unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-26 18:39:18 +00:00

chore: switch to charset normalizer (#4060 )

Closes
[SPI-44](https://linear.app/unstructured/issue/SPI-44/spike-replace-chardet-with-charset-normalizer-if-possible).

Removes `chardet` as a dependency, standardizing on
`charset-normalizer`.

This involved:
- Changing `chardet` to `charset-normalizer` in our base dependency file
- Updating the code (in only one place) where `chardet` was used
- pip-compiling to update our published dependency tree
- Updating one test: `charset-normalizer` misdiagnosed the encoding of
a file used as a test fixture. My guess is that the ~10 characters in
the file were not enough for `charset-normalizer` to make a reliable
inference, so I re-encoded another, slightly longer file that's also
used for encoding testing, and it detected that one correctly.
- Updating an ingest test fixture.
- Updating the ingest test fixture update workflow to also update the
expected markdown results (this was a task I missed when adding the
markdown ingest tests)

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: qued <qued@users.noreply.github.com>
Co-authored-by: Maksymilian Operlejn <36171422+MaksOpp@users.noreply.github.com>

2025-07-22 19:02:40 +00:00

eml

rfctr(email): eml partitioner rewrite (#3694 )

2024-10-16 02:02:33 +00:00

img

refactor: restructure PDF/Image example document organization (#3410 )

2024-07-18 22:21:32 +00:00

language-docs

feat: detect language for PDFs (#4051 )

2025-07-15 18:53:28 +00:00

pdf

Add password with PDF files (#3721 )

2025-02-11 17:39:16 +00:00

test_evaluate_files

Feat: weighted average table metrics (#3348 )

2024-11-20 17:14:57 +00:00

unsupported

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

2023-half-year-analyses-by-segment.xlsx

feat: xlsx subtable extraction (#1585 )

2023-10-04 13:30:23 -04:00

book-war-and-peace-1p.txt

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

book-war-and-peace-1225p.txt

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

CantinaBand3.wav

enhancement: file detection for .wav files (#2387 )

2024-01-15 16:50:49 +00:00

category-level.docx

Feat: Native hierarchies for docx element types (#1505 )

2023-09-27 11:32:46 -04:00

codeblock.md

add fenced-code extension to the md parser (#4044 )

2025-07-07 21:05:54 +00:00

contains-pictures.docx

feat(docx): add pluggable picture sub-partitioner (#3081 )

2024-05-23 18:46:30 +00:00

csv-with-escaped-commas.csv

rfctr(file): improve filetype tests (#3402 )

2024-07-16 04:04:34 +00:00

csv-with-line-delimiter.csv

add '|' as a delimiter in csv files (#4059 )

2025-07-18 17:56:24 +00:00

csv-with-long-lines.csv

fix(csv): partition_csv() raises on long lines (#2998 )

2024-05-10 21:19:31 +00:00

docx-hdrftr.docx

fix(docx): tables in header/footer dropped (#2135 )

2023-11-22 15:39:25 -08:00

docx-shapes.docx

feat: include text from shapes in docx (#2510 )

2024-02-14 17:48:38 +00:00

docx-tables.docx

fix(docx): Table.text duplicates merged cell text (#2134 )

2023-11-21 22:22:40 +00:00

duplicate-paragraphs.doc

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

duplicate-paragraphs.docx

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

emoji.xlsx

fix: extract emojis with partition_xlsx (#1009 )

2023-08-04 10:14:08 -04:00

empty.txt

enhancement: handling for empty files in detect_filetype and partition (#710 )

2023-06-09 16:07:50 -04:00

empty.xlsx

fix(xlsx): XLSX emits std minified .text_as_html (#3558 )

2024-10-17 22:05:11 +00:00

example-10k-1p.html

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

example-10k-230p.html

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

example-10k-utf-16.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-10k.html

Initial Release

2022-09-26 14:55:20 -07:00

example-list-items-multiple.docx

fix: detect list items in MS Word documents (#909 )

2023-07-10 15:29:08 +00:00

example-steelJIS-datasheet-utf-16.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-steelJIS-datasheet.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-with-scripts.html

fix: Remove JavaScript from HTML reader output (#313 )

2023-02-28 14:24:24 -08:00

factbook-utf-16.xml

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

factbook.xml

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

fake_table.docx

feat: Read docx tables (#572 )

2023-05-11 18:31:38 +00:00

fake-doc-emphasized-text.doc

feat: track emphasized text msword (#1048 )

2023-08-04 17:04:12 -04:00

fake-doc-emphasized-text.docx

feat: track emphasized text msword (#1048 )

2023-08-04 17:04:12 -04:00

fake-doc.rtf

Table processing test for RTF (#1388 )

2023-09-12 18:27:05 -07:00

fake-email-attachment.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email-multiple-attachments.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email-with-cc-and-bcc.msg

feat: msg and email metadata (#3444 )

2024-08-01 19:24:17 +00:00

fake-email.eml

rfctr(auto): improve expression in tests (#3384 )

2024-07-11 19:57:28 +00:00

fake-email.msg

feat: add partition_msg for MSFT Outlook files (#412 )

2023-03-28 20:15:22 +00:00

fake-email.txt

bug: empty-elements (#1252 )

2023-11-02 10:52:41 -05:00

fake-encrypted.msg

feat: detect PGP encrypted content in partition_email and partition_msg (#1205 )

2023-08-25 17:09:25 -07:00

fake-html-cp1252.html

chore: switch to charset normalizer (#4060 )

2025-07-22 19:02:40 +00:00

fake-html-lang-de.html

fix: adjust threshold for encoding detection (#894 )

2023-07-07 09:25:03 -04:00

fake-html-pre.htm

feature(html partition): parse pre tag (#642 )

2023-06-27 18:52:39 +00:00

fake-html-with-base64-image.html

feat: support extracting image url in html (#3955 )

2025-03-13 22:41:10 +00:00

fake-html-with-duplicate-elements.html

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

fake-html-with-footer-and-header.html

feat: optionally ignore header and footer tags in partition html (#1013 )

2023-08-04 21:56:33 +00:00

fake-html-with-image-from-url.html

feat: support extracting image url in html (#3955 )

2025-03-13 22:41:10 +00:00

fake-html.html

Feat: Create a naive hierarchy for elements (#1268 )

2023-09-14 11:23:16 -04:00

fake-incomplete-json.txt

enhancement: improve json detection by detect_filetype (#971 )

2023-07-25 12:47:39 -04:00

fake-power-point-malformed.pptx

fix malformed pptx issue (#761 )

2023-06-15 19:52:44 +00:00

fake-power-point-many-pages.pptx

fix: metadata.page_number of pptx files (#675 )

2023-06-02 13:22:43 +00:00

fake-power-point-table.pptx

feat: table extraction for power points (#664 )

2023-05-31 18:26:32 +00:00

fake-power-point.ppt

feat: add partition_ppt for older power point docs (#238 )

2023-02-17 16:57:08 +00:00

fake-power-point.pptx

feat: basic PowerPoint parsing in partition_pptx (#166 )

2023-01-23 17:03:09 +00:00

fake-text-all-whitespace.txt

Fix: partition on empty or whitespace-only text files (#3675 )

2024-09-28 21:16:33 -07:00

fake-text-utf-16-be.txt

Adding optional encoding arg, and text_partition tests (#339 )

2023-03-06 15:07:33 -08:00

fake-text-utf-16-le.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text-utf-16.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text-utf-32.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text.txt

fix: cleanup from live .docx tests (#177 )

2023-01-26 15:52:25 +00:00

fake.doc

feat: add partition_doc for .doc files (#236 )

2023-02-17 09:30:23 -05:00

fake.docx

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

fake.go

rfctr(file): improve filetype tests (#3402 )

2024-07-16 04:04:34 +00:00

fake.odt

chore: adding test case for odt tables (#1434 )

2023-09-16 22:29:44 -07:00

file_we_dont_want_imported

Fix: plug security issue partition system files via include (#3908 )

2025-02-06 03:27:18 +00:00

grid_offset_error.docx

fix: add try/except wrap over row.cells to failproof tc grid_offset (#4033 )

2025-06-30 14:20:18 +00:00

group-shapes-nested.pptx

rfctr(pptx): extract _PptxPartitionerOptions (#2853 )

2024-04-08 19:01:03 +00:00

handbook-1p-no-rendered-page-breaks.docx

fix(docx): improve page-break detection (#2036 )

2023-11-09 20:34:30 +00:00

handbook-1p.docx

fix(docx): improve page-break detection (#2036 )

2023-11-09 20:34:30 +00:00

handbook-872p.docx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

hebrew-text-base64-iso88598i.txt

fix: format Arabic and Hebrew annotated encodings (#823 )

2023-06-27 18:15:02 -07:00

hlink-meta.docx

rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978 )

2023-11-02 05:22:17 +00:00

ideas-page.html

fix: ensure all text is maintained in html output (#335 )

2023-03-02 14:03:13 -05:00

logger.py

rfctr(file): improve filetype tests (#3402 )

2024-07-16 04:04:34 +00:00

more-than-1k-cells.xlsx

fix: use nx to avoid recursion limit (#1761 )

2023-10-14 19:38:21 +00:00

norwich-city.txt

enhancement: max_partition kwarg for limiting element size (#818 )

2023-06-28 15:26:01 -04:00

not-unstructured-payload.json

fix(file): fix OLE-based file-type auto-detection (#3437 )

2024-07-25 17:25:41 +00:00

page-breaks.docx

docx: improve page break fidelity (#1631 )

2023-11-17 00:09:14 +00:00

password_protected.xlsx

fix: properly handle password protected xlsx (#4057 )

2025-07-16 13:19:14 +00:00

picture.pptx

feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 )

2024-04-12 06:00:01 +00:00

README-w-include.org

Fix: plug security issue partition system files via include (#3908 )

2025-02-06 03:27:18 +00:00

README-w-include.rst

Fix: plug security issue partition system files via include (#3908 )

2025-02-06 03:27:18 +00:00

README.md

Update README.md (#435 )

2023-04-02 09:52:14 -07:00

README.org

feat: partition_org for Org Mode documents (#780 )

2023-06-23 18:45:31 +00:00

README.rst

feat: partition_rst for ReStructured Text documents (#725 )

2023-06-12 19:31:10 +00:00

sample-presentation.pptx

Feat: Native hierarchies for elements from pptx documents (#1616 )

2023-10-05 12:55:45 -04:00

science-exploration-1p.pptx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

science-exploration-369p.pptx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

semicolon-delimited.csv

rfctr(csv): accommodate single column CSV files (#3483 )

2024-08-06 00:48:37 +00:00

simple-table.md

fix: md tables (#1924 )

2023-10-30 14:09:46 +00:00

simple.doc

rfctr(auto): improve expression in tests (#3384 )

2024-07-11 19:57:28 +00:00

simple.docx

rfctr(doc): spruce up test_doc.py (#3024 )

2024-05-15 18:32:51 +00:00

simple.epub

rfctr(part): remove double-decoration 3 (#3687 )

2024-10-02 21:04:37 +00:00

simple.json

rfctr(auto): fix auto-partition test xfails and skips (#3367 )

2024-07-10 05:29:07 +00:00

simple.ndjson

feat: add ndjson support (#3845 )

2024-12-19 14:39:26 +00:00

simple.odt

rfctr(odt): organize and improve test_odt.py (#3031 )

2024-05-16 01:04:06 +00:00

simple.pptx

rfctr(file): refactor detect_filetype() (#3429 )

2024-07-23 23:18:48 +00:00

simple.yaml

rfctr(file): improve filetype tests (#3402 )

2024-07-16 04:04:34 +00:00

simple.zip

rfctr(file): improve filetype tests (#3402 )

2024-07-16 04:04:34 +00:00

single-column.csv

rfctr(csv): accommodate single column CSV files (#3483 )

2024-08-06 00:48:37 +00:00

spring-weather.html.json

rfctr(auto): fix auto-partition test xfails and skips (#3367 )

2024-07-10 05:29:07 +00:00

spring-weather.html.ndjson

feat: add ndjson support (#3845 )

2024-12-19 14:39:26 +00:00

stanley-cups-utf-16.csv

feat: Support encoding parameter in partition_csv (#3564 )

2024-08-28 14:19:58 +00:00

stanley-cups-with-emoji.csv

fix: etree parser error (#1077 )

2023-08-10 23:28:57 +00:00

stanley-cups-with-emoji.tsv

Fix/1057 etree parser error tsv (#1106 )

2023-08-14 01:22:36 +00:00

stanley-cups.csv

rfctr(csv): accommodate single column CSV files (#3483 )

2024-08-06 00:48:37 +00:00

stanley-cups.tsv

feat: partition_tsv for tab separated value files (#758 )

2023-06-15 18:50:53 +00:00

stanley-cups.xlsx

feat: add partition_xlsx for MSFT Excel files (#594 )

2023-05-16 19:40:40 +00:00

table-multi-row-column-cells-actual.csv

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-semicolon-delimiter.csv

fix: handle delimiter bug in partition_csv (#2224 )

2023-12-13 23:57:46 +00:00

tables-with-incomplete-rows.docx

fix(docx): fix short-row DOCX table (#2943 )

2024-05-02 00:45:52 +00:00

teams_chat.docx

fix: handle sectionless-docx in the general case (#1829 )

2023-11-08 19:05:19 +00:00

test-image-jpg-mime.pptx

fix(pptx): accommodate invalid image/jpg MIME-type (#3475 )

2024-08-06 18:48:15 +00:00

tests-example.xls

feat: add xls support (#632 )

2023-05-26 01:55:32 -07:00

umlauts-non-utf8.md

chore: switch to charset normalizer (#4060 )

2025-07-22 19:02:40 +00:00

umlauts-utf8.md

fix: update md to reads umlauts on non-utf-8 files (#4037 )

2025-07-01 16:38:30 +00:00

vodafone.xlsx

feat: xlsx subtable extraction (#1585 )

2023-10-04 13:30:23 -04:00

winter-sports.epub

feat: add partition_epub function (#364 )

2023-03-14 15:52:21 +00:00

xlsx-subtable-cases.xlsx

fix(xlsx): xlsx subtable algorithm (#2534 )

2024-02-13 20:29:17 -08:00

README.md

Example Docs

The sample docs directory contains the following files:

- example-10k.html - A 10-K SEC filing in HTML format
- layout-parser-paper.pdf - A PDF copy of the layout parser paper
- factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test out the parsers in the library. In addition, here are instructions for pulling in some sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format using the following curl command. Note that you need to set the User-Agent header, or the SEC site will reject your request.

curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt

You can parse this document using the HTML parser.
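
For scripted downloads, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not part of the repository: the helper names `build_request` and `download` are hypothetical, and the organization and email values are placeholders you would replace with your own. The point it illustrates is the one above: the request must carry a User-Agent header that identifies you.

```python
import urllib.request

# URL copied from the curl command above.
FILING_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_request(
    organization: str, email: str, url: str = FILING_URL
) -> urllib.request.Request:
    """Build a GET request whose User-Agent identifies the caller.

    Hypothetical helper: mirrors `curl -A '<organization> <email>'`.
    """
    user_agent = f"{organization} {email}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request: urllib.request.Request, dest: str) -> None:
    """Send the request and write the response body to `dest`.

    Hypothetical helper: roughly what `curl -O` does, with an explicit path.
    """
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        out.write(response.read())


# Example call (performs a network request, so it is left commented out):
# download(build_request("Example Org", "contact@example.com"),
#          "0001171843-21-001344.txt")
```

Building the request separately from sending it makes it possible to check the header before making any network call.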