unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2026-01-01 01:34:18 +00:00

Michał Martyniak 2d1923ac7e

Better element IDs - deterministic and document-unique hashes (#2673 )

Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* the hash computation includes the element's sequence number on the page, the
page number, the document filename, and the element's text
* there are more tests for the deterministic behavior of IDs returned by
partitioning functions and for their uniqueness (guaranteed at the document
level, and with high probability across multiple documents)
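
The hash inputs listed above can be sketched as follows. This is only an illustration, not the code from the PR: the function name `element_id`, the choice of SHA-256, the field separator, and the 32-character truncation are all assumptions.

```python
import hashlib


def element_id(filename: str, page_number: int, sequence_number: int, text: str) -> str:
    """Hypothetical helper: derive a deterministic ID from the four inputs above."""
    # Assumed separator: a control character that is unlikely to appear in
    # ordinary text, so adjacent fields do not run together.
    payload = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    # Assumed digest and length; the same inputs always give the same ID.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Two identical paragraphs on the same page differ only in sequence number,
# so they still receive different IDs.
first = element_id("duplicate-paragraphs.docx", 1, 0, "Same text.")
second = element_id("duplicate-paragraphs.docx", 1, 1, "Same text.")
```

Including the sequence number is what separates repeated text within one document, while the filename makes collisions across documents unlikely rather than impossible.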

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461

2024-04-24 00:05:20 -07:00

eml

enhancement: process .p7s files with partition_email (#2521 )

2024-02-07 22:31:49 +00:00

language-docs

detect document language across all partitioners (#1627 )

2023-10-11 01:47:56 +00:00

test_evaluate_files

feat: add cleanup fixtures for test_evaluate (#2701 )

2024-04-02 15:10:59 +00:00

unsupported

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

2023-half-year-analyses-by-segment.xlsx

feat: xlsx subtable extraction (#1585 )

2023-10-04 13:30:23 -04:00

all-number-table.pdf

Jj/2027 float no attr strip (#2048 )

2023-11-10 05:14:06 +00:00

book-war-and-peace-1p.txt

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

book-war-and-peace-1225p.txt

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

CantinaBand3.wav

enhancement: file detection for .wav files (#2387 )

2024-01-15 16:50:49 +00:00

category-level.docx

Feat: Native hierarchies for docx element types (#1505 )

2023-09-27 11:32:46 -04:00

chevron-page.pdf

fix: group together text from the same bounding box in partition_pdf with fast strategy (#542 )

2023-05-03 18:33:24 -04:00

chi_sim_image.jpeg

chore: function to map between standard and Tesseract language codes (#1421 )

2023-09-18 08:42:02 -07:00

copy-protected.pdf

enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514 )

2023-04-21 21:35:43 +00:00

DA-1p.heic

feat: add support for partitioning .heic files (#2454 )

2024-01-30 04:49:00 +00:00

DA-1p.jpg

feat: add support for partitioning .heic files (#2454 )

2024-01-30 04:49:00 +00:00

DA-1p.pdf

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

DA-1p.png

feat: add support for partitioning .heic files (#2454 )

2024-01-30 04:49:00 +00:00

DA-619p.pdf

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

docx-hdrftr.docx

fix(docx): tables in header/footer dropped (#2135 )

2023-11-22 15:39:25 -08:00

docx-shapes.docx

feat: include text from shapes in docx (#2510 )

2024-02-14 17:48:38 +00:00

docx-tables.docx

fix(docx): Table.text duplicates merged cell text (#2134 )

2023-11-21 22:22:40 +00:00

double-column-A.jpg

chore: custom layout order example notebook (#1024 )

2023-08-02 18:29:04 -06:00

double-column-B.jpg

chore: custom layout order example notebook (#1024 )

2023-08-02 18:29:04 -06:00

duplicate-paragraphs.doc

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

duplicate-paragraphs.docx

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

embedded-images-tables.jpg

Feat: return base64 encoded images for PDF's (#2310 )

2023-12-27 05:39:01 +00:00

embedded-images-tables.pdf

Feat: return base64 encoded images for PDF's (#2310 )

2023-12-27 05:39:01 +00:00

embedded-images.pdf

Feat/1332 save embedded images in pdf (#1371 )

2023-09-22 09:16:03 +00:00

embedded-link.pdf

feat: get embedded url, associate text and start index for pdf (#1539 )

2023-09-27 13:43:32 -04:00

emoji.xlsx

fix: extract emojis with partition_xlsx (#1009 )

2023-08-04 10:14:08 -04:00

emphasis-text.pdf

feat: get embedded url, associate text and start index for pdf (#1539 )

2023-09-27 13:43:32 -04:00

empty.txt

enhancement: handling for empty files in detect_filetype and partition (#710 )

2023-06-09 16:07:50 -04:00

english-and-korean.png

enhancement: add ocr_only strategy for partition_image (#540 )

2023-05-04 20:23:51 +00:00

example-10k-1p.html

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

example-10k-230p.html

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

example-10k-utf-16.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-10k.html

Initial Release

2022-09-26 14:55:20 -07:00

example-list-items-multiple.docx

fix: detect list items in MS Word documents (#909 )

2023-07-10 15:29:08 +00:00

example-steelJIS-datasheet-utf-16.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-steelJIS-datasheet.html

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

example-with-scripts.html

fix: Remove JavaScript from HTML reader output (#313 )

2023-02-28 14:24:24 -08:00

example.jpg

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

factbook-utf-16.xml

fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660 )

2023-06-05 11:27:12 -07:00

factbook.xml

feat: add partition_xml for XML files (#596 )

2023-05-18 15:40:12 +00:00

failure-after-repair.pdf

Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 )

2023-11-29 19:00:15 +00:00

fake_table.docx

feat: Read docx tables (#572 )

2023-05-11 18:31:38 +00:00

fake-doc-emphasized-text.doc

feat: track emphasized text msword (#1048 )

2023-08-04 17:04:12 -04:00

fake-doc-emphasized-text.docx

feat: track emphasized text msword (#1048 )

2023-08-04 17:04:12 -04:00

fake-doc.rtf

Table processing test for RTF (#1388 )

2023-09-12 18:27:05 -07:00

fake-email-attachment.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email-multiple-attachments.msg

feat: add msg attachment support (#510 )

2023-04-21 11:14:46 -05:00

fake-email.msg

feat: add partition_msg for MSFT Outlook files (#412 )

2023-03-28 20:15:22 +00:00

fake-email.txt

bug: empty-elements (#1252 )

2023-11-02 10:52:41 -05:00

fake-encrypted.msg

feat: detect PGP encrypted content in partition_email and partition_msg (#1205 )

2023-08-25 17:09:25 -07:00

fake-html-cp1252.html

chore: Add encoding param to ingest (#955 )

2023-07-24 10:06:13 -07:00

fake-html-lang-de.html

fix: adjust threshold for encoding detection (#894 )

2023-07-07 09:25:03 -04:00

fake-html-pre.htm

feature(html partition): parse pre tag (#642 )

2023-06-27 18:52:39 +00:00

fake-html-with-duplicate-elements.html

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

fake-html-with-footer-and-header.html

feat: optionally ignore header and footer tags in partition html (#1013 )

2023-08-04 21:56:33 +00:00

fake-html.html

Feat: Create a naive hierarchy for elements (#1268 )

2023-09-14 11:23:16 -04:00

fake-incomplete-json.txt

enhancement: improve json detection by detect_filetype (#971 )

2023-07-25 12:47:39 -04:00

fake-memo-with-duplicate-page.pdf

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

fake-memo.pdf

feat: Create spacy notebook example (#593 )

2023-05-17 15:42:15 -05:00

fake-power-point-malformed.pptx

fix malformed pptx issue (#761 )

2023-06-15 19:52:44 +00:00

fake-power-point-many-pages.pptx

fix: metadata.page_number of pptx files (#675 )

2023-06-02 13:22:43 +00:00

fake-power-point-table.pptx

feat: table extraction for power points (#664 )

2023-05-31 18:26:32 +00:00

fake-power-point.ppt

feat: add partition_ppt for older power point docs (#238 )

2023-02-17 16:57:08 +00:00

fake-power-point.pptx

feat: basic PowerPoint parsing in partition_pptx (#166 )

2023-01-23 17:03:09 +00:00

fake-text-utf-16-be.txt

Adding optional encoding arg, and text_partition tests (#339 )

2023-03-06 15:07:33 -08:00

fake-text-utf-16-le.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text-utf-16.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text-utf-32.txt

Issue/unicode error (#608 )

2023-05-23 13:35:38 -07:00

fake-text.txt

fix: cleanup from live .docx tests (#177 )

2023-01-26 15:52:25 +00:00

fake.doc

feat: add partition_doc for .doc files (#236 )

2023-02-17 09:30:23 -05:00

fake.docx

feat: extract metadata from .docx, .xlsx, and .jpg (#113 )

2022-12-26 09:34:36 -05:00

fake.odt

chore: adding test case for odt tables (#1434 )

2023-09-16 22:29:44 -07:00

group-shapes-nested.pptx

rfctr(pptx): extract _PptxPartitionerOptions (#2853 )

2024-04-08 19:01:03 +00:00

handbook-1p-no-rendered-page-breaks.docx

fix(docx): improve page-break detection (#2036 )

2023-11-09 20:34:30 +00:00

handbook-1p.docx

fix(docx): improve page-break detection (#2036 )

2023-11-09 20:34:30 +00:00

handbook-872p.docx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

header-test-doc.pdf

enhancement: detect headers in partition_pdf with fast strategy (#2455 )

2024-02-23 16:56:09 +00:00

hebrew-text-base64-iso88598i.txt

fix: format Arabic and Hebrew annotated encodings (#823 )

2023-06-27 18:15:02 -07:00

hlink-meta.docx

rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978 )

2023-11-02 05:22:17 +00:00

ideas-page.html

fix: ensure all text is maintained in html output (#335 )

2023-03-02 14:03:13 -05:00

interface-config-guide-p93.pdf

fix: isalnum referenced before assignment (#1586 )

2023-10-03 11:25:20 -04:00

invalid-pdf-structure-pdfminer-entire-doc.pdf

Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 )

2023-11-29 19:00:15 +00:00

invalid-pdf-structure-pdfminer-one-page.pdf

Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 )

2023-11-29 19:00:15 +00:00

jpn-vert.jpeg

chore: function to map between standard and Tesseract language codes (#1421 )

2023-09-18 08:42:02 -07:00

korean-text-with-tables.pdf

Chore (refactor): support table extraction with pre-computed ocr data (#1801 )

2023-10-21 00:24:23 +00:00

layout-parser-paper-10p.jpg

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

layout-parser-paper-combined.tiff

feat: supports multipage tiff (#1131 )

2023-08-24 15:12:50 +00:00

layout-parser-paper-fast.jpg

docs: add bricks training notebook (#211 )

2023-02-10 14:39:14 +00:00

layout-parser-paper-fast.pdf

revert pdf changes and add new pdf for empty page testing (#1255 )

2023-09-01 22:33:06 +00:00

layout-parser-paper-fast.tiff

feat: supports multipage tiff (#1131 )

2023-08-24 15:12:50 +00:00

layout-parser-paper-with-empty-pages.pdf

revert pdf changes and add new pdf for empty page testing (#1255 )

2023-09-01 22:33:06 +00:00

layout-parser-paper-with-table.jpg

Chore: Pass table support param to partition image (#973 )

2023-07-27 13:33:36 -04:00

layout-parser-paper-with-table.pdf

Fix: missing columns on table ingest output after table OCR refactor (#1959 )

2023-11-01 18:34:27 +00:00

layout-parser-paper.pdf

Initial Release

2022-09-26 14:55:20 -07:00

list-item-example.pdf

fix pdf partition of list items being detected as titles in OCR only mode (#1119 )

2023-08-15 09:35:54 -07:00

loremipsum-flat.pdf

fix: better extractable check (#900 )

2023-07-07 23:41:37 -05:00

more-than-1k-cells.xlsx

fix: use nx to avoid recursion limit (#1761 )

2023-10-14 19:38:21 +00:00

multi-column-2p.pdf

Feat/1136 elements ordering for pdf (#1161 )

2023-08-24 17:46:19 -07:00

multi-column.pdf

Feat/1136 elements ordering for pdf (#1161 )

2023-08-24 17:46:19 -07:00

negative-coords.pdf

fix: coordinates bug on pdf parsing (#1462 )

2023-09-19 19:25:31 -07:00

norwich-city.txt

enhancement: max_partition kwarg for limiting element size (#818 )

2023-06-28 15:26:01 -04:00

page-breaks.docx

docx: improve page break fidelity (#1631 )

2023-11-17 00:09:14 +00:00

pdf2image-memory-error-test-400p.pdf

fix: 521 pdf2image memory error (#924 )

2023-07-14 15:08:33 -05:00

pdf-bad-color-space.pdf

fix: handle KeyError: 'N' for certain pdfs (#2072 )

2023-11-15 01:59:05 +00:00

picture.pptx

feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 )

2024-04-12 06:00:01 +00:00

README.md

Update README.md (#435 )

2023-04-02 09:52:14 -07:00

README.org

feat: partition_org for Org Mode documents (#780 )

2023-06-23 18:45:31 +00:00

README.rst

feat: partition_rst for ReStructured Text documents (#725 )

2023-06-12 19:31:10 +00:00

reliance.pdf

fix: enable partition_pdf to recursively grab text with fast strategy (#796 )

2023-06-22 11:19:54 -04:00

sample-presentation.pptx

Feat: Native hierarchies for elements from pptx documents (#1616 )

2023-10-05 12:55:45 -04:00

science-exploration-1p.pptx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

science-exploration-369p.pptx

test: add benchmark script (#638 )

2023-06-05 09:14:43 -07:00

simple-table.md

fix: md tables (#1924 )

2023-10-30 14:09:46 +00:00

spring-weather.html.json

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

stanley-cups-with-emoji.csv

fix: etree parser error (#1077 )

2023-08-10 23:28:57 +00:00

stanley-cups-with-emoji.tsv

Fix/1057 etree parser error tsv (#1106 )

2023-08-14 01:22:36 +00:00

stanley-cups.csv

feat: add partition_csv function (#619 )

2023-05-19 15:57:42 -04:00

stanley-cups.tsv

feat: partition_tsv for tab separated value files (#758 )

2023-06-15 18:50:53 +00:00

stanley-cups.xlsx

feat: add partition_xlsx for MSFT Excel files (#594 )

2023-05-16 19:40:40 +00:00

table-multi-row-column-cells-actual.csv

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-multi-row-column-cells.pdf

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-multi-row-column-cells.png

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-semicolon-delimiter.csv

fix: handle delimiter bug in partition_csv (#2224 )

2023-12-13 23:57:46 +00:00

teams_chat.docx

fix: handle sectionless-docx in the general case (#1829 )

2023-11-08 19:05:19 +00:00

tests-example.xls

feat: add xls support (#632 )

2023-05-26 01:55:32 -07:00

vodafone.xlsx

feat: xlsx subtable extraction (#1585 )

2023-10-04 13:30:23 -04:00

winter-sports.epub

feat: add partition_epub function (#364 )

2023-03-14 15:52:21 +00:00

xlsx-subtable-cases.xlsx

fix(xlsx): xlsx subtable algorithm (#2534 )

2024-02-13 20:29:17 -08:00

README.md

Example Docs

The sample docs directory contains the following files:

example-10k.html - A 10-K SEC filing in HTML format
layout-parser-paper.pdf - A PDF copy of the layout parser paper
factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test out the parsers in the library. In addition, here are instructions for pulling in some sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format with the following curl command. Note that you must set the user agent in the request header, or the SEC site will reject your request.

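# Set the organization and email shell variables first, or substitute
# your own values directly in the -A (user agent) argument.
curl -O \
  -A "${organization} ${email}" \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt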
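
The same download can be sketched in Python with the standard library. This is a minimal sketch, not part of the repo: the helper names `build_sec_request` and `download` are hypothetical, and the user agent format simply mirrors the organization and email placeholders used in the curl command.

```python
import urllib.request

SEC_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_sec_request(organization: str, email: str) -> urllib.request.Request:
    """Hypothetical helper: build the request with the user agent header set."""
    return urllib.request.Request(
        SEC_URL, headers={"User-Agent": f"{organization} {email}"}
    )


def download(organization: str, email: str, dest: str = "0001171843-21-001344.txt") -> None:
    """Hypothetical helper: fetch the filing and write it to dest."""
    request = build_sec_request(organization, email)
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        out.write(response.read())
```

Calling `download("Example Org", "you@example.com")` would then save the filing under the same name that `curl -O` uses.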

You can parse this document using the HTML parser.
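
As a rough sketch of that step, the snippet below calls `partition_html` from `unstructured.partition.html` on the downloaded file and tallies the resulting elements by category. The helper names `count_categories` and `parse_filing` are hypothetical, and the default path assumes the file name produced by the curl command above.

```python
from collections import Counter


def count_categories(elements) -> Counter:
    """Hypothetical helper: tally elements by their `category` attribute."""
    return Counter(element.category for element in elements)


def parse_filing(path: str = "0001171843-21-001344.txt"):
    """Hypothetical helper: partition the downloaded filing with the HTML parser."""
    # Imported inside the function so count_categories can be used
    # even where the unstructured package is not installed.
    from unstructured.partition.html import partition_html

    return partition_html(filename=path)
```

With the library installed and the file downloaded, `print(count_categories(parse_filing()))` gives a quick overview of what the parser found.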