unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2026-01-07 12:50:54 +00:00


feat(docx): add pluggable picture sub-partitioner (#3081 )

**Summary**
Allow registration of a custom sub-partitioner that extracts images from
a DOCX paragraph.

**Additional Context**
- A custom image sub-partitioner must implement the
`PicturePartitionerT` interface defined in this PR: an
`.iter_elements()` classmethod that takes the paragraph and generates
zero or more `Image` elements from it.
- The custom image sub-partitioner must be registered by passing the
class to `register_picture_partitioner()`.
- The default image sub-partitioner is `_NullPicturePartitioner`, which
does nothing.
- The registered picture partitioner is called once for each paragraph.

2024-05-23 18:46:30 +00:00

eml

enhancement: process .p7s files with partition_email (#2521 )

2024-02-07 22:31:49 +00:00

language-docs

fix: ppt parameters include_page_breaks and include_slide_notes (#2996 )

2024-05-10 17:57:36 +00:00

test_evaluate_files

Feat: form parsing placeholders (#3034 )

2024-05-16 14:21:31 +00:00

unsupported

…

2023-half-year-analyses-by-segment.xlsx

feat: xlsx subtable extraction (#1585 )

2023-10-04 13:30:23 -04:00

a1977-backus-p21.pdf

Fix: avoid elements sharing the same memory address (#2940 )

2024-04-28 19:15:17 -07:00

all-number-table.pdf

Jj/2027 float no attr strip (#2048 )

2023-11-10 05:14:06 +00:00

book-war-and-peace-1p.txt

…

book-war-and-peace-1225p.txt

…

CantinaBand3.wav

enhancement: file detection for .wav files (#2387 )

2024-01-15 16:50:49 +00:00

category-level.docx

Feat: Native hierarchies for docx element types (#1505 )

2023-09-27 11:32:46 -04:00

chevron-page.pdf

…

chi_sim_image.jpeg

chore: function to map between standard and Tesseract language codes (#1421 )

2023-09-18 08:42:02 -07:00

contains-pictures.docx

feat(docx): add pluggable picture sub-partitioner (#3081 )

2024-05-23 18:46:30 +00:00

copy-protected.pdf

…

csv-with-long-lines.csv

fix(csv): partition_csv() raises on long lines (#2998 )

2024-05-10 21:19:31 +00:00

DA-1p.heic

feat: add support for partitioning .heic files (#2454 )

2024-01-30 04:49:00 +00:00

DA-1p.jpg

feat: add support for partitioning .heic files (#2454 )

2024-01-30 04:49:00 +00:00

DA-1p.pdf

…

DA-1p.png

feat: add support for partitioning .heic files (#2454 )

2024-01-30 04:49:00 +00:00

DA-619p.pdf

…

docx-hdrftr.docx

fix(docx): tables in header/footer dropped (#2135 )

2023-11-22 15:39:25 -08:00

docx-shapes.docx

feat: include text from shapes in docx (#2510 )

2024-02-14 17:48:38 +00:00

docx-tables.docx

fix(docx): Table.text duplicates merged cell text (#2134 )

2023-11-21 22:22:40 +00:00

double-column-A.jpg

…

double-column-B.jpg

…

duplicate-paragraphs.doc

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

duplicate-paragraphs.docx

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

embedded-images-tables.jpg

Feat: return base64 encoded images for PDF's (#2310 )

2023-12-27 05:39:01 +00:00

embedded-images-tables.pdf

Feat: return base64 encoded images for PDF's (#2310 )

2023-12-27 05:39:01 +00:00

embedded-images.pdf

Feat/1332 save embedded images in pdf (#1371 )

2023-09-22 09:16:03 +00:00

embedded-link.pdf

feat: get embedded url, associate text and start index for pdf (#1539 )

2023-09-27 13:43:32 -04:00

emoji.xlsx

…

emphasis-text.pdf

feat: get embedded url, associate text and start index for pdf (#1539 )

2023-09-27 13:43:32 -04:00

empty.txt

…

english-and-korean.png

…

example-10k-1p.html

…

example-10k-230p.html

…

example-10k-utf-16.html

…

example-10k.html

…

example-list-items-multiple.docx

…

example-steelJIS-datasheet-utf-16.html

…

example-steelJIS-datasheet.html

…

example-with-scripts.html

…

example.jpg

…

factbook-utf-16.xml

…

factbook.xml

…

failure-after-repair.pdf

Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 )

2023-11-29 19:00:15 +00:00

fake_table.docx

…

fake-doc-emphasized-text.doc

…

fake-doc-emphasized-text.docx

…

fake-doc.rtf

Table processing test for RTF (#1388 )

2023-09-12 18:27:05 -07:00

fake-email-attachment.msg

…

fake-email-multiple-attachments.msg

…

fake-email.msg

…

fake-email.txt

bug: empty-elements (#1252 )

2023-11-02 10:52:41 -05:00

fake-encrypted.msg

feat: detect PGP encrypted content in partition_email and partition_msg (#1205 )

2023-08-25 17:09:25 -07:00

fake-html-cp1252.html

…

fake-html-lang-de.html

…

fake-html-pre.htm

…

fake-html-with-duplicate-elements.html

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

fake-html-with-footer-and-header.html

…

fake-html.html

Feat: Create a naive hierarchy for elements (#1268 )

2023-09-14 11:23:16 -04:00

fake-incomplete-json.txt

…

fake-memo-with-duplicate-page.pdf

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

fake-memo.pdf

…

fake-power-point-malformed.pptx

…

fake-power-point-many-pages.pptx

…

fake-power-point-table.pptx

…

fake-power-point.ppt

…

fake-power-point.pptx

…

fake-text-utf-16-be.txt

…

fake-text-utf-16-le.txt

…

fake-text-utf-16.txt

…

fake-text-utf-32.txt

…

fake-text.txt

…

fake.doc

…

fake.docx

…

fake.odt

chore: adding test case for odt tables (#1434 )

2023-09-16 22:29:44 -07:00

group-shapes-nested.pptx

rfctr(pptx): extract _PptxPartitionerOptions (#2853 )

2024-04-08 19:01:03 +00:00

handbook-1p-no-rendered-page-breaks.docx

fix(docx): improve page-break detection (#2036 )

2023-11-09 20:34:30 +00:00

handbook-1p.docx

fix(docx): improve page-break detection (#2036 )

2023-11-09 20:34:30 +00:00

handbook-872p.docx

…

header-test-doc.pdf

enhancement: detect headers in partition_pdf with fast strategy (#2455 )

2024-02-23 16:56:09 +00:00

hebrew-text-base64-iso88598i.txt

…

hlink-meta.docx

rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978 )

2023-11-02 05:22:17 +00:00

ideas-page.html

…

interface-config-guide-p93.pdf

fix: isalnum referenced before assignment (#1586 )

2023-10-03 11:25:20 -04:00

invalid-pdf-structure-pdfminer-entire-doc.pdf

Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 )

2023-11-29 19:00:15 +00:00

invalid-pdf-structure-pdfminer-one-page.pdf

Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 )

2023-11-29 19:00:15 +00:00

jpn-vert.jpeg

chore: function to map between standard and Tesseract language codes (#1421 )

2023-09-18 08:42:02 -07:00

korean-text-with-tables.pdf

Chore (refactor): support table extraction with pre-computed ocr data (#1801 )

2023-10-21 00:24:23 +00:00

layout-parser-paper-10p.jpg

…

layout-parser-paper-combined.tiff

feat: supports multipage tiff (#1131 )

2023-08-24 15:12:50 +00:00

layout-parser-paper-fast.jpg

…

layout-parser-paper-fast.pdf

revert pdf changes and add new pdf for empty page testing (#1255 )

2023-09-01 22:33:06 +00:00

layout-parser-paper-fast.tiff

feat: supports multipage tiff (#1131 )

2023-08-24 15:12:50 +00:00

layout-parser-paper-with-empty-pages.pdf

revert pdf changes and add new pdf for empty page testing (#1255 )

2023-09-01 22:33:06 +00:00

layout-parser-paper-with-table.jpg

…

layout-parser-paper-with-table.pdf

Fix: missing columns on table ingest output after table OCR refactor (#1959 )

2023-11-01 18:34:27 +00:00

layout-parser-paper.pdf

…

list-item-example.pdf

fix pdf partition of list items being detected as titles in OCR only mode (#1119 )

2023-08-15 09:35:54 -07:00

loremipsum-flat.pdf

…

more-than-1k-cells.xlsx

fix: use nx to avoid recursion limit (#1761 )

2023-10-14 19:38:21 +00:00

multi-column-2p.pdf

Feat/1136 elements ordering for pdf (#1161 )

2023-08-24 17:46:19 -07:00

multi-column.pdf

Feat/1136 elements ordering for pdf (#1161 )

2023-08-24 17:46:19 -07:00

negative-coords.pdf

fix: coordinates bug on pdf parsing (#1462 )

2023-09-19 19:25:31 -07:00

norwich-city.txt

…

page-breaks.docx

docx: improve page break fidelity (#1631 )

2023-11-17 00:09:14 +00:00

pdf2image-memory-error-test-400p.pdf

…

pdf-bad-color-space.pdf

fix: handle KeyError: 'N' for certain pdfs (#2072 )

2023-11-15 01:59:05 +00:00

picture.pptx

feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 )

2024-04-12 06:00:01 +00:00

README.md

…

README.org

…

README.rst

…

reliance.pdf

…

sample-presentation.pptx

Feat: Native hierarchies for elements from pptx documents (#1616 )

2023-10-05 12:55:45 -04:00

science-exploration-1p.pptx

…

science-exploration-369p.pptx

…

simple-table.md

fix: md tables (#1924 )

2023-10-30 14:09:46 +00:00

simple.doc

rfctr(doc): spruce up test_doc.py (#3024 )

2024-05-15 18:32:51 +00:00

simple.docx

rfctr(doc): spruce up test_doc.py (#3024 )

2024-05-15 18:32:51 +00:00

simple.odt

rfctr(odt): organize and improve test_odt.py (#3031 )

2024-05-16 01:04:06 +00:00

spring-weather.html.json

Better element IDs - deterministic and document-unique hashes (#2673 )

2024-04-24 00:05:20 -07:00

stanley-cups-with-emoji.csv

…

stanley-cups-with-emoji.tsv

…

stanley-cups.csv

…

stanley-cups.tsv

…

stanley-cups.xlsx

…

table-multi-row-column-cells-actual.csv

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-multi-row-column-cells.pdf

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-multi-row-column-cells.png

chore: add metric helper for table structure eval (#1877 )

2023-10-27 13:23:44 -05:00

table-semicolon-delimiter.csv

fix: handle delimiter bug in partition_csv (#2224 )

2023-12-13 23:57:46 +00:00

tables-with-incomplete-rows.docx

fix(docx): fix short-row DOCX table (#2943 )

2024-05-02 00:45:52 +00:00

teams_chat.docx

fix: handle sectionless-docx in the general case (#1829 )

2023-11-08 19:05:19 +00:00

tests-example.xls

…

vodafone.xlsx

feat: xlsx subtable extraction (#1585 )

2023-10-04 13:30:23 -04:00

winter-sports.epub

…

xlsx-subtable-cases.xlsx

fix(xlsx): xlsx subtable algorithm (#2534 )

2024-02-13 20:29:17 -08:00

README.md

Example Docs

The sample docs directory contains the following files:

example-10k.html - A 10-K SEC filing in HTML format
layout-parser-paper.pdf - A PDF copy of the layout parser paper
factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test out the parsers in the library. In addition, here are instructions for pulling in some sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format using the following curl command. Note that you must set a user agent in the request header (replace the ${organization} and ${email} placeholders with your own values), or the SEC site will reject your request.

curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt

You can parse this document using the HTML parser.
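
For readers who prefer to stay in Python, the download and parse steps might look like the sketch below. It assumes the `unstructured` package is installed and that `partition_html` accepts a `filename` argument; the function names `build_sec_request` and `download_and_partition` and the default destination filename are illustrative choices, not part of the library.

```python
from pathlib import Path
from urllib.request import Request, urlopen

SEC_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_sec_request(organization: str, email: str, url: str = SEC_URL) -> Request:
    """Build a GET request carrying the User-Agent header the SEC site expects."""
    return Request(url, headers={"User-Agent": f"{organization} {email}"})


def download_and_partition(
    organization: str,
    email: str,
    dest: str = "0001171843-21-001344.txt",
):
    """Download the filing to `dest`, then parse it with the HTML parser."""
    request = build_sec_request(organization, email)
    with urlopen(request) as response:
        Path(dest).write_bytes(response.read())

    # Imported lazily so the request helper works without `unstructured`.
    from unstructured.partition.html import partition_html

    return partition_html(filename=dest)
```

Calling `download_and_partition("Example Org", "user@example.com")` with your own organization and email should return the parsed elements, provided the network request succeeds.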