unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-12 00:18:56 +00:00

Author	SHA1	Message	Date
Christine Straub	ec59abfabc	enhancement: improve text clearing process in `email` partitioning (#3422 ) ### Summary Currently, the email partitioner removes only `=\n` characters during the clearing process. However, email content sometimes contains `=\r\n` characters, especially when read from file-like objects such as `SpooledTemporaryFile` (the file type used in our API). This PR updates the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process. ### Testing ``` filename = "example-docs/eml/family-day.eml" elements = partition_email( filename=filename, ) print(f"From filename: {elements[3].text}") with open(filename, "rb") as test_file: spooled_temp_file = tempfile.SpooledTemporaryFile() spooled_temp_file.write(test_file.read()) spooled_temp_file.seek(0) elements = partition_email(file=spooled_temp_file) print(f"From spooled_temp_file: {elements[3].text}") ``` Results: - on `main` ``` From filename: Make sure to RSVP! From spooled_temp_file: Make sure to = RSVP! ``` - on `PR` ``` From filename: Make sure to RSVP! From spooled_temp_file: Make sure to RSVP! ```	2024-07-19 18:18:02 +00:00
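The clearing step described in the commit above can be sketched as follows. This is a minimal illustration, not the partitioner's actual code: the helper name `remove_soft_line_breaks` and the regular expression are assumptions chosen to show how both `=\n` and `=\r\n` can be removed in one pass.

```python
import re

# Hypothetical helper (not the library's function): a quoted-printable
# "soft line break" is "=" immediately followed by LF or CRLF.
_SOFT_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Remove "=\\n" and "=\\r\\n" sequences from `text`."""
    return _SOFT_BREAK.sub("", text)


lf_text = "Make sure to =\nRSVP!"      # as read from a file opened by name
crlf_text = "Make sure to =\r\nRSVP!"  # as may arrive from a file-like object

print(remove_soft_line_breaks(lf_text))
print(remove_soft_line_breaks(crlf_text))
```

Making the carriage return optional in a single pattern keeps the two cases from drifting apart, which is the failure the commit describes.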
Christine Straub	0eb461acc2	refactor: restructure PDF/Image example document organization (#3410 ) This PR aims to improve the organization and readability of our example documents used in unit tests, specifically focusing on PDF and image files. ### Summary - Created two new subdirectories in the `example-docs` folder: - `pdf/`: for all PDF example files - `img/`: for all image example files - Moved relevant PDF files from `example-docs/` to `example-docs/pdf/` - Moved relevant image files from `example-docs/` to `example-docs/img/` - Updated file paths in affected unit & ingest tests to reflect the new directory structure ### Testing All unit & ingest tests should be updated and verified to work with the new file structure. ## Notes Other file types (e.g., office documents, HTML files) remain in the root of `example-docs/` for now. ## Next Steps Consider similar reorganization for other file types if this structure proves to be beneficial. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-07-18 22:21:32 +00:00
Steve Canny	56ca39ca7f	rfctr(file): improve filetype tests (#3402 ) Summary Improve file-detection tests in preparation for additional work and bug fixes. Additional Context - Add type annotations. - Use mocks instead of `monkeypatch` in most cases and verify calls to mock. This revealed a dozen broken tests, broken in that the mocks weren't being called so a different code path than intended was being exercised. - Use `example_doc_path()` instead of hard-coded paths. - Add actual test files for cases where they were being constructed in temporary directories. - Make test names consistent and more descriptive of behavior under test.	2024-07-16 04:04:34 +00:00
Steve Canny	3d6e30a1f7	rfctr(auto): improve expression in tests (#3384 ) Summary In preparation for further work on auto-partitioning, improve the expression in the test-suite.	2024-07-11 19:57:28 +00:00
Steve Canny	c27e0d0062	rfctr(html): replace html parser (#3218 ) Summary Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster although that's just a happy side-effect. Additional Context The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains `lxml` as the performant and reliable base library but uses `lxml`'s custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors. This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes. The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases. Fixes #2325 Fixes #2562 Fixes #2675 Fixes #3168 Fixes #3227 Fixes #3228 Fixes #3230 Fixes #3237 Fixes #3245 Fixes #3247 Fixes #3255 Fixes #3309 ### BEHAVIOR DIFFERENCES #### `emphasized_text_tags` encoding is changed: - `<strong>` is encoded as `"b"` rather than `"strong"`. - `<em>` is encoded as `"i"` rather than `"em"`. - `<span>` is no longer recorded in `emphasized_text_tags` (because without the CSS we can't tell whether it's used for emphasis or if so what kind). - nested emphasis (e.g. bold+italic) is encoded as multiple characters ("bi"). 
- `emphasized_text_contents` is broken on emphasis-change boundaries, like: ```html `<p>foo <b>bar <i>baz</i> bada</b> bing</p>` ``` produces: ```json { "emphasized_text_contents": ["bar", "baz", "bada"], "emphasized_text_tags": ["b", "bi", "b"] } ``` whereas previously it would have produced: ```json { "emphasized_text_contents": ["bar baz bada", "baz"], "emphasized_text_tags": ["b", "i"] } ``` #### `<pre>` text is preserved as it appears in the html Except that a leading newline is removed if present (has to be in position 0 of text). Also, a trailing newline is stripped but only if it appears in the very last position ([-1]) of the `<pre>` text. Old parser stripped all leading and trailing whitespace. Result is that: ```html <pre> foo bar baz </pre> ``` parses to `"foo\nbar\nbaz"` which is the same result produced for: ```html <pre>foo bar baz</pre> ``` This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way. #### Whitespace normalization Leading and trailing whitespace are removed from element text, just as it is removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that `\t`, `\n`, and ` ` are replaced with a regular space character. All text derived from elements is whitespace normalized except the text within a `<pre>` tag. Any leading or trailing newline is trimmed from `<pre>` element text; all other whitespace is preserved just as it appeared in the HTML source. #### `link_start_indexes` metadata is no longer captured. Rationale: - It was frequently wrong, often `-1`. - It was deprecated but then added back in a community PR. - Maintaining it across any possible downstream transformations (e.g. chunking) would be expensive and almost certainly lead to wrong values as distant code evolves. 
- It is complex to compute and recompute when whitespace is normalized, adding substantial complexity to the code and reducing readability and maintainability #### `<br/>` element is replaced with a single newline (`"\n"`) but that is usually replaced with a space in `Element.text` when it is normalized. The newline is preserved within a `<pre>` element. - Related: _No paragraph-break on `<br/><br/>`_ #### Empty `h1..h6` elements are dropped. HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a `Title` element) when they contain no text or contain only whitespace. --------- Co-authored-by: scanny <scanny@users.noreply.github.com>	2024-07-11 00:14:28 +00:00
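The whitespace rules listed in the commit above can be sketched in a few lines. This is only an illustration of the described behavior, not the parser's code: the function names are hypothetical, and treating only space, tab, newline, carriage return and form feed as collapsible whitespace is an assumption.

```python
import re

# Assumption: only these ASCII whitespace characters are collapsed.
_WHITESPACE_RUN = re.compile(r"[ \t\n\r\f]+")


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip(" ")


def trim_pre_text(text: str) -> str:
    """Drop one leading newline (position 0) and one trailing newline
    (position -1); all other whitespace is preserved as written."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


print(repr(normalize_whitespace("  foo\t\nbar   baz ")))
print(repr(trim_pre_text("\nfoo\nbar\nbaz\n")))
print(repr(trim_pre_text("foo\nbar\nbaz")))
```

The two `<pre>` inputs yield the same string, which mirrors the equivalence the commit message describes.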
Steve Canny	0c562d8050	rfctr(auto): fix auto-partition test xfails and skips (#3367 ) Summary Improve expression in auto-partition tests and fix xfails and skips. Add issues for the two hard-fails where xfail needed to stay.	2024-07-10 05:29:07 +00:00
Steve Canny	47d28612f7	feat(docx): add pluggable picture sub-partitioner (#3081 ) Summary Allow registration of a custom sub-partitioner that extracts images from a DOCX paragraph. Additional Context - A custom image sub-partitioner must implement the `PicturePartitionerT` interface defined in this PR. Basically have an `.iter_elements()` classmethod that takes the paragraph and generates zero or more `Image` elements from it. - The custom image sub-partitioner must be registered by passing the class to `register_picture_partitioner()`. - The default image sub-partitioner is `_NullPicturePartitioner` that does nothing. - The registered picture partitioner is called once for each paragraph.	2024-05-23 18:46:30 +00:00
Jan Kanty Milczek	e6ada05c55	Feat: form parsing placeholders (#3034 ) Allows introduction of form extraction in the future - sets up the FormKeysValues element & format, puts in an empty function call in the partition_pdf_or_image pipeline.	2024-05-16 14:21:31 +00:00
Steve Canny	aeca8bef88	rfctr(odt): organize and improve test_odt.py (#3031 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_odt()` test suite: - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove a couple duplicated tests	2024-05-16 01:04:06 +00:00
Steve Canny	a164b01c7e	rfctr(doc): spruce up test_doc.py (#3024 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_doc()` test suite: - Remove redundant DOCX -> DOC file conversions on most tests. - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove one duplicated test Speed was roughly doubled: 24 tests in 20s -> 23 tests in 8s.	2024-05-15 18:32:51 +00:00
Steve Canny	e4c895923d	fix(csv): partition_csv() raises on long lines (#2998 ) Summary The CSV delimiter-sniffer requires whole lines to properly detect the delimiter character. Limiting bytes read produced partial lines when lines were very long. Limit bytes but read whole lines. Fixes #2643.	2024-05-10 21:19:31 +00:00
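"Limit bytes but read whole lines" can be sketched with the standard library alone. This is an assumption-laden illustration rather than the code in `partition_csv()`: the function name `sniff_delimiter`, the `byte_hint` parameter and the candidate delimiter set are all hypothetical.

```python
import csv
import io


def sniff_delimiter(file, byte_hint: int = 1024) -> str:
    """Detect the delimiter from a size-limited sample of whole lines."""
    # readlines(hint) stops once the total size read exceeds `hint`, but it
    # never returns a partial line, so the sniffer sees complete rows.
    sample = "".join(file.readlines(byte_hint))
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter


# Each row is about 600 characters, far longer than the 100-character hint.
long_row = ";".join(["x" * 200] * 3) + "\n"
csv_file = io.StringIO(long_row * 5)
print(sniff_delimiter(csv_file, byte_hint=100))
```

A plain `file.read(100)` on the same data would hand the sniffer a fragment with no delimiter in it, which is the failure mode the commit describes.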
John	593aa47802	fix: ppt parameters include_page_breaks and include_slide_notes (#2996 ) Pass the parameters `include_slide_notes` and `include_page_breaks` to `partition_pptx` from `partition_ppt`. Also update the .ppt example doc we use for testing so it has slide notes and a PageBreak (and second page)	2024-05-10 17:57:36 +00:00
Pluto	4397dd6a10	Add calculation of table related metrics based on table_as_cells (#2898 ) This pull request adds metrics that are calculated based on table_as_cells instead of text_as_html. This change is required for comprehensive metrics calculation, as previously every predicted colspan or rowspan was counted as an incorrect prediction (even when the prediction was correct). This change has to be merged after https://github.com/Unstructured-IO/unstructured/pull/2892 which introduces the table_as_cells field	2024-05-07 13:57:38 +00:00
Steve Canny	601594d373	fix(docx): fix short-row DOCX table (#2943 ) Summary The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate `.text` and `.metadata.text_as_html` for these tables.	2024-05-02 00:45:52 +00:00
Michał Martyniak	7720e72424	Fix: avoid elements sharing the same memory address (#2940 ) This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how hashes or parent IDs are updated. When `assign_and_map_hash_ids()` is called and elements (or elements' metadata) do not have unique memory addresses, then updating the parent_id of one element will also overwrite the parent_id of some other element. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2024-04-28 19:15:17 -07:00
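The aliasing problem described in the commit above can be reproduced with a toy model. The `Element` and `Metadata` classes below are simplified stand-ins invented for this sketch, not the library's classes; the point is only that two elements holding the same metadata object cannot have different `parent_id` values.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:  # simplified stand-in, not the library's class
    parent_id: Optional[str] = None


@dataclass
class Element:  # simplified stand-in, not the library's class
    text: str
    metadata: Metadata = field(default_factory=Metadata)


shared = Metadata()
first = Element("first item", shared)
second = Element("second item", shared)  # same object: aliased metadata

first.metadata.parent_id = "abc"
print(second.metadata.parent_id)  # the write to `first` leaks into `second`

# Giving each element its own copy removes the aliasing.
third = Element("third item", copy.deepcopy(shared))
third.metadata.parent_id = "xyz"
print(first.metadata.parent_id)
```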
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
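A deterministic, document-unique ID of the kind described above can be sketched with `hashlib`. The commit names the hash inputs (filename, page number, sequence number on the page, text); everything else here is an assumption made for illustration, including the function name, the separator, the choice of SHA-256 and the 32-character truncation.

```python
import hashlib


def element_id(filename: str, page_number: int, sequence_number: int, text: str) -> str:
    """Hypothetical ID scheme: hash the four inputs named in the commit."""
    # A separator keeps ("a", 1, 23) from colliding with ("a", 12, 3).
    data = "\x1f".join([filename, str(page_number), str(sequence_number), text])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


id_a = element_id("report.pdf", 1, 0, "Quarterly results")
id_b = element_id("report.pdf", 1, 0, "Quarterly results")
id_c = element_id("report.pdf", 1, 1, "Quarterly results")

print(id_a == id_b)  # same inputs give the same ID
print(id_a == id_c)  # same text at a different position gives a different ID
```

Including the position is what lets two elements with identical text in the same document receive different IDs.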
Steve Canny	3e643c4cb3	feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 ) Summary Delegate partitioning of PPTX Picture (image, to a first approximation) shapes to a distinct sub-partitioner and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.	2024-04-12 06:00:01 +00:00
Steve Canny	2c7e0289aa	rfctr(pptx): extract _PptxPartitionerOptions (#2853 ) Reviewers: Likely quicker to review commit-by-commit. Summary In preparation for adding a PPTX `Picture` shape _sub-partitioner_, extract management of PPTX partitioning-run options to a separate `_PptxPartitioningOptions` object similar to those used in chunking and XLSX partitioning. This provides several benefits: - Extract code dealing with applying defaults and computing derived values from the main partitioning code, leaving it less cluttered and focused on the partitioning algorithm itself. - Allow the options set to be passed to helper objects, prominently including sub-partitioners, without requiring a long list of parameters or requiring the caller to couple itself to the particular option values the helper object requires. - Allow options behaviors to be thoroughly and efficiently tested in isolation.	2024-04-08 19:01:03 +00:00
Klaijan	8a239b346c	feat: add cleanup fixtures for test_evaluate (#2701 ) This PR adds `@pytest.mark.usefixtures("_cleanup_after_test")` to the tests in `test_evaluate` that do not already have it.	2024-04-02 15:10:59 +00:00
Matt Robinson	b4d9ad8130	enhancement: detect headers in `partition_pdf` with fast strategy (#2455 ) ### Summary Detects headers and footers when using `partition_pdf` with the fast strategy. Identifies elements that are positioned in the top or bottom 5% of the page as headers or footers. If no coordinate information is available, an element won't be detected as a header or footer. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-02-23 16:56:09 +00:00
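The 5% rule in the commit above can be sketched as a position check. Several details are assumptions made for this illustration: the function name, a coordinate system with y measured downward from the top of the page, and the requirement that the whole element lie inside the band.

```python
from typing import Optional


def classify_by_position(
    y_top: Optional[float],
    y_bottom: Optional[float],
    page_height: Optional[float],
    margin: float = 0.05,
) -> Optional[str]:
    """Return "Header", "Footer", or None for an element's vertical extent."""
    # Without coordinate information nothing is detected.
    if y_top is None or y_bottom is None or not page_height:
        return None
    if y_bottom <= page_height * margin:
        return "Header"
    if y_top >= page_height * (1 - margin):
        return "Footer"
    return None


print(classify_by_position(10, 40, 1000))
print(classify_by_position(960, 990, 1000))
print(classify_by_position(500, 520, 1000))
print(classify_by_position(None, None, 1000))
```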
Pawel Kmiecik	ff9d46f9dc	feat(eval): table evaluation metrics (#2558 ) This PR adds new table evaluation metrics prepared by @leah1985 The metrics include: - `table count` (check) - `table_level_acc` - accuracy of table detection - `element_col_level_index_acc` - accuracy of cell detection in columns - `element_row_level_index_acc` - accuracy of cell detection in rows - `element_col_level_content_acc` - accuracy of content detected in columns - `element_row_level_content_acc` - accuracy of content detected in rows TODO in next steps: - create a minimal dataset and upload to s3 for ingest tests - generate and add metrics on the above dataset to `test_unstructured_ingest/metrics`	2024-02-22 16:35:46 +00:00
Filip Knefel	f048695a55	feat: include text from shapes in docx (#2510 ) Reported bug: Text from docx shapes is not included in the `partition` output. Fix: Extend docx partition to search for text tags nested inside structures responsible for creating the shape. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-02-14 17:48:38 +00:00
Steve Canny	d9f8467187	fix(xlsx): xlsx subtable algorithm (#2534 ) Reviewers: It may be easier to review each of the two commits separately. The first adds the new `_SubtableParser` object with its unit-tests and the second one uses that object to replace the flawed existing subtable-parsing algorithm. Summary There are a cluster of bugs in `partition_xlsx()` that all derive from flaws in the algorithm we use to detect "subtables". These are encountered when the user wants to get multiple document-elements from each worksheet, which is the default (argument `find_subtable = True`). This PR replaces the flawed existing algorithm with a `_SubtableParser` object that encapsulates all that logic and has thorough unit-tests. Additional Context This is a summary of the failure cases. There are a few other cases but they're closely related and this was enough evidence and scope for my purposes. This PR fixes all these bugs: ```python # # -- ✅ CASE 1: There are no leading or trailing single-cell rows. # -> this subtable functions never get called, subtable is emitted as the only element # # a b -> Table(a, b, c, d) # c d # -- ✅ CASE 2: There is exactly one leading single-cell row. # -> Leading single-cell row emitted as `Title` element, core-table properly identified. # # a -> [ Title(a), # b c Table(b, c, d, e) ] # d e # -- ❌ CASE 3: There are two-or-more leading single-cell rows. # -> leading single-cell rows are included in subtable # # a -> [ Table(a, b, c, d, e, f) ] # b # c d # e f # -- ❌ CASE 4: There is exactly one trailing single-cell row. # -> core table is dropped. trailing single-cell row is emitted as Title # (this is the behavior in the reported bug) # # a b -> [ Title(e) ] # c d # e # -- ❌ CASE 5: There are two-or-more trailing single-cell rows. # -> core table is dropped. trailing single-cell rows are each emitted as a Title # # a b -> [ Title(e), # c d Title(f) ] # e # f # -- ✅ CASE 6: There are exactly one each leading and trailing single-cell rows. 
# -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b c Table(b, c, d, e), # d e Title(f) ] # f # -- ✅ CASE 7: There are two leading and one trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g) ] # g # -- ✅ CASE 8: There are two-or-more leading and trailing single-cell rows. # -> core table is correctly identified, leading and trailing single-cell rows are each # emitted as a Title. # # a -> [ Title(a), # b Title(b), # c d Table(c, d, e, f), # e f Title(g), # g Title(h) ] # h # -- ❌ CASE 9: Single-row subtable, no single-cell rows above or below. # -> First cell is mistakenly emitted as title, remaining cells are dropped. # # a b c -> [ Title(a) ] # -- ❌ CASE 10: Single-row subtable with one leading single-cell row. # -> Leading single-row cell is correctly identified as title, core-table is mis-identified # as a `Title` and truncated. # # a -> [ Title(a), # b c d Title(b) ] ```	2024-02-13 20:29:17 -08:00
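The cases above suggest a simple way to think about the split: peel single-cell rows off the top and the bottom, and treat what remains as the core table. The sketch below is not the `_SubtableParser` from the PR. It is a hypothetical `split_subtable()` function, and the outputs it gives for the previously failing cases are an assumption made by analogy with cases 6 through 8.

```python
from typing import List, Tuple

Row = List[str]  # the non-empty cell values of one worksheet row


def split_subtable(rows: List[Row]) -> Tuple[List[str], List[Row], List[str]]:
    """Return (leading single-cell texts, core-table rows, trailing single-cell texts)."""
    start = 0
    while start < len(rows) and len(rows[start]) == 1:
        start += 1
    end = len(rows)
    while end > start and len(rows[end - 1]) == 1:
        end -= 1
    leading = [row[0] for row in rows[:start]]
    trailing = [row[0] for row in rows[end:]]
    return leading, rows[start:end], trailing


# CASE 4: one trailing single-cell row; the core table is kept.
print(split_subtable([["a", "b"], ["c", "d"], ["e"]]))
# CASE 9: a single-row subtable is a table, not a title.
print(split_subtable([["a", "b", "c"]]))
# CASE 3: two leading single-cell rows are separated from the core table.
print(split_subtable([["a"], ["b"], ["c", "d"], ["e", "f"]]))
```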
Matt Robinson	ccf0477080	enhancement: process `.p7s` files with `partition_email` (#2521 ) ### Summary Closes #2489, which reported an inability to process `.p7s` files. PR implements two changes: - If the user selected content type for the email is not available and there is another valid content type available, fall back to the other valid content type. - For signed message, extract the signature and add it to the metadata ### Testing ```python from unstructured.partition.auto import partition filename = "example-docs/eml/signed-doc.p7s" elements = partition(filename=filename) # should get a message about fall back logic print(elements[0]) # "This is a test" elements[0].metadata.to_dict() # Will see the signature ```	2024-02-07 22:31:49 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
Klaijan	e65a44eabb	feat: update cct eval for text dir (#2299 ) The code edits the `measure_text_extraction_accuracy` function to allow a directory of txt files as well as json. The function also takes an `output_type` input, which must be either "json" or "txt", and checks whether the files under the given directory/list contain only the specified file type. To test this feature, run the following code: ```PYTHONPATH=. python unstructured/ingest/evaluate.py measure-text-extraction-accuracy-command --output_dir <clean-text-path> --source_dir <cct-label-path> --output_type txt```	2024-01-05 23:34:53 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
Andy Li	4ae49419c9	feat: support base64-encoded text in partition_email (#2277 ) closes #816 ## Description Added functionality for `partition_email` to automatically decode base64 text before passing it to `partition_text` or `partition_html`. Also adds base64 encoded email text test cases.	2023-12-19 23:37:17 -08:00
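Decoding a base64-encoded text body before further processing can be sketched with the standard `email` package. This shows the general technique only and is not the code added to `partition_email`; the sample message and the `decoded_text` helper are made up for the example.

```python
import base64
from email import message_from_string
from email.message import Message

body = base64.b64encode("Make sure to RSVP!".encode("utf-8")).decode("ascii")
raw_email = (
    "From: sender@example.com\n"
    "Subject: Family day\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    f"{body}\n"
)


def decoded_text(message: Message) -> str:
    """Hypothetical helper: return the body text of a non-multipart message."""
    # decode=True applies the Content-Transfer-Encoding (base64 here).
    payload = message.get_payload(decode=True)
    charset = message.get_content_charset() or "utf-8"
    return payload.decode(charset)


message = message_from_string(raw_email)
print(decoded_text(message))
```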
Austin Walker	d594c06a3e	fix: handle delimiter bug in partition_csv (#2224 ) Closes #2218. When a csv has commas in its content, and the delimiter is something else, Pandas may throw an error. We can sniff the csv and get the correct delimiter to pass to Pandas. To verify, try partitioning the file in the linked bug.	2023-12-13 23:57:46 +00:00
Yuming Long	92dae8cd1a	Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 ) ### Summary Add a procedure to repair PDF when the PDF structure is invalid for `PDFminer` to process. This PR handles two cases of `PSSyntaxError Invalid dictionary construct: ...`: * PDFminer opens the entire document and creates the pages generator on `PDFPage.get_pages(fp)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack) * PDFminer's interpreter processes a single page on `interpreter.process_page(page)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue) Additional tech details: * Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`, which is used for repairing PDF. * Add new dependency `pypdf` in `requirements/extra-pdf-image.in`, which is used to find the error page from the entire document by reading the PDF file again (can't find a way to split pdf in PDFminer). * Refactor the `is null` check for `get_uris_from_annots`, since the root cause is that `get_uris` passed a None `annots` to `get_uris_from_annots`, so the Null check should happen in `get_uris`. * Add more type protection in `get_uris_from_annots` when using any `PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This should fix : * https://github.com/Unstructured-IO/unstructured/issues/1922 where `annotation_dict` is a `PDFObjRef` * https://github.com/Unstructured-IO/unstructured/issues/1921 where `rect` is a `PDFObjRef` ### Test Added three test files (both are larger than 500 KB) for unittests to test: * Repair entire doc * Repair one page * Reprocess failure after repairing one page (just return the elements before error page in this case). 
* Also seems like splitting the document into smaller pages could fix this problem, but not sure why. For example, I saw error from reprocess in the whole [cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf) doc, but no error when i split the pdf by error page.... * tested if i can repair the entire doc again in this case, saw other error which means repairing is not helping imo * PDFminer can process the whole doc after pikepdf only repaired the entire doc in the first place, but we can't repair by pages in this way --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-29 19:00:15 +00:00
Klaijan	0aae1faa54	feat: add visualize param to command and add test (#2178 ) - Add `visualize` parameter to the click command -- now callable using `--visualize` flag to show the progress bar. - Refactor the name.	2023-11-29 01:05:55 +00:00
Steve Canny	02e8c962aa	fix(docx): tables in header/footer dropped (#2135 ) A DOCX header or footer is a so-called "story part" meaning like the document body (which is also a story part) it can contain both paragraphs and tables. The implementation of `Header.text` and `Footer.text` gather only the paragraphs. Add a new method to extract all content from a header or footer, including table content, suitable for use as the `.text` attribute of that element. Fixes #2126.	2023-11-22 15:39:25 -08:00
Steve Canny	e6637592d1	fix(docx): Table.text duplicates merged cell text (#2134 ) Summary. The `python-docx` table API is designed for _uniform_ tables (no merged cells, no nested tables). Naive processing of DOCX tables using this API produces duplicate text when the table has merged cells. Add a more sophisticated parsing method that reads only "root" cells (those with an actual `<tc>` element) and skip cells spanned by a merge. In the process, abandon use of the `tabulate` package for this job (which is also designed for uniform tables) and remove the whitespace padding it adds for visual alignment of columns. Separate the text for each cell with a single newline ("\n"). Since it's little extra trouble, add support for nested tables such that their text also contributes to the `Table.text` string. The new `._iter_table_texts()` method will also be used for parsing tables in headers and footers (where they are frequently used for layout purposes) in a closely following PR. Fixes #2106.	2023-11-21 22:22:40 +00:00
Steve Canny	a589a494f6	docx: improve page break fidelity (#1631 ) Page breaks can and often do occur within a paragraph. The full text of the paragraph is attributed to the page (number) the paragraph starts on. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the `PageBreak` element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements. This functionality is largely provided upstream by the new `python-docx` v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support). That version also makes obsolete the "include hyperlink text in `Paragraph.text`" monkey patch that we had maintained up to now. Remove that monkey-patch.	2023-11-17 00:09:14 +00:00
Austin Walker	2931cb38e8	fix: handle KeyError: 'N' for certain pdfs (#2072 ) Closes #2059. We've found some pdfs that throw an error in pdfminer. These files use an ICCBased color profile but do not include an expected value `N`. As a workaround, we can wrap pdfminer and drop any colorspace info, since we don't need to render the document. To verify, try to partition the document in the linked issue. ``` elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-15 01:59:05 +00:00
John	f8c180a59e	Jj/2027 float no attr strip (#2048 ) Closes #2027 Tables or pages that contain only numbers are returned as floats in a pandas.DataFrame when the image or page is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats. This update converts those floats if needed and otherwise strips the text. Testing (note: the document used for testing is new, so you will have to copy it to the main branch in order to see that this snippet raises an AttributeError on the main branch, but works on this branch) ``` from unstructured.partition.pdf import partition_pdf filename = "example-docs/all-number-table.pdf" partition_pdf(filename, strategy="ocr_only") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-10 05:14:06 +00:00
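The fix described above comes down to not calling `.strip()` on a value that is not a string. A minimal sketch, assuming a hypothetical helper name and a plain `str()` conversion for numeric values (the exact conversion used in the PR is not shown in the commit message):

```python
from typing import Union


def clean_ocr_text(value: Union[str, float, int]) -> str:
    """Hypothetical helper: strip strings, convert numbers to text."""
    if isinstance(value, str):
        return value.strip()
    # Numeric-only tables can come back as floats; floats have no .strip().
    return str(value)


print(repr(clean_ocr_text("  Total  ")))
print(repr(clean_ocr_text(3.5)))
```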
Steve Canny	d06bcc41bb	fix(docx): improve page-break detection (#2036 ) Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a `w:lastRenderedPageBreak` element so cause over-counting if used. Use rendered page-breaks only.	2023-11-09 20:34:30 +00:00
Steve Canny	0e2c21e5a2	fix: handle sectionless-docx in the general case (#1829 ) A DOCX document that has no sections can still contain one or more tables. Such files are never created by Word but Word can open them just fine. These can be and are generated by other applications. Use the newly-added `Document.iter_inner_content()` method added upstream in `python-docx` to capture both paragraphs and tables from a section-less DOCX document. This generalizes the fix for MS Teams chat-transcripts (an example of sectionless-docx) implemented in #1825.	2023-11-08 19:05:19 +00:00
Steve Canny	80fe07b89f	fix: #1952 support nested docx tables (#2020 ) In DOCX, like HTML, a table cell can itself contain a table. This is not uncommon and is typically used for formatting purposes. When a DOCX table is nested, create nested HTML tables to reflect that structure and create a plain-text table that captures all the text in nested tables, formatting it as a reasonable facsimile of a table. This implements the solution described and spiked in PR #1952. --------- Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>	2023-11-08 00:37:21 +00:00
shreyanid	6db663e7bb	refactor: separate click wrappers from core evaluation functionality (#1981 ) ### Summary Click decorated functions cannot (properly) be called outside of the click interface. This makes it difficult to reuse the setup functionality in measure_text_edit_distance or measure_element_type_accuracy. This PR removes the click decoration and separates it into a wrapper function purely to execute the command. ### Technical Details - Changed as suggested in [this StackOverflow post](https://stackoverflow.com/questions/40091347/call-another-click-command-from-a-click-command) response - The locations of these now distinct functions are separate: the `_command` click-decorated functions stay in ingest/evaluate.py, and the core functions measure_text_edit_distance and measure_element_type_accuracy are moved into the unstructured/metrics/ folder (which is a more logical location for them). - Initial test added for measure_text_edit_distance ### Test `sh ./test_unstructured_ingest/evaluation-metrics.sh text-extraction` functionality is unchanged. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>	2023-11-07 19:54:22 +00:00
Mallori Harrell	d07baed4a1	bug: empty-elements (#1252 ) - This PR adds a function to check if a piece of text only contains a bullet (no text) to prevent creating an empty element. - Also fixed a test that had a typo.	2023-11-02 10:52:41 -05:00
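A bullet-only check of the kind described in the commit above might look like the sketch below. The function names and the set of bullet characters are assumptions made for illustration; the commit does not show the actual implementation.

```python
import re

# Hypothetical pattern: optional whitespace, one bullet-like character, and
# nothing but whitespace after it. The character set is an assumption.
BULLET_ONLY_RE = re.compile(r"^\s*[\u2022\u2023\u25E6\u2043\u2219\-\*·]\s*$")


def is_bullet_only(text: str) -> bool:
    """Return True when the text is a bullet marker with no content after it."""
    return bool(BULLET_ONLY_RE.match(text))


def keep_non_empty(lines):
    """Drop blank lines and bullet-only lines so no empty element is created."""
    return [line for line in lines if line.strip() and not is_bullet_only(line)]
```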
Steve Canny	4e40999070	rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978 ) Reviewer: May be quicker to review commit by commit as they are quite distinct and well-groomed to each focus on a single clean-up task. Clean up odds-and-ends in the docx partitioner in preparation for adding nested-tables support in a closely following PR. 1. Remove obsolete TODOs now in GitHub issues, which is probably where they belong in future anyway. 2. Remove local DOCX "workaround" code that has been implemented upstream and is now obsolete. 3. "Clean" the docx tests, introducing strict typing, extracting a fixture or two, and generally tightening things up. 4. Extract docx-local versions of `unstructured.partition.common.convert_ms_office_table_to_text()` which will be the base for adding nested-table support. More information on why this is required in that commit.	2023-11-02 05:22:17 +00:00
Christine Straub	210d53a7e0	Fix: missing columns on table ingest output after table OCR refactor (#1959 ) Closes #1873. ### Summary Table OCR refactoring changed the default padding value for table image cropping from [12](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L95) to [0](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/ocr.py#L260), causing some columns in the table to be missing. ### Testing ``` filename = "example-docs/layout-parser-paper-with-table.pdf" elements = pdf.partition_pdf( filename=filename, strategy="hi_res", infer_table_structure=True, ) table = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html] assert "Large Model" in table[0] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-01 18:34:27 +00:00
qued	645a0fb765	fix: md tables (#1924 ) Courtesy @phowat, created a branch in the repo to make some changes and merge quickly. Closes #1486. * Fixes issue where tables from markdown documents were being treated as text Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the `tables` extension when instantiating the `python-markdown` object. Importance: This will allow users to extract structured data from tables in markdown documents. #### Testing: On `main` run the following (run `git checkout fix/md-tables -- example-docs/simple-table.md` first to grab the example table from this branch) ```python from unstructured.partition.md import partition_md elements = partition_md("example-docs/simple-table.md") print(elements[0].category) ``` Output should be `UncategorizedText`. Then run the same code on this branch and observe the output is `Table`. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-30 14:09:46 +00:00
Yao You	42f8cf1997	chore: add metric helper for table structure eval (#1877 ) - add helper to run inference over an image or pdf of table and compare it against a ground truth csv file - this metric generates a similarity score between 1 and 0, where 1 is perfect match and 0 is no match at all - add example docs for testing - NOTE: this metric is only relevant to table structure detection. Therefore the input should be just the table area in an image/pdf file; we are not evaluating table element detection in this metric	2023-10-27 13:23:44 -05:00
Yao You	b530e0a2be	fix: partition docx from teams output (#1825 ) This PR resolves #1816 - the current docx partitioner assumes all contents are in sections - this is not true for MS Teams chat transcripts exported to docx - now the code checks whether there are sections; if not, it iterates through the paragraphs and partitions their contents	2023-10-24 15:17:02 +00:00
Yuming Long	ce40cdc55f	Chore (refactor): support table extraction with pre-computed ocr data (#1801 ) ### Summary Table OCR refactor: move the OCR part for the table model from the inference repo to the unst repo. * Before this PR, the table model extracts OCR tokens with texts and bounding boxes and fills the tokens into the table structure in the inference repo. This means we need to do an additional OCR for tables. * After this PR, we use the OCR data from the entire-page OCR and pass the OCR tokens to the inference repo, which means we only do one OCR for the entire document. Tech details: * Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this means we use the same OCR agent for the entire page and tables, since we only do one OCR. * Bump inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unst repo. Please check in [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256). * All notebook lint changes are made by `make tidy` * This PR also fixes [issue](https://github.com/Unstructured-IO/unstructured/issues/1564); I've added a test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages` * Add the same scaling logic to the image [similar to previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to the entire image ### Test * Not much to test manually except that table extraction still works * But due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in table output; here is a comparison of test outputs I found from the same test `test_partition_image_with_table_extraction`: screenshot of the table in `layout-parser-paper-with-table.jpg`: <img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c"> before refactor: <img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e"> after refactor: <img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d"> ### TODO (added as a ticket) There is still some clean-up to do in the inference repo since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, here are the items that are deprecated and can be removed: * [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code) * parameter `extract_tables` in inference * [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88) * [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197) * env `TABLE_OCR` ### Note If we want to fall back to an additional table OCR (may need this for using paddle for tables), we need to: * pass `infer_table_structure` to inference with the `extract_tables` parameter * stop passing `infer_table_structure` to `ocr.py` --------- Co-authored-by: Yao You <yao@unstructured.io>	2023-10-21 00:24:23 +00:00
qued	cf31c9a2c4	fix: use nx to avoid recursion limit (#1761 ) Fixes recursion limit error that was being raised when partitioning Excel documents of a certain size. Previously we used a recursive method to find subtables within an excel sheet. However this would run afoul of Python's recursion depth limit when there was a contiguous block of more than 1000 cells within a sheet. This function has been updated to use the NetworkX library which avoids Python recursion issues. * Updated `_get_connected_components` to use `networkx` graph methods rather than implementing our own algorithm for finding contiguous groups of cells within a sheet. * Added a test and example doc that replicates the `RecursionError` prior to the change. * Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d. #### Testing: The following run from a Python terminal should raise a `RecursionError` on `main` and succeed on this branch: ```python import sys from unstructured.partition.xlsx import partition_xlsx old_recursion_limit = sys.getrecursionlimit() try: sys.setrecursionlimit(1000) filename = "example-docs/more-than-1k-cells.xlsx" partition_xlsx(filename=filename) finally: sys.setrecursionlimit(old_recursion_limit) ``` Note: the recursion limit is different in different contexts. Checking my own system, the default in a notebook seems to be 3000, but in a terminal it's 1000. The documented Python default recursion limit is 1000.	2023-10-14 19:38:21 +00:00
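The idea behind the commit above, finding contiguous groups of cells without recursion, can be sketched with a queue-based breadth-first search. This is not the `networkx`-based `_get_connected_components` from the PR; it is a standard-library stand-in, and the function name, the `(row, col)` cell representation, and the choice of 4-connectivity are all assumptions made for illustration.

```python
from collections import deque


def connected_components(cells):
    """Group (row, col) cells into 4-connected components without recursion."""
    remaining = set(cells)
    components = []
    while remaining:
        start = remaining.pop()
        queue = deque([start])
        component = {start}
        while queue:
            row, col = queue.popleft()
            # Visit the four orthogonal neighbors (assumed connectivity rule).
            for neighbor in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components
```

Because the pending cells live in an explicit queue rather than on the call stack, a contiguous block of several thousand cells does not come near Python's recursion limit.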
John	9500d04791	detect document language across all partitioners (#1627 ) ### Summary Closes #1534 and #1535 Detects document language using `langdetect` package. Creates new kwargs for user to set the document language (`languages`) or detect the language at the element level instead of the default document level (`detect_language_per_element`) --------- Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com> Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Austin Walker <austin@unstructured.io>	2023-10-11 01:47:56 +00:00


140 Commits