unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-18 04:47:38 +00:00

Author	SHA1	Message	Date
Christine Straub	da7ac625b1	Feat: save tables in PDF's as images (#2229 ) closes #2222. ### Summary The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This filename is presented in the `image_path` metadata field for the Table element. The default would be to not do this. ### Testing PDF: [124_PDFsam_Basel III - Finalising post-crisis reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf) ``` elements = partition_pdf( filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf", strategy="hi_res", infer_table_structure=True, extract_element_types=['Table'], ) ```	2023-12-11 19:14:41 +00:00
Christine Straub	ed76b11b1a	Refactor: support image extraction (#2201 ) ### Summary This PR is the second part of the "image extraction" refactor to move it from unstructured-inference repo to unstructured repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/299. This PR adds logic to support extracting images. ### Testing `git clone -b refactor/remove_image_extraction_code --single-branch https://github.com/Unstructured-IO/unstructured-inference.git && cd unstructured-inference && pip install -e . && cd ../` ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", extract_images_in_pdf=True, ) print("\n\n".join([str(el) for el in elements])) ```	2023-12-05 18:22:29 +00:00
John	8fa5cbf036	build(ci): rm unneeded call to get_api_key in test (#2199 ) Follow-up PR to [https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195). Removes unnecessary calls to `get_api_key()`. That helper function is supposed to only be used for tests decorated by @pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI") (which are skipped because those tests are partitioning pdf/jpg files). These tests are partitioning emails and rely on the MockResponse at the top of the file, so they don't need to call `get_api_key()` and it can simply be removed from them.	2023-12-03 21:28:05 -08:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
John	e5bdf7fb43	chore: unstructured python client (#2195 ) ### Summary Closes #2033 Updates `partition_via_api` to use `UnstructuredClient` for api calls instead of `requests`. Updates associated tests. Note: This PR does not update `partition_multiple_via_api` as documentation in `unstructured-python-client` indicates it does not support multiple files. A new issue should be opened to add that functionality to `unstructured-python-client`. --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-01 18:49:59 +00:00
Yuming Long	92dae8cd1a	Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 ) ### Summary Add a procedure to repair a PDF when its structure is invalid for `PDFminer` to process. This PR handles two cases of `PSSyntaxError Invalid dictionary construct: ...`: * PDFminer opens the entire document and creates the pages generator on `PDFPage.get_pages(fp)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack) * PDFminer's interpreter processes a single page on `interpreter.process_page(page)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue) Additional tech details: * Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`, which is used for repairing PDF. * Add new dependency `pypdf` in `requirements/extra-pdf-image.in`, which is used to find the error page from the entire document by reading the PDF file again (can't find a way to split pdf in PDFminer). * Refactor the `is null` check for `get_uris_from_annots`, since the root cause is that `get_uris` passed a None `annots` to `get_uris_from_annots`, so the Null check should happen in `get_uris`. * Add more type protection in `get_uris_from_annots` when using any `PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This should fix: * https://github.com/Unstructured-IO/unstructured/issues/1922 where `annotation_dict` is a `PDFObjRef` * https://github.com/Unstructured-IO/unstructured/issues/1921 where `rect` is a `PDFObjRef` ### Test Added three test files (all larger than 500 KB) for unittests to test: * Repair entire doc * Repair one page * Reprocess failure after repairing one page (just return the elements before error page in this case).
* Also seems like splitting the document into smaller pages could fix this problem, but not sure why. For example, I saw an error from reprocessing the whole [cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf) doc, but no error when I split the pdf by error page. * Tested whether I can repair the entire doc again in this case; saw another error, which means repairing is not helping in my opinion * PDFminer can process the whole doc when pikepdf repairs the entire doc in the first place, but we can't repair by pages in this way --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-29 19:00:15 +00:00
Yuming Long	6c08c136ae	ci: fix broken API unit test for using unsupported `fast` strategy for images (#2144 ) ### Summary This should fix the broken unit test on main CI * change the strategy in `test_partition_multiple_via_api_valid_request_data_kwargs` from `fast` to `auto`, since the test was using `fast` for images, and we don't support it.	2023-11-22 17:35:04 -08:00
Steve Canny	02e8c962aa	fix(docx): tables in header/footer dropped (#2135 ) A DOCX header or footer is a so-called "story part", meaning that, like the document body (which is also a story part), it can contain both paragraphs and tables. The implementations of `Header.text` and `Footer.text` gather only the paragraphs. Add a new method to extract all content from a header or footer, including table content, suitable for use as the `.text` attribute of that element. Fixes #2126.	2023-11-22 15:39:25 -08:00
Steve Canny	e6637592d1	fix(docx): Table.text duplicates merged cell text (#2134 ) Summary. The `python-docx` table API is designed for _uniform_ tables (no merged cells, no nested tables). Naive processing of DOCX tables using this API produces duplicate text when the table has merged cells. Add a more sophisticated parsing method that reads only "root" cells (those with an actual `<tc>` element) and skips cells spanned by a merge. In the process, abandon use of the `tabulate` package for this job (which is also designed for uniform tables) and remove the whitespace padding it adds for visual alignment of columns. Separate the text for each cell with a single newline ("\n"). Since it's little extra trouble, add support for nested tables such that their text also contributes to the `Table.text` string. The new `._iter_table_texts()` method will also be used for parsing tables in headers and footers (where they are frequently used for layout purposes) in a closely following PR. Fixes #2106.	2023-11-21 22:22:40 +00:00
Steve Canny	ee9be2a3b2	fix: assorted partition_html() bugs (#2113 ) Addresses a cluster of HTML-related bugs: - empty table is identified as bulleted-table - `partition_html()` emits empty (no text) tables (#1928) - `.text_as_html` contains inappropriate `<br>` elements in invalid locations. - cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928) - `.text_as_html` contains whitespace padding Each of these is addressed in a separate commit below. Fixes #1928. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>	2023-11-20 16:29:32 +00:00
Christine Straub	d623d75d3c	Fix: incorrect figure mapping (#2111 ) Closes #2098.	2023-11-18 00:11:11 +00:00
Steve Canny	a589a494f6	docx: improve page break fidelity (#1631 ) Page breaks can and often do occur within a paragraph. The full text of the paragraph is attributed to the page (number) the paragraph starts on. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the `PageBreak` element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements. This functionality is largely provided upstream by the new `python-docx` v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support). That version also makes obsolete the "include hyperlink text in `Paragraph.text`" monkey patch that we had maintained up to now. Remove that monkey-patch.	2023-11-17 00:09:14 +00:00
Steve Canny	7a741c9ae6	fix(chunk): #1985 mis-splits of Table chunks (#2076 ) Closes #1985 Summary. Due to an interaction of coding errors, HTML text in `TableChunk` splits of a `Table` element were repeating the entire HTML for the table in each chunk. Technical Summary. This behavior was fixed but not published in the last chunking PR of a series. Finish up that PR and submit it all here. This PR extracts chunking to the particular Section type (each has their own distinct chunking behavior).	2023-11-16 16:22:50 +00:00
Steve Canny	41fc55bc12	fix(docx): tabulate output is non-deterministic (#2090 ) The test for nested tables added a few PRs ago indirectly relies on the padding added to table-HTML by `tabulate`. The length of that padding turns out to be non-deterministic, perhaps related to M1 vs. Intel hardware. Remove padding from tabulate output in the test so only actual content is compared.	2023-11-16 07:52:16 +00:00
Christine Straub	e114e5c418	Refactor: partition pdf (#2074 ) ### Summary - add constants for strategies - add `_process_uncategorized_text_elements()` to remove code block duplication ### Testing CI should pass.	2023-11-15 21:41:02 -08:00
Steve Canny	252405c780	Dynamic ElementMetadata implementation (#2043 ) ### Executive Summary The structure of element metadata is currently static, meaning only predefined fields can appear in the metadata. We would like the flexibility for end-users, at their own discretion, to define and use additional metadata fields that make sense for their particular use-case. ### Concepts A key concept for dynamic metadata is _known field_. A known-field is one of those explicitly defined on `ElementMetadata`. Each of these has a type and can be specified when _constructing_ a new `ElementMetadata` instance. This is in contrast to an _end-user defined_ (or _ad-hoc_) metadata field, one not known at "compile" time and added at the discretion of an end-user to suit the purposes of their application. An ad-hoc field can only be added by _assignment_ on an already constructed instance. ### End-user ad-hoc metadata field behaviors An ad-hoc field can be added to an `ElementMetadata` instance by assignment: ```python >>> metadata = ElementMetadata() >>> metadata.coefficient = 0.536 ``` A field added in this way can be accessed by name: ```python >>> metadata.coefficient 0.536 ``` and that field will appear in the JSON/dict for that instance: ```python >>> metadata = ElementMetadata() >>> metadata.coefficient = 0.536 >>> metadata.to_dict() {"coefficient": 0.536} ``` However, accessing a "user-defined" value that has _not_ been assigned on that instance raises `AttributeError`: ```python >>> metadata.coeffcient # -- misspelled "coefficient" -- AttributeError: 'ElementMetadata' object has no attribute 'coeffcient' ``` This makes "tagging" a metadata item with a value very convenient, but entails the proviso that if an end-user wants to add a metadata field to _some_ elements and not others (sparse population), AND they want to access that field by name on ANY element and receive `None` where it has not been assigned, they will need to use an expression like this: ```python 
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None ``` ### Implementation Notes - ad-hoc metadata fields are discarded during consolidation (for chunking) because we don't have a consolidation strategy defined for those. We could consider using a default consolidation strategy like `FIRST` or possibly allow a user to register a strategy (although that gets hairy in non-private and multiple-memory-space situations.) - ad-hoc metadata fields cannot start with an underscore. - We have no way to distinguish an ad-hoc field from any "noise" fields that might appear in a JSON/dict loaded using `.from_dict()`, so unlike the original (which only loaded known-fields), we'll rehydrate anything that we find there. - No real type-safety is possible on ad-hoc fields but the type-checker does not complain because the type of all ad-hoc fields is `Any` (which is the best available behavior in my view). - We may want to consider whether end-users should be able to add ad-hoc fields to "sub" metadata objects too, like `DataSourceMetadata` and conceivably `CoordinatesMetadata` (although I'm not immediately seeing a use-case for the second one).	2023-11-15 13:22:15 -08:00
Austin Walker	2931cb38e8	fix: handle KeyError: 'N' for certain pdfs (#2072 ) Closes #2059. We've found some pdfs that throw an error in pdfminer. These files use a ICCBased color profile but do not include an expected value `N`. As a workaround, we can wrap pdfminer and drop any colorspace info, since we don't need to render the document. To verify, try to partition the document in the linked issue. ``` elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-15 01:59:05 +00:00
Christine Straub	475066ba7c	Fix: fast strategy fallback to ocr only (#2055 ) Closes #2038. ### Summary The `fast` strategy should not fall back to a more expensive strategy. ### Testing For [9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf), the following code should return an empty list. ``` elements = partition(filename=filename, strategy="fast") ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-14 18:46:41 +00:00
John	1ead5a27df	Jj/2011 missing languages metadata (#2037 ) ### Summary Closes #2011 `languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements. ### Testing On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`: ``` filename = "example-docs/english-and-korean.png" from unstructured.partition.image import partition_image elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor']) elements[0].metadata.languages elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor']) elements[0].metadata.languages ``` For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not. ``` filename = "example-docs/korean-text-with-tables.pdf" from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename, strategy="ocr_only", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="fast", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="hi_res", languages=['kor']) elements[0].metadata.languages ``` On this branch, `languages` is included in the metadata regardless of strategy --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2023-11-13 16:47:05 +00:00
John	f8c180a59e	Jj/2027 float no attr strip (#2048 ) Closes #2027 Tables or pages that contain only numbers are returned as floats in a pandas.DataFrame when the image or page is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats. This update converts those floats if needed and otherwise strips the text. Testing (note: the document used for testing is new, so you will have to copy it to the main branch in order to see that this snippet raises an AttributeError on the main branch, but works on this branch) ``` from unstructured.partition.pdf import partition_pdf filename = "example-docs/all-number-table.pdf" partition_pdf(filename, strategy="ocr_only") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-10 05:14:06 +00:00
Steve Canny	d06bcc41bb	fix(docx): improve page-break detection (#2036 ) Page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a `w:lastRenderedPageBreak` element so cause over-counting if used. Use rendered page-breaks only.	2023-11-09 20:34:30 +00:00
Christine Straub	3fe480799a	Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961 ) Closes #1875. ### Summary - add functionality to do a second OCR on cropped table images - use `IMAGE_CROP_PAD` env for `individual_blocks` mode ### Testing The test function [`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425) in `test_pdf.py` should pass. ### NOTE: I've tried to experiment with values for scaling ENVs on the following PRs but found that changes to the values for scaling ENVs affect the entire page OCR output(OCR regression) so switched to doing a second OCR for tables. - https://github.com/Unstructured-IO/unstructured/pull/1998/files - https://github.com/Unstructured-IO/unstructured/pull/2004/files - https://github.com/Unstructured-IO/unstructured/pull/2016/files - https://github.com/Unstructured-IO/unstructured/pull/2029/files --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-09 18:29:55 +00:00
Christine Straub	bb58c1bb0b	Refactor: element type (#2035 ) ### Summary - add constants for element type - replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the `ElementType` constants - replace element type strings using the constants ### Testing CI should pass.	2023-11-08 21:52:55 -08:00
Steve Canny	0e2c21e5a2	fix: handle sectionless-docx in the general case (#1829 ) A DOCX document that has no sections can still contain one or more tables. Such files are never created by Word but Word can open them just fine. These can be and are generated by other applications. Use the newly-added `Document.iter_inner_content()` method added upstream in `python-docx` to capture both paragraphs and tables from a section-less DOCX document. This generalizes the fix for MS Teams chat-transcripts (an example of sectionless-docx) implemented in #1825.	2023-11-08 19:05:19 +00:00
qued	92ddf3a337	feat: enable request timeout (#2013 ) Courtesy @cdpierse. Adds a test to PR #1529 in accordance with feedback. Description from original PR: In Python the default behaviour of `requests.get` without a `timeout` being set is to hang indefinitely. We have a production use case where the desired behaviour would be to raise a timeout error rather than have the application just hang. This PR adds a new optional keyword parameter `request_timeout` to `partition` which is passed to `file_and_type_from_url` in the case where we are fetching from a URL. This is then passed to `requests.get`. --------- Co-authored-by: Charles Pierse <charlespierse@gmail.com>	2023-11-08 00:44:58 +00:00
Steve Canny	80fe07b89f	fix: #1952 support nested docx tables (#2020 ) In DOCX, like HTML, a table cell can itself contain a table. This is not uncommon and is typically used for formatting purposes. When a DOCX table is nested, create nested HTML tables to reflect that structure and create a plain-text table which captures all the text in nested tables, formatting it as a reasonable facsimile of a table. This implements the solution described and spiked in PR #1952. --------- Co-authored-by: Bruno Bornsztein <bruno.bornsztein@gmail.com>	2023-11-08 00:37:21 +00:00
Yuming Long	ad14321016	Chore: don't pass empty language code to tesseract CLI (#1996 ) Summary: Close: https://github.com/Unstructured-IO/unstructured/issues/1920 * stop passing in empty string from `languages` to tesseract, which will result in passing empty string to language config `-l` for the tesseract CLI * also stop passing in duplicate language code from `languages` to tesseract OCR * if we failed to convert any iso languages from the `languages` parameter, proceed with OCR using `eng` as the default ### Test * First confirm the tesseract error `Estimating resolution as X` before this: * on the `unstructured-api` repo with main branch, run `make run-web-app` * curl to test error from empty string, or just any wrong input like `-F 'languages="eng,de"'`: ``` curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \ -H 'accept: application/json' \ -H 'Content-Type: multipart/form-data' \ -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \ -F 'languages=""' \ -F 'strategy=hi_res' \ -F 'pdf_infer_table_structure=True' \ | jq -C . | less -R ``` * after this change: * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .` * check out to this branch * run `make run-web-app` again in api repo * the curl command returns output and a warning appears in the log --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-11-06 19:30:12 -06:00
Christine Straub	9f7ff4fd98	rfctr: Clean up test functions in `test_pdf.py` (#1999 ) ### Summary: - use the test utility function `example_doc_path()` - clean up test functions related to `metadata_date` and `exclude_metadata`	2023-11-03 10:02:43 -05:00
Steve Canny	4e40999070	rfctr: prepare docx partitioner and tests for nested tables PR to follow (#1978 ) Reviewer: May be quicker to review commit by commit as they are quite distinct and well-groomed to each focus on a single clean-up task. Clean up odds-and-ends in the docx partitioner in preparation for adding nested-tables support in a closely following PR. 1. Remove obsolete TODOs now in GitHub issues, which is probably where they belong in future anyway. 2. Remove local DOCX "workaround" code that has been implemented upstream and is now obsolete. 3. "Clean" the docx tests, introducing strict typing, extracting a fixture or two, and generally tightening things up. 4. Extract docx-local versions of `unstructured.partition.common.convert_ms_office_table_to_text()` which will be the base for adding nested-table support. More information on why this is required in that commit.	2023-11-02 05:22:17 +00:00
John	2f553333bd	refactor text.py (#1872 ) ### Summary Closes #1520 Partial solution to #1521 - Adds an abstraction layer between the user API and the partitioner implementation - Adds comments explaining paragraph chunking - Makes edits to pass strict type-checking for both text.py and test_text.py	2023-11-01 17:44:55 -05:00
John	b92cab7fbd	fix languages 500 error with empty string for ocr_languages (#1968 ) Closes #1870 Defining both `languages` and `ocr_languages` raises a ValueError, but the api defaults to `ocr_languages` being an empty string, so if users define `languages` they are automatically hitting the ValueError. This fix checks if `ocr_languages` is an empty string and converts it to `None` to avoid this. ### Testing On the main branch, the following will raise the ValueError, but it will correctly partition on this branch ``` from unstructured.partition.auto import partition filename = "example-docs/category-level.docx" elements = partition(filename,languages=['spa'],ocr_languages="") elements[0].metadata.languages ``` --------- Co-authored-by: yuming <305248291@qq.com> Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com> Co-authored-by: Austin Walker <awalk89@gmail.com>	2023-11-01 22:02:00 +00:00
Christine Straub	210d53a7e0	Fix: missing columns on table ingest output after table OCR refactor (#1959 ) Closes #1873. ### Summary Table OCR refactoring changed the default padding value for table image cropping from [12](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L95) to [0](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/ocr.py#L260), causing some columns in the table to be missing. ### Testing ``` filename = "example-docs/layout-parser-paper-with-table.pdf" elements = pdf.partition_pdf( filename=filename, strategy="hi_res", infer_table_structure=True, ) table = [el.metadata.text_as_html for el in elements if el.metadata.text_as_html] assert "Large Model" in table[0] ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-01 18:34:27 +00:00
qued	b08562ba1a	tests: separate chipper tests (#1939 ) Separates chipper tests to speed up testing and CI.	2023-10-31 21:02:00 +00:00
Denis Lusson	f585d489c1	feat: Add include_header argument for partition_csv and partition_tsv (#1764 ) This PR adds the `include_header` argument for partition_csv and partition_tsv. This is related to the following feature request https://github.com/Unstructured-IO/unstructured/issues/1751. `include_header` is already part of partition_xlsx. The work here is in line with the current usage and testing of the `include_header` argument in partition_xlsx. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-31 08:16:36 +00:00
Klaijan	a11d4634f1	fix: type error string indices bug (#1940 ) Fix TypeError: string indices must be integers. The `annotation_dict` variable is conditioned to be `None` if instance type is not dict. Then we add logic to skip the attempt if the value is `None`.	2023-10-30 17:38:57 -07:00
Christine Straub	1f0c563e0c	refactor: `partition_pdf()` for `ocr_only` strategy (#1811 ) ### Summary Update `ocr_only` strategy in `partition_pdf()`. This PR adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy. - Add functionality to perform OCR region grouping based on the OCR text taken from `pytesseract.image_to_string()` - Add functionality to get layout elements from OCR regions (ocr_layout) for both `tesseract` and `paddle` - Add functionality to determine the `source` of merged text regions when merging text regions in `merge_text_regions()` - Merge multiple test functions related to "ocr_only" strategy into `test_partition_pdf_with_ocr_only_strategy()` - This PR also fixes [issue #1792](https://github.com/Unstructured-IO/unstructured/issues/1792) ### Evaluation ``` # Image PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image # PDF PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf ``` ### Test - Before update All elements have the same coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768) - After update All elements have accurate coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-30 20:13:29 +00:00
qued	645a0fb765	fix: md tables (#1924 ) Courtesy @phowat, created a branch in the repo to make some changes and merge quickly. Closes #1486. * Fixes issue where tables from markdown documents were being treated as text Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the `tables` extension when instantiating the `python-markdown` object. Importance: This will allow users to extract structured data from tables in markdown documents. #### Testing: On `main` run the following (run `git checkout fix/md-tables -- example-docs/simple-table.md` first to grab the example table from this branch) ```python from unstructured.partition.md import partition_md elements = partition_md("example-docs/simple-table.md") print(elements[0].category) ``` Output should be `UncategorizedText`. Then run the same code on this branch and observe the output is `Table`. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-10-30 14:09:46 +00:00
Benjamin Torres	05c3cd1be2	feat: clean pdfminer elements inside tables (#1808 ) This PR introduces `clean_pdfminer_inner_elements`, which deletes pdfminer elements inside other detection origins such as YoloX or detectron. This function returns the clean document. Also, the ingest-test fixtures were updated to reflect the new standard output. The best way to check that this function is working properly is to check the new test `test_clean_pdfminer_inner_elements` in `test_unstructured/partition/utils/test_processing_elements.py` --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2023-10-30 07:10:51 +00:00
Yao You	f87731e085	feat: use yolox as default to table extraction for pdf/image (#1919 ) - yolox has better recall than yolox_quantized, the current default model, for table detection - update logic so that when `infer_table_structure=True` the default model is `yolox` instead of `yolox_quantized` - user can still override the default by passing in a `model_name` or setting the env variable `UNSTRUCTURED_HI_RES_MODEL_NAME` ## Test: Partition the attached file with ```python from unstructured.partition.pdf import partition_pdf yolox_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True) yolox_quantized_elements = partition_pdf(filename, strategy="hi_res", infer_table_structure=True, model_name="yolox_quantized") ``` Compare the table elements between those two; the yolox (default) elements should have more complete tables. [AK_AK-PERS_CAFR_2008_3.pdf](https://github.com/Unstructured-IO/unstructured/files/13191198/AK_AK-PERS_CAFR_2008_3.pdf)	2023-10-27 15:37:45 -05:00
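The default-selection rule described in the commit above can be sketched as follows. This is a minimal illustration, not the library's code: `resolve_model_name` is a hypothetical helper, and the precedence of an explicit `model_name` over the `UNSTRUCTURED_HI_RES_MODEL_NAME` environment variable is an assumption, since the commit message does not state which one wins.

```python
import os
from typing import Mapping, Optional


def resolve_model_name(
    model_name: Optional[str] = None,
    infer_table_structure: bool = False,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper: pick the hi_res layout model name."""
    # Assumption: an explicitly passed model_name takes priority.
    if model_name:
        return model_name
    # Next, honor the environment variable named in the commit message.
    env_name = env.get("UNSTRUCTURED_HI_RES_MODEL_NAME")
    if env_name:
        return env_name
    # Otherwise yolox for table inference, yolox_quantized for everything else.
    return "yolox" if infer_table_structure else "yolox_quantized"
```

Passing `env` as a parameter keeps the sketch easy to exercise without touching the real process environment.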
qued	808b4ced7a	build(deps): remove ebooklib (#1878 ) * Removed `ebooklib` as a dependency `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.	2023-10-26 12:22:40 -05:00
qued	d8241cbcfc	fix: filename missing from image metadata (#1863 ) Closes [#1859](https://github.com/Unstructured-IO/unstructured/issues/1859). * Fixes elements partitioned from an image file missing certain metadata Metadata for image files, like file type, was being handled differently from other file types. This caused a bug where other metadata, like the file name, was being missed. This change brought metadata handling for image files to be more in line with the handling for other file types so that file name and other metadata fields are being captured. Additionally: * Added test to verify filename is being captured in metadata * Cleaned up `CHANGELOG.md` formatting #### Testing: The following produces output `None` on `main`, but outputs the filename `layout-parser-paper-fast.jpg` on this branch: ```python from unstructured.partition.auto import partition elements = partition("example-docs/layout-parser-paper-fast.jpg") print(elements[0].metadata.filename) ```	2023-10-25 05:19:51 +00:00
John	8080f9480d	fix strategy test for api and linting (#1840 ) ### Summary Closes unstructured-api issue [188](https://github.com/Unstructured-IO/unstructured-api/issues/188) The test and gist were using different versions of the same file (jpg/pdf), creating what looked like a bug when there wasn't one. The api is correctly using the `strategy` kwarg. ### Testing #### Checkout to `main` - Comment out the `@pytest.mark.skip` decorators for the `test_partition_via_api_with_no_strategy` test - Add an API key to your env: - Add `from dotenv import load_dotenv; load_dotenv()` to the top of the file and have `UNS_API_KEY` defined in `.env` - Run `pytest test_unstructured/partition/test_api.py -k "test_partition_via_api_with_no_strategy"` ^the test will fail #### Checkout to this branch - (make the same changes as above) - Run `pytest test_unstructured/partition/test_api.py -k "test_partition_via_api_with_no_strategy"` ### Other `make tidy` and `make check` made linting changes to additional files	2023-10-24 22:17:54 +00:00
Yuming Long	01a0e003d9	Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850 ) ### Summary A follow-up to https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to remove the lines that pass extract_tables to inference, and this PR also notes the table regression that occurs if we only do one OCR for the entire doc. Tech details: * stop passing `extract_tables` parameter to inference * added table extraction ingest test for image, which was skipped before, and the "text_as_html" field contains the OCR output from the table OCR refactor PR * replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to test additional parameters * added `error_margin` as ENV when comparing bounding boxes of `ocr_region` with `table_element` * added more tests for tables and noted the table regression in test for partition pdf ### Test * to check that the `extract_tables` parameter is no longer passed to inference, run test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and you will see a warning like `Table OCR from get_tokens method will be deprecated....`, which means it called the table OCR in inference repo. This branch removed the warning.	2023-10-24 17:13:28 +00:00
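The switch from `assert_called_once_with` to `call_args` mentioned in the commit above can be illustrated with the standard library's `unittest.mock`. This is only a sketch of the testing pattern: `partition_stub` and its keyword arguments are hypothetical stand-ins, not functions from the repository.

```python
from unittest import mock


def partition_stub(process, filename):
    # Hypothetical caller that forwards several keyword arguments downstream.
    return process(filename, ocr_languages="eng", extra_flag=True)


process = mock.Mock(return_value=[])
partition_stub(process, "doc.pdf")

# assert_called_once_with("doc.pdf", ocr_languages="eng") would fail here,
# because it requires every argument (including extra_flag) to match exactly.
# Inspecting call_args lets the test check only the arguments it cares about.
args, kwargs = process.call_args
assert args == ("doc.pdf",)
assert kwargs["ocr_languages"] == "eng"
```

The trade-off is that `call_args` no longer verifies the call count, so a separate `call_count` check is needed if that matters to the test.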
qued	44cef80c82	test: Add test to ensure languages trickle down to ocr (#1857 ) Closes [#93](https://github.com/Unstructured-IO/unstructured-inference/issues/93). Adds a test to ensure language parameters are passed all the way from `partition_pdf` down to the OCR calls. #### Testing: CI should pass.	2023-10-24 16:54:19 +00:00
Yao You	b530e0a2be	fix: partition docx from teams output (#1825 ) This PR resolves #1816 - current docx partition assumes all contents are in sections - this is not true for MS Teams chat transcript exported to docx - now the code checks if there are sections or not; if not then iterate through the paragraphs and partition contents in the paragraphs	2023-10-24 15:17:02 +00:00
Amanda Cameron	0584e1d031	chore: fix infer_table bug (#1833 ) Carrying `skip_infer_table_types` to `infer_table_structure` in partition flow. Now PPT/X, DOC/X, etc. Table elements should not have a `text_as_html` field. Note: I've continued to exclude this var from partitioners that go through html flow, I think if we've already got the html it doesn't make sense to carry the infer variable along, since we're not 'infer-ing' the html table in these cases. TODO: ✅ add unit tests --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: amanda103 <amanda103@users.noreply.github.com>	2023-10-24 00:11:53 +00:00
qued	7fdddfbc1e	chore: improve kwarg handling (#1810 ) Closes `unstructured-inference` issue [#265](https://github.com/Unstructured-IO/unstructured-inference/issues/265). Cleaned up the kwarg handling, taking opportunities to turn instances of handling kwargs as dicts to just using them as normal in function signatures. #### Testing: Should just pass CI.	2023-10-23 04:48:28 +00:00
Yuming Long	ce40cdc55f	Chore (refactor): support table extraction with pre-computed ocr data (#1801 ) ### Summary Table OCR refactor, move the OCR part for table model in inference repo to unst repo. * Before this PR, the table model extracts OCR tokens with texts and bounding boxes and fills the tokens into the table structure in inference repo. This means we need to do an additional OCR for tables. * After this PR, we use the OCR data from entire page OCR and pass the OCR tokens to inference repo, which means we only do one OCR for the entire document. Tech details: * Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this means we use the same OCR agent for entire page and tables since we only do one OCR. * Bump inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from unst repo. Please check in [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256). * All notebook lint changes are made by `make tidy` * This PR also fixes [issue](https://github.com/Unstructured-IO/unstructured/issues/1564); I've added a test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages` * Add same scaling logic to image [similar to previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to entire image ### Test * Not much to test manually, except that table extraction still works * But due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in table output; here is a comparison of test outputs I found from the same test `test_partition_image_with_table_extraction`: screenshot of the table in `layout-parser-paper-with-table.jpg`: <img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c"> before refactor: <img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e"> after refactor: <img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d"> ### TODO (added as a ticket) Still have some cleanup to do in inference repo since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR related in inference, here are items that are deprecated and can be removed: * [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code) * parameter `extract_tables` in inference * [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88) * [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197) * env `TABLE_OCR` ### Note if we want to fall back to an additional table OCR (may need this for using paddle for table), we need to: * pass `infer_table_structure` to inference with `extract_tables` parameter * stop passing `infer_table_structure` to `ocr.py` --------- Co-authored-by: Yao You <yao@unstructured.io>	2023-10-21 00:24:23 +00:00
Yao You	3437a23c91	fix: partition html fail with table without tbody (#1817 ) This PR resolves #1807 - fix a bug where the code fails when table-tagged content contains no `tbody` tag but uses a `thead` tag for the rows - now when there is no `tbody` in a table section we try to look for `thead` instead - when neither is found return an empty table	2023-10-20 23:21:59 +00:00
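The `tbody`-then-`thead` fallback described in the commit above might look roughly like this. It is a sketch under stated assumptions: `extract_rows` is a hypothetical function, and the standard library's `xml.etree.ElementTree` stands in for whatever HTML parser the repository actually uses, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_rows(table_markup: str) -> List[List[str]]:
    """Hypothetical helper: return cell text per row, or [] if no rows exist."""
    table = ET.fromstring(table_markup)
    # Prefer <tbody>; fall back to <thead> when the body section is absent.
    section = table.find("tbody")
    if section is None:
        section = table.find("thead")
    # Neither section found: return an empty table instead of failing.
    if section is None:
        return []
    rows = []
    for tr in section.findall("tr"):
        cells = [c for c in tr if c.tag in ("td", "th")]
        rows.append(["".join(c.itertext()).strip() for c in cells])
    return rows
```

The explicit `is None` checks matter here: an `Element` with no children is falsy, so `table.find("tbody") or table.find("thead")` would wrongly skip an empty `tbody`.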
Yao You	aa7b7c87d6	fix: model_name being None raises attribution error (#1822 ) This PR resolves #1754 - function wrapper tries to use `cast` to convert kwargs into `str` but when a value is `None` `cast(str, None)` still returns `None` - fix replaces the conversion to simply using `str()` function call	2023-10-20 21:08:17 +00:00
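The behavior behind the commit above is easy to confirm with the standard library alone: `typing.cast` is a hint for type checkers and returns its argument unchanged at runtime, whereas `str()` performs a real conversion. The variable names below are illustrative only and do not come from the repository.

```python
from typing import Optional, cast

model_name: Optional[str] = None

# cast() does nothing at runtime, so None stays None and any later
# string method call on it (e.g. .lower()) raises AttributeError.
as_cast = cast(str, model_name)

# str() actually converts the value, so the result is always a string.
as_str = str(model_name)

print(repr(as_cast))  # None
print(repr(as_str))   # 'None'
```

Note that `str(None)` yields the literal string `"None"`, so code downstream of such a conversion needs to be prepared for that value.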


402 Commits