unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-22 05:25:29 +00:00

Author	SHA1	Message	Date
Christine Straub	29b9ea7ba6	refactor: ocr modules (#2492 ) The purpose of this PR is to refactor OCR-related modules to reduce unnecessary module imports to avoid potential issues (most likely due to a "circular import"). ### Summary - add `inference_utils` module (unstructured/partition/pdf_image/inference_utils.py) to define unstructured-inference library related utility functions, which will reduce importing unstructured-inference library functions in other files - add `conftest.py` in `test_unstructured/partition/pdf_image/` directory to define fixtures that are available to all tests in the same directory and its subdirectories ### Testing CI should pass	2024-02-06 17:11:55 +00:00
Christine Straub	94001a208d	feat: improve table cell data (#2457 ) The purpose of this PR is to pass embedded text through the table processing sub-pipeline for later use.	2024-02-01 05:29:19 +00:00
Christophe Jolif	ccc2302b33	feat: add the ability to specify a custom OCR besides the ones natively supported (#2462 ) It is nice to natively support both Tesseract and Paddle. However, one might already use another OCR and might want to keep using it (for quality reasons, for cost reasons, etc.). This PR adds the ability for users to specify their own OCR agent implementation, which is then called by unstructured. I am new to unstructured so don't hesitate to let me know if you would prefer this being done differently and I will rework the PR. --------- Co-authored-by: Yao You <theyaoyou@gmail.com> Co-authored-by: Yao You <yao@unstructured.io>	2024-01-31 16:38:14 -06:00
Christine Straub	8b1de4c2b8	fix: `partition_pdf()` not working when using chipper model with file (#2479 ) Closes #2480. ### Summary - fixed an error introduced by PR [#2347](https://github.com/Unstructured-IO/unstructured/pull/2347) - https://github.com/Unstructured-IO/unstructured/pull/2347/files#diff-cefa2d296ae7ffcf5c28b5734d5c7d506fbdb225c05a0bc27c6b755d5424ffdaL373 - updated `test_partition_pdf_with_model_name()` to test more model names ### Testing The updated test function `test_partition_pdf_with_model_name()` should work on this branch, but fails on the `main` branch.	2024-01-31 17:36:59 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
John	9320311a19	fix: check languages args (#2435 ) This PR is the last in a series of PRs for refactoring and fixing the language parameters (`languages` and `ocr_languages`) so we can address incorrect input by users. See #2293 It is recommended to go through this PR commit-by-commit and note the commit message. The most significant commit is "update check_languages..."	2024-01-29 20:12:08 +00:00
Yao You	97fb10db4a	fix: default hi_res model rely on inference setting (#2441 ) - there are multiple places setting the default `hi_res_model_name` in both `unstructured` and `unstructured-inference` - they lead to inconsistency and unexpected behaviors - this fix removes a helper in `unstructured` that tries to set the default hi_res layout detection model; instead we rely on the `unstructured-inference` to provide that default when no explicit model name is passed in ## test ```bash UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython ``` ```python from unstructured.partition.auto import partition # find a pdf file elements = partition("foo.pdf", strategy="hi_res") assert elements[0].metadata.detection_origin == "yolox" ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-01-29 16:44:41 +00:00
Antonio Jose Jimeno Yepes	d8b3bdb919	Check chipper version and prevent running pdfminer with chipper (#2347 ) We have added a new version of chipper (Chipperv3), so unstructured needs to work effectively with all the current Chipper versions. This implies resizing images to the appropriate resolution and making sure that Chipper elements are not sorted by unstructured. In addition, it seems that PDFMiner is being called when calling Chipper, which adds repeated elements from Chipper and PDFMiner. To evaluate this PR, you can test the code below with the attached PDF. The code writes a JSON file with the generated elements. The output can be examined with `cat out.un.json | python -m json.tool`. There are three things to check: 1. The size of the image passed to Chipper, which can be identified in the layout_height and layout_width attributes, which should have values 3301 and 2550 as shown in the example below: ``` [ { "element_id": "c0493a7872f227e4172c4192c5f48a06", "metadata": { "coordinates": { "layout_height": 3301, "layout_width": 2550, ``` 2. There should be no repeated elements. 3. Order should be closer to reading order. 
The script to run Chipper from unstructured is: ``` from unstructured import __version__ print(__version__.__version__) import json from unstructured.partition.auto import partition from unstructured.staging.base import elements_to_json elements = json.loads(elements_to_json(partition("Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf", strategy="hi_res", model_name="chipperv3"))) with open('out.un.json', 'w') as w: json.dump(elements, w) ``` [Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf](https://github.com/Unstructured-IO/unstructured/files/13817273/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper-p6.pdf) --------- Co-authored-by: Antonio Jimeno Yepes <antonio@unstructured.io>	2024-01-25 02:33:32 +00:00
John	c34fac9c3a	enhancement: add _clean_ocr_languages_arg helper function (#2413 ) This PR is one in a series of PRs for refactoring and fixing the languages parameter so it can address incorrect input by users. #2293 This PR adds _clean_ocr_languages_arg. There are no calls to this function yet, but it will be called in later PRs related to this series.	2024-01-19 19:59:08 +00:00
Christine Straub	7378a378f6	enhancement: allow setting image block crop padding parameter (#2415 ) Closes #2320 . ### Summary In certain circumstances, adjusting the image block crop padding can improve image block extraction by preventing extracted image blocks from being clipped. ### Testing - PDF: [LM339-D_2-2.pdf](https://github.com/Unstructured-IO/unstructured/files/13968952/LM339-D_2-2.pdf) - Set two environment variables `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD` and `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD` (e.g. `EXTRACT_IMAGE_BLOCK_CROP_HORIZONTAL_PAD = 40`, `EXTRACT_IMAGE_BLOCK_CROP_VERTICAL_PAD = 20`) ``` elements = partition_pdf( filename="LM339-D_2-2.pdf", extract_image_block_types=["image"], ) ```	2024-01-19 06:28:32 +00:00
John	fa9f6ccc17	refactor: use _get_iso639_language_object (#2424 ) This refactor removes `_convert_to_standard_langcode` and replaces it with calling `_get_iso639_language_object` with a string slice. Use of TESSERACT_LANGUAGES_AND_CODES, which was added to `_convert_to_standard_langcode` previously, is moved to the relevant part where `_convert_to_standard_langcode` was previously called. If/else statements replace the list comprehension for readability and `langdetect_langs.append("zho")` replaces `_convert_to_standard_langcode("zh")` since that always returned `"zho"`.	2024-01-19 00:14:45 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/ayout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
John	125b63cd7c	refactor: extract language helper functions (#2370 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 Refactor `_convert_language_code_to_pytesseract_lang_code` and extract `_get_iso639_language_object` to its own function ``` from unstructured.partition.lang import _convert_language_code_to_pytesseract_lang_code as convert convert("English") # this will raise an error on both main and this branch convert("en") # this will return "eng" on both branches ```	2024-01-16 17:51:03 +00:00
Christine Straub	ee06260987	feat: keep all image elements when using `hi_res` strategy. (#2382 ) ### Summary The goal of this PR is to keep all image elements when using "hi_res" strategy. Previously, `Image` elements with small chunks of text were ignored unless the image block extraction parameters (`extract_images_in_pdf` or `extract_image_block_types`) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified. ### Testing - on `main` branch, ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print("number of image elements: ", len(image_elements)) ``` The above code will display `number of image elements: 0`. - on this `feature` branch, The same code will display `number of image elements: 3` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-01-15 23:19:17 +00:00
Matt Robinson	36faf677c0	enhancement: file detection for `.wav` files (#2387 ) ### Summary Adds filetype detection for `.wav` audio files ### Testing ```python from unstructured.file_utils.filetype import detect_filetype filename = "example-docs/CantinaBand3.wav" detect_filetype(filename=filename) # Should be FileType.WAV ```	2024-01-15 16:50:49 +00:00
John	bfd0258ba5	chore: refactor _convert_to_standard_langcode (#2369 ) This PR is one in a series of PRs for refactoring and fixing the `languages` parameter so it can address incorrect input by users. #2293 This PR adds a dictionary for helping map fully spelled out languages to tesseract language codes --------- Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2024-01-11 00:34:13 +00:00
Steve Canny	23edf2e911	feature(chunking): add basic strategy and overlap (#2367 ) This PR culminates the restructuring of chunking over my prior dozen-or-so commits by adding the new options to the API and documentation. Separately I'll be adding a new ingest test to defend against regression, although the integration test included in this PR will do a pretty good job of that too.	2024-01-10 22:19:24 +00:00
Steve Canny	22cbdce7ca	fix(html): unequal row lengths in HTMLTable.text_as_html (#2345 ) Fixes #2339 Fixes to HTML partitioning introduced with v0.11.0 removed the use of `tabulate` for forming the HTML placed in `HTMLTable.text_as_html`. This had several benefits, but part of `tabulate`'s behavior was to make row-length (cell-count) uniform across the rows of the table. Lacking this prior uniformity produced a downstream problem reported in On closer inspection, the method used to "harvest" cell-text was producing more text-nodes than there were cells and was sensitive to where whitespace was used to format the HTML. It also "moved" text to different columns in certain rows. Refine the cell-text gathering mechanism to get exactly one text string for each row cell, eliminating whitespace formatting nodes and producing strict correspondence between the number of cells in the original HTML table row and that placed in HTML.text_as_html. HTML tables that are uniform (every row has the same number of cells) will produce a uniform table in `.text_as_html`. Merged cells may still produce a non-uniform table in `.text_as_html` (because the source table is non-uniform).	2024-01-04 21:53:19 +00:00
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Austin Walker	91b892c79d	fix: Fix api_url param to partition_via_api (#2342 ) Closes #2340 We need to make sure the custom url is passed to our client. The client constructor takes the base url, so for compatibility we can continue to take the full url and strip off the path. To verify, run the api locally and confirm you can make calls to it. ``` # In unstructured-api make run-web-app # In ipython in this repo from unstructured.partition.api import partition_via_api filename = "example-docs/layout-parser-paper.pdf" partition_via_api(filename=filename, api_url="http://localhost:8000") ```	2024-01-03 20:08:48 +00:00
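The URL handling described in #2342 (accept a full URL, pass only the base URL to the client) can be sketched with the standard library. The helper name `base_url_from_api_url` and the example path are hypothetical and not taken from `unstructured`; this only illustrates the idea of stripping the path.

```python
from urllib.parse import urlsplit


def base_url_from_api_url(api_url: str) -> str:
    """Return scheme and host (plus port, if any) of a URL, dropping path and query.

    Hypothetical helper for illustration; not the library's implementation.
    """
    parts = urlsplit(api_url)
    return f"{parts.scheme}://{parts.netloc}"


# A full endpoint URL and a bare base URL both reduce to the same base.
full = base_url_from_api_url("http://localhost:8000/general/v0/general")
bare = base_url_from_api_url("http://localhost:8000")
```

Both calls yield the same base URL, so callers that already pass a bare base URL are unaffected.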
Christine Straub	9459af435d	Fix: element extraction not working when using "auto" strategy for pdf (#2324 ) Closes #2323. ### Summary - update logic to return "hi_res" if either `extract_images_in_pdf` or `extract_element_types` is set - refactor: remove unused `file` parameter from `determine_pdf_or_image_strategy()` ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", extract_element_types=["Image"], extract_to_payload=True, ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print(image_elements) ```	2023-12-28 22:25:30 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
John	5c0043aa7d	chore: add hi_res_model_name kwarg (#2289 ) Closes #2160 Explicitly adds `hi_res_model_name` as kwarg to relevant functions and notes that `model_name` is to be deprecated. Testing: ``` from unstructured.partition.auto import partition filename = "example-docs/DA-1p.pdf" elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Steve Canny <stcanny@gmail.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-12-22 15:06:54 +00:00
Steve Canny	093a11d058	rfctr(chunking): split oversized chunks on word boundary (#2297 ) The text of an oversized chunk is split on an arbitrary character boundary (mid-word). The `chunk_by_character()` strategy introduces the idea of allowing the user to specify a separator to use for chunk-splitting. For `langchain` this is typically "\n\n", "\n", or " "; blank-line, newline, or word boundaries respectively. Even if the user is allowed to specify a separator, we must provide fall-back for when a chunk contains no such character. This can be done incrementally, like blank-line is preferable to newline, newline is preferable to word, and word is preferable to arbitrary character. Further, there is nothing particular to `chunk_by_character()` in providing such a fall-back text-splitting strategy. It would be preferable for all strategies to split oversized chunks on even-word boundaries for example. Note that while a "blank-line" ("\n\n") may be common in plain text, it is unlikely to appear in the text of an element because it would have been interpreted as an element boundary during partitioning. Add _TextSplitter with basic separator preferences and fall-back and apply it to chunk-splitting for all strategies. The `by_character` chunking strategy may enhance this behavior by adding the option for a user to specify a particular separator suited to their use case.	2023-12-21 05:45:36 +00:00
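The separator-preference-with-fallback idea described in #2297 can be sketched as follows. This is a minimal illustration, not the `_TextSplitter` from the PR: the function names, the default separator order, and the whitespace trimming are all assumptions.

```python
from typing import List, Sequence, Tuple


def split_once(text: str, maxlen: int, separators: Sequence[str]) -> Tuple[str, str]:
    """Split off a head of at most `maxlen` characters.

    Separators are tried in order of preference; if none occurs in the
    window, fall back to an arbitrary character boundary.
    """
    if len(text) <= maxlen:
        return text, ""
    window = text[:maxlen]
    for sep in separators:
        idx = window.rfind(sep)
        if idx > 0:
            return text[:idx].rstrip(), text[idx + len(sep):].lstrip()
    # Fallback: no preferred separator found, so split mid-word.
    return window, text[maxlen:]


def split_text(text: str, maxlen: int, separators: Sequence[str] = ("\n", " ")) -> List[str]:
    """Split `text` into pieces no longer than `maxlen` characters."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    chunks: List[str] = []
    remainder = text
    while remainder:
        head, remainder = split_once(remainder, maxlen, separators)
        if head:
            chunks.append(head)
    return chunks


words = split_text("the quick brown fox jumps", 10)  # splits on spaces
no_sep = split_text("abcdefghij", 4)                 # falls back to mid-word splits
```

Trying newline before space mirrors the preference order in the commit message; a text with no separator at all still terminates because the fallback always consumes `maxlen` characters.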
Andy Li	4ae49419c9	feat: support base64-encoded text in partition_email (#2277 ) closes #816 ## Description Added functionality for `partition_email` to automatically decode base64 text before passing it to `partition_text` or `partition_html`. Also adds base64 encoded email text test cases.	2023-12-19 23:37:17 -08:00
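As a rough illustration of decoding a base64-encoded email body before handing it to a text partitioner, the standard-library `email` package can do the decoding. This sketch is independent of `partition_email`; the sample message and the helper name `decoded_text` are made up for the example.

```python
import base64
from email import message_from_string
from email.message import Message

# Build a tiny message whose body is base64-encoded (sample data for the sketch).
encoded_body = base64.b64encode("Hello, world!".encode("utf-8")).decode("ascii")
raw = (
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    f"{encoded_body}\n"
)


def decoded_text(msg: Message) -> str:
    """Return the body of a non-multipart message as text.

    `get_payload(decode=True)` honors the Content-Transfer-Encoding header,
    so base64 (or quoted-printable) bodies come back as raw bytes.
    """
    payload = msg.get_payload(decode=True)
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


text = decoded_text(message_from_string(raw))
```

The decoded string is what would then be passed on to plain-text or HTML partitioning.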
Christine Straub	a7c3f5f570	Refactor: importation consistency for `partition_pdf()` and `partition_image()` (#2282 ) Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned in issue #2280.	2023-12-15 22:29:58 +00:00
Yao You	5f5ff6319f	fix: consider text in cid code as invalid in hi_res (#2259 ) This PR addresses [CORE-2969](https://unstructured-ai.atlassian.net/browse/CORE-2969) - pdfminer sometimes fails to decode text in a PDF file and returns cid codes as text - such text is now considered invalid and replaced with OCR results in `hi_res` mode ## test This PR adds unit tests for the utility functions. In addition, the file below would return elements with text in cid code on main but proper ASCII text with this PR: [005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf](https://github.com/Unstructured-IO/unstructured/files/13662984/005-CISA-AA22-076-Strengthening-Cybersecurity-p1-p4.pdf) This change improves both cct accuracy and %missing scores: before: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.681 0.267 0.266 105 cct-%missing 0.086 0.159 0.159 105 ``` after: ``` metric average sample_sd population_sd count -------------------------------------------------- cct-accuracy 0.697 0.251 0.250 105 cct-%missing 0.071 0.123 0.122 105 ``` [CORE-2969]: https://unstructured-ai.atlassian.net/browse/CORE-2969?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-14 06:49:23 +00:00
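One way to flag text that pdfminer failed to decode is to look for `(cid:N)` placeholders. The sketch below is not the utility added in #2259; the function names and the 10% threshold are assumptions chosen for illustration.

```python
import re

# pdfminer renders glyphs it cannot map to Unicode as "(cid:<number>)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text: str) -> float:
    """Fraction of characters in `text` that belong to (cid:N) placeholders."""
    if not text:
        return 0.0
    cid_chars = sum(len(match.group(0)) for match in CID_PATTERN.finditer(text))
    return cid_chars / len(text)


def looks_undecoded(text: str, threshold: float = 0.1) -> bool:
    """Treat text as invalid when placeholders exceed `threshold` (an assumed cutoff)."""
    return cid_ratio(text) > threshold


garbled = looks_undecoded("(cid:12)(cid:34) ok")
clean = looks_undecoded("plain ascii text")
```

Text flagged this way would be a candidate for replacement with OCR output, as the commit message describes for `hi_res` mode.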
Austin Walker	d594c06a3e	fix: handle delimiter bug in partition_csv (#2224 ) Closes #2218. When a csv has commas in its content, and the delimiter is something else, Pandas may throw an error. We can sniff the csv and get the correct delimiter to pass to Pandas. To verify, try partitioning the file in the linked bug.	2023-12-13 23:57:46 +00:00
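Delimiter sniffing of the kind described in #2224 is available in the standard library through `csv.Sniffer`. The sketch below shows the idea on in-memory text and is not the code from the PR; the candidate delimiter set and the helper name are assumptions.

```python
import csv
import io


def sniff_delimiter(sample: str, candidates: str = ",;|\t") -> str:
    """Guess the delimiter of CSV text using the standard-library Sniffer."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Semicolon-delimited data whose quoted fields contain commas.
text = 'name;comment\nalice;"hello, world"\nbob;"a, b, c"\n'
delimiter = sniff_delimiter(text)
rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

The detected delimiter can then be passed to the parser (for example as the `sep` argument of `pandas.read_csv`) instead of relying on the comma default.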
Steve Canny	74d089d942	rfctr: skip CheckBox elements during chunking (#2253 ) `CheckBox` elements get special treatment during chunking. `CheckBox` does not derive from `Text` and can contribute no text to a chunk. It is considered "non-combinable" and so is emitted as-is as a chunk of its own. A consequence of this is it breaks an otherwise contiguous chunk into two wherever it occurs. This is problematic, but becomes much more so when overlap is introduced. Each chunk accepts a "tail" text fragment from its preceding element and contributes its own tail fragment to the next chunk. These tails represent the "overlap" between chunks. However, a non-text chunk can neither accept nor provide a tail-fragment and so interrupts the overlap. None of the possible solutions are terrific. Give `Element` a `.text` attribute such that _all_ elements have a `.text` attribute, even though its value is the empty-string for element-types such as CheckBox and PageBreak which inherently have no text. As a consequence, several `cast()` wrappers are no longer required to satisfy strict type-checking. This also allows a `CheckBox` element to be combined with `Text` subtypes during chunking, essentially the same way `PageBreak` is, contributing no text to the chunk. Also, remove the `_NonTextSection` object which previously wrapped a `CheckBox` element during pre-chunking as it is no longer required.	2023-12-13 20:22:25 +00:00
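The interaction between overlap "tails" and text-less elements described in #2253 can be illustrated with plain strings. This is a simplified sketch, not the chunking code from the PR: elements are modeled as strings, an empty string stands in for an element such as `CheckBox`, and the function name is hypothetical.

```python
from typing import List


def with_overlap(texts: List[str], overlap: int) -> List[str]:
    """Prefix each chunk with the last `overlap` characters of the previous one.

    An empty string models an element with no text: it contributes nothing
    and, importantly, does not interrupt the tail carried to the next chunk.
    """
    result: List[str] = []
    tail = ""
    for text in texts:
        if not text:
            continue  # text-less element: keep the current tail
        result.append(f"{tail} {text}" if tail else text)
        tail = text[-overlap:] if overlap > 0 else ""
    return result


chunks = with_overlap(["alpha beta", "", "gamma delta"], overlap=4)
```

Because every element exposes a (possibly empty) text, the loop needs no special wrapper type for non-text elements.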
Yao You	36e4639e05	fix: image may be scaled too large for tesseract (#2252 ) This PR addresses [CORE-2965](https://unstructured-ai.atlassian.net/browse/CORE-2965) by limiting zoom factor so that the scaled image can still be processed by tesseract. - tesseract has a 2^31 byte limit on image data - occasionally an image may be scaled too much and become larger than that size - fix limits the scaling factor so that we never scale an image larger than what tesseract can handle ## test A unit test is added in this PR to test an unlikely case where we'd scale an image a few thousand times and massively exceed the limit without the fix. Unstructured reviewers can also use the document in the ticket to test. [CORE-2965]: https://unstructured-ai.atlassian.net/browse/CORE-2965?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ	2023-12-13 19:35:05 +00:00
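The arithmetic behind capping the zoom factor can be sketched as below. Only the 2^31 byte limit comes from the commit message; the function name, the bytes-per-pixel value, and the way image size is estimated are assumptions for illustration.

```python
import math

TESSERACT_MAX_BYTES = 2**31  # limit cited in the commit message


def capped_zoom(
    width: int,
    height: int,
    zoom: float,
    bytes_per_pixel: int = 4,  # assumed pixel size for this sketch
    max_bytes: int = TESSERACT_MAX_BYTES,
) -> float:
    """Return `zoom`, reduced if needed so the scaled image stays within `max_bytes`.

    Scaled size is estimated as (width * zoom) * (height * zoom) * bytes_per_pixel,
    so the largest admissible zoom is sqrt(max_bytes / (width * height * bytes_per_pixel)).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    max_zoom = math.sqrt(max_bytes / (width * height * bytes_per_pixel))
    return min(zoom, max_zoom)


modest = capped_zoom(1000, 1000, 2.0)      # well under the limit, unchanged
extreme = capped_zoom(1000, 1000, 5000.0)  # reduced to the largest admissible zoom
```

A zoom of a few thousand, as in the unit test the PR mentions, is cut down to the cap while ordinary zoom factors pass through untouched.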
John	d3a404cfb5	pdfminer bug (#2244 ) Closes #2212. ### Summary This PR implements logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the `hi_res` pipeline (discussed in [this slack channel](https://unstructuredw-kbe4326.slack.com/archives/C057R3F8F7A/p1701807299018929)). ### Testing PDF: [NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf](https://github.com/Unstructured-IO/unstructured/files/13554149/NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf) ``` elements = partition_pdf( filename="NASA-SNA-8-D-027III-Rev2-CsmLmSpacecraftOperationalDataBook-Volume3-MassProperties-pg856.pdf", strategy="hi_res", ) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-12-13 00:51:38 +00:00
Christine Straub	da7ac625b1	Feat: save tables in PDF's as images (#2229 ) closes #2222. ### Summary The "table" elements are saved as `table-<pageN>-<tableN>.jpg`. This filename is presented in the `image_path` metadata field for the Table element. The default would be to not do this. ### Testing PDF: [124_PDFsam_Basel III - Finalising post-crisis reforms.pdf](https://github.com/Unstructured-IO/unstructured/files/13591714/124_PDFsam_Basel.III.-.Finalising.post-crisis.reforms.pdf) ``` elements = partition_pdf( filename="124_PDFsam_Basel III - Finalising post-crisis reforms.pdf", strategy="hi_res", infer_table_structure=True, extract_element_types=['Table'], ) ```	2023-12-11 19:14:41 +00:00
Christine Straub	ed76b11b1a	Refactor: support image extraction (#2201 ) ### Summary This PR is the second part of the "image extraction" refactor to move it from unstructured-inference repo to unstructured repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/299. This PR adds logic to support extracting images. ### Testing `git clone -b refactor/remove_image_extraction_code --single-branch https://github.com/Unstructured-IO/unstructured-inference.git && cd unstructured-inference && pip install -e . && cd ../` ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", extract_images_in_pdf=True, ) print("\n\n".join([str(el) for el in elements])) ```	2023-12-05 18:22:29 +00:00
John	8fa5cbf036	build(ci): rm unneeded call to get_api_key in test (#2199 ) Follow-up PR to [https://github.com/Unstructured-IO/unstructured/pull/2195](https://github.com/Unstructured-IO/unstructured/pull/2195). Removes unnecessary calls to `get_api_key()`. That helper function is supposed to only be used for tests decorated by @pytest.mark.skipif(skip_outside_ci, reason="Skipping test run outside of CI") (which are skipped because those tests are partitioning pdf/jpg files). These tests are partitioning emails and rely on the MockResponse at the top of the file, so they don't need to call `get_api_key()` and it can simply be removed from them.	2023-12-03 21:28:05 -08:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
John	e5bdf7fb43	chore: unstructured python client (#2195 ) ### Summary Closes #2033 Updates `partition_via_api` to use `UnstructuredClient` for api calls instead of `requests`. Updates associated tests. Note: This PR does not update `partition_multiple_via_api` as documentation in `unstructured-python-client` indicates it does not support multiple files. A new issue should be opened to add that functionality to `unstructured-python-client`. --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-01 18:49:59 +00:00
Yuming Long	92dae8cd1a	Chore: Repair invalid PDF structure for PDFminer when PSSyntaxError (#2137 ) ### Summary Add a procedure to repair a PDF when the PDF structure is invalid for `PDFminer` to process. This PR handles two cases of `PSSyntaxError Invalid dictionary construct: ...`: * PDFminer opens the entire document and creates the pages generator on `PDFPage.get_pages(fp)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack) * PDFminer's interpreter processes a single page on `interpreter.process_page(page)`: [sentry log example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue) Additional tech details: * Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`, which is used for repairing PDF. * Add new dependency `pypdf` in `requirements/extra-pdf-image.in`, which is used to find the error page from entire document by reading the PDF file again (can't find a way to split pdf in PDFminer). * Refactor the `is null` check for `get_uris_from_annots`, since the root cause is that `get_uris` passed a None `annots` to `get_uris_from_annots`, so the Null check should happen in `get_uris`. * Add more type protection in `get_uris_from_annots` when using any `PDFObjRef.resolve()` as `dict` (it could still be a `PDFObjRef`). This should fix : * https://github.com/Unstructured-IO/unstructured/issues/1922 where `annotation_dict` is a `PDFObjRef` * https://github.com/Unstructured-IO/unstructured/issues/1921 where `rect` is a `PDFObjRef` ### Test Added three test files (both are larger than 500 KB) for unittests to test: * Repair entire doc * Repair one page * Reprocess failure after repairing one page (just return the elements before error page in this case). 
* Also seems like splitting the document into smaller pages could fix this problem, but not sure why. For example, I saw error from reprocess in the whole [cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf) doc, but no error when i split the pdf by error page.... * tested if i can repair the entire doc again in this case, saw other error which means repairing is not helping imo * PDFminer can process the whole doc after pikepdf only repaired the entire doc in the first place, but we can't repair by pages in this way --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-29 19:00:15 +00:00
Yuming Long	6c08c136ae	ci: fix broken API unit test for using unsupported `fast` strategy for images (#2144 ) ### Summary This should fix the broken unit test on main CI * change the strategy in `test_partition_multiple_via_api_valid_request_data_kwargs` from `fast` to `auto`, since the test was using `fast` for images, and we don't support it.	2023-11-22 17:35:04 -08:00
Steve Canny	02e8c962aa	fix(docx): tables in header/footer dropped (#2135 ) A DOCX header or footer is a so-called "story part", meaning that, like the document body (which is also a story part), it can contain both paragraphs and tables. The implementation of `Header.text` and `Footer.text` gathers only the paragraphs. Add a new method to extract all content from a header or footer, including table content, suitable for use as the `.text` attribute of that element. Fixes #2126.	2023-11-22 15:39:25 -08:00
Steve Canny	e6637592d1	fix(docx): Table.text duplicates merged cell text (#2134 ) Summary. The `python-docx` table API is designed for _uniform_ tables (no merged cells, no nested tables). Naive processing of DOCX tables using this API produces duplicate text when the table has merged cells. Add a more sophisticated parsing method that reads only "root" cells (those with an actual `<tc>` element) and skip cells spanned by a merge. In the process, abandon use of the `tabulate` package for this job (which is also designed for uniform tables) and remove the whitespace padding it adds for visual alignment of columns. Separate the text for each cell with a single newline ("\n"). Since it's little extra trouble, add support for nested tables such that their text also contributes to the `Table.text` string. The new `._iter_table_texts()` method will also be used for parsing tables in headers and footers (where they are frequently used for layout purposes) in a closely following PR. Fixes #2106.	2023-11-21 22:22:40 +00:00
Steve Canny	ee9be2a3b2	fix: assorted partition_html() bugs (#2113 ) Addresses a cluster of HTML-related bugs: - empty table is identified as bulleted-table - `partition_html()` emits empty (no text) tables (#1928) - `.text_as_html` contains inappropriate `<br>` elements in invalid locations. - cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928) - `.text_as_html` contains whitespace padding Each of these is addressed in a separate commit below. Fixes #1928. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: scanny <scanny@users.noreply.github.com> Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>	2023-11-20 16:29:32 +00:00
Christine Straub	d623d75d3c	Fix: incorrect figure mapping (#2111 ) Closes #2098.	2023-11-18 00:11:11 +00:00
Steve Canny	a589a494f6	docx: improve page break fidelity (#1631 ) Page breaks can and often do occur within a paragraph. The full text of the paragraph is attributed to the page (number) the paragraph starts on. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the `PageBreak` element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements. This functionality is largely provided upstream by the new `python-docx` v1.0.0 release (1.0.0 from 0.8.11 because it drops Python 2 support). That version also makes obsolete the "include hyperlink text in `Paragraph.text`" monkey patch that we had maintained up to now. Remove that monkey-patch.	2023-11-17 00:09:14 +00:00
Steve Canny	7a741c9ae6	fix(chunk): #1985 mis-splits of Table chunks (#2076 ) Closes #1985 Summary. Due to an interaction of coding errors, HTML text in `TableChunk` splits of a `Table` element were repeating the entire HTML for the table in each chunk. Technical Summary. This behavior was fixed but not published in the last chunking PR of a series. Finish up that PR and submit it all here. This PR extracts chunking to the particular Section type (each has their own distinct chunking behavior).	2023-11-16 16:22:50 +00:00
Steve Canny	41fc55bc12	fix(docx): tabulate output is non-deterministic (#2090 ) The test for nested tables added a few PRs ago indirectly relies on the padding added to table-HTML by `tabulate`. The length of that padding turns out to be non-deterministic, perhaps related to M1 vs. Intel hardware. Remove padding from tabulate output in the test so only actual content is compared.	2023-11-16 07:52:16 +00:00
Christine Straub	e114e5c418	Refactor: partition pdf (#2074 ) ### Summary - add constants for strategies - add `_process_uncategorized_text_elements()` to remove code block duplication ### Testing CI should pass.	2023-11-15 21:41:02 -08:00
Steve Canny	252405c780	Dynamic ElementMetadata implementation (#2043 ) ### Executive Summary The structure of element metadata is currently static, meaning only predefined fields can appear in the metadata. We would like the flexibility for end-users, at their own discretion, to define and use additional metadata fields that make sense for their particular use-case. ### Concepts A key concept for dynamic metadata is _known field_. A known-field is one of those explicitly defined on `ElementMetadata`. Each of these has a type and can be specified when _constructing_ a new `ElementMetadata` instance. This is in contrast to an _end-user defined_ (or _ad-hoc_) metadata field, one not known at "compile" time and added at the discretion of an end-user to suit the purposes of their application. An ad-hoc field can only be added by _assignment_ on an already constructed instance. ### End-user ad-hoc metadata field behaviors An ad-hoc field can be added to an `ElementMetadata` instance by assignment: ```python >>> metadata = ElementMetadata() >>> metadata.coefficient = 0.536 ``` A field added in this way can be accessed by name: ```python >>> metadata.coefficient 0.536 ``` and that field will appear in the JSON/dict for that instance: ```python >>> metadata = ElementMetadata() >>> metadata.coefficient = 0.536 >>> metadata.to_dict() {"coefficient": 0.536} ``` However, accessing a "user-defined" value that has _not_ been assigned on that instance raises `AttributeError`: ```python >>> metadata.coeffcient # -- misspelled "coefficient" -- AttributeError: 'ElementMetadata' object has no attribute 'coeffcient' ``` This makes "tagging" a metadata item with a value very convenient, but entails the proviso that if an end-user wants to add a metadata field to _some_ elements and not others (sparse population), AND they want to access that field by name on ANY element and receive `None` where it has not been assigned, they will need to use an expression like this: ```python 
coefficient = metadata.coefficient if hasattr(metadata, "coefficient") else None ``` ### Implementation Notes - ad-hoc metadata fields are discarded during consolidation (for chunking) because we don't have a consolidation strategy defined for those. We could consider using a default consolidation strategy like `FIRST` or possibly allow a user to register a strategy (although that gets hairy in non-private and multiple-memory-space situations.) - ad-hoc metadata fields cannot start with an underscore. - We have no way to distinguish an ad-hoc field from any "noise" fields that might appear in a JSON/dict loaded using `.from_dict()`, so unlike the original (which only loaded known-fields), we'll rehydrate anything that we find there. - No real type-safety is possible on ad-hoc fields but the type-checker does not complain because the type of all ad-hoc fields is `Any` (which is the best available behavior in my view). - We may want to consider whether end-users should be able to add ad-hoc fields to "sub" metadata objects too, like `DataSourceMetadata` and conceivably `CoordinatesMetadata` (although I'm not immediately seeing a use-case for the second one).	2023-11-15 13:22:15 -08:00
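The sparse-population case described in the commit above can also be handled with Python's built-in `getattr()` and a default value. This is a minimal sketch, not the library's code: `AdHocMetadata` is a hypothetical stand-in for `ElementMetadata` that only mimics assignment of ad-hoc fields, and `read_field` is an illustrative helper name.

```python
from typing import Any, Dict


class AdHocMetadata:
    """Hypothetical stand-in for `ElementMetadata` (NOT the library class).

    It only mimics the behavior described above: fields can be added by
    assignment, and reading an unassigned field raises AttributeError.
    """

    def to_dict(self) -> Dict[str, Any]:
        # Serialize every assigned field that does not start with an underscore.
        return {k: v for k, v in vars(self).items() if not k.startswith("_")}


def read_field(metadata: Any, name: str, default: Any = None) -> Any:
    """Return the named field, or `default` where it was never assigned."""
    # getattr() with a default is equivalent to the hasattr() expression above.
    return getattr(metadata, name, default)


tagged = AdHocMetadata()
tagged.coefficient = 0.536  # ad-hoc field added by assignment
untagged = AdHocMetadata()  # field never assigned on this instance

present = read_field(tagged, "coefficient")
missing = read_field(untagged, "coefficient")
```

Because `getattr()` accepts a default, the same expression works on elements that were tagged and on those that were not.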
Austin Walker	2931cb38e8	fix: handle KeyError: 'N' for certain pdfs (#2072 ) Closes #2059. We've found some pdfs that throw an error in pdfminer. These files use an ICCBased color profile but do not include an expected value `N`. As a workaround, we can wrap pdfminer and drop any colorspace info, since we don't need to render the document. To verify, try to partition the document in the linked issue. ``` elements = partition(filename="google-2023-environmental-report_condensed.pdf", strategy="fast") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2023-11-15 01:59:05 +00:00
Christine Straub	475066ba7c	Fix: fast strategy fallback to ocr only (#2055 ) Closes #2038. ### Summary The `fast` strategy should not fall back to a more expensive strategy. ### Testing For [9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf), the following code should return an empty list. ``` elements = partition(filename=filename, strategy="fast") ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-14 18:46:41 +00:00
John	1ead5a27df	Jj/2011 missing languages metadata (#2037 ) ### Summary Closes #2011 `languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements. ### Testing On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`: ``` filename = "example-docs/english-and-korean.png" from unstructured.partition.image import partition_image elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor']) elements[0].metadata.languages elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor']) elements[0].metadata.languages ``` For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not. ``` filename = "example-docs/korean-text-with-tables.pdf" from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename, strategy="ocr_only", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="fast", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="hi_res", languages=['kor']) elements[0].metadata.languages ``` On this branch, `languages` is included in the metadata regardless of strategy --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2023-11-13 16:47:05 +00:00

333 Commits