unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-10 02:25:55 +00:00

Author	SHA1	Message	Date
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-14 18:11:38 +00:00
Christine Straub	76831f154b	refactor: `partition_pdf()` pass `kwargs` through `fast` strategy pipeline (#3040 ) This PR aims to pass `kwargs` through `fast` strategy pipeline, which was missing as part of the previous PR - https://github.com/Unstructured-IO/unstructured/pull/3030. I also did some code refactoring in this PR, so I recommend reviewing this PR commit by commit. ### Summary - pass `kwargs` through `fast` strategy pipeline, which will allow users to specify additional params like `sort_mode` - refactor: code reorganization - cut a release for `0.14.0` ### Testing CI should pass	2024-05-17 20:55:11 +00:00
John	d829b669e6	Add starting_page_num param to partition_image (#2987 ) Add missing starting_page_num param to partition_image Closes #2985	2024-05-09 21:31:35 +00:00
Filip Knefel	6af6604057	feat: introduce `date_from_file_object` parameter to partitions (#2563 ) Introduce `date_from_file_object` to the `partition*` functions, set to `False` by default. If it is set to `True` and the file is provided via the `file` parameter, the partitioner will attempt to infer the last-modified date from the `file`'s contents; otherwise the last-modified metadata will be set to `None`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-18 01:09:44 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
Yao You	97fb10db4a	fix: default hi_res model rely on inference setting (#2441 ) - there are multiple places setting the default `hi_res_model_name` in both `unstructured` and `unstructured-inference` - they lead to inconsistency and unexpected behaviors - this fix removes a helper in `unstructured` that tries to set the default hi_res layout detection model; instead we rely on the `unstructured-inference` to provide that default when no explicit model name is passed in ## test ```bash UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true ipython ``` ```python from unstructured.partition.auto import partition # find a pdf file elements = partition("foo.pdf", strategy="hi_res") assert elements[0].metadata.detection_origin == "yolox" ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-01-29 16:44:41 +00:00
Matt Robinson	4d5038d9fd	enhancement: add support for bitmap images (#2414 ) ### Summary Adds support for bitmap images (`.bmp`) in both file detection and partitioning. Bitmap images will be processed with `partition_image` just like JPGs and PNGs. ### Testing ```python from unstructured.file_utils.filetype import detect_filetype from unstructured.partition.auto import partition from PIL import Image filename = "example-docs/layout-parser-paper-with-table.jpg" bmp_filename = "~/tmp/layout-parser-paper-with-table.bmp" img = Image.open(filename) img.save(bmp_filename) detect_filetype(filename=bmp_filename) # Should be FileType.BMP elements = partition(filename=bmp_filename) ```	2024-01-17 22:50:36 +00:00
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Christine Straub	9459af435d	Fix: element extraction not working when using "auto" strategy for pdf (#2324 ) Closes #2323. ### Summary - update logic to return "hi_res" if either `extract_images_in_pdf` or `extract_element_types` is set - refactor: remove unused `file` parameter from `determine_pdf_or_image_strategy()` ### Testing ``` from unstructured.documents.elements import ElementType from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", extract_element_types=["Image"], extract_to_payload=True, ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print(image_elements) ```	2023-12-28 22:25:30 +00:00
Christine Straub	dd144456de	Feat: return base64 encoded images for PDF's (#2310 ) Closes #2302. ### Summary - add functionality to get a Base64 encoded string from a PIL image - store base64 encoded image data in two metadata fields: `image_base64` and `image_mime_type` - update the "image element filter" logic to keep all image elements in the output if a user specifies image extraction ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", extract_element_types=["Image", "Table"], extract_to_payload=True, ) ``` or ``` from unstructured.partition.auto import partition elements = partition( filename="example-docs/embedded-images-tables.pdf", strategy="hi_res", pdf_extract_element_types=["Image", "Table"], pdf_extract_to_payload=True, ) ```	2023-12-27 05:39:01 +00:00
John	5c0043aa7d	chore: add hi_res_model_name kwarg (#2289 ) Closes #2160 Explicitly adds `hi_res_model_name` as kwarg to relevant functions and notes that `model_name` is to be deprecated. Testing: ``` from unstructured.partition.auto import partition filename = "example-docs/DA-1p.pdf" elements = partition(filename, strategy="hi_res", hi_res_model_name="yolox") ``` --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: Steve Canny <stcanny@gmail.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: Yao You <yao@unstructured.io> Co-authored-by: Yao You <theyaoyou@gmail.com>	2023-12-22 15:06:54 +00:00
Christine Straub	a7c3f5f570	Refactor: importation consistency for `partition_pdf()` and `partition_image()` (#2282 ) Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned in issue #2280.	2023-12-15 22:29:58 +00:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
Christine Straub	e114e5c418	Refactor: partition pdf (#2074 ) ### Summary - add constants for strategies - add `_process_uncategorized_text_elements()` to remove code block duplication ### Testing CI should pass.	2023-11-15 21:41:02 -08:00
John	1ead5a27df	Jj/2011 missing languages metadata (#2037 ) ### Summary Closes #2011 `languages` was missing from the metadata when partitioning pdfs via `hi_res` and `fast` strategies and missing from image partitions via `hi_res`. This PR adds `languages` to the relevant function calls so it is included in the resulting elements. ### Testing On the main branch, `partition_image` will include `languages` when `strategy='ocr_only'`, but not when `strategy='hi_res'`: ``` filename = "example-docs/english-and-korean.png" from unstructured.partition.image import partition_image elements = partition_image(filename, strategy="ocr_only", languages=['eng', 'kor']) elements[0].metadata.languages elements = partition_image(filename, strategy="hi_res", languages=['eng', 'kor']) elements[0].metadata.languages ``` For `partition_pdf`, `'ocr_only'` will include `languages` in the metadata, but `'fast'` and `'hi_res'` will not. ``` filename = "example-docs/korean-text-with-tables.pdf" from unstructured.partition.pdf import partition_pdf elements = partition_pdf(filename, strategy="ocr_only", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="fast", languages=['kor']) elements[0].metadata.languages elements = partition_pdf(filename, strategy="hi_res", languages=['kor']) elements[0].metadata.languages ``` On this branch, `languages` is included in the metadata regardless of strategy --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2023-11-13 16:47:05 +00:00
Christine Straub	3fe480799a	Fix: missing characters at the beginning of sentences on table ingest output after table OCR refactor (#1961 ) Closes #1875. ### Summary - add functionality to do a second OCR on cropped table images - use `IMAGE_CROP_PAD` env for `individual_blocks` mode ### Testing The test function [`test_partition_pdf_hi_res_ocr_mode_with_table_extraction()`](https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured/partition/pdf_image/test_pdf.py#L425) in `test_pdf.py` should pass. ### NOTE: I've tried to experiment with values for scaling ENVs on the following PRs but found that changes to the values for scaling ENVs affect the entire page OCR output(OCR regression) so switched to doing a second OCR for tables. - https://github.com/Unstructured-IO/unstructured/pull/1998/files - https://github.com/Unstructured-IO/unstructured/pull/2004/files - https://github.com/Unstructured-IO/unstructured/pull/2016/files - https://github.com/Unstructured-IO/unstructured/pull/2029/files --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-09 18:29:55 +00:00
Christine Straub	bb58c1bb0b	Refactor: element type (#2035 ) ### Summary - add constants for element type - replace the `TYPE_TO_TEXT_ELEMENT_MAP` dictionary using the `ElementType` constants - replace element type strings using the constants ### Testing CI should pass.	2023-11-08 21:52:55 -08:00
Christine Straub	1f0c563e0c	refactor: `partition_pdf()` for `ocr_only` strategy (#1811 ) ### Summary Update `ocr_only` strategy in `partition_pdf()`. This PR adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy. - Add functionality to perform OCR region grouping based on the OCR text taken from `pytesseract.image_to_string()` - Add functionality to get layout elements from OCR regions (ocr_layout) for both `tesseract` and `paddle` - Add functionality to determine the `source` of merged text regions when merging text regions in `merge_text_regions()` - Merge multiple test functions related to "ocr_only" strategy into `test_partition_pdf_with_ocr_only_strategy()` - This PR also fixes [issue #1792](https://github.com/Unstructured-IO/unstructured/issues/1792) ### Evaluation ``` # Image PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/double-column-A.jpg ocr_only xy-cut image # PDF PYTHONPATH=. python examples/custom-layout-order/evaluate_natural_reading_order.py example-docs/multi-column-2p.pdf ocr_only xy-cut pdf ``` ### Test - Before update All elements have the same coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/aae0195a-2943-4fa8-bdd8-807f2f09c768) - After update All elements have accurate coordinate data ![multi-column-2p_1_xy-cut](https://github.com/Unstructured-IO/unstructured/assets/9475974/0f6c6202-9e65-4acf-bcd4-ac9dd01ab64a) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-10-30 20:13:29 +00:00
Benjamin Torres	05c3cd1be2	feat: clean pdfminer elements inside tables (#1808 ) This PR introduces `clean_pdfminer_inner_elements` , which deletes pdfminer elements inside other detection origins such as YoloX or detectron. This function returns the clean document. Also, the ingest-test fixtures were updated to reflect the new standard output. The best way to check that this function is working properly is check the new test `test_clean_pdfminer_inner_elements` in `test_unstructured/partition/utils/test_processing_elements.py` --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>	2023-10-30 07:10:51 +00:00
qued	d8241cbcfc	fix: filename missing from image metadata (#1863 ) Closes [#1859](https://github.com/Unstructured-IO/unstructured/issues/1859). * Fixes elements partitioned from an image file missing certain metadata Metadata for image files, like file type, was being handled differently from other file types. This caused a bug where other metadata, like the file name, was being missed. This change brought metadata handling for image files to be more in line with the handling for other file types so that file name and other metadata fields are being captured. Additionally: * Added test to verify filename is being captured in metadata * Cleaned up `CHANGELOG.md` formatting #### Testing: The following produces output `None` on `main`, but outputs the filename `layout-parser-paper-fast.jpg` on this branch: ```python from unstructured.partition.auto import partition elements = partition("example-docs/layout-parser-paper-fast.jpg") print(elements[0].metadata.filename) ```	2023-10-25 05:19:51 +00:00
Yuming Long	01a0e003d9	Chore: stop passing extract_tables to inference and note table regression on entire doc OCR (#1850 ) ### Summary A follow-up to https://github.com/Unstructured-IO/unstructured/pull/1801: I forgot to remove the lines that pass extract_tables to inference. This PR removes them and notes the table regression that occurs if we only do one OCR for the entire doc. Tech details: * stop passing the `extract_tables` parameter to inference * added a table extraction ingest test for images, which was skipped before; the "text_as_html" field contains the OCR output from the table OCR refactor PR * replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to test additional parameters * added `error_margin` as an ENV used when comparing the bounding boxes of `ocr_region` with `table_element` * added more tests for tables and noted the table regression in the test for partition pdf ### Test * To check that the `extract_tables` parameter is no longer passed to inference, run the test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and you will see a warning like `Table OCR from get_tokens method will be deprecated....`, which means it called the table OCR in the inference repo. This branch removes the warning.	2023-10-24 17:13:28 +00:00
Yuming Long	ce40cdc55f	Chore (refactor): support table extraction with pre-computed ocr data (#1801 ) ### Summary Table OCR refactor: move the OCR part of the table model from the inference repo to the unst repo. * Before this PR, the table model extracts OCR tokens with text and bounding boxes and fills the tokens into the table structure in the inference repo. This means we need to do an additional OCR for tables. * After this PR, we use the OCR data from the entire-page OCR and pass the OCR tokens to the inference repo, which means we only do one OCR for the entire document. Tech details: * Combined env `ENTIRE_PAGE_OCR` and `TABLE_OCR` into `OCR_AGENT`; this means we use the same OCR agent for the entire page and for tables, since we only do one OCR. * Bump inference repo to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unst repo. Please check the [PR](https://github.com/Unstructured-IO/unstructured-inference/pull/256). * All notebook lint changes were made by `make tidy` * This PR also fixes this [issue](https://github.com/Unstructured-IO/unstructured/issues/1564); I've added a test for the issue in `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages` * Add the same scaling logic to the image, [similar to the previous Table OCR](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L109C1-L113), but now scaling is applied to the entire image ### Test * Not much to test manually except that table extraction still works * But due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in the table output; here is a comparison of test outputs I found from the same test `test_partition_image_with_table_extraction`: screenshot of the table in `layout-parser-paper-with-table.jpg`: <img width="343" alt="expected" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/278d7665-d212-433d-9a05-872c4502725c"> before refactor: <img width="709" alt="before" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/347fbc3b-f52b-45b5-97e9-6f633eaa0d5e"> after refactor: <img width="705" alt="after" src="https://github.com/Unstructured-IO/unstructured/assets/63475068/b3cbd809-cf67-4e75-945a-5cbd06b33b2d"> ### TODO (added as a ticket) There is still some cleanup to do in the inference repo, since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, here are the items that are deprecated and can be removed: * [`get_tokens`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L77) (already noted in code) * parameter `extract_tables` in inference * [`interpret_table_block`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/inference/layoutelement.py#L88) * [`load_agent`](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/tables.py#L197) * env `TABLE_OCR` ### Note If we want to fall back to an additional table OCR (we may need this to use paddle for tables), we need to: * pass `infer_table_structure` to inference with the `extract_tables` parameter * stop passing `infer_table_structure` to `ocr.py` --------- Co-authored-by: Yao You <yao@unstructured.io>	2023-10-21 00:24:23 +00:00
qued	8100f1e7e2	chore: process chipper hierarchy (#1634 ) PR to support schema changes introduced from [PR 232](https://github.com/Unstructured-IO/unstructured-inference/pull/232) in `unstructured-inference`. Specifically what needs to be supported is: * Change to the way `LayoutElement` from `unstructured-inference` is structured, specifically that this class is no longer a subclass of `Rectangle`, and instead `LayoutElement` has a `bbox` property that captures the location information and a `from_coords` method that allows construction of a `LayoutElement` directly from coordinates. * Removal of `LocationlessLayoutElement` since chipper now exports bounding boxes, and if we need to support elements without bounding boxes, we can make the `bbox` property mentioned above optional. * Getting hierarchy data directly from the inference elements rather than in post-processing * Don't try to reorder elements received from chipper v2, as they should already be ordered. #### Testing: The following demonstrates that the new version of chipper is inferring hierarchy. ```python from unstructured.partition.pdf import partition_pdf elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res", model_name="chipper") children = [el for el in elements if el.metadata.parent_id is not None] print(children) ``` Also verify that running the traditional `hi_res` gives different results: ```python from unstructured.partition.pdf import partition_pdf elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res") ``` --------- Co-authored-by: Sebastian Laverde Alfonso <lavmlk20201@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-10-13 01:28:46 +00:00
Roman Isecke	ebf0722dcc	roman/ingest continue on error (#1736 ) ### Description Add a flag to raise an error on failure; by default, the error is only logged and processing continues with the other docs	2023-10-12 21:33:10 +00:00
Steve Canny	d726963e42	serde tests round-trip through JSON (#1681 ) Each partitioner has a test like `test_partition_x_with_json()`. What these do is serialize the elements produced by the partitioner to JSON, then read them back in from JSON and compare the before and after elements. Because our element equality (`Element.__eq__()`) is shallow, this doesn't tell us a lot, but if we take it one more step, like `List[Element] -> JSON -> List[Element] -> JSON` and then compare the JSON, it gives us some confidence that the serialized elements can be "re-hydrated" without losing any information. This actually showed up a few problems, all in the serialization/deserialization (serde) code that all elements share.	2023-10-12 19:47:55 +00:00
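
The double round-trip idea in the last commit message above (`List[Element] -> JSON -> List[Element] -> JSON`, then compare the JSON) can be illustrated with a small, self-contained sketch. It does not use the library's own classes or serde helpers: `ToyElement`, `to_json`, `from_json`, `lossy_from_json`, and `round_trip_is_lossless` are all hypothetical names invented here for illustration, and the shallow `__eq__` is only an assumption about what "shallow" might look like.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(eq=False)
class ToyElement:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __eq__(self, other: object) -> bool:
        # Deliberately shallow: only the text is compared, metadata is ignored.
        return isinstance(other, ToyElement) and self.text == other.text


def to_json(elements: List[ToyElement]) -> str:
    """Serialize elements to a JSON string with a stable key order."""
    return json.dumps([asdict(e) for e in elements], sort_keys=True)


def from_json(payload: str) -> List[ToyElement]:
    """Faithful deserializer: restores both text and metadata."""
    return [ToyElement(**item) for item in json.loads(payload)]


def lossy_from_json(payload: str) -> List[ToyElement]:
    """Buggy deserializer used for contrast: silently drops metadata."""
    return [ToyElement(text=item["text"]) for item in json.loads(payload)]


def round_trip_is_lossless(
    elements: List[ToyElement],
    deserialize: Callable[[str], List[ToyElement]],
) -> bool:
    """Compare the JSON before and after one deserialize/serialize cycle."""
    first = to_json(elements)
    second = to_json(deserialize(first))
    return first == second


elements = [
    ToyElement("Title", {"page_number": 1}),
    ToyElement("Body text", {"filename": "a.pdf"}),
]

# Shallow equality cannot see the dropped metadata...
print(lossy_from_json(to_json(elements)) == elements)
# ...but comparing the serialized JSON does.
print(round_trip_is_lossless(elements, lossy_from_json))
print(round_trip_is_lossless(elements, from_json))
```

Comparing serialized output rather than element objects sidesteps whatever the equality method ignores, which is the point the commit message makes about shallow element equality.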

25 Commits