unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-03 23:20:35 +00:00

Author	SHA1	Message	Date
Christine Straub	5b0ae3fd8b	Refactor: rename image extraction kwargs (#2303 ) Currently, we're using different kwarg names in partition() and partition_pdf(), which has implications for the API since it goes through partition(). ### Summary - rename `extract_element_types` -> `extract_image_block_types` - rename `image_output_dir_path` to `extract_image_block_output_dir` - rename `extract_to_payload` -> `extract_image_block_to_payload` - rename `pdf_extract_images` -> `extract_images_in_pdf` in `partition.auto` - add unit tests to test element extraction for `pdf/image` via `partition.auto` ### Testing CI should pass.	2024-01-04 17:52:00 +00:00
Christine Straub	9459af435d	Fix: element extraction not working when using "auto" strategy for pdf (#2324 ) Closes #2323. ### Summary - update logic to return "hi_res" if either `extract_images_in_pdf` or `extract_element_types` is set - refactor: remove unused `file` parameter from `determine_pdf_or_image_strategy()` ### Testing ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf( filename="example-docs/embedded-images-tables.pdf", extract_element_types=["Image"], extract_to_payload=True, ) image_elements = [el for el in elements if el.category == ElementType.IMAGE] print(image_elements) ```	2023-12-28 22:25:30 +00:00
Christine Straub	a7c3f5f570	Refactor: importation consistency for `partition_pdf()` and `partition_image()` (#2282 ) Closes #2278. This PR also removes the `extract_tables_in_pdf` mentioned in issue #2280.	2023-12-15 22:29:58 +00:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158 ) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout(DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document(data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout `(DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
Christine Straub	e114e5c418	Refactor: partition pdf (#2074 ) ### Summary - add constants for strategies - add `_process_uncategorized_text_elements()` to remove code block duplication ### Testing CI should pass.	2023-11-15 21:41:02 -08:00
Christine Straub	475066ba7c	Fix: fast strategy fallback to ocr only (#2055 ) Closes #2038. ### Summary The `fast` strategy should not fall back to a more expensive strategy. ### Testing For [9493801-p17.pdf](https://github.com/Unstructured-IO/unstructured/files/13292884/9493801-p17.pdf), the following code should return an empty list. ``` elements = partition(filename=filename, strategy="fast") ``` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2023-11-14 18:46:41 +00:00
shreyanid	a23d75a292	Set default strategy for images to be "hi_res" (#968 ) Set default strategy for images (not PDFs) to be hi_res.	2023-08-02 09:22:20 -07:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
Yuming Long	a611532e3c	Chore: convert fast strategy to ocr_only for images (#735 ) * fall back to ocr only * more note * add test case * maybe remove skipping dockertest for kor ocr? * bump again * clean up flag * empty commit	2023-06-16 10:59:13 -04:00
Matt Robinson	727d366a94	enhancement: auto strategy for PDFs and images (#578 ) * added functions for determining auto stratgy * change default strategy to auto * tests for auto strategy * update docs * changelog and version * bump version * remove ingest file in wrong location * update jpg output * typo fix	2023-05-12 17:45:08 +00:00
Matt Robinson	3d3f3df3ec	enhancement: add "ocr_only" strategy for PDFs (#553 ) * add tests for validating strategy * refactor into determine_pdf_strategy function * refactor pdf strategies into strategies * remove commented out code * remove unreachable code * add in handling for image types * a little more refactoring * import ocr partioning for images * catch warnings, partition type for valid strategies * fallback to ocr_only from fast * fallback logic for hi_res * test for fallback to ocr only * fallback logic ofr ocr_only * more tests for fallback logic * update doc strings * version and changelog * linting, linting, linting * update docs to include notes about strategy * fix typos * change back patched filename	2023-05-08 17:21:24 +00:00

11 Commits