unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-11 07:57:21 +00:00

Author	SHA1	Message	Date
Steve Canny	30e5a0cd4e	rfctr(docx): organize docx tests (#3070 ) Summary In preparation for adding DOCX pluggable image extraction, organize a few of the DOCX tests to be parallel to very similar tests for the DOC and ODT partitioners.	2024-05-21 22:11:46 +00:00
Christine Straub	b0d8a779da	feat: `partition_pdf()` set inferred elements text (#3061 ) This PR adds the ability to fill inferred elements text from embedded text (`pdfminer`) without depending on `unstructured-inference` library. This PR is the second part of moving embedded text related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/349.	2024-05-21 19:43:38 +00:00
Matt Robinson	acda4d0707	fix: set `skip_infer_tables` explicitly in `test_partition_via_api_with_no_strategy` (#3057 ) ### Summary A `partition_via_api` test that only runs on `main` was [failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959) with the following output, likely due to the change in the default behavior for `skip_infer_table_types`. This PR explicitly sets the `skip_infer_table_types` param to avoid the failure.. ```python =========================== short test summary info ============================ FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' + where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text + and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text = 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) = make: *** [Makefile:302: test] Error 1 ``` ### Testing After temporarily removing the "skip if not on `main`" `pytest` mark, the [unit tests pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O) on the feature branch.	2024-05-20 19:05:13 -04:00
Christine Straub	76831f154b	refactor: `partition_pdf()` pass `kwargs` through `fast` strategy pipeline (#3040 ) This PR aims to pass `kwargs` through `fast` strategy pipeline, which was missing as part of the previous PR - https://github.com/Unstructured-IO/unstructured/pull/3030. I also did some code refactoring in this PR, so I recommend reviewing this PR commit by commit. ### Summary - pass `kwargs` through `fast` strategy pipeline, which will allow users to specify additional params like `sort_mode` - refactor: code reorganization - cut a release for `0.14.0` ### Testing CI should pass	2024-05-17 20:55:11 +00:00
amadeusz-ds	1c8b2b23eb	feat: add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR config parameters (#3014 ) This PR introduces GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR controlling where temporary files are stored during partition flow, via tempfile.tempdir. #### Edit: Renamed prefixes from STORAGE_ to UNSTRUCTURED_CACHE_ #### Edit 2: Renamed prefixes from UNSTRUCTURED_CACHE to GLOBAL_WORKING_DIR_	2024-05-17 19:16:10 +00:00
Matt Robinson	ec987dcbb2	BREAKING CHANGE: revert table extraction off by default for PDFs and images (#3035 ) ### Summary Closes #3021 . Turns table extraction for PDFs and images off by default. The default behavior originally changed in #2588 . The reason for reversion is that some users did not realize turning off table extraction was an option and experience long processing times for PDFs and images with the new default behavior. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-05-17 15:28:11 +00:00
Steve Canny	f320889b4f	feat(docx): add strategy parameter to DOC and ODT (#3042 ) Summary Because DOCX now supports the `strategy` argument to control aspects of image extraction, `partition_doc()` and `partition_odt()` will need to support it too, because they delegate partitioning to `partition_docx()`. This will allow image extraction to work the same way for those two additional document-types.	2024-05-16 22:14:02 +00:00
Christine Straub	1fb0fe5cf5	enhancement: `partition_pdf()` skip unnecessary element sorting (#3030 ) This PR aims to skip element sorting when determining whether embedded text can be extracted. The extracted elements in this step are returned as final elements only for the `fast` strategy pipeline and are never used for other strategy pipelines (`hi_res`, `ocr`). Removing element sorting in this step and adding it to the `fast` strategy pipeline later will improve performance and reduce execution time. ### Summary - skip element sorting when determining whether embedded text can be extracted. - add `_partition_pdf_with_pdfparser()` function for `fast` strategy pipeline ### Testing CI should pass.	2024-05-16 06:02:56 +00:00
Steve Canny	aeca8bef88	rfctr(odt): organize and improve test_odt.py (#3031 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_odt()` test suite: - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove a couple duplicated tests	2024-05-16 01:04:06 +00:00
Matt Robinson	612905e311	build: wolfi base image for Dockerfile (#3016 ) ### Summary Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing. The job will fail if there are high or critical vulnerabilities. ### Testing Run `make docker-run-dev` and then `python3.11` once you're in. At that point, you can try: ```python from unstructured.partition.auto import partition elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"]) elements ``` Stop the container once you're done.	2024-05-15 22:53:15 +00:00
Steve Canny	094e3542cb	feat(docx): add strategy parameter to partition_docx() (#3026 ) Summary The behavior of an image sub-partitioner can be partially determined by the partitioning strategy, for example whether it is "hi_res" or "fast". Add this parameter to `partition_docx()` so it can pass it along to `DocxPartitionerOptions` which will make it available to any image sub-partitioners.	2024-05-15 21:05:32 +00:00
Steve Canny	a164b01c7e	rfctr(doc): spruce up test_doc.py (#3024 ) Summary In preparation for adding more tests related to image extraction, improve the `partition_doc()` test suite: - Remove redundant DOCX -> DOC file conversions on most tests. - Add type annotations to type-check clean on strict mode. - Improve test names. - Simplify tests where possible. - Remove one duplicated test Speed was roughly doubled: 24 tests in 20s -> 23 tests in 8s.	2024-05-15 18:32:51 +00:00
Steve Canny	12b30d2810	rfctr(docx): extract DocxPartitionerOptions (#3018 ) Reviewers: Probably easier to review first and second commits separately as the first one adds all the new code and tests (without installing it), and the second one installs it into the partitioner along with the required changes to code and tests. Summary Enable communication of partitioning options to sub-partitioners, in particular to the pluggable `PicturePartitioner` coming in a closely subsequent PR to implement image-extraction and OCR for DOCX, DOC, and ODT formats. Additional Context In general, validation of partitioning options as well as assigning default values and computing derived partitioning settings can be extracted from partitioners into a neatly encapsulated separate object. This simplifies the core partitioning code by removing the noise associated with computing metadata values and deciding how to access the source document, etc. However, better factoring aside, having the partition-time "settings" available in a single object allows partitioning of certain document features, for example images, to be readily _delegated_ to a sub-partitioner while still giving it access to all the relevant partitioning settings for the current document. This is particularly important when a sub-partitioner is "pluggable" at runtime and must rely on a clearly-defined (and simple as possible) interface to operate smoothly.	2024-05-15 00:50:31 +00:00
Steve Canny	db186dc23b	rfctr(doc): organize test_doc.py (#3017 ) Summary Organize DOC tests into related groups with markers. This makes it easier to assess coverage and find tests related to particular behaviors. This is in preparation for adding tests related to DOC image extraction. No code changes, purely line-block moves. - Move module-level fixtures to the bottom. - Organize tests into related groups with markers.	2024-05-14 20:57:31 +00:00
Steve Canny	b4a6009c09	rfctr(docx): improve typing etc. in prep for docx image extraction (#3015 ) Summary Noisy but trivial changes to `partition_docx()` environs and tests in preparation for DOCX image extraction. These changes are extracted here so they don't distract on the changes of substance to follow in the next PR.	2024-05-14 19:32:17 +00:00
Steve Canny	3f8e6b79c5	rfctr(docx): move docx unit tests to bottom (#3011 ) No code changes, strictly this single block move. Move `Describe_DocxPartitioner` unit-test class to bottom so `DescribeDocxPartitionerOptions` unit-test to follow in subsequent commit will be together with it. Integration tests first, then unit tests, for consistency with other test modules e.g. test_pptx. I added `Describe_DocxPartitioner` soon after I arrived, before we adopted the convention of placing unit-tests after integration tests. Move this so we can maintain that consistency with the block of tests to follow in a closely subsequent PR.	2024-05-13 22:05:12 +00:00
Steve Canny	e4c895923d	fix(csv): partition_csv() raises on long lines (#2998 ) Summary The CSV delimiter-sniffer requires whole lines to properly detect the delimiter character. Limiting bytes read produced partial lines when lines were very long. Limit bytes but read whole lines. Fixes #2643.	2024-05-10 21:19:31 +00:00
John	593aa47802	fix: ppt parameters include_page_breaks and include_slide_notes (#2996 ) Pass the parameters `include_slide_notes` and `include_page_breaks` to `partition_pptx` from `partition_ppt`. Also update the .ppt example doc we use for testing so it has slide notes and a PageBreak (and second page)	2024-05-10 17:57:36 +00:00
John	d829b669e6	Add starting_page_num param to partition_image (#2987 ) Add missing starting_page_num param to partition_image Closes #2985	2024-05-09 21:31:35 +00:00
John	ef47d530f6	feat: add chunking to partition_tsv (#2982 ) Closes #2980	2024-05-07 23:09:27 +00:00
Christine Straub	0cd07d78f9	feat: `partition_pdf()` add ability to get `cid` ratio (#2970 ) This PR adds the ability to get the ratio of `cid` characters in embedded text extracted by `pdfminer`. This PR is the second part of moving `cid` related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/342.	2024-05-04 05:21:27 +00:00
Steve Canny	cb55245f70	rfctr: extract OCRAgent.get_agent() out of PDF subtree (#2965 ) Summary File-types other than PDF need to use OCR on extracted images. Extract `OCRAgent.get_agent()` such that any file-type partitioner can use it without risking dependency on PDF-only extras.	2024-05-03 19:39:22 +00:00
Steve Canny	39b74a2370	fix(test): Remedy macOS-only test failure not triggered by CI (#2957 ) Summary A crude and OS-specific mechanism was used to detect when a path represented a temp-file. Change that to be robust across operating systems and localized configurations. The specific problem was for DOC files but this PR fixes it for PPT too which was prone to the same problem.	2024-05-02 18:21:18 +00:00
Steve Canny	7dea2fa4a1	rfctr: tidy up ppt+doc tests (#2956 ) Summary Make tests for DOC and PPT formats more concise and readable in preparation for adding one or two.	2024-05-02 16:00:00 +00:00
Steve Canny	601594d373	fix(docx): fix short-row DOCX table (#2943 ) Summary The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, using it in practice is relatively rare. However, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate `.text` and `.metadata.text_as_html` for these tables.	2024-05-02 00:45:52 +00:00
Yuming Long	542d442699	chore CORE-4775: remove html page number metadata field (#2942 ) ### Summary Rip off page_number metadata fields until we have page counting for all kinds of html files (not just limited to news articles with multiple `<article>` tag) ### Test Unit tests `test_add_chunking_strategy_on_partition_html_respects_multipage` and `test_add_chunking_strategy_title_on_partition_auto_respects_multipage` removed since they rely on the `page_number` fields from the SEC html file - now test moved to mock test for chunk_by_title -> revisit those tests when we find test file for this Also changed the element ids from partition outputs for html files - element id change due to page number change (in element id hashing) -> todo ticket: update other deterministic element id tests per crag's comment --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2024-04-30 15:20:26 +00:00
Marco Lüthy	0d80886578	fix: parse URL response Content-Type according to RFC 9110 (#2950 ) Currently, `file_and_type_from_url()` does not correctly handle the `Content-Type` header. Specifically, it assumes that the header contains only the mime-type (e.g. `text/html`), however, [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows for additional directives — specifically the `charset` — to be returned in the header. This leads to a `ValueError` when loading a URL with a response Content-Type header such as `text/html; charset=UTF-8`. To reproduce the issue: ```python from unstructured.partition.auto import partition url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/" partition(url=url) ``` Which will result in the following exception: ```python { "name": "ValueError", "message": "Invalid file. The FileType.UNK file type is not supported in partition.", "stack": "--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[1], line 4 1 from unstructured.partition.auto import partition 3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\" ----> 4 partition(url=url) File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs) 539 else: 540 msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\" --> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\") 543 for element in elements: 544 element.metadata.url = url ValueError: Invalid file. The FileType.UNK file type is not supported in partition." } ``` This PR fixes the issue by parsing the mime-type out of the `Content-Type` header string. Closes #2257	2024-04-29 22:53:44 -07:00
Michał Martyniak	7720e72424	Fix: avoid elements sharing the same memory address (#2940 ) This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how hashes or parent IDs are updated. When `assign_and_map_hash_ids()` is called and elements (or elements' metadata) do not have unique memory addresses, then updating the parent_id of one element will also overwrite the parent_id of some other element. --------- Co-authored-by: cragwolfe <crag@unstructured.io>	2024-04-28 19:15:17 -07:00
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more tests for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
Dimitri Lozeve	abb0174181	Integration with the Google Cloud Vision API (#2902 ) This PR adds a third OCR provider, alongside Tesseract and Paddle: the [Google Cloud Vision API](https://cloud.google.com/vision). It can be used similarly to other OCR methods: set the `OCR_AGENT` environment variable to the path to the OCR module (`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`). You also need to set the credentials to use Google APIs, for instance by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-04-23 21:11:39 +00:00
Steve Canny	05ff975081	fix: remove unused `ElementMetadata.section` (#2921 ) Summary The `.section` field in `ElementMetadata` is dead code, possibly a remainder from a prior iteration of `partition_epub()`. In any case, it is not populated by any partitioner. Remove it and any code that uses it.	2024-04-22 23:58:17 +00:00
Steve Canny	4dc8327149	rfctr(pptx): make PptxPartitionerOptions public (#2901 ) Summary A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because now client code that implements a custom Picture-shape sub-partitioner will need to reference this class.	2024-04-19 04:50:06 +00:00
Christine Straub	ac5048bf30	enhancement: remove duplicate embedded images (#2897 ) This PR aims to remove duplicate embedded images taken by `PDFminer`. ### Summary - add `clean_pdfminer_duplicate_image_elements()` to remove embedded images with similar `bboxes` and the same `text` - add env_config `EMBEDDED_IMAGE_SAME_REGION_THRESHOLD` to consider the bounding boxes of two embedded images as the same region - refactor: reorganize `clean_pdfminer_inner_elements()`	2024-04-18 23:07:47 +00:00
Michał Martyniak	cb1e91058e	Introduce `start_page` argument to partitioning functions that assign `element.metadata.page_number` (#2884 ) This small change will be useful for users who partition only fragments of their PDF documents. It's a small step towards addressing this issue: https://github.com/Unstructured-IO/unstructured/issues/2461 Related PRs: * https://github.com/Unstructured-IO/unstructured/pull/2842 * https://github.com/Unstructured-IO/unstructured/pull/2673	2024-04-15 21:03:42 +00:00
MiXiBo	0506aff788	add support for `start_index` in `html` links extraction (#2600 ) add support for start_index in html links extraction (closes #2625) Testing ``` from unstructured.partition.html import partition_html from unstructured.staging.base import elements_to_json html_text = """<html> <p>Hello there I am a <a href="/link">very important link!</a></p> <p>Here is a list of my favorite things</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li> <li>Dogs</li> </ul> <a href="/loner">A lone link!</a> </html>""" elements = partition_html(text=html_text) print(elements_to_json(elements)) ``` --------- Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com> Co-authored-by: christinestraub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-04-12 06:14:20 +00:00
Steve Canny	3e643c4cb3	feat(pptx): add pluggable PPTX Picture sub-partitioner (#2880 ) Summary Delegate partitioning of PPTX Picture (image, to a first approximation) shapes to a distinct sub-partitioner and allow the default picture sub-partitioner to be replaced at run-time by one of the user's choosing.	2024-04-12 06:00:01 +00:00
Steve Canny	2cba949f18	feat(pptx): partition_pptx() accepts strategy arg (#2879 ) Summary As we move to adding pluggable sub-partitioners, `partition_pptx()` will need to become sensitive to the `strategy` argument, in particular when it is set to "hi_res". Up until now there were no expensive operations (inference, OCR, etc.) incurred while partitioning PPTX so this argument was ignored. After this PR, `partition_pptx()` still won't do anything with that value, other than pass it along to `_PptxPartitionerOptions` for safe-keeping, but now it's ready for use by a `PicturePartitioner` (to come in a subsequent PR).	2024-04-11 22:36:16 +00:00
Christine Straub	4656b8cbe5	Fix: `partition_html()` partially extracts text (#2852 ) Closes #2362. Previously, when an HTML contained a `div` with a nested tag e.g. a `<b>` or `<span>`, the element created from the `div` contained only the text up to the inline element. This PR adds support for extracting text from tag tails in HTML. ### Testing ``` html_text = """ <html> <body> <div> the Company issues shares at $<div style="display:inline;"><span>5.22</span></div> per share. There is more text </div> </body> </html> """ elements = partition_html(text=html_text) print(''.join([str(el).strip() for el in elements])) ``` Expected behavior ``` the Company issues shares at $5.22per share. There is more text ```	2024-04-08 19:18:55 +00:00
Steve Canny	2c7e0289aa	rfctr(pptx): extract _PptxPartitionerOptions (#2853 ) Reviewers: Likely quicker to review commit-by-commit. Summary In preparation for adding a PPTX `Picture` shape _sub-partitioner_, extract management of PPTX partitioning-run options to a separate `_PptxPartitionerOptions` object similar to those used in chunking and XLSX partitioning. This provides several benefits: - Extract code dealing with applying defaults and computing derived values from the main partitioning code, leaving it less cluttered and focused on the partitioning algorithm itself. - Allow the options set to be passed to helper objects, prominently including sub-partitioners, without requiring a long list of parameters or requiring the caller to couple itself to the particular option values the helper object requires. - Allow options behaviors to be thoroughly and efficiently tested in isolation.	2024-04-08 19:01:03 +00:00
Christine Straub	a9b6506724	Fix: `partition_html()` fails parsing simple html (#2849 ) Closes #2520. Previously, `partition_html()` did not extract text from `<b>` tags inside container tags (like `<div>`, `<pre>`). This PR provides support for extracting text from `<b>` tags inside container tags. ### Testing ``` html_text = """ <!DOCTYPE html> <html> <head> <title>A page</title> </head> <body> <div> <h1>Header 1</h1> <p>Text </p> <h2>Header 2</h2> <pre><b>Param1</b> = Y<br><b>Param2</b> = 1<br><b>Param3</b> = 2<br><b>Param4</b> = A <br><b>Param5</b> = A,B,C,D,E<br><b>Param6</b> = 7<br><b>Param7</b> = Five<br></pre> </div> </body> </html> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ``` Expected behavior ``` Header 1 Text Header 2 Param1 = Y Param2 = 1 Param3 = 2 Param4 = A Param5 = A,B,C,D,E Param6 = 7 Param7 = Five ```	2024-04-08 18:09:41 +00:00
Pawel Kmiecik	63fc2a1061	feat: element types extension (#2700 ) This PR adds some new element types that can be used especially by pdf/image partition.	2024-04-04 07:49:55 +00:00
Steve Canny	1ce60f2bba	rfctr(xlsx): extract _XlsxPartitionerOptions (#2838 ) Summary As an initial step in reducing the complexity of the monolithic `partition_xlsx()` function, extract all argument-handling to a separate `_XlsxPartitionerOptions` object which can be fully covered by isolated unit tests. Additional Context This code was from a prior XLSX bug-fix branch that did not get committed because of time constraints. I wanted to revisit it here because I need the benefits of this as part of some new work on PPTX that will require a separate options object that can be passed to delegate objects. This approach was incubated in the chunking context and has produced a lot of opportunities there to decompose the logic into smaller components that are more understandable and isolated-test-able, without having to pass an extended list of option values in every sub-call. As well as decluttering the code, this removes coupling where the caller needs to know which options a subroutine might need to reference.	2024-04-03 23:27:33 +00:00
Christine Straub	887e6c9094	refactor: use env_config instead of `SUBREGION_THRESHOLD_FOR_OCR` constant (#2697 ) The purpose of this PR is to introduce a new env_config for the subregion threshold for OCR. ### Testing CI should pass.	2024-03-28 20:28:35 +00:00
Christine Straub	08fafc564f	Fix: embedded text not getting merged with inferred elements (#2679 ) This PR is the second part of fixing "embedded text not getting merged with inferred elements", the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/331. ### Summary - replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when removing pdfminer (embedded) elements that were merged with inferred elements - use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD` introduced in the [first part](https://github.com/Unstructured-IO/unstructured-inference/pull/331) when removing pdfminer (embedded) elements that were merged with inferred elements - bump `unstructured-inference` to 0.7.25 ### Testing PDF: [pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf) ``` $ pip uninstall unstructured-inference -y $ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference $ pip install -e . ``` ``` elements = partition_pdf( filename="pwc-financial-statements-p114.pdf", strategy="hi_res", infer_table_structure=True, extract_image_block_types=["Image"], ) table_elements = [el for el in elements if el.category == "Table"] print(table_elements[0].text) ``` --------- Co-authored-by: Antonio Jose Jimeno Yepes <antonio.jimeno@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-03-23 03:59:23 +00:00
Filip Knefel	bdfd975115	chore: change table extraction defaults (#2588 ) Change default values for table extraction - works in pair with [this](https://github.com/Unstructured-IO/unstructured-api/pull/370) `unstructured-api` PR We want to move away from `pdf_infer_table_structure` parameter, in this PR: - We change how it's treated wrt `skip_infer_table_types` parameter. Whether to extract tables from pdf now follows from the rule: `pdf_infer_table_structure && "pdf" not in skip_infer_table_types` - We set it to `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` by default - We remove it from the examples in documentation - We describe it as deprecated in favor of `skip_infer_table_types` in documentation More detailed description of how we want parameters to interact - if `pdf_infer_table_structure` is False tables will never be extracted from pdf - if `pdf_infer_table_structure` is True tables will be extracted from pdf unless it's skipped via `skip_infer_table_types` - on default `pdf_infer_table_structure=True` and `skip_infer_table_types=[]` --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ds-filipknefel <ds-filipknefel@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-22 10:08:49 +00:00
Steve Canny	31bef433ad	rfctr: prepare to add orig_elements serde (#2668 ) Summary The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.	2024-03-20 21:27:59 +00:00
Filip Knefel	6af6604057	feat: introduce `date_from_file_object` parameter to partitions (#2563 ) Introduce `date_from_file_object` to `partition*` functions, by default set to `False`. If set to `True` and file is provided via `file` parameter, partition will attempt to infer last modified date from `file`'s contents otherwise last modified metadata will be set to `None`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2024-03-18 01:09:44 +00:00
Steve Canny	b27ad9b6aa	fix: raises on file-like object with .name not a valid path (#2614 ) Summary Fixes: #2308 Additional context Through a somewhat deep call-chain, partitioning a file-like object (e.g. io.BytesIO) having its `.name` attribute set to a path not pointing to an actual file on the local filesystem would raise `FileNotFoundError` when the last-modified date was being computed for the document. This scenario is a legitimate partitioning call, where `file.name` is used downstream to describe the source of, for example, a bytes payload downloaded from the network. Fix - explicitly check for the existence of a file at the given path before accessing it to get its modified date. Return `None` (already a legitimate return value) when no such file exists. - Generally clean up the implementations. - Add unit tests that exercise all cases. --------- Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>	2024-03-07 19:02:04 +00:00
Steve Canny	b59e4b69ce	rfctr: prepare for fix to raises on file-like-object with name not a path to a file (#2617 ) Summary Improve typing and other mechanical refactoring in preparation for fix to issue 2308.	2024-03-06 23:46:54 +00:00
Christine Straub	ee8b0f93dc	feat: pass list type parameters via client sdk (#2567 ) The purpose of this PR is to support using the same type of parameters as `partition_()` when using `partition_via_api()`. This PR works together with `unstructured-api` [PR #368](https://github.com/Unstructured-IO/unstructured-api/pull/368). Note: This PR will support extracting image blocks ("Image", "Table") via partition_via_api(). ### Summary - update `partition_via_api()` to convert all list type parameters to JSON formatted strings before passing them to the unstructured client SDK - add a unit test function to test extracting image blocks via `partition_via_api()` - add a unit test function to test list type parameters passed to API via unstructured client sdk ### Testing ``` from unstructured.partition.api import partition_via_api elements = partition_via_api( filename="example-docs/embedded-images-tables.pdf", api_key="YOUR-API-KEY", strategy="hi_res", extract_image_block_types=["image", "table"], ) image_block_elements = [el for el in elements if el.category == "Image" or el.category == "Table"] print("\n\n".join([el.metadata.image_mime_type for el in image_block_elements])) print("\n\n".join([el.metadata.image_base64 for el in image_block_elements])) ```	2024-02-26 19:17:06 +00:00
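
The list-to-JSON conversion described in the `partition_via_api()` entry (#2567) might look roughly like the sketch below. This is an illustration only and not the library's actual code: `serialize_list_params` is a hypothetical helper name, and the assumption is that only top-level list values need encoding while everything else passes through unchanged.

```python
import json
from typing import Any


def serialize_list_params(params: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `params` with each list value encoded as a JSON string.

    Hypothetical helper for illustration; not part of `unstructured`.
    """
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in params.items()
    }


# Example: only the list-valued parameter is converted.
request_params = serialize_list_params(
    {"strategy": "hi_res", "extract_image_block_types": ["Image", "Table"]}
)
print(request_params)
```

Encoding lists as JSON strings keeps every parameter a flat scalar, which is convenient when the values are sent as form fields; the receiving side would decode them with `json.loads`.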


392 Commits