unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-16 03:48:33 +00:00

Author	SHA1	Message	Date
Matt Robinson	cf32672bc5	build(deps): bumps for 2024-09-09 (#3608 ) ### Summary Dependency bumps for 2024-09-09.	2024-09-09 16:45:18 +00:00
cragwolfe	3bb0ee1e79	chore: fix tests breaking on main (#3603 ) Fix API tests (really more like integration tests) that run only on main. Also use less compute-intensive files to speed up test time and remove a useless test. Tests in `test_unstructured/partition/test_api.py` pass, temporarily running outside of main per screenshot: ![image](https://github.com/user-attachments/assets/f15d440a-2574-40f2-98b4-adf57fbae704) https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513	2024-09-08 21:25:52 +00:00
Matt Robinson	c060467018	build(deps): bump cryptography version (#3599 ) ### Summary Bumps to the latest version of the `cryptography` library to address `GHSA-h4gh-qq45-vh27`.	2024-09-05 19:06:43 +00:00
Pawel Kmiecik	f25eb60585	fix: expose drawing options as function params rather than env config (#3598 ) This PR: - changes the interface of analysis tools to expose drawing params as function parameters rather than env_config (=environmental variables) - restructures analysis package	2024-09-05 15:51:43 +00:00
Christine Straub	acd070c5d5	feat: enhance `pdfminer` element cleanup (#3593 ) This PR aims to expand removal of `pdfminer` elements to include those inside all `non-pdfminer` elements, not just `tables`. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-09-04 12:02:50 +00:00
Yao You	d51fb134e6	Feat/improve iou speed (#3582 ) This PR vectorizes the computation of element overlap to speed up the deduplication process of extracted elements. ## test This PR adds unit tests for the new vectorized IOU and subregion computation functions. In addition, running partition on large files with many elements like this slide: [002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf) shows a reduction of runtime from around 15min on the main branch to less than 4min with this branch. Profiling results show that the new implementation greatly reduces the time cost of computation and now most of the time is spent on getting the coordinates from a list of bboxes. ![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)	2024-09-03 00:06:18 +00:00
Pawel Kmiecik	404f780bbb	feat: make analysis drawing more flexible (#3574 ) This PR changes the way the analysis tools can be used: - by default if `analysis` is set to `True` in `partition_pdf` and the strategy is resolved to `hi_res`: - for each file 4 layout dumps are produced and saved as JSON files (`object_detection`, `extracted`, `ocr`, `final`) - similar way to the current `object_detection` dump - the drawing functions/classes now accept these dumps accordingly instead of the internal class instances (like `TextRegion`, `DocumentLayout`) - it makes it possible to use the lightweight JSON files to render the bboxes of a given file after the partition is done - `_partition_pdf_or_image_local` has been refactored and most of the analysis code is now encapsulated in the `save_analysis_artifiacts` function - to do this, a helper function `render_bboxes_for_file` is added <img width="338" alt="Screenshot 2024-08-28 at 14 37 56" src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">	2024-09-02 11:06:11 +00:00
Matt Robinson	04322d1632	build(deps): removed unnecessary jupyter deps (#3583 ) ### Summary Removes unnecessary `jupyter` and `ipython` dev dependencies to reduce CVE surface area.	2024-08-31 05:21:40 +00:00
Matt Robinson	6ba8135bf9	fix: check ole storage content to differentiate filetypes (#3581 ) ### Summary Updates the file detection logic for OLE files to check the storage content of the file to more reliably differentiate between DOC, PPT, XLS and MSG files. This corrects a bug that caused file type detection to be incorrect in cases where the `filetype` library guessed an incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file. As part of this work, the `"msg"` extra was removed because the `python-oxmsg` package is now a base dependency. ### Testing Using a test `.msg` file that returns `'application/vnd.ms-excel'` from `filetype.guess_mime`. ```python from unstructured.file_utils.filetype import detect_filetype filename = "test-file.msg" detect_filetype(filename=filename) # result should be FileType.MSG ``` 0.15.9	2024-08-30 15:12:46 -04:00
John	ddb6cb631d	chore: remove minimum version pins for pins older than 6 mo (#3577 ) Remove a number of pins in `requirements/deps/constraints.txt` and `make pip-compile`	2024-08-29 15:35:14 +00:00
Austin Walker	f440eb476c	feat: Support encoding parameter in partition_csv (#3564 ) See added test file. Added support for the encoding parameter, which can be passed directly to `pd.read_csv`.	2024-08-28 14:19:58 +00:00
John	f21c853ade	bug: fix file_conversion disk leak (#3562 ) Fix disk space leaks and Windows errors when accessing file.name on a NamedTemporaryFile Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of `file.name` of NamedTemporaryFile have been replaced with TemporaryFileDirectory to avoid a known issue: - https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile - https://github.com/Unstructured-IO/unstructured/issues/3390 The first 7 commits each address an individual occurrence of the issue if reviewers want to review commit-by-commit.	2024-08-27 22:02:24 +00:00
Matt Robinson	4194a07d12	build(deps): replace pillow-heif with pi-heif (#3571 ) ### Summary Closes #2664 and replaces `pillow-heif` with `pi-heif` due to more permissive licensing on the binary wheel for `pi-heif`. 0.15.8	2024-08-27 11:54:35 -04:00
David Potter	ddba928344	Potter/mixedbread embedder (#3513 ) Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai embedder!	2024-08-27 14:52:13 +00:00
Christine Straub	affd997c39	refactor(ci): remove unused environment variables (#3568 ) This PR removes the unused env `TABLE_OCR` from CI.	2024-08-26 19:19:58 +00:00
Matt Robinson	09d84bc46b	build(deps): version bumps for 2024-08-26 (#3567 ) ### Summary Version bumps for 2024-08-26.	2024-08-26 15:15:25 -04:00
Christine Straub	ac10ba4fc1	build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561 ) ### Summary - Bump `unstructured.paddleocr` to 2.8.1.0 - Remove `opencv-python` and `opencv-contrib-python` constraint pins - Fix `0.15.7` changelog	2024-08-23 14:17:29 -07:00
Steve Canny	32bb77aafb	fix(file): no default OLE subtype (#3516 ) Summary Do not assume MSG format when an OLE "container" file cannot be differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based identification in that case. Additional Context DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly, a Microsoft-proprietary Zip format which "contains" a filesystem of discrete files and directories. An OLE "container" is easily identified by inspecting the first 8 bytes of the file, so all we need to do is differentiate between the four subtypes we can process. The `filetype` module does a good job of this but it is not perfect and does not identify MSG files. Previously we assumed MSG format when none of DOC, PPT, or XLS was detected, but we discovered that `filetype` is not completely reliable at detecting these types. Change the behavior to remove the assumption of MSG format. `_OleFileDifferentiator` returns `None` in this case and filetype detection falls back to use filename-extension. Note a file with no filename and no metadata_filename or an incorrect extension will not be correctly identified in this case, however we're assuming for now that will be rare in practice.	2024-08-22 19:16:53 +00:00
John	b4a6aa5559	chore: remove fsspec pin (#3554 ) remove fsspec pin	2024-08-21 21:57:42 +00:00
Steve Canny	03e0ed3519	rfctr(docx): DOCX emits std minified .text_as_html (#3545 ) Summary Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html` HTML introduced by `partition_docx()`. Produce minified `.text_as_html` consistent with that formed by chunking. Additional Context - nested tables appear as their extracted text in the parent cell (no nested `<table>` elements in `.text_as_html`). - DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody, tfoot elements).	2024-08-21 18:54:21 +00:00
John	f135344738	chore: remove scipy and packaging pins (#3550 ) Remove scipy and packaging constraint pins	2024-08-21 16:05:19 +00:00
John	604cadfb7e	chore: remove ipython pin (#3548 ) this pr is stacked on https://github.com/Unstructured-IO/unstructured/pull/3538 and https://github.com/Unstructured-IO/unstructured/pull/3547 This pr removes dependency pins for IPython, anyio, and pyparsing. It also updates the label-studio-sdk import statement so we don't have to have that pinned and make some minor type hinting edits. Label Studio had a breaking change in their 1.13.0 [release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0)	2024-08-21 00:06:31 +00:00
Christine Straub	01dbc7b473	fix: `nltk` data download path to prevent redundant nested directories (#3546 ) Closes #3543. ### Summary This PR addresses an issue with the NLTK data download process. Previously, when downloading NLTK data, a nested "nltk_data" directory was created within the parent "nltk_data" directory if the parent directory already existed. This redundant directory structure led to two significant problems: - errors in checking if data had already been downloaded, potentially causing redundant downloads in subsequent calls. - failures in loading models from the downloaded NLTK data due to incorrect path resolution. This fix modifies the NLTK data download logic to prevent creation of unnecessary nested directories. If the download path ends with "nltk_data" and that directory already exists, we now use the existing directory instead of creating a new nested one. ### Testing CI should pass. 0.15.7	2024-08-20 18:56:59 +00:00
Matt Robinson	1f8030dd0e	fix(CVE-2024-39705): bump to `nltk` 3.9.1; correct model download issues (#3541 ) ### Summary Bumps to `nltk==3.9.1` and resolves [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An NLTK version bump was originally introduced in #3512 and rolled back in #3527 because `nltk==3.8.2` was yanked from PyPI, and also because we observed significant slowdowns in processing time after bumping to `nltk==3.8.2`. The processing time regression does not appear in `nltk==3.9.1`. ### Testing After the bump, CI should pass. Additionally we verified locally that files processing takes around the amount of time we would expect for a long `.docx` file. ```python In [1]: from unstructured.partition.auto import partition In [2]: filename = "test-doc.docx" In [3]: %timeit partition(filename=filename) 3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ``` 0.15.6	2024-08-19 20:59:36 +00:00
Steve Canny	a861ed8fe7	feat(chunk): split tables on even row boundaries (#3504 ) Summary Use more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking to ensure element text and HTML are "synchronized" and HTML is always parseable. Additional Context Table splitting now has the following characteristics: - `TableChunk.metadata.text_as_html` is always a parseable HTML `<table>` subtree. - `TableChunk.text` is always the text in the HTML version of the table fragment in `.metadata.text_as_html`. Text and HTML are "synchronized". - The table is divided at a whole-row boundary whenever possible. - A row is broken at an even-cell boundary when a single row is larger than the chunking window. - A cell is broken at an even-word boundary when a single cell is larger than the chunking window. - `.text_as_html` is "minified", removing all extraneous whitespace and unneeded elements or attributes. This maximizes the semantic "density" of each chunk.	2024-08-19 18:56:53 +00:00
Christine Straub	99f72d65ba	ci: fix ingest test fixtures update (#3532 )	2024-08-16 16:37:33 -07:00
Christine Straub	fc26426310	feat: replace `pytesseract` with `unstructured.pytesseract` fork (#3528 ) This PR reverts `pytesseract` dependency to `unstructured.pytesseract` fork due to the unavailability of some recent release versions of `pytesseract` on PyPI. This PR also addresses an issue encountered during the publication of `unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI does not allow direct dependencies from Version Control System URLs like GitHub in the `install_requires` or `extras_require` sections of the `setup.py` file. 0.15.5	2024-08-16 10:34:22 -04:00
Matt Robinson	e64e09507a	build: update to latest base image (#3524 ) ### Summary Updates to the latest `wolfi-base` base image to pull in more recent package version. A notable update is that upgrading to `libreoffice==24.2.5.2` resolves several CVEs. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-08-15 22:27:41 -07:00
Christine Straub	d0211cc41f	build: downgrade `nltk` version (#3527 ) This PR aims to roll back `nltk` to `3.8.1`, which was bumped to `3.8.2` in https://github.com/Unstructured-IO/unstructured/pull/3512, because `3.8.2` is no longer available on PyPI due to some issues (https://github.com/nltk/nltk/issues/3301)	2024-08-15 16:35:21 -07:00
Christine Straub	9b778e270d	fix: `pytesseract>=0.3.12` installation error while installing `pdf` extra (#3522 ) Closes #3521. This PR resolves an installation error with `pytesseract>=0.3.12` that occurred during `pip install unstructured[pdf]==0.15.3`. ### Testing Run following command in main branch and this PR ``` pip uninstall -y pytesseract && pip install ".[pdf]" ``` Results - `main` branch ``` INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while. ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10) ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf" ``` - this `PR` `pytesseract-0.3.13` should be installed successfully. 0.15.4	2024-08-14 16:15:40 -05:00
Christine Straub	d6a84bdfbb	build(deps): update extra-paddleocr requirements (#3515 ) This PR removes custom index URL for `paddlepaddle` installation in `extra-paddleocr.in`, resolving `setup.py` configuration error. Now uses `paddlepaddle==3.0.0b1` directly from PyPI, simplifying installation process. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io> 0.15.3	2024-08-14 12:19:20 -05:00
Matt Robinson	7437f0a084	fix(CVE-2024-39705): update to latest `nltk` version (#3512 ) ### Summary Addresses [CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by updating to `nltk==3.8.2` and closes #3511. This CVE had previously been mitigated in #3361. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> 0.15.2	2024-08-13 09:39:29 -04:00
Christine Straub	1158d8f695	Refactor image block extraction in `pdf` partitioning (#3514 ) Closes [#3503](https://github.com/Unstructured-IO/unstructured/issues/3503). ### Summary This PR prevents creation of `figures` directory for saving image blocks (`Image`, `Table`) when `extract_image_block_to_payload` parameter is set to True ### Testing ``` elements = partition_image( filename="example-docs/img/embedded-images-tables.jpg", strategy="hi_res", extract_image_block_types=["Image", "Table"], extract_image_block_to_payload=True, ) ``` Results: - `Main` Branch: `figures` directory is created. - `PR`: `figures` directory is not created.	2024-08-13 06:11:10 +00:00
Steve Canny	cbe1b35621	rfctr(chunk): prep for adding TableSplitter (#3510 ) Summary Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.	2024-08-12 18:04:49 +00:00
Christine Straub	d99b39923d	build(deps): Remove unstructured.paddlepaddle fork (#3506 ) This PR aims to remove "unstructured.paddlepaddle" fork. Previously, we used `unstructured.paddlepaddle` fork to support `unstructured.paddleocr` on arm64 architecture. But currently, `unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work on `arm64` architecture. Also, `unstructured.paddleocr` with the latest version of the original `paddlepaddle` works on both `amd64` and `arm64` architectures. ### Testing ``` os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" elements = partition_pdf( filename=<file_path>, strategy="hi_res", infer_table_structure=True, ) ```	2024-08-09 22:04:22 +00:00
John	a2ae2ed646	chore: remove matplotlib constraint (#3505 )	2024-08-09 19:31:19 +00:00
Jake Zerrer	051be5aead	Remove unstructured.pytesseract fork (#3454 ) A second attempt at https://github.com/Unstructured-IO/unstructured/pull/3360, this PR removes unstructured's dependency on its own fork of `pytesseract`. (The original reason for the fork, the addition of `run_and_get_multiple_output`, was removed [here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).) --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-08-09 04:28:48 +00:00
John	2373eaa829	fix typo: pipline>pipeline (#3498 ) fix typo: pipline>pipeline	2024-08-08 18:53:47 +00:00
John	43ae0befa7	chore: bump botocore pin (#3493 ) bump botocore pin to match aiobotocore/s pin: `eae97439b3`	2024-08-07 21:41:53 +00:00
John	696155e614	chore: update importlib-metadata pin (#3491 )	2024-08-07 18:17:53 +00:00
John	6545f16e57	chore: remove cryptography pin and update test (#3482 ) remove cryptography pin, pin tenacity, and update test_unstructured_ingest/unit/connector/test_salesforce_connector.py	2024-08-07 15:25:23 +00:00
Pawel Kmiecik	eba12daeb2	feat: correct object detection metrics (#3490 ) This PR: - fixes an issue that made it impossible to compute OD metrics - adds per-class object detection metrics	2024-08-07 14:14:02 +00:00
John	24a1f298e5	chore: small edits (#3480 ) Add comments and fix decorators on some tests.	2024-08-06 19:21:43 +00:00
Steve Canny	73bef27ef1	fix(pptx): accommodate invalid image/jpg MIME-type (#3475 ) As described in #3381, some clients, perhaps including Adobe PDF Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior to v1.0.0, `python-pptx` would not load these images, which caused image extraction to fail. Update the `python-pptx` dependency to `v1.0.1` or above to ensure this upstream fix is always available. Fixes: #3381	2024-08-06 18:48:15 +00:00
Steve Canny	a468b2de3b	rfctr(csv): accommodate single column CSV files (#3483 ) Summary Improve factoring, type-annotation, and tests for `partition_csv()` and accommodate single-column CSV files. Fixes: #2616	2024-08-06 00:48:37 +00:00
David Potter	59ec64235b	chore: rename astra to astradb (#3458 ) DataStax wanted all references to be astradb instead of astra. As per @erichare We'll also have to do the same in unstructured-ingest :)	2024-08-05 20:41:02 +00:00
Austin Walker	7e887442c4	chore: Cut the 0.15.1 release (#3481 ) 0.15.1	2024-08-05 16:16:13 +00:00
Maciej Kurzawa	b749b891a7	fix: disabled checking max pages for images (#3473 ) Added fix related to https://github.com/Unstructured-IO/unstructured/pull/3431, which disables checking max pages for images	2024-08-02 14:25:08 +00:00
John	147514f6b5	feat: msg and email metadata (#3444 ) Update partition_eml and partition_msg to capture cc, bcc, and message id fields. Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files Testing ``` from unstructured.partition.email import partition_email from test_unstructured.unit_utils import example_doc_path elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True) print(elements) elements[0].metadata.to_dict() ``` Note to reviewers: Tests in `test_unstructured/partition/test_email.py` were refactored and rearranged to group similar tests together, so it will be easiest to review those changes commit by commit. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>	2024-08-01 19:24:17 +00:00
Christine Straub	0f057188c6	Improve pdfminer embedded image extraction in `pdf` partitioning (#3456 ) ### Summary This PR addresses an issue in `pdfminer` library's embedded image extraction process. Previously, some extracted "images" were incorrect, including embedded text elements, resulting in oversized bounding boxes. This update refines the extraction process to focus on actual images with more accurate, smaller bounding boxes. ### Testing PDF: [test_pdfminer_text_extraction.pdf](https://github.com/user-attachments/files/16448213/test_pdfminer_text_extraction.pdf) ``` elements = partition_pdf( filename="test_pdfminer_text_extraction", strategy=strategy, languages=["chi_sim"], analysis=True, ) ``` Results - this `PR` ![page1_layout_pdfminer](https://github.com/user-attachments/assets/098e0a1f-fdad-4627-a881-cbafd71ce5a0) ![page1_layout_final](https://github.com/user-attachments/assets/6dc89180-36ac-424a-99de-63810ebf8958) - `main` branch ![page1_layout_pdfminer](https://github.com/user-attachments/assets/8228995a-2ef1-4b76-9758-b8015c224e6d) ![page1_layout_final](https://github.com/user-attachments/assets/68d43d7b-7270-4f58-8360-dc76bd0df78f)	2024-08-01 16:47:08 +00:00
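
Commit 32bb77aafb above (#3516) describes identifying an OLE "container" from the first 8 bytes of the file and falling back to the filename extension when the subtype cannot be determined from content. A minimal sketch of that idea follows. It is not the library's `_OleFileDifferentiator`: the names `OLE_MAGIC`, `OLE_EXTENSIONS`, `is_ole_container` and `guess_ole_subtype` are hypothetical, the extension table is an assumption for illustration, and the content-based differentiation step is left out entirely.

```python
from pathlib import Path
from typing import Optional

# Signature carried in the first 8 bytes of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension table, for illustration only.
OLE_EXTENSIONS = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def is_ole_container(head: bytes) -> bool:
    """Return True when the leading bytes carry the OLE signature."""
    return head[:8] == OLE_MAGIC


def guess_ole_subtype(head: bytes, filename: Optional[str]) -> Optional[str]:
    """Extension-based fallback for an OLE file whose content was inconclusive.

    Returns None when the bytes are not OLE, when no filename is available,
    or when the extension is not one of the four known subtypes.
    """
    if not is_ole_container(head):
        return None
    if filename is None:
        return None
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

As the commit message notes, a fallback of this kind cannot identify a file that has no filename or carries an incorrect extension; in the sketch those cases return `None` or a wrong subtype rather than defaulting to MSG.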

1605 Commits