The new `torch==2.7.1` release now pulls in NVIDIA GPU support and Triton as
dependencies. Those are not supported on `arm64`, nor are they actually used
by `unstructured` on `amd64`. This is a quick patch to remove
them from the `.txt` requirements files to unblock builds.
Bump `unstructured-inference` to `1.0.5`, which includes a fix to ensure
model init is thread safe.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Update reqs to resolve CVEs and add the HF env var to stop it from reaching
out.
Updated the Dockerfile with `ENV HF_HUB_OFFLINE=1` to stop it from pinging
Hugging Face. This was an issue for a gov customer. Also updated requirements
to resolve some open CVEs.
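For anyone verifying the offline behavior outside the image, a minimal sketch (the env var must be set before any `huggingface_hub`/`transformers` import, since the flag is read at import time; the model name is just an example):
```python
import os

# Must be set before importing huggingface_hub / transformers.
os.environ["HF_HUB_OFFLINE"] = "1"

from huggingface_hub import hf_hub_download

try:
    # With HF_HUB_OFFLINE=1 this resolves only from the local cache
    # and raises instead of reaching out to huggingface.co.
    path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
    print(f"served from local cache: {path}")
except Exception as exc:
    print(f"not cached, and network access is disabled: {exc}")
```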
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
This PR is to address [a
CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in
a recent scan.
The CVE has to do with the package `label_studio_sdk`, which relates to the
tool Label Studio, a data labeling platform. We built a staging
function that takes a list of elements and converts it to a format
suitable for passing to the Label Studio platform.
We don't use the vulnerable package in the staging function itself;
we only use it to test the output of the function against the Label
Studio API schema.
Even that test is of questionable value, since it really tests the
schema against an old version of the Label Studio API (we test against
a recording of the Label Studio API's responses stored using `vcrpy`).
Label Studio has fixed the vulnerability as of version 1.0.10 of their
SDK, but we're stuck on 1.0.5 because 1.0.6 and above require
`numpy<2.0.0`.
This leaves us with several choices of resolution, some of which are:
1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to
resolve the CVE
2. Drop `label_studio_sdk` by either removing or rewriting the test.
3. Drop test and dev dependencies from the `unstructured` image.
We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a
follow-on PR.
Here we add a deprecation notice to `stage_for_label_studio` and remove
the offending test. Normally, good practice would be to warn of the
future deprecation for a reasonable amount of time before removal, but
in order to address the CVE immediately, we're deprecating it right
away.
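The notice follows the usual pattern, sketched below (a minimal sketch; the actual signature and wording of `stage_for_label_studio` may differ):
```python
import warnings
from typing import Any, Dict, List


def stage_for_label_studio(elements: List[Any], **kwargs: Any) -> List[Dict[str, Any]]:
    """Convert unstructured elements to the Label Studio upload format."""
    # Emitted immediately rather than after a grace period so the
    # vulnerable test dependency can be dropped right away.
    warnings.warn(
        "stage_for_label_studio is deprecated and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    ...  # conversion logic unchanged
```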
### Testing
Install the dependencies (`make install`) into a fresh environment, and
`pip list | grep label` should have no results. The scan artifact in CI
should contain no "high" or "critical" CVEs.
This PR removes usage of `PageLayout.elements` from the partition function,
except when `analysis=True`. It updates the partition logic so
that `PageLayout.elements_array` is used everywhere, saving memory and
CPU cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.
`PageLayout.elements` uses a list to store layout elements' data, while
`elements_array` uses a `numpy` array to store the data, which has much
lower memory requirements. Measured with `memory_profiler`, the
difference is usually around 10x.
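A rough illustration of where the savings come from (hypothetical field shapes, not the actual `PageLayout` internals): a single `numpy` array avoids the per-object overhead of thousands of Python lists and floats.
```python
import sys

import numpy as np

# Hypothetical: 10,000 bounding boxes stored as a list of Python lists...
boxes_list = [[float(i), 0.0, float(i) + 10.0, 20.0] for i in range(10_000)]
list_bytes = sys.getsizeof(boxes_list) + sum(
    sys.getsizeof(box) + sum(sys.getsizeof(v) for v in box) for box in boxes_list
)

# ...versus one contiguous numpy array holding the same data.
boxes_array = np.array(boxes_list, dtype=np.float32)

print(f"list of lists: ~{list_bytes / 1e6:.1f} MB")
print(f"numpy array:   ~{boxes_array.nbytes / 1e6:.2f} MB")
```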
Fixes the order of content-type detection strategies for byte-encoded JSONs.
```
import io
import json

from unstructured.file_utils.filetype import detect_filetype

json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")
```
Before: PDF
Now: JSON
- **Add auto-download for NLTK in the Python environment.** When a user
imports `tokenize`, it will automatically download the NLTK data.
- Added an `AUTO_DOWNLOAD_NLTK` flag in `tokenize.py` to download
`NLTK_DATA`, as sketched below.
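A minimal sketch of how such a flag typically gates the download (assumed shape; the actual wiring in `tokenize.py` may differ):
```python
import os

import nltk

# Assumed flag name from this change; the real default may differ.
AUTO_DOWNLOAD_NLTK = os.environ.get("AUTO_DOWNLOAD_NLTK", "true").lower() == "true"


def _ensure_nltk_data(package: str = "punkt") -> None:
    """Download an NLTK data package on first use if it is missing."""
    try:
        nltk.data.find(f"tokenizers/{package}")
    except LookupError:
        if AUTO_DOWNLOAD_NLTK:
            nltk.download(package)
```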
This PR aims to add support for link extraction in pdf `hi_res`
strategy. The `partition_pdf()` function now supports link extraction
when using the `hi_res` strategy, allowing users to extract hyperlinks
from PDF documents.
### Summary
- Added functionality to support link extraction in the `hi_res` flow
- Enhanced the word extraction functionality used for link extraction in
both the `fast` and `hi_res` flows, resulting in more accurate `start_index`
and `text` values in the `links` metadata
- Updated the ingest fixture update workflow to not skip the Astra DB source
test
### Testing
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/pdf/embedded-link.pdf",
    strategy="hi_res",
)
assert len(elements[0].metadata.links) == 3
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
> This is a POC change; not everything is working correctly, and code
quality could be improved significantly.

This ticket adds parsing of HTML to Unstructured elements and back. How does
it work?
HTML has a tree structure, while Unstructured elements form a flat list.
The HTML structure is traversed in DFS order, creating elements and adding
them to the list, so the reading order from the HTML is preserved. To be able
to compose the tree again, all elements have IDs, and `metadata.parent_id` is
leveraged.
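A minimal sketch of the flattening idea (hypothetical element class and field names; the real logic lives in `unstructured/documents/transformations.py`):
```python
import uuid
from dataclasses import dataclass
from typing import List, Optional

from lxml import etree


@dataclass
class FlatElement:
    """Hypothetical stand-in for an Unstructured element."""

    element_id: str
    tag: str
    text: str
    parent_id: Optional[str] = None


def flatten_dfs(
    node: etree._Element,
    parent_id: Optional[str] = None,
    out: Optional[List[FlatElement]] = None,
) -> List[FlatElement]:
    """Walk the HTML tree depth-first so reading order is preserved;
    parent_id lets the tree be reassembled from the flat list."""
    if out is None:
        out = []
    element_id = uuid.uuid4().hex
    out.append(FlatElement(element_id, node.tag, (node.text or "").strip(), parent_id))
    for child in node:
        flatten_dfs(child, parent_id=element_id, out=out)
    return out


elements = flatten_dfs(etree.fromstring("<div><h1>Title</h1><p>Body</p></div>"))
print([(e.tag, e.text, e.parent_id) for e in elements])
```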
How is the HTML preserved if there are 'layout' elements without text, or
deeply nested HTML that is just text from the point of view of an
Unstructured element?
Each element is parsed back to HTML using the `metadata.text_as_html` field.
For layout elements only the HTML tag is stored; for long text elements
there is everything required to recreate the HTML - you can see examples in
the unit tests or the .json file I attached.
Pros of the solution:
- Nothing had to be changed in the element types
Cons:
- There are elements without text, which may be confusing (they could be
replaced by some special type)
The core transformation logic can be found in two functions in
`unstructured/documents/transformations.py`.
Known bugs (they are minor):
- sometimes the HTML tag is changed incorrectly
- `metadata.category_depth` and `metadata.page_number` are not set
- a page break is not added between pages
How to test. First, generate HTML:
```python3
from pathlib import Path

from vlm_partitioner.src.partition import partition

if __name__ == "__main__":
    doc_dir = Path("out_dir")
    file_path = Path("example_doc.pdf")
    partition(str(file_path), provider="anthropic", output_dir=str(doc_dir))
```
Then parse it to Unstructured elements and back to HTML:
```python3
from pathlib import Path

from unstructured.documents.html_utils import indent_html
from unstructured.documents.transformations import (
    ontology_to_unstructured_elements,
    parse_html_to_ontology,
    unstructured_elements_to_ontology,
)
from unstructured.staging.base import elements_to_json

if __name__ == "__main__":
    output_dir = Path("out_dir/")
    output_dir.mkdir(exist_ok=True, parents=True)
    doc_path = Path("out_dir/example_doc.html")
    html_content = doc_path.read_text()

    ontology = parse_html_to_ontology(html_content)
    unstructured_elements = ontology_to_unstructured_elements(ontology)
    elements_to_json(unstructured_elements, str(output_dir / f"{doc_path.stem}_unstr.json"))

    parsed_ontology = unstructured_elements_to_ontology(unstructured_elements)
    html_to_save = indent_html(parsed_ontology.to_html())
    Path(output_dir / f"{doc_path.stem}_parsed_unstr.html").write_text(html_to_save)
```
I attached an example doc before and after running these scripts:
[outputs.zip](https://github.com/user-attachments/files/17438673/outputs.zip)
This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that have repeated definitions
of env variables or are missing installation of testing dependencies in the
cache.
A few document ingest results changed:
- two places for `biomed-api` (actually processed locally on the runner) are
due to very small changes in the numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in deduplication of a word in a sentence that starts on a new
line (yes, the two cases go in opposite directions)
- the layout parser paper now outputs the code lines with the page number
inside the code box as list items
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
introduced by `partition_pptx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- PPTX `.metadata.text_as_html` is minified (no extra whitespace and no
`thead`, `tbody`, or `tfoot` elements); see the sketch below.
- `table.text` is the clean-concatenated-text (CCT) of the table.
- The last use of the `tabulate` library is removed, and that dependency is
removed from `base.in`.
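As an illustration of the minified shape (a hypothetical two-cell table, not one of the actual test fixtures):
```python
# Hypothetical example of the minified form described above: a single
# line with no indentation and no thead/tbody/tfoot wrapper elements.
minified = "<table><tr><td>Quarter</td><td>Revenue</td></tr></table>"

# The legacy, non-minified form carried extra whitespace and wrappers:
legacy = """
<table>
  <tbody>
    <tr><td>Quarter</td><td>Revenue</td></tr>
  </tbody>
</table>
"""
```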
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Wrap the `shared.PartitionParameters` usage with
`operations.PartitionRequest`. This syntax has been deprecated since
v0.23.0 of the SDK and will be unsupported in v0.26.0.
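The change looks roughly like this (a sketch following the SDK's documented migration; file name and API key are placeholders):
```python
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

with open("example.pdf", "rb") as f:
    files = shared.Files(content=f.read(), file_name="example.pdf")

# Deprecated since v0.23.0: passing shared.PartitionParameters directly.
# resp = client.general.partition(shared.PartitionParameters(files=files))

# Supported form: wrap the parameters in operations.PartitionRequest.
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(files=files),
)
resp = client.general.partition(request=req)
```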
Bumps the max version of `protobuf` to `<5.0` and sets the min version of
`chromadb` to `>0.4.14` in `requirements/ingest/chroma.in`. Also fixes some
type hints in `unstructured/ingest/v2/processes/connectors/chroma.py`.
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies, `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.
May be easiest to review commit-by-commit.
### Summary
Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT, XLS,
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed an incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.
As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
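The approach can be sketched with the `olefile` package (an illustration of the idea, not the actual detection code; the markers are the standard OLE stream names for each format):
```python
import olefile

# Standard streams/storages that identify each OLE-based format.
_OLE_MARKERS = [
    ("WordDocument", "doc"),
    ("PowerPoint Document", "ppt"),
    ("Workbook", "xls"),
    ("__properties_version1.0", "msg"),  # Outlook message property stream
]


def guess_ole_filetype(path: str) -> str:
    """Inspect an OLE container's storage to tell DOC/PPT/XLS/MSG apart."""
    with olefile.OleFileIO(path) as ole:
        entries = {"/".join(entry) for entry in ole.listdir()}
        for marker, filetype in _OLE_MARKERS:
            if any(marker in entry for entry in entries):
                return filetype
    return "unknown"
```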
### Testing
Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.
```python
from unstructured.file_utils.filetype import detect_filetype
filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
### Summary
Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.
### Testing
After the bump, CI should pass. Additionally, we verified locally that
file processing takes around the amount of time we would expect for a
long `.docx` file.
```python
In [1]: from unstructured.partition.auto import partition
In [2]: filename = "test-doc.docx"
In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
This PR reverts the `pytesseract` dependency to the `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.
This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies on Version Control System URLs, such as
GitHub, in the `install_requires` or `extras_require` sections of the
`setup.py` file.
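For context, the rejected pattern versus an accepted one in `setup.py` (an illustrative sketch; the package name is real but the version pin is hypothetical):
```python
from setuptools import setup

setup(
    name="example-package",
    install_requires=[
        # Rejected by PyPI: a direct VCS URL dependency.
        # "unstructured.pytesseract @ git+https://github.com/Unstructured-IO/unstructured.pytesseract.git",
        #
        # Accepted: a normal versioned requirement resolvable from PyPI.
        "unstructured.pytesseract>=0.3.12",  # hypothetical pin
    ],
)
```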