### Issue
Attempting to partition a password-protected XLSX file results in an
obscure exception:
> Can't find workbook in OLE2 compound document
### Solution
Utilize the [msoffcrypto-tool](https://pypi.org/project/msoffcrypto-tool/)
package (MIT License) to load the XLSX file and check whether it's
encrypted; if so, raise an `UnprocessableEntityError` exception
detailing the reason for rejecting the file.
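A minimal sketch of the check (the `UnprocessableEntityError` class and error message are assumptions, stand-ins for whatever the codebase actually uses):
```python
import msoffcrypto

class UnprocessableEntityError(Exception):
    """Stand-in for the codebase's error type (an assumption here)."""

def raise_if_encrypted(filename: str) -> None:
    with open(filename, "rb") as f:
        try:
            encrypted = msoffcrypto.OfficeFile(f).is_encrypted()
        except Exception:  # not a recognized OLE2/OOXML container -> not encrypted
            encrypted = False
    if encrypted:
        raise UnprocessableEntityError(
            "File is encrypted. Please decrypt the file before partitioning."
        )
```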
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
The `@apply_metadata` decorator already contains logic to detect the
language of the element text (on either a document or element level).
Update PDF partitioning, and later image partitioning, to use this
decorator so that accurate element-language results are output.
## Test
```
from unstructured.partition.auto import partition

def test_partition_pdf():
    pdf_path = "example-docs/language-docs/fr_olap.pdf"
    # optionally set `detect_language_per_element=True`
    elements = partition(pdf_path)
    print(f"Number of elements partitioned: {len(elements)}")
    # Check that elements are returned
    assert len(elements) > 0, "No elements were partitioned from the PDF."
    # Check the language output for each element
    for element in elements:
        print(element)
        print(element.metadata.languages)
        print("-------------------------------")

test_partition_pdf()
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
#### To test, simply serialize a TableChunk element with and without the
changes in the PR
____
**Without the changes:**
```
In [1]: from unstructured.documents.elements import TableChunk
In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x110113410>
In [3]: TableChunk("hi").to_dict()
Out[3]:
{'type': 'Table',
 'element_id': '6267e99a-46d8-4f2d-a206-51c691469c72',
 'text': 'hi',
 'metadata': {}}
```
____
**With the changes:**
```
In [1]: from unstructured.documents.elements import TableChunk
In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x10367f050>
In [3]: TableChunk("hi").to_dict()
Out[3]:
{'type': 'TableChunk',
 'element_id': 'f91af3ac-0dea-4dc4-8a6a-69c28cfcca3b',
 'text': 'hi',
 'metadata': {}}
```
____
This change affects HTML partitioning.
Previously, when there was a table in the HTML, we stripped the `class`
and `id` attributes from all tags inside the table. However, a table
sometimes contains images (`img` tags) whose `class` attribute
identifies important information about the image. This change
preserves the `class` attribute for `img` tags inside a table, and is
reflected in a table element's `metadata.text_as_html` attribute.
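A minimal sketch of the attribute-cleanup rule (not the library's actual implementation), using `lxml`:
```python
from lxml import etree

html = '<table><tr><td class="cell" id="c1"><img class="chart" src="c.png"/></td></tr></table>'
table = etree.fromstring(html)

for el in table.iter():
    el.attrib.pop("id", None)    # id is always stripped inside tables
    if el.tag != "img":          # class survives only on img tags
        el.attrib.pop("class", None)

print(etree.tostring(table).decode())
# <table><tr><td><img class="chart" src="c.png"/></td></tr></table>
```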
Dropped variables that said we support Python 3.9 in `setup.py`, as well
as any remaining references to Python 3.9.
I also checked the pins and removed several that no longer seem
necessary.
Increase the CSV field limit to support partitioning of files with
large data in individual fields.
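A minimal sketch of the kind of change involved: Python's `csv` module caps field size (128 KB by default) and raises `_csv.Error: field larger than field limit` on oversized fields, so the limit must be raised before parsing. `sys.maxsize` can overflow the underlying C long on some platforms, hence the fallback loop:
```python
import csv
import sys

limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)  # raise the cap so huge fields parse
        break
    except OverflowError:
        limit //= 2  # back off until the platform accepts the value
```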
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
## Summary
This PR fixes an issue where header/footer content in HTML is not
partitioned as `unstructured` `Header` or `Footer` element types.
Instead, it comes out either as `UncategorizedText` or as the type of
the nested structure inside the header/footer. E.g., `<header
class="Header"><h1 class="Title">Header Title</h1></header>` would be
partitioned as a `Title` instead of a `Header`.
## Bug description
This behavior occurs because we treat header and footer as layout, i.e.,
containers, in the ontology definition. As a result, during parsing we
[unwrap](ec209c6b5f/unstructured/partition/html/transformations.py (L361-L378))
the container and parse the contents as if they came from the main text,
even though they are still part of the header/footer.
The fix is to treat header/footer as text instead of layout in the
ontology so that all content inside them is properly gathered under the
`Header`/`Footer` element types.
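A minimal check of the expected behavior, using the example from above (the exact element list after the fix is an assumption):
```python
from unstructured.partition.html import partition_html

html = '<header class="Header"><h1 class="Title">Header Title</h1></header>'
elements = partition_html(text=html)

print([type(e).__name__ for e in elements])
# before the fix: ['Title']; after the fix, the content should be gathered as ['Header']
```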
`<?xml version="1.0"?>` does not get escaped when converting to HTML
when it appears in a code block like this in the markdown file
````
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head></head>
<boolean>true</boolean>
</sparql>
````
which causes the parser to throw an error like
> AttributeError: 'lxml.etree._ProcessingInstruction' object has no
attribute 'is_phrasing'.
This PR processes the original md file and adds indentation to `<?xml
version="1.0"?>` to force the XML code to be escaped when converted
to HTML.
https://github.com/Unstructured-IO/unstructured/issues/3935
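A hypothetical sketch of the preprocessing step (the function name and indent width are assumptions):
```python
def indent_xml_declarations(md_text: str) -> str:
    """Indent XML declarations so the md -> HTML converter escapes them as code."""
    fixed_lines = []
    for line in md_text.splitlines():
        if line.lstrip().startswith("<?xml"):
            line = "    " + line  # forces escaping instead of a processing instruction
        fixed_lines.append(line)
    return "\n".join(fixed_lines)
```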
This PR fixes an issue with `docx` files containing
complex/recursive/merged/malformed tables by skipping cells that cannot
be traced back to a valid `<w:tc>` element by `python-docx`, due to
missing or improperly merged rows.
Accessing `row.cells` in such cases can raise a `ValueError` when
`python-docx` fails to resolve the full logical table layout. This PR
wraps those calls in `try/except` to skip problematic rows while
continuing to extract usable content from the rest of the document.
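A minimal sketch of the defensive pattern described (the file name is hypothetical):
```python
from docx import Document

doc = Document("merged-cells.docx")  # hypothetical document with malformed tables
for table in doc.tables:
    for row in table.rows:
        try:
            cells = row.cells
        except ValueError:
            continue  # row can't be traced to valid <w:tc> elements; skip it
        for cell in cells:
            print(cell.text)
```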
### Summary
Addressed a TypeError that occurred when partitioning empty or
whitespace-only HTML content.
## Test
* the unit test
`test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error`
reproduces the TypeError before the fix
* the test now passes
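A sketch of what that test plausibly looks like (the exact exception type raised after the fix is an assumption):
```python
import pytest

from unstructured.partition.html import partition_html

def test_partition_html_with_empty_content_raises_error():
    # assumed post-fix behavior: a clear error instead of an opaque TypeError
    with pytest.raises(ValueError):
        partition_html(text="   ")
```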
In scenarios where there is a large amount of data that represents the
document rather than individual elements in the document, it may be
preferable to specify this in a single location rather than duplicating
the data across all elements (as we do for smaller metadata like
filename or filetype)
This PR adds a `DocumentData` element type which can be used to
uniquely capture this data.
Given that `_CsvPartitioningContext` defines an `_encoding` property,
this property was meant to be used. Behaviorally, this change should be
a no-op, but it supports future efforts where the partitioning context
applies internal logic.
## PR Summary
This small PR fixes the bs4 deprecation warnings which you can find in
the [CI
logs](https://github.com/Unstructured-IO/unstructured/actions/runs/15491657572/job/43729960936#step:3:2615):
```python
/app/unstructured/metrics/table/table_extraction.py:53: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
/app/unstructured/metrics/table/table_extraction.py:57: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
```
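The rename is mechanical; `find_all` has been the preferred spelling since bs4 4.0:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>1</td></tr></table>", "html.parser")
rows = soup.findAll("tr")   # deprecated camelCase alias; emits DeprecationWarning
rows = soup.find_all("tr")  # preferred replacement with identical behavior
```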
---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
### Summary
Fixes `'NoneType' object has no attribute 'partitioner_shortname'`,
which occurred because `result_file_type = self._disambiguate_json_file_type`
could return `None` for the file type.
The new `torch==2.7.1` comes with NVIDIA GPU support and triton as
dependencies. Those are not supported on `arm64`, nor actually used
by `unstructured` on `amd64`. This is a quick patch to remove
them from the .txt requirements files to unblock builds.
### Summary
To fix the error `Error in chunk: 512: {"detail":"'NoneType' object has no
attribute 'strip'"}`, I found the logs under the same org (we can assume
this is the same job).
(Supporting screenshot and the stack trace from the `utic-api` ES log doc omitted.)
### Notes
Longer term, we should make the partitioner (vlm + utic-api) not return
text containing null.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
In this pull request, the parent-child relationship for elements
generated with the v2 parser is based on actual element IDs instead of
IDs baked somewhere in the HTML script.
With some extra bug fixing, this allowed for significantly simplifying
the JSON -> HTML script.
Bump `unstructured-inference` to `1.0.5`, which includes a fix to ensure
model init is thread safe.
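A sketch of the usual thread-safe lazy-init pattern (the actual fix lives in `unstructured-inference`; the names here are hypothetical):
```python
import threading

def expensive_model_constructor():
    return object()  # stand-in for the real (slow) model load

_model = None
_model_lock = threading.Lock()

def get_model():
    """Construct the model once, even when called from many threads."""
    global _model
    with _model_lock:
        if _model is None:
            _model = expensive_model_constructor()
    return _model
```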
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Update requirements to resolve CVEs, and add the HF environment
variable to stop the image from reaching out.
Updated the Dockerfile with `ENV HF_HUB_OFFLINE=1` to stop it from
pinging HF; this was an issue for a gov customer. Also updated
requirements to resolve some open CVEs.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
# PR Summary
This PR resolves deprecation warnings from the `logging` library:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
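The change is mechanical throughout the codebase:
```python
import logging

logger = logging.getLogger(__name__)
logger.warn("something happened")     # deprecated; emits DeprecationWarning
logger.warning("something happened")  # drop-in replacement
```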
---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
When I tried to partition a PNG file and extract images, I got an error
from Pillow:
```
WARNING unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```
The issue is that a PNG has an additional alpha channel that cannot be
saved in JPEG format. We can fix this with a quick conversion. I added
a PNG test case that now passes with this fix.
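A minimal sketch of the conversion (file names hypothetical): JPEG has no alpha channel, so `RGBA` images must be converted to `RGB` before saving:
```python
from PIL import Image

im = Image.open("extracted_figure.png")  # hypothetical extracted image
if im.mode == "RGBA":
    im = im.convert("RGB")  # drop the alpha channel JPEG can't represent
im.save("extracted_figure.jpg", format="JPEG")
```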
Some elements, like `Image`, can have `None` as the value of their
`text` attribute. In that case the current chunking logic fails,
because it expects the field to always have a length and be splittable.
The fix is to use `element.text or ""` when checking length, and to add
flow control that exits early to avoid calling split on `None`.
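A minimal sketch of the guard (the function name is hypothetical):
```python
def split_element_text(element) -> list[str]:
    text = element.text or ""  # treat a None text attribute as empty
    if not text:
        return []              # early exit; never call split on None
    return text.split()
```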
Fix for the `PSSyntaxError` import error:
"cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'".
The latest pdfminer-six no longer imports `PSSyntaxError` into
`pdfminer.pdfparser`; it must now be imported directly from its
source (`pdfminer.psexceptions`).
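One common way to handle such a move is a compatibility import (whether the PR uses a shim or a plain direct import is not specified here):
```python
try:
    from pdfminer.psexceptions import PSSyntaxError  # newer pdfminer.six
except ImportError:
    from pdfminer.pdfparser import PSSyntaxError     # older releases
```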
The `sort_page_element()` function uses the element id to sort the
elements.
Two executions of the same code, on the same file, produce different
results: the order of the elements is random.
This makes it impossible, for example, to write stable unit tests or to
obtain reproducible results.
Removed the dependencies contained in `test.txt`, `dev.txt`, and
`constraints.txt` from the things that get installed in the docker
image. In order to keep testing the image (running the tests), I added a
step to the `docker-test` make target to install `test.txt` and
`dev.txt`. Thus we presumably get a smaller image (probably not much
smaller), reduce the dependency chain of our images, and have less
exposure to vulnerabilities while still testing as robustly as before.
Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
made reference to non-existent make targets, which tells me it's stale
and wasn't being used.
### Review:
- Reviewer should ensure the dev and test dependencies are not being
installed in the docker image. One way is to check the logs in
CI and note, e.g., that
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
is the first reference to `pytest` in the docker build and test logs,
after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is
passing.
This PR is to address [a
CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in
a recent scan.
The CVE has to do with the package `label_studio_sdk`. This relates to
the tool Label Studio, a data labeling platform. We built a staging
function that takes a list of elements and converts it to a format
suitable for passing to the LabelStudio platform.
We don't use the vulnerable package in the actual function; we only
use it to test the output of the function against the Label Studio API
schema.
Even the test where we use it is of questionable value, since it
really tests the schema against an old version of the Label Studio
API (we test against a recording of the Label Studio API's
responses stored using `vcrpy`).
Label Studio has fixed the vulnerability as of version 1.0.10 of their
SDK, but we're stuck on 1.0.5 because 1.0.6 and above require
`numpy<2.0.0`.
This leaves us with several choices of resolution, some of which are:
1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to
resolve the CVE
2. Drop `label_studio_sdk` by either removing or rewriting the test.
3. Drop test and dev dependencies from the `unstructured` image.
We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a
follow-on PR.
Here we add a deprecation notice to `stage_for_label_studio` and remove
the offending test. Normally good practice would be to add a warning of
future deprecation to the function for a reasonable amount of time, but
in order to address the CVE immediately, we're deprecating it right
away.
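A sketch of what such a deprecation notice typically looks like (the exact message and signature are assumptions):
```python
import warnings

def stage_for_label_studio(elements, *args, **kwargs):
    warnings.warn(
        "stage_for_label_studio is deprecated and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    ...  # existing staging logic unchanged
```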
### Testing
Install the dependencies (`make install`) into a fresh environment, and
`pip list | grep label` should have no results. The scan artifact in CI
should contain no "high" or "critical" CVEs.
Instead of looking for the presence of `word/document.xml`,
`ppt/presentation.xml`, and `xl/workbook.xml` to identify DOCX, PPTX,
and XLSX files, we look for the prefixes `word/document*.xml`,
`ppt/presentation*.xml`, and `xl/workbook*.xml`, since certain files
generated by Office365 contain entries with different names.
Fixes https://github.com/Unstructured-IO/unstructured/issues/3937
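A minimal sketch of the relaxed check (the function name is hypothetical):
```python
import fnmatch
import zipfile

def has_docx_marker(path: str) -> bool:
    """Match word/document*.xml rather than requiring word/document.xml exactly."""
    with zipfile.ZipFile(path) as z:
        return any(fnmatch.fnmatch(name, "word/document*.xml") for name in z.namelist())
```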
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>