1739 Commits

Author SHA1 Message Date
Yuming Long
c04235c168
fix [NEX-49] : Fix TypeError for empty HTML content (#4032)
### Summary

Addressed a TypeError that occurred when partitioning empty or
whitespace-only HTML content.

## Test
* The unit test
`test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error`
reproduces the TypeError before the fix.
* The test now passes.
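
A minimal sketch of the kind of up-front guard this fix implies, assuming empty or whitespace-only input should be rejected with a clear error rather than failing later with a `TypeError`. `ensure_html_has_content` is a hypothetical helper, not a function in `unstructured`, and the choice of `ValueError` is an assumption:

```python
def ensure_html_has_content(html_text):
    """Hypothetical guard: reject None, empty, or whitespace-only HTML early."""
    # Assumption: ValueError is used purely for illustration; the real fix
    # may raise a different exception type.
    if html_text is None or not html_text.strip():
        raise ValueError("HTML content is empty or whitespace-only")
    return html_text
```

Calling the guard before parsing keeps the failure close to its cause instead of surfacing it deep inside the parser.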
2025-06-25 18:13:20 +00:00
ryannikolaidis
3f87946f56
feat: add DocumentData type (#4031)
When a large amount of data describes the document as a whole rather
than individual elements in the document, it may be preferable to store
it in a single location instead of duplicating it across all elements
(as we do for smaller metadata such as filename or filetype).

This PR adds a `DocumentData` element type, which can be used to capture
this data in one place.
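
The idea can be sketched as follows. This is an illustration only: `Element`, `DocumentDataSketch`, and `build_elements` are hypothetical stand-ins defined here, not the classes added by this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a document element."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class DocumentDataSketch(Element):
    """Hypothetical element that carries document-wide data exactly once."""


def build_elements(paragraphs, filename, document_payload):
    # Small metadata (e.g. the filename) is duplicated on every element.
    elements = [Element(text=p, metadata={"filename": filename}) for p in paragraphs]
    # The large payload is stored once, on a single dedicated element.
    elements.append(
        DocumentDataSketch(text=document_payload, metadata={"filename": filename})
    )
    return elements
```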
0.18.1
2025-06-23 03:46:25 +00:00
qued
6866fda860
fix: use encoding in context class (#4030)
`_CsvPartitioningContext` defines an `_encoding` property, and that
property was meant to be used. Behaviorally this change should be a
no-op, but it supports future efforts where the partitioning context
applies internal logic.
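
A rough sketch of the pattern, assuming a context object that owns the encoding decision. `CsvContextSketch` and its members are hypothetical and only mirror the idea; they are not the actual `_CsvPartitioningContext`:

```python
import csv
import io


class CsvContextSketch:
    """Hypothetical partitioning context that owns the encoding decision."""

    def __init__(self, data: bytes, encoding=None):
        self._data = data
        self._requested_encoding = encoding

    @property
    def encoding(self) -> str:
        # Single place where future logic (e.g. detection) could live.
        # Assumption: fall back to UTF-8 when no encoding was requested.
        return self._requested_encoding or "utf-8"

    def rows(self):
        # Read through the property instead of the raw constructor argument.
        text = self._data.decode(self.encoding)
        return list(csv.reader(io.StringIO(text)))
```

Routing every read through the property means later changes to how the encoding is chosen do not require touching the call sites.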
2025-06-20 21:43:07 +00:00
luke-kucing
2aca876921
Luke/CVE python3.12 update (#4027) 2025-06-17 06:32:06 +00:00
jiajun-unstructured
b0dbd71aff
Parallelize tests (#4024) 2025-06-16 23:29:35 +00:00
Emmanuel Ferdman
531490d013
Migrate to modern bs4 interface (#4025)
## PR Summary
This small PR fixes the bs4 deprecation warnings which you can find in
the [CI
logs](https://github.com/Unstructured-IO/unstructured/actions/runs/15491657572/job/43729960936#step:3:2615):
```python
/app/unstructured/metrics/table/table_extraction.py:53: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
/app/unstructured/metrics/table/table_extraction.py:57: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
```

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-06-16 18:44:20 +00:00
cragwolfe
6ef2fc1ec6
chore: add claude (#4023) 2025-06-13 15:20:10 +00:00
Yuming Long
a80decdbd4
fix [NEX-28]: file_type is None for result_file_type in chunker partition json (#4022)
### Summary
Fixes `'NoneType' object has no attribute 'partitioner_shortname'`,
raised because `result_file_type = self._disambiguate_json_file_type`
could return `None` for the file type.
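
A minimal sketch of handling an optional result explicitly, under the assumption that a fallback value is acceptable when the type cannot be determined. Both functions are hypothetical toys and do not reproduce the library's detection logic:

```python
from typing import Optional


def disambiguate_json_file_type(payload: str) -> Optional[str]:
    """Hypothetical stand-in that may fail to identify the file type."""
    stripped = payload.lstrip()
    if stripped.startswith(("[", "{")):
        return "json"
    return None


def shortname_for(payload: str, fallback: str = "text") -> str:
    file_type = disambiguate_json_file_type(payload)
    # Guard: handle None explicitly instead of dereferencing it.
    if file_type is None:
        return fallback
    return file_type
```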
2025-06-13 15:19:09 +00:00
Yao You
5e43e36427
recompile on arm64 to get minimum reqs (#4020)
The new `torch==2.7.1` now comes with NVIDIA GPU support and triton as
dependencies. Those are not supported on `arm64`, and are not actually
used by `unstructured` on `amd64` either. This is a quick patch to remove
them from the .txt requirements files to unblock builds.
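
For illustration, removing such entries from a requirements file could look like the sketch below. The helper name, the prefixes, and the example pins are assumptions, not the procedure used in this PR:

```python
def drop_unwanted_requirements(lines, unwanted_prefixes=("nvidia-", "triton")):
    """Hypothetical filter: drop pinned packages whose names match a prefix."""
    kept = []
    for line in lines:
        name = line.strip().lower()
        # str.startswith accepts a tuple of candidate prefixes.
        if name.startswith(unwanted_prefixes):
            continue
        kept.append(line)
    return kept
```

A real compiled requirements file also contains indented `# via` comment lines, which this sketch leaves untouched.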
0.17.11-dev1
2025-06-12 21:44:35 +00:00
Yuming Long
55ad5fd637
fix chucking text None type has no attribute stripe (#4018)
### Summary
Fixes the error
`Error in chunk: 512: {"detail":"'NoneType' object has no attribute 'strip'"}`.
I found the logs under the same org (presumably from the same job).

screenshot:
![Screenshot 2025-06-11 at 10 15
57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb)

stack trace from the `utic-api` ES log doc:
![Screenshot 2025-06-11 at 2 01
01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa)



### Notes
Longer term, we should make the partitioner (vlm + utic-api) stop
returning elements whose text is null.
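
A minimal sketch of a defensive normalization, assuming a missing text value can be treated as an empty string. The helper names are hypothetical and are not taken from the chunking code:

```python
def normalized_text(element_text):
    """Treat a missing (None) text value as an empty string before stripping."""
    return (element_text or "").strip()


def non_empty_texts(texts):
    # Drop entries that are None or blank after normalization.
    return [t for t in (normalized_text(x) for x in texts) if t]
```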

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2025-06-12 18:28:46 +00:00
Pluto
ec209c6b5f
Remove IDs from HTML code (#4012)
In this pull request, the parent-child relationship for elements
generated with the v2 parser is based on actual element IDs instead of
IDs baked into the HTML code.
Together with some extra bug fixes, this allowed the json -> HTML script
to be simplified significantly.
2025-06-11 11:55:02 +00:00
Emily Voss
b6ab471f00
Drop Python 3.9 support due to dependency conflicts (#4017) 2025-06-10 23:32:11 -07:00
Emily Voss
06e4e54f5c
Bump requests to address CVEs (#4015) 2025-06-11 01:38:43 +00:00
Yao You
37d2f021a3
Feat/bump inference (#4013)
Bump `unstructured-inference` to `1.0.5`, which includes fix to ensure
model init is thread safe.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-06-06 09:52:17 +00:00
luke-kucing
a7e90f7990
resolve CVEs and HF issue (#4009)
Update requirements to resolve some open CVEs, and add the HF
environment variable to stop the image from reaching out to Hugging Face.

The Dockerfile now sets
ENV HF_HUB_OFFLINE=1

to stop it from pinging HF, which was an issue for a government
customer.
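
Outside of Docker, the same setting can be applied from Python. This is a sketch under the assumption that the variable is set before `huggingface_hub` is first imported, since, as far as I know, the library reads it at import time. `enable_hf_offline_mode` is a hypothetical helper:

```python
import os


def enable_hf_offline_mode(environ=os.environ):
    """Set HF_HUB_OFFLINE=1 unless the caller has already chosen a value."""
    # setdefault leaves an explicit user-provided value untouched.
    environ.setdefault("HF_HUB_OFFLINE", "1")
    return environ["HF_HUB_OFFLINE"]
```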

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
2025-06-04 18:52:58 +00:00
cragwolfe
3a048a5a02
chore: script to verify unstructured image outbound connectivity (#4008)
Sample output. The key point is that the `offline` mode (meaning
HF_HUB_OFFLINE=1 and DO_NOT_TRACK=true are set) results in no outbound
connections. This also holds when the locally cached models are removed
(the last scenario, `offline-and-missing-models`).

```
$ ./test-all-outbound-connectivity-scenarios.sh 
>>> Removing leftover sut_* containers…
Container: 543ac4b14370a18d790a2035e206e8c445754b825ec8b2887f4246f7404299c7  (scenario baseline)
tcpdump running on interface eth0...
>>> Running Python workload (capturing stdout/stderr)…
[INFO] partitioning /app/example-docs/ideas-page.html

<snip>

Python finished.  Log saved to /r/unstructured/scripts/image/python-output/offline-and-missing-models.log
pcap saved to /r/unstructured/scripts/image/pcaps/offline-and-missing-models.pcap

==================================================================
======================================== Begin Scenario: baseline

   -------------------------------------------
   tshark output for baseline
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
172.18.0.2           <-> 108.138.246.79            20 12 kB          20 4,176 bytes      40 16 kB         2.531247000        69.0419
172.18.0.2           <-> 3.214.154.119             11 5,777 bytes      12 2,656 bytes      23 8,433 bytes     0.029451000         0.4118
172.18.0.2           <-> 192.168.65.5               2 656 bytes       2 158 bytes       4 814 bytes     0.000000000         2.5310

   ------------------------------------------
   python log output for baseline
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:05:02,265 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:05:02,356 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443
2025-06-02 22:05:02,497 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0
2025-06-02 22:05:02,613 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:05:04,877 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:05:04,960 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:05:04,970 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:05:05,062 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0
2025-06-02 22:05:05,065 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:05:05,071 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:05:05,152 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:05:07,693 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:05:12,706 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:05:12,733 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:05:15,251 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:05:16,936 - unstructured_inference - INFO - Reading image file: /tmp/tmplkanlou1/unstructured_logo.png ...
2025-06-02 22:05:18,749 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpxdzdouhb/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: missing-models

   -------------------------------------------
   tshark output for missing-models
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
172.18.0.2           <-> 18.155.192.23         181834 273 MB      33502 1,813 kB   215336 275 MB        2.704106000        75.2880
172.18.0.2           <-> 3.168.86.41            79696 119 MB      15234 825 kB      94930 120 MB        9.066044000        68.9276
172.18.0.2           <-> 108.138.246.85            29 21 kB          25 5,760 bytes      54 27 kB         2.431857000        75.5633
172.18.0.2           <-> 3.214.154.119             12 5,831 bytes      12 2,656 bytes      24 8,487 bytes     0.016604000         0.3590
172.18.0.2           <-> 192.168.65.5               4 1,084 bytes       4 314 bytes       8 1,398 bytes     0.000000000         9.0651

   ------------------------------------------
   python log output for missing-models
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:06:30,961 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:06:31,046 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443
2025-06-02 22:06:31,300 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0
2025-06-02 22:06:31,310 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cdn-lfs.hf.co:443
2025-06-02 22:06:31,439 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/d9/51/d951593388d0af1cb4a029c311ba19f9b05090d9acc4606c2b82588297ea4397/134301ca94fb0df8027be9a6dad1908fe6218af8ffa4d34f0819c7c2226195f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27yolox_l0.05.onnx%3B+filename%3D%22yolox_l0.05.onnx%22%3B&Expires=1748904676&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY3Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9kOS81MS9kOTUxNTkzMzg4ZDBhZjFjYjRhMDI5YzMxMWJhMTlmOWIwNTA5MGQ5YWNjNDYwNmMyYjgyNTg4Mjk3ZWE0Mzk3LzEzNDMwMWNhOTRmYjBkZjgwMjdiZTlhNmRhZDE5MDhmZTYyMThhZjhmZmE0ZDM0ZjA4MTljN2MyMjI2MTk1ZjM~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=hxvwTzJynEvyE~UuirlH~L4c5Gc6rGksDp~Uw94ooayDrzshE2sDdHmvqgoQyzqxHHhZLjfiJlAGUtVO7nVAHSoqt8mH7H9yN51Zj5UGqI-odXtW1dmWCD3i7nwwNlrEEjlXlERkIScpIjpkJDnjwhzeE94l1s7gysIm8c6J8JTcDlsdMver5wAVrBtLSVUrDN8PC84xgOGerHVhX7-eZcUVG2OAIJHoB3s2gLPkW9aVM5fvCmmoXMPI9oCvgLUp-zhXv3cWHh~yURuY1ufoI4CFG5ogW8nV~V45qLlbRw9PrvfFoLS-wxBGDOhT3SRWVOJzRRmACByABGWYMXRFuw__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 216625723
2025-06-02 22:06:35,019 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:06:37,290 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:06:37,375 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 1469
2025-06-02 22:06:37,484 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:06:37,581 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 0
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
2025-06-02 22:06:37,586 - huggingface_hub.file_download - WARNING - Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
2025-06-02 22:06:37,681 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 1319
2025-06-02 22:06:37,685 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cas-bridge.xethub.hf.co:443
2025-06-02 22:06:37,778 - urllib3.connectionpool - DEBUG - https://cas-bridge.xethub.hf.co:443 "GET /xet-bridge-us/634929bd8146350b3a4cadaf/e78778928a1863786d5bb22a109a7ff1dbac47a29eae6f223a1fc2689172c347?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250602%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250602T220637Z&X-Amz-Expires=3600&X-Amz-Signature=c0a361e8982b1b05ee443054646b438e5a68d6767ef6df03dad6c5db20d0bdc5&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1748905597&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNTU5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MzQ5MjliZDgxNDYzNTBiM2E0Y2FkYWYvZTc4Nzc4OTI4YTE4NjM3ODZkNWJiMjJhMTA5YTdmZjFkYmFjNDdhMjllYWU2ZjIyM2ExZmMyNjg5MTcyYzM0NyoifV19&Signature=cRjZe56uJ8vxmmgRhPmp7XZX69PHKoXO9XN1bfq5n~84Vxz~HvCmg6MqtuUAFIiOWAHFhOuVzJpoiWTYT1JdZrtMeQTdywnZM-lIIn5Q45kzr8q8C58yvLz7vmKKrD9pOnGjJPaVavYYxEDdlAXbWf6xo433kKF4TfmQ9z7UIKt~M-XV9EdPUUBNhByucLVcTZ3sec5DqI4FmzK28fdJ1BMD4NyDjWW6hi~Lp2V3bW0FLCpI6qKGuikJ3E-OVcJDdDvZAqSN0-GoQyHIP9kp4RTqPBb7jekpZ3Uj91UWEmGx6YNuNlorAMGi61hrL6mAUUmW13OGua2vcJyk9LxZQg__&Key-Pair-Id=K2L8F4GPSG1IFC HTTP/1.1" 200 115434268
2025-06-02 22:06:39,612 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:06:39,696 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0
2025-06-02 22:06:39,714 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/42/d5/42d585781e0b74854ae52a1bc2a63d09896f1d70f86bff969f4c053508d6c2d6/80c49dee3da4822c009c5a7fe591e9223c5a2cfcf95a4067ca4dfb5a7b89c612?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1748904665&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy80Mi9kNS80MmQ1ODU3ODFlMGI3NDg1NGFlNTJhMWJjMmE2M2QwOTg5NmYxZDcwZjg2YmZmOTY5ZjRjMDUzNTA4ZDZjMmQ2LzgwYzQ5ZGVlM2RhNDgyMmMwMDljNWE3ZmU1OTFlOTIyM2M1YTJjZmNmOTVhNDA2N2NhNGRmYjVhN2I4OWM2MTI~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=GL15CLiGsmHno-DP25kfcuObjbrjd~ir5C5xapGqb9lda~5Wjy-3axBPftr1xWUnKh24Ay0mS49U8ZOcEdQxmzxQ97HiSX0-8s0-H187hV6mId6uxsULOGkNtjpkMKhfxe0qIfAmfi9gxl9JdiVfG5367HfPDVST8NvGPqMuKYoywSNWA-Uby-L9qb~EjtxbH9v1H2g6C0i9t2mn8ghD8BtTWEn4LY9c4O5bI~EQatNToNjsQTKa18LzXEowZnODLSLkyE7beLzfEpuTX9vlDzcAwKCPp-1M3xMZI4tzR-yfzyGhW19wqc6BVncUw53WSK7oOCv56HmFTYHhzOE-eQ__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 46807446
2025-06-02 22:06:40,394 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:06:40,396 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:06:40,460 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:06:42,985 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:06:48,019 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:06:48,045 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:06:50,557 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:06:52,358 - unstructured_inference - INFO - Reading image file: /tmp/tmpsha4r586/unstructured_logo.png ...
2025-06-02 22:06:54,199 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpg_5lk06v/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: analytics-online-only

   -------------------------------------------
   tshark output for analytics-online-only
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
172.18.0.2           <-> 54.236.224.89             12 5,831 bytes      12 2,656 bytes      24 8,487 bytes     0.032536000         0.3535
172.18.0.2           <-> 192.168.65.5               1 462 bytes       1 84 bytes        2 546 bytes     0.000000000         0.0322

   ------------------------------------------
   python log output for analytics-online-only
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:08:10,114 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:08:10,320 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:08:12,475 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:08:12,476 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:08:12,478 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:08:12,548 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:08:15,102 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:08:20,163 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:08:20,189 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:08:22,732 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:08:24,468 - unstructured_inference - INFO - Reading image file: /tmp/tmp4oud0ctq/unstructured_logo.png ...
2025-06-02 22:08:26,297 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpv24idrvu/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: offline

   -------------------------------------------
   tshark output for offline
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |

   ------------------------------------------
   python log output for offline
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:09:37,826 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:09:38,028 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:09:40,193 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:09:40,193 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:09:40,195 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:09:40,260 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:09:42,810 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:09:47,851 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:09:47,877 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:09:50,475 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:09:52,181 - unstructured_inference - INFO - Reading image file: /tmp/tmpn3rraz6o/unstructured_logo.png ...
2025-06-02 22:09:54,032 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpvbqk645u/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: offline-and-missing-models

   -------------------------------------------
   tshark output for offline-and-missing-models
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |

   ------------------------------------------
   python log output for offline-and-missing-models
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:11:05,743 - matplotlib.font_manager - INFO - generated new fontManager
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1484, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1401, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 285, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 308, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 107, in send
    raise OfflineModeIsEnabled(
huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 35, in <module>
  File "/app/unstructured/partition/auto.py", line 231, in partition
    elements = partition_image(
               ^^^^^^^^^^^^^^^^
  File "/app/unstructured/documents/elements.py", line 585, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/image.py", line 102, in partition_image
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 341, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 216, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 649, in _partition_pdf_or_image_local
    inferred_document_layout = process_file_with_model(
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 371, in process_file_with_model
    model = get_model(model_name, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/models/base.py", line 74, in get_model
    model.initialize(**initialize_params)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 40, in __getitem__
    value = evaluate(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 115, in download_if_needed_and_get_local_path
    return hf_hub_download(path_or_repo, filename, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 961, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1068, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1599, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
```
2025-06-02 15:21:17 -07:00
Emmanuel Ferdman
e42884a566
fix: resolve warnings of logger library (#3999)
# PR Summary
This PR resolves the deprecation warnings of the `logger` library:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
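
A minimal sketch of the kind of change involved, assuming the warning comes from the standard `logging` module, where `Logger.warn` is a deprecated alias of `Logger.warning`. The logger name and the `report_low_disk` helper are hypothetical and not taken from the PR.

```python
import logging

logger = logging.getLogger("example")  # hypothetical logger name


def report_low_disk(free_mb: int) -> None:
    # Deprecated spelling that triggers the DeprecationWarning:
    #     logger.warn("only %d MB free", free_mb)
    # Supported spelling, same arguments:
    logger.warning("only %d MB free", free_mb)
```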

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2025-05-22 17:53:42 +00:00
Ronny H
8be7108829
Replace Serverless API to Platform announcement on README page (#4003)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2025-05-20 16:54:53 +00:00
jordan-homan
570ee078a4
fix: throw validation error when json is passed with invalid unstructured json (#4002)
### Notes
Adds validation if `json` / `ndjson` are not valid unstructured schema.

### Testing
Manually tested serverless API with example json:

```

test_length = [] = 200

test_invalid = [{"invalid": "schema"}] = 422
test_invalid_ndjson ={"hi": "there"} = 422

test_chunk = [{"type":"Header","element_id":"a23fdadef9277f217563e217ebd074d5" ... = 200

```
2025-05-19 18:24:44 +00:00
Austin Walker
e3417d7e98
fix: Fix for Pillow error when extracting PNG images (#3998)
When I tried to partition a PNG file and extract images, I got an error
from Pillow:

```
WARNING  unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
  File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```

The issue is that a PNG can carry an alpha channel (mode `RGBA`), which
cannot be saved in JPEG format. We can fix this with a quick conversion.
I added a PNG test case that now passes with this fix.
2025-05-08 21:57:05 +00:00
Yao You
b814ece39f
fix: properly handle the case when an element's text is None (#3995)
Some elements, like `Image`, can have `None` as the value of their
`text` attribute. In that case the current chunking logic fails because
it expects the field to always have a length and to be splittable. The
fix is to use `element.text or ""` when checking length and to add an
early exit to avoid calling split on `None`.
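
The pattern described above can be sketched as follows. This is an illustration only: the `Element` dataclass and both helper functions are hypothetical stand-ins, not the chunking code touched by the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for an element whose text may be None."""
    text: Optional[str]


def text_length(element: Element) -> int:
    # `element.text or ""` maps None to an empty string before measuring.
    return len(element.text or "")


def split_text(element: Element, max_chars: int) -> List[str]:
    text = element.text or ""
    if not text:
        # Early exit: nothing to split, and no str method is called on None.
        return []
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```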
2025-05-05 18:08:11 +00:00
Marek Połom
604c4a7c5e
fix: failing build (#3993)
Successful build and test:
https://github.com/Unstructured-IO/unstructured/actions/runs/14730300234/job/41342657532

Failing test_json_to_html CI job fix here:
https://github.com/Unstructured-IO/unstructured/pull/3992
2025-04-29 13:29:58 +00:00
Marek Połom
b585df1588
fix: Add missing diffstat command to test_json_to_html CI job (#3992)
Removed some additional html fixtures. The original json fixtures from
which the html ones were generated were removed some time ago.
2025-04-29 13:29:44 +00:00
David Potter
fd9d796797
fix cve (#3989)
fix critical cve for h11. supposedly 0.16.0 fixes it.

---------

Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-04-29 00:58:05 +00:00
Nathan
27f503ce31
Update pdfminer_utils.py (#3974)
Fix for 'PSSyntaxError' import error:
"cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'"

Latest pdfminer-six doesn't import PSSyntaxError into
`pdfminer.pdfparser` anymore. It must now be directly imported from its
source (`pdfminer.psexceptions`)
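
The PR imports the name directly from its new location. For code that has to run against both older and newer releases, one possible approach is a small fallback helper; the sketch below is an assumption for illustration, and `import_first` is a hypothetical name that does not exist in `unstructured` or `pdfminer`.

```python
import importlib
from typing import Any, Sequence


def import_first(name: str, modules: Sequence[str]) -> Any:
    """Return attribute `name` from the first listed module that provides it."""
    for module_name in modules:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"cannot import name {name!r} from any of {list(modules)}")


# Intended use (requires pdfminer.six to be installed):
# PSSyntaxError = import_first(
#     "PSSyntaxError", ["pdfminer.psexceptions", "pdfminer.pdfparser"]
# )
```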
2025-04-08 00:47:24 -07:00
Philippe PRADOS
d570f4624b
Fix sort_page_element. ensures that sorting is stable and not random. (#3978)
`sort_page_element()` uses the element id to sort the elements, so two
executions of the same code on the same file produce different results:
the order of the elements is random.
This makes it impossible to write stable unit tests, for example, or to
obtain reproducible results.
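
To illustrate why an id in the sort key breaks reproducibility, here is a minimal sketch. It assumes the ids are randomly generated on each run; the `PageElement` class and both sort functions are hypothetical and do not reproduce the actual fix.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageElement:
    """Hypothetical element with a position and a randomly generated id."""
    top: float
    left: float
    text: str
    element_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def sort_with_random_tiebreak(elements: List[PageElement]) -> List[PageElement]:
    # Ties on position are broken by a random id, so the order of elements
    # at the same position can differ from one run to the next.
    return sorted(elements, key=lambda e: (e.top, e.left, e.element_id))


def sort_deterministic(elements: List[PageElement]) -> List[PageElement]:
    # The key uses only deterministic properties. Python's sort is stable,
    # so elements with equal keys keep their input order.
    return sorted(elements, key=lambda e: (e.top, e.left))
```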
2025-04-07 15:57:20 +00:00
cragwolfe
dfa17bd3a0
fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975) 2025-04-04 14:38:23 -07:00
cragwolfe
8fc41811eb
chore: add html path to ingest-test-fixtures-update-pr (#3977)
This should allow the `Ingest Test Fixtures Update PR` workflow to also
update expected html outputs.

E.g., before the change, the .html files would be left unmodified:

![image](https://github.com/user-attachments/assets/fa14c1a5-39bd-4e32-b4b9-9552eb312de1)


https://github.com/Unstructured-IO/unstructured/actions/runs/14234877547/job/39892334672
2025-04-03 15:42:25 -07:00
cragwolfe
c6b8ed4290
chore: allow changing default output dir for unstructured-get-json.sh (#3973) 2025-03-31 22:18:57 -07:00
cragwolfe
19fc1fcc72
feat: convenience unstructured-get-json.sh update (#3971)
* script now supports:
   * the --vlm flag, to process the document with the VLM strategy
   * optionally takes --vlm-model, --vlm-provider args
* optionally also writes .html outputs by converting unstructured .json
output
   * optionally opens those .html outputs in a browser
   
Tested with:
   ```
unstructured-get-json.sh --write-html --open-html --fast layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --hi-res layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --ocr-only layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai --vlm-model gpt-4o layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai --vlm-model gemini-2.0-flash-001 layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf

```

[layout-parser-paper-p2.pdf](https://github.com/user-attachments/files/19514007/layout-parser-paper-p2.pdf)
2025-03-31 09:45:01 -07:00
qued
9a239fa18b
build: remove test and dev deps from docker image (#3969)
Removed the dependencies contained in `test.txt`, `dev.txt`, and
`constraints.txt` from the things that get installed in the docker
image. In order to keep testing the image (running the tests), I added a
step to the `docker-test` make target to install `test.txt` and
`dev.txt`. Thus we presumably get a smaller image (probably not much
smaller), reduce the dependency chain of our images, and have less
exposure to vulnerabilities while still testing as robustly as before.

Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
made reference to non-existent make targets, which tells me it's stale
and wasn't being used.

### Review:
- Reviewer should ensure the dev and test dependencies are not being
installed in the docker image. One way to check is to check the logs in
CI, and note, e.g. that
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
is the first reference to `pytest` in the docker build and test logs,
after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is
passing.
2025-03-27 18:41:11 +00:00
qued
3f07840b80
chore: deprecate stage_for_label_studio (#3968)
This PR is to address [a
CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in
a recent scan.

The CVE has to do with the package `label_studio_sdk`. This relates to
the tool Label Studio, a data labeling platform. We built a staging
function that takes a list of elements and converts it to a format
suitable for passing to the LabelStudio platform.

We don't use the package with the vulnerability in the actual function,
we only use it to test the output of the function against the Label
Studio API schema.

Even the test where we use it is sort of questionable in value, since
it's really testing the schema against an old version of the LabelStudio
API (we are testing against a recording of the Label Studio API's
responses stored using `vcrpy`).

Label Studio has fixed the vulnerability as of version 1.0.10 of their
SDK, but we're stuck on 1.0.5 because 1.0.6 and above require
`numpy<2.0.0`.

This leaves us with several choices of resolution, some of which are:
1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to
resolve the CVE
2. Drop `label_studio_sdk` by either removing or rewriting the test.
3. Drop test and dev dependencies from the `unstructured` image.

We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a
follow-on PR.

Here we add a deprecation notice to `stage_for_label_studio` and remove
the offending test. Normally good practice would be to add a warning of
future deprecation to the function for a reasonable amount of time, but
in order to address the CVE immediately, we're deprecating it right
away.
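
A deprecation notice of this kind is commonly emitted with the standard `warnings` module. The sketch below assumes that approach; `legacy_stage` is a hypothetical stand-in for a staging function, and the message wording and return shape are invented for illustration.

```python
import warnings
from typing import Any, Dict, List


def legacy_stage(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical staging function that announces its own deprecation."""
    warnings.warn(
        "legacy_stage is deprecated and may be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this function
    )
    return [{"data": {"text": element.get("text", "")}} for element in elements]
```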

### Testing
Install the dependencies (`make install`) into a fresh environment, and
`pip list | grep label` should have no results. The scan artifact in CI
should contain no "high" or "critical" CVEs.
2025-03-26 23:37:03 +00:00
luke-kucing
347a4e5d9e
manual trigger of workflows to publish new image and new vers tag in … (#3965)
…quay

There were some open CVEs in the base-image. Those are resolved so
triggering a workflow with updated version tag
2025-03-25 19:38:47 +00:00
Sri Sudarsan
349728162e
Matches prefix to verify presence of DOCX,PPTX,XLSX files instead of standard file names (#3959)
Instead of looking for the presence of `word/document.xml`,
`ppt/presentation.xml` and `xl/workbook.xml` to identify DOCX, PPTX and
XLSX files, we look for the patterns `word/document*.xml`,
`ppt/presentation*.xml` and `xl/workbook*.xml`, as certain files
generated by Office 365 contain parts with different names.
Fixes https://github.com/Unstructured-IO/unstructured/issues/3937
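
A rough sketch of pattern-based detection using only the standard library. The three patterns come from the description above; the `sniff_office_type` function, its return labels, and the use of `fnmatch` are assumptions for illustration and not the PR's implementation.

```python
import fnmatch
import io
import zipfile
from typing import Optional

# Patterns from the PR description, mapped to a short type label.
_PATTERNS = {
    "word/document*.xml": "docx",
    "ppt/presentation*.xml": "pptx",
    "xl/workbook*.xml": "xlsx",
}


def sniff_office_type(data: bytes) -> Optional[str]:
    """Return 'docx', 'pptx' or 'xlsx' when a matching archive member exists."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    for pattern, label in _PATTERNS.items():
        # `*` also matches the empty string, so `word/document.xml` still matches.
        if any(fnmatch.fnmatchcase(name, pattern) for name in names):
            return label
    return None
```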

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2025-03-21 16:27:13 +00:00
Antonio Jose Jimeno Yepes
0fa5174bd7
Image within div or span with no text is annotated as Image (#3962)
Ticket: https://unstructured-ai.atlassian.net/browse/ML-942

The following uncompressed HTML document can be used to test the
transformation using the `partition_html` function from the VLM
partitioner.


[recalibrating-risk-report.pdf.json.html.zip](https://github.com/user-attachments/files/19330528/recalibrating-risk-report.pdf.json.html.zip)
0.17.2
2025-03-20 04:09:02 +00:00
Yao You
7de630e45e
Feat/bump numpy to 2 (#3961)
This PR updates a few dependencies so that they are compatible with
`numpy>=2`.
2025-03-18 21:33:48 +00:00
Yao You
4e424efd22
feat: use lxml instead of bs4 to parse hOCR data (#3960)
- `lxml` is a much faster library than `bs4` when the input data is
regular
- since the hOCR data is guaranteed to be regular (programmatically
generated) we don't need `bs4` here to parse the data
- `lxml` improves parsing speed by about 10x

Example runtime profiling locally using the same `hocr` data from 1 page
pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main
and `agent.hocr_to_dataframe` is the PR's method.

![Screenshot 2025-03-17 at 12 14
59 PM](https://github.com/user-attachments/assets/7c483857-8711-4d72-8954-e83510fef783)
2025-03-18 00:36:19 +00:00
ryannikolaidis
66bf4b0198
feat: support extracting image url in html (#3955)
also removes mimetype when base64 is not included in image metadata

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-13 22:41:10 +00:00
Yao You
2dceac34b5
Feat/remove reference of PageLayout.elements (#3943)
This PR removes usage of `PageLayout.elements` from partition function,
except for when `analysis=True`. This PR updates the partition logic so
that `PageLayout.elements_array` is used everywhere to save memory and
cpu cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.

`PageLayout.elements` uses a list to store layout elements' data while
`elements_array` uses `numpy` array to store the data, which has much
lower memory requirements. Measured with `memory_profiler`, the
difference is usually around 10x.
0.17.0
2025-03-12 15:21:21 +00:00
Yao You
8759b0aac9
feat: allow passing down of ocr agent and table agent (#3954)
This PR allows passing down both `ocr_agent` and `table_ocr_agent` as
parameters to specify the `OCRAgent` class for the page and tables, if
any, respectively. Both default to `tesseract`, consistent
with the present default behavior.

We used to rely on env variables to specify the agents but os env can be
changed during runtime outside of the caller's control. This method of
passing down the variables ensures that specification is independent of
env changes.

## testing

Run partition on `example-docs/img/layout-parser-paper-with-table.jpg`
with two different settings. Note that this test requires the
`paddleocr` extra.

```python
from unstructured.partition.auto import partition
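from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE

# Path of the example image named in the testing notes.
f = "example-docs/img/layout-parser-paper-with-table.jpg"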
elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE)
elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT)
```

We should see both calls finish, with slight differences in the table
element's text attribute.
2025-03-11 16:36:31 +00:00
ryannikolaidis
0001a33dba
fix: pass extract image args to all partitioners (#3950)
This is needed in order for the user to specify whether to extract the
base64 for images, which are now parsed by the html partitioner.

## Testing

Adds test that validates this by calling the auto-partitioner with
appropriate arguments partitioning an html file with base64 embedded
image.
2025-03-10 04:15:08 +00:00
ryannikolaidis
c0457c1cc3
feat: include images when partitioning html (#3945)
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the html partitioner. More
importantly we also don’t currently have a defined “block” / mapping to
support these. This adds these mappings and logic to process.

It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.

The partitioned Image elements set their text to the img tag’s alt text
if available.

The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.
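
The mapping described above (alt text as element text, url or base64 depending on the `src`) can be sketched with the standard `html.parser` module. This is not the partitioner's code: `ImageCollector` is hypothetical, and the dictionary keys `image_url` and `image_mime_type` are assumed names chosen for the example.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ImageCollector(HTMLParser):
    """Hypothetical collector that records one dict per <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        # The alt text, when present, becomes the element text.
        image = {"text": attributes.get("alt") or ""}
        if src.startswith("data:"):
            # Data URI of the form data:<mime>;base64,<payload>
            header, _, payload = src.partition(",")
            image["image_mime_type"] = header[len("data:"):].split(";")[0]
            image["image_base64"] = payload
        else:
            # A regular url is kept as a url instead of base64 data.
            image["image_url"] = src
        self.images.append(image)
```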

## Testing

unit tests have been added for explicit coverage.
existing integration tests and other unit test fixtures have been
updated to account for `Image` elements now present

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-08 01:25:21 +00:00
Pluto
74b0647aa2
Fix json bytes content type detection (#3941)
Fixes order of content type detection strategies for byte-encoded jsons.

Example
```
import io
import json
from unstructured.file_utils.filetype import detect_filetype
json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")
```

Before
PDF

Now
JSON
0.16.25
2025-03-07 10:33:33 +00:00
Yao You
961c8d5b11
feat: use block matrix to reduce peak memory usage for matmul (#3947)
This PR targets the most memory-expensive operation in partitioning PDFs
and images: deduplicating pdfminer elements. On large pages the number of
elements can exceed 10k, which generates multiple 10k x 10k
double-precision float matrices during deduplication, pushing peak
memory usage close to 13 GB.
![Screenshot 2025-03-06 at 3 22
52 PM](https://github.com/user-attachments/assets/fdc26806-947b-4b5a-9d8e-4faeb0179b9f)


This PR breaks the computation down by computing partial IOU. More
precisely, it computes IOU for 2000 elements at a time against all the
elements, reducing peak memory usage by about 10x, to around
1.6 GB.

![image](https://github.com/user-attachments/assets/e7b9f149-2b6a-4fc9-83c7-652e20849b76)


The block size is configurable based on user preference for peak memory
usage and it is set by changing the env `UNST_MATMUL_MEMORY_CAP_IN_GB`.
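
For scale: one 10,000 x 10,000 matrix of 8-byte floats takes 800 MB, while a 2,000 x 10,000 block takes 160 MB. The blocking idea can be sketched in pure Python as below. This is an illustration under assumptions, not the PR's vectorized implementation: the function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are all hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_blocks(
    boxes: Sequence[Box], block_size: int = 2000
) -> Iterator[Tuple[int, List[List[float]]]]:
    # Only a (block_size x n) slice of the full n x n matrix exists at a time.
    for start in range(0, len(boxes), block_size):
        block = boxes[start:start + block_size]
        yield start, [[iou(a, b) for b in boxes] for a in block]


def duplicate_pairs(
    boxes: Sequence[Box], threshold: float = 0.5, block_size: int = 2000
) -> List[Tuple[int, int]]:
    pairs: List[Tuple[int, int]] = []
    for start, rows in iou_blocks(boxes, block_size):
        for offset, row in enumerate(rows):
            i = start + offset
            # Keep each overlapping pair once (j > i), then drop the block.
            pairs.extend((i, j) for j, v in enumerate(row) if j > i and v >= threshold)
    return pairs
```

The result does not depend on `block_size`; only the size of the intermediate data held at once changes.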
0.16.24
2025-03-07 00:28:36 +00:00
Nathan Van Gheem
19373de5ff
Enable dynamic file type registration (#3946)
The purpose of this PR is to enable registering new file types
dynamically.

The PR enables this through 2 primary functions:

1. `unstructured.file_utils.model.create_file_type` This registers the
new `FileType` enum which enables the rest of unstructured to understand
a new type of file
2. `unstructured.file_utils.model.register_partitioner` Decorator that
enables registering a partitioner function to run for a file type.
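
The general decorator-registry pattern behind this kind of API can be sketched as follows. The sketch does not use or reproduce the signatures of `create_file_type` or `register_partitioner`; every name in it (`register_partitioner_for`, `partition_foo`, `partition_by_type`, the `"foo"` type) is hypothetical.

```python
from typing import Callable, Dict, List

Partitioner = Callable[[bytes], List[str]]

# Hypothetical registry mapping a file type label to its partitioner.
_PARTITIONERS: Dict[str, Partitioner] = {}


def register_partitioner_for(file_type: str) -> Callable[[Partitioner], Partitioner]:
    """Decorator factory that records a partitioner for `file_type`."""
    def decorator(func: Partitioner) -> Partitioner:
        _PARTITIONERS[file_type] = func
        return func  # the function itself is returned unchanged
    return decorator


@register_partitioner_for("foo")
def partition_foo(data: bytes) -> List[str]:
    # Toy partitioner: one element per non-empty line.
    return [line for line in data.decode("utf-8").splitlines() if line.strip()]


def partition_by_type(file_type: str, data: bytes) -> List[str]:
    try:
        partitioner = _PARTITIONERS[file_type]
    except KeyError:
        raise ValueError(f"no partitioner registered for {file_type!r}") from None
    return partitioner(data)
```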

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2025-03-06 22:09:42 +00:00
Roman Isecke
061462de22
fix/drop ndjson extra dep (#3944) 2025-03-05 17:52:00 +00:00
Marek Połom
f333d7fe7f
feat: Json elements to HTML converter (#3936)
## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all
test HTML fixtures. Original JSON files, from which these HTML fixtures
are generated, were taken from
`test_unstructured_ingest/expected-structured-output`
2025-03-04 13:57:35 +00:00
Yao You
43b682ad3f
feat: allow extraction of camel cased element type names (#3938)
This PR allows element types with CamelCase names to be extractable
using `extract_image_block_types` variable.

Before: specifying `extract_image_block_types=["NarrativeText"]` (or any
casing of `NarrativeText`) would raise a warning that it doesn't match
any available types, and no image would be extracted for this element
type

Now: specifying `extract_image_block_types=["NarrativeText"]` extracts
images for this element type

## testing

```python
from unstructured.partition.auto import partition
f = "example-docs/pdf/embedded-images-tables.pdf"
elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"])
```

Without this PR no figures would be extracted. With this PR a local
folder is created to contain images of the narrative text elements,
at paths like `./figures/figure-1-1.jpg`
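
One way to make such matching independent of casing is to compare lowercased names against a lookup of canonical names. The sketch below assumes that approach; `AVAILABLE_TYPES` is an illustrative subset and `normalize_block_types` is a hypothetical helper, not the PR's code.

```python
from typing import Iterable, List

# Illustrative subset of element type names, not the library's full list.
AVAILABLE_TYPES = ["Image", "Table", "NarrativeText", "Title"]


def normalize_block_types(requested: Iterable[str]) -> List[str]:
    """Map requested names to canonical type names, ignoring case."""
    lookup = {name.lower(): name for name in AVAILABLE_TYPES}
    matched: List[str] = []
    for name in requested:
        canonical = lookup.get(name.lower())
        # Unknown names are skipped; duplicates are collapsed.
        if canonical is not None and canonical not in matched:
            matched.append(canonical)
    return matched
```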

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2025-03-04 01:33:05 +00:00
cragwolfe
2addb19473
chore: remove sphinx docs (#3923)
Docs are now at https://docs.unstructured.io
2025-02-20 22:25:41 +00:00
Pluto
0df50fe6e8
Fix file detection when spooled file is passed (#3932)
This pull request fixes the scenario where a `SpooledTemporaryFile` is
passed to `detect_filetype`. In such cases some weird number was assigned
as `name` (and it couldn't be overwritten, as `SpooledTemporaryFile` can't
have fields assigned 😩), so I added another scenario to our object
factory that handles this type of file.
For `BytesIO` the `name` attr is None, as it should be, and some other
metadata fields are leveraged for file type recognition
0.16.23
2025-02-20 13:00:25 +00:00