unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-03 18:49:53 +00:00

Author	SHA1	Message	Date
qued	8fd07fd9f6	feat: Add simple script to sync fork with local branch (#4102 ) #### Testing: From the base folder of this repo, run: ```bash ./scripts/sync_fork.sh git@github.com:aseembits93/unstructured.git optimize-_assign_hash_ids-memtfran ``` Check to make sure the only remote is `origin` with: ```bash git remote ``` Check the diff from `main` with: ```bash git diff main ```	2025-09-26 16:36:56 +00:00
David Potter	0d20f6a9b1	email date format flexibility (#4072 ) We are seeing some .eml files come through the VLM partitioner, which then downgrades to hi-res, I believe. For some reason they have a date format that is not the standard email format, but it is still legitimate. This change uses a more robust date package, which is already installed, to parse the date. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: potter-potter <potter-potter@users.noreply.github.com>	2025-08-13 18:55:24 +00:00
qued	c7c3e3c082	feat: convert elements to markdown (#4055 ) Creates a staging function `elements_to_md` to convert lists of `Elements` to markdown strings (or a markdown file). Includes unit tests as well as ingest tests and expected output fixtures.	2025-07-16 14:34:29 +00:00
qued	7764fb6fd4	build: drop remaining Python 3.9 refs (#4049 ) Dropped variables that said we support Python 3.9 in `setup.py`, as well as any remaining references to Python 3.9. I also checked the pins and removed several that don't seem necessary any more.	2025-07-10 16:43:15 +00:00
Pluto	ec209c6b5f	Remove IDs from HTML code (#4012 ) In this pull request, the parent-child relationship for elements generated with the v2 parser is based on actual element IDs instead of IDs baked somewhere into the HTML code. Together with some extra bug fixes, this allowed the JSON -> HTML script to be simplified significantly.	2025-06-11 11:55:02 +00:00
Emily Voss	b6ab471f00	Drop Python 3.9 support due to dependency conflicts (#4017 )	2025-06-10 23:32:11 -07:00
cragwolfe	3a048a5a02	chore: script to verify unstructured image outbound connectivity (#4008 ) Sample output. The key thing here is that the `offline` mode (meaning HF_HUB_OFFLINE=1 and DO_NOT_TRACK=true are set) results in no outbound connections. This is also true if the locally cached models are removed (the last scenario, `offline-and-missing-models`). ``` $ ./test-all-outbound-connectivity-scenarios.sh >>> Removing leftover sut_* containers… Container: 543ac4b14370a18d790a2035e206e8c445754b825ec8b2887f4246f7404299c7 (scenario baseline) tcpdump running on interface eth0... >>> Running Python workload (capturing stdout/stderr)… [INFO] partitioning /app/example-docs/ideas-page.html <snip> Python finished. Log saved to /r/unstructured/scripts/image/python-output/offline-and-missing-models.log pcap saved to /r/unstructured/scripts/image/pcaps/offline-and-missing-models.pcap ================================================================== ======================================== Begin Scenario: baseline ------------------------------------------- tshark output for baseline ------------------------------------------- IPv4 Conversations Filter:<No Filter> \| <- \| \| -> \| \| Total \| Relative \| Duration \| \| Frames Bytes \| \| Frames Bytes \| \| Frames Bytes \| Start \| \| 172.18.0.2 <-> 108.138.246.79 20 12 kB 20 4,176 bytes 40 16 kB 2.531247000 69.0419 172.18.0.2 <-> 3.214.154.119 11 5,777 bytes 12 2,656 bytes 23 8,433 bytes 0.029451000 0.4118 172.18.0.2 <-> 192.168.65.5 2 656 bytes 2 158 bytes 4 814 bytes 0.000000000 2.5310 ------------------------------------------ python log output for baseline ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:05:02,265 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 
22:05:02,356 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443 2025-06-02 22:05:02,497 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0 2025-06-02 22:05:02,613 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:05:04,877 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:05:04,960 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:05:04,970 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:05:05,062 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0 2025-06-02 22:05:05,065 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:05:05,071 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:05:05,152 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... 
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:05:07,693 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:05:12,706 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:05:12,733 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:05:15,251 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:05:16,936 - unstructured_inference - INFO - Reading image file: /tmp/tmplkanlou1/unstructured_logo.png ... 2025-06-02 22:05:18,749 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpxdzdouhb/dense_doc.pdf ... 
================================================================== ======================================== Begin Scenario: missing-models ------------------------------------------- tshark output for missing-models ------------------------------------------- IPv4 Conversations Filter:<No Filter> \| <- \| \| -> \| \| Total \| Relative \| Duration \| \| Frames Bytes \| \| Frames Bytes \| \| Frames Bytes \| Start \| \| 172.18.0.2 <-> 18.155.192.23 181834 273 MB 33502 1,813 kB 215336 275 MB 2.704106000 75.2880 172.18.0.2 <-> 3.168.86.41 79696 119 MB 15234 825 kB 94930 120 MB 9.066044000 68.9276 172.18.0.2 <-> 108.138.246.85 29 21 kB 25 5,760 bytes 54 27 kB 2.431857000 75.5633 172.18.0.2 <-> 3.214.154.119 12 5,831 bytes 12 2,656 bytes 24 8,487 bytes 0.016604000 0.3590 172.18.0.2 <-> 192.168.65.5 4 1,084 bytes 4 314 bytes 8 1,398 bytes 0.000000000 9.0651 ------------------------------------------ python log output for missing-models ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:06:30,961 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:06:31,046 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443 2025-06-02 22:06:31,300 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0 2025-06-02 22:06:31,310 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cdn-lfs.hf.co:443 2025-06-02 22:06:31,439 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET 
/repos/d9/51/d951593388d0af1cb4a029c311ba19f9b05090d9acc4606c2b82588297ea4397/134301ca94fb0df8027be9a6dad1908fe6218af8ffa4d34f0819c7c2226195f3?response-content-disposition=inline%3B+filename%3DUTF-8%27%27yolox_l0.05.onnx%3B+filename%3D%22yolox_l0.05.onnx%22%3B&Expires=1748904676&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY3Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9kOS81MS9kOTUxNTkzMzg4ZDBhZjFjYjRhMDI5YzMxMWJhMTlmOWIwNTA5MGQ5YWNjNDYwNmMyYjgyNTg4Mjk3ZWE0Mzk3LzEzNDMwMWNhOTRmYjBkZjgwMjdiZTlhNmRhZDE5MDhmZTYyMThhZjhmZmE0ZDM0ZjA4MTljN2MyMjI2MTk1ZjM~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=hxvwTzJynEvyE~UuirlH~L4c5Gc6rGksDp~Uw94ooayDrzshE2sDdHmvqgoQyzqxHHhZLjfiJlAGUtVO7nVAHSoqt8mH7H9yN51Zj5UGqI-odXtW1dmWCD3i7nwwNlrEEjlXlERkIScpIjpkJDnjwhzeE94l1s7gysIm8c6J8JTcDlsdMver5wAVrBtLSVUrDN8PC84xgOGerHVhX7-eZcUVG2OAIJHoB3s2gLPkW9aVM5fvCmmoXMPI9oCvgLUp-zhXv3cWHh~yURuY1ufoI4CFG5ogW8nV~V45qLlbRw9PrvfFoLS-wxBGDOhT3SRWVOJzRRmACByABGWYMXRFuw__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 216625723 2025-06-02 22:06:35,019 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the table structure model ... 
2025-06-02 22:06:37,290 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:06:37,375 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 1469 2025-06-02 22:06:37,484 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:06:37,581 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 0 Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet` 2025-06-02 22:06:37,586 - huggingface_hub.file_download - WARNING - Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. 
For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet` 2025-06-02 22:06:37,681 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 1319 2025-06-02 22:06:37,685 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cas-bridge.xethub.hf.co:443 2025-06-02 22:06:37,778 - urllib3.connectionpool - DEBUG - https://cas-bridge.xethub.hf.co:443 "GET /xet-bridge-us/634929bd8146350b3a4cadaf/e78778928a1863786d5bb22a109a7ff1dbac47a29eae6f223a1fc2689172c347?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250602%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250602T220637Z&X-Amz-Expires=3600&X-Amz-Signature=c0a361e8982b1b05ee443054646b438e5a68d6767ef6df03dad6c5db20d0bdc5&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1748905597&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNTU5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MzQ5MjliZDgxNDYzNTBiM2E0Y2FkYWYvZTc4Nzc4OTI4YTE4NjM3ODZkNWJiMjJhMTA5YTdmZjFkYmFjNDdhMjllYWU2ZjIyM2ExZmMyNjg5MTcyYzM0NyoifV19&Signature=cRjZe56uJ8vxmmgRhPmp7XZX69PHKoXO9XN1bfq5n~84Vxz~HvCmg6MqtuUAFIiOWAHFhOuVzJpoiWTYT1JdZrtMeQTdywnZM-lIIn5Q45kzr8q8C58yvLz7vmKKrD9pOnGjJPaVavYYxEDdlAXbWf6xo433kKF4TfmQ9z7UIKt~M-XV9EdPUUBNhByucLVcTZ3sec5DqI4FmzK28fdJ1BMD4NyDjWW6hi~Lp2V3bW0FLCpI6qKGuikJ3E-OVcJDdDvZAqSN0-GoQyHIP9kp4RTqPBb7jekpZ3Uj91UWEmGx6YNuNlorAMGi61hrL6mAUUmW13OGua2vcJyk9LxZQg__&Key-Pair-Id=K2L8F4GPSG1IFC HTTP/1.1" 200 115434268 2025-06-02 22:06:39,612 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:06:39,696 - urllib3.connectionpool - DEBUG - 
https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0 2025-06-02 22:06:39,714 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/42/d5/42d585781e0b74854ae52a1bc2a63d09896f1d70f86bff969f4c053508d6c2d6/80c49dee3da4822c009c5a7fe591e9223c5a2cfcf95a4067ca4dfb5a7b89c612?response-content-disposition=inline%3B+filename%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1748904665&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy80Mi9kNS80MmQ1ODU3ODFlMGI3NDg1NGFlNTJhMWJjMmE2M2QwOTg5NmYxZDcwZjg2YmZmOTY5ZjRjMDUzNTA4ZDZjMmQ2LzgwYzQ5ZGVlM2RhNDgyMmMwMDljNWE3ZmU1OTFlOTIyM2M1YTJjZmNmOTVhNDA2N2NhNGRmYjVhN2I4OWM2MTI~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=GL15CLiGsmHno-DP25kfcuObjbrjd~ir5C5xapGqb9lda~5Wjy-3axBPftr1xWUnKh24Ay0mS49U8ZOcEdQxmzxQ97HiSX0-8s0-H187hV6mId6uxsULOGkNtjpkMKhfxe0qIfAmfi9gxl9JdiVfG5367HfPDVST8NvGPqMuKYoywSNWA-Uby-L9qb~EjtxbH9v1H2g6C0i9t2mn8ghD8BtTWEn4LY9c4O5bI~EQatNToNjsQTKa18LzXEowZnODLSLkyE7beLzfEpuTX9vlDzcAwKCPp-1M3xMZI4tzR-yfzyGhW19wqc6BVncUw53WSK7oOCv56HmFTYHhzOE-eQ__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 46807446 2025-06-02 22:06:40,394 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:06:40,396 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:06:40,460 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... 
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:06:42,985 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:06:48,019 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:06:48,045 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:06:50,557 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:06:52,358 - unstructured_inference - INFO - Reading image file: /tmp/tmpsha4r586/unstructured_logo.png ... 2025-06-02 22:06:54,199 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpg_5lk06v/dense_doc.pdf ... 
================================================================== ======================================== Begin Scenario: analytics-online-only ------------------------------------------- tshark output for analytics-online-only ------------------------------------------- IPv4 Conversations Filter:<No Filter> \| <- \| \| -> \| \| Total \| Relative \| Duration \| \| Frames Bytes \| \| Frames Bytes \| \| Frames Bytes \| Start \| \| 172.18.0.2 <-> 54.236.224.89 12 5,831 bytes 12 2,656 bytes 24 8,487 bytes 0.032536000 0.3535 172.18.0.2 <-> 192.168.65.5 1 462 bytes 1 84 bytes 2 546 bytes 0.000000000 0.0322 ------------------------------------------ python log output for analytics-online-only ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:08:10,114 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:08:10,320 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:08:12,475 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:08:12,476 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:08:12,478 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. 
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:08:12,548 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:08:15,102 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:08:20,163 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:08:20,189 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:08:22,732 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:08:24,468 - unstructured_inference - INFO - Reading image file: /tmp/tmp4oud0ctq/unstructured_logo.png ... 2025-06-02 22:08:26,297 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpv24idrvu/dense_doc.pdf ... 
================================================================== ======================================== Begin Scenario: offline ------------------------------------------- tshark output for offline ------------------------------------------- IPv4 Conversations Filter:<No Filter> \| <- \| \| -> \| \| Total \| Relative \| Duration \| \| Frames Bytes \| \| Frames Bytes \| \| Frames Bytes \| Start \| \| ------------------------------------------ python log output for offline ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:09:37,826 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:09:38,028 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:09:40,193 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:09:40,193 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:09:40,195 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:09:40,260 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... 
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:09:42,810 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:09:47,851 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:09:47,877 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:09:50,475 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:09:52,181 - unstructured_inference - INFO - Reading image file: /tmp/tmpn3rraz6o/unstructured_logo.png ... 2025-06-02 22:09:54,032 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpvbqk645u/dense_doc.pdf ... 
================================================================== ======================================== Begin Scenario: offline-and-missing-models ------------------------------------------- tshark output for offline-and-missing-models ------------------------------------------- IPv4 Conversations Filter:<No Filter> \| <- \| \| -> \| \| Total \| Relative \| Duration \| \| Frames Bytes \| \| Frames Bytes \| \| Frames Bytes \| Start \| \| ------------------------------------------ python log output for offline-and-missing-models ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:11:05,743 - matplotlib.font_manager - INFO - generated new fontManager Traceback (most recent call last): File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1484, in _get_metadata_or_catch_error metadata = get_hf_file_metadata( ^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1401, in get_hf_file_metadata r = _request_wrapper( ^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 285, in _request_wrapper response = _request_wrapper( ^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 308, in _request_wrapper response = get_session().request(method=method, url=url, **params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request resp 
= self.send(prep, **send_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send r = adapter.send(request, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 107, in send raise OfflineModeIsEnabled( huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<stdin>", line 35, in <module> File "/app/unstructured/partition/auto.py", line 231, in partition elements = partition_image( ^^^^^^^^^^^^^^^^ File "/app/unstructured/documents/elements.py", line 585, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/file_utils/filetype.py", line 774, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/image.py", line 102, in partition_image return partition_pdf_or_image( ^^^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/pdf.py", line 341, in partition_pdf_or_image elements = _partition_pdf_or_image_local( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/utils.py", line 216, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/pdf.py", line 649, in _partition_pdf_or_image_local inferred_document_layout = process_file_with_model( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 371, in process_file_with_model model = get_model(model_name, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/models/base.py", line 74, in get_model model.initialize(**initialize_params) File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 40, in __getitem__ value = evaluate(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 115, in download_if_needed_and_get_local_path return hf_hub_download(path_or_repo, filename, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 961, in hf_hub_download return _hf_hub_download_to_cache_dir( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1068, in _hf_hub_download_to_cache_dir _raise_on_head_call_error(head_call_error, force_download, local_files_only) File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1599, in _raise_on_head_call_error raise LocalEntryNotFoundError( huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on. ```	2025-06-02 15:21:17 -07:00
Marek Połom	604c4a7c5e	fix: failing build (#3993 ) Successful build and test: https://github.com/Unstructured-IO/unstructured/actions/runs/14730300234/job/41342657532 Failing test_json_to_html CI job fix here: https://github.com/Unstructured-IO/unstructured/pull/3992	2025-04-29 13:29:58 +00:00
cragwolfe	c6b8ed4290	chore: allow changing default output dir for unstructured-get-json.sh (#3973 )	2025-03-31 22:18:57 -07:00
cragwolfe	19fc1fcc72	feat: convenience unstructured-get-json.sh update (#3971 ) * script now supports: * the --vlm flag, to process the document with the VLM strategy * optionally takes --vlm-model, --vlm-provider args * optionally also writes .html outputs by converting unstructured .json output * optionally opens those .html outputs in a browser Tested with: ``` unstructured-get-json.sh --write-html --open-html --fast layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --hi-res layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --ocr-only layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai --vlm-model gpt-4o layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai --vlm-model gemini-2.0-flash-001 layout-parser-paper-p2.pdf unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf ``` [layout-parser-paper-p2.pdf](https://github.com/user-attachments/files/19514007/layout-parser-paper-p2.pdf)	2025-03-31 09:45:01 -07:00
Marek Połom	f333d7fe7f	feat: Json elements to HTML converter (#3936 ) ## NOTE `test_unstructured_ingest/expected-structured-output-html` contains all test HTML fixtures. Original JSON files, from which these HTML fixtures are generated, were taken from `test_unstructured_ingest/expected-structured-output`	2025-03-04 13:57:35 +00:00
cragwolfe	238f985dda	feat: add --images support to unstructured-get-json.sh (#3888 ) E.g., now can run: ```bash # extracts base64 encoded image data for `Table` and `Image` elements $ unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf # also extracts `Title` elements (see screenshot) $ IMAGE_BLOCK_TYPES='"title","table","image"' unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf ``` It was discovered during testing that "narrativetext" does not work, probably due to camel casing of NarrativeText 😬 ![image](https://github.com/user-attachments/assets/e6414a57-81e1-4560-b1b2-dce3b1c2c804)	2025-01-27 16:09:13 -08:00
Marianna	4140f625d0	add script to render html from unstructured elements (#3799 ) Script to render HTML from unstructured elements. NOTE: This script is not intended to be used as a module. NOTE: This script is only intended to be used with outputs that have a non-empty `metadata.text_as_html`. TODO: It was noted that the `unstructured_elements_to_ontology` function always returns a single page, so this script uses helper functions to handle multiple pages. I am not sure whether this was intended or is a bug; if it is a bug, it would require somewhat longer debugging, so I used workarounds to make the script usable quickly. Usage: test with any outputs that have a non-empty `metadata.text_as_html`. Example files attached: [Example-Bill-of-Lading-Waste.docx.pdf.json](https://github.com/user-attachments/files/17922898/Example-Bill-of-Lading-Waste.docx.pdf.json) [Breast_Cancer1-5.pdf.json](https://github.com/user-attachments/files/17922899/Breast_Cancer1-5.pdf.json)	2024-12-04 19:46:51 -08:00
cragwolfe	9445a2dd01	chore: fix CHANGELOG formatting (#3800 ) Fixes formatting in CHANGELOG.md where most of the page was bold and indented. (verify the branch version here: https://github.com/Unstructured-IO/unstructured/blob/crag/tables-tweak/CHANGELOG.md) Bonus tweak: u-table-inspect.sh is more robust to adding borders for visualizations	2024-11-26 15:38:42 -08:00
Roman Isecke	9049e4e2be	feat/remove ingest code, use new dep for tests (#3595 ) ### Description Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572 but maintaining all ingest tests, running them by pulling in the latest version of unstructured-ingest. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-10-15 10:01:34 -05:00
Steve Canny	086b8d6f8a	rfctr(part): prepare for pluggable auto-partitioners 2 (#3657 ) Summary Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata` field from `ElementMetadata`. Additional Context - "regex-metadata" was an experimental feature that didn't pan out. - It's implemented by one of the post-partitioning metadata decorators, so get rid of it as part of the cleanup before consolidating those decorators.	2024-09-24 17:33:25 +00:00
Steve Canny	03c2bf8f1f	rfctr(part): extract partition.common submodules (#3649 ) Summary In preparation for consolidating post-partitioning metadata decorators, extract `partition.common` module into a sub-package (directory) and extract `partition.common.metadata` module to house metadata-specific objects shared by partitioners. Additional Context - This new module will be the home of the new consolidated metadata decorator. - The consolidated decorator is a step toward removing post-processing decorators from _delegating_ partitioners. A delegating partitioner is one that converts its file to a different format and "delegates" actual partitioning to the partitioner for that target format. 10 of the 20 partitioners are delegating partitioners. - Removing decorators from delegating partitioners will allow us to avoid "double-decorating", i.e. running those decorators twice, once on the principal partitioner and again on the proxy partitioner. - This will allow us to send `**kwargs` to either partitioner, removing the knowledge of which arguments to send for each file-type from auto-partition. - And this will allow pluggable auto-partitioners which all have a `partition_x(filename, *, file, **kwargs) -> list[Element]` interface.	2024-09-20 20:35:28 +00:00
Steve Canny	cd074bb32b	chore(file): remove dead code (#3645 ) Summary Remove dead code in `unstructured.file_utils`. Additional Context These modules were added in 12/2022 and 1/2023 and are not referenced by any code. Removing to reduce unnecessary complexity. These can of course be recovered from Git history if we decide we want them again in future.	2024-09-19 06:45:33 +00:00
cragwolfe	e9690b2738	feat: utility script to process large PDFs through the API by script (#3591 ) Adds the bash script `process-pdf-parallel-through-api.sh` that allows splitting up a PDF into smaller parts (splits) to be processed through the API concurrently, and is re-entrant. If any of the splits fail to process, one can attempt reprocessing those split(s) by rerunning the script. Note: requires the `qpdf` command line utility. The below command line output shows the scenario where just one split had to be reprocessed through the API to create the final `layout-parser-paper_combined.json` output. ``` $ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \ ./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf > % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-pars\ er-paper_pages_1_to_6.json as it already exists. Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists. Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json ``` Bonus change to `unstructured-get-json.sh` to point to the standard hosted Serverless API, but allow using the Free API with --freemium.	2024-09-10 11:40:35 -07:00
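The split naming in the output above (`_pages_1_to_6`, `_pages_7_to_12`, `_pages_13_to_16`) follows from chunking a 16-page PDF with `PDF_SPLIT_PAGE_SIZE=6`. A minimal Python sketch of that arithmetic, not the script's actual bash logic; the function names are hypothetical and the 16-page total is inferred from the last split shown:

```python
def split_page_ranges(total_pages: int, split_size: int) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges of at most split_size pages."""
    if total_pages < 1 or split_size < 1:
        raise ValueError("total_pages and split_size must be positive")
    return [
        (start, min(start + split_size - 1, total_pages))
        for start in range(1, total_pages + 1, split_size)
    ]


def split_output_name(stem: str, first: int, last: int) -> str:
    """Build a per-split JSON file name in the style shown in the log output."""
    return f"{stem}_pages_{first}_to_{last}.json"


ranges = split_page_ranges(total_pages=16, split_size=6)
names = [split_output_name("layout-parser-paper", first, last) for first, last in ranges]
```

A re-entrant run would then only need to process the ranges whose output file does not exist yet.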
Christine Straub	d99b39923d	build(deps): Remove unstructured.paddlepaddle fork (#3506 ) This PR aims to remove "unstructured.paddlepaddle" fork. Previously, we used `unstructured.paddlepaddle` fork to support `unstructured.paddleocr` on arm64 architecture. But currently, `unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work on `arm64` architecture. Also, `unstructured.paddleocr` with the latest version of the original `paddlepaddle` works on both `amd64` and `arm64` architectures. ### Testing ``` os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" elements = partition_pdf( filename=<file_path>, strategy="hi_res", infer_table_structure=True, ) ```	2024-08-09 22:04:22 +00:00
Matt Robinson	ee2b247297	build: check dependency licenses in CI (#3349 ) ### Summary Adds a CI check to ensure that packages added as dependencies are appropriately licensed. All of the `.txt` files in the `requirements` directory are checked with the exception of: - `constraints.txt`, since those are not installed and are instead conditions on the other dependency files - `dev.txt`, since those are for local development and not shipped as part of the `unstructured` package - `extra-pdf-image.txt` - the `extra-pdf-image.in` since checking `extra-pdf-image.txt` pulls in NVIDIA GPU related packages with an `Other/Proprietary` license type, and there's not a good way to exclude those without adding `Other/Proprietary` to the allowed licenses list. ### Testing The new `check-licenses` job should pass in CI.	2024-07-11 22:36:01 +00:00
Roman Isecke	f1a28600d9	feat/singlestore dest connector (#3320 ) ### Description Adds [SingleStore](https://www.singlestore.com/) database destination connector with associated ingest test.	2024-07-03 15:15:39 +00:00
Matt Robinson	db8617872b	build: image and dependency updates; fix tesseract files locations (#3310 ) ### Summary Updates to the latest version of the `wolfi-base` image. Changes include: - Version bumps to address CVEs - `libreoffice` is now included in the `arm64` image. `.doc` files are now supported for `arm64`. `.ppt` files do not work with the `libreoffice` package currently available on `wolfi-os`. We have follow-on work to look into that. - Updates the location of the `tesseract` `tessdata` files on the `arm64` build. Closes #3290. - Closes #3319 and adds `psutil` to the base dependencies. ### Testing - `test_dockerfile` should continue to pass with the updates.	2024-07-01 19:39:32 +00:00
David Potter	15f80c4ad6	rfct [P6M]-392: OpenSearch V2 Destination Connector (#3293 ) Migrates OpenSearch destination connector to V2. Relies a lot on the Elasticsearch connector where possible. (this is expected)	2024-06-28 20:51:23 +00:00
Roman Isecke	e0f4374386	Roman/bugfix conflicting event loop ingest (#3264 ) ### Description In use cases where an external system (such as code being run in a jupyter notebook) already has a running event loop, run the async code in a dedicated thread pool to not conflict with the existing event loop. This also has a variety of fixes that were found when putting together a demo leveraging the elasticsearch destination connector	2024-06-24 18:47:37 +00:00
David Potter	8610bd3ab9	feat: Kafka source and destination connector (#3176 ) Thanks to @tullytim we have a new Kafka source and destination connector. It also works with hosted Kafka via Confluent. Documentation will be added to the Docs repo.	2024-06-22 23:26:23 +00:00
Matt Robinson	2d965fd65e	build: switch arm64 image to wolfi-base (#3268 ) ### Summary Updates the `arm64` build to use the same `Dockerfile` as `amd64`, since there are now upstream base images for `wolfi-base` for both architectures. The legacy `rockylinux-9.4` is now stashed in a subdirectory of the `docker` directory and is no longer built in CI, but is available if users would like to build it themselves. Additionally, this PR includes a fix to symlink `python3` to `python3.11`, which had caused a CI failure [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9619486931/job/26535697755). BREAKING CHANGE: the `arm64` image no longer supports `.doc`, `.pptx`, or `.xls` because we do not yet have a `libreoffice` `apk` built for `wolfi-base`. We intend to address that as a follow-on. All other filetypes work. ### Testing Successfully docker builds, tests, and smoke tests for [amd64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610735?pr=3268) and [arm64](https://github.com/Unstructured-IO/unstructured/actions/runs/9619458140/job/26535610341?pr=3268) on the feature branch (with publish disabled).	2024-06-22 05:10:29 +00:00
Christine Straub	f23d180d34	fix: docker image publishing error (#3238 ) This PR aims to fix a docker image publishing error caused by user changes when pulling the `amd64` image from the `unstructured` `wolfi-base` image. (https://github.com/Unstructured-IO/unstructured/pull/3213). --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-06-18 21:01:42 +00:00
Christine Straub	b47e6e9fdc	refactor: remove download packages step (#3225 ) This PR aims to remove the download packages step since all of that gets installed in the base images. This PR also updates the base `wolfi` image because the original base image can not be found anymore: https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945	2024-06-18 12:15:44 +00:00
Matt Robinson	ad69bdcd4e	build(deps): deltalake bump to `0.18.x` (#3197 ) ### Summary Closes #3173. Removes the `overwrite_schema` kwarg from the Delta Table connector and bumps the `deltalake` version. Per [this PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake` repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`. Users can specify `schema_mode="merge"` to obtain the same behavior. - `schema_mode="merge"` is equivalent to `overwrite_schema=False` - `schema_mode="overwrite"` is equivalent to `overwrite_schema=True` Also adds an `engine` parameter that you can use to set `"rust"` or `"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and `schema_mode` defaults to `None`, which is consistent with the behavior in `deltalake` documented [here](https://delta-io.github.io/delta-rs/api/delta_writer/). ### Testing The Delta Table ingest tests should pass on this PR. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-06-13 15:59:34 +00:00
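For callers migrating old configuration, the equivalence stated in this commit message can be captured in a small helper. This is a sketch only; `schema_mode_from_overwrite_schema` is a hypothetical name and is not part of `deltalake` or of the connector:

```python
from typing import Optional


def schema_mode_from_overwrite_schema(overwrite_schema: Optional[bool]) -> Optional[str]:
    """Translate the deprecated overwrite_schema flag into a schema_mode value.

    Follows the mapping given in the commit message:
    overwrite_schema=False -> "merge", overwrite_schema=True -> "overwrite".
    None (flag not set) maps to the new default, schema_mode=None.
    """
    if overwrite_schema is None:
        return None
    return "overwrite" if overwrite_schema else "merge"
```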
Roman Isecke	b777864296	feat: Migrate over fsspec connectors (#3066 ) ### Description Move over all fsspec connectors to the new framework Variety of bug fixes found and fixed in this PR as well: * custom json mixin being used for the enhanced dataclass would break if typing was quoted. That was fixed. A check was also added to the enhanced dataclass to prevent `InitVar` from being used in the root dataclass since this breaks serialization. * hashing for partitioner was using the filename of the raw file being partitioned rather than the file name of the file data generated from indexing. This means that multiple files could result in the same partition hash when recursive flag is passed in. This was updated to use the file data file name instead. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-05 19:12:06 +00:00
Matt Robinson	6b400b46fe	feat: add VoyageAI embeddings (#3069 ) (#3099 ) Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved. ------------ Original PR description: Adding VoyageAI embeddings Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy. --------- Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>	2024-05-24 21:48:35 +00:00
Roman Isecke	3eaf65a8c1	feat: refactor ingest (#3009 ) ### Description This refactors the current ingest CLI process to support better granularity in how the steps are run * Both multiprocessing and async are now supported. Given that a lot of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here * Destination step broken up into a stager step and an upload step. This will allow for steps that require manipulation of the data between formats, such as converting the elements json into a csv format to upload for tabular destinations, to be pulled out of the step that does the actual upload. * The process of writing the content to a local destination has been pulled out as its own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere. * Quick update to the chunker/partition step to use the python client. * Move the uncompress support to a pipeline step since this can arbitrarily apply to any concrete files that have been downloaded, regardless of where they came from. * Leverage last modified date to mark files to be reprocessed, even if the file already exists locally. ### Callouts Retry configs haven't been moved over yet. This is an open question because the intent was for it to wrap potential connection errors but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more. 
### Testing * If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py) * If you want to run the new code as an isolated CLI: ```shell PYTHONPATH=. python unstructured/ingest/v2/main.py --help ``` * If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current cli: ```shell PYTHONPATH=. python unstructured/ingest/main.py --help Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help Options: --help Show this message and exit. Commands: airtable azure biomed box confluence delta-table discord dropbox elasticsearch fsspec gcs github gitlab google-drive hubspot jira local v2 mongodb notion onedrive opensearch outlook reddit s3 v2 salesforce sftp sharepoint slack wikipedia ``` You can run any of the local or s3 specific ingest tests and these should now work. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-05-21 17:01:49 +00:00
Matt Robinson	9cd0e706ab	fix: reenable arm64 builds for docker (#3045 ) ### Summary Closes #3034 and reenables ARM64 in the docker build and publish job. This was taken out in #3039 because we've only built `libreoffice` for AMD64 and not ARM64. If Chainguard publishes an `apk` for `libreoffice`, we can support a Chainguard image for both architectures. The smoke test now differs for both architectures, to reflect differences in the directory structure. ### Testing Build and publish ran successfully for ARM64 (job [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907497)) and AMD64 (job [here](https://github.com/Unstructured-IO/unstructured/actions/runs/9129712470/job/25104907826)).	2024-05-17 19:27:20 +00:00
Matt Robinson	934f1a464a	fix: disable arm build for chainguard (#3039 ) ### Summary Temporarily disables the ARM build due to the error in [this CI job](https://github.com/Unstructured-IO/unstructured/actions/runs/9114507405/job/25058629166). Will add back support for ARM using the rockylinux container once we show this works.	2024-05-17 00:22:10 +00:00
Matt Robinson	612905e311	build: wolfi base image for Dockerfile (#3016 ) ### Summary Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing. The job will fail if there are high or critical vulnerabilities. ### Testing Run `make docker-run-dev` and then `python3.11` once you're in. At that point, you can try: ```python from unstructured.partition.auto import partition elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"]) elements ``` Stop the container once you're done.	2024-05-15 22:53:15 +00:00
Roman Isecke	d6f2841ff4	feat: update dependencies and remove constraint on pydantic (#2841 ) ### Description * The `consistent-deps.sh` was fixed to take into account the ingest dependencies, causing some errors to show up. New constraints were added to make that script pass. * Update all requirements without constraint on pydantic, allowing the latest version to be pulled in. * `pikepdf` is causing a conflict but there's a fix on their `main` branch; we just need the next release to be published. Opened up a question here to see if we can get that out any sooner: [Do releases happen on a schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now added `lxml<5` to the constraints. A couple optimizations: * `constraints.in` renamed to `constraints.txt` since the whole point is all dependencies are already pinned and the file never gets compiled * `constraints.txt` moved to a `requirements/deps` directory as this never gets compiled by `pip-compile` * Other dependency files updated to reference the new location of `base.in` and `constraints.txt` * make file updated since it was originally written to avoid the `base.in` and `constraints.in` file	2024-04-04 19:58:23 +00:00
Klaijan	30b6a09bc3	fix: declare -i [SC2324 shellcheck] (#2624 ) Fix SC2324 shellcheck warning by adding -i to indicate var type of integer and tidy up the formatting.	2024-03-08 10:09:55 +00:00
Roman Isecke	9c1c41f493	BUGFIX: fix dependencies in setup.py (#2605 ) ### Description Currently the requirements associated with an extra in the `setup.py` is being dynamically generated using the `load_requirements()` method in the same file. This is being passed in all the `.in` files which then get read line by line to generate the requirements associated with an extra. Unless the `.in` file itself has a version pin, this will never respect the `.txt` files being generated by `pip-compile`. This fix updates all the inputs to `load_requirements()` to use the `.txt` files themselves.	2024-03-06 18:59:08 +00:00
David Potter	0c834517d8	fix: change opensearch port (#2517 ) change opensearch port to see if fixes CI. We think there may be a conflict with the elasticsearch docker port. Also adding simple retry to vector query. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-02-07 21:25:04 +00:00
David Potter	bc791d53f4	feat: add opensearch source and destination connector (#2349 ) Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible. - Adds OpenSearch source connector to be able to ingest documents from OpenSearch. - Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured OpenSearch indexes easily. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-17 04:31:49 +00:00
ryannikolaidis	2ce829ddd0	test: update test Elasticsearch mappings to validate embedding search (#2397 ) Currently in the Elasticsearch Destination ingest test we are writing the embeddings to a "float" type field. In order to leverage this field for similarity search it should be mapped as "dense_vector" with the respective dimensions assigned. This PR updates that mapping and adds a test query to validate that this works as expected.	2024-01-14 19:27:56 +00:00
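The mapping change described above can be illustrated with a small sketch. The field name `embeddings` and the 384-dimension default are assumptions for illustration, and `embedding_field_mapping` is a hypothetical helper rather than code from the test fixtures:

```python
def embedding_field_mapping(dims: int = 384, field_name: str = "embeddings") -> dict:
    """Build an index-mapping fragment that stores embeddings as a dense_vector.

    A "float" field accepts the numbers but cannot be used for similarity search;
    "dense_vector" with an explicit dimension count is the searchable form.
    """
    if dims < 1:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            field_name: {
                "type": "dense_vector",
                "dims": dims,  # must match the embedding model's output size
            }
        }
    }


mapping = embedding_field_mapping()
```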
Roman Isecke	b37b4689bc	drop python3.8 (#2372 ) ### Description Remove all uses of python3.8 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-01-09 23:37:30 +00:00
rvztz	950e5d68f9	feat: adds postgresql/sqlite destination connector (#2005 ) - Adds a destination connector to upload processed output into a PostgreSQL/Sqlite database instance. - Users are responsible to provide their instances. This PR includes a couple of configuration examples. - Defines the scripts required to setup a PostgreSQL instance with the unstructured elements schema. - Validates postgres/pgvector embedding storage and retrieval --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-04 19:33:16 +00:00
Ahmet Melek	fd293b3e78	feat: add elasticsearch destination connector (#2152 ) Closes https://github.com/Unstructured-IO/unstructured/issues/1842 Closes https://github.com/Unstructured-IO/unstructured/issues/2202 Closes https://github.com/Unstructured-IO/unstructured/issues/2203 This PR: - Adds Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch. - Defines an example unstructured elements schema for users to be able to setup their unstructured elasticsearch indexes easily. - Includes parallelized upload and lazy processing for elasticsearch destination connector. - Rearranges elasticsearch test helpers to source, destination, and common folders. - Adds util functions to be able to batch iterables in a lazy way for uploads - Fixes a bug where removing the optional parameter `--fields` broke the connector due to an integer processing error. - Fixes a bug where using an [elasticsearch config](`8fa5cbf036/unstructured/ingest/connector/elasticsearch.py (L26-L35)`) for a destination connector resulted in a serialization issue when optional parameter `--fields` was not provided.	2023-12-20 01:26:58 +00:00
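Lazy batching of an iterable, as mentioned in the utility-function bullet above, can be sketched with the standard library. This is a generic illustration, not the helper added in the PR; `batch_lazily` is a hypothetical name:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batch_lazily(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to batch_size items, pulling from the source only as needed."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(items)
    # islice consumes at most batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


batches = list(batch_lazily(range(5), batch_size=2))
```

Because only one batch is materialized at a time, the source can be a generator over a large document set without loading it all into memory before upload.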
David Potter	4b8352e0f5	feat: add chroma destination connector (#2240 ) Adds Chroma (also known as ChromaDB) as a vector destination. Currently Chroma is an in-memory single-process oriented library with plans of a hosted and/or more production ready solution -https://docs.trychroma.com/deployment Though they now claim to support multiple Clients hitting the database at once, I found that it was inconsistent. Sometimes multiprocessing worked (maybe 1 out of 3 times) But the other times I would get different errors. So I kept it single process. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-19 16:58:23 +00:00
cragwolfe	bd8a74d686	chore: shell scripts default indent of 2 instead of 4 (#2287 ) Given the tendency for shell scripts to easily enter into a few levels of indentation and long line lengths, update the default to 2 spaces.	2023-12-19 07:48:21 +00:00
Roman Isecke	76efcf4dd7	chore: add shfmt (#2246 ) ### Description Given all the shell files that now exist in the repo, would be nice to have linting/formatting around them (in addition to the existing shellcheck which doesn't do anything to format the shell code). This PR introduces `shfmt` to both check for changes and apply formatting when the associated make targets are called.	2023-12-12 01:04:15 +00:00
Roman Isecke	ac302689a0	chore: update sphinx ingest docs with new connectors (#2245 ) Replacing https://github.com/Unstructured-IO/unstructured/pull/2243	2023-12-11 21:29:41 +00:00
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
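The file-versus-folder rule described above can be approximated as follows. This is a sketch of the stated behavior using only the standard library, not the connector's implementation; `is_single_file_url` is a hypothetical name:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_single_file_url(remote_url: str) -> bool:
    """Return True when the URL path ends in a name with an extension.

    Per the commit message, such a URL is treated as one specific file;
    any other remote URL is treated as a folder path.
    """
    path = urlparse(remote_url).path
    return PurePosixPath(path).suffix != ""
```

One consequence of an extension-based rule is that a folder whose name contains a dot would be classified as a file.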


112 Commits