mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00

Sample output. The key thing here is the modes `offline` (meaning set HF_HUB_ONLINE=1 AND DO_NOT_TRACK=true) results in no outbound connections. This also is true if the locally cached models are removed, the last scenario of `offline-and-missing-models`) ``` $ ./test-all-outbound-connectivity-scenarios.sh >>> Removing leftover sut_* containers… Container: 543ac4b14370a18d790a2035e206e8c445754b825ec8b2887f4246f7404299c7 (scenario baseline) tcpdump running on interface eth0... >>> Running Python workload (capturing stdout/stderr)… [INFO] partitioning /app/example-docs/ideas-page.html <snip> Python finished. Log saved to /r/unstructured/scripts/image/python-output/offline-and-missing-models.log pcap saved to /r/unstructured/scripts/image/pcaps/offline-and-missing-models.pcap ================================================================== ======================================== Begin Scenario: baseline ------------------------------------------- tshark output for baseline ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | 172.18.0.2 <-> 108.138.246.79 20 12 kB 20 4,176 bytes 40 16 kB 2.531247000 69.0419 172.18.0.2 <-> 3.214.154.119 11 5,777 bytes 12 2,656 bytes 23 8,433 bytes 0.029451000 0.4118 172.18.0.2 <-> 192.168.65.5 2 656 bytes 2 158 bytes 4 814 bytes 0.000000000 2.5310 ------------------------------------------ python log output for baseline ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:05:02,265 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:05:02,356 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443 2025-06-02 22:05:02,497 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0 2025-06-02 22:05:02,613 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:05:04,877 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:05:04,960 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:05:04,970 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:05:05,062 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0 2025-06-02 22:05:05,065 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:05:05,071 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:05:05,152 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:05:07,693 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:05:12,706 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:05:12,733 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:05:15,251 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:05:16,936 - unstructured_inference - INFO - Reading image file: /tmp/tmplkanlou1/unstructured_logo.png ... 2025-06-02 22:05:18,749 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpxdzdouhb/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: missing-models ------------------------------------------- tshark output for missing-models ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | 172.18.0.2 <-> 18.155.192.23 181834 273 MB 33502 1,813 kB 215336 275 MB 2.704106000 75.2880 172.18.0.2 <-> 3.168.86.41 79696 119 MB 15234 825 kB 94930 120 MB 9.066044000 68.9276 172.18.0.2 <-> 108.138.246.85 29 21 kB 25 5,760 bytes 54 27 kB 2.431857000 75.5633 172.18.0.2 <-> 3.214.154.119 12 5,831 bytes 12 2,656 bytes 24 8,487 bytes 0.016604000 0.3590 172.18.0.2 <-> 192.168.65.5 4 1,084 bytes 4 314 bytes 8 1,398 bytes 0.000000000 9.0651 ------------------------------------------ python log output for missing-models ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:06:30,961 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:06:31,046 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443 2025-06-02 22:06:31,300 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0 2025-06-02 22:06:31,310 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cdn-lfs.hf.co:443 2025-06-02 22:06:31,439 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/d9/51/d951593388d0af1cb4a029c311ba19f9b05090d9acc4606c2b82588297ea4397/134301ca94fb0df8027be9a6dad1908fe6218af8ffa4d34f0819c7c2226195f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27yolox_l0.05.onnx%3B+filename%3D%22yolox_l0.05.onnx%22%3B&Expires=1748904676&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY3Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9kOS81MS9kOTUxNTkzMzg4ZDBhZjFjYjRhMDI5YzMxMWJhMTlmOWIwNTA5MGQ5YWNjNDYwNmMyYjgyNTg4Mjk3ZWE0Mzk3LzEzNDMwMWNhOTRmYjBkZjgwMjdiZTlhNmRhZDE5MDhmZTYyMThhZjhmZmE0ZDM0ZjA4MTljN2MyMjI2MTk1ZjM~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=hxvwTzJynEvyE~UuirlH~L4c5Gc6rGksDp~Uw94ooayDrzshE2sDdHmvqgoQyzqxHHhZLjfiJlAGUtVO7nVAHSoqt8mH7H9yN51Zj5UGqI-odXtW1dmWCD3i7nwwNlrEEjlXlERkIScpIjpkJDnjwhzeE94l1s7gysIm8c6J8JTcDlsdMver5wAVrBtLSVUrDN8PC84xgOGerHVhX7-eZcUVG2OAIJHoB3s2gLPkW9aVM5fvCmmoXMPI9oCvgLUp-zhXv3cWHh~yURuY1ufoI4CFG5ogW8nV~V45qLlbRw9PrvfFoLS-wxBGDOhT3SRWVOJzRRmACByABGWYMXRFuw__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 216625723 2025-06-02 22:06:35,019 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:06:37,290 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:06:37,375 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 1469 2025-06-02 22:06:37,484 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:06:37,581 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 0 Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet` 2025-06-02 22:06:37,586 - huggingface_hub.file_download - WARNING - Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet` 2025-06-02 22:06:37,681 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 1319 2025-06-02 22:06:37,685 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cas-bridge.xethub.hf.co:443 2025-06-02 22:06:37,778 - urllib3.connectionpool - DEBUG - https://cas-bridge.xethub.hf.co:443 "GET /xet-bridge-us/634929bd8146350b3a4cadaf/e78778928a1863786d5bb22a109a7ff1dbac47a29eae6f223a1fc2689172c347?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250602%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250602T220637Z&X-Amz-Expires=3600&X-Amz-Signature=c0a361e8982b1b05ee443054646b438e5a68d6767ef6df03dad6c5db20d0bdc5&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1748905597&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNTU5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MzQ5MjliZDgxNDYzNTBiM2E0Y2FkYWYvZTc4Nzc4OTI4YTE4NjM3ODZkNWJiMjJhMTA5YTdmZjFkYmFjNDdhMjllYWU2ZjIyM2ExZmMyNjg5MTcyYzM0NyoifV19&Signature=cRjZe56uJ8vxmmgRhPmp7XZX69PHKoXO9XN1bfq5n~84Vxz~HvCmg6MqtuUAFIiOWAHFhOuVzJpoiWTYT1JdZrtMeQTdywnZM-lIIn5Q45kzr8q8C58yvLz7vmKKrD9pOnGjJPaVavYYxEDdlAXbWf6xo433kKF4TfmQ9z7UIKt~M-XV9EdPUUBNhByucLVcTZ3sec5DqI4FmzK28fdJ1BMD4NyDjWW6hi~Lp2V3bW0FLCpI6qKGuikJ3E-OVcJDdDvZAqSN0-GoQyHIP9kp4RTqPBb7jekpZ3Uj91UWEmGx6YNuNlorAMGi61hrL6mAUUmW13OGua2vcJyk9LxZQg__&Key-Pair-Id=K2L8F4GPSG1IFC HTTP/1.1" 200 115434268 2025-06-02 22:06:39,612 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:06:39,696 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0 2025-06-02 22:06:39,714 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/42/d5/42d585781e0b74854ae52a1bc2a63d09896f1d70f86bff969f4c053508d6c2d6/80c49dee3da4822c009c5a7fe591e9223c5a2cfcf95a4067ca4dfb5a7b89c612?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1748904665&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy80Mi9kNS80MmQ1ODU3ODFlMGI3NDg1NGFlNTJhMWJjMmE2M2QwOTg5NmYxZDcwZjg2YmZmOTY5ZjRjMDUzNTA4ZDZjMmQ2LzgwYzQ5ZGVlM2RhNDgyMmMwMDljNWE3ZmU1OTFlOTIyM2M1YTJjZmNmOTVhNDA2N2NhNGRmYjVhN2I4OWM2MTI~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=GL15CLiGsmHno-DP25kfcuObjbrjd~ir5C5xapGqb9lda~5Wjy-3axBPftr1xWUnKh24Ay0mS49U8ZOcEdQxmzxQ97HiSX0-8s0-H187hV6mId6uxsULOGkNtjpkMKhfxe0qIfAmfi9gxl9JdiVfG5367HfPDVST8NvGPqMuKYoywSNWA-Uby-L9qb~EjtxbH9v1H2g6C0i9t2mn8ghD8BtTWEn4LY9c4O5bI~EQatNToNjsQTKa18LzXEowZnODLSLkyE7beLzfEpuTX9vlDzcAwKCPp-1M3xMZI4tzR-yfzyGhW19wqc6BVncUw53WSK7oOCv56HmFTYHhzOE-eQ__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 46807446 2025-06-02 22:06:40,394 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:06:40,396 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:06:40,460 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:06:42,985 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:06:48,019 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:06:48,045 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:06:50,557 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:06:52,358 - unstructured_inference - INFO - Reading image file: /tmp/tmpsha4r586/unstructured_logo.png ... 2025-06-02 22:06:54,199 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpg_5lk06v/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: analytics-online-only ------------------------------------------- tshark output for analytics-online-only ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | 172.18.0.2 <-> 54.236.224.89 12 5,831 bytes 12 2,656 bytes 24 8,487 bytes 0.032536000 0.3535 172.18.0.2 <-> 192.168.65.5 1 462 bytes 1 84 bytes 2 546 bytes 0.000000000 0.0322 ------------------------------------------ python log output for analytics-online-only ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:08:10,114 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:08:10,320 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:08:12,475 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:08:12,476 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:08:12,478 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:08:12,548 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:08:15,102 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:08:20,163 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:08:20,189 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:08:22,732 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:08:24,468 - unstructured_inference - INFO - Reading image file: /tmp/tmp4oud0ctq/unstructured_logo.png ... 2025-06-02 22:08:26,297 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpv24idrvu/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: offline ------------------------------------------- tshark output for offline ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | ------------------------------------------ python log output for offline ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:09:37,826 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:09:38,028 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:09:40,193 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:09:40,193 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:09:40,195 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:09:40,260 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:09:42,810 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:09:47,851 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:09:47,877 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:09:50,475 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:09:52,181 - unstructured_inference - INFO - Reading image file: /tmp/tmpn3rraz6o/unstructured_logo.png ... 2025-06-02 22:09:54,032 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpvbqk645u/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: offline-and-missing-models ------------------------------------------- tshark output for offline-and-missing-models ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | ------------------------------------------ python log output for offline-and-missing-models ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:11:05,743 - matplotlib.font_manager - INFO - generated new fontManager Traceback (most recent call last): File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1484, in _get_metadata_or_catch_error metadata = get_hf_file_metadata( ^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1401, in get_hf_file_metadata r = _request_wrapper( ^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 285, in _request_wrapper response = _request_wrapper( ^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 308, in _request_wrapper response = get_session().request(method=method, url=url, **params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request resp = self.send(prep, **send_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send r = adapter.send(request, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 107, in send raise OfflineModeIsEnabled( huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<stdin>", line 35, in <module> File "/app/unstructured/partition/auto.py", line 231, in partition elements = partition_image( ^^^^^^^^^^^^^^^^ File "/app/unstructured/documents/elements.py", line 585, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/file_utils/filetype.py", line 774, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/image.py", line 102, in partition_image return partition_pdf_or_image( ^^^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/pdf.py", line 341, in partition_pdf_or_image elements = _partition_pdf_or_image_local( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/utils.py", line 216, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/pdf.py", line 649, in _partition_pdf_or_image_local inferred_document_layout = process_file_with_model( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 371, in process_file_with_model model = get_model(model_name, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/models/base.py", line 74, in get_model model.initialize(**initialize_params) File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 40, in __getitem__ value = evaluate(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 115, in download_if_needed_and_get_local_path return hf_hub_download(path_or_repo, filename, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 961, in hf_hub_download return _hf_hub_download_to_cache_dir( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1068, in _hf_hub_download_to_cache_dir _raise_on_head_call_error(head_call_error, force_download, local_files_only) File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1599, in _raise_on_head_call_error raise LocalEntryNotFoundError( huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on. ```
215 lines
3.1 KiB
Plaintext
215 lines
3.1 KiB
Plaintext
# Byte-compiled / optimized / DLL files
|
|
__pycache__/
|
|
*.py[cod]
|
|
*$py.class
|
|
|
|
# C extensions
|
|
*.so
|
|
|
|
# Distribution / packaging
|
|
.Python
|
|
build/
|
|
develop-eggs/
|
|
dist/
|
|
downloads/
|
|
figures/
|
|
eggs/
|
|
.eggs/
|
|
lib/
|
|
lib64/
|
|
parts/
|
|
sdist/
|
|
var/
|
|
wheels/
|
|
pip-wheel-metadata/
|
|
share/python-wheels/
|
|
*.egg-info/
|
|
nltk_data/
|
|
.installed.cfg
|
|
*.egg
|
|
MANIFEST
|
|
|
|
# PyInstaller
|
|
# Usually these files are written by a python script from a template
|
|
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
|
*.manifest
|
|
*.spec
|
|
|
|
# Installer logs
|
|
pip-log.txt
|
|
pip-delete-this-directory.txt
|
|
|
|
# Pycharm
|
|
.idea/
|
|
|
|
# Unit test / coverage reports
|
|
htmlcov/
|
|
.tox/
|
|
.nox/
|
|
.coverage
|
|
.coverage.*
|
|
.cache
|
|
nosetests.xml
|
|
coverage.xml
|
|
*.cover
|
|
*.py,cover
|
|
.hypothesis/
|
|
.pytest_cache/
|
|
|
|
# Translations
|
|
*.mo
|
|
*.pot
|
|
|
|
# Django stuff:
|
|
*.log
|
|
local_settings.py
|
|
db.sqlite3
|
|
db.sqlite3-journal
|
|
|
|
# Flask stuff:
|
|
instance/
|
|
.webassets-cache
|
|
|
|
# Scrapy stuff:
|
|
.scrapy
|
|
|
|
# Sphinx documentation
|
|
docs/_build/
|
|
|
|
# PyBuilder
|
|
target/
|
|
|
|
# Jupyter Notebook
|
|
.ipynb_checkpoints
|
|
nbs/
|
|
|
|
# IPython
|
|
profile_default/
|
|
ipython_config.py
|
|
|
|
# pyenv
|
|
.python-version
|
|
|
|
# pipenv
|
|
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
|
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
|
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
|
# install all needed dependencies.
|
|
#Pipfile.lock
|
|
|
|
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
|
|
__pypackages__/
|
|
|
|
# Celery stuff
|
|
celerybeat-schedule
|
|
celerybeat.pid
|
|
|
|
# SageMath parsed files
|
|
*.sage.py
|
|
|
|
# Environments
|
|
.env
|
|
.venv
|
|
env/
|
|
venv/
|
|
ENV/
|
|
env.bak/
|
|
venv.bak/
|
|
|
|
# Spyder project settings
|
|
.spyderproject
|
|
.spyproject
|
|
|
|
# Rope project settings
|
|
.ropeproject
|
|
|
|
# mkdocs documentation
|
|
/site
|
|
|
|
# mypy
|
|
.mypy_cache/
|
|
.dmypy.json
|
|
dmypy.json
|
|
|
|
# Pyre type checker
|
|
.pyre/
|
|
|
|
# pyright (Python LSP/type-checker in VSCode) config
|
|
/pyrightconfig.json
|
|
|
|
# ingest outputs
|
|
/structured-output
|
|
test_unstructured_ingest/workdir/
|
|
test_unstructured_ingest/delta-table-dest/
|
|
test_unstructured_ingest/skipped-files.txt
|
|
test_unstructured_ingest/chroma-dest/
|
|
|
|
# suggested ingest mirror directory
|
|
/mirror
|
|
|
|
## https://github.com/github/gitignore/blob/main/Global/Emacs.gitignore (partial)
|
|
|
|
*~
|
|
\#*\#
|
|
/.emacs.desktop
|
|
/.emacs.desktop.lock
|
|
*.elc
|
|
auto-save-list
|
|
tramp
|
|
.\#*
|
|
|
|
## https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
|
|
.vscode/*
|
|
!.vscode/tasks.json
|
|
!.vscode/launch.json
|
|
!.vscode/extensions.json
|
|
!.vscode/*.code-snippets
|
|
|
|
# Local History for Visual Studio Code
|
|
.history/
|
|
|
|
# Built Visual Studio Code Extensions
|
|
*.vsix
|
|
|
|
## https://github.com/github/gitignore/blob/main/Global/Vim.gitignore
|
|
# Swap
|
|
[._]*.s[a-v][a-z]
|
|
!*.svg # comment out if you don't need vector files
|
|
[._]*.sw[a-p]
|
|
[._]s[a-rt-v][a-z]
|
|
[._]ss[a-gi-z]
|
|
[._]sw[a-p]
|
|
|
|
# Session
|
|
Session.vim
|
|
Sessionx.vim
|
|
|
|
# Temporary
|
|
.netrwhist
|
|
# Auto-generated tag files
|
|
tags
|
|
# Persistent undo
|
|
[._]*.un~
|
|
|
|
*.DS_Store
|
|
|
|
# Ruff cache
|
|
.ruff_cache/
|
|
|
|
unstructured-inference/sample-docs/*
|
|
.ppm
|
|
unstructured-inference/
|
|
|
|
example-docs/*_images
|
|
examples/**/output/
|
|
|
|
outputdiff.txt
|
|
outputhtmldiff.txt
|
|
metricsdiff.txt
|
|
|
|
# analysis
|
|
annotated/
|
|
.aider*
|
|
pcaps
|
|
python-output
|