mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-06-27 02:30:08 +00:00
chore: script to verify unstructured image outbound connectivity (#4008)
Sample output. The key thing here is the modes `offline` (meaning set HF_HUB_ONLINE=1 AND DO_NOT_TRACK=true) results in no outbound connections. This also is true if the locally cached models are removed, the last scenario of `offline-and-missing-models`) ``` $ ./test-all-outbound-connectivity-scenarios.sh >>> Removing leftover sut_* containers… Container: 543ac4b14370a18d790a2035e206e8c445754b825ec8b2887f4246f7404299c7 (scenario baseline) tcpdump running on interface eth0... >>> Running Python workload (capturing stdout/stderr)… [INFO] partitioning /app/example-docs/ideas-page.html <snip> Python finished. Log saved to /r/unstructured/scripts/image/python-output/offline-and-missing-models.log pcap saved to /r/unstructured/scripts/image/pcaps/offline-and-missing-models.pcap ================================================================== ======================================== Begin Scenario: baseline ------------------------------------------- tshark output for baseline ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | 172.18.0.2 <-> 108.138.246.79 20 12 kB 20 4,176 bytes 40 16 kB 2.531247000 69.0419 172.18.0.2 <-> 3.214.154.119 11 5,777 bytes 12 2,656 bytes 23 8,433 bytes 0.029451000 0.4118 172.18.0.2 <-> 192.168.65.5 2 656 bytes 2 158 bytes 4 814 bytes 0.000000000 2.5310 ------------------------------------------ python log output for baseline ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:05:02,265 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:05:02,356 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443 2025-06-02 22:05:02,497 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0 2025-06-02 22:05:02,613 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:05:04,877 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:05:04,960 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:05:04,970 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:05:05,062 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0 2025-06-02 22:05:05,065 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:05:05,071 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:05:05,152 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:05:07,693 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:05:12,706 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:05:12,733 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:05:15,251 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:05:16,936 - unstructured_inference - INFO - Reading image file: /tmp/tmplkanlou1/unstructured_logo.png ... 2025-06-02 22:05:18,749 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpxdzdouhb/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: missing-models ------------------------------------------- tshark output for missing-models ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | 172.18.0.2 <-> 18.155.192.23 181834 273 MB 33502 1,813 kB 215336 275 MB 2.704106000 75.2880 172.18.0.2 <-> 3.168.86.41 79696 119 MB 15234 825 kB 94930 120 MB 9.066044000 68.9276 172.18.0.2 <-> 108.138.246.85 29 21 kB 25 5,760 bytes 54 27 kB 2.431857000 75.5633 172.18.0.2 <-> 3.214.154.119 12 5,831 bytes 12 2,656 bytes 24 8,487 bytes 0.016604000 0.3590 172.18.0.2 <-> 192.168.65.5 4 1,084 bytes 4 314 bytes 8 1,398 bytes 0.000000000 9.0651 ------------------------------------------ python log output for missing-models ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:06:30,961 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:06:31,046 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443 2025-06-02 22:06:31,300 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0 2025-06-02 22:06:31,310 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cdn-lfs.hf.co:443 2025-06-02 22:06:31,439 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/d9/51/d951593388d0af1cb4a029c311ba19f9b05090d9acc4606c2b82588297ea4397/134301ca94fb0df8027be9a6dad1908fe6218af8ffa4d34f0819c7c2226195f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27yolox_l0.05.onnx%3B+filename%3D%22yolox_l0.05.onnx%22%3B&Expires=1748904676&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY3Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9kOS81MS9kOTUxNTkzMzg4ZDBhZjFjYjRhMDI5YzMxMWJhMTlmOWIwNTA5MGQ5YWNjNDYwNmMyYjgyNTg4Mjk3ZWE0Mzk3LzEzNDMwMWNhOTRmYjBkZjgwMjdiZTlhNmRhZDE5MDhmZTYyMThhZjhmZmE0ZDM0ZjA4MTljN2MyMjI2MTk1ZjM~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=hxvwTzJynEvyE~UuirlH~L4c5Gc6rGksDp~Uw94ooayDrzshE2sDdHmvqgoQyzqxHHhZLjfiJlAGUtVO7nVAHSoqt8mH7H9yN51Zj5UGqI-odXtW1dmWCD3i7nwwNlrEEjlXlERkIScpIjpkJDnjwhzeE94l1s7gysIm8c6J8JTcDlsdMver5wAVrBtLSVUrDN8PC84xgOGerHVhX7-eZcUVG2OAIJHoB3s2gLPkW9aVM5fvCmmoXMPI9oCvgLUp-zhXv3cWHh~yURuY1ufoI4CFG5ogW8nV~V45qLlbRw9PrvfFoLS-wxBGDOhT3SRWVOJzRRmACByABGWYMXRFuw__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 216625723 2025-06-02 22:06:35,019 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:06:37,290 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:06:37,375 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 1469 2025-06-02 22:06:37,484 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0 2025-06-02 22:06:37,581 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 0 Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet` 2025-06-02 22:06:37,586 - huggingface_hub.file_download - WARNING - Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet` 2025-06-02 22:06:37,681 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 1319 2025-06-02 22:06:37,685 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cas-bridge.xethub.hf.co:443 2025-06-02 22:06:37,778 - urllib3.connectionpool - DEBUG - https://cas-bridge.xethub.hf.co:443 "GET /xet-bridge-us/634929bd8146350b3a4cadaf/e78778928a1863786d5bb22a109a7ff1dbac47a29eae6f223a1fc2689172c347?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250602%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250602T220637Z&X-Amz-Expires=3600&X-Amz-Signature=c0a361e8982b1b05ee443054646b438e5a68d6767ef6df03dad6c5db20d0bdc5&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1748905597&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNTU5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MzQ5MjliZDgxNDYzNTBiM2E0Y2FkYWYvZTc4Nzc4OTI4YTE4NjM3ODZkNWJiMjJhMTA5YTdmZjFkYmFjNDdhMjllYWU2ZjIyM2ExZmMyNjg5MTcyYzM0NyoifV19&Signature=cRjZe56uJ8vxmmgRhPmp7XZX69PHKoXO9XN1bfq5n~84Vxz~HvCmg6MqtuUAFIiOWAHFhOuVzJpoiWTYT1JdZrtMeQTdywnZM-lIIn5Q45kzr8q8C58yvLz7vmKKrD9pOnGjJPaVavYYxEDdlAXbWf6xo433kKF4TfmQ9z7UIKt~M-XV9EdPUUBNhByucLVcTZ3sec5DqI4FmzK28fdJ1BMD4NyDjWW6hi~Lp2V3bW0FLCpI6qKGuikJ3E-OVcJDdDvZAqSN0-GoQyHIP9kp4RTqPBb7jekpZ3Uj91UWEmGx6YNuNlorAMGi61hrL6mAUUmW13OGua2vcJyk9LxZQg__&Key-Pair-Id=K2L8F4GPSG1IFC HTTP/1.1" 200 115434268 2025-06-02 22:06:39,612 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:06:39,696 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0 2025-06-02 22:06:39,714 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/42/d5/42d585781e0b74854ae52a1bc2a63d09896f1d70f86bff969f4c053508d6c2d6/80c49dee3da4822c009c5a7fe591e9223c5a2cfcf95a4067ca4dfb5a7b89c612?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1748904665&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy80Mi9kNS80MmQ1ODU3ODFlMGI3NDg1NGFlNTJhMWJjMmE2M2QwOTg5NmYxZDcwZjg2YmZmOTY5ZjRjMDUzNTA4ZDZjMmQ2LzgwYzQ5ZGVlM2RhNDgyMmMwMDljNWE3ZmU1OTFlOTIyM2M1YTJjZmNmOTVhNDA2N2NhNGRmYjVhN2I4OWM2MTI~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=GL15CLiGsmHno-DP25kfcuObjbrjd~ir5C5xapGqb9lda~5Wjy-3axBPftr1xWUnKh24Ay0mS49U8ZOcEdQxmzxQ97HiSX0-8s0-H187hV6mId6uxsULOGkNtjpkMKhfxe0qIfAmfi9gxl9JdiVfG5367HfPDVST8NvGPqMuKYoywSNWA-Uby-L9qb~EjtxbH9v1H2g6C0i9t2mn8ghD8BtTWEn4LY9c4O5bI~EQatNToNjsQTKa18LzXEowZnODLSLkyE7beLzfEpuTX9vlDzcAwKCPp-1M3xMZI4tzR-yfzyGhW19wqc6BVncUw53WSK7oOCv56HmFTYHhzOE-eQ__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 46807446 2025-06-02 22:06:40,394 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:06:40,396 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:06:40,460 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:06:42,985 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:06:48,019 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:06:48,045 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:06:50,557 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:06:52,358 - unstructured_inference - INFO - Reading image file: /tmp/tmpsha4r586/unstructured_logo.png ... 2025-06-02 22:06:54,199 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpg_5lk06v/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: analytics-online-only ------------------------------------------- tshark output for analytics-online-only ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | 172.18.0.2 <-> 54.236.224.89 12 5,831 bytes 12 2,656 bytes 24 8,487 bytes 0.032536000 0.3535 172.18.0.2 <-> 192.168.65.5 1 462 bytes 1 84 bytes 2 546 bytes 0.000000000 0.0322 ------------------------------------------ python log output for analytics-online-only ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:08:10,114 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:08:10,320 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:08:12,475 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:08:12,476 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:08:12,478 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:08:12,548 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:08:15,102 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:08:20,163 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:08:20,189 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:08:22,732 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:08:24,468 - unstructured_inference - INFO - Reading image file: /tmp/tmp4oud0ctq/unstructured_logo.png ... 2025-06-02 22:08:26,297 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpv24idrvu/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: offline ------------------------------------------- tshark output for offline ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | ------------------------------------------ python log output for offline ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:09:37,826 - matplotlib.font_manager - INFO - generated new fontManager 2025-06-02 22:09:38,028 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ... 2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the Table agent ... 2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the table structure model ... 2025-06-02 22:09:40,193 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k) 2025-06-02 22:09:40,193 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors. 2025-06-02 22:09:40,195 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted. [INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg 2025-06-02 22:09:40,260 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ... [INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg 2025-06-02 22:09:42,810 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ... [INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf 2025-06-02 22:09:47,851 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized 2025-06-02 22:09:47,877 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ... [INFO] partitioning /app/example-docs/pdf/all-number-table.pdf 2025-06-02 22:09:50,475 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ... [INFO] partitioning /app/example-docs/fake-power-point.pptx [INFO] partitioning /app/example-docs/stanley-cups.xlsx [INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg 2025-06-02 22:09:52,181 - unstructured_inference - INFO - Reading image file: /tmp/tmpn3rraz6o/unstructured_logo.png ... 2025-06-02 22:09:54,032 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpvbqk645u/dense_doc.pdf ... ================================================================== ======================================== Begin Scenario: offline-and-missing-models ------------------------------------------- tshark output for offline-and-missing-models ------------------------------------------- IPv4 Conversations Filter:<No Filter> | <- | | -> | | Total | Relative | Duration | | Frames Bytes | | Frames Bytes | | Frames Bytes | Start | | ------------------------------------------ python log output for offline-and-missing-models ------------------------------------------ [INFO] partitioning /app/example-docs/ideas-page.html [INFO] partitioning /app/example-docs/category-level.docx [INFO] partitioning /app/example-docs/fake_table.docx [INFO] partitioning /app/example-docs/img/english-and-korean.png 2025-06-02 22:11:05,743 - matplotlib.font_manager - INFO - generated new fontManager Traceback (most recent call last): File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1484, in _get_metadata_or_catch_error metadata = get_hf_file_metadata( ^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1401, in get_hf_file_metadata r = _request_wrapper( ^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 285, in _request_wrapper response = _request_wrapper( ^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 308, in _request_wrapper response = get_session().request(method=method, url=url, **params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request resp = self.send(prep, **send_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send r = adapter.send(request, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 107, in send raise OfflineModeIsEnabled( huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<stdin>", line 35, in <module> File "/app/unstructured/partition/auto.py", line 231, in partition elements = partition_image( ^^^^^^^^^^^^^^^^ File "/app/unstructured/documents/elements.py", line 585, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/file_utils/filetype.py", line 774, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper elements = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/image.py", line 102, in partition_image return partition_pdf_or_image( ^^^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/pdf.py", line 341, in partition_pdf_or_image elements = _partition_pdf_or_image_local( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/utils.py", line 216, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/app/unstructured/partition/pdf.py", line 649, in _partition_pdf_or_image_local inferred_document_layout = process_file_with_model( ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 371, in process_file_with_model model = get_model(model_name, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/models/base.py", line 74, in get_model model.initialize(**initialize_params) File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 40, in __getitem__ value = evaluate(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 115, in download_if_needed_and_get_local_path return hf_hub_download(path_or_repo, filename, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 961, in hf_hub_download return _hf_hub_download_to_cache_dir( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1068, in _hf_hub_download_to_cache_dir _raise_on_head_call_error(head_call_error, force_download, local_files_only) File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1599, in _raise_on_head_call_error raise LocalEntryNotFoundError( huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on. ```
This commit is contained in:
parent
e42884a566
commit
3a048a5a02
2
.gitignore
vendored
2
.gitignore
vendored
@ -210,3 +210,5 @@ metricsdiff.txt
|
||||
# analysis
|
||||
annotated/
|
||||
.aider*
|
||||
pcaps
|
||||
python-output
|
||||
|
61
scripts/image/test-all-outbound-connectivity-scenarios.sh
Executable file
61
scripts/image/test-all-outbound-connectivity-scenarios.sh
Executable file
@ -0,0 +1,61 @@
|
||||
#!/usr/bin/env bash
|
||||
|
||||
# Note:
|
||||
#
|
||||
# The scenarios baseline, missing-models, and analytics-online-only
|
||||
# are expected to have conversations reported by tshark
|
||||
#
|
||||
# The scenarios offline and offline-and-missing-models
|
||||
# are *NOT* expected to have any conversations (or attempted conversations) reported by tshark
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# shellcheck disable=SC2015
|
||||
((BASH_VERSINFO[0] >= 5)) || {
|
||||
echo "Requires bash >= 5" >&2
|
||||
exit 1
|
||||
}
|
||||
|
||||
mkdir -p python-output
|
||||
mkdir -p pcaps
|
||||
|
||||
start_timestamp_seconds=$(date +%s)
|
||||
|
||||
./test-outbound-connectivity.sh --cleanup baseline
|
||||
./test-outbound-connectivity.sh --cleanup missing-models
|
||||
./test-outbound-connectivity.sh --cleanup analytics-online-only
|
||||
./test-outbound-connectivity.sh --cleanup offline
|
||||
./test-outbound-connectivity.sh --cleanup offline-and-missing-models
|
||||
|
||||
set +e
|
||||
found_pcap_files=$(find "pcaps" -maxdepth 1 -name "*.pcap" -type f -newermt "@$start_timestamp_seconds" 2>/dev/null | wc -l | tr -d ' ')
|
||||
found_log_files=$(find "python-output" -maxdepth 1 -name "*.log" -type f -newermt "@$start_timestamp_seconds" 2>/dev/null | wc -l | tr -d ' ')
|
||||
set -e
|
||||
if [ "$found_pcap_files" -ne "5" ]; then
|
||||
echo "Expected to find 4 fresh pcap/ files from this test but found $found_pcap_files instead"
|
||||
exit 1
|
||||
fi
|
||||
if [ "$found_log_files" -ne "5" ]; then
|
||||
echo "Expected to find 4 fresh python-output .log files from this test but found $found_log_files instead"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
for scenario in baseline missing-models analytics-online-only offline offline-and-missing-models; do
|
||||
echo
|
||||
|
||||
echo "=================================================================="
|
||||
echo "======================================== Begin Scenario: $scenario"
|
||||
echo
|
||||
echo " -------------------------------------------"
|
||||
echo " tshark output for $scenario"
|
||||
echo " -------------------------------------------"
|
||||
echo
|
||||
tshark -r pcaps/$scenario.pcap -q -z conv,ip | grep -v '===================================='
|
||||
|
||||
echo
|
||||
echo " ------------------------------------------"
|
||||
echo " python log output for $scenario"
|
||||
echo " ------------------------------------------"
|
||||
echo
|
||||
cat python-output/$scenario.log
|
||||
done
|
183
scripts/image/test-outbound-connectivity.sh
Executable file
183
scripts/image/test-outbound-connectivity.sh
Executable file
@ -0,0 +1,183 @@
|
||||
#!/usr/bin/env bash
|
||||
#
|
||||
# test-outbound-connectivity.sh
|
||||
#
|
||||
# Capture every external packet an Unstructured Docker image emits while
|
||||
# partition()‑ing a test PNG, *inside the same container* (works on macOS).
|
||||
#
|
||||
# In addition **also capture the Python workload's stdout / stderr** and save it
|
||||
# under ./python-output/<scenario>.log while still streaming it to your terminal.
|
||||
#
|
||||
# Usage examples
|
||||
# ./test-outbound-connectivity.sh baseline
|
||||
# ./test-outbound-connectivity.sh --cleanup missing-models
|
||||
# ./test-outbound-connectivity.sh --cleanup offline
|
||||
# ./test-outbound-connectivity.sh offline-and-missing-models
|
||||
#
|
||||
# Outputs:
|
||||
# ./pcaps/<scenario>.pcap
|
||||
# ./python-output/<scenario>.log
|
||||
# ---------------------------------------------------------------------
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
######################## user‑tunable constants ########################
|
||||
IMAGE="downloads.unstructured.io/unstructured-io/unstructured:e42884a"
|
||||
NET="unstructured_test_net"
|
||||
CAPTURE_IFACE="${CAPTURE_IFACE:-eth0}"
|
||||
PCAP_DIR="$(pwd)/pcaps"
|
||||
PY_LOG_DIR="$(pwd)/python-output" # where Python logs go
|
||||
HF_CACHE="/home/notebook-user/.cache/huggingface"
|
||||
########################################################################
|
||||
|
||||
# shellcheck disable=SC2015
|
||||
((BASH_VERSINFO[0] >= 5)) || {
|
||||
echo "Requires bash >= 5" >&2
|
||||
exit 1
|
||||
}
|
||||
|
||||
# Create output directories up‑front so failures don’t leave us empty‑handed
|
||||
mkdir -p "$PCAP_DIR" "$PY_LOG_DIR"
|
||||
|
||||
# ---------- parse flags (optional --cleanup) --------------------------
|
||||
CLEANUP=0
|
||||
if [[ "${1:-}" == "--cleanup" ]]; then
|
||||
CLEANUP=1
|
||||
shift
|
||||
fi
|
||||
|
||||
SCENARIO="${1:-}"
|
||||
if [[ -z "$SCENARIO" ]]; then
|
||||
echo "Usage: $0 [--cleanup] {baseline|missing-models|offline|offline-and-missing-models}" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# ---------- optional pre‑run cleanup ----------------------------------
|
||||
if ((CLEANUP)); then
|
||||
echo ">>> Removing leftover sut_* containers…"
|
||||
# shellcheck disable=SC2015
|
||||
docker rm -f "$(docker ps -aq --filter name='^sut_')" 2>/dev/null || true
|
||||
fi
|
||||
|
||||
# ---------- scenario‑specific settings --------------------------------
|
||||
DO_NOT_TRACK=""
|
||||
HF_HUB_OFFLINE=""
|
||||
REMOVE_CACHE=0
|
||||
case "$SCENARIO" in
|
||||
baseline) ;;
|
||||
missing-models) REMOVE_CACHE=1 ;;
|
||||
analytics-online-only) HF_HUB_OFFLINE=1 ;;
|
||||
offline)
|
||||
DO_NOT_TRACK=true
|
||||
HF_HUB_OFFLINE=1
|
||||
;;
|
||||
offline-and-missing-models)
|
||||
DO_NOT_TRACK=true
|
||||
HF_HUB_OFFLINE=1
|
||||
REMOVE_CACHE=1
|
||||
;;
|
||||
*)
|
||||
echo "Unknown scenario: $SCENARIO"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
|
||||
docker network inspect "$NET" >/dev/null 2>&1 || docker network create "$NET"
|
||||
|
||||
# ---------- launch SUT idle -------------------------------------------
|
||||
CID=$(docker run -d --rm --name "sut_${SCENARIO}" \
|
||||
--network "$NET" \
|
||||
--cap-add NET_RAW --cap-add NET_ADMIN \
|
||||
-e DO_NOT_TRACK="$DO_NOT_TRACK" \
|
||||
-e HF_HUB_OFFLINE="$HF_HUB_OFFLINE" \
|
||||
--entrypoint /bin/sh "$IMAGE" -c "sleep infinity")
|
||||
echo "Container: $CID (scenario $SCENARIO)"
|
||||
|
||||
# install tcpdump (Wolfi uses apk) as root
|
||||
docker exec -u root "$CID" apk add --no-cache tcpdump >/dev/null
|
||||
|
||||
# optionally wipe HF cache
|
||||
# shellcheck disable=SC2015
|
||||
((REMOVE_CACHE)) && docker exec "$CID" rm -rf "$HF_CACHE" || true
|
||||
|
||||
# ---------- start tcpdump in background -------------------------------
|
||||
FILTER='not (dst net ff02::/16 or src net ff02::/16 or ip6[6] = 58 or ether multicast)'
|
||||
|
||||
docker exec -u root -d "$CID" sh -c "tcpdump -U -n -i $CAPTURE_IFACE '$FILTER' -w /tmp/capture.pcap > /tmp/tcpdump.log 2>&1"
|
||||
|
||||
# check if tcpdump stayed alive
|
||||
sleep 2
|
||||
if ! docker exec "$CID" pgrep tcpdump >/dev/null; then
|
||||
echo 'tcpdump exited – showing its log:'
|
||||
docker exec "$CID" cat /tmp/tcpdump.log
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "tcpdump running on interface $CAPTURE_IFACE..."
|
||||
# ---------- run the Python workload -----------------------------------
|
||||
echo ">>> Running Python workload (capturing stdout/stderr)…"
|
||||
# ‑ The "|&" pipes *both* stdout *and* stderr into tee.
|
||||
# ‑ tee sends it to the terminal *and* writes the log file.
|
||||
# ‑ With `set -o pipefail` we still fail early if the Python process exits non‑zero.
|
||||
|
||||
if [[ "$HF_HUB_OFFLINE" -eq 1 && "$REMOVE_CACHE" -eq 1 ]]; then
|
||||
echo "HF_HUB_OFFLINE=1 and REMOVE_CACHE=1 : allowing python command have a non-exit 0 status and will continue the script."
|
||||
set +e
|
||||
fi
|
||||
|
||||
docker exec -i -e PYTHONUNBUFFERED=1 "$CID" python - <<PY |& tee "${PY_LOG_DIR}/${SCENARIO}.log"
|
||||
import logging
|
||||
from unstructured.partition.auto import partition
|
||||
from unstructured.logger import logger # force analytics ping if not DO_NOT_TRACK
|
||||
import urllib.request, time, os, sys
|
||||
|
||||
# Configure detailed logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.StreamHandler(sys.stdout),
|
||||
logging.FileHandler('test_platform_api.log')
|
||||
]
|
||||
)
|
||||
logging.getLogger("urllib").setLevel(logging.DEBUG)
|
||||
logging.getLogger("urllib3").setLevel(logging.DEBUG)
|
||||
logging.getLogger("httpx").setLevel(logging.DEBUG)
|
||||
logging.getLogger("httpcore").setLevel(logging.DEBUG)
|
||||
logging.getLogger("pdfminer.pdfpage").setLevel(logging.CRITICAL)
|
||||
|
||||
for test_file in [
|
||||
"/app/example-docs/ideas-page.html",
|
||||
"/app/example-docs/category-level.docx",
|
||||
"/app/example-docs/fake_table.docx",
|
||||
"/app/example-docs/img/english-and-korean.png",
|
||||
"/app/example-docs/img/embedded-images-tables.jpg",
|
||||
"/app/example-docs/img/layout-parser-paper-with-table.jpg",
|
||||
"/app/example-docs/pdf/embedded-images-tables.pdf",
|
||||
"/app/example-docs/pdf/all-number-table.pdf",
|
||||
"/app/example-docs/fake-power-point.pptx",
|
||||
"/app/example-docs/stanley-cups.xlsx",
|
||||
"/app/example-docs/fake-email-multiple-attachments.msg",
|
||||
]:
|
||||
print("[INFO] partitioning "+test_file)
|
||||
partition(test_file, strategy="hi_res", skip_infer_table_types=[])
|
||||
## add this if you always want to force an external connection
|
||||
#print("[INFO] done partitioning; hitting google…")
|
||||
#urllib.request.urlopen("https://www.google.com", timeout=10).read(64)
|
||||
#print("[INFO] google fetch finished")
|
||||
|
||||
time.sleep(1) # ensure FIN packets captured
|
||||
PY
|
||||
|
||||
echo "Python finished. Log saved to ${PY_LOG_DIR}/${SCENARIO}.log"
|
||||
|
||||
set -e
|
||||
|
||||
# ---------- stop tcpdump, copy pcap, clean up -------------------------
|
||||
docker exec "$CID" pkill -2 tcpdump || true
|
||||
sleep 1 # let pcap flush
|
||||
|
||||
docker cp "$CID:/tmp/capture.pcap" "${PCAP_DIR}/${SCENARIO}.pcap"
|
||||
echo "pcap saved to ${PCAP_DIR}/${SCENARIO}.pcap"
|
||||
|
||||
docker stop "$CID" >/dev/null
|
Loading…
x
Reference in New Issue
Block a user