1760 Commits

Author SHA1 Message Date
jiajun-unstructured
d24dec5e04
add '|' as a delimiter in csv files (#4059)
This PR fixes the error “Failure to process CSV: Expected 2 fields in
line 2, saw 4” when '|' is used as a delimiter in the csv file
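
To illustrate the failure mode, here is a minimal sketch of delimiter detection with the standard library's `csv.Sniffer`. The helper name `detect_delimiter` and the candidate set `,;\t|` are assumptions for illustration, not necessarily what the PR implements:

```python
import csv

def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV sample, falling back to a comma."""
    try:
        # Restricting the sniffer to known candidates lets '|' be recognized.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","

sample = "name|city|score\nAda|London|10\nLin|Paris|7\n"
delimiter = detect_delimiter(sample)
rows = list(csv.reader(sample.splitlines(), delimiter=delimiter))
```

If `|` is missing from the candidate set, each line is parsed with the wrong delimiter, which is how field-count mismatches like the one quoted above can arise.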
2025-07-18 17:56:24 +00:00
Nick Franck
a040483a7e
Add OCR_AGENT_CACHE_SIZE environment variable (#4066)
## Problem
OCR agents used unlimited caching, causing excessive memory usage. Each
cached OCR agent consumes different amounts of memory, but can easily
consume ~800MB.

## Solution
Add `OCR_AGENT_CACHE_SIZE` environment variable to limit cached OCR
agents per process.

- **Default**: 1 cached agent
- **Configurable**: Set to 0 to disable caching, or higher for more
languages
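
A rough sketch of how such a bound can behave, assuming an LRU cache keyed by language. `build_agent_getter` and the placeholder agent object are hypothetical and only mimic the behavior described above:

```python
import functools
import os
from typing import Callable

def build_agent_getter(cache_size: int) -> Callable[[str], object]:
    """Return a getter that keeps at most `cache_size` agents alive."""
    @functools.lru_cache(maxsize=cache_size)
    def get_agent(language: str) -> object:
        # Hypothetical stand-in for an expensive OCR agent constructor.
        return object()
    return get_agent

# Default of 1 mirrors the PR description; 0 disables caching entirely.
cache_size = int(os.environ.get("OCR_AGENT_CACHE_SIZE", "1"))
get_agent = build_agent_getter(cache_size)
```

With a size of 1, switching languages evicts the previous agent, so memory stays bounded at the cost of re-creating agents when languages alternate.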
0.18.10
2025-07-18 15:47:55 +00:00
qued
869ef457fe
build: Update CodeQL GHA to v3 (#4065)
We were using CodeQL v2, which has been [deprecated since
January](https://github.blog/changelog/2025-01-10-code-scanning-codeql-action-v2-is-now-deprecated/).
2025-07-17 21:07:22 +00:00
Yao You
909716f310
feat: keep input tag's class attr in table (#4064)
This change affects partition html.

Previously when there is a table in the html, we clean any tags inside
the table of their class and id attributes except for the class
attribute for `img` tags. This change also preserves the class attribute
for `input` tags inside a table. This change is reflected in a table
element's metadata.text_as_html attribute.
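
The cleaning rule can be sketched with the standard library as follows. This assumes well-formed markup and also cleans the `table` tag itself; `clean_table_attributes` is a hypothetical helper, not the function used in `unstructured`:

```python
import xml.etree.ElementTree as ET

# Tags whose `class` attribute is preserved inside a table.
KEEP_CLASS_ON = {"img", "input"}

def clean_table_attributes(table_html: str) -> str:
    """Drop `id` everywhere and `class` on all tags except img/input."""
    table = ET.fromstring(table_html)  # assumes well-formed markup
    for node in table.iter():
        node.attrib.pop("id", None)
        if node.tag not in KEEP_CLASS_ON:
            node.attrib.pop("class", None)
    return ET.tostring(table, encoding="unicode")

cleaned = clean_table_attributes(
    '<table class="t"><tr><td id="c1" class="cell">'
    '<input class="checkbox" type="checkbox"/><img class="icon" src="a.png"/>'
    "</td></tr></table>"
)
```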
0.18.9
2025-07-16 21:46:58 +00:00
shreyanid
446826885b
fix: add empty string case for language metadata (#4062)
Add an empty string edge case for when the element text field is None or
not a string.

Most of the diff is from `make tidy`.
2025-07-16 21:35:00 +00:00
qued
c7c3e3c082
feat: convert elements to markdown (#4055)
Creates a staging function `elements_to_md` to convert lists of
`Elements` to markdown strings (or a markdown file). Includes unit tests
as well as ingest tests and expected output fixtures.
2025-07-16 14:34:29 +00:00
Filip Knefel
f66562b1cb
fix: properly handle password protected xlsx (#4057)
### Issue
Attempting to partition a password-protected XLSX file results in an
obscure exception
> Can't find workbook in OLE2 compound document

### Solution
Use the [msoffcrypto-tool](https://pypi.org/project/msoffcrypto-tool/)
package (MIT License) to load the XLSX file and check whether it is
encrypted; if so, raise an `UnprocessableEntityError` exception
detailing the reason for rejecting the file.
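
For background on why the original error mentions OLE2: a regular XLSX file is a ZIP package, while a password-protected one is wrapped in an OLE2 compound file. The sketch below is a standard-library heuristic that only inspects magic bytes. It is not the msoffcrypto-tool check the PR uses, and `classify_xlsx_container` is a hypothetical name:

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header signature

def classify_xlsx_container(data: bytes) -> str:
    """Classify the container format from the leading bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        # For a file expected to be XLSX this suggests encryption
        # (or a mislabeled legacy .xls file).
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A dedicated library is still preferable for the real check, since the heuristic cannot tell an encrypted XLSX from a legacy `.xls` file.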

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
2025-07-16 13:19:14 +00:00
shreyanid
344202fa6d
feat: detect language for PDFs (#4051)
The `@apply_metadata` decorator already contains logic to detect the
language of the element text (on either a document or element level).
Update PDF partitioning, and later image partitioning, to use this
decorator so that elements report accurate language results.

Test
```
from unstructured.partition.auto import partition

def test_partition_pdf():
    pdf_path = "example-docs/language-docs/fr_olap.pdf"
    elements = partition(pdf_path)  # optionally pass `detect_language_per_element=True`
    print(f"Number of elements partitioned: {len(elements)}")

    # Check if elements are returned
    assert len(elements) > 0, "No elements were partitioned from the PDF."

    # check language outputted for each element
    for element in elements:
        print(element)
        print(element.metadata.languages)
        print("-------------------------------")

test_partition_pdf()
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
0.18.7
2025-07-15 18:53:28 +00:00
Ahmet Melek
2ffaf6f323
fix: type for serialized TableChunks (#4056)
#### To test, simply serialize a TableChunk element with and without the
changes in the PR

____
**Without the changes:**

```
In [1]: from unstructured.documents.elements import TableChunk

In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x110113410>

In [3]: TableChunk("hi").to_dict()
Out[3]: 
{'type': 'Table',
 'element_id': '6267e99a-46d8-4f2d-a206-51c691469c72',
 'text': 'hi',
 'metadata': {}}
```

____
**With the changes:**

```
In [1]: from unstructured.documents.elements import TableChunk

In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x10367f050>

In [3]: TableChunk("hi").to_dict()
Out[3]: 
{'type': 'TableChunk',
 'element_id': 'f91af3ac-0dea-4dc4-8a6a-69c28cfcca3b',
 'text': 'hi',
 'metadata': {}}
```
____
0.18.6
2025-07-15 17:29:02 +00:00
mateuszkuprowski
37800c3523
feat: added new exception type to epub conversions (#4052)
Added UnprocessableEpubError to better handle the case when the incoming
epub file is actually damaged, which makes the pandoc lib crash with exit
code 64.
2025-07-15 10:56:22 +00:00
Yao You
73d239fb28
feat: keep img tag's class attr (#4050)
This change affects partition html.

Previously when there is a table in the html, we clean any tags inside
the table of their `class` and `id` attributes. However, sometimes there
are images, `img` tags, present in a table and its `class` attribute
identifies some important information about the image. This change
preserves the `class` attribute for `img` tags inside a table. This
change is reflected in a table element's `metadata.text_as_html`
attribute.
0.18.5
2025-07-10 20:46:28 +00:00
qued
7764fb6fd4
build: drop remaining Python 3.9 refs (#4049)
Dropped variables that said we support Python 3.9 in `setup.py`, as well
as any remaining references to Python 3.9.

I also checked the pins and removed several that don't seem necessary
any more.
2025-07-10 16:43:15 +00:00
jiajun-unstructured
92965fb286
add fenced-code extension to the md parser (#4044)
https://github.com/Unstructured-IO/unstructured/issues/3578

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: Alan Bertl <alan@unstructured.io>
2025-07-07 21:05:54 +00:00
Filip Knefel
f078cd923b
fix(partition, csv): increase csv field limit (#4046)
Increase the csv field limit to support partitioning of files with large
data in fields.
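
For context, Python's `csv` module rejects fields longer than its field size limit (131072 characters by default). A minimal sketch of raising the limit for a single read follows. `read_rows` is a hypothetical helper, and the PR may set the limit differently:

```python
import csv
import io

def read_rows(text: str, field_limit: int) -> list[list[str]]:
    """Parse CSV text with a temporary field size limit."""
    previous = csv.field_size_limit(field_limit)  # returns the old limit
    try:
        return list(csv.reader(io.StringIO(text)))
    finally:
        csv.field_size_limit(previous)  # restore the process-wide setting

big_text = "id,payload\n1," + "x" * 200_000 + "\n"
rows = read_rows(big_text, field_limit=1_000_000)
```

The limit is process-wide, which is why the sketch restores the previous value after parsing.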

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
0.18.4
2025-07-07 14:12:53 +00:00
Austin Walker
8a9abddb16
chore: bump pillow to address a CVE (#4045)
0.18.3
2025-07-05 18:33:15 +00:00
Yao You
d7dfda9ecb
bump version to make a release (#4042)
0.18.2
2025-07-01 23:06:02 +00:00
Yao You
aa332101ab
fix: fix header and footer not parsed as Header/Footer types (#4041)
## Summary

This PR fixes an issue where header/footer content in HTML is not
partitioned as `unstructured` `Header` or `Footer` element types.
Instead, it is partitioned either as `UncategorizedText` or as the type of
the nested structure inside the header/footer. E.g., `<header class="Header"><h1
class="Title">Header Title</h1></header>` would be partitioned as a
`Title` instead of `Header`.

## Bug description

This behavior is because we treat header and footer as layout, i.e.,
containers, in the ontology definition. As a result, during parsing we
[unwrap](ec209c6b5f/unstructured/partition/html/transformations.py (L361-L378))
the container and parse the contents as if they are from the main text
even though they are still part of header/footer.

The fix is to treat header/footer as text instead of layout in the
ontology so that all content inside them is properly gathered under
`Header`/`Footer` element types.
2025-07-01 21:58:43 +00:00
Klaijan
45c3b63dcc
bump version (#4038)
2025-07-01 17:44:24 +00:00
Klaijan
56e739b34c
fix: update md to reads umlauts on non-utf-8 files (#4037)
This PR updates `partition_md` to read files with non-UTF-8
encodings without failing.

Closes issue
https://github.com/Unstructured-IO/unstructured-api/issues/489
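
One common way to tolerate non-UTF-8 input is to try a short list of encodings in order. The sketch below assumes that approach; `decode_markdown` and the fallback list are hypothetical, and the PR may instead rely on encoding detection:

```python
def decode_markdown(
    raw: bytes,
    encodings: tuple[str, ...] = ("utf-8", "cp1252", "latin-1"),
) -> str:
    """Decode markdown bytes, trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when a custom list omits a catch-all such as latin-1.
    return raw.decode("utf-8", errors="replace")

text = decode_markdown("Grüße".encode("latin-1"))
```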
2025-07-01 16:38:30 +00:00
jiajun-unstructured
66640f26fe
fix: xml processing not escaped (#4034)
`<?xml version="1.0"?>` does not get escaped when converting to html, in
a code block like this in the markdown file
````
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head></head>
  <boolean>true</boolean>
</sparql>
````
which causes the parser to throw an error like

> AttributeError: 'lxml.etree._ProcessingInstruction' object has no
attribute 'is_phrasing'.

This PR preprocesses the original md file and adds indentation to `<?xml
version="1.0"?>` to force the xml code to be escaped when it is
converted to html
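
The preprocessing step can be sketched as a regular-expression pass that indents any line beginning with an XML declaration by four spaces, which Markdown treats as an indented code block. `indent_xml_declarations` is a hypothetical helper and may differ from the PR's actual code:

```python
import re

# Matches an XML declaration that starts at the beginning of a line.
XML_DECLARATION = re.compile(r"^<\?xml\b", flags=re.MULTILINE)

def indent_xml_declarations(markdown_text: str) -> str:
    """Indent XML declarations so the markdown parser escapes them."""
    return XML_DECLARATION.sub(r"    \g<0>", markdown_text)

indented = indent_xml_declarations('<?xml version="1.0"?>\n<sparql>\n</sparql>\n')
```

Lines that are already indented are left alone, so applying the pass twice does not add more indentation.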

https://github.com/Unstructured-IO/unstructured/issues/3935
2025-06-30 20:15:38 +00:00
Klaijan
dab79b0c83
fix: add try/except wrap over row.cells to failproof tc grid_offset (#4033)
This PR fixes an issue with `docx` files containing complex, recursive,
merged, or malformed tables by skipping cells that cannot be traced back
to a valid `<w:tc>` element used by `python-docx`, which happens when
rows are missing or improperly merged.

Accessing row.cells in such cases can raise a `ValueError` when
`python-docx` fails to resolve the full logical table layout. This PR
wraps those calls in `try/except` to skip problematic rows while
continuing to extract usable content from the rest of the document.
2025-06-30 14:20:18 +00:00
Yuming Long
c04235c168
fix [NEX-49] : Fix TypeError for empty HTML content (#4032)
### Summary

Addressed a TypeError that occurred when partitioning empty or
whitespace-only HTML content.

## Test
* The unit test
`test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error`
reproduces the TypeError before the fix.
* The test now passes.
2025-06-25 18:13:20 +00:00
ryannikolaidis
3f87946f56
feat: add DocumentData type (#4031)
In scenarios where there is a large amount of data that represents the
document rather than individual elements in the document, it may be
preferable to specify this in a single location rather than duplicating
the data across all elements (as we do for smaller metadata like
filename or filetype).

This PR adds DocumentData element type which can be used to uniquely
capture this data.
0.18.1
2025-06-23 03:46:25 +00:00
qued
6866fda860
fix: use encoding in context class (#4030)
Given the fact that the `_CsvPartitioningContext` defines an `_encoding`
property, this property was meant to be used. Behaviorally this change
should be a no-op, but supports future efforts where the partitioning
context applies internal logic.
2025-06-20 21:43:07 +00:00
luke-kucing
2aca876921
Luke/CVE python3.12 update (#4027)
2025-06-17 06:32:06 +00:00
jiajun-unstructured
b0dbd71aff
Parallelize tests (#4024)
2025-06-16 23:29:35 +00:00
Emmanuel Ferdman
531490d013
Migrate to modern bs4 interface (#4025)
## PR Summary
This small PR fixes the bs4 deprecation warnings which you can find in
the [CI
logs](https://github.com/Unstructured-IO/unstructured/actions/runs/15491657572/job/43729960936#step:3:2615):
```python
/app/unstructured/metrics/table/table_extraction.py:53: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
/app/unstructured/metrics/table/table_extraction.py:57: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
```

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-06-16 18:44:20 +00:00
cragwolfe
6ef2fc1ec6
chore: add claude (#4023)
2025-06-13 15:20:10 +00:00
Yuming Long
a80decdbd4
fix [NEX-28]: file_type is None for result_file_type in chunker partition json (#4022)
### Summary
`'NoneType' object has no attribute 'partitioner_shortname'` is raised
because `result_file_type = self._disambiguate_json_file_type` can return
`None` for the file type.
2025-06-13 15:19:09 +00:00
Yao You
5e43e36427
recompile on arm64 to get minimum reqs (#4020)
The new `torch==2.7.1` now comes with NVIDIA GPU support and Triton as
dependencies. Those are not supported on `arm64`, nor are they actually
used by `unstructured` on `amd64`. This is a quick patch to remove
them from the .txt requirements files to unblock builds.
0.17.11-dev1
2025-06-12 21:44:35 +00:00
Yuming Long
55ad5fd637
fix chucking text None type has no attribute stripe (#4018)
### Summary
Fixes the error `Error in chunk: 512: {"detail":"'NoneType' object has no
attribute 'strip'"}`. I found the logs under the same org (presumably
from the same job).

screenshot:
![Screenshot 2025-06-11 at 10 15
57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb)

stack trace from the `utic-api` ES log doc:
![Screenshot 2025-06-11 at 2 01
01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa)



### Notes
Longer term, we should make the partitioners (vlm + utic-api) not return
null text.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2025-06-12 18:28:46 +00:00
Pluto
ec209c6b5f
Remove IDs from HTML code (#4012)
In this pull request, the parent-child relationship for elements generated
with the v2 parser is based on actual element IDs instead of IDs baked
somewhere into the HTML script.
Together with some extra bug fixing, this allowed for significantly
simplifying the json -> HTML script.
2025-06-11 11:55:02 +00:00
Emily Voss
b6ab471f00
Drop Python 3.9 support due to dependency conflicts (#4017)
2025-06-10 23:32:11 -07:00
Emily Voss
06e4e54f5c
Bump requests to address CVEs (#4015)
2025-06-11 01:38:43 +00:00
Yao You
37d2f021a3
Feat/bump inference (#4013)
Bump `unstructured-inference` to `1.0.5`, which includes a fix to ensure
model init is thread safe.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-06-06 09:52:17 +00:00
luke-kucing
a7e90f7990
resolve CVEs and HF issue (#4009)
update reqs to resolve CVEs and add the HF ENV to stop it from reaching
out

updated the Dockerfile with
ENV HF_HUB_OFFLINE=1

to stop it from pinging HF, which was an issue for a gov customer. Also
updated requirements to resolve some open CVEs.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
2025-06-04 18:52:58 +00:00
cragwolfe
3a048a5a02
chore: script to verify unstructured image outbound connectivity (#4008)
Sample output. The key point is that the `offline` mode (meaning
`HF_HUB_OFFLINE=1` and `DO_NOT_TRACK=true` are set) results in no outbound
connections. This also holds if the locally cached models are removed
(the last scenario, `offline-and-missing-models`).

```
$ ./test-all-outbound-connectivity-scenarios.sh 
>>> Removing leftover sut_* containers…
Container: 543ac4b14370a18d790a2035e206e8c445754b825ec8b2887f4246f7404299c7  (scenario baseline)
tcpdump running on interface eth0...
>>> Running Python workload (capturing stdout/stderr)…
[INFO] partitioning /app/example-docs/ideas-page.html

<snip>

Python finished.  Log saved to /r/unstructured/scripts/image/python-output/offline-and-missing-models.log
pcap saved to /r/unstructured/scripts/image/pcaps/offline-and-missing-models.pcap

==================================================================
======================================== Begin Scenario: baseline

   -------------------------------------------
   tshark output for baseline
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
172.18.0.2           <-> 108.138.246.79            20 12 kB          20 4,176 bytes      40 16 kB         2.531247000        69.0419
172.18.0.2           <-> 3.214.154.119             11 5,777 bytes      12 2,656 bytes      23 8,433 bytes     0.029451000         0.4118
172.18.0.2           <-> 192.168.65.5               2 656 bytes       2 158 bytes       4 814 bytes     0.000000000         2.5310

   ------------------------------------------
   python log output for baseline
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:05:02,265 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:05:02,356 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443
2025-06-02 22:05:02,497 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0
2025-06-02 22:05:02,613 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:05:04,792 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:05:04,877 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:05:04,960 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:05:04,970 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:05:05,062 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0
2025-06-02 22:05:05,065 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:05:05,071 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:05:05,152 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:05:07,693 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:05:12,706 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:05:12,733 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:05:15,251 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:05:16,936 - unstructured_inference - INFO - Reading image file: /tmp/tmplkanlou1/unstructured_logo.png ...
2025-06-02 22:05:18,749 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpxdzdouhb/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: missing-models

   -------------------------------------------
   tshark output for missing-models
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
172.18.0.2           <-> 18.155.192.23         181834 273 MB      33502 1,813 kB   215336 275 MB        2.704106000        75.2880
172.18.0.2           <-> 3.168.86.41            79696 119 MB      15234 825 kB      94930 120 MB        9.066044000        68.9276
172.18.0.2           <-> 108.138.246.85            29 21 kB          25 5,760 bytes      54 27 kB         2.431857000        75.5633
172.18.0.2           <-> 3.214.154.119             12 5,831 bytes      12 2,656 bytes      24 8,487 bytes     0.016604000         0.3590
172.18.0.2           <-> 192.168.65.5               4 1,084 bytes       4 314 bytes       8 1,398 bytes     0.000000000         9.0651

   ------------------------------------------
   python log output for missing-models
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:06:30,961 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:06:31,046 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): huggingface.co:443
2025-06-02 22:06:31,300 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx HTTP/1.1" 302 0
2025-06-02 22:06:31,310 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cdn-lfs.hf.co:443
2025-06-02 22:06:31,439 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/d9/51/d951593388d0af1cb4a029c311ba19f9b05090d9acc4606c2b82588297ea4397/134301ca94fb0df8027be9a6dad1908fe6218af8ffa4d34f0819c7c2226195f3?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27yolox_l0.05.onnx%3B+filename%3D%22yolox_l0.05.onnx%22%3B&Expires=1748904676&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY3Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9kOS81MS9kOTUxNTkzMzg4ZDBhZjFjYjRhMDI5YzMxMWJhMTlmOWIwNTA5MGQ5YWNjNDYwNmMyYjgyNTg4Mjk3ZWE0Mzk3LzEzNDMwMWNhOTRmYjBkZjgwMjdiZTlhNmRhZDE5MDhmZTYyMThhZjhmZmE0ZDM0ZjA4MTljN2MyMjI2MTk1ZjM~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=hxvwTzJynEvyE~UuirlH~L4c5Gc6rGksDp~Uw94ooayDrzshE2sDdHmvqgoQyzqxHHhZLjfiJlAGUtVO7nVAHSoqt8mH7H9yN51Zj5UGqI-odXtW1dmWCD3i7nwwNlrEEjlXlERkIScpIjpkJDnjwhzeE94l1s7gysIm8c6J8JTcDlsdMver5wAVrBtLSVUrDN8PC84xgOGerHVhX7-eZcUVG2OAIJHoB3s2gLPkW9aVM5fvCmmoXMPI9oCvgLUp-zhXv3cWHh~yURuY1ufoI4CFG5ogW8nV~V45qLlbRw9PrvfFoLS-wxBGDOhT3SRWVOJzRRmACByABGWYMXRFuw__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 216625723
2025-06-02 22:06:35,019 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:06:37,188 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:06:37,290 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:06:37,375 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 1469
2025-06-02 22:06:37,484 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/config.json HTTP/1.1" 200 0
2025-06-02 22:06:37,581 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 0
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
2025-06-02 22:06:37,586 - huggingface_hub.file_download - WARNING - Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
2025-06-02 22:06:37,681 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "GET /microsoft/table-transformer-structure-recognition/resolve/main/model.safetensors HTTP/1.1" 302 1319
2025-06-02 22:06:37,685 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): cas-bridge.xethub.hf.co:443
2025-06-02 22:06:37,778 - urllib3.connectionpool - DEBUG - https://cas-bridge.xethub.hf.co:443 "GET /xet-bridge-us/634929bd8146350b3a4cadaf/e78778928a1863786d5bb22a109a7ff1dbac47a29eae6f223a1fc2689172c347?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20250602%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250602T220637Z&X-Amz-Expires=3600&X-Amz-Signature=c0a361e8982b1b05ee443054646b438e5a68d6767ef6df03dad6c5db20d0bdc5&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&x-id=GetObject&Expires=1748905597&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNTU5N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82MzQ5MjliZDgxNDYzNTBiM2E0Y2FkYWYvZTc4Nzc4OTI4YTE4NjM3ODZkNWJiMjJhMTA5YTdmZjFkYmFjNDdhMjllYWU2ZjIyM2ExZmMyNjg5MTcyYzM0NyoifV19&Signature=cRjZe56uJ8vxmmgRhPmp7XZX69PHKoXO9XN1bfq5n~84Vxz~HvCmg6MqtuUAFIiOWAHFhOuVzJpoiWTYT1JdZrtMeQTdywnZM-lIIn5Q45kzr8q8C58yvLz7vmKKrD9pOnGjJPaVavYYxEDdlAXbWf6xo433kKF4TfmQ9z7UIKt~M-XV9EdPUUBNhByucLVcTZ3sec5DqI4FmzK28fdJ1BMD4NyDjWW6hi~Lp2V3bW0FLCpI6qKGuikJ3E-OVcJDdDvZAqSN0-GoQyHIP9kp4RTqPBb7jekpZ3Uj91UWEmGx6YNuNlorAMGi61hrL6mAUUmW13OGua2vcJyk9LxZQg__&Key-Pair-Id=K2L8F4GPSG1IFC HTTP/1.1" 200 115434268
2025-06-02 22:06:39,612 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:06:39,696 - urllib3.connectionpool - DEBUG - https://huggingface.co:443 "HEAD /timm/resnet18.a1_in1k/resolve/main/model.safetensors HTTP/1.1" 302 0
2025-06-02 22:06:39,714 - urllib3.connectionpool - DEBUG - https://cdn-lfs.hf.co:443 "GET /repos/42/d5/42d585781e0b74854ae52a1bc2a63d09896f1d70f86bff969f4c053508d6c2d6/80c49dee3da4822c009c5a7fe591e9223c5a2cfcf95a4067ca4dfb5a7b89c612?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1748904665&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0ODkwNDY2NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy80Mi9kNS80MmQ1ODU3ODFlMGI3NDg1NGFlNTJhMWJjMmE2M2QwOTg5NmYxZDcwZjg2YmZmOTY5ZjRjMDUzNTA4ZDZjMmQ2LzgwYzQ5ZGVlM2RhNDgyMmMwMDljNWE3ZmU1OTFlOTIyM2M1YTJjZmNmOTVhNDA2N2NhNGRmYjVhN2I4OWM2MTI~cmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=GL15CLiGsmHno-DP25kfcuObjbrjd~ir5C5xapGqb9lda~5Wjy-3axBPftr1xWUnKh24Ay0mS49U8ZOcEdQxmzxQ97HiSX0-8s0-H187hV6mId6uxsULOGkNtjpkMKhfxe0qIfAmfi9gxl9JdiVfG5367HfPDVST8NvGPqMuKYoywSNWA-Uby-L9qb~EjtxbH9v1H2g6C0i9t2mn8ghD8BtTWEn4LY9c4O5bI~EQatNToNjsQTKa18LzXEowZnODLSLkyE7beLzfEpuTX9vlDzcAwKCPp-1M3xMZI4tzR-yfzyGhW19wqc6BVncUw53WSK7oOCv56HmFTYHhzOE-eQ__&Key-Pair-Id=K3RPWS32NSSJCE HTTP/1.1" 200 46807446
2025-06-02 22:06:40,394 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:06:40,396 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:06:40,460 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:06:42,985 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:06:48,019 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:06:48,045 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:06:50,557 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:06:52,358 - unstructured_inference - INFO - Reading image file: /tmp/tmpsha4r586/unstructured_logo.png ...
2025-06-02 22:06:54,199 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpg_5lk06v/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: analytics-online-only

   -------------------------------------------
   tshark output for analytics-online-only
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |
172.18.0.2           <-> 54.236.224.89             12 5,831 bytes      12 2,656 bytes      24 8,487 bytes     0.032536000         0.3535
172.18.0.2           <-> 192.168.65.5               1 462 bytes       1 84 bytes        2 546 bytes     0.000000000         0.0322

   ------------------------------------------
   python log output for analytics-online-only
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:08:10,114 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:08:10,320 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:08:12,470 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:08:12,475 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:08:12,476 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:08:12,478 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:08:12,548 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:08:15,102 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:08:20,163 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:08:20,189 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:08:22,732 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:08:24,468 - unstructured_inference - INFO - Reading image file: /tmp/tmp4oud0ctq/unstructured_logo.png ...
2025-06-02 22:08:26,297 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpv24idrvu/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: offline

   -------------------------------------------
   tshark output for offline
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |

   ------------------------------------------
   python log output for offline
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:09:37,826 - matplotlib.font_manager - INFO - generated new fontManager
2025-06-02 22:09:38,028 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/english-and-korean.png ...
2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the Table agent ...
2025-06-02 22:09:40,188 - unstructured_inference - INFO - Loading the table structure model ...
2025-06-02 22:09:40,193 - timm.models._builder - INFO - Loading pretrained weights from Hugging Face hub (timm/resnet18.a1_in1k)
2025-06-02 22:09:40,193 - timm.models._hub - INFO - [timm/resnet18.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
2025-06-02 22:09:40,195 - timm.models._builder - INFO - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[INFO] partitioning /app/example-docs/img/embedded-images-tables.jpg
2025-06-02 22:09:40,260 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/embedded-images-tables.jpg ...
[INFO] partitioning /app/example-docs/img/layout-parser-paper-with-table.jpg
2025-06-02 22:09:42,810 - unstructured_inference - INFO - Reading image file: /app/example-docs/img/layout-parser-paper-with-table.jpg ...
[INFO] partitioning /app/example-docs/pdf/embedded-images-tables.pdf
2025-06-02 22:09:47,851 - pikepdf._core - INFO - pikepdf C++ to Python logger bridge initialized
2025-06-02 22:09:47,877 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/embedded-images-tables.pdf ...
[INFO] partitioning /app/example-docs/pdf/all-number-table.pdf
2025-06-02 22:09:50,475 - unstructured_inference - INFO - Reading PDF for file: /app/example-docs/pdf/all-number-table.pdf ...
[INFO] partitioning /app/example-docs/fake-power-point.pptx
[INFO] partitioning /app/example-docs/stanley-cups.xlsx
[INFO] partitioning /app/example-docs/fake-email-multiple-attachments.msg
2025-06-02 22:09:52,181 - unstructured_inference - INFO - Reading image file: /tmp/tmpn3rraz6o/unstructured_logo.png ...
2025-06-02 22:09:54,032 - unstructured_inference - INFO - Reading PDF for file: /tmp/tmpvbqk645u/dense_doc.pdf ...

==================================================================
======================================== Begin Scenario: offline-and-missing-models

   -------------------------------------------
   tshark output for offline-and-missing-models
   -------------------------------------------

IPv4 Conversations
Filter:<No Filter>
                                               |       <-      | |       ->      | |     Total     |    Relative    |   Duration   |
                                               | Frames  Bytes | | Frames  Bytes | | Frames  Bytes |      Start     |              |

   ------------------------------------------
   python log output for offline-and-missing-models
   ------------------------------------------

[INFO] partitioning /app/example-docs/ideas-page.html
[INFO] partitioning /app/example-docs/category-level.docx
[INFO] partitioning /app/example-docs/fake_table.docx
[INFO] partitioning /app/example-docs/img/english-and-korean.png
2025-06-02 22:11:05,743 - matplotlib.font_manager - INFO - generated new fontManager
Traceback (most recent call last):
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1484, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1401, in get_hf_file_metadata
    r = _request_wrapper(
        ^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 285, in _request_wrapper
    response = _request_wrapper(
               ^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 308, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 107, in send
    raise OfflineModeIsEnabled(
huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 35, in <module>
  File "/app/unstructured/partition/auto.py", line 231, in partition
    elements = partition_image(
               ^^^^^^^^^^^^^^^^
  File "/app/unstructured/documents/elements.py", line 585, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/image.py", line 102, in partition_image
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 341, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/utils.py", line 216, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/unstructured/partition/pdf.py", line 649, in _partition_pdf_or_image_local
    inferred_document_layout = process_file_with_model(
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 371, in process_file_with_model
    model = get_model(model_name, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/models/base.py", line 74, in get_model
    model.initialize(**initialize_params)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 40, in __getitem__
    value = evaluate(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/unstructured_inference/utils.py", line 115, in download_if_needed_and_get_local_path
    return hf_hub_download(path_or_repo, filename, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 961, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1068, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/notebook-user/.local/lib/python3.11/site-packages/huggingface_hub/file_download.py", line 1599, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.
```
2025-06-02 15:21:17 -07:00
Emmanuel Ferdman
e42884a566
fix: resolve warnings of logger library (#3999)
# PR Summary
This PR resolves the deprecation warnings raised by calls to the logger's `warn` method:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
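As a minimal sketch of the kind of change involved (the logger name and the helper `emit_warning` are hypothetical, not taken from the PR), the call below uses `warning` and records any Python warnings raised while logging:

```python
import logging
import warnings

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("deprecation_example")


def emit_warning(message: str) -> list:
    """Log at WARNING level and return any Python warnings raised meanwhile."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Preferred spelling; `logger.warn(...)` is the deprecated alias.
        logger.warning(message)
    return caught


caught = emit_warning("model cache is nearly full")
```

With `warning`, the `caught` list should contain no `DeprecationWarning` entries.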

---------

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2025-05-22 17:53:42 +00:00
Ronny H
8be7108829
Replace Serverless API to Platform announcement on README page (#4003)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2025-05-20 16:54:53 +00:00
jordan-homan
570ee078a4
fix: throw validation error when json is passed with invalid unstructured json (#4002)
### Notes
Adds validation that rejects `json` / `ndjson` input that does not conform to the unstructured schema.

### Testing
Manually tested serverless API with example json:

```

test_length = [] = 200

test_invalid = [{"invalid": "schema"}] = 422
test_invalid_ndjson ={"hi": "there"} = 422

test_chunk = [{"type":"Header","element_id":"a23fdadef9277f217563e217ebd074d5" ... = 200

```
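The behaviour exercised above could be sketched roughly as follows. This is an illustration only: the names `REQUIRED_KEYS`, `is_valid_elements_json` and `status_for` are hypothetical, and the assumption that an element needs at least `type` and `element_id` keys is inferred from the example payload, not from the actual schema, which is richer. The sketch also handles only the JSON-array case, so a lone object is rejected.

```python
import json

# Assumption: a minimal element carries at least these keys.
REQUIRED_KEYS = {"type", "element_id"}


def is_valid_elements_json(raw: str) -> bool:
    """Return True if `raw` is a JSON list of element-like objects."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list):
        return False
    # An empty list is accepted: there is nothing to reject.
    return all(
        isinstance(item, dict) and REQUIRED_KEYS <= item.keys() for item in data
    )


def status_for(raw: str) -> int:
    """Map validation to the HTTP status codes used in the manual test."""
    return 200 if is_valid_elements_json(raw) else 422
```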
2025-05-19 18:24:44 +00:00
Austin Walker
e3417d7e98
fix: Fix for Pillow error when extracting PNG images (#3998)
When I tried to partition a PNG file and extract images, I got an error
from Pillow:

```
WARNING  unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
  File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
    rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```

The issue is that a PNG can carry an alpha channel (`RGBA` mode), which
cannot be saved in JPEG format. We can fix this with a quick conversion.
I added a PNG test case that now passes with this fix.
2025-05-08 21:57:05 +00:00
Yao You
b814ece39f
fix: properly handle the case when an element's text is None (#3995)
Some elements, like `Image`, can have `None` as the value of their
`text` attribute. In that case the current chunking logic fails because
it expects the field to always have a length and to be splittable. The
fix is to use `element.text or ""` when checking length and to add an
early exit so that split is never called on `None`.
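A minimal sketch of that pattern, assuming a stand-in `FakeElement` class and hypothetical helpers `text_length` and `split_text` (none of these are the library's actual chunking code):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeElement:
    """Stand-in for an element whose text may be None (e.g. an image)."""
    text: Optional[str]


def text_length(element: FakeElement) -> int:
    # `or ""` turns None into an empty string before len() is applied.
    return len(element.text or "")


def split_text(element: FakeElement, max_chars: int) -> List[str]:
    text = element.text or ""
    if not text:
        # Early exit: nothing to split, and no string method is called on None.
        return []
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

With this guard, an element with `text=None` behaves like one with empty text in both helpers.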
2025-05-05 18:08:11 +00:00
Marek Połom
604c4a7c5e
fix: failing build (#3993)
Successful build and test:
https://github.com/Unstructured-IO/unstructured/actions/runs/14730300234/job/41342657532

Failing test_json_to_html CI job fix here:
https://github.com/Unstructured-IO/unstructured/pull/3992
2025-04-29 13:29:58 +00:00
Marek Połom
b585df1588
fix: Add missing diffstat command to test_json_to_html CI job (#3992)
Removed some additional html fixtures. The original json fixtures from
which the html ones were generated were removed some time ago.
2025-04-29 13:29:44 +00:00
David Potter
fd9d796797
fix cve (#3989)
Fix critical CVE for h11; version 0.16.0 reportedly fixes it.

---------

Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-04-29 00:58:05 +00:00
Nathan
27f503ce31
Update pdfminer_utils.py (#3974)
Fix for 'PSSyntaxError' import error:
"cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'"

Latest pdfminer-six doesn't import PSSyntaxError into
`pdfminer.pdfparser` anymore. It must now be directly imported from its
source (`pdfminer.psexceptions`)
2025-04-08 00:47:24 -07:00
Philippe PRADOS
d570f4624b
Fix sort_page_element. ensures that sorting is stable and not random. (#3978)
`sort_page_element()` uses the element id to sort the elements.
Two executions of the same code on the same file therefore produce
different results: the order of the elements is random.
This makes it impossible to write stable unit tests, for example, or to
obtain reproducible results.
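The PR's actual change is not shown here. As a hedged illustration of the general idea, the sketch below sorts on position and uses the original index as the tie-breaker, so randomly generated ids never influence the order. `Box`, `make_page` and `sort_boxes` are hypothetical names.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Hypothetical page element with a position and a random id."""
    top: float
    left: float
    text: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def make_page() -> List[Box]:
    # Each call produces fresh random ids, like two runs on the same file.
    return [
        Box(50, 0, "left col"),
        Box(10, 0, "title"),
        Box(50, 0, "overlap"),
        Box(20, 5, "subtitle"),
    ]


def sort_boxes(boxes: List[Box]) -> List[Box]:
    # Key on position only; the original index breaks ties deterministically.
    indexed = sorted(enumerate(boxes), key=lambda p: (p[1].top, p[1].left, p[0]))
    return [box for _, box in indexed]
```

Two calls to `sort_boxes(make_page())` return the same text order even though the ids differ between calls.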
2025-04-07 15:57:20 +00:00
cragwolfe
dfa17bd3a0
fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975)
2025-04-04 14:38:23 -07:00
cragwolfe
8fc41811eb
chore: add html path to ingest-test-fixtures-update-pr (#3977)
This should allow the `Ingest Test Fixtures Update PR` workflow to also
update expected html outputs.

E.g., before the change, the .html files would be left unmodified:

![image](https://github.com/user-attachments/assets/fa14c1a5-39bd-4e32-b4b9-9552eb312de1)


https://github.com/Unstructured-IO/unstructured/actions/runs/14234877547/job/39892334672
2025-04-03 15:42:25 -07:00
cragwolfe
c6b8ed4290
chore: allow changing default output dir for unstructured-get-json.sh (#3973)
2025-03-31 22:18:57 -07:00