7 Commits

Author SHA1 Message Date
Yuming Long
55ad5fd637
fix chucking text None type has no attribute stripe (#4018)
### Summary
To fix error `Error in chunk: 512: {"detail":"'NoneType' object has no
attribute 'strip'"}` I found the logs under same org (could assume this
is the same job)

screenshot:
![Screenshot 2025-06-11 at 10 15
57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb)

stack trace from the `utic-api` ES log doc:
![Screenshot 2025-06-11 at 2 01
01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa)



### Notes
longer term we should make partitioner (vlm + utic-api) not return text
with Null

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2025-06-12 18:28:46 +00:00
Yao You
37d2f021a3
Feat/bump inference (#4013)
Bump `unstructured-inference` to `1.0.5`, which includes fix to ensure
model init is thread safe.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-06-06 09:52:17 +00:00
luke-kucing
a7e90f7990
resolve CVEs and HF issue (#4009)
update reqs to resolve CVEs and add the HF ENV to stop it from reaching
out

updated the Dockerfile with
ENV HF_HUB_OFFLINE=1

to stop it from pinging HF. This was an issue for a gov customer. and
updated requirements to resolve some open CVEs

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
2025-06-04 18:52:58 +00:00
Marek Połom
b585df1588
fix: Add missing diffstat command to test_json_to_html CI job (#3992)
Removed some additional html fixtures. The original json fixtures from
which html ones were generated, were removed some time ago.
2025-04-29 13:29:44 +00:00
cragwolfe
dfa17bd3a0
fix: hi_res PDF parsing: only uncategorized text for extracted elements (#3975) 2025-04-04 14:38:23 -07:00
ryannikolaidis
c0457c1cc3
feat: include images when partitioning html (#3945)
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the html partitioner. More
importantly we also don’t currently have a defined “block” / mapping to
support these. This adds these mappings and logic to process.

It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.

The partitioned Image Elements sets the text to the img tag’s alt text
if available.

The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.

## Testing

unit tests have been added for explicit coverage.
existing integration tests and other unit test fixtures have been
updated to account for `Image` elements now present

---------

Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
2025-03-08 01:25:21 +00:00
Marek Połom
f333d7fe7f
feat: Json elements to HTML converter (#3936)
## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all
test HTML fixtures. Original JSON files, from which these HTML fixtures
are generated, were taken from
`test_unstructured_ingest/expected-structured-output`
2025-03-04 13:57:35 +00:00