### Summary
Addressed a TypeError that occurred when partitioning empty or
whitespace-only HTML content.
## Test
* unit test
`test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error`
can reproduce the TypeErro before fix
* now test can pass
In scenarios where there is a large amount of data that represents the
document rather than individual elements in the document, it may be
preferable to specify this in a single location rather than duplicating
the data across all elements (as we do for smaller metadata like
filename or filetype)
This PR adds DocumentData element type which can be used to uniquely
capture this data.
Given the fact that the `_CsvPartitioningContext` defines an `_encoding`
property, this property was meant to be used. Behaviorally this change
should be a no-op, but supports future efforts where the partitioning
context applies internal logic.
## PR Summary
This small PR fixes the bs4 deprecation warnings which you can find in
the [CI
logs](https://github.com/Unstructured-IO/unstructured/actions/runs/15491657572/job/43729960936#step:3:2615):
```python
/app/unstructured/metrics/table/table_extraction.py:53: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
/app/unstructured/metrics/table/table_extraction.py:57: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
```
---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
### Summary
`'NoneType' object has no attribute 'partitioner_shortname'` due to
`result_file_type = self._disambiguate_json_file_type` could return None
for file type
new `torch==2.7.1` now comes with nvidia gpu support and triton as
dependencies. Those are not supported by `arm64` or actually being used
by `unstructured` in `adm64` either. This is a quick patch to remove
those from .txt requirements files to unblock builds.
### Summary
To fix error `Error in chunk: 512: {"detail":"'NoneType' object has no
attribute 'strip'"}` I found the logs under same org (could assume this
is the same job)
screenshot:

stack trace from the `utic-api` ES log doc:

### Notes
longer term we should make partitioner (vlm + utic-api) not return text
with Null
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
In this pull request parent-child relationship for elements generated
with v2 parser is based on actual element IDs instead of IDs baked
somewhere in the HTML script.
With some extra bug fixing it allowed for significantly simplifying json
-> HTML script
Bump `unstructured-inference` to `1.0.5`, which includes fix to ensure
model init is thread safe.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
update reqs to resolve CVEs and add the HF ENV to stop it from reaching
out
updated the Dockerfile with
ENV HF_HUB_OFFLINE=1
to stop it from pinging HF. This was an issue for a gov customer. and
updated requirements to resolve some open CVEs
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
# PR Summary
This PR resolves the deprecation warnings of the `logger` library:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
When I tried to partition a PNG file and extract images, I got an error
from Pillow:
```
WARNING unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```
The issue is that a PNG has an additional layer that cannot be saved off
in jpeg format. We can fix this with a quick conversion. I added a png
test case that is now passing with this fix.
Some elements, like `Image`, can have `None` as its `text` attribute's
value. In that case current chunking logic fails because it expects the
field to always have a length or can be split. The fix is to update the
logic as `element.text or ""` for checking length and add flow control
to early exit to avoid calling split on `None`.
Fix for 'PSSyntaxError' import error:
"cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'"
Latest pdfminer-six doesn't import PSSyntaxError into
`pdfminer.pdfparser` anymore. It must now be directly imported from its
source (`pdfminer.psexceptions`)
The sort_page_element() use the element id to sort the elements.
Two executions of the same code, on the same file, produce different
results. The order of the elements is random.
This makes it impossible to write stable unit tests, for example, or to
obtain reproducible results.
Removed the dependencies contained in `test.txt`, `dev.txt`, and
`constraints.txt` from the things that get installed in the docker
image. In order to keep testing the image (running the tests), I added a
step to the `docker-test` make target to install `test.txt` and
`dev.txt`. Thus we presumably get a smaller image (probably not much
smaller), reduce the dependency chain or our images, and have less
exposure to vulnerabilities while still testing as robustly as before.
Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
made reference to non-existent make targets, which tells me it's stale
and wasn't being used.
### Review:
- Reviewer should ensure the dev and test dependencies are not being
installed in the docker image. One way to check is to check the logs in
CI, and note, e.g. that
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
is the first reference to `pytest` in the docker build and test logs,
after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is
passing.
This PR is to address [a
CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in
a recent scan.
The CVE has to do with the package `label_studio_sdk`. This relates to
the tool Label Studio, a data labeling platform. We built a staging
function that takes a list of elements and converts it to a format
suitable for passing to the LabelStudio platform.
We don't use the package with the vulnerability in the actual function,
we only use it to test the output of the function against the Label
Studio API schema.
Even the test where we use it is sort of questionable in value, since
it's really testing the schema against an old version of the LabelStudio
API (we are testing against a recording of the Label Studio API's
responses stored using `vcrpy`).
Label Studio has fixed the vulnerability as of version 1.0.10 of their
SDK, but we're stuck on 1.0.5 because 1.0.6 and above require
`numpy<2.0.0`.
This leaves us with several choices of resolution, some of which are:
1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to
resolve the CVE
2. Drop `label_studio_sdk` by either removing or rewriting the test.
3. Drop test and dev dependencies from the `unstructured` image.
We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a
follow-on PR.
Here we add a deprecation notice to `stage_for_label_studio` and remove
the offending test. Normally good practice would be to add a warning of
future deprecation to the function for a reasonable amount of time, but
in order to address the CVE immediately, we're deprecating it right
away.
### Testing
Install the dependencies (`make install`) into a fresh environment, and
`pip list | grep label` should have no results. The scan artifact in CI
should contain no "high" or "critical" CVEs.
Instead of looking for presence of `word/document.xml` ,
`ppt/presentation.xml` and `xl/workbook.xml` to identify DOCX,PPTX and
XLSX files, we look for prefix `word/document*.xml`,
`ppt/presentation*.xml` and `xl/workbook*.xml` as certain files
generated from office365 has files with different names.
Fixes https://github.com/Unstructured-IO/unstructured/issues/3937
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
- `lxml` is a much faster library than `bs4` when the input data is
regular
- since the hOCR data is guaranteed to be regular (programmatically
generated) we don't need `bs4` here to parse the data
- `lxml` improves parsing speed by about 10x
Example runtime profiling locally using the same `hocr` data from 1 page
pdf, where `agent.hocr_to_dataframe_bs4` is the current method on main
and `agent.hocr_to_dataframe` is the PR's method.

This PR removes usage of `PageLayout.elements` from partition function,
except for when `analysis=True`. This PR updates the partition logic so
that `PageLayout.elements_array` is used everywhere to save memory and
cpu cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.
`PageLayout.elements` uses a list to store layout elements' data while
`elements_array` uses `numpy` array to store the data, which has much
lower memory requirements. Using `memory_profiler` to test the
differences is usually around 10x.
This PR allows passing down both `ocr_agent` and `table_ocr_agent` as
parameters to specify the `OCRAgent` class for the page and tables, if
any, respectively. Both are default to using `tesseract`, consistent
with the present default behavior.
We used to rely on env variables to specify the agents but os env can be
changed during runtime outside of the caller's control. This method of
passing down the variables ensures that specification is independent of
env changes.
## testing
Using `example-docs/img/layout-parser-paper-with-table.jpg` and run
partition with two different settings. Note that this test requires
`paddleocr` extra.
```python
from unstructured.partition.auto import partition
from unstructured.partition.utils.constants import OCR_AGENT_TESSERACT, OCR_AGENT_PADDLE
elements = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_TESSERACT, table_ocr_agent=OCR_AGENT_PADDLE)
elements_alt = partition(f, strategy="hi_res", skip_infer_table_types=[], ocr_agent=OCR_AGENT_PADDLE, table_ocr_agent=OCR_AGENT_TESSERACT)
```
we should see both finish and slight differences in the table element's
text attribute.
This is needed in order for the user to specify whether to extract the
base64 for images, which are now parsed by the html partitioner.
## Testing
Adds test that validates this by calling the auto-partitioner with
appropriate arguments partitioning an html file with base64 embedded
image.
Currently we [filter img
tags](2addb19473/unstructured/partition/html/partition.py (L226-L229))
before tags are converted to Elements by the html partitioner. More
importantly we also don’t currently have a defined “block” / mapping to
support these. This adds these mappings and logic to process.
It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.
The partitioned Image Elements sets the text to the img tag’s alt text
if available.
The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.
## Testing
unit tests have been added for explicit coverage.
existing integration tests and other unit test fixtures have been
updated to account for `Image` elements now present
---------
Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>
Fixes order of content type detection strategies for byte-encoded jsons.
Before
```
json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf")
```
Before
PDF
Now
JSON
This PR targets the most memory expensive operation in partition pdf and
images: deduplicate pdfminer elements. In large pages the number of
elements can be over 10k, which would generate multiple 10k x 10k square
double float matrices during deduplication, pushing peak memory usage
close to 13Gb

This PR breaks this computation down by computing partial IOU. More
precisely it computes IOU for each 2000 elements against all the
elements at a time to reduce peak memory usage by about 10x to around
1.6Gb.

The block size is configurable based on user preference for peak memory
usage and it is set by changing the env `UNST_MATMUL_MEMORY_CAP_IN_GB`.
The purpose of this PR is to enable registering new file types
dynamically.
The PR enables this through 2 primary functions:
1. `unstructured.file_utils.model.create_file_type` This registers the
new `FileType` enum which enables the rest of unstructured to understand
a new type of file
2. `unstructured.file_utils.model.register_partitioner` Decorator that
enables registering a partitioner function to run for a file type.
---------
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
## NOTE
`test_unstructured_ingest/expected-structured-output-html` contains all
test HTML fixtures. Original JSON files, from which these HTML fixtures
are generated, were taken from
`test_unstructured_ingest/expected-structured-output`
This PR allows element types with CamelCase names to be extractable
using `extract_image_block_types` variable.
Before: specify `extract_image_block_types=["NarrativeText"]` (or any
casing for `NarrativeText`) would raise a warning that it doesn't match
any available types and not image would be extracted for this element
type
Now: specify `extract_image_block_types=["NarrativeText"]` would extract
images for this element type
## testing
```python
from unstructured.partition.auto import partition
f = "example-docs/pdf/embedded-images-tables.pdf"
elements = partition(f, strategy="hi_res", extract_image_block_types=["narrativetext"])
```
Without this PR no figures would be extracted. With this PR a local
folder would be created to contain images of the narrative text elements
in path like `./figures/figure-1-1.jpg`
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
This pull request fixes the scenario when SpooledTemporaryFile is passed
to detect_file type. In such cases some weird number was assigned as
'name' (and it couldn't be overwritten as SpooledTemporaryFile can't
have fields assigned 😩 ) so I added in our object factory just another
scenario where we parse this type of file.
For BytesIo `name` attr is None as it should be and some other metadata
fields are leveraged for file type recognition