### Issue
Attempting to partition a password-protected XLSX file results in an
obscure exception:
> Can't find workbook in OLE2 compound document
### Solution
Utilize the [msoffcrypto-tool](https://pypi.org/project/msoffcrypto-tool/)
package (MIT License) to load the XLSX file and check whether it's
encrypted; if so, raise an `UnprocessableEntityError` exception
detailing the reason for rejecting the file.
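A minimal sketch of the check (the `UnprocessableEntityError` class and error message are assumptions, stand-ins for whatever the codebase actually uses):
```python
import msoffcrypto

class UnprocessableEntityError(Exception):
    """Stand-in for the codebase's error type (an assumption here)."""

def raise_if_encrypted(filename: str) -> None:
    with open(filename, "rb") as f:
        try:
            encrypted = msoffcrypto.OfficeFile(f).is_encrypted()
        except Exception:  # not a recognized OLE2/OOXML container -> not encrypted
            encrypted = False
    if encrypted:
        raise UnprocessableEntityError(
            "File is encrypted. Please decrypt the file before partitioning."
        )
```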
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
The `@apply_metadata` decorator already contains logic to detect the
language of the element text (on either a document or element level).
Update PDF partitioning, and later image partitioning, to use this
decorator so that accurate element-language results are output.
## Test
```
from unstructured.partition.auto import partition

def test_partition_pdf():
    pdf_path = "example-docs/language-docs/fr_olap.pdf"
    # optionally set `detect_language_per_element=True`
    elements = partition(pdf_path)
    print(f"Number of elements partitioned: {len(elements)}")
    # Check that elements are returned
    assert len(elements) > 0, "No elements were partitioned from the PDF."
    # Check the language output for each element
    for element in elements:
        print(element)
        print(element.metadata.languages)
        print("-------------------------------")

test_partition_pdf()
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
#### To test, simply serialize a TableChunk element with and without the
changes in the PR
____
**Without the changes:**
```
In [1]: from unstructured.documents.elements import TableChunk
In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x110113410>
In [3]: TableChunk("hi").to_dict()
Out[3]:
{'type': 'Table',
 'element_id': '6267e99a-46d8-4f2d-a206-51c691469c72',
 'text': 'hi',
 'metadata': {}}
```
____
**With the changes:**
```
In [1]: from unstructured.documents.elements import TableChunk
In [2]: TableChunk("hi")
Out[2]: <unstructured.documents.elements.TableChunk at 0x10367f050>
In [3]: TableChunk("hi").to_dict()
Out[3]:
{'type': 'TableChunk',
 'element_id': 'f91af3ac-0dea-4dc4-8a6a-69c28cfcca3b',
 'text': 'hi',
 'metadata': {}}
```
____
This change affects HTML partitioning.
Previously, when there was a table in the HTML, we stripped the `class`
and `id` attributes from all tags inside the table. However, a table
sometimes contains images (`img` tags) whose `class` attribute
identifies important information about the image. This change
preserves the `class` attribute for `img` tags inside a table, and is
reflected in a table element's `metadata.text_as_html` attribute.
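A minimal sketch of the attribute-cleanup rule (not the library's actual implementation), using `lxml`:
```python
from lxml import etree

html = '<table><tr><td class="cell" id="c1"><img class="chart" src="c.png"/></td></tr></table>'
table = etree.fromstring(html)

for el in table.iter():
    el.attrib.pop("id", None)    # id is always stripped inside tables
    if el.tag != "img":          # class survives only on img tags
        el.attrib.pop("class", None)

print(etree.tostring(table).decode())
# <table><tr><td><img class="chart" src="c.png"/></td></tr></table>
```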
Dropped variables that said we support Python 3.9 in `setup.py`, as well
as any remaining references to Python 3.9.
I also checked the pins and removed several that no longer seem
necessary.
Increase the CSV field limit to support partitioning of files with
large data in individual fields.
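A minimal sketch of the kind of change involved: Python's `csv` module caps field size (128 KB by default) and raises `_csv.Error: field larger than field limit` on oversized fields, so the limit must be raised before parsing. `sys.maxsize` can overflow the underlying C long on some platforms, hence the fallback loop:
```python
import csv
import sys

limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)  # raise the cap so huge fields parse
        break
    except OverflowError:
        limit //= 2  # back off until the platform accepts the value
```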
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
## Summary
This PR fixes an issue where header/footer content in HTML is not
partitioned as `unstructured` `Header` or `Footer` element types.
Instead, it comes out either as `UncategorizedText` or as the type of
the nested structure inside the header/footer. E.g., `<header
class="Header"><h1 class="Title">Header Title</h1></header>` would be
partitioned as a `Title` instead of a `Header`.
## Bug description
This behavior occurs because we treat header and footer as layout, i.e.,
containers, in the ontology definition. As a result, during parsing we
[unwrap](ec209c6b5f/unstructured/partition/html/transformations.py (L361-L378))
the container and parse the contents as if they came from the main text,
even though they are still part of the header/footer.
The fix is to treat header/footer as text instead of layout in the
ontology so that all content inside them is properly gathered under the
`Header`/`Footer` element types.
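A minimal check of the expected behavior, using the example from above (the exact element list after the fix is an assumption):
```python
from unstructured.partition.html import partition_html

html = '<header class="Header"><h1 class="Title">Header Title</h1></header>'
elements = partition_html(text=html)

print([type(e).__name__ for e in elements])
# before the fix: ['Title']; after the fix, the content should be gathered as ['Header']
```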
`<?xml version="1.0"?>` does not get escaped when converting to HTML
when it appears in a code block like this in the markdown file
````
<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head></head>
<boolean>true</boolean>
</sparql>
````
which causes the parser to throw an error like
> AttributeError: 'lxml.etree._ProcessingInstruction' object has no
attribute 'is_phrasing'.
This PR processes the original md file and adds indentation to `<?xml
version="1.0"?>` to force the XML code to be escaped when converted
to HTML.
https://github.com/Unstructured-IO/unstructured/issues/3935
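A hypothetical sketch of the preprocessing step (the function name and indent width are assumptions):
```python
def indent_xml_declarations(md_text: str) -> str:
    """Indent XML declarations so the md -> HTML converter escapes them as code."""
    fixed_lines = []
    for line in md_text.splitlines():
        if line.lstrip().startswith("<?xml"):
            line = "    " + line  # forces escaping instead of a processing instruction
        fixed_lines.append(line)
    return "\n".join(fixed_lines)
```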
This PR fixes an issue with `docx` files containing
complex/recursive/merged/malformed tables by skipping cells that cannot
be traced back to a valid `<w:tc>` element by `python-docx`, due to
missing or improperly merged rows.
Accessing `row.cells` in such cases can raise a `ValueError` when
`python-docx` fails to resolve the full logical table layout. This PR
wraps those calls in `try/except` to skip problematic rows while
continuing to extract usable content from the rest of the document.
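A minimal sketch of the defensive pattern described (the file name is hypothetical):
```python
from docx import Document

doc = Document("merged-cells.docx")  # hypothetical document with malformed tables
for table in doc.tables:
    for row in table.rows:
        try:
            cells = row.cells
        except ValueError:
            continue  # row can't be traced to valid <w:tc> elements; skip it
        for cell in cells:
            print(cell.text)
```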
### Summary
Addressed a TypeError that occurred when partitioning empty or
whitespace-only HTML content.
## Test
* the unit test
`test_unstructured/partition/html/test_partition.py::test_partition_html_with_empty_content_raises_error`
reproduces the TypeError before the fix
* the test now passes
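A sketch of what that test plausibly looks like (the exact exception type raised after the fix is an assumption):
```python
import pytest

from unstructured.partition.html import partition_html

def test_partition_html_with_empty_content_raises_error():
    # assumed post-fix behavior: a clear error instead of an opaque TypeError
    with pytest.raises(ValueError):
        partition_html(text="   ")
```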
In scenarios where there is a large amount of data that represents the
document rather than individual elements in the document, it may be
preferable to specify this in a single location rather than duplicating
the data across all elements (as we do for smaller metadata like
filename or filetype)
This PR adds a `DocumentData` element type which can be used to
uniquely capture this data.
Given that `_CsvPartitioningContext` defines an `_encoding` property,
this property was meant to be used. Behaviorally, this change should be
a no-op, but it supports future efforts where the partitioning context
applies internal logic.
## PR Summary
This small PR fixes the bs4 deprecation warnings which you can find in
the [CI
logs](https://github.com/Unstructured-IO/unstructured/actions/runs/15491657572/job/43729960936#step:3:2615):
```python
/app/unstructured/metrics/table/table_extraction.py:53: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
/app/unstructured/metrics/table/table_extraction.py:57: DeprecationWarning: Call to deprecated method findAll. (Replaced by find_all) -- Deprecated since version 4.0.0.
```
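The rename is mechanical; `find_all` has been the preferred spelling since bs4 4.0:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>1</td></tr></table>", "html.parser")
rows = soup.findAll("tr")   # deprecated camelCase alias; emits DeprecationWarning
rows = soup.find_all("tr")  # preferred replacement with identical behavior
```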
---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
### Summary
Fixes `'NoneType' object has no attribute 'partitioner_shortname'`,
which occurred because `result_file_type = self._disambiguate_json_file_type`
could return `None` for the file type.
The new `torch==2.7.1` comes with NVIDIA GPU support and triton as
dependencies. Those are not supported on `arm64`, nor actually used
by `unstructured` on `amd64`. This is a quick patch to remove
them from the .txt requirements files to unblock builds.
### Summary
To fix the error `Error in chunk: 512: {"detail":"'NoneType' object has no
attribute 'strip'"}`, I found the logs under the same org (we can assume
this is the same job).
(Supporting screenshot and the stack trace from the `utic-api` ES log doc omitted.)
### Notes
Longer term, we should make the partitioner (vlm + utic-api) not return
text containing null.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
In this pull request, the parent-child relationship for elements
generated with the v2 parser is based on actual element IDs instead of
IDs baked somewhere in the HTML script.
With some extra bug fixing, this allowed for significantly simplifying
the JSON -> HTML script.
Bump `unstructured-inference` to `1.0.5`, which includes a fix to ensure
model init is thread safe.
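A sketch of the usual thread-safe lazy-init pattern (the actual fix lives in `unstructured-inference`; the names here are hypothetical):
```python
import threading

def expensive_model_constructor():
    return object()  # stand-in for the real (slow) model load

_model = None
_model_lock = threading.Lock()

def get_model():
    """Construct the model once, even when called from many threads."""
    global _model
    with _model_lock:
        if _model is None:
            _model = expensive_model_constructor()
    return _model
```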
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Update requirements to resolve CVEs, and add the HF environment
variable to stop the image from reaching out.
Updated the Dockerfile with `ENV HF_HUB_OFFLINE=1` to stop it from
pinging HF; this was an issue for a gov customer. Also updated
requirements to resolve some open CVEs.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: luke-kucing <luke-kucing@users.noreply.github.com>
# PR Summary
This PR resolves deprecation warnings from the `logging` library:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
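The change is mechanical throughout the codebase:
```python
import logging

logger = logging.getLogger(__name__)
logger.warn("something happened")     # deprecated; emits DeprecationWarning
logger.warning("something happened")  # drop-in replacement
```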
---------
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
When I tried to partition a PNG file and extract images, I got an error
from Pillow:
```
WARNING unstructured:pdf_image_utils.py:230 Image Extraction Error: Skipping the failed image
Traceback (most recent call last):
File "/Users/austin/.pyenv/versions/unstructured/lib/python3.10/site-packages/PIL/JpegImagePlugin.py", line 666, in _save
rawmode = RAWMODE[im.mode]
KeyError: 'RGBA'
```
The issue is that a PNG has an additional alpha channel that cannot be
saved in JPEG format. We can fix this with a quick conversion. I added
a PNG test case that now passes with this fix.
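A minimal sketch of the conversion (file names hypothetical): JPEG has no alpha channel, so `RGBA` images must be converted to `RGB` before saving:
```python
from PIL import Image

im = Image.open("extracted_figure.png")  # hypothetical extracted image
if im.mode == "RGBA":
    im = im.convert("RGB")  # drop the alpha channel JPEG can't represent
im.save("extracted_figure.jpg", format="JPEG")
```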
Some elements, like `Image`, can have `None` as the value of their
`text` attribute. In that case the current chunking logic fails,
because it expects the field to always have a length and be splittable.
The fix is to use `element.text or ""` when checking length, and to add
flow control that exits early to avoid calling split on `None`.
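A minimal sketch of the guard (the function name is hypothetical):
```python
def split_element_text(element) -> list[str]:
    text = element.text or ""  # treat a None text attribute as empty
    if not text:
        return []              # early exit; never call split on None
    return text.split()
```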
Fix for the `PSSyntaxError` import error:
"cannot import name 'PSSyntaxError' from 'pdfminer.pdfparser'".
The latest pdfminer-six no longer imports `PSSyntaxError` into
`pdfminer.pdfparser`; it must now be imported directly from its
source (`pdfminer.psexceptions`).
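One common way to handle such a move is a compatibility import (whether the PR uses a shim or a plain direct import is not specified here):
```python
try:
    from pdfminer.psexceptions import PSSyntaxError  # newer pdfminer.six
except ImportError:
    from pdfminer.pdfparser import PSSyntaxError     # older releases
```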
The `sort_page_element()` function uses the element id to sort the
elements.
Two executions of the same code, on the same file, produce different
results: the order of the elements is random.
This makes it impossible, for example, to write stable unit tests or to
obtain reproducible results.
Removed the dependencies contained in `test.txt`, `dev.txt`, and
`constraints.txt` from the things that get installed in the docker
image. In order to keep testing the image (running the tests), I added a
step to the `docker-test` make target to install `test.txt` and
`dev.txt`. Thus we presumably get a smaller image (probably not much
smaller), reduce the dependency chain of our images, and have less
exposure to vulnerabilities while still testing as robustly as before.
Incidentally, I removed the `Dockerfile` for our ubuntu image, since it
made reference to non-existent make targets, which tells me it's stale
and wasn't being used.
### Review:
- Reviewer should ensure the dev and test dependencies are not being
installed in the docker image. One way is to check the logs in
CI and note, e.g., that
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/14112971425/job/39536304012#step:3:1700)
is the first reference to `pytest` in the docker build and test logs,
after the image build is completed.
- Reviewer should ensure docker image is still being tested in CI and is
passing.
This PR is to address [a
CVE](https://github.com/advisories/GHSA-rgv9-w7jp-m23g) that appeared in
a recent scan.
The CVE has to do with the package `label_studio_sdk`. This relates to
the tool Label Studio, a data labeling platform. We built a staging
function that takes a list of elements and converts it to a format
suitable for passing to the LabelStudio platform.
We don't use the vulnerable package in the actual function; we only
use it to test the output of the function against the Label Studio API
schema.
Even the test where we use it is of questionable value, since it
really tests the schema against an old version of the Label Studio
API (we test against a recording of the Label Studio API's
responses stored using `vcrpy`).
Label Studio has fixed the vulnerability as of version 1.0.10 of their
SDK, but we're stuck on 1.0.5 because 1.0.6 and above require
`numpy<2.0.0`.
This leaves us with several choices of resolution, some of which are:
1. Downgrade `numpy` to upgrade `label_studio_sdk` to >=1.0.10 to
resolve the CVE
2. Drop `label_studio_sdk` by either removing or rewriting the test.
3. Drop test and dev dependencies from the `unstructured` image.
We've decided to do 2. _and_ 3. This PR handles 2., with 3. to be a
follow-on PR.
Here we add a deprecation notice to `stage_for_label_studio` and remove
the offending test. Normally good practice would be to add a warning of
future deprecation to the function for a reasonable amount of time, but
in order to address the CVE immediately, we're deprecating it right
away.
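A sketch of what such a deprecation notice typically looks like (the exact message and signature are assumptions):
```python
import warnings

def stage_for_label_studio(elements, *args, **kwargs):
    warnings.warn(
        "stage_for_label_studio is deprecated and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,
    )
    ...  # existing staging logic unchanged
```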
### Testing
Install the dependencies (`make install`) into a fresh environment, and
`pip list | grep label` should have no results. The scan artifact in CI
should contain no "high" or "critical" CVEs.
Instead of looking for the presence of `word/document.xml`,
`ppt/presentation.xml`, and `xl/workbook.xml` to identify DOCX, PPTX,
and XLSX files, we look for the prefixes `word/document*.xml`,
`ppt/presentation*.xml`, and `xl/workbook*.xml`, since certain files
generated by Office365 contain entries with different names.
Fixes https://github.com/Unstructured-IO/unstructured/issues/3937
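A minimal sketch of the relaxed check (the function name is hypothetical):
```python
import fnmatch
import zipfile

def has_docx_marker(path: str) -> bool:
    """Match word/document*.xml rather than requiring word/document.xml exactly."""
    with zipfile.ZipFile(path) as z:
        return any(fnmatch.fnmatch(name, "word/document*.xml") for name in z.namelist())
```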
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>