**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extention-based
identification in that case.
**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.
An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it not perfect and does not identify MSG files.
Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.
Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.
Note a file with no filename and no metadata_filename or an incorrect
extension will not be correctly identified in this case, however we're
assuming for now that will be rare in practice.
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
Closes#3543.
### Summary
This PR addresses an issue with the NLTK data download process.
Previously, when downloading NLTK data, a nested "nltk_data" directory
was created within the parent "nltk_data" directory if the parent
directory already existed. This redundant directory structure led to two
significant problems:
- errors in checking if data had already been downloaded, potentially
causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to
incorrect path resolution.
This fix modifies the NLTK data download logic to prevent creation of
unnecessary nested directories. If the download path ends with
"nltk_data" and that directory already exists, we now use the existing
directory instead of creating a new nested one.
### Testing
CI should pass.
### Summary
Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.
### Testing
After the bump, CI should pass. Additionally we verified locally that
files processing takes around the amount of time we would expect for a
long `.docx` file.
```python
In [1]: from unstructured.partition.auto import partition
In [2]: filename = "test-doc.docx"
In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.
**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.
This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
### Summary
Updates to the latest `wolfi-base` base image to pull in more recent
package version. A notable update is that upgrading to
`libreoffice==24.2.5.2` resolves several CVEs.
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Closes#3521.
This PR resolves an installation error with `pytesseract>=0.3.12` that
occurred during `pip install unstructured[pdf]==0.15.3`.
### Testing
**Run following command in main branch and this PR**
```
pip uninstall -y pytesseract && pip install ".[pdf]"
```
**Results**
- `main` branch
```
INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10)
ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf"
```
- this `PR`
`pytesseract-0.3.13` should be installed successfully.
This PR removes custom index URL for `paddlepaddle` installation in
`extra-paddleocr.in`, resolving `setup.py` configuration error. Now uses
`paddlepaddle==3.0.0b1` directly from PyPI, simplifying installation
process.
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes#3511. This CVE had previously been
mitigated in #3361.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Closes
[#3503](https://github.com/Unstructured-IO/unstructured/issues/3503).
### Summary
This PR prevents creation of `figures` directory for saving image blocks
(`Image`, `Table`) when `extract_image_block_to_payload` parameter is
set to True
### Testing
```
elements = partition_image(
filename="example-docs/img/embedded-images-tables.jpg",
strategy="hi_res",
extract_image_block_types=["Image", "Table"],
extract_image_block_to_payload=True,
)
```
**Results:**
- `Main` Branch: `figures` directory is created.
- `PR`: `figures` directory is not created.
This PR aims to remove "unstructured.paddlepaddle" fork. Previously, we
used `unstructured.paddlepaddle` fork to support
`unstructured.paddleocr` on arm64 architecture. But currently,
`unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work
on `arm64` architecture. Also, `unstructured.paddleocr` with the latest
version of the original `paddlepaddle` works on both `amd64` and `arm64`
architectures.
### Testing
```
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
elements = partition_pdf(
filename=<file_path>,
strategy="hi_res",
infer_table_structure=True,
)
```
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.
Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.
Fixes: #3381
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.
Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files
Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path
elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```
Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
# Description:
Passing `max_pages` argument allows rejecting pdf files which exceeds
this page number limit while `high_res` strategy is chosen. By default
it will allow parsing pdf files with unlimited number of pages.
# Testing:
```python
from unstructured.partition.auto import partition
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError
```
### Description
Any time `unstructed.ingest` is imported, this deprecation warning gets
emitted:
```
DeprecationWarning: unstructured.ingest will be removed in a future version
```
**Summary**
A DOC, PPT, or XLS file sent to partition() as a file-like object is
misidentified as a MSG file and raises an exception in python-oxmsg
(which is used to process MSG files).
**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound
File Binary Format (CFBF). These can be reliably distinguished by
inspecting magic bytes in certain locations. `libmagic` is unreliable at
this or doesn't try, reporting the generic `"application/x-ole-storage"`
which corresponds to the "container" CFBF format (vaguely like a
Microsoft Zip format) that all these document types are stored in.
Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.
Fixes#3364
**Summary**
The `content_type` argument received by `partition()` from the API is
sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed
is that it gets the MS-Office bit right but falls down on distinguishing
PPTX from DOCX or XLSX.
Confirmation of these types is simple, fast, and reliable. Confirm all
MS-Office `content_type` argument values asserted by callers of
`detect_filetype()` and correct swapped values.
Similar to https://github.com/Unstructured-IO/unstructured/pull/3433.
### Summary
This PR aims to update `HuggingFaceEmbeddingEncoder` to use
`HuggingFaceEmbeddings` from `langchain_huggingface` package instead of
the deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.
### Testing
```
from unstructured.documents.elements import Text
from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder
embedding_encoder = HuggingFaceEmbeddingEncoder(
config=HuggingFaceEmbeddingConfig()
)
elements = embedding_encoder.embed_documents(
elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)
query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)
[print(e.embeddings, e) for e in elements]
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
```
**Expected behavior**
No deprecation warning should be displayed. The code should use the
updated `HuggingFaceEmbeddings` class from the `langchain_huggingface`
package.
Closes https://github.com/Unstructured-IO/unstructured/issues/3378.
### Summary
This PR aims to update `OpenAIEmbeddingEncoder` to use
`OpenAIEmbeddings` from `langchain-openai` package instead of the
deprecated version from `langchain-community`. This resolves the
deprecation warning and ensures compatibility with future versions of
langchain.
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.
**Additional Context**
Factor type-recognition process into three distinct strategies that are
attempted in sequence. Attempted in order of preference,
type-recognition falls to the next strategy when the one before it is
not applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.
Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.
Several bugs were uncovered in the process. Small ones were just fixed,
bigger ones will be remedied in following PRs.
the pinecone python package moved their importing of
PineconeApiException
Chroma `sleep` added because even thought there is a `wait`, there is
still some sort of timing issue.
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.
`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.
`PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module lifetime scope for efficiency.
### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
### Testing
```
filename = "example-docs/eml/family-day.eml"
elements = partition_email(
filename=filename,
)
print(f"From filename: {elements[3].text}")
with open(filename, "rb") as test_file:
spooled_temp_file = tempfile.SpooledTemporaryFile()
spooled_temp_file.write(test_file.read())
spooled_temp_file.seek(0)
elements = partition_email(file=spooled_temp_file)
print(f"From spooled_temp_file: {elements[3].text}")
```
**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
### Description
If the id value exists in the stats response from fsspec, save it as a
`file_id` field in the metadata being persisted on each element.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.
### Summary
- Created two new subdirectories in the `example-docs` folder:
- `pdf/`: for all PDF example files
- `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure
### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.
## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.
## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
### Description
At times, the google drive response doens't have some of the metadata
we're grabbing to populate the `FileData` metadata. This is fine, but
without the added safegaurds, this can cause a `KeyError`.
**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.
In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.
**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.