This PR bumps `unstructured-inference` to `0.8.0`, which introduces
vectorized data structure for layout elements and text regions.
This PR also cleans up a few places in CI that has repeated definition
of env variables or missing installation of testing dependencies in
cache.
A few document ingest results are changed:
- two places for `biomed-api` (actually processed locally on runner) are
due to very small changes in numerical results of the bounding box
areas: one results in a duplicated page number/header and another
results in a deduplication of a word of a sentence that starts in a new
line. (yes, two cases goes in opposite directions)
- the layout parser paper now outputs the code lines with page number
inside the code box as list items
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_pptx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- PPTX `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
- Last use of `tabulate` library is removed and that dependency is
removed from `base.in`.
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_xlsx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- XLSX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Initial attempts to incrementally refactor `partition_email()` into
shape to allow pluggable partitioning quickly became too complex for
ready code-review. Prepare separate rewritten module and tests and swap
them out whole.
**Additional Context**
- Uses the modern stdlib `email` module to reliably accomplish several
manual decoding steps in the legacy code.
- Remove obsolete email-specific element-types which were replaced 18
months or so ago with email-specific metadata fields for things like Cc:
addresses, subject, etc.
- Remove accepting an email as `text: str` because MIME-email is
inherently a binary format which can and often does contain multiple and
contradictory character-encodings.
- Remove `encoding` parameters as it is now unused. An email file is not
a text file and as such does not have a single overall encoding.
Character encoding is specified individually for each MIME-part within
the message and often varies from one part to another in the same
message.
- Remove the need for a caller to specify `attachment_partitioner`.
There is only one reasonable choice for this which is
`auto.partition()`, consistent with the same interface and operation in
`partition_msg()`.
- Fixes#3671 along the way by silently skipping attachments with a
file-type for which there is no partitioner.
- Substantially extend the test-suite to cover multiple
transport-encoding/charset combinations.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
This PR addresses issue #3659 by adding an optional `language` parameter
to the `OCRAgentGoogleVision` class constructor.
This parameter serves as a "language hint" for the
`document_text_detection` method in the `ImageAnnotatorClient`. For more
information on language hints, refer to the [Google Cloud Vision
documentation](https://cloud.google.com/vision/docs/languages).
**Default Behavior**:
The language parameter defaults to None, allowing Google Cloud Vision to
auto-detect the language, as recommended in their documentation.
**Purpose**:
This change is necessary because the `OCRAgent`'s `get_instance` method
expects all `OCRAgent`s to include a language parameter in their
constructors.
**Context on Issue:**
When trying to parse a PDF with
`OCR_AGENT=unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`,
an error occurs in the `get_instance` method. The method expects a
`language` parameter, which the current `OCRAgentGoogleVision`
constructor does not support, leading to a positional argument error.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
**Summary**
Remove double-decoration from EML and MSG.
**Additional Context**
- These needed to wait to the end because `partition_email()` and
`partition_msg()` can use any other partitioner for one of their
attachments.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Install new `@apply_metadata()` on TXT.
**Additional Context**
- Both EML and MSG delegate to both HTML and TXT to partition the
message-body, depending on which MIME-part body payload is selected
(`text/plain` or `text/html`). This PR prepares the way to remove
decorators from EML and MSG in the next PR.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Summary**
Install new `@apply_metadata()` on HTML and remove decorators from
delegating partitioners EPUB, MD, ORG, RST, and RTF.
**Additional Context**
- All five of these delegating partitioners delegate to
`partition_html()` so they're something of a matched set. EML and MSG
also partially delegate to HTML but that's a harder problem (they also
delegate to all other partitioners for attachments) that we'll address a
couple PRs later .
- Replace use of `@process_metadata()` and
`@add_metadata_with_filetype()` decorators with `@apply_metadata()` on
`partition_html()`.
- Remove all decorators from delegating partitioners; this removes the
"double-decorating".
**Summary**
Install new `@apply_metadata()` on PPTX, TSV, XLSX, and XML and remove
decoration from PPT.
**Additional Context**
- Alphabetical order turns out to be hard, so this is the remaining
"easy" delegating partitioner and the remaining principal partitioners.
- Replace use of `@process_metadata()` and
`@add_metadata_with_filetype()` decorators with `@apply_metadata()` on
principal partitioners (those that do not delegate to other
partitioners.
- Remove all decorators from delegating partitioners (PPT in this case);
this removes the "double-decorating".
**Summary**
Install new `@apply_metadata()` on CSV and DOCX and remove decoration
from DOC and ODT.
**Additional Context**
- Working in alphabetical order and keeping PR size manageable, replace
use of `@process_metadata()` and `@add_metadata_with_filetype()`
decorators with `@apply_metadata()` on principal partitioners (those
that do not delegate to other partitioners.
- Remove all decorators from delegating partitioners (DOC and ODT in
this case); this removes the "double-decorating".
**Summary**
Refine `@apply_metadata()` replacement decorator. Note it has not been
installed yet.
- Apply `metadata_last_modified` arg with the `@apply_metadata()`
decorator. No need for redundant processing in each partitioner.
- Add "unique-ify" step to fix any cases where the same `Element` or
`ElementMetadata` instance was used more than once in the element
stream. This prevents unexpected "multi-mutation" in downstream
processes.
- Apply "global" metadata items before computing hash-ids. In
particular, `.metadata.filename` is used in the hash computation and
will produce different results if that's not already settled.
- Compute hash-ids _before_ computing `.metadata.parent_id`. This
removes the need for mapping UUID element-ids to their hash counterpart
and doing a fixup of `.parent_id` after applying hash-ids to elements.
**Additional Context**
- The `@apply_metadata()` decorator replaces the four metadata-related
decorators: `@process_metadata()`, `@add_metadata_with_filetype()`,
`@add_metadata()`, and `@add_filetype()`.
- It will be installed on each partitioner in a series of following PRs.
This is a fix for this
[bug](https://github.com/Unstructured-IO/unstructured/issues/3674), auto partition fails on text files which are empty or contain only whitespaces
Inference of .txt file type fails if the file has only whitespaces.
To Reproduce:
```
from tempfile import NamedTemporaryFile
from unstructured.partition.auto import partition
with NamedTemporaryFile(mode="w", suffix=".txt") as f:
f.write(" \n")
f.seek(0)
elements = partition(filename=f.name)
```
**Summary**
In preparation for pluggable auto-partitioners, add a new metadata
decorator to replace the four existing ones.
**Additional Context**
"Global" metadata items, those applied to all element on all
partitioners, are applied using a decorator.
Currently there are four decorators where there only needs to be one.
Consolidate those into a single metadata decorator.
One or two additional behaviors of the new decorator will allow us to
remove decorators from delegating partitioners which is a prerequisite
for pluggable auto-partitioners.
**Summary**
Remove unused `include_metadata` parameter.
**Additional Context**
- The `include_metadata` parameter was originally added circa v0.7.12 as
a mechanism for avoiding the "double-decorating" problem on delegating
partitioners.
- It turns out it doesn't fully address that problem, is now unused, and
is unnecessary for the solution we'll be adding as part of pluggable
partitioners.
- Remove the unnecessary complexity introduced by this unused parameter.
**Summary**
Step 2 in prep for pluggable auto-partitioners, remove `regex_metadata`
field from `ElementMetadata`.
**Additional Context**
- "regex-metadata" was an experimental feature that didn't pan out.
- It's implemented by one of the post-partitioning metadata decorators,
so get rid of it as part of the cleanup before consolidating those
decorators.
This PR fixes an occasional `KeyError` when calling
`assign_and_map_hash_ids`.
- This happens when the input `elements` has duplicated element
instances or metadata.
- When there are duplications the logic to iterate through all elements
and map their parent ids will raise an error when an already mapped
parent id is up for mapping.
- The fix adds a logic to check if the parent id exists in
`old_to_new_mapping` and if it doesn't we skip mapping it
## test
This PR adds a unit test on this case and the test would fail without
the fix.
Wrap the `shared.PartitionParameters` usage with
`operations.PartitionRequest`. This syntax has been deprecated since
v0.23.0 of the SDK, and will be unsupported in v0.26.0.
**Summary**
In preparation for pluggable auto-partitioners simplify metadata as
discussed.
**Additional Context**
- Pluggable auto-partitioners requires partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent. Basically
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- The way double-decorating has been avoided in the past is to avoid
sending the arguments the metadata decorators are sensitive to to the
proxy partitioner. This is very obscure, complex to reason about,
error-prone, and just overall not a viable strategy. The better solution
is to not decorate delegating partitioners and let the proxy partitioner
handle all the metadata.
- This first step in preparation for that is part of simplifying the
metadata processing by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file-object never
contains last-modified data.
- It can never produce useful results in the API where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter so not in use.
- Using it can produce incorrect metadata.
**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract `partition.common` module into a sub-package (directory) and
extract `partition.common.metadata` module to house metadata-specific
object shared by partitioners.
**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that convert its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners which all have a
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface.
This PR enhances `pdfminer` image cleanup process by repositioning the
duplicate image removal step. It optimizes the removal of duplicated
pdfminer images by performing the cleanup before merging elements,
rather than after. This improvement reduces execution time and enhances
the overall processing speed of PDF documents.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
**Summary**
Remove dead code in `unstructured.file_utils`.
**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.
May be easiest to review commit-by-commit.
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.
## test
This PR adds unit test to the new vectorized IOU and subregion
computation functions.
In addition, running partition on large files with many elements like
this slide:
[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)
shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.
Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spend on getting
the coordinates from a list of bboxes.

This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - similar way to the
current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of the internal classes instances (like `TextRegion`,
`DocumentLayout`
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in `save_analysis_artifiacts` function
- to do this, helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
### Summary
Updates the file detection logic for OLE files to check the storage
content of the file to more reliable differentiate between DOC, PPT, XLS
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed and incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.
As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
### Testing
Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.
```python
from unstructured.file_utils.filetype import detect_filetype
filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extention-based
identification in that case.
**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.
An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it not perfect and does not identify MSG files.
Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.
Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.
Note a file with no filename and no metadata_filename or an incorrect
extension will not be correctly identified in this case, however we're
assuming for now that will be rare in practice.
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.
**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.
This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes#3511. This CVE had previously been
mitigated in #3361.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.
Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.
Fixes: #3381
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.
Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files
Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path
elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```
Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
# Description:
Passing `max_pages` argument allows rejecting pdf files which exceeds
this page number limit while `high_res` strategy is chosen. By default
it will allow parsing pdf files with unlimited number of pages.
# Testing:
```python
from unstructured.partition.auto import partition
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError
```