**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract `partition.common` module into a sub-package (directory) and
extract `partition.common.metadata` module to house the
metadata-specific objects shared by partitioners.
**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that converts its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners which all have a
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface.
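A minimal sketch of that uniform interface (an assumption for
illustration, not code from this PR):
```python
from typing import IO, Optional

from unstructured.documents.elements import Element


def partition_x(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    **kwargs,
) -> list[Element]:
    """Hypothetical partitioner for file-type "x" conforming to the interface."""
    ...
```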
This PR enhances the `pdfminer` image cleanup process by repositioning
the duplicate-image removal step. It optimizes the removal of duplicated
`pdfminer` images by performing the cleanup before merging elements
rather than after. This change reduces execution time and improves the
overall processing speed of PDF documents.
---------
Co-authored-by: Yao You <theyaoyou@gmail.com>
**Summary**
Remove dead code in `unstructured.file_utils`.
**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing them reduces unnecessary complexity. They can of
course be recovered from Git history if we decide we want them again in
the future.
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.
May be easiest to review commit-by-commit.
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.
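As a rough illustration, a hedged NumPy sketch of the kind of
vectorized pairwise IOU computation this refers to (not the PR's exact
code):
```python
import numpy as np


def boxes_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return a (len(a), len(b)) matrix of pairwise IOU values.

    `a` and `b` are arrays of (x1, y1, x2, y2) boxes, shapes (N, 4) and (M, 4).
    """
    # intersection coordinates, broadcast across all (a, b) pairs
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-10)
```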
## test
This PR adds unit tests for the new vectorized IOU and subregion
computation functions.
In addition, running partition on large files with many elements like
this slide:
[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)
shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.
Profiling results show that the new implementation greatly reduces the
time cost of computation, and now most of the time is spent on getting
the coordinates from a list of bboxes.

This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file, 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - in a similar way to
the current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly,
instead of instances of the internal classes (like `TextRegion` and
`DocumentLayout`)
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in the `save_analysis_artifiacts`
function; to do this, a helper function `render_bboxes_for_file` was
added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
### Summary
Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT,
XLS, and MSG files. This corrects a bug that caused file-type detection
to be incorrect in cases where the `filetype` library guessed an
incorrect MIME type, such as `'application/vnd.ms-excel'` for a `.msg`
file.
As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.
### Testing
Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.
```python
from unstructured.file_utils.filetype import detect_filetype
filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based
identification in that case.
**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.
An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this,
but it is not perfect and does not identify MSG files.
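For reference, the 8-byte signature check looks roughly like this (a
sketch, not this PR's implementation):
```python
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # CFBF header signature


def is_ole_file(file_path: str) -> bool:
    with open(file_path, "rb") as f:
        return f.read(8) == OLE_MAGIC
```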
Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.
Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and file-type
detection falls back to the filename extension.
Note that a file with no filename (and no metadata_filename) or an
incorrect extension will not be correctly identified in this case;
however, we're assuming for now that this will be rare in practice.
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.
**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or `<thead>`,
`<tbody>`, `<tfoot>` elements).
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.
**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
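For illustration, a small hedged example of splitting an oversized
`Table` element with the basic chunker (the table content and
`max_characters` value are made up):
```python
from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import ElementMetadata, Table

table = Table(
    text="Header-1 Header-2 a b c d",
    metadata=ElementMetadata(
        text_as_html=(
            "<table><tr><td>Header-1</td><td>Header-2</td></tr>"
            "<tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>"
        )
    ),
)

for chunk in chunk_elements([table], max_characters=20):
    # each TableChunk's .text matches the text in its parseable HTML fragment
    print(repr(chunk.text), "|", chunk.metadata.text_as_html)
```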
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.
This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.
Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.
Fixes: #3381
Update `partition_email` and `partition_msg` to capture the cc, bcc,
and message-id fields.
Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files
Testing
```python
from test_unstructured.unit_utils import example_doc_path
from unstructured.partition.email import partition_email

elements = partition_email(
    filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True
)
print(elements)
elements[0].metadata.to_dict()
```
Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
# Description:
Passing the `max_pages` argument allows rejecting PDF files that exceed
this page-count limit when the `hi_res` strategy is chosen. By default,
PDF files with an unlimited number of pages are allowed.
# Testing:
```python
from unstructured.partition.auto import partition
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res') # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=4) # should pass
elements = partition(filename="unstructured/example-docs/pdf/reliance.pdf", strategy='hi_res', max_pages=2) # should raise PdfMaxPagesExceededError
```
**Summary**
A DOC, PPT, or XLS file sent to partition() as a file-like object is
misidentified as an MSG file and raises an exception in python-oxmsg
(which is used to process MSG files).
**Fix**
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, a.k.a.
Compound File Binary Format (CFBF) files. These can be reliably
distinguished by inspecting magic bytes in certain locations. `libmagic`
is unreliable at this (or doesn't try), reporting the generic
`"application/x-ole-storage"`, which corresponds to the "container" CFBF
format (vaguely like a Microsoft Zip format) in which all these document
types are stored.
Unconditionally use `filetype.guess_mime()` provided by the `filetype`
package that is part of the base unstructured install. Unlike
`libmagic`, this package reliably detects the distinguished MIME-type
(e.g. `"application/msword"`) for OLE file subtypes.
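For example (the file path here is hypothetical):
```python
import filetype

# unlike libmagic, this reports the distinguished OLE-subtype MIME-type
mime = filetype.guess_mime("example-docs/fake.doc")
print(mime)  # "application/msword"
```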
Fixes #3364
**Summary**
The `content_type` argument received by `partition()` from the API is
sometimes unreliable for MS-Office 2007+ MIME-types. What we've observed
is that it gets the MS-Office bit right but falls down on distinguishing
PPTX from DOCX or XLSX.
Confirmation of these types is simple, fast, and reliable. Confirm all
MS-Office `content_type` argument values asserted by callers of
`detect_filetype()` and correct swapped values.
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.
**Additional Context**
Factor type-recognition process into three distinct strategies that are
attempted in sequence. Attempted in order of preference,
type-recognition falls to the next strategy when the one before it is
not applicable or cannot determine the file-type. This provides a clear
basis for organizing the code and tests at the top level.
Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.
Several bugs were uncovered in the process. Small ones were just fixed,
bigger ones will be remedied in following PRs.
**Summary**
Replace conditional explicit import of partitioner modules in
`.partition.auto` with the new `_PartitionerLoader` class. This avoids
unbound variable warnings and is much less noisy.
`_PartitionerLoader` makes use of the new `FileType` property
`.importable_package_dependencies` to determine whether all required
packages are importable before dispatching the file to its partitioner.
It uses `FileType.extra_name` to form a helpful error message when a
dependency is not installed, so the caller knows which `pip install`
extra to specify to remedy the error.
`_PartitionerLoader` uses the `FileType` properties
`.partitioner_module_qname` and `.partitioner_function_name` to load
the partitioner once its dependencies are verified. Loaded partitioners
are cached with module-lifetime scope for efficiency.
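A hedged sketch of that loading strategy (property names follow the
description above; the implementation details are assumptions):
```python
import importlib
import importlib.util
from typing import Any, Callable, Dict


class _PartitionerLoader:
    """Dispatch file-types to partitioners, verifying dependencies first."""

    # cache persists for the lifetime of the module
    _partitioners: Dict[Any, Callable[..., list]] = {}

    def get(self, file_type) -> Callable[..., list]:
        for package in file_type.importable_package_dependencies:
            if importlib.util.find_spec(package) is None:
                raise ImportError(
                    f"partitioning a {file_type.name} file requires extra"
                    f' dependencies; run: pip install "unstructured[{file_type.extra_name}]"'
                )
        if file_type not in self._partitioners:
            module = importlib.import_module(file_type.partitioner_module_qname)
            self._partitioners[file_type] = getattr(
                module, file_type.partitioner_function_name
            )
        return self._partitioners[file_type]
```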
### Summary
Currently, the email partitioner removes only `=\n` characters during
the clearing process. However, email content sometimes contains `=\r\n`
characters, especially when read from file-like objects such as
`SpooledTemporaryFile` (the file type used in our API). This PR updates
the email partitioner to remove both `=\n` and `=\r\n` characters during
the clearing process.
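A minimal sketch of the clearing step in its assumed form:
```python
import re

raw_email_text = "Make sure to =\r\nRSVP!"  # illustrative CRLF soft line break
cleaned = re.sub(r"=\r?\n", "", raw_email_text)
assert cleaned == "Make sure to RSVP!"
```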
### Testing
```python
import tempfile

from unstructured.partition.email import partition_email

filename = "example-docs/eml/family-day.eml"
elements = partition_email(filename=filename)
print(f"From filename: {elements[3].text}")

with open(filename, "rb") as test_file:
    spooled_temp_file = tempfile.SpooledTemporaryFile()
    spooled_temp_file.write(test_file.read())
    spooled_temp_file.seek(0)
    elements = partition_email(file=spooled_temp_file)
print(f"From spooled_temp_file: {elements[3].text}")
```
**Results:**
- on `main`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to = RSVP!
```
- on `PR`
```
From filename: Make sure to RSVP!
From spooled_temp_file: Make sure to RSVP!
```
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.
### Summary
- Created two new subdirectories in the `example-docs` folder:
- `pdf/`: for all PDF example files
- `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure
### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.
## Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.
## Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
**Summary**
Elaborate the `FileType` enum to be a complete descriptor of file-types.
Add methods to allow `STR_TO_FILETYPE`, `EXT_TO_FILETYPE` and
`FILETYPE_TO_MIMETYPE` mappings to be replaced, removing those redundant
and noisy declarations.
In the process, fix some lingering file-type identification and
`.metadata.filetype` errors that had been skipped in the tests.
**Additional Context**
Gathering the various attributes of a file-type into the `FileType` enum
eliminates the duplication inherent in the separate `STR_TO_FILETYPE`
etc. mappings and makes access to those values convenient for callers.
These attributes include what MIME-type a file-type should record in
metadata and what MIME-types and extensions map to that file-type. These
values and others are made available as methods and properties directly
on the `FileType` class and members. Because all attributes are defined
in the `FileType` enum there is no risk of inconsistency across multiple
locations and any changes happen in one and only one place. Further
attributes and methods will be added in later commits to support other
file-type related operations like mapping to a partitioner and verifying
its dependencies are installed.
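A hedged illustration of consolidated attribute access on the
`FileType` enum (the property name and import path are assumptions):
```python
from unstructured.file_utils.filetype import FileType

# the MIME-type this file-type records in element metadata, formerly
# looked up via the separate FILETYPE_TO_MIMETYPE mapping
print(FileType.DOCX.mime_type)
```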
**Summary**
In preparation for further work on auto file-type detection, improve
`filetype.py` and related modules:
- improve docstrings
- improve type annotations
- extract domain model to `.model` module
Closes #3159.
This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.
### Testing
```python
import os

from unstructured.partition.pdf import partition_pdf

os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
)
elements = partition_pdf(
    filename=<file_path>,
    strategy=strategy,
    languages=["chi_sim"],  # Chinese (simplified)
    infer_table_structure=True,
)
```
**Summary**
Improve file-detection tests in preparation for additional work and bug
fixes.
**Additional Context**
- Add type annotations.
- Use mocks instead of `monkeypatch` in most cases and verify calls to
mock. This revealed a dozen broken tests, broken in that the mocks
weren't being called so a different code path than intended was being
exercised.
- Use `example_doc_path()` instead of hard-coded paths.
- Add actual test files for cases where they were being constructed in
temporary directories.
- Make test names consistent and more descriptive of behavior under
test.
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.
**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.
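A hedged sketch of `lxml`'s custom element-class mechanism (the class
names here are illustrative, not the PR's):
```python
from lxml import etree


class Phrasing(etree.ElementBase):
    """Behavior for inline (phrasing) elements like <b> and <i>."""


class BlockItem(etree.ElementBase):
    """Behavior for block-level elements like <p> and <div>."""


# map tag names (HTML elements have no namespace) to element classes
lookup = etree.ElementNamespaceClassLookup()
namespace = lookup.get_namespace(None)
namespace["b"] = namespace["i"] = Phrasing
namespace["p"] = namespace["div"] = BlockItem

parser = etree.HTMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring("<p>foo <b>bar</b></p>", parser)
print(type(root.find(".//p")).__name__)  # BlockItem
```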
This solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.
The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. Its behavior is based on the HTML Standard
and reliably produces proper and explainable results even for novel
cases.
Fixes #2325, Fixes #2562, Fixes #2675, Fixes #3168, Fixes #3227,
Fixes #3228, Fixes #3230, Fixes #3237, Fixes #3245, Fixes #3247,
Fixes #3255, Fixes #3309
### BEHAVIOR DIFFERENCES
#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
```html
<p>foo <b>bar <i>baz</i> bada</b> bing</p>
```
produces:
```json
{
"emphasized_text_contents": ["bar", "baz", "bada"],
"emphasized_text_tags": ["b", "bi", "b"]
}
```
whereas previously it would have produced:
```json
{
"emphasized_text_contents": ["bar baz bada", "baz"],
"emphasized_text_tags": ["b", "i"]
}
```
#### `<pre>` text is preserved as it appears in the HTML
Except that a leading newline is removed if present (it has to be at
position 0 of the text). Likewise, a trailing newline is stripped, but
only if it appears in the very last position ([-1]) of the `<pre>` text.
The old parser stripped all leading and trailing whitespace.
Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.
#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and ` ` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.
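A minimal sketch (assumed form, not the parser's actual code) of this
normalization:
```python
import re


def normalize(text: str) -> str:
    # collapse runs of whitespace (including \t and \n) to one space, trim ends
    return re.sub(r"\s+", " ", text).strip()


assert normalize("  foo\t\n bar  ") == "foo bar"
```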
#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial complexity to the code and reducing readability and
maintainability.
#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
- Related: _No paragraph-break on `<br/><br/>`_
#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.
---------
Co-authored-by: scanny <scanny@users.noreply.github.com>
**Note**
This refines the new HTML parser but _does not install it_. This is why
no changes to ingest test expectations or other unit-tests are required
here. Installing the new parser will happen in the next PR #3218.
**Summary**
The initial version of the parser (purposely) raised on a block element
nested inside a phrasing element. While such nesting is not valid
according to the HTML Standard, it is accepted by the browser and does
happen in the wild.
The refinements here handle this situation similarly to how the browser
does, breaking phrasing at the block element boundaries and starting it
up again after the block element.
Unfortunately this adds complexity to the parser, but it makes the
parser robust against pretty much any HTML we're likely to encounter and
partitions it consistent with how it would be rendered in the browser.
### Summary
Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download`. Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files and checks the SHA256 hash to validate
the download. An error is now raised if `nltk.download` is invoked.
The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.
### Testing
1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages
download_nltk_packages()
```
3. Run `ls ~/tmp/nltk_test/nltk_data`. You should see the downloaded
data.
---------
Co-authored-by: Steve Canny <stcanny@gmail.com>
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
The table metric that considers spans is not used and it distorts the
output, so I have removed that code. I have left `table_as_cells` in
the source code, though - it may still be useful to users.
This pull request adds table detection metrics.
One case I considered:
Case: two tables are predicted and matched with one table in the ground
truth.
Question: is this matching correct in both cases, or just for one
table?
There are two subcases:
- the table was predicted by the OD model as two sub-tables (split in
half into two non-overlapping sub-tables) -> in my opinion, both are
correct
- it is a false positive from the table-matching script in
get_table_level_alignment -> 1 good, 1 wrong
As we don't have bounding boxes, I followed the notebook's calculation
script and assumed the pessimistic second-subcase interpretation.
Change the unstructured-client pin to set a minimum version instead of
a maximum version, and run `make pip-compile`.
Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
This pull request fixes the table-counting metric for three cases:
- False negatives: when a table exists in the ground truth but none of
the predicted tables matches it, the table should count as 0 and the
file should not be skipped entirely (previously it was np.NaN).
- False positives: when a predicted table doesn't match any
ground-truth table, it should be counted as 0; currently it is skipped
in processing (matched_indices == -1).
- The file should be skipped entirely only if there are no tables in
either the ground truth or the prediction.
In short, the previous metric calculation didn't account for OD
mistakes.
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError`
arises both when no file exists at the given path and when the file
exists but is not a ZIP archive (and so is not a DOCX file).
This ambiguity is unwelcome when diagnosing the error, as the two
possible conditions generally indicate different courses of action to
resolve it.
Add detailed validation to `DocxPartitionerOptions` to distinguish these
two and provide more precise exception messages.
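A hedged sketch (an assumed form, not the PR's actual code) of the
distinguishing validation:
```python
import os
import zipfile


def _validate_docx_path(file_path: str) -> None:
    # distinguish "no such file" from "file exists but is not a ZIP archive"
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"no such file or directory: '{file_path}'")
    if not zipfile.is_zipfile(file_path):
        raise ValueError(f"not a ZIP archive (so not a DOCX file): '{file_path}'")
```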
**Additional Context**
- `python-pptx` shares the same OPC-Package (file) loading code used by
`python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be
upstream in `python-docx` and `python-pptx`. If we're willing to take
the version bump it might be worth considering doing that instead.
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, OCR, and final), plus a dump of the OD model
output as a JSON file, for better analysis.
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>