1605 Commits

Author SHA1 Message Date
Matt Robinson
cf32672bc5
build(deps): bumps for 2024-09-09 (#3608)
### Summary

Dependency bumps for 2024-09-09.
2024-09-09 16:45:18 +00:00
cragwolfe
3bb0ee1e79
chore: fix tests breaking on main (#3603)
Fix API tests (really more like integration tests) that run only on
main. Also use less compute-intensive files to speed up test time and
remove a useless test.

Tests in `test_unstructured/partition/test_api.py` pass, temporarily
running outside of main per the screenshot:

![image](https://github.com/user-attachments/assets/f15d440a-2574-40f2-98b4-adf57fbae704)


https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513
2024-09-08 21:25:52 +00:00
Matt Robinson
c060467018
build(deps): bump cryptography version (#3599)
### Summary

Bumps to the latest version of the `cryptography` library to address
`GHSA-h4gh-qq45-vh27`.
2024-09-05 19:06:43 +00:00
Pawel Kmiecik
f25eb60585
fix: expose drawing options as function params rather than env config (#3598)
This PR:
- changes the interface of analysis tools to expose drawing params as
function parameters rather than env_config (i.e. environment variables)
- restructures analysis package
2024-09-05 15:51:43 +00:00
Christine Straub
acd070c5d5
feat: enhance pdfminer element cleanup (#3593)
This PR aims to expand removal of `pdfminer` elements to include those
inside all `non-pdfminer` elements, not just `tables`.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-04 12:02:50 +00:00
Yao You
d51fb134e6
Feat/improve iou speed (#3582)
This PR vectorizes the computation of element overlap to speed up
the deduplication process for extracted elements.
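
As a rough illustration of what vectorizing an overlap computation can look like, the sketch below computes a pairwise intersection-over-union (IoU) matrix with NumPy broadcasting instead of nested Python loops. The function name `pairwise_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example; this is not the code added by the PR.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b) -> np.ndarray:
    """Return an (N, M) IoU matrix for boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration only.
    """
    a = np.asarray(boxes_a, dtype=float)[:, None, :]  # shape (N, 1, 4)
    b = np.asarray(boxes_b, dtype=float)[None, :, :]  # shape (1, M, 4)

    # Width and height of each pairwise intersection, clipped at zero
    # so that non-overlapping boxes contribute no area.
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h  # shape (N, M)

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])  # shape (N, 1)
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])  # shape (1, M)
    union = area_a + area_b - inter

    # Guard against division by zero for degenerate (zero-area) boxes.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
```

All N×M pairs are evaluated in a handful of array operations, which is where the speedup over a per-pair loop typically comes from.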

## test

This PR adds unit tests for the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spent on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29 27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)
2024-09-03 00:06:18 +00:00
Pawel Kmiecik
404f780bbb
feat: make analysis drawing more flexible (#3574)
This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - similar way to the
current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of instances of internal classes (like `TextRegion`,
`DocumentLayout`)
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in `save_analysis_artifiacts` function
- to do this, helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
2024-09-02 11:06:11 +00:00
Matt Robinson
04322d1632
build(deps): removed unnecessary jupyter deps (#3583)
### Summary

Removes unnecessary `jupyter` and `ipython` dev dependencies to reduce
CVE surface area.
2024-08-31 05:21:40 +00:00
Matt Robinson
6ba8135bf9
fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT, XLS
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed an incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.

### Testing

Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
0.15.9
2024-08-30 15:12:46 -04:00
John
ddb6cb631d
chore: remove minimum version pins for pins older than 6 mo (#3577)
Remove a number of pins in `requirements/deps/constraints.txt` and run
`make pip-compile`.
2024-08-29 15:35:14 +00:00
Austin Walker
f440eb476c
feat: Support encoding parameter in partition_csv (#3564)
See added test file. Added support for the encoding parameter, which can
be passed directly to `pd.read_csv`.
2024-08-28 14:19:58 +00:00
John
f21c853ade
bug: fix file_conversion disk leak (#3562)
Fix disk space leaks and Windows errors when accessing file.name on a
NamedTemporaryFile

Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of
`file.name` of `NamedTemporaryFile` have been replaced with
`TemporaryDirectory` to avoid a known issue:
- https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
- https://github.com/Unstructured-IO/unstructured/issues/3390

The first 7 commits each address an individual occurrence of the issue
if reviewers want to review commit-by-commit.
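
A minimal sketch of the general pattern, assuming a conversion step that needs a real path on disk: the scratch file is created inside a `tempfile.TemporaryDirectory`, closed before it is reopened by path, and removed together with the directory when the context exits. The helper name `convert_in_temp_dir` and the UTF-8 round-trip are illustrative assumptions, not code from this PR.

```python
import os
import tempfile


def convert_in_temp_dir(data: bytes, suffix: str = ".txt") -> str:
    """Write `data` to a scratch file, read it back as text, and clean up.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, f"input{suffix}")

        # The handle is closed at the end of this block, so the path can be
        # reopened afterwards without hitting Windows file-locking errors.
        with open(tmp_path, "wb") as f:
            f.write(data)

        # Stand-in for a conversion tool that takes a path rather than a handle.
        with open(tmp_path, "rb") as f:
            result = f.read().decode("utf-8")

    # Leaving the `with` block deletes the directory and everything in it,
    # so nothing is left behind on disk.
    return result
```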
2024-08-27 22:02:24 +00:00
Matt Robinson
4194a07d12
build(deps): replace pillow-heif with pi-heif (#3571)
### Summary

Closes #2664 and replaces `pillow-heif` with `pi-heif` due to more
permissive licensing on the binary wheel for `pi-heif`.
0.15.8
2024-08-27 11:54:35 -04:00
David Potter
ddba928344
Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
2024-08-27 14:52:13 +00:00
Christine Straub
affd997c39
refactor(ci): remove unused environment variables (#3568)
This PR removes the unused env `TABLE_OCR` from CI.
2024-08-26 19:19:58 +00:00
Matt Robinson
09d84bc46b
build(deps): version bumps for 2024-08-26 (#3567)
### Summary

Version bumps for 2024-08-26.
2024-08-26 15:15:25 -04:00
Christine Straub
ac10ba4fc1
build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561)
### Summary
- Bump `unstructured.paddleocr` to 2.8.1.0
- Remove `opencv-python` and `opencv-contrib-python` constraint pins
- Fix `0.15.7` changelog
2024-08-23 14:17:29 -07:00
Steve Canny
32bb77aafb
fix(file): no default OLE subtype (#3516)
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based
identification in that case.

**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.

An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it is not perfect and does not identify MSG files.
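
For illustration, the 8-byte check mentioned above can be sketched as follows. The signature bytes are the standard OLE/Compound File Binary header; the function names are assumptions made for this example and are not the names used in `unstructured`.

```python
# Standard 8-byte signature at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole_container(head: bytes) -> bool:
    """Return True when `head` starts with the OLE signature (hypothetical helper)."""
    return head[:8] == OLE_MAGIC


def file_is_ole_container(path: str) -> bool:
    """Read only the first 8 bytes of the file at `path` and check them."""
    with open(path, "rb") as f:
        return is_ole_container(f.read(8))
```

This only identifies the container; telling DOC, PPT, XLS, and MSG apart still requires inspecting what is stored inside it.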

Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.

Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.

Note that a file with no filename and no metadata_filename, or with an
incorrect extension, will not be correctly identified in this case;
however, we're assuming for now that will be rare in practice.
2024-08-22 19:16:53 +00:00
John
b4a6aa5559
chore: remove fsspec pin (#3554)
remove fsspec pin
2024-08-21 21:57:42 +00:00
Steve Canny
03e0ed3519
rfctr(docx): DOCX emits std minified .text_as_html (#3545)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
2024-08-21 18:54:21 +00:00
John
f135344738
chore: remove scipy and packaging pins (#3550)
Remove scipy and packaging constraint pins
2024-08-21 16:05:19 +00:00
John
604cadfb7e
chore: remove ipython pin (#3548)
this pr is stacked on
https://github.com/Unstructured-IO/unstructured/pull/3538 and
https://github.com/Unstructured-IO/unstructured/pull/3547

This pr removes dependency pins for IPython, anyio, and pyparsing. It
also updates the label-studio-sdk import statement so we don't have to
have that pinned and makes some minor type hinting edits. Label Studio
had a breaking change in their 1.13.0
[release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0).
2024-08-21 00:06:31 +00:00
Christine Straub
01dbc7b473
fix: nltk data download path to prevent redundant nested directories (#3546)
Closes #3543.

### Summary
This PR addresses an issue with the NLTK data download process.
Previously, when downloading NLTK data, a nested "nltk_data" directory
was created within the parent "nltk_data" directory if the parent
directory already existed. This redundant directory structure led to two
significant problems:
- errors in checking if data had already been downloaded, potentially
causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to
incorrect path resolution.

This fix modifies the NLTK data download logic to prevent creation of
unnecessary nested directories. If the download path ends with
"nltk_data" and that directory already exists, we now use the existing
directory instead of creating a new nested one.
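
The rule described above can be sketched roughly as below. The function name `resolve_nltk_data_dir` is hypothetical, and the fallback branch (appending `nltk_data` when the rule does not apply) is an assumption made so the example is complete; it is not taken from the PR.

```python
import os


def resolve_nltk_data_dir(path: str) -> str:
    """Pick the directory NLTK data should live in (hypothetical helper)."""
    normalized = os.path.normpath(path)

    # Rule from the summary: a path that already ends in "nltk_data" and
    # exists is reused as-is instead of nesting another "nltk_data" inside it.
    if os.path.basename(normalized) == "nltk_data" and os.path.isdir(normalized):
        return normalized

    # Assumption for this sketch: otherwise place the data in an
    # "nltk_data" subdirectory of the given path.
    return os.path.join(normalized, "nltk_data")
```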

### Testing
CI should pass.
0.15.7
2024-08-20 18:56:59 +00:00
Matt Robinson
1f8030dd0e
fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary

Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.

### Testing

After the bump, CI should pass. Additionally, we verified locally that
file processing takes about as long as we would expect for a long
`.docx` file.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
0.15.6
2024-08-19 20:59:36 +00:00
Steve Canny
a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
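
To make the whole-row rule concrete, here is a simplified sketch that splits a table only at row boundaries and derives each chunk's text and minified HTML from the same rows, so the two stay in step. It assumes the chunking window is measured in characters of text, and it emits an oversized single row whole rather than breaking it at cell or word boundaries as described above. All names are hypothetical; this is not the implementation from this PR.

```python
from html import escape


def _rows_text(rows: list[list[str]]) -> str:
    """Join all non-empty cell texts with single spaces."""
    return " ".join(cell for row in rows for cell in row if cell)


def _rows_html(rows: list[list[str]]) -> str:
    """Build a minified <table> with no whitespace and no thead/tbody."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_table_rows(rows: list[list[str]], max_chars: int) -> list[tuple[str, str]]:
    """Return (text, html) pairs, splitting only at whole-row boundaries."""
    chunks: list[tuple[str, str]] = []
    current: list[list[str]] = []
    for row in rows:
        # Close the current chunk when adding this row would overflow the window.
        if current and len(_rows_text(current + [row])) > max_chars:
            chunks.append((_rows_text(current), _rows_html(current)))
            current = []
        current.append(row)
    if current:
        chunks.append((_rows_text(current), _rows_html(current)))
    return chunks
```

Because both outputs are generated from the same list of rows, each HTML fragment is a complete `<table>` and its text always matches the accompanying chunk text.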
2024-08-19 18:56:53 +00:00
Christine Straub
99f72d65ba
ci: fix ingest test fixtures update (#3532)
2024-08-16 16:37:33 -07:00
Christine Straub
fc26426310
feat: replace pytesseract with unstructured.pytesseract fork (#3528)
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.

This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
0.15.5
2024-08-16 10:34:22 -04:00
Matt Robinson
e64e09507a
build: update to latest base image (#3524)
### Summary

Updates to the latest `wolfi-base` base image to pull in more recent
package versions. A notable update is that upgrading to
`libreoffice==24.2.5.2` resolves several CVEs.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-08-15 22:27:41 -07:00
Christine Straub
d0211cc41f
build: downgrade nltk version (#3527)
This PR rolls back `nltk` to `3.8.1`, which was bumped to `3.8.2` in
https://github.com/Unstructured-IO/unstructured/pull/3512, because
`3.8.2` is no longer available on PyPI due to some issues
(https://github.com/nltk/nltk/issues/3301).
2024-08-15 16:35:21 -07:00
Christine Straub
9b778e270d
fix: pytesseract>=0.3.12 installation error while installing pdf extra (#3522)
Closes #3521.

This PR resolves an installation error with `pytesseract>=0.3.12` that
occurred during `pip install unstructured[pdf]==0.15.3`.

### Testing
**Run the following command on the main branch and on this PR**
```
pip uninstall -y pytesseract && pip install ".[pdf]"
```
**Results**
- `main` branch
```
INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10)
ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf"
```
- this `PR`

`pytesseract-0.3.13` should be installed successfully.
0.15.4
2024-08-14 16:15:40 -05:00
Christine Straub
d6a84bdfbb
build(deps): update extra-paddleocr requirements (#3515)
This PR removes custom index URL for `paddlepaddle` installation in
`extra-paddleocr.in`, resolving `setup.py` configuration error. Now uses
`paddlepaddle==3.0.0b1` directly from PyPI, simplifying installation
process.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
0.15.3
2024-08-14 12:19:20 -05:00
Matt Robinson
7437f0a084
fix(CVE-2024-39705): update to latest nltk version (#3512)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705) by
updating to `nltk==3.8.2` and closes #3511. This CVE had previously been
mitigated in #3361.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
0.15.2
2024-08-13 09:39:29 -04:00
Christine Straub
1158d8f695
Refactor image block extraction in pdf partitioning (#3514)
Closes
[#3503](https://github.com/Unstructured-IO/unstructured/issues/3503).

### Summary
This PR prevents creation of `figures` directory for saving image blocks
(`Image`, `Table`) when `extract_image_block_to_payload` parameter is
set to True

### Testing

```
elements = partition_image(
    filename="example-docs/img/embedded-images-tables.jpg",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True,
)
```
**Results:**
- `Main` Branch: `figures` directory is created.
- `PR`: `figures` directory is not created.
2024-08-13 06:11:10 +00:00
Steve Canny
cbe1b35621
rfctr(chunk): prep for adding TableSplitter (#3510)
**Summary**
Mechanical refactoring in preparation for adding (pre-chunk)
`TableSplitter` in a PR stacked on this one.
2024-08-12 18:04:49 +00:00
Christine Straub
d99b39923d
build(deps): Remove unstructured.paddlepaddle fork (#3506)
This PR aims to remove "unstructured.paddlepaddle" fork. Previously, we
used `unstructured.paddlepaddle` fork to support
`unstructured.paddleocr` on arm64 architecture. But currently,
`unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work
on `arm64` architecture. Also, `unstructured.paddleocr` with the latest
version of the original `paddlepaddle` works on both `amd64` and `arm64`
architectures.

### Testing
```
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

elements = partition_pdf(
    filename=<file_path>,
    strategy="hi_res",
    infer_table_structure=True,
)
```
2024-08-09 22:04:22 +00:00
John
a2ae2ed646
chore: remove matplotlib constraint (#3505)
2024-08-09 19:31:19 +00:00
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
John
2373eaa829
fix typo: pipline>pipeline (#3498)
fix typo: pipline>pipeline
2024-08-08 18:53:47 +00:00
John
43ae0befa7
chore: bump botocore pin (#3493)
bump botocore pin to match aiobotocore/s pin:

eae97439b3
2024-08-07 21:41:53 +00:00
John
696155e614
chore: update importlib-metadata pin (#3491)
2024-08-07 18:17:53 +00:00
John
6545f16e57
chore: remove cryptography pin and update test (#3482)
remove cryptography pin, pin tenacity, and update
test_unstructured_ingest/unit/connector/test_salesforce_connector.py
2024-08-07 15:25:23 +00:00
Pawel Kmiecik
eba12daeb2
feat: correct object detection metrics (#3490)
This PR:
- fixes an issue that made it impossible to compute OD metrics
- adds per-class object detection metrics
2024-08-07 14:14:02 +00:00
John
24a1f298e5
chore: small edits (#3480)
Add comments and fix decorators on some tests.
2024-08-06 19:21:43 +00:00
Steve Canny
73bef27ef1
fix(pptx): accommodate invalid image/jpg MIME-type (#3475)
As described in #3381, some clients, perhaps including Adobe PDF
Converter, map JPEG images to the invalid `image/jpg` MIME-type. Prior
to v1.0.0, `python-pptx` would not load these images, which caused image
extraction to fail.

Update the `python-pptx` dependency to `v1.0.1` or above to ensure this
upstream fix is always available.

Fixes: #3381
2024-08-06 18:48:15 +00:00
Steve Canny
a468b2de3b
rfctr(csv): accommodate single column CSV files (#3483)
**Summary**
Improve factoring, type-annotation, and tests for `partition_csv()` and
accommodate single-column CSV files.

Fixes: #2616
2024-08-06 00:48:37 +00:00
David Potter
59ec64235b
chore: rename astra to astradb (#3458)
DataStax wanted all references to be astradb instead of astra. As per
@erichare

We'll also have to do the same in unstructured-ingest :)
2024-08-05 20:41:02 +00:00
Austin Walker
7e887442c4
chore: Cut the 0.15.1 release (#3481)
0.15.1
2024-08-05 16:16:13 +00:00
Maciej Kurzawa
b749b891a7
fix: disabled checking max pages for images (#3473)
Added fix related to
https://github.com/Unstructured-IO/unstructured/pull/3431, which
disables checking max pages for images
2024-08-02 14:25:08 +00:00
John
147514f6b5
feat: msg and email metadata (#3444)
Update partition_eml and partition_msg to capture cc, bcc, and message
id fields.
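
As a standalone illustration of where these fields come from, the sketch below reads the Cc, Bcc, and Message-ID headers with Python's standard `email` package. The function name and the dictionary keys are assumptions made for this example; they are not the metadata field names used by `unstructured`.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_header_metadata(raw_email: str) -> dict:
    """Collect cc, bcc, and message-id values from a raw message (hypothetical helper)."""
    msg = message_from_string(raw_email)
    return {
        # getaddresses() splits comma-separated header values into
        # (display name, address) pairs; only the addresses are kept here.
        "cc": [addr for _, addr in getaddresses(msg.get_all("Cc", []))],
        "bcc": [addr for _, addr in getaddresses(msg.get_all("Bcc", []))],
        # Returned as written in the header, including angle brackets;
        # None when the header is absent.
        "message_id": msg.get("Message-ID"),
    }
```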

Docs PR: https://github.com/Unstructured-IO/docs/pull/135/files

Testing
```
from unstructured.partition.email import partition_email
from test_unstructured.unit_utils import example_doc_path

elements = partition_email(filename=example_doc_path("eml/fake-email-header.eml"), include_headers=True)
print(elements)
elements[0].metadata.to_dict()
```

Note to reviewers:
Tests in `test_unstructured/partition/test_email.py` were refactored and
rearranged to group similar tests together, so it will be easiest to
review those changes commit by commit.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
2024-08-01 19:24:17 +00:00
Christine Straub
0f057188c6
Improve pdfminer embedded image extraction in pdf partitioning (#3456)
### Summary
This PR addresses an issue in `pdfminer` library's embedded image
extraction process. Previously, some extracted "images" were incorrect,
including embedded text elements, resulting in oversized bounding boxes.
This update refines the extraction process to focus on actual images
with more accurate, smaller bounding boxes.

### Testing
PDF:
[test_pdfminer_text_extraction.pdf](https://github.com/user-attachments/files/16448213/test_pdfminer_text_extraction.pdf)

```
elements = partition_pdf(
    filename="test_pdfminer_text_extraction",
    strategy=strategy,
    languages=["chi_sim"],
    analysis=True,
)
```
**Results**
- this `PR`

![page1_layout_pdfminer](https://github.com/user-attachments/assets/098e0a1f-fdad-4627-a881-cbafd71ce5a0)

![page1_layout_final](https://github.com/user-attachments/assets/6dc89180-36ac-424a-99de-63810ebf8958)
- `main` branch

![page1_layout_pdfminer](https://github.com/user-attachments/assets/8228995a-2ef1-4b76-9758-b8015c224e6d)

![page1_layout_final](https://github.com/user-attachments/assets/68d43d7b-7270-4f58-8360-dc76bd0df78f)
2024-08-01 16:47:08 +00:00