1574 Commits

Author SHA1 Message Date
Steve Canny
3bab9d93e6
rfctr(part): prepare for pluggable auto-partitioners 1 (#3655)
**Summary**
In preparation for pluggable auto-partitioners, simplify metadata as
discussed.

**Additional Context**
- Pluggable auto-partitioners require partitioners to have a consistent
call signature. An arbitrary partitioner provided at runtime needs to
have a call signature that is known and consistent. Basically
`partition_x(filename, *, file, **kwargs)`.
- The current `auto.partition()` is highly coupled to each distinct
file-type partitioner, deciding which arguments to forward to each.
- This is driven by the existence of "delegating" partitioners, those
that convert their file-type and then call a second partitioner to do
the actual partitioning. Both the delegating and proxy partitioners are
decorated with metadata-post-processing decorators and those decorators
are not idempotent. We call the situation where those decorators would
run twice "double-decorating". For example, EPUB converts to HTML and
calls `partition_html()` and both `partition_epub()` and
`partition_html()` are decorated.
- The way double-decorating has been avoided in the past is to avoid
sending the arguments the metadata decorators are sensitive to to the
proxy partitioner. This is very obscure, complex to reason about,
error-prone, and just overall not a viable strategy. The better solution
is to not decorate delegating partitioners and let the proxy partitioner
handle all the metadata.
- This first step in preparation for that is part of simplifying the
metadata processing by removing unused or unwanted legacy parameters.
- `date_from_file_object` is a misnomer because a file-object never
contains last-modified data.
- It can never produce useful results in the API where last-modified
information must be provided by `metadata_last_modified`.
- It is an undocumented parameter so not in use.
- Using it can produce incorrect metadata.
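
As a rough illustration of the "double-decorating" problem described above, the sketch below uses a deliberately non-idempotent decorator. The decorator `post_process_metadata`, the `passes` counter, and the three `*_sketch` functions are hypothetical stand-ins, not code from the library; only the `partition_x(filename, *, file, **kwargs)` call shape comes from the description.

```python
import functools


def post_process_metadata(func):
    """Hypothetical metadata decorator that is NOT idempotent.

    Each application increments a counter on every element, so running it
    twice visibly changes the result.
    """
    @functools.wraps(func)
    def wrapper(filename=None, *, file=None, **kwargs):
        elements = func(filename, file=file, **kwargs)
        for element in elements:
            element["passes"] = element.get("passes", 0) + 1
        return elements
    return wrapper


@post_process_metadata
def partition_html_sketch(filename=None, *, file=None, **kwargs):
    # Stand-in for the proxy partitioner that does the real work.
    return [{"text": "Hello"}]


@post_process_metadata
def partition_epub_double_sketch(filename=None, *, file=None, **kwargs):
    # Decorated delegating partitioner: the decorator runs twice overall.
    return partition_html_sketch(filename, file=file, **kwargs)


def partition_epub_single_sketch(filename=None, *, file=None, **kwargs):
    # Undecorated delegating partitioner: the proxy handles all metadata.
    return partition_html_sketch(filename, file=file, **kwargs)


double = partition_epub_double_sketch("book.epub")
single = partition_epub_single_sketch("book.epub")
```

Here `double[0]["passes"]` ends up at 2 while `single[0]["passes"]` stays at 1, which is the behavior the undecorated-delegator approach is meant to guarantee.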
2024-09-23 22:23:10 +00:00
Steve Canny
03c2bf8f1f
rfctr(part): extract partition.common submodules (#3649)
**Summary**
In preparation for consolidating post-partitioning metadata decorators,
extract `partition.common` module into a sub-package (directory) and
extract `partition.common.metadata` module to house metadata-specific
objects shared by partitioners.

**Additional Context**
- This new module will be the home of the new consolidated metadata
decorator.
- The consolidated decorator is a step toward removing post-processing
decorators from _delegating_ partitioners. A delegating partitioner is
one that converts its file to a different format and "delegates" actual
partitioning to the partitioner for that target format. 10 of the 20
partitioners are delegating partitioners.
- Removing decorators from delegating partitioners will allow us to
avoid "double-decorating", i.e. running those decorators twice, once on
the principal partitioner and again on the proxy partitioner.
- This will allow us to send `**kwargs` to either partitioner, removing
the knowledge of which arguments to send for each file-type from
auto-partition.
- And this will allow pluggable auto-partitioners which all have a
`partition_x(filename, *, file, **kwargs) -> list[Element]` interface.
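
One way to picture what a uniform interface buys is a small registry that forwards `**kwargs` untouched. This is only a sketch under assumptions: `register_partitioner`, `auto_partition_sketch`, the extension-keyed registry, and the dict-based elements are all hypothetical and are not the library's API.

```python
import os
from typing import IO, Any, Callable, Dict, List, Optional

# Every partitioner shares one call shape:
#   partition_x(filename, *, file, **kwargs) -> list of elements
Partitioner = Callable[..., List[dict]]

_REGISTRY: Dict[str, Partitioner] = {}


def register_partitioner(extension: str, partitioner: Partitioner) -> None:
    """Register a partitioner for a file extension such as '.txt'."""
    _REGISTRY[extension.lower()] = partitioner


def auto_partition_sketch(
    filename: Optional[str] = None,
    *,
    file: Optional[IO[bytes]] = None,
    **kwargs: Any,
) -> List[dict]:
    """Dispatch on extension and forward every keyword argument unchanged."""
    if filename is None:
        raise ValueError("this sketch dispatches on the filename extension")
    extension = os.path.splitext(filename)[1].lower()
    if extension not in _REGISTRY:
        raise ValueError(f"no partitioner registered for {extension!r}")
    # No per-file-type knowledge of which arguments to send is needed here.
    return _REGISTRY[extension](filename, file=file, **kwargs)


def partition_txt_sketch(filename=None, *, file=None, **kwargs):
    # Stand-in partitioner that records what it received.
    return [{"text": f"partitioned {filename}", "kwargs": kwargs}]


register_partitioner(".txt", partition_txt_sketch)
elements = auto_partition_sketch("notes.txt", languages=["eng"])
```

Because the dispatcher never inspects `kwargs`, a partitioner supplied at runtime needs nothing more than the shared signature.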
2024-09-20 20:35:28 +00:00
Matt Robinson
7d66a236f1
fix: correctly install mesa-gl for arm (#3647)
### Summary

Fixes the `arm64` image builds, which will be available again starting
in version `0.15.13`. A fix was implemented upstream in
https://github.com/Unstructured-IO/base-images/pull/47 and a workaround
that installed `x86` packages in the `unstructured` repo was removed.

### Testing

See [this
job](https://github.com/Unstructured-IO/unstructured/actions/runs/10948943594/job/30401108059?pr=3647)
for a successful `arm64` build on the feature branch.
0.15.13
2024-09-20 13:32:47 +00:00
Christine Straub
0ed69a1ac3
refactor: pdfminer image cleanup (#3648)
This PR aims to remove `clean_pdfminer_duplicate_image_elements()`
function, as its functionality has already been integrated into the
`remove_duplicate_elements()` function in [PR
#3630](https://github.com/Unstructured-IO/unstructured/pull/3630).
2024-09-19 18:57:02 +00:00
Christine Straub
be88eef06f
perf: optimize pdfminer image cleanup process for improved performance (#3630)
This PR enhances the `pdfminer` image cleanup process by repositioning the
duplicate image removal step. It optimizes the removal of duplicated
pdfminer images by performing the cleanup before merging elements,
rather than after. This improvement reduces execution time and enhances
the overall processing speed of PDF documents.

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2024-09-19 14:05:05 +00:00
Steve Canny
cd074bb32b
chore(file): remove dead code (#3645)
**Summary**
Remove dead code in `unstructured.file_utils`.

**Additional Context**
These modules were added in 12/2022 and 1/2023 and are not referenced by
any code. Removing to reduce unnecessary complexity. These can of course
be recovered from Git history if we decide we want them again in future.
2024-09-19 06:45:33 +00:00
Yao You
22998354db
add requirements files to ingest cache hash key (#3641)
This PR adds the requirement files for base and extras to the ingest
cache's hash key.

- The current workflow uses only the ingest requirements to generate the
hash key for the GitHub Actions cache.
- Sometimes only base or extra requirements (like extra-pdf.txt) are
updated but not any ingest requirements, which means the ingest test
would fetch a cache with outdated non-ingest dependencies.
- When we generate a new ingest cache we first check the base and
extra requirements and generate a base env before layering the
ingest dependencies on top.
- This PR allows the ingest step to recognize changes to non-ingest
dependencies and trigger new cache generation when either ingest
or base/extra requirement files change.
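
The idea behind the wider hash key can be sketched in Python. This is not how the CI workflow computes its key; `cache_key` and the `ingest-venv` prefix are hypothetical names used only to show that a change in any contributing file yields a different key.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def cache_key(paths: Iterable[Path], prefix: str = "ingest-venv") -> str:
    """Derive a cache key from the names and contents of requirement files."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        # Include the file name so renames also invalidate the cache.
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()}"
```

Passing the base, extra, and ingest requirement files together means an update to any one of them produces a new key and therefore a fresh cache.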

This PR also bumps the setup python action version in cache actions; it
also adds installation of `virtualenv` for the ingest cache action to
avoid errors like
https://github.com/Unstructured-IO/unstructured/actions/runs/10905551870/job/30265057515?pr=3641#step:3:111
2024-09-18 18:39:14 -05:00
Yao You
2d3cd45b23
Fix/reduce memory usage (#3629)
This PR fixes the high memory usage when computing intersection areas.

- it now converts the coordinates into half precision floating point
numbers instead of double
- removes some intermediate variables to free up memory

## test

Using a memory profiler like `memory_profiler` in `ipython`:

```ipython
## cell 1
from unstructured.partition.pdf_image.pdfminer_processing import areas_of_boxes_and_intersection_area
import numpy as np
%load_ext memory_profiler

## cell 2
%%memit
coords = np.random.rand(40000).reshape((10000,4)).astype(np.float16)

## cell 3
%%memit
inter_area, boxa_area, boxb_area = areas_of_boxes_and_intersection_area(coords, coords)
```

The peak memory and incremental memory from cell 3 should be close to 
```
peak memory: 730.55 MiB, increment: 573.22 MiB
```

On main branch the `coords` is double precision and running the same
code with

```
coords = np.random.rand(40000).reshape((10000,4)).astype(np.float64)
```

would result in peak memory usage more than 4GiB
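
A back-of-the-envelope calculation helps explain the gap. It assumes the computation materializes full pairwise (n by n) intermediate arrays for the 10,000 boxes in the example, which is an assumption about the implementation rather than something stated above; `pairwise_array_mib` is a hypothetical helper.

```python
def pairwise_array_mib(n_boxes: int, bytes_per_value: int) -> float:
    """Size in MiB of one n x n array with the given element width."""
    return n_boxes * n_boxes * bytes_per_value / 2**20


float64_mib = pairwise_array_mib(10_000, 8)  # about 762.9 MiB per array
float16_mib = pairwise_array_mib(10_000, 2)  # about 190.7 MiB per array
```

Under that assumption, a handful of live double-precision intermediates would already account for several GiB, while half precision cuts each one to a quarter of the size.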

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-09-17 11:00:26 -05:00
John
46e04b165a
build(deps): bump protobuf pin (#3625)
Bumps max version of `protobuf<5.0` and sets min version of
`chromadb>0.4.14` in `requirements/ingest/chroma.in`. Also fixes some
type hints in `unstructured/ingest/v2/processes/connectors/chroma.py`
2024-09-16 19:39:47 +00:00
Matt Robinson
ba93f9a26a
fix: reenable arm64 build (#3626)
### Summary

Reverts the CI change in #3624 and reenables the `arm64` build and
publish steps.
2024-09-13 16:15:01 +00:00
Matt Robinson
8b7e5bbeac
fix: temporarily disable arm64 build (#3624)
### Summary

Per [this
job](https://github.com/Unstructured-IO/unstructured/actions/runs/10842120429/job/30087252047),
`arm64` builds are currently failing, likely because the workaround for
the broken `mesa-gl` package from the `wolfi` repository only works for
`amd64`. Temporarily disabling the `arm64` build in order to push out
the latest `amd64` image with security patches, then will revert and
work the fix for the `arm64` image.

- https://github.com/Unstructured-IO/base-images/pull/44
0.15.12
2024-09-13 13:47:39 +00:00
John
159b8a9082
remove more dependency pins (#3621)
Remove `langchain-community>=0.2.5` and `wrapt>=1.14.0` pins and add
`importlib-metadata>=8.5.0` pin
2024-09-13 01:55:14 +00:00
Christine Straub
87a88a3c87
feat: improve pdfminer element processing (#3618)
This PR implements splitting of `pdfminer` elements (`groups of text
chunks`) into smaller bounding boxes (`text lines`). This implementation
prevents loss of information from the object detection model and
facilitates more effective removal of duplicated `pdfminer` text. This
PR also addresses #3430.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-12 21:17:27 +00:00
qued
639ca591d8
fix: Table metric typo (#3623)
It looks like we put columns when we meant rows in one of the table
metrics. @pravin-unstructured flagged this.
2024-09-12 19:47:53 +00:00
John
ab94c6c5d1
chore: remove pins (#3579)
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.

May be easiest to review commit-by-commit.
2024-09-12 13:48:59 +00:00
Roman Isecke
ebf16055d8
feat/add deprecation warning to all embed code (#3614)
### Description
Related PR to move the code over:
https://github.com/Unstructured-IO/unstructured-ingest/pull/92

Also removed the console script that exposes ingest.
2024-09-10 23:48:39 +00:00
cragwolfe
e9690b2738
feat: utility script to process large PDFs through the API by script (#3591)
Adds the bash script `process-pdf-parallel-through-api.sh` that allows
splitting up a PDF into smaller parts (splits) to be processed through
the API concurrently, and is re-entrant. If any of the splits fail
to process, one can attempt reprocessing those split(s) by rerunning the
script.

Note: requires the `qpdf` command line utility.

The below command line output shows the scenario where just one split
had to be reprocessed through the API to create the final
`layout-parser-paper_combined.json` output.

```
$ BATCH_SIZE=20 PDF_SPLIT_PAGE_SIZE=6 STRATEGY=hi_res \
  ./scripts/user/process-pdf-parallel-through-api.sh example-docs/pdf/layout-parser-paper.pdf
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-pars\
er-paper_pages_1_to_6.json as it already exists.
Skipping processing for /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_7_to_12.json as it already exists.
Valid JSON output created: /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_pages_13_to_16.json
Processing complete. Combined JSON saved to /Users/cragwolfe/tmp/pdf-splits/layout-parser-paper-output-8a76cb6228e109450992bc097dbd1a51_split-6_strat-hi_res/layout-parser-paper_combined.json
```
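
The splitting and re-entrancy logic can be sketched in Python for illustration. The script itself is bash and relies on `qpdf`; the helpers `split_page_ranges` and `pending_splits` below are hypothetical, and only the `_pages_<start>_to_<end>.json` naming is taken from the output above.

```python
import os
from typing import List, Tuple


def split_page_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (start, end) page ranges of at most split_size pages."""
    return [
        (start, min(start + split_size - 1, total_pages))
        for start in range(1, total_pages + 1, split_size)
    ]


def pending_splits(
    ranges: List[Tuple[int, int]], output_dir: str, stem: str
) -> List[Tuple[int, int]]:
    """Keep only the ranges whose JSON output does not exist yet (re-entrancy)."""
    return [
        (start, end)
        for start, end in ranges
        if not os.path.exists(
            os.path.join(output_dir, f"{stem}_pages_{start}_to_{end}.json")
        )
    ]
```

For a 16-page document and a split size of 6, `split_page_ranges(16, 6)` gives `[(1, 6), (7, 12), (13, 16)]`, matching the three split files named in the output above.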

Bonus change to `unstructured-get-json.sh` to point to the standard
hosted Serverless API, but allow using the Free API with --freemium.
2024-09-10 11:40:35 -07:00
cragwolfe
71208ca2ee
doc: emphasize deprecation of ingest (#3610)
Given that unstructured-ingest is now maintained in [its own
repo](https://github.com/Unstructured-IO/unstructured-ingest), update
documentation references in this repo to point there.

Note that the forked, deprecated unstructured.ingest [in this
repo](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest)
will be removed in the near future, once CI is updated properly.
0.15.10
2024-09-09 16:03:44 -07:00
Matt Robinson
dc1128c21c
build(release): version 0.15.10 (#3609)
### Summary

Release for version `0.15.10`.
2024-09-09 21:42:20 +00:00
Matt Robinson
cf32672bc5
build(deps): bumps for 2024-09-09 (#3608)
### Summary

Dependency bumps for 2024-09-09.
2024-09-09 16:45:18 +00:00
cragwolfe
3bb0ee1e79
chore: fix tests breaking on main (#3603)
Fix API tests (really more like integration tests) that run only on
main. Also use less compute intensive files to speedup test time and
remove a useless test.

Tests in `test_unstructured/partition/test_api.py` pass, temporarily
running outside of main per per screenshot:

![image](https://github.com/user-attachments/assets/f15d440a-2574-40f2-98b4-adf57fbae704)


https://github.com/Unstructured-IO/unstructured/actions/runs/10754098974/job/29824415513
2024-09-08 21:25:52 +00:00
Matt Robinson
c060467018
build(deps): bump cryptography version (#3599)
### Summary

Bumps to the latest version of the `cryptography` library to address
`GHSA-h4gh-qq45-vh27`.
2024-09-05 19:06:43 +00:00
Pawel Kmiecik
f25eb60585
fix: expose drawing options as function params rather than env config (#3598)
This PR:
- changes the interface of analysis tools to expose drawing params as
function parameters rather than env_config (=environmental variables)
- restructures analysis package
2024-09-05 15:51:43 +00:00
Christine Straub
acd070c5d5
feat: enhance pdfminer element cleanup (#3593)
This PR aims to expand removal of `pdfminer` elements to include those
inside all `non-pdfminer` elements, not just `tables`.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-09-04 12:02:50 +00:00
Yao You
d51fb134e6
Feat/improve iou speed (#3582)
This PR vectorizes the computation of element overlap to speed up
deduplication process of extracted elements.

## test

This PR adds unit tests for the new vectorized IOU and subregion
computation functions.

In addition, running partition on large files with many elements like
this slide:

[002489.pdf](https://github.com/user-attachments/files/16823176/002489.pdf)

shows a reduction of runtime from around 15min on the main branch to
less than 4min with this branch.

Profiling results show that the new implementation greatly reduces the
time cost of computation and now most of the time is spent on getting
the coordinates from a list of bboxes.

![Screenshot 2024-08-30 at 9 29
27 PM](https://github.com/user-attachments/assets/6c186838-54c7-483b-ac3e-7342c23ff3a6)
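
For readers unfamiliar with the quantity being computed, here is a plain-Python definition of intersection over union for axis-aligned boxes. It is a scalar reference sketch only; the PR replaces per-pair work like this with array operations. The `(x1, y1, x2, y2)` box layout and the function names are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def box_area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes, or 0.0 if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union; defined as 0.0 when both boxes are empty."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def pairwise_iou(boxes_a: Sequence[Box], boxes_b: Sequence[Box]) -> List[List[float]]:
    """Naive O(n * m) loop that a vectorized implementation avoids."""
    return [[iou(a, b) for b in boxes_b] for a in boxes_a]
```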
2024-09-03 00:06:18 +00:00
Pawel Kmiecik
404f780bbb
feat: make analysis drawing more flexible (#3574)
This PR changes the way the analysis tools can be used:
- by default if `analysis` is set to `True` in `partition_pdf` and the
strategy is resolved to `hi_res`:
- for each file 4 layout dumps are produced and saved as JSON files
(`object_detection`, `extracted`, `ocr`, `final`) - similar way to the
current `object_detection` dump
- the drawing functions/classes now accept these dumps accordingly
instead of the internal classes instances (like `TextRegion`,
`DocumentLayout`)
- it makes it possible to use the lightweight JSON files to render the
bboxes of a given file after the partition is done
- `_partition_pdf_or_image_local` has been refactored and most of the
analysis code is now encapsulated in `save_analysis_artifiacts` function
- to do this, helper function `render_bboxes_for_file` is added
<img width="338" alt="Screenshot 2024-08-28 at 14 37 56"
src="https://github.com/user-attachments/assets/10b6fbbd-7824-448d-8c11-52fc1b1b0dd0">
2024-09-02 11:06:11 +00:00
Matt Robinson
04322d1632
build(deps): removed unnecessary jupyter deps (#3583)
### Summary

Removes unnecessary `jupyter` and `ipython` dev dependencies to reduce
CVE surface area.
2024-08-31 05:21:40 +00:00
Matt Robinson
6ba8135bf9
fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage
content of the file to more reliably differentiate between DOC, PPT, XLS
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed an incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.

### Testing

Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
0.15.9
2024-08-30 15:12:46 -04:00
John
ddb6cb631d
chore: remove minimum version pins for pins older than 6 mo (#3577)
Remove a number of pins in `requirements/deps/constraints.txt` and `make
pip-compile`
2024-08-29 15:35:14 +00:00
Austin Walker
f440eb476c
feat: Support encoding parameter in partition_csv (#3564)
See added test file. Added support for the encoding parameter, which can
be passed directly to `pd.read_csv`.
2024-08-28 14:19:58 +00:00
John
f21c853ade
bug: fix file_conversion disk leak (#3562)
Fix disk space leaks and Windows errors when accessing file.name on a
NamedTemporaryFile

Uses of `NamedTemporaryFile(..., delete=False)` and/or uses of
`file.name` of NamedTemporaryFile have been replaced with
`TemporaryDirectory` to avoid a known issue:
-
https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
- https://github.com/Unstructured-IO/unstructured/issues/3390

The first 7 commits each address an individual occurrence of the issue
if reviewers want to review commit-by-commit.
2024-08-27 22:02:24 +00:00
Matt Robinson
4194a07d12
build(deps): replace pillow-heif with pi-heif (#3571)
### Summary

Closes #2664 and replaces `pillow-heif` with `pi-heif` due to more
permissive licensing on the binary wheel for `pi-heif`.
0.15.8
2024-08-27 11:54:35 -04:00
David Potter
ddba928344
Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
2024-08-27 14:52:13 +00:00
Christine Straub
affd997c39
refactor(ci): remove unused environment variables (#3568)
This PR removes the unused env `TABLE_OCR` from CI.
2024-08-26 19:19:58 +00:00
Matt Robinson
09d84bc46b
build(deps): version bumps for 2024-08-26 (#3567)
### Summary

Version bumps for 2024-08-26.
2024-08-26 15:15:25 -04:00
Christine Straub
ac10ba4fc1
build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561)
### Summary
- Bump `unstructured.paddleocr` to 2.8.1.0
- Remove `opencv-python` and `opencv-contrib-python` constraint pins
- Fix `0.15.7` changelog
2024-08-23 14:17:29 -07:00
Steve Canny
32bb77aafb
fix(file): no default OLE subtype (#3516)
**Summary**
Do not assume MSG format when an OLE "container" file cannot be
differentiated into DOC, PPT, XLS, or MSG. Fall back to extension-based
identification in that case.

**Additional Context**
DOC, MSG, PPT, and XLS are all OLE files. An OLE file is, very roughly,
a Microsoft-proprietary Zip format which "contains" a filesystem of
discrete files and directories.

An OLE "container" is easily identified by inspecting the first 8 bytes
of the file, so all we need to do is differentiate between the four
subtypes we can process. The `filetype` module does a good job of this
but it is not perfect and does not identify MSG files.

Previously we assumed MSG format when none of DOC, PPT, or XLS was
detected, but we discovered that `filetype` is not completely reliable
at detecting these types.

Change the behavior to remove the assumption of MSG format.
`_OleFileDifferentiator` returns `None` in this case and filetype
detection falls back to use filename-extension.

Note a file with no filename and no metadata_filename or an incorrect
extension will not be correctly identified in this case, however we're
assuming for now that will be rare in practice.
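
The two steps described above, recognizing the OLE container from its first 8 bytes and falling back to the filename extension, can be sketched like this. The 8-byte signature is the standard OLE compound-file header; the function names and the string return values are hypothetical and do not mirror `_OleFileDifferentiator`.

```python
import io
import os
from typing import IO, Optional

# Standard signature at the start of every OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

_OLE_EXTENSIONS = {".doc": "DOC", ".msg": "MSG", ".ppt": "PPT", ".xls": "XLS"}


def is_ole_container(file: IO[bytes]) -> bool:
    """Check the first 8 bytes without disturbing the current file position."""
    position = file.tell()
    try:
        file.seek(0)
        return file.read(8) == OLE_MAGIC
    finally:
        file.seek(position)


def ole_subtype_from_extension(filename: Optional[str]) -> Optional[str]:
    """Fallback used when content inspection cannot pick a subtype.

    Returns None when there is no filename or the extension is not one of
    the four OLE subtypes, rather than assuming MSG.
    """
    if not filename:
        return None
    extension = os.path.splitext(filename)[1].lower()
    return _OLE_EXTENSIONS.get(extension)
```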
2024-08-22 19:16:53 +00:00
John
b4a6aa5559
chore: remove fsspec pin (#3554)
remove fsspec pin
2024-08-21 21:57:42 +00:00
Steve Canny
03e0ed3519
rfctr(docx): DOCX emits std minified .text_as_html (#3545)
**Summary**
Eliminate historical "idiosyncrasies" of `table.metadata.text_as_html`
HTML introduced by `partition_docx()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- nested tables appear as their extracted text in the parent cell (no
nested `<table>` elements in `.text_as_html`).
- DOCX `.text_as_html` is minified (no extra whitespace or thead, tbody,
tfoot elements).
2024-08-21 18:54:21 +00:00
John
f135344738
chore: remove scipy and packaging pins (#3550)
Remove scipy and packaging constraint pins
2024-08-21 16:05:19 +00:00
John
604cadfb7e
chore: remove ipython pin (#3548)
this pr is stacked on
https://github.com/Unstructured-IO/unstructured/pull/3538 and
https://github.com/Unstructured-IO/unstructured/pull/3547

This pr removes dependency pins for IPython, anyio, and pyparsing. It
also updates the label-studio-sdk import statement so we don't have to
have that pinned and makes some minor type hinting edits. Label Studio
had a breaking change in their 1.13.0
[release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0)
2024-08-21 00:06:31 +00:00
Christine Straub
01dbc7b473
fix: nltk data download path to prevent redundant nested directories (#3546)
Closes #3543.

### Summary
This PR addresses an issue with the NLTK data download process.
Previously, when downloading NLTK data, a nested "nltk_data" directory
was created within the parent "nltk_data" directory if the parent
directory already existed. This redundant directory structure led to two
significant problems:
- errors in checking if data had already been downloaded, potentially
causing redundant downloads in subsequent calls.
- failures in loading models from the downloaded NLTK data due to
incorrect path resolution.

This fix modifies the NLTK data download logic to prevent creation of
unnecessary nested directories. If the download path ends with
"nltk_data" and that directory already exists, we now use the existing
directory instead of creating a new nested one.
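
A minimal sketch of the path rule follows. `resolve_nltk_data_dir` is a hypothetical helper, and the fallback branch (appending `nltk_data` when the rule does not apply) is an assumption made for illustration; only the "ends with `nltk_data` and already exists" condition comes from the description above.

```python
import os


def resolve_nltk_data_dir(download_path: str) -> str:
    """Return the directory that should end up holding the NLTK data."""
    normalized = os.path.normpath(download_path)
    if os.path.basename(normalized) == "nltk_data" and os.path.isdir(normalized):
        # Reuse the existing directory instead of nesting nltk_data/nltk_data.
        return normalized
    # Assumed fallback: the data lives in an nltk_data folder under the path.
    return os.path.join(normalized, "nltk_data")
```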

### Testing
CI should pass.
0.15.7
2024-08-20 18:56:59 +00:00
Matt Robinson
1f8030dd0e
fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary

Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.

### Testing

After the bump, CI should pass. Additionally, we verified locally that
file processing takes around the amount of time we would expect for a
long `.docx` file.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
0.15.6
2024-08-19 20:59:36 +00:00
Steve Canny
a861ed8fe7
feat(chunk): split tables on even row boundaries (#3504)
**Summary**
Use a more sophisticated algorithm for splitting oversized `Table`
elements into `TableChunk` elements during chunking to ensure element
text and HTML are "synchronized" and HTML is always parseable.

**Additional Context**
Table splitting now has the following characteristics:
- `TableChunk.metadata.text_as_html` is always a parseable HTML
`<table>` subtree.
- `TableChunk.text` is always the text in the HTML version of the table
fragment in `.metadata.text_as_html`. Text and HTML are "synchronized".
- The table is divided at a whole-row boundary whenever possible.
- A row is broken at an even-cell boundary when a single row is larger
than the chunking window.
- A cell is broken at an even-word boundary when a single cell is larger
than the chunking window.
- `.text_as_html` is "minified", removing all extraneous whitespace and
unneeded elements or attributes. This maximizes the semantic "density"
of each chunk.
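
The whole-row case can be sketched as a greedy grouping in which text and HTML are rendered from the same rows, so the two stay synchronized by construction. This is a simplified illustration, not the chunker's algorithm: it does not split oversized rows or cells, and `split_rows`, `rows_to_text`, and `rows_to_html` are hypothetical names.

```python
import html
from typing import List

Row = List[str]


def rows_to_text(rows: List[Row]) -> str:
    return " ".join(" ".join(row) for row in rows)


def rows_to_html(rows: List[Row]) -> str:
    """Minified table HTML: no whitespace, no thead/tbody/tfoot wrappers."""
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_rows(rows: List[Row], max_chars: int) -> List[List[Row]]:
    """Group whole rows so each group's text fits within max_chars.

    A single row longer than max_chars is emitted alone; splitting inside a
    row or cell is outside the scope of this sketch.
    """
    chunks: List[List[Row]] = []
    current: List[Row] = []
    current_len = 0
    for row in rows:
        row_len = len(" ".join(row))
        # One extra character for the space that joins consecutive rows.
        added = row_len + 1 if current else row_len
        if current and current_len + added > max_chars:
            chunks.append(current)
            current, current_len, added = [], 0, row_len
        current.append(row)
        current_len += added
    if current:
        chunks.append(current)
    return chunks


table_rows = [["a", "b"], ["c", "d"], ["e", "f"]]
chunks = split_rows(table_rows, max_chars=7)
```

Each chunk can then be rendered with `rows_to_text(chunk)` and `rows_to_html(chunk)`, which always describe exactly the same rows.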
2024-08-19 18:56:53 +00:00
Christine Straub
99f72d65ba
ci: fix ingest test fixtures update (#3532)
2024-08-16 16:37:33 -07:00
Christine Straub
fc26426310
feat: replace pytesseract with unstructured.pytesseract fork (#3528)
This PR reverts `pytesseract` dependency to `unstructured.pytesseract`
fork due to the unavailability of some recent release versions of
`pytesseract` on PyPI.

This PR also addresses an issue encountered during the publication of
`unstructured==0.15.4` to PyPI. The error was due to the fact that PyPI
does not allow direct dependencies from Version Control System URLs like
GitHub in the `install_requires` or `extras_require` sections of the
`setup.py` file.
0.15.5
2024-08-16 10:34:22 -04:00
Matt Robinson
e64e09507a
build: update to latest base image (#3524)
### Summary

Updates to the latest `wolfi-base` base image to pull in more recent
package versions. A notable update is that upgrading to
`libreoffice==24.2.5.2` resolves several CVEs.

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-08-15 22:27:41 -07:00
Christine Straub
d0211cc41f
build: downgrade nltk version (#3527)
This PR rolls `nltk` back to `3.8.1`. It was bumped to `3.8.2` in
https://github.com/Unstructured-IO/unstructured/pull/3512, but
`3.8.2` is no longer available on PyPI due to some
issues (https://github.com/nltk/nltk/issues/3301).
2024-08-15 16:35:21 -07:00
Christine Straub
9b778e270d
fix: pytesseract>=0.3.12 installation error while installing pdf extra (#3522)
Closes #3521.

This PR resolves an installation error with `pytesseract>=0.3.12` that
occurred during `pip install unstructured[pdf]==0.15.3`.

### Testing
**Run following command in main branch and this PR**
```
pip uninstall -y pytesseract && pip install ".[pdf]"
```
**Results**
- `main` branch
```
INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10)
ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf"
```
- this `PR`

`pytesseract-0.3.13` should be installed successfully.
0.15.4
2024-08-14 16:15:40 -05:00
Christine Straub
d6a84bdfbb
build(deps): update extra-paddleocr requirements (#3515)
This PR removes custom index URL for `paddlepaddle` installation in
`extra-paddleocr.in`, resolving `setup.py` configuration error. Now uses
`paddlepaddle==3.0.0b1` directly from PyPI, simplifying installation
process.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
0.15.3
2024-08-14 12:19:20 -05:00