37 Commits

Author SHA1 Message Date
Emily Voss
b6ab471f00
Drop Python 3.9 support due to dependency conflicts (#4017) 2025-06-10 23:32:11 -07:00
David Potter
fd9d796797
fix cve (#3989)
fix critical cve for h11. supposedly 0.16.0 fixes it.

---------

Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2025-04-29 00:58:05 +00:00
Yao You
2dceac34b5
Feat/remove reference of PageLayout.elements (#3943)
This PR removes usage of `PageLayout.elements` from partition function,
except for when `analysis=True`. This PR updates the partition logic so
that `PageLayout.elements_array` is used everywhere to save memory and
cpu cost.
Since the analysis function is intended for investigation and not for
general document processing purposes, this part of the code is left for
a future refactor.

`PageLayout.elements` uses a list to store layout elements' data while
`elements_array` uses `numpy` array to store the data, which has much
lower memory requirements. Using `memory_profiler` to test the
differences is usually around 10x.
2025-03-12 15:21:21 +00:00
Pluto
74b0647aa2
Fix json bytes content type detection (#3941)
Fixes order of content type detection strategies for byte-encoded jsons.

Before
```
json_bytes = json.dumps([{"example": "data"}]).encode("utf-8")
file_buffer = io.BytesIO(json_bytes)
detect_filetype(file=file_buffer, metadata_file_path="filename.pdf") 
```

Before
PDF

Now
JSON
2025-03-07 10:33:33 +00:00
luke-kucing
147add9a04
Luke/CVE bump (#3928)
bumping dependancies and updated the tokenizer constraint
2025-02-19 17:23:31 +00:00
Tracy Shen
afecf1b742
update unstructured-inference lib (#3880)
update unstructured-inference to 0.8.6 in requirements in
extra-pdf-image.in

0.8.6 has pdfminer=20240706 (newer version)
2025-01-22 23:11:28 +00:00
Yao You
3dea723656
chore: pin upper limit for unstructured-client (#3743)
- 0.26.0 breaks tests and reason is unknown
- 0.26.0 introduces async request but the sync version interface still
exists
2024-10-21 01:06:21 +00:00
Austin Walker
6428d19e5a
fix: update python SDK syntax for forward compatibility (#3656)
Wrap the `shared.PartitionParameters` usage with
`operations.PartitionRequest`. This syntax has been deprecated since
v0.23.0 of the SDK, and will be unsupported in v0.26.0.
2024-09-24 16:37:38 +00:00
John
46e04b165a
build(deps): bump protobuf pin (#3625)
Bumps max version of `protobuf<5.0` and sets min version of
`chromadb>0.4.14` in `requirements/ingest/chroma.in`. Also fixes some
type hints in `unstructured/ingest/v2/processes/connectors/chroma.py`
2024-09-16 19:39:47 +00:00
John
159b8a9082
remove more dependency pins (#3621)
Remove `langchain-community>=0.2.5` and `wrapt>=1.14.0` pins and add
`importlib-metadata>=8.5.0` pin
2024-09-13 01:55:14 +00:00
John
ab94c6c5d1
chore: remove pins (#3579)
- Remove constraint pins for `Office365-REST-Python-Client`,
`weaviate-client`, and `platformdirs`. Removing the pin for `Office365`
brought to light some bugs in the Onedrive connector, so some changes
were also made to
`unstructured/ingest/v2/processes/connectors/onedrive.py`.
- Also, as part of updating dependencies `unstructured-client` was
updated to `0.25.8`, which introduced a new default for the `strategy`
param and required updating a test fixture.
- The `hubspot.sh` integration test was failing and is now ignored in CI
with this PR per discussion with @rbiseck3.

May be easiest to review commit-by-commit.
2024-09-12 13:48:59 +00:00
John
ddb6cb631d
chore: remove minimum version pins for pins older than 6 mo (#3577)
Remove a number of pins in `requirements/deps/constraints.txt` and `make
pip-compile`
2024-08-29 15:35:14 +00:00
Christine Straub
ac10ba4fc1
build(deps): bump unstructured.paddleocr to 2.8.1.0 (#3561)
### Summary
- Bump `unstructured.paddleocr` to 2.8.1.0
- Remove `opencv-python` and `opencv-contrib-python` constraint pins
- Fix `0.15.7` changelog
2024-08-23 14:17:29 -07:00
John
b4a6aa5559
chore: remove fsspec pin (#3554)
remove fsspec pin
2024-08-21 21:57:42 +00:00
John
f135344738
chore: remove scipy and packaging pins (#3550)
Remove scipy and packaging constraint pins
2024-08-21 16:05:19 +00:00
John
604cadfb7e
chore: remove ipython pin (#3548)
this pr is stacked on
https://github.com/Unstructured-IO/unstructured/pull/3538 and
https://github.com/Unstructured-IO/unstructured/pull/3547

This pr removes dependency pins for IPython, anyio, and pyparsing. It
also updates the label-studio-sdk import statement so we don't have to
have that pinned and make some minor type hinting edits. Label Studio
had a breaking change in their 1.13.0
[release](https://github.com/HumanSignal/label-studio/releases/tag/1.13.0)
2024-08-21 00:06:31 +00:00
Matt Robinson
1f8030dd0e
fix(CVE-2024-39705): bump to nltk 3.9.1; correct model download issues (#3541)
### Summary

Bumps to `nltk==3.9.1` and resolves
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705). An
NLTK version bump was originally introduced in #3512 and rolled back in
#3527 because `nltk==3.8.2` was yanked from PyPI, and also because we
observed significant slowdowns in processing time after bumping to
`nltk==3.8.2`. The processing time regression does not appear in
`nltk==3.9.1`.

### Testing

After the bump, CI should pass. Additionally we verified locally that
files processing takes around the amount of time we would expect for a
long `.docx` file.

```python
In [1]: from unstructured.partition.auto import partition

In [2]: filename = "test-doc.docx"

In [3]: %timeit partition(filename=filename)
3.92 s ± 73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
2024-08-19 20:59:36 +00:00
Christine Straub
9b778e270d
fix: pytesseract>=0.3.12 installation error while installing pdf extra (#3522)
Closes #3521.

This PR resolves an installation error with `pytesseract>=0.3.12` that
occurred during `pip install unstructured[pdf]==0.15.3`.

### Testing
**Run following command in main branch and this PR**
```
pip uninstall -y pytesseract && pip install ".[pdf]"
```
**Results**
- `main` branch
```
INFO: pip is looking at multiple versions of unstructured[pdf] to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement pytesseract>=0.3.12; extra == "pdf" (from unstructured[pdf]) (from versions: 0.1, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.2, 0.2.4, 0.2.5, 0.2.6, 0.2.7, 0.2.8, 0.2.9, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.3.9, 0.3.10)
ERROR: No matching distribution found for pytesseract>=0.3.12; extra == "pdf"
```
- this `PR`

`pytesseract-0.3.13` should be installed successfully.
2024-08-14 16:15:40 -05:00
John
a2ae2ed646
chore: remove matplotlib constraint (#3505) 2024-08-09 19:31:19 +00:00
Jake Zerrer
051be5aead
Remove unstructured.pytesseract fork (#3454)
A second attempt at
https://github.com/Unstructured-IO/unstructured/pull/3360, this PR
removes unstructured's dependency on its own fork of `pytesseract`. (The
original reason for the fork, the addition of
`run_and_get_multiple_output`, was removed
[here](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
2024-08-09 04:28:48 +00:00
John
43ae0befa7
chore: bump botocore pin (#3493)
bump botocore pin to match aiobotocore/s pin:

eae97439b3
2024-08-07 21:41:53 +00:00
John
696155e614
chore: update importlib-metadata pin (#3491) 2024-08-07 18:17:53 +00:00
John
6545f16e57
chore: remove cryptography pin and update test (#3482)
remove cryptography pin, pin tenacity, and update
test_unstructured_ingest/unit/connector/test_salesforce_connector.py
2024-08-07 15:25:23 +00:00
David Potter
441b3393b1
bugfix [OSS-67]: update import of pinecone exception (#3432)
the pinecone python package moved their importing of
PineconeApiException

Chroma `sleep` added because even thought there is a `wait`, there is
still some sort of timing issue.
2024-07-23 19:48:55 +00:00
Matt Robinson
b2f0620f2c
build(deps): version bumps for 2024-07-22 (#3427)
### Summary

Weekly dependency bumps.
2024-07-22 20:49:40 +00:00
John
0046f58a4f
revert unstructured-client pin and make pip-compile (#3298)
Change unstructured-client pin to setting minimum version instead of max
version and `make pip-compile`.

Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
2024-07-02 16:42:03 +00:00
Matt Robinson
6939bff49e
build(deps): bump langchain-community version (#3305)
### Summary

Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-06-26 22:42:32 +00:00
qued
0665e94b96
build: move numpy pin to packaging (#3296)
Moved numpy pin to `base.in` where it will be picked up by packaging.

Side note:
`constraints.txt` (formerly `constraints.in`) is a really useful
pattern: you put a constraint there, add that file as a `-c` requirement
in other files, and the constraint will be applied when pip-compiling
*only when needed* because the library is required by something else.
Neat! However, unfortunately, in my searches I've never found a similar
pattern for packaging, so any pins we want to propagate to user installs
need to be explicitly placed in the `.in` files.

So what is `constraints.txt` really doing for us? Well in the past I
think there have been instances where something is temporarily broken in
an upstream dependency but we expect it to be patched soon, but in the
meantime we want things to work in our CI builds and development
installs, so it's not worth pinning everywhere it's used. Having said
that, I'm coming to the conclusion that `constraints.txt` causes more
harm than good in the confusion it causes WRT packaging -- maybe we
should remove that pattern at some point.
2024-06-25 21:08:25 +00:00
Christine Straub
ab88e20575
chore: bump unstructured-inference 0.7.36 (#3275)
### Summary
- bump unstructured-inference to `0.7.35` which fixed `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2024-06-24 13:07:22 +00:00
Matt Robinson
2815226b54
build(deps): version bumps for 2024-06-17 (#3220)
### Summary

Version bumps for the week of 2024-06-17. There is a now a pin on
`numpy` due to a breaking change in the latest version that we'll need
to investigate and remove in a subsequent PR.
2024-06-17 14:04:29 +00:00
qued
293901e144
build: pin python-docx (#3110)
Since we incorporate a newer feature from `python-docx`
[here](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L521),
we should make the version of `python-docx` that first supports that
method an explicit requirement.

I didn't pip recompile since our generated dependencies already have
`python-docx==1.1.2`, but I can do that if someone thinks it's
necessary.
2024-05-30 15:08:10 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Matt Robinson
d7608014c0
improve: add Python 3.12 support (#3033) (#3047)
### Summary

Closes #2959. Updates the dependency and CI to add support for Python
3.12.

The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.

Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
2024-05-19 23:03:15 +00:00
Steve Canny
eff84afe24
chore: update python-docx version dependency (#2952)
**Summary**
`unstructured` will use table features added in the most recent version
of `python-docx`.

Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon
(https://github.com/Unstructured-IO/unstructured/issues/1707).

`python-docx` requires `lxml` although other file formats require it as
well.
2024-05-01 21:36:31 +00:00
Steve Canny
305247b4e1
chore: bump unstructured-inference pin (#2913)
**Summary**
Update dependencies to use the new version of `unstructured-inference`
released yesterday. Remedy a few small problems with `make pip-compile`
that stood in the way.
2024-04-21 03:08:20 +00:00
Roman Isecke
4185a1a15a
feat: Remove constraint on unstructured client from .in file (#2862)
### Description
Don't limit the version of the unstructured client for all users of the
repo
2024-04-08 16:50:56 +00:00
Roman Isecke
d6f2841ff4
feat: update dependencies and remove constraint on pydantic (#2841)
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.

A couple optimizations: 
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file
2024-04-04 19:58:23 +00:00