### Summary
Adds a CI check to ensure that packages added as dependencies are
appropriately licensed. All of the `.txt` files in the `requirements`
directory are checked with the exception of:
- `constraints.txt`, since those are not installed and are instead
conditions on the other dependency files
- `dev.txt`, since those are for local development and not shipped as
part of the `unstructured` package
- `extra-pdf-image.txt` - the `extra-pdf-image.in` since checking
`extra-pdf-image.txt` pulls in NVIDIA GPU related packages with an
`Other/Proprietary` license type, and there's not a good way to exclude
those without adding `Other/Proprietary` to the allowed licenses list.
### Testing
The new `check-licenses` job should pass in CI.
Change unstructured-client pin to setting minimum version instead of max
version and `make pip-compile`.
Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
### Summary
Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs
- `libreoffice` is now included in the `arm64`. `.doc` files are now
supported for `arm64`. `.ppt` do not work with the `libreoffice` package
currently available on `wolfi-os`. We have follow on work to look into
that.
- Updates the location of the `tesseract` `tessdata` files on the
`arm64` build. Closes#3290.
- Closes#3319 and addes `psutil` to the base dependencies.
### Testing
- `test_dockerfile` should continue to pass with the updates.
### Summary
Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Moved numpy pin to `base.in` where it will be picked up by packaging.
Side note:
`constraints.txt` (formerly `constraints.in`) is a really useful
pattern: you put a constraint there, add that file as a `-c` requirement
in other files, and the constraint will be applied when pip-compiling
*only when needed* because the library is required by something else.
Neat! However, unfortunately, in my searches I've never found a similar
pattern for packaging, so any pins we want to propagate to user installs
need to be explicitly placed in the `.in` files.
So what is `constraints.txt` really doing for us? Well in the past I
think there have been instances where something is temporarily broken in
an upstream dependency but we expect it to be patched soon, but in the
meantime we want things to work in our CI builds and development
installs, so it's not worth pinning everywhere it's used. Having said
that, I'm coming to the conclusion that `constraints.txt` causes more
harm than good in the confusion it causes WRT packaging -- maybe we
should remove that pattern at some point.
## Summary
This PR addresses an issue where the code could attempt to run `soffice`
in multiple processes and closes#3284
The fix is to add a wait mechanism when there is another `soffice`
process running in already.
## Diagnosis of issue
- `soffice` can only have one process running when using the command
`soffice` as is.
- on main branch the function `partition.common.convert_office_doc`
simply spawns a subprocess to run `soffice` command to convert a `doc`
or `ppt` file into `docx` or `pptx` format.
- if there are multiple partition calls to process `doc` or `ppt` files
and they all want to spawn `soffice` subprocesses only one will succeed
while other processes will simply fail and return 1 from the subprocess
- in downstream this will lead to errors like `PackageNotFoundError:
Package not found at '/tmp/tmpac6lcu4w/document.docx'`
## solution
While there are
[ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/)
to circumvent the limit of `soffice` by setting a tmp file as user
installation env, these kind of solutions rely on the internals of
`soffice` and adds maintenance cost to track its changes.
This PR solves this problem by adding a wait mechanism:
- we first spawning a subprocess to run `soffice`
- if the `stdout` is empty and we still have wait time budget left the
function first checks if there is another `soffice` running
* If yes then the function waits for 0.01s before checking again;
* if no the functions spawns a subprocess to run `soffice` and return to
beginning of this step
* we need to return the the beginning to check if `stdout` is empty
because we could have another collision right after `soffice` becomes
available.
## test
This PR adds two unit tests.
Additionally this can be tested by running partition of `.doc` files
locally with multiprocessing.
### Summary
- bump unstructured-inference to `0.7.35` which fixed `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
### Summary
Version bumps for the week of 2024-06-17. There is a now a pin on
`numpy` due to a breaking change in the latest version that we'll need
to investigate and remove in a subsequent PR.
### Summary
Closes#3173. Removes the `overwrite_schema` kwarg from the Delta Table
connector and bumps the `deltalake` version. Per [this
PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake`
repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`.
Users can specify `schema_mode="merge"` to obtain the same behavior.
- `schema_mode="merge"` is equivalent to `overwrite_schema=False`
- `schema_mode="overwrite"` is equivalent to `overwrite_schema=True`
Also adds an `engine` parameter that you can use to set `"rust"` or
`"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and
`schema_mode` defaults to `None`, which is consistent with the behavior
in `deltalake` documented
[here](https://delta-io.github.io/delta-rs/api/delta_writer/).
### Testing
The Delta Table ingest tests should pass on this PR.
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.
------------
Original PR description:
Adding VoyageAI embeddings
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.
---------
Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
Summary:
- bump unstructured-inference to `0.7.33`
- cut a release for `0.14.2`
- add some dependencies that previously came through from the
layoutparser extras.
### Summary
Switches to installing `libreoffice` from the Wolfi repository and
upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a
medium vulnerability in the old `libreoffice` version. Security scanning
with `anchore/grype` was also added to the `test_dockerfile` job.
Requirements were bumped to resolve a vulnerability in the `requests`
library.
### Testing
`test_dockerfile` passes with the updates.
### Summary
Closes#2959. Updates the dependency and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
**Summary**
`unstructured` will use table features added in the most recent version
of `python-docx`.
Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon
(https://github.com/Unstructured-IO/unstructured/issues/1707).
`python-docx` requires `lxml` although other file formats require it as
well.
Cut a release.
Run pip-compile on mac to avoid `nvidia-*` requirements creeping into
`requirements/extra-pdf-image.txt`. This should fix arm64 image builds
that have been breaking on main.
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).
It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
**Summary**
Update dependencies to use the new version of `unstructured-inference`
released yesterday. Remedy a few small problems with `make pip-compile`
that stood in the way.
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.
A couple optimizations:
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file