unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-06-27 02:30:08 +00:00

Author	SHA1	Message	Date
Matt Robinson	ee2b247297	build: check dependency licenses in CI (#3349 ) ### Summary Adds a CI check to ensure that packages added as dependencies are appropriately licensed. All of the `.txt` files in the `requirements` directory are checked with the exception of: - `constraints.txt`, since those are not installed and are instead conditions on the other dependency files - `dev.txt`, since those are for local development and not shipped as part of the `unstructured` package - `extra-pdf-image.txt` - the `extra-pdf-image.in` since checking `extra-pdf-image.txt` pulls in NVIDIA GPU related packages with an `Other/Proprietary` license type, and there's not a good way to exclude those without adding `Other/Proprietary` to the allowed licenses list. ### Testing The new `check-licenses` job should pass in CI.	2024-07-11 22:36:01 +00:00
Christine Straub	512583ed91	build(deps): bump unstructured.paddleocr 2.8.0 (#3374 ) ### Summary Bump unstructured.paddleocr to `2.8.0` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-07-09 22:19:08 +00:00
Matt Robinson	3fc2342f6d	build(deps): version bumps for 2024-07-08 (#3359 ) ## Summary Version bumps for 2024-07-08.	2024-07-08 12:38:04 +00:00
John	0046f58a4f	revert unstructured-client pin and make pip-compile (#3298 ) Change unstructured-client pin to setting minimum version instead of max version and `make pip-compile`. Integration tests that were dependent on the old version of the client are removed. These tests should be replicated in/moved to the SDK repo(s).	2024-07-02 16:42:03 +00:00
Matt Robinson	db8617872b	build: image and dependency updates; fix tesseract files locations (#3310 ) ### Summary Updates to the latest version of the `wolfi-base` image. Changes include: - Version bumps to address CVEs - `libreoffice` is now included in the `arm64` build. `.doc` files are now supported for `arm64`. `.ppt` files do not work with the `libreoffice` package currently available on `wolfi-os`. We have follow-on work to look into that. - Updates the location of the `tesseract` `tessdata` files on the `arm64` build. Closes #3290. - Closes #3319 and adds `psutil` to the base dependencies. ### Testing - `test_dockerfile` should continue to pass with the updates.	2024-07-01 19:39:32 +00:00
Matt Robinson	6939bff49e	build(deps): bump langchain-community version (#3305 ) ### Summary Bumps to the latest `langchain-community` version to resolve [CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965). --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2024-06-26 22:42:32 +00:00
qued	0665e94b96	build: move numpy pin to packaging (#3296 ) Moved numpy pin to `base.in` where it will be picked up by packaging. Side note: `constraints.txt` (formerly `constraints.in`) is a really useful pattern: you put a constraint there, add that file as a `-c` requirement in other files, and the constraint will be applied when pip-compiling only when needed because the library is required by something else. Neat! However, unfortunately, in my searches I've never found a similar pattern for packaging, so any pins we want to propagate to user installs need to be explicitly placed in the `.in` files. So what is `constraints.txt` really doing for us? Well in the past I think there have been instances where something is temporarily broken in an upstream dependency but we expect it to be patched soon, but in the meantime we want things to work in our CI builds and development installs, so it's not worth pinning everywhere it's used. Having said that, I'm coming to the conclusion that `constraints.txt` causes more harm than good in the confusion it causes WRT packaging -- maybe we should remove that pattern at some point.	2024-06-25 21:08:25 +00:00
Yao You	c32aeaac44	fix: wait to run soffice until there is no other soffice process running (#3287 ) ## Summary This PR addresses an issue where the code could attempt to run `soffice` in multiple processes and closes #3284. The fix is to add a wait mechanism when there is another `soffice` process already running. ## Diagnosis of issue - `soffice` can only have one process running when using the command `soffice` as is. - on main branch the function `partition.common.convert_office_doc` simply spawns a subprocess to run the `soffice` command to convert a `doc` or `ppt` file into `docx` or `pptx` format. - if there are multiple partition calls to process `doc` or `ppt` files and they all want to spawn `soffice` subprocesses, only one will succeed while the other processes will simply fail and return 1 from the subprocess - downstream this will lead to errors like `PackageNotFoundError: Package not found at '/tmp/tmpac6lcu4w/document.docx'` ## solution While there are [ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/) to circumvent the limit of `soffice` by setting a tmp file as user installation env, these kinds of solutions rely on the internals of `soffice` and add maintenance cost to track its changes. This PR solves this problem by adding a wait mechanism: - we first spawn a subprocess to run `soffice` - if the `stdout` is empty and we still have wait time budget left, the function first checks if there is another `soffice` running * if yes, then the function waits for 0.01s before checking again; * if no, the function spawns a subprocess to run `soffice` and returns to the beginning of this step * we need to return to the beginning to check if `stdout` is empty because we could have another collision right after `soffice` becomes available. ## test This PR adds two unit tests. Additionally this can be tested by running partition of `.doc` files locally with multiprocessing.	2024-06-25 18:49:27 +00:00
Christine Straub	ab88e20575	chore: bump unstructured-inference 0.7.36 (#3275 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed `ValueError` when converting cells to HTML in the table processing subpipeline - cut a release for `0.14.8` --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2024-06-24 13:07:22 +00:00
Matt Robinson	2815226b54	build(deps): version bumps for 2024-06-17 (#3220 ) ### Summary Version bumps for the week of 2024-06-17. There is now a pin on `numpy` due to a breaking change in the latest version that we'll need to investigate and remove in a subsequent PR.	2024-06-17 14:04:29 +00:00
Matt Robinson	ad69bdcd4e	build(deps): deltalake bump to `0.18.x` (#3197 ) ### Summary Closes #3173. Removes the `overwrite_schema` kwarg from the Delta Table connector and bumps the `deltalake` version. Per [this PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake` repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`. Users can specify `schema_mode="merge"` to obtain the same behavior. - `schema_mode="merge"` is equivalent to `overwrite_schema=False` - `schema_mode="overwrite"` is equivalent to `overwrite_schema=True` Also adds an `engine` parameter that you can use to set `"rust"` or `"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and `schema_mode` defaults to `None`, which is consistent with the behavior in `deltalake` documented [here](https://delta-io.github.io/delta-rs/api/delta_writer/). ### Testing The Delta Table ingest tests should pass on this PR. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-06-13 15:59:34 +00:00
Matt Robinson	c822e3fd10	build(deps): weekly dependency bumps (6/10/2024) (#3170 ) ### Summary Weekly dependency bumps for the week of 6/10/2024. The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0` seemed to break the connector test, per [this test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005). Opened #3173 to address.	2024-06-10 16:20:22 +00:00
Matt Robinson	5203390a4a	build(deps): weekly pip version bump (#3147 ) ### Summary Weekly PR to bump dependency versions.	2024-06-04 20:47:04 +00:00
Matt Robinson	6b400b46fe	feat: add VoyageAI embeddings (#3069 ) (#3099 ) Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved. ------------ Original PR description: Adding VoyageAI embeddings Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy. --------- Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>	2024-05-24 21:48:35 +00:00
Christine Straub	18428f24ab	chore: bump unstructured-inference 0.7.33 (#3074 ) Summary: - bump unstructured-inference to `0.7.33` - cut a release for `0.14.2` - add some dependencies that previously came through from the layoutparser extras.	2024-05-22 22:35:00 +00:00
Matt Robinson	059fc64bd9	build: apk add libreoffice24 (#3065 ) ### Summary Switches to installing `libreoffice` from the Wolfi repository and upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a medium vulnerability in the old `libreoffice` version. Security scanning with `anchore/grype` was also added to the `test_dockerfile` job. Requirements were bumped to resolve a vulnerability in the `requests` library. ### Testing `test_dockerfile` passes with the updates.	2024-05-21 18:54:16 +00:00
Matt Robinson	d7608014c0	improve: add Python 3.12 support (#3033 ) (#3047 ) ### Summary Closes #2959. Updates the dependency and CI to add support for Python 3.12. The MongoDB ingest tests were disabled due to jobs like [this one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333) failing due to issues with the `bson` package. `bson` is a dependency for the AstraDB connector, but `pymongo` does not work when `bson` is installed from `pip`. This issue is documented by MongoDB [here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun off #3049 to resolve this. Issue seems unrelated to Python 3.12, though unsure why this didn't surface previously. Disables the `argilla` tests because `argilla` does not yet support Python 3.12. We can add the `argilla` tests back in once the PR referenced below is merged. You can still use the `stage_for_argilla` function if you're on `python<3.12` and you install `argilla` yourself. - https://github.com/argilla-io/argilla/pull/4837 --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>	2024-05-19 23:03:15 +00:00
Matt Robinson	f4b01a4aad	build(deps): bump versions for security hygiene (#3008 ) ### Summary Version bumps to keep on top of security scans.	2024-05-13 15:30:09 +00:00
Christine Straub	b64a48440d	chore: bump unstructured-inference 0.7.31 (#2981 )	2024-05-08 16:26:58 +00:00
Steve Canny	eff84afe24	chore: update python-docx version dependency (#2952 ) Summary `unstructured` will use table features added in the most recent version of `python-docx`. Also update the `lxml` version constraint because `lxml>4.9.2` will not install on Apple Silicon (https://github.com/Unstructured-IO/unstructured/issues/1707). `python-docx` requires `lxml` although other file formats require it as well.	2024-05-01 21:36:31 +00:00
Pluto	fa767d6706	chore: Bump unstructured inference 0.29 (#2932 ) Co-authored-by: cragwolfe <crag@unstructured.io>	2024-04-27 19:49:22 +00:00
cragwolfe	9e46ed016c	fix: reqs arm64 friendly again. release 0.13.4 (#2935 ) Cut a release. Run pip-compile on mac to avoid `nvidia-*` requirements creeping into `requirements/extra-pdf-image.txt`. This should fix arm64 image builds that have been breaking on main.	2024-04-26 08:15:13 +00:00
Dimitri Lozeve	abb0174181	Integration with the Google Cloud Vision API (#2902 ) This PR adds a third OCR provider, alongside Tesseract and Paddle: the [Google Cloud Vision API](https://cloud.google.com/vision). It can be used similarly to other OCR methods: set the `OCR_AGENT` environment variable to the path to the OCR module (`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`). You also need to set the credentials to use Google APIs, for instance by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-04-23 21:11:39 +00:00
Steve Canny	305247b4e1	chore: bump unstructured-inference pin (#2913 ) Summary Update dependencies to use the new version of `unstructured-inference` released yesterday. Remedy a few small problems with `make pip-compile` that stood in the way.	2024-04-21 03:08:20 +00:00
Roman Isecke	9ad2993fe3	bug: fix pip-compile (#2885 ) ### Description Currently wasn't compiling `base.in` first, which is required because others use the generated `.txt` file as a constraint.	2024-04-19 21:39:25 +00:00
Roman Isecke	4185a1a15a	feat: Remove constraint on unstructured client from .in file (#2862 ) ### Description Don't limit the version of the unstructured client for all users of the repo	2024-04-08 16:50:56 +00:00
Roman Isecke	d6f2841ff4	feat: update dependencies and remove constraint on pydantic (#2841 ) ### Description * The `consistent-deps.sh` was fixed to take into account the ingest dependencies, causing some errors to show up. New constraints were added to make that script pass. * Update all requirements without constraint on pydantic, allowing the latest version to be pulled in. * `pikepdf` is causing a conflict but there's a fix on their `main` branch; we just need the next release to be published. Opened up a question here to see if we can get that out any sooner: [Do releases happen on a schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now added `lxml<5` to the constraints. A couple optimizations: * `constraints.in` renamed to `constraints.txt` since the whole point is all dependencies are already pinned and the file never gets compiled * `constraints.txt` moved to a `requirements/deps` directory as this never gets compiled by `pip-compile` * Other dependency files updated to reference the new location of `base.in` and `constraints.txt` * make file updated since it was originally written to avoid the `base.in` and `constraints.in` file	2024-04-04 19:58:23 +00:00
Ahmet Melek	d46792214a	feat: add vertexai embeddings (#2693 ) This PR: - Adds VertexAI embeddings as an embedding provider Testing - Tested with pinecone destination connector on [this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693) job run. --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-03-28 21:15:36 +00:00

28 Commits
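
The wait mechanism described in commit c32aeaac44 (#3287) can be sketched roughly as below. This is a minimal illustration of the retry loop only, not the actual `convert_office_doc` code: the function name `run_with_wait`, its parameters, and the default wait budget are all hypothetical, and the way `soffice` is launched and the way a running `soffice` process is detected are left to caller-supplied callables.

```python
import time
from typing import Callable


def run_with_wait(
    run_command: Callable[[], str],
    other_instance_running: Callable[[], bool],
    wait_budget: float = 10.0,   # hypothetical default, in seconds
    poll_interval: float = 0.01,  # the 0.01s wait mentioned in the commit
) -> str:
    """Retry run_command until it yields output or the wait budget is spent.

    run_command: launches the conversion and returns its stdout (assumed
        empty when the launch collided with another instance).
    other_instance_running: reports whether another instance is active.
    """
    deadline = time.monotonic() + wait_budget
    output = run_command()  # first attempt
    while output == "" and time.monotonic() < deadline:
        if other_instance_running():
            # Another instance holds the slot: wait briefly, then re-check.
            time.sleep(poll_interval)
            continue
        # The slot looks free: try again, then loop back to inspect stdout,
        # since a new collision may happen right after it became free.
        output = run_command()
    return output
```

Passing the two callables in keeps the loop independent of any particular process-inspection library and makes it easy to exercise with fakes.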