unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-28 00:05:55 +00:00

Author	SHA1	Message	Date
Christine Straub	ab88e20575	chore: bump unstructured-inference 0.7.36 (#3275) ### Summary - bump unstructured-inference to `0.7.36`, which fixed a `ValueError` when converting cells to HTML in the table processing subpipeline - cut a release for `0.14.8` --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2024-06-24 13:07:22 +00:00
Matt Robinson	ad69bdcd4e	build(deps): deltalake bump to `0.18.x` (#3197 ) ### Summary Closes #3173. Removes the `overwrite_schema` kwarg from the Delta Table connector and bumps the `deltalake` version. Per [this PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake` repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`. Users can specify `schema_mode="merge"` to obtain the same behavior. - `schema_mode="merge"` is equivalent to `overwrite_schema=False` - `schema_mode="overwrite"` is equivalent to `overwrite_schema=True` Also adds an `engine` parameter that you can use to set `"rust"` or `"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and `schema_mode` defaults to `None`, which is consistent with the behavior in `deltalake` documented [here](https://delta-io.github.io/delta-rs/api/delta_writer/). ### Testing The Delta Table ingest tests should pass on this PR. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-06-13 15:59:34 +00:00
Matt Robinson	c822e3fd10	build(deps): weekly dependency bumps (6/10/2024) (#3170 ) ### Summary Weekly dependency bumps for the week of 6/10/2024. The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0` seemed to break the connector test, per [this test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005). Opened #3173 to address.	2024-06-10 16:20:22 +00:00
Matt Robinson	5203390a4a	build(deps): weekly pip version bump (#3147 ) ### Summary Weekly PR to bump dependency versions.	2024-06-04 20:47:04 +00:00
Christine Straub	18428f24ab	chore: bump unstructured-inference 0.7.33 (#3074 ) Summary: - bump unstructured-inference to `0.7.33` - cut a release for `0.14.2` - add some dependencies that previously came through from the layoutparser extras.	2024-05-22 22:35:00 +00:00
Matt Robinson	059fc64bd9	build: apk add libreoffice24 (#3065 ) ### Summary Switches to installing `libreoffice` from the Wolfi repository and upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a medium vulnerability in the old `libreoffice` version. Security scanning with `anchore/grype` was also added to the `test_dockerfile` job. Requirements were bumped to resolve a vulnerability in the `requests` library. ### Testing `test_dockerfile` passes with the updates.	2024-05-21 18:54:16 +00:00
Christine Straub	b64a48440d	chore: bump unstructured-inference 0.7.31 (#2981 )	2024-05-08 16:26:58 +00:00
Steve Canny	eff84afe24	chore: update python-docx version dependency (#2952 ) Summary `unstructured` will use table features added in the most recent version of `python-docx`. Also update the `lxml` version constraint because `lxml>4.9.2` will not install on Apple Silicon (https://github.com/Unstructured-IO/unstructured/issues/1707). `python-docx` requires `lxml` although other file formats require it as well.	2024-05-01 21:36:31 +00:00
Dimitri Lozeve	abb0174181	Integration with the Google Cloud Vision API (#2902 ) This PR adds a third OCR provider, alongside Tesseract and Paddle: the [Google Cloud Vision API](https://cloud.google.com/vision). It can be used similarly to other OCR methods: set the `OCR_AGENT` environment variable to the path to the OCR module (`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`). You also need to set the credentials to use Google APIs, for instance by setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-04-23 21:11:39 +00:00
Steve Canny	305247b4e1	chore: bump unstructured-inference pin (#2913 ) Summary Update dependencies to use the new version of `unstructured-inference` released yesterday. Remedy a few small problems with `make pip-compile` that stood in the way.	2024-04-21 03:08:20 +00:00
Roman Isecke	9ad2993fe3	bug: fix pip-compile (#2885) ### Description `base.in` was not being compiled first, which is required because the other requirement files use the generated `.txt` file as a constraint.	2024-04-19 21:39:25 +00:00
Roman Isecke	d6f2841ff4	feat: update dependencies and remove constraint on pydantic (#2841) ### Description * The `consistent-deps.sh` script was fixed to take the ingest dependencies into account, which caused some errors to show up. New constraints were added to make that script pass. * Update all requirements without a constraint on pydantic, allowing the latest version to be pulled in. * `pikepdf` is causing a conflict, but there's a fix on their `main` branch; it just needs the next release to be published. Opened up a question here to see if we can get that out any sooner: [Do releases happen on a schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now added `lxml<5` to the constraints. A couple of optimizations: * `constraints.in` renamed to `constraints.txt`, since the whole point is that all dependencies are already pinned and the file never gets compiled * `constraints.txt` moved to a `requirements/deps` directory, as this never gets compiled by `pip-compile` * Other dependency files updated to reference the new location of `base.in` and `constraints.txt` * Makefile updated, since it was originally written to avoid the `base.in` and `constraints.in` files	2024-04-04 19:58:23 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509) This PR: - Moves ingest dependencies into local scopes so that ingest connector classes can be imported without installing the external dependencies they import. This allows lightweight use of the classes (not the instances; to use the instances as intended you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to the `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
qued	399dd60311	build(deps): unpin pillow (#2472 ) Removed `pillow` pin and recompiled. I think it was originally there to address a conflict, which, as far as I can tell, no longer exists. Also a security vulnerability was discovered in the older version of `pillow`. #### Testing: CI should pass.	2024-01-30 21:29:08 +00:00
John	db67805ec6	feat: add support for partitioning .heic files (#2454 ) .heic files are an image filetype we have not supported. #### Testing ``` from unstructured.partition.image import partition_image png_filename = "example-docs/DA-1p.png" heic_filename = "example-docs/DA-1p.heic" png_elements = partition_image(png_filename, strategy="hi_res") heic_elements = partition_image(heic_filename, strategy="hi_res") for i in range(len(heic_elements)): print(heic_elements[i].text == png_elements[i].text) ``` --------- Co-authored-by: christinestraub <christinemstraub@gmail.com>	2024-01-30 04:49:00 +00:00
Matt Robinson	2d3a7f1c48	fix: fix table index error by bumping `unstructured-inference` (#2430 ) ### Summary Closes #2417. Bumps `unstructured-inference` to pull in the fix implemented in https://github.com/Unstructured-IO/unstructured-inference/pull/317	2024-01-19 22:42:32 +00:00
Roman Isecke	b37b4689bc	drop python3.8 (#2372 ) ### Description Remove all uses of python3.8 --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-01-09 23:37:30 +00:00
Christine Straub	ed76b11b1a	Refactor: support image extraction (#2201 ) ### Summary This PR is the second part of the "image extraction" refactor to move it from unstructured-inference repo to unstructured repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/299. This PR adds logic to support extracting images. ### Testing `git clone -b refactor/remove_image_extraction_code --single-branch https://github.com/Unstructured-IO/unstructured-inference.git && cd unstructured-inference && pip install -e . && cd ../` ``` elements = partition_pdf( filename="example-docs/embedded-images.pdf", strategy="hi_res", extract_images_in_pdf=True, ) print("\n\n".join([str(el) for el in elements])) ```	2023-12-05 18:22:29 +00:00
Christine Straub	69d0ee1aea	Refactor: support merging `extracted` layout with `inferred` layout (#2158) ### Summary This PR is the second part of `pdfminer` refactor to move it from `unstructured-inference` repo to `unstructured` repo, the first part is done in https://github.com/Unstructured-IO/unstructured-inference/pull/294. This PR adds logic to merge the extracted layout with the inferred layout. The updated workflow for the `hi_res` strategy: * pass the document (as data/filename) to the `inference` repo to get `inferred_layout` (DocumentLayout) * pass the `inferred_layout` returned from the `inference` repo and the document (as data/filename) to the `pdfminer_processing` module, which first opens the document (create temp file/dir as needed), and splits the document by pages * if is_image is `True`, return the passed inferred_layout (DocumentLayout) * if is_image is `False`: * get extracted_layout (TextRegions) from the passed document (data/filename) by pdfminer * merge `extracted_layout` (TextRegions) with the passed `inferred_layout` (DocumentLayout) * return the `inferred_layout` (DocumentLayout) with updated elements (all merged LayoutElements) as merged_layout (DocumentLayout) * pass merged_layout and the document (as data/filename) to the `OCR` module, which first opens the document (create temp file/dir as needed), and splits the document by pages (convert PDF pages to image pages for PDF file) ### Note This PR also fixes issue #2164 by using functionality similar to the one implemented in the `fast` strategy workflow when extracting elements by `pdfminer`. ### TODO * image extraction refactor to move it from `unstructured-inference` repo to `unstructured` repo * improving natural reading order by applying the current default `xycut` sorting to the elements extracted by `pdfminer`	2023-12-01 20:56:31 +00:00
John	e5bdf7fb43	chore: unstructured python client (#2195 ) ### Summary Closes #2033 Updates `partition_via_api` to use `UnstructuredClient` for api calls instead of `requests`. Updates associated tests. Note: This PR does not update `partition_multiple_via_api` as documentation in `unstructured-python-client` indicates it does not support multiple files. A new issue should be opened to add that functionality to `unstructured-python-client`. --------- Co-authored-by: Klaijan <klaijan@unstructured.io> Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-01 18:49:59 +00:00
Yuming Long	ccda93b0d1	chore: bump inference to `0.7.15` release unst `0.11.0` (#2110 ) ^^	2023-11-20 18:20:03 +00:00
Yuming Long	ad14321016	Chore: don't pass empty language code to tesseract CLI (#1996) Summary: Closes: https://github.com/Unstructured-IO/unstructured/issues/1920 * stop passing an empty string from `languages` to tesseract, which results in an empty string being passed to the `-l` language option of the tesseract CLI * also stop passing duplicate language codes from `languages` to tesseract OCR * if none of the ISO languages in the `languages` parameter can be converted, proceed with OCR using `eng` as the default ### Test * First confirm the tesseract error `Estimating resolution as X` before this change: * in the `unstructured-api` repo on the main branch, run `make run-web-app` * use curl to trigger the error with an empty string, or with any wrong input such as `-F 'languages="eng,de"'`: ``` curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \ -H 'accept: application/json' \ -H 'Content-Type: multipart/form-data' \ -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \ -F 'languages=""' \ -F 'strategy=hi_res' \ -F 'pdf_infer_table_structure=True' \ | jq -C . | less -R ``` * after this change: * in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .` * check out this branch * run `make run-web-app` again in the api repo * the curl command returns output and a warning appears in the log --------- Co-authored-by: qued <64741807+qued@users.noreply.github.com>	2023-11-06 19:30:12 -06:00
qued	808b4ced7a	build(deps): remove ebooklib (#1878 ) * Removed `ebooklib` as a dependency `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license. Thus it is being removed.	2023-10-26 12:22:40 -05:00
Roman Isecke	4802332de0	Roman/optimize ingest ci (#1799 ) ### Description Currently the CI caches the CI dependencies but uses the hash of all files in `requirements/`. This isn't completely accurate since the ingest dependencies are installed in a later step and don't affect the cached environment. As part of this PR: * ingest dependencies were isolated into their own folder in `requirements/ingest/` * A new cache setup was introduced in the CI to restore the base cache -> install ingest dependencies -> cache it with a new id * new make target created to install all ingest dependencies via `pip install -r ...` * updates to Dockerfile to use `find ...` to install all dependencies, avoiding the need to update this when new deps are added. * update to pip-compile script to run over all `*.in` files in `requirements/`	2023-10-24 14:54:00 +00:00
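
Commit ad69bdcd4e (#3197) above replaces the removed `overwrite_schema` kwarg with `schema_mode`. A minimal sketch of that mapping follows, assuming a hypothetical helper named `schema_mode_from_overwrite`. The helper is for illustration only and is not part of `unstructured` or `deltalake`; the only facts taken from the commit message are the two equivalences and the `None` default.

```python
from typing import Optional


def schema_mode_from_overwrite(overwrite_schema: Optional[bool]) -> Optional[str]:
    """Translate the legacy boolean flag into a schema_mode value.

    Hypothetical helper, not an API of unstructured or deltalake.
    Mapping as stated in the commit message:
      overwrite_schema=False -> schema_mode="merge"
      overwrite_schema=True  -> schema_mode="overwrite"
    An unset flag maps to None, the default the commit describes.
    """
    if overwrite_schema is None:
        return None
    return "overwrite" if overwrite_schema else "merge"


if __name__ == "__main__":
    for flag in (None, False, True):
        print(flag, "->", schema_mode_from_overwrite(flag))
```

A caller migrating old configuration could run the legacy flag through a function like this before passing `schema_mode` to the connector.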

24 Commits
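
Commit ad14321016 (#1996) above describes three rules for the `languages` parameter: drop empty strings, drop duplicates, and fall back to `eng` when nothing usable remains. The sketch below illustrates those rules only. The function name `clean_tesseract_languages` is hypothetical, the ISO-to-tesseract code conversion mentioned in the commit is deliberately left out, and joining codes with `+` follows the tesseract CLI convention for `-l`.

```python
from typing import Iterable, List


def clean_tesseract_languages(languages: Iterable[str], default: str = "eng") -> str:
    """Build a value for tesseract's -l option from user-supplied codes.

    Hypothetical helper, not the code from the commit. It assumes the
    codes are already tesseract language codes (no ISO conversion here).
    """
    kept: List[str] = []
    for code in languages:
        code = code.strip()
        # Skip empty strings and codes that were already seen.
        if code and code not in kept:
            kept.append(code)
    # Fall back to the default when no usable code remains.
    return "+".join(kept or [default])


if __name__ == "__main__":
    print(clean_tesseract_languages(["", "eng", "eng", "deu"]))
    print(clean_tesseract_languages([""]))
```

Keeping first-seen order rather than sorting preserves the priority order the caller supplied.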