25 Commits

Author SHA1 Message Date
Christine Straub
ab88e20575
chore: bump unstructured-inference 0.7.36 (#3275)
### Summary
- bump unstructured-inference to `0.7.36`, which fixes a `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2024-06-24 13:07:22 +00:00
Matt Robinson
c822e3fd10
build(deps): weekly dependency bumps (6/10/2024) (#3170)
### Summary

Weekly dependency bumps for the week of 6/10/2024.

The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0`
seemed to break the connector test, per [this
test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005).
Opened #3173 to address.
2024-06-10 16:20:22 +00:00
Matt Robinson
5203390a4a
build(deps): weekly pip version bump (#3147)
### Summary

Weekly PR to bump dependency versions.
2024-06-04 20:47:04 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged into a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Christine Straub
18428f24ab
chore: bump unstructured-inference 0.7.33 (#3074)
Summary:
- bump unstructured-inference to `0.7.33`
- cut a release for `0.14.2`
- add some dependencies that previously came through from the
layoutparser extras.
2024-05-22 22:35:00 +00:00
Matt Robinson
059fc64bd9
build: apk add libreoffice24 (#3065)
### Summary

Switches to installing `libreoffice` from the Wolfi repository and
upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a
medium vulnerability in the old `libreoffice` version. Security scanning
with `anchore/grype` was also added to the `test_dockerfile` job.
Requirements were bumped to resolve a vulnerability in the `requests`
library.

### Testing

`test_dockerfile` passes with the updates.
2024-05-21 18:54:16 +00:00
Matt Robinson
d7608014c0
improve: add Python 3.12 support (#3033) (#3047)
### Summary

Closes #2959. Updates the dependency and CI to add support for Python
3.12.

The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.

Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
referenced below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
2024-05-19 23:03:15 +00:00
Christine Straub
b64a48440d
chore: bump unstructured-inference 0.7.31 (#2981)
2024-05-08 16:26:58 +00:00
Steve Canny
eff84afe24
chore: update python-docx version dependency (#2952)
**Summary**
`unstructured` will use table features added in the most recent version
of `python-docx`.

Also update the `lxml` version constraint because `lxml>4.9.2` will not
install on Apple Silicon
(https://github.com/Unstructured-IO/unstructured/issues/1707).

`python-docx` requires `lxml` although other file formats require it as
well.
2024-05-01 21:36:31 +00:00
Dimitri Lozeve
abb0174181
Integration with the Google Cloud Vision API (#2902)
This PR adds a third OCR provider, alongside Tesseract and Paddle: the
[Google Cloud Vision API](https://cloud.google.com/vision).

It can be used similarly to other OCR methods: set the `OCR_AGENT`
environment variable to the path to the OCR module
(`unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision`).
You also need to set the credentials to use Google APIs, for instance by
setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
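
A minimal sketch of that configuration in Python, assuming the variables are set before any OCR call is made. The helper name `configure_google_vision_ocr` and the credentials path are hypothetical; only the two variable names and the module path come from the description above.

```python
import os

# Module path of the OCR agent, as given in the PR description.
GOOGLE_VISION_AGENT = (
    "unstructured.partition.utils.ocr_models.google_vision_ocr.OCRAgentGoogleVision"
)


def configure_google_vision_ocr(credentials_path: str) -> None:
    """Hypothetical helper: it only sets the two environment variables."""
    os.environ["OCR_AGENT"] = GOOGLE_VISION_AGENT
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path


# Placeholder path; replace it with a real service-account key file.
configure_google_vision_ocr("/path/to/service-account.json")
```

Setting the variables from the shell before starting Python works equally well; the helper only keeps the two values together in one place.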

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-04-23 21:11:39 +00:00
Steve Canny
305247b4e1
chore: bump unstructured-inference pin (#2913)
**Summary**
Update dependencies to use the new version of `unstructured-inference`
released yesterday. Remedy a few small problems with `make pip-compile`
that stood in the way.
2024-04-21 03:08:20 +00:00
Roman Isecke
9ad2993fe3
bug: fix pip-compile (#2885)
### Description
The script wasn't compiling `base.in` first, which is required because
the other files use the generated `.txt` file as a constraint.
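
The ordering requirement can be sketched as follows. This is an illustration only, not the repository's compile script: `compile_order` is a hypothetical helper, and the file names other than `base.in` are examples.

```python
from pathlib import Path
from typing import Iterable, List


def compile_order(in_files: Iterable[str]) -> List[str]:
    """Return the `.in` files with `base.in` first and the rest sorted by name."""
    paths = sorted(in_files)
    base = [p for p in paths if Path(p).name == "base.in"]
    others = [p for p in paths if Path(p).name != "base.in"]
    # base.in goes first so its generated .txt exists before others reference it.
    return base + others


order = compile_order(
    ["requirements/test.in", "requirements/base.in", "requirements/dev.in"]
)
print(order)
```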
2024-04-19 21:39:25 +00:00
Roman Isecke
4185a1a15a
feat: Remove constraint on unstructured client from .in file (#2862)
### Description
Don't limit the version of the unstructured client for all users of the
repo
2024-04-08 16:50:56 +00:00
Roman Isecke
d6f2841ff4
feat: update dependencies and remove constraint on pydantic (#2841)
### Description
* The `consistent-deps.sh` script was fixed to take the ingest
dependencies into account, which surfaced some errors. New constraints
were added to make that script pass.
* Update all requirements without a constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict, but there's a fix on their `main`
branch that only needs the next release to be published. Opened a
question to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now,
added `lxml<5` to the constraints.

A couple of optimizations:
* `constraints.in` renamed to `constraints.txt`, since the whole point is
that all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory, as it
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* Makefile updated, since it was originally written to avoid the
`base.in` and `constraints.in` files
2024-04-04 19:58:23 +00:00
Ahmet Melek
be71633415
refactor: isolate ingest dependencies into local scopes (#2509)
This PR: 
- Moves ingest dependencies into local scopes so that ingest connector
classes can be imported without installing their external
dependencies. This allows lightweight use of the classes (not
the instances; to use the instances as intended, you'll still need the
dependencies).
- Upgrades the embed module dependencies from `langchain` to the
`langchain-community` module (to pass CI [rather than introducing a
pin])
- Runs `pip-compile`
- Does minor refactors in other files to pass the `ruff 2.0` checks that
were introduced by `pip-compile`
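
The local-scope import pattern looks roughly like this. It is a sketch under stated assumptions: `ExampleConnector` and `hypothetical_vendor_sdk` are made-up names, not classes or packages from `unstructured`, and the SDK is assumed not to be installed.

```python
class ExampleConnector:
    """Hypothetical connector used only to illustrate the pattern."""

    def __init__(self, uri: str) -> None:
        # No third-party import is needed to define or construct the class.
        self.uri = uri

    def connect(self):
        # The import lives inside the method, so importing this module
        # succeeds even when the SDK is missing.
        import hypothetical_vendor_sdk  # placeholder name, assumed absent

        return hypothetical_vendor_sdk.Client(self.uri)


connector = ExampleConnector("db://example")
try:
    connector.connect()
    dependency_available = True
except ImportError:
    # The missing dependency only surfaces when the connector is actually used.
    dependency_available = False
```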
2024-02-06 21:28:55 +00:00
John
db67805ec6
feat: add support for partitioning .heic files (#2454)
`.heic` is an image file type we have not supported until now.

#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

# Partition the same page stored as PNG and as HEIC.
png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

# Each pair of elements should carry identical text.
for png_element, heic_element in zip(png_elements, heic_elements):
    print(heic_element.text == png_element.text)
```

---------

Co-authored-by: christinestraub <christinemstraub@gmail.com>
2024-01-30 04:49:00 +00:00
Matt Robinson
2d3a7f1c48
fix: fix table index error by bumping unstructured-inference (#2430)
### Summary

Closes #2417. Bumps `unstructured-inference` to pull in the fix
implemented in
https://github.com/Unstructured-IO/unstructured-inference/pull/317
2024-01-19 22:42:32 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
Christine Straub
69d0ee1aea
Refactor: support merging extracted layout with inferred layout (#2158)
### Summary
This PR is the second part of `pdfminer` refactor to move it from
`unstructured-inference` repo to `unstructured` repo, the first part is
done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.

The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (create temp file/dir as needed), and splits
the document by pages
  * if `is_image` is `True`, return the passed `inferred_layout`
    (DocumentLayout)
  * if `is_image` is `False`:
    * get `extracted_layout` (TextRegions) from the passed document
      (data/filename) with pdfminer
    * merge `extracted_layout` (TextRegions) with the passed
      `inferred_layout` (DocumentLayout)
    * return the `inferred_layout` (DocumentLayout) with updated elements
      (all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass merged_layout and the document (as data/filename) to the `OCR`
module, which first opens the document (create temp file/dir as needed),
and splits the document by pages (convert PDF pages to image pages for
PDF file)
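
The branch in the middle of that workflow can be sketched with plain lists standing in for the layout objects. `merge_step` and its callable parameters are hypothetical names for illustration; the real code works on `DocumentLayout` and `TextRegions` objects and is not reproduced here.

```python
from typing import Callable, List


def merge_step(
    inferred_layout: List[str],
    is_image: bool,
    extract_layout: Callable[[], List[str]],
    merge: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Mirror the is_image branch described above."""
    if is_image:
        # pdfminer extraction is skipped; the inferred layout is returned as is.
        return inferred_layout
    extracted_layout = extract_layout()
    return merge(extracted_layout, inferred_layout)


# Toy stand-ins for the extraction and merge functions.
def toy_extract() -> List[str]:
    return ["NarrativeText"]


def toy_merge(extracted: List[str], inferred_in: List[str]) -> List[str]:
    return inferred_in + extracted


inferred = ["Title", "Table"]
merged = merge_step(inferred, False, toy_extract, toy_merge)
unchanged = merge_step(inferred, True, toy_extract, toy_merge)
```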

### Note
This PR also fixes issue #2164 by using functionality similar to the one
implemented in the `fast` strategy workflow when extracting elements by
`pdfminer`.

### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
2023-12-01 20:56:31 +00:00
John
e5bdf7fb43
chore: unstructured python client (#2195)
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for api calls
instead of `requests`.
Updates associated tests.

Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.

---------

Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-01 18:49:59 +00:00
Yuming Long
ccda93b0d1
chore: bump inference to 0.7.15 release unst 0.11.0 (#2110)
^^
2023-11-20 18:20:03 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated JSON output
of partition to MongoDB. None of this code provisions the MongoDB
instance, so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing empty strings from `languages` to tesseract, which
previously resulted in an empty string being passed to the `-l` language
option of the tesseract CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if none of the languages in the `languages` parameter can be converted
from ISO codes, proceed with OCR using `eng` as the default
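
A minimal sketch of that cleanup, assuming the codes have already been converted to tesseract language codes. `clean_tesseract_languages` is a hypothetical name, not the function in this PR; joining with `+` follows the tesseract CLI convention for passing several languages to `-l`.

```python
from typing import Iterable, List


def clean_tesseract_languages(codes: Iterable[str], default: str = "eng") -> str:
    """Drop empty and duplicate codes; fall back to `default` if none remain."""
    seen: List[str] = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        seen.append(default)
    # The tesseract CLI accepts multiple languages joined with "+".
    return "+".join(seen)


print(clean_tesseract_languages(["", "eng", "eng", "deu"]))
```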
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this change:
  * in the `unstructured-api` repo, on the main branch, run `make
    run-web-app`
  * use curl to reproduce the error with an empty string, or with any
    wrong input like `-F 'languages="eng,de"'`:
    ```
    curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
      -F 'languages=""' \
      -F 'strategy=hi_res' \
      -F 'pdf_infer_table_structure=True' \
      | jq -C . | less -R
    ```

* after this change:
   * in your unstructured API env, cd to the `unstructured` repo and install it locally with `pip install -e .`
   * check out this branch
   * run `make run-web-app` again in the API repo
   * the curl command now returns output, and a warning appears in the log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-06 19:30:12 -06:00
qued
808b4ced7a
build(deps): remove ebooklib (#1878)
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
2023-10-26 12:22:40 -05:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches its dependencies but keys the cache on the hash
of all files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
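
The idea behind the split cache can be illustrated in Python: the base cache key should change only when non-ingest requirement files change. This is a sketch of the concept, not the CI workflow itself; `requirements_cache_key` and the file names in the demo are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def requirements_cache_key(requirements_dir: Path, exclude_subdir: str = "ingest") -> str:
    """Hash every .txt file under requirements_dir except those in exclude_subdir."""
    digest = hashlib.sha256()
    for path in sorted(requirements_dir.rglob("*.txt")):
        rel = path.relative_to(requirements_dir)
        if rel.parts[0] == exclude_subdir:
            continue  # ingest files must not invalidate the base cache
        digest.update(rel.as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ingest").mkdir()
    (root / "base.txt").write_text("requests\n")
    (root / "ingest" / "s3.txt").write_text("boto3\n")
    key_before = requirements_cache_key(root)

    (root / "ingest" / "s3.txt").write_text("boto3>=1.0\n")
    key_after_ingest_change = requirements_cache_key(root)

    (root / "base.txt").write_text("requests>=2\n")
    key_after_base_change = requirements_cache_key(root)
```

Editing an ingest file leaves the base key untouched, so only the second, ingest-specific cache would need to be rebuilt.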
2023-10-24 14:54:00 +00:00