15 Commits

Author SHA1 Message Date
Christine Straub
ab88e20575
chore: bump unstructured-inference 0.7.36 (#3275)
### Summary
- bump unstructured-inference to `0.7.35` which fixed `ValueError` when
converting cells to HTML in the table processing subpipeline
- cut a release for `0.14.8`

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2024-06-24 13:07:22 +00:00
Roman Isecke
d876a386ed
Roman/fix ingest async connectors (#3210)
### Description
Choosing to use async needs to be very careful because if a connector is
set to use async, the pipeline will not fan out the inputs via
multiprocessing but instead it will be limited to run in a single
process under the assumption it has more benefit from async due to heavy
network traffic. This means the exact same code that is not optimized
for async and is blocking will force the pipeline to perform worse than
simply never marking the connector to use async since the pipeline will
fan that out using multiprocessing.

All connectors and processes in the pipeline we revisited to make sure
this criteria was met and updated accordingly:
* Currently the unstructured client does not support making requests
async, so this was moved over to use multiprocessing
* fsspec connector was updated to use the async client from the fsspec
library. This also required that the client be a `@property` fetched on
demand, otherwise the client would break the multiprocessing pool since
it maintains a thread lock and that can't be pickled when the fsspec
connector doesn't support async.
* elasticsearch was also updated to use the async client
* weaviate only recently came out with async support in their SDK at a
version that is higher than we can use in the open source repo, so a
TODO was left but otherwise moved to use multiprocessing
* all underlying embedders don't use async to embedder step must be
multiprocessing for now. TODO left to update underlying embedder classes
to optionally support async.
* Chunking parameters were not accurately being passed through from cli
to chunker params, this was fixed

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-17 16:55:19 +00:00
Matt Robinson
c822e3fd10
build(deps): weekly dependency bumps (6/10/2024) (#3170)
### Summary

Weekly dependency bumps for the week of 6/10/2024.

The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0`
seemed to break the connector test, per [this
test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005).
Opened #3173 to address.
2024-06-10 16:20:22 +00:00
Matt Robinson
5203390a4a
build(deps): weekly pip version bump (#3147)
### Summary

Weekly PR to bump dependency versions.
2024-06-04 20:47:04 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Christine Straub
b64a48440d
chore: bump unstructured-inference 0.7.31 (#2981) 2024-05-08 16:26:58 +00:00
Roman Isecke
d6f2841ff4
feat: update dependencies and remove constraint on pydantic (#2841)
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.

A couple optimizations: 
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file
2024-04-04 19:58:23 +00:00
Ahmet Melek
be71633415
refactor: isolate ingest dependencies into local scopes (#2509)
This PR: 
- Moves ingest dependencies into local scopes to be able to import
ingest connector classes without the need of installing imported
external dependencies. This allows lightweight use of the classes (not
the instances. to use the instances as intended you'll still need the
dependencies).
- Upgrades the embed module dependencies from `langchain` to
`langchain-community` module (to pass CI [rather than introducing a
pin])
- Does pip-compile
- Does minor refactors in other files to pass `ruff 2.0` checks which
were introduced by pip-compile
2024-02-06 21:28:55 +00:00
Matt Robinson
2d3a7f1c48
fix: fix table index error by bumping unstructured-inference (#2430)
### Summary

Closes #2417. Bumps `unstructured-inference` to pull in the fix
implemented in
https://github.com/Unstructured-IO/unstructured-inference/pull/317
2024-01-19 22:42:32 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
Roman Isecke
2bb463d006
feat: support both single and batch ingest docs (#2105)
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.

**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-27 19:25:30 +00:00
Yuming Long
ccda93b0d1
chore: bump inference to 0.7.15 release unst 0.11.0 (#2110)
^^
2023-11-20 18:20:03 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
 ```
curl -X 'POST'  'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json'   \
-H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
``` 

* after this change:
   * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
   * check out to this branch
   * run `make run-web-app` again in api repo
   * the curl command return output and see warning in log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-06 19:30:12 -06:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
2023-10-24 14:54:00 +00:00