7 Commits

Author SHA1 Message Date
Matt Robinson
2d3a7f1c48
fix: fix table index error by bumping unstructured-inference (#2430)
### Summary

Closes #2417. Bumps `unstructured-inference` to pull in the fix
implemented in
https://github.com/Unstructured-IO/unstructured-inference/pull/317
2024-01-19 22:42:32 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
Roman Isecke
2bb463d006
feat: support both single and batch ingest docs (#2105)
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.

**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-27 19:25:30 +00:00
Yuming Long
ccda93b0d1
chore: bump inference to 0.7.15 release unst 0.11.0 (#2110)
^^
2023-11-20 18:20:03 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
Yuming Long
ad14321016
Chore: don't pass empty language code to tesseract CLI (#1996)
Summary:

Close: https://github.com/Unstructured-IO/unstructured/issues/1920

* stop passing in empty string from `languages` to tesseract, which will
result in passing empty string to language config `-l` for the tesseract
CLI
* also stop passing in duplicate language code from `languages` to
tesseract OCR
* if we failed to convert any iso languages from the `languages`
parameter, proceed OCR with `eng` as default
  
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this:
* on the `unstructured-api` repo with main branch, run `make
run-web-app`
* curl to test error from empty string, or just any wrong input like `-F
'languages="eng,de"'`:
 ```
curl -X 'POST'  'http://0.0.0.0:8000/general/v0/general' \
  -H 'accept: application/json'   \
-H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""'  \
-F 'strategy=hi_res'  \
-F 'pdf_infer_table_structure=True' \
 | jq -C . | less -R
``` 

* after this change:
   * in your unstructured API env, cd to unstructured repo and install it locally with `pip install -e .`
   * check out to this branch
   * run `make run-web-app` again in api repo
   * the curl command return output and see warning in log

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-11-06 19:30:12 -06:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
2023-10-24 14:54:00 +00:00