### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.
A couple optimizations:
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.
**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`