unstructured/elasticsearch.in at e65a44eabbcb1597e7a4b46e4d61f083503632f8 - unstructured

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-03 07:05:20 +00:00

Roman Isecke 2bb463d006

feat: support both single and batch ingest docs (#2105)

### Description
Some source ingest connectors would be more efficient if they read
content in batches rather than using an entire process per document.
Reading from Elasticsearch is one example: given an index with possibly
hundreds of documents, reading each one individually is less efficient
than reading in batches. To preserve as much of the existing ingest doc
paradigm as possible, a new class `BaseIngestDocBatch` was added to
handle batch reads. It produces a list of `BaseSingleIngestDoc`, which
is the new name for all current implementations. This list is generated
after the batch runs its `get_files` method. Past the source node, no
other step in the pipeline should be affected; this is only an
optimization for the read step.
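
The relationship between the two classes can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the names `BaseIngestDocBatch`, `BaseSingleIngestDoc`, and `get_files` come from the description above, while the fields, the `ingest_docs` attribute, `InMemoryDocBatch`, `FAKE_INDEX`, and `chunk` are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BaseSingleIngestDoc:
    """Stand-in for a single ingest doc; the fields are hypothetical."""
    doc_id: str
    content: str = ""


@dataclass
class BaseIngestDocBatch(ABC):
    """Stand-in for a group of docs that are read together."""
    doc_ids: List[str]
    ingest_docs: List[BaseSingleIngestDoc] = field(default_factory=list)

    @abstractmethod
    def get_files(self) -> None:
        """Read every doc in the batch and populate `ingest_docs`."""


# Hypothetical in-memory "index" standing in for a real data source.
FAKE_INDEX: Dict[str, str] = {"a": "alpha", "b": "beta", "c": "gamma"}


@dataclass
class InMemoryDocBatch(BaseIngestDocBatch):
    def get_files(self) -> None:
        # One pass over the source for the whole batch,
        # instead of one separate read per document.
        self.ingest_docs = [
            BaseSingleIngestDoc(doc_id=i, content=FAKE_INDEX[i])
            for i in self.doc_ids
        ]


def chunk(ids: List[str], size: int) -> List[List[str]]:
    """Split a list of ids into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


batches = [InMemoryDocBatch(doc_ids=ids) for ids in chunk(["a", "b", "c"], 2)]
for batch in batches:
    batch.get_files()

# Downstream steps only ever see single docs, as before.
docs = [doc for batch in batches for doc in batch.ingest_docs]
```

Under these assumptions, everything after the read step keeps working on a flat list of single docs, which is why the rest of the pipeline should not need to change.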

**Additional Changes:**
* Removed the use of jq and replaced it with a fields filter on the
content, so that the database handles the filtering and less data is
pulled in.
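
As a rough illustration of server-side field filtering, Elasticsearch search requests accept a `_source` entry listing the fields to return. The sketch below only builds such a request body; the helper name `build_search_body` and the `match_all` query are assumptions, and the commit message does not show how the connector actually passes its fields.

```python
from typing import Any, Dict, List


def build_search_body(fields: List[str]) -> Dict[str, Any]:
    """Build a search body that asks the server to return only `fields`.

    Hypothetical helper: an empty list means "return the full document".
    """
    body: Dict[str, Any] = {"query": {"match_all": {}}}
    if fields:
        # Source filtering: the server trims each hit to these fields,
        # so unwanted data never leaves the database.
        body["_source"] = fields
    return body


body = build_search_body(["title", "text"])
```

Compared with post-processing each document with jq on the client, this keeps the filtering in the database and reduces the amount of data transferred.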

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>

2023-11-27 19:25:30 +00:00

4 lines

50 B

Plaintext


	`-c ../constraints.in`
	`-c ../base.txt`
	`elasticsearch`