mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-08-18 13:45:45 +00:00

Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039
This PR:
- Uses the Pinecone Python client to implement a destination connector for
Pinecone and fulfills the ingest README checklist
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node in the ingest pipeline
- Adds a baseline session handle implementation for destination
connectors, to be able to parallelize their operations
- Tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
to the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
related to persisting element data to ingest embedding nodes, using this
PR's ingest test
- Fixes a bug in the ingest chunking params with [bugfix on chunking params
and implementing related test](69e1949a6f)
---------
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
31 lines
1.3 KiB
Bash
Executable File
#!/usr/bin/env bash

# Processes example-docs/book-war-and-peace-1225p.txt with the local source connector,
# chunks and embeds the processed document, and writes the results to a Pinecone index.

# Structured outputs are stored in local-to-pinecone/

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

# As an example we're using the local source connector,
# however ingesting from any supported source connector is possible.
# shellcheck disable=2094
PYTHONPATH=. ./unstructured/ingest/main.py \
  local \
  --input-path example-docs/book-war-and-peace-1225p.txt \
  --output-dir local-to-pinecone \
  --strategy fast \
  --chunk-elements \
  --embedding-provider "<an unstructured embedding provider, e.g. langchain-huggingface>" \
  --num-processes 2 \
  --verbose \
  --work-dir "<directory for intermediate outputs to be saved>" \
  pinecone \
  --api-key "<Pinecone API Key to write into a Pinecone index>" \
  --index-name "<Pinecone index name, e.g. ingest-test>" \
  --environment "<Pinecone environment name>" \
  --batch-size "<Number of elements to be uploaded per batch, e.g. 80>" \
  --num-processes "<Number of processes to be used to upload, e.g. 2>"
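
The placeholders in the script have to be filled in before it will run. One possible way to do that, sketched here as an assumption rather than as part of the original example, is to read the Pinecone values from environment variables and fail early if any of them is missing. The function name `missing_vars` and the variable names `PINECONE_API_KEY`, `PINECONE_INDEX_NAME`, and `PINECONE_ENVIRONMENT` are hypothetical; neither the script nor the ingest CLI defines them.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper; not part of the original example script.

# Prints the name of every variable in "$@" that is unset or empty, one per line.
# Returns 0 when all of them are set, 1 otherwise.
missing_vars() {
  local name status=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [[ -z "${!name:-}" ]]; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Example wiring (assumed variable names), to be placed before the ingest command:
#
#   if ! missing=$(missing_vars PINECONE_API_KEY PINECONE_INDEX_NAME PINECONE_ENVIRONMENT); then
#     echo "Set these variables before running:" $missing >&2
#     exit 1
#   fi
#
# The pinecone flags could then read, for instance:
#   --api-key "$PINECONE_API_KEY" --index-name "$PINECONE_INDEX_NAME" --environment "$PINECONE_ENVIRONMENT"
```

Keeping the API key in the environment also keeps it out of the script file and out of version control.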