Ahmet Melek ed08773de7
feat: add pinecone destination connector (#1774)
Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039 

This PR:
- Uses the Pinecone Python client to implement a destination connector for
Pinecone and provides the ingest readme requirements
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector (a rough write-path sketch follows this list)
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node in the ingest pipeline
- Adds a baseline session handle implementation for destination
connectors so that their operations can be parallelized (also illustrated in the sketch after this list)
- Tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
to the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
about persisting element data through ingest embedding nodes, using this
connector's ingest test
- Fixes a bug in the ingest chunking params with a [bugfix on chunking
params and a related
test](69e1949a6f)
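
To make the two connector items above concrete, here is a minimal, hypothetical sketch of the write path: a per-process session handle that initializes the Pinecone client once so parallel workers can reuse it, plus an upsert of a single embedded element. It assumes the pre-3.0 `pinecone-client` module-level API (`pinecone.init` / `pinecone.Index`); the names `PineconeSessionHandle`, `create_session_handle`, `write_element`, the `gcp-starter` environment, and the element dict are all illustrative, not the connector's actual code, which batches elements and flattens metadata differently.

```python
import uuid
from dataclasses import dataclass

import pinecone  # pre-3.0 pinecone-client, module-level init/Index API


@dataclass
class PineconeSessionHandle:
    """Hypothetical per-process handle holding an initialized index client.

    Creating the client once per worker process lets destination writes run
    in parallel without re-initializing the connection for every batch.
    """

    index: "pinecone.Index"


def create_session_handle(
    api_key: str, environment: str, index_name: str
) -> PineconeSessionHandle:
    pinecone.init(api_key=api_key, environment=environment)
    return PineconeSessionHandle(index=pinecone.Index(index_name))


def write_element(handle: PineconeSessionHandle, element: dict) -> None:
    # Each embedded element carries its text, metadata, and an "embeddings" vector.
    handle.index.upsert(
        vectors=[
            {
                "id": str(uuid.uuid4()),
                "values": element["embeddings"],
                "metadata": {"text": element["text"], **element["metadata"]},
            }
        ]
    )


if __name__ == "__main__":
    handle = create_session_handle("<PINECONE_API_KEY>", "gcp-starter", "ingest-test")
    write_element(
        handle,
        {
            "text": "Well, Prince, so Genoa and Lucca are now just family estates...",
            "metadata": {"filename": "book-war-and-peace-1225p.txt"},
            "embeddings": [0.1, 0.2, 0.3],  # truncated for illustration
        },
    )
```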

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-11-29 22:37:32 +00:00


#!/usr/bin/env bash
# Processes example-docs/book-war-and-peace-1225p.txt,
# embeds the processed document, and writes the results to a Pinecone index.
# Structured outputs are stored in local-to-pinecone/
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1
# As an example we're using the local source connector;
# however, ingesting from any supported source connector is possible.
# shellcheck disable=2094
PYTHONPATH=. ./unstructured/ingest/main.py \
local \
--input-path example-docs/book-war-and-peace-1225p.txt \
--output-dir local-to-pinecone \
--strategy fast \
--chunk-elements \
--embedding-provider "<an unstructured embedding provider, e.g. langchain-huggingface>" \
--num-processes 2 \
--verbose \
--work-dir "<directory for intermediate outputs to be saved>" \
pinecone \
--api-key "<Pinecone API Key to write into a Pinecone index>" \
--index-name "<Pinecone index name, e.g. ingest-test>" \
--environment "<Pinecone environment name, e.g. gcp-starter>" \
--batch-size "<Number of elements to be uploaded per batch, e.g. 80>" \
--num-processes "<Number of processes to be used to upload, e.g. 2>"
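
After the script finishes, the upload can be spot-checked against the index. A minimal sketch, again assuming the pre-3.0 `pinecone-client` API; the index name matches the placeholder example above, and the environment value is only illustrative:

```python
import pinecone

# Substitute your own credentials; these values are placeholders.
pinecone.init(api_key="<PINECONE_API_KEY>", environment="gcp-starter")
index = pinecone.Index("ingest-test")

# Reports vector counts per namespace, which should line up with the number
# of embedded elements written by the ingest run.
print(index.describe_index_stats())
```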