mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-10 18:45:54 +00:00

Thanks to Eric Hare (@erichare) at DataStax we have a new destination connector.

This Pull Request implements an integration with [Astra DB](https://datastax.com), which makes the Astra DB Vector Database compatible with Unstructured's set of integrations.

To create your Astra account and authenticate with your `ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT`, follow these steps:

1. Create an account at https://astra.datastax.com
2. Log in and create a new database
3. From the database page, in the right-hand panel, you will find your API Endpoint
4. Beneath that, you can create a Token to be used

Some notes about Astra DB:

- Astra DB is a Vector Database which allows for high-performance database transactions and enables modern GenAI apps. [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods. [See here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
19 lines
507 B
Bash
#!/usr/bin/env bash

# Default embedding provider; export EMBEDDING_PROVIDER to override it.
EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}

# Partition, chunk, and embed a local text file, then write the results to an
# Astra DB collection. ASTRA_DB_TOKEN, ASTRA_DB_ENDPOINT, COLLECTION_NAME, and
# EMBEDDING_DIMENSION must already be set in the environment.
unstructured-ingest \
  local \
  --input-path example-docs/book-war-and-peace-1p.txt \
  --output-dir local-output-to-astra \
  --strategy fast \
  --chunk-elements \
  --embedding-provider "$EMBEDDING_PROVIDER" \
  --num-processes 2 \
  --verbose \
  astra \
  --token "$ASTRA_DB_TOKEN" \
  --api-endpoint "$ASTRA_DB_ENDPOINT" \
  --collection-name "$COLLECTION_NAME" \
  --embedding-dimension "$EMBEDDING_DIMENSION"
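
The script expands four environment variables without checking that they are set, so a missing one reaches the `astra` arguments as an empty string. The sketch below shows one way to check them before running the ingest. The function names `missing_vars` and `check_astra_env` are hypothetical helpers and not part of the repository. The variable names are the ones the script uses, which differ from the `ASTRA_DB_APPLICATION_TOKEN` and `ASTRA_DB_API_ENDPOINT` names in the commit message.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the name of every listed variable that is
# unset or empty, one name per line.
missing_vars() {
  for _name in "$@"; do
    # Look up the value of the variable whose name is stored in _name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      printf '%s\n' "$_name"
    fi
  done
}

# Hypothetical helper: return 0 when every variable expanded by the astra
# arguments is set, otherwise report the missing names on stderr and return 1.
check_astra_env() {
  _missing=$(missing_vars ASTRA_DB_TOKEN ASTRA_DB_ENDPOINT COLLECTION_NAME EMBEDDING_DIMENSION)
  if [ -n "$_missing" ]; then
    printf 'Missing environment variables:\n%s\n' "$_missing" >&2
    return 1
  fi
  return 0
}
```

One possible use is to call `check_astra_env || exit 1` just before the `unstructured-ingest` command, so the run stops early with a list of the variables that still need to be exported.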