#########
Embedding
#########

Embedding encoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).

``BaseEmbeddingEncoder``
------------------------

The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.

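To make the idea of an abstract encoder interface concrete, here is a minimal, self-contained sketch. It does not reproduce the library's actual class definition: ``ToyElement``, ``ToyBaseEncoder``, and ``HashEncoder`` are hypothetical names, and the hash-based "embedding" is only a deterministic placeholder for a real model. The four method names mirror the ones described for the encoders in this document.

```python
import hashlib
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyElement:
    """Hypothetical stand-in for an ``unstructured`` element."""

    text: str
    embeddings: Optional[List[float]] = None


class ToyBaseEncoder(ABC):
    """Illustrative interface only; not the library's actual base class."""

    @abstractmethod
    def embed_documents(self, elements: List[ToyElement]) -> List[ToyElement]: ...

    @abstractmethod
    def embed_query(self, query: str) -> List[float]: ...

    @abstractmethod
    def num_of_dimensions(self) -> int: ...

    @abstractmethod
    def is_unit_vector(self) -> bool: ...


class HashEncoder(ToyBaseEncoder):
    """Toy subclass: hashes text into a normalized 8-dimensional vector."""

    DIMENSIONS = 8

    def embed_query(self, query: str) -> List[float]:
        digest = hashlib.sha256(query.encode("utf-8")).digest()
        # Center the byte values around zero; b - 127.5 is never zero for integer b.
        raw = [b - 127.5 for b in digest[: self.DIMENSIONS]]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]

    def embed_documents(self, elements: List[ToyElement]) -> List[ToyElement]:
        # Attach an embedding to every element and return the same list.
        for element in elements:
            element.embeddings = self.embed_query(element.text)
        return elements

    def num_of_dimensions(self) -> int:
        return self.DIMENSIONS

    def is_unit_vector(self) -> bool:
        # Vectors are normalized in embed_query, so they have length 1.
        return True


encoder = HashEncoder()
docs = encoder.embed_documents(
    [ToyElement("This is sentence 1"), ToyElement("This is sentence 2")]
)
query_vector = encoder.embed_query("This is the query")
print(len(docs[0].embeddings), encoder.num_of_dimensions(), encoder.is_unit_vector())
```

Because the base class declares every method as abstract, a subclass that forgets one of them cannot be instantiated, which is the main benefit of defining the interface this way.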
``OpenAIEmbeddingEncoder``
--------------------------

The ``OpenAIEmbeddingEncoder`` class uses the LangChain OpenAI integration under the hood
to connect to the OpenAI text embedding API and obtain embeddings for pieces of text.

``embed_documents`` receives a list of Elements and returns an updated list that
includes the ``embeddings`` attribute for each Element.

``embed_query`` receives a query as a string and returns a list of floats, which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes whether embedding vectors obtained via
this class are unit vectors.

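What the two metadata properties describe can be checked by hand on any embedding vector. The sketch below uses only the standard library. The function names are illustrative and the tolerance is an assumed value, so this is not a description of how the encoder classes compute these properties internally.

```python
import math
from typing import Sequence


def count_dimensions(vector: Sequence[float]) -> int:
    """Number of dimensions is simply the length of the vector."""
    return len(vector)


def has_unit_length(vector: Sequence[float], tolerance: float = 1e-6) -> bool:
    """A unit vector has Euclidean (L2) norm equal to 1, up to rounding error."""
    norm = math.sqrt(sum(x * x for x in vector))
    return math.isclose(norm, 1.0, abs_tol=tolerance)


unit_example = [0.6, 0.8]  # norm is 1
other_example = [3.0, 4.0]  # norm is 5

print(count_dimensions(unit_example), has_unit_length(unit_example))
print(count_dimensions(other_example), has_unit_length(other_example))
```

Knowing whether vectors are unit length matters downstream: for unit vectors the dot product and the cosine similarity are the same number, so a vector store can use either metric interchangeably.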
The following code block shows an example of how to use ``OpenAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.
You will need to set an environment variable named ``OPENAI_API_KEY`` to run this example.
To obtain an API key, visit: https://platform.openai.com/account/api-keys

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

    # Initialize the encoder with OpenAI credentials
    embedding_encoder = OpenAIEmbeddingEncoder(
        config=OpenAIEmbeddingConfig(api_key=os.environ["OPENAI_API_KEY"])
    )

    # Embed a list of Elements
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    # Embed a single query string
    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    # Print embeddings
    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

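In a RAG setting, element embeddings and the query embedding are typically compared to find the most relevant elements. The following sketch shows one common way to do that, cosine similarity, using the standard library only. The helper names and the three-dimensional vectors are made up for illustration; real embeddings would come from ``embed_documents`` and ``embed_query``.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_by_similarity(
    query_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [
        (text, cosine_similarity(query_embedding, embedding))
        for text, embedding in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up toy vectors standing in for real element embeddings.
candidates = [
    ("This is sentence 1", [1.0, 0.0, 0.0]),
    ("This is sentence 2", [0.0, 1.0, 0.0]),
]
ranked = rank_by_similarity([0.1, 0.9, 0.0], candidates)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

In practice this comparison is usually delegated to a vector database, but the ranking logic it applies is the same in spirit.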
``HuggingFaceEmbeddingEncoder``
-------------------------------

The ``HuggingFaceEmbeddingEncoder`` class uses the LangChain HuggingFace integration under the hood
to obtain embeddings for pieces of text using a local model.

``embed_documents`` receives a list of Elements and returns an updated list that
includes the ``embeddings`` attribute for each Element.

``embed_query`` receives a query as a string and returns a list of floats, which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes whether embedding vectors obtained via
this class are unit vectors.

Usage follows the same pattern as the ``OpenAIEmbeddingEncoder`` example above: the output consists of
the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.

``BedrockEmbeddingEncoder``
---------------------------

The ``BedrockEmbeddingEncoder`` class provides an interface to obtain embeddings for text using Bedrock embeddings via the LangChain integration. It connects to the Bedrock Runtime using AWS's boto3 package.

Key methods and attributes include:

- ``embed_documents``: This method takes a list of Elements as its input and returns the same list with an updated ``embeddings`` attribute for each Element.
- ``embed_query``: This method takes a query as a string and returns the embedding vector for the given query string.
- ``num_of_dimensions``: A metadata property that gives the number of dimensions in any embedding vector obtained via this class.
- ``is_unit_vector``: A metadata property that indicates whether embedding vectors obtained via this class are unit vectors.

Initialization:
To create an instance of the ``BedrockEmbeddingEncoder``, AWS credentials and the region name are required.

.. code:: python

    from unstructured.documents.elements import Text
    from unstructured.embed.bedrock import BedrockEmbeddingEncoder

    # Initialize the encoder with AWS credentials
    embedding_encoder = BedrockEmbeddingEncoder(
        aws_access_key_id="YOUR_AWS_ACCESS_KEY_ID",
        aws_secret_access_key="YOUR_AWS_SECRET_ACCESS_KEY",
        region_name="us-west-2",
    )

    # Embed a list of Elements
    elements = embedding_encoder.embed_documents(elements=[Text("Sentence A"), Text("Sentence B")])

    # Embed a single query string
    query = "Example query"
    query_embedding = embedding_encoder.embed_query(query=query)

    # Print embeddings
    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

Dependencies:
This class relies on several dependencies, including boto3, numpy, and langchain. Ensure they are installed in the environment where the class is used.

``OctoAIEmbeddingEncoder``
--------------------------

The ``OctoAIEmbeddingEncoder`` class connects to the OctoAI text embedding API to obtain embeddings for pieces of text.

``embed_documents`` receives a list of Elements and returns an updated list that
includes the ``embeddings`` attribute for each Element.

``embed_query`` receives a query as a string and returns a list of floats, which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes whether embedding vectors obtained via
this class are unit vectors.

The following code block shows an example of how to use ``OctoAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.
You will need to set an environment variable named ``OCTOAI_API_KEY`` to run this example.
To obtain an API key, visit: https://octo.ai/docs/getting-started/how-to-create-an-octoai-access-token

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.octoai import OctoAiEmbeddingConfig, OctoAIEmbeddingEncoder

    embedding_encoder = OctoAIEmbeddingEncoder(
        config=OctoAiEmbeddingConfig(api_key=os.environ["OCTOAI_API_KEY"])
    )
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())

``VertexAIEmbeddingEncoder``
----------------------------

The ``VertexAIEmbeddingEncoder`` class connects to GCP VertexAI to obtain embeddings for pieces of text.

``embed_documents`` receives a list of Elements and returns an updated list that
includes the ``embeddings`` attribute for each Element.

``embed_query`` receives a query as a string and returns a list of floats, which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes whether embedding vectors obtained via
this class are unit vectors.

The following code block shows an example of how to use ``VertexAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.

To use Vertex AI PaLM you will need to do one of the following:

- Pass the full JSON content of your GCP VertexAI application credentials to the
  ``VertexAIEmbeddingConfig`` as the ``api_key`` parameter. (This will create a file in the ``/tmp``
  directory with the content of the JSON, and set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment
  variable to the **path** of the created file.)
- Store the path to a manually created service account JSON file in the
  ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable. (For more information:
  https://python.langchain.com/docs/integrations/text_embedding/google_vertex_ai_palm)
- Have the credentials configured for your environment (gcloud,
  workload identity, etc.)

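The first option above, writing the JSON content to a temporary file and pointing ``GOOGLE_APPLICATION_CREDENTIALS`` at it, can be pictured roughly as follows. This is a sketch of the described behavior, not the library's own code: the helper name ``register_credentials`` and the placeholder credential content are hypothetical, and the system temporary directory is assumed to be ``/tmp`` on most Linux systems.

```python
import json
import os
import tempfile


def register_credentials(json_content: str) -> str:
    """Hypothetical helper: persist credential JSON and export its path."""
    # Fail early if the string is not valid JSON.
    json.loads(json_content)

    # delete=False keeps the file on disk after the handle is closed,
    # so that client libraries can read it later via the environment variable.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as handle:
        handle.write(json_content)
        path = handle.name

    # The variable must hold the *path* to the file, not the JSON itself.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return path


# Placeholder content only; a real service account file has more fields.
placeholder = json.dumps({"type": "service_account", "project_id": "example-project"})
credentials_path = register_credentials(placeholder)
print(os.environ["GOOGLE_APPLICATION_CREDENTIALS"] == credentials_path)
```

Since the file contains secrets, it is worth removing it once the process no longer needs it, and avoiding this approach on shared machines where a manually managed credentials file is easier to control.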
.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.vertexai import VertexAIEmbeddingConfig, VertexAIEmbeddingEncoder

    embedding_encoder = VertexAIEmbeddingEncoder(
        config=VertexAIEmbeddingConfig(api_key=os.environ["VERTEXAI_GCP_APP_CREDS_JSON_CONTENT"])
    )
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())