feat: document embeddings (#1368)

Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:

- implements `EmbeddingEncoder` classes which track embedding-related data
- implements an `embed_documents` method which receives a list of Elements,
obtains embeddings for the text within each Element, updates the Elements
with an attribute named `embeddings`, and returns the updated Elements
- the module uses LangChain to obtain the embeddings
-----
- The PR additionally fixes a JSON deserialization issue in the metadata
fields; a short round-trip sketch follows the test note below.

To test the changes, run `examples/embed/example.py`
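
As a quick illustration of the metadata fix, here is a minimal round-trip sketch. It assumes the `elements_to_json` / `elements_from_json` helpers from `unstructured.staging.base` and an illustrative `elements.json` path; with the fix, the `data_source` field comes back as a `DataSourceMetadata` instance instead of a plain dict.

.. code:: python

    from unstructured.documents.elements import DataSourceMetadata, ElementMetadata, Text
    from unstructured.staging.base import elements_from_json, elements_to_json

    # Build an element whose metadata carries a data_source field.
    element = Text(
        "Hello world",
        metadata=ElementMetadata(
            data_source=DataSourceMetadata(url="s3://bucket/key.txt"),  # illustrative value
        ),
    )

    # Serialize to disk and load back (the path is illustrative).
    elements_to_json([element], filename="elements.json")
    loaded = elements_from_json(filename="elements.json")

    # With the fix, data_source is deserialized into DataSourceMetadata.
    print(type(loaded[0].metadata.data_source))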
Ahmet Melek 2023-09-20 22:55:30 +03:00 committed by GitHub
parent 7a3828d292
commit 9e88929a8c
9 changed files with 182 additions and 3 deletions


@@ -1,4 +1,4 @@
## 0.10.17-dev1
## 0.10.17-dev2
### Enhancements
@@ -6,8 +6,12 @@
### Features
* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings. Importance: The ability to embed documents or parts of documents enables users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval-augmented generation.
### Fixes
* **Fixes a metadata source serialization bug** Problem: When loading an elements JSON file from disk, the data_source attribute is assumed to be an instance of DataSourceMetadata and the code acts on that assumption. However, the loader did not satisfy the assumption and loaded it as a dict instead, causing an error. Fix: Added the necessary code block to initialize a DataSourceMetadata object, and refactored the DataSourceMetadata.from_dict() method to remove redundant code. Importance: Crucial for loading elements (which have data_source fields) from JSON files.
## 0.10.16
### Enhancements
@@ -78,7 +82,7 @@
* Update all connectors to use new downstream architecture
* New click type added to parse comma-delimited string inputs
* Some CLI options renamed
### Features
### Fixes
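
Since the feature entry above mentions search, retrieval, and retrieval-augmented generation, here is a rough sketch of how the returned embeddings might be ranked against a query vector. It assumes elements already carry an `embeddings` attribute (from `embed_documents`) and a query vector (from `embed_query`), and uses plain NumPy rather than any particular vector store.

.. code:: python

    import numpy as np

    def rank_elements(elements, query_embedding, top_k=3):
        """Rank already-embedded elements by cosine similarity to a query vector."""
        query = np.asarray(query_embedding)
        scored = []
        for element in elements:
            vector = np.asarray(element.embeddings)
            score = float(np.dot(vector, query) / (np.linalg.norm(vector) * np.linalg.norm(query)))
            scored.append((score, element))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

    # Usage, after embed_documents / embed_query have been called:
    # for score, element in rank_elements(elements, query_embedding):
    #     print(f"{score:.3f}  {element}")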


@@ -0,0 +1,58 @@
#########
Embedding
#########
EmbeddingEncoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).
``BaseEmbeddingEncoder``
------------------------
The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.
``OpenAIEmbeddingEncoder``
--------------------------
The ``OpenAIEmbeddingEncoder`` class uses the LangChain OpenAI integration under the hood
to connect to the OpenAI embeddings API and obtain embeddings for pieces of text.
``embed_documents`` will receive a list of Elements, and return an updated list which
includes the ``embeddings`` attribute for each Element.
``embed_query`` will receive a query as a string, and return a list of floats which is the
embedding vector for the given query string.
``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.
``is_unit_vector`` is a metadata property that denotes if embedding vectors obtained via
this class are unit vectors.
The following code block shows an example of how to use ``OpenAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.
You will need to set an environment variable named ``OPENAI_API_KEY`` to be able to run this example.
To obtain an API key, visit: https://platform.openai.com/account/api-keys

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.openai import OpenAIEmbeddingEncoder

    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
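
The page above also mentions that elements grouped with ``chunking`` can be embedded. A minimal sketch of that flow follows; it assumes ``partition`` from ``unstructured.partition.auto``, ``chunk_by_title`` from ``unstructured.chunking.title``, and an illustrative document path.

.. code:: python

    import os

    from unstructured.chunking.title import chunk_by_title
    from unstructured.embed.openai import OpenAIEmbeddingEncoder
    from unstructured.partition.auto import partition

    # Partition a document, group the elements into chunks, then embed the chunks.
    elements = partition(filename="example-docs/layout-parser-paper.pdf")  # illustrative path
    chunks = chunk_by_title(elements)

    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
    chunks_with_embeddings = embedding_encoder.embed_documents(elements=chunks)

    for chunk in chunks_with_embeddings:
        print(len(chunk.embeddings), str(chunk)[:60])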

examples/embed/example.py

@@ -0,0 +1,16 @@
import os

from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingEncoder

# Embed two example elements and a query string, then print the results.
embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())


@@ -1 +1 @@
__version__ = "0.10.17-dev1" # pragma: no cover
__version__ = "0.10.17-dev2" # pragma: no cover


@@ -48,6 +48,10 @@ class DataSourceMetadata:
    def to_dict(self):
        return {key: value for key, value in self.__dict__.items() if value is not None}

    @classmethod
    def from_dict(cls, input_dict):
        return cls(**input_dict)


@dc.dataclass
class CoordinatesMetadata:
@@ -200,6 +204,10 @@ class ElementMetadata:
            constructor_args["coordinates"] = CoordinatesMetadata.from_dict(
                constructor_args["coordinates"],
            )
        if constructor_args.get("data_source", None) is not None:
            constructor_args["data_source"] = DataSourceMetadata.from_dict(
                constructor_args["data_source"],
            )
        return cls(**constructor_args)

    def merge(self, other: ElementMetadata):
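
To make the new `data_source` branch concrete, here is a small sketch of the behavior it enables (the dictionary values are illustrative): `ElementMetadata.from_dict` now rebuilds a typed `DataSourceMetadata` object whenever a `data_source` key is present.

.. code:: python

    from unstructured.documents.elements import DataSourceMetadata, ElementMetadata

    metadata_dict = {
        "filename": "report.txt",  # illustrative values
        "data_source": {"url": "s3://bucket/report.txt", "version": "1"},
    }

    metadata = ElementMetadata.from_dict(metadata_dict)

    # Previously data_source stayed a plain dict; now it is typed metadata.
    assert isinstance(metadata.data_source, DataSourceMetadata)
    print(metadata.data_source.url)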


@@ -0,0 +1,32 @@
from abc import ABC, abstractmethod
from typing import List, Tuple

from unstructured.documents.elements import Element


class BaseEmbeddingEncoder(ABC):
    @abstractmethod
    def initialize(self):
        """Initializes the embedding encoder class. Should also validate that the
        instance is properly configured: e.g., embed a single element."""
        pass

    @property
    @abstractmethod
    def num_of_dimensions(self) -> Tuple[int]:
        """Number of dimensions for the embedding vector."""
        pass

    @property
    @abstractmethod
    def is_unit_vector(self) -> bool:
        """Denotes if the embedding vector is a unit vector."""
        pass

    @abstractmethod
    def embed_documents(self, elements: List[Element]) -> List[Element]:
        pass

    @abstractmethod
    def embed_query(self, query: str) -> List[float]:
        pass
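
To show how the interface is intended to be extended, here is a purely illustrative sketch of a toy encoder that satisfies `BaseEmbeddingEncoder` with deterministic hash-based vectors; it is not part of the library and exists only to demonstrate the required methods and properties.

.. code:: python

    import hashlib
    from typing import List, Tuple

    from unstructured.documents.elements import Element
    from unstructured.embed.interfaces import BaseEmbeddingEncoder


    class ToyHashEmbeddingEncoder(BaseEmbeddingEncoder):
        """Illustrative encoder that maps text to deterministic 8-dimensional vectors."""

        def initialize(self):
            pass

        @property
        def num_of_dimensions(self) -> Tuple[int]:
            return (8,)

        @property
        def is_unit_vector(self) -> bool:
            return False

        def embed_documents(self, elements: List[Element]) -> List[Element]:
            for element in elements:
                element.embeddings = self.embed_query(str(element))
            return elements

        def embed_query(self, query: str) -> List[float]:
            digest = hashlib.sha256(query.encode("utf-8")).digest()
            return [byte / 255.0 for byte in digest[:8]]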


@@ -0,0 +1,57 @@
from typing import List, Optional

import numpy as np

from unstructured.documents.elements import (
    Element,
)
from unstructured.embed.interfaces import BaseEmbeddingEncoder
from unstructured.ingest.error import EmbeddingEncoderConnectionError
from unstructured.utils import requires_dependencies


class OpenAIEmbeddingEncoder(BaseEmbeddingEncoder):
    def __init__(self, api_key: str, model_name: Optional[str] = "text-embedding-ada-002"):
        self.api_key = api_key
        self.model_name = model_name

        self.initialize()

    def initialize(self):
        self.openai_client = self.get_openai_client()

    def num_of_dimensions(self):
        return np.shape(self.examplary_embedding)

    def is_unit_vector(self):
        return np.isclose(np.linalg.norm(self.examplary_embedding), 1.0)

    def embed_query(self, query):
        # Use the client's single-string embedding call; embed_documents expects
        # a list of texts and would otherwise iterate over the query's characters.
        return self.openai_client.embed_query(str(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        embeddings = self.openai_client.embed_documents([str(e) for e in elements])
        elements_with_embeddings = self._add_embeddings_to_elements(elements, embeddings)
        return elements_with_embeddings

    def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]:
        assert len(elements) == len(embeddings)
        for i in range(len(elements)):
            elements[i].embeddings = embeddings[i]
        return elements

    @EmbeddingEncoderConnectionError.wrap
    @requires_dependencies(
        ["langchain", "openai"],
    )  # add extras="langchain" when it's added to the makefile
    def get_openai_client(self):
        """Creates a langchain OpenAI python client used to embed elements."""
        if not hasattr(self, "openai_client"):
            from langchain.embeddings.openai import OpenAIEmbeddings

            openai_client = OpenAIEmbeddings(
                openai_api_key=self.api_key,
                model=self.model_name,
            )
            self.examplary_embedding = openai_client.embed_query("Q")
            return openai_client


@@ -33,5 +33,9 @@ class DestinationConnectionError(CustomError):
    error_string = "Error in connecting to downstream data source: {}"


class EmbeddingEncoderConnectionError(CustomError):
    error_string = "Error in connecting to the embedding model provider: {}"


class PartitionError(CustomError):
    error_string = "Error in partitioning content: {}"
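
Because `get_openai_client` is wrapped with `EmbeddingEncoderConnectionError.wrap`, provider failures during client setup should surface as this single exception type. A rough sketch of handling it (the invalid key is illustrative):

.. code:: python

    from unstructured.embed.openai import OpenAIEmbeddingEncoder
    from unstructured.ingest.error import EmbeddingEncoderConnectionError

    try:
        # Construction triggers client setup, which embeds a probe string.
        encoder = OpenAIEmbeddingEncoder(api_key="not-a-real-key")  # illustrative bad key
    except EmbeddingEncoderConnectionError as error:
        print(f"Embedding provider error: {error}")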