feat: document embeddings (#1368)

Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:

- implements `EmbeddingEncoder` classes which track embedding-related data
- implements an `embed_documents` method which receives a list of Elements,
obtains embeddings for the text within each Element, updates the Elements
with an attribute named `embeddings`, and returns the updated Elements
- the module uses LangChain to obtain the embeddings
-----
- The PR additionally fixes a JSON deserialization issue in the metadata
fields; a short round-trip sketch follows the test note below.

To test the changes, run `examples/embed/example.py`
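
As a quick illustration of the metadata fix, here is a minimal round-trip sketch. It assumes the `elements_to_json` / `elements_from_json` helpers from `unstructured.staging.base` and an illustrative `elements.json` path; with the fix, the `data_source` field comes back as a `DataSourceMetadata` instance instead of a plain dict.

.. code:: python

    from unstructured.documents.elements import DataSourceMetadata, ElementMetadata, Text
    from unstructured.staging.base import elements_from_json, elements_to_json

    # Build an element whose metadata carries a data_source field.
    element = Text(
        "Hello world",
        metadata=ElementMetadata(
            data_source=DataSourceMetadata(url="s3://bucket/key.txt"),  # illustrative value
        ),
    )

    # Serialize to disk and load back (the path is illustrative).
    elements_to_json([element], filename="elements.json")
    loaded = elements_from_json(filename="elements.json")

    # With the fix, data_source is deserialized into DataSourceMetadata.
    print(type(loaded[0].metadata.data_source))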
Ahmet Melek 2023-09-20 22:55:30 +03:00 committed by GitHub
parent 7a3828d292
commit 9e88929a8c
9 changed files with 182 additions and 3 deletions


@@ -1,4 +1,4 @@
## 0.10.17-dev1
## 0.10.17-dev2
### Enhancements
@@ -6,8 +6,12 @@
### Features
* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings. Importance: The ability to embed documents or parts of documents enables users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval-augmented generation.
### Fixes
* **Fixes a metadata source serialization bug** Problem: When loading an elements JSON file from disk, the data_source attribute is assumed to be an instance of DataSourceMetadata and the code acts on that assumption. However, the loader did not satisfy the assumption and loaded it as a dict instead, causing an error. Fix: Added the necessary code block to initialize a DataSourceMetadata object, and refactored the DataSourceMetadata.from_dict() method to remove redundant code. Importance: Crucial for loading elements (which have data_source fields) from JSON files.
## 0.10.16
### Enhancements
@@ -78,7 +82,7 @@
* Update all connectors to use new downstream architecture
* New click type added to parse comma-delimited string inputs
* Some CLI options renamed
### Features
### Fixes
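
Since the feature entry above mentions search, retrieval, and retrieval-augmented generation, here is a rough sketch of how the returned embeddings might be ranked against a query vector. It assumes elements already carry an `embeddings` attribute (from `embed_documents`) and a query vector (from `embed_query`), and uses plain NumPy rather than any particular vector store.

.. code:: python

    import numpy as np

    def rank_elements(elements, query_embedding, top_k=3):
        """Rank already-embedded elements by cosine similarity to a query vector."""
        query = np.asarray(query_embedding)
        scored = []
        for element in elements:
            vector = np.asarray(element.embeddings)
            score = float(np.dot(vector, query) / (np.linalg.norm(vector) * np.linalg.norm(query)))
            scored.append((score, element))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

    # Usage, after embed_documents / embed_query have been called:
    # for score, element in rank_elements(elements, query_embedding):
    #     print(f"{score:.3f}  {element}")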


@@ -0,0 +1,58 @@
#########
Embedding
#########
EmbeddingEncoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).
``BaseEmbeddingEncoder``
------------------------
The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.
``OpenAIEmbeddingEncoder``
--------------------------
The ``OpenAIEmbeddingEncoder`` class uses the LangChain OpenAI integration under the hood
to connect to the OpenAI embeddings API and obtain embeddings for pieces of text.
``embed_documents`` will receive a list of Elements, and return an updated list which
includes the ``embeddings`` attribute for each Element.
``embed_query`` will receive a query as a string, and return a list of floats which is the
embedding vector for the given query string.
``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.
``is_unit_vector`` is a metadata property that denotes if embedding vectors obtained via
this class are unit vectors.
The following code block shows an example of how to use ``OpenAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.
You will need to set an environment variable named ``OPENAI_API_KEY`` to be able to run this example.
To obtain an API key, visit: https://platform.openai.com/account/api-keys

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.openai import OpenAIEmbeddingEncoder

    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    for e in elements:
        print(e.embeddings, e)
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
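
The page above also mentions that elements grouped with ``chunking`` can be embedded. A minimal sketch of that flow follows; it assumes ``partition`` from ``unstructured.partition.auto``, ``chunk_by_title`` from ``unstructured.chunking.title``, and an illustrative document path.

.. code:: python

    import os

    from unstructured.chunking.title import chunk_by_title
    from unstructured.embed.openai import OpenAIEmbeddingEncoder
    from unstructured.partition.auto import partition

    # Partition a document, group the elements into chunks, then embed the chunks.
    elements = partition(filename="example-docs/layout-parser-paper.pdf")  # illustrative path
    chunks = chunk_by_title(elements)

    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
    chunks_with_embeddings = embedding_encoder.embed_documents(elements=chunks)

    for chunk in chunks_with_embeddings:
        print(len(chunk.embeddings), str(chunk)[:60])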

examples/embed/example.py

@@ -0,0 +1,16 @@
import os

from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingEncoder

# Embed two example elements and a query string, then print the results.
embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

for e in elements:
    print(e.embeddings, e)
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())


@@ -1 +1 @@
__version__ = "0.10.17-dev1" # pragma: no cover
__version__ = "0.10.17-dev2" # pragma: no cover


@@ -48,6 +48,10 @@ class DataSourceMetadata:
    def to_dict(self):
        return {key: value for key, value in self.__dict__.items() if value is not None}

    @classmethod
    def from_dict(cls, input_dict):
        return cls(**input_dict)


@dc.dataclass
class CoordinatesMetadata:
@@ -200,6 +204,10 @@ class ElementMetadata:
            constructor_args["coordinates"] = CoordinatesMetadata.from_dict(
                constructor_args["coordinates"],
            )
        if constructor_args.get("data_source", None) is not None:
            constructor_args["data_source"] = DataSourceMetadata.from_dict(
                constructor_args["data_source"],
            )
        return cls(**constructor_args)

    def merge(self, other: ElementMetadata):
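
To make the new `data_source` branch concrete, here is a small sketch of the behavior it enables (the dictionary values are illustrative): `ElementMetadata.from_dict` now rebuilds a typed `DataSourceMetadata` object whenever a `data_source` key is present.

.. code:: python

    from unstructured.documents.elements import DataSourceMetadata, ElementMetadata

    metadata_dict = {
        "filename": "report.txt",  # illustrative values
        "data_source": {"url": "s3://bucket/report.txt", "version": "1"},
    }

    metadata = ElementMetadata.from_dict(metadata_dict)

    # Previously data_source stayed a plain dict; now it is typed metadata.
    assert isinstance(metadata.data_source, DataSourceMetadata)
    print(metadata.data_source.url)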


@@ -0,0 +1,32 @@
from abc import ABC, abstractmethod
from typing import List, Tuple

from unstructured.documents.elements import Element


class BaseEmbeddingEncoder(ABC):
    @abstractmethod
    def initialize(self):
        """Initializes the embedding encoder class. Should also validate that the
        instance is properly configured: e.g., embed a single element."""
        pass

    @property
    @abstractmethod
    def num_of_dimensions(self) -> Tuple[int]:
        """Number of dimensions for the embedding vector."""
        pass

    @property
    @abstractmethod
    def is_unit_vector(self) -> bool:
        """Denotes if the embedding vector is a unit vector."""
        pass

    @abstractmethod
    def embed_documents(self, elements: List[Element]) -> List[Element]:
        pass

    @abstractmethod
    def embed_query(self, query: str) -> List[float]:
        pass
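
To show how the interface is intended to be extended, here is a purely illustrative sketch of a toy encoder that satisfies `BaseEmbeddingEncoder` with deterministic hash-based vectors; it is not part of the library and exists only to demonstrate the required methods and properties.

.. code:: python

    import hashlib
    from typing import List, Tuple

    from unstructured.documents.elements import Element
    from unstructured.embed.interfaces import BaseEmbeddingEncoder


    class ToyHashEmbeddingEncoder(BaseEmbeddingEncoder):
        """Illustrative encoder that maps text to deterministic 8-dimensional vectors."""

        def initialize(self):
            pass

        @property
        def num_of_dimensions(self) -> Tuple[int]:
            return (8,)

        @property
        def is_unit_vector(self) -> bool:
            return False

        def embed_documents(self, elements: List[Element]) -> List[Element]:
            for element in elements:
                element.embeddings = self.embed_query(str(element))
            return elements

        def embed_query(self, query: str) -> List[float]:
            digest = hashlib.sha256(query.encode("utf-8")).digest()
            return [byte / 255.0 for byte in digest[:8]]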


@@ -0,0 +1,57 @@
from typing import List, Optional

import numpy as np

from unstructured.documents.elements import (
    Element,
)
from unstructured.embed.interfaces import BaseEmbeddingEncoder
from unstructured.ingest.error import EmbeddingEncoderConnectionError
from unstructured.utils import requires_dependencies


class OpenAIEmbeddingEncoder(BaseEmbeddingEncoder):
    def __init__(self, api_key: str, model_name: Optional[str] = "text-embedding-ada-002"):
        self.api_key = api_key
        self.model_name = model_name

        self.initialize()

    def initialize(self):
        self.openai_client = self.get_openai_client()

    def num_of_dimensions(self):
        return np.shape(self.examplary_embedding)

    def is_unit_vector(self):
        return np.isclose(np.linalg.norm(self.examplary_embedding), 1.0)

    def embed_query(self, query):
        # Use the client's single-string embedding call; embed_documents expects
        # a list of texts and would otherwise iterate over the query's characters.
        return self.openai_client.embed_query(str(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        embeddings = self.openai_client.embed_documents([str(e) for e in elements])
        elements_with_embeddings = self._add_embeddings_to_elements(elements, embeddings)
        return elements_with_embeddings

    def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]:
        assert len(elements) == len(embeddings)
        for i in range(len(elements)):
            elements[i].embeddings = embeddings[i]
        return elements

    @EmbeddingEncoderConnectionError.wrap
    @requires_dependencies(
        ["langchain", "openai"],
    )  # add extras="langchain" when it's added to the makefile
    def get_openai_client(self):
        """Creates a langchain OpenAI python client used to embed elements."""
        if not hasattr(self, "openai_client"):
            from langchain.embeddings.openai import OpenAIEmbeddings

            openai_client = OpenAIEmbeddings(
                openai_api_key=self.api_key,
                model=self.model_name,
            )
            self.examplary_embedding = openai_client.embed_query("Q")
            return openai_client


@@ -33,5 +33,9 @@ class DestinationConnectionError(CustomError):
    error_string = "Error in connecting to downstream data source: {}"


class EmbeddingEncoderConnectionError(CustomError):
    error_string = "Error in connecting to the embedding model provider: {}"


class PartitionError(CustomError):
    error_string = "Error in partitioning content: {}"
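
Because `get_openai_client` is wrapped with `EmbeddingEncoderConnectionError.wrap`, provider failures during client setup should surface as this single exception type. A rough sketch of handling it (the invalid key is illustrative):

.. code:: python

    from unstructured.embed.openai import OpenAIEmbeddingEncoder
    from unstructured.ingest.error import EmbeddingEncoderConnectionError

    try:
        # Construction triggers client setup, which embeds a probe string.
        encoder = OpenAIEmbeddingEncoder(api_key="not-a-real-key")  # illustrative bad key
    except EmbeddingEncoderConnectionError as error:
        print(f"Embedding provider error: {error}")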