feat: document embeddings (#1368)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319, closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:
- implements EmbeddingEncoder classes which track embedding-related data
- implements an embed_documents method which receives a list of Elements, obtains embeddings for the text within the Elements, updates the Elements with an attribute named embeddings, and returns the updated Elements
- uses langchain to obtain the embeddings

The PR additionally fixes a JSON de-serialization issue on the metadata fields. To test the changes, run `examples/embed/example.py`.
This commit is contained in:
parent 7a3828d292
commit 9e88929a8c
CHANGELOG.md

@@ -1,4 +1,4 @@
## 0.10.17-dev1
## 0.10.17-dev2

### Enhancements

@@ -6,8 +6,12 @@

### Features

* **Adds the embedding module to be able to embed Elements** Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the *embeddings* property. The module is also able to embed query strings. Importance: The ability to embed documents or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval-augmented generation.

### Fixes

* **Fixes a metadata source serialization bug** Problem: In unstructured elements, when loading an elements JSON file from disk, the data_source attribute is assumed to be an instance of DataSourceMetadata and the code acts based on that. However, the loader did not satisfy that assumption and loaded it as a dict instead, causing an error. Fix: Added the necessary code to initialize a DataSourceMetadata object, and refactored the DataSourceMetadata.from_dict() method to remove redundant code. Importance: Crucial to be able to load elements (which have data_source fields) from JSON files.
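
A sketch of what the fix enables (illustrative only, not part of this commit; the field values are made up, and `url` and `date_created` are assumed `DataSourceMetadata` attributes):

```python
from unstructured.documents.elements import DataSourceMetadata, ElementMetadata

# Metadata as it would appear after json.load()-ing an elements JSON file from disk.
metadata_dict = {
    "filename": "example.pdf",
    "data_source": {"url": "https://example.com/example.pdf", "date_created": "2023-09-12"},
}

# ElementMetadata.from_dict now rebuilds the nested dict into a DataSourceMetadata
# instance instead of leaving it as a plain dict.
metadata = ElementMetadata.from_dict(metadata_dict)
assert isinstance(metadata.data_source, DataSourceMetadata)
print(metadata.data_source.url)
```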

## 0.10.16

### Enhancements

@@ -78,7 +82,7 @@

* Update all connectors to use new downstream architecture
* New click type added to parse comma-delimited string inputs
* Some CLI options renamed

### Features

### Fixes
docs/source/bricks/embedding.rst (new file, 58 lines)

@@ -0,0 +1,58 @@
#########
Embedding
#########

EmbeddingEncoder classes in ``unstructured`` use document elements detected
with ``partition`` or document elements grouped with ``chunking`` to obtain
embeddings for each element, for use cases such as Retrieval Augmented Generation (RAG).


``BaseEmbeddingEncoder``
------------------------

The ``BaseEmbeddingEncoder`` is an abstract base class that defines the methods to be implemented
for each ``EmbeddingEncoder`` subclass.


``OpenAIEmbeddingEncoder``
--------------------------

The ``OpenAIEmbeddingEncoder`` class uses the langchain OpenAI integration under the hood
to connect to the OpenAI text embedding API and obtain embeddings for pieces of text.

``embed_documents`` will receive a list of Elements, and return an updated list which
includes the ``embeddings`` attribute for each Element.

``embed_query`` will receive a query as a string, and return a list of floats which is the
embedding vector for the given query string.

``num_of_dimensions`` is a metadata property that denotes the number of dimensions in any
embedding vector obtained via this class.

``is_unit_vector`` is a metadata property that denotes if embedding vectors obtained via
this class are unit vectors.

The following code block shows an example of how to use ``OpenAIEmbeddingEncoder``. You will
see the updated elements list (with the ``embeddings`` attribute included for each element),
the embedding vector for the query string, and some metadata properties about the embedding model.
You will need to set an environment variable named ``OPENAI_API_KEY`` to be able to run this example.
To obtain an API key, visit: https://platform.openai.com/account/api-keys

.. code:: python

    import os

    from unstructured.documents.elements import Text
    from unstructured.embed.openai import OpenAIEmbeddingEncoder

    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
    elements = embedding_encoder.embed_documents(
        elements=[Text("This is sentence 1"), Text("This is sentence 2")],
    )

    query = "This is the query"
    query_embedding = embedding_encoder.embed_query(query=query)

    [print(e.embeddings, e) for e in elements]
    print(query_embedding, query)
    print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
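
The encoder can also be applied to elements produced by a partitioning function. The sketch
below is illustrative only (it is not part of this commit, and the file path is a placeholder):

.. code:: python

    import os

    from unstructured.embed.openai import OpenAIEmbeddingEncoder
    from unstructured.partition.auto import partition

    # Partition a document into elements, then attach an embedding to each element.
    elements = partition(filename="example-docs/fake-text.txt")

    embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
    elements = embedding_encoder.embed_documents(elements=elements)

    for element in elements:
        print(element.embeddings[:5], str(element)[:40])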
examples/embed/example.py (new file, 16 lines)

@@ -0,0 +1,16 @@
import os

from unstructured.documents.elements import Text
from unstructured.embed.openai import OpenAIEmbeddingEncoder

embedding_encoder = OpenAIEmbeddingEncoder(api_key=os.environ["OPENAI_API_KEY"])
elements = embedding_encoder.embed_documents(
    elements=[Text("This is sentence 1"), Text("This is sentence 2")],
)

query = "This is the query"
query_embedding = embedding_encoder.embed_query(query=query)

[print(e.embeddings, e) for e in elements]
print(query_embedding, query)
print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions())
unstructured/__version__.py

@@ -1 +1 @@
__version__ = "0.10.17-dev1"  # pragma: no cover
__version__ = "0.10.17-dev2"  # pragma: no cover
unstructured/documents/elements.py

@@ -48,6 +48,10 @@ class DataSourceMetadata:
    def to_dict(self):
        return {key: value for key, value in self.__dict__.items() if value is not None}

    @classmethod
    def from_dict(cls, input_dict):
        return cls(**input_dict)


@dc.dataclass
class CoordinatesMetadata:

@@ -200,6 +204,10 @@ class ElementMetadata:
            constructor_args["coordinates"] = CoordinatesMetadata.from_dict(
                constructor_args["coordinates"],
            )
        if constructor_args.get("data_source", None) is not None:
            constructor_args["data_source"] = DataSourceMetadata.from_dict(
                constructor_args["data_source"],
            )
        return cls(**constructor_args)

    def merge(self, other: ElementMetadata):
unstructured/embed/__init__.py (new file, 0 lines)
unstructured/embed/interfaces.py (new file, 32 lines)

@@ -0,0 +1,32 @@
from abc import ABC, abstractmethod
from typing import List, Tuple

from unstructured.documents.elements import Element


class BaseEmbeddingEncoder(ABC):
    @abstractmethod
    def initialize(self):
        """Initializes the embedding encoder class. Should also validate the instance
        is properly configured: e.g., embed a single element"""
        pass

    @property
    @abstractmethod
    def num_of_dimensions(self) -> Tuple[int]:
        """Number of dimensions for the embedding vector."""
        pass

    @property
    @abstractmethod
    def is_unit_vector(self) -> bool:
        """Denotes if the embedding vector is a unit vector."""
        pass

    @abstractmethod
    def embed_documents(self, elements: List[Element]) -> List[Element]:
        pass

    @abstractmethod
    def embed_query(self, query: str) -> List[float]:
        pass
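
For illustration only (not part of this commit), a minimal sketch of a custom subclass
showing what the abstract interface requires. The hash-based "embedding" is a hypothetical
stand-in for a real model:

```python
import hashlib
from typing import List, Tuple

from unstructured.documents.elements import Element, Text
from unstructured.embed.interfaces import BaseEmbeddingEncoder


class ToyHashEmbeddingEncoder(BaseEmbeddingEncoder):
    """Hypothetical encoder mapping text to a fixed-size pseudo-embedding (not semantically meaningful)."""

    def __init__(self, dimensions: int = 8):
        self.dimensions = dimensions
        self.initialize()

    def initialize(self):
        # Nothing to set up here; a real encoder would create and validate its client.
        pass

    @property
    def num_of_dimensions(self) -> Tuple[int]:
        return (self.dimensions,)

    @property
    def is_unit_vector(self) -> bool:
        return False

    def embed_query(self, query: str) -> List[float]:
        # Derive a deterministic pseudo-embedding from a SHA-256 digest of the text.
        digest = hashlib.sha256(query.encode("utf-8")).digest()
        return [byte / 255 for byte in digest[: self.dimensions]]

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        for element in elements:
            element.embeddings = self.embed_query(str(element))
        return elements


elements = ToyHashEmbeddingEncoder().embed_documents([Text("Hello world")])
print(elements[0].embeddings)
```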
unstructured/embed/openai.py (new file, 57 lines)

@@ -0,0 +1,57 @@
from typing import List, Optional

import numpy as np

from unstructured.documents.elements import (
    Element,
)
from unstructured.embed.interfaces import BaseEmbeddingEncoder
from unstructured.ingest.error import EmbeddingEncoderConnectionError
from unstructured.utils import requires_dependencies


class OpenAIEmbeddingEncoder(BaseEmbeddingEncoder):
    def __init__(self, api_key: str, model_name: Optional[str] = "text-embedding-ada-002"):
        self.api_key = api_key
        self.model_name = model_name
        self.initialize()

    def initialize(self):
        self.openai_client = self.get_openai_client()

    def num_of_dimensions(self):
        return np.shape(self.examplary_embedding)

    def is_unit_vector(self):
        return np.isclose(np.linalg.norm(self.examplary_embedding), 1.0)

    def embed_query(self, query):
        return self.openai_client.embed_documents(str(query))

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        embeddings = self.openai_client.embed_documents([str(e) for e in elements])
        elements_with_embeddings = self._add_embeddings_to_elements(elements, embeddings)
        return elements_with_embeddings

    def _add_embeddings_to_elements(self, elements, embeddings) -> List[Element]:
        assert len(elements) == len(embeddings)
        for i in range(len(elements)):
            elements[i].embeddings = embeddings[i]
        return elements

    @EmbeddingEncoderConnectionError.wrap
    @requires_dependencies(
        ["langchain", "openai"],
    )  # add extras="langchain" when it's added to the makefile
    def get_openai_client(self):
        if not hasattr(self, "openai_client"):
            """Creates a langchain OpenAI python client to embed elements."""
            from langchain.embeddings.openai import OpenAIEmbeddings

            openai_client = OpenAIEmbeddings(
                openai_api_key=self.api_key,
                model=self.model_name,
            )

            self.examplary_embedding = openai_client.embed_query("Q")
            return openai_client
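
Illustrative only (not part of this commit): because client creation embeds a probe query,
a bad API key or a missing optional dependency surfaces when the encoder is constructed.
Assuming EmbeddingEncoderConnectionError.wrap re-raises provider failures under that error
type (as its name and the error_string below suggest), callers could guard construction like this:

```python
import os

from unstructured.embed.openai import OpenAIEmbeddingEncoder
from unstructured.ingest.error import EmbeddingEncoderConnectionError

try:
    # __init__ -> initialize() -> get_openai_client() embeds the probe query "Q".
    encoder = OpenAIEmbeddingEncoder(api_key=os.environ.get("OPENAI_API_KEY", "missing-key"))
except EmbeddingEncoderConnectionError as exc:
    print(f"Could not connect to the embedding model provider: {exc}")
```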
unstructured/ingest/error.py

@@ -33,5 +33,9 @@ class DestinationConnectionError(CustomError):
    error_string = "Error in connecting to downstream data source: {}"


class EmbeddingEncoderConnectionError(CustomError):
    error_string = "Error in connecting to the embedding model provider: {}"


class PartitionError(CustomError):
    error_string = "Error in partitioning content: {}"