Original PR was #3069. Merged into a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.
------------
Original PR description:
Adding VoyageAI embeddings
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.
---------
Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
**Summary**
The serialization and deserialization (serde) of
`metadata.orig_elements` will be located in `unstructured.staging.base`
alongside `elements_to_json()` and other existing serde functions.
Improve the typing, readability, and structure of that module before
adding the new serde functions for `metadata.orig_elements`.
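For context, here is a sketch of the round trip the existing helpers already provide; the new `metadata.orig_elements` serde functions will sit alongside these (exact signatures may vary by version):
```
# Round-trip elements through the existing serde helpers in staging.base.
from unstructured.documents.elements import Text
from unstructured.staging.base import elements_from_json, elements_to_json

elements = [Text("hello world")]
json_str = elements_to_json(elements)              # serialize to a JSON string
round_tripped = elements_from_json(text=json_str)  # deserialize back to elements
assert round_tripped[0].text == "hello world"
```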
**Reviewers:** The commits are well-groomed and are probably quicker to
review commit-by-commit than as a single all-files-changed diff.
Thanks to Pedro at OctoAI, we have a new embedding option.
This PR adds support for OctoAI embeddings.
It is forked from the original OpenAI embeddings class; we removed the
LangChain adaptor and use OpenAI's SDK directly instead.
Also updated the out-of-date example script.
Includes a new test file for OctoAI.
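As an illustration of the SDK-direct approach, a minimal sketch follows; the OctoAI endpoint URL and model name are assumptions, not taken from this PR:
```
# Hypothetical sketch: OpenAI's SDK pointed at an OctoAI-compatible endpoint.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OCTOAI_TOKEN"],
    base_url="https://text.octoai.run/v1",  # assumed endpoint URL
)
response = client.embeddings.create(
    model="thenlper/gte-large",  # assumed embedding model name
    input=["An example sentence to embed."],
)
print(len(response.data[0].embedding))
```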
# Testing
Get a token from our platform at: https://www.octoai.cloud/
For testing, one can do the following:
```
export OCTOAI_TOKEN=<your octo token>
python3 examples/embed/example_octoai.py
```
## Testing done
Validated by running the above script from within a locally built
container via `make docker-start-dev`.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
### Description
Leverage a pattern similar to the one used for connectors: a nested
config dataclass as a field, along with cached content such as the
client and a sample embedding for each encoder. This required an update
to the embeddings config in ingest; I left a TODO there because the
current approach breaks on other encoders such as Bedrock, since the
parameters in that config don't map to all encoders. This change keeps
the existing functionality working.
This update makes sure all variables associated with the dataclass exist
when it is instantiated, rather than being added in the `__post_init__()`
or `initialize()` methods, allowing other libraries like pydantic to
generate schemas from it appropriately. It also now follows the connector
pattern: each class has a nested config class used to instantiate the
client, as well as a field/property approach used to cache the client.
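An illustrative sketch of that pattern (names here are made up for illustration, not the library's actual classes):
```
# All fields are declared on the dataclass so schema generators (e.g. pydantic)
# can see them; the client is cached behind a property instead of being built
# in __post_init__().
from dataclasses import dataclass, field
from typing import Any, Optional


def make_client(config: "ExampleEmbeddingConfig") -> Any:
    """Stand-in for real client construction (e.g., an SDK call)."""
    return {"api_key": config.api_key, "model": config.model_name}


@dataclass
class ExampleEmbeddingConfig:
    api_key: str
    model_name: str = "example-model"


@dataclass
class ExampleEmbeddingEncoder:
    config: ExampleEmbeddingConfig
    _client: Optional[Any] = field(init=False, default=None)

    @property
    def client(self) -> Any:
        # Lazily create and cache the client on first access.
        if self._client is None:
            self._client = make_client(self.config)
        return self._client
```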
Closes #1782
This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible now that a deterministic provider can be selected
Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs locally, making it
zero-cost
For all these reasons, testing the embedding modules with the
HuggingFace model makes sense.
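A sketch of that determinism check, using the `langchain` API current at the time of this PR (later versions moved this class to `langchain_community`):
```
# HuggingFaceEmbeddings runs a local sentence-transformers model, so repeated
# calls on the same input produce identical vectors at zero API cost.
from langchain.embeddings import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings()  # downloads the default model on first use
first = embedder.embed_documents(["This is a test document."])
second = embedder.embed_documents(["This is a test document."])
assert first == second  # deterministic even with no seed set
```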
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Currently, when the OpenAIEmbeddingEncoder adds embeddings to Elements in
`_add_embeddings_to_elements`, it overwrites each Element's `to_dict`
method, mistakenly resulting in each Element having identical values
with the exception of the actual embedding value. This was due to the
way a nested `new_to_dict` function was used for the override. Instead,
this updates the original definition of Element itself to accommodate
the `embeddings` field when available. This also adds a test to validate
that values are not duplicated.
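An illustrative reconstruction of the failure mode (simplified stand-in classes, not the original code): rebinding a method with a nested function that closes over a loop variable leaves every object reading shared state.
```
# Simplified stand-in for Element; demonstrates why the per-element override
# duplicated values across elements.
class Item:
    def __init__(self, text):
        self.text = text

    def to_dict(self):
        return {"text": self.text}


items = [Item("a"), Item("b")]
for item in items:
    original = item.to_dict

    def new_to_dict():
        # `original` is looked up when called, after the loop has finished,
        # so every override serializes the *last* element's state.
        return {**original(), "embedding": [0.0]}

    item.to_dict = new_to_dict

print([i.to_dict()["text"] for i in items])  # ['b', 'b'] -- duplicated values
```
Storing `embeddings` as an ordinary field on Element and leaving `to_dict` untouched avoids the shared-closure problem entirely.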