Original PR was #3069. Merged into a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.
------------
Original PR description:
Adding VoyageAI embeddings
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.
---------
Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
Thanks to Pedro at OctoAI we have a new embedding option.
This PR adds support for OctoAI embeddings.
The new class is forked from the original OpenAI embeddings class. We removed
the LangChain adaptor and call OpenAI's SDK directly instead.
This PR also updates the out-of-date example script and adds a new test file
for OctoAI.
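As a rough illustration of calling the OpenAI SDK directly rather than through the LangChain adaptor, here is a minimal sketch. It is not the code from this PR: `embed_texts`, `make_octoai_client`, and `fake_client` are hypothetical names, and the endpoint URL and model name are left to the caller because they are not stated here. The only assumptions about the SDK are that the client accepts `api_key` and `base_url`, and that `client.embeddings.create(model=..., input=...)` returns an object whose `data` items each carry an `embedding`.

```python
import os
from types import SimpleNamespace
from typing import List


def embed_texts(client, texts: List[str], model: str) -> List[List[float]]:
    """Call an OpenAI-SDK-style client and return one vector per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def make_octoai_client(base_url: str):
    """Build an OpenAI SDK client pointed at an OpenAI-compatible endpoint.

    base_url is supplied by the caller (assumption: the OctoAI endpoint is
    OpenAI-compatible). The token is read from OCTOAI_TOKEN, as in the
    testing instructions.
    """
    from openai import OpenAI  # lazy import so the sketch loads without the SDK

    return OpenAI(api_key=os.environ["OCTOAI_TOKEN"], base_url=base_url)


def fake_client(dim: int = 3):
    """Offline stub that mimics the response shape of embeddings.create."""

    def create(model, input):
        data = [SimpleNamespace(embedding=[float(len(t))] * dim) for t in input]
        return SimpleNamespace(data=data)

    return SimpleNamespace(embeddings=SimpleNamespace(create=create))
```

With a real token, `embed_texts(make_octoai_client(base_url), texts, model)` would perform the network call; the stub lets the same function be exercised offline.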
# Testing
Get a token from our platform at: https://www.octoai.cloud/
To test, export the token and run the example script:
```
export OCTOAI_TOKEN=<your octo token>
python3 examples/embed/example_octoai.py
```
## Testing done
Validated by running the above script from within a locally built container
started via `make docker-start-dev`.
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372
This module:
- implements `EmbeddingEncoder` classes, which track embedding-related data
- implements an `embed_documents` method, which receives a list of Elements,
  obtains embeddings for the text within them, stores the result on each
  Element in an attribute named `embeddings`, and returns the updated Elements
- uses langchain to obtain the embeddings
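
The `embed_documents` flow in the list above can be sketched as follows. This is an illustrative approximation, not the module's actual code: `Element`, `SketchEmbeddingEncoder`, and `FakeBackend` are hypothetical stand-ins, and the backend is assumed to expose a LangChain-style `embed_documents(texts)` method that returns one vector per text.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Element:
    """Hypothetical stand-in for the library's Element type."""

    text: str
    embeddings: Optional[List[float]] = None


class EmbeddingBackend(Protocol):
    """Anything exposing a LangChain-style embed_documents(texts) call."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


class SketchEmbeddingEncoder:
    """Illustrative encoder; not the class added in this PR."""

    def __init__(self, backend: EmbeddingBackend) -> None:
        self.backend = backend

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        # One batched call for all texts, then attach each vector in order.
        vectors = self.backend.embed_documents([e.text for e in elements])
        for element, vector in zip(elements, vectors):
            element.embeddings = vector
        return elements


class FakeBackend:
    """Deterministic toy backend so the sketch runs without network access."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t)), float(t.count(" "))] for t in texts]
```

The elements are updated in place and the same list is returned, which matches the "updates the Elements and returns them" wording; whether the real implementation copies the Elements first is not stated here.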
-----
- This PR additionally fixes a JSON deserialization issue in the
metadata fields.
To test the changes, run `examples/embed/example.py`