Closes#1782
This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers
Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost
For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Currently when the OpenAIEmbeddingEncoder adds embeddings to Elements in
`_add_embeddings_to_elements` it overwrites each Element's `to_dict`
method, mistakenly resulting in each Element having identical values
with the exception of the actual embedding value. This was due to the
way it leverages a nested `new_to_dict` method to overwrite. Instead,
this updates the original definition of Element itself to accommodate
the `embeddings` field when available. This also adds a test to validate
that values are not duplicated.