mirror of https://github.com/deepset-ai/haystack.git
synced 2025-10-29 08:49:07 +00:00

proposal: New EmbeddingRetriever for Haystack 2.0 (#3558)

* Add EmbeddingRetriever proposal
* Update with Sara's feedback
* Consistent naming

Commit c28f6688f5 (parent 77cea8b140), adding new file proposals/text/3558-embedding_retriever.md (129 lines)
- Start Date: 2022-11-11
- Proposal PR: https://github.com/deepset-ai/haystack/pull/3558
- Github Issue:
# Summary

- The current EmbeddingRetriever doesn't allow Haystack users to provide new embedding methods; it is
restricted to farm, transformers, sentence transformers, OpenAI, and Cohere based
embedding approaches. Any new encoding method needs to be explicitly added to Haystack
and registered with the EmbeddingRetriever.

- We should allow users to easily plug new embedding methods into EmbeddingRetriever. For example, a Haystack user should be able to
add custom embeddings without having to commit additional code to the Haystack repository.
# Basic example

EmbeddingRetriever is instantiated with:
``` python
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)
```
- The current approach doesn't provide a pluggable abstraction point of composition but
rather attempts to satisfy the various embedding methodologies with an ever-expanding list of
parameters.
- The new approach allows creation of the underlying embedding mechanism (EmbeddingEncoder),
which is in turn plugged into EmbeddingRetriever. For example:
``` python
encoder = SomeNewFancyEmbeddingEncoder(api_key="asdfklklja",
                                       query_model="text-search-query",
                                       doc_model="text-search-doc")
```
- The EmbeddingEncoder is then used to create the EmbeddingRetriever. The EmbeddingRetriever
init method doesn't get polluted with additional parameters, as all the peculiarities
of a particular encoding methodology are contained in its abstraction layer.
``` python
retriever = EmbeddingRetriever(
    document_store=document_store,
    encoder=encoder
)
```
# Motivation

- Why are we doing this? What use cases does it support? What is the expected outcome?
We could certainly keep the current solution as is; it does implement a decent level
of composition/decoration to lower the coupling between EmbeddingRetriever and the underlying
embedding mechanism (sentence transformers, OpenAI, etc.). However, the current mechanism
hard-codes the available embedding implementations and prevents our users from
adding new embedding mechanisms themselves, outside of the Haystack repository. We also might
want to have a non-public dC embedding mechanism in the future; in the current design, a non-public
dC embedding mechanism would be impractical. In addition, the more underlying implementations we
add, the more we "pollute" the EmbeddingRetriever init method with additional parameters.
This is certainly less than ideal long term.
- EmbeddingEncoder classes should be subclasses of BaseComponent! As subclasses of BaseComponent,
we can use them outside the EmbeddingRetriever context, for example in indexing pipelines to generate
embeddings. We currently employ the kludge of using Retrievers for this, which is quite counter-intuitive
and confusing for our users.
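As a sketch of that standalone use (the class names and the pipeline-free driver below are illustrative assumptions, not the actual Haystack API), an Embedder could generate document embeddings at indexing time without any Retriever involved:

```python
from typing import Dict, List


class SentenceTransformerEmbedder:
    """Hypothetical stub of a standalone Embedder; the real class would
    load the named model and call it in embed_documents."""

    def __init__(self, model: str):
        self.model = model

    def embed_documents(self, docs: List[str]) -> List[List[float]]:
        # Toy vectors for illustration: one dimension, the document length.
        return [[float(len(d))] for d in docs]


def index(documents: List[str], embedder: SentenceTransformerEmbedder) -> List[Dict]:
    """Attach embeddings to documents at indexing time, no Retriever needed."""
    return [
        {"content": d, "embedding": e}
        for d, e in zip(documents, embedder.embed_documents(documents))
    ]
```

The point of the sketch is the shape of the call site: an indexing step depends only on the Embedder, not on a Retriever wrapper.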
- EmbeddingEncoder class names might sound overly complicated, especially with a distinguishing mechanism
name prepended (i.e. CohereEmbeddingEncoder). Therefore, we'll adopt the <specific>Embedder
naming scheme, i.e. CohereEmbedder, SentenceTransformerEmbedder, and so on.
# Detailed design
- Our new EmbeddingRetriever would still wrap the underlying encoding mechanism in the form of
_BaseEmbedder. _BaseEmbedder still needs to implement the methods:
  - embed_queries
  - embed_documents
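One way to read the _BaseEmbedder contract is as a minimal abstract class (a sketch: the abstract-class form, type hints, and the DummyEmbedder are assumptions for illustration, not the proposed implementation):

```python
from abc import ABC, abstractmethod
from typing import List


class _BaseEmbedder(ABC):
    """Hypothetical contract: every Embedder turns text into vectors."""

    @abstractmethod
    def embed_queries(self, queries: List[str]) -> List[List[float]]:
        ...

    @abstractmethod
    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        ...


class DummyEmbedder(_BaseEmbedder):
    """Toy subclass that returns fixed-size zero vectors, showing that a
    user-defined Embedder needs only these two methods."""

    def __init__(self, dim: int = 4):
        self.dim = dim

    def embed_queries(self, queries: List[str]) -> List[List[float]]:
        return [[0.0] * self.dim for _ in queries]

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        return [[0.0] * self.dim for _ in documents]
```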
- Where the new design differs is in the creation of EmbeddingRetriever - rather than hiding the underlying encoding
mechanism, one simply creates the EmbeddingRetriever with a specific encoder directly. For example:
``` python
retriever = EmbeddingRetriever(
    document_store=document_store,
    encoder=OpenAIEmbedder(api_key="asdfklklja", model="ada"),
    # additional EmbeddingRetriever-abstraction-level parameters
)
```
- If the "two-step approach" to EmbeddingRetriever initialization turns out not to be the ideal solution (due to issues with the current
schema generation and loading/saving via YAML pipelines), we might simply add an EmbeddingRetriever
class for every supported encoding approach. For example, we could have OpenAIEmbeddingRetriever, CohereEmbeddingRetriever,
SentenceTransformerEmbeddingRetriever, and so on. Each of these retrievers would delegate the bulk of the work to an
existing EmbeddingRetriever with a per-class-specific Embedder set in the class constructor (for the custom
encoding part). We'd get the best of both worlds. Each <Specific>EmbeddingRetriever will have only the relevant primitive
parameters in its **init()** constructor; the underlying EmbeddingRetriever attribute in <Specific>EmbeddingRetriever
will handle most of the business logic of retrieving, yet each retriever will use an appropriate per-class-specific
Embedder for the custom encoding part.
|
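The delegation idea above can be sketched as follows (the first two classes are stand-in stubs for the proposed API; all names and signatures are illustrative assumptions, not final):

```python
class OpenAIEmbedder:
    """Stub standing in for the proposed OpenAI Embedder."""

    def __init__(self, api_key: str, model: str):
        self.api_key, self.model = api_key, model


class EmbeddingRetriever:
    """Stub standing in for the generic retriever that owns the business logic."""

    def __init__(self, document_store, encoder):
        self.document_store = document_store
        self.encoder = encoder


class OpenAIEmbeddingRetriever(EmbeddingRetriever):
    """Exposes only OpenAI-relevant primitives in its constructor and
    delegates everything else to the generic EmbeddingRetriever with a
    per-class-specific Embedder."""

    def __init__(self, document_store, api_key: str, model: str = "ada"):
        super().__init__(
            document_store=document_store,
            encoder=OpenAIEmbedder(api_key=api_key, model=model),
        )
```

Under this sketch, YAML schema generation would see one flat constructor per retriever class, while the encoding peculiarities stay inside the Embedder.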
# Drawbacks

- The main shortcomings are:
  - The "two-step approach" in EmbeddingRetriever initialization
  - Likely an issue for the current schema generation and loading/saving via YAML pipelines (see the solution above)
  - It is an API-breaking change, so it'll require code updates for all EmbeddingRetriever usage, both in our codebase and for Haystack users
  - It can only be done in a major release, along with other breaking changes
# Alternatives

We could certainly keep everything as is :-)
# Adoption strategy

- As it is a breaking change, we should implement it for the next major release.
# How do we teach this?

- This change would require only a minor change in documentation.
- The concept of the embedding retriever remains; just the mechanics are slightly changed.
- All docs and tutorials need to be updated.
- Haystack users are informed about the possibility to create and use their own embedders for the embedding retriever.
# Unresolved questions

Optional, but suggested for first drafts. What parts of the design are still
TBD?