From c28f6688f5fb6e683a9b4a49131d7db3f0289ae2 Mon Sep 17 00:00:00 2001
From: Vladimir Blagojevic <dovlex@gmail.com>
Date: Mon, 12 Dec 2022 10:06:35 +0100
Subject: [PATCH] proposal: New EmbeddingRetriever for Haystack 2.0 (#3558)

* Add EmbeddingRetriever proposal

* Update with Sara's feedback

* Consistent naming
---
 proposals/text/3558-embedding_retriever.md | 129 +++++++++++++++++++++
 1 file changed, 129 insertions(+)
 create mode 100644 proposals/text/3558-embedding_retriever.md

diff --git a/proposals/text/3558-embedding_retriever.md b/proposals/text/3558-embedding_retriever.md
new file mode 100644
index 000000000..d245e0fbf
--- /dev/null
+++ b/proposals/text/3558-embedding_retriever.md
@@ -0,0 +1,129 @@
+- Start Date: 2022-11-11
+- Proposal PR: https://github.com/deepset-ai/haystack/pull/3558
+- Github Issue: 
+  # Summary  
+    
+- Current EmbeddingRetriever doesn't allow Haystack users to provide new embedding methods and is 
+  currently constricted to farm, transformers, sentence transformers, OpenAI and Cohere based 
+  embedding approaches. Any new encoding methods need to be explicitly added to Haystack 
+  and registered with the EmbeddingRetriever.  
+ 
+
+- We should allow users to easily plug-in new embedding methods to EmbeddingRetriever. For example, a Haystack user should be able to 
+  add custom embeddings without having to commit additional code to Haystack repository.
+
+  # Basic example  
+    EmbeddingRetriever is instantiated with:  
+
+  ``` python
+	  retriever = EmbeddingRetriever(
+	      document_store=document_store,
+	      embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
+	      model_format="sentence_transformers",
+	  )
+  ```
+- The current approach doesn't provide a pluggable abstraction point of composition but 
+  rather attempts to satisfy various embedding methodologies by having a lot of 
+  parameters which keep ever expanding.
+
+
+- The new approach allows creation of the underlying embedding mechanism (EmbeddingEncoder) 
+  which is then in turn plugged into EmbeddingRetriever.  For example:  
+ 
+  ``` python
+	  encoder = SomeNewFancyEmbeddingEncoder(api_key="asdfklklja", 
+                                          query_model="text-search-query",
+                                          doc_model="text-search-doc")
+  ```
+
+- EmbeddingEncoder is then used for the creation of EmbeddingRetriever. EmbeddingRetriever 
+  init method doesn't get polluted with additional parameters as all of the peculiarities 
+  of a particular encoder methodology are contained on in its abstraction layer.  
+
+  ``` python
+	  retriever = EmbeddingRetriever(
+	      document_store=document_store,
+	      encoder=encoder
+	  )
+  ```
+  
+  # Motivation
+
+- Why are we doing this? What use cases does it support? What is the expected outcome?  
+    
+  We could certainly keep the current solution as is; it does implement a decent level 
+  of composition/decoration to lower coupling between EmbeddingRetriever and the underlying 
+  mechanism of embedding (sentence transformers, OpenAI, etc). However, the current mechanism 
+  in place basically hard-codes available embedding implementations and prevents our users from 
+  adding new embedding mechanism by themselves outside of Haystack repository. We also might 
+  want to have a non-public dC embedding mechanism in the future. In the current design a non-public 
+  dC embedding mechanism would be impractical. In addition, the more underlying implementations we 
+  add we'll continue to "pollute" EmbeddingRetriever init method with more and more parameters. 
+  This is certainly less than ideal long term.  
+
+
+- EmbeddingEncoder classes should be subclasses of BaseComponent! As subclasses of BaseComponent,
+  we can use them outside the EmbeddingRetriever context in indexing pipelines, generating the
+  embeddings. We are currently employing a kludge of using Retrievers which is quite counter-intuitive
+  and confusing for our users.
+
+
+- EmbeddingEncoder classes might sound overly complicated, especially with a distinguishing mechanism
+  name pre-appended (i.e CohereEmbeddingEncoder). Therefore, we'll adopt <specific>Embedder
+  naming scheme, i.e. CohereEmbedder, SentenceTransformerEmbedder and so on.
+
+  # Detailed design  
+
+- Our new EmbeddingRetriever would still wrap the underlying encoding mechanism in the form of 
+  _BaseEmbedder. _BaseEmbedder still needs to implement methods:
+	- embed_queries  
+	- embed_documents 
+
+
+- The new design approach differs is in the creation of EmbeddingRetriever - rather than hiding the underlying encoding 
+  mechanism one could simply create the EmbeddingRetriever with a specific encoder directly. For example:  
+
+  ```
+	  retriever = EmbeddingRetriever(
+	      document_store=document_store,
+	      encoder=OpenAIEmbedder(api_key="asdfklklja", model="ada"),
+	      #additional EmbeddingRetriever-abstraction-level parameters
+	  )
+  ```
+ 
+- If the "two-step approach" of EmbeddingRetriever initialization is no longer the ideal solution (issues with current 
+  schema generation and loading/saving via YAML pipelines) we might simply add the EmbeddingRetriever 
+  class for every supported encoding approach. For example, we could have OpenAIEmbeddingRetriever, CohereEmbeddingRetriever, 
+  SentenceTransformerEmbeddingRetriever and so on. Each of these retrievers will delegate the bulk of the work to an 
+  existing EmbeddingRetriever with a per-class-specific Embedder set in the class constructor (for that custom
+  encoding part). We'll get the best of both worlds. Each <Specific>EmeddingRetriever will have only the relevant primitives 
+  parameters for the **init()** constructor; the underlying EmbeddingRetriever attribute in <Specific>EmeddingRetriever 
+  will handle most of the business logic of retrieving, yet each retriever will use an appropriate per-class-specific 
+  Embedder for the custom encoding part.
+
+
+ 
+  # Drawbacks  
+- The main shortcoming are:  
+	- The "two-step approach" in EmbeddingRetriever initialization  
+	- Likely be an issue for the current schema generation and loading/saving via YAML pipelines (see solution above)  
+	- It is a API breaking change so it'll require code update for all EmbeddingRetriever usage both in our codebase and for Haystack users  
+	- Can only be done in major release along with other breaking changes  
+
+  # Alternatives  
+    
+  We could certainly keep everything as is :-)
+
+  # Adoption strategy  
+- As it is a breaking change, we should implement it for the next major release.  
+ 
+  # How do we teach this?  
+- This change would require only a minor change in documentation.  
+- The concept of embedding retriever remains, just the mechanics are slightly changed  
+- All docs and tutorials need to be updated  
+- Haystack users are informed about a possibility to create and use their own embedders for embedding retriever.
+- # Unresolved questions  
+    
+  Optional, but suggested for first drafts. What parts of the design are still  
+  TBD?  
+ 
\ No newline at end of file