<!---
title: "Retriever"
metaTitle: "Retriever"
metaDescription: ""
slug: "/docs/retriever"
date: "2020-09-03"
id: "retrievermd"
--->

# Retriever

The Retriever is a lightweight filter that can quickly go through the full document store and pass a set of candidate documents to the Reader.
It is a tool for sifting out the obvious negative cases, saving the Reader from doing more work than it needs to and speeding up the querying process.

<div class="recommendation">

**Recommendations**

* BM25 (sparse)

* Dense Passage Retrieval (dense)

</div>

<!-- _comment: !! Example speedup from slides !! -->
<!-- _comment: !! Benchmarks !! -->
Note that not all Retrievers can be paired with every DocumentStore.
Here are the combinations which are supported:

|           | Memory | Elasticsearch | SQL | FAISS |
| --------- | ------ | ------------- | --- | ----- |
| BM25      | N      | Y             | N   | N     |
| TF-IDF    | Y      | Y             | Y   | N     |
| Embedding | Y      | Y             | N   | Y     |
| DPR       | Y      | Y             | N   | Y     |

See [Optimization](/docs/latest/optimizationmd) for suggestions on how to choose top-k values.

## TF-IDF

### Description

TF-IDF is a commonly used baseline for information retrieval that exploits two key intuitions:

* documents that have more lexical overlap with the query are more likely to be relevant

* words that occur in fewer documents are more significant than words that occur in many documents

Given a query, a tf-idf score is computed for each document as follows:

```python
score = tf * idf
```

Where:

* `tf` is how many times words in the query occur in that document.

* `idf` is the inverse of the fraction of documents containing the word.

In practice, both terms are usually log normalised. If you’d like to learn more about the exact details of the algorithm,
have a look at [this video](https://www.youtube.com/watch?v=hNXwhF0OZ_o).

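To make the scoring above concrete, here is a toy, self-contained sketch of log normalised tf-idf scoring in plain Python. It only illustrates the formula; it is not how Haystack's `TfidfRetriever` computes scores internally.

```python
import math
from collections import Counter

def tfidf_score(query, document, corpus):
    """Toy log-normalised tf-idf score of one document for a query."""
    doc_terms = Counter(document.lower().split())
    n_docs = len(corpus)
    score = 0.0
    for term in set(query.lower().split()):
        tf = doc_terms[term]                                  # term frequency in this document
        df = sum(term in d.lower().split() for d in corpus)   # number of documents containing the term
        if tf and df:
            score += (1 + math.log(tf)) * math.log(n_docs / df)
    return score

corpus = ["Paris is the capital of France", "Berlin is the capital of Germany"]
print(tfidf_score("capital of France", corpus[0], corpus))
```
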
### Initialisation

```python
document_store = InMemoryDocumentStore()
...
retriever = TfidfRetriever(document_store)
...
finder = Finder(reader, retriever)
```

<div class="recommendation">

**Tip:** The Finder class is being deprecated and has been replaced by a more powerful [Pipelines class](/docs/latest/pipelinesmd).

</div>

## BM25 (Recommended)

### Description

BM25 is a variant of TF-IDF that we recommend you use if you are looking for a retrieval method that does not need a neural network for indexing.
It improves upon its predecessor in two main aspects:

* It saturates `tf` after a set number of occurrences of the given term in the document

* It normalises by document length so that short documents are favoured over long documents if they have the same amount of word overlap with the query

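To illustrate those two modifications, here is a toy sketch of the Okapi BM25 term score with its usual `k1` (saturation) and `b` (length normalisation) parameters. Elasticsearch computes this for you; the snippet is purely explanatory.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Toy BM25 contribution of a single query term to a document's score."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # rarer terms weigh more
    tf_saturated = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_saturated

# The same term frequency counts for less in a longer document,
# and repeated occurrences of a term quickly stop adding score.
print(bm25_term_score(tf=3, df=5, n_docs=1000, doc_len=80, avg_doc_len=120))
print(bm25_term_score(tf=3, df=5, n_docs=1000, doc_len=300, avg_doc_len=120))
```
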
### Initialisation

```python
document_store = ElasticsearchDocumentStore()
...
retriever = ElasticsearchRetriever(document_store)
...
finder = Finder(reader, retriever)
```

<div class="recommendation">

**Tip:** The Finder class is being deprecated and has been replaced by a more powerful [Pipelines class](/docs/latest/pipelinesmd).

</div>

See [this](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables) blog post for more details about the algorithm.

<!-- _comment: !! Diagram showing TFIDF vs BM25 !! -->
## Dense Passage Retrieval (Recommended)

### Description

[Dense Passage Retrieval](https://arxiv.org/abs/2004.04906) is a highly performant retrieval method that calculates relevance using dense representations.
Key features:

* One BERT base model to encode documents

* One BERT base model to encode queries

* Ranking of documents done by dot product similarity between query and document embeddings (see the sketch below)

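In place of a diagram, here is a minimal sketch of that ranking step using made-up vectors; in a real setup the embeddings come from the DPR query and passage encoders.

```python
import numpy as np

# Stand-in embeddings: random vectors only to show the ranking step.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(5, 768))   # 5 documents, 768-dim vectors
query_embedding = rng.normal(size=768)

scores = doc_embeddings @ query_embedding    # dot product similarity per document
ranking = np.argsort(-scores)                # highest score first
print(ranking, scores[ranking])
```
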

<!-- _comment: !! Diagram !! -->
Indexing using DPR is comparatively expensive in terms of required computation since all documents in the database need to be processed through the transformer.
The embeddings that are created in this step can be stored in FAISS, a database optimized for vector similarity.
DPR can also work with the ElasticsearchDocumentStore or the InMemoryDocumentStore.

<div class="recommendation">

**Tip**

When using DPR, it is recommended that you use the dot product similarity function since that is how it is trained.
To do so, simply provide `similarity="dot_product"` when initializing the DocumentStore
as is done in the code example below.

</div>

There are two design decisions that have made DPR particularly performant.

* Separate encoders for document and query help since queries are much shorter than documents

* Training with ‘In-batch negatives’ (the gold passages of the other questions in the same batch are treated as negative examples) is highly efficient, as sketched below

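The in-batch negatives objective can be illustrated with a toy similarity matrix. The embeddings below are random stand-ins with a small dimensionality to keep the numbers readable; this only sketches the training objective, it is not Haystack's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
query_emb = rng.normal(size=(3, 8))      # a batch of 3 queries (stand-in embeddings)
passage_emb = rng.normal(size=(3, 8))    # their 3 gold passages

sim = query_emb @ passage_emb.T          # 3 x 3 matrix: rows = queries, columns = passages
# For query i, passage i is the positive; the other passages in the batch are
# reused as negatives. Training maximises the softmax probability of the diagonal.
probs = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
loss = -np.log(np.diag(probs)).mean()
print(loss)
```
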

In Haystack, you can simply download the pretrained encoders needed to start using DPR.
If you’d like to learn how to set up a DPR-based system, have a look at our tutorials.

### Initialisation

```python
document_store = FAISSDocumentStore(similarity="dot_product")
...
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)
...
finder = Finder(reader, retriever)
```

<div class="recommendation">

**Tip:** The Finder class is being deprecated and has been replaced by a more powerful [Pipelines class](/docs/latest/pipelinesmd).

</div>

<div class="recommendation">

**Tip:** Haystack supports training of your own DPR model! Tutorial coming soon!

</div>

<!-- _comment: !! Training in future? !! -->
<!-- _comment: !! Talk more about benchmarks, SoTA, results !! -->
## Embedding Retrieval

### Description

In Haystack, you also have the option of using a single transformer model to encode document and query.
One style of model that is suited to this kind of retrieval is that of [Sentence Transformers](https://github.com/UKPLab/sentence-transformers).
These models are trained in Siamese Networks and use triplet loss such that they learn to embed similar sentences near to each other in a shared embedding space.

They are particularly suited to cases where your query input is similar in style to that of the documents in your database,
i.e. when you are searching for the most similar documents.
This is not inherently suited to query-based search, where the length, language and format of the query usually differ significantly from the searched-for text.

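To see what the cosine similarity recommended in the tip below actually measures, here is a small sketch with made-up embedding vectors; in practice the vectors come from the Sentence Transformers model configured in the retriever.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: direction matters, vector length does not."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in sentence embeddings, not real encoder outputs.
emb_query = np.array([0.2, 0.9, 0.1])
emb_doc_a = np.array([0.4, 1.8, 0.2])   # same direction as the query, longer vector
emb_doc_b = np.array([0.9, 0.1, 0.3])

print(cosine_similarity(emb_query, emb_doc_a))  # 1.0 despite the different magnitudes
print(cosine_similarity(emb_query, emb_doc_b))  # noticeably lower
```
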
<div class="recommendation">
|
||
|
||
**Tip**
|
||
|
||
When using Sentence Transformer models, we recommend that you use a cosine similarity function.
|
||
To do so, simply provide `similarity="cosine"` when initializing the DocumentStore
|
||
as is done in the code example below.
|
||
|
||
</div>
|
||
|
||
### Initialisation

```python
document_store = ElasticsearchDocumentStore(similarity="cosine")
...
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="deepset/sentence_bert")
...
finder = Finder(reader, retriever)
```

<div class="recommendation">

**Tip:** The Finder class is being deprecated and has been replaced by a more powerful [Pipelines class](/docs/latest/pipelinesmd).

</div>

## Deeper Dive: Dense vs Sparse

Broadly speaking, retrieval methods can be split into two categories: **dense** and **sparse**.

**Sparse** methods, like TF-IDF and BM25, operate by looking for shared keywords between the document and query.
They are:

* simple but effective

* don’t need to be trained

* work on any language

More recently, **dense** approaches such as Dense Passage Retrieval (DPR) have shown even better performance than their sparse counterparts.
These methods embed both document and query into a shared embedding space using deep neural networks
and the top candidates are the nearest neighbour documents to the query.
They are:

* powerful but computationally more expensive, especially during indexing

* trained using labelled datasets

* language specific

### Qualitative Differences

There are also some qualitative differences between the two types.
For example, sparse methods treat text as a bag-of-words, meaning that they **do not take word order and syntax into account**,
while the latest generation of dense methods use transformer-based encoders
which are designed to be **sensitive** to these factors.

Also, dense methods are very capable of building strong semantic representations of text,
but they **struggle when encountering out-of-vocabulary** words such as new names.
By contrast, sparse methods don’t need to learn representations of words,
they only care about whether they are present or absent in the text.
As such, **they handle out-of-vocabulary words with no problem**.

<!-- _comment: !! Show example from DPR paper? !! -->
### Indexing

Dense methods perform indexing by processing all the documents through a neural network and storing the resulting vectors.
This is a much more expensive operation than the creation of the inverted index in sparse methods
and will require significant computational power and time.

<!-- _comment: !!See their individual sections (!! link !!) for more details on this point. Benchmarks too !! -->
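As a rough sketch of what this indexing step looks like in Haystack, assuming the v0.6-era API (import paths and method names may differ in other versions):

```python
# Assumed v0.6-era import paths; adjust to your installed Haystack version.
from haystack.document_store.faiss import FAISSDocumentStore
from haystack.retriever.dense import DensePassageRetriever

document_store = FAISSDocumentStore(similarity="dot_product")
document_store.write_documents([{"text": "Paris is the capital of France.", "meta": {"name": "paris"}}])

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base"
)
# The expensive part: every stored document is passed through the passage
# encoder and the resulting vectors are written back to the document store.
document_store.update_embeddings(retriever)
```
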
### Terminology

<!-- _comment: !! Diagram of what a sparse vector looks like vs dense vector. !! -->
<!-- _comment: !! This section should be turned into something more like a side note !! -->
The terms **dense** and **sparse** refer to the representations that the algorithms build for each document and query.
**Sparse** methods characterise texts using vectors with one dimension corresponding to each word in the vocabulary.
Dimensions will be zero if the word is absent and non-zero if it is present.
Since most documents contain only a small subset of the full vocabulary,
these vectors are considered sparse since non-zero values are few and far between.

**Dense** methods, by contrast, pass text as input into neural network encoders
and represent text in a vector of a manually defined size (usually 768).
Though individual dimensions are not mapped to any corresponding vocabulary or linguistic feature,
each dimension encodes some information about the text.
There are rarely 0s in these vectors, hence their relative density.
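
As a small, made-up illustration of the two kinds of vector (the vocabulary here is tiny and the dense values are random stand-ins for an encoder's output):

```python
import numpy as np

vocabulary = ["berlin", "capital", "france", "germany", "is", "of", "paris", "the"]
text = "Paris is the capital of France"

# Sparse representation: one dimension per vocabulary word, mostly zeros.
sparse_vector = np.array([1 if word in text.lower().split() else 0 for word in vocabulary])
print(sparse_vector)            # [0 1 1 0 1 1 1 1]

# Dense representation: a fixed-size vector from a neural encoder
# (random numbers stand in for the encoder output), almost no zeros.
dense_vector = np.random.default_rng(0).normal(size=768)
print(dense_vector[:5])
```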