<a name="sparse"></a>
# sparse
<a name="sparse.ElasticsearchRetriever"></a>
## ElasticsearchRetriever
```python
class ElasticsearchRetriever(BaseRetriever)
```
<a name="sparse.ElasticsearchRetriever.__init__"></a>
#### \_\_init\_\_
```python
| __init__(document_store: ElasticsearchDocumentStore, custom_query: str = None)
```
**Arguments**:
- `document_store`: an instance of a DocumentStore to retrieve documents from.
- `custom_query`: query string as per Elasticsearch DSL with a mandatory query placeholder (`${question}`).
Optionally, an ES `filter` clause can be added whose `terms` values are placeholders
that get substituted at runtime. The placeholder names (`${filter_name_1}`, `${filter_name_2}`, ...)
must match the keys of the `filters` dict supplied to `self.retrieve()`.
An example custom_query:

```
{
    "size": 10,
    "query": {
        "bool": {
            "should": [
                {"multi_match": {
                    "query": "${question}",             // mandatory ${question} placeholder
                    "type": "most_fields",
                    "fields": ["text", "title"]
                }}
            ],
            "filter": [                                 // optional custom filters
                {"terms": {"year": "${years}"}},
                {"terms": {"quarter": "${quarters}"}},
                {"range": {"date": {"gte": "${date}"}}}
            ]
        }
    }
}
```

For this custom_query, a sample `retrieve()` call could be:

```python
self.retrieve(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
```
<a name="sparse.ElasticsearchFilterOnlyRetriever"></a>
## ElasticsearchFilterOnlyRetriever
```python
class ElasticsearchFilterOnlyRetriever(ElasticsearchRetriever)
```
Naive "Retriever" that returns all documents that match the given filters. No impact of query at all.
Helpful for benchmarking, testing and if you want to do QA on small documents without an "active" retriever.
<a name="sparse.TfidfRetriever"></a>
## TfidfRetriever
```python
class TfidfRetriever(BaseRetriever)
```
Reads all documents from a SQL backend.
Splits documents into smaller units (e.g., paragraphs or pages) to reduce the
computation when text is passed on to a Reader for QA.
It uses sklearn's TfidfVectorizer to compute a tf-idf matrix.
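A minimal usage sketch, assuming documents have already been written to an SQL-backed DocumentStore (import paths and the database URL are illustrative and may differ across Haystack versions):

```python
from haystack.document_store.sql import SQLDocumentStore  # path may vary by version
from haystack.retriever.sparse import TfidfRetriever

# Assumes documents were previously written to this SQL-backed store
document_store = SQLDocumentStore(url="sqlite:///qa.db")

# Builds the tf-idf matrix over the documents in the store
retriever = TfidfRetriever(document_store=document_store)

# Rank document units by tf-idf similarity to the query
docs = retriever.retrieve(query="Who created the Python language?", top_k=5)
```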
<a name="dense"></a>
# dense
<a name="dense.DensePassageRetriever"></a>
## DensePassageRetriever
```python
class DensePassageRetriever(BaseRetriever)
```
Retriever that uses a bi-encoder (one transformer for query, one transformer for passage).
See the original paper for more details:
Karpukhin, Vladimir, et al. (2020): "Dense Passage Retrieval for Open-Domain Question Answering."
(https://arxiv.org/abs/2004.04906).
<a name="dense.DensePassageRetriever.__init__"></a>
#### \_\_init\_\_
```python
| __init__(document_store: BaseDocumentStore, query_embedding_model: str = "facebook/dpr-question_encoder-single-nq-base", passage_embedding_model: str = "facebook/dpr-ctx_encoder-single-nq-base", max_seq_len: int = 256, use_gpu: bool = True, batch_size: int = 16, embed_title: bool = True, remove_sep_tok_from_untitled_passages: bool = True)
```
Init the Retriever, including the two encoder models, from a local or remote model checkpoint.
The checkpoint format matches Hugging Face Transformers' model format.
**Arguments**:
- `document_store`: An instance of DocumentStore from which to retrieve documents.
- `query_embedding_model`: Local path or remote name of the question encoder checkpoint. The format matches
that of Hugging Face Transformers' model hub models.
Currently available remote names: ``"facebook/dpr-question_encoder-single-nq-base"``
- `passage_embedding_model`: Local path or remote name of the passage encoder checkpoint. The format matches
that of Hugging Face Transformers' model hub models.
Currently available remote names: ``"facebook/dpr-ctx_encoder-single-nq-base"``
- `max_seq_len`: Maximum length of each sequence
- `use_gpu`: Whether to use a GPU or not
- `batch_size`: Number of questions or passages to encode at once
- `embed_title`: Whether to concatenate title and passage into a text pair that is then used to create the embedding
- `remove_sep_tok_from_untitled_passages`: If `embed_title` is ``True``, there are different strategies for dealing with documents that don't have a title.
If this param is ``True`` => embed the passage as a single text, as if embed_title were False (i.e. ``[CLS] passage_tok1 ... [SEP]``).
If this param is ``False`` => embed the passage as a text pair with an empty title (i.e. ``[CLS] [SEP] passage_tok1 ... [SEP]``)
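A hedged setup sketch: initialize the retriever, compute passage embeddings for documents already in the store, then retrieve. The `FAISSDocumentStore` choice and import paths are illustrative; any DocumentStore that supports embeddings should work:

```python
from haystack.document_store.faiss import FAISSDocumentStore  # path may vary by version
from haystack.retriever.dense import DensePassageRetriever

document_store = FAISSDocumentStore()  # illustrative choice of embedding-capable store

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)

# Compute and store passage embeddings for all documents already in the store
document_store.update_embeddings(retriever)

docs = retriever.retrieve(query="Who developed the theory of relativity?", top_k=10)
```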
<a name="dense.DensePassageRetriever.embed_queries"></a>
#### embed\_queries
```python
| embed_queries(texts: List[str]) -> List[np.array]
```
Create embeddings for a list of queries using the query encoder
**Arguments**:
- `texts`: Queries to embed
**Returns**:
Embeddings, one per input query
<a name="dense.DensePassageRetriever.embed_passages"></a>
#### embed\_passages
```python
| embed_passages(docs: List[Document]) -> List[np.array]
```
Create embeddings for a list of passages using the passage encoder
**Arguments**:
- `docs`: List of Document objects used to represent documents / passages in a standardized way within Haystack.
**Returns**:
Embeddings of documents / passages, shape: (batch_size, embedding_dim)
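For direct access to the two encoders, a small sketch (the import location of `Document` is an assumption and may vary by version):

```python
from haystack.document_store.base import Document  # location may vary by version

# Encode queries with the question encoder
query_embs = retriever.embed_queries(texts=["What is dense passage retrieval?"])

# Encode passages with the passage (context) encoder
passage_embs = retriever.embed_passages(docs=[Document(text="DPR uses a bi-encoder ...")])

print(query_embs[0].shape, passage_embs[0].shape)  # one embedding vector per input
```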
<a name="dense.EmbeddingRetriever"></a>
## EmbeddingRetriever
```python
class EmbeddingRetriever(BaseRetriever)
```
<a name="dense.EmbeddingRetriever.__init__"></a>
#### \_\_init\_\_
```python
| __init__(document_store: BaseDocumentStore, embedding_model: str, use_gpu: bool = True, model_format: str = "farm", pooling_strategy: str = "reduce_mean", emb_extraction_layer: int = -1)
```
**Arguments**:
- `document_store`: An instance of DocumentStore from which to retrieve documents.
- `embedding_model`: Local path or name of model in Hugging Face's model hub such as ``'deepset/sentence_bert'``
- `use_gpu`: Whether to use gpu or not
- `model_format`: Name of framework that was used for saving the model. Options:
- ``'farm'``
- ``'transformers'``
- ``'sentence_transformers'``
- `pooling_strategy`: Strategy for combining the embeddings from the model (for farm / transformers models only).
Options:
- ``'cls_token'`` (sentence vector)
- ``'reduce_mean'`` (sentence vector)
- ``'reduce_max'`` (sentence vector)
- ``'per_token'`` (individual token vectors)
- `emb_extraction_layer`: Number of layer from which the embeddings shall be extracted (for farm / transformers models only).
Default: -1 (very last layer).
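A hedged usage sketch with a Sentence-BERT model (the document store choice and its embedding configuration are illustrative and may differ across Haystack versions):

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore  # path may vary by version
from haystack.retriever.dense import EmbeddingRetriever

# The store must be configured to hold embeddings; field name and dim are illustrative
document_store = ElasticsearchDocumentStore(embedding_field="embedding", embedding_dim=768)

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="deepset/sentence_bert",
    model_format="farm",
    pooling_strategy="reduce_mean",
)

# Compute and store embeddings for existing documents, then query
document_store.update_embeddings(retriever)
docs = retriever.retrieve(query="effects of climate change", top_k=10)
```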
<a name="dense.EmbeddingRetriever.embed"></a>
#### embed
```python
| embed(texts: Union[List[str], str]) -> List[np.array]
```
Create embeddings for each text in a list of texts using the retriever's model (`self.embedding_model`)
**Arguments**:
- `texts`: Texts to embed
**Returns**:
List of embeddings (one per input text). Each embedding is a list of floats.
<a name="dense.EmbeddingRetriever.embed_queries"></a>
#### embed\_queries
```python
| embed_queries(texts: List[str]) -> List[np.array]
```
Create embeddings for a list of queries. For this Retriever type, this is equivalent to calling `.embed()`
**Arguments**:
- `texts`: Queries to embed
**Returns**:
Embeddings, one per input query
<a name="dense.EmbeddingRetriever.embed_passages"></a>
#### embed\_passages
```python
| embed_passages(docs: List[Document]) -> List[np.array]
```
Create embeddings for a list of passages. For this Retriever type, this is equivalent to calling `.embed()`
**Arguments**:
- `docs`: List of documents to embed
**Returns**:
Embeddings, one per input passage
<a name="base"></a>
# base
<a name="base.BaseRetriever"></a>
## BaseRetriever
```python
class BaseRetriever(ABC)
```
<a name="base.BaseRetriever.retrieve"></a>
#### retrieve
```python
| @abstractmethod
| retrieve(query: str, filters: dict = None, top_k: int = 10, index: str = None) -> List[Document]
```
Scan through the documents in a DocumentStore and return a small number of documents
that are most relevant to the query.
**Arguments**:
- `query`: The query
- `filters`: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
- `top_k`: How many documents to return per query.
- `index`: The name of the index in the DocumentStore from which to retrieve documents
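For illustration, a call restricting retrieval to documents whose (hypothetical) `year` metadata field matches one of the accepted values:

```python
docs = retriever.retrieve(
    query="Why did the revenue increase?",
    filters={"year": ["2019", "2020"]},  # metadata field -> list of accepted values
    top_k=5,
)
```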
<a name="base.BaseRetriever.eval"></a>
#### eval
```python
| eval(label_index: str = "label", doc_index: str = "eval_document", label_origin: str = "gold_label", top_k: int = 10, open_domain: bool = False) -> dict
```
Performs evaluation on the Retriever.
The Retriever is evaluated on whether it finds the correct document given the question string and on the
position in the document ranking at which the correct document appears.
Returns a dict containing the following metrics:
- "recall": Proportion of questions for which the correct document is among the retrieved documents
- "mean avg precision": Mean of the average precision for each question. Rewards retrievers that give relevant
documents a higher rank.
**Arguments**:
- `label_index`: Index/Table in DocumentStore where labeled questions are stored
- `doc_index`: Index/Table in DocumentStore where documents that are used for evaluation are stored
- `top_k`: How many documents to return per question
- `open_domain`: If ``True``, retrieval will be evaluated by checking if the answer string to a question is
contained in the retrieved docs (common approach in open-domain QA).
If ``False``, retrieval uses a stricter evaluation that checks if the retrieved document ids
are within ids explicitly stated in the labels.
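An example evaluation run, assuming labels and evaluation documents were previously indexed under the given names:

```python
# Evaluate against previously indexed labels and evaluation documents
metrics = retriever.eval(label_index="label", doc_index="eval_document", top_k=10)

print(metrics["recall"])              # proportion of questions with a correct doc retrieved
print(metrics["mean avg precision"])  # rank-sensitive retrieval quality
```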