<!---
title: "Tutorial 6"
metaTitle: "Better retrieval via Dense Passage Retrieval"
metaDescription: ""
slug: "/docs/tutorial6"
date: "2020-09-03"
id: "tutorial6md"
--->

# Better Retrieval via "Dense Passage Retrieval"

EXECUTABLE VERSION: [colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb)

### Importance of Retrievers

The Retriever has a huge impact on the performance of our overall search pipeline: the Reader only sees the documents the Retriever returns, so an answer that is not among the retrieved candidates cannot be found.

### Different types of Retrievers
#### Sparse
A family of algorithms based on counting word occurrences (bag-of-words), resulting in very sparse vectors whose length equals the vocabulary size.

**Examples**: BM25, TF-IDF

**Pros**: Simple, fast, easy to explain

**Cons**: Relies on exact keyword matches between query and text

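To make the exact-match limitation concrete, here is a minimal sketch of a TF-IDF-style scorer in plain Python. The helper names (`tokenize`, `tfidf_vectors`, `score`), the toy documents, and the simplified IDF formula are assumptions for illustration only; this is not how Haystack's sparse retrievers are implemented.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real systems use better tokenizers.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Only terms that occur in the document are stored, hence "sparse".
        vectors.append({t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf})
    return vectors

def score(query, doc_vector):
    # Sum the weights of the query terms that literally occur in the document.
    return sum(doc_vector.get(t, 0.0) for t in tokenize(query))

docs = [
    "the king rules the north",
    "a dragon burned the city",
]
vectors = tfidf_vectors(docs)

print(score("king", vectors[0]))     # positive: exact keyword match
print(score("monarch", vectors[0]))  # 0.0: a synonym gets no credit
```

The query "monarch" scores zero against a document about a king, which is exactly the weakness that dense retrievers try to address.
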
#### Dense
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two different approaches:

a) Single encoder: Use a **single model** to embed both query and passage.
b) Dual-encoder: Use **two models**, one to embed the query and one to embed the passage.

Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax ...).

**Examples**: REALM, DPR, Sentence-Transformers

**Pros**: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics ...)

**Cons**: Computationally heavier; the model needs initial training

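The ranking step of a dual-encoder can be sketched in a few lines. In the sketch below, the hand-written vectors stand in for the outputs of the passage encoder and the query encoder, and `dot` and `retrieve` are hypothetical helper names. Only the scoring idea (dot product between query and passage embeddings, as used by DPR) is taken from the approach itself.

```python
def dot(u, v):
    # Inner product of two equally long vectors.
    return sum(a * b for a, b in zip(u, v))

# Toy embeddings, as if produced offline by a passage encoder.
passage_embeddings = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.3],
    "p3": [0.0, 0.2, 0.9],
}

# Toy embedding, as if produced at query time by a separate query encoder.
query_embedding = [0.8, 0.2, 0.1]

def retrieve(query_vec, passages, top_k=2):
    # Score every passage against the query and keep the best top_k ids.
    scored = [(dot(query_vec, vec), pid) for pid, vec in passages.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]

print(retrieve(query_embedding, passage_embeddings))  # ['p1', 'p2']
```

Because the two encoders are independent, passage embeddings can be computed once ahead of time; only the query has to be embedded when a question arrives.
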
### "Dense Passage Retrieval"

In this tutorial, we want to highlight one "Dense Dual-Encoder" called Dense Passage Retriever.
It was introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906).

Original Abstract:

_"Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks."_

Paper: https://arxiv.org/abs/2004.04906
Original Code: https://fburl.com/qa-dpr

*Use this* [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb) *to open the notebook in Google Colab.*

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg">

```python
# Make sure you have a GPU running
!nvidia-smi
```

```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack and install the version of torch that works with the colab GPUs
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
```

```python
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
```

### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried to find answers.

```python
from haystack.document_store.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore()
```

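The division of labour described in this section, a SQL database for text and a vector index for embeddings, can be illustrated with a small standard-library sketch. It is an assumption-laden toy and not the actual `FAISSDocumentStore` implementation: the table layout, the `write_document` and `query` helpers, and the toy embeddings are hypothetical, and a brute-force Python list stands in for the FAISS index.

```python
import sqlite3

# Text and metadata live in a SQL table (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT)")

# Embeddings live in a separate vector index; a plain list stands in for FAISS.
vector_index = []  # entries are (document id, embedding)

def write_document(doc_id, text, embedding):
    # Store the text in SQL and the embedding in the vector index, linked by id.
    conn.execute("INSERT INTO document (id, text) VALUES (?, ?)", (doc_id, text))
    vector_index.append((doc_id, embedding))

def query(query_embedding, top_k=1):
    # Brute-force inner-product search over all stored embeddings.
    ranked = sorted(
        vector_index,
        key=lambda item: sum(q * d for q, d in zip(query_embedding, item[1])),
        reverse=True,
    )
    texts = []
    for doc_id, _ in ranked[:top_k]:
        # The vector index only yields ids; the text comes from the SQL table.
        row = conn.execute("SELECT text FROM document WHERE id = ?", (doc_id,)).fetchone()
        texts.append(row[0])
    return texts

write_document(1, "passage about Arya Stark", [0.9, 0.1])
write_document(2, "passage about the Dothraki language", [0.1, 0.9])

print(query([0.2, 0.8]))  # ['passage about the Dothraki language']
```

A real FAISS index replaces the sorted list with data structures built for fast nearest-neighbour search over large collections.
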
### Cleaning & indexing documents

As in the previous tutorials, we download, convert and index some Game of Thrones articles into our DocumentStore.

```python
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)
```

### Initialize Retriever, Reader, & Finder

#### Retriever

**Here:** We use a `DensePassageRetriever`

**Alternatives:**

- The `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging

```python
from haystack.retriever.dense import DensePassageRetriever
retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  max_seq_len=256,
                                  batch_size=16,
                                  remove_sep_tok_from_untitled_passages=True)
# Important:
# Now that the DPR is initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)
```

#### Reader

As in previous tutorials, we now initialize our reader.

Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2)

##### FARMReader

```python
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
```

#### Finder

The Finder sticks together reader and retriever in a pipeline to answer our actual questions.

```python
finder = Finder(reader, retriever)
```

### Voilà! Ask a question!

```python
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers.
prediction = finder.get_answers(question="Who created the Dothraki vocabulary?", top_k_retriever=10, top_k_reader=5)

#prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)
#prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_retriever=10, top_k_reader=5)
```

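The effect of the two `top_k` parameters can be illustrated with a toy two-stage pipeline. This is a rough sketch under stated assumptions, not the `Finder` implementation: `run_pipeline`, `word_overlap`, and `toy_reader_score` are hypothetical names, and the scoring functions are simple stand-ins for a real retriever and a real QA model.

```python
def run_pipeline(question, documents, retriever_score, reader_score,
                 top_k_retriever=10, top_k_reader=5):
    # Stage 1: rank all documents with the cheap retriever score
    # and keep only the best top_k_retriever candidates.
    candidates = sorted(documents, key=lambda d: retriever_score(question, d),
                        reverse=True)[:top_k_retriever]
    # Stage 2: run the expensive reader on the candidates only
    # and return the best top_k_reader results.
    ranked = sorted(candidates, key=lambda d: reader_score(question, d), reverse=True)
    return ranked[:top_k_reader]

def word_overlap(question, doc):
    # Stand-in retriever: number of distinct words shared with the question.
    return len(set(question.lower().split()) & set(doc.lower().split()))

def toy_reader_score(question, doc):
    # Stand-in reader: pretends an answer was found if the doc mentions "created".
    return 1.0 if "created" in doc.lower().split() else 0.0

docs = [
    "the dothraki vocabulary and the dothraki culture",
    "a linguist created it",
    "a list of castles",
]
question = "who created the dothraki vocabulary"

# With one candidate, the reader never sees the second document.
print(run_pipeline(question, docs, word_overlap, toy_reader_score,
                   top_k_retriever=1, top_k_reader=1))
# With two candidates, the reader can pick the document it prefers.
print(run_pipeline(question, docs, word_overlap, toy_reader_score,
                   top_k_retriever=2, top_k_reader=1))
```

A larger `top_k_retriever` gives the reader more chances to find a good answer, at the cost of running the reader on more candidates.
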
```python
print_answers(prediction, details="minimal")
```