{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Better retrieval via \"Dense Passage Retrieval\"\n",
"\n",
"\n",
"### Importance of Retrievers\n",
"\n",
"The Retriever has a huge impact on the performance of our overall search pipeline.\n",
"\n",
"\n",
"### Different types of Retrievers\n",
"#### Sparse\n",
"Family of algorithms based on counting the occurrences of words (bag-of-words), resulting in very sparse vectors with length = vocab size. \n",
"\n",
"Examples: BM25, TF-IDF \n",
"Pros: Simple, fast, well explainable \n",
"Cons: Relies on exact keyword matches between query and text \n",
" \n",
"\n",
"#### Dense\n",
"These retrievers use neural network models to create \"dense\" embedding vectors. Within this family there are two different approaches: \n",
"\n",
"a) Single encoder: Use a **single model** to embed both query and passage. \n",
"b) Dual-encoder: Use **two models**, one to embed the query and one to embed the passage.\n",
"\n",
"Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax ...). \n",
"\n",
"Examples: REALM, DPR, Sentence-Transformers ...\n",
"Pros: Captures semantic similarity instead of \"word matches\" (e.g. synonyms, related topics ...) \n",
"Cons: Computationally heavier; requires initial training of a model \n",
"\n",
"\n",
"### \"Dense Passage Retrieval\"\n",
"\n",
"In this tutorial, we want to highlight one \"Dense Dual-Encoder\" called Dense Passage Retriever. \n",
"It was introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906). \n",
"\n",
"Original Abstract: \n",
"\n",
"_\"Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.\"_\n",
"\n",
"Paper: https://arxiv.org/abs/2004.04906 \n",
"Original Code: https://fburl.com/qa-dpr \n",
"\n",
"\n",
"*Use this [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb) to open the notebook in Google Colab.*\n"
]
},
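{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the dual-encoder idea concrete, here is a minimal sketch in plain NumPy (this is *not* Haystack's actual DPR implementation): two separate encoders map query and passage into the same vector space, and relevance is scored by the dot product of the two vectors. The toy hash-based \"encoders\" below are stand-ins for the two BERT models that DPR uses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Stand-ins for DPR's two encoders (in reality: two separate BERT models)\n",
"rng = np.random.default_rng(42)\n",
"W_query = rng.normal(size=(8, 4))    # hypothetical query-encoder weights\n",
"W_passage = rng.normal(size=(8, 4))  # hypothetical passage-encoder weights\n",
"\n",
"def embed(text, W):\n",
"    # Toy featurizer: hash tokens into a fixed-size bag-of-words vector,\n",
"    # then project it with the encoder's weights. Real encoders use BERT.\n",
"    x = np.zeros(8)\n",
"    for tok in text.lower().split():\n",
"        x[hash(tok) % 8] += 1\n",
"    return x @ W\n",
"\n",
"query_vec = embed(\"Who is the father of Arya Stark?\", W_query)\n",
"passages = [\"Ned Stark is Arya's father.\", \"Winterfell is a castle.\"]\n",
"passage_vecs = [embed(p, W_passage) for p in passages]\n",
"\n",
"# DPR scores query-passage pairs by dot-product similarity\n",
"scores = [query_vec @ p for p in passage_vecs]\n",
"print(scores)"
]
},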
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare environment\n",
"\n",
"### Colab: Enable the GPU runtime \n",
"Make sure you enable the GPU runtime to experience decent speed in this tutorial. \n",
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install git+https://github.com/deepset-ai/haystack.git"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from haystack import Finder\n",
"from haystack.indexing.cleaning import clean_wiki_text\n",
"from haystack.indexing.utils import convert_files_to_dicts, fetch_archive_from_http\n",
"from haystack.reader.farm import FARMReader\n",
"from haystack.reader.transformers import TransformersReader\n",
"from haystack.utils import print_answers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Document Store\n",
"\n",
"FAISS is a library for efficient similarity search and clustering of dense vectors.\n",
"The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood\n",
"to store the document text and other metadata. The vector embeddings of the text are\n",
"indexed in a FAISS index that is later queried to retrieve answers."
]
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [],
"source": [
"from haystack.database.faiss import FAISSDocumentStore\n",
"\n",
"document_store = FAISSDocumentStore()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
}
}
},
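{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: if you want the stored texts and metadata to survive a restart, you can back the underlying SQL database with a file instead of keeping it in memory. A minimal sketch, assuming your installed Haystack version exposes a `sql_url` parameter on `FAISSDocumentStore` (verify against the actual signature, printed below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical persistent variant (parameter name assumed -- verify it\n",
"# against your installed Haystack version before relying on it):\n",
"# document_store = FAISSDocumentStore(sql_url=\"sqlite:///haystack_docs.db\")\n",
"\n",
"# Inspect the actual constructor signature of your installed version:\n",
"import inspect\n",
"from haystack.database.faiss import FAISSDocumentStore\n",
"print(inspect.signature(FAISSDocumentStore.__init__))"
]
},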
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Cleaning & indexing documents\n",
"\n",
"As in the previous tutorials, we download, convert and index some Game of Thrones articles to our DocumentStore."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"07/03/2020 11:46:28 - INFO - haystack.indexing.utils - Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.\n",
"07/03/2020 11:46:28 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.300s]\n",
"07/03/2020 11:46:28 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.209s]\n",
"07/03/2020 11:46:28 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.124s]\n",
"07/03/2020 11:46:29 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.101s]\n",
"07/03/2020 11:46:29 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.125s]\n",
"07/03/2020 11:46:29 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.096s]\n"
]
}
],
"source": [
"# Let's first get some files that we want to use\n",
"doc_dir = \"data/article_txt_got\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# Convert files to dicts\n",
"dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
"\n",
"# Now, let's write the dicts containing documents to our DB.\n",
"document_store.write_documents(dicts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Retriever, Reader & Finder\n",
"\n",
"### Retriever\n",
"\n",
"**Here:** We use a `DensePassageRetriever`\n",
"\n",
"**Alternatives:**\n",
"\n",
"- Use the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters\n",
"- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)\n",
"- Use `TfidfRetriever` in combination with a SQL or InMemory DocumentStore for simple prototyping and debugging"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"07/03/2020 11:46:29 - INFO - haystack.retriever.dpr_utils - Loading saved model from models/dpr/checkpoint/retriever/single/nq/bert-base-encoder.cp\n",
"07/03/2020 11:46:29 - INFO - haystack.retriever.dense - Loaded encoder params: {'do_lower_case': True, 'pretrained_model_cfg': 'bert-base-uncased', 'encoder_model_type': 'hf_bert', 'pretrained_file': None, 'projection_dim': 0, 'sequence_length': 256}\n",
"07/03/2020 11:46:36 - INFO - haystack.retriever.dense - Loading saved model state ...\n",
"07/03/2020 11:46:36 - INFO - haystack.retriever.dense - Loading saved model state ...\n",
"07/03/2020 11:46:37 - INFO - elasticsearch - GET http://localhost:9200/document/_search?scroll=5m&size=1000 [status:200 request:0.177s]\n",
"07/03/2020 11:46:37 - INFO - elasticsearch - GET http://localhost:9200/_search/scroll [status:200 request:0.053s]\n",
"07/03/2020 11:46:37 - INFO - elasticsearch - GET http://localhost:9200/_search/scroll [status:200 request:0.047s]\n",
"07/03/2020 11:46:37 - INFO - elasticsearch - GET http://localhost:9200/_search/scroll [status:200 request:0.003s]\n",
"07/03/2020 11:46:37 - INFO - elasticsearch - DELETE http://localhost:9200/_search/scroll [status:200 request:0.005s]\n",
"07/03/2020 11:46:37 - INFO - haystack.database.elasticsearch - Updating embeddings for 2811 docs ...\n",
"07/03/2020 11:46:55 - INFO - haystack.retriever.dense - Embedded 80 / 2811 texts\n",
"07/03/2020 11:47:13 - INFO - haystack.retriever.dense - Embedded 160 / 2811 texts\n",
"07/03/2020 11:47:35 - INFO - haystack.retriever.dense - Embedded 240 / 2811 texts\n",
"07/03/2020 11:47:55 - INFO - haystack.retriever.dense - Embedded 320 / 2811 texts\n",
"07/03/2020 11:48:15 - INFO - haystack.retriever.dense - Embedded 400 / 2811 texts\n",
"07/03/2020 11:48:34 - INFO - haystack.retriever.dense - Embedded 480 / 2811 texts\n",
"07/03/2020 11:48:53 - INFO - haystack.retriever.dense - Embedded 560 / 2811 texts\n",
"07/03/2020 11:49:15 - INFO - haystack.retriever.dense - Embedded 640 / 2811 texts\n",
"07/03/2020 11:49:35 - INFO - haystack.retriever.dense - Embedded 720 / 2811 texts\n",
"07/03/2020 11:49:57 - INFO - haystack.retriever.dense - Embedded 800 / 2811 texts\n",
"07/03/2020 11:50:20 - INFO - haystack.retriever.dense - Embedded 880 / 2811 texts\n",
"07/03/2020 11:50:44 - INFO - haystack.retriever.dense - Embedded 960 / 2811 texts\n",
"07/03/2020 11:51:07 - INFO - haystack.retriever.dense - Embedded 1040 / 2811 texts\n",
"07/03/2020 11:51:29 - INFO - haystack.retriever.dense - Embedded 1120 / 2811 texts\n",
"07/03/2020 11:51:52 - INFO - haystack.retriever.dense - Embedded 1200 / 2811 texts\n",
"07/03/2020 11:52:14 - INFO - haystack.retriever.dense - Embedded 1280 / 2811 texts\n",
"07/03/2020 11:52:38 - INFO - haystack.retriever.dense - Embedded 1360 / 2811 texts\n",
"07/03/2020 11:53:00 - INFO - haystack.retriever.dense - Embedded 1440 / 2811 texts\n",
"07/03/2020 11:53:23 - INFO - haystack.retriever.dense - Embedded 1520 / 2811 texts\n",
"07/03/2020 11:53:48 - INFO - haystack.retriever.dense - Embedded 1600 / 2811 texts\n",
"07/03/2020 11:54:09 - INFO - haystack.retriever.dense - Embedded 1680 / 2811 texts\n",
"07/03/2020 11:54:33 - INFO - haystack.retriever.dense - Embedded 1760 / 2811 texts\n",
"07/03/2020 11:54:56 - INFO - haystack.retriever.dense - Embedded 1840 / 2811 texts\n",
"07/03/2020 11:55:18 - INFO - haystack.retriever.dense - Embedded 1920 / 2811 texts\n",
"07/03/2020 11:55:41 - INFO - haystack.retriever.dense - Embedded 2000 / 2811 texts\n",
"07/03/2020 11:56:08 - INFO - haystack.retriever.dense - Embedded 2080 / 2811 texts\n",
"07/03/2020 11:56:32 - INFO - haystack.retriever.dense - Embedded 2160 / 2811 texts\n",
"07/03/2020 11:56:54 - INFO - haystack.retriever.dense - Embedded 2240 / 2811 texts\n",
"07/03/2020 11:57:18 - INFO - haystack.retriever.dense - Embedded 2320 / 2811 texts\n",
"07/03/2020 11:57:40 - INFO - haystack.retriever.dense - Embedded 2400 / 2811 texts\n",
"07/03/2020 11:58:04 - INFO - haystack.retriever.dense - Embedded 2480 / 2811 texts\n",
"07/03/2020 11:58:39 - INFO - haystack.retriever.dense - Embedded 2560 / 2811 texts\n",
"07/03/2020 11:59:16 - INFO - haystack.retriever.dense - Embedded 2640 / 2811 texts\n",
"07/03/2020 11:59:53 - INFO - haystack.retriever.dense - Embedded 2720 / 2811 texts\n",
"07/03/2020 12:00:33 - INFO - haystack.retriever.dense - Embedded 2800 / 2811 texts\n",
"07/03/2020 12:00:42 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:2.577s]\n",
"07/03/2020 12:00:44 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:2.085s]\n",
"07/03/2020 12:00:46 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:1.918s]\n",
"07/03/2020 12:00:49 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:1.933s]\n",
"07/03/2020 12:00:51 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:1.562s]\n",
"07/03/2020 12:00:52 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:1.150s]\n"
]
}
],
"source": [
"from haystack.retriever.dense import DensePassageRetriever\n",
"retriever = DensePassageRetriever(document_store=document_store, embedding_model=\"dpr-bert-base-nq\",\n",
"                                  do_lower_case=True, use_gpu=True)\n",
"\n",
"# Important: \n",
"# Now that we have the DPR initialized, we need to call update_embeddings() to iterate over all\n",
"# previously indexed documents and update their embedding representation. \n",
"# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once. \n",
"# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.\n",
"document_store.update_embeddings(retriever)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reader\n",
"\n",
"As in the previous tutorials, we now initialize our reader.\n",
"\n",
"Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2)\n",
"\n",
"\n",
"\n",
"#### FARMReader"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"07/03/2020 12:00:52 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
"07/03/2020 12:00:52 - INFO - farm.infer - Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...\n",
"07/03/2020 12:00:59 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is. \n",
"\t We guess it's an *ENGLISH* model ... \n",
"\t If not: Init the language model by supplying the 'language' param.\n",
"07/03/2020 12:01:07 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {\"loss_ignore_index\": -1}\n",
"/home/mp/miniconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils.py:831: FutureWarning: Parameter max_len is deprecated and will be removed in a future release. Use model_max_length instead.\n",
" category=FutureWarning,\n",
"07/03/2020 12:01:11 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
"07/03/2020 12:01:11 - INFO - farm.infer - Got ya 7 parallel workers to do inference ...\n",
"07/03/2020 12:01:11 - INFO - farm.infer - 0 0 0 0 0 0 0 \n",
"07/03/2020 12:01:11 - INFO - farm.infer - /w\\ /w\\ /w\\ /w\\ /w\\ /w\\ /w\\\n",
"07/03/2020 12:01:11 - INFO - farm.infer - /'\\ / \\ /'\\ /'\\ / \\ / \\ /'\\\n",
"07/03/2020 12:01:11 - INFO - farm.infer - \n"
]
}
],
"source": [
"# Load a local model or any of the QA models on\n",
"# Hugging Face's model hub (https://huggingface.co/models)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finder\n",
"\n",
"The Finder sticks together reader and retriever in a pipeline to answer our actual questions."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"finder = Finder(reader, retriever)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Voilà! Ask a question!"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"07/03/2020 12:07:16 - INFO - elasticsearch - GET http://localhost:9200/document/_search [status:200 request:0.018s]\n",
"07/03/2020 12:07:16 - INFO - haystack.finder - Reader is looking for detailed answer in 11362 chars ...\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.09 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.05 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00, 1.76s/ Batches]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.50 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.93 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.15 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.64 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.93 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:02<00:00, 2.60s/ Batches]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.10 Batches/s]\n"
]
}
],
"source": [
"# You can configure how many candidates the reader and retriever shall return\n",
"# The higher top_k_retriever, the better (but also the slower) your answers. \n",
"prediction = finder.get_answers(question=\"Who created the Dothraki vocabulary?\", top_k_retriever=10, top_k_reader=5)\n",
"\n",
"#prediction = finder.get_answers(question=\"Who is the father of Arya Stark?\", top_k_retriever=10, top_k_reader=5)\n",
"#prediction = finder.get_answers(question=\"Who is the sister of Sansa?\", top_k_retriever=10, top_k_reader=5)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ { 'answer': 'David J. Peterson',\n",
" 'context': \"ge for ''Game of Thrones''\\n\"\n",
" 'The Dothraki vocabulary was created by David J. Peterson '\n",
" 'well in advance of the adaptation. HBO hired the Language '\n",
" 'Creation'},\n",
" { 'answer': 'David J. Peterson',\n",
" 'context': '\\n'\n",
" '===Creation===\\n'\n",
" 'David J. Peterson, creator of the spoken Valyrian '\n",
" \"languages for ''Game of Thrones''\\n\"\n",
" 'To create the Dothraki and Valyrian languages to b'},\n",
" { 'answer': 'David J. Peterson',\n",
" 'context': 'orld. The language was developed for the TV series by the '\n",
" 'linguist David J. Peterson, working off the Dothraki words '\n",
" \"and phrases in Martin's novels.\\n\"\n",
" ','},\n",
" { 'answer': 'Peterson',\n",
" 'context': \"e does not exist in the fictional world of ''A Song of Ice \"\n",
" \"and Fire'', Peterson chose to treat the similarity as \"\n",
" \"coincidental and made ''dracarys'' an\"},\n",
" { 'answer': 'Peterson',\n",
" 'context': 'ystem. Another word, \\'\\'trēsy\\'\\', meaning \"son\", was '\n",
" \"coined in honour of Peterson's 3000th Twitter follower.\\n\"\n",
" 'Peterson did not create a High Valyrian wri'}]\n"
]
}
],
"source": [
"print_answers(prediction, details=\"minimal\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}