mirror of
				https://github.com/deepset-ai/haystack.git
				synced 2025-10-31 01:39:45 +00:00 
			
		
		
		
	 717796c587
			
		
	
	
* Tutorial 06: Replace DPR with EmbeddingRetriever. Closes #2887
* Add updated tutorials/6.md file. Replace `DensePassageRetriever` with `EmbeddingRetriever`
* Update Tutorial 06 based on PR feedback
* Further updates to Tutorial-06 according to review feedback
* [Tutorial 06] Put in review feedback for the py file
		
			
				
	
	
		
			450 lines
		
	
	
		
			16 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
| {
 | |
|   "cells": [
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "bEH-CRbeA6NU"
 | |
|       },
 | |
|       "source": [
 | |
|         "# Better Retrieval via \"Embedding Retrieval\"\n",
 | |
|         "\n",
 | |
|         "[Open in Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb)\n",
 | |
|         "\n",
 | |
|         "### Importance of Retrievers\n",
 | |
|         "\n",
 | |
|         "The Retriever has a huge impact on the performance of our overall search pipeline.\n",
 | |
|         "\n",
 | |
|         "\n",
 | |
|         "### Different types of Retrievers\n",
 | |
|         "#### Sparse\n",
 | |
|         "A family of algorithms based on counting word occurrences (bag-of-words), which results in very sparse vectors whose length equals the vocabulary size.\n",
 | |
|         "\n",
 | |
|         "**Examples**: BM25, TF-IDF\n",
 | |
|         "\n",
 | |
|         "**Pros**: Simple, fast, easy to explain\n",
 | |
|         "\n",
 | |
|         "**Cons**: Relies on exact keyword matches between query and text\n",
 | |
|         " \n",
 | |
|         "\n",
 | |
|         "#### Dense\n",
 | |
|         "These retrievers use neural network models to create \"dense\" embedding vectors. Within this family, there are two different approaches:\n",
 | |
|         "\n",
 | |
|         "- Single encoder: Use a **single model** to embed both the query and the passage.\n",
 | |
|         "- Dual encoder: Use **two models**, one to embed the query and one to embed the passage.\n",
 | |
|         "\n",
 | |
|         "**Examples**: REALM, DPR, Sentence-Transformers\n",
 | |
|         "\n",
 | |
|         "**Pros**: Captures semantic similarity instead of \"word matches\" (for example, synonyms, related topics).\n",
 | |
|         "\n",
 | |
|         "**Cons**: Computationally heavier to use and requires initial training of the model (though this is less of an issue nowadays, as many pre-trained models are available and you rarely need to train one yourself).\n",
 | |
|         "\n",
 | |
|         "\n",
 | |
|         "### Embedding Retrieval\n",
 | |
|         "\n",
 | |
|         "In this tutorial, we use an `EmbeddingRetriever` with [Sentence Transformers](https://www.sbert.net/index.html) models.\n",
 | |
|         "\n",
 | |
|         "These models are trained to embed similar sentences close to each other in a shared embedding space.\n",
 | |
|         "\n",
 | |
|         "Some models have been fine-tuned on massive Information Retrieval data and can be used to retrieve documents based on a short query (for example, `multi-qa-mpnet-base-dot-v1`). There are others that are more suited to semantic similarity tasks where you are trying to find the most similar documents to a given document (for example, `all-mpnet-base-v2`). There are even models that are multilingual (for example, `paraphrase-multilingual-mpnet-base-v2`). For a good overview of different models with their evaluation metrics, see the [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#) in the Sentence Transformers documentation.\n",
 | |
|         "\n",
 | |
|         "*Use this* [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb) *to open the notebook in Google Colab.*\n"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "3K27Y5FbA6NV"
 | |
|       },
 | |
|       "source": [
 | |
|         "### Prepare the Environment\n",
 | |
|         "\n",
 | |
|         "#### Colab: Enable the GPU Runtime\n",
 | |
|         "Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
 | |
|         "**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
 | |
|         "\n",
 | |
|         "<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg\">"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "JlZgP8q1A6NW"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "# Make sure you have a GPU running\n",
 | |
|         "!nvidia-smi"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "NM36kbRFA6Nc"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "# Install the latest release of Haystack in your own environment\n",
 | |
|         "#! pip install farm-haystack\n",
 | |
|         "\n",
 | |
|         "# Install the latest master of Haystack\n",
 | |
|         "!pip install --upgrade pip\n",
 | |
|         "!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "source": [
 | |
|         "## Logging\n",
 | |
|         "\n",
 | |
|         "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
 | |
|         "Example log message:\n",
 | |
|         "INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt\n",
 | |
|         "The default log level in basicConfig is WARNING, so the explicit parameter is not necessary, but it makes the level easy to change:"
 | |
|       ],
 | |
|       "metadata": {
 | |
|         "collapsed": false,
 | |
|         "pycharm": {
 | |
|           "name": "#%% md\n"
 | |
|         },
 | |
|         "id": "GbM2ml-ozqLX"
 | |
|       }
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "import logging\n",
 | |
|         "\n",
 | |
|         "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
 | |
|         "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
 | |
|       ],
 | |
|       "metadata": {
 | |
|         "pycharm": {
 | |
|           "name": "#%%\n"
 | |
|         },
 | |
|         "id": "kQWEUUMnzqLX"
 | |
|       }
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "xmRuhTQ7A6Nh"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers\n",
 | |
|         "from haystack.nodes import FARMReader, TransformersReader"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "q3dSo7ZtA6Nl"
 | |
|       },
 | |
|       "source": [
 | |
|         "### Document Store\n",
 | |
|         "\n",
 | |
|         "#### Option 1: FAISS\n",
 | |
|         "\n",
 | |
|         "FAISS is a library for efficient similarity search and clustering of dense vectors.\n",
 | |
|         "The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood\n",
 | |
|         "to store the document text and other metadata. The vector embeddings of the text are\n",
 | |
|         "indexed in a FAISS index that is later queried to find answers.\n",
 | |
|         "The default flavour of FAISSDocumentStore is \"Flat\" but can also be set to \"HNSW\" for\n",
 | |
|         "faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.\n",
 | |
|         "For more info on which suits your use case: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "1cYgDJmrA6Nv",
 | |
|         "pycharm": {
 | |
|           "name": "#%%\n"
 | |
|         }
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "from haystack.document_stores import FAISSDocumentStore\n",
 | |
|         "\n",
 | |
|         "document_store = FAISSDocumentStore(faiss_index_factory_str=\"Flat\")"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "collapsed": false,
 | |
|         "pycharm": {
 | |
|           "name": "#%% md\n"
 | |
|         },
 | |
|         "id": "s4HK5l0qzqLZ"
 | |
|       },
 | |
|       "source": [
 | |
|         "#### Option 2: Milvus\n",
 | |
|         "\n",
 | |
|         "Milvus is an open source vector database that, like FAISS, is optimized for vector similarity search.\n",
 | |
|         "Like FAISS, it has both a \"Flat\" and an \"HNSW\" mode, but it outperforms FAISS when it comes to dynamic data management.\n",
 | |
|         "It does require a little more setup, however, as it is run through Docker and requires the setup of some config files.\n",
 | |
|         "See [their docs](https://milvus.io/docs/v1.0.0/milvus_docker-cpu.md) for more details."
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "pycharm": {
 | |
|           "name": "#%%\n"
 | |
|         },
 | |
|         "id": "2Ur4h-E3zqLZ"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "# Milvus cannot be run on Colab, so this cell is commented out.\n",
 | |
|         "# To run Milvus you need Docker (versions below 2.0.0) or docker-compose (versions >= 2.0.0), neither of which is available on Colab.\n",
 | |
|         "# See Milvus' documentation for more details: https://milvus.io/docs/install_standalone-docker.md\n",
 | |
|         "\n",
 | |
|         "# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[milvus]\n",
 | |
|         "\n",
 | |
|         "# from haystack.utils import launch_milvus\n",
 | |
|         "# from haystack.document_stores import MilvusDocumentStore\n",
 | |
|         "\n",
 | |
|         "# launch_milvus()\n",
 | |
|         "# document_store = MilvusDocumentStore()"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "06LatTJBA6N0",
 | |
|         "pycharm": {
 | |
|           "name": "#%% md\n"
 | |
|         }
 | |
|       },
 | |
|       "source": [
 | |
|         "### Cleaning & indexing documents\n",
 | |
|         "\n",
 | |
|         "Similarly to the previous tutorials, we download, convert, and index some Game of Thrones articles into our DocumentStore."
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "iqKnu6wxA6N1",
 | |
|         "pycharm": {
 | |
|           "name": "#%%\n"
 | |
|         }
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "# Let's first get some files that we want to use\n",
 | |
|         "doc_dir = \"data/tutorial6\"\n",
 | |
|         "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt6.zip\"\n",
 | |
|         "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
 | |
|         "\n",
 | |
|         "# Convert files to Haystack Document objects\n",
 | |
|         "docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
 | |
|         "\n",
 | |
|         "# Now, let's write the documents to our DB.\n",
 | |
|         "document_store.write_documents(docs)"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "wgjedxx_A6N6"
 | |
|       },
 | |
|       "source": [
 | |
|         "### Initialize Retriever, Reader & Pipeline\n",
 | |
|         "\n",
 | |
|         "#### Retriever\n",
 | |
|         "\n",
 | |
|         "**Here:** We use an `EmbeddingRetriever`.\n",
 | |
|         "\n",
 | |
|         "**Alternatives:**\n",
 | |
|         "\n",
 | |
|         "- `BM25Retriever` with custom queries (for example, boosting) and filters\n",
 | |
|         "- `DensePassageRetriever` which uses two encoder models, one to embed the query and one to embed the passage, and then compares the embeddings for retrieval\n",
 | |
|         "- `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "kFwiPP60A6N7",
 | |
|         "pycharm": {
 | |
|           "is_executing": true
 | |
|         }
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "from haystack.nodes import EmbeddingRetriever\n",
 | |
|         "\n",
 | |
|         "retriever = EmbeddingRetriever(\n",
 | |
|         "    document_store=document_store,\n",
 | |
|         "    embedding_model=\"sentence-transformers/multi-qa-mpnet-base-dot-v1\",\n",
 | |
|         "    model_format=\"sentence_transformers\",\n",
 | |
|         ")\n",
 | |
|         "# Important:\n",
 | |
|         "# Now that we initialized the Retriever, we need to call update_embeddings() to iterate over all\n",
 | |
|         "# previously indexed documents and update their embedding representation.\n",
 | |
|         "# While this can be a time-consuming operation (depending on the corpus size), it only needs to be done once.\n",
 | |
|         "# At query time, we only need to embed the query and compare it to the existing document embeddings, which is very fast.\n",
 | |
|         "document_store.update_embeddings(retriever)"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "rnVR28OXA6OA"
 | |
|       },
 | |
|       "source": [
 | |
|         "#### Reader\n",
 | |
|         "\n",
 | |
|         "Similar to previous tutorials, we now initialize our reader.\n",
 | |
|         "\n",
 | |
|         "Here we use a `FARMReader` with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2).\n",
 | |
|         "\n",
 | |
|         "\n",
 | |
|         "\n",
 | |
|         "##### FARMReader"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "fyIuWVwhA6OB"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "# Load a local model or any of the QA models on\n",
 | |
|         "# Hugging Face's model hub (https://huggingface.co/models)\n",
 | |
|         "\n",
 | |
|         "reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "unhLD18yA6OF"
 | |
|       },
 | |
|       "source": [
 | |
|         "### Pipeline\n",
 | |
|         "\n",
 | |
|         "With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.\n",
 | |
|         "Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
 | |
|         "To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.\n",
 | |
|         "You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd)."
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "TssPQyzWA6OG"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "from haystack.pipelines import ExtractiveQAPipeline\n",
 | |
|         "\n",
 | |
|         "pipe = ExtractiveQAPipeline(reader, retriever)"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "id": "bXlBBxKXA6OL"
 | |
|       },
 | |
|       "source": [
 | |
|         "## Voilà! Ask a question!"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "Zi97Hif2A6OM"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "# You can configure how many candidates the reader and retriever should return.\n",
 | |
|         "# The higher the retriever's top_k, the better (but also the slower) your answers.\n",
 | |
|         "prediction = pipe.run(\n",
 | |
|         "    query=\"Who created the Dothraki vocabulary?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
 | |
|         ")"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "code",
 | |
|       "execution_count": null,
 | |
|       "metadata": {
 | |
|         "id": "pI0wrHylzqLa"
 | |
|       },
 | |
|       "outputs": [],
 | |
|       "source": [
 | |
|         "print_answers(prediction, details=\"minimum\")"
 | |
|       ]
 | |
|     },
 | |
|     {
 | |
|       "cell_type": "markdown",
 | |
|       "metadata": {
 | |
|         "collapsed": false,
 | |
|         "id": "kXE84-2_zqLa"
 | |
|       },
 | |
|       "source": [
 | |
|         "## About us\n",
 | |
|         "\n",
 | |
|         "This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.\n",
 | |
|         "\n",
 | |
|         "We bring NLP to the industry via open source!\n",
 | |
|         "  \n",
 | |
|         "Our focus: Industry specific language models & large scale QA systems.  \n",
 | |
|         "  \n",
 | |
|         "Some of our other work: \n",
 | |
|         "- [German BERT](https://deepset.ai/german-bert)\n",
 | |
|         "- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
 | |
|         "- [FARM](https://github.com/deepset-ai/FARM)\n",
 | |
|         "\n",
 | |
|         "Get in touch:\n",
 | |
|         "[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
 | |
|         "\n",
 | |
|         "By the way: [we're hiring!](https://www.deepset.ai/jobs)"
 | |
|       ]
 | |
|     }
 | |
|   ],
 | |
|   "metadata": {
 | |
|     "accelerator": "GPU",
 | |
|     "colab": {
 | |
|       "collapsed_sections": [],
 | |
|       "name": "Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb",
 | |
|       "provenance": []
 | |
|     },
 | |
|     "kernelspec": {
 | |
|       "display_name": "Python 3",
 | |
|       "language": "python",
 | |
|       "name": "python3"
 | |
|     },
 | |
|     "language_info": {
 | |
|       "codemirror_mode": {
 | |
|         "name": "ipython",
 | |
|         "version": 3
 | |
|       },
 | |
|       "file_extension": ".py",
 | |
|       "mimetype": "text/x-python",
 | |
|       "name": "python",
 | |
|       "nbconvert_exporter": "python",
 | |
|       "pygments_lexer": "ipython3",
 | |
|       "version": "3.6.9"
 | |
|     },
 | |
|     "gpuClass": "standard"
 | |
|   },
 | |
|   "nbformat": 4,
 | |
|   "nbformat_minor": 0
 | |
| } |