haystack/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb

{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bEH-CRbeA6NU"
      },
      "source": [
        "# Better Retrieval via \"Embedding Retrieval\"\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb)\n",
        "\n",
        "### Importance of Retrievers\n",
        "\n",
        "The Retriever has a huge impact on the performance of our overall search pipeline.\n",
        "\n",
        "\n",
        "### Different types of Retrievers\n",
        "#### Sparse\n",
        "Family of algorithms based on counting the occurrences of words (bag-of-words) resulting in very sparse vectors with length = vocab size.\n",
        "\n",
        "**Examples**: BM25, TF-IDF\n",
        "\n",
        "**Pros**: Simple, fast, well explainable\n",
        "\n",
        "**Cons**: Relies on exact keyword matches between query and text\n",
        " \n",
        "\n",
        "#### Dense\n",
        "These retrievers use neural network models to create \"dense\" embedding vectors. Within this family, there are two different approaches:\n",
        "\n",
        "a) Single encoder: Use a **single model** to embed both the query and the passage.\n",
        "b) Dual-encoder: Use **two models**, one to embed the query and one to embed the passage.\n",
        "\n",
        "**Examples**: REALM, DPR, Sentence-Transformers\n",
        "\n",
        "**Pros**: Captures semantic similarity instead of \"word matches\" (for example, synonyms, related topics).\n",
        "\n",
        "**Cons**: Computationally more heavy to use, initial training of the model (though this is less of an issue nowadays as many pre-trained models are available and most of the time, it's not needed to train the model).\n",
        "\n",
        "\n",
        "### Embedding Retrieval\n",
        "\n",
        "In this Tutorial, we use an `EmbeddingRetriever` with [Sentence Transformers](https://www.sbert.net/index.html) models.\n",
        "\n",
        "These models are trained to embed similar sentences close to each other in a shared embedding space.\n",
        "\n",
        "Some models have been fine-tuned on massive Information Retrieval data and can be used to retrieve documents based on a short query (for example, `multi-qa-mpnet-base-dot-v1`). There are others that are more suited to semantic similarity tasks where you are trying to find the most similar documents to a given document (for example, `all-mpnet-base-v2`). There are even models that are multilingual (for example, `paraphrase-multilingual-mpnet-base-v2`). For a good overview of different models with their evaluation metrics, see the [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#) in the Sentence Transformers documentation.\n",
        "\n",
        "*Use this* [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb) *to open the notebook in Google Colab.*\n"
      ]
    },
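    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The keyword-matching limitation of sparse retrieval can be illustrated with a minimal sketch. The helper functions `bow_vector` and `cosine` and the three example sentences below are hypothetical and written only for this illustration. They are not part of Haystack, and real sparse retrievers such as BM25 apply term weighting on top of raw counts."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "from collections import Counter\n",
        "\n",
        "\n",
        "def bow_vector(text):\n",
        "    # Toy bag-of-words: count lowercase, whitespace-separated tokens\n",
        "    return Counter(text.lower().split())\n",
        "\n",
        "\n",
        "def cosine(a, b):\n",
        "    # Cosine similarity between two sparse count vectors\n",
        "    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())\n",
        "    norm_a = math.sqrt(sum(v * v for v in a.values()))\n",
        "    norm_b = math.sqrt(sum(v * v for v in b.values()))\n",
        "    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0\n",
        "\n",
        "\n",
        "# Hypothetical example sentences (not taken from the tutorial data)\n",
        "query = bow_vector(\"who invented the language\")\n",
        "doc_exact = bow_vector(\"he invented the language\")\n",
        "doc_synonym = bow_vector(\"she created a vocabulary\")\n",
        "\n",
        "# The first document shares three tokens with the query and scores well.\n",
        "print(cosine(query, doc_exact))\n",
        "# The second document is related in meaning but shares no token, so it scores zero.\n",
        "print(cosine(query, doc_synonym))"
      ]
    },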
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3K27Y5FbA6NV"
      },
      "source": [
        "### Prepare the Environment\n",
        "\n",
        "#### Colab: Enable the GPU Runtime\n",
        "Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
        "**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
        "\n",
        "<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg\">"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JlZgP8q1A6NW"
      },
      "outputs": [],
      "source": [
        "# Make sure you have a GPU running\n",
        "!nvidia-smi"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NM36kbRFA6Nc"
      },
      "outputs": [],
      "source": [
        "# Install the latest release of Haystack in your own environment\n",
        "#! pip install farm-haystack\n",
        "\n",
        "# Install the latest master of Haystack\n",
        "!pip install --upgrade pip\n",
        "!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Logging\n",
        "\n",
        "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
        "Example log message:\n",
        "INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt\n",
        "Default log level in basicConfig is WARNING so the explicit parameter is not necessary but can be changed easily:"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "GbM2ml-ozqLX"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
        "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "kQWEUUMnzqLX"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xmRuhTQ7A6Nh"
      },
      "outputs": [],
      "source": [
        "from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers\n",
        "from haystack.nodes import FARMReader, TransformersReader"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q3dSo7ZtA6Nl"
      },
      "source": [
        "### Document Store\n",
        "\n",
        "#### Option 1: FAISS\n",
        "\n",
        "FAISS is a library for efficient similarity search on a cluster of dense vectors.\n",
        "The `FAISSDocumentStore` uses a SQL(SQLite in-memory be default) database under-the-hood\n",
        "to store the document text and other meta data. The vector embeddings of the text are\n",
        "indexed on a FAISS Index that later is queried for searching answers.\n",
        "The default flavour of FAISSDocumentStore is \"Flat\" but can also be set to \"HNSW\" for\n",
        "faster search at the expense of some accuracy. Just set the faiss_index_factor_str argument in the constructor.\n",
        "For more info on which suits your use case: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1cYgDJmrA6Nv",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from haystack.document_stores import FAISSDocumentStore\n",
        "\n",
        "document_store = FAISSDocumentStore(faiss_index_factory_str=\"Flat\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "s4HK5l0qzqLZ"
      },
      "source": [
        "#### Option 2: Milvus\n",
        "\n",
        "Milvus is an open source database library that is also optimized for vector similarity searches like FAISS.\n",
        "Like FAISS it has both a \"Flat\" and \"HNSW\" mode but it outperforms FAISS when it comes to dynamic data management.\n",
        "It does require a little more setup, however, as it is run through Docker and requires the setup of some config files.\n",
        "See [their docs](https://milvus.io/docs/v1.0.0/milvus_docker-cpu.md) for more details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "2Ur4h-E3zqLZ"
      },
      "outputs": [],
      "source": [
        "# Milvus cannot be run on COlab, so this cell is commented out.\n",
        "# To run Milvus you need Docker (versions below 2.0.0) or a docker-compose (versions >= 2.0.0), neither of which is available on Colab.\n",
        "# See Milvus' documentation for more details: https://milvus.io/docs/install_standalone-docker.md\n",
        "\n",
        "# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[milvus]\n",
        "\n",
        "# from haystack.utils import launch_milvus\n",
        "# from haystack.document_stores import MilvusDocumentStore\n",
        "\n",
        "# launch_milvus()\n",
        "# document_store = MilvusDocumentStore()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "06LatTJBA6N0",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "### Cleaning & indexing documents\n",
        "\n",
        "Similarly to the previous tutorials, we download, convert and index some Game of Thrones articles to our DocumentStore"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iqKnu6wxA6N1",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "# Let's first get some files that we want to use\n",
        "doc_dir = \"data/tutorial6\"\n",
        "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt6.zip\"\n",
        "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
        "\n",
        "# Convert files to dicts\n",
        "docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
        "\n",
        "# Now, let's write the dicts containing documents to our DB.\n",
        "document_store.write_documents(docs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wgjedxx_A6N6"
      },
      "source": [
        "### Initialize Retriever, Reader & Pipeline\n",
        "\n",
        "#### Retriever\n",
        "\n",
        "**Here:** We use an `EmbeddingRetriever`.\n",
        "\n",
        "**Alternatives:**\n",
        "\n",
        "- `BM25Retriever` with custom queries (for example, boosting) and filters\n",
        "- `DensePassageRetriever` which uses two encoder models, one to embed the query and one to embed the passage, and then compares the embedding for retrieval\n",
        "- `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kFwiPP60A6N7",
        "pycharm": {
          "is_executing": true
        }
      },
      "outputs": [],
      "source": [
        "from haystack.nodes import EmbeddingRetriever\n",
        "\n",
        "retriever = EmbeddingRetriever(\n",
        "    document_store=document_store,\n",
        "    embedding_model=\"sentence-transformers/multi-qa-mpnet-base-dot-v1\",\n",
        "    model_format=\"sentence_transformers\",\n",
        ")\n",
        "# Important:\n",
        "# Now that we initialized the Retriever, we need to call update_embeddings() to iterate over all\n",
        "# previously indexed documents and update their embedding representation.\n",
        "# While this can be a time consuming operation (depending on the corpus size), it only needs to be done once.\n",
        "# At query time, we only need to embed the query and compare it to the existing document embeddings, which is very fast.\n",
        "document_store.update_embeddings(retriever)"
      ]
    },
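    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The comments in the cell above note that, at query time, only the query is embedded and then compared to the stored document embeddings. A minimal NumPy sketch of that comparison follows. It assumes dot-product similarity (which the `dot` in the model name hints at) and uses made-up 4-dimensional vectors in place of real embeddings, so it only illustrates the idea and is not how the document store implements the search."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "# Hypothetical document embeddings, one row per document.\n",
        "# Real Sentence Transformers models produce vectors with hundreds of dimensions.\n",
        "doc_embeddings = np.array(\n",
        "    [\n",
        "        [0.9, 0.1, 0.0, 0.2],\n",
        "        [0.1, 0.8, 0.3, 0.0],\n",
        "        [0.2, 0.2, 0.9, 0.1],\n",
        "    ]\n",
        ")\n",
        "\n",
        "# Hypothetical embedding of the query\n",
        "query_embedding = np.array([0.8, 0.0, 0.1, 0.3])\n",
        "\n",
        "# One dot product per document: a higher score means a closer match\n",
        "scores = doc_embeddings @ query_embedding\n",
        "\n",
        "# Keep the indices of the top_k highest-scoring documents, best first\n",
        "top_k = 2\n",
        "ranking = np.argsort(-scores)[:top_k]\n",
        "print(ranking, scores[ranking])"
      ]
    },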
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rnVR28OXA6OA"
      },
      "source": [
        "#### Reader\n",
        "\n",
        "Similar to previous Tutorials we now initalize our reader.\n",
        "\n",
        "Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2)\n",
        "\n",
        "\n",
        "\n",
        "##### FARMReader"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fyIuWVwhA6OB"
      },
      "outputs": [],
      "source": [
        "# Load a  local model or any of the QA models on\n",
        "# Hugging Face's model hub (https://huggingface.co/models)\n",
        "\n",
        "reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "unhLD18yA6OF"
      },
      "source": [
        "### Pipeline\n",
        "\n",
        "With a Haystack `Pipeline` you can stick together your building blocks to a search pipeline.\n",
        "Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
        "To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.\n",
        "You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TssPQyzWA6OG"
      },
      "outputs": [],
      "source": [
        "from haystack.pipelines import ExtractiveQAPipeline\n",
        "\n",
        "pipe = ExtractiveQAPipeline(reader, retriever)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bXlBBxKXA6OL"
      },
      "source": [
        "## Voilà! Ask a question!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Zi97Hif2A6OM"
      },
      "outputs": [],
      "source": [
        "# You can configure how many candidates the reader and retriever shall return\n",
        "# The higher top_k for retriever, the better (but also the slower) your answers.\n",
        "prediction = pipe.run(\n",
        "    query=\"Who created the Dothraki vocabulary?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pI0wrHylzqLa"
      },
      "outputs": [],
      "source": [
        "print_answers(prediction, details=\"minimum\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "kXE84-2_zqLa"
      },
      "source": [
        "## About us\n",
        "\n",
        "This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
        "\n",
        "We bring NLP to the industry via open source!\n",
        "  \n",
        "Our focus: Industry specific language models & large scale QA systems.  \n",
        "  \n",
        "Some of our other work: \n",
        "- [German BERT](https://deepset.ai/german-bert)\n",
        "- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
        "- [FARM](https://github.com/deepset-ai/FARM)\n",
        "\n",
        "Get in touch:\n",
        "[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
        "\n",
        "By the way: [we're hiring!](https://www.deepset.ai/jobs)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    },
    "gpuClass": "standard"
  },
  "nbformat": 4,
  "nbformat_minor": 0
}