{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "bEH-CRbeA6NU"
},
"source": [
"# Better Retrieval via \"Embedding Retrieval\"\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb)\n",
"\n",
"### Importance of Retrievers\n",
"\n",
"The Retriever has a huge impact on the performance of our overall search pipeline.\n",
"\n",
"\n",
"### Different types of Retrievers\n",
"#### Sparse\n",
"Family of algorithms based on counting the occurrences of words (bag-of-words) resulting in very sparse vectors with length = vocab size.\n",
"\n",
"**Examples**: BM25, TF-IDF\n",
"\n",
"**Pros**: Simple, fast, well explainable\n",
"\n",
"**Cons**: Relies on exact keyword matches between query and text\n",
" \n",
"\n",
"#### Dense\n",
"These retrievers use neural network models to create \"dense\" embedding vectors. Within this family, there are two different approaches:\n",
"\n",
"a) Single encoder: Use a **single model** to embed both the query and the passage.\n",
"b) Dual encoder: Use **two models**, one to embed the query and one to embed the passage.\n",
"\n",
"**Examples**: REALM, DPR, Sentence-Transformers\n",
"\n",
"**Pros**: Captures semantic similarity instead of \"word matches\" (for example, synonyms, related topics).\n",
"\n",
"**Cons**: Computationally heavier to use, and the models require initial training (though this is less of an issue nowadays, as many pre-trained models are available and most of the time it's not necessary to train your own).\n",
"\n",
"\n",
"### Embedding Retrieval\n",
"\n",
"In this tutorial, we use an `EmbeddingRetriever` with [Sentence Transformers](https://www.sbert.net/index.html) models.\n",
"\n",
"These models are trained to embed similar sentences close to each other in a shared embedding space.\n",
"\n",
"Some models have been fine-tuned on massive Information Retrieval data and can be used to retrieve documents based on a short query (for example, `multi-qa-mpnet-base-dot-v1`). Others are more suited to semantic similarity tasks where you are trying to find the most similar documents to a given document (for example, `all-mpnet-base-v2`). There are even models that are multilingual (for example, `paraphrase-multilingual-mpnet-base-v2`). For a good overview of different models with their evaluation metrics, see the [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#) page in the Sentence Transformers documentation.\n",
"\n",
"*Use this* [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb) *to open the notebook in Google Colab.*\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3K27Y5FbA6NV"
},
"source": [
"### Prepare the Environment\n",
"\n",
"#### Colab: Enable the GPU Runtime\n",
"Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/colab_gpu_runtime.jpg\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JlZgP8q1A6NW"
},
"outputs": [],
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NM36kbRFA6Nc"
},
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install --upgrade pip\n",
"!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]"
]
},
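{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Optional: A Quick Look at Dense Retrieval\n",
"\n",
"To build some intuition for the dense retrieval approach described in the introduction, here is a minimal, optional sketch that uses the `sentence-transformers` library directly (it is installed as part of the Haystack installation above). It embeds a query and two made-up passages with the same `multi-qa-mpnet-base-dot-v1` model we pass to the `EmbeddingRetriever` later, and ranks the passages by dot-product similarity. The query and passages are illustrative examples only, not part of the tutorial dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: dense retrieval with sentence-transformers directly\n",
"from sentence_transformers import SentenceTransformer, util\n",
"\n",
"model = SentenceTransformer(\"multi-qa-mpnet-base-dot-v1\")\n",
"\n",
"query = \"Who composed the music for the series?\"\n",
"passages = [\n",
"    \"The music for the series was composed by Ramin Djawadi.\",\n",
"    \"The series was filmed in several locations, including Northern Ireland.\",\n",
"]\n",
"\n",
"# Embed the query and the passages with the same model (single-encoder approach)\n",
"query_emb = model.encode(query, convert_to_tensor=True)\n",
"passage_embs = model.encode(passages, convert_to_tensor=True)\n",
"\n",
"# Rank the passages by dot-product similarity to the query\n",
"scores = util.dot_score(query_emb, passage_embs)[0]\n",
"for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True):\n",
"    print(f\"{score:.2f}  {passage}\")"
]
},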
{
"cell_type": "markdown",
"source": [
"## Logging\n",
"\n",
"We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
"Example log message:\n",
"`INFO - haystack.utils.preprocessing - Converting data/tutorial1/218_Olenna_Tyrell.txt`\n",
"The default log level in `basicConfig` is WARNING, so the explicit parameter is not necessary here, but it can be changed easily:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "GbM2ml-ozqLX"
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import logging\n",
"\n",
"logging.basicConfig(format=\"%(levelname)s - %(name)s - %(message)s\", level=logging.WARNING)\n",
"logging.getLogger(\"haystack\").setLevel(logging.INFO)"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"id": "kQWEUUMnzqLX"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xmRuhTQ7A6Nh"
},
"outputs": [],
"source": [
"from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http, print_answers\n",
"from haystack.nodes import FARMReader, TransformersReader"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q3dSo7ZtA6Nl"
},
"source": [
"### Document Store\n",
"\n",
"#### Option 1: FAISS\n",
"\n",
"FAISS is a library for efficient similarity search and clustering of dense vectors.\n",
"The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood\n",
"to store the document text and other metadata. The vector embeddings of the text are\n",
"indexed in a FAISS index that is later queried to search for answers.\n",
"The default index type of `FAISSDocumentStore` is \"Flat\", but it can also be set to \"HNSW\" for\n",
"faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.\n",
"For more info on which index type suits your use case, see: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1cYgDJmrA6Nv",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack.document_stores import FAISSDocumentStore\n",
"\n",
"document_store = FAISSDocumentStore(faiss_index_factory_str=\"Flat\")"
]
},
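{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch (commented out so it does not replace the store created above):\n",
"# the \"HNSW\" index type mentioned above trades a little accuracy for faster search.\n",
"\n",
"# document_store = FAISSDocumentStore(faiss_index_factory_str=\"HNSW\")"
]
},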
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "s4HK5l0qzqLZ"
},
"source": [
"#### Option 2: Milvus\n",
"\n",
"Milvus is an open-source vector database that, like FAISS, is optimized for vector similarity search.\n",
"Like FAISS, it has both a \"Flat\" and an \"HNSW\" mode, but it outperforms FAISS when it comes to dynamic data management.\n",
"It does require a little more setup, however, as it runs through Docker and requires some config files to be set up.\n",
"See [their docs](https://milvus.io/docs/v1.0.0/milvus_docker-cpu.md) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"id": "2Ur4h-E3zqLZ"
},
"outputs": [],
"source": [
"# Milvus cannot be run on Colab, so this cell is commented out.\n",
"# To run Milvus you need Docker (for Milvus versions below 2.0.0) or docker-compose (for versions >= 2.0.0), neither of which is available on Colab.\n",
"# See Milvus' documentation for more details: https://milvus.io/docs/install_standalone-docker.md\n",
"\n",
"# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[milvus]\n",
"\n",
"# from haystack.utils import launch_milvus\n",
"# from haystack.document_stores import MilvusDocumentStore\n",
"\n",
"# launch_milvus()\n",
"# document_store = MilvusDocumentStore()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "06LatTJBA6N0",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Cleaning & indexing documents\n",
"\n",
"As in the previous tutorials, we download, convert, and index some Game of Thrones articles into our DocumentStore."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iqKnu6wxA6N1",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let's first get some files that we want to use\n",
"doc_dir = \"data/tutorial6\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt6.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# Convert the files to Haystack Document objects\n",
"docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
"\n",
"# Now, let's write the documents to our DB.\n",
"document_store.write_documents(docs)"
]
},
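{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (a small addition to the original tutorial):\n",
"# verify how many documents were written to the DocumentStore.\n",
"document_store.get_document_count()"
]
},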
{
"cell_type": "markdown",
"metadata": {
"id": "wgjedxx_A6N6"
},
"source": [
"### Initialize Retriever, Reader & Pipeline\n",
"\n",
"#### Retriever\n",
"\n",
"**Here:** We use an `EmbeddingRetriever`.\n",
"\n",
"**Alternatives:**\n",
"\n",
"- `BM25Retriever` with custom queries (for example, boosting) and filters\n",
"- `DensePassageRetriever` which uses two encoder models, one to embed the query and one to embed the passage, and then compares the embeddings for retrieval (a commented-out sketch follows the next code cell)\n",
"- `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kFwiPP60A6N7",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from haystack.nodes import EmbeddingRetriever\n",
"\n",
"retriever = EmbeddingRetriever(\n",
"    document_store=document_store,\n",
"    embedding_model=\"sentence-transformers/multi-qa-mpnet-base-dot-v1\",\n",
"    model_format=\"sentence_transformers\",\n",
")\n",
"# Important:\n",
"# Now that we initialized the Retriever, we need to call update_embeddings() to iterate over all\n",
"# previously indexed documents and update their embedding representation.\n",
"# While this can be a time-consuming operation (depending on the corpus size), it only needs to be done once.\n",
"# At query time, we only need to embed the query and compare it to the existing document embeddings, which is very fast.\n",
"document_store.update_embeddings(retriever)"
]
},
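{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative (commented-out sketch): the dual-encoder DensePassageRetriever listed above.\n",
"# The model names below are the common DPR encoders on the Hugging Face hub; adjust them to your needs.\n",
"# If you switch to this retriever, re-run update_embeddings() with it as well.\n",
"\n",
"# from haystack.nodes import DensePassageRetriever\n",
"\n",
"# retriever = DensePassageRetriever(\n",
"#     document_store=document_store,\n",
"#     query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n",
"#     passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n",
"# )\n",
"# document_store.update_embeddings(retriever)"
]
},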
{
"cell_type": "markdown",
"metadata": {
"id": "rnVR28OXA6OA"
},
"source": [
"#### Reader\n",
"\n",
"Similar to the previous tutorials, we now initialize our reader.\n",
"\n",
"Here we use a FARMReader with the *deepset/roberta-base-squad2* model (see: https://huggingface.co/deepset/roberta-base-squad2).\n",
"\n",
"\n",
"\n",
"##### FARMReader"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fyIuWVwhA6OB"
},
"outputs": [],
"source": [
"# Load a local model or any of the QA models on\n",
"# Hugging Face's model hub (https://huggingface.co/models)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
]
},
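{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative (commented-out sketch): the TransformersReader imported above can be used\n",
"# instead of the FARMReader, for example with the same model:\n",
"\n",
"# reader = TransformersReader(model_name_or_path=\"deepset/roberta-base-squad2\")"
]
},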
{
"cell_type": "markdown",
"metadata": {
"id": "unhLD18yA6OF"
},
"source": [
"### Pipeline\n",
"\n",
"With a Haystack `Pipeline` you can stick your building blocks together into a search pipeline.\n",
"Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
"To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.\n",
"You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TssPQyzWA6OG"
},
"outputs": [],
"source": [
"from haystack.pipelines import ExtractiveQAPipeline\n",
"\n",
"pipe = ExtractiveQAPipeline(reader, retriever)"
]
},
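{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the DAG idea above concrete, here is a short, optional sketch of how roughly the same Retriever-Reader graph could be assembled by hand with the generic `Pipeline` class and `add_node()`. The node names `Retriever` and `Reader` are our own choice; we keep using the predefined `pipe` from the cell above for the rest of the tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: building an equivalent pipeline manually as a DAG\n",
"from haystack.pipelines import Pipeline\n",
"\n",
"custom_pipe = Pipeline()\n",
"custom_pipe.add_node(component=retriever, name=\"Retriever\", inputs=[\"Query\"])\n",
"custom_pipe.add_node(component=reader, name=\"Reader\", inputs=[\"Retriever\"])"
]
},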
{
"cell_type": "markdown",
"metadata": {
"id": "bXlBBxKXA6OL"
},
"source": [
"## Voilà! Ask a question!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Zi97Hif2A6OM"
},
"outputs": [],
"source": [
"# You can configure how many candidates the reader and retriever shall return\n",
"# The higher the retriever's top_k, the better (but also the slower) your answers.\n",
"prediction = pipe.run(\n",
"    query=\"Who created the Dothraki vocabulary?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pI0wrHylzqLa"
},
"outputs": [],
"source": [
"print_answers(prediction, details=\"minimum\")"
]
},
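{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to see which documents the `EmbeddingRetriever` passes to the reader, you can also call it on its own. This is a small, optional addition to the tutorial; `retrieve()` returns the top-ranked `Document` objects for a query."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the retriever's output on its own\n",
"retrieved_docs = retriever.retrieve(query=\"Who created the Dothraki vocabulary?\", top_k=3)\n",
"for doc in retrieved_docs:\n",
"    print(doc.meta.get(\"name\"), \"-\", doc.content[:100])"
]
},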
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "kXE84-2_zqLa"
},
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source!\n",
" \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"name": "Tutorial6_Better_Retrieval_via_Embedding_Retrieval.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"gpuClass": "standard"
},
"nbformat": 4,
"nbformat_minor": 0
}