haystack/tutorials/Tutorial7_RAG_Generator.ipynb
Sara Zan 13510aa753
Refactoring of the haystack package (#1624)
2021-10-25 15:50:23 +02:00


{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Generative QA with \"Retrieval-Augmented Generation\"\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial7_RAG_Generator.ipynb)\n",
"\n",
"While extractive QA highlights the span of text that answers a query,\n",
"generative QA can return a novel text answer that it has composed.\n",
"In this tutorial, you will learn how to set up a generative system using the\n",
"[RAG model](https://arxiv.org/abs/2005.11401) which conditions the\n",
"answer generator on a set of retrieved documents."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Prepare environment\n",
"\n",
"#### Colab: Enable the GPU runtime\n",
"Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here are the packages and imports that we'll need:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"!pip install grpcio-tools==1.34.1\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"from typing import List\n",
"import requests\n",
"import pandas as pd\n",
"from haystack import Document\n",
"from haystack.document_stores import FAISSDocumentStore\n",
"from haystack.nodes import RAGenerator, DensePassageRetriever"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Let's download a csv containing some sample text and preprocess the data.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Download sample\n",
"temp = requests.get(\"https://raw.githubusercontent.com/deepset-ai/haystack/master/tutorials/small_generator_dataset.csv\")\n",
"open('small_generator_dataset.csv', 'wb').write(temp.content)\n",
"\n",
"# Create dataframe with columns \"title\" and \"text\"\n",
"df = pd.read_csv(\"small_generator_dataset.csv\", sep=',')\n",
"# Minimal cleaning\n",
"df.fillna(value=\"\", inplace=True)\n",
"\n",
"print(df.head())"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"We can cast our data into Haystack Document objects.\n",
"Alternatively, we can also just use dictionaries with \"text\" and \"meta\" fields"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Use data to initialize Document objects\n",
"titles = list(df[\"title\"].values)\n",
"texts = list(df[\"text\"].values)\n",
"documents: List[Document] = []\n",
"for title, text in zip(titles, texts):\n",
" documents.append(\n",
" Document(\n",
" content=text,\n",
" meta={\n",
" \"name\": title or \"\"\n",
" }\n",
" )\n",
" )"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here we initialize the FAISSDocumentStore, DensePassageRetriever and RAGenerator.\n",
"FAISS is chosen here since it is optimized vector storage."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Initialize FAISS document store.\n",
"# Set `return_embedding` to `True`, so generator doesn't have to perform re-embedding\n",
"document_store = FAISSDocumentStore(\n",
" faiss_index_factory_str=\"Flat\",\n",
" return_embedding=True\n",
")\n",
"\n",
"# Initialize DPR Retriever to encode documents, encode question and query documents\n",
"retriever = DensePassageRetriever(\n",
" document_store=document_store,\n",
" query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n",
" passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n",
" use_gpu=True,\n",
" embed_title=True,\n",
")\n",
"\n",
"# Initialize RAG Generator\n",
"generator = RAGenerator(\n",
" model_name_or_path=\"facebook/rag-token-nq\",\n",
" use_gpu=True,\n",
" top_k=1,\n",
" max_length=200,\n",
" min_length=2,\n",
" embed_title=True,\n",
" num_beams=2,\n",
")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"We write documents to the DocumentStore, first by deleting any remaining documents then calling `write_documents()`.\n",
"The `update_embeddings()` method uses the retriever to create an embedding for each document.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Delete existing documents in documents store\n",
"document_store.delete_documents()\n",
"\n",
"# Write documents to document store\n",
"document_store.write_documents(documents)\n",
"\n",
"# Add documents embeddings to index\n",
"document_store.update_embeddings(\n",
" retriever=retriever\n",
")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here are our questions:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"QUESTIONS = [\n",
" \"who got the first nobel prize in physics\",\n",
" \"when is the next deadpool movie being released\",\n",
" \"which mode is used for short wave broadcast service\",\n",
" \"who is the owner of reading football club\",\n",
" \"when is the next scandal episode coming out\",\n",
" \"when is the last time the philadelphia won the superbowl\",\n",
" \"what is the most current adobe flash player version\",\n",
" \"how many episodes are there in dragon ball z\",\n",
" \"what is the first step in the evolution of the eye\",\n",
" \"where is gall bladder situated in human body\",\n",
" \"what is the main mineral in lithium batteries\",\n",
" \"who is the president of usa right now\",\n",
" \"where do the greasers live in the outsiders\",\n",
" \"panda is a national animal of which country\",\n",
" \"what is the name of manchester united stadium\",\n",
"]"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Now let's run our system!\n",
"The retriever will pick out a small subset of documents that it finds relevant.\n",
"These are used to condition the generator as it generates the answer.\n",
"What it should return then are novel text spans that form and answer to your question!"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Now generate an answer for each question\n",
"for question in QUESTIONS:\n",
" # Retrieve related documents from retriever\n",
" retriever_results = retriever.retrieve(\n",
" query=question\n",
" )\n",
"\n",
" # Now generate answer from question and retrieved documents\n",
" predicted_result = generator.predict(\n",
" query=question,\n",
" documents=retriever_results,\n",
" top_k=1\n",
" )\n",
"\n",
" # Print you answer\n",
" answers = predicted_result[\"answers\"]\n",
" print(f'Generated answer is \\'{answers[0][\"answer\"]}\\' for the question = \\'{question}\\'')"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Or alternatively use the Pipeline class\n",
"from haystack.pipelines import GenerativeQAPipeline\n",
"\n",
"pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)\n",
"for question in QUESTIONS:\n",
" res = pipe.run(query=question, params={\"Generator\": {\"top_k\": 1}, \"Retriever\": {\"top_k\": 5}})\n",
" print(res)"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}