haystack/tutorials/Tutorial7_RAG_Generator.ipynb
Malte Pietsch 4a6c9302b3
Redesign primitives - Document, Answer, Label (#1398)
* first draft / notes on new primitives

* wip label / feedback refactor

* rename doc.text -> doc.content. add doc.content_type

* add datatype for content

* remove faq_question_field from ES and weaviate. rename text_field -> content_field in docstores. update tutorials for content field

* update converters for . Add warning for empty

* renam label.question -> label.query. Allow sorting of Answers.

* WIP primitives

* update ui/reader for new Answer format

* Improve Label. First refactoring of MultiLabel. Adjust eval code

* fixed workflow conflict with introducing new one (#1472)

* Add latest docstring and tutorial changes

* make add_eval_data() work again

* fix reader formats. WIP fix _extract_docs_and_labels_from_dict

* fix test reader

* Add latest docstring and tutorial changes

* fix another test case for reader

* fix mypy in farm reader.eval()

* fix mypy in farm reader.eval()

* WIP ORM refactor

* Add latest docstring and tutorial changes

* fix mypy weaviate

* make label and multilabel dataclasses

* bump mypy env in CI to python 3.8

* WIP refactor Label ORM

* WIP refactor Label ORM

* simplify tests for individual doc stores

* WIP refactoring markers of tests

* test alternative approach for tests with existing parametrization

* WIP refactor ORMs

* fix skip logic of already parametrized tests

* fix weaviate behaviour in tests - not parametrizing it in our general test cases.

* Add latest docstring and tutorial changes

* fix some tests

* remove sql from document_store_types

* fix markers for generator and pipeline test

* remove inmemory marker

* remove unneeded elasticsearch markers

* add dataclasses-json dependency. adjust ORM to just store JSON repr

* ignore type as dataclasses_json seems to miss functionality here

* update readme and contributing.md

* update contributing

* adjust example

* fix duplicate doc handling for custom index

* Add latest docstring and tutorial changes

* fix some ORM issues. fix get_all_labels_aggregated.

* update drop flags where get_all_labels_aggregated() was used before

* Add latest docstring and tutorial changes

* add to_json(). add + fix tests

* fix no_answer handling in label / multilabel

* fix duplicate docs in memory doc store. change primary key for sql doc table

* fix mypy issues

* fix mypy issues

* haystack/retriever/base.py

* fix test_write_document_meta[elastic]

* fix test_elasticsearch_custom_fields

* fix test_labels[elastic]

* fix crawler

* fix converter

* fix docx converter

* fix preprocessor

* fix test_utils

* fix tfidf retriever. fix selection of docstore in tests with multiple fixtures / parameterizations

* Add latest docstring and tutorial changes

* fix crawler test. fix ocrconverter attribute

* fix test_elasticsearch_custom_query

* fix generator pipeline

* fix ocr converter

* fix ragenerator

* Add latest docstring and tutorial changes

* fix test_load_and_save_yaml for elasticsearch

* fixes for pipeline tests

* fix faq pipeline

* fix pipeline tests

* Add latest docstring and tutorial changes

* fix weaviate

* Add latest docstring and tutorial changes

* trigger CI

* satisfy mypy

* Add latest docstring and tutorial changes

* satisfy mypy

* Add latest docstring and tutorial changes

* trigger CI

* fix question generation test

* fix ray. fix Q-generation

* fix translator test

* satisfy mypy

* wip refactor feedback rest api

* fix rest api feedback endpoint

* fix doc classifier

* remove relation of Labels -> Docs in SQL ORM

* fix faiss/milvus tests

* fix doc classifier test

* fix eval test

* fixing eval issues

* Add latest docstring and tutorial changes

* fix mypy

* WIP replace dataclasses-json with manual serialization

* Add latest docstring and tutorial changes

* revert to dataclass-json serialization for now. remove debug prints.

* update docstrings

* fix extractor. fix Answer Span init

* fix api test

* keep meta data of answers in reader.run()

* fix meta handling

* adress review feedback

* Add latest docstring and tutorial changes

* make document=None for open domain labels

* add import

* fix print utils

* fix rest api

* adress review feedback

* Add latest docstring and tutorial changes

* fix mypy

Co-authored-by: Markus Paff <markuspaff.mp@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2021-10-13 14:23:23 +02:00

394 lines
11 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Generative QA with \"Retrieval-Augmented Generation\"\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial7_RAG_Generator.ipynb)\n",
"\n",
"While extractive QA highlights the span of text that answers a query,\n",
"generative QA can return a novel text answer that it has composed.\n",
"In this tutorial, you will learn how to set up a generative system using the\n",
"[RAG model](https://arxiv.org/abs/2005.11401) which conditions the\n",
"answer generator on a set of retrieved documents."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Prepare environment\n",
"\n",
"#### Colab: Enable the GPU runtime\n",
"Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here are the packages and imports that we'll need:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install grpcio-tools==1.34.1\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from typing import List\n",
"import requests\n",
"import pandas as pd\n",
"from haystack import Document\n",
"from haystack.document_store.faiss import FAISSDocumentStore\n",
"from haystack.generator.transformers import RAGenerator\n",
"from haystack.retriever.dense import DensePassageRetriever"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Let's download a csv containing some sample text and preprocess the data.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Download sample\n",
"temp = requests.get(\"https://raw.githubusercontent.com/deepset-ai/haystack/master/tutorials/small_generator_dataset.csv\")\n",
"open('small_generator_dataset.csv', 'wb').write(temp.content)\n",
"\n",
"# Create dataframe with columns \"title\" and \"text\"\n",
"df = pd.read_csv(\"small_generator_dataset.csv\", sep=',')\n",
"# Minimal cleaning\n",
"df.fillna(value=\"\", inplace=True)\n",
"\n",
"print(df.head())"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"We can cast our data into Haystack Document objects.\n",
"Alternatively, we can also just use dictionaries with \"text\" and \"meta\" fields"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Use data to initialize Document objects\n",
"titles = list(df[\"title\"].values)\n",
"texts = list(df[\"text\"].values)\n",
"documents: List[Document] = []\n",
"for title, text in zip(titles, texts):\n",
" documents.append(\n",
" Document(\n",
" content=text,\n",
" meta={\n",
" \"name\": title or \"\"\n",
" }\n",
" )\n",
" )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here we initialize the FAISSDocumentStore, DensePassageRetriever and RAGenerator.\n",
"FAISS is chosen here since it is optimized vector storage."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Initialize FAISS document store.\n",
"# Set `return_embedding` to `True`, so generator doesn't have to perform re-embedding\n",
"document_store = FAISSDocumentStore(\n",
" faiss_index_factory_str=\"Flat\",\n",
" return_embedding=True\n",
")\n",
"\n",
"# Initialize DPR Retriever to encode documents, encode question and query documents\n",
"retriever = DensePassageRetriever(\n",
" document_store=document_store,\n",
" query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n",
" passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n",
" use_gpu=True,\n",
" embed_title=True,\n",
")\n",
"\n",
"# Initialize RAG Generator\n",
"generator = RAGenerator(\n",
" model_name_or_path=\"facebook/rag-token-nq\",\n",
" use_gpu=True,\n",
" top_k=1,\n",
" max_length=200,\n",
" min_length=2,\n",
" embed_title=True,\n",
" num_beams=2,\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"We write documents to the DocumentStore, first by deleting any remaining documents then calling `write_documents()`.\n",
"The `update_embeddings()` method uses the retriever to create an embedding for each document.\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Delete existing documents in documents store\n",
"document_store.delete_documents()\n",
"\n",
"# Write documents to document store\n",
"document_store.write_documents(documents)\n",
"\n",
"# Add documents embeddings to index\n",
"document_store.update_embeddings(\n",
" retriever=retriever\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here are our questions:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"QUESTIONS = [\n",
" \"who got the first nobel prize in physics\",\n",
" \"when is the next deadpool movie being released\",\n",
" \"which mode is used for short wave broadcast service\",\n",
" \"who is the owner of reading football club\",\n",
" \"when is the next scandal episode coming out\",\n",
" \"when is the last time the philadelphia won the superbowl\",\n",
" \"what is the most current adobe flash player version\",\n",
" \"how many episodes are there in dragon ball z\",\n",
" \"what is the first step in the evolution of the eye\",\n",
" \"where is gall bladder situated in human body\",\n",
" \"what is the main mineral in lithium batteries\",\n",
" \"who is the president of usa right now\",\n",
" \"where do the greasers live in the outsiders\",\n",
" \"panda is a national animal of which country\",\n",
" \"what is the name of manchester united stadium\",\n",
"]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Now let's run our system!\n",
"The retriever will pick out a small subset of documents that it finds relevant.\n",
"These are used to condition the generator as it generates the answer.\n",
"What it should return then are novel text spans that form and answer to your question!"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Now generate an answer for each question\n",
"for question in QUESTIONS:\n",
" # Retrieve related documents from retriever\n",
" retriever_results = retriever.retrieve(\n",
" query=question\n",
" )\n",
"\n",
" # Now generate answer from question and retrieved documents\n",
" predicted_result = generator.predict(\n",
" query=question,\n",
" documents=retriever_results,\n",
" top_k=1\n",
" )\n",
"\n",
" # Print you answer\n",
" answers = predicted_result[\"answers\"]\n",
" print(f'Generated answer is \\'{answers[0][\"answer\"]}\\' for the question = \\'{question}\\'')"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Or alternatively use the Pipeline class\n",
"from haystack.pipeline import GenerativeQAPipeline\n",
"\n",
"pipe = GenerativeQAPipeline(generator=generator, retriever=retriever)\n",
"for question in QUESTIONS:\n",
" res = pipe.run(query=question, params={\"Generator\": {\"top_k\": 1}, \"Retriever\": {\"top_k\": 5}})\n",
" print(res)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://apply.workable.com/deepset/) "
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}