<!---
title: "Tutorial 7"
metaTitle: "Generative QA"
metaDescription: ""
slug: "/docs/tutorial7"
date: "2020-11-12"
id: "tutorial7md"
--->

# Generative QA with "Retrieval-Augmented Generation"

While extractive QA highlights the span of text that answers a query,
generative QA can return a novel text answer that it has composed.
In this tutorial, you will learn how to set up a generative system using the
[RAG model](https://arxiv.org/abs/2005.11401), which conditions the
answer generator on a set of retrieved documents.

Here are the packages and imports that we'll need:


```python
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4
!pip install torch==1.6.0+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

```


```python
from typing import List
import requests
import pandas as pd
from haystack import Document
from haystack.document_store.faiss import FAISSDocumentStore
from haystack.generator.transformers import RAGenerator
from haystack.retriever.dense import DensePassageRetriever
```

Let's download a CSV file containing some sample text and preprocess the data.


```python
# Download sample
response = requests.get("https://raw.githubusercontent.com/deepset-ai/haystack/master/tutorials/small_generator_dataset.csv")
response.raise_for_status()  # Fail early if the download did not succeed
with open("small_generator_dataset.csv", "wb") as f:
    f.write(response.content)

# Create dataframe with columns "title" and "text"
df = pd.read_csv("small_generator_dataset.csv", sep=',')
# Minimal cleaning
df.fillna(value="", inplace=True)

print(df.head())
```

We can cast our data into Haystack `Document` objects.
Alternatively, we can also just use dictionaries with "text" and "meta" fields.


```python
# Use data to initialize Document objects
titles = list(df["title"].values)
texts = list(df["text"].values)
documents: List[Document] = []
for title, text in zip(titles, texts):
    documents.append(
        Document(
            text=text,
            meta={
                "name": title or ""
            }
        )
    )
```
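
The dictionary alternative mentioned above can be sketched as follows. This is a minimal illustration that assumes each dictionary carries the same "text" and "meta" keys used for the `Document` objects; the `rows` values are hypothetical stand-ins for the CSV columns.

```python
# Hypothetical (title, text) pairs standing in for the CSV columns
rows = [
    ("Example title", "Example passage text."),
    ("", "A passage whose title is missing."),
]

# Build plain dictionaries with the "text" and "meta" fields
dict_documents = [
    {"text": text, "meta": {"name": title or ""}}
    for title, text in rows
]

print(dict_documents[0]["meta"]["name"])
```

A list like `dict_documents` should then be usable in place of `documents` when writing to the document store.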

Here we initialize the FAISSDocumentStore, DensePassageRetriever and RAGenerator.
FAISS is chosen here because it is optimized for vector storage and similarity search.


```python
# Initialize FAISS document store.
# Set `return_embedding` to `True` so the generator doesn't have to re-embed the documents
document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    return_embedding=True
)

# Initialize the DPR retriever, which encodes documents and questions and queries the document store
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=False,
    embed_title=True,
)

# Initialize RAG Generator
generator = RAGenerator(
    model_name_or_path="facebook/rag-token-nq",
    use_gpu=False,
    top_k_answers=1,
    max_length=200,
    min_length=2,
    embed_title=True,
    num_beams=2,
)
```
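
To give an intuition for what a "Flat" index does, here is a toy sketch in plain Python: it keeps every vector and compares the query against all of them. The class `ToyFlatIndex` and its methods are hypothetical and are not part of FAISS or Haystack, and the inner product is assumed as the similarity measure; the real libraries use optimized implementations.

```python
from typing import List, Tuple


class ToyFlatIndex:
    """Hypothetical brute-force index: keeps all vectors and scans them on search."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []

    def add(self, vector: List[float]) -> int:
        # Store the vector and return its position, which serves as its id
        self.vectors.append(vector)
        return len(self.vectors) - 1

    def search(self, query: List[float], top_k: int = 1) -> List[Tuple[int, float]]:
        # Score every stored vector by its inner product with the query
        scores = [
            (idx, sum(q * v for q, v in zip(query, vec)))
            for idx, vec in enumerate(self.vectors)
        ]
        # Highest score first, then keep the top_k results
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:top_k]


index = ToyFlatIndex()
index.add([1.0, 0.0])
index.add([0.0, 1.0])
index.add([0.7, 0.7])

# Returns (id, score) pairs for the two best-matching vectors
results = index.search([0.9, 0.1], top_k=2)
print(results)
```

Because every vector is scanned, the search is exact but its cost grows linearly with the number of stored documents, which is acceptable for a small dataset like the one in this tutorial.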

We write the documents to the DocumentStore by first deleting any existing documents and then calling `write_documents()`.
The `update_embeddings()` method uses the retriever to create an embedding for each document.


```python
# Delete existing documents in the document store
document_store.delete_all_documents()

# Write documents to document store
document_store.write_documents(documents)

# Add document embeddings to the index
document_store.update_embeddings(
    retriever=retriever
)
```

Here are our questions:


```python
QUESTIONS = [
    "who got the first nobel prize in physics",
    "when is the next deadpool movie being released",
    "which mode is used for short wave broadcast service",
    "who is the owner of reading football club",
    "when is the next scandal episode coming out",
    "when is the last time the philadelphia won the superbowl",
    "what is the most current adobe flash player version",
    "how many episodes are there in dragon ball z",
    "what is the first step in the evolution of the eye",
    "where is gall bladder situated in human body",
    "what is the main mineral in lithium batteries",
    "who is the president of usa right now",
    "where do the greasers live in the outsiders",
    "panda is a national animal of which country",
    "what is the name of manchester united stadium",
]
```

Now let's run our system!
The retriever will pick out a small subset of documents that it finds relevant.
These are used to condition the generator as it generates the answer.
What it returns should then be novel text that forms an answer to your question!


```python
# Now generate an answer for each question
for question in QUESTIONS:
    # Retrieve related documents from retriever
    retriever_results = retriever.retrieve(
        query=question
    )

    # Now generate answer from question and retrieved documents
    predicted_result = generator.predict(
        query=question,
        documents=retriever_results,
        top_k=1
    )

    # Print your answer
    answers = predicted_result["answers"]
    print(f'Generated answer is \'{answers[0]["answer"]}\' for the question = \'{question}\'')
```
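
In case the `answers` list ever comes back empty, it can help to guard the lookup. The helper below is a hypothetical sketch that only assumes the result shape used above, a dictionary with an "answers" list whose items have an "answer" key; `first_answer` and the sample dictionaries are illustrative and are not part of Haystack.

```python
from typing import Any, Dict, Optional


def first_answer(predicted_result: Dict[str, Any]) -> Optional[str]:
    """Return the text of the first answer, or None if there is none."""
    answers = predicted_result.get("answers") or []
    if not answers:
        return None
    return answers[0].get("answer")


# Hypothetical results mimicking the shape of the generator output
sample_result = {"query": "example question", "answers": [{"answer": "example answer"}]}
empty_result = {"query": "example question", "answers": []}

print(first_answer(sample_result))
print(first_answer(empty_result))
```

Inside the loop, `first_answer(predicted_result)` could replace the direct `answers[0]["answer"]` lookup, with a `None` check before printing.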