<!---
title: "Tutorial 12"
metaTitle: "Generative QA with LFQA"
metaDescription: ""
slug: "/docs/tutorial12"
date: "2021-04-06"
id: "tutorial12md"
--->

# Long-Form Question Answering

[Open in Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial12_LFQA.ipynb)

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg">


```python
# Make sure you have a GPU running
!nvidia-smi
```


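The `!nvidia-smi` shell escape only works inside a notebook. As a rough sketch of the same check in a plain Python script, the snippet below looks for the `nvidia-smi` executable and runs it. The helper name `gpu_visible` is hypothetical and not part of Haystack, and a successful exit code only suggests that a driver is installed. It does not prove that a deep learning framework can use the GPU.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on the PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    except OSError:
        # The executable was found but could not be started.
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```
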
```python
# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
```


```python
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.generator.transformers import Seq2SeqGenerator
```

### Document Store

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (in-memory SQLite by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried to find answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For more information on which index suits your use case, see https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index


```python
from haystack.document_store.faiss import FAISSDocumentStore

# vector_dim must match the embedding size produced by the retriever model
document_store = FAISSDocumentStore(vector_dim=128, faiss_index_factory_str="Flat")
```

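To give an intuition for what a "Flat" index does, here is a minimal pure-Python sketch of exhaustive search: every stored vector is scored against the query and the best matches are returned. This is not FAISS code. The function name `flat_search`, the toy vectors, and the use of the inner product as the similarity measure are assumptions made for illustration.

```python
def flat_search(query_vec, index, top_k=3):
    """Exhaustive ('Flat') search: score every stored vector against the query."""
    scored = [
        # Inner product between the query and one stored vector.
        (sum(q * v for q, v in zip(query_vec, vec)), doc_id)
        for doc_id, vec in index.items()
    ]
    # Highest score first.
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy 3-dimensional "embeddings" keyed by document id.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

print(flat_search([0.9, 0.1, 0.0], index, top_k=2))
```

Because every vector is compared, the result is exact but the cost grows linearly with the number of documents. Approximate indexes such as HNSW avoid the full scan, which is where the speed/accuracy trade-off mentioned above comes from.
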
### Cleaning & indexing documents

Similarly to the previous tutorials, we download and convert some Game of Thrones articles and index them in our DocumentStore.


```python
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)
```

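The conversion step can be pictured as splitting each file into paragraphs and wrapping every paragraph in a dictionary. The sketch below is a simplified stand-in, not the Haystack implementation. The helper `text_to_dicts` is hypothetical, and the `text` / `meta` key names are an assumption about the dictionary shape this version of `write_documents` accepts.

```python
def text_to_dicts(text, name, split_paragraphs=True):
    """Turn one text into a list of document dicts, one per paragraph."""
    # Paragraphs are assumed to be separated by a blank line.
    paragraphs = text.split("\n\n") if split_paragraphs else [text]
    return [
        {"text": para.strip(), "meta": {"name": name}}
        for para in paragraphs
        if para.strip()  # skip empty paragraphs
    ]


sample = (
    "Arya Stark is a fictional character.\n\n"
    "She is the daughter of Eddard Stark.\n\n"
)
docs = text_to_dicts(sample, name="arya.txt")
for doc in docs:
    print(doc)
```

Splitting into paragraphs keeps each indexed unit short, so the retriever can return focused passages instead of whole articles.
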
### Initialize Retriever and Reader/Generator

#### Retriever

**Here:** We use an `EmbeddingRetriever` with the RetriBERT model and invoke `update_embeddings` to index the embeddings of the documents in the `FAISSDocumentStore`.



```python
from haystack.retriever.dense import EmbeddingRetriever

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="yjernite/retribert-base-uncased",
                               model_format="retribert")

document_store.update_embeddings(retriever)
```

Before we blindly use the RetriBERT-based retriever, let's empirically test it to make sure a simple search indeed finds the relevant documents.


```python
from haystack.utils import print_answers, print_documents
from haystack.pipeline import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(retriever)
res = p_retrieval.run(
    query="Tell me something about Arya Stark?",
    top_k_retriever=5
)
print_documents(res, max_text_len=512)
```

#### Reader/Generator

Similar to previous tutorials, we now initialize our reader/generator.

Here we use a `Seq2SeqGenerator` with the *yjernite/bart_eli5* model (see: https://huggingface.co/yjernite/bart_eli5).



```python
generator = Seq2SeqGenerator(model_name_or_path="yjernite/bart_eli5")
```

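A sequence-to-sequence generator sees a single input string, so the query and the retrieved passages have to be merged before generation. The sketch below shows one plausible layout, with the question first and each passage introduced by a `<P>` marker. That layout is my understanding of the convention used with the ELI5 BART model and should be treated as an assumption. The helper name `build_eli5_input` is hypothetical and not a Haystack function.

```python
def build_eli5_input(query, documents):
    """Merge a query and its supporting passages into one model input string."""
    # Each passage is prefixed with a <P> marker (assumed convention).
    context = "<P> " + " <P> ".join(documents)
    return f"question: {query} context: {context}"


passages = [
    "Arya Stark is a character in A Song of Ice and Fire.",
    "She is portrayed by Maisie Williams in the HBO series.",
]
print(build_eli5_input("Who is Arya Stark?", passages))
```

The more passages are retrieved, the longer this input becomes, which is one reason the number of retrieved documents matters for both speed and answer quality.
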
### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `GenerativeQAPipeline`, which combines a retriever and a reader/generator to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).


```python
from haystack.pipeline import GenerativeQAPipeline
pipe = GenerativeQAPipeline(generator, retriever)
```

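The DAG idea can be illustrated without Haystack at all. In the sketch below, each node is a plain function that receives and returns a state dictionary, and the nodes run in topological order. Everything here is a toy stand-in: the word-overlap retriever, the echoing generator, and the names `NODES`, `GRAPH` and `run_pipeline` are assumptions for illustration, not Haystack's implementation. It requires Python 3.9 or later for `graphlib`.

```python
import re
from graphlib import TopologicalSorter

DOCS = [
    "Arya Stark is the younger daughter of Eddard Stark.",
    "Winterfell is the seat of House Stark.",
    "Daenerys Targaryen hatched three dragons.",
]


def tokens(text):
    """Lowercased set of words in a text."""
    return set(re.findall(r"\w+", text.lower()))


def toy_retriever(state):
    """Rank documents by shared words with the query (stand-in for a dense retriever)."""
    query_words = tokens(state["query"])
    ranked = sorted(DOCS, key=lambda d: len(query_words & tokens(d)), reverse=True)
    return {**state, "documents": ranked[: state["top_k"]]}


def toy_generator(state):
    """Stand-in for a seq2seq model: simply echo the retrieved context."""
    return {**state, "answer": "Based on: " + " ".join(state["documents"])}


NODES = {"Retriever": toy_retriever, "Generator": toy_generator}
# Each key maps to the set of nodes it depends on.
GRAPH = {"Retriever": set(), "Generator": {"Retriever"}}


def run_pipeline(query, top_k=1):
    """Run all nodes in dependency order, passing the state along."""
    state = {"query": query, "top_k": top_k}
    for name in TopologicalSorter(GRAPH).static_order():
        state = NODES[name](state)
    return state


result = run_pipeline("Who is Arya Stark?", top_k=1)
print(result["answer"])
```

Describing the flow as a graph rather than a fixed function chain is what makes it easy to swap a node or add a branch without touching the other nodes.
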
## Voilà! Ask a question!


```python
pipe.run(query="Why did Arya Stark's character get portrayed in a television adaptation?", top_k_retriever=1)
```


```python
pipe.run(query="What kind of character does Arya Stark play?", top_k_retriever=1)
```

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!
Our focus: Industry-specific language models & large-scale QA systems.

Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://apply.workable.com/deepset/)