haystack/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb
2020-05-26 11:56:24 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task: Question Answering for Game of Thrones\n",
"\n",
"<img style=\"float: right;\" src=\"https://upload.wikimedia.org/wikipedia/en/d/d8/Game_of_Thrones_title_card.jpg\">\n",
"\n",
"Question Answering can be used in a variety of use cases. A very common one: Using it to navigate through complex knowledge bases or long documents (\"search setting\").\n",
"\n",
"A \"knowledge base\" could for example be your website, an internal wiki or a collection of financial reports. \n",
"In this tutorial we will work on a slightly different domain: \"Game of Thrones\". \n",
"\n",
"Let's see how we can use a bunch of Wikipedia articles to answer a variety of questions about the \n",
"marvellous Seven Kingdoms...\n",
"\n",
"\n",
"*Use this [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb) to open the notebook in Google Colab.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install git+git://github.com/deepset-ai/haystack.git@92429a40e6176a3b0c822081f6511fc1c555fabf\n",
"#! pip install farm-haystack"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from haystack import Finder\n",
"from haystack.indexing.cleaning import clean_wiki_text\n",
"from haystack.indexing.io import write_documents_to_db, fetch_archive_from_http\n",
"from haystack.reader.farm import FARMReader\n",
"from haystack.reader.transformers import TransformersReader\n",
"from haystack.utils import print_answers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Document Store\n",
"\n",
"Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.\n",
"\n",
"**Here:** We recommend Elasticsearch, as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).\n",
"\n",
"**Alternatives:** If you are unable to set up an Elasticsearch instance, follow [Tutorial 3](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb) to use the SQL/InMemory document stores instead.\n",
"\n",
"**Hint**: This tutorial creates a new document store instance with Wikipedia articles on Game of Thrones. However, you can configure Haystack to work with your existing document stores.\n",
"\n",
"### Start an Elasticsearch server\n",
"You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and execute Elasticsearch from source."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recommended: Start Elasticsearch using Docker\n",
"# ! docker run -d -p 9200:9200 -e \"discovery.type=single-node\" elasticsearch:7.6.2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q\n",
"! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz\n",
"! chown -R daemon:daemon elasticsearch-7.6.2\n",
"\n",
"import os\n",
"from subprocess import Popen, PIPE, STDOUT\n",
"es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],\n",
" stdout=PIPE, stderr=STDOUT,\n",
" preexec_fn=lambda: os.setuid(1) # as daemon\n",
" )\n",
"# wait until ES has started\n",
"! sleep 30"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"04/28/2020 12:27:32 - INFO - elasticsearch - PUT http://localhost:9200/document [status:400 request:0.010s]\n"
]
}
],
"source": [
"# Connect to Elasticsearch\n",
"\n",
"from haystack.database.elasticsearch import ElasticsearchDocumentStore\n",
"document_store = ElasticsearchDocumentStore(host=\"localhost\", username=\"\", password=\"\", index=\"document\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Cleaning & indexing documents\n",
"\n",
"Haystack provides a customizable cleaning and indexing pipeline for ingesting documents into Document Stores.\n",
"\n",
"In this tutorial, we download Wikipedia articles on Game of Thrones, apply a basic cleaning function, and index them in Elasticsearch."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"04/28/2020 12:19:09 - INFO - haystack.indexing.io - Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.\n",
"04/28/2020 12:19:09 - INFO - elasticsearch - POST http://localhost:9200/_count [status:200 request:0.066s]\n",
"04/28/2020 12:19:10 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.603s]\n",
"04/28/2020 12:19:10 - INFO - elasticsearch - POST http://localhost:9200/_bulk [status:200 request:0.040s]\n",
"04/28/2020 12:19:10 - INFO - haystack.indexing.io - Wrote 517 docs to DB\n"
]
}
],
"source": [
"# Let's first get some documents that we want to query\n",
"# Here: 517 Wikipedia articles for Game of Thrones\n",
"doc_dir = \"data/article_txt_got\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"\n",
"# Now, let's write the docs to our DB.\n",
"# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)\n",
"# It must take a str as input, and return a str.\n",
"write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True, split_paragraphs=True)"
]
},
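{
"cell_type": "markdown",
"metadata": {},
"source": [
"Any custom `clean_func` just needs to take a `str` and return a `str`. As a sketch (the function below is our own illustration, not part of Haystack):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical custom cleaning function: drop very short lines\n",
"# (e.g. leftover navigation snippets) before indexing.\n",
"def remove_short_lines(text: str) -> str:\n",
"    return \"\\n\".join(line for line in text.split(\"\\n\") if len(line.split()) > 2)\n",
"\n",
"# Pass it instead of clean_wiki_text:\n",
"# write_documents_to_db(document_store=document_store, document_dir=doc_dir,\n",
"#                       clean_func=remove_short_lines, only_empty_db=True, split_paragraphs=True)"
]
},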
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialize Retriever, Reader & Finder\n",
"\n",
"### Retriever\n",
"\n",
"Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.\n",
"They use simple but fast algorithms.\n",
"\n",
"**Here:** We use Elasticsearch's default BM25 algorithm\n",
"\n",
"**Alternatives:**\n",
"\n",
"- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters\n",
"- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)\n",
"- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from haystack.retriever.elasticsearch import ElasticsearchRetriever\n",
"retriever = ElasticsearchRetriever(document_store=document_store)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Alternative: An in-memory TfidfRetriever based on Pandas dataframes, for building quick prototypes with the SQLite document store.\n",
"\n",
"# from haystack.retriever.tfidf import TfidfRetriever\n",
"# retriever = TfidfRetriever(document_store=document_store)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reader\n",
"\n",
"A Reader scans the texts returned by retrievers in detail and extracts the k best answers. Readers are based\n",
"on powerful but slower deep learning models.\n",
"\n",
"Haystack currently supports Readers based on the frameworks FARM and Transformers.\n",
"With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).\n",
"\n",
"**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)\n",
"\n",
"**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)\n",
"\n",
"**Alternatives (Models):** e.g. \"distilbert-base-uncased-distilled-squad\" (fast) or \"deepset/bert-large-uncased-whole-word-masking-squad2\" (good accuracy)\n",
"\n",
"**Hint:** You can adjust the model to return \"no answer possible\" via the `no_ans_boost` parameter. Higher values mean the model prefers \"no answer possible\".\n",
"\n",
"#### FARMReader"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"04/28/2020 12:29:45 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
"04/28/2020 12:29:45 - INFO - farm.infer - Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...\n",
"04/28/2020 12:29:49 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is. \n",
"\t We guess it's an *ENGLISH* model ... \n",
"\t If not: Init the language model by supplying the 'language' param.\n",
"04/28/2020 12:29:54 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {\"loss_ignore_index\": -1}\n",
"04/28/2020 12:29:58 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n"
]
}
],
"source": [
"# Load a local model or any of the QA models on\n",
"# Hugging Face's model hub (https://huggingface.co/models)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=False)"
]
},
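{
"cell_type": "markdown",
"metadata": {},
"source": [
"To try the `no_ans_boost` hint from above (keyword name as used by `FARMReader`; double-check the signature if your Haystack version differs):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: make the model more willing to return \"no answer possible\".\n",
"# Higher no_ans_boost values mean more \"no answer\" predictions.\n",
"# reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\",\n",
"#                     use_gpu=False, no_ans_boost=1.0)"
]
},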
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TransformersReader"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative:\n",
"# reader = TransformersReader(model=\"distilbert-base-uncased-distilled-squad\", tokenizer=\"distilbert-base-uncased\", use_gpu=-1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finder\n",
"\n",
"The Finder sticks the Reader and Retriever together in a pipeline to answer our actual questions."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"finder = Finder(reader, retriever)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Voilà! Ask a question!"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"04/28/2020 12:27:53 - INFO - elasticsearch - GET http://localhost:9200/document/_search [status:200 request:0.113s]\n",
"04/28/2020 12:27:53 - INFO - haystack.retriever.elasticsearch - Got 10 candidates from retriever\n",
"04/28/2020 12:27:53 - INFO - haystack.finder - Reader is looking for detailed answer in 362347 chars ...\n"
]
}
],
"source": [
"# You can configure how many candidates the reader and retriever shall return\n",
"# The higher top_k_retriever, the better (but also the slower) your answers. \n",
"prediction = finder.get_answers(question=\"Who is the father of Arya Stark?\", top_k_retriever=10, top_k_reader=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# prediction = finder.get_answers(question=\"Who created the Dothraki vocabulary?\", top_k_reader=5)\n",
"# prediction = finder.get_answers(question=\"Who is the sister of Sansa?\", top_k_reader=5)"
]
},
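{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides `print_answers`, you can work with the prediction directly: it is a dict whose `\"answers\"` entry is a list of answer dicts with keys such as `\"answer\"` and `\"context\"` (the same keys that `print_answers` displays):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the top answer programmatically\n",
"# top_answer = prediction[\"answers\"][0]\n",
"# print(top_answer[\"answer\"], \"|\", top_answer[\"context\"])"
]
},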
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"pycharm": {
"is_executing": false,
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ { 'answer': 'Eddard',\n",
" 'context': 's Nymeria after a legendary warrior queen. She travels '\n",
" \"with her father, Eddard, to King's Landing when he is made \"\n",
" 'Hand of the King. Before she leaves,'},\n",
" { 'answer': 'Ned',\n",
" 'context': 'girl disguised as a boy all along and is surprised to '\n",
" \"learn she is Arya, Ned Stark's daughter. After the \"\n",
" 'Goldcloaks get help from Ser Amory Lorch and '},\n",
" { 'answer': 'Ned',\n",
" 'context': 'in the television series.\\n'\n",
" '\\n'\n",
" '\\n'\n",
" '====Season 1====\\n'\n",
" 'Arya accompanies her father Ned and her sister Sansa to '\n",
" \"King's Landing. Before their departure, Arya's ha\"},\n",
" { 'answer': 'Balon Greyjoy',\n",
" 'context': 'He sends Theon to the Iron Islands hoping to broker an '\n",
" \"alliance with Balon Greyjoy, Theon's father. In exchange \"\n",
" 'for Greyjoy support, Robb as the King '},\n",
" { 'answer': 'Brynden Tully',\n",
" 'context': 'o the weather. Sandor decides to instead take her to her '\n",
" 'great-uncle Brynden Tully. On their way to Riverrun, they '\n",
" \"encounter two men on Arya's death l\"}]\n"
]
}
],
"source": [
"print_answers(prediction, details=\"minimal\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}