haystack/tutorials/Tutorial4_FAQ_style_QA.ipynb

285 lines
8.7 KiB
Plaintext
Raw Normal View History

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
Create documentation website (#272) * Skeleton of doc website * Flesh out documentation pages * Split concepts into their own rst files * add tutorial rsts * Consistent level 1 markdown headers in tutorials * Change theme to readthedocs * Turn bullet points into prose * Populate sections * Add more text * Add more sphinx files * Add more retriever documentation * combined all documenations in one structure * rename of src to _src as it was ignored by git * Incorporate MP2's changes * add benchmark bar charts * Adapt docstrings in Readers * Improvements to intro, creation of glossary * Adapt docstrings in Retrievers * Adapt docstrings in Finder * Adapt Docstrings of Finder * Updates to text * Edit text * update doc strings * proof read tutorials * Edit text * Edit text * Add stacked chart * populate graph with data * Switch Documentation to markdown (#386) * add way to generate markdown files to sphinx * changed from rst to markdown and extended sphinx for it * fix spelling * Clean titles * delete file * change spelling * add sections to document store usage * add basic rest api docs * fix readme in setup.py * Update Tutorials * Change section names * add windows note to pip install * update intro * new renderer for markdown files * Fix typos * delete dpr_utils.py * fix windows note in get started * Fix docstrings * deleted rest api docs in api * fixed typo * Fix docstring * revert readme to rst * Fix readme * Update setup.py Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com> Co-authored-by: Bogdan Kostić <bogdankostic@web.de> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00
"# Utilizing existing FAQs for Question Answering\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb)\n",
"\n",
"While *extractive Question Answering* works on pure texts and is therefore more generalizable, there's also a common alternative that utilizes existing FAQ data.\n",
"\n",
Create documentation website (#272) * Skeleton of doc website * Flesh out documentation pages * Split concepts into their own rst files * add tutorial rsts * Consistent level 1 markdown headers in tutorials * Change theme to readthedocs * Turn bullet points into prose * Populate sections * Add more text * Add more sphinx files * Add more retriever documentation * combined all documenations in one structure * rename of src to _src as it was ignored by git * Incorporate MP2's changes * add benchmark bar charts * Adapt docstrings in Readers * Improvements to intro, creation of glossary * Adapt docstrings in Retrievers * Adapt docstrings in Finder * Adapt Docstrings of Finder * Updates to text * Edit text * update doc strings * proof read tutorials * Edit text * Edit text * Add stacked chart * populate graph with data * Switch Documentation to markdown (#386) * add way to generate markdown files to sphinx * changed from rst to markdown and extended sphinx for it * fix spelling * Clean titles * delete file * change spelling * add sections to document store usage * add basic rest api docs * fix readme in setup.py * Update Tutorials * Change section names * add windows note to pip install * update intro * new renderer for markdown files * Fix typos * delete dpr_utils.py * fix windows note in get started * Fix docstrings * deleted rest api docs in api * fixed typo * Fix docstring * revert readme to rst * Fix readme * Update setup.py Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com> Co-authored-by: Bogdan Kostić <bogdankostic@web.de> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00
"**Pros**:\n",
"\n",
"- Very fast at inference time\n",
"- Utilize existing FAQ data\n",
"- Quite good control over answers\n",
"\n",
Create documentation website (#272) * Skeleton of doc website * Flesh out documentation pages * Split concepts into their own rst files * add tutorial rsts * Consistent level 1 markdown headers in tutorials * Change theme to readthedocs * Turn bullet points into prose * Populate sections * Add more text * Add more sphinx files * Add more retriever documentation * combined all documenations in one structure * rename of src to _src as it was ignored by git * Incorporate MP2's changes * add benchmark bar charts * Adapt docstrings in Readers * Improvements to intro, creation of glossary * Adapt docstrings in Retrievers * Adapt docstrings in Finder * Adapt Docstrings of Finder * Updates to text * Edit text * update doc strings * proof read tutorials * Edit text * Edit text * Add stacked chart * populate graph with data * Switch Documentation to markdown (#386) * add way to generate markdown files to sphinx * changed from rst to markdown and extended sphinx for it * fix spelling * Clean titles * delete file * change spelling * add sections to document store usage * add basic rest api docs * fix readme in setup.py * Update Tutorials * Change section names * add windows note to pip install * update intro * new renderer for markdown files * Fix typos * delete dpr_utils.py * fix windows note in get started * Fix docstrings * deleted rest api docs in api * fixed typo * Fix docstring * revert readme to rst * Fix readme * Update setup.py Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com> Co-authored-by: Bogdan Kostić <bogdankostic@web.de> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00
"**Cons**:\n",
"\n",
Create documentation website (#272) * Skeleton of doc website * Flesh out documentation pages * Split concepts into their own rst files * add tutorial rsts * Consistent level 1 markdown headers in tutorials * Change theme to readthedocs * Turn bullet points into prose * Populate sections * Add more text * Add more sphinx files * Add more retriever documentation * combined all documenations in one structure * rename of src to _src as it was ignored by git * Incorporate MP2's changes * add benchmark bar charts * Adapt docstrings in Readers * Improvements to intro, creation of glossary * Adapt docstrings in Retrievers * Adapt docstrings in Finder * Adapt Docstrings of Finder * Updates to text * Edit text * update doc strings * proof read tutorials * Edit text * Edit text * Add stacked chart * populate graph with data * Switch Documentation to markdown (#386) * add way to generate markdown files to sphinx * changed from rst to markdown and extended sphinx for it * fix spelling * Clean titles * delete file * change spelling * add sections to document store usage * add basic rest api docs * fix readme in setup.py * Update Tutorials * Change section names * add windows note to pip install * update intro * new renderer for markdown files * Fix typos * delete dpr_utils.py * fix windows note in get started * Fix docstrings * deleted rest api docs in api * fixed typo * Fix docstring * revert readme to rst * Fix readme * Update setup.py Co-authored-by: deepset <deepset@Crenolape.localdomain> Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com> Co-authored-by: Bogdan Kostić <bogdankostic@web.de> Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
2020-09-18 12:57:32 +02:00
"- Generalizability: We can only answer questions that are similar to existing ones in FAQ\n",
"\n",
"In some use cases, a combination of extractive QA and FAQ-style can also be an interesting option."
]
},
{
"cell_type": "markdown",
"source": [
"### Prepare environment\n",
"\n",
"#### Colab: Enable the GPU runtime\n",
"Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment \n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"!pip install urllib3==1.25.4\n",
"!pip install torch==1.6.0+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": false
}
},
"outputs": [],
"source": [
"from haystack import Finder\n",
"from haystack.document_store.elasticsearch import ElasticsearchDocumentStore\n",
"\n",
"from haystack.retriever.dense import EmbeddingRetriever\n",
"from haystack.utils import print_answers\n",
"import pandas as pd\n",
"import requests\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Start an Elasticsearch server\n",
"You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recommended: Start Elasticsearch using Docker\n",
"# ! docker run -d -p 9200:9200 -e \"discovery.type=single-node\" elasticsearch:7.9.2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
"! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
"! chown -R daemon:daemon elasticsearch-7.9.2\n",
"\n",
"import os\n",
"from subprocess import Popen, PIPE, STDOUT\n",
"es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],\n",
" stdout=PIPE, stderr=STDOUT,\n",
" preexec_fn=lambda: os.setuid(1) # as daemon\n",
" )\n",
"# wait until ES has started\n",
"! sleep 30\n"
]
},
{
"cell_type": "markdown",
"source": [
"### Init the DocumentStore\n",
"In contrast to Tutorial 1 (extractive QA), we:\n",
"\n",
"* specify the name of our `text_field` in Elasticsearch that we want to return as an answer\n",
"* specify the name of our `embedding_field` in Elasticsearch where we'll store the embedding of our question and that is used later for calculating our similarity to the incoming user question\n",
"* set `excluded_meta_data=[\"question_emb\"]` so that we don't return the huge embedding vectors in our search results"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"04/28/2020 12:27:32 - INFO - elasticsearch - PUT http://localhost:9200/document [status:400 request:0.010s]\n"
]
}
],
"source": [
"from haystack.document_store.elasticsearch import ElasticsearchDocumentStore\n",
"document_store = ElasticsearchDocumentStore(host=\"localhost\", username=\"\", password=\"\",\n",
" index=\"document\",\n",
" embedding_field=\"question_emb\",\n",
" embedding_dim=768,\n",
" excluded_meta_data=[\"question_emb\"])"
]
},
{
"cell_type": "markdown",
"source": [
"### Create a Retriever using embeddings\n",
"Instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of the questions (user question vs. FAQ ones).\n",
"We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"retriever = EmbeddingRetriever(document_store=document_store, embedding_model=\"deepset/sentence_bert\", use_gpu=True)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Prepare & Index FAQ data\n",
"We create a pandas dataframe containing some FAQ data (i.e curated pairs of question + answer) and index those in elasticsearch.\n",
"Here: We download some question-answer pairs related to COVID-19"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Download\n",
"temp = requests.get(\"https://raw.githubusercontent.com/deepset-ai/COVID-QA/master/data/faqs/faq_covidbert.csv\")\n",
"open('small_faq_covid.csv', 'wb').write(temp.content)\n",
"\n",
"# Get dataframe with columns \"question\", \"answer\" and some custom metadata\n",
"df = pd.read_csv(\"small_faq_covid.csv\")\n",
"# Minimal cleaning\n",
"df.fillna(value=\"\", inplace=True)\n",
"df[\"question\"] = df[\"question\"].apply(lambda x: x.strip())\n",
"print(df.head())\n",
"\n",
"# Get embeddings for our questions from the FAQs\n",
"questions = list(df[\"question\"].values)\n",
"df[\"question_emb\"] = retriever.embed_queries(texts=questions)\n",
"df = df.rename(columns={\"answer\": \"text\"})\n",
"\n",
"# Convert Dataframe to list of dicts and index them in our DocumentStore\n",
"docs_to_index = df.to_dict(orient=\"records\")\n",
"document_store.write_documents(docs_to_index)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"### Ask questions\n",
"Initialize a Finder (this time without a reader) and ask questions"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"finder = Finder(reader=None, retriever=retriever)\n",
"prediction = finder.get_answers_via_similar_questions(question=\"How is the virus spreading?\", top_k_retriever=10)\n",
"print_answers(prediction, details=\"all\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}