mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-25 09:50:14 +00:00

333 lines
9.7 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Utilizing existing FAQs for Question Answering\n",
|
|
"\n",
|
|
"[](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial4_FAQ_style_QA.ipynb)\n",
|
|
"\n",
|
|
"While *extractive Question Answering* works on pure texts and is therefore more generalizable, there's also a common alternative that utilizes existing FAQ data.\n",
|
|
"\n",
|
|
"**Pros**:\n",
|
|
"\n",
|
|
"- Very fast at inference time\n",
|
|
"- Utilize existing FAQ data\n",
|
|
"- Quite good control over answers\n",
|
|
"\n",
|
|
"**Cons**:\n",
|
|
"\n",
|
|
"- Generalizability: We can only answer questions that are similar to existing ones in FAQ\n",
|
|
"\n",
|
|
"In some use cases, a combination of extractive QA and FAQ-style can also be an interesting option."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"### Prepare environment\n",
|
|
"\n",
|
|
"#### Colab: Enable the GPU runtime\n",
|
|
"Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
|
|
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
|
|
"\n",
|
|
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Make sure you have a GPU running\n",
|
|
"!nvidia-smi"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Install the latest release of Haystack in your own environment\n",
|
|
"#! pip install farm-haystack\n",
|
|
"\n",
|
|
"# Install the latest master of Haystack\n",
|
|
"!pip install --upgrade pip\n",
|
|
"!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"pycharm": {
|
|
"is_executing": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from haystack.document_stores import ElasticsearchDocumentStore\n",
|
|
"\n",
|
|
"from haystack.nodes import EmbeddingRetriever\n",
|
|
"import pandas as pd\n",
|
|
"import requests"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Start an Elasticsearch server\n",
|
|
"You can start Elasticsearch on your local machine instance using Docker. If Docker is not readily available in your environment (eg., in Colab notebooks), then you can manually download and execute Elasticsearch from source."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Recommended: Start Elasticsearch using Docker via the Haystack utility function\n",
|
|
"from haystack.utils import launch_es, fetch_archive_from_http\n",
|
|
"\n",
|
|
"launch_es()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
|
|
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
|
|
"! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
|
|
"! chown -R daemon:daemon elasticsearch-7.9.2\n",
|
|
"\n",
|
|
"import os\n",
|
|
"from subprocess import Popen, PIPE, STDOUT\n",
|
|
"\n",
|
|
"es_server = Popen(\n",
|
|
" [\"elasticsearch-7.9.2/bin/elasticsearch\"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1) # as daemon\n",
|
|
")\n",
|
|
"# wait until ES has started\n",
|
|
"! sleep 30"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"### Init the DocumentStore\n",
|
|
"In contrast to Tutorial 1 (extractive QA), we:\n",
|
|
"\n",
|
|
"* specify the name of our `text_field` in Elasticsearch that we want to return as an answer\n",
|
|
"* specify the name of our `embedding_field` in Elasticsearch where we'll store the embedding of our question and that is used later for calculating our similarity to the incoming user question\n",
|
|
"* set `excluded_meta_data=[\"question_emb\"]` so that we don't return the huge embedding vectors in our search results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"04/28/2020 12:27:32 - INFO - elasticsearch - PUT http://localhost:9200/document [status:400 request:0.010s]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from haystack.document_stores import ElasticsearchDocumentStore\n",
|
|
"\n",
|
|
"document_store = ElasticsearchDocumentStore(\n",
|
|
" host=\"localhost\",\n",
|
|
" username=\"\",\n",
|
|
" password=\"\",\n",
|
|
" index=\"document\",\n",
|
|
" embedding_field=\"question_emb\",\n",
|
|
" embedding_dim=384,\n",
|
|
" excluded_meta_data=[\"question_emb\"],\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"### Create a Retriever using embeddings\n",
|
|
"Instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of the questions (user question vs. FAQ ones).\n",
|
|
"We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"retriever = EmbeddingRetriever(\n",
|
|
" document_store=document_store, embedding_model=\"sentence-transformers/all-MiniLM-L6-v2\", use_gpu=True\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"### Prepare & Index FAQ data\n",
|
|
"We create a pandas dataframe containing some FAQ data (i.e curated pairs of question + answer) and index those in elasticsearch.\n",
|
|
"Here: We download some question-answer pairs related to COVID-19"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Download\n",
|
|
"doc_dir = \"data/tutorial4\"\n",
|
|
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip\"\n",
|
|
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
|
|
"\n",
|
|
"# Get dataframe with columns \"question\", \"answer\" and some custom metadata\n",
|
|
"df = pd.read_csv(\"small_faq_covid.csv\")\n",
|
|
"# Minimal cleaning\n",
|
|
"df.fillna(value=\"\", inplace=True)\n",
|
|
"df[\"question\"] = df[\"question\"].apply(lambda x: x.strip())\n",
|
|
"print(df.head())\n",
|
|
"\n",
|
|
"# Get embeddings for our questions from the FAQs\n",
|
|
"questions = list(df[\"question\"].values)\n",
|
|
"df[\"question_emb\"] = retriever.embed_queries(texts=questions)\n",
|
|
"df = df.rename(columns={\"question\": \"content\"})\n",
|
|
"\n",
|
|
"# Convert Dataframe to list of dicts and index them in our DocumentStore\n",
|
|
"docs_to_index = df.to_dict(orient=\"records\")\n",
|
|
"document_store.write_documents(docs_to_index)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"### Ask questions\n",
|
|
"Initialize a Pipeline (this time without a reader) and ask questions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from haystack.pipelines import FAQPipeline\n",
|
|
"\n",
|
|
"pipe = FAQPipeline(retriever=retriever)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"pycharm": {
|
|
"name": "#%%\n"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from haystack.utils import print_answers\n",
|
|
"\n",
|
|
"prediction = pipe.run(query=\"How is the virus spreading?\", params={\"Retriever\": {\"top_k\": 10}})\n",
|
|
"print_answers(prediction, details=\"medium\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"source": [
|
|
"## About us\n",
|
|
"\n",
|
|
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
|
|
"\n",
|
|
"We bring NLP to the industry via open source! \n",
|
|
"Our focus: Industry specific language models & large scale QA systems. \n",
|
|
" \n",
|
|
"Some of our other work: \n",
|
|
"- [German BERT](https://deepset.ai/german-bert)\n",
|
|
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
|
|
"- [FARM](https://github.com/deepset-ai/FARM)\n",
|
|
"\n",
|
|
"Get in touch:\n",
|
|
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
|
|
"\n",
|
|
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}