{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Pipelines Tutorial\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial11_Pipelines.ipynb)\n",
"\n",
"In this tutorial, you will learn how the `Pipeline` class acts as a connector between all the different\n",
"building blocks that are found in FARM. Whether you are using a Reader, Generator, Summarizer\n",
"or Retriever (or 2), the `Pipeline` class will help you build a Directed Acyclic Graph (DAG) that\n",
"determines how to route the output of one component into the input of another.\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Setting Up the Environment\n",
"\n",
"Let's start by ensuring we have a GPU running to ensure decent speed in this tutorial.\n",
"In Google colab, you can change to a GPU runtime in the menu:\n",
"- **Runtime -> Change Runtime type -> Hardware accelerator -> GPU**"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"These lines are to install Haystack through pip"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install --upgrade git+https://github.com/deepset-ai/haystack.git\n",
"!pip install urllib3==1.25.4\n",
"\n",
"# Install pygraphviz\n",
"!apt install libgraphviz-dev\n",
"!pip install pygraphviz"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"If running from Colab or a no Docker environment, you will want to start Elasticsearch from source"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
"! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
"! chown -R daemon:daemon elasticsearch-7.9.2\n",
"\n",
"import os\n",
"from subprocess import Popen, PIPE, STDOUT\n",
"es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],\n",
" stdout=PIPE, stderr=STDOUT,\n",
" preexec_fn=lambda: os.setuid(1) # as daemon\n",
" )\n",
"# wait until ES has started\n",
"! sleep 30"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Initialization\n",
"\n",
"Here are some core imports"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from haystack.utils import print_answers, print_documents"
]
},
{
"cell_type": "markdown",
"source": [
"Then let's fetch some data (in this case, pages from the Game of Thrones wiki) and prepare it so that it can\n",
"be used indexed into our `DocumentStore`"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.preprocessor.utils import fetch_archive_from_http, convert_files_to_dicts\n",
"from haystack.preprocessor.cleaning import clean_wiki_text\n",
"\n",
"#Download and prepare data - 517 Wikipedia articles for Game of Thrones\n",
"doc_dir = \"data/article_txt_got\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# convert files to dicts containing documents that can be indexed to our datastore\n",
"got_dicts = convert_files_to_dicts(\n",
" dir_path=doc_dir,\n",
" clean_func=clean_wiki_text,\n",
" split_paragraphs=True\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here we initialize the core components that we will be gluing together using the `Pipeline` class.\n",
"We have a `DocumentStore`, an `ElasticsearchRetriever` and a `FARMReader`.\n",
"These can be combined to create a classic Retriever-Reader pipeline that is designed\n",
"to perform Open Domain Question Answering."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [
"from haystack import Pipeline\n",
"from haystack.utils import launch_es\n",
"from haystack.document_store import ElasticsearchDocumentStore\n",
"from haystack.retriever.sparse import ElasticsearchRetriever\n",
"from haystack.retriever.dense import DensePassageRetriever\n",
"from haystack.reader import FARMReader\n",
"\n",
"\n",
"# Initialize DocumentStore and index documents\n",
"launch_es()\n",
"document_store = ElasticsearchDocumentStore()\n",
"document_store.delete_all_documents()\n",
"document_store.write_documents(got_dicts)\n",
"\n",
"# Initialize Sparse retriever\n",
"es_retriever = ElasticsearchRetriever(document_store=document_store)\n",
"\n",
"# Initialize dense retriever\n",
"dpr_retriever = DensePassageRetriever(document_store)\n",
"document_store.update_embeddings(dpr_retriever, update_existing_embeddings=False)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Prebuilt Pipelines\n",
"\n",
"Haystack features many prebuilt pipelines that cover common tasks.\n",
"Here we have an `ExtractiveQAPipeline` (the successor to the now deprecated `Finder` class)."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import ExtractiveQAPipeline\n",
"\n",
"# Prebuilt pipeline\n",
"p_extractive_premade = ExtractiveQAPipeline(reader=reader, retriever=es_retriever)\n",
"res = p_extractive_premade.run(\n",
" query=\"Who is the father of Arya Stark?\",\n",
" top_k_retriever=10,\n",
" top_k_reader=5\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"If you want to just do the retrieval step, you can use a `DocumentSearchPipeline`"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import DocumentSearchPipeline\n",
"\n",
"p_retrieval = DocumentSearchPipeline(es_retriever)\n",
"res = p_retrieval.run(\n",
" query=\"Who is the father of Arya Stark?\",\n",
" top_k_retriever=10\n",
")\n",
"print_documents(res, max_text_len=200)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Or if you want to use a `Generator` instead of a `Reader`,\n",
"you can initialize a `GenerativeQAPipeline` like this:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import GenerativeQAPipeline, FAQPipeline\n",
"from haystack.generator import RAGenerator\n",
"\n",
"# We set this to True so that the document store returns document embeddings with each document\n",
"# This is needed by the Generator\n",
"document_store.return_embedding = True\n",
"\n",
"# Initialize generator\n",
"rag_generator = RAGenerator()\n",
"\n",
"# Generative QA\n",
"p_generator = GenerativeQAPipeline(generator=rag_generator, retriever=dpr_retriever)\n",
"res = p_generator.run(\n",
" query=\"Who is the father of Arya Stark?\",\n",
" top_k_retriever=10\n",
")\n",
"print_answers(res, details=\"minimal\")\n",
"\n",
"# We are setting this to False so that in later pipelines,\n",
"# we get a cleaner printout\n",
"document_store.return_embedding = False\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Haystack features prebuilt pipelines to do:\n",
"- just document search (DocumentSearchPipeline),\n",
"- document search with summarization (SearchSummarizationPipeline)\n",
"- generative QA (GenerativeQAPipeline)\n",
"- FAQ style QA (FAQPipeline)\n",
"- translated search (TranslationWrapperPipeline)\n",
"To find out more about these pipelines, have a look at our [documentation](https://haystack.deepset.ai/docs/latest/pipelinesmd)\n"
],
"metadata": {
"collapsed": false
}
},
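{
"cell_type": "markdown",
"source": [
"As a sketch of one of these, a search pipeline with summarization might be set up as follows.\n",
"This is not run in this tutorial; it assumes a `TransformersSummarizer` with the\n",
"`google/pegasus-xsum` model, which is downloaded on first use:\n",
"\n",
"```python\n",
"from haystack.pipeline import SearchSummarizationPipeline\n",
"from haystack.summarizer import TransformersSummarizer\n",
"\n",
"# Summarize the documents returned by the sparse retriever\n",
"summarizer = TransformersSummarizer(model_name_or_path=\"google/pegasus-xsum\")\n",
"p_summarization = SearchSummarizationPipeline(summarizer=summarizer, retriever=es_retriever)\n",
"res = p_summarization.run(query=\"Who is the father of Arya Stark?\", top_k_retriever=10)\n",
"print_documents(res, max_text_len=200)\n",
"```"
],
"metadata": {
"collapsed": false
}
},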
{
"cell_type": "markdown",
"source": [
"With any Pipeline, whether prebuilt or custom constructed,\n",
"you can save a diagram showing how all the components are connected.\n",
"\n",
"![image](https://user-images.githubusercontent.com/1563902/102451716-54813700-4039-11eb-881e-f3c01b47ca15.png)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"p_extractive_premade.draw(\"pipeline_extractive_premade.png\")\n",
"p_retrieval.draw(\"pipeline_retrieval.png\")\n",
"p_generator.draw(\"pipeline_generator.png\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Custom Pipelines\n",
"\n",
"Now we are going to rebuild the `ExtractiveQAPipelines` using the generic Pipeline class.\n",
"We do this by adding the building blocks that we initialized as nodes in the graph."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Custom built extractive QA pipeline\n",
"p_extractive = Pipeline()\n",
"p_extractive.add_node(component=es_retriever, name=\"Retriever\", inputs=[\"Query\"])\n",
"p_extractive.add_node(component=reader, name=\"Reader\", inputs=[\"Retriever\"])\n",
"\n",
"# Now we can run it\n",
"res = p_extractive.run(\n",
" query=\"Who is the father of Arya Stark?\",\n",
" top_k_retriever=10,\n",
" top_k_reader=5\n",
")\n",
"print_answers(res, details=\"minimal\")\n",
"p_extractive.draw(\"pipeline_extractive.png\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Pipelines offer a very simple way to ensemble together different components.\n",
"In this example, we are going to combine the power of a `DensePassageRetriever`\n",
"with the keyword based `ElasticsearchRetriever`.\n",
"See our [documentation](https://haystack.deepset.ai/docs/latest/retrievermd) to understand why\n",
"we might want to combine a dense and sparse retriever.\n",
"\n",
"![image](https://user-images.githubusercontent.com/1563902/102451782-7bd80400-4039-11eb-9046-01b002a783f8.png)\n",
"\n",
"Here we use a `JoinDocuments` node so that the predictions from each retriever can be merged together."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import JoinDocuments\n",
"\n",
"# Create ensembled pipeline\n",
"p_ensemble = Pipeline()\n",
"p_ensemble.add_node(component=es_retriever, name=\"ESRetriever\", inputs=[\"Query\"])\n",
"p_ensemble.add_node(component=dpr_retriever, name=\"DPRRetriever\", inputs=[\"Query\"])\n",
"p_ensemble.add_node(component=JoinDocuments(join_mode=\"concatenate\"), name=\"JoinResults\", inputs=[\"ESRetriever\", \"DPRRetriever\"])\n",
"p_ensemble.add_node(component=reader, name=\"Reader\", inputs=[\"JoinResults\"])\n",
"p_ensemble.draw(\"pipeline_ensemble.png\")\n",
"\n",
"# Run pipeline\n",
"res = p_ensemble.run(\n",
" query=\"Who is the father of Arya Stark?\",\n",
" top_k_retriever=5 #This is top_k per retriever\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
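{
"cell_type": "markdown",
"source": [
"Note that `join_mode=\"concatenate\"` simply combines the two result lists.\n",
"`JoinDocuments` also has a `merge` mode that aggregates the scores of documents returned\n",
"by both retrievers. As a sketch (the `weights` below are illustrative values that you\n",
"would want to tune for your own retrievers):\n",
"\n",
"```python\n",
"# Weight the sparse and dense retriever scores equally when merging\n",
"join_merge = JoinDocuments(join_mode=\"merge\", weights=[0.5, 0.5])\n",
"```"
],
"metadata": {
"collapsed": false
}
},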
{
"cell_type": "markdown",
"source": [
"## Custom Nodes\n",
"\n",
"Nodes are relatively simple objects\n",
"and we encourage our users to design their own if they don't see on that fits their use case\n",
"\n",
"The only requirements are:\n",
"- Add a method run(self, **kwargs) to your class. **kwargs will contain the output from the previous node in your graph.\n",
"- Do whatever you want within run() (e.g. reformatting the query)\n",
"- Return a tuple that contains your output data (for the next node)\n",
"and the name of the outgoing edge (by default \"output_1\" for nodes that have one output)\n",
"- Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).\n",
"\n",
"Here we have a template for a Node:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"class NodeTemplate():\n",
" outgoing_edges = 1\n",
"\n",
" def run(self, **kwargs):\n",
" # Insert code here to manipulate the variables in kwarg\n",
" return (kwargs, \"output_1\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
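{
"cell_type": "markdown",
"source": [
"As a concrete (hypothetical) example, here is a minimal node that lowercases the incoming\n",
"query and forwards everything else untouched. It follows exactly the contract described above\n",
"and could be placed in front of a retriever like any other node:\n",
"\n",
"```python\n",
"class QueryLowercaser:\n",
"    outgoing_edges = 1\n",
"\n",
"    def run(self, **kwargs):\n",
"        # Reformat the query; leave all other pipeline variables unchanged\n",
"        kwargs[\"query\"] = kwargs[\"query\"].lower()\n",
"        return (kwargs, \"output_1\")\n",
"```"
],
"metadata": {
"collapsed": false
}
},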
{
"cell_type": "markdown",
"source": [
"## Decision Nodes\n",
"\n",
"Decision Nodes help you route your data so that only certain branches of your `Pipeline` are run.\n",
"One popular use case for such query classifiers is routing keyword queries to Elasticsearch and questions to DPR + Reader.\n",
"With this approach you keep optimal speed and simplicity for keywords while going deep with transformers when it's most helpful.\n",
"\n",
"![image](https://user-images.githubusercontent.com/1563902/102452199-41229b80-403a-11eb-9365-7038697e7c3e.png)\n",
"\n",
"Though this looks very similar to the ensembled pipeline shown above,\n",
"the key difference is that only one of the retrievers is run for each request.\n",
"By contrast both retrievers are always run in the ensembled approach.\n",
"\n",
"Below, we define a very naive `QueryClassifier` and show how to use it:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"class QueryClassifier():\n",
" outgoing_edges = 2\n",
"\n",
" def run(self, **kwargs):\n",
" if \"?\" in kwargs[\"query\"]:\n",
" return (kwargs, \"output_2\")\n",
" else:\n",
" return (kwargs, \"output_1\")\n",
"\n",
"# Here we build the pipeline\n",
"p_classifier = Pipeline()\n",
"p_classifier.add_node(component=QueryClassifier(), name=\"QueryClassifier\", inputs=[\"Query\"])\n",
"p_classifier.add_node(component=es_retriever, name=\"ESRetriever\", inputs=[\"QueryClassifier.output_1\"])\n",
"p_classifier.add_node(component=dpr_retriever, name=\"DPRRetriever\", inputs=[\"QueryClassifier.output_2\"])\n",
"p_classifier.add_node(component=reader, name=\"QAReader\", inputs=[\"ESRetriever\", \"DPRRetriever\"])\n",
"p_classifier.draw(\"pipeline_classifier.png\")\n",
"\n",
"# Run only the dense retriever on the full sentence query\n",
"res_1 = p_classifier.run(\n",
" query=\"Who is the father of Arya Stark?\",\n",
" top_k_retriever=10\n",
")\n",
"print(\"DPR Results\" + \"\\n\" + \"=\"*15)\n",
"print_answers(res_1)\n",
"\n",
"# Run only the sparse retriever on a keyword based query\n",
"res_2 = p_classifier.run(\n",
" query=\"Arya Stark father\",\n",
" top_k_retriever=10\n",
")\n",
"print(\"ES Results\" + \"\\n\" + \"=\"*15)\n",
"print_answers(res_2)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Evaluation Nodes\n",
"\n",
"We have also designed a set of nodes that can be used to evaluate the performance of a system.\n",
"Have a look at our [tutorial](https://haystack.deepset.ai/docs/latest/tutorial5md) to get hands on with the code and learn more about Evaluation Nodes!\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## YAML Configs\n",
"\n",
"A full `Pipeline` can be defined in a YAML file and simply loaded.\n",
"Having your pipeline available in a YAML is particularly useful\n",
"when you move between experimentation and production environments.\n",
"Just export the YAML from your notebook / IDE and import it into your production environment.\n",
"It also helps with version control of pipelines,\n",
"allows you to share your pipeline easily with colleagues,\n",
"and simplifies the configuration of pipeline parameters in production.\n",
"\n",
"It consists of two main sections: you define all objects (e.g. a reader) in components\n",
"and then stick them together to a pipeline in pipelines.\n",
"You can also set one component to be multiple nodes of a pipeline or to be a node across multiple pipelines.\n",
"It will be loaded just once in memory and therefore doesn't hurt your resources more than actually needed.\n",
"\n",
"The contents of a YAML file should look something like this:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"```yaml\n",
"version: '0.7'\n",
"components: # define all the building-blocks for Pipeline\n",
"- name: MyReader # custom-name for the component; helpful for visualization & debugging\n",
" type: FARMReader # Haystack Class name for the component\n",
" params:\n",
" no_ans_boost: -10\n",
" model_name_or_path: deepset/roberta-base-squad2\n",
"- name: MyESRetriever\n",
" type: ElasticsearchRetriever\n",
" params:\n",
" document_store: MyDocumentStore # params can reference other components defined in the YAML\n",
" custom_query: null\n",
"- name: MyDocumentStore\n",
" type: ElasticsearchDocumentStore\n",
" params:\n",
" index: haystack_test\n",
"pipelines: # multiple Pipelines can be defined using the components from above\n",
"- name: my_query_pipeline # a simple extractive-qa Pipeline\n",
" nodes:\n",
" - name: MyESRetriever\n",
" inputs: [Query]\n",
" - name: MyReader\n",
" inputs: [MyESRetriever]\n",
"```"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"To load, simply call:\n",
"``` python\n",
"pipeline.load_from_yaml(Path(\"sample.yaml\"))\n",
"```"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
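{
"cell_type": "markdown",
"source": [
"Putting it together, loading and running the pipeline defined above might look like this\n",
"(a sketch which assumes the YAML has been saved as `sample.yaml` in the working directory\n",
"and that documents have already been indexed into the `haystack_test` index):\n",
"\n",
"```python\n",
"from pathlib import Path\n",
"from haystack import Pipeline\n",
"\n",
"# Load the extractive QA pipeline defined in the YAML\n",
"pipe = Pipeline.load_from_yaml(Path(\"sample.yaml\"), pipeline_name=\"my_query_pipeline\")\n",
"\n",
"# Run it just like a pipeline built in code\n",
"res = pipe.run(query=\"Who is the father of Arya Stark?\", top_k_retriever=10, top_k_reader=5)\n",
"print_answers(res, details=\"minimal\")\n",
"```"
],
"metadata": {
"collapsed": false
}
},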
{
"cell_type": "markdown",
"source": [
"## Conclusion\n",
"\n",
"The possibilities are endless with the `Pipeline` class and we hope that this tutorial will inspire you\n",
"to build custom pipeplines that really work for your use case!"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}