<!---
title: "Tutorial 11"
metaTitle: "Pipelines"
metaDescription: ""
slug: "/docs/tutorial11"
date: "2021-04-06"
id: "tutorial11md"
--->

# Pipelines Tutorial

[Open In Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial11_Pipelines.ipynb)

In this tutorial, you will learn how the `Pipeline` class acts as a connector between all the different
building blocks that are found in Haystack. Whether you are using a Reader, Generator, Summarizer
or Retriever (or two), the `Pipeline` class will help you build a Directed Acyclic Graph (DAG) that
determines how to route the output of one component into the input of another.

## Setting Up the Environment

Let's start by making sure we have a GPU running so that this tutorial runs at a decent speed.
In Google Colab, you can change to a GPU runtime in the menu:
- **Runtime -> Change Runtime type -> Hardware accelerator -> GPU**


```python
# Make sure you have a GPU running
!nvidia-smi
```

These lines install Haystack through pip:


```python
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4

# Install pygraphviz
!apt install libgraphviz-dev
!pip install pygraphviz
```

If you are running this notebook in Colab or in an environment without Docker, you will want to start Elasticsearch from source:


```python
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
```

## Initialization

Here are some core imports:


```python
from haystack.utils import print_answers, print_documents
```

Then let's fetch some data (in this case, pages from the Game of Thrones wiki) and prepare it so that it can
be indexed into our `DocumentStore`.


```python
from haystack.preprocessor.utils import fetch_archive_from_http, convert_files_to_dicts
from haystack.preprocessor.cleaning import clean_wiki_text

# Download and prepare data - 517 Wikipedia articles for Game of Thrones
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts containing documents that can be indexed into our datastore
got_dicts = convert_files_to_dicts(
    dir_path=doc_dir,
    clean_func=clean_wiki_text,
    split_paragraphs=True
)
```

Here we initialize the core components that we will be gluing together using the `Pipeline` class.
We have a `DocumentStore`, an `ElasticsearchRetriever` and a `FARMReader`.
These can be combined to create a classic Retriever-Reader pipeline that is designed
to perform Open Domain Question Answering.


```python
from haystack import Pipeline
from haystack.utils import launch_es
from haystack.document_store import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.retriever.dense import DensePassageRetriever
from haystack.reader import FARMReader


# Initialize DocumentStore and index documents
launch_es()
document_store = ElasticsearchDocumentStore()
document_store.delete_all_documents()
document_store.write_documents(got_dicts)

# Initialize sparse retriever
es_retriever = ElasticsearchRetriever(document_store=document_store)

# Initialize dense retriever
dpr_retriever = DensePassageRetriever(document_store)
document_store.update_embeddings(dpr_retriever, update_existing_embeddings=False)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
```

## Prebuilt Pipelines

Haystack features many prebuilt pipelines that cover common tasks.
Here we have an `ExtractiveQAPipeline` (the successor to the now deprecated `Finder` class).


```python
from haystack.pipeline import ExtractiveQAPipeline

# Prebuilt pipeline
p_extractive_premade = ExtractiveQAPipeline(reader=reader, retriever=es_retriever)
res = p_extractive_premade.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10,
    top_k_reader=5
)
print_answers(res, details="minimal")
```

If you just want to do the retrieval step, you can use a `DocumentSearchPipeline`:


```python
from haystack.pipeline import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(es_retriever)
res = p_retrieval.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10
)
print_documents(res, max_text_len=200)
```

Or if you want to use a `Generator` instead of a `Reader`,
you can initialize a `GenerativeQAPipeline` like this:


```python
from haystack.pipeline import GenerativeQAPipeline, FAQPipeline
from haystack.generator import RAGenerator

# We set this to True so that the document store returns document embeddings with each document
# This is needed by the Generator
document_store.return_embedding = True

# Initialize generator
rag_generator = RAGenerator()

# Generative QA
p_generator = GenerativeQAPipeline(generator=rag_generator, retriever=dpr_retriever)
res = p_generator.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10
)
print_answers(res, details="minimal")

# We are setting this to False so that in later pipelines,
# we get a cleaner printout
document_store.return_embedding = False
```

Haystack features prebuilt pipelines to do:
- just document search (`DocumentSearchPipeline`)
- document search with summarization (`SearchSummarizationPipeline`)
- generative QA (`GenerativeQAPipeline`)
- FAQ style QA (`FAQPipeline`)
- translated search (`TranslationWrapperPipeline`)

To find out more about these pipelines, have a look at our [documentation](https://haystack.deepset.ai/docs/latest/pipelinesmd).
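
As a rough illustration, here is a sketch of how one of these prebuilt pipelines, the `SearchSummarizationPipeline`, could be wired up. The `TransformersSummarizer` and the `"google/pegasus-xsum"` model below are assumptions made for this example only; they are not initialized anywhere else in this tutorial.


```python
from haystack.summarizer import TransformersSummarizer
from haystack.pipeline import SearchSummarizationPipeline

# Assumed for this sketch: a transformer-based summarizer (not part of this tutorial's setup)
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")

# Retrieve documents with the sparse retriever, then summarize them
p_summarization = SearchSummarizationPipeline(summarizer=summarizer, retriever=es_retriever)
res = p_summarization.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10
)
print_documents(res, max_text_len=200)
```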

With any Pipeline, whether prebuilt or custom constructed,
you can save a diagram showing how all the components are connected.



```python
p_extractive_premade.draw("pipeline_extractive_premade.png")
p_retrieval.draw("pipeline_retrieval.png")
p_generator.draw("pipeline_generator.png")
```

## Custom Pipelines

Now we are going to rebuild the `ExtractiveQAPipeline` using the generic `Pipeline` class.
We do this by adding the building blocks that we initialized above as nodes in the graph.


```python
# Custom built extractive QA pipeline
p_extractive = Pipeline()
p_extractive.add_node(component=es_retriever, name="Retriever", inputs=["Query"])
p_extractive.add_node(component=reader, name="Reader", inputs=["Retriever"])

# Now we can run it
res = p_extractive.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10,
    top_k_reader=5
)
print_answers(res, details="minimal")
p_extractive.draw("pipeline_extractive.png")
```

Pipelines offer a very simple way to combine different components into an ensemble.
In this example, we are going to combine the power of a `DensePassageRetriever`
with the keyword based `ElasticsearchRetriever`.
See our [documentation](https://haystack.deepset.ai/docs/latest/retrievermd) to understand why
we might want to combine a dense and sparse retriever.



Here we use a `JoinDocuments` node so that the predictions from each retriever can be merged together.


```python
from haystack.pipeline import JoinDocuments

# Create ensembled pipeline
p_ensemble = Pipeline()
p_ensemble.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p_ensemble.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p_ensemble.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p_ensemble.add_node(component=reader, name="Reader", inputs=["JoinResults"])
p_ensemble.draw("pipeline_ensemble.png")

# Run pipeline
res = p_ensemble.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=5  # This is top_k per retriever
)
print_answers(res, details="minimal")
```

## Custom Nodes

Nodes are relatively simple objects,
and we encourage our users to design their own if they don't see one that fits their use case.

The only requirements are:
- Add a method `run(self, **kwargs)` to your class. `**kwargs` will contain the output from the previous node in your graph.
- Do whatever you want within `run()` (e.g. reformatting the query).
- Return a tuple that contains your output data (for the next node) and the name of the outgoing edge (by default `"output_1"` for nodes that have one output).
- Add a class attribute `outgoing_edges = 1` that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).

Here we have a template for a Node:


```python
class NodeTemplate:
    outgoing_edges = 1

    def run(self, **kwargs):
        # Insert code here to manipulate the variables in kwargs
        return (kwargs, "output_1")
```
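
To make this more concrete, here is a small sketch (an assumed example, not part of the original tutorial) of a custom node that lower-cases the incoming query, plugged in front of the retriever and reader we initialized earlier:


```python
# Hypothetical custom node: cleans up the query before it reaches the retriever
class QueryCleaner:
    outgoing_edges = 1

    def run(self, **kwargs):
        # kwargs contains the output of the previous node, including the query
        kwargs["query"] = kwargs["query"].strip().lower()
        return (kwargs, "output_1")


p_cleaned = Pipeline()
p_cleaned.add_node(component=QueryCleaner(), name="QueryCleaner", inputs=["Query"])
p_cleaned.add_node(component=es_retriever, name="Retriever", inputs=["QueryCleaner"])
p_cleaned.add_node(component=reader, name="Reader", inputs=["Retriever"])
```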

## Decision Nodes

Decision Nodes help you route your data so that only certain branches of your `Pipeline` are run.
One popular use case for such query classifiers is routing keyword queries to Elasticsearch and questions to DPR + Reader.
With this approach you keep optimal speed and simplicity for keywords while going deep with transformers when it's most helpful.



Though this looks very similar to the ensembled pipeline shown above,
the key difference is that only one of the retrievers is run for each request.
By contrast, both retrievers are always run in the ensembled approach.

Below, we define a very naive `QueryClassifier` and show how to use it:

```python
class QueryClassifier:
    outgoing_edges = 2

    def run(self, **kwargs):
        if "?" in kwargs["query"]:
            return (kwargs, "output_2")
        else:
            return (kwargs, "output_1")


# Here we build the pipeline
p_classifier = Pipeline()
p_classifier.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
p_classifier.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
p_classifier.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
p_classifier.add_node(component=reader, name="QAReader", inputs=["ESRetriever", "DPRRetriever"])
p_classifier.draw("pipeline_classifier.png")

# Run only the dense retriever on the full sentence query
res_1 = p_classifier.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10
)
print("DPR Results" + "\n" + "="*15)
print_answers(res_1)

# Run only the sparse retriever on a keyword based query
res_2 = p_classifier.run(
    query="Arya Stark father",
    top_k_retriever=10
)
print("ES Results" + "\n" + "="*15)
print_answers(res_2)
```

## Evaluation Nodes

We have also designed a set of nodes that can be used to evaluate the performance of a system.
Have a look at our [tutorial](https://haystack.deepset.ai/docs/latest/tutorial5md) to get hands-on with the code and learn more about Evaluation Nodes!

## YAML Configs

A full `Pipeline` can be defined in a YAML file and simply loaded.
Having your pipeline available in a YAML file is particularly useful
when you move between experimentation and production environments.
Just export the YAML from your notebook / IDE and import it into your production environment.
It also helps with version control of pipelines,
allows you to share your pipeline easily with colleagues,
and simplifies the configuration of pipeline parameters in production.

The file consists of two main sections: you define all objects (e.g. a reader) in `components`
and then stick them together into a pipeline in `pipelines`.
You can also set one component to be multiple nodes of a pipeline or to be a node across multiple pipelines.
Each component is loaded just once into memory, so reusing it doesn't consume more resources than actually needed.

The contents of a YAML file should look something like this:

```yaml
version: '0.7'
components:    # define all the building-blocks for Pipeline
- name: MyReader       # custom-name for the component; helpful for visualization & debugging
  type: FARMReader    # Haystack Class name for the component
  params:
    no_ans_boost: -10
    model_name_or_path: deepset/roberta-base-squad2
- name: MyESRetriever
  type: ElasticsearchRetriever
  params:
    document_store: MyDocumentStore    # params can reference other components defined in the YAML
    custom_query: null
- name: MyDocumentStore
  type: ElasticsearchDocumentStore
  params:
    index: haystack_test
pipelines: # multiple Pipelines can be defined using the components from above
- name: my_query_pipeline    # a simple extractive-qa Pipeline
  nodes:
  - name: MyESRetriever
    inputs: [Query]
  - name: MyReader
    inputs: [MyESRetriever]
```

To load, simply call:

```python
from pathlib import Path

pipeline = Pipeline.load_from_yaml(Path("sample.yaml"), pipeline_name="my_query_pipeline")
```
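
Once loaded, the pipeline behaves just like the ones built in code above. As a quick, assumed usage example (note that the `MyDocumentStore` defined in the YAML points at a `haystack_test` index, which would need to contain documents for this to return answers):


```python
# Run the pipeline loaded from YAML exactly like a pipeline built in code
res = pipeline.run(
    query="Who is the father of Arya Stark?",
    top_k_retriever=10,
    top_k_reader=5
)
print_answers(res, details="minimal")
```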

## Conclusion

The possibilities are endless with the `Pipeline` class and we hope that this tutorial will inspire you
to build custom pipelines that really work for your use case!