Update pipeline documentation and readme (#693)

* Update README.md
* Update pipelines.md
* Update pipelines.md
* Update README.md
2025-12-30 08:37:20 +00:00 · 2020-12-22 13:34:28 +01:00 · 2020-12-22 13:34:28 +01:00 · a2e5e6b09e
commit a2e5e6b09e
parent 94b7345505
3 changed files with 259 additions and 13 deletions
--- a/README.md
+++ b/README.md
@ -94,7 +94,7 @@ We recommend Elasticsearch or FAISS, but have also more light-weight options for
 5.  **Reader**: Neural network (e.g. BERT or RoBERTA) that reads through texts in detail
    to find an answer. The Reader takes multiple passages of text as input and returns top-n answers. Models are trained via [FARM](https://github.com/deepset-ai/FARM) or [Transformers](https://github.com/huggingface/transformers) on SQuAD like tasks.  You can just load a pretrained model from [Hugging Face's model hub](https://huggingface.co/models) or fine-tune it on your own domain data.
 6.  **Generator**: Neural network (e.g. RAG) that *generates* an answer for a given question conditioned on the retrieved documents from the retriever.
-6.  **Finder**: Glues together a Retriever + Reader/Generator as a pipeline to provide an easy-to-use question answering interface.
+6.  **Pipeline**: Stick building blocks together into highly customizable pipelines that are represented as Directed Acyclic Graphs (DAGs). Think of it as "Apache Airflow for search".
 7.  **REST API**: Exposes a simple API based on fastAPI for running QA search, uploading files and collecting user feedback for continuous learning.
 8.  **Haystack Annotate**: Create custom QA labels to improve performance of your domain-specific models. [Hosted version](https://annotate.deepset.ai/login) or [Docker images](https://github.com/deepset-ai/haystack/tree/master/annotation_tool). 

@ -102,8 +102,49 @@ We recommend Elasticsearch or FAISS, but have also more light-weight options for

 ## Usage

-![image](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/code_snippet_usage.png)
+```python
+# DB to store your docs
+document_store = ElasticsearchDocumentStore(host="localhost", username="", password="",
+                                            index="document", embedding_dim=768,
+                                            embedding_field="embedding")

+# Index your docs
+# (Options: Convert text from PDFs etc. via FileConverter; Split and clean docs with the PreProcessor)
+docs = [Document(text="Arya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure ...", meta={}), 
+        ...]
+
+document_store.write_documents(docs)
+
+# Init Retriever: Fast algorithm to identify most promising candidate docs
+# (Options: DPR, TF-IDF, Elasticsearch, Plain Embeddings ..)
+retriever = DensePassageRetriever(document_store=document_store,                         
+                                query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
+                                passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
+                                )
+document_store.update_embeddings(retriever)
+
+# Init Reader: Powerful, but slower neural model 
+# (Options: FARM or Transformers Framework; Extractive or generative models)
+reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
+
+# The Pipeline combines Reader + Retriever into a DAG
+# There are many different pipeline types and you can easily build your own
+pipeline = ExtractiveQAPipeline(reader, retriever)
+
+# Voilà! Ask a question!
+prediction = pipeline.run(query="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=3)
+print_answers(prediction, details="minimal")
+
+[   {   'answer': 'Eddard',
+        'context': """... She travels with her father, Eddard, to 
+                   King's Landing when he is made Hand of the King ..."""},
+    {   'answer': 'Ned',
+        'context': """... girl disguised as a boy all along and is surprised 
+                   to learn she is Arya, Ned Stark's daughter ..."""},
+    {   'answer': 'Ned',
+        'context': """... Arya accompanies her father Ned and her sister Sansa to
+                   King's Landing. Before their departure ..."""}
+]
+``` 
 ## Tutorials

 -   Tutorial 1 - Basic QA Pipeline: [Jupyter notebook](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb)
@ -132,7 +173,7 @@ We recommend Elasticsearch or FAISS, but have also more light-weight options for


 ## Quick Tour
-[File Conversion](https://github.com/deepset-ai/haystack/blob/master/README.md#1-file-conversion) | [Preprocessing](https://github.com/deepset-ai/haystack/blob/master/README.md#2-preprocessing) | [DocumentStores](https://github.com/deepset-ai/haystack/blob/master/README.md#3-documentstores) | [Retrievers](https://github.com/deepset-ai/haystack/blob/master/README.md#4-retrievers) | [Readers](https://github.com/deepset-ai/haystack/blob/master/README.md#5-readers) | [REST API](https://github.com/deepset-ai/haystack/blob/master/README.md#6-rest-api) |  [Labeling Tool](https://github.com/deepset-ai/haystack/blob/master/README.md#7-labeling-tool) 
+[File Conversion](https://github.com/deepset-ai/haystack/blob/master/README.md#1-file-conversion) | [Preprocessing](https://github.com/deepset-ai/haystack/blob/master/README.md#2-preprocessing) | [DocumentStores](https://github.com/deepset-ai/haystack/blob/master/README.md#3-documentstores) | [Retrievers](https://github.com/deepset-ai/haystack/blob/master/README.md#4-retrievers) | [Readers](https://github.com/deepset-ai/haystack/blob/master/README.md#5-readers) | [Pipelines](https://github.com/deepset-ai/haystack/blob/master/README.md#6-pipelines) | [REST API](https://github.com/deepset-ai/haystack/blob/master/README.md#7-rest-api) |  [Labeling Tool](https://github.com/deepset-ai/haystack/blob/master/README.md#8-labeling-tool) 

 ### 1) File Conversion
 **What**  
@ -295,7 +336,36 @@ reader.predict(question="Who is the father of Arya Starck?", documents=documents
 ```
 -> See [docs](https://haystack.deepset.ai/docs/latest/readermd) for details

-### 6) REST API
+### 6) Pipelines
+
+**What**  
+In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
+The `Pipeline` class is built exactly for this purpose and enables many search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...).
+
+**Available Options**   
+- Standard nodes: Reader, Retriever, Generator ...
+- Join nodes: For example, combine results of multiple retrievers via the `JoinDocuments` node
+- Decision nodes: For example, classify an incoming query and, depending on the result, execute only a certain branch of your graph
+
+**Example**  
+A minimal Open-Domain QA Pipeline:
+
+```python
+p = Pipeline()
+p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
+p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
+res = p.run(query="What did Einstein work on?", top_k_retriever=1)
+
+```
+You can **draw the DAG** to better inspect what you are building:
+```python
+p.draw(path="custom_pipe.png")
+```
+![image](https://user-images.githubusercontent.com/1563902/102451716-54813700-4039-11eb-881e-f3c01b47ca15.png)
+
+-> See [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd) for details and examples of more complex pipelines
+
+### 7) REST API
 **What**  
 A simple REST API based on [FastAPI](https://fastapi.tiangolo.com/) is provided to:

@ -315,7 +385,7 @@ To serve the API, adjust the values in `rest_api/config.py` and run:
 You will find the Swagger API documentation at
 <http://127.0.0.1:8000/docs>

-### 7) Labeling Tool
+### 8) Labeling Tool

 -   Use the [hosted version](https://annotate.deepset.ai/login) (Beta) or deploy it yourself with the [Docker Images](https://github.com/deepset-ai/haystack/blob/master/annotation_tool).
 -   Create labels with different techniques: Come up with questions (+ answers) while reading passages (SQuAD style) or have a set of predefined questions and look for answers in the document (~ Natural Questions).
--- a/docs/_src/usage/usage/pipelines.md
+++ b/docs/_src/usage/usage/pipelines.md
@ -15,8 +15,96 @@ id: "pipelinesmd"
 The new Pipelines class was added in Haystack 0.6.0 to give a more flexible way of defining your processing steps. 
 It replaces the Finder class which will be deprecated in the next version.

-Comprehensive guides and documentation coming soon! 
-For now, have a look at our [latest release notes](https://github.com/deepset-ai/haystack/releases/tag/v0.6.0) to see what is possible.
-You can also refer to our [Pipelines API documentation](/docs/latest/apipipelinesmd). 
-
 </div>
+
+
+### Flexible Pipelines powered by DAGs
+In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
+The `Pipeline` class is built exactly for this purpose and enables many search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example of a "standard" Open-Domain QA Pipeline: 
+
+```python
+p = Pipeline()
+p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
+p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
+res = p.run(query="What did Einstein work on?", top_k_retriever=1)
+
+```
+
+You can **draw the DAG** to better inspect what you are building:
+```python
+p.draw(path="custom_pipe.png")
+```
+![image](https://user-images.githubusercontent.com/1563902/102451716-54813700-4039-11eb-881e-f3c01b47ca15.png)
+
+### Multiple retrievers
+You can now also use multiple Retrievers and join their results: 
+```python
+p = Pipeline()
+p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
+p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
+p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
+p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
+res = p.run(query="What did Einstein work on?", top_k_retriever=1)
+```
+![image](https://user-images.githubusercontent.com/1563902/102451782-7bd80400-4039-11eb-9046-01b002a783f8.png)
+
+### Custom nodes
+You can easily build your own custom nodes. Just respect the following requirements: 
+
+1. Add a method `run(self, **kwargs)` to your class. `**kwargs` will contain the output from the previous node in your graph.
+2. Do whatever you want within `run()` (e.g. reformatting the query)
+3. Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. `output_dict, "output_1"`
+4. Add a class attribute `outgoing_edges = 1` that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
+
+### Decision nodes
+You can also add decision nodes where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and, depending on the result, route it to different modules: 
+![image](https://user-images.githubusercontent.com/1563902/102452199-41229b80-403a-11eb-9365-7038697e7c3e.png)
+```python 
+    class QueryClassifier():
+        outgoing_edges = 2
+
+        def run(self, **kwargs):
+            if "?" in kwargs["query"]:
+                return (kwargs, "output_1")
+
+            else:
+                return (kwargs, "output_2")
+
+    pipe = Pipeline()
+    pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
+    pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
+    pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
+    pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
+                  inputs=["ESRetriever", "DPRRetriever"])
+    pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
+    res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
+```
+
+### Default Pipelines (replacing the "Finder")
+Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code.
+These replace the `Finder` class, which is now deprecated.
+
+```python
+from haystack.pipeline import (DocumentSearchPipeline, ExtractiveQAPipeline, FAQPipeline,
+                               GenerativeQAPipeline, JoinDocuments, Pipeline)
+
+# Extractive QA
+qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
+res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
+
+# Document Search
+doc_pipe = DocumentSearchPipeline(retriever=retriever)
+res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
+
+# Generative QA
+doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
+res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
+
+# FAQ based QA
+doc_pipe = FAQPipeline(retriever=retriever)
+res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)
+
+```    
+See also the [Pipelines API documentation](/docs/latest/apipipelinesmd) for more details. 
+
+We plan many more features around the new pipelines, including parallelized execution, distributed execution, definition via YAML files, and dry runs, so stay tuned ...  
+
--- a/docs/v0.6.0/_src/usage/usage/pipelines.md
+++ b/docs/v0.6.0/_src/usage/usage/pipelines.md
@ -15,8 +15,96 @@ id: "pipelinesmd"
 The new Pipelines class was added in Haystack 0.6.0 to give a more flexible way of defining your processing steps. 
 It replaces the Finder class which will be deprecated in the next version.

-Comprehensive guides and documentation coming soon! 
-For now, have a look at our [latest release notes](https://github.com/deepset-ai/haystack/releases/tag/v0.6.0) to see what is possible.
-You can also refer to our [Pipelines API documentation](/docs/latest/apipipelinesmd). 
-
 </div>
+
+
+### Flexible Pipelines powered by DAGs
+In order to build modern search pipelines, you need two things: powerful building blocks and a flexible way to stick them together.
+The `Pipeline` class is built exactly for this purpose and enables many search scenarios beyond QA. The core idea: you can build a Directed Acyclic Graph (DAG) where each node is one "building block" (Reader, Retriever, Generator ...). Here's a simple example of a "standard" Open-Domain QA Pipeline: 
+
+```python
+p = Pipeline()
+p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
+p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
+res = p.run(query="What did Einstein work on?", top_k_retriever=1)
+
+```
+
+You can **draw the DAG** to better inspect what you are building:
+```python
+p.draw(path="custom_pipe.png")
+```
+![image](https://user-images.githubusercontent.com/1563902/102451716-54813700-4039-11eb-881e-f3c01b47ca15.png)
+
+### Multiple retrievers
+You can now also use multiple Retrievers and join their results: 
+```python
+p = Pipeline()
+p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
+p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
+p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
+p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
+res = p.run(query="What did Einstein work on?", top_k_retriever=1)
+```
+![image](https://user-images.githubusercontent.com/1563902/102451782-7bd80400-4039-11eb-9046-01b002a783f8.png)
+
+### Custom nodes
+You can easily build your own custom nodes. Just respect the following requirements: 
+
+1. Add a method `run(self, **kwargs)` to your class. `**kwargs` will contain the output from the previous node in your graph.
+2. Do whatever you want within `run()` (e.g. reformatting the query)
+3. Return a tuple that contains your output data (for the next node) and the name of the outgoing edge, e.g. `output_dict, "output_1"`
+4. Add a class attribute `outgoing_edges = 1` that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).
+
+### Decision nodes
+You can also add decision nodes where only one "branch" is executed afterwards. This allows you, for example, to classify an incoming query and, depending on the result, route it to different modules: 
+![image](https://user-images.githubusercontent.com/1563902/102452199-41229b80-403a-11eb-9365-7038697e7c3e.png)
+```python 
+    class QueryClassifier():
+        outgoing_edges = 2
+
+        def run(self, **kwargs):
+            if "?" in kwargs["query"]:
+                return (kwargs, "output_1")
+
+            else:
+                return (kwargs, "output_2")
+
+    pipe = Pipeline()
+    pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
+    pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
+    pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
+    pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
+                  inputs=["ESRetriever", "DPRRetriever"])
+    pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
+    res = pipe.run(query="What did Einstein work on?", top_k_retriever=1)
+```
+
+### Default Pipelines (replacing the "Finder")
+Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code.
+These replace the `Finder` class, which is now deprecated.
+
+```python
+from haystack.pipeline import (DocumentSearchPipeline, ExtractiveQAPipeline, FAQPipeline,
+                               GenerativeQAPipeline, JoinDocuments, Pipeline)
+
+# Extractive QA
+qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
+res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
+
+# Document Search
+doc_pipe = DocumentSearchPipeline(retriever=retriever)
+res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
+
+# Generative QA
+doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
+res = doc_pipe.run(query="Physics Einstein", top_k_retriever=1)
+
+# FAQ based QA
+doc_pipe = FAQPipeline(retriever=retriever)
+res = doc_pipe.run(query="How can I change my address?", top_k_retriever=3)
+
+```    
+See also the [Pipelines API documentation](/docs/latest/apipipelinesmd) for more details. 
+
+We plan many more features around the new pipelines, including parallelized execution, distributed execution, definition via YAML files, and dry runs, so stay tuned ...  
+
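The custom-node requirements listed in the `pipelines.md` changes above (a `run(self, **kwargs)` method, a returned tuple of output data and edge name, and an `outgoing_edges` class attribute) can be sketched in plain Python. This is only an illustration of that contract, not part of the commit: `QueryNormalizer` and `QuestionMarkClassifier` are hypothetical names, and the nodes are called directly rather than through `Pipeline`, so nothing here shows how Haystack itself invokes them.

```python
class QueryNormalizer:
    """Hypothetical custom node: trims and lowercases the incoming query."""

    outgoing_edges = 1  # one output edge; only decision nodes need more

    def run(self, **kwargs):
        # kwargs holds the output of the previous node (here: just the query)
        output = dict(kwargs)
        output["query"] = kwargs["query"].strip().lower()
        # Return the data for the next node and the name of the outgoing edge
        return output, "output_1"


class QuestionMarkClassifier:
    """Hypothetical decision node: picks an edge based on a "?" in the query."""

    outgoing_edges = 2  # two possible branches

    def run(self, **kwargs):
        edge = "output_1" if "?" in kwargs["query"] else "output_2"
        return kwargs, edge


# Call run() by hand to see what each node hands to its successor
normalized, edge = QueryNormalizer().run(query="  What did Einstein work on?  ")
_, question_edge = QuestionMarkClassifier().run(**normalized)
_, keyword_edge = QuestionMarkClassifier().run(query="physics einstein")
print(normalized["query"], edge, question_edge, keyword_edge)
```

Assuming the contract described above, instances of such classes would be passed as `component=` to `add_node`, with downstream nodes referring to a decision node's branches as `"<node name>.output_1"` or `"<node name>.output_2"`, as in the `QueryClassifier` example in the diff.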