
* updated tutorials and docstrings and new version * update to correct directory structure
7.9 KiB
Pipelines
Flexibility powered by DAGs
In order to build modern search pipelines, you need two things: powerful building blocks and an easy way to stick them together.
The Pipeline
class is exactly built for this purpose and enables many search scenarios beyond QA.
The core idea is that you can build a Directed Acyclic Graph (DAG) where each node is one building block (Reader, Retriever, Generator ...).
Here's a simple example for a standard Open-Domain QA Pipeline:
from haystack import Pipeline
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever1", inputs=["Query"])
p.add_node(component=reader, name="QAReader", inputs=["ESRetriever1"])
res = p.run(query="What did Einstein work on?")
You can draw the DAG to better inspect what you are building:
p.draw(path="custom_pipe.png")
Arguments
Each node in a Pipeline defines the arguments the run() method accepts. The Pipeline class takes care of passing relevant
arguments to the node. In addition to mandatory inputs like query
, the run()
accepts optional node parameters like
top_k
with the params
argument. For instance, params={"top_k": 5}
will set the top_k
of all nodes as 5. To
target params to a specific node, the node name can be explicitly specifie as params={"Retriever": {"top_k": 5}}
.
res = pipeline.run(
query="What did Einstein work on?",
params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}}
)
YAML File Definitions
For your convenience, there is also the option of defining and loading pipelines in YAML files. Having your pipeline available in a YAML is particularly useful when you move between experimentation and production environments. Just export the YAML from your notebook / IDE and import it into your production environment. It also helps with version control of pipelines, allows you to share your pipeline easily with colleagues, and simplifies the configuration of pipeline parameters in production.
For example, you can define and save a simple Retriever Reader pipeline by saving the following to a file:
version: '0.7'
components: # define all the building-blocks for Pipeline
- name: MyReader # custom-name for the component; helpful for visualization & debugging
type: FARMReader # Haystack Class name for the component
params:
no_ans_boost: -10
model_name_or_path: deepset/roberta-base-squad2
- name: MyESRetriever
type: ElasticsearchRetriever
params:
document_store: MyDocumentStore # params can reference other components defined in the YAML
custom_query: null
- name: MyDocumentStore
type: ElasticsearchDocumentStore
params:
index: haystack_test
pipelines: # multiple Pipelines can be defined using the components from above
- name: my_query_pipeline # a simple extractive-qa Pipeline
nodes:
- name: MyESRetriever
inputs: [Query]
- name: MyReader
inputs: [MyESRetriever]
To load, simply call:
pipeline.load_from_yaml(Path("sample.yaml"))
For another example YAML config, check out this file.
Multiple retrievers
You can now also use multiple Retrievers and join their results:
p = Pipeline()
p.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", params={"ESRetriever": {"top_k": 1}, "DPRRetriever": {"top_k": 3}})
Custom nodes
It is easy to build custom nodes. Just respect the following requirements:
- Create a Class that inherits from
BaseComponent
. - Add a
run()
method to your class with any parameters it needs to process the input. Ensure that the parameters are either passed withparams
to the pipeline or are returned by the preceding nodes. - Do whatever you want within
run()
(e.g., reformatting the query). - Return a tuple that contains your output data (for the next node) and the name of the outgoing edge
output_dict, "output_1
. - Add a class attribute
outgoing_edges = 1
that defines your node's number of output options. You only need a higher number here if you have a decision node (see below).
Decision nodes
Or you can add decision nodes where only one "branch" is executed afterwards. This allows, for example, to classify an incoming query and depending on the result routing it to different modules:
from haystack import BaseComponent, Pipeline
class QueryClassifier(BaseComponent):
outgoing_edges = 2
def run(self, query):
if "?" in query:
return {}, "output_1"
else:
return {}, "output_2"
pipe = Pipeline()
pipe.add_node(component=QueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults",
inputs=["ESRetriever", "DPRRetriever"])
pipe.add_node(component=reader, name="QAReader", inputs=["JoinResults"])
res = p.run(query="What did Einstein work on?", params={"ESRetriever": {"top_k": 1}, "DPRRetriever": {"top_k": 3}})
Evaluation nodes
There are nodes in Haystack that are used to evaluate the performance of readers, retrievers and combine systems. To get hands on with this kind of node, have a look at the evaluation tutorial.
Default Pipelines (replacing the "Finder")
Last but not least, we added some "Default Pipelines" that allow you to run standard patterns with very few lines of code.
This is replacing the Finder
class which was deprecated with Haystack 0.6.0 .
from haystack.pipeline import DocumentSearchPipeline, ExtractiveQAPipeline, Pipeline, JoinDocuments
# Extractive QA
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
res = qa_pipe.run(query="When was Kant born?", params={"retriever": {"top_k": 3}, "reader": {"top_k": 5}})
# Document Search
doc_pipe = DocumentSearchPipeline(retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", params={"retriever": {"top_k": 3}})
# Generative QA
doc_pipe = GenerativeQAPipeline(generator=rag_generator, retriever=retriever)
res = doc_pipe.run(query="Physics Einstein", params={"retriever": {"top_k": 3}})
# FAQ based QA
doc_pipe = FAQPipeline(retriever=retriever)
res = doc_pipe.run(query="How can I change my address?", params={"retriever": {"top_k": 3}})
So to migrate your QA system from the deprecated Finder
to ExtractiveQAPipeline
you'd need to:
# 1. Change import
from haystack.pipeline import ExtractiveQAPipeline
# 2. Replace the Finder
qa_pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
# 3. Replace get_answers() with run()
res = qa_pipe.run(query="When was Kant born?", top_k_retriever=3, top_k_reader=5)
# 4. Access your results from ["documents"] rather than ["answers"]
print(res["documents"])
See also the Pipelines API documentation for more details.
We plan many more features around the new pipelines incl. parallelized execution, distributed execution, dry runs - so stay tuned ...