{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Pipelines Tutorial\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial11_Pipelines.ipynb)\n",
"\n",
"In this tutorial, you will learn how the `Pipeline` class acts as a connector between all the different\n",
"building blocks that are found in Haystack. Whether you are using a Reader, Generator, Summarizer\n",
"or Retriever (or 2), the `Pipeline` class will help you build a Directed Acyclic Graph (DAG) that\n",
"determines how to route the output of one component into the input of another.\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Setting Up the Environment\n",
"\n",
"Let's start by making sure we have a GPU running so that this tutorial runs at a decent speed.\n",
"In Google Colab, you can change to a GPU runtime in the menu:\n",
"- **Runtime -> Change Runtime type -> Hardware accelerator -> GPU**"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"These lines install Haystack through pip."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install grpcio-tools==1.34.1\n",
"!pip install --upgrade git+https://github.com/deepset-ai/haystack.git\n",
"\n",
"# Install pygraphviz\n",
"!apt install libgraphviz-dev\n",
"!pip install pygraphviz"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"If you are running this in Colab or in an environment without Docker, you will want to start Elasticsearch from source."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
"! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
"! chown -R daemon:daemon elasticsearch-7.9.2\n",
"\n",
"import os\n",
"from subprocess import Popen, PIPE, STDOUT\n",
"es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],\n",
"                  stdout=PIPE, stderr=STDOUT,\n",
"                  preexec_fn=lambda: os.setuid(1)  # as daemon\n",
"                  )\n",
"# wait until ES has started\n",
"! sleep 30"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Initialization\n",
"\n",
"Here are some core imports"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from haystack.utils import print_answers, print_documents"
]
},
{
"cell_type": "markdown",
"source": [
"Then let's fetch some data (in this case, pages from the Game of Thrones wiki) and prepare it so that it can\n",
"be indexed into our `DocumentStore`"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.preprocessor.utils import fetch_archive_from_http, convert_files_to_dicts\n",
"from haystack.preprocessor.cleaning import clean_wiki_text\n",
"\n",
"# Download and prepare data - 517 Wikipedia articles for Game of Thrones\n",
"doc_dir = \"data/article_txt_got\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# Convert files to dicts containing documents that can be indexed into our datastore\n",
"got_dicts = convert_files_to_dicts(\n",
"    dir_path=doc_dir,\n",
"    clean_func=clean_wiki_text,\n",
"    split_paragraphs=True\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Here we initialize the core components that we will be gluing together using the `Pipeline` class.\n",
"We have a `DocumentStore`, an `ElasticsearchRetriever` and a `FARMReader`.\n",
"These can be combined to create a classic Retriever-Reader pipeline that is designed\n",
"to perform Open Domain Question Answering."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [
"from haystack import Pipeline\n",
"from haystack.utils import launch_es\n",
"from haystack.document_store import ElasticsearchDocumentStore\n",
"from haystack.retriever.sparse import ElasticsearchRetriever\n",
"from haystack.retriever.dense import DensePassageRetriever\n",
"from haystack.reader import FARMReader\n",
"\n",
"\n",
"# Initialize DocumentStore and index documents\n",
"launch_es()\n",
"document_store = ElasticsearchDocumentStore()\n",
"document_store.delete_all_documents()\n",
"document_store.write_documents(got_dicts)\n",
"\n",
"# Initialize Sparse retriever\n",
"es_retriever = ElasticsearchRetriever(document_store=document_store)\n",
"\n",
"# Initialize dense retriever\n",
"dpr_retriever = DensePassageRetriever(document_store)\n",
"document_store.update_embeddings(dpr_retriever, update_existing_embeddings=False)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Prebuilt Pipelines\n",
"\n",
"Haystack features many prebuilt pipelines that cover common tasks.\n",
"Here we have an `ExtractiveQAPipeline` (the successor to the now deprecated `Finder` class)."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import ExtractiveQAPipeline\n",
"\n",
"# Prebuilt pipeline\n",
"p_extractive_premade = ExtractiveQAPipeline(reader=reader, retriever=es_retriever)\n",
"res = p_extractive_premade.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10,\n",
"    top_k_reader=5\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"If you only want to do the retrieval step, you can use a `DocumentSearchPipeline`"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import DocumentSearchPipeline\n",
"\n",
"p_retrieval = DocumentSearchPipeline(es_retriever)\n",
"res = p_retrieval.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10\n",
")\n",
"print_documents(res, max_text_len=200)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Or if you want to use a `Generator` instead of a `Reader`,\n",
"you can initialize a `GenerativeQAPipeline` like this:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import GenerativeQAPipeline, FAQPipeline\n",
"from haystack.generator import RAGenerator\n",
"\n",
"# We set this to True so that the document store returns document embeddings with each document\n",
"# This is needed by the Generator\n",
"document_store.return_embedding = True\n",
"\n",
"# Initialize generator\n",
"rag_generator = RAGenerator()\n",
"\n",
"# Generative QA\n",
"p_generator = GenerativeQAPipeline(generator=rag_generator, retriever=dpr_retriever)\n",
"res = p_generator.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10\n",
")\n",
"print_answers(res, details=\"minimal\")\n",
"\n",
"# We are setting this to False so that in later pipelines,\n",
"# we get a cleaner printout\n",
"document_store.return_embedding = False\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Haystack features prebuilt pipelines for:\n",
"- just document search (`DocumentSearchPipeline`)\n",
"- document search with summarization (`SearchSummarizationPipeline`)\n",
"- generative QA (`GenerativeQAPipeline`)\n",
"- FAQ style QA (`FAQPipeline`)\n",
"- translated search (`TranslationWrapperPipeline`)\n",
"\n",
"To find out more about these pipelines, have a look at our [documentation](https://haystack.deepset.ai/docs/latest/pipelinesmd).\n"
],
"metadata": {
"collapsed": false
}
},
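{
"cell_type": "markdown",
"source": [
"As a quick example of one more of these, the next cell sketches a `SearchSummarizationPipeline`.\n",
"It is a minimal sketch rather than a definitive recipe: it assumes that `TransformersSummarizer` (with its default model) is available under `haystack.summarizer`,\n",
"and it reuses the `dpr_retriever` initialized above."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import SearchSummarizationPipeline\n",
"from haystack.summarizer import TransformersSummarizer\n",
"\n",
"# Summarizer with its default model; pass model_name_or_path to override\n",
"summarizer = TransformersSummarizer()\n",
"\n",
"# Retrieve documents, then summarize each of them\n",
"p_summarization = SearchSummarizationPipeline(summarizer=summarizer, retriever=dpr_retriever)\n",
"res = p_summarization.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10\n",
")\n",
"print_documents(res, max_text_len=200)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},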
{
"cell_type": "markdown",
"source": [
"With any Pipeline, whether prebuilt or custom constructed,\n",
"you can save a diagram showing how all the components are connected.\n",
"\n",
""
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"p_extractive_premade.draw(\"pipeline_extractive_premade.png\")\n",
"p_retrieval.draw(\"pipeline_retrieval.png\")\n",
"p_generator.draw(\"pipeline_generator.png\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Custom Pipelines\n",
"\n",
"Now we are going to rebuild the `ExtractiveQAPipeline` using the generic `Pipeline` class.\n",
"We do this by adding the building blocks that we initialized earlier as nodes in the graph."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Custom built extractive QA pipeline\n",
"p_extractive = Pipeline()\n",
"p_extractive.add_node(component=es_retriever, name=\"Retriever\", inputs=[\"Query\"])\n",
"p_extractive.add_node(component=reader, name=\"Reader\", inputs=[\"Retriever\"])\n",
"\n",
"# Now we can run it\n",
"res = p_extractive.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10,\n",
"    top_k_reader=5\n",
")\n",
"print_answers(res, details=\"minimal\")\n",
"p_extractive.draw(\"pipeline_extractive.png\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Pipelines offer a very simple way to combine different components into an ensemble.\n",
"In this example, we are going to combine the power of a `DensePassageRetriever`\n",
"with the keyword-based `ElasticsearchRetriever`.\n",
"See our [documentation](https://haystack.deepset.ai/docs/latest/retrievermd) to understand why\n",
"we might want to combine a dense and sparse retriever.\n",
"\n",
"\n",
"\n",
"Here we use a `JoinDocuments` node so that the predictions from each retriever can be merged together."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from haystack.pipeline import JoinDocuments\n",
"\n",
"# Create ensembled pipeline\n",
"p_ensemble = Pipeline()\n",
"p_ensemble.add_node(component=es_retriever, name=\"ESRetriever\", inputs=[\"Query\"])\n",
"p_ensemble.add_node(component=dpr_retriever, name=\"DPRRetriever\", inputs=[\"Query\"])\n",
"p_ensemble.add_node(component=JoinDocuments(join_mode=\"concatenate\"), name=\"JoinResults\", inputs=[\"ESRetriever\", \"DPRRetriever\"])\n",
"p_ensemble.add_node(component=reader, name=\"Reader\", inputs=[\"JoinResults\"])\n",
"p_ensemble.draw(\"pipeline_ensemble.png\")\n",
"\n",
"# Run pipeline\n",
"res = p_ensemble.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=5  # This is top_k per retriever\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Custom Nodes\n",
"\n",
"Nodes are relatively simple objects,\n",
"and we encourage our users to design their own if they don't see one that fits their use case.\n",
"\n",
"The only requirements are:\n",
"- Add a method `run(self, **kwargs)` to your class. `**kwargs` will contain the output from the previous node in your graph.\n",
"- Do whatever you want within `run()` (e.g. reformatting the query).\n",
"- Return a tuple that contains your output data (for the next node)\n",
"and the name of the outgoing edge (by default \"output_1\" for nodes that have one output).\n",
"- Add a class attribute `outgoing_edges = 1` that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).\n",
"\n",
"Here we have a template for a Node:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"class NodeTemplate():\n",
"    outgoing_edges = 1\n",
"\n",
"    def run(self, **kwargs):\n",
"        # Insert code here to manipulate the variables in kwargs\n",
"        return (kwargs, \"output_1\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
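{
"cell_type": "markdown",
"source": [
"To make this more concrete, below is a small custom node that follows the template above:\n",
"it reformats the query (here: stripping whitespace and lowercasing) before passing it on to the retriever.\n",
"This is a minimal sketch for illustration; the `QueryCleaner` name and its placement in front of the retriever are our own example, not a built-in Haystack component."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Illustrative custom node (not a built-in Haystack class):\n",
"# it cleans up the incoming query before passing it along \"output_1\"\n",
"class QueryCleaner():\n",
"    outgoing_edges = 1\n",
"\n",
"    def run(self, **kwargs):\n",
"        kwargs[\"query\"] = kwargs[\"query\"].strip().lower()\n",
"        return (kwargs, \"output_1\")\n",
"\n",
"# A custom node plugs into a pipeline like any other component\n",
"p_cleaned = Pipeline()\n",
"p_cleaned.add_node(component=QueryCleaner(), name=\"QueryCleaner\", inputs=[\"Query\"])\n",
"p_cleaned.add_node(component=es_retriever, name=\"Retriever\", inputs=[\"QueryCleaner\"])\n",
"p_cleaned.add_node(component=reader, name=\"Reader\", inputs=[\"Retriever\"])\n",
"\n",
"res = p_cleaned.run(query=\"  Who is the father of Arya Stark?  \", top_k_retriever=10)\n",
"print_answers(res, details=\"minimal\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},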
{
"cell_type": "markdown",
"source": [
"## Decision Nodes\n",
"\n",
"Decision Nodes help you route your data so that only certain branches of your `Pipeline` are run.\n",
"One popular use case for such query classifiers is routing keyword queries to Elasticsearch and questions to DPR + Reader.\n",
"With this approach you keep optimal speed and simplicity for keywords while going deep with transformers when it's most helpful.\n",
"\n",
"\n",
"\n",
"Though this looks very similar to the ensembled pipeline shown above,\n",
"the key difference is that only one of the retrievers is run for each request.\n",
"By contrast, both retrievers are always run in the ensembled approach.\n",
"\n",
"Below, we define a very naive `QueryClassifier` and show how to use it:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"class QueryClassifier():\n",
"    outgoing_edges = 2\n",
"\n",
"    def run(self, **kwargs):\n",
"        if \"?\" in kwargs[\"query\"]:\n",
"            return (kwargs, \"output_2\")\n",
"        else:\n",
"            return (kwargs, \"output_1\")\n",
"\n",
"# Here we build the pipeline\n",
"p_classifier = Pipeline()\n",
"p_classifier.add_node(component=QueryClassifier(), name=\"QueryClassifier\", inputs=[\"Query\"])\n",
"p_classifier.add_node(component=es_retriever, name=\"ESRetriever\", inputs=[\"QueryClassifier.output_1\"])\n",
"p_classifier.add_node(component=dpr_retriever, name=\"DPRRetriever\", inputs=[\"QueryClassifier.output_2\"])\n",
"p_classifier.add_node(component=reader, name=\"QAReader\", inputs=[\"ESRetriever\", \"DPRRetriever\"])\n",
"p_classifier.draw(\"pipeline_classifier.png\")\n",
"\n",
"# Run only the dense retriever on the full sentence query\n",
"res_1 = p_classifier.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10\n",
")\n",
"print(\"DPR Results\" + \"\\n\" + \"=\"*15)\n",
"print_answers(res_1)\n",
"\n",
"# Run only the sparse retriever on a keyword based query\n",
"res_2 = p_classifier.run(\n",
"    query=\"Arya Stark father\",\n",
"    top_k_retriever=10\n",
")\n",
"print(\"ES Results\" + \"\\n\" + \"=\"*15)\n",
"print_answers(res_2)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Evaluation Nodes\n",
"\n",
"We have also designed a set of nodes that can be used to evaluate the performance of a system.\n",
"Have a look at our [tutorial](https://haystack.deepset.ai/docs/latest/tutorial5md) to get hands-on with the code and learn more about Evaluation Nodes!\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## YAML Configs\n",
"\n",
"A full `Pipeline` can be defined in a YAML file and simply loaded.\n",
"Having your pipeline available in a YAML is particularly useful\n",
"when you move between experimentation and production environments.\n",
"Just export the YAML from your notebook / IDE and import it into your production environment.\n",
"It also helps with version control of pipelines,\n",
"allows you to share your pipeline easily with colleagues,\n",
"and simplifies the configuration of pipeline parameters in production.\n",
"\n",
"The YAML file consists of two main sections: you define all objects (e.g. a reader) in `components`\n",
"and then stick them together into a pipeline in `pipelines`.\n",
"You can also set one component to be multiple nodes of a pipeline or to be a node across multiple pipelines.\n",
"The component will be loaded just once into memory, so it doesn't use more resources than actually needed.\n",
"\n",
"The contents of a YAML file should look something like this:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"```yaml\n",
"version: '0.7'\n",
"components: # define all the building-blocks for Pipeline\n",
"- name: MyReader # custom-name for the component; helpful for visualization & debugging\n",
"  type: FARMReader # Haystack Class name for the component\n",
"  params:\n",
"    no_ans_boost: -10\n",
"    model_name_or_path: deepset/roberta-base-squad2\n",
"- name: MyESRetriever\n",
"  type: ElasticsearchRetriever\n",
"  params:\n",
"    document_store: MyDocumentStore # params can reference other components defined in the YAML\n",
"    custom_query: null\n",
"- name: MyDocumentStore\n",
"  type: ElasticsearchDocumentStore\n",
"  params:\n",
"    index: haystack_test\n",
"pipelines: # multiple Pipelines can be defined using the components from above\n",
"- name: my_query_pipeline # a simple extractive-qa Pipeline\n",
"  nodes:\n",
"  - name: MyESRetriever\n",
"    inputs: [Query]\n",
"  - name: MyReader\n",
"    inputs: [MyESRetriever]\n",
"```"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"To load, simply call:\n",
"```python\n",
"pipeline = Pipeline.load_from_yaml(Path(\"sample.yaml\"))\n",
"```"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
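{
"cell_type": "markdown",
"source": [
"As a fuller sketch of the loading step, the cell below assumes that a `sample.yaml` file with the contents shown above sits next to this notebook,\n",
"and that `load_from_yaml` is called as a class method on `Pipeline`, with a `pipeline_name` argument selecting one of the pipelines defined in the file."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"# Load the pipeline defined as my_query_pipeline in sample.yaml\n",
"# (assumes sample.yaml exists next to this notebook)\n",
"yaml_pipeline = Pipeline.load_from_yaml(Path(\"sample.yaml\"), pipeline_name=\"my_query_pipeline\")\n",
"\n",
"# The loaded pipeline runs like any other\n",
"res = yaml_pipeline.run(\n",
"    query=\"Who is the father of Arya Stark?\",\n",
"    top_k_retriever=10\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},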
{
"cell_type": "markdown",
"source": [
"## Conclusion\n",
"\n",
"The possibilities are endless with the `Pipeline` class and we hope that this tutorial will inspire you\n",
"to build custom pipelines that really work for your use case!"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://apply.workable.com/deepset/) "
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}