Replaced minimal with minimum in print_answers function call (#1890)

Alberto Villa 2021-12-14 15:27:37 +01:00 committed by GitHub
parent 2396f0cd3a
commit 1bb6244a63
4 changed files with 529 additions and 522 deletions

View File

@ -17,6 +17,8 @@ or Retriever (or 2), the `Pipeline` class will help you build a Directed Acyclic
determines how to route the output of one component into the input of another.
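
For instance, a Retriever-Reader graph can be wired together in a few lines. This is only a sketch; `es_retriever` and `reader` stand in for components that are initialized later in the tutorial:

```python
from haystack import Pipeline

# A minimal DAG: the query feeds the retriever, whose output feeds the reader
p = Pipeline()
p.add_node(component=es_retriever, name="Retriever", inputs=["Query"])
p.add_node(component=reader, name="Reader", inputs=["Retriever"])
```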
## Setting Up the Environment
Let's start by making sure we have a GPU running to ensure decent speed in this tutorial.
@ -130,7 +132,7 @@ res = p_extractive_premade.run(
query="Who is the father of Arya Stark?",
params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
print_answers(res, details="minimal")
print_answers(res, details="minimum")
```
If you want to just do the retrieval step, you can use a `DocumentSearchPipeline`
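
The corresponding notebook cell (shown later in this diff) runs roughly:

```python
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(es_retriever)
res = p_retrieval.run(
    query="Who is the father of Arya Stark?",
    params={"Retriever": {"top_k": 10}},
)
print_documents(res, max_text_len=200)
```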
@ -168,7 +170,7 @@ res = p_generator.run(
query="Who is the father of Arya Stark?",
params={"Retriever": {"top_k": 10}}
)
print_answers(res, details="minimal")
print_answers(res, details="minimum")
# We are setting this to False so that in later pipelines,
# we get a cleaner printout
@ -214,7 +216,7 @@ res = p_extractive.run(
query="Who is the father of Arya Stark?",
params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
print_answers(res, details="minimal")
print_answers(res, details="minimum")
p_extractive.draw("pipeline_extractive.png")
```
@ -245,7 +247,7 @@ res = p_ensemble.run(
query="Who is the father of Arya Stark?",
params={"DPRRetriever": {"top_k": 5}, "ESRetriever": {"top_k": 5}}
)
print_answers(res, details="minimal")
print_answers(res, details="minimum")
```
## Custom Nodes
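
The notebook version of this section spells out the node template; condensed here as a sketch (the class name is illustrative):

```python
from typing import Optional

from haystack import BaseComponent


class NodeTemplate(BaseComponent):
    # If this is not a decision node, there is exactly one outgoing edge
    outgoing_edges = 1

    def run(self, query: Optional[str] = None):
        # process the inputs
        output = {"my_output": ...}
        return output, "output_1"
```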

View File

@ -221,7 +221,7 @@ prediction = pipe.run(
```python
print_answers(prediction, details="minimal")
print_answers(prediction, details="minimum")
```
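
The rename matters because `print_answers` expects `details` to be one of its supported verbosity levels; at the time of this change those were `"minimum"`, `"medium"`, and `"all"`, so `"minimal"` was not accepted. A quick sketch, assuming `prediction` is the dict returned by `pipe.run` above:

```python
from haystack.utils import print_answers

# "minimum" keeps just the essentials (answer and context);
# "medium" and "all" print progressively more fields
print_answers(prediction, details="minimum")
```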
## About us

View File

@ -2,6 +2,12 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Pipelines Tutorial\n",
"\n",
@ -11,60 +17,66 @@
"building blocks that are found in FARM. Whether you are using a Reader, Generator, Summarizer\n",
"or Retriever (or 2), the `Pipeline` class will help you build a Directed Acyclic Graph (DAG) that\n",
"determines how to route the output of one component into the input of another.\n"
],
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
},
"source": [
"## Setting Up the Environment\n",
"\n",
"Let's start by ensuring we have a GPU running to ensure decent speed in this tutorial.\n",
"In Google colab, you can change to a GPU runtime in the menu:\n",
"- **Runtime -> Change Runtime type -> Hardware accelerator -> GPU**"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
"outputs": [],
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"source": [
"These lines are to install Haystack through pip"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"These lines are to install Haystack through pip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Install the latest master of Haystack\n",
"!pip install grpcio-tools==1.34.1\n",
@ -76,30 +88,30 @@
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"If running from Colab or a no Docker environment, you will want to start Elasticsearch from source"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"If running from Colab or a no Docker environment, you will want to start Elasticsearch from source"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
@ -114,43 +126,43 @@
" )\n",
"# wait until ES has started\n",
"! sleep 30"
],
"outputs": [],
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Initialization"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Then let's fetch some data (in this case, pages from the Game of Thrones wiki) and prepare it so that it can\n",
"be used indexed into our `DocumentStore`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Initialization"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Then let's fetch some data (in this case, pages from the Game of Thrones wiki) and prepare it so that it can\n",
"be used indexed into our `DocumentStore`"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
},
"outputs": [],
"source": [
"from haystack.utils import print_answers, print_documents, fetch_archive_from_http, convert_files_to_dicts, clean_wiki_text\n",
"\n",
@ -165,33 +177,33 @@
" clean_func=clean_wiki_text,\n",
" split_paragraphs=True\n",
")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"Here we initialize the core components that we will be gluing together using the `Pipeline` class.\n",
"We have a `DocumentStore`, an `ElasticsearchRetriever` and a `FARMReader`.\n",
"These can be combined to create a classic Retriever-Reader pipeline that is designed\n",
"to perform Open Domain Question Answering."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"Here we initialize the core components that we will be gluing together using the `Pipeline` class.\n",
"We have a `DocumentStore`, an `ElasticsearchRetriever` and a `FARMReader`.\n",
"These can be combined to create a classic Retriever-Reader pipeline that is designed\n",
"to perform Open Domain Question Answering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack import Pipeline\n",
"from haystack.utils import launch_es\n",
@ -213,33 +225,33 @@
"document_store.update_embeddings(dpr_retriever, update_existing_embeddings=False)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"## Prebuilt Pipelines\n",
"\n",
"Haystack features many prebuilt pipelines that cover common tasks.\n",
"Here we have an `ExtractiveQAPipeline` (the successor to the now deprecated `Finder` class)."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"## Prebuilt Pipelines\n",
"\n",
"Haystack features many prebuilt pipelines that cover common tasks.\n",
"Here we have an `ExtractiveQAPipeline` (the successor to the now deprecated `Finder` class)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack.pipelines import ExtractiveQAPipeline\n",
"\n",
@ -249,28 +261,28 @@
" query=\"Who is the father of Arya Stark?\",\n",
" params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}},\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"outputs": [],
"print_answers(res, details=\"minimum\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"If you want to just do the retrieval step, you can use a `DocumentSearchPipeline`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"If you want to just do the retrieval step, you can use a `DocumentSearchPipeline`"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
},
"outputs": [],
"source": [
"from haystack.pipelines import DocumentSearchPipeline\n",
"\n",
@ -280,28 +292,28 @@
" params={\"Retriever\": {\"top_k\": 10}},\n",
")\n",
"print_documents(res, max_text_len=200)"
],
"outputs": [],
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Or if you want to use a `Generator` instead of a `Reader`,\n",
"you can initialize a `GenerativeQAPipeline` like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Or if you want to use a `Generator` instead of a `Reader`,\n",
"you can initialize a `GenerativeQAPipeline` like this:"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
},
"outputs": [],
"source": [
"from haystack.pipelines import GenerativeQAPipeline, FAQPipeline\n",
"from haystack.nodes import RAGenerator\n",
@ -319,22 +331,18 @@
" query=\"Who is the father of Arya Stark?\",\n",
" params={\"Retriever\": {\"top_k\": 10}}\n",
")\n",
"print_answers(res, details=\"minimal\")\n",
"print_answers(res, details=\"minimum\")\n",
"\n",
"# We are setting this to False so that in later pipelines,\n",
"# we get a cleaner printout\n",
"document_store.return_embedding = False\n"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Haystack features prebuilt pipelines to do:\n",
"- just document search (DocumentSearchPipeline),\n",
@ -343,60 +351,64 @@
"- FAQ style QA (FAQPipeline)\n",
"- translated search (TranslationWrapperPipeline)\n",
"To find out more about these pipelines, have a look at our [documentation](https://haystack.deepset.ai/docs/latest/pipelinesmd)\n"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"With any Pipeline, whether prebuilt or custom constructed,\n",
"you can save a diagram showing how all the components are connected.\n",
"\n",
"![image](https://user-images.githubusercontent.com/1563902/102451716-54813700-4039-11eb-881e-f3c01b47ca15.png)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"p_extractive_premade.draw(\"pipeline_extractive_premade.png\")\n",
"p_retrieval.draw(\"pipeline_retrieval.png\")\n",
"p_generator.draw(\"pipeline_generator.png\")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
"outputs": [],
"source": [
"p_extractive_premade.draw(\"pipeline_extractive_premade.png\")\n",
"p_retrieval.draw(\"pipeline_retrieval.png\")\n",
"p_generator.draw(\"pipeline_generator.png\")"
]
},
{
"cell_type": "markdown",
"source": [
"## Custom Pipelines\n",
"\n",
"Now we are going to rebuild the `ExtractiveQAPipelines` using the generic Pipeline class.\n",
"We do this by adding the building blocks that we initialized as nodes in the graph."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"## Custom Pipelines\n",
"\n",
"Now we are going to rebuild the `ExtractiveQAPipelines` using the generic Pipeline class.\n",
"We do this by adding the building blocks that we initialized as nodes in the graph."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Custom built extractive QA pipeline\n",
"p_extractive = Pipeline()\n",
@ -408,19 +420,18 @@
" query=\"Who is the father of Arya Stark?\",\n",
" params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
")\n",
"print_answers(res, details=\"minimal\")\n",
"print_answers(res, details=\"minimum\")\n",
"p_extractive.draw(\"pipeline_extractive.png\")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Pipelines offer a very simple way to ensemble together different components.\n",
"In this example, we are going to combine the power of a `DensePassageRetriever`\n",
@ -431,17 +442,18 @@
"![image](https://user-images.githubusercontent.com/1563902/102451782-7bd80400-4039-11eb-9046-01b002a783f8.png)\n",
"\n",
"Here we use a `JoinDocuments` node so that the predictions from each retriever can be merged together."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack.pipelines import JoinDocuments\n",
"\n",
@ -458,18 +470,17 @@
" query=\"Who is the father of Arya Stark?\",\n",
" params={\"DPRRetriever\": {\"top_k\": 5}, \"ESRetriever\": {\"top_k\": 5}}\n",
")\n",
"print_answers(res, details=\"minimal\")"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
"print_answers(res, details=\"minimum\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Custom Nodes\n",
"\n",
@ -485,17 +496,18 @@
"- Add a class attribute outgoing_edges = 1 that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).\n",
"\n",
"Here we have a template for a Node:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack import BaseComponent\n",
"from typing import Optional\n",
@ -507,17 +519,16 @@
" # process the inputs\n",
" output = {\"my_output\": ...}\n",
" return output, \"output_1\""
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Decision Nodes\n",
"\n",
@ -532,17 +543,18 @@
"By contrast both retrievers are always run in the ensembled approach.\n",
"\n",
"Below, we define a very naive `QueryClassifier` and show how to use it:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"class CustomQueryClassifier(BaseComponent):\n",
" outgoing_edges = 2\n",
@ -570,47 +582,46 @@
"res_2 = p_classifier.run(query=\"Arya Stark father\")\n",
"print(\"ES Results\" + \"\\n\" + \"=\"*15)\n",
"print_answers(res_2)"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Evaluation Nodes\n",
"\n",
"We have also designed a set of nodes that can be used to evaluate the performance of a system.\n",
"Have a look at our [tutorial](https://haystack.deepset.ai/docs/latest/tutorial5md) to get hands on with the code and learn more about Evaluation Nodes!"
],
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
},
"source": [
"## Debugging Pipelines\n",
"\n",
"You can print out debug information from nodes in your pipelines in a few different ways."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# 1) You can set the `debug` attribute of a given node.\n",
@ -635,16 +646,16 @@
")\n",
"\n",
"result[\"_debug\"]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## YAML Configs\n",
"\n",
@ -662,16 +673,16 @@
"It will be loaded just once in memory and therefore doesn't hurt your resources more than actually needed.\n",
"\n",
"The contents of a YAML file should look something like this:"
],
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
},
"source": [
"```yaml\n",
"version: '0.7'\n",
@ -698,46 +709,43 @@
" - name: MyReader\n",
" inputs: [MyESRetriever]\n",
"```"
],
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
},
"source": [
"To load, simply call:\n",
"``` python\n",
"pipeline.load_from_yaml(Path(\"sample.yaml\"))\n",
"```"
],
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
},
"source": [
"## Conclusion\n",
"\n",
"The possibilities are endless with the `Pipeline` class and we hope that this tutorial will inspire you\n",
"to build custom pipeplines that really work for your use case!"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## About us\n",
"\n",
@ -755,10 +763,7 @@
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
],
"metadata": {
"collapsed": false
}
]
}
],
"metadata": {
@ -782,4 +787,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}

View File

@ -2,6 +2,10 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bEH-CRbeA6NU"
},
"source": [
"# Better Retrieval via \"Dense Passage Retrieval\"\n",
"\n",
@ -52,14 +56,14 @@
"\n",
"\n",
"*Use this* [link](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb) *to open the notebook in Google Colab.*\n"
],
"metadata": {
"colab_type": "text",
"id": "bEH-CRbeA6NU"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3K27Y5FbA6NV"
},
"source": [
"### Prepare environment\n",
"\n",
@ -68,23 +72,24 @@
"**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
"\n",
"<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
],
"metadata": {
"colab_type": "text",
"id": "3K27Y5FbA6NV"
}
]
},
{
"cell_type": "code",
"execution_count": 1,
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
},
"colab_type": "code",
"id": "JlZgP8q1A6NW",
"outputId": "c893ac99-b7a0-4d49-a8eb-1a9951d364d9"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Mon Aug 24 11:56:45 2020 \r\n",
"+-----------------------------------------------------------------------------+\r\n",
@ -106,34 +111,27 @@
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
},
"colab_type": "code",
"id": "JlZgP8q1A6NW",
"outputId": "c893ac99-b7a0-4d49-a8eb-1a9951d364d9"
}
"source": [
"# Make sure you have a GPU running\n",
"!nvidia-smi"
]
},
{
"cell_type": "code",
"execution_count": 2,
"source": [
"# Install the latest release of Haystack in your own environment \n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install grpcio-tools==1.34.1\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"colab_type": "code",
"id": "NM36kbRFA6Nc",
"outputId": "af1a9d85-9557-4d68-ea87-a01f00c584f9"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting git+https://github.com/deepset-ai/haystack.git\n",
" Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-fqgbr4x7\n",
@ -200,8 +198,8 @@
]
},
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: future in /home/ubuntu/py3_6/lib/python3.6/site-packages (from torch==1.5.*->farm==0.4.6->farm-haystack==0.3.0) (0.18.2)\n",
"Requirement already satisfied: botocore<1.18.0,>=1.17.20 in /home/ubuntu/py3_6/lib/python3.6/site-packages (from boto3->farm==0.4.6->farm-haystack==0.3.0) (1.17.20)\n",
@ -258,8 +256,8 @@
]
},
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in links: https://download.pytorch.org/whl/torch_stable.html\n",
"Collecting torch==1.5.1+cu101\n",
@ -282,32 +280,38 @@
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"colab_type": "code",
"id": "NM36kbRFA6Nc",
"outputId": "af1a9d85-9557-4d68-ea87-a01f00c584f9"
}
"source": [
"# Install the latest release of Haystack in your own environment \n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install grpcio-tools==1.34.1\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
]
},
{
"cell_type": "code",
"execution_count": 2,
"source": [
"from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers\n",
"from haystack.nodes import FARMReader, TransformersReader\n"
],
"outputs": [],
"metadata": {
"colab": {},
"colab_type": "code",
"id": "xmRuhTQ7A6Nh"
}
},
"outputs": [],
"source": [
"from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers\n",
"from haystack.nodes import FARMReader, TransformersReader\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "q3dSo7ZtA6Nl"
},
"source": [
"### Document Store\n",
"\n",
@ -320,30 +324,11 @@
"The default flavour of FAISSDocumentStore is \"Flat\" but can also be set to \"HNSW\" for\n",
"faster search at the expense of some accuracy. Just set the faiss_index_factor_str argument in the constructor.\n",
"For more info on which suits your use case: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index"
],
"metadata": {
"colab_type": "text",
"id": "q3dSo7ZtA6Nl"
}
]
},
{
"cell_type": "code",
"execution_count": 3,
"source": [
"from haystack.document_stores import FAISSDocumentStore\n",
"\n",
"document_store = FAISSDocumentStore(faiss_index_factory_str=\"Flat\")"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"08/25/2020 08:27:51 - INFO - faiss - Loading faiss with AVX2 support.\n",
"08/25/2020 08:27:51 - INFO - faiss - Loading faiss.\n"
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@ -355,10 +340,31 @@
"pycharm": {
"name": "#%%\n"
}
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"08/25/2020 08:27:51 - INFO - faiss - Loading faiss with AVX2 support.\n",
"08/25/2020 08:27:51 - INFO - faiss - Loading faiss.\n"
]
}
],
"source": [
"from haystack.document_stores import FAISSDocumentStore\n",
"\n",
"document_store = FAISSDocumentStore(faiss_index_factory_str=\"Flat\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"#### Option 2: Milvus\n",
"\n",
@ -366,71 +372,44 @@
"Like FAISS it has both a \"Flat\" and \"HNSW\" mode but it outperforms FAISS when it comes to dynamic data management.\n",
"It does require a little more setup, however, as it is run through Docker and requires the setup of some config files.\n",
"See [their docs](https://milvus.io/docs/v1.0.0/milvus_docker-cpu.md) for more details."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from haystack.utils import launch_milvus\n",
"from haystack.document_stores import MilvusDocumentStore\n",
"\n",
"launch_milvus()\n",
"document_store = MilvusDocumentStore()"
],
"outputs": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
]
},
{
"cell_type": "markdown",
"source": [
"### Cleaning & indexing documents\n",
"\n",
"Similarly to the previous tutorials, we download, convert and index some Game of Thrones articles to our DocumentStore"
],
"metadata": {
"colab_type": "text",
"id": "06LatTJBA6N0",
"pycharm": {
"name": "#%% md\n"
}
}
},
"source": [
"### Cleaning & indexing documents\n",
"\n",
"Similarly to the previous tutorials, we download, convert and index some Game of Thrones articles to our DocumentStore"
]
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"# Let's first get some files that we want to use\n",
"doc_dir = \"data/article_txt_got\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# Convert files to dicts\n",
"dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
"\n",
"# Now, let's write the dicts containing documents to our DB.\n",
"document_store.write_documents(dicts)"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"08/25/2020 08:27:53 - INFO - haystack.indexing.utils - Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.\n"
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@ -442,10 +421,35 @@
"pycharm": {
"name": "#%%\n"
}
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"08/25/2020 08:27:53 - INFO - haystack.indexing.utils - Found data stored in `data/article_txt_got`. Delete this first if you really want to fetch new data.\n"
]
}
],
"source": [
"# Let's first get some files that we want to use\n",
"doc_dir = \"data/article_txt_got\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
"\n",
"# Convert files to dicts\n",
"dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)\n",
"\n",
"# Now, let's write the dicts containing documents to our DB.\n",
"document_store.write_documents(dicts)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "wgjedxx_A6N6"
},
"source": [
"### Initalize Retriever, Reader & Pipeline\n",
"\n",
@ -458,37 +462,53 @@
"- The `ElasticsearchRetriever`with custom queries (e.g. boosting) and filters\n",
"- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)\n",
"- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging"
],
"metadata": {
"colab_type": "text",
"id": "wgjedxx_A6N6"
}
]
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"from haystack.nodes import DensePassageRetriever\n",
"retriever = DensePassageRetriever(document_store=document_store,\n",
" query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n",
" passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n",
" max_seq_len_query=64,\n",
" max_seq_len_passage=256,\n",
" batch_size=16,\n",
" use_gpu=True,\n",
" embed_title=True,\n",
" use_fast_tokenizers=True)\n",
"# Important: \n",
"# Now that after we have the DPR initialized, we need to call update_embeddings() to iterate over all\n",
"# previously indexed documents and update their embedding representation. \n",
"# While this can be a time consuming operation (depending on corpus size), it only needs to be done once. \n",
"# At query time, we only need to embed the query and compare it the existing doc embeddings which is very fast.\n",
"document_store.update_embeddings(retriever)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000,
"referenced_widgets": [
"20affb86c4574e3a9829136fdfe40470",
"7f8c2c86bbb74a18ac8bd24046d99d34",
"84311c037c6e44b5b621237f59f027a0",
"05d793fc179746e9b74cbcbc1a3389eb",
"ad2ce6a8b4f844ac93b425f1261c131f",
"bb45d5e4c9944fcd87b408e2fbfea440",
"248d02e01dea4a63a3296e28e4537eaf",
"74a9c43eb61a43aa973194b0b70e18f5",
"58fc3339f13644aea1d4c6d8e1d43a65",
"460bef2bfa7d4aa480639095555577ac",
"8553a48fb3144739b99fa04adf8b407c",
"babe35bb292f4010b64104b2b5bc92af",
"887412c45ce744efbcc875b563770c29",
"b4b950d899df4e3fbed9255b281e988a",
"89535c589aa64648b82a9794a2888e78",
"f35430501bb14fba8dbd5fb797c2e509",
"eb5d93a8416a437e9cb039650756ac74",
"5b8d5975d2674e7e9ada64e77c463c0a",
"4afa2be1c2c5483f932a42ea4a7897af",
"0e7186eeb5fa47d89c8c111ebe43c5af",
"fa946133dfcc4a6ebc6fef2ef9dd92f7",
"518b6a993e42490297289f2328d0270a",
"cea074a636d34a75b311569fc3f0b3ab",
"2630fd2fa91d498796af6d7d8d73aba4"
]
},
"colab_type": "code",
"id": "kFwiPP60A6N7",
"outputId": "07249856-3222-4898-9246-68e9ecbf5a1b",
"pycharm": {
"is_executing": true
}
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"output_type": "stream",
"text": [
"08/25/2020 08:28:12 - INFO - haystack.database.faiss - Updating embeddings for 2497 docs ...\n",
"/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:\n",
@ -529,47 +549,31 @@
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000,
"referenced_widgets": [
"20affb86c4574e3a9829136fdfe40470",
"7f8c2c86bbb74a18ac8bd24046d99d34",
"84311c037c6e44b5b621237f59f027a0",
"05d793fc179746e9b74cbcbc1a3389eb",
"ad2ce6a8b4f844ac93b425f1261c131f",
"bb45d5e4c9944fcd87b408e2fbfea440",
"248d02e01dea4a63a3296e28e4537eaf",
"74a9c43eb61a43aa973194b0b70e18f5",
"58fc3339f13644aea1d4c6d8e1d43a65",
"460bef2bfa7d4aa480639095555577ac",
"8553a48fb3144739b99fa04adf8b407c",
"babe35bb292f4010b64104b2b5bc92af",
"887412c45ce744efbcc875b563770c29",
"b4b950d899df4e3fbed9255b281e988a",
"89535c589aa64648b82a9794a2888e78",
"f35430501bb14fba8dbd5fb797c2e509",
"eb5d93a8416a437e9cb039650756ac74",
"5b8d5975d2674e7e9ada64e77c463c0a",
"4afa2be1c2c5483f932a42ea4a7897af",
"0e7186eeb5fa47d89c8c111ebe43c5af",
"fa946133dfcc4a6ebc6fef2ef9dd92f7",
"518b6a993e42490297289f2328d0270a",
"cea074a636d34a75b311569fc3f0b3ab",
"2630fd2fa91d498796af6d7d8d73aba4"
]
},
"colab_type": "code",
"id": "kFwiPP60A6N7",
"outputId": "07249856-3222-4898-9246-68e9ecbf5a1b",
"pycharm": {
"is_executing": true
}
}
"source": [
"from haystack.nodes import DensePassageRetriever\n",
"retriever = DensePassageRetriever(document_store=document_store,\n",
" query_embedding_model=\"facebook/dpr-question_encoder-single-nq-base\",\n",
" passage_embedding_model=\"facebook/dpr-ctx_encoder-single-nq-base\",\n",
" max_seq_len_query=64,\n",
" max_seq_len_passage=256,\n",
" batch_size=16,\n",
" use_gpu=True,\n",
" embed_title=True,\n",
" use_fast_tokenizers=True)\n",
"# Important: \n",
"# Now that after we have the DPR initialized, we need to call update_embeddings() to iterate over all\n",
"# previously indexed documents and update their embedding representation. \n",
"# While this can be a time consuming operation (depending on corpus size), it only needs to be done once. \n",
"# At query time, we only need to embed the query and compare it the existing doc embeddings which is very fast.\n",
"document_store.update_embeddings(retriever)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "rnVR28OXA6OA"
},
"source": [
"#### Reader\n",
"\n",
@ -580,41 +584,11 @@
"\n",
"\n",
"##### FARMReader"
],
"metadata": {
"colab_type": "text",
"id": "rnVR28OXA6OA"
}
]
},
{
"cell_type": "code",
"execution_count": 6,
"source": [
"# Load a local model or any of the QA models on\n",
"# Hugging Face's model hub (https://huggingface.co/models)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
],
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"08/25/2020 08:28:54 - INFO - farm.utils - device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None\n",
"08/25/2020 08:28:54 - INFO - farm.infer - Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...\n",
"08/25/2020 08:28:59 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is. \n",
"\t We guess it's an *ENGLISH* model ... \n",
"\t If not: Init the language model by supplying the 'language' param.\n",
"08/25/2020 08:29:06 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {\"loss_ignore_index\": -1}\n",
"08/25/2020 08:29:09 - INFO - farm.utils - device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None\n",
"08/25/2020 08:29:10 - INFO - farm.infer - Got ya 7 parallel workers to do inference ...\n",
"08/25/2020 08:29:10 - INFO - farm.infer - 0 0 0 0 0 0 0 \n",
"08/25/2020 08:29:10 - INFO - farm.infer - /w\\ /w\\ /w\\ /w\\ /w\\ /w\\ /w\\\n",
"08/25/2020 08:29:10 - INFO - farm.infer - /'\\ / \\ /'\\ /'\\ / \\ / \\ /'\\\n",
"08/25/2020 08:29:10 - INFO - farm.infer - \n"
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@ -673,10 +647,40 @@
"colab_type": "code",
"id": "fyIuWVwhA6OB",
"outputId": "33113253-8b95-4604-f9e5-1aa28ee66a91"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"08/25/2020 08:28:54 - INFO - farm.utils - device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None\n",
"08/25/2020 08:28:54 - INFO - farm.infer - Could not find `deepset/roberta-base-squad2` locally. Try to download from model hub ...\n",
"08/25/2020 08:28:59 - WARNING - farm.modeling.language_model - Could not automatically detect from language model name what language it is. \n",
"\t We guess it's an *ENGLISH* model ... \n",
"\t If not: Init the language model by supplying the 'language' param.\n",
"08/25/2020 08:29:06 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {\"loss_ignore_index\": -1}\n",
"08/25/2020 08:29:09 - INFO - farm.utils - device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None\n",
"08/25/2020 08:29:10 - INFO - farm.infer - Got ya 7 parallel workers to do inference ...\n",
"08/25/2020 08:29:10 - INFO - farm.infer - 0 0 0 0 0 0 0 \n",
"08/25/2020 08:29:10 - INFO - farm.infer - /w\\ /w\\ /w\\ /w\\ /w\\ /w\\ /w\\\n",
"08/25/2020 08:29:10 - INFO - farm.infer - /'\\ / \\ /'\\ /'\\ / \\ / \\ /'\\\n",
"08/25/2020 08:29:10 - INFO - farm.infer - \n"
]
}
],
"source": [
"# Load a local model or any of the QA models on\n",
"# Hugging Face's model hub (https://huggingface.co/models)\n",
"\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "unhLD18yA6OF"
},
"source": [
"### Pipeline\n",
"\n",
@ -684,50 +688,48 @@
"Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.\n",
"To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.\n",
"You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd)."
],
"metadata": {
"colab_type": "text",
"id": "unhLD18yA6OF"
}
]
},
{
"cell_type": "code",
"execution_count": 7,
"source": [
"from haystack.pipelines import ExtractiveQAPipeline\n",
"pipe = ExtractiveQAPipeline(reader, retriever)"
],
"outputs": [],
"metadata": {
"colab": {},
"colab_type": "code",
"id": "TssPQyzWA6OG"
}
},
"outputs": [],
"source": [
"from haystack.pipelines import ExtractiveQAPipeline\n",
"pipe = ExtractiveQAPipeline(reader, retriever)"
]
},
{
"cell_type": "markdown",
"source": [
"## Voilà! Ask a question!"
],
"metadata": {
"colab_type": "text",
"id": "bXlBBxKXA6OL"
}
},
"source": [
"## Voilà! Ask a question!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"source": [
"# You can configure how many candidates the reader and retriever shall return\n",
"# The higher top_k for retriever, the better (but also the slower) your answers.\n",
"prediction = pipe.run(\n",
" query=\"Who created the Dothraki vocabulary?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 275
},
"colab_type": "code",
"id": "Zi97Hif2A6OM",
"outputId": "5eb9363d-ba92-45d5-c4d0-63ada3073f02"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"output_type": "stream",
"text": [
"08/25/2020 08:30:28 - INFO - haystack.finder - Reader is looking for detailed answer in 9168 chars ...\n",
"Inferencing Samples: 0%| | 0/1 [00:00<?, ? Batches/s]/home/ubuntu/deepset/FARM/farm/modeling/prediction_head.py:1073: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
@ -747,27 +749,28 @@
]
}
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 275
},
"colab_type": "code",
"id": "Zi97Hif2A6OM",
"outputId": "5eb9363d-ba92-45d5-c4d0-63ada3073f02"
}
"source": [
"# You can configure how many candidates the reader and retriever shall return\n",
"# The higher top_k for retriever, the better (but also the slower) your answers.\n",
"prediction = pipe.run(\n",
" query=\"Who created the Dothraki vocabulary?\", params={\"Retriever\": {\"top_k\": 10}, \"Reader\": {\"top_k\": 5}}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"print_answers(prediction, details=\"minimal\")"
],
"metadata": {},
"outputs": [],
"metadata": {}
"source": [
"print_answers(prediction, details=\"minimum\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## About us\n",
"\n",
@ -785,10 +788,7 @@
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)"
],
"metadata": {
"collapsed": false
}
]
}
],
"metadata": {
@ -3027,4 +3027,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}