{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Preprocessing\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb)\n",
    "\n",
    "Haystack includes a suite of tools to extract text from different file types, normalize white space\n",
    "and split text into smaller pieces to optimize retrieval.\n",
    "These data preprocessing steps can have a big impact on the system's performance, and effective handling of data is key to getting the most out of Haystack."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Ultimately, Haystack expects data to be provided as a list of documents in the following dictionary format:\n",
    "``` python\n",
    "docs = [\n",
    "    {\n",
    "        'content': DOCUMENT_TEXT_HERE,\n",
    "        'meta': {'name': DOCUMENT_NAME, ...}\n",
    "    }, ...\n",
    "]\n",
    "```"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
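  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# A concrete instance of the dictionary format above, built by hand purely for\n",
    "# illustration. The text and file name are made-up placeholder values, not part\n",
    "# of the tutorial data.\n",
    "docs_example = [\n",
    "    {\n",
    "        'content': 'Classics is the study of classical antiquity.',\n",
    "        'meta': {'name': 'example.txt'}\n",
    "    }\n",
    "]\n",
    "print(docs_example[0]['content'])"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },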
  {
   "cell_type": "markdown",
   "source": [
    "This tutorial will show you all the tools that Haystack provides to help you cast your data into this format."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "source": [
    "# Let's start by installing Haystack\n",
    "\n",
    "# Install the latest release of Haystack in your own environment\n",
    "#! pip install farm-haystack\n",
    "\n",
    "# Install the latest master of Haystack\n",
    "!pip install grpcio-tools==1.34.1\n",
    "!pip install git+https://github.com/deepset-ai/haystack.git\n",
    "!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz\n",
    "!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin\n",
    "\n",
    "# If you run this notebook on Google Colab, you might need to\n",
    "# restart the runtime after installing haystack."
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "# Here are the imports we need\n",
    "from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor\n",
    "from haystack.utils import convert_files_to_dicts, fetch_archive_from_http"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "01/06/2021 14:49:14 - INFO - faiss -   Loading faiss with AVX2 support.\n",
      "01/06/2021 14:49:14 - INFO - faiss -   Loading faiss.\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "source": [
    "# This fetches some sample files to work with\n",
    "\n",
    "doc_dir = \"data/preprocessing_tutorial\"\n",
    "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "01/05/2021 12:02:30 - INFO - haystack.preprocessor.utils -   Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip to `data/preprocessing_tutorial`\n",
      "100%|██████████| 595119/595119 [00:00<00:00, 5299765.39B/s]\n"
     ]
    },
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "metadata": {},
     "execution_count": 29
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Converters\n",
    "\n",
    "Haystack's converter classes are designed to help you turn files on your computer into the documents\n",
    "that can be processed by the Haystack pipeline.\n",
    "There are file converters for txt, pdf and docx files, as well as a converter that is powered by Apache Tika (see the sketch after the examples below).\n",
    "The parameter `valid_languages` does not convert files to the target language; it only checks whether the conversion worked as expected.\n",
    "If the conversion of a PDF isn't great, try changing the encoding to UTF-8."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "source": [
    "# Here are some examples of how you would use file converters\n",
    "\n",
    "converter = TextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
    "doc_txt = converter.convert(file_path=\"data/preprocessing_tutorial/classics.txt\", meta=None)\n",
    "\n",
    "converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
    "doc_pdf = converter.convert(file_path=\"data/preprocessing_tutorial/bert.pdf\", meta=None)\n",
    "\n",
    "converter = DocxToTextConverter(remove_numeric_tables=False, valid_languages=[\"en\"])\n",
    "doc_docx = converter.convert(file_path=\"data/preprocessing_tutorial/heavy_metal.docx\", meta=None)\n"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
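  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# A minimal sketch of the Apache Tika powered converter mentioned above.\n",
    "# Assumption: an Apache Tika server is already running and reachable at\n",
    "# http://localhost:9998/tika (e.g. started via the official Tika docker image);\n",
    "# without a running server this cell will fail.\n",
    "from haystack.nodes import TikaConverter\n",
    "\n",
    "tika_converter = TikaConverter(tika_url=\"http://localhost:9998/tika\")\n",
    "doc_tika = tika_converter.convert(file_path=\"data/preprocessing_tutorial/bert.pdf\", meta=None)"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },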
  {
   "cell_type": "code",
   "execution_count": 9,
   "source": [
    "# Haystack also has a convenience function that will automatically apply the right converter to each file in a directory.\n",
    "\n",
    "all_docs = convert_files_to_dicts(dir_path=\"data/preprocessing_tutorial\")"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "01/06/2021 14:51:06 - INFO - haystack.preprocessor.utils -   Converting data/preprocessing_tutorial/heavy_metal.docx\n",
      "01/06/2021 14:51:06 - INFO - haystack.preprocessor.utils -   Converting data/preprocessing_tutorial/bert.pdf\n",
      "01/06/2021 14:51:07 - INFO - haystack.preprocessor.utils -   Converting data/preprocessing_tutorial/classics.txt\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## PreProcessor\n",
    "\n",
    "The PreProcessor class is designed to help you clean text and split text into sensible units.\n",
    "File splitting can have a very significant impact on the system's performance and is absolutely mandatory for Dense Passage Retrieval models.\n",
    "In general, we recommend you split the text from your files into small documents of around 100 words for dense retrieval methods\n",
    "and no more than 10,000 words for sparse methods.\n",
    "Have a look at the [Preprocessing](https://haystack.deepset.ai/docs/latest/preprocessingmd)\n",
    "and [Optimization](https://haystack.deepset.ai/docs/latest/optimizationmd) pages on our website for more details."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "source": [
    "# This is a default usage of the PreProcessor.\n",
    "# Here, it performs cleaning of consecutive whitespaces\n",
    "# and splits a single large document into smaller documents.\n",
    "# Each document is up to 100 words long and document breaks cannot fall in the middle of sentences.\n",
    "# Note how the single document passed into the PreProcessor gets split into many smaller documents.\n",
    "\n",
    "preprocessor = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=False,\n",
    "    split_by=\"word\",\n",
    "    split_length=100,\n",
    "    split_respect_sentence_boundary=True\n",
    ")\n",
    "docs_default = preprocessor.process(doc_txt)\n",
    "print(f\"n_docs_input: 1\\nn_docs_output: {len(docs_default)}\")"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "n_docs_input: 1\n",
      "n_docs_output: 51\n"
     ]
    },
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Cleaning\n",
    "\n",
    "- `clean_empty_lines` will normalize 3 or more consecutive empty lines down to two empty lines\n",
    "- `clean_whitespace` will remove any whitespace at the beginning or end of each line in the text\n",
    "- `clean_header_footer` will remove any long header or footer texts that are repeated on each page\n",
    "\n",
    "A cleaning-only setup is sketched in the cell below."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
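  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# A minimal sketch of a cleaning-only configuration.\n",
    "# Assumption: split_by=None disables splitting entirely, so only the cleaning\n",
    "# flags described above are applied and the document count stays at 1.\n",
    "preprocessor_clean_only = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=False,\n",
    "    split_by=None\n",
    ")\n",
    "docs_clean_only = preprocessor_clean_only.process(doc_txt)\n",
    "print(f\"n_docs_output: {len(docs_clean_only)}\")"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },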
  {
   "cell_type": "markdown",
   "source": [
    "## Splitting\n",
    "By default, the PreProcessor will respect sentence boundaries, meaning that documents will not start or end\n",
    "midway through a sentence.\n",
    "This will help reduce the possibility of answer phrases being split between two documents.\n",
    "This feature can be turned off by setting `split_respect_sentence_boundary=False`."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "source": [
    "# Not respecting sentence boundary vs respecting sentence boundary\n",
    "\n",
    "preprocessor_nrsb = PreProcessor(split_respect_sentence_boundary=False)\n",
    "docs_nrsb = preprocessor_nrsb.process(doc_txt)\n",
    "\n",
    "print(\"RESPECTING SENTENCE BOUNDARY\")\n",
    "end_text = docs_default[0][\"content\"][-50:]\n",
    "print(\"End of document: \\\"...\" + end_text + \"\\\"\")\n",
    "print()\n",
    "print(\"NOT RESPECTING SENTENCE BOUNDARY\")\n",
    "end_text_nrsb = docs_nrsb[0][\"content\"][-50:]\n",
    "print(\"End of document: \\\"...\" + end_text_nrsb + \"\\\"\")"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "RESPECTING SENTENCE BOUNDARY\n",
      "End of document: \"...cornerstone of a typical elite European education.\"\n",
      "\n",
      "NOT RESPECTING SENTENCE BOUNDARY\n",
      "End of document: \"...on. In England, for instance, Oxford and Cambridge\"\n"
     ]
    },
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "A commonly used strategy to split long documents, especially in the field of Question Answering,\n",
    "is the sliding window approach. If `split_length=10` and `split_overlap=3`, your documents will look like this:\n",
    "\n",
    "- doc1 = words[0:10]\n",
    "- doc2 = words[7:17]\n",
    "- doc3 = words[14:24]\n",
    "- ...\n",
    "\n",
    "You can use this strategy by following the code below."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "source": [
    "# Sliding window approach\n",
    "\n",
    "preprocessor_sliding_window = PreProcessor(\n",
    "    split_overlap=3,\n",
    "    split_length=10,\n",
    "    split_respect_sentence_boundary=False\n",
    ")\n",
    "docs_sliding_window = preprocessor_sliding_window.process(doc_txt)\n",
    "\n",
    "doc1 = docs_sliding_window[0][\"content\"][:200]\n",
    "doc2 = docs_sliding_window[1][\"content\"][:100]\n",
    "doc3 = docs_sliding_window[2][\"content\"][:100]\n",
    "\n",
    "print(\"Document 1: \\\"\" + doc1 + \"...\\\"\")\n",
    "print(\"Document 2: \\\"\" + doc2 + \"...\\\"\")\n",
    "print(\"Document 3: \\\"\" + doc3 + \"...\\\"\")"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Document 1: \"Classics or classical studies is the study of classical antiquity,...\"\n",
      "Document 2: \"of classical antiquity, and in the Western world traditionally refers...\"\n",
      "Document 3: \"world traditionally refers to the study of Classical Greek and...\"\n"
     ]
    },
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Bringing it all together"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "source": [
    "all_docs = convert_files_to_dicts(dir_path=\"data/preprocessing_tutorial\")\n",
    "preprocessor = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=False,\n",
    "    split_by=\"word\",\n",
    "    split_length=100,\n",
    "    split_respect_sentence_boundary=True\n",
    ")\n",
    "docs = preprocessor.process(all_docs)\n",
    "\n",
    "print(f\"n_files_input: {len(all_docs)}\\nn_docs_output: {len(docs)}\")"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils -   Converting data/preprocessing_tutorial/heavy_metal.docx\n",
      "01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils -   Converting data/preprocessing_tutorial/bert.pdf\n",
      "01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils -   Converting data/preprocessing_tutorial/classics.txt\n",
      "[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n"
     ]
    },
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "n_files_input: 3\n",
      "n_docs_output: 150\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## About us\n",
    "\n",
    "This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
    "\n",
    "We bring NLP to the industry via open source!  \n",
    "Our focus: Industry specific language models & large scale QA systems.  \n",
    "  \n",
    "Some of our other work: \n",
    "- [German BERT](https://deepset.ai/german-bert)\n",
    "- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
    "- [FARM](https://github.com/deepset-ai/FARM)\n",
    "\n",
    "Get in touch:\n",
    "[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
    "\n",
    "By the way: [we're hiring!](https://www.deepset.ai/jobs)\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}