Create Preprocessing Tutorial (#706)

* WIP: First version of preprocessing tutorial

* stride renamed overlap, ipynb and py files created

* rename split_stride in test

* Update preprocessor api documentation

* define order for markdown files

* define order of modules in api docs

* Add colab links

* Incorporate review feedback

Co-authored-by: PiffPaffM <markuspaff.mp@gmail.com>
Branden Chan 2021-01-06 15:54:05 +01:00 committed by GitHub
parent 5db73d4107
commit bb8aba18e0
13 changed files with 718 additions and 55 deletions


@ -1,3 +1,22 @@
<a name="base"></a>
# Module base
<a name="base.BasePreProcessor"></a>
## BasePreProcessor Objects
```python
class BasePreProcessor()
```
<a name="base.BasePreProcessor.process"></a>
#### process
```python
| process(document: dict) -> List[dict]
```
Perform document cleaning and splitting. Takes a single document as input and returns a list of documents.
<a name="preprocessor"></a>
# Module preprocessor
@ -12,7 +31,7 @@ class PreProcessor(BasePreProcessor)
#### \_\_init\_\_
```python
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_stride: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
| __init__(clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_overlap: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True)
```
**Arguments**:
@ -26,10 +45,12 @@ or similar.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
- `split_stride`: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
- `split_overlap`: Word overlap between two adjacent documents after a split.
Setting this to a positive number essentially enables the sliding window approach.
For example, if split_by -> `word`,
split_length -> 5 & split_overlap -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
Set the value to None to ensure there is no overlap among the documents after splitting.
- `split_respect_sentence_boundary`: Whether to split in partial sentences if split_by -> `word`. If set
to True, the individual split will always have complete sentences &
the number of words will be <= split_length.
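For illustration, the sliding-window behaviour described for `split_overlap` above could be configured roughly as follows (a minimal sketch; the input text and parameter values are invented):
```python
from haystack.preprocessor.preprocessor import PreProcessor

# Each split holds up to 5 words and shares 2 words with its neighbour.
preprocessor = PreProcessor(
    split_by="word",
    split_length=5,
    split_overlap=2,
    split_respect_sentence_boundary=False,
)
docs = preprocessor.process({"text": "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11", "meta": {"name": "example"}})
print([d["text"] for d in docs])
# expected roughly: ['w1 w2 w3 w4 w5', 'w4 w5 w6 w7 w8', 'w7 w8 w9 w10 w11']
```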
@ -52,22 +73,9 @@ and empty lines. Its exact functionality is defined by the parameters passed int
```
Perform document splitting on a single document. This method can split on different units, at different lengths,
with different strides. It can also respect sectence boundaries. Its exact functionality is defined by
with different strides. It can also respect sentence boundaries. Its exact functionality is defined by
the parameters passed into PreProcessor.__init__(). Takes a single document as input and returns a list of documents.
<a name="cleaning"></a>
# Module cleaning
<a name="cleaning.clean_wiki_text"></a>
#### clean\_wiki\_text
```python
clean_wiki_text(text: str) -> str
```
Clean Wikipedia text by removing multiple new lines, removing extremely short lines,
adding paragraph breaks, and removing empty paragraphs.
<a name="utils"></a>
# Module utils
@ -154,22 +162,16 @@ Fetch an archive (zip or tar.gz) from a url via http and extract content to an o
bool if anything got fetched
<a name="base"></a>
# Module base
<a name="cleaning"></a>
# Module cleaning
<a name="base.BasePreProcessor"></a>
## BasePreProcessor Objects
<a name="cleaning.clean_wiki_text"></a>
#### clean\_wiki\_text
```python
class BasePreProcessor()
clean_wiki_text(text: str) -> str
```
<a name="base.BasePreProcessor.process"></a>
#### process
```python
| process(document: dict) -> List[dict]
```
Perform document cleaning and splitting. Takes a single document as input and returns a list of documents.
Clean Wikipedia text by removing multiple new lines, removing extremely short lines,
adding paragraph breaks, and removing empty paragraphs.
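For reference, a minimal call might look like this (a sketch; the input string is invented):
```python
from haystack.preprocessor.cleaning import clean_wiki_text

raw_text = "==Heading==\n\n\n\nA short line\nA longer sentence about classical antiquity.\n"
print(clean_wiki_text(raw_text))
```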


@ -1,6 +1,7 @@
loaders:
- type: python
search_path: [../../../../haystack/document_store]
modules: ['base', 'elasticsearch', 'memory', 'sql', 'faiss']
ignore_when_discovered: ['__init__']
processor:
- type: filter


@ -1,6 +1,7 @@
loaders:
- type: python
search_path: [../../../../haystack/file_converter]
modules: ['base', 'txt', 'docx', 'tika', 'pdf']
ignore_when_discovered: ['__init__']
processor:
- type: filter


@ -1,6 +1,7 @@
loaders:
- type: python
search_path: [../../../../haystack/generator]
modules: [ 'base', 'transformers']
ignore_when_discovered: ['__init__']
processor:
- type: filter


@ -1,6 +1,7 @@
loaders:
- type: python
search_path: [../../../../haystack/preprocessor]
modules: ['base', 'preprocessor', 'utils', 'cleaning']
ignore_when_discovered: ['__init__']
processor:
- type: filter


@ -1,6 +1,7 @@
loaders:
- type: python
search_path: [../../../../haystack/reader]
modules: ['base', 'farm', 'transformers']
ignore_when_discovered: ['__init__']
processor:
- type: filter


@ -1,6 +1,7 @@
loaders:
- type: python
search_path: [../../../../haystack/retriever]
modules: ['base', 'sparse', 'dense']
ignore_when_discovered: ['__init__']
processor:
- type: filter


@ -94,12 +94,15 @@ For suggestions on how best to split your documents, see [Optimization](/docs/la
```python
doc = converter.convert(file_path=file, meta=None)
processor = PreProcessor(clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=True,
split_by="word",
split_length=200,
split_respect_sentence_boundary=True)
processor = PreProcessor(
clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=True,
split_by="word",
split_length=200,
split_respect_sentence_boundary=True,
split_overlap=0
)
docs = processor.process(doc)
```
@ -109,3 +112,5 @@ docs = processor.process(d)
* `split_by` determines what unit the document is split by: `'word'`, `'sentence'` or `'passage'`
* `split_length` sets a maximum number of `'word'`, `'sentence'` or `'passage'` units per output document
* `split_respect_sentence_boundary` ensures that document boundaries do not fall in the middle of sentences
* `split_overlap` sets the amount of overlap between two adjacent documents after a split. Setting this to a positive number essentially enables the sliding window approach, as shown in the example below.
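To enable the sliding window approach, you could extend the snippet above like this (an illustrative sketch; the overlap value is an example, not a recommendation):
```python
processor_sliding = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True
)
docs = processor_sliding.process(doc)
```
With `split_overlap=20`, each output document repeats roughly the last 20 words of its predecessor, which helps reduce the chance of an answer phrase being cut off at a document boundary.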


@ -21,7 +21,7 @@ class PreProcessor(BasePreProcessor):
clean_empty_lines: Optional[bool] = True,
split_by: Optional[str] = "word",
split_length: Optional[int] = 1000,
split_stride: Optional[int] = None,
split_overlap: Optional[int] = None,
split_respect_sentence_boundary: Optional[bool] = True,
):
"""
@ -34,10 +34,12 @@ class PreProcessor(BasePreProcessor):
:param split_by: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
:param split_length: Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by ->
"sentence", then each output document will have 10 sentences.
:param split_stride: Length of striding window over the splits. For example, if split_by -> `word`,
split_length -> 5 & split_stride -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to disable striding behaviour.
:param split_overlap: Word overlap between two adjacent documents after a split.
Setting this to a positive number essentially enables the sliding window approach.
For example, if split_by -> `word`,
split_length -> 5 & split_overlap -> 2, then the splits would be like:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to None to ensure there is no overlap among the documents after splitting.
:param split_respect_sentence_boundary: Whether to split in partial sentences if split_by -> `word`. If set
to True, the individual split will always have complete sentences &
the number of words will be <= split_length.
@ -48,7 +50,7 @@ class PreProcessor(BasePreProcessor):
self.clean_empty_lines = clean_empty_lines
self.split_by = split_by
self.split_length = split_length
self.split_stride = split_stride
self.split_overlap = split_overlap
self.split_respect_sentence_boundary = split_respect_sentence_boundary
def clean(self, document: dict) -> dict:
@ -79,7 +81,7 @@ class PreProcessor(BasePreProcessor):
def split(self, document: dict) -> List[dict]:
"""Perform document splitting on a single document. This method can split on different units, at different lengths,
with different strides. It can also respect sectence boundaries. Its exact functionality is defined by
with different strides. It can also respect sentence boundaries. Its exact functionality is defined by
the parameters passed into PreProcessor.__init__(). Takes a single document as input and returns a list of documents. """
if not self.split_by:
@ -107,12 +109,12 @@ class PreProcessor(BasePreProcessor):
if word_count + current_word_count > self.split_length:
list_splits.append(current_slice)
# Enable split_overlap with split_by='word' while respecting sentence boundaries.
if self.split_stride:
if self.split_overlap:
overlap = []
w_count = 0
for s in current_slice[::-1]:
sen_len = len(s.split(" "))
if w_count < self.split_stride:
if w_count < self.split_overlap:
overlap.append(s)
w_count += sen_len
else:
@ -139,8 +141,8 @@ class PreProcessor(BasePreProcessor):
raise NotImplementedError("PreProcessor only supports 'passage' or 'sentence' split_by options.")
# concatenate individual elements based on split_length & split_overlap
if self.split_stride:
segments = windowed(elements, n=self.split_length, step=self.split_length - self.split_stride)
if self.split_overlap:
segments = windowed(elements, n=self.split_length, step=self.split_length - self.split_overlap)
else:
segments = windowed(elements, n=self.split_length, step=self.split_length)
text_splits = []
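For intuition, the `windowed` helper (from `more_itertools`) used above steps through the elements by `split_length - split_overlap`, so adjacent segments share `split_overlap` elements. A standalone sketch with invented words:
```python
from more_itertools import windowed

words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10", "w11"]
split_length, split_overlap = 5, 2

for window in windowed(words, n=split_length, step=split_length - split_overlap):
    # windowed pads a final short window with None, so filter those out
    print([w for w in window if w is not None])

# ['w1', 'w2', 'w3', 'w4', 'w5']
# ['w4', 'w5', 'w6', 'w7', 'w8']
# ['w7', 'w8', 'w9', 'w10', 'w11']
```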


@ -21,12 +21,12 @@ in the sentence.
@pytest.mark.tika
def test_preprocess_sentence_split():
document = {"text": TEXT}
preprocessor = PreProcessor(split_length=1, split_stride=0, split_by="sentence")
preprocessor = PreProcessor(split_length=1, split_overlap=0, split_by="sentence")
documents = preprocessor.process(document)
assert len(documents) == 15
preprocessor = PreProcessor(
split_length=10, split_stride=0, split_by="sentence"
split_length=10, split_overlap=0, split_by="sentence"
)
documents = preprocessor.process(document)
assert len(documents) == 2
@ -35,11 +35,11 @@ def test_preprocess_sentence_split():
@pytest.mark.tika
def test_preprocess_word_split():
document = {"text": TEXT}
preprocessor = PreProcessor(split_length=10, split_stride=0, split_by="word", split_respect_sentence_boundary=False)
preprocessor = PreProcessor(split_length=10, split_overlap=0, split_by="word", split_respect_sentence_boundary=False)
documents = preprocessor.process(document)
assert len(documents) == 11
preprocessor = PreProcessor(split_length=15, split_stride=0, split_by="word", split_respect_sentence_boundary=True)
preprocessor = PreProcessor(split_length=15, split_overlap=0, split_by="word", split_respect_sentence_boundary=True)
documents = preprocessor.process(document)
for i,doc in enumerate(documents):
if i == 0:
@ -47,7 +47,7 @@ def test_preprocess_word_split():
assert len(doc["text"].split(" ")) <= 15 or doc["text"].startswith("This is to trick")
assert len(documents) == 8
preprocessor = PreProcessor(split_length=40, split_stride=10, split_by="word", split_respect_sentence_boundary=True)
preprocessor = PreProcessor(split_length=40, split_overlap=10, split_by="word", split_respect_sentence_boundary=True)
documents = preprocessor.process(document)
assert len(documents) == 5
@ -55,11 +55,11 @@ def test_preprocess_word_split():
@pytest.mark.tika
def test_preprocess_passage_split():
document = {"text": TEXT}
preprocessor = PreProcessor(split_length=1, split_stride=0, split_by="passage", split_respect_sentence_boundary=False)
preprocessor = PreProcessor(split_length=1, split_overlap=0, split_by="passage", split_respect_sentence_boundary=False)
documents = preprocessor.process(document)
assert len(documents) == 3
preprocessor = PreProcessor(split_length=2, split_stride=0, split_by="passage", split_respect_sentence_boundary=False)
preprocessor = PreProcessor(split_length=2, split_overlap=0, split_by="passage", split_respect_sentence_boundary=False)
documents = preprocessor.process(document)
assert len(documents) == 2


@ -5,6 +5,8 @@
"source": [
"# Generative QA with \"Retrieval-Augmented Generation\"\n",
"\n",
"EXECUTABLE VERSION: [colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial7_RAG_Generator.ipynb)\n",
"\n",
"While extractive QA highlights the span of text that answers a query,\n",
"generative QA can return a novel text answer that it has composed.\n",
"In this tutorial, you will learn how to set up a generative system using the\n",


@ -0,0 +1,504 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Preprocessing\n",
"\n",
"EXECUTABLE VERSION: [colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb)\n",
"\n",
"Haystack includes a suite of tools to extract text from different file types, normalize white space\n",
"and split text into smaller pieces to optimize retrieval.\n",
"These data preprocessing steps can have a big impact on the systems performance and effective handling of data is key to getting the most out of Haystack."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Ultimately, Haystack expects data to be provided as a list documents in the following dictionary format:\n",
"``` python\n",
"docs = [\n",
" {\n",
" 'text': DOCUMENT_TEXT_HERE,\n",
" 'meta': {'name': DOCUMENT_NAME, ...}\n",
" }, ...\n",
"]\n",
"```"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This tutorial will show you all the tools that Haystack provides to help you cast your data into this format."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 26,
"outputs": [],
"source": [
"# Let's start by installing Haystack\n",
"\n",
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"!pip install torch==1.6.0+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html\n",
"!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.02.tar.gz\n",
"!tar -xvf xpdf-tools-linux-4.02.tar.gz && sudo cp xpdf-tools-linux-4.02/bin64/pdftotext /usr/local/bin"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/06/2021 14:49:14 - INFO - faiss - Loading faiss with AVX2 support.\n",
"01/06/2021 14:49:14 - INFO - faiss - Loading faiss.\n"
]
}
],
"source": [
"# Here are the imports we need\n",
"\n",
"from haystack.file_converter.txt import TextConverter\n",
"from haystack.file_converter.pdf import PDFToTextConverter\n",
"from haystack.file_converter.docx import DocxToTextConverter\n",
"\n",
"from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http\n",
"from haystack.preprocessor.preprocessor import PreProcessor"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 29,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/05/2021 12:02:30 - INFO - haystack.preprocessor.utils - Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip to `data/preprocessing_tutorial`\n",
"100%|██████████| 595119/595119 [00:00<00:00, 5299765.39B/s]\n"
]
},
{
"data": {
"text/plain": "True"
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This fetches some sample files to work with\n",
"\n",
"doc_dir = \"data/preprocessing_tutorial\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Converters\n",
"\n",
"Haystack's converter classes are designed to help you turn files on your computer into the documents\n",
"that can be processed by the Haystack pipeline.\n",
"There are file converters for txt, pdf, docx files as well as a converter that is powered by Apache Tika."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 8,
"outputs": [],
"source": [
"# Here are some examples of how you would use file converters\n",
"\n",
"converter = TextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
"doc_txt = converter.convert(file_path=\"data/preprocessing_tutorial/classics.txt\", meta=None)\n",
"\n",
"converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
"doc_pdf = converter.convert(file_path=\"data/preprocessing_tutorial/bert.pdf\", meta=None)\n",
"\n",
"converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
"doc_docx = converter.convert(file_path=\"data/preprocessing_tutorial/heavy_metal.docx\", meta=None)\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/06/2021 14:51:06 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/heavy_metal.docx\n",
"01/06/2021 14:51:06 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/bert.pdf\n",
"01/06/2021 14:51:07 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/classics.txt\n"
]
}
],
"source": [
"# Haystack also has a convenience function that will automatically apply the right converter to each file in a directory.\n",
"\n",
"all_docs = convert_files_to_dicts(dir_path=\"data/preprocessing_tutorial\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## PreProcessor\n",
"\n",
"The PreProcessor class is designed to help you clean text and split text into sensible units.\n",
"File splitting can have a very significant impact on the system's performance and is absolutely mandatory for Dense Passage Retrieval models.\n",
"In general, we recommend you split the text from your files into small documents of around 100 words for dense retrieval methods\n",
"and no more than 10,000 words for sparse methods.\n",
"Have a look at the [Preprocessing](https://haystack.deepset.ai/docs/latest/preprocessingmd)\n",
"and [Optimization](https://haystack.deepset.ai/docs/latest/optimizationmd) pages on our website for more details."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 10,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"n_docs_input: 1\n",
"n_docs_output: 51\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"# This is a default usage of the PreProcessor.\n",
"# Here, it performs cleaning of consecutive whitespaces\n",
"# and splits a single large document into smaller documents.\n",
"# Each document is up to 1000 words long and document breaks cannot fall in the middle of sentences\n",
"# Note how the single document passed into the document gets split into 5 smaller documents\n",
"\n",
"preprocessor = PreProcessor(\n",
" clean_empty_lines=True,\n",
" clean_whitespace=True,\n",
" clean_header_footer=False,\n",
" split_by=\"word\",\n",
" split_length=100,\n",
" split_respect_sentence_boundary=True\n",
")\n",
"docs_default = preprocessor.process(doc_txt)\n",
"print(f\"n_docs_input: 1\\nn_docs_output: {len(docs_default)}\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Cleaning\n",
"\n",
"- `clean_empty_lines` will normalize 3 or more consecutive empty lines to be just a two empty lines\n",
"- `clean_whitespace` will remove any whitespace at the beginning or end of each line in the text\n",
"- `clean_header_footer` will remove any long header or footer texts that are repeated on each page"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
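{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# A minimal cleaning-only sketch: setting split_by=None disables splitting,\n",
"# so process() should only apply the cleaning steps to doc_txt from above.\n",
"\n",
"preprocessor_clean_only = PreProcessor(\n",
" clean_empty_lines=True,\n",
" clean_whitespace=True,\n",
" clean_header_footer=False,\n",
" split_by=None\n",
")\n",
"docs_clean_only = preprocessor_clean_only.process(doc_txt)\n",
"print(docs_clean_only[0][\"text\"][:200])"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},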
{
"cell_type": "markdown",
"source": [
"## Splitting\n",
"By default, the PreProcessor will respect sentence boundaries, meaning that documents will not start or end\n",
"midway through a sentence.\n",
"This will help reduce the possibility of answer phrases being split between two documents.\n",
"This feature can be turned off by setting `split_respect_sentence_boundary=False`."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 11,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RESPECTING SENTENCE BOUNDARY\n",
"End of document: \"...cornerstone of a typical elite European education.\"\n",
"\n",
"NOT RESPECTING SENTENCE BOUNDARY\n",
"End of document: \"...on. In England, for instance, Oxford and Cambridge\"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"# Not respecting sentence boundary vs respecting sentence boundary\n",
"\n",
"preprocessor_nrsb = PreProcessor(split_respect_sentence_boundary=False)\n",
"docs_nrsb = preprocessor_nrsb.process(doc_txt)\n",
"\n",
"print(\"RESPECTING SENTENCE BOUNDARY\")\n",
"end_text = docs_default[0][\"text\"][-50:]\n",
"print(\"End of document: \\\"...\" + end_text + \"\\\"\")\n",
"print()\n",
"print(\"NOT RESPECTING SENTENCE BOUNDARY\")\n",
"end_text_nrsb = docs_nrsb[0][\"text\"][-50:]\n",
"print(\"End of document: \\\"...\" + end_text_nrsb + \"\\\"\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"A commonly used strategy to split long documents, especially in the field of Question Answering,\n",
"is the sliding window approach. If `split_length=10` and `split_overlap=3`, your documents will look like this:\n",
"\n",
"- doc1 = words[0:10]\n",
"- doc2 = words[7:17]\n",
"- doc3 = words[14:24]\n",
"- ...\n",
"\n",
"You can use this strategy by following the code below."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [
"# Sliding window approach\n",
"\n",
"preprocessor_sliding_window = PreProcessor(\n",
" split_overlap=3,\n",
" split_length=10,\n",
" split_respect_sentence_boundary=False\n",
")\n",
"docs_sliding_window = preprocessor_sliding_window.process(doc_txt)\n",
"\n",
"doc1 = docs_sliding_window[0][\"text\"][:200]\n",
"doc2 = docs_sliding_window[1][\"text\"][:100]\n",
"doc3 = docs_sliding_window[2][\"text\"][:100]\n",
"\n",
"print(\"Document 1: \\\"\" + doc1 + \"...\\\"\")\n",
"print(\"Document 2: \\\"\" + doc2 + \"...\\\"\")\n",
"print(\"Document 3: \\\"\" + doc3 + \"...\\\"\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": 12,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document 1: \"Classics or classical studies is the study of classical antiquity,...\"\n",
"Document 2: \"of classical antiquity, and in the Western world traditionally refers...\"\n",
"Document 3: \"world traditionally refers to the study of Classical Greek and...\"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Bringing it all together"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 16,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/heavy_metal.docx\n",
"01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/bert.pdf\n",
"01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/classics.txt\n",
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"n_files_input: 3\n",
"n_docs_output: 150\n"
]
}
],
"source": [
"all_docs = convert_files_to_dicts(dir_path=\"data/preprocessing_tutorial\")\n",
"preprocessor = PreProcessor(\n",
" clean_empty_lines=True,\n",
" clean_whitespace=True,\n",
" clean_header_footer=False,\n",
" split_by=\"word\",\n",
" split_length=100,\n",
" split_respect_sentence_boundary=True\n",
")\n",
"nested_docs = [preprocessor.process(d) for d in all_docs]\n",
"docs = [d for x in nested_docs for d in x]\n",
"\n",
"print(f\"n_files_input: {len(all_docs)}\\nn_docs_output: {len(docs)}\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}


@ -0,0 +1,142 @@
"""
Preprocessing
Haystack includes a suite of tools to extract text from different file types, normalize white space
and split text into smaller pieces to optimize retrieval.
These data preprocessing steps can have a big impact on the system's performance, and effective handling of data is key to getting the most out of Haystack.
Ultimately, Haystack pipelines expect data to be provided as a list of documents in the following dictionary format:
docs = [
{
'text': DOCUMENT_TEXT_HERE,
'meta': {'name': DOCUMENT_NAME, ...}
}, ...
]
This tutorial will show you all the tools that Haystack provides to help you cast your data into the right format.
"""
# Here are the imports we need
from haystack.file_converter.txt import TextConverter
from haystack.file_converter.pdf import PDFToTextConverter
from haystack.file_converter.docx import DocxToTextConverter
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.preprocessor.preprocessor import PreProcessor
# This fetches some sample files to work with
doc_dir = "data/preprocessing_tutorial"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
"""
## Converters
Haystack's converter classes are designed to help you turn files on your computer into documents
that can be processed by the Haystack pipeline.
There are file converters for txt, pdf, docx files as well as a converter that is powered by Apache Tika.
"""
# Here are some examples of how you would use file converters
converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path="data/preprocessing_tutorial/bert.pdf", meta=None)
converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_docx = converter.convert(file_path="data/preprocessing_tutorial/heavy_metal.docx", meta=None)
# Haystack also has a convenience function that will automatically apply the right converter to each file in a directory.
all_docs = convert_files_to_dicts(dir_path="data/preprocessing_tutorial")
"""
## PreProcessor
The PreProcessor class is designed to help you clean text and split text into sensible units.
File splitting can have a very significant impact on the system's performance.
Have a look at the [Preprocessing](https://haystack.deepset.ai/docs/latest/preprocessingmd)
and [Optimization](https://haystack.deepset.ai/docs/latest/optimizationmd) pages on our website for more details.
"""
# This is a default usage of the PreProcessor.
# Here, it performs cleaning of consecutive whitespaces
# and splits a single large document into smaller documents.
# Each document is up to 1000 words long and document breaks cannot fall in the middle of sentences
# Note how the single document passed into the PreProcessor gets split into 5 smaller documents
preprocessor = PreProcessor(
clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=False,
split_by="word",
split_length=1000,
split_respect_sentence_boundary=True
)
docs_default = preprocessor.process(doc_txt)
print(f"n_docs_input: 1\nn_docs_output: {len(docs_default)}")
"""
## Cleaning
- `clean_empty_lines` will normalize 3 or more consecutive empty lines to just two empty lines
- `clean_whitespace` will remove any whitespace at the beginning or end of each line in the text
- `clean_header_footer` will remove any long header or footer texts that are repeated on each page (see the cleaning-only sketch below)
## Splitting
By default, the PreProcessor will respect sentence boundaries, meaning that documents will not start or end
midway through a sentence.
This will help reduce the possibility of answer phrases being split between two documents.
This feature can be turned off by setting `split_respect_sentence_boundary=False`.
"""
# Not respecting sentence boundary vs respecting sentence boundary
preprocessor_nrsb = PreProcessor(split_respect_sentence_boundary=False)
docs_nrsb = preprocessor_nrsb.process(doc_txt)
print("RESPECTING SENTENCE BOUNDARY")
end_text = docs_default[0]["text"][-50:]
print("End of document: \"..." + end_text + "\"")
print()
print("NOT RESPECTING SENTENCE BOUNDARY")
end_text_nrsb = docs_nrsb[0]["text"][-50:]
print("End of document: \"..." + end_text_nrsb + "\"")
"""
A commonly used strategy to split long documents, especially in the field of Question Answering,
is the sliding window approach. If `split_length=10` and `split_overlap=3`, your documents will look like this:
- doc1 = words[0:10]
- doc2 = words[7:17]
- doc3 = words[14:24]
- ...
You can use this strategy by following the code below.
"""
# Sliding window approach
preprocessor_sliding_window = PreProcessor(
split_overlap=3,
split_length=10,
split_respect_sentence_boundary=False
)
docs_sliding_window = preprocessor_sliding_window.process(doc_txt)
doc1 = docs_sliding_window[0]["text"][:200]
doc2 = docs_sliding_window[1]["text"][:100]
doc3 = docs_sliding_window[2]["text"][:100]
print("Document 1: \"" + doc1 + "...\"")
print("Document 2: \"" + doc2 + "...\"")
print("Document 3: \"" + doc3 + "...\"")