Tutorial for DocumentClassifier at Index Time (#1697)

* basic example of document classifier in preprocessing logic

* add batch_size to TransformersDocumentClassifier

* complete tutorial16

* Add latest docstring and tutorial changes

* fix missing batch_size

* add notebook

* test for batch_size use added

* add tutorial 16 to headers.py

* Add latest docstring and tutorial changes

* make DocumentClassifier indexing pipeline ready

* Add latest docstring and tutorial changes

* flexibility improvements for DocumentClassifier in Pipelines

* Add latest docstring and tutorial changes

* fix index time usage

* remove query from documentclassifier tests

* improve classification_field resolving + minor fixes

* Add latest docstring and tutorial changes

* tutorial 16 extended with zero shot and pipelines

* Add latest docstring and tutorial changes

* install graphviz in notebook

* Add latest docstring and tutorial changes

* remove convert_to_dicts

* Add latest docstring and tutorial changes

* Fix typo

* Add latest docstring and tutorial changes

* remove retriever from indexing pipeline

* Add latest docstring and tutorial changes

* fix save_to_yaml when using FileTypeClassifier

* emphasize the impact with zero shot classification

* Add latest docstring and tutorial changes

* adjust use_gpu to boolean in test

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>
tstadel 2021-11-09 18:43:00 +01:00 committed by GitHub
parent 91cafb49bb
commit 14515a861b
13 changed files with 1219 additions and 24 deletions

@ -34,9 +34,12 @@ This node classifies documents and adds the output from the classification step
The meta field of the document is a dictionary with the following format:
``'meta': {'name': '450_Baelor.txt', 'classification': {'label': 'neutral', 'probability': 0.9997646, ...} }``
Classification is run on the document's content field by default. If you want it to run on another field,
set `classification_field` to one of the document's meta fields.
With this document_classifier, you can directly get predictions via predict().
**Usage example:**
**Usage example at query time:**
```python
| ...
| retriever = ElasticsearchRetriever(document_store=document_store)
@ -55,11 +58,27 @@ With this document_classifier, you can directly get predictions via predict()
| res["documents"][0].to_dict()["meta"]["classification"]["label"]
```
**Usage example at index time:**
```python
| ...
| converter = TextConverter()
| preprocessor = PreProcessor()
| document_store = ElasticsearchDocumentStore()
| document_classifier = TransformersDocumentClassifier(model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
| batch_size=16)
| p = Pipeline()
| p.add_node(component=converter, name="TextConverter", inputs=["File"])
| p.add_node(component=preprocessor, name="Preprocessor", inputs=["TextConverter"])
| p.add_node(component=document_classifier, name="DocumentClassifier", inputs=["Preprocessor"])
| p.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])
| p.run(file_paths=file_paths)
```
<a name="transformers.TransformersDocumentClassifier.__init__"></a>
#### \_\_init\_\_
```python
| __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = 'text-classification', labels: Optional[List[str]] = None)
| __init__(model_name_or_path: str = "bhadresh-savani/distilbert-base-uncased-emotion", model_version: Optional[str] = None, tokenizer: Optional[str] = None, use_gpu: bool = True, return_all_scores: bool = False, task: str = 'text-classification', labels: Optional[List[str]] = None, batch_size: int = -1, classification_field: Optional[str] = None)
```
Load a text classification model from Transformers.
@ -88,6 +107,8 @@ See https://huggingface.co/models for full list of available models.
["positive", "negative"] otherwise None. Given a LABEL, the sequence fed to the model is "<cls> sequence to
classify <sep> This example is LABEL . <sep>" and the model predicts whether that sequence is a contradiction
or an entailment.
- `batch_size`: Number of Documents to be processed at once. Defaults to -1, meaning all Documents are processed in a single batch.
- `classification_field`: Name of the Document's meta field to be used for classification. If left unset, `Document.content` is used by default.
<a name="transformers.TransformersDocumentClassifier.predict"></a>
#### predict
@ -96,7 +117,8 @@ or an entailment.
| predict(documents: List[Document]) -> List[Document]
```
Returns documents containing classification result in meta field
Returns documents containing classification result in meta field.
Documents are updated in place.
**Arguments**:

@ -2,7 +2,7 @@
title: "Tutorial 15"
metaTitle: "TableQA Tutorial"
metaDescription: ""
slug: "/docs/tutorial16"
slug: "/docs/tutorial15"
date: "2021-10-28"
id: "tutorial15md"
--->

@ -0,0 +1,253 @@
<!---
title: "Tutorial 16"
metaTitle: "DocumentClassifier at Index Time Tutorial"
metaDescription: ""
slug: "/docs/tutorial16"
date: "2021-11-05"
id: "tutorial16md"
--->
# Extending your Metadata using DocumentClassifiers at Index Time
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial16_Document_Classifier_at_Index_Time.ipynb)
With the DocumentClassifier it's possible to automatically enrich your documents with categories, sentiments, topics, or whatever metadata you like. This metadata can be used for efficient filtering or further processing. Say you have some categories your users typically filter on. If the documents are tagged manually with these categories, you could automate this process by training a model. Or you can leverage the full power and flexibility of zero-shot classification: all you need to do is pass your categories to the classifier, no labels required. This tutorial shows how to integrate it in your indexing pipeline.

The DocumentClassifier adds the classification result (label and score) to each Document's meta property.
Hence, we can use it to classify documents at index time. \
The result can be accessed at query time, for example by applying a filter on "classification.label".

This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline.
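As a quick preview, such a filter is just a dictionary keyed on the meta field. A minimal sketch, assuming `retriever` is an already initialized Haystack retriever:

```python
# Hedged sketch: restrict retrieval to documents the classifier labeled "music"
filters = {"classification.label": ["music"]}
docs = retriever.retrieve(query="What is heavy metal?", filters=filters, top_k=10)
```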
```python
# Let's start by installing Haystack
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack
# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git
!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin
# Install pygraphviz
!apt install libgraphviz-dev
!pip install pygraphviz
# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.
```
```python
# Here are the imports we need
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
from haystack.schema import Document
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers
```
```python
# This fetches some sample files to work with
doc_dir = "data/preprocessing_tutorial"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
```
## Read and preprocess documents
```python
# note that you can also use the document classifier before applying the PreProcessor, e.g. before splitting your documents
all_docs = convert_files_to_dicts(dir_path=doc_dir)
preprocessor_sliding_window = PreProcessor(
split_overlap=3,
split_length=10,
split_respect_sentence_boundary=False
)
docs_sliding_window = preprocessor_sliding_window.process(all_docs)
```
## Apply DocumentClassifier
We can enrich the document metadata at index time using any transformers document classifier model. While traditional classification models are trained to predict one of a few "hard-coded" classes and require a dedicated training dataset, zero-shot classification is super flexible and you can easily switch the classes the model should predict on the fly. Just supply them via the labels param.
Here we use a zero-shot model that classifies our documents into 'music', 'natural language processing' and 'history'. Feel free to change these to whatever you would like to classify. \
These classes can later on be accessed at query time.
```python
doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
task="zero-shot-classification",
labels=["music", "natural language processing", "history"],
batch_size=16
)
```
```python
# we can also use any other transformers classification model besides zero-shot classification
# doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'
# doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model, batch_size=16, use_gpu=False)
```
```python
# we could also specify a different field we want to run the classification on
# doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
# task="zero-shot-classification",
# labels=["music", "natural language processing", "history"],
# batch_size=16, use_gpu=False,
# classification_field="description")
```
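To make `classification_field` concrete, here is a hypothetical Document whose meta carries the text to classify (the "description" field is purely illustrative):

```python
# Hypothetical illustration: with classification_field="description",
# the classifier reads doc.meta["description"] instead of doc.content
from haystack.schema import Document

doc = Document(
    content="full article text ...",
    meta={"description": "A short summary about jazz music"},
)
```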
```python
# convert the dicts to Documents; use a field_map in Document.from_dict if the classification should run on custom content fields
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]
```
```python
# classify using the GPU; batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)
```
```python
# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())
```
## Indexing
```python
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2
import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
stdout=PIPE, stderr=STDOUT,
preexec_fn=lambda: os.setuid(1) # as daemon
)
# wait until ES has started
! sleep 30
```
```python
# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
```
```python
# Now, let's write the docs to our DB.
document_store.delete_all_documents()
document_store.write_documents(classified_docs)
```
```python
# check if indexed docs contain classification results
test_doc = document_store.get_all_documents()[0]
print(f'document {test_doc.id} with content \n\n{test_doc.content}\n\nhas label {test_doc.meta["classification"]["label"]}')
```
## Querying the data
All we have to do to filter for one of our classes is to set a filter on "classification.label".
```python
# Initialize QA-Pipeline
from haystack.pipelines import ExtractiveQAPipeline
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)
```
```python
## Voilà! Ask a question while filtering for "music"-only documents
prediction = pipe.run(
query="What is heavy metal?", params={"Retriever": {"top_k": 10, "filters": {"classification.label": ["music"]}}, "Reader": {"top_k": 5}}
)
```
```python
print_answers(prediction, details="high")
```
## Wrapping it up in an indexing pipeline
```python
from pathlib import Path
from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter
```
```python
file_type_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
docx_converter = DocxToTextConverter()
indexing_pipeline_with_classification = Pipeline()
indexing_pipeline_with_classification.add_node(component=file_type_classifier, name="FileTypeClassifier", inputs=["File"])
indexing_pipeline_with_classification.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
indexing_pipeline_with_classification.add_node(component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"])
indexing_pipeline_with_classification.add_node(component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"])
indexing_pipeline_with_classification.add_node(component=preprocessor_sliding_window, name="Preprocessor", inputs=["TextConverter", "PdfConverter", "DocxConverter"])
indexing_pipeline_with_classification.add_node(component=doc_classifier, name="DocumentClassifier", inputs=["Preprocessor"])
indexing_pipeline_with_classification.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])
indexing_pipeline_with_classification.draw("index_time_document_classifier.png")
document_store.delete_documents()
txt_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.txt']
pdf_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.pdf']
docx_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.docx']
indexing_pipeline_with_classification.run(file_paths=txt_files)
indexing_pipeline_with_classification.run(file_paths=pdf_files)
indexing_pipeline_with_classification.run(file_paths=docx_files)
document_store.get_all_documents()[0]
```
```python
# we can store this pipeline and use it from the REST API
indexing_pipeline_with_classification.save_to_yaml("indexing_pipeline_with_classification.yaml")
```
## About us
This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany
We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.
Some of our other work:
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)
Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
By the way: [we're hiring!](https://www.deepset.ai/jobs)

@ -115,8 +115,16 @@ id: "tutorial14md"
title: "Tutorial 15"
metaTitle: "TableQA Tutorial"
metaDescription: ""
slug: "/docs/tutorial16"
slug: "/docs/tutorial15"
date: "2021-10-28"
id: "tutorial15md"
--->""",
16: """<!---
title: "Tutorial 16"
metaTitle: "DocumentClassifier at Index Time Tutorial"
metaDescription: ""
slug: "/docs/tutorial16"
date: "2021-11-05"
id: "tutorial16md"
--->"""
}

@ -1,4 +1,4 @@
from typing import List
from typing import List, Union
import logging
from abc import abstractmethod
@ -21,16 +21,22 @@ class BaseDocumentClassifier(BaseComponent):
def predict(self, documents: List[Document]):
pass
def run(self, query: str, documents: List[Document]): # type: ignore
def run(self, documents: Union[List[dict], List[Document]], root_node: str): # type: ignore
self.query_count += 1
if documents:
predict = self.timing(self.predict, "query_time")
predict = self.timing(self.predict, "query_time")
documents = [Document.from_dict(doc) if isinstance(doc, dict) else doc for doc in documents]
results = predict(documents=documents)
else:
results = []
document_ids = [doc.id for doc in results]
logger.debug(f"Retrieved documents with IDs: {document_ids}")
# convert back to dicts if we are in an indexing pipeline
if root_node == "File":
results = [doc.to_dict() for doc in results]
output = {"documents": results}
return output, "output_1"
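To make the control flow above concrete, here is a minimal standalone sketch of the dict/Document round-trip that `run()` performs (an illustration under the same assumptions, not the node itself):

```python
from haystack.schema import Document

raw_docs = [{"content": "That's good. I like it.", "meta": {"name": "0"}}]

# normalize the input: indexing pipelines pass dicts, query pipelines pass Documents
documents = [Document.from_dict(d) if isinstance(d, dict) else d for d in raw_docs]

# ... predict() would enrich documents[i].meta["classification"] here ...

# convert back to dicts if the pipeline's root node is "File", i.e. an indexing pipeline
root_node = "File"
results = [doc.to_dict() for doc in documents] if root_node == "File" else documents
```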

@ -20,9 +20,12 @@ class TransformersDocumentClassifier(BaseDocumentClassifier):
The meta field of the document is a dictionary with the following format:
``'meta': {'name': '450_Baelor.txt', 'classification': {'label': 'neutral', 'probability': 0.9997646, ...} }``
Classification is run on the document's content field by default. If you want it to run on another field,
set `classification_field` to one of the document's meta fields.
With this document_classifier, you can directly get predictions via predict().
**Usage example:**
**Usage example at query time:**
```python
| ...
| retriever = ElasticsearchRetriever(document_store=document_store)
@ -40,6 +43,22 @@ class TransformersDocumentClassifier(BaseDocumentClassifier):
| # or access the predicted class label directly
| res["documents"][0].to_dict()["meta"]["classification"]["label"]
```
**Usage example at index time:**
```python
| ...
| converter = TextConverter()
| preprocessor = PreProcessor()
| document_store = ElasticsearchDocumentStore()
| document_classifier = TransformersDocumentClassifier(model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
| batch_size=16)
| p = Pipeline()
| p.add_node(component=converter, name="TextConverter", inputs=["File"])
| p.add_node(component=preprocessor, name="Preprocessor", inputs=["TextConverter"])
| p.add_node(component=document_classifier, name="DocumentClassifier", inputs=["Preprocessor"])
| p.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])
| p.run(file_paths=file_paths)
```
"""
def __init__(
self,
@ -49,7 +68,9 @@ class TransformersDocumentClassifier(BaseDocumentClassifier):
use_gpu: bool = True,
return_all_scores: bool = False,
task: str = 'text-classification',
labels: Optional[List[str]] = None
labels: Optional[List[str]] = None,
batch_size: int = -1,
classification_field: Optional[str] = None
):
"""
Load a text classification model from Transformers.
@ -76,11 +97,14 @@ class TransformersDocumentClassifier(BaseDocumentClassifier):
["positive", "negative"] otherwise None. Given a LABEL, the sequence fed to the model is "<cls> sequence to
classify <sep> This example is LABEL . <sep>" and the model predicts whether that sequence is a contradiction
or an entailment.
:param batch_size: Number of Documents to be processed at once. Defaults to -1, meaning all Documents are processed in a single batch.
:param classification_field: Name of the Document's meta field to be used for classification. If left unset, Document.content is used by default.
"""
# save init parameters to enable export of component config as YAML
self.set_config(
model_name_or_path=model_name_or_path, model_version=model_version, tokenizer=tokenizer,
use_gpu=use_gpu, return_all_scores=return_all_scores, labels=labels, task=task
use_gpu=use_gpu, return_all_scores=return_all_scores, labels=labels, task=task, batch_size=batch_size,
classification_field=classification_field
)
if labels and task == 'text-classification':
logger.warning(f'Provided labels {labels} will be ignored for task text-classification. Set task to '
@ -98,27 +122,35 @@ class TransformersDocumentClassifier(BaseDocumentClassifier):
self.return_all_scores = return_all_scores
self.labels = labels
self.task = task
self.batch_size = batch_size
self.classification_field = classification_field
def predict(self, documents: List[Document]) -> List[Document]:
"""
Returns documents containing classification result in meta field
Returns documents containing classification result in meta field.
Documents are updated in place.
:param documents: List of Document to classify
:return: List of Document enriched with meta information
"""
texts = [doc.content for doc in documents]
texts = [doc.content if self.classification_field is None else doc.meta[self.classification_field] for doc in documents]
batches = self.get_batches(texts, batch_size=self.batch_size)
if self.task == 'zero-shot-classification':
predictions = self.model(texts, candidate_labels=self.labels, truncation=True)
batched_predictions = [self.model(batch, candidate_labels=self.labels, truncation=True) for batch in batches]
elif self.task == 'text-classification':
predictions = self.model(texts, return_all_scores=self.return_all_scores, truncation=True)
classified_docs: List[Document] = []
batched_predictions = [self.model(batch, return_all_scores=self.return_all_scores, truncation=True) for batch in batches]
predictions = [pred for batched_prediction in batched_predictions for pred in batched_prediction]
for prediction, doc in zip(predictions, documents):
cur_doc = doc
cur_doc.meta["classification"] = prediction
if self.task == 'zero-shot-classification':
cur_doc.meta["classification"]["label"] = cur_doc.meta["classification"]["labels"][0]
classified_docs.append(cur_doc)
prediction["label"] = prediction["labels"][0]
doc.meta["classification"] = prediction
return classified_docs
return documents
def get_batches(self, items, batch_size):
if batch_size == -1:
yield items
return
for index in range(0, len(items), batch_size):
yield items[index:index + batch_size]
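For clarity, here is a standalone copy of the batching helper together with a tiny usage example (pure Python, no Haystack dependency):

```python
def get_batches(items, batch_size):
    # batch_size == -1 disables batching and yields all items at once
    if batch_size == -1:
        yield items
        return
    for index in range(0, len(items), batch_size):
        yield items[index:index + batch_size]

texts = ["a", "b", "c", "d", "e"]
print(list(get_batches(texts, batch_size=2)))   # [['a', 'b'], ['c', 'd'], ['e']]
print(list(get_batches(texts, batch_size=-1)))  # [['a', 'b', 'c', 'd', 'e']]
```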

@ -9,6 +9,9 @@ class FileTypeClassifier(BaseComponent):
"""
outgoing_edges = 5
def __init__(self):
self.set_config()
def _get_files_extension(self, file_paths: list) -> set:
"""
Return the file extensions

@ -343,18 +343,34 @@ def ranker():
def document_classifier():
return TransformersDocumentClassifier(
model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
use_gpu=-1
use_gpu=False
)
@pytest.fixture(scope="module")
def zero_shot_document_classifier():
return TransformersDocumentClassifier(
model_name_or_path="cross-encoder/nli-distilroberta-base",
use_gpu=-1,
use_gpu=False,
task="zero-shot-classification",
labels=["negative", "positive"]
)
@pytest.fixture(scope="module")
def batched_document_classifier():
return TransformersDocumentClassifier(
model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
use_gpu=False,
batch_size=16
)
@pytest.fixture(scope="module")
def indexing_document_classifier():
return TransformersDocumentClassifier(
model_name_or_path="bhadresh-savani/distilbert-base-uncased-emotion",
use_gpu=False,
batch_size=16,
classification_field="class_field"
)
# TODO Fix bug in test_no_answer_output when using
# @pytest.fixture(params=["farm", "transformers"])

@ -24,6 +24,15 @@ components:
type: PreProcessor
params:
clean_whitespace: true
- name: IndexTimeDocumentClassifier
type: TransformersDocumentClassifier
params:
batch_size: 16
use_gpu: false
- name: QueryTimeDocumentClassifier
type: TransformersDocumentClassifier
params:
use_gpu: false
pipelines:
@ -44,6 +53,16 @@ pipelines:
- name: Reader
inputs: [ ESRetriever ]
- name: query_pipeline_with_document_classifier
type: Pipeline
nodes:
- name: ESRetriever
inputs: [Query]
- name: QueryTimeDocumentClassifier
inputs: [ESRetriever]
- name: Reader
inputs: [QueryTimeDocumentClassifier]
- name: indexing_pipeline
type: Pipeline
nodes:
@ -67,3 +86,17 @@ pipelines:
inputs: [Preprocessor]
- name: DocumentStore
inputs: [ESRetriever]
- name: indexing_pipeline_with_classifier
type: Pipeline
nodes:
- name: PDFConverter
inputs: [File]
- name: Preprocessor
inputs: [PDFConverter]
- name: IndexTimeDocumentClassifier
inputs: [Preprocessor]
- name: ESRetriever
inputs: [IndexTimeDocumentClassifier]
- name: DocumentStore
inputs: [ESRetriever]
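As a sketch of how such a YAML definition is consumed (the file paths are assumptions based on the test fixtures below):

```python
from pathlib import Path
from haystack.pipelines import Pipeline

# load the indexing pipeline declared in the YAML above and run it on a sample PDF
pipeline = Pipeline.load_from_yaml(
    Path("samples/pipeline/test_pipeline.yaml"),
    pipeline_name="indexing_pipeline_with_classifier",
)
pipeline.run(file_paths=[Path("samples/pdf/sample_pdf_1.pdf")])
```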

@ -46,3 +46,69 @@ def test_zero_shot_document_classifier(zero_shot_document_classifier):
expected_labels = ["positive", "negative"]
for i, doc in enumerate(results):
assert doc.to_dict()["meta"]["classification"]["label"] == expected_labels[i]
@pytest.mark.slow
def test_document_classifier_batch_size(batched_document_classifier):
assert isinstance(batched_document_classifier, BaseDocumentClassifier)
docs = [
Document(
content="""That's good. I like it."""*700, # extra long text to check truncation
meta={"name": "0"},
id="1",
),
Document(
content="""That's bad. I don't like it.""",
meta={"name": "1"},
id="2",
),
]
results = batched_document_classifier.predict(documents=docs)
expected_labels = ["joy", "sadness"]
for i, doc in enumerate(results):
assert doc.to_dict()["meta"]["classification"]["label"] == expected_labels[i]
@pytest.mark.slow
def test_document_classifier_as_index_node(indexing_document_classifier):
assert isinstance(indexing_document_classifier, BaseDocumentClassifier)
docs = [
{"content":"""That's good. I like it."""*700, # extra long text to check truncation
"meta":{"name": "0"},
"id":"1",
"class_field": "That's bad."
},
{"content":"""That's bad. I like it.""",
"meta":{"name": "1"},
"id":"2",
"class_field": "That's good."
},
]
output, output_name = indexing_document_classifier.run(documents=docs, root_node="File")
expected_labels = ["sadness", "joy"]
for i, doc in enumerate(output["documents"]):
assert doc["meta"]["classification"]["label"] == expected_labels[i]
@pytest.mark.slow
def test_document_classifier_as_query_node(document_classifier):
assert isinstance(document_classifier, BaseDocumentClassifier)
docs = [
Document(
content="""That's good. I like it."""*700, # extra long text to check truncation
meta={"name": "0"},
id="1",
),
Document(
content="""That's bad. I don't like it.""",
meta={"name": "1"},
id="2",
),
]
output, output_name = document_classifier.run(documents=docs, root_node="Query")
expected_labels = ["joy", "sadness"]
for i, doc in enumerate(output["documents"]):
assert doc.to_dict()["meta"]["classification"]["label"] == expected_labels[i]

@ -700,3 +700,50 @@ def test_document_search_pipeline(retriever, document_store):
assert isinstance(document, Document)
assert isinstance(document.id, str)
assert isinstance(document.content, str)
@pytest.mark.elasticsearch
@pytest.mark.parametrize("document_store", ["elasticsearch"], indirect=True)
def test_indexing_pipeline_with_classifier(document_store):
# test correct load of indexing pipeline from yaml
pipeline = Pipeline.load_from_yaml(
Path(__file__).parent/"samples"/"pipeline"/"test_pipeline.yaml", pipeline_name="indexing_pipeline_with_classifier"
)
pipeline.run(
file_paths=Path(__file__).parent/"samples"/"pdf"/"sample_pdf_1.pdf"
)
# test correct load of query pipeline from yaml
pipeline = Pipeline.load_from_yaml(
Path(__file__).parent/"samples"/"pipeline"/"test_pipeline.yaml", pipeline_name="query_pipeline"
)
prediction = pipeline.run(
query="Who made the PDF specification?", params={"ESRetriever": {"top_k": 10}, "Reader": {"top_k": 3}}
)
assert prediction["query"] == "Who made the PDF specification?"
assert prediction["answers"][0].answer == "Adobe Systems"
assert prediction["answers"][0].meta["classification"]["label"] == "joy"
assert "_debug" not in prediction.keys()
@pytest.mark.elasticsearch
@pytest.mark.parametrize("document_store", ["elasticsearch"], indirect=True)
def test_query_pipeline_with_document_classifier(document_store):
# test correct load of indexing pipeline from yaml
pipeline = Pipeline.load_from_yaml(
Path(__file__).parent/"samples"/"pipeline"/"test_pipeline.yaml", pipeline_name="indexing_pipeline"
)
pipeline.run(
file_paths=Path(__file__).parent/"samples"/"pdf"/"sample_pdf_1.pdf"
)
# test correct load of query pipeline from yaml
pipeline = Pipeline.load_from_yaml(
Path(__file__).parent/"samples"/"pipeline"/"test_pipeline.yaml", pipeline_name="query_pipeline_with_document_classifier"
)
prediction = pipeline.run(
query="Who made the PDF specification?", params={"ESRetriever": {"top_k": 10}, "Reader": {"top_k": 3}}
)
assert prediction["query"] == "Who made the PDF specification?"
assert prediction["answers"][0].answer == "Adobe Systems"
assert prediction["answers"][0].meta["classification"]["label"] == "joy"
assert "_debug" not in prediction.keys()

@ -0,0 +1,534 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extending your Metadata using DocumentClassifiers at Index Time\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb)\n",
"\n",
"With DocumentClassifier it's possible to automatically enrich your documents with categories, sentiments, topics or whatever metadata you like. This metadata could be used for efficient filtering or further processing. Say you have some categories your users typically filter on. If the documents are tagged manually with these categories, you could automate this process by training a model. Or you can leverage the full power and flexibility of zero shot classification. All you need to do is pass your categories to the classifier, no labels required. This tutorial shows how to integrate it in your indexing pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DocumentClassifier adds the classification result (label and score) to Document's meta property.\n",
"Hence, we can use it to classify documents at index time. \\\n",
"The result can be accessed at query time: for example by applying a filter for \"classification.label\"."
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let's start by installing Haystack\n",
"\n",
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install grpcio-tools==1.34.1\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz\n",
"!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin\n",
"\n",
"# Install pygraphviz\n",
"!apt install libgraphviz-dev\n",
"!pip install pygraphviz\n",
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Here are the imports we need\n",
"from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore\n",
"from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever\n",
"from haystack.schema import Document\n",
"from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This fetches some sample files to work with\n",
"\n",
"doc_dir = \"data/preprocessing_tutorial\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Read and preprocess documents\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"pdftotext version 0.86.1\n",
"Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org\n",
"Copyright 1996-2011 Glyph & Cog, LLC\n",
"100%|██████████| 3/3 [00:00<00:00, 372.17docs/s]\n"
]
}
],
"source": [
"# note that you can also use the document classifier before applying the PreProcessor, e.g. before splitting your documents\n",
"\n",
"all_docs = convert_files_to_dicts(dir_path=doc_dir)\n",
"preprocessor_sliding_window = PreProcessor(\n",
" split_overlap=3,\n",
" split_length=10,\n",
" split_respect_sentence_boundary=False\n",
")\n",
"docs_sliding_window = preprocessor_sliding_window.process(all_docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply DocumentClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can enrich the document metadata at index time using any transformers document classifier model. While traditional classification models are trained to predict one of a few \"hard-coded\" classes and required a dedicated training dataset, zero-shot classification is super flexible and you can easily switch the classes the model should predict on the fly. Just supply them via the labels param.\n",
"Here we use a zero shot model that is supposed to classify our documents in 'music', 'natural language processing' and 'history'. Feel free to change them for whatever you like to classify. \\\n",
"These classes can later on be accessed at query time."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"doc_classifier = TransformersDocumentClassifier(model_name_or_path=\"cross-encoder/nli-distilroberta-base\",\n",
" task=\"zero-shot-classification\",\n",
" labels=[\"music\", \"natural language processing\", \"history\"],\n",
" batch_size=16\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# we can also use any other transformers model besides zero shot classification\n",
"\n",
"# doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'\n",
"# doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model, batch_size=16, use_gpu=-1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# we could also specifiy a different field we want to run the classification on\n",
"\n",
"# doc_classifier = TransformersDocumentClassifier(model_name_or_path=\"cross-encoder/nli-distilroberta-base\",\n",
"# task=\"zero-shot-classification\",\n",
"# labels=[\"music\", \"natural language processing\", \"history\"],\n",
"# batch_size=16, use_gpu=-1,\n",
"# classification_field=\"description\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# convert to Document using a fieldmap for custom content fields the classification should run on\n",
"docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# classify using gpu, batch_size makes sure we do not run out of memory\n",
"classified_docs = doc_classifier.predict(docs_to_classify)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'content': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'content_type': 'text', 'score': None, 'meta': {'name': 'heavy_metal.docx', '_split_id': 0, 'classification': {'sequence': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8191022872924805, 0.11593689769506454, 0.06496082246303558], 'label': 'music'}}, 'embedding': None, 'id': '9903d23737f3d05a9d9ee170703dc245'}\n"
]
}
],
"source": [
"# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.\n",
"print(classified_docs[0].to_dict())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q\n",
"! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz\n",
"! chown -R daemon:daemon elasticsearch-7.9.2\n",
"\n",
"import os\n",
"from subprocess import Popen, PIPE, STDOUT\n",
"es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],\n",
" stdout=PIPE, stderr=STDOUT,\n",
" preexec_fn=lambda: os.setuid(1) # as daemon\n",
" )\n",
"# wait until ES has started\n",
"! sleep 30"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Connect to Elasticsearch\n",
"document_store = ElasticsearchDocumentStore(host=\"localhost\", username=\"\", password=\"\", index=\"document\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"DEPRECATION WARNINGS: \n",
" 1. delete_all_documents() method is deprecated, please use delete_documents method\n",
" For more details, please refer to the issue: https://github.com/deepset-ai/haystack/issues/1045\n",
" \n"
]
}
],
"source": [
"# Now, let's write the docs to our DB.\n",
"document_store.delete_all_documents()\n",
"document_store.write_documents(classified_docs)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"document 9903d23737f3d05a9d9ee170703dc245 with content \n",
"\n",
"Heavy metal\n",
"\n",
"Heavy metal (or simply metal) is a genre of\n",
"\n",
"has label music\n"
]
}
],
"source": [
"# check if indexed docs contain classification results\n",
"test_doc = document_store.get_all_documents()[0]\n",
"print(f'document {test_doc.id} with content \\n\\n{test_doc.content}\\n\\nhas label {test_doc.meta[\"classification\"][\"label\"]}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All we have to do to filter for one of our classes is to set a filter on \"classification.label\"."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Initialize QA-Pipeline\n",
"from haystack.pipelines import ExtractiveQAPipeline\n",
"retriever = ElasticsearchRetriever(document_store=document_store)\n",
"reader = FARMReader(model_name_or_path=\"deepset/roberta-base-squad2\", use_gpu=True)\n",
"pipe = ExtractiveQAPipeline(reader, retriever) "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Inferencing Samples: 0%| | 0/1 [00:00<?, ? Batches/s]/home/tstad/git/haystack/haystack/modeling/model/prediction_head.py:455: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').\n",
" start_indices = flat_sorted_indices // max_seq_len\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.97 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.45 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.46 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.46 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.44 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.43 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.48 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.47 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.48 Batches/s]\n",
"Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 2.46 Batches/s]\n"
]
}
],
"source": [
"## Voilà! Ask a question while filtering for \"music\"-only documents\n",
"prediction = pipe.run(\n",
" query=\"What is heavy metal?\", params={\"Retriever\": {\"top_k\": 10, \"filters\": {\"classification.label\": [\"music\"]}}, \"Reader\": {\"top_k\": 5}}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{ 'answers': [ Answer(answer='thick, massive sound', type='extractive', score=0.5388454645872116, context=',[6] heavy metal bands developed a thick, massive sound, characterized', offsets_in_document=[Span(start=35, end=55)], offsets_in_context=[Span(start=35, end=55)], document_id='b69a8816c2c8d782dceb412b80a4bf6e', meta={'_split_id': 5, 'classification': {'sequence': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.926899254322052, 0.0447617806494236, 0.0283389650285244], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='a genre', type='extractive', score=0.399858795106411, context='Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', offsets_in_document=[Span(start=46, end=53)], offsets_in_context=[Span(start=46, end=53)], document_id='9903d23737f3d05a9d9ee170703dc245', meta={'_split_id': 0, 'classification': {'sequence': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8191022872924805, 0.11593689769506454, 0.06496082246303558], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='more accessible forms', type='extractive', score=0.09895830601453781, context='Several American bands modified heavy metal into more accessible forms', offsets_in_document=[Span(start=49, end=70)], offsets_in_context=[Span(start=49, end=70)], document_id='5a9cefc4732e4b2f97529a79231345ec', meta={'_split_id': 13, 'classification': {'sequence': 'Several American bands modified heavy metal into more accessible forms', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.9277411699295044, 0.036878395825624466, 0.035380441695451736], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='roots', type='extractive', score=0.017235582694411278, context='States.[5] With roots in , , and ,[6] heavy metal', offsets_in_document=[Span(start=16, end=21)], offsets_in_context=[Span(start=16, end=21)], document_id='7aa5f844f509c36ee6cf6a8d605ca452', meta={'_split_id': 4, 'classification': {'sequence': 'States.[5] With roots in , , and ,[6] heavy metal', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.661548376083374, 0.20618689060211182, 0.13226467370986938], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='heavy metal fans', type='extractive', score=0.014702926389873028, context='decade, heavy metal fans became known as \"\" or \"\".\\nDuring', offsets_in_document=[Span(start=8, end=24)], offsets_in_context=[Span(start=8, end=24)], document_id='ac5cc61f3e646d87d6549a7bb80b8b4a', meta={'_split_id': 26, 'classification': {'sequence': 'decade, heavy metal fans became known as \"\" or \"\".\\nDuring', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.5980438590049744, 0.32206231355667114, 0.0798937976360321], 'label': 'music'}, 'name': 'heavy_metal.docx'})],\n",
" 'documents': [ {'content': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'content_type': 'text', 'score': 0.9135275466034989, 'meta': {'_split_id': 0, 'classification': {'sequence': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8191022872924805, 0.11593689769506454, 0.06496082246303558], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': '9903d23737f3d05a9d9ee170703dc245'},\n",
" {'content': 'States.[5] With roots in , , and ,[6] heavy metal', 'content_type': 'text', 'score': 0.8256961596331841, 'meta': {'_split_id': 4, 'classification': {'sequence': 'States.[5] With roots in , , and ,[6] heavy metal', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.661548376083374, 0.20618689060211182, 0.13226467370986938], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': '7aa5f844f509c36ee6cf6a8d605ca452'},\n",
" {'content': 'decade, heavy metal fans became known as \"\" or \"\".\\nDuring', 'content_type': 'text', 'score': 0.8256961596331841, 'meta': {'_split_id': 26, 'classification': {'sequence': 'decade, heavy metal fans became known as \"\" or \"\".\\nDuring', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.5980438590049744, 0.32206231355667114, 0.0798937976360321], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': 'ac5cc61f3e646d87d6549a7bb80b8b4a'},\n",
" {'content': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'content_type': 'text', 'score': 0.8172186978337235, 'meta': {'_split_id': 5, 'classification': {'sequence': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.926899254322052, 0.0447617806494236, 0.0283389650285244], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': 'b69a8816c2c8d782dceb412b80a4bf6e'},\n",
" {'content': 'Several American bands modified heavy metal into more accessible forms', 'content_type': 'text', 'score': 0.8172186978337235, 'meta': {'_split_id': 13, 'classification': {'sequence': 'Several American bands modified heavy metal into more accessible forms', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.9277411699295044, 0.036878395825624466, 0.035380441695451736], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': '5a9cefc4732e4b2f97529a79231345ec'},\n",
" {'content': 'similar vein. By the end of the decade, heavy metal', 'content_type': 'text', 'score': 0.8172186978337235, 'meta': {'_split_id': 25, 'classification': {'sequence': 'similar vein. By the end of the decade, heavy metal', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.613710343837738, 0.2949920892715454, 0.09129763394594193], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': 'f99c648859137f46860d2e14a7524012'},\n",
" {'content': '> snull + , where the threshold is', 'content_type': 'text', 'score': 0.5794093707835908, 'meta': {'_split_id': 513, 'classification': {'sequence': '> snull + , where the threshold is', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.7594349384307861, 0.1394801139831543, 0.10108496993780136], 'label': 'music'}, 'name': 'bert.pdf'}, 'embedding': None, 'id': 'dea394d0bc270f451ea72caeb31f8fe0'},\n",
" {'content': 'They are sampled such that the combined length is ', 'content_type': 'text', 'score': 0.5673102196284736, 'meta': {'_split_id': 964, 'classification': {'sequence': 'They are sampled such that the combined length is ', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.6959176659584045, 0.20470784604549408, 0.0993744507431984], 'label': 'music'}, 'name': 'bert.pdf'}, 'embedding': None, 'id': '3792202c1568920f7353682800f6d3f5'},\n",
" {'content': 'Classics or classical studies is the study of classical antiquity,', 'content_type': 'text', 'score': 0.564837188549428, 'meta': {'_split_id': 0, 'classification': {'sequence': 'Classics or classical studies is the study of classical antiquity,', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.3462620675563812, 0.33706134557724, 0.3166765868663788], 'label': 'music'}, 'name': 'classics.txt'}, 'embedding': None, 'id': '5f06721d4e5ddd207e8de318274a89b6'},\n",
" {'content': 'Orders: Doric, Ionic, and Corinthian. The Parthenon is still the', 'content_type': 'text', 'score': 0.564837188549428, 'meta': {'_split_id': 272, 'classification': {'sequence': 'Orders: Doric, Ionic, and Corinthian. The Parthenon is still the', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8087852001190186, 0.09828957170248032, 0.09292517602443695], 'label': 'music'}, 'name': 'classics.txt'}, 'embedding': None, 'id': 'eb79ba3545ad1c1fa01a32f0b70a7455'}],\n",
" 'no_ans_gap': 3.725033760070801,\n",
" 'node_id': 'Reader',\n",
" 'params': { 'Reader': {'top_k': 5},\n",
" 'Retriever': { 'filters': { 'classification.label': [ 'music']},\n",
" 'top_k': 10}},\n",
" 'query': 'What is heavy metal?',\n",
" 'root_node': 'Query'}\n"
]
}
],
"source": [
"print_answers(prediction, details=\"high\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrapping it up in an indexing pipeline"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from haystack.pipelines import Pipeline\n",
"from haystack.nodes import TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"file_type_classifier = FileTypeClassifier()\n",
"text_converter = TextConverter()\n",
"pdf_converter = PDFToTextConverter()\n",
"docx_converter = DocxToTextConverter()\n",
"\n",
"indexing_pipeline_with_classification = Pipeline()\n",
"indexing_pipeline_with_classification.add_node(component=file_type_classifier, name=\"FileTypeClassifier\", inputs=[\"File\"])\n",
"indexing_pipeline_with_classification.add_node(component=text_converter, name=\"TextConverter\", inputs=[\"FileTypeClassifier.output_1\"])\n",
"indexing_pipeline_with_classification.add_node(component=pdf_converter, name=\"PdfConverter\", inputs=[\"FileTypeClassifier.output_2\"])\n",
"indexing_pipeline_with_classification.add_node(component=docx_converter, name=\"DocxConverter\", inputs=[\"FileTypeClassifier.output_4\"])\n",
"indexing_pipeline_with_classification.add_node(component=preprocessor_sliding_window, name=\"Preprocessor\", inputs=[\"TextConverter\", \"PdfConverter\", \"DocxConverter\"])\n",
"indexing_pipeline_with_classification.add_node(component=doc_classifier, name=\"DocumentClassifier\", inputs=[\"Preprocessor\"])\n",
"indexing_pipeline_with_classification.add_node(component=document_store, name=\"DocumentStore\", inputs=[\"DocumentClassifier\"])\n",
"indexing_pipeline_with_classification.draw(\"index_time_document_classifier.png\")\n",
"\n",
"document_store.delete_documents()\n",
"txt_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.txt']\n",
"pdf_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.pdf']\n",
"docx_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.docx']\n",
"indexing_pipeline_with_classification.run(file_paths=txt_files)\n",
"indexing_pipeline_with_classification.run(file_paths=pdf_files)\n",
"indexing_pipeline_with_classification.run(file_paths=docx_files)\n",
"\n",
"document_store.get_all_documents()[0]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# we can store this pipeline and use it from the REST-API\n",
"indexing_pipeline_with_classification.save_to_yaml(\"indexing_pipeline_with_classification.yaml\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://www.deepset.ai/jobs)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,175 @@
# # Extending your Metadata using DocumentClassifiers at Index Time
#
# With DocumentClassifier it's possible to automatically enrich your documents
# with categories, sentiments, topics or whatever metadata you like.
# This metadata could be used for efficient filtering or further processing.
# Say you have some categories your users typically filter on.
# If the documents are tagged manually with these categories, you could automate
# this process by training a model. Or you can leverage the full power and flexibility
# of zero shot classification. All you need to do is pass your categories to the classifier,
# no labels required.
# This tutorial shows how to integrate it in your indexing pipeline.
# DocumentClassifier adds the classification result (label and score) to Document's meta property.
# Hence, we can use it to classify documents at index time. \
# The result can be accessed at query time: for example by applying a filter for "classification.label".
# This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline.
# Here are the imports we need
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
from haystack.schema import Document
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers, launch_es
def tutorial16_document_classifier_at_index_time():
# This fetches some sample files to work with
doc_dir = "data/preprocessing_tutorial"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# ## Read and preprocess documents
# note that you can also use the document classifier before applying the PreProcessor, e.g. before splitting your documents
all_docs = convert_files_to_dicts(dir_path=doc_dir)
preprocessor_sliding_window = PreProcessor(
split_overlap=3,
split_length=10,
split_respect_sentence_boundary=False
)
docs_sliding_window = preprocessor_sliding_window.process(all_docs)
# ## Apply DocumentClassifier
# We can enrich the document metadata at index time using any transformers document classifier model.
# Here we use a zero-shot model that classifies our documents into 'music', 'natural language processing' and 'history'.
# While traditional classification models are trained to predict one of a few "hard-coded" classes and require a dedicated training dataset,
# zero-shot classification is super flexible and you can easily switch the classes the model should predict on the fly.
# Just supply them via the labels param.
# Feel free to change these to whatever you would like to classify.
# These classes can later on be accessed at query time.
doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
task="zero-shot-classification",
labels=["music", "natural language processing", "history"],
batch_size=16
)
# we can also use any other transformers classification model besides zero-shot classification
# doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'
# doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model, batch_size=16)
# we could also specify a different field we want to run the classification on
# doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
# task="zero-shot-classification",
# labels=["music", "natural language processing", "history"],
# batch_size=16,
# classification_field="description")
# convert the dicts to Documents; use a field_map in Document.from_dict if the classification should run on custom content fields
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]
# classify using the GPU; batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)
# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())
# ## Indexing
launch_es()
# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
# Now, let's write the docs to our DB.
document_store.delete_all_documents()
document_store.write_documents(classified_docs)
# check if indexed docs contain classification results
test_doc = document_store.get_all_documents()[0]
print(f'document {test_doc.id} with content \n\n{test_doc.content}\n\nhas label {test_doc.meta["classification"]["label"]}')
# ## Querying the data
# All we have to do to filter for one of our classes is to set a filter on "classification.label".
# Initialize QA-Pipeline
from haystack.pipelines import ExtractiveQAPipeline
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)
## Voilà! Ask a question while filtering for "music"-only documents
prediction = pipe.run(
query="What is heavy metal?", params={"Retriever": {"top_k": 10, "filters": {"classification.label": ["music"]}}, "Reader": {"top_k": 5}}
)
print_answers(prediction, details="high")
# ## Wrapping it up in an indexing pipeline
from pathlib import Path
from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter
file_type_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
docx_converter = DocxToTextConverter()
indexing_pipeline_with_classification = Pipeline()
indexing_pipeline_with_classification.add_node(component=file_type_classifier, name="FileTypeClassifier", inputs=["File"])
indexing_pipeline_with_classification.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
indexing_pipeline_with_classification.add_node(component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"])
indexing_pipeline_with_classification.add_node(component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"])
indexing_pipeline_with_classification.add_node(component=preprocessor_sliding_window, name="Preprocessor", inputs=["TextConverter", "PdfConverter", "DocxConverter"])
indexing_pipeline_with_classification.add_node(component=doc_classifier, name="DocumentClassifier", inputs=["Preprocessor"])
indexing_pipeline_with_classification.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])
indexing_pipeline_with_classification.draw("index_time_document_classifier.png")
document_store.delete_documents()
txt_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.txt']
pdf_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.pdf']
docx_files = [f for f in Path(doc_dir).iterdir() if f.suffix == '.docx']
indexing_pipeline_with_classification.run(file_paths=txt_files)
indexing_pipeline_with_classification.run(file_paths=pdf_files)
indexing_pipeline_with_classification.run(file_paths=docx_files)
document_store.get_all_documents()[0]
# we can store this pipeline and use it from the REST API
indexing_pipeline_with_classification.save_to_yaml("indexing_pipeline_with_classification.yaml")
if __name__ == "__main__":
tutorial16_document_classifier_at_index_time()
# ## About us
#
# This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany
#
# We bring NLP to the industry via open source!
# Our focus: Industry specific language models & large scale QA systems.
#
# Some of our other work:
# - [German BERT](https://deepset.ai/german-bert)
# - [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
# - [FARM](https://github.com/deepset-ai/FARM)
#
# Get in touch:
# [Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
#
# By the way: [we're hiring!](https://www.deepset.ai/jobs)
#