"[](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial16_Document_Classifier_at_Index_Time.ipynb)\n",
"With DocumentClassifier it's possible to automatically enrich your documents with categories, sentiments, topics or whatever metadata you like. This metadata could be used for efficient filtering or further processing. Say you have some categories your users typically filter on. If the documents are tagged manually with these categories, you could automate this process by training a model. Or you can leverage the full power and flexibility of zero shot classification. All you need to do is pass your categories to the classifier, no labels required. This tutorial shows how to integrate it in your indexing pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DocumentClassifier adds the classification result (label and score) to Document's meta property.\n",
"Hence, we can use it to classify documents at index time. \\\n",
"The result can be accessed at query time: for example by applying a filter for \"classification.label\"."
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let's start by installing Haystack\n",
"\n",
"# Install the latest release of Haystack in your own environment\n",
"We can enrich the document metadata at index time using any transformers document classifier model. While traditional classification models are trained to predict one of a few \"hard-coded\" classes and required a dedicated training dataset, zero-shot classification is super flexible and you can easily switch the classes the model should predict on the fly. Just supply them via the labels param.\n",
"Here we use a zero shot model that is supposed to classify our documents in 'music', 'natural language processing' and 'history'. Feel free to change them for whatever you like to classify. \\\n",
"These classes can later on be accessed at query time."
"Inferencing Samples: 0%| | 0/1 [00:00<?, ? Batches/s]/home/tstad/git/haystack/haystack/modeling/model/prediction_head.py:455: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').\n",
"## Voilà! Ask a question while filtering for \"music\"-only documents\n",
"prediction = pipe.run(\n",
" query=\"What is heavy metal?\", params={\"Retriever\": {\"top_k\": 10, \"filters\": {\"classification.label\": [\"music\"]}}, \"Reader\": {\"top_k\": 5}}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{ 'answers': [ Answer(answer='thick, massive sound', type='extractive', score=0.5388454645872116, context=',[6] heavy metal bands developed a thick, massive sound, characterized', offsets_in_document=[Span(start=35, end=55)], offsets_in_context=[Span(start=35, end=55)], document_id='b69a8816c2c8d782dceb412b80a4bf6e', meta={'_split_id': 5, 'classification': {'sequence': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.926899254322052, 0.0447617806494236, 0.0283389650285244], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='a genre', type='extractive', score=0.399858795106411, context='Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', offsets_in_document=[Span(start=46, end=53)], offsets_in_context=[Span(start=46, end=53)], document_id='9903d23737f3d05a9d9ee170703dc245', meta={'_split_id': 0, 'classification': {'sequence': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8191022872924805, 0.11593689769506454, 0.06496082246303558], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='more accessible forms', type='extractive', score=0.09895830601453781, context='Several American bands modified heavy metal into more accessible forms', offsets_in_document=[Span(start=49, end=70)], offsets_in_context=[Span(start=49, end=70)], document_id='5a9cefc4732e4b2f97529a79231345ec', meta={'_split_id': 13, 'classification': {'sequence': 'Several American bands modified heavy metal into more accessible forms', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.9277411699295044, 0.036878395825624466, 0.035380441695451736], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='roots', type='extractive', score=0.017235582694411278, context='States.[5] With roots in , , and ,[6] heavy metal', offsets_in_document=[Span(start=16, end=21)], offsets_in_context=[Span(start=16, end=21)], document_id='7aa5f844f509c36ee6cf6a8d605ca452', meta={'_split_id': 4, 'classification': {'sequence': 'States.[5] With roots in , , and ,[6] heavy metal', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.661548376083374, 0.20618689060211182, 0.13226467370986938], 'label': 'music'}, 'name': 'heavy_metal.docx'}),\n",
" Answer(answer='heavy metal fans', type='extractive', score=0.014702926389873028, context='decade, heavy metal fans became known as \"\" or \"\".\\nDuring', offsets_in_document=[Span(start=8, end=24)], offsets_in_context=[Span(start=8, end=24)], document_id='ac5cc61f3e646d87d6549a7bb80b8b4a', meta={'_split_id': 26, 'classification': {'sequence': 'decade, heavy metal fans became known as \"\" or \"\".\\nDuring', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.5980438590049744, 0.32206231355667114, 0.0798937976360321], 'label': 'music'}, 'name': 'heavy_metal.docx'})],\n",
" 'documents': [ {'content': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'content_type': 'text', 'score': 0.9135275466034989, 'meta': {'_split_id': 0, 'classification': {'sequence': 'Heavy metal\\n\\nHeavy metal (or simply metal) is a genre of', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8191022872924805, 0.11593689769506454, 0.06496082246303558], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': '9903d23737f3d05a9d9ee170703dc245'},\n",
" {'content': 'States.[5] With roots in , , and ,[6] heavy metal', 'content_type': 'text', 'score': 0.8256961596331841, 'meta': {'_split_id': 4, 'classification': {'sequence': 'States.[5] With roots in , , and ,[6] heavy metal', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.661548376083374, 0.20618689060211182, 0.13226467370986938], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': '7aa5f844f509c36ee6cf6a8d605ca452'},\n",
" {'content': 'decade, heavy metal fans became known as \"\" or \"\".\\nDuring', 'content_type': 'text', 'score': 0.8256961596331841, 'meta': {'_split_id': 26, 'classification': {'sequence': 'decade, heavy metal fans became known as \"\" or \"\".\\nDuring', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.5980438590049744, 0.32206231355667114, 0.0798937976360321], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': 'ac5cc61f3e646d87d6549a7bb80b8b4a'},\n",
" {'content': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'content_type': 'text', 'score': 0.8172186978337235, 'meta': {'_split_id': 5, 'classification': {'sequence': ',[6] heavy metal bands developed a thick, massive sound, characterized', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.926899254322052, 0.0447617806494236, 0.0283389650285244], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': 'b69a8816c2c8d782dceb412b80a4bf6e'},\n",
" {'content': 'Several American bands modified heavy metal into more accessible forms', 'content_type': 'text', 'score': 0.8172186978337235, 'meta': {'_split_id': 13, 'classification': {'sequence': 'Several American bands modified heavy metal into more accessible forms', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.9277411699295044, 0.036878395825624466, 0.035380441695451736], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': '5a9cefc4732e4b2f97529a79231345ec'},\n",
" {'content': 'similar vein. By the end of the decade, heavy metal', 'content_type': 'text', 'score': 0.8172186978337235, 'meta': {'_split_id': 25, 'classification': {'sequence': 'similar vein. By the end of the decade, heavy metal', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.613710343837738, 0.2949920892715454, 0.09129763394594193], 'label': 'music'}, 'name': 'heavy_metal.docx'}, 'embedding': None, 'id': 'f99c648859137f46860d2e14a7524012'},\n",
" {'content': '> snull + , where the threshold is', 'content_type': 'text', 'score': 0.5794093707835908, 'meta': {'_split_id': 513, 'classification': {'sequence': '> snull + , where the threshold is', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.7594349384307861, 0.1394801139831543, 0.10108496993780136], 'label': 'music'}, 'name': 'bert.pdf'}, 'embedding': None, 'id': 'dea394d0bc270f451ea72caeb31f8fe0'},\n",
" {'content': 'They are sampled such that the combined length is ', 'content_type': 'text', 'score': 0.5673102196284736, 'meta': {'_split_id': 964, 'classification': {'sequence': 'They are sampled such that the combined length is ', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.6959176659584045, 0.20470784604549408, 0.0993744507431984], 'label': 'music'}, 'name': 'bert.pdf'}, 'embedding': None, 'id': '3792202c1568920f7353682800f6d3f5'},\n",
" {'content': 'Classics or classical studies is the study of classical antiquity,', 'content_type': 'text', 'score': 0.564837188549428, 'meta': {'_split_id': 0, 'classification': {'sequence': 'Classics or classical studies is the study of classical antiquity,', 'labels': ['music', 'natural language processing', 'history'], 'scores': [0.3462620675563812, 0.33706134557724, 0.3166765868663788], 'label': 'music'}, 'name': 'classics.txt'}, 'embedding': None, 'id': '5f06721d4e5ddd207e8de318274a89b6'},\n",
" {'content': 'Orders: Doric, Ionic, and Corinthian. The Parthenon is still the', 'content_type': 'text', 'score': 0.564837188549428, 'meta': {'_split_id': 272, 'classification': {'sequence': 'Orders: Doric, Ionic, and Corinthian. The Parthenon is still the', 'labels': ['music', 'history', 'natural language processing'], 'scores': [0.8087852001190186, 0.09828957170248032, 0.09292517602443695], 'label': 'music'}, 'name': 'classics.txt'}, 'embedding': None, 'id': 'eb79ba3545ad1c1fa01a32f0b70a7455'}],\n",