"[Open in Colab](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial14_Query_Classifier.ipynb)\n",
"One of the great benefits of using state-of-the-art NLP models like those available in Haystack is that it allows users to state their queries as *plain natural language questions*: rather than trying to come up with just the right set of keywords to find the answer to their question, users can simply ask their question in much the same way that they would ask it of a (very knowledgeable!) person.\n",
"But just because users *can* ask their questions in \"plain English\" (or \"plain German\", etc.), that doesn't mean they always *will*. For instance, a user might input a few keywords rather than a complete question because they don't understand the pipeline's full capabilities, or because they are accustomed to keyword search. While a standard Haystack pipeline might handle such queries with reasonable accuracy, we might still prefer, for reasons such as computational cost, that our pipeline be sensitive to the type of query it receives, so that it behaves differently when a user inputs, say, a collection of keywords instead of a question.\n",
"For this reason, Haystack comes with built-in capabilities to distinguish between three types of queries: **keyword queries**, **interrogative queries**, and **statement queries**, described below.\n",
"1. **Keyword queries** can be thought of more or less as lists of words, such as \"Alaska cruises summer\". While the meanings of individual words may matter in a keyword query, the linguistic connections *between* words do not. Hence, in a keyword query the order of words is largely irrelevant: \"Alaska cruises summer\", \"summer Alaska cruises\", and \"summer cruises Alaska\" are functionally the same.\n",
"2. **Interrogative queries** (or **question queries**) are queries phrased as natural language questions, such as \"Who was the father of Eddard Stark?\". Unlike with keyword queries, word order very much matters here: \"Who was the father of Eddard Stark?\" and \"Who was Eddard Stark the father of?\" are very different questions, despite containing exactly the same words. (Note that while we often write questions with question marks, Haystack can recognize interrogative queries without such a dead giveaway!)\n",
"3. **Statement queries** are just declarative sentences, such as \"Daenerys loved Jon\". These are like interrogative queries in that word order matters—again, \"Daenerys loved Jon\" and \"Jon loved Daenerys\" mean very different things—but they are statements instead of questions.\n",
"In this tutorial you will learn how to use **query classifiers** to branch your Haystack pipeline based on the type of query it receives. Haystack comes with two out-of-the-box query classification schemas, each of which routes a given query into one of two branches:\n",
"1. **Keyword vs. Question/Statement** — routes a query into one of two branches depending on whether it is a full question/statement or a collection of keywords.\n",
"2. **Question vs. Statement** — routes a natural language query into one of two branches depending on whether it is a question or a statement.\n",
"Furthermore, for each classification schema there are two types of nodes capable of performing this classification: a **`TransformersQueryClassifier`** that uses a transformer model, and an **`SklearnQueryClassifier`** that uses a more lightweight model built with scikit-learn (`sklearn`).\n",
"Before integrating query classifiers into our pipelines, let's test them out on their own and see what they actually do. First we instantiate a simple, out-of-the-box **keyword vs. question/statement** `SklearnQueryClassifier`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XhPMEqBzxA8V",
"collapsed": true
},
"outputs": [],
"source": [
"# Here we create the keyword vs question/statement query classifier\n",
"Now let's feed some queries into this query classifier. We'll test with one keyword query, one interrogative query, and one statement query. Notice that we don't use any punctuation, such as question marks; this illustrates that the classifier doesn't need punctuation in order to make the right decision."
"We can see below what our classifier does with these queries: \"Arya Stark father\" is rightly determined to be a keyword query and is sent to branch 2, while both the interrogative query \"Who was the father of Arya Stark\" and the statement query \"Lord Eddard was the father of Arya Stark\" are correctly labeled as non-keyword queries, and are thus shipped off to branch 1."
"Next we will illustrate a **question vs. statement** `SklearnQueryClassifier`. We define our classifier below; notice that this time we have to explicitly specify the model and vectorizer, since the default for an `SklearnQueryClassifier` (and a `TransformersQueryClassifier`) is keyword vs. question/statement classification."
" q_vs_s_results[\"Class\"].append(\"Question\" if result[1] == \"output_1\" else \"Statement\")\n",
"\n",
"pd.DataFrame.from_dict(q_vs_s_results)"
],
"metadata": {
"id": "1ZULHEBVmqq2"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"And as we see, the question \"Who was the father of Arya Stark\" is sent to branch 1, while the statement \"Lord Eddard was the father of Arya Stark\" is sent to branch 2, so we can have our pipeline treat statements and questions differently."
],
"metadata": {
"id": "Fk2kpvQR6Fa0"
}
},
{
"cell_type": "markdown",
"source": [
"### Using Query Classifiers in a Pipeline\n",
"\n",
"Now let's see how we can use query classifiers in a question-answering (QA) pipeline. We begin by starting Elasticsearch:"
],
"metadata": {
"id": "eEwDIq9KXXke"
}
},
{
"cell_type": "code",
"source": [
"# In Colab / No Docker environments: Start Elasticsearch from source\n",
"#### Pipelines with Keyword vs. Question/Statement Classification\n",
"\n",
"Our first illustration will be a simple retriever-reader QA pipeline, but the choice of which retriever we use will depend on the type of query received: **keyword** queries will use a sparse **`BM25Retriever`**, while **question/statement** queries will use the more accurate but also more computationally expensive **`EmbeddingRetriever`**.\n",
"Now we define our pipeline. As promised, the question/statement branch `output_1` from the query classifier is fed into an `EmbeddingRetriever`, while the keyword branch `output_2` from the same classifier is fed into a `BM25Retriever`. Both of these retrievers are then fed into our reader. Our pipeline can thus be thought of as having something of a diamond shape: all queries are sent into the classifier, which splits those queries into two different retrievers, and those retrievers feed their outputs to the same reader."
"Below we can see some results from this choice in branching structure: the keyword query \"arya stark father\" and the question query \"Who is the father of Arya Stark?\" generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries."
"The above example uses an `SklearnQueryClassifier`, but of course we can do precisely the same thing with a `TransformersQueryClassifier`. This is illustrated below, where we have constructed the same diamond-shaped pipeline."
"Above we saw a potential use for keyword vs. question/statement classification: we might choose to use a less resource-intensive retriever for keyword queries than for question/statement queries. But what about question vs. statement classification?\n",
"One potential use is to run the reader only on question queries, while every query still goes through the retriever. In other words, our pipeline will be a **retriever-only pipeline for statement queries** (given the statement \"Arya Stark was the daughter of a Lord\", all we will get back are the most relevant documents), but it will be a **retriever-reader pipeline for question queries**.\n",
"\n",
"To make things more concrete, our pipeline will start with a retriever, which is then fed into a `TransformersQueryClassifier` that is set to do question vs. statement classification. Note that this means we need to explicitly choose the model, since as mentioned previously a default `TransformersQueryClassifier` performs keyword vs. question/statement classification. The classifier's first branch, which handles question queries, will then be sent to the reader, while the second branch will not be connected to any other nodes. As a result, the last node of the pipeline depends on the type of query: questions go all the way through the reader, while statements only go through the retriever. This pipeline is illustrated below:"
"And below we see the results of this pipeline: with a question query like \"Who is the father of Arya Stark?\" we get back answers returned by a reader, but with a statement query like \"Arya Stark was the daughter of a Lord\" we just get back documents returned by a retriever."