haystack/docs/_src/api/api/query_classifier.md

# Module base

## BaseQueryClassifier

```python
class BaseQueryClassifier(BaseComponent)
```

Abstract class for Query Classifiers.
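Concrete classifiers subclass this and follow the standard decision-node contract: `run` returns a (possibly empty) results dict plus the name of the outgoing edge the query should follow. A hypothetical minimal sketch (`MyQueryClassifier` and its heuristic are illustrative, not part of Haystack, and the `run_batch` shown is a simplification):

```python
from haystack.nodes.base import BaseComponent

class MyQueryClassifier(BaseComponent):
    # Decision nodes declare how many outgoing edges they can route to
    outgoing_edges = 2

    def run(self, query: str):
        # Toy heuristic: route questions to output_1, everything else to output_2
        edge = "output_1" if query.strip().endswith("?") else "output_2"
        return {}, edge

    def run_batch(self, queries, batch_size=None):
        # Sketch only: a real decision node routes each query in the
        # batch independently rather than sending them all down one edge
        return {"queries": queries}, "output_1"
```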

# Module sklearn

## SklearnQueryClassifier

```python
class SklearnQueryClassifier(BaseQueryClassifier)
```

A node to classify an incoming query into one of two categories using a lightweight sklearn model. Depending on the result, the query flows to a different branch in your pipeline, and further processing can be customized. You define this by connecting the rest of the pipeline to either `output_1` or `output_2` of this node.

Example:

```python
from haystack.pipelines import Pipeline
from haystack.nodes import SklearnQueryClassifier

pipe = Pipeline()
pipe.add_node(component=SklearnQueryClassifier(), name="QueryClassifier", inputs=["Query"])
# elastic_retriever and dpr_retriever are assumed to be initialized retriever nodes
pipe.add_node(component=elastic_retriever, name="ElasticRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])

# Keyword queries will use the ElasticRetriever
pipe.run("kubernetes aws")

# Semantic queries (questions, statements, sentences ...) will leverage the DPR retriever
pipe.run("How to manage kubernetes on aws")
```

Models:

Pass your own sklearn binary classification model or use one of the following pretrained ones:

1. Keywords vs. Questions/Statements (Default)
   - `query_classifier`: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle
   - `query_vectorizer`: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle
   - output_1 => question/statement
   - output_2 => keyword query

2. Questions vs. Statements
   - output_1 => question
   - output_2 => statement

See also the tutorial on pipelines.
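For instance, instantiating with the default pretrained pickles versus your own serialized model (a minimal sketch; the local paths are hypothetical placeholders):

```python
from haystack.nodes import SklearnQueryClassifier

# Default: keyword vs. question/statement model from the URLs above
classifier = SklearnQueryClassifier()

# Your own sklearn binary classifier plus TF-IDF vectorizer;
# the "my_models/..." paths are hypothetical placeholders
custom_classifier = SklearnQueryClassifier(
    model_name_or_path="my_models/query_classifier.pickle",
    vectorizer_name_or_path="my_models/query_vectorizer.pickle",
)
```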

#### SklearnQueryClassifier.\_\_init\_\_

```python
def __init__(
    model_name_or_path: Union[str, Any] = "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle",
    vectorizer_name_or_path: Union[str, Any] = "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle",
    batch_size: Optional[int] = None,
)
```

Arguments:

- `model_name_or_path`: Gradient-boosting-based binary classifier that distinguishes keyword queries from statement/question queries (or statements from questions).
- `vectorizer_name_or_path`: An n-gram-based TF-IDF vectorizer for extracting features from the query.
- `batch_size`: Number of queries to process at a time.

# Module transformers

## TransformersQueryClassifier

```python
class TransformersQueryClassifier(BaseQueryClassifier)
```

A node to classify an incoming query into one of two categories using a (small) BERT transformer model. Depending on the result, the query flows to a different branch in your pipeline, and further processing can be customized. You define this by connecting the rest of the pipeline to either `output_1` or `output_2` of this node.

Example:

```python
from haystack.pipelines import Pipeline
from haystack.nodes import TransformersQueryClassifier

pipe = Pipeline()
pipe.add_node(component=TransformersQueryClassifier(), name="QueryClassifier", inputs=["Query"])
# elastic_retriever and dpr_retriever are assumed to be initialized retriever nodes
pipe.add_node(component=elastic_retriever, name="ElasticRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])

# Keyword queries will use the ElasticRetriever
pipe.run("kubernetes aws")

# Semantic queries (questions, statements, sentences ...) will leverage the DPR retriever
pipe.run("How to manage kubernetes on aws")
```

Models:

Pass your own transformer binary classification model (from a local file or the Hugging Face Hub), or use one of the following pretrained models hosted on Hugging Face:

1. Keywords vs. Questions/Statements (Default)
   - `model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection"`
   - output_1 => question/statement
   - output_2 => keyword query
   - Model card: https://huggingface.co/shahrukhx01/bert-mini-finetune-question-detection

2. Questions vs. Statements
   - `model_name_or_path="shahrukhx01/question-vs-statement-classifier"`
   - output_1 => question
   - output_2 => statement
   - Model card: https://huggingface.co/shahrukhx01/question-vs-statement-classifier

See also the tutorial on pipelines.
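For example, to separate questions from plain statements, swap in the second pretrained model listed above; a minimal sketch:

```python
from haystack.nodes import TransformersQueryClassifier

# "Questions vs. Statements" model: questions exit via output_1,
# statements via output_2
question_classifier = TransformersQueryClassifier(
    model_name_or_path="shahrukhx01/question-vs-statement-classifier"
)
```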

#### TransformersQueryClassifier.\_\_init\_\_

```python
def __init__(
    model_name_or_path: Union[Path, str] = "shahrukhx01/bert-mini-finetune-question-detection",
    use_gpu: bool = True,
    batch_size: Optional[int] = None,
)
```

Arguments:

- `model_name_or_path`: Transformer-based, fine-tuned mini BERT model for query classification.
- `use_gpu`: Whether to use GPU (if available).
- `batch_size`: Number of queries to process at a time.
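A short instantiation sketch covering both optional parameters (the values are illustrative):

```python
from haystack.nodes import TransformersQueryClassifier

# Force CPU inference and classify queries in batches of 32
classifier = TransformersQueryClassifier(
    model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection",
    use_gpu=False,
    batch_size=32,
)
```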