mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-24 17:30:38 +00:00

Add run_batch method to all nodes and Pipeline to allow batch querying (#2481 )

* Add run_batch methods for batch querying

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Fix mypy

* Fix linter

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix mypy

* Fix rest api test

* Update Documentation & Code Style

* Add Doc strings

* Update Documentation & Code Style

* Add batch_size as attribute to nodes supporting batching

* Adapt error messages

* Adapt type of filters in retrievers

* Revert change about truncation_warning in summarizer

* Unify multiple_doc_lists tests

* Use smaller models in extractor tests

* Add return types to JoinAnswers and RouteDocuments

* Adapt return statements in reader's run_batch method

* Allow list of filters

* Adapt error messages

* Update Documentation & Code Style

* Fix tests

* Fix mypy

* Adapt print_questions

* Remove disabling warning about too many public methods

* Add flag for pylint to disable warning about too many public methods in pipelines/base.py and document_stores/base.py

* Add type check

* Update Documentation & Code Style

* Adapt tutorial 11

* Update Documentation & Code Style

* Add query_batch method for DCDocStore

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

2022-05-11 11:11:00 +02:00

2.5 KiB

Raw Blame History

Module question_generator

QuestionGenerator

class QuestionGenerator(BaseComponent)

The Question Generator takes only a document as input and outputs questions that it thinks can be answered by this document. In our current implementation, input texts are split into chunks of 50 words with a 10 word overlap. This is because the default model valhalla/t5-base-e2e-qg seems to generate only about 3 questions per passage regardless of length. Our approach prioritizes the creation of more questions over processing efficiency (T5 is able to digest much more than 50 words at once). The returned questions generally come in an order dictated by the order of their answers i.e. early questions in the list generally come from earlier in the document.

QuestionGenerator.init

def __init__(model_name_or_path="valhalla/t5-base-e2e-qg", model_version=None, num_beams=4, max_length=256, no_repeat_ngram_size=3, length_penalty=1.5, early_stopping=True, split_length=50, split_overlap=10, use_gpu=True, prompt="generate questions:", batch_size: Optional[int] = None)

Uses the valhalla/t5-base-e2e-qg model by default. This class supports any question generation model that is

implemented as a Seq2SeqLM in HuggingFace Transformers. Note that this style of question generation (where the only input is a document) is sometimes referred to as end-to-end question generation. Answer-supervised question generation is not currently supported.

Arguments:

model_name_or_path: Directory of a saved model or the name of a public model e.g. "valhalla/t5-base-e2e-qg". See https://huggingface.co/models for full list of available models.
model_version: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
use_gpu: Whether to use GPU or the CPU. Falls back on CPU if no GPU is available.
batch_size: Number of documents to process at a time.

QuestionGenerator.generate_batch

def generate_batch(texts: Union[List[str], List[List[str]]], batch_size: Optional[int] = None) -> Union[List[List[str]], List[List[List[str]]]]

Generates questions for a list of strings or a list of lists of strings.