mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-24 01:10:45 +00:00

Add run_batch method to all nodes and Pipeline to allow batch querying (#2481)

* Add run_batch methods for batch querying

* Update Documentation & Code Style

* Fix mypy

* Update Documentation & Code Style

* Fix mypy

* Fix linter

* Fix tests

* Update Documentation & Code Style

* Fix tests

* Update Documentation & Code Style

* Fix mypy

* Fix rest api test

* Update Documentation & Code Style

* Add Doc strings

* Update Documentation & Code Style

* Add batch_size as attribute to nodes supporting batching

* Adapt error messages

* Adapt type of filters in retrievers

* Revert change about truncation_warning in summarizer

* Unify multiple_doc_lists tests

* Use smaller models in extractor tests

* Add return types to JoinAnswers and RouteDocuments

* Adapt return statements in reader's run_batch method

* Allow list of filters

* Adapt error messages

* Update Documentation & Code Style

* Fix tests

* Fix mypy

* Adapt print_questions

* Remove disabling warning about too many public methods

* Add flag for pylint to disable warning about too many public methods in pipelines/base.py and document_stores/base.py

* Add type check

* Update Documentation & Code Style

* Adapt tutorial 11

* Update Documentation & Code Style

* Add query_batch method for DCDocStore

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

2022-05-11 11:11:00 +02:00

Module base

BaseSummarizer

class BaseSummarizer(BaseComponent)

Abstract class for Summarizer

BaseSummarizer.predict

@abstractmethod
def predict(documents: List[Document], generate_single_summary: Optional[bool] = None) -> List[Document]

Abstract method for creating a summary.

Arguments:

documents: Related documents (e.g. coming from a retriever) that the summary shall be conditioned on.
generate_single_summary: Whether to generate a single summary for all documents or one summary per document. If set to True, all documents are joined into a single string, which is then summarized. Important: The summary depends on the order of the supplied documents!

Returns:

List of Documents, where Document.text contains the summarization and Document.meta["context"] the original, not summarized text
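
To make the contract concrete, here is a minimal sketch of a subclass that honors the `predict` signature. It is an illustration under stated assumptions, not library code: the `Document` dataclass is a simplified stand-in for Haystack's class, the plain `ABC` base replaces `BaseComponent`, and `FirstSentenceSummarizer` with its first-sentence heuristic is hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    """Simplified stand-in for Haystack's Document (assumption)."""
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


class BaseSummarizer(ABC):
    """Stand-in for the abstract base; the real class extends BaseComponent."""

    @abstractmethod
    def predict(
        self, documents: List[Document], generate_single_summary: Optional[bool] = None
    ) -> List[Document]:
        ...


class FirstSentenceSummarizer(BaseSummarizer):
    """Hypothetical toy summarizer: keeps only the first sentence of each input."""

    def predict(self, documents, generate_single_summary=None):
        if generate_single_summary:
            # Join in the supplied order, so the result depends on document order.
            texts = [" ".join(doc.text for doc in documents)]
        else:
            texts = [doc.text for doc in documents]
        # Summary goes into `text`, the unsummarized input into meta["context"].
        return [
            Document(text=t.split(". ")[0].rstrip(".") + ".", meta={"context": t})
            for t in texts
        ]


docs = [Document("Wind is strong. Power is off."), Document("Fires are a risk. Crews are ready.")]
per_doc = FirstSentenceSummarizer().predict(docs)
single = FirstSentenceSummarizer().predict(docs, generate_single_summary=True)
```

With two input documents, `per_doc` holds two summaries and `single` holds one, whose `meta["context"]` is the joined input text.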

Module transformers

TransformersSummarizer

class TransformersSummarizer(BaseSummarizer)

Transformer-based model that summarizes documents using Hugging Face's transformers framework.

You can use any model that has been fine-tuned on a summarization task. For example: 'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'. See the up-to-date list of available models at https://huggingface.co/models?filter=summarization

Example

|     # Assumes `summarizer` is an already initialized TransformersSummarizer
|     docs = [Document(text="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
|            "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
|            "the shutoffs which were expected to last through at least midday tomorrow.")]
|
|     # Summarize
|     summary = summarizer.predict(
|        documents=docs,
|        generate_single_summary=True
|     )
|
|     # Show results (List of Documents, containing summary and original text)
|     print(summary)
|
|    [
|      {
|        "text": "California's largest electricity provider has turned off power to hundreds of thousands of customers.",
|        ...
|        "meta": {
|          "context": "PGE stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. ..."
|        },
|        ...
|      }
|    ]

TransformersSummarizer.__init__

def __init__(model_name_or_path: str = "google/pegasus-xsum", model_version: Optional[str] = None, tokenizer: Optional[str] = None, max_length: int = 200, min_length: int = 5, use_gpu: bool = True, clean_up_tokenization_spaces: bool = True, separator_for_single_summary: str = " ", generate_single_summary: bool = False, batch_size: Optional[int] = None)

Load a Summarization model from Transformers.

See the up-to-date list of available models at https://huggingface.co/models?filter=summarization

Arguments:

model_name_or_path: Directory of a saved model or the name of a public model, e.g. 'google/pegasus-xsum' or 't5-small'. See https://huggingface.co/models?filter=summarization for the full list of available models.
model_version: The version of the model to use from the HuggingFace model hub. Can be a tag name, branch name, or commit hash.
tokenizer: Name of the tokenizer (usually the same as the model).
max_length: Maximum length of the summarized text.
min_length: Minimum length of the summarized text.
use_gpu: Whether to use GPU (if available).
clean_up_tokenization_spaces: Whether or not to clean up potential extra spaces in the text output.
separator_for_single_summary: If generate_single_summary=True in predict(), all documents are joined into a single text. This separator is placed between consecutive documents.
generate_single_summary: Whether to generate a single summary for all documents or one summary per document. If set to True, all documents are joined into a single string, which is then summarized. Important: The summary depends on the order of the supplied documents!
batch_size: Number of documents to process at a time.
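
The `predict` signature accepts `generate_single_summary: Optional[bool] = None`, which suggests that `None` falls back to the value set in the constructor. The sketch below illustrates that reading together with `separator_for_single_summary`. Both `resolve_flag` and `join_for_single_summary` are hypothetical helpers written for this illustration, and the fallback behaviour is an assumption rather than something the text above states.

```python
from typing import List, Optional


def resolve_flag(call_value: Optional[bool], init_value: bool) -> bool:
    """Per-call value wins; None falls back to the constructor default (assumed behaviour)."""
    return init_value if call_value is None else call_value


def join_for_single_summary(texts: List[str], separator: str = " ") -> str:
    """Join document texts in the supplied order, as a single-summary run would need."""
    return separator.join(texts)


texts = ["Blackouts were scheduled.", "High winds were forecast."]
use_single = resolve_flag(None, init_value=True)
joined = join_for_single_summary(texts, separator=" ")
reversed_joined = join_for_single_summary(texts[::-1], separator=" ")
```

Because the joined string differs when the input order changes (`joined` versus `reversed_joined`), a model summarizing it can produce a different summary, which is the order dependence the argument description warns about.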

TransformersSummarizer.predict

def predict(documents: List[Document], generate_single_summary: Optional[bool] = None) -> List[Document]

Produce the summarization from the supplied documents.

These documents can, for example, be retrieved via the Retriever.

Arguments:

documents: Related documents (e.g. coming from a retriever) that the summary shall be conditioned on.
generate_single_summary: Whether to generate a single summary for all documents or one summary per document. If set to True, all documents are joined into a single string, which is then summarized. Important: The summary depends on the order of the supplied documents!

Returns:

List of Documents, where Document.text contains the summarization and Document.meta["context"] the original, not summarized text

TransformersSummarizer.predict_batch

def predict_batch(documents: Union[List[Document], List[List[Document]]], generate_single_summary: Optional[bool] = None, batch_size: Optional[int] = None) -> Union[List[Document], List[List[Document]]]

Produce the summarization from the supplied documents.

These documents can for example be retrieved via the Retriever.

Arguments:

documents: Single list of related documents or list of lists of related documents (e.g. coming from a retriever) that the summary shall be conditioned on.
generate_single_summary: Whether to generate a single summary for each provided document list or one summary per document. If set to True, all documents of a document list are joined into a single string, which is then summarized. Important: The summary depends on the order of the supplied documents!
batch_size: Number of Documents to process at a time.
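
The two accepted input shapes and `batch_size` can be sketched as follows. This is an illustration with plain strings instead of `Document` objects; `chunked`, `toy_summarize`, and `predict_batch_sketch` are hypothetical names, and the real method may group and batch its inputs differently.

```python
from typing import Iterator, List, Sequence, Union


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_summarize(batch: Sequence[str]) -> List[str]:
    """Placeholder for a model call: keeps the first three words of each text."""
    return [" ".join(text.split()[:3]) for text in batch]


def predict_batch_sketch(documents: Union[List[str], List[List[str]]], batch_size: int = 2):
    """Return a flat list for flat input and a list of lists for nested input."""
    nested = len(documents) > 0 and isinstance(documents[0], list)
    doc_lists = documents if nested else [documents]
    # Flatten, run fixed-size batches, then restore the original grouping.
    flat = [text for doc_list in doc_lists for text in doc_list]
    summaries = [s for batch in chunked(flat, batch_size) for s in toy_summarize(batch)]
    grouped, pos = [], 0
    for doc_list in doc_lists:
        grouped.append(summaries[pos:pos + len(doc_list)])
        pos += len(doc_list)
    return grouped if nested else grouped[0]


flat_result = predict_batch_sketch(["a b c d", "e f g h", "i j"])
nested_result = predict_batch_sketch([["a b c d"], ["e f g h", "i j"]])
```

The output mirrors the input shape, matching the `Union[List[Document], List[List[Document]]]` return type in the signature: a flat list in gives a flat list out, and a list of lists gives one result list per input list.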
