mirror of
https://github.com/deepset-ai/haystack.git
refactor: Adapt running benchmarks (#5007)
* Generate eval result in separate method
* Adapt benchmarking utils
* Adapt running retriever benchmarks
* Adapt error message
* Adapt running reader benchmarks
* Adapt retriever reader benchmark script
* Adapt running benchmarks script
* Adapt README.md
* Raise error if file doesn't exist
* Raise error if path doesn't exist or is a directory
* minor readme update
* Create separate methods for checking if pipeline contains reader or retriever
* Fix reader pipeline case

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
This commit is contained in:
parent
2ede4d1d1d
commit
b8ff1052d4
@ -1,45 +1,97 @@

# Benchmarks

The tooling provided in this directory allows running benchmarks on reader pipelines, retriever pipelines,
and retriever-reader pipelines.
## Defining configuration

To run a benchmark, you first need to create a configuration file. This should be a Pipeline YAML file that
contains the querying pipeline and, optionally, the indexing pipeline (needed whenever the querying pipeline includes a retriever).

The configuration file should also have a **`benchmark_config`** section that includes the following information:

- **`labels_file`**: The path to a SQuAD-formatted JSON or CSV file that contains the labels to benchmark on.
- **`documents_directory`**: The path to a directory containing the files to be indexed into the document store.
  This is only necessary for retriever and retriever-reader pipelines.
- **`data_url`**: Optional. If provided, the benchmarking script downloads data from this URL and
  saves it in the **`data/`** directory.

Here is an example of what a configuration file for a retriever-reader pipeline might look like:
```yaml
components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
  - name: TextConverter
    type: TextConverter
  - name: Reader
    type: FARMReader
    params:
      model_name_or_path: deepset/roberta-base-squad2-distilled
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore
      top_k: 10

pipelines:
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Retriever
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Retriever]
  - name: querying
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]

benchmark_config:
  data_url: http://example.com/data.tar.gz
  documents_directory: /path/to/documents
  labels_file: /path/to/labels.csv
```
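The `labels_file` referenced in `benchmark_config` is expected to be SQuAD-formatted. As a minimal, purely illustrative sketch (the title, context, question, and answer below are placeholders, not part of any benchmark dataset), such a JSON file contains entries like the following, shown here as the equivalent Python structure:

```python
# Illustrative sketch only: the SQuAD-style structure expected in labels_file.
# All titles, contexts, questions, and answers are placeholders.
minimal_squad_labels = {
    "data": [
        {
            "title": "Example document",
            "paragraphs": [
                {
                    "context": "Haystack is an open-source framework for building search systems.",
                    "qas": [
                        {
                            "id": "0",
                            "question": "What is Haystack?",
                            # answer_start is the character offset of the answer within the context
                            "answers": [{"text": "an open-source framework", "answer_start": 12}],
                        }
                    ],
                }
            ],
        }
    ]
}
```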

## Running benchmarks

Once you have your configuration file, you can run benchmarks using the **`run.py`** script:

```bash
python run.py [--output OUTPUT] config
```

The script takes the following arguments:

- `config`: The path to your configuration file.
- `--output`: An optional path where benchmark results should be saved. If not provided, the script creates a JSON file with the same name as the specified config file.
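If you prefer to drive the benchmark from Python instead of the command line, a minimal sketch can reuse the `run_benchmark` function from `run.py` directly. This assumes the code is executed from within the benchmarks directory so that `run.py` is importable; the config and output file names are placeholders:

```python
# Minimal sketch: programmatic equivalent of the CLI call above.
# File names are placeholders; run from the benchmarks directory.
import json
from pathlib import Path

from run import run_benchmark

results = run_benchmark(Path("retriever_reader.yml"))
with open("retriever_reader_results.json", "w") as f:
    json.dump(results, f, indent=2)
```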

## Metrics

The benchmarks yield the following metrics:

- Reader pipelines:
  - Exact match score
  - F1 score
  - Total querying time
  - Seconds/query
- Retriever pipelines:
  - Recall
  - Mean average precision
  - Total querying time
  - Seconds/query
  - Queries/second
  - Total indexing time
  - Number of indexed Documents/second
- Retriever-Reader pipelines:
  - Exact match score
  - F1 score
  - Total querying time
  - Seconds/query
  - Total indexing time
  - Number of indexed Documents/second

You can find more details about the performance metrics in our [evaluation guide](https://docs.haystack.deepset.ai/docs/evaluation).
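Exact match and F1 follow the usual SQuAD conventions. As a rough illustration only (this is not Haystack's implementation; see the evaluation guide for the authoritative definitions), per prediction/gold pair they can be computed like this:

```python
# Illustrative SQuAD-style metrics for a single prediction/gold pair
# (sketch only, not Haystack's implementation).
import re
import string
from collections import Counter


def _normalize(text: str) -> str:
    # Lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(_normalize(prediction) == _normalize(gold))


def f1(prediction: str, gold: str) -> float:
    pred_tokens = _normalize(prediction).split()
    gold_tokens = _normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```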

@ -1,51 +1,77 @@

from pathlib import Path
from typing import Dict

import argparse
import json

from haystack import Pipeline
from haystack.nodes import BaseRetriever, BaseReader
from haystack.pipelines.config import read_pipeline_config_from_yaml

from utils import prepare_environment, contains_reader, contains_retriever
from reader import benchmark_reader
from retriever import benchmark_retriever
from retriever_reader import benchmark_retriever_reader


def run_benchmark(pipeline_yaml: Path) -> Dict:
    """
    Run benchmarking on a given pipeline. Pipeline can be a retriever, reader, or retriever-reader pipeline.
    In case of retriever or retriever-reader pipelines, indexing is also benchmarked, so the config file must
    contain an indexing pipeline as well.

    :param pipeline_yaml: Path to pipeline YAML config. The config file should contain a benchmark_config section where
                          the following parameters are specified:
                            - documents_directory: Directory containing files to index.
                            - labels_file: Path to evaluation set.
                            - data_url (optional): URL to download the data from. Downloaded data will be stored in
                                                   the directory `data/`.
    """
    pipeline_config = read_pipeline_config_from_yaml(pipeline_yaml)
    benchmark_config = pipeline_config.pop("benchmark_config", {})

    # Prepare environment
    prepare_environment(pipeline_config, benchmark_config)
    labels_file = Path(benchmark_config["labels_file"])

    querying_pipeline = Pipeline.load_from_config(pipeline_config, pipeline_name="querying")
    pipeline_contains_reader = contains_reader(querying_pipeline)
    pipeline_contains_retriever = contains_retriever(querying_pipeline)

    # Retriever-Reader pipeline
    if pipeline_contains_retriever and pipeline_contains_reader:
        documents_dir = Path(benchmark_config["documents_directory"])
        indexing_pipeline = Pipeline.load_from_config(pipeline_config, pipeline_name="indexing")

        results = benchmark_retriever_reader(indexing_pipeline, querying_pipeline, documents_dir, labels_file)

    # Retriever pipeline
    elif pipeline_contains_retriever:
        documents_dir = Path(benchmark_config["documents_directory"])
        indexing_pipeline = Pipeline.load_from_config(pipeline_config, pipeline_name="indexing")

        results = benchmark_retriever(indexing_pipeline, querying_pipeline, documents_dir, labels_file)

    # Reader pipeline
    elif pipeline_contains_reader:
        results = benchmark_reader(querying_pipeline, labels_file)

    # Unsupported pipeline type
    else:
        raise ValueError("Pipeline must be a retriever, reader, or retriever-reader pipeline.")

    results["config_file"] = pipeline_config
    return results


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("config", type=str, help="Path to pipeline YAML config.")
    parser.add_argument("--output", type=str, help="Path to output file.")
    args = parser.parse_args()

    config_file = Path(args.config)
    output_file = f"{config_file.stem}_results.json" if args.output is None else args.output

    results = run_benchmark(config_file)
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
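As a small follow-up sketch (the file name is a placeholder), the JSON written by the script can be inspected afterwards; `run_benchmark` always attaches the resolved pipeline config under the `config_file` key:

```python
# Hypothetical: read back a results file produced by run.py; the file name is a placeholder.
import json

with open("retriever_reader_results.json") as f:
    results = json.load(f)

# The pipeline configuration that produced these numbers travels with the results.
print(results["config_file"])
```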

@ -158,3 +158,20 @@ def get_retriever_config(pipeline: Pipeline) -> Tuple[str, Union[int, str]]:

    retriever_top_k = retriever.top_k

    return retriever_type, retriever_top_k


def contains_reader(pipeline: Pipeline) -> bool:
    """
    Check if a pipeline contains a Reader component.
    :param pipeline: Pipeline
    """
    components = [comp for comp in pipeline.components.values()]
    return any(isinstance(comp, BaseReader) for comp in components)


def contains_retriever(pipeline: Pipeline) -> bool:
    """
    Check if a pipeline contains a Retriever component.
    """
    components = [comp for comp in pipeline.components.values()]
    return any(isinstance(comp, BaseRetriever) for comp in components)
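A minimal usage sketch for the two helpers above, mirroring the branching in run.py (the YAML path is a placeholder, and the snippet assumes it runs from the benchmarks directory so that `utils` is importable):

```python
# Minimal sketch: classify a loaded querying pipeline with the helpers above.
from pathlib import Path

from haystack import Pipeline

from utils import contains_reader, contains_retriever

pipeline = Pipeline.load_from_yaml(Path("my_pipelines.yml"), pipeline_name="querying")

if contains_retriever(pipeline) and contains_reader(pipeline):
    print("retriever-reader pipeline")
elif contains_retriever(pipeline):
    print("retriever pipeline")
elif contains_reader(pipeline):
    print("reader pipeline")
```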