fix: make pandas DataFrame optional in EvaluationRunResult (#8838)

* feat: AsyncPipeline that can schedule components to run concurrently (#8812)

* add component checks

* pipeline should run deterministically

* add FIFOQueue

* add agent tests

* add order dependent tests

* run new tests

* remove code that is not needed

* test: intermediate outputs from cycle are available outside cycle

* add tests for component checks (Claude)

* adapt tests for component checks (o1 review)

* chore: format

* remove tests that aren't needed anymore

* add _calculate_priority tests

* revert accidental change in pyproject.toml

* test format conversion

* adapt to naming convention

* chore: proper docstrings and type hints for PQ

* format

* add more unit tests

* rm unneeded comments

* test input consumption

* lint

* fix: docstrings

* lint

* format

* format

* fix license header

* fix license header

* add component run tests

* fix: pass correct input format to tracing

* fix types

* format

* format

* types

* add defaults from Socket instead of signature

- otherwise components with dynamic inputs would fail

* fix test names

* still wait for optional inputs on greedy variadic sockets

- mirrors previous behavior

* fix format

* wip: warn for ambiguous running order

* wip: alternative warning

* fix license header

* make code more readable

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Introduce content tracing to a behavioral test

* Fixing linting

* Remove debug print statements

* Fix tracer tests

* remove print

* test: test for component inputs

* test: remove testing for run order

* chore: update component checks from experimental

* chore: update pipeline and base from experimental

* refactor: remove unused method

* refactor: remove unused method

* refactor: outdated comment

* refactor: inputs state is updated as side effect

- to prepare for AsyncPipeline implementation

* format

* test: add file conversion test

* format

* fix: original implementation deepcopies outputs

* lint

* fix: from_dict was updated

* fix: format

* fix: test

* test: add test for thread safety

* remove unused imports

* format

* test: FIFOPriorityQueue

* chore: add release note

* feat: add AsyncPipeline

* chore: Add release notes

* fix: format

* debug: switch run order to debug ubuntu and windows tests

* fix: consider priorities of other components while waiting for DEFER

* refactor: simplify code

* fix: resolve merge conflict with mermaid changes

* fix: format

* fix: remove unused import

* refactor: rename to avoid accidental conflicts

* fix: track pipeline type

* fix and extend test

* fix: format

* style: sort alphabetically

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Update releasenotes/notes/feat-async-pipeline-338856a142e1318c.yaml

* fix: indentation, do not close loop

* fix: use asyncio.run

* fix: format

---------

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>

* updated changes for refactoring evaluations without the pandas package

* added release notes for eval_run_result.py: refactoring EvaluationRunResult to work without pandas

* wip: cleaning and refactoring

* removing BaseEvaluationRunResult

* wip: fixing tests

* fixing tests and docstrings

* updating release notes

* fixing typing

* pylint fix

* adding deprecation warning

* fixing tests

* fixing type consistency

* adding stacklevel=2 to warning messages

* fixing docstrings

* fixing docstrings

* updating release notes

---------

Co-authored-by: mathislucka <mathis.lucka@gmail.com>
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>
Authored by Hemanth Taduka on 2025-02-17 08:43:54 -05:00; committed by GitHub
Parent: 2f383bce25
Commit: b5fb0d3ff8
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in database)
5 changed files with 207 additions and 189 deletions

View File

@@ -2,7 +2,6 @@
#
# SPDX-License-Identifier: Apache-2.0
from .base import BaseEvaluationRunResult
from .eval_run_result import EvaluationRunResult
__all__ = ["BaseEvaluationRunResult", "EvaluationRunResult"]
__all__ = ["EvaluationRunResult"]

View File

@@ -1,49 +0,0 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
from abc import ABC, abstractmethod
from typing import List, Optional
from pandas import DataFrame
class BaseEvaluationRunResult(ABC):
"""
Represents the results of an evaluation run.
"""
@abstractmethod
def to_pandas(self) -> "DataFrame":
"""
Creates a Pandas DataFrame containing the scores of each metric for every input sample.
:returns:
Pandas DataFrame with the scores.
"""
@abstractmethod
def score_report(self) -> "DataFrame":
"""
Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
:returns:
Pandas DataFrame with the aggregated scores.
"""
@abstractmethod
def comparative_individual_scores_report(
self, other: "BaseEvaluationRunResult", keep_columns: Optional[List[str]] = None
) -> "DataFrame":
"""
Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
The inputs to both evaluation runs is assumed to be the same.
:param other:
Results of another evaluation run to compare with.
:param keep_columns:
List of common column names to keep from the inputs of the evaluation runs to compare.
:returns:
Pandas DataFrame with the score comparison.
"""

View File

@@ -2,17 +2,18 @@
#
# SPDX-License-Identifier: Apache-2.0
import csv
from copy import deepcopy
from typing import Any, Dict, List, Optional
from typing import Any, Dict, List, Literal, Optional, Union
from warnings import warn
from pandas import DataFrame
from pandas import concat as pd_concat
from haystack.lazy_imports import LazyImport
from .base import BaseEvaluationRunResult
with LazyImport("Run 'pip install pandas'") as pandas_import:
from pandas import DataFrame
class EvaluationRunResult(BaseEvaluationRunResult):
class EvaluationRunResult:
"""
Contains the inputs and the outputs of an evaluation pipeline and provides methods to inspect them.
"""
@@ -23,16 +24,14 @@ class EvaluationRunResult(BaseEvaluationRunResult):
:param run_name:
Name of the evaluation run.
:param inputs:
Dictionary containing the inputs used for the run.
Each key is the name of the input and its value is
a list of input values. The length of the lists should
be the same.
Dictionary containing the inputs used for the run. Each key is the name of the input and its value is a list
of input values. The length of the lists should be the same.
:param results:
Dictionary containing the results of the evaluators
used in the evaluation pipeline. Each key is the name
of the metric and its value is dictionary with the following
keys:
Dictionary containing the results of the evaluators used in the evaluation pipeline. Each key is the name
of the metric and its value is a dictionary with the following keys:
- 'score': The aggregated score for the metric.
- 'individual_scores': A list of scores for each input sample.
"""
@@ -59,77 +58,186 @@ class EvaluationRunResult(BaseEvaluationRunResult):
f"Got {len(outputs['individual_scores'])} but expected {expected_len}."
)
def score_report(self) -> DataFrame:
@staticmethod
def _write_to_csv(csv_file: str, data: Dict[str, List[Any]]) -> str:
"""
Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
Write data to a CSV file.
:param csv_file: Path to the CSV file to write
:param data: Dictionary containing the data to write
:return: Status message indicating success or failure
"""
list_lengths = [len(value) for value in data.values()]
if len(set(list_lengths)) != 1:
raise ValueError("All lists in the JSON must have the same length")
try:
headers = list(data.keys())
num_rows = list_lengths[0]
rows = []
for i in range(num_rows):
row = [data[header][i] for header in headers]
rows.append(row)
with open(csv_file, "w", newline="") as csvfile:
writer = csv.writer(csvfile)
writer.writerow(headers)
writer.writerows(rows)
return f"Data successfully written to {csv_file}"
except PermissionError:
return f"Error: Permission denied when writing to {csv_file}"
except IOError as e:
return f"Error writing to {csv_file}: {str(e)}"
except Exception as e:
return f"Error: {str(e)}"
@staticmethod
def _handle_output(
data: Dict[str, List[Any]], output_format: Literal["json", "csv", "df"] = "csv", csv_file: Optional[str] = None
) -> Union[str, DataFrame, Dict[str, List[Any]]]:
"""
Handles output formatting based on `output_format`.
:returns: DataFrame for 'df', dict for 'json', or confirmation message for 'csv'
"""
if output_format == "json":
return data
elif output_format == "df":
pandas_import.check()
return DataFrame(data)
elif output_format == "csv":
if not csv_file:
raise ValueError("A file path must be provided in 'csv_file' parameter to save the CSV output.")
return EvaluationRunResult._write_to_csv(csv_file, data)
else:
raise ValueError(f"Invalid output format '{output_format}' provided. Choose from 'json', 'csv', or 'df'.")
def aggregated_report(
self, output_format: Literal["json", "csv", "df"] = "json", csv_file: Optional[str] = None
) -> Union[Dict[str, List[Any]], "DataFrame", str]:
"""
Generates a report with aggregated scores for each metric.
:param output_format: The output format for the report: "json", "csv", or "df"; defaults to "json".
:param csv_file: Filepath to save the CSV output; must be provided if `output_format` is "csv".
:returns:
Pandas DataFrame with the aggregated scores.
JSON or DataFrame with aggregated scores, in case the output is set to a CSV file, a message confirming the
successful write or an error message.
"""
results = {k: v["score"] for k, v in self.results.items()}
df = DataFrame.from_dict(results, orient="index", columns=["score"]).reset_index()
df.columns = ["metrics", "score"]
return df
data = {"metrics": list(results.keys()), "score": list(results.values())}
return self._handle_output(data, output_format, csv_file)
def to_pandas(self) -> DataFrame:
def detailed_report(
self, output_format: Literal["json", "csv", "df"] = "json", csv_file: Optional[str] = None
) -> Union[Dict[str, List[Any]], "DataFrame", str]:
"""
Creates a Pandas DataFrame containing the scores of each metric for every input sample.
Generates a report with detailed scores for each metric.
:param output_format: The output format for the report: "json", "csv", or "df"; defaults to "json".
:param csv_file: Filepath to save the CSV output; must be provided if `output_format` is "csv".
:returns:
Pandas DataFrame with the scores.
JSON or DataFrame with the detailed scores, in case the output is set to a CSV file, a message confirming
the successful write or an error message.
"""
inputs_columns = list(self.inputs.keys())
inputs_values = list(self.inputs.values())
inputs_values = list(map(list, zip(*inputs_values))) # transpose the values
df_inputs = DataFrame(inputs_values, columns=inputs_columns)
combined_data = {col: self.inputs[col] for col in self.inputs}
# enforce columns type consistency
scores_columns = list(self.results.keys())
scores_values = [v["individual_scores"] for v in self.results.values()]
scores_values = list(map(list, zip(*scores_values))) # transpose the values
df_scores = DataFrame(scores_values, columns=scores_columns)
for col in scores_columns:
col_values = self.results[col]["individual_scores"]
if any(isinstance(v, float) for v in col_values):
col_values = [float(v) for v in col_values]
combined_data[col] = col_values
return df_inputs.join(df_scores)
return self._handle_output(combined_data, output_format, csv_file)
def comparative_individual_scores_report(
self, other: "BaseEvaluationRunResult", keep_columns: Optional[List[str]] = None
) -> DataFrame:
def comparative_detailed_report(
self,
other: "EvaluationRunResult",
keep_columns: Optional[List[str]] = None,
output_format: Literal["json", "csv", "df"] = "json",
csv_file: Optional[str] = None,
) -> Union[str, "DataFrame", None]:
"""
Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
Generates a report with detailed scores for each metric from two evaluation runs for comparison.
The inputs to both evaluation runs are assumed to be the same.
:param other: Results of another evaluation run to compare with.
:param keep_columns: List of common column names to keep from the inputs of the evaluation runs to compare.
:param output_format: The output format for the report: "json", "csv", or "df"; defaults to "json".
:param csv_file: Filepath to save the CSV output; must be provided if `output_format` is "csv".
:param other:
Results of another evaluation run to compare with.
:param keep_columns:
List of common column names to keep from the inputs of the evaluation runs to compare.
:returns:
Pandas DataFrame with the score comparison.
JSON or DataFrame with a comparison of the detailed scores, in case the output is set to a CSV file,
a message confirming the successful write or an error message.
"""
if not isinstance(other, EvaluationRunResult):
raise ValueError("Comparative scores can only be computed between EvaluationRunResults.")
this_name = self.run_name
other_name = other.run_name
if this_name == other_name:
warn(f"The run names of the two evaluation results are the same ('{this_name}')")
this_name = f"{this_name}_first"
other_name = f"{other_name}_second"
if not hasattr(other, "run_name") or not hasattr(other, "inputs") or not hasattr(other, "results"):
raise ValueError("The 'other' parameter must have 'run_name', 'inputs', and 'results' attributes.")
if self.run_name == other.run_name:
warn(f"The run names of the two evaluation results are the same ('{self.run_name}')")
if self.inputs.keys() != other.inputs.keys():
warn(f"The input columns differ between the results; using the input columns of '{this_name}'.")
warn(f"The input columns differ between the results; using the input columns of '{self.run_name}'.")
pipe_a_df = self.to_pandas()
pipe_b_df = other.to_pandas()
# got both detailed reports
detailed_a = self.detailed_report(output_format="json")
detailed_b = other.detailed_report(output_format="json")
# ensure both detailed reports are in dictionaries format
if not isinstance(detailed_a, dict) or not isinstance(detailed_b, dict):
raise ValueError("Detailed reports must be dictionaries.")
# determine which columns to ignore
if keep_columns is None:
ignore = list(self.inputs.keys())
else:
ignore = [col for col in list(self.inputs.keys()) if col not in keep_columns]
pipe_b_df.drop(columns=ignore, inplace=True, errors="ignore")
pipe_b_df.columns = [f"{other_name}_{column}" for column in pipe_b_df.columns] # type: ignore
pipe_a_df.columns = [f"{this_name}_{col}" if col not in ignore else col for col in pipe_a_df.columns] # type: ignore
# filter out ignored columns from pipe_b_dict
filtered_detailed_b = {
f"{other.run_name}_{key}": value for key, value in detailed_b.items() if key not in ignore
}
results_df = pd_concat([pipe_a_df, pipe_b_df], axis=1)
# rename columns in pipe_a_dict based on ignore list
renamed_detailed_a = {
(key if key in ignore else f"{self.run_name}_{key}"): value for key, value in detailed_a.items()
}
return results_df
# combine both detailed reports
combined_results = {**renamed_detailed_a, **filtered_detailed_b}
return self._handle_output(combined_results, output_format, csv_file)
def score_report(self) -> "DataFrame":
"""Generates a DataFrame report with aggregated scores for each metric."""
msg = "The `score_report` method is deprecated and will be changed to `aggregated_report` in Haystack 2.11.0."
warn(msg, DeprecationWarning, stacklevel=2)
return self.aggregated_report(output_format="df")
def to_pandas(self) -> "DataFrame":
"""Generates a DataFrame report with detailed scores for each metric."""
msg = "The `to_pandas` method is deprecated and will be changed to `detailed_report` in Haystack 2.11.0."
warn(msg, DeprecationWarning, stacklevel=2)
return self.detailed_report(output_format="df")
def comparative_individual_scores_report(self, other: "EvaluationRunResult") -> "DataFrame":
"""Generates a DataFrame report with detailed scores for each metric from two evaluation runs for comparison."""
msg = (
"The `comparative_individual_scores_report` method is deprecated and will be changed to "
"`comparative_detailed_report` in Haystack 2.11.0."
)
warn(msg, DeprecationWarning, stacklevel=2)
return self.comparative_detailed_report(other, output_format="df")
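
For orientation, a minimal usage sketch of the refactored reporting API shown above; the run name, inputs, and metric values are invented for illustration, and only output_format="df" requires pandas, since the import is deferred behind the LazyImport guard and checked at call time.

from haystack.evaluation import EvaluationRunResult

# Illustrative inputs and evaluator results (values are made up for this sketch).
inputs = {
    "question": ["What is the capital of France?", "What is the capital of Spain?"],
    "predicted_answer": ["Paris", "Madrid"],
}
results = {
    "single_hit": {"score": 1.0, "individual_scores": [1, 1]},
    "faithfulness": {"score": 0.416, "individual_scores": [0.136, 0.696]},
}

run = EvaluationRunResult("my_pipeline", inputs=inputs, results=results)

# "json" is the default and returns a plain dict of lists; no pandas needed.
aggregated = run.aggregated_report()
detailed = run.detailed_report(output_format="json")

# "csv" writes the report to disk and returns a status message.
status = run.detailed_report(output_format="csv", csv_file="detailed_scores.csv")

# "df" returns a pandas DataFrame and fails with an install hint if pandas is missing.
df = run.aggregated_report(output_format="df")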

View File

@@ -0,0 +1,8 @@
---
enhancements:
- |
`EvaluationRunResult` can now output results as JSON, a pandas DataFrame, or a CSV file.
deprecations:
- |
The use of a pandas DataFrame in `EvaluationRunResult` is now optional; the methods `score_report`, `to_pandas`, and `comparative_individual_scores_report` are deprecated and will be removed in the next Haystack release.
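
For callers of the deprecated methods, the shims in eval_run_result.py above map one-to-one onto the new report methods; a short migration sketch, where run and other_run stand for any two EvaluationRunResult instances:

# Deprecated calls (they emit a DeprecationWarning) and their drop-in replacements.
run.score_report()                                   # -> run.aggregated_report(output_format="df")
run.to_pandas()                                      # -> run.detailed_report(output_format="df")
run.comparative_individual_scores_report(other_run)  # -> run.comparative_detailed_report(other_run, output_format="df")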

View File

@@ -1,6 +1,8 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
import json
from haystack.evaluation import EvaluationRunResult
import pytest
@@ -87,16 +89,24 @@ def test_score_report():
}
result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
report = result.score_report().to_json()
report = result.aggregated_report(output_format="json")
assert report == (
'{"metrics":{"0":"reciprocal_rank","1":"single_hit","2":"multi_hit","3":"context_relevance",'
'"4":"faithfulness","5":"semantic_answer_similarity"},'
'"score":{"0":0.476932,"1":0.75,"2":0.46428375,"3":0.58177975,"4":0.40585375,"5":0.53757075}}'
{
"metrics": [
"reciprocal_rank",
"single_hit",
"multi_hit",
"context_relevance",
"faithfulness",
"semantic_answer_similarity",
],
"score": [0.476932, 0.75, 0.46428375, 0.58177975, 0.40585375, 0.53757075],
}
)
def test_to_pandas():
def test_to_df():
data = {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
@@ -121,19 +131,25 @@ def test_to_pandas():
}
result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
assert result.to_pandas().to_json() == (
'{"query_id":{"0":"53c3b3e6","1":"225f87f7","2":"53c3b3e6","3":"225f87f7"},'
'"question":{"0":"What is the capital of France?","1":"What is the capital of Spain?",'
'"2":"What is the capital of Luxembourg?","3":"What is the capital of Portugal?"},'
'"contexts":{"0":"wiki_France","1":"wiki_Spain","2":"wiki_Luxembourg","3":"wiki_Portugal"},'
'"answer":{"0":"Paris","1":"Madrid","2":"Luxembourg","3":"Lisbon"},'
'"predicted_answer":{"0":"Paris","1":"Madrid","2":"Luxembourg","3":"Lisbon"},'
'"reciprocal_rank":{"0":0.378064,"1":0.534964,"2":0.216058,"3":0.778642},'
'"single_hit":{"0":1,"1":1,"2":0,"3":1},'
'"multi_hit":{"0":0.706125,"1":0.454976,"2":0.445512,"3":0.250522},'
'"context_relevance":{"0":0.805466,"1":0.410251,"2":0.75007,"3":0.361332},'
'"faithfulness":{"0":0.135581,"1":0.695974,"2":0.749861,"3":0.041999},'
'"semantic_answer_similarity":{"0":0.971241,"1":0.15932,"2":0.019722,"3":1.0}}'
assert result.detailed_report() == (
{
"query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
"question": [
"What is the capital of France?",
"What is the capital of Spain?",
"What is the capital of Luxembourg?",
"What is the capital of Portugal?",
],
"contexts": ["wiki_France", "wiki_Spain", "wiki_Luxembourg", "wiki_Portugal"],
"answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
"predicted_answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
"reciprocal_rank": [0.378064, 0.534964, 0.216058, 0.778642],
"single_hit": [1, 1, 0, 1],
"multi_hit": [0.706125, 0.454976, 0.445512, 0.250522],
"context_relevance": [0.805466, 0.410251, 0.75007, 0.361332],
"faithfulness": [0.135581, 0.695974, 0.749861, 0.041999],
"semantic_answer_similarity": [0.971241, 0.15932, 0.019722, 1.0],
}
)
@@ -176,73 +192,9 @@ def test_comparative_individual_scores_report():
result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
results = result1.comparative_individual_scores_report(result2)
results = result1.comparative_detailed_report(result2, keep_columns=["predicted_answer"])
expected = {
"query_id": {0: "53c3b3e6", 1: "225f87f7"},
"question": {0: "What is the capital of France?", 1: "What is the capital of Spain?"},
"contexts": {0: "wiki_France", 1: "wiki_Spain"},
"answer": {0: "Paris", 1: "Madrid"},
"predicted_answer": {0: "Paris", 1: "Madrid"},
"testing_pipeline_1_reciprocal_rank": {0: 0.378064, 1: 0.534964},
"testing_pipeline_1_single_hit": {0: 1, 1: 1},
"testing_pipeline_1_multi_hit": {0: 0.706125, 1: 0.454976},
"testing_pipeline_1_context_relevance": {0: 1, 1: 1},
"testing_pipeline_1_faithfulness": {0: 0.135581, 1: 0.695974},
"testing_pipeline_1_semantic_answer_similarity": {0: 0.971241, 1: 0.15932},
"testing_pipeline_2_reciprocal_rank": {0: 0.378064, 1: 0.534964},
"testing_pipeline_2_single_hit": {0: 1, 1: 1},
"testing_pipeline_2_multi_hit": {0: 0.706125, 1: 0.454976},
"testing_pipeline_2_context_relevance": {0: 1, 1: 1},
"testing_pipeline_2_faithfulness": {0: 0.135581, 1: 0.695974},
"testing_pipeline_2_semantic_answer_similarity": {0: 0.971241, 1: 0.15932},
}
assert expected == results.to_dict()
def test_comparative_individual_scores_report_keep_truth_answer_in_df():
data_1 = {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7"],
"question": ["What is the capital of France?", "What is the capital of Spain?"],
"contexts": ["wiki_France", "wiki_Spain"],
"answer": ["Paris", "Madrid"],
"predicted_answer": ["Paris", "Madrid"],
},
"metrics": {
"reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
"single_hit": {"individual_scores": [1, 1], "score": 0.75},
"multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
"context_relevance": {"individual_scores": [1, 1], "score": 1},
"faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
"semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
},
}
data_2 = {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7"],
"question": ["What is the capital of France?", "What is the capital of Spain?"],
"contexts": ["wiki_France", "wiki_Spain"],
"answer": ["Paris", "Madrid"],
"predicted_answer": ["Paris", "Madrid"],
},
"metrics": {
"reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
"single_hit": {"individual_scores": [1, 1], "score": 0.75},
"multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
"context_relevance": {"individual_scores": [1, 1], "score": 1},
"faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
"semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
},
}
result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
results = result1.comparative_individual_scores_report(result2, keep_columns=["predicted_answer"])
assert list(results.columns) == [
assert list(results.keys()) == [
"query_id",
"question",
"contexts",