fix: make pandas DataFrame optional in EvaluationRunResult (#8838)

* feat: AsyncPipeline that can schedule components to run concurrently (#8812)

* add component checks

* pipeline should run deterministically

* add FIFOQueue

* add agent tests

* add order dependent tests

* run new tests

* remove code that is not needed

* test: intermediate outputs from cycle are available outside cycle

* add tests for component checks (Claude)

* adapt tests for component checks (o1 review)

* chore: format

* remove tests that aren't needed anymore

* add _calculate_priority tests

* revert accidental change in pyproject.toml

* test format conversion

* adapt to naming convention

* chore: proper docstrings and type hints for PQ

* format

* add more unit tests

* rm unneeded comments

* test input consumption

* lint

* fix: docstrings

* lint

* format

* format

* fix license header

* fix license header

* add component run tests

* fix: pass correct input format to tracing

* fix types

* format

* format

* types

* add defaults from Socket instead of signature

- otherwise components with dynamic inputs would fail

* fix test names

* still wait for optional inputs on greedy variadic sockets

- mirrors previous behavior

* fix format

* wip: warn for ambiguous running order

* wip: alternative warning

* fix license header

* make code more readable

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Introduce content tracing to a behavioral test

* Fixing linting

* Remove debug print statements

* Fix tracer tests

* remove print

* test: test for component inputs

* test: remove testing for run order

* chore: update component checks from experimental

* chore: update pipeline and base from experimental

* refactor: remove unused method

* refactor: remove unused method

* refactor: outdated comment

* refactor: inputs state is updated as side effect

- to prepare for AsyncPipeline implementation

* format

* test: add file conversion test

* format

* fix: original implementation deepcopies outputs

* lint

* fix: from_dict was updated

* fix: format

* fix: test

* test: add test for thread safety

* remove unused imports

* format

* test: FIFOPriorityQueue

* chore: add release note

* feat: add AsyncPipeline

* chore: Add release notes

* fix: format

* debug: switch run order to debug ubuntu and windows tests

* fix: consider priorities of other components while waiting for DEFER

* refactor: simplify code

* fix: resolve merge conflict with mermaid changes

* fix: format

* fix: remove unused import

* refactor: rename to avoid accidental conflicts

* fix: track pipeline type

* fix and extend test

* fix: format

* style: sort alphabetically

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Update releasenotes/notes/feat-async-pipeline-338856a142e1318c.yaml

* fix: indentation, do not close loop

* fix: use asyncio.run

* fix: format

---------

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>

* updated changes for refactoring evaluations without the pandas package

* added release notes for eval_run_result.py: refactoring EvaluationRunResult to work without pandas

* wip: cleaning and refactoring

* removing BaseEvaluationRunResult

* wip: fixing tests

* fixing tests and docstrings

* updating release notes

* fixing typing

* pylint fix

* adding deprecation warning

* fixing tests

* fixing type consistency

* adding stacklevel=2 to warning messages

* fixing docstrings

* fixing docstrings

* updating release notes

---------

Co-authored-by: mathislucka <mathis.lucka@gmail.com>
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>
Authored by Hemanth Taduka on 2025-02-17 08:43:54 -05:00; committed by GitHub
Parent: 2f383bce25
Commit: b5fb0d3ff8
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in database)
5 changed files with 207 additions and 189 deletions

View File

@@ -2,7 +2,6 @@
#
# SPDX-License-Identifier: Apache-2.0
from .base import BaseEvaluationRunResult
from .eval_run_result import EvaluationRunResult
__all__ = ["BaseEvaluationRunResult", "EvaluationRunResult"]
__all__ = ["EvaluationRunResult"]

View File

@@ -1,49 +0,0 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
from abc import ABC, abstractmethod
from typing import List, Optional
from pandas import DataFrame
class BaseEvaluationRunResult(ABC):
"""
Represents the results of an evaluation run.
"""
@abstractmethod
def to_pandas(self) -> "DataFrame":
"""
Creates a Pandas DataFrame containing the scores of each metric for every input sample.
:returns:
Pandas DataFrame with the scores.
"""
@abstractmethod
def score_report(self) -> "DataFrame":
"""
Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
:returns:
Pandas DataFrame with the aggregated scores.
"""
@abstractmethod
def comparative_individual_scores_report(
self, other: "BaseEvaluationRunResult", keep_columns: Optional[List[str]] = None
) -> "DataFrame":
"""
Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
The inputs to both evaluation runs is assumed to be the same.
:param other:
Results of another evaluation run to compare with.
:param keep_columns:
List of common column names to keep from the inputs of the evaluation runs to compare.
:returns:
Pandas DataFrame with the score comparison.
"""

View File

@@ -2,17 +2,18 @@
#
# SPDX-License-Identifier: Apache-2.0
import csv
from copy import deepcopy
from typing import Any, Dict, List, Optional
from typing import Any, Dict, List, Literal, Optional, Union
from warnings import warn
from pandas import DataFrame
from pandas import concat as pd_concat
from haystack.lazy_imports import LazyImport
from .base import BaseEvaluationRunResult
with LazyImport("Run 'pip install pandas'") as pandas_import:
from pandas import DataFrame
class EvaluationRunResult(BaseEvaluationRunResult):
class EvaluationRunResult:
"""
Contains the inputs and the outputs of an evaluation pipeline and provides methods to inspect them.
"""
@@ -23,16 +24,14 @@ class EvaluationRunResult(BaseEvaluationRunResult):
:param run_name:
Name of the evaluation run.
:param inputs:
Dictionary containing the inputs used for the run.
Each key is the name of the input and its value is
a list of input values. The length of the lists should
be the same.
Dictionary containing the inputs used for the run. Each key is the name of the input and its value is a list
of input values. The length of the lists should be the same.
:param results:
Dictionary containing the results of the evaluators
used in the evaluation pipeline. Each key is the name
of the metric and its value is dictionary with the following
keys:
Dictionary containing the results of the evaluators used in the evaluation pipeline. Each key is the name
of the metric and its value is a dictionary with the following keys:
- 'score': The aggregated score for the metric.
- 'individual_scores': A list of scores for each input sample.
"""
@@ -59,77 +58,186 @@ class EvaluationRunResult(BaseEvaluationRunResult):
f"Got {len(outputs['individual_scores'])} but expected {expected_len}."
)
def score_report(self) -> DataFrame:
@staticmethod
def _write_to_csv(csv_file: str, data: Dict[str, List[Any]]) -> str:
"""
Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
Write data to a CSV file.
:param csv_file: Path to the CSV file to write
:param data: Dictionary containing the data to write
:return: Status message indicating success or failure
"""
list_lengths = [len(value) for value in data.values()]
if len(set(list_lengths)) != 1:
raise ValueError("All lists in the JSON must have the same length")
try:
headers = list(data.keys())
num_rows = list_lengths[0]
rows = []
for i in range(num_rows):
row = [data[header][i] for header in headers]
rows.append(row)
with open(csv_file, "w", newline="") as csvfile:
writer = csv.writer(csvfile)
writer.writerow(headers)
writer.writerows(rows)
return f"Data successfully written to {csv_file}"
except PermissionError:
return f"Error: Permission denied when writing to {csv_file}"
except IOError as e:
return f"Error writing to {csv_file}: {str(e)}"
except Exception as e:
return f"Error: {str(e)}"
@staticmethod
def _handle_output(
data: Dict[str, List[Any]], output_format: Literal["json", "csv", "df"] = "csv", csv_file: Optional[str] = None
) -> Union[str, DataFrame, Dict[str, List[Any]]]:
"""
Handles output formatting based on `output_format`.
:returns: DataFrame for 'df', dict for 'json', or confirmation message for 'csv'
"""
if output_format == "json":
return data
elif output_format == "df":
pandas_import.check()
return DataFrame(data)
elif output_format == "csv":
if not csv_file:
raise ValueError("A file path must be provided in 'csv_file' parameter to save the CSV output.")
return EvaluationRunResult._write_to_csv(csv_file, data)
else:
raise ValueError(f"Invalid output format '{output_format}' provided. Choose from 'json', 'csv', or 'df'.")
def aggregated_report(
self, output_format: Literal["json", "csv", "df"] = "json", csv_file: Optional[str] = None
) -> Union[Dict[str, List[Any]], "DataFrame", str]:
"""
Generates a report with aggregated scores for each metric.
:param output_format: The output format for the report: "json", "csv", or "df"; defaults to "json".
:param csv_file: Filepath to save the CSV output; must be provided if `output_format` is "csv".
:returns:
Pandas DataFrame with the aggregated scores.
JSON or DataFrame with aggregated scores, in case the output is set to a CSV file, a message confirming the
successful write or an error message.
"""
results = {k: v["score"] for k, v in self.results.items()}
df = DataFrame.from_dict(results, orient="index", columns=["score"]).reset_index()
df.columns = ["metrics", "score"]
return df
data = {"metrics": list(results.keys()), "score": list(results.values())}
return self._handle_output(data, output_format, csv_file)
def to_pandas(self) -> DataFrame:
def detailed_report(
self, output_format: Literal["json", "csv", "df"] = "json", csv_file: Optional[str] = None
) -> Union[Dict[str, List[Any]], "DataFrame", str]:
"""
Creates a Pandas DataFrame containing the scores of each metric for every input sample.
Generates a report with detailed scores for each metric.
:param output_format: The output format for the report: "json", "csv", or "df"; defaults to "json".
:param csv_file: Filepath to save the CSV output; must be provided if `output_format` is "csv".
:returns:
Pandas DataFrame with the scores.
JSON or DataFrame with the detailed scores, in case the output is set to a CSV file, a message confirming
the successful write or an error message.
"""
inputs_columns = list(self.inputs.keys())
inputs_values = list(self.inputs.values())
inputs_values = list(map(list, zip(*inputs_values))) # transpose the values
df_inputs = DataFrame(inputs_values, columns=inputs_columns)
combined_data = {col: self.inputs[col] for col in self.inputs}
# enforce columns type consistency
scores_columns = list(self.results.keys())
scores_values = [v["individual_scores"] for v in self.results.values()]
scores_values = list(map(list, zip(*scores_values))) # transpose the values
df_scores = DataFrame(scores_values, columns=scores_columns)
for col in scores_columns:
col_values = self.results[col]["individual_scores"]
if any(isinstance(v, float) for v in col_values):
col_values = [float(v) for v in col_values]
combined_data[col] = col_values
return df_inputs.join(df_scores)
return self._handle_output(combined_data, output_format, csv_file)
def comparative_individual_scores_report(
self, other: "BaseEvaluationRunResult", keep_columns: Optional[List[str]] = None
) -> DataFrame:
def comparative_detailed_report(
self,
other: "EvaluationRunResult",
keep_columns: Optional[List[str]] = None,
output_format: Literal["json", "csv", "df"] = "json",
csv_file: Optional[str] = None,
) -> Union[str, "DataFrame", None]:
"""
Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
Generates a report with detailed scores for each metric from two evaluation runs for comparison.
The inputs to both evaluation runs are assumed to be the same.
:param other: Results of another evaluation run to compare with.
:param keep_columns: List of common column names to keep from the inputs of the evaluation runs to compare.
:param output_format: The output format for the report: "json", "csv", or "df"; defaults to "json".
:param csv_file: Filepath to save the CSV output; must be provided if `output_format` is "csv".
:param other:
Results of another evaluation run to compare with.
:param keep_columns:
List of common column names to keep from the inputs of the evaluation runs to compare.
:returns:
Pandas DataFrame with the score comparison.
JSON or DataFrame with a comparison of the detailed scores, in case the output is set to a CSV file,
a message confirming the successful write or an error message.
"""
if not isinstance(other, EvaluationRunResult):
raise ValueError("Comparative scores can only be computed between EvaluationRunResults.")
this_name = self.run_name
other_name = other.run_name
if this_name == other_name:
warn(f"The run names of the two evaluation results are the same ('{this_name}')")
this_name = f"{this_name}_first"
other_name = f"{other_name}_second"
if not hasattr(other, "run_name") or not hasattr(other, "inputs") or not hasattr(other, "results"):
raise ValueError("The 'other' parameter must have 'run_name', 'inputs', and 'results' attributes.")
if self.run_name == other.run_name:
warn(f"The run names of the two evaluation results are the same ('{self.run_name}')")
if self.inputs.keys() != other.inputs.keys():
warn(f"The input columns differ between the results; using the input columns of '{this_name}'.")
warn(f"The input columns differ between the results; using the input columns of '{self.run_name}'.")
pipe_a_df = self.to_pandas()
pipe_b_df = other.to_pandas()
# got both detailed reports
detailed_a = self.detailed_report(output_format="json")
detailed_b = other.detailed_report(output_format="json")
# ensure both detailed reports are in dictionaries format
if not isinstance(detailed_a, dict) or not isinstance(detailed_b, dict):
raise ValueError("Detailed reports must be dictionaries.")
# determine which columns to ignore
if keep_columns is None:
ignore = list(self.inputs.keys())
else:
ignore = [col for col in list(self.inputs.keys()) if col not in keep_columns]
pipe_b_df.drop(columns=ignore, inplace=True, errors="ignore")
pipe_b_df.columns = [f"{other_name}_{column}" for column in pipe_b_df.columns] # type: ignore
pipe_a_df.columns = [f"{this_name}_{col}" if col not in ignore else col for col in pipe_a_df.columns] # type: ignore
# filter out ignored columns from pipe_b_dict
filtered_detailed_b = {
f"{other.run_name}_{key}": value for key, value in detailed_b.items() if key not in ignore
}
results_df = pd_concat([pipe_a_df, pipe_b_df], axis=1)
# rename columns in pipe_a_dict based on ignore list
renamed_detailed_a = {
(key if key in ignore else f"{self.run_name}_{key}"): value for key, value in detailed_a.items()
}
return results_df
# combine both detailed reports
combined_results = {**renamed_detailed_a, **filtered_detailed_b}
return self._handle_output(combined_results, output_format, csv_file)
def score_report(self) -> "DataFrame":
"""Generates a DataFrame report with aggregated scores for each metric."""
msg = "The `score_report` method is deprecated and will be changed to `aggregated_report` in Haystack 2.11.0."
warn(msg, DeprecationWarning, stacklevel=2)
return self.aggregated_report(output_format="df")
def to_pandas(self) -> "DataFrame":
"""Generates a DataFrame report with detailed scores for each metric."""
msg = "The `to_pandas` method is deprecated and will be changed to `detailed_report` in Haystack 2.11.0."
warn(msg, DeprecationWarning, stacklevel=2)
return self.detailed_report(output_format="df")
def comparative_individual_scores_report(self, other: "EvaluationRunResult") -> "DataFrame":
"""Generates a DataFrame report with detailed scores for each metric from two evaluation runs for comparison."""
msg = (
"The `comparative_individual_scores_report` method is deprecated and will be changed to "
"`comparative_detailed_report` in Haystack 2.11.0."
)
warn(msg, DeprecationWarning, stacklevel=2)
return self.comparative_detailed_report(other, output_format="df")
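
For orientation, a minimal usage sketch of the refactored reporting API shown above; the run name, inputs, and metric values are invented for illustration, and only output_format="df" requires pandas, since the import is deferred behind the LazyImport guard and checked at call time.

from haystack.evaluation import EvaluationRunResult

# Illustrative inputs and evaluator results (values are made up for this sketch).
inputs = {
    "question": ["What is the capital of France?", "What is the capital of Spain?"],
    "predicted_answer": ["Paris", "Madrid"],
}
results = {
    "single_hit": {"score": 1.0, "individual_scores": [1, 1]},
    "faithfulness": {"score": 0.416, "individual_scores": [0.136, 0.696]},
}

run = EvaluationRunResult("my_pipeline", inputs=inputs, results=results)

# "json" is the default and returns a plain dict of lists; no pandas needed.
aggregated = run.aggregated_report()
detailed = run.detailed_report(output_format="json")

# "csv" writes the report to disk and returns a status message.
status = run.detailed_report(output_format="csv", csv_file="detailed_scores.csv")

# "df" returns a pandas DataFrame and fails with an install hint if pandas is missing.
df = run.aggregated_report(output_format="df")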

View File

@@ -0,0 +1,8 @@
---
enhancements:
- |
`EvaluationRunResult` can now output results as JSON, a pandas DataFrame, or a CSV file.
deprecations:
- |
The use of a pandas DataFrame in `EvaluationRunResult` is now optional; the methods `score_report`, `to_pandas`, and `comparative_individual_scores_report` are deprecated and will be removed in the next Haystack release.
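
For callers of the deprecated methods, the shims in eval_run_result.py above map one-to-one onto the new report methods; a short migration sketch, where run and other_run stand for any two EvaluationRunResult instances:

# Deprecated calls (they emit a DeprecationWarning) and their drop-in replacements.
run.score_report()                                   # -> run.aggregated_report(output_format="df")
run.to_pandas()                                      # -> run.detailed_report(output_format="df")
run.comparative_individual_scores_report(other_run)  # -> run.comparative_detailed_report(other_run, output_format="df")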

View File

@@ -1,6 +1,8 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
import json
from haystack.evaluation import EvaluationRunResult
import pytest
@@ -87,16 +89,24 @@ def test_score_report():
}
result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
report = result.score_report().to_json()
report = result.aggregated_report(output_format="json")
assert report == (
'{"metrics":{"0":"reciprocal_rank","1":"single_hit","2":"multi_hit","3":"context_relevance",'
'"4":"faithfulness","5":"semantic_answer_similarity"},'
'"score":{"0":0.476932,"1":0.75,"2":0.46428375,"3":0.58177975,"4":0.40585375,"5":0.53757075}}'
{
"metrics": [
"reciprocal_rank",
"single_hit",
"multi_hit",
"context_relevance",
"faithfulness",
"semantic_answer_similarity",
],
"score": [0.476932, 0.75, 0.46428375, 0.58177975, 0.40585375, 0.53757075],
}
)
def test_to_pandas():
def test_to_df():
data = {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
@@ -121,19 +131,25 @@ def test_to_pandas():
}
result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
assert result.to_pandas().to_json() == (
'{"query_id":{"0":"53c3b3e6","1":"225f87f7","2":"53c3b3e6","3":"225f87f7"},'
'"question":{"0":"What is the capital of France?","1":"What is the capital of Spain?",'
'"2":"What is the capital of Luxembourg?","3":"What is the capital of Portugal?"},'
'"contexts":{"0":"wiki_France","1":"wiki_Spain","2":"wiki_Luxembourg","3":"wiki_Portugal"},'
'"answer":{"0":"Paris","1":"Madrid","2":"Luxembourg","3":"Lisbon"},'
'"predicted_answer":{"0":"Paris","1":"Madrid","2":"Luxembourg","3":"Lisbon"},'
'"reciprocal_rank":{"0":0.378064,"1":0.534964,"2":0.216058,"3":0.778642},'
'"single_hit":{"0":1,"1":1,"2":0,"3":1},'
'"multi_hit":{"0":0.706125,"1":0.454976,"2":0.445512,"3":0.250522},'
'"context_relevance":{"0":0.805466,"1":0.410251,"2":0.75007,"3":0.361332},'
'"faithfulness":{"0":0.135581,"1":0.695974,"2":0.749861,"3":0.041999},'
'"semantic_answer_similarity":{"0":0.971241,"1":0.15932,"2":0.019722,"3":1.0}}'
assert result.detailed_report() == (
{
"query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
"question": [
"What is the capital of France?",
"What is the capital of Spain?",
"What is the capital of Luxembourg?",
"What is the capital of Portugal?",
],
"contexts": ["wiki_France", "wiki_Spain", "wiki_Luxembourg", "wiki_Portugal"],
"answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
"predicted_answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
"reciprocal_rank": [0.378064, 0.534964, 0.216058, 0.778642],
"single_hit": [1, 1, 0, 1],
"multi_hit": [0.706125, 0.454976, 0.445512, 0.250522],
"context_relevance": [0.805466, 0.410251, 0.75007, 0.361332],
"faithfulness": [0.135581, 0.695974, 0.749861, 0.041999],
"semantic_answer_similarity": [0.971241, 0.15932, 0.019722, 1.0],
}
)
@@ -176,73 +192,9 @@ def test_comparative_individual_scores_report():
result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
results = result1.comparative_individual_scores_report(result2)
results = result1.comparative_detailed_report(result2, keep_columns=["predicted_answer"])
expected = {
"query_id": {0: "53c3b3e6", 1: "225f87f7"},
"question": {0: "What is the capital of France?", 1: "What is the capital of Spain?"},
"contexts": {0: "wiki_France", 1: "wiki_Spain"},
"answer": {0: "Paris", 1: "Madrid"},
"predicted_answer": {0: "Paris", 1: "Madrid"},
"testing_pipeline_1_reciprocal_rank": {0: 0.378064, 1: 0.534964},
"testing_pipeline_1_single_hit": {0: 1, 1: 1},
"testing_pipeline_1_multi_hit": {0: 0.706125, 1: 0.454976},
"testing_pipeline_1_context_relevance": {0: 1, 1: 1},
"testing_pipeline_1_faithfulness": {0: 0.135581, 1: 0.695974},
"testing_pipeline_1_semantic_answer_similarity": {0: 0.971241, 1: 0.15932},
"testing_pipeline_2_reciprocal_rank": {0: 0.378064, 1: 0.534964},
"testing_pipeline_2_single_hit": {0: 1, 1: 1},
"testing_pipeline_2_multi_hit": {0: 0.706125, 1: 0.454976},
"testing_pipeline_2_context_relevance": {0: 1, 1: 1},
"testing_pipeline_2_faithfulness": {0: 0.135581, 1: 0.695974},
"testing_pipeline_2_semantic_answer_similarity": {0: 0.971241, 1: 0.15932},
}
assert expected == results.to_dict()
def test_comparative_individual_scores_report_keep_truth_answer_in_df():
data_1 = {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7"],
"question": ["What is the capital of France?", "What is the capital of Spain?"],
"contexts": ["wiki_France", "wiki_Spain"],
"answer": ["Paris", "Madrid"],
"predicted_answer": ["Paris", "Madrid"],
},
"metrics": {
"reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
"single_hit": {"individual_scores": [1, 1], "score": 0.75},
"multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
"context_relevance": {"individual_scores": [1, 1], "score": 1},
"faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
"semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
},
}
data_2 = {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7"],
"question": ["What is the capital of France?", "What is the capital of Spain?"],
"contexts": ["wiki_France", "wiki_Spain"],
"answer": ["Paris", "Madrid"],
"predicted_answer": ["Paris", "Madrid"],
},
"metrics": {
"reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
"single_hit": {"individual_scores": [1, 1], "score": 0.75},
"multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
"context_relevance": {"individual_scores": [1, 1], "score": 1},
"faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
"semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
},
}
result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
results = result1.comparative_individual_scores_report(result2, keep_columns=["predicted_answer"])
assert list(results.columns) == [
assert list(results.keys()) == [
"query_id",
"question",
"contexts",