mirror of https://github.com/deepset-ai/haystack.git (synced 2025-11-28 16:08:37 +00:00)
fix: make pandas DataFrame optional in EvaluationRunResult (#8838)
* feat: AsyncPipeline that can schedule components to run concurrently (#8812)
* add component checks
* pipeline should run deterministically
* add FIFOQueue
* add agent tests
* add order dependent tests
* run new tests
* remove code that is not needed
* test: intermediate from cycle outputs are available outside cycle
* add tests for component checks (Claude)
* adapt tests for component checks (o1 review)
* chore: format
* remove tests that aren't needed anymore
* add _calculate_priority tests
* revert accidental change in pyproject.toml
* test format conversion
* adapt to naming convention
* chore: proper docstrings and type hints for PQ
* format
* add more unit tests
* rm unneeded comments
* test input consumption
* lint
* fix: docstrings
* lint
* format
* format
* fix license header
* fix license header
* add component run tests
* fix: pass correct input format to tracing
* fix types
* format
* format
* types
* add defaults from Socket instead of signature - otherwise components with dynamic inputs would fail
* fix test names
* still wait for optional inputs on greedy variadic sockets - mirrors previous behavior
* fix format
* wip: warn for ambiguous running order
* wip: alternative warning
* fix license header
* make code more readable Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
* Introduce content tracing to a behavioral test
* Fixing linting
* Remove debug print statements
* Fix tracer tests
* remove print
* test: test for component inputs
* test: remove testing for run order
* chore: update component checks from experimental
* chore: update pipeline and base from experimental
* refactor: remove unused method
* refactor: remove unused method
* refactor: outdated comment
* refactor: inputs state is updated as side effect - to prepare for AsyncPipeline implementation
* format
* test: add file conversion test
* format
* fix: original implementation deepcopies outputs
* lint
* fix: from_dict was updated
* fix: format
* fix: test
* test: add test for thread safety
* remove unused imports
* format
* test: FIFOPriorityQueue
* chore: add release note
* feat: add AsyncPipeline
* chore: Add release notes
* fix: format
* debug: switch run order to debug ubuntu and windows tests
* fix: consider priorities of other components while waiting for DEFER
* refactor: simplify code
* fix: resolve merge conflict with mermaid changes
* fix: format
* fix: remove unused import
* refactor: rename to avoid accidental conflicts
* fix: track pipeline type
* fix: and extend test
* fix: format
* style: sort alphabetically
* Update test/core/pipeline/features/conftest.py Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
* Update test/core/pipeline/features/conftest.py Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
* Update releasenotes/notes/feat-async-pipeline-338856a142e1318c.yaml
* fix: indentation, do not close loop
* fix: use asyncio.run
* fix: format

---------

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>

* updated changes for refactoring evaluations without pandas package
* added release notes for eval_run_result.py for refactoring EvaluationRunResult to work without pandas
* wip: cleaning and refactoring
* removing BaseEvaluationRunResult
* wip: fixing tests
* fixing tests and docstrings
* updating release notes
* fixing typing
* pylint fix
* adding deprecation warning
* fixing tests
* fixin types consistency
* adding stacklevel=2 to warning messages
* fixing docstrings
* fixing docstrings
* updating release notes

---------

Co-authored-by: mathislucka <mathis.lucka@gmail.com>
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>
This commit is contained in:
parent 2f383bce25
commit b5fb0d3ff8
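
For orientation, a minimal usage sketch of the refactored API introduced by this commit (the method names and the `output_format`/`csv_file` parameters come from the diff below; the run name, inputs, scores, and file name are made up for illustration):

```python
from haystack.evaluation import EvaluationRunResult

# Illustrative inputs and metric results; each metric carries an aggregated
# "score" and per-sample "individual_scores" matching the number of inputs.
inputs = {"query_id": ["q1", "q2"], "question": ["What is X?", "What is Y?"]}
results = {
    "reciprocal_rank": {"score": 0.5, "individual_scores": [0.25, 0.75]},
    "single_hit": {"score": 1.0, "individual_scores": [1, 1]},
}

run = EvaluationRunResult("my_run", inputs=inputs, results=results)

# Plain-dict ("json") output works without pandas installed.
print(run.aggregated_report(output_format="json"))

# CSV output writes to disk and returns a status message.
print(run.detailed_report(output_format="csv", csv_file="detailed_scores.csv"))

# DataFrame output still requires pandas (imported lazily).
df = run.detailed_report(output_format="df")
```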
@@ -2,7 +2,6 @@
 #
 # SPDX-License-Identifier: Apache-2.0
 
-from .base import BaseEvaluationRunResult
 from .eval_run_result import EvaluationRunResult
 
-__all__ = ["BaseEvaluationRunResult", "EvaluationRunResult"]
+__all__ = ["EvaluationRunResult"]
@@ -1,49 +0,0 @@
-# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
-#
-# SPDX-License-Identifier: Apache-2.0
-
-from abc import ABC, abstractmethod
-from typing import List, Optional
-
-from pandas import DataFrame
-
-
-class BaseEvaluationRunResult(ABC):
-    """
-    Represents the results of an evaluation run.
-    """
-
-    @abstractmethod
-    def to_pandas(self) -> "DataFrame":
-        """
-        Creates a Pandas DataFrame containing the scores of each metric for every input sample.
-
-        :returns:
-            Pandas DataFrame with the scores.
-        """
-
-    @abstractmethod
-    def score_report(self) -> "DataFrame":
-        """
-        Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
-
-        :returns:
-            Pandas DataFrame with the aggregated scores.
-        """
-
-    @abstractmethod
-    def comparative_individual_scores_report(
-        self, other: "BaseEvaluationRunResult", keep_columns: Optional[List[str]] = None
-    ) -> "DataFrame":
-        """
-        Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
-
-        The inputs to both evaluation runs is assumed to be the same.
-
-        :param other:
-            Results of another evaluation run to compare with.
-        :param keep_columns:
-            List of common column names to keep from the inputs of the evaluation runs to compare.
-        :returns:
-            Pandas DataFrame with the score comparison.
-        """
@@ -2,17 +2,18 @@
 #
 # SPDX-License-Identifier: Apache-2.0
 
+import csv
 from copy import deepcopy
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Literal, Optional, Union
 from warnings import warn
 
-from pandas import DataFrame
-from pandas import concat as pd_concat
+from haystack.lazy_imports import LazyImport
 
-from .base import BaseEvaluationRunResult
+with LazyImport("Run 'pip install pandas'") as pandas_import:
+    from pandas import DataFrame
 
 
-class EvaluationRunResult(BaseEvaluationRunResult):
+class EvaluationRunResult:
     """
     Contains the inputs and the outputs of an evaluation pipeline and provides methods to inspect them.
     """
@@ -23,16 +24,14 @@ class EvaluationRunResult(BaseEvaluationRunResult):
 
         :param run_name:
             Name of the evaluation run.
 
         :param inputs:
-            Dictionary containing the inputs used for the run.
-            Each key is the name of the input and its value is
-            a list of input values. The length of the lists should
-            be the same.
+            Dictionary containing the inputs used for the run. Each key is the name of the input and its value is a list
+            of input values. The length of the lists should be the same.
 
         :param results:
-            Dictionary containing the results of the evaluators
-            used in the evaluation pipeline. Each key is the name
-            of the metric and its value is dictionary with the following
-            keys:
+            Dictionary containing the results of the evaluators used in the evaluation pipeline. Each key is the name
+            of the metric and its value is dictionary with the following keys:
             - 'score': The aggregated score for the metric.
             - 'individual_scores': A list of scores for each input sample.
         """
@@ -59,77 +58,186 @@ class EvaluationRunResult(BaseEvaluationRunResult):
                     f"Got {len(outputs['individual_scores'])} but expected {expected_len}."
                 )
 
-    def score_report(self) -> DataFrame:
+    @staticmethod
+    def _write_to_csv(csv_file: str, data: Dict[str, List[Any]]) -> str:
         """
-        Transforms the results into a Pandas DataFrame with the aggregated scores for each metric.
+        Write data to a CSV file.
 
+        :param csv_file: Path to the CSV file to write
+        :param data: Dictionary containing the data to write
+        :return: Status message indicating success or failure
         """
+        list_lengths = [len(value) for value in data.values()]
+
+        if len(set(list_lengths)) != 1:
+            raise ValueError("All lists in the JSON must have the same length")
+
+        try:
+            headers = list(data.keys())
+            num_rows = list_lengths[0]
+            rows = []
+
+            for i in range(num_rows):
+                row = [data[header][i] for header in headers]
+                rows.append(row)
+
+            with open(csv_file, "w", newline="") as csvfile:
+                writer = csv.writer(csvfile)
+                writer.writerow(headers)
+                writer.writerows(rows)
+
+            return f"Data successfully written to {csv_file}"
+        except PermissionError:
+            return f"Error: Permission denied when writing to {csv_file}"
+        except IOError as e:
+            return f"Error writing to {csv_file}: {str(e)}"
+        except Exception as e:
+            return f"Error: {str(e)}"
+
+    @staticmethod
+    def _handle_output(
+        data: Dict[str, List[Any]], output_format: Literal["json", "csv", "df"] = "csv", csv_file: Optional[str] = None
+    ) -> Union[str, DataFrame, Dict[str, List[Any]]]:
+        """
+        Handles output formatting based on `output_format`.
+
+        :returns: DataFrame for 'df', dict for 'json', or confirmation message for 'csv'
+        """
+        if output_format == "json":
+            return data
+
+        elif output_format == "df":
+            pandas_import.check()
+            return DataFrame(data)
+
+        elif output_format == "csv":
+            if not csv_file:
+                raise ValueError("A file path must be provided in 'csv_file' parameter to save the CSV output.")
+            return EvaluationRunResult._write_to_csv(csv_file, data)
+
+        else:
+            raise ValueError(f"Invalid output format '{output_format}' provided. Choose from 'json', 'csv', or 'df'.")
+
+    def aggregated_report(
+        self, output_format: Literal["json", "csv", "df"] = "json", csv_file: Optional[str] = None
+    ) -> Union[Dict[str, List[Any]], "DataFrame", str]:
+        """
+        Generates a report with aggregated scores for each metric.
+
+        :param output_format: The output format for the report, "json", "csv", or "df", default to "json".
+        :param csv_file: Filepath to save CSV output if `output_format` is "csv", must be provided.
+
         :returns:
-            Pandas DataFrame with the aggregated scores.
+            JSON or DataFrame with aggregated scores, in case the output is set to a CSV file, a message confirming the
+            successful write or an error message.
         """
         results = {k: v["score"] for k, v in self.results.items()}
-        df = DataFrame.from_dict(results, orient="index", columns=["score"]).reset_index()
-        df.columns = ["metrics", "score"]
-        return df
+        data = {"metrics": list(results.keys()), "score": list(results.values())}
+        return self._handle_output(data, output_format, csv_file)
 
-    def to_pandas(self) -> DataFrame:
+    def detailed_report(
+        self, output_format: Literal["json", "csv", "df"] = "json", csv_file: Optional[str] = None
+    ) -> Union[Dict[str, List[Any]], "DataFrame", str]:
         """
-        Creates a Pandas DataFrame containing the scores of each metric for every input sample.
+        Generates a report with detailed scores for each metric.
 
+        :param output_format: The output format for the report, "json", "csv", or "df", default to "json".
+        :param csv_file: Filepath to save CSV output if `output_format` is "csv", must be provided.
+
         :returns:
-            Pandas DataFrame with the scores.
+            JSON or DataFrame with the detailed scores, in case the output is set to a CSV file, a message confirming
+            the successful write or an error message.
         """
-        inputs_columns = list(self.inputs.keys())
-        inputs_values = list(self.inputs.values())
-        inputs_values = list(map(list, zip(*inputs_values)))  # transpose the values
-        df_inputs = DataFrame(inputs_values, columns=inputs_columns)
+        combined_data = {col: self.inputs[col] for col in self.inputs}
 
+        # enforce columns type consistency
         scores_columns = list(self.results.keys())
-        scores_values = [v["individual_scores"] for v in self.results.values()]
-        scores_values = list(map(list, zip(*scores_values)))  # transpose the values
-        df_scores = DataFrame(scores_values, columns=scores_columns)
+        for col in scores_columns:
+            col_values = self.results[col]["individual_scores"]
+            if any(isinstance(v, float) for v in col_values):
+                col_values = [float(v) for v in col_values]
+            combined_data[col] = col_values
 
-        return df_inputs.join(df_scores)
+        return self._handle_output(combined_data, output_format, csv_file)
 
-    def comparative_individual_scores_report(
-        self, other: "BaseEvaluationRunResult", keep_columns: Optional[List[str]] = None
-    ) -> DataFrame:
+    def comparative_detailed_report(
+        self,
+        other: "EvaluationRunResult",
+        keep_columns: Optional[List[str]] = None,
+        output_format: Literal["json", "csv", "df"] = "json",
+        csv_file: Optional[str] = None,
+    ) -> Union[str, "DataFrame", None]:
         """
-        Creates a Pandas DataFrame with the scores for each metric in the results of two different evaluation runs.
+        Generates a report with detailed scores for each metric from two evaluation runs for comparison.
 
-        The inputs to both evaluation runs is assumed to be the same.
+        :param other: Results of another evaluation run to compare with.
+        :param keep_columns: List of common column names to keep from the inputs of the evaluation runs to compare.
+        :param output_format: The output format for the report, "json", "csv", or "df", default to "json".
+        :param csv_file: Filepath to save CSV output if `output_format` is "csv", must be provided.
 
-        :param other:
-            Results of another evaluation run to compare with.
-        :param keep_columns:
-            List of common column names to keep from the inputs of the evaluation runs to compare.
         :returns:
-            Pandas DataFrame with the score comparison.
+            JSON or DataFrame with a comparison of the detailed scores, in case the output is set to a CSV file,
+            a message confirming the successful write or an error message.
         """
 
-        if not isinstance(other, EvaluationRunResult):
-            raise ValueError("Comparative scores can only be computed between EvaluationRunResults.")
-
-        this_name = self.run_name
-        other_name = other.run_name
-        if this_name == other_name:
-            warn(f"The run names of the two evaluation results are the same ('{this_name}')")
-            this_name = f"{this_name}_first"
-            other_name = f"{other_name}_second"
+        if not hasattr(other, "run_name") or not hasattr(other, "inputs") or not hasattr(other, "results"):
+            raise ValueError("The 'other' parameter must have 'run_name', 'inputs', and 'results' attributes.")
+
+        if self.run_name == other.run_name:
+            warn(f"The run names of the two evaluation results are the same ('{self.run_name}')")
 
         if self.inputs.keys() != other.inputs.keys():
-            warn(f"The input columns differ between the results; using the input columns of '{this_name}'.")
+            warn(f"The input columns differ between the results; using the input columns of '{self.run_name}'.")
 
-        pipe_a_df = self.to_pandas()
-        pipe_b_df = other.to_pandas()
+        # got both detailed reports
+        detailed_a = self.detailed_report(output_format="json")
+        detailed_b = other.detailed_report(output_format="json")
+
+        # ensure both detailed reports are in dictionaries format
+        if not isinstance(detailed_a, dict) or not isinstance(detailed_b, dict):
+            raise ValueError("Detailed reports must be dictionaries.")
 
+        # determine which columns to ignore
         if keep_columns is None:
             ignore = list(self.inputs.keys())
         else:
             ignore = [col for col in list(self.inputs.keys()) if col not in keep_columns]
 
-        pipe_b_df.drop(columns=ignore, inplace=True, errors="ignore")
-        pipe_b_df.columns = [f"{other_name}_{column}" for column in pipe_b_df.columns]  # type: ignore
-        pipe_a_df.columns = [f"{this_name}_{col}" if col not in ignore else col for col in pipe_a_df.columns]  # type: ignore
+        # filter out ignored columns from pipe_b_dict
+        filtered_detailed_b = {
+            f"{other.run_name}_{key}": value for key, value in detailed_b.items() if key not in ignore
+        }
 
-        results_df = pd_concat([pipe_a_df, pipe_b_df], axis=1)
+        # rename columns in pipe_a_dict based on ignore list
+        renamed_detailed_a = {
+            (key if key in ignore else f"{self.run_name}_{key}"): value for key, value in detailed_a.items()
+        }
 
-        return results_df
+        # combine both detailed reports
+        combined_results = {**renamed_detailed_a, **filtered_detailed_b}
+        return self._handle_output(combined_results, output_format, csv_file)
+
+    def score_report(self) -> "DataFrame":
+        """Generates a DataFrame report with aggregated scores for each metric."""
+        msg = "The `score_report` method is deprecated and will be changed to `aggregated_report` in Haystack 2.11.0."
+        warn(msg, DeprecationWarning, stacklevel=2)
+        return self.aggregated_report(output_format="df")
+
+    def to_pandas(self) -> "DataFrame":
+        """Generates a DataFrame report with detailed scores for each metric."""
+        msg = "The `to_pandas` method is deprecated and will be changed to `detailed_report` in Haystack 2.11.0."
+        warn(msg, DeprecationWarning, stacklevel=2)
+        return self.detailed_report(output_format="df")
+
+    def comparative_individual_scores_report(self, other: "EvaluationRunResult") -> "DataFrame":
+        """Generates a DataFrame report with detailed scores for each metric from two evaluation runs for comparison."""
+        msg = (
+            "The `comparative_individual_scores_report` method is deprecated and will be changed to "
+            "`comparative_detailed_report` in Haystack 2.11.0."
+        )
+        warn(msg, DeprecationWarning, stacklevel=2)
+        return self.comparative_detailed_report(other, output_format="df")
@@ -0,0 +1,8 @@
+---
+enhancements:
+  - |
+    `EvaluationRunResult` can now output the results in JSON, a pandas Dataframe or in a CSV file.
+
+deprecations:
+  - |
+    The use of pandas Dataframe in `EvaluationRunResult` is now optional and the methods `score_report`, `to_pandas` and `comparative_individual_scores_report` are deprecated and will be removed in the next haystack release.
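
To illustrate the deprecation called out in this note, a small sketch of how the old entry points now behave (assumes pandas is installed, since the deprecated methods still return DataFrames; the sample data is illustrative):

```python
import warnings

from haystack.evaluation import EvaluationRunResult

run = EvaluationRunResult(
    "my_run",
    inputs={"question": ["What is X?"]},
    results={"faithfulness": {"score": 0.9, "individual_scores": [0.9]}},
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Deprecated alias; delegates to detailed_report(output_format="df").
    df = run.to_pandas()

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```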
@@ -1,6 +1,8 @@
 # SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
 #
 # SPDX-License-Identifier: Apache-2.0
+import json
 
 from haystack.evaluation import EvaluationRunResult
 import pytest
 
@@ -87,16 +89,24 @@ def test_score_report():
     }
 
     result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
-    report = result.score_report().to_json()
+    report = result.aggregated_report(output_format="json")
 
     assert report == (
-        '{"metrics":{"0":"reciprocal_rank","1":"single_hit","2":"multi_hit","3":"context_relevance",'
-        '"4":"faithfulness","5":"semantic_answer_similarity"},'
-        '"score":{"0":0.476932,"1":0.75,"2":0.46428375,"3":0.58177975,"4":0.40585375,"5":0.53757075}}'
+        {
+            "metrics": [
+                "reciprocal_rank",
+                "single_hit",
+                "multi_hit",
+                "context_relevance",
+                "faithfulness",
+                "semantic_answer_similarity",
+            ],
+            "score": [0.476932, 0.75, 0.46428375, 0.58177975, 0.40585375, 0.53757075],
+        }
     )
 
 
-def test_to_pandas():
+def test_to_df():
     data = {
         "inputs": {
             "query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
@@ -121,19 +131,25 @@ def test_to_pandas():
     }
 
     result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
-    assert result.to_pandas().to_json() == (
-        '{"query_id":{"0":"53c3b3e6","1":"225f87f7","2":"53c3b3e6","3":"225f87f7"},'
-        '"question":{"0":"What is the capital of France?","1":"What is the capital of Spain?",'
-        '"2":"What is the capital of Luxembourg?","3":"What is the capital of Portugal?"},'
-        '"contexts":{"0":"wiki_France","1":"wiki_Spain","2":"wiki_Luxembourg","3":"wiki_Portugal"},'
-        '"answer":{"0":"Paris","1":"Madrid","2":"Luxembourg","3":"Lisbon"},'
-        '"predicted_answer":{"0":"Paris","1":"Madrid","2":"Luxembourg","3":"Lisbon"},'
-        '"reciprocal_rank":{"0":0.378064,"1":0.534964,"2":0.216058,"3":0.778642},'
-        '"single_hit":{"0":1,"1":1,"2":0,"3":1},'
-        '"multi_hit":{"0":0.706125,"1":0.454976,"2":0.445512,"3":0.250522},'
-        '"context_relevance":{"0":0.805466,"1":0.410251,"2":0.75007,"3":0.361332},'
-        '"faithfulness":{"0":0.135581,"1":0.695974,"2":0.749861,"3":0.041999},'
-        '"semantic_answer_similarity":{"0":0.971241,"1":0.15932,"2":0.019722,"3":1.0}}'
+    assert result.detailed_report() == (
+        {
+            "query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
+            "question": [
+                "What is the capital of France?",
+                "What is the capital of Spain?",
+                "What is the capital of Luxembourg?",
+                "What is the capital of Portugal?",
+            ],
+            "contexts": ["wiki_France", "wiki_Spain", "wiki_Luxembourg", "wiki_Portugal"],
+            "answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
+            "predicted_answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
+            "reciprocal_rank": [0.378064, 0.534964, 0.216058, 0.778642],
+            "single_hit": [1, 1, 0, 1],
+            "multi_hit": [0.706125, 0.454976, 0.445512, 0.250522],
+            "context_relevance": [0.805466, 0.410251, 0.75007, 0.361332],
+            "faithfulness": [0.135581, 0.695974, 0.749861, 0.041999],
+            "semantic_answer_similarity": [0.971241, 0.15932, 0.019722, 1.0],
+        }
     )
 
 
@@ -176,73 +192,9 @@ def test_comparative_individual_scores_report():
 
     result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
     result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
-    results = result1.comparative_individual_scores_report(result2)
+    results = result1.comparative_detailed_report(result2, keep_columns=["predicted_answer"])
 
-    expected = {
-        "query_id": {0: "53c3b3e6", 1: "225f87f7"},
-        "question": {0: "What is the capital of France?", 1: "What is the capital of Spain?"},
-        "contexts": {0: "wiki_France", 1: "wiki_Spain"},
-        "answer": {0: "Paris", 1: "Madrid"},
-        "predicted_answer": {0: "Paris", 1: "Madrid"},
-        "testing_pipeline_1_reciprocal_rank": {0: 0.378064, 1: 0.534964},
-        "testing_pipeline_1_single_hit": {0: 1, 1: 1},
-        "testing_pipeline_1_multi_hit": {0: 0.706125, 1: 0.454976},
-        "testing_pipeline_1_context_relevance": {0: 1, 1: 1},
-        "testing_pipeline_1_faithfulness": {0: 0.135581, 1: 0.695974},
-        "testing_pipeline_1_semantic_answer_similarity": {0: 0.971241, 1: 0.15932},
-        "testing_pipeline_2_reciprocal_rank": {0: 0.378064, 1: 0.534964},
-        "testing_pipeline_2_single_hit": {0: 1, 1: 1},
-        "testing_pipeline_2_multi_hit": {0: 0.706125, 1: 0.454976},
-        "testing_pipeline_2_context_relevance": {0: 1, 1: 1},
-        "testing_pipeline_2_faithfulness": {0: 0.135581, 1: 0.695974},
-        "testing_pipeline_2_semantic_answer_similarity": {0: 0.971241, 1: 0.15932},
-    }
-
-    assert expected == results.to_dict()
-
-
-def test_comparative_individual_scores_report_keep_truth_answer_in_df():
-    data_1 = {
-        "inputs": {
-            "query_id": ["53c3b3e6", "225f87f7"],
-            "question": ["What is the capital of France?", "What is the capital of Spain?"],
-            "contexts": ["wiki_France", "wiki_Spain"],
-            "answer": ["Paris", "Madrid"],
-            "predicted_answer": ["Paris", "Madrid"],
-        },
-        "metrics": {
-            "reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
-            "single_hit": {"individual_scores": [1, 1], "score": 0.75},
-            "multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
-            "context_relevance": {"individual_scores": [1, 1], "score": 1},
-            "faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
-            "semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
-        },
-    }
-
-    data_2 = {
-        "inputs": {
-            "query_id": ["53c3b3e6", "225f87f7"],
-            "question": ["What is the capital of France?", "What is the capital of Spain?"],
-            "contexts": ["wiki_France", "wiki_Spain"],
-            "answer": ["Paris", "Madrid"],
-            "predicted_answer": ["Paris", "Madrid"],
-        },
-        "metrics": {
-            "reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
-            "single_hit": {"individual_scores": [1, 1], "score": 0.75},
-            "multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
-            "context_relevance": {"individual_scores": [1, 1], "score": 1},
-            "faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
-            "semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
-        },
-    }
-
-    result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
-    result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
-    results = result1.comparative_individual_scores_report(result2, keep_columns=["predicted_answer"])
-
-    assert list(results.columns) == [
+    assert list(results.keys()) == [
         "query_id",
         "question",
         "contexts",