haystack/test/evaluation/test_eval_run_result.py
Hemanth Taduka b5fb0d3ff8
fix: make pandas DataFrame optional in EvaluationRunResult (#8838)
* feat: AsyncPipeline that can schedule components to run concurrently (#8812)

* add component checks

* pipeline should run deterministically

* add FIFOQueue

* add agent tests

* add order dependent tests

* run new tests

* remove code that is not needed

* test: intermediate outputs from a cycle are available outside the cycle

* add tests for component checks (Claude)

* adapt tests for component checks (o1 review)

* chore: format

* remove tests that aren't needed anymore

* add _calculate_priority tests

* revert accidental change in pyproject.toml

* test format conversion

* adapt to naming convention

* chore: proper docstrings and type hints for PQ

* format

* add more unit tests

* rm unneeded comments

* test input consumption

* lint

* fix: docstrings

* lint

* format

* format

* fix license header

* fix license header

* add component run tests

* fix: pass correct input format to tracing

* fix types

* format

* format

* types

* add defaults from Socket instead of signature

- otherwise components with dynamic inputs would fail

* fix test names

* still wait for optional inputs on greedy variadic sockets

- mirrors previous behavior

* fix format

* wip: warn for ambiguous running order

* wip: alternative warning

* fix license header

* make code more readable

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Introduce content tracing to a behavioral test

* Fixing linting

* Remove debug print statements

* Fix tracer tests

* remove print

* test: test for component inputs

* test: remove testing for run order

* chore: update component checks from experimental

* chore: update pipeline and base from experimental

* refactor: remove unused method

* refactor: remove unused method

* refactor: outdated comment

* refactor: inputs state is updated as a side effect

- to prepare for AsyncPipeline implementation

* format

* test: add file conversion test

* format

* fix: original implementation deepcopies outputs

* lint

* fix: from_dict was updated

* fix: format

* fix: test

* test: add test for thread safety

* remove unused imports

* format

* test: FIFOPriorityQueue

* chore: add release note

* feat: add AsyncPipeline

* chore: Add release notes

* fix: format

* debug: switch run order to debug ubuntu and windows tests

* fix: consider priorities of other components while waiting for DEFER

* refactor: simplify code

* fix: resolve merge conflict with mermaid changes

* fix: format

* fix: remove unused import

* refactor: rename to avoid accidental conflicts

* fix: track pipeline type

* fix and extend test

* fix: format

* style: sort alphabetically

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Update test/core/pipeline/features/conftest.py

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Update releasenotes/notes/feat-async-pipeline-338856a142e1318c.yaml

* fix: indentation, do not close loop

* fix: use asyncio.run (see the sketch after this message)

* fix: format

---------

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>

* updated changes for refactoring evaluations without the pandas package

* added release notes for refactoring EvaluationRunResult to work without pandas

* wip: cleaning and refactoring

* removing BaseEvaluationRunResult

* wip: fixing tests

* fixing tests and docstrings

* updating release notes

* fixing typing

* pylint fix

* adding deprecation warning

* fixing tests

* fixing type consistency

* adding stacklevel=2 to warning messages (see the sketch after this message)

* fixing docstrings

* fixing docstrings

* updating release notes

---------

Co-authored-by: mathislucka <mathis.lucka@gmail.com>
Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-02-17 14:43:54 +01:00
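
Two of the bullets above describe generic patterns worth a quick illustration. First, "adding stacklevel=2 to warning messages": a minimal, self-contained sketch of the standard-library pattern (the function names here are hypothetical, not Haystack's actual API). With stacklevel=2, Python attributes the warning to the caller's line rather than to the wrapper that emits it:

import warnings

def new_report() -> dict:
    return {"score": 1.0}

def old_report() -> dict:
    # Hypothetical deprecated alias kept for backwards compatibility.
    warnings.warn(
        "old_report is deprecated; use new_report instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return new_report()

old_report()  # the emitted DeprecationWarning references this call site

Second, "fix: use asyncio.run": a sketch of that pattern, assuming the earlier code created and closed an event loop by hand. asyncio.run creates the loop, runs the coroutine to completion, and closes the loop cleanly:

import asyncio

async def run_async_pipeline() -> str:
    # Stand-in for awaiting an AsyncPipeline; sleeps instead of scheduling components.
    await asyncio.sleep(0)
    return "done"

print(asyncio.run(run_async_pipeline()))  # replaces manual loop.run_until_complete() / loop.close()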

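The test file below pins down the dict-based report API that makes pandas optional. For orientation, a minimal usage sketch grounded in the assertions below (the run name, inputs, and scores are illustrative):

from haystack.evaluation import EvaluationRunResult

# Illustrative fixtures, same shape as the ones in the tests below.
inputs = {
    "question": ["What is the capital of France?", "What is the capital of Spain?"],
    "predicted_answer": ["Paris", "Madrid"],
}
metrics = {"some_metric": {"score": 0.5, "individual_scores": [0.4, 0.6]}}

result = EvaluationRunResult("demo_pipeline", inputs=inputs, results=metrics)

# With output_format="json" the reports are plain Python dicts, no pandas required.
aggregated = result.aggregated_report(output_format="json")  # {"metrics": [...], "score": [...]}
detailed = result.detailed_report()  # inputs plus one list of scores per metric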

# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import json

import pytest

from haystack.evaluation import EvaluationRunResult

def test_init_results_evaluator():
    data = {
        "inputs": {
            "query_id": ["53c3b3e6", "225f87f7"],
            "question": ["What is the capital of France?", "What is the capital of Spain?"],
            "contexts": ["wiki_France", "wiki_Spain"],
            "answer": ["Paris", "Madrid"],
            "predicted_answer": ["Paris", "Madrid"],
        },
        "metrics": {
            "reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
            "single_hit": {"individual_scores": [1, 1], "score": 0.75},
            "multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
            "context_relevance": {"individual_scores": [0.805466, 0.410251], "score": 0.58177975},
            "faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
            "semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
        },
    }

    _ = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])

    with pytest.raises(ValueError, match="No inputs provided"):
        _ = EvaluationRunResult("testing_pipeline_1", inputs={}, results={})

    with pytest.raises(ValueError, match="Lengths of the inputs should be the same"):
        _ = EvaluationRunResult(
            "testing_pipeline_1",
            inputs={"query_id": ["53c3b3e6", "something else"], "question": ["What is the capital of France?"]},
            results={"some": {"score": 0.1, "individual_scores": [0.378064, 0.534964]}},
        )

    with pytest.raises(ValueError, match="Aggregate score missing"):
        _ = EvaluationRunResult(
            "testing_pipeline_1",
            inputs={
                "query_id": ["53c3b3e6", "something else"],
                "question": ["What is the capital of France?", "another"],
            },
            results={"some": {"individual_scores": [0.378064, 0.534964]}},
        )

    with pytest.raises(ValueError, match="Individual scores missing"):
        _ = EvaluationRunResult(
            "testing_pipeline_1",
            inputs={
                "query_id": ["53c3b3e6", "something else"],
                "question": ["What is the capital of France?", "another"],
            },
            results={"some": {"score": 0.378064}},
        )

    with pytest.raises(ValueError, match="Length of individual scores .* should be the same as the inputs"):
        _ = EvaluationRunResult(
            "testing_pipeline_1",
            inputs={
                "query_id": ["53c3b3e6", "something else"],
                "question": ["What is the capital of France?", "another"],
            },
            results={"some": {"score": 0.1, "individual_scores": [0.378064, 0.534964, 0.3]}},
        )

def test_score_report():
    data = {
        "inputs": {
            "query_id": ["53c3b3e6", "225f87f7"],
            "question": ["What is the capital of France?", "What is the capital of Spain?"],
            "contexts": ["wiki_France", "wiki_Spain"],
            "answer": ["Paris", "Madrid"],
            "predicted_answer": ["Paris", "Madrid"],
        },
        "metrics": {
            "reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
            "single_hit": {"individual_scores": [1, 1], "score": 0.75},
            "multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
            "context_relevance": {"individual_scores": [0.805466, 0.410251], "score": 0.58177975},
            "faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
            "semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
        },
    }

    result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
    report = result.aggregated_report(output_format="json")
    assert report == (
        {
            "metrics": [
                "reciprocal_rank",
                "single_hit",
                "multi_hit",
                "context_relevance",
                "faithfulness",
                "semantic_answer_similarity",
            ],
            "score": [0.476932, 0.75, 0.46428375, 0.58177975, 0.40585375, 0.53757075],
        }
    )

def test_to_df():
    data = {
        "inputs": {
            "query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
            "question": [
                "What is the capital of France?",
                "What is the capital of Spain?",
                "What is the capital of Luxembourg?",
                "What is the capital of Portugal?",
            ],
            "contexts": ["wiki_France", "wiki_Spain", "wiki_Luxembourg", "wiki_Portugal"],
            "answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
            "predicted_answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
        },
        "metrics": {
            "reciprocal_rank": {"score": 0.1, "individual_scores": [0.378064, 0.534964, 0.216058, 0.778642]},
            "single_hit": {"score": 0.1, "individual_scores": [1, 1, 0, 1]},
            "multi_hit": {"score": 0.1, "individual_scores": [0.706125, 0.454976, 0.445512, 0.250522]},
            "context_relevance": {"score": 0.1, "individual_scores": [0.805466, 0.410251, 0.750070, 0.361332]},
            "faithfulness": {"score": 0.1, "individual_scores": [0.135581, 0.695974, 0.749861, 0.041999]},
            "semantic_answer_similarity": {"score": 0.1, "individual_scores": [0.971241, 0.159320, 0.019722, 1]},
        },
    }

    result = EvaluationRunResult("testing_pipeline_1", inputs=data["inputs"], results=data["metrics"])
    assert result.detailed_report() == (
        {
            "query_id": ["53c3b3e6", "225f87f7", "53c3b3e6", "225f87f7"],
            "question": [
                "What is the capital of France?",
                "What is the capital of Spain?",
                "What is the capital of Luxembourg?",
                "What is the capital of Portugal?",
            ],
            "contexts": ["wiki_France", "wiki_Spain", "wiki_Luxembourg", "wiki_Portugal"],
            "answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
            "predicted_answer": ["Paris", "Madrid", "Luxembourg", "Lisbon"],
            "reciprocal_rank": [0.378064, 0.534964, 0.216058, 0.778642],
            "single_hit": [1, 1, 0, 1],
            "multi_hit": [0.706125, 0.454976, 0.445512, 0.250522],
            "context_relevance": [0.805466, 0.410251, 0.75007, 0.361332],
            "faithfulness": [0.135581, 0.695974, 0.749861, 0.041999],
            "semantic_answer_similarity": [0.971241, 0.15932, 0.019722, 1.0],
        }
    )

def test_comparative_individual_scores_report():
    data_1 = {
        "inputs": {
            "query_id": ["53c3b3e6", "225f87f7"],
            "question": ["What is the capital of France?", "What is the capital of Spain?"],
            "contexts": ["wiki_France", "wiki_Spain"],
            "answer": ["Paris", "Madrid"],
            "predicted_answer": ["Paris", "Madrid"],
        },
        "metrics": {
            "reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
            "single_hit": {"individual_scores": [1, 1], "score": 0.75},
            "multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
            "context_relevance": {"individual_scores": [1, 1], "score": 1},
            "faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
            "semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
        },
    }
    data_2 = {
        "inputs": {
            "query_id": ["53c3b3e6", "225f87f7"],
            "question": ["What is the capital of France?", "What is the capital of Spain?"],
            "contexts": ["wiki_France", "wiki_Spain"],
            "answer": ["Paris", "Madrid"],
            "predicted_answer": ["Paris", "Madrid"],
        },
        "metrics": {
            "reciprocal_rank": {"individual_scores": [0.378064, 0.534964], "score": 0.476932},
            "single_hit": {"individual_scores": [1, 1], "score": 0.75},
            "multi_hit": {"individual_scores": [0.706125, 0.454976], "score": 0.46428375},
            "context_relevance": {"individual_scores": [1, 1], "score": 1},
            "faithfulness": {"individual_scores": [0.135581, 0.695974], "score": 0.40585375},
            "semantic_answer_similarity": {"individual_scores": [0.971241, 0.159320], "score": 0.53757075},
        },
    }

    result1 = EvaluationRunResult("testing_pipeline_1", inputs=data_1["inputs"], results=data_1["metrics"])
    result2 = EvaluationRunResult("testing_pipeline_2", inputs=data_2["inputs"], results=data_2["metrics"])
    results = result1.comparative_detailed_report(result2, keep_columns=["predicted_answer"])

    assert list(results.keys()) == [
        "query_id",
        "question",
        "contexts",
        "answer",
        "testing_pipeline_1_predicted_answer",
        "testing_pipeline_1_reciprocal_rank",
        "testing_pipeline_1_single_hit",
        "testing_pipeline_1_multi_hit",
        "testing_pipeline_1_context_relevance",
        "testing_pipeline_1_faithfulness",
        "testing_pipeline_1_semantic_answer_similarity",
        "testing_pipeline_2_predicted_answer",
        "testing_pipeline_2_reciprocal_rank",
        "testing_pipeline_2_single_hit",
        "testing_pipeline_2_multi_hit",
        "testing_pipeline_2_context_relevance",
        "testing_pipeline_2_faithfulness",
        "testing_pipeline_2_semantic_answer_similarity",
    ]