[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)
To judge the quality of the results that a question-answering pipeline, or any other pipeline in Haystack, produces, it is important to evaluate it. Evaluation also shows which components of the pipeline can be improved.
The results of the evaluation can be saved as CSV files, which contain all the information needed to calculate additional metrics later on or to inspect individual predictions.
### Prepare environment
#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**
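You can verify that a GPU is actually available in the runtime with a quick check (a minimal sketch; it assumes PyTorch is installed, which Haystack pulls in as a dependency):

```python
import torch

# Prints True if a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())
```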
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download and run Elasticsearch from source.
```python
# If Docker is available: Start Elasticsearch as a Docker container
# from haystack.utils import launch_es
# launch_es()
```
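Alternatively, in Colab or other environments without Docker, you can download Elasticsearch and start it from source. A sketch, assuming Elasticsearch 7.9.2 and a notebook environment where `!` shell commands are available:

```python
# Download and extract Elasticsearch (version 7.9.2 is illustrative)
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import PIPE, STDOUT, Popen

# Elasticsearch refuses to run as root, so start it as the "daemon" user
es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"],
    stdout=PIPE,
    stderr=STDOUT,
    preexec_fn=lambda: os.setuid(1),
)

# Give the server some time to start up
! sleep 30
```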
## Fetch, Store And Preprocess the Evaluation Dataset
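The indexing code below assumes that a document store and a preprocessor have already been initialized. A minimal setup sketch, assuming Elasticsearch running on localhost; the index names are illustrative:

```python
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor

# Dedicated indices for this tutorial's documents and labels
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"

# Connect to the local Elasticsearch instance
document_store = ElasticsearchDocumentStore(
    host="localhost",
    username="",
    password="",
    index=doc_index,
    label_index=label_index,
)

# Split long documents into passages of at most 200 words
preprocessor = PreProcessor(
    split_by="word",
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
)
```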
```python
from haystack.utils import fetch_archive_from_http

# Download evaluation data: a subset of the Natural Questions development set containing
# 50 documents with one question per document and multiple annotated answers
# (archive URL as used in the Haystack tutorials)
doc_dir = "../data/nq"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# The add_eval_data() method converts the given dataset in JSON format into Haystack Document
# and Label objects, which are then indexed in their respective document and label indices in
# the document store. The method can be used with any dataset in SQuAD format.
document_store.add_eval_data(
    filename="../data/nq/nq_dev_subset_v2.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor,
)
```
## Initialize the Two Components of an ExtractiveQAPipeline: Retriever and Reader
Here we evaluate the retriever and the reader in an open-domain fashion on the full corpus of documents, i.e. a document is considered
correctly retrieved if it contains the gold answer string. The reader is evaluated based purely on the
predicted answer string, regardless of which document the answer came from and of the position of the extracted span.
The generation of predictions is separated from the calculation of metrics. This allows you to run the computation-heavy model predictions only once and then iterate flexibly on the metrics or reports you want to generate.
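The eval call below assumes the pipeline has already been built. A minimal initialization sketch, assuming a Haystack 1.x version where `BM25Retriever` is available (older releases call it `ElasticsearchRetriever`); the reader model is the one commonly used in Haystack tutorials, not prescribed by this section:

```python
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Sparse retriever on top of the Elasticsearch document store
retriever = BM25Retriever(document_store=document_store)

# Extractive reader; return_no_answer lets it predict "no answer" when appropriate
reader = FARMReader("deepset/roberta-base-squad2", top_k=4, return_no_answer=True)

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
```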
```python
from haystack.schema import EvaluationResult, MultiLabel
# We can load evaluation labels from the document store,
# opting to filter out no_answer samples
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)

# Alternative: define queries and labels directly. The commented sketch below reconstructs
# a single gold label around the original example document; the answer string is illustrative.
# eval_labels = [
#     MultiLabel(labels=[Label(
#         query="who is written in the book of life",
#         answer=Answer(answer="every person who is destined for Heaven or the World to Come"),
#         document=Document(
#             content_type="text",
#             content='Book of Life - wikipedia Book of Life Jump to: navigation, search This article is about the book mentioned in Christian and Jewish religious teachings. For other uses, see The Book of Life. In Christianity and Judaism, the Book of Life (Hebrew: ספר החיים, transliterated Sefer HaChaim; Greek: βιβλίον τῆς ζωῆς Biblíon tēs Zōēs) is the book in which God records the names of every person who is destined for Heaven or the World to Come. According to the Talmud it is open on Rosh Hashanah, as is its analog for the wicked, the Book of the Dead. For this reason extra mention is made for the Book of Life during Amidah recitations during the Days of Awe, the ten days between Rosh Hashanah, the Jewish new year, and Yom Kippur, the day of atonement (the two High Holidays, particularly in the prayer Unetaneh Tokef). Contents (hide) 1 In the Hebrew Bible 2 Book of Jubilees 3 References in the New Testament 4 The eschatological or annual roll-call 5 Fundraising 6 See also 7 Notes 8 References In the Hebrew Bible(edit) In the Hebrew Bible the Book of Life - the book or muster-roll of God - records forever all people considered righteous before God'),
#         is_correct_answer=True,
#         is_correct_document=True,
#         origin="gold-label")])
# ]
# Similar to pipeline.run() we can execute pipeline.eval()
eval_result = pipeline.eval(
labels=eval_labels,
params={"Retriever": {"top_k": 5}}
)
```
```python
# The EvaluationResult contains a pandas dataframe for each pipeline node.
# That's why there are two dataframes in the EvaluationResult of an ExtractiveQAPipeline.
retriever_result = eval_result["Retriever"]
retriever_result.head()
```
```python
reader_result = eval_result["Reader"]
reader_result.head()
```
```python
# We can filter for all documents retrieved for a given query
retriever_book_of_life = retriever_result[retriever_result['query'] == "who is written in the book of life"]
```
```python
# We can also filter for all answers predicted for a given query
reader_book_of_life = reader_result[reader_result['query'] == "who is written in the book of life"]
```
```python
# Save the evaluation result so that we can reload it later and calculate evaluation metrics without running the pipeline again.
eval_result.save("../")
```
## Calculating Evaluation Metrics
Load an EvaluationResult to quickly calculate standard evaluation metrics for all predictions, such as the F1-score of each individual prediction of the Reader node or the recall of the Retriever.
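A sketch of loading the saved result and computing metrics; the metric names follow the output of `EvaluationResult.calculate_metrics()` for an ExtractiveQAPipeline:

```python
from haystack.schema import EvaluationResult

# Reload the evaluation result saved earlier
saved_eval_result = EvaluationResult.load("../")

# Calculate standard metrics for each pipeline node
metrics = saved_eval_result.calculate_metrics()
print(f'Retriever - Recall (single relevant document): {metrics["Retriever"]["recall_single_hit"]}')
print(f'Retriever - Mean Reciprocal Rank: {metrics["Retriever"]["mrr"]}')
print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact Match: {metrics["Reader"]["exact_match"]}')
```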
A summary of the evaluation results can be printed to get a quick overview. It includes some aggregated metrics and also shows a few wrongly predicted examples.
```python
pipeline.print_eval_report(saved_eval_result)
```
## Advanced Evaluation Metrics
As an advanced evaluation metric, semantic answer similarity (SAS) can be calculated. This metric takes into account whether the meaning of a predicted answer is similar to the annotated gold answer rather than relying on string comparison alone.
To this end, SAS relies on pre-trained models. For English, we recommend "cross-encoder/stsb-roberta-large", whereas for German we recommend "deepset/gbert-large-sts". A good multilingual model is "sentence-transformers/paraphrase-multilingual-mpnet-base-v2".
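To calculate SAS, pass one of these models when running eval; the metric then shows up in the Reader's metrics. A sketch, assuming the `sas_model_name_or_path` parameter of `Pipeline.eval()` in Haystack 1.x:

```python
# Run evaluation again, this time also computing semantic answer similarity
advanced_eval_result = pipeline.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},
    sas_model_name_or_path="cross-encoder/stsb-roberta-large",
)

metrics = advanced_eval_result.calculate_metrics()
print(f'Reader - Semantic Answer Similarity: {metrics["Reader"]["sas"]}')
```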
More info on this metric can be found in our [paper](https://arxiv.org/abs/2108.06130) or in our [blog post](https://www.deepset.ai/blog/semantic-answer-similarity-to-evaluate-qa).