# Download the evaluation data, a subset of the Natural Questions development set containing 50 documents with one question per document and multiple annotated answers.
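# A minimal sketch of fetching and unpacking the dataset with Haystack's download helper.
# The URL and output directory below are placeholders; use the dataset location given in the tutorial/documentation.
from haystack.utils import fetch_archive_from_http

doc_dir = "data/nq_eval_subset"  # placeholder output directory
fetch_archive_from_http(
    url="https://example.com/nq_dev_subset.json.zip",  # placeholder URL for the evaluation set
    output_dir=doc_dir,
)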
# The SentenceTransformer model "sentence-transformers/multi-qa-mpnet-base-dot-v1" generally works well with the EmbeddingRetriever on any kind of English text.
# For more information and suggestions on different models check out the documentation at: https://www.sbert.net/docs/pretrained_models.html
# from haystack.retriever import EmbeddingRetriever, DensePassageRetriever
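# A minimal sketch of setting up the retriever with that model, assuming a `document_store` has already
# been initialized. In recent Haystack 1.x versions the import lives in `haystack.nodes`.
from haystack.nodes import EmbeddingRetriever

embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
)
# Compute and store the document embeddings so the retriever can search over them.
document_store.update_embeddings(embedding_retriever)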
# Answers are considered correct if the predicted answer matches the gold answer in the labels.
# Documents are considered correct if the predicted document ID matches the gold document ID in the labels.
# Sometimes, these simple definitions of "correctness" are not sufficient.
# There are cases where you want to further specify the "scope" within which an answer or a document is considered correct.
# For this reason, `EvaluationResult.calculate_metrics()` offers the parameters `answer_scope` and `document_scope`.
#
# Say you want to ensure that an answer is only considered correct if it stems from a specific context of surrounding words.
# This is especially useful if your answer is very short, like a date (for example, "2011") or a place ("Berlin").
# Such a short answer might easily appear in multiple completely different contexts.
# Some of those contexts might perfectly fit the actual question and answer it.
# Some others might not: they don't relate to the question at all but still contain the answer string.
# In that case, you might want to ensure that only answers that stem from the correct context are considered correct.
# To do that, specify `answer_scope="context"` in `calculate_metrics()`.
#
# `answer_scope` takes the following values:
# - `any` (default): Any matching answer is considered correct.
# - `context`: The answer is only considered correct if its context matches as well. It uses fuzzy matching (see `context_matching` parameters of `pipeline.eval()`).
# - `document_id`: The answer is only considered correct if its document ID matches as well. You can specify a custom document ID through the `custom_document_id_field` parameter of `pipeline.eval()`.
# - `document_id_and_context`: The answer is only considered correct if its document ID and its context match as well.
#
# In Question Answering, to enforce that the retrieved document is considered correct whenever the answer is correct, set `document_scope` to `answer` or `document_id_or_answer`.
#
# `document_scope` takes the following values:
# - `document_id`: Specifies that the document ID must match. You can specify a custom document ID through the `custom_document_id_field` parameter of `pipeline.eval()`.
# - `context`: Specifies that the content of the document must match. It uses fuzzy matching (see the `context_matching` parameters of `pipeline.eval()`).
# - `document_id_and_context`: A Boolean operation specifying that both `'document_id' AND 'context'` must match.
# - `document_id_or_context`: A Boolean operation specifying that either `'document_id' OR 'context'` must match.
# - `answer`: Specifies that the document contents must include the answer. The selected `answer_scope` is enforced.
# - `document_id_or_answer` (default): A Boolean operation specifying that either `'document_id' OR 'answer'` must match.
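# A minimal sketch of passing the two scope parameters, assuming `eval_result` holds the
# `EvaluationResult` returned by `pipeline.eval()`:
metrics = eval_result.calculate_metrics(answer_scope="context")
strict_metrics = eval_result.calculate_metrics(answer_scope="context", document_scope="document_id")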
# Let's try Document Retrieval on a file level: it's sufficient if the correct file, identified by its name (for example, 'Book of Life'), was retrieved.
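# A hedged sketch of file-level evaluation, assuming the documents carry the file name in a "name" meta field
# and that `pipeline` and `eval_labels` are the query pipeline and labels used above:
file_level_eval_result = pipeline.eval(
    labels=eval_labels,
    params={"Retriever": {"top_k": 5}},  # "Retriever" is an assumed node name
    custom_document_id_field="name",
)
file_level_metrics = file_level_eval_result.calculate_metrics(document_scope="document_id")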
# Storing evaluation results in CSVs is fine but not enough if you want to compare and track multiple evaluation runs. MLflow is a handy tool when it comes to tracking experiments. So we decided to use it to track all `Pipeline.eval()` runs, with reproducibility of your experiments in mind.
# ### Host your own MLflow or use deepset's public MLflow
# If you don't want to use deepset's public MLflow instance at https://public-mlflow.deepset.ai, you can easily host it yourself.
# !pip install mlflow
# !mlflow server --serve-artifacts
# ### Preprocessing the dataset
# Preprocessing the dataset works a bit differently than before. Instead of directly generating documents (and labels) out of a SQuAD file, we first save them to disk. This is necessary to experiment with different indexing pipelines.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import PreProcessor

# Use a fresh in-memory document store for this experiment.
document_store = InMemoryDocumentStore()

# Preprocessor used when indexing the evaluation labels: split into passages of 200 words, with cleaning disabled.
label_preprocessor = PreProcessor(
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)
# The add_eval_data() method converts the given dataset in JSON format into Haystack Document and Label objects.
# Those objects are then indexed in their respective document and label index in the document store.
# The method can be used with any dataset in SQuAD format.
# We only use it to get the evaluation set labels and the corpus files.
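# A hedged sketch of how these pieces fit together: index the SQuAD-format file, pull the aggregated labels,
# write the corpus back to disk, and kick off a tracked evaluation run. The file path and run name are
# placeholders, `index_pipeline` and `query_pipeline` are assumed to be defined as earlier in the tutorial,
# and the exact parameter names of `Pipeline.execute_eval_run()` may differ between Haystack versions.
import tempfile
from pathlib import Path

from haystack.pipelines import Pipeline

document_store.add_eval_data(
    filename="data/nq_dev_subset.json",  # placeholder: path to the downloaded evaluation set
    doc_index=document_store.index,
    label_index=document_store.label_index,
    preprocessor=label_preprocessor,
)
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True)

# Write the corpus documents to temporary text files so different indexing pipelines can be compared.
temp_dir = tempfile.TemporaryDirectory()
file_paths = []
for doc in document_store.get_all_documents():
    file_path = Path(temp_dir.name) / f"{doc.id}.txt"
    file_path.write_text(doc.content)
    file_paths.append(str(file_path))

# Run the evaluation with MLflow tracking. For a self-hosted server, point
# `experiment_tracking_uri` at your own instance (for example, http://localhost:5000).
eval_result = Pipeline.execute_eval_run(
    index_pipeline=index_pipeline,
    query_pipeline=query_pipeline,
    evaluation_set_labels=eval_labels,
    corpus_file_paths=file_paths,
    experiment_tracking_tool="mlflow",
    experiment_tracking_uri="https://public-mlflow.deepset.ai",
    experiment_name="haystack-eval-experiment",
    experiment_run_name="sparse-retriever-run",  # placeholder run name
)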
# You can now open MLflow (e.g. https://public-mlflow.deepset.ai/ if you used the public one hosted by deepset) and look for the haystack-eval-experiment experiment.
# Try out MLflow's compare function and have fun...
#
# Note that on our public MLflow instance we are not able to log artifacts like the evaluation results or the pipelines.yaml file.
# Just as a sanity check, we can compare the recall from `retriever.eval()` with the multi hit recall from `pipeline.eval(add_isolated_node_eval=True)`.
# These two recall metrics are only comparable because we filtered out no_answer samples when generating eval_labels and set document_scope to `"document_id"`.
# By default, `calculate_metrics()` has document_scope set to `"document_id_or_answer"`, which considers a document relevant if it either matches the gold document ID or contains the answer.
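# A hedged sketch of the comparison. The "Retriever" node name, the metric keys, and the index names are
# assumptions that depend on your pipeline and Haystack version.
retriever_eval_results = retriever.eval(
    top_k=10,
    label_index=document_store.label_index,
    doc_index=document_store.index,
)
pipeline_metrics = eval_result.calculate_metrics(document_scope="document_id")
print("Retriever Recall (retriever.eval()):", retriever_eval_results["recall"])
print("Retriever Multi-Hit Recall (pipeline.eval()):", pipeline_metrics["Retriever"]["recall_multi_hit"])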
# Reader Top-1-Exact Match is the proportion of questions where the first predicted answer is exactly the same as the correct answer (including no_answers).
# Reader Top-N-Exact Match is the proportion of questions where at least one of the top n predicted answers is exactly the same as the correct answer, excluding no_answers (no_answers are always present within the top n).
print(f"Reader Top-{top_n}-Exact Match (without no_answers):",reader_eval_results["top_n_EM_text_answer"])
# Reader Top-N-F1-Score is the average overlap between the top n predicted answers and the correct answers excluding no_answers (no_answers are always present within top n).
# Just as a sanity check, we can compare the top-n exact_match and f1 metrics from `reader.eval()` with the exact_match and f1 from `pipeline.eval(add_isolated_node_eval=True)`.
# These two approaches return the same values because pipeline.eval() calculates top-n metrics by default.
# Small discrepancies might occur due to string normalization in pipeline.eval()'s answer-to-label comparison.
# reader.eval() does not use string normalization.
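# A hedged sketch of the comparison. The "Reader" node name, the `eval_mode` parameter, and the metric keys
# are assumptions that may differ between Haystack versions.
isolated_metrics = eval_result.calculate_metrics(eval_mode="isolated")
print("Reader Top-N-Exact Match (pipeline.eval()):", isolated_metrics["Reader"]["exact_match"])
print("Reader Top-N-F1-Score (pipeline.eval()):", isolated_metrics["Reader"]["f1"])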