
Module evaluator
EvalDocuments
class EvalDocuments(BaseComponent)
This is a pipeline node that should be placed after a node that returns a List of Documents, e.g., a Retriever or Ranker, in order to assess its performance. Performance metrics are stored in this class and updated as each sample passes through it. To view the results of the evaluation, call EvalDocuments.print(). Note that results from this node may differ from those obtained by calling Retriever.eval(), since the latter is a closed-domain evaluation. Have a look at our evaluation tutorial for more info about open vs. closed domain eval (https://haystack.deepset.ai/tutorials/evaluation).
EvalDocuments node is deprecated and will be removed in a future version. Please use pipeline.eval() instead.
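Since this node is deprecated, the recommended replacement is the Pipeline.eval() API mentioned above. A minimal sketch, assuming a Haystack 1.x query pipeline and a list of evaluation labels (the names `pipeline`, `eval_labels`, and the `Retriever` node name are illustrative placeholders):

```python
# `pipeline` is an existing query pipeline and `eval_labels` a list of
# MultiLabel objects; both names are placeholders for this sketch.
eval_result = pipeline.eval(labels=eval_labels, params={"Retriever": {"top_k": 10}})

# Per-node metrics, e.g., retriever recall and reader F1 / exact match.
metrics = eval_result.calculate_metrics()
print(metrics)
```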
EvalDocuments.__init__
def __init__(debug: bool = False, open_domain: bool = True, top_k: int = 10)
Arguments:
open_domain
: When True, a document is considered correctly retrieved as long as the answer string can be found within it. When False, correct retrieval is evaluated based on document_id.
debug
: When True, a record of each sample and its evaluation will be stored in EvalDocuments.log.
top_k
: Calculate eval metrics for the top k results, e.g., recall@k.
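As a rough usage sketch (Haystack 1.x import paths assumed; `retriever` and `eval_labels` are illustrative placeholders, not part of this module), the node is placed directly after the Retriever and fed one sample at a time:

```python
from haystack.nodes import EvalDocuments
from haystack.pipelines import Pipeline

eval_docs = EvalDocuments(open_domain=True, top_k=10)

p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=eval_docs, name="EvalDocuments", inputs=["Retriever"])

# Run one evaluation sample at a time together with its labels,
# then print the aggregated metrics (e.g., recall@top_k).
for label in eval_labels:
    p.run(query=label.query, labels=label)
eval_docs.print()
```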
EvalDocuments.run
def run(documents: List[Document], labels: List[Label], top_k: Optional[int] = None)
Run this node on one sample and its labels
EvalDocuments.print
def print()
Print the evaluation results
EvalAnswers
class EvalAnswers(BaseComponent)
This is a pipeline node that should be placed after a Reader in order to assess the performance of the Reader individually or to assess the extractive QA performance of the whole pipeline. Performance metrics are stored in this class and updated as each sample passes through it. To view the results of the evaluation, call EvalAnswers.print(). Note that results from this node may differ from those obtained by calling Reader.eval(), since the latter is a closed-domain evaluation. Have a look at our evaluation tutorial for more info about open vs. closed domain eval (https://haystack.deepset.ai/tutorials/evaluation).
EvalAnswers node is deprecated and will be removed in a future version. Please use pipeline.eval() instead.
EvalAnswers.__init__
def __init__(skip_incorrect_retrieval: bool = True, open_domain: bool = True, sas_model: str = None, debug: bool = False)
Arguments:
skip_incorrect_retrieval
: When set to True, this eval will ignore the cases where the retriever returned no correct documents
open_domain
: When True, extracted answers are evaluated purely on string similarity rather than the position of the extracted answer
sas_model
: Name or path of a "Semantic Answer Similarity (SAS)" model. When set, the model will be used to calculate the similarity between predictions and labels and to generate the SAS metric. The SAS metric correlates better with human judgement of correct answers as it does not rely on string overlaps. Example: Prediction = "30%", Label = "thirty percent"; EM and F1 would be overly pessimistic with both being 0, while SAS paints a more realistic picture. More info in the paper: https://arxiv.org/abs/2108.06130
Models:
- You can use Bi-Encoders (sentence transformers) or Cross-Encoders trained on Semantic Textual Similarity (STS) data. Not all Cross-Encoders can be used because of different return types. If you use custom Cross-Encoders, please make sure they work with the sentence_transformers.CrossEncoder class.
- Good default for multiple languages: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
- Large, powerful, but slow model for English only: "cross-encoder/stsb-roberta-large"
- Large model for German only: "deepset/gbert-large-sts"
debug
: When True, a record of each sample and its evaluation will be stored in EvalAnswers.log
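As with EvalDocuments, a hedged usage sketch (Haystack 1.x import paths assumed; `p` is the evaluation pipeline from the earlier sketch and `reader` an already initialized Reader):

```python
from haystack.nodes import EvalAnswers

eval_answers = EvalAnswers(
    skip_incorrect_retrieval=True,
    open_domain=True,
    sas_model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
)

# Placed after the Reader, which in turn follows the EvalDocuments node.
p.add_node(component=reader, name="Reader", inputs=["EvalDocuments"])
p.add_node(component=eval_answers, name="EvalAnswers", inputs=["Reader"])

# After all evaluation samples have been run through the pipeline:
eval_answers.print(mode="reader")  # mode value assumed; see EvalAnswers.print below
```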
EvalAnswers.run
def run(labels: List[Label], answers: List[Answer], correct_retrieval: bool)
Run this node on one sample and its labels
EvalAnswers.print
def print(mode)
Print the evaluation results
semantic_answer_similarity
def semantic_answer_similarity(predictions: List[List[str]], gold_labels: List[List[str]], sas_model_name_or_path: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2", batch_size: int = 32, use_gpu: bool = True) -> Tuple[List[float], List[float], List[List[float]]]
Computes Transformer-based similarity of predicted answers to gold labels to derive a more meaningful metric than EM or F1.
Returns per QA pair a) the similarity of the most likely prediction (top 1) to all available gold labels, b) the highest similarity of all predictions to gold labels, and c) a matrix consisting of the similarities of all the predictions compared to all gold labels.
Arguments:
predictions
: Predicted answers as a list of multiple predictions per question
gold_labels
: Labels as a list of multiple possible answers per question
sas_model_name_or_path
: SentenceTransformers semantic textual similarity model; should be a path or string pointing to a downloadable model
batch_size
: Number of prediction-label pairs to encode at once
use_gpu
: Whether to use a GPU or the CPU for calculating semantic answer similarity. Falls back to CPU if no GPU is available.
Returns:
top_1_sas, top_k_sas, pred_label_matrix
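An illustrative call with toy data (the import path is assumed from this module's location; any SAS model from the list above works):

```python
# Import path assumed for Haystack 1.x; adjust if the module lives elsewhere.
from haystack.nodes.evaluator.evaluator import semantic_answer_similarity

# One question with two predictions and two acceptable gold answers.
top_1_sas, top_k_sas, pred_label_matrix = semantic_answer_similarity(
    predictions=[["30%", "roughly a third"]],
    gold_labels=[["thirty percent", "30 percent"]],
    sas_model_name_or_path="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    batch_size=32,
    use_gpu=False,  # force CPU for this sketch
)

print(top_1_sas[0])          # similarity of the top-1 prediction to its best gold label
print(top_k_sas[0])          # best similarity achieved by any of the predictions
print(pred_label_matrix[0])  # similarities of all predictions vs. all gold labels
```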