---
title: "ExtractiveReader"
id: extractivereader
slug: "/extractivereader"
description: "Use this component in extractive question answering pipelines based on a query and a list of documents."
---

# ExtractiveReader

Use this component in extractive question answering pipelines based on a query and a list of documents.

| | |
| --- | --- |
| **Most common position in a pipeline** | In query pipelines, after a component that returns a list of documents, such as a [Retriever](/docs/pipeline-components/retrievers.mdx) |
| **Mandatory init variables** | "token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var. |
| **Mandatory run variables** | "documents": A list of documents <br /> <br />"query": A query string |
| **Output variables** | "answers": A list of [`ExtractedAnswer`](/docs/concepts/data-classes.mdx#extractedanswer) objects |
| **API reference** | [Readers](/reference/readers-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/readers/extractive.py |

## Overview
`ExtractiveReader` locates and extracts answers to a given query from the document text. It's used in extractive QA systems where you want to know exactly where the answer is located within the document. It's usually coupled with a Retriever that precedes it, but you can also use it with other components that fetch documents.

Readers assign a _probability_ to each answer. This score ranges from 0 to 1 and indicates how well a returned answer matches the query: the closer the probability is to 1, the more confident the model is that the answer is relevant. The Reader sorts the answers by their probability scores, with the highest probability listed first. You can limit the number of answers the Reader returns with the optional `top_k` parameter.

You can use the probability to set quality expectations for your system. To do that, use the Reader's `confidence_threshold` parameter to set a minimum probability for returned answers. For example, setting `confidence_threshold` to `0.7` means only answers with a probability higher than 0.7 are returned.

By default, the Reader covers the scenario where no answer to the query is found in the document text (`no_answer=True`). In this case, it returns an additional `ExtractedAnswer` with no text and the probability that none of the `top_k` answers are correct. For example, if `top_k=4`, the Reader returns four answers and an additional empty one. Each answer has a probability assigned. If the empty answer has a probability of 0.5, that's the probability that none of the returned answers is correct. To receive only the actual `top_k` answers, set the `no_answer` parameter to `False` when initializing the component.
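
For example, here is a minimal sketch of initializing the Reader with these options, using the parameter names described above (check the [Readers](/reference/readers-api) API reference for the full init signature):

```python
from haystack.components.readers import ExtractiveReader

# Sketch: return at most three answers and skip the additional empty
# "no answer" entry. Parameter names follow the description above.
reader = ExtractiveReader(top_k=3, no_answer=False)
reader.warm_up()
```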
### Models

Here are the models we recommend for use with `ExtractiveReader`:

| Model URL | Description | Language |
| --- | --- | --- |
| [deepset/roberta-base-squad2-distilled](https://huggingface.co/deepset/roberta-base-squad2-distilled) (default) | A distilled model, relatively fast and with good performance. | English |
| [deepset/roberta-large-squad2](https://huggingface.co/deepset/roberta-large-squad2) | A large model with good performance. Slower than the distilled one. | English |
| [deepset/tinyroberta-squad2](https://huggingface.co/deepset/tinyroberta-squad2) | A distilled version of the roberta-large-squad2 model, very fast. | English |
| [deepset/xlm-roberta-base-squad2](https://huggingface.co/deepset/xlm-roberta-base-squad2) | A base multilingual model with good speed and performance. | Multilingual |

You can also view other question answering models on [Hugging Face](https://huggingface.co/models?pipeline_tag=question-answering).
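
To use one of these models instead of the default, pass its Hugging Face model ID when creating the component. A minimal sketch, assuming the `model` init parameter (the same one that holds the default `deepset/roberta-base-squad2-distilled`):

```python
from haystack.components.readers import ExtractiveReader

# Sketch: load the larger English model from the table above.
# The model is downloaded from Hugging Face when warm_up() is called.
reader = ExtractiveReader(model="deepset/roberta-large-squad2")
reader.warm_up()
```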
## Usage

### On its own

Below is an example that uses the `ExtractiveReader` outside of a pipeline. The Reader gets the query and the documents at runtime. It returns two answers and an additional, third answer with no text and the probability that none of the `top_k` answers is correct.
```python
from haystack import Document
from haystack.components.readers import ExtractiveReader

docs = [Document(content="Paris is the capital of France."), Document(content="Berlin is the capital of Germany.")]

reader = ExtractiveReader()
reader.warm_up()

reader.run(query="What is the capital of France?", documents=docs, top_k=2)
```
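
Continuing the example above, you can inspect the results by iterating over the returned answers. A short sketch, assuming the `data` and `score` fields of `ExtractedAnswer` (the empty "no answer" entry has no text in `data`):

```python
# Sketch: print each extracted answer together with its probability.
result = reader.run(query="What is the capital of France?", documents=docs, top_k=2)
for answer in result["answers"]:
    print(answer.data, answer.score)
```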
### In a pipeline

Below is an example of a pipeline that retrieves documents from an `InMemoryDocumentStore` based on keyword search (using `InMemoryBM25Retriever`). It then uses the `ExtractiveReader` to extract the answer to our query from the top retrieved documents.

With the `ExtractiveReader`'s `top_k` set to 2, the pipeline also returns an additional, third answer with no text and the probability that none of the other `top_k` answers is correct.
```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.readers import ExtractiveReader

docs = [Document(content="Paris is the capital of France."),
        Document(content="Berlin is the capital of Germany."),
        Document(content="Rome is the capital of Italy."),
        Document(content="Madrid is the capital of Spain.")]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = InMemoryBM25Retriever(document_store=document_store)
reader = ExtractiveReader()
reader.warm_up()

extractive_qa_pipeline = Pipeline()
extractive_qa_pipeline.add_component(instance=retriever, name="retriever")
extractive_qa_pipeline.add_component(instance=reader, name="reader")

extractive_qa_pipeline.connect("retriever.documents", "reader.documents")

query = "What is the capital of France?"
extractive_qa_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
                                 "reader": {"query": query, "top_k": 2}})
```
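
The pipeline returns the Reader's output under the component name it was added with. A sketch of reading the extracted answers from the result, again assuming the `data` and `score` fields of `ExtractedAnswer`:

```python
# Sketch: the Reader's answers appear under the "reader" key of the pipeline result.
result = extractive_qa_pipeline.run(data={"retriever": {"query": query, "top_k": 3},
                                          "reader": {"query": query, "top_k": 2}})
for answer in result["reader"]["answers"]:
    print(answer.data, answer.score)
```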