mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-07-28 11:19:58 +00:00

* new docs version * Add latest docstring and tutorial changes Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
384 lines
15 KiB
Markdown
384 lines
15 KiB
Markdown
<a name="base"></a>
|
|
# Module base
|
|
|
|
<a name="farm"></a>
|
|
# Module farm
|
|
|
|
<a name="farm.FARMReader"></a>
|
|
## FARMReader Objects
|
|
|
|
```python
|
|
class FARMReader(BaseReader)
|
|
```
|
|
|
|
Transformer based model for extractive Question Answering using the FARM framework (https://github.com/deepset-ai/FARM).
|
|
While the underlying model can vary (BERT, Roberta, DistilBERT, ...), the interface remains the same.
|
|
|
|
| With a FARMReader, you can:
|
|
|
|
- directly get predictions via predict()
|
|
- fine-tune the model on QA data via train()
|
|
|
|
<a name="farm.FARMReader.__init__"></a>
|
|
#### \_\_init\_\_
|
|
|
|
```python
|
|
| __init__(model_name_or_path: Union[str, Path], context_window_size: int = 150, batch_size: int = 50, use_gpu: bool = True, no_ans_boost: float = 0.0, return_no_answer: bool = False, top_k_per_candidate: int = 3, top_k_per_sample: int = 1, num_processes: Optional[int] = None, max_seq_len: int = 256, doc_stride: int = 128)
|
|
```
|
|
|
|
**Arguments**:
|
|
|
|
- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased',
|
|
'deepset/bert-base-cased-squad2', 'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
|
|
See https://huggingface.co/models for full list of available models.
|
|
- `context_window_size`: The size, in characters, of the window around the answer span that is used when
|
|
displaying the context around the answer.
|
|
- `batch_size`: Number of samples the model receives in one batch for inference.
|
|
Memory consumption is much lower in inference mode. Recommendation: Increase the batch size
|
|
to a value so only a single batch is used.
|
|
- `use_gpu`: Whether to use GPU (if available)
|
|
- `no_ans_boost`: How much the no_answer logit is boosted/increased.
|
|
If set to 0 (default), the no_answer logit is not changed.
|
|
If a negative number, there is a lower chance of "no_answer" being predicted.
|
|
If a positive number, there is an increased chance of "no_answer"
|
|
- `return_no_answer`: Whether to include no_answer predictions in the results.
|
|
- `top_k_per_candidate`: How many answers to extract for each candidate doc that is coming from the retriever (might be a long text).
|
|
Note that this is not the number of "final answers" you will receive
|
|
(see `top_k` in FARMReader.predict() or Finder.get_answers() for that)
|
|
and that FARM includes no_answer in the sorted list of predictions.
|
|
- `top_k_per_sample`: How many answers to extract from each small text passage that the model can process at once
|
|
(one "candidate doc" is usually split into many smaller "passages").
|
|
You usually want a very small value here, as it slows down inference
|
|
and you don't gain much of quality by having multiple answers from one passage.
|
|
Note that this is not the number of "final answers" you will receive
|
|
(see `top_k` in FARMReader.predict() or Finder.get_answers() for that)
|
|
and that FARM includes no_answer in the sorted list of predictions.
|
|
- `num_processes`: The number of processes for `multiprocessing.Pool`. Set to value of 0 to disable
|
|
multiprocessing. Set to None to let Inferencer determine optimum number. If you
|
|
want to debug the Language Model, you might need to disable multiprocessing!
|
|
- `max_seq_len`: Max sequence length of one input text for the model
|
|
- `doc_stride`: Length of striding window for splitting long texts (used if ``len(text) > max_seq_len``)
|
|
|
|
<a name="farm.FARMReader.train"></a>
|
|
#### train
|
|
|
|
```python
|
|
| train(data_dir: str, train_filename: str, dev_filename: Optional[str] = None, test_filename: Optional[str] = None, use_gpu: Optional[bool] = None, batch_size: int = 10, n_epochs: int = 2, learning_rate: float = 1e-5, max_seq_len: Optional[int] = None, warmup_proportion: float = 0.2, dev_split: float = 0, evaluate_every: int = 300, save_dir: Optional[str] = None, num_processes: Optional[int] = None, use_amp: str = None)
|
|
```
|
|
|
|
Fine-tune a model on a QA dataset. Options:
|
|
|
|
- Take a plain language model (e.g. `bert-base-cased`) and train it for QA (e.g. on SQuAD data)
|
|
- Take a QA model (e.g. `deepset/bert-base-cased-squad2`) and fine-tune it for your domain (e.g. using your labels collected via the haystack annotation tool)
|
|
|
|
**Arguments**:
|
|
|
|
- `data_dir`: Path to directory containing your training data in SQuAD style
|
|
- `train_filename`: Filename of training data
|
|
- `dev_filename`: Filename of dev / eval data
|
|
- `test_filename`: Filename of test data
|
|
- `dev_split`: Instead of specifying a dev_filename, you can also specify a ratio (e.g. 0.1) here
|
|
that gets split off from training data for eval.
|
|
- `use_gpu`: Whether to use GPU (if available)
|
|
- `batch_size`: Number of samples the model receives in one batch for training
|
|
- `n_epochs`: Number of iterations on the whole training data set
|
|
- `learning_rate`: Learning rate of the optimizer
|
|
- `max_seq_len`: Maximum text length (in tokens). Everything longer gets cut down.
|
|
- `warmup_proportion`: Proportion of training steps until maximum learning rate is reached.
|
|
Until that point LR is increasing linearly. After that it's decreasing again linearly.
|
|
Options for different schedules are available in FARM.
|
|
- `evaluate_every`: Evaluate the model every X steps on the hold-out eval dataset
|
|
- `save_dir`: Path to store the final model
|
|
- `num_processes`: The number of processes for `multiprocessing.Pool` during preprocessing.
|
|
Set to value of 1 to disable multiprocessing. When set to 1, you cannot split away a dev set from train set.
|
|
Set to None to use all CPU cores minus one.
|
|
- `use_amp`: Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model.
|
|
Available options:
|
|
None (Don't use AMP)
|
|
"O0" (Normal FP32 training)
|
|
"O1" (Mixed Precision => Recommended)
|
|
"O2" (Almost FP16)
|
|
"O3" (Pure FP16).
|
|
See details on: https://nvidia.github.io/apex/amp.html
|
|
|
|
**Returns**:
|
|
|
|
None
|
|
|
|
<a name="farm.FARMReader.update_parameters"></a>
|
|
#### update\_parameters
|
|
|
|
```python
|
|
| update_parameters(context_window_size: Optional[int] = None, no_ans_boost: Optional[float] = None, return_no_answer: Optional[bool] = None, max_seq_len: Optional[int] = None, doc_stride: Optional[int] = None)
|
|
```
|
|
|
|
Hot update parameters of a loaded Reader. It may not to be safe when processing concurrent requests.
|
|
|
|
<a name="farm.FARMReader.save"></a>
|
|
#### save
|
|
|
|
```python
|
|
| save(directory: Path)
|
|
```
|
|
|
|
Saves the Reader model so that it can be reused at a later point in time.
|
|
|
|
**Arguments**:
|
|
|
|
- `directory`: Directory where the Reader model should be saved
|
|
|
|
<a name="farm.FARMReader.predict_batch"></a>
|
|
#### predict\_batch
|
|
|
|
```python
|
|
| predict_batch(query_doc_list: List[dict], top_k: int = None, batch_size: int = None)
|
|
```
|
|
|
|
Use loaded QA model to find answers for a list of queries in each query's supplied list of Document.
|
|
|
|
Returns list of dictionaries containing answers sorted by (desc.) probability
|
|
|
|
**Arguments**:
|
|
|
|
- `query_doc_list`: List of dictionaries containing queries with their retrieved documents
|
|
- `top_k`: The maximum number of answers to return for each query
|
|
- `batch_size`: Number of samples the model receives in one batch for inference
|
|
|
|
**Returns**:
|
|
|
|
List of dictionaries containing query and answers
|
|
|
|
<a name="farm.FARMReader.predict"></a>
|
|
#### predict
|
|
|
|
```python
|
|
| predict(query: str, documents: List[Document], top_k: Optional[int] = None)
|
|
```
|
|
|
|
Use loaded QA model to find answers for a query in the supplied list of Document.
|
|
|
|
Returns dictionaries containing answers sorted by (desc.) probability.
|
|
Example:
|
|
```python
|
|
|{
|
|
| 'query': 'Who is the father of Arya Stark?',
|
|
| 'answers':[
|
|
| {'answer': 'Eddard,',
|
|
| 'context': " She travels with her father, Eddard, to King's Landing when he is ",
|
|
| 'offset_answer_start': 147,
|
|
| 'offset_answer_end': 154,
|
|
| 'probability': 0.9787139466668613,
|
|
| 'score': None,
|
|
| 'document_id': '1337'
|
|
| },...
|
|
| ]
|
|
|}
|
|
```
|
|
|
|
**Arguments**:
|
|
|
|
- `query`: Query string
|
|
- `documents`: List of Document in which to search for the answer
|
|
- `top_k`: The maximum number of answers to return
|
|
|
|
**Returns**:
|
|
|
|
Dict containing query and answers
|
|
|
|
<a name="farm.FARMReader.eval_on_file"></a>
|
|
#### eval\_on\_file
|
|
|
|
```python
|
|
| eval_on_file(data_dir: str, test_filename: str, device: str)
|
|
```
|
|
|
|
Performs evaluation on a SQuAD-formatted file.
|
|
Returns a dict containing the following metrics:
|
|
- "EM": exact match score
|
|
- "f1": F1-Score
|
|
- "top_n_accuracy": Proportion of predicted answers that overlap with correct answer
|
|
|
|
**Arguments**:
|
|
|
|
- `data_dir`: The directory in which the test set can be found
|
|
:type data_dir: Path or str
|
|
- `test_filename`: The name of the file containing the test data in SQuAD format.
|
|
:type test_filename: str
|
|
- `device`: The device on which the tensors should be processed. Choose from "cpu" and "cuda".
|
|
:type device: str
|
|
|
|
<a name="farm.FARMReader.eval"></a>
|
|
#### eval
|
|
|
|
```python
|
|
| eval(document_store: BaseDocumentStore, device: str, label_index: str = "label", doc_index: str = "eval_document", label_origin: str = "gold_label")
|
|
```
|
|
|
|
Performs evaluation on evaluation documents in the DocumentStore.
|
|
Returns a dict containing the following metrics:
|
|
- "EM": Proportion of exact matches of predicted answers with their corresponding correct answers
|
|
- "f1": Average overlap between predicted answers and their corresponding correct answers
|
|
- "top_n_accuracy": Proportion of predicted answers that overlap with correct answer
|
|
|
|
**Arguments**:
|
|
|
|
- `document_store`: DocumentStore containing the evaluation documents
|
|
- `device`: The device on which the tensors should be processed. Choose from "cpu" and "cuda".
|
|
- `label_index`: Index/Table name where labeled questions are stored
|
|
- `doc_index`: Index/Table name where documents that are used for evaluation are stored
|
|
|
|
<a name="farm.FARMReader.predict_on_texts"></a>
|
|
#### predict\_on\_texts
|
|
|
|
```python
|
|
| predict_on_texts(question: str, texts: List[str], top_k: Optional[int] = None)
|
|
```
|
|
|
|
Use loaded QA model to find answers for a question in the supplied list of Document.
|
|
Returns dictionaries containing answers sorted by (desc.) probability.
|
|
Example:
|
|
```python
|
|
|{
|
|
| 'question': 'Who is the father of Arya Stark?',
|
|
| 'answers':[
|
|
| {'answer': 'Eddard,',
|
|
| 'context': " She travels with her father, Eddard, to King's Landing when he is ",
|
|
| 'offset_answer_start': 147,
|
|
| 'offset_answer_end': 154,
|
|
| 'probability': 0.9787139466668613,
|
|
| 'score': None,
|
|
| 'document_id': '1337'
|
|
| },...
|
|
| ]
|
|
|}
|
|
```
|
|
|
|
**Arguments**:
|
|
|
|
- `question`: Question string
|
|
- `documents`: List of documents as string type
|
|
- `top_k`: The maximum number of answers to return
|
|
|
|
**Returns**:
|
|
|
|
Dict containing question and answers
|
|
|
|
<a name="farm.FARMReader.convert_to_onnx"></a>
|
|
#### convert\_to\_onnx
|
|
|
|
```python
|
|
| @classmethod
|
|
| convert_to_onnx(cls, model_name: str, output_path: Path, convert_to_float16: bool = False, quantize: bool = False, task_type: str = "question_answering", opset_version: int = 11)
|
|
```
|
|
|
|
Convert a PyTorch BERT model to ONNX format and write to ./onnx-export dir. The converted ONNX model
|
|
can be loaded with in the `FARMReader` using the export path as `model_name_or_path` param.
|
|
|
|
Usage:
|
|
|
|
`from haystack.reader.farm import FARMReader
|
|
from pathlib import Path
|
|
onnx_model_path = Path("roberta-onnx-model")
|
|
FARMReader.convert_to_onnx(model_name="deepset/bert-base-cased-squad2", output_path=onnx_model_path)
|
|
reader = FARMReader(onnx_model_path)`
|
|
|
|
**Arguments**:
|
|
|
|
- `model_name`: transformers model name
|
|
- `output_path`: Path to output the converted model
|
|
- `convert_to_float16`: Many models use float32 precision by default. With the half precision of float16,
|
|
inference is faster on Nvidia GPUs with Tensor core like T4 or V100. On older GPUs,
|
|
float32 could still be be more performant.
|
|
- `quantize`: convert floating point number to integers
|
|
- `task_type`: Type of task for the model. Available options: "question_answering" or "embeddings".
|
|
- `opset_version`: ONNX opset version
|
|
|
|
<a name="transformers"></a>
|
|
# Module transformers
|
|
|
|
<a name="transformers.TransformersReader"></a>
|
|
## TransformersReader Objects
|
|
|
|
```python
|
|
class TransformersReader(BaseReader)
|
|
```
|
|
|
|
Transformer based model for extractive Question Answering using the HuggingFace's transformers framework
|
|
(https://github.com/huggingface/transformers).
|
|
While the underlying model can vary (BERT, Roberta, DistilBERT ...), the interface remains the same.
|
|
With this reader, you can directly get predictions via predict()
|
|
|
|
<a name="transformers.TransformersReader.__init__"></a>
|
|
#### \_\_init\_\_
|
|
|
|
```python
|
|
| __init__(model_name_or_path: str = "distilbert-base-uncased-distilled-squad", tokenizer: Optional[str] = None, context_window_size: int = 70, use_gpu: int = 0, top_k_per_candidate: int = 4, return_no_answers: bool = True, max_seq_len: int = 256, doc_stride: int = 128)
|
|
```
|
|
|
|
Load a QA model from Transformers.
|
|
Available models include:
|
|
|
|
- ``'distilbert-base-uncased-distilled-squad`'``
|
|
- ``'bert-large-cased-whole-word-masking-finetuned-squad``'
|
|
- ``'bert-large-uncased-whole-word-masking-finetuned-squad``'
|
|
|
|
See https://huggingface.co/models for full list of available QA models
|
|
|
|
**Arguments**:
|
|
|
|
- `model_name_or_path`: Directory of a saved model or the name of a public model e.g. 'bert-base-cased',
|
|
'deepset/bert-base-cased-squad2', 'deepset/bert-base-cased-squad2', 'distilbert-base-uncased-distilled-squad'.
|
|
See https://huggingface.co/models for full list of available models.
|
|
- `tokenizer`: Name of the tokenizer (usually the same as model)
|
|
- `context_window_size`: Num of chars (before and after the answer) to return as "context" for each answer.
|
|
The context usually helps users to understand if the answer really makes sense.
|
|
- `use_gpu`: If < 0, then use cpu. If >= 0, this is the ordinal of the gpu to use
|
|
- `top_k_per_candidate`: How many answers to extract for each candidate doc that is coming from the retriever (might be a long text).
|
|
Note that this is not the number of "final answers" you will receive
|
|
(see `top_k` in TransformersReader.predict() or Finder.get_answers() for that)
|
|
and that no_answer can be included in the sorted list of predictions.
|
|
- `return_no_answers`: If True, the HuggingFace Transformers model could return a "no_answer" (i.e. when there is an unanswerable question)
|
|
If False, it cannot return a "no_answer". Note that `no_answer_boost` is unfortunately not available with TransformersReader.
|
|
If you would like to set no_answer_boost, use a `FARMReader`.
|
|
- `max_seq_len`: max sequence length of one input text for the model
|
|
- `doc_stride`: length of striding window for splitting long texts (used if len(text) > max_seq_len)
|
|
|
|
<a name="transformers.TransformersReader.predict"></a>
|
|
#### predict
|
|
|
|
```python
|
|
| predict(query: str, documents: List[Document], top_k: Optional[int] = None)
|
|
```
|
|
|
|
Use loaded QA model to find answers for a query in the supplied list of Document.
|
|
|
|
Returns dictionaries containing answers sorted by (desc.) probability.
|
|
Example:
|
|
|
|
```python
|
|
|{
|
|
| 'query': 'Who is the father of Arya Stark?',
|
|
| 'answers':[
|
|
| {'answer': 'Eddard,',
|
|
| 'context': " She travels with her father, Eddard, to King's Landing when he is ",
|
|
| 'offset_answer_start': 147,
|
|
| 'offset_answer_end': 154,
|
|
| 'probability': 0.9787139466668613,
|
|
| 'score': None,
|
|
| 'document_id': '1337'
|
|
| },...
|
|
| ]
|
|
|}
|
|
```
|
|
|
|
**Arguments**:
|
|
|
|
- `query`: Query string
|
|
- `documents`: List of Document in which to search for the answer
|
|
- `top_k`: The maximum number of answers to return
|
|
|
|
**Returns**:
|
|
|
|
Dict containing query and answers
|
|
|