<a id="export_utils"></a>
|
|
|
|
# Module export\_utils
|
|
|
|
<a id="export_utils.print_answers"></a>
|
|
|
|
#### print\_answers
|
|
|
|
```python
|
|
def print_answers(results: dict,
|
|
details: str = "all",
|
|
max_text_len: Optional[int] = None)
|
|
```

Utility function to print the results of Haystack pipelines.

**Arguments**:

- `results`: Results that the pipeline returned.
- `details`: Defines the level of detail to print. Possible values: "minimum", "medium", "all".
- `max_text_len`: Specifies the maximum allowed length for a text field. If you don't want to shorten the text, set this value to None.

**Returns**:

None
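
Example: a minimal sketch of typical usage. The result dict is hand-built here to stand in for real pipeline output (in practice you would pass the dict returned by `pipeline.run()`).

```python
from haystack.schema import Answer
from haystack.utils import print_answers

# Hand-built stand-in for the output of an extractive QA pipeline.
results = {
    "query": "Who is the father of Arya Stark?",
    "answers": [Answer(answer="Eddard Stark", context="... Eddard Stark is the father of Arya Stark ...")],
}

# Print a compact view, truncating each text field to 100 characters.
print_answers(results, details="minimum", max_text_len=100)
```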
<a id="export_utils.print_documents"></a>
|
|
|
|
#### print\_documents
|
|
|
|
```python
|
|
def print_documents(results: dict,
|
|
max_text_len: Optional[int] = None,
|
|
print_name: bool = True,
|
|
print_meta: bool = False)
|
|
```

Utility that prints a compressed representation of the documents returned by a pipeline.

**Arguments**:

- `results`: The results that the pipeline returned.
- `max_text_len`: Shorten the document's content to a maximum number of characters. When set to `None`, the document is not shortened.
- `print_name`: Whether to print the document's name from the metadata.
- `print_meta`: Whether to print the document's metadata.
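
Example: a minimal sketch with a hand-built result dict standing in for the output of a document search pipeline.

```python
from haystack.schema import Document
from haystack.utils import print_documents

# Hand-built stand-in for the output of a document search pipeline.
results = {
    "query": "climate change",
    "documents": [
        Document(
            content="Climate change refers to long-term shifts in temperatures and weather patterns.",
            meta={"name": "climate.txt"},
        )
    ],
}

# Print each document's name and the first 50 characters of its content.
print_documents(results, max_text_len=50, print_name=True)
```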
<a id="export_utils.print_questions"></a>
|
|
|
|
#### print\_questions
|
|
|
|
```python
|
|
def print_questions(results: dict)
|
|
```
|
|
|
|
Utility to print the output of a question generating pipeline in a readable format.
|
|
|
|
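
Example: a sketch assuming Haystack's `QuestionGenerator` node and `QuestionGenerationPipeline` (the document content is illustrative; the model is downloaded on first use).

```python
from haystack.nodes import QuestionGenerator
from haystack.pipelines import QuestionGenerationPipeline
from haystack.schema import Document
from haystack.utils import print_questions

# Generate questions from a document and print them in a readable format.
pipeline = QuestionGenerationPipeline(QuestionGenerator())
doc = Document(content="Python is a programming language created by Guido van Rossum.")

results = pipeline.run(documents=[doc])
print_questions(results)
```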
<a id="export_utils.export_answers_to_csv"></a>
|
|
|
|
#### export\_answers\_to\_csv
|
|
|
|
```python
|
|
def export_answers_to_csv(agg_results: list, output_file)
|
|
```
|
|
|
|
Exports answers coming from finder.get_answers() to a CSV file.
|
|
|
|
**Arguments**:
|
|
|
|
- `agg_results`: A list of predictions coming from finder.get_answers().
|
|
- `output_file`: The name of the output file.
|
|
|
|
**Returns**:
|
|
|
|
None
|
|
|
|
<a id="export_utils.convert_labels_to_squad"></a>
|
|
|
|
#### convert\_labels\_to\_squad
|
|
|
|
```python
|
|
def convert_labels_to_squad(labels_file: str)
|
|
```
|
|
|
|
Convert the export from the labeling UI to the SQuAD format for training.
|
|
|
|
**Arguments**:
|
|
|
|
- `labels_file`: The path to the file containing labels.
|
|
|
|
<a id="preprocessing"></a>
|
|
|
|
# Module preprocessing
|
|
|
|
<a id="preprocessing.convert_files_to_docs"></a>
|
|
|
|
#### convert\_files\_to\_docs
|
|
|
|
```python
|
|
def convert_files_to_docs(
|
|
dir_path: str,
|
|
clean_func: Optional[Callable] = None,
|
|
split_paragraphs: bool = False,
|
|
encoding: Optional[str] = None,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```

Convert all files (.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be
written to a Document Store.

**Arguments**:

- `dir_path`: The path of the directory containing the files.
- `clean_func`: A custom cleaning function that gets applied to each Document (input: str, output: str).
- `split_paragraphs`: Whether to split the text by paragraph.
- `encoding`: Character encoding to use when converting PDF documents.
- `id_hash_keys`: A list of Document attribute names from which the Document ID should be hashed.
Useful for generating unique IDs even if the Document contents are identical.
To ensure you don't have duplicate Documents in your Document Store if texts are
not unique, you can modify the metadata and pass `["content", "meta"]` to this field.
If you do this, the Document ID will be generated by using the content and the defined metadata.
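
Example: the typical tutorial-style usage, assuming the `clean_wiki_text` helper from `haystack.utils` and an illustrative directory path.

```python
from haystack.utils import clean_wiki_text, convert_files_to_docs

# Convert every .txt, .pdf, and .docx file under data/my_files/ into Document objects,
# cleaning each file's text and splitting it into one Document per paragraph.
docs = convert_files_to_docs(
    dir_path="data/my_files",
    clean_func=clean_wiki_text,
    split_paragraphs=True,
)

# The resulting Documents can then be written to a Document Store:
# document_store.write_documents(docs)
```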
<a id="preprocessing.tika_convert_files_to_docs"></a>
|
|
|
|
#### tika\_convert\_files\_to\_docs
|
|
|
|
```python
|
|
def tika_convert_files_to_docs(
|
|
dir_path: str,
|
|
clean_func: Optional[Callable] = None,
|
|
split_paragraphs: bool = False,
|
|
merge_short: bool = True,
|
|
merge_lowercase: bool = True,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```

Convert all files (.txt, .pdf) in the sub-directories of the given path to Documents that can be
written to a Document Store.

**Arguments**:

- `dir_path`: The path to the directory containing the files.
- `clean_func`: A custom cleaning function that gets applied to each doc (input: str, output: str).
- `split_paragraphs`: Whether to split the text by paragraph.
- `merge_short`: Whether to allow merging of short paragraphs.
- `merge_lowercase`: Whether to convert merged paragraphs to lowercase.
- `id_hash_keys`: A list of Document attribute names from which the Document ID should be hashed.
Useful for generating unique IDs even if the Document contents are identical.
To ensure you don't have duplicate Documents in your Document Store if texts are
not unique, you can modify the metadata and pass `["content", "meta"]` to this field.
If you do this, the Document ID will be generated by using the content and the defined metadata.
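
Example: a sketch assuming a running Apache Tika server reachable by the converter (for instance via the official Tika docker image); the directory path is illustrative.

```python
from haystack.utils import tika_convert_files_to_docs

# Convert every .txt and .pdf file under data/mixed_files/ into Document objects,
# merging paragraphs as configured by merge_short and merge_lowercase.
docs = tika_convert_files_to_docs(
    dir_path="data/mixed_files",
    split_paragraphs=True,
    merge_short=True,
    merge_lowercase=True,
)
```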
<a id="squad_data"></a>
|
|
|
|
# Module squad\_data
|
|
|
|
<a id="squad_data.SquadData"></a>
|
|
|
|
## SquadData
|
|
|
|
```python
|
|
class SquadData()
|
|
```

This class is designed to manipulate data that is in the SQuAD format.

<a id="squad_data.SquadData.__init__"></a>

#### SquadData.\_\_init\_\_

```python
def __init__(squad_data)
```

**Arguments**:

- `squad_data`: SQuAD format data, either as a dictionary with a `data` key, or just a list of SQuAD documents.
<a id="squad_data.SquadData.merge_from_file"></a>
|
|
|
|
#### SquadData.merge\_from\_file
|
|
|
|
```python
|
|
def merge_from_file(filename: str)
|
|
```
|
|
|
|
Merge the contents of a JSON file in the SQuAD format with the data stored in this object.
|
|
|
|
<a id="squad_data.SquadData.merge"></a>
|
|
|
|
#### SquadData.merge
|
|
|
|
```python
|
|
def merge(new_data: List)
|
|
```
|
|
|
|
Merge data in SQuAD format with the data stored in this object.
|
|
|
|
**Arguments**:
|
|
|
|
- `new_data`: A list of SQuAD document data.
|
|
|
|
<a id="squad_data.SquadData.from_file"></a>
|
|
|
|
#### SquadData.from\_file
|
|
|
|
```python
|
|
@classmethod
|
|
def from_file(cls, filename: str)
|
|
```
|
|
|
|
Create a SquadData object by providing the name of a JSON file in the SQuAD format.
|
|
|
|
<a id="squad_data.SquadData.save"></a>
|
|
|
|
#### SquadData.save
|
|
|
|
```python
|
|
def save(filename: str)
|
|
```
|
|
|
|
Write the data stored in this object to a JSON file.
|
|
|
|
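
Example: loading, merging, and saving SQuAD-format files (file names are illustrative; the import path may differ slightly between Haystack versions).

```python
from haystack.utils.squad_data import SquadData

# Load an existing SQuAD-format dataset, merge annotations from a second file,
# and write the combined data back to disk.
squad = SquadData.from_file("train-v2.0.json")
squad.merge_from_file("my_annotations.json")
squad.save("combined_train.json")
```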
<a id="squad_data.SquadData.to_document_objs"></a>
|
|
|
|
#### SquadData.to\_document\_objs
|
|
|
|
```python
|
|
def to_document_objs()
|
|
```
|
|
|
|
Export all paragraphs stored in this object to haystack.Document objects.
|
|
|
|
<a id="squad_data.SquadData.to_label_objs"></a>
|
|
|
|
#### SquadData.to\_label\_objs
|
|
|
|
```python
|
|
def to_label_objs()
|
|
```
|
|
|
|
Export all labels stored in this object to haystack.Label objects.
|
|
|
|
<a id="squad_data.SquadData.to_df"></a>
|
|
|
|
#### SquadData.to\_df
|
|
|
|
```python
|
|
@staticmethod
|
|
def to_df(data)
|
|
```
|
|
|
|
Convert a list of SQuAD document dictionaries into a pandas dataframe (each row is one annotation).
|
|
|
|
<a id="squad_data.SquadData.count"></a>
|
|
|
|
#### SquadData.count
|
|
|
|
```python
|
|
def count(unit="questions")
|
|
```
|
|
|
|
Count the samples in the data. Choose a unit: "paragraphs", "questions", "answers", "no_answers", "span_answers".
|
|
|
|
<a id="squad_data.SquadData.df_to_data"></a>
|
|
|
|
#### SquadData.df\_to\_data
|
|
|
|
```python
|
|
@classmethod
|
|
def df_to_data(cls, df)
|
|
```
|
|
|
|
Convert a data frame into the SQuAD format data (list of SQuAD document dictionaries).
|
|
|
|
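
Example: a round trip through a DataFrame to filter annotations, continuing with the `squad` object from the earlier example. This sketch assumes the flattened frame exposes a `title` column and that the object stores its raw data in a `data` attribute; both are assumptions about the internal layout, not guaranteed API.

```python
# Flatten to one row per annotation, filter, and convert back to SQuAD format.
df = SquadData.to_df(squad.data)
filtered_df = df[df["title"] != "Some unwanted article"]  # hypothetical filter
squad_filtered = SquadData(SquadData.df_to_data(filtered_df))
```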
<a id="squad_data.SquadData.sample_questions"></a>
|
|
|
|
#### SquadData.sample\_questions
|
|
|
|
```python
|
|
def sample_questions(n)
|
|
```
|
|
|
|
Return a sample of n questions in the SQuAD format (a list of SQuAD document dictionaries).
|
|
Note that if the same question is asked on multiple different passages, this function treats that
|
|
as a single question.
|
|
|
|
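
Example: drawing a fixed-size sample and saving it as a smaller SQuAD-format file, continuing with the `squad` object from the earlier example (the file name is illustrative).

```python
# Sample 100 questions and persist them as a compact dev set.
sample = squad.sample_questions(n=100)
SquadData(sample).save("dev_sample.json")
```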
<a id="squad_data.SquadData.get_all_paragraphs"></a>
|
|
|
|
#### SquadData.get\_all\_paragraphs
|
|
|
|
```python
|
|
def get_all_paragraphs()
|
|
```
|
|
|
|
Return all paragraph strings.
|
|
|
|
<a id="squad_data.SquadData.get_all_questions"></a>
|
|
|
|
#### SquadData.get\_all\_questions
|
|
|
|
```python
|
|
def get_all_questions()
|
|
```
|
|
|
|
Return all question strings. Note that if the same question appears for different paragraphs, this function returns it multiple times.
|
|
|
|
<a id="squad_data.SquadData.get_all_document_titles"></a>
|
|
|
|
#### SquadData.get\_all\_document\_titles
|
|
|
|
```python
|
|
def get_all_document_titles()
|
|
```
|
|
|
|
Return all document title strings.
|
|
|
|
<a id="early_stopping"></a>
|
|
|
|
# Module early\_stopping
|
|
|
|
<a id="early_stopping.EarlyStopping"></a>
|
|
|
|
## EarlyStopping
|
|
|
|
```python
|
|
class EarlyStopping()
|
|
```

An object you can use to control early stopping with a Node's `train()` method or a Trainer class. You can
use a custom EarlyStopping class instead as long as it implements the method `check_stopping()` and
provides the attribute `save_dir`.
<a id="early_stopping.EarlyStopping.__init__"></a>
|
|
|
|
#### EarlyStopping.\_\_init\_\_
|
|
|
|
```python
|
|
def __init__(head: int = 0,
|
|
metric: Union[str, Callable] = "loss",
|
|
save_dir: Optional[str] = None,
|
|
mode: Literal["min", "max"] = "min",
|
|
patience: int = 0,
|
|
min_delta: float = 0.001,
|
|
min_evals: int = 0)
|
|
```

**Arguments**:

- `head`: The index of the prediction head that you are evaluating to determine the chosen `metric`.
In Haystack, the large majority of models are trained from the loss signal of a single prediction
head, so the default value of 0 should work in most cases.
- `save_dir`: The directory where to save the final best model. If you set it to None, the model is not saved.
- `metric`: The name of a dev set metric to monitor (default: loss) which is extracted from the prediction
head specified by the variable `head`, or a function that extracts a value from the trainer dev evaluation
result.
For FARMReader training, available metrics to choose from include "EM", "f1", and "top_n_accuracy".
For DensePassageRetriever training, available metrics to choose from include "acc", "f1", and "average_rank".
NOTE: This is different from the metric that is specified in the Processor, which defines how to calculate
one or more evaluation metric values from the prediction and target sets. The `metric` variable in this
function specifies the name of one particular metric value, or it is a method that calculates a value from
the result returned by the Processor metric.
- `mode`: When set to "min", training stops if the metric does not continue to decrease. When set to "max",
training stops if the metric does not continue to increase.
- `patience`: How many evaluations with no improvement to perform before stopping training.
- `min_delta`: The minimum difference from the previous best value that counts as an improvement.
- `min_evals`: The minimum number of evaluations to perform before checking whether the evaluation metric is
improving.
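
Example: a sketch assuming a Haystack 1.x release in which `FARMReader.train()` accepts an `early_stopping` argument; paths and file names are illustrative.

```python
from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping

# Stop fine-tuning once dev-set F1 has not improved for three evaluations in a row,
# and keep the best checkpoint in models/best_reader.
early_stopping = EarlyStopping(
    metric="f1",
    mode="max",            # F1 should increase
    patience=3,            # tolerate three evaluations without improvement
    save_dir="models/best_reader",
)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
reader.train(
    data_dir="data/squad",
    train_filename="train.json",
    dev_filename="dev.json",
    early_stopping=early_stopping,
)
```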
<a id="early_stopping.EarlyStopping.check_stopping"></a>
|
|
|
|
#### EarlyStopping.check\_stopping
|
|
|
|
```python
|
|
def check_stopping(eval_result: List[Dict]) -> Tuple[bool, bool, float]
|
|
```
|
|
|
|
Provides the evaluation value for the current evaluation. Returns true if stopping should occur.
|
|
|
|
This saves the model if you provided `self.save_dir` when initializing `EarlyStopping`.
|
|
|
|
**Arguments**:
|
|
|
|
- `eval_result`: The current evaluation result which consists of a list of dictionaries, one for each
|
|
prediction head. Each dictionary contains the metrics and reports generated during evaluation.
|
|
|
|
**Returns**:
|
|
|
|
A tuple (stopprocessing, savemodel, eval_value) indicating if processing should be stopped
|
|
and if the current model should get saved and the evaluation value used.
|
|
|
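
Because `metric` in `EarlyStopping.__init__()` can also be a callable, you can compute the monitored value from this per-head result list yourself. A minimal sketch, continuing from the example above and using one of the FARMReader metrics mentioned earlier:

```python
def top_n_accuracy_of_first_head(eval_result):
    # eval_result is a list of dicts, one per prediction head,
    # as described for check_stopping() above.
    return eval_result[0]["top_n_accuracy"]

early_stopping = EarlyStopping(
    metric=top_n_accuracy_of_first_head,
    mode="max",
    patience=2,
)
```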