<a id="export_utils"></a>
|
|
|
|
# Module export\_utils
|
|
|
|
<a id="export_utils.print_answers"></a>
|
|
|
|
#### print\_answers
|
|
|
|
```python
|
|
def print_answers(results: dict,
|
|
details: str = "all",
|
|
max_text_len: Optional[int] = None)
|
|
```

Utility function to print the results of Haystack pipelines.

**Arguments**:

- `results`: Results that the pipeline returned.
- `details`: Defines the level of detail to print. Possible values: "minimum", "medium", "all".
- `max_text_len`: Specifies the maximum allowed length for a text field. If you don't want to shorten the text, set this value to None.

**Returns**:

None
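
Example: a minimal sketch of typical usage. The result dict is hand-built here to stand in for real pipeline output (in practice you would pass the dict returned by `pipeline.run()`).

```python
from haystack.schema import Answer
from haystack.utils import print_answers

# Hand-built stand-in for the output of an extractive QA pipeline.
results = {
    "query": "Who is the father of Arya Stark?",
    "answers": [Answer(answer="Eddard Stark", context="... Eddard Stark is the father of Arya Stark ...")],
}

# Print a compact view, truncating each text field to 100 characters.
print_answers(results, details="minimum", max_text_len=100)
```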
<a id="export_utils.print_documents"></a>
|
|
|
|
#### print\_documents
|
|
|
|
```python
|
|
def print_documents(results: dict,
|
|
max_text_len: Optional[int] = None,
|
|
print_name: bool = True,
|
|
print_meta: bool = False)
|
|
```

Utility that prints a compressed representation of the documents returned by a pipeline.

**Arguments**:

- `results`: The results that the pipeline returned.
- `max_text_len`: Shorten the document's content to a maximum number of characters. When set to `None`, the document is not shortened.
- `print_name`: Whether to print the document's name from the metadata.
- `print_meta`: Whether to print the document's metadata.
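
Example: a minimal sketch with a hand-built result dict standing in for the output of a document search pipeline.

```python
from haystack.schema import Document
from haystack.utils import print_documents

# Hand-built stand-in for the output of a document search pipeline.
results = {
    "query": "climate change",
    "documents": [
        Document(
            content="Climate change refers to long-term shifts in temperatures and weather patterns.",
            meta={"name": "climate.txt"},
        )
    ],
}

# Print each document's name and the first 50 characters of its content.
print_documents(results, max_text_len=50, print_name=True)
```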
<a id="export_utils.print_questions"></a>
|
|
|
|
#### print\_questions
|
|
|
|
```python
|
|
def print_questions(results: dict)
|
|
```
|
|
|
|
Utility to print the output of a question generating pipeline in a readable format.
|
|
|
|
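
Example: a sketch assuming Haystack's `QuestionGenerator` node and `QuestionGenerationPipeline` (the document content is illustrative; the model is downloaded on first use).

```python
from haystack.nodes import QuestionGenerator
from haystack.pipelines import QuestionGenerationPipeline
from haystack.schema import Document
from haystack.utils import print_questions

# Generate questions from a document and print them in a readable format.
pipeline = QuestionGenerationPipeline(QuestionGenerator())
doc = Document(content="Python is a programming language created by Guido van Rossum.")

results = pipeline.run(documents=[doc])
print_questions(results)
```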
<a id="export_utils.export_answers_to_csv"></a>
|
|
|
|
#### export\_answers\_to\_csv
|
|
|
|
```python
|
|
def export_answers_to_csv(agg_results: list, output_file)
|
|
```
|
|
|
|
Exports answers coming from finder.get_answers() to a CSV file.
|
|
|
|
**Arguments**:
|
|
|
|
- `agg_results`: A list of predictions coming from finder.get_answers().
|
|
- `output_file`: The name of the output file.
|
|
|
|
**Returns**:
|
|
|
|
None
|
|
|
|
<a id="export_utils.convert_labels_to_squad"></a>
|
|
|
|
#### convert\_labels\_to\_squad
|
|
|
|
```python
|
|
def convert_labels_to_squad(labels_file: str)
|
|
```
|
|
|
|
Convert the export from the labeling UI to the SQuAD format for training.
|
|
|
|
**Arguments**:
|
|
|
|
- `labels_file`: The path to the file containing labels.
|
|
|
|
<a id="preprocessing"></a>
|
|
|
|
# Module preprocessing
|
|
|
|
<a id="preprocessing.convert_files_to_docs"></a>
|
|
|
|
#### convert\_files\_to\_docs
|
|
|
|
```python
|
|
def convert_files_to_docs(
|
|
dir_path: str,
|
|
clean_func: Optional[Callable] = None,
|
|
split_paragraphs: bool = False,
|
|
encoding: Optional[str] = None,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```

Convert all files (.txt, .pdf, .docx) in the sub-directories of the given path to Documents that can be
written to a Document Store.

**Arguments**:

- `dir_path`: The path of the directory containing the files.
- `clean_func`: A custom cleaning function that gets applied to each Document (input: str, output: str).
- `split_paragraphs`: Whether to split the text by paragraph.
- `encoding`: Character encoding to use when converting PDF documents.
- `id_hash_keys`: A list of Document attribute names from which the Document ID should be hashed.
Useful for generating unique IDs even if the Document contents are identical.
To ensure you don't have duplicate Documents in your Document Store if texts are
not unique, you can modify the metadata and pass `["content", "meta"]` to this field.
If you do this, the Document ID will be generated by using the content and the defined metadata.
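
Example: the typical tutorial-style usage, assuming the `clean_wiki_text` helper from `haystack.utils` and an illustrative directory path.

```python
from haystack.utils import clean_wiki_text, convert_files_to_docs

# Convert every .txt, .pdf, and .docx file under data/my_files/ into Document objects,
# cleaning each file's text and splitting it into one Document per paragraph.
docs = convert_files_to_docs(
    dir_path="data/my_files",
    clean_func=clean_wiki_text,
    split_paragraphs=True,
)

# The resulting Documents can then be written to a Document Store:
# document_store.write_documents(docs)
```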
<a id="preprocessing.tika_convert_files_to_docs"></a>
|
|
|
|
#### tika\_convert\_files\_to\_docs
|
|
|
|
```python
|
|
def tika_convert_files_to_docs(
|
|
dir_path: str,
|
|
clean_func: Optional[Callable] = None,
|
|
split_paragraphs: bool = False,
|
|
merge_short: bool = True,
|
|
merge_lowercase: bool = True,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```

Convert all files (.txt, .pdf) in the sub-directories of the given path to Documents that can be
written to a Document Store.

**Arguments**:

- `dir_path`: The path to the directory containing the files.
- `clean_func`: A custom cleaning function that gets applied to each doc (input: str, output: str).
- `split_paragraphs`: Whether to split the text by paragraph.
- `merge_short`: Whether to allow merging of short paragraphs.
- `merge_lowercase`: Whether to convert merged paragraphs to lowercase.
- `id_hash_keys`: A list of Document attribute names from which the Document ID should be hashed.
Useful for generating unique IDs even if the Document contents are identical.
To ensure you don't have duplicate Documents in your Document Store if texts are
not unique, you can modify the metadata and pass `["content", "meta"]` to this field.
If you do this, the Document ID will be generated by using the content and the defined metadata.
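
Example: a sketch assuming a running Apache Tika server reachable by the converter (for instance via the official Tika docker image); the directory path is illustrative.

```python
from haystack.utils import tika_convert_files_to_docs

# Convert every .txt and .pdf file under data/mixed_files/ into Document objects,
# merging paragraphs as configured by merge_short and merge_lowercase.
docs = tika_convert_files_to_docs(
    dir_path="data/mixed_files",
    split_paragraphs=True,
    merge_short=True,
    merge_lowercase=True,
)
```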
<a id="squad_data"></a>
|
|
|
|
# Module squad\_data
|
|
|
|
<a id="squad_data.SquadData"></a>
|
|
|
|
## SquadData
|
|
|
|
```python
|
|
class SquadData()
|
|
```

This class is designed to manipulate data that is in the SQuAD format.

<a id="squad_data.SquadData.__init__"></a>

#### SquadData.\_\_init\_\_

```python
def __init__(squad_data)
```

**Arguments**:

- `squad_data`: SQuAD format data, either as a dictionary with a `data` key, or just a list of SQuAD documents.
<a id="squad_data.SquadData.merge_from_file"></a>
|
|
|
|
#### SquadData.merge\_from\_file
|
|
|
|
```python
|
|
def merge_from_file(filename: str)
|
|
```
|
|
|
|
Merge the contents of a JSON file in the SQuAD format with the data stored in this object.
|
|
|
|
<a id="squad_data.SquadData.merge"></a>
|
|
|
|
#### SquadData.merge
|
|
|
|
```python
|
|
def merge(new_data: List)
|
|
```
|
|
|
|
Merge data in SQuAD format with the data stored in this object.
|
|
|
|
**Arguments**:
|
|
|
|
- `new_data`: A list of SQuAD document data.
|
|
|
|
<a id="squad_data.SquadData.from_file"></a>
|
|
|
|
#### SquadData.from\_file
|
|
|
|
```python
|
|
@classmethod
|
|
def from_file(cls, filename: str)
|
|
```
|
|
|
|
Create a SquadData object by providing the name of a JSON file in the SQuAD format.
|
|
|
|
<a id="squad_data.SquadData.save"></a>
|
|
|
|
#### SquadData.save
|
|
|
|
```python
|
|
def save(filename: str)
|
|
```
|
|
|
|
Write the data stored in this object to a JSON file.
|
|
|
|
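
Example: loading, merging, and saving SQuAD-format files (file names are illustrative; the import path may differ slightly between Haystack versions).

```python
from haystack.utils.squad_data import SquadData

# Load an existing SQuAD-format dataset, merge annotations from a second file,
# and write the combined data back to disk.
squad = SquadData.from_file("train-v2.0.json")
squad.merge_from_file("my_annotations.json")
squad.save("combined_train.json")
```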
<a id="squad_data.SquadData.to_document_objs"></a>
|
|
|
|
#### SquadData.to\_document\_objs
|
|
|
|
```python
|
|
def to_document_objs()
|
|
```
|
|
|
|
Export all paragraphs stored in this object to haystack.Document objects.
|
|
|
|
<a id="squad_data.SquadData.to_label_objs"></a>
|
|
|
|
#### SquadData.to\_label\_objs
|
|
|
|
```python
|
|
def to_label_objs()
|
|
```
|
|
|
|
Export all labels stored in this object to haystack.Label objects.
|
|
|
|
<a id="squad_data.SquadData.to_df"></a>
|
|
|
|
#### SquadData.to\_df
|
|
|
|
```python
|
|
@staticmethod
|
|
def to_df(data)
|
|
```
|
|
|
|
Convert a list of SQuAD document dictionaries into a pandas dataframe (each row is one annotation).
|
|
|
|
<a id="squad_data.SquadData.count"></a>
|
|
|
|
#### SquadData.count
|
|
|
|
```python
|
|
def count(unit="questions")
|
|
```
|
|
|
|
Count the samples in the data. Choose a unit: "paragraphs", "questions", "answers", "no_answers", "span_answers".
|
|
|
|
<a id="squad_data.SquadData.df_to_data"></a>
|
|
|
|
#### SquadData.df\_to\_data
|
|
|
|
```python
|
|
@classmethod
|
|
def df_to_data(cls, df)
|
|
```
|
|
|
|
Convert a data frame into the SQuAD format data (list of SQuAD document dictionaries).
|
|
|
|
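
Example: a round trip through a DataFrame to filter annotations, continuing with the `squad` object from the earlier example. This sketch assumes the flattened frame exposes a `title` column and that the object stores its raw data in a `data` attribute; both are assumptions about the internal layout, not guaranteed API.

```python
# Flatten to one row per annotation, filter, and convert back to SQuAD format.
df = SquadData.to_df(squad.data)
filtered_df = df[df["title"] != "Some unwanted article"]  # hypothetical filter
squad_filtered = SquadData(SquadData.df_to_data(filtered_df))
```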
<a id="squad_data.SquadData.sample_questions"></a>
|
|
|
|
#### SquadData.sample\_questions
|
|
|
|
```python
|
|
def sample_questions(n)
|
|
```
|
|
|
|
Return a sample of n questions in the SQuAD format (a list of SQuAD document dictionaries).
|
|
Note that if the same question is asked on multiple different passages, this function treats that
|
|
as a single question.
|
|
|
|
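
Example: drawing a fixed-size sample and saving it as a smaller SQuAD-format file, continuing with the `squad` object from the earlier example (the file name is illustrative).

```python
# Sample 100 questions and persist them as a compact dev set.
sample = squad.sample_questions(n=100)
SquadData(sample).save("dev_sample.json")
```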
<a id="squad_data.SquadData.get_all_paragraphs"></a>
|
|
|
|
#### SquadData.get\_all\_paragraphs
|
|
|
|
```python
|
|
def get_all_paragraphs()
|
|
```
|
|
|
|
Return all paragraph strings.
|
|
|
|
<a id="squad_data.SquadData.get_all_questions"></a>
|
|
|
|
#### SquadData.get\_all\_questions
|
|
|
|
```python
|
|
def get_all_questions()
|
|
```
|
|
|
|
Return all question strings. Note that if the same question appears for different paragraphs, this function returns it multiple times.
|
|
|
|
<a id="squad_data.SquadData.get_all_document_titles"></a>
|
|
|
|
#### SquadData.get\_all\_document\_titles
|
|
|
|
```python
|
|
def get_all_document_titles()
|
|
```
|
|
|
|
Return all document title strings.
|
|
|
|
<a id="early_stopping"></a>
|
|
|
|
# Module early\_stopping
|
|
|
|
<a id="early_stopping.EarlyStopping"></a>
|
|
|
|
## EarlyStopping
|
|
|
|
```python
|
|
class EarlyStopping()
|
|
```

An object you can use to control early stopping with a Node's `train()` method or a Trainer class. You can
use a custom EarlyStopping class instead as long as it implements the method `check_stopping()` and
provides the attribute `save_dir`.
<a id="early_stopping.EarlyStopping.__init__"></a>
|
|
|
|
#### EarlyStopping.\_\_init\_\_
|
|
|
|
```python
|
|
def __init__(head: int = 0,
|
|
metric: Union[str, Callable] = "loss",
|
|
save_dir: Optional[str] = None,
|
|
mode: Literal["min", "max"] = "min",
|
|
patience: int = 0,
|
|
min_delta: float = 0.001,
|
|
min_evals: int = 0)
|
|
```

**Arguments**:

- `head`: The index of the prediction head that you are evaluating to determine the chosen `metric`.
In Haystack, the large majority of models are trained from the loss signal of a single prediction
head, so the default value of 0 should work in most cases.
- `save_dir`: The directory where to save the final best model. If you set it to None, the model is not saved.
- `metric`: The name of a dev set metric to monitor (default: loss) which is extracted from the prediction
head specified by the variable `head`, or a function that extracts a value from the trainer dev evaluation
result.
For FARMReader training, available metrics to choose from include "EM", "f1", and "top_n_accuracy".
For DensePassageRetriever training, available metrics to choose from include "acc", "f1", and "average_rank".
NOTE: This is different from the metric that is specified in the Processor, which defines how to calculate
one or more evaluation metric values from the prediction and target sets. The `metric` variable in this
function specifies the name of one particular metric value, or it is a method that calculates a value from
the result returned by the Processor metric.
- `mode`: When set to "min", training stops if the metric does not continue to decrease. When set to "max",
training stops if the metric does not continue to increase.
- `patience`: How many evaluations with no improvement to perform before stopping training.
- `min_delta`: The minimum difference from the previous best value that counts as an improvement.
- `min_evals`: The minimum number of evaluations to perform before checking whether the evaluation metric is
improving.
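
Example: a sketch assuming a Haystack 1.x release in which `FARMReader.train()` accepts an `early_stopping` argument; paths and file names are illustrative.

```python
from haystack.nodes import FARMReader
from haystack.utils.early_stopping import EarlyStopping

# Stop fine-tuning once dev-set F1 has not improved for three evaluations in a row,
# and keep the best checkpoint in models/best_reader.
early_stopping = EarlyStopping(
    metric="f1",
    mode="max",            # F1 should increase
    patience=3,            # tolerate three evaluations without improvement
    save_dir="models/best_reader",
)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
reader.train(
    data_dir="data/squad",
    train_filename="train.json",
    dev_filename="dev.json",
    early_stopping=early_stopping,
)
```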
<a id="early_stopping.EarlyStopping.check_stopping"></a>
|
|
|
|
#### EarlyStopping.check\_stopping
|
|
|
|
```python
|
|
def check_stopping(eval_result: List[Dict]) -> Tuple[bool, bool, float]
|
|
```
|
|
|
|
Provides the evaluation value for the current evaluation. Returns true if stopping should occur.
|
|
|
|
This saves the model if you provided `self.save_dir` when initializing `EarlyStopping`.
|
|
|
|
**Arguments**:
|
|
|
|
- `eval_result`: The current evaluation result which consists of a list of dictionaries, one for each
|
|
prediction head. Each dictionary contains the metrics and reports generated during evaluation.
|
|
|
|
**Returns**:
|
|
|
|
A tuple (stopprocessing, savemodel, eval_value) indicating if processing should be stopped
|
|
and if the current model should get saved and the evaluation value used.
|
|
|
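
Because `metric` in `EarlyStopping.__init__()` can also be a callable, you can compute the monitored value from this per-head result list yourself. A minimal sketch, continuing from the example above and using one of the FARMReader metrics mentioned earlier:

```python
def top_n_accuracy_of_first_head(eval_result):
    # eval_result is a list of dicts, one per prediction head,
    # as described for check_stopping() above.
    return eval_result[0]["top_n_accuracy"]

early_stopping = EarlyStopping(
    metric=top_n_accuracy_of_first_head,
    mode="max",
    patience=2,
)
```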