<a id="base"></a>
|
|
|
|
# Module base
|
|
|
|
<a id="base.BasePreProcessor"></a>
|
|
|
|
## BasePreProcessor
|
|
|
|
```python
|
|
class BasePreProcessor(BaseComponent)
|
|
```
|
|
|
|
<a id="base.BasePreProcessor.process"></a>
|
|
|
|
#### BasePreProcessor.process
|
|
|
|
```python
|
|
@abstractmethod
|
|
def process(documents: Union[dict, Document, List[Union[dict, Document]]],
|
|
clean_whitespace: Optional[bool] = True,
|
|
clean_header_footer: Optional[bool] = False,
|
|
clean_empty_lines: Optional[bool] = True,
|
|
remove_substrings: List[str] = [],
|
|
split_by: Optional[str] = "word",
|
|
split_length: Optional[int] = 1000,
|
|
split_overlap: Optional[int] = None,
|
|
split_respect_sentence_boundary: Optional[bool] = True,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```
|
|
|
|
Perform document cleaning and splitting. Takes a single Document or a List of Documents as input and returns a
|
|
list of Documents.
|
|
|
|
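A minimal sketch of this contract, using the concrete `PreProcessor` subclass documented below (Haystack v1.x import paths assumed):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(split_by="word", split_length=100)

# A single Document is accepted as input ...
doc = Document(content="First sentence. Second sentence. Third sentence.")
single_result = preprocessor.process(doc)

# ... and so is a list of Documents. The return value is a list either way.
list_result = preprocessor.process([doc, Document(content="Another document.")])
print(len(single_result), len(list_result))
```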
<a id="preprocessor"></a>
|
|
|
|
# Module preprocessor
|
|
|
|
<a id="preprocessor.PreProcessor"></a>
|
|
|
|
## PreProcessor
|
|
|
|
```python
|
|
class PreProcessor(BasePreProcessor)
|
|
```
|
|
|
|
<a id="preprocessor.PreProcessor.__init__"></a>
|
|
|
|
#### PreProcessor.\_\_init\_\_
|
|
|
|
```python
|
|
def __init__(clean_whitespace: bool = True,
|
|
clean_header_footer: bool = False,
|
|
clean_empty_lines: bool = True,
|
|
remove_substrings: List[str] = [],
|
|
split_by: str = "word",
|
|
split_length: int = 200,
|
|
split_overlap: int = 0,
|
|
split_respect_sentence_boundary: bool = True,
|
|
tokenizer_model_folder: Optional[Union[str, Path]] = None,
|
|
language: str = "en",
|
|
id_hash_keys: Optional[List[str]] = None,
|
|
progress_bar: bool = True,
|
|
add_page_number: bool = False)
|
|
```
|
|
|
|
**Arguments**:

- `clean_header_footer`: Use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: Strip leading and trailing whitespace from each line in the text.
- `clean_empty_lines`: Remove runs of more than two consecutive empty lines from the text.
- `remove_substrings`: Remove the specified substrings from the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Maximum number of the above split units (e.g. words) allowed in one document. For instance, if split_length=10 and split_by="sentence",
then each output document will have at most 10 sentences.
- `split_overlap`: Overlap (in split units, e.g. words) between two adjacent documents after a split.
Setting this to a positive number enables a sliding-window approach.
For example, if split_by="word", split_length=5, and split_overlap=2, then the splits would be:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to 0 to ensure there is no overlap among the documents after splitting.
See the sketch after this list for a concrete configuration.
- `split_respect_sentence_boundary`: Whether to avoid splitting inside sentences when split_by="word". If set
to True, each split will contain only complete sentences and
at most split_length words.
- `language`: The language used by "nltk.tokenize.sent_tokenize", in ISO 639 format.
Available options: "ru", "sl", "es", "sv", "tr", "cs", "da", "nl", "en", "et", "fi", "fr", "de", "el", "it", "no", "pl", "pt", "ml"
- `tokenizer_model_folder`: Path to the folder containing the NLTK PunktSentenceTokenizer models, if loading a model from a local path. Leave empty otherwise.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the ID will be generated from both the content and the defined metadata.
- `progress_bar`: Whether to show a progress bar.
- `add_page_number`: Add the number of the page a paragraph occurs in to the Document's meta
field `"page"`. Page boundaries are determined by the `"\f"` character, which is inserted
between pages by `PDFToTextConverter`, `TikaConverter`, `ParsrConverter`, and
`AzureConverter`.
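As referenced in the `split_overlap` description above, a minimal configuration sketch (Haystack v1.x import path assumed; `split_respect_sentence_boundary` is disabled so that fixed-size word windows are possible):

```python
from haystack.nodes import PreProcessor

# Sliding-window splitting: 5-word chunks with 2 words shared between
# adjacent chunks, so each new window starts 3 words after the previous one.
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=False,
    clean_empty_lines=True,
    split_by="word",
    split_length=5,
    split_overlap=2,
    split_respect_sentence_boundary=False,
)

# Dicts with a "content" key are accepted alongside Document objects.
docs = preprocessor.process([{"content": "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11"}])
print([d.content for d in docs])
# Expected: ["w1 w2 w3 w4 w5", "w4 w5 w6 w7 w8", "w7 w8 w9 w10 w11"]
```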
<a id="preprocessor.PreProcessor.process"></a>
|
|
|
|
#### PreProcessor.process
|
|
|
|
```python
|
|
def process(documents: Union[dict, Document, List[Union[dict, Document]]],
|
|
clean_whitespace: Optional[bool] = None,
|
|
clean_header_footer: Optional[bool] = None,
|
|
clean_empty_lines: Optional[bool] = None,
|
|
remove_substrings: List[str] = [],
|
|
split_by: Optional[str] = None,
|
|
split_length: Optional[int] = None,
|
|
split_overlap: Optional[int] = None,
|
|
split_respect_sentence_boundary: Optional[bool] = None,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```
|
|
|
|
Perform document cleaning and splitting. Can take a single document or a list of documents as input and returns a list of documents.
|
|
|
|
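The cleaning and splitting arguments default to None here; a None value falls back to whatever was set in `__init__`, while an explicit value overrides it for that call only. A minimal sketch (Haystack v1.x import paths assumed):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(split_by="word", split_length=200, split_overlap=0)

doc = Document(content="This is a sentence. " * 100)

# Use the defaults configured in __init__ ...
docs_default = preprocessor.process([doc])

# ... or override individual settings for this call only; arguments
# left as None keep the values set at construction time.
docs_short = preprocessor.process([doc], split_length=50)
print(len(docs_default), len(docs_short))
```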
<a id="preprocessor.PreProcessor.clean"></a>
|
|
|
|
#### PreProcessor.clean
|
|
|
|
```python
|
|
def clean(document: Union[dict, Document],
|
|
clean_whitespace: bool,
|
|
clean_header_footer: bool,
|
|
clean_empty_lines: bool,
|
|
remove_substrings: List[str],
|
|
id_hash_keys: Optional[List[str]] = None) -> Document
|
|
```
|
|
|
|
Perform document cleaning on a single document and return a single document. This method will deal with whitespaces, headers, footers
|
|
and empty lines. Its exact functionality is defined by the parameters passed into PreProcessor.__init__().
|
|
|
|
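A minimal sketch of calling `clean` directly (Haystack v1.x import paths assumed; note that, unlike `process`, the cleaning flags are required arguments here):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor()

messy = Document(content="  padded line  \n\n\n\n\nnext line DRAFT")
cleaned = preprocessor.clean(
    document=messy,
    clean_whitespace=True,      # strip leading/trailing whitespace per line
    clean_header_footer=False,  # no repeated headers/footers in this snippet
    clean_empty_lines=True,     # collapse runs of more than two empty lines
    remove_substrings=["DRAFT"],
)
print(repr(cleaned.content))
```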
<a id="preprocessor.PreProcessor.split"></a>
|
|
|
|
#### PreProcessor.split
|
|
|
|
```python
|
|
def split(document: Union[dict, Document],
|
|
split_by: str,
|
|
split_length: int,
|
|
split_overlap: int,
|
|
split_respect_sentence_boundary: bool,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```
|
|
|
|
Perform document splitting on a single document. This method can split on different units, at different lengths,
|
|
with different strides. It can also respect sentence boundaries. Its exact functionality is defined by
|
|
the parameters passed into PreProcessor.__init__(). Takes a single document as input and returns a list of documents.
|
|
|
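A minimal sketch of calling `split` directly, using sentence units (Haystack v1.x import paths assumed; sentence splitting relies on NLTK's punkt models, which must be available locally):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor()

doc = Document(content="One. Two. Three. Four.")
splits = preprocessor.split(
    document=doc,
    split_by="sentence",
    split_length=2,
    split_overlap=0,
    # Sentence-boundary handling only applies to split_by="word",
    # so it is disabled here.
    split_respect_sentence_boundary=False,
)
print([d.content for d in splits])
# Expected: two documents of two sentences each.
```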