<a id="base"></a>
|
|
|
|
# Module base
|
|
|
|
<a id="base.BasePreProcessor"></a>
|
|
|
|
## BasePreProcessor
|
|
|
|
```python
|
|
class BasePreProcessor(BaseComponent)
|
|
```
|
|
|
|
<a id="base.BasePreProcessor.process"></a>
|
|
|
|
#### BasePreProcessor.process
|
|
|
|
```python
|
|
@abstractmethod
|
|
def process(documents: Union[dict, Document, List[Union[dict, Document]]],
|
|
clean_whitespace: Optional[bool] = True,
|
|
clean_header_footer: Optional[bool] = False,
|
|
clean_empty_lines: Optional[bool] = True,
|
|
remove_substrings: List[str] = [],
|
|
split_by: Optional[str] = "word",
|
|
split_length: Optional[int] = 1000,
|
|
split_overlap: Optional[int] = None,
|
|
split_respect_sentence_boundary: Optional[bool] = True,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```
|
|
|
|
Perform document cleaning and splitting. Takes a single Document or a List of Documents as input and returns a
|
|
list of Documents.
|
|
|
|
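A minimal sketch of this contract, using the concrete `PreProcessor` subclass documented below (Haystack v1.x import paths assumed):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(split_by="word", split_length=100)

# A single Document is accepted as input ...
doc = Document(content="First sentence. Second sentence. Third sentence.")
single_result = preprocessor.process(doc)

# ... and so is a list of Documents. The return value is a list either way.
list_result = preprocessor.process([doc, Document(content="Another document.")])
print(len(single_result), len(list_result))
```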
<a id="preprocessor"></a>
|
|
|
|
# Module preprocessor
|
|
|
|
<a id="preprocessor.PreProcessor"></a>
|
|
|
|
## PreProcessor
|
|
|
|
```python
|
|
class PreProcessor(BasePreProcessor)
|
|
```
|
|
|
|
<a id="preprocessor.PreProcessor.__init__"></a>
|
|
|
|
#### PreProcessor.\_\_init\_\_
|
|
|
|
```python
|
|
def __init__(clean_whitespace: bool = True,
|
|
clean_header_footer: bool = False,
|
|
clean_empty_lines: bool = True,
|
|
remove_substrings: List[str] = [],
|
|
split_by: str = "word",
|
|
split_length: int = 200,
|
|
split_overlap: int = 0,
|
|
split_respect_sentence_boundary: bool = True,
|
|
tokenizer_model_folder: Optional[Union[str, Path]] = None,
|
|
language: str = "en",
|
|
id_hash_keys: Optional[List[str]] = None,
|
|
progress_bar: bool = True,
|
|
add_page_number: bool = False)
|
|
```
|
|
|
|
**Arguments**:

- `clean_header_footer`: Use a heuristic to remove footers and headers across different pages by searching
for the longest common string. This heuristic uses exact matches and therefore
works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4"
or similar.
- `clean_whitespace`: Strip leading and trailing whitespace from each line in the text.
- `clean_empty_lines`: Remove runs of more than two consecutive empty lines from the text.
- `remove_substrings`: Remove the specified substrings from the text.
- `split_by`: Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
- `split_length`: Maximum number of the above split units (e.g. words) allowed in one document. For instance, if split_length=10 and split_by="sentence",
then each output document will have at most 10 sentences.
- `split_overlap`: Overlap (in split units, e.g. words) between two adjacent documents after a split.
Setting this to a positive number enables a sliding-window approach.
For example, if split_by="word", split_length=5, and split_overlap=2, then the splits would be:
[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11].
Set the value to 0 to ensure there is no overlap among the documents after splitting.
See the sketch after this list for a concrete configuration.
- `split_respect_sentence_boundary`: Whether to avoid splitting inside sentences when split_by="word". If set
to True, each split will contain only complete sentences and
at most split_length words.
- `language`: The language used by "nltk.tokenize.sent_tokenize", in ISO 639 format.
Available options: "ru", "sl", "es", "sv", "tr", "cs", "da", "nl", "en", "et", "fi", "fr", "de", "el", "it", "no", "pl", "pt", "ml"
- `tokenizer_model_folder`: Path to the folder containing the NLTK PunktSentenceTokenizer models, if loading a model from a local path. Leave empty otherwise.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
In this case the ID will be generated from both the content and the defined metadata.
- `progress_bar`: Whether to show a progress bar.
- `add_page_number`: Add the number of the page a paragraph occurs in to the Document's meta
field `"page"`. Page boundaries are determined by the `"\f"` character, which is inserted
between pages by `PDFToTextConverter`, `TikaConverter`, `ParsrConverter`, and
`AzureConverter`.
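As referenced in the `split_overlap` description above, a minimal configuration sketch (Haystack v1.x import path assumed; `split_respect_sentence_boundary` is disabled so that fixed-size word windows are possible):

```python
from haystack.nodes import PreProcessor

# Sliding-window splitting: 5-word chunks with 2 words shared between
# adjacent chunks, so each new window starts 3 words after the previous one.
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=False,
    clean_empty_lines=True,
    split_by="word",
    split_length=5,
    split_overlap=2,
    split_respect_sentence_boundary=False,
)

# Dicts with a "content" key are accepted alongside Document objects.
docs = preprocessor.process([{"content": "w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11"}])
print([d.content for d in docs])
# Expected: ["w1 w2 w3 w4 w5", "w4 w5 w6 w7 w8", "w7 w8 w9 w10 w11"]
```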
<a id="preprocessor.PreProcessor.process"></a>
|
|
|
|
#### PreProcessor.process
|
|
|
|
```python
|
|
def process(documents: Union[dict, Document, List[Union[dict, Document]]],
|
|
clean_whitespace: Optional[bool] = None,
|
|
clean_header_footer: Optional[bool] = None,
|
|
clean_empty_lines: Optional[bool] = None,
|
|
remove_substrings: List[str] = [],
|
|
split_by: Optional[str] = None,
|
|
split_length: Optional[int] = None,
|
|
split_overlap: Optional[int] = None,
|
|
split_respect_sentence_boundary: Optional[bool] = None,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```
|
|
|
|
Perform document cleaning and splitting. Can take a single document or a list of documents as input and returns a list of documents.
|
|
|
|
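The cleaning and splitting arguments default to None here; a None value falls back to whatever was set in `__init__`, while an explicit value overrides it for that call only. A minimal sketch (Haystack v1.x import paths assumed):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(split_by="word", split_length=200, split_overlap=0)

doc = Document(content="This is a sentence. " * 100)

# Use the defaults configured in __init__ ...
docs_default = preprocessor.process([doc])

# ... or override individual settings for this call only; arguments
# left as None keep the values set at construction time.
docs_short = preprocessor.process([doc], split_length=50)
print(len(docs_default), len(docs_short))
```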
<a id="preprocessor.PreProcessor.clean"></a>
|
|
|
|
#### PreProcessor.clean
|
|
|
|
```python
|
|
def clean(document: Union[dict, Document],
|
|
clean_whitespace: bool,
|
|
clean_header_footer: bool,
|
|
clean_empty_lines: bool,
|
|
remove_substrings: List[str],
|
|
id_hash_keys: Optional[List[str]] = None) -> Document
|
|
```
|
|
|
|
Perform document cleaning on a single document and return a single document. This method will deal with whitespaces, headers, footers
|
|
and empty lines. Its exact functionality is defined by the parameters passed into PreProcessor.__init__().
|
|
|
|
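A minimal sketch of calling `clean` directly (Haystack v1.x import paths assumed; note that, unlike `process`, the cleaning flags are required arguments here):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor()

messy = Document(content="  padded line  \n\n\n\n\nnext line DRAFT")
cleaned = preprocessor.clean(
    document=messy,
    clean_whitespace=True,      # strip leading/trailing whitespace per line
    clean_header_footer=False,  # no repeated headers/footers in this snippet
    clean_empty_lines=True,     # collapse runs of more than two empty lines
    remove_substrings=["DRAFT"],
)
print(repr(cleaned.content))
```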
<a id="preprocessor.PreProcessor.split"></a>
|
|
|
|
#### PreProcessor.split
|
|
|
|
```python
|
|
def split(document: Union[dict, Document],
|
|
split_by: str,
|
|
split_length: int,
|
|
split_overlap: int,
|
|
split_respect_sentence_boundary: bool,
|
|
id_hash_keys: Optional[List[str]] = None) -> List[Document]
|
|
```
|
|
|
|
Perform document splitting on a single document. This method can split on different units, at different lengths,
|
|
with different strides. It can also respect sentence boundaries. Its exact functionality is defined by
|
|
the parameters passed into PreProcessor.__init__(). Takes a single document as input and returns a list of documents.
|
|
|
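A minimal sketch of calling `split` directly, using sentence units (Haystack v1.x import paths assumed; sentence splitting relies on NLTK's punkt models, which must be available locally):

```python
from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor()

doc = Document(content="One. Two. Three. Four.")
splits = preprocessor.split(
    document=doc,
    split_by="sentence",
    split_length=2,
    split_overlap=0,
    # Sentence-boundary handling only applies to split_by="word",
    # so it is disabled here.
    split_respect_sentence_boundary=False,
)
print([d.content for d in splits])
# Expected: two documents of two sentences each.
```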