haystack/docs/_src/api/api/preprocessor.md
Sara Zan e85b948a4c
Fix PreProcessor test (#2290)
* Adding Document import, missing from recent PR

* Fix mypy signature warning too

* reduce diff to minimum

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-03-09 13:46:47 +01:00


<a id="base"></a>
# Module base
<a id="base.BasePreProcessor"></a>
## BasePreProcessor
```python
class BasePreProcessor(BaseComponent)
```
<a id="base.BasePreProcessor.process"></a>
#### process
```python
def process(documents: Union[dict, List[dict]], clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, remove_substrings: List[str] = [], split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_overlap: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True) -> List[dict]
```
Perform document cleaning and splitting. Takes a single document or a list of documents as input and returns a list of documents.
<a id="preprocessor"></a>
# Module preprocessor
<a id="preprocessor.PreProcessor"></a>
## PreProcessor
```python
class PreProcessor(BasePreProcessor)
```
<a id="preprocessor.PreProcessor.process"></a>
#### process
```python
def process(documents: Union[dict, List[dict]], clean_whitespace: Optional[bool] = None, clean_header_footer: Optional[bool] = None, clean_empty_lines: Optional[bool] = None, remove_substrings: List[str] = [], split_by: Optional[str] = None, split_length: Optional[int] = None, split_overlap: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = None) -> List[dict]
```
Perform document cleaning and splitting. Takes a single document or a list of documents as input and returns a list of documents.
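
The single-or-list input handling described above can be sketched without Haystack installed. This is a minimal illustration, not the library's implementation: `process_any`, `split_in_two`, and the `"content"` key are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Union

Document = Dict[str, object]


def process_any(
    documents: Union[Document, List[Document]],
    process_one: Callable[[Document], List[Document]],
) -> List[Document]:
    """Hypothetical helper: accept one document or a list, always return a flat list."""
    # Wrap a single document so the loop below handles both input shapes.
    if isinstance(documents, dict):
        documents = [documents]
    if not isinstance(documents, list):
        raise TypeError("documents must be a dict or a list of dicts")

    results: List[Document] = []
    for doc in documents:
        # Each input document may yield several output documents.
        results.extend(process_one(doc))
    return results


def split_in_two(doc: Document) -> List[Document]:
    """Toy per-document step (assumption for illustration): cut the content in half."""
    text = str(doc["content"])
    middle = len(text) // 2
    return [{"content": text[:middle]}, {"content": text[middle:]}]


single = process_any({"content": "abcd"}, split_in_two)
many = process_any([{"content": "abcd"}, {"content": "ef"}], split_in_two)
```

Either call returns a flat list of dictionaries, so downstream code does not need to know which input shape was used.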
<a id="preprocessor.PreProcessor.clean"></a>
#### clean
```python
def clean(document: dict, clean_whitespace: bool, clean_header_footer: bool, clean_empty_lines: bool, remove_substrings: List[str]) -> dict
```
Perform document cleaning on a single document and return a single document. This method handles whitespace, headers, footers,
and empty lines. Its exact behavior is defined by the parameters passed into `PreProcessor.__init__()`.
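
As a rough, standalone sketch of what whitespace, empty-line, and substring cleaning can look like, the function below works on plain strings. It is an assumption-laden illustration rather than Haystack's code: `clean_text` is a hypothetical name, header and footer removal is left out, and the exact rules (strip each line, collapse runs of blank lines to one) are choices made for the example.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    clean_whitespace: bool = True,
    clean_empty_lines: bool = True,
    remove_substrings: Optional[List[str]] = None,
) -> str:
    """Hypothetical cleaner: the parameter names mirror the documented ones."""
    if clean_whitespace:
        # Strip leading and trailing whitespace from every line.
        text = "\n".join(line.strip() for line in text.split("\n"))
    if clean_empty_lines:
        # Collapse runs of blank lines into a single blank line.
        text = re.sub(r"\n\n+", "\n\n", text)
    for substring in remove_substrings or []:
        # Remove every occurrence of each unwanted substring.
        text = text.replace(substring, "")
    return text


cleaned = clean_text("  first line  \n\n\n\n second line ")
without_tag = clean_text("hello [AD] world", remove_substrings=["[AD] "])
```

Using `None` as the default for `remove_substrings` avoids sharing one mutable list across calls in this sketch.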
<a id="preprocessor.PreProcessor.split"></a>
#### split
```python
def split(document: dict, split_by: str, split_length: int, split_overlap: int, split_respect_sentence_boundary: bool) -> List[dict]
```
Perform document splitting on a single document. This method can split on different units, at different lengths,
and with different strides. It can also respect sentence boundaries. Its exact behavior is defined by
the parameters passed into `PreProcessor.__init__()`. Takes a single document as input and returns a list of documents.
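
To make the interaction between split length and overlap concrete, here is a minimal word-level sliding window. It is a sketch under stated assumptions, not the library's algorithm: `split_words` is a hypothetical name, the stride is taken to be `split_length - split_overlap`, and sentence boundaries are ignored.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Hypothetical splitter: fixed-size word windows that may overlap."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    # Each new chunk starts this many words after the previous one.
    stride = split_length - split_overlap

    chunks: List[str] = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk reaches the end, so no redundant tail chunk is emitted.
        if start + split_length >= len(words):
            break
    return chunks


overlapping = split_words("a b c d e", split_length=3, split_overlap=1)
disjoint = split_words("a b c d e f g", split_length=3)
```

With an overlap of 1, the last word of each chunk is repeated as the first word of the next, which preserves some context across chunk borders at the cost of duplicated text.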