mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-06 04:43:39 +00:00

2022-03-09 13:46:47 +01:00

Module base

BasePreProcessor

class BasePreProcessor(BaseComponent)

process

def process(documents: Union[dict, List[dict]], clean_whitespace: Optional[bool] = True, clean_header_footer: Optional[bool] = False, clean_empty_lines: Optional[bool] = True, remove_substrings: List[str] = [], split_by: Optional[str] = "word", split_length: Optional[int] = 1000, split_overlap: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = True) -> List[dict]

Perform document cleaning and splitting. Per the signature, takes a single document or a list of documents as input and returns a list of documents.

Module preprocessor

PreProcessor

class PreProcessor(BasePreProcessor)

process

def process(documents: Union[dict, List[dict]], clean_whitespace: Optional[bool] = None, clean_header_footer: Optional[bool] = None, clean_empty_lines: Optional[bool] = None, remove_substrings: List[str] = [], split_by: Optional[str] = None, split_length: Optional[int] = None, split_overlap: Optional[int] = None, split_respect_sentence_boundary: Optional[bool] = None) -> List[dict]

Perform document cleaning and splitting. Can take a single document or a list of documents as input and returns a list of documents.
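The input handling described above can be sketched without Haystack installed. This is a minimal illustration, not the library's implementation: `SketchPreProcessor` is a hypothetical stand-in, the `"content"` key for the document text is an assumption, and so is the idea that a `None` argument falls back to the value configured on the instance.

```python
from typing import List, Optional, Union


class SketchPreProcessor:
    """Hypothetical stand-in used for illustration; not Haystack's PreProcessor."""

    def __init__(self, split_length: int = 1000):
        self.split_length = split_length

    def process(
        self,
        documents: Union[dict, List[dict]],
        split_length: Optional[int] = None,
    ) -> List[dict]:
        # Assumption: None means "use the value configured on the instance".
        if split_length is None:
            split_length = self.split_length

        # Accept a single document or a list of documents.
        if isinstance(documents, dict):
            documents = [documents]

        # Split each document into word chunks and flatten into one list.
        result: List[dict] = []
        for doc in documents:
            words = doc["content"].split()
            for start in range(0, len(words), split_length):
                chunk = " ".join(words[start:start + split_length])
                result.append({"content": chunk})
        return result


pp = SketchPreProcessor(split_length=2)
single = pp.process({"content": "a b c"})
many = pp.process([{"content": "a b"}, {"content": "c d e"}], split_length=3)
```

Whichever form the input takes, the return value is a flat list, so callers can treat both cases the same way.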

clean

def clean(document: dict, clean_whitespace: bool, clean_header_footer: bool, clean_empty_lines: bool, remove_substrings: List[str]) -> dict

Perform document cleaning on a single document and return a single document. This method deals with whitespace, headers, footers and empty lines. Its exact functionality is defined by the parameters passed into PreProcessor.__init__().
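As a rough illustration of what these cleaning flags could do, the sketch below applies them to a plain dict. `clean_sketch` is a hypothetical helper and the `"content"` key is an assumption. Header and footer removal is left out because the source does not describe how it works.

```python
import re
from typing import List


def clean_sketch(
    document: dict,
    clean_whitespace: bool,
    clean_empty_lines: bool,
    remove_substrings: List[str],
) -> dict:
    """Illustrative only; header/footer removal is intentionally omitted."""
    text = document["content"]

    if clean_whitespace:
        # Strip leading and trailing whitespace from every line.
        text = "\n".join(line.strip() for line in text.split("\n"))

    if clean_empty_lines:
        # Collapse runs of blank lines into a single blank line.
        text = re.sub(r"\n\n+", "\n\n", text)

    for substring in remove_substrings:
        text = text.replace(substring, "")

    # Return a new dict so the input document is not mutated.
    return {**document, "content": text}


doc = {"content": "  Hello  \n\n\n\nWorld [ad] ", "meta": {"name": "a"}}
cleaned = clean_sketch(
    doc,
    clean_whitespace=True,
    clean_empty_lines=True,
    remove_substrings=[" [ad]"],
)
```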

split

def split(document: dict, split_by: str, split_length: int, split_overlap: int, split_respect_sentence_boundary: bool) -> List[dict]

Perform document splitting on a single document. This method can split on different units, at different lengths, with different strides. It can also respect sentence boundaries. Its exact functionality is defined by the parameters passed into PreProcessor.__init__(). Takes a single document as input and returns a list of documents.
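The relationship between length, overlap, and stride can be shown with a small word-based sketch. `split_by_word_sketch` is a hypothetical helper, the `"content"` key is an assumption, and sentence-boundary handling is omitted. The stride is taken to be `split_length - split_overlap`.

```python
from typing import List


def split_by_word_sketch(
    document: dict, split_length: int, split_overlap: int
) -> List[dict]:
    """Illustrative word-based splitting; ignores sentence boundaries."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = document["content"].split()
    stride = split_length - split_overlap  # how far each window advances

    chunks: List[dict] = []
    for start in range(0, len(words), stride):
        window = words[start:start + split_length]
        chunks.append({**document, "content": " ".join(window)})
        # Stop once a window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


parts = split_by_word_sketch(
    {"content": "a b c d e"}, split_length=3, split_overlap=1
)
```

With a length of 3 and an overlap of 1, consecutive chunks share one word, so the second chunk starts at the last word of the first.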
