mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-06-26 22:00:13 +00:00

* remove whitespaces, substrings, regex, empty lines
* remove repeated substrings
* reno
* return empty string as shortest common ngram
* address first half of review feedback
* address second half of review feedback
* mention \f page separator for header/footer removal
* mark example usage as python code
6 lines
218 B
YAML
---
preview:
  - |
    Added DocumentCleaner, which removes extra whitespace, empty lines, headers, etc. from Documents containing text.
    Useful as a preprocessing step before splitting into shorter text documents.
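
The kind of cleaning described in the note can be illustrated with a minimal standalone sketch. This is not the DocumentCleaner implementation or its API: the function name clean_text is hypothetical, it works on plain strings rather than Documents, and the header step is a deliberate simplification that only drops a first line repeated verbatim on every page. It assumes pages are separated by the form feed character \f, as the commit message mentions for header/footer removal.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper (not Haystack's API): basic text cleanup.

    Collapses runs of spaces/tabs, drops empty lines, and removes a
    first line that is repeated verbatim on every page. Pages are
    assumed to be separated by the form feed character "\f".
    """
    pages = []
    for page in text.split("\f"):
        # Collapse spaces/tabs inside each line and trim both ends.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in page.split("\n")]
        # Keep only lines that still contain text.
        pages.append([line for line in lines if line])

    # Simplified header removal: only when there are several non-empty
    # pages and they all start with exactly the same line.
    if len(pages) > 1 and all(pages) and len({p[0] for p in pages}) == 1:
        pages = [p[1:] for p in pages]

    return "\f".join("\n".join(p) for p in pages)


sample = "Report 2023\n\nFirst   page  text. \n\fReport 2023\n\n\nSecond page\ttext.\n"
print(repr(clean_text(sample)))
```

Running a step like this before splitting keeps repeated page headers and blank lines from ending up inside the shorter text documents.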