haystack/releasenotes/notes/text-document-cleaner-8afce831a2ac31ae.yaml
Julian Risch aaee03aee8
feat: Add DocumentCleaner 2.0 (#5976)
* remove whitespaces, substrings, regex, empty lines

* remove repeated substrings

* reno

* return empty string as shortest common ngram

* address first half of review feedback

* address second half of review feedback

* mention \f page separator for header/footer removal

* mark example usage as python code
2023-10-13 12:39:55 +02:00


---
preview:
  - |
    Added DocumentCleaner, which removes extra whitespace, empty lines, repeated headers and footers, and user-specified substrings or regex matches from Documents containing text.
    Useful as a preprocessing step before splitting into shorter text documents.