haystack/releasenotes/notes/preprocessor-2-0-9828d930562fa3f5.yaml
Julian Risch 4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870)
* draft split by word, sentence, passage

* naive way to split sentences without nltk

* reno

* add tests

* make input list of docs, review feedback

* add source_id and more validation

* update docstrings

* add split delimiters back to strings

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00

5 lines
277 B
YAML

---
preview:
- |
Add the `TextDocumentSplitter` component for Haystack 2.0 that splits a Document with long text into multiple Documents with shorter texts. Thereby the texts match the maximum length that the language models in Embedders or other components can process.