haystack/releasenotes/notes/fix-nltk-doc-splitter-d0864dda906c45b0.yaml
Sebastian Husch Lee 0c11c7b98e
fix: Bring in fix from custom nodes (#8539)
* Bring in fix from custom nodes

* Add to_dict function and test

* reno

* Fix pylint
2024-11-14 13:00:28 +01:00


---
fixes:
  - |
    For the NLTKDocumentSplitter, we updated how chunks are built when splitting by word while respecting sentence boundaries.
    To avoid fully subsuming the previous chunk into the next one, the first sentence of the previous chunk is now ignored when calculating sentence overlap.
    That is, we want to avoid cases such as Doc1 = [s1, s2], Doc2 = [s1, s2, s3].
    Completed support for function-based splitting in this component by updating the _split_into_units function and adding the splitting_function init parameter.
    Added a dedicated to_dict method that overrides the one inherited from DocumentSplitter. This is needed to correctly save the component's settings to YAML.
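
The overlap rule described in the first fix can be illustrated with a small sketch. This is a simplified stand-in, not the component's actual code: the function name `sentences_to_carry_over` and its parameters are hypothetical, and word counts are approximated by whitespace splitting. It counts how many trailing sentences of a chunk would be repeated at the start of the next chunk, never considering the chunk's first sentence.

```python
from typing import List


def sentences_to_carry_over(chunk: List[str], split_length: int, split_overlap: int) -> int:
    """Return how many trailing sentences of `chunk` to repeat in the next chunk.

    Hypothetical helper for illustration only; not part of Haystack's API.
    """
    if split_overlap == 0:
        return 0

    kept = 0
    words = 0
    # Skip chunk[0]: if it were eligible, the next chunk could start exactly
    # like this one and fully contain it (Doc1 = [s1, s2], Doc2 = [s1, s2, s3]).
    for sentence in reversed(chunk[1:]):
        words += len(sentence.split())
        # Never carry over more words than a whole chunk may hold.
        if words > split_length:
            break
        kept += 1
        # Stop once the requested overlap has been covered.
        if words > split_overlap:
            break
    return kept


s1 = "Alpha beta."
s2 = "Gamma delta."
# Even with a generous overlap, only s2 can be carried over.
carried = sentences_to_carry_over([s1, s2], split_length=10, split_overlap=10)
```

Under these assumptions, a chunk [s1, s2] carries at most s2 into the next chunk, so the next chunk starts at s2 rather than repeating s1.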