3 Commits

Author SHA1 Message Date
David S. Batista
0f00c1882e
fix: make SentenceSplitter QUOTE_SPANS_RE regex ReDoS-safe (#9338)
* fix: make QUOTE_SPANS_RE regex ReDoS-safe

* Removing the capture of leading non-character on double quotes, allowing quote with new lines, adding tests

* cleaning

* fixing release notes

* changing import

* adding test for Regex Denial of Service (ReDoS)

* reducing the size/time of tests

* Update test/components/preprocessors/test_sentence_tokenizer.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* Update test/components/preprocessors/test_sentence_tokenizer.py

---------

Co-authored-by: Waivey <waivey@proton.me>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-05-02 15:40:17 +00:00
David S. Batista
c306bee665
fix: adding missing abbreviations files for SentenceSplitter (#8660)
* adding missing abbreviations files for SentenceSplitter

* fixing tests path
2024-12-19 11:08:29 +01:00
David S. Batista
3f77d3ab6c
!feat: unify NLTKDocumentSplitter and DocumentSplitter (#8617)
* wip: initial import

* wip: refactoring

* wip: refactoring tests

* wip: refactoring tests

* making all NLTKSplitter related tests work

* refactoring

* docstrings

* refactoring and removing NLTKDocumentSplitter

* fixing tests for custom sentence tokenizer

* fixing tests for custom sentence tokenizer

* cleaning up

* adding release notes

* reverting some changes

* cleaning up tests

* fixing serialisation and adding tests

* cleaning up

* wip

* renaming and cleaning

* adding NLTK files

* updating docstring

* adding import to init

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* updating tests

* wip

* adding sentence/period change warning

* fixing LICENSE header

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-12-12 14:22:27 +00:00