Silvano Cerza
3d69094f9a
refactor: Remove id_hash_keys from TextDocumentSplitter (#6124)
...
* Remove id_hash_keys from DocumentCleaner
* Remove id_hash_keys from TextDocumentSplitter
2023-10-20 15:18:28 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 (#6108)
...
* whisper transcriber
* remove from/to_dict from builders
* remove from/to_dict from embedders
* remove from/to_dict from fetcher, file_converters
* remove from/to_dict from generators, preprocessors
* remove from/to_dict from ranker, reader
* remove from/to_dict from router, sampler, websearch
* pylint
* reno
* refactor import
* remove unused import
2023-10-19 11:17:02 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list (#6081)
...
* Fix TextDocumentSplitter failing if run with empty list
* Release notes
* Simplify check
* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents (#6083)
...
* copy id_hash_keys in splitter and cleaner
* reno
2023-10-17 11:03:48 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870)
...
* draft split by word, sentence, passage
* naive way to split sentences without nltk
* reno
* add tests
* make input list of docs, review feedback
* add source_id and more validation
* update docstrings
* add split delimiters back to strings
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
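
The splitting behaviour described in the commit above (split by word, sentence, or passage; a naive sentence split without nltk; delimiters added back to the strings; an empty input list handled cleanly) can be sketched roughly as follows. This is a minimal illustration, not the code from the commits: `naive_split`, `split_all`, the `DELIMITERS` mapping, and the `split_by` / `split_length` parameter names are all hypothetical choices made for this example.

```python
from typing import List

# Hypothetical mapping for illustration only; not taken from the commits.
DELIMITERS = {"word": " ", "sentence": ".", "passage": "\n\n"}


def naive_split(text: str, split_by: str = "word", split_length: int = 2) -> List[str]:
    """Split text into units, keep each delimiter on its unit, then group units."""
    if split_by not in DELIMITERS:
        raise ValueError("split_by must be one of 'word', 'sentence', 'passage'")
    if split_length <= 0:
        raise ValueError("split_length must be greater than 0")
    delimiter = DELIMITERS[split_by]
    units = text.split(delimiter)
    # Add the delimiter back to every unit except the last, so that
    # concatenating the units reproduces the original text.
    units = [unit + delimiter for unit in units[:-1]] + units[-1:]
    # Group consecutive units into chunks of split_length units each.
    chunks = ["".join(units[i:i + split_length]) for i in range(0, len(units), split_length)]
    # Drop empty chunks (for example, the one after a trailing delimiter).
    return [chunk for chunk in chunks if chunk]


def split_all(texts: List[str], split_by: str = "word", split_length: int = 2) -> List[str]:
    """Split every text in the list; an empty list yields an empty list."""
    return [chunk for text in texts for chunk in naive_split(text, split_by, split_length)]
```

Because each delimiter stays attached to the unit before it, joining the returned chunks gives back the input text unchanged, which is the point of adding the delimiters back.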