* Fix types in test_run.py
* Get test_run.py to pass fmt-check
* Add test_run to mypy checks
* Update test folder to pass ruff linting
* Fix merge
* Fix HF tests
* Fix hf test
* Try to fix tests
* Another attempt
* minor fix
* fix SentenceTransformersDiversityRanker
* skip integration tests because the model is unavailable on HF inference
---------
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
* add token split_unit
* fix overlap with fallback
* reno
* mark as integration tests
* use type ignore instead of assert
* Update releasenotes/notes/recursive-splitter-token-df56428887ac45bd.yaml
Co-authored-by: David S. Batista <dsbatista@gmail.com>
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* initial import
* adding initial version + tests
* adding more tests
* more tests
* incorporating SentenceSplitter based on NLTK
* adding more tests
* adding release notes
* adding LICENSE header
* removing unused imports
* fixing example docstring
* adding docstrings
* fixing tests and returning a dictionary
* updating release notes
* attending PR comments
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip: updating tests for split_idx_start and _split_overlap
* adding tests for split_idx and split_start and overlaps
* adjusting file for LICENSE checking
* adding more tests
* adding tests for page numbering
* adding tests for min split lengths and falling back to character-level chunking based on size
* fixing linting issue
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip
* wip
* updating tests
* wip: fixing all tests after changes
* more tests
* wip: debugging sentence overlap
* wip: debugging page number
* wip
* wip; fixed bug with sentence tokenizer, needs to keep white spaces
* adding tests for counting pages on different split approaches
* NLTK checks done on SentenceSplitter
* fixing types
* adding detection for full overlap with previous chunks
* fixing types
* improving docstring
* improving docstring
* adding custom length, 'character' use case
* customising overlap function for word and adding a few tests
* updating docstring
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip: adding more tests for word unit length
* fix
* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)
* draft
* del HF token in tests
* adaptations
* progress
* fix type
* import sorting
* more control on deserialization
* release note
* improvements
* support name field
* fix chatpromptbuilder test
* port Tool from experimental
* release note
* docs upd
* Update tool.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* fix: fix deserialization issues in multi-threading environments (#8651)
* adding 'word' as default length
* fixing types
* handling both default strategies
* wip
* \f was not being counted properly
* updating tests
* fixing the overlap bug
* adding more tests
* refactoring _apply_overlap
* further refactoring
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* adding ticks to close code block
* fixing comments
* applying changes: split with space and force keep_white_spaces=True
* fixing some tests and replacing count words approach in more places
* keep_white_spaces = True only if not defined
* cleaning docs
* handling some more edge cases, when split is still too big and all separators have been tried
* fixing fallback whitespaces count to fixed word/char split based on split size
* cleaning
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>