1 Commit

Author SHA1 Message Date
David S. Batista
4f73b192f8
feat: add RecursiveSplitter component for Document preprocessing (#8605)
* initial import

* adding initial version + tests

* adding more tests

* more tests

* incorporating SentenceSplitter based on NLTK

* adding more tests

* adding release notes

* adding LICENSE header

* removing unused imports

* fixing example docstring

* adding docstrings

* fixing tests and returning a dictionary

* updating release notes

* addressing PR comments

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip: updating tests for split_idx_start and _split_overlap

* adding tests for split_idx and split_start and overlaps

* adjusting file for LICENSE checking

* adding more tests

* adding tests for page numbering

* adding tests for min split lengths and falling back to character-level chunking based on size

* fixing linting issue

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip

* wip

* updating tests

* wip: fixing all tests after changes

* more tests

* wip: debugging sentence overlap

* wip: debugging page number

* wip

* wip; fixed bug with sentence tokenizer, needs to keep white spaces

* adding tests for counting pages on different split approaches

* NLTK checks done on SentenceSplitter

* fixing types

* adding detection for full overlap with previous chunks

* fixing types

* improving docstring

* improving docstring

* adding custom length, 'character' use case

* customising overlap function for word and adding a few tests

* updating docstring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip: adding more tests for word unit length

* fix

* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)

* draft

* del HF token in tests

* adaptations

* progress

* fix type

* import sorting

* more control on deserialization

* release note

* improvements

* support name field

* fix chatpromptbuilder test

* port Tool from experimental

* release note

* docs upd

* Update tool.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* fix: fix deserialization issues in multi-threading environments (#8651)

* adding 'word' as default length

* fixing types

* handling both default strategies

* wip

* \f was not being counted properly

* updating tests

* fixing the overlap bug

* adding more tests

* refactoring _apply_overlap

* further refactoring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* adding ticks to close code block

* fixing comments

* applying changes: split with space and force keep_white_spaces=True

* fixing some tests and replacing count words approach in more places

* keep_white_spaces = True only if not defined

* cleaning docs

* handling some more edge cases, when split is still too big and all separators ran

* fixing fallback whitespaces count to fixed word/char split based on split size

* cleaning

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>
2025-01-10 17:28:53 +01:00