* Initial commit for csv cleaner
* Add release notes
* Update lineterminator
* Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* alphabetize
* Use lazy import
* Some refactoring
* Some refactoring
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* updated DocumentSplitter
issue #8741
* release note
* updated DocumentSplitter
in the _create_docs_from_splits function, initialize a new variable copied_meta instead of overwriting meta
* added test
test_duplicate_pages_get_different_doc_id
* fix fmt
---------
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
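For context on the DocumentSplitter fix above, here is a minimal sketch of the copy-per-split idea. It is not the Haystack implementation: the dict-based documents, the `make_id` helper, and the `page_number`/`split_id` keys are assumptions chosen for illustration. It only shows that giving each split its own copy of the metadata keeps the caller's dict untouched and lets identical content on different pages end up with different ids.

```python
from copy import deepcopy
import hashlib
import json


def make_id(content, meta):
    # Hypothetical id function: hash of content plus metadata.
    payload = json.dumps({"content": content, "meta": meta}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def create_docs_from_splits(splits, pages, meta):
    docs = []
    for split_id, (text, page) in enumerate(zip(splits, pages)):
        # Fresh copy per split; the original meta dict is never mutated.
        copied_meta = deepcopy(meta)
        copied_meta["page_number"] = page
        copied_meta["split_id"] = split_id
        docs.append(
            {"id": make_id(text, copied_meta), "content": text, "meta": copied_meta}
        )
    return docs


source_meta = {"source": "a.pdf"}
docs = create_docs_from_splits(["same text", "same text"], [1, 2], source_meta)
```

In this sketch the two documents carry the same content but separate metadata dicts, so their ids differ and `source_meta` is left as it was passed in.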
* initial import
* adding initial version + tests
* adding more tests
* more tests
* incorporating SentenceSplitter based on NLTK
* adding more tests
* adding release notes
* adding LICENSE header
* removing unused imports
* fixing example docstring
* adding docstrings
* fixing tests and returning a dictionary
* updating release notes
* attending PR comments
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip: updating tests for split_idx_start and _split_overlap
* adding tests for split_idx and split_start and overlaps
* adjusting file for LICENSE checking
* adding more tests
* adding tests for page numbering
* adding tests for min split lengths and falling back to character-level chunking based on size
* fixing linting issue
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip
* wip
* updating tests
* wip: fixing all tests after changes
* more tests
* wip: debugging sentence overlap
* wip: debugging page number
* wip
* wip; fixed bug with sentence tokenizer, needs to keep white spaces
* adding tests for counting pages on different split approaches
* NLTK checks done on SentenceSplitter
* fixing types
* adding detection for full overlap with previous chunks
* fixing types
* improving docstring
* improving docstring
* adding custom length, 'character' use case
* customising overlap function for word and adding a few tests
* updating docstring
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* wip: adding more tests for word unit length
* fix
* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)
* draft
* del HF token in tests
* adaptations
* progress
* fix type
* import sorting
* more control on deserialization
* release note
* improvements
* support name field
* fix chatpromptbuilder test
* port Tool from experimental
* release note
* docs upd
* Update tool.py
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
* fix: fix deserialization issues in multi-threading environments (#8651)
* adding 'word' as default length
* fixing types
* handling both default strategies
* wip
* \f was not being counted properly
* updating tests
* fixing the overlap bug
* adding more tests
* refactoring _apply_overlap
* further refactoring
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Update haystack/components/preprocessors/recursive_splitter.py
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* adding ticks to close code block
* fixing comments
* applying changes: split with space and force keep_white_spaces=True
* fixing some tests and replacing count words approach in more places
* keep_white_spaces = True only if not defined
* cleaning docs
* handling some more edge cases, when split is still too big and all separators have been tried
* fixing fallback whitespaces count to fixed word/char split based on split size
* cleaning
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>
* feat: added split by line to DocumentSplitter
* fix: pr review comments
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter
* Add logging statement for PDFMinerToDocument as well
* Add tests
* Remove unused line
* Remove unused line
* add reno
* Add in PDF file
* Update checks in PDF converters and add tests for document splitter
* Revert
* Remove line
* Fix comment
* Make mypy happy
* Make mypy happy
* Port NLTKDocumentSplitter from dC to Haystack
* Improve pydocs
* Use haystack logging
* Add NLTKDocumentSplitter to __init__.py
* Use haystack logging, rename test classes
* Fixing _needs_join return
* Linting
* PR feedback
* More static methods
* Increase test coverage
* Compile pattern
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
* Adding splitting function
* Adding test for split by function
* Adding release note for feat adding split by function
* Fixing release note for split_by_function
* Fixing issue with splitting_function non callable
* nit: fixing value error in documentsplitter for split_by
* Add custom serde
---------
Co-authored-by: Giovanni Alzetta <giovannialzetta@gmail.com>
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
* feat: add unicode normalization & ascii_only mode for DocumentCleaner.
* feat: add unicode_normalization parameter validation to DocumentCleaner.
* test: fix the unit test to work after code linting.
* Fix bug in DocumentSplitter and expand tests to catch said bug
* Fix split overlap information calc and actually test it
* Add release notes
* Remove comments
* Same fix in SentenceWindowRetrieval
---------
Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
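As an illustration of the unicode normalization and ascii_only entries for DocumentCleaner above, the following is a rough sketch built only on the standard library. The function name `clean_text` and its exact behaviour are assumptions, not the DocumentCleaner API; only the parameter names `unicode_normalization` and `ascii_only` come from the commit messages.

```python
import unicodedata

VALID_FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def clean_text(text, unicode_normalization=None, ascii_only=False):
    # Hypothetical helper; validates the form before using it.
    if unicode_normalization is not None:
        if unicode_normalization not in VALID_FORMS:
            raise ValueError(
                f"unicode_normalization must be one of {VALID_FORMS}"
            )
        text = unicodedata.normalize(unicode_normalization, text)
    if ascii_only:
        # Decompose accented characters, then drop anything that is not ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    return text
```

With these assumptions, `clean_text("café", ascii_only=True)` keeps the base letter and drops the accent, and an unknown normalization form is rejected early instead of failing later inside `unicodedata`.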
* ruff settings
enable ruff format and re-format outdated files
feat: `EvaluationRunResult` add parameter to specify columns to keep in the comparative `Dataframe` (#7879)
* adding param to explicitly state which cols to keep
* adding param to explicitly state which cols to keep
* adding param to explicitly state which cols to keep
* updating tests
* adding release notes
* Update haystack/evaluation/eval_run_result.py
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Update releasenotes/notes/add-keep-columns-to-EvalRunResult-comparative-be3e15ce45de3e0b.yaml
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* updating docstring
---------
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
add format-check
fail on format and linting failures
fix string formatting
reformat long lines
fix tests
fix typing
linter
pull from main
* reformat
* lint -> check
* lint -> check
* Add the implementation for page counting used in the v1.25.x branch. It should work as expected in issue #6705.
* Add tests that reflect the desired behaviour. This behaviour is inferred from the one it had on Haystack 1.x
Solve some minor bugs spotted by tests.
* Update docstrings.
* Add reno.
* Update haystack/components/preprocessors/document_splitter.py
Update docstring from suggestion
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* solve suggestion to improve readability
* fragment tests
* Update haystack/components/preprocessors/document_splitter.py
Co-authored-by: David S. Batista <dsbatista@gmail.com>
* Update .gitignore
* Update .gitignore
* Update add-page-number-to-document-splitter-162e9dc7443575f0.yaml
* blackening
---------
Co-authored-by: David S. Batista <dsbatista@gmail.com>
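The page counting described in the entries above can be sketched roughly as follows. This is an assumption-laden illustration, not the code from the v1.25.x branch or from document_splitter.py: it assumes pages are separated by form feed characters ("\f"), that splitting is by word, and that each split is labelled with the page on which it starts. The function name `split_by_word_with_pages` is hypothetical.

```python
def split_by_word_with_pages(text, split_length):
    # Split on single spaces and re-attach the separator so that the
    # original text, including "\f" characters, can be reconstructed.
    units = text.split(" ")
    units = [u + " " for u in units[:-1]] + [units[-1]]

    splits, pages = [], []
    current_page = 1
    for start in range(0, len(units), split_length):
        chunk = "".join(units[start:start + split_length])
        splits.append(chunk)
        pages.append(current_page)
        # Every form feed inside this chunk moves the next chunk one page on.
        current_page += chunk.count("\f")
    return splits, pages


splits, pages = split_by_word_with_pages("one two\fthree four five\fsix", 2)
```

Here the first split starts on page 1 and contains one form feed, so the second split is labelled as starting on page 2.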
* feat-added-split-by-page-to-DocumentSplitter
* added test case and the suggested changes
* Update document_splitter.py
* Update haystack/components/preprocessors/document_splitter.py
* Update test_document_splitter.py
---------
Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>