haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-07-07 09:01:11 +00:00

Author	SHA1	Message	Date
Nicola Procopio	542a7f7ef5	fix: update meta data before initializing new Document in DocumentSplitter (#8745) * updated DocumentSplitter issue #8741 * release note * updated DocumentSplitter: the _create_docs_from_splits function now initializes a new variable copied_meta instead of overwriting meta * added test test_duplicate_pages_get_different_doc_id * fix fmt --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-20 09:51:47 +01:00
Stefano Fiorucci	5539f6c33f	refactor: improve serialization/deserialization of callables (to handle class methods and static methods) (#8683) * progress * refinements * tidy up * release note	2025-01-08 11:28:00 +01:00
Sebastian Husch Lee	286061f005	fix: Move potential nltk download to warm_up (#8646) * Move potential nltk download to warm_up * Update tests * Add release notes * Fix tests * Uncomment * Make mypy happy * Add RuntimeError message * Update release notes --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-12-20 10:41:44 +01:00
David S. Batista	3f77d3ab6c	!feat: unify NLTKDocumentSplitter and DocumentSplitter (#8617) * wip: initial import * wip: refactoring * wip: refactoring tests * wip: refactoring tests * making all NLTKSplitter related tests work * refactoring * docstrings * refactoring and removing NLTKDocumentSplitter * fixing tests for custom sentence tokenizer * fixing tests for custom sentence tokenizer * cleaning up * adding release notes * reverting some changes * cleaning up tests * fixing serialisation and adding tests * cleaning up * wip * renaming and cleaning * adding NLTK files * updating docstring * adding import to init * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * updating tests * wip * adding sentence/period change warning * fixing LICENSE header * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-12-12 14:22:27 +00:00
Silvano Cerza	bd77120cf3	Fix `DocumentSplitter` not splitting by function (#8549) * Fix DocumentSplitter not splitting by function * Make the split_by mapping a constant	2024-11-18 11:54:30 +01:00
Sriniketh J	a045c0eabb	feat: added split by line to DocumentSplitter (#8525) * feat: added split by line to DocumentSplitter * fix: pr review comments Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-11-14 16:09:01 +01:00
Sebastian Husch Lee	911f3523ab	feat: Increase logging transparency for empty Documents during conversion (#8509) * Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter * Add logging statement for PDFMinerToDocument as well * Add tests * Remove unused line * Remove unused line * add reno * Add in PDF file * Update checks in PDF converters and add tests for document splitter * Revert * Remove line * Fix comment * Make mypy happy * Make mypy happy	2024-11-04 09:26:57 +01:00
Giovanni Alzetta, PhD	4106e7e8d1	feat: DocumentSplitter, adding the option to split_by function (#8336) * Adding splitting function * Adding test for split by function * Adding release note for feat adding split by function * Fixing release note for split_by_function * Fixing issue with splitting_function non callable * nit: fixing value error in documentsplitter for split_by * Add custom serde --------- Co-authored-by: Giovanni Alzetta <giovannialzetta@gmail.com> Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2024-09-12 16:38:37 +02:00
Sebastian Husch Lee	baed478f23	fix: Fix `split_start_idx` and `_split_overlap` information in `DocumentSplitter` (#8046) * Fix bug in DocumentSplitter and expand tests to catch said bug * Fix split overlap information calc and actually test it * Add release notes * Remove comments * Same fix in SentenceWindowRetrieval --------- Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2024-07-24 15:15:36 +02:00
David S. Batista	91f57015c0	feat: adding `split_id` and `split_overlap` to `DocumentSplitter` (#7933) * wip: adding _split_overlap * fixing join issue for _split_overlap * adding tests * adding release notes * cleaning and fixing tests * making mypy happy * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * adding docstrings --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2024-06-27 15:07:43 +02:00
Massimiliano Pippi	3a03fce71c	ci: Add code formatting checks (#7882) * ruff settings enable ruff format and re-format outdated files feat: `EvaluationRunResult` add parameter to specify columns to keep in the comparative `Dataframe` (#7879) * adding param to explicitly state which cols to keep * adding param to explicitly state which cols to keep * adding param to explicitly state which cols to keep * updating tests * adding release notes * Update haystack/evaluation/eval_run_result.py Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Update releasenotes/notes/add-keep-columns-to-EvalRunResult-comparative-be3e15ce45de3e0b.yaml Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * updating docstring --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> add format-check fail on format and linting failures fix string formatting reformat long lines fix tests fix typing linter pull from main * reformat * lint -> check * lint -> check	2024-06-18 15:52:46 +00:00
Alessio Cesaretti	d0da31a047	feat: Add split_threshold to DocumentSplitter to avoid excessively short splits (#7721) * feat: add split_threshold to document splitter to avoid excessively small splits * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * extend release note --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-05-27 14:48:38 +02:00
Massimiliano Pippi	10c675d534	chore: add license header to all modules (#7675) * add license header to modules * check license header at linting time	2024-05-09 13:40:36 +00:00
Carlos Fernández	d2c87b2fd9	feat: add page_number to metadata in DocumentSplitter (#7599) * Add the implementation for page counting used in the v1.25.x branch. It should work as expected in issue #6705. * Add tests that reflect the desired behaviour. This behaviour is inferred from the one it had on Haystack 1.x. Solve some minor bugs spotted by tests. * Update docstrings. * Add reno. * Update haystack/components/preprocessors/document_splitter.py Update docstring from suggestion Co-authored-by: David S. Batista <dsbatista@gmail.com> * solve suggestion to improve readability * fragment tests * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: David S. Batista <dsbatista@gmail.com> * Update .gitignore * Update .gitignore * Update add-page-number-to-document-splitter-162e9dc7443575f0.yaml * blackening --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2024-04-29 12:51:18 +02:00
sahusiddharth	a7ac4edd07	feat: added split by page to `DocumentSplitter` (#6753) * feat-added-split-by-page-to-DocumentSplitter * added test case and the suggested changes * Update document_splitter.py * Update haystack/components/preprocessors/document_splitter.py * Update test_document_splitter.py --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-01-17 15:36:29 +01:00
Massimiliano Pippi	7c05f37a53	remove unit marker (#6450)	2023-11-29 19:24:25 +01:00
Silvano Cerza	e6637f5ec2	Fix all tests	2023-11-24 14:48:43 +01:00
Massimiliano Pippi	8adb8bbab8	Remove preview folder in test/ --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-11-24 11:52:55 +01:00

18 Commits