haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-16 01:54:35 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	35788a2d06	feat: Update csv cleaner (#8828) * More refactoring * Add more new options and more tests * Improve docstrings * Add release notes * Fix pylint	2025-02-07 14:29:53 +01:00
Sebastian Husch Lee	1785ea622e	feat: Add component CSVDocumentCleaner for removing empty rows and columns (#8816 ) * Initial commit for csv cleaner * Add release notes * Update lineterminator * Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml Co-authored-by: David S. Batista <dsbatista@gmail.com> * alphabetize * Use lazy import * Some refactoring * Some refactoring --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 17:56:38 +01:00
Nicola Procopio	542a7f7ef5	fix: update meta data before initializing new Document in DocumentSplitter (#8745 ) * updated DocumentSplitter issue #8741 * release note * updated DocumentSplitter in _create_docs_from_splits function initialize a new variable copied_mete instead to overwrite meta * added test test_duplicate_pages_get_different_doc_id * fix fmt --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-20 09:51:47 +01:00
David S. Batista	26b80778f5	chore: removing NLTKDocumentSplitter (#8724) * removing NLTKDocumentSplitter * adding release notes * removing pydocs reference	2025-01-15 16:11:51 +00:00
David S. Batista	4f73b192f8	feat: add `RecursiveSplitter` component for `Document` preprocessing (#8605 ) * initial import * adding initial version + tests * adding more tests * more tests * incorporating SentenceSplitter based on NLTK * adding more tests * adding release notes * adding LICENSE header * removing unused imports * fixing example docstring * addding docstrings * fixing tests and returning a dictionary * updating release notes * attending PR comments * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: updating tests for split_idx_start and _split_overlap * adding tests for split_idx and split_start and overlaps * adjusting file for LICENSE checking * adding more tests * adding tests for page numbering * adding tests for min split lenghts and falling back to character-level chunking based on size * fixing linting issue * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * 
wip * wip * updating tests * wip: fixing all tests after changes * more tests * wip: debugging sentence overlap * wip: debugging page number * wip * wip; fixed bug with sentence tokenizer, needs to keep white spaces * adding tests for counting pages on different split approaches * NLTK checks done on SentenceSplitter * fixing types * adding detecting for full overlap with previous chunks * fixing types * improving docstring * improving docstring * adding custom lenght, 'character' use case * customising overlap function for word and adding a few tests * updating docstring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: adding more tests for word unit length * fix * feat: `Tool` dataclass - unified abstraction to represent tools (#8652) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * port Tool from experimental * release note * docs upd * Update tool.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * fix: fix deserialization issues in multi-threading environments (#8651) * adding 'word' as default length * fixing types * handing both default strategies * wip * \f was not being counted properly * updating tests * fixing the overlap bug * adding more tests * refactoring _apply_overlap * further refactoring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee 
<sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * adding ticks to close code block * fixing comments * applying changes: split with space and force keep_white_spaces=True * fixing some tests and replacing count words approach in more places * keep_white_spaces = True only if not defined * cleaning docs * handling some more edge cases, when split is still too big and all separators ran * fixing fallback whitespaces count to fixed word/char split based on split size * cleaning --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>	2025-01-10 17:28:53 +01:00
mathislucka	fe9b1e29d4	CI: fix format after newly introduced formatting rules from ruff release (#8696)	2025-01-09 16:25:55 +00:00
Stefano Fiorucci	5539f6c33f	refactor: improve serialization/deserialization of callables (to handle class methods and static methods) (#8683) * progress * refinements * tidy up * release note	2025-01-08 11:28:00 +01:00
Sebastian Husch Lee	286061f005	fix: Move potential nltk download to warm_up (#8646 ) * Move potential nltk download to warm_up * Update tests * Add release notes * Fix tests * Uncomment * Make mypy happy * Add RuntimeError message * Update release notes --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-12-20 10:41:44 +01:00
David S. Batista	c306bee665	fix: adding missing abbreviations files for SentenceSplitter (#8660) * adding missing abbreviations files for SentenceSplitter * fixing tests path	2024-12-19 11:08:29 +01:00
David S. Batista	3f77d3ab6c	!feat: unify NLTKDocumentSplitter and DocumentSplitter (#8617 ) * wip: initial import * wip: refactoring * wip: refactoring tests * wip: refactoring tests * making all NLTKSplitter related tests work * refactoring * docstrings * refactoring and removing NLTKDocumentSplitter * fixing tests for custom sentence tokenizer * fixing tests for custom sentence tokenizer * cleaning up * adding release notes * reverting some changes * cleaning up tests * fixing serialisation and adding tests * cleaning up * wip * renaming and cleaning * adding NLTK files * updating docstring * adding import to init * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * updating tests * wip * adding sentence/period change warning * fixing LICENSE header * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-12-12 14:22:27 +00:00
Silvano Cerza	ab840351f8	Fix DocumentCleaner not preserving Document fields (#8578)	2024-11-25 13:08:59 +01:00
Silvano Cerza	bd77120cf3	Fix `DocumentSplitter` not splitting by function (#8549) * Fix DocumentSplitter not splitting by function * Make the split_by mapping a constant	2024-11-18 11:54:30 +01:00
Sriniketh J	a045c0eabb	feat: added split by line to DocumentSplitter (#8525 ) * feat: added split by line to DocumentSplitter * fix: pr review comments Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-11-14 16:09:01 +01:00
Sebastian Husch Lee	0c11c7b98e	fix: Bring in fix from custom nodes (#8539 ) * Bring in fix from custom nodes * Add to_dict function and test * reno * Fix pylint	2024-11-14 13:00:28 +01:00
Sebastian Husch Lee	911f3523ab	feat: Increase logging transparency for empty Documents during conversion (#8509 ) * Add log lines for PDF conversion and make skipping more explicit in DocumentSplitter * Add logging statement for PDFMinerToDocument as well * Add tests * Remove unused line * Remove unused line * add reno * Add in PDF file * Update checks in PDF converters and add tests for document splitter * Revert * Remove line * Fix comment * Make mypy happy * Make mypy happy	2024-11-04 09:26:57 +01:00
Vladimir Blagojevic	514e0abc39	fix: Fix nltk imports (#8381)	2024-09-18 11:25:21 +00:00
Vladimir Blagojevic	badd0594cc	feat: Port NLTKDocumentSplitter from dC to Haystack (#8350 ) * Port NLTKDocumentSplitter from dC to Haystack * Improve pydocs * Use haystack logging * Add NLTKDocumentSplitter to __init__.py * Use haystack logging, rename test classes * Fixing _needs_join return * Linting * PR feedback * More static methods * Increase test coverage * Compile pattern --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-09-17 13:59:19 +02:00
Giovanni Alzetta, PhD	4106e7e8d1	feat : DocumentSplitter, adding the option to split_by function (#8336 ) * Adding splitting function * Adding test for split by function * Adding release note for feat adding split by function * Fixing release note for split_by_function * Fixing issue with splitting_function non callable * nit: fixing value error in documentsplitter for split_by * Add custom serde --------- Co-authored-by: Giovanni Alzetta <giovannialzetta@gmail.com> Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2024-09-12 16:38:37 +02:00
Corentin Meyer	58517014ec	fix: DocumentCleaner: keep the \f in text (#8078) * Keep the \f in Document Cleaner * Add Reno * Add Test * Simplified _remove_empty_lines() code	2024-08-07 14:50:14 +02:00
Tim Wellbrock	2e2f5f17bb	feat: add unicode normalization & ascii_only mode for DocumentCleaner (#8103) * feat: add unicode normalization & ascii_only mode for DocumentCleaner. * feat: add unicode_normalization parameter validation to DocumentCleaner. * test: fix the unit test to work after code linting.	2024-08-05 13:00:39 +02:00
Sebastian Husch Lee	baed478f23	fix: Fix `split_start_idx` and `_split_overlap` information in `DocumentSplitter` (#8046 ) * Fix bug in DocumentSplitter and expand tests to catch said bug * Fix split overlap information calc and actually test it * Add release notes * Remove comments * Same fix in SentenceWindowRetrieval --------- Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>	2024-07-24 15:15:36 +02:00
David S. Batista	91f57015c0	feat : adding `split_id` and `split_overlap` to `DocumentSplitter` (#7933 ) * wip: adding _split_overlapp * fixing join issue for _split_overlap * adding tests * adding release notes * cleaning and fixing tests * making mypy happy * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com> * adding docstrings --------- Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>	2024-06-27 15:07:43 +02:00
Massimiliano Pippi	3a03fce71c	ci: Add code formatting checks (#7882) * ruff settings enable ruff format and re-format outdated files feat: `EvaluationRunResult` add parameter to specify columns to keep in the comparative `Dataframe` (#7879) * adding param to explicitly state which cols to keep * adding param to explicitly state which cols to keep * adding param to explicitly state which cols to keep * updating tests * adding release notes * Update haystack/evaluation/eval_run_result.py Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * Update releasenotes/notes/add-keep-columns-to-EvalRunResult-comparative-be3e15ce45de3e0b.yaml Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> * updating docstring --------- Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> add format-check fail on format and linting failures fix string formatting reformat long lines fix tests fix typing linter pull from main * reformat * lint -> check * lint -> check	2024-06-18 15:52:46 +00:00
Alessio Cesaretti	d0da31a047	feat: Add split_threshold to DocumentSplitter to avoid excessively short splits (#7721 ) * feat: add split_threshold to document splitter to avoid excessively small splits * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * extend release note --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-05-27 14:48:38 +02:00
Carlos Fernández	57af95d7ea	add keep-id to DocumentCleaner (#7703)	2024-05-16 19:18:48 +02:00
Massimiliano Pippi	10c675d534	chore: add license header to all modules (#7675) * add license header to modules * check license header at linting time	2024-05-09 13:40:36 +00:00
Carlos Fernández	d2c87b2fd9	feat: add page_number to metadata in DocumentSplitter (#7599) * Add the implementation for page counting used in the v1.25.x branch. It should work as expected in issue #6705. * Add tests that reflect the desired behaviour. This behaviour is inferred from the one it had on Haystack 1.x Solve some minor bugs spotted by tests. * Update docstrings. * Add reno. * Update haystack/components/preprocessors/document_splitter.py Update docstring from suggestion Co-authored-by: David S. Batista <dsbatista@gmail.com> * solve suggestion to improve readability * fragment tests * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: David S. Batista <dsbatista@gmail.com> * Update .gitignore * Update .gitignore * Update add-page-number-to-document-splitter-162e9dc7443575f0.yaml * blackening --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2024-04-29 12:51:18 +02:00
Silvano Cerza	c82f787b41	feat: Add `TextCleaner` component (#6997) * Add TextCleaner component * Update docstrings and simplify run logic * Update docstrings	2024-02-15 16:10:38 +01:00
sahusiddharth	a7ac4edd07	feat: added split by page to `DocumentSplitter` (#6753 ) * feat-added-split-by-page-to-DocumentSplitter * added test case and the suggested changes * Update document_splitter.py * Update haystack/components/preprocessors/document_splitter.py * Update test_document_splitter.py --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>	2024-01-17 15:36:29 +01:00
Massimiliano Pippi	7c05f37a53	remove unit marker (#6450)	2023-11-29 19:24:25 +01:00
Silvano Cerza	e6637f5ec2	Fix all tests	2023-11-24 14:48:43 +01:00
Massimiliano Pippi	8adb8bbab8	Remove preview folder in test/ --------- Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>	2023-11-24 11:52:55 +01:00

32 Commits