haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-23 05:26:33 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	35788a2d06	feat: Update csv cleaner (#8828 ) * More refactoring * Add more new options and more tests * Improve docstrings * Add release notes * Fix pylint	2025-02-07 14:29:53 +01:00
Sebastian Husch Lee	1785ea622e	feat: Add component CSVDocumentCleaner for removing empty rows and columns (#8816 ) * Initial commit for csv cleaner * Add release notes * Update lineterminator * Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml Co-authored-by: David S. Batista <dsbatista@gmail.com> * alphabetize * Use lazy import * Some refactoring * Some refactoring --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 17:56:38 +01:00
Stefano Fiorucci	1f257944a6	chore: fix Hugging Face components for mypy 1.15.0 (#8822 ) * chore: fix Hugging Face components for mypy 1.15.0 * small fixes * fix test * rm print * use cast and be more permissive	2025-02-06 16:25:59 +00:00
Amna Mubashar	b0809b75f5	feat: Add a `ListJoiner` component (#8810 ) * Add a ListJoiner * Add tests and release notes	2025-02-05 23:19:14 +01:00
György Orosz	d2348ad462	feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode (#8806 ) * feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode * refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder mae to be the last positional parameter for backward compatibility reasons * docs: added explanation for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * test: added tests for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * doc: removed empty lines from docstrings of SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder mae to be the last positional parameter for backward compatibility (part II.)	2025-02-05 16:09:35 +00:00
Stefano Fiorucci	2828d9e4ae	refactor!: `DOCXToDocument` converter - store DOCX metadata as a dict (#8804 ) * DOCXToDocument - store DOCX metadata as a dict * do not export DOCXMetadata to converters package	2025-02-05 14:43:19 +01:00
Stefano Fiorucci	5ae94886b2	fix: fix test failures with Transformers models in PRs from forks (#8809 ) * trigger * try pinning sentence transformers * make integr tests run right away * pin transformers instead * older transformers version * rm transformers pin * try ignoring cache * change ubuntu version * try removing token * try again * more HF_API_TOKEN local deletions * restore test priority * rm leftover * more deletions * moreee * more * deletions * restore jobs order	2025-02-04 19:08:37 +01:00
Sebastian Husch Lee	1ee86b5041	fix: Fix filters to handle date times with timezones (loading and comparison) (#8800 ) * Fix on date time parsing with timezones. And comparing naive and aware date times. * Add release note * Add more filter tests	2025-02-04 14:51:06 +01:00
Stefano Fiorucci	877f826da0	refactor: HF API Embedders - use `InferenceClient.feature_extraction` instead of `InferenceClient.post` (#8794 ) * HF API Embedders: refactoring * rename variables * rm leftovers * rm pin * rm unused import * relnote * warning with truncate/normalize and serverless inference API * test that warnings are raised	2025-02-03 15:11:16 +00:00
Sebastian Husch Lee	bba84e5517	fix: Fix JSONConverter to properly skip files that are not utf-8 encoded (#8775 ) * Small fix * Add reno * Trying out license header fix here	2025-01-28 10:29:55 +01:00
tstadel	bf79f04932	feat: support streaming_callback as run param for HF Chat generators (#8763 ) * feat: support streaming_callback as run param for HF Chat generators * add tests	2025-01-23 12:14:32 +01:00
Stefano Fiorucci	c3d0643511	feat: `AzureOpenAIChatGenerator` - support for tools (#8757 ) * feat: AzureOpenAIChatGenerator - support for tools * release note * feedback	2025-01-23 09:24:04 +00:00
Stefano Fiorucci	f96839e139	chore: update `transformers` test dependency (#8752 ) * update transformers test dependency * add pad_token_id to the mock tokenizer * fix HFLocal test + new test	2025-01-21 14:43:27 +01:00
Nicola Procopio	542a7f7ef5	fix: update meta data before initializing new Document in DocumentSplitter (#8745 ) * updated DocumentSplitter issue #8741 * release note * updated DocumentSplitter in _create_docs_from_splits function initialize a new variable copied_mete instead to overwrite meta * added test test_duplicate_pages_get_different_doc_id * fix fmt --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-20 09:51:47 +01:00
David S. Batista	5af2888e23	fix: `PDFMinerToDocument` convert function - adding double new lines between each `container_text` so that passages can be detected. (#8729 ) * initial import * adding double new lines between container_texts so that passages can be detected * reducing type specification to avoid import error * adding release notes * renaming variable	2025-01-17 13:01:16 +00:00
Stefano Fiorucci	424bce2783	test: fix HF API flaky live test with tools (#8744 ) * test: fix HF API flaky live test with tools * rm print	2025-01-17 12:36:07 +00:00
David S. Batista	2c84266d8f	test: adding test for PyPDF to extract passages so that they are detect by DocumentSplitter (#8739 )	2025-01-17 10:56:16 +01:00
Vladimir Blagojevic	21dd03d3e7	feat: Add completion start time timestamp to relevant generators (#8728 ) * OpenAIChatGenerator - add completion_start_time * HuggingFaceAPIChatGenerator - add completion_start_time * Add tests * Add reno note * Relax condition for cached responses * Add completion_start_time timestamping to non-chat generators * Update haystack/components/generators/chat/hugging_face_api.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * PR feedback --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-17 09:58:45 +01:00
David S. Batista	26b80778f5	chore: removing NLTKDocumentSplitter (#8724 ) * removing NLTKDocumentSplitter * adding release notes * removing pydocs reference	2025-01-15 16:11:51 +00:00
David S. Batista	425ce9b98f	test: updating HuggingFaceAPIChatGenerator tests	2025-01-14 16:47:29 +01:00
Julian Risch	642fa60cdf	fix: PDFMinerToDocument initializes documents with content and meta (#8708 ) * fix: PDFMinerToDocument initializes documents with content and meta * add release note * Apply suggestions from code review Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-13 10:12:06 +00:00
Amna Mubashar	db76ae2847	feat: add `default_headers` for Azure embedders (#8699 ) * Add default_headers param to azure embedders	2025-01-12 17:41:38 +01:00
David S. Batista	4f73b192f8	feat: add `RecursiveSplitter` component for `Document` preprocessing (#8605 ) * initial import * adding initial version + tests * adding more tests * more tests * incorporating SentenceSplitter based on NLTK * adding more tests * adding release notes * adding LICENSE header * removing unused imports * fixing example docstring * addding docstrings * fixing tests and returning a dictionary * updating release notes * attending PR comments * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: updating tests for split_idx_start and _split_overlap * adding tests for split_idx and split_start and overlaps * adjusting file for LICENSE checking * adding more tests * adding tests for page numbering * adding tests for min split lenghts and falling back to character-level chunking based on size * fixing linting issue * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * 
wip * wip * updating tests * wip: fixing all tests after changes * more tests * wip: debugging sentence overlap * wip: debugging page number * wip * wip; fixed bug with sentence tokenizer, needs to keep white spaces * adding tests for counting pages on different split approaches * NLTK checks done on SentenceSplitter * fixing types * adding detecting for full overlap with previous chunks * fixing types * improving docstring * improving docstring * adding custom lenght, 'character' use case * customising overlap function for word and adding a few tests * updating docstring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: adding more tests for word unit length * fix * feat: `Tool` dataclass - unified abstraction to represent tools (#8652) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * port Tool from experimental * release note * docs upd * Update tool.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * fix: fix deserialization issues in multi-threading environments (#8651) * adding 'word' as default length * fixing types * handing both default strategies * wip * \f was not being counted properly * updating tests * fixing the overlap bug * adding more tests * refactoring _apply_overlap * further refactoring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee 
<sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * adding ticks to close code block * fixing comments * applying changes: split with space and force keep_white_spaces=True * fixing some tests and replacing count words approach in more places * keep_white_spaces = True only if not defined * cleaning docs * handling some more edge cases, when split is still too big and all separators ran * fixing fallback whitespaces count to fixed word/char split based on split size * cleaning --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>	2025-01-10 17:28:53 +01:00
Stefano Fiorucci	741ce5df50	fix: `OpenAIChatGenerator` - do not pass tools to the OpenAI client when none are provided (#8702 ) * do not pass tools to OpenAI client if None * release note * fix release note	2025-01-10 14:46:41 +01:00
Julian Risch	dd9660f90d	fix: PyPDFToDocument initializes documents with content and meta (#8698 ) * initialize document with content and meta * update test * add test checking that not only content is used for id generation	2025-01-09 19:12:10 +00:00
mathislucka	fe9b1e29d4	CI: fix format after newly introduced formatting rules from ruff release (#8696 )	2025-01-09 16:25:55 +00:00
Stefano Fiorucci	3f15f38c51	refactor: move `Tool` to a separate package; refactor serde (#8690 ) * move tool to separate package; refactor serde * release note * rm unused import	2025-01-09 12:30:13 +01:00
Sebastian Husch Lee	28ad78c73d	feat: Add XLSXToDocument converter (#8522 ) * Add draft of the Excel To Document converter * Add license header * Add release note * Use Union instead of pipe * Add openpyxl as additional dep * Fix zip issue * few updates from Bijay * Update deps * Add markdown test * Adding more example excels and expanding tests * Added more tests * Fix windows test by setting lineterminator * Addressing PR comments * PR comments * Fix linting	2025-01-09 09:03:19 +01:00
Stefano Fiorucci	5539f6c33f	refactor: improve serialization/deserialization of callables (to handle class methods and static methods) (#8683 ) * progress * refinements * tidy up * release note	2025-01-08 11:28:00 +01:00
Stefano Fiorucci	188b2a7f06	feat: support for tools in `OpenAIChatGenerator` (#8666 ) * move chatmsg>openai conversion to chatmsg dataclass * implementation and tests cleanup * release note * try fixing azure chat generator * add serde test for toolinvoker * small fix	2024-12-20 14:20:54 +00:00
Stefano Fiorucci	7dcbf25bd7	feat: add Tool Invoker component (#8664 ) * port toolinvoker * release note	2024-12-20 14:02:42 +01:00
Michele Pangrazzi	c192488bf6	Named entity extractor private models (#8658 ) * add 'token' support to NamedEntityExtractor to enable using private models on HF backend * fix existing error message format * add release note * add HF_API_TOKEN to e2e workflow * add informative comment * Updated to_dict / from_dict to handle 'token' correctly ; Added tests * Fix lint * Revert unwanted change	2024-12-20 11:15:55 +01:00
Sebastian Husch Lee	286061f005	fix: Move potential nltk download to warm_up (#8646 ) * Move potential nltk download to warm_up * Update tests * Add release notes * Fix tests * Uncomment * Make mypy happy * Add RuntimeError message * Update release notes --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-12-20 10:41:44 +01:00
Stefano Fiorucci	f4d9c2bb91	fix: Make the `HuggingFaceLocalChatGenerator` compatible with the new `ChatMessage`; serialize `chat_template` (#8663 ) * message conversion function * hfapi w tools * right test file + hf_hub version * release note * fix for new chatmessage; serialize chat_template * feedback	2024-12-19 15:12:12 +01:00
Stefano Fiorucci	2bc58d2987	feat: support for tools in `HuggingFaceAPIChatGenerator` (#8661 ) * message conversion function * hfapi w tools * right test file + hf_hub version * release note * feedback	2024-12-19 15:04:37 +01:00
David S. Batista	c306bee665	fix: adding missing abbreviations files for SentenceSplitter (#8660 ) * adding missing abbreviations files for SentenceSplitter * fixing tests path	2024-12-19 11:08:29 +01:00
Stefano Fiorucci	ea3602643a	feat!: new `ChatMessage` (#8640 ) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * Update chat_message.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2024-12-17 17:02:04 +01:00
Stefano Fiorucci	f2b5f123b3	del HF token in tests (#8634 )	2024-12-13 09:50:23 +01:00
David S. Batista	3f77d3ab6c	!feat: unify NLTKDocumentSplitter and DocumentSplitter (#8617 ) * wip: initial import * wip: refactoring * wip: refactoring tests * wip: refactoring tests * making all NLTKSplitter related tests work * refactoring * docstrings * refactoring and removing NLTKDocumentSplitter * fixing tests for custom sentence tokenizer * fixing tests for custom sentence tokenizer * cleaning up * adding release notes * reverting some changes * cleaning up tests * fixing serialisation and adding tests * cleaning up * wip * renaming and cleaning * adding NLTK files * updating docstring * adding import to init * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * updating tests * wip * adding sentence/period change warning * fixing LICENSE header * Update haystack/components/preprocessors/document_splitter.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-12-12 14:22:27 +00:00
Michele Pangrazzi	21d53d0ec6	update default value of 'store_full_path' to False in converters (#8619 )	2024-12-10 16:03:38 +01:00
David S. Batista	248dccbdd3	chore: fixing `pylint` issues (#8610 ) * initial import * fixing internal methods * fixing some internal methods * modify _preprocess * fixed internal methods --------- Co-authored-by: anakin87 <stefanofiorucci@gmail.com>	2024-12-09 16:53:37 +00:00
Anton Pelykh	6f983a22ca	fix: add missing stream mime type assignment to the `LinkContentFetcher` (#8596 ) * add missing stream mime type assignment to the `LinkContentFetcher` * fix release note fmt --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2024-12-09 14:51:14 +00:00
Michele Pangrazzi	b32f85cca2	remove deprecated 'converter' init parameter from PyPDFToDocument component (#8609 )	2024-12-06 15:43:43 +01:00
David S. Batista	2282c26f17	feat!: `SentenceWindowRetriever` returns `List[Document]` with docs ordered by `split_idx_start` (#8590 ) * initial import * adding a few pylint disable * adding tests * fixing integration tests * adding release notes * fixing types and docstrings	2024-12-04 16:55:56 +01:00
Amna Mubashar	4c8eb54049	feat: Add store_full_path to converters (3/3) (#8585 ) * Add store_full_path params	2024-12-03 13:48:56 +05:00
Stefano Fiorucci	c8685aa141	refactor: update components to access `ChatMessage.text` instead of `content` (#8589 ) * introduce text property and deprecate content * release note * use chatmessage.text * release note * linting	2024-11-28 10:16:07 +00:00
Stefano Fiorucci	51c1390426	chore: use class methods to create `ChatMessage` (#8581 ) * use class methods to build messages * fix failing format	2024-11-28 09:35:24 +00:00
Stefano Fiorucci	fb42c035c5	feat: `PyPDFToDocument` - add new customization parameters (#8574 ) * deprecat converter in pypdf * fix linting of MetaFieldGroupingRanker * linting * pypdftodocument: add customization params * fix mypy * incorporate feedback	2024-11-26 16:37:59 +01:00
Vladimir Blagojevic	59f1e182db	feat: Add variable to specify inputs as optional to ConditionalRouter (#8568 ) * Add optional_variables in ConditionalRouter * Add reno note * Add more unit test with various complex scenarios * Add more unit tests * Add pylint disable=too-many-positional-arguments * PR feedback from @sjrl	2024-11-26 10:48:55 +01:00
Silvano Cerza	ab840351f8	Fix DocumentCleaner not preserving Document fields (#8578 )	2024-11-25 13:08:59 +01:00

373 Commits