haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-14 17:13:03 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	35788a2d06	feat: Update csv cleaner (#8828 ) * More refactoring * Add more new options and more tests * Improve docstrings * Add release notes * Fix pylint	2025-02-07 14:29:53 +01:00
Sebastian Husch Lee	1785ea622e	feat: Add component CSVDocumentCleaner for removing empty rows and columns (#8816 ) * Initial commit for csv cleaner * Add release notes * Update lineterminator * Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml Co-authored-by: David S. Batista <dsbatista@gmail.com> * alphabetize * Use lazy import * Some refactoring * Some refactoring --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 17:56:38 +01:00
mathislucka	eec91824bc	fix: pipeline run bugs in cyclic and acyclic pipelines (#8707 ) * add component checks * pipeline should run deterministically * add FIFOQueue * add agent tests * add order dependent tests * run new tests * remove code that is not needed * test: intermediate from cycle outputs are available outside cycle * add tests for component checks (Claude) * adapt tests for component checks (o1 review) * chore: format * remove tests that aren't needed anymore * add _calculate_priority tests * revert accidental change in pyproject.toml * test format conversion * adapt to naming convention * chore: proper docstrings and type hints for PQ * format * add more unit tests * rm unneeded comments * test input consumption * lint * fix: docstrings * lint * format * format * fix license header * fix license header * add component run tests * fix: pass correct input format to tracing * fix types * format * format * types * add defaults from Socket instead of signature - otherwise components with dynamic inputs would fail * fix test names * still wait for optional inputs on greedy variadic sockets - mirrors previous behavior * fix format * wip: warn for ambiguous running order * wip: alternative warning * fix license header * make code more readable Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com> * Introduce content tracing to a behavioral test * Fixing linting * Remove debug print statements * Fix tracer tests * remove print * test: test for component inputs * test: remove testing for run order * chore: update component checks from experimental * chore: update pipeline and base from experimental * refactor: remove unused method * refactor: remove unused method * refactor: outdated comment * refactor: inputs state is updated as side effect - to prepare for AsyncPipeline implementation * format * test: add file conversion test * format * fix: original implementation deepcopies outputs * lint * fix: from_dict was updated * fix: format * fix: test * test: add 
test for thread safety * remove unused imports * format * test: FIFOPriorityQueue * chore: add release note * fix: resolve merge conflict with mermaid changes * fix: format * fix: remove unused import * refactor: rename to avoid accidental conflicts * chore: remove unused inputs, add missing license header * chore: extend release notes * Update releasenotes/notes/fix-pipeline-run-2fefeafc705a6d91.yaml Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com> * fix: format * fix: format * Update release note --------- Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com> Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 14:19:47 +00:00
Amna Mubashar	b0809b75f5	feat: Add a `ListJoiner` component (#8810 ) * Add a ListJoiner * Add tests and release notes	2025-02-05 23:19:14 +01:00
György Orosz	d2348ad462	feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode (#8806 ) * feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode * refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility reasons * docs: added explanation for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * test: added tests for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * doc: removed empty lines from docstrings of SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility (part II.)	2025-02-05 16:09:35 +00:00
Stefano Fiorucci	2828d9e4ae	refactor!: `DOCXToDocument` converter - store DOCX metadata as a dict (#8804 ) * DOCXToDocument - store DOCX metadata as a dict * do not export DOCXMetadata to converters package	2025-02-05 14:43:19 +01:00
Sebastian Husch Lee	1ee86b5041	fix: Fix filters to handle date times with timezones (loading and comparison) (#8800 ) * Fix on date time parsing with timezones. And comparing naive and aware date times. * Add release note * Add more filter tests	2025-02-04 14:51:06 +01:00
Stefano Fiorucci	877f826da0	refactor: HF API Embedders - use `InferenceClient.feature_extraction` instead of `InferenceClient.post` (#8794 ) * HF API Embedders: refactoring * rename variables * rm leftovers * rm pin * rm unused import * relnote * warning with truncate/normalize and serverless inference API * test that warnings are raised	2025-02-03 15:11:16 +00:00
David S. Batista	f1652121ac	feat: Add support for custom (or offline) Mermaid.ink server and support all parameters (#8799 ) * compress graph data to support pako endpoint * support mermaid.ink parameters and custom servers * don't try to resolve conflicts with the github web ui... * avoid double graph copy * fixing typing, improving docstrings and release notes * reverting type * nit - force type checker no cache * nit - force type checker no cache --------- Co-authored-by: Ulises M <ulises@lbux.org> Co-authored-by: Ulises M <30765968+lbux@users.noreply.github.com>	2025-02-03 15:55:29 +01:00
mathislucka	1a91365cc8	fix: callables can be deserialized from fully qualified import path (#8788 ) * fix: callables can be deserialized from fully qualified import path * fix: license header * fix: format * fix: types * fix? types * test: extend test case * format * add release notes	2025-02-03 12:35:37 +01:00
Stefano Fiorucci	80575a7e9c	deprecate dataframe and ExtractedTableAnswer (#8789 )	2025-01-31 15:03:15 +01:00
Ulises M	d939321505	fix: compress pipeline graphs before sending to mermaid (#8767 ) * compress graph data to support pako endpoint * Update haystack/core/pipeline/draw.py Co-authored-by: David S. Batista <dsbatista@gmail.com> * Update haystack/core/pipeline/draw.py Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-28 12:18:54 +01:00
Sebastian Husch Lee	bba84e5517	fix: Fix JSONConverter to properly skip files that are not utf-8 encoded (#8775 ) * Small fix * Add reno * Trying out license header fix here	2025-01-28 10:29:55 +01:00
Per Lunnemann Hansen	0e6d2a4c39	fix: update component registration to use new class reference (#8715 ) The pyright language server is now able to resolve the import and provide completions for the component. Co-authored-by: Michele Pangrazzi <xmikex83@gmail.com>	2025-01-27 14:52:24 +01:00
Night-Quiet	c989d9c483	fix: skip comment blocks in `DOCXToDocument` (#8764 ) * fix bug #8759 * Apply suggestions from code review * release note --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-24 11:06:09 +00:00
Stefano Fiorucci	223373eced	fix: Document Classifiers - fix error messages (#8765 ) * fix: Document Classifiers - fix docstrings + error messages * grammar * fix	2025-01-24 11:17:47 +01:00
tstadel	3119ae1ec9	refactor: raise `PipelineError` when `Pipeline.from_dict` receives an invalid type (#8711 ) * fix: error on invalid type * add reno * Update releasenotes/notes/fix-invalid-component-type-error-83ee00d820b63cc5.yaml Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * Update test/core/pipeline/test_pipeline.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * fix reno * fix reno * last reno fix --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-23 11:40:19 +00:00
tstadel	bf79f04932	feat: support streaming_callback as run param for HF Chat generators (#8763 ) * feat: support streaming_callback as run param for HF Chat generators * add tests	2025-01-23 12:14:32 +01:00
Stefano Fiorucci	c3d0643511	feat: `AzureOpenAIChatGenerator` - support for tools (#8757 ) * feat: AzureOpenAIChatGenerator - support for tools * release note * feedback	2025-01-23 09:24:04 +00:00
Stefano Fiorucci	2bf6bf6a45	build: add `jsonschema` library to core dependencies (#8753 ) * add jsonschema to core dependencies * release note	2025-01-21 10:07:56 +01:00
Nicola Procopio	542a7f7ef5	fix: update meta data before initializing new Document in DocumentSplitter (#8745 ) * updated DocumentSplitter issue #8741 * release note * updated DocumentSplitter: the _create_docs_from_splits function initializes a new variable copied_meta instead of overwriting meta * added test test_duplicate_pages_get_different_doc_id * fix fmt --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-20 09:51:47 +01:00
David S. Batista	5af2888e23	fix: `PDFMinerToDocument` convert function - adding double new lines between each `container_text` so that passages can be detected. (#8729 ) * initial import * adding double new lines between container_texts so that passages can be detected * reducing type specification to avoid import error * adding release notes * renaming variable	2025-01-17 13:01:16 +00:00
Vladimir Blagojevic	21dd03d3e7	feat: Add completion start time timestamp to relevant generators (#8728 ) * OpenAIChatGenerator - add completion_start_time * HuggingFaceAPIChatGenerator - add completion_start_time * Add tests * Add reno note * Relax condition for cached responses * Add completion_start_time timestamping to non-chat generators * Update haystack/components/generators/chat/hugging_face_api.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * PR feedback --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-17 09:58:45 +01:00
Stefano Fiorucci	62ac27c947	chore: remove deprecated `function` `ChatRole` and `from_function` class method in `ChatMessage` (#8725 ) * rm deprecated function role and from_function class method in chatmessage * release note	2025-01-15 18:55:22 +01:00
David S. Batista	26b80778f5	chore: removing NLTKDocumentSplitter (#8724 ) * removing NLTKDocumentSplitter * adding release notes * removing pydocs reference	2025-01-15 16:11:51 +00:00
Vladimir Blagojevic	d147c7658f	feat: Add `ComponentTool` to Haystack tools (#8693 ) * Initial ComponentTool --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2025-01-13 11:15:33 +01:00
Julian Risch	642fa60cdf	fix: PDFMinerToDocument initializes documents with content and meta (#8708 ) * fix: PDFMinerToDocument initializes documents with content and meta * add release note * Apply suggestions from code review Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-13 10:12:06 +00:00
Amna Mubashar	db76ae2847	feat: add `default_headers` for Azure embedders (#8699 ) * Add default_headers param to azure embedders	2025-01-12 17:41:38 +01:00
David S. Batista	4f73b192f8	feat: add `RecursiveSplitter` component for `Document` preprocessing (#8605 ) * initial import * adding initial version + tests * adding more tests * more tests * incorporating SentenceSplitter based on NLTK * adding more tests * adding release notes * adding LICENSE header * removing unused imports * fixing example docstring * adding docstrings * fixing tests and returning a dictionary * updating release notes * attending PR comments * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: updating tests for split_idx_start and _split_overlap * adding tests for split_idx and split_start and overlaps * adjusting file for LICENSE checking * adding more tests * adding tests for page numbering * adding tests for min split lengths and falling back to character-level chunking based on size * fixing linting issue * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * 
wip * wip * updating tests * wip: fixing all tests after changes * more tests * wip: debugging sentence overlap * wip: debugging page number * wip * wip; fixed bug with sentence tokenizer, needs to keep white spaces * adding tests for counting pages on different split approaches * NLTK checks done on SentenceSplitter * fixing types * adding detecting for full overlap with previous chunks * fixing types * improving docstring * improving docstring * adding custom length, 'character' use case * customising overlap function for word and adding a few tests * updating docstring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: adding more tests for word unit length * fix * feat: `Tool` dataclass - unified abstraction to represent tools (#8652) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * port Tool from experimental * release note * docs upd * Update tool.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * fix: fix deserialization issues in multi-threading environments (#8651) * adding 'word' as default length * fixing types * handling both default strategies * wip * \f was not being counted properly * updating tests * fixing the overlap bug * adding more tests * refactoring _apply_overlap * further refactoring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee 
<sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * adding ticks to close code block * fixing comments * applying changes: split with space and force keep_white_spaces=True * fixing some tests and replacing count words approach in more places * keep_white_spaces = True only if not defined * cleaning docs * handling some more edge cases, when split is still too big and all separators ran * fixing fallback whitespaces count to fixed word/char split based on split size * cleaning --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>	2025-01-10 17:28:53 +01:00
Stefano Fiorucci	741ce5df50	fix: `OpenAIChatGenerator` - do not pass tools to the OpenAI client when none are provided (#8702 ) * do not pass tools to OpenAI client if None * release note * fix release note	2025-01-10 14:46:41 +01:00
Stefano Fiorucci	08cf09f83f	refactor: `create_tool_from_function` + `tool` decorator (#8697 ) * create_tool_from_function + decorator * release note * improve usage example * add imports to @tool usage example * clarify docstrings * small docstring addition	2025-01-10 12:15:15 +01:00
Julian Risch	dd9660f90d	fix: PyPDFToDocument initializes documents with content and meta (#8698 ) * initialize document with content and meta * update test * add test checking that not only content is used for id generation	2025-01-09 19:12:10 +00:00
Stefano Fiorucci	3f15f38c51	refactor: move `Tool` to a separate package; refactor serde (#8690 ) * move tool to separate package; refactor serde * release note * rm unused import	2025-01-09 12:30:13 +01:00
Sebastian Husch Lee	28ad78c73d	feat: Add XLSXToDocument converter (#8522 ) * Add draft of the Excel To Document converter * Add license header * Add release note * Use Union instead of pipe * Add openpyxl as additional dep * Fix zip issue * few updates from Bijay * Update deps * Add markdown test * Adding more example excels and expanding tests * Added more tests * Fix windows test by setting lineterminator * Addressing PR comments * PR comments * Fix linting	2025-01-09 09:03:19 +01:00
Stefano Fiorucci	bc30105fbc	test: reorganize docstore test suite to isolate dataframe tests (#8684 ) * reorganize docstore test suite to isolate dataframe tests * improve docstring * include FilterDocumentsTestWithDataframe in InMemoryDocumentStore tests	2025-01-08 14:58:52 +00:00
Stefano Fiorucci	5539f6c33f	refactor: improve serialization/deserialization of callables (to handle class methods and static methods) (#8683 ) * progress * refinements * tidy up * release note	2025-01-08 11:28:00 +01:00
tstadel	e6059e632e	fix: truncate ByteStream string representation (#8673 ) * fix: truncate ByteStream string representation * add reno * better reno * add test * Update test_byte_stream.py * apply feedback * update reno	2025-01-07 19:00:52 +01:00
Bohan Qu	8e3f64717f	feat: use importlib when deserializing callables (#8648 )	2025-01-03 15:06:58 +01:00
Stefano Fiorucci	7b4d9ba86e	feat: introduce class method to create `ChatMessage` from the OpenAI dictionary format (#8670 ) * add ChatMessage.from_openai_dict_format * remove print * release note * improve docstring * separate validation logic * rm obvious comment	2025-01-02 10:34:41 +00:00
Stefano Fiorucci	99e7e343b2	chore: update links to chatmessage docs (#8667 )	2024-12-20 15:33:27 +01:00
Stefano Fiorucci	188b2a7f06	feat: support for tools in `OpenAIChatGenerator` (#8666 ) * move chatmsg>openai conversion to chatmsg dataclass * implementation and tests cleanup * release note * try fixing azure chat generator * add serde test for toolinvoker * small fix	2024-12-20 14:20:54 +00:00
Stefano Fiorucci	7dcbf25bd7	feat: add Tool Invoker component (#8664 ) * port toolinvoker * release note	2024-12-20 14:02:42 +01:00
Michele Pangrazzi	c192488bf6	Named entity extractor private models (#8658 ) * add 'token' support to NamedEntityExtractor to enable using private models on HF backend * fix existing error message format * add release note * add HF_API_TOKEN to e2e workflow * add informative comment * Updated to_dict / from_dict to handle 'token' correctly ; Added tests * Fix lint * Revert unwanted change	2024-12-20 11:15:55 +01:00
Sebastian Husch Lee	286061f005	fix: Move potential nltk download to warm_up (#8646 ) * Move potential nltk download to warm_up * Update tests * Add release notes * Fix tests * Uncomment * Make mypy happy * Add RuntimeError message * Update release notes --------- Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2024-12-20 10:41:44 +01:00
Stefano Fiorucci	f4d9c2bb91	fix: Make the `HuggingFaceLocalChatGenerator` compatible with the new `ChatMessage`; serialize `chat_template` (#8663 ) * message conversion function * hfapi w tools * right test file + hf_hub version * release note * fix for new chatmessage; serialize chat_template * feedback	2024-12-19 15:12:12 +01:00
Stefano Fiorucci	2bc58d2987	feat: support for tools in `HuggingFaceAPIChatGenerator` (#8661 ) * message conversion function * hfapi w tools * right test file + hf_hub version * release note * feedback	2024-12-19 15:04:37 +01:00
Tobias Wochinger	91619a79c1	fix: fix deserialization issues in multi-threading environments (#8651 )	2024-12-18 21:34:57 +01:00
Stefano Fiorucci	96b4a1d2fd	feat: `Tool` dataclass - unified abstraction to represent tools (#8652 ) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * port Tool from experimental * release note * docs upd * Update tool.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2024-12-18 11:36:44 +00:00
Stefano Fiorucci	ea3602643a	feat!: new `ChatMessage` (#8640 ) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * Update chat_message.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>	2024-12-17 17:02:04 +01:00
Stefano Fiorucci	2a9a6401d2	chore: pin `openai>=1.56.1` (#8632 ) * pin openai>=1.56.1 * release note	2024-12-12 16:26:38 +01:00

652 Commits