haystack

mirror of https://github.com/deepset-ai/haystack.git synced 2025-11-14 00:54:22 +00:00

Author	SHA1	Message	Date
Sebastian Husch Lee	35788a2d06	feat: Update csv cleaner (#8828 ) * More refactoring * Add more new options and more tests * Improve docstrings * Add release notes * Fix pylint	2025-02-07 14:29:53 +01:00
Sebastian Husch Lee	1785ea622e	feat: Add component CSVDocumentCleaner for removing empty rows and columns (#8816 ) * Initial commit for csv cleaner * Add release notes * Update lineterminator * Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml Co-authored-by: David S. Batista <dsbatista@gmail.com> * alphabetize * Use lazy import * Some refactoring * Some refactoring --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 17:56:38 +01:00
Stefano Fiorucci	1f257944a6	chore: fix Hugging Face components for mypy 1.15.0 (#8822 ) * chore: fix Hugging Face components for mypy 1.15.0 * small fixes * fix test * rm print * use cast and be more permissive	2025-02-06 16:25:59 +00:00
David S. Batista	e7c6d14431	docs: removing undefined param from docstring (#8826 )	2025-02-06 16:04:57 +01:00
mathislucka	eec91824bc	fix: pipeline run bugs in cyclic and acyclic pipelines (#8707 ) * add component checks * pipeline should run deterministically * add FIFOQueue * add agent tests * add order dependent tests * run new tests * remove code that is not needed * test: intermediate from cycle outputs are available outside cycle * add tests for component checks (Claude) * adapt tests for component checks (o1 review) * chore: format * remove tests that aren't needed anymore * add _calculate_priority tests * revert accidental change in pyproject.toml * test format conversion * adapt to naming convention * chore: proper docstrings and type hints for PQ * format * add more unit tests * rm unneeded comments * test input consumption * lint * fix: docstrings * lint * format * format * fix license header * fix license header * add component run tests * fix: pass correct input format to tracing * fix types * format * format * types * add defaults from Socket instead of signature - otherwise components with dynamic inputs would fail * fix test names * still wait for optional inputs on greedy variadic sockets - mirrors previous behavior * fix format * wip: warn for ambiguous running order * wip: alternative warning * fix license header * make code more readable Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com> * Introduce content tracing to a behavioral test * Fixing linting * Remove debug print statements * Fix tracer tests * remove print * test: test for component inputs * test: remove testing for run order * chore: update component checks from experimental * chore: update pipeline and base from experimental * refactor: remove unused method * refactor: remove unused method * refactor: outdated comment * refactor: inputs state is updated as side effect - to prepare for AsyncPipeline implementation * format * test: add file conversion test * format * fix: original implementation deepcopies outputs * lint * fix: from_dict was updated * fix: format * fix: test * test: add 
test for thread safety * remove unused imports * format * test: FIFOPriorityQueue * chore: add release note * fix: resolve merge conflict with mermaid changes * fix: format * fix: remove unused import * refactor: rename to avoid accidental conflicts * chore: remove unused inputs, add missing license header * chore: extend release notes * Update releasenotes/notes/fix-pipeline-run-2fefeafc705a6d91.yaml Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com> * fix: format * fix: format * Update release note --------- Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com> Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 14:19:47 +00:00
Stefano Fiorucci	05300490a6	docs: add `ListJoiner` to pydoc configuration (#8821 ) * docs: add ListJoiner to pydoc configuration * Update docs/pydoc/config/joiners_api.yml Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-02-06 08:52:24 +00:00
Amna Mubashar	b0809b75f5	feat: Add a `ListJoiner` component (#8810 ) * Add a ListJoiner * Add tests and release notes	2025-02-05 23:19:14 +01:00
György Orosz	d2348ad462	feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode (#8806 ) * feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode * refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility reasons * docs: added explanation for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * test: added tests for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * doc: removed empty lines from docstrings of SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder * refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility (part II.)	2025-02-05 16:09:35 +00:00
Stefano Fiorucci	2828d9e4ae	refactor!: `DOCXToDocument` converter - store DOCX metadata as a dict (#8804 ) * DOCXToDocument - store DOCX metadata as a dict * do not export DOCXMetadata to converters package	2025-02-05 14:43:19 +01:00
Stefano Fiorucci	5ae94886b2	fix: fix test failures with Transformers models in PRs from forks (#8809 ) * trigger * try pinning sentence transformers * make integr tests run right away * pin transformers instead * older transformers version * rm transformers pin * try ignoring cache * change ubuntu version * try removing token * try again * more HF_API_TOKEN local deletions * restore test priority * rm leftover * more deletions * moreee * more * deletions * restore jobs order	2025-02-04 19:08:37 +01:00
dependabot[bot]	f1679f1dca	build(deps): bump fossas/fossa-action from 1.4.0 to 1.5.0 (#8771 ) Bumps [fossas/fossa-action](https://github.com/fossas/fossa-action) from 1.4.0 to 1.5.0. - [Release notes](https://github.com/fossas/fossa-action/releases) - [Commits](https://github.com/fossas/fossa-action/compare/v1.4.0...v1.5.0) --- updated-dependencies: - dependency-name: fossas/fossa-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-02-04 15:01:54 +01:00
Sebastian Husch Lee	1ee86b5041	fix: Fix filters to handle date times with timezones (loading and comparison) (#8800 ) * Fix on date time parsing with timezones. And comparing naive and aware date times. * Add release note * Add more filter tests	2025-02-04 14:51:06 +01:00
Stefano Fiorucci	ad5d29d92f	chore: ToolInvoker - remove warning (#8803 )	2025-02-04 09:39:17 +00:00
Stefano Fiorucci	877f826da0	refactor: HF API Embedders - use `InferenceClient.feature_extraction` instead of `InferenceClient.post` (#8794 ) * HF API Embedders: refactoring * rename variables * rm leftovers * rm pin * rm unused import * relnote * warning with truncate/normalize and serverless inference API * test that warnings are raised	2025-02-03 15:11:16 +00:00
David S. Batista	f1652121ac	feat: Add support for custom (or offline) Mermaid.ink server and support all parameters (#8799 ) * compress graph data to support pako endpoint * support mermaid.ink parameters and custom servers * dont try to resolve conflicts with the github web ui... * avoid double graph copy * fixing typing, improving docstrings and release notes * reverting type * nit - force type checker no cache * nit - force type checker no cache --------- Co-authored-by: Ulises M <ulises@lbux.org> Co-authored-by: Ulises M <30765968+lbux@users.noreply.github.com>	2025-02-03 15:55:29 +01:00
David S. Batista	503d275ade	chore: remove DocumentSplitter warning related to split_by='sentence'	2025-02-03 12:47:14 +01:00
mathislucka	1a91365cc8	fix: callables can be deserialized from fully qualified import path (#8788 ) * fix: callables can be deserialized from fully qualified import path * fix: license header * fix: format * fix: types * fix? types * test: extend test case * format * add release notes	2025-02-03 12:35:37 +01:00
Amna Mubashar	379711f63e	fix: Pin nltk version for sentence tokenizer (#8786 ) * Pin nltk version for sentence tokenizer * Update pyproject.toml * Update haystack/components/preprocessors/sentence_tokenizer.py --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-31 17:01:00 +01:00
Stefano Fiorucci	80575a7e9c	deprecate dataframe and ExtractedTableAnswer (#8789 )	2025-01-31 15:03:15 +01:00
Stefano Fiorucci	3ef609a3e8	temporarily pin huggingface_hub<0.28.0 (#8790 )	2025-01-31 10:35:15 +01:00
Ulises M	d939321505	fix: compress pipeline graphs before sending to mermaid (#8767 ) * compress graph data to support pako endpoint * Update haystack/core/pipeline/draw.py Co-authored-by: David S. Batista <dsbatista@gmail.com> * Update haystack/core/pipeline/draw.py Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-28 12:18:54 +01:00
Sebastian Husch Lee	bba84e5517	fix: Fix JSONConverter to properly skip files that are not utf-8 encoded (#8775 ) * Small fix * Add reno * Trying out license header fix here	2025-01-28 10:29:55 +01:00
Sebastian Husch Lee	e3dc164625	Update license-header.txt with breaking changes from hawkeye (#8778 )	2025-01-28 10:03:23 +01:00
Per Lunnemann Hansen	0e6d2a4c39	fix: update component registration to use new class reference (#8715 ) The pyright language server is now able to resolve the import and provide completions for the component. Co-authored-by: Michele Pangrazzi <xmikex83@gmail.com>	2025-01-27 14:52:24 +01:00
Stefano Fiorucci	0ac47b0064	pin numba>=0.54.0 (#8773 )	2025-01-27 11:55:18 +01:00
Night-Quiet	c989d9c483	fix: skip comment blocks in `DOCXToDocument` (#8764 ) * fix bug #8759 * Apply suggestions from code review * release note --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-24 11:06:09 +00:00
Stefano Fiorucci	223373eced	fix: Document Classifiers - fix error messages (#8765 ) * fix: Document Classifiers - fix docstrings + error messages * grammar * fix	2025-01-24 11:17:47 +01:00
tstadel	3119ae1ec9	refactor: raise `PipelineError` when `Pipeline.from_dict` receives an invalid type (#8711 ) * fix: error on invalid type * add reno * Update releasenotes/notes/fix-invalid-component-type-error-83ee00d820b63cc5.yaml Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * Update test/core/pipeline/test_pipeline.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * fix reno * fix reno * last reno fix --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-23 11:40:19 +00:00
tstadel	bf79f04932	feat: support streaming_callback as run param for HF Chat generators (#8763 ) * feat: support streaming_callback as run param for HF Chat generators * add tests	2025-01-23 12:14:32 +01:00
Stefano Fiorucci	c3d0643511	feat: `AzureOpenAIChatGenerator` - support for tools (#8757 ) * feat: AzureOpenAIChatGenerator - support for tools * release note * feedback	2025-01-23 09:24:04 +00:00
Stefano Fiorucci	f96839e139	chore: update `transformers` test dependency (#8752 ) * update transformers test dependency * add pad_token_id to the mock tokenizer * fix HFLocal test + new test	2025-01-21 14:43:27 +01:00
Stefano Fiorucci	2bf6bf6a45	build: add `jsonschema` library to core dependencies (#8753 ) * add jsonschema to core dependencies * release note	2025-01-21 10:07:56 +01:00
Nicola Procopio	542a7f7ef5	fix: update meta data before initializing new Document in DocumentSplitter (#8745 ) * updated DocumentSplitter issue #8741 * release note * updated DocumentSplitter in _create_docs_from_splits function initialize a new variable copied_mete instead to overwrite meta * added test test_duplicate_pages_get_different_doc_id * fix fmt --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-20 09:51:47 +01:00
Stefano Fiorucci	242138c68b	chore: update ruff version in pre-commit hook (#8746 )	2025-01-19 20:45:02 +01:00
Julian Risch	6feb3856bb	chore: Remove FixMe comment from __init__.py (#8749 )	2025-01-19 17:28:37 +01:00
David S. Batista	5af2888e23	fix: `PDFMinerToDocument` convert function - adding double new lines between each `container_text` so that passages can be detected. (#8729 ) * initial import * adding double new lines between container_texts so that passages can be detected * reducing type specification to avoid import error * adding release notes * renaming variable	2025-01-17 13:01:16 +00:00
Stefano Fiorucci	424bce2783	test: fix HF API flaky live test with tools (#8744 ) * test: fix HF API flaky live test with tools * rm print	2025-01-17 12:36:07 +00:00
David S. Batista	2c84266d8f	test: adding test for PyPDF to extract passages so that they are detected by DocumentSplitter (#8739 )	2025-01-17 10:56:16 +01:00
Vladimir Blagojevic	21dd03d3e7	feat: Add completion start time timestamp to relevant generators (#8728 ) * OpenAIChatGenerator - add completion_start_time * HuggingFaceAPIChatGenerator - add completion_start_time * Add tests * Add reno note * Relax condition for cached responses * Add completion_start_time timestamping to non-chat generators * Update haystack/components/generators/chat/hugging_face_api.py Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> * PR feedback --------- Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>	2025-01-17 09:58:45 +01:00
Stefano Fiorucci	62ac27c947	chore: remove deprecated `function` `ChatRole` and `from_function` class method in `ChatMessage` (#8725 ) * rm deprecated function role and from_function class method in chatmessage * release note	2025-01-15 18:55:22 +01:00
David S. Batista	26b80778f5	chore: removing NLTKDocumentSplitter (#8724 ) * removing NLTKDocumentSplitter * adding release notes * removing pydocs reference	2025-01-15 16:11:51 +00:00
Stefano Fiorucci	167ede1f4c	remove deprecation warning from SentenceWindowRetriever (#8720 )	2025-01-15 08:51:52 +00:00
David S. Batista	425ce9b98f	test: updating HuggingFaceAPIChatGenerator tests	2025-01-14 16:47:29 +01:00
David S. Batista	34bd31ef32	docs: fixing RecursiveSplitter pydoc markdown rendering	2025-01-14 11:27:31 +00:00
Haystack Bot	ed40d9f001	Update unstable version to 2.10.0-rc0 (#8713 ) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-01-13 15:23:33 +01:00
David S. Batista	ec8666545d	docs: adding RecursiveSplitter to pydoc v2.10.0-rc0	2025-01-13 11:46:34 +01:00
Vladimir Blagojevic	d147c7658f	feat: Add `ComponentTool` to Haystack tools (#8693 ) * Initial ComponentTool --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Julian Risch <julian.risch@deepset.ai>	2025-01-13 11:15:33 +01:00
Julian Risch	642fa60cdf	fix: PDFMinerToDocument initializes documents with content and meta (#8708 ) * fix: PDFMinerToDocument initializes documents with content and meta * add release note * Apply suggestions from code review Co-authored-by: David S. Batista <dsbatista@gmail.com> --------- Co-authored-by: David S. Batista <dsbatista@gmail.com>	2025-01-13 10:12:06 +00:00
Amna Mubashar	db76ae2847	feat: add `default_headers` for Azure embedders (#8699 ) * Add default_headers param to azure embedders	2025-01-12 17:41:38 +01:00
David S. Batista	4f73b192f8	feat: add `RecursiveSplitter` component for `Document` preprocessing (#8605 ) * initial import * adding initial version + tests * adding more tests * more tests * incorporating SentenceSplitter based on NLTK * adding more tests * adding release notes * adding LICENSE header * removing unused imports * fixing example docstring * addding docstrings * fixing tests and returning a dictionary * updating release notes * attending PR comments * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: updating tests for split_idx_start and _split_overlap * adding tests for split_idx and split_start and overlaps * adjusting file for LICENSE checking * adding more tests * adding tests for page numbering * adding tests for min split lenghts and falling back to character-level chunking based on size * fixing linting issue * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * 
wip * wip * updating tests * wip: fixing all tests after changes * more tests * wip: debugging sentence overlap * wip: debugging page number * wip * wip; fixed bug with sentence tokenizer, needs to keep white spaces * adding tests for counting pages on different split approaches * NLTK checks done on SentenceSplitter * fixing types * adding detecting for full overlap with previous chunks * fixing types * improving docstring * improving docstring * adding custom lenght, 'character' use case * customising overlap function for word and adding a few tests * updating docstring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * wip: adding more tests for word unit length * fix * feat: `Tool` dataclass - unified abstraction to represent tools (#8652) * draft * del HF token in tests * adaptations * progress * fix type * import sorting * more control on deserialization * release note * improvements * support name field * fix chatpromptbuilder test * port Tool from experimental * release note * docs upd * Update tool.py --------- Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> * fix: fix deserialization issues in multi-threading environments (#8651) * adding 'word' as default length * fixing types * handing both default strategies * wip * \f was not being counted properly * updating tests * fixing the overlap bug * adding more tests * refactoring _apply_overlap * further refactoring * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee 
<sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * Update haystack/components/preprocessors/recursive_splitter.py Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> * adding ticks to close code block * fixing comments * applying changes: split with space and force keep_white_spaces=True * fixing some tests and replacing count words approach in more places * keep_white_spaces = True only if not defined * cleaning docs * handling some more edge cases, when split is still too big and all separators ran * fixing fallback whitespaces count to fixed word/char split based on split size * cleaning --------- Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com> Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com> Co-authored-by: Daria Fokina <daria.fokina@deepset.ai> Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>	2025-01-10 17:28:53 +01:00


3855 Commits