3855 Commits

Author SHA1 Message Date
Sebastian Husch Lee
35788a2d06
feat: Update csv cleaner (#8828)
* More refactoring

* Add more new options and more tests

* Improve docstrings

* Add release notes

* Fix pylint
2025-02-07 14:29:53 +01:00
Sebastian Husch Lee
1785ea622e
feat: Add component CSVDocumentCleaner for removing empty rows and columns (#8816)
* Initial commit for csv cleaner

* Add release notes

* Update lineterminator

* Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml

Co-authored-by: David S. Batista <dsbatista@gmail.com>

* alphabetize

* Use lazy import

* Some refactoring

* Some refactoring

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-02-06 17:56:38 +01:00
Stefano Fiorucci
1f257944a6
chore: fix Hugging Face components for mypy 1.15.0 (#8822)
* chore: fix Hugging Face components for mypy 1.15.0

* small fixes

* fix test

* rm print

* use cast and be more permissive
2025-02-06 16:25:59 +00:00
David S. Batista
e7c6d14431
docs: removing undefined param from docstring (#8826) 2025-02-06 16:04:57 +01:00
mathislucka
eec91824bc
fix: pipeline run bugs in cyclic and acyclic pipelines (#8707)
* add component checks

* pipeline should run deterministically

* add FIFOQueue

* add agent tests

* add order dependent tests

* run new tests

* remove code that is not needed

* test: intermediate from cycle outputs are available outside cycle

* add tests for component checks (Claude)

* adapt tests for component checks (o1 review)

* chore: format

* remove tests that aren't needed anymore

* add _calculate_priority tests

* revert accidental change in pyproject.toml

* test format conversion

* adapt to naming convention

* chore: proper docstrings and type hints for PQ

* format

* add more unit tests

* rm unneeded comments

* test input consumption

* lint

* fix: docstrings

* lint

* format

* format

* fix license header

* fix license header

* add component run tests

* fix: pass correct input format to tracing

* fix types

* format

* format

* types

* add defaults from Socket instead of signature

- otherwise components with dynamic inputs would fail

* fix test names

* still wait for optional inputs on greedy variadic sockets

- mirrors previous behavior

* fix format

* wip: warn for ambiguous running order

* wip: alternative warning

* fix license header

* make code more readable

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* Introduce content tracing to a behavioral test

* Fixing linting

* Remove debug print statements

* Fix tracer tests

* remove print

* test: test for component inputs

* test: remove testing for run order

* chore: update component checks from experimental

* chore: update pipeline and base from experimental

* refactor: remove unused method

* refactor: remove unused method

* refactor: outdated comment

* refactor: inputs state is updated as side effect

- to prepare for AsyncPipeline implementation

* format

* test: add file conversion test

* format

* fix: original implementation deepcopies outputs

* lint

* fix: from_dict was updated

* fix: format

* fix: test

* test: add test for thread safety

* remove unused imports

* format

* test: FIFOPriorityQueue

* chore: add release note

* fix: resolve merge conflict with mermaid changes

* fix: format

* fix: remove unused import

* refactor: rename to avoid accidental conflicts

* chore: remove unused inputs, add missing license header

* chore: extend release notes

* Update releasenotes/notes/fix-pipeline-run-2fefeafc705a6d91.yaml

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>

* fix: format

* fix: format

* Update release note

---------

Co-authored-by: Amna Mubashar <amnahkhan.ak@gmail.com>
Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-02-06 14:19:47 +00:00
Stefano Fiorucci
05300490a6
docs: add ListJoiner to pydoc configuration (#8821)
* docs: add ListJoiner to pydoc configuration

* Update docs/pydoc/config/joiners_api.yml

Co-authored-by: David S. Batista <dsbatista@gmail.com>

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-02-06 08:52:24 +00:00
Amna Mubashar
b0809b75f5
feat: Add a ListJoiner component (#8810)
* Add a ListJoiner

* Add tests and release notes
2025-02-05 23:19:14 +01:00
György Orosz
d2348ad462
feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode (#8806)
* feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode

* refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility reasons

* docs: added explanation for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder

* test: added tests for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder

* doc: removed empty lines from docstrings of SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder

* refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder made to be the last positional parameter for backward compatibility (part II.)
2025-02-05 16:09:35 +00:00
Stefano Fiorucci
2828d9e4ae
refactor!: DOCXToDocument converter - store DOCX metadata as a dict (#8804)
* DOCXToDocument - store DOCX metadata as a dict

* do not export DOCXMetadata to converters package
2025-02-05 14:43:19 +01:00
Stefano Fiorucci
5ae94886b2
fix: fix test failures with Transformers models in PRs from forks (#8809)
* trigger

* try pinning sentence transformers

* make integr tests run right away

* pin transformers instead

* older transformers version

* rm transformers pin

* try ignoring cache

* change ubuntu version

* try removing token

* try again

* more HF_API_TOKEN local deletions

* restore test priority

* rm leftover

* more deletions

* moreee

* more

* deletions

* restore jobs order
2025-02-04 19:08:37 +01:00
dependabot[bot]
f1679f1dca
build(deps): bump fossas/fossa-action from 1.4.0 to 1.5.0 (#8771)
Bumps [fossas/fossa-action](https://github.com/fossas/fossa-action) from 1.4.0 to 1.5.0.
- [Release notes](https://github.com/fossas/fossa-action/releases)
- [Commits](https://github.com/fossas/fossa-action/compare/v1.4.0...v1.5.0)

---
updated-dependencies:
- dependency-name: fossas/fossa-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-04 15:01:54 +01:00
Sebastian Husch Lee
1ee86b5041
fix: Fix filters to handle date times with timezones (loading and comparison) (#8800)
* Fix on date time parsing with timezones. And comparing naive and aware date times.

* Add release note

* Add more filter tests
2025-02-04 14:51:06 +01:00
Stefano Fiorucci
ad5d29d92f
chore: ToolInvoker - remove warning (#8803) 2025-02-04 09:39:17 +00:00
Stefano Fiorucci
877f826da0
refactor: HF API Embedders - use InferenceClient.feature_extraction instead of InferenceClient.post (#8794)
* HF API Embedders: refactoring

* rename variables

* rm leftovers

* rm pin

* rm unused import

* relnote

* warning with truncate/normalize and serverless inference API

* test that warnings are raised
2025-02-03 15:11:16 +00:00
David S. Batista
f1652121ac
feat: Add support for custom (or offline) Mermaid.ink server and support all parameters (#8799)
* compress graph data to support pako endpoint

* support mermaid.ink parameters and custom servers

* dont try to resolve conflicts with the github web ui...

* avoid double graph copy

* fixing typing, improving docstrings and release notes

* reverting type

* nit - force type checker no cache

* nit - force type checker no cache

---------

Co-authored-by: Ulises M <ulises@lbux.org>
Co-authored-by: Ulises M <30765968+lbux@users.noreply.github.com>
2025-02-03 15:55:29 +01:00
David S. Batista
503d275ade
chore: remove DocumentSplitter warning related to split_by='sentence' 2025-02-03 12:47:14 +01:00
mathislucka
1a91365cc8
fix: callables can be deserialized from fully qualified import path (#8788)
* fix: callables can be deserialized from fully qualified import path

* fix: license header

* fix: format

* fix: types

* fix? types

* test: extend test case

* format

* add release notes
2025-02-03 12:35:37 +01:00
Amna Mubashar
379711f63e
fix: Pin nltk version for sentence tokenizer (#8786)
* Pin nltk version for sentence tokenizer

* Update pyproject.toml

* Update haystack/components/preprocessors/sentence_tokenizer.py

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-01-31 17:01:00 +01:00
Stefano Fiorucci
80575a7e9c
deprecate dataframe and ExtractedTableAnswer (#8789) 2025-01-31 15:03:15 +01:00
Stefano Fiorucci
3ef609a3e8
temporarily pin huggingface_hub<0.28.0 (#8790) 2025-01-31 10:35:15 +01:00
Ulises M
d939321505
fix: compress pipeline graphs before sending to mermaid (#8767)
* compress graph data to support pako endpoint

* Update haystack/core/pipeline/draw.py

Co-authored-by: David S. Batista <dsbatista@gmail.com>

* Update haystack/core/pipeline/draw.py

Co-authored-by: David S. Batista <dsbatista@gmail.com>

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-01-28 12:18:54 +01:00
Sebastian Husch Lee
bba84e5517
fix: Fix JSONConverter to properly skip files that are not utf-8 encoded (#8775)
* Small fix

* Add reno

* Trying out license header fix here
2025-01-28 10:29:55 +01:00
Sebastian Husch Lee
e3dc164625
Update license-header.txt with breaking changes from hawkeye (#8778) 2025-01-28 10:03:23 +01:00
Per Lunnemann Hansen
0e6d2a4c39
fix: update component registration to use new class reference (#8715)
The pyright language server is now able to resolve the import and provide completions for the component.

Co-authored-by: Michele Pangrazzi <xmikex83@gmail.com>
2025-01-27 14:52:24 +01:00
Stefano Fiorucci
0ac47b0064
pin numba>=0.54.0 (#8773) 2025-01-27 11:55:18 +01:00
Night-Quiet
c989d9c483
fix: skip comment blocks in DOCXToDocument (#8764)
* fix bug #8759

* Apply suggestions from code review

* release note

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-01-24 11:06:09 +00:00
Stefano Fiorucci
223373eced
fix: Document Classifiers - fix error messages (#8765)
* fix: Document Classifiers - fix docstrings + error messages

* grammar

* fix
2025-01-24 11:17:47 +01:00
tstadel
3119ae1ec9
refactor: raise PipelineError when Pipeline.from_dict receives an invalid type (#8711)
* fix: error on invalid type

* add reno

* Update releasenotes/notes/fix-invalid-component-type-error-83ee00d820b63cc5.yaml

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* Update test/core/pipeline/test_pipeline.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* fix reno

* fix reno

* last reno fix

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-01-23 11:40:19 +00:00
tstadel
bf79f04932
feat: support streaming_callback as run param for HF Chat generators (#8763)
* feat: support streaming_callback as run param for HF Chat generators

* add tests
2025-01-23 12:14:32 +01:00
Stefano Fiorucci
c3d0643511
feat: AzureOpenAIChatGenerator - support for tools (#8757)
* feat: AzureOpenAIChatGenerator - support for tools

* release note

* feedback
2025-01-23 09:24:04 +00:00
Stefano Fiorucci
f96839e139
chore: update transformers test dependency (#8752)
* update transformers test dependency

* add pad_token_id to the mock tokenizer

* fix HFLocal test + new test
2025-01-21 14:43:27 +01:00
Stefano Fiorucci
2bf6bf6a45
build: add jsonschema library to core dependencies (#8753)
* add jsonschema to core dependencies

* release note
2025-01-21 10:07:56 +01:00
Nicola Procopio
542a7f7ef5
fix: update meta data before initializing new Document in DocumentSplitter (#8745)
* updated DocumentSplitter

issue #8741

* release note

* updated DocumentSplitter

in the _create_docs_from_splits function, initialize a new variable copied_meta instead of overwriting meta

* added test

test_duplicate_pages_get_different_doc_id

* fix fmt

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-01-20 09:51:47 +01:00
Stefano Fiorucci
242138c68b
chore: update ruff version in pre-commit hook (#8746) 2025-01-19 20:45:02 +01:00
Julian Risch
6feb3856bb
chore: Remove FixMe comment from __init__.py (#8749) 2025-01-19 17:28:37 +01:00
David S. Batista
5af2888e23
fix: PDFMinerToDocument convert function - adding double new lines between each container_text so that passages can be detected. (#8729)
* initial import

* adding double new lines between container_texts so that passages can be detected

* reducing type specification to avoid import error

* adding release notes

* renaming variable
2025-01-17 13:01:16 +00:00
Stefano Fiorucci
424bce2783
test: fix HF API flaky live test with tools (#8744)
* test: fix HF API flaky live test with tools

* rm print
2025-01-17 12:36:07 +00:00
David S. Batista
2c84266d8f
test: adding test for PyPDF to extract passages so that they are detected by DocumentSplitter (#8739) 2025-01-17 10:56:16 +01:00
Vladimir Blagojevic
21dd03d3e7
feat: Add completion start time timestamp to relevant generators (#8728)
* OpenAIChatGenerator - add completion_start_time

* HuggingFaceAPIChatGenerator - add completion_start_time

* Add tests

* Add reno note

* Relax condition for cached responses

* Add completion_start_time timestamping to non-chat generators

* Update haystack/components/generators/chat/hugging_face_api.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* PR feedback

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-01-17 09:58:45 +01:00
Stefano Fiorucci
62ac27c947
chore: remove deprecated function ChatRole and from_function class method in ChatMessage (#8725)
* rm deprecated function role and from_function class method in chatmessage

* release note
2025-01-15 18:55:22 +01:00
David S. Batista
26b80778f5
chore: removing NLTKDocumentSplitter (#8724)
* removing NLTKDocumentSplitter

* adding release notes

* removing pydocs reference
2025-01-15 16:11:51 +00:00
Stefano Fiorucci
167ede1f4c
remove deprecation warning from SentenceWindowRetriever (#8720) 2025-01-15 08:51:52 +00:00
David S. Batista
425ce9b98f
test: updating HuggingFaceAPIChatGenerator tests 2025-01-14 16:47:29 +01:00
David S. Batista
34bd31ef32
docs: fixing RecursiveSplitter pydoc markdown rendering 2025-01-14 11:27:31 +00:00
Haystack Bot
ed40d9f001
Update unstable version to 2.10.0-rc0 (#8713)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-01-13 15:23:33 +01:00
David S. Batista
ec8666545d
docs: adding RecursiveSplitter to pydoc v2.10.0-rc0 2025-01-13 11:46:34 +01:00
Vladimir Blagojevic
d147c7658f
feat: Add ComponentTool to Haystack tools (#8693)
* Initial ComponentTool
---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2025-01-13 11:15:33 +01:00
Julian Risch
642fa60cdf
fix: PDFMinerToDocument initializes documents with content and meta (#8708)
* fix: PDFMinerToDocument initializes documents with content and meta

* add release note

* Apply suggestions from code review

Co-authored-by: David S. Batista <dsbatista@gmail.com>

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-01-13 10:12:06 +00:00
Amna Mubashar
db76ae2847
feat: add default_headers for Azure embedders (#8699)
* Add default_headers param to azure embedders
2025-01-12 17:41:38 +01:00
David S. Batista
4f73b192f8
feat: add RecursiveSplitter component for Document preprocessing (#8605)
* initial import

* adding initial version + tests

* adding more tests

* more tests

* incorporating SentenceSplitter based on NLTK

* adding more tests

* adding release notes

* adding LICENSE header

* removing unused imports

* fixing example docstring

* adding docstrings

* fixing tests and returning a dictionary

* updating release notes

* attending PR comments

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip: updating tests for split_idx_start and _split_overlap

* adding tests for split_idx and split_start and overlaps

* adjusting file for LICENSE checking

* adding more tests

* adding tests for page numbering

* adding tests for min split lengths and falling back to character-level chunking based on size

* fixing linting issue

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip

* wip

* updating tests

* wip: fixing all tests after changes

* more tests

* wip: debugging sentence overlap

* wip: debugging page number

* wip

* wip; fixed bug with sentence tokenizer, needs to keep white spaces

* adding tests for counting pages on different split approaches

* NLTK checks done on SentenceSplitter

* fixing types

* adding detection for full overlap with previous chunks

* fixing types

* improving docstring

* improving docstring

* adding custom length, 'character' use case

* customising overlap function for word and adding a few tests

* updating docstring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip: adding more tests for word unit length

* fix

* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)

* draft

* del HF token in tests

* adaptations

* progress

* fix type

* import sorting

* more control on deserialization

* release note

* improvements

* support name field

* fix chatpromptbuilder test

* port Tool from experimental

* release note

* docs upd

* Update tool.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* fix: fix deserialization issues in multi-threading environments (#8651)

* adding 'word' as default length

* fixing types

* handling both default strategies

* wip

* \f was not being counted properly

* updating tests

* fixing the overlap bug

* adding more tests

* refactoring _apply_overlap

* further refactoring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* adding ticks to close code block

* fixing comments

* applying changes: split with space and force keep_white_spaces=True

* fixing some tests and replacing count words approach in more places

* keep_white_spaces = True only if not defined

* cleaning docs

* handling some more edge cases, when split is still too big and all separators ran

* fixing fallback whitespaces count to fixed word/char split based on split size

* cleaning

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>
2025-01-10 17:28:53 +01:00