373 Commits

Author SHA1 Message Date
Sebastian Husch Lee
35788a2d06
feat: Update csv cleaner (#8828)
* More refactoring

* Add more new options and more tests

* Improve docstrings

* Add release notes

* Fix pylint
2025-02-07 14:29:53 +01:00
Sebastian Husch Lee
1785ea622e
feat: Add component CSVDocumentCleaner for removing empty rows and columns (#8816)
* Initial commit for csv cleaner

* Add release notes

* Update lineterminator

* Update releasenotes/notes/csv-document-cleaner-8eca67e884684c56.yaml

Co-authored-by: David S. Batista <dsbatista@gmail.com>

* alphabetize

* Use lazy import

* Some refactoring

* Some refactoring

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-02-06 17:56:38 +01:00
Stefano Fiorucci
1f257944a6
chore: fix Hugging Face components for mypy 1.15.0 (#8822)
* chore: fix Hugging Face components for mypy 1.15.0

* small fixes

* fix test

* rm print

* use cast and be more permissive
2025-02-06 16:25:59 +00:00
Amna Mubashar
b0809b75f5
feat: Add a ListJoiner component (#8810)
* Add a ListJoiner

* Add tests and release notes
2025-02-05 23:19:14 +01:00
György Orosz
d2348ad462
feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode (#8806)
* feat: SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder can accept and pass any arguments to SentenceTransformer.encode

* refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder mae to be the last positional parameter for backward compatibility reasons

* docs: added explanation for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder

* test: added tests for encode_kwargs in SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder

* doc: removed empty lines from docstrings of SentenceTransformersTextEmbedder and SentenceTransformersDocumentEmbedder

* refactor: encode_kwargs parameter of SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder mae to be the last positional parameter for backward compatibility (part II.)
2025-02-05 16:09:35 +00:00
Stefano Fiorucci
2828d9e4ae
refactor!: DOCXToDocument converter - store DOCX metadata as a dict (#8804)
* DOCXToDocument - store DOCX metadata as a dict

* do not export DOCXMetadata to converters package
2025-02-05 14:43:19 +01:00
Stefano Fiorucci
5ae94886b2
fix: fix test failures with Transformers models in PRs from forks (#8809)
* trigger

* try pinning sentence transformers

* make integr tests run right away

* pin transformers instead

* older transformers version

* rm transformers pin

* try ignoring cache

* change ubuntu version

* try removing token

* try again

* more HF_API_TOKEN local deletions

* restore test priority

* rm leftover

* more deletions

* moreee

* more

* deletions

* restore jobs order
2025-02-04 19:08:37 +01:00
Sebastian Husch Lee
1ee86b5041
fix: Fix filters to handle date times with timezones (loading and comparison) (#8800)
* Fix on date time parsing with timezones. And comparing naive and aware date times.

* Add release note

* Add more filter tests
2025-02-04 14:51:06 +01:00
Stefano Fiorucci
877f826da0
refactor: HF API Embedders - use InferenceClient.feature_extraction instead of InferenceClient.post (#8794)
* HF API Embedders: refactoring

* rename variables

* rm leftovers

* rm pin

* rm unused import

* relnote

* warning with truncate/normalize and serverless inference API

* test that warnings are raised
2025-02-03 15:11:16 +00:00
Sebastian Husch Lee
bba84e5517
fix: Fix JSONConverter to properly skip files that are not utf-8 encoded (#8775)
* Small fix

* Add reno

* Trying out license header fix here
2025-01-28 10:29:55 +01:00
tstadel
bf79f04932
feat: support streaming_callback as run param for HF Chat generators (#8763)
* feat: support streaming_callback as run param for HF Chat generators

* add tests
2025-01-23 12:14:32 +01:00
Stefano Fiorucci
c3d0643511
feat: AzureOpenAIChatGenerator - support for tools (#8757)
* feat: AzureOpenAIChatGenerator - support for tools

* release note

* feedback
2025-01-23 09:24:04 +00:00
Stefano Fiorucci
f96839e139
chore: update transformers test dependency (#8752)
* update transformers test dependency

* add pad_token_id to the mock tokenizer

* fix HFLocal test + new test
2025-01-21 14:43:27 +01:00
Nicola Procopio
542a7f7ef5
fix: update meta data before initializing new Document in DocumentSplitter (#8745)
* updated DocumentSplitter

issue #8741

* release note

* updated DocumentSplitter

in _create_docs_from_splits function initialize a new variable copied_mete instead to overwrite meta

* added test

test_duplicate_pages_get_different_doc_id

* fix fmt

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-01-20 09:51:47 +01:00
David S. Batista
5af2888e23
fix: PDFMinerToDocument convert function - adding double new lines between each container_text so that passages can be detected. (#8729)
* initial import

* adding double new lines between container_texts so that passages can be detected

* reducing type specification to avoid import error

* adding release notes

* renaming variable
2025-01-17 13:01:16 +00:00
Stefano Fiorucci
424bce2783
test: fix HF API flaky live test with tools (#8744)
* test: fix HF API flaky live test with tools

* rm print
2025-01-17 12:36:07 +00:00
David S. Batista
2c84266d8f
test: adding test for PyPDF to extract passages so that they are detect by DocumentSplitter (#8739) 2025-01-17 10:56:16 +01:00
Vladimir Blagojevic
21dd03d3e7
feat: Add completion start time timestamp to relevant generators (#8728)
* OpenAIChatGenerator - add completion_start_time

* HuggingFaceAPIChatGenerator - add completion_start_time

* Add tests

* Add reno note

* Relax condition for cached responses

* Add completion_start_time timestamping to non-chat generators

* Update haystack/components/generators/chat/hugging_face_api.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* PR feedback

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2025-01-17 09:58:45 +01:00
David S. Batista
26b80778f5
chore: removing NLTKDocumentSplitter (#8724)
* removing NLTKDocumentSplitter

* adding release notes

* removing pydocs reference
2025-01-15 16:11:51 +00:00
David S. Batista
425ce9b98f
test: updating HuggingFaceAPIChatGenerator tests 2025-01-14 16:47:29 +01:00
Julian Risch
642fa60cdf
fix: PDFMinerToDocument initializes documents with content and meta (#8708)
* fix: PDFMinerToDocument initializes documents with content and meta

* add release note

* Apply suggestions from code review

Co-authored-by: David S. Batista <dsbatista@gmail.com>

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2025-01-13 10:12:06 +00:00
Amna Mubashar
db76ae2847
feat: add default_headers for Azure embedders (#8699)
* Add default_headers param to azure embedders
2025-01-12 17:41:38 +01:00
David S. Batista
4f73b192f8
feat: add RecursiveSplitter component for Document preprocessing (#8605)
* initial import

* adding initial version + tests

* adding more tests

* more tests

* incorporating SentenceSplitter based on NLTK

* adding more tests

* adding release notes

* adding LICENSE header

* removing unused imports

* fixing example docstring

* addding docstrings

* fixing tests and returning a dictionary

* updating release notes

* attending PR comments

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip: updating tests for split_idx_start and _split_overlap

* adding tests for split_idx and split_start and overlaps

* adjusting file for LICENSE checking

* adding more tests

* adding tests for page numbering

* adding tests for min split lenghts and falling back to character-level chunking based on size

* fixing linting issue

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip

* wip

* updating tests

* wip: fixing all tests after changes

* more tests

* wip: debugging sentence overlap

* wip: debugging page number

* wip

* wip; fixed bug with sentence tokenizer, needs to keep white spaces

* adding tests for counting pages on different split approaches

* NLTK checks done on SentenceSplitter

* fixing types

* adding detecting for full overlap with previous chunks

* fixing types

* improving docstring

* improving docstring

* adding custom lenght, 'character' use case

* customising overlap function for word and adding a few tests

* updating docstring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* wip: adding more tests for word unit length

* fix

* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)

* draft

* del HF token in tests

* adaptations

* progress

* fix type

* import sorting

* more control on deserialization

* release note

* improvements

* support name field

* fix chatpromptbuilder test

* port Tool from experimental

* release note

* docs upd

* Update tool.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* fix: fix deserialization issues in multi-threading environments (#8651)

* adding 'word' as default length

* fixing types

* handing both default strategies

* wip

* \f was not being counted properly

* updating tests

* fixing the overlap bug

* adding more tests

* refactoring _apply_overlap

* further refactoring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* adding ticks to close code block

* fixing comments

* applying changes: split with space and force keep_white_spaces=True

* fixing some tests and replacing count words approach in more places

* keep_white_spaces = True only if not defined

* cleaning docs

* handling some more edge cases, when split is still too big and all separators ran

* fixing fallback whitespaces count to fixed word/char split based on split size

* cleaning

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Tobias Wochinger <tobias.wochinger@deepset.ai>
2025-01-10 17:28:53 +01:00
Stefano Fiorucci
741ce5df50
fix: OpenAIChatGenerator - do not pass tools to the OpenAI client when none are provided (#8702)
* do not pass tools to OpenAI client if None

* release note

* fix release note
2025-01-10 14:46:41 +01:00
Julian Risch
dd9660f90d
fix: PyPDFToDocument initializes documents with content and meta (#8698)
* initialize document with content and meta

* update test

* add test checking that not only content is used for id generation
2025-01-09 19:12:10 +00:00
mathislucka
fe9b1e29d4
CI: fix format after newly introduced formatting rules from ruff release (#8696) 2025-01-09 16:25:55 +00:00
Stefano Fiorucci
3f15f38c51
refactor: move Tool to a separate package; refactor serde (#8690)
* move tool to separate package; refactor serde

* release note

* rm unused import
2025-01-09 12:30:13 +01:00
Sebastian Husch Lee
28ad78c73d
feat: Add XLSXToDocument converter (#8522)
* Add draft of the Excel To Document converter

* Add license header

* Add release note

* Use Union instead of pipe

* Add openpyxl as additional dep

* Fix zip issue

* few updates from Bijay

* Update deps

* Add markdown test

* Adding more example excels and expanding tests

* Added more tests

* Fix windows test by setting lineterminator

* Addressing PR comments

* PR comments

* Fix linting
2025-01-09 09:03:19 +01:00
Stefano Fiorucci
5539f6c33f
refactor: improve serialization/deserialization of callables (to handle class methods and static methods) (#8683)
* progress

* refinements

* tidy up

* release note
2025-01-08 11:28:00 +01:00
Stefano Fiorucci
188b2a7f06
feat: support for tools in OpenAIChatGenerator (#8666)
* move chatmsg>openai conversion to chatmsg dataclass

* implementation and tests cleanup

* release note

* try fixing azure chat generator

* add serde test for toolinvoker

* small fix
2024-12-20 14:20:54 +00:00
Stefano Fiorucci
7dcbf25bd7
feat: add Tool Invoker component (#8664)
* port toolinvoker

* release note
2024-12-20 14:02:42 +01:00
Michele Pangrazzi
c192488bf6
Named entity extractor private models (#8658)
* add 'token' support to NamedEntityExtractor to enable using private models on HF backend

* fix existing error message format

* add release note

* add HF_API_TOKEN to e2e workflow

* add informative comment

* Updated to_dict / from_dict to handle 'token' correctly ; Added tests

* Fix lint

* Revert unwanted change
2024-12-20 11:15:55 +01:00
Sebastian Husch Lee
286061f005
fix: Move potential nltk download to warm_up (#8646)
* Move potential nltk download to warm_up

* Update tests

* Add release notes

* Fix tests

* Uncomment

* Make mypy happy

* Add RuntimeError message

* Update release notes

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2024-12-20 10:41:44 +01:00
Stefano Fiorucci
f4d9c2bb91
fix: Make the HuggingFaceLocalChatGenerator compatible with the new ChatMessage; serialize chat_template (#8663)
* message conversion function

* hfapi w tools

* right test file + hf_hub version

* release note

* fix for new chatmessage; serialize chat_template

* feedback
2024-12-19 15:12:12 +01:00
Stefano Fiorucci
2bc58d2987
feat: support for tools in HuggingFaceAPIChatGenerator (#8661)
* message conversion function

* hfapi w tools

* right test file + hf_hub version

* release note

* feedback
2024-12-19 15:04:37 +01:00
David S. Batista
c306bee665
fix: adding missing abbreviations files for SentenceSplitter (#8660)
* adding missing abbreviations files for SentenceSplitter

* fixing tests path
2024-12-19 11:08:29 +01:00
Stefano Fiorucci
ea3602643a
feat!: new ChatMessage (#8640)
* draft

* del HF token in tests

* adaptations

* progress

* fix type

* import sorting

* more control on deserialization

* release note

* improvements

* support name field

* fix chatpromptbuilder test

* Update chat_message.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2024-12-17 17:02:04 +01:00
Stefano Fiorucci
f2b5f123b3
del HF token in tests (#8634) 2024-12-13 09:50:23 +01:00
David S. Batista
3f77d3ab6c
!feat: unify NLTKDocumentSplitter and DocumentSplitter (#8617)
* wip: initial import

* wip: refactoring

* wip: refactoring tests

* wip: refactoring tests

* making all NLTKSplitter related tests work

* refactoring

* docstrings

* refactoring and removing NLTKDocumentSplitter

* fixing tests for custom sentence tokenizer

* fixing tests for custom sentence tokenizer

* cleaning up

* adding release notes

* reverting some changes

* cleaning up tests

* fixing serialisation and adding tests

* cleaning up

* wip

* renaming and cleaning

* adding NLTK files

* updating docstring

* adding import to init

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* updating tests

* wip

* adding sentence/period change warning

* fixing LICENSE header

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-12-12 14:22:27 +00:00
Michele Pangrazzi
21d53d0ec6
update default value of 'store_full_path' to False in converters (#8619) 2024-12-10 16:03:38 +01:00
David S. Batista
248dccbdd3
chore: fixing pylint issues (#8610)
* initial import

* fixing internal methods

* fixing some internal methods

* modify _preprocess

* fixed internal methods

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2024-12-09 16:53:37 +00:00
Anton Pelykh
6f983a22ca
fix: add missing stream mime type assignment to the LinkContentFetcher (#8596)
* add missing stream mime type assignment to the `LinkContentFetcher`

* fix release note fmt

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
2024-12-09 14:51:14 +00:00
Michele Pangrazzi
b32f85cca2
remove deprecated 'converter' init parameter from PyPDFToDocument component (#8609) 2024-12-06 15:43:43 +01:00
David S. Batista
2282c26f17
feat!: SentenceWindowRetriever returns List[Document] with docs ordered by split_idx_start (#8590)
* initial import

* adding a few pylint disable

* adding tests

* fixing integration tests

* adding release notes

* fixing types and docstrings
2024-12-04 16:55:56 +01:00
Amna Mubashar
4c8eb54049
feat: Add store_full_path to converters (3/3) (#8585)
* Add store_full_path params
2024-12-03 13:48:56 +05:00
Stefano Fiorucci
c8685aa141
refactor: update components to access ChatMessage.text instead of content (#8589)
* introduce text property and deprecate content

* release note

* use chatmessage.text

* release note

* linting
2024-11-28 10:16:07 +00:00
Stefano Fiorucci
51c1390426
chore: use class methods to create ChatMessage (#8581)
* use class methods to build messages

* fix failing format
2024-11-28 09:35:24 +00:00
Stefano Fiorucci
fb42c035c5
feat: PyPDFToDocument - add new customization parameters (#8574)
* deprecat converter in pypdf

* fix linting of MetaFieldGroupingRanker

* linting

* pypdftodocument: add customization params

* fix mypy

* incorporate feedback
2024-11-26 16:37:59 +01:00
Vladimir Blagojevic
59f1e182db
feat: Add variable to specify inputs as optional to ConditionalRouter (#8568)
* Add optional_variables in ConditionalRouter

* Add reno note

* Add more unit test with various complex scenarios

* Add more unit tests

* Add pylint disable=too-many-positional-arguments

* PR feedback from @sjrl
2024-11-26 10:48:55 +01:00
Silvano Cerza
ab840351f8
Fix DocumentCleaner not preserving Document fields (#8578) 2024-11-25 13:08:59 +01:00