218 Commits

Author SHA1 Message Date
David S. Batista
91f57015c0
feat : adding split_id and split_overlap to DocumentSplitter (#7933)
* wip: adding _split_overlapp

* fixing join issue for _split_overlap

* adding tests

* adding release notes

* cleaning and fixing tests

* making mypy happy

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* adding docstrings

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-06-27 15:07:43 +02:00
Vladimir Blagojevic
569b2a87cb
feat: Update LocalWhisperTranscriber, add tests (#7935)
* Update LocalWhisperTranscriber, add tests

* Final touches

* Update haystack/components/audio/whisper_local.py

Co-authored-by: David S. Batista <dsbatista@gmail.com>

* Fix prev commit

* Relax test for tiny model to work

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-06-27 12:53:41 +02:00
Vladimir Blagojevic
c2ed275a2d
feat: Improve LinkContentFetcher content type handling (#7920)
* LinkContentFetcher: add more default content type handlers

* Update/add unit test

* Add reno note

* Add image content handler

* Update unit test
2024-06-27 11:45:20 +02:00
Vladimir Blagojevic
535a281eec
feat: Add option to use HF_TOKEN as env var for authentication across all HF components (#7942)
* Read both HF_API_TOKEN and HF_TOKEN env vars in all HF related components

* Add reno note

* Test fixes

* More test updates

* More test updates
2024-06-27 10:31:58 +02:00
Sebastian Husch Lee
6836079686
chore: Capitalize DOCX in DOCXToDocument converter (#7931)
* Capitalize DOCX in DOCXToDocument converter

* Update docstrings

* Update test class name

* add releease notes
2024-06-27 08:19:01 +02:00
Amna Mubashar
866e6c8fc2
Add the missing parameter for serialization (#7929)
* Add the missing parameter for serialization

* Updated test

---------

Co-authored-by: Amna Mubashar <amna.mubashar@Amnas-MBP.fritz.box>
2024-06-26 11:07:00 +02:00
David S. Batista
8b9eddcd94
fix: explicitly tell ContextRelevanceEvaluator that each statement should be scored (#7904)
* initial import

* adding release notes

* adding pytest decorator for live test

* make examples more readable

* updating tests

* reverting progress_bar = False
2024-06-25 16:59:37 +02:00
Amna Mubashar
fc011d7b04
bug: fix MRR and MAP calculations (#7841)
* bug: fix MRR and MAP calculations
2024-06-25 12:07:11 +02:00
Stefano Fiorucci
c51f8ffb86
PyPDFToDocument: remove deprecated converter_name and CONVERTERS_REGISTRY (#7910) 2024-06-21 16:52:03 +02:00
Ulises M
9c45203a76
fix: check for None in SAS eval input (#7909)
* check for None in SAS input

* Update releasenotes/notes/check-for-None-SAS-eval-0b982ccc1491ee83.yaml

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-06-21 14:22:33 +02:00
Vedant Naik
f5a34d4d5c
fix: CacheChecker filters syntax (#7898)
* fix: cache_checker filters

* add release notes

* test: add test for cache checker filters syntax
2024-06-21 12:31:29 +02:00
Stefano Fiorucci
c59ad95f42
chore: remove deprecated TGI generators (#7908)
* remove deprecated TGI generators

* rm unused import
2024-06-21 11:15:13 +02:00
Corentin
d158be40b3
fix(JsonSchemaValidator): fix recursive loop and general LLM (claude, mistral...) compatibility (#7556)
* Feat: Fix recursive conversion in JsonSchemaValidator (autofix generated by ClaudeOpus). Modify the behaviour to build the error template in a single user_message instead of two separate. Modify the behaviour to only include latest message instead of full history (very costly if long looping pipeline)

* Feat: Fix recursive conversion in JsonSchemaValidator (autofix generated by ClaudeOpus). Modify the behaviour to build the error template in a single user_message instead of two separate. Modify the behaviour to only include latest message instead of full history (very costly if long looping pipeline)

* reno

* fix test

* Verify provided message contains JSON object to begin with

* Minor detail

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2024-06-21 11:10:34 +02:00
Stefano Fiorucci
75ad76a7ce
chore: remove deprecated TEI embedders (#7907)
* remove deprecated TEI embedders

* rm from the embedders init

* rm related tests
2024-06-21 10:36:12 +02:00
Carlos Fernández
57c1d47c7d
fix: ChatPromptBuilder Fails to JSON Serialize (#7849)
* implement serialization for chat messages and add tests

* implement serialization for ChatPromptBuilder and test it

* add reno

* solve mypy type error

* solve mypy type error

* remove flattening parameter in to_dict

* simplify to jus non-flat metadata

* try to fix linting issue

* solve format issues

* update test for ChatPromptBuilder

* remove unused import

---------

Co-authored-by: David S. Batista <dsbatista@gmail.com>
2024-06-20 13:20:52 +02:00
Sebastian Husch Lee
3db56d9066
refactor: DocxToDocument update (#7857)
* Some changes

Use tests file path

* Update tests

* Add another unit test

* Shorten _get_docx_metadata

* Update tests

* Remove try block

* Add a dataclass

* Add a to dict unit test

* Remove unused import

* Add release notes

* Update docstrings

* Use optional instead of pipe

* Update docstring

* Remove file
2024-06-19 15:48:31 +02:00
Madeesh Kannan
fe60eedee9
fix: Fix deserialization of pipelines that contain LLMEvaluator subclasses (#7891) 2024-06-19 13:47:38 +02:00
Massimiliano Pippi
3a03fce71c
ci: Add code formatting checks (#7882)
* ruff settings

enable ruff format and re-format outdated files

feat: `EvaluationRunResult` add parameter to specify columns to keep in the comparative `Dataframe`  (#7879)

* adding param to explictily state which cols to keep

* adding param to explictily state which cols to keep

* adding param to explictily state which cols to keep

* updating tests

* adding release notes

* Update haystack/evaluation/eval_run_result.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update releasenotes/notes/add-keep-columns-to-EvalRunResult-comparative-be3e15ce45de3e0b.yaml

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* updating docstring

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

add format-check

fail on format and linting failures

fix string formatting

reformat long lines

fix tests

fix typing

linter

pull from main

* reformat

* lint -> check

* lint -> check
2024-06-18 15:52:46 +00:00
Stefano Fiorucci
8de639bd70
DocxDocument forward reference (#7852) 2024-06-13 11:29:31 +02:00
Carlos Fernández
c1c339923f
feat: add DocxToDocument converter (#7838)
* first fucntioning DocxFileToDocument

* fix lazy import message

* add reno

* Add license headder

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* change DocxFileToDocument to DocxToDocument

* Update library install to the maintained version

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* clan try-exvept to only take non haystack errors into account

* Add wanring on docstring of component ignoring page brakes, mark test as skip

* make warnings lazy evaluations

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* make warnings lazy evaluations

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Make warnings lazy evaluated

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Solve f bug

* Get more metadata from docx files

* add 'python-docx' dependency and docs

* Change logging import

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Fix typo

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* remake metadata extraction for docx

* solve bug regarding _get_docx_metadata method

* Update haystack/components/converters/docx.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Update haystack/components/converters/docx.py

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>

* Delete unused test

---------

Co-authored-by: Sebastian Husch Lee <sjrl@users.noreply.github.com>
2024-06-12 11:58:36 +02:00
Rob Pasternak
28dd0f5596
feat: Add options for what to do with missing metadata fields in MetaFieldRanker (#7700)
* Add `missing_meta` param to `MetaFieldRanker`, plus checks for validation.

* Implement `missing_meta` functionality in `run()`.

* Finish first draft of revised `MetaFieldRanker` functionality.

* Add tests for `MetaFieldRanker` `missing_meta` functionality.

* Add `missing_meta` param to `MetaFieldRanker`, plus checks for validation.

* Implement `missing_meta` functionality in `run()`.

* Finish first draft of revised `MetaFieldRanker` functionality.

* Add tests for `MetaFieldRanker` `missing_meta` functionality.

* Add release notes for new `missing_meta` param of `MetaFieldRanker`

* Move part of docs_missing_meta_field warning string outside of `if...elif...else`.
2024-06-12 10:42:02 +02:00
Madeesh Kannan
63226dad34
fix: Fix LLMEvaluator serialization (#7818)
* fix: Fix `LLMEvaluator` serialization

* `reno`
2024-06-07 12:49:23 +02:00
Sebastian Husch Lee
2c2c7c9f56
feat: Add PPTXToDocument converter (#7808)
* Add first pass at PPTXToDocument converter

* Add test and update code

* Add doc string

* Update docstrings

* Add release notes

* remove unused imports, add to api docs, update pyproject.toml

* Add a new test

* Add dep so tests can run
2024-06-07 09:43:29 +00:00
Sebastian Husch Lee
d815c78198
feat: Add TransformersTextRouter component (#7801)
* First pass at adding TransformerTextRouter

* Fix tests

* Add release notes

* Add optional labels param

* Add verification in the warm_up

* Fix tests

* Add labels to to_dict

* Feedback from review

* Add component to docs

* Added extra tests
2024-06-05 15:28:53 +02:00
Vladimir Blagojevic
678f193f10
feat: Add filter_policy init parameter to in memory retrievers (#7795)
* Add filter_policy init parameter to in-memory retrievers
2024-06-04 17:51:16 +02:00
Silvano Cerza
854c4173f2
feat: Add memory sharing between different instances of InMemoryDocumentStore (#7781)
* Add memory sharing between different instances of InMemoryDocumentStore

* Fix FilterRetriever tests

* Fix InMemoryBM25Retriever tests
2024-05-31 16:44:14 +02:00
Massimiliano Pippi
8d80ff86d9
Add BranchJoiner and deprecate Multiplexer (#7765) 2024-05-30 15:34:52 +02:00
Massimiliano Pippi
0ceeb733ba
chore: make warm_up() usage consistent (#7752)
* make  usage consistent

* fix error type

* release notes

* pylint fix

* change of plan

* revert

* fix test

* revert

* fix HF tests

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* fix formatting

* reformat

* fix regex match with the new error message

* fix integration test

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-05-29 10:54:21 +02:00
Alessio Cesaretti
d0da31a047
feat: Add split_threshold to DocumentSplitter to avoid excessively short splits (#7721)
* feat: add split_threshold to document splitter to avoid excessively small splits

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* Update haystack/components/preprocessors/document_splitter.py

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>

* extend release note

---------

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2024-05-27 14:48:38 +02:00
tstadel
98fd270428
feat: add ChatPromptBuilder, deprecate DynamicChatPromptBuilder (#7663) 2024-05-23 19:04:55 +02:00
David S. Batista
38747ff7a3
fix: failsafe for non-valid json and failed LLM calls (#7723)
* wip

* initial import

* adding tests

* adding params

* adding safeguards for nan in evaluators

* adding docstrings

* fixing tests

* removing unused imports

* adding tests to context and faithfullness evaluators

* fixing docstrings

* nit

* removing unused imports

* adding release notes

* attending PR comments

* fixing tests

* fixing tests

* adding types

* removing unused imports

* Update haystack/components/evaluators/context_relevance.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update haystack/components/evaluators/faithfulness.py

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* attending PR comments

---------

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2024-05-23 15:41:29 +00:00
Massimiliano Pippi
e3dccf4406
add timeout to AzureOpenAIGenerator (#7724)
* add timeout to AzureOpenAIGenerator

* add to chat also

* Update azure-openai-generator-timeout-c39ecd6d4b0cdb4b.yaml
2024-05-23 16:28:24 +02:00
tstadel
83d3970405
feat: extend PromptBuilder and deprecate DynamicPromptBuilder (#7655)
* feat: add default template to DynamicPromptBuilder

* fix mypy

* fix mypy

* extend PromptBuilder and deprecate DynamicPromptBuilder

* make backward-compatible: optional -> required

* make backward-compatible: _template_string

* make backward-compatible: missing_required_vars error

* add test for no template case

* better docstrings

* some chors

* some chors

* add reno

* revert test_dynamic_prompt_builder.py

* better docstring

* make backward-compatible: reorder init args

* fix tests

* add raises docstring

* make default template required and rework docstrings

* docs chores

* keep to_dict in place for easier review

* remove unnecessary logger

* update docstring
2024-05-23 16:03:39 +02:00
Varun Krishnan
badb05b3ab
feat: allow DocumentJoiner to accept top_k parameter in run method (#7709)
* feat: allow DocumentJoiner to accept top_k parameter in run method

* Added release note for DocumentJoiner top_k fix
2024-05-23 16:03:26 +02:00
Massimiliano Pippi
482f60ec99
fix: exit early if the component receives no documents (#7732)
* exit early if the component receives no documents

* relnote
2024-05-23 09:35:10 +02:00
David S. Batista
a4fc2b66e6
style: adding progress bar to llm-based evaluators (#7726)
* adding progress bar

* fixing typo

* fixing tests

* Update test_llm_evaluator.py

* fixing missing colon

* passing directly to parent

* adding docstrings
2024-05-23 09:22:14 +02:00
Massimiliano Pippi
76224fc781
make SerperDevWebSearch more robust (#7725) 2024-05-22 13:14:39 +02:00
Stefano Fiorucci
7181f6b7e9
feat: change HTML conversion backend from boilerpy3 to Trafilatura (#7705)
* change HTML conversion backed to Trafilatura

* rm unused var
2024-05-17 10:38:47 +02:00
Carlos Fernández
57af95d7ea
add keep-id to DocumentCleaner (#7703) 2024-05-16 19:18:48 +02:00
Carlos Fernández
686a4999cf
feat: widen support of env vars in OpenAI components (#7653)
* add enviroment variables to the _enviroment.py file

* add support for two of the three variables

* Add support for 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES' on OpenAIDocument Ebedder.

* Replicate support for env vars in OpenAITextEmbedder.

* Add support for env vars in OpenAIGenerator..

* Add support for env vars in OpenAIChatGenerator.

* add docstrings and reno

* add params to __init__ in OpenAIDocumentEmbedder

* add params to __init__ in OpenAITextEmbedder

* make fully functional implementation of env vars and unit tests

* update reno

* Update haystack/components/embedders/openai_text_embedder.py

* reverse changes to telemetry/_enviroment.py

* Update haystack/components/embedders/openai_text_embedder.py

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2024-05-15 21:58:41 +00:00
David S. Batista
96b9d3e32a
fix: Adding missing component decorator to AzureOpenAIGenerator (#7698)
* initial import

* adding release notes

* tests avoiding I/O operations

* Update fix-azure-generators-serialization-18fcdc9cbcb3732e.yaml
2024-05-15 10:00:38 +02:00
David S. Batista
798dc4a4a5
fix: avoid FaithfulnessEvaluator and ContextRelevanceEvaluator return Nan (#7685)
* initial import

* fixing tests

* relaxing condition

* adding safeguard for ContextRelevanceEvaluator as well

* adding release notes
2024-05-14 17:08:51 +02:00
Vladimir Blagojevic
4352b1688e
fix: Fix NamedEntityExtractor serde (#7684)
* Fix NamedEntityExtractor serde

* Add release note

* Linting, remove unit markers
2024-05-14 12:24:55 +02:00
Sebastian Husch Lee
a2be90b95a
fix: Update device deserialization for components that use local models (#7686)
* fix: Update device deserializtion for SentenceTransformersTextEmbedder

* Add unit test

* Fix unit test

* Make same change to doc embedder

* Add release notes

* Add same change to Diversity Ranker and Named Entity Extractor

* Add unit test

* Add the same for whisper local

* Update release notes
2024-05-14 08:36:14 +02:00
Vladimir Blagojevic
811b93db91
feat: Set ByteStream's mime_type attribute for web based resources (#7681) 2024-05-13 19:44:02 +02:00
Massimiliano Pippi
10c675d534
chore: add license header to all modules (#7675)
* add license header to modules
* check license header at linting time
2024-05-09 13:40:36 +00:00
Stefano Fiorucci
7c9532b200
fix broken serialization of HFAPI components (#7661) 2024-05-08 17:14:37 +02:00
Stefano Fiorucci
94467149c1
fix: fix serialization of DocumentRecallEvaluator (#7662)
* fix serialization of DocumentRecallEvaluator

* add requested tests
2024-05-08 16:00:49 +02:00
Vladimir Blagojevic
5f813373eb
chore: Update huggingface_hub classes used after library upgrade (#7631)
* Update huggingface_hub classes used after library upgrade

* Fix chat tests

* Update lazy import guard and other references to huggingface_hub>=0.23.0

* In huggingface_hub 0.23.0 TextGenerationOutput property details is now optional

* More fixes

* Add reno note
2024-05-03 10:14:54 +02:00
Julian Risch
b0284977db
feat: Add document page number of ExtractedAnswer to meta (#7572)
* calculate page number of answer and add to meta

* fix mypy, add reno

* add test

* simplify unit test

* update release note

* undo @patch updates

* extend tests, check page_number type
2024-05-02 14:48:27 +02:00