1204 Commits

Author SHA1 Message Date
Christian Clauss
30ca042370
ci: Use ruff in pre-commit to further limit code complexity (#5783)
* ci: Use ruff in pre-commit to further limit complexity

* Delete releasenotes/notes/ruff-4d2504d362035166.yaml

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-13 15:18:16 +02:00
ZanSara
5888fb7052
make MemoryBM25Retriever non batch (#5768) 2023-09-13 15:11:47 +02:00
Julian Risch
4ae0924ea0
feat!: Remove SklearnQueryClassifier (#5779)
* remove SklearnQueryClassifier

* reno
2023-09-13 12:55:33 +02:00
Stefano Fiorucci
283ecf2760
feat: add prefix and suffix to SentenceTransformersDocumentEmbedder (#5745)
* add prefix and suffix

* fix test
2023-09-13 12:55:06 +02:00
ZanSara
335a09bc1d
feat: make AnswerBuilder non batch (#5766)
* make answerbuilder non batch

* fix mypy

* review feedback

* mypy

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-09-13 12:01:16 +02:00
ZanSara
2c4d839b64
feat: GPT4Generator (#5744)
* add gpt4generator

* add e2e

* add tests

* reno

* fix e2e

* Update test/preview/components/generators/openai/test_gpt4_generator.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-13 10:07:09 +02:00
ZanSara
94c5d6d216
feat: make GPT35Generator non batch (#5764)
* make gpt35generator not batch

* fix tests

* review feedback

* mypy
2023-09-12 18:19:28 +02:00
ZanSara
6e70d403f8
feat: Improve Document for Haystack 2.0 (#5738)
* initial draft

* tests

* add proposal

* proposal number

* reno

* fix tests and usage of content and content_type

* update branch & fix more tests

* mypy

* add docstring

* fix more tests

* review feedback

* improve __str__

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Update haystack/preview/dataclasses/document.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* improve __str__

* fix tests

* fix more tests

* Update haystack/preview/document_stores/memory/document_store.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-11 17:40:00 +02:00
Stefano Fiorucci
2edf85f739
MemoryEmbeddingRetriever (2.0) (#5726)
* MemoryDocumentStore - Embedding retrieval draft

* add release notes

* fix mypy

* better comment

* improve return_embeddings handling

* MemoryEmbeddingRetriever - first draft

* address PR comments

* release note

* update docstrings

* update docstrings

* incorporated feedback

* add return_embedding to __init__

* rm leftover docstring

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-08 15:52:48 +02:00
Stefano Fiorucci
d860a5c604
make tests more robust (#5747) 2023-09-08 15:50:56 +02:00
Stefano Fiorucci
b7bea3ae9c
MemoryDocumentStore - Embedding retrieval (2.0) (#5715)
* MemoryDocumentStore - Embedding retrieval draft

* add release notes

* fix mypy

* better comment

* improve return_embeddings handling

* address PR comments

* update docstrings

* incorporated feedback

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-07 15:44:07 +02:00
bogdankostic
71852c7b06
Fix output of AnswerBuilder (#5737) 2023-09-07 12:54:24 +02:00
ZanSara
63cbde7287
feat: GPT35Generator (#5714)
* chatgpt backend

* fix tests

* reno

* remove print

* helpers tests

* add chatgpt generator

* use openai sdk

* remove backend

* tests are broken

* fix tests

* stray param

* move _check_troncated_answers into the class

* wrong import

* rename function

* typo in test

* add openai deps

* mypy

* improve system prompt docstring

* typos update

* Update haystack/preview/components/generators/openai/chatgpt.py

* pylint

* Update haystack/preview/components/generators/openai/chatgpt.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/generators/openai/chatgpt.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/generators/openai/chatgpt.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* review feedback

* fix tests

* review feedback

* reno

* remove tenacity mock

* gpt35generator

* fix naming

* remove stray references to chatgpt

* fix e2e

* Update releasenotes/notes/chatgpt-llm-generator-d043532654efe684.yaml

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* add another test

* test wrong model name

* review feedback

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-07 10:06:57 +02:00
Vladimir Blagojevic
c5edb45c10
feat: Add SerperDevWebSearch Haystack 2.0 component (#5712)
* Add SerperDev

* Add release note

* PR Feedback

* Simplify, remove one-liner

* Update haystack/preview/components/websearch/serper_dev.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* Update haystack/preview/components/websearch/serper_dev.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* Fix formatting

* PR feedback

* Fix tests

* Function rename

* Remove scoring, update tests

* PR feedback

* Fix return

* small adjustments

* fix tests

* add e2e test

* fix release notes

* fix tests

* fix e2e

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-06 17:31:42 +02:00
ZanSara
10d6886255
chore: move PromptBuilder in builders (#5729) 2023-09-06 11:52:21 +02:00
bogdankostic
639f7cf888
chore: Rename AnswersBuilder to AnswerBuilder (#5720)
* Add AnswersBuilder

* Add tests for AnswersBuilder

* Add release note

* PR feedback

* Fix mypy

* Remove redundant check for number of groups

* Rename AnswersBuilder to AnswerBuilder

* Update test/preview/components/builders/test_answer_builder.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Rename reno file

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-05 14:34:22 +02:00
Silvano Cerza
2acc41ea85
Add PromptBuilder (#5713)
* Add PromptBuilder

* Update release note

* Add test
2023-09-05 12:22:21 +02:00
bogdankostic
a5b815690e
feat: Add AnswersBuilder component (2.0) (#5701)
* Add AnswersBuilder

* Add tests for AnswersBuilder

* Add release note

* PR feedback

* Fix mypy

* Remove redundant check for number of groups

* docstrings upd

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-04 21:16:20 +02:00
bogdankostic
11440395f4
fix: Set model_max_length in the Tokenizer of DefaultPromptHandler (#5596)
* Set model_max_length in tokenizer in prompt handler

* Add release note
2023-09-01 11:48:41 +02:00
ZanSara
5f1256ac7e
feat: generators (2.0) (#5690)
* add generators module

* add tests for module helper

* reno

* add another test

* move into openai

* improve tests
2023-08-31 17:33:12 +02:00
Fanli Lin
40d9f34e68
feat: enable passing use_fast to the underlying transformers' pipeline (#5655)
* copy instead of deepcopy

* fix pylint

* add use_fast

* add release note

* remove irrelevant changes

* black fix

* fix bug

* black

* bug fix
2023-08-30 10:25:18 +02:00
ZanSara
b1daa7c647
chore: migrate to canals==0.7.0 (#5647)
* add default_to_dict and default_from_dict placeholders to ease migration to canals 0.7.0

* canals==0.7.0

* whisper components

* add to_dict/from_dict stubs

* import serialization methods in init to hide canals imports

* reno

* export deserializationerror too

* Update haystack/preview/__init__.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* serialization methods for LocalWhisperTranscriber (#5648)

* chore: serialization methods for `FileExtensionClassifier` (#5651)

* serialization methods for FileExtensionClassifier

* Update test_file_classifier.py

* chore: serialization methods for `SentenceTransformersDocumentEmbedder` (#5652)

* serialization methods for SentenceTransformersDocumentEmbedder

* fix device management

* serialization methods for SentenceTransformersTextEmbedder (#5653)

* serialization methods for TextFileToDocument (#5654)

* chore: serialization methods for `RemoteWhisperTranscriber` (#5650)

* serialization methods for RemoteWhisperTranscriber

* remove patches

* Add default to_dict and from_dict in document stores built with factory (#5674)

* fix tests (#5671)

* chore: simplify serialization methods for `MemoryDocumentStore` (#5667)

* simplify serialization for MemoryDocumentStore

* remove redundant tests

* pylint

* chore: serialization methods for `MemoryRetriever` (#5663)

* serialization method for MemoryRetriever

* more tests

* remove hash from default_document_store_to_dict

* remove diff in factory.py

* chore: serialization methods for `DocumentWriter` (#5661)

* serialization methods for DocumentWriter

* more tests

* use factory

* black

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-08-29 18:15:07 +02:00
bogdankostic
07c85905f3
fix: Change use_auth_token to token in TransformersQueryClassifier (#5659) 2023-08-29 15:21:25 +02:00
Vladimir Blagojevic
f13b37db24
fix: LinkContentFetcher - when no content retrieved (i.e. request blocked), default to snippet text (#5656)
* When no content retrieved (i.e. request blocked), default to snippet

* Add release note
2023-08-29 10:57:47 +02:00
Stefano Fiorucci
72fe4fc57b
feat: SentenceTransformersDocumentEmbedder (#5606)
* first draft

* incorporate feedback

* some unit tests

* release notes

* real release notes

* refactored to use a factory class

* allow forcing fresh instances

* first draft

* Update haystack/preview/embedding_backends/sentence_transformers_backend.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* simplify implementation and tests

* add embed_meta_fields implementation

* lg update

* improve meta data embedding; tests

* support non-string metadata

* make factory private

* change return type; improve tests

* warm_up not called in run

* fix typing

* rm unused import

* Remove base test class

* black

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-08-28 16:23:41 +02:00
Stefano Fiorucci
89c1813d9f
feat: SentenceTransformersTextEmbedder (#5600)
* first draft

* incorporate feedback

* some unit tests

* release notes

* real release notes

* first draft

* refactored to use a factory class

* adapt to new ST Embedding Backend implementation

* allow forcing fresh instances

* add tests

* release notes

* fix typo

* little improvements in tests

* Update haystack/preview/embedding_backends/sentence_transformers_backend.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* simplify implementation and tests

* lg update

* input check

* better error message

* make factory private

* change return type; improve tests

* warm_up not called in run

* warm_up not called in run

* rm unused import; default model

* fix typing

* rm unused import

* Remove BaseTestComponent

* black

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-08-28 16:23:26 +02:00
Stefano Fiorucci
35dfe47186
feat: SentenceTransformersEmbeddingBackend (v2) (#5572)
* first draft

* incorporate feedback

* some unit tests

* release notes

* real release notes

* refactored to use a factory class

* allow forcing fresh instances

* Update haystack/preview/embedding_backends/sentence_transformers_backend.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* simplify implementation and tests

* make factory private

* change return type; improve tests

* fix typing

* rm unused import

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-08-28 12:32:37 +02:00
Silvano Cerza
66f615a3a4
Remove BaseTestComponent (#5613)
* Remove BaseTestComponent

* Add release notes
2023-08-23 17:03:37 +02:00
Silvano Cerza
4ef813fc8a
Remove specialised Pipeline (#5584)
* Remove Pipeline

* Add release notes

* Enhance imports

* Update release note

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Remove Pipeline tests

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-08-18 17:48:13 +02:00
Silvano Cerza
72e0a588db
Rework DocumentWriter (#5583)
* Remove DocumentStoreAwareMixin from DocumentWriter

* Add release notes
2023-08-18 17:03:17 +02:00
Silvano Cerza
4bc68cbc2f
Rework MemoryRetriever (#5582)
* Remove DocumentStoreAwareMixin from MemoryRetriever

* Add release notes

* Update an article

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-08-18 16:33:35 +02:00
Massimiliano Pippi
7e633c6b0c
chore: change import paths under preview (#5592)
* fix import paths

* add release notes
2023-08-18 12:53:25 +02:00
Massimiliano Pippi
39a1f61326
chore: improve error message in FileExtensionClassifier (#5590)
* output an actionable error

* add release note

* fix matching in raised error

* fix release note category
2023-08-18 12:28:55 +02:00
bogdankostic
ee2745bad8
ci: Add Github workflow to automate benchmark runs (#5399)
* Add config files

* log benchmarks to stdout

* Add top-k and batch size to configs

* Add batch size to configs

* fix: don't download files if they already exist

* Add batch size to configs

* refine script

* Remove configs using 1m docs

* update run script

* update run script

* update run script

* datadog integration

* remove out folder

* gitignore benchmarks output

* test: send benchmarks to datadog

* remove uncommented lines in script

* feat: take branch/tag argument for benchmark setup script

* fix: run.sh should ignore errors

* Add GH workflow to run benchmarks periodically

* Remove unused script

* Adapt cml.yml

* Adapt cml.yml

* Rename cml.yml to benchmarks.yml

* Revert "Rename cml.yml to benchmarks.yml"

This reverts commit 897299433a71a55827124728adff5de918d46d21.

* remove benchmarks.yml

* Use same file extension for all config files

* Use checkout@v3

* Run benchmarks sequentially

* Add timeout-minutes parameter

* Remove changes unrelated to datadog

* Apply black

* use haystack-oss aws account

* Update test/benchmarks/utils.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* PR feedback

* fix aws credentials step

* Fix path

* check docker

* Allow spinning up containers from within container

* Allow spinning up containers from within container

* Separate launching doc stores from benchmarks

* Remove docker related commands

* run only retrievers

* change port

* Revert "change port"

This reverts commit 6e5bcebb1d16e03ba7672be7e8a089084c7fc3a7.

* Run opensearch benchmark only

* Run weaviate benchmark only

* Run bm25 benchmarks only

* Changes host of doc stores

* add step to get docker logs

* Revert "add step to get docker logs"

This reverts commit c10e6faa76bde5df406a027203bd775d18c93c90.

* Install docker

* Launch doc store containers from within runner container

* Remove kill command

* Change host

* dump docker logs

* change port

* Add cloud startup script

* dump docker logs

* add network param

* add network to startup.sh

* check cluster health

* move steps

* change port

* try using services

* check cluster health

* use services

* run only weaviate

* change host

* Upload benchmark results as artifacts

* Update configs

* Delete index after benchmark run

* Use correct index name

* Run only failing config

* Use smaller batch size

* Increase memory for opensearch

* Reduce batch size further

* Provide more storage

* Reduce batch size

* dump docker logs

* add java opts

* Spin up only opensearch container

* Create separate job for each doc store

* Run benchmarks sequentially

* Set working directory

* Account for reader benchmarks not doing indexing

* Change key of reader metrics

* Apply PR feedback

* Remove whitespace

* Adapt workflow to changes in datadog scripts

* Adapt workflow to changes in datadog scripts

* Increase memory for opensearch

* Reduce batch size

* Add preprocessing_batch_size to Readers

* Remove unrelated change

* Move order

* Fix path

* Manually terminate EC2 instance

* Manually terminate EC2 instance

* Manually terminate EC2 instance

* Always terminate runner

* Always terminate runner

* Remove unnecessary terminate-runner job

* Add cron schedule

* Disable telemetry

* Rename cml.yml to benchmarks.yml

---------

Co-authored-by: rjanjua <rohan.janjua@gmail.com>
Co-authored-by: Paul Steppacher <p.steppacher91@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-08-17 12:56:45 +02:00
Vladimir Blagojevic
46c9139caf
refactor: Rework WebRetriever caching, adjust tests (#5566)
* Rework WebRetriever caching, adjust tests

* Add release note

* Better pydocs

* Minor improvements

* Update haystack/nodes/retriever/web.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-08-16 17:41:11 +02:00
Julian Risch
22c7601729
feat: Add DocumentWriter v2 (#5435)
* add draft of WriteToStore and basic test

* add DocumentWriter implementation

* draft unit and integration tests

* add release note

* mock Store in unit tests

* pylint

* Update haystack/preview/components/writers/document_writer.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Remove unnecessary test

* Rework DocumentWriter to support new Component I/O definition

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-08-16 13:48:33 +02:00
Massimiliano Pippi
d4c1a0508a
chore: remove haystack dependencies from preview (#5569)
* provides preview's own implementation of expit

* copy the requests utility over into preview

* remove unnecessary types conversions

* fix mocking paths
2023-08-16 12:45:28 +02:00
Vladimir Blagojevic
8652d00b54
feat: Add FileExtensionClassifier to previews (#5514)
* Add FileExtensionClassifier preview component

* Add release note

* PR feedback
2023-08-15 15:58:55 +02:00
bogdankostic
c26f1e9426
fix: Use correct type for points in datadog (benchmarks) (#5570) 2023-08-14 17:40:36 +02:00
Massimiliano Pippi
f9bd64ba9e
make code layout consistent (#5561) 2023-08-14 16:35:34 +02:00
Massimiliano Pippi
714b944dc2
chore: rename store to document_store for clarity (#5547)
* store -> document_store

* fix leftovers

* fix import name

* moar leftovers

* rebase on main, update MemoryDocumentStore to the new protocol

* Update haystack/preview/pipeline.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-08-12 08:44:36 +02:00
Silvano Cerza
a7416bcf89
Add to_dict and from_dict methods for Stores (#5541)
* Add to_dict and from_dict methods for Stores

* Add release notes

* Add tests with custom init parameters
2023-08-11 14:45:56 +02:00
Silvano Cerza
168b7c806c
Add _store_name field to StoreAwareMixin to ease serialisation (#5531) 2023-08-10 15:42:19 +02:00
Vladimir Blagojevic
a75b9dd4bb
feat: LinkContentFetcher - add content-type resolution, user agent switching, PDF handler (#5374)
* Add content type resolution, pdf handler, user agent switching

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-08-09 18:14:04 +02:00
ZanSara
5ca4874df9
Migrate existing v2 components to Canals 0.4.0 (#5532)
* pin canals==0.4.0

* update audio components

* allow audio components to receive whisper_params in init too

* migrating memoryretriever

* migrate memoryretriever

* migrate TextFileToDocument

* fix TextFileToDocument tests

* fix pipeline tests

* fix defaults management

* reno

* inverted assignments

* Simplify release notes

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-08-09 15:51:32 +02:00
Silvano Cerza
83fce1bd72
Add Store class factory (#5530)
* Add Store class factory

* Add release notes
2023-08-09 13:09:36 +02:00
Vladimir Blagojevic
227bf6ca39
feat: Remove template variables from PromptNode invocation kwargs (#5526)
* Remove template params from kwargs before passing kwargs to invocation layer

* More unit tests

* Add release note

* Enable simple prompt node pipeline integration test use case
2023-08-08 16:40:23 +02:00
Vladimir Blagojevic
84ed954c8c
feat: Improve performance and add default media support in FileTypeClassifier (#5083)
* feat: add media outgoing edge to FileTypeClassifier

* Add release note

* Update language

---------

Co-authored-by: Daniel Bichuetti <daniel.bichuetti@gmail.com>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-08-08 15:51:07 +02:00
tstadel
d46c84bb61
feat: support dynamic filters in custom_query (#5427)
* support filters in custom_query

* better tests

* Update docstrings

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-08-08 15:48:15 +02:00
Stefano Fiorucci
3f472995bb
refactor: update Crawler to support selenium>=4.11.0 and simplify it (#5515)
* refactor crawler

* rm unused imports

* release notes!

* rm outdated mock
2023-08-08 15:13:22 +02:00