89 Commits

Author SHA1 Message Date
Massimiliano Pippi
ac408134f4
feat: add support for async openai calls (#5946)
* add support for async openai calls

* add actual async call

* split the async api

* ask permission

* Update haystack/utils/openai_utils.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Fix OpenAI content moderation tests

* Fix ChatGPT invocation layer tests

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-10-03 10:42:21 +02:00
Lavesh Akhadkar
1ccf674d73
feat: DocumentWriter returns number of documents written (#5939)
* Make DocumentWriter return the number of documents it wrote

* Fixed return type
2023-10-03 10:02:33 +02:00
Massimiliano Pippi
0947f59545
feat: add async PromptNode run (#5890)
* add async promptnode

* Remove unnecessary calls to dict.keys()

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-29 08:40:01 +02:00
Vladimir Blagojevic
e882a7d5c8
feat: Add HTMLToDocument component (v2) (#5907) 2023-09-28 17:22:28 +02:00
Stefano Fiorucci
d4aacad5f9
feat: OpenAIDocumentEmbedder (#5822)
* first draft

* release note

* mypy fix

* fix test

* corrections

* pr feedback

* better secrets handling and new tests

* missing imports in embedders/__init__.py

* better format condition

* address feedback
2023-09-28 15:42:51 +02:00
Julian Risch
4413675e64
feat: Add TextDocumentSplitter that splits by word, sentence, passage (2.0) (#5870)
* draft split by word, sentence, passage

* naive way to split sentences without nltk

* reno

* add tests

* make input list of docs, review feedback

* add source_id and more validation

* update docstrings

* add split delimiters back to strings

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-27 12:26:20 +02:00
bogdankostic
80192589b1
feat: Add AzureOCRDocumentConverter (2.0) (#5855)
* Add AzureOCRDocumentConverter

* Add tests

* Add release note

* Formatting

* update docstrings

* Apply suggestions from code review

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* PR feedback

* PR feedback

* PR feedback

* Add secrets as environment variables

* Adapt test

* Add azure dependency to CI

* Add azure dependency to CI

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-26 15:57:55 +02:00
Silvano Cerza
cf7f0ebc22
Add Pipelines async run (#5864)
* Add Pipeline.arun()

* Sleeper node

* Fix async running

* Add e2e tests

To run a Pipeline that doesn't have any async node in async mode:

    pytest e2e/pipelines/test_standard_pipelines.py::test_query_and_indexing_pipeline

To run a Pipeline that has a single async node in concurrent mode:

    pytest e2e/pipelines/test_standard_pipelines.py::test_async_concurrent_complex_pipeline

To run a Pipeline that has a single async node in sequential mode:

    pytest e2e/pipelines/test_standard_pipelines.py::test_async_sequential_complex_pipeline

* Remove unused _adispatch_run method

* Make Pipeline.run work with async nodes

* Revert "Make Pipeline.run work with async nodes"

This reverts commit 22d7a94e4d41aca1b59dad18c0b366fbb6e8f431.

* Rename Pipeline.arun to Pipeline._arun

* Enhance docstring

* Add Sleeper docstring

* Add release notes

* ignore typing across the node

* make pylint happy

* skip pylint on needed unused import

* fix

* if a node has an arun method, use it
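The last point above (prefer a node's `arun` method when it has one) can be illustrated roughly as follows. This is a minimal sketch, not the Pipeline code from this commit: `SyncNode`, `AsyncNode` and `dispatch` are hypothetical names, and falling back to a thread executor for sync-only nodes is an assumption made for the example.

```python
import asyncio
import inspect


class SyncNode:
    """Hypothetical node that only offers a blocking run()."""

    def run(self, query):
        return f"sync:{query}"


class AsyncNode:
    """Hypothetical node that offers both run() and a native arun()."""

    def run(self, query):
        return f"sync:{query}"

    async def arun(self, query):
        await asyncio.sleep(0)  # stand-in for real async I/O
        return f"async:{query}"


async def dispatch(node, query):
    # Prefer the node's coroutine method when it exists.
    arun = getattr(node, "arun", None)
    if arun is not None and inspect.iscoroutinefunction(arun):
        return await arun(query)
    # Assumption: sync-only nodes run in the default executor
    # so they do not block the event loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, node.run, query)


async def main():
    return [await dispatch(SyncNode(), "q"), await dispatch(AsyncNode(), "q")]


results = asyncio.run(main())
```

With this dispatch, a pipeline made only of sync nodes still works in async mode, while nodes with `arun` get awaited directly.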

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-26 15:37:27 +02:00
ZanSara
6cb7d16e22
feat: preview extra (#5869)
* copy the deps list over from haystack-ai

* fix lazyimport usage

* keep jinja and openai

* fix ci

* reno

* separate out preview unit tests

* fix import error message for tika

* tika

* add preview to all

* wrap torch

* remove comment

* unwrap openai and jinja
2023-09-26 12:48:15 +02:00
bogdankostic
9a4373bf8e
feat: Add TikaDocumentConverter (2.0) (#5847)
* Add TikaFileToDocument component

* Add tests

* Add tika service to CI

* Add release note

* Change name

* PR feedback

* Fix naming in tests

* Fix tika version in CI

* Update tests

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-25 11:47:21 +02:00
Stefano Fiorucci
c0f22372d4
feat: OpenAITextEmbedder (#5801)
* first draft

* release notes

* avoid serializing secrets

* fix import order

* simplify serialization

* simplification

* monkeypatch delenv

* Update haystack/preview/components/embedders/openai_text_embedder.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* docstrings updates

* fix test

* Update haystack/preview/components/embedders/openai_text_embedder.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* rm comment

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-22 21:54:11 +02:00
Massimiliano Pippi
a5a0dc9f87
feat: optionally pass an id to the Document constructor (#5862)
* revert #5826

* do not use Optional
2023-09-22 11:09:59 +02:00
Silvano Cerza
cc4f95bf51
Remove unnecessary GPT4Generator class (#5863)
* Remove GPT4Generator class

* Rename GPT35Generator to GPTGenerator

* Fix tests

* Release notes
2023-09-22 11:05:06 +02:00
MichelBartels
f3dc9edd26
feat: initial ExtractiveReader implementation (#5553)
* initial ExtractiveReader implementation

* initial ExtractiveReader implementation

* fix mypy

* remove unused import

* Use AutoTokenizer

* rename reader to model

* combine no-answer logit

* support document slicing with proper probabilities

* add variable stride

* validate model

* fix typo

* make postprocessing easier to understand

* remove debug code

* set default reader

* add ExtractiveReader to __init__

* remove validation

* use new answer class

* add batching

* use v2 lazy imports

* move reader

* fix type hints

* add doc strings

* add nucleus sampling

* fix types

* fix doc string

* add no_answer parameter

* remove print statement

* fix gpu support

* turn into binary classification task

* change dataclass so document does not need to be provided for no answer

* add simple tests

* add unit tests

* rename reader folder to readers

* add integration tests

* fix type hints

* add release notes

* remove accidentally included test file

* remove unnecessary __init__ file

* revert __init__ file to main

* rename test script by adding test_ prefix

* undo accidental move of test script after renaming it

* remove use of bisect

* rename _flatten and _unflatten

* make variable name more intuitive

* remove type: ignore

* fix mypy issue

* refactor long tuple

* add doc strings

* explain HF test

* remove unnecessary top_k check

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 12:16:51 +02:00
Vladimir Blagojevic
92a6221927
feat: Add PyPDFToDocument component (2.0) (#5850)
* Initial PyPDFToDocument implementation

* Remove progress bar

* Add release note

* Minor fix

* import check and dependency

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-21 11:52:26 +02:00
bogdankostic
abe2706298
feat: Add MetadataRouter (2.0) (#5824)
* Move filter utilities

* Add MetadataRouter

* Add tests for MetadataRouter

* Add more tests

* Rename FileExtensionClassifier to FileExtensionRouter

* Add support for dates in filters

* Add tests

* Add release note

* Add release note

* Apply suggestions from code review

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 14:49:17 +02:00
ZanSara
454988672e
feat: UrlCacheChecker (#5841)
* add UrlCacheChecker

* rename

* add tests

* reno

* pylint

* review feedback
2023-09-20 14:45:50 +02:00
bogdankostic
719c1c040c
feat: Add support for dates in filters (2.0) (#5823)
* Add support for dates in filters

* Add tests

* Add release note

* Update haystack/preview/utils/filters.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-20 12:05:56 +02:00
Vladimir Blagojevic
0983fb656a
feat: Add LinkContentFetcher Haystack 2.0 component (#5724)
* Add LinkContentFetcher

* Add release note

* Small fixes

* Fix pydocs

* PR feedback

* Remove handlers registration

* PR feedback

* adjustments

* improve tests

* initial draft

* tests

* add proposal

* proposal number

* reno

* fix tests and usage of content and content_type

* update branch & fix more tests

* mypy

* use the new document

* add docstring

* fix more tests

* mypy

* fix tests

* add e2e

* review feedback

* improve __str__

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Update haystack/preview/dataclasses/document.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* improve __str__

* fix tests

* fix more tests

* fix test

* Fix end-of-file-fixer

* Post merge fixes

* Move e2e tests back into component

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-20 11:03:52 +02:00
Malte Pietsch
aa3cc3d5ae
feat: Add support for OpenAI's gpt-3.5-turbo-instruct model (#5837)
* support gpt-3.5-turbo-instruct

* add release note
2023-09-19 16:06:43 +02:00
Onur Eren Arpacı
8af0d816e6
bug: fix the date_fields request bottleneck (#5695)
* bug: fix the date_fields request bottleneck

I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours. 

After some investigation, I identified the bottleneck: the _get_date_properties function was being called for every document, so a request to the Weaviate client was sent and awaited once per document.

To address this, I optimized the code by invoking the _get_date_properties function only when there is a schema change. This modification resulted in a notable performance improvement, reducing the indexing time to approximately 90 minutes for the same 1 million vectors.
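
The idea described above (look up the date properties only when the schema changes, not once per document) can be sketched as follows. This is an illustrative sketch, not the actual WeaviateDocumentStore code: `DatePropertyCache`, `mark_schema_changed` and the `fetch_schema` callable are hypothetical names, and the dict-based schema stands in for a real client request.

```python
from typing import Callable, Dict, List


class DatePropertyCache:
    """Hypothetical helper that caches the date-typed property names of a schema."""

    def __init__(self, fetch_schema: Callable[[], Dict[str, str]]):
        self._fetch_schema = fetch_schema  # stand-in for the expensive remote call
        self._date_fields: List[str] = []
        self._stale = True
        self.fetch_count = 0  # counts how often the remote call was made

    def mark_schema_changed(self) -> None:
        # Call this whenever a property is added to the schema.
        self._stale = True

    def date_fields(self) -> List[str]:
        # Only hit the remote schema when it may have changed.
        if self._stale:
            schema = self._fetch_schema()
            self.fetch_count += 1
            self._date_fields = [name for name, dtype in schema.items() if dtype == "date"]
            self._stale = False
        return self._date_fields


schema = {"title": "text", "created_at": "date"}
cache = DatePropertyCache(lambda: dict(schema))

# Simulate indexing many documents: the schema is fetched once, not per document.
for _ in range(1000):
    cache.date_fields()
```

After the loop, `cache.fetch_count` is 1; it only grows when `mark_schema_changed()` is called before the next lookup.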

* bug: fix the date_fields request bottleneck

* fix: executed the pre commit hooks for #9341
2023-09-15 18:12:14 +02:00
Silvano Cerza
5c04cd6ba2
Fix Document constructor accepting unused id parameter (#5826) 2023-09-15 17:03:03 +02:00
Chivereanu Radu
cab21da87b
fix: Support for Azure 16k gpt 35 deployment (#5804)
* Support for Azure 16k gpt 35 deployment

* release note added

---------

Co-authored-by: user11999 <radugabrielchivereanu@gmail.com>
2023-09-14 18:01:22 +02:00
Ivana Zeljkovic
4bad202197
feat: Pinecone document store refactoring (#5725)
* Refactor codebase so that doc_type metadata is used instead of namespaces for making distinction between documents without embeddings, documents with embeddings and labels

* Fix parameter name in integration test

* Remove code under comment in add_type_metadata_filter method

* Fix mypy and pylint checks

* Add release note

* Apply minimal changes: rename method, update method docs and remove redundant method

* Mypy fixes

* Fix docstrings

* Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit

* Remove unnecessary attributes in PineconeDocumentStore

* Fix unit test

---------

Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io>
Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>
2023-09-14 11:46:47 +02:00
Darion
beb8853412
fix: return types of EntityExtractor to work with FAISSDocumentStore (#5750)
* Changed entity extractor score from type float32 to float64 and start/stop from int64 to int

* Added release notes
2023-09-14 10:49:54 +02:00
Stefano Fiorucci
28f42fbaab
move release note to the right directory (#5808) 2023-09-14 09:57:09 +02:00
Christian Clauss
6dd52d91b2
ci: Fix typos discovered by codespell (#5778)
* Fix typos discovered by codespell

* pylint: max-args = 38
2023-09-13 16:14:45 +02:00
Julian Risch
4ae0924ea0
feat!: Remove SklearnQueryClassifier (#5779)
* remove SklearnQueryClassifier

* reno
2023-09-13 12:55:33 +02:00
Stefano Fiorucci
283ecf2760
feat: add prefix and suffix to SentenceTransformersDocumentEmbedder (#5745)
* add prefix and suffix

* fix test
2023-09-13 12:55:06 +02:00
ZanSara
2c4d839b64
feat: GPT4Generator (#5744)
* add gpt4generator

* add e2e

* add tests

* reno

* fix e2e

* Update test/preview/components/generators/openai/test_gpt4_generator.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-09-13 10:07:09 +02:00
Christian Clauss
23f7308bec
ci: pre-commit autoupdate (#5777) 2023-09-12 14:34:41 +02:00
ZanSara
6e70d403f8
feat: Improve Document for Haystack 2.0 (#5738)
* initial draft

* tests

* add proposal

* proposal number

* reno

* fix tests and usage of content and content_type

* update branch & fix more tests

* mypy

* add docstring

* fix more tests

* review feedback

* improve __str__

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Update haystack/preview/dataclasses/document.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* improve __str__

* fix tests

* fix more tests

* Update haystack/preview/document_stores/memory/document_store.py

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-11 17:40:00 +02:00
Stefano Fiorucci
2edf85f739
MemoryEmbeddingRetriever (2.0) (#5726)
* MemoryDocumentStore - Embedding retrieval draft

* add release notes

* fix mypy

* better comment

* improve return_embeddings handling

* MemoryEmbeddingRetriever - first draft

* address PR comments

* release note

* update docstrings

* update docstrings

* incorporated feedback

* add return_embedding to __init__

* rm leftover docstring

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-08 15:52:48 +02:00
Stefano Fiorucci
b7bea3ae9c
MemoryDocumentStore - Embedding retrieval (2.0) (#5715)
* MemoryDocumentStore - Embedding retrieval draft

* add release notes

* fix mypy

* better comment

* improve return_embeddings handling

* address PR comments

* update docstrings

* incorporated feedback

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-07 15:44:07 +02:00
ZanSara
63cbde7287
feat: GPT35Generator (#5714)
* chatgpt backend

* fix tests

* reno

* remove print

* helpers tests

* add chatgpt generator

* use openai sdk

* remove backend

* tests are broken

* fix tests

* stray param

* move _check_troncated_answers into the class

* wrong import

* rename function

* typo in test

* add openai deps

* mypy

* improve system prompt docstring

* typos update

* Update haystack/preview/components/generators/openai/chatgpt.py

* pylint

* Update haystack/preview/components/generators/openai/chatgpt.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/generators/openai/chatgpt.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/generators/openai/chatgpt.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* review feedback

* fix tests

* review feedback

* reno

* remove tenacity mock

* gpt35generator

* fix naming

* remove stray references to chatgpt

* fix e2e

* Update releasenotes/notes/chatgpt-llm-generator-d043532654efe684.yaml

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* add another test

* test wrong model name

* review feedback

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-09-07 10:06:57 +02:00
Vladimir Blagojevic
c5edb45c10
feat: Add SerperDevWebSearch Haystack 2.0 component (#5712)
* Add SerperDev

* Add release note

* PR Feedback

* Simplify, remove one-liner

* Update haystack/preview/components/websearch/serper_dev.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* Update haystack/preview/components/websearch/serper_dev.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

* Fix formatting

* PR feedback

* Fix tests

* Function rename

* Remove scoring, update tests

* PR feedback

* Fix return

* small adjustments

* fix tests

* add e2e test

* fix release notes

* fix tests

* fix e2e

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-09-06 17:31:42 +02:00
bogdankostic
639f7cf888
chore: Rename AnswersBuilder to AnswerBuilder (#5720)
* Add AnswersBuilder

* Add tests for AnswersBuilder

* Add release note

* PR feedback

* Fix mypy

* Remove redundant check for number of groups

* Rename AnswersBuilder to AnswerBuilder

* Update test/preview/components/builders/test_answer_builder.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* Rename reno file

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-05 14:34:22 +02:00
Silvano Cerza
2acc41ea85
Add PromptBuilder (#5713)
* Add PromptBuilder

* Update release note

* Add test
2023-09-05 12:22:21 +02:00
bogdankostic
a5b815690e
feat: Add AnswersBuilder component (2.0) (#5701)
* Add AnswersBuilder

* Add tests for AnswersBuilder

* Add release note

* PR feedback

* Fix mypy

* Remove redundant check for number of groups

* docstrings upd

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-09-04 21:16:20 +02:00
bogdankostic
11440395f4
fix: Set model_max_length in the Tokenizer of DefaultPromptHandler (#5596)
* Set model_max_length in tokenizer in prompt handler

* Add release note
2023-09-01 11:48:41 +02:00
ZanSara
5f1256ac7e
feat: generators (2.0) (#5690)
* add generators module

* add tests for module helper

* reno

* add another test

* move into openai

* improve tests
2023-08-31 17:33:12 +02:00
Fanli Lin
40d9f34e68
feat: enable passing use_fast to the underlying transformers' pipeline (#5655)
* copy instead of deepcopy

* fix pylint

* add use_fast

* add release note

* remove irrelevant changes

* black fix

* fix bug

* black

* bug fix
2023-08-30 10:25:18 +02:00
ZanSara
b1daa7c647
chore: migrate to canals==0.7.0 (#5647)
* add default_to_dict and default_from_dict placeholders to ease migration to canals 0.7.0

* canals==0.7.0

* whisper components

* add to_dict/from_dict stubs

* import serialization methods in init to hide canals imports

* reno

* export deserializationerror too

* Update haystack/preview/__init__.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* serialization methods for LocalWhisperTranscriber (#5648)

* chore: serialization methods for `FileExtensionClassifier` (#5651)

* serialization methods for FileExtensionClassifier

* Update test_file_classifier.py

* chore: serialization methods for `SentenceTransformersDocumentEmbedder` (#5652)

* serialization methods for SentenceTransformersDocumentEmbedder

* fix device management

* serialization methods for SentenceTransformersTextEmbedder (#5653)

* serialization methods for TextFileToDocument (#5654)

* chore: serialization methods for `RemoteWhisperTranscriber` (#5650)

* serialization methods for RemoteWhisperTranscriber

* remove patches

* Add default to_dict and from_dict in document stores built with factory (#5674)

* fix tests (#5671)

* chore: simplify serialization methods for `MemoryDocumentStore` (#5667)

* simplify serialization for MemoryDocumentStore

* remove redundant tests

* pylint

* chore: serialization methods for `MemoryRetriever` (#5663)

* serialization method for MemoryRetriever

* more tests

* remove hash from default_document_store_to_dict

* remove diff in factory.py

* chore: serialization methods for `DocumentWriter` (#5661)

* serialization methods for DocumentWriter

* more tests

* use factory

* black

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-08-29 18:15:07 +02:00
Vladimir Blagojevic
e5e7bb9654
feat: Allow WebRetriever to use custom LinkContentFetcher (#5662)
* Allow use of custom LinkContentFetcher

* Add release note
2023-08-29 15:46:48 +02:00
Vladimir Blagojevic
1f7c7b716a
Update release note for #5526 (#5664) 2023-08-29 14:25:52 +02:00
Julian Risch
fa81c611e8
build: Upgrade transformers to v4.32.1 (#5658)
* upgrade transformers to 4.32.1

* added release notes

* upgrade transformers version also for inference extra
2023-08-29 13:46:00 +02:00
Vladimir Blagojevic
f13b37db24
fix: LinkContentFetcher - when no content retrieved (i.e. request blocked), default to snippet text (#5656)
* When no content retrieved (i.e. request blocked), default to snippet

* Add release note
2023-08-29 10:57:47 +02:00
Vladimir Blagojevic
2118f68769
feat: Add domain scoping to WebRetriever (#5587)
* WebSearch: add allowed_domains scoped search

* Add talk to website example

* Add release note

* Add allowed_domains to WebSearch

* Minor fix

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-08-28 20:02:02 +02:00
Stefano Fiorucci
72fe4fc57b
feat: SentenceTransformersDocumentEmbedder (#5606)
* first draft

* incorporate feedback

* some unit tests

* release notes

* real release notes

* refactored to use a factory class

* allow forcing fresh instances

* first draft

* Update haystack/preview/embedding_backends/sentence_transformers_backend.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* simplify implementation and tests

* add embed_meta_fields implementation

* lg update

* improve meta data embedding; tests

* support non-string metadata

* make factory private

* change return type; improve tests

* warm_up not called in run

* fix typing

* rm unused import

* Remove base test class

* black

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-08-28 16:23:41 +02:00
Stefano Fiorucci
89c1813d9f
feat: SentenceTransformersTextEmbedder (#5600)
* first draft

* incorporate feedback

* some unit tests

* release notes

* real release notes

* first draft

* refactored to use a factory class

* adapt to new ST Embedding Backend implementation

* allow forcing fresh instances

* add tests

* release notes

* fix typo

* little improvements in tests

* Update haystack/preview/embedding_backends/sentence_transformers_backend.py

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>

* simplify implementation and tests

* lg update

* input check

* better error message

* make factory private

* change return type; improve tests

* warm_up not called in run

* warm_up not called in run

* rm unused import; default model

* fix typing

* rm unused import

* Remove BaseTestComponent

* black

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-08-28 16:23:26 +02:00