187 Commits

Author SHA1 Message Date
Silvano Cerza
76d5142bb8
Refactor: Document serialization and backward compatibility (#6180)
* Rework Document serialisation

* Make Document backward compatible

* Fix InMemoryDocumentStore filters

* Fix InMemoryDocumentStore.bm25_retrieval

* Add release notes

* Fix pylint failures

* Enhance Document kwargs handling and docstrings

* cosmetics

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 17:03:06 +01:00
Ayush Jain
655bf68b7a
fix: Add search_engine_kwargs param to WebRetriever to pass to WebSearch (#5805)
* Add search_engine_kwargs param to WebRetriever to pass to WebSearch

* add relnote

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 12:50:00 +01:00
Nripesh Niketan
708d33a657
feat: add apple silicon GPU acceleration (#6151)
* feat: add apple silicon GPU acceleration

* add release notes

* small fix

* Update utils.py

* Update utils.py

* ci fix mps

* Revert "ci fix mps"

This reverts commit 783ae503940d9ff8270a970a321549fb9e69dce7.

* mps fix

* Update experiment_tracking.py

* try removing upper watermark limit

* disable mps CI

* Use xl runner

* initialise env

* small fix

* black linting

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 11:26:46 +01:00
Vladimir Blagojevic
f76fc04ed0
feat: Add StreamingChunk dataclass to Haystack 2.x (#6174)
* Add StreamingChunk

* Add release note

* Use default value init for metadata, turn off hashing

* Add unit tests
2023-10-26 17:42:52 +02:00
Ashwin Mathur
5f35e7d04a
refactor: Migrate RemoteWhisperTranscriber to OpenAI SDK. (#6149)
* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Remove unnecessary imports

* Add release notes

* Fix api_key serialization

* Fix linting

* Apply suggestions from code review

Co-authored-by: ZanSara <sarazanzo94@gmail.com>

* Add additional tests for api_key

* Adapt .run() to take ByteStream inputs

* Update docstrings

* Rework implementation to use io.BytesIO

* Update error message

* Add default file name

---------

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-10-26 16:25:23 +02:00
Julian Risch
fe3bc15571
chore: Rename ExtractiveReader's input from document to documents to match its type List[Document] (#6164)
* rename input param, add doc string, add example

* reno
2023-10-24 21:44:15 +02:00
Stefano Fiorucci
1f4ed3cc03
refactor!: rename SimilarityRanker to TransformersSimilarityRanker (#6100)
* rename

* release note

* Update haystack/preview/components/rankers/transformers_similarity.py

Co-authored-by: Domenico <domenico.cinque98@gmail.com>

* Update haystack/preview/components/rankers/transformers_similarity.py

Co-authored-by: Domenico <domenico.cinque98@gmail.com>

* fix test

---------

Co-authored-by: Domenico <domenico.cinque98@gmail.com>
2023-10-24 19:45:16 +02:00
Grant Williams
1cf70d3dce
build: Upgrade transformers to the latest version 4.34.1 (#5994)
* Upgrade transformers to the latest version 4.34.0 so that Haystack can support the new Mistral, Nougat, and other models.

* update release notes

* updated missing lazy import

* Update .github workflows imports

* bump more versions in .github workflows

* revert import sorting

* Update  to catch runtime errors to match haystack_hub changes

* add language parameter value to whisper test

* bump transformers version in linting preview workflow

* bump transformers version in linting preview workflow

* bump version to v4.34.1

* resolve mypy issue with reused variables

* install openai-whisper without dependencies

* remove audio extra, update whisper install instructions

* remove audio extra, update whisper install instructions

* keep audio extra but add version

* keep audio extra with no constraints

* remove audio extra

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-24 19:13:12 +02:00
Vladimir Blagojevic
b9b7d7666d
feat: Add dynamic per-user ChatMessage templating support (#6161)
* Add dynamic per-user ChatMessage templating support

* Add unit tests for dynamic templating

* Update add-dynamic-per-message-templating-908468226c5e3d45.yaml

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Proper init ValueError raising, unit tests

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-10-24 16:50:45 +02:00
Massimiliano Pippi
dd24210908
feat: add pipeline Yaml marshaller (#6137)
* add marshaller

* release notes

* add docstrings and missing tests
2023-10-23 19:02:59 +02:00
Silvano Cerza
31fb5b84e7
feature: Add mime_type field to ByteStream (#6154)
* Add mime_type field to ByteStream

* Add release notes

* Update tests
2023-10-23 16:13:40 +02:00
Vladimir Blagojevic
dcc7e63dc9
feat: Add ChatMessage class to Haystack 2.0 (#6144)
* Add ChatMessage and ChatRole
2023-10-23 16:08:05 +02:00
Shaurya Agrawal
9d8979af41
feat: Refactor SentenceTransformersDocumentEmbedder.py (#6143)
* changed sentence_transformers

* added release note

* updated release notes

* Corrected release notes

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-23 14:02:35 +02:00
Silvano Cerza
ae812617fd
Remove Document.array field (#6139) 2023-10-23 13:01:15 +02:00
Stefano Fiorucci
047e79f256
refactor: better API keys handling in GPTGenerator (#6103)
* refactor: do not serialize API keys

* release note

* check if api key is set in the module client

* make tests more robust

* better tests
2023-10-23 12:53:52 +02:00
Ashwin Mathur
101bd816f8
refactor: Remove api_key from serialization of AzureOCRDocumentConverter and SerperDevWebSearch (#6150)
* Remove api_key from serialization of AzureOCRDocumentConverter

* Remove api_key from serialization of SerperDevWebSearch

* Add release notes

* Add init_fail_without_api_key test for SerperDevWebSearch

* Rename env var to AZURE_AI_API_KEY
2023-10-23 12:26:23 +02:00
Silvano Cerza
c8d162ced9
refactor: Change Document.embedding type to list of floats (#6135)
* Change Document.embedding type

* Add release notes

* Fix document_store testing

* Fix pylint

* Fix tests
2023-10-23 12:26:05 +02:00
Silvano Cerza
8f289282f1
refactor: Remove id_hash_keys field from Document (#6127)
* Remove id_hash_keys from Document

* Update release notes

* Remove unused import
2023-10-23 10:35:24 +02:00
Stefano Fiorucci
7e6c6becd6
fix release note (#6145) 2023-10-22 11:15:51 +02:00
Julian Risch
64649312bc
build: Upgrade to canals==0.9.0 (#6133)
* build: Upgrade to `canals==0.9.0`

* reno
2023-10-20 13:00:24 +02:00
Silvano Cerza
3f98bd9137
refactor: Rework Document.id generation (#6122)
* Rework Document id generation

* Fix tests

* Add release notes

* Fix failing integration test

* Remove score from Document id generation

* Enhance tests

* Update release notes

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-20 10:34:28 +02:00
Sunil Kumar Dash
957d1be68d
Enrich documents with embeddings for OpenAIDocumentEmbedder (#6126)
* Enrich documents with embeddings

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* add release note

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* try to fix typing

* change embedding field type in Document

---------

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-19 18:29:16 +02:00
Stefano Fiorucci
ef40c7c728
refactor: make sure that Document's id_hash_keys has a valid value (#6112)
* fix handling id_hash_keys

* reno

* handle empty id_hash_keys in post_init

* fix

* reno

* test
2023-10-19 12:10:19 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 (#6108)
* whisper transcriber

* remove from/to_dict from builders

* remove from/to_dict from embedders

* remove from/to_dict from fetcher, file_converters

* remove from/to_dict from generators, preprocessors

* remove from/to_dict from ranker, reader

* remove from/to_dict from router, sampler, websearch

* pylint

* reno

* refactor import

* remove unused import
2023-10-19 11:17:02 +02:00
Stefano Fiorucci
21d894d85a
refactor: adopt token instead of use_auth_token in HF components (#6040)
* move embedding backends

* use token in Sentence Transformers embeddings

* more compact token handling

* token parameter in reader

* add token to ranker

* release note

* add test for reader
2023-10-17 16:32:13 +02:00
Stefano Fiorucci
4e4af99a5e
refactor!: rename MemoryDocumentStore and related Retrievers (#6076)
* rename doc store and retrievers

* release note

* fix patch
2023-10-17 16:15:16 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list (#6081)
* Fix TextDocumentSplitter failing if run with empty list

* Release notes

* Simplify check

* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents (#6083)
* copy id_hash_keys in splitter and cleaner

* reno
2023-10-17 11:03:48 +02:00
Stefano Fiorucci
e963c8acdd
feat: HuggingFaceLocalGenerator - stopwords handling (#6049)
* first implementation

* release notes

* fixes

* tests

* better reno

* release note
2023-10-17 10:36:08 +02:00
Ivana Zeljkovic
2326f2f9fe
feat: Pinecone document store optimizations (#5902)
* Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type.

* Fix typo

* Add release note

* Fix mypy errors

* Remove unused import. Fix warning logging message.

* Update release note with description about limits for Starter index type in Pinecone

* Improve code base by:
- Adding new test cases for get_embedding_count method
- Fixing get_embedding_count method
- Improving delete documents
- Fix label retrieval
- Increase default batch size
- Improve get_document_count method

* Remove unused variable

* Fix mypy issues
2023-10-16 19:26:24 +02:00
Nicola Procopio
32e87d37c1
fixed join_docs.py concatenate (#5970)
* added hybrid search example

Added an example about hybrid search for faq pipeline on covid dataset

* formatted with black formatter

* renamed document

* fixed

* fixed typos

* added test

added test for hybrid search

* fixed whitespaces

* removed test for hybrid search

* fixed pylint

* commented logging

* fixed bug in join_docs.py _concatenate_results

* Update join_docs.py

updated comment

* format with black

* added releasenote on PR

* updated release notes

* updated test_join_documents

* updated test

* updated test

* Update test_join_documents.py

* formatted with black

* fixed test

* fixed

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-16 09:31:52 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 (#5976)
* remove whitespaces, substrings, regex, empty lines

* remove repeated substrings

* reno

* return empty string as shortest common ngram

* address first half of review feedback

* address second half of review feedback

* mention \f page separator for header/footer removal

* mention \f page separator for header/footer removal

* mark example usage as python code
2023-10-13 12:39:55 +02:00
Bilge Yücel
ad25041618
Remove old Cohere models and add aliases for existing ones (#6007)
* Remove old cohere models

* Add aliases for the existing models according to Cohere documentation

* Add release note

* put cohere embedding models in a constant

* update doc strings

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-13 12:08:26 +02:00
Stefano Fiorucci
fbd22bc1e9
feat: HuggingFaceLocalGenerator - first implementation (#6022)
* draft

* still a raw draft

* still a raw draft

* improvements

* minimal impl ok

* tests

* reno

* better language

* examples of generation_kwargs

* incorporate feedback

* lg and format updates

* don't save valid str tokens

* fix style

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-10-13 11:23:56 +02:00
Julian Risch
b507f1a124
feat: Add TextLanguageClassifier 2.0 (#6026)
* draft TextLanguageClassifier

* implement language detection with langdetect

* add unit test for logging message

* reno

* pylint

* change input from List[str] to str

* remove empty output connections

* add from_dict/to_dict tests

* mark example usage as python code
2023-10-13 10:30:49 +02:00
ZanSara
110aacdc35
feat: add basic telemetry to pipelines 2.0 (#5929)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* rather index by component type

* black

* mypy

* review feedback & small improvements

* defaultdict

* stray changes

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* invert condition

* linting
2023-10-13 09:31:51 +02:00
ZanSara
adf7e49af3
chore: review all extra (#6029) 2023-10-12 21:50:53 +02:00
Vladimir Blagojevic
6a50123b9f
feat: Adjust LinkContentFetcher run method, use ByteStream (#5972) 2023-10-10 17:48:31 +02:00
Nicola Procopio
c102b152dc
fix: Run update_embeddings in examples (#6008)
* added hybrid search example

Added an example about hybrid search for faq pipeline on covid dataset

* formatted with black formatter

* renamed document

* fixed

* fixed typos

* added test

added test for hybrid search

* fixed whitespaces

* removed test for hybrid search

* fixed pylint

* commented logging

* updated hybrid search example

* release notes

* Update hybrid_search_faq_pipeline.py-815df846dca7e872.yaml

* Update hybrid_search_faq_pipeline.py

* mention hybrid search example in release notes

* reduce installed dependencies in examples test workflow

* do not install cuda dependencies

* skip models if API key not set; delete document indices

* skip models if API key not set; delete document indices

* skip models if API key not set; delete document indices

* keep roberta-base model and inference extra

* pylint

* disable pylint no-logging-basicconfig rule

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-10 16:38:52 +02:00
Vladimir Blagojevic
98215aec0d
feat: Rename FileExtensionRouter to FileTypeRouter, handle ByteStream(s) (#5998)
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-10-10 09:14:04 +02:00
DanShatford
07048791aa
feat: allow list of file paths in convert_files_to_docs (#5961)
* feat: allow list of file paths in `convert_files_to_docs`

* Fix validation

* Fix check errors
2023-10-09 20:19:03 +02:00
David Berenstein
13fb7c5b5f
feat: added on_agent_final_answer-support to Agent callback_manager (#5736)
* chore: added on_agent_final_answer-support to Agent callback_manager

* chore: format black

* run pre-commit to format file

* updated release notes

* reverted sorted imports

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-09 18:03:47 +02:00
Vladimir Blagojevic
40b83d8a47
feat: Add TopPSampler Haystack 2.0 component (#5924) 2023-10-09 13:44:01 +02:00
Vladimir Blagojevic
1cdff6427e
feat: Add SimilarityRanker to Haystack 2.0 (#5923)
* Initial SimilarityRanker
2023-10-06 16:01:34 +02:00
Stefano Fiorucci
ccc9f010bb
fix: fix ChatGPT invocation layer (and add async support) (#5979)
* ChatGPT async

* release note

* fix tests
2023-10-05 18:43:26 +02:00
Tobias Wochinger
d5d3a9eef4
chore: adapt deepset cloud sdk endpoint format for saving pipelines (#5969)
* chore: adapt to new endpoints formats

* docs: add release notes
2023-10-05 08:56:28 +02:00
Massimiliano Pippi
c2ec3f5fde
feat: add File type to preview package (#5873)
* add Blob type

* review feedback

* fix tests and naming

* Update add-blob-type-2a9476a39841f54d.yaml

* removed unused import

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-04 17:23:12 +02:00
Stefano Fiorucci
cc70b4b613
deprecation (#5954) 2023-10-03 12:48:06 +02:00
Massimiliano Pippi
ac408134f4
feat: add support for async openai calls (#5946)
* add support for async openai calls

* add actual async call

* split the async api

* ask permission

* Update haystack/utils/openai_utils.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Fix OpenAI content moderation tests

* Fix ChatGPT invocation layer tests

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-10-03 10:42:21 +02:00
Lavesh Akhadkar
1ccf674d73
feat: DocumentWriter returns number of documents written (#5939)
* Make DocumentWriter return the number of documents it wrote

* Fixed return type
2023-10-03 10:02:33 +02:00