297 Commits

Author SHA1 Message Date
Vladimir Blagojevic
5497ca2a45
feat: Adapt GPTGenerator to use str input/output format in Haystack 2.x (#6214)
* Adapt GPTGenerator to string input/output

* Finishing touches

* punctuation upd

* PR feedback

* Small naming fixes

* Update haystack/preview/components/generators/openai.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Update class pydoc with a printed response

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-07 18:00:43 +01:00
Stefano Fiorucci
fb96aef4dd
refactor!: move classifiers to an appropriate directory/package (#6240)
* mv classifiers

* release note
2023-11-06 12:00:01 +01:00
Vladimir Blagojevic
d7e1833c40
feat: Add HuggingFaceTGIChatGenerator Haystack 2.x component (#6199)
* Add ChatHuggingFaceTGIGenerator

* Add release note

---------

Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-06 09:48:45 +01:00
Stefano Fiorucci
063d27c522
refactor!: rename TextDocumentSplitter to DocumentSplitter (#6223)
* rename TextDocumentSplitter to DocumentSplitter

* reno

* fix init
2023-11-03 11:33:20 +01:00
Vladimir Blagojevic
6e2dbdc320
feat: Add HuggingFaceTGIGenerator Haystack 2.x component (#6205)
* Add HuggingFaceTGIGenerator

* PR review

* PR feedback from Stefano

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-02 19:35:16 +01:00
Stefano Fiorucci
8511b8cd79
feat: HuggingFaceLocalGenerator- allow passing generation_kwargs in run method (#6220)
* allow custom generation_kwargs in run

* reno

* make pylint ignore too-many-public-methods
2023-11-02 15:29:38 +01:00
Vladimir Blagojevic
f2db68ef0b
fix: Add new rankers to nodes __init__.py (#6219)
* Add new rankers to nodes __init__.py

* Add release note
2023-11-02 10:56:52 +01:00
Ashwin Mathur
6bf0b9dc7c
feat: Add MarkdownToTextDocument (v2) (#6159)
* Add MarkdownToTextDocument

* Add release notes

* Update GitHub workflows

* Update GitHub workflows

* Refactor code with minimal dependencies

* Update docstrings

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update document with content and meta for backward compatibility

* Refactor Document Class for Backward Compatibility

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Update tests

* Improve test assertions

---------

Co-authored-by: Daria Fokina <daria.f93@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-31 18:28:13 +01:00
Julian Risch
29b1fefaa4
feat: Add DocumentLanguageClassifier 2.0 (#6037)
* add DocumentLanguageClassifier and tests

* reno

* fix import, rename DocumentCleaner

* mark example usage as python code

* add assertions to e2e test

* use deserialized document_store

* Apply suggestions from code review

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* remove from/to_dict

* use renamed InMemoryDocumentStore

* adapt to Document refactoring

* improve docstring

* fix test for new Document

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-10-31 15:35:05 +01:00
Silvano Cerza
7287657f0e
refactor: Rename Document's text field to content (#6181)
* Rework Document serialisation

Make Document backward compatible

Fix InMemoryDocumentStore filters

Fix InMemoryDocumentStore.bm25_retrieval

Add release notes

Fix pylint failures

Enhance Document kwargs handling and docstrings

Rename Document's text field to content

Fix e2e tests

Fix SimilarityRanker tests

Fix typo in release notes

Rename Document's metadata field to meta (#6183)

* fix bugs

* make linters happy

* fix

* more fix

* match regex

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-31 12:44:04 +01:00
Silvano Cerza
76d5142bb8
Refactor: Document serialization and backward compatibility (#6180)
* Rework Document serialisation

* Make Document backward compatible

* Fix InMemoryDocumentStore filters

* Fix InMemoryDocumentStore.bm25_retrieval

* Add release notes

* Fix pylint failures

* Enhance Document kwargs handling and docstrings

* cosmetics

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 17:03:06 +01:00
Ayush Jain
655bf68b7a
fix: Add search_engine_kwargs param to WebRetriever to pass to WebSearch (#5805)
* Add search_engine_kwargs param to WebRetriever to pass to WebSearch

* add relnote

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 12:50:00 +01:00
Nripesh Niketan
708d33a657
feat: add apple silicon GPU acceleration (#6151)
* feat: add apple silicon GPU acceleration

* add release notes

* small fix

* Update utils.py

* Update utils.py

* ci fix mps

* Revert "ci fix mps"

This reverts commit 783ae503940d9ff8270a970a321549fb9e69dce7.

* mps fix

* Update experiment_tracking.py

* try removing upper watermark limit

* disable mps CI

* Use xl runner

* initialise env

* small fix

* black linting

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 11:26:46 +01:00
Vladimir Blagojevic
f76fc04ed0
feat: Add StreamingChunk dataclass to Haystack 2.x (#6174)
* Add StreamingChunk

* Add release note

* Use default value init for metadata, turn off hashing

* Add unit tests
2023-10-26 17:42:52 +02:00
Ashwin Mathur
5f35e7d04a
refactor: Migrate RemoteWhisperTranscriber to OpenAI SDK. (#6149)
* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Remove unnecessary imports

* Add release notes

* Fix api_key serialization

* Fix linting

* Apply suggestions from code review

Co-authored-by: ZanSara <sarazanzo94@gmail.com>

* Add additional tests for api_key

* Adapt .run() to take ByteStream inputs

* Update docstrings

* Rework implementation to use io.BytesIO

* Update error message

* Add default file name

---------

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-10-26 16:25:23 +02:00
Julian Risch
fe3bc15571
chore: Rename ExtractiveReader's input from document to documents to match its type List[Document] (#6164)
* rename input param, add doc string, add example

* reno
2023-10-24 21:44:15 +02:00
Stefano Fiorucci
1f4ed3cc03
refactor!: rename SimilarityRanker to TransformersSimilarityRanker (#6100)
* rename

* release note

* Update haystack/preview/components/rankers/transformers_similarity.py

Co-authored-by: Domenico <domenico.cinque98@gmail.com>

* Update haystack/preview/components/rankers/transformers_similarity.py

Co-authored-by: Domenico <domenico.cinque98@gmail.com>

* fix test

---------

Co-authored-by: Domenico <domenico.cinque98@gmail.com>
2023-10-24 19:45:16 +02:00
Grant Williams
1cf70d3dce
build: Upgrade transformers to the latest version 4.34.1 (#5994)
* Upgrade transformers to the latest version 4.34.0 so that Haystack can support the new Mistral, Nougat, and other models.

* update release notes

* updated missing lazy import

* Update .github workflows imports

* bump more versions in .github workflows

* revert import sorting

* Update  to catch runtime errors to match haystack_hub changes

* add language parameter value to whisper test

* bump transformers version in linting preview workflow

* bump transformers version in linting preview workflow

* bump version to v4.34.1

* resolve mypy issue with reused variables

* install openai-whisper without dependencies

* remove audio extra, update whisper install instructions

* remove audio extra, update whisper install instructions

* keep audio extra but add version

* keep audio extra with no constraints

* remove audio extra

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-24 19:13:12 +02:00
Vladimir Blagojevic
b9b7d7666d
feat: Add dynamic per-user ChatMessage templating support (#6161)
* Add dynamic per-user ChatMessage templating support

* Add unit tests for dynamic templating

* Update add-dynamic-per-message-templating-908468226c5e3d45.yaml

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Proper init ValueError raising, unit tests

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-10-24 16:50:45 +02:00
Massimiliano Pippi
dd24210908
feat: add pipeline Yaml marshaller (#6137)
* add marshaller

* release notes

* add docstrings and missing tests
2023-10-23 19:02:59 +02:00
Silvano Cerza
31fb5b84e7
feature: Add mime_type field to ByteStream (#6154)
* Add mime_type field to ByteStream

* Add release notes

* Update tests
2023-10-23 16:13:40 +02:00
Vladimir Blagojevic
dcc7e63dc9
feat: Add ChatMessage class to Haystack 2.0 (#6144)
* Add ChatMessage and ChatRole
2023-10-23 16:08:05 +02:00
Shaurya Agrawal
9d8979af41
feat: Refactor SentenceTransformersDocumentEmbedder.py (#6143)
* changed sentence_transformers

* added release note

* updated release notes

* Corrected release notes

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-23 14:02:35 +02:00
Silvano Cerza
ae812617fd
Remove Document.array field (#6139)
2023-10-23 13:01:15 +02:00
Stefano Fiorucci
047e79f256
refactor: better API keys handling in GPTGenerator (#6103)
* refactor: do not serialize API keys

* release note

* check if api key is set in the module client

* make tests more robust

* better tests
2023-10-23 12:53:52 +02:00
Ashwin Mathur
101bd816f8
refactor: Remove api_key from serialization of AzureOCRDocumentConverter and SerperDevWebSearch (#6150)
* Remove api_key from serialization of AzureOCRDocumentConverter

* Remove api_key from serialization of SerperDevWebSearch

* Add release notes

* Add init_fail_without_api_key test for SerperDevWebSearch

* Rename env var to AZURE_AI_API_KEY
2023-10-23 12:26:23 +02:00
Silvano Cerza
c8d162ced9
refactor: Change Document.embedding type to list of floats (#6135)
* Change Document.embedding type

* Add release notes

* Fix document_store testing

* Fix pylint

* Fix tests
2023-10-23 12:26:05 +02:00
Silvano Cerza
8f289282f1
refactor: Remove id_hash_keys field from Document (#6127)
* Remove id_hash_fields from Document

* Update release notes

* Remove unused import
2023-10-23 10:35:24 +02:00
Stefano Fiorucci
7e6c6becd6
fix release note (#6145)
2023-10-22 11:15:51 +02:00
Julian Risch
64649312bc
build: Upgrade to canals==0.9.0 (#6133)
* build: Upgrade to `canals==0.9.0`

* reno
2023-10-20 13:00:24 +02:00
Silvano Cerza
3f98bd9137
refactor: Rework Document.id generation (#6122)
* Rework Document id generation

* Fix tests

* Add release notes

* Fix failing integration test

* Remove score from Document id generation

* Enhance tests

* Update release notes

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-20 10:34:28 +02:00
Sunil Kumar Dash
957d1be68d
Enrich documents with embeddings for OpenAIDocumentEmbedder (#6126)
* Enrich documents with embeddings

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* add release note

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* try to fix typing

* change embedding field type in Document

---------

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-19 18:29:16 +02:00
Stefano Fiorucci
ef40c7c728
refactor: make sure that Document's id_hash_keys has a valid value (#6112)
* fix handling id_hash_keys

* reno

* handle empty id_hash_keys in post_init

* fix

* reno

* test
2023-10-19 12:10:19 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 (#6108)
* whisper transcriber

* remove from/to_dict from builders

* remove from/to_dict from embedders

* remove from/to_dict from fetcher, file_converters

* remove from/to_dict from generators, preprocessors

* remove from/to_dict from ranker, reader

* remove from/to_dict from router, sampler, websearch

* pylint

* reno

* refactor import

* remove unused import
2023-10-19 11:17:02 +02:00
Stefano Fiorucci
21d894d85a
refactor: adopt token instead of use_auth_token in HF components (#6040)
* move embedding backends

* use token in Sentence Transformers embeddings

* more compact token handling

* token parameter in reader

* add token to ranker

* release note

* add test for reader
2023-10-17 16:32:13 +02:00
Stefano Fiorucci
4e4af99a5e
refactor!: rename MemoryDocumentStore and related Retrievers (#6076)
* rename doc store and retrievers

* release note

* fix patch
2023-10-17 16:15:16 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list (#6081)
* Fix TextDocumentSplitter failing if run with empty list

* Release notes

* Simplify check

* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents (#6083)
* copy id_hash_keys in splitter and cleaner

* reno
2023-10-17 11:03:48 +02:00
Stefano Fiorucci
e963c8acdd
feat: HuggingFaceLocalGenerator - stopwords handling (#6049)
* first implementation

* release notes

* fixes

* tests

* better reno

* release note
2023-10-17 10:36:08 +02:00
Ivana Zeljkovic
2326f2f9fe
feat: Pinecone document store optimizations (#5902)
* Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type.

* Fix typo

* Add release note

* Fix mypy errors

* Remove unused import. Fix warning logging message.

* Update release note with description about limits for Starter index type in Pinecone

* Improve code base by:
- Adding new test cases for get_embedding_count method
- Fixing get_embedding_count method
- Improving delete documents
- Fix label retrieval
- Increase default batch size
- Improve get_document_count method

* Remove unused variable

* Fix mypy issues
2023-10-16 19:26:24 +02:00
Nicola Procopio
32e87d37c1
fixed join_docs.py concatenate (#5970)
* added hybrid search example

Added an example about hybrid search for faq pipeline on covid dataset

* formatted with black formatter

* renamed document

* fixed

* fixed typos

* added test

added test for hybrid search

* fixed whitespaces

* removed test for hybrid search

* fixed pylint

* commented logging

* fixed bug in join_docs.py _concatenate_results

* Update join_docs.py

updated comment

* format with black

* added releasenote on PR

* updated release notes

* updated test_join_documents

* updated test

* updated test

* Update test_join_documents.py

* formatted with black

* fixed test

* fixed

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-16 09:31:52 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 (#5976)
* remove whitespaces, substrings, regex, empty lines

* remove repeated substrings

* reno

* return empty string as shortest common ngram

* address first half of review feedback

* address second half of review feedback

* mention \f page separator for header/footer removal

* mention \f page separator for header/footer removal

* mark example usage as python code
2023-10-13 12:39:55 +02:00
Bilge Yücel
ad25041618
Remove old Cohere models and add aliases for existing ones (#6007)
* Remove old cohere models

* Add aliases for the existing models according to Cohere documentation

* Add release note

* put cohere embedding models in a constant

* update doc strings

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-13 12:08:26 +02:00
Stefano Fiorucci
fbd22bc1e9
feat: HuggingFaceLocalGenerator - first implementation (#6022)
* draft

* still a raw draft

* still a raw draft

* improvements

* minimal impl ok

* tests

* reno

* better language

* examples of generation_kwargs

* incorporate feedback

* lg and format updates

* don't save valid str tokens

* fix style

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-10-13 11:23:56 +02:00
Julian Risch
b507f1a124
feat: Add TextLanguageClassifier 2.0 (#6026)
* draft TextLanguageClassifier

* implement language detection with langdetect

* add unit test for logging message

* reno

* pylint

* change input from List[str] to str

* remove empty output connections

* add from_dict/to_dict tests

* mark example usage as python code
2023-10-13 10:30:49 +02:00
ZanSara
110aacdc35
feat: add basic telemetry to pipelines 2.0 (#5929)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* rather index by component type

* black

* mypy

* review feedback & small improvements

* defaultdict

* stray changes

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* invert condition

* linting
2023-10-13 09:31:51 +02:00
ZanSara
adf7e49af3
chore: review all extra (#6029)
2023-10-12 21:50:53 +02:00
Vladimir Blagojevic
6a50123b9f
feat: Adjust LinkContentFetcher run method, use ByteStream (#5972)
2023-10-10 17:48:31 +02:00
Nicola Procopio
c102b152dc
fix: Run update_embeddings in examples (#6008)
* added hybrid search example

Added an example about hybrid search for faq pipeline on covid dataset

* formatted with black formatter

* renamed document

* fixed

* fixed typos

* added test

added test for hybrid search

* fixed whitespaces

* removed test for hybrid search

* fixed pylint

* commented logging

* updated hybrid search example

* release notes

* Update hybrid_search_faq_pipeline.py-815df846dca7e872.yaml

* Update hybrid_search_faq_pipeline.py

* mention hybrid search example in release notes

* reduce installed dependencies in examples test workflow

* do not install cuda dependencies

* skip models if API key not set; delete document indices

* skip models if API key not set; delete document indices

* skip models if API key not set; delete document indices

* keep roberta-base model and inference extra

* pylint

* disable pylint no-logging-basicconfig rule

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-10 16:38:52 +02:00
Vladimir Blagojevic
98215aec0d
feat: Rename FileExtensionRouter to FileTypeRouter, handle ByteStream(s) (#5998)
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-10-10 09:14:04 +02:00