140 Commits

Author SHA1 Message Date
Vladimir Blagojevic
cd429a73cd
feat: Add GPTChatGenerator to Haystack 2.x (#6212)
* Add GPTChatGenerator

* Apply lessons from previous PR

* PR review - Stefano
2023-11-09 10:45:41 +01:00
Silvano Cerza
bf884094d1
refactor: Change Document.blob type and remove mime_type field (#6249)
* Change Document.blob type and remove mime_type field

* Add release notes

* Remove mime_type from Document docstring
2023-11-08 10:35:17 +01:00
Vladimir Blagojevic
5497ca2a45
feat: Adapt GPTGenerator to use str input/output format in Haystack 2.x (#6214)
* Adapt GPTGenerator to string input/output

* Finishing touches

* punctuation upd

* PR feedback

* Small naming fixes

* Update haystack/preview/components/generators/openai.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Update class pydoc with a printed response

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-07 18:00:43 +01:00
Stefano Fiorucci
fb96aef4dd
refactor!: move classifiers to an appropriate directory/package (#6240)
* mv classifiers

* release note
2023-11-06 12:00:01 +01:00
Vladimir Blagojevic
d7e1833c40
feat: Add HuggingFaceTGIChatGenerator Haystack 2.x component (#6199)
* Add ChatHuggingFaceTGIGenerator

* Add release note
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-06 09:48:45 +01:00
Stefano Fiorucci
063d27c522
refactor!: rename TextDocumentSplitter to DocumentSplitter (#6223)
* rename TextDocumentSplitter to DocumentSplitter

* reno

* fix init
2023-11-03 11:33:20 +01:00
Vladimir Blagojevic
6e2dbdc320
feat: Add HuggingFaceTGIGenerator Haystack 2.x component (#6205)
* Add HuggingFaceTGIGenerator

* PR review

* PR feedback from Stefano

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-02 19:35:16 +01:00
Stefano Fiorucci
8511b8cd79
feat: HuggingFaceLocalGenerator- allow passing generation_kwargs in run method (#6220)
* allow custom generation_kwargs in run

* reno

* make pylint ignore too-many-public-methods
2023-11-02 15:29:38 +01:00
Ashwin Mathur
6bf0b9dc7c
feat: Add MarkdownToTextDocument (v2) (#6159)
* Add MarkdownToTextDocument

* Add release notes

* Update GitHub workflows

* Update GitHub workflows

* Refactor code with minimal dependencies

* Update docstrings

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update document with content and meta for backward compatibility

* Refactor Document Class for Backward Compatibility

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Update tests

* Improve test assertions

---------

Co-authored-by: Daria Fokina <daria.f93@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-31 18:28:13 +01:00
Julian Risch
29b1fefaa4
feat: Add DocumentLanguageClassifier 2.0 (#6037)
* add DocumentLanguageClassifier and tests

* reno

* fix import, rename DocumentCleaner

* mark example usage as python code

* add assertions to e2e test

* use deserialized document_store

* Apply suggestions from code review

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* remove from/to_dict

* use renamed InMemoryDocumentStore

* adapt to Document refactoring

* improve docstring

* fix test for new Document

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-10-31 15:35:05 +01:00
Silvano Cerza
7287657f0e
refactor: Rename Document's text field to content (#6181)
* Rework Document serialisation

Make Document backward compatible

Fix InMemoryDocumentStore filters

Fix InMemoryDocumentStore.bm25_retrieval

Add release notes

Fix pylint failures

Enhance Document kwargs handling and docstrings

Rename Document's text field to content

Fix e2e tests

Fix SimilarityRanker tests

Fix typo in release notes

Rename Document's metadata field to meta (#6183)

* fix bugs

* make linters happy

* fix

* more fix

* match regex

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-31 12:44:04 +01:00
Vladimir Blagojevic
c51aa1ee8d
feat: Add general and HF util methods (#6200)
* Add general and hf util methods
2023-10-31 11:13:11 +01:00
Silvano Cerza
76d5142bb8
Refactor: Document serialization and backward compatibility (#6180)
* Rework Document serialisation

* Make Document backward compatible

* Fix InMemoryDocumentStore filters

* Fix InMemoryDocumentStore.bm25_retrieval

* Add release notes

* Fix pylint failures

* Enhance Document kwargs handling and docstrings

* cosmetics

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 17:03:06 +01:00
Vladimir Blagojevic
f76fc04ed0
feat: Add StreamingChunk dataclass to Haystack 2.x (#6174)
* Add StreamingChunk

* Add release note

* Use default value init for metadata, turn off hashing

* Add unit tests
2023-10-26 17:42:52 +02:00
Vladimir Blagojevic
bb295d29ee
Fix failing test (#6176)
2023-10-26 17:22:24 +02:00
Ashwin Mathur
5f35e7d04a
refactor: Migrate RemoteWhisperTranscriber to OpenAI SDK. (#6149)
* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Remove unnecessary imports

* Add release notes

* Fix api_key serialization

* Fix linting

* Apply suggestions from code review

Co-authored-by: ZanSara <sarazanzo94@gmail.com>

* Add additional tests for api_key

* Adapt .run() to take ByteStream inputs

* Update docstrings

* Rework implementation to use io.BytesIO

* Update error message

* Add default file name

---------

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-10-26 16:25:23 +02:00
Stefano Fiorucci
1f4ed3cc03
refactor!: rename SimilarityRanker to TransformersSimilarityRanker (#6100)
* rename

* release note

* Update haystack/preview/components/rankers/transformers_similarity.py

Co-authored-by: Domenico <domenico.cinque98@gmail.com>

* Update haystack/preview/components/rankers/transformers_similarity.py

Co-authored-by: Domenico <domenico.cinque98@gmail.com>

* fix test

---------

Co-authored-by: Domenico <domenico.cinque98@gmail.com>
2023-10-24 19:45:16 +02:00
Grant Williams
1cf70d3dce
build: Upgrade transformers to the latest version 4.34.1 (#5994)
* Upgrade transformers to the latest version 4.34.0 so that Haystack can support the new Mistral, Nougat, and other models.

* update release notes

* updated missing lazy import

* Update .github workflows imports

* bump more versions in .github workflows

* revert import sorting

* Update  to catch runtime errors to match haystack_hub changes

* add language parameter value to whisper test

* bump transformers version in linting preview workflow

* bump transformers version in linting preview workflow

* bump version to v4.34.1

* resolve mypy issue with reused variables

* install openai-whisper without dependencies

* remove audio extra, update whisper install instructions

* remove audio extra, update whisper install instructions

* keep audio extra but add version

* keep audio extra with no constraints

* remove audio extra

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-24 19:13:12 +02:00
Vladimir Blagojevic
b9b7d7666d
feat: Add dynamic per-user ChatMessage templating support (#6161)
* Add dynamic per-user ChatMessage templating support

* Add unit tests for dynamic templating

* Update add-dynamic-per-message-templating-908468226c5e3d45.yaml

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Proper init ValueError raising, unit tests

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-10-24 16:50:45 +02:00
Massimiliano Pippi
dd24210908
feat: add pipeline Yaml marshaller (#6137)
* add marshaller

* release notes

* add docstrings and missing tests
2023-10-23 19:02:59 +02:00
Silvano Cerza
31fb5b84e7
feature: Add mime_type field to ByteStream (#6154)
* Add mime_type field to ByteStream

* Add release notes

* Update tests
2023-10-23 16:13:40 +02:00
Vladimir Blagojevic
dcc7e63dc9
feat: Add ChatMessage class to Haystack 2.0 (#6144)
* Add ChatMessage and ChatRole
2023-10-23 16:08:05 +02:00
Silvano Cerza
ae812617fd
Remove Document.array field (#6139)
2023-10-23 13:01:15 +02:00
Stefano Fiorucci
047e79f256
refactor: better API keys handling in GPTGenerator (#6103)
* refactor: do not serialize API keys

* release note

* check if api key is set in the module client

* make tests more robust

* better tests
2023-10-23 12:53:52 +02:00
Ashwin Mathur
101bd816f8
refactor: Remove api_key from serialization of AzureOCRDocumentConverter and SerperDevWebSearch (#6150)
* Remove api_key from serialization of AzureOCRDocumentConverter

* Remove api_key from serialization of SerperDevWebSearch

* Add release notes

* Add init_fail_without_api_key test for SerperDevWebSearch

* Rename env var to AZURE_AI_API_KEY
2023-10-23 12:26:23 +02:00
Silvano Cerza
c8d162ced9
refactor: Change Document.embedding type to list of floats (#6135)
* Change Document.embedding type

* Add release notes

* Fix document_store testing

* Fix pylint

* Fix tests
2023-10-23 12:26:05 +02:00
Silvano Cerza
8f289282f1
refactor: Remove id_hash_keys field from Document (#6127)
* Remove id_hash_fields from Document

* Update release notes

* Remove unused import
2023-10-23 10:35:24 +02:00
Silvano Cerza
2a45e7cc06
refactor: Remove id_hash_keys from all file_converters (#6125)
* Remove id_hash_keys from DocumentCleaner

* Remove id_hash_keys from TextDocumentSplitter

* Remove id_hash_keys from all file_converters

* Fix pylint failure

* Update docstrings
2023-10-20 16:22:14 +02:00
Silvano Cerza
3d69094f9a
refactor: Remove id_hash_keys from TextDocumentSplitter (#6124)
* Remove id_hash_keys from DocumentCleaner

* Remove id_hash_keys from TextDocumentSplitter
2023-10-20 15:18:28 +02:00
Silvano Cerza
ec376c7dbd
Remove id_hash_keys from DocumentCleaner (#6123)
2023-10-20 15:16:06 +02:00
Silvano Cerza
3f98bd9137
refactor: Rework Document.id generation (#6122)
* Rework Document id generation

* Fix tests

* Add release notes

* Fix failing integration test

* Remove score from Document id generation

* Enhance tests

* Update release notes

---------

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-10-20 10:34:28 +02:00
Stefano Fiorucci
ef40c7c728
refactor: make sure that Document's id_hash_keys has a valid value (#6112)
* fix handling id_hash_keys

* reno

* handle empty id_hash_keys in post_init

* fix

* reno

* test
2023-10-19 12:10:19 +02:00
Julian Risch
9f3b6512be
refactor: Remove reimplementations of default from_dict/to_dict and corresponding tests in 2.0 (#6108)
* whisper transcriber

* remove from/to_dict from builders

* remove from/to_dict from embedders

* remove from/to_dict from fetcher, file_converters

* remove from/to_dict from generators, preprocessors

* remove from/to_dict from ranker, reader

* remove from/to_dict from router, sampler, websearch

* pylint

* reno

* refactor import

* remove unused import
2023-10-19 11:17:02 +02:00
Stefano Fiorucci
21d894d85a
refactor: adopt token instead of use_auth_token in HF components (#6040)
* move embedding backends

* use token in Sentence Transformers embeddings

* more compact token handling

* token parameter in reader

* add token to ranker

* release note

* add test for reader
2023-10-17 16:32:13 +02:00
Stefano Fiorucci
4e4af99a5e
refactor!: rename MemoryDocumentStore and related Retrievers (#6076)
* rename doc store and retrievers

* release note

* fix patch
2023-10-17 16:15:16 +02:00
Silvano Cerza
ec9f898cd6
fix: Fix TextDocumentSplitter failing if run with empty list (#6081)
* Fix TextDocumentSplitter failing if run with empty list

* Release notes

* Simplify check

* Enhance test
2023-10-17 11:25:28 +02:00
Julian Risch
90ddeba579
fix: DocumentSplitter and DocumentCleaner copy id_hash_keys to newly created Documents (#6083)
* copy id_hash_keys in splitter and cleaner

* reno
2023-10-17 11:03:48 +02:00
Stefano Fiorucci
e963c8acdd
feat: HuggingFaceLocalGenerator - stopwords handling (#6049)
* first implementation

* release notes

* fixes

* tests

* better reno

* release note
2023-10-17 10:36:08 +02:00
ZanSara
660f84e6ef
feat: enable telemetry to pick up component data (#5957)
* add telemetry to pipelines 2.0

* only collect data if telemetry is on

* reno

* add downsampling

* typing

* manual tests

* pylint

* simplify code

* Update haystack/preview/telemetry/__init__.py

* look for _telemetry_data

* rather index by component type

* black

* mypy

* error handling

* comment

* review feedback & small improvements

* defaultdict

* stray changes

* try-catch

* method instead of attribute

* fixes

* remove print statements

* lint

* invert condition

* always send the first event of the day

* collect specs

* track 2nd and 3rd events too

* send first event and then max 1 event a minute

* rename constant

* black

* add test
2023-10-16 17:43:48 +02:00
Julian Risch
aaee03aee8
feat: Add DocumentCleaner 2.0 (#5976)
* remove whitespaces, substrings, regex, empty lines

* remove repeated substrings

* reno

* return empty string as shortest common ngram

* address first half of review feedback

* address second half of review feedback

* mention \f page separator for header/footer removal

* mention \f page separator for header/footer removal

* mark example usage as python code
2023-10-13 12:39:55 +02:00
Stefano Fiorucci
fbd22bc1e9
feat: HuggingFaceLocalGenerator - first implementation (#6022)
* draft

* still a raw draft

* still a raw draft

* improvements

* minimal impl ok

* tests

* reno

* better language

* examples of generation_kwargs

* incorporate feedback

* lg and format updates

* don't save valid str tokens

* fix style

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
2023-10-13 11:23:56 +02:00
Julian Risch
b507f1a124
feat: Add TextLanguageClassifier 2.0 (#6026)
* draft TextLanguageClassifier

* implement language detection with langdetect

* add unit test for logging message

* reno

* pylint

* change input from List[str] to str

* remove empty output connections

* add from_dict/to_dict tests

* mark example usage as python code
2023-10-13 10:30:49 +02:00
Stefano Fiorucci
2c2549f13d
move embedding backends (#6033)
2023-10-12 17:52:28 +02:00
Vladimir Blagojevic
d51be9edac
Add top_k to SimilarityRanker (#6036)
2023-10-12 13:52:01 +02:00
Vladimir Blagojevic
3803d23ff6
feat: Update PyPDFToDocument to process ByteStream inputs (#6021)
* Update PyPDF converter

* Add mixed source unit test

* Update haystack/preview/components/file_converters/pypdf.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:52:08 +02:00
Vladimir Blagojevic
1a6a8863e8
feat: Update HTMLToDocument to handle ByteStream inputs (#6020)
* Update HTML converter

* Add mixed source unit test

* Update haystack/preview/components/file_converters/html.py

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>

---------

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-10-11 10:15:58 +02:00
Vladimir Blagojevic
6a50123b9f
feat: Adjust LinkContentFetcher run method, use ByteStream (#5972)
2023-10-10 17:48:31 +02:00
Vladimir Blagojevic
98215aec0d
feat: Rename FileExtensionRouter to FileTypeRouter, handle ByteStream(s) (#5998)
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-10-10 09:14:04 +02:00
Vladimir Blagojevic
40b83d8a47
feat: Add TopPSampler Haystack 2.0 component (#5924)
2023-10-09 13:44:01 +02:00
Vladimir Blagojevic
1cdff6427e
feat: Add SimilarityRanker to Haystack 2.0 (#5923)
* Initial SimilarityRanker
2023-10-06 16:01:34 +02:00