1032 Commits

Author SHA1 Message Date
Ben Heckmann
a492771b4d
feat: PreProcessor split by token (tiktoken & Hugging Face) (#5276)
* #4983 implemented split by token for tiktoken tokenizer

* #4983 added unit test for tiktoken splitting

* #4983 implemented and added a test for splitting documents with HuggingFace tokenizer

* #4983 added support for passing HF model names (instead of objects) and added an example to the HF token splitting test

* mocked HTTP model loading in unit tests, fixed pylint error

* fix lossy tokenizers splitting, use LazyImport, ignore UnicodeEncodeError for tiktoken

* reno

* rename reno file

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-11-23 12:26:37 +01:00
Vladimir Blagojevic
e04a1f16bb
feat: Add DynamicPromptBuilder to Haystack 2.x (#6328)
* Add DynamicPromptBuilder

* Improve pydocs, add unit tests

* Add release note

* Make expected_runtime_variables optional

* Add pydocs usage example

* Add more pydocs

* Remove test markers

* Update type in unit test

* Update after canals upgrade

* add to api ref

* docstrings updates

* Update test/preview/components/builders/test_dynamic_prompt_builder.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/builders/dynamic_prompt_builder.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Deparametrize init test

* Rename expected_runtime_variables to runtime_variables

* Rephrase docstring so meaning is clearer

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-23 11:41:57 +01:00
Vladimir Blagojevic
e57a593d2e
fix: Revert back to straightforward PromptBuilder (#6335)
* Revert back to simple PromptBuilder

* Updating to full typing
2023-11-23 11:34:06 +01:00
Vladimir Blagojevic
cfff0d5212
Rename file_converters to converters (#6390) 2023-11-23 10:28:40 +01:00
Vladimir Blagojevic
b557f3035e
feat: Add ConditionalRouter Haystack 2.x component (#6147)
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-23 10:28:08 +01:00
Stefano Fiorucci
e91f7a8a4d
refactor!: improve the public interface of Generators (#6374)
* merge lazy import blocks

* refactor generators

* release note

* revert unrelated changes
2023-11-22 10:40:48 +01:00
ZanSara
b751978d65
Extends input types of RemoteWhisperTranscriber (#6218)
* fix tests

* reno

* tests

* retain file name

* paths are strings for openai sdk

* streams->sources

* feedback

* always add name to file

* mypy

* test placeholder with extension

* fallback

* paths

* path test

* path must be a string

* fix test
2023-11-22 09:57:45 +01:00
Ashwin Mathur
e6c8374562
feat: Add ByteStream metadata and other metadata to Documents created by HTMLToDocument (#6304)
* Refactor HTMLToDocument

* Add release notes

* Add additional tests

* remove progress bar

* Add additional test for metadata

* remove progress bar from release notes

* Update tests

* Use truthiness checks instead of is not None
2023-11-21 21:44:02 +01:00
Silvano Cerza
76165d024f
Fix corner cases and error handling with filters conversion (#6376) 2023-11-21 18:22:48 +01:00
Stefano Fiorucci
456902235a
feat: make DocumentWriter return the actual number of documents written (#6366)
* make DocumentWriter return the actual number of documents written

* add/improve tests
2023-11-21 15:54:25 +01:00
Daniel Fleischer
0cef17ac13
feat: embedding instructions for dense retrieval (#6372)
* Embedding instructions in EmbeddingRetriever

Queries and documents are prefixed with instructions before they are
embedded, which is useful for retrievers fine-tuned on specific tasks, such as Q&A.

* Tests

Checking each vector's 0th component against a reference value, using different stores.

* Normalizing vectors

* Release notes
2023-11-21 12:56:40 +01:00
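
Note on 0cef17ac13: a minimal sketch of the instruction-prefixing idea described in the commit message above, not the code from the PR. The helper name `add_instruction` and both instruction strings are hypothetical; the right instructions depend on how a given retriever model was fine-tuned. The sketch assumes the instruction is joined to the text with a single space before the text is passed to the embedding model.

```python
from typing import List


def add_instruction(texts: List[str], instruction: str = "") -> List[str]:
    """Prepend an instruction to every text before embedding (hypothetical helper)."""
    if not instruction:
        # No instruction configured: return the texts unchanged.
        return list(texts)
    return [f"{instruction} {text}" for text in texts]


# Hypothetical instruction strings, chosen only for illustration.
QUERY_INSTRUCTION = "Represent this question for retrieving supporting passages:"
PASSAGE_INSTRUCTION = "Represent this passage for retrieval:"

queries = add_instruction(["Who wrote Hamlet?"], QUERY_INSTRUCTION)
passages = add_instruction(
    ["Hamlet is a tragedy by William Shakespeare."], PASSAGE_INSTRUCTION
)

# The prefixed strings, rather than the raw texts, would then go to the embedding model.
print(queries[0])
print(passages[0])
```

Queries and passages get separate instructions here because models fine-tuned this way typically expect a different prefix on each side.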
Silvano Cerza
a7f742fdbd
refactor: Rename docstore fixture to document_store (#6360)
* Prevent pytest_generate_tests from polluting preview tests

* Rename docstore fixture to document_store
2023-11-20 17:41:48 +01:00
Silvano Cerza
83c245db74
feat: Implement function to convert legacy filters to new style (#6314)
* Implement function to convert legacy filters to new style

* Reduce return statements in conversion to fix linting

* Move convert function in different module

* Fix typos in docstrings

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

---------

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 13:00:05 +01:00
Agnieszka Marzec
497299c27a
Docs: Update Rankers docstrings and messages (#6296)
* Update docstrings and messages

* Fix tests

* Fix formatting

* Update haystack/preview/components/rankers/meta_field.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Fix tests

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-11-20 12:24:01 +01:00
Julian Risch
4ef2a680bb
feat: Add DocumentJoiner component 2.0 (#6105)
* draft DocumentJoiner

* implement merge and rrf

* draft end-to-end test with DocumentJoiner in hybrid doc search pipeline

* adjust for variadics Canals PR #122

* fix text_embedder input

* adapt to the new Document class

* adapt to new doc id

* specify documents input as Variadic in run method

* compare doc ids instead of full docs

* rename text_file_converter input to sources

* update docstring

* Update haystack/preview/components/routers/document_joiner.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from docstring review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* capitalize Documents and Retrievers in docstrings

* fix log message in test

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 10:56:56 +01:00
ZanSara
e905066458
feat: make InMemoryDocumentStore return the number of docs actually written (#6274)
* make InMemoryDocumentStore return the number of documents actually written

* add fixme

* reno

* add missing continue
2023-11-20 10:03:22 +01:00
ZanSara
e888852aec
Standardize TextFileToDocument (#6232)
* simplify textfiletodocument

* fix error handling and tests

* stray print

* reno

* streams->sources

* reno

* feedback

* test

* fix tests
2023-11-17 15:39:39 +01:00
Silvano Cerza
c26a932423
Change preview tests to run all tests except integration ones (#6325) 2023-11-17 15:33:43 +01:00
ZanSara
dfc1d452bb
feat: upgrade canals to 0.10.1 (#6309)
* upgrade canals

* reno

* trigger preview e2e

* bump canals

* fix decorator

* fix test

* test factory

* tests inmemory

* tests writer

* test audio

* tests builders

* tests caching

* tests embedders

* tests converters

* tests generators

* tests rankers

* tests retrievers

* fix pipeline and telemetry tests

* remove trigger
2023-11-17 14:46:23 +01:00
Silvano Cerza
6dda6e5b2d
Change Document.__eq__ to compare all fields (#6323) 2023-11-16 17:17:43 +01:00
Massimiliano Pippi
ff3165b8b8
fix: fix un-flattening of metadata (#6318)
* fix un-flattening of metadata

* test should pass

* add relnote

* change policy: raise an error if both meta and keys are passed

* Update document.py

* support python 3.8

* adjust wording in the error message
2023-11-16 17:10:53 +01:00
x110
c4cfe6cb90
fix: Load additional fields from SQUAD-format file to meta field for labels #5978 (#6301)
* Load additional fields from SQUAD-format file to meta field for labels

* added a test function

* rewritten test using pytest

* added release notes

* improve release note

* clean up test

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-16 10:44:51 +01:00
Vivek Silimkhan
f998bf4a4f
feat: add Amazon Bedrock support (#6226)
* Add Bedrock

* Update supported models for Bedrock

* Fix supports and add extract response in Bedrock

* fix errors imports

* improve and refactor supports

* fix install

* fix mypy

* fix pylint

* fix existing tests

* Added Anthropic Bedrock

* fix tests

* fix sagemaker tests

* add default prompt handler, constructor and supports tests

* more tests

* invoke refactoring

* refactor model_kwargs

* fix mypy

* lstrip responses

* Add streaming support

* bump boto3 version

* add class docstrings, better exception names

* fix layer name

* add tests for anthropic and cohere model adapters

* update cohere params

* update ai21 args and add tests

* support cohere command light model

* add titan tests

* better class names

* support meta llama 2 model

* fix streaming support

* more future-proof model adapter selection

* fix import

* fix mypy

* fix pylint for preview

* add tests for streaming

* add release notes

* Apply suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* fix format

* fix tests after msg changes

* fix streaming for cohere

---------

Co-authored-by: tstadel <60758086+tstadel@users.noreply.github.com>
Co-authored-by: tstadel <thomas.stadelmann@deepset.ai>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-15 13:26:29 +01:00
Julian Risch
08ec492039
refactor!: Remove routing from DocumentLanguageClassifier and rename TextLanguageClassifier (#6307)
* remove routing from DocumentLanguageClassifier

* fix MetadataRouter typo
2023-11-15 13:10:07 +01:00
Ashwin Mathur
4e4d5eb3e2
feat!: Remove unused query parameter from MetaFieldRanker (#6300)
* Remove unused query parameter from MetaFieldRanker

* Add release notes
2023-11-14 12:33:38 +01:00
Stefano Fiorucci
f708cf6056
refactor!: set scale_score default value to False (#6276)
* set default scale_score to False

* release note
2023-11-13 11:59:18 +01:00
Silvano Cerza
8e7ce208fc
Fix Document init when passing non existing fields (#6286)
* Fix Document init when passing non existing fields

* Update releasenotes/notes/fix-document-init-09c1cbb14202be7d.yaml

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* Fix linting

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-11-13 11:42:42 +01:00
Vladimir Blagojevic
b4d8d1c904
feat: Add custom conversion callable to PyPDFToDocument - Haystack 2.x (#6258)
* Allow user specified converter hook

* Add a release note

* More unit tests

* PR review - Massi, use protocol as converter
2023-11-09 17:35:33 +01:00
Agnieszka Marzec
1046bebbe0
Docs: Update docstrings lg (#6260)
* Update docstrings lg

* Update test_in_memory_bm25_retriever.py

* Update test_in_memory_embedding_retriever.py

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-11-09 17:34:52 +01:00
Stefano Fiorucci
f95937b0ce
chore: move HuggingFaceLocalGenerator to the generators directory (#6264)
* move HuggingFaceLocalGenerator to right directory

* fix tests
2023-11-09 15:59:23 +01:00
Stefano Fiorucci
2b3c77e41d
fix: make JoinDocuments correctly handle duplicate documents w null scores (#6261)
* fix error with null values

* release note

* simplify
2023-11-09 14:28:56 +01:00
Domenico
676da681d0
feat: MetaField Ranker (#6189)
* proposal: meta field ranker

* Apply suggestions from code review

Co-authored-by: ZanSara <sarazanzo94@gmail.com>

* update proposal filename

* feat: add metafield ranker

* fix docstrings

* remove proposal file from pr

* add release notes

* update code according to new Document class

* separate loops for each ranking mode in __merge_scores

* change error type in init and new tests for linear score warning

* docstring upd

---------

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-09 12:20:41 +01:00
Sebastian Husch Lee
71d0d92ea2
feat: Add model_kwargs to ExtractiveReader to impact model loading (#6257)
* Add ability to pass model_kwargs to AutoModelForQuestionAnswering

* Add testing for new model_kwargs

* Add spacing

* Add release notes

* Update haystack/preview/components/readers/extractive.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Make changes suggested by Stefano

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-09 11:25:22 +01:00
Vladimir Blagojevic
cd429a73cd
feat: Add GPTChatGenerator to Haystack 2.x (#6212)
* Add GPTChatGenerator

* Apply lessons from previous PR

* PR review - Stefano
2023-11-09 10:45:41 +01:00
Silvano Cerza
bf884094d1
refactor: Change Document.blob type and remove mime_type field (#6249)
* Change Document.blob type and remove mime_type field

* Add release notes

* Remove mime_type from Document docstring
2023-11-08 10:35:17 +01:00
Vladimir Blagojevic
5497ca2a45
feat: Adapt GPTGenerator to use str input/output format in Haystack 2.x (#6214)
* Adapt GPTGenerator to string input/output

* Finishing touches

* punctuation upd

* PR feedback

* Small naming fixes

* Update haystack/preview/components/generators/openai.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Update class pydoc with a printed response

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-07 18:00:43 +01:00
Stefano Fiorucci
fb96aef4dd
refactor!: move classifiers to an appropriate directory/package (#6240)
* mv classifiers

* release note
2023-11-06 12:00:01 +01:00
Vladimir Blagojevic
d7e1833c40
feat: Add HuggingFaceTGIChatGenerator Haystack 2.x component (#6199)
* Add ChatHuggingFaceTGIGenerator

* Add release note
---------
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-06 09:48:45 +01:00
Stefano Fiorucci
063d27c522
refactor!: rename TextDocumentSplitter to DocumentSplitter (#6223)
* rename TextDocumentSplitter to DocumentSplitter

* reno

* fix init
2023-11-03 11:33:20 +01:00
Vladimir Blagojevic
6e2dbdc320
feat: Add HuggingFaceTGIGenerator Haystack 2.x component (#6205)
* Add HuggingFaceTGIGenerator

* PR review

* PR feedback from Stefano

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-02 19:35:16 +01:00
Stefano Fiorucci
8511b8cd79
feat: HuggingFaceLocalGenerator- allow passing generation_kwargs in run method (#6220)
* allow custom generation_kwargs in run

* reno

* make pylint ignore too-many-public-methods
2023-11-02 15:29:38 +01:00
Ashwin Mathur
6bf0b9dc7c
feat: Add MarkdownToTextDocument (v2) (#6159)
* Add MarkdownToTextDocument

* Add release notes

* Update GitHub workflows

* Update GitHub workflows

* Refactor code with minimal dependencies

* Update docstrings

* Apply suggestions from code review

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update document with content and meta for backward compatibility

* Refactor Document Class for Backward Compatibility

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* Update tests

* Improve test assertions

---------

Co-authored-by: Daria Fokina <daria.f93@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-10-31 18:28:13 +01:00
Julian Risch
29b1fefaa4
feat: Add DocumentLanguageClassifier 2.0 (#6037)
* add DocumentLanguageClassifier and tests

* reno

* fix import, rename DocumentCleaner

* mark example usage as python code

* add assertions to e2e test

* use deserialized document_store

* Apply suggestions from code review

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* remove from/to_dict

* use renamed InMemoryDocumentStore

* adapt to Document refactoring

* improve docstring

* fix test for new Document

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-10-31 15:35:05 +01:00
Silvano Cerza
7287657f0e
refactor: Rename Document's text field to content (#6181)
* Rework Document serialisation

Make Document backward compatible

Fix InMemoryDocumentStore filters

Fix InMemoryDocumentStore.bm25_retrieval

Add release notes

Fix pylint failures

Enhance Document kwargs handling and docstrings

Rename Document's text field to content

Fix e2e tests

Fix SimilarityRanker tests

Fix typo in release notes

Rename Document's metadata field to meta (#6183)

* fix bugs

* make linters happy

* fix

* more fix

* match regex

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-31 12:44:04 +01:00
Vladimir Blagojevic
c51aa1ee8d
feat: Add general and HF util methods (#6200)
* Add general and hf util methods
2023-10-31 11:13:11 +01:00
Silvano Cerza
76d5142bb8
Refactor: Document serialization and backward compatibility (#6180)
* Rework Document serialisation

* Make Document backward compatible

* Fix InMemoryDocumentStore filters

* Fix InMemoryDocumentStore.bm25_retrieval

* Add release notes

* Fix pylint failures

* Enhance Document kwargs handling and docstrings

* cosmetics

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-10-30 17:03:06 +01:00
Massimiliano Pippi
789e524de3
remove leftovers from 1.18 (#6196) 2023-10-30 11:25:54 +01:00
Vladimir Blagojevic
f76fc04ed0
feat: Add StreamingChunk dataclass to Haystack 2.x (#6174)
* Add StreamingChunk

* Add release note

* Use default value init for metadata, turn off hashing

* Add unit tests
2023-10-26 17:42:52 +02:00
Vladimir Blagojevic
bb295d29ee
Fix failing test (#6176) 2023-10-26 17:22:24 +02:00
Ashwin Mathur
5f35e7d04a
refactor: Migrate RemoteWhisperTranscriber to OpenAI SDK. (#6149)
* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Migrate RemoteWhisperTranscriber to OpenAI SDK

* Remove unnecessary imports

* Add release notes

* Fix api_key serialization

* Fix linting

* Apply suggestions from code review

Co-authored-by: ZanSara <sarazanzo94@gmail.com>

* Add additional tests for api_key

* Adapt .run() to take ByteStream inputs

* Update docstrings

* Rework implementation to use io.BytesIO

* Update error message

* Add default file name

---------

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-10-26 16:25:23 +02:00