3803 Commits

Author SHA1 Message Date
Massimiliano Pippi
bbb6025e89
update package name 2023-11-24 12:14:43 +01:00
Massimiliano Pippi
ea1e3f588b
Update dependencies list
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 12:09:47 +01:00
Massimiliano Pippi
8adb8bbab8
Remove preview folder in test/
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:52:55 +01:00
Massimiliano Pippi
f71e11c717
Removed preview package
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:49:41 +01:00
Massimiliano Pippi
09e7831f60
clean up 1.x code
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:47:47 +01:00
Silvano Cerza
fd16ec63cb
refactor: Add support for new filters declaration (#6397)
* Rework filter logic for InMemoryDocumentStore to support new filters
declaration

* Fix legacy filters tests

* Simplify logic and handle dates comparison

* Rework MetadataRouter to support new filters

* Update docstrings

* Add release notes

* Fix linting

* Avoid duplicating filters specifications

* Handle corner case

* Simplify docstring

* Fix filters logic and tests

* Fix Document Store testing legacy filters tests
2023-11-24 11:22:46 +01:00
SebastjanPrachovskij
28c2b09d90
Add SearchApi integration for websearch (#6400) 2023-11-24 11:18:43 +01:00
Agnieszka Marzec
27cf8ee4ff
Docs: Update Reader's doc strings (#6312)
* Update doc strings

* Add warm_up docs, fix margins

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2023-11-24 11:07:02 +01:00
Stefano Fiorucci
b850b36a4b
fix: Cohere - better handling of COHERE_API_URL (#6407)
* extract API URL from lazy import

* improve solution
2023-11-24 10:58:46 +01:00
ZanSara
c45d8c39c7
fix: make ExtractiveReader handle situations where token_to_chars returns None instead of a (start, end) tuple (#6382)
* fix reader bug

* add test

* log

* fix logging

* improve error message
2023-11-24 09:08:56 +01:00
ZanSara
f3b73030a1
Fix wrong import in cohere.py and change model to model_name for consistency (#6405)
* Fix wrong import in `cohere.py`

* model -> model_name

* fix tests too

* black

* typo

* typo
2023-11-23 19:54:50 +01:00
Stefano Fiorucci
fdae81eee8
add pptx to API reference (#6404) 2023-11-23 18:02:31 +01:00
pandasar13
edb40b6c1b
refactor: add batch_size to FAISS __init__ (#6401)
* refactor: add batch_size to FAISS __init__

* refactor: add batch_size to FAISS __init__

* add release note to refactor: add batch_size to FAISS __init__

* fix release note

* add batch_size to docstrings

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-11-23 17:27:24 +01:00
ZanSara
4ec6a60a76
feat: CohereGenerator (#6395)
* added CohereGenerator with unit tests

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. added releasenote
2. removed commented files in test-cohere_generators
3. removed unused imports

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. move client creation to __init__
2. remove dict casting of metadata in run

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* few fixes

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* add cohere to git workflows

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. CohereGenerator as top level import in generators
2. small change in doc string

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. corrected git workflow files for cohere import
2. changed api key env var from CO_API_KEY to COHERE_API_KEY

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* added cohere in missed out workflow installs

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* 1. Removed default_streaming_callback from cohere.py and added in test.
2. Added kwargs doc strings for CohereGenerator
3. removed type hints for metadata and replies
4. use COHERE_API_URL instead of hard coded URL.

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* Update haystack/preview/components/generators/cohere/cohere.py

Co-authored-by: Daria Fokina <daria.f93@gmail.com>

* move out of folder

* black

* fix tests

* feedback

* black

* remove api key from tests

* read api key from env var if missing

* typo

* black

* missing import

---------

Signed-off-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: sunilkumardash9 <sunilkumardash9@gmail.com>
Co-authored-by: Daria Fokina <daria.f93@gmail.com>
2023-11-23 17:21:07 +01:00
Julian Risch
67780a62d5
test: Add end-to-end test for dense doc search 2.0 (#6102)
* draft e2e test for dense doc search

* fix import path

* add DocumentJoiner

* update converter import; fix getting filled doc store

* add text embedder

* add sample txt and pdf for preview e2e tests

* run the query pipeline before serializing

* define samples path

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-11-23 16:59:02 +01:00
jlonge4
c44e2cf49b
feat: add microsoft pptx file converter (#6399)
* Create pptx.py

* feat: pptx converter import __init__.py

* feat: add pptx import __init__.py

* feat: add python-pptx dependency

* feat: add sample pptx for testing

* feat: add pptx file-converter test

* feat: release note pptx-file-converter-3e494d2747637eb2.yaml

* feat: Update releasenotes/notes/pptx-file-converter-3e494d2747637eb2.yaml

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* feat: refactor haystack/nodes/file_converter/pptx.py

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>

* fix imports

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-23 16:46:41 +01:00
Silvano Cerza
604b177788
chore: Remove pydoc-markdown from dev dependencies (#6398)
* Remove pydoc-markdown from dev dependencies

* Remove fastapi pin in rest_api
2023-11-23 15:59:41 +01:00
Stefano Fiorucci
b0b514778d
fix!: make PyPDFToDocument JSON-serializable (#6396)
* add registry

* release note

* add checks

* rm superfluous check

* fix typo

* rm print :-)
2023-11-23 15:37:20 +01:00
Ben Heckmann
a492771b4d
feat: PreProcessor split by token (tiktoken & Hugging Face) (#5276)
* #4983 implemented split by token for tiktoken tokenizer

* #4983 added unit test for tiktoken splitting

* #4983 implemented and added a test for splitting documents with HuggingFace tokenizer

* #4983 added support for passing HF model names (instead of objects) and added an example to the HF token splitting test

* mocked HTTP model loading in unit tests, fixed pylint error

* fix lossy tokenizers splitting, use LazyImport, ignore UnicodeEncodeError for tiktoken

* reno

* rename reno file

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-11-23 12:26:37 +01:00
Vladimir Blagojevic
e04a1f16bb
feat: Add DynamicPromptBuilder to Haystack 2.x (#6328)
* Add DynamicPromptBuilder

* Improve pydocs, add unit tests

* Add release note

* Make expected_runtime_variables optional

* Add pydocs usage example

* Add more pydocs

* Remove test markers

* Update type in unit test

* Update after canals upgrade

* add to api ref

* docstrings updates

* Update test/preview/components/builders/test_dynamic_prompt_builder.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Update haystack/preview/components/builders/dynamic_prompt_builder.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Deparametrize init test

* Rename expected_runtime_variables to runtime_variables

* Rephrase docstring so meaning is clearer

---------

Co-authored-by: Darja Fokina <daria.f93@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-23 11:41:57 +01:00
Vladimir Blagojevic
e57a593d2e
fix: Revert back to straightforward PromptBuilder (#6335)
* Revert back to simple PromptBuilder

* Updating to full typing
2023-11-23 11:34:06 +01:00
Silvano Cerza
3e79de7043
ci: Add workflow to test code snippets (#6364)
* initial

* Add workflow to test code snippets

---------

Co-authored-by: Timo Möller <timo.moeller@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-11-23 11:26:53 +01:00
Timo Moeller
b34c35d982
initial (#6355) 2023-11-23 10:32:54 +01:00
Vladimir Blagojevic
cfff0d5212
Rename file_converters to converters (#6390) 2023-11-23 10:28:40 +01:00
Vladimir Blagojevic
b557f3035e
feat: Add ConditionalRouter Haystack 2.x component (#6147)
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Daria Fokina <daria.fokina@deepset.ai>
2023-11-23 10:28:08 +01:00
Massimiliano Pippi
70e40eae5c
fix: fix type hints on DocumentStore protocol (#6383)
* fix type hints

* disable specific pylint checker
2023-11-23 09:14:08 +01:00
Stefano Fiorucci
e91f7a8a4d
refactor!: improve the public interface of Generators (#6374)
* merge lazy import blocks

* refactor generators

* release note

* revert unrelated changes
2023-11-22 10:40:48 +01:00
ZanSara
b751978d65
Extends input types of RemoteWhisperTranscriber (#6218)
* fix tests

* reno

* tests

* retain file name

* paths are strings for openai sdk

* streams->sources

* feedback

* always add name to file

* mypy

* test placeholder with extension

* fallback

* paths

* path test

* path must be a string

* fix test
2023-11-22 09:57:45 +01:00
Ashwin Mathur
e6c8374562
feat: Add ByteStream metadata and other metadata to Documents created by HTMLToDocument (#6304)
* Refactor HTMLToDocument

* Add release notes

* Add additional tests

* remove progress bar

* Add additional test for metadata

* remove progress bar from release notes

* Update tests

* Use truthiness checks instead of is not None
2023-11-21 21:44:02 +01:00
Silvano Cerza
76165d024f
Fix corner cases and error handling with filters conversion (#6376) 2023-11-21 18:22:48 +01:00
Stefano Fiorucci
456902235a
feat: make DocumentWriter return the actual number of documents written (#6366)
* make DocumentWriter return the actual number of documents written

* add/improve tests
2023-11-21 15:54:25 +01:00
Silvano Cerza
ec3558021e
Remove Document Store tests with invalid filter (#6375) 2023-11-21 15:08:16 +01:00
Silvano Cerza
0a5b37f3d1
Rework legacy filters embedding tests and remove numpy dependency (#6371) 2023-11-21 14:02:15 +01:00
Daniel Fleischer
0cef17ac13
feat: embedding instructions for dense retrieval (#6372)
* Embedding instructions in EmbeddingRetriever

Query and documents embeddings are prefixed with instructions, useful
for retrievers finetuned on specific tasks, such as Q&A.

* Tests

Checking vectors 0th component vs. reference, using different stores.

* Normalizing vectors

* Release notes
2023-11-21 12:56:40 +01:00
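Note on 0cef17ac13: the commit message describes prefixing query and document texts with instructions before embedding, then normalizing the vectors. The following is a minimal sketch of that idea only, not the code from the commit. The function names and the instruction strings are hypothetical, and the embedding model itself is left out.

```python
import math
from typing import List


def add_instruction(texts: List[str], instruction: str) -> List[str]:
    """Prefix every text with an instruction string before it is embedded."""
    if not instruction:
        return list(texts)
    return [f"{instruction}{text}" for text in texts]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical instruction strings; real values depend on how the model was fine-tuned.
QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents: "
DOCUMENT_INSTRUCTION = "Represent the document for retrieval: "

queries = add_instruction(["Who wrote Faust?"], QUERY_INSTRUCTION)
documents = add_instruction(["Goethe wrote Faust."], DOCUMENT_INSTRUCTION)
unit_vector = l2_normalize([3.0, 4.0])
```

Queries and documents get different prefixes, so a model fine-tuned with such instructions can tell the two roles apart.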
Julian Risch
07cda09aa8
docs: Include TextEmbedder in DocumentJoiner usage example (#6369)
* docs: Include TextEmbedder in DocumentJoiner usage example

* black
2023-11-21 11:27:10 +01:00
Stefano Fiorucci
1fff2bc255
merge lazy import blocks (#6358) 2023-11-21 11:15:37 +01:00
Julian Risch
2943b83b31
fix: Add DocumentJoiner to routers' init (#6368) 2023-11-21 09:45:00 +01:00
Julian Risch
939e443ee8
docs: Add DocumentJoiner to API docs (#6365) 2023-11-20 18:18:06 +01:00
Silvano Cerza
d57760787d
refactor: Rework delete_documents tests (#6363)
* Rework write_documents tests

* Rework delete_documents tests

* Fix linting
2023-11-20 17:54:42 +01:00
Silvano Cerza
9b0e3f5ed4
Rework write_documents tests (#6362) 2023-11-20 17:54:29 +01:00
Silvano Cerza
a7f742fdbd
refactor: Rename docstore fixture to document_store (#6360)
* Prevent pytest_generate_tests from polluting preview tests

* Rename docstore fixture to document_store
2023-11-20 17:41:48 +01:00
Silvano Cerza
365127dc5b
Prevent pytest_generate_tests from polluting preview tests (#6361) 2023-11-20 15:47:06 +01:00
Silvano Cerza
83c245db74
feat: Implement function to convert legacy filters to new style (#6314)
* Implement function to convert legacy filters to new style

* Reduce return statements in conversion to fix linting

* Move convert function in different module

* Fix typos in docstrings

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

---------

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 13:00:05 +01:00
Agnieszka Marzec
497299c27a
Docs: Update Rankers docstrings and messages (#6296)
* Update docstrings and messages

* Fix tests

* Fix formatting

* Update haystack/preview/components/rankers/meta_field.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* Fix tests

---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-11-20 12:24:01 +01:00
Stefano Fiorucci
0ef06e72ff
fix: InMemoryDocumentStore - recreate Documents in the right way during embedding retrieval (#6354)
* do not recreate docs

* copy Documents

* recreate Document in the right way

* improve naming
2023-11-20 12:10:15 +01:00
ZanSara
9cee2f82c4
feat: extend write_documents to return the number of documents actually written in the document store (#6006)
* add typing and docstring

* reno

* Update releasenotes/notes/extend-write-documents-855ffc315974f03b.yaml

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-11-20 11:54:02 +01:00
Julian Risch
4ef2a680bb
feat: Add DocumentJoiner component 2.0 (#6105)
* draft DocumentJoiner

* implement merge and rrf

* draft end-to-end test with DocumentJoiner in hybrid doc search pipeline

* adjust for variadics Canals PR #122

* fix text_embedder input

* adapt to the new Document class

* adapt to new doc id

* specify documents input as Variadic in run method

* compare doc ids instead of full docs

* rename text_file_converter input to sources

* update docstring

* Update haystack/preview/components/routers/document_joiner.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* Apply suggestions from docstring review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

* capitalize Documents and Retrievers in docstrings

* fix log message in test

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
2023-11-20 10:56:56 +01:00
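Note on 4ef2a680bb: the commit message mentions "merge and rrf" as join modes. As an illustration of what reciprocal rank fusion generally does, here is a minimal sketch of the textbook formula, where each document scores the sum of 1 / (k + rank) over the input rankings. It is not the DocumentJoiner implementation. The function name, the use of plain string ids, and k = 60 are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top document of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["a", "b", "c"]    # e.g. ids returned by a keyword retriever
embedding_ranking = ["b", "d", "a"]  # e.g. ids returned by an embedding retriever
fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
```

Only ranks enter the score, so the retrievers' raw scores do not need to be on a comparable scale.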
ZanSara
e905066458
feat: make InMemoryDocumentStore return the number of docs actually written (#6274)
* make InMemoryDocumentStore return the number of documents actually written

* add fixme

* reno

* add missing continue
2023-11-20 10:03:22 +01:00
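Note on e905066458: the commit makes the store report how many documents were actually written. The following is a toy sketch of that behaviour, assuming a store keyed by document id and three duplicate policies. The class, the policy names, and the (id, content) tuples are hypothetical and do not mirror the InMemoryDocumentStore API.

```python
from typing import Dict, List, Tuple


class ToyDocumentStore:
    """Keeps documents in a dict keyed by id (illustrative only)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def write_documents(self, documents: List[Tuple[str, str]], policy: str = "skip") -> int:
        """Write (id, content) pairs and return how many were actually stored."""
        written = 0
        for doc_id, content in documents:
            if doc_id in self._docs:
                if policy == "fail":
                    raise ValueError(f"Document {doc_id!r} already exists")
                if policy == "skip":
                    # A skipped duplicate is not written, so it is not counted.
                    continue
            # New ids, and duplicates under the "overwrite" policy, end up here.
            self._docs[doc_id] = content
            written += 1
        return written


store = ToyDocumentStore()
first = store.write_documents([("1", "a"), ("2", "b")])
second = store.write_documents([("2", "x"), ("3", "c")])
third = store.write_documents([("1", "z"), ("3", "y")], policy="overwrite")
```

Without the `continue` in the skip branch, a duplicate would fall through and be both overwritten and counted.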
x110
d03bffab8b
Promptnode timeout (#6282) 2023-11-19 16:32:09 +01:00
Silvano Cerza
9b11462bf8
refactor: Move tests for delete_documents from DocumentStoreBaseTests to separate class (#6336)
* Move tests for delete_documents from DocumentStoreBaseTests to separate class

* Move `filterable_docs` fixture from `DocumentStoreBaseTests` to separate mixin class (#6337)

* Move filterable_docs fixture from DocumentStoreBaseTests to separate mixin class

* refactor: Move generic `filter_documents` tests from `DocumentStoreBaseTests` to separate class (#6338)

* Move generic filter_documents tests from DocumentStoreBaseTests to separate class

* refactor: Move `filter_documents` tests with invalid filters from `DocumentStoreBaseTests` to separate class (#6339)

* Move filter_documents tests with invalid filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with equal filters from `DocumentStoreBaseTests` to separate class (#6340)

* Move filter_documents tests with equal filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with not equal filters from `DocumentStoreBaseTests` to separate class (#6341)

* Move filter_documents tests with not equal filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with in filters from `DocumentStoreBaseTests` to separate class (#6342)

* Move filter_documents tests with in filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with not in filters from `DocumentStoreBaseTests` to separate class (#6343)

* Move filter_documents tests with not in filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with greater than filters from `DocumentStoreBaseTests` to separate class (#6344)

* Move filter_documents tests with greater than filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with greater than equal filters from `DocumentStoreBaseTests` to separate class (#6345)

* Move filter_documents tests with greater than equal filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with less than filters from `DocumentStoreBaseTests` to separate class (#6346)

* Move filter_documents tests with less than filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with less than equal filters from `DocumentStoreBaseTests` to separate class (#6347)

* Move filter_documents tests with less than equal filters from DocumentStoreBaseTests to separate class

* Move `filter_documents` tests with simple logical filters from `DocumentStoreBaseTests` to separate class (#6348)

* Move filter_documents tests with simple logical filters from DocumentStoreBaseTests to separate class

* Move filter_documents tests with nested logical filters from DocumentStoreBaseTests to separate class (#6349)
2023-11-17 19:35:12 +01:00