3803 Commits

Author SHA1 Message Date
Daniel Bichuetti
739fc228c6
feat: support cl100k_base tokenization and increase performance for GPT2 (#3897)
* feat: migrate to tiktoken when tokenizing for OpenAI

* refactor: add OpenAI optional egg

* fix: add Python 3.7 fallback support for tiktoken

* refactor: change both tokenization implementations and fix mypy

* refactor: remove dummy-class

* refactor: add tiktoken as core dependency and minor refactoring

* refactor: sort imports

* refactor: remove out-of-scope PR change

* refactor: reintroduce corner case check

* refactor: remove unused egg

* refactor: remove unused exception after tiktoken as core dep

* refactor: reduce ifs and include log warning

* refactor: remove timeout linting ignore

* refactor: revert change due to mypy

* refactor: disable pylint import error
2023-01-24 16:15:49 +01:00
Vladimir Blagojevic
4d8b1d0b22
refactor: Improve stop_words handling, add unit test cases (#3918)
* Improve stop_words handling, add unit test cases

* Update test/nodes/test_prompt_node.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-01-24 12:52:41 +01:00
Fabian
61ebe4b5dc
fix: authenticate with aws4auth if set in OpenSearchDocumentStore (#3741)
* bug(OpenSearchDocumentStore): fix authenticate with aws4auth if set.

Rearrange check to authenticate with aws4auth before username
and password, as the username is set to "admin" by default.

* Make username check less restrictive

* Fix test, do not use mocked _init_client function

* Add warning for aws4auth and username to ElasticSearchDocumentStore

Co-authored-by: Julian Risch <julian.risch@deepset.ai>
2023-01-24 10:01:39 +01:00
ZanSara
e954230ae7
chore: enable f-string-without-interpolation (#3906)
* f-string-without-interpolation

* remove line

* missed one line
2023-01-23 17:35:52 +01:00
Zoltan Fedor
e447bd728a
feat: adding the ability to use Ray Serve async functionality (#3769)
* Adding the ability to call the Ray pipeline from concurrent apps with async

This is to fix #2968

* Fixes: mypy + pylint (`invalid-overridden-method`)

* Simplifying - no real need for an `AsyncRayPipeline` anymore

* Moving the new `run_async` method to the `RayPipeline`

* Cleanup

* [EMPTY] Re-trigger CI
2023-01-23 16:23:09 +01:00
Benjamin BERNARD
eed009eddb
feat: Add CsvTextConverter (#3587)
* feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline

Fixes #3550, allow user to build full FAQ using YAML pipeline description and with CSV import and indexing.

* feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline

Fix linter issues mypy and pylint.

* feat: Add Csv2Documents, EmbedDocuments nodes and FAQ indexing pipeline

Fix linter issues mypy.

* implement proposal's feedback

* tidy up for merge

* use BaseConverter

* use BaseConverter

* pylint

* black

* Revert "black"

This reverts commit e1c45cb1848408bd52a630328750cb67c8eb7110.

* black

* add check for column names

* add check for column names

* add tests

* fix tests

* address lists of paths

* typo

* remove duplicate line

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-01-23 15:56:36 +01:00
ZanSara
94f660c56f
feat: store id_hash_keys in Document objects to make documents clonable (#3697)
* store id_hash_keys in Document objects

* fix id_hash_keys calls throughout codebase

* generate schema

* fix es

* fix weaviate

* backward compatible

* openapi schema

* remove unused deprecation warning

* remove unused imports

* openapi

* unused var

* Apply suggestions from code review

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/schema.py

* Apply suggestions from code review

Co-authored-by: bogdankostic <bogdankostic@web.de>

* Update haystack/schema.py

* review feedback

* trailing spaces

* pylint

* add deprecation test

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-01-23 15:00:52 +01:00
ZanSara
2f15f3c64d
Fix OpensearchDocumentStore docstring (#3904) 2023-01-23 19:19:40 +05:30
Silvano Cerza
afa2bb1386
fix: Remove double super class init from ParsrConverter init (#3896) 2023-01-23 12:31:27 +01:00
Silvano Cerza
45bea5a838
chore: Add timeouts to external requests calls (#3895)
* chore: Add timeouts to external requests calls

* Remove :type directives from docstrings
2023-01-23 12:31:13 +01:00
Stefano Fiorucci
b910df7ec7
feat: ImageToText (caption generator) (#3859)
* first draft

* fix pylint and mypy

* retry w mypy

* mypy :-)

* rem unused import

* incorporate feedback and initial tests

* better tests

* fix import order

* fix docstring

* other fix docstring

* more and better tests

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-01-23 11:59:56 +01:00
Sebastian
d2bba4935b
feat: Use truncate option for Cohere.embed (#3865)
* Use truncate option for cohere request instead of GPT2 tokenizer to truncate texts

* Update max batch size for cohere which is 96

Co-authored-by: ZanSara <sarazanzo94@gmail.com>
2023-01-20 09:49:55 +01:00
Vladimir Blagojevic
04deb3b535
feat: Add retry with exponential back-off to PromptNode's OpenAI models (#3886) 2023-01-19 21:04:32 +01:00
ZanSara
90c877a559
bug: mypy should ignore files in test/ (#3894)
* exclude files in test/

* verify that the CI ignores test files

* don't fail in case of no files
2023-01-19 18:12:26 +01:00
Vladimir Blagojevic
4c28253955
feat: PromptNode - implement stop words (#3884) 2023-01-19 12:26:15 +01:00
Vladimir Blagojevic
e2fb82b148
refactor: Move invocation_context from meta to own pipeline variable (#3888) 2023-01-19 11:17:06 +01:00
ZanSara
34b7db0209
chore: enable singleton-comparison and cleanup (#3849)
* enable singleton-comparison

* fix triadaptive_model bug
2023-01-19 10:07:41 +01:00
ZanSara
6f5a2fb1da
fix: remove string validation in YAML (#3854)
* remove string validation in YAML

* unused import

* fix import

* remove tests

* fix tests
2023-01-19 10:06:53 +01:00
Mayank Jobanputra
dad7b12874
fix: Allowing InMemStore and FAISSDocStore for indexing using single worker (#3868)
* Allowing InMemStore and FAISSDocStore for indexing using single worker YAML config

* unified pipeline & doc store loading

* fix pylint warning

* separated tests

* removed unnecessary caplog
2023-01-19 14:06:00 +05:30
Ahmed Nabil
12e057837b
Adding condition to pinecone object. (#3768)
* Adding condition to `pinecone` object.

Currently any value can be assigned to `PineconeDocumentStore`'s `pinecone_index` parameter; this adds a condition to prevent that.

* Added test, and changed the code to make sure the pinecone idx variable has correct instance

* fixed black error

Co-authored-by: Mayank Jobanputra <mayankjobanputra@gmail.com>
2023-01-19 01:34:44 +05:30
Vladimir Blagojevic
c44d67856e
Simplify PromptTemplate substitution in PromptNode (#3876) 2023-01-18 18:31:15 +01:00
ZanSara
eb57e1fc09
chore: make Mypy work when Haystack is installed (#3856)
* add ignore statements to each failing line in haystack/

* simplify workflow

* few typos

* mypy cache directory missing

* mypy cache directory missing

* install types from Haystack only

* install types from rest_api too

* mypy vs literal

* install types at check time

* add mypy cache to python cache

* fix version condition

* fix version condition

* try running mypy only on affected files

* try using explicit hashes

* try another approach

* filter python files

* typo

* quotes

* use action
2023-01-18 15:36:10 +01:00
ZanSara
6af4f14fe0
feat: preprocessor raises warning when doc length exceeds threshold (#3837)
* add warning for excessive length

* improve test

* review feedback

* fix test

* move into _process_single
2023-01-17 13:48:28 +01:00
ZanSara
c50968dfe5
upgrade es to the version used in the CI (#3858) 2023-01-17 13:47:37 +01:00
ZanSara
9e457db2e9
test: add version deprecation fixture (#3851)
* add fixture

* Update test/conftest.py

* remove +2 and add tests

* few typos

* more cases

* Update test/conftest.py
2023-01-16 15:36:14 +01:00
ZanSara
3ffdb0a9a3
chore: fix all EOF (#3852)
* fix all eof

* fix test

* fix test

* fix test

* typo

* fix sample

* fix sample

* add logs

* fix page_dynamic_result.txt
2023-01-16 12:34:50 +01:00
ZanSara
62935bde6d
enable unused-variable (#3846) 2023-01-12 19:38:45 +01:00
Benjamin BERNARD
15203d864b
docs: Proposal - CSV FAQ indexing feature (#3638)
* docs(proposal): Add new proposal about CSV FAQ indexing feature

* docs(proposal): Add new proposal about CSV FAQ indexing feature

Introduce PR number.

* Review feedback

* Mixed up the PR numbers

Co-authored-by: ZanSara <sara.zanzottera@deepset.ai>
2023-01-12 11:07:26 +01:00
Zoltan Fedor
9cf80ee07e
feat: add HA support for Weaviate (#3764)
* feat: add HA support for Weaviate

Adding the `replicationConfig => factor` parameter to the Weaviate class at the time of class creation, allowing the user to have Haystack create a Weaviate "Class" with a replication factor set above 1.

This enables the use of Weaviate in a HA (High Availability) fashion, where the created class is stored on multiple Weaviate nodes increasing Weaviate's throughput and also ensuring high availability.

* Trying out a recommendation from @masci to fix the CI issue
2023-01-12 10:01:38 +01:00
ZanSara
d157e41c1f
chore: enable logging-fstring-interpolation and cleanup (#3843)
* enable logging-fstring-interpolation

* remove logging-fstring-interpolation from exclusion list

* remove implicit string interpolations added by black

* remove from rest_api too

* fix % sign
2023-01-12 09:31:21 +01:00
ZanSara
4cbc8550d6
chore: enable trailing-whitespace and cleanup (#3847)
* enable trailing-whitespace

* remove trailing whitespace on rest api too
2023-01-11 20:08:19 +01:00
Massimiliano Pippi
fa4404baa0
fix: ignore non-serializable params when hashing pipeline objects (#3842)
* ignore non-serializable params when hashing pipeline objects

* make tests more clear
2023-01-11 17:11:41 +01:00
Vladimir Blagojevic
ccda51fb43
proposal: Shaper pipeline component (#3784)
* Add InputOutputShaper proposal

* Add security section

* Rename to Shaper, small additions

* Rewording, rename contract_docs to concat
2023-01-11 18:50:12 +05:30
Bilge Yücel
88db75a419
feat: update the docker image for haystack-api service (#3835) 2023-01-11 15:35:46 +03:00
Stefano Fiorucci
be31178892
fix: make the crawler runnable and testable on Windows (#3830)
* fix crawler and try to run CI

* more compact expression

* try to fix

* improve naming regex

* revert regex

* make test_url compatible with Windows

* better conditional expression
2023-01-10 20:27:28 +01:00
Massimiliano Pippi
7f8910192e
list conventional commit types in the PR template (#3836) 2023-01-10 18:24:51 +01:00
Julian Risch
0e42a9015e
fix: inconsistent batch_size parameter names in distillation (#3811) 2023-01-10 11:38:21 +01:00
Tobias Wochinger
dea10a51d3
fix: gracefully handle FileExistsError during Preprocessor resource download (#3816)
* fix: use temp path for downloading punkt resources

* fix: gracefully handle file exists error during download
2023-01-10 11:22:49 +01:00
Vladimir Blagojevic
394c4895c7
fix: Add missing docstrings to PromptNode, PromptTemplate and PromptModel (#3821)
Co-authored-by: agnieszka-m <amarzec13@gmail.com>
Co-authored-by: sjrl <sjrl@users.noreply.github.com>
2023-01-10 10:26:20 +01:00
Zoltan Fedor
0288e1be76
bug: The PromptNode handles all parameters as lists without checking if they are in fact lists (#3820) 2023-01-10 08:08:17 +01:00
Agnieszka Marzec
897e89c9b1
Docs: Update FAISSDocStore load and save descriptions (#3808)
* Update load and save descriptions

* Add reviewers' suggestions

* Add Bilge's comment

* Blackify

* Update haystack/document_stores/faiss.py

Co-authored-by: Bilge Yücel <bilgeyucel96@gmail.com>
2023-01-10 07:55:20 +01:00
Massimiliano Pippi
d728dc2210
refactor: remove haystack demo along with deprecated Dockerfiles (#3829)
* remove haystack demo from the repo

* remove install step from the action
2023-01-09 18:46:47 +01:00
Vladimir Blagojevic
fa78e2b0e4
refactor: Change PromptNode registered templates from per class to per instance (#3810) 2023-01-09 15:57:04 +01:00
tstadel
6ca88bfd23
fix: Despite return_embedding=False SearchEngineDocumentStore.query retrieves embedding_field (#3662)
* fix: Despite return_embedding=False SearchEngineDocumentStore.query retrieves embedding_field

* fix pylint

* add tests

* fix mypy

* fix merge

* format

* fix pylint

* move tests to SearchEngineDocumentStoreTestAbstract

* move missed constants

* add mocked_document_store fixture to TestElasticsearchDocumentStore

* fix mocked_document_store

* fix get_all_documents tests for elasticsearch>=7.16

* fix tests

* fix tests try 2
2023-01-09 11:58:23 +01:00
Sebastian
5b0b338175
fix: Ensure eval mode for TableReader model for predictions (#3743)
* Adding model.eval() calls to prediction functions in table reader

* Add unit test to check that inference-time prediction still works when the model is set to train mode.
2023-01-09 11:07:06 +01:00
Sebastian
659020fcac
fix: Convert table cells to strings for compatibility with TableReader (#3762)
* Add table = table.astype(str) to make sure cells are converted to strings for compatibility with the TableReader

* Turn more strings into ints

* Make sure answer text is always a string.
2023-01-09 10:42:11 +01:00
Massimiliano Pippi
93b48bc334
fix if clause in job skip logic (#3825) 2023-01-08 22:50:35 +05:30
Massimiliano Pippi
eb1881f38f
skip fossa check from forks (#3824) 2023-01-08 16:50:11 +01:00
tstadel
4a0a054164
fix: linefeeds in custom_query (#3813)
* fix linefeeds in custom_query

* add double quote test case
2023-01-05 17:13:04 +01:00
Julian Risch
0c2d13f1b8
bug: skip validating empty embeddings (#3774)
* skip validating empty embeddings

* skip batches without embeddings to update

* add unit test with mocked retriever
2023-01-05 15:13:57 +01:00