266 Commits

Author SHA1 Message Date
Tobias Wochinger
fe0ac5c4a2
chore: enforce kwarg logging (#7207)
* chore: add logger which eases logging of extras

* chore: start migrating to key value

* fix: import fixes

* tests: temporarily comment out breaking test

* refactor: move to kwarg based logging

* style: fix import order

* chore: implement self-review comments

* test: drop failing test

* chore: fix more import orders

* docs: add changelog

* tests: fix broken tests

* chore: fix getting the frames

* chore: add comment

* chore: cleanup

* chore: adapt remaining `%s` usages
2024-02-29 14:31:20 +01:00
David S. Batista
3fc77979d8
fixing docstrings (#7225) 2024-02-27 17:50:36 +01:00
ZanSara
1182c08daf
fix: Don't filter negative scores when using BM25Okapi and scale_score=False (#6889)
* don't filter negatives for unscaled Okapi

* change BM25 algorithm default to BM25L

* Update haystack/document_stores/in_memory/document_store.py

* improve comment
2024-02-06 11:07:27 +01:00
Madeesh Kannan
a5189dd035
fix!: InMemoryBM25Retriever no longer returns documents that have a score of 0.0 (#6717)
* fix!: `InMemoryBM25Retriever` no longer returns documents that have a score of 0.0

Also update tests to accommodate the new behavior.

* Remove superfluous code
2024-01-12 17:50:55 +01:00
Massimiliano Pippi
e1ec4e5e4d
refact!: Remove symbols under the haystack.document_stores namespace (#6714)
* remove symbols under the haystack.document_stores namespace

* Update haystack/document_stores/types/protocol.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* fix

* same for retrievers

* leftovers

* more leftovers

* add relnote

* leftovers

* one more

* fix examples

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2024-01-10 21:20:42 +01:00
Massimiliano Pippi
00fed32024
build: depend on haystack_bm25 instead of rank_bm25 (#6578)
* use the forked package

* switch package dependency

* relnote

* fix package name
2023-12-18 10:47:15 +01:00
Stefano Fiorucci
4912f7cb58
refactor!: improve the deserialization logic for components that use a Document Store (#6466)
* improve deserialization

* rm ds decorator

* improve tests

* fix pylint

* rm decorator from module init

* rm decorator

* rm decorator from factory

* fix tests

* release note

* rm print
2023-12-04 15:17:28 +01:00
Silvano Cerza
5240672088
Rename document_store/protocols.py to document_store/protocol.py (#6448) 2023-11-29 16:41:15 +01:00
Silvano Cerza
831d0611d9
feat: Change default DuplicatePolicy in DocumentStore.write_documents() (#6438)
* Change default DuplicatePolicy in DocumentStore.write_documents()

* Add release notes
2023-11-28 12:30:17 +01:00
Silvano Cerza
9a7fd6f2ce
refactor: Add new filters tests for Document Store testing (#6428)
* Add new filters tests for Document Store testing

* Add release notes
2023-11-28 09:57:08 +01:00
Silvano Cerza
e6637f5ec2
Fix all tests 2023-11-24 14:48:43 +01:00
Massimiliano Pippi
f71e11c717
Removed preview package
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:49:41 +01:00
Massimiliano Pippi
09e7831f60
clean up 1.x code
---------

Co-authored-by: Silvano Cerza <silvanocerza@gmail.com>
2023-11-24 11:47:47 +01:00
pandasar13
edb40b6c1b
refactor: add batch_size to FAISS __init__ (#6401)
* refactor: add batch_size to FAISS __init__

* refactor: add batch_size to FAISS __init__

* add release note to refactor: add batch_size to FAISS __init__

* fix release note

* add batch_size to docstrings

---------

Co-authored-by: anakin87 <stefanofiorucci@gmail.com>
2023-11-23 17:27:24 +01:00
x110
c4cfe6cb90
fix: Load additional fields from SQUAD-format file to meta field for labels #5978 (#6301)
* Load additional fields from SQUAD-format file to meta field for labels

* added a test function

* rewritten test using pytest

* added release notes

* improve release note

* clean up test

---------

Co-authored-by: Stefano Fiorucci <44616784+anakin87@users.noreply.github.com>
2023-11-16 10:44:51 +01:00
Ivana Zeljkovic
2326f2f9fe
feat: Pinecone document store optimizations (#5902)
* Optimize methods for deleting documents and getting vector count. Enable warning messages when Pinecone limits are exceeded on Starter index type.

* Fix typo

* Add release note

* Fix mypy errors

* Remove unused import. Fix warning logging message.

* Update release note with description about limits for Starter index type in Pinecone

* Improve code base by:
- Adding new test cases for get_embedding_count method
- Fixing get_embedding_count method
- Improving delete documents
- Fix label retrieval
- Increase default batch size
- Improve get_document_count method

* Remove unused variable

* Fix mypy issues
2023-10-16 19:26:24 +02:00
Christian Clauss
bf6d306d68
ci: Simplify Python code with ruff rules SIM (#5833)
* ci: Simplify Python code with ruff rules SIM

* Revert #5828

* ruff --select=I --fix haystack/modeling/infer.py

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-09-20 08:32:44 +02:00
Christian Clauss
91ab90a256
perf: Python performance improvements with ruff C4 and PERF fixes (#5803)
* Python performance improvements with ruff C4 and PERF

* pre-commit fixes

* Revert changes to examples/basic_qa_pipeline.py

* Revert changes to haystack/preview/testing/document_store.py

* revert releasenotes

* Upgrade to ruff v0.0.290
2023-09-16 16:26:07 +02:00
Onur Eren Arpacı
8af0d816e6
bug: fix the date_fields request bottleneck (#5695)
* bug: fix the date_fields request bottleneck

I encountered a performance issue while attempting to index 1 million vectors. Despite the Weaviate instance having low utilization, the process was estimated to take around 10 hours.

After some investigation, I identified the bottleneck: the `_get_date_properties` function was being called for every document, so a request to the Weaviate client was sent and awaited for each one.

To address this, I optimized the code to invoke `_get_date_properties` only when the schema changes. This reduced the indexing time to approximately 90 minutes for the same 1 million vectors.

* bug: fix the date_fields request bottleneck

* fix: executed the pre-commit hooks for #9341
2023-09-15 18:12:14 +02:00
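
The fix described in #5695 above amounts to caching a remote lookup and refreshing it only when the schema changes. The following is a minimal sketch of that idea, not the actual Weaviate document store code: `FakeSchemaClient`, `DatePropertyCache`, and `invalidate` are hypothetical names introduced here for illustration, and the client is a local stand-in that only counts requests.

```python
from typing import Dict, List


class FakeSchemaClient:
    """Hypothetical stand-in for a remote client; counts schema requests."""

    def __init__(self) -> None:
        self.requests = 0

    def get_schema(self, index: str) -> Dict[str, str]:
        self.requests += 1
        return {"created_at": "date", "title": "text"}


class DatePropertyCache:
    """Caches date-typed property names per index (illustrative only)."""

    def __init__(self, client: FakeSchemaClient) -> None:
        self._client = client
        self._cache: Dict[str, List[str]] = {}

    def date_properties(self, index: str) -> List[str]:
        # Only hit the remote client when nothing is cached for this index.
        if index not in self._cache:
            schema = self._client.get_schema(index)
            self._cache[index] = [name for name, dtype in schema.items() if dtype == "date"]
        return self._cache[index]

    def invalidate(self, index: str) -> None:
        # Call this whenever the schema of `index` changes.
        self._cache.pop(index, None)


client = FakeSchemaClient()
cache = DatePropertyCache(client)

# Simulate writing many documents: the lookup runs once, not once per document.
for _ in range(1000):
    cache.date_properties("Document")
requests_without_schema_change = client.requests

# A schema change forces exactly one more request.
cache.invalidate("Document")
cache.date_properties("Document")
requests_after_schema_change = client.requests
```

With this shape, the number of schema requests depends on the number of schema changes, not on the number of documents written.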
ZanSara
9056c43240
fix: remove __future__ import from pinecone.py (#5813)
* remove future import

* fix forward reference
2023-09-14 16:28:39 +02:00
Ivana Zeljkovic
4bad202197
feat: Pinecone document store refactoring (#5725)
* Refactor codebase so that doc_type metadata is used instead of namespaces for making distinction between documents without embeddings, documents with embeddings and labels

* Fix parameter name in integration test

* Remove code under comment in add_type_metadata_filter method

* Fix mypy and pylint checks

* Add release note

* Apply minimal changes: rename method, update method docs and remove redundant method

* Mypy fixes

* Fix docstrings

* Revert helper methods for fetching documents when the number of documents exceeds Pinecone limit

* Remove unnecessary attributes in PineconeDocumentStore

* Fix unit test

---------

Co-authored-by: Ivana Zeljkovic <ivana.zeljkovic@smartcat.io>
Co-authored-by: DosticJelena <jelena.dostic@smartcat.io>
2023-09-14 11:46:47 +02:00
Christian Clauss
6dd52d91b2
ci: Fix typos discovered by codespell (#5778)
* Fix typos discovered by codespell

* pylint: max-args = 38
2023-09-13 16:14:45 +02:00
Tuana Çelik
4bb22c9665
Update weaviate.py (#5469)
Updating the weaviate docstrings to replace the old URL with the new correct one. The old one now gives a 404
2023-08-10 15:37:55 +02:00
ZanSara
c27622e1bc
chore: normalize more optional imports (#5251)
* docstore filters

* modeling metrics

* doc language classifier

* file converter

* docx converter

* tika

* preprocessor

* context matcher

* pylint
2023-08-09 09:27:53 +02:00
tstadel
d46c84bb61
feat: support dynamic filters in custom_query (#5427)
* support filters in custom_query

* better tests

* Update docstrings

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-08-08 15:48:15 +02:00
tstadel
d26d4201fc
feat: support search_fields in DeepsetCloudDocumentStore (#5455)
* feat: support search_fields in DeepsetCloudDocumentStore

* add reno file

* make search_fields plain init arg

* Update lg

* Update releasenotes/notes/deepset-cloud-document-store-search-fields-40b2322466f808a3.yaml

* Update haystack/document_stores/deepsetcloud.py

---------

Co-authored-by: agnieszka-m <amarzec13@gmail.com>
2023-08-04 11:13:05 +02:00
bogdankostic
237d67dbfd
feat: Check version of Elasticsearch server and add support for Elasticsearch <= 7.5 (#5320)
* Check ES server version + add support for ES <= 7.5

* Adapt comment

* PR feedback
2023-07-13 14:50:43 +02:00
bogdankostic
206b21816c
chore: Adapt import message for Elasticsearch7 (#5295)
* Adapt import message for es7.ElasticsearchDocumentStore

* Move import statement
2023-07-10 10:21:26 +02:00
tstadel
9acb275680
fix: avoid conflicts with opensearch / elasticsearch magic attributes during bulk requests (#5113)
* use _source on opensearch bulk requests

* fix label bulk requests

* add tests

* fix test

* apply feedback

---------

Co-authored-by: Vladimir Blagojevic <dovlex@gmail.com>
2023-07-07 15:12:50 +02:00
Massimiliano Pippi
00efa514ca
refactor: remove Elasticsearch client version 8 deprecation warnings (#5245)
* remove deprecation warnings

* remove leftover
2023-07-04 14:17:34 +02:00
Vladimir Blagojevic
1066e959a2
bug: fix for pinecone not working for per document updates (#5110) 2023-07-03 14:07:52 +02:00
Stefano Fiorucci
1be39367ac
Fix: FAISSDocumentStore - make write_documents work properly in combination with update_embeddings (#5221)
* Update VERSION.txt

* first draft

* simplify method and test

* rm unnecessary pb.close

* integrate feedback
2023-07-03 10:07:36 +02:00
Massimiliano Pippi
6c1d0fbf04
refactor: isolate Elasticsearch client calls (#5241)
* isolate client code

* pass headers

* pass headers

* more adjustments

* revert

* revert

* leftover

* fix opensearch
2023-06-30 18:29:01 +02:00
Massimiliano Pippi
cb638af0ff
refactor: fix method type and add comments (#5235)
* fix method type and add comments

* fix tests
2023-06-30 11:55:52 +02:00
Massimiliano Pippi
037e4f24ce
refactor: add a new Document Store supporting Elasticsearch 8 (#5231)
* introduce es8

* prepare tests

* fix unit tests

* adjust tests

* install elastic_transport package

* make mypy happy

* fix opensearch tests
2023-06-29 16:40:10 +02:00
Massimiliano Pippi
d5c13aa71d
refactor: introduce a base class for ElasticsearchDocumentStore (#5228)
* introduce a base class

* forgot

* fix linting

* try

* try

* schema generation doesn't support aliasing, use the same name
2023-06-29 13:28:49 +02:00
bogdankostic
8c63e295f4
fix: Allow filtering on list fields in InMemoryDocumentStore with all operators (#5208)
* Add support for list fields

* Unskip tests
2023-06-29 12:10:39 +02:00
Massimiliano Pippi
6373e2ea66
refactor: prepare support to Elasticsearch 8 (#5226)
* make a package

* Update haystack/document_stores/elasticsearch/es7.py

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>

* do not expose ES types from the package

---------

Co-authored-by: Silvano Cerza <3314350+silvanocerza@users.noreply.github.com>
2023-06-29 11:06:20 +02:00
ZanSara
462f3a5c99
feat: globally disable progress bars (#5207)
* add SilenceableTqdm and update usage

* pylint

* rename module

* add tests
2023-06-27 11:45:17 +02:00
bogdankostic
82291b56ad
fix: Send batches of query-doc pairs to inference_from_objects (#5125)
* Send batches of query-doc pairs to inference_from_objects

* Use absolute import path

* Add separate preprocessing_batch_size parameter
2023-06-26 14:26:26 +02:00
ZanSara
7a9cf30063
chore: remove safe_import and all usages (#5139)
* remove safe_import and all usages

* forward references

* fix additional import

* mypy

* mypy

* pylint

* forward reference

* Update haystack/document_stores/opensearch.py

Co-authored-by: bogdankostic <bogdankostic@web.de>

* fix except clause

---------

Co-authored-by: bogdankostic <bogdankostic@web.de>
2023-06-26 12:42:43 +02:00
Julian Risch
30fdf2b5df
feat!: Add extra for inference dependencies such as torch (#5147)
* feat!: add extra for inference dependencies such as torch

* add inference extra to 'all' and 'all-gpu' extra

* install inference extra in selected integration tests

* import LazyImport

* review feedback

* add import error messages and update readme

* remove extra dot
2023-06-20 09:54:10 +02:00
Shukri
916e8452f5
feat!: simplify weaviate auth (#5115)
* feat!: simplify weaviate auth

* docs: explain param precedence

* refactor: simplify _get_embedded_options
2023-06-19 15:46:58 +02:00
Ben Heckmann
60e5d73424
fix: changing document scores (#5090)
* #4653 fix changing scores by returning new document objects from document store queries

* added integration test for InMemoryDocumentStore demonstrating the desired behavior

* Update test/document_stores/test_memory.py
2023-06-14 17:35:46 +02:00
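
The behavior targeted by #5090 above (scores changing because queries handed out the stored objects themselves) can be illustrated with a small sketch. This is not the `InMemoryDocumentStore` implementation: `Doc`, `TinyStore`, and the term-count scoring are hypothetical and exist only to show why returning new objects from a query keeps stored documents and later results unaffected by caller-side mutation.

```python
from copy import copy
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    id: str
    content: str
    score: Optional[float] = None


class TinyStore:
    """Hypothetical in-memory store that returns copies from queries."""

    def __init__(self) -> None:
        self._docs: Dict[str, Doc] = {}

    def write(self, docs: List[Doc]) -> None:
        for doc in docs:
            self._docs[doc.id] = doc

    def query(self, term: str) -> List[Doc]:
        hits = []
        for doc in self._docs.values():
            if term in doc.content:
                # Score a copy, so the stored document is never mutated.
                hit = copy(doc)
                hit.score = float(doc.content.count(term))
                hits.append(hit)
        return hits


store = TinyStore()
store.write([Doc(id="1", content="bm25 bm25 ranking")])

first = store.query("bm25")
first[0].score = 99.0  # a downstream component overwrites the score

second = store.query("bm25")  # unaffected by the mutation above
```

A shallow copy is enough here because only the top-level `score` attribute is reassigned; a store whose documents carry mutable nested metadata might need a deep copy instead.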
ZanSara
20c1f23fff
feat: optional transformers (#5101)
* generalimport -> lazy-imports

* remove generalimport

* fix pdftotextconverter import check

* customize error messages

* pylint

* fix sql.py

* pylint

* Update haystack/document_stores/sql.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* make contextmanager less verbose

* do not catch syntax errors

* review feedback

* make all torch and transformers import lazy

* fix environment.py

* mypy

* merge leftovers

* fix schema

* pylint

* review feedback

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-06-14 12:00:20 +02:00
ZanSara
52e7a77595
feat: introduce lazy_import (#5084)
* generalimport -> lazy-imports

* remove generalimport

* fix pdftotextconverter import check

* customize error messages

* pylint

* fix sql.py

* pylint

* Update haystack/document_stores/sql.py

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>

* make contextmanager less verbose

* do not catch syntax errors

* review feedback

* Update haystack/nodes/file_converter/pdf.py

---------

Co-authored-by: Massimiliano Pippi <mpippi@gmail.com>
2023-06-08 12:11:38 +02:00
bogdankostic
da1f245a84
feat: Add batch_size parameter and cast timeout_config value to tuple for WeaviateDocumentStore (#5079)
* Add batch_size parameter and cast timeout_config to tuple

* Add unit test

* Remove debug tqdm

* Remove debug tqdm introduced in #5063
2023-06-06 17:06:10 +02:00
bogdankostic
9cb83402c4
refactor: Use globally defined request timeout in ElasticsearchDocumentStore and OpenSearchDocumentStore (#5064)
* Include benchmark config in output

* Use queries from aggregated labels

* Introduce batching for querying in ElasticsearchDocStore and OpenSearchDocStore

* Use globally defined timeout

* Fix mypy

* Use self.batch_size in write_documents

* Use 10_000 as default batch size

* Add unit tests for write documents
2023-06-05 09:47:31 +02:00
bogdankostic
a9a49e2c0a
feat: Add batching for querying in ElasticsearchDocumentStore and OpenSearchDocumentStore (#5063)
* Include benchmark config in output

* Use queries from aggregated labels

* Introduce batching for querying in ElasticsearchDocStore and OpenSearchDocStore

* Fix mypy

* Use self.batch_size in write_documents

* Use 10_000 as default batch size

* Add unit tests for write documents
2023-06-01 18:47:24 +02:00
Silvano Cerza
ba06bc4805
Unpin typing_extensions and remove all its uses (#5040) 2023-05-29 15:31:34 +02:00