Make weaviate more compliant with other doc stores (UUIDs and dummy embeddings) (#1656)

* create uuid and dummy embedding in weaviate doc store

* handle and test for duplicate non-uuid-formatted ids in weaviate

* add uuid and dummy embedding to doc strings

* Add latest docstring and tutorial changes

* Upgrade weaviate

* Include weaviate in common doc store test cases

* Add latest docstring and tutorial changes

* Exclude weaviate doc store from eval tests

* Incorporate index name in uuid generation

* Ignore mypy error

* Fix typo

* Restore DOCS without uuid and embeddings generated by weaviate

* Supply docs for retriever tests as fixture

* Limit scope of fixture to function instead of session

* Add comments

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Julian Risch 2021-11-04 09:27:12 +01:00 committed by GitHub
parent 4ca1937775
commit 892ce4a760
12 changed files with 267 additions and 410 deletions

View File

@ -78,7 +78,7 @@ jobs:
run: docker run -d -p 19530:19530 -p 19121:19121 milvusdb/milvus:1.1.0-cpu-d050721-5e559c
- name: Run Weaviate
run: docker run -d -p 8080:8080 --name haystack_test_weaviate --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.0
run: docker run -d -p 8080:8080 --name haystack_test_weaviate --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.2
- name: Run GraphDB
run: docker run -d -p 7200:7200 --name haystack_test_graphdb deepset/graphdb-free:9.4.1-adoptopenjdk11

View File

@ -33,7 +33,7 @@ You can launch them like this:
```
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "ES_JAVA_OPTS=-Xms128m -Xmx128m" elasticsearch:7.9.2
docker run -d -p 19530:19530 -p 19121:19121 milvusdb/milvus:1.1.0-cpu-d050721-5e559c
docker run -d -p 8080:8080 --name haystack_test_weaviate --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.0
docker run -d -p 8080:8080 --name haystack_test_weaviate --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.2
docker run -d -p 7200:7200 --name haystack_test_graphdb deepset/graphdb-free:9.4.1-adoptopenjdk11
docker run -d -p 9998:9998 -e "TIKA_CHILD_JAVA_OPTS=-JXms128m" -e "TIKA_CHILD_JAVA_OPTS=-JXmx128m" apache/tika:1.24.1
```

View File

@ -1677,7 +1677,8 @@ Weaviate is a cloud-native, modular, real-time vector search engine built to sca
Some of the key differences in contrast to FAISS & Milvus:
1. Stores everything in one place: documents, metadata and vectors - so less network overhead when scaling this up
2. Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset
3. Has less variety of ANN algorithms, as of now only HNSW.
3. Has less variety of ANN algorithms, as of now only HNSW.
4. Requires document ids to be in uuid format. If wrongly formatted ids are provided at indexing time, they will be replaced with uuids automatically.
The Weaviate Python client is used to connect to the server; more details are here:
https://weaviate-python-client.readthedocs.io/en/docs/weaviate.html
@ -1735,7 +1736,7 @@ The current implementation is not supporting the storage of labels, so you canno
| get_document_by_id(id: str, index: Optional[str] = None) -> Optional[Document]
```
Fetch a document by specifying its text id string
Fetch a document by specifying its uuid string
<a name="weaviate.WeaviateDocumentStore.get_documents_by_id"></a>
#### get\_documents\_by\_id
@ -1744,7 +1745,7 @@ Fetch a document by specifying its text id string
| get_documents_by_id(ids: List[str], index: Optional[str] = None, batch_size: int = 10_000) -> List[Document]
```
Fetch documents by specifying a list of text id strings.
Fetch documents by specifying a list of uuid strings.
<a name="weaviate.WeaviateDocumentStore.write_documents"></a>
#### write\_documents
@ -1757,8 +1758,7 @@ Add new documents to the DocumentStore.
**Arguments**:
- `documents`: List of `Dicts` or List of `Documents`. Passing an Embedding/Vector is mandatory in case weaviate is not
configured with a module. If a module is configured, the embedding is automatically generated by Weaviate.
- `documents`: List of `Dicts` or List of `Documents`. A dummy embedding vector for each document is automatically generated if it is not provided. The document id needs to be in uuid format. Otherwise, a correctly formatted uuid will be automatically generated based on the provided id.
- `index`: index name for storing the docs and metadata
- `batch_size`: When working with large numbers of documents, batching can help reduce memory footprint.
- `duplicate_documents`: Handle duplicate documents based on parameter options.
@ -1785,6 +1785,15 @@ None
Update the metadata dictionary of a document by specifying its string id.
<a name="weaviate.WeaviateDocumentStore.get_embedding_count"></a>
#### get\_embedding\_count
```python
| get_embedding_count(filters: Optional[Dict[str, List[str]]] = None, index: Optional[str] = None) -> int
```
Return the number of embeddings in the document store, which is the same as the number of documents, since every document has a default embedding.
<a name="weaviate.WeaviateDocumentStore.get_document_count"></a>
#### get\_document\_count
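The deterministic uuid handling documented above can be sketched in isolation. The snippet below is a minimal illustration mirroring the documented behavior; `sanitize_id` is a hypothetical standalone helper, not the document store's actual API:

```python
import hashlib
import re
import uuid

# Pattern for a well-formed uuid, as used by the document store.
UUID_PATTERN = re.compile(r'^[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}$', re.IGNORECASE)

def sanitize_id(doc_id: str, index: str) -> str:
    """Return doc_id unchanged if it is already a uuid; otherwise derive a
    deterministic uuid from the id and the index name."""
    if UUID_PATTERN.match(doc_id):
        return doc_id
    # sha256 yields 64 hex chars; every second char gives the 32 a uuid needs.
    hashed = hashlib.sha256((doc_id + index).encode("utf-8"))
    return str(uuid.UUID(hashed.hexdigest()[::2]))

# The same non-uuid id always maps to the same uuid within one index ...
assert sanitize_id("doc_1", "Document") == sanitize_id("doc_1", "Document")
# ... but to a different uuid in another index.
assert sanitize_id("doc_1", "Document") != sanitize_id("doc_1", "OtherIndex")
```

Because the hash input includes the index name, two documents with the same provided id land on the same uuid only within the same index, which is what makes duplicate detection work per index.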

View File

@ -130,7 +130,7 @@ document_store = SQLDocumentStore()
The `WeaviateDocumentStore` requires a running Weaviate Server.
You can start a basic instance like this (see the [Weaviate docs](https://www.semi.technology/developers/weaviate/current/) for details):
```
docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.0
docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.2
```
Afterwards, you can use it in Haystack:

View File

@ -1,3 +1,6 @@
import hashlib
import re
import uuid
from typing import Dict, Generator, List, Optional, Union
import logging
@ -13,6 +16,7 @@ from weaviate import ObjectsBatchRequest
logger = logging.getLogger(__name__)
UUID_PATTERN = re.compile(r'^[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}$', re.IGNORECASE)
class WeaviateDocumentStore(BaseDocumentStore):
@ -24,7 +28,8 @@ class WeaviateDocumentStore(BaseDocumentStore):
Some of the key differences in contrast to FAISS & Milvus:
1. Stores everything in one place: documents, metadata and vectors - so less network overhead when scaling this up
2. Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset
3. Has less variety of ANN algorithms, as of now only HNSW.
3. Has less variety of ANN algorithms, as of now only HNSW.
4. Requires document ids to be in uuid format. If wrongly formatted ids are provided at indexing time, they will be replaced with uuids automatically.
The Weaviate Python client is used to connect to the server; more details are here:
https://weaviate-python-client.readthedocs.io/en/docs/weaviate.html
@ -120,7 +125,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
f"Initial connection to Weaviate failed. Make sure you run Weaviate instance "
f"at `{weaviate_url}` and that it has finished the initial ramp up (can take > 30s)."
)
self.index = index
self.index = self._sanitize_index_name(index)
self.embedding_dim = embedding_dim
self.content_field = content_field
self.name_field = name_field
@ -133,6 +138,15 @@ class WeaviateDocumentStore(BaseDocumentStore):
self.duplicate_documents = duplicate_documents
self._create_schema_and_index_if_not_exist(self.index)
self.uuid_format_warning_raised = False
def _sanitize_index_name(self, index: Optional[str]) -> Optional[str]:
if index is None:
return None
elif "_" in index:
return ''.join(x.capitalize() for x in index.split('_'))
else:
return index[0].upper() + index[1:]
def _create_schema_and_index_if_not_exist(
self,
@ -142,7 +156,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
Create a new index (schema/class in Weaviate) for storing documents in case an
index (schema) with that name doesn't already exist.
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
if self.custom_schema:
schema = self.custom_schema
@ -239,7 +253,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
}
def get_document_by_id(self, id: str, index: Optional[str] = None) -> Optional[Document]:
"""Fetch a document by specifying its text id string"""
"""Fetch a document by specifying its uuid string"""
# Sample result dict from a get method
'''{'class': 'Document',
'creationTimeUnix': 1621075584724,
@ -248,8 +262,11 @@ class WeaviateDocumentStore(BaseDocumentStore):
'name': 'name_5',
'content': 'text_5'},
'vector': []}'''
index = index or self.index
index = self._sanitize_index_name(index) or self.index
document = None
id = self._sanitize_id(id=id, index=index)
result = self.weaviate_client.data_object.get_by_id(id, with_vector=True)
if result:
document = self._convert_weaviate_result_to_document(result, return_embedding=True)
@ -258,23 +275,40 @@ class WeaviateDocumentStore(BaseDocumentStore):
def get_documents_by_id(self, ids: List[str], index: Optional[str] = None,
batch_size: int = 10_000) -> List[Document]:
"""
Fetch documents by specifying a list of text id strings.
Fetch documents by specifying a list of uuid strings.
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
documents = []
#TODO: better implementation with multiple where filters instead of chatty call below?
for id in ids:
id = self._sanitize_id(id=id, index=index)
result = self.weaviate_client.data_object.get_by_id(id, with_vector=True)
if result:
document = self._convert_weaviate_result_to_document(result, return_embedding=True)
documents.append(document)
return documents
def _sanitize_id(self, id: str, index: Optional[str] = None) -> str:
"""
Generate a valid uuid if the provided id is not in uuid format.
Two documents with the same provided id and index name will get the same uuid.
"""
index = self._sanitize_index_name(index) or self.index
if not UUID_PATTERN.match(id):
hashed_id = hashlib.sha256((id+index).encode('utf-8')) #type: ignore
generated_uuid = str(uuid.UUID(hashed_id.hexdigest()[::2]))
if not self.uuid_format_warning_raised:
logger.warning(
f"Document id {id} is not in uuid format. Such ids will be replaced by uuids, in this case {generated_uuid}.")
self.uuid_format_warning_raised = True
id = generated_uuid
return id
def _get_current_properties(self, index: Optional[str] = None) -> List[str]:
"""
Get all the existing properties in the schema.
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
cur_properties = []
for class_item in self.weaviate_client.schema.get()['classes']:
if class_item['class'] == index:
@ -309,7 +343,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
"""
Updates the schema with a new property.
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
property_dict = {
"dataType": [
"string"
@ -331,8 +365,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
"""
Add new documents to the DocumentStore.
:param documents: List of `Dicts` or List of `Documents`. Passing an Embedding/Vector is mandatory in case weaviate is not
configured with a module. If a module is configured, the embedding is automatically generated by Weaviate.
:param documents: List of `Dicts` or List of `Documents`. A dummy embedding vector for each document is automatically generated if it is not provided. The document id needs to be in uuid format. Otherwise, a correctly formatted uuid will be automatically generated based on the provided id.
:param index: index name for storing the docs and metadata
:param batch_size: When working with large numbers of documents, batching can help reduce memory footprint.
:param duplicate_documents: Handle duplicate documents based on parameter options.
@ -344,7 +377,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
:raises DuplicateDocumentError: Exception triggered on duplicate document
:return: None
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
self._create_schema_and_index_if_not_exist(index)
field_map = self._create_document_field_map()
@ -361,9 +394,30 @@ class WeaviateDocumentStore(BaseDocumentStore):
current_properties = self._get_current_properties(index)
document_objects = [Document.from_dict(d, field_map=field_map) if isinstance(d, dict) else d for d in documents]
# Weaviate has strict requirements for what ids can be used.
# We check the id format and sanitize it if no uuid was provided.
# Duplicate document ids will be mapped to the same generated uuid.
for do in document_objects:
do.id = self._sanitize_id(id=do.id, index=index)
document_objects = self._handle_duplicate_documents(documents=document_objects,
index=index,
duplicate_documents=duplicate_documents)
# Weaviate requires that documents contain a vector in order to be indexed. These lines add a
# dummy vector so that indexing can still happen
dummy_embed_warning_raised = False
for do in document_objects:
if do.embedding is None:
dummy_embedding = np.random.rand(self.embedding_dim).astype(np.float32)
do.embedding = dummy_embedding
if not dummy_embed_warning_raised:
logger.warning("No embedding found in Document object being written into Weaviate. A dummy "
"embedding is being supplied so that indexing can still take place. This "
"embedding should be overwritten in order to perform vector similarity searches.")
dummy_embed_warning_raised = True
batched_documents = get_batches_from_generator(document_objects, batch_size)
with tqdm(total=len(document_objects), disable=not self.progress_bar) as progress_bar:
for document_batch in batched_documents:
@ -417,11 +471,17 @@ class WeaviateDocumentStore(BaseDocumentStore):
"""
self.weaviate_client.data_object.update(meta, class_name=self.index, uuid=id)
def get_embedding_count(self, filters: Optional[Dict[str, List[str]]] = None, index: Optional[str] = None) -> int:
"""
Return the number of embeddings in the document store, which is the same as the number of documents, since every document has a default embedding.
"""
return self.get_document_count(filters=filters, index=index)
def get_document_count(self, filters: Optional[Dict[str, List[str]]] = None, index: Optional[str] = None) -> int:
"""
Return the number of documents in the document store.
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
doc_count = 0
if filters:
filter_dict = self._build_filter_clause(filters=filters)
@ -457,7 +517,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
:param return_embedding: Whether to return the document embeddings.
:param batch_size: When working with large numbers of documents, batching can help reduce memory footprint.
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
result = self.get_all_documents_generator(
index=index, filters=filters, return_embedding=return_embedding, batch_size=batch_size
)
@ -474,7 +534,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
"""
Return all documents in a specific index in the document store
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
# Build the properties to retrieve from Weaviate
properties = self._get_current_properties(index)
@ -516,8 +576,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
:param batch_size: When working with large numbers of documents, batching can help reduce memory footprint.
"""
if index is None:
index = self.index
index = self._sanitize_index_name(index) or self.index
if return_embedding is None:
return_embedding = self.return_embedding
@ -546,7 +605,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
https://www.semi.technology/developers/weaviate/current/graphql-references/filters.html
:param index: The name of the index in the DocumentStore from which to retrieve documents
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
# Build the properties to retrieve from Weaviate
properties = self._get_current_properties(index)
@ -597,7 +656,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
"""
if return_embedding is None:
return_embedding = self.return_embedding
index = index or self.index
index = self._sanitize_index_name(index) or self.index
# Build the properties to retrieve from Weaviate
properties = self._get_current_properties(index)
@ -658,8 +717,7 @@ class WeaviateDocumentStore(BaseDocumentStore):
:param batch_size: When working with large numbers of documents, batching can help reduce memory footprint.
:return: None
"""
if index is None:
index = self.index
index = self._sanitize_index_name(index) or self.index
if not self.embedding_field:
raise RuntimeError("Specify the arg `embedding_field` when initializing WeaviateDocumentStore()")
@ -718,7 +776,11 @@ class WeaviateDocumentStore(BaseDocumentStore):
have their ID in the list).
:return: None
"""
index = index or self.index
index = self._sanitize_index_name(index) or self.index
# create index if it doesn't exist yet
self._create_schema_and_index_if_not_exist(index)
if not filters and not ids:
self.weaviate_client.schema.delete_class(index)
self._create_schema_and_index_if_not_exist(index)
@ -728,7 +790,3 @@ class WeaviateDocumentStore(BaseDocumentStore):
docs_to_delete = [doc for doc in docs_to_delete if doc.id in ids]
for doc in docs_to_delete:
self.weaviate_client.data_object.delete(doc.id)
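The dummy-embedding fallback added to the write path above can be sketched on its own. This is an illustrative helper under the assumption of a 768-dimensional embedding space; `ensure_embedding` is a hypothetical name, not part of the store's API:

```python
import numpy as np

def ensure_embedding(embedding, embedding_dim=768):
    """Return the given vector unchanged, or a random float32 placeholder so
    that Weaviate can index the document. The placeholder is meant to be
    overwritten later (e.g. via update_embeddings) before similarity search."""
    if embedding is None:
        return np.random.rand(embedding_dim).astype(np.float32)
    return embedding

vec = ensure_embedding(None)
assert vec.shape == (768,) and vec.dtype == np.float32
```

The design choice mirrors the diff: Weaviate refuses documents without a vector, so supplying a placeholder keeps indexing working, at the cost that `return_embedding=False` semantics differ from other stores (a dummy vector exists instead of `None`).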

View File

@ -9,6 +9,7 @@ from haystack.utils.doc_store import (
launch_milvus,
launch_open_distro_es,
launch_opensearch,
launch_weaviate,
stop_opensearch,
stop_service,
)

View File

@ -50,6 +50,20 @@ def launch_opensearch(sleep=15):
time.sleep(sleep)
def launch_weaviate(sleep=15):
# Start a Weaviate server via Docker
logger.info("Starting Weaviate ...")
status = subprocess.run(
["docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.2"], shell=True
)
if status.returncode:
logger.warning("Tried to start Weaviate through Docker but this failed. "
"It is likely that there is already an existing Weaviate instance running. ")
else:
time.sleep(sleep)
def stop_opensearch():
logger.info("Stopping OpenSearch...")
status = subprocess.run(['docker stop opensearch'], shell=True)

View File

@ -57,13 +57,6 @@ def pytest_generate_tests(metafunc):
break
# for all others that don't have explicit parametrization, we add the ones from the CLI arg
if 'document_store' in metafunc.fixturenames and not found_mark_parametrize_document_store:
# TODO: Remove the following if-condition once weaviate is fully compliant
# Background: Currently, weaviate is not fully compliant (e.g. "_" in "meta_field", problems with uuids ...)
# Therefore, we have separate tests in test_weaviate.py and we don't want to parametrize our generic
# tests (e.g. in test_document_store.py) with the weaviate fixture. However, we still need the weaviate option
# in the CLI arg as we want to skip test_weaviate.py if weaviate is not selected from CLI
if "weaviate" in selected_doc_stores:
selected_doc_stores.remove("weaviate")
metafunc.parametrize("document_store", selected_doc_stores, indirect=True)
@ -172,7 +165,7 @@ def weaviate_fixture():
shell=True
)
status = subprocess.run(
['docker run -d --name haystack_test_weaviate -p 8080:8080 semitechnologies/weaviate:1.4.0'],
['docker run -d --name haystack_test_weaviate -p 8080:8080 semitechnologies/weaviate:1.7.2'],
shell=True
)
if status.returncode:
@ -447,8 +440,7 @@ def get_retriever(retriever_type, document_store):
return retriever
@pytest.fixture(params=["elasticsearch", "faiss", "memory", "milvus"])
# @pytest.fixture(params=["memory"])
@pytest.fixture(params=["elasticsearch", "faiss", "memory", "milvus", "weaviate"])
def document_store_with_docs(request, test_docs_xs):
document_store = get_document_store(request.param)
document_store.write_documents(test_docs_xs)
@ -516,7 +508,7 @@ def get_document_store(document_store_type, embedding_dim=768, embedding_field="
elif document_store_type == "weaviate":
document_store = WeaviateDocumentStore(
weaviate_url="http://localhost:8080",
index=index.replace('_','').title(),
index=index,
similarity=similarity
)
document_store.weaviate_client.schema.delete_all()
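The conftest change above passes the raw index name because the store now capitalizes names itself. A standalone sketch of that capitalization logic, mirroring `_sanitize_index_name` from the diff (an illustrative copy, not an import from the library):

```python
def sanitize_index_name(index):
    """Weaviate class names must start with a capital letter, so snake_case
    index names are converted to CamelCase and other names are capitalized."""
    if index is None:
        return None
    if "_" in index:
        return "".join(part.capitalize() for part in index.split("_"))
    return index[0].upper() + index[1:]

assert sanitize_index_name("haystack_test_one") == "HaystackTestOne"
assert sanitize_index_name("document") == "Document"
```

This also explains why the tests below rename `haystack_test_1` to `haystack_test_one`: `"1".capitalize()` is still `"1"`, so numeric suffixes survive sanitization but collide with Weaviate's class-name rules, while spelled-out suffixes map cleanly.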

View File

@ -4,6 +4,8 @@ import pytest
from elasticsearch import Elasticsearch
from conftest import get_document_store
from haystack.document_stores import WeaviateDocumentStore
from haystack.errors import DuplicateDocumentError
from haystack.schema import Document, Label, Answer, Span
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.document_stores.faiss import FAISSDocumentStore
@ -49,6 +51,7 @@ def test_write_with_duplicate_doc_ids(document_store):
document_store.write_documents(documents, duplicate_documents="fail")
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus", "weaviate"], indirect=True)
def test_write_with_duplicate_doc_ids_custom_index(document_store):
documents = [
Document(
@ -62,9 +65,24 @@ def test_write_with_duplicate_doc_ids_custom_index(document_store):
]
document_store.delete_documents(index="haystack_custom_test")
document_store.write_documents(documents, index="haystack_custom_test", duplicate_documents="skip")
with pytest.raises(Exception):
with pytest.raises(DuplicateDocumentError):
document_store.write_documents(documents, index="haystack_custom_test", duplicate_documents="fail")
# Weaviate manipulates document objects in-place when writing them to an index.
# It generates a uuid based on the provided id and the index name where the document is added to.
# We need to get rid of these generated uuids for this test and therefore reset the document objects.
# As a result, the documents will receive a fresh uuid based on their id_hash_keys and a different index name.
if isinstance(document_store, WeaviateDocumentStore):
documents = [
Document(
content="Doc1",
id_hash_keys=["key1"]
),
Document(
content="Doc2",
id_hash_keys=["key1"]
)
]
# writing to the default, empty index should still work
document_store.write_documents(documents, duplicate_documents="fail")
@ -220,13 +238,13 @@ def test_write_document_index(document_store):
{"content": "text1", "id": "1"},
{"content": "text2", "id": "2"},
]
document_store.write_documents([documents[0]], index="haystack_test_1")
assert len(document_store.get_all_documents(index="haystack_test_1")) == 1
document_store.write_documents([documents[0]], index="haystack_test_one")
assert len(document_store.get_all_documents(index="haystack_test_one")) == 1
document_store.write_documents([documents[1]], index="haystack_test_2")
assert len(document_store.get_all_documents(index="haystack_test_2")) == 1
document_store.write_documents([documents[1]], index="haystack_test_two")
assert len(document_store.get_all_documents(index="haystack_test_two")) == 1
assert len(document_store.get_all_documents(index="haystack_test_1")) == 1
assert len(document_store.get_all_documents(index="haystack_test_one")) == 1
assert len(document_store.get_all_documents()) == 0
@ -237,13 +255,15 @@ def test_document_with_embeddings(document_store):
{"content": "text3", "id": "3", "embedding": np.random.rand(768).astype(np.float32).tolist()},
{"content": "text4", "id": "4", "embedding": np.random.rand(768).astype(np.float32)},
]
document_store.write_documents(documents, index="haystack_test_1")
assert len(document_store.get_all_documents(index="haystack_test_1")) == 4
document_store.write_documents(documents, index="haystack_test_one")
assert len(document_store.get_all_documents(index="haystack_test_one")) == 4
documents_without_embedding = document_store.get_all_documents(index="haystack_test_1", return_embedding=False)
assert documents_without_embedding[0].embedding is None
if not isinstance(document_store, WeaviateDocumentStore):
# weaviate is excluded because it would return dummy vectors instead of None
documents_without_embedding = document_store.get_all_documents(index="haystack_test_one", return_embedding=False)
assert documents_without_embedding[0].embedding is None
documents_with_embedding = document_store.get_all_documents(index="haystack_test_1", return_embedding=True)
documents_with_embedding = document_store.get_all_documents(index="haystack_test_one", return_embedding=True)
assert isinstance(documents_with_embedding[0].embedding, (list, np.ndarray))
@ -254,15 +274,15 @@ def test_update_embeddings(document_store, retriever):
documents.append({"content": f"text_{i}", "id": str(i), "meta_field": f"value_{i}"})
documents.append({"content": "text_0", "id": "6", "meta_field": "value_0"})
document_store.write_documents(documents, index="haystack_test_1")
document_store.update_embeddings(retriever, index="haystack_test_1", batch_size=3)
documents = document_store.get_all_documents(index="haystack_test_1", return_embedding=True)
document_store.write_documents(documents, index="haystack_test_one")
document_store.update_embeddings(retriever, index="haystack_test_one", batch_size=3)
documents = document_store.get_all_documents(index="haystack_test_one", return_embedding=True)
assert len(documents) == 7
for doc in documents:
assert type(doc.embedding) is np.ndarray
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_0"]},
return_embedding=True,
)
@ -272,53 +292,57 @@ def test_update_embeddings(document_store, retriever):
np.testing.assert_array_almost_equal(documents[0].embedding, documents[1].embedding, decimal=4)
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_0", "value_5"]},
return_embedding=True,
)
documents_with_value_0 = [doc for doc in documents if doc.meta["meta_field"] == "value_0"]
documents_with_value_5 = [doc for doc in documents if doc.meta["meta_field"] == "value_5"]
np.testing.assert_raises(
AssertionError,
np.testing.assert_array_equal,
documents[0].embedding,
documents[1].embedding
documents_with_value_0[0].embedding,
documents_with_value_5[0].embedding
)
doc = {"content": "text_7", "id": "7", "meta_field": "value_7",
"embedding": retriever.embed_queries(texts=["a random string"])[0]}
document_store.write_documents([doc], index="haystack_test_1")
document_store.write_documents([doc], index="haystack_test_one")
documents = []
for i in range(8, 11):
documents.append({"content": f"text_{i}", "id": str(i), "meta_field": f"value_{i}"})
document_store.write_documents(documents, index="haystack_test_1")
document_store.write_documents(documents, index="haystack_test_one")
doc_before_update = document_store.get_all_documents(index="haystack_test_1", filters={"meta_field": ["value_7"]})[0]
doc_before_update = document_store.get_all_documents(index="haystack_test_one", filters={"meta_field": ["value_7"]})[0]
embedding_before_update = doc_before_update.embedding
# test updating only documents without embeddings
document_store.update_embeddings(retriever, index="haystack_test_1", batch_size=3, update_existing_embeddings=False)
doc_after_update = document_store.get_all_documents(index="haystack_test_1", filters={"meta_field": ["value_7"]})[0]
embedding_after_update = doc_after_update.embedding
np.testing.assert_array_equal(embedding_before_update, embedding_after_update)
if not isinstance(document_store, WeaviateDocumentStore):
# All the documents in a Weaviate store have an embedding by default. "update_existing_embeddings=False" is not allowed
document_store.update_embeddings(retriever, index="haystack_test_one", batch_size=3, update_existing_embeddings=False)
doc_after_update = document_store.get_all_documents(index="haystack_test_one", filters={"meta_field": ["value_7"]})[0]
embedding_after_update = doc_after_update.embedding
np.testing.assert_array_equal(embedding_before_update, embedding_after_update)
# test updating with filters
if isinstance(document_store, FAISSDocumentStore):
with pytest.raises(Exception):
document_store.update_embeddings(
retriever, index="haystack_test_1", update_existing_embeddings=True, filters={"meta_field": ["value"]}
retriever, index="haystack_test_one", update_existing_embeddings=True, filters={"meta_field": ["value"]}
)
else:
document_store.update_embeddings(
retriever, index="haystack_test_1", batch_size=3, filters={"meta_field": ["value_0", "value_1"]}
retriever, index="haystack_test_one", batch_size=3, filters={"meta_field": ["value_0", "value_1"]}
)
doc_after_update = document_store.get_all_documents(index="haystack_test_1", filters={"meta_field": ["value_7"]})[0]
doc_after_update = document_store.get_all_documents(index="haystack_test_one", filters={"meta_field": ["value_7"]})[0]
embedding_after_update = doc_after_update.embedding
np.testing.assert_array_equal(embedding_before_update, embedding_after_update)
# test update all embeddings
document_store.update_embeddings(retriever, index="haystack_test_1", batch_size=3, update_existing_embeddings=True)
assert document_store.get_embedding_count(index="haystack_test_1") == 11
doc_after_update = document_store.get_all_documents(index="haystack_test_1", filters={"meta_field": ["value_7"]})[0]
document_store.update_embeddings(retriever, index="haystack_test_one", batch_size=3, update_existing_embeddings=True)
assert document_store.get_embedding_count(index="haystack_test_one") == 11
doc_after_update = document_store.get_all_documents(index="haystack_test_one", filters={"meta_field": ["value_7"]})[0]
embedding_after_update = doc_after_update.embedding
np.testing.assert_raises(AssertionError, np.testing.assert_array_equal, embedding_before_update, embedding_after_update)
@ -326,9 +350,12 @@ def test_update_embeddings(document_store, retriever):
documents = []
for i in range(12, 15):
documents.append({"content": f"text_{i}", "id": str(i), "meta_field": f"value_{i}"})
document_store.write_documents(documents, index="haystack_test_1")
document_store.update_embeddings(retriever, index="haystack_test_1", batch_size=3, update_existing_embeddings=False)
assert document_store.get_embedding_count(index="haystack_test_1") == 14
document_store.write_documents(documents, index="haystack_test_one")
if not isinstance(document_store, WeaviateDocumentStore):
# All the documents in a Weaviate store have an embedding by default. "update_existing_embeddings=False" is not allowed
document_store.update_embeddings(retriever, index="haystack_test_one", batch_size=3, update_existing_embeddings=False)
assert document_store.get_embedding_count(index="haystack_test_one") == 14
@pytest.mark.parametrize("retriever", ["table_text_retriever"], indirect=True)
@ -354,16 +381,16 @@ def test_update_embeddings_table_text_retriever(document_store, retriever):
"meta_field": "value_table_0",
"content_type": "table"})
document_store.write_documents(documents, index="haystack_test_1")
document_store.update_embeddings(retriever, index="haystack_test_1", batch_size=3)
documents = document_store.get_all_documents(index="haystack_test_1", return_embedding=True)
document_store.write_documents(documents, index="haystack_test_one")
document_store.update_embeddings(retriever, index="haystack_test_one", batch_size=3)
documents = document_store.get_all_documents(index="haystack_test_one", return_embedding=True)
assert len(documents) == 8
for doc in documents:
assert type(doc.embedding) is np.ndarray
# Check if Documents with same content (text) get same embedding
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_text_0"]},
return_embedding=True,
)
@ -374,7 +401,7 @@ def test_update_embeddings_table_text_retriever(document_store, retriever):
# Check if Documents with same content (table) get same embedding
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_table_0"]},
return_embedding=True,
)
@ -385,7 +412,7 @@ def test_update_embeddings_table_text_retriever(document_store, retriever):
# Check if Documents wih different content (text) get different embedding
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_text_1", "value_text_2"]},
return_embedding=True,
)
@ -398,7 +425,7 @@ def test_update_embeddings_table_text_retriever(document_store, retriever):
# Check if Documents with different content (table) get different embeddings
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_table_1", "value_table_2"]},
return_embedding=True,
)
@ -411,7 +438,7 @@ def test_update_embeddings_table_text_retriever(document_store, retriever):
# Check if Documents with different content (table + text) get different embeddings
documents = document_store.get_all_documents(
index="haystack_test_1",
index="haystack_test_one",
filters={"meta_field": ["value_text_1", "value_table_1"]},
return_embedding=True,
)
@ -438,17 +465,25 @@ def test_delete_documents(document_store_with_docs):
documents = document_store_with_docs.get_all_documents()
assert len(documents) == 0
def test_delete_documents_by_id(document_store_with_docs):
doc_ids = [doc.id for doc in document_store_with_docs.get_all_documents()]
assert len(doc_ids) == 3
docs_to_delete = doc_ids[0:2]
document_store_with_docs.delete_documents(ids=docs_to_delete)
def test_delete_documents_with_filters(document_store_with_docs):
document_store_with_docs.delete_documents(filters={"meta_field": ["test1", "test2"]})
documents = document_store_with_docs.get_all_documents()
assert len(documents) == 1
assert documents[0].id == doc_ids[2]
assert documents[0].meta["meta_field"] == "test3"
def test_delete_documents_by_id(document_store_with_docs):
docs_to_delete = document_store_with_docs.get_all_documents(filters={"meta_field": ["test1", "test2"]})
docs_not_to_delete = document_store_with_docs.get_all_documents(filters={"meta_field": ["test3"]})
document_store_with_docs.delete_documents(ids=[doc.id for doc in docs_to_delete])
all_docs_left = document_store_with_docs.get_all_documents()
assert len(all_docs_left) == 1
assert all_docs_left[0].meta["meta_field"] == "test3"
all_ids_left = [doc.id for doc in all_docs_left]
assert all(doc.id in all_ids_left for doc in docs_not_to_delete)
def test_delete_documents_by_id_with_filters(document_store_with_docs):
@ -464,14 +499,9 @@ def test_delete_documents_by_id_with_filters(document_store_with_docs):
all_ids_left = [doc.id for doc in all_docs_left]
assert all(doc.id in all_ids_left for doc in docs_not_to_delete)
def test_delete_documents_with_filters(document_store_with_docs):
document_store_with_docs.delete_documents(filters={"meta_field": ["test1", "test2"]})
documents = document_store_with_docs.get_all_documents()
assert len(documents) == 1
assert documents[0].meta["meta_field"] == "test3"
# exclude weaviate because it does not support storing labels
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
def test_labels(document_store):
label = Label(
query="question1",
@ -556,6 +586,8 @@ def test_labels(document_store):
assert len(labels) == 0
# exclude weaviate because it does not support storing labels
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
def test_multilabel(document_store):
labels =[
Label(
@ -666,8 +698,9 @@ def test_multilabel(document_store):
assert len(docs) == 0
# exclude weaviate because it does not support storing labels
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
def test_multilabel_no_answer(document_store):
labels = [
Label(
query="question",


@ -4,6 +4,8 @@ from haystack.nodes.preprocessor import PreProcessor
from haystack.nodes.evaluator import EvalAnswers, EvalDocuments
from haystack.pipelines.base import Pipeline
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
@pytest.mark.parametrize("batch_size", [None, 20])
def test_add_eval_data(document_store, batch_size):
# add eval data (SQUAD format)
@ -47,6 +49,7 @@ def test_add_eval_data(document_store, batch_size):
assert doc.content[start:end] == "France"
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
@pytest.mark.parametrize("reader", ["farm"], indirect=True)
def test_eval_reader(reader, document_store: BaseDocumentStore):
# add eval data (SQUAD format)
@ -136,6 +139,7 @@ def test_eval_pipeline(document_store: BaseDocumentStore, reader, retriever):
assert eval_reader.top_k_em == eval_reader_vanila.top_k_em
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
def test_eval_data_split_word(document_store):
# splitting by word
preprocessor = PreProcessor(
@ -160,6 +164,7 @@ def test_eval_data_split_word(document_store):
assert len(set(labels[0].document_ids)) == 2
@pytest.mark.parametrize("document_store", ["elasticsearch", "faiss", "memory", "milvus"], indirect=True)
def test_eval_data_split_passage(document_store):
# splitting by passage
preprocessor = PreProcessor(


@ -4,6 +4,8 @@ import numpy as np
import pandas as pd
import pytest
from elasticsearch import Elasticsearch
from haystack.document_stores import WeaviateDocumentStore
from haystack.schema import Document
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.document_stores.faiss import FAISSDocumentStore
@ -13,32 +15,35 @@ from haystack.nodes.retriever.sparse import ElasticsearchRetriever, Elasticsearc
from transformers import DPRContextEncoderTokenizerFast, DPRQuestionEncoderTokenizerFast
DOCS = [
Document(
content="""Aaron Aaron ( or ; ""Ahärôn"") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother's spokesman (""prophet"") to the Pharaoh. Part of the Law (Torah) that Moses received from""",
meta={"name": "0"},
id="1",
),
Document(
content="""Democratic Republic of the Congo to the south. Angola's capital, Luanda, lies on the Atlantic coast in the northwest of the country. Angola, although located in a tropical zone, has a climate that is not characterized for this region, due to the confluence of three factors: As a result, Angola's climate is characterized by two seasons: rainfall from October to April and drought, known as ""Cacimbo"", from May to August, drier, as the name implies, and with lower temperatures. On the other hand, while the coastline has high rainfall rates, decreasing from North to South and from to , with""",
id="2",
),
Document(
content="""Schopenhauer, describing him as an ultimately shallow thinker: ""Schopenhauer has quite a crude mind ... where real depth starts, his comes to an end."" His friend Bertrand Russell had a low opinion on the philosopher, and attacked him in his famous ""History of Western Philosophy"" for hypocritically praising asceticism yet not acting upon it. On the opposite isle of Russell on the foundations of mathematics, the Dutch mathematician L. E. J. Brouwer incorporated the ideas of Kant and Schopenhauer in intuitionism, where mathematics is considered a purely mental activity, instead of an analytic activity wherein objective properties of reality are""",
meta={"name": "1"},
id="3",
),
Document(
content="""The Dothraki vocabulary was created by David J. Peterson well in advance of the adaptation. HBO hired the Language Creatio""",
meta={"name": "2"},
id="4",
),
Document(
content="""The title of the episode refers to the Great Sept of Baelor, the main religious building in King's Landing, where the episode's pivotal scene takes place. In the world created by George R. R. Martin""",
meta={},
id="5",
),
]
@pytest.fixture()
def docs():
documents = [
Document(
content="""Aaron Aaron ( or ; ""Ahärôn"") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother's spokesman (""prophet"") to the Pharaoh. Part of the Law (Torah) that Moses received from""",
meta={"name": "0"},
id="1",
),
Document(
content="""Democratic Republic of the Congo to the south. Angola's capital, Luanda, lies on the Atlantic coast in the northwest of the country. Angola, although located in a tropical zone, has a climate that is not characterized for this region, due to the confluence of three factors: As a result, Angola's climate is characterized by two seasons: rainfall from October to April and drought, known as ""Cacimbo"", from May to August, drier, as the name implies, and with lower temperatures. On the other hand, while the coastline has high rainfall rates, decreasing from North to South and from to , with""",
id="2",
),
Document(
content="""Schopenhauer, describing him as an ultimately shallow thinker: ""Schopenhauer has quite a crude mind ... where real depth starts, his comes to an end."" His friend Bertrand Russell had a low opinion on the philosopher, and attacked him in his famous ""History of Western Philosophy"" for hypocritically praising asceticism yet not acting upon it. On the opposite isle of Russell on the foundations of mathematics, the Dutch mathematician L. E. J. Brouwer incorporated the ideas of Kant and Schopenhauer in intuitionism, where mathematics is considered a purely mental activity, instead of an analytic activity wherein objective properties of reality are""",
meta={"name": "1"},
id="3",
),
Document(
content="""The Dothraki vocabulary was created by David J. Peterson well in advance of the adaptation. HBO hired the Language Creatio""",
meta={"name": "2"},
id="4",
),
Document(
content="""The title of the episode refers to the Great Sept of Baelor, the main religious building in King's Landing, where the episode's pivotal scene takes place. In the world created by George R. R. Martin""",
meta={},
id="5",
),
]
return documents
# TODO: check if this works with only the "memory" arg
@pytest.mark.parametrize(
@ -142,10 +147,10 @@ def test_elasticsearch_custom_query():
@pytest.mark.slow
@pytest.mark.parametrize("retriever", ["dpr"], indirect=True)
def test_dpr_embedding(document_store, retriever):
def test_dpr_embedding(document_store, retriever, docs):
document_store.return_embedding = True
document_store.write_documents(DOCS)
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever)
time.sleep(1)
@ -165,10 +170,19 @@ def test_dpr_embedding(document_store, retriever):
@pytest.mark.slow
@pytest.mark.parametrize("retriever", ["retribert"], indirect=True)
@pytest.mark.vector_dim(128)
def test_retribert_embedding(document_store, retriever):
def test_retribert_embedding(document_store, retriever, docs):
if isinstance(document_store, WeaviateDocumentStore):
# Weaviate sets the embedding dimension to 768 as soon as it is initialized.
# We need 128 here and therefore initialize a new WeaviateDocumentStore.
document_store = WeaviateDocumentStore(
weaviate_url="http://localhost:8080",
index="haystack_test",
embedding_dim=128
)
document_store.weaviate_client.schema.delete_all()
document_store._create_schema_and_index_if_not_exist()
document_store.return_embedding = True
document_store.write_documents(DOCS)
document_store.write_documents(docs)
document_store.update_embeddings(retriever=retriever)
time.sleep(1)
@ -184,10 +198,10 @@ def test_retribert_embedding(document_store, retriever):
@pytest.mark.parametrize("retriever", ["table_text_retriever"], indirect=True)
@pytest.mark.parametrize("document_store", ["elasticsearch"], indirect=True)
@pytest.mark.vector_dim(512)
def test_table_text_retriever_embedding(document_store, retriever):
def test_table_text_retriever_embedding(document_store, retriever, docs):
document_store.return_embedding = True
document_store.write_documents(DOCS)
document_store.write_documents(docs)
table_data = {
"Mountain": ["Mount Everest", "K2", "Kangchenjunga", "Lhotse", "Makalu"],
"Height": ["8848m", "8,611 m", "8 586m", "8 516 m", "8,485m"]


@ -6,11 +6,13 @@ import uuid
embedding_dim = 768
def get_uuid():
return str(uuid.uuid4())
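The helper above generates fresh random UUIDs for test data. The PR also handles documents whose supplied ids are not UUID-formatted (see the `"not a correct uuid"` entry below) by deriving a deterministic UUID that incorporates the index name. A hedged sketch of that mapping, with a hypothetical helper name (not the actual Haystack implementation):

```python
import uuid

def to_weaviate_id(doc_id: str, index: str) -> str:
    # Hypothetical helper: keep ids that are already valid UUIDs;
    # otherwise derive a deterministic UUID from the index name plus
    # the original id, so the same (index, id) pair always maps to
    # the same UUID and duplicate writes can be detected.
    try:
        return str(uuid.UUID(doc_id))
    except ValueError:
        return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{index}/{doc_id}"))
```

Including the index name in the derivation means the same non-UUID id written to two different indexes yields two distinct Weaviate object ids.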
DOCUMENTS = [
{"content": "text1", "id":get_uuid(), "key": "a", "embedding": np.random.rand(embedding_dim).astype(np.float32)},
{"content": "text1", "id":"not a correct uuid", "key": "a"},
{"content": "text2", "id":get_uuid(), "key": "b", "embedding": np.random.rand(embedding_dim).astype(np.float32)},
{"content": "text3", "id":get_uuid(), "key": "b", "embedding": np.random.rand(embedding_dim).astype(np.float32)},
{"content": "text4", "id":get_uuid(), "key": "b", "embedding": np.random.rand(embedding_dim).astype(np.float32)},
@ -26,6 +28,7 @@ DOCUMENTS_XS = [
Document(content="My name is Christelle and I live in Paris", id=get_uuid(), meta={"metafield": "test3", "name": "filename3"}, embedding=np.random.rand(embedding_dim).astype(np.float32))
]
@pytest.fixture(params=["weaviate"])
def document_store_with_docs(request):
document_store = get_document_store(request.param)
@ -33,56 +36,13 @@ def document_store_with_docs(request):
yield document_store
document_store.delete_documents()
@pytest.fixture(params=["weaviate"])
def document_store(request):
document_store = get_document_store(request.param)
yield document_store
document_store.delete_documents()
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store_with_docs", ["weaviate"], indirect=True)
def test_get_all_documents_without_filters(document_store_with_docs):
documents = document_store_with_docs.get_all_documents()
assert all(isinstance(d, Document) for d in documents)
assert len(documents) == 3
assert {d.meta["name"] for d in documents} == {"filename1", "filename2", "filename3"}
assert {d.meta["metafield"] for d in documents} == {"test1", "test2", "test3"}
@pytest.mark.weaviate
def test_get_all_documents_with_correct_filters(document_store_with_docs):
documents = document_store_with_docs.get_all_documents(filters={"metafield": ["test2"]})
assert len(documents) == 1
assert documents[0].meta["name"] == "filename2"
documents = document_store_with_docs.get_all_documents(filters={"metafield": ["test1", "test3"]})
assert len(documents) == 2
assert {d.meta["name"] for d in documents} == {"filename1", "filename3"}
assert {d.meta["metafield"] for d in documents} == {"test1", "test3"}
@pytest.mark.weaviate
def test_get_all_documents_with_incorrect_filter_name(document_store_with_docs):
documents = document_store_with_docs.get_all_documents(filters={"incorrectmetafield": ["test2"]})
assert len(documents) == 0
@pytest.mark.weaviate
def test_get_all_documents_with_incorrect_filter_value(document_store_with_docs):
documents = document_store_with_docs.get_all_documents(filters={"metafield": ["incorrect_value"]})
assert len(documents) == 0
@pytest.mark.weaviate
def test_get_documents_by_id(document_store_with_docs):
documents = document_store_with_docs.get_all_documents()
doc = document_store_with_docs.get_document_by_id(documents[0].id)
assert doc.id == documents[0].id
assert doc.content == documents[0].content
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_get_document_count(document_store):
document_store.write_documents(DOCUMENTS)
assert document_store.get_document_count() == 5
assert document_store.get_document_count(filters={"key": ["a"]}) == 1
assert document_store.get_document_count(filters={"key": ["b"]}) == 4
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
@ -98,189 +58,6 @@ def test_weaviate_write_docs(document_store, batch_size):
documents_indexed = document_store.get_all_documents(batch_size=batch_size)
assert len(documents_indexed) == len(DOCUMENTS)
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_get_all_document_filter_duplicate_value(document_store):
documents = [
Document(
content="Doc1",
meta={"fone": "f0"},
id=get_uuid(),
embedding=np.random.rand(embedding_dim).astype(np.float32)
),
Document(
content="Doc1",
meta={"fone": "f1", "metaid": "0"},
id=get_uuid(),
embedding=np.random.rand(embedding_dim).astype(np.float32)
),
Document(
content="Doc2",
meta={"fthree": "f0"},
id=get_uuid(),
embedding=np.random.rand(embedding_dim).astype(np.float32)
)
]
document_store.write_documents(documents)
documents = document_store.get_all_documents(filters={"fone": ["f1"]})
assert documents[0].content == "Doc1"
assert len(documents) == 1
assert {d.meta["metaid"] for d in documents} == {"0"}
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_get_all_documents_generator(document_store):
document_store.write_documents(DOCUMENTS)
assert len(list(document_store.get_all_documents_generator(batch_size=2))) == 5
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_write_with_duplicate_doc_ids(document_store):
id = get_uuid()
documents = [
Document(
content="Doc1",
id=id,
embedding=np.random.rand(embedding_dim).astype(np.float32)
),
Document(
content="Doc2",
id=id,
embedding=np.random.rand(embedding_dim).astype(np.float32)
)
]
document_store.write_documents(documents, duplicate_documents="skip")
with pytest.raises(Exception):
document_store.write_documents(documents, duplicate_documents="fail")
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
@pytest.mark.parametrize("update_existing_documents", [True, False])
def test_update_existing_documents(document_store, update_existing_documents):
id = str(uuid.uuid4())
original_docs = [
{"content": "text1_orig", "id": id, "metafieldforcount": "a", "embedding": np.random.rand(embedding_dim).astype(np.float32)},
]
updated_docs = [
{"content": "text1_new", "id": id, "metafieldforcount": "a", "embedding": np.random.rand(embedding_dim).astype(np.float32)},
]
document_store.update_existing_documents = update_existing_documents
document_store.write_documents(original_docs)
assert document_store.get_document_count() == 1
if update_existing_documents:
document_store.write_documents(updated_docs, duplicate_documents="overwrite")
else:
with pytest.raises(Exception):
document_store.write_documents(updated_docs, duplicate_documents="fail")
stored_docs = document_store.get_all_documents()
assert len(stored_docs) == 1
if update_existing_documents:
assert stored_docs[0].content == updated_docs[0]["content"]
else:
assert stored_docs[0].content == original_docs[0]["content"]
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_write_document_meta(document_store):
uid1 = get_uuid()
uid2 = get_uuid()
uid3 = get_uuid()
uid4 = get_uuid()
documents = [
{"content": "dict_without_meta", "id": uid1, "embedding": np.random.rand(embedding_dim).astype(np.float32)},
{"content": "dict_with_meta", "metafield": "test2", "name": "filename2", "id": uid2, "embedding": np.random.rand(embedding_dim).astype(np.float32)},
Document(content="document_object_without_meta", id=uid3, embedding=np.random.rand(embedding_dim).astype(np.float32)),
Document(content="document_object_with_meta", meta={"metafield": "test4", "name": "filename3"}, id=uid4, embedding=np.random.rand(embedding_dim).astype(np.float32)),
]
document_store.write_documents(documents)
documents_in_store = document_store.get_all_documents()
assert len(documents_in_store) == 4
assert not document_store.get_document_by_id(uid1).meta
assert document_store.get_document_by_id(uid2).meta["metafield"] == "test2"
assert not document_store.get_document_by_id(uid3).meta
assert document_store.get_document_by_id(uid4).meta["metafield"] == "test4"
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_write_document_index(document_store):
documents = [
{"content": "text1", "id": uuid.uuid4(), "embedding": np.random.rand(embedding_dim).astype(np.float32)},
{"content": "text2", "id": uuid.uuid4(), "embedding": np.random.rand(embedding_dim).astype(np.float32)},
]
document_store.write_documents([documents[0]], index="Haystackone")
assert len(document_store.get_all_documents(index="Haystackone")) == 1
document_store.write_documents([documents[1]], index="Haystacktwo")
assert len(document_store.get_all_documents(index="Haystacktwo")) == 1
assert len(document_store.get_all_documents(index="Haystackone")) == 1
assert len(document_store.get_all_documents()) == 0
@pytest.mark.weaviate
@pytest.mark.parametrize("retriever", ["dpr", "embedding"], indirect=True)
@pytest.mark.parametrize("document_store", ["weaviate"], indirect=True)
def test_update_embeddings(document_store, retriever):
documents = []
for i in range(6):
documents.append({"content": f"text_{i}", "id": str(uuid.uuid4()), "metafield": f"value_{i}", "embedding": np.random.rand(embedding_dim).astype(np.float32)})
documents.append({"content": "text_0", "id": str(uuid.uuid4()), "metafield": "value_0", "embedding": np.random.rand(embedding_dim).astype(np.float32)})
document_store.write_documents(documents, index="HaystackTestOne")
document_store.update_embeddings(retriever, index="HaystackTestOne", batch_size=3)
documents = document_store.get_all_documents(index="HaystackTestOne", return_embedding=True)
assert len(documents) == 7
for doc in documents:
assert type(doc.embedding) is np.ndarray
documents = document_store.get_all_documents(
index="HaystackTestOne",
filters={"metafield": ["value_0"]},
return_embedding=True,
)
assert len(documents) == 2
for doc in documents:
assert doc.meta["metafield"] == "value_0"
np.testing.assert_array_almost_equal(documents[0].embedding, documents[1].embedding, decimal=4)
documents = document_store.get_all_documents(
index="HaystackTestOne",
filters={"metafield": ["value_1", "value_5"]},
return_embedding=True,
)
np.testing.assert_raises(
AssertionError,
np.testing.assert_array_equal,
documents[0].embedding,
documents[1].embedding
)
doc = {"content": "text_7", "id": str(uuid.uuid4()), "metafield": "value_7",
"embedding": retriever.embed_queries(texts=["a random string"])[0]}
document_store.write_documents([doc], index="HaystackTestOne")
doc_before_update = document_store.get_all_documents(index="HaystackTestOne", filters={"metafield": ["value_7"]})[0]
embedding_before_update = doc_before_update.embedding
document_store.update_embeddings(
retriever, index="HaystackTestOne", batch_size=3, filters={"metafield": ["value_0", "value_1"]}
)
doc_after_update = document_store.get_all_documents(index="HaystackTestOne", filters={"metafield": ["value_7"]})[0]
embedding_after_update = doc_after_update.embedding
np.testing.assert_array_equal(embedding_before_update, embedding_after_update)
# test update all embeddings
document_store.update_embeddings(retriever, index="HaystackTestOne", batch_size=3, update_existing_embeddings=True)
assert document_store.get_document_count(index="HaystackTestOne") == 8
doc_after_update = document_store.get_all_documents(index="HaystackTestOne", filters={"metafield": ["value_7"]})[0]
embedding_after_update = doc_after_update.embedding
np.testing.assert_raises(AssertionError, np.testing.assert_array_equal, embedding_before_update, embedding_after_update)
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store_with_docs", ["weaviate"], indirect=True)
@ -311,49 +88,3 @@ def test_query(document_store_with_docs):
docs = document_store_with_docs.query(filters={"content":['live']})
assert len(docs) == 3
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store_with_docs", ["weaviate"], indirect=True)
def test_delete_documents(document_store_with_docs):
assert len(document_store_with_docs.get_all_documents()) == 3
document_store_with_docs.delete_documents()
documents = document_store_with_docs.get_all_documents()
assert len(documents) == 0
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store_with_docs", ["weaviate"], indirect=True)
def test_delete_documents_with_filters(document_store_with_docs):
assert len(document_store_with_docs.get_all_documents()) == 3
document_store_with_docs.delete_documents(filters={"metafield": ["test1", "test2"]})
documents = document_store_with_docs.get_all_documents()
assert len(documents) == 1
assert documents[0].meta["metafield"] == "test3"
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store_with_docs", ["weaviate"], indirect=True)
def test_delete_documents_by_id(document_store_with_docs):
assert len(document_store_with_docs.get_all_documents()) == 3
ids_to_delete = [doc.id for doc in document_store_with_docs.get_all_documents()[0:2]]
document_store_with_docs.delete_documents(ids=ids_to_delete)
documents = document_store_with_docs.get_all_documents()
assert len(documents) == 1
assert documents[0].id not in ids_to_delete
@pytest.mark.weaviate
@pytest.mark.parametrize("document_store_with_docs", ["weaviate"], indirect=True)
def test_delete_documents_by_id_with_filters(document_store_with_docs):
docs_to_delete = document_store_with_docs.get_all_documents(filters={"metafield": ["test1", "test2"]})
docs_not_to_delete = document_store_with_docs.get_all_documents(filters={"metafield": ["test3"]})
document_store_with_docs.delete_documents(ids=[doc.id for doc in docs_to_delete], filters={"metafield": ["test1"]})
all_docs_left = document_store_with_docs.get_all_documents()
assert len(all_docs_left) == 2
assert all(doc.meta["metafield"] != "test1" for doc in all_docs_left)
all_ids_left = [doc.id for doc in all_docs_left]
assert all(doc.id in all_ids_left for doc in docs_not_to_delete)