DocumentStores
You can think of the DocumentStore as a "database" that:
- stores your texts and metadata
- provides them to the retriever at query time
There are different DocumentStores in Haystack to fit different use cases and tech stacks.
Initialisation
Initialising a new DocumentStore within Haystack is straightforward.
Install Elasticsearch and then start an instance.
If you have Docker set up, we recommend pulling the Docker image and running it.
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2
Note that we also have a utility function, haystack.utils.launch_es, that can start up an Elasticsearch instance.
Next you can initialize the Haystack object that will connect to this instance.
from haystack.document_store import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()
Note that we also support OpenSearch. Follow their documentation to run it, then connect to it using Haystack's OpenSearchDocumentStore class.
We also support the AWS Elasticsearch Service with signed requests: use e.g. aws-requests-auth to create an auth object and pass it as aws4auth to the ElasticsearchDocumentStore constructor.
Follow the official documentation to start a Milvus instance via Docker.
Note that we also have a utility function, haystack.utils.launch_milvus, that can start up a Milvus instance.
You can initialize the Haystack object that will connect to this instance as follows:
from haystack.document_store import MilvusDocumentStore
document_store = MilvusDocumentStore()
The FAISSDocumentStore requires no external setup. Start it with this line:
from haystack.document_store import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
The InMemoryDocumentStore requires no external setup. Start it with this line:
from haystack.document_store import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
The SQLDocumentStore requires SQLite, PostgreSQL or MySQL to be installed and started.
Note that SQLite already comes packaged with most operating systems.
from haystack.document_store import SQLDocumentStore
document_store = SQLDocumentStore()
The WeaviateDocumentStore requires a running Weaviate server.
You can start a basic instance like this (see the Weaviate docs for details):
docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.7.2
Afterwards, you can use it in Haystack:
from haystack.document_store import WeaviateDocumentStore
document_store = WeaviateDocumentStore()
See the official OpenSearch documentation on how to install and start an instance.
If you have Docker set up, we recommend pulling the Docker image and running it.
docker pull opensearchproject/opensearch:1.0.0
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.0
Note that we also have a utility function, haystack.utils.launch_opensearch, that can start up an OpenSearch instance.
Next you can initialize the Haystack object that will connect to this instance.
from haystack.document_store import OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()
Each DocumentStore constructor accepts arguments that specify how to connect to an existing database and which index names to use. See the API documentation for more info.
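As a rough illustration of passing connection details to a constructor, the sketch below collects settings for an existing Elasticsearch instance and hands them over as keyword arguments. The environment variable names (ES_HOST, ES_PORT, ES_INDEX) are hypothetical, and the keyword names host, port and index are assumptions; check the API documentation for the arguments your Haystack version accepts.

```python
import os


def elasticsearch_settings(env=os.environ):
    """Collect connection settings for an existing Elasticsearch instance.

    The variable names ES_HOST, ES_PORT and ES_INDEX are hypothetical.
    """
    return {
        "host": env.get("ES_HOST", "localhost"),
        "port": int(env.get("ES_PORT", "9200")),
        "index": env.get("ES_INDEX", "document"),
    }


def connect(settings):
    # Imported lazily so the settings helper also works without Haystack installed.
    from haystack.document_store import ElasticsearchDocumentStore

    # Assumed keyword arguments: host, port, index.
    return ElasticsearchDocumentStore(**settings)


settings = elasticsearch_settings({"ES_HOST": "search.internal", "ES_PORT": "9201"})
print(settings)
```

Calling connect(settings) would then create the DocumentStore against that instance, provided Haystack is installed and the instance is reachable.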
Input Format
DocumentStores expect Documents in dictionary form, like the one below.
They are loaded using the DocumentStore.write_documents() method.
See Preprocessing for more information on the cleaning and splitting steps that will help you maximize Haystack's performance.
from haystack.document_store import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()
dicts = [
    {
        'text': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
document_store.write_documents(dicts)
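If your texts start out as plain strings, a small helper can put them into this dictionary format. The following is a minimal sketch; the helper name to_document_dicts and the sample file names are hypothetical and not part of Haystack.

```python
def to_document_dicts(named_texts):
    """Turn (name, text) pairs into dictionaries with 'text' and 'meta' keys."""
    documents = []
    for name, text in named_texts:
        text = text.strip()
        if not text:
            continue  # skip documents that contain only whitespace
        documents.append({"text": text, "meta": {"name": name}})
    return documents


dicts = to_document_dicts([
    ("first.txt", "First sample document.  "),
    ("empty.txt", "   "),
])
print(dicts)
```

The resulting list can be passed to document_store.write_documents(dicts).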
Writing Documents (Sparse Retrievers)
Haystack lets you store documents in an optimised fashion so that query times can be kept low.
For sparse, keyword-based retrievers such as BM25 and TF-IDF, you simply have to call DocumentStore.write_documents().
The creation of the inverted index, which optimises querying speed, is handled automatically.
document_store.write_documents(dicts)
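To give an intuition for why an inverted index keeps keyword queries fast, here is a toy sketch in plain Python. It is not how Haystack or Elasticsearch implement indexing; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(dicts):
    """Map each lowercase token to the set of document positions containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(dicts):
        for token in doc["text"].lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


dicts = [
    {"text": "Berlin is the capital of Germany", "meta": {"name": "a"}},
    {"text": "Paris is the capital of France", "meta": {"name": "b"}},
]
index = build_inverted_index(dicts)
print(sorted(lookup(index, "capital germany")))
```

A query only touches the entries for its own tokens, so no document has to be scanned in full at query time.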
Writing Documents (Dense Retrievers)
For dense, neural-network-based retrievers like Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings, which will be compared against the Query embedding.
The storing of the text is handled by DocumentStore.write_documents(), and the computation of the embeddings is started by DocumentStore.update_embeddings().
document_store.write_documents(dicts)
document_store.update_embeddings(retriever)
This step is computationally intensive since it engages the transformer-based encoders. GPU acceleration will speed it up significantly.
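The comparison between Document embeddings and the Query embedding can be pictured with a toy sketch. The vectors below are hand-written stand-ins for encoder output, and the dot product is one common similarity measure, assumed here for illustration. In Haystack the retriever computes the real embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_embedding, doc_embeddings, top_k=2):
    """Return (name, score) pairs for the top_k documents by dot product."""
    scored = [(name, dot(query_embedding, emb)) for name, emb in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed document embeddings (what update_embeddings would store).
doc_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}
query_embedding = [1.0, 0.0, 0.0]
print(rank_by_similarity(query_embedding, doc_embeddings))
```

Because the Document embeddings are computed once at indexing time, only the Query has to be encoded when a question comes in.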
Choosing the Right Document Store
The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:
ElasticsearchDocumentStore
Pros:
- Fast & accurate sparse retrieval with many tuning options
- Basic support for dense retrieval
- Production-ready
- Also supports Open Distro
Cons:
- Slow for dense retrieval with more than ~1 million documents
MilvusDocumentStore
Pros:
- Scalable DocumentStore that excels at handling vectors (hence suited to dense retrieval methods like DPR)
- Encapsulates multiple ANN libraries (e.g. FAISS and ANNOY) and provides added reliability
- Runs as a separate service (e.g. a Docker container)
- Allows dynamic data management
Cons:
- No efficient sparse retrieval
FAISSDocumentStore
Pros:
- Fast & accurate dense retrieval
- Highly scalable due to approximate nearest neighbour algorithms (ANN)
- Many options to tune dense retrieval via different index types (more info here)
Cons:
- No efficient sparse retrieval
Pros:
- Simple
- Exists already in many environments
Cons:
- Only compatible with minimal TF-IDF Retriever
- Bad retrieval performance
- Not recommended for production
Pros:
- Simple & fast to test
- No database requirements
- Supports MySQL, PostgreSQL and SQLite
Cons:
- Not scalable
- Not persisting your data on disk
WeaviateDocumentStore
Pros:
- Simple vector search
- Stores everything in one place (documents, metadata and vectors), so there is less network overhead when scaling up
- Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset
Cons:
- Fewer options for ANN algorithms than FAISS or Milvus
- No BM25 / TF-IDF retrieval
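The combination of scalar filtering and vector search mentioned in the pros can be sketched in plain Python: first narrow the candidates by a metadata tag, then rank only that subset by vector similarity. This is a conceptual toy, not the Weaviate or Haystack API; the function name, the "tag" field and the embeddings are all made up.

```python
def filtered_search(query_embedding, documents, filters, top_k=1):
    """Keep documents whose meta matches all filters, then rank them by dot product."""
    candidates = [
        doc for doc in documents
        if all(doc["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(
        key=lambda doc: sum(q * d for q, d in zip(query_embedding, doc["embedding"])),
        reverse=True,
    )
    return [doc["meta"]["name"] for doc in candidates[:top_k]]


documents = [
    {"meta": {"name": "a", "tag": "news"}, "embedding": [0.9, 0.1]},
    {"meta": {"name": "b", "tag": "blog"}, "embedding": [1.0, 0.0]},
    {"meta": {"name": "c", "tag": "news"}, "embedding": [0.2, 0.8]},
]
print(filtered_search([1.0, 0.0], documents, {"tag": "news"}))
```

Document "b" is the closest vector overall, but the filter excludes it, so the search returns the best match within the "news" subset.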
OpenSearchDocumentStore
Pros:
- Fully open source fork of Elasticsearch
- Has support for Approximate Nearest Neighbours vector search
Cons:
- Its ANN algorithms seem a little less performant than FAISS or Milvus in our benchmarks
Our Recommendations
Restricted environment: Use the InMemoryDocumentStore if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases.
Allrounder: Use the ElasticsearchDocumentStore if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production.
Vector Specialist: Use the MilvusDocumentStore if you want to focus on dense retrieval and possibly deal with larger datasets.