docs(document_store): add usage information for aws elastic search (#1008)

Co-authored-by: Mario Jäckle <m.jaeckle@careerpartner.eu>
Mario Jäckle 2021-04-30 10:38:25 +01:00 committed by GitHub
parent 37a72d2f45
commit a00703256f
GPG Key ID: 4AEE18F83AFDEB23


@ -11,10 +11,10 @@ id: "documentstoremd"
# DocumentStores
You can think of the DocumentStore as a "database" that:
- stores your texts and metadata
- provides them to the retriever at query time
There are different DocumentStores in Haystack to fit different use cases and tech stacks.
## Initialisation
@ -29,7 +29,7 @@ Initialising a new DocumentStore within Haystack is straight forward.
[Install](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html)
Elasticsearch and then [start](https://www.elastic.co/guide/en/elasticsearch/reference/current/starting-elasticsearch.html)
an instance.
If you have Docker set up, we recommend pulling the Docker image and running it.
```bash
@ -49,6 +49,8 @@ Note that we also support [Open Distro for Elasticsearch](https://opendistro.git
Follow [their documentation](https://opendistro.github.io/for-elasticsearch-docs/docs/install/)
to run it and connect to it using Haystack's `OpenDistroElasticsearchDocumentStore` class.
We also support [Amazon Elasticsearch Service](https://aws.amazon.com/elasticsearch-service/) with [signed requests](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html).
Use, for example, [aws-requests-auth](https://github.com/davidmuller/aws-requests-auth) to create an auth object and pass it as `aws4auth` to the `ElasticsearchDocumentStore` constructor.
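
As a minimal sketch of the signed-request setup described above, the snippet below creates an auth object with aws-requests-auth and passes it to the document store. The helper names `aws_es_kwargs` and `connect_to_aws_es`, the endpoint, the region and the credential values are hypothetical placeholders. The `host`, `port` and `scheme` keyword arguments are assumed to match the `ElasticsearchDocumentStore` constructor of your installed Haystack version, so check them against it before relying on this.

```python
def aws_es_kwargs(host, auth):
    """Collect connection settings for an AWS-hosted Elasticsearch domain.

    Hypothetical helper: AWS exposes Elasticsearch domains over HTTPS,
    so port 443 and the "https" scheme are used here.
    """
    return {
        "host": host,
        "port": 443,
        "scheme": "https",
        "aws4auth": auth,  # parameter name as given in the text above
    }


def connect_to_aws_es(host, region, access_key, secret_key):
    """Build a signed-request auth object and open the document store."""
    # Imported lazily so this module loads even when the optional
    # dependencies are not installed.
    from aws_requests_auth.aws_auth import AWSRequestsAuth
    from haystack.document_store import ElasticsearchDocumentStore

    auth = AWSRequestsAuth(
        aws_access_key=access_key,
        aws_secret_access_key=secret_key,
        aws_host=host,
        aws_region=region,
        aws_service="es",  # service name used when signing Elasticsearch requests
    )
    return ElasticsearchDocumentStore(**aws_es_kwargs(host, auth))


# Example call with placeholder values (not a real endpoint or credentials):
# document_store = connect_to_aws_es(
#     host="search-my-domain.eu-west-1.es.amazonaws.com",
#     region="eu-west-1",
#     access_key="YOUR_ACCESS_KEY",
#     secret_key="YOUR_SECRET_KEY",
# )
```

Keeping the credentials out of source code, for instance by reading them from environment variables, is usually preferable to passing literals as in the commented example.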
</div>
</div>
@ -59,7 +61,7 @@ to run it and connect to it using Haystack's `OpenDistroElasticsearchDocumentSto
<div class="tabcontent">
Follow the [official documentation](https://www.milvus.io/docs/v1.0.0/milvus_docker-cpu.md) to start a Milvus instance via Docker.
You can initialize the Haystack object that will connect to this instance as follows:
```python
from haystack.document_store import MilvusDocumentStore
@ -75,7 +77,7 @@ document_store = MilvusDocumentStore()
<label class="labelouter" for="tab-1-3">FAISS</label>
<div class="tabcontent">
The `FAISSDocumentStore` requires no external setup. Start it with this line:
```python
from haystack.document_store import FAISSDocumentStore
@ -106,7 +108,7 @@ document_store = InMemoryDocumentStore()
<div class="tabcontent">
The `SQLDocumentStore` requires SQLite, PostgreSQL or MySQL to be installed and started.
Note that SQLite already comes packaged with most operating systems.
```python
from haystack.document_store import SQLDocumentStore
@ -174,7 +176,7 @@ Having GPU acceleration will significantly speed this up.
<!-- _comment: !! Make this a tab element to show how different datastores are initialized !! -->
## Choosing the Right Document Store
The Document Stores have different characteristics. You should choose one depending on the maturity of your project, your use case and your technical environment:
<div class="tabs tabsdschoose">
@ -183,13 +185,13 @@ The Document Stores have different characteristics. You should choose one depend
<label class="labelouter" for="tab-2-1">Elasticsearch</label>
<div class="tabcontent">
**Pros:**
- Fast & accurate sparse retrieval with many tuning options
- Basic support for dense retrieval
- Production-ready
- Also supports Open Distro
**Cons:**
- Slow for dense retrieval with more than ~1 million documents
</div>
@ -200,7 +202,7 @@ The Document Stores have different characteristics. You should choose one depend
<label class="labelouter" for="tab-2-2">Milvus</label>
<div class="tabcontent">
**Pros:**
- Scalable DocumentStore that excels at handling vectors (hence suited to dense retrieval methods like DPR)
- Encapsulates multiple ANN libraries (e.g. FAISS and ANNOY) and provides added reliability
- Runs as a separate service (e.g. a Docker container)
@ -217,7 +219,7 @@ The Document Stores have different characteristics. You should choose one depend
<label class="labelouter" for="tab-2-3">FAISS</label>
<div class="tabcontent">
**Pros:**
- Fast & accurate dense retrieval
- Highly scalable due to approximate nearest neighbour algorithms (ANN)
- Many options to tune dense retrieval via different index types (more info [here](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index))
@ -255,7 +257,7 @@ The Document Stores have different characteristics. You should choose one depend
- No database requirements
- Supports MySQL, PostgreSQL and SQLite
**Cons:**
- Not scalable
- Does not persist your data on disk
@ -268,7 +270,7 @@ The Document Stores have different characteristics. You should choose one depend
#### Our Recommendations
**Restricted environment:** Use the `InMemoryDocumentStore` if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases.
**Allrounder:** Use the `ElasticsearchDocumentStore` if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production.