mirror of
https://github.com/deepset-ai/haystack.git
synced 2025-12-15 00:57:12 +00:00
---
title: "Document Store"
id: "document-store"
description: "You can think of the Document Store as a database that stores your data and provides it to the Retriever at query time. Learn how to use a Document Store in a pipeline or how to create your own."
---

A Document Store is an object that stores your documents. In Haystack, a Document Store is different from a component, as it doesn't have a `run()` method. You can think of it as an interface to your database: you can write information to it or look through what it holds. This means that a Document Store is not a piece of a pipeline but rather a tool that the components of a pipeline can access and interact with.

:::tip 👍 Work with Retrievers
The most common way to use a Document Store in Haystack is to fetch documents using a Retriever. A Document Store often has a corresponding Retriever to get the most out of a specific technology. See our [Retriever](/docs/retrievers) documentation for more information.
:::

:::info 📘 How to choose a Document Store?
To learn about the different types of Document Stores and their strengths and weaknesses, head to the [Choosing a Document Store](/docs/choosing-a-document-store) page.
:::

## DocumentStore Protocol

Document Stores in Haystack implement the following methods as part of their protocol:

- `count_documents` returns the number of documents stored in the given store as an integer.
- `filter_documents` returns a list of documents that match the provided filters.
- `write_documents` writes or overwrites documents into the given store and returns the number of documents that were written as an integer.
- `delete_documents` deletes all documents with the given `document_ids` from the Document Store.
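
As a rough sketch of what this contract means in code, the block below models the four methods with Python's `typing.Protocol` and checks a toy class against it. The names `DocumentStoreLike` and `DictStore`, the reduced `Document` dataclass, and the flat key-equals-value filter are all stand-ins invented for this illustration; they are not Haystack's actual definitions, whose signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol, runtime_checkable


@dataclass
class Document:
    """Stand-in for Haystack's Document, reduced to three fields."""
    id: str
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


@runtime_checkable
class DocumentStoreLike(Protocol):
    """The four methods listed above, with simplified signatures."""

    def count_documents(self) -> int: ...
    def filter_documents(self, filters: Optional[dict[str, Any]] = None) -> list[Document]: ...
    def write_documents(self, documents: list[Document]) -> int: ...
    def delete_documents(self, document_ids: list[str]) -> None: ...


class DictStore:
    """Toy store keyed by document ID; it does not inherit from the protocol."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def count_documents(self) -> int:
        return len(self._docs)

    def filter_documents(self, filters: Optional[dict[str, Any]] = None) -> list[Document]:
        # Assumption: a filter is a flat {meta_key: expected_value} mapping.
        wanted = filters or {}
        return [d for d in self._docs.values()
                if all(d.meta.get(k) == v for k, v in wanted.items())]

    def write_documents(self, documents: list[Document]) -> int:
        for doc in documents:
            self._docs[doc.id] = doc  # writes a new entry or overwrites an existing one
        return len(documents)

    def delete_documents(self, document_ids: list[str]) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)


store = DictStore()
conforms = isinstance(store, DocumentStoreLike)  # structural check, no base class needed
written = store.write_documents([
    Document(id="1", content="My name is Jean and I live in Paris.", meta={"city": "Paris"}),
    Document(id="2", content="My name is Mark and I live in Berlin.", meta={"city": "Berlin"}),
])
in_paris = store.filter_documents({"city": "Paris"})
store.delete_documents(["2"])
remaining = store.count_documents()
```

The point of the sketch is that conformance is structural: any object offering these four methods can be handed to code that expects a Document Store, regardless of its class hierarchy.
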

## Initialization

To use a Document Store in a pipeline, you must initialize it first.

See the installation and initialization details for each Document Store in the "Document Stores" section in the navigation panel on your left.

## Work with Documents

Before writing your data into a Document Store, convert it into `Document` objects that carry the content together with its metadata and document ID.

The ID field is mandatory. If you don't choose an ID yourself, Haystack generates one from the document's information and assigns it automatically. However, since Haystack uses the document's contents to create the ID, two identical documents end up with identical IDs. Keep this in mind when you update your documents, as the ID is not updated automatically.
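
```python
from haystack import Document
# ChromaDocumentStore comes from the separate `chroma-haystack` integration package.
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

DOCUMENT_NAME = "my_document"  # placeholder value for the `name` metadata field

document_store = ChromaDocumentStore()
documents = [
    Document(
        meta={"name": DOCUMENT_NAME},  # ... add further metadata fields here
        id="document_unique_id",
        content="this is content",
    ),
    # ... more documents
]
document_store.write_documents(documents)
```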
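
To illustrate the ID behavior described above, here is a minimal sketch of a content-derived ID. It is not Haystack's implementation: the `Document` dataclass is a reduced stand-in, `derive_id` is a hypothetical helper, and hashing the content plus JSON-encoded metadata with SHA-256 is an assumption chosen for the example.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Optional


def derive_id(content: str, meta: dict[str, Any]) -> str:
    """Hypothetical helper: hash the content and metadata into a stable ID."""
    payload = content + json.dumps(meta, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class Document:
    """Stand-in document whose ID defaults to a hash of its data."""
    content: str
    meta: dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None

    def __post_init__(self) -> None:
        # Derive an ID only when the caller did not choose one.
        if self.id is None:
            self.id = derive_id(self.content, self.meta)


first = Document(content="My name is Jean and I live in Paris.")
second = Document(content="My name is Jean and I live in Paris.")
same_id = first.id == second.id  # identical data produces identical IDs

original_id = first.id
first.content = "My name is Jean and I live in Lyon."
id_is_stale = first.id == original_id  # the ID is computed once, at creation

custom = Document(content="this is content", id="document_unique_id")
```

In this sketch, editing `content` after creation leaves the old ID in place, which is why an updated document may need a fresh ID or an explicit overwrite when it is written again.
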

To write documents into the `InMemoryDocumentStore`, call its `.write_documents()` method:

```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome.")
])
```

:::info 📘 `DocumentWriter`
See the `DocumentWriter` component [docs](/docs/documentwriter) to learn how to write your documents into a Document Store in a pipeline.
:::

## DuplicatePolicy

`DuplicatePolicy` is a class that defines the options for handling documents with the same ID in a `DocumentStore`. It provides the following options:

- **OVERWRITE**: If a document with the same ID already exists in the `DocumentStore`, it is overwritten with the new document.
- **SKIP**: If a document with the same ID already exists, the new document is skipped and not added to the `DocumentStore`.
- **FAIL**: If a document with the same ID already exists in the `DocumentStore`, an error is raised. This prevents duplicate documents from being added.

Here is an example of how you could apply the policy to skip documents whose ID already exists:

```python
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()
# With SKIP, documents whose ID is already in the store are not written again.
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
```
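
To make the semantics of the three options concrete, the following toy sketch applies them to a plain dictionary that maps document IDs to contents. `ToyPolicy`, `DuplicateIdError`, and `write_with_policy` are hypothetical names made up for this illustration; a real Document Store may handle batches and errors differently.

```python
from enum import Enum


class ToyPolicy(Enum):
    """Local stand-in mirroring the three options described above."""
    OVERWRITE = "overwrite"
    SKIP = "skip"
    FAIL = "fail"


class DuplicateIdError(ValueError):
    """Raised by the FAIL option in this sketch."""


def write_with_policy(store: dict[str, str],
                      documents: list[tuple[str, str]],
                      policy: ToyPolicy) -> int:
    """Write (id, content) pairs into `store`; return how many were written."""
    written = 0
    for doc_id, content in documents:
        if doc_id in store:
            if policy is ToyPolicy.SKIP:
                continue  # keep the existing document untouched
            if policy is ToyPolicy.FAIL:
                # Documents earlier in the batch stay written in this sketch.
                raise DuplicateIdError(f"document ID {doc_id!r} already exists")
        store[doc_id] = content  # new document, or OVERWRITE of an existing one
        written += 1
    return written


store = {"1": "old content"}

skipped_run = write_with_policy(store, [("1", "new content"), ("2", "other")], ToyPolicy.SKIP)
content_after_skip = store["1"]

overwrite_run = write_with_policy(store, [("1", "new content")], ToyPolicy.OVERWRITE)
content_after_overwrite = store["1"]

try:
    write_with_policy(store, [("1", "yet another")], ToyPolicy.FAIL)
    failed = False
except DuplicateIdError:
    failed = True
```

Note that with SKIP the returned count covers only the documents actually written, so it can be lower than the number of documents passed in.
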

## Custom Document Store

All custom document stores must implement the [protocol](https://github.com/deepset-ai/haystack/blob/13804293b1bb79743e5a30e980b76a0561dcfaf8/haystack/document_stores/types/protocol.py) with four mandatory methods: `count_documents`, `filter_documents`, `write_documents`, and `delete_documents`.

The `__init__` method should take all the settings specific to the chosen database or vector store.

We also recommend creating a corresponding custom Retriever to get the most out of a specific Document Store.

See the [Creating Custom Document Stores](/docs/creating-custom-document-stores) page for more details.
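
As a rough illustration of how backend specifics end up in `__init__`, here is a sketch of a custom store backed by SQLite from the Python standard library. `SQLiteDocumentStore`, its `path` and `table` parameters, the reduced `Document` dataclass, and the flat key-equals-value filter are assumptions made for this example; it is not a Haystack class and omits much of what a production store needs.

```python
import json
import sqlite3
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Document:
    """Stand-in for Haystack's Document, reduced to three fields."""
    id: str
    content: str
    meta: dict[str, Any] = field(default_factory=dict)


class SQLiteDocumentStore:
    """Hypothetical custom store; the four methods follow the list above."""

    def __init__(self, path: str = ":memory:", table: str = "documents") -> None:
        # Backend specifics (database location, table name) are settled here.
        if not table.isidentifier():
            raise ValueError(f"invalid table name: {table!r}")
        self._table = table
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(id TEXT PRIMARY KEY, content TEXT NOT NULL, meta TEXT NOT NULL)"
        )

    def count_documents(self) -> int:
        return self._conn.execute(f"SELECT COUNT(*) FROM {self._table}").fetchone()[0]

    def filter_documents(self, filters: Optional[dict[str, Any]] = None) -> list[Document]:
        # Assumption: filters is a flat {meta_key: expected_value} mapping.
        rows = self._conn.execute(
            f"SELECT id, content, meta FROM {self._table} ORDER BY id"
        ).fetchall()
        docs = [Document(id=r[0], content=r[1], meta=json.loads(r[2])) for r in rows]
        wanted = filters or {}
        return [d for d in docs if all(d.meta.get(k) == v for k, v in wanted.items())]

    def write_documents(self, documents: list[Document]) -> int:
        # INSERT OR REPLACE overwrites any row that shares the same ID.
        with self._conn:
            self._conn.executemany(
                f"INSERT OR REPLACE INTO {self._table} (id, content, meta) VALUES (?, ?, ?)",
                [(d.id, d.content, json.dumps(d.meta)) for d in documents],
            )
        return len(documents)

    def delete_documents(self, document_ids: list[str]) -> None:
        with self._conn:
            self._conn.executemany(
                f"DELETE FROM {self._table} WHERE id = ?",
                [(doc_id,) for doc_id in document_ids],
            )


store = SQLiteDocumentStore(path=":memory:")
written = store.write_documents([
    Document(id="1", content="My name is Giorgio and I live in Rome.", meta={"city": "Rome"}),
    Document(id="2", content="My name is Mark and I live in Berlin.", meta={"city": "Berlin"}),
])
in_rome = store.filter_documents({"city": "Rome"})
store.delete_documents(["1"])
remaining = store.count_documents()
```

Keeping connection details in the constructor means the rest of the class, and any Retriever built on top of it, only ever deals with the four protocol methods.
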