mirror of
https://github.com/deepset-ai/haystack.git
synced 2026-01-29 02:53:51 +00:00
161 lines
6.2 KiB
Plaintext
---
title: "FastembedDocumentEmbedder"
id: fastembeddocumentembedder
slug: "/fastembeddocumentembedder"
description: "This component computes the embeddings of a list of documents using the models supported by FastEmbed."
---

# FastembedDocumentEmbedder

This component computes the embeddings of a list of documents using the models supported by FastEmbed.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents (enriched with embeddings) |
| **API reference** | [FastEmbed](/reference/fastembed-embedders) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed |

</div>

This component should be used to embed a list of documents. To embed a string, use the [`FastembedTextEmbedder`](fastembedtextembedder.mdx).

## Overview

`FastembedDocumentEmbedder` computes the embeddings of a list of documents and stores the resulting vectors in the `embedding` field of each document. It uses embedding [models supported by FastEmbed](https://qdrant.github.io/fastembed/examples/Supported_Models/).

The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector that represents the query is compared with the document vectors to find the most similar or relevant documents.

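The comparison at retrieval time can be illustrated with a minimal sketch. It assumes cosine similarity as the metric and uses made-up three-dimensional vectors in place of real embeddings. `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of Haystack or FastEmbed.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (assumption: 3 dimensions).
query_embedding = [1.0, 0.0, 0.0]
document_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query_embedding, document_embeddings)
print(ranking)  # [1, 2, 0]
```

In a real pipeline, a retriever and a document store perform this comparison for you; the sketch only shows the idea behind it.
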
### Compatible models

You can find the original models in the [FastEmbed documentation](https://qdrant.github.io/fastembed/).

Most of the models on the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) are compatible with FastEmbed. To check whether a specific model is compatible, see the [supported model list](https://qdrant.github.io/fastembed/examples/Supported_Models/).

### Installation

To start using this integration with Haystack, install the package with:

```shell
pip install fastembed-haystack
```

### Parameters

Use `cache_dir` to set the directory where the model is cached, and `threads` to set the number of threads a single `onnxruntime` session can use.

```python
from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder

# Directory where the downloaded model files are cached
cache_dir = "/your_cacheDirectory"

embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-large-en-v1.5",
    cache_dir=cache_dir,
    threads=2,  # threads available to a single onnxruntime session
)
```

To use data-parallel encoding, set the `parallel` and `batch_size` parameters:

- If `parallel` > 1, data-parallel encoding is used. This is recommended for offline encoding of large datasets.
- If `parallel` is 0, all available cores are used.
- If `parallel` is `None`, data-parallel processing is not used; the default `onnxruntime` threading is used instead.

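The rules for `parallel` can be read as a small decision function. The sketch below is only an illustration of those rules, not the integration's code: `resolve_parallel_workers` is a hypothetical helper, and treating a positive value as a worker count, passing 1 through unchanged, and rejecting negative values are assumptions of this sketch.

```python
import os
from typing import Optional


def resolve_parallel_workers(parallel: Optional[int]) -> Optional[int]:
    """Map a `parallel` value to a worker count.

    Returns None when data-parallel processing is not used.
    """
    if parallel is None:
        # No data-parallel processing: rely on default onnxruntime threading.
        return None
    if parallel < 0:
        # Assumption of this sketch: negative values are rejected.
        raise ValueError("parallel must be None, 0, or a positive integer")
    if parallel == 0:
        # Use all available cores; os.cpu_count() may return None, so fall back to 1.
        return os.cpu_count() or 1
    # Assumption of this sketch: a positive value is the number of workers.
    # The documented rules only describe parallel > 1; 1 is passed through as-is.
    return parallel


print(resolve_parallel_workers(None))  # None
print(resolve_parallel_workers(4))     # 4
```
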
:::tip
If you create a Text Embedder and a Document Embedder based on the same model, Haystack shares a single underlying resource between them rather than creating it twice.
:::

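One way this kind of sharing can work is a cache keyed by model name. The sketch below is illustrative only: `EmbeddingBackend` and `BackendFactory` are hypothetical names, and the actual mechanism in the integration may differ, for example in what the cache key includes.

```python
class EmbeddingBackend:
    """Stand-in for an object that holds a loaded embedding model."""

    def __init__(self, model_name: str):
        self.model_name = model_name


class BackendFactory:
    """Hand out one shared backend per model name (hypothetical helper)."""

    _instances: dict[str, EmbeddingBackend] = {}

    @classmethod
    def get(cls, model_name: str) -> EmbeddingBackend:
        # Create the backend on the first request, then reuse it.
        if model_name not in cls._instances:
            cls._instances[model_name] = EmbeddingBackend(model_name)
        return cls._instances[model_name]


# Two components asking for the same model receive the same object.
text_backend = BackendFactory.get("BAAI/bge-small-en-v1.5")
document_backend = BackendFactory.get("BAAI/bge-small-en-v1.5")
print(text_backend is document_backend)  # True
```
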
### Embedding Metadata

Text documents often come with a set of metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the text of the document to improve retrieval.

You can do this with the Document Embedder:

```python
from haystack import Document
from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
)

embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5",
    batch_size=256,
    meta_fields_to_embed=["title"],  # only the title is embedded with the content
)
embedder.warm_up()  # load the model before running the component on its own

docs_w_embeddings = embedder.run(documents=[doc])["documents"]
```

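Conceptually, embedding metadata means that the selected field values are combined with the document text before that text is passed to the model. The sketch below assumes the values are placed before the content and joined with a newline. `build_text_to_embed` is a hypothetical helper, not the integration's API, and the exact concatenation the component performs may differ.

```python
def build_text_to_embed(content, meta, fields_to_embed, separator="\n"):
    """Join the selected metadata values and the content into one string."""
    meta_values = [
        str(meta[field])
        for field in fields_to_embed
        if field in meta and meta[field] is not None  # skip missing or empty fields
    ]
    return separator.join([*meta_values, content or ""])


text = build_text_to_embed(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
    fields_to_embed=["title"],
)
print(repr(text))  # 'relevant title\nsome text'
```

Only the fields you list are included, so the page number in this example does not influence the embedding.
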
## Usage

### On its own

```python
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder

document_list = [
    Document(content="I love pizza!"),
    Document(content="I like spaghetti"),
]

doc_embedder = FastembedDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run(document_list)
print(result['documents'][0].embedding)

# [-0.04235665127635002, 0.021791068837046623, ...]
```

### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.fastembed import FastembedDocumentEmbedder, FastembedTextEmbedder

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="fastembed is supported by and maintained by Qdrant."),
]

document_embedder = FastembedDocumentEmbedder()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("document_embedder", "writer")

indexing_pipeline.run({"document_embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", FastembedTextEmbedder())
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who supports fastembed?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])  # noqa: T201

# Document(id=...,
#  content: 'fastembed is supported by and maintained by Qdrant.',
#  score: 0.758..)
```

## Additional References

🧑‍🍳 Cookbook: [RAG Pipeline Using FastEmbed for Embeddings Generation](https://haystack.deepset.ai/cookbook/rag_fastembed)