---
title: "SentenceTransformersSparseDocumentEmbedder"
id: sentencetransformerssparsedocumentembedder
slug: "/sentencetransformerssparsedocumentembedder"
description: "Use this component to enrich a list of documents with their sparse embeddings using Sentence Transformers models."
---

# SentenceTransformersSparseDocumentEmbedder

Use this component to enrich a list of documents with their sparse embeddings using Sentence Transformers models.

| | |
| :--- | :--- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | "documents": A list of documents |
| **Output variables** | "documents": A list of documents (enriched with sparse embeddings) |
| **API reference** | [Embedders](/reference/embedders-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/embedders/sentence_transformers_sparse_document_embedder.py |

To compute a sparse embedding for a string, use the [`SentenceTransformersSparseTextEmbedder`](sentencetransformerssparsetextembedder.mdx).

## Overview

`SentenceTransformersSparseDocumentEmbedder` computes the sparse embeddings of a list of documents and stores the obtained vectors in the `sparse_embedding` field of each document. It uses sparse embedding models supported by the Sentence Transformers library.

The vectors computed by this component are necessary to perform sparse embedding retrieval on a collection of documents. At retrieval time, the sparse vector representing the query is compared with those of the documents to find the most similar or relevant ones.

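To make that comparison concrete: a common scoring function for sparse vectors is the dot product over shared vocabulary indices. Below is a minimal plain-Python sketch with hypothetical toy vectors, not a Haystack API; Haystack's `SparseEmbedding` stores the same parallel `indices`/`values` layout:

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    weights_a = dict(zip(indices_a, values_a))
    return sum(weights_a.get(i, 0.0) * v for i, v in zip(indices_b, values_b))

# Toy vectors: only vocabulary index 42 is shared, so the score is 0.5 * 0.8 = 0.4
score = sparse_dot([10, 42], [0.9, 0.5], [42, 77], [0.8, 0.3])
```

Because most entries are zero, only the overlapping indices contribute to the score, which is what makes sparse retrieval efficient at scale.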
### Compatible Models

The default embedding model is [`prithivida/Splade_PP_en_v2`](https://huggingface.co/prithivida/Splade_PP_en_v2). You can specify another model with the `model` parameter when initializing this component.

Compatible models are based on SPLADE (SParse Lexical AnD Expansion), a technique for producing sparse representations of text, where each non-zero value in the embedding is the importance weight of a term in the vocabulary. This approach combines the benefits of learned sparse representations with the efficiency of traditional sparse retrieval methods. For more information, see [our docs](../retrievers.mdx#sparse-embedding-based-retrievers) that explain sparse embedding-based Retrievers further.

You can find compatible SPLADE models on the [Hugging Face Model Hub](https://huggingface.co/models?search=splade).

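Because every non-zero value is tied to a vocabulary term, sparse embeddings are directly interpretable. Here is a minimal sketch using a hypothetical toy vocabulary (with a real SPLADE model, the indices would map into its tokenizer's vocabulary):

```python
def top_terms(indices, values, vocab, k=3):
    """Return the k highest-weighted vocabulary terms of a sparse embedding."""
    pairs = sorted(zip(indices, values), key=lambda p: p[1], reverse=True)
    return [(vocab[i], weight) for i, weight in pairs[:k]]

# Toy vocabulary mapping term IDs to terms
vocab = {101: "pizza", 205: "food", 309: "italy", 412: "love"}
print(top_terms([101, 205, 309, 412], [1.2, 0.7, 0.4, 0.9], vocab))
# -> [('pizza', 1.2), ('love', 0.9), ('food', 0.7)]
```

Inspecting the top-weighted terms like this is a simple way to debug why a document matched a query.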
### Authentication

Authentication with a Hugging Face API token is only required to access private or gated models.

The component reads the `HF_API_TOKEN` or `HF_TOKEN` environment variable, or you can pass a Hugging Face API token at initialization. See our [Secret Management](doc:secret-management) page for more information.

```python
from haystack.utils import Secret
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

document_embedder = SentenceTransformersSparseDocumentEmbedder(
    token=Secret.from_token("<your-api-key>")
)
```

### Backend Options

This component supports multiple backends for model execution:

- **torch** (default): Standard PyTorch backend
- **onnx**: Optimized ONNX Runtime backend for faster inference
- **openvino**: Intel OpenVINO backend for additional optimizations on Intel hardware

You can specify the backend during initialization:

```python
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

embedder = SentenceTransformersSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v2",
    backend="onnx"
)
```

For more information on acceleration and quantization options, refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html).

### Embedding Metadata

Text documents often include metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the document's text to improve retrieval.

You can do this with the `meta_fields_to_embed` initialization parameter:

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18}
)

embedder = SentenceTransformersSparseDocumentEmbedder(
    meta_fields_to_embed=["title"]
)
embedder.warm_up()

docs_w_sparse_embeddings = embedder.run(documents=[doc])["documents"]
```

## Usage

### On its own

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(content="I love pizza!")
doc_embedder = SentenceTransformersSparseDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run([doc])
print(result['documents'][0].sparse_embedding)

# SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])
```

### In a pipeline

Currently, sparse embedding retrieval is only supported by `QdrantDocumentStore`.

First, install the required package:

```shell
pip install qdrant-haystack
```

Then, try out this pipeline:

```python
from haystack import Document, Pipeline
from haystack.components.embedders import (
    SentenceTransformersSparseDocumentEmbedder,
    SentenceTransformersSparseTextEmbedder,
)
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    use_sparse_embeddings=True
)

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="Sentence Transformers provides sparse embedding models."),
]

# Indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(
    "sparse_document_embedder",
    SentenceTransformersSparseDocumentEmbedder()
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)
)
indexing_pipeline.connect("sparse_document_embedder", "writer")

indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})

# Query pipeline
query_pipeline = Pipeline()
query_pipeline.add_component(
    "sparse_text_embedder",
    SentenceTransformersSparseTextEmbedder()
)
query_pipeline.add_component(
    "sparse_retriever",
    QdrantSparseEmbeddingRetriever(document_store=document_store)
)
query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")

query = "Who provides sparse embedding models?"

result = query_pipeline.run({"sparse_text_embedder": {"text": query}})

print(result["sparse_retriever"]["documents"][0])

# Document(id=...,
#   content: 'Sentence Transformers provides sparse embedding models.',
#   score: 0.75...)
```
|