---
title: "EmbeddingBasedDocumentSplitter"
id: embeddingbaseddocumentsplitter
slug: "/embeddingbaseddocumentsplitter"
description: "Use this component to split documents based on embedding similarity using cosine distances between sequential sentence groups."
---

# EmbeddingBasedDocumentSplitter

Use this component to split documents based on embedding similarity using cosine distances between sequential sentence groups.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines, after [Converters](../converters.mdx) and [`DocumentCleaner`](documentcleaner.mdx) |
| **Mandatory run variables** | `documents`: A list of documents to split into smaller documents based on embedding similarity |
| **Output variables** | `documents`: A list of split documents |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/embedding_based_document_splitter.py |

</div>

## Overview

This component splits documents based on embedding similarity, using cosine distances between sequential sentence groups.

It first splits the text into sentences, optionally groups them, calculates an embedding for each group, and then uses the cosine distance between sequential embeddings to determine split points: any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters (`\f`) in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.
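
To make the split rule concrete, here is a minimal, self-contained sketch of the idea (an illustration, not the component's actual implementation): neighboring group embeddings are compared by cosine distance, and any distance above the chosen percentile starts a new chunk.

```python
import numpy as np

def percentile_break_points(embeddings: list[list[float]], percentile: float = 0.95) -> list[int]:
    """Return the group indices at which a new split should start (illustrative only)."""
    vectors = np.asarray(embeddings, dtype=float)
    # Normalize rows so the dot product of neighbors equals their cosine similarity
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    # Cosine distance between each sentence group and the one that follows it
    distances = 1.0 - np.sum(vectors[:-1] * vectors[1:], axis=1)
    # Distances above the chosen percentile mark a topic shift
    threshold = np.percentile(distances, percentile * 100)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

# Four toy "group embeddings": the jump from the second to the third group is large,
# so a break point is reported before group index 2
groups = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(percentile_break_points(groups))  # [2]
```

In the component itself, `percentile` plays this role, while `min_length` and `max_length` post-process the result by merging chunks that are too short and further splitting chunks that are too long.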

## Usage

### On its own

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,  # Split when cosine distance exceeds the 95th percentile
    min_length=50,  # Merge splits shorter than 50 characters
    max_length=1000,  # Further split chunks longer than 1000 characters
)
splitter.warm_up()

result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk.
# Each split document includes metadata: source_id, split_id, and page_number.
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```
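
Each split `Document` carries its provenance in the `meta` dictionary, using the field names listed in the comments above; a quick way to inspect them:

```python
# Assumes `result` from the example above; meta field names per the component's output
for split_doc in result["documents"]:
    print(split_doc.meta["source_id"], split_doc.meta["split_id"], split_doc.meta["page_number"])
```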

### In a pipeline

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, EmbeddingBasedDocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# The splitter needs a document embedder to compute the group embeddings
embedder = SentenceTransformersDocumentEmbedder()

pipeline = Pipeline()
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(
    instance=EmbeddingBasedDocumentSplitter(
        document_embedder=embedder,
        sentences_per_group=2,
        percentile=0.95,
        min_length=50,
        max_length=1000,
    ),
    name="splitter",
)
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

pipeline.connect("text_file_converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
pipeline.run({"text_file_converter": {"sources": files}})
```