---
title: "EmbeddingBasedDocumentSplitter"
id: embeddingbaseddocumentsplitter
slug: "/embeddingbaseddocumentsplitter"
description: "Use this component to split documents based on embedding similarity using cosine distances between sequential sentence groups."
---

# EmbeddingBasedDocumentSplitter

Use this component to split documents based on embedding similarity using cosine distances between sequential sentence groups.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines, after [Converters](../converters.mdx) and [`DocumentCleaner`](documentcleaner.mdx) |
| **Mandatory run variables** | `documents`: A list of documents to split into smaller documents based on embedding similarity |
| **Output variables** | `documents`: A list of split documents |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/embedding_based_document_splitter.py |

</div>

## Overview

This component splits documents based on embedding similarity, using cosine distances between sequential sentence groups.

It first splits the text into sentences, optionally groups them, calculates an embedding for each group, and then uses the cosine distance between sequential embeddings to determine split points: any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters (`\f`) in the original document.

This component is inspired by [5 Levels of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt.
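
To make the split rule concrete, here is a minimal, self-contained sketch of the idea (an illustration, not the component's actual implementation): neighboring group embeddings are compared by cosine distance, and any distance above the chosen percentile starts a new chunk.

```python
import numpy as np

def percentile_break_points(embeddings: list[list[float]], percentile: float = 0.95) -> list[int]:
    """Return the group indices at which a new split should start (illustrative only)."""
    vectors = np.asarray(embeddings, dtype=float)
    # Normalize rows so the dot product of neighbors equals their cosine similarity
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    # Cosine distance between each sentence group and the one that follows it
    distances = 1.0 - np.sum(vectors[:-1] * vectors[1:], axis=1)
    # Distances above the chosen percentile mark a topic shift
    threshold = np.percentile(distances, percentile * 100)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

# Four toy "group embeddings": the jump from the second to the third group is large,
# so a break point is reported before group index 2
groups = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(percentile_break_points(groups))  # [2]
```

In the component itself, `percentile` plays this role, while `min_length` and `max_length` post-process the result by merging chunks that are too short and further splitting chunks that are too long.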

## Usage

### On its own

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,  # Split when cosine distance exceeds the 95th percentile
    min_length=50,  # Merge splits shorter than 50 characters
    max_length=1000,  # Further split chunks longer than 1000 characters
)
splitter.warm_up()

result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk.
# Each split document includes metadata: source_id, split_id, and page_number.
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")
```
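
Each split `Document` carries its provenance in the `meta` dictionary, using the field names listed in the comments above; a quick way to inspect them:

```python
# Assumes `result` from the example above; meta field names per the component's output
for split_doc in result["documents"]:
    print(split_doc.meta["source_id"], split_doc.meta["split_id"], split_doc.meta["page_number"])
```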

### In a pipeline

```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, EmbeddingBasedDocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# The splitter needs a document embedder to compute the group embeddings
embedder = SentenceTransformersDocumentEmbedder()

pipeline = Pipeline()
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(
    instance=EmbeddingBasedDocumentSplitter(
        document_embedder=embedder,
        sentences_per_group=2,
        percentile=0.95,
        min_length=50,
        max_length=1000,
    ),
    name="splitter",
)
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

pipeline.connect("text_file_converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
pipeline.run({"text_file_converter": {"sources": files}})
```