
## Introduction

A *chunker* is a Docling abstraction that, given a
[`DoclingDocument`](./docling_document.md), returns a stream of chunks, each of which
captures some part of the document as a string accompanied by the corresponding metadata.

To enable both flexibility for downstream applications and out-of-the-box utility,
Docling defines a chunker class hierarchy, providing a base type, `BaseChunker`, as well
as specific subclasses.

Docling's integrations with generative AI frameworks such as LlamaIndex rely on the
`BaseChunker` interface, so users can plug in any built-in, self-defined, or
third-party `BaseChunker` implementation.

## Base Chunker

The `BaseChunker` base class API requires every chunker to provide the following:

- `def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]`:
  Returns the chunks for the provided document.
- `def serialize(self, chunk: BaseChunk) -> str`:
  Returns the potentially metadata-enriched serialization of the chunk, typically
  used to feed an embedding model (or generation model).
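
To make the shape of this interface concrete, here is a minimal, dependency-free sketch. It does not use the real Docling types: `ToyDocument`, `ToyChunk`, and `ParagraphChunker` are hypothetical stand-ins that only mirror the two methods listed above (a lazy `chunk` and a metadata-enriching `serialize`).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for DoclingDocument: a title plus paragraphs."""
    title: str
    paragraphs: List[str]


@dataclass
class ToyChunk:
    """Hypothetical stand-in for BaseChunk: chunk text plus metadata."""
    text: str
    meta: Dict[str, object] = field(default_factory=dict)


class ParagraphChunker:
    """Toy chunker exposing the same two methods as the BaseChunker API."""

    def chunk(self, dl_doc: ToyDocument, **kwargs) -> Iterator[ToyChunk]:
        # Yield chunks lazily, one per non-empty paragraph.
        for index, paragraph in enumerate(dl_doc.paragraphs):
            stripped = paragraph.strip()
            if stripped:
                yield ToyChunk(text=stripped, meta={"title": dl_doc.title, "index": index})

    def serialize(self, chunk: ToyChunk) -> str:
        # Prepend metadata so an embedding model sees the surrounding context.
        return f"{chunk.meta['title']}\n{chunk.text}"


doc = ToyDocument(title="Chunking", paragraphs=["First paragraph.", "  ", "Second paragraph."])
chunker = ParagraphChunker()
chunks = list(chunker.chunk(doc))
serialized = [chunker.serialize(c) for c in chunks]
```

Keeping `chunk` as a generator means a consumer can start embedding the first chunks before the whole document has been processed.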

## Hybrid Chunker

!!! note "To access `HybridChunker`"

    - If you are using the `docling` package, you can import as follows:
        ```python
        from docling.chunking import HybridChunker
        ```
    - If you are only using the `docling-core` package, make sure to install
        the `chunking` extra, e.g.
        ```shell
        pip install 'docling-core[chunking]'
        ```
        and then import as follows:
        ```python
        from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
        ```

The `HybridChunker` implementation uses a hybrid approach, applying tokenization-aware
refinements on top of document-based [hierarchical](#hierarchical-chunker) chunking.

More precisely:

- it starts from the result of the hierarchical chunker and, based on the user-provided
  tokenizer (typically aligned with the embedding model's tokenizer), it:
    - does one pass in which it splits chunks only when needed (i.e. when oversized
      w.r.t. tokens), and
    - does another pass in which it merges chunks only when possible (i.e. undersized
      successive chunks with the same headings and captions); users can opt out of this
      step via the `merge_peers` parameter (`True` by default)
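
The two passes described above can be sketched roughly as follows. This is a simplified illustration, not the `HybridChunker` implementation: whitespace splitting stands in for a real tokenizer, only headings (not captions) are compared, and `ToyChunk`, `count_tokens`, `split_oversized`, and the `max_tokens` parameter are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyChunk:
    """Hypothetical chunk: text plus the headings it sits under."""
    text: str
    headings: Tuple[str, ...]


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def split_oversized(chunks: List[ToyChunk], max_tokens: int) -> List[ToyChunk]:
    """Pass 1: split a chunk only if it exceeds the token budget."""
    result: List[ToyChunk] = []
    for chunk in chunks:
        words = chunk.text.split()
        if len(words) <= max_tokens:
            result.append(chunk)
            continue
        for start in range(0, len(words), max_tokens):
            piece = " ".join(words[start:start + max_tokens])
            result.append(ToyChunk(piece, chunk.headings))
    return result


def merge_peers(chunks: List[ToyChunk], max_tokens: int) -> List[ToyChunk]:
    """Pass 2: merge successive chunks that share headings and still fit."""
    result: List[ToyChunk] = []
    for chunk in chunks:
        if result:
            previous = result[-1]
            combined = f"{previous.text} {chunk.text}"
            if previous.headings == chunk.headings and count_tokens(combined) <= max_tokens:
                result[-1] = ToyChunk(combined, previous.headings)
                continue
        result.append(chunk)
    return result


hierarchical_chunks = [
    ToyChunk("one two three four five six seven", ("Intro",)),
    ToyChunk("alpha beta", ("Usage",)),
    ToyChunk("gamma", ("Usage",)),
    ToyChunk("delta", ("API",)),
]
refined = merge_peers(split_oversized(hierarchical_chunks, max_tokens=4), max_tokens=4)
texts = [c.text for c in refined]
```

In this toy run the oversized "Intro" chunk is split in two, and the two small "Usage" chunks are merged because they share headings and fit the budget together. The "API" chunk stays separate because its heading differs.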

👉 Example: see [here](../examples/hybrid_chunking.ipynb).

## Hierarchical Chunker

The `HierarchicalChunker` implementation uses the document structure information from
the [`DoclingDocument`](./docling_document.md) to create one chunk for each individual
detected document element. By default, it merges only list items together (users can opt
out via the `merge_list_items` parameter). It also takes care of attaching all relevant
document metadata, including headers and captions.
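
As a rough illustration of this behavior, the sketch below produces one chunk per element, groups consecutive list items, and attaches the current heading as metadata. It is an assumption-laden simplification rather than the library's code: the document is flattened to hypothetical `(kind, text)` pairs, only the most recent heading is tracked, and captions are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a document is flattened to (kind, text) pairs in reading order.
Element = Tuple[str, str]


@dataclass
class ToyChunk:
    """Hypothetical chunk: text plus the heading it sits under."""
    text: str
    headings: Tuple[str, ...]


def hierarchical_chunks(elements: List[Element], merge_list_items: bool = True) -> List[ToyChunk]:
    chunks: List[ToyChunk] = []
    current_heading: Tuple[str, ...] = ()
    pending_items: List[str] = []

    def flush_list() -> None:
        # Emit any buffered list items as a single chunk.
        if pending_items:
            chunks.append(ToyChunk("\n".join(pending_items), current_heading))
            pending_items.clear()

    for kind, text in elements:
        if kind == "list_item" and merge_list_items:
            pending_items.append(text)
            continue
        flush_list()
        if kind == "heading":
            # Headings become metadata for the chunks that follow them.
            current_heading = (text,)
        else:
            chunks.append(ToyChunk(text, current_heading))
    flush_list()
    return chunks


elements = [
    ("heading", "Setup"),
    ("paragraph", "Install the package."),
    ("list_item", "Step one"),
    ("list_item", "Step two"),
    ("heading", "Usage"),
    ("paragraph", "Call the chunker."),
]
merged = hierarchical_chunks(elements)
unmerged = hierarchical_chunks(elements, merge_list_items=False)
```

With `merge_list_items=False`, each list item in this sketch becomes its own chunk, which mirrors the opt-out described above.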