docling/docs/concepts/serialization.md
Panos Vagenas 3220a592e7
docs: add serialization docs, update chunking docs (#1556)
* docs: add serializers docs, update chunking docs

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

* update notebook to improve MD table rendering

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>

---------

Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
2025-05-08 21:43:01 +02:00

41 lines
1.8 KiB
Markdown
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

## Introduction
A *document serializer* (AKA simply *serializer*) is a Docling abstraction that is
initialized with a given [`DoclingDocument`](./docling_document.md) and returns a
textual representation for that document.
Besides the document serializer, Docling defines similar abstractions for several
document subcomponents, for example: *text serializer*, *table serializer*,
*picture serializer*, *list serializer*, *inline serializer*, and more.
Last but not least, a *serializer provider* is a wrapper that abstracts the
document serialization strategy from the document instance.
## Base classes
To enable both flexibility for downstream applications and out-of-the-box utility,
Docling defines a serialization class hierarchy, providing:
- base types for the above abstractions: `BaseDocSerializer`, as well as
`BaseTextSerializer`, `BaseTableSerializer` etc, and `BaseSerializerProvider`, and
- specific subclasses for the above-mentioned base types, e.g. `MarkdownDocSerializer`.
You can review all methods required to define the above base classes [here](https://github.com/docling-project/docling-core/blob/main/docling_core/transforms/serializer/base.py).
From a client perspective, the most relevant is `BaseDocSerializer.serialize()`, which
returns the textual representation, as well as relevant metadata on which document
components contributed to that serialization.
## Use in `DoclingDocument` export methods
Docling provides predefined serializers for Markdown, HTML, and DocTags.
The respective `DoclingDocument` export methods (e.g. `export_to_markdown()`) are
provided as user shorthands — internally directly instantiating and delegating to
respective serializers.
## Examples
For an example showcasing how to use serializers, see
[here](../examples/serialization.ipynb).