mirror of
https://github.com/docling-project/docling.git
synced 2025-06-27 05:20:05 +00:00

* docs: add serializers docs, update chunking docs Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * update notebook to improve MD table rendering Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com>
41 lines
1.8 KiB
Markdown
41 lines
1.8 KiB
Markdown
## Introduction
|
||
|
||
A *document serializer* (AKA simply *serializer*) is a Docling abstraction that is
|
||
initialized with a given [`DoclingDocument`](./docling_document.md) and returns a
|
||
textual representation for that document.
|
||
|
||
Besides the document serializer, Docling defines similar abstractions for several
|
||
document subcomponents, for example: *text serializer*, *table serializer*,
|
||
*picture serializer*, *list serializer*, *inline serializer*, and more.
|
||
|
||
Last but not least, a *serializer provider* is a wrapper that abstracts the
|
||
document serialization strategy from the document instance.
|
||
|
||
## Base classes
|
||
|
||
To enable both flexibility for downstream applications and out-of-the-box utility,
|
||
Docling defines a serialization class hierarchy, providing:
|
||
|
||
- base types for the above abstractions: `BaseDocSerializer`, as well as
|
||
`BaseTextSerializer`, `BaseTableSerializer` etc, and `BaseSerializerProvider`, and
|
||
- specific subclasses for the above-mentioned base types, e.g. `MarkdownDocSerializer`.
|
||
|
||
You can review all methods required to define the above base classes [here](https://github.com/docling-project/docling-core/blob/main/docling_core/transforms/serializer/base.py).
|
||
|
||
From a client perspective, the most relevant is `BaseDocSerializer.serialize()`, which
|
||
returns the textual representation, as well as relevant metadata on which document
|
||
components contributed to that serialization.
|
||
|
||
## Use in `DoclingDocument` export methods
|
||
|
||
Docling provides predefined serializers for Markdown, HTML, and DocTags.
|
||
|
||
The respective `DoclingDocument` export methods (e.g. `export_to_markdown()`) are
|
||
provided as user shorthands — internally directly instantiating and delegating to
|
||
respective serializers.
|
||
|
||
## Examples
|
||
|
||
For an example showcasing how to use serializers, see
|
||
[here](../examples/serialization.ipynb).
|