## What's new
Docling v2 introduces several new features:
- Understands and converts PDF, MS Word, MS PowerPoint, HTML and several image formats
- Produces a new, universal document representation which can encapsulate document hierarchy
- Comes with a fresh new API and CLI
## Changes in Docling v2
### CLI
We updated the command-line syntax of Docling v2 to support many input and output formats. Some examples are shown below.
```shell
# Convert a single file to Markdown (default)
docling myfile.pdf
# Convert a single file to Markdown and JSON, without OCR
docling myfile.pdf --to json --to md --no-ocr
# Convert PDF files in input directory to Markdown (default)
docling ./input/dir --from pdf
# Convert PDF and Word files in input directory to Markdown and JSON
docling ./input/dir --from pdf --from docx --to md --to json --output ./scratch
# Convert all supported files in input directory to Markdown, but abort on first error
docling ./input/dir --output ./scratch --abort-on-error
```
**Notable changes from Docling v1:**
- The standalone switches for the different export formats have been removed and replaced with the `--from` and `--to` arguments, which define input and output formats respectively.
- The new `--abort-on-error` option aborts any batch conversion as soon as an error is encountered.
- The `--backend` option for PDFs has been removed.
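If you invoke the CLI from a script, the repeated `--from` and `--to` arguments can be assembled programmatically. The following is a minimal sketch, not part of Docling: `build_docling_command` is a hypothetical helper, and it only uses the flags shown in the examples above.

```python
from typing import Iterable, List, Optional


def build_docling_command(
    source: str,
    from_formats: Iterable[str] = (),
    to_formats: Iterable[str] = (),
    output: Optional[str] = None,
    abort_on_error: bool = False,
) -> List[str]:
    """Assemble an argument list for the `docling` CLI (hypothetical helper)."""
    cmd = ["docling", source]
    # `--from` and `--to` are repeated once per format, as in the CLI examples.
    for fmt in from_formats:
        cmd += ["--from", fmt]
    for fmt in to_formats:
        cmd += ["--to", fmt]
    if output is not None:
        cmd += ["--output", output]
    if abort_on_error:
        cmd.append("--abort-on-error")
    return cmd


cmd = build_docling_command(
    "./input/dir",
    from_formats=["pdf", "docx"],
    to_formats=["md", "json"],
    output="./scratch",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` if the `docling` executable is on your `PATH`.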
### Setting up a `DocumentConverter`
To accommodate many input formats, we changed the way you need to set up your `DocumentConverter` object.
You can now define a list of allowed formats on the `DocumentConverter` initialization, and specify custom options
per-format if desired. By default, all supported formats are allowed. If you don't provide `format_options`, defaults
will be used for all `allowed_formats`.
Format options can include the pipeline class to use, the options to provide to the pipeline, and the document backend.
They are provided as format-specific types, such as `PdfFormatOption` or `WordFormatOption`, as seen below.
```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

## Default initialization still works as before:
# doc_converter = DocumentConverter()

# previous `PipelineOptions` is now `PdfPipelineOptions`
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
#...

## Custom options are now defined per format.
doc_converter = (
    DocumentConverter(  # all of the below is optional, has internal defaults.
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
        ],  # whitelist formats, non-matching files are ignored.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,  # pipeline options go here.
                backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline  # default for office formats and HTML
            ),
        },
    )
)
```
**Note**: If you work only with defaults, all remains the same as in Docling v1.
More options are shown in the following example units:
- [run_with_formats.py](examples/run_with_formats.py)
- [custom_convert.py](examples/custom_convert.py)
### Converting documents
We have simplified the way you can feed input to the `DocumentConverter` and renamed the conversion methods for
better semantics. You can now call the conversion directly with a single file, a list of input files,
or `DocumentStream` objects, without constructing a `DocumentConversionInput` object first.
* `DocumentConverter.convert` now converts a single file input (previously `DocumentConverter.convert_single`).
* `DocumentConverter.convert_all` now converts many files at once (previously `DocumentConverter.convert`).
```python
...
from docling.datamodel.document import ConversionResult

## Convert a single file (from URL or local path)
conv_result: ConversionResult = doc_converter.convert("https://arxiv.org/pdf/2408.09869")  # previously `convert_single`

## Convert several files at once:
input_files = [
    "tests/data/html/wiki_duck.html",
    "tests/data/docx/word_sample.docx",
    "tests/data/docx/lorem_ipsum.docx",
    "tests/data/pptx/powerpoint_sample.pptx",
    "tests/data/2305.03393v1-pg9-img.png",
    "tests/data/pdf/2206.01062.pdf",
]

# Directly pass list of files or streams to `convert_all`
conv_results_iter = doc_converter.convert_all(input_files)  # previously `convert`
```
Through the `raises_on_error` argument, you can also control whether the conversion raises an exception on the first
problem it encounters, or resiliently converts all files first and reflects errors in each file's conversion status.
By default, any error is immediately raised and the conversion aborts (previously, exceptions were swallowed).
```python
...
conv_results_iter = doc_converter.convert_all(input_files, raises_on_error=False) # previously `convert`
```
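When errors are not raised, you will usually want to tally the outcomes once the batch finishes. The sketch below shows one way to do that. It assumes each result exposes a `status` attribute (an enum member or a plain string); `summarize_statuses` is a hypothetical helper, and the stand-in objects and their status strings are placeholders so the snippet runs without Docling installed.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_statuses(results: Iterable[Any]) -> Dict[str, int]:
    """Count results per conversion status.

    Assumes each result has a `status` attribute that is either an enum
    member (its `.value` is used) or a plain string.
    """
    counts: Counter = Counter()
    for res in results:
        status = res.status
        counts[str(getattr(status, "value", status))] += 1
    return dict(counts)


# Stand-ins for real conversion results; the status strings are placeholders.
fake_results = [
    SimpleNamespace(status="success"),
    SimpleNamespace(status="failure"),
    SimpleNamespace(status="success"),
]
summary = summarize_statuses(fake_results)
print(summary)
```

With real results, you would pass the iterator returned by `convert_all(..., raises_on_error=False)` instead of `fake_results`.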
### Access document structures
We have simplified how you can access and export the converted document data, too. Our universal document representation
is now available in conversion results as a `DoclingDocument` object.
`DoclingDocument` provides a neat set of APIs to construct, iterate and export content in the document, as shown below.
```python
import pandas as pd
from docling_core.types.doc import TableItem, TextItem  # item types used in the checks below

conv_result: ConversionResult = doc_converter.convert("https://arxiv.org/pdf/2408.09869")  # previously `convert_single`

## Inspect the converted document:
conv_result.document.print_element_tree()

## Iterate the elements in reading order, including hierarchy level:
for item, level in conv_result.document.iterate_items():
    if isinstance(item, TextItem):
        print(item.text)
    elif isinstance(item, TableItem):
        table_df: pd.DataFrame = item.export_to_dataframe()
        print(table_df.to_markdown())
    elif ...:
        pass  # ...
```
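The `level` value returned alongside each item can be used to render the hierarchy. The following is a small sketch under stated assumptions: `render_outline` is a hypothetical helper that accepts any iterable of `(item, level)` pairs, such as the one produced by `iterate_items()`, and it relies only on an optional `text` attribute. Stand-in objects are used here so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def render_outline(pairs: Iterable[Tuple[Any, int]], indent: str = "  ") -> List[str]:
    """Turn (item, level) pairs into indented outline lines."""
    lines = []
    for item, level in pairs:
        text = getattr(item, "text", None)
        # Items without text (e.g. tables) are shown by their type name.
        label = text if text else f"<{type(item).__name__}>"
        lines.append(f"{indent * level}{label}")
    return lines


class FakeTable:
    """Stand-in for an item that carries no `text` attribute."""


# Stand-in pairs; real ones would come from `conv_result.document.iterate_items()`.
pairs = [
    (SimpleNamespace(text="Title"), 0),
    (SimpleNamespace(text="Body paragraph"), 1),
    (FakeTable(), 1),
]
outline = render_outline(pairs)
print("\n".join(outline))
```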
**Note**: While it is deprecated, you can _still_ work with the Docling v1 document representation. It is available as:
```python
conv_result.legacy_document  # provides the representation in previous ExportedCCSDocument type
```
### Export into JSON, Markdown, Doctags
**Note**: All `render_...` methods in `ConversionResult` have been removed in Docling v2,
and are now available on `DoclingDocument` as:
- `DoclingDocument.export_to_dict`
- `DoclingDocument.export_to_markdown`
- `DoclingDocument.export_to_document_tokens`
```python
import json

conv_result: ConversionResult = doc_converter.convert("https://arxiv.org/pdf/2408.09869")  # previously `convert_single`

## Export to desired format:
print(json.dumps(conv_result.document.export_to_dict()))
print(conv_result.document.export_to_markdown())
print(conv_result.document.export_to_document_tokens())
```
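In practice you will often write the exports to files instead of printing them. Here is a minimal sketch of that pattern: `write_exports` is a hypothetical helper that calls only `export_to_markdown` and `export_to_dict`, and `FakeDocument` is a stand-in so the snippet runs without performing a conversion.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def write_exports(doc: Any, out_dir: Path, stem: str) -> Dict[str, Path]:
    """Write Markdown and JSON exports of `doc` into `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    md_path = out_dir / f"{stem}.md"
    json_path = out_dir / f"{stem}.json"
    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
    json_path.write_text(json.dumps(doc.export_to_dict()), encoding="utf-8")
    return {"md": md_path, "json": json_path}


class FakeDocument:
    """Stand-in exposing the two export methods used above."""

    def export_to_markdown(self) -> str:
        return "# Title\n\nBody text."

    def export_to_dict(self) -> dict:
        return {"name": "fake", "texts": ["Body text."]}


with tempfile.TemporaryDirectory() as tmp:
    paths = write_exports(FakeDocument(), Path(tmp) / "out", "sample")
    markdown_text = paths["md"].read_text(encoding="utf-8")
    reloaded = json.loads(paths["json"].read_text(encoding="utf-8"))
```

With a real conversion, you would pass `conv_result.document` in place of `FakeDocument()`.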
**Note**: While it is deprecated, you can _still_ export the Docling v1 JSON format. This is available through the same
methods as on the `DoclingDocument` type:
```python
## Export legacy document representation to desired format, for v1 compatibility:
print(json.dumps(conv_result.legacy_document.export_to_dict()))
print(conv_result.legacy_document.export_to_markdown())
print(conv_result.legacy_document.export_to_document_tokens())
```
### Reload a `DoclingDocument` stored as JSON
You can save a `DoclingDocument` to disk in JSON format and reload it using the following code:
```python
import json
from pathlib import Path

from docling_core.types.doc import DoclingDocument

# Save to disk:
doc: DoclingDocument = conv_result.document  # produced from conversion result...
with Path("./doc.json").open("w") as fp:
    fp.write(json.dumps(doc.export_to_dict()))  # use `export_to_dict` to ensure consistency

# Load from disk:
with Path("./doc.json").open("r") as fp:
    doc_dict = json.loads(fp.read())
    doc = DoclingDocument.model_validate(doc_dict)  # use standard pydantic API to populate doc
```
### Chunking
Docling v2 defines new base classes for chunking:
- `BaseMeta` for chunk metadata
- `BaseChunk` containing the chunk text and metadata, and
- `BaseChunker` for chunkers, producing chunks out of a `DoclingDocument`.
Additionally, it provides an updated `HierarchicalChunker` implementation, which
leverages the new `DoclingDocument` and provides a new, richer chunk output format, including:
- the respective doc items for grounding
- any applicable headings for context
- any applicable captions for context
For an example, check out [Chunking usage](usage.md#chunking).
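To illustrate why the headings and captions in the chunk metadata are useful, the sketch below prepends them to the chunk text, for example before embedding. This is an illustration only: `chunk_with_context` is a hypothetical helper, and the attribute names `text`, `meta`, `headings` and `captions` are assumptions about the chunk layout; check them against the actual chunk classes before relying on them.

```python
from types import SimpleNamespace
from typing import Any


def chunk_with_context(chunk: Any, sep: str = "\n") -> str:
    """Prefix chunk text with its headings and captions, if any.

    Assumes `chunk.text` plus optional `chunk.meta.headings` and
    `chunk.meta.captions` lists (assumed names, not confirmed API).
    """
    meta = chunk.meta
    headings = getattr(meta, "headings", None) or []
    captions = getattr(meta, "captions", None) or []
    return sep.join([*headings, *captions, chunk.text])


# Stand-in chunk so the sketch runs without Docling installed.
fake_chunk = SimpleNamespace(
    text="Docling converts documents.",
    meta=SimpleNamespace(headings=["Introduction"], captions=None),
)
contextualized = chunk_with_context(fake_chunk)
print(contextualized)
```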