2025-02-19 11:28:54 +01:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/pictures_description.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"%pip install -q docling[vlm] ipython"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.base_models import InputFormat\n",
"from docling.datamodel.pipeline_options import PdfPipelineOptions\n",
"from docling.document_converter import DocumentConverter, PdfFormatOption"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# The source document\n",
"DOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Describe pictures with Granite Vision\n",
"\n",
"This section will run locally the [ibm-granite/granite-vision-3.1-2b-preview](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview) model to describe the pictures of the document."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "93a634699bf1434c9bc8e384d6db1a28",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from docling.datamodel.pipeline_options import granite_picture_description\n",
"\n",
"pipeline_options = PdfPipelineOptions()\n",
"pipeline_options.do_picture_description = True\n",
"pipeline_options.picture_description_options = (\n",
" granite_picture_description # <-- the model choice\n",
")\n",
"pipeline_options.picture_description_options.prompt = (\n",
" \"Describe the image in three sentences. Be consise and accurate.\"\n",
")\n",
"pipeline_options.images_scale = 2.0\n",
"pipeline_options.generate_picture_images = True\n",
"\n",
"converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=pipeline_options,\n",
" )\n",
" }\n",
")\n",
"doc = converter.convert(DOC_SOURCE).document"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<h3>Picture <code>#/pictures/0</code></h3><img src=\"
"<hr /><h3>Picture <code>#/pictures/1</code></h3><img src=\"
"<hr /><h3>Picture <code>#/pictures/2</code></h3><img src=\"
"<hr /><h3>Picture <code>#/pictures/3</code></h3><img src=\"
"<hr /><h3>Picture <code>#/pictures/4</code></h3><img src=\"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from docling_core.types.doc.document import PictureDescriptionData\n",
"from IPython import display\n",
"\n",
"html_buffer = []\n",
"# display the first 5 pictures and their captions and annotations:\n",
"for pic in doc.pictures[:5]:\n",
" html_item = (\n",
" f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n",
" f'<img src=\"{pic.image.uri!s}\" /><br />'\n",
" f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n",
" )\n",
" for annotation in pic.annotations:\n",
" if not isinstance(annotation, PictureDescriptionData):\n",
" continue\n",
" html_item += (\n",
" f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n",
" )\n",
" html_buffer.append(html_item)\n",
"display.HTML(\"<hr />\".join(html_buffer))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Describe pictures with SmolVLM\n",
"\n",
"This section will run locally the [HuggingFaceTB/SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct) model to describe the pictures of the document."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.pipeline_options import smolvlm_picture_description\n",
"\n",
"pipeline_options = PdfPipelineOptions()\n",
"pipeline_options.do_picture_description = True\n",
"pipeline_options.picture_description_options = (\n",
" smolvlm_picture_description # <-- the model choice\n",
")\n",
"pipeline_options.picture_description_options.prompt = (\n",
" \"Describe the image in three sentences. Be consise and accurate.\"\n",
")\n",
"pipeline_options.images_scale = 2.0\n",
"pipeline_options.generate_picture_images = True\n",
"\n",
"converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=pipeline_options,\n",
" )\n",
" }\n",
")\n",
"doc = converter.convert(DOC_SOURCE).document"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<h3>Picture <code>#/pictures/0</code></h3><img src=\"
"<hr /><h3>Picture <code>#/pictures/1</code></h3><img src=\"
"- Science\n",
"- Articles\n",
"- Law and Regulations\n",
"- Articles\n",
"- Misc.<br />\n",
"<hr /><h3>Picture <code>#/pictures/2</code></h3><img src=\"
"\n",
"The chart shows a clear trend: as the number of pages increases, the number of pages decreases. This is evident from the following points:\n",
"\n",
"- The number of pages increases from 100 to 1000.\n",
"- The number of pages decreases from 1000 to 10,000.\n",
"- The number of pages increases from 10,000 to 10,000.<br />\n",
"<hr /><h3>Picture <code>#/pictures/3</code></h3><img src=\"
"<hr /><h3>Picture <code>#/pictures/4</code></h3><img src=\"
"\n",
"- The x-axis represents the number of pages, ranging from 0 to 14.\n",
"- The y-axis represents the page count, ranging from 0 to 14.\n",
"- The chart has three categories: Marker, Unstructured, and Detailed.\n",
"- The x-axis is labeled \"see/page.\"\n",
"- The y-axis is labeled \"Page Count.\"\n",
"- The chart shows that the Marker category has the highest number of pages, followed by the Unstructured category, and then the Detailed category.<br />\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from docling_core.types.doc.document import PictureDescriptionData\n",
"from IPython import display\n",
"\n",
"html_buffer = []\n",
"# display the first 5 pictures and their captions and annotations:\n",
"for pic in doc.pictures[:5]:\n",
" html_item = (\n",
" f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n",
" f'<img src=\"{pic.image.uri!s}\" /><br />'\n",
" f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n",
" )\n",
" for annotation in pic.annotations:\n",
" if not isinstance(annotation, PictureDescriptionData):\n",
" continue\n",
" html_item += (\n",
" f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n",
" )\n",
" html_buffer.append(html_item)\n",
"display.HTML(\"<hr />\".join(html_buffer))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use other vision models\n",
"\n",
"The examples above can also be reproduced using other vision model.\n",
"The Docling options `PictureDescriptionVlmOptions` allows to specify your favorite vision model from the Hugging Face Hub."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n",
"\n",
"pipeline_options = PdfPipelineOptions()\n",
"pipeline_options.do_picture_description = True\n",
"pipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n",
" repo_id=\"\", # <-- add here the Hugging Face repo_id of your favorite VLM\n",
" prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n",
")\n",
"pipeline_options.images_scale = 2.0\n",
"pipeline_options.generate_picture_images = True\n",
"\n",
"converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=pipeline_options,\n",
" )\n",
" }\n",
")\n",
"\n",
"# Uncomment to run:\n",
"# doc = converter.convert(DOC_SOURCE).document"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "docling-hgXEfXco-py3.12",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}