docling/examples/export_figures.py

86 lines
2.5 KiB
Python
Raw Normal View History

import logging
import time
from pathlib import Path
from typing import Tuple
from docling.datamodel.base_models import (
AssembleOptions,
ConversionStatus,
FigureElement,
PageElement,
TableElement,
)
from docling.datamodel.document import DocumentConversionInput
from docling.document_converter import DocumentConverter
_log = logging.getLogger(__name__)
IMAGE_RESOLUTION_SCALE = 2.0
def main():
logging.basicConfig(level=logging.INFO)
input_doc_paths = [
fix: Add unit tests (#51) * add the pytests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * renamed the test folder and added the toplevel test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the toplevel function test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to start running all tests successfully Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the reference converted documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added first test for json and md output Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix backend tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * commented out the drawing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ci: avoid duplicate runs Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * commented out json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added verification of input cells Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformat code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * run all examples in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * make sure examples return failures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * raise a failure if examples fail Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * run examples after tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add tests and update top_level_tests using only datamodels Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unnecessary code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Validate conversion status on e2e test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * package verify utils and add more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * reduce docs in example, since they are already in the tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip batch_convert Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse 1.1.2 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated the error messages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * commented out the json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * bumped GLM version Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin new docling-parse v1.1.3 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
Path("./tests/data/2206.01062.pdf"),
]
output_dir = Path("./scratch")
input_files = DocumentConversionInput.from_paths(input_doc_paths)
# Important: For operating with page images, we must keep them, otherwise the DocumentConverter
# will destroy them for cleaning up memory.
# This is done by setting AssembleOptions.images_scale, which also defines the scale of images.
# scale=1 correspond of a standard 72 DPI image
assemble_options = AssembleOptions()
assemble_options.images_scale = IMAGE_RESOLUTION_SCALE
doc_converter = DocumentConverter(assemble_options=assemble_options)
start_time = time.time()
conv_results = doc_converter.convert(input_files)
fix: Add unit tests (#51) * add the pytests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * renamed the test folder and added the toplevel test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the toplevel function test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to start running all tests successfully Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the reference converted documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added first test for json and md output Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix backend tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * commented out the drawing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ci: avoid duplicate runs Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * commented out json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added verification of input cells Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformat code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * run all examples in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * make sure examples return failures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * raise a failure if examples fail Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * run examples after tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add tests and update top_level_tests using only datamodels Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unnecessary code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Validate conversion status on e2e test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * package verify utils and add more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * reduce docs in example, since they are already in the tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip batch_convert Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse 1.1.2 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated the error messages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * commented out the json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * bumped GLM version Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin new docling-parse v1.1.3 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
success_count = 0
failure_count = 0
output_dir.mkdir(parents=True, exist_ok=True)
for conv_res in conv_results:
if conv_res.status != ConversionStatus.SUCCESS:
_log.info(f"Document {conv_res.input.file} failed to convert.")
fix: Add unit tests (#51) * add the pytests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * renamed the test folder and added the toplevel test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the toplevel function test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to start running all tests successfully Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the reference converted documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added first test for json and md output Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix backend tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * commented out the drawing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ci: avoid duplicate runs Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * commented out json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added verification of input cells Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformat code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * run all examples in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * make sure examples return failures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * raise a failure if examples fail Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * run examples after tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add tests and update top_level_tests using only datamodels Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unnecessary code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Validate conversion status on e2e test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * package verify utils and add more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * reduce docs in example, since they are already in the tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip batch_convert Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse 1.1.2 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated the error messages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * commented out the json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * bumped GLM version Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin new docling-parse v1.1.3 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
failure_count += 1
continue
doc_filename = conv_res.input.file.stem
# Export page images
for page in conv_res.pages:
page_no = page.page_no + 1
page_image_filename = output_dir / f"{doc_filename}-{page_no}.png"
with page_image_filename.open("wb") as fp:
page.image.save(fp, format="PNG")
# Export figures and tables
for element, image in conv_res.render_element_images(
element_types=(FigureElement, TableElement)
):
element_image_filename = (
output_dir / f"{doc_filename}-element-{element.id}.png"
)
with element_image_filename.open("wb") as fp:
image.save(fp, "PNG")
fix: Add unit tests (#51) * add the pytests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * renamed the test folder and added the toplevel test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the toplevel function test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to start running all tests successfully Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the reference converted documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added first test for json and md output Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix backend tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * commented out the drawing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ci: avoid duplicate runs Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * commented out json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added verification of input cells Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformat code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * run all examples in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * make sure examples return failures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * raise a failure if examples fail Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * run examples after tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add tests and update top_level_tests using only datamodels Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unnecessary code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Validate conversion status on e2e test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * package verify utils and add more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * reduce docs in example, since they are already in the tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip batch_convert Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse 1.1.2 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated the error messages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * commented out the json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * bumped GLM version Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin new docling-parse v1.1.3 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
success_count += 1
end_time = time.time() - start_time
_log.info(f"All documents were converted in {end_time:.2f} seconds.")
fix: Add unit tests (#51) * add the pytests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * renamed the test folder and added the toplevel test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * updated the toplevel function test Signed-off-by: Peter Staar <taa@zurich.ibm.com> * need to start running all tests successfully Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added the reference converted documents Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added first test for json and md output Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ran pre-commit Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * replaced deprecated json function with model_dump_json Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformatted code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix backend tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * commented out the drawing Signed-off-by: Peter Staar <taa@zurich.ibm.com> * ci: avoid duplicate runs Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> * commented out json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added verification of input cells Signed-off-by: Peter Staar <taa@zurich.ibm.com> * reformat code Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * added test to verify the cells in the pages (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com> * run all examples in CI Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * make sure examples return failures Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * raise a failure if examples fail Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * run examples after tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Add tests and update top_level_tests using only datamodels Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Remove unnecessary code Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Validate conversion status on e2e test Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * package verify utils and add more tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * reduce docs in example, since they are already in the tests Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * skip batch_convert Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin docling-parse 1.1.2 Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * updated the error messages Signed-off-by: Peter Staar <taa@zurich.ibm.com> * commented out the json verification for now Signed-off-by: Peter Staar <taa@zurich.ibm.com> * bumped GLM version Signed-off-by: Peter Staar <taa@zurich.ibm.com> * Fix lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Pin new docling-parse v1.1.3 Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> Co-authored-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Michele Dolfi <dol@zurich.ibm.com>
2024-08-30 14:08:20 +02:00
if failure_count > 0:
raise RuntimeError(
f"The example failed converting {failure_count} on {len(input_doc_paths)}."
)
if __name__ == "__main__":
main()