docling/docs/examples/develop_formula_understanding.py

# WARNING
# This example demonstrates only how to develop a new enrichment model.
# It does not run the actual formula understanding model.

import logging
from collections.abc import Iterable
from pathlib import Path

from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem

from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.models.base_model import BaseItemAndImageEnrichmentModel
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline


class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):
    do_formula_understanding: bool = True


# A new enrichment model using both the document element and its image as input
class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):
    images_scale = 2.6

    def __init__(self, enabled: bool):
        self.enabled = enabled

    def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:
        return (
            self.enabled
            and isinstance(element, TextItem)
            and element.label == DocItemLabel.FORMULA
        )

    def __call__(
        self,
        doc: DoclingDocument,
        element_batch: Iterable[ItemAndImageEnrichmentElement],
    ) -> Iterable[NodeItem]:
        if not self.enabled:
            return

        for enrich_element in element_batch:
            enrich_element.image.show()

            yield enrich_element.item


# How the pipeline can be extended.
class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):
    def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):
        super().__init__(pipeline_options)
        self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions

        self.enrichment_pipe = [
            ExampleFormulaUnderstandingEnrichmentModel(
                enabled=self.pipeline_options.do_formula_understanding
            )
        ]

        if self.pipeline_options.do_formula_understanding:
            self.keep_backend = True

    @classmethod
    def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:
        return ExampleFormulaUnderstandingPipelineOptions()


# Example main. In the final version, we simply have to set do_formula_understanding to true.
def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path("./tests/data/pdf/2203.01017v2.pdf")

    pipeline_options = ExampleFormulaUnderstandingPipelineOptions()
    pipeline_options.do_formula_understanding = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ExampleFormulaUnderstandingPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    doc_converter.convert(input_doc_path)


if __name__ == "__main__":
    main()
docs: Enrichment models (#1097) * warning for develop examples Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add docs for enrichment models Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * minor reorg of top-level docs (#1098) * minor reorg of top-level docs Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> * fix typo [no ci] Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> --------- Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> * trigger ci Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Panos Vagenas <pva@zurich.ibm.com> Signed-off-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> Co-authored-by: Panos Vagenas <35837085+vagenas@users.noreply.github.com> 2025-03-04 14:24:38 +01:00			`# WARNING`
			`# This example demonstrates only how to develop a new enrichment model.`
			`# It does not run the actual formula understanding model.`

refactor: allow the usage of backends in the enrich models and generalize the interface (#742) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> 2025-01-15 09:52:38 +01:00			`import logging`
ci: add coverage and ruff (#1383) * add coverage calculation and push Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * new codecov version and usage of token Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * enable ruff formatter instead of black and isort Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff lint fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff unsafe fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add removed imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * runs 1 on linter issues Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * finalize linter fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update pyproject.toml Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> 2025-04-14 18:01:26 +02:00			`from collections.abc import Iterable`
refactor: allow the usage of backends in the enrich models and generalize the interface (#742) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> 2025-01-15 09:52:38 +01:00			`from pathlib import Path`

			`from docling_core.types.doc import DocItemLabel, DoclingDocument, NodeItem, TextItem`

			`from docling.datamodel.base_models import InputFormat, ItemAndImageEnrichmentElement`
			`from docling.datamodel.pipeline_options import PdfPipelineOptions`
			`from docling.document_converter import DocumentConverter, PdfFormatOption`
			`from docling.models.base_model import BaseItemAndImageEnrichmentModel`
			`from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline`


			`class ExampleFormulaUnderstandingPipelineOptions(PdfPipelineOptions):`
			`do_formula_understanding: bool = True`


			`# A new enrichment model using both the document element and its image as input`
			`class ExampleFormulaUnderstandingEnrichmentModel(BaseItemAndImageEnrichmentModel):`
			`images_scale = 2.6`

			`def __init__(self, enabled: bool):`
			`self.enabled = enabled`

			`def is_processable(self, doc: DoclingDocument, element: NodeItem) -> bool:`
			`return (`
			`self.enabled`
			`and isinstance(element, TextItem)`
			`and element.label == DocItemLabel.FORMULA`
			`)`

			`def __call__(`
			`self,`
			`doc: DoclingDocument,`
			`element_batch: Iterable[ItemAndImageEnrichmentElement],`
			`) -> Iterable[NodeItem]:`
			`if not self.enabled:`
			`return`

			`for enrich_element in element_batch:`
			`enrich_element.image.show()`

			`yield enrich_element.item`


			`# How the pipeline can be extended.`
			`class ExampleFormulaUnderstandingPipeline(StandardPdfPipeline):`
			`def __init__(self, pipeline_options: ExampleFormulaUnderstandingPipelineOptions):`
			`super().__init__(pipeline_options)`
			`self.pipeline_options: ExampleFormulaUnderstandingPipelineOptions`

			`self.enrichment_pipe = [`
			`ExampleFormulaUnderstandingEnrichmentModel(`
			`enabled=self.pipeline_options.do_formula_understanding`
			`)`
			`]`

			`if self.pipeline_options.do_formula_understanding:`
			`self.keep_backend = True`

			`@classmethod`
			`def get_default_options(cls) -> ExampleFormulaUnderstandingPipelineOptions:`
			`return ExampleFormulaUnderstandingPipelineOptions()`


			`# Example main. In the final version, we simply have to set do_formula_understanding to true.`
			`def main():`
			`logging.basicConfig(level=logging.INFO)`

fix: Test cases for RTL programmatic PDFs and fixes for the formula model (#903) fix: Support for RTL programmatic documents fix(parser): detect and handle rotated pages fix(parser): fix bug causing duplicated text fix(formula): improve stopping criteria chore: update lock file fix: temporary constrain beautifulsoup * switch to code formula model v1.0.1 and new test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * switch to code formula model v1.0.1 and new test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * cleaned up the data folder in the tests Signed-off-by: Peter Staar <taa@zurich.ibm.com> * switch to code formula model v1.0.1 and new test pdf Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * added three test-files for right-to-left Signed-off-by: Peter Staar <taa@zurich.ibm.com> * fix black Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * added new gt for test_e2e_conversion Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * added new gt for test_e2e_conversion Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * Add code to expose text direction of cell Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * new test file Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> * update lock Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix mypy reports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix example filepaths Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add test data results Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * pin wheel of latest docling-parse release Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * use latest docling-core Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * remove debugging code Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * fix path to files in example Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Revert unwanted RTL additions Signed-off-by: Christoph Auer <cau@zurich.ibm.com> * Fix test data paths in examples Signed-off-by: Christoph Auer <cau@zurich.ibm.com> --------- Signed-off-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Signed-off-by: Peter Staar <taa@zurich.ibm.com> Signed-off-by: Christoph Auer <cau@zurich.ibm.com> Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Co-authored-by: Matteo-Omenetti <Matteo.Omenetti1@ibm.com> Co-authored-by: Peter Staar <taa@zurich.ibm.com> Co-authored-by: Christoph Auer <cau@zurich.ibm.com> 2025-02-07 08:43:31 +01:00			`input_doc_path = Path("./tests/data/pdf/2203.01017v2.pdf")`
refactor: allow the usage of backends in the enrich models and generalize the interface (#742) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> 2025-01-15 09:52:38 +01:00
			`pipeline_options = ExampleFormulaUnderstandingPipelineOptions()`
			`pipeline_options.do_formula_understanding = True`

			`doc_converter = DocumentConverter(`
			`format_options={`
			`InputFormat.PDF: PdfFormatOption(`
			`pipeline_cls=ExampleFormulaUnderstandingPipeline,`
			`pipeline_options=pipeline_options,`
			`)`
			`}`
			`)`
ci: add coverage and ruff (#1383) * add coverage calculation and push Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * new codecov version and usage of token Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * enable ruff formatter instead of black and isort Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff lint fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * apply ruff unsafe fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * add removed imports Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * runs 1 on linter issues Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * finalize linter fixes Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * Update pyproject.toml Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> Signed-off-by: Michele Dolfi <97102151+dolfim-ibm@users.noreply.github.com> Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> 2025-04-14 18:01:26 +02:00			`doc_converter.convert(input_doc_path)`
refactor: allow the usage of backends in the enrich models and generalize the interface (#742) * fix get image with cropbox Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * allow the usage of backends in the enrich models and generalize the interface Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * move logic in BaseTextImageEnrichmentModel Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> * renaming Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> --------- Signed-off-by: Michele Dolfi <dol@zurich.ibm.com> 2025-01-15 09:52:38 +01:00

			`if __name__ == "__main__":`
			`main()`