Mirror of https://github.com/docling-project/docling.git, synced 2025-06-27 05:20:05 +00:00
fix: prov for merged-elems (#1728)
* fix: prov for merged-elems
  Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatted the code
  Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* Reset pyproject.toml
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix tests
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
---------
Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Co-authored-by: Christoph Auer <cau@zurich.ibm.com>
This commit is contained in:
parent
e979750ce9
commit
6613b9e98b
@@ -334,12 +334,12 @@ class ReadingOrderModel:
             "Labels of merged elements must match."
         )
         prov = ProvenanceItem(
-            page_no=element.page_no + 1,
+            page_no=merged_elem.page_no + 1,
             charspan=(
                 len(new_item.text) + 1,
                 len(new_item.text) + 1 + len(merged_elem.text),
             ),
-            bbox=element.cluster.bbox.to_bottom_left_origin(page_height),
+            bbox=merged_elem.cluster.bbox.to_bottom_left_origin(page_height),
         )
         new_item.text += f" {merged_elem.text}"
         new_item.orig += f" {merged_elem.text}"  # TODO: This is incomplete, we don't have the `orig` field of the merged element.
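For reference, a minimal, self-contained sketch of the behaviour this hunk fixes: when two layout elements are merged, the provenance entry appended for the extra fragment should carry the page and bbox of the merged element (merged_elem), not of the anchor element. The dataclasses and the append_merged helper below are simplified stand-ins, not docling's actual ProvenanceItem/BoundingBox API, and the sample values are hypothetical.

# Sketch only: simplified stand-ins for docling's ProvenanceItem / cluster bbox.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Prov:
    page_no: int                              # 1-based page number
    bbox: Tuple[float, float, float, float]   # (l, t, r, b), bottom-left origin
    charspan: Tuple[int, int]                 # span of the fragment inside the merged text

@dataclass
class Item:
    text: str
    prov: List[Prov] = field(default_factory=list)

def append_merged(new_item: Item, merged_page_no: int,
                  merged_bbox: Tuple[float, float, float, float],
                  merged_text: str) -> None:
    # The charspan covers the fragment about to be appended; the +1 offset
    # accounts for the joining space, mirroring the hunk above.
    prov = Prov(
        page_no=merged_page_no + 1,   # page of the merged element, not the anchor
        bbox=merged_bbox,             # bbox of the merged element, not the anchor
        charspan=(len(new_item.text) + 1,
                  len(new_item.text) + 1 + len(merged_text)),
    )
    new_item.prov.append(prov)
    new_item.text += f" {merged_text}"

# Hypothetical usage: the appended fragment is recoverable from its charspan.
item = Item(text="5.1. Implementation")
append_merged(item, merged_page_no=5,
              merged_bbox=(308.86, 716.79, 545.12, 683.98), merged_text="Details")
assert item.text[item.prov[0].charspan[0]:item.prov[0].charspan[1]] == "Details"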
@@ -109,7 +109,7 @@
<subtitle-level-1><location><page_6><loc_8><loc_28><loc_28><loc_30></location>5. Experimental Results</subtitle-level-1>
<subtitle-level-1><location><page_6><loc_8><loc_26><loc_29><loc_27></location>5.1. Implementation Details</subtitle-level-1>
<paragraph><location><page_6><loc_8><loc_19><loc_47><loc_25></location>TableFormer uses ResNet-18 as the CNN Backbone Network . The input images are resized to 448*448 pixels and the feature map has a dimension of 28*28. Additionally, we enforce the following input constraints:</paragraph>
<paragraph><location><page_6><loc_8><loc_10><loc_47><loc_13></location><location><page_6><loc_8><loc_10><loc_47><loc_13></location>Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved runtime performance and lower memory footprint of TableFormer. This allows to utilize input samples with longer sequences and images with larger dimensions.</paragraph>
<paragraph><location><page_6><loc_8><loc_10><loc_47><loc_13></location><location><page_6><loc_50><loc_86><loc_89><loc_91></location>Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved runtime performance and lower memory footprint of TableFormer. This allows to utilize input samples with longer sequences and images with larger dimensions.</paragraph>
<paragraph><location><page_6><loc_50><loc_59><loc_89><loc_85></location>The Transformer Encoder consists of two "Transformer Encoder Layers", with an input feature size of 512, feed forward network of 1024, and 4 attention heads. As for the Transformer Decoder it is composed of four "Transformer Decoder Layers" with similar input and output dimensions as the "Transformer Encoder Layers". Even though our model uses fewer layers and heads than the default implementation parameters, our extensive experimentation has proved this setup to be more suitable for table images. We attribute this finding to the inherent design of table images, which contain mostly lines and text, unlike the more elaborate content present in other scopes (e.g. the COCO dataset). Moreover, we have added ResNet blocks to the inputs of the Structure Decoder and Cell BBox Decoder. This prevents a decoder having a stronger influence over the learned weights which would damage the other prediction task (structure vs bounding boxes), but learn task specific weights instead. Lastly our dropout layers are set to 0.5.</paragraph>
<paragraph><location><page_6><loc_50><loc_46><loc_89><loc_58></location>For training, TableFormer is trained with 3 Adam optimizers, each one for the CNN Backbone Network , Structure Decoder , and Cell BBox Decoder . Taking the PubTabNet as an example for our parameter set up, the initializing learning rate is 0.001 for 12 epochs with a batch size of 24, and λ set to 0.5. Afterwards, we reduce the learning rate to 0.0001, the batch size to 18 and train for 12 more epochs or convergence.</paragraph>
<paragraph><location><page_6><loc_50><loc_30><loc_89><loc_45></location>TableFormer is implemented with PyTorch and Torchvision libraries [22]. To speed up the inference, the image undergoes a single forward pass through the CNN Backbone Network and transformer encoder. This eliminates the overhead of generating the same features for each decoding step. Similarly, we employ a 'caching' technique to preform faster autoregressive decoding. This is achieved by storing the features of decoded tokens so we can reuse them for each time step. Therefore, we only compute the attention for each new tag.</paragraph>
@@ -137,7 +137,7 @@
</table>
<paragraph><location><page_7><loc_8><loc_23><loc_47><loc_25></location>Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).</paragraph>
<paragraph><location><page_7><loc_8><loc_21><loc_43><loc_22></location>FT: Model was trained on PubTabNet then finetuned.</paragraph>
<paragraph><location><page_7><loc_8><loc_10><loc_47><loc_19></location><location><page_7><loc_8><loc_10><loc_47><loc_19></location>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</paragraph>
<paragraph><location><page_7><loc_8><loc_10><loc_47><loc_19></location><location><page_7><loc_50><loc_71><loc_89><loc_91></location>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</paragraph>
<table>
<location><page_7><loc_50><loc_62><loc_87><loc_69></location>
<caption>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption>
@@ -263,7 +263,7 @@
<paragraph><location><page_11><loc_8><loc_21><loc_47><loc_51></location>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</paragraph>
<paragraph><location><page_11><loc_8><loc_18><loc_47><loc_20></location>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</paragraph>
<subtitle-level-1><location><page_11><loc_8><loc_15><loc_25><loc_16></location>1.2. Synthetic datasets</subtitle-level-1>
<paragraph><location><page_11><loc_8><loc_10><loc_47><loc_14></location><location><page_11><loc_8><loc_10><loc_47><loc_14></location>Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear- ances in regard to their size, structure, style and content. Every synthetic dataset contains 150k examples, summing up to 600k synthetic examples. All datasets are divided into Train, Test and Val splits (80%, 10%, 10%).</paragraph>
<paragraph><location><page_11><loc_8><loc_10><loc_47><loc_14></location><location><page_11><loc_50><loc_74><loc_89><loc_79></location>Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear- ances in regard to their size, structure, style and content. Every synthetic dataset contains 150k examples, summing up to 600k synthetic examples. All datasets are divided into Train, Test and Val splits (80%, 10%, 10%).</paragraph>
<paragraph><location><page_11><loc_50><loc_71><loc_89><loc_73></location>The process of generating a synthetic dataset can be decomposed into the following steps:</paragraph>
<paragraph><location><page_11><loc_50><loc_60><loc_89><loc_70></location>- 1. Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data, marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.).</paragraph>
<paragraph><location><page_11><loc_50><loc_43><loc_89><loc_60></location>- 2. Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header - body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans.</paragraph>
@@ -1742,10 +1742,10 @@
 },
 {
 "bbox": [
-50.112061,
-78.84812199999999,
-286.36514,
-99.70968600000003
+308.86206,
+683.9751,
+545.11523,
+716.79169
 ],
 "page": 6,
 "span": [
@@ -2106,10 +2106,10 @@
 },
 {
 "bbox": [
-50.112015,
-78.84806800000001,
-286.366,
-147.65019000000007
+308.862,
+564.42291,
+545.11517,
+716.79163
 ],
 "page": 7,
 "span": [
@@ -3660,10 +3660,10 @@
 },
 {
 "bbox": [
-50.111984,
-77.85229500000003,
-286.36505,
-110.66886999999997
+308.862,
+584.57227,
+545.11511,
+629.34485
 ],
 "page": 11,
 "span": [
@@ -80,7 +80,7 @@
</figure>
<caption><location><page_4><loc_9><loc_23><loc_48><loc_30></location>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption>
<paragraph><location><page_4><loc_9><loc_15><loc_48><loc_20></location>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</paragraph>
<paragraph><location><page_4><loc_9><loc_11><loc_48><loc_14></location><location><page_4><loc_9><loc_11><loc_48><loc_14></location>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</paragraph>
<paragraph><location><page_4><loc_9><loc_11><loc_48><loc_14></location><location><page_4><loc_52><loc_53><loc_91><loc_61></location>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</paragraph>
<paragraph><location><page_4><loc_52><loc_36><loc_91><loc_52></location>Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.</paragraph>
<paragraph><location><page_4><loc_52><loc_12><loc_91><loc_36></location>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula , List-item , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</paragraph>
<paragraph><location><page_5><loc_9><loc_87><loc_48><loc_89></location>the textual content of an element, which goes beyond visual layout recognition, in particular outside the Scientific Articles category.</paragraph>
@@ -1337,10 +1337,10 @@
 },
 {
 "bbox": [
-53.79800000000001,
-83.57982600000003,
-295.55844,
-113.98901000000001
+317.95499,
+416.75183,
+559.18536,
+479.92047
 ],
 "page": 4,
 "span": [
@@ -213,9 +213,9 @@
 "prov": [
 {
 "bbox": [
-139.66746520996094,
+139.6674041748047,
 322.5054626464844,
-475.0093078613281,
+475.00927734375,
 454.4546203613281
 ],
 "page": 1,
@@ -2646,7 +2646,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9373531937599182,
+"confidence": 0.9373533129692078,
 "cells": [
 {
 "index": 0,
@@ -2686,7 +2686,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.8858677744865417,
+"confidence": 0.8858679533004761,
 "cells": [
 {
 "index": 1,
@@ -2881,7 +2881,7 @@
 "b": 255.42400999999995,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.98504239320755,
+"confidence": 0.9850425124168396,
 "cells": [
 {
 "index": 7,
@@ -3096,7 +3096,7 @@
 "b": 327.98218,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9591910243034363,
+"confidence": 0.9591907262802124,
 "cells": [
 {
 "index": 15,
@@ -3280,9 +3280,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -7852,7 +7852,7 @@
 "b": 618.3,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9849975109100342,
+"confidence": 0.9849976301193237,
 "cells": [
 {
 "index": 93,
@@ -8184,9 +8184,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -13582,7 +13582,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9373531937599182,
+"confidence": 0.9373533129692078,
 "cells": [
 {
 "index": 0,
@@ -13628,7 +13628,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.8858677744865417,
+"confidence": 0.8858679533004761,
 "cells": [
 {
 "index": 1,
@@ -13841,7 +13841,7 @@
 "b": 255.42400999999995,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.98504239320755,
+"confidence": 0.9850425124168396,
 "cells": [
 {
 "index": 7,
@@ -14062,7 +14062,7 @@
 "b": 327.98218,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9591910243034363,
+"confidence": 0.9591907262802124,
 "cells": [
 {
 "index": 15,
@@ -14252,9 +14252,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -19713,7 +19713,7 @@
 "b": 618.3,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9849975109100342,
+"confidence": 0.9849976301193237,
 "cells": [
 {
 "index": 93,
@@ -20224,7 +20224,7 @@
 "b": 255.42400999999995,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.98504239320755,
+"confidence": 0.9850425124168396,
 "cells": [
 {
 "index": 7,
@@ -20445,7 +20445,7 @@
 "b": 327.98218,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9591910243034363,
+"confidence": 0.9591907262802124,
 "cells": [
 {
 "index": 15,
@@ -20635,9 +20635,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -26096,7 +26096,7 @@
 "b": 618.3,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9849975109100342,
+"confidence": 0.9849976301193237,
 "cells": [
 {
 "index": 93,
@@ -26440,7 +26440,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9373531937599182,
+"confidence": 0.9373533129692078,
 "cells": [
 {
 "index": 0,
@@ -26486,7 +26486,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.8858677744865417,
+"confidence": 0.8858679533004761,
 "cells": [
 {
 "index": 1,
@@ -83,7 +83,7 @@
<section_header_level_1><loc_41><loc_364><loc_146><loc_370>5.1. Implementation Details</section_header_level_1>
<text><loc_41><loc_376><loc_234><loc_404>TableFormer uses ResNet-18 as the CNN Backbone Network . The input images are resized to 448*448 pixels and the feature map has a dimension of 28*28. Additionally, we enforce the following input constraints:</text>
<formula><loc_75><loc_413><loc_234><loc_428></formula>
<text><loc_41><loc_437><loc_234><loc_450><loc_41><loc_437><loc_234><loc_450>Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved runtime performance and lower memory footprint of TableFormer. This allows to utilize input samples with longer sequences and images with larger dimensions.</text>
<text><loc_41><loc_437><loc_234><loc_450><loc_252><loc_47><loc_445><loc_68>Although input constraints are used also by other methods, such as EDD, ours are less restrictive due to the improved runtime performance and lower memory footprint of TableFormer. This allows to utilize input samples with longer sequences and images with larger dimensions.</text>
<text><loc_252><loc_73><loc_445><loc_207>The Transformer Encoder consists of two "Transformer Encoder Layers", with an input feature size of 512, feed forward network of 1024, and 4 attention heads. As for the Transformer Decoder it is composed of four "Transformer Decoder Layers" with similar input and output dimensions as the "Transformer Encoder Layers". Even though our model uses fewer layers and heads than the default implementation parameters, our extensive experimentation has proved this setup to be more suitable for table images. We attribute this finding to the inherent design of table images, which contain mostly lines and text, unlike the more elaborate content present in other scopes (e.g. the COCO dataset). Moreover, we have added ResNet blocks to the inputs of the Structure Decoder and Cell BBox Decoder. This prevents a decoder having a stronger influence over the learned weights which would damage the other prediction task (structure vs bounding boxes), but learn task specific weights instead. Lastly our dropout layers are set to 0.5.</text>
<text><loc_252><loc_212><loc_445><loc_271>For training, TableFormer is trained with 3 Adam optimizers, each one for the CNN Backbone Network , Structure Decoder , and Cell BBox Decoder . Taking the PubTabNet as an example for our parameter set up, the initializing learning rate is 0.001 for 12 epochs with a batch size of 24, and λ set to 0.5. Afterwards, we reduce the learning rate to 0.0001, the batch size to 18 and train for 12 more epochs or convergence.</text>
<text><loc_252><loc_276><loc_445><loc_350>TableFormer is implemented with PyTorch and Torchvision libraries [22]. To speed up the inference, the image undergoes a single forward pass through the CNN Backbone Network and transformer encoder. This eliminates the overhead of generating the same features for each decoding step. Similarly, we employ a 'caching' technique to preform faster autoregressive decoding. This is achieved by storing the features of decoded tokens so we can reuse them for each time step. Therefore, we only compute the attention for each new tag.</text>
@@ -101,7 +101,7 @@
<otsl><loc_44><loc_258><loc_231><loc_368><ched>Model<ched>Dataset<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>EDD<fcel>PTN<fcel>91.1<fcel>88.7<fcel>89.9<nl><rhed>GTE<fcel>PTN<fcel>-<fcel>-<fcel>93.01<nl><rhed>TableFormer<fcel>PTN<fcel>98.5<fcel>95.0<fcel>96.75<nl><rhed>EDD<fcel>FTN<fcel>88.4<fcel>92.08<fcel>90.6<nl><rhed>GTE<fcel>FTN<fcel>-<fcel>-<fcel>87.14<nl><rhed>GTE (FT)<fcel>FTN<fcel>-<fcel>-<fcel>91.02<nl><rhed>TableFormer<fcel>FTN<fcel>97.5<fcel>96.0<fcel>96.8<nl><rhed>EDD<fcel>TB<fcel>86.0<fcel>-<fcel>86.0<nl><rhed>TableFormer<fcel>TB<fcel>89.6<fcel>-<fcel>89.6<nl><rhed>TableFormer<fcel>STN<fcel>96.9<fcel>95.7<fcel>96.7<nl></otsl>
<text><loc_41><loc_374><loc_234><loc_387>Table 2: Structure results on PubTabNet (PTN), FinTabNet (FTN), TableBank (TB) and SynthTabNet (STN).</text>
<text><loc_41><loc_389><loc_214><loc_395>FT: Model was trained on PubTabNet then finetuned.</text>
<text><loc_41><loc_407><loc_234><loc_450><loc_41><loc_407><loc_234><loc_450>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</text>
<text><loc_41><loc_407><loc_234><loc_450><loc_252><loc_47><loc_445><loc_144>Cell Detection. Like any object detector, our Cell BBox Detector provides bounding boxes that can be improved with post-processing during inference. We make use of the grid-like structure of tables to refine the predictions. A detailed explanation on the post-processing is available in the supplementary material. As shown in Tab. 3, we evaluate our Cell BBox Decoder accuracy for cells with a class label of 'content' only using the PASCAL VOC mAP metric for pre-processing and post-processing. Note that we do not have post-processing results for SynthTabNet as images are only provided. To compare the performance of our proposed approach, we've integrated TableFormer's Cell BBox Decoder into EDD architecture. As mentioned previously, the Structure Decoder provides the Cell BBox Decoder with the features needed to predict the bounding box predictions. Therefore, the accuracy of the Structure Decoder directly influences the accuracy of the Cell BBox Decoder . If the Structure Decoder predicts an extra column, this will result in an extra column of predicted bounding boxes.</text>
<otsl><loc_252><loc_156><loc_436><loc_192><ched>Model<ched>Dataset<ched>mAP<ched>mAP (PP)<nl><rhed>EDD+BBox<fcel>PubTabNet<fcel>79.2<fcel>82.7<nl><rhed>TableFormer<fcel>PubTabNet<fcel>82.1<fcel>86.8<nl><rhed>TableFormer<fcel>SynthTabNet<fcel>87.7<fcel>-<nl><caption><loc_252><loc_200><loc_445><loc_213>Table 3: Cell Bounding Box detection results on PubTabNet, and FinTabNet. PP: Post-processing.</caption></otsl>
<text><loc_252><loc_232><loc_445><loc_328>Cell Content. In this section, we evaluate the entire pipeline of recovering a table with content. Here we put our approach to test by capitalizing on extracting content from the PDF cells rather than decoding from images. Tab. 4 shows the TEDs score of HTML code representing the structure of the table along with the content inserted in the data cell and compared with the ground-truth. Our method achieved a 5.3% increase over the state-of-the-art, and commercial solutions. We believe our scores would be higher if the HTML ground-truth matched the extracted PDF cell content. Unfortunately, there are small discrepancies such as spacings around words or special characters with various unicode representations.</text>
<otsl><loc_272><loc_341><loc_426><loc_406><fcel>Model<ched>Simple<ched>TEDS Complex<ched>All<nl><rhed>Tabula<fcel>78.0<fcel>57.8<fcel>67.9<nl><rhed>Traprange<fcel>60.8<fcel>49.9<fcel>55.4<nl><rhed>Camelot<fcel>80.0<fcel>66.0<fcel>73.0<nl><rhed>Acrobat Pro<fcel>68.9<fcel>61.8<fcel>65.3<nl><rhed>EDD<fcel>91.2<fcel>85.4<fcel>88.3<nl><rhed>TableFormer<fcel>95.4<fcel>90.1<fcel>93.6<nl><caption><loc_252><loc_415><loc_445><loc_435>Table 4: Results of structure with content retrieved using cell detection on PubTabNet. In all cases the input is PDF documents with cropped tables.</caption></otsl>
@@ -181,7 +181,7 @@
<text><loc_41><loc_247><loc_234><loc_396>We have developed a technique that tries to derive a missing bounding box out of its neighbors. As a first step, we use the annotation data to generate the most fine-grained grid that covers the table structure. In case of strict HTML tables, all grid squares are associated with some table cell and in the presence of table spans a cell extends across multiple grid squares. When enough bounding boxes are known for a rectangular table, it is possible to compute the geometrical border lines between the grid rows and columns. Eventually this information is used to generate the missing bounding boxes. Additionally, the existence of unused grid squares indicates that the table rows have unequal number of columns and the overall structure is non-strict. The generation of missing bounding boxes for non-strict HTML tables is ambiguous and therefore quite challenging. Thus, we have decided to simply discard those tables. In case of PubTabNet we have computed missing bounding boxes for 48% of the simple and 69% of the complex tables. Regarding FinTabNet, 68% of the simple and 98% of the complex tables require the generation of bounding boxes.</text>
<text><loc_41><loc_398><loc_234><loc_411>Figure 7 illustrates the distribution of the tables across different dimensions per dataset.</text>
<section_header_level_1><loc_41><loc_418><loc_125><loc_424>1.2. Synthetic datasets</section_header_level_1>
<text><loc_41><loc_430><loc_234><loc_451><loc_41><loc_430><loc_234><loc_451>Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear- ances in regard to their size, structure, style and content. Every synthetic dataset contains 150k examples, summing up to 600k synthetic examples. All datasets are divided into Train, Test and Val splits (80%, 10%, 10%).</text>
<text><loc_41><loc_430><loc_234><loc_451><loc_252><loc_103><loc_445><loc_131>Aiming to train and evaluate our models in a broader spectrum of table data we have synthesized four types of datasets. Each one contains tables with different appear- ances in regard to their size, structure, style and content. Every synthetic dataset contains 150k examples, summing up to 600k synthetic examples. All datasets are divided into Train, Test and Val splits (80%, 10%, 10%).</text>
<text><loc_252><loc_133><loc_445><loc_147>The process of generating a synthetic dataset can be decomposed into the following steps:</text>
<unordered_list><list_item><loc_252><loc_149><loc_445><loc_200>1. Prepare styling and content templates: The styling templates have been manually designed and organized into groups of scope specific appearances (e.g. financial data, marketing data, etc.) Additionally, we have prepared curated collections of content templates by extracting the most frequently used terms out of non-synthetic datasets (e.g. PubTabNet, FinTabNet, etc.).</list_item>
<list_item><loc_252><loc_202><loc_445><loc_283>2. Generate table structures: The structure of each synthetic dataset assumes a horizontal table header which potentially spans over multiple rows and a table body that may contain a combination of row spans and column spans. However, spans are not allowed to cross the header - body boundary. The table structure is described by the parameters: Total number of table rows and columns, number of header rows, type of spans (header only spans, row only spans, column only spans, both row and column spans), maximum span size and the ratio of the table area covered by spans.</list_item>
@@ -8606,10 +8606,10 @@
 {
 "page_no": 6,
 "bbox": {
-"l": 50.112061,
-"t": 99.70968600000003,
-"r": 286.36514,
-"b": 78.84812199999999,
+"l": 308.86206,
+"t": 716.79169,
+"r": 545.11523,
+"b": 683.9751,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
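A note on the values above: these ground-truth provs store boxes with coord_origin BOTTOMLEFT, so t is larger than b, and the updated numbers correspond to the merged element's box after conversion from the page's top-left coordinates. Below is a small illustrative sketch of that flip, in the spirit of the to_bottom_left_origin(page_height) call in the model code; the field names and the 792 pt page height are assumptions, not the library's exact implementation.

# Sketch: flip a top-left-origin box to a bottom-left origin by mirroring y.
from dataclasses import dataclass, replace

@dataclass
class BBox:
    l: float
    t: float
    r: float
    b: float
    coord_origin: str = "TOPLEFT"

def to_bottom_left_origin(box: BBox, page_height: float) -> BBox:
    """Return the same box expressed with y growing upward from the page bottom."""
    if box.coord_origin == "BOTTOMLEFT":
        return box
    return replace(box, t=page_height - box.t, b=page_height - box.b,
                   coord_origin="BOTTOMLEFT")

# Indicative example on a 792 pt tall page: a box near the bottom of the page
# maps to BOTTOMLEFT values with t > b, as in the prov above.
print(to_bottom_left_origin(BBox(l=308.86, t=75.21, r=545.12, b=108.02), 792.0))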
@@ -9087,10 +9087,10 @@
 {
 "page_no": 7,
 "bbox": {
-"l": 50.112015,
-"t": 147.65019000000007,
-"r": 286.366,
-"b": 78.84806800000001,
+"l": 308.862,
+"t": 716.79163,
+"r": 545.11517,
+"b": 564.42291,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -12848,10 +12848,10 @@
 {
 "page_no": 11,
 "bbox": {
-"l": 50.111984,
-"t": 110.66886999999997,
-"r": 286.36505,
-"b": 77.85229500000003,
+"l": 308.862,
+"t": 629.34485,
+"r": 545.11511,
+"b": 584.57227,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -61,7 +61,7 @@
<otsl><loc_81><loc_87><loc_419><loc_186><ecel><ecel><ched>% of Total<lcel><lcel><ched>triple inter-annotator mAP @ 0.5-0.95 (%)<lcel><lcel><lcel><lcel><lcel><lcel><nl><ched>class label<ched>Count<ched>Train<ched>Test<ched>Val<ched>All<ched>Fin<ched>Man<ched>Sci<ched>Law<ched>Pat<ched>Ten<nl><rhed>Caption<fcel>22524<fcel>2.04<fcel>1.77<fcel>2.32<fcel>84-89<fcel>40-61<fcel>86-92<fcel>94-99<fcel>95-99<fcel>69-78<fcel>n/a<nl><rhed>Footnote<fcel>6318<fcel>0.60<fcel>0.31<fcel>0.58<fcel>83-91<fcel>n/a<fcel>100<fcel>62-88<fcel>85-94<fcel>n/a<fcel>82-97<nl><rhed>Formula<fcel>25027<fcel>2.25<fcel>1.90<fcel>2.96<fcel>83-85<fcel>n/a<fcel>n/a<fcel>84-87<fcel>86-96<fcel>n/a<fcel>n/a<nl><rhed>List-item<fcel>185660<fcel>17.19<fcel>13.34<fcel>15.82<fcel>87-88<fcel>74-83<fcel>90-92<fcel>97-97<fcel>81-85<fcel>75-88<fcel>93-95<nl><rhed>Page-footer<fcel>70878<fcel>6.51<fcel>5.58<fcel>6.00<fcel>93-94<fcel>88-90<fcel>95-96<fcel>100<fcel>92-97<fcel>100<fcel>96-98<nl><rhed>Page-header<fcel>58022<fcel>5.10<fcel>6.70<fcel>5.06<fcel>85-89<fcel>66-76<fcel>90-94<fcel>98-100<fcel>91-92<fcel>97-99<fcel>81-86<nl><rhed>Picture<fcel>45976<fcel>4.21<fcel>2.78<fcel>5.31<fcel>69-71<fcel>56-59<fcel>82-86<fcel>69-82<fcel>80-95<fcel>66-71<fcel>59-76<nl><rhed>Section-header<fcel>142884<fcel>12.60<fcel>15.77<fcel>12.85<fcel>83-84<fcel>76-81<fcel>90-92<fcel>94-95<fcel>87-94<fcel>69-73<fcel>78-86<nl><rhed>Table<fcel>34733<fcel>3.20<fcel>2.27<fcel>3.60<fcel>77-81<fcel>75-80<fcel>83-86<fcel>98-99<fcel>58-80<fcel>79-84<fcel>70-85<nl><rhed>Text<fcel>510377<fcel>45.82<fcel>49.28<fcel>45.00<fcel>84-86<fcel>81-86<fcel>88-93<fcel>89-93<fcel>87-92<fcel>71-79<fcel>87-95<nl><rhed>Title<fcel>5071<fcel>0.47<fcel>0.30<fcel>0.50<fcel>60-72<fcel>24-63<fcel>50-63<fcel>94-100<fcel>82-96<fcel>68-79<fcel>24-56<nl><rhed>Total<fcel>1107470<fcel>941123<fcel>99816<fcel>66531<fcel>82-83<fcel>71-74<fcel>79-81<fcel>89-94<fcel>86-91<fcel>71-76<fcel>68-85<nl><caption><loc_44><loc_54><loc_456><loc_73>Table 1: DocLayNet dataset overview. Along with the frequency of each class label, we present the relative occurrence (as % of row "Total") in the train, test and validation sets. The inter-annotator agreement is computed as the mAP@0.5-0.95 metric between pairwise annotations from the triple-annotated pages, from which we obtain accuracy ranges.</caption></otsl>
<picture><loc_43><loc_196><loc_242><loc_341><caption><loc_44><loc_350><loc_242><loc_383>Figure 3: Corpus Conversion Service annotation user interface. The PDF page is shown in the background, with overlaid text-cells (in darker shades). The annotation boxes can be drawn by dragging a rectangle over each segment with the respective label from the palette on the right.</caption></picture>
<text><loc_44><loc_400><loc_240><loc_426>we distributed the annotation workload and performed continuous quality controls. Phase one and two required a small team of experts only. For phases three and four, a group of 40 dedicated annotators were assembled and supervised.</text>
<text><loc_44><loc_428><loc_241><loc_447><loc_44><loc_428><loc_241><loc_447>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
<text><loc_44><loc_428><loc_241><loc_447><loc_260><loc_197><loc_457><loc_237>Phase 1: Data selection and preparation. Our inclusion criteria for documents were described in Section 3. A large effort went into ensuring that all documents are free to use. The data sources include publication repositories such as arXiv$^{3}$, government offices, company websites as well as data directory services for financial reports and patents. Scanned documents were excluded wherever possible because they can be rotated or skewed. This would not allow us to perform annotation with rectangular bounding-boxes and therefore complicate the annotation process.</text>
<text><loc_260><loc_239><loc_457><loc_320>Preparation work included uploading and parsing the sourced PDF documents in the Corpus Conversion Service (CCS) [22], a cloud-native platform which provides a visual annotation interface and allows for dataset inspection and analysis. The annotation interface of CCS is shown in Figure 3. The desired balance of pages between the different document categories was achieved by selective subsampling of pages with certain desired properties. For example, we made sure to include the title page of each document and bias the remaining page selection to those with figures or tables. The latter was achieved by leveraging pre-trained object detection models from PubLayNet, which helped us estimate how many figures and tables a given page contains.</text>
<text><loc_259><loc_321><loc_457><loc_438>Phase 2: Label selection and guideline. We reviewed the collected documents and identified the most common structural features they exhibit. This was achieved by identifying recurrent layout elements and lead us to the definition of 11 distinct class labels. These 11 class labels are Caption , Footnote , Formula , List-item , Pagefooter , Page-header , Picture , Section-header , Table , Text , and Title . Critical factors that were considered for the choice of these class labels were (1) the overall occurrence of the label, (2) the specificity of the label, (3) recognisability on a single page (i.e. no need for context from previous or next page) and (4) overall coverage of the page. Specificity ensures that the choice of label is not ambiguous, while coverage ensures that all meaningful items on a page can be annotated. We refrained from class labels that are very specific to a document category, such as Abstract in the Scientific Articles category. We also avoided class labels that are tightly linked to the semantics of the text. Labels such as Author and Affiliation , as seen in DocBank, are often only distinguishable by discriminating on</text>
<footnote><loc_260><loc_443><loc_302><loc_448>$^{3}$https://arxiv.org/</footnote>
@@ -12152,10 +12152,10 @@
 {
 "page_no": 4,
 "bbox": {
-"l": 53.79800000000001,
-"t": 113.98901000000001,
-"r": 295.55844,
-"b": 83.57982600000003,
+"l": 317.95499,
+"t": 479.92047,
+"r": 559.18536,
+"b": 416.75183,
 "coord_origin": "BOTTOMLEFT"
 },
 "charspan": [
@@ -336,9 +336,9 @@
 {
 "page_no": 1,
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 454.4546203613281,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 322.5054626464844,
 "coord_origin": "BOTTOMLEFT"
 },
@@ -2646,7 +2646,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9373531937599182,
+"confidence": 0.9373533129692078,
 "cells": [
 {
 "index": 0,
@@ -2686,7 +2686,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.8858677744865417,
+"confidence": 0.8858679533004761,
 "cells": [
 {
 "index": 1,
@@ -2881,7 +2881,7 @@
 "b": 255.42400999999995,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.98504239320755,
+"confidence": 0.9850425124168396,
 "cells": [
 {
 "index": 7,
@@ -3096,7 +3096,7 @@
 "b": 327.98218,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9591910243034363,
+"confidence": 0.9591907262802124,
 "cells": [
 {
 "index": 15,
@@ -3280,9 +3280,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -7852,7 +7852,7 @@
 "b": 618.3,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9849975109100342,
+"confidence": 0.9849976301193237,
 "cells": [
 {
 "index": 93,
@@ -8184,9 +8184,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -13582,7 +13582,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9373531937599182,
+"confidence": 0.9373533129692078,
 "cells": [
 {
 "index": 0,
@@ -13628,7 +13628,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.8858677744865417,
+"confidence": 0.8858679533004761,
 "cells": [
 {
 "index": 1,
@@ -13841,7 +13841,7 @@
 "b": 255.42400999999995,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.98504239320755,
+"confidence": 0.9850425124168396,
 "cells": [
 {
 "index": 7,
@@ -14062,7 +14062,7 @@
 "b": 327.98218,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9591910243034363,
+"confidence": 0.9591907262802124,
 "cells": [
 {
 "index": 15,
@@ -14252,9 +14252,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -19713,7 +19713,7 @@
 "b": 618.3,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9849975109100342,
+"confidence": 0.9849976301193237,
 "cells": [
 {
 "index": 93,
@@ -20224,7 +20224,7 @@
 "b": 255.42400999999995,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.98504239320755,
+"confidence": 0.9850425124168396,
 "cells": [
 {
 "index": 7,
@@ -20445,7 +20445,7 @@
 "b": 327.98218,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9591910243034363,
+"confidence": 0.9591907262802124,
 "cells": [
 {
 "index": 15,
@@ -20635,9 +20635,9 @@
 "id": 0,
 "label": "table",
 "bbox": {
-"l": 139.66746520996094,
+"l": 139.6674041748047,
 "t": 337.5453796386719,
-"r": 475.0093078613281,
+"r": 475.00927734375,
 "b": 469.4945373535156,
 "coord_origin": "TOPLEFT"
 },
@@ -26096,7 +26096,7 @@
 "b": 618.3,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9849975109100342,
+"confidence": 0.9849976301193237,
 "cells": [
 {
 "index": 93,
@@ -26440,7 +26440,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.9373531937599182,
+"confidence": 0.9373533129692078,
 "cells": [
 {
 "index": 0,
@@ -26486,7 +26486,7 @@
 "b": 102.78223000000003,
 "coord_origin": "TOPLEFT"
 },
-"confidence": 0.8858677744865417,
+"confidence": 0.8858679533004761,
 "cells": [
 {
 "index": 1,