Document Image Orientation Classification Module (Optional):
In this pipeline, you can choose the model to use based on the benchmark data below.

Text Image Unwarp Module (Optional):

Table Structure Recognition Module Models:
| Model | Model Download Link | Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (MB) | Description |
|---|---|---|---|---|---|---|
| SLANet | Inference Model/Training Model | 59.52 | 23.96 / 21.75 | - / 43.12 | 6.9 | SLANet is a table structure recognition model developed by the Baidu PaddleX team. It significantly improves the accuracy and inference speed of table structure recognition by adopting the CPU-friendly lightweight backbone network PP-LCNet, the high-and-low-level feature fusion module CSP-PAN, and the feature decoding module SLA Head, which aligns structural and positional information. |
| SLANet_plus | Inference Model/Training Model | 63.69 | 23.43 / 22.16 | - / 41.80 | 6.9 | SLANet_plus is an enhanced version of SLANet, the table structure recognition model developed by the Baidu PaddleX team. Compared to SLANet, it significantly improves recognition of borderless and complex tables and reduces the model's sensitivity to the accuracy of table localization, enabling accurate recognition even when the table is imperfectly located. |
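As an illustration of how the benchmark figures above can drive model choice, the sketch below picks the most accurate table structure model that fits a GPU latency budget. This is plain Python with numbers copied from the table; the `pick_model` helper and the latency-budget rule are illustrative, not part of PaddleX.

```python
# Benchmark figures copied from the table above (GPU, normal mode).
TABLE_STRUCTURE_MODELS = {
    "SLANet":      {"accuracy": 59.52, "gpu_ms": 23.96, "size_mb": 6.9},
    "SLANet_plus": {"accuracy": 63.69, "gpu_ms": 23.43, "size_mb": 6.9},
}

def pick_model(max_gpu_ms: float) -> str:
    """Return the most accurate model whose GPU latency fits the budget."""
    candidates = [
        (info["accuracy"], name)
        for name, info in TABLE_STRUCTURE_MODELS.items()
        if info["gpu_ms"] <= max_gpu_ms
    ]
    if not candidates:
        raise ValueError("no model fits the latency budget")
    return max(candidates)[1]

print(pick_model(max_gpu_ms=25.0))  # SLANet_plus: same size, higher accuracy
```

Since SLANet_plus matches SLANet's storage size while exceeding its accuracy at comparable latency, it is the natural default under most budgets.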
Layout Detection Module Models:
| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (MB) | Introduction |
|---|---|---|---|---|---|---|
| PP-DocLayout_plus-L | Inference Model/Training Model | 83.2 | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 | A higher-precision layout area localization model trained on a self-built dataset containing Chinese and English papers, PPT, multi-layout magazines, contracts, books, exams, ancient books, and research reports using RT-DETR-L. |
| PP-DocBlockLayout | Inference Model/Training Model | 95.9 | 34.60 / 28.54 | 506.43 / 256.83 | 123.92 | A layout block localization model trained on a self-built dataset containing Chinese and English papers, PPT, multi-layout magazines, contracts, books, exams, ancient books, and research reports using RT-DETR-L. |

| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (MB) | Description |
|---|---|---|---|---|---|---|
| PP-DocLayout-L | Inference Model/Pretrained Model | 90.4 | 33.59 / 33.59 | 503.01 / 251.08 | - | - |
| PP-DocLayout-M | Inference Model/Pretrained Model | 75.2 | 13.03 / 4.72 | 43.39 / 24.44 | 22.6 | A balanced efficiency and precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-L. |
| PP-DocLayout-S | Inference Model/Pretrained Model | 70.9 | 11.54 / 3.86 | 18.53 / 6.29 | - | - |
| PicoDet_layout_1x_table | - | - | - | - | - | A high-efficiency layout area localization model trained on a self-built dataset using PicoDet-1x, capable of detecting table regions. |
| PicoDet_layout_1x | - | - | - | - | - | A high-efficiency English document layout area localization model trained on the PubLayNet dataset using PicoDet-1x. |
| RT-DETR-H_layout_3cls | Inference Model/Training Model | - | - | - | 470.2 | A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using RT-DETR-H. |

Table Classification Module Models:
Text Detection Module Models:
Text Recognition Module Models:
Formula Recognition Module Models:

| Model | Model Download Link | En-BLEU (%) | Zh-BLEU (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (MB) | Description |
|---|---|---|---|---|---|---|---|
| UniMERNet | Inference Model/Training Model | 85.91 | 43.50 | 1311.84 / 1311.84 | - / 8288.07 | 1530 | UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. The model is trained on a dataset of one million samples, including simple formulas, complex formulas, scanned formulas, and handwritten formulas, significantly improving the recognition accuracy of real-world formulas. |
| PP-FormulaNet-S | Inference Model/Training Model | 87.00 | 45.71 | 182.25 / 182.25 | - / 254.39 | 224 | PP-FormulaNet is an advanced formula recognition model developed by the Baidu PaddlePaddle Vision Team, supporting recognition of 50,000 common LaTeX source-code tokens. The PP-FormulaNet-S version uses PP-HGNetV2-B4 as its backbone network. Through parallel masking and model distillation techniques, it significantly improves inference speed while maintaining high recognition accuracy, making it suitable for scenarios such as simple printed formulas and simple multi-line printed formulas. |
| PP-FormulaNet-L | Inference Model/Training Model | 90.36 | 45.78 | 1482.03 / 1482.03 | - / 3131.54 | 695 | The PP-FormulaNet-L version uses Vary_VIT_B as its backbone network and is trained on a large-scale formula dataset. It shows significant improvements over PP-FormulaNet-S in recognizing complex formulas, making it suitable for simple printed, complex printed, and handwritten formulas. |
| PP-FormulaNet_plus-S | Inference Model/Training Model | 88.71 | 53.32 | 179.20 / 179.20 | - / 260.99 | 248 | PP-FormulaNet_plus is an enhanced version of the formula recognition model developed by the Baidu PaddlePaddle Vision Team, building upon the original PP-FormulaNet. Compared to the original version, PP-FormulaNet_plus utilizes a more diverse formula dataset during training, including sources such as Chinese dissertations, professional books, textbooks, exam papers, and mathematics journals, significantly improving its recognition capabilities. The PP-FormulaNet_plus-S model focuses on improving the recognition of English formulas. |
| PP-FormulaNet_plus-M | Inference Model/Training Model | 91.45 | 89.76 | 1040.27 / 1040.27 | - / 1615.80 | 592 | PP-FormulaNet_plus-M and PP-FormulaNet_plus-L add support for Chinese formulas and increase the maximum number of predicted tokens per formula from 1,024 to 2,560, greatly enhancing recognition performance on complex formulas. With these improvements, the PP-FormulaNet_plus series performs exceptionally well on complex and diverse formula recognition tasks. |
| PP-FormulaNet_plus-L | Inference Model/Training Model | 92.22 | 90.64 | 1476.07 / 1476.07 | - / 3125.58 | 698 | See PP-FormulaNet_plus-M. |
| LaTeX_OCR_rec | Inference Model/Training Model | 74.55 | 39.96 | 1088.89 / 1088.89 | - / - | 99 | LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. It uses Hybrid ViT as the backbone network and a transformer as the decoder, significantly improving the accuracy of formula recognition. |
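To see which of these models are worth considering at a given latency, one can compute the Pareto front of English-formula accuracy versus GPU latency from the benchmark figures above. This is a plain-Python sketch: the numbers are copied from the table, while the `pareto_front` helper is illustrative and not part of any library.

```python
# (model, En-BLEU %, GPU latency ms in normal mode), copied from the table above.
MODELS = [
    ("UniMERNet",            85.91, 1311.84),
    ("PP-FormulaNet-S",      87.00,  182.25),
    ("PP-FormulaNet-L",      90.36, 1482.03),
    ("PP-FormulaNet_plus-S", 88.71,  179.20),
    ("PP-FormulaNet_plus-M", 91.45, 1040.27),
    ("PP-FormulaNet_plus-L", 92.22, 1476.07),
    ("LaTeX_OCR_rec",        74.55, 1088.89),
]

def pareto_front(models):
    """Keep models not dominated by another that is both at least as accurate
    and at least as fast (and strictly better on one axis)."""
    front = []
    for name, bleu, ms in models:
        dominated = any(
            b2 >= bleu and m2 <= ms and (b2 > bleu or m2 < ms)
            for n2, b2, m2 in models
            if n2 != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(MODELS))
# → ['PP-FormulaNet_plus-S', 'PP-FormulaNet_plus-M', 'PP-FormulaNet_plus-L']
```

On these figures, only the PP-FormulaNet_plus series survives: each of the other models is beaten on both accuracy and speed by some plus variant.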
Seal Text Detection Module Models:
+| 85.91 | +43.50 | +1311.84 / 1311.84 | +- / 8288.07 | +1530 | +UniMERNet是由上海AI Lab研发的一款公式识别模型。该模型采用Donut Swin作为编码器,MBartDecoder作为解码器,并通过在包含简单公式、复杂公式、扫描捕捉公式和手写公式在内的一百万数据集上进行训练,大幅提升了模型对真实场景公式的识别准确率 | +|||||
| PP-FormulaNet-S | +推理模型/训练模型 | +87.00 | +45.71 | +182.25 / 182.25 | +- / 254.39 | +224 | +PP-FormulaNet 是由百度飞桨视觉团队开发的一款先进的公式识别模型,支持5万个常见LateX源码词汇的识别。PP-FormulaNet-S 版本采用了 PP-HGNetV2-B4 作为其骨干网络,通过并行掩码和模型蒸馏等技术,大幅提升了模型的推理速度,同时保持了较高的识别精度,适用于简单印刷公式、跨行简单印刷公式等场景。而 PP-FormulaNet-L 版本则基于 Vary_VIT_B 作为骨干网络,并在大规模公式数据集上进行了深入训练,在复杂公式的识别方面,相较于PP-FormulaNet-S表现出显著的提升,适用于简单印刷公式、复杂印刷公式、手写公式等场景。 | + +PP-FormulaNet-L | +推理模型/训练模型 | +90.36 | +45.78 | +1482.03 / 1482.03 | +- / 3131.54 | +695 | + +PP-FormulaNet_plus-S | +推理模型/训练模型 | +88.71 | +53.32 | +179.20 / 179.20 | +- / 260.99 | +248 | +PP-FormulaNet_plus 是百度飞桨视觉团队在 PP-FormulaNet 的基础上开发的增强版公式识别模型。与原版相比,PP-FormulaNet_plus 在训练中使用了更为丰富的公式数据集,包括中文学位论文、专业书籍、教材试卷以及数学期刊等多种来源。这一扩展显著提升了模型的识别能力。 + +其中,PP-FormulaNet_plus-M 和 PP-FormulaNet_plus-L 模型新增了对中文公式的支持,并将公式的最大预测 token 数从 1024 扩大至 2560,大幅提升了对复杂公式的识别性能。同时,PP-FormulaNet_plus-S 模型则专注于增强英文公式的识别能力。通过这些改进,PP-FormulaNet_plus 系列模型在处理复杂多样的公式识别任务时表现更加出色。 | + +
| PP-FormulaNet_plus-M | +推理模型/训练模型 | +91.45 | +89.76 | +1040.27 / 1040.27 | +- / 1615.80 | +592 | +||||
| PP-FormulaNet_plus-L | +推理模型/训练模型 | +92.22 | +90.64 | +1476.07 / 1476.07 | +- / 3125.58 | +698 | ||||
| LaTeX_OCR_rec | 推理模型/训练模型 | -0.8821 | -0.0823 | -40.01 | +74.55 | +39.96 | 1088.89 / 1088.89 | - / - | 99 | +LaTeX-OCR是一种基于自回归大模型的公式识别算法,通过采用 Hybrid ViT 作为骨干网络,transformer作为解码器,显著提升了公式识别的准确性。 |