From 1788b8633f0d5f5192c9e97639ff79703d40dae3 Mon Sep 17 00:00:00 2001 From: Zhang Zelun <33217105+BluebirdStory@users.noreply.github.com> Date: Mon, 19 May 2025 23:02:27 +0800 Subject: [PATCH] add ocr docs (#15173) * add ocr docs * fix sth --- docs/version3.x/module_usage/doc_vlm.en.md | 255 +++++ docs/version3.x/module_usage/doc_vlm.md | 27 +- .../pipeline_usage/doc_understanding.en.md | 878 ++++++++++++++++++ .../pipeline_usage/doc_understanding.md | 23 +- paddleocr/_models/doc_vlm.py | 2 +- 5 files changed, 1171 insertions(+), 14 deletions(-) create mode 100644 docs/version3.x/module_usage/doc_vlm.en.md create mode 100644 docs/version3.x/pipeline_usage/doc_understanding.en.md diff --git a/docs/version3.x/module_usage/doc_vlm.en.md b/docs/version3.x/module_usage/doc_vlm.en.md new file mode 100644 index 0000000000..eee46a9f92 --- /dev/null +++ b/docs/version3.x/module_usage/doc_vlm.en.md @@ -0,0 +1,255 @@ +--- +comments: true +--- + +# Tutorial on Using Document Visual Language Model Module + +## I. Overview + +Document visual language models are a cutting-edge multimodal processing technology aimed at addressing the limitations of traditional document processing methods. Traditional methods are often limited to processing document information in specific formats or predefined categories, whereas document visual language models can integrate visual and linguistic information to understand and handle diverse document content. By combining computer vision and natural language processing technologies, these models can recognize images, text, and their relationships within documents, and even understand semantic information within complex layout structures. This makes document processing more intelligent and flexible, with stronger generalization capabilities, showing broad application prospects in automated office work, information extraction, and other fields. + +## II. Supported Model List + + + + + + + + + + + + + + + + + + + + + + + + + +
| Model | Model Download Link | Model Storage Size (GB) | Total Score | Description |
| --- | --- | --- | --- | --- |
| PP-DocBee-2B | Inference Model | 4.2 | 765 | PP-DocBee is a multimodal large model developed by the PaddlePaddle team that focuses on document understanding and performs excellently on Chinese document understanding tasks. It was fine-tuned on nearly 5 million document-understanding multimodal samples, covering general VQA, OCR, charts, text-rich documents, mathematics and complex reasoning, synthetic data, and plain text, with tuned training-data ratios. On several authoritative English document understanding benchmarks in academia, PP-DocBee achieves SOTA among models of the same parameter scale, and on internal Chinese business-scenario metrics it also outperforms currently popular open-source and closed-source models. |
| PP-DocBee-7B | Inference Model | 15.8 | - | - |
| PP-DocBee2-3B | Inference Model | 7.6 | 852 | PP-DocBee2 is a multimodal large model developed by the PaddlePaddle team that focuses on document understanding. Building on PP-DocBee, it further optimizes the base model and introduces a new data optimization scheme to improve data quality. With only 470,000 samples generated by a self-developed data synthesis strategy, PP-DocBee2 performs even better on Chinese document understanding tasks. On internal Chinese business-scenario metrics, PP-DocBee2 improves on PP-DocBee by about 11.4% and also outperforms currently popular open-source and closed-source models of the same scale. |
+ +Note: The total scores of the above models are test results from an internal evaluation set, where all images have a resolution (height, width) of (1680, 1204), with a total of 1196 data entries, covering scenarios such as financial reports, laws and regulations, scientific and technical papers, manuals, humanities papers, contracts, research reports, etc. There are no plans for public release at the moment. + +## III. Quick Start + +> ❗ Before starting quickly, please install the PaddleOCR wheel package. For details, please refer to the [Installation Guide](../ppocr/installation.md). + +You can quickly experience it with one line of command: + +```bash +paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}" +``` + +You can also integrate the model inference from the open document visual language model module into your project. Before running the following code, please download the [sample image](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png) locally. + +```python +from paddleocr import DocVLM +model = DocVLM(model_name="PP-DocBee2-3B") +results = model.predict( + input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"}, + batch_size=1 +) +for res in results: + res.print() + res.save_to_json(f"./output/res.json") +``` + +After running, the result is: + +```bash +{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}} +``` + +The meaning of the result parameters is as follows: +- `image`: Indicates the path of the input image to be predicted +- `query`: Represents the input text information to be predicted +- `result`: Information of the model's prediction result + +The visualization of the prediction result is as follows: + +```bash +| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 | +| --- | --- | --- | --- | --- | --- | +| 1 | 中国(CHN) | 48 | 22 | 30 | 100 | +| 2 | 美国(USA) | 36 | 39 | 37 | 112 | +| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 | +| 4 | 英国(GBR) | 19 | 13 | 19 | 51 | +| 5 | 德国(GER) | 16 | 11 | 14 | 41 | +| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 | +| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 | +| 8 | 日本(JPN) | 9 | 8 | 8 | 25 | +| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 | +| 10 | 法国(FRA) | 7 | 16 | 20 | 43 | +| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 | +| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 | +| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 | +| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 | +| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 | +``` + +Explanations of related methods, parameters, etc., are as follows: + +* `DocVLM` instantiates the document visual language model (taking `PP-DocBee-2B` as an example), with specific explanations as follows: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| `model_name` | Model name | `str` | None | None |
| `model_dir` | Model storage path | `str` | None | None |
| `device` | Model inference device | `str` | Supports specifying a concrete card, e.g. `"gpu:0"` for GPU, `"npu:0"` for other hardware, or `"cpu"` for CPU. | `gpu:0` |
| `use_hpip` | Whether to enable the high-performance inference plugin. Currently not supported. | `bool` | None | `False` |
| `hpi_config` | High-performance inference configuration. Currently not supported. | `dict` \| `None` | None | `None` |
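For example, a minimal instantiation sketch using the parameters above (the device value is illustrative, and the local model directory is a hypothetical path):

```python
from paddleocr import DocVLM

# Use the built-in model on the first GPU.
model = DocVLM(model_name="PP-DocBee2-3B", device="gpu:0")

# Or load user-defined weights from a local directory (hypothetical path):
# model = DocVLM(model_name="PP-DocBee2-3B", model_dir="./PP-DocBee2-3B")
```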
* `model_name` must be specified. Once `model_name` is set, the default PaddleX built-in model parameters are used; if `model_dir` is also specified, the user-defined model found there is used instead.

* Call the `predict()` method of the document visual language model to run inference; it returns a list of results. The module also provides a `predict_iter()` method, which accepts the same parameters and returns the same results, except that it returns a `generator` that yields predictions one at a time. This is useful for large datasets or when memory is limited (see the sketch after the table below). Choose whichever method fits your needs. The `predict()` method takes the parameters `input` and `batch_size`, described below:
| Parameter | Description | Type | Options | Default |
| --- | --- | --- | --- | --- |
| `input` | Data to be predicted | `dict` | A dict whose format depends on the specific model; for the PP-DocBee series it is `{'image': image_path, 'query': query_text}` | None |
| `batch_size` | Batch size | `int` | Integer | `1` |
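As referenced above, a minimal `predict_iter()` sketch (same arguments as `predict()`; the image and query reuse the quick-start example):

```python
from paddleocr import DocVLM

model = DocVLM(model_name="PP-DocBee2-3B")
# predict_iter() yields one result at a time instead of building the full
# result list, which keeps memory usage flat across many inputs.
for res in model.predict_iter(
    input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"},
    batch_size=1,
):
    res.print()
```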
* Process the prediction results. Each sample's prediction result is a corresponding `Result` object, which supports printing and saving as a `json` file:
| Method | Description | Parameter | Type | Parameter Description | Default |
| --- | --- | --- | --- | --- | --- |
| `print()` | Print results to the terminal | `format_json` | `bool` | Whether to format the output with JSON indentation | `True` |
| | | `indent` | `int` | Indentation level used to beautify the JSON output for readability; effective only when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode; `True` escapes all non-ASCII characters, `False` keeps the original characters; effective only when `format_json` is `True` | `False` |
| `save_to_json()` | Save the result as a `json` file | `save_path` | `str` | Path of the file to save; when it is a directory, the saved file is named after the input file | None |
| | | `indent` | `int` | Same as in `print()` | `4` |
| | | `ensure_ascii` | `bool` | Same as in `print()` | `False` |
* The prediction results can also be obtained through the following attributes:
| Attribute | Description |
| --- | --- |
| `json` | Get the prediction result in `json` format |
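For instance, a short sketch of reading the `json` attribute (assuming `results` comes from `model.predict()` as in the quick start; the `['res']` key mirrors the printed output above):

```python
for res in results:
    data = res.json                # same dict that save_to_json() writes
    print(data["res"]["result"])   # the model's answer text
```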
    + +## IV. Secondary Development + +The current module does not support fine-tuning training temporarily, only inference integration is supported. The fine-tuning training of this module is planned to be supported in the future. + +## V. FAQ diff --git a/docs/version3.x/module_usage/doc_vlm.md b/docs/version3.x/module_usage/doc_vlm.md index 28dd25bcee..a0ff328636 100644 --- a/docs/version3.x/module_usage/doc_vlm.md +++ b/docs/version3.x/module_usage/doc_vlm.md @@ -15,19 +15,31 @@ comments: true 模型模型下载链接 模型存储大小(GB) +模型总分 介绍 PP-DocBee-2B推理模型 4.2 +765 PP-DocBee 是飞桨团队自研的一款专注于文档理解的多模态大模型,在中文文档理解任务上具有卓越表现。该模型通过近 500 万条文档理解类多模态数据集进行微调优化,各种数据集包括了通用VQA类、OCR类、图表类、text-rich文档类、数学和复杂推理类、合成数据类、纯文本数据等,并设置了不同训练数据配比。在学术界权威的几个英文文档理解评测榜单上,PP-DocBee基本都达到了同参数量级别模型的SOTA。在内部业务中文场景类的指标上,PP-DocBee也高于目前的热门开源和闭源模型。 PP-DocBee-7B推理模型 15.8 +- + + +PP-DocBee2-3B推理模型 +7.6 +852 +PP-DocBee2 是飞桨团队自研的一款专注于文档理解的多模态大模型,在PP-DocBee的基础上进一步优化了基础模型,并引入了新的数据优化方案,提高了数据质量,使用自研数据合成策略生成的少量的47万数据便使得PP-DocBee2在中文文档理解任务上表现更佳。在内部业务中文场景类的指标上,PP-DocBee2相较于PP-DocBee提升了约11.4%,同时也高于目前的同规模热门开源和闭源模型。 +注:以上模型总分为内部评估集模型测试结果,内部评估集所有图像分辨率 (height, width) 为 (1680,1204),共1196条数据,包括了财报、法律法规、理工科论文、说明书、文科论文、合同、研报等场景,暂时未有计划公开。 + + ## 三、快速开始 @@ -36,16 +48,16 @@ comments: true 使用一行命令即可快速体验: ```bash -paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容'}" +paddleocr doc_vlm -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}" ``` 您也可以将开放文档类视觉语言模型模块中的模型推理集成到您的项目中。运行以下代码前,请您下载[示例图片](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png)到本地。 ```python from paddleocr import DocVLM -model = DocVLM(model_name="PP-DocBee-2B") +model = DocVLM(model_name="PP-DocBee2-3B") results = model.predict( - input={"image": "medal_table.png", "query": "识别这份表格的内容"}, + input={"image": "medal_table.png", "query": "识别这份表格的内容, 以markdown格式输出"}, batch_size=1 ) for res in results: @@ -56,7 +68,7 @@ for res in results: 运行后,得到的结果为: ```bash -{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}} +{'res': {'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}} ``` 运行结果参数含义如下: - `image`: 表示输入待预测图像的路径 @@ -155,7 +167,8 
@@ for res in results: 待预测数据 dict -Dict, 需要根据具体的模型确定,如PP-DocBee系列的输入为{'image': image_path, 'query': query_text} +Dict, 由于多模态模型对输入有不同的要求,需要根据具体的模型确定,具体而言: +
  • PP-DocBee系列的输入形式为{'image': image_path, 'query': query_text}
  • 无 @@ -163,7 +176,7 @@ for res in results: batch_size 批大小 int -整数(目前仅支持为1) +整数 1 @@ -241,3 +254,5 @@ for res in results: ## 四、二次开发 当前模块暂时不支持微调训练,仅支持推理集成。关于该模块的微调训练,计划在未来支持。 + +## 五、FAQ diff --git a/docs/version3.x/pipeline_usage/doc_understanding.en.md b/docs/version3.x/pipeline_usage/doc_understanding.en.md new file mode 100644 index 0000000000..b6604763e0 --- /dev/null +++ b/docs/version3.x/pipeline_usage/doc_understanding.en.md @@ -0,0 +1,878 @@ +--- + +comments: true + +--- + +# Document Understanding Pipeline Usage Tutorial + +## 1. Introduction to the Document Understanding Pipeline + +The Document Understanding Pipeline is an advanced document processing technology based on Visual-Language Models (VLM), designed to overcome the limitations of traditional document processing. Traditional methods rely on fixed templates or predefined rules to parse documents, whereas this pipeline leverages the multimodal capabilities of VLM to accurately answer user queries by inputting document images and user questions, integrating visual and language information. This technology does not require pre-training for specific document formats, allowing it to flexibly handle diverse document content, significantly enhancing the generalization and practicality of document processing. It has broad application prospects in intelligent Q&A, information extraction, and other scenarios. Currently, the pipeline does not support secondary development of VLM models, but plans to support it in the future. + + + +The general document image preprocessing pipeline includes the following module. Each module can be trained and inferred independently and contains multiple models. For more details, click the corresponding module to view the documentation. + +- [Document-like Vision Language Model Module](../module_usage/doc_vlm.md) + +In this pipeline, you can choose the model to use based on the benchmark data below. + +
Document Visual Language Model Module:
| Model | Model Download Link | Model Storage Size (GB) | Total Score | Description |
| --- | --- | --- | --- | --- |
| PP-DocBee-2B | Inference Model | 4.2 | 765 | PP-DocBee is a multimodal large model developed independently by the PaddlePaddle team that focuses on document understanding and performs excellently on Chinese document understanding tasks. It was fine-tuned on nearly 5 million document-understanding multimodal samples, covering general VQA, OCR, charts, text-rich documents, math and complex reasoning, synthetic data, and plain text, with tuned training-data ratios. On several authoritative English document understanding benchmarks in academia, PP-DocBee achieves SOTA among models of the same parameter scale, and on internal Chinese business-scenario metrics it also outperforms currently popular open-source and closed-source models. |
| PP-DocBee-7B | Inference Model | 15.8 | - | - |
| PP-DocBee2-3B | Inference Model | 7.6 | 852 | PP-DocBee2 is a multimodal large model developed independently by the PaddlePaddle team that focuses on document understanding. Building on PP-DocBee, it further optimizes the base model and introduces new data optimization schemes to improve data quality. With only 470,000 samples generated by a self-developed data synthesis strategy, PP-DocBee2 performs even better on Chinese document understanding tasks. On internal Chinese business-scenario metrics, PP-DocBee2 improves on PP-DocBee by about 11.4% and also outperforms currently popular open-source and closed-source models of the same scale. |
Note: The total scores above come from an internal evaluation set. All images in the set have a resolution (height, width) of (1680, 1204), and it contains 1196 samples covering scenarios such as financial reports, laws and regulations, science and engineering papers, manuals, humanities papers, contracts, and research reports. There are currently no plans to release it publicly.
    + +
Choose a model by weighing your priorities: pick a higher-scoring model if accuracy matters most, a faster model if inference speed matters most, or a smaller model if storage size is the main constraint.

## 2. Quick Start

Before using the document understanding pipeline locally, ensure that you have installed the wheel package according to the [installation tutorial](../ppocr/installation.md). Once installed, you can experience it locally via the command line or Python integration.

### 2.1 Command Line Experience

Experience the doc_understanding pipeline with a single command:

```bash
paddleocr doc_understanding -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}"
```
    The command line supports more parameter settings, click to expand for a detailed explanation of the command line parameters + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `doc_understanding_model_name` | Name of the document understanding model. If set to `None`, the pipeline's default model is used. | `str` | `None` |
| `doc_understanding_model_dir` | Directory path of the document understanding model. If set to `None`, the official model is downloaded. | `str` | `None` |
| `doc_understanding_batch_size` | Batch size of the document understanding model. If set to `None`, the batch size defaults to `1`. | `int` | `None` |
| `input` | Data to be predicted. Dictionary-type input, required. For PP-DocBee the format is `{"image": /path/to/image, "query": user question}`, i.e. the input image and the corresponding user question. | `Python Var\|str\|list` | `None` |
| `save_path` | Path for saving the inference result file. If set to `None`, the result is not saved locally. | `str` | `None` |
| `device` | Device used for inference. Supports specifying a card number: `cpu`, `gpu:0` (first GPU), `npu:0`, `xpu:0`, `mlu:0`, `dcu:0`. If set to `None`, GPU 0 is used when available, otherwise the CPU. | `str` | `None` |
| `enable_hpi` | Whether to enable high-performance inference. | `bool` | `False` |
| `use_tensorrt` | Whether to use TensorRT for inference acceleration. | `bool` | `False` |
| `min_subgraph_size` | Minimum subgraph size, used to optimize the computation of model subgraphs. | `int` | `3` |
| `precision` | Computation precision, e.g. `fp32`, `fp16`. | `str` | `fp32` |
| `enable_mkldnn` | Whether to enable the MKL-DNN acceleration library. If set to `None`, it is enabled by default. | `bool` | `None` |
| `cpu_threads` | Number of threads used for CPU inference. | `int` | `8` |
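For reference, a hypothetical command combining several of the parameters above (the flag spellings are assumed to mirror the parameter names in the table):

```bash
paddleocr doc_understanding \
    -i "{'image': 'medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}" \
    --doc_understanding_model_name PP-DocBee2-3B \
    --save_path ./output \
    --device gpu:0
```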
    + +The results will be printed to the terminal, and the default configuration of the doc_understanding pipeline will produce the following output: + +```bash +{'res': {'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}} +``` + +### 2.2 Python Script Integration + +The command line method is for quickly experiencing the effect. Generally, in projects, code integration is often required. You can complete quick inference of the pipeline with just a few lines of code. The inference code is as follows: + +```python +from paddleocr import DocUnderstanding + +pipeline = DocUnderstanding() +output = pipeline.predict( + { + "image": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png", + "query": "识别这份表格的内容, 以markdown格式输出" + } +) +for res in output: + res.print() ## Print the structured output of the prediction + res.save_to_json("./output/") +``` + +In the above Python script, the following steps are performed: + +(1) Instantiate a Document Understanding Pipeline object through `DocUnderstanding()`. The specific parameter descriptions are as follows: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `doc_understanding_model_name` | Name of the document understanding model. If set to `None`, the pipeline's default model is used. | `str` | `None` |
| `doc_understanding_model_dir` | Directory path of the document understanding model. If set to `None`, the official model is downloaded. | `str` | `None` |
| `doc_understanding_batch_size` | Batch size of the document understanding model. If set to `None`, the batch size defaults to `1`. | `int` | `None` |
| `device` | Device used for inference. Supports specifying a card number: `cpu`, `gpu:0` (first GPU), `npu:0`, `xpu:0`, `mlu:0`, `dcu:0`. If set to `None`, GPU 0 is used when available, otherwise the CPU. | `str` | `None` |
| `enable_hpi` | Whether to enable high-performance inference. | `bool` | `False` |
| `use_tensorrt` | Whether to use TensorRT for inference acceleration. | `bool` | `False` |
| `min_subgraph_size` | Minimum subgraph size, used to optimize the computation of model subgraphs. | `int` | `3` |
| `precision` | Computation precision, e.g. `fp32`, `fp16`. | `str` | `fp32` |
| `enable_mkldnn` | Whether to enable the MKL-DNN acceleration library. If set to `None`, it is enabled by default. | `bool` | `None` |
| `cpu_threads` | Number of threads used for CPU inference. | `int` | `8` |
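For example, a minimal instantiation sketch using the parameters above (values are illustrative):

```python
from paddleocr import DocUnderstanding

# Pin the pipeline to a specific model and device.
pipeline = DocUnderstanding(
    doc_understanding_model_name="PP-DocBee2-3B",
    device="gpu:0",
)
```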
(2) Call the `predict()` method of the document understanding pipeline object to run inference; it returns a list of results.

The pipeline also provides a `predict_iter()` method. Both methods accept the same parameters and return the same results; the difference is that `predict_iter()` returns a `generator` that yields predictions one at a time, which is useful for large datasets or when memory is limited (see the sketch after the table below). Choose whichever method fits your needs.

The parameters of the `predict()` method are described below:
| Parameter | Description | Type | Default |
| --- | --- | --- | --- |
| `input` | Data to be predicted; currently only dictionary-type input is supported. For PP-DocBee the format is `{"image": /path/to/image, "query": user question}`, i.e. the input image and the corresponding user question. | `Python Dict` | `None` |
| `device` | Same as the parameter at instantiation. | `str` | `None` |
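As referenced above, a minimal `predict_iter()` sketch for the pipeline (same input dict as the `predict()` example):

```python
from paddleocr import DocUnderstanding

pipeline = DocUnderstanding()
# The generator yields results one at a time, keeping memory usage flat.
for res in pipeline.predict_iter(
    {
        "image": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png",
        "query": "识别这份表格的内容, 以markdown格式输出",
    }
):
    res.print()
```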
    + +(3) Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports printing and saving as a `json` file: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Method | Description | Parameter | Type | Parameter Description | Default |
| --- | --- | --- | --- | --- | --- |
| `print()` | Print the result to the terminal | `format_json` | `bool` | Whether to format the output with JSON indentation | `True` |
| | | `indent` | `int` | Indentation level used to beautify the JSON output for readability; effective only when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode; `True` escapes all non-ASCII characters, `False` keeps the original characters; effective only when `format_json` is `True` | `False` |
| `save_to_json()` | Save the result as a JSON file | `save_path` | `str` | Path of the file to save; when it is a directory, the saved file is named after the input file | None |
| | | `indent` | `int` | Same as in `print()` | `4` |
| | | `ensure_ascii` | `bool` | Same as in `print()` | `False` |
- Calling the `print()` method prints the result to the terminal. The printed content is explained as follows:

    - `image`: `(str)` Input path of the image

    - `query`: `(str)` Question about the input image

    - `result`: `(str)` Output result of the model

- Calling the `save_to_json()` method saves the above content to the specified `save_path`. If a directory is given, the result is saved to `save_path/{your_img_basename}_res.json`; if a file path is given, it is saved directly to that file.

* In addition, the prediction results can be obtained through the following attributes:
| Attribute | Description |
| --- | --- |
| `json` | Get the prediction result in `json` format |
| `img` | Get the visualized image in `dict` format |
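A short sketch of reading these attributes (assuming `output` comes from `pipeline.predict()` as above; the `['res']` key mirrors the printed output):

```python
for res in output:
    print(res.json["res"]["result"])  # prediction dict, same content as save_to_json()
    visuals = res.img                 # dict of visualized images, when available
```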
    + +- The prediction result obtained through the `json` attribute is data of the dict type, consistent with the content saved by calling the `save_to_json()` method. + +## 3. Development Integration/Deployment + +If the pipeline meets your requirements for pipeline inference speed and accuracy, you can proceed with development integration/deployment directly. + +If you need to apply the pipeline directly to your Python project, you can refer to the example code in [2.2 Python Script Integration](#22-python-script-integration). + +In addition, PaddleOCR also provides two other deployment methods, detailed descriptions are as follows: + +🚀 High-Performance Inference: In real production environments, many applications have strict standards for the performance indicators of deployment strategies (especially response speed) to ensure efficient system operation and smooth user experience. To this end, PaddleOCR provides high-performance inference capabilities, aiming to deeply optimize the performance of model inference and pre-and post-processing, achieving significant acceleration of the end-to-end process. For detailed high-performance inference processes, refer to the [High-Performance Inference Guide](../deployment/high_performance_inference.md). + +☁️ Service Deployment: Service deployment is a common form of deployment in real production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. For detailed pipeline service deployment processes, refer to the [Service Deployment Guide](../deployment/serving.md). + +Below is the API reference for basic service deployment and examples of service invocation in multiple languages: + +
    API Reference + +

For the main operations provided by the service, the request body and the response body are both JSON data (JSON objects). When a request is processed successfully, the response status code is `200` and the response body has the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning |
| --- | --- | --- |
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Fixed as `0`. |
| `errorMsg` | `string` | Error description. Fixed as `"Success"`. |
| `result` | `object` | Operation result. |
When a request is not processed successfully, the response body has the following attributes:
| Name | Type | Meaning |
| --- | --- | --- |
| `logId` | `string` | UUID of the request. |
| `errorCode` | `integer` | Error code. Same as the response status code. |
| `errorMsg` | `string` | Error description. |
The main operations provided by the service are as follows:

Perform inference on the input messages to generate a response:

`POST /document-understanding`

Note: The endpoint above is also exposed as `/chat/completion` and is compatible with the OpenAI interface.

The request body has the following attributes:
| Name | Type | Meaning | Required | Default |
| --- | --- | --- | --- | --- |
| `model` | `string` | Name of the model to use | Yes | - |
| `messages` | `array` | List of dialogue messages | Yes | - |
| `max_tokens` | `integer` | Maximum number of tokens to generate | No | `1024` |
| `temperature` | `float` | Sampling temperature | No | `0.1` |
| `top_p` | `float` | Nucleus sampling probability | No | `0.95` |
| `stream` | `boolean` | Whether to stream the output | No | `false` |
| `max_image_tokens` | `int` | Maximum number of input tokens for images | No | `None` |

    Each element in messages is an object with the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning | Required |
| --- | --- | --- | --- |
| `role` | `string` | Message role (`user`/`assistant`/`system`) | Yes |
| `content` | `string` or `array` | Message content (text or mixed media) | Yes |

    When content is an array, each element is an object with the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning | Required | Default |
| --- | --- | --- | --- | --- |
| `type` | `string` | Content type (`text`/`image_url`) | Yes | - |
| `text` | `string` | Text content (when `type` is `text`) | Conditionally required | - |
| `image_url` | `string` or `object` | Image URL or object (when `type` is `image_url`) | Conditionally required | - |

    When image_url is an object, it has the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning | Required | Default |
| --- | --- | --- | --- | --- |
| `url` | `string` | Image URL | Yes | - |
| `detail` | `string` | Image detail processing method (`low`/`high`/`auto`) | No | `auto` |
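Putting the request schema above together, a hypothetical `curl` sketch (the host, port, and base64 placeholder are assumptions):

```bash
curl -s "http://localhost:8080/document-understanding" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pp-docbee",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "识别这份表格的内容, 以markdown格式输出"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<BASE64_IMAGE>"}}
      ]
    }]
  }'
```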

    When the request is processed successfully, the result in the response body has the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning |
| --- | --- | --- |
| `id` | `string` | Request ID |
| `object` | `string` | Object type (`chat.completion`) |
| `created` | `integer` | Creation timestamp |
| `choices` | `array` | Generated result options |
| `usage` | `object` | Token usage |

    Each element in choices is a Choice object with the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning | Optional Values |
| --- | --- | --- | --- |
| `finish_reason` | `string` | Reason the model stopped generating tokens | `stop` (natural stop)<br>`length` (reached the max token count)<br>`tool_calls` (called a tool)<br>`content_filter` (content filtering)<br>`function_call` (called a function, deprecated) |
| `index` | `integer` | Index of the option in the list | - |
| `logprobs` | `object` \| `null` | Log probability information of the option | - |
| `message` | `ChatCompletionMessage` | Chat message generated by the model | - |

    The message object has the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning | Remarks |
| --- | --- | --- | --- |
| `content` | `string` \| `null` | Message content | May be empty |
| `refusal` | `string` \| `null` | Refusal message generated by the model | Provided when content is refused |
| `role` | `string` | Role of the message author | Fixed as `"assistant"` |
| `audio` | `object` \| `null` | Audio output data | Provided when audio output is requested |
| `function_call` | `object` \| `null` | Name and arguments of the function to call | Deprecated; use `tool_calls` instead |
| `tool_calls` | `array` \| `null` | Tool calls generated by the model | e.g. function calls |

    The usage object has the following attributes:

    + + + + + + + + + + + + + + + + + + + + + + + + + +
| Name | Type | Meaning |
| --- | --- | --- |
| `prompt_tokens` | `integer` | Number of prompt tokens |
| `completion_tokens` | `integer` | Number of generated tokens |
| `total_tokens` | `integer` | Total number of tokens |

    An example of a result is as follows:

```json
{
    "id": "ed960013-eb19-43fa-b826-3c1b59657e35",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n",
                "role": "assistant"
            }
        }
    ],
    "created": 1745218041,
    "model": "pp-docbee",
    "object": "chat.completion"
}
```
    + +
    Multi-language Service Invocation Examples + +
Python: OpenAI interface invocation example
```python
import base64
from openai import OpenAI

API_BASE_URL = "http://0.0.0.0:8080"

# Initialize the OpenAI client
client = OpenAI(
    api_key="xxxxxxxxx",
    base_url=f"{API_BASE_URL}",
)

# Convert an image file to a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Input image path
image_path = "medal_table.png"

# Convert the original image to base64
base64_image = encode_image(image_path)

# Submit the request to the PP-DocBee model
response = client.chat.completions.create(
    model="pp-docbee",  # choose the model
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "识别这份表格的内容,输出html格式的内容"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                },
            ]
        },
    ],
)
content = response.choices[0].message.content
print("Reply:", content)
```
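Since the request body documents a `stream` flag, a hypothetical streaming variant of the call above might look like this (the chunk layout follows the OpenAI client convention and is an assumption here):

```python
# Reuses `client` and `base64_image` from the example above.
stream = client.chat.completions.create(
    model="pp-docbee",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "识别这份表格的内容,输出html格式的内容"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            ],
        },
    ],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta rather than the full message.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```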
    + +## 4. Secondary Development + +The current pipeline does not support fine-tuning training and only supports inference integration. Concerning fine-tuning training for this pipeline, there are plans to support it in the future. diff --git a/docs/version3.x/pipeline_usage/doc_understanding.md b/docs/version3.x/pipeline_usage/doc_understanding.md index 377f2bd238..c889ca95ee 100644 --- a/docs/version3.x/pipeline_usage/doc_understanding.md +++ b/docs/version3.x/pipeline_usage/doc_understanding.md @@ -23,18 +23,29 @@ comments: true 模型模型下载链接 模型存储大小(GB) +模型总分 介绍 PP-DocBee-2B推理模型 4.2 +765 PP-DocBee 是飞桨团队自研的一款专注于文档理解的多模态大模型,在中文文档理解任务上具有卓越表现。该模型通过近 500 万条文档理解类多模态数据集进行微调优化,各种数据集包括了通用VQA类、OCR类、图表类、text-rich文档类、数学和复杂推理类、合成数据类、纯文本数据等,并设置了不同训练数据配比。在学术界权威的几个英文文档理解评测榜单上,PP-DocBee基本都达到了同参数量级别模型的SOTA。在内部业务中文场景类的指标上,PP-DocBee也高于目前的热门开源和闭源模型。 PP-DocBee-7B推理模型 15.8 +- + + +PP-DocBee2-3B推理模型 +7.6 +852 +PP-DocBee2 是飞桨团队自研的一款专注于文档理解的多模态大模型,在PP-DocBee的基础上进一步优化了基础模型,并引入了新的数据优化方案,提高了数据质量,使用自研数据合成策略生成的少量的47万数据便使得PP-DocBee2在中文文档理解任务上表现更佳。在内部业务中文场景类的指标上,PP-DocBee2相较于PP-DocBee提升了约11.4%,同时也高于目前的同规模热门开源和闭源模型。 + +注:以上模型总分为内部评估集模型测试结果,内部评估集所有图像分辨率 (height, width) 为 (1680,1204),共1196条数据,包括了财报、法律法规、理工科论文、说明书、文科论文、合同、研报等场景,暂时未有计划公开。
    @@ -49,7 +60,7 @@ comments: true 一行命令即可快速体验 doc_understanding 产线效果: ```bash -paddleocr doc_understanding -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容'}" +paddleocr doc_understanding -i "{'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出'}" ```
    命令行支持更多参数设置,点击展开以查看命令行参数的详细说明 @@ -83,11 +94,9 @@ paddleocr doc_understanding -i "{'image': 'https://paddle-model-ecology.bj.bcebo input -待预测数据,支持多种输入类型,必填。 +待预测数据,支持字典类型输入,必填。 Python Var|str|list @@ -160,7 +169,7 @@ paddleocr doc_understanding -i "{'image': 'https://paddle-model-ecology.bj.bcebo 运行结果会被打印到终端上,默认配置的 doc_understanding 产线的运行结果如下: ```bash -{'res': {'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}} +{'res': {'image': 'https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png', 'query': '识别这份表格的内容, 以markdown格式输出', 'result': '| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n'}} ``` ### 2.2 Python脚本方式集成 @@ -174,7 +183,7 @@ pipeline = DocUnderstanding() output = pipeline.predict( { "image": "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/medal_table.png", - "query": "识别这份表格的内容" + "query": "识别这份表格的内容, 以markdown格式输出" } ) for res in output: diff --git a/paddleocr/_models/doc_vlm.py b/paddleocr/_models/doc_vlm.py index c0df6bd697..f2fce5b9bf 100644 --- a/paddleocr/_models/doc_vlm.py +++ b/paddleocr/_models/doc_vlm.py @@ -34,7 +34,7 @@ class DocVLM(PaddleXPredictorWrapper): @property def default_model_name(self): - return "PP-DocBee-2B" + return "PP-DocBee2-3B" @classmethod def get_cli_subcommand_executor(cls):