---
comments: true
---

# Document Visual Language Model Module Tutorial

## I. Overview

Document visual language models are a cutting-edge multimodal processing technology aimed at addressing the limitations of traditional document processing methods. Traditional methods are often limited to processing document information in specific formats or predefined categories, whereas document visual language models can integrate visual and linguistic information to understand and handle diverse document content. By combining computer vision and natural language processing technologies, these models can recognize images, text, and their relationships within documents, and even understand semantic information within complex layout structures. This makes document processing more intelligent and flexible, with stronger generalization capabilities, showing broad application prospects in automated office work, information extraction, and other fields.

## II. Supported Model List

Model | Model Download Link | Model Storage Size (GB) | Total Score | Description |
---|---|---|---|---|
PP-DocBee-2B | Inference Model | 4.2 | 765 | PP-DocBee is a multimodal large model developed in-house by the PaddlePaddle team. It focuses on document understanding and performs excellently on Chinese document understanding tasks. The model is fine-tuned on nearly 5 million multimodal samples for document understanding, covering general VQA, OCR, charts, text-rich documents, mathematics and complex reasoning, synthetic data, and pure text data, with different mixing ratios set for the training data. On several authoritative academic benchmarks for English document understanding, PP-DocBee essentially achieves SOTA among models of the same parameter scale. On internal business metrics for Chinese scenarios, PP-DocBee also outperforms currently popular open-source and closed-source models. |
PP-DocBee-7B | Inference Model | 15.8 | - | |
PP-DocBee2-3B | Inference Model | 7.6 | 852 | PP-DocBee2 is a multimodal large model developed in-house by the PaddlePaddle team. Building on PP-DocBee, it further optimizes the base model and introduces a new data optimization scheme to improve data quality. Using only 470,000 samples generated by a self-developed data synthesis strategy, PP-DocBee2 performs better on Chinese document understanding tasks. On internal business metrics for Chinese scenarios, PP-DocBee2 improves on PP-DocBee by about 11.4% and also outperforms currently popular open-source and closed-source models of the same scale. |
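
As a quick sanity check on the figures in the table, the relative improvement can be recomputed from the Total Score column. This is only a sketch: it assumes that the "about 11.4%" figure refers to the same metric as the Total Score column, which the table does not state explicitly.

```python
# Total scores taken from the table above.
score_pp_docbee_2b = 765
score_pp_docbee2_3b = 852

# Relative improvement of PP-DocBee2-3B over PP-DocBee-2B, in percent.
# Assumption: the "about 11.4%" claim refers to this Total Score metric.
relative_gain = (score_pp_docbee2_3b - score_pp_docbee_2b) / score_pp_docbee_2b * 100

print(f"Relative improvement: {relative_gain:.1f}%")
```

Under that assumption, the two scores are consistent with the stated improvement.
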

Model initialization parameters:

Parameter | Description | Type | Default |
---|---|---|---|
`model_name` | Model name | `str` | `PP-DocBee-2B` |
`model_dir` | Model storage path | `str` | `None` |
`device` | Device(s) to use for inference. Examples: `cpu`, `gpu`, `npu`, `gpu:0`, `gpu:0,1`. If multiple devices are specified, inference will be performed in parallel. Note that parallel inference is not always supported. By default, GPU 0 will be used if available; otherwise, the CPU will be used. | `str` | `None` |
`enable_hpi` | Whether to enable high-performance inference. | `bool` | `False` |
`use_tensorrt` | Whether to use the Paddle Inference TensorRT subgraph engine. For Paddle with CUDA version 11.8, the compatible TensorRT version is 8.x (x>=6), and it is recommended to install TensorRT 8.6.1.6. For Paddle with CUDA version 12.6, the compatible TensorRT version is 10.x (x>=5), and it is recommended to install TensorRT 10.5.0.18. | `bool` | `False` |
`min_subgraph_size` | Minimum subgraph size for TensorRT when using the Paddle Inference TensorRT subgraph engine. | `int` | `3` |
`precision` | Precision for TensorRT when using the Paddle Inference TensorRT subgraph engine. Options: `fp32`, `fp16`, etc. | `str` | `fp32` |
`enable_mkldnn` | Whether to enable MKL-DNN acceleration for inference. If MKL-DNN is unavailable or the model does not support it, acceleration will not be used even if this flag is set. | `bool` | `True` |
`cpu_threads` | Number of threads to use for inference on CPUs. | `int` | `10` |
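
The `device` string format described in the table can be illustrated with a small parser. This is a minimal sketch, not the library's own parsing logic: `parse_device` and `KNOWN_DEVICE_TYPES` are hypothetical names, and the set of accepted device types is assumed to be just the ones listed in the examples.

```python
from typing import List, Optional, Tuple

# Device types listed in the parameter table; treated here as the allowed set
# (an assumption for this sketch, the real library may accept more).
KNOWN_DEVICE_TYPES = {"cpu", "gpu", "npu"}


def parse_device(device: str) -> Tuple[str, Optional[List[int]]]:
    """Split a device string such as "gpu:0,1" into (type, ids).

    Hypothetical helper: ids is None when no index is given.
    """
    device_type, sep, id_part = device.strip().partition(":")
    if device_type not in KNOWN_DEVICE_TYPES:
        raise ValueError(f"Unknown device type: {device_type!r}")
    if not sep:
        return device_type, None
    if not id_part:
        raise ValueError(f"Missing device index in {device!r}")
    ids = [int(part) for part in id_part.split(",")]
    return device_type, ids


for spec in ["cpu", "gpu", "gpu:0", "gpu:0,1"]:
    device_type, ids = parse_device(spec)
    # More than one index means inference would be requested in parallel.
    parallel = ids is not None and len(ids) > 1
    print(f"{spec!r}: type={device_type}, ids={ids}, parallel={parallel}")
```

Only a string with several indices, such as `gpu:0,1`, requests parallel inference, which the table notes is not always supported.
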

Prediction parameters:

Parameter | Description | Type | Default |
---|---|---|---|
`input` | Input data. Required. Because multimodal models differ in their input requirements, refer to the specific model for the correct format. For example, PP-DocBee series models expect: `{'image': image_path, 'query': query_text}` | `dict` | `None` |
`batch_size` | Batch size, positive integer. | `int` | `1` |
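
The input format for the PP-DocBee series, and what a batch size means for a set of queries, can be sketched as follows. The helpers `make_input` and `iter_batches` are hypothetical and exist only for illustration; the file names and queries are placeholders, and the sketch makes no claim about how the library itself groups inputs internally.

```python
from typing import Dict, Iterator, List


def make_input(image_path: str, query_text: str) -> Dict[str, str]:
    """Build one input dict in the PP-DocBee format described above."""
    return {"image": image_path, "query": query_text}


def iter_batches(
    inputs: List[Dict[str, str]], batch_size: int = 1
) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive groups of at most `batch_size` inputs."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size]


# Placeholder file names and queries, for illustration only.
inputs = [
    make_input("invoice_01.png", "What is the total amount?"),
    make_input("invoice_02.png", "What is the invoice date?"),
    make_input("report_01.png", "Summarize the table in Markdown."),
]

batches = list(iter_batches(inputs, batch_size=2))
for index, batch in enumerate(batches):
    print(index, [item["image"] for item in batch])
```
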

Methods of the prediction result:

Method | Method Description | Parameter | Parameter Type | Parameter Description | Default |
---|---|---|---|---|---|
`print()` | Print the result to the terminal | `format_json` | `bool` | Whether to format the output content using JSON indentation | `True` |
`print()` | Print the result to the terminal | `indent` | `int` | Indentation level used to beautify the output JSON data and make it more readable. Effective only when `format_json` is `True` | `4` |
`print()` | Print the result to the terminal | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters as Unicode escape sequences. When set to `True`, all non-ASCII characters are escaped; `False` retains the original characters. Effective only when `format_json` is `True` | `False` |
`save_to_json()` | Save the result as a JSON file | `save_path` | `str` | Path of the file to be saved. When it is a directory, the naming of the saved file is consistent with the input file type. | `None` |
`save_to_json()` | Save the result as a JSON file | `indent` | `int` | Indentation level used to beautify the output JSON data and make it more readable. Effective only when `format_json` is `True` | `4` |
`save_to_json()` | Save the result as a JSON file | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters as Unicode escape sequences. When set to `True`, all non-ASCII characters are escaped; `False` retains the original characters. Effective only when `format_json` is `True` | `False` |
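
The effect of `indent` and `ensure_ascii` can be seen with Python's standard `json` module, whose `json.dumps` takes parameters with the same names. This is a sketch under the assumption that the result methods behave like `json.dumps` for these options, which this document does not confirm; the `result` dictionary is a made-up payload, not the actual result schema.

```python
import json
import os
import tempfile

# Hypothetical result payload; the real result fields are not listed here.
result = {"image": "example.png", "query": "识别表格内容", "result": "名次: 1"}

# indent=4 with ensure_ascii=False: pretty-printed, original characters kept.
pretty = json.dumps(result, indent=4, ensure_ascii=False)

# ensure_ascii=True: every non-ASCII character is escaped as \uXXXX.
escaped = json.dumps(result, indent=4, ensure_ascii=True)

# Without indentation, the whole object is emitted on a single line.
compact = json.dumps(result, ensure_ascii=False)

print(escaped)
print("pretty output spans", len(pretty.splitlines()), "lines")
print("compact output spans", len(compact.splitlines()), "line")

# Writing the same data to a file, similar in spirit to save_to_json().
save_path = os.path.join(tempfile.mkdtemp(), "res.json")
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(result, f, indent=4, ensure_ascii=False)
```

Keeping `ensure_ascii` at `False` is usually the more readable choice for Chinese documents, since the recognized text stays legible in the output.
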

Attributes of the prediction result:

Attribute | Description |
---|---|
`json` | Get the prediction result in JSON format |