From 7aafc2d7b896b315e2e6c77f71b564544e03e65f Mon Sep 17 00:00:00 2001
From: andyjpaddle
Date: Sat, 7 May 2022 12:24:23 +0000
Subject: [PATCH 1/3] update quickstart doc

---
 doc/doc_ch/quickstart.md    | 10 ++++------
 doc/doc_en/quickstart_en.md | 10 ++++------
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/doc/doc_ch/quickstart.md b/doc/doc_ch/quickstart.md
index 6301755de8..d6b9dc1ccb 100644
--- a/doc/doc_ch/quickstart.md
+++ b/doc/doc_ch/quickstart.md
@@ -59,15 +59,13 @@ cd /path/to/ppocr_img
 
 If you do not use the provided test images, you can replace the `--image_dir` argument below with the path to your own test image.
 
-**Note**: The whl package uses the `PP-OCRv3` model by default, and the recognition model uses an input shape of `3,48,320`. Therefore, if you use the recognition function, you need to add the parameter `--rec_image_shape 3,48,320`; if you do not use the default `PP-OCRv3` model, this parameter does not need to be set.
-
 #### 2.1.1 Chinese and English Model
 
 * Whole pipeline of detection, angle classification and recognition: set `--use_angle_cls true` to use the angle classifier for text rotated by 180 degrees, and `--use_gpu false` to run without a GPU
 
   ```bash
-  paddleocr --image_dir ./imgs/11.jpg --use_angle_cls true --use_gpu false --rec_image_shape 3,48,320
+  paddleocr --image_dir ./imgs/11.jpg --use_angle_cls true --use_gpu false
   ```
 
   The result is a list; each item contains the bounding box, the text, and the recognition confidence
 
@@ -94,7 +92,7 @@ cd /path/to/ppocr_img
 - Recognition only: set `--det` to `false`
 
   ```bash
-  paddleocr --image_dir ./imgs_words/ch/word_1.jpg --det false --rec_image_shape 3,48,320
+  paddleocr --image_dir ./imgs_words/ch/word_1.jpg --det false
   ```
 
   The result is a list; each item contains only the recognized text and its confidence
 
@@ -104,7 +102,7 @@ cd /path/to/ppocr_img
   ```
   ['韩国小馆', 0.994467]
   ```
 
-To use the 2.0 models, specify the parameter `--version PP-OCR`; paddleocr uses the PP-OCRv3 model by default (`--versioin PP-OCRv3`). For more usage of the whl package, see the [whl package documentation](./whl.md)
+To use the 2.0 models, specify the parameter `--ocr_version PP-OCR`; paddleocr uses the PP-OCRv3 model by default (`--ocr_version PP-OCRv3`). For more usage of the whl package, see the [whl package documentation](./whl.md)
 
@@ -113,7 +111,7 @@ cd /path/to/ppocr_img
 PaddleOCR currently supports 80 languages, which can be switched via the `--lang` parameter; for the English model, specify `--lang=en`. PP-OCRv3 currently provides only Chinese and English models; other multilingual models will be released successively.
 
 ``` bash
-paddleocr --image_dir ./imgs_en/254.jpg --lang=en --rec_image_shape 3,48,320
+paddleocr --image_dir ./imgs_en/254.jpg --lang=en
 ```
diff --git a/doc/doc_en/quickstart_en.md b/doc/doc_en/quickstart_en.md
index bf1ce05cf4..6e698b0c79 100644
--- a/doc/doc_en/quickstart_en.md
+++ b/doc/doc_en/quickstart_en.md
@@ -75,8 +75,6 @@ cd /path/to/ppocr_img
 
 If you do not use the provided test image, you can replace the following `--image_dir` parameter with the corresponding test image path
 
-**Note**: The whl package uses the `PP-OCRv3` model by default, and the input shape used by the recognition model is `3,48,320`, so if you use the recognition function, you need to add the parameter `--rec_image_shape 3,48,320`, if you do not use the default `PP-OCRv3` model, you do not need to set this parameter.
-
 #### 2.1.1 Chinese and English Model
 
@@ -84,7 +82,7 @@ If you do not use the provided test image, you can replace the following `--imag
 * Detection, direction classification and recognition: set the parameter `--use_gpu false` to disable the gpu device
 
   ```bash
-  paddleocr --image_dir ./imgs_en/img_12.jpg --use_angle_cls true --lang en --use_gpu false --rec_image_shape 3,48,320
+  paddleocr --image_dir ./imgs_en/img_12.jpg --use_angle_cls true --lang en --use_gpu false
   ```
 
   Output will be a list, each item contains bounding box, text and recognition confidence
 
@@ -114,7 +112,7 @@ If you do not use the provided test image, you can replace the following `--imag
 * Only recognition: set `--det` to `false`
 
   ```bash
-  paddleocr --image_dir ./imgs_words_en/word_10.png --det false --lang en --rec_image_shape 3,48,320
+  paddleocr --image_dir ./imgs_words_en/word_10.png --det false --lang en
   ```
 
   Output will be a list, each item contains text and recognition confidence
 
   ```bash
   ['PAIN', 0.9934559464454651]
   ```
 
-If you need to use the 2.0 model, please specify the parameter `--version PP-OCR`, paddleocr uses the PP-OCRv3 model by default(`--versioin PP-OCRv3`). More whl package usage can be found in [whl package](./whl_en.md)
+If you need to use the 2.0 model, please specify the parameter `--ocr_version PP-OCR`; paddleocr uses the PP-OCRv3 model by default (`--ocr_version PP-OCRv3`). More whl package usage can be found in the [whl package documentation](./whl_en.md)
 
 #### 2.1.2 Multi-language Model
 
 Paddleocr currently supports 80 languages, which can be switched by modifying the `--lang` parameter. PP-OCRv3 currently only supports Chinese and English models, and other multilingual models will be updated one after another.
 
 ``` bash
-paddleocr --image_dir ./doc/imgs_en/254.jpg --lang=en --rec_image_shape 3,48,320
+paddleocr --image_dir ./doc/imgs_en/254.jpg --lang=en
 ```
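For reference alongside these CLI changes, the whl package's Python API uses the same defaults. A minimal sketch, assuming a 2022-era paddleocr release (the result layout can differ between versions):

```python
from paddleocr import PaddleOCR

# PP-OCRv3 is the default; pass ocr_version='PP-OCR' to fall back to the 2.0 models.
ocr = PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False)
result = ocr.ocr('./imgs_en/img_12.jpg', cls=True)
for line in result:
    print(line)  # [bounding box, (text, confidence)]
```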
From 394cb2f7df41c59a878b96b53b8933139fad28d0 Mon Sep 17 00:00:00 2001
From: andyjpaddle
Date: Sat, 7 May 2022 12:40:04 +0000
Subject: [PATCH 2/3] update quickstart doc

---
 doc/doc_ch/quickstart.md    | 2 +-
 doc/doc_en/quickstart_en.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/doc_ch/quickstart.md b/doc/doc_ch/quickstart.md
index e0e65c3e83..29ca48fa83 100644
--- a/doc/doc_ch/quickstart.md
+++ b/doc/doc_ch/quickstart.md
@@ -108,7 +108,7 @@ cd /path/to/ppocr_img
 
 #### 2.1.2 Multi-language Model
 
-PaddleOCR currently supports 80 languages, which can be switched via the `--lang` parameter; for the English model, specify `--lang=en`. PP-OCRv3 currently provides only Chinese and English models; other multilingual models will be released successively.
+PaddleOCR currently supports 80 languages, which can be switched via the `--lang` parameter; for the English model, specify `--lang=en`.
 
 ``` bash
 paddleocr --image_dir ./imgs_en/254.jpg --lang=en
diff --git a/doc/doc_en/quickstart_en.md b/doc/doc_en/quickstart_en.md
index ac820ed64f..d7aeb77730 100644
--- a/doc/doc_en/quickstart_en.md
+++ b/doc/doc_en/quickstart_en.md
@@ -124,7 +124,7 @@ If you need to use the 2.0 model, please specify the parameter `--ocr_version PP
 
 #### 2.1.2 Multi-language Model
 
-PaddleOCR currently supports 80 languages, which can be switched by modifying the `--lang` parameter. PP-OCRv3 currently only supports Chinese and English models, and other multilingual models will be updated one after another.
+PaddleOCR currently supports 80 languages, which can be switched by modifying the `--lang` parameter.
 
 ``` bash
 paddleocr --image_dir ./doc/imgs_en/254.jpg --lang=en
@@ -208,4 +208,4 @@ Visualization of results
 
 In this section, you have mastered the use of PaddleOCR whl package.
 
-PaddleOCR is a rich and practical OCR tool library that get through the whole process of data production, model training, compression, inference and deployment, please refer to the [tutorials](../../README.md#tutorials) to start the journey of PaddleOCR.
\ No newline at end of file
+PaddleOCR is a rich and practical OCR tool library that covers the whole process of data production, model training, compression, inference and deployment. Please refer to the [tutorials](../../README.md#tutorials) to start your journey with PaddleOCR.

From 3cfa0fbf6336bdec4126537ded3384a9e4ce5fde Mon Sep 17 00:00:00 2001
From: andyjpaddle
Date: Sun, 8 May 2022 10:05:59 +0000
Subject: [PATCH 3/3] update ppocr en doc

---
 doc/doc_en/PP-OCRv3_introduction_en.md | 111 +++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/doc/doc_en/PP-OCRv3_introduction_en.md b/doc/doc_en/PP-OCRv3_introduction_en.md
index 791c95a6b5..5cc6ef3741 100644
--- a/doc/doc_en/PP-OCRv3_introduction_en.md
+++ b/doc/doc_en/PP-OCRv3_introduction_en.md
@@ -1 +1,112 @@
 English | [简体中文](../doc_ch/PP-OCRv3_introduction.md)
+
+## 3. Optimization for Text Recognition Model
+
+The recognition module of PP-OCRv3 is optimized based on the text recognition algorithm [SVTR](https://arxiv.org/abs/2205.00159). SVTR abandons RNN and mines the context information of text line images more effectively by introducing a Transformer structure, thereby improving the text recognition ability. Directly replacing the recognition model of PP-OCRv2 with SVTR_Tiny raised the recognition accuracy from 74.8% to 80.1% (+5.3%), but the prediction speed became nearly 11 times slower, taking almost 100 ms to predict one text line on the CPU. Therefore, as shown in the figure below, PP-OCRv3 adopts the following six optimization strategies to accelerate the recognition model.
+
+[Figure: overview of the six optimization strategies of the PP-OCRv3 recognition model]
+
+Based on the above strategies, compared with PP-OCRv2, the PP-OCRv3 recognition model further improves the accuracy by 4.6% with comparable speed. The specific ablation experiments are as follows:
+
+| ID | strategy | Model size | accuracy | prediction speed (CPU + MKLDNN) |
+|-----|-----|--------|----|-----|
+| 01 | PP-OCRv2 | 8M | 74.8% | 8.54ms |
+| 02 | SVTR_Tiny | 21M | 80.1% | 97ms |
+| 03 | SVTR_LCNet(h32) | 12M | 71.9% | 6.6ms |
+| 04 | SVTR_LCNet(h48) | 12M | 73.98% | 7.6ms |
+| 05 | + GTC | 12M | 75.8% | 7.6ms |
+| 06 | + TextConAug | 12M | 76.3% | 7.6ms |
+| 07 | + TextRotNet | 12M | 76.9% | 7.6ms |
+| 08 | + UDML | 12M | 78.4% | 7.6ms |
+| 09 | + UIM | 12M | 79.4% | 7.6ms |
+
+Note: When testing the speed, the input image shape of experiments 01-03 is (3, 32, 320), and that of experiments 04-09 is (3, 48, 320). In actual prediction the image is a variable-length input, so the speed will vary (a resize sketch is given after the figure below). Test environment: Intel Gold 6148 CPU, with MKLDNN acceleration enabled during prediction.
+
+**(1) SVTR_LCNet: Lightweight Text Recognition Network**
+
+SVTR_LCNet is a lightweight text recognition network for text recognition tasks that fuses the Transformer-based [SVTR](https://arxiv.org/abs/2205.00159) network with the lightweight CNN network [PP-LCNet](https://arxiv.org/abs/2109.15099). With this network, the prediction speed is 20% faster than that of the PP-OCRv2 recognition model, but the accuracy is slightly lower because the distillation strategy is not adopted. In addition, the height of the input image is further increased from 32 to 48, which makes prediction slightly slower but greatly improves the model accuracy: the recognition accuracy reaches 73.98% (+2.08%), close to that of the PP-OCRv2 recognition model trained with the distillation strategy.
+
+The SVTR_Tiny network structure is as follows:
+
+[Figure: SVTR_Tiny network structure]
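+Since text line images are variable-length, they are resized to the fixed recognition input shape before inference: scale to the target height, keep the aspect ratio, and pad the width. A minimal sketch of this preprocessing (the function below is illustrative, not PaddleOCR's exact implementation):
+
+```python
+import cv2
+import numpy as np
+
+def resize_norm_img(img, rec_image_shape=(3, 48, 320)):
+    """Resize a text line image to a fixed height, keep the aspect
+    ratio, and right-pad with zeros up to the maximum width."""
+    c, h, w = rec_image_shape
+    ratio = img.shape[1] / float(img.shape[0])        # width / height
+    resized_w = min(w, int(np.ceil(h * ratio)))       # cap the width at 320
+    resized = cv2.resize(img, (resized_w, h)).astype("float32")
+    resized = (resized.transpose((2, 0, 1)) / 255.0 - 0.5) / 0.5  # CHW, [-1, 1]
+    padded = np.zeros((c, h, w), dtype="float32")
+    padded[:, :, :resized_w] = resized
+    return padded
+
+line_img = np.random.randint(0, 255, (20, 120, 3), dtype=np.uint8)  # dummy text line
+print(resize_norm_img(line_img).shape)  # (3, 48, 320)
+```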
+
+Due to the limited set of model structures supported by the MKLDNN acceleration library, SVTR is 10 times slower than PP-OCRv2 on CPU + MKLDNN. PP-OCRv3 aims to improve the accuracy of the model without adding inference time. Analysis found that the main time-consuming module of the SVTR_Tiny structure is the Mixing Block, so a series of optimizations was carried out on the SVTR_Tiny structure (for detailed speed data, see the ablation experiment table below; a rough attention-cost estimate follows the three structure figures):
+
+1. Replace the first half of the SVTR network with the first three stages of PP-LCNet and retain 4 Global Mixing Blocks; the accuracy is 76%, and the speedup is 69%. The network structure is as follows:
+
+[Figure: network structure with the first three stages of PP-LCNet and 4 Global Mixing Blocks]
+
+2. Reduce the number of Global Mixing Blocks from 4 to 2; the accuracy is 72.9%, with a further 69% speedup. The network structure is as follows:
+
+[Figure: network structure with 2 Global Mixing Blocks]
+
+3. Experiments found that the prediction speed of the Global Mixing Block is related to the shape of its input features. Therefore, after moving the Global Mixing Block behind the pooling layer, the accuracy dropped to 71.9%, while the speed surpassed the CNN-based PP-OCRv2-baseline by 22%. The network structure is as follows:
+
+[Figure: network structure with the Global Mixing Block moved behind the pooling layer]
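+Why the third change helps: the self-attention cost of a Global Mixing Block grows quadratically with the number of tokens, and pooling the feature map to a height of 1 sharply reduces the token count. A back-of-the-envelope estimate (the downsampling ratios and dimensions below are illustrative assumptions, not exact SVTR_LCNet hyperparameters):
+
+```python
+def attn_cost(n_tokens, dim):
+    """Rough self-attention cost of a mixing block: O(n^2 * d)."""
+    return n_tokens ** 2 * dim
+
+h, w, dim = 48, 320, 120         # input text line size and embedding dim (assumed)
+n_before = (h // 8) * (w // 8)   # 2D feature map before pooling: 6 * 40 = 240 tokens
+n_after = 1 * (w // 8)           # after pooling to height 1: 40 tokens
+
+# Moving the block behind the pooling layer cuts the attention cost ~36x here.
+print(attn_cost(n_before, dim) / attn_cost(n_after, dim))  # 36.0
+```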
+
+The specific ablation experiments are as follows:
+
+| ID | strategy | Model size | accuracy | prediction speed (CPU + MKLDNN) |
+|-----|-----|--------|----|-----|
+| 01 | PP-OCRv2-baseline | 8M | 69.3% | 8.54ms |
+| 02 | SVTR_Tiny | 21M | 80.1% | 97ms |
+| 03 | SVTR_LCNet(G4) | 9.2M | 76% | 30ms |
+| 04 | SVTR_LCNet(G2) | 13M | 72.98% | 9.37ms |
+| 05 | SVTR_LCNet(h32) | 12M | 71.9% | 6.6ms |
+| 06 | SVTR_LCNet(h48) | 12M | 73.98% | 7.6ms |
+
+Note: When testing the speed, the input image shapes of experiments 01-05 are all (3, 32, 320). PP-OCRv2-baseline denotes the model trained without the distillation method.
+
+**(2) GTC: Attention Guides CTC Training Strategy**
+
+[GTC](https://arxiv.org/pdf/2002.01276.pdf) (Guided Training of CTC) uses an Attention module and its loss to guide the training of the CTC loss and to fuse the representations of multiple text features, which is an effective strategy for improving text recognition. With this strategy, the Attention module is completely removed at prediction time, so no extra inference time is added, and the accuracy of the recognition model is further improved to 75.8% (+1.82%). The training process is as follows:
+
+[Figure: GTC training process]
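+A minimal sketch of the GTC idea: both heads share one backbone during training, the attention branch guiding the shared features through a combined loss, and only the CTC head survives into inference. The stand-in functions and the weight `alpha` are illustrative assumptions:
+
+```python
+# Toy stand-ins so the sketch runs; a real model supplies these pieces.
+backbone = lambda x: x        # shared feature extractor
+ctc_head = lambda f: f        # CTC branch, kept for deployment
+attn_head = lambda f: f       # attention branch, used only during training
+ctc_loss = lambda pred, y: abs(pred - y)
+attn_loss = lambda pred, y: abs(pred - y)
+alpha = 1.0                   # weight of the guiding attention loss (assumed)
+
+def gtc_train_loss(image, label):
+    feats = backbone(image)
+    # The attention loss guides the shared features that the CTC head consumes.
+    return ctc_loss(ctc_head(feats), label) + alpha * attn_loss(attn_head(feats), label)
+
+def gtc_predict(image):
+    # The attention head is dropped entirely, so inference cost is unchanged.
+    return ctc_head(backbone(image))
+
+print(gtc_train_loss(0.5, 1.0), gtc_predict(0.5))  # 1.0 0.5
+```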
+
+**(3) TextConAug: Data Augmentation for Mining Text Context Information**
+
+TextConAug is a data augmentation strategy for mining textual context information. The main idea comes from the paper [ConCLR](https://www.cse.cuhk.edu.hk/~byu/papers/C139-AAAI2022-ConCLR.pdf), in which the authors propose the ConAug augmentation: connecting 2 different images in a batch to form new images for self-supervised contrastive learning. PP-OCRv3 applies this method to supervised learning tasks and designs the TextConAug data augmentation, which enriches the context information of the training data and improves its diversity. With this strategy, the accuracy of the recognition model is further improved to 76.3% (+0.5%). The schematic diagram of TextConAug is as follows:
+
+[Figure: TextConAug schematic diagram]
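+A sketch of the TextConAug idea in the supervised setting: pair each text line image with another image from the batch, concatenate them width-wise, and join their labels. The pairing policy below is an illustrative assumption:
+
+```python
+import numpy as np
+
+def text_con_aug(images, labels, rng):
+    """Concatenate random pairs of same-height text line images and
+    their label strings, enriching the context of training samples."""
+    partner = rng.permutation(len(images))
+    aug_images = [np.concatenate([images[i], images[partner[i]]], axis=1)
+                  for i in range(len(images))]
+    aug_labels = [labels[i] + labels[partner[i]] for i in range(len(images))]
+    return aug_images, aug_labels
+
+rng = np.random.default_rng(0)
+batch = [np.zeros((48, 100, 3), np.uint8), np.zeros((48, 60, 3), np.uint8)]
+texts = ["hello", "world"]
+aug_imgs, aug_texts = text_con_aug(batch, texts, rng)
+print(aug_imgs[0].shape, aug_texts)  # widths add up; labels are concatenated
+```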
+
+**(4) TextRotNet: Self-Supervised Pre-trained Model**
+
+TextRotNet is a pre-trained model trained on a large amount of unlabeled text line data in a self-supervised manner, following the paper [STR-Fewer-Labels](https://github.com/ku21fan/STR-Fewer-Labels). This model is used to initialize the weights of SVTR_LCNet, which helps the text recognition model converge to a better position. With this strategy, the accuracy of the recognition model is further improved to 76.9% (+0.6%). The TextRotNet training process is shown in the following figure:
+
+[Figure: TextRotNet training process]
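+A sketch of a rotation-prediction pretext task of this kind (the label set and batch construction are illustrative assumptions, not the exact TextRotNet recipe): rotate each unlabeled image by a random multiple of 90 degrees and train the network to classify which rotation was applied; the resulting backbone weights then initialize SVTR_LCNet.
+
+```python
+import numpy as np
+
+ROTATIONS = 4  # 0, 90, 180, 270 degrees (assumed label set)
+
+def make_rotation_batch(images, rng):
+    """Self-supervised batch: the rotation index is the free label."""
+    labels = rng.integers(0, ROTATIONS, size=len(images))
+    rotated = [np.rot90(img, k) for img, k in zip(images, labels)]
+    return rotated, labels  # train a classifier to predict `labels`
+
+rng = np.random.default_rng(0)
+unlabeled = [np.zeros((48, 320, 3), np.uint8) for _ in range(4)]
+batch, rot_labels = make_rotation_batch(unlabeled, rng)
+print(batch[0].shape, rot_labels)
+```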
+
+**(5) UDML: Unified-Deep Mutual Learning**
+
+UDML (Unified-Deep Mutual Learning) is a strategy adopted in PP-OCRv2 that is very effective for improving the accuracy of text recognition. In PP-OCRv3, for the two different structures SVTR_LCNet and Attention, the feature map of PP-LCNet, the output of the SVTR module and the output of the Attention module are supervised and trained simultaneously. With this strategy, the accuracy of the recognition model is further improved to 78.4% (+1.5%).
+
+**(6) UIM: Unlabeled Images Mining**
+
+UIM (Unlabeled Images Mining) is a very simple unlabeled data mining scheme. The core idea is to use a high-precision text recognition model to predict unlabeled data and obtain pseudo-labels, and then select samples with high prediction confidence as training data for the small model. With this strategy, the accuracy of the recognition model is further improved to 79.4% (+1%). The mining process is illustrated below:
+
+[Figure: UIM unlabeled images mining process]
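+A minimal sketch of the UIM idea (the threshold, names and data layout are illustrative assumptions): run a large, high-precision recognizer over unlabeled images and keep only its confident predictions as pseudo-labeled training data for the small model.
+
+```python
+CONF_THRESHOLD = 0.95  # keep only very confident pseudo-labels (assumed value)
+
+def mine_unlabeled(teacher_predict, unlabeled_paths):
+    """teacher_predict(path) -> (text, confidence). Returns pseudo-labeled pairs."""
+    mined = []
+    for path in unlabeled_paths:
+        text, conf = teacher_predict(path)
+        if conf >= CONF_THRESHOLD:  # low-confidence samples are discarded
+            mined.append((path, text))
+    return mined  # extra training data for the small recognition model
+
+# Toy stand-in for the high-precision teacher model:
+fake_teacher = lambda p: ("text", 0.99) if "clean" in p else ("??", 0.40)
+print(mine_unlabeled(fake_teacher, ["clean_1.jpg", "noisy_2.jpg"]))  # [('clean_1.jpg', 'text')]
+```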