diff --git a/README.md b/README.md
index 6df373911..1c9613f91 100644
--- a/README.md
+++ b/README.md
@@ -260,7 +260,6 @@ You can do MindSpore Lite inference in MindOCR using **MindOCR models** or **Thi
 - [x] [MASTER](configs/rec/master/README.md) (PR'2019)
 - [x] [VISIONLAN](configs/rec/visionlan/README.md) (ICCV'2021)
 - [x] [RobustScanner](configs/rec/robustscanner/README.md) (ECCV'2020)
-- [x] [ABINet](configs/rec/abinet/README.md) (CVPR'2021)
@@ -382,7 +381,7 @@ Frequently asked questions about configuring environment and mindocr, please ref
    - [PP-OCRv3 DBNet](configs/det/dbnet/db_mobilenetv3_ppocrv3.yaml) for text detection and [PP-OCRv3 SVTR](configs/rec/svtr/svtr_ppocrv3_ch.yaml) for recognition, supporting online inference and finetuning
  2. Add more benchmark datasets and their results
    - [XFUND](configs/kie/vi_layoutxlm/README_CN.md)
-3. Multiple specifications support for Ascend 910: DBNet ResNet-50, DBNet++ ResNet-50, CRNN VGG7, SVTR-Tiny, FCENet, ABINet
+3. Multiple specifications support for Ascend 910: DBNet ResNet-50, DBNet++ ResNet-50, CRNN VGG7, SVTR-Tiny, FCENet
- 2023/11/28
 1. Add offline inference support for PP-OCRv4
    - [PP-OCRv4 DBNet](deploy/py_infer/src/configs/det/ppocr/ch_PP-OCRv4_det_cml.yaml) for text detection and [PP-OCRv4 CRNN](deploy/py_infer/src/configs/rec/ppocr/ch_PP-OCRv4_rec_distillation.yaml) for text recognition, supporting offline inference
diff --git a/README_CN.md b/README_CN.md
index fd6158d4b..ff8cfa2a3 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -263,7 +263,6 @@ python tools/infer/text/predict_system.py --image_dir {path_to_img or dir_to_img
 - [x] [MASTER](configs/rec/master/README_CN.md) (PR'2019)
 - [x] [VISIONLAN](configs/rec/visionlan/README_CN.md) (ICCV'2021)
 - [x] [RobustScanner](configs/rec/robustscanner/README_CN.md) (ECCV'2020)
-- [x] [ABINet](configs/rec/abinet/README_CN.md) (CVPR'2021)
@@ -385,7 +384,7 @@ MindOCR提供了[数据格式转换工具](https://github.com/mindspore-lab/mind
    - 文本检测[PP-OCRv3 DBNet](configs/det/dbnet/db_mobilenetv3_ppocrv3.yaml)和文本识别[PP-OCRv3 SVTR](configs/rec/svtr/svtr_ppocrv3_ch.yaml),支持在线推理和微调训练
  2. 添加更多基准数据集及其结果
    - [XFUND](configs/kie/vi_layoutxlm/README_CN.md)
-3. 昇腾910硬件多规格支持:DBNet ResNet-50、DBNet++ ResNet-50、CRNN VGG7、SVTR-Tiny、FCENet、ABINet
+3. 昇腾910硬件多规格支持:DBNet ResNet-50、DBNet++ ResNet-50、CRNN VGG7、SVTR-Tiny、FCENet
- 2023/11/28
 1. 增加支持PP-OCRv4模型离线推理
    - 文本检测 [PP-OCRv4 DBNet](deploy/py_infer/src/configs/det/ppocr/ch_PP-OCRv4_det_cml.yaml)和文本识别 [PP-OCRv4 CRNN](deploy/py_infer/src/configs/rec/ppocr/ch_PP-OCRv4_rec_distillation.yaml),支持离线推理
diff --git a/configs/rec/abinet/README.md b/configs/rec/abinet/README.md
deleted file mode 100644
index 75a5b87f2..000000000
--- a/configs/rec/abinet/README.md
+++ /dev/null
@@ -1,299 +0,0 @@
-
-# ABINet
-
-> [Read Like Humans: Autonomous, Bidirectional and Iterative Language
-Modeling for Scene Text Recognition](https://arxiv.org/pdf/2103.06495)
-
-## Abstract
-
-Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. First, *autonomous* means blocking gradient flow between the vision and language models to enforce explicit language modeling. Second, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Third, we propose an iterative correction scheme for the language model, which effectively alleviates the impact of noisy input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Moreover, ABINet trained with ensemble self-training shows promising progress toward human-level recognition. [1]
-
-
-*Figure 1. Architecture of ABINet [1]*
-
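-The iterative correction loop described above can be summarized with a schematic sketch (plain NumPy; the dict-returning `vision`, `language` and `alignment` callables are stand-ins for the real sub-networks, and the actual computation in this repository lives in `ABINetHead.construct`, see the deleted `rec_abinet_head.py` further below):
-
-```python
-import numpy as np
-
-def softmax(x, axis=-1):
-    e = np.exp(x - x.max(axis=axis, keepdims=True))
-    return e / e.sum(axis=axis, keepdims=True)
-
-def iterative_decode(vision, language, alignment, image, iter_size=3):
-    """Sketch of ABINet decoding: the vision branch predicts autonomously,
-    then the language model and alignment module iteratively refine it."""
-    v_res = vision(image)                # autonomous vision prediction
-    a_res = v_res
-    for _ in range(iter_size):           # iterative correction
-        tokens = softmax(a_res["logits"])               # (noisy) prediction as LM input
-        l_res = language(tokens)                        # bidirectional cloze network
-        a_res = alignment(l_res["feature"], v_res["feature"])  # fuse both branches
-    return a_res["logits"]
-```
-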
-
-## Requirements
-
-| mindspore | ascend driver | firmware | cann toolkit/kernel |
-|:---------:|:-------------:|:-----------:|:-------------------:|
-| 2.5.0 | 24.1.0 | 7.5.0.3.220 | 8.0.0.beta1 |
-
-## Quick Start
-
-### Installation
-
-Please refer to the [installation instruction](https://github.com/mindspore-lab/mindocr#installation) in MindOCR.
-
-### Dataset preparation
-
-#### Dataset Download
-
-Please download the LMDB dataset for training and evaluation:
-  - `training` contains two datasets: [MJSynth (MJ)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ) and [SynthText (ST)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
-  - `evaluation` contains several benchmarking datasets: [IIIT](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html), [SVT](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset), [IC13](http://rrc.cvc.uab.es/?ch=2), [IC15](http://rrc.cvc.uab.es/?ch=4), [SVTP](http://openaccess.thecvf.com/content_iccv_2013/papers/Phan_Recognizing_Text_with_2013_ICCV_paper.pdf), and [CUTE](http://cs-chan.com/downloads_CUTE80_dataset.html).
-
-The data structure should be manually adjusted as follows:
-
-``` text
-data_lmdb_release/
-├── evaluation
-│   ├── CUTE80
-│   │   ├── data.mdb
-│   │   └── lock.mdb
-│   ├── IC13_857
-│   │   ├── data.mdb
-│   │   └── lock.mdb
-│   ├── IC15_1811
-│   │   ├── data.mdb
-│   │   └── lock.mdb
-│   ├── ...
-├── train
-│   ├── MJ
-│   │   ├── MJ_test
-│   │   │   ├── data.mdb
-│   │   │   └── lock.mdb
-│   │   ├── MJ_train
-│   │   │   ├── data.mdb
-│   │   │   └── lock.mdb
-│   │   └── MJ_valid
-│   │       ├── data.mdb
-│   │       └── lock.mdb
-│   └── ST
-│       ├── data.mdb
-│       └── lock.mdb
-```
-
-#### Dataset Usage
-
-Here we use the datasets under the `train/` folder for **training**. After training, we use the datasets under `evaluation/` to evaluate model accuracy.
-
-**Train:** (total 15,895,356 samples)
-- [MJSynth (MJ)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
-  - Train: 21.2 GB, 7224586 samples
-  - Valid: 2.36 GB, 802731 samples
-  - Test: 2.61 GB, 891924 samples
-- [SynthText (ST)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
-  - Total: 24.6 GB, 6976115 samples
-
-**Evaluation:** (total 12,067 samples)
-- [CUTE80](http://cs-chan.com/downloads_CUTE80_dataset.html): 8.8 MB, 288 samples
-- [IC03_860](http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2003_Robust_Reading_Competitions): 36 MB, 860 samples
-- [IC03_867](http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2003_Robust_Reading_Competitions): 4.9 MB, 867 samples
-- [IC13_857](http://rrc.cvc.uab.es/?ch=2): 72 MB, 857 samples
-- [IC13_1015](http://rrc.cvc.uab.es/?ch=2): 77 MB, 1015 samples
-- [IC15_1811](http://rrc.cvc.uab.es/?ch=4): 21 MB, 1811 samples
-- [IC15_2077](http://rrc.cvc.uab.es/?ch=4): 25 MB, 2077 samples
-- [IIIT5k_3000](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html): 50 MB, 3000 samples
-- [SVT](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset): 2.4 MB, 647 samples
-- [SVTP](http://openaccess.thecvf.com/content_iccv_2013/papers/Phan_Recognizing_Text_with_2013_ICCV_paper.pdf): 1.8 MB, 645 samples
-
-### Update yaml config file
-
-#### Data configuration for model training
-
-To reproduce the model training, it is recommended to modify the configuration yaml as follows:
-
-```yaml
-...
-train:
-  ...
-  dataset:
-    type: LMDBDataset
-    dataset_root: dir/to/data_lmdb_release/    # Root dir of training dataset
-    data_dir: train/                           # Dir of training dataset, concatenated with `dataset_root` to be the complete dir of training dataset
-    # label_file:                              # Path of training label file, concatenated with `dataset_root` to be the complete path of training label file, not required when using LMDBDataset
-...
-eval:
-  dataset:
-    type: LMDBDataset
-    dataset_root: dir/to/data_lmdb_release/    # Root dir of validation dataset
-    data_dir: evaluation/                      # Dir of validation dataset, concatenated with `dataset_root` to be the complete dir of validation dataset
-    # label_file:                              # Path of validation label file, concatenated with `dataset_root` to be the complete path of validation label file, not required when using LMDBDataset
-  ...
-```
-
-#### Data configuration for model evaluation
-
-We use the datasets under `evaluation/` as the benchmark datasets. On **each individual dataset** (e.g. CUTE80, IC13_857, etc.), we perform a full evaluation by setting `eval.dataset.data_dir` to that dataset's directory. This way, we get one accuracy per dataset, and the reported accuracy is the average of these values.
-
-To reproduce the reported evaluation results, you can:
-- Option 1: Repeat the evaluation step for each individual dataset: CUTE80, IC13_857, IC15_1811, IIIT5k_3000, SVT, SVTP. Then take the average score.
-
-- Option 2: Put all benchmark dataset folders under the same directory, e.g. `evaluation/`, and use the script `tools/benchmarking/multi_dataset_eval.py`.
-
-1. Evaluate on one specific dataset
-
-For example, you can evaluate the model on dataset `CUTE80` by modifying the config yaml as follows:
-
-```yaml
-...
-train:
-  # NO NEED TO CHANGE ANYTHING IN TRAIN SINCE IT IS NOT USED
-...
-eval:
-  dataset:
-    type: LMDBDataset
-    dataset_root: dir/to/data_lmdb_release/    # Root dir of evaluation dataset
-    data_dir: evaluation/CUTE80/               # Dir of evaluation dataset, concatenated with `dataset_root` to be the complete dir of evaluation dataset
-    # label_file:                              # Path of evaluation label file, concatenated with `dataset_root` to be the complete path of evaluation label file, not required when using LMDBDataset
-  ...
-```
-
-By running `tools/eval.py` as noted in section [Model Evaluation](#model-evaluation) with the above config yaml, you can get the accuracy on dataset CUTE80.
-
-2. Evaluate on multiple datasets under the same folder
-
-Assume you have put all benchmark datasets under `evaluation/` as shown below:
-
-``` text
-data_lmdb_release/
-├── evaluation
-│   ├── CUTE80
-│   │   ├── data.mdb
-│   │   └── lock.mdb
-│   ├── IC13_857
-│   │   ├── data.mdb
-│   │   └── lock.mdb
-│   ├── IC15_1811
-│   │   ├── data.mdb
-│   │   └── lock.mdb
-│   ├── ...
-```
-
-Then you can evaluate each dataset by modifying the config yaml as above and executing the script `tools/benchmarking/multi_dataset_eval.py`.
-
-#### Check YAML Config Files
-Apart from the dataset setting, please also check the following important args: `system.distribute`, `system.val_while_train`, `common.batch_size`, `train.ckpt_save_dir`, `train.dataset.dataset_root`, `train.dataset.data_dir`, `train.dataset.label_file`,
-`eval.ckpt_load_path`, `eval.dataset.dataset_root`, `eval.dataset.data_dir`, `eval.dataset.label_file`, `eval.loader.batch_size`.
Explanations of these important args:
-
-```yaml
-system:
-  distribute: True                            # `True` for distributed training, `False` for standalone training
-  amp_level: 'O0'
-  seed: 42
-  val_while_train: True                       # Validate while training
-  drop_overflow_update: False
-common:
-  ...
-  batch_size: &batch_size 96                  # Batch size for training
-...
-train:
-  ckpt_save_dir: './tmp_rec'                  # The training result (including checkpoints, per-epoch performance and curves) saving directory
-  dataset_sink_mode: False
-  dataset:
-    type: LMDBDataset
-    dataset_root: dir/to/data_lmdb_release/   # Root dir of training dataset
-    data_dir: train/                          # Dir of training dataset, concatenated with `dataset_root` to be the complete dir of training dataset
-    # label_file:                             # Path of training label file, concatenated with `dataset_root` to be the complete path of training label file, not required when using LMDBDataset
-...
-eval:
-  ckpt_load_path: './tmp_rec/best.ckpt'       # checkpoint file path
-  dataset_sink_mode: False
-  dataset:
-    type: LMDBDataset
-    dataset_root: dir/to/data_lmdb_release/   # Root dir of validation/evaluation dataset
-    data_dir: evaluation/                     # Dir of validation/evaluation dataset, concatenated with `dataset_root` to be the complete dir of validation/evaluation dataset
-    # label_file:                             # Path of validation/evaluation label file, concatenated with `dataset_root` to be the complete path of validation/evaluation label file, not required when using LMDBDataset
-  ...
-  loader:
-    shuffle: False
-    batch_size: 96                            # Batch size for validation/evaluation
-...
-```
-
-**Notes:**
-- As the global batch size (batch_size x num_devices) is important for reproducing the results, please adjust `batch_size` to keep the global batch size unchanged for a different number of NPUs, or adjust the learning rate linearly to the new global batch size.
-- Dataset: The MJSynth and SynthText datasets come from [ABINet_repo](https://github.com/FangShancheng/ABINet).
-
-### Model Training
-
-* Distributed Training
-
-It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please set the configuration parameter `distribute` to True and run:
-
-```shell
-# distributed training on multiple Ascend devices
-# worker_num is the total number of Worker processes participating in the distributed task.
-# local_worker_num is the number of Worker processes pulled up on the current node.
-# The number of processes is equal to the number of NPUs used for training. For single-machine multi-card training, worker_num and local_worker_num must be the same.
-msrun --worker_num=8 --local_worker_num=8 python tools/train.py --config configs/rec/abinet/abinet_resnet45_en.yaml

-# Based on verification, core binding usually speeds up training. Please configure the parameters and run:
-msrun --bind_core=True --worker_num=8 --local_worker_num=8 python tools/train.py --config configs/rec/abinet/abinet_resnet45_en.yaml
-```
-**Note:** For more information about msrun configuration, please refer to [here](https://www.mindspore.cn/docs/en/master/model_train/parallel/msrun_launcher.html).
-
-A pretrained model needs to be loaded when training ABINet; the pretrained weights come
-from [abinet_pretrain_en.ckpt](https://download.mindspore.cn/toolkits/mindocr/abinet/abinet_pretrain_en-821ca20b.ckpt). Add the path of the pretrained weights to the `model.pretrained` field in `configs/rec/abinet/abinet_resnet45_en.yaml`.
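-
-To sanity-check the downloaded weights before wiring the path into the yaml, a minimal sketch using the standard MindSpore checkpoint APIs (the checkpoint file name is taken from the link above; `net` is any instantiated `mindspore.nn.Cell`):
-
-```python
-import mindspore as ms
-
-# load_checkpoint returns a {parameter name: Parameter} dict,
-# which is what the training pipeline loads into the network.
-params = ms.load_checkpoint("abinet_pretrain_en-821ca20b.ckpt")
-print(len(params), "parameters")
-print(sorted(params)[:5])  # inspect the first few parameter names
-
-# To load into an instantiated network `net`:
-# ms.load_param_into_net(net, params)
-```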
-
-* Standalone Training
-
-If you want to train or finetune the model on a smaller dataset without distributed training, please set the configuration parameter `distribute` to False and run:
-
-```shell
-# standalone training on a CPU/Ascend device
-python tools/train.py --config configs/rec/abinet/abinet_resnet45_en.yaml
-```
-
-The training result (including checkpoints, per-epoch performance and curves) will be saved in the directory specified by the arg `ckpt_save_dir`. The default directory is `./tmp_rec`.
-
-### Model Evaluation
-
-To evaluate the accuracy of the trained model, you can use `eval.py`. Please set the checkpoint path in the arg `ckpt_load_path` of the `eval` section in the yaml config file, set `distribute` to False, and then run:
-
-```
-python tools/eval.py --config configs/rec/abinet/abinet_resnet45_en.yaml
-```
-
-## Results
-
-### Accuracy
-
-According to our experiments, the evaluation results on the public benchmark datasets (IC13, IC15, IIIT, SVT, SVTP, CUTE) are as follows:
-
-Performance tested on Ascend 910* in graph mode
-
- -| **model name** | **backbone** | **train dataset** | **params(M)** | **cards** | **batch size** | **jit level** | **graph compile** | **ms/step** | **img/s** | **accuracy** | **recipe** | **weight** | -|:--------------:| :----------: |:-----------------:| :-----------: |:---------:| :------------: |:-------------:|:-----------------:|:-----------:|:---------:| :----------: |:--------------------------------:|:------------------------------------------------------------------------------------------------:| -| ABINet | Resnet45 | MJ+ST | 36.93 | 8 | 96 | O2 | 680.51 s | 115.56 | 6646.07 | 91.35% | [yaml](abinet_resnet45_en.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/abinet/abinet_resnet45_en-7efa1184.ckpt) | -
- - Detailed accuracy results for each benchmark dataset -
- -| **model name** | **backbone** | **cards** | **IC03_860** | **IC03_867** | **IC13_857** | **IC13_1015** | **IC15_1811** | **IC15_2077** | **IIIT5k_3000** | **SVT** | **SVTP** | **CUTE80** | **Average** | -|:--------------:|:------------:|:---------:|:------------:|:------------:|:------------:|:-------------:|:-------------:|:-------------:|:---------------:|:-------:|:--------:|:----------:|:-----------:| -| ABINet | Resnet45 | 1 | 96.22% | 95.83% | 96.48% | 94.90% | 84.38% | 80.56% | 95.83% | 92.36% | 87.33% | 89.58% | 91.35% | -
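-
-As described in the evaluation section above, the reported **Average** is the plain arithmetic mean of the per-dataset accuracies, which is easy to verify (values copied from the table above):
-
-```python
-# Reproduce the "Average" column from the per-dataset accuracies.
-accs = {
-    "IC03_860": 96.22, "IC03_867": 95.83, "IC13_857": 96.48,
-    "IC13_1015": 94.90, "IC15_1811": 84.38, "IC15_2077": 80.56,
-    "IIIT5k_3000": 95.83, "SVT": 92.36, "SVTP": 87.33, "CUTE80": 89.58,
-}
-print(f"{sum(accs.values()) / len(accs):.2f}%")  # -> 91.35%
-```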
-
-**Notes:**
-- The input shape of the exported MindIR of ABINet is (1, 3, 32, 128).
-
-## References
-
-[1] Fang S, Xie H, Wang Y, et al. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7098-7107.
diff --git a/configs/rec/abinet/README_CN.md b/configs/rec/abinet/README_CN.md
deleted file mode 100644
index 3dd8ebb3b..000000000
--- a/configs/rec/abinet/README_CN.md
+++ /dev/null
@@ -1,322 +0,0 @@
-
-# ABINet
-
-> [Read Like Humans: Autonomous, Bidirectional and Iterative Language
-Modeling for Scene Text Recognition](https://arxiv.org/pdf/2103.06495)
-
-## 模型描述
-
-语义知识对场景文本识别有很大的帮助。然而,如何在端到端深度网络中有效地建模语义规则仍然是一个研究挑战。在本文中,我们认为语言模型的能力有限来自于:1)隐式语言建模;2)单向特征表示;3)带噪声输入的语言模型。相应地,我们提出了一种自主、双向、迭代的场景文本识别网络ABINet。首先,自主指阻塞视觉和语言模型之间的梯度流,以强制显式语言建模。其次,提出了一种基于双向特征表示的新型双向完形填空式网络作为语言模型。第三,提出了一种语言模型迭代修正的执行方式,可以有效缓解噪声输入的影响。此外,我们提出了一种基于迭代预测集合的自训练方法,可以有效地从未标记的图像中学习。大量的实验表明,ABINet在低质量图像上具有优势,并在几个主流基准上取得了最先进的结果。此外,采用集成自训练方式训练的ABINet在实现人类水平的识别方面也有很大的进步。[1]
-
-
-*图1. ABINet结构图 [1]*
-
- -## 配套版本 - -| mindspore | ascend driver | firmware | cann toolkit/kernel | -|:----------:|:--------------:|:-------------:|:-------------------:| -| 2.5.0 | 24.1.0 | 7.5.0.3.220 | 8.0.0.beta1 | - - -## 快速开始 - -### 安装 -环境安装教程请参考MindOCR的 [installation instruction](https://github.com/mindspore-lab/mindocr#installation). - -### 数据准备 - -#### 数据集下载 - -请下载LMDB数据集用于训练和评估 - - `training` 包含两个数据集: [MJSynth (MJ)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ) 和 [SynthText (ST)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ) - - `evaluation` 包含几个基准数据集,它们是[IIIT](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html), [SVT](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset), [IC13](http://rrc.cvc.uab.es/?ch=2), [IC15](http://rrc.cvc.uab.es/?ch=4), [SVTP](http://openaccess.thecvf.com/content_iccv_2013/papers/Phan_Recognizing_Text_with_2013_ICCV_paper.pdf), 和 [CUTE](http://cs-chan.com/downloads_CUTE80_dataset.html). - - -数据结构应该被手动调整为 - -``` text -data_lmdb_release/ -├── evaluation -│ ├── CUTE80 -│ │ ├── data.mdb -│ │ └── lock.mdb -│ ├── IC13_857 -│ │ ├── data.mdb -│ │ └── lock.mdb -│ ├── IC15_1811 -│ │ ├── data.mdb -│ │ └── lock.mdb -│ ├── ... -├── train -│ ├── MJ -│ │ ├── MJ_test -│ │ │ ├── data.mdb -│ │ │ └── lock.mdb -│ │ ├── MJ_train -│ │ │ ├── data.mdb -│ │ │ └── lock.mdb -│ │ └── MJ_valid -│ │ ├── data.mdb -│ │ └── lock.mdb -│ └── ST -│ ├── data.mdb -│ └── lock.mdb -``` - -#### 数据集使用 - -在这里,我们使用 `train/` 文件夹下的数据集进行训练,我们使用 `evaluation/` 下的数据集来评估模型的准确性。 - -**Train:** (total 15,895,356 samples) -- [MJSynth (MJ)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ) - - Train: 21.2 GB, 7224586 samples - - Valid: 2.36 GB, 802731 samples - - Test: 2.61 GB, 891924 samples -- [SynthText (ST)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ) - - Total: 24.6 GB, 6976115 samples - - -**Evaluation:** (total 12,067 samples) -- [CUTE80](http://cs-chan.com/downloads_CUTE80_dataset.html): 8.8 MB, 288 samples -- [IC03_860](http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2003_Robust_Reading_Competitions): 36 MB, 860 samples -- [IC03_867](http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2003_Robust_Reading_Competitions): 4.9 MB, 867 samples -- [IC13_857](http://rrc.cvc.uab.es/?ch=2): 72 MB, 857 samples -- [IC13_1015](http://rrc.cvc.uab.es/?ch=2): 77 MB, 1015 samples -- [IC15_1811](http://rrc.cvc.uab.es/?ch=4): 21 MB, 1811 samples -- [IC15_2077](http://rrc.cvc.uab.es/?ch=4): 25 MB, 2077 samples -- [IIIT5k_3000](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html): 50 MB, 3000 samples -- [SVT](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset): 2.4 MB, 647 samples -- [SVTP](http://openaccess.thecvf.com/content_iccv_2013/papers/Phan_Recognizing_Text_with_2013_ICCV_paper.pdf): 1.8 MB, 645 samples - -### 配置说明 - -#### 模型训练的数据配置 - -如欲重现模型的训练,建议修改配置yaml如下: - -```yaml -... -train: - ckpt_save_dir: './tmp_rec' # 训练结果(包括checkpoint、每个epoch的性能和曲线图)保存目录 - dataset_sink_mode: False - dataset: - type: LMDBDataset - dataset_root: dir/to/data_lmdb_release/ # 训练数据集根目录 - data_dir: train/ # 训练数据集目录,将与`dataset_root`拼接形成完整训练数据集目录 - # label_files: # 训练数据集的标签文件路径,将与`dataset_root`拼接形成完整的训练数据的标签文件路径。当数据集为LMDB格式时无需配置 -... -eval: - ckpt_load_path: './tmp_rec/best.ckpt' # checkpoint文件路径 - dataset_sink_mode: False - dataset: - type: LMDBDataset - dataset_root: dir/to/data_lmdb_release/ # 验证或评估数据集根目录 - data_dir: evaluation/ # 验证或评估数据集目录,将与`dataset_root`拼接形成完整验证或评估数据集目录 - # label_file: # 验证或评估数据集的标签文件路径,将与`dataset_root`拼接形成完整的验证或评估数据的标签文件路径。当数据集为LMDB格式时无需配置 - ... 
-``` - -#### 模型评估的数据配置 - -我们使用 `evaluation/` 下的数据集作为基准数据集。在**每个单独的数据集**(例如 CUTE80、IC03_860 等)上,我们通过将数据集的目录设置为评估数据集来执行完整评估。这样,我们就得到了每个数据集对应精度的列表,然后报告的精度是这些值的平均值。 - -如要重现报告的评估结果,您可以: -- 方法 1:对所有单个数据集重复评估步骤:CUTE80、IC13_857、IC15_1811、IIIT5k_3000、SVT、SVTP。然后取平均分。 - -- 方法 2:将所有基准数据集文件夹放在同一目录下,例如`evaluation/`。并使用脚本`tools/benchmarking/multi_dataset_eval.py`。 - -1.评估一个特定的数据集 - -例如,您可以通过修改配置 yaml 来评估数据集“CUTE80”上的模型,如下所示: - -```yaml -... -train: - # 不需要修改 -... -eval: - ckpt_load_path: './tmp_rec/best.ckpt' # checkpoint文件路径 - dataset_sink_mode: False - dataset: - type: LMDBDataset - dataset_root: dir/to/data_lmdb_release/ # 验证或评估数据集根目录 - data_dir: evaluation/CUTE80/ # 验证或评估数据集目录,将与`dataset_root`拼接形成完整验证或评估数据集目录 - # label_file: # 验证或评估数据集的标签文件路径,将与`dataset_root`拼接形成完整的验证或评估数据的标签文件路径。当数据集为LMDB格式时无需配置 - ... -``` -通过使用上述配置 yaml 运行 [模型评估](#模型评估) 部分中所述的`tools/eval.py`,您可以获得数据集 CUTE80 的准确度性能。 - -2.对同一文件夹下的多个数据集进行评估 - -假设您已将所有 benckmark 数据集置于 evaluation/ 下,如下所示: - -``` text -data_lmdb_release/ -├── evaluation -│ ├── CUTE80 -│ │ ├── data.mdb -│ │ └── lock.mdb -│ ├── IC13_857 -│ │ ├── data.mdb -│ │ └── lock.mdb -│ ├── IC15_1811 -│ │ ├── data.mdb -│ │ └── lock.mdb -│ ├── ... -``` - -然后你可以通过如下修改配置yaml来评估每个数据集,并执行脚本`tools/benchmarking/multi_dataset_eval.py`。 - -```yaml -... -train: - # 不需要修改 -... -eval: - ckpt_load_path: './tmp_rec/best.ckpt' # checkpoint文件路径 - dataset_sink_mode: False - dataset: - type: LMDBDataset - dataset_root: dir/to/data_lmdb_release/ # 验证或评估数据集根目录 - data_dir: evaluation/ # 验证或评估数据集目录,将与`dataset_root`拼接形成完整验证或评估数据集目录 - # label_file: # 验证或评估数据集的标签文件路径,将与`dataset_root`拼接形成完整的验证或评估数据的标签文件路径。当数据集为LMDB格式时无需配置 - ... -``` - -#### 检查配置文件 - -除了数据集的设置,请同时重点关注以下变量的配置:`system.distribute`, `system.val_while_train`, `common.batch_size`, `train.ckpt_save_dir`, `train.dataset.dataset_root`, `train.dataset.data_dir`, `train.dataset.label_file`, -`eval.ckpt_load_path`, `eval.dataset.dataset_root`, `eval.dataset.data_dir`, `eval.dataset.label_file`, `eval.loader.batch_size`。说明如下: - -```yaml -system: - distribute: True # 分布式训练为True,单卡训练为False - amp_level: 'O3' - seed: 42 - val_while_train: True # 边训练边验证 - drop_overflow_update: False -common: - ... - batch_size: &batch_size 96 # 训练批大小 -... -train: - ckpt_save_dir: './tmp_rec' # 训练结果(包括checkpoint、每个epoch的性能和曲线图)保存目录 - dataset_sink_mode: False - dataset: - type: LMDBDataset - dataset_root: dir/to/data_lmdb_release/ # 训练数据集根目录 - data_dir: train/ # 训练数据集目录,将与`dataset_root`拼接形成完整训练数据集目录 - # label_files: # 训练数据集的标签文件路径,将与`dataset_root`拼接形成完整的训练数据的标签文件路径。当数据集为LMDB格式时无需配置 -... -eval: - ckpt_load_path: './tmp_rec/best.ckpt' # checkpoint文件路径 - dataset_sink_mode: False - dataset: - type: LMDBDataset - dataset_root: dir/to/data_lmdb_release/ # 验证或评估数据集根目录 - data_dir: evaluation/ # 验证或评估数据集目录,将与`dataset_root`拼接形成完整验证或评估数据集目录 - # label_file: # 验证或评估数据集的标签文件路径,将与`dataset_root`拼接形成完整的验证或评估数据的标签文件路径。当数据集为LMDB格式时无需配置 - ... - loader: - shuffle: False - batch_size: 96 # 验证或评估批大小 -... -``` - -**注意:** -- 由于全局批大小 (batch_size x num_devices) 是对结果复现很重要,因此当NPU卡数发生变化时,调整`batch_size`以保持全局批大小不变,或将学习率线性调整为新的全局批大小。 -- 数据集:MJSynth和SynthText数据集来自作者公布的代码仓[ABINet_repo](https://github.com/FangShancheng/ABINet). 
- - -### 模型训练 - - -* 分布式训练 - -使用预定义的训练配置可以轻松重现报告的结果。对于在多个昇腾910设备上的分布式训练,请将配置参数`distribute`修改为True,并运行: - -```shell -# 在多个 Ascend 设备上进行分布式训练 -# worker_num代表分布式总进程数量。 -# local_worker_num代表当前节点进程数量。 -# 进程数量即为训练使用的NPU的数量,单机多卡情况下worker_num和local_worker_num需保持一致。 -msrun --worker_num=8 --local_worker_num=8 python tools/train.py --config configs/rec/abinet/abinet_resnet45_en.yaml - -# 经验证,绑核在大部分情况下有性能加速,请配置参数并运行 -msrun --bind_core=True --worker_num=8 --local_worker_num=8 python tools/train.py --config configs/rec/abinet/abinet_resnet45_en.yaml -``` -**注意:** 有关 msrun 配置的更多信息,请参考[此处](https://www.mindspore.cn/docs/zh-CN/master/model_train/parallel/msrun_launcher.html). - - -ABINet模型训练时需要加载预训练模型,预训练模型的权重来自[abinet_pretrain_en.ckpt](https://download.mindspore.cn/toolkits/mindocr/abinet/abinet_pretrain_en-821ca20b.ckpt),需要在“configs/rec/abinet/abinet_resnet45_en.yaml”中model的pretrained添加预训练权重的路径。 - - -* 单卡训练 - -如果要在没有分布式训练的情况下在较小的数据集上训练或微调模型,请将配置参数`distribute`修改为False 并运行: - -```shell -# CPU/Ascend 设备上的单卡训练 -python tools/train.py --config configs/rec/abinet/abinet_resnet45_en.yaml -``` - - - -训练结果(包括checkpoint、每个epoch的性能和曲线图)将被保存在yaml配置文件的`ckpt_save_dir`参数配置的目录下,默认为`./tmp_rec`。 - -### 模型评估 - -若要评估已训练模型的准确性,可以使用`eval.py`。请在yaml配置文件的`eval`部分将参数`ckpt_load_path`设置为模型checkpoint的文件路径,设置`distribute`为False,然后运行: - -``` -python tools/eval.py --config configs/rec/abinet/abinet_resnet45_en.yaml -``` - - - -## 评估结果 - - -### 精确度 -根据我们的实验,在公共基准数据集(IC13、IC15、IIIT、SVT、SVTP、CUTE)上的评估结果如下: - -
- -| **model name** | **backbone** | **train dataset** | **params(M)** | **cards** | **batch size** | **jit level** | **graph compile** | **ms/step** | **img/s** | **accuracy** | **recipe** | **weight** | -|:--------------:| :----------: |:-----------------:| :-----------: |:---------:| :------------: |:-------------:|:-----------------:|:-----------:|:---------:| :----------: |:--------------------------------:|:------------------------------------------------------------------------------------------------:| -| ABINet | Resnet45 | MJ+ST | 36.93 | 8 | 96 | O2 | 680.51 s | 115.56 | 6646.07 | 91.35% | [yaml](abinet_resnet45_en.yaml) | [ckpt](https://download.mindspore.cn/toolkits/mindocr/abinet/abinet_resnet45_en-7efa1184.ckpt) | - -
- - -
-
- 每个基准数据集的详细精度结果 - -| **model name** | **backbone** | **cards** | **IC03_860** | **IC03_867** | **IC13_857** | **IC13_1015** | **IC15_1811** | **IC15_2077** | **IIIT5k_3000** | **SVT** | **SVTP** | **CUTE80** | **Average** | -|:--------------:|:------------:|:---------:|:------------:|:------------:|:------------:|:-------------:|:-------------:|:-------------:|:---------------:|:-------:|:--------:|:----------:|:-----------:| -| ABINet | Resnet45 | 1 | 96.22% | 95.83% | 96.48% | 94.90% | 84.38% | 80.56% | 95.83% | 92.36% | 87.33% | 89.58% | 91.35% | - -
-
- - - -## 参考文献 - - -[1] Fang S, Xie H, Wang Y, et al. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7098-7107. diff --git a/configs/rec/abinet/abinet_resnet45_en.yaml b/configs/rec/abinet/abinet_resnet45_en.yaml deleted file mode 100644 index 4a8c3edd3..000000000 --- a/configs/rec/abinet/abinet_resnet45_en.yaml +++ /dev/null @@ -1,119 +0,0 @@ -system: - mode: 0 # 0 for graph mode, 1 for pynative mode in MindSpore - distribute: True - amp_level: 'O0' - seed: 42 - log_interval: 100 - val_while_train: True - refine_batch_size: False - drop_overflow_update: False - -common: - character_dict_path: &character_dict_path - num_classes: &num_classes 37 - max_text_len: &max_text_len 25 - infer_mode: &infer_mode False - use_space_char: &use_space_char False - batch_size: &batch_size 96 - -model: - type: rec - pretrained : https://download.mindspore.cn/toolkits/mindocr/abinet/abinet_pretrain_en-821ca20b.ckpt - transform: null - backbone: - name: abinet_backbone - pretrained: False - batchsize: *batch_size - head: - name: ABINetHead - batchsize: *batch_size - -postprocess: - name: ABINetLabelDecode - -metric: - name: RecMetric - main_indicator: acc - character_dict_path: *character_dict_path - ignore_space: True - print_flag: False - filter_ood: False - -loss: - name: ABINetLoss - - -scheduler: - scheduler: step_decay - decay_rate: 0.1 - decay_epochs: 6 - warmup_epochs: 0 - lr: 0.00001 - num_epochs : 10 - -# scheduler: -# scheduler: warmup_cosine_decay -# lr: 0.00001 -# decay_epochs: 12 -# num_epochs : 15 - -optimizer: - opt: adam - - -train: - clip_grad: True - clip_norm: 20.0 - ckpt_save_dir: './tmp_rec' - dataset_sink_mode: False - pred_cast_fp32: False - max_call_depth: 1300 - dataset: - type: LMDBDataset - dataset_root: path/to/data_lmdb_release/train/ # Optional, if set, dataset_root will be used as a prefix for data_dir - data_dir: ['MJ/MJ_test','MJ/MJ_train','MJ/MJ_valid','ST'] - # label_files: # not required when using LMDBDataset - sample_ratio: 1.0 - shuffle: True - transform_pipeline: - - ABINetTransforms: - - ABINetRecAug: - # # the order of the dataloader list, matching the network input and the input labels for the loss function, and optional data for debug/visaulize - output_columns: ['image','label','length','label_for_mask'] #'img_path'] - filter_max_len: True - max_text_len: 25 - filter_zero_text_image: True - check_rec_image: True - - loader: - shuffle: True # TODO: tbc - batch_size: *batch_size - drop_remainder: True - max_rowsize: 64 - num_workers: 8 - -eval: - ckpt_load_path: './tmp_rec/best.ckpt' - dataset_sink_mode: False - pred_cast_fp32: False - dataset: - type: LMDBDataset - dataset_root: path/to/data_lmdb_release/ # Root dir of validation/evaluation dataset - data_dir: evaluation/ - # label_files: # not required when using LMDBDataset - sample_ratio: 1.0 - shuffle: False - transform_pipeline: - - ABINetEvalTransforms: - - ABINetEval: - # the order of the dataloader list, matching the network input and the input labels for the loss function, and optional data for debug/visaulize - output_columns: ['image','label','length'] # TODO return text string padding w/ fixed length, and a scaler to indicate the length - net_input_column_index: [0] # input indices for network forward func in output_columns - label_column_index: [1, 2] # input indices marked as label - - loader: - shuffle: False # TODO: tbc - batch_size: *batch_size - 
drop_remainder: True - max_rowsize: 64 - num_workers: 8 diff --git a/docs/en/mkdocs/modelzoo_training.md b/docs/en/mkdocs/modelzoo_training.md index b3e6f35b9..4eb145252 100644 --- a/docs/en/mkdocs/modelzoo_training.md +++ b/docs/en/mkdocs/modelzoo_training.md @@ -25,7 +25,7 @@ | visionlan_resnet45| IC03,13,15,IIIT,etc | 192| 4 | 90.61 | 417 | 1840 | O2 | [mindocr_visionlan](https://github.com/mindspore-lab/mindocr/blob/main/configs/rec/visionlan) | | master_resnet31 | IC03,13,15,IIIT,etc | 512 | 4 | 90.37 | 747 | 2741 | O2 | [mindocr_master](https://github.com/mindspore-lab/mindocr/tree/main/configs/rec/master) | | robustscanner_resnet31 | IC13,15,IIIT,SVT,etc | 256 | 4 | 87.86 | 825 | 310 | O0 | [mindocr_robustscanner](https://github.com/mindspore-lab/mindocr/blob/main/configs/rec/robustscanner) | -| abinet_resnet45 | IC03,13,15,IIIT,etc | 768 | 8 | 91.35 | 718 | 628.11 | O0 | [mindocr_abinet](https://github.com/mindspore-lab/mindocr/tree/main/configs/rec/abinet) | + ### Text Direction Classification diff --git a/docs/en/tutorials/frequently_asked_questions.md b/docs/en/tutorials/frequently_asked_questions.md index bfbb8e0e5..37726c493 100644 --- a/docs/en/tutorials/frequently_asked_questions.md +++ b/docs/en/tutorials/frequently_asked_questions.md @@ -9,10 +9,9 @@ - [Problems related to inference](#q8-problems-related-to-inference) - [Training speed of DBNet not as fast as expexted](#q9-training-speed-of-dbnet-not-as-fast-as-expexted) - [Error about libgomp-d22c30c5.so.1.0.0](#q10-error-about-libgomp-d22c30c5so100) - - [Dataset Pipeline Error when training abinet on lmdb dataset](#q11-dataset-pipeline-error-when-training-abinet-on-lmdb-dataset) - - [Runtime Error when training dbnet on synthtext dataset](#q12-runtime-error-when-training-dbnet-on-synthtext-dataset) - - [Failed to install seqeval](#q13-failed-to-install-seqeval) - - [Failed to install lanms](#q14-failed-to-install-lanms) + - [Runtime Error when training dbnet on synthtext dataset](#q11-runtime-error-when-training-dbnet-on-synthtext-dataset) + - [Failed to install seqeval](#q12-failed-to-install-seqeval) + - [Failed to install lanms](#q13-failed-to-install-lanms) ### Q1 Undefined symbol @@ -748,58 +747,8 @@ You can try the following steps to fix it: export LD_PRELOAD=/root/mindocr_env/lib/python3.8/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0:$LD_PRELOAD ``` -### Q11 Dataset Pipeline Error when training abinet on lmdb dataset -The following error may occur when training abinet on lmdb dataset -```bash -mindocr.data.rec_lmdb_dataset WARNING - Error occurred during preprocess. - Exception thrown from dataset pipeline. Refer to 'Dataset Pipeline Error Message'. - ------------------------------------------------------------------- -- Dataset Pipeline Error Message: ------------------------------------------------------------------- -[ERROR] No cast for the specified DataType was found. - ------------------------------------------------------------------- -- C++ Call Stack: (For framework developers) ------------------------------------------------------------------- -mindspore/ccsrc/minddata/dataset/kernels/py_func_op.cc(143). 
-``` -You can try the following steps to fix it: - - - find the folder of mindspore package - - open file: `mindspore/dataset/transforms/transform.py` - - switch to line 93: - ```bash - 93 if key in EXECUTORS_LIST: - 94 # get the executor by process id and thread id - 95 executor = EXECUTORS_LIST[key] - 96 # remove the old transform which in executor and update the new transform - 97 executor.UpdateOperation(self.parse()) - 98 else: - 99 # create a new executor by process id and thread_id - 100 executor = cde.Execute(self.parse()) - 101 # add the executor the global EXECUTORS_LIST - 102 EXECUTORS_LIST[key] = executor - ``` - - - replace line 97 with `executor = cde.Execute(self.parse())`, and get - ```bash - 93 if key in EXECUTORS_LIST: - 94 # get the executor by process id and thread id - 95 executor = EXECUTORS_LIST[key] - 96 # remove the old transform which in executor and update the new transform - 97 executor = cde.Execute(self.parse()) - 98 else: - 99 # create a new executor by process id and thread_id - 100 executor = cde.Execute(self.parse()) - 101 # add the executor the global EXECUTORS_LIST - 102 EXECUTORS_LIST[key] = executor - ``` - - - save the file, and try to train the model. - -### Q12 Runtime Error when training dbnet on synthtext dataset +### Q11 Runtime Error when training dbnet on synthtext dataset Runtime Error occur as following when training dbnet on synthtext dataset: ```bash Traceback (most recent call last): @@ -811,7 +760,7 @@ RuntimeError: Run task for graph:kernel_graph_1 error! The details reger to 'Asc Please update CANN to 7.1 version. -### Q13 Failed to install seqeval +### Q12 Failed to install seqeval The following error occur when run `pip install -r requirements.txt` ```bash Collecting seqeval>=1.2.2 (from -r requirements.txt (line 19)) @@ -889,7 +838,7 @@ Please try the following steps to fix this problem: - Install `seqeval`: `pip3 install seqeval -i https://pypi.tuna.tsinghua.edu.cn/simple` -### Q14 Failed to install lanms +### Q13 Failed to install lanms The following error occur when installing lanms ```bash ImportError: Python version mismatch: module was compiled for version 3.8, while the interpreter is running version 3.7. 
diff --git a/docs/zh/mkdocs/modelzoo_training.md b/docs/zh/mkdocs/modelzoo_training.md index be0660800..9a5960472 100644 --- a/docs/zh/mkdocs/modelzoo_training.md +++ b/docs/zh/mkdocs/modelzoo_training.md @@ -25,7 +25,7 @@ | visionlan_resnet45| IC03,13,15,IIIT,etc | 192| 4 | 90.61 | 417 | 1840 | O2 | [mindocr_visionlan](https://github.com/mindspore-lab/mindocr/blob/main/configs/rec/visionlan) | | master_resnet31 | IC03,13,15,IIIT,etc | 512 | 4 | 90.37 | 747 | 2741 | O2 | [mindocr_master](https://github.com/mindspore-lab/mindocr/tree/main/configs/rec/master) | | robustscanner_resnet31 | IC13,15,IIIT,SVT,etc | 256 | 4 | 87.86 | 825 | 310 | O0 | [mindocr_robustscanner](https://github.com/mindspore-lab/mindocr/blob/main/configs/rec/robustscanner) | -| abinet_resnet45 | IC03,13,15,IIIT,etc | 768 | 8 | 91.35 | 718 | 628.11 | O0 | [mindocr_abinet](https://github.com/mindspore-lab/mindocr/tree/main/configs/rec/abinet) | + ### 文本方向分类 diff --git a/docs/zh/tutorials/frequently_asked_questions.md b/docs/zh/tutorials/frequently_asked_questions.md index e9c987084..b07fcb372 100644 --- a/docs/zh/tutorials/frequently_asked_questions.md +++ b/docs/zh/tutorials/frequently_asked_questions.md @@ -9,10 +9,9 @@ - [推理相关问题](#q8) - [DBNet训练速率不及预期](#q9) - [libgomp-d22c30c5.so.1.0.0 相关错误](#q10) - - [当在lmdb dataset上训练abinet报数据管道错误](#q11) - - [当在synthtext数据集上训练dbnet报运行时错误](#q12) - - [安装seqeval相关错误](#q13) - - [安装lanms相关错误](#q14) + - [当在synthtext数据集上训练dbnet报运行时错误](#q11) + - [安装seqeval相关错误](#q12) + - [安装lanms相关错误](#q13) ### Q1 未定义符号 @@ -756,57 +755,8 @@ ImportError: /root/mindocr_env/lib/python3.8/site-packages/sklearn/__check_build export LD_PRELOAD=/root/mindocr_env/lib/python3.8/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0:$LD_PRELOAD ``` -### Q11 当在lmdb dataset上训练abinet报数据管道错误 -当在lmdb dataset上训练abinet报以下数据管道错误 -```bash -mindocr.data.rec_lmdb_dataset WARNING - Error occurred during preprocess. - Exception thrown from dataset pipeline. Refer to 'Dataset Pipeline Error Message'. - ------------------------------------------------------------------- -- Dataset Pipeline Error Message: ------------------------------------------------------------------- -[ERROR] No cast for the specified DataType was found. - ------------------------------------------------------------------- -- C++ Call Stack: (For framework developers) ------------------------------------------------------------------- -mindspore/ccsrc/minddata/dataset/kernels/py_func_op.cc(143). 
-``` -可以尝试用如下步骤修复: - - - 找到mindspore的包路径 - - 打开文件: `mindspore/dataset/transforms/transform.py` - - 跳转到93行,可以得到如下内容: - ```bash - 93 if key in EXECUTORS_LIST: - 94 # get the executor by process id and thread id - 95 executor = EXECUTORS_LIST[key] - 96 # remove the old transform which in executor and update the new transform - 97 executor.UpdateOperation(self.parse()) - 98 else: - 99 # create a new executor by process id and thread_id - 100 executor = cde.Execute(self.parse()) - 101 # add the executor the global EXECUTORS_LIST - 102 EXECUTORS_LIST[key] = executor - ``` - - - 使用`executor = cde.Execute(self.parse())`替换97行, 得到如下内容: - ```bash - 93 if key in EXECUTORS_LIST: - 94 # get the executor by process id and thread id - 95 executor = EXECUTORS_LIST[key] - 96 # remove the old transform which in executor and update the new transform - 97 executor = cde.Execute(self.parse()) - 98 else: - 99 # create a new executor by process id and thread_id - 100 executor = cde.Execute(self.parse()) - 101 # add the executor the global EXECUTORS_LIST - 102 EXECUTORS_LIST[key] = executor - ``` - - - 保存后再次尝试训练即可 -### Q12 当在synthtext数据集上训练dbnet报运行时错误 +### Q11 当在synthtext数据集上训练dbnet报运行时错误 当在synthtext数据集上训练dbnet报以下数据管道错误 ```bash Traceback (most recent call last): @@ -819,7 +769,7 @@ RuntimeError: Run task for graph:kernel_graph_1 error! The details reger to 'Asc 请尝试将CANN更新到7.1。 -### Q13 安装seqeval相关错误 +### Q12 安装seqeval相关错误 当运行`pip install -r requirements.txt`时,报以下错误 ```bash Collecting seqeval>=1.2.2 (from -r requirements.txt (line 19)) @@ -897,7 +847,7 @@ note: This is an issue with the package mentioned above, not pip. - 安装`seqeval`:`pip3 install seqeval -i https://pypi.tuna.tsinghua.edu.cn/simple` -### Q14 安装lanms相关错误 +### Q13 安装lanms相关错误 当安装lanms时,报 ```bash ImportError: Python version mismatch: module was compiled for version 3.8, while the interpreter is running version 3.7. diff --git a/mindocr/data/transforms/rec_abinet_transforms.py b/mindocr/data/transforms/rec_abinet_transforms.py deleted file mode 100644 index 95a0ae386..000000000 --- a/mindocr/data/transforms/rec_abinet_transforms.py +++ /dev/null @@ -1,242 +0,0 @@ -""" -transform for text recognition tasks. 
-""" -import copy -import logging -import random -import re -import warnings - -import cv2 -import numpy as np -import PIL -import six -from PIL import Image - -import mindspore as ms -import mindspore.dataset as ds - -from ...models.utils.abinet_layers import CharsetMapper, onehot -from .svtr_transform import ( - CVColorJitter, - CVGaussianNoise, - CVMotionBlur, - CVRandomAffine, - CVRandomPerspective, - CVRandomRotation, - CVRescale, -) - -_logger = logging.getLogger(__name__) -__all__ = ["ABINetTransforms", "ABINetRecAug", "ABINetEval", "ABINetEvalTransforms"] - - -class ABINetTransforms(object): - """Convert text label (str) to a sequence of character indices according to the char dictionary - - Args: - - """ - - def __init__( - self, - **kwargs, - ): - # ABINet_Transforms - self.case_sensitive = False - self.charset = CharsetMapper(max_length=26) - - def __call__(self, data: dict): - if "img_path" in data: - with open(data["img_path"], "rb") as f: - img = f.read() - elif "img_lmdb" in data: - img = data["img_lmdb"] - label = data["label"] - label = label.encode("utf-8") - label = str(label, "utf-8") - try: - label = re.sub("[^0-9a-zA-Z]+", "", label) - if len(label) > 25 or len(label) <= 0: - string_false2 = f"len(label) > 25 or len(label) <= 0: {label}, {len(label)}" - _logger.warning(string_false2) - label = label[:25] - buf = six.BytesIO() - buf.write(img) - buf.seek(0) - with warnings.catch_warnings(): - warnings.simplefilter("ignore", UserWarning) - image = PIL.Image.open(buf).convert("RGB") - if not _check_image(image, pixels=6): - string_false1 = f"_check_image false: {label}, {len(label)}" - _logger.warning(string_false1) - except Exception: - string_false = f"Corrupted image is found: {label}, {len(label)}" - _logger.warning(string_false) - - image = np.array(image) - - text = label - - length = len(text) + 1 - length = float(length) - - label = self.charset.get_labels(text, case_sensitive=self.case_sensitive) - label_for_mask = copy.deepcopy(label) - label_for_mask[int(length - 1)] = 1 - label = onehot(label, self.charset.num_classes) - data_dict = {"image": image, "label": label, "length": length, "label_for_mask": label_for_mask} - return data_dict - - -class ABINetRecAug(object): - def __init__(self, width=128, height=32, **kwargs): - self.transforms = ds.transforms.Compose( - [ - CVGeometry( - degrees=45, - translate=(0.0, 0.0), - scale=(0.5, 2.0), - shear=(45, 15), - distortion=0.5, - p=0.5, - ), - CVDeterioration(var=20, degrees=6, factor=4, p=0.25), - CVColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1, p=0.25), - ] - ) - self.toTensor = ds.vision.ToTensor() - self.w = width - self.h = height - self.op = ms.dataset.vision.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], is_hwc=False) - - def __call__(self, data): - img = data["image"] - img = self.transforms(img) - img = cv2.resize(img, (self.w, self.h)) - img = self.toTensor(img) - img = self.op(img) - data["image"] = img - return data - - -def _check_image(x, pixels=6): - if x.size[0] <= pixels or x.size[1] <= pixels: - return False - else: - return True - - -class ABINetEvalTransforms(object): - """Convert text label (str) to a sequence of character indices according to the char dictionary - - Args: - - """ - - def __init__( - self, - **kwargs, - ): - # ABINet_Transforms - self.case_sensitive = False - self.charset = CharsetMapper(max_length=26) - - def __call__(self, data: dict): - if "img_path" in data: - with open(data["img_path"], "rb") as f: - img = f.read() - elif "img_lmdb" in 
data: - img = data["img_lmdb"] - label = data["label"] - label = label.encode("utf-8") - label = str(label, "utf-8") - try: - label = re.sub("[^0-9a-zA-Z]+", "", label) - if len(label) > 25 or len(label) <= 0: - string_false2 = f"en(label) > 25 or len(label) <= 0: {label}, {len(label)}" - _logger.warning(string_false2) - label = label[:25] - buf = six.BytesIO() - buf.write(img) - buf.seek(0) - with warnings.catch_warnings(): - warnings.simplefilter("ignore", UserWarning) - image = PIL.Image.open(buf).convert("RGB") - if not _check_image(image, pixels=6): - string_false1 = f"_check_image false: {label}, {len(label)}" - _logger.warning(string_false1) - except Exception: - string_false = f"Corrupted image is found: {label}, {len(label)}" - _logger.warning(string_false) - - image = np.array(image) - - text = label - length = len(text) + 1 - length = float(length) - data_dict = {"image": image, "label": text, "length": length} - return data_dict - - -class ABINetEval(object): - def __init__(self, **kwargs): - self.toTensor = ds.vision.ToTensor() - self.w = 128 - self.h = 32 - self.op = ms.dataset.vision.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], is_hwc=False) - - def __call__(self, data): - img = data["image"] - img = cv2.resize(img, (self.w, self.h)) - img = self.toTensor(img) - data["image"] = img - length = data["length"] - length = int(length) - data["length"] = length - return data - - -class CVGeometry(object): - def __init__( - self, degrees=15, translate=(0.3, 0.3), scale=(0.5, 2.0), shear=(45, 15), distortion=0.5, p=0.5, **kwargs - ): - self.p = p - type_p = random.random() - if type_p < 0.33: - self.transforms = CVRandomRotation(degrees=degrees) - elif type_p < 0.66: - self.transforms = CVRandomAffine(degrees=degrees, translate=translate, scale=scale, shear=shear) - else: - self.transforms = CVRandomPerspective(distortion=distortion) - - def __call__(self, img): - if random.random() < self.p: - img = np.array(img) - return Image.fromarray(self.transforms(img)) - else: - return img - - -class CVDeterioration(object): - def __init__(self, var, degrees, factor, p=0.5, **kwargs): - self.p = p - transforms = [] - if var is not None: - transforms.append(CVGaussianNoise(variance=var)) - if degrees is not None: - transforms.append(CVMotionBlur(degrees=degrees)) - if factor is not None: - transforms.append(CVRescale(factor=factor)) - - random.shuffle(transforms) - - transforms = ds.transforms.Compose(transforms) - self.transforms = transforms - - def __call__(self, img): - if random.random() < self.p: - img = np.array(img) - return Image.fromarray(self.transforms(img)) - else: - return img diff --git a/mindocr/data/transforms/transforms_factory.py b/mindocr/data/transforms/transforms_factory.py index a1d551612..89586dddc 100644 --- a/mindocr/data/transforms/transforms_factory.py +++ b/mindocr/data/transforms/transforms_factory.py @@ -1,3 +1,4 @@ +# flake8: noqa: F405 """ Create and run transformations from a config or predefined transformation pipeline """ @@ -7,7 +8,6 @@ from .det_transforms import * from .general_transforms import * from .layoutlm_transforms import * -from .rec_abinet_transforms import * from .rec_transforms import * from .svtr_transform import * from .table_transform import * @@ -39,10 +39,6 @@ "VQAReTokenRelation": VQAReTokenRelation, "VQAReTokenChunk": VQAReTokenChunk, "TensorizeEntitiesRelations": TensorizeEntitiesRelations, - "ABINetTransforms": ABINetTransforms, - "ABINetRecAug": ABINetRecAug, - "ABINetEval": ABINetEval, - "ABINetEvalTransforms": 
ABINetEvalTransforms, "RecCTCLabelEncode": RecCTCLabelEncode, "RecAttnLabelEncode": RecAttnLabelEncode, "RecMasterLabelEncode": RecMasterLabelEncode, diff --git a/mindocr/losses/abinet_loss.py b/mindocr/losses/abinet_loss.py deleted file mode 100644 index 07d5299bd..000000000 --- a/mindocr/losses/abinet_loss.py +++ /dev/null @@ -1,132 +0,0 @@ -import mindspore as ms -import mindspore.numpy as msnp -from mindspore import nn -from mindspore.ops import operations as P - -__all__ = ["ABINetLoss"] - - -class ABINetLoss(nn.Cell): - def __init__(self, one_hot=True): - super().__init__() - self.ce = SoftCrossEntropyLoss() - self.bce = nn.BCELoss(reduction="mean") - self.cast = P.Cast() - - def _merge_list(self, all_res): - if not isinstance(all_res, (list, tuple)): - return all_res - - def merge(items): - concat_op = ms.ops.Concat(axis=0) - if isinstance(items[0], ms.Tensor): - return concat_op(items) - else: - return items[0] - - res = [] - - for key in all_res[0].keys(): - items = [] - - for i in range(3): - items.append(all_res[i][key]) - - res.append(merge(items)) - - return res - - def _ce_loss(self, output, loss_args, i, idx=None, record=True): - pt_logits = 1.0 - weight = 1.0 - - if i == 0: - pt_logits = output[0] - - if i == 1: - pt_logits = output[1] - - if i == 2: - pt_logits = output["logits"] - - gt_labels = loss_args[0] - gt_lengths = loss_args[1] - label_for_mask = loss_args[2] - assert pt_logits.shape[0] % gt_labels.shape[0] == 0 - - iter_size = pt_logits.shape[0] // gt_labels.shape[0] - type_dst = ms.float16 - cast = ms.ops.Cast() - gt_labels = cast(gt_labels, type_dst) - gt_lengths = cast(gt_lengths, type_dst) - pt_logits = cast(pt_logits, type_dst) - label_for_mask = cast(label_for_mask, type_dst) - - if iter_size > 1: - gt_labels = msnp.tile(gt_labels, (3, 1, 1)) - gt_lengths = msnp.tile(gt_lengths, 3) - - label_for_mask = msnp.tile(label_for_mask, (3, 1)) - - label_for_mask = label_for_mask[:, None] - - loss = self.ce(gt_labels, pt_logits, gt_lengths, label_for_mask) * weight - - return loss - - def construct(self, outputs, label, length, label_for_mask): - loss_args = [label, length, label_for_mask] - output_list = [] - for i in range(len(outputs)): - output_list.append(self._merge_list(outputs[i])) - outputs = output_list - loss_one = 0 - loss_all = 0 - for i in range(3): - loss_one = self._ce_loss(outputs[i], loss_args, i) - loss_all = loss_one + loss_all - return loss_all - - -class SoftCrossEntropyLoss(nn.Cell): - def __init__(self, reduction="mean"): - super().__init__() - - def construct(self, gt_labels, pt_logits, gt_lengths, label_for_mask, softmax=True): - data_pt_list = [] - mask_list = [] - gt_list = [] - - loss = 0 - mean_divide = 0 - - for i in range(pt_logits.shape[0]): - data_length = gt_lengths[i] - mean_divide = mean_divide + data_length - mask_pt = label_for_mask[i] > 0 - - mask_pt = mask_pt.transpose(1, 0) - - data_pt_list.append(pt_logits[i]) - mask_list.append(mask_pt) - gt_list.append(gt_labels[i]) - - concat_pt_logits = ms.ops.concat(data_pt_list) - concat_mask = ms.ops.concat(mask_list) - concat_gt_labels = ms.ops.concat(gt_list) - concat_mask = concat_mask.astype(ms.float16) - concat_pt_logits = concat_pt_logits * concat_mask - - if softmax: - concat_pt_logits = concat_pt_logits.astype(ms.float16) - log_prob = ms.ops.log_softmax(concat_pt_logits) - else: - log_prob = ms.ops.log(concat_pt_logits) - - loss = -(concat_gt_labels * log_prob) - loss = loss.astype(ms.float16) - loss = loss * concat_mask - loss = loss.sum(axis=(-2, -1)) - loss_mean = loss / 
mean_divide - - return loss_mean diff --git a/mindocr/losses/builder.py b/mindocr/losses/builder.py index e232d2e4b..28f19bfb5 100644 --- a/mindocr/losses/builder.py +++ b/mindocr/losses/builder.py @@ -1,4 +1,3 @@ -from .abinet_loss import ABINetLoss from .cls_loss import CrossEntropySmooth from .det_loss import DBLoss, EASTLoss, PSEDiceLoss from .kie_loss import VQAReTokenLayoutLMLoss, VQASerTokenLayoutLMLoss @@ -16,7 +15,6 @@ "PSEDiceLoss", "EASTLoss", "CrossEntropySmooth", - "ABINetLoss", "SARLoss", "VisionLANLoss", "VQAReTokenLayoutLMLoss", diff --git a/mindocr/models/__init__.py b/mindocr/models/__init__.py index 9ee5f139d..643e76f15 100644 --- a/mindocr/models/__init__.py +++ b/mindocr/models/__init__.py @@ -7,7 +7,6 @@ from .det_psenet import * from .kie_layoutxlm import * from .layout_yolov8 import * -from .rec_abinet import * from .rec_crnn import * from .rec_master import * from .rec_rare import * diff --git a/mindocr/models/backbones/__init__.py b/mindocr/models/backbones/__init__.py index 8f71299ab..5fb4132e0 100644 --- a/mindocr/models/backbones/__init__.py +++ b/mindocr/models/backbones/__init__.py @@ -10,7 +10,6 @@ from .det_resnet import * from .layoutlmv3 import layoutlmv3 from .layoutxlm import layoutxlm -from .rec_abinet_backbone import * from .rec_master import * from .rec_resnet import * from .rec_resnet45 import * diff --git a/mindocr/models/backbones/rec_abinet_backbone.py b/mindocr/models/backbones/rec_abinet_backbone.py deleted file mode 100644 index 5af63e349..000000000 --- a/mindocr/models/backbones/rec_abinet_backbone.py +++ /dev/null @@ -1,254 +0,0 @@ -import math - -import numpy as np - -import mindspore as ms -import mindspore.nn as nn - -from ..utils.abinet_layers import ( - ABINetBlock, - PositionalEncoding, - PositionAttention, - TransformerEncoder, - _default_tfmer_cfg, -) -from ._registry import register_backbone, register_backbone_class - -__all__ = [ - "ABINetIterBackbone", - "abinet_backbone"] - -# ABINet_backbone - - -@register_backbone_class -class ABINetIterBackbone(nn.Cell): - def __init__(self, batchsize=96): - super().__init__() - self.out_channels = [1, 512] - self.batchsize = batchsize - self.vision = BaseVision(self.batchsize) - - def construct(self, images, *args): - v_res = self.vision(images) - return v_res - - -@register_backbone -def abinet_backbone(pretrained: bool = True, **kwargs): - model = ABINetIterBackbone(**kwargs) - - # load pretrained weights - if pretrained: - raise NotImplementedError("The default pretrained checkpoint for `rec_abinet_backbone` backbone does not exist") - - return model - - -class BaseVision(ABINetBlock): - def __init__(self, batchsize): - super().__init__() - self.batchsize = batchsize - self.loss_weight = 1.0 - self.out_channels = 512 - self.backbone = ResTranformer(self.batchsize) - mode = "nearest" - self.attention = PositionAttention( - max_length=26, # additional stop token - mode=mode, - ) - - self.cls = nn.Dense( - self.out_channels, - self.charset.num_classes, - weight_init="HeUniform", - bias_init="uniform", - ) - - def construct(self, images, *args): - features = self.backbone(images) # (N, E, H, W) - - attn_vecs, attn_scores = self.attention(features) - - logits = self.cls(attn_vecs) # (N, T, C) - - pt_lengths = self._get_length(logits) - - return { - "feature": attn_vecs, - "logits": logits, - "pt_lengths": pt_lengths, - "attn_scores": attn_scores, - "loss_weight": self.loss_weight, - "name": "vision", - } - - -class ResTranformer(nn.Cell): - def __init__(self, batchsize): - super().__init__() - 
self.resnet = resnet45() - - self.d_model = _default_tfmer_cfg["d_model"] - nhead = _default_tfmer_cfg["nhead"] - d_inner = _default_tfmer_cfg["d_inner"] - dropout = _default_tfmer_cfg["dropout"] - num_layers = 3 - self.encoder_mask = ms.Tensor(np.ones((batchsize, 256, 256)), dtype=ms.float32) - self.pos_encoder = PositionalEncoding(self.d_model, max_len=8 * 32) - - self.transformer = TransformerEncoder( - batch_size=batchsize, - num_layers=num_layers, - hidden_size=self.d_model, - num_heads=nhead, - ffn_hidden_size=d_inner, - hidden_dropout_rate=dropout, - attention_dropout_rate=dropout, - hidden_act="relu", - seq_length=256, - ) - - def construct(self, images): - feature = self.resnet(images) - n, c, h, w = feature.shape - feature = feature.view(n, c, -1) - feature = feature.transpose(2, 0, 1) - - feature = self.pos_encoder(feature) - feature = feature.transpose(1, 0, 2) - feature = self.transformer( - feature, self.encoder_mask - ) - feature = feature.transpose(1, 0, 2) - feature = feature.transpose(1, 2, 0) - feature = feature.view(n, c, h, w) - return feature - - -def conv1x1(in_planes, out_planes, stride=1): - return nn.Conv2d( - in_planes, out_planes, kernel_size=1, stride=stride, has_bias=False - ) - - -def conv3x3(in_planes, out_planes, stride=1): - "3x3 convolution with padding" - return nn.Conv2d( - in_planes, - out_planes, - kernel_size=3, - stride=stride, - pad_mode="pad", - padding=1, - has_bias=False, - ) - - -class BasicBlock(nn.Cell): - expansion = 1 - - def __init__(self, inplanes, planes, stride=1, downsample=None): - super(BasicBlock, self).__init__() - self.conv1 = conv1x1(inplanes, planes) - self.bn1 = nn.BatchNorm2d(planes, momentum=0.1) - self.relu = nn.ReLU() - self.conv2 = conv3x3(planes, planes, stride) - self.bn2 = nn.BatchNorm2d(planes, momentum=0.1) - self.downsample = downsample - self.stride = stride - - def construct(self, x): - residual = x - - out = self.conv1(x) - out = self.bn1(out) - out = self.relu(out) - - out = self.conv2(out) - out = self.bn2(out) - - if self.downsample is not None: - residual = self.downsample(x) - - out += residual - out = self.relu(out) - - return out - - -class ResNet(nn.Cell): - def __init__(self, block, layers): - self.inplanes = 32 - super(ResNet, self).__init__() - self.conv1 = nn.Conv2d( - 3, 32, kernel_size=3, stride=1, padding=1, has_bias=False, pad_mode="pad" - ) - - self.bn1 = nn.BatchNorm2d(32, momentum=0.1) - self.relu = nn.ReLU() - - self.layer1 = self._make_layer(block, 32, layers[0], stride=2) - self.layer2 = self._make_layer(block, 64, layers[1], stride=1) - self.layer3 = self._make_layer(block, 128, layers[2], stride=2) - self.layer4 = self._make_layer(block, 256, layers[3], stride=1) - self.layer5 = self._make_layer(block, 512, layers[4], stride=1) - - for _, cell in self.cells_and_names(): - if isinstance(cell, nn.Conv2d): - n = cell.kernel_size[0] * cell.kernel_size[1] * cell.out_channels - cell.weight.set_data( - ms.common.initializer.initializer( - ms.common.initializer.Normal(sigma=math.sqrt(2.0 / n), mean=0), - cell.weight.shape, - cell.weight.dtype, - ) - ) - elif isinstance(cell, nn.BatchNorm2d): - cell.gamma.set_data( - ms.common.initializer.initializer( - "ones", cell.gamma.shape, cell.gamma.dtype - ) - ) - cell.beta.set_data( - ms.common.initializer.initializer( - "zeros", cell.beta.shape, cell.beta.dtype - ) - ) - - def _make_layer(self, block, planes, blocks, stride=1): - downsample = None - if stride != 1 or self.inplanes != planes * block.expansion: - downsample = nn.SequentialCell( - nn.Conv2d( 
- self.inplanes, - planes * block.expansion, - kernel_size=1, - stride=stride, - has_bias=False, - ), - nn.BatchNorm2d(planes * block.expansion, momentum=0.1), - ) - - layers = [] - layers.append(block(self.inplanes, planes, stride, downsample)) - self.inplanes = planes * block.expansion - for i in range(1, blocks): - layers.append(block(self.inplanes, planes)) - - return nn.SequentialCell(*layers) - - def construct(self, x): - x = self.conv1(x) - x = self.bn1(x) - x = self.relu(x) - x = self.layer1(x) - x = self.layer2(x) - x = self.layer3(x) - x = self.layer4(x) - x = self.layer5(x) - return x - - -def resnet45(): - return ResNet(BasicBlock, [3, 4, 6, 6, 3]) diff --git a/mindocr/models/heads/builder.py b/mindocr/models/heads/builder.py index 0ca586207..f001bfc0a 100644 --- a/mindocr/models/heads/builder.py +++ b/mindocr/models/heads/builder.py @@ -11,7 +11,6 @@ 'MasterDecoder', 'RobustScannerHead', 'VisionLANHead', - 'ABINetHead', "TokenClassificationHead", "RelationExtractionHead", 'YOLOv8Head', @@ -26,7 +25,6 @@ from .det_pse_head import PSEHead from .kie_relationextraction_head import RelationExtractionHead from .kie_tokenclassification_head import TokenClassificationHead -from .rec_abinet_head import ABINetHead from .rec_attn_head import AttentionHead from .rec_ctc_head import CTCHead from .rec_master_decoder import MasterDecoder diff --git a/mindocr/models/heads/rec_abinet_head.py b/mindocr/models/heads/rec_abinet_head.py deleted file mode 100644 index de90c97a3..000000000 --- a/mindocr/models/heads/rec_abinet_head.py +++ /dev/null @@ -1,226 +0,0 @@ -import math - -import numpy as np - -import mindspore as ms -from mindspore import nn - -from ..utils.abinet_layers import ABINetBlock, PositionalEncoding -from ..utils.abinet_layers import TransformerDecoder as ms_TransformerDecoder -from ..utils.abinet_layers import _default_tfmer_cfg - -__all__ = ["ABINetHead"] - - -class ABINetHead(nn.Cell): - def __init__(self, in_channels, batchsize=96): - super().__init__() - self.iter_size = 3 - self.batchsize = batchsize - self.in_channels = in_channels # In order to fit the mindocr framework, it is not actually used. - self.alignment = BaseAlignment() - self.language = BCNLanguage(self.batchsize) - self.max_length = 26 # additional stop token - - def construct(self, v_res): - # v_res = nout - a_res = v_res - all_l_res = [] - all_a_res = [] - for _ in range(self.iter_size): - ms_softmax = nn.Softmax() - tokens = ms_softmax(a_res["logits"]) - lengths = a_res["pt_lengths"] - lengths = ms.ops.clip_by_value(lengths, 2, self.max_length) - l_res = self.language( - tokens, lengths - ) - all_l_res.append(l_res) - a_res = self.alignment(l_res["feature"], v_res["feature"]) - all_a_res.append(a_res) - - if not self.training: - return a_res["logits"] - - return all_a_res, all_l_res, v_res - - -def _calculate_fan_in_and_fan_out(shape): - """ - calculate fan_in and fan_out - - Args: - shape (tuple): input shape. - - Returns: - Tuple, a tuple with two elements, the first element is `n_in` and the second element is `n_out`. 
- """ - dimensions = len(shape) - if dimensions < 2: - raise ValueError("'fan_in' and 'fan_out' can not be computed for tensor with fewer than" - " 2 dimensions, but got dimensions {}.".format(dimensions)) - if dimensions == 2: # Linear - fan_in = shape[1] - fan_out = shape[0] - else: - num_input_fmaps = shape[1] - num_output_fmaps = shape[0] - receptive_field_size = 1 - for i in range(2, dimensions): - receptive_field_size *= shape[i] - fan_in = num_input_fmaps * receptive_field_size - fan_out = num_output_fmaps * receptive_field_size - return fan_in, fan_out - - -class BaseAlignment(ABINetBlock): - def __init__(self): - super().__init__() - d_model = 512 - - self.loss_weight = 1.0 - self.max_length = 26 # additional stop token - self.w_att = nn.Dense( - 2 * d_model, d_model, weight_init='HeUniform', bias_init='uniform' - ) - self.cls = nn.Dense( - d_model, - self.charset.num_classes, weight_init='HeUniform', bias_init='uniform' - ) - for _, cell in self.cells_and_names(): - if isinstance(cell, nn.Dense): - print("Dense Init HeUniform") - cell.weight.set_data(ms.common.initializer.initializer( - ms.common.initializer.HeUniform(negative_slope=math.sqrt(5), mode="fan_in", - nonlinearity="leaky_relu"), - cell.weight.shape, cell.weight.dtype)) - weight = cell.weight - fan_in, _ = _calculate_fan_in_and_fan_out(weight.shape) - bound = 1 / math.sqrt(int(fan_in)) - - cell.bias.set_data(ms.common.initializer.initializer(ms.common.initializer.Uniform(scale=bound), - cell.bias.shape, cell.bias.dtype)) - - def construct(self, l_feature, v_feature): - - f = ms.ops.concat((l_feature, v_feature), axis=2) - - f_att = ms.ops.sigmoid(self.w_att(f)) - - output = f_att * v_feature + (1 - f_att) * l_feature - logits = self.cls(output) # (N, T, C) - pt_lengths = self._get_length(logits) - - return { - "logits": logits, - "pt_lengths": pt_lengths, - "loss_weight": self.loss_weight, - "name": "alignment", - } - - -class BCNLanguage(ABINetBlock): - def __init__( - self, batchsize - ): - super().__init__() - d_model = _default_tfmer_cfg["d_model"] - nhead = _default_tfmer_cfg["nhead"] - d_inner = _default_tfmer_cfg["d_inner"] - dropout = _default_tfmer_cfg["dropout"] - self.batchsize = batchsize - num_layers = 4 - self.d_model = d_model - self.detach = True - self.use_self_attn = False - self.loss_weight = 1.0 - self.max_length = 26 # additional stop token - self.debug = False - - self.proj = nn.Dense( - self.charset.num_classes, - d_model, - weight_init="uniform", - bias_init="uniform", - has_bias=False, - ) - self.token_encoder = PositionalEncoding(d_model, max_len=self.max_length) - self.pos_encoder = PositionalEncoding( - d_model, dropout=1.0, max_len=self.max_length - ) - self.model = ms_TransformerDecoder( - batch_size=self.batchsize, - num_layers=num_layers, - hidden_size=self.d_model, - num_heads=nhead, - ffn_hidden_size=d_inner, - hidden_dropout_rate=dropout, - attention_dropout_rate=dropout, - hidden_act="relu", - src_seq_length=26, - tgt_seq_length=26, - ) - - self.cls = nn.Dense( - self.d_model, - self.charset.num_classes, - weight_init="uniform", - bias_init="uniform", - ) - - def mindspore_decoder_mask(self, lengths): - ms_unqueeze = ms.ops.expand_dims - ms_pad_mask = self._get_padding_mask(lengths, 26) - ms_pad_mask = ms_unqueeze(ms_pad_mask, -2) - ms_eye_mask = self._get_location_mask(26) - ms_eye_mask = ms_unqueeze(ms_eye_mask, 0) - bitand = ms.ops.logical_and - out_mask = bitand(ms_pad_mask, ms_eye_mask) - - return (out_mask).astype(ms.float16) - - def _get_padding_mask(self, length, max_length): 
- ms_unqueeze = ms.ops.expand_dims - length = ms_unqueeze(length, -1) - grid = ms.numpy.arange(0, max_length) - grid = ms_unqueeze(grid, 0) - return grid < length - - def _get_location_mask(self, sz): - a = np.eye(sz, sz) - b = np.ones((26, 26)) - mask = b - a - mask = ms.Tensor(mask) - return mask.astype(ms.bool_) - - def construct(self, tokens, lengths): - """ - Args: - tokens: (N, T, C) where T is length, N is batch size and C is classes number - lengths: (N,) - """ - # if self.detach: tokens = tokens.detach() - tokens1 = ms.ops.stop_gradient(tokens) - embed = self.proj(tokens1) # (N, T, E) - embed = embed.transpose(1, 0, 2) - embed = self.token_encoder(embed) # (T, N, E) - embed = embed.transpose(1, 0, 2) - zeros = ms.ops.zeros((self.batchsize, 26, 512), ms.float32) - zeros = zeros.transpose(1, 0, 2) - query = self.pos_encoder(zeros) - query = query.transpose(1, 0, 2) - padding_mask = self.mindspore_decoder_mask(lengths) - location_mask = self.mindspore_decoder_mask(lengths) - output = self.model(query, padding_mask, embed, location_mask) - logits = self.cls(output) # (N, T, C) - pt_lengths = self._get_length(logits) - - res = { - "feature": output, - "logits": logits, - "pt_lengths": pt_lengths, - "loss_weight": self.loss_weight, - "name": "language", - } - - return res diff --git a/mindocr/models/rec_abinet.py b/mindocr/models/rec_abinet.py deleted file mode 100644 index 3270ee906..000000000 --- a/mindocr/models/rec_abinet.py +++ /dev/null @@ -1,40 +0,0 @@ -from ._registry import register_model -from .backbones.mindcv_models.utils import load_pretrained -from .base_model import BaseModel - -__all__ = ["ABINetModel", "abinet"] - - -def _cfg(url="", **kwargs): - return {"url": url, "input_size": (3, 32, 100), **kwargs} - - -default_cfgs = { - # 'abinet': - "abinet": _cfg( - url="https://download-mindspore.osinfra.cn/toolkits/mindocr/abinet/abinet_resnet45_en-7efa1184.ckpt" - ), -} - - -class ABINetModel(BaseModel): - def __init__(self, config): - BaseModel.__init__(self, config) - - -@register_model -def abinet(pretrained=False, **kwargs): - model_config = { - "backbone": {"name": "abinet_backbone", "pretrained": False}, - "head": { - "name": "ABINetHead", - }, - } - model = ABINetModel(model_config) - - # load pretrained weights - if pretrained: - default_cfg = default_cfgs['abinet'] - load_pretrained(model, default_cfg) - - return model diff --git a/mindocr/models/utils/abinet_layers.py b/mindocr/models/utils/abinet_layers.py deleted file mode 100644 index 42c2f58ed..000000000 --- a/mindocr/models/utils/abinet_layers.py +++ /dev/null @@ -1,1733 +0,0 @@ -import math - -import numpy as np - -import mindspore as ms -import mindspore.common.dtype as mstype -from mindspore import _checkparam as Validator -from mindspore import log as logger -from mindspore import nn -from mindspore.common.parameter import Parameter -from mindspore.common.tensor import Tensor -from mindspore.context import ParallelMode -from mindspore.log import _LogActionOnce -from mindspore.nn.cell import Cell -from mindspore.ops import functional as F -from mindspore.ops import operations as P -from mindspore.ops.primitive import constexpr -from mindspore.parallel._transformer.layers import ( - _args_type_validator_check, - _check_input_dtype, - _check_past_none_input_none, - _LayerInputCheck, - _LayerNorm, - _valid_type_checks, - _valid_value_checks, -) -from mindspore.parallel._transformer.moe import MoE, _check_moe_config, default_moe_config -from mindspore.parallel._transformer.op_parallel_config import ( - 
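# mindspore_decoder_mask above combines two constraints: a padding mask
# (grid < length, so only real tokens are visible) and a "location" mask
# (ones minus the identity, so a position never attends to itself -- the
# cloze constraint of the BCN). A NumPy sketch producing the same (N, T, T)
# mask, with 1 = attendable:
import numpy as np

def bcn_mask(lengths, max_len=26):
    grid = np.arange(max_len)
    pad = (grid[None, :] < lengths[:, None])[:, None, :]        # (N, 1, T)
    loc = (1 - np.eye(max_len, dtype=int)).astype(bool)[None]   # (1, T, T), zero diagonal
    return np.logical_and(pad, loc).astype(np.float16)          # (N, T, T)

mask = bcn_mask(np.array([5, 26]))
assert mask.shape == (2, 26, 26)
assert mask[0, 3, 3] == 0                           # a token cannot see itself
assert mask[0, 3, 4] == 1 and mask[0, 3, 5] == 0    # only positions < length are visible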
MoEParallelConfig, - OpParallelConfig, - _check_config, - default_dpmp_config, -) -from mindspore.parallel._transformer.transformer import ( - FeedForward, - MultiHeadAttention, - TransformerOpParallelConfig, - _get_lambda_func, - default_transformer_config, -) -from mindspore.parallel._utils import _get_parallel_mode, _is_sharding_propagation - -_default_tfmer_cfg = dict( - d_model=512, nhead=8, d_inner=2048, dropout=0.1, activation="relu" # 1024 -) - - -@constexpr -def _check_shape_equal(input_shape, param_name, func_name, target_shape): - _LayerInputCheck.check_shape_equal(input_shape, param_name, func_name, target_shape) - - -class ABINetBlock(nn.Cell): - def __init__(self): - super().__init__() - self.max_length = 26 - self.charset = CharsetMapper( - max_length=self.max_length, - ) - - def _get_length(self, logit, dim=-1): - - logit_argmax = ms.ops.Argmax()(logit) - out = logit_argmax == 0 - out_copy = out.copy() - abn = out.any(dim) - out1 = out.cumsum(dim) == 1 - out = ms.ops.logical_and(out_copy, out1) - out1 = out.argmax(-1) - out1 = out1 + 1 - logit_shape1 = logit.shape[1] - out = ms.numpy.where(abn, out1, logit_shape1) - return out - - @staticmethod - def _get_padding_mask(length, max_length): - length = ms.numpy.expand_dims(length, -1) - # length = length.unsqueeze(-1) - grid = ms.numpy.arange(0, max_length) - grid = ms.numpy.expand_dims(grid, 0) - grid = ms.Tensor(grid) - return grid >= length - - @staticmethod - def _get_location_mask(sz, device=None): - eyes = ms.ops.Eye() - mask1 = eyes(sz, sz, ms.bool_) - cast = ms.ops.Cast() - mask = cast(mask1, ms.float32) - mask = ms.ops.masked_fill(mask, mask1, float("-inf")) - expand_dims = ms.ops.ExpandDims() - mask = expand_dims(mask, 0) - # mask = mask.float().masked_fill(mask == 1, float('-inf')) - return mask - - -class CharsetMapper(object): - - def __init__(self, max_length=30, null_char="\u2591"): - - self.null_char = null_char - self.max_length = max_length - self.label_to_char = self._read_charset() - self.char_to_label = dict(map(reversed, self.label_to_char.items())) - self.num_classes = len(self.label_to_char) - - def _read_charset(self): - charset = {} - charset_list = "░abcdefghijklmnopqrstuvwxyz1234567890" - charset = {idx: c for idx, c in enumerate(charset_list)} - self.null_label = 0 - charset[self.null_label] = self.null_char - return charset - - def trim(self, text): - assert isinstance(text, str) - return text.replace(self.null_char, "") - - def get_text(self, labels, length=None, padding=True, trim=False): - """Returns a string corresponding to a sequence of character ids.""" - length = length if length else self.max_length - labels = [int(a) if isinstance(a, ms.Tensor) else int(a) for a in labels] - if padding: - labels = labels + [self.null_label] * (length - len(labels)) - text = "".join([self.label_to_char[label] for label in labels]) - if trim: - text = self.trim(text) - return text - - def get_labels(self, text, length=None, padding=True, case_sensitive=False): - """Returns the labels of the corresponding text.""" - length = length if length else self.max_length - if padding: - text = text + self.null_char * (length - len(text)) - if not case_sensitive: - text = text.lower() - labels = [self.char_to_label[char] for char in text] - return labels - - def pad_labels(self, labels, length=None): - length = length if length else self.max_length - return labels + [self.null_label] * (length - len(labels)) - - @property - def digits(self): - return "0123456789" - - @property - def digit_labels(self): - return 
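# _get_length above recovers the predicted text length as "index of the first
# end token (class 0) plus one", falling back to the full width T when no end
# token is predicted; the cumsum == 1 trick isolates the first hit even when
# several zeros appear. A NumPy sketch of the same logic:
import numpy as np

def get_length(logits):
    """logits: (N, T, C) -> (N,) predicted lengths."""
    is_end = logits.argmax(-1) == 0                   # (N, T): end-token positions
    first = is_end & (np.cumsum(is_end, -1) == 1)     # keep only the first end token
    length = first.argmax(-1) + 1
    return np.where(is_end.any(-1), length, logits.shape[1])

logits = np.zeros((1, 5, 3)); logits[0, :, 1] = 1.0   # class 1 everywhere ...
logits[0, 2, :] = [9.0, 0.0, 0.0]                     # ... except an end token at t=2
assert get_length(logits)[0] == 3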
self.get_labels(self.digits, padding=False) - - @property - def alphabets(self): - all_chars = list(self.char_to_label.keys()) - valid_chars = [] - for c in all_chars: - if c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ": - valid_chars.append(c) - return "".join(valid_chars) - - @property - def alphabet_labels(self): - return self.get_labels(self.alphabets, padding=False) - - -def onehot(label, depth, device=None): - - label_shape = 26 - - onehot_output = np.zeros((label_shape, depth)) - - label_expand = np.expand_dims(label, -1) - label_expand = np.expand_dims(label_expand, -1) - label_expand_onehot = np.zeros((26, 37)) - a = 0 - for i in label_expand: - i = int(i) - label_expand_onehot[a][i] = 1 - a = a + 1 - - label_expand_onehot = label_expand_onehot - onehot_output = label_expand_onehot + onehot_output - - return onehot_output - - -def encoder_layer(in_c, out_c, k=3, s=2, p=1): - return nn.SequentialCell( - nn.Conv2d(in_c, out_c, k, s, pad_mode="pad", padding=p, has_bias=True), - nn.BatchNorm2d(out_c, momentum=0.1), - nn.ReLU(), - ) - - -class ms_upsample_scale(nn.Cell): - def __init__(self, scale_factor, align_corners): - super().__init__() - self.scale_factor = scale_factor - self.align_corners = align_corners - - def construct(self, x): - _1, _2, height, width = x.shape - new_height = self.scale_factor * height - new_width = self.scale_factor * width - resize = ms.ops.ResizeNearestNeighbor( - size=(new_height, new_width), align_corners=self.align_corners - ) - x = resize(x) - return x - - -class ms_upsample_size(nn.Cell): - def __init__(self, size, align_corners): - super().__init__() - self.size = size - self.align_corners = align_corners - - def construct(self, x): - resize = ms.ops.ResizeNearestNeighbor( - size=self.size, align_corners=self.align_corners - ) - x = resize(x) - return x - - -# mindspore upsample ResizeBilinear 只有bilinear -def decoder_layer1( - in_c, out_c, k=3, s=1, p=1, mode="nearest", scale_factor=None, size=None -): - align_corners = False if mode == "nearest" else True - return nn.SequentialCell( - ms_upsample_scale(scale_factor, align_corners=align_corners), - nn.Conv2d(in_c, out_c, k, s, pad_mode="pad", padding=p, has_bias=True), - nn.BatchNorm2d(out_c, momentum=0.1), - nn.ReLU(), - ) - - -def decoder_layer2( - in_c, out_c, k=3, s=1, p=1, mode="nearest", scale_factor=None, size=None -): - align_corners = False if mode == "nearest" else True - return nn.SequentialCell( - ms_upsample_size(size, align_corners=align_corners), - nn.Conv2d(in_c, out_c, k, s, pad_mode="pad", padding=p, has_bias=True), - nn.BatchNorm2d(out_c, momentum=0.1), - nn.ReLU(), - ) - - -class PositionAttention(nn.Cell): - def __init__( - self, - max_length, - in_channels=512, - num_channels=64, - h=8, - w=32, - mode="nearest", - **kwargs - ): - super().__init__() - self.max_length = max_length - self.k_encoder1 = encoder_layer(in_channels, num_channels, s=(1, 2)) - self.k_encoder2 = encoder_layer(num_channels, num_channels, s=(2, 2)) - self.k_encoder3 = encoder_layer(num_channels, num_channels, s=(2, 2)) - self.k_encoder4 = encoder_layer(num_channels, num_channels, s=(2, 2)) - - self.k_decoder1 = decoder_layer1( - num_channels, num_channels, scale_factor=2, mode=mode - ) - self.k_decoder2 = decoder_layer1( - num_channels, num_channels, scale_factor=2, mode=mode - ) - self.k_decoder3 = decoder_layer1( - num_channels, num_channels, scale_factor=2, mode=mode - ) - self.k_decoder4 = decoder_layer2( - num_channels, in_channels, size=(h, w), mode=mode - ) - - self.pos_encoder = 
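# The Chinese comment above notes that MindSpore's built-in upsampling
# (ResizeBilinear) only offers bilinear mode, which is why the code wraps
# ResizeNearestNeighbor in custom cells for mode="nearest". For an integer
# scale factor, nearest-neighbor upsampling is just pixel repetition; a NumPy
# sketch:
import numpy as np

def upsample_nearest(x, scale):
    """x: (N, C, H, W) -> (N, C, H*scale, W*scale) by repeating pixels."""
    return x.repeat(scale, axis=2).repeat(scale, axis=3)

x = np.arange(4, dtype=np.float32).reshape(1, 1, 2, 2)
y = upsample_nearest(x, 2)
assert y.shape == (1, 1, 4, 4)
assert y[0, 0, 0, 1] == 0 and y[0, 0, 0, 2] == 1   # each source pixel fills a 2x2 block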
PositionalEncoding( - in_channels, dropout=1.0, max_len=max_length - ) - self.project = nn.Dense( - in_channels, in_channels, weight_init="HeUniform", bias_init="uniform" - ) - - def construct(self, x): - N, E, H, W = x.shape - k, v = x, x # (N, E, H, W) - features = [] - k = self.k_encoder1(k) - features.append(k) - k = self.k_encoder2(k) - features.append(k) - k = self.k_encoder3(k) - features.append(k) - k = self.k_encoder4(k) - features.append(k) - k = self.k_decoder1(k) - k = k + features[2] - k = self.k_decoder2(k) - k = k + features[1] - k = self.k_decoder3(k) - k = k + features[0] - k = self.k_decoder4(k) - - k_1, k_2, k_3, k_4 = k.shape - # calculate query vector - # TODO q=f(q,k) - zeros = ms.ops.Zeros() - x_zeros = zeros((self.max_length, N, E), ms.float32) # (T, N, E) - q = self.pos_encoder(x_zeros) # (T, N, E) - q = q.transpose(1, 0, 2) - q = self.project(q) # (N, T, E) - - # calculate attention - k_attn = k.view(k_1, k_2, -1) - batmatmul = ms.ops.BatchMatMul() - attn_scores = batmatmul(q, k_attn) # (N, T, (H*W)) - attn_scores = attn_scores / (E**0.5) - softmax_attn = nn.Softmax() - attn_scores = softmax_attn(attn_scores) - v = v.transpose(0, 2, 3, 1) - v = v.view(N, -1, E) # (N, (H*W), E) - attn_vecs = batmatmul(attn_scores, v) # (N, T, E) - - return attn_vecs, attn_scores.view(N, -1, H, W) - - -class PositionalEncoding(nn.Cell): - def __init__(self, d_model=512, dropout=0.9, max_len=5000): - super(PositionalEncoding, self).__init__() - self.dropout = nn.Dropout(p=1 - dropout) - pe = np.zeros((max_len, d_model), np.float32) - position = np.arange(0, max_len, dtype=np.float32) - position = np.expand_dims(position, 1) - div = np.arange(0, d_model, 2, dtype=np.float32) - div = div * (-math.log(10000.0) / d_model) - div_term = np.exp(div) - pe[:, 0::2] = np.sin(position * div_term) - pe[:, 1::2] = np.cos(position * div_term) - pe = np.expand_dims(pe, 0) - pe = np.transpose(pe, (1, 0, 2)) - pe = ms.Tensor(pe).astype(dtype=ms.float32) - self.pe = ms.Parameter(pe, name="pe1", requires_grad=False) - - def construct(self, x): - w, _, = x.shape - x = x + self.pe[:w, :] - - return self.dropout(x) - - -# Since Mindspore Transformer does not meet the requirements -# It has been modified - - -class TransformerEncoderLayer(Cell): - - @_LogActionOnce( - logger=logger, - key="TransformerEncoderLayer", - no_warning=_get_parallel_mode() in (ParallelMode.STAND_ALONE,), - ) - @_args_type_validator_check( - batch_size=Validator.check_positive_int, - hidden_size=Validator.check_positive_int, - num_heads=Validator.check_positive_int, - ffn_hidden_size=Validator.check_positive_int, - seq_length=Validator.check_positive_int, - attention_dropout_rate=Validator.check_non_negative_float, - hidden_dropout_rate=Validator.check_non_negative_float, - hidden_act=_valid_type_checks([str], "TransformerEncoderLayer"), - post_layernorm_residual=Validator.check_bool, - layernorm_compute_type=_valid_value_checks( - [mstype.float32, mstype.float16], "TransformerEncoderLayer" - ), - softmax_compute_type=_valid_value_checks( - [mstype.float32, mstype.float16], "TransformerEncoderLayer" - ), - param_init_type=_valid_value_checks( - [mstype.float32, mstype.float16], "TransformerEncoderLayer" - ), - parallel_config=_valid_type_checks( - [OpParallelConfig, MoEParallelConfig], "TransformerEncoderLayer" - ), - use_past=Validator.check_bool, - ) - def __init__( - self, - batch_size, - hidden_size, - ffn_hidden_size, - num_heads, - seq_length, - attention_dropout_rate=0.1, - hidden_dropout_rate=0.1, - 
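# PositionalEncoding above precomputes the standard sinusoidal table
# PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(...), stores it
# as a non-trainable (max_len, 1, d) parameter, and adds it to a (T, N, E)
# input (note the keep-probability convention: it passes p = 1 - dropout to
# nn.Dropout, so dropout=1.0 means no dropout). Table construction in NumPy:
import math
import numpy as np

def sinusoid_table(max_len, d_model):
    pos = np.arange(max_len, dtype=np.float32)[:, None]               # (L, 1)
    div = np.exp(np.arange(0, d_model, 2, dtype=np.float32)
                 * (-math.log(10000.0) / d_model))                    # (d/2,)
    pe = np.zeros((max_len, d_model), np.float32)
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe[:, None, :]                                             # broadcasts over batch

pe = sinusoid_table(256, 512)
assert pe.shape == (256, 1, 512)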
post_layernorm_residual=False, - layernorm_compute_type=mstype.float32, - softmax_compute_type=mstype.float32, - param_init_type=mstype.float32, - hidden_act="gelu", - use_past=False, - moe_config=default_moe_config, - parallel_config=default_dpmp_config, - ): - super(TransformerEncoderLayer, self).__init__() - if ( - _get_parallel_mode() in (ParallelMode.AUTO_PARALLEL,) - and _is_sharding_propagation() - ): - _check_config(parallel_config) - if num_heads % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerEncoderLayer', the class variable 'num_heads' must be divisibled by the " - "'parallel_config.model_parallel', but got the num_heads is {} and " - "parallel_config.model_parallel is {}.".format( - num_heads, parallel_config.model_parallel - ) - ) - if hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerEncoderLayer', the class variable 'hidden_size' must be divisibled by " - "the 'parallel_config.model_parallel', but got the hidden_size is {} and parallel_config." - " model_parallel is {}.".format( - hidden_size, parallel_config.model_parallel - ) - ) - if ffn_hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerEncoderLayer', the class variable 'ffn_hidden_size' must be divisibled " - "by the 'parallel_config.model_parallel', but got the ffn_hidden_size is {} " - "and parallel_config. model_parallel is {}.".format( - ffn_hidden_size, parallel_config.model_parallel - ) - ) - _check_moe_config(moe_config, parallel_config) - self.use_moe = moe_config.expert_num > 1 - self.use_past = use_past - self.seq_length = seq_length - self.hidden_size = hidden_size - self.batch_size = batch_size - self.layernorm1 = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - self.layernorm2 = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - parallel_config_args = ( - parallel_config.dpmp if self.use_moe else parallel_config - ) - self.attention = MultiHeadAttention( - batch_size=batch_size, - src_seq_length=seq_length, - tgt_seq_length=seq_length, - hidden_size=hidden_size, - num_heads=num_heads, - hidden_dropout_rate=hidden_dropout_rate, - attention_dropout_rate=attention_dropout_rate, - softmax_compute_type=softmax_compute_type, - param_init_type=param_init_type, - use_past=use_past, - parallel_config=parallel_config_args, - ) - # For ABINet, the following paragraph is deleted - # if self.use_moe: - # self.output = MoE(hidden_size=hidden_size, - # dropout_rate=hidden_dropout_rate, - # ffn_hidden_size=ffn_hidden_size, - # param_init_type=param_init_type, - # hidden_act=hidden_act, - # moe_config=moe_config, - # parallel_config=parallel_config) - # else: - # Feed Forward Network, FFN - self.output = FeedForward( - hidden_size=hidden_size, - dropout_rate=hidden_dropout_rate, - ffn_hidden_size=ffn_hidden_size, - param_init_type=param_init_type, - hidden_act=hidden_act, - parallel_config=parallel_config, - ) - self.post_layernorm_residual = post_layernorm_residual - self.add = P.Add().shard( - ((parallel_config.data_parallel, 1), (parallel_config.data_parallel, 1)) - ) - self.add_3d = P.Add().shard( - ( - (parallel_config.data_parallel, 1, 1), - (parallel_config.data_parallel, 1, 1), - ) - ) - self.dtype = mstype.float16 - self.key_past = None - self.value_past = None - - # For ABINet, the following paragraph is deleted - # if self.use_past: - # # operator used for state reuse - # self.reducesum = P.ReduceSum().shard(((1, 1, 1, 1),)) - # self.not_equal = P.NotEqual().shard(((1, 
1, 1, 1), ())) - # self.slice = P.StridedSlice().shard(((1, 1, 1, 1),)) - # size_per_head = int(hidden_size / num_heads) - # self.key_shape = (batch_size, num_heads, size_per_head, seq_length) - # self.value_shape = (batch_size, num_heads, seq_length, size_per_head) - # # parameters saving key and value states - # self.key_past = Parameter(Tensor(np.zeros(shape=self.key_shape), self.dtype), name="key_past") - # self.value_past = Parameter(Tensor(np.zeros(shape=self.value_shape), self.dtype), name="value_past") - # self.tile = P.Tile().shard(((1, 1),)) - # self.mul = P.Mul().shard(((1, 1, 1, 1), (1,))) - # self.assign = P.Assign().shard(((1, 1, 1, 1), (1, 1, 1, 1))) - elif _get_parallel_mode() not in (ParallelMode.AUTO_PARALLEL,): - _check_config(parallel_config) - if num_heads % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerEncoderLayer', the class variable 'num_heads' must be divisibled by the " - "'parallel_config.model_parallel', but got the num_heads is {} and " - "parallel_config.model_parallel is {}.".format( - num_heads, parallel_config.model_parallel - ) - ) - if hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerEncoderLayer', the class variable 'hidden_size' must be divisibled by " - "the 'parallel_config.model_parallel', but got the hidden_size is {} and parallel_config." - " model_parallel is {}.".format( - hidden_size, parallel_config.model_parallel - ) - ) - if ffn_hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerEncoderLayer', the class variable 'ffn_hidden_size' must be divisibled " - "by the 'parallel_config.model_parallel', but got the ffn_hidden_size is {} " - "and parallel_config. model_parallel is {}.".format( - ffn_hidden_size, parallel_config.model_parallel - ) - ) - _check_moe_config(moe_config, parallel_config) - self.use_moe = moe_config.expert_num > 1 - self.use_past = use_past - self.seq_length = seq_length - self.hidden_size = hidden_size - self.batch_size = batch_size - self.layernorm1 = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - self.layernorm1.shard(((parallel_config.data_parallel, 1),)) - self.layernorm2 = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - self.layernorm2.shard(((parallel_config.data_parallel, 1),)) - parallel_config_args = ( - parallel_config.dpmp if self.use_moe else parallel_config - ) - self.attention = MultiHeadAttention( - batch_size=batch_size, - src_seq_length=seq_length, - tgt_seq_length=seq_length, - hidden_size=hidden_size, - num_heads=num_heads, - hidden_dropout_rate=hidden_dropout_rate, - attention_dropout_rate=attention_dropout_rate, - softmax_compute_type=softmax_compute_type, - param_init_type=param_init_type, - use_past=use_past, - parallel_config=parallel_config_args, - ) - - # For ABINet, the following paragraph is deleted - # if self.use_moe: - # self.output = MoE(hidden_size=hidden_size, - # dropout_rate=hidden_dropout_rate, - # ffn_hidden_size=ffn_hidden_size, - # param_init_type=param_init_type, - # hidden_act=hidden_act, - # moe_config=moe_config, - # parallel_config=parallel_config) - # else: - # Feed Forward Network, FFN - self.output = FeedForward( - hidden_size=hidden_size, - dropout_rate=hidden_dropout_rate, - ffn_hidden_size=ffn_hidden_size, - param_init_type=param_init_type, - hidden_act=hidden_act, - parallel_config=parallel_config, - ) - self.post_layernorm_residual = post_layernorm_residual - self.add = P.Add().shard( - ((parallel_config.data_parallel, 1), 
(parallel_config.data_parallel, 1)) - ) - self.add_3d = P.Add().shard( - ( - (parallel_config.data_parallel, 1, 1), - (parallel_config.data_parallel, 1, 1), - ) - ) - self.dtype = mstype.float16 - self.key_past = None - self.value_past = None - - # For ABINet, the following paragraph is deleted - # if self.use_past: - # # operator used for state reuse - # self.reducesum = P.ReduceSum().shard(((1, 1, 1, 1),)) - # self.not_equal = P.NotEqual().shard(((1, 1, 1, 1), ())) - # self.slice = P.StridedSlice().shard(((1, 1, 1, 1),)) - # size_per_head = int(hidden_size / num_heads) - # self.key_shape = (batch_size, num_heads, size_per_head, seq_length) - # self.value_shape = (batch_size, num_heads, seq_length, size_per_head) - # # parameters saving key and value states - # self.key_past = Parameter(Tensor(np.zeros(shape=self.key_shape), self.dtype), name="key_past") - # self.value_past = Parameter(Tensor(np.zeros(shape=self.value_shape), self.dtype), name="value_past") - # self.tile = P.Tile().shard(((1, 1),)) - # self.mul = P.Mul().shard(((1, 1, 1, 1), (1,))) - # self.assign = P.Assign().shard(((1, 1, 1, 1), (1, 1, 1, 1))) - else: - raise RuntimeError( - f"The {self.cls_name} only support sharding propagation or " - f"semi-auto parallel mode now." - ) - - def construct(self, x, input_mask, init_reset=True, batch_valid_length=None): - self._check_input(x, input_mask, init_reset, batch_valid_length) - x_shape = F.shape(x) - x = F.reshape(x, (-1, x_shape[-1])) - - # For ABINet, the following paragraph needs to be revised - # input_x = self.layernorm1(x) - input_x = F.cast(x, self.dtype) - - # indicate whether reset saved states - # key_reset = None - # value_reset = None - - # if self.use_past: - # # reset states, init_reset True for reuse and False for reset - # key_reset = self.assign(self.key_past, self.mul(self.key_past, F.cast(init_reset, self.dtype))) - # value_reset = self.assign(self.value_past, self.mul(self.value_past, F.cast(init_reset, self.dtype))) - # # add dependency for desired execution order - # input_x = F.depend(input_x, key_reset) - # input_x = F.depend(input_x, value_reset) - - attention, layer_present = self.attention( - input_x, - input_x, - input_x, - input_mask, - self.key_past, - self.value_past, - batch_valid_length, - ) - # For post-layernorm the inputs for residual path are output of self-attention and output of layernorm - # if self.post_layernorm_residual: - # x = self.add(input_x, attention) - # # For pre-layernorm the inputs for residual path are output of self-attention and input of this layer - # else: - # x = self.add(x, attention) - x = self.add(x, attention) - x = self.layernorm1(x) - - # output_x = self.layernorm2(x) - output_x = F.cast(x, self.dtype) - # if self.use_moe: - # mlp_logit, aux_loss = self.output(output_x) - - mlp_logit = self.output(output_x) - - value_update = None - key_update = None - # if self.use_past: - # # current key and value - # key_present, value_present = layer_present - # # update key and value calculated this step - # key_update = self.assign(self.key_past, key_present) - # value_update = self.assign(self.value_past, value_present) - # # add dependency for desired execution order - # key_update = F.depend(key_update, key_reset) - # value_update = F.depend(value_update, value_reset) - - # add dependency for desired execution order - mlp_logit = F.depend(mlp_logit, value_update) - mlp_logit = F.depend(mlp_logit, key_update) - output = 1.0 # mindspore graph need assignment - # if shape is 3d, we reshape the inputs of the add - if 
len(x_shape) == 3: - output_x = P.Reshape()(output_x, x_shape) - mlp_logit = P.Reshape()(mlp_logit, x_shape) - x = P.Reshape()(x, x_shape) - - # if self.post_layernorm_residual: - # output = self.add_3d(output_x, mlp_logit) - # else: - # output = self.add_3d(x, mlp_logit) - output = self.add_3d(x, mlp_logit) - output = self.layernorm2(output) - - else: - # if self.post_layernorm_residual: - # output = self.add(output_x, mlp_logit) - # else: - # output = self.add(x, mlp_logit) - output = F.reshape(output, x_shape) - output = self.add_3d(x, mlp_logit) - output = self.layernorm2(output) - # if self.use_moe: - # return output, layer_present, aux_loss - return output - - def _check_input(self, x, input_mask, init_reset, batch_valid_length): - r"""Check inputs""" - if not self.use_past or (self.use_past and self.is_first_iteration): - _check_shape_equal( - F.shape(x), - "x", - self.cls_name, - [ - [self.batch_size, self.seq_length, self.hidden_size], - [self.batch_size * self.seq_length, self.hidden_size], - ], - ) - _check_shape_equal( - F.shape(input_mask), - "input_mask", - self.cls_name, - [self.batch_size, self.seq_length, self.seq_length], - ) - else: - _check_shape_equal( - F.shape(x), "x", self.cls_name, [self.batch_size, 1, self.hidden_size] - ) - _check_shape_equal( - F.shape(input_mask), - "input_mask", - self.cls_name, - [self.batch_size, 1, self.seq_length], - ) - _check_input_dtype( - F.dtype(x), "x", [mstype.float32, mstype.float16], self.cls_name - ) - _check_input_dtype( - F.dtype(input_mask), - "input_mask", - [mstype.float32, mstype.float16], - self.cls_name, - ) - - init_reset_is_tensor = isinstance(init_reset, Tensor) - init_reset_is_default = init_reset is True - batch_valid_length_is_tensor = isinstance(batch_valid_length, Tensor) - batch_is_default = batch_valid_length is None - _check_past_none_input_none( - self.use_past, - "init_reset", - self.cls_name, - True, - init_reset_is_tensor, - init_reset_is_default, - ) - _check_past_none_input_none( - self.use_past, - "batch_valid_length", - self.cls_name, - None, - batch_valid_length_is_tensor, - batch_is_default, - ) - - if self.use_past: - _check_shape_equal(F.shape(init_reset), "init_reset", self.cls_name, [1]) - _check_input_dtype( - F.dtype(init_reset), "init_reset", [mstype.bool_], self.cls_name - ) - _check_shape_equal( - F.shape(batch_valid_length), - "batch_valid_length", - self.cls_name, - [self.batch_size], - ) - _check_input_dtype( - F.dtype(batch_valid_length), - "batch_valid_length", - [mstype.int32], - self.cls_name, - ) - return True - - -class TransformerDecoderLayer(Cell): - - @_LogActionOnce( - logger=logger, - key="TransformerDecoderLayer", - no_warning=_get_parallel_mode() in (ParallelMode.STAND_ALONE,), - ) - @_args_type_validator_check( - batch_size=Validator.check_positive_int, - hidden_size=Validator.check_positive_int, - num_heads=Validator.check_positive_int, - ffn_hidden_size=Validator.check_positive_int, - src_seq_length=Validator.check_positive_int, - tgt_seq_length=Validator.check_positive_int, - attention_dropout_rate=Validator.check_non_negative_float, - hidden_dropout_rate=Validator.check_non_negative_float, - hidden_act=_valid_type_checks([str], "TransformerDecoderLayer"), - post_layernorm_residual=Validator.check_bool, - layernorm_compute_type=_valid_value_checks( - [mstype.float32, mstype.float16], "TransformerDecoderLayer" - ), - softmax_compute_type=_valid_value_checks( - [mstype.float32, mstype.float16], "TransformerDecoderLayer" - ), - param_init_type=_valid_value_checks( - 
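# The "For ABINet, the following paragraph needs to be revised" edits above
# appear to switch the layer from MindSpore's pre-LayerNorm ordering to the
# post-LayerNorm ordering of PyTorch's nn.TransformerEncoderLayer: each
# residual sum is normalized after the add instead of normalizing the sublayer
# input. Schematically, with stand-in sublayers:
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def post_ln_block(x, attn, ffn):           # revised ordering used above
    x = layer_norm(x + attn(x))            # residual add, then norm
    return layer_norm(x + ffn(x))

def pre_ln_block(x, attn, ffn):            # stock MindSpore ordering (commented out)
    x = x + attn(layer_norm(x))            # norm, then sublayer, then residual add
    return x + ffn(layer_norm(x))

x = np.random.rand(2, 256, 512).astype(np.float32)
y = post_ln_block(x, lambda t: 0.1 * t, lambda t: 0.1 * t)
assert y.shape == x.shape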
[mstype.float32, mstype.float16], "TransformerDecoderLayer" - ), - parallel_config=_valid_type_checks( - [OpParallelConfig, MoEParallelConfig], "TransformerDecoderLayer" - ), - use_past=Validator.check_bool, - ) - def __init__( - self, - hidden_size, - ffn_hidden_size, - num_heads, - batch_size, - src_seq_length, - tgt_seq_length, - attention_dropout_rate=0.1, - hidden_dropout_rate=0.1, - post_layernorm_residual=False, - use_past=False, - layernorm_compute_type=mstype.float32, - softmax_compute_type=mstype.float32, - param_init_type=mstype.float32, - hidden_act="gelu", - moe_config=default_moe_config, - parallel_config=default_dpmp_config, - ): - super(TransformerDecoderLayer, self).__init__() - if ( - _get_parallel_mode() in (ParallelMode.AUTO_PARALLEL,) - and _is_sharding_propagation() - ): - _check_config(parallel_config) - if num_heads % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerDecoderLayer', the class variable 'num_heads' must be divisibled by " - "'parallel_config.model_parallel', but got the num_heads is {} and " - "parallel_config.model_parallel is {}.".format( - num_heads, parallel_config.model_parallel - ) - ) - if hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerDecoderLayer', the class variable 'hidden_size' must be divisibled by " - "'parallel_config.model_parallel', but got the hidden_size is {} and " - "parallel_config.model_parallel is {}.".format( - hidden_size, parallel_config.model_parallel - ) - ) - if ffn_hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerDecoderLayer', the class variable 'ffn_hidden_size' must be " - "divisibled by 'parallel_config.model_parallel', but got the ffn_hidden_size is {} " - "and parallel_config.model_parallel is {}.".format( - ffn_hidden_size, parallel_config.model_parallel - ) - ) - _check_moe_config(moe_config, parallel_config) - self.use_moe = moe_config.expert_num > 1 - if use_past: - raise ValueError(f"The {self.cls_name} does not support use_past=True.") - self.batch_size = batch_size - self.use_past = use_past - self.softmax_compute_type = softmax_compute_type - - self.src_seq_length = src_seq_length - self.tgt_seq_length = tgt_seq_length - self.use_past = use_past - self.hidden_size = hidden_size - - # For ABINet, the following paragraph needs to be revised - # self.layernorm1 = _LayerNorm((hidden_size,)).to_float(layernorm_compute_type) - self.layernorm2 = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - parallel_config_args = ( - parallel_config.dpmp if self.use_moe else parallel_config - ) - # self.attention = MultiHeadAttention(hidden_size=hidden_size, - # num_heads=num_heads, - # batch_size=batch_size, - # src_seq_length=tgt_seq_length, - # tgt_seq_length=tgt_seq_length, - # hidden_dropout_rate=hidden_dropout_rate, - # attention_dropout_rate=attention_dropout_rate, - # use_past=use_past, - # softmax_compute_type=softmax_compute_type, - # param_init_type=param_init_type, - # parallel_config=parallel_config_args) - - # Cross attention with the output of encoder as memory tensor - self.cross_attention = MultiHeadAttention( - hidden_size=hidden_size, - num_heads=num_heads, - batch_size=batch_size, - src_seq_length=tgt_seq_length, - tgt_seq_length=src_seq_length, - hidden_dropout_rate=hidden_dropout_rate, - attention_dropout_rate=attention_dropout_rate, - softmax_compute_type=softmax_compute_type, - use_past=use_past, - param_init_type=param_init_type, - parallel_config=parallel_config_args, - ) - 
self.cross_attention_layernorm = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - - if self.use_moe: - self.output = MoE( - hidden_size=hidden_size, - dropout_rate=hidden_dropout_rate, - ffn_hidden_size=ffn_hidden_size, - param_init_type=param_init_type, - hidden_act=hidden_act, - moe_config=moe_config, - parallel_config=parallel_config, - ) - else: - # Feed Forward Network, FFN - self.output = FeedForward( - hidden_size=hidden_size, - dropout_rate=hidden_dropout_rate, - ffn_hidden_size=ffn_hidden_size, - hidden_act=hidden_act, - param_init_type=param_init_type, - parallel_config=parallel_config, - ) - self.post_layernorm_residual = post_layernorm_residual - self.add = P.Add() - self.add_3d = P.Add() - self.dtype = mstype.float16 - self.key_past = None - self.value_past = None - if self.use_past: - # operator used for state reuse - self.reducesum = P.ReduceSum().shard(((1, 1, 1, 1),)) - self.not_equal = P.NotEqual().shard(((1, 1, 1, 1), ())) - self.slice = P.StridedSlice().shard(((1, 1, 1, 1),)) - size_per_head = int(hidden_size / num_heads) - self.key_shape = (batch_size, num_heads, size_per_head, tgt_seq_length) - self.value_shape = ( - batch_size, - num_heads, - tgt_seq_length, - size_per_head, - ) - # parameters saving key and value states - self.key_past = Parameter( - Tensor(np.zeros(shape=self.key_shape), self.dtype), name="key_past" - ) - self.value_past = Parameter( - Tensor(np.zeros(shape=self.value_shape), self.dtype), - name="value_past", - ) - self.tile = P.Tile().shard(((1, 1),)) - self.mul = P.Mul().shard(((1, 1, 1, 1), (1,))) - self.assign = P.Assign().shard(((1, 1, 1, 1), (1, 1, 1, 1))) - elif _get_parallel_mode() not in (ParallelMode.AUTO_PARALLEL,): - _check_config(parallel_config) - if num_heads % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerDecoderLayer', the class variable 'num_heads' must be divisibled by " - "'parallel_config.model_parallel', but got the num_heads is {} and " - "parallel_config.model_parallel is {}.".format( - num_heads, parallel_config.model_parallel - ) - ) - if hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerDecoderLayer', the class variable 'hidden_size' must be divisibled by " - "'parallel_config.model_parallel', but got the hidden_size is {} and " - "parallel_config.model_parallel is {}.".format( - hidden_size, parallel_config.model_parallel - ) - ) - if ffn_hidden_size % parallel_config.model_parallel != 0: - raise ValueError( - "For 'TransformerDecoderLayer', the class variable 'ffn_hidden_size' must be " - "divisibled by 'parallel_config.model_parallel', but got the ffn_hidden_size is {} " - "and parallel_config.model_parallel is {}.".format( - ffn_hidden_size, parallel_config.model_parallel - ) - ) - _check_moe_config(moe_config, parallel_config) - self.use_moe = moe_config.expert_num > 1 - if use_past: - raise ValueError(f"The {self.cls_name} does not support use_past=True.") - self.batch_size = batch_size - self.use_past = use_past - self.softmax_compute_type = softmax_compute_type - - self.src_seq_length = src_seq_length - self.tgt_seq_length = tgt_seq_length - self.use_past = use_past - self.hidden_size = hidden_size - - # self.layernorm1 = _LayerNorm((hidden_size,)).to_float(layernorm_compute_type) - # self.layernorm1.shard(((parallel_config.data_parallel, 1),)) - self.layernorm2 = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - self.layernorm2.shard(((parallel_config.data_parallel, 1),)) - parallel_config_args = ( - 
parallel_config.dpmp if self.use_moe else parallel_config - ) - # self.attention = MultiHeadAttention(hidden_size=hidden_size, - # num_heads=num_heads, - # batch_size=batch_size, - # src_seq_length=tgt_seq_length, - # tgt_seq_length=tgt_seq_length, - # hidden_dropout_rate=hidden_dropout_rate, - # attention_dropout_rate=attention_dropout_rate, - # use_past=use_past, - # softmax_compute_type=softmax_compute_type, - # param_init_type=param_init_type, - # parallel_config=parallel_config_args) - - # Cross attention with the output of encoder as memory tensor - self.cross_attention = MultiHeadAttention( - hidden_size=hidden_size, - num_heads=num_heads, - batch_size=batch_size, - src_seq_length=tgt_seq_length, - tgt_seq_length=src_seq_length, - hidden_dropout_rate=hidden_dropout_rate, - attention_dropout_rate=attention_dropout_rate, - softmax_compute_type=softmax_compute_type, - use_past=use_past, - param_init_type=param_init_type, - parallel_config=parallel_config_args, - ) - self.cross_attention_layernorm = _LayerNorm((hidden_size,)).to_float( - layernorm_compute_type - ) - self.cross_attention_layernorm.shard(((parallel_config.data_parallel, 1),)) - - if self.use_moe: - self.output = MoE( - hidden_size=hidden_size, - dropout_rate=hidden_dropout_rate, - ffn_hidden_size=ffn_hidden_size, - param_init_type=param_init_type, - hidden_act=hidden_act, - moe_config=moe_config, - parallel_config=parallel_config, - ) - else: - # Feed Forward Network, FFN - self.output = FeedForward( - hidden_size=hidden_size, - dropout_rate=hidden_dropout_rate, - ffn_hidden_size=ffn_hidden_size, - hidden_act=hidden_act, - param_init_type=param_init_type, - parallel_config=parallel_config, - ) - self.post_layernorm_residual = post_layernorm_residual - self.add = P.Add().shard( - ((parallel_config.data_parallel, 1), (parallel_config.data_parallel, 1)) - ) - self.add_3d = P.Add().shard( - ( - (parallel_config.data_parallel, 1, 1), - (parallel_config.data_parallel, 1, 1), - ) - ) - self.dtype = mstype.float16 - self.key_past = None - self.value_past = None - if self.use_past: - # operator used for state reuse - self.reducesum = P.ReduceSum().shard(((1, 1, 1, 1),)) - self.not_equal = P.NotEqual().shard(((1, 1, 1, 1), ())) - self.slice = P.StridedSlice().shard(((1, 1, 1, 1),)) - size_per_head = int(hidden_size / num_heads) - self.key_shape = (batch_size, num_heads, size_per_head, tgt_seq_length) - self.value_shape = ( - batch_size, - num_heads, - tgt_seq_length, - size_per_head, - ) - # parameters saving key and value states - self.key_past = Parameter( - Tensor(np.zeros(shape=self.key_shape), self.dtype), name="key_past" - ) - self.value_past = Parameter( - Tensor(np.zeros(shape=self.value_shape), self.dtype), - name="value_past", - ) - self.tile = P.Tile().shard(((1, 1),)) - self.mul = P.Mul().shard(((1, 1, 1, 1), (1,))) - self.assign = P.Assign().shard(((1, 1, 1, 1), (1, 1, 1, 1))) - else: - raise RuntimeError( - f"The {self.cls_name} only support sharding propagation or " - f"semi-auto parallel mode now." 
- ) - - def construct( - self, - hidden_stats, - decoder_mask, - encoder_output=None, - memory_mask=None, - init_reset=True, - batch_valid_length=None, - ): - self._check_input( - hidden_stats, - decoder_mask, - encoder_output, - memory_mask, - init_reset, - batch_valid_length, - ) - # the returned shape is [bs, seq_length, embedding_size] or [bs * seq_length, embedding_size] - hidden_shape = F.shape(hidden_stats) - hidden_stats = F.reshape(hidden_stats, (-1, hidden_shape[-1])) - # input_x = self.layernorm1(hidden_stats) - # input_x = F.cast(input_x, self.dtype) - - # # indicate whether reset saved states - # key_reset = None - # value_reset = None - # if self.use_past: - # # reset states, init_reset True for reuse and False for reset - # key_reset = self.assign(self.key_past, self.mul(self.key_past, F.cast(init_reset, self.dtype))) - # value_reset = self.assign(self.value_past, self.mul(self.value_past, F.cast(init_reset, self.dtype))) - # # add dependency for desired execution order - # input_x = F.depend(input_x, key_reset) - # input_x = F.depend(input_x, value_reset) - - # attention, layer_present = self.attention(input_x, input_x, input_x, decoder_mask, self.key_past, - # self.value_past, batch_valid_length) - # # For post-layernorm the inputs for residual path are output of self-attention and output of layernorm - # if self.post_layernorm_residual: - # x = self.add(input_x, attention) - # # For pre-layernorm the inputs for residual path are output of self-attention and input of this layer - # else: - # x = self.add(hidden_stats, attention) - x = hidden_stats # 为了适应pytorch没有调用self.attn - middle_output = None - if encoder_output is not None: - # middle_output = self.cross_attention_layernorm(x) - middle_output = F.cast(x, self.dtype) - encoder_output = F.cast(encoder_output, self.dtype) - cross_attn_output, cross_layer_present = self.cross_attention( - middle_output, - encoder_output, - encoder_output, - memory_mask, - self.key_past, - self.value_past, - batch_valid_length, - ) - # layer_present += cross_layer_present - # if self.post_layernorm_residual: - # x = self.add(middle_output, cross_attn_output) - # else: - # x = self.add(x, cross_attn_output) - x = self.add(x, cross_attn_output) - x = self.cross_attention_layernorm(x) - - # output_x = self.layernorm2(x) - output_x = F.cast(x, self.dtype) - aux_loss = None - if self.use_moe: - mlp_logit, aux_loss = self.output(output_x) - else: - mlp_logit = self.output(output_x) - - value_update = None - key_update = None - # if self.use_past: - # # current key and value - # key_present, value_present = layer_present - # # update key and value calculated this step - # key_update = self.assign(self.key_past, key_present) - # value_update = self.assign(self.value_past, value_present) - # # add dependency for desired execution order - # key_update = F.depend(key_update, key_reset) - # value_update = F.depend(value_update, value_reset) - - # add dependency for desired execution order - mlp_logit = F.depend(mlp_logit, value_update) - mlp_logit = F.depend(mlp_logit, key_update) - - # if shape is 3d, we reshape the inputs of the add - if len(hidden_shape) == 3: - output_x = P.Reshape()(output_x, hidden_shape) - mlp_logit = P.Reshape()(mlp_logit, hidden_shape) - x = P.Reshape()(x, hidden_shape) - - # if self.post_layernorm_residual: - # output = self.add_3d(output_x, mlp_logit) - # else: - # output = self.add_3d(x, mlp_logit) - output = self.add_3d(x, mlp_logit) - output = self.layernorm2(output) - - else: - # if self.post_layernorm_residual: - # 
output = self.add(output_x, mlp_logit) - # else: - # output = self.add(x, mlp_logit) - # output = F.reshape(output, hidden_shape) - output = self.add(x, mlp_logit) - output = self.layernorm2(output) - - # if self.use_moe: - # return output, layer_present, aux_loss - return output - - def _check_input( - self, - hidden_states, - attention_mask, - encoder_output, - memory_mask, - init_reset, - batch_valid_length, - ): - r"""Check inputs""" - if not self.use_past or (self.use_past and self.is_first_iteration): - _check_shape_equal( - F.shape(hidden_states), - "hidden_states", - self.cls_name, - [ - [self.batch_size, self.tgt_seq_length, self.hidden_size], - [self.batch_size * self.tgt_seq_length, self.hidden_size], - ], - ) - _check_shape_equal( - F.shape(attention_mask), - "attention_mask", - self.cls_name, - [self.batch_size, self.tgt_seq_length, self.tgt_seq_length], - ) - - else: - _check_shape_equal( - F.shape(hidden_states), - "hidden_states", - self.cls_name, - [self.batch_size, 1, self.hidden_size], - ) - _check_shape_equal( - F.shape(attention_mask), - "attention_mask", - self.cls_name, - [self.batch_size, 1, self.tgt_seq_length], - ) - _check_input_dtype( - F.dtype(hidden_states), - "hidden_states", - [mstype.float32, mstype.float16], - self.cls_name, - ) - _check_input_dtype( - F.dtype(attention_mask), - "attention_mask", - [mstype.float32, mstype.float16], - self.cls_name, - ) - if encoder_output is not None: - _check_shape_equal( - F.shape(encoder_output), - "encoder_output", - self.cls_name, - [ - [self.batch_size, self.src_seq_length, self.hidden_size], - [self.batch_size * self.src_seq_length, self.hidden_size], - ], - ) - _check_input_dtype( - F.dtype(encoder_output), - "encoder_output", - [mstype.float32, mstype.float16], - self.cls_name, - ) - if memory_mask is not None: - _check_shape_equal( - F.shape(memory_mask), - "memory_mask", - self.cls_name, - [self.batch_size, self.tgt_seq_length, self.src_seq_length], - ) - _check_input_dtype( - F.dtype(memory_mask), - "memory_mask", - [mstype.float32, mstype.float16], - self.cls_name, - ) - - init_reset_is_tensor = isinstance(init_reset, Tensor) - init_reset_is_default = init_reset is True - batch_valid_length_is_tensor = isinstance(batch_valid_length, Tensor) - batch_is_default = batch_valid_length is None - _check_past_none_input_none( - self.use_past, - "init_reset", - self.cls_name, - True, - init_reset_is_tensor, - init_reset_is_default, - ) - _check_past_none_input_none( - self.use_past, - "batch_valid_length", - self.cls_name, - None, - batch_valid_length_is_tensor, - batch_is_default, - ) - - if self.use_past: - _check_shape_equal(F.shape(init_reset), "init_reset", self.cls_name, [1]) - _check_input_dtype( - F.dtype(init_reset), "init_reset", [mstype.bool_], self.cls_name - ) - _check_shape_equal( - F.shape(batch_valid_length), - "batch_valid_length", - self.cls_name, - [self.batch_size], - ) - _check_input_dtype( - F.dtype(batch_valid_length), - "batch_valid_length", - [mstype.int32], - self.cls_name, - ) - return True - - -class TransformerEncoder(Cell): - - @_LogActionOnce( - logger=logger, - key="TransformerEncoder", - no_warning=_get_parallel_mode() in (ParallelMode.STAND_ALONE,), - ) - @_args_type_validator_check( - batch_size=Validator.check_positive_int, - hidden_size=Validator.check_positive_int, - num_heads=Validator.check_positive_int, - ffn_hidden_size=Validator.check_positive_int, - seq_length=Validator.check_positive_int, - num_layers=Validator.check_positive_int, - offset=Validator.check_non_negative_int, - 
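# In the decoder construct above, self-attention is skipped entirely (the
# Chinese comment "为了适应pytorch没有调用self.attn" says: to match the PyTorch
# version, self.attn is not called), so the positional queries go straight into
# cross-attention over the projected token embeddings -- the cloze step of the
# BCN. A minimal NumPy sketch of masked cross-attention:
import numpy as np

def cross_attention(q, k, v, mask):
    """q: (T, D) queries; k, v: (S, D) memory; mask: (T, S), 1 = visible."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask > 0, scores, -1e9)        # hide padded/self positions
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(26, 512)) for _ in range(3))
out = cross_attention(q, k, v, np.ones((26, 26)))
assert out.shape == (26, 512)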
-        attention_dropout_rate=Validator.check_non_negative_float,
-        hidden_dropout_rate=Validator.check_non_negative_float,
-        hidden_act=_valid_type_checks([str], "TransformerEncoder"),
-        post_layernorm_residual=Validator.check_bool,
-        layernorm_compute_type=_valid_value_checks(
-            [mstype.float32, mstype.float16], "TransformerEncoder"
-        ),
-        softmax_compute_type=_valid_value_checks(
-            [mstype.float32, mstype.float16], "TransformerEncoder"
-        ),
-        param_init_type=_valid_value_checks(
-            [mstype.float32, mstype.float16], "TransformerEncoder"
-        ),
-        parallel_config=_valid_type_checks(
-            [TransformerOpParallelConfig], "TransformerEncoder"
-        ),
-        use_past=Validator.check_bool,
-    )
-    def __init__(
-        self,
-        batch_size,
-        num_layers,
-        hidden_size,
-        ffn_hidden_size,
-        seq_length,
-        num_heads,
-        attention_dropout_rate=0.1,
-        hidden_dropout_rate=0.1,
-        hidden_act="gelu",
-        post_layernorm_residual=False,
-        layernorm_compute_type=mstype.float32,
-        softmax_compute_type=mstype.float32,
-        param_init_type=mstype.float32,
-        lambda_func=None,
-        offset=0,
-        use_past=False,
-        moe_config=default_moe_config,
-        parallel_config=default_transformer_config,
-    ):
-        super(TransformerEncoder, self).__init__()
-        if (
-            _get_parallel_mode() in (ParallelMode.AUTO_PARALLEL,)
-            and _is_sharding_propagation()
-        ):
-            _check_config(parallel_config)
-            _check_moe_config(moe_config, parallel_config)
-            self.use_moe = moe_config.expert_num > 1
-            self.add = P.Add()
-            self.aux_loss = Tensor(0.0, mstype.float32)
-            self.num_layers = num_layers
-            self.blocks = nn.CellList()
-            parallel_config_args = (
-                parallel_config.moe_parallel_config
-                if self.use_moe
-                else parallel_config.dp_mp_config
-            )
-            for i in range(num_layers):
-                block = TransformerEncoderLayer(
-                    hidden_size=hidden_size,
-                    batch_size=batch_size,
-                    ffn_hidden_size=ffn_hidden_size,
-                    seq_length=seq_length,
-                    attention_dropout_rate=attention_dropout_rate,
-                    hidden_dropout_rate=hidden_dropout_rate,
-                    layernorm_compute_type=layernorm_compute_type,
-                    softmax_compute_type=softmax_compute_type,
-                    num_heads=num_heads,
-                    hidden_act=hidden_act,
-                    post_layernorm_residual=post_layernorm_residual,
-                    param_init_type=param_init_type,
-                    use_past=use_past,
-                    moe_config=moe_config,
-                    parallel_config=parallel_config_args,
-                )
-                # If the user doesn't pass the fusion function, use the default one
-                if not lambda_func:
-                    lambda_func = _get_lambda_func()
-
-                lambda_func(
-                    block,
-                    layer_id=i,
-                    layers=num_layers,
-                    offset=offset,
-                    parallel_config=parallel_config,
-                )
-                self.blocks.append(block)
-        elif _get_parallel_mode() not in (ParallelMode.AUTO_PARALLEL,):
-            _check_config(parallel_config)
-            _check_moe_config(moe_config, parallel_config)
-            self.use_moe = moe_config.expert_num > 1
-            self.add = P.Add().shard(((), ()))
-            self.aux_loss = Tensor(0.0, mstype.float32)
-            logger.warning(
-                "For parallel mode, sharding propagation is recommended, you can use it by setting "
-                "'set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, "
-                'search_mode="sharding_propagation")\' and '
-                "'set_algo_parameters(elementwise_op_strategy_follow=False, fully_use_devices=False)'"
-            )
-            self.num_layers = num_layers
-            self.blocks = nn.CellList()
-            parallel_config_args = (
-                parallel_config.moe_parallel_config
-                if self.use_moe
-                else parallel_config.dp_mp_config
-            )
-            for i in range(num_layers):
-                block = TransformerEncoderLayer(
-                    hidden_size=hidden_size,
-                    batch_size=batch_size,
-                    ffn_hidden_size=ffn_hidden_size,
-                    seq_length=seq_length,
-                    attention_dropout_rate=attention_dropout_rate,
-                    hidden_dropout_rate=hidden_dropout_rate,
-                    layernorm_compute_type=layernorm_compute_type,
-                    softmax_compute_type=softmax_compute_type,
-                    num_heads=num_heads,
-                    hidden_act=hidden_act,
-                    post_layernorm_residual=post_layernorm_residual,
-                    param_init_type=param_init_type,
-                    use_past=use_past,
-                    moe_config=moe_config,
-                    parallel_config=parallel_config_args,
-                )
-                # If the user doesn't pass the fusion function, use the default one
-                if not lambda_func:
-                    lambda_func = _get_lambda_func()
-
-                lambda_func(
-                    block,
-                    layer_id=i,
-                    layers=num_layers,
-                    offset=offset,
-                    parallel_config=parallel_config,
-                )
-                self.blocks.append(block)
-        else:
-            raise RuntimeError(
-                f"The {self.cls_name} only support sharding propagation or "
-                f"semi-auto parallel mode now."
-            )
-
-    def construct(
-        self, hidden_states, attention_mask, init_reset=True, batch_valid_length=None
-    ):
-
-        # if self.use_moe:
-        #     accum_loss = self.aux_loss
-        #     for i in range(self.num_layers):
-        #         hidden_states, present, aux_loss = self.blocks[i](hidden_states,
-        #                                                           attention_mask,
-        #                                                           init_reset,
-        #                                                           batch_valid_length)
-        #         present_layer = present_layer + (present,)
-        #         accum_loss = self.add(accum_loss, aux_loss)
-        #     return hidden_states, present_layer, accum_loss
-
-        for i in range(self.num_layers):
-            hidden_states = self.blocks[i](
-                hidden_states, attention_mask, init_reset, batch_valid_length
-            )
-            # present_layer = present_layer + (present,)
-
-        return hidden_states
-
-
-class TransformerDecoder(Cell):
-
-    @_LogActionOnce(
-        logger=logger,
-        key="TransformerDecoder",
-        no_warning=_get_parallel_mode() in (ParallelMode.STAND_ALONE,),
-    )
-    @_args_type_validator_check(
-        batch_size=Validator.check_positive_int,
-        hidden_size=Validator.check_positive_int,
-        num_heads=Validator.check_positive_int,
-        ffn_hidden_size=Validator.check_positive_int,
-        src_seq_length=Validator.check_positive_int,
-        num_layers=Validator.check_positive_int,
-        tgt_seq_length=Validator.check_positive_int,
-        offset=Validator.check_non_negative_int,
-        attention_dropout_rate=Validator.check_non_negative_float,
-        hidden_dropout_rate=Validator.check_non_negative_float,
-        hidden_act=_valid_type_checks([str], "TransformerDecoder"),
-        post_layernorm_residual=Validator.check_bool,
-        layernorm_compute_type=_valid_value_checks(
-            [mstype.float32, mstype.float16], "TransformerDecoder"
-        ),
-        softmax_compute_type=_valid_value_checks(
-            [mstype.float32, mstype.float16], "TransformerDecoder"
-        ),
-        param_init_type=_valid_value_checks(
-            [mstype.float32, mstype.float16], "TransformerDecoder"
-        ),
-        parallel_config=_valid_type_checks(
-            [TransformerOpParallelConfig], "TransformerDecoder"
-        ),
-        use_past=Validator.check_bool,
-    )
-    def __init__(
-        self,
-        num_layers,
-        batch_size,
-        hidden_size,
-        ffn_hidden_size,
-        src_seq_length,
-        tgt_seq_length,
-        num_heads,
-        attention_dropout_rate=0.1,
-        hidden_dropout_rate=0.1,
-        post_layernorm_residual=False,
-        layernorm_compute_type=mstype.float32,
-        softmax_compute_type=mstype.float32,
-        param_init_type=mstype.float32,
-        hidden_act="gelu",
-        lambda_func=None,
-        use_past=False,
-        offset=0,
-        moe_config=default_moe_config,
-        parallel_config=default_transformer_config,
-    ):
-        super(TransformerDecoder, self).__init__()
-        if (
-            _get_parallel_mode() in (ParallelMode.AUTO_PARALLEL,)
-            and _is_sharding_propagation()
-        ):
-            _check_config(parallel_config)
-
-            self.add = P.Add()
-            self.aux_loss = Tensor(0.0, mstype.float32)
-            self.num_layers = num_layers
-            self.blocks = nn.CellList()
-            _check_moe_config(moe_config, parallel_config)
-            self.use_moe = moe_config.expert_num > 1
-            parallel_config_args = (
-                parallel_config.moe_parallel_config
-                if self.use_moe
-                else parallel_config.dp_mp_config
-            )
-            for i in range(num_layers):
-                block = TransformerDecoderLayer(
-                    hidden_size=hidden_size,
-                    batch_size=batch_size,
-                    ffn_hidden_size=ffn_hidden_size,
-                    src_seq_length=src_seq_length,
-                    tgt_seq_length=tgt_seq_length,
-                    attention_dropout_rate=attention_dropout_rate,
-                    hidden_dropout_rate=hidden_dropout_rate,
-                    num_heads=num_heads,
-                    layernorm_compute_type=layernorm_compute_type,
-                    softmax_compute_type=softmax_compute_type,
-                    hidden_act=hidden_act,
-                    use_past=use_past,
-                    param_init_type=param_init_type,
-                    post_layernorm_residual=post_layernorm_residual,
-                    moe_config=moe_config,
-                    parallel_config=parallel_config_args,
-                )
-                # If the user doesn't pass the fusion function, use the default one
-                if not lambda_func:
-                    lambda_func = _get_lambda_func()
-
-                lambda_func(
-                    block,
-                    layer_id=i,
-                    layers=num_layers,
-                    offset=offset,
-                    parallel_config=parallel_config,
-                )
-
-                self.blocks.append(block)
-        elif _get_parallel_mode() not in (ParallelMode.AUTO_PARALLEL,):
-            _check_config(parallel_config)
-
-            self.add = P.Add().shard(((), ()))
-            self.aux_loss = Tensor(0.0, mstype.float32)
-            logger.warning(
-                "For parallel mode, sharding propagation is recommended, you can use it by setting "
-                "'set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL, "
-                'search_mode="sharding_propagation")\' and '
-                "'set_algo_parameters(elementwise_op_strategy_follow=False, fully_use_devices=False)'"
-            )
-            self.num_layers = num_layers
-            self.blocks = nn.CellList()
-            _check_moe_config(moe_config, parallel_config)
-            self.use_moe = moe_config.expert_num > 1
-            parallel_config_args = (
-                parallel_config.moe_parallel_config
-                if self.use_moe
-                else parallel_config.dp_mp_config
-            )
-            for i in range(num_layers):
-                block = TransformerDecoderLayer(
-                    hidden_size=hidden_size,
-                    batch_size=batch_size,
-                    ffn_hidden_size=ffn_hidden_size,
-                    src_seq_length=src_seq_length,
-                    tgt_seq_length=tgt_seq_length,
-                    attention_dropout_rate=attention_dropout_rate,
-                    hidden_dropout_rate=hidden_dropout_rate,
-                    num_heads=num_heads,
-                    layernorm_compute_type=layernorm_compute_type,
-                    softmax_compute_type=softmax_compute_type,
-                    hidden_act=hidden_act,
-                    use_past=use_past,
-                    param_init_type=param_init_type,
-                    post_layernorm_residual=post_layernorm_residual,
-                    moe_config=moe_config,
-                    parallel_config=parallel_config_args,
-                )
-                # If the user doesn't pass the fusion function, use the default one
-                if not lambda_func:
-                    lambda_func = _get_lambda_func()
-
-                lambda_func(
-                    block,
-                    layer_id=i,
-                    layers=num_layers,
-                    offset=offset,
-                    parallel_config=parallel_config,
-                )
-
-                self.blocks.append(block)
-        else:
-            raise RuntimeError(
-                f"The {self.cls_name} only support sharding propagation or "
-                f"semi-auto parallel mode now."
-            )
-
-    def construct(
-        self,
-        hidden_states,
-        attention_mask,
-        encoder_output=None,
-        memory_mask=None,
-        init_reset=True,
-        batch_valid_length=None,
-    ):
-
-        # if self.use_moe:
-        #     accum_loss = self.aux_loss
-        #     for i in range(self.num_layers):
-        #         hidden_states, present, aux_loss = self.blocks[i](hidden_states,
-        #                                                           attention_mask,
-        #                                                           encoder_output,
-        #                                                           memory_mask,
-        #                                                           init_reset,
-        #                                                           batch_valid_length)
-        #         present_layer = present_layer + (present,)
-        #         accum_loss = self.add(accum_loss, aux_loss)
-        #     return hidden_states, present_layer, accum_loss
-
-        # Loop through each self-attention layer
-        for i in range(self.num_layers):
-            hidden_states = self.blocks[i](
-                hidden_states,
-                attention_mask,
-                encoder_output,
-                memory_mask,
-                init_reset,
-                batch_valid_length,
-            )
-            # present_layer = present_layer + (present,)
-
-        return hidden_states
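
Both deleted classes above follow the same MindSpore stacking pattern: dispatch on the parallel mode, build `num_layers` blocks into an `nn.CellList` (letting an optional `lambda_func(block, layer_id, layers, offset, ...)` assign per-layer fusion or pipeline-stage settings), then thread the activations through every block in `construct`. A minimal sketch of that pattern, with a hypothetical `make_block` factory standing in for `TransformerEncoderLayer`; `StackedEncoder` and `make_block` are illustrative names, not MindOCR or MindSpore APIs:

```python
# Illustrative sketch only; StackedEncoder/make_block are not MindOCR APIs.
import mindspore.nn as nn


class StackedEncoder(nn.Cell):
    def __init__(self, num_layers, make_block, lambda_func=None, offset=0):
        super().__init__()
        self.num_layers = num_layers
        self.blocks = nn.CellList()
        for i in range(num_layers):
            block = make_block()
            if lambda_func is not None:
                # e.g. assign block.pipeline_stage or recompute flags from i + offset
                lambda_func(block, layer_id=i, layers=num_layers, offset=offset)
            self.blocks.append(block)

    def construct(self, hidden_states, attention_mask):
        # the same sequential threading as the deleted construct() methods
        for i in range(self.num_layers):
            hidden_states = self.blocks[i](hidden_states, attention_mask)
        return hidden_states
```
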
diff --git a/mindocr/postprocess/builder.py b/mindocr/postprocess/builder.py
index e0f80da60..63a942a07 100644
--- a/mindocr/postprocess/builder.py
+++ b/mindocr/postprocess/builder.py
@@ -1,4 +1,5 @@
-from . import (  # rec_abinet_postprocess,
+# flake8: noqa: F405
+from . import (
     cls_postprocess,
     det_db_postprocess,
     det_east_postprocess,
@@ -6,7 +7,6 @@
     kie_re_postprocess,
     kie_ser_postprocess,
     layout_postprocess,
-    rec_abinet_postprocess,
     rec_postprocess,
     table_postprocess,
 )
@@ -17,7 +17,6 @@
 from .kie_re_postprocess import VQAReTokenLayoutLMPostProcess
 from .kie_ser_postprocess import VQASerTokenLayoutLMPostProcess
 from .layout_postprocess import *
-from .rec_abinet_postprocess import *
 from .rec_postprocess import *
 from .table_postprocess import *
@@ -34,7 +33,6 @@
     "VisionLANPostProcess": VisionLANPostProcess,
     "SARLabelDecode": SARLabelDecode,
     "ClsPostprocess": ClsPostprocess,
-    "ABINetLabelDecode": ABINetLabelDecode,
     "VQASerTokenLayoutLMPostProcess": VQASerTokenLayoutLMPostProcess,
     "VQAReTokenLayoutLMPostProcess": VQAReTokenLayoutLMPostProcess,
     "YOLOv8Postprocess": YOLOv8Postprocess,
diff --git a/mindocr/postprocess/rec_abinet_postprocess.py b/mindocr/postprocess/rec_abinet_postprocess.py
deleted file mode 100644
index 3ed1db42d..000000000
--- a/mindocr/postprocess/rec_abinet_postprocess.py
+++ /dev/null
@@ -1,75 +0,0 @@
-"""
-"""
-from typing import Union
-
-import numpy as np
-
-import mindspore as ms
-from mindspore import Tensor
-
-from ..models.utils.abinet_layers import CharsetMapper
-
-__all__ = ["ABINetLabelDecode"]
-
-
-class ABINetLabelDecode(object):
-    def __init__(
-        self,
-        lower=False,
-    ):
-        self.space_idx = None
-        self.lower = lower
-        self.charset = CharsetMapper(max_length=26)
-
-    def decode(self, logit):
-        """Greed decode"""
-        # TODO: test running time and decode on GPU
-        ms_softmax = ms.ops.Softmax(axis=2)
-        out = ms_softmax(logit)
-        pt_text, pt_scores, pt_lengths = [], [], []
-        for o in out:
-            text = self.charset.get_text(o.argmax(axis=1), padding=False, trim=False)
-            text = text.split(self.charset.null_char)[0]  # end at end-token
-            pt_text.append(text)
-            pt_scores.append(o.max(axis=1)[0])
-            pt_lengths.append(min(len(text) + 1, 26))  # one for end-token
-        ms_stack = ms.ops.Stack()
-        pt_scores = ms_stack(pt_scores)
-        pt_lengths = ms.Tensor(pt_lengths, dtype=ms.int64)
-
-        return pt_text, pt_scores, pt_lengths
-
-    def __call__(self, preds: Union[Tensor, np.ndarray], labels=None, **kwargs):
-        """
-        Args:
-            preds (Union[Tensor, np.ndarray]): network prediction, class probabilities in shape [BS, W, num_classes],
-                where W is the sequence length.
-            labels: optional
-        Return:
-            texts (List[Tuple]): list of string
-
-        """
-        logits = preds
-        pt_text, pt_scores, pt_lengths_ = self.decode(logits)
-        pt_text = [self.charset.trim(t) for t in pt_text]
-        return {"texts": pt_text}
-
-    def _get_output(self, last_output):
-        if isinstance(last_output, (tuple, list)):
-            for res in last_output:
-                for i in range(len(res)):
-                    if len(res) == 3:
-                        if res[i]["name"] == "alignment":
-                            output = res[i]
-        else:
-            output = last_output
-        return output
-
-    def _update_output(self, last_output, items):
-        if isinstance(last_output, (tuple, list)):
-            res = last_output
-            if res[3] == "alignment":
-                res.update(items)
-        else:
-            last_output.update(items)
-        return last_output
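
The deleted `ABINetLabelDecode.decode` is a greedy decoder: softmax over the class axis of `[BS, W, num_classes]` logits, argmax per time step, truncation at the charset's end token. A rough numpy equivalent is sketched below; `greedy_decode` is illustrative, and `charset` (a plain id-to-character sequence) and `null_char` (the end-token character) are stand-ins for the removed `CharsetMapper`:

```python
# Hedged sketch of the removed greedy decode; names here are illustrative.
import numpy as np


def greedy_decode(logits, charset, null_char, max_length=26):
    """logits: [BS, W, num_classes] -> (texts, per-step scores, lengths)."""
    # numerically stable softmax over the class axis
    probs = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs /= probs.sum(axis=2, keepdims=True)
    texts, scores, lengths = [], [], []
    for p in probs:  # p: [W, num_classes], one sample at a time
        text = "".join(charset[i] for i in p.argmax(axis=1))
        text = text.split(null_char)[0]  # keep everything before the end token
        texts.append(text)
        scores.append(p.max(axis=1))  # confidence of each decoded step
        lengths.append(min(len(text) + 1, max_length))  # one extra for the end token
    return texts, scores, lengths
```

As the deleted `TODO` comment hints, per-sample decoding on device was a known cost; a host-side numpy pass like this is one common workaround.
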
diff --git a/tests/ut/test_models.py b/tests/ut/test_models.py
index 88a2fe576..367523229 100644
--- a/tests/ut/test_models.py
+++ b/tests/ut/test_models.py
@@ -22,7 +22,6 @@
     "configs/rec/svtr/svtr_tiny.yaml",
     "configs/rec/visionlan/visionlan_resnet45_LF_1.yaml",
     "configs/cls/mobilenetv3/cls_mv3.yaml",
-    "configs/rec/abinet/abinet_resnet45_en.yaml",
 ]
 
 print("All config yamls: ", all_yamls)
diff --git a/tools/data_for_export_convert.py b/tools/data_for_export_convert.py
index 8685ece1e..5937e6368 100644
--- a/tools/data_for_export_convert.py
+++ b/tools/data_for_export_convert.py
@@ -1,10 +1,4 @@
 data_converte_static_model_from_download_mindir = {
-    "abinet": {
-        "mindir_url": "None",
-        "mindir_name": "None",
-        "data_shape": "x:[96,3,32,128]",
-        "infer_shape_list": ["96,3,32,128"],
-    },
     "master_resnet31": {
         "mindir_url": "https://download-mindspore.osinfra.cn/toolkits/mindocr/master/"
         + "master_resnet31_ascend-e7bfbc97-b724ed55.mindir",
@@ -156,11 +150,6 @@
 data_converte_static_model_from_exported_mindir = {
-    "abinet": {
-        "data_shape": "args0:[96,3,32,128]",
-        "mindir_name": "abinet.mindir",
-        "infer_shape_list": ["96,3,32,128"],
-    },
     "master_resnet31": {"data_shape": "args0:[1,3,48,160]", "mindir_name": "master_resnet31.mindir"},
     "cls_mobilenet_v3_small_100_model": {
         "data_shape": "args0:[1,3,48,192]",
@@ -271,11 +260,6 @@
 data_converte_dynamic_model_from_exported_mindir = {
-    "abinet": {
-        "data_shape": "args0:[96,3,32,-1]",
-        "mindir_name": "abinet.mindir",
-        "infer_shape_list": ["96,3,32,128", "96,3,32,144"],
-    },
     "master_resnet31": {
         "data_shape": "args0:[-1,3,-1,-1]",
         "mindir_name": "master_resnet31.mindir",
@@ -390,7 +374,6 @@
 data_export_static_model = {
-    "abinet": {"model_name": "abinet", "data_shape_h_w": [32, 128]},
     "master_resnet31": {"model_name": "master_resnet31", "data_shape_h_w": [32, 100]},
     "cls_mobilenet_v3_small_100_model": {"model_name": "cls_mobilenet_v3_small_100_model", "data_shape_h_w": [48, 192]},
     "crnn_resnet34": {"model_name": "crnn_resnet34", "data_shape_h_w": [32, 100]},
@@ -417,7 +400,6 @@
 data_export_dynamic_model = {
-    "abinet": {"model_name": "abinet", "model_type": "rec"},
     "master_resnet31": {"model_name": "master_resnet31", "model_type": "rec"},
     "cls_mobilenet_v3_small_100_model": {"model_name": "cls_mobilenet_v3_small_100_model", "model_type": "cls"},
     "crnn_resnet34": {"model_name": "crnn_resnet34", "model_type": "rec"},
diff --git a/tools/export.py b/tools/export.py
index b8160bc3a..74f49f7b8 100644
--- a/tools/export.py
+++ b/tools/export.py
@@ -68,21 +68,6 @@ def common_exporter(save_dir, name, net, data_shape, is_dynamic_shape, model_typ
     )
 
 
-def abinet_exporter(save_dir, name, net, data_shape, is_dynamic_shape):
-    if is_dynamic_shape:
-        x = ms.Tensor(shape=[96, 3, 32, None], dtype=ms.float32)
-    else:
-        h, w = data_shape
-        bs, c = 96, 3
-        x = ms.Tensor(np.ones([bs, c, h, w]), dtype=ms.float32)
-    output_path = os.path.join(save_dir, name) + ".mindir"
-    ms.export(net, x, file_name=output_path, file_format="MINDIR")
-    logger.info(
-        f"=> Finish exporting mindir file of {name} to {os.path.realpath(output_path)}."
-        f"The data shape (N, C, H, W) is {x.shape}."
-    )
-
-
 def robustscanner_resnet31_exporter(save_dir, name, net, data_shape, is_dynamic_shape):
     if is_dynamic_shape:
         x0 = ms.Tensor(shape=[1, 3, 48, None], dtype=ms.float32)
@@ -147,10 +132,6 @@ def export(model_name_or_config, data_shape, local_ckpt_path, save_dir, is_dynam
     net.set_train(False)
 
-    if name == "abinet":
-        abinet_exporter(save_dir, name, net, data_shape, is_dynamic_shape)
-        return
-
     if name == "robustscanner_resnet31":
         robustscanner_resnet31_exporter(save_dir, name, net, data_shape, is_dynamic_shape)
         return
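
The removed `abinet_exporter` was essentially `common_exporter` specialized to a fixed batch of 96 and a free width axis. The underlying static versus dynamic MindIR export pattern is unchanged and is sketched below, assuming `net` is any trained `mindspore.nn.Cell`; `export_mindir` and its default shapes are illustrative, not MindOCR functions:

```python
# Sketch of static vs. dynamic-shape MINDIR export, mirroring the deleted exporter.
import numpy as np
import mindspore as ms


def export_mindir(net, file_name, dynamic=False, bs=96, c=3, h=32, w=128):
    net.set_train(False)
    if dynamic:
        # None leaves the width axis free, so one MindIR serves several input widths
        x = ms.Tensor(shape=[bs, c, h, None], dtype=ms.float32)
    else:
        x = ms.Tensor(np.ones([bs, c, h, w]), dtype=ms.float32)
    ms.export(net, x, file_name=file_name, file_format="MINDIR")
```
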