diff --git a/PaddlePaddle/Classification/RN50v1.5/Dockerfile b/PaddlePaddle/Classification/RN50v1.5/Dockerfile index b926e4dd1..932dca3c6 100644 --- a/PaddlePaddle/Classification/RN50v1.5/Dockerfile +++ b/PaddlePaddle/Classification/RN50v1.5/Dockerfile @@ -1,4 +1,4 @@ -ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:23.09-py3 +ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:23.12-py3 FROM ${FROM_IMAGE_NAME} ADD requirements.txt /workspace/ diff --git a/PaddlePaddle/Classification/RN50v1.5/README.md b/PaddlePaddle/Classification/RN50v1.5/README.md index 1d856bacf..1e981db73 100644 --- a/PaddlePaddle/Classification/RN50v1.5/README.md +++ b/PaddlePaddle/Classification/RN50v1.5/README.md @@ -17,6 +17,8 @@ achieve state-of-the-art accuracy. The content of this repository is tested and * [Enabling TF32](#enabling-tf32) * [Automatic SParsity](#automatic-sparsity) * [Enable Automatic SParsity](#enable-automatic-sparsity) + * [Quantization aware training](#quantization-aware-training) + * [Enable quantization aware training](#enable-quantization-aware-training) * [Setup](#setup) * [Requirements](#requirements) * [Quick Start Guide](#quick-start-guide) @@ -26,6 +28,7 @@ achieve state-of-the-art accuracy. The content of this repository is tested and * [Dataset guidelines](#dataset-guidelines) * [Training process](#training-process) * [Automatic SParsity training process](#automatic-sparsity-training-process) + * [Quantization aware training process](#quantization-aware-training-process) * [Inference process](#inference-process) * [Performance](#performance) * [Benchmarking](#benchmarking) @@ -128,6 +131,7 @@ This model supports the following features: |[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes | |[Paddle AMP](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/performance_improving/amp_en.html) | Yes | |[Paddle ASP](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/decorate_en.html) | Yes | +|[PaddleSlim QAT](https://paddleslim.readthedocs.io/en/latest/quick_start/quant_aware_tutorial_en.html) | Yes | |[Paddle-TRT](https://github.com/PaddlePaddle/Paddle-Inference-Demo/blob/master/docs/optimize/paddle_trt_en.rst) | Yes | #### Features @@ -139,7 +143,9 @@ with the DALI library. For more information about DALI, refer to the [DALI produ - Paddle ASP is a PaddlePaddle built-in module that provides functions to enable automatic sparsity workflow with only a few code line insertions. The full APIs can be found in [Paddle.static.sparsity](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/calculate_density_en.html). Paddle ASP support, currently, static graph mode only (Dynamic graph support is under development). Refer to the [Enable Automatic SParsity](#enable-automatic-sparsity) section for more details. -- Paddle-TRT is a PaddlePaddle inference integration with [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html). It selects subgraph to be accelerated by TensorRT, while leaving the rest of the operations to be executed natively by PaddlePaddle. Refer to the [Inference with TensorRT](#inference-with-tensorrt) section for more details. +- PaddleSlim is a set of tools based on PaddlePaddle for model acceleration, quantization, pruning, and knowledge distillation. For model quantization, PaddleSlim offers simple and user-friendly APIs for quantization aware training. 
The full APIs can be found in [Quantization aware training](https://paddleslim.readthedocs.io/en/latest/api_en/index_en.html). PaddleSlim currently supports updating gradients and scales simultaneously during quantization aware training (training with fixed scales is still under development). Refer to the [Enable quantization aware training](#enable-quantization-aware-training) section for more details.
+
+- Paddle-TRT is a PaddlePaddle inference integration with [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html). It selects subgraphs to be accelerated by TensorRT, while leaving the rest of the operations to be executed natively by PaddlePaddle. Refer to the [Inference with TensorRT](#inference-with-tensorrt) section for more details.

### DALI

@@ -147,7 +153,7 @@ We use [NVIDIA DALI](https://github.com/NVIDIA/DALI), which speeds up data loading when the CPU becomes a bottleneck. DALI can use CPU or GPU and outperforms the PaddlePaddle native data loader.

-For data loader, we only support DALI as data loader for now.
+For now, DALI is the only data loader we support.

### Mixed precision training

@@ -225,6 +231,30 @@ Moreover, ASP is also compatible with mixed precision training.

Note that currently ASP only supports static graphs (Dynamic graph support is under development).

+### Quantization aware training
+Quantization aware training (QAT) is a technique to train models with awareness of the quantization process. Quantization refers to reducing the precision of numerical values in a model, typically from floating-point to lower-bit fixed-point representations. In QAT, the model is trained to accommodate the effects of quantization, enabling it to maintain accuracy even when deployed with reduced precision.
+Through PaddleSlim QAT, we quantize models in the following steps:
+- Quantize and dequantize the weights and inputs before feeding them into weighted layers (e.g., convolution and fully connected layers).
+- Record the scale of each tensor for use in low-precision inference.
+
+For more information, refer to
+- [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/pdf/2004.09602.pdf)
+
+#### Enable quantization aware training
+PaddlePaddle integrates QAT modules from PaddleSlim, a toolkit for deep learning model compression, to enable QAT training.
+The APIs can quantize a training program and convert it into an INT8 inference model.
+
+```python
+# Insert fake quantize/dequantize OPs into the training program
+quant_program = quanter.quant_aware(program)
+...
+# After training, convert the quantized program into an INT8 inference program
+quant_infer_program = quanter.convert(quant_program)
+```
+
+Detailed information on the QAT APIs can be found in the [quantization aware training tutorial](https://paddleslim.readthedocs.io/en/latest/quick_start/quant_aware_tutorial_en.html).
+
+Moreover, QAT is also compatible with mixed precision training.
+
+
 ## Setup

 The following section lists the requirements you need to meet to start training the ResNet50 model.

@@ -233,7 +263,7 @@ The following section lists the requirements you need to meet to start training

 This repository contains a Dockerfile that extends the CUDA NGC container and encapsulates some dependencies.
Aside from these dependencies, ensure you have the following components: * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) -* [PaddlePaddle 23.09-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer +* [PaddlePaddle 23.12-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer * Supported GPUs: * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/) @@ -295,7 +325,7 @@ nvidia-docker run --rm -it -v :/imagenet --ipc=host --e FLAGS_ ### 5. Start training To run training for a standard configuration (DGXA100, AMP/TF32), -use one of scripts in `scripts/training` to launch training. (Please ensure ImageNet is mounted in the `/imagenet` directory.) +use one of the scripts in `scripts/training` to launch training. (Please ensure ImageNet is mounted in the `/imagenet` directory.) Example: ```bash @@ -390,7 +420,8 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py ### Command-line options: To find the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example: -`python [train.py|export_model.py|inference.py] -h` + +`python train.py -h` ```bash PaddlePaddle RN50v1.5 training script @@ -398,9 +429,11 @@ PaddlePaddle RN50v1.5 training script optional arguments: -h, --help show this help message and exit -Global: - --output-dir OUTPUT_DIR - A path to store trained models. (default: ./output/) +General: + --checkpoint-dir CHECKPOINT_DIR + A path to store trained models. (default: ./checkpoint) + --inference-dir INFERENCE_DIR + A path to store inference model once the training is finished. (default: ./inference/) --run-scope {train_eval,train_only,eval_only} Running scope. It should be one of {train_eval, train_only, eval_only}. (default: train_eval) --epochs EPOCHS The number of epochs for training. (default: 90) @@ -410,11 +443,9 @@ Global: The iteration interval to test trained models on a given validation dataset. Ignored when --run-scope is train_only. (default: 1) --print-interval PRINT_INTERVAL - The iteration interval to show training/evaluation message. (default: 10) + The iteration interval to show a training/evaluation message. (default: 10) --report-file REPORT_FILE - A file in which to store JSON experiment report. (default: ./report.json) - --data-layout {NCHW,NHWC} - Data format. It should be one of {NCHW, NHWC}. (default: NCHW) + A file in which to store JSON experiment reports. (default: ./train.json) --benchmark To enable benchmark mode. (default: False) --benchmark-steps BENCHMARK_STEPS Steps for benchmark run, only be applied when --benchmark is set. (default: 100) @@ -431,7 +462,7 @@ Global: --last-epoch-of-checkpoint LAST_EPOCH_OF_CHECKPOINT The epoch id of the checkpoint given by --from-checkpoint. It should be None, auto or integer >= 0. If it is set as None, then training will start from 0-th epoch. If it is set as auto, then it will search largest integer- - convertable folder --from-checkpoint, which contains required checkpoint. Default is None. (default: None) + convertible folder --from-checkpoint, which contains the required checkpoint. Default is None. (default: None) --show-config SHOW_CONFIG To show arguments. (default: True) --enable-cpu-affinity ENABLE_CPU_AFFINITY @@ -448,7 +479,7 @@ Dataset: --dali-random-seed DALI_RANDOM_SEED The random seed for DALI data loader. 
(default: 42) --dali-num-threads DALI_NUM_THREADS - The number of threads applied to DALI data loader. (default: 4) + The number of threads applied to the DALI data loader. (default: 4) --dali-output-fp16 Output FP16 data from DALI data loader. (default: False) Data Augmentation: @@ -472,6 +503,8 @@ Model: The model architecture name. It should be one of {ResNet50}. (default: ResNet50) --num-of-class NUM_OF_CLASS The number classes of images. (default: 1000) + --data-layout {NCHW,NHWC} + Data format. It should be one of {NCHW, NHWC}. (default: NCHW) --bn-weight-decay Apply weight decay to BatchNorm shift and scale. (default: False) Training: @@ -479,16 +512,16 @@ Training: The ratio of label smoothing. (default: 0.1) --optimizer OPTIMIZER The name of optimizer. It should be one of {Momentum}. (default: Momentum) - --momentum MOMENTUM The momentum value of optimizer. (default: 0.875) + --momentum MOMENTUM The momentum value of an optimizer. (default: 0.875) --weight-decay WEIGHT_DECAY The coefficient of weight decay. (default: 3.0517578125e-05) --lr-scheduler LR_SCHEDULER - The name of learning rate scheduler. It should be one of {Cosine}. (default: Cosine) + The name of the learning rate scheduler. It should be one of {Cosine}. (default: Cosine) --lr LR The initial learning rate. (default: 0.256) --warmup-epochs WARMUP_EPOCHS The number of epochs for learning rate warmup. (default: 5) --warmup-start-lr WARMUP_START_LR - The initial learning rate for warmup. (default: 0.0) + The initial learning rate for warm up. (default: 0.0) Advanced Training: --amp Enable automatic mixed precision training (AMP). (default: False) @@ -503,33 +536,44 @@ Advanced Training: --mask-algo {mask_1d,mask_2d_greedy,mask_2d_best} The algorithm to generate sparse masks. It should be one of {mask_1d, mask_2d_greedy, mask_2d_best}. This only be applied when --asp and --prune-model is set. (default: mask_1d) + --qat Enable quantization aware training (QAT). (default: False) +``` +`python inference.py -h` +```sh Paddle-TRT: --device DEVICE_ID The GPU device id for Paddle-TRT inference. (default: 0) - --trt-inference-dir TRT_INFERENCE_DIR - A path to store/load inference models. export_model.py would export models to this folder, then inference.py - would load from here. (default: ./inference) - --trt-precision {FP32,FP16,INT8} + --inference-dir INFERENCE_DIR + A path to load inference models. (default: ./inference) + --batch-size BATCH_SIZE + The batch size for Paddle-TRT. (default: 256) + --image-shape IMAGE_SHAPE + The image shape. Its shape should be [channel, height, width]. (default: [4, 224, 224]) + --data-layout {NCHW,NHWC} + Data format. It should be one of {NCHW, NHWC}. (default: NCHW) + --precision {FP32,FP16,INT8} The precision of TensorRT. It should be one of {FP32, FP16, INT8}. (default: FP32) - --trt-workspace-size TRT_WORKSPACE_SIZE + --workspace-size WORKSPACE_SIZE The memory workspace of TensorRT in MB. (default: 1073741824) - --trt-min-subgraph-size TRT_MIN_SUBGRAPH_SIZE + --min-subgraph-size MIN_SUBGRAPH_SIZE The minimal subgraph size to enable PaddleTRT. (default: 3) - --trt-use-static TRT_USE_STATIC + --use-static USE_STATIC Fix TensorRT engine at first running. (default: False) - --trt-use-calib-mode TRT_USE_CALIB_MODE + --use-calib-mode USE_CALIB_MODE Use the PTQ calibration of PaddleTRT int8. (default: False) - --trt-export-log-path TRT_EXPORT_LOG_PATH - A file in which to store JSON model exporting report. 
(default: ./export.json)
-  --trt-log-path TRT_LOG_PATH
-                        A file in which to store JSON inference report. (default: ./inference.json)
-  --trt-use-synthetic TRT_USE_SYNTHAT
+  --report-file REPORT_FILE
+                        A file in which to store JSON experiment report. (default: ./inference.json)
+  --use-synthetic USE_SYNTHETIC
                         Apply synthetic data for benchmark. (default: False)
+  --benchmark-steps BENCHMARK_STEPS
+                        Steps for benchmark run, only be applied when --benchmark is set. (default: 100)
+  --benchmark-warmup-steps BENCHMARK_WARMUP_STEPS
+                        Warmup steps for benchmark run, only be applied when --benchmark is set. (default: 100)
+  --show-config SHOW_CONFIG
+                        To show arguments. (default: True)
 ```

-Noted that arguments in Paddle-TRT are only available to `export_model.py` or `inference.py`.
-
 ### Dataset guidelines

 To use your own dataset, divide it in directories as in the following scheme:

@@ -540,15 +584,17 @@ To use your own dataset, divide it in directories as in the following scheme:

 If the number of classes in your dataset is not 1000, you need to specify it to `--num-of-class`.

 ### Training process
-The model will be stored in the directory specified with `--output-dir` and `--model-arch-name`, including three files:
+The checkpoint will be stored in the directory specified with `--checkpoint-dir` and `--model-arch-name`, including three files:
 - `.pdparams`: The parameters contain all the trainable tensors and will save to a file with the suffix “.pdparams”.
-- `.pdopts`: The optimizer information contains all the Tensors used by the optimizer. For Adam optimizer, it contains beta1, beta2, momentum, and so on. All the information will be saved to a file with suffix “.pdopt”. (If the optimizer has no Tensor need to save (like SGD), the file will not be generated).
+- `.pdopt`: The optimizer information contains all the Tensors used by the optimizer. For the Adam optimizer, it contains beta1, beta2, momentum, and so on. All the information will be saved to a file with the suffix “.pdopt”. (If the optimizer has no Tensors to save (like SGD), the file will not be generated.)
 - `.pdmodel`: The network description is the description of the program. It’s only used for deployment. The description will save to a file with the suffix “.pdmodel”.

-The prefix of model files is specified by `--model-prefix`, which default value is `resnet_50_paddle`. Model of each epoch would be stored in directory `./output/ResNet50/epoch_id/` with three files by default, including `resnet_50_paddle.pdparams`, `resnet_50_paddle.pdopts`, `resnet_50_paddle.pdmodel`. Note that `epoch_id` is 0-based, which means `epoch_id` is from 0 to 89 for a total of 90 epochs. For example, the model of the 89th epoch would be stored in `./output/ResNet50/89/resnet_50_paddle`
+The prefix of the model files is specified by `--model-prefix`, whose default value is `resnet_50_paddle`. The model of each epoch is stored in the directory `./checkpoint/ResNet50/epoch_id/` with three files by default: `resnet_50_paddle.pdparams`, `resnet_50_paddle.pdopt`, and `resnet_50_paddle.pdmodel`. Note that `epoch_id` is 0-based, which means `epoch_id` runs from 0 to 89 for a total of 90 epochs. For example, the model of the 89th epoch would be stored in `./checkpoint/ResNet50/89/resnet_50_paddle`.
+
+When the training phase is done, the inference model will be stored in the directory specified with `--inference-dir` and `--model-arch-name`; it includes two files: `.pdmodel` and `.pdiparams`.
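+
+This exported pair can be loaded directly with the Paddle inference API. The following is a minimal sketch, assuming the default `--inference-dir ./inference`, `--model-arch-name ResNet50`, NCHW layout, and the default `--image-shape [4, 224, 224]`:
+
+```python
+import numpy as np
+from paddle.inference import Config, create_predictor
+
+# Assumed default paths: ./inference/ResNet50.pdmodel and ./inference/ResNet50.pdiparams
+config = Config('./inference/ResNet50.pdmodel', './inference/ResNet50.pdiparams')
+config.enable_use_gpu(256, 0)  # 256 MB initial GPU memory pool on device 0
+predictor = create_predictor(config)
+
+# Feed one synthetic image shaped after the default --image-shape [4, 224, 224]
+input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
+input_handle.copy_from_cpu(np.random.rand(1, 4, 224, 224).astype(np.float32))
+predictor.run()
+
+output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
+print(output_handle.copy_to_cpu().shape)
+```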
Assume you want to train the ResNet50 for 90 epochs, but the training process aborts during the 50th epoch due to infrastructure faults. To resume training from the checkpoint, specify `--from-checkpoint` and `--last-epoch-of-checkpoint` with following these steps:
-- Set `./output/ResNet50/49` to `--from-checkpoint`.
+- Set `--from-checkpoint` to `./checkpoint/ResNet50/49`.
 - Set `--last-epoch-of-checkpoint` to `49`.

 Then rerun the training to resume training from the 50th epoch to the 89th epoch.
@@ -562,11 +608,11 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
     --use-dynamic-loss-scaling \
     --data-layout NHWC \
     --model-prefix resnet_50_paddle \
-    --from-checkpoint ./output/ResNet50/49 \
+    --from-checkpoint ./checkpoint/ResNet50/49 \
     --last-epoch-of-checkpoint 49
 ```

-We also provide automatic searching for the checkpoint from last epoch. You can enable this by set `--last-epoch-of-checkpoint` as `auto`. Noted that if enable automatic searching, `--from-checkpoint` should be a folder contains chekcpoint files or `/`. In previous example, it should be `./output/ResNet50`.
+We also provide automatic searching for the checkpoint from the last epoch. You can enable it by setting `--last-epoch-of-checkpoint` to `auto`. Note that if you enable automatic searching, `--from-checkpoint` should be a folder containing checkpoint files or `/`. In the previous example, it would be `./checkpoint/ResNet50`.

 Example:
 ```bash
@@ -578,11 +624,11 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
     --use-dynamic-loss-scaling \
     --data-layout NHWC \
     --model-prefix resnet_50_paddle \
-    --from-checkpoint ./output/ResNet50 \
+    --from-checkpoint ./checkpoint/ResNet50 \
     --last-epoch-of-checkpoint auto
 ```

-To start training from pretrained weights, set `--from-pretrained-params` to `./output/ResNet50//<--model-prefix>`.
+To start training from pretrained weights, set `--from-pretrained-params` to `./checkpoint/ResNet50//<--model-prefix>`.

 Example:
 ```bash
@@ -594,7 +640,7 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
     --use-dynamic-loss-scaling \
     --data-layout NHWC \
     --model-prefix resnet_50_paddle \
-    --from-pretrained-params ./output/ResNet50/
+    --from-pretrained-params ./checkpoint/ResNet50/
 ```

 Make sure:
@@ -606,7 +652,7 @@ The difference between those two is that `--from-pretrained-params` contain only

 `--from-checkpoint` is suitable for dividing the training into parts, for example, in order to divide the training job into shorter stages, or restart training after infrastructure faults.

-`--from-pretrained-params` can be used as a base for finetuning the model to a different dataset or as a backbone to detection models.
+`--from-pretrained-params` can be used as a base for fine-tuning the model to a different dataset or as a backbone for detection models.

 Metrics gathered through both training and evaluation:
 - `[train|val].loss` - loss
@@ -622,24 +668,24 @@ Metrics gathered through both training and evaluation:

 ### Automatic SParsity training process:

-To enable automatic sparsity training workflow, turn on `--amp` and `--prune-mode` when training launches. Refer to [Command-line options](#command-line-options)
+To enable the automatic sparsity training workflow, turn on `--asp` and `--prune-model` when training launches. Refer to [Command-line options](#command-line-options).

 Note that automatic sparsity (ASP) requires a pretrained model to initialize parameters.
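+
+For reference, the ASP workflow in `train.py` is built on the `paddle.incubate.asp` module. The toy sketch below shows the two calls involved (`decorate` and `prune_model`) on a stand-alone static-graph program; it is illustrative only and not the repository's actual training code:
+
+```python
+import numpy as np
+import paddle
+from paddle.incubate import asp as sparsity
+
+paddle.enable_static()
+main_prog, startup_prog = paddle.static.Program(), paddle.static.Program()
+with paddle.static.program_guard(main_prog, startup_prog):
+    x = paddle.static.data(name='x', shape=[None, 128], dtype='float32')
+    y = paddle.static.data(name='y', shape=[None, 10], dtype='float32')
+    out = paddle.static.nn.fc(x, size=10)
+    loss = paddle.mean(paddle.nn.functional.square_error_cost(out, y))
+    # decorate() wraps the optimizer so pruned weights stay zero during updates
+    optimizer = sparsity.decorate(paddle.optimizer.Momentum(learning_rate=0.1))
+    optimizer.minimize(loss, startup_prog)
+
+exe = paddle.static.Executor(paddle.CUDAPlace(0))
+exe.run(startup_prog)
+# Pretrained weights would be loaded here, then pruned to a 2:4 sparse pattern
+sparsity.prune_model(main_prog, mask_algo='mask_1d')
+exe.run(main_prog,
+        feed={'x': np.random.rand(4, 128).astype('float32'),
+              'y': np.random.rand(4, 10).astype('float32')},
+        fetch_list=[loss])
+```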
You can apply `scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh` we provided to launch ASP + AMP training.

```bash
-# Default path to pretrained parameters is ./output/ResNet50/89/resnet_50_paddle
+# Default path to pretrained parameters is ./checkpoint/ResNet50/89/resnet_50_paddle
 bash scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh
```

Or following steps below to manually launch ASP + AMP training.

-First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained the ResNet50 for 90 epochs following [Training process](#training-process), the final pretrained weights would be stored in `./output/ResNet50/89/resnet_50_paddle.pdparams` by default, and set `--from-pretrained-params` to `./output/ResNet50/89`.
+First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained the ResNet50 for 90 epochs following [Training process](#training-process), the final pretrained weights would be stored in `./checkpoint/ResNet50/89/resnet_50_paddle.pdparams` by default, and set `--from-pretrained-params` to `./checkpoint/ResNet50/89`.

Then run following command to run AMP + ASP:
```bash
python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
-    --from-pretrained-params ./output/ResNet50/89 \
+    --from-pretrained-params ./checkpoint/ResNet50/89 \
     --model-prefix resnet_50_paddle \
     --epochs 90 \
     --amp \
@@ -651,14 +697,43 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
     --mask-algo mask_1d
 ```

+### Quantization aware training process
+Quantization aware training requires a fine-tuned model. Quantize/dequantize OPs are inserted into the model, and a small number of additional training epochs is then used to update its parameters.
+
+To enable the quantization aware training workflow, turn on `--qat` when training launches. Refer to [Command-line options](#command-line-options).
+
+You can use the provided script `scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh` to launch AMP + QAT training.
+```bash
+# Default path to pretrained parameters is ./checkpoint/ResNet50/89/resnet_50_paddle
+bash scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
+```
+
+Or follow the steps below to manually launch AMP + QAT training.
+
+First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained the ResNet50 for 90 epochs following [Training process](#training-process), the final pretrained weights would be stored in `./checkpoint/ResNet50/89/resnet_50_paddle.pdparams` by default, and set `--from-pretrained-params` to `./checkpoint/ResNet50/89`.
+
+Then run the following command to run AMP + QAT:
+```bash
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+    --from-pretrained-params ./checkpoint/ResNet50/89 \
+    --model-prefix resnet_50_paddle \
+    --epochs 10 \
+    --amp \
+    --scale-loss 128.0 \
+    --use-dynamic-loss-scaling \
+    --data-layout NHWC \
+    --qat
+```
+
+
 ### Inference process

 #### Inference on your own datasets.

 To run inference on a single example with pretrained parameters,
 1. Set `--from-pretrained-params` to your pretrained parameters.
-2. Set `--image-root` to the root folder of your own dataset.
-   - Note that validation dataset should be in `image-root/val`.
+2. Set `--image-root` to the root folder of your own dataset.
+   - Note that the validation dataset should be in `image-root/val`.
 3. Set `--run-scope` to `eval_only`.
```bash # For single GPU evaluation @@ -675,7 +750,7 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \ ``` #### Inference with TensorRT -For inference with TensorRT, we provide two scopes to do benchmark with or without data preprocessing. +For inference with TensorRT, we provide two scopes to benchmark with or without data preprocessing. The default scripts in `scripts/inference` use synthetic input to run inference without data preprocessing. @@ -688,12 +763,12 @@ Or you could manually run `export_model.py` and `inference.py` with specific arg Note that arguments passed to `export_model.py` and `inference.py` should be the same with arguments used in training. -To run inference with data preprocessing, set the option `--trt-use-synthetic` to false and `--image-root` to the path of your own dataset. For example, +To run inference with data preprocessing, set the option `--use-synthetic` to false and `--image-root` to the path of your own dataset. For example, ```bash -python inference.py --trt-inference-dir \ +python inference.py --inference-dir \ --image-root \ - --trt-use-synthetic False + --use-synthetic False ``` ## Performance @@ -761,28 +836,28 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \ ##### Benchmark with TensorRT -To benchmark the inference performance with TensorRT on a specific batch size, run inference.py with `--trt-use-synthetic True`. The benchmark uses synthetic input without data preprocessing. +To benchmark the inference performance with TensorRT on a specific batch size, run inference.py with `--use-synthetic True`. The benchmark uses synthetic input without data preprocessing. * FP32 / TF32 ```bash python inference.py \ - --trt-inference-dir \ - --trt-precision FP32 \ + --inference-dir \ + --precision FP32 \ --batch-size \ --benchmark-steps 1024 \ --benchmark-warmup-steps 16 \ - --trt-use-synthetic True + --use-synthetic True ``` * FP16 ```bash python inference.py \ - --trt-inference-dir \ - --trt-precision FP16 \ + --inference-dir \ + --precision FP16 \ --batch-size --benchmark-steps 1024 \ --benchmark-warmup-steps 16 \ - --trt-use-synthetic True + --use-synthetic True ``` Note that arguments passed to `inference.py` should be the same with arguments used in training. @@ -806,7 +881,7 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic ##### Example plots -The following images show the 90 epochs configuration on a DGX-A100. +The following images show the 90 epoch configuration on a DGX-A100. ![ValidationLoss](./img/loss.png) ![ValidationTop1](./img/top1.png) @@ -838,7 +913,7 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic | 8 | 20267 img/s | 20144 img/s | 0.6% | -Note that the `train.py` would enable CPU affinity binding to GPUs by default, that is designed and guaranteed being optimal for NVIDIA DGX-series. You could disable binding via launch `train.py` with `--enable-cpu-affinity false`. +Note that the `train.py` would enable CPU affinity binding to GPUs by default, that is designed and guaranteed to be optimal for NVIDIA DGX-series. You could disable binding via launch `train.py` with `--enable-cpu-affinity false`. ### Inference performance results @@ -885,30 +960,43 @@ Note that the benchmark does not include data preprocessing. 
Refer to [Benchmark |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| |--------------|------------------|---------------|---------------|---------------|---------------| -| 1 | 915.48 img/s | 1.09 ms | 1.09 ms | 1.18 ms | 1.19 ms | -| 2 | 1662.70 img/s | 1.20 ms | 1.21 ms | 1.29 ms | 1.30 ms | -| 4 | 2856.25 img/s | 1.40 ms | 1.40 ms | 1.49 ms | 1.55 ms | -| 8 | 3988.80 img/s | 2.01 ms | 2.01 ms | 2.10 ms | 2.18 ms | -| 16 | 5409.55 img/s | 2.96 ms | 2.96 ms | 3.05 ms | 3.07 ms | -| 32 | 6406.13 img/s | 4.99 ms | 5.00 ms | 5.08 ms | 5.12 ms | -| 64 | 7169.75 img/s | 8.93 ms | 8.94 ms | 9.01 ms | 9.04 ms | -| 128 | 7616.79 img/s | 16.80 ms | 16.89 ms | 16.90 ms | 16.99 ms | -| 256 | 7843.26 img/s | 32.64 ms | 32.85 ms | 32.88 ms | 32.93 ms | +| 1 | 969.11 img/s | 1.03 ms | 1.03 ms | 1.13 ms | 1.14 ms | +| 2 | 1775.33 img/s | 1.13 ms | 1.13 ms | 1.22 ms | 1.23 ms | +| 4 | 3088.02 img/s | 1.29 ms | 1.30 ms | 1.39 ms | 1.40 ms | +| 8 | 4552.29 img/s | 1.76 ms | 1.76 ms | 1.85 ms | 1.87 ms | +| 16 | 6059.48 img/s | 2.64 ms | 2.64 ms | 2.73 ms | 2.75 ms | +| 32 | 7264.92 img/s | 4.40 ms | 4.41 ms | 4.49 ms | 4.52 ms | +| 64 | 8022.82 img/s | 7.98 ms | 8.03 ms | 8.05 ms | 8.11 ms | +| 128 | 8436.27 img/s | 15.17 ms | 15.20 ms | 15.27 ms | 15.30 ms | +| 256 | 8623.08 img/s | 29.69 ms | 29.82 ms | 29.86 ms | 29.97 ms | **FP16 Inference Latency** |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| |--------------|------------------|---------------|---------------|---------------|---------------| -| 1 | 1265.67 img/s | 0.79 ms | 0.79 ms | 0.88 ms | 0.89 ms | -| 2 | 2339.59 img/s | 0.85 ms | 0.86 ms | 0.94 ms | 0.96 ms | -| 4 | 4271.30 img/s | 0.94 ms | 0.94 ms | 1.03 ms | 1.04 ms | -| 8 | 7053.76 img/s | 1.13 ms | 1.14 ms | 1.22 ms | 1.31 ms | -| 16 | 10225.85 img/s | 1.56 ms | 1.57 ms | 1.65 ms | 1.67 ms | -| 32 | 12802.53 img/s | 2.50 ms | 2.50 ms | 2.59 ms | 2.61 ms | -| 64 | 14723.56 img/s | 4.35 ms | 4.35 ms | 4.43 ms | 4.45 ms | -| 128 | 16157.12 img/s | 7.92 ms | 7.96 ms | 8.00 ms | 8.06 ms | -| 256 | 17054.80 img/s | 15.01 ms | 15.06 ms | 15.07 ms | 15.16 ms | +| 1 | 1306.28 img/s | 0.76 ms | 0.77 ms | 0.86 ms | 0.87 ms | +| 2 | 2453.18 img/s | 0.81 ms | 0.82 ms | 0.91 ms | 0.92 ms | +| 4 | 4295.75 img/s | 0.93 ms | 0.95 ms | 1.03 ms | 1.04 ms | +| 8 | 7036.09 img/s | 1.14 ms | 1.15 ms | 1.23 ms | 1.25 ms | +| 16 | 10376.70 img/s | 1.54 ms | 1.56 ms | 1.64 ms | 1.66 ms | +| 32 | 13078.23 img/s | 2.45 ms | 2.45 ms | 2.54 ms | 2.56 ms | +| 64 | 14992.88 img/s | 4.27 ms | 4.27 ms | 4.36 ms | 4.38 ms | +| 128 | 16386.96 img/s | 7.81 ms | 7.83 ms | 7.89 ms | 7.93 ms | +| 256 | 17363.79 img/s | 14.74 ms | 14.80 ms | 14.82 ms | 14.90 ms | + +**INT8 Inference Latency** +|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| +|--------------|------------------|---------------|---------------|---------------|---------------| +| 1 | 1430.17 img/s | 0.70 ms | 0.70 ms | 0.79 ms | 0.80 ms | +| 2 | 2683.75 img/s | 0.74 ms | 0.75 ms | 0.84 ms | 0.85 ms | +| 4 | 4792.51 img/s | 0.83 ms | 0.84 ms | 0.93 ms | 0.94 ms | +| 8 | 8366.92 img/s | 0.96 ms | 0.96 ms | 1.05 ms | 1.06 ms | +| 16 | 13083.56 img/s | 1.22 ms | 1.22 ms | 1.32 ms | 1.33 ms | +| 32 | 18171.90 img/s | 1.76 ms | 1.76 ms | 1.86 ms | 1.87 ms | +| 64 | 22578.08 img/s | 2.83 ms | 2.84 ms | 2.93 ms | 2.95 ms | +| 128 | 25730.51 img/s | 4.97 ms | 4.98 ms | 5.07 ms | 5.08 ms | +| 256 | 27935.10 img/s | 9.16 ms 
| 9.26 ms | 9.30 ms | 9.34 ms | #### Paddle-TRT performance: NVIDIA A30 (1x A30 24GB) Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA A30 with (1x A30 24G) GPU. @@ -919,30 +1007,43 @@ Note that the benchmark does not include data preprocessing. Refer to [Benchmark |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| |--------------|------------------|---------------|---------------|---------------|---------------| -| 1 | 781.87 img/s | 1.28 ms | 1.29 ms | 1.38 ms | 1.45 ms | -| 2 | 1290.14 img/s | 1.55 ms | 1.55 ms | 1.65 ms | 1.67 ms | -| 4 | 1876.48 img/s | 2.13 ms | 2.13 ms | 2.23 ms | 2.25 ms | -| 8 | 2451.23 img/s | 3.26 ms | 3.27 ms | 3.37 ms | 3.42 ms | -| 16 | 2974.77 img/s | 5.38 ms | 5.42 ms | 5.47 ms | 5.53 ms | -| 32 | 3359.63 img/s | 9.52 ms | 9.62 ms | 9.66 ms | 9.72 ms | -| 64 | 3585.82 img/s | 17.85 ms | 18.03 ms | 18.09 ms | 18.20 ms | -| 128 | 3718.44 img/s | 34.42 ms | 34.71 ms | 34.75 ms | 34.91 ms | -| 256 | 3806.11 img/s | 67.26 ms | 67.61 ms | 67.71 ms | 67.86 ms | +| 1 | 860.08 img/s | 1.16 ms | 1.16 ms | 1.27 ms | 1.29 ms | +| 2 | 1422.02 img/s | 1.40 ms | 1.41 ms | 1.52 ms | 1.53 ms | +| 4 | 2058.41 img/s | 1.94 ms | 1.94 ms | 2.06 ms | 2.10 ms | +| 8 | 2748.94 img/s | 2.91 ms | 2.93 ms | 3.03 ms | 3.22 ms | +| 16 | 3329.39 img/s | 4.80 ms | 4.90 ms | 4.93 ms | 5.09 ms | +| 32 | 3729.45 img/s | 8.58 ms | 8.68 ms | 8.74 ms | 8.84 ms | +| 64 | 3946.74 img/s | 16.21 ms | 16.34 ms | 16.41 ms | 16.51 ms | +| 128 | 4116.98 img/s | 31.09 ms | 31.26 ms | 31.38 ms | 31.43 ms | +| 256 | 4227.52 img/s | 60.55 ms | 60.93 ms | 61.01 ms | 61.25 ms | **FP16 Inference Latency** |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| |--------------|------------------|---------------|---------------|---------------|---------------| -| 1 | 1133.80 img/s | 0.88 ms | 0.89 ms | 0.98 ms | 0.99 ms | -| 2 | 2068.18 img/s | 0.97 ms | 0.97 ms | 1.06 ms | 1.08 ms | -| 4 | 3181.06 img/s | 1.26 ms | 1.27 ms | 1.35 ms | 1.38 ms | -| 8 | 5078.30 img/s | 1.57 ms | 1.58 ms | 1.68 ms | 1.74 ms | -| 16 | 6240.02 img/s | 2.56 ms | 2.58 ms | 2.67 ms | 2.86 ms | -| 32 | 7000.86 img/s | 4.57 ms | 4.66 ms | 4.69 ms | 4.76 ms | -| 64 | 7523.45 img/s | 8.51 ms | 8.62 ms | 8.73 ms | 8.86 ms | -| 128 | 7914.47 img/s | 16.17 ms | 16.31 ms | 16.34 ms | 16.46 ms | -| 256 | 8225.56 img/s | 31.12 ms | 31.29 ms | 31.38 ms | 31.50 ms | +| 1 | 1195.76 img/s | 0.83 ms | 0.84 ms | 0.95 ms | 0.96 ms | +| 2 | 2121.44 img/s | 0.94 ms | 0.95 ms | 1.05 ms | 1.10 ms | +| 4 | 3498.59 img/s | 1.14 ms | 1.14 ms | 1.26 ms | 1.30 ms | +| 8 | 5139.91 img/s | 1.55 ms | 1.56 ms | 1.67 ms | 1.72 ms | +| 16 | 6322.78 img/s | 2.53 ms | 2.54 ms | 2.64 ms | 2.83 ms | +| 32 | 7093.70 img/s | 4.51 ms | 4.61 ms | 4.64 ms | 4.70 ms | +| 64 | 7682.36 img/s | 8.33 ms | 8.44 ms | 8.48 ms | 8.58 ms | +| 128 | 8072.73 img/s | 15.85 ms | 15.98 ms | 16.04 ms | 16.14 ms | +| 256 | 8393.37 img/s | 30.50 ms | 30.67 ms | 30.70 ms | 30.84 ms | + +**INT8 Inference Latency** +|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| +|--------------|------------------|---------------|---------------|---------------|---------------| +| 1 | 1346.83 img/s | 0.74 ms | 0.74 ms | 0.85 ms | 0.87 ms | +| 2 | 2415.06 img/s | 0.83 ms | 0.83 ms | 0.94 ms | 0.99 ms | +| 4 | 4152.29 img/s | 0.96 ms | 0.97 ms | 1.07 ms | 1.11 ms | +| 8 | 6684.53 img/s | 1.20 ms | 1.20 ms | 1.31 ms | 1.37 ms | +| 
16 | 9336.11 img/s | 1.71 ms | 1.72 ms | 1.82 ms | 1.89 ms | +| 32 | 11544.88 img/s | 2.77 ms | 2.77 ms | 2.88 ms | 3.09 ms | +| 64 | 12954.16 img/s | 4.94 ms | 5.04 ms | 5.08 ms | 5.23 ms | +| 128 | 13914.60 img/s | 9.20 ms | 9.27 ms | 9.34 ms | 9.45 ms | +| 256 | 14443.15 img/s | 17.72 ms | 17.87 ms | 17.92 ms | 18.00 ms | #### Paddle-TRT performance: NVIDIA A10 (1x A10 24GB) Our results for Paddle-TRT were obtained by running the `inference.py` script on NVIDIA A10 with (1x A10 24G) GPU. @@ -953,29 +1054,43 @@ Note that the benchmark does not include data preprocessing. Refer to [Benchmark |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| |--------------|------------------|---------------|---------------|---------------|---------------| -| 1 | 563.63 img/s | 1.77 ms | 1.79 ms | 1.87 ms | 1.89 ms | -| 2 | 777.13 img/s | 2.57 ms | 2.63 ms | 2.68 ms | 2.89 ms | -| 4 | 1171.93 img/s | 3.41 ms | 3.43 ms | 3.51 ms | 3.55 ms | -| 8 | 1627.81 img/s | 4.91 ms | 4.97 ms | 5.02 ms | 5.09 ms | -| 16 | 1986.40 img/s | 8.05 ms | 8.11 ms | 8.19 ms | 8.37 ms | -| 32 | 2246.04 img/s | 14.25 ms | 14.33 ms | 14.40 ms | 14.57 ms | -| 64 | 2398.07 img/s | 26.69 ms | 26.87 ms | 26.91 ms | 27.06 ms | -| 128 | 2489.96 img/s | 51.41 ms | 51.74 ms | 51.80 ms | 51.94 ms | -| 256 | 2523.22 img/s | 101.46 ms | 102.13 ms | 102.35 ms | 102.77 ms | +| 1 | 601.39 img/s | 1.66 ms | 1.66 ms | 1.82 ms | 1.85 ms | +| 2 | 962.31 img/s | 2.08 ms | 2.13 ms | 2.23 ms | 2.38 ms | +| 4 | 1338.26 img/s | 2.99 ms | 3.04 ms | 3.14 ms | 3.32 ms | +| 8 | 1650.56 img/s | 4.85 ms | 4.93 ms | 5.01 ms | 5.14 ms | +| 16 | 2116.53 img/s | 7.56 ms | 7.64 ms | 7.71 ms | 7.84 ms | +| 32 | 2316.43 img/s | 13.81 ms | 14.00 ms | 14.07 ms | 14.26 ms | +| 64 | 2477.26 img/s | 25.83 ms | 26.05 ms | 26.15 ms | 26.35 ms | +| 128 | 2528.92 img/s | 50.61 ms | 51.24 ms | 51.37 ms | 51.72 ms | +| 256 | 2576.08 img/s | 99.37 ms | 100.45 ms | 100.66 ms | 101.05 ms | **FP16 Inference Latency** |**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| |--------------|------------------|---------------|---------------|---------------|---------------| -| 1 | 1296.81 img/s | 0.77 ms | 0.77 ms | 0.87 ms | 0.88 ms | -| 2 | 2224.06 img/s | 0.90 ms | 0.90 ms | 1.00 ms | 1.01 ms | -| 4 | 2845.61 img/s | 1.41 ms | 1.43 ms | 1.51 ms | 1.53 ms | -| 8 | 3793.35 img/s | 2.11 ms | 2.19 ms | 2.22 ms | 2.30 ms | -| 16 | 4315.53 img/s | 3.71 ms | 3.80 ms | 3.86 ms | 3.98 ms | -| 32 | 4815.26 img/s | 6.64 ms | 6.74 ms | 6.79 ms | 7.15 ms | -| 64 | 5103.27 img/s | 12.54 ms | 12.66 ms | 12.70 ms | 13.01 ms | -| 128 | 5393.20 img/s | 23.73 ms | 23.98 ms | 24.05 ms | 24.20 ms | -| 256 | 5505.24 img/s | 46.50 ms | 46.82 ms | 46.92 ms | 47.17 ms | +| 1 | 1109.59 img/s | 0.90 ms | 0.90 ms | 1.06 ms | 1.08 ms | +| 2 | 1901.53 img/s | 1.05 ms | 1.05 ms | 1.22 ms | 1.23 ms | +| 4 | 2733.20 img/s | 1.46 ms | 1.48 ms | 1.62 ms | 1.65 ms | +| 8 | 3494.23 img/s | 2.29 ms | 2.32 ms | 2.44 ms | 2.48 ms | +| 16 | 4113.53 img/s | 3.89 ms | 3.99 ms | 4.10 ms | 4.17 ms | +| 32 | 4714.63 img/s | 6.79 ms | 6.98 ms | 7.14 ms | 7.30 ms | +| 64 | 5054.70 img/s | 12.66 ms | 12.78 ms | 12.83 ms | 13.08 ms | +| 128 | 5261.98 img/s | 24.32 ms | 24.58 ms | 24.71 ms | 24.96 ms | +| 256 | 5397.53 img/s | 47.43 ms | 47.83 ms | 47.95 ms | 48.17 ms | + +**INT8 Inference Latency** + +|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**| 
+|--------------|------------------|---------------|---------------|---------------|---------------| +| 1 | 1285.15 img/s | 0.78 ms | 0.78 ms | 0.93 ms | 0.95 ms | +| 2 | 2293.43 img/s | 0.87 ms | 0.88 ms | 1.03 ms | 1.05 ms | +| 4 | 3508.39 img/s | 1.14 ms | 1.15 ms | 1.29 ms | 1.32 ms | +| 8 | 5907.02 img/s | 1.35 ms | 1.36 ms | 1.51 ms | 1.60 ms | +| 16 | 7416.99 img/s | 2.16 ms | 2.19 ms | 2.31 ms | 2.36 ms | +| 32 | 8337.02 img/s | 3.84 ms | 3.91 ms | 4.01 ms | 4.14 ms | +| 64 | 9039.71 img/s | 7.08 ms | 7.24 ms | 7.40 ms | 7.66 ms | +| 128 | 9387.23 img/s | 13.63 ms | 13.84 ms | 13.92 ms | 14.11 ms | +| 256 | 9598.97 img/s | 26.67 ms | 27.12 ms | 27.24 ms | 27.48 ms | ## Release notes @@ -995,6 +1110,11 @@ Note that the benchmark does not include data preprocessing. Refer to [Benchmark * Updated README * A100 convergence benchmark +3. December 2023 + * Add quantization aware training + * Add INT8 inference for Paddle-TRT + * Simplify the inference process + ### Known issues * Allreduce issues to top1 and top5 accuracy in evaluation. Workaround: use `build_strategy.fix_op_run_order = True` for eval program. (refer to [Paddle-issue-39567](https://github.com/PaddlePaddle/Paddle/issues/39567) for details) diff --git a/PaddlePaddle/Classification/RN50v1.5/export_model.py b/PaddlePaddle/Classification/RN50v1.5/export_model.py deleted file mode 100644 index dac24d3e8..000000000 --- a/PaddlePaddle/Classification/RN50v1.5/export_model.py +++ /dev/null @@ -1,75 +0,0 @@ -# Copyright (c) 2022 NVIDIA Corporation. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import logging -import paddle -import program -from dali import build_dataloader -from utils.mode import Mode -from utils.save_load import init_ckpt -from utils.logger import setup_dllogger -from utils.config import parse_args, print_args - - -def main(args): - ''' - Export saved model params to paddle inference model - ''' - setup_dllogger(args.trt_export_log_path) - if args.show_config: - print_args(args) - - eval_dataloader = build_dataloader(args, Mode.EVAL) - - startup_prog = paddle.static.Program() - eval_prog = paddle.static.Program() - - eval_fetchs, _, eval_feeds, _ = program.build( - args, - eval_prog, - startup_prog, - step_each_epoch=len(eval_dataloader), - is_train=False) - eval_prog = eval_prog.clone(for_test=True) - - device = paddle.set_device('gpu') - exe = paddle.static.Executor(device) - exe.run(startup_prog) - - path_to_ckpt = args.from_checkpoint - - if path_to_ckpt is None: - logging.warning( - 'The --from-checkpoint is not set, model weights will not be initialize.' 
- ) - else: - init_ckpt(path_to_ckpt, eval_prog, exe) - logging.info('Checkpoint path is %s', path_to_ckpt) - - save_inference_dir = args.trt_inference_dir - paddle.static.save_inference_model( - path_prefix=os.path.join(save_inference_dir, args.model_arch_name), - feed_vars=[eval_feeds['data']], - fetch_vars=[eval_fetchs['label'][0]], - executor=exe, - program=eval_prog) - - logging.info('Successully export inference model to %s', - save_inference_dir) - - -if __name__ == '__main__': - paddle.enable_static() - main(parse_args(including_trt=True)) diff --git a/PaddlePaddle/Classification/RN50v1.5/inference.py b/PaddlePaddle/Classification/RN50v1.5/inference.py index 2396865e7..bad6ccac9 100644 --- a/PaddlePaddle/Classification/RN50v1.5/inference.py +++ b/PaddlePaddle/Classification/RN50v1.5/inference.py @@ -29,7 +29,7 @@ def init_predictor(args): - infer_dir = args.trt_inference_dir + infer_dir = args.inference_dir assert os.path.isdir( infer_dir), f'inference_dir = "{infer_dir}" is not a directory' pdiparams_path = glob.glob(os.path.join(infer_dir, '*.pdiparams')) @@ -41,7 +41,7 @@ def init_predictor(args): predictor_config = Config(pdmodel_path[0], pdiparams_path[0]) predictor_config.enable_memory_optim() predictor_config.enable_use_gpu(0, args.device) - precision = args.trt_precision + precision = args.precision max_batch_size = args.batch_size assert precision in ['FP32', 'FP16', 'INT8'], \ 'precision should be FP32/FP16/INT8' @@ -54,12 +54,17 @@ def init_predictor(args): else: raise NotImplementedError predictor_config.enable_tensorrt_engine( - workspace_size=args.trt_workspace_size, + workspace_size=args.workspace_size, max_batch_size=max_batch_size, - min_subgraph_size=args.trt_min_subgraph_size, + min_subgraph_size=args.min_subgraph_size, precision_mode=precision_mode, - use_static=args.trt_use_static, - use_calib_mode=args.trt_use_calib_mode) + use_static=args.use_static, + use_calib_mode=args.use_calib_mode) + predictor_config.set_trt_dynamic_shape_info( + {"data": (1,) + tuple(args.image_shape)}, + {"data": (args.batch_size,) + tuple(args.image_shape)}, + {"data": (args.batch_size,) + tuple(args.image_shape)}, + ) predictor = create_predictor(predictor_config) return predictor @@ -140,7 +145,7 @@ def benchmark_dataset(args): quantile = np.quantile(latency, [0.9, 0.95, 0.99]) statistics = { - 'precision': args.trt_precision, + 'precision': args.precision, 'batch_size': batch_size, 'throughput': total_images / (end - start), 'accuracy': correct_predict / total_images, @@ -189,7 +194,7 @@ def benchmark_synthetic(args): quantile = np.quantile(latency, [0.9, 0.95, 0.99]) statistics = { - 'precision': args.trt_precision, + 'precision': args.precision, 'batch_size': batch_size, 'throughput': args.benchmark_steps * batch_size / (end - start), 'eval_latency_avg': np.mean(latency), @@ -200,11 +205,11 @@ def benchmark_synthetic(args): return statistics def main(args): - setup_dllogger(args.trt_log_path) + setup_dllogger(args.report_file) if args.show_config: print_args(args) - if args.trt_use_synthetic: + if args.use_synthetic: statistics = benchmark_synthetic(args) else: statistics = benchmark_dataset(args) @@ -213,4 +218,4 @@ def main(args): if __name__ == '__main__': - main(parse_args(including_trt=True)) + main(parse_args(script='inference')) diff --git a/PaddlePaddle/Classification/RN50v1.5/program.py b/PaddlePaddle/Classification/RN50v1.5/program.py index 12de72282..ec16c727d 100644 --- a/PaddlePaddle/Classification/RN50v1.5/program.py +++ 
b/PaddlePaddle/Classification/RN50v1.5/program.py @@ -188,6 +188,7 @@ def dist_optimizer(args, optimizer): } dist_strategy.asp = args.asp + dist_strategy.qat = args.qat optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy) diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh index 611520c93..7dd68dc40 100644 --- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh +++ b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh @@ -14,9 +14,9 @@ python inference.py \ --data-layout NHWC \ - --trt-inference-dir ./inference_amp \ - --trt-precision FP16 \ + --inference-dir ./inference_amp \ + --precision FP16 \ --batch-size 256 \ --benchmark-steps 1024 \ --benchmark-warmup-steps 16 \ - --trt-use-synthetic True + --use-synthetic True diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_QAT.sh similarity index 71% rename from PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh rename to PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_QAT.sh index 107b1f4f9..bb2858eb7 100644 --- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh +++ b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_QAT.sh @@ -12,10 +12,11 @@ # See the License for the specific language governing permissions and # limitations under the License. -CKPT=${1:-"./output/ResNet50/89"} -MODEL_PREFIX=${2:-"resnet_50_paddle"} - -python -m paddle.distributed.launch --gpus=0 export_model.py \ - --trt-inference-dir ./inference_tf32 \ - --from-checkpoint $CKPT \ - --model-prefix ${MODEL_PREFIX} +python inference.py \ + --data-layout NHWC \ + --inference-dir ./inference_qat \ + --precision INT8 \ + --batch-size 256 \ + --benchmark-steps 1024 \ + --benchmark-warmup-steps 16 \ + --use-synthetic True diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh index bfad608ee..6e55fd0be 100644 --- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh +++ b/PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh @@ -13,10 +13,10 @@ # limitations under the License. 
python inference.py \
-    --trt-inference-dir ./inference_tf32 \
-    --trt-precision FP32 \
+    --inference-dir ./inference_tf32 \
+    --precision FP32 \
     --dali-num-threads 8 \
     --batch-size 256 \
     --benchmark-steps 1024 \
     --benchmark-warmup-steps 16 \
-    --trt-use-synthetic True
+    --use-synthetic True
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh
index 4096d655a..23c4a4991 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh
@@ -18,4 +18,5 @@ python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
     --scale-loss 128.0 \
     --use-dynamic-loss-scaling \
     --data-layout NHWC \
-    --fuse-resunit
+    --fuse-resunit \
+    --inference-dir ./inference_amp
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
similarity index 69%
rename from PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh
rename to PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
index b1c5676b9..0e7c8f104 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_QAT_10E_DGXA100.sh
@@ -15,9 +15,14 @@
-CKPT=${1:-"./output/ResNet50/89"}
+CKPT=${1:-"./checkpoint/ResNet50/89"}
 MODEL_PREFIX=${2:-"resnet_50_paddle"}
 
-python -m paddle.distributed.launch --gpus=0 export_model.py \
-    --amp \
-    --data-layout NHWC \
-    --trt-inference-dir ./inference_amp \
-    --from-checkpoint ${CKPT} \
-    --model-prefix ${MODEL_PREFIX}
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+    --from-pretrained-params ${CKPT} \
+    --model-prefix ${MODEL_PREFIX} \
+    --epochs 10 \
+    --amp \
+    --scale-loss 128.0 \
+    --use-dynamic-loss-scaling \
+    --data-layout NHWC \
+    --qat \
+    --lr 0.00005 \
+    --inference-dir ./inference_qat
diff --git a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh
index 65c87b752..0c5ea7988 100644
--- a/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh
+++ b/PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh
@@ -12,4 +12,4 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py --epochs 90 +python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py --epochs 90 --inference-dir ./inference_tf32 diff --git a/PaddlePaddle/Classification/RN50v1.5/train.py b/PaddlePaddle/Classification/RN50v1.5/train.py index 3fbb9dd63..28e985135 100644 --- a/PaddlePaddle/Classification/RN50v1.5/train.py +++ b/PaddlePaddle/Classification/RN50v1.5/train.py @@ -28,6 +28,7 @@ from paddle.static.amp.fp16_lists import AutoMixedPrecisionLists from paddle.static.amp.fp16_utils import cast_model_to_fp16 from paddle.incubate import asp as sparsity +from paddle.static.quantization.quanter import quant_aware class MetricSummary: @@ -107,7 +108,7 @@ def main(args): eval_step_each_epoch = len(eval_dataloader) eval_prog = paddle.static.Program() - eval_fetchs, _, _, _ = program.build( + eval_fetchs, _, eval_feeds, _ = program.build( args, eval_prog, startup_prog, @@ -147,6 +148,14 @@ def main(args): sparsity.prune_model(train_prog, mask_algo=args.mask_algo) logging.info("Pruning model done.") + if args.qat: + if args.run_scope == RunScope.EVAL_ONLY: + eval_prog = quant_aware(eval_prog, device, for_test=True, return_program=True) + else: + optimizer.qat_init( + device, + test_program=eval_prog) + if eval_prog is not None: eval_prog = program.compile_prog(args, eval_prog, is_train=False) @@ -169,7 +178,7 @@ def main(args): # Save a checkpoint if epoch_id % args.save_interval == 0: - model_path = os.path.join(args.output_dir, args.model_arch_name) + model_path = os.path.join(args.checkpoint_dir, args.model_arch_name) save_model(train_prog, model_path, epoch_id, args.model_prefix) # Evaluation @@ -190,6 +199,10 @@ def main(args): if eval_summary.is_updated: program.log_info((), eval_summary.metric_dict, Mode.EVAL) + if eval_prog is not None: + model_path = os.path.join(args.inference_dir, args.model_arch_name) + paddle.static.save_inference_model(model_path, [eval_feeds['data']], [eval_fetchs['label'][0]], exe, program=eval_prog) + if __name__ == '__main__': paddle.enable_static() diff --git a/PaddlePaddle/Classification/RN50v1.5/utils/config.py b/PaddlePaddle/Classification/RN50v1.5/utils/config.py index c77ea7422..3b4b46494 100644 --- a/PaddlePaddle/Classification/RN50v1.5/utils/config.py +++ b/PaddlePaddle/Classification/RN50v1.5/utils/config.py @@ -100,7 +100,8 @@ def print_args(args): args_for_log = copy.deepcopy(args) # Due to dllogger cannot serialize Enum into JSON. - args_for_log.run_scope = args_for_log.run_scope.value + if hasattr(args_for_log, 'run_scope'): + args_for_log.run_scope = args_for_log.run_scope.value dllogger.log(step='PARAMETER', data=vars(args_for_log)) @@ -150,13 +151,19 @@ def check_and_process_args(args): args.eval_interval = 1 -def add_global_args(parser): - group = parser.add_argument_group('Global') +def add_general_args(parser): + group = parser.add_argument_group('General') group.add_argument( - '--output-dir', + '--checkpoint-dir', type=str, - default='./output/', + default='./checkpoint/', help='A path to store trained models.') + group.add_argument( + '--inference-dir', + type=str, + default='./inference/', + help='A path to store inference model once the training is finished.' 
+ ) group.add_argument( '--run-scope', default='train_eval', @@ -188,13 +195,8 @@ def add_global_args(parser): group.add_argument( '--report-file', type=str, - default='./report.json', + default='./train.json', help='A file in which to store JSON experiment report.') - group.add_argument( - '--data-layout', - default='NCHW', - choices=('NCHW', 'NHWC'), - help='Data format. It should be one of {NCHW, NHWC}.') group.add_argument( '--benchmark', action='store_true', help='To enable benchmark mode.') group.add_argument( @@ -298,6 +300,11 @@ def add_advance_args(parser): '{mask_1d, mask_2d_greedy, mask_2d_best}. This only be applied ' \ 'when --asp and --prune-model is set.' ) + # QAT + group.add_argument( + '--qat', + action='store_true', + help='Enable quantization aware training (QAT).') return parser @@ -395,6 +402,11 @@ def add_model_args(parser): type=int, default=1000, help='The number classes of images.') + group.add_argument( + '--data-layout', + default='NCHW', + choices=('NCHW', 'NHWC'), + help='Data format. It should be one of {NCHW, NHWC}.') group.add_argument( '--bn-weight-decay', action='store_true', @@ -448,6 +460,9 @@ def add_training_args(parser): def add_trt_args(parser): + def int_list(x): + return list(map(int, x.split(','))) + group = parser.add_argument_group('Paddle-TRT') group.add_argument( '--device', @@ -456,70 +471,94 @@ def add_trt_args(parser): help='The GPU device id for Paddle-TRT inference.' ) group.add_argument( - '--trt-inference-dir', + '--inference-dir', type=str, default='./inference', - help='A path to store/load inference models. ' \ - 'export_model.py would export models to this folder, ' \ - 'then inference.py would load from here.' + help='A path to load inference models.' ) group.add_argument( - '--trt-precision', + '--data-layout', + default='NCHW', + choices=('NCHW', 'NHWC'), + help='Data format. It should be one of {NCHW, NHWC}.') + group.add_argument( + '--precision', default='FP32', choices=('FP32', 'FP16', 'INT8'), help='The precision of TensorRT. It should be one of {FP32, FP16, INT8}.' ) group.add_argument( - '--trt-workspace-size', + '--workspace-size', type=int, default=(1 << 30), help='The memory workspace of TensorRT in MB.') group.add_argument( - '--trt-min-subgraph-size', + '--min-subgraph-size', type=int, default=3, help='The minimal subgraph size to enable PaddleTRT.') group.add_argument( - '--trt-use-static', + '--use-static', type=distutils.util.strtobool, default=False, help='Fix TensorRT engine at first running.') group.add_argument( - '--trt-use-calib-mode', + '--use-calib-mode', type=distutils.util.strtobool, default=False, help='Use the PTQ calibration of PaddleTRT int8.') group.add_argument( - '--trt-export-log-path', - type=str, - default='./export.json', - help='A file in which to store JSON model exporting report.') - group.add_argument( - '--trt-log-path', + '--report-file', type=str, default='./inference.json', help='A file in which to store JSON inference report.') group.add_argument( - '--trt-use-synthetic', + '--use-synthetic', type=distutils.util.strtobool, default=False, help='Apply synthetic data for benchmark.') + group.add_argument( + '--benchmark-steps', + type=int, + default=100, + help='Steps for benchmark run, only be applied when --benchmark is set.' + ) + group.add_argument( + '--benchmark-warmup-steps', + type=int, + default=100, + help='Warmup steps for benchmark run, only be applied when --benchmark is set.' 
+    )
+    group.add_argument(
+        '--show-config',
+        type=distutils.util.strtobool,
+        default=True,
+        help='To show arguments.')
     return parser
 
 
-def parse_args(including_trt=False):
+def parse_args(script='train'):
+    assert script in ['train', 'inference']
     parser = argparse.ArgumentParser(
-        description="PaddlePaddle RN50v1.5 training script",
+        description=f'PaddlePaddle RN50v1.5 {script} script',
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser = add_global_args(parser)
-    parser = add_dataset_args(parser)
-    parser = add_model_args(parser)
-    parser = add_training_args(parser)
-    parser = add_advance_args(parser)
-
-    if including_trt:
+    if script == 'train':
+        parser = add_general_args(parser)
+        parser = add_dataset_args(parser)
+        parser = add_model_args(parser)
+        parser = add_training_args(parser)
+        parser = add_advance_args(parser)
+        args = parser.parse_args()
+        check_and_process_args(args)
+    else:
         parser = add_trt_args(parser)
+        parser = add_dataset_args(parser)
+        args = parser.parse_args()
+        # Process image layout and channel
+        args.image_channel = args.image_shape[0]
+        if args.data_layout == "NHWC":
+            args.image_shape = [
+                args.image_shape[1], args.image_shape[2], args.image_shape[0]
+            ]
 
-    args = parser.parse_args()
-    check_and_process_args(args)
     return args