The mAP of DAMO-YOLO-S drops sharply from 46.8% to 33.6% after conventional post-training quantization (PTQ), which is unacceptable. To solve this problem, we apply partial quantization. We quantized each layer of the model separately at the TensorRT (TRT) level, used the resulting accuracy as a measure of that layer's sensitivity, and then kept the most sensitive layers in full precision as a compromise.
With partial quantization, DAMO-YOLO-S reaches 46.5% mAP, a loss of only 0.3% relative to the FP16 model. Compared with the FP16 model, the partially quantized model is about 20% faster at batch size 1 (3.23 ms versus 3.83 ms on a T4), a good compromise between accuracy and latency.
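The per-layer sensitivity analysis described above can be sketched roughly as follows. This is a minimal illustration, not the code in `tools/partial_quantization`: the `evaluate` callback, the layer names, and the per-layer penalties are all hypothetical stand-ins, and the toy numbers are invented for the example rather than measured.

```python
from typing import Callable, List, Set, Tuple


def rank_layer_sensitivity(
    layer_names: List[str],
    evaluate: Callable[[Set[str]], float],
    baseline: float,
) -> List[Tuple[str, float]]:
    """Quantize one layer at a time and rank layers by accuracy drop."""
    drops = []
    for name in layer_names:
        score = evaluate({name})  # only this layer is quantized
        drops.append((name, baseline - score))
    # Largest drop first: these are the most sensitive layers.
    return sorted(drops, key=lambda item: item[1], reverse=True)


def choose_quantized_layers(
    ranking: List[Tuple[str, float]], num_skip: int
) -> List[str]:
    """Keep the `num_skip` most sensitive layers in full precision."""
    skipped = {name for name, _ in ranking[:num_skip]}
    return [name for name, _ in ranking if name not in skipped]


# Toy stand-in for a real evaluation run (hypothetical layers and penalties).
BASELINE_MAP = 46.8  # FP16 mAP reported in this document
TOY_PENALTY = {"backbone.stem": 0.1, "neck.merge": 0.4, "head.cls": 9.0}


def toy_evaluate(quantized: Set[str]) -> float:
    # Pretend each quantized layer costs a fixed amount of mAP.
    return BASELINE_MAP - sum(TOY_PENALTY[name] for name in quantized)


ranking = rank_layer_sensitivity(list(TOY_PENALTY), toy_evaluate, BASELINE_MAP)
to_quantize = choose_quantized_layers(ranking, num_skip=1)
print("most sensitive first:", [name for name, _ in ranking])
print("layers to quantize:", to_quantize)
```

In a real run, `evaluate` would build and evaluate an engine with only the given layers quantized, which is far more expensive than this toy function; the ranking and selection logic stays the same.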
Quantized DAMO-YOLO-T and DAMO-YOLO-M models are coming soon.
TRT Version: 8.4.1.5
pip install --extra-index-url=https://pypi.ngc.nvidia.com --trusted-host pypi.ngc.nvidia.com nvidia-pyindex
pip install --extra-index-url=https://pypi.ngc.nvidia.com --trusted-host pypi.ngc.nvidia.com pytorch_quantization
After specifying the layers to be quantized, run partial quantization as follows. The calibration weights, ONNX files, and TRT files will be generated.
python tools/partial_quantization/partial_quant.py -f configs/damoyolo_tinynasL25_S.py -c damoyolo_tinynasL25_S_468.pth --batch_size 1 --img_size 640 --trt --trt_eval
TRT model latency can be measured by trtexec.
trtexec --avgRuns=1000 --workspace=1024 --loadEngine=damoyolo_tinynasL25_S_partial_quant_bs1.trt
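When several engines need to be timed, the command above can be wrapped in a small script. The sketch below only assembles the same three flags used in this document and runs `trtexec` if it is found on `PATH`; the helper name `build_trtexec_cmd` is our own, and parsing of the `trtexec` output is deliberately left out.

```python
import shutil
import subprocess
from typing import List


def build_trtexec_cmd(
    engine_path: str, avg_runs: int = 1000, workspace: int = 1024
) -> List[str]:
    """Assemble the trtexec invocation used in this document."""
    return [
        "trtexec",
        f"--avgRuns={avg_runs}",
        f"--workspace={workspace}",
        f"--loadEngine={engine_path}",
    ]


cmd = build_trtexec_cmd("damoyolo_tinynasL25_S_partial_quant_bs1.trt")
print(" ".join(cmd))

# Only run the benchmark when trtexec is actually installed.
if shutil.which("trtexec") is not None:
    subprocess.run(cmd, check=True)
else:
    print("trtexec not found on PATH; skipping the benchmark run")
```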
Model | Size | Precision | mAP_val(0.5:0.95) | T4 Latency bs=1 (ms) |
---|---|---|---|---|
DAMOYOLO-S-partial | 640 | INT8 | 46.5 | 3.23 |
DAMOYOLO-S-INT8 | 640 | INT8 | 33.6 | 3.08 |
DAMOYOLO-S-FP16 | 640 | FP16 | 46.8 | 3.83 |
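The trade-off in the table can be checked with a few lines of arithmetic. The snippet below uses only the numbers from the table; the function names are ours, and it reports both the latency reduction and the equivalent speedup factor, since the two give different percentages.

```python
# Values copied from the table above: (mAP, T4 latency in ms at bs=1).
RESULTS = {
    "DAMOYOLO-S-partial": (46.5, 3.23),
    "DAMOYOLO-S-INT8": (33.6, 3.08),
    "DAMOYOLO-S-FP16": (46.8, 3.83),
}


def latency_reduction_pct(base_ms: float, new_ms: float) -> float:
    """Percentage of latency saved relative to the baseline."""
    return (base_ms - new_ms) / base_ms * 100.0


def speedup(base_ms: float, new_ms: float) -> float:
    """Throughput ratio: how many times faster the new model is."""
    return base_ms / new_ms


fp16_map, fp16_ms = RESULTS["DAMOYOLO-S-FP16"]
for name in ("DAMOYOLO-S-partial", "DAMOYOLO-S-INT8"):
    model_map, model_ms = RESULTS[name]
    print(
        f"{name}: mAP drop {fp16_map - model_map:.1f}, "
        f"latency -{latency_reduction_pct(fp16_ms, model_ms):.1f}%, "
        f"speedup {speedup(fp16_ms, model_ms):.2f}x"
    )
```

For the partially quantized model this works out to roughly a 16% reduction in latency, or about a 1.19x speedup over FP16, for a 0.3 mAP drop. Full INT8 is slightly faster still but loses 13.2 mAP.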