bevformer-int8-eq

About

This repository contains an end-to-end example of deploying BEVFormer with explicit quantization using NVIDIA's ModelOpt toolkit. At the end, we report TensorRT deployment results in terms of runtime and accuracy.

Requirements

Prepare dataset

Follow the Data Preparation steps for NuScenes and CAN bus. This prepares the full train/validation dataset.
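
After these steps, the data directory typically looks as follows (a sketch; exact contents depend on the NuScenes release and any generated info files):

BEVFormer_tensorrt/data
├── can_bus/
└── nuscenes/
    ├── maps/
    ├── samples/
    ├── sweeps/
    └── v1.0-trainval/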

Prepare docker image

Build the docker image:

$ export TAG=tensorrt_bevformer:24.08
$ docker build -f docker/tensorrt.Dockerfile --no-cache --tag=$TAG .

How to Run

1. Export model to ONNX and compile plugins

A. Download the model weights and save them in ./models:

$ wget -P ./models https://github.com/zhiqi-li/storage/releases/download/v1.0/bevformer_tiny_epoch_24.pth

B. Run docker container:

$ docker run -it --rm --gpus device=0 --network=host --shm-size 20g -v $(pwd):/mnt -v <path to data>:/workspace/BEVFormer_tensorrt/data $TAG

C. In the docker container, patch the BEVFormer_tensorrt folder and compile the plugins:

# 1. Apply patch to BEVFormer_tensorrt with changes necessary for TensorRT 10 support
$ cd /workspace/BEVFormer_tensorrt
$ git apply /mnt/bevformer_trt10.patch

# 2. Compile plugins
$ cd TensorRT/build
$ cmake .. -DCMAKE_TENSORRT_PATH=/usr && make -j$(nproc) && make install

The compiled plugin is saved to TensorRT/lib/libtensorrt_ops.so and is later used by both ModelOpt and TensorRT.
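
To sanity-check the build, you can verify from Python that the plugin creators register with TensorRT (a minimal sketch; the creator names you see depend on the plugin library, and plugin_creator_list is deprecated but still available in TensorRT 10):

import ctypes
import tensorrt as trt

# Load the compiled plugin library so its creators self-register.
ctypes.CDLL("/workspace/BEVFormer_tensorrt/TensorRT/lib/libtensorrt_ops.so")

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")  # also registers TensorRT's built-in plugins

# The BEVFormer custom ops should appear among the registered creators.
print(sorted(c.name for c in trt.get_plugin_registry().plugin_creator_list))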

D. Export the simplified ONNX model from torch:

$ cd /workspace/BEVFormer_tensorrt
$ python tools/pth2onnx.py configs/bevformer/plugin/bevformer_tiny_trt_p2.py /mnt/models/bevformer_tiny_epoch_24.pth --opset=13 --cuda --flag=cp2_op13
$ cp checkpoints/onnx/bevformer_tiny_epoch_24_cp2_op13.onnx /mnt/models/
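
You can quickly inspect the exported model before post-processing (a minimal sketch using the onnx package; the custom TRT ops show up as non-standard op types):

import onnx
from collections import Counter

model = onnx.load("/mnt/models/bevformer_tiny_epoch_24_cp2_op13.onnx")
# Opset imports: the default ai.onnx domain should be at version 13.
print([(o.domain or "ai.onnx", o.version) for o in model.opset_import])
# Op-type histogram: non-standard entries are the custom TRT plugin ops.
print(Counter(node.op_type for node in model.graph.node).most_common(15))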

2. Post-process ONNX model

$ export PLUGIN_PATH=/workspace/BEVFormer_tensorrt/TensorRT/lib/libtensorrt_ops.so
$ python /mnt/tools/onnx_postprocess.py --onnx=/mnt/models/bevformer_tiny_epoch_24_cp2_op13.onnx --trt_plugins=$PLUGIN_PATH

This generates an ONNX file with the same name as the input ONNX file plus the suffix _post_simp.onnx. You may need to set CUDA_MODULE_LOADING=LAZY when using CUDA 12.x; no such variable is needed with CUDA 11.8.

This script does the following post-processing actions:

  1. Automatically detect custom TRT ops in the ONNX model.
  2. Ensure that the custom ops are supported as a TRT plugin in ONNX-Runtime (trt.plugins domain); see the sketch after this list.
  3. Update all tensor types and shapes in the ONNX graph with onnx-graphsurgeon.
  4. Simplify model with onnxsim.
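
As a rough illustration of step 2, moving a custom op into the trt.plugins domain amounts to the following (a sketch; the op names here are hypothetical, the real script detects them automatically):

import onnx

CUSTOM_OPS = {"RotateTRT2", "MultiScaleDeformableAttnTRT2"}  # hypothetical names

model = onnx.load("/mnt/models/bevformer_tiny_epoch_24_cp2_op13.onnx")
for node in model.graph.node:
    if node.op_type in CUSTOM_OPS:
        node.domain = "trt.plugins"  # lets ONNX-Runtime dispatch them to the TRT plugin

# Declare the custom domain so the model remains valid ONNX.
model.opset_import.append(onnx.helper.make_opsetid("trt.plugins", 1))
onnx.save(model, "/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post.onnx")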

3. Quantize ONNX model

  1. Prepare the calibration data:
$ cd /workspace/BEVFormer_tensorrt
$ python /mnt/tools/calib_data_prep.py configs/bevformer/plugin/bevformer_tiny_trt_p2.py \
    --onnx_path=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.onnx \
    --trt_plugins=$PLUGIN_PATH

The calibration data will be saved in data/nuscenes/calib_data.npz; the script uses 600 calibration samples by default. See the instructions in the ModelOpt toolkit for more info on generating the calibration data.
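
To verify the calibration file, you can inspect its contents (a sketch, assuming the script stores one array per model input with the sample index on the first axis):

import numpy as np

calib = np.load("data/nuscenes/calib_data.npz", allow_pickle=True)
for name in calib.files:
    print(name, calib[name].shape, calib[name].dtype)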

  2. Quantize the ONNX model with the calibration data:
$ python /mnt/tools/quantize_model.py --onnx_path=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.onnx \
      --trt_plugins=$PLUGIN_PATH \
      --op_types_to_exclude MatMul \
      --calibration_data_path=/workspace/BEVFormer_tensorrt/data/nuscenes/calib_data.npz

This generates an ONNX model with the suffix .quant.onnx and Q/DQ nodes inserted around the relevant layers.
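
A quick way to confirm the quantization took effect is to count the inserted Q/DQ nodes (a minimal sketch):

import onnx
from collections import Counter

model = onnx.load("/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.onnx")
ops = Counter(node.op_type for node in model.graph.node)
print(ops["QuantizeLinear"], "QuantizeLinear /", ops["DequantizeLinear"], "DequantizeLinear nodes")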

Notes:

  • MatMul ops are not quantized (--op_types_to_exclude MatMul) because the MHA blocks in Transformer-based models are currently recommended to run in FP16. Keep in mind that the optimal Q/DQ node placement varies across models, so there may be cases where quantizing MatMul ops is advantageous; this is up to the user to decide (the sketch after these notes shows where the exclusion plugs in).
  • If you run out of memory, prepend CUDA_MODULE_LOADING=LAZY to the quantization command. This only applies to CUDA 12.x; no such variable is needed with CUDA 11.8.
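
For reference, /mnt/tools/quantize_model.py presumably wraps ModelOpt's ONNX PTQ entry point; a minimal sketch of the equivalent direct call (parameter names may differ across ModelOpt releases):

import numpy as np
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.onnx",
    calibration_data=dict(np.load(
        "/workspace/BEVFormer_tensorrt/data/nuscenes/calib_data.npz", allow_pickle=True)),
    op_types_to_exclude=["MatMul"],  # keep MHA matmuls in higher precision (see note above)
    trt_plugins="/workspace/BEVFormer_tensorrt/TensorRT/lib/libtensorrt_ops.so",
    output_path="/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.onnx",
)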

4. Build TensorRT engine

$ trtexec --onnx=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.onnx \
          --saveEngine=/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.engine \
          --staticPlugins=$PLUGIN_PATH \
          --best

Note: To deploy the quantized ONNX model on another platform or with another TensorRT version, simply re-compile the plugin for the target settings and rebuild the engine from the same explicitly quantized ONNX model.
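
On the target platform, the recompiled plugin must be loaded before the engine can be deserialized (a minimal sketch with the TensorRT Python API):

import ctypes
import tensorrt as trt

# Load the plugin compiled for this platform/TensorRT version first.
ctypes.CDLL("/workspace/BEVFormer_tensorrt/TensorRT/lib/libtensorrt_ops.so")
logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

with open("/mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
print("I/O tensors:", engine.num_io_tensors)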

5. Evaluate accuracy of TensorRT engine

Run evaluation script:

$ cd /workspace/BEVFormer_tensorrt
$ python tools/bevformer/evaluate_trt.py \
         configs/bevformer/plugin/bevformer_tiny_trt_p2.py \
         /mnt/models/bevformer_tiny_epoch_24_cp2_op13_post_simp.quant.engine \
         --trt_plugins=$PLUGIN_PATH

Results

System: NVIDIA A40 GPU, TensorRT 10.3.0.26.

BEVFormer tiny with FP16 plugins using nv_half2 (bevformer_tiny_epoch_24_cp2_op13_post_simp.onnx):

| Precision | GPU Compute Time (median, ms) | Accuracy (NDS / mAP) |
|-----------|-------------------------------|----------------------|
| FP32 | 18.82 | NDS: 0.354, mAP: 0.252 |
| FP16 | 9.36 | NDS: 0.354, mAP: 0.251 |
| BEST (TensorRT PTQ - Implicit Quantization) | 6.20 | NDS: 0.353, mAP: 0.250 |
| QDQ_BEST (ModelOpt PTQ - Explicit Quantization) | 6.02 | NDS: 0.352, mAP: 0.251 |

BEVFormer tiny with FP16 plugins using nv_half (bevformer_tiny_epoch_24_cp_op13_post_simp.onnx):

| Precision | GPU Compute Time (median, ms) | Accuracy (NDS / mAP) |
|-----------|-------------------------------|----------------------|
| FP32 | 18.80 | NDS: 0.354, mAP: 0.252 |
| FP16 | 9.81 | NDS: 0.354, mAP: 0.251 |
| BEST (TensorRT PTQ - Implicit Quantization) | 6.73 | NDS: 0.353, mAP: 0.250 |
| QDQ_BEST (ModelOpt PTQ - Explicit Quantization) | 6.54 | NDS: 0.353, mAP: 0.251 |

Steps to reproduce

To reproduce the results, run:

  1. ./deploy_trt.sh to build/save the TensorRT engine and obtain the runtime;
  2. ./evaluate_trt.sh to evaluate the TensorRT engine's accuracy.