This application demonstrates the deployment of a multi-task network on the NVIDIA Drive Orin platform.
To improve latency and throughput, we leverage the different compute devices (GPU and DLA) on the SoC with CUDA, TensorRT, and cuDLA. To better utilize all the compute resources, we run the encoder in FP16 on the GPU, the depth decoder in INT8 on DLA0, and the segmentation decoder in INT8 on DLA1.
The tasks are scheduled in a pipelined fashion to pursue higher throughput with low overhead: while DLA0 and DLA1 are working on the previous frame, we can launch the encoder for the current frame on the GPU at the same time, so the execution is overlapped.
For more details, you may refer to our webinar Optimizing Multi-task Model Inference for Autonomous Vehicles.
The original ONNX model was exported from a trained model. It is a multi-task network with a Mix Transformer (MiT-B0) encoder as the backbone, which was originally used in SegFormer. The segmentation head is a slim version of the SegFormer-B0 head, and the depth head is a progressive decoder originally from DEST.
You may install dependencies with
pip install onnx onnxruntime onnx-graphsurgeon onnxsim
We also provide some ONNX files containing random weights to help you go through the following process. You can find these files in AV-Solutions/mtmi/onnx_files/. These files are handled by git-lfs; to set up git-lfs, you may refer to the GitHub tutorial installing-git-large-file-storage. For real use cases, we encourage users to choose and train their own models with suitable backbone and heads in order to generate decent inference results in the end.
| onnx | links |
|---|---|
| whole network | mtmi.onnx |
| simplified whole network | mtmi_slim.onnx |
| image encoder | mtmi_encoder.onnx |
| depth head | mtmi_depth_head.onnx |
| segmentation head | mtmi_seg_head.onnx |
Step 1: Simplify the ONNX file. This step manipulates the ONNX graph: it removes redundant nodes, performs constant folding, etc. For more details about the simplification, you may refer to daquexian/onnx-simplifier.
python tools/onnx_simplify.py
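If you prefer to see what this step does, here is a minimal sketch using the onnxsim Python API directly; the file names are assumptions and tools/onnx_simplify.py may differ in details.

```python
# Minimal sketch of the simplification step using the onnxsim Python API.
# File names are assumptions; tools/onnx_simplify.py may differ in details.
import onnx
from onnxsim import simplify

model = onnx.load("onnx_files/mtmi.onnx")
model_simplified, check = simplify(model)  # constant folding, redundant-node removal, etc.
assert check, "Simplified ONNX model could not be validated"
onnx.save(model_simplified, "onnx_files/mtmi_slim.onnx")
```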
Step 2: Split the ONNX model into encoder, depth decoder, and semantic segmentation decoder. Since the encoder, depth head, and segmentation head are assigned to different devices, we split the whole ONNX graph into three sub-graphs and handle them separately.
python tools/onnx_split.py
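For reference, a split like this can be expressed with onnx.utils.extract_model by cutting the graph at the encoder's output tensors. The sketch below is illustrative only: the boundary tensor names are placeholders, and tools/onnx_split.py uses the actual tensor names from the mtmi graph.

```python
# Sketch of splitting the simplified graph into encoder and head sub-graphs
# with onnx.utils.extract_model. The boundary tensor names are placeholders;
# tools/onnx_split.py uses the actual tensor names from the mtmi graph.
import onnx.utils

WHOLE = "onnx_files/mtmi_slim.onnx"
FEATURE_TENSORS = ["encoder_feat_0", "encoder_feat_1"]  # hypothetical boundary names

# encoder: network input -> shared features
onnx.utils.extract_model(WHOLE, "onnx_files/mtmi_encoder.onnx",
                         input_names=["input"], output_names=FEATURE_TENSORS)
# depth head: shared features -> depth output (hypothetical output name)
onnx.utils.extract_model(WHOLE, "onnx_files/mtmi_depth_head.onnx",
                         input_names=FEATURE_TENSORS, output_names=["depth"])
# segmentation head: shared features -> segmentation output (hypothetical output name)
onnx.utils.extract_model(WHOLE, "onnx_files/mtmi_seg_head.onnx",
                         input_names=FEATURE_TENSORS, output_names=["seg"])
```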
To better utilize the DLAs, we should use INT8. In order to obtain the scale information, we use the TensorRT calibration APIs.
We first use onnxruntime to produce the intermediate features shared by the two decoders. These features are cached for the next steps.
python tools/batch_preprocessing_onnx.py --onnx=PATH_TO_ONNX --image_path=PATH_TO_IMAGE_FILES --output_path=PATH_TO_SAVE_INTERMEDIATE_FEATURES
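Conceptually, this step runs the encoder with onnxruntime and saves its outputs to disk. A minimal sketch, with assumed paths, input name, and preprocessing (see tools/batch_preprocessing_onnx.py for the actual implementation):

```python
# Minimal sketch of caching the encoder's intermediate features with
# onnxruntime. Paths, input name, and preprocessing are assumptions; see
# tools/batch_preprocessing_onnx.py for the actual implementation.
import glob, os
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession("onnx_files/mtmi_encoder.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

os.makedirs("features", exist_ok=True)
for path in glob.glob("images/*.png"):
    img = Image.open(path).convert("RGB").resize((1024, 1024))
    x = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0  # NCHW, assumed normalization
    feats = session.run(None, {input_name: x})  # shared encoder features
    stem = os.path.splitext(os.path.basename(path))[0]
    np.savez(os.path.join("features", stem + ".npz"), *feats)
```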
Then we can load the cached feature maps as input and feed them into the calibrators.
With default arguments, you will get the calibration file under calibration/.
python tools/create_calibration_cache.py --onnx=PATH_TO_HEAD_ONNX --image-path=PATH_TO_IMAGE_FILES --output-path=PATH_TO_SAVE_ENGINES --cache-path=PATH_TO_SAVE_CALIBRATION_FILES
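At the core of this step is a TensorRT INT8 calibrator that feeds the cached feature maps. The sketch below is a hedged illustration of the standard IInt8EntropyCalibrator2 pattern; the class name, file layout, and single-input assumption are ours, and tools/create_calibration_cache.py is the reference implementation.

```python
# Hedged sketch of an INT8 entropy calibrator that feeds cached feature maps
# to TensorRT. Names and the single-input assumption are illustrative; see
# tools/create_calibration_cache.py for the real calibrator.
import os
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import tensorrt as trt

class FeatureCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, feature_files, cache_path):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.files = feature_files
        self.cache_path = cache_path
        self.index = 0
        first = np.load(self.files[0])["arr_0"].astype(np.float32)
        self.device_mem = cuda.mem_alloc(first.nbytes)

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.index >= len(self.files):
            return None  # no more calibration batches
        batch = np.ascontiguousarray(
            np.load(self.files[self.index])["arr_0"].astype(np.float32))
        cuda.memcpy_htod(self.device_mem, batch)
        self.index += 1
        return [int(self.device_mem)]  # one device pointer per input tensor

    def read_calibration_cache(self):
        if os.path.exists(self.cache_path):
            with open(self.cache_path, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_path, "wb") as f:
            f.write(cache)
```

The calibrator is attached to the builder config (config.int8_calibrator) during an offline build, so that TensorRT writes the cache file that trtexec later consumes via --calib.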
In order to achieve the best INT8 resize performance on DLA, we carefully adjust the I/O scales by hand for the segmentation head while preserving accuracy. With the modified scales, TensorRT will choose a better DLA kernel.
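One way to apply such a change is to patch the scale of the affected tensor directly in the calibration cache before building. The sketch below assumes the cache stores each scale as the big-endian hex of a float32 on a "name: value" line; the tensor name, input cache path, and new scale are placeholders.

```python
# Hedged sketch: patch the I/O scale of one tensor in a TensorRT calibration
# cache. Assumes each cache line is "tensor_name: <big-endian float32 hex>";
# the tensor name, input cache path, and new scale below are placeholders.
import struct

TENSOR = "seg_head_input"   # placeholder tensor name
NEW_SCALE = 0.25            # placeholder scale chosen to keep accuracy

with open("calibration/calibration_cache_seg.bin") as f:
    lines = f.read().splitlines()

patched = []
for line in lines:
    name, _, value = line.partition(": ")
    if value and name == TENSOR:
        old = struct.unpack("!f", bytes.fromhex(value))[0]
        print(f"{name}: {old} -> {NEW_SCALE}")
        line = f"{name}: {struct.pack('!f', NEW_SCALE).hex()}"
    patched.append(line)

with open("calibration/calibration_cache_seg_mod.bin", "w") as f:
    f.write("\n".join(patched) + "\n")
```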
On NVIDIA Drive Orin platform:
Build the engine with "outputIOFormats=fp16:chw32" for the MiT-B0 encoder:
trtexec --onnx=onnx_files/mtmi_encoder.onnx \
--fp16 \
--saveEngine=engines/mtmi_encoder_fp16.engine \
--outputIOFormats=fp16:chw32 \
--verbose
Build DLA loadables with "inputIOFormats=int8:chw32" and "outputIOFormats=int8:dla_linear" for the depth decoder and semantic segmentation decoder ONNX models.
trtexec --onnx=onnx_files/mtmi_depth_head.onnx \
--int8 \
--saveEngine=loadables/mtmi_depth_i8_dla.loadable \
--useDLACore=0 \
--inputIOFormats=int8:chw32 \
--outputIOFormats=int8:dla_linear \
--buildOnly \
--verbose \
--buildDLAStandalone \
--calib=calibration/calibration_cache_depth.bin
trtexec --onnx=onnx_files/mtmi_seg_head.onnx \
--int8 \
--saveEngine=loadables/mtmi_seg_i8_dla.loadable \
--useDLACore=1 \
--inputIOFormats=int8:chw32 \
--outputIOFormats=int8:dla_linear \
--buildOnly \
--verbose \
--buildDLAStandalone \
--calib=calibration/calibration_cache_seg_mod.bin
The inference app is designed around the following steps:
- Initialization
- For each round
  - load the next image; if there are no more images, move back to the first one
  - preprocess
  - run the backbone on the GPU
  - quantize the feature map
  - run the depth head on DLA0 and the segmentation head on DLA1 at the same time (see the sketch after this list)
  - if this is the last round, dump the output data to results/ for visualization
- Cleanup and exit
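The pipelining in this loop can be summarized with the following conceptual sketch. The real app is C++ with CUDA, TensorRT, and cuDLA; run_encoder_gpu, run_depth_dla0, and run_seg_dla1 below are hypothetical stand-ins for those asynchronous calls.

```python
# Conceptual sketch of the pipelined schedule: in each round the two heads work
# on the previous frame while the encoder processes the current frame. The
# helper functions are hypothetical stand-ins for the real GPU/DLA calls.
from concurrent.futures import ThreadPoolExecutor

def run_encoder_gpu(frame):      # hypothetical: FP16 encoder on GPU
    return f"features({frame})"

def run_depth_dla0(features):    # hypothetical: INT8 depth head on DLA0
    return f"depth({features})"

def run_seg_dla1(features):      # hypothetical: INT8 seg head on DLA1
    return f"seg({features})"

frames = ["frame0", "frame1", "frame2", "frame3"]
with ThreadPoolExecutor(max_workers=3) as pool:
    prev_features = None
    for frame in frames + [None]:            # one extra round to drain the pipeline
        futures = []
        if prev_features is not None:
            # heads consume the previous frame's features on DLA0/DLA1 ...
            futures.append(pool.submit(run_depth_dla0, prev_features))
            futures.append(pool.submit(run_seg_dla1, prev_features))
        if frame is not None:
            # ... while the encoder processes the current frame on the GPU
            futures.append(pool.submit(run_encoder_gpu, frame))
        results = [f.result() for f in futures]
        prev_features = results[-1] if frame is not None else None
```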
In this section, we will cross-compile the application on x86 for the NVIDIA Drive Orin platform.
Docker: nvcr.io/drive-priority/driveos-pdk/drive-agx-orin-linux-aarch64-pdk-build-x86:6.0.8.1-0006. Launch the docker container with
docker run --gpus all -it --network=host --rm -v DL4AGX_DIR:/DL4AGX nvcr.io/drive-priority/driveos-pdk/drive-agx-orin-linux-aarch64-pdk-build-x86:6.0.8.1-0006
Inside the docker, you can run:
cd /DL4AGX/mtmi/inference_app
sh build.sh
You will get the executable file at /DL4AGX/mtmi/inference_app/build/orin/mtmiapp.
Please make sure you have the following files on the NVIDIA Drive Orin platform.
engines/
mtmi_depth_i8_dla.loadable
mtmi_encoder_fp16.engine
mtmi_seg_i8_dla.loadable
results/ # folder to hold result data
tests/
1.png # input images, must be cropped to 1024x1024
...
mtmiapp # the demo app
Then run the demo with
$ ./mtmiapp
Sample logs:
...
loop: 1 backbone elapsed: 29.799711 chrono elapsed: 29.965000
...
Run the following Python script to obtain visualization results from the dumped binary result data.
python tools/visualize.py
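If you want to inspect the dumps yourself, the sketch below shows one way to turn raw binary buffers into images. The file names, dtypes, and shapes are placeholders; tools/visualize.py handles the actual formats produced by the demo app.

```python
# Hedged sketch of turning dumped binary buffers into images. File names,
# dtypes, and shapes are placeholders; tools/visualize.py handles the actual
# formats produced by the demo app.
import numpy as np
import matplotlib.pyplot as plt

H, W, NUM_CLASSES = 1024, 1024, 19  # assumed output resolution / class count

depth = np.fromfile("results/depth.bin", dtype=np.float32).reshape(H, W)
seg = np.fromfile("results/seg.bin", dtype=np.float32).reshape(NUM_CLASSES, H, W)
seg_labels = seg.argmax(axis=0)     # per-pixel class id

plt.imsave("results/depth.png", depth, cmap="plasma")
plt.imsave("results/seg.png", seg_labels, cmap="tab20")
```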