Custom model had very slow performance (fps) #107

Open
hellvesper opened this issue Jan 19, 2025 · 5 comments

@hellvesper

Hi, I tried to run a custom model and it runs very slowly compared to YOLO. I tested with examples/vision/ai_vision/nn_forward.py: my model's forward time is ~280 ms, compared to 11 ms for YOLOv8n, even though my model is about 4x smaller. The model I'm running is SuperPoint (a CNN).

I've exported the PyTorch model to ONNX and it runs fine; on CPU it has a similar ~200 ms forward time. Here is the model structure from Netron.app:

[Image: ONNX model structure from Netron]

Then I used the script below to quantize the model to the cvitek format, setting the output tensors to the last convolution layers:

convert_model.sh
#!/bin/bash

set -e

net_name=superpoint_dynamic_simple
input_w=640  
input_h=480  

mkdir -p workspace
cd workspace

# convert to mlir
model_transform.py \
--model_name ${net_name} \
--model_def ../${net_name}.onnx \
--input_shapes [[1,1,${input_h},${input_w}]] \
--mean "0" \
--scale "0.00392156862745098" \
--keep_aspect_ratio \
--pixel_format gray \
--channel_format nchw \
--output_names "semi,/convDb/Conv_output_0" \
--test_input ../test_image.jpg \
--test_result ${net_name}_top_outputs.npz \
--tolerance 0.99,0.99 \
--mlir ${net_name}.mlir

# export bf16 model
#   not use --quant_input, use float32 for easy coding
model_deploy.py \
--mlir ${net_name}.mlir \
--quantize BF16 \
--processor cv181x \
--test_input ${net_name}_in_f32.npz \
--test_reference ${net_name}_top_outputs.npz \
--model ${net_name}_bf16.cvimodel

# export int8 model
echo "calibrate for int8 model"
run_calibration.py ${net_name}.mlir \
--dataset ../calibration_images \
--input_num 200 \
-o ${net_name}_cali_table

echo "convert to int8 model"
model_deploy.py \
--mlir ${net_name}.mlir \
--quantize INT8 \
--quant_input \
--calibration_table ${net_name}_cali_table \
--processor cv181x \
--test_input ${net_name}_in_f32.npz \
--test_reference ${net_name}_top_outputs.npz \
--tolerance 0.9,0.6 \
--model ${net_name}_int8.cvimodel 

Although my model has only 18 nodes compared to 80 in YOLOv8n, it has an enormous ION memory requirement: 46.68 MB (CviModel Need ION Memory Size: (46.68 MB)) versus 4.40 MB for YOLO (CviModel Need ION Memory Size: (4.40 MB)).
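As a side note, the shared_gmem_size in the dump further down lines up exactly with two full-resolution 64-channel int8 feature maps, which suggests the 640x480 activations, not the weights, dominate the memory footprint. A back-of-envelope check (assuming int8 NCHW activations, per the tensor map):

```shell
# Two 1x64x480x640 int8 activation tensors (the first two Relu outputs
# in the tensor map) account for the entire shared_gmem_size:
bytes_per_tensor=$((64 * 480 * 640))   # 19,660,800 bytes each
shared_gmem=$((2 * bytes_per_tensor))
echo "${shared_gmem}"                  # matches shared_gmem_size: 39321600
```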

The tensor map in the resulting cvimodel also looks odd: it is a batch of Relu entries, whereas the ONNX model has a Conv→Relu→Conv→Relu structure.

I have read the "cvitek TPU quick start guide" and the tpumlir.org docs and didn't find any clue.

I'm definitely missing something, please help.

cvimodel_tool full dump
Cvitek Runtime (1.4.0)t4.1.0-23-gb920beb@20230910
Mlir Version: v1.14-20241231
Cvimodel Version: 1.4.0
superpoint_dynamic_simple Build at 2025-01-19 03:07:23
For cv181x chip ONLY
CviModel Need ION Memory Size: (46.68 MB)

Sections:
ID   TYPE      NAME                     SIZE        OFFSET      ENCRYPT     COMPRESS    MD5
000  weight    weight                   1313680     0           False       False       22500857e07e66db361ac62bbc1b4780
001  cmdbuf    subfunc_0                1837776     1313680     False       False       d7f1d41bfa3e2e7f32e0035ca91e8639

WeightMap:
ID   OFFSET    SIZE      TYPE    N    C    H    W    NAME
000  467072    576       int8    1    64   1    9    /relu/Relu_output_0_Relu_bias_packed
001  902400    576       int8    1    64   9    1    /relu/Relu_output_0_Relu_filter_reordered
002  467648    576       int8    1    64   1    9    /relu_1/Relu_output_0_Relu_bias_packed
003  865536    36864     int8    1    64   9    64   /relu_1/Relu_output_0_Relu_filter_reordered
004  942160    576       int8    1    64   1    9    /relu_2/Relu_output_0_Relu_bias_packed
005  902976    36864     int8    1    64   9    64   /relu_2/Relu_output_0_Relu_filter_reordered
006  939840    576       int8    1    64   1    9    /relu_3/Relu_output_0_Relu_bias_packed
007  468224    36864     int8    1    64   9    64   /relu_3/Relu_output_0_Relu_filter_reordered
008  940416    1152      int8    1    128  1    9    /relu_4/Relu_output_0_Relu_bias_packed
009  942736    73728     int8    1    128  9    64   /relu_4/Relu_output_0_Relu_filter_reordered
010  1165072   1152      int8    1    128  1    9    /relu_5/Relu_output_0_Relu_bias_packed
011  1166224   147456    int8    1    128  9    128  /relu_5/Relu_output_0_Relu_filter_reordered
012  1016464   1152      int8    1    128  1    9    /relu_6/Relu_output_0_Relu_bias_packed
013  1017616   147456    int8    1    128  9    128  /relu_6/Relu_output_0_Relu_filter_reordered
014  465920    1152      int8    1    128  1    9    /relu_7/Relu_output_0_Relu_bias_packed
015  318464    147456    int8    1    128  9    128  /relu_7/Relu_output_0_Relu_filter_reordered
016  316160    2304      int8    1    256  1    9    /relu_8/Relu_output_0_Relu_bias_packed
017  21248     294912    int8    1    256  9    128  /relu_8/Relu_output_0_Relu_filter_reordered
018  941568    585       int8    1    65   1    9    semi_Conv_bias_packed
019  2304      16640     int8    1    65   1    256  semi_Conv_filter_reordered
020  0         2304      int8    1    256  1    9    /relu_9/Relu_output_0_Relu_bias_packed
021  570624    294912    int8    1    256  9    128  /relu_9/Relu_output_0_Relu_filter_reordered
022  18944     2304      int8    1    256  1    9    /convDb/Conv_output_0_Conv_bias_packed
023  505088    65536     int8    1    256  1    256  /convDb/Conv_output_0_Conv_filter_reordered

Program #0
    batch_num   : 0
    private_gmem_size: 0
    shared_gmem_size: 39321600
    inputs      : input
    outputs     : semi_Conv_f32,/convDb/Conv_output_0_Conv_f32
    routines    :
     #00  tpu
        inputs  : input
        outputs : semi_Conv_f32,/convDb/Conv_output_0_Conv_f32
        section : subfunc_0

    tensor_map  :
        ID   OFFSET      TYPE  N    C    H    W    QSCALE     MEM     NAME
        000  0           int8  1    1    480  640  127.000000 io_mem  input
        001  0           int8  1    64   480  640  0.339957   shared  /relu/Relu_output_0_Relu
        002  19660800    int8  1    64   480  640  0.165536   shared  /relu_1/Relu_output_0_Relu
        003  0           int8  1    64   240  320  0.165536   shared  /pool/MaxPool_output_0_MaxPool
        004  4915200     int8  1    64   240  320  0.231064   shared  /relu_2/Relu_output_0_Relu
        005  0           int8  1    64   240  320  0.269022   shared  /relu_3/Relu_output_0_Relu
        006  4915200     int8  1    64   120  160  0.269022   shared  /pool_1/MaxPool_output_0_MaxPool
        007  0           int8  1    128  120  160  0.167438   shared  /relu_4/Relu_output_0_Relu
        008  2457600     int8  1    128  120  160  0.154103   shared  /relu_5/Relu_output_0_Relu
        009  0           int8  1    128  60   80   0.154103   shared  /pool_2/MaxPool_output_0_MaxPool
        010  614400      int8  1    128  60   80   0.248690   shared  /relu_6/Relu_output_0_Relu
        011  0           int8  1    128  60   80   0.283347   shared  /relu_7/Relu_output_0_Relu
        012  614400      int8  1    256  60   80   0.033683   shared  /relu_8/Relu_output_0_Relu
        013  2457600     int8  1    65   60   80   0.335987   shared  semi_Conv
        014  1228800     int8  1    256  60   80   0.177987   shared  /relu_9/Relu_output_0_Relu
        015  0           int8  1    256  60   80   4.332494   shared  /convDb/Conv_output_0_Conv
        016  0           fp32  1    256  60   80   1.000000   io_mem  /convDb/Conv_output_0_Conv_f32
        017  0           fp32  1    65   60   80   1.000000   io_mem  semi_Conv_f32
@Neutree (Member) commented Jan 20, 2025

Maybe because of the input size? Try changing to a smaller input size.

@hellvesper (Author)

> Maybe because of the input size? Try changing to a smaller input size.

I tried YOLO11n at 640x640 and it takes ~11 ms. Are there any profiler tools I can use to investigate performance bottlenecks?
I noticed that the toolkit used to quantize and compile models for the TPU uses some TPU emulation, but it is lacking documentation.

@Neutree (Member) commented Jan 20, 2025

Maybe you can change to a different output node to debug which node spends so much time.

@Neutree (Member) commented Jan 20, 2025

Your model is simple; exporting with different output nodes, as either bf16 or int8, is both fast, so just try it.
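That bisection can be scripted on top of the convert_model.sh posted above. A sketch, not a tested pipeline: the intermediate tensor names below are taken from the posted tensor map and must be adjusted to the actual graph, and the bf16 path is used because it skips calibration, so each probe build is quick:

```shell
#!/bin/bash
# Rebuild the model with progressively earlier output nodes to find
# the layer where the forward time jumps. The tensor names here come
# from the cvimodel dump's tensor map; adjust them to your own graph.
set -e
net_name=superpoint_dynamic_simple
for out in "/relu_3/Relu_output_0" "/relu_7/Relu_output_0" "/convDb/Conv_output_0"; do
  tag=$(echo "${out}" | tr '/' '_')    # make the node name filesystem-safe
  model_transform.py \
    --model_name ${net_name} \
    --model_def ../${net_name}.onnx \
    --input_shapes [[1,1,480,640]] \
    --mean "0" \
    --scale "0.00392156862745098" \
    --pixel_format gray \
    --channel_format nchw \
    --output_names "${out}" \
    --mlir ${net_name}${tag}.mlir
  model_deploy.py \
    --mlir ${net_name}${tag}.mlir \
    --quantize BF16 \
    --processor cv181x \
    --model ${net_name}${tag}_bf16.cvimodel
done
```

Measuring the forward time of each resulting cvimodel then brackets the expensive layer.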

@Neutree (Member) commented Jan 20, 2025

And don't use the --quant_input arg if you use MaixPy.
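Applied to the int8 deploy step from the script above, that means dropping the single --quant_input flag so the model keeps a float32 input tensor; the rest of the command is unchanged:

```shell
# int8 deploy for MaixPy: same as before, but without --quant_input
model_deploy.py \
  --mlir ${net_name}.mlir \
  --quantize INT8 \
  --calibration_table ${net_name}_cali_table \
  --processor cv181x \
  --test_input ${net_name}_in_f32.npz \
  --test_reference ${net_name}_top_outputs.npz \
  --tolerance 0.9,0.6 \
  --model ${net_name}_int8.cvimodel
```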
