
Cannot get a meaningful speed improvement on GPU #3353

Open

Yongfan-Liu opened this issue Sep 21, 2024 · 6 comments

Yongfan-Liu commented Sep 21, 2024

I tried to run the exported ONNX file on both an RTX 3070 and an RTX 4090, but I see no speed improvement (it is even slower than the unquantized model). Here is the warning from onnxruntime:
2024-09-20 19:58:09.358958003 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-09-20 19:58:09.367445710 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-09-20 19:58:09.367452106 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-09-20 19:58:09.536748386 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-09-20 19:58:09.536770318 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
Has anyone met the same problem? Can someone tell me whether this is caused by AIMET, or whether something is wrong with onnxruntime?
It seems that the exported ONNX file does not fit ORT well; how can I improve that?
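[Editor's note: for reference, here is one way to surface the detailed logs the Memcpy warning refers to. This is a minimal sketch, not from the thread; it assumes onnxruntime-gpu is installed, and the model path is hypothetical.]

    import onnxruntime as ort

    # Lower the session log severity (1 = INFO, as the warning suggests) to
    # see which nodes the Memcpy insertions and CPU fallbacks are attached to.
    opts = ort.SessionOptions()
    opts.log_severity_level = 1

    sess = ort.InferenceSession(
        "model.onnx",  # hypothetical path to the exported model
        sess_options=opts,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())  # confirm the CUDA EP was actually loaded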

xs-alt commented Oct 21, 2024

@Yongfan-Liu
Hi Liu, did you solve it?

Yongfan-Liu (Author)

@xs-alt No, I haven't.

quic-mangal (Contributor)

@quic-mtuttle, can you help respond to this?

quic-mtuttle (Contributor)

Hi @Yongfan-Liu, sorry for the delayed response. To clarify a bit, AIMET is designed to simulate and optimize the quantized accuracy of networks prior to deployment on quantized runtimes/edge devices, not to optimize GPU performance in onnxruntime/torch/tensorflow. This simulation is done by inserting fake-quantization (quantize-dequantize) operations in the model graph, which adds some computational overhead.

I might need a bit more context to help with the warnings. Generally, the exported onnx files do not contain any aimet quantization nodes at all (the quantization parameters are in a separate .encodings file), so it's possible these warnings may be normal for your model. Do you see any of these warnings when running the model in onnxruntime without going through aimet?
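[Editor's note: for context, a minimal sketch of the simulation flow described above, assuming the aimet_torch 1.x API; the toy model, calibration callback, and output paths are all hypothetical.]

    import torch
    from aimet_torch.quantsim import QuantizationSimModel

    # Hypothetical toy model standing in for the real network.
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    # Wraps ops with quantize-dequantize ("fake quant") nodes; this is the
    # simulation overhead seen when timing the sim model on a GPU.
    sim = QuantizationSimModel(model, dummy_input=dummy_input)

    # Calibrate quantization encodings with representative data (here, one batch).
    def forward_pass(m, _):
        with torch.no_grad():
            m(dummy_input)

    sim.compute_encodings(forward_pass, forward_pass_callback_args=None)

    # Export writes a plain ONNX file plus a separate model.encodings JSON.
    sim.export(path=".", filename_prefix="model", dummy_input=dummy_input)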

Yongfan-Liu (Author) commented Nov 18, 2024

Hello @quic-mtuttle, thank you for your clarification. I tried running three kinds of files on ORT (export calls sketched after this list):

  • The ONNX file exported directly, without going through AIMET
  • The ONNX file exported by sim.export, after going through AIMET
  • The ONNX file exported by sim.export, after going through AIMET and with use_embedded_encodings=True
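[Editor's note: a sketch of what those three export calls might look like, continuing the hypothetical model, dummy_input, and sim from the earlier sketch; filename prefixes are made up.]

    # 1. Plain export, no AIMET involved.
    torch.onnx.export(model, dummy_input, "model_plain.onnx")

    # 2. AIMET export: plain ONNX plus a separate model_aimet.encodings file.
    sim.export(path=".", filename_prefix="model_aimet", dummy_input=dummy_input)

    # 3. AIMET export with the encodings embedded as QDQ nodes in the graph.
    sim.export(path=".", filename_prefix="model_qdq", dummy_input=dummy_input,
               use_embedded_encodings=True)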

They all reported:

24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.

Actually, I'm still confused: after PTQ finishes and sim.export runs, how do we load the ONNX file correctly so that we get a real quantized model for downstream tasks? How do we make good use of the .encodings file? The related introduction in the documentation is not very clear. Do you have any solutions for this, or any plans for it in the future?

quic-mtuttle (Contributor)

Hi @Yongfan-Liu, thanks for the additional information. If you are getting the warnings even without AIMET, then it is probably related to the model structure rather than anything AIMET is doing. It may not be anything you need to worry about; in my experience these warnings are fairly common for ONNX models.

Exporting the model with use_embedded_encodings=True will allow you to load the quantized model in ONNX (in QDQ format), but as far as I know the CUDAExecutionProvider isn't yet capable of truly running this as a quantized model and will just fall back to fake quantization (i.e., use quantize-dequantize operations).
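[Editor's note: one way to confirm the QDQ export looks as expected; a sketch assuming the onnx and onnxruntime-gpu packages, with a hypothetical model path.]

    import onnx
    import onnxruntime as ort

    m = onnx.load("model_qdq.onnx")  # hypothetical path to the QDQ export
    qdq = [n for n in m.graph.node
           if n.op_type in ("QuantizeLinear", "DequantizeLinear")]
    print(f"{len(qdq)} Q/DQ nodes in the graph")

    # With the CUDA EP, these Q/DQ pairs currently execute as fake
    # quantization rather than being fused into true int8 kernels.
    sess = ort.InferenceSession(
        "model_qdq.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )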

As for how to use the .encodings file: it can be passed to other tools such as qairt-converter (as --quantization_overrides) or Qualcomm AI Hub (alongside the .onnx model in a compile job), which compile the model for target runtimes. We will work to provide more thorough documentation on this process in the coming weeks.
