Cannot get a significant speed improvement on GPU #3353
I tried to run the exported onnx file on both an RTX 3070 and an RTX 4090, but cannot see any speed improvement (the quantized model is even slower than the unquantized one). Here are the onnxruntime warnings:

2024-09-20 19:58:09.358958003 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-09-20 19:58:09.367445710 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-09-20 19:58:09.367452106 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-09-20 19:58:09.536748386 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-09-20 19:58:09.536770318 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.

Does anyone else meet the same problem? Or can someone please tell me whether this is because of AIMET, or whether something is wrong with onnxruntime? It seems that the exported onnx file does not match ORT well; how can I improve that?
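A minimal first step for digging into these warnings is the one the first log line itself suggests: raise ORT's log verbosity and confirm which execution providers actually loaded. A rough sketch ("model.onnx" is a placeholder for the exported model):

```python
import onnxruntime as ort

# Lower numbers are more verbose: 0 = verbose, 1 = info, 2 = warning (default).
# The warning suggests 1; 0 additionally prints node-to-provider assignments.
so = ort.SessionOptions()
so.log_severity_level = 1

# "model.onnx" is a placeholder for the AIMET-exported model.
sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# If the CUDA provider failed to initialize, ORT silently falls back to CPU,
# which would also explain the missing speedup.
print(sess.get_providers())
```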
Comments

@xs-alt No, I haven't.

@quic-mtuttle, can you help respond to this?

Hi @Yongfan-Liu, sorry for the delayed response. To clarify a bit: AIMET is designed to simulate and optimize the quantized accuracy of networks prior to deployment on quantized runtimes/edge devices, not to optimize GPU performance in onnxruntime/torch/tensorflow. This simulation is done by inserting fake-quantization (quantize-dequantize) operations into the model graph, which adds some computational overhead. I might need a bit more context to help with the warnings. Generally, the exported onnx files do not contain any aimet quantization nodes at all (the quantization parameters are in a separate file).
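To make the quantize-dequantize point concrete, here is a toy sketch of what fake quantization does (a simplified illustration of the concept, not AIMET's actual implementation). The tensors stay float32, so the GPU still runs full-precision kernels plus the extra quantize/dequantize work, which is why simulation does not speed anything up:

```python
import torch

def fake_quantize(x: torch.Tensor, bitwidth: int = 8) -> torch.Tensor:
    # Asymmetric quantization grid derived from the tensor's min/max.
    qmax = 2 ** bitwidth - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / qmax
    offset = torch.round(x.min() / scale)
    # Quantize to integers and clamp to the representable range...
    q = torch.clamp(torch.round(x / scale) - offset, 0, qmax)
    # ...then immediately dequantize back to float32.
    return (q + offset) * scale

x = torch.randn(8)
print(x)
print(fake_quantize(x))  # same shape and dtype, values snapped to the int8 grid
```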
Hello @quic-mtuttle, thank you for your clarification. I tried to run three types of files on ORT: […]

They all reported: "24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message." Actually, I'm still confused that, after the PTQ finished and running […]
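One way to see exactly where those 24 Memcpy nodes land is to have ORT dump its transformed graph via the standard optimized_model_filepath session option. A sketch, with both file names as placeholders:

```python
import onnx
import onnxruntime as ort

so = ort.SessionOptions()
# Ask ORT to save the graph after its transformations; the inserted
# MemcpyToHost/MemcpyFromHost nodes are visible in the saved model.
so.optimized_model_filepath = "optimized.onnx"
ort.InferenceSession("model.onnx", sess_options=so,
                     providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

m = onnx.load("optimized.onnx")
memcpy = [n for n in m.graph.node if n.op_type.startswith("Memcpy")]
# Each Memcpy marks a CPU<->GPU boundary, typically around nodes assigned
# to the CPU provider (e.g. the shape-related ops the second warning mentions).
for n in memcpy:
    print(n.op_type, n.name)
```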
Hi @Yongfan-Liu, thanks for the additional information. If you are getting the warnings even without AIMET, then it is probably something related to the model structure rather than anything AIMET is doing. It may not be anything you need to worry about; these warnings can be fairly common in onnx models in my experience. Exporting the model with […] As for how to use the […]
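On the separate file mentioned earlier: AIMET writes the quantization parameters to a JSON encodings file alongside the exported onnx model. A minimal sketch for inspecting it (the file name is a placeholder, and the exact schema can vary between AIMET versions):

```python
import json

# "model.encodings" is a placeholder for the file produced next to the
# exported onnx model.
with open("model.encodings") as f:
    enc = json.load(f)

print(list(enc.keys()))
# In 1.x releases the top level typically holds per-tensor entries under
# keys like "activation_encodings" and "param_encodings".
for name, e in list(enc.get("activation_encodings", {}).items())[:3]:
    print(name, e)  # per-tensor scale/offset/bitwidth entries
```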