
Cannot get a meaningful speed improvement on GPU #3353

Open

Yongfan-Liu opened this issue Sep 21, 2024 · 6 comments

Yongfan-Liu commented Sep 21, 2024

I tried to run the exported ONNX file on both an RTX 3070 and an RTX 4090, but I see no speed improvement (it is even slower than the unquantized model). Here is the warning from onnxruntime:
2024-09-20 19:58:09.358958003 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-09-20 19:58:09.367445710 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-09-20 19:58:09.367452106 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-09-20 19:58:09.536748386 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-09-20 19:58:09.536770318 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
Has anyone met the same problem? Can someone tell me whether this is caused by AIMET, or whether something is wrong with onnxruntime?
It seems that the exported ONNX file does not fit ORT well; how can I improve that?
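[Editor's note: for reference, here is one way to surface the detailed logs the Memcpy warning refers to. This is a minimal sketch, not from the thread; it assumes onnxruntime-gpu is installed, and the model path is hypothetical.]

    import onnxruntime as ort

    # Lower the session log severity (1 = INFO, as the warning suggests) to
    # see which nodes the Memcpy insertions and CPU fallbacks are attached to.
    opts = ort.SessionOptions()
    opts.log_severity_level = 1

    sess = ort.InferenceSession(
        "model.onnx",  # hypothetical path to the exported model
        sess_options=opts,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())  # confirm the CUDA EP was actually loaded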

xs-alt commented Oct 21, 2024

@Yongfan-Liu
Hi Liu, did you solve it?

Yongfan-Liu (Author)

@xs-alt No, I haven't.

quic-mangal (Contributor)

@quic-mtuttle, can you help respond to this?

quic-mtuttle (Contributor)

Hi @Yongfan-Liu, sorry for the delayed response. To clarify a bit, AIMET is designed to simulate and optimize the quantized accuracy of networks prior to deployment on quantized runtimes/edge devices, not to optimize GPU performance in onnxruntime/torch/tensorflow. This simulation is done by inserting fake-quantization (quantize-dequantize) operations in the model graph, which adds some computational overhead.

I might need a bit more context to help with the warnings. Generally, the exported onnx files do not contain any aimet quantization nodes at all (the quantization parameters are in a separate .encodings file), so it's possible these warnings may be normal for your model. Do you see any of these warnings when running the model in onnxruntime without going through aimet?
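[Editor's note: for context, a minimal sketch of the simulation flow described above, assuming the aimet_torch 1.x API; the toy model, calibration callback, and output paths are all hypothetical.]

    import torch
    from aimet_torch.quantsim import QuantizationSimModel

    # Hypothetical toy model standing in for the real network.
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    # Wraps ops with quantize-dequantize ("fake quant") nodes; this is the
    # simulation overhead seen when timing the sim model on a GPU.
    sim = QuantizationSimModel(model, dummy_input=dummy_input)

    # Calibrate quantization encodings with representative data (here, one batch).
    def forward_pass(m, _):
        with torch.no_grad():
            m(dummy_input)

    sim.compute_encodings(forward_pass, forward_pass_callback_args=None)

    # Export writes a plain ONNX file plus a separate model.encodings JSON.
    sim.export(path=".", filename_prefix="model", dummy_input=dummy_input)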

Yongfan-Liu (Author) commented Nov 18, 2024

Hello @quic-mtuttle, thank you for your clarification. I tried running three kinds of files on ORT (export calls sketched after this list):

  • The ONNX file exported directly, without going through AIMET
  • The ONNX file exported by sim.export, after going through AIMET
  • The ONNX file exported by sim.export, after going through AIMET and with use_embedded_encodings=True
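[Editor's note: a sketch of what those three export calls might look like, continuing the hypothetical model, dummy_input, and sim from the earlier sketch; filename prefixes are made up.]

    # 1. Plain export, no AIMET involved.
    torch.onnx.export(model, dummy_input, "model_plain.onnx")

    # 2. AIMET export: plain ONNX plus a separate model_aimet.encodings file.
    sim.export(path=".", filename_prefix="model_aimet", dummy_input=dummy_input)

    # 3. AIMET export with the encodings embedded as QDQ nodes in the graph.
    sim.export(path=".", filename_prefix="model_qdq", dummy_input=dummy_input,
               use_embedded_encodings=True)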

They all reported:

24 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.

Actually, I'm still confused: after PTQ finishes and sim.export runs, how do we load the ONNX file correctly so that we get a real quantized model for downstream tasks? How do we make good use of the .encodings file? The related introduction in the documentation is not very clear. Do you have any solutions for this, or any plans for it in the future?

quic-mtuttle (Contributor)

Hi @Yongfan-Liu, thanks for the additional information. If you are getting the warnings even without AIMET, then it is probably related to the model structure rather than anything AIMET is doing. It may not be anything you need to worry about; in my experience these warnings are fairly common for ONNX models.

Exporting the model with use_embedded_encodings=True will allow you to load the quantized model in ONNX (in QDQ format), but as far as I know the CUDAExecutionProvider isn't yet capable of truly running this as a quantized model and will just fall back to fake quantization (i.e., use quantize-dequantize operations).
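[Editor's note: one way to confirm the QDQ export looks as expected; a sketch assuming the onnx and onnxruntime-gpu packages, with a hypothetical model path.]

    import onnx
    import onnxruntime as ort

    m = onnx.load("model_qdq.onnx")  # hypothetical path to the QDQ export
    qdq = [n for n in m.graph.node
           if n.op_type in ("QuantizeLinear", "DequantizeLinear")]
    print(f"{len(qdq)} Q/DQ nodes in the graph")

    # With the CUDA EP, these Q/DQ pairs currently execute as fake
    # quantization rather than being fused into true int8 kernels.
    sess = ort.InferenceSession(
        "model_qdq.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )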

As for how to use the .encodings file: it can be passed to other tools such as qairt-converter (as --quantization_overrides) or Qualcomm AI Hub (alongside the .onnx model in a compile job), which compile the model for target runtimes. We will work to provide more thorough documentation on this process in the coming weeks.
