
Slower results from quantized model potentially due to warning prints #9

Open
Mijawel opened this issue Feb 16, 2023 · 4 comments

Comments

@Mijawel

Mijawel commented Feb 16, 2023

I'm getting the following warning, which prints probably 1000 times during execution:
[W qlinear_dynamic.cpp:239] Warning: Currently, qnnpack incorrectly ignores reduce_range when it is set to true; this may change in a future release. (function apply_dynamic_impl)

And my results are maybe 30% slower than when running the non-quantized model, and I think the warning spam might be the cause.

Any idea how to fix?

(And just using `warnings.filterwarnings` in Python is not suppressing them, for some reason.)
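One likely reason `warnings.filterwarnings` has no effect here: the message is emitted from PyTorch's C++ layer directly to the process's standard error stream, so Python's warning filters never see it. A common workaround (not confirmed by this thread, just a general technique) is to redirect file descriptor 2 itself around the noisy call. A minimal sketch, with a hypothetical `quantized_model` in the usage comment:

```python
import contextlib
import os
import sys

@contextlib.contextmanager
def suppress_stderr():
    """Temporarily point file descriptor 2 at /dev/null.

    Unlike `warnings.filterwarnings` or `contextlib.redirect_stderr`
    (which only swaps the Python-level sys.stderr object), this also
    silences messages written by C/C++ extension code.
    """
    sys.stderr.flush()
    saved_fd = os.dup(2)                        # keep a copy of the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)                  # fd 2 now goes to /dev/null
        yield
    finally:
        os.dup2(saved_fd, 2)                    # restore the original stderr
        os.close(devnull_fd)
        os.close(saved_fd)

# Hypothetical usage around the noisy inference call:
# with suppress_stderr():
#     result = quantized_model.transcribe(audio)
```

Note this hides *all* stderr output inside the block, including real errors, so it is best scoped tightly around the call that triggers the warning. It also does not address the slowdown itself; if the qnnpack backend is the root cause, switching the quantized engine may be the more direct fix.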

@MiscellaneousStuff
Owner

Are you getting this using the code from this repository or are you running the dynamic quantisation in your own script?

@Mijawel
Author

Mijawel commented Feb 27, 2023

Getting it from this repo (although I have since moved over to the faster_whisper repo, which is based on CTranslate2 and seems to have working quantization).

@MiscellaneousStuff
Owner

That repo looks more developed than this one, so that is probably a good idea, especially if they have verified that their method has the same performance after optimisation. Good to see others also using quantisation. As for the warnings, I've run into similar issues when using other NVIDIA libraries like TensorRT, so I will still need to look into this one.

@cehongwang

> I'm getting the following warning which prints probably 1000 times during execution: [W qlinear_dynamic.cpp:239] Warning: Currently, qnnpack incorrectly ignores reduce_range when it is set to true; this may change in a future release. (function apply_dynamic_impl) […] Any idea how to fix?

Hello, how did you fix the problem? I am also using torch.quantization and want to disable the printing.
