HQQ for convolutional layers #78
Hi! Thanks for your message! What is the main reason to quantize conv layers? The solution above will just shrink the model size; you won't see an inference speed-up, because you'd need a custom CUDA kernel to run this faster.
I understand your work is focused on transformers (large ML models). But the key problems your solution addresses are present in CNNs as well, such as calibration on a data subset, which becomes an issue with large datasets or multi-task training. So it's worth seeing whether the method can help there as well. Thanks for your response, I will try to see if I can get a working solution out of this.
@mobicham Do you have any inference-time benchmarks comparing GPTQ, AWQ, bitsandbytes and HQQ?
What is your main goal in quantizing convolutions: reducing the model size or faster inference? Regarding speed, HQQ uses an optimized int4 matmul kernel developed by the Torch AO team: https://github.com/mobiusml/hqq/raw/master/imgs/llama_int4_4090.png
Any of the below, without dropping inference speed and accuracy (minor/reasonable trade-off acceptable):
I see your implementation has
That's the default backend; we support different backends. This is what is actually used to speed up 4-bit inference: https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py#L258-L266 The idea is that you first have a default backend that works for both inference and training, then, depending on your inference use-case, you patch the model to use a specific optimized backend.
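A rough sketch of that two-step flow, assuming the API names from the repo (`BaseQuantizeConfig`, `HQQLinear`, and the `prepare_for_inference` patching helper); exact signatures may differ, so check the current README:

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Step 1: quantize a layer with the default backend
# (works for both inference and training).
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
linear = nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(linear, quant_config,
                      compute_dtype=torch.float16, device="cuda")

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_layer(x)  # runs with the default backend

# Step 2: depending on the inference use-case, patch the model
# afterwards to an optimized backend (e.g. the torchao int4 kernel):
# from hqq.utils.patching import prepare_for_inference
# prepare_for_inference(model, backend="torchao_int4")
```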
Thanks for the amazing work.
Is there any chance you can add support for convolutional layers?
I tried reshaping the weight to get it quantized, but then it crashes at inference because the shapes don't match.
Any suggestions or ideas would be highly appreciated. Thanks,
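For reference, a minimal sketch of the reshape idea (not the library's own method): flatten the Conv2d weight to 2D, quantize it like a linear weight, then dequantize and restore the original 4D shape before the convolution so the shapes match again. It assumes hqq exposes `Quantizer.quantize` / `Quantizer.dequantize` helpers as in the repo; as noted above, this only shrinks the model and won't speed up inference without a dedicated kernel.

```python
import torch
import torch.nn.functional as F
from hqq.core.quantize import Quantizer  # assumed helper from the hqq repo

conv = torch.nn.Conv2d(64, 128, kernel_size=3, padding=1).half().cuda()
w = conv.weight.data                      # (out_c, in_c, kh, kw)
w_2d = w.reshape(w.shape[0], -1)          # flatten to (out_c, in_c*kh*kw)

# Quantize the flattened weight as if it were a linear layer's weight.
w_q, meta = Quantizer.quantize(w_2d, nbits=4, group_size=64)

# At inference: dequantize and reshape back to 4D before calling conv2d.
w_deq = Quantizer.dequantize(w_q, meta).reshape(w.shape)
x = torch.randn(1, 64, 32, 32, dtype=torch.float16, device="cuda")
y = F.conv2d(x, w_deq, bias=conv.bias, padding=1)
```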