Does Marlin support zero-point quantization? #5
Comments
+1 - would be great to have Marlin speed with AWQ perplexity
+1 - Marlin's great, would be amazing to have AWQ support
I am a bit confused by this issue. Have you compared the PPL of Marlin models relative to AWQ?
Marlin used a different method for measuring perplexity, so unfortunately the two can't be compared directly.
Well, my point is that the above post seems to assume that the AWQ PPL is better than that of the GPTQ version used by Marlin. This might not be the case.
Hi, in general, my experience is that when GPTQ is tuned and configured properly (e.g., it also uses grid-clipping), results are extremely similar to AWQ. That being said, Marlin is a general fp16xint4 matmul kernel, at the moment supporting symmetric linear quantization either column-wise or at groupsize 128, with fp16 scales. It does not matter how the quantized weights are produced: they could come from GPTQ, AWQ, ZeroQuant or any other quantization method, as long as they follow Marlin's format. I think fixing the zero-point to 8 should cause AWQ to produce Marlin-compatible weights? Currently, Marlin does not support zero points. With moderately sized groups and grid-clipping (as used by AWQ or our improved GPTQ implementation), the difference between symmetric and asymmetric quantization seemed very small in my tests, maybe <= 0.01 PPL. Zero points stored in fp16 should not be too hard to support, but are probably not worth it from an accuracy standpoint (one could use a smaller symmetric groupsize instead). Quantized zero points may bring marginal gains in some cases, but are likely a bit tricky to support without any efficiency drop (the current version already requires quite some care to avoid unfavorable instruction ordering by the compiler in the main loop when using groups).
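To make the "fix the zero-point to 8" idea concrete, here is a minimal NumPy sketch of symmetric 4-bit group quantization. It only illustrates the arithmetic: the function names, the `(in_features, out_features)` weight layout, and the choice of `max_abs / 7` as the scale are assumptions for illustration, and it does not reproduce Marlin's actual packed weight format or any grid-clipping search.

```python
import numpy as np

def quantize_symmetric_int4(w, group_size=128):
    """Symmetric 4-bit quantization with the zero-point fixed to 8 (hypothetical helper).

    w has shape (in_features, out_features); groups run along in_features.
    Returns uint8 codes in [1, 15] and one fp16 scale per (group, column).
    """
    k, n = w.shape
    assert k % group_size == 0, "in_features must be divisible by group_size"
    g = w.reshape(k // group_size, group_size, n).astype(np.float32)
    max_abs = np.abs(g).max(axis=1, keepdims=True)
    # Map [-max_abs, max_abs] onto the integer levels -7..7; clamp so that the
    # fp16 scale never underflows to zero.
    scale = np.maximum(max_abs / 7.0, np.finfo(np.float16).tiny).astype(np.float16)
    s = scale.astype(np.float32)
    # Shift by the fixed zero-point 8 so the stored codes are unsigned.
    q = np.clip(np.round(g / s) + 8, 0, 15).astype(np.uint8)
    return q.reshape(k, n), scale.squeeze(1)

def dequantize_symmetric_int4(q, scale, group_size=128):
    """Inverse mapping: w_hat = (q - 8) * scale."""
    k, n = q.shape
    g = q.reshape(k // group_size, group_size, n).astype(np.float32) - 8.0
    return (g * scale[:, None, :].astype(np.float32)).reshape(k, n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 16)).astype(np.float32)
    q, scale = quantize_symmetric_int4(w)
    w_hat = dequantize_symmetric_int4(q, scale)
    print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Because the zero-point is a constant, only the scales need to be stored next to the 4-bit codes, which matches the "symmetric, fp16 scales" description above.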
Thanks for the amazing work @efrantar! Regarding the zero-point, it is actually very important to have it, especially at low bit-widths. In fact, the zero-point is more important than the scaling. That is why methods like HQQ optimize for the zero-point. To give you some perspective on why the zero-point is important, I ran two experiments on wikitext, Llama2-7B model, 2-bit quantization, context-size=1024 with HQQ+:
If the group-size is the same for both the scaling and the zero-point, it shouldn't be too difficult to add it, I think.
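For comparison, the asymmetric case with a scale and a zero-point sharing the same group size can be sketched as follows. This is plain per-group min/max quantization with a floating-point zero-point, not HQQ's optimized zero-point; the function names, the 2-bit default, and the group size of 64 are assumptions chosen for illustration.

```python
import numpy as np

def quantize_asymmetric(w, n_bits=2, group_size=64):
    """Per-group min/max quantization with a scale and a zero-point (hypothetical helper).

    w has shape (in_features, out_features); groups run along in_features.
    Scale and zero-point use the same group size, as discussed above.
    """
    k, n = w.shape
    assert k % group_size == 0, "in_features must be divisible by group_size"
    g = w.reshape(k // group_size, group_size, n).astype(np.float32)
    q_max = 2 ** n_bits - 1
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / q_max
    # Floating-point zero-point: the group minimum maps to code 0.
    zero = -w_min / scale
    q = np.clip(np.round(g / scale + zero), 0, q_max).astype(np.uint8)
    return q.reshape(k, n), scale.squeeze(1), zero.squeeze(1)

def dequantize_asymmetric(q, scale, zero, group_size=64):
    """Inverse mapping: w_hat = (q - zero) * scale."""
    k, n = q.shape
    g = q.reshape(k // group_size, group_size, n).astype(np.float32)
    return ((g - zero[:, None, :]) * scale[:, None, :]).reshape(k, n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A shifted distribution, where a symmetric grid would waste levels.
    w = (rng.standard_normal((128, 8)) + 1.0).astype(np.float32)
    q, scale, zero = quantize_asymmetric(w)
    w_hat = dequantize_asymmetric(q, scale, zero)
    print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

With only four levels at 2 bits, the zero-point lets the grid follow each group's actual range rather than a range centered on zero, which is the intuition behind the comment above.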
Dear creators of Marlin
What a huge performance boost these kernels can bring! I’m super excited about this as the open source community has been lacking kernels that scale.
My question: does Marlin support zero-point quantization, like we normally get from AutoGPTQ or AutoAWQ?
Best wishes
Casper