[Feature Request] Better support for w4a8 quantization #2605
Comments
@Barry-Delaney would you please add some comments here?
Hi, our officially supported toolkit for quantization is ModelOpt. We have discussed this before and found that it is not trivial to land the techniques used by DeepCompressor (such as asymmetric quantization, double quantization, rotation, smoothing, etc.) in ModelOpt, so at least in the near future we need to rely on DeepCompressor. If you run into problems when quantizing new models, could you try to do some of the implementation yourself? DeepCompressor has an abstraction layer for models like this. You could also raise an issue in the DeepCompressor library and paste the errors in detail; the authors of DeepCompressor would be glad to answer. Thank you!
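For readers following along, here is a minimal, illustrative sketch of the asymmetric per-group quantization technique mentioned above. It is plain PyTorch, not DeepCompressor's or ModelOpt's actual API; the function name and default group size are made up for the example.

```python
import torch

def asymmetric_quantize_per_group(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Asymmetric (zero-point) per-group quantization of a weight tensor.
    Illustrative sketch only, not DeepCompressor's real implementation."""
    qmax = 2 ** n_bits - 1
    wg = w.reshape(-1, group_size)                      # assumes numel divisible by group_size
    w_min = wg.amin(dim=1, keepdim=True)
    w_max = wg.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax      # one scale per group
    zero = torch.round(-w_min / scale)                  # one zero point per group (the asymmetric part)
    q = torch.clamp(torch.round(wg / scale) + zero, 0, qmax)
    return q.to(torch.uint8), scale, zero
```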
Thanks! I have filed mit-han-lab/deepcompressor#38. By the way, if we can fix these issues: #2602, #2603, #2604, we can at least use w8a8. Today we can't even use w8a8.
I managed to build the QServe w4a8 engine with g128, and now it fails with:
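(Side note for context: "w4a8 with g128" means INT4 weights quantized in groups of 128 elements plus INT8 activations. Below is a rough, purely illustrative sketch of the activation side; it is not the missing error output above and not QServe's actual kernel code, which performs quantization inside fused GEMMs and may use per-token scales.)

```python
import torch

def quantize_activations_int8(x: torch.Tensor):
    """Per-tensor symmetric INT8 quantization, i.e. the "a8" half of w4a8.
    Hypothetical helper for illustration only."""
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale
```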
Based on this doc, we have to use DeepCompressor to prepare a fake-quantized checkpoint. However, it is a lot of extra trouble to set up that repo, and it seems to me the tool is not being maintained well, especially for newer Llama models like 3.1/3.2. At least I was not able to do it successfully for Llama 3.1 8B.
It would be great if more native support for w4a8 quantization could be added to TensorRT-LLM.
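For anyone unfamiliar with the term, a "fake-quantized" checkpoint keeps the weights in floating point but restricts them to values representable under the target INT4 per-group scheme, so the TensorRT-LLM conversion step only needs to re-derive the integer values and scales. A minimal sketch, assuming symmetric per-group quantization for brevity (not DeepCompressor's actual export format):

```python
import torch

def fake_quantize_weight(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Quantize then immediately dequantize: the result is a float tensor whose values
    all lie on the INT4 grid of their 128-element group. Illustrative sketch only."""
    qmax = 2 ** (n_bits - 1) - 1
    orig_shape = w.shape
    wg = w.reshape(-1, group_size)                       # assumes numel divisible by group_size
    scale = wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(wg / scale), -qmax - 1, qmax)
    return (q * scale).reshape(orig_shape)
```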