GPTQ / Quantization support? #174
Thanks for the feature request! Quantization is not currently supported, but it's definitely on our roadmap. Please stay tuned.
How do I best go about tracking this? Is there a Discord or public roadmap somewhere I can look at?
See the roadmap here: #244
I looked into this a bit today and it seems straightforward to integrate AutoGPTQ into vLLM, so I implemented a preliminary version for LLaMA (see this commit) and ran a few benchmarks on a single A100-80G. I don't know why, but it's slower than expected.
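For readers who want a picture of what such an integration involves, here is a minimal, illustrative sketch (not the linked commit) of a GPTQ-style 4-bit QuantLinear and a helper that swaps it in for a model's nn.Linear layers. The packing layout, group size, and helper names are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Toy GPTQ-style 4-bit linear: packed int4 weights plus per-group
    scales/zeros, dequantized on the fly in forward(). The layout is a
    simplified assumption, not the AutoGPTQ format."""

    def __init__(self, in_features: int, out_features: int, group_size: int = 128):
        super().__init__()
        self.in_features, self.out_features, self.group_size = in_features, out_features, group_size
        # 8 int4 values are packed into each int32 along the input dimension.
        self.register_buffer("qweight", torch.zeros(in_features // 8, out_features, dtype=torch.int32))
        self.register_buffer("scales", torch.ones(in_features // group_size, out_features, dtype=torch.float16))
        # Zero points are packed 8-per-int32 along the output dimension.
        self.register_buffer("qzeros", torch.zeros(in_features // group_size, out_features // 8, dtype=torch.int32))

    def dequantize(self) -> torch.Tensor:
        shifts = torch.arange(0, 32, 4, device=self.qweight.device)
        # Unpack int4 weights: [in/8, out] -> [in, out]
        w = ((self.qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF).reshape(-1, self.out_features)
        # Unpack int4 zero points: [in/group, out/8] -> [in/group, out]
        z = ((self.qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & 0xF).reshape(self.qzeros.shape[0], -1)
        # Broadcast group-wise scales/zeros over their rows, then dequantize.
        scales = self.scales.repeat_interleave(self.group_size, dim=0)
        zeros = z.to(self.scales.dtype).repeat_interleave(self.group_size, dim=0)
        return (w.to(self.scales.dtype) - zeros) * scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.dequantize().to(x.dtype)


def replace_linear(module: nn.Module, group_size: int = 128) -> None:
    """Recursively swap nn.Linear layers (e.g. LLaMA's qkv/o/mlp projections)
    for the quantized version above."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, QuantLinear(child.in_features, child.out_features, group_size))
        else:
            replace_linear(child, group_size)
```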
What's the baseline with the normal (unquantized) version?
If you mean the throughput, it's in the table above. I dug into the kernel code of the quant linear layer and found that it falls back to dequantization followed by an fp16 matrix multiplication when the batch size is bigger than 8, so the performance degradation is understandable.
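In other words, the layer effectively dispatches on the number of input tokens. A rough sketch of the assumed behavior (the threshold constant and kernel stand-ins are placeholders, not the actual CUDA code):

```python
import torch

# Assumed threshold from the quant linear kernel described above;
# the real value lives in the CUDA code.
FUSED_KERNEL_MAX_TOKENS = 8


def quant_linear_forward(x, qweight, scales, qzeros, fused_gemv, dequantize):
    """x: [tokens, in_features]. `fused_gemv` and `dequantize` are stand-ins
    for the real quantized kernels."""
    if x.shape[0] <= FUSED_KERNEL_MAX_TOKENS:
        # Memory-bound regime: reading the int4 weights once per token is a win.
        return fused_gemv(x, qweight, scales, qzeros)
    # Compute-bound regime: dequantize the whole weight once, then run a plain
    # fp16 GEMM, so throughput cannot exceed the unquantized baseline here.
    return x @ dequantize(qweight, scales, qzeros).to(x.dtype)
```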
As an update, I added a tensor-parallel QuantLinear layer and supported most AutoGPTQ-compatible models in this branch. The code has not been thoroughly tested yet because there are far too many combinations of model architectures and GPTQ settings.
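For context, a column-parallel split of a GPTQ layer mainly means sharding the packed weight, scales, and zero points along the output dimension so each rank dequantizes only its slice. A small sketch under an assumed 4-bit packing (not the branch's actual code):

```python
import torch


def shard_gptq_column_parallel(qweight, scales, qzeros, tp_rank, tp_size):
    """Assumed shapes: qweight [in/8, out] int32, scales [in/groups, out] fp16,
    qzeros [in/groups, out/8] int32 (8 int4 values packed per int32)."""
    out_features = qweight.shape[1]
    assert out_features % (tp_size * 8) == 0, "output dim must split evenly across ranks"
    shard = out_features // tp_size
    cols = slice(tp_rank * shard, (tp_rank + 1) * shard)
    # qzeros packs 8 output columns per int32, so its slice is 8x narrower.
    zcols = slice(tp_rank * shard // 8, (tp_rank + 1) * shard // 8)
    return qweight[:, cols], scales[:, cols], qzeros[:, zcols]
```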
@chu-tianxiang I tried forking your
Hi, is baichuan-gptq supported?
Is acceleration for Qwen-72B-Chat-Int4 supported?
Something is off with this QLLM GPTQ quantization @wejoncy ... not all the dependencies are specified in the requirements file. Also, I tried quantizing three times and every time it breaks when it tries to save the file or when the quantization finishes. Tried Llama2-70b and Mistral 7B.
Hi, is there any update on 8-bit support? That would help Mixtral generate usable outputs on a single (non-overpriced) GPU.
I have successfully used both GPTQ and AWQ models with vLLM. Should this issue be considered solved @WoosukKwon? |
@hmellor it currently works with 4-bit, but not 8-bit. For now you have to use chu-tianxiang/vllm-gptq to get 8-bit support.
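For anyone landing here later, loading a 4-bit GPTQ checkpoint with a recent vLLM release looks roughly like this (the model name is just an example GPTQ checkpoint from the Hugging Face Hub):

```python
from vllm import LLM, SamplingParams

# Example GPTQ checkpoint; any 4-bit GPTQ model from the Hub should work similarly.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")

outputs = llm.generate(
    ["Will vLLM support 4-bit GPTQ models?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```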
Closing as this was resolved by #2330 |
Does vLLM support int-2 GPTQ models well? Thank you very much!
Sanity check done: Server mode; BS1 perf; Llama405b FP8
Will vLLM support 4-bit GPTQ models?