[RFC] Add Auto-Round Support #2130
Comments
Hi, and the second one is about 8-bit quants: do you already have some benchmarks? I would be interested in integrating this :)
Awesome work! Question about inference: I see that you also support exporting to GPTQ format. What is the benefit of adding an extra format? (Some background: we are considering switching to GPTQ-Marlin for supported configurations, since we see much-improved throughput.)
Ah, I see you fixed the asymmetric quantization negative zeros bug. There is work on fixing this with a new gptq format revision, which you have probably seen. Linking it just in case: AutoGPTQ/AutoGPTQ#640
Yes, AutoRound supports exporting to the AutoGPTQ format without quality loss, but only for pure 4-bit or 8-bit models. Currently, AutoGPTQ can only apply a single quantization configuration to all target layers. In contrast, we support quantizing models with mixed bit widths to balance accuracy and inference speed. Additionally, auto-round supports quantizing the lm-head layer. We are also exploring other data types. cc @wenhuach21
While 8-bit quantization is supported, extensive benchmarks have not been conducted because our 4-bit quantization results are already quite good :). If you are interested, we are happy to run some tests.
Thank you for sharing this information. AutoRound supports exporting a format compatible with the GPTQ-Marlin kernel as well.
That's really nice! I wonder if 'mixed-bitness' could be considered for a GPTQ v2 format as well. I think ideally, every quantizer that uses quantized weights, scales, biases, and scale grouping would use the same GPTQ-based format. This would allow us to switch out kernels when new options become available. There has been a lot of development in this space and every improvement in GPTQ inference performance has benefitted all GPTQ format-based models. With respect to training,
Yes, I agree. From the TGI side, this is the ideal scenario. However, unifying everything is challenging for various reasons, similar to the existence of multiple LLM serving frameworks. For example, AutoGPTQ limits their calibration dataset to about three datasets and throws an error if you specify others. Additionally, GPTQv2's pull request has been open for 2-3 months with no indication of whether it will be merged. For AutoRound, we have currently specified the backend name (e.g., gptq:exllamav2 or others; see https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead/blob/main/config.json) and will switch to the GPTQ backend as the default CUDA kernel once their issue is fixed. Therefore, I believe TGI should have little difficulty switching to a better option in the future for all GPTQ-based models.
I tried to export a model using the gptq format, but it seems to not be Marlin-compatible.
Yes, sorry for the inconvenience. We will provide support within 1-2 days and keep you updated.
I just started a quantization with 1000 samples and iters.
For debugging, use 32 samples with 2 iterations and disable_low_gpu_mem_usage for much faster performance. By default, we use 512 samples and 200 iterations, and we will support a fast config soon. Additionally, I think --sym needs to be added. Currently, the packing format in AutoRound is Triton, the same as ExLlamaV2. I'm not sure whether TGI supports conversion between Marlin and ExLlamaV2, as they are different formats to my knowledge.
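For reference, a minimal sketch of such a debug run, modeled on the reference command later in this thread; the --iters flag name is an assumption, so please check the auto-round example's --help output for the exact spelling:

```bash
# Hedged sketch of a fast debug configuration: 32 calibration samples, 2 tuning
# iterations, low-GPU-memory mode disabled. Flags not shown in the reference
# command elsewhere in this thread (e.g. --iters) are assumptions.
CUDA_VISIBLE_DEVICES=0 \
python3 main.py \
    --model_name $model_name \
    --nsamples 32 \
    --iters 2 \
    --sym \
    --disable_low_gpu_mem_usage \
    --deployment_device 'gpu'
```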
intel/auto-round#168
@flozi00 Hi, I have fixed the issue, please double-check.

auto-gptq==0.7.1

text = "There is a girl who likes adventure,"

opt125m Transformers API: There is a girl who likes adventure, and she is a girl who likes adventure.

opt125m AutoGPTQ marlin API: There is a girl who likes adventure, and she is a girl who likes adventure.

LLAMA3-8B-Instruct Transformers API: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a true adventurer at heart, and she loves to explore new places and try new things. She is also very brave and never backs down from a challenge, even if it seems scary or

LLAMA3-8B-Instruct AutoGPTQ marlin API: There is a girl who likes adventure, and she is always ready to take on new challenges. She is a true adventurer at heart, and she loves to explore new places and try new things. She is also very brave and never backs down from a challenge, even if it seems scary or

reference cmd:

```bash
CUDA_VISIBLE_DEVICES=0 \
python3 main.py \
    --model_name $model_name \
    --nsamples 128 \
    --seqlen 512 \
    --sym \
    --disable_low_gpu_mem_usage \
    --disable_eval \
    --deployment_device 'gpu'
```

We can support exporting to the Marlin format directly if needed, due to the repacking process.
That would make it a lot easier.
Sure, we will support it tomorrow.
I can confirm that the Marlin kernels for GPTQ in TGI are working with the exported models from the auto-round main branch.
We have added support for packing directly to the AutoRound Marlin format in intel/auto-round#172, enabled by setting --deployment_device 'auto_round:marlin' in our latest update. This feature will be merged after extensive testing. Regarding exporting to the AutoGPTQ format, we found that with the current AutoGPTQ API it still conducts repacking even with the Marlin format, so we do not plan to support that path, as exporting to ExLlamaV2 is more compatible. test_result:
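For illustration, a hedged sketch of the export described above; the --deployment_device value is taken from the comment, while the surrounding flags simply mirror the earlier reference command and may differ once the feature is merged:

```bash
# Sketch only: pack directly to the AutoRound Marlin format as described above.
# The surrounding flags mirror the earlier reference command and are assumptions.
CUDA_VISIBLE_DEVICES=0 \
python3 main.py \
    --model_name $model_name \
    --sym \
    --disable_low_gpu_mem_usage \
    --deployment_device 'auto_round:marlin'
```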
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
Hi, this is the INC team from Intel. Thank you for developing this amazing project.
Motivation
Our team has developed a new weight-only quantization algorithm called Auto-Round. It has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bits and 3-bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.
We would like to contribute this quantization algorithm to TGI and enable users to:
1. Quantize a Floating-Point Model Using Auto-Round
Extend the current `quantize` API and add `method` as a new argument to select different algorithms. Users can utilize it as follows:

```bash
text-generation-server quantize \
    --MODEL_ID path/to/float/model/ \
    --OUTPUT_DIR /path/to/save/quantized/model \
    --method autoround # <--- select the different methods, such as `gptq`, `autoround`
```
We propose two options to implement it:
Option 1: Adding Auto-Round as a New Python Dependency (Recommended)
Auto-Round is currently released as a pure Python package. This option adds `auto-round` to TGI's `requirements_xx.txt` and calls Auto-Round's API to obtain the quantized model.

Advantages:
Option 2: Porting All Source Code of Auto-Round into TGI
We are also willing to integrate all source code of Auto-Round directly into TGI.
Advantages:
Here is the overall calling flow for these two options:
2. Perform Inference with an AutoRound-quantized Model.
We propose extending the current `text-generation-launcher` API to include `autoround` as a new option within `--quantize`. Users can utilize it as follows:
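For illustration, a minimal sketch of such an invocation, assuming the existing `--model-id` and `--quantize` flags of `text-generation-launcher` and the proposed `autoround` value:

```bash
# Sketch of the proposed usage: `autoround` would become a new accepted value
# for the existing --quantize option of text-generation-launcher.
text-generation-launcher \
    --model-id /path/to/autoround/quantized/model \
    --quantize autoround
```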
Your feedback is important. Please feel free to comment on the options mentioned above or suggest additional approaches to ensure the most appropriate way to contribute :). Thank you in advance!