Enable GPTQModel #2064
@SunMarc GPTQModel is intended to replace AutoGPTQ entirely, because progress in that repo has stalled for many reasons. For the sake of compatibility, the two can co-exist in parallel until this integration is merged and everything is stable and tested. Later we can initiate a deprecation plan for AutoGPTQ, which is no longer actively developed or maintained.
Signed-off-by: jiqing-feng <[email protected]>
Thanks for the clean PR! LGTM! Thanks for creating this lib! Can you check that the tests in optimum and in transformers pass as expected?
@SunMarc The PR in its current state is not passing our internal tests. @jiqing-feng We will merge in some of our changes that pass both the inference and quant tests. Please delay your review until then, since there are substantial changes relative to the code currently in this PR.
* need checkpoint_format
* default value of checkpoint_format is gptq
* fix quantize
* fix quantize
* fix quantize
* Update quantizer.py
* need convert to v1 before gptqmodel save
* back checkpoint_format to gptq after convert
* cleanup code
* sym=False is not supported with auto-gptq
* add comments
* cleanup code
* Update quantizer.py
* always convert v2 to v1 if checkpoint_format = "gptq"
* Update quantizer.py
---------
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium-ModelCloud <[email protected]>
* keep gptq_v2 if sym is false
* use hf_convert_gptq_v1_to_v2_format, hf_convert_gptq_v2_to_v1_format, and hf_gptqmodel_post_init
* no need check backend
* use device_map
* cleanup
* Update quantizer.py
* move import
---------
Co-authored-by: Qubitium-ModelCloud <[email protected]>
Hi @Qubitium . The gptqmodel tests have been integrated. I can run @SunMarc Do we need to change any test yaml file in .github or any dockerfile? If yes, please let me know the file location. Thanks! BTW, tests with |
Testing changes contain:
|
* add meta info
* cleanup
* cleanup
* The value of quantizer should be an array
* Update quantizer.py
* If is_auto_gptq_available() also writes "auto_gptq:version" to "quantizer"
* If is_auto_gptq_available() also writes "auto_gptq:version" to "quantizer"
* Update quantizer.py
* cleanup
* comment on meta
* hf_select_quant_linear pass checkpoint_format
* add todo fix
* move convert code to quantizer.save()
* Update quantizer.py
* Optimize hf_convert_gptq_v2_to_v1_format()
* Optimize hf_convert_gptq_v1_to_v2_format()
* fix GPTQTestCUDA
* hf_select_quant_linear() always set pack=True
* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2
* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2
* GPTQQuantizer add backend
* lower checkpoint_format and backend
* cleanup
* move backend to bottom
* no need to check gptqmodel version for ipex support
* Update import_utils.py
* Update quantizer.py
* fix UnboundLocalError: cannot access local variable 'version' where it is not associated with a value
* make version var short
* Update import_utils.py
* fix unittest
* use assertLessEqual
---------
Co-authored-by: Qubitium-ModelCloud <[email protected]>
Co-authored-by: LRL <[email protected]>
optimum/gptq/quantizer.py
Outdated
checkpoint_format (`str`, *optional*, defaults to `gptq`):
    GPTQ weight format. `gptq`(v1) is supported by both gptqmodel and auto-gptq. `gptq_v2` is gptqmodel only.
meta (`Dict[str, any]`, *optional*):
    Properties, such as tooling:version, that do not directly contributes to quantization or quant inference are stored in meta.
    i.e. `meta.quantizer`: ["optimum:_version_", "gptqmodel:_version_"]
backend (`str`, *optional*):
    Controls which gptq kernel to be used. Valid values for gptqmodel are `auto`, `auto_trainable` and more. For auto-gptq, only
    valid value is None and `auto_trainable`. Ref gptqmodel backends: https://github.com/ModelCloud/GPTQModel/blob/main/gptqmodel/utils/backend.py
@SunMarc This is the biggest change, and one I need to explain in explicit detail since the reasons are non-obvious.
`checkpoint_format` was added to auto-gptq (main) but never released, and was carried over to gptqmodel. I see it as a good addition, since the GPTQ method has produced many kernels, each of which may use a specific weight/disk format. Existing gptqmodel `checkpoint_format` values are `gptq`, `gptq_v2`, `marlin`, `ipex`, and `bitblas`, with more coming this year.
`meta` is added by gptqmodel to store informational properties only, that is, properties not related to the loading and execution of the quantized model. Most importantly, it stores a `meta.quantizer` property, a list of `quantizer:version` entries naming the tooling that produced the quant. This is extremely valuable for two reasons:
- [Good to have but not essential] Debugging and tracing back bad quants generated by bad code/tools. Who made the (bad/good) quants? This is a tooling fingerprint, since there are multiple tools that can produce the gptq format. They are not equal, and this allows everyone to trace a quant back to its origin. In this PR, `meta.quantizer` is a size-2 array holding the optimum version plus either the gptqmodel version or the auto-gptq version.
- [Requirement for GPTQModel + future bug-proofing] Backward compat and future bug-proofing. GPTQModel uses this to test for a zeropoint fix made by @qwopqwop200 that affects every `gptq` (v1) disk format created before the fix. Models made before the fix have a broken `sym=False` zeropoint. Models quantized after the fix can load with `sym=False`. This is a safety check, since there are two versions of gptq v1: both are compatible when `sym=True`, but only the post-fix version works for `sym=False`.
`backend` [Essential for GPTQModel and also good for auto-gptq]: The old auto-gptq method of selecting which kernel/quant_linear to use for which task/model is extremely cryptic and is controlled by 3 params, `disable_exllama`, `exllama_config`, `disable_exllama`, plus 1 code state called `use_exllama`. Frankly, this control scheme no longer makes logical sense and is borderline crazy. GPTQModel uses a single `backend` to signal kernel selection.
The kernel selection comes down to this logic, split into two core paths: does the model require training, i.e. will it enter the peft path? If true, select the best kernel that can be trained on. If false, select the best kernel for quant/inference.
You can ignore all the switches, as the basic need for the 3+1 variables in auto-gptq boils down to the above logic while trying to give users the ability to choose a specific kernel. But I can safely say there are maybe 3 people in the world who can select the correct auto-gptq kernel without actually reading the entire code. There are even more toggles beyond the 3+1 within auto-gptq.
Due to the above, GPTQModel will not accept or adapt to auto-gptq's kernel selection craziness. 1 clean variable is all you need, with two primary/auto states, `auto` and `auto_trainable`, plus individual kernels you can explicitly call via this single `backend` param. This is the best and only way out of the mess of kernel selection. There are currently 8 kernels in gptqmodel with more coming this year, plus even more checkpoint_formats beyond gptq, gptq_v2, marlin, ipex, bitblas. We are not going down the auto-gptq path of adding a 1970s telco phone-line-switch style variable for each kernel that needs to be `and`+`or`ed to compute the kernel selection state.
But, for the sake of compatibility, this PR contains code that allows users to pass only auto-gptq control vars and converts them to the gptqmodel auto-loading states of `auto` and `auto_trainable`, and conversely allows passing `backend=auto_trainable` and converts it to auto-gptq kernel selection control.
I apologize for the overly verbose message here, but this is the meat of the PR as far as potential friction for review goes. I can totally see why anyone seeing these changes would throw blank stares without me explaining every single detail about each new param/state var.
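The two-path selection described above (trainable vs. quant/inference, with an explicit kernel always allowed) can be sketched roughly as follows. This is only an illustration: `resolve_backend` and the constant names are hypothetical and are not the gptqmodel or optimum API, and lowercasing the value is an assumption based on the "lower checkpoint_format and backend" commit note.

```python
from typing import Optional

# The two automatic states named in the comment above.
AUTO = "auto"
AUTO_TRAINABLE = "auto_trainable"


def resolve_backend(backend: Optional[str], trainable: bool) -> str:
    """Return the backend string to use for kernel selection (hypothetical helper).

    An explicitly requested backend always wins. Otherwise the choice depends
    only on whether the model will be trained (for example through peft).
    """
    if backend is not None:
        # Assumption: backend names are compared case-insensitively.
        return backend.lower()
    return AUTO_TRAINABLE if trainable else AUTO


# No explicit backend: the training path decides.
print(resolve_backend(None, trainable=True))
print(resolve_backend(None, trainable=False))
# Explicit kernel request is passed through.
print(resolve_backend("marlin", trainable=False))
```

The point of the sketch is that one string carries the whole decision, so no combination of boolean flags has to be evaluated.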
def select_quant_linear(self, device_map: Union[str, dict]):
    if is_gptqmodel_available():
        self.quant_linear = hf_select_quant_linear(
            bits=self.bits,
            group_size=self.group_size,
            desc_act=self.desc_act,
            sym=self.sym,
            checkpoint_format=self.checkpoint_format,
            meta=self.meta,
            device_map=device_map,
            backend=self.backend,
        )
@SunMarc GPTQModel is exposing `hf_`-prefixed stable APIs to transformers/peft/optimum that will not change over time, so any calls to GPTQModel will be `hf_`-prefixed.
Here our quant linear selection requires full knowledge of `sym`, `checkpoint_format`, `meta`, `device_map`, and `backend` before deciding on the correct quant_linear to use.
meta = gptq_dict["meta"]
# store both optimum:version and gptq_lib:version into quantize_config.meta.quantizer
if meta.get("quantizer") is None:
    meta["quantizer"] = [f"optimum:{optimum_version}"]

    if is_gptqmodel_available():
        meta["quantizer"].append(f"gptqmodel:{gptqmodel_version}")
    elif is_auto_gptq_available():
        meta["quantizer"].append(f"auto_gptq:{autogptq_version}")
@SunMarc This is where we store the tooling fingerprints: the optimum name + version, plus either gptqmodel + version or auto-gptq + version. The fingerprints also double as future bug-proofing, since quant weight bugs can be detected and auto-fixed by new code.
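As a rough illustration of how such fingerprints could be consumed later, the sketch below parses `tool:version` entries from `meta.quantizer` and checks whether a quant was produced by at least a given tool version. The helper names (`parse_version`, `find_tool_version`, `quantized_with_at_least`), the simplified version parser, and the example version numbers are all assumptions for illustration; they are not part of optimum or gptqmodel.

```python
from typing import Any, Dict, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a dotted version (simplified).

    Suffixes such as ".dev0" are dropped; real code would more likely rely
    on a full version parser.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def find_tool_version(meta: Optional[Dict[str, Any]], tool: str) -> Optional[Tuple[int, ...]]:
    """Return the recorded version of `tool` in meta["quantizer"], or None."""
    if not meta:
        return None
    for entry in meta.get("quantizer") or []:
        name, _, ver = entry.partition(":")
        if name == tool and ver:
            return parse_version(ver)
    return None


def quantized_with_at_least(meta: Optional[Dict[str, Any]], tool: str, minimum: Tuple[int, ...]) -> bool:
    """True only if `tool` is recorded in the fingerprint with version >= minimum."""
    found = find_tool_version(meta, tool)
    return found is not None and found >= minimum


# Example fingerprint in the two-entry shape described above (versions are made up).
meta = {"quantizer": ["optimum:1.24.0", "gptqmodel:1.4.2"]}
print(find_tool_version(meta, "gptqmodel"))
print(quantized_with_at_least(meta, "gptqmodel", (1, 0, 0)))
```

A missing fingerprint is treated as "unknown" rather than "safe", which matches the safety-check intent: a loader can refuse or repair quants whose origin cannot be established.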
optimum/gptq/quantizer.py
Outdated
if is_gptqmodel_available():
    model, _ = hf_convert_gptq_v1_to_v2_format(model, self.bits, self.quant_linear, self.checkpoint_format, self.meta)
@SunMarc GPTQModel uses v2 as the internal format for most kernels, except for IPEX, but this method automatically skips conversion for IPEX. It returns the converted model plus true/false indicating whether conversion happened.
On model save to `gptq` format we do the reverse, v2 to v1. It's very fast and minimal relative to the slow quantization phase.
        v = version.parse(importlib_metadata.version("auto_gptq"))
        if v >= AUTOGPTQ_MINIMUM_VERSION:
            return True
        else:
            raise ImportError(
-                f"Found an incompatible version of auto-gptq. Found version {version_autogptq}, but only version above {AUTOGPTQ_MINIMUM_VERSION} are supported"
+                f"Found an incompatible version of auto-gptq. Found version {v}, but only version >= {AUTOGPTQ_MINIMUM_VERSION} are supported"
            )


def is_gptqmodel_available():
    if _gptqmodel_available:
        v = version.parse(importlib_metadata.version("gptqmodel"))
        if v >= GPTQMODEL_MINIMUM_VERSION:
            return True
        else:
            raise ImportError(
                f"Found an incompatible version of gptqmodel. Found version {v}, but only version >= {GPTQMODEL_MINIMUM_VERSION} are supported"
The variable names are just too verbose. `v` is enough to convey a clear message in such a short code context.
Since GPTQModel tests will not be running on the CI to verify them, let's revert the modifications in GPTQ testing (auto-gptq + cuda only) so that we don't miss something that might be broken.
Done. Please re-run the CI and take a second round of review. Thanks.
optimum/gptq/quantizer.py
Outdated
checkpoint_format: str = "gptq",
meta: Optional[Dict[str, any]] = None,
backend: Optional[str] = None,
These should probably be moved down so that code relying on the order of args won't break.
done
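To make the argument-order concern concrete, here is a small sketch. `QuantizerSketch` is a hypothetical stand-in with a shortened parameter list, not the real optimum `GPTQQuantizer` signature; it only shows why appending new optional parameters at the end keeps existing positional calls working.

```python
from typing import Any, Dict, Optional


class QuantizerSketch:
    """Hypothetical stand-in for a quantizer constructor (not the optimum class)."""

    def __init__(
        self,
        bits: int,
        group_size: int = 128,
        desc_act: bool = False,
        sym: bool = True,
        # New optional parameters are appended last, so callers that pass
        # the older arguments positionally still bind them to the same names.
        checkpoint_format: str = "gptq",
        meta: Optional[Dict[str, Any]] = None,
        backend: Optional[str] = None,
    ):
        self.bits = bits
        self.group_size = group_size
        self.desc_act = desc_act
        self.sym = sym
        self.checkpoint_format = checkpoint_format
        self.meta = meta
        self.backend = backend


# A pre-existing positional call keeps its meaning after the new params are added.
legacy = QuantizerSketch(4, 128, False, True)
print(legacy.sym, legacy.checkpoint_format)
```

Had `checkpoint_format` been inserted before `sym`, the fourth positional value `True` in the legacy call would have been bound to `checkpoint_format` instead.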
optimum/gptq/quantizer.py
Outdated
@@ -450,6 +564,8 @@ def store_input_hook(_, input, *args):
    raise ValueError(f"Module {module_name} was not found in model")

torch.cuda.empty_cache()
if hasattr(torch, "xpu"):
Not sure, but I have seen it multiple times: don't we also have to check `torch.xpu.is_available()`?
done
Hi @IlyasMoutawwakil, could you check the failed gptq tests? It looks like a torch version error; I can pass the tests locally with torch 2.5. Besides, the failed tests should already be present in the original repo: I see no gptq tests being triggered in previous optimum commits. Could you please help check it?
Hello, the error is This error doesn't happen on main, it happens on this branch only because a layer is on cpu trying to process 16bit inputs Scheduled CI from 14 hours ago on main runs successfully. |
Hi @IlyasMoutawwakil. The main branch optimum/gptq cannot run gptqmodel on a CPU device because it hard-codes moving the model to cuda. In my changes, I keep the original model's device, so the model can run on CPU if you set
The point is that pytorch 2.2 does not support the cpu fp16 layer norm op, while pytorch 2.5 supports it but conflicts with the gptq exllama tests. I am afraid we need to skip the cpu tests, since the previous optimum never actually ran on CPU because of the hard-coding. Another way is to use an fp32 model on CPU so it can pass the tests. We can book a meeting to align on this if I didn't make it clear; please let me know your time slot. Thanks!
Hi @IlyasMoutawwakil. I now use an fp32 model to run the CPU tests and all tests pass; please trigger the CI. Thanks!
Hi @IlyasMoutawwakil . I have fixed the tests according to your instructions, please re-run the CI, thanks! |
There are no gptq+cpu tests in optimum... The tests that failed on previous commits here are cuda tests which no longer worked because of the previous code change. It's not cpu-only friendly yes, but we can think about how to make it so in another PR that adds cpu tests as well, breaking compatibility with pytorch<2.5 is not an optimal solution IMO. |
I get your point, it makes sense. Yes, we will see how to integrate cpu-only device test in the next PR since gptqmodel supports CPU-only now. |
Hi @IlyasMoutawwakil. Please check whether any changes are required before merging, thanks!
@IlyasMoutawwakil, I have already had a conversation with @jiqing-feng regarding this issue, and we will need to address it in a new PR (post-merge), as it requires even more internal changes that are well out of scope of these relatively simple changes for the first round of gptqmodel support. For the next round of PRs:
Our plan is to integrate GPTQModel as simply as possible with the least friction first. Then introduce more advanced features such as Edit: There are actually 2 kernels in GPTQModel that support CPU inference/quantization:
Co-authored-by: Ilyas Moutawwakil <[email protected]>
Hi @IlyasMoutawwakil . Please re-run the CI if no more changes are required. Thanks! |
I think we can go ahead with merging these changes for now. @jiqing-feng do the tests pass locally when gptqmodel is installed?
Yes! Both CUDA and CPU pass all gptq tests.
Enable GPTQModel in optimum.