HQQ Integration: dequant kernel
Standalone asymmetric dequant kernel for `hqq` quantization, as a first step towards integrating `hqq` as an alternative quantization backend.

Currently supports the following `hqq` `BaseQuantizeConfig` settings (a config sketch follows this list):

- `nbits={4, 8}`; {1, 2, 3} bits are not yet supported.
- `axis={0, 1}`; `hqq` dequant implementations are available for both `axis=0` and `axis=1`, and this kernel supports both.
- `group_size` (`64`, `128`).
- `autotune`: kernels are autotuned, which should ease downstream interoperability with `torch.compile`.
- `quant_zero`: `nbit=8` scalar scale / zero quantization of the zeros, which is the default setting of `hqq.BaseQuantizeConfig`.
- `quant_scale`: the default in `hqq.BaseQuantizeConfig` is `quant_scale=False` (scales are not additionally quantized).
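
As a concrete reference, a config exercising the supported settings might look like the following. This is only a sketch assuming `hqq`'s standard `BaseQuantizeConfig` API; exact defaults may vary across `hqq` versions.

```python
from hqq.core.quantize import BaseQuantizeConfig

# 4-bit weights with group size 64 along axis=1; zeros quantized to 8 bits
# (the hqq default), scales left unquantized (quant_scale=False, also the default).
quant_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=False,
    axis=1,
)
```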
## Accuracy

See `test_hqq_dequant.py` for comprehensive tests across `dtypes`, `group_sizes`, `axis`, and other relevant params. Run with `pytest`.
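
For intuition, a minimal accuracy check against `hqq`'s pure-PyTorch dequant path might look like this. The sketch assumes `hqq.core.quantize.Quantizer`'s `quantize`/`dequantize` API; the name `triton_dequant` is a hypothetical stand-in for the kernel entry point added in this PR.

```python
import torch
from hqq.core.quantize import Quantizer

W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
W_q, meta = Quantizer.quantize(W, nbits=4, group_size=64, axis=1)

W_ref = Quantizer.dequantize(W_q, meta)   # PyTorch reference dequant
# W_tri = triton_dequant(W_q, meta)       # hypothetical call into this PR's kernel
# assert torch.allclose(W_tri, W_ref, atol=1e-3)
```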
## Performance

Please take these numbers with a grain of salt, as I only benchmarked against `HQQBackend.PYTORCH` on my laptop (RTX 3050).
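
For reference, the baseline backend is selected through `hqq`'s standard `HQQLinear`/`HQQBackend` interface:

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# Use the pure-PyTorch dequant path as the comparison baseline.
HQQLinear.set_backend(HQQBackend.PYTORCH)
```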
## Notes

The kernel requires `triton >= 3.0.0`, which is not compatible with stable `xformers`:

- Per this PR, the `triton` import in `unsloth.__init__.py` works for `unsloth.kernels`, but `import xformers` from `unsloth.models.__init__.py` errors out because the `xformers` `triton` kernels are incompatible with `triton >= 3.0.0`.
- `xformers` is technically not required with `torch >= 2.3`, since `xformers.attn_bias.LowerTriangularMask` is available under `torch.nn.attention.bias` (see the sketch after this list).
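
A minimal sketch of the torch-native replacement, assuming `torch >= 2.3` (the tensor shapes here are arbitrary examples):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention.bias import causal_lower_right  # torch >= 2.3

# Plays the same role as xformers' LowerTriangularMask: a causal attention
# bias object that scaled_dot_product_attention understands natively.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

bias = causal_lower_right(q.size(-2), k.size(-2))
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```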
## TODO

- Integrate with `fast_lora`
- Wire into `FastLanguageModel` as an alternative to `bitsandbytes`
- Support for 1, 2, 3 bits