Prerequisites
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Qwen 2 (and similarly shaped) models can't use the Q5_K and Q6_K ggml_types to quantize ffn_down, because its row size is not a multiple of the 256-weight super-blocks the K-quants require.
This forces a fallback to Q5_1, which is now suboptimal and sometimes subpar to Q5_0, or to Q8_0, which is much bigger.
Ikawrakow published a while ago a Q6_0 type that can quantize such irregularly shaped tensors, and it would be a great alternative to either Q5_0 or Q8_0.
After all, ffn_down represents something like 25% of the weights of a layer, so for Qwen 2 models the fallback costs 0.4 to 0.5 bpw on a Q6_K quantized model. In Q5_K_S or Q5_K_M, the Q5_1 fallback tensors are likewise bigger than Q5_K, with lower quality.
In all these cases, a Q6_0 ggml_type would be more appropriate.
Ikawrakow has already made one on his fork. It'd be great to see such a quant in llama.cpp mainline.
Motivation
Better ratio quality/size for models with irregularly shaped tensors.
Possible Implementation
Port and merge IK's Q6_0 ggml_type.
Or
Develop and implement an equivalent here.