Releases: ModelCloud/GPTQModel

GPTQModel v2.1.0

13 Mar 14:30
37d4b2b

What's Changed

✨ New QQQ quantization method and inference support! See the config sketch after this list.
✨ New Google Gemma 3 day-zero model support.
✨ New Alibaba Ovis 2 VL model support.
✨ New AMD Instella day-zero model support.
✨ New GSM8K Platinum and MMLU-Pro benchmarking support.
✨ PEFT LoRA training with GPTQModel is now 30%+ faster on all GPU and IPEX devices.
✨ Auto-detect MoE modules not activated during quantization due to insufficient calibration data.
✨ ROCm setup.py compat fixes.
✨ Optimum and PEFT compat fixes.
✨ Fixed PEFT bfloat16 training.
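
A minimal sketch of quantizing with the new QQQ method. The model id and calibration rows are placeholders, and selecting QQQ via `quant_method="qqq"` on `QuantizeConfig` is an assumption; check the docs for the exact field/value.

```python
# Hedged sketch: quantize with the new QQQ method.
# Assumption: QuantizeConfig selects QQQ via quant_method="qqq"; the model id
# and calibration samples below are placeholders.
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = [
    "GPTQModel quantizes large language models with minimal accuracy loss.",
    "The quick brown fox jumps over the lazy dog.",
]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    quant_method="qqq",  # assumption: value that selects the QQQ method
)

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(calibration_dataset)
model.save("qwen2.5-0.5b-qqq-4bit")
```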

Full Changelog: v2.0.0...v2.1.0

GPTQModel v2.0.0

03 Mar 22:14
c0f9dc0

What's Changed

🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion.
🎉 Synced Marlin kernel inference quality fix from upstream. Added MARLIN_FP16, a lower-quality but faster backend (see the loading sketch after this list).
🎉 ModelScope support added.
🎉 Logging and cli progress bar output has been revamped with sticky bottom progress.
🎉 Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes.
🎉 Delegated logging/progress bar output to the LogBar pkg.
🐛 Fix ROCm version auto-detection in setup install.
🐛 Fixed generation_config.json save and load.
🐛 Fixed Transformers v4.49.0 compat. Fixed compat for models without a BOS token.
🐛 Fixed group_size=-1 and bits=3 packing regression.
🐛 Fixed Qwen 2.5 MoE regressions.
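
A minimal loading sketch for the new MARLIN_FP16 backend, following the usual `GPTQModel.load` + `BACKEND` pattern; the repo id is a placeholder.

```python
# Hedged sketch: load a GPTQ checkpoint with the faster, lower-quality
# MARLIN_FP16 backend. The repo id is a placeholder.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/some-gptq-4bit-model",  # placeholder repo id
    backend=BACKEND.MARLIN_FP16,        # default MARLIN remains the higher-quality path
)
```

Omitting `backend` keeps the usual automatic kernel selection.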

Full Changelog: v1.9.0...v2.0.0

GPTQModel v1.9.0

12 Feb 09:34
599e5c7

What's Changed

⚡ Offload tokenizer fixes to Toke(n)icer pkg.
⚡ Optimized lm_head quant time and vram usage.
⚡ Optimized DeepSeek v3/R1 model quant vram usage.
⚡ 3x speed-up for Torch kernel when using PyTorch >= 2.5.0 with model.compile().
⚡ New calibration_dataset_concat_size option enables calibration data concat mode, mimicking the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like wikitext2 (see the sketch after this list).
🐛 Fixed Optimum compat and XPU/IPEX auto kernel selection regression in v1.8.1.
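
A minimal sketch of the new `calibration_dataset_concat_size` option, assuming the usual load → quantize → save flow; the model id and calibration rows are placeholders.

```python
# Hedged sketch: enable calibration data concat mode during quantization.
# The model id and calibration rows are placeholders.
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = [
    "Text row one from wikitext2 or a similar corpus.",
    "Text row two; short rows benefit most from concat mode.",
]

model = GPTQModel.load("meta-llama/Llama-3.2-1B", QuantizeConfig(bits=4, group_size=128))
model.quantize(
    calibration_dataset,
    calibration_dataset_concat_size=2048,  # concat/pack rows to ~2048 tokens, mimicking original GPTQ packing
)
model.save("llama-3.2-1b-gptq-4bit")
```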

Full Changelog: v1.8.1...v1.9.0

GPTQModel v1.8.1

08 Feb 20:19
63499e1

What's Changed

⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: quantized weights can now be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype (see the sketch after this list).
⚡ Over 50% speedup for VL model quantization (Qwen 2.5-VL + Ovis).
⚡ New auto_gc: bool control in quantize() which can reduce quantization time for small models with no risk of OOM.
⚡ New GPTQModel.push_to_hub() API for easy quant model upload to HF repo.
⚡ New buffered_fwd: bool control in model.quantize().
🐛 Fixed bits=3 packing and group_size=-1 regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
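
A combined sketch of the new pack_dtype, auto_gc, and buffered_fwd controls plus push_to_hub. Ids and paths are placeholders, and the exact push_to_hub argument names are an assumption.

```python
# Hedged sketch: new packing and quantize() controls in v1.8.x.
# Ids/paths are placeholders; push_to_hub argument names are assumed.
import torch
from gptqmodel import GPTQModel, QuantizeConfig

calibration_dataset = ["Placeholder calibration text.", "Another sample row."]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int8,  # new: pack quantized weights as int32/int16/int8
)

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(
    calibration_dataset,
    auto_gc=False,      # new: skip aggressive gc to speed up small-model quantization
    buffered_fwd=True,  # new: buffered forward control
)
model.save("qwen2.5-0.5b-gptq-4bit")

# new: one-call upload of the quantized folder to the HF Hub (argument names assumed)
GPTQModel.push_to_hub(
    repo_id="your-org/qwen2.5-0.5b-gptq-4bit",
    quantized_path="qwen2.5-0.5b-gptq-4bit",
)
```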

Full Changelog: v1.7.4...v1.8.1

GPTQModel v1.8.0

07 Feb 17:07
e876a49
Pre-release

What's Changed

⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: quantized weights can now be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype.
⚡ New auto_gc: bool control in quantize() which can reduce quantization time for small models with no risk of OOM.
⚡ New GPTQModel.push_to_hub() API for easy quant model upload to HF repo.
⚡ New buffered_fwd: bool control in model.quantize().
🐛 Fixed bits=3 packing regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.

Full Changelog: v1.7.4...v1.8.0

GPTQModel v1.7.4

26 Jan 07:02
b623b96

What's Changed

⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() API that allows torch to improve TPS by ~4-8% (see the sketch after this list). May need to disable flash_attention for some kernels.
🐛 Fix HF Transformers bug of downcasting fast tokenizer class on save.
🐛 Fix inaccurate BPW calculations.
🐛 Fix ROCm compile with setup.py.
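
A minimal sketch of the new compile() call on an already-loaded quantized model; the repo id is a placeholder.

```python
# Hedged sketch: compile a loaded quantized model for ~4-8% higher TPS.
# The repo id is a placeholder; disable flash_attention if a kernel complains.
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder repo id
model.compile()  # new: torch compilation hook; best gains on recent PyTorch
```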

Full Changelog: v1.7.3...v1.7.4

GPTQModel v1.7.3

21 Jan 00:14
5c1a7e8

What's Changed

⚡ Telechat2 (China Telecom) model support
⚡ PhiMoE model support
🐛 Fix lm_head weights duplicated in post-quantize save() for models with tied embeddings.

Full Changelog: v1.7.2...v1.7.3

GPTQModel v1.7.2

19 Jan 03:52
d762379

What's Changed

⚡ Effective BPW (bits per weight) will now be logged during load().
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.

Full Changelog: v1.7.0...v1.7.2

GPTQModel v1.7.0

17 Jan 01:34
d247fd0

What's Changed

⚡ backend.MLX added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+). Exporting GPTQ models to MLX is now also possible (see the sketch after this list). We have added MLX-exported models to huggingface.co/ModelCloud.
⚡ lm_head quantization is now fully supported by GPTQModel without external pkg dependency.
🐛 Fixed setup.py not correctly detecting incompatible setuptools/wheel pkgs.
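
A minimal sketch of running a GPTQ checkpoint through the new MLX backend on Apple Silicon; the repo id is a placeholder.

```python
# Hedged sketch: runtime-convert and run a GPTQ checkpoint on Apple's MLX
# framework (Apple Silicon only). The repo id is a placeholder.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/some-gptq-4bit-model",  # placeholder repo id
    backend=BACKEND.MLX,
)
```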

Full Changelog: v1.6.1...v1.7.0

GPTQModel v1.6.1

09 Jan 03:40
0c6452b

What's Changed

🎉 New OpenAI API-compatible endpoint via model.serve(host, port). See the sketch after this list.
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed sym=False loading regression.
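
A minimal sketch of the new model.serve() endpoint; the repo id, host, and port are placeholders.

```python
# Hedged sketch: expose a loaded quantized model over an OpenAI-compatible
# endpoint. The repo id, host, and port are placeholders.
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/some-gptq-4bit-model")  # placeholder repo id
model.serve(host="0.0.0.0", port=8000)
```

Any OpenAI-compatible client can then be pointed at the chosen host and port.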

Full Changelog: v1.6.0...v1.6.1