Releases: ModelCloud/GPTQModel
GPTQModel v2.1.0
What's Changed
✨ New QQQ quantization method and inference support (see the quantization sketch below)!
✨ New Google Gemma 3 day-zero model support.
✨ New Alibaba Ovis 2 VL model support.
✨ New AMD Instella day-zero model support.
✨ New GSM8K Platinum and MMLU-Pro benchmarking support.
✨ Peft LoRA training with GPTQModel is now 30%+ faster on all GPU and IPEX devices.
✨ Auto-detect MoE modules not activated during quantization due to insufficient calibration data.
✨ ROCm `setup.py` compat fixes.
✨ Optimum and Peft compat fixes.
✨ Fixed Peft bfloat16 training.
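A minimal sketch of the new QQQ path through the existing quantize-and-save API, assuming QQQ is selected via a `quant_method`-style field on `QuantizeConfig` (the field name, its value, the model id, and the output path are illustrative assumptions, not confirmed by these notes):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Small calibration set pulled from C4 (same pattern as the project README).
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    quant_method="qqq",  # assumed selector for the new QQQ method
)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration, batch_size=2)
model.save("Llama-3.2-1B-Instruct-QQQ-4bit")
```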
- auto enable flash_attn only when flash-attn was installed by @CSY-ModelCloud in #1372
- Fix rocm compat by @Qubitium in #1373
- fix unnecessary mkdir by @CSY-ModelCloud in #1374
- add test_kernel_output_xpu.py by @CSY-ModelCloud in #1382
- clean test_kernel_output_xpu.py by @CSY-ModelCloud in #1383
- remove xpu support of triton kernel by @Qubitium in #1384
- [MODEL] Add instella support by @LRL-ModelCloud in #1385
- Fix optimum/peft trainer integration by @CSY-ModelCloud in #1381
- rename peft test file by @CSY-ModelCloud in #1387
- [CI] fix wandb was not installed & update test_olora_finetuning_xpu.py by @CSY-ModelCloud in #1388
- Add lm-eval `GSM8K Platinum` by @Qubitium in #1394
- Remove cuda kernel by @Qubitium in #1396
- fix exllama kernels not compiled by @Qubitium in #1397
- update tests by @Qubitium in #1398
- make the kernel output validation more robust by @Qubitium in #1399
- speed up ci by @Qubitium in #1400
- add fwd counter by @yuchiwang in #1389
- allow triton and ipex to inherit torch kernel and use torch for train… by @Qubitium in #1401
- fix skip moe modules when fwd count is 0 by @Qubitium in #1404
- fix ipex linear post init for finetune by @jiqing-feng in #1406
- fix optimum compat by @Qubitium in #1408
- [Feature] Add mmlupro API by @CL-ModelCloud in #1405
- add training callback by @CSY-ModelCloud in #1409
- Fix bf16 training by @Qubitium in #1410
- fix bf16 forward for triton by @Qubitium in #1411
- Add QQQ by @Qubitium in #1402
- make IPEX or any kernel that uses Torch for Training to auto switch v… by @Qubitium in #1412
- [CI] xpu inference test by @CL-ModelCloud in #1380
- [FIX] qqq with eora by @ZX-ModelCloud in #1415
- [FIX] device error by @ZX-ModelCloud in #1417
- make quant linear expose internal buffers by @Qubitium in #1418
- Fix bfloat16 kernels by @Qubitium in #1420
- fix qqq bfloat16 forward by @Qubitium in #1423
- Fix ci10 by @Qubitium in #1424
- fix marlin bf16 compat by @Qubitium in #1427
- [CI] no need reinstall requirements by @CSY-ModelCloud in #1426
- [FIX] dynamic save error by @ZX-ModelCloud in #1428
- [FIX] super().post_init() calling order by @ZX-ModelCloud in #1431
- fix bitblas choose IPEX in cuda env by @CSY-ModelCloud in #1432
- Fix exllama is not packable by @Qubitium in #1433
- disable exllama for training by @Qubitium in #1435
- remove TritonV2QuantLinear for xpu test by @CSY-ModelCloud in #1436
- [MODEL] add gemma3 support by @LRL-ModelCloud in #1434
- fix the error when downloading models using modelscope by @mushenL in #1437
- Add QQQ Rotation by @ZX-ModelCloud in #1425
- fix no init.py by @CSY-ModelCloud in #1438
- Fix hardmard import by @Qubitium in #1441
- Eora final by @nbasyl in #1440
- triton is not validated for ipex by @Qubitium in #1445
- Fix exllama adapter by @Qubitium in #1446
- fix rocm compile by @Qubitium in #1447
- [FIX] Correctly obtain the submodule's device by @ZX-ModelCloud in #1448
- fix rocm not compatible with exllama v2 and eora kernel by @Qubitium in #1449
- revert overflow code by @Qubitium in #1450
- add kernel dtype support and add full float16 vs bfloat16 kernel testing by @Qubitium in #1452
- [MODEL] add Ovis2 support and bug fix by @Fusionplay in #1454
- add unit test for ovis2 by @CSY-ModelCloud in #1456
New Contributors
- @yuchiwang made their first contribution in #1389
- @mushenL made their first contribution in #1437
- @nbasyl made their first contribution in #1440
- @Fusionplay made their first contribution in #1454
Full Changelog: v2.0.0...v2.1.0
GPTQModel v2.0.0
What's Changed
🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion.
🎉 Synced Marlin kernel inference quality fix from upstream. Added `MARLIN_FP16`, a lower-quality but faster backend (see the sketch below).
🎉 ModelScope support added.
🎉 Logging and CLI progress bar output have been revamped with sticky bottom progress.
🎉 Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes.
🎉 Delegate logging/progress bar to the LogBar pkg.
🐛 Fix ROCm version auto-detection in setup install.
🐛 Fixed `generation_config.json` save and load.
🐛 Fixed Transformers v4.49.0 compat. Fixed compat for models without a `bos` token.
🐛 Fixed `group_size=-1` and `bits=3` packing regression.
🐛 Fixed Qwen 2.5 MoE regressions.
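A minimal sketch of opting into the faster, lower-accuracy Marlin FP16 path when loading an already-quantized model, assuming the new backend is exposed as a `BACKEND.MARLIN_FP16` member (the repo id is a placeholder):

```python
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # placeholder quantized repo id
    backend=BACKEND.MARLIN_FP16,  # assumed enum member per the note above
)

tokens = model.generate("The capital of France is")[0]
print(model.tokenizer.decode(tokens))
```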
- fix 3 bit packing regression, fixed #1278 by @CSY-ModelCloud in #1280
- Fix supported models list (syntax error) by @Forenche in #1281
- feat: load model from modelscope by @suluyana in #1283
- merge eval & utils.lm_eval by @CSY-ModelCloud in #1282
- fix modelscope import & tests by @CSY-ModelCloud in #1285
- allow passing model instance to evalplus & update tokenizer loading logics by @CSY-ModelCloud in #1284
- fix lm-eval & vllm check tokenizer type by @CSY-ModelCloud in #1287
- Fix `generation_config.json` not auto-saved by @Qubitium in #1292
- [SAVE] Save config files with empty state dict by @ZX-ModelCloud in #1293
- [SAVE] Save processor related config files by @ZX-ModelCloud in #1295
- fix wrong order of config save causing sharded tensors to be removed by @Qubitium in #1297
- [FIX] not pack when group_size=-1 by @ZX-ModelCloud in #1298
- cleanup marlin paths: marlin does conversion on `post_init` by @Qubitium in #1310
- bump tokenicer to v0.0.3 by @CSY-ModelCloud in #1308
- clean is_marlin_format for tests by @CSY-ModelCloud in #1311
- [CI] fix sglang test name & add status logs & remove exllama packing test by @CSY-ModelCloud in #1312
- skip v1 to v2 conversion for sym=True only kernels by @Qubitium in #1314
- bump tokenicer to 0.0.4 & remove FORMAT_FIELD_COMPAT_MARLIN by @CSY-ModelCloud in #1315
- revert is_marlin_format check by @CSY-ModelCloud in #1316
- Improve Marlin accuracy (default) but add `MARLIN_FP16` backend for faster with less-accuracy by @Qubitium in #1317
- marlin fp32 mode should also be enabled if kernel was selected due to… by @Qubitium in #1318
- refractor logger by @Qubitium in #1319
- fix typo by @Qubitium in #1320
- refractor logger and have progress bar sticky to bottom of cli by @Qubitium in #1322
- [CI] fix tokenicer upgraded transformers & install bitblas for test_save_quanted_model by @CSY-ModelCloud in #1321
- [CI] allow to select compiler server & move model test to correct dir by @CSY-ModelCloud in #1323
- fix bitblas loading regression by @Qubitium in #1324
- marlin fp16 warning missed check by @Qubitium in #1325
- fix custom logger overriding system level logger by @Qubitium in #1327
- fix progress bar for packing by @CSY-ModelCloud in #1326
- More log fixes by @Qubitium in #1328
- fix no backend when creating a quant linear by @CSY-ModelCloud in #1329
- use relative path instead of importing gptqmodel by @CSY-ModelCloud in #1331
- no need patch vllm now by @CSY-ModelCloud in #1332
- [CI] fix CI url by @CSY-ModelCloud in #1333
- fix oom by @CSY-ModelCloud in #1335
- add default value for backend, fix optimum doesn't pass it by @CSY-ModelCloud in #1334
- refractor pb and pb usage by @Qubitium in #1341
- fix generator has no length info by @CSY-ModelCloud in #1342
- replace utils.Progressbar with logbar by @CSY-ModelCloud in #1343
- [CI] update UI by @CSY-ModelCloud in #1344
- fix logbar api usage by @CSY-ModelCloud in #1345
- fix v2 to v1 missed logic bypass by @Qubitium in #1347
- [CI] fix xpu env has no logbar by @CSY-ModelCloud in #1346
- [CI] update runner ip env & fix show-statistics didn't run by @CSY-ModelCloud in #1348
- fix time was not imported by @CSY-ModelCloud in #1349
- update device-smi depend to v0.4.0 by @Qubitium in #1351
- [CI] install requirements.txt for m4 by @CSY-ModelCloud in #1352
- Exllama V1 is Packable by @ZX-ModelCloud in #1356
- [FIX] test_packable.py by @ZX-ModelCloud in #1357
- [setup] use torch.version.hip for rocm version check by @CSY-ModelCloud in #1360
- save/load peft lora by @Qubitium in #1358
- update device-smi to 0.4.1 for rocm fix by @Qubitium in #1362
- strip model path by @Qubitium in #1363
- [CI] exllama v1 kernel now eligible for quant stage by @Qubitium in #1364
- Fix transformers modeling code passing `input.shape[0] == 0` to nn.module by @Qubitium in #1365
- simplify log var by @Qubitium in #1368
- fix import by @CSY-ModelCloud in #1369
- update by @Qubitium in #1370
Full Changelog: v1.9.0...v2.0.0
GPTQModel v1.9.0
What's Changed
⚡ Offload tokenizer fixes to the Toke(n)icer pkg.
⚡ Optimized `lm_head` quant time and VRAM usage.
⚡ Optimized DeepSeek v3/R1 model quant VRAM usage.
⚡ 3x speed-up for Torch kernel when using PyTorch >= 2.5.0 with `model.compile()`.
⚡ New `calibration_dataset_concat_size` option to enable calibration data concat mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like wikitext2 (see the sketch below).
🐛 Fixed Optimum compat and XPU/IPEX auto kernel selection regression in v1.8.1.
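A minimal sketch of the new concat mode, assuming the option is passed straight to `quantize()` under the name given above (the model id, sample count, and 2048 block size are placeholders):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# wikitext2 is the dataset the note calls out as benefiting from concat mode.
calibration = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]
calibration = [t for t in calibration if t.strip()][:1024]

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", QuantizeConfig(bits=4, group_size=128))
model.quantize(
    calibration,
    calibration_dataset_concat_size=2048,  # pack calibration text into fixed-size blocks
)
model.save("Qwen2.5-0.5B-Instruct-gptq-4bit")
```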
- Fix init arg order and `optimum` compat by @CSY-ModelCloud in #1240
- [FIX][Optimize] lm_head quantize by @ZX-ModelCloud in #1239
- [Model] [DeepSeek] un-merge `gate_proj` and `up_proj` by @LRL-ModelCloud in #1241
- Use Toke(n)icer by @CL-ModelCloud in #1242, #1244
- Add Tokenicer Test by @CL-ModelCloud in #1245
- prepare for 1.8.2 release by @Qubitium in #1243
- simplify calls to tokenicer by @CL-ModelCloud in #1246
- Update requirements.txt by @Qubitium in #1248
- fix trust_remote was lost by @CSY-ModelCloud in #1249
- fix trust_remote was lost by @CSY-ModelCloud in #1250
- prepare for 1.8.5 release by @Qubitium in #1251
- fix unit tests & tweak logic for selecting backends by @CSY-ModelCloud in #1253
- install tokenicer form git & do ruff by @CSY-ModelCloud in #1254
- fix k,v is not a dict by @CSY-ModelCloud in #1255
- fix not enough values to unpack (expected 2, got 1) by @CSY-ModelCloud in #1256
- fix sglang test requires numpy<2.0 by @CSY-ModelCloud in #1258
- fix ipex backend by @jiqing-feng in #1259
- ipex should be packable, reverted pr #1259 importer.py changes by @CSY-ModelCloud in #1260
- remove sentencepiece by @CSY-ModelCloud in #1261
- speed up torch dequantize by @Qubitium in #1262
- Add `calibration_dataset_concat_size` option/mode by @LRL-ModelCloud in #1257
- add transformers test by @CSY-ModelCloud in #1264
- Add kernel torch.compile hook by @Qubitium in #1265
- [FIX]fix vl model prepare_dataset by @LRL-ModelCloud in #1266
Full Changelog: v1.8.1...v1.9.0
GPTQModel v1.8.1
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new `QuantizeConfig.pack_dtype` (see the sketch below).
⚡ Over 50% speedup for VL model quantization (Qwen 2.5-VL + Ovis).
⚡ New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of OOM.
⚡ New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo.
⚡ New `buffered_fwd: bool` control in `model.quantize()`.
🐛 Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
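A minimal sketch combining the new controls from this release, assuming `pack_dtype` accepts a torch integer dtype and that `auto_gc`/`buffered_fwd` are plain keyword arguments to `quantize()` (the model id, calibration text, and output path are placeholders):

```python
import torch
from gptqmodel import GPTQModel, QuantizeConfig

# Toy calibration set for illustration only; use a real dataset in practice.
calibration = ["GPTQModel packs quantized weights into narrow integer dtypes."] * 256

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int16,  # new: pack weights to int32/int16/int8
)

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(
    calibration,
    auto_gc=False,      # new: skip aggressive gc to cut quant time on small models
    buffered_fwd=True,  # new: experimental buffered forward control
)
model.save("Qwen2.5-0.5B-int16-packed")
```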
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add `pack_dtype` to dynamic config and fix validate by @Qubitium in #1178
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting `max_memory` arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental `buffered_fwd` quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add `GPTQModel.push_to_hub()` support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
- fix merge error by @CSY-ModelCloud in #1234
- fix var name by @CSY-ModelCloud in #1235
- fix visual llm slow forward by @LRL-ModelCloud in #1232
Full Changelog: v1.7.4...v1.8.1
GPTQModel v1.8.0
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new `QuantizeConfig.pack_dtype`.
⚡ New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of OOM.
⚡ New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo (see the sketch below).
⚡ New `buffered_fwd: bool` control in `model.quantize()`.
🐛 Fixed `bits=3` packing regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
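A minimal sketch of the new upload helper, assuming `GPTQModel.push_to_hub()` takes a destination repo id and a local quantized-model folder; only the method name comes from the note above, the argument names and values are assumptions:

```python
from gptqmodel import GPTQModel

GPTQModel.push_to_hub(
    repo_id="your-org/Llama-3.2-1B-Instruct-gptq-4bit",  # destination HF repo (placeholder)
    quantized_path="./Llama-3.2-1B-Instruct-gptq-4bit",  # local quantized model folder (placeholder)
    # token="hf_...",  # or authenticate beforehand with `huggingface-cli login`
)
```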
- start 1.8.0-dev cycle by @Qubitium in #1168
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add `pack_dtype` to dynamic config and fix validate by @Qubitium in #1178
- format by @Qubitium in #1179
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting `max_memory` arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental `buffered_fwd` quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add `GPTQModel.push_to_hub()` support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
Full Changelog: v1.7.4...v1.8.0
GPTQModel v1.7.4
What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New `compile()` api that allows torch to improve tps by ~4-8% (see the sketch below). May need to disable flash_attention for some kernels.
🐛 Fix HF Transformers bug of downcasting fast tokenizer class on save.
🐛 Fix inaccurate `bpw` calculations.
🐛 Fix ROCm compile with `setup.py`.
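A minimal sketch of the new api, assuming it is exposed as `model.compile()` on a loaded model and wraps `torch.compile` internally (the repo id is a placeholder):

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder repo id
model.compile()  # assumed entry point per the note; may require disabling flash_attention on some kernels

tokens = model.generate("Quantization reduces")[0]
print(model.tokenizer.decode(tokens))
```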
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- supress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4
GPTQModel v1.7.3
What's Changed
⚡ Telechat2 (China Telecom) model support.
⚡ PhiMoE model support.
🐛 Fix `lm_head` weights duplicated in post-quantize `save()` for models with tied embeddings.
- Add util.tensor_parameters() by @ZX-ModelCloud in #1107
- add require_dtype by @LRL-ModelCloud in #1109
- [MODEL] Add Telechat2 (China Telecom) by @1096125073 in #1106
- [FIX] Filter weight-sharing tensors when save by @ZX-ModelCloud in #1112
- Add telechat test by @LRL-ModelCloud in #1111
- [FIX] fix convert_gptq_to_mlx_weights by @LRL-ModelCloud in #1113
- add test_parameter_count.py by @ZX-ModelCloud in #1115
- Add gpqa eval task by @CL-ModelCloud in #1117
- [FIX] Call tied_weights() after load_checkpoint_in_model() by @ZX-ModelCloud in #1119
- add phimoe support by @CSY-ModelCloud in #1118
New Contributors
- @1096125073 made their first contribution in #1106
Full Changelog: v1.7.2...v1.7.3
GPTQModel v1.7.2
What's Changed
⚡ Effective BPW (bits per weight) will now be logged during `load()`.
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
- remove catching module error by @CSY-ModelCloud in #1088
- [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by @LRL-ModelCloud in #1090
- [FIX] monkey patch only once by @LRL-ModelCloud in #1091
- check CC >= 8 for marlin, fixed #1092 by @CSY-ModelCloud in #1093
- check compute capability for marlin in validate_device() by @CSY-ModelCloud in #1095
- torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by @CSY-ModelCloud in #1096
- fix local model path & marlin test by @CSY-ModelCloud in #1097
- mod bits info by @CL-ModelCloud in #1100
- Reduce memory usage in mlx conversion by @Qubitium in #1099
- cleanup mlx code by @Qubitium in #1101
Full Changelog: v1.7.0...v1.7.2
GPTQModel v1.7.0
What's Changed
⚡ `backend.MLX` added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+) (see the sketch below).
⚡ Exports of GPTQ models to MLX are also now possible. We have added MLX-exported models to huggingface.co/ModelCloud.
⚡ `lm_head` quantization now fully supported by GPTQModel without external pkg dependency.
🐛 Fixed `setup.py` not correctly detecting incompatible `setuptools`/`wheel` pkgs.
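A minimal sketch of both highlights, assuming the MLX path is selected with `backend=BACKEND.MLX` at load time and that `lm_head` quantization is toggled by an `lm_head` flag on `QuantizeConfig` (the flag name and repo id are assumptions):

```python
from gptqmodel import GPTQModel, BACKEND, QuantizeConfig

# Inference: runtime conversion + execution on Apple's MLX framework
# (Apple Silicon, mlx package installed).
mlx_model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # placeholder quantized repo id
    backend=BACKEND.MLX,
)

# Quantization: include the lm_head module (assumed flag name per the release note).
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)
```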
- [CI] run tests with linux tag by @CSY-ModelCloud in #1067
- Add backend.MLX by @LRL-ModelCloud in #1061
- add mlx generate test by @CL-ModelCloud in #1069
- [CI] upload source in build step by @CSY-ModelCloud in #1070
- code review by @CL-ModelCloud in #1072
- [CI] install mlx by @CSY-ModelCloud in #1071
- Add option to quantize `lm_head` by @ZX-ModelCloud in #1037
- fix test_packing by @LRL-ModelCloud in #1073
- [CI] add mlx test by @CSY-ModelCloud in #1074
- [CI] fix ci relase env name by @CSY-ModelCloud in #1078
- update mlx test by @CSY-ModelCloud in #1079
- convert to mlx support desc_act true by @LRL-ModelCloud in #1082
- [CI] add extra-index-url for pip install by @CSY-ModelCloud in #1083
- catch module error for setup.py by @CSY-ModelCloud in #1084
Full Changelog: v1.6.1...v1.7.0
GPTQModel v1.6.1
What's Changed
🎉 New OpenAI api compatible end-point via `model.serve(host, port)` (see the sketch below).
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed `sym=False` loading regression.
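A minimal sketch of the new end-point, assuming `serve()` blocks and speaks the OpenAI HTTP schema as stated above (repo id, host, and port are placeholders):

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder repo id
model.serve(host="127.0.0.1", port=8000)  # serves OpenAI-compatible requests until interrupted
```

Any OpenAI-compatible client pointed at http://127.0.0.1:8000 should then be able to issue requests against the quantized model.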
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1