Releases: ModelCloud/GPTQModel
GPTQModel v2.1.0
What's Changed
✨ New QQQ quantization method and inference support (see the quantization sketch below)!
✨ New Google Gemma 3 day-zero model support.
✨ New Alibaba Ovis 2 VL model support.
✨ New AMD Instella day-zero model support.
✨ New GSM8K Platinum and MMLU-Pro benchmarking support.
✨ Peft LoRA training with GPTQModel is now 30%+ faster on all GPU and IPEX devices.
✨ Auto-detect MoE modules not activated during quantization due to insufficient calibration data.
✨ ROCm `setup.py` compat fixes.
✨ Optimum and Peft compat fixes.
✨ Fixed Peft bfloat16 training.
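A minimal sketch of the new QQQ path through the existing quantize-and-save API, assuming QQQ is selected via a `quant_method`-style field on `QuantizeConfig` (the field name, its value, the model id, and the output path are illustrative assumptions, not confirmed by these notes):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Small calibration set pulled from C4 (same pattern as the project README).
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    quant_method="qqq",  # assumed selector for the new QQQ method
)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration, batch_size=2)
model.save("Llama-3.2-1B-Instruct-QQQ-4bit")
```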
- auto enable flash_attn only when flash-attn was installed by @CSY-ModelCloud in #1372
- Fix rocm compat by @Qubitium in #1373
- fix unnecessary mkdir by @CSY-ModelCloud in #1374
- add test_kernel_output_xpu.py by @CSY-ModelCloud in #1382
- clean test_kernel_output_xpu.py by @CSY-ModelCloud in #1383
- remove xpu support of triton kernel by @Qubitium in #1384
- [MODEL] Add instella support by @LRL-ModelCloud in #1385
- Fix optimum/peft trainer integration by @CSY-ModelCloud in #1381
- rename peft test file by @CSY-ModelCloud in #1387
- [CI] fix wandb was not installed & update test_olora_finetuning_xpu.py by @CSY-ModelCloud in #1388
- Add lm-eval `GSM8K Platinum` by @Qubitium in #1394
- Remove cuda kernel by @Qubitium in #1396
- fix exllama kernels not compiled by @Qubitium in #1397
- update tests by @Qubitium in #1398
- make the kernel output validation more robust by @Qubitium in #1399
- speed up ci by @Qubitium in #1400
- add fwd counter by @yuchiwang in #1389
- allow triton and ipex to inherit torch kernel and use torch for train… by @Qubitium in #1401
- fix skip moe modules when fwd count is 0 by @Qubitium in #1404
- fix ipex linear post init for finetune by @jiqing-feng in #1406
- fix optimum compat by @Qubitium in #1408
- [Feature] Add mmlupro API by @CL-ModelCloud in #1405
- add training callback by @CSY-ModelCloud in #1409
- Fix bf16 training by @Qubitium in #1410
- fix bf16 forward for triton by @Qubitium in #1411
- Add QQQ by @Qubitium in #1402
- make IPEX or any kernel that uses Torch for Training to auto switch v… by @Qubitium in #1412
- [CI] xpu inference test by @CL-ModelCloud in #1380
- [FIX] qqq with eora by @ZX-ModelCloud in #1415
- [FIX] device error by @ZX-ModelCloud in #1417
- make quant linear expose internal buffers by @Qubitium in #1418
- Fix bfloat16 kernels by @Qubitium in #1420
- fix qqq bfloat16 forward by @Qubitium in #1423
- Fix ci10 by @Qubitium in #1424
- fix marlin bf16 compat by @Qubitium in #1427
- [CI] no need reinstall requirements by @CSY-ModelCloud in #1426
- [FIX] dynamic save error by @ZX-ModelCloud in #1428
- [FIX] super().post_init() calling order by @ZX-ModelCloud in #1431
- fix bitblas choose IPEX in cuda env by @CSY-ModelCloud in #1432
- Fix exllama is not packable by @Qubitium in #1433
- disable exllama for training by @Qubitium in #1435
- remove TritonV2QuantLinear for xpu test by @CSY-ModelCloud in #1436
- [MODEL] add gemma3 support by @LRL-ModelCloud in #1434
- fix the error when downloading models using modelscope by @mushenL in #1437
- Add QQQ Rotation by @ZX-ModelCloud in #1425
- fix no init.py by @CSY-ModelCloud in #1438
- Fix hardmard import by @Qubitium in #1441
- Eora final by @nbasyl in #1440
- triton is not validated for ipex by @Qubitium in #1445
- Fix exllama adapter by @Qubitium in #1446
- fix rocm compile by @Qubitium in #1447
- [FIX] Correctly obtain the submodule's device by @ZX-ModelCloud in #1448
- fix rocm not compatible with exllama v2 and eora kernel by @Qubitium in #1449
- revert overflow code by @Qubitium in #1450
- add kernel dtype support and add full float16 vs bfloat16 kernel testing by @Qubitium in #1452
- [MODEL] add Ovis2 support and bug fix by @Fusionplay in #1454
- add unit test for ovis2 by @CSY-ModelCloud in #1456
New Contributors
- @yuchiwang made their first contribution in #1389
- @mushenL made their first contribution in #1437
- @nbasyl made their first contribution in #1440
- @Fusionplay made their first contribution in #1454
Full Changelog: v2.0.0...v2.1.0
GPTQModel v2.0.0
What's Changed
🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion.
🎉 Synced Marlin kernel inference quality fix from upstream. Added `MARLIN_FP16`, a lower-quality but faster backend (see the sketch below).
🎉 ModelScope support added.
🎉 Logging and CLI progress bar output have been revamped with sticky bottom progress.
🎉 Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes.
🎉 Delegate logging/progress bar to the LogBar pkg.
🐛 Fix ROCm version auto-detection in setup install.
🐛 Fixed `generation_config.json` save and load.
🐛 Fixed Transformers v4.49.0 compat. Fixed compat for models without a `bos` token.
🐛 Fixed `group_size=-1` and `bits=3` packing regression.
🐛 Fixed Qwen 2.5 MoE regressions.
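A minimal sketch of opting into the faster, lower-accuracy Marlin FP16 path when loading an already-quantized model, assuming the new backend is exposed as a `BACKEND.MARLIN_FP16` member (the repo id is a placeholder):

```python
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # placeholder quantized repo id
    backend=BACKEND.MARLIN_FP16,  # assumed enum member per the note above
)

tokens = model.generate("The capital of France is")[0]
print(model.tokenizer.decode(tokens))
```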
- fix 3 bit packing regression, fixed #1278 by @CSY-ModelCloud in #1280
- Fix supported models list (syntax error) by @Forenche in #1281
- feat: load model from modelscope by @suluyana in #1283
- merge eval & utils.lm_eval by @CSY-ModelCloud in #1282
- fix modelscope import & tests by @CSY-ModelCloud in #1285
- allow passing model instance to evalplus & update tokenizer loading logics by @CSY-ModelCloud in #1284
- fix lm-eval & vllm check tokenizer type by @CSY-ModelCloud in #1287
- Fix `generation_config.json` not auto-saved by @Qubitium in #1292
- [SAVE] Save config files with empty state dict by @ZX-ModelCloud in #1293
- [SAVE] Save processor related config files by @ZX-ModelCloud in #1295
- fix wrong order of config save causing sharded tensors to be removed by @Qubitium in #1297
- [FIX] not pack when group_size=-1 by @ZX-ModelCloud in #1298
- cleanup marlin paths: marlin does conversion on `post_init` by @Qubitium in #1310
- bump tokenicer to v0.0.3 by @CSY-ModelCloud in #1308
- clean is_marlin_format for tests by @CSY-ModelCloud in #1311
- [CI] fix sglang test name & add status logs & remove exllama packing test by @CSY-ModelCloud in #1312
- skip v1 to v2 conversion for sym=True only kernels by @Qubitium in #1314
- bump tokenicer to 0.0.4 & remove FORMAT_FIELD_COMPAT_MARLIN by @CSY-ModelCloud in #1315
- revert is_marlin_format check by @CSY-ModelCloud in #1316
- Improve Marlin accuracy (default) but add `MARLIN_FP16` backend for faster with less-accuracy by @Qubitium in #1317
- marlin fp32 mode should also be enabled if kernel was selected due to… by @Qubitium in #1318
- refractor logger by @Qubitium in #1319
- fix typo by @Qubitium in #1320
- refractor logger and have progress bar sticky to bottom of cli by @Qubitium in #1322
- [CI] fix tokenicer upgraded transformers & install bitblas for test_save_quanted_model by @CSY-ModelCloud in #1321
- [CI] allow to select compiler server & move model test to correct dir by @CSY-ModelCloud in #1323
- fix bitblas loading regression by @Qubitium in #1324
- marlin fp16 warning missed check by @Qubitium in #1325
- fix custom logger overriding system level logger by @Qubitium in #1327
- fix progress bar for packing by @CSY-ModelCloud in #1326
- More log fixes by @Qubitium in #1328
- fix no backend when creating a quant linear by @CSY-ModelCloud in #1329
- use relative path instead of importing gptqmodel by @CSY-ModelCloud in #1331
- no need patch vllm now by @CSY-ModelCloud in #1332
- [CI] fix CI url by @CSY-ModelCloud in #1333
- fix oom by @CSY-ModelCloud in #1335
- add default value for backend, fix optimum doesn't pass it by @CSY-ModelCloud in #1334
- refractor pb and pb usage by @Qubitium in #1341
- fix generator has no length info by @CSY-ModelCloud in #1342
- replace utils.Progressbar with logbar by @CSY-ModelCloud in #1343
- [CI] update UI by @CSY-ModelCloud in #1344
- fix logbar api usage by @CSY-ModelCloud in #1345
- fix v2 to v1 missed logic bypass by @Qubitium in #1347
- [CI] fix xpu env has no logbar by @CSY-ModelCloud in #1346
- [CI] update runner ip env & fix show-statistics didn't run by @CSY-ModelCloud in #1348
- fix time was not imported by @CSY-ModelCloud in #1349
- update device-smi depend to v0.4.0 by @Qubitium in #1351
- [CI] install requirements.txt for m4 by @CSY-ModelCloud in #1352
- Exllama V1 is Packable by @ZX-ModelCloud in #1356
- [FIX] test_packable.py by @ZX-ModelCloud in #1357
- [setup] use torch.version.hip for rocm version check by @CSY-ModelCloud in #1360
- save/load peft lora by @Qubitium in #1358
- update device-smi to 0.4.1 for rocm fix by @Qubitium in #1362
- strip model path by @Qubitium in #1363
- [CI] exllama v1 kernel now eligible for quant stage by @Qubitium in #1364
- Fix transformers modeling code passing `input.shape[0] == 0` to nn.module by @Qubitium in #1365
- simplify log var by @Qubitium in #1368
- fix import by @CSY-ModelCloud in #1369
- update by @Qubitium in #1370
Full Changelog: v1.9.0...v2.0.0
GPTQModel v1.9.0
What's Changed
⚡ Offload tokenizer fixes to the Toke(n)icer pkg.
⚡ Optimized `lm_head` quant time and VRAM usage.
⚡ Optimized DeepSeek v3/R1 model quant VRAM usage.
⚡ 3x speed-up for Torch kernel when using PyTorch >= 2.5.0 with `model.compile()`.
⚡ New `calibration_dataset_concat_size` option to enable calibration data concat mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like wikitext2 (see the sketch below).
🐛 Fixed Optimum compat and XPU/IPEX auto kernel selection regression in v1.8.1.
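A minimal sketch of the new concat mode, assuming the option is passed straight to `quantize()` under the name given above (the model id, sample count, and 2048 block size are placeholders):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# wikitext2 is the dataset the note calls out as benefiting from concat mode.
calibration = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"]
calibration = [t for t in calibration if t.strip()][:1024]

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", QuantizeConfig(bits=4, group_size=128))
model.quantize(
    calibration,
    calibration_dataset_concat_size=2048,  # pack calibration text into fixed-size blocks
)
model.save("Qwen2.5-0.5B-Instruct-gptq-4bit")
```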
- Fix init arg order and `optimum` compat by @CSY-ModelCloud in #1240
- [FIX][Optimize] lm_head quantize by @ZX-ModelCloud in #1239
- [Model] [DeepSeek] un-merge `gate_proj` and `up_proj` by @LRL-ModelCloud in #1241
- Use Toke(n)icer by @CL-ModelCloud in #1242, #1244
- Add Tokenicer Test by @CL-ModelCloud in #1245
- prepare for 1.8.2 release by @Qubitium in #1243
- simplify calls to tokenicer by @CL-ModelCloud in #1246
- Update requirements.txt by @Qubitium in #1248
- fix trust_remote was lost by @CSY-ModelCloud in #1249
- fix trust_remote was lost by @CSY-ModelCloud in #1250
- prepare for 1.8.5 release by @Qubitium in #1251
- fix unit tests & tweak logic for selecting backends by @CSY-ModelCloud in #1253
- install tokenicer form git & do ruff by @CSY-ModelCloud in #1254
- fix k,v is not a dict by @CSY-ModelCloud in #1255
- fix not enough values to unpack (expected 2, got 1) by @CSY-ModelCloud in #1256
- fix sglang test requires numpy<2.0 by @CSY-ModelCloud in #1258
- fix ipex backend by @jiqing-feng in #1259
- ipex should be packable, reverted pr #1259 importer.py changes by @CSY-ModelCloud in #1260
- remove sentencepiece by @CSY-ModelCloud in #1261
- speed up torch dequantize by @Qubitium in #1262
- Add `calibration_dataset_concat_size` option/mode by @LRL-ModelCloud in #1257
- add transformers test by @CSY-ModelCloud in #1264
- Add kernel torch.compile hook by @Qubitium in #1265
- [FIX]fix vl model prepare_dataset by @LRL-ModelCloud in #1266
Full Changelog: v1.8.1...v1.9.0
GPTQModel v1.8.1
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new `QuantizeConfig.pack_dtype` (see the sketch below).
⚡ Over 50% speedup for VL model quantization (Qwen 2.5-VL + Ovis).
⚡ New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of OOM.
⚡ New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo.
⚡ New `buffered_fwd: bool` control in `model.quantize()`.
🐛 Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
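A minimal sketch combining the new controls from this release, assuming `pack_dtype` accepts a torch integer dtype and that `auto_gc`/`buffered_fwd` are plain keyword arguments to `quantize()` (the model id, calibration text, and output path are placeholders):

```python
import torch
from gptqmodel import GPTQModel, QuantizeConfig

# Toy calibration set for illustration only; use a real dataset in practice.
calibration = ["GPTQModel packs quantized weights into narrow integer dtypes."] * 256

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int16,  # new: pack weights to int32/int16/int8
)

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct", quant_config)
model.quantize(
    calibration,
    auto_gc=False,      # new: skip aggressive gc to cut quant time on small models
    buffered_fwd=True,  # new: experimental buffered forward control
)
model.save("Qwen2.5-0.5B-int16-packed")
```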
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add `pack_dtype` to dynamic config and fix validate by @Qubitium in #1178
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting `max_memory` arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental `buffered_fwd` quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add `GPTQModel.push_to_hub()` support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
- fix merge error by @CSY-ModelCloud in #1234
- fix var name by @CSY-ModelCloud in #1235
- fix visual llm slow forward by @LRL-ModelCloud in #1232
Full Changelog: v1.7.4...v1.8.1
GPTQModel v1.8.0
What's Changed
⚡ DeepSeek v3/R1 model support.
⚡ New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new `QuantizeConfig.pack_dtype`.
⚡ New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of OOM.
⚡ New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo (see the sketch below).
⚡ New `buffered_fwd: bool` control in `model.quantize()`.
🐛 Fixed `bits=3` packing regression in v1.7.4.
🐛 Fixed Google Colab install requiring two install passes.
🐛 Fixed Python 3.10 compatibility.
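A minimal sketch of the new upload helper, assuming `GPTQModel.push_to_hub()` takes a destination repo id and a local quantized-model folder; only the method name comes from the note above, the argument names and values are assumptions:

```python
from gptqmodel import GPTQModel

GPTQModel.push_to_hub(
    repo_id="your-org/Llama-3.2-1B-Instruct-gptq-4bit",  # destination HF repo (placeholder)
    quantized_path="./Llama-3.2-1B-Instruct-gptq-4bit",  # local quantized model folder (placeholder)
    # token="hf_...",  # or authenticate beforehand with `huggingface-cli login`
)
```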
- start 1.8.0-dev cycle by @Qubitium in #1168
- Flexible Pack DType by @Qubitium in #1158
- cuda needs to declare pack dtypes by @Qubitium in #1169
- fix pass pack dtype by @Qubitium in #1172
- Pass dtype by @Qubitium in #1173
- move in/out features and grop_size init to base by @Qubitium in #1174
- move self.maxq to base class by @Qubitium in #1175
- consolidate pack() into packer cls by @Qubitium in #1176
- Add `pack_dtype` to dynamic config and fix validate by @Qubitium in #1178
- format by @Qubitium in #1179
- Refract 4 by @Qubitium in #1180
- Refractor and simplify multi-kernel selection/init by @Qubitium in #1183
- Update/Refractor Bitblas/Marlin/Cuda by @Qubitium in #1184
- push bitblas logic down by @Qubitium in #1185
- Revert Bitblas to 0.0.1-dev13 by @Qubitium in #1186
- Do not export config.key if value is None by @Qubitium in #1187
- Fix examples/perplexity by @Qubitium in #1191
- [MODEL] add deepseek v3 support by @LRL-ModelCloud in #1127
- Push register buffer down to base class and rename all in/out features by @Qubitium in #1193
- Fix #1196 hf_transfer not accepting `max_memory` arg by @Qubitium in #1197
- reduce peak memory and reduce quant time by @Qubitium in #1198
- skip zero math by @Qubitium in #1199
- fix test_packing_speed by @Qubitium in #1202
- Update test_quant_time.py by @Qubitium in #1203
- experimental `buffered_fwd` quantize control by @Qubitium in #1205
- Fix dynamic regression on quant save by @Qubitium in #1208
- Python 3.10 type-hint compt bug by @Qubitium in #1213
- Fix colab install by @Qubitium in #1215
- add `GPTQModel.push_to_hub()` support by @Qubitium in #1216
- default to 8GB shard-size for model save by @Qubitium in #1217
- Auto gc toggle by @Qubitium in #1219
- fix 3bit packing and inference by @Qubitium in #1218
Full Changelog: v1.7.4...v1.8.0
GPTQModel v1.7.4
What's Changed
⚡ Faster packing for post-quantization model weight save.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New `compile()` api that allows torch to improve tps by ~4-8% (see the sketch below). May need to disable flash_attention for some kernels.
🐛 Fix HF Transformers bug of downcasting fast tokenizer class on save.
🐛 Fix inaccurate `bpw` calculations.
🐛 Fix ROCm compile with `setup.py`.
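A minimal sketch of the new api, assuming it is exposed as `model.compile()` on a loaded model and wraps `torch.compile` internally (the repo id is a placeholder):

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder repo id
model.compile()  # assumed entry point per the note; may require disabling flash_attention on some kernels

tokens = model.generate("Quantization reduces")[0]
print(model.tokenizer.decode(tokens))
```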
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() codes by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- supress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] fix incorrectly saved the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4
GPTQModel v1.7.3
What's Changed
⚡ Telechat2 (China Telecom) model support.
⚡ PhiMoE model support.
🐛 Fix `lm_head` weights duplicated in post-quantize `save()` for models with tied embeddings.
- Add util.tensor_parameters() by @ZX-ModelCloud in #1107
- add require_dtype by @LRL-ModelCloud in #1109
- [MODEL] Add Telechat2 (China Telecom) by @1096125073 in #1106
- [FIX] Filter weight-sharing tensors when save by @ZX-ModelCloud in #1112
- Add telechat test by @LRL-ModelCloud in #1111
- [FIX] fix convert_gptq_to_mlx_weights by @LRL-ModelCloud in #1113
- add test_parameter_count.py by @ZX-ModelCloud in #1115
- Add gpqa eval task by @CL-ModelCloud in #1117
- [FIX] Call tied_weights() after load_checkpoint_in_model() by @ZX-ModelCloud in #1119
- add phimoe support by @CSY-ModelCloud in #1118
New Contributors
- @1096125073 made their first contribution in #1106
Full Changelog: v1.7.2...v1.7.3
GPTQModel v1.7.2
What's Changed
⚡ Effective BPW (bits per weight) will now be logged during `load()`.
⚡ Reduce loading time on Intel Arc A770/B580 XPU by 3.3x.
⚡ Reduce memory usage in MLX conversion.
🐛 Fix Marlin kernel auto-select not checking CUDA compute version.
- remove catching module error by @CSY-ModelCloud in #1088
- [FIX] monkey patch GPTQShuffle.convert_idx to use fixed convert_idx by @LRL-ModelCloud in #1090
- [FIX] monkey patch only once by @LRL-ModelCloud in #1091
- check CC >= 8 for marlin, fixed #1092 by @CSY-ModelCloud in #1093
- check compute capability for marlin in validate_device() by @CSY-ModelCloud in #1095
- torch get device with index of CUDA_VISIBLE_DEVICES, not value of it by @CSY-ModelCloud in #1096
- fix local model path & marlin test by @CSY-ModelCloud in #1097
- mod bits info by @CL-ModelCloud in #1100
- Reduce memory usage in mlx conversion by @Qubitium in #1099
- cleanup mlx code by @Qubitium in #1101
Full Changelog: v1.7.0...v1.7.2
GPTQModel v1.7.0
What's Changed
⚡ `backend.MLX` added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+) (see the sketch below).
⚡ Exports of GPTQ models to MLX are also now possible. We have added MLX-exported models to huggingface.co/ModelCloud.
⚡ `lm_head` quantization now fully supported by GPTQModel without external pkg dependency.
🐛 Fixed `setup.py` not correctly detecting incompatible `setuptools`/`wheel` pkgs.
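A minimal sketch of both highlights, assuming the MLX path is selected with `backend=BACKEND.MLX` at load time and that `lm_head` quantization is toggled by an `lm_head` flag on `QuantizeConfig` (the flag name and repo id are assumptions):

```python
from gptqmodel import GPTQModel, BACKEND, QuantizeConfig

# Inference: runtime conversion + execution on Apple's MLX framework
# (Apple Silicon, mlx package installed).
mlx_model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit",  # placeholder quantized repo id
    backend=BACKEND.MLX,
)

# Quantization: include the lm_head module (assumed flag name per the release note).
quant_config = QuantizeConfig(bits=4, group_size=128, lm_head=True)
```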
- [CI] run tests with linux tag by @CSY-ModelCloud in #1067
- Add backend.MLX by @LRL-ModelCloud in #1061
- add mlx generate test by @CL-ModelCloud in #1069
- [CI] upload source in build step by @CSY-ModelCloud in #1070
- code review by @CL-ModelCloud in #1072
- [CI] install mlx by @CSY-ModelCloud in #1071
- Add option to quantize `lm_head` by @ZX-ModelCloud in #1037
- fix test_packing by @LRL-ModelCloud in #1073
- [CI] add mlx test by @CSY-ModelCloud in #1074
- [CI] fix ci relase env name by @CSY-ModelCloud in #1078
- update mlx test by @CSY-ModelCloud in #1079
- convert to mlx support desc_act true by @LRL-ModelCloud in #1082
- [CI] add extra-index-url for pip install by @CSY-ModelCloud in #1083
- catch module error for setup.py by @CSY-ModelCloud in #1084
Full Changelog: v1.6.1...v1.7.0
GPTQModel v1.6.1
What's Changed
🎉 New OpenAI api compatible end-point via `model.serve(host, port)` (see the sketch below).
⚡ Auto-enable flash-attention2 for inference.
🐛 Fixed `sym=False` loading regression.
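A minimal sketch of the new end-point, assuming `serve()` blocks and speaks the OpenAI HTTP schema as stated above (repo id, host, and port are placeholders):

```python
from gptqmodel import GPTQModel

model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit")  # placeholder repo id
model.serve(host="127.0.0.1", port=8000)  # serves OpenAI-compatible requests until interrupted
```

Any OpenAI-compatible client pointed at http://127.0.0.1:8000 should then be able to issue requests against the quantized model.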
- code opt by @CL-ModelCloud in #1038
- fix marlin validate rocm & do validate() if backend not AUTO by @CSY-ModelCloud in #1040
- add global rocm check by @CSY-ModelCloud in #1043
- [FIX] pass sym to make_quant by @LRL-ModelCloud in #1046
- enable flash attn for loading quantized by @CSY-ModelCloud in #1045
- add flash_attn2 test by @CSY-ModelCloud in #1047
- enable flash_attention only when device is cuda by @CSY-ModelCloud in #1050
- move flash attn test to correct folder by @CSY-ModelCloud in #1052
- Expose openai server api by @CL-ModelCloud in #1048
- update openai server by @CL-ModelCloud in #1058
- don't download whl for xpu env by @CSY-ModelCloud in #1059
- remove build tag for normal release by @CSY-ModelCloud in #1063
- disable flash attn 2 for internlm by @CSY-ModelCloud in #1065
Full Changelog: v1.6.0...v1.6.1