How to quantize the Qwen2.5-VL-3B model? #979

Open
Gusha-nye opened this issue Mar 20, 2025 · 1 comment

Comments

@Gusha-nye

Hi everyone!
I recently wanted to quantize a Qwen2.5-VL-3B model and deploy it locally. I tried to quantize it following the sglang guide (https://docs.sglang.ai/backend/quantization.html), but it failed with the following error:

[Image: screenshot of the error]

It seems that sglang can only quantize text-only chat models (e.g. Qwen2.5-3B)? I was able to quantize the Qwen2.5-3B model successfully.
Is there any way to quantize the Qwen2.5-VL-3B model?
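For reference, this is roughly the flow from the sglang guide that worked for me on the text-only model. The model ID, output path, and the FP8_DYNAMIC scheme below are just the guide's example, not exactly my script:

```python
# Rough sketch of the llm-compressor flow from the sglang quantization guide.
# Worked for the text-only Qwen2.5-3B; model ID and output path are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
SAVE_DIR = "Qwen2.5-3B-Instruct-FP8-Dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Dynamic FP8 quantization of all Linear layers; no calibration data is needed,
# so oneshot() can run without a dataset.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```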

@Gusha-nye
Author

Now that I have changed the code, the model loads successfully, but it fails with another error (full log below):

(sglangEnv) (base) ubuntu@ubun:~/Song/sglang$ python Quantization.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.43it/s]
Model Load finished
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Tokenizer Load finished
2025-03-20T16:59:08.550688+0800 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2025-03-20T16:59:08.551472+0800 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output/runs/Mar20_16-59-08_ubun,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=./output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./output,
run_stages=False,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=SaveStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2025-03-20T16:59:08.858207+0800 | _check_create_state | INFO - State created for compression lifecycle
2025-03-20T16:59:08.858472+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2025-03-20T16:59:08.858533+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2025-03-20T16:59:08.858563+0800 | populate_datasets | INFO - Running oneshot without calibration data. This is expected for weight-only and dynamic quantization
2025-03-20T16:59:08.861435+0800 | one_shot | INFO - *** One Shot ***
2025-03-20T16:59:08.861538+0800 | from_modifiers | INFO - Creating recipe from modifiers
2025-03-20T16:59:08.875834+0800 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/compressed_tensors/utils/offload.py:205: UserWarning: Shape of parameter being updated torch.Size([1280, 26]) does not match shape of update data torch.Size([1280, 27])
warnings.warn(
Traceback (most recent call last):
File "/home/ubuntu/Song/sglang/Quantization.py", line 20, in
oneshot(model=model, recipe=recipe)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 94, in oneshot
main(model_args, data_args, recipe_args, training_args)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/transformers/finetune/text_generation.py", line 447, in main
stage_runner.one_shot()
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/transformers/finetune/runner.py", line 168, in one_shot
self.trainer.one_shot(calibration_data=calib_data, stage=stage)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/transformers/finetune/session_mixin.py", line 454, in one_shot
apply(
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/core/session_functions.py", line 184, in apply
return active_session().apply(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/core/session.py", line 212, in apply
self.initialize(**kwargs)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/core/session.py", line 158, in initialize
mod_data = self._lifecycle.initialize(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/core/lifecycle.py", line 126, in initialize
data = mod.initialize(state=self.state, **extras)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/modifiers/stage.py", line 124, in initialize
modifier.initialize(state, **kwargs)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/modifiers/modifier.py", line 118, in initialize
initialized = self.on_initialize(state=state, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 110, in on_initialize
module.apply(update_weight_zp_scale)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1029, in apply
module.apply(fn)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1029, in apply
module.apply(fn)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1029, in apply
module.apply(fn)
[Previous line repeated 2 more times]
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1030, in apply
fn(self)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 123, in update_weight_zp_scale
call_observer(module=module, base_name="weight")
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/calibration.py", line 87, in call_observer
update_parameter_data(module, updated_scale, f"{base_name}scale")
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/compressed_tensors/utils/offload.py", line 155, in update_parameter_data
update_offload_parameter(module, param_name, new_param_data)
File "/home/ubuntu/Song/Env/sglangEnv/lib/python3.12/site-packages/compressed_tensors/utils/offload.py", line 212, in update_offload_parameter
param.data.copy_(data)
RuntimeError: The size of tensor a (26) must match the size of tensor b (27) at non-singleton dimension 1
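The mismatch (26 vs. 27 groups on a 1280-wide dimension) is raised while quantizing the vision tower, whose layer shapes apparently don't divide evenly into the scheme's groups the way the language model's do. One workaround that may be worth trying (I have not verified it against this exact setup) is to restrict the recipe to the language model and skip the visual modules via the `ignore` list:

```python
# Hypothetical sketch: skip the vision tower during quantization.
# The class Qwen2_5_VLForConditionalGeneration requires a recent transformers
# release, and the "re:visual.*" pattern assumes the vision blocks live under
# a module named "visual" -- check model.named_modules() to confirm.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"
SAVE_DIR = "Qwen2.5-VL-3B-Instruct-W4A16"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Weight-only scheme shown as an example; substitute whatever scheme your
# Quantization.py uses. The key part is ignoring lm_head and the vision encoder.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:visual.*"],
)
oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```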
