Error when quantizing LLama 3.3 70b to FP8 #963

Syst3m1cAn0maly · 2024-12-06T23:59:18Z

I get a systematic error when quantizing LLama 3.3 70b to FP8 (static) on 2xH100, it always fails at the 82nd step of calibration with the following error :

Loading checkpoint shards: 100%
30/30 [00:03<00:00, 7.96it/s]
Loading checkpoint shards: 100%
30/30 [01:09<00:00, 1.98s/it]
Map: 100%
512/512 [00:00<00:00, 1513.65 examples/s]
Map: 100%
512/512 [00:01<00:00, 458.48 examples/s]
2024-12-06T23:51:22.108862+0000 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 2, distributed training: True, 16-bits training: False
2024-12-06T23:51:22.110751+0000 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/data/models/Llama-3.3-70B-Instruct-FP8/runs/Dec06_23-51-22_2343050e0892,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
oneshot_device=cuda:0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=/data/models/Llama-3.3-70B-Instruct-FP8,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
recipe=
quant_stage:
quant_modifiers:
QuantizationModifier:
ignore: ["lm_head"]
config_groups:
group_0:
weights:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
input_activations:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
targets: ["Linear"]
,
recipe_args=None,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/data/models/Llama-3.3-70B-Instruct-FP8,
run_stages=False,
save_compressed=True,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2024-12-06T23:51:22.574722+0000 | _check_create_state | INFO - State created for compression lifecycle
2024-12-06T23:51:22.576664+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-06T23:51:22.577698+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-06T23:51:22.634068+0000 | one_shot | INFO - * One Shot *
2024-12-06T23:51:22.701228+0000 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
/opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/session_mixin.py:95: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.init`. Use `processing_class` instead.
super().init(**kwargs)
2024-12-06T23:51:23.012585+0000 | _calibrate | INFO - Running QuantizationModifier calibration with 512 samples...
16%|█▌ | 82/512 [04:17<22:32, 3.14s/it]

RuntimeError Traceback (most recent call last)
Cell In[2], line 76
66 return tokenizer(
67 sample["text"],
68 padding=False,
(...)
71 add_special_tokens=False,
72 )
74 ds = ds.map(tokenize, remove_columns=ds.column_names)
---> 76 oneshot(
77 model=model,
78 output_dir=output_dir,
79 dataset=ds,
80 recipe=recipe,
81 max_seq_length=MAX_SEQUENCE_LENGTH,
82 num_calibration_samples=NUM_CALIBRATION_SAMPLES,
83 save_compressed=True,
84 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/text_generation.py:76, in oneshot(**kwargs)
74 model_args, data_args, training_args = parse_args(**kwargs)
75 training_args.do_oneshot = True
---> 76 main(model_args, data_args, training_args)

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/text_generation.py:363, in main(model_args, data_args, training_args)
361 # One Shot
362 if training_args.do_oneshot:
--> 363 stage_runner.one_shot()
365 # Evaluation
366 if training_args.do_eval:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/runner.py:171, in StageRunner.one_shot(self, stage)
167 self.trainer.model(**dummy_inp)
169 self.trainer.accelerator.wait_for_everyone()
--> 171 self.trainer.one_shot(calibration_data=calib_data, stage=stage)
173 if is_fsdp_model(self.trainer.model):
174 try:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/session_mixin.py:439, in SessionManagerMixIn.one_shot(self, calibration_data, stage)
430 def one_shot(
431 self, calibration_data: Optional[DataLoader] = None, stage: Optional[str] = None
432 ):
433 """
434 Run oneshot calibration on the active model
435
436 :param stage: which stage of the recipe to run, or None to run whole recipe
437 :param calib_data: dataloader of calibration data
438 """
--> 439 apply(
440 recipe=self.recipe,
441 recipe_stage=stage,
442 recipe_args=self.recipe_args,
443 model=self.model,
444 calib_data=calibration_data,
445 start=-1,
446 copy_data=False,
447 accelerator=self.accelerator,
448 min_tokens_per_module=self.min_tokens_per_module,
449 )
451 # log model sparsity
452 # self.maybe_log_model_sparsification()
453 self.accelerator.wait_for_everyone()

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session_functions.py:184, in apply(recipe, recipe_stage, recipe_args, model, teacher_model, train_data, val_data, test_data, calib_data, copy_data, start, steps_per_epoch, batches_per_step, **kwargs)
146 def apply(
147 recipe: Union[str, List[str], "Recipe", List["Recipe"], None] = None,
148 recipe_stage: Union[str, List[str], None] = None,
(...)
160 **kwargs,
161 ) -> ModifiedState:
162 """
163 A method to apply the recipe in one-shot manner. This will invoke the initialize
164 and then finalize methods for each modifier in the active session's lifecycle.
(...)
182 :return: the modified state of the active session after applying the recipe
183 """
--> 184 return active_session().apply(
185 recipe=recipe,
186 recipe_stage=recipe_stage,
187 recipe_args=recipe_args,
188 model=model,
189 teacher_model=teacher_model,
190 train_data=train_data,
191 val_data=val_data,
192 test_data=test_data,
193 calib_data=calib_data,
194 copy_data=copy_data,
195 start=start,
196 steps_per_epoch=steps_per_epoch,
197 batches_per_step=batches_per_step,
198 **kwargs,
199 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session.py:210, in CompressionSession.apply(self, **kwargs)
201 def apply(self, **kwargs):
202 """
203 Apply the recipe in one-shot manner. This will invoke the initialize
204 and then finalize methods for each modifier in the session's lifecycle.
(...)
208 finalize methods
209 """
--> 210 self.initialize(**kwargs)
212 return self.finalize(**kwargs)

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session.py:156, in CompressionSession.initialize(self, recipe, recipe_stage, recipe_args, model, teacher_model, optimizer, attach_optim_callbacks, train_data, val_data, test_data, calib_data, copy_data, start, steps_per_epoch, batches_per_step, loggers, **kwargs)
105 def initialize(
106 self,
107 recipe: Union[str, List[str], "Recipe", List["Recipe"], None] = None,
(...)
123 **kwargs,
124 ) -> ModifiedState:
125 """
126 Initialize the session for compression. This will run the initialize method
127 for each modifier in the session's lifecycle. This will also set the session's
(...)
153 :return: the modified state of the session after initializing
154 """
--> 156 mod_data = self._lifecycle.initialize(
157 recipe=recipe,
158 recipe_stage=recipe_stage,
159 recipe_args=recipe_args,
160 model=model,
161 teacher_model=teacher_model,
162 optimizer=optimizer,
163 attach_optim_callbacks=attach_optim_callbacks,
164 train_data=train_data,
165 val_data=val_data,
166 test_data=test_data,
167 calib_data=calib_data,
168 copy_data=copy_data,
169 start=start,
170 steps_per_epoch=steps_per_epoch,
171 batches_per_step=batches_per_step,
172 loggers=loggers,
173 **kwargs,
174 )
176 return ModifiedState(
177 model=self.state.model,
178 optimizer=self.state.optimizer,
179 loss=self.state.loss,
180 modifier_data=mod_data,
181 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/lifecycle.py:126, in CompressionLifecycle.initialize(self, **kwargs)
124 mod_data = []
125 for mod in self.modifiers:
--> 126 data = mod.initialize(state=self.state, **extras)
127 logger.debug("Initialized modifier: {}", mod)
128 if data is not None:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/stage.py:124, in StageModifiers.initialize(self, state, **kwargs)
122 accelerator = kwargs.get("accelerator", None)
123 for modifier in self.modifiers:
--> 124 modifier.initialize(state, **kwargs)
125 if accelerator:
126 accelerator.wait_for_everyone()

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/modifier.py:119, in Modifier.initialize(self, state, **kwargs)
113 if (
114 self.calculate_end() >= 0
115 and state.start_event.current_index >= self.calculate_end()
116 ):
117 return
--> 119 initialized = self.on_initialize(state=state, **kwargs)
121 if not isinstance(initialized, bool):
122 raise ValueError(
123 "on_initialize must return a boolean value; "
124 "True for success, False for not initialized"
125 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:105, in QuantizationModifier.on_initialize(self, state, **kwargs)
103 module.apply(apply_calibration_status)
104 self.calibration_hooks_ = []
--> 105 self._calibrate_if_possible(module)
106 self._check_token_distribution(
107 module, threshold=kwargs.get("min_tokens_per_module")
108 )
109 module.apply(freeze_module_quantization)

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:268, in QuantizationModifier._calibrate_if_possible(self, module)
266 module.apply(lambda model: initialize_observer(model, base_name="output"))
267 module.apply(self.register_calibration_hooks)
--> 268 self.calibrate(module)
269 module.apply(set_unset_kv_cache)
270 for h in self.calibration_hooks:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:325, in QuantizationModifier.calibrate(self, module)
322 module_training = module.training
323 module.eval()
--> 325 run_calibration_forward(
326 module,
327 self.calibration_dataloader,
328 self.num_calibration_steps,
329 self.calibration_function_,
330 )
332 if module_training:
333 module.train()

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py:105, in run_calibration_forward(model, calibration_dataloader, num_calibration_steps, calibration_function, device, mask_padding)
101 intermediates.append((e.args, e.kwargs))
103 # TODO: not ideal, figure out where we aren't freeing memory instead
104 # currently without this we run OOM on the 2nd forward pass
--> 105 torch.cuda.empty_cache()
107 return intermediates

File /opt/conda/lib/python3.11/site-packages/torch/cuda/memory.py:170, in empty_cache()
159 r"""Release all unoccupied cached memory currently held by the caching
160 allocator so that those can be used in other GPU application and visible in
161 nvidia-smi.
(...)
167 more details about GPU memory management.
168 """
169 if is_initialized():
--> 170 torch._C._cuda_emptyCache()

RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Additional context
Add any other context about the problem here. Also include any relevant files.

The text was updated successfully, but these errors were encountered:

Kha-Zix-1 · 2024-12-07T02:33:42Z

I get a systematic error when quantizing LLama 3.3 70b to FP8 (static) on 2xH100, it always fails at the 82nd step of calibration with the following error :

Loading checkpoint shards: 100%

30/30 [00:03<00:00, 7.96it/s]
Loading checkpoint shards: 100%
30/30 [01:09<00:00, 1.98s/it]
Map: 100%
512/512 [00:00<00:00, 1513.65 examples/s]
Map: 100%
512/512 [00:01<00:00, 458.48 examples/s]
2024-12-06T23:51:22.108862+0000 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 2, distributed training: True, 16-bits training: False
2024-12-06T23:51:22.110751+0000 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_for_metrics=[],
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/data/models/Llama-3.3-70B-Instruct-FP8/runs/Dec06_23-51-22_2343050e0892,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
oneshot_device=cuda:0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=/data/models/Llama-3.3-70B-Instruct-FP8,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
recipe=
quant_stage:
quant_modifiers:
QuantizationModifier:
ignore: ["lm_head"]
config_groups:
group_0:
weights:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
input_activations:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
targets: ["Linear"]
,
recipe_args=None,
remove_unused_columns=True,
report_to=[],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=/data/models/Llama-3.3-70B-Instruct-FP8,
run_stages=False,
save_compressed=True,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2024-12-06T23:51:22.574722+0000 | _check_create_state | INFO - State created for compression lifecycle
2024-12-06T23:51:22.576664+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-06T23:51:22.577698+0000 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-12-06T23:51:22.634068+0000 | one_shot | INFO - *** One Shot ***
2024-12-06T23:51:22.701228+0000 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
/opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/session_mixin.py:95: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Trainer.__init__. Use processing_class instead.
super().init(**kwargs)
2024-12-06T23:51:23.012585+0000 | _calibrate | INFO - Running QuantizationModifier calibration with 512 samples...
16%|█▌ | 82/512 [04:17<22:32, 3.14s/it]
RuntimeError Traceback (most recent call last) Cell In[2], line 76 66 return tokenizer( 67 sample["text"], 68 padding=False, (...) 71 add_special_tokens=False, 72 ) 74 ds = ds.map(tokenize, remove_columns=ds.column_names) ---> 76 oneshot( 77 model=model, 78 output_dir=output_dir, 79 dataset=ds, 80 recipe=recipe, 81 max_seq_length=MAX_SEQUENCE_LENGTH, 82 num_calibration_samples=NUM_CALIBRATION_SAMPLES, 83 save_compressed=True, 84 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/text_generation.py:76, in oneshot(**kwargs) 74 model_args, data_args, training_args = parse_args(**kwargs) 75 training_args.do_oneshot = True ---> 76 main(model_args, data_args, training_args)

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/text_generation.py:363, in main(model_args, data_args, training_args) 361 # One Shot 362 if training_args.do_oneshot: --> 363 stage_runner.one_shot() 365 # Evaluation 366 if training_args.do_eval:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/runner.py:171, in StageRunner.one_shot(self, stage) 167 self.trainer.model(**dummy_inp) 169 self.trainer.accelerator.wait_for_everyone() --> 171 self.trainer.one_shot(calibration_data=calib_data, stage=stage) 173 if is_fsdp_model(self.trainer.model): 174 try:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/transformers/finetune/session_mixin.py:439, in SessionManagerMixIn.one_shot(self, calibration_data, stage) 430 def one_shot( 431 self, calibration_data: Optional[DataLoader] = None, stage: Optional[str] = None 432 ): 433 """ 434 Run oneshot calibration on the active model 435 436 :param stage: which stage of the recipe to run, or None to run whole recipe 437 :param calib_data: dataloader of calibration data 438 """ --> 439 apply( 440 recipe=self.recipe, 441 recipe_stage=stage, 442 recipe_args=self.recipe_args, 443 model=self.model, 444 calib_data=calibration_data, 445 start=-1, 446 copy_data=False, 447 accelerator=self.accelerator, 448 min_tokens_per_module=self.min_tokens_per_module, 449 ) 451 # log model sparsity 452 # self.maybe_log_model_sparsification() 453 self.accelerator.wait_for_everyone()

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session_functions.py:184, in apply(recipe, recipe_stage, recipe_args, model, teacher_model, train_data, val_data, test_data, calib_data, copy_data, start, steps_per_epoch, batches_per_step, **kwargs) 146 def apply( 147 recipe: Union[str, List[str], "Recipe", List["Recipe"], None] = None, 148 recipe_stage: Union[str, List[str], None] = None, (...) 160 **kwargs, 161 ) -> ModifiedState: 162 """ 163 A method to apply the recipe in one-shot manner. This will invoke the initialize 164 and then finalize methods for each modifier in the active session's lifecycle. (...) 182 :return: the modified state of the active session after applying the recipe 183 """ --> 184 return active_session().apply( 185 recipe=recipe, 186 recipe_stage=recipe_stage, 187 recipe_args=recipe_args, 188 model=model, 189 teacher_model=teacher_model, 190 train_data=train_data, 191 val_data=val_data, 192 test_data=test_data, 193 calib_data=calib_data, 194 copy_data=copy_data, 195 start=start, 196 steps_per_epoch=steps_per_epoch, 197 batches_per_step=batches_per_step, 198 **kwargs, 199 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session.py:210, in CompressionSession.apply(self, **kwargs) 201 def apply(self, **kwargs): 202 """ 203 Apply the recipe in one-shot manner. This will invoke the initialize 204 and then finalize methods for each modifier in the session's lifecycle. (...) 208 finalize methods 209 """ --> 210 self.initialize(**kwargs) 212 return self.finalize(**kwargs)

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/session.py:156, in CompressionSession.initialize(self, recipe, recipe_stage, recipe_args, model, teacher_model, optimizer, attach_optim_callbacks, train_data, val_data, test_data, calib_data, copy_data, start, steps_per_epoch, batches_per_step, loggers, **kwargs) 105 def initialize( 106 self, 107 recipe: Union[str, List[str], "Recipe", List["Recipe"], None] = None, (...) 123 **kwargs, 124 ) -> ModifiedState: 125 """ 126 Initialize the session for compression. This will run the initialize method 127 for each modifier in the session's lifecycle. This will also set the session's (...) 153 :return: the modified state of the session after initializing 154 """ --> 156 mod_data = self._lifecycle.initialize( 157 recipe=recipe, 158 recipe_stage=recipe_stage, 159 recipe_args=recipe_args, 160 model=model, 161 teacher_model=teacher_model, 162 optimizer=optimizer, 163 attach_optim_callbacks=attach_optim_callbacks, 164 train_data=train_data, 165 val_data=val_data, 166 test_data=test_data, 167 calib_data=calib_data, 168 copy_data=copy_data, 169 start=start, 170 steps_per_epoch=steps_per_epoch, 171 batches_per_step=batches_per_step, 172 loggers=loggers, 173 **kwargs, 174 ) 176 return ModifiedState( 177 model=self.state.model, 178 optimizer=self.state.optimizer, 179 loss=self.state.loss, 180 modifier_data=mod_data, 181 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/core/lifecycle.py:126, in CompressionLifecycle.initialize(self, **kwargs) 124 mod_data = [] 125 for mod in self.modifiers: --> 126 data = mod.initialize(state=self.state, **extras) 127 logger.debug("Initialized modifier: {}", mod) 128 if data is not None:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/stage.py:124, in StageModifiers.initialize(self, state, **kwargs) 122 accelerator = kwargs.get("accelerator", None) 123 for modifier in self.modifiers: --> 124 modifier.initialize(state, **kwargs) 125 if accelerator: 126 accelerator.wait_for_everyone()

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/modifier.py:119, in Modifier.initialize(self, state, **kwargs) 113 if ( 114 self.calculate_end() >= 0 115 and state.start_event.current_index >= self.calculate_end() 116 ): 117 return --> 119 initialized = self.on_initialize(state=state, **kwargs) 121 if not isinstance(initialized, bool): 122 raise ValueError( 123 "on_initialize must return a boolean value; " 124 "True for success, False for not initialized" 125 )

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:105, in QuantizationModifier.on_initialize(self, state, **kwargs) 103 module.apply(apply_calibration_status) 104 self.calibration_hooks_ = [] --> 105 self._calibrate_if_possible(module) 106 self._check_token_distribution( 107 module, threshold=kwargs.get("min_tokens_per_module") 108 ) 109 module.apply(freeze_module_quantization)

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:268, in QuantizationModifier._calibrate_if_possible(self, module) 266 module.apply(lambda model: initialize_observer(model, base_name="output")) 267 module.apply(self.register_calibration_hooks) --> 268 self.calibrate(module) 269 module.apply(set_unset_kv_cache) 270 for h in self.calibration_hooks:

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/quantization/quantization/base.py:325, in QuantizationModifier.calibrate(self, module) 322 module_training = module.training 323 module.eval() --> 325 run_calibration_forward( 326 module, 327 self.calibration_dataloader, 328 self.num_calibration_steps, 329 self.calibration_function_, 330 ) 332 if module_training: 333 module.train()

File /opt/conda/lib/python3.11/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py:105, in run_calibration_forward(model, calibration_dataloader, num_calibration_steps, calibration_function, device, mask_padding) 101 intermediates.append((e.args, e.kwargs)) 103 # TODO: not ideal, figure out where we aren't freeing memory instead 104 # currently without this we run OOM on the 2nd forward pass --> 105 torch.cuda.empty_cache() 107 return intermediates

File /opt/conda/lib/python3.11/site-packages/torch/cuda/memory.py:170, in empty_cache() 159 r"""Release all unoccupied cached memory currently held by the caching 160 allocator so that those can be used in other GPU application and visible in 161 nvidia-smi. (...) 167 more details about GPU memory management. 168 """ 169 if is_initialized(): --> 170 torch._C._cuda_emptyCache()

RuntimeError: CUDA error: unspecified launch failure CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Additional context Add any other context about the problem here. Also include any relevant files.

Hello! @Syst3m1cAn0maly I have not encountered your bug yet, but I have encountered a similar bug.

When I use the SmoothQuantModifier, the calibration process is always stuck at the 61st data. I delete the torch.cuda.empty_cache() in llmcompressor/modifiers/utils/pytorch_helpers.py run_calibration_forward and the bug is fixed. Maybe it should help?

Syst3m1cAn0maly · 2024-12-07T15:46:53Z

Thank you @Kha-Zix-1 it works with the workaround you provided.
It probably needs a proper fix.

Details: vllm-project#963

Syst3m1cAn0maly added the bug Something isn't working label Dec 6, 2024

Tinsane pushed a commit to Tinsane/llm-compressor that referenced this issue Dec 11, 2024

Fix issue with cuda.

d92a609

Details: vllm-project#963

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error when quantizing LLama 3.3 70b to FP8 #963

Error when quantizing LLama 3.3 70b to FP8 #963

Syst3m1cAn0maly commented Dec 6, 2024

Kha-Zix-1 commented Dec 7, 2024

Loading checkpoint shards: 100%

Syst3m1cAn0maly commented Dec 7, 2024

Error when quantizing LLama 3.3 70b to FP8 #963

Error when quantizing LLama 3.3 70b to FP8 #963

Comments

Syst3m1cAn0maly commented Dec 6, 2024

Kha-Zix-1 commented Dec 7, 2024

Loading checkpoint shards: 100%

Syst3m1cAn0maly commented Dec 7, 2024