
_quantize_param_fragment_impl gives "NameError" #12638

Open
edyin opened this issue Mar 17, 2025 · 0 comments
Labels
bug Something isn't working

Comments


edyin commented Mar 17, 2025

Describe the bug

When following this benchmark instruction and using the fp8 option, I get the following error:

[rank23]:   File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1341, in optimizer_step
[rank23]:     super().optimizer_step(*args, **kwargs)
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
[rank23]:     optimizer.step(closure=optimizer_closure)
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/optimizer.py", line 153, in step
[rank23]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank23]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
[rank23]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank23]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
[rank23]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/amp.py", line 75, in optimizer_step
[rank23]:     return super().optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
[rank23]:     return optimizer.step(closure=closure, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/lr_scheduler.py", line 140, in wrapper
[rank23]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/optimizer.py", line 494, in wrapper
[rank23]:     out = func(*args, **kwargs)
[rank23]:           ^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2402, in step
[rank23]:     self._local_step(first_bucket_ids)
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2532, in _local_step
[rank23]:     self._check_params_shard_dtypes(
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank23]:     return func(*args, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 738, in _check_params_shard_dtypes
[rank23]:     quantize_param_fragment(
[rank23]:   File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 121, in quantize_param_fragment
[rank23]:     _quantize_param_fragment_impl(input_, out=out, param=param)
[rank23]:   File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 83, in _quantize_param_fragment_impl
[rank23]:     src.view(1, -1),
[rank23]:     ^^^
[rank23]: NameError: name 'src' is not defined

Steps/Code to reproduce bug
Following these instructions: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/nemotron15b-dgxc-benchmarking-b

the bug can be reproduced by running training with fp8.

The code causing the problem is:
470243c#diff-9d28dcb461bdb37dfaafbb51b827d6a9e51865afb7654d96417fe695fba22d0cR83

def _quantize_param_fragment_impl(
    input_: torch.Tensor,
    *,
    out: torch.Tensor,
    param: torch.nn.Parameter,
) -> None:
    cast_to_fp8(
        src.view(1, -1),  # `src` is not defined anywhere; the parameter is named `input_`
        param._fp8_meta["scaling_fwd"],
        param._fp8_meta_index,
        param._fp8_dtype,
        out=dst.view(1, -1),  # `dst` is not defined either; the parameter is named `out`
    )
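
For reference, a minimal sketch of what the fix presumably looks like, assuming the intent is simply that cast_to_fp8 should read from the input_ argument and write into out (mirroring how quantize_param_fragment calls this helper in the traceback above). This is my guess at the fix, not a confirmed patch:

def _quantize_param_fragment_impl(
    input_: torch.Tensor,
    *,
    out: torch.Tensor,
    param: torch.nn.Parameter,
) -> None:
    # Cast the parameter fragment to FP8 using the parameter's FP8 metadata,
    # writing the result into the pre-allocated output fragment.
    cast_to_fp8(
        input_.view(1, -1),   # was: src.view(1, -1)
        param._fp8_meta["scaling_fwd"],
        param._fp8_meta_index,
        param._fp8_dtype,
        out=out.view(1, -1),  # was: dst.view(1, -1)
    )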

Expected behavior

The training step should complete without errors; instead, it fails with the NameError shown in the traceback above.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: Docker
  • If method of install is [Docker]:

    srun bash -c "enroot import --output ${STAGE_PATH}/nvidia+nemo+24.09.sqsh docker://nvcr.io#nvidia/nemo:24.09"

    We also tried the latest image: nvcr.io/nvidia/nemo:25.02. The job was also executed following the instructions:

    srun \
      --container-image "$IMAGE" \
      --container-mounts "$RESULT_DIR,$INDEX_MAPPING_DIR,$STAGE_PATH/cfg:/cfg,$STAGE_PATH/configure.sh:/gsw/configure.sh" \
      --container-writable \
      --no-container-mount-home \
      bash -c "source /gsw/configure.sh && launch"

Environment details

If an NVIDIA Docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

GPU: 32× H100

edyin added the bug label Mar 17, 2025