
_quantize_param_fragment_impl gives "NameError" #12638

Open
edyin opened this issue Mar 17, 2025 · 0 comments
Labels
bug Something isn't working

Comments


edyin commented Mar 17, 2025

Describe the bug

When following this benchmark instruction and using the fp8 option, I get the following error:

[rank23]:   File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1341, in optimizer_step
[rank23]:     super().optimizer_step(*args, **kwargs)
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
[rank23]:     optimizer.step(closure=optimizer_closure)
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/optimizer.py", line 153, in step
[rank23]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank23]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
[rank23]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank23]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
[rank23]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/amp.py", line 75, in optimizer_step
[rank23]:     return super().optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
[rank23]:     return optimizer.step(closure=closure, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/lr_scheduler.py", line 140, in wrapper
[rank23]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/torch/optim/optimizer.py", line 494, in wrapper
[rank23]:     out = func(*args, **kwargs)
[rank23]:           ^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2402, in step
[rank23]:     self._local_step(first_bucket_ids)
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/apex/contrib/optimizers/distributed_fused_adam.py", line 2532, in _local_step
[rank23]:     self._check_params_shard_dtypes(
[rank23]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank23]:     return func(*args, **kwargs)
[rank23]:            ^^^^^^^^^^^^^^^^^^^^^
[rank23]:   File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 738, in _check_params_shard_dtypes
[rank23]:     quantize_param_fragment(
[rank23]:   File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 121, in quantize_param_fragment
[rank23]:     _quantize_param_fragment_impl(input_, out=out, param=param)
[rank23]:   File "/opt/NeMo/nemo/core/optim/distributed_adam.py", line 83, in _quantize_param_fragment_impl
[rank23]:     src.view(1, -1),
[rank23]:     ^^^
[rank23]: NameError: name 'src' is not defined

Steps/Code to reproduce bug
Following these instructions: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/dgxc-benchmarking/resources/nemotron15b-dgxc-benchmarking-b

the bug can be reproduced by running training with fp8.

The code causing the problem is:
470243c#diff-9d28dcb461bdb37dfaafbb51b827d6a9e51865afb7654d96417fe695fba22d0cR83

def _quantize_param_fragment_impl(
    input_: torch.Tensor,
    *,
    out: torch.Tensor,
    param: torch.nn.Parameter,
) -> None:
    cast_to_fp8(
        src.view(1, -1),  # `src` is not defined anywhere; the parameter is named `input_`
        param._fp8_meta["scaling_fwd"],
        param._fp8_meta_index,
        param._fp8_dtype,
        out=dst.view(1, -1),  # `dst` is not defined either; the parameter is named `out`
    )
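
For reference, a minimal sketch of what the fix presumably looks like, assuming the intent is simply that cast_to_fp8 should read from the input_ argument and write into out (mirroring how quantize_param_fragment calls this helper in the traceback above). This is my guess at the fix, not a confirmed patch:

def _quantize_param_fragment_impl(
    input_: torch.Tensor,
    *,
    out: torch.Tensor,
    param: torch.nn.Parameter,
) -> None:
    # Cast the parameter fragment to FP8 using the parameter's FP8 metadata,
    # writing the result into the pre-allocated output fragment.
    cast_to_fp8(
        input_.view(1, -1),   # was: src.view(1, -1)
        param._fp8_meta["scaling_fwd"],
        param._fp8_meta_index,
        param._fp8_dtype,
        out=out.view(1, -1),  # was: dst.view(1, -1)
    )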

Expected behavior

The training step should complete without errors; instead, it fails with the NameError shown in the traceback above.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of NeMo install: Docker
  • If method of install is [Docker]:

    srun bash -c "enroot import --output ${STAGE_PATH}/nvidia+nemo+24.09.sqsh docker://nvcr.io#nvidia/nemo:24.09"

    We also tried the latest image: nvcr.io/nvidia/nemo:25.02. The job was also executed following the instructions:

    srun \
      --container-image "$IMAGE" \
      --container-mounts "$RESULT_DIR,$INDEX_MAPPING_DIR,$STAGE_PATH/cfg:/cfg,$STAGE_PATH/configure.sh:/gsw/configure.sh" \
      --container-writable \
      --no-container-mount-home \
      bash -c "source /gsw/configure.sh && launch"

Environment details

If an NVIDIA Docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

GPU: 32× H100

edyin added the bug label Mar 17, 2025