Describe the bug
Running speech_llm/modular_audio_gpt_train.py with freeze_audio_encoder: False results in a runtime error during trainer.fit:
```
RuntimeError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
```
This occurs because the optimizer state restoration assumes a fixed parameter group structure, which changes when the audio encoder is not frozen.
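For context, the same mismatch can be reproduced outside NeMo with plain PyTorch. The following is a minimal, hypothetical sketch (not NeMo code; the toy model stands in for the frozen/unfrozen audio encoder) of how an optimizer state dict saved while a module was frozen fails to load once that module's parameters become trainable:

```python
# Minimal sketch (plain PyTorch, not NeMo code): an optimizer state dict saved
# while some parameters were frozen no longer matches the parameter groups of
# an optimizer built after those parameters are unfrozen.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# "Checkpoint" created with the first layer (stand-in for the audio encoder) frozen.
for p in model[0].parameters():
    p.requires_grad = False
opt = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
saved_opt_state = opt.state_dict()

# New run: the layer is unfrozen, so the optimizer's single param group is larger.
for p in model[0].parameters():
    p.requires_grad = True
opt = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)

try:
    opt.load_state_dict(saved_opt_state)
except (ValueError, RuntimeError) as e:
    # Prints the same "parameter group that doesn't match the size of
    # optimizer's group" message as the error above.
    print(e)
```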
Steps/Code to reproduce bug
Clone the NeMo repo (main branch) and run the training example with the audio encoder unfrozen:

```bash
/path/to/NeMo/examples/multimodal/speech_llm/modular_audio_gpt_train.py \
    model.freeze_audio_encoder=False \
    model.freeze_llm=True \
    model.freeze_modality_adapter=False \
    model.global_batch_size=4 \
    model.micro_batch_size=2 \
    model.pretrained_audio_model=/path/to/stt_en_fastconformer_transducer_large.nemo \
    model.restore_from_path=/path/to/megatron_gpt_345m.nemo \
    trainer.val_check_interval=1
```

The error is raised immediately during trainer.fit(model), right after checkpoint loading/initialization.
Expected behavior
Training should proceed without requiring manual modifications to Lightning’s internals; optimizer state restoration should depend on the configured set of trainable parameters.
Environment overview (please complete the following information)
- Method of NeMo install: git clone
- No docker
Environment details
If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:
- OS version: Ubuntu 22.04.1
- PyTorch version: 2.5.1
- Python version: 3.10.12
Workaround
Comment out the optimizer/LR-scheduler restore in Lightning's checkpoint connector:

File: .../site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py
Line 297:

```python
# self.restore_optimizers_and_lr_schedulers(...)
```

After this modification, I was able to successfully overfit a single audio sample, confirming that training works with freeze_audio_encoder=False.
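As a less invasive alternative to editing site-packages, the same effect can likely be achieved by monkey-patching the connector from the training script. This is an untested sketch; the class name _CheckpointConnector is assumed from the Lightning 2.x file referenced above, so verify it against your installed version:

```python
# Untested sketch: equivalent to commenting out the call above, but applied at
# runtime instead of editing the installed Lightning sources. It turns the
# optimizer/LR-scheduler restore into a no-op, so training resumes with fresh
# optimizer state. Class/method names assume Lightning 2.x; verify locally.
from lightning.pytorch.trainer.connectors.checkpoint_connector import _CheckpointConnector

_CheckpointConnector.restore_optimizers_and_lr_schedulers = lambda self: None
```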