Describe the bug
When using multiple nodes I am getting OSError: [Errno 12] Cannot allocate memory while checkpointing (training works fine). I don't see this on the nemo:24.12 container, but when I change it to nemo:25.02 I get the error, even with model.data.num_workers=0. It works fine on one node.
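For reference, that override is passed on the training command as a standard Hydra-style argument; the script path below is illustrative rather than the exact one from my job:

python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    trainer.num_nodes=2 trainer.devices=8 \
    model.data.num_workers=0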
Steps/Code to reproduce bug
Run the following slurm script after filling in the appropriate HF/wandb keys and pointing it at a dataset, and confirm it works on 24.12.
SLURM JOB
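The full script is in the collapsed SLURM JOB block above; as a rough sketch (not the original script), the multi-node launch has this shape, with placeholder paths, keys, and resource counts:

#!/bin/bash
# Placeholder resource request: 2 nodes x 8 H100s = 16 GPUs
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

export HF_TOKEN=hf_xxx             # fill in your HF token
export WANDB_API_KEY=wandb_xxx     # fill in your wandb key

# Launch inside the NeMo container; switching the tag to 25.02 is the only
# change needed to reproduce the failure. Training script path and config
# overrides below are illustrative.
srun --container-image='nvcr.io#nvidia/nemo:24.12' \
     --container-mounts=/path/to/dataset:/data \
     python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
         trainer.num_nodes=2 trainer.devices=8 \
         model.data.num_workers=0 \
         'model.data.data_prefix=[1.0,/data/my_dataset_text_document]' \
         exp_manager.create_wandb_logger=True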
Switch to the nemo:25.02 container and expect to see the cannot-allocate-memory error while checkpointing after 2 steps (see below for the full traceback).
Traceback
Environment overview (please complete the following information)
srun --container-image='nvcr.io#nvidia/nemo:25.02'
Additional context
GPU model - 16x H100s