[BUG] Cannot allocate memory while multinode checkpointing on NeMo 25 container #12637

Stillerman opened this issue Mar 17, 2025 · 0 comments

Describe the bug

When training on multiple nodes, I get OSError: [Errno 12] Cannot allocate memory during checkpointing (training itself runs fine). I do not see this on the nemo:24.12 container, but as soon as I switch to nemo:25.02 the error appears, even with model.data.num_workers=0. Single-node runs are unaffected.
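
Since the failure is a host-side ENOMEM raised by os.fork() (see the traceback below) rather than a GPU OOM, it may be worth checking how much memory the kernel will let each compute node commit. A minimal diagnostic sketch, run outside the container (the srun flags and node count here are just illustrative for this 2-node allocation):

srun --nodes=2 --ntasks-per-node=1 bash -c '
  hostname
  free -g                                 # total / available host RAM in GiB
  cat /proc/sys/vm/overcommit_memory      # 2 = strict commit accounting
  grep -E "CommitLimit|Committed_AS" /proc/meminfo
'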

Steps/Code to reproduce bug

Run the following Slurm script (after filling in your HF/W&B keys and pointing it at a dataset) and confirm that it works on 24.12.

SLURM JOB
#!/bin/bash
#SBATCH --job-name=NEMO24WT
#SBATCH --output=/fsx/jason/MEGATRON_EXPS/logs/NEMO24WT-%j.out
#SBATCH --gpus-per-node=8
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH --qos=high
#SBATCH --mem-per-gpu=248g

#### SETTINGS ####

MODEL="/fsx/jason/Minitron/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo"

# Can change these to accommodate resources:

DEVICES=8
TENSOR_PARALLEL_SIZE=4
NODES=$SLURM_NNODES
MICRO_BATCH_SIZE=4

# Don't change the following:

EXPERIMENT_DIR="/fsx/jason/MEGATRON_EXPS/outputs"
EXPERIMENT_NAME="NEMO24WT"

DATA_TRAIN='/fsx/jason/workspace/wikitext_tokenized_train_text_document'
DATA_VAL='/fsx/jason/workspace/wikitext_tokenized_val_text_document'
DATA_TEST='/fsx/jason/workspace/wikitext_tokenized_test_text_document'

DATA_SETTINGS="model.data.data_prefix={train:[1.0,$DATA_TRAIN],validation:[$DATA_VAL],test:[$DATA_TEST]}"

SEQ_LENGTH=4096
STEPS=10
WARMUP_STEPS=800
GLOBAL_BATCH_SIZE=128 # 4096 * 128 = .5M toks GBS

LOG_INTERVAL=1
VAL_INTERVAL=500
NUM_VAL_BATCHES=5
SAVE_TOP_K=1
SAVE_INTERVAL=2

LR=1e-5
MIN_LR=1e-6

HYDRA_FULL_ERROR=1

#### END SETTINGS ####

set -x -e

echo "START TIME: $(date)"


# SLURM stuff
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000
export COUNT_NODE=$SLURM_NNODES

module load cuda/12.4

echo go $COUNT_NODE
echo $HOSTNAMES


# Export environment variables
export CUDA_DEVICE_MAX_CONNECTIONS=1

export LAUNCHER="pip install hf_transfer && TMPDIR=/scratch/nemotmp python -u -m torch.distributed.run \
    --nproc_per_node 8 \
    --nnodes $COUNT_NODE \
    --rdzv-backend c10d \
    --rdzv-endpoint $MASTER_ADDR \
    --rdzv-id $SLURM_JOB_ID \
    --node_rank $SLURM_PROCID \
    --role $SLURMD_NODENAME: \
    --max_restarts 0 \
    --tee 3 \
    "

export CMD="/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    --config-path /opt/NeMo/examples/nlp/language_modeling/conf/ \
    --config-name megatron_llama_distill.yaml \
    \
    name=${EXPERIMENT_NAME} \
    \
    exp_manager.exp_dir=${EXPERIMENT_DIR} \
    exp_manager.checkpoint_callback_params.save_top_k=${SAVE_TOP_K} \
    exp_manager.checkpoint_callback_params.monitor=step \
    exp_manager.checkpoint_callback_params.mode=max \
    +exp_manager.checkpoint_callback_params.every_n_train_steps=${SAVE_INTERVAL} \
    +exp_manager.checkpoint_callback_params.every_n_epochs=null \
    +exp_manager.log_step_timing=True \
    exp_manager.create_wandb_logger=True \
    exp_manager.wandb_logger_kwargs.name=${EXPERIMENT_NAME} \
    exp_manager.wandb_logger_kwargs.project=${EXPERIMENT_NAME} \
    \
    trainer.max_steps=${STEPS} \
    trainer.log_every_n_steps=${LOG_INTERVAL} \
    trainer.val_check_interval=${VAL_INTERVAL} \
    trainer.limit_val_batches=${NUM_VAL_BATCHES} \
    +trainer.num_sanity_val_steps=0 \
    \
    trainer.precision=bf16 \
    trainer.devices=${DEVICES} \
    trainer.num_nodes=${NODES} \
    \
    \"${DATA_SETTINGS}\" \
    model.data.num_workers=0 \
    model.data.seq_length=${SEQ_LENGTH} \
    \
    model.restore_from_path=${MODEL} \
    +model.dist_ckpt_load_strictness=log_all \
    \
    ~model.tokenizer \
    +model.tokenizer.library=huggingface \
    +model.tokenizer.type=meta-llama/Meta-Llama-3.1-8B \
    +model.tokenizer.use_fast=True \
    \
    model.tensor_model_parallel_size=${TENSOR_PARALLEL_SIZE} \
    model.sequence_parallel=True \
    model.micro_batch_size=${MICRO_BATCH_SIZE} \
    model.global_batch_size=${GLOBAL_BATCH_SIZE} \
    \
    model.encoder_seq_length=${SEQ_LENGTH} \
    model.num_layers=32 \
    model.hidden_size=4096 \
    model.ffn_hidden_size=14336 \
    model.num_attention_heads=32 \
    model.hidden_dropout=0.0 \
    model.attention_dropout=0.0 \
    model.apply_query_key_layer_scaling=True \
    model.normalization='rmsnorm' \
    model.bias=False \
    model.activation='fast-swiglu' \
    model.position_embedding_type='rope' \
    model.share_embeddings_and_output_weights=False \
    model.num_query_groups=8 \
    ++model.scale_positional_embedding=True \
    ++model.rotary_base=500000.0 \
    \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=${LR} \
    model.optim.sched.min_lr=${MIN_LR} \
    model.optim.sched.warmup_steps=${WARMUP_STEPS}"

export HF_TOKEN='hf_xxxx'
export WANDB_API_KEY='xxxx'
export TMPDIR=/scratch
srun --container-image='nvcr.io#nvidia/nemo:24.12' --container-mounts=/fsx:/fsx,/scratch:/scratch --no-container-mount-home -u bash -c "$LAUNCHER --node_rank $SLURM_PROCID --role $SLURMD_NODENAME: $CMD"

echo "END TIME: $(date)"

Switch to the nemo:25.02 container and expect to see the cannot-allocate-memory error while checkpointing after 2 steps (see below for the full traceback).
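
Concretely, the only change is the container tag on the srun line:

srun --container-image='nvcr.io#nvidia/nemo:25.02' --container-mounts=/fsx:/fsx,/scratch:/scratch --no-container-mount-home -u bash -c "$LAUNCHER --node_rank $SLURM_PROCID --role $SLURMD_NODENAME: $CMD"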

Traceback
[ip-26-0-170-31:2]:Traceback (most recent call last):
[ip-26-0-170-31:2]:  File "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py", line 66, in main
[ip-26-0-170-31:2]:    trainer.fit(model)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[ip-26-0-170-31:2]:    call._call_and_handle_interrupt(
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[ip-26-0-170-31:2]:    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[ip-26-0-170-31:2]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[ip-26-0-170-31:2]:    return function(*args, **kwargs)
[ip-26-0-170-31:2]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[ip-26-0-170-31:2]:    self._run(model, ckpt_path=ckpt_path)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
[ip-26-0-170-31:2]:    results = self._run_stage()
[ip-26-0-170-31:2]:              ^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
[ip-26-0-170-31:2]:    self.fit_loop.run()
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
[ip-26-0-170-31:2]:    self.advance()
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
[ip-26-0-170-31:2]:    self.epoch_loop.run(self._data_fetcher)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
[ip-26-0-170-31:2]:    self.advance(data_fetcher)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 269, in advance
[ip-26-0-170-31:2]:    call._call_callback_hooks(trainer, "on_train_batch_end", batch_output, batch, batch_idx)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 218, in _call_callback_hooks
[ip-26-0-170-31:2]:    fn(trainer, trainer.lightning_module, *args, **kwargs)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 316, in on_train_batch_end
[ip-26-0-170-31:2]:    self._save_topk_checkpoint(trainer, monitor_candidates)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 385, in _save_topk_checkpoint
[ip-26-0-170-31:2]:    self._save_monitor_checkpoint(trainer, monitor_candidates)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 705, in _save_monitor_checkpoint
[ip-26-0-170-31:2]:    self._update_best_and_save(current, trainer, monitor_candidates)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py", line 757, in _update_best_and_save
[ip-26-0-170-31:2]:    self._save_checkpoint(trainer, filepath)
[ip-26-0-170-31:2]:  File "/opt/NeMo/nemo/utils/callbacks/nemo_model_checkpoint.py", line 545, in _save_checkpoint
[ip-26-0-170-31:2]:    trainer.save_checkpoint(filepath, self.save_weights_only, storage_options=storage_options)
[ip-26-0-170-31:2]:  File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1365, in save_checkpoint
[ip-26-0-170-31:2]:    self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
[ip-26-0-170-31:2]:  File "/opt/NeMo/nemo/collections/nlp/parts/nlp_overrides.py", line 421, in save_checkpoint
[ip-26-0-170-31:2]:    self.checkpoint_io.save_checkpoint(checkpoint, ckpt_to_dir(filepath), storage_options=storage_options)
[ip-26-0-170-31:2]:  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
[ip-26-0-170-31:2]:    return func(*args, **kwds)
[ip-26-0-170-31:2]:           ^^^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/opt/NeMo/nemo/utils/callbacks/dist_ckpt_io.py", line 287, in save_checkpoint
[ip-26-0-170-31:2]:    async_save_request = dist_checkpointing.save(
[ip-26-0-170-31:2]:                         ^^^^^^^^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/opt/megatron-lm/megatron/core/dist_checkpointing/serialization.py", line 395, in save
[ip-26-0-170-31:2]:    sharded_strategy.save(sharded_state_dict, checkpoint_dir)
[ip-26-0-170-31:2]:  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/base.py", line 226, in save
[ip-26-0-170-31:2]:    async_calls.schedule_async_request(async_request)
[ip-26-0-170-31:2]:  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/async_utils.py", line 189, in schedule_async_request
[ip-26-0-170-31:2]:    async_caller.schedule_async_call(async_request.async_fn, async_request.async_fn_args)
[ip-26-0-170-31:2]:  File "/opt/megatron-lm/megatron/core/dist_checkpointing/strategies/async_utils.py", line 103, in schedule_async_call
[ip-26-0-170-31:2]:    self.process.start()
[ip-26-0-170-31:2]:  File "/usr/lib/python3.12/multiprocessing/process.py", line 121, in start
[ip-26-0-170-31:2]:    self._popen = self._Popen(self)
[ip-26-0-170-31:2]:                  ^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/usr/lib/python3.12/multiprocessing/context.py", line 282, in _Popen
[ip-26-0-170-31:2]:    return Popen(process_obj)
[ip-26-0-170-31:2]:           ^^^^^^^^^^^^^^^^^^
[ip-26-0-170-31:2]:  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
[ip-26-0-170-31:2]:    self._launch(process_obj)
[ip-26-0-170-31:2]:  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 66, in _launch
[ip-26-0-170-31:2]:    self.pid = os.fork()
[ip-26-0-170-31:2]:               ^^^^^^^^^
[ip-26-0-170-31:2]:OSError: [Errno 12] Cannot allocate memory
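
The last frames show the async distributed-checkpointing path in 25.02 starting a worker process through Python multiprocessing, and it is the fork itself that fails with ENOMEM. To confirm that host memory (commit accounting), not GPU memory, is the bottleneck, one option is to log /proc/meminfo on every node while the job runs and see whether it collapses right before the checkpoint-time fork. A rough monitoring sketch, started from the login node against the running job (standard Slurm flags, but treat the exact invocation as illustrative):

srun --overlap --jobid=<JOBID> --nodes=2 --ntasks-per-node=1 bash -c '
  while true; do
    echo "$(hostname) $(date +%T) $(grep -E "MemAvailable|CommitLimit|Committed_AS" /proc/meminfo | tr "\n" " ")"
    sleep 5
  done
'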

Environment overview

  • Environment location: Docker
  • srun --container-image='nvcr.io#nvidia/nemo:25.02'

Additional context

GPUs: 16x H100 (2 nodes x 8 GPUs)
