OOM and Ray Debugger Issues with Training on 4x4090 #62

Open
jankinf opened this issue Mar 6, 2025 · 0 comments
jankinf commented Mar 6, 2025

When running training with Qwen2.5-3B on 4x RTX 4090 GPUs (24GB each), I run into two issues:

  1. Out of Memory (OOM) error despite using memory optimization settings
  2. Ray debugger hangs indefinitely when trying to debug remote functions

My script is as follows:

set -x
MODEL_PATH="Qwen/Qwen2.5-3B"
EXPERIMENT_NAME="logic_grpo_countdown_3b"
export HYDRA_FULL_ERROR=1
export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES=0,1,2,3

RAY_DEBUG=legacy python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=data/countdown/train.parquet \
    data.val_files=data/countdown/test.parquet \
    data.train_batch_size=8 \
    data.val_batch_size=8 \
    data.max_prompt_length=400 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=3e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size=32 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=160 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.rollout.n=10 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name='GRPO_logic_countdown' \
    trainer.experiment_name='Qwen-3B' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.default_local_dir='/data/projects/logic_rl/checkpoints/${trainer.project_name}/${trainer.experiment_name}' \
    trainer.default_hdfs_dir=null \
    trainer.save_freq=100 \
    trainer.test_freq=10 \
    trainer.total_epochs=5 "$@"
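
For reference, this is the kind of change I was planning to try next to reduce memory pressure on the 24GB cards. The values are untested guesses on my part and only touch keys that already appear in the script above; please tell me if they point in the wrong direction.

# Untested memory-reduction overrides I am considering (values are guesses);
# all other arguments would stay exactly as in the script above.
actor_rollout_ref.actor.ppo_mini_batch_size=8 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size=8 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.25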

I followed the official Ray debugger documentation and tried to step into the remote function with the remote command, but it hung indefinitely with no error or log output.

Active breakpoints:
index | timestamp           | Ray task       | filename:lineno                                                         
0     | 2025-03-06 12:03:52 | ray::main_task | /home/projects/Logic-RL/verl/trainer/ppo/ray_trainer.py:691
Enter breakpoint index or press enter to refresh: 0
> /home/projects/Logic-RL/verl/trainer/ppo/ray_trainer.py(692)fit()
-> actor_output = self.actor_rollout_wg.update_actor(batch)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(38)func()
-> def func(*args, **kwargs):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(39)func()
-> args, kwargs = dispatch_fn(self, *args, **kwargs)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(40)func()
-> output = execute_fn(method_name, *args, **kwargs)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(329)execute_all()
-> def execute_all(self, method_name: str, *args, **kwargs):
(Pdb) s
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(330)execute_all()
-> return self.execute_all_async(method_name, *args, **kwargs)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(335)execute_all_async()
-> def execute_all_async(self, method_name: str, *args, **kwargs):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(339)execute_all_async()
-> length = len(self._workers)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(340)execute_all_async()
-> if all(isinstance(arg, list) for arg in args) and all(isinstance(kwarg, list) for kwarg in kwargs.values()):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(341)execute_all_async()
-> if all(len(arg) == length for arg in args) and all(len(kwarg) == length for kwarg in kwargs.values()):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(343)execute_all_async()
-> result = []
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(344)execute_all_async()
-> for i in range(length):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(345)execute_all_async()
-> sliced_args = tuple(arg[i] for arg in args)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(346)execute_all_async()
-> sliced_kwargs = {k: v[i] for k, v in kwargs.items()}
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(347)execute_all_async()
-> remote_call = getattr(self._workers[i], method_name)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(348)execute_all_async()
-> result.append(remote_call.remote(*sliced_args, **sliced_kwargs))
(Pdb) remote
Continuing pdb session in different process...
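
As a possible workaround for the hang, I was considering skipping the remote stepping entirely: add a breakpoint() call directly inside the worker-side method (I assume that would be update_actor in verl's FSDP worker, but I have not verified the exact location) and attach to it from a second terminal. Roughly (run_grpo.sh is just a placeholder name for the script above):

# Terminal 1: launch training with the legacy Ray debugger enabled, after
# inserting breakpoint() into the worker-side method of interest.
RAY_DEBUG=legacy bash run_grpo.sh

# Terminal 2: attach to the breakpoint raised inside the worker process.
ray debug

Is stepping across a .remote() call with the remote command expected to work here, or is a worker-side breakpoint the intended way to do this?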

Could you give me some suggestions on how to tune the hyperparameters for this hardware and how to fix the debugger hang?
Thanks!
