OOM and Ray Debugger Issues with Training on 4x4090 #62

Open
jankinf opened this issue Mar 6, 2025 · 0 comments
jankinf commented Mar 6, 2025

When running training with Qwen2.5-3B on 4x RTX 4090 GPUs (24GB each), I run into two issues:

  1. Out of Memory (OOM) error despite using memory optimization settings
  2. Ray debugger hangs indefinitely when trying to debug remote functions

My script is as follows:

set -x
MODEL_PATH="Qwen/Qwen2.5-3B"
EXPERIMENT_NAME="logic_grpo_countdown_3b"
export HYDRA_FULL_ERROR=1
export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES=0,1,2,3

RAY_DEBUG=legacy python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=data/countdown/train.parquet \
    data.val_files=data/countdown/test.parquet \
    data.train_batch_size=8 \
    data.val_batch_size=8 \
    data.max_prompt_length=400 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=3e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size=32 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=160 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.rollout.n=10 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name='GRPO_logic_countdown' \
    trainer.experiment_name='Qwen-3B' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.default_local_dir='/data/projects/logic_rl/checkpoints/${trainer.project_name}/${trainer.experiment_name}' \
    trainer.default_hdfs_dir=null \
    trainer.save_freq=100 \
    trainer.test_freq=10 \
    trainer.total_epochs=5 "$@"
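
For reference, this is the kind of change I was planning to try next to reduce memory pressure on the 24GB cards. The values are untested guesses on my part and only touch keys that already appear in the script above; please tell me if they point in the wrong direction.

# Untested memory-reduction overrides I am considering (values are guesses);
# all other arguments would stay exactly as in the script above.
actor_rollout_ref.actor.ppo_mini_batch_size=8 \
actor_rollout_ref.actor.ppo_micro_batch_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size=8 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.25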

I followed the official Ray debugger documentation and tried to step into the remote function with the remote command, but it hung indefinitely with no error or log output.

Active breakpoints:
index | timestamp           | Ray task       | filename:lineno                                                         
0     | 2025-03-06 12:03:52 | ray::main_task | /home/projects/Logic-RL/verl/trainer/ppo/ray_trainer.py:691
Enter breakpoint index or press enter to refresh: 0
> /home/projects/Logic-RL/verl/trainer/ppo/ray_trainer.py(692)fit()
-> actor_output = self.actor_rollout_wg.update_actor(batch)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(38)func()
-> def func(*args, **kwargs):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(39)func()
-> args, kwargs = dispatch_fn(self, *args, **kwargs)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(40)func()
-> output = execute_fn(method_name, *args, **kwargs)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(329)execute_all()
-> def execute_all(self, method_name: str, *args, **kwargs):
(Pdb) s
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(330)execute_all()
-> return self.execute_all_async(method_name, *args, **kwargs)
(Pdb) s
--Call--
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(335)execute_all_async()
-> def execute_all_async(self, method_name: str, *args, **kwargs):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(339)execute_all_async()
-> length = len(self._workers)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(340)execute_all_async()
-> if all(isinstance(arg, list) for arg in args) and all(isinstance(kwarg, list) for kwarg in kwargs.values()):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(341)execute_all_async()
-> if all(len(arg) == length for arg in args) and all(len(kwarg) == length for kwarg in kwargs.values()):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(343)execute_all_async()
-> result = []
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(344)execute_all_async()
-> for i in range(length):
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(345)execute_all_async()
-> sliced_args = tuple(arg[i] for arg in args)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(346)execute_all_async()
-> sliced_kwargs = {k: v[i] for k, v in kwargs.items()}
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(347)execute_all_async()
-> remote_call = getattr(self._workers[i], method_name)
(Pdb) n
> /home/projects/Logic-RL/verl/single_controller/ray/base.py(348)execute_all_async()
-> result.append(remote_call.remote(*sliced_args, **sliced_kwargs))
(Pdb) remote
Continuing pdb session in different process...
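
As a possible workaround for the hang, I was considering skipping the remote stepping entirely: add a breakpoint() call directly inside the worker-side method (I assume that would be update_actor in verl's FSDP worker, but I have not verified the exact location) and attach to it from a second terminal. Roughly (run_grpo.sh is just a placeholder name for the script above):

# Terminal 1: launch training with the legacy Ray debugger enabled, after
# inserting breakpoint() into the worker-side method of interest.
RAY_DEBUG=legacy bash run_grpo.sh

# Terminal 2: attach to the breakpoint raised inside the worker process.
ray debug

Is stepping across a .remote() call with the remote command expected to work here, or is a worker-side breakpoint the intended way to do this?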

Could you give me some suggestions on how to tune the hyperparameters for this hardware and how to fix the debugger hang?
Thanks!
