GPU Requirement for PPO - CUDA Out of Memory Error During PPO Training #393

Open
RoozbehNahavandi opened this issue Oct 19, 2024 · 1 comment

@RoozbehNahavandi

Hi,

I'm an external user and have recently been struggling with PPO finetuning. I finetuned a Llama 2 7B model as a reward model, and now I'm trying to run PPO on another Llama 2 7B model using 4 H100 GPUs, each with 94 GB of memory. However, I keep hitting CUDA out-of-memory errors during PPO training. With 4 GPUs, the process manages to backpropagate through some batches, but after a few minutes it consistently runs out of memory. The issue persists whether I launch with DeepSpeed, torchrun, or accelerate. The README provides a configuration for 8-GPU training, so my question is: do I actually need 8 GPUs for PPO, or is it possible to complete it with only 4?

I've tried adjusting settings like response_length, max_token_length, and gradient_accumulation_steps, but none of that resolves the issue.

I also faced a similar OOM issue during reward modeling, but I was able to solve it by setting zero_stage to 3 in the DeepSpeed config. However, I haven't made any changes to deepspeed_zero3.yaml for PPO training.
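
For what it's worth, one thing I have not tried for PPO yet is CPU offload on top of ZeRO-3. Below is a minimal, untested sketch of what that could look like, passing the DeepSpeed options directly on the accelerate command line instead of through deepspeed_zero3.yaml; the exact settings would need tuning, and offloading trades GPU memory for host RAM, so the job's --mem request might need to grow.

# Sketch only: ZeRO-3 with parameter/optimizer offload to CPU, replacing
# --config_file configs/ds_configs/deepspeed_zero3.yaml in the command below.
accelerate launch \
    --num_processes 3 \
    --use_deepspeed \
    --zero_stage 3 \
    --offload_param_device cpu \
    --offload_optimizer_device cpu \
    open_instruct/ppo_vllm_thread.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing
# (remaining ppo_vllm_thread.py arguments as in the full command below)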

Any guidance would be greatly appreciated.

This is the script I'm submitting:


#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=2
#SBATCH --mem=128gb
#SBATCH --ntasks=1

accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/ppo_vllm_thread.py \
    --exp_name "ppo_vllm_thread_beta_0.03" \
    --dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
    --sft_messages_key chosen \
    --dataset_train_splits train_prefs \
    --dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
    --dataset_eval_splits test_prefs \
    --max_token_length 1024 \
    --max_prompt_token_lenth 512 \
    --learning_rate 8e-7 \
    --output_dir /output/ \
    --chat_template tulu \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --local_rollout_forward_batch_size 1 \
    --vllm_device cuda:3 \
    --num_epochs 1 \
    --num_mini_batches 1 \
    --total_episodes 300000 \
    --model_name_or_path output/tulu_v2_7B  \
    --model_revision finetune__meta-llama_Meta-Llama-3.1-8B__42__1725751338 \
    --reward_model_path models/rm/rm_tulu_7b \
    --reward_model_revision reward_modeling__1__1726175049 \
    --non_stop_penalty \
    --stop_token eos \
    --penalty_reward_value -10.0 \
    --beta 0.02 \
    --num_evals 3 \
    --response_length 1024 \
    --checkpoint_output_dir output/ppo_7b \
    --gradient_checkpointing \
    --with_tracking

And this is the error I'm getting:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
[rank1]:     main(*parser.parse())
[rank1]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
[rank1]:     with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
[rank1]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
[rank1]:     self.params[0].all_gather(param_list=self.params)
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank1]:     return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank1]:     self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank1]:     flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 1 has a total capacity of 93.02 GiB of which 7.81 MiB is free. Including non-PyTorch memory, this process has 92.99 GiB memory in use. Of the allocated memory 89.65 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
[rank2]:     main(*parser.parse())
[rank2]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
[rank2]:     with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
[rank2]:     return next(self.gen)
[rank2]:            ^^^^^^^^^^^^^^
[rank2]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
[rank2]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
[rank2]:     self.params[0].all_gather(param_list=self.params)
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank2]:     return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank2]:     self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank2]:     flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 2 has a total capacity of 93.02 GiB of which 7.81 MiB is free. Including non-PyTorch memory, this process has 92.99 GiB memory in use. Of the allocated memory 89.65 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
    main(*parser.parse())
  File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
    with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
    with deepspeed.zero.GatheredParameters(model.parameters()):
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
    self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
    flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 93.02 GiB of which 67.81 MiB is free. Including non-PyTorch memory, this process has 92.93 GiB memory in use. Of the allocated memory 90.04 GiB is allocated by PyTorch, and 896.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
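
The error message itself suggests one mitigation: the expandable-segments allocator setting. A minimal sketch of adding it to the sbatch script above, before the launch line, is shown below; note that this only mitigates allocator fragmentation, so it may not be enough if the model genuinely does not fit.

# Sketch only: allocator setting suggested by the OOM message, exported
# before the existing accelerate launch line in the sbatch script above.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/ppo_vllm_thread.py
# (script arguments unchanged from the command above)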

@vwxyzjn (Collaborator) commented Oct 19, 2024

Yeah, PPO takes more memory, and 8 GPUs is the recommended setup. Maybe you would be interested in trying out the online DPO scripts? They use less memory.
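
For reference, a rough launch sketch for that route, assuming the online DPO counterpart lives at open_instruct/online_dpo_vllm_thread.py and accepts flags similar to the PPO script above; both the script name and the flags here are assumptions, so check the repo's README and scripts for the actual invocation.

# Sketch only: swap the PPO script for its online DPO counterpart while keeping
# the same accelerate/DeepSpeed setup. Script name and flags are assumed.
accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/online_dpo_vllm_thread.py \
    --model_name_or_path output/tulu_v2_7B \
    --reward_model_path models/rm/rm_tulu_7b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing
# (remaining arguments as in the PPO command above, if supported)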
