GPU Requirement for PPO - CUDA Out of Memory Error During PPO Training #393

Open
RoozbehNahavandi opened this issue Oct 19, 2024 · 1 comment

@RoozbehNahavandi

Hi,

I'm an external user and have recently been struggling with PPO finetuning. I finetuned a Llama 2 7B model as a reward model, and now I'm trying to run PPO on another Llama 2 7B model using 4 H100 GPUs, each with 94 GB of memory. However, I keep hitting CUDA out-of-memory errors during PPO training. With 4 GPUs, the process manages to backpropagate through some batches, but after a few minutes it consistently runs out of memory. The issue persists whether I launch with DeepSpeed, torchrun, or accelerate. The README provides a configuration for 8-GPU training, so my question is: do I actually need 8 GPUs for PPO, or is it possible to complete it with only 4?

I've tried adjusting settings like response_length, max_token_length, and gradient_accumulation_steps, but none of that resolves the issue.

I also faced a similar OOM issue during reward modeling, but I was able to solve it by setting zero_stage to 3 in the DeepSpeed config. However, I haven't made any changes to deepspeed_zero3.yaml for PPO training.
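
For what it's worth, one thing I have not tried for PPO yet is CPU offload on top of ZeRO-3. Below is a minimal, untested sketch of what that could look like, passing the DeepSpeed options directly on the accelerate command line instead of through deepspeed_zero3.yaml; the exact settings would need tuning, and offloading trades GPU memory for host RAM, so the job's --mem request might need to grow.

# Sketch only: ZeRO-3 with parameter/optimizer offload to CPU, replacing
# --config_file configs/ds_configs/deepspeed_zero3.yaml in the command below.
accelerate launch \
    --num_processes 3 \
    --use_deepspeed \
    --zero_stage 3 \
    --offload_param_device cpu \
    --offload_optimizer_device cpu \
    open_instruct/ppo_vllm_thread.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing
# (remaining ppo_vllm_thread.py arguments as in the full command below)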

Any guidance would be greatly appreciated.

This is the script I'm submitting:


#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=2
#SBATCH --mem=128gb
#SBATCH --ntasks=1

accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/ppo_vllm_thread.py \
    --exp_name "ppo_vllm_thread_beta_0.03" \
    --dataset_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
    --sft_messages_key chosen \
    --dataset_train_splits train_prefs \
    --dataset_eval_mixer '{"allenai/ultrafeedback_binarized_cleaned": 1.0}' \
    --dataset_eval_splits test_prefs \
    --max_token_length 1024 \
    --max_prompt_token_lenth 512 \
    --learning_rate 8e-7 \
    --output_dir /output/ \
    --chat_template tulu \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --local_rollout_forward_batch_size 1 \
    --vllm_device cuda:3 \
    --num_epochs 1 \
    --num_mini_batches 1 \
    --total_episodes 300000 \
    --model_name_or_path output/tulu_v2_7B  \
    --model_revision finetune__meta-llama_Meta-Llama-3.1-8B__42__1725751338 \
    --reward_model_path models/rm/rm_tulu_7b \
    --reward_model_revision reward_modeling__1__1726175049 \
    --non_stop_penalty \
    --stop_token eos \
    --penalty_reward_value -10.0 \
    --beta 0.02 \
    --num_evals 3 \
    --response_length 1024 \
    --checkpoint_output_dir output/ppo_7b \
    --gradient_checkpointing \
    --with_tracking

And this is the error I'm getting:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
[rank1]:     main(*parser.parse())
[rank1]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
[rank1]:     with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
[rank1]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
[rank1]:     self.params[0].all_gather(param_list=self.params)
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank1]:     return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank1]:     self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank1]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank1]:     flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 1 has a total capacity of 93.02 GiB of which 7.81 MiB is free. Including non-PyTorch memory, this process has 92.99 GiB memory in use. Of the allocated memory 89.65 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
[rank2]:     main(*parser.parse())
[rank2]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
[rank2]:     with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
[rank2]:     return next(self.gen)
[rank2]:            ^^^^^^^^^^^^^^
[rank2]:   File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
[rank2]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
[rank2]:     self.params[0].all_gather(param_list=self.params)
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank2]:     return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank2]:     self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank2]:   File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank2]:     flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank2]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 2 has a total capacity of 93.02 GiB of which 7.81 MiB is free. Including non-PyTorch memory, this process has 92.99 GiB memory in use. Of the allocated memory 89.65 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
    main(*parser.parse())
  File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
    with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
    with deepspeed.zero.GatheredParameters(model.parameters()):
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
    self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
  File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
    flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 93.02 GiB of which 67.81 MiB is free. Including non-PyTorch memory, this process has 92.93 GiB memory in use. Of the allocated memory 90.04 GiB is allocated by PyTorch, and 896.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
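
The error message itself suggests one mitigation: the expandable-segments allocator setting. A minimal sketch of adding it to the sbatch script above, before the launch line, is shown below; note that this only mitigates allocator fragmentation, so it may not be enough if the model genuinely does not fit.

# Sketch only: allocator setting suggested by the OOM message, exported
# before the existing accelerate launch line in the sbatch script above.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/ppo_vllm_thread.py
# (script arguments unchanged from the command above)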

@vwxyzjn (Collaborator) commented Oct 19, 2024

Yeah, PPO takes more memory, and 8 GPUs is the recommended setup. Maybe you would be interested in trying out the online DPO scripts? They use less memory.
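
For reference, a rough launch sketch for that route, assuming the online DPO counterpart lives at open_instruct/online_dpo_vllm_thread.py and accepts flags similar to the PPO script above; both the script name and the flags here are assumptions, so check the repo's README and scripts for the actual invocation.

# Sketch only: swap the PPO script for its online DPO counterpart while keeping
# the same accelerate/DeepSpeed setup. Script name and flags are assumed.
accelerate launch --num_processes 3 --config_file configs/ds_configs/deepspeed_zero3.yaml \
    open_instruct/online_dpo_vllm_thread.py \
    --model_name_or_path output/tulu_v2_7B \
    --reward_model_path models/rm/rm_tulu_7b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --gradient_checkpointing
# (remaining arguments as in the PPO command above, if supported)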
