Hi, I'm an external user and have recently been struggling with PPO finetuning. I have finetuned a Llama 2 7B model as a reward model and am now trying to run PPO on another Llama 2 7B model using 4 H100 GPUs, each with 94 GB of memory. However, I'm hitting CUDA out-of-memory errors during PPO training. With 4 GPUs, the process manages to backpropagate for some batches, but after a few minutes it consistently fails with a CUDA out-of-memory error. The issue persists whether I launch with deepspeed, torchrun, or accelerate. The README provides a configuration for 8-GPU training, so my question is: do I actually need 8 GPUs for PPO, or is it possible to complete it with only 4?
I've tried adjusting settings such as response_length, max_token_length, and gradient_accumulation_step, but none of them resolve the issue.
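For concreteness, this is roughly the kind of launch command I've been experimenting with. It's only a sketch: the flag names mirror the settings mentioned above and the config path is my guess, so they should be checked against the argparser in ppo_vllm_thread.py and the repo's actual config layout. The allocator setting is the one suggested in the OOM message itself.

```bash
# Sketch only: flag names and the config path are assumptions based on the
# settings mentioned above; verify them against ppo_vllm_thread.py's argparser.
# The allocator hint comes straight from the OOM message.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

accelerate launch \
    --config_file configs/ds_configs/deepspeed_zero3.yaml \
    --num_processes 4 \
    open_instruct/ppo_vllm_thread.py \
    --response_length 256 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16
```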
I also faced a similar OOM issue during reward modeling, and was able to solve it by setting zero_stage to 3 in the DeepSpeed config. However, I haven't made any changes to deepspeed_zero3.yaml for PPO training. Any guidance would be greatly appreciated.
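For reference, this is roughly what I'd expect an accelerate-style ZeRO-3 config with CPU offload to look like. It's a sketch, not the repo's stock deepspeed_zero3.yaml; the offload keys in particular are my assumption about what might relieve GPU memory pressure.

```yaml
# Sketch of an accelerate DeepSpeed ZeRO-3 config with CPU offload added.
# The offload_* keys are an assumption, not necessarily in the repo's stock file;
# everything else mirrors a standard `accelerate` DeepSpeed config.
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu   # assumption: move optimizer state to CPU
  offload_param_device: cpu       # assumption: move partitioned params to CPU
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4   # possibly 3 if one GPU ends up reserved for vLLM generation
rdzv_backend: static
same_network: true
use_cpu: false
```

And this is the error I'm getting: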
[rank1]: Traceback (most recent call last):
[rank1]: File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
[rank1]: main(*parser.parse())
[rank1]: File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
[rank1]: with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
[rank1]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]: return next(self.gen)
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
[rank1]: with deepspeed.zero.GatheredParameters(model.parameters()):
[rank1]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
[rank1]: self.params[0].all_gather(param_list=self.params)
[rank1]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank1]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank1]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank1]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank1]: flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 1 has a total capacity of 93.02 GiB of which 7.81 MiB is free. Including non-PyTorch memory, this process has 92.99 GiB memory in use. Of the allocated memory 89.65 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank2]: Traceback (most recent call last):
[rank2]: File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
[rank2]: main(*parser.parse())
[rank2]: File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
[rank2]: with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
[rank2]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
[rank2]: return next(self.gen)
[rank2]: ^^^^^^^^^^^^^^
[rank2]: File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
[rank2]: with deepspeed.zero.GatheredParameters(model.parameters()):
[rank2]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
[rank2]: self.params[0].all_gather(param_list=self.params)
[rank2]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
[rank2]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]: ret_val = func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
[rank2]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank2]: File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
[rank2]: flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 2 has a total capacity of 93.02 GiB of which 7.81 MiB is free. Including non-PyTorch memory, this process has 92.99 GiB memory in use. Of the allocated memory 89.65 GiB is allocated by PyTorch, and 1.36 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 1086, in <module>
main(*parser.parse())
File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/ppo_vllm_thread.py", line 714, in main
with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/contextlib.py", line 137, in __enter__
return next(self.gen)
^^^^^^^^^^^^^^
File "/fs/scratch/PAS2138/roozbehn99/open-instruct/open_instruct/model_utils.py", line 455, in unwrap_model_for_generation
with deepspeed.zero.GatheredParameters(model.parameters()):
File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2224, in __enter__
self.params[0].all_gather(param_list=self.params)
File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1143, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1511, in _all_gather
self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
File "/users/PAS2138/roozbehn99/miniconda3/envs/openinstruct-repo/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1799, in _allgather_params_coalesced
flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 93.02 GiB of which 67.81 MiB is free. Including non-PyTorch memory, this process has 92.93 GiB memory in use. Of the allocated memory 90.04 GiB is allocated by PyTorch, and 896.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)