Optimize GPU usage in reward models #82
Issue #96 provides a palliative solution for an edge case in which validators experience out-of-memory (OOM) errors. This bug is not a showstopper, since it does not happen very often and the validators restart gracefully with the autorun script, but the behavior still needs to be investigated because it can cause inconvenience.
Issue update (initial EDA): The issue is still present, as we can see from the following wandb runs:
One thing that can be observed is that this exception does not follow a temporal pattern: the affected run durations vary from 59m to 1d 21h 53m. Plotting the GPU memory allocation of some preliminary runs from netuid 11, we can see a peak that happens suddenly at some point during each run. Looking at the GPU memory allocation of the runs mentioned above, we can verify that GPU usage does not grow linearly or consistently over time. The pattern that stands out across the logs of those runs is the following error (example from a real run):

    OutOfMemoryError: CUDA out of memory. Tried to allocate 4.40 GiB (GPU 0; 39.56
    GiB total capacity; 28.93 GiB already allocated; 566.56 MiB free; 32.88 GiB
    reserved in total by PyTorch) If reserved memory is >> allocated memory try
    setting max_split_size_mb to avoid fragmentation. See documentation for Memory
    Management and PYTORCH_CUDA_ALLOC_CONF

The error happens when a model tries to allocate more GPU memory than is available. Looking at ~250 rows of the stack trace before the EOF exception in the wandb log, it can be seen that the error consistently originates from the openassistant model.

Some simulations were done in an attempt to replicate the peak, using the isolated openassistant model and the complete default reward stack of the validator. Both tests iterated over the data of run w60lsiy9, passing all prompts + completions through the reward flow:

- Test 1: isolated openassistant model
- Test 2: default reward + mask stack of openvalidators

In both cases, after the initial peak from loading the model(s), GPU usage remained stable without variations, and neither attempt resulted in an OOM exception.

Possible directions for future investigation
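One way to narrow down where the spike originates during such a replay is to wrap each reward call with the `torch.cuda` memory counters and log the peak per step. The sketch below is only illustrative and is not the openvalidators implementation; `reward_fn`, `prompt`, and `completions` are hypothetical placeholders for the validator's actual reward flow, and the `max_split_size_mb` value is just the mitigation suggested by the OOM message itself.

```python
import os
import torch

# Fragmentation mitigation suggested by the OOM message; must be set before the
# first CUDA allocation so the caching allocator picks it up. The value 128 is
# an assumption, not a recommendation from the issue.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

def reward_with_memory_trace(reward_fn, prompt, completions, device="cuda:0"):
    """Run one reward step and report allocated/peak/reserved GPU memory."""
    torch.cuda.reset_peak_memory_stats(device)
    before = torch.cuda.memory_allocated(device)

    rewards = reward_fn(prompt, completions)  # hypothetical reward call

    after = torch.cuda.memory_allocated(device)
    peak = torch.cuda.max_memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(
        f"allocated: {before / 2**30:.2f} -> {after / 2**30:.2f} GiB | "
        f"peak: {peak / 2**30:.2f} GiB | reserved: {reserved / 2**30:.2f} GiB"
    )
    return rewards
```

Logging these three numbers per step would show whether the spike comes from a single large allocation (peak jumps while allocated stays flat) or from gradual accumulation (allocated creeps up between steps).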
Some of the validators (including the test validator) are hitting CUDA OOM errors every now and then.
https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro
My initial hypothesis is that tensors are accumulating on the GPU until they reach the memory limit. Given that a validator should run for days, it would be worth identifying potential improvements to GPU memory management in order to avoid reaching the OOM point.
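If the accumulation hypothesis holds, the usual culprits are autograd graphs kept alive across steps and GPU-resident results that are never released. The following is a hedged sketch of the kind of hygiene worth checking for, not the project's actual scoring code; `model` and `tokenizer` are hypothetical HuggingFace-style stand-ins for the openassistant reward model wrapper.

```python
import gc
import torch

def score_completions(model, tokenizer, prompt, completions, device="cuda:0"):
    """Score completions while keeping only CPU-side floats between steps."""
    scores = []
    # no_grad prevents autograd from retaining activations across iterations.
    with torch.no_grad():
        for completion in completions:
            inputs = tokenizer(prompt + completion, return_tensors="pt").to(device)
            logits = model(**inputs).logits
            # Pull the scalar score off the GPU immediately; only a float survives.
            scores.append(logits[0, -1].item())
            del inputs, logits  # drop GPU references before the next iteration
    gc.collect()
    torch.cuda.empty_cache()  # optional: return cached blocks to the driver
    return scores
```

Even if the existing code already does all of this, instrumenting the loop as above (together with the per-step memory trace sketched earlier) should at least make the source of the spike visible in the wandb logs.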