Optimize GPU usage in reward models #82

Open
p-ferreira opened this issue Jul 3, 2023 · 2 comments
Labels
bug Something isn't working low priority

Comments

@p-ferreira
Contributor

Some of the validators are getting CUDA OOM every now and then (including the test validator).

https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro

My initial hypothesis is that data is accumulating in GPU memory until it reaches the limit. Since a validator is expected to run for days, it would be good to identify potential improvements in GPU memory management so that it never reaches the OOM point.

@p-ferreira p-ferreira added the bug Something isn't working label Jul 3, 2023
@p-ferreira
Contributor Author

Issue #96 provides a palliative solution for an edge case in which validators experience out-of-memory (OOM) errors.

This bug is not a show stopper as it does not happen very often and the validators get restarted gracefully with the autorun script. However, it is still necessary to investigate this behavior as it can cause inconvenience.

@p-ferreira
Contributor Author

Issue update (initial EDA):

The issue is still present as we can see from the following wandb runs:

One observation is that this exception has no temporal pattern: run durations vary from 59m to 1d 21h 53m.

Plotting the GPU memory allocation of some preliminary runs from netuid 11, we can see sudden peaks occurring during the runs.
[Image: GPU memory allocation plot]

Looking at the GPU memory allocation of the runs mentioned above, we can verify that GPU memory usage does not grow linearly or consistently over time.
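To make the distinction between steady growth and sudden peaks concrete, here is a minimal sketch that flags step-to-step jumps in a series of memory samples. The function name `find_spikes`, the 2 GiB threshold, and the sample values are all assumptions for illustration; the numbers are synthetic and are not taken from the wandb runs.

```python
from typing import List, Tuple


def find_spikes(samples_gib: List[float], jump_gib: float = 2.0) -> List[Tuple[int, float]]:
    """Return (index, delta) for each sample that rises more than jump_gib over the previous one."""
    spikes = []
    for i in range(1, len(samples_gib)):
        delta = samples_gib[i] - samples_gib[i - 1]
        if delta > jump_gib:
            spikes.append((i, delta))
    return spikes


# Synthetic values for illustration only (GiB), not data from the runs above.
synthetic = [20.0, 20.1, 20.0, 20.2, 27.5, 20.1, 20.0]

spikes = find_spikes(synthetic)
# Net growth over the whole series; a slow leak would show up here rather than as a spike.
net_growth = synthetic[-1] - synthetic[0]
```

A series with isolated spikes and near-zero net growth would point away from a slow leak, which matches the shape described in this update.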

The common pattern across the logs of those runs is the following error (example from a real run):

OutOfMemoryError: CUDA out of memory. Tried to allocate 4.40 GiB (GPU 0; 39.56
 GiB total capacity; 28.93 GiB already allocated; 566.56 MiB free; 32.88 GiB
 reserved in total by PyTorch) If reserved memory is >> allocated memory try
 setting max_split_size_mb to avoid fragmentation.  See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF

The error happens when a model tries to allocate more GPU memory than is available. Looking at the ~250 lines of stack trace before the EOF exception in the wandb log, it can be seen that the error happens consistently with the openassistant model.
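The numbers in the error message can be compared directly. The following is a rough sketch that parses the message quoted above and computes how much memory PyTorch had reserved but not allocated to tensors; the helper `parse_oom` is a hypothetical name, and the regular expressions assume the message format shown in this issue.

```python
import re

# The OOM message from this issue, with the line wraps joined.
OOM_MESSAGE = (
    "OutOfMemoryError: CUDA out of memory. Tried to allocate 4.40 GiB (GPU 0; 39.56 "
    "GiB total capacity; 28.93 GiB already allocated; 566.56 MiB free; 32.88 GiB "
    "reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message: str) -> dict:
    """Extract the sizes reported in a PyTorch CUDA OOM message, converted to GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory held by PyTorch's caching allocator but not currently used by tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]
```

In this example the reserved-but-unallocated memory is about 3.95 GiB, which is less than the 4.40 GiB request, so a single large allocation at that moment could not be served from the cache alone.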

Some simulations were done in an attempt to replicate the peak, using the isolated openassistant model and the complete default reward stack of the validator.

Both tests iterated over the data of the run w60lsiy9, by passing all prompts + completions to the reward flow.

Test 1: Isolated openassistant model
[Image: GPU usage during Test 1]

Test 2: Default reward + mask stack of openvalidators

[Image: GPU usage during Test 2]

In both cases, after the initial peak from loading the model(s), GPU usage remained stable without variation. Neither test resulted in an OOM exception.

Possible directions for future investigation

  • Track GPU memory changes closely by adding extra observability
  • Optimize models to reduce GPU consumption
  • Verify the impact of Clear cache #96 once it's deployed in main
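As a starting point for the observability item, here is a minimal sketch of per-step memory tracking using PyTorch's built-in CUDA memory counters. The names `gpu_memory_snapshot` and `run_step_with_tracking` are hypothetical and not part of openvalidators; the sketch assumes a single GPU at index 0 and returns an empty snapshot when PyTorch or CUDA is not available.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None

GIB = 1024 ** 3


def gpu_memory_snapshot(device: int = 0) -> dict:
    """Return allocated, reserved, and peak allocated GPU memory in GiB, or {} without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return {}
    return {
        "allocated_gib": torch.cuda.memory_allocated(device) / GIB,
        "reserved_gib": torch.cuda.memory_reserved(device) / GIB,
        "peak_allocated_gib": torch.cuda.max_memory_allocated(device) / GIB,
    }


def run_step_with_tracking(step_fn, *args, device: int = 0, **kwargs):
    """Run one step and return (result, snapshot); the peak covers this step only."""
    if torch is not None and torch.cuda.is_available():
        # Reset the peak counter so the snapshot reflects this step, not the whole run.
        torch.cuda.reset_peak_memory_stats(device)
    result = step_fn(*args, **kwargs)
    return result, gpu_memory_snapshot(device)


# Usage with a stand-in step; in the validator this would wrap a reward model call.
result, snapshot = run_step_with_tracking(lambda x: x * 2, 21)
```

Logging the snapshot for each reward model call, together with the prompt and completion lengths, would show which model and which input produce a peak when one occurs.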
