Optimize GPU usage in reward models #82
Issue #96 provides a palliative solution for an edge case in which validators experience out-of-memory (OOM) errors. This bug is not a showstopper, since it does not happen very often and the validators restart gracefully with the autorun script, but the behavior still needs to be investigated because it can cause inconvenience.
Issue update (initial EDA): The issue is still present, as we can see from the following wandb runs:
One thing that can be observed is that this exception does not follow a temporal pattern: the affected run durations vary from 59m to 1d 21h 53m. Plotting the GPU memory allocation of some preliminary runs from netuid 11, we can see a peak that happens suddenly at some point during each run. Looking at the GPU memory allocation of the runs mentioned above, we can verify that GPU usage does not grow linearly or consistently over time. The pattern that stands out across the logs of those runs is the following error (example from a real run):

    OutOfMemoryError: CUDA out of memory. Tried to allocate 4.40 GiB (GPU 0; 39.56
    GiB total capacity; 28.93 GiB already allocated; 566.56 MiB free; 32.88 GiB
    reserved in total by PyTorch) If reserved memory is >> allocated memory try
    setting max_split_size_mb to avoid fragmentation. See documentation for Memory
    Management and PYTORCH_CUDA_ALLOC_CONF

The error happens when a model tries to allocate more GPU memory than is available. Looking at ~250 rows of the stack trace before the EOF exception in the wandb log, it can be seen that the error consistently originates from the openassistant model.

Some simulations were done in an attempt to replicate the peak, using the isolated openassistant model and the complete default reward stack of the validator. Both tests iterated over the data of run w60lsiy9, passing all prompts + completions through the reward flow:

- Test 1: isolated openassistant model
- Test 2: default reward + mask stack of openvalidators

In both cases, after the initial peak from loading the model(s), GPU usage remained stable without variations, and neither attempt resulted in an OOM exception.

Possible directions for future investigation
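One way to narrow down where the spike originates during such a replay is to wrap each reward call with the `torch.cuda` memory counters and log the peak per step. The sketch below is only illustrative and is not the openvalidators implementation; `reward_fn`, `prompt`, and `completions` are hypothetical placeholders for the validator's actual reward flow, and the `max_split_size_mb` value is just the mitigation suggested by the OOM message itself.

```python
import os
import torch

# Fragmentation mitigation suggested by the OOM message; must be set before the
# first CUDA allocation so the caching allocator picks it up. The value 128 is
# an assumption, not a recommendation from the issue.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

def reward_with_memory_trace(reward_fn, prompt, completions, device="cuda:0"):
    """Run one reward step and report allocated/peak/reserved GPU memory."""
    torch.cuda.reset_peak_memory_stats(device)
    before = torch.cuda.memory_allocated(device)

    rewards = reward_fn(prompt, completions)  # hypothetical reward call

    after = torch.cuda.memory_allocated(device)
    peak = torch.cuda.max_memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(
        f"allocated: {before / 2**30:.2f} -> {after / 2**30:.2f} GiB | "
        f"peak: {peak / 2**30:.2f} GiB | reserved: {reserved / 2**30:.2f} GiB"
    )
    return rewards
```

Logging these three numbers per step would show whether the spike comes from a single large allocation (peak jumps while allocated stays flat) or from gradual accumulation (allocated creeps up between steps).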
Some of the validators (including the test validator) are hitting CUDA OOM errors every now and then.
https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro
My initial hypothesis is that tensors are accumulating on the GPU until they reach the memory limit. Given that a validator should run for days, it would be worth identifying potential improvements to GPU memory management in order to avoid reaching the OOM point.
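If the accumulation hypothesis holds, the usual culprits are autograd graphs kept alive across steps and GPU-resident results that are never released. The following is a hedged sketch of the kind of hygiene worth checking for, not the project's actual scoring code; `model` and `tokenizer` are hypothetical HuggingFace-style stand-ins for the openassistant reward model wrapper.

```python
import gc
import torch

def score_completions(model, tokenizer, prompt, completions, device="cuda:0"):
    """Score completions while keeping only CPU-side floats between steps."""
    scores = []
    # no_grad prevents autograd from retaining activations across iterations.
    with torch.no_grad():
        for completion in completions:
            inputs = tokenizer(prompt + completion, return_tensors="pt").to(device)
            logits = model(**inputs).logits
            # Pull the scalar score off the GPU immediately; only a float survives.
            scores.append(logits[0, -1].item())
            del inputs, logits  # drop GPU references before the next iteration
    gc.collect()
    torch.cuda.empty_cache()  # optional: return cached blocks to the driver
    return scores
```

Even if the existing code already does all of this, instrumenting the loop as above (together with the per-step memory trace sketched earlier) should at least make the source of the spike visible in the wandb logs.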