GPU Out of Memory when repeatedly running large models (hyperparameter_search) #13019
Comments
Thanks for the issue and the investigation. It looks like you have found the right fix, would you mind making a PR with it?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm experiencing the exact same problem. Sadly, the suggested solution doesn't work for me. At first I had the impression that the OutOfMemoryError shows up a bit later now (sometimes after 6–8 instead of 2 iterations), but that might be a coincidence.
Environment info
- transformers version: 4.9.1
- I'm letting trainer do its default thing here; I see that trainer.is_model_parallel = False.

Who can help
Looks like @sgugger has some related activity in trainer...maybe he can point toward the right person to help?
Information
Model I am using (Bert, XLNet ...): distilbert-base-uncased
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
1. Run hyperparameter_search with distilbert-base-uncased, using the code below. Training set is limited to 10k sentences with binary labels. Eval consists of 500 sentences.
2. After roughly two tuning runs, training fails with RuntimeError: CUDA out of memory... (full error pasted at the bottom of this issue).
3. Looking at my wandb logs, I see that GPU memory is not freed between tuning runs (purple is run-0, gray is run-1, blue is run-2).
4. Modifying the run_hp_search_optuna fn to explicitly delete the model and de-allocate memory between runs seems to resolve the problem (see below).

Code that produces the issue
Running the following code yields the error after ~2 hyperparameter tuning runs.
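The original script was not captured in this copy of the issue, so the snippet below is a minimal sketch of the setup described above (distilbert-base-uncased, ~10k binary-labeled training sentences, 500 eval sentences, Optuna backend). The dataset choice, column names, and training arguments are placeholder assumptions, not the author's original code:

```python
# Hypothetical minimal reproduction: repeated Optuna trials via
# Trainer.hyperparameter_search on distilbert-base-uncased.
# Dataset choice, column names, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# ~10k binary-labeled training sentences, 500 eval sentences.
raw = load_dataset("imdb")
train_ds = raw["train"].shuffle(seed=42).select(range(10_000)).map(tokenize, batched=True)
eval_ds = raw["test"].shuffle(seed=42).select(range(500)).map(tokenize, batched=True)

def model_init():
    # hyperparameter_search builds a fresh model for every trial from this callable.
    return AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

args = TrainingArguments(
    output_dir="hp_search",
    evaluation_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    report_to="wandb",
)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)

# On 4.9.1 each trial's model stays resident on the GPU after the trial ends,
# so memory grows until a trial fails with "CUDA out of memory".
best_run = trainer.hyperparameter_search(direction="minimize", backend="optuna", n_trials=10)
print(best_run)
```

Checking torch.cuda.memory_allocated() (or the wandb system panel, as in the plots mentioned above) between trials should show the previous trial's allocation still resident when the next one starts.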
Updates to remedy the issue
If I re-write the hyperparameter_search fn with the following additions to run_hp_search_optuna (following advice in #1742), then the memory does appear to get de-allocated between tuning runs:
Full error / trace