GPU Out of Memory when repeatedly running large models (hyperparameter_search) #13019

Closed

acocos opened this issue Aug 5, 2021 · 4 comments · May be fixed by #35440
Comments

acocos commented Aug 5, 2021

Environment info

  • transformers version: 4.9.1
  • Platform: Linux-4.19.0-17-cloud-amd64-x86_64-with-debian-10.10
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0 (True)
  • Using GPU in script?: yes (4 x GPUs)
  • Using distributed or parallel set-up in script?: There are 4 GPUs on this machine; I'm letting the Trainer do its default thing here (it wraps the model in torch.nn.DataParallel, per the traceback below). I see that trainer.is_model_parallel = False.

Who can help

It looks like @sgugger has some related activity in the Trainer; maybe he can point toward the right person to help?

Information

Model I am using (Bert, XLNet ...): distilbert-base-uncased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. I'm fine-tuning distilbert-base-uncased for sentence classification using the code below. The training set is limited to 10k sentences with binary labels; eval consists of 500 sentences.
  2. Hyperparameter search runs fine for the first ~2 iterations, and then I reliably see a CUDA out-of-memory error RuntimeError: CUDA out of memory... (full error pasted at the bottom of this issue).
    Looking at my wandb logs, I see that GPU memory is not freed between tuning runs (see the snippet after this list for one way to confirm this programmatically).
    [wandb chart: GPU memory utilization per run; purple is run-0, gray is run-1, blue is run-2]
  3. I think this is very closely related to, and possibly the same as, the issue in Out of Memory (OOM) when repeatedly running large models #1742.
  4. I have found that adding a few lines to the run_hp_search_optuna function to explicitly delete the model and de-allocate GPU memory between runs seems to resolve the problem (see below).
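
For reference, here is a minimal sketch (not from the original code; the callback class name is illustrative) of how the per-trial memory growth can be confirmed programmatically. It logs allocated/reserved GPU memory after each trial's training run and would be passed to the Trainer as callbacks=[GpuMemoryLogger()]:

import torch
from transformers import TrainerCallback

class GpuMemoryLogger(TrainerCallback):
    # Print per-device GPU memory after each call to trainer.train(),
    # i.e. once per hyperparameter-search trial.
    def on_train_end(self, args, state, control, **kwargs):
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 2**30
            reserved = torch.cuda.memory_reserved(i) / 2**30
            print(f"GPU {i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")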

Code that produces the issue

Running the following code yields the error after ~2 hyperparameter tuning runs.

## setup data
from datasets import DatasetDict

# train_file, dev_file, test_file, and to_classify_file are paths to local JSON files defined elsewhere
paths = {
    "train": train_file,
    "dev": dev_file,
    "test": test_file,
    "unlabeled": to_classify_file
}
raw_datasets = DatasetDict.from_json(paths)

## setup tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(x):
    return tokenizer(x["sentence"], x["source_column"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets.set_format("torch")

## setup model and metrics
import torch
import gc
from transformers import AutoModelForSequenceClassification
from datasets import load_metric

prec = load_metric("precision")
rec = load_metric("recall")
acc = load_metric("accuracy")
f1 = load_metric("f1")

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2, return_dict=True)

def f_b(p, r, b):
    num = (1 + b**2) * p * r
    den = (b**2 * p) + r
    if den == 0:
        return 0.
    return num/den

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    result = {}
    for mtrc in [prec, rec, acc, f1]:
        mtrc_result = mtrc.compute(predictions=predictions, references=labels)
        result.update(mtrc_result)
    result["f0.5"] = f_b(result["precision"], result["recall"], 0.5)
    return result

def compute_objective(metrics):
    return metrics["eval_f0.5"]

## run hyperparam tuning
from transformers import Trainer, TrainingArguments

gpus_per_trial = 1


n_hyperparam_search_examples = 10000

training_args = TrainingArguments(
    "ls_classifier_distilbert_hyperparams",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=250,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=0,
    weight_decay=0.1,
    logging_dir="./logs",
    report_to="wandb",
    load_best_model_at_end=True
)
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets["train"].shuffle(seed=123).select(range(n_hyperparam_search_examples)),
    eval_dataset=tokenized_datasets["dev"],
    compute_metrics=compute_metrics
)
trainer.hyperparameter_search(
    backend="optuna",
    compute_objective=compute_objective,
    n_trials=4,
    direction="maximize",
)
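
No hp_space is passed here, so the search falls back to transformers' default Optuna search space. A custom space could be supplied instead; the following is an illustrative sketch (the parameter ranges are placeholders, not from the original report):

def my_hp_space(trial):
    # Illustrative search space; values are made up for the example
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

trainer.hyperparameter_search(
    backend="optuna",
    hp_space=my_hp_space,
    compute_objective=compute_objective,
    n_trials=4,
    direction="maximize",
)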

Updates to remedy the issue

If I rewrite the hyperparameter_search function with the following additions to run_hp_search_optuna (following the advice in #1742), then the memory does appear to get de-allocated between tuning runs:

import os

# gc and torch are already imported above
from transformers.trainer_utils import (
    BestRun,
    HPSearchBackend,
    PREFIX_CHECKPOINT_DIR,
    default_hp_space,
)

def run_hp_search_optuna(trainer, n_trials, direction, **kwargs):
    import optuna
    
    def _objective(trial, checkpoint_dir=None):
        checkpoint = None
        if checkpoint_dir:
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                    checkpoint = os.path.join(checkpoint_dir, subdir)
        #################
        ## UPDATES START
        #################
        if not checkpoint:
            # free GPU memory
            del trainer.model
            gc.collect()
            torch.cuda.empty_cache()
        trainer.objective = None
        trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
        # If there hasn't been any evaluation during the training loop.
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
        return trainer.objective

    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    study = optuna.create_study(direction=direction, **kwargs)
    study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs)
    best_trial = study.best_trial
    return BestRun(str(best_trial.number), best_trial.value, best_trial.params)

def hyperparameter_search(trainer, compute_objective, n_trials, direction, **kwargs):
    trainer.hp_search_backend = HPSearchBackend.OPTUNA
    trainer.hp_space = default_hp_space[HPSearchBackend.OPTUNA]
    trainer.hp_name = None
    trainer.compute_objective = compute_objective
    best_run = run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
    trainer.hp_search_backend = None
    return best_run
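
With these overrides, the search is launched through the standalone helper instead of trainer.hyperparameter_search (a sketch using the trainer defined above; BestRun exposes run_id, objective, and hyperparameters):

best_run = hyperparameter_search(
    trainer,
    compute_objective=compute_objective,
    n_trials=4,
    direction="maximize",
)
print(best_run.run_id, best_run.objective, best_run.hyperparameters)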

Full error / trace

[W 2021-08-05 17:21:10,456] Trial 2 failed because of the following error: RuntimeError('Caught RuntimeError in replica 0 on device 0.\nOriginal Traceback (most recent call last):\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker\n    output = module(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 632, in forward\n    return_dict=return_dict,\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 495, in forward\n    return_dict=return_dict,\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 315, in forward\n    x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 264, in forward\n    output_attentions=output_attentions,\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 192, in forward\n    scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)\nRuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 14.76 GiB total capacity; 12.82 GiB already allocated; 727.75 MiB free; 12.93 GiB reserved in total by PyTorch)\n')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/opt/conda/lib/python3.7/site-packages/transformers/integrations.py", line 140, in _objective
    trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1280, in train
    tr_loss += self.training_step(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1773, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1805, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 632, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 495, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 315, in forward
    x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 264, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 192, in forward
    scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 14.76 GiB total capacity; 12.82 GiB already allocated; 727.75 MiB free; 12.93 GiB reserved in total by PyTorch)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_10884/1040859948.py in <module>
     35     compute_objective=compute_objective,
     36     n_trials=4,
---> 37     direction="maximize",
     38 )
     39 # trainer.is_model_parallel

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   1698 
   1699         run_hp_search = run_hp_search_optuna if backend == HPSearchBackend.OPTUNA else run_hp_search_ray
-> 1700         best_run = run_hp_search(self, n_trials, direction, **kwargs)
   1701 
   1702         self.hp_search_backend = None

/opt/conda/lib/python3.7/site-packages/transformers/integrations.py in run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
    148     n_jobs = kwargs.pop("n_jobs", 1)
    149     study = optuna.create_study(direction=direction, **kwargs)
--> 150     study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs)
    151     best_trial = study.best_trial
    152     return BestRun(str(best_trial.number), best_trial.value, best_trial.params)

/opt/conda/lib/python3.7/site-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
    407             callbacks=callbacks,
    408             gc_after_trial=gc_after_trial,
--> 409             show_progress_bar=show_progress_bar,
    410         )
    411 

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
     74                 reseed_sampler_rng=False,
     75                 time_start=None,
---> 76                 progress_bar=progress_bar,
     77             )
     78         else:

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar)
    161 
    162         try:
--> 163             trial = _run_trial(study, func, catch)
    164         except Exception:
    165             raise

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
    262 
    263     if state == TrialState.FAIL and func_err is not None and not isinstance(func_err, catch):
--> 264         raise func_err
    265     return trial
    266 

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
    211 
    212     try:
--> 213         value_or_values = func(trial)
    214     except exceptions.TrialPruned as e:
    215         # TODO(mamu): Handle multi-objective cases.

/opt/conda/lib/python3.7/site-packages/transformers/integrations.py in _objective(trial, checkpoint_dir)
    138                     checkpoint = os.path.join(checkpoint_dir, subdir)
    139         trainer.objective = None
--> 140         trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
    141         # If there hasn't been any evaluation during the training loop.
    142         if getattr(trainer, "objective", None) is None:

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1278                         tr_loss += self.training_step(model, inputs)
   1279                 else:
-> 1280                     tr_loss += self.training_step(model, inputs)
   1281                 self.current_flos += float(self.floating_point_ops(inputs))
   1282 

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in training_step(self, model, inputs)
   1771                 loss = self.compute_loss(model, inputs)
   1772         else:
-> 1773             loss = self.compute_loss(model, inputs)
   1774 
   1775         if self.args.n_gpu > 1:

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1803         else:
   1804             labels = None
-> 1805         outputs = model(**inputs)
   1806         # Save past state if it exists
   1807         # TODO: this needs to be fixed and made cleaner later.

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    166                 return self.module(*inputs[0], **kwargs[0])
    167             replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 168             outputs = self.parallel_apply(replicas, inputs, kwargs)
    169             return self.gather(outputs, self.output_device)
    170 

/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    176 
    177     def parallel_apply(self, replicas, inputs, kwargs):
--> 178         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    179 
    180     def gather(self, outputs, output_device):

/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     84         output = results[i]
     85         if isinstance(output, ExceptionWrapper):
---> 86             output.reraise()
     87         outputs.append(output)
     88     return outputs

/opt/conda/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
    423             # have message field
    424             raise self.exc_type(message=msg)
--> 425         raise self.exc_type(msg)
    426 
    427 

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 632, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 495, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 315, in forward
    x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 264, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 192, in forward
    scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 14.76 GiB total capacity; 12.82 GiB already allocated; 727.75 MiB free; 12.93 GiB reserved in total by PyTorch)

sgugger (Collaborator) commented Aug 5, 2021

Thanks for the issue and the investigation. It looks like you have found the right fix, would you mind making a PR with it?

github-actions bot commented Sep 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Imaginny

I'm experiencing the exact same problem. Sadly, the suggested solution doesn't work for me. At first I had the impression that the OutOfMemoryError shows up a bit later now (sometimes after 6–8 iterations instead of 2), but that might be a coincidence.
I'm using Python 3.10.11, PyTorch 2.0.1, and one GPU with 24 GiB of memory, on Linux (Ubuntu 20.04.1, x86_64) on AWS.

@ilanazim

I'm also experiencing the same error. Memory increases at every parameter change until an OOM is reached.
[screenshot: GPU memory usage climbing across trials]
