GPU Out of Memory when repeatedly running large models (hyperparameter_search) #13019

Closed

acocos opened this issue Aug 5, 2021 · 4 comments · May be fixed by #35440
Comments

acocos commented Aug 5, 2021

Environment info

  • transformers version: 4.9.1
  • Platform: Linux-4.19.0-17-cloud-amd64-x86_64-with-debian-10.10
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.9.0 (True)
  • Using GPU in script?: yes (4 x GPUs)
  • Using distributed or parallel set-up in script?: There are 4 GPUs on this machine; I'm letting the Trainer do its default thing here (it wraps the model in torch.nn.DataParallel, per the traceback below). I see that trainer.is_model_parallel = False.

Who can help

It looks like @sgugger has some related activity in the Trainer; maybe he can point toward the right person to help?

Information

Model I am using (Bert, XLNet ...): distilbert-base-uncased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. I'm fine-tuning distilbert-base-uncased for sentence classification using the code below. The training set is limited to 10k sentences with binary labels; eval consists of 500 sentences.
  2. Hyperparameter search runs fine for the first ~2 iterations, and then I reliably see a CUDA out-of-memory error RuntimeError: CUDA out of memory... (full error pasted at the bottom of this issue).
    Looking at my wandb logs, I see that GPU memory is not freed between tuning runs (see the snippet after this list for one way to confirm this programmatically).
    [wandb chart: GPU memory utilization per run; purple is run-0, gray is run-1, blue is run-2]
  3. I think this is very closely related to, and possibly the same as, the issue in Out of Memory (OOM) when repeatedly running large models #1742.
  4. I have found that adding a few lines to the run_hp_search_optuna function to explicitly delete the model and de-allocate GPU memory between runs seems to resolve the problem (see below).
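
For reference, here is a minimal sketch (not from the original code; the callback class name is illustrative) of how the per-trial memory growth can be confirmed programmatically. It logs allocated/reserved GPU memory after each trial's training run and would be passed to the Trainer as callbacks=[GpuMemoryLogger()]:

import torch
from transformers import TrainerCallback

class GpuMemoryLogger(TrainerCallback):
    # Print per-device GPU memory after each call to trainer.train(),
    # i.e. once per hyperparameter-search trial.
    def on_train_end(self, args, state, control, **kwargs):
        for i in range(torch.cuda.device_count()):
            allocated = torch.cuda.memory_allocated(i) / 2**30
            reserved = torch.cuda.memory_reserved(i) / 2**30
            print(f"GPU {i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")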

Code that produces the issue

Running the following code yields the error after ~2 hyperparameter tuning runs.

## setup data
from datasets import DatasetDict

# train_file, dev_file, test_file, and to_classify_file are paths to local JSON files defined elsewhere
paths = {
    "train": train_file,
    "dev": dev_file,
    "test": test_file,
    "unlabeled": to_classify_file
}
raw_datasets = DatasetDict.from_json(paths)

## setup tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_function(x):
    return tokenizer(x["sentence"], x["source_column"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets.set_format("torch")

## setup model and metrics
import torch
import gc
from transformers import AutoModelForSequenceClassification
from datasets import load_metric

prec = load_metric("precision")
rec = load_metric("recall")
acc = load_metric("accuracy")
f1 = load_metric("f1")

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2, return_dict=True)

def f_b(p, r, b):
    num = (1 + b**2) * p * r
    den = (b**2 * p) + r
    if den == 0:
        return 0.
    return num/den

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    result = {}
    for mtrc in [prec, rec, acc, f1]:
        mtrc_result = mtrc.compute(predictions=predictions, references=labels)
        result.update(mtrc_result)
    result["f0.5"] = f_b(result["precision"], result["recall"], 0.5)
    return result

def compute_objective(metrics):
    return metrics["eval_f0.5"]

## run hyperparam tuning
from transformers import Trainer, TrainingArguments

gpus_per_trial = 1


n_hyperparam_search_examples = 10000

training_args = TrainingArguments(
    "ls_classifier_distilbert_hyperparams",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    num_train_epochs=2,
    evaluation_strategy="steps",
    eval_steps=250,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=0,
    weight_decay=0.1,
    logging_dir="./logs",
    report_to="wandb",
    load_best_model_at_end=True
)
trainer = Trainer(
    model_init=model_init,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets["train"].shuffle(seed=123).select(range(n_hyperparam_search_examples)),
    eval_dataset=tokenized_datasets["dev"],
    compute_metrics=compute_metrics
)
trainer.hyperparameter_search(
    backend="optuna",
    compute_objective=compute_objective,
    n_trials=4,
    direction="maximize",
)
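
No hp_space is passed here, so the search falls back to transformers' default Optuna search space. A custom space could be supplied instead; the following is an illustrative sketch (the parameter ranges are placeholders, not from the original report):

def my_hp_space(trial):
    # Illustrative search space; values are made up for the example
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

trainer.hyperparameter_search(
    backend="optuna",
    hp_space=my_hp_space,
    compute_objective=compute_objective,
    n_trials=4,
    direction="maximize",
)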

Updates to remedy the issue

If I rewrite the hyperparameter_search function with the following additions to run_hp_search_optuna (following the advice in #1742), then the memory does appear to get de-allocated between tuning runs:

import os

# gc and torch are already imported above
from transformers.trainer_utils import (
    BestRun,
    HPSearchBackend,
    PREFIX_CHECKPOINT_DIR,
    default_hp_space,
)

def run_hp_search_optuna(trainer, n_trials, direction, **kwargs):
    import optuna
    
    def _objective(trial, checkpoint_dir=None):
        checkpoint = None
        if checkpoint_dir:
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                    checkpoint = os.path.join(checkpoint_dir, subdir)
        #################
        ## UPDATES START
        #################
        if not checkpoint:
            # free GPU memory
            del trainer.model
            gc.collect()
            torch.cuda.empty_cache()
        trainer.objective = None
        trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
        # If there hasn't been any evaluation during the training loop.
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
        return trainer.objective

    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    study = optuna.create_study(direction=direction, **kwargs)
    study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs)
    best_trial = study.best_trial
    return BestRun(str(best_trial.number), best_trial.value, best_trial.params)

def hyperparameter_search(trainer, compute_objective, n_trials, direction, **kwargs):
    trainer.hp_search_backend = HPSearchBackend.OPTUNA
    trainer.hp_space = default_hp_space[HPSearchBackend.OPTUNA]
    trainer.hp_name = None
    trainer.compute_objective = compute_objective
    best_run = run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
    trainer.hp_search_backend = None
    return best_run
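
With these overrides, the search is launched through the standalone helper instead of trainer.hyperparameter_search (a sketch using the trainer defined above; BestRun exposes run_id, objective, and hyperparameters):

best_run = hyperparameter_search(
    trainer,
    compute_objective=compute_objective,
    n_trials=4,
    direction="maximize",
)
print(best_run.run_id, best_run.objective, best_run.hyperparameters)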

Full error / trace

[W 2021-08-05 17:21:10,456] Trial 2 failed because of the following error: RuntimeError('Caught RuntimeError in replica 0 on device 0.\nOriginal Traceback (most recent call last):\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker\n    output = module(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 632, in forward\n    return_dict=return_dict,\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 495, in forward\n    return_dict=return_dict,\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 315, in forward\n    x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 264, in forward\n    output_attentions=output_attentions,\n  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl\n    return forward_call(*input, **kwargs)\n  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 192, in forward\n    scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)\nRuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 14.76 GiB total capacity; 12.82 GiB already allocated; 727.75 MiB free; 12.93 GiB reserved in total by PyTorch)\n')
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/opt/conda/lib/python3.7/site-packages/transformers/integrations.py", line 140, in _objective
    trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1280, in train
    tr_loss += self.training_step(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1773, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/trainer.py", line 1805, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 632, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 495, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 315, in forward
    x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 264, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 192, in forward
    scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 14.76 GiB total capacity; 12.82 GiB already allocated; 727.75 MiB free; 12.93 GiB reserved in total by PyTorch)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_10884/1040859948.py in <module>
     35     compute_objective=compute_objective,
     36     n_trials=4,
---> 37     direction="maximize",
     38 )
     39 # trainer.is_model_parallel

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in hyperparameter_search(self, hp_space, compute_objective, n_trials, direction, backend, hp_name, **kwargs)
   1698 
   1699         run_hp_search = run_hp_search_optuna if backend == HPSearchBackend.OPTUNA else run_hp_search_ray
-> 1700         best_run = run_hp_search(self, n_trials, direction, **kwargs)
   1701 
   1702         self.hp_search_backend = None

/opt/conda/lib/python3.7/site-packages/transformers/integrations.py in run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
    148     n_jobs = kwargs.pop("n_jobs", 1)
    149     study = optuna.create_study(direction=direction, **kwargs)
--> 150     study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs)
    151     best_trial = study.best_trial
    152     return BestRun(str(best_trial.number), best_trial.value, best_trial.params)

/opt/conda/lib/python3.7/site-packages/optuna/study/study.py in optimize(self, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
    407             callbacks=callbacks,
    408             gc_after_trial=gc_after_trial,
--> 409             show_progress_bar=show_progress_bar,
    410         )
    411 

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _optimize(study, func, n_trials, timeout, n_jobs, catch, callbacks, gc_after_trial, show_progress_bar)
     74                 reseed_sampler_rng=False,
     75                 time_start=None,
---> 76                 progress_bar=progress_bar,
     77             )
     78         else:

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _optimize_sequential(study, func, n_trials, timeout, catch, callbacks, gc_after_trial, reseed_sampler_rng, time_start, progress_bar)
    161 
    162         try:
--> 163             trial = _run_trial(study, func, catch)
    164         except Exception:
    165             raise

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
    262 
    263     if state == TrialState.FAIL and func_err is not None and not isinstance(func_err, catch):
--> 264         raise func_err
    265     return trial
    266 

/opt/conda/lib/python3.7/site-packages/optuna/study/_optimize.py in _run_trial(study, func, catch)
    211 
    212     try:
--> 213         value_or_values = func(trial)
    214     except exceptions.TrialPruned as e:
    215         # TODO(mamu): Handle multi-objective cases.

/opt/conda/lib/python3.7/site-packages/transformers/integrations.py in _objective(trial, checkpoint_dir)
    138                     checkpoint = os.path.join(checkpoint_dir, subdir)
    139         trainer.objective = None
--> 140         trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
    141         # If there hasn't been any evaluation during the training loop.
    142         if getattr(trainer, "objective", None) is None:

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1278                         tr_loss += self.training_step(model, inputs)
   1279                 else:
-> 1280                     tr_loss += self.training_step(model, inputs)
   1281                 self.current_flos += float(self.floating_point_ops(inputs))
   1282 

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in training_step(self, model, inputs)
   1771                 loss = self.compute_loss(model, inputs)
   1772         else:
-> 1773             loss = self.compute_loss(model, inputs)
   1774 
   1775         if self.args.n_gpu > 1:

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   1803         else:
   1804             labels = None
-> 1805         outputs = model(**inputs)
   1806         # Save past state if it exists
   1807         # TODO: this needs to be fixed and made cleaner later.

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    166                 return self.module(*inputs[0], **kwargs[0])
    167             replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 168             outputs = self.parallel_apply(replicas, inputs, kwargs)
    169             return self.gather(outputs, self.output_device)
    170 

/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    176 
    177     def parallel_apply(self, replicas, inputs, kwargs):
--> 178         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    179 
    180     def gather(self, outputs, output_device):

/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     84         output = results[i]
     85         if isinstance(output, ExceptionWrapper):
---> 86             output.reraise()
     87         outputs.append(output)
     88     return outputs

/opt/conda/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
    423             # have message field
    424             raise self.exc_type(message=msg)
--> 425         raise self.exc_type(msg)
    426 
    427 

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 632, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 495, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 315, in forward
    x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 264, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 192, in forward
    scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 14.76 GiB total capacity; 12.82 GiB already allocated; 727.75 MiB free; 12.93 GiB reserved in total by PyTorch)

sgugger (Collaborator) commented Aug 5, 2021

Thanks for the issue and the investigation. It looks like you have found the right fix, would you mind making a PR with it?

github-actions bot commented Sep 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Imaginny

I'm experiencing the exact same problem. Sadly, the suggested solution doesn't work for me. At first I had the impression that the OutOfMemoryError shows up a bit later now (sometimes after 6–8 iterations instead of 2), but that might be a coincidence.
I'm using Python 3.10.11, PyTorch 2.0.1, and one GPU with 24 GiB of memory, on Linux (Ubuntu 20.04.1, x86_64) on AWS.

@ilanazim

I'm also experiencing the same error. Memory increases at every parameter change until an OOM is reached.
[screenshot: GPU memory usage climbing across trials]
