
Does anyone else get CUDA out of memory during hyperparameter search? #311

Open
bogedy opened this issue Feb 12, 2023 · 6 comments

@bogedy
Contributor

bogedy commented Feb 12, 2023

I had this problem and I see that in the repo's hyperparameter notebook someone else had this problem too! https://github.com/huggingface/setfit/blob/main/notebooks/text-classification_hyperparameter-search.ipynb

I fixed it by following this advice here huggingface/transformers#13019

I wanted to make a pull request, but when I tried to reproduce the issue later (after pulling new changes) I couldn't. The memory use stayed constant over all the trials. Did e1a5375 fix this? I'm curious. Would love to supply a PR if it's helpful, but maybe it's already fixed.

@tomaarsen
Member

tomaarsen commented Feb 13, 2023

Hello!

Just intuitively, I wouldn't expect e1a5375 to have fixed this issue. I'm aware that others have experienced OOM issues with the hyperparameter search, but I don't think anyone has successfully debugged it so far. In other words, I suspect the issue still persists.

  • Tom Aarsen

@ayala-usma

Hi @bogedy, can I ask how exactly you applied the suggestion in huggingface/transformers#13019?

I'm running into the same OutOfMemoryError when doing hyperparameter search with Optuna, but I'm not sure how to apply the suggestion in the issue you reference, since there is no checkpointing in SetFit's trainer. Please let me know.

  • Aurelia

@bogedy
Contributor Author

bogedy commented May 24, 2023

Could you share which versions of SetFit, Optuna, and PyTorch, and which base model you're using, so I can try to reproduce?

I had to edit the SetFit source code. It's in the second code block under "Updates to remedy the issue" in that issue. Basically it's a hacky workaround: the _objective function gets called at the end of each trial to evaluate the trial, and you add some code to it that deletes the model and runs the garbage collector. That's fine as long as the cleanup comes after you run your evaluation; see the sketch below.
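
For anyone looking for the shape of that edit, here is a minimal, hypothetical sketch of the tail end of _objective; the real function lives inside SetFit's hyperparameter search code, closes over the trainer, and its exact signature depends on your version:

import gc
import torch

def _objective(trial):
    # ... training for this trial happens above this point (omitted) ...
    # `trainer` is the enclosing SetFit trainer (a closure variable in the real code).
    metrics = trainer.evaluate()                    # evaluate the trial first
    score = trainer.compute_objective(metrics)

    # The workaround: only after evaluation, drop the trial's model and
    # reclaim GPU memory so the next trial starts from a clean slate.
    del trainer.model
    gc.collect()
    torch.cuda.empty_cache()

    return score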

This error is common enough with Optuna that they have documentation on it and an argument to run garbage collection automatically: https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-avoid-running-out-of-memory-oom-when-optimizing-studies
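
For the Optuna-side mitigation, the relevant knob is Study.optimize(..., gc_after_trial=True). A minimal example (the objective body here is just a placeholder):

import optuna

def objective(trial):
    # Placeholder: build/train a model here and return the metric to optimize.
    lr = trial.suggest_float("lr", 1e-6, 1e-3, log=True)
    return lr  # stand-in for a real validation score

study = optuna.create_study(direction="maximize")
# gc_after_trial=True makes Optuna run gc.collect() after each trial,
# which helps release GPU memory held by per-trial objects.
study.optimize(objective, n_trials=10, gc_after_trial=True)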

@azagajewski

Still an active issue with Optuna. I'm not using Hugging Face, but running an Optuna hyperparameter optimization with big Keras models is impossible because the GPU memory allocator bugs out before a trial even begins, unless it is done with trivially small batch sizes.

@julioc-p

julioc-p commented Jun 24, 2024

Hi! I had the same problem and got a working version by rewriting the hyperparameter_search function following this issue: huggingface/transformers#13019

I just updated it to match the current state of the module:

# Imports needed for the snippet below.
import gc
import os
from typing import List, Union

import optuna
import torch
from transformers.trainer_utils import (
    PREFIX_CHECKPOINT_DIR,
    BestRun,
    HPSearchBackend,
    default_compute_objective,
)


def run_hp_search_optuna(trainer, n_trials, direction, **kwargs):

    def _objective(trial, checkpoint_dir=None):
        checkpoint = None
        if checkpoint_dir:
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                    checkpoint = os.path.join(checkpoint_dir, subdir)
        if not checkpoint:
            # No checkpoint to resume from: delete the previous trial's model
            # and free GPU memory before this trial trains a new one.
            del trainer.model
            gc.collect()
            torch.cuda.empty_cache()
        trainer.objective = None
        trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
        # If there hasn't been any evaluation during the training loop.
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
        return trainer.objective

    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    study = optuna.create_study(direction=direction, **kwargs)
    study.optimize(_objective, n_trials=n_trials,
                   timeout=timeout, n_jobs=n_jobs)
    best_trial = study.best_trial
    return BestRun(str(best_trial.number), best_trial.value, best_trial.params)


def hyperparameter_search(
    self,
    hp_space,
    n_trials,
    direction,
) -> Union[BestRun, List[BestRun]]:

    self.hp_search_backend = HPSearchBackend.OPTUNA
    self.hp_space = hp_space
    self.hp_name = None
    self.compute_objective = default_compute_objective
    best_run = run_hp_search_optuna(self, n_trials, direction)
    self.hp_search_backend = None
    return best_run
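
For reference, this is roughly how the patched method could be called; the search-space parameters are illustrative and depend on your SetFit version:

def hp_space(trial):
    # Illustrative search space; adjust to whatever your SetFit version supports.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32]),
    }

best_run = trainer.hyperparameter_search(hp_space=hp_space, n_trials=20, direction="maximize")
print(best_run)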

@julioc-p

Although it is probably better to just run Optuna with gc_after_trial=True.
