
Does anyone else get CUDA out of memory during hyperparameter search? #311

Open
bogedy opened this issue Feb 12, 2023 · 6 comments

@bogedy
Contributor

bogedy commented Feb 12, 2023

I had this problem and I see that in the repo's hyperparameter notebook someone else had this problem too! https://github.com/huggingface/setfit/blob/main/notebooks/text-classification_hyperparameter-search.ipynb

I fixed it by following this advice here huggingface/transformers#13019

I wanted to make a pull request, but when I tried to reproduce the issue later (after pulling new changes) I couldn't. The memory use stayed constant over all the trials. Did e1a5375 fix this? I'm curious. Would love to supply a PR if it's helpful, but maybe it's already fixed.

@tomaarsen
Member

tomaarsen commented Feb 13, 2023

Hello!

Just intuitively, I wouldn't expect e1a5375 to have fixed this issue. I'm aware that others have experienced OOM issues with the hyperparameter search, but I don't think anyone has successfully debugged it so far. In other words, I suspect the issue still persists.

  • Tom Aarsen

@ayala-usma

Hi @bogedy, can I ask how exactly you applied the suggestion in huggingface/transformers#13019?

I'm running into the same OutOfMemoryError when doing hyperparameter search with Optuna, but I'm not sure how to apply the suggestion in the issue you reference, since there is no checkpointing in SetFit's trainer. Please let me know.

  • Aurelia

@bogedy
Contributor Author

bogedy commented May 24, 2023

Could you share which versions of SetFit, Optuna, and PyTorch, and which base model you're using, so I can try to reproduce?

I had to edit the SetFit source code. It's in the second code block under "Updates to remedy the issue" in that issue. Basically it's a hacky workaround: the _objective function gets called at the end of each trial to evaluate the trial, and you add some code to it that deletes the model and runs the garbage collector. That's fine as long as the cleanup comes after you run your evaluation; see the sketch below.
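
For anyone looking for the shape of that edit, here is a minimal, hypothetical sketch of the tail end of _objective; the real function lives inside SetFit's hyperparameter search code, closes over the trainer, and its exact signature depends on your version:

import gc
import torch

def _objective(trial):
    # ... training for this trial happens above this point (omitted) ...
    # `trainer` is the enclosing SetFit trainer (a closure variable in the real code).
    metrics = trainer.evaluate()                    # evaluate the trial first
    score = trainer.compute_objective(metrics)

    # The workaround: only after evaluation, drop the trial's model and
    # reclaim GPU memory so the next trial starts from a clean slate.
    del trainer.model
    gc.collect()
    torch.cuda.empty_cache()

    return score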

This error is common enough with Optuna that they have documentation on it and an argument to run garbage collection automatically: https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-avoid-running-out-of-memory-oom-when-optimizing-studies
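
For the Optuna-side mitigation, the relevant knob is Study.optimize(..., gc_after_trial=True). A minimal example (the objective body here is just a placeholder):

import optuna

def objective(trial):
    # Placeholder: build/train a model here and return the metric to optimize.
    lr = trial.suggest_float("lr", 1e-6, 1e-3, log=True)
    return lr  # stand-in for a real validation score

study = optuna.create_study(direction="maximize")
# gc_after_trial=True makes Optuna run gc.collect() after each trial,
# which helps release GPU memory held by per-trial objects.
study.optimize(objective, n_trials=10, gc_after_trial=True)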

@azagajewski

Still an active issue with Optuna. I'm not using Hugging Face, but running an Optuna hyperparameter optimization with big Keras models is impossible because the GPU memory allocator bugs out before a trial even begins, unless it is done with trivially small batch sizes.

@julioc-p

julioc-p commented Jun 24, 2024

Hi! I had the same problem and got a working version by rewriting the hyperparameter_search function following this issue: huggingface/transformers#13019

I just updated it to match the current state of the module:

# Imports needed for the snippet below.
import gc
import os
from typing import List, Union

import optuna
import torch
from transformers.trainer_utils import (
    PREFIX_CHECKPOINT_DIR,
    BestRun,
    HPSearchBackend,
    default_compute_objective,
)


def run_hp_search_optuna(trainer, n_trials, direction, **kwargs):

    def _objective(trial, checkpoint_dir=None):
        checkpoint = None
        if checkpoint_dir:
            for subdir in os.listdir(checkpoint_dir):
                if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                    checkpoint = os.path.join(checkpoint_dir, subdir)
        if not checkpoint:
            # No checkpoint to resume from: delete the previous trial's model
            # and free GPU memory before this trial trains a new one.
            del trainer.model
            gc.collect()
            torch.cuda.empty_cache()
        trainer.objective = None
        trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
        # If there hasn't been any evaluation during the training loop.
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
        return trainer.objective

    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    study = optuna.create_study(direction=direction, **kwargs)
    study.optimize(_objective, n_trials=n_trials,
                   timeout=timeout, n_jobs=n_jobs)
    best_trial = study.best_trial
    return BestRun(str(best_trial.number), best_trial.value, best_trial.params)


def hyperparameter_search(
    self,
    hp_space,
    n_trials,
    direction,
) -> Union[BestRun, List[BestRun]]:

    self.hp_search_backend = HPSearchBackend.OPTUNA
    self.hp_space = hp_space
    self.hp_name = None
    self.compute_objective = default_compute_objective
    best_run = run_hp_search_optuna(self, n_trials, direction)
    self.hp_search_backend = None
    return best_run
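
For reference, this is roughly how the patched method could be called; the search-space parameters are illustrative and depend on your SetFit version:

def hp_space(trial):
    # Illustrative search space; adjust to whatever your SetFit version supports.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32]),
    }

best_run = trainer.hyperparameter_search(hp_space=hp_space, n_trials=20, direction="maximize")
print(best_run)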

@julioc-p

Although it is probably better to just run Optuna with gc_after_trial=True.
