Allow additional keyword args to be passed to optuna hyperparameter search #31923

Closed
JanetVictorious opened this issue Jul 12, 2024 · 0 comments · Fixed by #31924
Labels
Feature request Request for a new feature

Comments

@JanetVictorious
Contributor

Feature request

CUDA out-of-memory errors during hyperparameter optimization have been raised before (old issue), but no fix has been implemented.

The fix would be quite simple: allow the additional argument gc_after_trial to be passed to hyperparameter_search() and forward it to study.optimize() inside run_hp_search_optuna():

def run_hp_search_optuna(trainer, n_trials: int, direction: str, **kwargs) -> BestRun:
    import optuna

    if trainer.args.process_index == 0:

        def _objective(trial, checkpoint_dir=None):
            checkpoint = None
            if checkpoint_dir:
                for subdir in os.listdir(checkpoint_dir):
                    if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                        checkpoint = os.path.join(checkpoint_dir, subdir)
            trainer.objective = None
            if trainer.args.world_size > 1:
                if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
                    raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
                trainer._hp_search_setup(trial)
                torch.distributed.broadcast_object_list(pickle.dumps(trainer.args), src=0)
                trainer.train(resume_from_checkpoint=checkpoint)
            else:
                trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
            return trainer.objective

        timeout = kwargs.pop("timeout", None)
        n_jobs = kwargs.pop("n_jobs", 1)
        gc_after_trial = kwargs.pop("gc_after_trial", False)  # <--- Added arg
        directions = direction if isinstance(direction, list) else None
        direction = None if directions is not None else direction
        study = optuna.create_study(direction=direction, directions=directions, **kwargs)
        study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs, gc_after_trial=gc_after_trial)  # <--- Added arg
        if not study._is_multi_objective():
            best_trial = study.best_trial
            return BestRun(str(best_trial.number), best_trial.value, best_trial.params)
        else:
            best_trials = study.best_trials
            return [BestRun(str(best.number), best.values, best.params) for best in best_trials]
    else:
        for i in range(n_trials):
            trainer.objective = None
            args_main_rank = list(pickle.dumps(trainer.args))
            if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
                raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
            torch.distributed.broadcast_object_list(args_main_rank, src=0)
            args = pickle.loads(bytes(args_main_rank))
            for key, value in asdict(args).items():
                if key != "local_rank":
                    setattr(trainer.args, key, value)
            trainer.train(resume_from_checkpoint=None)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
        return None

Then, in the training script, trainer.hyperparameter_search() would accept the additional argument gc_after_trial:

best_trial = trainer.hyperparameter_search(
    direction='minimize',
    backend='optuna',
    hp_space=_hp_space,
    n_trials=model_args.hpo_trials,
    gc_after_trial=True,  # request garbage collection after every trial
)
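
The kwarg handling in the proposed change can be illustrated without optuna installed. The sketch below uses hypothetical stand-ins (create_study_stub, optimize_stub and run_search, none of which exist in transformers or optuna) to show how popping timeout, n_jobs and gc_after_trial out of **kwargs routes them to the optimize call, while whatever remains is forwarded to the study constructor.

```python
def create_study_stub(**study_kwargs):
    """Hypothetical stand-in for optuna.create_study(); records what it receives."""
    return {"study_kwargs": study_kwargs}


def optimize_stub(study, n_trials, timeout=None, n_jobs=1, gc_after_trial=False):
    """Hypothetical stand-in for study.optimize(); records what it receives."""
    study["optimize_kwargs"] = {
        "n_trials": n_trials,
        "timeout": timeout,
        "n_jobs": n_jobs,
        "gc_after_trial": gc_after_trial,
    }
    return study


def run_search(n_trials, direction, **kwargs):
    """Mirror the kwarg handling of the proposed run_hp_search_optuna()."""
    # Arguments meant for the optimize call are removed from kwargs first...
    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    gc_after_trial = kwargs.pop("gc_after_trial", False)
    # ...so only study-level options remain for the constructor.
    study = create_study_stub(direction=direction, **kwargs)
    return optimize_stub(
        study, n_trials, timeout=timeout, n_jobs=n_jobs, gc_after_trial=gc_after_trial
    )


result = run_search(3, "minimize", gc_after_trial=True, study_name="demo")
print(result["study_kwargs"])
print(result["optimize_kwargs"])
```

Without the pop, gc_after_trial would stay in kwargs and be forwarded to the study constructor, which does not expect it.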

Motivation

I'm experiencing CUDA out-of-memory errors when running trainer.hyperparameter_search() with the optuna backend. After some investigation, I found that optuna recommends passing gc_after_trial=True to study.optimize() (optuna reference).
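
As the optuna documentation describes it, gc_after_trial=True makes the study run a full gc.collect() after each trial. The sketch below is not optuna's implementation. It is a minimal standard-library illustration, with a hypothetical Model class and run_trials loop, of why that can help: objects caught in a reference cycle are not released by reference counting alone and linger until a collection pass runs.

```python
import gc
import weakref


class Model:
    """Hypothetical stand-in for a per-trial object that holds a lot of memory."""

    def __init__(self):
        # A self-reference creates a cycle, so reference counting cannot free it.
        self.me = self


def run_trials(n_trials, gc_after_trial=False):
    """Run dummy trials; return how many trial objects are still alive afterwards."""
    refs = []
    for _ in range(n_trials):
        model = Model()
        refs.append(weakref.ref(model))
        del model  # drop the last external reference; the cycle keeps it alive
        if gc_after_trial:
            gc.collect()  # explicit full collection after the trial
    return sum(ref() is not None for ref in refs)


gc.disable()  # turn off automatic collection so the demo is deterministic
try:
    leaked = run_trials(5, gc_after_trial=False)
    gc.collect()  # clean up before the second run
    cleaned = run_trials(5, gc_after_trial=True)
finally:
    gc.enable()

print(leaked, cleaned)
```

In a real training run the lingering objects may hold references to GPU tensors, which is presumably why an explicit collection between trials can relieve memory pressure.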

My proposal is to allow gc_after_trial to be passed as a kwarg and picked up by the study.optimize() call in the run_hp_search_optuna() function (source code).

I would like to be able to pass this argument to trainer.hyperparameter_search() when using the optuna backend.

Your contribution

I can submit a PR for this change if there is value in this feature.
