Feature request
The issue of CUDA out-of-memory errors during hyperparameter optimization has been raised before (old issue), but no fix has been implemented to remedy it.
The fix would be quite simple: allow an additional argument, gc_after_trial, to be passed through hyperparameter_search():
```python
# Excerpt from transformers' Optuna integration (ParallelMode,
# PREFIX_CHECKPOINT_DIR and BestRun come from transformers itself).
import os
import pickle
from dataclasses import asdict

import torch


def run_hp_search_optuna(trainer, n_trials: int, direction: str, **kwargs) -> BestRun:
    import optuna

    if trainer.args.process_index == 0:

        def _objective(trial, checkpoint_dir=None):
            checkpoint = None
            if checkpoint_dir:
                for subdir in os.listdir(checkpoint_dir):
                    if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                        checkpoint = os.path.join(checkpoint_dir, subdir)
            trainer.objective = None
            if trainer.args.world_size > 1:
                if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
                    raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
                trainer._hp_search_setup(trial)
                torch.distributed.broadcast_object_list(pickle.dumps(trainer.args), src=0)
                trainer.train(resume_from_checkpoint=checkpoint)
            else:
                trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
            return trainer.objective

        timeout = kwargs.pop("timeout", None)
        n_jobs = kwargs.pop("n_jobs", 1)
        gc_after_trial = kwargs.pop("gc_after_trial", False)  # <--- Added arg
        directions = direction if isinstance(direction, list) else None
        direction = None if directions is not None else direction
        study = optuna.create_study(direction=direction, directions=directions, **kwargs)
        study.optimize(
            _objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs, gc_after_trial=gc_after_trial
        )  # <--- Added arg
        if not study._is_multi_objective():
            best_trial = study.best_trial
            return BestRun(str(best_trial.number), best_trial.value, best_trial.params)
        else:
            best_trials = study.best_trials
            return [BestRun(str(best.number), best.values, best.params) for best in best_trials]
    else:
        for i in range(n_trials):
            trainer.objective = None
            args_main_rank = list(pickle.dumps(trainer.args))
            if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
                raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
            torch.distributed.broadcast_object_list(args_main_rank, src=0)
            args = pickle.loads(bytes(args_main_rank))
            for key, value in asdict(args).items():
                if key != "local_rank":
                    setattr(trainer.args, key, value)
            trainer.train(resume_from_checkpoint=None)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
        return None
```
Then, in the training script, calling trainer.hyperparameter_search() would accept the additional argument gc_after_trial.
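As a rough, self-contained sketch of that plumbing (the function below is a mock that only mirrors how run_hp_search_optuna pops its known kwargs, not the real Trainer code), the new argument would travel from hyperparameter_search() down to study.optimize() like this:

```python
def run_hp_search_optuna_mock(n_trials, direction, **kwargs):
    """Mock of the kwargs plumbing in run_hp_search_optuna (illustration only)."""
    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    gc_after_trial = kwargs.pop("gc_after_trial", False)  # proposed new kwarg
    # In the real code, the leftover kwargs go to optuna.create_study(**kwargs)
    # and gc_after_trial is handed to study.optimize(...).
    return {"timeout": timeout, "n_jobs": n_jobs, "gc_after_trial": gc_after_trial}


# A training script would then call something like:
#   trainer.hyperparameter_search(direction="minimize", backend="optuna",
#                                 n_trials=20, gc_after_trial=True)
# and hyperparameter_search forwards its **kwargs unchanged:
resolved = run_hp_search_optuna_mock(20, "minimize", gc_after_trial=True)
print(resolved["gc_after_trial"])  # True
```

Because the argument rides along in **kwargs, no change to the hyperparameter_search() signature itself would be needed.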
Motivation
I'm experiencing CUDA out-of-memory errors when running trainer.hyperparameter_search() with the optuna backend. After investigating a bit, I found that optuna's recommendation is to pass gc_after_trial=True to study.optimize() (optuna reference).
My proposal is to allow gc_after_trial to be passed as a kwarg and picked up in the study.optimize() call in the run_hp_search_optuna() method (source code).
I would like to be able to pass this argument through trainer.hyperparameter_search() with the optuna backend.
Your contribution
I can submit a PR for this change if there is value in this feature.