[Feature Request] ASHA #1169
As an additional point, ASHA is not model-based, so the slow-down you're experiencing with a model-based optimizer comes from the model re-training taking longer and longer as the number of trials increases. One of the benefits of using SMAC's random forest surrogate is that it is generally faster than re-training the de-facto standard of a GP, whose exact training scales as O(n^3) in the number of trials. ASHA, in comparison, is (usually) fully non-model-based, so the speed-ups there largely come at the cost of the optimization taking longer to converge to a good solution: each new trial is sampled at random, independently of the samples that came before it. Still fully on board with an ASHA implementation in SMAC, just that better core utilization does not directly translate to better end performance, and perhaps running on fewer cores with a better optimization routine is still something worth considering.
Thanks for your answer. I'm not sure I understood fully correctly, but just to make sure: I don't mean to use ASHA as a standalone optimizer, but in conjunction with SMAC's random forest surrogate model, as a multi-fidelity intensifier in place of Hyperband. So, just to cut short trials suggested by the RF that underperform (early stopping).
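Something like this is what I have in mind. This is only a rough sketch assuming the SMAC3 2.x multi-fidelity API (the toy objective, budgets, and `eta` value are made up); today the intensifier returned by the facade is Hyperband, and an ASHA intensifier would simply slot into the same place:

```python
from ConfigSpace import ConfigurationSpace
from smac import MultiFidelityFacade, Scenario

# Toy search space and target function; `budget` would normally control
# the number of epochs or the subsample size.
cs = ConfigurationSpace({"x": (-5.0, 5.0)})

def train(config, seed: int = 0, budget: float = 10.0) -> float:
    return (config["x"] - 1.0) ** 2 / budget

scenario = Scenario(cs, n_trials=200, min_budget=1, max_budget=27, n_workers=8)

# The RF surrogate proposes configs; the intensifier decides which trials get
# promoted to higher budgets. This currently returns Hyperband; a future ASHA
# intensifier would be passed in its place.
intensifier = MultiFidelityFacade.get_intensifier(scenario, eta=3)

smac = MultiFidelityFacade(scenario, train, intensifier=intensifier)
incumbent = smac.optimize()
```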
Also, I guess, in my case, to eliminate the other probable cause of the low CPU load, it would make sense to increase that setting. However, could you suggest a particular value given the metrics above? More generally, what would be a good way to compute an optimal value for it? Also, should the Hyperband reduction factor play a role in establishing that optimal value?
Ideally, I sure wish there were a way to periodically re-adjust that value during the run. Come to think of it, though, one should ideally also take into consideration the time required to run a particular trial that has been (or will be) scheduled, and that time might depend on the parameter values sampled for a particular config, on the feature vector of the dataset, and, of course, on the allotted budget. A simple model will not suffice, so another complex model (like a random forest) would then be required to provide these time estimates (see the sketch below). But maybe that's not so bad, after all: this kind of model would allow algorithms to better choose exploration/exploitation balance strategies under given time/call budget constraints. Also, if the same surrogate model predicts the time to evaluate the cost function jointly with the objective costs being optimized, the time target might act as a regularizer of sorts and help improve the accuracy of the objective cost estimates. I remember HyperMapper used to train a separate predictor for feasibility; perhaps that's another thing that might tie into this.
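A rough sketch of the runtime-model idea (purely illustrative, not SMAC code; the synthetic data, feature layout, and names are all made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_trials = 500
configs = rng.uniform(size=(n_trials, 4))           # 4 hyperparameters scaled to [0, 1]
budgets = rng.choice([1.0, 3.0, 9.0, 27.0], n_trials)
# Pretend runtime grows with the budget and with one "expensive" hyperparameter.
runtimes = budgets * (1.0 + 5.0 * configs[:, 0]) * rng.lognormal(0.0, 0.2, n_trials)

# Fit a second forest mapping (config vector, budget) -> expected runtime, so the
# scheduler can weigh expected improvement against expected cost.
X = np.column_stack([configs, budgets])
runtime_model = RandomForestRegressor(n_estimators=100, random_state=0)
runtime_model.fit(X, np.log(runtimes))               # log-runtime is better behaved

candidate = np.array([[0.2, 0.5, 0.9, 0.1, 27.0]])   # config + budget about to be scheduled
expected_runtime = float(np.exp(runtime_model.predict(candidate))[0])
print(f"predicted runtime: {expected_runtime:.1f}s")
```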
Another nice thing one could do if there's a bottleneck like this would be to employ an async polling method, i.e., just use an older version of the surrogate model to provide trial configs when asked, while the new surrogate model is training (see the sketch at the end of this comment). I guess there's an optimal tradeoff to be calculated here, too: figuring out when it is better to wait a little while for the new results to be churned out by the surrogate model before starting a new time-consuming trial, and when it is better to just keep starting almost-random trials (i.e., because there is too little data to constrain the model) instead of remaining blocked. And I think that's a bit meta right there, because the decision of asking the old model now vs. waiting for the new one to finish training would, fundamentally, be based on the expected improvement in the accuracy of the surrogate model. And now my head hurts.

LE: Perhaps all this stuff about computing optimal tradeoffs, in this comment and in the one above, might be easier to solve if posed within the framework of Value of Information, utility functions, etc.?

LE2: VoI might also be a sensible paradigm for things like optimizing the tradeoff between exploration and exploitation (e.g., how much time should be allocated to roaming around with, say, random search in the hope of finding a better bucket / ravine in the acquisition function surface, while the remaining time is used to zero in on the currently best bucket's local minimum).
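Very roughly, something like this (illustrative only, not SMAC internals; the class and method names are made up):

```python
import threading
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class AsyncSurrogate:
    """Serve suggestions from the last fitted surrogate while a new one trains."""

    def __init__(self):
        self._model = None             # last fully trained snapshot (may be stale)
        self._lock = threading.Lock()
        self._refit_running = False

    def suggest(self, candidates: np.ndarray) -> np.ndarray:
        with self._lock:
            model = self._model
        if model is None:              # too little data yet: fall back to random
            return candidates[np.random.randint(len(candidates))]
        preds = model.predict(candidates)
        return candidates[int(np.argmin(preds))]

    def maybe_refit(self, X: np.ndarray, y: np.ndarray) -> None:
        """Kick off a background refit unless one is already running."""
        with self._lock:
            if self._refit_running:
                return
            self._refit_running = True

        def _work():
            new_model = RandomForestRegressor(n_estimators=50).fit(X, y)
            with self._lock:           # swap in the new snapshot atomically
                self._model = new_model
                self._refit_running = False

        threading.Thread(target=_work, daemon=True).start()
```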
Also, since the surrogate model is only trained on the main thread, would it perhaps be possible to have some GPU support to help minimize RF training times?
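Just as an illustration of the direction I mean (an assumption on my part, not an existing SMAC feature): a GPU-backed random forest such as RAPIDS cuML's mirrors the scikit-learn fit/predict interface, though it obviously requires a CUDA-capable GPU and the `cuml` package:

```python
import numpy as np
from cuml.ensemble import RandomForestRegressor  # GPU implementation

# Synthetic stand-in for the (config, cost) history the surrogate is fit on.
X = np.random.rand(10000, 8).astype(np.float32)
y = np.random.rand(10000).astype(np.float32)

gpu_rf = RandomForestRegressor(n_estimators=100)
gpu_rf.fit(X, y)                 # trains on the GPU
preds = gpu_rf.predict(X[:10])
```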
Sorry, I over-complicated things. It can be simpler than all that estimation: just use the old model if occupancy falls below 90%. Occupancy is the percentage of workers that are running, i.e., that aren't waiting for a config suggestion from the surrogate model.

LE: NOTE: I made this into a separate issue, #1170.
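In code, the rule is basically just this (names are made up):

```python
def should_wait_for_retrain(n_running: int, n_workers: int, threshold: float = 0.9) -> bool:
    """Wait for the freshly retrained surrogate only while occupancy is healthy;
    below the threshold, serve the next suggestion from the stale model instead."""
    occupancy = n_running / n_workers
    return occupancy >= threshold
```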
Asynchronous Successive Halving Algorithm (ASHA) is, as far as I can understand, a tweaked version of the Hyperband multi-fidelity intensifier (aka [ early stopping | pruning | scheduler ] algorithm) that makes better use of the available computing power when many CPU cores are available (it improves CPU occupancy).
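The core of it is the asynchronous promotion rule. The sketch below is my own paraphrase of the paper (not SMAC or ray[tune] code, and it ignores details like tracking which configs were already promoted): a config is promoted to the next rung as soon as it is in the top 1/eta of the results observed so far on its rung, instead of waiting for the whole rung to finish the way synchronous successive halving / Hyperband does.

```python
from collections import defaultdict

class ASHAPromoter:
    def __init__(self, eta: int = 3):
        self.eta = eta
        self.rungs = defaultdict(list)    # rung index -> losses observed so far

    def report(self, rung: int, loss: float) -> bool:
        """Record a finished result; return True if this config should be
        promoted to rung + 1 right away (no synchronization barrier)."""
        self.rungs[rung].append(loss)
        results = sorted(self.rungs[rung])
        k = len(results) // self.eta      # size of the current top 1/eta
        return k > 0 and loss <= results[k - 1]
```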
Here are a bunch of references:
I think here is the original repo in which the authors published the reference implementation, but a bunch of other optimizers implement it too nowadays (e.g., Oríon, Optuna). I've personally used `ray[tune]`'s implementation before (docs overview, API reference), and, at some point, I somehow ended up stepping through the code, and their implementation looked surprisingly simple (yes, I know... famous last words, but after reading research papers there's sometimes this shock when you read the code and see that the implementation could have been explained in much simpler words). A minimal usage sketch is included at the end of this post.

Motivation: I was checking out an optimization session running on 64 cores and noticed the CPU load drop below 50% after 12 hours. I'm not sure yet exactly how many trials it had gotten through at that point (after 22 hours it had gotten through 12600 trials, and the CPU load is under 40%).
Also, I'm not sure if this is really the bottleneck that throttles CPU usage; I know it can be something else, too, e.g., many processes waiting for the surrogate model to retrain before starting a new trial (that's why I also mentioned the number of finished trials above), or perhaps other reasons. It's kind of hard to debug, so I thought it would be useful to be able to eliminate some potential causes, and this is the one I thought of first.
However, if you have some suggestions on improving parallelism, please let me know.
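For reference, here is roughly what using `ray[tune]`'s ASHAScheduler looks like (a minimal sketch from memory; exact details such as the metric-reporting call vary between Ray versions, and all parameter values here are just illustrative):

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def trainable(config):
    acc = 0.0
    for _ in range(100):
        acc += config["lr"] * 0.01            # stand-in for one training epoch
        # Older Ray versions use keyword args instead, e.g. tune.report(accuracy=acc).
        tune.report({"accuracy": acc})        # ASHA stops underperforming trials early

scheduler = ASHAScheduler(metric="accuracy", mode="max",
                          max_t=100, grace_period=5, reduction_factor=3)

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=20, scheduler=scheduler),
)
tuner.fit()
```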