[Feature Request] ASHA #1169

Open · bbudescu opened this issue Nov 22, 2024 · 7 comments
bbudescu commented Nov 22, 2024

Asynchronous Successive Halving Algorithm (ASHA) is, as far as I understand, a tweaked version of the Hyperband multi-fidelity intensifier (aka [ early stopping | pruning | scheduler ] algorithm) that makes better use of the available computing power when many CPU cores are available (it improves CPU occupancy). A minimal sketch of the promotion rule follows the references below.

Here are a bunch of references:

  • a nice explanation in a book that I found helpful for understanding how it works
  • authors' 2018 blog post introducing the method
  • 2017 paper (rejected by ICLR 2018, accepted by MLSys 2020); arxiv preprint here
  • another docs entry offering a pretty nice explanation
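
To make the mechanism concrete, below is a minimal sketch of the promotion rule as I understand it (my own simplification in Python, not the reference implementation): a trial at rung r gets promoted as soon as its result lands in the top 1/eta of everything recorded at that rung so far, without waiting for the rung to fill up; that's the asynchronous part.

```python
# Minimal sketch of ASHA's promotion rule (my simplification, not the reference code).
from collections import defaultdict

class ASHASketch:
    def __init__(self, eta=3, min_budget=1, max_budget=81):
        self.eta = eta
        self.min_budget = min_budget
        self.max_budget = max_budget
        self.rungs = defaultdict(list)        # rung index -> losses observed at that rung

    def budget(self, rung):
        """Fidelity (budget) assigned to trials running at a given rung."""
        return min(self.min_budget * self.eta ** rung, self.max_budget)

    def report(self, rung, loss):
        """Record a result at `rung`; return True if the trial should be promoted
        to rung + 1 (continued at a larger budget), False if it should stop here."""
        self.rungs[rung].append(loss)
        if self.budget(rung) >= self.max_budget:
            return False                      # already at the top rung
        results = sorted(self.rungs[rung])
        k = len(results) // self.eta          # size of the "top 1/eta" cut at this rung
        return k > 0 and loss <= results[k - 1]
```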

I think this is the original repo in which the authors published the reference implementation, but nowadays a bunch of other optimizers implement it, too (e.g., Oríon, Optuna). I've personally used ray[tune]'s implementation before (docs overview, API reference), and at some point I somehow ended up stepping through the code; their implementation looked surprisingly simple (yes, I know... famous last words, but after reading research papers there's sometimes this shock when you read the code and see that the implementation could have been explained in much simpler words).
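
For reference, this is roughly how I've used it there (Ray 2.x style; train_fn and the search space are just placeholders, and the exact reporting API differs a bit across Ray versions):

```python
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Placeholder objective: report a loss once per "iteration" (the fidelity axis).
    for step in range(100):
        loss = (config["lr"] - 0.01) ** 2 + 1.0 / (step + 1)
        train.report({"loss": loss})

scheduler = ASHAScheduler(
    time_attr="training_iteration",  # fidelity axis
    max_t=100,                       # maximum budget per trial
    grace_period=1,                  # minimum budget before a trial may be stopped
    reduction_factor=3,              # eta
)
tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="loss", mode="min",
                                num_samples=64, scheduler=scheduler),
)
results = tuner.fit()
```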

Motivation: I was checking out an optimization session running on 64 cores and noticed the CPU load drop below 50% after 12 hours. I'm not sure yet exactly how many trials it had gotten through at that point (after 22 hours it had gotten through 12,600 trials, and the CPU load was under 40%).

Also, I'm not sure this is really the bottleneck that throttles CPU usage; I know it can be something else, too, e.g., many processes waiting for the surrogate model to retrain before starting a new trial (that's why I also mentioned the number of finished trials above), or perhaps other reasons. It's kind of hard to debug, so I thought it would be useful to be able to eliminate some potential causes, and this is the one I thought of first.

However, if you have some suggestions on improving parallelism, please let me know.

eddiebergman commented Nov 22, 2024

As an additional point, ASHA is not model-based, so the slow-down you're experiencing, if you're using a model-based optimizer, is likely that the model re-training takes longer and longer as the number of trials increases. One of the benefits of using SMAC's random forest surrogate is that it is generally faster to re-train than the de-facto standard of a GP, whose exact re-training scales cubically in the number of trials.

ASHA, in comparison, is (usually) fully non-model-based, so the speed-ups it offers largely come at the cost of the optimization taking longer to converge to a good solution: each new trial is sampled at random, independently of the samples that came before it.

Still fully on board with an ASHA implementation in SMAC; just note that better core utilization does not directly translate to better end performance, and running on fewer cores with a better optimization routine may still be worth considering.

bbudescu commented Nov 22, 2024

Thanks for your answer. I'm not sure I understood fully correctly, but just to make sure: I don't mean to use ASHA as a standalone optimizer, but in conjunction with SMAC's random forest surrogate model, as a multi-fidelity intensifier in place of Hyperband. So, just to cut short underperforming trials suggested by the RF (early stopping).
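
Concretely, what I have in mind is keeping a setup like the one below (a sketch based on my reading of the SMAC3 2.x docs; configspace and train are placeholders) and only swapping the intensifier, so an ASHA intensifier would plug in exactly where Hyperband does today:

```python
from smac import MultiFidelityFacade, Scenario
from smac.intensifier.hyperband import Hyperband

# configspace: a ConfigSpace.ConfigurationSpace
# train(config, seed, budget) -> cost; both are placeholders here
scenario = Scenario(
    configspace,
    n_trials=20000,
    min_budget=1,
    max_budget=100,
    n_workers=64,
)
intensifier = Hyperband(scenario, eta=3)   # <- an ASHA intensifier would go here instead
smac = MultiFidelityFacade(scenario, train, intensifier=intensifier)
incumbent = smac.optimize()
```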

bbudescu (Author) commented:
Also, I guess that in my case, to eliminate the other probable cause of the low CPU load, it would make sense to increase retrain_after, as suggested in the docs.

However, could you suggest a particular value given the metrics above? More generally, what would be a good way to compute an optimal value for retrain_after, given the average time it takes to run a trial and the number of workers, in order to keep CPU load above a certain threshold of, e.g., 90%?

Also, should the hyperband reduction factor play a role in establishing the optimal value for retrain_after?
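
For what it's worth, the back-of-envelope estimate I have in mind is something like this (assuming, possibly wrongly, that every retrain blocks all new suggestions): with n_workers workers and an average trial time t_trial, trials finish at roughly n_workers / t_trial per second, a retrain of duration t_fit is triggered every retrain_after finished trials, so the fraction of time lost to retraining is about n_workers * t_fit / (t_trial * retrain_after); keeping that below 1 - target gives a lower bound on retrain_after.

```python
import math

def min_retrain_after(n_workers, t_trial, t_fit, target_occupancy=0.9):
    """Smallest retrain_after that keeps the time lost to surrogate retraining
    below (1 - target_occupancy), assuming each retrain blocks all new
    config suggestions for t_fit seconds (a simplification)."""
    return math.ceil(n_workers * t_fit / (t_trial * (1.0 - target_occupancy)))

# Hypothetical numbers: 64 workers, ~400 s per trial (roughly what 12600 trials
# in 22 h on 64 cores works out to), 20 s per surrogate refit, 90% target.
print(min_retrain_after(n_workers=64, t_trial=400.0, t_fit=20.0))  # -> 32
```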

bbudescu (Author) commented:
Ideally, I wish there was a way to re-adjust retrain_after periodically and automatically: an adaptive heuristic that uses a simple model-based estimate of how long it will take to train the surrogate model given the number of finished trial results (perhaps even something as simple as a regression would suffice, if there's no aim to do transfer learning across config spaces and instance features), to make sure that the CPU load always stays at, let's say, at least 90%.
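
As a rough sketch of what I mean (assuming a history of observed surrogate fit times is available, and reusing the occupancy bound from my previous comment):

```python
import numpy as np

def adapt_retrain_after(fit_history, n_workers, t_trial, target_occupancy=0.9):
    """fit_history: list of (n_finished_trials, fit_seconds) pairs observed so far.
    Extrapolate the next surrogate fit time with a crude polynomial regression and
    plug it into the occupancy bound to pick the next retrain_after."""
    n, t = np.array(fit_history, dtype=float).T
    coeffs = np.polyfit(n, t, deg=2)              # simple model of fit time vs. #trials
    t_fit_next = float(np.polyval(coeffs, n[-1] + 1))
    bound = n_workers * t_fit_next / (t_trial * (1.0 - target_occupancy))
    return max(1, int(np.ceil(bound)))
```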

Come to think of it, though, ideally one must also take into consideration the time required to run a particular trial that has been (or will be) scheduled, and that time might depend on the parameter values of the sampled config, on the feature vector of the dataset, and, of course, on the allotted budget. A simple model would not suffice, so another complex model (like a random forest) would then need to be trained to provide these time estimates.

But that's maybe not so bad, after all. This kind of model would allow algorithms to better choose exploration/exploitation balance strategies under given time/call-budget constraints. Also, if the same surrogate model predicts the time to evaluate the cost function jointly with the objective costs to optimize, the time-cost target might act as a regularizer of sorts and help improve the objective cost estimation accuracy.

I remember HyperMapper used to train a separate predictor for feasibility, perhaps that's another thing that might tie into this.

bbudescu commented Nov 22, 2024

Another nice thing one could do, if there's a bottleneck like this, would be to employ an async polling method, i.e., just use an older version of the surrogate model to provide trial configs when asked, while the new surrogate model is training.
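
Something along these lines (a sketch with a hypothetical fit/ask surrogate interface, not SMAC's actual API): keep serving configs from the last fully trained surrogate while a background thread refits on the newest data.

```python
import threading

class StaleModelSampler:
    """Serve suggestions from the most recently *finished* surrogate fit, while a
    background thread refits on newer data. `surrogate_factory` and its fit/ask
    methods are hypothetical placeholders, not SMAC's actual interface."""

    def __init__(self, surrogate_factory):
        self.surrogate_factory = surrogate_factory
        self.current = None            # last fully trained surrogate
        self._lock = threading.Lock()
        self._fitting = False

    def ask(self, n=1):
        with self._lock:
            model = self.current
        if model is None:
            return [self._random_config() for _ in range(n)]  # cold start: random configs
        return model.ask(n)

    def tell(self, trials):
        # Kick off a refit in the background if one isn't already running.
        with self._lock:
            if self._fitting:
                return
            self._fitting = True
        threading.Thread(target=self._refit, args=(trials,), daemon=True).start()

    def _refit(self, trials):
        model = self.surrogate_factory()
        model.fit(trials)              # potentially slow: happens off the ask() path
        with self._lock:
            self.current = model
            self._fitting = False

    def _random_config(self):
        raise NotImplementedError      # placeholder: sample uniformly from the config space
```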

I guess there's an optimal tradeoff to be calculated here, too: figuring out when it is better to wait a little while for the new surrogate model to finish before starting a new, time-consuming trial, and when it is better to just keep starting almost-random trials (almost random because there's too little data to constrain the model) instead of remaining blocked.

And I think that's a bit meta right there, because the decision of asking the old model now vs. waiting for the new one to finish training would, fundamentally, be based on the expected improvement in the accuracy of the surrogate model.

And now my head hurts.

LE: Perhaps all this stuff about computing optimal tradeoffs, in this comment and the ones above, might be easier to solve if posed within the framework of Value of Information, utility functions, etc.?

LE2: VoI stuff might also be a sensible paradigm for things like optimizing the tradeoff between exploration and exploitation (e.g., how much time should be allocated to roaming around with, say, random search in the hope of finding a better bucket / ravine in the acquisition function surface, while the remaining time is used to zero in on the currently best bucket's local minimum).

bbudescu (Author) commented:
Also, since the surrogate model is only trained on the main thread, would it perhaps be possible to have some GPU support to help minimize RF training times?

bbudescu commented Nov 23, 2024

Sorry, I over-complicated things. It can be simpler than all that estimation: just use the old model if occupancy falls below 90%. Occupancy is the percentage of workers that are running, i.e., that aren't waiting for a config suggestion from the surrogate model.
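
i.e., something as simple as this (the occupancy bookkeeping and the two model handles are assumed to exist):

```python
def suggest(n_running_workers, n_workers, fresh_model, stale_model, threshold=0.9):
    """Decide which surrogate answers the next suggestion request.
    fresh_model is None while the latest refit is still in progress."""
    occupancy = n_running_workers / n_workers
    if fresh_model is not None:
        return fresh_model.ask()
    if occupancy < threshold and stale_model is not None:
        return stale_model.ask()   # don't let workers idle: fall back to the old model
    return None                    # occupancy is fine, so it's OK to wait for the refit
```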

LE: NOTE: I made this into a separate issue #1170
