when combining, workers persist for longer than needed #1080
Comments
Yeah, I see what you mean, and I agree it does occupy resources unnecessarily. One limitation is that …

Proposal: send a shutdown message whenever the number of unscheduled targets (ones we have not yet sent to workers) drops below the number of workers. I predict that extra idle workers will terminate and workers in the middle of building targets will remain unaffected. I think the shutdown message will go to the next available idle worker, and even if it does go to a worker in the middle of a job, I do not think it will interrupt the job (@mschubert, would you confirm?)
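A minimal sketch of that rule, with hypothetical names (not drake's actual internals):

# Hypothetical helper illustrating the proposed rule: when an idle worker asks for
# work, shut it down if the targets not yet sent to workers no longer need every
# running worker; otherwise tell it to wait.
decide_idle_worker <- function(n_unscheduled_targets, n_workers) {
  if (n_unscheduled_targets < n_workers) {
    "shutdown" # surplus worker exits; workers busy building targets are unaffected
  } else {
    "wait"     # more targets are coming; keep the worker alive
  }
}

decide_idle_worker(n_unscheduled_targets = 1, n_workers = 2) # "shutdown"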
On second thought, I do not think that idea will work. The development version of …

So @psadil, like you said initially, I think a better way to conserve resources in your case is to use transient workers, e.g. …

So close yet so far...
Depending on mschubert/clustermq#182, I might actually come back to this issue. @psadil, on the off chance that that happens, would you install https://github.com/ropensci/drake/tree/1080 and see if those extra workers disappear?
Your current handling in f449117 seems reasonable to me. Using the worker API, you will always have to decide whether to shut down a worker or have it wait, because clustermq does not know how much work remains.
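To make that decision concrete, here is a rough sketch; the send_* helpers below are placeholders rather than the real clustermq worker API:

# Sketch only: clustermq reports that a worker is idle, and the caller (here, a
# drake-like scheduler) consults its own bookkeeping to choose one of three replies.
send_target   <- function(worker) message("assigning the next target to ", worker)
send_wait     <- function(worker) message("telling ", worker, " to keep waiting")
send_shutdown <- function(worker) message("shutting down ", worker)

handle_idle_worker <- function(worker, n_targets_queued, n_targets_running) {
  if (n_targets_queued > 0) {
    send_target(worker)   # a target is ready: hand it out
  } else if (n_targets_running > 0) {
    send_wait(worker)     # nothing ready yet, but running targets may unlock more work
  } else {
    send_shutdown(worker) # nothing queued or running: retire the worker
  }
}

handle_idle_worker("worker_2", n_targets_queued = 0, n_targets_running = 1)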
@wlandau, thanks for the response. I'd be happy to try out your fix, but I think it'll be a few days before I can run it.
👋 Hey @psadil... Letting you know, …
Forgot to mention: if you supply a …
Hmm... it actually doesn't work on my end because …
Yeah, do not worry about trying it yourself. It does not work yet, and we cannot roll out a solution anyway until mschubert/clustermq#182 is addressed.
If I'm not misunderstanding the point, this is not entirely true. Let's say …

You can use the worker API as it is right now to never have more workers than remaining targets (if you send the shutdown signal instead of the wait signal to all superfluous workers). I thought this was what you did in f449117? This should work and improve the situation.

The only thing that depends on mschubert/clustermq#182 is the case where a target 3 requires more workers again, and you want to temporarily scale your workers down and then back up.
It almost works, but … (line 159 in 20aaa86). Dynamic branching creates such additional targets while …
FYI: I just pushed a quick fix to the …

library(drake)
options(
clustermq.scheduler = "sge",
clustermq.template = "sge.tmpl" # drake_hpc_template_file("sge_clustermq.tmpl")
)
clean(destroy = TRUE)
plan <- drake_plan(
a = target(Sys.sleep(x), transform = map(x = c(1, 10))),
b = target(Sys.sleep(prod(a)), transform = combine(a))
)
make(plan, parallelism = "clustermq", jobs = 3)
Wouldn't it at least be safe to shut down extra workers once there are 0 remaining unassigned targets?
Yes, and drake does do this already. |
Prework

Abide by drake's code of conduct. If you think your issue has a quick and definite solution, consider posting to Stack Overflow under the drake-r-package tag. (If you anticipate extended follow-up and discussion, you are already in the right place!)

Description
This is closely related to #751, but I think it is slightly different (sorry if it's actually the same issue!). A workflow has a first target that is easy to parallelize (target B, below). The second target combines these outputs (target C), which requires fewer workers. That's the end of the workflow. However, it seems like the workers from the first target are not shut down until the second target begins, so they can persist for much longer than they are needed.
Reproducible example
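As a stand-in for the original script (which is not shown here), a minimal plan with the same shape might look like the following sketch; the sleep times, the combine command, and the job count are assumptions, not the original example:

# Hypothetical stand-in for the original example.
# Requires a configured clustermq scheduler, e.g.
# options(clustermq.scheduler = "sge", clustermq.template = "sge.tmpl").
library(drake)

plan <- drake_plan(
  # Two parallel B targets: B_10 finishes quickly, B_60 takes much longer.
  B = target(Sys.sleep(x), transform = map(x = c(10, 60))),
  # C combines the B targets and needs only one worker.
  C = target(length(list(B)), transform = combine(B))
)

make(plan, parallelism = "clustermq", jobs = 2)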
Two workers get going on the B targets. B_10 completes quickly; B_60 takes a bit longer. Only one of those workers will be needed for C, but B_10's worker hangs around until B_60 has finished.
Sample output
The idle worker continues to wait for a while, and one worker (I'm not sure which one) shuts down when C starts.
Desired result
When a worker has finished a job and there are enough other currently engaged workers to finish the remaining jobs, that finished worker should be shut down.
Does this mean I should use transient workers?
Session info