-
Thanks @mschubert - important topic! Reading your post I am a bit confused because of
This is not what I experience in practice but I am not sure if this is due to {clustermq}. I am using {clustermq} via {drake} and the following SLURM logic
In this scenario, when I spin up 100 workers, some finish earlier than others. These then output the following when their job is done
The worker is in an IDLE state and blocks resources for the scheduler. So, TL;DR: is this maybe something I need to configure in SLURM, and {clustermq} is doing just fine?
-
@mschubert, thank you so much for bringing up this thread! It is such a critical issue for large
The use cases I see are mostly around (1). Workflows that I deal with in Bayesian data analysis and clinical trial simulation are arbitrary DAGs of tasks. Many of these DAGs are just composites of map/reduce steps, but this is not always the case, and I would strongly prefer not to make assumptions about the graph topology.
Somewhat related:
(4) The HPC devops team at my work has expressed a strong preference for (4) because it would significantly reduce idling time. For us, our scheduler is beefy, but HPC resources are often occupied. In addition, they claim fully transient workers would allow sys admins to more finely manage what is running on the cluster. However, I am not sure how they would feel about scenarios with large numbers of small jobs (less common at my work). In any case, my team and I have not even been able to pilot transient workers (via If I recall correctly,
What about something that allowed workers to scale up and down as the work progresses?
Tuning the initial worker size in (1) and max idle time in (3) could allow a nice spectrum between fully transient and fully persistent workers.
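
To make the "max idle time" knob concrete, here is a minimal sketch in Python (not R, and not `clustermq`'s API). All names (`worker`, `run_pool`, `max_idle`) are hypothetical and only illustrate the idea: each worker keeps pulling tasks and shuts itself down once it has waited longer than `max_idle` seconds without receiving work.

```python
import queue
import threading


def worker(tasks, results, max_idle):
    """Process tasks until none arrives for `max_idle` seconds, then exit."""
    while True:
        try:
            task = tasks.get(timeout=max_idle)
        except queue.Empty:
            # Idle for too long: exit so the scheduler slot is released.
            return
        results.put(task())


def run_pool(jobs, n_workers, max_idle):
    """Start `n_workers` threads (the hypothetical initial worker size),
    feed them `jobs`, and wait until every worker has timed out."""
    tasks, results = queue.Queue(), queue.Queue()
    for job in jobs:
        tasks.put(job)
    threads = [
        threading.Thread(target=worker, args=(tasks, results, max_idle))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)


if __name__ == "__main__":
    squares = run_pool([lambda i=i: i * i for i in range(5)], n_workers=3, max_idle=0.05)
    print(squares)
```

In this toy model, a very small `max_idle` behaves like fully transient workers, while a very large one approximates persistent workers; scaling back up when new work arrives is left out of the sketch.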
-
Related: @mschubert, does
-
We've had multiple mentions that `clustermq` workers are "persistent" and that in some situations more "transient" workers (i.e., workers that get spun up and down more frequently) can be desirable. This came especially from the angle of the `drake` and `targets` packages.

The question that I'm asking myself now is: What are these use cases?
I see the following:
And the following gradient of approaches:
The advantages and disadvantages I see for those approaches are roughly:
So, I am looking for examples where point 3 (or 4) is desirable, and why it will serve a use case better than the current approach (point 2).
In terms of implementation, I'm considering:
/cc @wlandau @pat-s @mattwarkentin