RFC: Use cases for transient workers? #257

mschubert · 2021-04-03T21:55:25Z

Maintainer

It has been mentioned several times that clustermq workers are "persistent" and that more "transient" workers (i.e., workers that get spun up and down more frequently) can be desirable for some use cases. This has come up especially from the angle of the drake and targets packages.

The question that I'm asking myself now is: What are these use cases?

I see the following:

1. Computing on a graph may parallelize to 100 workers, or go through an integration ("reduce") step where no parallelism is possible
2. (I'm not sure about batch-like queries, are there any?)

And the following gradient of approaches:

1. Spin up workers once and keep all of them until all computations are complete
2. Spin up workers once and shut them down as soon as their work is complete (this is what we are currently using)
3. (there are probably some intermediate use cases here)
4. Spin up a separate worker for each individual function call

The advantages and disadvantages I see for those approaches are roughly:

1. This is wasteful, as workers are kept online when they are idle (both during startup and shutdown)
2. This seems like a better alternative than 1, but workers may still be idle until they receive their instructions (I'd consider this a minor drawback; when is this an issue?)
3. ?
4. This will likely put unnecessary load on the scheduler; a rule of thumb here is that jobs should run at least 2 minutes (and individual function calls may well be faster)
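
To make the rule of thumb in the last point concrete, here is a rough back-of-the-envelope sketch in base R. Only the 2-minute threshold comes from the text above; the per-call runtime and the number of calls are made-up illustrative values, and the variable names are hypothetical.

```r
# Rule of thumb from the text: a scheduler job should run at least 2 minutes.
min_job_secs <- 120

# Assumed, illustrative values (not measurements):
secs_per_call <- 0.5   # runtime of one function call
n_calls <- 1e6         # total number of function calls

# Calls that must be grouped into one job to reach the threshold
calls_per_job <- ceiling(min_job_secs / secs_per_call)

# Scheduler jobs needed with grouping, versus one job per call
n_jobs_grouped <- ceiling(n_calls / calls_per_job)
n_jobs_per_call <- n_calls

cat(calls_per_job, "calls per job,", n_jobs_grouped, "jobs instead of",
    n_jobs_per_call, "\n")
```

Under these assumptions, one job per call would submit a million jobs of half a second each, while grouping brings that down to a few thousand jobs that each respect the threshold.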

So, I am looking for examples where point 3 (or 4) is desirable, and why it will serve a use case better than the current approach (point 2).

In terms of implementation, I'm considering:

1. A worker API call that allows either to (a) resize the worker pool, or (b) concatenate different (potentially heterogeneous) pools of workers; when this should be done would be a decision of the API user
2. others?

/cc @wlandau @pat-s @mattwarkentin

pat-s · 2021-04-04T10:46:03Z

Thanks @mschubert - important topic!

Reading your post I am a bit confused because of

> Spin up workers once and shut them down as soon as their work is complete (this is what we are currently using)

This is not what I experience in practice but I am not sure if this is due to {clustermq}. I am using {clustermq} via {drake} and the following SLURM logic

#SBATCH --array=1-{{ n_jobs }}

In this scenario, when I spin up 100 workers, some finish earlier than others. These then output the following when their job is done:

2021-04-03 22:48:07.480188 | > WORKER_STOP (0.000s wait)
2021-04-03 22:48:07.488350 | shutting down worker

The worker is then in an IDLE state and blocks resources for the scheduler.
SLURM will only release the allocated resources once all jobs are done.
In my concrete scenario, around 5 of the 100 workers take a lot longer than the rest.
This is very unfortunate because I then block a lot of resources even though just 5% of them are in actual use at some point in time.

So, TL;DR: is this maybe something I need to configure in SLURM, and {clustermq} is doing just fine?
My goal would be that the scheduler releases the resources once it realizes that they are not needed anymore to compute the missing array tasks.
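
A small worked example of the cost described above, in base R. The worker counts (100 in total, 5 slow) are from the post; the runtimes of 1 and 10 hours are purely hypothetical, chosen only to illustrate the arithmetic.

```r
n_workers <- 100   # from the post
n_slow <- 5        # from the post
fast_hours <- 1    # hypothetical runtime of the 95 fast workers
slow_hours <- 10   # hypothetical runtime of the 5 slow workers

# If resources are only released once every array task is done,
# each worker is allocated for as long as the slowest one runs.
allocated <- n_workers * slow_hours

# Worker-hours that are actually spent computing
used <- (n_workers - n_slow) * fast_hours + n_slow * slow_hours

idle <- allocated - used
idle_fraction <- idle / allocated

cat(idle, "of", allocated, "allocated worker-hours are idle\n")
```

With these assumed runtimes, 855 of 1000 allocated worker-hours would be spent idle, which is the kind of waste that releasing finished workers early would avoid.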

10 replies

wlandau Apr 7, 2021

Should be fixed now, yes.

mattwarkentin Apr 7, 2021

Can confirm it works on my HPC with Slurm. This is excellent!

pat-s Apr 9, 2021

@wlandau I assume there won't be a backport for {drake}, will there?

wlandau Apr 9, 2021

I know I promised backports for some things, but I did not architect drake well enough to handle the kind of flexibility required in this case, especially with dynamic branching. Even with targets, it was very sensitive work to implement, and I would not have caught ropensci/targets#404 if I had not been using targets for an actual data analysis project at work. So in spite of the inefficiency, I would prefer to leave drake alone so we can be more confident that it continues to orchestrate and conclude pipelines without obvious errors.

pat-s Apr 9, 2021

Fair enough - looking forward to using {targets} soon hopefully!

wlandau · 2021-04-04T15:05:18Z

@mschubert, thank you so much for bringing up this thread! It is such a critical issue for large targets pipelines!

> The question that I'm asking myself now is: What are these use cases?
>
> I see the following:
>
> 1. Computing on a graph may parallelize to 100 workers, or go through an integration ("reduce") step where no parallelism is possible
> 2. (I'm not sure about batch-like queries, are there any?)

The use cases I see are mostly around (1). Workflows that I deal with in Bayesian data analysis and clinical trial simulation are arbitrary DAGs of tasks. Many of these DAGs are just composites of map/reduce steps, but this is not always the case, and I would strongly prefer not to make assumptions about the graph topology.

> And the following gradient of approaches:
>
> 1. Spin up workers once and keep all of them until all computations are complete
> 2. Spin up workers once and shut them down as soon as their work is complete (this is what we are currently using)

Somewhat related: targets has the capability of running some targets on the main process and others on remote workers. When it starts traversing the graph, it skips up-to-date targets like always, and it tries to run everything it can locally. It only launches clustermq workers when it reaches a target in the graph that (1) is out of date (needs to run) and (2) must run on a remote worker.

> 4. Spin up a separate worker for each individual function call

The HPC devops team at my work has expressed a strong preference for (4) because it would significantly reduce idling time. For us, our scheduler is beefy, but HPC resources are often occupied. In addition, they claim fully transient workers would allow sys admins to more finely manage what is running on the cluster. However, I am not sure how they would feel about scenarios with large numbers of small jobs (less common at my work).

In any case, my team and I have not even been able to pilot transient workers (via tar_make_future()) because future.batchtools is intractably slow. Even if we could, the overhead might be a bit much.

If I recall correctly, snakemake used fully transient workers at first, but due to complaints from sys admins, it now supports job groups. This is similar to batching in targets, which is recommended in the documentation as part of user best practices. Target batching is automatically supported in extension packages such as tarchetypes and stantargets.

> 3. (there are probably some intermediate use cases here)

What about something that allowed workers to scale up and down as the work progresses?

1. Start by submitting an array job of a certain user-specified size, not necessarily the maximum size.
2. If more work is requested with $send_call() and the number of currently busy workers is less than the user-specified maximum, then initialize a new worker for the new job.
3. If a worker idles for long enough (i.e. receives nothing or only $send_wait() for some length of time) then shut down that worker.

Tuning the initial worker size in (1) and max idle time in (3) could allow a nice spectrum between fully transient and fully persistent workers.
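
The scaling rules above can be sketched as a single decision function in base R. This is only one possible reading of the proposal, not clustermq code: the function `scale_decision()` and its arguments `idle_secs`, `pending`, `max_workers` and `max_idle` are all hypothetical names, and the sketch assumes that idle workers pick up pending calls before any new worker is started.

```r
# Hypothetical helper, not part of clustermq: decide how to resize the pool.
#   idle_secs:   seconds each running worker has been idle (0 means busy)
#   pending:     number of calls waiting to be sent
#   max_workers: user-specified cap on the pool size
#   max_idle:    idle time after which a worker may be shut down
scale_decision <- function(idle_secs, pending, max_workers, max_idle) {
  n_running <- length(idle_secs)
  n_free <- sum(idle_secs > 0)
  # Start workers only for calls that no idle worker can pick up,
  # and never exceed the user-specified maximum.
  n_start <- max(0, min(pending - n_free, max_workers - n_running))
  # Stop workers that have idled too long, but only if no work is waiting.
  stop_idx <- if (pending == 0) which(idle_secs >= max_idle) else integer(0)
  list(start = n_start, stop = stop_idx)
}

# Three workers running (two busy, one idle for 5s), 4 calls pending, cap of 5
d1 <- scale_decision(c(0, 0, 5), pending = 4, max_workers = 5, max_idle = 60)

# No pending work; the second worker has idled past the 60s limit
d2 <- scale_decision(c(0, 90, 10), pending = 0, max_workers = 5, max_idle = 60)
```

In the first case the sketch asks for two new workers (the cap of 5 limits it) and stops none; in the second it starts none and marks the second worker for shutdown. Setting `max_idle` very low approaches fully transient workers, while `max_idle = Inf` keeps every worker until the end.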

8 replies

mschubert Apr 5, 2021
Maintainer Author

> The HPC devops team at my work has expressed a strong preference for (4)

I can only see this making sense if all function calls on your HPC take some minutes to compute, and the overall number of calls is relatively low. Otherwise, if some users have a million short calls each, this may bring down the whole farm (which is why I will probably never support that).

mattwarkentin Apr 7, 2021

Again, late to join here, but my opinion is that I like persistent workers capable of handling many jobs. I am definitely looking at clustermq through a targets lens, but I do think the ability to scale up and down would be great. As you mentioned, scaling down was always supported by clustermq and is now working in targets, so the ability to scale up as required would be a great feature.

mattwarkentin Apr 8, 2021

@wlandau I was thinking about this a bit last night, if/when upscaling is supported, I thought about two types of worker management techniques (relevant to targets):

Heuristic (less conservative)
- This would align with what you described above: (1) start some number of jobs (less than or equal to max_workers), (2) spin up more jobs if more work is requested and all current workers are busy, and if max_workers isn't saturated, and (3) if any worker is idle for idle amount of time, you can shut down that worker. Iterate through (2) and (3) until all the work is done.
- This approach would probably be more resource conscious, in that it would free up resources when they are idle and return them to the resource pool for others to use.
Deterministic (more conservative)
- With this approach, steps (1) and (2) would be the same, but you would only shut down workers when those workers would never receive any more work. I think this is basically the current approach in targets. Resources are only relinquished when they are definitely no longer needed.

In either approach (or a combination of the two), the number of workers spun up at step (1) would only be as many workers as needed at the very start (up to max_workers), and then workers are spun up and shut down according to one of the above methods. Just scribbling my thoughts down.

wlandau Apr 9, 2021

I like the way you frame those options, @mattwarkentin.

If the heuristic approach also adopts the shutdown rule of the deterministic one, it seems like the latter is a special case of the former with max_workers == initial_workers and idle = Inf.

mattwarkentin Apr 9, 2021

> If the heuristic approach also adopts the shutdown rule of the deterministic one, it seems like the latter is a special case of the former with max_workers == initial_workers and idle = Inf.

Yes, this is exactly what I was thinking too! With the deterministic shutdown rule in place, idle becomes like a tuning parameter for the strength of the heuristic rule.
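
A minimal sketch of how the two shutdown rules could combine, in base R. The function `should_stop()` and the argument `can_get_more_work` are hypothetical names for illustration; only `idle` is taken from the discussion above, and none of this is targets or clustermq code.

```r
# Hypothetical combined shutdown rule for a single worker.
#   idle_secs:         how long the worker has been idle
#   can_get_more_work: FALSE once the worker can never receive work again
#   idle:              tuning parameter; Inf disables the heuristic rule
should_stop <- function(idle_secs, can_get_more_work, idle = Inf) {
  deterministic <- !can_get_more_work  # definitely no longer needed
  heuristic <- idle_secs >= idle       # has idled "long enough"
  deterministic || heuristic
}

# Heuristic rule fires: idle for 300s with a 60s limit
a <- should_stop(idle_secs = 300, can_get_more_work = TRUE, idle = 60)

# Same worker with idle = Inf: only the deterministic rule applies, so it stays
b <- should_stop(idle_secs = 300, can_get_more_work = TRUE, idle = Inf)

# Deterministic rule fires regardless of idle time
d <- should_stop(idle_secs = 0, can_get_more_work = FALSE, idle = Inf)
```

With `idle = Inf` the comparison `idle_secs >= idle` is never true for a finite idle time, so the function reduces to the deterministic rule; smaller values of `idle` make the heuristic progressively stronger.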

wlandau · 2021-04-09T12:48:05Z

Related: @mschubert, does clustermq provide a way to know how many workers are up that have not been sent shutdown messages? To implement ropensci/targets#399, I used a counter in targets to keep track of this manually. I am not sure that will still work if/when clustermq automatically scales up workers.

4 replies

wlandau Apr 9, 2021

I tried qsys$workers and qsys$workers_running but had trouble. (The event loop was trying to shut down workers that were already shut down.)

mschubert Apr 11, 2021
Maintainer Author

Can you elaborate? w$workers should give you the correct number.

For worker management in general, the idea is that if you want to shut down a worker, you reply to its WORKER_READY message with a WORKER_STOP message (in practice, by calling w$send_shutdown_worker()). After that, it will still answer with a WORKER_DONE message and summary statistics, at which point you no longer need to shut it down (this is handled internally by the QSys base class; you just quit the loop and then call w$cleanup()).
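
The exchange described above can be illustrated with a toy model in base R. The message names WORKER_READY, WORKER_STOP and WORKER_DONE are from the thread; everything else is hypothetical, including the function `master_reply()`, the placeholder reply `SEND_WORK`, and the counters. It does not use or reproduce the clustermq API.

```r
# Toy model: what the main process replies to a given worker message.
master_reply <- function(msg, work_left) {
  switch(msg,
    # A ready worker gets work if any is left, otherwise a stop message
    WORKER_READY = if (work_left > 0) "SEND_WORK" else "WORKER_STOP",
    # The final message of a stopped worker needs no reply
    WORKER_DONE = NA_character_,
    stop("unexpected message: ", msg)
  )
}

n_workers <- 2        # workers that have not yet been sent WORKER_STOP
work_left <- 3        # calls still to be handed out
log <- character(0)   # replies sent, in order

while (n_workers > 0) {
  reply <- master_reply("WORKER_READY", work_left)
  log <- c(log, reply)
  if (reply == "WORKER_STOP") {
    # The worker answers with WORKER_DONE; count it out exactly once
    stopifnot(is.na(master_reply("WORKER_DONE", work_left)))
    n_workers <- n_workers - 1
  } else {
    work_left <- work_left - 1
  }
}

print(log)
```

The point of the toy loop is that each worker is decremented from the count only once, when it is sent WORKER_STOP, so the loop ends when the count reaches zero rather than trying to stop a worker a second time.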

wlandau Apr 11, 2021

That certainly makes sense. I reproduced the issue in a new branch. If you install ropensci/targets@d271573, the following error results in a simple example. (I know the error is slightly different from what I was describing, but it is one of the errors I saw when trying w$workers and w$workers_running.)

library(targets)

print(packageDescription("targets")$GithubSHA1)
#> [1] "d2715730de0c86b9f2053b2e499177d6a69ffd03"

tar_script({
  options(clustermq.scheduler = "multicore")
  tar_option_set(deployment = "main")
  list(
    tar_target(index, seq_len(4)),
    tar_target(run, index, pattern = map(index), deployment = "worker"),
    tar_target(end, run)
  )
})

tar_make_clustermq(callr_function = NULL)
#> ● start target index
#> ● built target index
#> ● start branch run_77ba9815
#> ● built branch run_77ba9815
#> ● start branch run_4b7abfa4
#> ● built branch run_4b7abfa4
#> ● start branch run_f5a3cbc2
#> ● built branch run_f5a3cbc2
#> ● start branch run_b0df9eb8
#> ● built branch run_b0df9eb8
#> ● built pattern run
#> Error in self$crew$receive_data(): Trying to receive data after work finished
#> ● end pipeline

Created on 2021-04-11 by the reprex package (v1.0.0)

If I use a custom worker counter in targets (current implementation) then there is no issue.

library(targets)

print(packageDescription("targets")$GithubSHA1)
#> [1] "d0422b378eb5ed2900822c89112406ace3e9db7c"

tar_script({
  options(clustermq.scheduler = "multicore")
  tar_option_set(deployment = "main")
  list(
    tar_target(index, seq_len(4)),
    tar_target(run, index, pattern = map(index), deployment = "worker"),
    tar_target(end, run)
  )
})

tar_make_clustermq(callr_function = NULL)
#> ● start target index
#> ● built target index
#> ● start branch run_77ba9815
#> ● built branch run_77ba9815
#> ● start branch run_4b7abfa4
#> ● built branch run_4b7abfa4
#> ● start branch run_f5a3cbc2
#> ● built branch run_f5a3cbc2
#> ● start branch run_b0df9eb8
#> ● built branch run_b0df9eb8
#> ● built pattern run
#> ● start target end
#> ● built target end
#> Master: [0.3s 38.1% CPU]; Worker: [avg 23.7% CPU, max 4159493.0 Mb]
#> ● end pipeline

Created on 2021-04-11 by the reprex package (v1.0.0)

mschubert Apr 11, 2021
Maintainer Author

I'll have a look. Given that you have a working solution, I will put this in the low priority bucket.
