Support clustermq as backend for drake #86

wlandau · 2018-06-27T02:00:41Z

First of all, I am super excited about this package! I just tried it out, and it automatically detected my company's SGE scheduler and submitted jobs in a fraction of the time I am used to. Way cool!

I am thinking about a couple of clustermq backends for drake: one for persistent workers, and another for transient workers. The former is easiest to implement, but I do not expect to see performance gains there because persistent workers need to use the file system to save targets on the go.

However, clustermq-based transient workers could really shine, especially when compared to drake's current future/batchtools-based transient workers (make(parallelism = "future"), which add noticeable overhead to every single target. But to implement transient workers with clustermq, I believe I will need a way to submit and monitor workers in a non-blocking way. With future, for example, the master process can submit a job, go check on other jobs, and come back and check if the first job finished.

I suppose one potential alternative is to take advantage of the native pipeline functionality in the schedulers themselves. Are you considering this for clustermq?

The text was updated successfully, but these errors were encountered:

mschubert · 2018-06-27T21:26:21Z

Thank you for your kind words, and please excuse my delay in getting back to you: this question is a bit harder to answer than the others.

We have discussed a non-blocking version of Q before in #23.

For drake support specifically:

One option would be to implement clustermq as a backend for future, which would give us drake support for free. There is an issue on future for the backend, but I'm afraid this is currently not high enough on my priority list.

Another option would be to use parallel foreach as a parallelism backend for drake. I don't know if this is possible already, or how hard it would be to implement. foreach support in clustermq is experimental and should be trialed from the next release.

In either case, I think this would use persistent (and reusable) workers and not actually need non-blocking Q, because each loop will be blocking but still be processed in parallel.

wlandau · 2018-06-28T12:55:41Z

Thanks for explaining, Michael. I think futureverse/future#204 would be ideal, and I am optimistic about its eventual implementation. I will close this issue as a duplicate of #23.

mschubert · 2018-06-28T20:08:26Z

I am reopening this for drake support specifically

wlandau · 2018-06-28T20:15:33Z

And I appreciate it. Really glad you're willing to help me speed up drake.

mschubert · 2018-06-29T13:44:40Z

Let me expand on the previous point a bit:

(1) I'm pretty sure you do not want transient workers for drake on the cluster (as in, a worker per call)

Many of your function calls may be short. Submitting many short-running jobs in a short period of time puts stress on the scheduler. The lower limit that is usually quoted is 2 minutes, and snakemake was scolded by sysadmins for using this exact approach (that they later tried to mend using job groups)

(2) There are usually batches of function calls where some arguments are iterated but others are not

Often, function calls will work on the same data but e.g. different indices are supplied. In this case, you will not want the common part to be sent over the network for each call separately.

I think the way to go here is to:

Let drake define a pool of workers that will be common to the workflow (I think you already do this, but also allow for a different number of workers during those steps)
Group tasks per function to call, organise iterated and constant arguments; if there are parallel chains, create a temporary function for the chain (I think you already do this or something similar)
Let clustermq work on each of these groups one after another (this is where the parallel foreach or the future backend comes in)

Q in this case would be handled like parLapply, which is blocking itself but processes tasks in parallel.

The downside of course is that each group of function calls needs to be completed before the next can start. As the calls themselves are load-balanced, I don't think this is a big deal.

wlandau · 2018-06-29T20:11:18Z

You bring up a good point re transient workers. I do eventually want to get to job groups, but it is much harder to implement than persistent and transient workers. All in good time.

I do like the idea of maintaining a pool of common workers and submitting jobs to them as targets become ready (i.e. when their dependencies are all checked and built). The only downside is that users may hit wall time limits fairly easily, but we can work around that later with job groups (or simply by refreshing workers periodically).

Unfortunately, it is not so simple to group tasks by function call. Users define arbitrary dependency networks of commands, and those commands are arbitrary chunks of R code (not necessarily neat function calls). I suppose some targets sometimes share all the same dependencies, but I do not think such groupings would efficiently generalize to larger workflows.

The downside of course is that each group of function calls needs to be completed before the next can start.

At first, I did think this was such a big deal. What you describe is essentially what I have been calling "staged scheduling", and it was a good start. However, during its first several months of use, there were several projects that lost considerable parallel efficiency. drake could not always jump ahead in the graph even when targets were ready. ropensci/drake#168 helped somewhat, but it was not nearly sufficient. I touched on the downsides of staged parallelism in a recent rOpenSci tech note, and drake no longer supports it for remote workers.

The most successful solutions so far have been

Totally transient workers with the "future" and "Makefile" backends, and
Totally persistent workers with the "mclapply", "parLapply", and "future_lapply" backends. Here, the workers cannot and do not block the master process that sends them instructions. I have found this asynchronicity to be essential for traversing the dependency graph in parallel as efficiently as possible. Whether through future or clustermq itself, I hope we can make it happen.

wlandau · 2018-06-29T20:15:02Z

I did try to use clustermq workers just like the current lapply-like persistent worker backends, but I found it defeated the purpose. Currently, drake uses the file system to communicate with its persistent workers, and those workers use the file system to cache the targets they build. While this approach produces the correct results (I took special precautions to avoid race conditions due to NFS file latency) the overhead is large. That's why I hope instructions and targets can be transmitted with ZeroMQ and the master process can take total responsibility for the file system.

mschubert · 2018-07-04T15:41:37Z

I'm beginning to see the issues you are facing. It should be possible to use clustermq workers for individual function calls, while not using the main event loop.

This would be to use something like:

w = workers(2) # submit 2 worker jobs for testing
on.exit(w$cleanup())

msg = w$receive_data()
if (!is.null(msg$result))
    # handle result ...

# if there is work
# this still needs some adjustments on cmq for this to work
w$set_common_data(data=list(fun, const, export, seed)) # set function to call, constant args
w$send_common_data()
w$send_job_data(msg=list(id="do_chunk", df)) # df is data.frame with row=id and fun args

# if not
w$disconnect_worker(msg)

Processing the calls this way would be asynchronous, with the drake parallel backend responsible for the event loop.

Is that something you could work with?

wlandau · 2018-07-04T22:31:42Z

Thanks for meeting me half way! I think I get the general idea, and I want to understand more about the details.

From what I understand, the master process will loop over something like the above code to receive worker data and send jobs to workers.

After msg <- w$receive_data(), what are the contents of msg? Is it the output from the worker that finished soonest, results from all the workers, or something else?

In w$send_job_data(), would I be sending a single job to the workers collectively? Will clustermq's load balancing decide which worker gets assigned?

mschubert · 2018-07-05T13:22:42Z

I've written about the specification here, hope that helps. Please note the proposed drake-specific extensions DO_CALL, which is DO_SETUP and DO_CHUNK in one command (otherwise there's no guarantee you're talking to the same worker).

From what I understand, the master process will loop over something like the above code to receive worker data and send jobs to workers.

Yes

After msg <- w$receive_data(), what are the contents of msg? Is it the output from the worker that finished soonest, results from all the workers, or something else?

It will be the result of a single chunk (in your case, likely a function call or expression on a single set of arguments).

The fields of msg will be: id, result, warnings, and errors (as documented above)

In w$send_job_data(), would I be sending a single job to the workers collectively? Will clustermq's load balancing decide which worker gets assigned?

This would be a single call to a single worker. You'd loop through the following way:

switch(msg$id,
    "WORKER_UP" = {
        if (there is calls left that you can do right now)
            w$do_call(...)
        if (there is a result in msg$id)
            # handle result here
    }
    "WORKER_ERROR" = {
        # something went wrong, clean up and print error
    }
)

Calls will be load-balanced, but you need to check the result ID to match the result to a call you sent. Otherwise, you'll have no idea which worker sent which result.

wlandau · 2018-07-07T21:04:02Z

Thanks for elaborating. I will experiment with this and get back to you.

wlandau · 2018-07-18T15:17:33Z

Michael, thank you for your patience. I know I was slow getting back to you. I wanted to make sure I had a good quality block of time to devote to this.

I've written about the specification here, hope that helps. Please note the proposed drake-specific extensions DO_CALL, which is DO_SETUP and DO_CHUNK in one command (otherwise there's no guarantee you're talking to the same worker).

The spec helped a lot, and I do see the necessity of DO_CALL. Thank you.

Regarding your follow-up code from #86 (comment): do you plan to implement w$do_call(...) as part of the DO_CALL message type? Maybe shorthand for w$send_job_data(msg = list(id = "DO_CALL")?

I studied your code from #86 (comment), and the results of my experimentation are below, along with some embedded questions and comments. I am optimistic about interacting with clustermq this way, and I have decided to seriously pursue it for drake::make(parallelism = "clustermq"). I realize we still need DO_CALL, and I would also appreciate your help teaching me the rest.

library(clustermq)
#> * Option 'clustermq.scheduler' not set, defaulting to SGE
#> --- see: https://github.com/mschubert/clustermq/wiki#setting-up-the-scheduler
w <- workers(2)
#> Submitting 2 worker jobs (ID: 7628) ...
on.exit(w$finalize()) # See #91.
msg <- w$receive_data()
msg
#> $id
#> [1] "WORKER_UP"
#>
#> $pkgver
#> [1] 0.8.4.1
if (is.null(msg)){
  # Under what circumstances is the message NULL?
  # What should I check if it is?
}
# From https://github.com/mschubert/clustermq/blob/3c7fda0a7c6ddcc7f9b7ac557a6a2abfcc917e16/R/Q_rows.r#L24-L25
data <- list(
  fun = function(x){
    x * 2
  },
  const = list(),
  export = list(),
  rettype = "list",
  common_seed = 123
)
# From https://github.com/mschubert/clustermq/blob/3c7fda0a7c6ddcc7f9b7ac557a6a2abfcc917e16/R/Q_rows.r#L41
# Do we need `id = "DO_SETUP"` somewhere?
do.call(w$set_common_data, data)
w$send_common_data()
w$send_job_data(
  msg = list(
    id = "DO_CHUNK",       # Maybe eventually "DO_CALL"?
    df = data.frame(x = 1) # drake will deploy targets 1 at a time.
  )
)
#> Operation cannot be accomplished in current state
msg <- w$receive_data()
msg
#> $id
#> [1] "WORKER_ERROR"
#>
#> $token
#> [1] "agasx"
w$disconnect_worker(msg)

It seems I do not know how to set the common data properly. In a previous attempt, I tried to send data more like you did in #86 (comment), but I get a worker error there too.

w$set_common_data(data)
w$send_common_data()
w$send_job_data(
  msg = list(
    id = "DO_CHUNK",       # Maybe eventually "DO_CALL"?
    df = data.frame(x = 1) # drake will deploy targets 1 at a time.
  )
)
#> Operation cannot be accomplished in current state
msg <- w$receive_data()
msg
#> $id
#> [1] "WORKER_ERROR"
#>
#> $msg
#> [1] "wrong field names for DO_SETUP: data"

Like you said, maybe clustermq needs more development before this will work. How far can I get with the current development version?

mschubert · 2018-08-09T19:01:19Z

@wlandau Can you try the following using develop?

w = workers(1)

# worker connecting to master with id=WORKER_UP
w$receive_data()

# you should be able to send arbitrary expressions to evaluate on a worker
w$send_call(x*2, list(x=5))

# receive the result from any worker with id=WORKER_READY
w$receive_data()

w$cleanup()

wlandau · 2018-08-09T19:26:11Z

It worked exactly as I guessed.

> options(clustermq.scheduler = "sge", clustermq.template = "template.tmpl")
> library(clustermq)
> w = workers(1)
Submitting 1 worker jobs (ID: 7225) ...
> # worker connecting to master with id=WORKER_UP
> w$receive_data()
$id
[1] "WORKER_UP"

$pkgver
[1] ‘0.8.4.99’

> # you should be able to send arbitrary expressions to evaluate on a worker
> w$send_call(x*2, list(x=5))
> # receive the result from any worker with id=WORKER_READY
> w$receive_data()
$id
[1] "WORKER_READY"

$token
[1] NA

$expr
x * 2

$result
[1] 10

> w$cleanup()

And with a low enough timeout, it looks like I can keep calling w$receive_data() until a worker is ready to send a result. Multiple calls are giving me a little trouble at the moment, though (maybe I am getting a little too far ahead here).

> options(clustermq.scheduler = "sge", clustermq.template = "template.tmpl")
> library(clustermq)
> 
> w = workers(2)
Submitting 2 worker jobs (ID: 7321) ...
> 
> # worker connecting to master with id=WORKER_UP
> w$receive_data()
$id
[1] "WORKER_UP"

$pkgver
[1] ‘0.8.4.99’

> w$receive_data(timeout = 1e-9)
NULL
> 
> # you should be able to send arbitrary expressions to evaluate on a worker
> w$send_call(x*2, list(x=5))
> w$send_call(x*2, list(x=5))
Operation cannot be accomplished in current state
> 
> # receive the result from any worker with id=WORKER_READY
> w$receive_data()
$id
[1] "WORKER_READY"

$token
[1] NA

$expr
x * 2

$result
[1] 10

> 
> w$cleanup()
Master: [2.3s 1.5% CPU]; Worker: [avg NA% CPU, max NA Mb]

mschubert · 2018-08-09T19:42:59Z

For your 2nd example you still need an event loop ;-)

while(work remaining or worker still running) {
    msg = w$receive_data() # wait until a worker is ready
    if (msg$id == "WORKER_READY") {
        if (work remaining)
            w$send_call(next work) # send it work if it is
        else
            w$send_shutdown_worker()
        # ...handle your result...
    } else
        # w$disconnect_worker() if WORKER_DONE, stop if WORKER_ERROR
}

The key here is that receive and send are called alternately, so that ZeroMQ can distribute the work in the background.

wlandau · 2018-08-10T00:34:03Z

I see. This is beginning to take shape.

Also, drake keeps track of a large configuration list internally (from drake_config()). If possible, I would like to send it to the workers just once at the beginning rather than every send_call().

Does this look sort of right to you?

w$send_job_data(config)
while(work remains or worker still running) {
  msg = w$receive_data()
  if (identical(msg$id, "WORKER_READY")) {
    store_result(msg)
    target <- choose_next_target(msg)
    if (!is.null(target)){ # There is work to be done.
      # Load the dependencies the the next worker needs to have in memory
      # and remove unnecessary objects
      # (depending on the user-defined memory strategy)
      prune_envir(config) # https://github.com/ropensci/drake/blob/16700f77700729511db103610f55c23978d7fef7/R/envir.R#L28
      # Build the target.
      w$send_call(
        build_target,
        list(target = target),
        const = list(dependencies = get_dependencies(config$envir))
      )
    }
  } else if (identical(msg$id, "WORKER_DONE") {
    w$disconnect_worker()
  } else if (identical(msg$id, "WORKER_ERROR") {
    stop("worker error")
  }
}

Eventually, I would like to be more selective about the data that gets sent to each worker. Many targets share dependencies, and depending what the assigned worker built previously, we could be repeating ourselves if we send all the dependency data over the socket. Of course, because of clustermq's automatic load balancing, we may not be able to identify the next worker beforehand to figure out what it already has in memory.

mschubert · 2018-08-10T09:53:58Z

Looks good!

Some minor pointers:

If you want to send the config separately in the beginning, use w$set_common_data with your config object in export and dummy fields for fun, const, rettype, and common_seed.¹ In your loop, you'd check for WORKER_UP (respond with w$send_job_data), and WORKER_READY (do w$send_call) as you do now.

w$send_call takes an expression (expr) and all dependencies in a list (env); There is no const argument, because for a single call it doesn't make sense to distinguish between constant and iterated.²

¹ I know requiring dummy fields is not ideal, and this is still subject to change
² I'm thinking about generalizing token to refer to compute environments, so that workers can perform operations on multiple sets of available data (that were previously exported to that worker); this way you could also track what objects they have in memory

wlandau · 2018-08-10T11:37:52Z

I am not sure I understand your comments about WORKER_UP and w$send_job_data(), especially since there is no DO_CHUNK message in section of the message specification on evaluating custom expressions. What is the role of w$send_job_data() here? If w$send_call() already has an env argument, what additional value does w$send_job_data() provide?

Also, do we need a w$send_common_data() after w$set_common_data()? Do we need calls to receive_data() after each?

Overall, I am having trouble understanding the functions in the QSys class and when to use them. I think it would be helpful to have a specification to distinguish among them and walk through how they work together practice.

wlandau · 2018-08-11T22:04:05Z

Update: the new backend has progressed enough for a PR: ropensci/drake#501. The only thing left is an unpredictable yet frequent error:

Error in rzmq::poll.socket(list(private$socket), list("read"), timeout = timeout) :
  Interrupted system call

Any ideas about what might be happening? I am running the reprex from #86 (comment) using ropensci/drake@8f84aa7.

mschubert · 2018-08-12T09:18:45Z

I can process jobs=1 for the mtcars example without problems in 3.1 seconds. I see the same problem as you with jobs=2 (which requires me to install the txtq package, so there is some different handling of 2 vs 1 jobs).

rzmq::poll.socket will raise an error if an interrupt is received. Question is, what is sending the interrupt, and why? (it's not rzmq or clustermq)

Note that we had issues with terminal size changes sending interrupts before.

wlandau · 2018-08-12T12:11:21Z

txtq is not actually used for the work that drake does with clustermq. It is required for drake's mclapply backend, which is used by default to process the imports. make(jobs = c(imports = 1, targets = 2) and make(parallelism = c(imports = "mclapply_staged", targets = "clustermq") do not require txtq.

ropensci/rzmq#37 does provide context, thanks.

I am not sure what could be sending the interruption, but since the error message mentions private$socket, it seems like the interrupt happens around the call to receive_data(). I would like to be able to just try receive_data() again if there is an interruption, but try()/tryCatch() statements are ineffective, presumably because of the asynchronous nature of signal interruptions.

mschubert · 2018-08-12T19:48:54Z

Yes, it's no surprise that you catch the interrupts with receive_data() because poll.socket() is what reacts to them (and you're spending most of your time there).

Using strace, I get:

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=160983,
si_uid=10282396, si_status=SIGTERM, si_utime=1, si_stime=1} ---

Are you running any multicore operations in the background?

wlandau · 2018-08-13T02:23:09Z

I was running some occasional mclapply() calls to manage dependencies in the execution environments, but no longer (ropensci/drake@7228b3b). Even afterwards, I still see that SIGCHLD/CLD_KILLED message from strace. Incidentally, this reminds me of the SIGCHLD/parallel issues callr is grappling with right now.

By the way, I am not sure this is relevant, but in the current loop on the master process, I never see a WORKER_DONE message. Just one WORKER_UP message for each worker at the beginning and then WORKER_READY messages from then on.

wlandau · 2018-08-13T02:38:31Z

Regarding WORKER_DONE, I realize that part is my fault. The relevant section of the loop is never reached, and I have been relying on w$cleanup().

In your pseudocode from #86 (comment), the while() loop checks if a worker is still running. What is the correct way to do that with a QSys object?

wlandau · 2018-08-13T12:59:37Z

Never mind: I saw you use w$workers_running internally, and it seems to work for drake as well. As of ropensci/drake@026cfb3, the "Interrupted system call" errors seem to have disappeared completely when I submit to SGE. The multicore backend still shows them when I try am using at least 4 workers at a time, but only when drake starts shutting down workers.

wlandau · 2018-08-14T00:54:14Z

bb4477e seemed to help, but drake's clustermq tests give several warnings

test-clustermq.R:16: warning: clustermq parallelism
cannot wait for child 14833 as it does not exist

Then, sometimes the task hangs, and other times the infamous "Interrupted system call" error appears.

wlandau · 2018-08-14T00:57:13Z

However, jobs on SGE run great, and real HPC schedulers should be drake's main focus when it comes to clustermq. drake already has several local parallel backends, so its only use for multicore clustermq is for unit testing. Is there another backend that could allow drake to just go through the motions locally? I would rather not merge ropensci/drake#501 unless there are proper unit tests that cover all the important parts of the code.

mschubert · 2018-08-20T18:43:25Z

Please note that the latest version on develop no longer sends WORKER_UP, but WORKER_READY instead.

This will cause problems for drake/clustermq.R#L34; you will need to check token of the ready message to be the same as you set in common data.

Sorry for this, but that's the downside of using the bleeding edge version. I might still simplify the shutdown procedure before a release (eliminating the requirement for WORKER_DONE checks).

wlandau · 2018-08-21T14:31:40Z

Thanks for letting me know. Nothing to apologize for, I am excited to see clustermq develop.

In the new develop branch, would the workflow be something like this?

cmq_set_common_data <- function(config){
  export <- list()
  if (identical(config$envir, globalenv())){
    export <- as.list(config$envir, all.names = TRUE) # nocov
  }
  export$config <- config
  config$workers$set_common_data(
    export = export,
    fun = identity,
    const = list(),
    rettype = list(),
    common_seed = config$seed,
    token = "set_common_data_token"
  )
}

cmq_master <- function(config){
  while (cmq_work_remains(config) || config$workers$workers_running > 0){
    msg <- config$workers$receive_data()
    cmq_conclude_build(msg = msg, config = config)
    if (identical(msg$id, "WORKER_READY")) {
      if (identical(msg$token, "set_common_data_token")){
        config$workers$send_common_data()
      } else if (cmq_work_remains(config)){
        cmq_send_target(config)
      } else {
        config$workers$send_shutdown_worker()
      }
    } else if (identical(msg$id, "WORKER_DONE")) {
      config$workers$disconnect_worker(msg)
    } else if (identical(msg$id, "WORKER_ERROR")) {
      stop("clustermq worker error") # nocov
    }
  }
}

wlandau · 2018-08-21T14:32:36Z

Also, if/when you remove the requirement to check WORKER_DONE, what would the cleanup procedure look like? I am picturing something like the following.

cmq_master <- function(config){
  on.exit(config$workers$cleanup()) # Replaces WORKER_DONE checks?
  while (cmq_work_remains(config) || config$workers$workers_running > 0){
    msg <- config$workers$receive_data()
    cmq_conclude_build(msg = msg, config = config)
    if (identical(msg$id, "WORKER_READY")) {
      if (identical(msg$token, "set_common_data_token")){
        config$workers$send_common_data()
      } else if (cmq_work_remains(config)){
        cmq_send_target(config)
      } else {
        config$workers$send_shutdown_worker() # Do we still need this?
      }
    } else if (identical(msg$id, "WORKER_ERROR")) {
      stop("clustermq worker error") # nocov
    }
  }
}

mschubert · 2018-08-21T20:01:00Z

The idea (implemented in clustermq@wapi) is to handle default events within workers$receive_data() and separate the worker API from the message identifiers:

cmq_master <- function(config){
  on.exit(config$workers$finalize())
  while (not all results are in){
    msg <- config$workers$receive_data()
    cmq_conclude_build(msg = msg, config = config)
    if (!identical(msg$token, "set_common_data_token")){
      config$workers$send_common_data()
    } else if (cmq_work_remains(config)){
      cmq_send_target(config)
    } else {
      config$workers$send_shutdown_worker()
    }
  }
  if (config$workers$cleanup())
    on.exit()
}

The downside I can see is that if you enter the loop after all results are finished, you will get a WORKER_DONE message ID, which you usually don't have to handle (and hence may confuse API users if they don't set their exit condition properly)

wlandau · 2018-08-21T20:21:11Z

I do like having a worker API that does not require checking message types most of the time. I have begun the adjustment in ropensci/drake@8ee84b0. Currently, drake hangs indefinitely here at receive_data() after the first send_common_data().

# devtools::install_github("ropensci/drake", ref = "clustermq")
library(drake)
load_mtcars_example()
clean(destroy = TRUE)
options(clustermq.scheduler = "multicore")
make(my_plan, parallelism = "clustermq", jobs = 2, verbose = 4)

mschubert · 2018-08-22T09:54:36Z

You still seem to be using

while (cmq_work_remains(config) || config$workers$workers_running > 0){

and not

while (not all results are in){

The new API requires that you exit your main loop when the last result is in, not when the last worker is shut down (this is handled by cleanup())

I've added (2a4c3bf) the errors:

Error in config$workers$receive_data() :
  Trying to receive data without workers

and

Error in config$workers$receive_data() :
  Trying to receive data after work finished

to tell the user when they attempt to do that.

wlandau · 2018-08-22T12:12:16Z

Thanks. In ropensci/drake@0489ccf, the main loop now checks if all the results are in.

I just tried #86 (comment) again using ropensci/drake@0489ccf and 6af6dff, and I see the following.

The first receive_data() succeeds and returns a WORKER_READY message.

[1] "receiveing"
$id
[1] "WORKER_READY"

$pkgver
[1] 0.8.4.99

$token
[1] "not set"

The first set_common_data() runs without error.
The second receive_data() quits with the new "Trying to receive data without workers" error.

The results are the same whether I use 1, 2, or 8 workers.

mschubert · 2018-08-22T19:20:10Z

The call to cleanup should go outside the loop (see #86 (comment)).

Now you shut down your workers after the first receive_data(), that's why you don't have workers ;-)

wlandau · 2018-08-22T19:48:35Z

Wow, I can't believe I missed that! Thanks!

wlandau · 2018-08-22T20:45:08Z

Initial testing is looking so much better! #99 aside, my last concern is that I am seeing some (but not all) of my SGE jobs still running even after cleanup. I submitted more workers than targets that time, so I wonder if these zombie workers never fully started up to begin with.

Anyway, is the new worker API likely to change significantly going forward? If not, we are close being ready to merge ropensci/drake#501.

wlandau · 2018-08-23T20:19:16Z

drake::make(parallelism = "clustermq") and drake::make(parallelism = "clustermq_staged") appear to work beautifully with the clustermq develop branch. I sincerely appreciate your hard work on refactoring and generous help bringing me along. High-performance computing is one of drake's major promises, and thanks to you, it can really deliver.

mschubert · 2018-08-24T12:02:10Z

Good to see that it works & happy I could contribute a bit!

I still aim to solve #99 and have the rzmq patch merged before a CRAN release, so I estimate about 10 days for 0.8.5.

wlandau · 2018-08-24T15:19:16Z

Fantastic! I will keep an eye out for the next CRAN release.

mschubert added the feature-request label Jun 27, 2018

wlandau mentioned this issue Jun 27, 2018

Experiment with clustermq for persistent workers ropensci/drake#431

Closed

7 tasks

wlandau closed this as completed Jun 28, 2018

wlandau mentioned this issue Jun 28, 2018

run asynchronously #23

Open

mschubert reopened this Jun 28, 2018

mschubert changed the title ~~Consider a non-blocking version of Q()~~ Support clustermq as backend for drake Jun 28, 2018

This was referenced Jul 3, 2018

"future" parallelism takes a really long time to get going when many upstream targets are up to date ropensci/drake#448

Closed

Add clustermq-based staged scheduling ropensci/drake#452

Merged

Faster remote/distributed persistent workers ropensci/drake#417

Closed

wlandau mentioned this issue Jul 4, 2018

future_lapply workers don't start working right away ropensci/drake#449

Closed

wlandau mentioned this issue Jul 18, 2018

Question: cleanup() vs finalize() #91

Closed

mschubert added a commit that referenced this issue Aug 9, 2018

add send_call (#86)

3664b93

wlandau mentioned this issue Aug 13, 2018

Interrupted system call while cleaning up multicore workers #93

Closed

wlandau closed this as completed Aug 14, 2018

wlandau reopened this Aug 14, 2018

wlandau mentioned this issue Aug 23, 2018

Document persistent clustermq workers ropensci-books/drake#35

Closed

wlandau closed this as completed Aug 23, 2018

wlandau mentioned this issue Aug 24, 2018

Redeploy workers that time out too soon? #101

Closed

wlandau mentioned this issue Oct 27, 2018

Remove all non-clustermq parallel backends? ropensci/drake#561

Closed

pat-s mentioned this issue Feb 27, 2021

Issues getting SLURM + future to work ropensci/drake#1359

Closed

6 tasks

Support clustermq as backend for drake #86

Support clustermq as backend for drake #86

Comments

wlandau commented Jun 27, 2018 • edited Loading

mschubert commented Jun 27, 2018

wlandau commented Jun 28, 2018

mschubert commented Jun 28, 2018

wlandau commented Jun 28, 2018

mschubert commented Jun 29, 2018

wlandau commented Jun 29, 2018

wlandau commented Jun 29, 2018 • edited Loading

mschubert commented Jul 4, 2018

wlandau commented Jul 4, 2018

mschubert commented Jul 5, 2018 • edited Loading

wlandau commented Jul 7, 2018

wlandau commented Jul 18, 2018 • edited Loading

mschubert commented Aug 9, 2018

wlandau commented Aug 9, 2018

mschubert commented Aug 9, 2018 • edited Loading

wlandau commented Aug 10, 2018 • edited Loading

mschubert commented Aug 10, 2018

wlandau commented Aug 10, 2018 • edited Loading

wlandau commented Aug 11, 2018 • edited Loading

mschubert commented Aug 12, 2018

wlandau commented Aug 12, 2018 • edited Loading

mschubert commented Aug 12, 2018 • edited Loading

wlandau commented Aug 13, 2018 • edited Loading

wlandau commented Aug 13, 2018 • edited Loading

wlandau commented Aug 13, 2018

wlandau commented Aug 14, 2018 • edited Loading

wlandau commented Aug 14, 2018 • edited Loading

mschubert commented Aug 20, 2018

wlandau commented Aug 21, 2018

wlandau commented Aug 21, 2018 • edited Loading

mschubert commented Aug 21, 2018

wlandau commented Aug 21, 2018 • edited Loading

mschubert commented Aug 22, 2018

wlandau commented Aug 22, 2018 • edited Loading

mschubert commented Aug 22, 2018

wlandau commented Aug 22, 2018

wlandau commented Aug 22, 2018 • edited Loading

wlandau commented Aug 23, 2018

mschubert commented Aug 24, 2018

wlandau commented Aug 24, 2018

wlandau commented Jun 27, 2018 •

edited

Loading

wlandau commented Jun 29, 2018 •

edited

Loading

mschubert commented Jul 5, 2018 •

edited

Loading

wlandau commented Jul 18, 2018 •

edited

Loading

mschubert commented Aug 9, 2018 •

edited

Loading

wlandau commented Aug 10, 2018 •

edited

Loading

wlandau commented Aug 10, 2018 •

edited

Loading

wlandau commented Aug 11, 2018 •

edited

Loading

wlandau commented Aug 12, 2018 •

edited

Loading

mschubert commented Aug 12, 2018 •

edited

Loading

wlandau commented Aug 13, 2018 •

edited

Loading

wlandau commented Aug 13, 2018 •

edited

Loading

wlandau commented Aug 14, 2018 •

edited

Loading

wlandau commented Aug 14, 2018 •

edited

Loading

wlandau commented Aug 21, 2018 •

edited

Loading

wlandau commented Aug 21, 2018 •

edited

Loading

wlandau commented Aug 22, 2018 •

edited

Loading

wlandau commented Aug 22, 2018 •

edited

Loading