
Can workers begin processing concurrently as the master uploads jobs? #39

nathaniel-mahieu opened this issue Aug 8, 2016 · 9 comments


@nathaniel-mahieu
Contributor

When using foreach with doRedis, the workers wait until all jobs have reached the Redis server before they begin processing. Is it possible to have them start before all the preprocessing has finished?

I am using an iterator, which is working great: preprocessing happens 'just in time', and job data begins to hit the server while the iterator is still running. I can't take advantage of this behavior, though, because the workers wait until all jobs have been uploaded before they start grabbing them from the queue.

Example code:

library(foreach)
library(doRedis)

registerDoRedis("worklist", "0.0.0.0")

foreach(var = complex.iter(1:1E6)) %dopar% {
    process.function(var)
}

In this example complex.iter takes a while and there are many elements to iterate over. As such it would be great if workers started running process.function() before all the preprocessing is finished. Unfortunately they seem to wait until complex.iter has run on all elements.

I have already set .inorder=FALSE.

Any suggestions as to how to achieve this desired behavior? Thanks.
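For reference, here is a minimal sketch of what a lazy iterator like complex.iter might look like, following the custom-iterator pattern from the iterators package. This is hypothetical (the real complex.iter is not shown), and expensive.preprocess is a placeholder for the slow preprocessing step:

```r
library(iterators)

# Hypothetical lazy iterator: each call to nextElem() preprocesses exactly
# one element, so work happens "just in time" rather than all up front.
complex.iter <- function(x) {
  i <- 0
  nextEl <- function() {
    i <<- i + 1
    if (i > length(x)) stop("StopIteration", call. = FALSE)
    expensive.preprocess(x[i])  # placeholder: the slow preprocessing step
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c("abstractiter", "iter")
  obj
}
```

The "StopIteration" error is the conventional end-of-iteration signal that foreach and the iterators package expect.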

@nathaniel-mahieu
Contributor Author

nathaniel-mahieu commented Aug 8, 2016

Looking at the source it looks like currently the answer is no. https://github.com/bwlewis/doRedis/blob/master/R/doRedis.R#L492

So in addition to the question of feasibility, am I misunderstanding this as one of the major features of iterators? If all work has to be queued before it can be operated on how do we get the memory savings associated with iterating over data in a "just in time" manner?

Thanks for the help!

@bwlewis
Owner

bwlewis commented Aug 8, 2016

This is a good question and an astute observation. Right now the code you refer to:

# use nonblocking call to submit all tasks at once
  redisSetPipeline(TRUE)
  redisMulti()
  while(j <= ntasks)
  {
    k <- min(j + chunkSize, ntasks)
    block <- argsList[j:k]
    if(is.null(block)) break
    if(!is.null(gather)) names(block) <- rep(nout, k - j + 1)
    else names(block) <- j:k
    blocknames <- c(blocknames, list(names(block)))
    redisRPush(queue, list(ID=ID, argsList=block))
    j <- k + 1
    nout <- nout + 1
  }
  redisExec()
  redisGetResponse(all=TRUE)
  redisSetPipeline(FALSE)

could be changed to, for example:

  while(j <= ntasks)
  {
    k <- min(j + chunkSize, ntasks)
    block <- argsList[j:k]
    if(is.null(block)) break
    if(!is.null(gather)) names(block) <- rep(nout, k - j + 1)
    else names(block) <- j:k
    blocknames <- c(blocknames, list(names(block)))
    redisRPush(queue, list(ID=ID, argsList=block))
    j <- k + 1
    nout <- nout + 1
  }

and then tasks would go into the queue one at a time, and the workers would pull them immediately. For a job with thousands of tasks, though, submission would take much longer.

I guess I had that code in there for efficient submission of large jobs. The new TCP nodelay setting in the rredis package probably goes a long way toward making this optimization less important.

So, this would be an interesting change to experiment with and see how well it works. Would you consider submitting a pull request if this change works well for you?

@nathaniel-mahieu
Contributor Author

I will definitely give it a try. But first, there is a lot I don't understand about that code so I want to clarify.

Will we create a race condition between the workers and the uploaded work, in which the job could end prematurely? Or will the workers still wait for all the jobs in the case of slow preprocessing? (Don't workers quit when the queue goes empty?)

Also, does .to.list() on this line: https://github.com/bwlewis/doRedis/blob/master/R/doRedis.R#L380 cause the full iteration to be executed, in which case we will still wait to upload?

@bwlewis
Owner

bwlewis commented Aug 8, 2016

Workers don't quit when the queue goes empty, but they can be configured to exit after a fixed number of tasks. And I don't see a race condition: recall that with doRedis one can submit tasks without any workers around, and the master will just wait until someone comes along and runs them (anonymous pull-based scheduling). That's its main feature, and what makes it very elastic.

But you're right about the second part. The .to.list() call does indeed just materialize all entries in the iterator. Sorry, it's been quite a while since I've looked closely at this code.

That .to.list() pattern is repeated in every foreach backend that I know of. It's unfortunate. Let me think a while about a work-around.
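For context, that pattern essentially drains the iterator into a list before any task is pushed, something like this sketch (not the actual doRedis source; the helper name to.list is illustrative):

```r
# Sketch of iterator materialization: calls nextElem() until the iterator
# signals exhaustion, so every element's preprocessing completes before a
# single task is submitted to Redis.
to.list <- function(it) {
  out <- list()
  repeat {
    # nextElem() raises a "StopIteration" error at the end; we treat any
    # error as end-of-iteration in this sketch
    x <- tryCatch(iterators::nextElem(it), error = function(e) NULL)
    if (is.null(x)) break
    out[[length(out) + 1]] <- x
  }
  out
}
```

This is exactly why the "just in time" memory savings disappear: the full list of preprocessed arguments lives in the master's memory before submission begins.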

@bwlewis
Owner

bwlewis commented Aug 8, 2016

Would be nice to use nextElem() but then also interleave those task submissions with the code that checks for results in the result queue. This would be a pretty big change...
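A rough sketch of what that interleaving might look like, alternating one task submission with one non-blocking result check in the single master process. Everything here is illustrative (it, queue, resultQueue, ID, and ntasks are assumed to exist), not the actual doRedis implementation:

```r
library(rredis)

submitted <- 0
collected <- 0
results <- list()
while (collected < ntasks) {
  if (submitted < ntasks) {
    block <- iterators::nextElem(it)            # lazy: preprocess one task now
    redisRPush(queue, list(ID = ID, argsList = block))
    submitted <- submitted + 1
  }
  res <- redisLPop(resultQueue)                 # non-blocking; NULL if empty
  if (!is.null(res)) {
    results[[length(results) + 1]] <- res
    collected <- collected + 1
  }
}
```

Because submission and collection share one process, a slow nextElem() still delays result pickup, but workers can start as soon as the first task lands in the queue.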

@nathaniel-mahieu
Contributor Author

Initially it appears that using nextElem() would afford the concurrency benefits. I could see concurrently checking for results, but only if we had a second process to work with. As it stands, with nextElem() wouldn't we be uploading or downloading results 100% of the time with our one process?

Looking at the surrounding code (with a vague understanding) nextElem() seems like it would work nearly drop in with the current while(j <= ntasks) loop.

Could we also use redisMulti() and let the number of uploaded chunks be configurable? A default of chunkSize/10 would probably be reasonable. Ideally this would be the number of workers I think - but we can't know that.
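A hedged sketch of that compromise: pull tasks lazily with nextElem() but pipeline them in batches, flushing every batchSize chunks so workers can start on each batch as soon as it is executed. The function name submit.batched and its arguments are illustrative, not doRedis internals:

```r
library(rredis)
library(iterators)

# Batched pipelined submission: workers see each batch after redisExec(),
# instead of waiting for the entire job to be queued.
submit.batched <- function(it, queue, ID, batchSize = 10) {
  done <- FALSE
  while (!done) {
    redisSetPipeline(TRUE)
    redisMulti()
    for (b in seq_len(batchSize)) {
      block <- tryCatch(nextElem(it), error = function(e) NULL)
      if (is.null(block)) { done <- TRUE; break }
      redisRPush(queue, list(ID = ID, argsList = block))
    }
    redisExec()                    # flush this batch to the server
    redisGetResponse(all = TRUE)
    redisSetPipeline(FALSE)
  }
}
```

Setting batchSize to roughly the number of workers would keep everyone busy while amortizing round trips, though as noted the master cannot know that number.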

@bwlewis
Owner

bwlewis commented Aug 11, 2016

Oops, I merged this but now I see that tests are failing. For instance

R CMD build doRedis && TEST_DOREDIS=TRUE R CMD check doRedis_1.2.0.tar.gz

fails with errors. I'll work on figuring out what's wrong tonight... In the meantime, I reverted this but moved the commit into the 'devel' branch.

@nathaniel-mahieu
Contributor Author

Great, thanks for checking it for me. I will look into this as well.

@bwlewis
Owner

bwlewis commented Aug 18, 2016

Sorry I've been slow to investigate... lots going on.
