
Can workers begin processing concurrently as the master uploads jobs? #39

nathaniel-mahieu opened this issue Aug 8, 2016 · 9 comments


@nathaniel-mahieu
Contributor

When using foreach with doRedis, the workers wait until all jobs have reached the Redis server before they begin processing. Is it possible to have them start before all the preprocessing has finished?

I am using an iterator, which is working great: preprocessing happens 'just in time', and job data begins to hit the server while the iterator is still running. I can't take advantage of this behavior, though, because the workers wait until all jobs have been uploaded before they start grabbing them from the queue.

Example code:

library(foreach)
library(doRedis)

registerDoRedis("worklist", "0.0.0.0")

foreach(var = complex.iter(1:1E6)) %dopar% {
    process.function(var)
}

In this example complex.iter takes a while and there are many elements to iterate over. As such it would be great if workers started running process.function() before all the preprocessing is finished. Unfortunately they seem to wait until complex.iter has run on all elements.

I have already set .inorder=FALSE.

Any suggestions as to how to achieve this desired behavior? Thanks.
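For reference, here is a minimal sketch of what a lazy iterator like complex.iter might look like, following the custom-iterator pattern from the iterators package. This is hypothetical (the real complex.iter is not shown), and expensive.preprocess is a placeholder for the slow preprocessing step:

```r
library(iterators)

# Hypothetical lazy iterator: each call to nextElem() preprocesses exactly
# one element, so work happens "just in time" rather than all up front.
complex.iter <- function(x) {
  i <- 0
  nextEl <- function() {
    i <<- i + 1
    if (i > length(x)) stop("StopIteration", call. = FALSE)
    expensive.preprocess(x[i])  # placeholder: the slow preprocessing step
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c("abstractiter", "iter")
  obj
}
```

The "StopIteration" error is the conventional end-of-iteration signal that foreach and the iterators package expect.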

@nathaniel-mahieu
Contributor Author

nathaniel-mahieu commented Aug 8, 2016

Looking at the source it looks like currently the answer is no. https://github.com/bwlewis/doRedis/blob/master/R/doRedis.R#L492

So in addition to the question of feasibility, am I misunderstanding this as one of the major features of iterators? If all work has to be queued before it can be operated on how do we get the memory savings associated with iterating over data in a "just in time" manner?

Thanks for the help!

@bwlewis
Owner

bwlewis commented Aug 8, 2016

This is a good question and an astute observation. Right now the code you refer to:

# use nonblocking call to submit all tasks at once
  redisSetPipeline(TRUE)
  redisMulti()
  while(j <= ntasks)
  {
    k <- min(j + chunkSize, ntasks)
    block <- argsList[j:k]
    if(is.null(block)) break
    if(!is.null(gather)) names(block) <- rep(nout, k - j + 1)
    else names(block) <- j:k
    blocknames <- c(blocknames, list(names(block)))
    redisRPush(queue, list(ID=ID, argsList=block))
    j <- k + 1
    nout <- nout + 1
  }
  redisExec()
  redisGetResponse(all=TRUE)
  redisSetPipeline(FALSE)

could be changed to, for example:

  while(j <= ntasks)
  {
    k <- min(j + chunkSize, ntasks)
    block <- argsList[j:k]
    if(is.null(block)) break
    if(!is.null(gather)) names(block) <- rep(nout, k - j + 1)
    else names(block) <- j:k
    blocknames <- c(blocknames, list(names(block)))
    redisRPush(queue, list(ID=ID, argsList=block))
    j <- k + 1
    nout <- nout + 1
  }

and then tasks would go into the queue one at a time, and the workers would pull them immediately. For a job with thousands of tasks, though, submission would take much longer.

I guess I had that code in there for efficient submission of large jobs. The new TCP nodelay setting in the rredis package probably goes a long way toward making this optimization less important.

So, this would be an interesting change to experiment with and see how well it works. Would you consider submitting a pull request if this change works well for you?

@nathaniel-mahieu
Contributor Author

I will definitely give it a try. But first, there is a lot I don't understand about that code so I want to clarify.

Will we create a race condition between the workers and the uploaded work, in which the job could end prematurely? Or will the workers still wait for all the jobs in the case of slow preprocessing? (Don't workers quit when the queue goes empty?)

Also, does .to.list() on this line: https://github.com/bwlewis/doRedis/blob/master/R/doRedis.R#L380 cause the full iteration to be executed, in which case we will still wait to upload?

@bwlewis
Owner

bwlewis commented Aug 8, 2016

Workers don't quit when the queue goes empty, but they can be configured to exit after a fixed number of tasks. And I don't see a race condition: recall that with doRedis one can submit tasks without any workers around, and the master will just wait until someone comes along and runs them (anonymous pull-based scheduling). That's its main feature, and what makes it very elastic.

But you're right about the second part. The .to.list() call does indeed just materialize all entries in the iterator. Sorry, it's been quite a while since I've looked closely at this code.

That .to.list() pattern is repeated in every foreach backend that I know of. It's unfortunate. Let me think a while about a work-around.
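For context, that pattern essentially drains the iterator into a list before any task is pushed, something like this sketch (not the actual doRedis source; the helper name to.list is illustrative):

```r
# Sketch of iterator materialization: calls nextElem() until the iterator
# signals exhaustion, so every element's preprocessing completes before a
# single task is submitted to Redis.
to.list <- function(it) {
  out <- list()
  repeat {
    # nextElem() raises a "StopIteration" error at the end; we treat any
    # error as end-of-iteration in this sketch
    x <- tryCatch(iterators::nextElem(it), error = function(e) NULL)
    if (is.null(x)) break
    out[[length(out) + 1]] <- x
  }
  out
}
```

This is exactly why the "just in time" memory savings disappear: the full list of preprocessed arguments lives in the master's memory before submission begins.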

@bwlewis
Owner

bwlewis commented Aug 8, 2016

Would be nice to use nextElem() but then also interleave those task submissions with the code that checks for results in the result queue. This would be a pretty big change...
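A rough sketch of what that interleaving might look like, alternating one task submission with one non-blocking result check in the single master process. Everything here is illustrative (it, queue, resultQueue, ID, and ntasks are assumed to exist), not the actual doRedis implementation:

```r
library(rredis)

submitted <- 0
collected <- 0
results <- list()
while (collected < ntasks) {
  if (submitted < ntasks) {
    block <- iterators::nextElem(it)            # lazy: preprocess one task now
    redisRPush(queue, list(ID = ID, argsList = block))
    submitted <- submitted + 1
  }
  res <- redisLPop(resultQueue)                 # non-blocking; NULL if empty
  if (!is.null(res)) {
    results[[length(results) + 1]] <- res
    collected <- collected + 1
  }
}
```

Because submission and collection share one process, a slow nextElem() still delays result pickup, but workers can start as soon as the first task lands in the queue.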

@nathaniel-mahieu
Contributor Author

Initially it appears that using nextElem() would afford the concurrency benefits. I could see concurrently checking for results, but only if we had a second process to work with. As it stands, with nextElem() wouldn't we be uploading or downloading results 100% of the time with our one process?

Looking at the surrounding code (with a vague understanding) nextElem() seems like it would work nearly drop in with the current while(j <= ntasks) loop.

Could we also use redisMulti() and let the number of uploaded chunks be configurable? A default of chunkSize/10 would probably be reasonable. Ideally this would be the number of workers I think - but we can't know that.
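A hedged sketch of that compromise: pull tasks lazily with nextElem() but pipeline them in batches, flushing every batchSize chunks so workers can start on each batch as soon as it is executed. The function name submit.batched and its arguments are illustrative, not doRedis internals:

```r
library(rredis)
library(iterators)

# Batched pipelined submission: workers see each batch after redisExec(),
# instead of waiting for the entire job to be queued.
submit.batched <- function(it, queue, ID, batchSize = 10) {
  done <- FALSE
  while (!done) {
    redisSetPipeline(TRUE)
    redisMulti()
    for (b in seq_len(batchSize)) {
      block <- tryCatch(nextElem(it), error = function(e) NULL)
      if (is.null(block)) { done <- TRUE; break }
      redisRPush(queue, list(ID = ID, argsList = block))
    }
    redisExec()                    # flush this batch to the server
    redisGetResponse(all = TRUE)
    redisSetPipeline(FALSE)
  }
}
```

Setting batchSize to roughly the number of workers would keep everyone busy while amortizing round trips, though as noted the master cannot know that number.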

@bwlewis
Owner

bwlewis commented Aug 11, 2016

Oops, I merged this but now I see that tests are failing. For instance

R CMD build doRedis && TEST_DOREDIS=TRUE R CMD check doRedis_1.2.0.tar.gz

fails with errors. I'll work on figuring out what's wrong tonight... In the meantime, I reverted this but moved the commit into the 'devel' branch.

@nathaniel-mahieu
Contributor Author

Great, thanks for checking it for me. I will look into this as well.

@bwlewis
Owner

bwlewis commented Aug 18, 2016

Sorry I've been slow to investigate... lots going on.
