
Target level resource overrides not honored with clustermq #635

Closed
nviets opened this issue Sep 13, 2021 · 8 comments

Comments


nviets commented Sep 13, 2021

In targets 0.7.0.9000, target-level resource overrides are not honored: targets inherit the resource settings from the parent session options set with tar_option_set(), instead of the ones defined in the `resources` argument of tar_target().

# _targets.R
options("clustermq.scheduler" = "SLURM", clustermq.template = "slurm_clustermq.tmpl", tidyverse.quiet = TRUE)

library(targets)
library(tarchetypes)
library(tidyverse)
resources_1 <- tar_resources(
  clustermq = tar_resources_clustermq(template = list(n_cores = 2, memory = 500, job_name = 'hello'))
)

resources_2 <- tar_resources(
  clustermq = tar_resources_clustermq(template = list(n_cores = 2, memory = 1000, job_name = 'world'))
)

tar_option_set(
  resources = resources_1
)

source("R/functions.R") # function to grab session memory

list(
  tar_target(
    mem1,
    memCheck()
  ),
  tar_target(
    mem2,
    memCheck(),
    resources = resources_2
  )
)
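
For reference, memCheck() lives in R/functions.R (not shown above). A minimal stand-in, assuming it just reports the memory SLURM allocated to the job (SLURM_MEM_PER_NODE may not be set on every cluster, so treat this as illustrative):

# Hypothetical stand-in for the memCheck() defined in R/functions.R:
# report the memory SLURM allocated to the job the worker is running in.
memCheck <- function() {
  list(mem = paste0(Sys.getenv("SLURM_MEM_PER_NODE"), "MB"))
}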

Result:

> tar_read(mem1)
$mem
[1] "1048MB"

> tar_read(mem2)
$mem
[1] "1048MB"

mattwarkentin commented Sep 13, 2021

Hi @nviets,

It looks like the targets are actually inheriting the correct resources.

library(targets)
library(tarchetypes)

resources_1 <- tar_resources(
  clustermq = tar_resources_clustermq(template = list(n_cores = 2, memory = 500, job_name = 'hello'))
)

resources_2 <- tar_resources(
  clustermq = tar_resources_clustermq(template = list(n_cores = 2, memory = 1000, job_name = 'world'))
)

tar_option_set(
  resources = resources_1
)

ll <- list(
  tar_target(
    mem1,
    memCheck()
  ),
  tar_target(
    mem2,
    memCheck(),
    resources = resources_2
  )
)

ll[[1]]$settings$resources
#> $clustermq
#> <tar_resources_clustermq>
#>   template: list(n_cores = 2, memory = 500, job_name = "hello")
ll[[2]]$settings$resources
#> $clustermq
#> <tar_resources_clustermq>
#>   template: list(n_cores = 2, memory = 1000, job_name = "world")

I think this is related to how targets and clustermq work together. @wlandau is away this week, but my understanding is that clustermq uses the filled-in slurm_clustermq.tmpl to spin up however many persistent workers it can (based on available and requested resources), and targets then hands targets off to those workers as they become available. I'm pretty sure workers are not target-specific, so I don't think it makes sense to set target-specific HPC resource requests.

I believe this is confirmed by the following code, where self$workers workers are created from the global resources, independent of any specific target.

create_crew = function() {
  crew <- clustermq::workers(
    n_jobs = self$workers,
    template = tar_option_get("resources")$clustermq$template %|||%
      tar_option_get("resources") %|||%
      list(),
    log_worker = self$log_worker
  )
  self$crew <- crew
},


nviets commented Sep 13, 2021

Thanks for the quick response @mattwarkentin! Looks like the issue is related to persistent workers in clustermq. Can workers be configured with different resources, e.g. one clustermq worker assigned to a GPU SLURM queue and another to a normal server?


mattwarkentin commented Sep 13, 2021

Hmm, that's a good question. I don't believe that is currently possible. In your SLURM template file you can control the partition and nodelist to ensure that clustermq spins up workers using specific resources (e.g. GPU), but I don't think you can set different resources for different workers (at least not with a single call to clustermq::workers(), as is done by targets). More importantly, I don't think you can control which targets are ultimately sent to which workers.
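
For illustration, a sketch of what that could look like in the template file (modeled on the standard clustermq SLURM template; the exact placeholder names depend on your clustermq version, and the partition/nodelist values here are made up):

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=gpu            # every worker lands on the GPU partition
#SBATCH --nodelist=gpu-node-01     # optionally pin workers to specific nodes
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 4096 }}
#SBATCH --array=1-{{ n_jobs }}

CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

Note that every worker launched from a single template gets the same partition/nodelist, which is why this does not translate into per-target resources.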

Maybe we can page @mschubert to see if he has any insights into spinning up workers with different resources.


mattwarkentin commented Sep 13, 2021

For your interest, there were some lengthy discussions about building out clustermq's support for transient workers here: mschubert/clustermq#257

Not sure what came of those discussions, but if transient workers are supported in the future, it might make sense to start thinking about target-specific compute resources. I don't think it makes sense under the current approach (i.e. persistent workers).

@mschubert

We don't support heterogeneous workers in clustermq yet, unfortunately. It's on the roadmap (tracked in mschubert/clustermq#145), but I don't have an ETA.


nviets commented Sep 14, 2021

Thanks @mattwarkentin and @mschubert - the context is helpful! Is support for transient workers naturally something that falls within the clustermq package rather than targets? I am not familiar with the internal mechanics of how targets farms out jobs to clustermq. Is there no way to make independent calls to clustermq with different worker configurations? As a user, it seems like the interface to support this already exists via tar_target(..., resources = configX).


mattwarkentin commented Sep 14, 2021

Targets supports several resource backends; I think future provides support for transient workers. Like you, I use a SLURM HPC and much prefer clustermq for these purposes. tar_make_clustermq() is effectively a wrapper around the clustermq API, so targets is "limited" to what clustermq supports (though I don't feel it is really limited at all).

In the future, if clustermq gains support for transient and/or heterogeneous workers, then I imagine @wlandau would at least think about supporting this in a way that makes sense for targets. In theory, you could create artificial transient workers by just making a call to clustermq::workers() for each target, but this would be inefficient and would mean a significant refactor of how targets works with clustermq, and not for the better, in my opinion.
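
To make that concrete, here is a rough sketch of what heterogeneous resources look like when you call clustermq directly, outside of targets (the template fields mirror the ones from the original post; the functions are placeholders). Each Q() call brings up its own set of workers and shuts them down when it returns, so you pay the startup cost per call:

library(clustermq)
options(clustermq.scheduler = "SLURM", clustermq.template = "slurm_clustermq.tmpl")

# small-memory workers for a cheap step
cheap <- Q(function(x) x^2, x = 1:10, n_jobs = 2,
           template = list(n_cores = 2, memory = 500, job_name = "hello"))

# larger-memory workers for a heavier step
heavy <- Q(function(x) sum(rnorm(x)), x = rep(1e6, 4), n_jobs = 2,
           template = list(n_cores = 2, memory = 1000, job_name = "world"))

That is roughly the behavior per-target resources would require, but it forfeits the persistent-worker model targets currently relies on.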

There is a computational cost to starting up/shutting down workers, starting R processes, loading package dependencies, loading targets, and sending data over sockets, which is why targets mostly does all of this once and up front (some of this is configurable). With persistent workers, most of these tasks only need to be done once, and then jobs can be dispatched as needed. Workers only shut down when there is no work left for them to receive. There are definitely use cases for both transient and persistent workers.


wlandau commented Sep 21, 2021

Thanks for the discussion. Indeed, this is a limitation of the current clustermq package and not a targets issue. The targets interface is designed to allow target-specific resources, even though some of those resources are currently applied globally.

wlandau closed this as completed Sep 21, 2021