Replies: 2 comments 2 replies
-
To make it easier to debug this, you could try running an equivalent workload with just … Once you get your setup working, https://wlandau.github.io/crew.cluster/ + …
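For example, a minimal sketch of such an equivalent workload submitted directly through {clustermq}, bypassing {targets} entirely (the scheduler option and template path are assumptions about your setup):

```r
# Sketch only: submit the same scale of work straight through {clustermq}
# to see whether the failure reproduces without {targets} in the loop.
library(clustermq)
options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "slurm.tmpl"  # path to your SLURM template (assumed)
)

# 1600 tiny tasks on 1600 workers, mirroring the failing configuration.
res <- Q(function(x) sum(rnorm(1e6)), x = seq_len(1600), n_jobs = 1600)
```

If this reproduces the failure, the problem is in the clustermq/scheduler layer rather than in the pipeline itself.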
-
I think I finally got to the bottom of this. It does seem to be a memory issue on the R session running the targets pipeline. Thank you for helping me diagnose this issue, and sorry for not catching it before posting here!
-
Help
Description
Dear all,
My setup is {[email protected]} + {[email protected]} on SLURM on an AWS ParallelCluster.
My pipeline is:
I use a minimally modified version of the default template (sh -> bash, job name).
If I run this with a low number of workers, say around 100,
targets::tar_make_clustermq(reporter = "summary", workers = 100L)
this works great, although a bit slowly. If I run this with many more workers,
targets::tar_make_clustermq(reporter = "summary", workers = 1600L)
the same pipeline has a fairly high chance of failing at some point. I suspect that the high number of individual small workers is maxing out either networking or I/O (no Lustre FS) on the launching instance, but I cannot pin it down exactly.
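One thing worth ruling out first (an assumption on my part, not something I have confirmed for this cluster): each connected worker holds a TCP connection, i.e. one open file descriptor, in the R session running the pipeline, and 1600 workers would exceed a common default soft limit of 1024 open files. A quick check before launching:

```shell
# Show the current soft limit on open file descriptors for this shell:
ulimit -n
# Raise the soft limit up to the hard limit before starting the pipeline:
ulimit -n "$(ulimit -Hn)"
```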
I tried to fix the situation with … but the same issue occurs.
My next attempt was to leverage the rep_workers argument of tarchetypes::tar_map_rep() to see whether that reduces the worker-host communication and gets rid of the problem, but this leads to …
I can, however, open PSOCK clusters manually via
parallel::makePSOCKcluster(workers)
on the compute nodes. I would appreciate any ideas on how to diagnose this further.
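For reference, a minimal sketch of the rep_workers approach mentioned above (the target name, command, and batch/rep counts are illustrative, and run_one_rep() is a placeholder):

```r
# Sketch only: rep_workers runs several reps of each batch in parallel local
# processes inside one worker, so fewer workers talk back to the host session.
library(targets)
library(tarchetypes)

tar_map_rep(
  name = sim,               # illustrative target name
  command = run_one_rep(),  # run_one_rep() is a placeholder function
  batches = 100,            # 100 dynamic branches ...
  reps = 16,                # ... each running 16 reps
  rep_workers = 4           # 4 local R processes per worker
)
```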