From d3ea5e7b252b624e04c810ead188d01b0f3cf4db Mon Sep 17 00:00:00 2001
From: Philippe Marchand
Date: Fri, 27 May 2016 19:00:05 -0400
Subject: [PATCH] Add vignette and finalize for CRAN submission

---
 .Rbuildignore                 |   2 +-
 .gitignore                    |   1 +
 .travis.yml                   |   0
 DESCRIPTION                   |  10 +-
 R/slurm_call.R                |   2 +-
 README.md                     | 239 +++++++++++++++++++++++++++++-----
 cran-comments.md              |   9 ++
 vignettes/rslurm-vignette.Rmd | 233 +++++++++++++++++++++++++++++++++
 8 files changed, 460 insertions(+), 36 deletions(-)
 mode change 100644 => 100755 .travis.yml
 create mode 100755 cran-comments.md
 create mode 100755 vignettes/rslurm-vignette.Rmd

diff --git a/.Rbuildignore b/.Rbuildignore
index 42404a8..e5afaad 100755
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -1,4 +1,4 @@
 ^.*\.Rproj$
 ^\.Rproj\.user$
 ^\.travis\.yml$
-^README.md$
+cran-comments.md
diff --git a/.gitignore b/.gitignore
index 807ea25..09a72cb 100755
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 .Rproj.user
 .Rhistory
 .RData
+inst/doc
diff --git a/.travis.yml b/.travis.yml
old mode 100644
new mode 100755
diff --git a/DESCRIPTION b/DESCRIPTION
index 43154c4..0a09194 100755
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -8,12 +8,16 @@ Version: 0.3.0
 License: GPL-3
 URL: https://github.com/SESYNC-ci/rslurm
 BugReports: https://github.com/SESYNC-ci/rslurm/issues
-Authors@R: person('Philippe', 'Marchand', email = 'pmarchand@sesync.org',
-    role = c('aut', 'cre'))
+Authors@R: c(person('Philippe', 'Marchand', email = 'pmarchand@sesync.org',
+    role = c('aut', 'cre')),
+    person('Mike', 'Smorul', role = 'ctb'))
 Depends: R (>= 3.2.0)
 Imports: parallel, whisker (>= 0.3)
 RoxygenNote: 5.0.1
-Suggests: testthat
+Suggests: testthat,
+    knitr,
+    rmarkdown
+VignetteBuilder: knitr
diff --git a/R/slurm_call.R b/R/slurm_call.R
index 982f345..7030636 100755
--- a/R/slurm_call.R
+++ b/R/slurm_call.R
@@ -90,7 +90,7 @@ slurm_call <- function(f, params, jobname = NA, add_objects = NULL,
     writeLines(script_r, file.path(tmpdir, "slurm_run.R"))
 
     # Create submission bash script
-    template_sh <- readLines(system.file("templates/submit_sh.txt",
+    template_sh <- readLines(system.file("templates/submit_single_sh.txt",
                              package = "rslurm"))
     slurm_options <- format_option_list(slurm_options)
     script_sh <- whisker::whisker.render(template_sh,
diff --git a/README.md b/README.md
index 91e142d..97044d5 100755
--- a/README.md
+++ b/README.md
@@ -1,59 +1,236 @@
 rslurm
 ======
-This R package simplifies the process of splitting a R calculation over a computing cluster that uses the [SLURM](http://slurm.schedmd.com/) workload manager.
+[![Travis-CI Build Status](https://travis-ci.org/SESYNC-ci/rslurm.svg?branch=master)](https://travis-ci.org/SESYNC-ci/rslurm)
 
-Currently, it is possible to use existing R packages like `parallel` to split a calculation over multiple CPUs in a single cluster node. The functions in this package automate the process of dividing the parameter sets over multiple cluster nodes (using a slurm array), applying the function in parallel in each node using `parallel`, and recombining the output.
+Many computing-intensive processes in R involve the repeated evaluation of
+a function over many items or parameter sets. These so-called
+[embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)
+calculations can be run serially with the `lapply` or `Map` function, or in parallel
+on a single machine with `mclapply` or `mcMap` (from the **parallel** package).
+The rslurm package simplifies the process of distributing this type of calculation
+across a computing cluster that uses the [SLURM](http://slurm.schedmd.com/)
+workload manager. Its main function, `slurm_apply`, automatically divides the
+computation over multiple nodes and writes the necessary submission scripts.
+It also includes functions to retrieve and combine the output from different nodes,
+as well as wrappers for common SLURM commands.
-How to install / use
---------------------
+*Development of this R package was supported by the National Socio-Environmental Synthesis Center (SESYNC) under funding received from the National Science Foundation DBI-1052875.*
-Install the package from GitHub using the following code:
-```R
-install.packages("devtools")
-devtools::install_github("SESYNC-ci/rslurm")
-```
+
+
+### Table of contents
+
+- [Basic example](#basic-example)
+- [Single function evaluation](#single-function-evaluation)
+- [Adding auxiliary data and functions](#adding-auxiliary-data-and-functions)
+- [Configuring SLURM options](#configuring-slurm-options)
+- [Generating scripts for later submission](#generating-scripts-for-later-submission)
+- [How it works / advanced customization](#how-it-works-advanced-customization)
+
+
+## Basic example
+
+To illustrate a typical rslurm workflow, we use a simple function that takes
+a mean and standard deviation as parameters, generates a million normal deviates
+and returns the sample mean and standard deviation.
+
+```r
+test_func <- function(par_mu, par_sd) {
+    samp <- rnorm(10^6, par_mu, par_sd)
+    c(s_mu = mean(samp), s_sd = sd(samp))
+}
+```
-Here's an overview of the workflow using this package:
+
+We then create a parameter data frame where each row is a parameter set and each
+column matches an argument of the function.
-- Create a function that you want to call with multiple parameter sets, and a data frame containing these parameter sets.
-- Call `slurm_apply` with the function, the parameters data frame and (if applicable) the names of additional R objects needed as arguments. The function returns a `slurm_job` object.
-- The `slurm_job` object can be passed to other utility functions in the package to inquire about the SLURM job's status (`print_job_status`), cancel the job (`cancel_slurm`), collect the output in a single list or data frame (`get_slurm_out`), or delete the temporary files generated during the process (`cleanup_files`).
+```r
+pars <- data.frame(par_mu = 1:10,
+                   par_sd = seq(0.1, 1, length.out = 10))
+head(pars, 3)
+```
+
+```
+  par_mu par_sd
+1      1    0.1
+2      2    0.2
+3      3    0.3
+```
-Read the `rslurm-package` help file in R and each function's help file for more details.
+We can now pass that function and the parameters data frame to `slurm_apply`,
+specifying the number of cluster nodes to use and the number of CPUs per node.
+The latter (`cpus_per_node`) determines how many processes will be forked on
+each node, via the `mc.cores` argument of `parallel::mcMap`.
+```r
+library(rslurm)
+sjob <- slurm_apply(test_func, pars, jobname = "test_job",
+                    nodes = 2, cpus_per_node = 2)
+```
+The output of `slurm_apply` is a *slurm_job* object that stores a few pieces of
+information (job name and number of nodes) needed to retrieve the job's output.
+Assuming the function is run on a machine with access to the cluster, it also
+prints a message confirming the job has been submitted to SLURM.
+```
+Submitted batch job 352375
+```
-Instructions for SESYNC users
------------------------------
+Particular clusters may require the specification of additional SLURM options,
+such as time and memory limits for the job. Also, when running R on a local
+machine without direct cluster access, you may want to generate scripts to be
+copied to the cluster and run at a later time. These topics are covered in
+additional sections below this basic example.
+
+After the job has been submitted, you can call `print_job_status` to display its
+status (in queue, running or completed) or call `cancel_slurm` to cancel its
+execution. These functions are R wrappers for the SLURM command line functions
+`squeue` and `scancel`, respectively.
+
+Once the job completes, `get_slurm_out` reads and combines the output from all
+nodes.
+```r
+res <- get_slurm_out(sjob, outtype = "table")
+head(res, 3)
+```
-When using the SESYNC SLURM cluster, you should set the `nodes` argument of `slurm_apply` to a value less than the number of nodes available on the cluster (there are 20 nodes in total). You should set `cpus_per_node = 8` unless your job requires a large amount of memory (i.e. when running 8 copies would exceed the 60 Gb available by node).
+
+```
+      s_mu       s_sd
+1 1.000005 0.09987899
+2 2.000185 0.20001108
+3 3.000238 0.29988789
+```
-You must also specify the correct partition for jobs to be run in serial or parallel mode. This can be done in one of two ways:
+
+When `outtype = "table"`, the outputs from each function evaluation are
+row-bound into a single data frame; this is an appropriate format when the
+function returns a simple vector. The default `outtype = "raw"` combines the
+outputs into a list and can thus handle arbitrarily complex return objects.
-*As an option set in each call to the `rslurm` functions*
+
+```r
+res_raw <- get_slurm_out(sjob, outtype = "raw")
+res_raw[1:3]
+```
-* For `slurm_apply`, set `slurm_options = list(partition = "sesync")`.
-* For `slurm_call`, set `slurm_options = list(partition = "sesyncshared", share = TRUE)`.
+
+```
+[[1]]
+      s_mu       s_sd
+1.00000506 0.09987899
-*By editing the template scripts*
+
+[[2]]
+     s_mu      s_sd
+2.0001852 0.2000111
-Note: We recommend saving a backup copy of the original templates before editing them.
+
+[[3]]
+     s_mu      s_sd
+3.0002377 0.2998879
+```
-* Go to the `rslurm` folder in your R library (generally located at `~/R/x86_64-pc-linux-gnu-library/3.3/`, with "3.3" replaced with the latest version of R). Open the `templates` subfolder.
+
+The files generated by `slurm_apply` are saved in a folder named
+*\_rslurm_[jobname]* under the current working directory.
-* In `submit_sh.txt`, insert the line
+
+```r
+dir("_rslurm_test_job")
+```
-```
-#SBATCH --partition=sesync
-```
-before the first `#SBATCH` line.
-* In `submit_single_sh.txt`, insert the lines
-```
-#SBATCH --partition=sesyncshared
-#SBATCH --share
-```
-before the first `#SBATCH` line.
+
+```
+[1] "params.RData"    "results_0.RData" "results_1.RData" "slurm_0.out"
+[5] "slurm_1.out"     "slurm_run.R"     "submit.sh"
+```
+
+The utility function `cleanup_files` deletes the temporary folder for the
+specified *slurm_job*.
+
+
+## Single function evaluation
+
+In addition to `slurm_apply`, rslurm also defines a `slurm_call` function, which
+sends a single function call to the cluster. It is analogous in syntax to the
+base R function `do.call`, accepting a function and a named list of parameters
+as arguments.
+
+```r
+sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1))
+```
+
+Because `slurm_call` involves a single process on a single node, it does not
+recognize the `nodes` and `cpus_per_node` arguments; otherwise, it accepts the
+same additional arguments (detailed in the sections below) as `slurm_apply`.
+
+
+## Adding auxiliary data and functions
+
+The function passed to `slurm_apply` can only receive atomic parameters stored
+within a data frame. Suppose we want instead to apply a function `func` to a list
+of complex R objects, `obj_list`. To use `slurm_apply` in this case, we can wrap
+`func` in an inline function that takes an integer parameter.
+
+```r
+sjob <- slurm_apply(function(i) func(obj_list[[i]]),
+                    data.frame(i = seq_along(obj_list)),
+                    add_objects = c("func", "obj_list"),
+                    nodes = 2, cpus_per_node = 2)
+```
+
+The `add_objects` argument specifies the names of any R objects (besides the
+parameters data frame) that must be accessed by the function passed to
+`slurm_apply`. These objects are saved to a `.RData` file that is loaded
+on each cluster node prior to evaluating the function in parallel.
+
+By default, all R packages attached to the current R session will also be
+attached (with `library`) on each cluster node, though this can be modified with
+the optional `pkgs` argument.
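As a sanity check, the index-wrapper pattern above can first be run serially with base R's `Map`; `func` and `obj_list` below are toy stand-ins for the real objects, not part of rslurm.

```r
# Local dry run of the index-wrapper pattern; func and obj_list are
# illustrative stand-ins for objects that would be passed via add_objects.
func <- function(x) sum(x$values)
obj_list <- list(list(values = 1:3), list(values = 4:6))
res <- Map(function(i) func(obj_list[[i]]), seq_along(obj_list))
```

If this serial version behaves as expected, the same wrapper and objects can be handed to `slurm_apply` unchanged.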
+
+
+## Configuring SLURM options
+
+The `slurm_options` argument allows you to set any of the command line
+options ([view list](http://slurm.schedmd.com/sbatch.html)) recognized by the
+SLURM `sbatch` command. It should be formatted as a named list, using the long
+names of each option (e.g. "time" rather than "t"). Flags, i.e. command line
+options that are toggled rather than set to a particular value, should be set to
+`TRUE` in `slurm_options`. For example, the following code:
+```r
+sjob <- slurm_apply(test_func, pars,
+                    slurm_options = list(time = "1:00:00", share = TRUE))
+```
+sets the command line options `--time=1:00:00 --share`.
+
+
+## Generating scripts for later submission
+
+When working from an R session without direct access to the cluster, you can set
+`submit = FALSE` within `slurm_apply`. The function will create the
+*\_rslurm\_[jobname]* folder and generate the scripts and .RData files, without
+submitting the job. You may then copy those files to the cluster and submit the
+job manually by calling `sbatch submit.sh` from the command line.
+
+
+## How it works / advanced customization
+
+As mentioned above, the `slurm_apply` function creates a job-specific folder.
+This folder contains the parameters data frame and (if applicable) the objects
+specified as `add_objects`, both saved in *.RData* files. The function also
+generates an R script (`slurm_run.R`) to be run on each cluster node, as well
+as a Bash script (`submit.sh`) to submit the job to SLURM.
+
+More specifically, the Bash script creates a SLURM job array, with each cluster
+node receiving a different value of the *SLURM\_ARRAY\_TASK\_ID* environment
+variable. This variable is read by `slurm_run.R`, which allows each instance of
+the script to operate on a different parameter subset and write its output to
+a different results file. The R script calls `parallel::mcMap` to parallelize
+calculations on each node.
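The per-node task splitting described above can be sketched in plain R; the chunking rule and variable names here are illustrative, not the exact contents of the generated `slurm_run.R`.

```r
# Illustrative sketch of how one array task could select its share of the
# parameters; rslurm's generated script may differ in the details.
params <- data.frame(par_mu = 1:10, par_sd = seq(0.1, 1, length.out = 10))
nodes <- 2
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0"))
chunk <- ceiling(nrow(params) / nodes)
rows <- seq(task_id * chunk + 1, min((task_id + 1) * chunk, nrow(params)))
my_params <- params[rows, , drop = FALSE]
```

Each task then saves its own results file (e.g. `results_0.RData`), which `get_slurm_out` later reads back and combines.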
+
+Both `slurm_run.R` and `submit.sh` are generated from templates, using the
+**whisker** package; these templates can be found in the `rslurm/templates`
+subfolder in your R package library. There are two templates for each script,
+one for `slurm_apply` and the other (with the word *single* in its title) for
+`slurm_call`.
+
+While you should avoid changing any existing lines in the template scripts, you
+may want to add `#SBATCH` lines to the `submit.sh` templates in order to
+permanently set certain SLURM command line options and thus customize the package
+to your particular cluster setup.
+
diff --git a/cran-comments.md b/cran-comments.md
new file mode 100755
index 0000000..49ba3d9
--- /dev/null
+++ b/cran-comments.md
@@ -0,0 +1,9 @@
+## Tested on
+
+win-builder (devel and release)
+Ubuntu 12.04 with R 3.3 (on travis-ci)
+OS X with R 3.3 (local machine)
+
+## R CMD check results
+
+Status: OK (no errors, warnings or notes)
diff --git a/vignettes/rslurm-vignette.Rmd b/vignettes/rslurm-vignette.Rmd
new file mode 100755
index 0000000..d38bc77
--- /dev/null
+++ b/vignettes/rslurm-vignette.Rmd
@@ -0,0 +1,233 @@
+---
+title: "Parallelize R code on a SLURM cluster"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{rslurm-vignette}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+Many computing-intensive processes in R involve the repeated evaluation of
+a function over many items or parameter sets. These so-called
+[embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)
+calculations can be run serially with the `lapply` or `Map` function, or in parallel
+on a single machine with `mclapply` or `mcMap` (from the **parallel** package).
+
+The rslurm package simplifies the process of distributing this type of calculation
+across a computing cluster that uses the [SLURM](http://slurm.schedmd.com/)
+workload manager.
Its main function, `slurm_apply`, automatically divides the
+computation over multiple nodes and writes the necessary submission scripts.
+It also includes functions to retrieve and combine the output from different nodes,
+as well as wrappers for common SLURM commands.
+
+### Table of contents
+
+- [Basic example](#basic-example)
+- [Single function evaluation](#single-function-evaluation)
+- [Adding auxiliary data and functions](#adding-auxiliary-data-and-functions)
+- [Configuring SLURM options](#configuring-slurm-options)
+- [Generating scripts for later submission](#generating-scripts-for-later-submission)
+- [How it works / advanced customization](#how-it-works-advanced-customization)
+
+
+## Basic example
+
+To illustrate a typical rslurm workflow, we use a simple function that takes
+a mean and standard deviation as parameters, generates a million normal deviates
+and returns the sample mean and standard deviation.
+
+```r
+test_func <- function(par_mu, par_sd) {
+    samp <- rnorm(10^6, par_mu, par_sd)
+    c(s_mu = mean(samp), s_sd = sd(samp))
+}
+```
+
+We then create a parameter data frame where each row is a parameter set and each
+column matches an argument of the function.
+
+```r
+pars <- data.frame(par_mu = 1:10,
+                   par_sd = seq(0.1, 1, length.out = 10))
+head(pars, 3)
+```
+
+```
+  par_mu par_sd
+1      1    0.1
+2      2    0.2
+3      3    0.3
+```
+
+We can now pass that function and the parameters data frame to `slurm_apply`,
+specifying the number of cluster nodes to use and the number of CPUs per node.
+The latter (`cpus_per_node`) determines how many processes will be forked on
+each node, via the `mc.cores` argument of `parallel::mcMap`.
+```r
+library(rslurm)
+sjob <- slurm_apply(test_func, pars, jobname = "test_job",
+                    nodes = 2, cpus_per_node = 2)
+```
+The output of `slurm_apply` is a *slurm_job* object that stores a few pieces of
+information (job name and number of nodes) needed to retrieve the job's output.
+
+Assuming the function is run on a machine with access to the cluster, it also
+prints a message confirming the job has been submitted to SLURM.
+```
+Submitted batch job 352375
+```
+
+Particular clusters may require the specification of additional SLURM options,
+such as time and memory limits for the job. Also, when running R on a local
+machine without direct cluster access, you may want to generate scripts to be
+copied to the cluster and run at a later time. These topics are covered in
+additional sections below this basic example.
+
+After the job has been submitted, you can call `print_job_status` to display its
+status (in queue, running or completed) or call `cancel_slurm` to cancel its
+execution. These functions are R wrappers for the SLURM command line functions
+`squeue` and `scancel`, respectively.
+
+Once the job completes, `get_slurm_out` reads and combines the output from all
+nodes.
+```r
+res <- get_slurm_out(sjob, outtype = "table")
+head(res, 3)
+```
+
+```
+      s_mu       s_sd
+1 1.000005 0.09987899
+2 2.000185 0.20001108
+3 3.000238 0.29988789
+```
+
+When `outtype = "table"`, the outputs from each function evaluation are
+row-bound into a single data frame; this is an appropriate format when the
+function returns a simple vector. The default `outtype = "raw"` combines the
+outputs into a list and can thus handle arbitrarily complex return objects.
+
+```r
+res_raw <- get_slurm_out(sjob, outtype = "raw")
+res_raw[1:3]
+```
+
+```
+[[1]]
+      s_mu       s_sd
+1.00000506 0.09987899
+
+[[2]]
+     s_mu      s_sd
+2.0001852 0.2000111
+
+[[3]]
+     s_mu      s_sd
+3.0002377 0.2998879
+```
+
+The files generated by `slurm_apply` are saved in a folder named
+*\_rslurm_[jobname]* under the current working directory.
+
+```r
+dir("_rslurm_test_job")
+```
+
+```
+[1] "params.RData"    "results_0.RData" "results_1.RData" "slurm_0.out"
+[5] "slurm_1.out"     "slurm_run.R"     "submit.sh"
+```
+
+The utility function `cleanup_files` deletes the temporary folder for the
+specified *slurm_job*.
+
+
+## Single function evaluation
+
+In addition to `slurm_apply`, rslurm also defines a `slurm_call` function, which
+sends a single function call to the cluster. It is analogous in syntax to the
+base R function `do.call`, accepting a function and a named list of parameters
+as arguments.
+
+```r
+sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1))
+```
+
+Because `slurm_call` involves a single process on a single node, it does not
+recognize the `nodes` and `cpus_per_node` arguments; otherwise, it accepts the
+same additional arguments (detailed in the sections below) as `slurm_apply`.
+
+
+## Adding auxiliary data and functions
+
+The function passed to `slurm_apply` can only receive atomic parameters stored
+within a data frame. Suppose we want instead to apply a function `func` to a list
+of complex R objects, `obj_list`. To use `slurm_apply` in this case, we can wrap
+`func` in an inline function that takes an integer parameter.
+
+```r
+sjob <- slurm_apply(function(i) func(obj_list[[i]]),
+                    data.frame(i = seq_along(obj_list)),
+                    add_objects = c("func", "obj_list"),
+                    nodes = 2, cpus_per_node = 2)
+```
+
+The `add_objects` argument specifies the names of any R objects (besides the
+parameters data frame) that must be accessed by the function passed to
+`slurm_apply`. These objects are saved to a `.RData` file that is loaded
+on each cluster node prior to evaluating the function in parallel.
+
+By default, all R packages attached to the current R session will also be
+attached (with `library`) on each cluster node, though this can be modified with
+the optional `pkgs` argument.
+
+
+## Configuring SLURM options
+
+The `slurm_options` argument allows you to set any of the command line
+options ([view list](http://slurm.schedmd.com/sbatch.html)) recognized by the
+SLURM `sbatch` command. It should be formatted as a named list, using the long
+names of each option (e.g. "time" rather than "t"). Flags, i.e.
command line
+options that are toggled rather than set to a particular value, should be set to
+`TRUE` in `slurm_options`. For example, the following code:
+```r
+sjob <- slurm_apply(test_func, pars,
+                    slurm_options = list(time = "1:00:00", share = TRUE))
+```
+sets the command line options `--time=1:00:00 --share`.
+
+
+## Generating scripts for later submission
+
+When working from an R session without direct access to the cluster, you can set
+`submit = FALSE` within `slurm_apply`. The function will create the
+*\_rslurm\_[jobname]* folder and generate the scripts and .RData files, without
+submitting the job. You may then copy those files to the cluster and submit the
+job manually by calling `sbatch submit.sh` from the command line.
+
+
+## How it works / advanced customization
+
+As mentioned above, the `slurm_apply` function creates a job-specific folder.
+This folder contains the parameters data frame and (if applicable) the objects
+specified as `add_objects`, both saved in *.RData* files. The function also
+generates an R script (`slurm_run.R`) to be run on each cluster node, as well
+as a Bash script (`submit.sh`) to submit the job to SLURM.
+
+More specifically, the Bash script creates a SLURM job array, with each cluster
+node receiving a different value of the *SLURM\_ARRAY\_TASK\_ID* environment
+variable. This variable is read by `slurm_run.R`, which allows each instance of
+the script to operate on a different parameter subset and write its output to
+a different results file. The R script calls `parallel::mcMap` to parallelize
+calculations on each node.
+
+Both `slurm_run.R` and `submit.sh` are generated from templates, using the
+**whisker** package; these templates can be found in the `rslurm/templates`
+subfolder in your R package library. There are two templates for each script,
+one for `slurm_apply` and the other (with the word *single* in its title) for
+`slurm_call`.
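The templating step amounts to filling `{{placeholder}}` fields in a text template. A base-R stand-in for what whisker does (the `{{max_node}}` field name is hypothetical, not an actual rslurm template field) looks like this:

```r
# Base-R stand-in for the whisker templating step used to build submit.sh;
# the {{max_node}} placeholder name is hypothetical.
render <- function(tmpl, values) {
  for (nm in names(values)) {
    tmpl <- gsub(paste0("{{", nm, "}}"), as.character(values[[nm]]),
                 tmpl, fixed = TRUE)
  }
  tmpl
}
script <- render("#SBATCH --array=0-{{max_node}}", list(max_node = 1))
```

In the real package, `whisker::whisker.render` performs this substitution on the full template files.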
+
+While you should avoid changing any existing lines in the template scripts, you
+may want to add `#SBATCH` lines to the `submit.sh` templates in order to
+permanently set certain SLURM command line options and thus customize the package
+to your particular cluster setup.
\ No newline at end of file