Commit
Add vignette and finalize for CRAN submission
pmarchand1 committed May 27, 2016
1 parent 17cbb4f commit d3ea5e7
Showing 8 changed files with 460 additions and 36 deletions.
2 changes: 1 addition & 1 deletion .Rbuildignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
^.*\.Rproj$
^\.Rproj\.user$
^\.travis\.yml$
^README.md$
cran-comments.md
1 change: 1 addition & 0 deletions .gitignore
@@ -1,3 +1,4 @@
.Rproj.user
.Rhistory
.RData
inst/doc
Empty file modified .travis.yml
100644 → 100755
Empty file.
10 changes: 7 additions & 3 deletions DESCRIPTION
@@ -8,12 +8,16 @@ Version: 0.3.0
License: GPL-3
URL: https://github.com/SESYNC-ci/rslurm
BugReports: https://github.com/SESYNC-ci/rslurm/issues
Authors@R: person('Philippe', 'Marchand', email = '[email protected]',
role = c('aut', 'cre'))
Authors@R: c(person('Philippe', 'Marchand', email = '[email protected]',
role = c('aut', 'cre')),
person('Mike', 'Smorul', role = 'ctb'))
Depends:
R (>= 3.2.0)
Imports:
parallel,
whisker (>= 0.3)
RoxygenNote: 5.0.1
Suggests: testthat
Suggests: testthat,
knitr,
rmarkdown
VignetteBuilder: knitr
2 changes: 1 addition & 1 deletion R/slurm_call.R
@@ -90,7 +90,7 @@ slurm_call <- function(f, params, jobname = NA, add_objects = NULL,
writeLines(script_r, file.path(tmpdir, "slurm_run.R"))

# Create submission bash script
template_sh <- readLines(system.file("templates/submit_sh.txt",
template_sh <- readLines(system.file("templates/submit_single_sh.txt",
package = "rslurm"))
slurm_options <- format_option_list(slurm_options)
script_sh <- whisker::whisker.render(template_sh,
239 changes: 208 additions & 31 deletions README.md
@@ -1,59 +1,236 @@
rslurm
======

[![Travis-CI Build Status](https://travis-ci.org/SESYNC-ci/rslurm.svg?branch=master)](https://travis-ci.org/SESYNC-ci/rslurm)

Many computing-intensive processes in R involve the repeated evaluation of
a function over many items or parameter sets. These so-called
[embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)
calculations can be run serially with the `lapply` or `Map` function, or in parallel
on a single machine with `mclapply` or `mcMap` (from the **parallel** package).

The rslurm package simplifies the process of distributing this type of calculation
across a computing cluster that uses the [SLURM](http://slurm.schedmd.com/)
workload manager. Its main function, `slurm_apply`, automatically divides the
computation over multiple nodes and writes the necessary submission scripts.
It also includes functions to retrieve and combine the output from different nodes,
as well as wrappers for common SLURM commands.

*Development of this R package was supported by the National Socio-Environmental Synthesis Center (SESYNC) under funding received from the National Science Foundation DBI-1052875.*

### Table of contents

- [Basic example](#basic-example)
- [Single function evaluation](#single-function-evaluation)
- [Adding auxiliary data and functions](#adding-auxiliary-data-and-functions)
- [Configuring SLURM options](#configuring-slurm-options)
- [Generating scripts for later submission](#generating-scripts-for-later-submission)
- [How it works / advanced customization](#how-it-works-advanced-customization)


## Basic example

To illustrate a typical rslurm workflow, we use a simple function that takes
a mean and standard deviation as parameters, generates a million normal deviates
and returns the sample mean and standard deviation.

```r
test_func <- function(par_mu, par_sd) {
samp <- rnorm(10^6, par_mu, par_sd)
c(s_mu = mean(samp), s_sd = sd(samp))
}
```

We then create a parameter data frame where each row is a parameter set and each
column matches an argument of the function.

```r
pars <- data.frame(par_mu = 1:10,
                   par_sd = seq(0.1, 1, length.out = 10))
head(pars, 3)
```

```
par_mu par_sd
1 1 0.1
2 2 0.2
3 3 0.3
```

We can now pass that function and the parameters data frame to `slurm_apply`,
specifying the number of cluster nodes to use and the number of CPUs per node.
The latter (`cpus_per_node`) determines how many processes will be forked on
each node, as the `mc.cores` argument of `parallel::mcMap`.
```r
library(rslurm)
sjob <- slurm_apply(test_func, pars, jobname = "test_job",
                    nodes = 2, cpus_per_node = 2)
```
The output of `slurm_apply` is a *slurm_job* object that stores a few pieces of
information (job name and number of nodes) needed to retrieve the job's output.

Assuming the function is run on a machine with access to the cluster, it also
prints a message confirming the job has been submitted to SLURM.
```
Submitted batch job 352375
```

Particular clusters may require the specification of additional SLURM options,
such as time and memory limits for the job. Also, when running R on a local
machine without direct cluster access, you may want to generate scripts to be
copied to the cluster and run at a later time. These topics are covered in
additional sections below this basic example.

After the job has been submitted, you can call `print_job_status` to display its
status (in queue, running or completed) or call `cancel_slurm` to cancel its
execution. These functions are R wrappers for the SLURM command line functions
`squeue` and `scancel`, respectively.

Once the job completes, `get_slurm_out` reads and combines the output from all
nodes.
```r
res <- get_slurm_out(sjob, outtype = "table")
head(res, 3)
```

```
s_mu s_sd
1 1.000005 0.09987899
2 2.000185 0.20001108
3 3.000238 0.29988789
```

When `outtype = "table"`, the outputs from each function evaluation are
row-bound into a single data frame; this is an appropriate format when the
function returns a simple vector. The default `outtype = "raw"` combines the
outputs into a list and can thus handle arbitrarily complex return objects.

```r
res_raw <- get_slurm_out(sjob, outtype = "raw")
res_raw[1:3]
```

```
[[1]]
      s_mu       s_sd
1.00000506 0.09987899

[[2]]
     s_mu      s_sd
2.0001852 0.2000111

[[3]]
     s_mu      s_sd
3.0002377 0.2998879
```

The files generated by `slurm_apply` are saved in a folder named
*\_rslurm\_[jobname]* under the current working directory.

```r
dir("_rslurm_test_job")
```

```
[1] "params.RData"    "results_0.RData" "results_1.RData" "slurm_0.out"
[5] "slurm_1.out"     "slurm_run.R"     "submit.sh"
```

The utility function `cleanup_files` deletes the temporary folder for the
specified *slurm_job*.


## Single function evaluation

In addition to `slurm_apply`, rslurm also defines a `slurm_call` function, which
sends a single function call to the cluster. It is analogous in syntax to the
base R function `do.call`, accepting a function and a named list of parameters
as arguments.

```r
sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1))
```

Because `slurm_call` involves a single process on a single node, it does not
recognize the `nodes` and `cpus_per_node` arguments; otherwise, it accepts the
same additional arguments (detailed in the sections below) as `slurm_apply`.


## Adding auxiliary data and functions

The function passed to `slurm_apply` can only receive atomic parameters stored
within a data frame. Suppose we want instead to apply a function `func` to a list
of complex R objects, `obj_list`. To use `slurm_apply` in this case, we can wrap
`func` in an inline function that takes an integer parameter.

```r
sjob <- slurm_apply(function(i) func(obj_list[[i]]),
                    data.frame(i = seq_along(obj_list)),
                    add_objects = c("func", "obj_list"),
                    nodes = 2, cpus_per_node = 2)
```

The `add_objects` argument specifies the names of any R objects (besides the
parameters data frame) that must be accessed by the function passed to
`slurm_apply`. These objects are saved to a `.RData` file that is loaded
on each cluster node prior to evaluating the function in parallel.

By default, all R packages attached to the current R session will also be
attached (with `library`) on each cluster node, though this can be modified with
the optional `pkgs` argument.


## Configuring SLURM options

The `slurm_options` argument allows you to set any of the command line
options ([view list](http://slurm.schedmd.com/sbatch.html)) recognized by the
SLURM `sbatch` command. It should be formatted as a named list, using the long
names of each option (e.g. "time" rather than "t"). Flags, i.e. command line
options that are toggled rather than set to a particular value, should be set to
`TRUE` in `slurm_options`. For example, the following code:
```r
sjob <- slurm_apply(test_func, pars,
                    slurm_options = list(time = "1:00:00", share = TRUE))
```
sets the command line options `--time=1:00:00 --share`.
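These options end up as `#SBATCH` header lines in the generated `submit.sh`. As a sketch (not the package's exact template output), the header of the rendered script would look roughly like this:

```shell
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --share
```

Flags set to `TRUE`, like `share` above, are emitted without a value.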


## Generating scripts for later submission

When working from an R session without direct access to the cluster, you can set
`submit = FALSE` within `slurm_apply`. The function will create the
*\_rslurm\_[jobname]* folder and generate the scripts and .RData files, without
submitting the job. You may then copy those files to the cluster and submit the
job manually by calling `sbatch submit.sh` from the command line.


## How it works / advanced customization

As mentioned above, the `slurm_apply` function creates a job-specific folder.
This folder contains the parameters data frame and (if applicable) the objects
specified as `add_objects`, both saved in *.RData* files. The function also
generates an R script (`slurm_run.R`) to be run on each cluster node, as well
as a Bash script (`submit.sh`) to submit the job to SLURM.

More specifically, the Bash script creates a SLURM job array, with each cluster
node receiving a different value of the *SLURM\_ARRAY\_TASK\_ID* environment
variable. This variable is read by `slurm_run.R`, which allows each instance of
the script to operate on a different parameter subset and write its output to
a different results file. The R script calls `parallel::mcMap` to parallelize
calculations on each node.
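To illustrate the array mechanism, here is a minimal shell sketch (not the package's actual template; the chunk size and variable names are made up) of how an array task can map its *SLURM\_ARRAY\_TASK\_ID* to a distinct slice of the parameter rows:

```shell
#!/bin/sh
# Outside SLURM we set the task id by hand; under SLURM each array task
# receives its own value of this variable automatically.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-1}

# With 10 parameter sets split over 2 nodes (task ids 0 and 1),
# each task handles a block of 5 consecutive rows.
chunk_size=5
start=$(( SLURM_ARRAY_TASK_ID * chunk_size + 1 ))
end=$(( start + chunk_size - 1 ))
echo "task ${SLURM_ARRAY_TASK_ID} processes rows ${start}-${end}"
# prints: task 1 processes rows 6-10
```

In the real scripts, `slurm_run.R` reads the same environment variable with `Sys.getenv` and writes its chunk's output to a task-specific results file.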

Both `slurm_run.R` and `submit.sh` are generated from templates, using the
**whisker** package; these templates can be found in the `rslurm/templates`
subfolder in your R package library. There are two templates for each script,
one for `slurm_apply` and the other (with the word *single* in its title) for
`slurm_call`.

While you should avoid changing any existing lines in the template scripts, you
may want to add `#SBATCH` lines to the `submit.sh` templates in order to
permanently set certain SLURM command line options and thus customize the package
to your particular cluster setup.
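For example, to always submit to a particular partition, a single line of this form (the partition name is a placeholder for your cluster's) could be added above the existing `#SBATCH` lines in the template:

```shell
#SBATCH --partition=mypartition
```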




9 changes: 9 additions & 0 deletions cran-comments.md
@@ -0,0 +1,9 @@
## Tested on

win-builder (devel and release)
Ubuntu 12.04 with R 3.3 (on travis-ci)
OS X with R 3.3 (local machine)

## R CMD check results

Status: OK (no errors, warnings or notes)