Add vignette and finalize for CRAN submission

earthlab · May 27, 2016 · d3ea5e7
1 parent 17cbb4f

Showing 8 changed files with 460 additions and 36 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -1,4 +1,4 @@
 ^.*\.Rproj$
 ^\.Rproj\.user$
 ^\.travis\.yml$
-^README.md$
+cran-comments.md
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,4 @@
 .Rproj.user
 .Rhistory
 .RData
+inst/doc
diff --git a/.travis.yml b/.travis.yml
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -8,12 +8,16 @@ Version: 0.3.0
 License: GPL-3
 URL: https://github.com/SESYNC-ci/rslurm
 BugReports: https://github.com/SESYNC-ci/rslurm/issues
-Authors@R: person('Philippe', 'Marchand', email = '[email protected]', 
-                role = c('aut', 'cre'))
+Authors@R: c(person('Philippe', 'Marchand', email = '[email protected]', 
+             role = c('aut', 'cre')), 
+             person('Mike', 'Smorul', role = 'ctb'))
 Depends:
     R (>= 3.2.0)
 Imports:
     parallel,
     whisker (>= 0.3)
 RoxygenNote: 5.0.1
-Suggests: testthat
+Suggests: testthat,
+    knitr,
+    rmarkdown
+VignetteBuilder: knitr
diff --git a/R/slurm_call.R b/R/slurm_call.R
@@ -90,7 +90,7 @@ slurm_call <- function(f, params, jobname = NA, add_objects = NULL,
     writeLines(script_r, file.path(tmpdir, "slurm_run.R"))
 
     # Create submission bash script
-    template_sh <- readLines(system.file("templates/submit_sh.txt", 
+    template_sh <- readLines(system.file("templates/submit_single_sh.txt", 
                                          package = "rslurm"))
     slurm_options <- format_option_list(slurm_options)
     script_sh <- whisker::whisker.render(template_sh, 

diff --git a/README.md b/README.md
@@ -1,59 +1,236 @@
 rslurm
 ======
 
-This R package simplifies the process of splitting a R calculation over a computing cluster that uses the [SLURM](http://slurm.schedmd.com/) workload manager.
+[![Travis-CI Build Status](https://travis-ci.org/SESYNC-ci/rslurm.svg?branch=master)](https://travis-ci.org/SESYNC-ci/rslurm)
 
-Currently, it is possible to use existing R packages like `parallel` to split a calculation over multiple CPUs in a single cluster node. The functions in this package automate the process of dividing the parameter sets over multiple cluster nodes (using a slurm array), applying the function in parallel in each node using `parallel`, and recombining the output.
+Many computationally intensive processes in R involve the repeated evaluation of
+a function over many items or parameter sets. These so-called 
+[embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)
+calculations can be run serially with the `lapply` or `Map` function, or in parallel
+on a single machine with `mclapply` or `mcMap` (from the **parallel** package).
 
+The rslurm package simplifies the process of distributing this type of calculation
+across a computing cluster that uses the [SLURM](http://slurm.schedmd.com/) 
+workload manager. Its main function, `slurm_apply`, automatically divides the
+computation over multiple nodes and writes the necessary submission scripts.
+It also includes functions to retrieve and combine the output from different nodes,
+as well as wrappers for common SLURM commands.
 
-How to install / use
---------------------
+*Development of this R package was supported by the National Socio-Environmental Synthesis Center (SESYNC) under funding received from the National Science Foundation DBI-1052875.*
 
-Install the package from GitHub using the following code:
-```R
-install.packages("devtools")
-devtools::install_github("SESYNC-ci/rslurm")
+
+### Table of contents
+
+- [Basic example](#basic-example)
+- [Single function evaluation](#single-function-evaluation)
+- [Adding auxiliary data and functions](#adding-auxiliary-data-and-functions)
+- [Configuring SLURM options](#configuring-slurm-options)
+- [Generating scripts for later submission](#generating-scripts-for-later-submission)
+- [How it works / advanced customization](#how-it-works--advanced-customization)
+
+
+## Basic example
+
+To illustrate a typical rslurm workflow, we use a simple function that takes
+a mean and standard deviation as parameters, generates a million normal deviates
+and returns the sample mean and standard deviation.
+
+```r
+test_func <- function(par_mu, par_sd) {
+    samp <- rnorm(10^6, par_mu, par_sd)
+    c(s_mu = mean(samp), s_sd = sd(samp))
+}
 ```
 
-Here's an overview of the workflow using this package:
+We then create a parameter data frame where each row is a parameter set and each
+column matches an argument of the function.
 
-- Create a function that you want to call with multiple parameter sets, and a data frame containing these parameter sets. 
-- Call `slurm_apply` with the function, the parameters data frame and (if applicable) the names of additional R objects needed as arguments. The function returns a `slurm_job` object.
-- The `slurm_job` object can be passed to other utility functions in the package to inquire about the SLURM job's status (`print_job_status`), cancel the job (`cancel_slurm`), collect the output in a single list or data frame (`get_slurm_out`), or delete the temporary files generated during the process (`cleanup_files`).
+```r
+pars <- data.frame(par_mu = 1:10,
+                   par_sd = seq(0.1, 1, length.out = 10))
+head(pars, 3)
+```
+
+```
+  par_mu par_sd
+1      1    0.1
+2      2    0.2
+3      3    0.3
+```
 
-Read the `rslurm-package` help file in R and each function's help file for more details.
+We can now pass that function and the parameters data frame to `slurm_apply`,
+specifying the number of cluster nodes to use and the number of CPUs per node.
+The latter (`cpus_per_node`) determines how many processes will be forked on
+each node, as the `mc.cores` argument of `parallel::mcMap`. 
+```r
+library(rslurm)
+sjob <- slurm_apply(test_func, pars, jobname = "test_job", 
+                    nodes = 2, cpus_per_node = 2)
+```
+The output of `slurm_apply` is a *slurm_job* object that stores a few pieces of 
+information (job name and number of nodes) needed to retrieve the job's output. 
 
+Assuming the function is run on a machine with access to the cluster, it also 
+prints a message confirming the job has been submitted to SLURM.
+```
+Submitted batch job 352375
+```
 
-Instructions for SESYNC users
------------------------------
+Particular clusters may require the specification of additional SLURM options,
+such as time and memory limits for the job. Also, when running R on a local
+machine without direct cluster access, you may want to generate scripts to be
+copied to the cluster and run at a later time. These topics are covered in
+additional sections below this basic example.
+
+After the job has been submitted, you can call `print_job_status` to display its
+status (in queue, running or completed) or call `cancel_slurm` to cancel its
+execution. These functions are R wrappers for the SLURM commands
+`squeue` and `scancel`, respectively.
+
+Once the job completes, `get_slurm_out` reads and combines the output from all
+nodes.
+```r
+res <- get_slurm_out(sjob, outtype = "table")
+head(res, 3)
+```
 
-When using the SESYNC SLURM cluster, you should set the `nodes` argument of `slurm_apply` to a value less than the number of nodes available on the cluster (there are 20 nodes in total). You should set `cpus_per_node = 8` unless your job requires a large amount of memory (i.e. when running 8 copies would exceed the 60 Gb available by node).
+```
+      s_mu       s_sd
+1 1.000005 0.09987899
+2 2.000185 0.20001108
+3 3.000238 0.29988789
+```
 
-You must also specify the correct partition for jobs to be run in serial or parallel mode. This can be done in one of two ways:
+When `outtype = "table"`, the outputs from each function evaluation are 
+row-bound into a single data frame; this is an appropriate format when the 
+function returns a simple vector. The default `outtype = "raw"` combines the
+outputs into a list and can thus handle arbitrarily complex return objects.
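
The difference between the two output types can be previewed locally, without a cluster. The sketch below is a serial stand-in and not the package's implementation: it assumes each call returns a named numeric vector, evaluates the calls with base R's `Map`, and row-binds them with `rbind`. The names `small_func`, `raw_out`, and `table_out` are illustrative.

```r
# Serial stand-in for a cluster run (illustrative; not rslurm's internal code).
# A smaller sample size than in the README keeps the sketch fast.
small_func <- function(par_mu, par_sd) {
    samp <- rnorm(1000, par_mu, par_sd)
    c(s_mu = mean(samp), s_sd = sd(samp))
}

pars <- data.frame(par_mu = 1:3, par_sd = c(0.1, 0.2, 0.3))

# "raw"-style output: one list element per parameter set (row of pars)
raw_out <- Map(small_func, pars$par_mu, pars$par_sd)

# "table"-style output: the same results row-bound into a data frame
table_out <- as.data.frame(do.call(rbind, raw_out))
```

The list form keeps whatever each call returned, while the row-bound form only makes sense when every call returns a vector with the same names and length.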
 
-*As an option set in each call to the `rslurm` functions*
+```r
+res_raw <- get_slurm_out(sjob, outtype = "raw")
+res_raw[1:3]
+```
 
-* For `slurm_apply`, set `slurm_options = list(partition = "sesync")`.
-* For `slurm_call`, set `slurm_options = list(partition = "sesyncshared", share = TRUE)`.
+```
+[[1]]
+      s_mu       s_sd 
+1.00000506 0.09987899 
 
-*By editing the template scripts*
+[[2]]
+     s_mu      s_sd 
+2.0001852 0.2000111 
 
-Note: We recommend saving a backup copy of the original templates before editing them.
+[[3]]
+     s_mu      s_sd 
+3.0002377 0.2998879 
+```
 
-* Go to the `rslurm` folder in your R library (generally located at `~/R/x86_64-pc-linux-gnu-library/3.3/`, with "3.3" replaced with the latest version of R). Open the `templates` subfolder.
+The files generated by `slurm_apply` are saved in a folder named
+*\_rslurm_[jobname]* under the current working directory.
 
-* In `submit_sh.txt`, insert the line 
+```r
+dir("_rslurm_test_job")
 ```
-#SBATCH --partition=sesync
-``` 
-before the first `#SBATCH` line.
 
-* In `submit_single_sh.txt`, insert the lines
 ```
-#SBATCH --partition=sesyncshared
-#SBATCH --share
+[1] "params.RData"    "results_0.RData" "results_1.RData" "slurm_0.out"    
+[5] "slurm_1.out"     "slurm_run.R"     "submit.sh" 
 ```
-before the first `#SBATCH` line.
+
+The utility function `cleanup_files` deletes the temporary folder for the
+specified *slurm_job*.
+
+
+## Single function evaluation
+
+In addition to `slurm_apply`, rslurm also defines a `slurm_call` function, which
+sends a single function call to the cluster. It is analogous in syntax to the 
+base R function `do.call`, accepting a function and a named list of parameters
+as arguments.
+
+```r
+sjob <- slurm_call(test_func, list(par_mu = 5, par_sd = 1))
+```
+
+Because `slurm_call` involves a single process on a single node, it does not
+recognize the `nodes` and `cpus_per_node` arguments; otherwise, it accepts the
+same additional arguments (detailed in the sections below) as `slurm_apply`.
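
Since the README describes `slurm_call` as analogous to `do.call`, a quick local check of the function and its named parameter list can be made before sending anything to the cluster. This is only a sketch of that check; `local_res` is an illustrative name.

```r
# Same test function as in the basic example
test_func <- function(par_mu, par_sd) {
    samp <- rnorm(10^6, par_mu, par_sd)
    c(s_mu = mean(samp), s_sd = sd(samp))
}

# Local, cluster-free evaluation with the same named list of parameters
local_res <- do.call(test_func, list(par_mu = 5, par_sd = 1))
```

If this call fails locally (for example, because a list name does not match an argument name), the same call submitted to the cluster would presumably fail as well.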
+
+
+## Adding auxiliary data and functions
+
+The function passed to `slurm_apply` can only receive atomic parameters stored 
+within a data frame. Suppose we want instead to apply a function `func` to a list
+of complex R objects, `obj_list`. To use `slurm_apply` in this case, we can wrap 
+`func` in an inline function that takes an integer parameter.
+
+```r
+sjob <- slurm_apply(function(i) func(obj_list[[i]]), 
+                    data.frame(i = seq_along(obj_list)),
+                    add_objects = c("func", "obj_list"),
+                    nodes = 2, cpus_per_node = 2)
+```
+
+The `add_objects` argument specifies the names of any R objects (besides the 
+parameters data frame) that must be accessed by the function passed to 
+`slurm_apply`. These objects are saved to a `.RData` file that is loaded
+on each cluster node prior to evaluating the function in parallel.
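
The save-then-load mechanism described above can be imitated in a single R session. The sketch below is an assumption about the general idea, not the package's code: `func` and `obj_list` are toy stand-ins, and `worker_env` plays the role of a fresh R process on a cluster node.

```r
# Toy stand-ins for `func` and `obj_list` from the README
func <- function(x) sum(x$values)
obj_list <- list(list(values = 1:3), list(values = 4:6))

# Save the named objects to an .RData file, as add_objects is described to do
rdata_file <- tempfile(fileext = ".RData")
save(list = c("func", "obj_list"), file = rdata_file, envir = environment())

# Simulate a fresh worker: load the file into a clean environment
worker_env <- new.env()
load(rdata_file, envir = worker_env)

# The wrapper now finds both objects by name in the worker environment
wrapper <- function(i) func(obj_list[[i]])
environment(wrapper) <- worker_env
worker_res <- vapply(seq_along(worker_env$obj_list), wrapper, numeric(1))
```

Any object the wrapper refers to by name but that is missing from `add_objects` would not be found on the node, which is why both `func` and `obj_list` are listed in the README example.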
+
+By default, all R packages attached to the current R session will also be 
+attached (with `library`) on each cluster node, though this can be modified with
+the optional `pkgs` argument.
+
+
+## Configuring SLURM options
+
+The `slurm_options` argument allows you to set any of the command line 
+options ([view list](http://slurm.schedmd.com/sbatch.html)) recognized by the 
+SLURM `sbatch` command. It should be formatted as a named list, using the long
+names of each option (e.g. "time" rather than "t"). Flags, i.e. command line 
+options that are toggled rather than set to a particular value, should be set to
+`TRUE` in `slurm_options`. For example, the following code:
+```r
+sjob <- slurm_apply(test_func, pars, 
+                    slurm_options = list(time = "1:00:00", share = TRUE))
+```
+sets the command line options `--time=1:00:00 --share`.
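
To make the mapping from the named list to `sbatch` options concrete, here is a minimal sketch. The helper `opts_to_flags` is hypothetical and is not the package's own formatting function; it only reproduces the rule stated above (long option names, flags set to `TRUE`).

```r
# Hypothetical helper (not the package's own formatting code): turn a named
# list of sbatch options into a string of command line options.
opts_to_flags <- function(opts) {
    flags <- vapply(names(opts), function(nm) {
        val <- opts[[nm]]
        if (isTRUE(val)) {
            paste0("--", nm)            # toggled flag, e.g. --share
        } else {
            paste0("--", nm, "=", val)  # valued option, e.g. --time=1:00:00
        }
    }, character(1))
    paste(flags, collapse = " ")
}

flag_string <- opts_to_flags(list(time = "1:00:00", share = TRUE))
```

For the list used in the README example, `flag_string` is `"--time=1:00:00 --share"`.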
+
+
+## Generating scripts for later submission
+
+When working from an R session without direct access to the cluster, you can set
+`submit = FALSE` within `slurm_apply`. The function will create the
+*\_rslurm\_[jobname]* folder and generate the scripts and .RData files, without
+submitting the job. You may then copy those files to the cluster and submit the
+job manually by calling `sbatch submit.sh` from the command line.
+
+
+## How it works / advanced customization
+
+As mentioned above, the `slurm_apply` function creates a job-specific folder. 
+This folder contains the parameters data frame and (if applicable) the objects
+specified as `add_objects`, both saved in *.RData* files. The function also
+generates an R script (`slurm_run.R`) to be run on each cluster node, as well
+as a Bash script (`submit.sh`) to submit the job to SLURM.
+
+More specifically, the Bash script creates a SLURM job array, with each cluster
+node receiving a different value of the *SLURM\_ARRAY\_TASK\_ID* environment
+variable. This variable is read by `slurm_run.R`, which allows each instance of
+the script to operate on a different parameter subset and write its output to
+a different results file. The R script calls `parallel::mcMap` to parallelize
+calculations on each node.
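
The way an array task could pick its own parameter subset is sketched below. This is illustrative only: the actual `slurm_run.R` template is not shown in this diff, and the chunking rule, `n_nodes`, `chunk_of_row`, `my_pars`, and `out_file` are assumptions made for the example. The 0-based task numbering matches the `results_0.RData` and `results_1.RData` file names listed earlier.

```r
# Illustrative only: the real slurm_run.R template is not shown here, and the
# chunking rule below is an assumption.
pars <- data.frame(par_mu = 1:10,
                   par_sd = seq(0.1, 1, length.out = 10))
n_nodes <- 2

# On the cluster SLURM sets this variable; default to task 0 for a local run
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "0"))

# Assign each row of pars to a 0-based task index, in equal-sized blocks
chunk_of_row <- (seq_len(nrow(pars)) - 1) %/% ceiling(nrow(pars) / n_nodes)
my_pars <- pars[chunk_of_row == task_id, , drop = FALSE]

# Each task would write its output to its own results file
out_file <- paste0("results_", task_id, ".RData")
```

Run locally with the variable unset, the script behaves as task 0 and keeps the first five rows of `pars`.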
+
+Both `slurm_run.R` and `submit.sh` are generated from templates, using the
+**whisker** package; these templates can be found in the `rslurm/templates`
+subfolder in your R package library. There are two templates for each script,
+one for `slurm_apply` and the other (with the word *single* in its file name) for
+`slurm_call`. 
+
+While you should avoid changing any existing lines in the template scripts, you
+may want to add `#SBATCH` lines to the `submit.sh` templates in order to
+permanently set certain SLURM command line options and thus customize the package
+to your particular cluster setup.
+
 
 
 
diff --git a/cran-comments.md b/cran-comments.md
@@ -0,0 +1,9 @@
+## Tested on
+
+win-builder (devel and release)
+Ubuntu 12.04 with R 3.3 (on travis-ci)
+OS X with R 3.3 (local machine)
+
+## R CMD check results
+
+Status: OK (no errors, warnings or notes)