clean environments for modules #84
Comments
If we could spawn modules in their own, clean environments, load the things they need (and only them), then take their products and destroy them, that would be a lovely clean solution. One approach would be to use existing parallel computing libraries to spawn child processes. E.g. using
sfInit(cpus = 1, parallel = FALSE)
sfLibrary(list_of_module_dependencies)
sfExport(input_data_and_module_arguments)
ans <- sfLapply(NULL, some_model_module_in_a_wrapper)[[1]]
sfStop()
We could also act on lists to do parallelism easily, though we'd want to load different packages on different cluster nodes. A downside of child processes is that they make debugging much harder.
Tim's suggestion would be worth trying first, although it doesn't protect the modules from packages the user has loaded previously. |
any thoughts on this appreciated - maybe @richfitz has ideas? |
I don't know much about S4 classes or how often they are going to be a problem. Was the leaflet example an issue with S4? I get the feeling the It isn't totally obvious how to make sure modules even remotely clear up after themselves, as they are user contributed. I don't know if you could add a line in Something like
package <- as.character(substitute(package))
str <- paste0('package:', package)
eval.parent(on.exit(detach(str, unload = TRUE, character.only = TRUE)), n = 1)
The idea being that this will be evaluated in the module function environment, so that once the module has finished, any packages used will be detached. However, I can't work out how to get Anyway, then we only have to push that people use Child processes might still be a better idea, at least in the long term, as (as you said) we want to make stuff parallel anyway. We just need to be careful that we don't duplicate big memory objects. Model objects (e.g. random forests) can get pretty big, let alone data.
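One way the on.exit idea above might be made to work is to evaluate the on.exit call in the caller's frame with do.call. This is only a sketch: UsePackage and ExampleModule are hypothetical names, not zoon functions, and the helper detaches without unloading, since unloading is the step that can fail.

```r
# Hypothetical helper (not part of zoon): attach `package` and register its
# detachment as an exit handler of the function that called this helper.
UsePackage <- function(package) {
  caller <- parent.frame()
  search_name <- paste0("package:", package)
  # only clean up packages that were not already attached by the user
  if (!search_name %in% search()) {
    library(package, character.only = TRUE)
    cleanup <- bquote(detach(.(search_name), character.only = TRUE))
    # evaluate on.exit() in the caller's frame so it fires when the caller exits
    do.call("on.exit", list(cleanup, add = TRUE), envir = caller)
  }
  invisible(package)
}

# A toy module that needs the 'tools' package while it runs
ExampleModule <- function() {
  UsePackage("tools")
  "package:tools" %in% search()
}

attached_inside <- ExampleModule()
attached_after <- "package:tools" %in% search()
```

This relies on module authors calling the helper instead of library(), and it does nothing about packages the user attached beforehand.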
It's not really an S4 thing - that was part of that SO thread, but the detaching/unloading thing is more generic. It's down to the dependencies loaded in by each package, and not being able to work out reverse dependencies without trying to remove them from packages you want to keep. I tried all the combinations of
I'd love a way of getting isolated environments within an R session, but hands down the safest way would be a subprocess (along those lines I'm stewing on the idea of trying to replicate Python's subprocess module in R which would allow for possibly nicer management of subprocesses). Previously, I wondered if this could be done by manipulating the search path (this might not work well for S4 classes, but then they're generally horrible things anyway). Basically, I'm wondering if you can either:
For both approaches you'd want to prohibit user use of The trick with most of these is not to unload the packages but to use things like |
Cool, we should definitely try In the meantime, here's a basic version of child processes using
# load the parallel package to spawn child processes
library(parallel)
# load some other arbitrary package as a user might
library(plyr)
# set up a cluster on three cores
cl <- makeCluster(3)
# function to load multiple libraries
libraries <- function (list) {
lapply(as.list(list),
library,
character.only = TRUE)
}
# a list for our three modules, giving the required packages
dependencies <- list(one = c('dismo', 'spocc'),
two = c('leaflet'),
three = 'base') # load nothing
# (need to get a list of dependencies for each module
# - should enforce that these are listed in the docs)
# send packages out
clusterMap(cl, libraries, dependencies) -> .
# packages loaded for the user
names(sessionInfo()$otherPkgs)
# [1] "plyr"
# packages loaded on each child
clusterEvalQ(cl, names(sessionInfo()$otherPkgs))
# [[1]]
# [1] "spocc" "dismo" "raster" "sp"
# [[2]]
# [1] "leaflet"
# [[3]]
# NULL
# ~~~~~~~~~~~
# export modules and arguments in the zoon style
# define some silly modules (all have the x argument)
printer <- function(x) {
paste('result:', x)
}
printer2 <- function(x, ...) {
x <- as.vector(c(x, list(...)))
paste('result:', x)
}
emphasiser <- function(x) {
x <- toupper(x)
paste('result:', x)
}
# define the list of modules and their arguments
# (workflow has something like this)
modules <- list(one = list(module = printer,
paras = list(x = 'test')),
two = list(module = printer2,
paras = list(x = 'test',
a = 1)),
three = list(module = emphasiser,
paras = list(x = 'test')))
# note that we pass the modules, not their names, so that the children
# have them
# a function to execute the modules from their lists
doer <- function(module_list) {
do.call(module_list$module, module_list$paras)
}
# show that it works at the top level
(res1 <- doer(modules$one))
# [1] "result: test"
(res2 <- doer(modules$two))
# [1] "result: test" "result: 1"
(res3 <- doer(modules$three))
# [1] "result: TEST"
# execute on the child processes
(res <- clusterMap(cl, doer, modules))
# $one
# [1] "result: test"
# $two
# [1] "result: test" "result: 1"
# $three
# [1] "result: TEST"
This is all in parallel already, which is nice. A downside is that we can't plot directly to the open graphics device with the output modules. We can always require the user to save plots instead, and that may solve a lot of other problems (e.g. compiling results into a report, other unforeseen pain) further down the line. We could easily write a ZoonPlot function to handle that nicely with a temp directory.
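A rough sketch of what such a ZoonPlot helper could look like, assuming plots are written to a PDF in a temporary directory. The signature and the choice of pdf() are assumptions for illustration only; pdf() is used because it needs no display, which matters in a child process.

```r
# Hypothetical sketch of ZoonPlot: evaluate plotting code against a file
# device in a temp directory and return the path to the saved plot.
ZoonPlot <- function(plot_code, dir = tempdir()) {
  path <- tempfile(pattern = "zoonplot_", tmpdir = dir, fileext = ".pdf")
  grDevices::pdf(path)
  # always close the device, even if the plotting code errors
  on.exit(grDevices::dev.off(), add = TRUE)
  # the plotting expression is lazily evaluated, so it only runs here,
  # after the file device has been opened
  force(plot_code)
  invisible(path)
}

plot_file <- ZoonPlot(plot(1:10, main = "example"))
```

The parent session then only has to deal with a file path, which could also be collected into a report later.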
Here's a more concrete definition of approach 2:
in_environment <- function(expr, packages, envir=parent.frame()) {
nms <- sprintf("fakepackage:%s", packages)
cleanup <- function() {
loaded <- intersect(rev(nms), search())
for (p in loaded) {
try(detach(p, character.only=TRUE))
}
}
on.exit(cleanup())
for (i in seq_along(packages)) {
requireNamespace(packages[[i]], quietly=TRUE)
attach(asNamespace(packages[[i]]), name=nms[[i]])
}
eval(expr, envir)
}
The function
sweet. thanks! |
Hmmm... Running the following on the definitions in my example above,
in_environment <- function(expr, packages, envir=parent.frame()) {
nms <- sprintf("fakepackage:%s", packages)
cleanup <- function() {
loaded <- intersect(rev(nms), search())
for (p in loaded) {
try(detach(p, character.only=TRUE))
}
}
on.exit(cleanup())
for (i in seq_along(packages)) {
requireNamespace(packages[[i]], quietly=TRUE)
attach(asNamespace(packages[[i]]), name=nms[[i]])
}
eval(expr, envir)
}
attached <- function() names(sessionInfo()$loadedOnly)
attached()
# "tools" "Rcpp"
in_environment(attached(),
packages = dependencies$one)
# The following object is masked from fakepackage:dismo:
#
# .onLoad
#
# The following objects are masked from package:plyr:
#
# mapvalues, quickdf, rename, revalue
#
# The following object is masked from package:base:
#
# strtrim
#
# [1] "Rcpp" "tools" "digest" "lubridate" "jsonlite"
# [6] "memoise" "ecoengine" "gtable" "lattice" "rgbif"
# [11] "DBI" "rstudioapi" "mapproj" "curl" "rbison"
# [16] "proto" "dismo" "dplyr" "httr" "stringr"
# [21] "leafletR" "raster" "maps" "rvertnet" "grid"
# [26] "rebird" "data.table" "R6" "XML" "sp"
# [31] "ggplot2" "reshape2" "magrittr" "whisker" "AntWeb"
# [36] "scales" "spocc" "MASS" "assertthat" "colorspace"
# [41] "brew" "V8" "stringi" "munsell" "chron"
# [46] "rjson"
attached()
# [1] "Rcpp" "tools" "digest" "lubridate" "jsonlite"
# [6] "memoise" "ecoengine" "gtable" "lattice" "rgbif"
# [11] "DBI" "rstudioapi" "mapproj" "curl" "rbison"
# [16] "proto" "dismo" "dplyr" "httr" "stringr"
# [21] "leafletR" "raster" "maps" "rvertnet" "grid"
# [26] "rebird" "data.table" "R6" "XML" "sp"
# [31] "ggplot2" "reshape2" "magrittr" "whisker" "AntWeb"
# [36] "scales" "spocc" "MASS" "assertthat" "colorspace"
# [41] "brew" "V8" "stringi" "munsell" "chron"
# [46] "rjson" |
Yeah, it doesn't unload them (which is the bit that might fail) but it will remove them from the search path which removes conflicts (check out As a potential warning about unloading packages: if you use a package that uses Rcpp modules and unload or reload the package before all garbage collection of R objects is complete, R will crash. |
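The difference between detaching and unloading can be seen with a base R package. This is a small illustration rather than anything zoon-specific; 'tools' is just an arbitrary package that ships with R.

```r
# attach an arbitrary package that ships with R
library(tools)

# detach() only removes the package from the search path ...
detach("package:tools")
attached <- "package:tools" %in% search()

# ... its namespace stays loaded, so it still shows up under
# 'loaded via a namespace (and not attached)' in sessionInfo()
loaded <- "tools" %in% loadedNamespaces()
```

After this, `attached` is FALSE while `loaded` is TRUE: functions from the package no longer mask anything on the search path, but nothing has been unloaded.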
Ah OK, thanks. I'm not sure whether they need to be out of the search path or unloaded in the |
unfortunately On a clean R session, this works just fine:
# install.packages('raster')
# devtools::install_github('environmentalinformatics-marburg/Rsenal')
library(Rsenal); library(raster)
mapView(raster(system.file("external/test.grd", package="raster")))
but this does not:
# install.packages(c('spocc', 'raster'))
# devtools::install_github('environmentalinformatics-marburg/Rsenal')
in_environment <- function(expr, packages, envir=parent.frame()) {
nms <- sprintf("fakepackage:%s", packages)
cleanup <- function() {
loaded <- intersect(rev(nms), search())
for (p in loaded) {
try(detach(p, character.only=TRUE))
}
}
on.exit(cleanup())
for (i in seq_along(packages)) {
requireNamespace(packages[[i]], quietly=TRUE)
attach(asNamespace(packages[[i]]), name=nms[[i]])
}
eval(expr, envir)
}
in_environment(occ('Anopheles plumbeus'), 'spocc')
in_environment(mapView(raster(system.file("external/test.grd", package="raster"))),
c('raster', 'Rsenal')) |
for completeness, the parallel version works fine on the leaflet example:
libraries <- function (list)
lapply(list, library, character.only = TRUE)
dependencies <- list(one = 'spocc',
two = c('raster', 'Rsenal'))
expr_list <- list(one = expression(occ('Anopheles plumbeus')),
two = expression(mapView(raster(system.file("external/test.grd",
package="raster")))))
library(parallel)
cl <- makeCluster(2)
clusterMap(cl, libraries, dependencies) -> .
res <- clusterMap(cl, eval, expr_list)
str(res$one, 1)
library('leaflet') # needed for plotting too
m |
Is there a copy of external/test.grd somewhere? I could have a look? Alternatively, what is the error message? The parallel versions will be running in separate clean environments. However, presumably if a future call required the alternate loading order things could fail (non deterministically as jobs get allocated to nodes). |
That's in the The error message is: |
what do you mean by the alternate loading order in that example? |
OK, so I managed to debug more successfully now. The error arises because the result has the classes:
library(spocc); library(raster); library(Rsenal)
m <- mapView(raster(system.file("external/test.grd", package="raster")))
m
# Error in browseURL(x, ...) : 'url' must be a non-empty character string
sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin14.4.0 (64-bit)
# Running under: OS X 10.10.3 (Yosemite)
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] Rsenal_0.1.85 raster_2.4-18 sp_1.1-1 spocc_0.3.0
#
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.0 plyr_1.8.3 base64enc_0.1-3 tools_3.2.1
# [5] digest_0.6.8 lubridate_1.3.3 jsonlite_0.9.16 memoise_0.2.1
# [9] ecoengine_1.9.1 gtable_0.1.2 lattice_0.20-31 rgbif_0.8.8
# [13] png_0.1-7 DBI_0.3.1 rstudioapi_0.3.1 mapproj_1.2-3
# [17] rgdal_1.0-4 curl_0.9.2 parallel_3.2.1 rbison_0.4.8
# [21] proto_0.3-10 dplyr_0.4.2 httr_1.0.0 stringr_1.0.0
# [25] leafletR_0.3-3 htmlwidgets_0.5 maps_2.3-11 leaflet_1.0.0
# [29] rvertnet_0.3.0 grid_3.2.1 rebird_0.2 data.table_1.9.4
# [33] R6_2.1.0 XML_3.98-1.3 ggplot2_1.0.1 reshape2_1.4.1
# [37] magrittr_1.5 whisker_0.3-2 AntWeb_0.7 htmltools_0.2.6
# [41] scales_0.2.5 MASS_7.3-40 assertthat_0.1 colorspace_1.2-6
# [45] brew_1.0-6 V8_0.6 stringi_0.5-5 munsell_0.4.2
# [49] chron_2.3-47 rjson_0.2.15 |
I'm now explicitly calling the correct print method in the module InteractiveMap, which fixes the problem. Would be good to know if there's an elegant and generalisable way around this though!
I still like the idea of using child processes to execute each module, but realised that would mean predict methods aren't visible in the output modules. We'd need to find a way of identifying the package hosting the relevant predict method to load when executing output modules. Or, preferably, pass the predict method with the model so that module writers can define their own predict method for weird models.
As above, I think that model modules returning objects that contain a bespoke predict method with the correct behaviour would solve a lot of problems in output modules. It would also mean we can execute those in clean processes without the package containing the predict method having to be in the global namespace. Here's a prototype which makes model module developers define prediction code to conform to set (yet to be determined) IO standards via the
# these functions defined in zoon:
# generic predict function for predict objects in zoon output modules
# usage along the lines of:
# prediction <- ZoonPredict(.model$model, SomeNewData)
ZoonPredict <- function(predict_object, newdata) {
# get required packages
# (loop so that more than one package can be listed)
for (pkg in predict_object$packages) {
require(pkg, character.only = TRUE)
}
# define prediction function using module code
fun_text <- sprintf('fun <- function (model, newdata) {%s}',
predict_object$code)
fun <- eval(parse(text = fun_text))
# run the predictor and return result
ans <- fun(predict_object$model, newdata = newdata)
return (ans)
}
# utility function to help module developers make prediction objects
# packages is the packages required for prediction
# (i.e. wherever the predict method is defined)
# model is the fitted model object
# code is a code snippet used to make predictions that conform to clear
# IO rules (tbd). e.g:
# must use the object 'model' and a dataframe 'newdata' (which may have some constraints)
# and nothing else;
# must return a numeric vector with length matching the number of rows
# of newdata and on the response scale.
MakePredictor <- function(packages,
model,
code) {
# catch the code as text
code <- deparse(substitute(code))
code <- paste(code, collapse = '\n')
# return the list
list(packages = packages,
model = model,
code = code)
}
# ~~~~~~~~~
# example usage
library(zoon)
# an example using this approach for the logistic regression module:
LogisticRegression <- function (.df) {
# fit the model, as normal
covs <- as.data.frame(.df[, 6:ncol(.df)])
names(covs) <- names(.df)[6:ncol(.df)]
m <- glm(.df$value ~ .,
data = covs,
family = 'binomial')
# make the predictor object
MakePredictor(packages = 'stats',
model = m,
code = stats::predict.glm(object = model,
newdata = newdata,
type = 'response'))
}
# fake data to test
.df <- data.frame(value = sample(0:1, 10, replace = TRUE),
matrix(rnorm(100), nrow = 10))
# run the module to get the predict object
out <- LogisticRegression(.df)
# make predictions (here, against the training data)
ZoonPredict(predict_object = out,
newdata = .df[, 6:NCOL(.df)])
A similar one for QuickGRaF would therefore account for the fact that predictions are a matrix
QuickGRaF <- function (.df, l = NULL) {
zoon:::GetPackage('GRaF')
if (!all(.df$type %in% c('presence', 'absence', 'background'))) {
stop ('only for presence/absence or presence/background data')
}
# get the covariates
covs <- as.data.frame(.df[, 6:ncol(.df)])
names(covs) <- names(.df)[6:ncol(.df)]
# set up l
if (!is.null(l)) {
# duplicate if necessary
if (length(l) == 1) {
l <- rep(l, ncol(covs))
}
# check l has the correct length
if (length(l) != ncol(covs)) {
stop (sprintf('l has %i elements, but there are %i covariates', length(l), ncol(covs)))
}
# check l is of the correct value
if (any(l <= 0)) {
stop(sprintf('l must be positive, but the values provided were: %s',
paste(format(l, digits = 3), collapse = ', ')))
}
}
# fit the model
m <- graf(.df$value,
covs,
l = l)
# make the predictor object
MakePredictor(packages = 'GRaF',
model = m,
code = {
p <- GRaF::predict.graf(object = model,
newdata = newdata,
type = 'response')
p[, 1]
})
}
# run the module to get the predict object
out <- QuickGRaF(.df)
# make predictions (here, against the training data)
ZoonPredict(predict_object = out,
newdata = .df[, 6:NCOL(.df)])
This would mean we could get rid of this snippet, which appears in lots of output modules:
# if pred is a matrix/dataframe, take only the first column
if(!is.null(dim(pred))) {
pred <- pred[, 1]
}
And because the model object, package name and code are all passed in a single object, the object can easily be passed to a child process and executed cleanly. What are people's thoughts on this as an approach to the problem? When I get some time, I'll start a new branch to implement both this and execution of modules in child processes.
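As a rough illustration of passing such an object to a child process, the sketch below builds a predictor list by hand (same fields as the prototype: packages, model, code) and evaluates it on a single worker from the parallel package. PredictOnWorker and the mtcars model are stand-ins chosen for this example, not zoon code.

```r
library(parallel)

# A hand-built predictor object with the same fields as the prototype above;
# the model and data are illustrative only
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
predictor <- list(
  packages = "stats",
  model = fit,
  code = "stats::predict.glm(object = model, newdata = newdata, type = 'response')"
)

# Hypothetical worker-side function: attach the listed packages, rebuild the
# prediction function from the stored code text, and run it
PredictOnWorker <- function(predict_object, newdata) {
  for (pkg in predict_object$packages) {
    library(pkg, character.only = TRUE)
  }
  fun_text <- sprintf("function (model, newdata) {%s}", predict_object$code)
  fun <- eval(parse(text = fun_text))
  fun(predict_object$model, newdata = newdata)
}

# run the prediction in a fresh child process, then shut the process down
cl <- makeCluster(1)
pred <- clusterCall(cl, PredictOnWorker, predictor, mtcars)[[1]]
stopCluster(cl)
```

Everything the worker needs travels inside the one object, so the parent session never has to attach the package that hosts the predict method.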
I think it makes sense as long as we do good documentation for module builders (which is important anyway). Part of that will be simply rewriting the model modules we already have. I think people will often use current modules as examples. I'm just wondering whether the default value for the That said, we would probably want to write modules for most of those basic algorithms anyway. So maybe pointless. Finally I'm not sure I like the name
Yup, sounds good.
That would help developers a bit, but is not explicit about the predict function. If a model has two classes, it would help to be explicit about the required function. I think that being explicit won't be too arduous, and as you say they can copy from existing functions.
Yup, I agree - worth thinking about a better name... |
I've been having a go at child processes again and thought I'd record my notes here. TL;DR: nope.
It's definitely possible to execute modules in child processes, but every time a new clean child process is started there's a significant overhead (~2s on my 2015 MBP) for loading zoon and raster. This would apply to every module in a workflow, so even the smallest workflow would have 10 seconds tacked on, which I think is unacceptable.
Instead of using brand new processes, we can fork from existing processes at next to no computational cost (if the parent process has zoon and raster loaded, the fork is born with them loaded too). If we spun up one clean process when zoon is loaded in an R session, then we could fork from that, and only pay the 2s cost once per session. Unfortunately, #?*!ing Windows doesn't do forking, so that's a non-starter for the majority of our user base. We could think about a two-tier option, where things just work more cleanly on Unix, but that doesn't appeal to me: workflows developed on Unix could error for a Windows user.
I think it would be better to invest some time trying to implement @richfitz's suggestion of using environments and attachment. AFAIK that wouldn't protect us from a user
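For reference, the forking behaviour described above can be sketched with mcparallel and mccollect from the parallel package. This is Unix-only (there is no forking on Windows, as noted), and RunForked is a hypothetical helper name used just for this example.

```r
library(parallel)

# Hypothetical helper: evaluate `expr` in a forked copy of the current
# session and return its value. Packages attached inside the fork never
# reach the parent session's search path.
RunForked <- function(expr) {
  job <- mcparallel(expr)   # fork; the expression is evaluated in the child
  mccollect(job)[[1]]       # wait for the child and fetch its result
}

attached_in_child <- RunForked({
  library(tools)
  "package:tools" %in% search()
})
attached_in_parent <- "package:tools" %in% search()
```

The child inherits whatever the parent already has loaded, which is what makes the fork cheap, but it also inherits anything the user attached before the fork.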
Packages required for one module are loaded into the global namespace and are visible to subsequent modules and to users. As is happening with #77, already loaded packages can conflict with subsequent modules. As the number of modules increases, the number of such conflicts will rise and it will become impossible to test for them.
It would be great if we could just clean up after each module by unloading all the packages it's loaded. I've searched for a while on this to no avail, and this comment seems pretty definitive: