Parallel implementation of `ffs()` #31

r2j2ritson · 2022-03-25T16:43:59Z

Hi there,

This is more of an enhancement suggestion, but also a question: any advice on parallelizing ffs()? I'm using ranger to create species distribution models for many plant species and have ~70 covariates, which results in ffs reporting over 4000 individual models to be trained. I have ~20 cores at my disposal, and I think I could see major speed improvements with a multicore implementation similar to aoa.

Thanks,

Rob
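
For a sense of where "over 4000 models" comes from, here is a rough sketch of the arithmetic. It assumes (this is not confirmed anywhere in this thread) that forward feature selection first fits every pair of predictors and then, at each later step, tries each remaining predictor once. The helper name `max_ffs_models` is hypothetical.

```r
# Upper bound on the number of models forward feature selection may train,
# ASSUMING the search first fits every combination of `min_var` predictors
# and then, at each later step, tries each remaining predictor once.
max_ffs_models <- function(n_predictors, min_var = 2) {
  # Step 1: all combinations of the minimum number of variables.
  first_step <- choose(n_predictors, min_var)
  # Later steps: one model per remaining candidate, shrinking by one each step.
  remaining <- n_predictors - min_var
  later_steps <- sum(seq_len(remaining))
  first_step + later_steps
}

max_ffs_models(70)  # 2415 pairs + 2346 later candidates = 4761
```

Under that assumption the pairwise first step alone accounts for about half of the work, and all of its models are independent of each other, which is what makes the search a natural fit for multiple cores.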


joshualerickson · 2022-11-11T19:12:38Z

Hey developers, I've been using this package for a while now and love it! I'm reaching out because I took a look at the ffs() function and noticed that it could be parallelized (chunk-based), as suggested above by @r2j2ritson. I was able to convert it to chunk-based parallelism within this fork; however, there are some big changes.

Two new imports: furrr and purrr.
Model results are returned in a data.frame instead of the actual model object (this could be changed or adapted for backward compatibility).
I tried to keep as much of the original code as I could but ended up refactoring a decent amount.
It requires a new argument, ffsParallel (logical). If TRUE, chunk-based parallelism is used. If FALSE, purrr::map() is used instead. On the outside everything looks the same.

Caveats

It's up to the user to plan() their process via {future}. I really like this approach, but it may be new to others.
I haven't done any testing on other datasets and methods!
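
To illustrate the pattern described above (the caller sets the plan, and a logical flag switches between furrr and purrr), here is a minimal sketch. It is not the fork's code: `fit_candidate()` and `run_candidates()` are hypothetical stand-ins, and a real version would call caret::train() for each predictor subset.

```r
library(future)
library(furrr)
library(purrr)

# Hypothetical stand-in for fitting one candidate model; a real version
# would train a model on the given predictor subset and return its metrics.
fit_candidate <- function(vars) {
  data.frame(vars = paste(vars, collapse = ", "), var_number = length(vars))
}

# All pairs of predictors, one list element per candidate model.
predictors <- c("DEM", "TWI", "BLD", "Precip_cum")
candidates <- combn(predictors, 2, simplify = FALSE)

# The caller, not the function, decides how the work is distributed.
plan(multisession, workers = 2)

run_candidates <- function(candidates, parallel = TRUE) {
  # Same call shape either way; only the mapping function differs.
  map_fn <- if (parallel) furrr::future_map else purrr::map
  results <- map_fn(candidates, fit_candidate)
  do.call(rbind, results)
}

res <- run_candidates(candidates, parallel = TRUE)

plan(sequential)  # shut down the background workers
```

Leaving plan() to the user means the same function can run sequentially, on local cores, or on a cluster without any change to its arguments.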

If this is something you would like to add as a new function (e.g. ffs_parallel()) while keeping the original ffs() for backward compatibility, that's fine by me. It's also totally fine if this is outside the scope of your package and goals, and you'd rather just point people to the fork for requests like the one above! However, if this is something you would want to add, I could eventually open a PR, knowing that it's a big change. I'd be open to discussing it in more detail if so inclined.

Thanks again for the papers and package!

Below are some benchmarks using the fork and the new function:

library(microbenchmark)
library(lubridate)
library(CAST)
library(caret)
library(ggplot2) # needed for autoplot() and ggplot() below

#load and prepare dataset:
dat <- get(load(system.file("extdata","Cookfarm.RData",package="CAST")))
trainDat <- dat[dat$altitude==-0.3&year(dat$Date)==2012&week(dat$Date)%in%c(13:14),]

#create folds for Leave Location Out Cross Validation:
set.seed(10)
indices <- CreateSpacetimeFolds(trainDat,spacevar = "SOURCEID",k=3)
ctrl <- trainControl(method="cv",
                     index = indices$index)

#define potential predictors:
predictors <- c("DEM","TWI","BLD","Precip_cum","cday","MaxT_wrcc",
                "Precip_wrcc","NDRE.M","Bt","MinT_wrcc","Northing","Easting")


bm <- microbenchmark(
                      para = {
                        set.seed(10)
                        library(future)
                        plan(multisession, workers = availableCores() - 1)
                        
                        ffsmodel <- ffs(trainDat[,predictors],trainDat$VW,method="rf",
                                        tuneLength=2,trControl=ctrl, ffsParallel = TRUE, minVar = 2)
                      },
                      
                      normal = {
                        set.seed(10)
                        ffsmodel <- ffs(trainDat[,predictors],trainDat$VW,method="rf",
                                        tuneLength=2,trControl=ctrl, minVar = 2)
                      },
                      times = 10
                    )

autoplot(bm)

bm
Unit: seconds
   expr      min      lq     mean   median       uq      max neval cld
   para 25.00555 25.0667 26.18992 25.17491 26.90551 30.57737    10  a 
 normal 63.08693 63.2396 65.07448 63.41705 66.17914 72.53602    10   b

Model data.frame (what you get back)

head(ffsmodel)

  actmodelperf actmodelperfSE             vars var_number
1   0.09254037     0.01761334         DEM, TWI          2
2   0.06752643     0.01192134         DEM, BLD          2
3   0.08953551     0.01323702  DEM, Precip_cum          2
4   0.08949154     0.01326544        DEM, cday          2
5   0.08906957     0.01281766   DEM, MaxT_wrcc          2
6   0.08933983     0.01296838 DEM, Precip_wrcc          2


library(dplyr)
ffsmodel %>% 
  mutate(iteration = row_number()) %>% 
  ggplot(aes(iteration, actmodelperf, color = factor(var_number))) +
  geom_point() +
  geom_errorbar(aes(ymin = actmodelperf-actmodelperfSE, ymax = actmodelperf+actmodelperfSE))

pecto2020 · 2022-12-13T12:36:39Z

Hi Josh,
Thank you very much for your fork, which finally allowed me to use CAST ffs on a moderately large database!
When running ffs I get a warning that I cannot figure out how to resolve. Any suggestions?

Warning: UNRELIABLE VALUE: Future (‘’) unexpectedly generated random numbers without specifying argument 'seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".

Thank you

joshualerickson · 2022-12-13T17:13:59Z

Hey @pecto2020, I'm glad you found this useful! As for the warning, this is a standard warning when using future and it shouldn't affect the seed you set in the function arguments. I went ahead and set future.seed=TRUE in the fork, so that message will no longer be displayed. Hope that helps, thanks.

If you have any more issues regarding the fork feel free to leave an issue at that repository, thanks.
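
For readers who hit the same warning in their own furrr code, here is a minimal sketch of declaring the seed. It uses furrr_options(seed = TRUE), which is furrr's documented way to request parallel-safe random numbers; whether the fork sets it in exactly this way is an assumption.

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# Drawing random numbers inside a future without declaring a seed is what
# triggers the "UNRELIABLE VALUE" warning. Declaring seed = TRUE gives each
# element its own parallel-safe L'Ecuyer-CMRG stream, derived from the
# current RNG state of the main session.
draw_one <- function(i) rnorm(1)

set.seed(10)
a <- future_map_dbl(1:4, draw_one, .options = furrr_options(seed = TRUE))

set.seed(10)
b <- future_map_dbl(1:4, draw_one, .options = furrr_options(seed = TRUE))

identical(a, b)  # the two runs can be compared for reproducibility

plan(sequential)  # shut down the background workers
```

Because the per-element streams are derived from the main session's RNG state, calling set.seed() before the parallel call is what ties the results to a fixed seed.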
