Parallel implementation of ffs() #31

Open
r2j2ritson opened this issue Mar 25, 2022 · 3 comments

Comments

@r2j2ritson

Hi there,

This is more of an enhancement suggestion, but also a question: any advice on parallelizing ffs()? I'm using ranger to create species distribution models for many plant species and have ~70 covariates, which results in ffs reporting over 4000 individual models to train. I have ~20 cores at my disposal, and I think a multicore implementation similar to the one in aoa could give major speed improvements.
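
For a sense of scale, the reported model count can be approximated with a few lines of R. This is only a sketch: it assumes the search scheme described in the CAST documentation (all pairs of predictors are trained first, then each remaining variable is added one at a time to the current best model), and `max_ffs_models` is a hypothetical helper, not a CAST function.

```r
# Hypothetical helper (not part of CAST): upper bound on the number of models
# trained by a forward feature selection that starts from all combinations of
# `min_var` predictors and then adds one remaining predictor per round.
max_ffs_models <- function(n_predictors, min_var = 2) {
  initial   <- choose(n_predictors, min_var)   # all starting combinations
  remaining <- n_predictors - min_var          # predictors left to add
  forward   <- remaining * (remaining + 1) / 2 # remaining + (remaining - 1) + ... + 1
  initial + forward
}

max_ffs_models(70)  # 2415 + 2346 = 4761
max_ffs_models(12)  # 66 + 55 = 121
```

The search can stop early when no further improvement is found, so this is an upper bound rather than the exact number of models trained.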

Thanks,

Rob

@joshualerickson

Hey developers, I've been using this package for a while now and love it! I'm reaching out because I took a look at the ffs() function and noticed that it could be parallelized (chunk-based), as described above by @r2j2ritson. I was able to convert it to chunk-based parallelism within this fork; however, there are some big changes.

  1. Two new imports: furrr and purrr.
  2. Model results are returned in a data.frame instead of the actual model (this could be changed/adapted for backward compatibility, though).
  3. I tried to keep as much of the original code as I could but ended up refactoring a decent amount.
  4. It requires a new argument, ffsParallel (logical). If TRUE, chunk-based parallelism is used. If FALSE, purrr::map() is used instead. On the outside everything looks the same.

Caveats

  1. It's up to the user to plan() their process via {future}. I really like this approach, but it may be new to others.
  2. I haven't done any testing on other datasets and methods!
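
The switch between chunk-based parallelism and purrr::map() described above can be sketched roughly as follows. This is a minimal illustration, not the fork's actual code: `score_candidate` and `evaluate_candidates` are hypothetical names, and the scoring function is a placeholder where a real implementation would train a model (for example with caret::train()).

```r
library(future)
library(furrr)
library(purrr)

# Hypothetical stand-in for training and scoring one candidate variable set.
score_candidate <- function(vars) {
  data.frame(vars = paste(vars, collapse = ", "),
             var_number = length(vars))
}

# Evaluate every candidate, either in parallel chunks or sequentially.
evaluate_candidates <- function(candidates, ffsParallel = FALSE) {
  results <- if (ffsParallel) {
    # furrr splits `candidates` into chunks and sends them to the workers
    # defined by the user's plan(); seed = TRUE requests parallel-safe RNG.
    future_map(candidates, score_candidate,
               .options = furrr_options(seed = TRUE))
  } else {
    map(candidates, score_candidate)
  }
  do.call(rbind, results)
}

predictors <- c("DEM", "TWI", "BLD", "Precip_cum")
candidates <- combn(predictors, 2, simplify = FALSE)

plan(multisession, workers = 2)  # the user picks the backend
res_par <- evaluate_candidates(candidates, ffsParallel = TRUE)
plan(sequential)                 # release the workers again

res_seq <- evaluate_candidates(candidates, ffsParallel = FALSE)
```

Because the backend is chosen with plan() outside the function, the same call can run on a laptop or a cluster without changes to the function itself.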

If this is something you would like to add as a new function (e.g. ffs_parallel()) while keeping the original ffs() for backward compatibility, that's fine by me. It's also totally fine if this is outside the scope of your package and goals and you just refer to the fork for comments like the one above! However, if this is something you would want to add, I could eventually do a PR, knowing that it's a big change. I'd be open to discussing it in more detail if you're so inclined.

Thanks again for the papers and package!

Below are some benchmarks using the fork and the new function:

library(microbenchmark)
library(lubridate)
library(CAST)
library(caret)
library(ggplot2) # needed for autoplot() on the benchmark result

#load and prepare dataset:
dat <- get(load(system.file("extdata","Cookfarm.RData",package="CAST")))
trainDat <- dat[dat$altitude==-0.3&year(dat$Date)==2012&week(dat$Date)%in%c(13:14),]

#create folds for Leave Location Out Cross Validation:
set.seed(10)
indices <- CreateSpacetimeFolds(trainDat,spacevar = "SOURCEID",k=3)
ctrl <- trainControl(method="cv",
                     index = indices$index)

#define potential predictors:
predictors <- c("DEM","TWI","BLD","Precip_cum","cday","MaxT_wrcc",
                "Precip_wrcc","NDRE.M","Bt","MinT_wrcc","Northing","Easting")


bm <- microbenchmark(
                      para = {
                        set.seed(10)
                        library(future)
                        # leave one core free for the main R session
                        plan(multisession, workers = availableCores() - 1)
                        
                        ffsmodel <- ffs(trainDat[,predictors],trainDat$VW,method="rf",
                                        tuneLength=2,trControl=ctrl, ffsParallel = TRUE, minVar = 2)
                      },
                      
                      normal = {
                        set.seed(10)
                        ffsmodel <- ffs(trainDat[,predictors],trainDat$VW,method="rf",
                                        tuneLength=2,trControl=ctrl, minVar = 2)
                      },
                      times = 10
                    )

autoplot(bm)

[Image: cast_bm]

bm
Unit: seconds
   expr      min      lq     mean   median       uq      max neval cld
   para 25.00555 25.0667 26.18992 25.17491 26.90551 30.57737    10  a 
 normal 63.08693 63.2396 65.07448 63.41705 66.17914 72.53602    10   b
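
As a quick reading of the table above, the speedup can be computed from the median timings. The numbers are copied from the benchmark output; the number of workers used is not stated, so parallel efficiency cannot be derived from it.

```r
# Median run times in seconds, taken from the microbenchmark output above.
median_s <- c(para = 25.17491, normal = 63.41705)

# Speedup of the parallel run relative to the sequential one.
speedup <- median_s[["normal"]] / median_s[["para"]]
round(speedup, 2)  # 2.52
```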

Model data.frame (what you get back)

head(ffsmodel)

  actmodelperf actmodelperfSE             vars var_number
1   0.09254037     0.01761334         DEM, TWI          2
2   0.06752643     0.01192134         DEM, BLD          2
3   0.08953551     0.01323702  DEM, Precip_cum          2
4   0.08949154     0.01326544        DEM, cday          2
5   0.08906957     0.01281766   DEM, MaxT_wrcc          2
6   0.08933983     0.01296838 DEM, Precip_wrcc          2


library(dplyr)
library(ggplot2)
ffsmodel %>% 
  mutate(iteration = row_number()) %>% 
  ggplot(aes(iteration, actmodelperf, color = factor(var_number))) +
  geom_point() +
  geom_errorbar(aes(ymin = actmodelperf-actmodelperfSE, ymax = actmodelperf+actmodelperfSE))

[Image: cast_ggplot]

@pecto2020

Hei Josh,
Thank you very much for your fork which finally allowed me to use CAST ffs on a moderately large database!
Running ffs I get a warning which I cannot figure out how to solve. Any suggestion about it?

Warning: UNRELIABLE VALUE: Future (‘<none>’) unexpectedly generated random numbers without specifying argument 'seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".

Thank you

@joshualerickson

Hey @pecto2020, I'm glad you found this useful! This is a standard warning when using future, and it shouldn't affect the seed you set in the function arguments. I went ahead and set future.seed=TRUE in the fork, so that message should no longer appear. Hope that helps, thanks.

If you have any more issues regarding the fork feel free to leave an issue at that repository, thanks.
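
For anyone hitting the same warning in their own furrr code, a minimal sketch of the seed option is shown below. It is an illustration only, not the fork's code: in furrr the option is passed as furrr_options(seed = ...), and the toy rnorm() call just stands in for any step that draws random numbers on a worker.

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# seed = TRUE requests parallel-safe L'Ecuyer-CMRG random number streams,
# which is what the warning asks for.
draws <- future_map_dbl(1:4, function(i) rnorm(1),
                        .options = furrr_options(seed = TRUE))

# An integer seed additionally makes the draws repeatable across calls.
a <- future_map_dbl(1:4, function(i) rnorm(1),
                    .options = furrr_options(seed = 10L))
b <- future_map_dbl(1:4, function(i) rnorm(1),
                    .options = furrr_options(seed = 10L))

plan(sequential)  # release the workers
```

Without the seed option, the same calls would still run but would emit the "UNRELIABLE VALUE" warning quoted above.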
