performance bug with engine = "lightgbm" #94

Open
simonpcouch opened this issue Nov 12, 2024 · 2 comments

Comments

@simonpcouch (Contributor)

Noticed while working on emlwr the other day that bonsai::train_lightgbm() is quite a bit slower than lightgbm::lgb.train(), probably due to the handling of categorical variables / conversion to lgb.Dataset. Observed with emlwr:::simulate_classification(). @EmilHvitfeldt also noted a slowdown reported by a user with a similar-looking dataset last week.
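
For reference, a rough benchmark sketch of the kind of comparison meant here (not the emlwr code; the simulated data, dataset size, and trees = 100 are made up for illustration, and it's not perfectly apples-to-apples since bonsai passes factors to LightGBM natively while the direct call one-hot encodes):

library(bonsai)    # loads parsnip
library(lightgbm)

set.seed(1)
n <- 1e5
dat <- data.frame(
  y  = factor(sample(c("a", "b"), n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = factor(sample(letters[1:20], n, replace = TRUE))
)

#via bonsai: formula handling + conversion to lgb.Dataset happens inside fit()
t_bonsai <- system.time(
  boost_tree(trees = 100) |>
    set_engine("lightgbm") |>
    set_mode("classification") |>
    fit(y ~ ., data = dat)
)

#direct lgb.train() on a pre-built lgb.Dataset
x <- model.matrix(y ~ . - 1, data = dat)
dtrain <- lgb.Dataset(data = x, label = as.integer(dat$y) - 1L)
t_lgb <- system.time(
  lgb.train(params = list(objective = "binary"), data = dtrain, nrounds = 100)
)

rbind(bonsai = t_bonsai, lightgbm = t_lgb)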

@estroger34

Not sure if related, but I've noticed unexpectedly long tuning times with lightgbm as well, even with numeric variables only. About 10x longer than xgboost for the reprex below.

library(tidyverse)
library(tidymodels)
library(bonsai)
library(future)

options(tidymodels.dark = TRUE) #lighter colors.

moddata <-
  concrete |> 
  summarise(compressive_strength = mean(compressive_strength), .by = c(cement:age)) 
  
#initial split
set.seed(432)
split <- initial_split(moddata)

#recipe
rec <- 
  recipe(compressive_strength ~ ., data = training(split)) |>  
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

#model specs
lgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             mtry = tune(), min_n = tune(), sample_size = tune(), trees = tune(), stop_iter = tune()) %>% 
  set_engine("lightgbm") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             mtry = tune(), min_n = tune(), sample_size = tune(), trees = tune(), stop_iter = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

#workflow set
wfset <- 
  workflow_set(
    preproc = list(rec = rec), 
    models = list(
      lGBM = lgb_spec,
      xgb = xgb_spec
      )
  )

#resamples
set.seed(1265)
folds <- vfold_cv(training(split), v = 5, repeats = 1)

#set parallel
plan(multisession, workers = 5)

#ctrl
grid_ctrl <-
  control_grid(
    save_pred = FALSE,
    parallel_over = "resamples",
    pkgs = NULL,
    save_workflow = FALSE
  )

#fit
fit_results <-
  wfset |> 
  workflow_map("tune_grid",
               seed = 1563,
               resamples = folds,
               grid = 2,
               control = grid_ctrl,
               verbose = TRUE)

#> i 1 of 2 tuning:     rec_lGBM
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> v 1 of 2 tuning:     rec_lGBM (32.6s)
#> i 2 of 2 tuning:     rec_xgb
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> v 2 of 2 tuning:     rec_xgb (3.4s)

Created on 2024-12-07 with reprex v2.1.1
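
One way to narrow this down further (sketch only, not run here; trees = 500 is an arbitrary fixed value) is to time a single non-tuned fit of each engine on the training set, which separates the per-fit cost from any tuning or parallelization overhead:

lgb_plain <- boost_tree(trees = 500) |> set_engine("lightgbm") |> set_mode("regression")
xgb_plain <- boost_tree(trees = 500) |> set_engine("xgboost") |> set_mode("regression")

#time one fit per engine with the same recipe
system.time(workflow(rec, lgb_plain) |> fit(data = training(split)))
system.time(workflow(rec, xgb_plain) |> fit(data = training(split)))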

@estroger34

FWIW, I've noticed that while tuning lightgbm in parallel, each of the 5 R session processes uses ~7-9% of total CPU capacity, as shown in the task manager of a 20-core/40-thread Windows workstation. This is unusual, as they otherwise max out at ~3% for other heavy computations, including tuning the xgboost and other models.
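
One thing that might be worth ruling out is nested parallelism, i.e. LightGBM multi-threading inside each of the 5 future workers. A sketch of pinning LightGBM to a single thread per worker, assuming engine arguments such as LightGBM's num_threads parameter are forwarded to lgb.train() (which is how I read the bonsai docs):

#lightgbm spec pinned to one thread per worker
lgb_spec_1t <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             mtry = tune(), min_n = tune(), sample_size = tune(), trees = tune(), stop_iter = tune()) %>% 
  set_engine("lightgbm", num_threads = 1) %>% 
  set_mode("regression")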
