performance bug with engine = "lightgbm" #94

Open
simonpcouch opened this issue Nov 12, 2024 · 2 comments

Comments

@simonpcouch (Contributor)

Noticed while working on emlwr the other day that bonsai::train_lightgbm() is quite a bit slower than lightgbm::lgb.train(), probably due to the handling of categorical variables / conversion to lgb.Dataset. Observed with emlwr:::simulate_classification(). @EmilHvitfeldt also noted a slowdown reported by a user with a similar-looking dataset last week.
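
For reference, a rough benchmark sketch of the kind of comparison meant here (not the emlwr code; the simulated data, dataset size, and trees = 100 are made up for illustration, and it's not perfectly apples-to-apples since bonsai passes factors to LightGBM natively while the direct call one-hot encodes):

library(bonsai)    # loads parsnip
library(lightgbm)

set.seed(1)
n <- 1e5
dat <- data.frame(
  y  = factor(sample(c("a", "b"), n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = factor(sample(letters[1:20], n, replace = TRUE))
)

#via bonsai: formula handling + conversion to lgb.Dataset happens inside fit()
t_bonsai <- system.time(
  boost_tree(trees = 100) |>
    set_engine("lightgbm") |>
    set_mode("classification") |>
    fit(y ~ ., data = dat)
)

#direct lgb.train() on a pre-built lgb.Dataset
x <- model.matrix(y ~ . - 1, data = dat)
dtrain <- lgb.Dataset(data = x, label = as.integer(dat$y) - 1L)
t_lgb <- system.time(
  lgb.train(params = list(objective = "binary"), data = dtrain, nrounds = 100)
)

rbind(bonsai = t_bonsai, lightgbm = t_lgb)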

@estroger34

Not sure if related, but I've noticed unexpectedly long tuning times with lightgbm as well, even with numeric variables only. About 10x longer than xgboost for the reprex below.

library(tidyverse)
library(tidymodels)
library(bonsai)
library(future)

options(tidymodels.dark = TRUE) #lighter colors.

moddata <-
  concrete |> 
  summarise(compressive_strength = mean(compressive_strength), .by = c(cement:age)) 
  
#initial split
set.seed(432)
split <- initial_split(moddata)

#recipe
rec <- 
  recipe(compressive_strength ~ ., data = training(split)) |>  
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

#model specs
lgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             mtry = tune(), min_n = tune(), sample_size = tune(), trees = tune(), stop_iter = tune()) %>% 
  set_engine("lightgbm") %>% 
  set_mode("regression")

xgb_spec <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             mtry = tune(), min_n = tune(), sample_size = tune(), trees = tune(), stop_iter = tune()) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

#workflow set
wfset <- 
  workflow_set(
    preproc = list(rec = rec), 
    models = list(
      lGBM = lgb_spec,
      xgb = xgb_spec
      )
  )

#resamples
set.seed(1265)
folds <- vfold_cv(training(split), v = 5, repeats = 1)

#set parallel
plan(multisession, workers = 5)

#ctrl
grid_ctrl <-
  control_grid(
    save_pred = FALSE,
    parallel_over = "resamples",
    pkgs = NULL,
    save_workflow = FALSE
  )

#fit
fit_results <-
  wfset |> 
  workflow_map("tune_grid",
               seed = 1563,
               resamples = folds,
               grid = 2,
               control = grid_ctrl,
               verbose = TRUE)

#> i 1 of 2 tuning:     rec_lGBM
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> v 1 of 2 tuning:     rec_lGBM (32.6s)
#> i 2 of 2 tuning:     rec_xgb
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> v 2 of 2 tuning:     rec_xgb (3.4s)

Created on 2024-12-07 with reprex v2.1.1
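
One way to narrow this down further (sketch only, not run here; trees = 500 is an arbitrary fixed value) is to time a single non-tuned fit of each engine on the training set, which separates the per-fit cost from any tuning or parallelization overhead:

lgb_plain <- boost_tree(trees = 500) |> set_engine("lightgbm") |> set_mode("regression")
xgb_plain <- boost_tree(trees = 500) |> set_engine("xgboost") |> set_mode("regression")

#time one fit per engine with the same recipe
system.time(workflow(rec, lgb_plain) |> fit(data = training(split)))
system.time(workflow(rec, xgb_plain) |> fit(data = training(split)))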

@estroger34

FWIW, I've noticed that while tuning lightgbm in parallel, each of the 5 R session processes uses ~7-9% of total CPU capacity, as shown in the task manager of a 20-core/40-thread Windows workstation. This is unusual, as they otherwise max out at ~3% for other heavy computations, including tuning the xgboost and other models.
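
One thing that might be worth ruling out is nested parallelism, i.e. LightGBM multi-threading inside each of the 5 future workers. A sketch of pinning LightGBM to a single thread per worker, assuming engine arguments such as LightGBM's num_threads parameter are forwarded to lgb.train() (which is how I read the bonsai docs):

#lightgbm spec pinned to one thread per worker
lgb_spec_1t <- 
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), 
             mtry = tune(), min_n = tune(), sample_size = tune(), trees = tune(), stop_iter = tune()) %>% 
  set_engine("lightgbm", num_threads = 1) %>% 
  set_mode("regression")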
