Implement out of bag predictions in Random Forest mode #6805

Open
oliviermeslin opened this issue Jan 28, 2025 · 0 comments

Comments


oliviermeslin commented Jan 28, 2025

Summary

LightGBM could support out-of-bag predictions when used in random forest mode (boosting_type = 'rf' and bagging_fraction < 1). This means that, for a given instance, the prediction of the model would be based only on the trees that were not trained on this particular instance. This feature has been requested multiple times in earlier issues and discussions.
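For context, here is a minimal sketch of how random forest mode is configured with today's API (the data and parameter values are illustrative assumptions, not recommendations):

```python
import numpy as np
import lightgbm as lgb

# Toy data, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(size=1000)

# Random forest mode: 'rf' boosting requires row subsampling,
# i.e. bagging_fraction < 1 together with bagging_freq >= 1.
params = {
    "objective": "regression",
    "boosting_type": "rf",
    "bagging_fraction": 0.7,
    "bagging_freq": 1,
    "feature_fraction": 0.8,
}
booster = lgb.train(params, lgb.Dataset(X, y), num_boost_round=100)
preds = booster.predict(X)  # regular predictions: every tree votes, even on in-bag rows
```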

Motivation

What are out-of-bag predictions and why are they useful?

A distinctive feature of the random forest algorithm (Breiman 2001) is the out-of-bag (OOB) method, which makes it possible to evaluate the performance of the model using the training data itself. The principle is to build, for each instance, a prediction using only the trees that were trained on a bootstrap subsample which did not contain this particular instance. This method has several practical advantages:

  • with OOB predictions, there is no need for a holdout/test set, meaning that all the data can be used to train the model; this is a real gain when working on small datasets;
  • OOB predictions can be used as an alternative to cross-validation, yielding a large speed-up in model evaluation;
  • OOB predictions of a random forest can be safely compared to the target on the training data (for instance to detect anomalies by comparing the target with the RF prediction).
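To make the principle concrete, here is a minimal sketch of the OOB computation itself, independent of any library (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def oob_predict(per_tree_preds, bootstrap_indices, n_samples):
    """Average, for each instance, the predictions of the trees whose
    bootstrap sample did NOT include that instance.

    per_tree_preds:    array of shape (n_trees, n_samples); each row is
                       one tree's predictions on the full training data.
    bootstrap_indices: list of index arrays, one per tree, giving the
                       rows used to train that tree.
    """
    n_trees = len(bootstrap_indices)
    in_bag = np.zeros((n_trees, n_samples), dtype=bool)
    for t, idx in enumerate(bootstrap_indices):
        in_bag[t, idx] = True
    out_of_bag = ~in_bag
    counts = out_of_bag.sum(axis=0)  # number of trees for which each row is OOB
    sums = np.where(out_of_bag, per_tree_preds, 0.0).sum(axis=0)
    # NaN for instances that ended up in every tree's sample
    return np.divide(sums, counts, out=np.full(n_samples, np.nan),
                     where=counts > 0)
```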

OOB predictions are widely used by RF users and are supported in all common implementations of random forests (see for instance scikit-learn and ranger in R).
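For reference, this is what the feature looks like in scikit-learn:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print(rf.oob_score_)           # R² computed on out-of-bag predictions
print(rf.oob_prediction_[:5])  # per-instance OOB predictions
```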

Why would out-of-bag predictions be useful in LightGBM?

LightGBM is already an excellent implementation of random forests, for three reasons.

  • All defining features of random forests are already implemented in LightGBM (building trees independently instead of sequentially, row subsampling, feature subsampling at the node level); the only exception is sampling with replacement, which does not matter much in practice.
  • LightGBM in RF mode is clearly more convenient than the common dedicated implementations of random forests, which lack many of LightGBM's nice features (histogram-based learning, native support for categorical variables, depth-wise tree building, GOSS, custom loss functions, ...).
  • Most importantly, LightGBM is much faster than other implementations of RF. Given that random forests typically rely on complex trees (fully grown, unpruned trees) rather than shallow ones, this speed advantage matters even more for random forests than for gradient boosting.

Unfortunately, one key feature is still missing: OOB predictions. I think this gap prevents LightGBM from becoming the default choice for random forest users.

Note: first, my point is not about predictive performance, so the usual argument that “gbdt is better than rf” is not relevant here. Second, the previously suggested workaround (basically using a train/validation split) would not address my concern, because it does not offer the first and third advantages of OOB predictions listed above.

Description

The first step towards introducing this feature could consist in keeping track of which instances were used to train each tree. I guess this could be done by storing the indices of the sampled instances as an attribute of the booster_ object.
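To illustrate the bookkeeping involved, here is a rough emulation with today's API: grow each tree as a separate one-tree model on a manually drawn subsample and record the indices ourselves. This is only an approximation of native RF mode (the setup is hypothetical, and it reuses the oob_predict helper sketched above):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(size=1000)

n_trees, frac = 50, 0.7
n = len(y)
models, bootstrap_indices = [], []
for _ in range(n_trees):
    # Manual row subsampling without replacement, with indices recorded --
    # exactly the bookkeeping a native implementation would do internally.
    idx = rng.choice(n, size=int(frac * n), replace=False)
    bootstrap_indices.append(idx)
    m = lgb.LGBMRegressor(n_estimators=1, min_child_samples=5)
    m.fit(X[idx], y[idx])
    models.append(m)

per_tree_preds = np.stack([m.predict(X) for m in models])
oob = oob_predict(per_tree_preds, bootstrap_indices, n)  # helper sketched above
```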

The second step could consist in implementing the OOB prediction itself. An option oob_prediction=False could be added to the predict() method of LightGBM models (raising an error if boosting_type is not 'rf'). I'm not familiar with the underlying C++ implementation of prediction in LightGBM, but I understand from older comments that some work may already have been done in this direction.
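From the user's side, the proposed (purely hypothetical, not implemented) API could look like this:

```python
import lightgbm as lgb

# Hypothetical future API -- the oob_prediction argument does not exist yet.
model = lgb.LGBMRegressor(
    boosting_type="rf",
    subsample=0.7,     # alias of bagging_fraction
    subsample_freq=1,  # alias of bagging_freq
)
model.fit(X, y)

# Each instance would be scored only by the trees that did not see it.
oob_preds = model.predict(X, oob_prediction=True)
```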

@oliviermeslin oliviermeslin changed the title Implement the out of bag prediction in Random Forest mode Implement out of bag predictions in Random Forest mode Jan 29, 2025