Summary

LightGBM could support out-of-bag predictions when used in random forest mode (boosting_type = 'rf' and bagging_fraction < 1). This means that, for a given instance, the prediction of the model would be based only on the trees that were not trained using this particular instance. This feature has been mentioned multiple times.
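For concreteness, here is roughly what this could look like from the user's side. The oob_prediction argument in the last line is the proposed addition and does not exist today; the other parameters are just the usual way of putting LightGBM into RF mode:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(size=1000)

# Random forest mode: trees are built independently on row subsamples,
# so for every tree some rows are left out of the bag.
params = {
    "objective": "regression",
    "boosting_type": "rf",
    "bagging_freq": 1,        # resample rows for every tree
    "bagging_fraction": 0.7,  # each tree sees 70% of the rows
    "feature_fraction": 0.8,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=500)

# Proposed API (does not exist yet): for each training row, average only
# the trees that did NOT see that row.
oob_pred = booster.predict(X, oob_prediction=True)
```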
Motivation

What are out-of-bag predictions and why are they useful?
A specific feature of the random forest algorithm (Breiman 2001) is the out-of-bag (OOB) method: it makes it possible to evaluate the performance of the model using only the training data. The principle is to build, for each instance, a prediction using only the trees that were trained on a bootstrap subsample that did not contain this particular instance. This method has several practical advantages:
1. With OOB predictions, there is no need for a holdout/test set, meaning that all the data can be used to train the model; this is a real gain when working on small datasets.
2. OOB predictions can be used as an alternative to cross-validation, implying a large speed-up in training.
3. OOB predictions of a random forest can be safely compared to the target on the training data (for instance if you want to detect anomalies by comparing the target with an RF prediction).
OOB predictions are widely used by RF users and are supported in all common implementations of random forests (see for instance scikit-learn in Python and ranger in R).
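To make the OOB principle concrete, here is a minimal, framework-agnostic sketch of how OOB predictions are assembled from per-tree predictions and per-tree bagging indices (the names tree_predictions and in_bag_indices are illustrative, not any existing API):

```python
import numpy as np

def oob_predictions(tree_predictions, in_bag_indices, n_samples):
    """Average, for each row, the predictions of the trees that did not see it.

    tree_predictions: list of arrays of shape (n_samples,), one per tree
    in_bag_indices:   list of arrays with the row indices used to train each tree
    """
    pred_sum = np.zeros(n_samples)
    n_oob_trees = np.zeros(n_samples)
    for tree_pred, in_bag in zip(tree_predictions, in_bag_indices):
        oob = np.ones(n_samples, dtype=bool)
        oob[in_bag] = False              # rows the tree was trained on are excluded
        pred_sum[oob] += tree_pred[oob]
        n_oob_trees[oob] += 1
    out = np.full(n_samples, np.nan)     # rows seen by every tree get no OOB value
    has_oob = n_oob_trees > 0
    out[has_oob] = pred_sum[has_oob] / n_oob_trees[has_oob]
    return out
```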
Why would out-of-bag predictions be useful in LightGBM?
LightGBM is already an excellent implementation of random forests, for three reasons:

1. All defining features of random forests have already been implemented in LightGBM (building trees independently instead of sequentially, row subsampling, feature subsampling at the node level), except sampling with replacement, which does not matter much in practice.
2. LightGBM in RF mode is clearly more convenient than the common implementations of random forests, because those implementations do not have all the nice features of LightGBM (histogram-based learning, native support of categorical variables, depth-wise tree building, GOSS, custom loss functions...).
3. Most importantly, LightGBM is much faster than other implementations of RF. Given that RF typically relies on complex trees (fully developed, unpruned trees) instead of shallow trees, this speed advantage is even more important for random forests than for gradient boosting.
Unfortunately, one crucial feature is still missing: OOB predictions. I think this prevents LightGBM from becoming the default choice for random forest users.
Note: first, my point is not about performance, so the usual argument “gbdt is better than rf” is not relevant here. Second, the solution proposed in this issue (basically using a train/validation split) would not address my concern, because it does not offer advantages 1 and 3 of OOB predictions listed above.
Description
The first step towards introducing this feature could consist in keeping track of which instances were used to train each tree. I guess this could be done by storing the indices of sampled instances as an attribute of the booster_ object.
The second step could consist in implementing the OOB prediction itself. An option oob_prediction=False could be added to the predict() method of LightGBM models (raising an error if boosting_type is not rf). I’m not familiar with the underlying C++ implementation of prediction in LightGBM, but I understand from older comments that some work might have been done in this direction.
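To illustrate how the two steps could fit together at the Python level, here is a rough sketch that reuses the oob_predictions helper and the RF-mode booster from the snippets above. Both the bagging_indices_ attribute and the assumption that, in RF mode, predict(start_iteration=i, num_iteration=1) returns the output of tree i alone are hypothetical:

```python
# Step 1 (hypothetical): the training routine stores, for each tree,
# the indices of the rows it was trained on.
in_bag_indices = booster.bagging_indices_  # does not exist today

# Step 2: per-tree predictions on the training data, assuming that restricting
# prediction to a single iteration yields that tree's own output in RF mode.
tree_predictions = [
    booster.predict(X, start_iteration=i, num_iteration=1)
    for i in range(booster.num_trees())
]

# Combine with the OOB aggregation sketched in the Motivation section.
oob_pred = oob_predictions(tree_predictions, in_bag_indices, n_samples=X.shape[0])
```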