Enhance binning strategy #39
Here is another example, this time with pre-binned data. I can't explain why the left root child has a different split gain: when I print the split gain values considered by LightGBM, no split gain is equal to the one that pygbm finds. The discrepancy may not come from the actual binning strategy here, but could be due to how the bins are treated afterwards: some of them may not be considered, or may be merged, I don't know.

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from pygbm import GradientBoostingMachine
from lightgbm import LGBMClassifier
from pygbm.plotting import plot_tree
from pygbm.binning import BinMapper
import numpy as np

rng = np.random.RandomState(seed=2)

n_leaf_nodes = 4
n_trees = 1
lr = 1.
min_samples_leaf = 1
max_bins = 255
n_samples = 100

X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0) & (X[:, 1] > .5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

# Pre-bin the training data so that both models receive the same input
X_train = BinMapper().fit_transform(X_train)

pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)

plot_tree(pygbm_model, lightgbm_model, view=True)
```
It's not just the split gain that is different on the left root child: it's also not splitting on the same feature.
OK, I made some small progress on this. I still don't know the details of LightGBM's binning, but I can explain the two previous comments.

For the first comment (#39 (comment)), it looks like LightGBM forces 0 to be an allowed bin threshold. Do we want to do such a thing as well? For the binning threshold, something like `midpoints = np.insert(midpoints, np.searchsorted(midpoints, 0), 0)` would do it, but that would give one more threshold than `max_bins` allows.

For my second comment (#39 (comment)), the discrepancy comes from LightGBM's `min_data_in_bin` parameter (3 by default), which merges bins that contain too few samples.

Side note: when debugging, it's helpful to set […].
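For illustration, here is a minimal, self-contained sketch of the threshold trick above; the `midpoints` array is made up and merely stands in for the sorted thresholds that `BinMapper` would compute:

```python
import numpy as np

# Hypothetical candidate thresholds, standing in for the sorted
# midpoints computed by pygbm's BinMapper (0 is not among them):
midpoints = np.array([-1.3, -0.4, 0.7, 2.1])

# Insert 0 at the position that keeps the array sorted, so that 0
# becomes an allowed split threshold (what LightGBM seems to enforce):
midpoints = np.insert(midpoints, np.searchsorted(midpoints, 0), 0)
print(midpoints)  # [-1.3 -0.4  0.   0.7  2.1] -- one more entry than before
```

As for the second point, passing `min_data_in_bin=1` to the `LGBMClassifier` in the script above should disable the bin-merging behaviour and make the comparison apples-to-apples.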
Maybe we should ask the LightGBM developers to explain why this is useful.
+1, and we can re-enable it the day we implement feature bundling (hopefully).
Nice catch.
Results are comparable to LightGBM when `n_samples <= n_bins`, because both libraries are using the actual feature values as bin thresholds. This is not the case anymore when `n_samples > n_bins`. In particular, on this very easy dataset (target `= X[:, 0] > 0`), LightGBM finds a perfect threshold of `1e-35` while that of pygbm is `-0.262`. This leads to different trees and less accurate predictions (1 vs .9).