Benchmark results with better parameters #30
Indeed I observed the memory usage issue. We will need to investigate why this is the case. The grower object is expected to be big because we store the sample indices in the nodes, but this should not be the case for the predictor objects. When the number of trees is increased, we only accumulate the predictor objects in a list. The grower objects should be garbage collected as we progress; maybe we do not collect them correctly for some reason. I had not realized performance would degrade with more trees and a larger number of leaf nodes. |
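For reference, here is a minimal, self-contained sketch (using a hypothetical stand-in class, not the actual pygbm grower) of how one could check whether grower-like objects really are garbage collected once only the predictors are kept around:

```python
# Hypothetical sketch: verify that heavy "grower" objects are collected once
# only lightweight predictors are kept. The Grower class below is a stand-in,
# not pygbm's real grower.
import gc
import weakref


class Grower:
    """Stand-in for a grower holding per-node sample indices (heavy)."""
    def __init__(self, n_samples):
        self.sample_indices = list(range(n_samples))


grower_refs = []
predictors = []
for _ in range(10):
    grower = Grower(n_samples=100_000)
    grower_refs.append(weakref.ref(grower))  # weak ref does not keep it alive
    predictors.append(object())              # only the predictor is retained
    del grower                               # drop the last strong reference

gc.collect()
n_alive = sum(ref() is not None for ref in grower_refs)
print(f"{n_alive} growers still alive")  # expected: 0 if collection works
```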
Including the warm-up penalty (JIT compilation overhead) is a bit unfair because it would be possible to use numba with a compilation cache or ahead-of-time compilation, but we do not want to spend time on this while we are still developing the pygbm package. |
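As a side note, numba's on-disk compilation cache can amortize the warm-up cost across runs. A minimal sketch (the function below is only an illustrative toy, not pygbm code):

```python
# Toy example of numba's on-disk compilation cache: the first run pays the JIT
# compilation cost, subsequent runs of the same script load cached machine code.
import numpy as np
from numba import njit


@njit(cache=True)
def sum_of_squares(values):
    total = 0.0
    for v in values:
        total += v * v
    return total


print(sum_of_squares(np.arange(1_000_000, dtype=np.float64)))
```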
To get additional info on where the time is spent in LightGBM you can compile it with the following:

```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index c222221..9309026 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -54,6 +54,8 @@ if(USE_R35)
     ADD_DEFINITIONS(-DR_VER_ABOVE_35)
 endif()
 
+add_definitions(-DTIMETAG)
+
 if(USE_MPI)
     find_package(MPI REQUIRED)
     ADD_DEFINITIONS(-DUSE_MPI)
```

It should report additional information that makes it possible to compare with the verbose output of pygbm. |
On my laptop (an XPS13 from 2 years ago), with the following change to the benchmark settings:

```diff
diff --git a/benchmarks/bench_higgs_boson.py b/benchmarks/bench_higgs_boson.py
index 3631edd..0097dcd 100644
--- a/benchmarks/bench_higgs_boson.py
+++ b/benchmarks/bench_higgs_boson.py
@@ -18,10 +18,10 @@ HERE = os.path.dirname(__file__)
 URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00280/"
        "HIGGS.csv.gz")
 m = Memory(location='/tmp', mmap_mode='r')
-n_leaf_nodes = 31
-n_trees = 10
+n_leaf_nodes = 255
+n_trees = 50
 subsample = None
-lr = 1.
+lr = 0.1
 max_bins = 255
```

I get:
So LightGBM performs better than pygbm, but the difference is not as large as what you report. The compile-time overhead for pygbm (not included in the above results) is around 6-8s on my machine. |
With your hyperparameters:

```diff
diff --git a/benchmarks/bench_higgs_boson.py b/benchmarks/bench_higgs_boson.py
index 3631edd..a2b3866 100644
--- a/benchmarks/bench_higgs_boson.py
+++ b/benchmarks/bench_higgs_boson.py
@@ -18,10 +18,10 @@ HERE = os.path.dirname(__file__)
 URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00280/"
        "HIGGS.csv.gz")
 m = Memory(location='/tmp', mmap_mode='r')
-n_leaf_nodes = 31
-n_trees = 10
-subsample = None
-lr = 1.
+n_leaf_nodes = 255
+n_trees = 500
+subsample = int(1e6)
+lr = 0.05
 max_bins = 255
```

I get the following:
I only have 16GB of RAM on this laptop and the VIRT memory usage seems to be the cause of the slowdown. |
New results with 1 million and
With
That's a VERY interesting result for a small model! Results with 1 million and
By the way, I noticed pygbm does not fully saturate my 8 threads. Usually, around 65% CPU usage (cores full, hyperthreaded cores not full), which means around 25% of performance remains unused (75% of the hyperthreads are not fully exploited) => potential parallelism issue? I'll check later on my 72 thread server. |
@ogrisel It seems with |
I noticed that too. If you install numba with conda, you can set |
We might have a discrepancy in the hyperparameters that would explain the difference in AUC, but I am not sure which. We have a test that checks that we get the same trees in non-pathological cases here: https://github.com/ogrisel/pygbm/blob/master/tests/test_compare_lightgbm.py But apparently this does not hold for the Higgs boson dataset. This would also require more investigation. I suspect our handling of shrinkage / learning rate is different. |
You need to check the following equivalent hyperparameters in pygbm from LightGBM:
|
One thing I noticed before is that
In pygbm the logic is different: the parent will be split even if the children have fewer samples than min_samples_leaf:

```python
if (self.min_samples_leaf is not None
        and len(sample_indices_left) < self.min_samples_leaf):
    self._finalize_leaf(left_child_node)
else:
    self._compute_spittability(left_child_node)
if (self.min_samples_leaf is not None
        and len(sample_indices_right) < self.min_samples_leaf):
    self._finalize_leaf(right_child_node)
```

Also I seem to remember that LightGBM's |
Great work, both on the new implementation (which allows comparing the effectiveness of LLVM/code generation etc. vs. traditional implementation/compilation for GBMs) and on the benchmarking effort uncovered here in this issue. |
EDIT: nevermind I need some sleep ^^

Hmmm so:

```python
import lightgbm as lb
from sklearn.model_selection import train_test_split
import numpy as np
import pytest
from pygbm import GradientBoostingMachine
from pygbm import plotting

rng = np.random.RandomState(2)

n_samples = 100
max_leaf_nodes = 40
min_sample_leaf = 40
max_iter = 1

# data = linear target, 5 features, 3 irrelevant.
X = rng.normal(size=(n_samples, 5))
y = X[:, 0] - X[:, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

est_lightgbm = lb.LGBMRegressor(n_estimators=max_iter,
                                min_data=1, min_data_in_bin=1,
                                learning_rate=1,
                                min_sample_leaf=min_sample_leaf,
                                num_leaves=max_leaf_nodes)
est_pygbm = GradientBoostingMachine(validation_split=None)  # just train for plotting to work

est_lightgbm.fit(X_train, y_train)
est_pygbm.fit(X_train, y_train)

plotting.plot_tree(est_pygbm, est_lightgbm, view=True)
```
I get a tree with 40 leaves and leaves with 1 sample. Changing So a side effect is that I don't know if this comes from the python binding or directly from the c++ source though. |
Are you using Also, I'm curious if setting the
|
@NicolasHug LightGBM doesn't have a parameter named |
@dhirschfeld there are no nested |
@Laurae2 I merged #36 with a fix to work around a memory leak in numba. Please feel free to try again, the results should be better. I also noticed that on a many-core machine the tbb threading layer of numba gives much better performance than the workqueue backend, but LightGBM is still better at using all the cores efficiently. |
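For anyone who wants to try the same thing, here is a minimal sketch of selecting numba's threading layer in code (assuming the tbb package is installed; the parallel function is only a toy, not pygbm code):

```python
# Toy example: request the tbb threading layer before the first parallel
# compilation, then report which layer was actually used.
import numpy as np
from numba import config, njit, prange, threading_layer

config.THREADING_LAYER = 'tbb'  # alternatives include 'omp' and 'workqueue'


@njit(parallel=True)
def parallel_sum(x):
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i]
    return total


parallel_sum(np.ones(10_000))
print("threading layer used:", threading_layer())
```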
We still get a lower accuracy when the number of trees is large and the learning rate is small. This discrepancy is tracked in #32. |
We also merged #37, which makes it possible to customize the benchmark parameters from the command line. For instance:

```
$ python benchmarks/bench_higgs_boson.py --n-trees 500 --learning-rate 0.1 --n-leaf-nodes 255
```

This gives the following results (on a workstation with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 sockets with 12 cores each, which means 48 hyperthreads in total):
|
Here is another run on the same machine with a single core to check the scalability w.r.t. the number of threads:

```
NUMBA_NUM_THREADS=1 OMP_NUM_THREADS=1 python benchmarks/bench_higgs_boson.py --n-trees 100 --learning-rate 0.1 --n-leaf-nodes 255
```
So even with a single thread, pygbm is slightly slower, and the AUC is slightly lower (tracked in #32). |
Regarding speed, there are still some places where we can parallelize the code, for example when computing the new gradients and hessians. |
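For instance, a minimal sketch (hypothetical code, not pygbm's actual loss module) of parallelizing the per-sample gradient update with numba's prange, for a least-squares loss where the gradient is simply prediction minus target and the hessian is constant:

```python
# Toy example of parallelizing the gradient update with numba's prange.
import numpy as np
from numba import njit, prange


@njit(parallel=True)
def update_gradients_least_squares(gradients, y_true, raw_predictions):
    # For least squares, gradient_i = prediction_i - target_i (hessian is 1).
    for i in prange(gradients.shape[0]):
        gradients[i] = raw_predictions[i] - y_true[i]


y_true = np.random.rand(10_000).astype(np.float32)
raw_predictions = np.random.rand(10_000).astype(np.float32)
gradients = np.empty_like(y_true)
update_gradients_least_squares(gradients, y_true, raw_predictions)
```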
I will open an issue dedicated to the thread scalability with more details. Edit: here it is: #38. |
Used a laptop for a better demo benchmark:
Setup for the proper benchmarking:
The benchmark in the master branch (https://github.com/ogrisel/pygbm/blob/master/benchmarks/bench_higgs_boson.py) is way too short and doesn't really test the speed of the whole model because it runs so fast: there are diminishing returns as the number of iterations increases, and this is what is difficult to optimize once the tree construction itself is already optimized.
Results:
Slower as more trees are added over time.
Conclusion:
To run the benchmark, one can use the following for a clean setup. It is not optimized for the fastest performance, but it gives you the pre-requisites (scikit-learn 0.20, numba 0.39):
Before installing pygbm, change the following in line 147 of pygbm/grower.py (https://github.com/ogrisel/pygbm/blob/master/pygbm/grower.py#L146-L147):
to:
This avoids the infamous divide-by-zero error.
Then, one can run the following:
If you have a slow Internet connection, download the HIGGS dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00280/ and then uncompress it.
Then, you may run a proper benchmark using the following (make sure to change the load_path to your HIGGS csv file):

If something is missing in the script, please let me know.