[WIP] Do not store histogram on SplitInfo #36
Conversation
At this time some tests are broken (they need to be updated), but the memory leak issue should be fixed.
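(For context, a minimal sketch of the shape such a fix could take, assuming the leak was tied to storing numpy histogram arrays as jitclass attributes; all names below are hypothetical and not pygbm's actual code.)

```python
# Hypothetical sketch, not pygbm's actual implementation: SplitInfo keeps
# only scalar fields, and the histograms array stays owned by the caller
# instead of being attached to a jitclass instance.
import numpy as np
from numba import njit, jitclass, float32, uint32


@jitclass([
    ('gain', float32),
    ('feature_idx', uint32),
    ('bin_idx', uint32),
])
class SplitInfo:
    # Scalar split metadata only; no numpy array stored on the instance.
    def __init__(self, gain, feature_idx, bin_idx):
        self.gain = gain
        self.feature_idx = feature_idx
        self.bin_idx = bin_idx


@njit
def find_best_split(histograms):
    # histograms: float32 array of shape (n_features, n_bins), owned by
    # the caller; only the winning (feature, bin) metadata is returned.
    best_gain = -1.0
    best_feature = 0
    best_bin = 0
    for f in range(histograms.shape[0]):
        for b in range(histograms.shape[1]):
            gain = histograms[f, b]  # placeholder for the real gain formula
            if gain > best_gain:
                best_gain = gain
                best_feature = f
                best_bin = b
    return SplitInfo(best_gain, best_feature, best_bin)
```

With this shape, short-lived histogram arrays can be freed by the caller as soon as they are no longer needed, rather than staying alive through references held by long-lived SplitInfo instances.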
The fact that we observe a slowdown when we update the packed data structure suggests that false sharing might be at play.
@NicolasHug noted that false sharing might not be a problem in a write-only data structure updated in a parallel for loop. I don't know. What I observe, though, is that on a 12-core machine the code runs 2x faster with the tbb threading backend.
Actually I tried again with master and the various threading backends, and this PR is either as fast as or faster than master. I must have done something wrong when I reported the initial slowdown. In any case, tbb is significantly faster than the workqueue backend when the number of cores is large (e.g. 12 in my case).
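(For reference, a minimal sketch, not pygbm code, of selecting the numba threading layer and of the write-only parallel update pattern being discussed; the function name is hypothetical, and `THREADING_LAYER` must be set before the first parallel function runs.)

```python
import numpy as np
import numba
from numba import njit, prange

# Choose the threading backend before any parallel function is run;
# valid values include 'tbb', 'omp' and 'workqueue'.
numba.config.THREADING_LAYER = 'tbb'


@njit(parallel=True)
def fill_packed(out):
    # Write-only access pattern: each iteration writes only its own slot,
    # so threads never read shared entries, but neighbouring slots can
    # still sit on the same cache line (the false-sharing concern).
    for i in prange(out.shape[0]):
        out[i] = 2 * i


packed = np.zeros(1_000_000, dtype=np.int64)
fill_packed(packed)
print(numba.threading_layer())  # reports the backend actually used
```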
@NicolasHug I have to go, feel free to update the failing tests and merge this PR.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master      #36      +/-   ##
==========================================
+ Coverage   94.34%   94.36%   +0.02%
==========================================
  Files           8        8
  Lines         778      781       +3
==========================================
+ Hits          734      737       +3
  Misses         44       44
```

Continue to review full report at Codecov.
Here are the same plots from #31 (comment) now. I don't know how I feel about the 1e7 case. We now have the following results with the benchmark from #30 (comment) (numba is pre-compiled here for fair comparison): Laptop with 8 GB RAM, i5 7th gen. LightGBM: 75.408s, ROC AUC: 0.8293 😄

Code:

```python
from urllib.request import urlretrieve
import os
from gzip import GzipFile
from time import time
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from joblib import Memory
from pygbm import GradientBoostingMachine
from lightgbm import LGBMRegressor
import numba
import gc

HERE = os.path.dirname(__file__)
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00280/"
       "HIGGS.csv.gz")
m = Memory(location='/tmp', mmap_mode='r')


@m.cache
def load_data():
    filename = os.path.join(HERE, URL.rsplit('/', 1)[-1])
    if not os.path.exists(filename):
        print(f"Downloading {URL} to {filename} (2.6 GB)...")
        urlretrieve(URL, filename)
        print("done.")
    print(f"Parsing {filename}...")
    tic = time()
    with GzipFile(filename) as f:
        df = pd.read_csv(f, header=None, dtype=np.float32)
    toc = time()
    print(f"Loaded {df.values.nbytes / 1e9:0.3f} GB in {toc - tic:0.3f}s")
    return df


df = load_data()

n_leaf_nodes = 255
n_trees = 500
lr = 0.05
max_bins = 255
subsample = 1000000  # Change this to 10000000 if you wish, or to None

target = df.values[:, 0]
data = np.ascontiguousarray(df.values[:, 1:])
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=50000, random_state=0)
if subsample is not None:
    data_train, target_train = data_train[:subsample], target_train[:subsample]

n_samples, n_features = data_train.shape
print(f"Training set with {n_samples} records with {n_features} features.")

gc.collect()
print("Compiling pygbm...")
tic = time()
pygbm_model = GradientBoostingMachine(learning_rate=lr, max_iter=n_trees,
                                      max_bins=max_bins,
                                      max_leaf_nodes=n_leaf_nodes,
                                      random_state=0, scoring=None,
                                      verbose=0, validation_split=None)
# Fit on a small slice first to trigger numba JIT compilation, so that
# compilation time is not counted in the benchmark below.
pygbm_model.fit(data_train[:100], target_train[:100])
toc = time()
predicted_test = pygbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")
del pygbm_model
del predicted_test

print("Fitting a LightGBM model...")
tic = time()
lightgbm_model = LGBMRegressor(n_estimators=n_trees, num_leaves=n_leaf_nodes,
                               learning_rate=lr, silent=False)
lightgbm_model.fit(data_train, target_train)
toc = time()
predicted_test = lightgbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")
del lightgbm_model
del predicted_test
gc.collect()

print("Fitting a pygbm model...")
tic = time()
pygbm_model = GradientBoostingMachine(learning_rate=lr, max_iter=n_trees,
                                      max_bins=max_bins,
                                      max_leaf_nodes=n_leaf_nodes,
                                      random_state=0, scoring=None,
                                      verbose=1, validation_split=None)
pygbm_model.fit(data_train, target_train)
toc = time()
predicted_test = pygbm_model.predict(data_test)
roc_auc = roc_auc_score(target_test, predicted_test)
print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}")
del pygbm_model
del predicted_test
gc.collect()

if hasattr(numba, 'threading_layer'):
    # Report which threading layer numba actually used (tbb, omp, workqueue)
    print(f"Threading layer used: {numba.threading_layer()}")
```
Log:
Also, I tried to reproduce the leak with a minimal example and this is what I came up with. It's pretty weird, and I'd like to know if you can reproduce it before submitting it to numba, because I feel like I'm tripping:

```python
import numpy as np
import psutil
from numba import (njit, jitclass, prange, float32, uint8, uint32, typeof,
                   optional)


@jitclass([
    ('attr', uint32),
])
class JitClass:
    def __init__(self):
        self.attr = 3


@njit
def f():
    cs = [JitClass() for i in range(1000)]
    array = np.empty(shape=10, dtype=np.uint32)
    # If I remove this loop, there is no leak
    for i, c in enumerate(cs):
        c.attr[i] = array[i]  # this should not even pass!
        # c.attr = array[i]  # <-- leak still here if we do this instead.

        # this should not work either!
        something_that_should_not_compile
        blahblahblah
        why_is_this_passing

        # a = a + 3 <-- this produces an error, as expected (a is not defined)
    return array


class C:
    def g(self):
        self.array = f()


p = psutil.Process()
for _ in range(10000):
    o = C()
    o.g()
    del o
    # leak proportional to the size of cs, independent of the size of array
print(f"{p.memory_info().rss / 1e6} MB") @ogrisel I'm happy to merge as is, but I'd like your input on the 1e7 case first. |
Cool, let's merge, as it's already a net improvement. About the minimal reproduction case: I confirm I get the leak with your code, without any error message or exception, just memory usage increasing as reported by psutil.
We still have a discrepancy in results compared to LightGBM, but there is another issue for that.
I have opened numba/numba#3473 and numba/numba#3472 regarding the leak and some other weird stuff I found.
Tentative fix for #31.