
Single threaded split #85

Open
wants to merge 3 commits into master
Conversation

@maartenbreddels (Contributor) commented Dec 18, 2018

(continues from #79, will rebase after merge)

Implement what is discussed in #83.

TODO:

  • docstrings

@codecov (bot) commented Dec 18, 2018

Codecov Report

Merging #85 into master will decrease coverage by 0.21%.
The diff coverage is 86.66%.


@@            Coverage Diff             @@
##           master      #85      +/-   ##
==========================================
- Coverage   96.96%   96.74%   -0.22%     
==========================================
  Files          10       10              
  Lines        1053     1076      +23     
==========================================
+ Hits         1021     1041      +20     
- Misses         32       35       +3
Impacted Files Coverage Δ
pygbm/grower.py 92.89% <100%> (+0.07%) ⬆️
pygbm/gradient_boosting.py 96.86% <75%> (+0.01%) ⬆️
pygbm/splitting.py 97.59% <86.95%> (-1.35%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 686495b...e84cdc3.

@NicolasHug (Collaborator) left a comment

Looks good in general! Thanks!

Made a few comments.

Waiting for @ogrisel's input, but maybe it would be interesting to add a new benchmark script comparing memory usage with parallel_splitting=True and parallel_splitting=False.

def split_indices_single_thread(context, split_info, sample_indices):
"""Split samples into left and right arrays.

This implementation requires less memory than the parallel version.
Collaborator:

Maybe add a comment about using Hoare's partition scheme, just for reference

Contributor Author:

OK, I didn't know it had a name. I'm fine with adding it, but did you check that it is effectively the same algorithm?
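For reference, a self-contained sketch of Hoare-style two-pointer partitioning, matching the swap / `i += 1` / `j -= 1` pattern in the diff. `goes_left` is a hypothetical boolean array standing in for the real left/right decision (which compares a sample's binned feature value to the split threshold):

```python
def hoare_split(sample_indices, goes_left):
    # Two pointers scan inward from both ends; misplaced pairs are swapped.
    i, j = 0, len(sample_indices) - 1
    while i <= j:
        # advance i past samples already on the correct (left) side
        while i <= j and goes_left[i]:
            i += 1
        # retreat j past samples already on the correct (right) side
        while i <= j and not goes_left[j]:
            j -= 1
        if i < j:
            # both are misplaced: swap them (and their predicate values,
            # since goes_left is indexed positionally in this sketch)
            sample_indices[i], sample_indices[j] = \
                sample_indices[j], sample_indices[i]
            goes_left[i], goes_left[j] = goes_left[j], goes_left[i]
            i += 1
            j -= 1
    # i is the boundary: everything before it goes to the left child
    return sample_indices[:i], sample_indices[i:]
```

Like Hoare's scheme in quicksort, this does at most one swap per misplaced pair and needs no auxiliary array, which is where the memory saving over the parallel version comes from.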

Collaborator:

----------
context : SplittingContext
The splitting context
split_ingo : SplitInfo
Collaborator:

split_info (my fault ^^)
Can you also fix it at the other occurrence, please?

i += 1
j -= 1
return (sample_indices[:i],
sample_indices[i:])
Collaborator:

return sample_indices[:i], sample_indices[i:] ?

pygbm/grower.py Outdated
@@ -336,6 +339,7 @@ def split_next(self):
node = heappop(self.splittable_nodes)

tic = time()
split_indices = split_indices_parallel if self.parallel_splitting else split_indices_single_thread
Collaborator:

I think we should fork in SplittingContext instead of in the grower. We probably don't need grower.parallel_splitting.

That is, call SplittingContext.split_indice() in the grower, and in splitting.py do:

def split_indice(...)
   if self.parallel_splitting:
       return ...
   else:
       return ...
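A runnable sketch of that dispatch, with the two module-level helpers stubbed out (in pygbm they are the numba-compiled `split_indices_parallel` and `_split_indices_single_threaded` functions, whose real signatures take more arguments):

```python
# Hypothetical stand-ins for the real module-level split functions:
def split_indices_parallel(context, split_info, sample_indices):
    return 'parallel'  # stub

def _split_indices_single_threaded(context, split_info, sample_indices):
    return 'single'  # stub

class SplittingContext:
    def __init__(self, parallel_splitting=True):
        self.parallel_splitting = parallel_splitting

    def split_indices(self, split_info, sample_indices):
        # Fork here instead of in the grower: the grower only ever calls
        # context.split_indices(...) and no longer needs its own
        # parallel_splitting attribute.
        if self.parallel_splitting:
            return split_indices_parallel(self, split_info, sample_indices)
        return _split_indices_single_threaded(self, split_info,
                                              sample_indices)
```

This keeps the choice of splitting strategy encapsulated in splitting.py, so the grower code is identical for both paths.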

@@ -287,6 +288,12 @@ def test_split_indices():
assert samples_left.shape[0] == si_root.n_samples_left
assert samples_right.shape[0] == si_root.n_samples_right

samples_left_single_thread, samples_right_single_thread = split_indices_single_thread(
Collaborator:

Maybe instead parametrize this test with parallel_split=True and parallel_split=False, so that both methods go through exactly the same tests (though I agree your current test also covers this implicitly)
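A sketch of that parametrization with pytest. The partition function here is a toy stand-in (the real test builds a full SplittingContext); the point is that both code paths run through exactly the same assertions:

```python
import pytest

def split_indices(sample_indices, goes_left, parallel):
    # Toy partition; the `parallel` flag only mirrors the real signature,
    # both paths behave identically in this sketch.
    left = [s for s, g in zip(sample_indices, goes_left) if g]
    right = [s for s, g in zip(sample_indices, goes_left) if not g]
    return left, right

@pytest.mark.parametrize('parallel_splitting', [True, False])
def test_split_indices(parallel_splitting):
    idx = list(range(8))
    goes_left = [i % 2 == 0 for i in idx]
    left, right = split_indices(idx, goes_left, parallel_splitting)
    # every sample ends up on exactly one side
    assert sorted(left + right) == idx
    assert len(left) == 4 and len(right) == 4
```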

@ogrisel (Owner) left a comment

Some more comments on top of @NicolasHug's review.

@@ -85,7 +85,7 @@ def load_data():
max_leaf_nodes=n_leaf_nodes,
n_iter_no_change=None,
random_state=0,
verbose=1)
verbose=1, parallel_splitting=False)
Owner:

Maybe you can expose this as a command-line flag to make it easy to profile with mprof run benchmarks/bench_higgs_boson.py --disable-parallel-splitting (using memory_profiler).
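A sketch of that flag with argparse (the flag name comes from the comment; wiring it into bench_higgs_boson.py this way is an assumption):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--disable-parallel-splitting', action='store_true',
                    help='use the single-threaded, lower-memory split')
# an explicit empty argv here only for illustration; the real script
# would parse sys.argv
args = parser.parse_args([])
parallel_splitting = not args.disable_parallel_splitting
```

`parallel_splitting` can then be passed straight to the estimator, so `mprof run benchmarks/bench_higgs_boson.py --disable-parallel-splitting` measures the single-threaded path.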

pygbm/splitting.py (Outdated, resolved)
@@ -304,6 +312,31 @@ def split_indices(context, split_info, sample_indices):
return (sample_indices[:right_child_position],
sample_indices[right_child_position:])

Owner:

PEP8: you need 2 blank lines to separate top level functions.

@NicolasHug (Collaborator) left a comment

A few more comments.

Also please see the pep8 issues on travis, or locally run

flake8 pygbm tests examples benchmarks

@@ -304,6 +312,31 @@ def split_indices(context, split_info, sample_indices):
return (sample_indices[:right_child_position],
sample_indices[right_child_position:])

@njit(parallel=False)
Collaborator:

You can just use @njit; parallelism is off by default.

Suggested change
@njit(parallel=False)
@njit

sample_indices[i], sample_indices[j] = sample_indices[j], sample_indices[i]
i += 1
j -= 1
return (sample_indices[:i], sample_indices[i:])
Collaborator:

No need for the parentheses here.

@@ -261,7 +262,7 @@ def test_split_indices():
n_bins_per_feature,
all_gradients, all_hessians,
l2_regularization, min_hessian_to_split,
min_samples_leaf, min_gain_to_split)
min_samples_leaf, min_gain_to_split, True)
Collaborator:

Please parametrize this test with True/False instead and remove the checks below

@@ -304,6 +312,31 @@ def split_indices(context, split_info, sample_indices):
return (sample_indices[:right_child_position],
sample_indices[right_child_position:])

@njit(parallel=False)
def _split_indices_single_threaded(context, split_info, sample_indices):
Collaborator:

Suggested change
def _split_indices_single_threaded(context, split_info, sample_indices):
def _split_indices_single_threaded(context, split_info, sample_indices):
# single-threaded partition into left and right arrays (Hoare's partition scheme)
