Decouple Boosting Types (fixes #3128) #4827

Merged on Dec 28, 2022 (105 commits)
Changes from 40 commits
Commits
2ac5786
add parameter data_sample_strategy
GuangdaLiu Nov 5, 2021
590aec6
abstract GOSS as a sample strategy (GOSS1), together with original GOSS …
GuangdaLiu Nov 9, 2021
c8dce4d
abstract Bagging as a subclass (BAGGING), but original Bagging member…
GuangdaLiu Nov 12, 2021
dd40531
fix some variables
GuangdaLiu Nov 12, 2021
4b6095d
remove GOSS(as boost) and Bagging logic in GBDT
GuangdaLiu Nov 12, 2021
2acb230
rename GOSS1 to GOSS(as sample strategy)
GuangdaLiu Nov 12, 2021
8b25d65
add warning about using GOSS as boosting_type
GuangdaLiu Nov 12, 2021
05a8d15
a little ; bug
GuangdaLiu Nov 12, 2021
6f9c8cc
remove CHECK when "gradients != nullptr"
GuangdaLiu Nov 15, 2021
80c4f70
rename DataSampleStrategy to avoid confusion
GuangdaLiu Dec 5, 2021
8103d81
remove and add some comments, following convention
GuangdaLiu Dec 5, 2021
94a17ee
fix bug about GBDT::ResetConfig (ObjectiveFunction inconsistency bet…
GuangdaLiu Dec 5, 2021
f000f0a
add std::ignore to avoid compiler warnings (and potential failures)
GuangdaLiu Dec 7, 2021
0ca5cb1
update Makevars and vcxproj
shiyu1994 Dec 8, 2021
2a58353
handle constant hessian
shiyu1994 Dec 8, 2021
8775c05
mark override for IsHessianChange
shiyu1994 Dec 8, 2021
1e888ef
fix lint errors
shiyu1994 Dec 8, 2021
22ad1c8
rerun parameter_generator.py
shiyu1994 Dec 8, 2021
f72e0c4
Merge remote-tracking branch 'LightGBM/master' into decouple
shiyu1994 Dec 8, 2021
e64ad6f
update config_auto.cpp
shiyu1994 Dec 8, 2021
8dec630
delete redundant blank line
shiyu1994 Dec 8, 2021
aa63de8
update num_data_ when train_data_ is updated
shiyu1994 Dec 8, 2021
6405361
check bagging_freq is not zero
shiyu1994 Dec 8, 2021
4d6362a
reset config_ value
shiyu1994 Dec 8, 2021
21ee487
remove useless check
shiyu1994 Dec 8, 2021
634fab4
add tests in test_engine.py
GuangdaLiu Dec 10, 2021
a68fc25
remove whitespace in blank line
GuangdaLiu Dec 11, 2021
7b957c4
Merge remote-tracking branch 'LightGBM/master' into decouple
shiyu1994 Jan 7, 2022
ac387b3
remove arguments verbose_eval and evals_result
shiyu1994 Jan 7, 2022
6e94059
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
0fe6dc8
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
ab39d21
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
9978c3c
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
7ba1750
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
ecaaabe
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
c1f1b91
Update src/boosting/sample_strategy.cpp
GuangdaLiu Jan 11, 2022
006de87
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 11, 2022
20ddcb4
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 15, 2022
73d7db7
Modify warning about using goss as boosting type
GuangdaLiu Jan 15, 2022
beaaf19
Update tests/python_package_test/test_engine.py
GuangdaLiu Jan 18, 2022
99b069f
merge LightGBM/master
shiyu1994 Mar 11, 2022
1dbbee4
Update src/boosting/sample_strategy.cpp
shiyu1994 Mar 15, 2022
cddfcd6
remove goss from boosting types in documentation
shiyu1994 Mar 15, 2022
cd08e39
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Mar 15, 2022
df523f3
Update src/boosting/bagging.hpp
shiyu1994 Mar 15, 2022
85e7fd1
Update src/boosting/bagging.hpp
shiyu1994 Mar 15, 2022
efb5e28
Update src/boosting/goss.hpp
shiyu1994 Mar 15, 2022
beb9f8c
Update src/boosting/goss.hpp
shiyu1994 Mar 15, 2022
4bdcdd5
rename GOSS with GOSSStrategy
shiyu1994 Mar 19, 2022
3291d7e
update doc
shiyu1994 Mar 19, 2022
2b5a9b6
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Mar 19, 2022
93a8762
address comments
shiyu1994 Mar 19, 2022
7e1167a
fix table in doc
shiyu1994 Mar 19, 2022
a1b6bd1
Update include/LightGBM/config.h
shiyu1994 Mar 21, 2022
4499113
update documentation
shiyu1994 Mar 21, 2022
3a2235e
update test case
shiyu1994 Mar 21, 2022
1e4c11a
revert useless change in test_engine.py
shiyu1994 Mar 21, 2022
72f5e1b
merge LightGBM/master
shiyu1994 Jun 7, 2022
e72fb01
add tests for evaluation results in test_sample_strategy_with_boosting
shiyu1994 Jun 7, 2022
05292ff
include <string>
shiyu1994 Jun 9, 2022
6ec7812
change to assert_allclose in test_goss_boosting_and_strategy_equivalent
shiyu1994 Jun 9, 2022
9f749fa
more tolerance in result checking, due to minor difference in results…
shiyu1994 Jun 9, 2022
808ccc6
change == to np.testing.assert_allclose
shiyu1994 Jun 9, 2022
35f4eb5
fix test case
shiyu1994 Jun 13, 2022
7fe6a94
set gpu_use_dp to true
shiyu1994 Jul 27, 2022
7f10818
change --report to --report-level for rstcheck
shiyu1994 Jul 27, 2022
8033655
Merge branch 'master' into decouple
shiyu1994 Jul 29, 2022
755cb3a
use gpu_use_dp=true in test_goss_boosting_and_strategy_equivalent
shiyu1994 Jul 29, 2022
f12458b
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Jul 29, 2022
b431c2c
revert unexpected changes of non-ascii characters
shiyu1994 Jul 29, 2022
43480d1
revert unexpected changes of non-ascii characters
shiyu1994 Jul 29, 2022
6f799a7
merge LightGBM/master
shiyu1994 Aug 16, 2022
b1f2c77
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Aug 16, 2022
9297157
remove useless changes
shiyu1994 Aug 16, 2022
7a5fede
allocate gradients_pointer_ and hessians_pointer when necessary
shiyu1994 Aug 24, 2022
b4a014f
add spaces
shiyu1994 Aug 24, 2022
f783a61
remove redundant virtual
shiyu1994 Aug 24, 2022
204517b
include <LightGBM/utils/log.h> for USE_CUDA
shiyu1994 Aug 24, 2022
9259845
merge LightGBM/master
shiyu1994 Aug 29, 2022
e5d4605
check for in test_goss_boosting_and_strategy_equivalent
shiyu1994 Aug 29, 2022
469f6bb
check for identity in test_sample_strategy_with_boosting
shiyu1994 Aug 29, 2022
5127188
remove cuda option in test_sample_strategy_with_boosting
shiyu1994 Aug 29, 2022
cc28c8a
Update tests/python_package_test/test_engine.py
shiyu1994 Sep 5, 2022
42f3de9
Update tests/python_package_test/test_engine.py
shiyu1994 Sep 5, 2022
4668986
Merge branch 'master' into decouple
shiyu1994 Sep 5, 2022
33e3fd6
apply review comments
shiyu1994 Sep 6, 2022
ea95e86
ResetGradientBuffers after ResetSampleConfig
shiyu1994 Sep 7, 2022
48c21a4
Merge remote-tracking branch 'origin/master' into decouple
shiyu1994 Sep 9, 2022
18a54ef
ResetGradientBuffers after ResetSampleConfig
shiyu1994 Sep 9, 2022
beb12b9
ResetGradientBuffers after bagging
shiyu1994 Sep 9, 2022
c617518
remove useless code
shiyu1994 Sep 9, 2022
87b3e0e
check objective_function_ instead of gradients
shiyu1994 Sep 9, 2022
58356e4
enable rf with goss
shiyu1994 Sep 14, 2022
47957b1
remove useless changes
shiyu1994 Sep 14, 2022
1eb96d6
allow rf with feature subsampling alone
shiyu1994 Sep 14, 2022
d138c69
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Sep 14, 2022
90a2b8f
change position of ResetGradientBuffers
shiyu1994 Sep 15, 2022
c3d4933
check for dask
shiyu1994 Dec 1, 2022
a198468
Merge branch 'master' into decouple
shiyu1994 Dec 13, 2022
faf5932
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Dec 13, 2022
8ab505a
Merge remote-tracking branch 'origin/master' into decouple
shiyu1994 Dec 21, 2022
ced7b06
add parameter types for data_sample_strategy
shiyu1994 Dec 22, 2022
0b541a6
Merge remote-tracking branch 'origin/master' into decouple
shiyu1994 Dec 27, 2022
aec8454
Merge branch 'decouple' of https://github.com/GuangdaLiu/LightGBM int…
shiyu1994 Dec 27, 2022
7fd71f1
Merge remote-tracking branch 'origin/master' into decouple
shiyu1994 Dec 28, 2022
1 change: 1 addition & 0 deletions R-package/src/Makevars.in
@@ -26,6 +26,7 @@ OBJECTS = \
     boosting/gbdt_model_text.o \
     boosting/gbdt_prediction.o \
     boosting/prediction_early_stop.o \
+    boosting/sample_strategy.o \
     io/bin.o \
     io/config.o \
     io/config_auto.o \
1 change: 1 addition & 0 deletions R-package/src/Makevars.win.in
@@ -27,6 +27,7 @@ OBJECTS = \
     boosting/gbdt_model_text.o \
     boosting/gbdt_prediction.o \
     boosting/prediction_early_stop.o \
+    boosting/sample_strategy.o \
     io/bin.o \
     io/config.o \
     io/config_auto.o \
6 changes: 6 additions & 0 deletions docs/Parameters.rst
@@ -139,6 +139,12 @@ Core Parameters

       -  **Note**: internally, LightGBM uses ``gbdt`` mode for the first ``1 / learning_rate`` iterations

+-  ``data_sample_strategy`` :raw-html:`<a id="data_sample_strategy" title="Permalink to this parameter" href="#data_sample_strategy">&#x1F517;&#xFE0E;</a>`, default = ``bagging``, type = enum, options: ``bagging``, ``goss``
+
+   -  ``bagging``, Randomly Bagging Sampling
+
+   -  ``goss``, Gradient-based One-Side Sampling
+
 -  ``data`` :raw-html:`<a id="data" title="Permalink to this parameter" href="#data">&#x1F517;&#xFE0E;</a>`, default = ``""``, type = string, aliases: ``train``, ``train_data``, ``train_data_file``, ``data_filename``

    -  path of training data, LightGBM will train from this data
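Reviewer note: the parameter text above only names the ``goss`` option. For readers unfamiliar with it, here is a minimal sketch of what Gradient-based One-Side Sampling does conceptually. It is not the code added by this PR. `GossSample` is a hypothetical helper, and its `top_rate` / `other_rate` arguments are assumed to play the role of LightGBM's parameters of the same name: rows with the largest absolute gradients are always kept, a random share of the remaining rows is kept, and the gradients of those randomly kept rows are scaled by `(1 - top_rate) / other_rate`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper for illustration only (not LightGBM's implementation).
// Assumes 0 <= top_rate, 0 < other_rate, and top_rate + other_rate <= 1.
// Returns the selected row indices and rescales, in place, the gradients of
// the rows that were kept by random sampling.
std::vector<std::size_t> GossSample(std::vector<double>* gradients,
                                    double top_rate, double other_rate,
                                    unsigned seed) {
  const std::size_t n = gradients->size();
  // Order row indices by descending absolute gradient.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return std::fabs((*gradients)[a]) >
                            std::fabs((*gradients)[b]);
                   });
  const std::size_t top_k = static_cast<std::size_t>(top_rate * n);
  const std::size_t other_k = static_cast<std::size_t>(other_rate * n);

  // Always keep the top_k rows with the largest absolute gradients.
  std::vector<std::size_t> selected(order.begin(), order.begin() + top_k);

  // Randomly keep up to other_k rows from the remainder.
  std::vector<std::size_t> rest(order.begin() + top_k, order.end());
  std::mt19937 rng(seed);
  std::shuffle(rest.begin(), rest.end(), rng);
  const std::size_t take = std::min(other_k, rest.size());

  // Amplify the sampled small-gradient rows to compensate for the rows dropped.
  const double multiply = (1.0 - top_rate) / other_rate;
  for (std::size_t i = 0; i < take; ++i) {
    (*gradients)[rest[i]] *= multiply;
    selected.push_back(rest[i]);
  }
  return selected;
}
```

With ten rows, `top_rate = 0.2` and `other_rate = 0.2`, the two largest-gradient rows are kept unchanged and two further rows are kept with their gradients multiplied by 4.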
7 changes: 7 additions & 0 deletions include/LightGBM/config.h
@@ -149,6 +149,13 @@ struct Config {
   // descl2 = **Note**: internally, LightGBM uses ``gbdt`` mode for the first ``1 / learning_rate`` iterations
   std::string boosting = "gbdt";

+  // [doc-only]
+  // type = enum
+  // options = bagging, goss
+  // desc = ``bagging``, Randomly Bagging Sampling
+  // desc = ``goss``, Gradient-based One-Side Sampling
+  std::string data_sample_strategy = "bagging";
+
   // alias = train, train_data, train_data_file, data_filename
   // desc = path of training data, LightGBM will train from this data
   // desc = **Note**: can be used only in CLI version
73 changes: 73 additions & 0 deletions include/LightGBM/sample_strategy.h
@@ -0,0 +1,73 @@
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/

#ifndef LIGHTGBM_SAMPLE_STRATEGY_H_
#define LIGHTGBM_SAMPLE_STRATEGY_H_

#include <LightGBM/utils/random.h>
#include <LightGBM/utils/common.h>
#include <LightGBM/utils/threading.h>
#include <LightGBM/config.h>
#include <LightGBM/dataset.h>
#include <LightGBM/tree_learner.h>
#include <LightGBM/objective_function.h>

#include <memory>
#include <vector>

namespace LightGBM {

class SampleStrategy {
 public:
  SampleStrategy() : balanced_bagging_(false), bagging_runner_(0, bagging_rand_block_), need_resize_gradients_(false) {}

  virtual ~SampleStrategy() {}

  static SampleStrategy* CreateSampleStrategy(const Config* config, const Dataset* train_data, const ObjectiveFunction* objective_function, int num_tree_per_iteration);

  virtual void Bagging(int iter, TreeLearner* tree_learner, score_t* gradients, score_t* hessians) = 0;

  virtual void ResetSampleConfig(const Config* config, bool is_change_dataset) = 0;

  bool is_use_subset() const { return is_use_subset_; }

  data_size_t bag_data_cnt() const { return bag_data_cnt_; }

  std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>>& bag_data_indices() { return bag_data_indices_; }

  void UpdateObjectiveFunction(const ObjectiveFunction* objective_function) {
    objective_function_ = objective_function;
  }

  void UpdateTrainingData(const Dataset* train_data) {
    train_data_ = train_data;
    num_data_ = train_data->num_data();
  }

  virtual bool IsHessianChange() const = 0;

  bool NeedResizeGradients() const { return need_resize_gradients_; }

 protected:
  const Config* config_;
  const Dataset* train_data_;
  const ObjectiveFunction* objective_function_;
  std::vector<data_size_t, Common::AlignmentAllocator<data_size_t, kAlignedSize>> bag_data_indices_;
  data_size_t bag_data_cnt_;
  data_size_t num_data_;
  int num_tree_per_iteration_;
  std::unique_ptr<Dataset> tmp_subset_;
  bool is_use_subset_;
  bool balanced_bagging_;
  const int bagging_rand_block_ = 1024;
  std::vector<Random> bagging_rands_;
  ParallelPartitionRunner<data_size_t, false> bagging_runner_;
  /*! \brief whether need to resize the gradient vectors */
  bool need_resize_gradients_;
};

} // namespace LightGBM

#endif // LIGHTGBM_SAMPLE_STRATEGY_H_
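Reviewer note: the header declares `CreateSampleStrategy`, but its definition (in `src/boosting/sample_strategy.cpp`) is not part of the hunks shown here. The following is a minimal sketch of how such a factory could dispatch on `data_sample_strategy`, using toy stand-in types. Every name prefixed with `Toy` is hypothetical, returning `nullptr` for an unknown value is an illustrative choice, and the `true` returned by the GOSS stand-in's `IsHessianChange` is an assumption (only the bagging value, `false`, is visible in this diff).

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-in for the relevant part of Config (hypothetical).
struct ToyStrategyConfig {
  std::string data_sample_strategy = "bagging";
};

// Toy stand-in for the SampleStrategy interface (hypothetical).
class ToySampleStrategy {
 public:
  virtual ~ToySampleStrategy() {}
  virtual std::string Name() const = 0;
  virtual bool IsHessianChange() const = 0;
  static std::unique_ptr<ToySampleStrategy> Create(const ToyStrategyConfig& config);
};

class ToyBagging : public ToySampleStrategy {
 public:
  std::string Name() const override { return "bagging"; }
  // Matches the bagging strategy in this PR, which reports no hessian change.
  bool IsHessianChange() const override { return false; }
};

class ToyGoss : public ToySampleStrategy {
 public:
  std::string Name() const override { return "goss"; }
  // Assumption: GOSS rescales sampled gradients and hessians.
  bool IsHessianChange() const override { return true; }
};

// Dispatch on the configured strategy name; unknown names yield nullptr here.
std::unique_ptr<ToySampleStrategy> ToySampleStrategy::Create(const ToyStrategyConfig& config) {
  if (config.data_sample_strategy == "goss") {
    return std::unique_ptr<ToySampleStrategy>(new ToyGoss());
  }
  if (config.data_sample_strategy == "bagging") {
    return std::unique_ptr<ToySampleStrategy>(new ToyBagging());
  }
  return nullptr;
}
```

Keeping the choice behind one factory means the boosting code only sees the abstract interface, which is the decoupling this PR is after.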
181 changes: 181 additions & 0 deletions src/boosting/bagging.hpp
@@ -0,0 +1,181 @@
/*!
* Copyright (c) 2021 Microsoft Corporation. All rights reserved.
* Licensed under the MIT License. See LICENSE file in the project root for license information.
*/

#ifndef LIGHTGBM_SAMPLE_STRATEGY_BAGGING_HPP_
#define LIGHTGBM_SAMPLE_STRATEGY_BAGGING_HPP_

namespace LightGBM {

class BAGGING : public SampleStrategy {
 public:
  BAGGING(const Config* config, const Dataset* train_data, const ObjectiveFunction* objective_function, int num_tree_per_iteration)
    : need_re_bagging_(false) {
    config_ = config;
    train_data_ = train_data;
    num_data_ = train_data->num_data();
    objective_function_ = objective_function;
    num_tree_per_iteration_ = num_tree_per_iteration;
  }

  ~BAGGING() {}

  void Bagging(int iter, TreeLearner* tree_learner, score_t* gradients, score_t* hessians) override {
    Common::FunctionTimer fun_timer("GBDT::Bagging", global_timer);
    // if need bagging
    if ((bag_data_cnt_ < num_data_ && iter % config_->bagging_freq == 0) ||
        need_re_bagging_) {
      need_re_bagging_ = false;
      auto left_cnt = bagging_runner_.Run<true>(
          num_data_,
          [=](int, data_size_t cur_start, data_size_t cur_cnt, data_size_t* left,
              data_size_t*) {
            data_size_t cur_left_count = 0;
            if (balanced_bagging_) {
              cur_left_count =
                  BalancedBaggingHelper(cur_start, cur_cnt, left);
            } else {
              cur_left_count = BaggingHelper(cur_start, cur_cnt, left);
            }
            return cur_left_count;
          },
          bag_data_indices_.data());
      bag_data_cnt_ = left_cnt;
      Log::Debug("Re-bagging, using %d data to train", bag_data_cnt_);
      // set bagging data to tree learner
      if (!is_use_subset_) {
        tree_learner->SetBaggingData(nullptr, bag_data_indices_.data(), bag_data_cnt_);
      } else {
        // get subset
        tmp_subset_->ReSize(bag_data_cnt_);
        tmp_subset_->CopySubrow(train_data_, bag_data_indices_.data(),
                                bag_data_cnt_, false);
        tree_learner->SetBaggingData(tmp_subset_.get(), bag_data_indices_.data(),
                                     bag_data_cnt_);
      }
    }
    // avoid warnings
    std::ignore = gradients;
    std::ignore = hessians;
  }

  void ResetSampleConfig(const Config* config, bool is_change_dataset) override {
    need_resize_gradients_ = false;
    // if need bagging, create buffer
    data_size_t num_pos_data = 0;
    if (objective_function_ != nullptr) {
      num_pos_data = objective_function_->NumPositiveData();
    }
    bool balance_bagging_cond = (config->pos_bagging_fraction < 1.0 || config->neg_bagging_fraction < 1.0) && (num_pos_data > 0);
    if ((config->bagging_fraction < 1.0 || balance_bagging_cond) && config->bagging_freq > 0) {
      need_re_bagging_ = false;
      if (!is_change_dataset &&
          config_ != nullptr && config_->bagging_fraction == config->bagging_fraction && config_->bagging_freq == config->bagging_freq
          && config_->pos_bagging_fraction == config->pos_bagging_fraction && config_->neg_bagging_fraction == config->neg_bagging_fraction) {
        return;
      }
      config_ = config;
      if (balance_bagging_cond) {
        balanced_bagging_ = true;
        bag_data_cnt_ = static_cast<data_size_t>(num_pos_data * config->pos_bagging_fraction)
                        + static_cast<data_size_t>((num_data_ - num_pos_data) * config->neg_bagging_fraction);
      } else {
        bag_data_cnt_ = static_cast<data_size_t>(config->bagging_fraction * num_data_);
      }
      bag_data_indices_.resize(num_data_);
      bagging_runner_.ReSize(num_data_);
      bagging_rands_.clear();
      for (int i = 0;
           i < (num_data_ + bagging_rand_block_ - 1) / bagging_rand_block_; ++i) {
        bagging_rands_.emplace_back(config_->bagging_seed + i);
      }

      double average_bag_rate =
          (static_cast<double>(bag_data_cnt_) / num_data_) / config->bagging_freq;
      is_use_subset_ = false;
      const int group_threshold_usesubset = 100;
      if (average_bag_rate <= 0.5
          && (train_data_->num_feature_groups() < group_threshold_usesubset)) {
        if (tmp_subset_ == nullptr || is_change_dataset) {
          tmp_subset_.reset(new Dataset(bag_data_cnt_));
          tmp_subset_->CopyFeatureMapperFrom(train_data_);
        }
        is_use_subset_ = true;
        Log::Debug("Use subset for bagging");
      }

      need_re_bagging_ = true;

      if (is_use_subset_ && bag_data_cnt_ < num_data_) {
        if (objective_function_ == nullptr) {
          // resize gradient vectors to copy the customized gradients for using subset data
          need_resize_gradients_ = true;
        }
      }
    } else {
      bag_data_cnt_ = num_data_;
      bag_data_indices_.clear();
      bagging_runner_.ReSize(0);
      is_use_subset_ = false;
    }
  }

  bool IsHessianChange() const override {
    return false;
  }

 private:
  data_size_t BaggingHelper(data_size_t start, data_size_t cnt, data_size_t* buffer) {
    if (cnt <= 0) {
      return 0;
    }
    data_size_t cur_left_cnt = 0;
    data_size_t cur_right_pos = cnt;
    // random bagging, minimal unit is one record
    for (data_size_t i = 0; i < cnt; ++i) {
      auto cur_idx = start + i;
      if (bagging_rands_[cur_idx / bagging_rand_block_].NextFloat() < config_->bagging_fraction) {
        buffer[cur_left_cnt++] = cur_idx;
      } else {
        buffer[--cur_right_pos] = cur_idx;
      }
    }
    return cur_left_cnt;
  }

  data_size_t BalancedBaggingHelper(data_size_t start, data_size_t cnt, data_size_t* buffer) {
    if (cnt <= 0) {
      return 0;
    }
    auto label_ptr = train_data_->metadata().label();
    data_size_t cur_left_cnt = 0;
    data_size_t cur_right_pos = cnt;
    // random bagging, minimal unit is one record
    for (data_size_t i = 0; i < cnt; ++i) {
      auto cur_idx = start + i;
      bool is_pos = label_ptr[start + i] > 0;
      bool is_in_bag = false;
      if (is_pos) {
        is_in_bag = bagging_rands_[cur_idx / bagging_rand_block_].NextFloat() <
                    config_->pos_bagging_fraction;
      } else {
        is_in_bag = bagging_rands_[cur_idx / bagging_rand_block_].NextFloat() <
                    config_->neg_bagging_fraction;
      }
      if (is_in_bag) {
        buffer[cur_left_cnt++] = cur_idx;
      } else {
        buffer[--cur_right_pos] = cur_idx;
      }
    }
    return cur_left_cnt;
  }

  /*! \brief whether need restart bagging in continued training */
  bool need_re_bagging_;
};

} // namespace LightGBM

#endif // LIGHTGBM_SAMPLE_STRATEGY_BAGGING_HPP_
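Reviewer note: `BaggingHelper` and `BalancedBaggingHelper` above share one partition pattern: in-bag indices are written from the front of the block buffer, out-of-bag indices from the back, and the in-bag count is returned. Below is a standalone sketch of that pattern. `PartitionInBag` is a hypothetical name, and `std::mt19937` with `std::uniform_real_distribution` stands in for LightGBM's `Random::NextFloat()`, so the actual random draws will differ from the library's.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical standalone version of the partition used by the helpers above.
// Writes in-bag indices to buffer[0 .. ret) in ascending order and out-of-bag
// indices to buffer[ret .. cnt) in descending order; returns the in-bag count.
// The buffer must have room for at least cnt entries.
int PartitionInBag(int start, int cnt, double fraction, std::mt19937* rng,
                   int* buffer) {
  if (cnt <= 0) {
    return 0;
  }
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  int left_cnt = 0;
  int right_pos = cnt;
  for (int i = 0; i < cnt; ++i) {
    const int idx = start + i;
    if (uniform(*rng) < fraction) {
      buffer[left_cnt++] = idx;   // in-bag: fill from the front
    } else {
      buffer[--right_pos] = idx;  // out-of-bag: fill from the back
    }
  }
  return left_cnt;
}
```

Because both ends are filled in a single pass, no second buffer is needed, and every index in the block ends up in exactly one of the two regions.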
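Reviewer note: the arithmetic in `ResetSampleConfig` is easier to check with numbers. The sketch below restates the bag-size computation and the "use subset" test as free functions. `ToyBaggingParams`, `BagDataCount` and `UseSubset` are hypothetical names; the formulas and the threshold of 100 feature groups are copied from the diff. As in the diff, these only apply when bagging is active (`bagging_freq > 0` and at least one fraction below 1.0).

```cpp
#include <cassert>

// Hypothetical container for the config fields read by ResetSampleConfig.
struct ToyBaggingParams {
  double bagging_fraction;
  double pos_bagging_fraction;
  double neg_bagging_fraction;
  int bagging_freq;
};

// Expected number of in-bag rows, following the two branches in the diff.
int BagDataCount(int num_data, int num_pos_data, const ToyBaggingParams& p) {
  const bool balanced =
      (p.pos_bagging_fraction < 1.0 || p.neg_bagging_fraction < 1.0) &&
      num_pos_data > 0;
  if (balanced) {
    return static_cast<int>(num_pos_data * p.pos_bagging_fraction) +
           static_cast<int>((num_data - num_pos_data) * p.neg_bagging_fraction);
  }
  return static_cast<int>(p.bagging_fraction * num_data);
}

// Whether a copied subset dataset would be used instead of index-based bagging.
bool UseSubset(int bag_data_cnt, int num_data, int bagging_freq,
               int num_feature_groups) {
  const int group_threshold_usesubset = 100;
  const double average_bag_rate =
      (static_cast<double>(bag_data_cnt) / num_data) / bagging_freq;
  return average_bag_rate <= 0.5 &&
         num_feature_groups < group_threshold_usesubset;
}
```

For example, with 1000 rows of which 100 are positive, `pos_bagging_fraction = 0.5` and `neg_bagging_fraction = 0.25` give 50 + 225 = 275 in-bag rows. With `bagging_freq = 1` the average bag rate is 0.275, so a dataset with fewer than 100 feature groups would take the subset path.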
5 changes: 2 additions & 3 deletions src/boosting/boosting.cpp
@@ -6,7 +6,6 @@

 #include "dart.hpp"
 #include "gbdt.h"
-#include "goss.hpp"
 #include "rf.hpp"

 namespace LightGBM {
@@ -39,7 +38,7 @@ Boosting* Boosting::CreateBoosting(const std::string& type, const char* filename
     } else if (type == std::string("dart")) {
       return new DART();
     } else if (type == std::string("goss")) {
-      return new GOSS();
+      return new GBDT();
     } else if (type == std::string("rf")) {
       return new RF();
     } else {
@@ -53,7 +52,7 @@ Boosting* Boosting::CreateBoosting(const std::string& type, const char* filename
       } else if (type == std::string("dart")) {
         ret.reset(new DART());
       } else if (type == std::string("goss")) {
-        ret.reset(new GOSS());
+        ret.reset(new GBDT());
       } else if (type == std::string("rf")) {
         return new RF();
       } else {
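Reviewer note: in the hunks above, `type == "goss"` now constructs a plain `GBDT`, so the sampling behaviour has to come from `data_sample_strategy`. The commit titles mention a warning for `goss` used as a boosting type, but the config-handling code is not among the hunks shown. The sketch below is one way such a backward-compatible mapping could look, under that assumption. `ToyBoostingConfig`, `NormalizeLegacyGoss` and the warning text are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Toy stand-in for the two relevant Config fields (hypothetical).
struct ToyBoostingConfig {
  std::string boosting = "gbdt";
  std::string data_sample_strategy = "bagging";
};

// Rewrites the legacy boosting=goss setting into the decoupled form.
// Returns true if the config was rewritten (and a warning was printed).
bool NormalizeLegacyGoss(ToyBoostingConfig* config) {
  if (config->boosting != "goss") {
    return false;
  }
  std::fprintf(stderr,
               "[Warning] boosting=goss is treated as boosting=gbdt with "
               "data_sample_strategy=goss\n");
  config->boosting = "gbdt";
  config->data_sample_strategy = "goss";
  return true;
}
```

Mapping the legacy value once at config time would let the factory above stay a thin dispatch while old parameter files keep working.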