
XGBoost parameter error (colsample_bytree=1) #449

Open
tjvananne opened this issue May 16, 2017 · 12 comments

@tjvananne

I've seen some traffic in these issues about potentially dropping xgboost altogether due to dependency troubles; if that's the plan, then this report isn't relevant.

I am receiving the following error message:

Optimization Progress:   0%|                            | 26/10100 [00:22<2:10:08,  1.29pipeline/s]
[08:35:46] c:\dev\libs\xgboost\dmlc-core\include\dmlc\./logging.h:235: [08:35:46] C:\dev\libs\xgboost\src\tree\updater_colmaker.cc:162: Check failed: (n) > (0U) colsample_bytree=1 is too small that no feature can be included

My understanding is that the colsample_bytree parameter is the proportion of features each tree is allowed to randomly sample from when building itself, so colsample_bytree=1 should tell each tree to sample from 100% of the columns/features. (Please correct me if I'm wrong on that!)

From the xgboost docs: colsample_bytree = subsample ratio of columns when constructing each tree.
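
As a sanity check (a minimal standalone sketch on an ordinary dense feature matrix), fitting XGBClassifier directly with colsample_bytree=1 succeeds as long as the data actually has columns, which supports the reading above:

import numpy as np
import xgboost as xgb

# colsample_bytree=1 lets every tree sample from 100% of the columns,
# so a direct fit on a normal feature matrix should not raise this error.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (rng.rand(200) > 0.5).astype(int)

clf = xgb.XGBClassifier(n_estimators=10, colsample_bytree=1)
clf.fit(X, y)
print(clf.predict(X[:5]))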

This has also been raised previously as an issue on xgboost's GitHub repo, but that issue was closed without any real explanation of what the user was doing wrong.

My guess is that this is a problem with the parameters being passed into XGBoost rather than an xgboost bug itself.

Context of the issue

My environment:

  • Windows 7 - 64-bit OS
  • Python3.4 - 64-bit
  • pandas version: 0.20.1
  • numpy version: 1.12.1
  • scipy version: 0.19.0
  • tpot version: 0.7.3
  • sklearn version: 0.18.1
  • xgboost version: 0.6

Process to reproduce the issue

This is a simple script to reproduce the error in my environment with random data. The error doesn't tend to occur when generations and population_size are low (around 10-15 each); I have hit it with generations/population_size as low as 32 (with this same script below). Hopefully this short script is sufficiently reproducible!

print("importing modules...")
import pandas as pd
import numpy as np
import tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import sklearn
import scipy
import xgboost as xgb
from random import randint

# I wanted the label data to be a bit imbalanced
print("creating fake data...")
np.random.seed(1776)
df = pd.DataFrame(np.random.randn(8000,11), columns=list("ABCDEFGHIJK"))
label = np.array([randint(1,11) for mynumber in range(0, 8000)])
label[label <= 9] = 0
label[label >= 10] = 1
print(label)
df['label'] = label


# extract labels and drop them from the DataFrame
y = df['label'].values
colsToDrop = ['label']
xdf = df.drop(colsToDrop, axis=1)


x_train, x_test, y_train, y_test = train_test_split(xdf, y, train_size=0.7, test_size=0.3, random_state=1776)

# this will error out:
tpot = TPOTClassifier(generations=100, population_size=100, verbosity=2,
                      scoring="balanced_accuracy", cv=5, random_state=1776)
tpot.fit(x_train, y_train)

I couldn't find any prior issues addressing this specific error, but I apologize if I missed one.

@weixuanfu
Contributor

weixuanfu commented May 16, 2017

Hmm, the warning message is indeed reproducible in my environment. But it is very odd: the colsample_bytree parameter is not in our operator dictionary, so TPOT should not tune it and should leave it at its default value of 1.

I also checked the first 26 pipelines in the TPOT optimization process. Only the two pipelines below used XGBClassifier. I tested both of them and both ran without the warning message. Very weird.

Pipeline(steps=[('xgbclassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True,
       subsample=0.6500000000000001))])

Pipeline(steps=[('xgbclassifier', XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=4,
       min_child_weight=7, missing=None, n_estimators=100, nthread=1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.8))])

@tjvananne
Author

I should have been a bit more explicit for anyone else following along. In addition to the message in the console, I also get the Windows message that "python.exe has stopped working", and then the program crashes.

[screenshot: Windows dialog "python.exe has stopped working"]

Also, I'm not sure whether "pipeline" maps one-to-one with "generation" (I haven't dug into much of the source yet), but I did have to set the generations parameter of the TPOTClassifier object high enough to get the error. So maybe it isn't one of the XGBClassifiers in the first 26 pipelines, but rather one of the pipelines later on?

I might be misunderstanding the relationship between pipelines and generations, though.

In the script above, for example, if I set generations and population_size both to 32, I get this error message, but if I lower each of those parameters to 30, there is no error. That seems to be roughly where the threshold is for reproducing this issue.

Note: nevermind, I just tested with 30 as the value for both generations and population_size and got the same error message, though not until the optimization was 23% done.

The tpot object was able to fit with no errors when generations and population_size were both set to 26.

I'm going to try to investigate this more tonight.

@weixuanfu
Contributor

Thank you for the detailed information on this issue.

I don't think the issue is related to generations, since the error message showed up in the initial generation, which contains only randomly generated pipelines. I suspect it might be due to the _pre_test decorator, which tests each pipeline on a small dataset to make sure it is valid. Some invalid pipelines containing the XGBClassifier operator might trigger this issue in _pre_test. I will run more tests to find the cause; a simplified sketch of the idea is below.
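
A simplified sketch of that idea (hypothetical code, not TPOT's actual _pre_test implementation): wrap the function that generates a pipeline, try fitting the result on a small random dataset, and reject it if the fit raises:

import numpy as np
from functools import wraps

def pre_test(generate_pipeline):
    """Hypothetical pre-test decorator: only accept pipelines that can be
    fitted on a tiny synthetic dataset without raising."""
    pretest_X = np.random.rand(50, 10)
    pretest_y = np.random.randint(0, 2, 50)

    @wraps(generate_pipeline)
    def wrapper(*args, **kwargs):
        pipeline = generate_pipeline(*args, **kwargs)
        try:
            pipeline.fit(pretest_X, pretest_y)  # cheap validity check
        except Exception:
            raise ValueError("Invalid pipeline encountered. Skipping its evaluation.")
        return pipeline
    return wrapper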

@tjvananne
Author

Absolutely! Thanks for your response!

That reminds me, it's also probably worth mentioning that I ran into a few "Invalid pipeline encountered. Skipping its evaluation." messages when using verbosity=3 in the TPOTClassifier() constructor.

Thank you!

@weixuanfu
Contributor

weixuanfu commented May 17, 2017

I found the cause of the issue. With the demo from this issue, it is pipeline 32 in generation 0 (see below). The first step is feature selection, but unfortunately no feature passes the threshold in that step, so no features are available for the XGBClassifier in the second step.

To solve this, I will submit a PR that catches this error and prevents TPOT from crashing.

# pipeline 32
make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2), threshold=0.30000000000000004),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6000000000000001)
)

To reproduce the error message without running TPOT, try the code below:

print("importing modules...")
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import VotingClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
from random import randint
from copy import copy
from tpot import TPOTClassifier

# I wanted the label data to be a bit imbalanced
print("creating fake data...")
np.random.seed(1776)
df = pd.DataFrame(np.random.randn(8000,11), columns=list("ABCDEFGHIJK"))
label = np.array([randint(1,11) for mynumber in range(0, 8000)])
label[label <= 9] = 0
label[label >= 10] = 1
print(label)
df['label'] = label


# extract labels and drop them from the DataFrame
y = df['label'].values
colsToDrop = ['label']
xdf = df.drop(colsToDrop, axis=1)


x_train, x_test, y_train, y_test = train_test_split(xdf, y, train_size=0.7, test_size=0.3, random_state=1776)

# make a test pipeline
"""test_pipeline = make_pipeline(
    make_union(VotingClassifier([("est", DecisionTreeClassifier(criterion="gini", max_depth=10, min_samples_leaf=8, min_samples_split=13))]), FunctionTransformer(copy)),
    XGBClassifier(learning_rate=0.01, max_depth=2, min_child_weight=9, nthread=1, subsample=0.1)
)"""

test_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(max_features=0.2), threshold=0.30000000000000004),
    XGBClassifier(learning_rate=0.1, max_depth=1, min_child_weight=13, nthread=1, subsample=0.6000000000000001)
    )

# Fix the random state where the operator allows it (optional), just to get a consistent CV score as in TPOT
tpot = TPOTClassifier()
tpot._set_param_recursive(test_pipeline.steps, 'random_state', 42)

# cv scores
cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5, scoring='accuracy', verbose=0)


@weixuanfu
Contributor

from xgboost.core import XGBoostError
import warnings

for i in range(2000):
    try:
        # cv scores
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5, scoring='accuracy', verbose=0)
    except XGBoostError:
        print("Wrong")

Somehow, the error message still showed up even when I used the code above to catch XGBoostError, but the program did not crash after running this bad pipeline 2000 times.

@weixuanfu
Contributor

From this part of the xgboost source code, it seems the error message is printed via std::ostringstream. I am not sure whether Python can catch that message.
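
If the C++ code writes the message straight to an OS-level file descriptor, Python's exception handling and sys.stdout/sys.stderr redirection never see it. One possible workaround (a sketch, assuming the message goes to file descriptor 2; use fd=1 if it actually goes to stdout) is to redirect the descriptor itself:

import os
from contextlib import contextmanager

@contextmanager
def suppress_native_output(fd=2):
    """Temporarily point an OS-level file descriptor (stderr by default)
    at the null device, hiding output written directly by C/C++ code."""
    saved = os.dup(fd)                          # keep a copy of the original fd
    devnull = os.open(os.devnull, os.O_WRONLY)  # 'nul' on Windows, '/dev/null' elsewhere
    try:
        os.dup2(devnull, fd)
        yield
    finally:
        os.dup2(saved, fd)                      # restore the original fd
        os.close(saved)
        os.close(devnull)

# usage:
# with suppress_native_output():
#     cvscores = cross_val_score(test_pipeline, x_train, y_train, cv=5)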

@tjvananne
Author

Ah that makes sense, good catch!

Would it be acceptable to simply suppress any pipelines that don't meet certain conditions (in this case, passing no features forward because none meet the feature-selection threshold) so they don't get scored or crossed over with other pipelines?

I see what you're saying, though: it would probably be best to use XGBoost's built-in error checking, from a maintainability perspective, right?

@weixuanfu
Contributor

weixuanfu commented May 18, 2017

Thank you for the good ideas.

It is hard to tell whether a feature-selection step will remove all features before running the pipeline, and it also depends on the data. We will refine the parameters in the selectors (#423) to prevent this issue.

In my code above, I tried catching XGBoost's built-in XGBoostError from its Python wrapper, but the error message was still printed even though the program kept running. I think std::ostringstream in the xgboost C++ source prints the error directly to stdout. It is very strange.

@bicepjai

Do we know why this issue occurs? It would be helpful to understand why "colsample_bytree=1 is too small that no feature can be included" happens.

@weixuanfu
Contributor

The reason is that a feature-selection step in a pipeline can exclude all features before xgboost ever runs. We need better control over the number of features flowing through a pipeline; a sketch of such a check is below.
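
For illustration, a hypothetical guard (not TPOT's actual fix): fit the selector on its own first and count how many features survive before handing the data to the next step. On pure-noise data like the demo above, the importances are spread roughly evenly across the 11 columns (about 1/11 ≈ 0.09 each), so a 0.3 threshold removes everything:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(1776)
X = rng.randn(500, 11)                  # pure-noise features, as in the demo
y = rng.randint(0, 2, 500)

selector = SelectFromModel(
    ExtraTreesClassifier(max_features=0.2, random_state=42),
    threshold=0.3,                      # far above the ~0.09 average importance
)
selector.fit(X, y)

n_kept = selector.get_support().sum()   # number of features passing the threshold
print("features kept:", n_kept)
if n_kept == 0:
    print("Invalid pipeline: the selector removed every feature; skip it.")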

@privefl

privefl commented Aug 24, 2017

I have the same issue.

> devtools::session_info()
Session info --------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.0 (2017-04-21)
 system   x86_64, mingw32             
 ui       RStudio (1.0.153)           
 language (EN)                        
 collate  French_France.1252          
 tz       Europe/Paris                
 date     2017-08-24                  

Packages ------------------------------------------------------------------------------------------
 package       * version    date       source                             
 acepack         1.4.1      2016-10-29 CRAN (R 3.4.1)                     
 backports       1.1.0      2017-05-22 CRAN (R 3.4.0)                     
 base          * 3.4.0      2017-04-21 local                              
 base64enc       0.1-3      2015-07-28 CRAN (R 3.4.0)                     
 bigmemory       4.5.19     2016-03-28 CRAN (R 3.4.1)                     
 bigmemory.sri   0.1.3      2014-08-18 CRAN (R 3.4.0)                     
 bigsnpr       * 0.1.0.9001 2017-08-24 local                              
 bigstatsr     * 0.1.0.9002 2017-08-24 local                              
 checkmate       1.8.3      2017-07-03 CRAN (R 3.4.1)                     
 cluster         2.0.6      2017-03-10 CRAN (R 3.4.0)                     
 codetools       0.2-15     2016-10-05 CRAN (R 3.4.0)                     
 colorspace      1.3-2      2016-12-14 CRAN (R 3.4.0)                     
 compiler        3.4.0      2017-04-21 local                              
 crayon          1.3.2.9000 2017-07-22 Github (gaborcsardi/crayon@750190f)
 data.table      1.10.4     2017-02-01 CRAN (R 3.4.0)                     
 datasets      * 3.4.0      2017-04-21 local                              
 devtools        1.13.3     2017-08-02 CRAN (R 3.4.1)                     
 digest          0.6.12     2017-01-27 CRAN (R 3.4.0)                     
 foreach       * 1.4.3      2015-10-13 CRAN (R 3.4.0)                     
 foreign         0.8-67     2016-09-13 CRAN (R 3.4.0)                     
 Formula       * 1.2-2      2017-07-10 CRAN (R 3.4.1)                     
 ggplot2       * 2.2.1.9000 2017-07-23 Github (hadley/ggplot2@331977e)    
 graphics      * 3.4.0      2017-04-21 local                              
 grDevices     * 3.4.0      2017-04-21 local                              
 grid            3.4.0      2017-04-21 local                              
 gridExtra       2.2.1      2016-02-29 CRAN (R 3.4.0)                     
 gtable          0.2.0      2016-02-26 CRAN (R 3.4.0)                     
 Hmisc         * 4.0-3      2017-05-02 CRAN (R 3.4.1)                     
 htmlTable       1.9        2017-01-26 CRAN (R 3.4.1)                     
 htmltools       0.3.6      2017-04-28 CRAN (R 3.4.0)                     
 htmlwidgets     0.9        2017-07-10 CRAN (R 3.4.1)                     
 iterators       1.0.8      2015-10-13 CRAN (R 3.4.0)                     
 knitr           1.17       2017-08-10 CRAN (R 3.4.1)                     
 lattice       * 0.20-35    2017-03-25 CRAN (R 3.4.0)                     
 latticeExtra    0.6-28     2016-02-09 CRAN (R 3.4.1)                     
 lazyeval        0.2.0      2016-06-12 CRAN (R 3.4.0)                     
 magrittr      * 1.5        2014-11-22 CRAN (R 3.4.0)                     
 Matrix        * 1.2-9      2017-03-14 CRAN (R 3.4.0)                     
 memoise         1.1.0      2017-04-21 CRAN (R 3.4.0)                     
 methods       * 3.4.0      2017-04-21 local                              
 munsell         0.4.3      2016-02-13 CRAN (R 3.4.0)                     
 nnet            7.3-12     2016-02-02 CRAN (R 3.4.0)                     
 parallel        3.4.0      2017-04-21 local                              
 plyr            1.8.4      2016-06-08 CRAN (R 3.4.0)                     
 R6              2.2.2      2017-06-17 CRAN (R 3.4.1)                     
 RColorBrewer    1.1-2      2014-12-07 CRAN (R 3.4.0)                     
 Rcpp            0.12.12    2017-07-15 CRAN (R 3.4.1)                     
 rlang           0.1.2      2017-08-09 CRAN (R 3.4.1)                     
 rpart           4.1-11     2017-03-13 CRAN (R 3.4.0)                     
 rstudioapi      0.6        2016-06-27 CRAN (R 3.4.0)                     
 scales          0.4.1.9002 2017-07-23 Github (hadley/scales@6db7b6f)     
 splines         3.4.0      2017-04-21 local                              
 stats         * 3.4.0      2017-04-21 local                              
 stringi         1.1.5      2017-04-07 CRAN (R 3.4.0)                     
 stringr         1.2.0      2017-02-18 CRAN (R 3.4.0)                     
 survival      * 2.41-3     2017-04-04 CRAN (R 3.4.0)                     
 testthat      * 1.0.2      2016-04-23 CRAN (R 3.4.0)                     
 tibble          1.3.4      2017-08-22 CRAN (R 3.4.0)                     
 tools           3.4.0      2017-04-21 local                              
 utils         * 3.4.0      2017-04-21 local                              
 withr           2.0.0      2017-07-28 CRAN (R 3.4.1)                     
 xgboost         0.6-4      2017-01-05 CRAN (R 3.4.0)  
