SMILE #32
A couple of questions about your benchmark. First, about your data encoding: do you use the original 8 variables directly, or convert them to another representation? Also, the data is highly unbalanced (positive : negative is about 1 : 4). Do you rebalance the data before training? Can you also report other metrics besides AUC, such as accuracy, sensitivity, specificity, etc.? None of them is perfect, but it would be better to report more than just AUC. Thanks!
BTW, our random forest AUC is low because the prediction probabilities are derived from votes instead of from leaf weights. We will update the calculation ASAP. The AUC of our gradient boosted trees matches other systems.
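For context, the difference between vote-based and probability-averaged random forest scores can be sketched like this (a toy Python example; the per-tree probabilities are invented for illustration and this is not SMILE's API):

```python
# Toy illustration: scoring a random forest by majority votes vs. by
# averaging per-tree class probabilities (e.g. from leaf weights).
# The per-tree probabilities below are made up for illustration only.

def score_by_votes(tree_probs):
    """Fraction of trees voting for the positive class (coarse scores)."""
    votes = [1 if p > 0.5 else 0 for p in tree_probs]
    return sum(votes) / len(votes)

def score_by_probs(tree_probs):
    """Average of per-tree positive-class probabilities (finer-grained)."""
    return sum(tree_probs) / len(tree_probs)

# Two samples with identical vote-based scores but distinct
# probability-averaged scores -- the finer-grained ranking helps AUC.
sample_a = [0.55, 0.60, 0.40]  # two weak positive votes
sample_b = [0.95, 0.90, 0.40]  # two confident positive votes

print(score_by_votes(sample_a), score_by_votes(sample_b))  # identical
print(score_by_probs(sample_a), score_by_probs(sample_b))  # distinct
```

With votes, both samples score 2/3 and cannot be ranked against each other; with averaged probabilities, the second sample correctly ranks higher.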
Thanks, I'll try it out. Re: your questions. I use the original (categorical) encoding for the algorithms/implementations that can deal with it, and 1-hot encoding for the ones that cannot. 1:4 is not really "highly" unbalanced (1:100 would be), so I do not rebalance. Surely, AUC is not "complete", but it captures a lot of what I'm interested in. Yes, for random forests averaging probabilities gives better AUC than averaging votes.
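For reference, the 1-hot conversion mentioned above can be sketched as follows (plain Python; the column names and values are placeholders, not the actual benchmark schema):

```python
# Minimal sketch of 1-hot encoding a categorical column: each distinct
# category becomes its own 0/1 indicator column.

def one_hot(rows, column):
    """Expand one categorical column into 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rows = [{"carrier": "AA", "distance": 300},
        {"carrier": "UA", "distance": 2475}]
print(one_hot(rows, "carrier"))
# Each row now carries carrier_AA and carrier_UA indicator columns
# in place of the original categorical "carrier" column.
```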
Thanks! There are two real-valued variables (departure time and distance). Do you also treat them as categorical? This data is unbalanced: even though AUC is about 70%, the sensitivity is only about 10% (at 99% specificity), which is pretty much useless for this particular problem. Our implementation can assign different weights to classes. By adjusting the weights, we can achieve much higher sensitivity (and of course lower specificity) and a lower AUC. I feel that this is more meaningful in practice. As your benchmark is mostly about speed and memory usage, it may not be important.
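The sensitivity/specificity trade-off described above can be sketched like this (a toy Python example; up-weighting the positive class acts much like lowering the decision threshold, and the labels/scores here are invented for illustration):

```python
# Sketch: sensitivity and specificity from predicted probabilities at a
# given decision threshold. Lowering the threshold (or up-weighting the
# positive class) raises sensitivity at the cost of specificity.

def sens_spec(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

labels = [1, 1, 1, 0, 0, 0, 0, 0]  # unbalanced toy data
scores = [0.9, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05, 0.35]

print(sens_spec(labels, scores, 0.5))   # strict threshold: low sensitivity
print(sens_spec(labels, scores, 0.25))  # looser threshold: higher sensitivity
```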
Have you tried it? Is there anything I can help with? Thanks!
No, sorry. And I'll have very limited time for the next 3-4 weeks for sure. How about you take a look at https://github.com/szilard/benchm-ml/tree/master/z-other-tools, run random forests with 100 trees on 32 cores for the 1M dataset, and tell me the run time and AUC?
No problem. I ran the 1M dataset on my 4-core Mac (while using it for other things). Here is the printout:

--------------- 100K samples ---------------------
Training Gradient Boosted Trees of 300 trees...
Training AdaBoost of 300 trees...
--------------- 1M samples ---------------------
Training Gradient Boosted Trees of 300 trees...
Training AdaBoost of 300 trees...

Note that I report other metrics besides AUC and also run AdaBoost. For gradient boosting, I use your second setting (300 trees). Thanks!
My running times are reported in milliseconds, so it is about 1436 seconds for random forest, 84 seconds for gradient boosting, and 97 seconds for AdaBoost on the 1M dataset. As random forest training scales linearly, I expect that we will use 1/8 of the time on a 32-core box. We also parallelize tree training in gradient boosting and AdaBoost; I expect we will use less time, but not as little as 1/8.
BTW, we calculate AUC with our own implementation (https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/validation/AUC.java), which is based on the Mann-Whitney U test. I am not sure if it is the same as yours. If you want, I can ship you the prediction results and you can calculate it with your AUC method. Thanks!
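For comparison, AUC derived from the Mann-Whitney U statistic can be sketched like this (a minimal Python version, not SMILE's code; ties get half credit, and the labels/scores are invented for illustration):

```python
# AUC via the Mann-Whitney U statistic: the probability that a randomly
# chosen positive is scored higher than a randomly chosen negative,
# counting ties as half. Written O(n_pos * n_neg) for clarity; rank-based
# implementations compute the same value in O(n log n).

def auc_mann_whitney(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.8, 0.4, 0.4, 0.3, 0.1]
print(auc_mann_whitney(labels, scores))  # 5.5 / 6 = 0.9166...
```

Two implementations that agree on this tie-handling convention should produce identical AUC values on the same predictions.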
Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster for this data set. For 100K training data on a 4-core machine, we can train a random forest with 500 trees in 100 seconds, and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think we will be much faster than all the tools you tested. You can try it out by cloning our project and then running
sbt benchmark/run
This also includes a benchmark on the USPS data, which you may ignore. Thanks!