SMILE #32
A couple of questions about your benchmark. First, about your data encoding: do you use the original 8 variables directly, or convert them to another representation? Also, the data is highly unbalanced (positive : negative is about 1 : 4). Do you rebalance the data before training? Can you also report other metrics besides AUC, such as accuracy, sensitivity, specificity, etc.? None of them is perfect, but it would be better to report more than just AUC. Thanks!
BTW, our random forest AUC is low because the prediction probabilities are derived from votes instead of from leaf weights. We will update the calculation ASAP. The AUC of our gradient boosted trees matches other systems.
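For context, the difference between vote-based and probability-averaged random forest scores can be sketched like this (a toy Python example; the per-tree probabilities are invented for illustration and this is not SMILE's API):

```python
# Toy illustration: scoring a random forest by majority votes vs. by
# averaging per-tree class probabilities (e.g. from leaf weights).
# The per-tree probabilities below are made up for illustration only.

def score_by_votes(tree_probs):
    """Fraction of trees voting for the positive class (coarse scores)."""
    votes = [1 if p > 0.5 else 0 for p in tree_probs]
    return sum(votes) / len(votes)

def score_by_probs(tree_probs):
    """Average of per-tree positive-class probabilities (finer-grained)."""
    return sum(tree_probs) / len(tree_probs)

# Two samples with identical vote-based scores but distinct
# probability-averaged scores -- the finer-grained ranking helps AUC.
sample_a = [0.55, 0.60, 0.40]  # two weak positive votes
sample_b = [0.95, 0.90, 0.40]  # two confident positive votes

print(score_by_votes(sample_a), score_by_votes(sample_b))  # identical
print(score_by_probs(sample_a), score_by_probs(sample_b))  # distinct
```

With votes, both samples score 2/3 and cannot be ranked against each other; with averaged probabilities, the second sample correctly ranks higher.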
Thanks, I'll try it out. Re: your questions. I use the original (categorical) encoding for the algorithms/implementations that can deal with it, and 1-hot encoding for the ones that cannot. 1:4 is not really "highly" unbalanced (1:100 would be), so I do not rebalance. Surely, AUC is not "complete", but it captures a lot of what I'm interested in. Yes, for random forests averaging probabilities gives better AUC than averaging votes.
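For reference, the 1-hot conversion mentioned above can be sketched as follows (plain Python; the column names and values are placeholders, not the actual benchmark schema):

```python
# Minimal sketch of 1-hot encoding a categorical column: each distinct
# category becomes its own 0/1 indicator column.

def one_hot(rows, column):
    """Expand one categorical column into 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

rows = [{"carrier": "AA", "distance": 300},
        {"carrier": "UA", "distance": 2475}]
print(one_hot(rows, "carrier"))
# Each row now carries carrier_AA and carrier_UA indicator columns
# in place of the original categorical "carrier" column.
```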
Thanks! There are two real-valued variables (departure time and distance). Do you also treat them as categorical? This data is unbalanced: even though AUC is about 70%, the sensitivity is only about 10% (at 99% specificity), which is pretty much useless for this particular problem. Our implementation can assign different weights to classes. By adjusting the weights, we can achieve much higher sensitivity (and of course lower specificity) and a lower AUC. I feel that this is more meaningful in practice. As your benchmark is mostly about speed and memory usage, it may not be important.
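The sensitivity/specificity trade-off described above can be sketched like this (a toy Python example; up-weighting the positive class acts much like lowering the decision threshold, and the labels/scores here are invented for illustration):

```python
# Sketch: sensitivity and specificity from predicted probabilities at a
# given decision threshold. Lowering the threshold (or up-weighting the
# positive class) raises sensitivity at the cost of specificity.

def sens_spec(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

labels = [1, 1, 1, 0, 0, 0, 0, 0]  # unbalanced toy data
scores = [0.9, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05, 0.35]

print(sens_spec(labels, scores, 0.5))   # strict threshold: low sensitivity
print(sens_spec(labels, scores, 0.25))  # looser threshold: higher sensitivity
```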
Have you tried it? Is there anything I can help with? Thanks!
No, sorry. And I'll have very limited time for the next 3-4 weeks for sure. How about you take a look at https://github.com/szilard/benchm-ml/tree/master/z-other-tools, run random forests with 100 trees on 32 cores for the 1M dataset, and tell me the run time and AUC?
No problem. I ran the 1M dataset on my 4-core Mac (while using it for other things). Here is the printout:

--------------- 100K samples ---------------------
Training Gradient Boosted Trees of 300 trees...
Training AdaBoost of 300 trees...
--------------- 1M samples ---------------------
Training Gradient Boosted Trees of 300 trees...
Training AdaBoost of 300 trees...

Note that I report other metrics besides AUC and also run AdaBoost. For gradient boosting, I use your second setting (300 trees). Thanks!
My running times are reported in milliseconds, so it is about 1436 seconds for random forest, 84 seconds for gradient boosting, and 97 seconds for AdaBoost on the 1M dataset. As random forest training scales linearly, I expect that we will use 1/8 of the time on a 32-core box. We also parallelize tree training in gradient boosting and AdaBoost; I expect we will use less time, but not as little as 1/8.
BTW, we calculate AUC with our own implementation (https://github.com/haifengl/smile/blob/master/core/src/main/java/smile/validation/AUC.java), which is based on the Mann-Whitney U test. I am not sure if it is the same as yours. If you want, I can ship you the prediction results and you can calculate it with your AUC method. Thanks!
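For comparison, AUC derived from the Mann-Whitney U statistic can be sketched like this (a minimal Python version, not SMILE's code; ties get half credit, and the labels/scores are invented for illustration):

```python
# AUC via the Mann-Whitney U statistic: the probability that a randomly
# chosen positive is scored higher than a randomly chosen negative,
# counting ties as half. Written O(n_pos * n_neg) for clarity; rank-based
# implementations compute the same value in O(n log n).

def auc_mann_whitney(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.8, 0.4, 0.4, 0.3, 0.1]
print(auc_mann_whitney(labels, scores))  # 5.5 / 6 = 0.9166...
```

Two implementations that agree on this tie-handling convention should produce identical AUC values on the same predictions.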
Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster for this data set. For 100K training data on a 4-core machine, we can train a random forest with 500 trees in 100 seconds, and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think we will be much faster than all the tools you tested. You can try it out by cloning our project and then running
sbt benchmark/run
This also includes a benchmark on the USPS data, which you may ignore. Thanks!