# XGBoost
## Load packages
```{r load_packages}
library(caret)
library(pROC)
library(xgboost)
```
## Load data
Load `train_x_class`, `train_y_class`, `test_x_class`, and `test_y_class` variables we defined in 02-preprocessing.Rmd for this *classification* task.
```{r setup_data}
# Loads the objects created in 02-preprocessing.Rmd, including train_x_class, train_y_class, test_x_class, and test_y_class
load("data/preprocessed.RData")
```
## Overview
From [Freund Y, Schapire RE. 1999. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14:771-780](https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf):
"Boosting is a general method for improving the accuracy of any given learning algorithm" and evolved from AdaBoost and PAC learning (p. 1-2). Gradient boosted machines are ensembles decision tree methods of "weak" trees that are just slightly more accurate than random guessing. These are then "boosted" into "strong" learners. That is, the models don't have to be accurate over the entire feature space."
The model first tries to predict each value in a dataset - the cases that can be predicted easily are _downweighted_ so that the algorithm does not try as hard to predict them.
However, the cases that the model has difficulty predicting are _upweighted_ so that the model more assertively tries to predict them. This continues for multiple "boosting iterations", with a training-based performance measure produced at each iteration. This method can drive down generalization error (p. 5).
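To make the idea concrete, here is a minimal hand-rolled sketch of boosting on a toy regression problem. It fits each new `rpart` stump to the current residuals, which plays the same role as upweighting the hard-to-predict cases. The toy data and the names `toy_data`, `n_rounds`, and `learning_rate` are invented for this illustration; xgboost does all of this, and much more, for us.
```{r boosting_sketch}
library(rpart)

set.seed(1)
toy_data = data.frame(x = runif(200))
toy_data$y = sin(2 * pi * toy_data$x) + rnorm(200, sd = 0.3)

n_rounds = 50
learning_rate = 0.1

# Start from a constant prediction.
pred = rep(mean(toy_data$y), nrow(toy_data))

for (i in seq_len(n_rounds)) {
  # Cases the current ensemble predicts poorly have large residuals,
  # so they dominate the next weak learner's fit.
  toy_data$residual = toy_data$y - pred
  stump = rpart(residual ~ x, data = toy_data, maxdepth = 1)
  # Add a small correction from the new weak learner.
  pred = pred + learning_rate * predict(stump, toy_data)
}

# Training error shrinks as more weak learners are added.
mean((toy_data$y - pred)^2)
```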
Rather than fitting a single model with one fixed set of hyperparameters, it is useful to evaluate many candidate settings and keep the one that performs best. caret's default resampling method is the bootstrap, but we want cross-validation.
Create two objects - `cv_control` and `xgb_grid`. `cv_control` customizes the cross-validation settings, while `xgb_grid` specifies the hyperparameter combinations to evaluate:
### Define `cv_control`
```{r caret_prep}
# Use 5-fold cross-validation with 2 repeats as our evaluation procedure
# (instead of the default "bootstrap").
cv_control =
  caret::trainControl(method = "repeatedcv",
                      # Number of folds
                      number = 5L,
                      # Number of complete sets of folds to compute
                      repeats = 2L,
                      # Calculate class probabilities?
                      classProbs = TRUE,
                      # Indicate that our response variable is binary
                      summaryFunction = twoClassSummary)
```
### Define `xgb_grid`
```{r}
# Ask caret what hyperparameters can be tuned for the xgbTree algorithm.
modelLookup("xgbTree")
# turn off scientific notation
options(scipen = 999)
# More details at https://xgboost.readthedocs.io/en/latest/parameter.html
(xgb_grid = expand.grid(
  # Number of trees to fit, aka boosting iterations
  nrounds = c(100, 300, 500, 700, 900),
  # Depth of the decision tree (how many levels of splits)
  max_depth = c(1, 6),
  # Learning rate: lower means the ensemble will adapt more slowly
  eta = c(0.0001, 0.01, 0.2),
  # Make this larger and xgboost will tend to make smaller trees
  gamma = 0,
  # Fraction of columns (features) sampled for each tree
  colsample_bytree = 1.0,
  # Fraction of rows (observations) sampled for each tree
  subsample = 1.0,
  # Minimum sum of instance weights (roughly, observations) a node needs before it is split further
  min_child_weight = 10L))

# gamma, colsample_bytree, and subsample are held at fixed values here, but they can also be tuned.

# How many combinations of settings do we end up with?
nrow(xgb_grid)
```
## Fit model
Note that we will now use the *A*rea *U*nder the ROC *C*urve ("AUC") as our performance metric. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across classification thresholds, and the AUC summarizes that trade-off in a single number.
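As a quick illustration of what the AUC measures, here is a toy example with made-up labels and predicted probabilities (the names `toy_truth` and `toy_probs` are invented for this illustration), using the pROC package we loaded above:
```{r auc_toy}
# Hypothetical labels and predicted probabilities, invented for illustration.
toy_truth = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"))
toy_probs = c(0.10, 0.35, 0.40, 0.62, 0.38, 0.60, 0.80, 0.90)

# AUC of 1 = the probabilities perfectly separate the classes;
# AUC of 0.5 = no better than random guessing.
pROC::auc(pROC::roc(response = toy_truth, predictor = toy_probs,
                    levels = c("no", "yes"), direction = "<"))
```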
One wrinkle: caret requires the outcome factor levels to be valid R variable names when computing class probabilities, so our integer 1s and 0s will not do. Let's quickly recode the 1s as "yes" and the 0s as "no".
```{r}
xgb_train_y_class = as.factor(ifelse(train_y_class == 1, "yes", "no"))
xgb_test_y_class = as.factor(ifelse(test_y_class == 1, "yes", "no"))
table(train_y_class, xgb_train_y_class)
table(test_y_class, xgb_test_y_class)
```
> NOTE: This will take a few minutes to complete!
```{r xgb_fit, cache = TRUE}
set.seed(1)
model = caret::train(
  xgb_train_y_class ~ .,
  # cbind: caret expects the Y response and X predictors in the same dataframe
  data = cbind(xgb_train_y_class, train_x_class),
  # Use xgboost's tree-based booster (gradient boosted trees)
  method = "xgbTree",
  # Use AUC as our performance metric, which caret's twoClassSummary labels "ROC"
  metric = "ROC",
  # Specify our cross-validation settings
  trControl = cv_control,
  # Test multiple configurations of the xgboost algorithm
  tuneGrid = xgb_grid,
  # Hide detailed output (set to TRUE to print it)
  verbose = FALSE)
# See how long this algorithm took to complete (from ?proc.time)
# user time = the CPU time charged for the execution of user instructions of the calling process
# system time = the CPU time charged for execution by the system on behalf of the calling process
# elapsed time = real time since the process was started
model$times
```
Review the model summary table:
```{r}
model
# model$bestTune = "The final values used for the model were..."
```
## Investigate Results
```{r}
# Extract the hyperparameters with the best performance
model$bestTune
# And the corresponding performance metrics (merged on the hyperparameter columns).
merge(model$bestTune, model$results)
# Plot the performance across all hyperparameter combinations. Nice!
options(scipen = 999)
ggplot(model) + theme_bw() + ggtitle("Xgboost hyperparameter comparison")
# Show variable importance (text).
caret::varImp(model)
# This version uses the complex caret object
vip::vip(model) + theme_minimal()
# This version operates on the xgboost model within the caret object
vip::vip(model$finalModel) + theme_minimal()
# Generate predicted labels.
predicted_labels = predict(model, test_x_class)
table(xgb_test_y_class, predicted_labels)
# Generate class probabilities.
pred_probs = predict(model, test_x_class, type = "prob")
head(pred_probs)
# Confusion matrix of predicted vs. observed labels on the test set
(cm = confusionMatrix(predicted_labels, xgb_test_y_class))
# Define ROC characteristics
(rocCurve = pROC::roc(response = xgb_test_y_class,
predictor = pred_probs[, "yes"],
levels = rev(levels(xgb_test_y_class)),
auc = TRUE, ci = TRUE))
# Plot ROC curve with optimal threshold.
plot(rocCurve,
print.thres.cex = 2,
print.thres = "best",
main = "XGBoost on test set", col = "blue", las = 1)
# Get specificity and sensitivity at particular probability thresholds
pROC::coords(rocCurve, 0.01, transpose = FALSE)
pROC::coords(rocCurve, 0.525, transpose = FALSE)
pROC::coords(rocCurve, 0.99, transpose = FALSE)
```
## Challenge 4
Open Challenge 4 in the "Challenges" folder.