```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(applicable)
library(patchwork)
library(probably)
load("RData/Chicago_2020.RData")
data(Chicago)
```

# When Should You Trust Your Predictions? {#trust}

A predictive model can almost always produce a prediction, given input data. However, in plenty of situations it is inappropriate to produce such a prediction. When a new data point is well outside of the range of data used to create the model, making a prediction may be an inappropriate _extrapolation_. A more qualitative example of an inappropriate prediction would be when the model is used in a completely different context. The cell segmentation data used in Chapter \@ref(iterative-search) flags when human breast cancer cells can or cannot be accurately isolated inside an image. A model built from these data could be inappropriately applied to stomach cells for the same purpose. We can produce a prediction but it is unlikely to be applicable to the different cell type.

This chapter discusses two methods for quantifying the potential quality of a prediction:

- _Equivocal zones_ use the predicted values to alert the user that results may be suspect.
- _Applicability_ uses the predictors to measure the amount of extrapolation (if any) for new samples.

## Equivocal Results {#equivocal-zones}

:::rmdwarning
In some cases, the amount of uncertainty associated with a prediction is too high to be trusted.
:::

If a model result indicated that you had a 51% chance of having contracted COVID-19, it would be natural to view the diagnosis with some skepticism. In fact, regulatory bodies often require many medical diagnostics to have an _equivocal zone_. This zone is a range of results in which the prediction should not be reported, for example, a range of COVID-19 test results that are too uncertain to share with a patient. See @Danowski524 and @Kerleguer1783 for examples. The same notion can be applied to models created outside of medical diagnostics.

Let's use a function that can simulate classification data with two classes and two predictors (`x` and `y`). The true model is a logistic regression model with the equation:
$$
\mathrm{logit}(p) = -1 - 2x - \frac{x^2}{5} + 2y^2
$$
The two predictors follow a bivariate normal distribution with a correlation of 0.70. We'll create a training set of 200 samples and a test set of 50:
```{r trust-simulation}
library(tidymodels)
tidymodels_prefer()
simulate_two_classes <- 
  function (n, error = 0.1, eqn = quote(-1 - 2 * x - 0.2 * x^2 + 2 * y^2)) {
    # Slightly correlated predictors
    sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2, ncol = 2)
    dat <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = sigma)
    colnames(dat) <- c("x", "y")
    cls <- paste0("class_", 1:2)
    
    dat <- 
      as_tibble(dat) %>% 
      mutate(
        linear_pred = !!eqn,
        # Add some misclassification noise
        linear_pred = linear_pred + rnorm(n, sd = error),
        prob = binomial()$linkinv(linear_pred),
        class = ifelse(prob > runif(n), cls[1], cls[2]),
        class = factor(class, levels = cls)
      )
    dplyr::select(dat, x, y, class)
  }
set.seed(1901)
training_set <- simulate_two_classes(200)
testing_set <- simulate_two_classes(50)
```
We estimate a logistic regression model using Bayesian methods (using the default Gaussian prior distributions for the parameters):
```{r trust-bayes-glm}
two_class_mod <- 
  logistic_reg() %>% 
  set_engine("stan", seed = 1902) %>% 
  fit(class ~ . + I(x^2) + I(y^2), data = training_set)
print(two_class_mod, digits = 3)
```
The fitted class boundary is overlaid onto the test set in Figure \@ref(fig:glm-boundaries). The data points closest to the class boundary are the most uncertain. If their values changed slightly, their predicted class might change. One simple method for disqualifying some results is to call them "equivocal" if the values are within some range around 50% (or the appropriate probability cutoff for a certain situation). Depending on the problem the model is being applied to, this might indicate that we should collect another measurement or that we require more information before a trustworthy prediction is possible.

```{r glm-boundaries}
#| echo = FALSE,
#| fig.width=6,
#| fig.height=6,
#| out.width="70%",
#| fig.align="center",
#| fig.cap = "Simulated two-class data set with a logistic regression fit and decision boundary.",
#| fig.alt = "Simulated two-class data set with a logistic regression fit and decision boundary. The scatter plot of the two classes shows fairly correlated data. The decision boundary is a parabola in the x axis that does a good job of separating the classes."
data_grid <-
crossing(
x = seq(-4.5, 4.5, length = 100),
y = seq(-4.5, 4.5, length = 100)
)
grid_pred <-
predict(two_class_mod, data_grid, type = "prob") %>%
bind_cols(
predict(two_class_mod, data_grid, type = "pred_int", std_error = TRUE),
data_grid
)
grid_pred %>%
mutate(`Probability of Class 1` = .pred_class_1) %>%
ggplot(aes(x = x, y = y)) +
geom_raster(aes(fill = `Probability of Class 1`)) +
geom_point(data = testing_set, aes(shape = class, color = class), alpha = .75, size = 2.5) +
geom_contour(aes(z = .pred_class_1), breaks = .5, color = "black", lty = 2) +
coord_equal() +
labs(x = "Predictor x", y = "Predictor y", color = NULL, shape = NULL) +
scale_fill_gradient2(low = "#FDB863", mid = "white", high = "#B2ABD2", midpoint = .5) +
scale_color_manual(values = c("#2D004B", "darkorange"))
```
We could base the width of the band around the cutoff on how performance improves when the uncertain results are removed. However, we should also estimate the reportable rate (the expected proportion of usable results). For example, it would not be useful in real-world situations to have perfect performance but release predictions on only 2% of the samples passed to the model.
Let's use the test set to determine the balance between improving performance and having enough reportable results. The predictions are created using:
```{r trust-bayes-glm-pred}
test_pred <- augment(two_class_mod, testing_set)
test_pred %>% head()
```
With tidymodels, the `r pkg(probably)` package contains functions for equivocal zones. For cases with two classes, the `make_two_class_pred()` function creates a factor-like column that has the predicted classes with an equivocal zone:
```{r trust-make-eq}
library(probably)
lvls <- levels(training_set$class)
test_pred <- 
  test_pred %>% 
  mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = 0.15))

test_pred %>% count(.pred_with_eqz)
```
Rows that are within $0.50\pm0.15$ are given a value of `[EQ]`.

:::rmdnote
The notation `[EQ]` in this example is not a factor level but an attribute of that column.
:::

Since the factor levels are the same as the original data, confusion matrices and other statistics can be computed without error. When using standard functions from the `r pkg(yardstick)` package, the equivocal results are converted to `NA` and are not used in the calculations that use the hard class predictions. Notice the differences in these confusion matrices:
```{r trust-conf-mat}
# All data
test_pred %>% conf_mat(class, .pred_class)
# Reportable results only:
test_pred %>% conf_mat(class, .pred_with_eqz)
```
An `is_equivocal()` function is also available for filtering these rows from the data.
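
As a quick sketch (not part of the original analysis), the equivocal rows could be dropped in this way before computing a metric:

```{r trust-eq-filter-sketch, eval = FALSE}
# Keep only the reportable predictions, then compute accuracy on those rows
test_pred %>% 
  filter(!is_equivocal(.pred_with_eqz)) %>% 
  accuracy(class, .pred_with_eqz)
```
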
Does the equivocal zone help improve accuracy? Let's look at different buffer sizes, as shown in Figure \@ref(fig:equivocal-zone-results):
```{r trust-eq-calcs, eval = FALSE}
# A function to change the buffer then compute performance.
eq_zone_results <- function(buffer) {
  test_pred <- 
    test_pred %>% 
    mutate(.pred_with_eqz = make_two_class_pred(.pred_class_1, lvls, buffer = buffer))
  acc <- test_pred %>% accuracy(class, .pred_with_eqz)
  rep_rate <- reportable_rate(test_pred$.pred_with_eqz)
  tibble(accuracy = acc$.estimate, reportable = rep_rate, buffer = buffer)
}

# Evaluate a sequence of buffers and plot the results.
map(seq(0, .1, length.out = 40), eq_zone_results) %>% 
  list_rbind() %>% 
  pivot_longer(c(-buffer), names_to = "statistic", values_to = "value") %>% 
  ggplot(aes(x = buffer, y = value, lty = statistic)) + 
  geom_step(linewidth = 1.2, alpha = 0.8) + 
  labs(y = NULL, lty = NULL)
```
```{r equivocal-zone-results, ref.label = "trust-eq-calcs"}
#| echo = FALSE,
#| out.width="80%",
#| fig.cap = "The effect of equivocal zones on model performance",
#| fig.alt = "The effect of equivocal zones on model performance. There is a slight increase in accuracy at the expense of a falling reportable rate."
```
Figure \@ref(fig:equivocal-zone-results) shows us that accuracy improves by a few percentage points but at the cost of nearly 10% of predictions being unusable! The value of such a compromise depends on how the model predictions will be used.
This analysis focused on using the predicted class probability to disqualify points, since this is a fundamental measure of uncertainty in classification models. A slightly better approach would be to use the standard error of the class probability. Since we used a Bayesian model, the probability estimates we found are actually the mean of the posterior predictive distribution. In other words, the Bayesian model gives us a distribution for the class probability. Measuring the standard deviation of this distribution gives us a _standard error of prediction_ of the probability. In most cases, this value is directly related to the mean class probability. You might recall that, for a Bernoulli random variable with probability $p$, the variance is $p(1-p)$. Because of this relationship, the standard error is largest when the probability is 50%. Instead of assigning an equivocal result using the class probability, we could instead use a cutoff on the standard error of prediction.
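
As a quick numerical sketch of the variance relationship mentioned above (not part of the original analysis), the Bernoulli variance $p(1-p)$, and therefore the standard error, is maximized at a probability of 0.5:

```{r trust-bernoulli-sketch, eval = FALSE}
# Evaluate p * (1 - p) over a grid of probabilities; the maximum occurs at p = 0.5
p_grid <- seq(0, 1, by = 0.01)
p_grid[which.max(p_grid * (1 - p_grid))]
```
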
One important aspect of the standard error of prediction is that it takes into account more than just the class probability. In cases where there is significant extrapolation or aberrant predictor values, the standard error might increase. The benefit of using the standard error of prediction is that it might also flag predictions that are problematic (as opposed to simply uncertain). One reason we used the Bayesian model is that it naturally estimates the standard error of prediction; not many models can calculate this. For our test set, using `type = "pred_int"` will produce upper and lower limits, and the `std_error = TRUE` argument adds a column for that quantity. For 80% intervals:
```{r trust-pred-int}
test_pred <- 
  test_pred %>% 
  bind_cols(
    predict(two_class_mod, testing_set, type = "pred_int", std_error = TRUE)
  )
```
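
To sketch how such a cutoff could be applied (this is illustrative only and not from the original analysis; the 0.1 threshold is arbitrary and would need to be chosen for the application at hand):

```{r trust-se-cutoff-sketch, eval = FALSE}
# Flag rows whose standard error of prediction exceeds an arbitrary cutoff
test_pred %>% 
  mutate(se_reportable = .std_error <= 0.1) %>% 
  count(se_reportable)
```
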
For our example where the model and data are well behaved, Figure \@ref(fig:std-errors) shows the standard error of prediction across the space:
```{r std-errors}
#| echo = FALSE,
#| fig.width=5,
#| fig.height=5,
#| out.width="70%",
#| fig.align="center",
#| fig.cap = "The effect of the standard error of prediction overlaid with the test set data",
#| fig.alt = "The effect of the standard error of prediction overlaid with the test set data. The region of large variation is very similar to the class boundary space. Additionally, there is a large amount of variation to the west of the inflection point of the boundary curve."
grid_pred %>%
mutate(`Std Error` = .std_error) %>%
ggplot(aes(x = x, y = y)) +
geom_raster(aes(fill = `Std Error`)) +
scale_fill_gradientn(colours = c("#F7FBFF", "#DEEBF7", "#C6DBEF", "#9ECAE1", "#6BAED6")) +
geom_point(data = testing_set, aes(shape = class), alpha = .5, size = 2) +
coord_equal() +
labs(x = "Predictor x", y = "Predictor y", shape = NULL)
```
Using the standard error as a measure to preclude samples from being predicted can also be applied to models with numeric outcomes. However, as shown in the next section, this may not always work.

## Determining Model Applicability {#applicability-domains}

Equivocal zones try to measure the reliability of a prediction based on the model outputs. It may be that model statistics, such as the standard error of prediction, cannot measure the impact of extrapolation, and so we need another way to assess whether to trust a prediction and answer, "Is our model applicable for predicting a specific data point?" Let's take the Chicago train data used extensively in [Kuhn and Johnson (2019)](https://bookdown.org/max/FES/chicago-intro.html) and first shown in Section \@ref(examples-of-tidyverse-syntax). The goal is to predict the number of customers entering the Clark and Lake train station each day.
The data set in the `r pkg(modeldata)` package (a tidymodels package with example data sets) has daily values between `r format(min(Chicago$date), "%B %d, %Y")` and `r format(max(Chicago$date), "%B %d, %Y")`. Let's create a small test set using the last two weeks of the data:
```{r trust-chicago-data}
## loads both `Chicago` data set as well as `stations`
data(Chicago)
Chicago <- Chicago %>% select(ridership, date, one_of(stations))
n <- nrow(Chicago)
Chicago_train <- Chicago %>% slice(1:(n - 14))
Chicago_test <- Chicago %>% slice((n - 13):n)
```
The main predictors are lagged ridership data at different train stations, including Clark and Lake, as well as the date. The ridership predictors are highly correlated with one another. In the following recipe, the date column is expanded into several new features, and the ridership predictors are represented using partial least squares (PLS) components. PLS [@Geladi:1986], as we discussed in Section \@ref(partial-least-squares), is a supervised version of principal component analysis where the new features have been decorrelated but are predictive of the outcome data.
Using the preprocessed data, we fit a standard linear model:
```{r trust-chicago-model}
base_recipe <-
  recipe(ridership ~ ., data = Chicago_train) %>%
  # Create date features
  step_date(date) %>%
  step_holiday(date, keep_original_cols = FALSE) %>%
  # Create dummy variables from factor columns
  step_dummy(all_nominal()) %>%
  # Remove any columns with a single unique value
  step_zv(all_predictors()) %>%
  step_normalize(!!!stations) %>%
  step_pls(!!!stations, num_comp = 10, outcome = vars(ridership))

lm_spec <-
  linear_reg() %>%
  set_engine("lm") 

lm_wflow <-
  workflow() %>%
  add_recipe(base_recipe) %>%
  add_model(lm_spec)

set.seed(1902)
lm_fit <- fit(lm_wflow, data = Chicago_train)
```
How well does the model fit the test set? We can use `predict()` on the test set to find both predictions and prediction intervals:
```{r trust-chicago-test-res}
res_test <-
  predict(lm_fit, Chicago_test) %>%
  bind_cols(
    predict(lm_fit, Chicago_test, type = "pred_int"),
    Chicago_test
  )

res_test %>% select(date, ridership, starts_with(".pred"))
res_test %>% rmse(ridership, .pred)
```
These are fairly good results. Figure \@ref(fig:chicago-2016) visualizes the predictions along with 95% prediction intervals.
```{r chicago-2016}
#| echo = FALSE,
#| fig.height=4,
#| out.width="80%",
#| fig.align="center",
#| fig.cap = "Two weeks of 2016 predictions for the Chicago data along with 95% prediction intervals",
#| fig.alt = "Two weeks of 2016 predictions for the Chicago data along with 95% prediction intervals. The model fit the data fairly well with reasonable error estimates."
add_day <- function(x) {
day <- lubridate::wday(x$date, label = TRUE)
factor(as.character(day), ordered = FALSE, levels = levels(day))
}
res_test %>%
mutate(day = add_day(.)) %>%
ggplot(aes(x = date)) +
geom_point(aes(y = ridership, color = day, pch = day), size = 3) +
geom_line(aes(y = .pred), alpha = .75) +
geom_ribbon(aes(ymin = .pred_lower, ymax = .pred_upper), fill = "blue", alpha = .1) +
scale_color_brewer(palette = "Set2") +
scale_shape_manual(values = 15:22) +
scale_x_date(labels = date_format("%B %d, %Y")) +
labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL, pch = NULL)
```
Given the scale of the ridership numbers, these results look particularly good for such a simple model. If this model were deployed, how well would it have done a few years later in June 2020? The model successfully makes a prediction, as a predictive model almost always will when given input data:
```{r trust-chicago-2020-res}
res_2020 <-
  predict(lm_fit, Chicago_2020) %>%
  bind_cols(
    predict(lm_fit, Chicago_2020, type = "pred_int"),
    Chicago_2020
  )

res_2020 %>% select(date, contains(".pred"))
```
The prediction intervals are about the same width, even though these data are well beyond the time period of the original training set. However, given the global pandemic in 2020, the performance on these data is abysmal:
```{r trust-chicago-2020-stats}
res_2020 %>% select(date, ridership, starts_with(".pred"))
res_2020 %>% rmse(ridership, .pred)
```
You can see this terrible model performance visually in Figure \@ref(fig:chicago-2020).
```{r chicago-2020}
#| echo = FALSE,
#| fig.height=4,
#| out.width="80%",
#| fig.align="center",
#| fig.cap = "Two weeks of 2020 predictions for the Chicago data along with 95% prediction intervals",
#| fig.alt = "Two weeks of 2020 predictions for the Chicago data along with 95% prediction intervals. The model fit these data poorly, with actual ridership far below the predicted values."
res_2020 %>%
mutate(day = add_day(.)) %>%
ggplot(aes(x = date)) +
geom_point(aes(y = ridership, color = day, pch = day), size = 3) +
geom_line(aes(y = .pred), alpha = .75) +
geom_ribbon(aes(ymin = .pred_lower, ymax = .pred_upper), fill = "blue", alpha = .1) +
scale_shape_manual(values = 15:22) +
scale_color_brewer(palette = "Set2") +
scale_x_date(labels = date_format("%B %d, %Y")) +
labs(x = NULL, y = "Daily Ridership (x1000)", color = NULL, pch = NULL)
```
Confidence and prediction intervals for linear regression expand as the data become more and more removed from the center of the training set. However, that effect is not dramatic enough to flag these predictions as being poor.

:::rmdwarning
Sometimes the statistics produced by models don't measure the quality of predictions very well.
:::

This situation can be avoided by having a secondary methodology that can quantify how applicable the model is for any new prediction (i.e., the model's _applicability domain_). There are a variety of methods to compute an applicability domain model, such as @Jaworska or @Netzeva. The approach used in this chapter is a fairly simple unsupervised method that attempts to measure how much (if any) a new data point is beyond the training data.^[@Bartley shows yet another method and applies it to ecological studies.]

:::rmdnote
The idea is to accompany a prediction with a score that measures how similar the new point is to the training set.
:::

One method that works well uses principal component analysis (PCA) on the numeric predictor values. We'll illustrate the process by using only two of the predictors that correspond to ridership at different stations (the California and Austin stations). The training set is shown in panel (a) of Figure \@ref(fig:pca-reference-dist). The ridership data for these stations are highly correlated, and the two distributions shown in the scatter plot correspond to ridership on weekends and weekdays.
The first step is to conduct PCA on the training data. The PCA scores for the training set are shown in panel (b) in Figure \@ref(fig:pca-reference-dist). Next, using these results, we measure the distance of each training set point to the center of the PCA data (panel (c) of Figure \@ref(fig:pca-reference-dist)). We can then use this _reference distribution_ (panel (d) of Figure \@ref(fig:pca-reference-dist)) to estimate how far a data point is from the mainstream of the training data.
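
As a compact sketch of those steps for the two stations (this mirrors, in simplified form, the computation behind Figure \@ref(fig:pca-reference-dist); the object names here are illustrative only):

```{r trust-pca-distance-sketch, eval = FALSE}
# PCA on the two normalized stations, then the Euclidean distance of each
# training point to the center of the score space
pca_sketch <- recipe(~ California + Austin, data = Chicago_train) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 2) %>% 
  prep()

train_scores <- bake(pca_sketch, new_data = NULL)
ref_distances <- sqrt((train_scores$PC1 - mean(train_scores$PC1))^2 + 
                      (train_scores$PC2 - mean(train_scores$PC2))^2)
```
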
```{r pca-reference-dist}
#| echo = FALSE,
#| out.width = "100%",
#| fig.cap = "The PCA reference distribution based on the training set",
#| fig.alt = "The PCA reference distribution based on the training set. The majority of the distances to the center of the PCA distribution are below a value of three."
pca_rec <- recipe(~ ., data = Chicago_train) %>%
step_normalize(California, Austin) %>%
step_pca(California, Austin, num_comp = 2) %>%
prep()
training_pca <- bake(pca_rec, new_data = NULL)
pca_center <-
training_pca %>%
select(PC1, PC2) %>%
summarize(PC1_mean = mean(PC1), PC2_mean = mean(PC2))
training_pca <-
cbind(pca_center, training_pca) %>%
mutate(
distance = (PC1 - PC1_mean)^2 + (PC2 - PC2_mean)^2,
distance = sqrt(distance)
)
testing_pca <-
bake(pca_rec, Chicago_test %>% slice(1)) %>%
cbind(pca_center) %>%
mutate(
distance = (PC1 - PC1_mean)^2 + (PC2 - PC2_mean)^2,
distance = sqrt(distance)
)
testing_pctl <- round(mean(training_pca$distance <= testing_pca$distance) * 100, 1)
new_pca <-
bake(pca_rec, Chicago_2020 %>% slice(6)) %>%
cbind(pca_center) %>%
mutate(
distance = (PC1 - PC1_mean)^2 + (PC2 - PC2_mean)^2,
distance = sqrt(distance)
)
new_pctl <- round(mean(training_pca$distance <= new_pca$distance) * 100, 1)
tr_plot <-
Chicago_train %>%
ggplot(aes(x = California, y = Austin)) +
geom_point(alpha = .25, size = .3) +
# coord_equal() +
labs(title = "(a) Training Set") +
theme(plot.title = element_text(size=9))
pca_plot <- training_pca %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_point(alpha = .25, size = .3) +
coord_obs_pred() +
labs(x = "Component 1", y = "Component 2", title = "(b) Training Set PCA Scores") +
theme(plot.title = element_text(size = 9))
pca_dist <-
training_pca %>%
ggplot() +
geom_segment(aes(x = PC1_mean, y = PC2_mean,
xend = PC1, yend = PC2), alpha = .1) +
coord_obs_pred() +
labs(x = "Component 1", y = "Component 2", title = "(c) Distances to Center") +
theme(plot.title = element_text(size = 9))
dist_hist <-
training_pca %>%
ggplot(aes(x = distance)) +
geom_histogram(bins = 30, color = "white") +
labs(x = "Distance to Training Set Center", title = "(d) Reference Distribution") +
theme(plot.title = element_text(size = 9))
library(patchwork)
tr_plot + pca_plot + pca_dist + dist_hist
```
For a new sample, the PCA scores are computed along with the distance to the center of the training set.
However, what does it mean when a new sample has a distance of _X_? Since the PCA components can have different ranges from data set to data set, there is no obvious limit to say that a distance is too large.
One approach is to treat the distances from the training set data as "normal." For new samples, we can determine how the new distance compares to the range in the reference distribution (from the training set). A percentile can be computed for each new sample that reflects how much of the training set is less extreme than that sample (a small sketch of this computation follows the note below).

:::rmdnote
A percentile of 90% means that most of the training set data are closer to the data center than the new sample.
:::

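A minimal sketch of that percentile computation, assuming a vector of training set distances and a single new distance (the `distance_percentile()` helper is hypothetical, not a package function):

```{r trust-percentile-sketch, eval = FALSE}
# Percentile of a new distance relative to the training set reference distribution
distance_percentile <- function(new_distance, ref_distances) {
  mean(ref_distances <= new_distance) * 100
}
```
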
The plot in Figure \@ref(fig:two-new-points) overlays a testing set sample (triangle and dashed line) and a 2020 sample (circle and solid line) with the PCA distances from the training set.
```{r two-new-points}
#| echo = FALSE,
#| out.width = "100%",
#| fig.width=9,
#| fig.height=4,
#| fig.cap = "The reference distribution with two new points: one using the test set and one from the 2020 data",
#| fig.alt = "The reference distribution with two new points: one using the test set and one from the 2020 data. The test set point is snugly within the data mainstream while the 2020 point is outside of the reference distribution."
test_pca_dist <-
training_pca %>%
ggplot() +
geom_segment(
aes(x = PC1_mean, y = PC2_mean, xend = PC1, yend = PC2),
alpha = .05
) +
geom_segment(
data = testing_pca,
aes(x = PC1_mean, y = PC2_mean, xend = PC1, yend = PC2),
color = "lightblue",
lty = 2
) +
geom_segment(
data = new_pca,
aes(x = PC1_mean, y = PC2_mean, xend = PC1, yend = PC2),
color = "red"
) +
geom_point(data = testing_pca, aes(x = PC1, y = PC2), color = "lightblue", size = 2, pch = 17) +
geom_point(data = new_pca, aes(x = PC1, y = PC2), size = 2, color = "red") +
coord_obs_pred() +
labs(x = "Component 1", y = "Component 2", title = "Distances to Training Set Center") +
theme_bw() +
theme(legend.position = "top")
test_dist_hist <-
training_pca %>%
ggplot(aes(x = distance)) +
geom_histogram(bins = 30, color = "white", alpha = .5) +
geom_vline(xintercept = testing_pca$distance, color = "lightblue", lty = 2) +
geom_vline(xintercept = new_pca$distance, color = "red") +
xlab("Distance to Training Set Center")
test_pca_dist + test_dist_hist
```
The test set point has a distance of `r round(testing_pca$distance, 2)`. It falls at the `r testing_pctl` percentile of the training set distribution, indicating that it is snugly within the mainstream of the training set.
The 2020 sample is farther from the center than any of the training set samples (with a percentile of `r new_pctl`%). This indicates the sample is very extreme and that its corresponding prediction would be a severe extrapolation (and probably should not be reported).
The `r pkg(applicable)` package can develop an applicability domain model using PCA. We'll use the 20 lagged station ridership predictors as inputs into the PCA analysis. There is an additional argument called `threshold` that determines how many components are used in the distance calculation. For our example, we'll use a large value that indicates we should use enough components to account for 99% of the variation in the ridership predictors:
```{r trust-apd-pca}
library(applicable)
pca_stat <- apd_pca(~ ., data = Chicago_train %>% select(one_of(stations)), 
                    threshold = 0.99)
pca_stat
```
The `autoplot()` method plots the reference distribution. It has an optional argument for which data to plot. We'll add a value of `distance` to plot only the training set distance distribution. This code generates the plot in Figure \@ref(fig:ap-autoplot):
```{r trust-ref-dist, eval = FALSE}
autoplot(pca_stat, distance) + labs(x = "distance")
```
```{r ap-autoplot, ref.label = "trust-ref-dist"}
#| echo = FALSE,
#| out.width="70%",
#| fig.cap = "The results of using the `autoplot()` method on an applicable object",
#| fig.alt = "The results of using the `autoplot()` method on an applicable object."
```
The x-axis shows the values of the distance and the y-axis displays the distribution's percentiles. For example, half of the training set samples had distances less than `r round(approx(pca_stat$pctls$percentile, pca_stat$pctls$distance, xout = 50)$y, 1)`.
To compute the percentiles for new data, the `score()` function works in the same way as `predict()`:
```{r trust-apd-test-scores}
score(pca_stat, Chicago_test) %>% select(starts_with("distance"))
```
These seem fairly reasonable. For the 2020 data:
```{r trust-apd-2020-scores}
score(pca_stat, Chicago_2020) %>% select(starts_with("distance"))
```
The 2020 distance values indicate that these predictor values are outside of the vast majority of data seen by the model at training time. These should be flagged so that the predictions are either not reported at all or viewed with skepticism.
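
One way to operationalize that flagging, shown here only as a sketch (the 95th-percentile cutoff is arbitrary, and `distance_pctl` is assumed to be the percentile column returned by `score()`):

```{r trust-apd-flag-sketch, eval = FALSE}
# Join the 2020 predictions with their applicability percentiles and mark
# rows that fall beyond an arbitrary 95th-percentile cutoff
res_2020 %>% 
  bind_cols(score(pca_stat, Chicago_2020) %>% select(starts_with("distance"))) %>% 
  mutate(outside_domain = distance_pctl > 95) %>% 
  select(date, .pred, distance_pctl, outside_domain)
```
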

:::rmdnote
One important aspect of this analysis concerns which predictors are used to develop the applicability domain model. In our analysis, we used the raw predictor columns. However, when building the model, PLS score features were used in their place. Which of these should `apd_pca()` use? The `apd_pca()` function can also take a recipe as its input (instead of a formula) so that the distances reflect the PLS scores instead of the individual predictor columns; a sketch of this follows the note. You can evaluate both methods to understand which one gives more relevant results.
:::

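A minimal sketch of the recipe-based approach, assuming the `base_recipe` defined earlier in this chapter; treat this as an outline to adapt rather than a definitive analysis:

```{r trust-apd-recipe-sketch, eval = FALSE}
# Build the applicability domain on the recipe's PLS features rather than
# the raw station columns (sketch only)
pls_stat <- apd_pca(base_recipe, data = Chicago_train, threshold = 0.99)
score(pls_stat, Chicago_2020) %>% select(starts_with("distance"))
```
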

## Chapter Summary {#trust-summary}

This chapter showed two methods for evaluating whether predictions should be reported to the consumers of models. Equivocal zones deal with outcomes/predictions and can be helpful when the amount of uncertainty in a prediction is too large.
Applicability domain models deal with features/predictors and quantify the amount of extrapolation (if any) that occurs when making a prediction. This chapter showed a basic method using principal component analysis, although there are many other ways to measure applicability. The `r pkg(applicable)` package also contains specialized methods for data sets where all of the predictors are binary. This method computes similarity scores between training set data points to define the reference distribution.