```{r include=FALSE}
source(here::here("R", "common.R"))
knitr::opts_chunk$set(fig.path = "figs/ch09/")
```
::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::
# Hotelling's $T^2$ {#sec-Hotelling}
Just as the one- and two-sample univariate $t$-test is the gateway drug for understanding
analysis of variance, so too Hotelling's $T^2$ test provides an entry point to multivariate analysis of variance. This simple case provides an entry point to understanding the collection of methods
I call the **HE plot framework** for visualizing effects in multivariate linear models, which
are a main focus of this book.
The essential idea is that Hotelling's $T^2$ provides a test of the difference in
means between two groups on a _collection_ of variables, $\mathbf{x} = x_1, x_2, \dots x_p$
_simultaneously_, rather than one by one. This has the advantages that it:
* does not require $p$-value corrections for multiple tests (e.g., Bonferroni);
* combines the evidence from the multiple response variables, and _pools strength_,
accumulating support across measures;
* clarifies how the multiple responses are _jointly_ related to the group effect along a single dimension, the _discriminant axis_;
* in so doing, it reduces the problem of testing differences for two (and potentially more)
response variables to testing the difference on a single variable that best captures
the multivariable relations.
After describing its features, I use an example of a two-group $T^2$ test to illustrate
the basic ideas behind multivariate tests and hypothesis error plots.
Then, we'll dip our toes into the visual ideas for representing the statistical quantities
involved in such tests.
**Packages**
In this chapter we use the following packages. Load them now.
```{r}
library(car)
library(heplots)
library(Hotelling)
library(ggplot2)
library(dplyr)
library(tidyr)
library(ggbiplot)
library(broom)
```
## $T^2$ as a generalized $t$-test
Hotelling's $T^2$ [@Hotelling:1931] is an analog of the square of a univariate $t$ statistic,
extended to the case of two or more response variables tested together.
Consider the basic one-sample $t$-test, where we have a set of $N$ scores on a test of basic math
with mean $\bar{x}$ and standard deviation $s$, and we wish to test whether the population mean
differs from an assumed value $\mu_0 = 150$. The $t$ statistic for testing $\mathcal{H}_0 : \mu = \mu_0$ against the
two-sided alternative, $\mathcal{H}_1 : \mu \ne \mu_0$, is
$$
t = \frac{(\bar{x} - \mu_0)}{s / \sqrt{N}} = \frac{(\bar{x} - \mu_0)\sqrt{N}}{s}
$$
Squaring this gives
$$
t^2 = \frac{N (\bar{x} - \mu_0)^2}{s^2} = N (\bar{x} - \mu_0)(s^2)^{-1} (\bar{x} - \mu_0)
$$
Now consider we also have measures on a test of solving word problems for the same sample.
Then, a hypothesis test for the means on basic math (BM) and word problems (WP)
is the test of the means of these two variables jointly equal some specified values,
say, $(\mu_{0,BM}=150,\; \mu_{0,WP} =100)$:
$$
\mathcal{H}_0 : \mathbf{\mu} = \mathbf{\mu_0} =
\begin{pmatrix}
\mu_{0,BM} \\ \mu_{0,WP}
\end{pmatrix}
=
\begin{pmatrix}
150 \\ 100
\end{pmatrix}
$$
<!-- ![](equations/eqn-mathscore1.png){width=50% fig-align="center"} -->
Hotelling's $T^2$ is then the analog of $t^2$, with the variance-covariance matrix $\mathbf{S}$ of the scores
on (BM, WP) replacing the variance of a single score. This is nothing more than $N$ times the
squared Mahalanobis distance $D^2_M$ between the sample mean vector $(\bar{x}_{BM}, \bar{x}_{WP})^\mathsf{T}$
and the hypothesized means $\mathbf{\mu}_0$, in the metric of $\mathbf{S}$, as shown in @fig-T2-diagram.
\begin{align*}
T^2 &= N (\bar{\mathbf{x}} - \mathbf{\mu}_0)^\mathsf{T} \; \mathbf{S}^{-1} \; (\bar{\mathbf{x}} - \mathbf{\mu}_0) \\
&= N D^2_M (\bar{\mathbf{x}}, \mathbf{\mu}_0)
\end{align*}
```{r}
#| label: fig-T2-diagram
#| echo: false
#| fig-align: center
#| out-width: "50%"
#| fig-cap: "Hotelling's T^2 statistic as the squared distance between the sample means and hypothesized means relative to the variance-covariance matrix. _Source_: Author"
knitr::include_graphics("images/T2-diagram.png")
```
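Before turning to its properties, here is a minimal numerical sketch of this definition. It uses simulated (BM, WP) scores (the values below are illustrative assumptions, not data from the text) and computes $T^2$ both from the matrix formula and via `mahalanobis()`.
```{r}
# A minimal sketch of the definition above, using simulated scores (illustrative only)
set.seed(1235)
X <- MASS::mvrnorm(n = 20, mu = c(155, 105),
                   Sigma = matrix(c(100, 60,
                                     60, 80), nrow = 2))
colnames(X) <- c("BM", "WP")
mu0  <- c(150, 100)       # hypothesized means
xbar <- colMeans(X)       # sample mean vector
S    <- cov(X)            # sample variance-covariance matrix
N    <- nrow(X)
T2 <- N * t(xbar - mu0) %*% solve(S) %*% (xbar - mu0)
# the same quantity, expressed as N times the squared Mahalanobis distance
T2_D2 <- N * mahalanobis(xbar, center = mu0, cov = S)
c(T2 = c(T2), T2_D2 = T2_D2)
```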
## $T^2$ properties {#sec-t2-properties}
Aside from its elegant geometric interpretation, Hotelling's $T^2$ has simple properties that aid in understanding the extension to more complex multivariate tests.
* **Maximum $t^2$** :
Consider constructing a new variable $w$ as a linear combination of the scores in a matrix
$\mathbf{X} = [ \mathbf{x_1}, \mathbf{x_2}, \dots, \mathbf{x_p}]$ with weights $\mathbf{a}$,
$$
w = a_1 \mathbf{x_1} + a_2 \mathbf{x_2} + \dots + a_p \mathbf{x_p} = \mathbf{X} \mathbf{a}
$$
Hotelling's $T^2$ is then the maximum value of a univariate $t^2 (\mathbf{a})$ over all possible choices of the weights in $\mathbf{a}$. In this way, Hotelling's test reduces a multivariate problem to a univariate one.
* **Eigenvalue** : Hotelling showed that $T^2$ is the one non-zero eigenvalue (latent root)
$\lambda$
of the matrix $\mathbf{Q}_H = N (\bar{\mathbf{x}} - \mathbf{\mu}_0) (\bar{\mathbf{x}} - \mathbf{\mu}_0)^\mathsf{T}$ relative to $\mathbf{Q}_E = \mathbf{S}$
that solves the equation
$$
(\mathbf{Q}_H - \lambda \mathbf{Q}_E) \mathbf{a} = 0
$$ {#eq-eigen}
In more complex MANOVA problems, there is more than one non-zero latent
root, $\lambda_1, \lambda_2, \dots, \lambda_s$, and the test statistics (Wilks' $\Lambda$, the Pillai and Hotelling-Lawley trace criteria, Roy's maximum root test)
are functions of these.
* **Eigenvector** : The corresponding eigenvector is
$\mathbf{a} = \mathbf{S}^{-1} (\bar{\mathbf{x}} - \mathbf{\mu}_0)$. These are the
(raw) _discriminant coefficients_, giving the relative contribution of each variable to $T^2$.
* **Critical values** : For a single response, the square of a $t$ statistic with
$N-1$ degrees of freedom is an $F (1, N-1)$ statistic. But we chose $\mathbf{a}$
to give the _maximum_ $t^2 (\mathbf{a})$; this can be taken into account with a
transformation of $T^2$ to give an **exact** $F$ test with the correct
sampling distribution:
$$
F^* = \frac{N - p}{p (N-1)} T^2 \; \sim \; F (p, N - p)
$$ {#eq-Fstat}
* **Invariance under linear transformation** : Just as a univariate $t$-test is unchanged if
we apply a linear transformation to the variable, $x \rightarrow a x + b$, $T^2$ is invariant
under all linear (_affine_) transformations,
$$
\mathbf{x}_{p \times 1} \rightarrow \mathbf{C}_{p \times p} \mathbf{x} + \mathbf{b}
$$
So, you get the same results if you convert penguins' flipper lengths from
millimeters to centimeters or inches.
The same is true for all MANOVA tests.
* **Two-sample tests** : With minor variations in notation, everything above applies to the more usual test of equality of multivariate means in a two-sample test of
$\mathcal{H}_0 : \mathbf{\mu}_1 = \mathbf{\mu}_2$.
$$
T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^\mathsf{T} \; \mathbf{S}_p^{-1} \; (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)
$$
where $\mathbf{S}_p$ is the pooled within-sample variance-covariance matrix and $n_1, n_2$ are the group sample sizes. The analog of @eq-Fstat for this case is
$$
F^* = \frac{n_1 + n_2 - p - 1}{p \, (n_1 + n_2 - 2)} T^2 \; \sim \; F (p,\; n_1 + n_2 - p - 1)
$$
### Example {.unnumbered}
The data set `heplots::mathscore` gives (fictitious) scores on a test of basic math skills (BM) and solving word problems (WP) for two groups of $N=6$ students in an algebra course, each taught by different instructors.
```{r mathscore}
data(mathscore, package = "heplots")
str(mathscore)
```
You can carry out the test that the means for both variables are jointly equal across groups using either
`Hotelling::hotelling.test()` [@R-Hotelling] or `car::Anova()`, but the latter is more generally useful.
```{r mathscore1}
hotelling.test(cbind(BM, WP) ~ group, data=mathscore) |> print()
math.mod <- lm(cbind(BM, WP) ~ group, data=mathscore)
Anova(math.mod)
```
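It is instructive to verify these results by computing the two-sample $T^2$ "by hand" from the formulas in @sec-t2-properties. The sketch below introduces the helper objects `xbar1`, `xbar2` and `Sp` purely for illustration; the values it prints should agree with those reported by `hotelling.test()` and `Anova()` above.
```{r}
# Sketch: two-sample T^2 and its exact F equivalent, computed directly from the formulas
X <- as.matrix(mathscore[, c("BM", "WP")])
g <- mathscore$group
n <- table(g)                                  # group sizes
xbar1 <- colMeans(X[g == 1, , drop = FALSE])   # group mean vectors
xbar2 <- colMeans(X[g == 2, , drop = FALSE])
d <- xbar1 - xbar2                             # mean difference vector
# pooled within-group covariance matrix
Sp <- ((n[1] - 1) * cov(X[g == 1, , drop = FALSE]) +
       (n[2] - 1) * cov(X[g == 2, , drop = FALSE])) / (sum(n) - 2)
T2 <- as.numeric((n[1] * n[2] / sum(n)) * t(d) %*% solve(Sp) %*% d)
p  <- ncol(X)
Fstat <- (sum(n) - p - 1) / (p * (sum(n) - 2)) * T2
c(T2 = T2, F = Fstat)
```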
What's wrong with just doing the two $t$-tests (or equivalent $F$-test with `lm()`)?
```{r mathscore2}
Anova(mod1 <- lm(BM ~ group, data=mathscore))
Anova(mod2 <- lm(WP ~ group, data=mathscore))
```
From this, we might conclude that the two groups do _not_ differ significantly on Basic Math
but strongly differ on Word Problems. But the two univariate tests do not take the correlation
between the responses within groups into account.
If you want to just extract the $t$-tests, here's a handy trick using `broom::tidy.mlm()`,
which summarizes the test statistics for each response and each term in an MLM. The mean difference shown below is that for group 2 minus group 1.
```{r mathscore-tidy}
tidy(math.mod) |>
filter(term != "(Intercept)") |>
select(-term) |>
rename(Mean_diff = estimate,
t = statistic) |>
mutate(signif = noquote(gtools::stars.pval(p.value)))
```
To see the differences between the groups on both variables together, we draw their 68% data
ellipses using `heplots::covEllipses()`. (Setting `pooled=FALSE` here omits drawing
the ellipse for the pooled covariance matrix $\mathbf{S}_p$.)
<!-- figure-code: R/mathscore/mathscore-figs.R -->
```{r}
#| label: fig-mathscore-cov1
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Data ellipses for the `mathscore` data, enclosing approximately 68% of the observations in each group"
colors <- c("darkgreen", "blue")
covEllipses(mathscore[,c("BM", "WP")], mathscore$group,
pooled = FALSE,
col = colors,
fill = TRUE,
fill.alpha = 0.05,
cex = 2, cex.lab = 1.5,
asp = 1,
xlab="Basic math", ylab="Word problems")
# plot points
pch <- ifelse(mathscore$group==1, 15, 16)
col <- ifelse(mathscore$group==1, colors[1], colors[2])
points(mathscore[,2:3], pch=pch, col=col, cex=1.25)
```
We can see that:
* Group 1 > Group 2 on Basic Math, but worse on Word Problems
* Group 2 > Group 1 on Word Problems, but worse on Basic Math
* Within each group, those who do better on Basic Math also do better on Word Problems
We can also see why the univariate test, at least for Basic Math, is non-significant: the
scores for the two groups overlap considerably on the horizontal axis. They are slightly
better separated along the vertical axis for Word Problems. The plot also shows why
Hotelling's $T^2$ yields such a strongly significant result: the two groups are very widely
separated along an approximately 45$^\circ$ line between them.
A relatively simple interpretation is that the groups don't really differ in overall
math ability, but perhaps the instructor in Group 1 put more focus on basic math skills,
while the instructor for Group 2 placed greater emphasis on solving word problems.
In Hotelling's $T^2$, the "size" of the difference between the means (labeled "1" and "2")
is assessed relative to the pooled within-group covariance matrix $\mathbf{S}_p$, which is just a size-weighted average of the two within-sample matrices, $\mathbf{S}_1$ and $\mathbf{S}_2$,
$$
\mathbf{S}_p = [ (n_1 - 1) \mathbf{S}_1 + (n_2 - 1) \mathbf{S}_2 ] / (n_1 + n_2 - 2) \period
$$
Visually, imagine sliding the separate data ellipses to the grand mean,
$(\bar{x}_{\text{BM}}, \bar{x}_{\text{WP}})$, and finding their combined data ellipse.
This is just the data ellipse of the deviations of the scores from their
group means, or equivalently that of the residuals from the model `lm(cbind(BM, WP) ~ group, data=mathscore)`.
To see this, we plot $\mathbf{S}_1$, $\mathbf{S}_2$ and $\mathbf{S}_p$ together:
```{r}
#| label: fig-mathscore-cov2
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Data ellipses and the pooled covariance matrix `mathscore` data."
covEllipses(mathscore[,c("BM", "WP")], mathscore$group,
col = c(colors, "red"),
fill = c(FALSE, FALSE, TRUE),
fill.alpha = 0.3,
cex = 2, cex.lab = 1.5,
asp = 1,
xlab="Basic math", ylab="Word problems")
```
One of the assumptions of the $T^2$ test (and of MANOVA) is that the within-group variance-covariance matrices, $\mathbf{S}_1$ and $\mathbf{S}_2$, are the same.
In @fig-mathscore-cov2, you can see that the shapes of $\mathbf{S}_1$ and $\mathbf{S}_2$ are very
similar, differing mainly in that the variance of Word Problems is slightly greater for group 2.
In Chapter XX we take up the topic of visualizing tests of this assumption, based on
Box's $M$-test.
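Although that discussion is deferred, a quick numerical check is easy to run now. The sketch below assumes the classic data-and-group interface of `heplots::boxM()`; a non-significant result is consistent with equal covariance matrices, though with only $n = 6$ per group the test has little power.
```{r}
# Sketch: Box's M test of equality of the within-group covariance matrices
boxM(mathscore[, c("BM", "WP")], mathscore$group)
```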
## HE plot and discriminant axis {#sec-t2-heplot}
As we describe in detail in @sec-vis-mlm, all the information relevant to the
$T^2$ test and MANOVA can be captured in the remarkably simple
_Hypothesis Error_ plot, which shows the relative size of two data ellipses,
* $\mathbf{H}$: the data ellipse of the _fitted_ values, which are just the
group means on the two variables, $\bar{\mathbf{x}}$, corresponding to $\mathbf{Q}_H$
in @eq-eigen. In the case of $T^2$, the $\mathbf{H}$ matrix is of rank 1, so the
"ellipse" plots as a line.
```{r Hcalc}
# calculate H directly
fit <- fitted(math.mod)
xbar <- colMeans(mathscore[,2:3])
N <- nrow(mathscore)
crossprod(fit) - N * outer(xbar, xbar)
# same as: SSP for group effect from Anova
math.aov <- Anova(math.mod)
(H <- math.aov$SSP)
```
* $\mathbf{E}$: the data ellipse of the _residuals_, the deviations of the scores
from the group means, $\mathbf{x} - \bar{\mathbf{x}}$, corresponding to $\mathbf{Q}_E$.
```{r Ecalc}
# calculate E directly
resids <- residuals(math.mod)
crossprod(resids)
# same as: SSPE from Anova
(E <- math.aov$SSPE)
```
### `heplot()`
`heplots::heplot()` takes the model object, extracts the $\mathbf{H}$ and $\mathbf{E}$
matrices (from `summary(Anova(math.mod))`) and plots them. There are many options to control the details.
```{r}
#| label: fig-mathscore-HE
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Hypothesis error plot of the `mathscore` data. The line through the group means is the H ellipse, which plots as a line here. The red ellipse labeled 'Error' represents the pooled within-group covariance matrix."
heplot(math.mod,
fill=TRUE, lwd = 3,
asp = 1,
cex=2, cex.lab=1.8,
xlab="Basic math", ylab="Word problems")
```
But the HE plot offers more:
* A visual test of significance: the $\mathbf{H}$ ellipse is scaled so that it projects outside the $\mathbf{E}$ ellipse _somewhere_ if and only if the test is significant at a given $\alpha$ level ($\alpha = 0.05$ by default)
* The $\mathbf{H}$ ellipse, which appears as a line, goes through the means of the two groups.
This is also the _discriminant axis_, the direction in the space of the variables which maximally
discriminates between the groups. That is, if we project the data points onto this line, we get the
linear combination $w$ which has the maximum possible univariate $t^2$.
You can see how the HE plot relates to the plots of the separate data ellipses by overlaying them
in a single figure. We also plot the scores on the discriminant axis, by using this small function
to find the orthogonal projection of a point $\mathbf{a}$ on the line joining two points,
$\mathbf{p}_1$ and $\mathbf{p}_2$, which in math is
$\mathbf{p}_1 + \frac{\mathbf{d}^\mathsf{T} (\mathbf{a} - \mathbf{p}_1)} {\mathbf{d}^\mathsf{T} \mathbf{d}} \; \mathbf{d}$,
letting $\mathbf{d} = \mathbf{p}_2 - \mathbf{p}_1$.
```{r project-on}
dot <- function(x, y) sum(x*y) # dot product of two vectors
project_on <- function(a, p1, p2) {
  a  <- as.numeric(a)
  p1 <- as.numeric(p1)
  p2 <- as.numeric(p2)
  # position of the projection along the line from p1 to p2
  t <- dot(p2 - p1, a - p1) / dot(p2 - p1, p2 - p1)
  # the projected point
  p1 + t * (p2 - p1)
}
```
Then, we run the same code as before to plot the data ellipses, followed by a call
to `heplot()` with the option `add=TRUE`, which adds to an existing plot. Finally,
we find the group means and draw lines projecting each point onto the line between them.
```{r}
#| label: fig-mathscore-HE-overlay
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "HE plot overlaid on top of the within-group data ellipses, with lines showing the projection of each point on the discriminant axis."
covEllipses(mathscore[,c("BM", "WP")], mathscore$group,
pooled=FALSE,
col = colors,
cex=2, cex.lab=1.5,
asp=1,
xlab="Basic math", ylab="Word problems"
)
pch <- ifelse(mathscore$group==1, 15, 16)
col <- ifelse(mathscore$group==1, "red", "blue")
points(mathscore[,2:3], pch=pch, col=col, cex=1.25)
# overlay with HEplot (add = TRUE)
heplot(math.mod,
fill=TRUE,
cex=2, cex.lab=1.8,
fill.alpha=0.2, lwd=c(1,3),
add = TRUE,
error.ellipse=TRUE)
# find group means
means <- mathscore |>
group_by(group) |>
summarize(BM = mean(BM), WP = mean(WP))
for(i in 1:nrow(mathscore)) {
pt <- project_on( mathscore[i, 2:3], means[1, 2:3], means[2, 2:3])
segments(mathscore[i, "BM"], mathscore[i, "WP"], pt[1], pt[2], lwd = 1.2)
}
```
## Discriminant analysis {#sec-t2-discrim}
Discriminant analysis for two-group designs or for one-way MANOVA
essentially turns the problem around: Instead of asking whether the mean
vectors for two or more groups are equal, discriminant analysis tries to find
the linear combination $w$ of the response variables that has the greatest
separation among the groups, allowing cases to be best classified.
It was developed by @Fisher:1936 as a solution to the biological taxonomy problem
of developing a rule to classify instances of flowers---in his famous case, Iris flowers---into known species (_I. setosa_, _I. versicolor_, _I. virginica_) on the basis of multiple measurements
(length and width of their sepals and petals).
```{r lda}
(math.lda <- MASS::lda(group ~ ., data=mathscore))
```
The coefficients give $w = -0.084 \;\text{BM} + 0.075 \;\text{WP}$. This is exactly the direction given by the line for the $\mathbf{H}$ ellipse in @fig-mathscore-HE-overlay.
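As a check on the eigenvector property from @sec-t2-properties, the sketch below compares these weights with $\mathbf{S}_p^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$, reusing the `Sp`, `xbar1` and `xbar2` objects from the by-hand sketch earlier in this chapter. The two sets of weights should differ only by a scalar multiple (and possibly in sign, because the orientation of the discriminant axis is arbitrary).
```{r}
# Sketch: raw discriminant weights vs. the lda() scaling (proportional, up to sign)
a <- solve(Sp) %*% (xbar1 - xbar2)
cbind(by_hand = a, lda = math.lda$scaling, ratio = a / math.lda$scaling)
```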
To round this out, we can calculate the discriminant scores by multiplying the matrix $\mathbf{X}$
by the vector of discriminant weights, which is proportional to $\mathbf{a} = \mathbf{S}_p^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. These were shown in @fig-mathscore-HE-overlay as the projections
of the data points on the line joining the group means.
```{r}
math.lda$scaling
scores <- cbind(group = mathscore$group,
as.matrix(mathscore[, 2:3]) %*% math.lda$scaling) |>
as.data.frame()
scores |>
group_by(group) |>
slice(1:3)
```
Then a $t$-test on these scores gives a value of $t$ whose square is Hotelling's $T^2$; the $t$ statistic is extracted via the `statistic` component of the `t.test()` result.
```{r}
t <- t.test(LD1 ~ group, data=scores)$statistic
c(t, T2 =t^2)
```
Finally, it is instructive to compare violin plots for the three measures,
`BM`, `WP` and `LD1`. To do this with `ggplot2` requires reshaping the
data from wide to long format so the plots can be faceted.
```{r}
#| label: fig-mathscore-violins
#| fig-align: center
#| out-width: "90%"
#| fig-cap: "Violin plots comparing group 1 and 2 for the two observed measures and the linear discriminant score."
scores <- mathscore |>
bind_cols(LD1 = scores[, "LD1"])
scores |>
tidyr::gather(key = "measure", value ="Score", BM:LD1) |>
mutate(measure = factor(measure, levels = c("BM", "WP", "LD1"))) |>
ggplot(aes(x = group, y = Score, color = group, fill = group)) +
geom_violin(alpha = 0.2) +
geom_jitter(width = .2, size = 2) +
facet_wrap( ~ measure, scales = "free", labeller = label_both) +
scale_fill_manual(values = c("darkgreen", "blue")) +
scale_color_manual(values = c("darkgreen", "blue")) +
theme_bw(base_size = 14) +
theme(legend.position = "none")
```
You can readily see how well the groups are separated on the discriminant axis,
relative to the two individual variables.
## More variables {#sec-t2-more-variables}
The `mathscore` data gave a simple example with two outcomes to explain the essential ideas behind
Hotelling's $T^2$ and multivariate tests.
Multivariate methods become increasingly useful as the number of response variables increases because
it is harder to show them all together and see how they relate to differences between groups.
A classic example is the dataset `mclust::banknote`, containing six size measures made on 100 genuine and 100 counterfeit old Swiss 1000-franc banknotes [@FluryReidwyl-1988]. The goal is to see how well the real and fake
banknotes can be distinguished. The measures are the `Length` and `Diagonal` lengths of a banknote and
the `Left`, `Right`, `Top` and `Bottom` edge margins, in mm.
Before considering hypothesis tests, let's look at some exploratory graphics. @fig-banknote-violin shows univariate violin and boxplots of each of the measures. To make this plot, faceted by measure, I first reshape the data
from wide to long and make `measure` a factor with levels in the order of the variables in the data set.
<!-- figure-code: R/banknote.R -->
```{r}
#| label: fig-banknote-violin
#| fig-height: 7
#| out-width: "100%"
#| fig-cap: "Overlaid violin and boxplots of the banknote variables. The violin plots give a sense of the shapes of the distributions, while the boxplots highlight the center and spread."
data(banknote, package= "mclust")
banknote |>
tidyr::gather(key = "measure",
value = "Size",
Length:Diagonal) |>
mutate(measure = factor(measure,
levels = c(names(banknote)[-1]))) |>
ggplot(aes(x = Status, y = Size, color = Status)) +
geom_violin(aes(fill = Status), alpha = 0.2) + # (1)
geom_jitter(width = .2, size = 1.2) + # (2)
geom_boxplot(width = 0.25, # (3)
linewidth = 1.1,
color = "black",
alpha = 0.5) +
labs(y = "Size (mm)") +
facet_wrap( ~ measure, scales = "free", labeller = label_both) +
theme_bw(base_size = 14) +
theme(legend.position = "top")
```
A quick glance at @fig-banknote-violin shows that the counterfeit and genuine bills differ in their
means on most of the measures, with the counterfeit ones slightly larger on Left, Right, Bottom and Top margins.
But univariate plots don't give an overall sense of how these variables are related to one another.
::: {.callout-note title="**Graph craft**: Layers and transparency"}
@fig-banknote-violin is somewhat complex, so it is useful to understand the steps needed to make this figure show what I wanted. The plot in each panel contains three layers:
(1) the violin plot based on a density estimate, showing the shape of each distribution;
(2) the data points, but they are jittered horizontally using `geom_jitter()` because otherwise they would all overlap on the x axis;
(3) the boxplot, showing the center (median) and spread (IQR) of each distribution.
In composing graphs with layers, order matters, and so does the `alpha` transparency, because each layer
adds data ink on top of earlier ones. I plotted these in the order shown because I wanted the violin plot to
provide the background, and the boxplot to show a simple univariate summary, not obscured by the other layers.
The `alpha` values allow the data ink of the layers to be blended; in this case, `alpha = 0.5` for the
boxplot lets the earlier layers show through.
:::
### Biplots
Multivariate relations among these six variables could be explored in data space using scatterplots or other methods, but I turn to my trusty multivariate juicer, a biplot, to give a 2D summary.
Two dimensions account for 70% of the total variance of all the banknotes, while three would give 85%.
```{r banknote-pca}
banknote.pca <- prcomp(banknote[, -1], scale = TRUE)
summary(banknote.pca)
```
The biplot in @fig-banknote-biplot gives a nicely coherent overview, at least in two dimensions.
The first component reflects the positive correlations among the margin measures, on which the
counterfeit bills are larger than the real ones, together with the negative correlation of the Diagonal with these measures.
The Length of the bills distinguishes the two types of banknotes only on the second dimension.
```{r}
#| label: fig-banknote-biplot
#| out-width: "100%"
#| fig-cap: "Biplot of the banknote variables, showing how the size measurements are related to each other. The points and data ellipses for the component scores are colored by Status, showing how the counterfeit and genuine bills are distinguished by these measures."
banknote.pca <- ggbiplot::reflect(banknote.pca)
ggbiplot(banknote.pca,
obs.scale = 1, var.scale = 1,
groups = banknote$Status,
ellipse = TRUE,
ellipse.level = 0.5,
ellipse.alpha = 0.1,
ellipse.linewidth = 0,
varname.size = 4,
varname.color = "black") +
labs(fill = "Status",
color = "Status") +
theme_minimal(base_size = 14) +
theme(legend.position = 'top')
```
### Testing mean differences
As noted above, Hotelling's $T^2$ is equivalent to a one-way MANOVA, fitting the size measures
to the `Status` of the banknotes. `Anova()` reports only the $F$-statistic based on Pillai's trace criterion.
```{r banknote-mlm}
banknote.mlm <- lm(cbind(Length, Left, Right, Bottom, Top, Diagonal) ~ Status,
data = banknote)
Anova(banknote.mlm)
```
You can see all the multivariate test statistics with the `summary()` method for `"Anova.mlm"` objects.
With two groups, and hence a 1 df test, these all translate into identical $F$-statistics.
```{r summary-Anova}
summary(Anova(banknote.mlm)) |> print(SSP = FALSE)
```
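As a small check of the invariance property noted in @sec-t2-properties, the sketch below rescales the measurements from millimeters to centimeters in a hypothetical copy of the data (`banknote_cm`) and refits the model; the multivariate test is unchanged.
```{r}
# Sketch: the multivariate test is unchanged by rescaling the responses (mm -> cm)
banknote_cm <- banknote |>
  mutate(across(Length:Diagonal, \(x) x / 10))
banknote_cm.mlm <- lm(cbind(Length, Left, Right, Bottom, Top, Diagonal) ~ Status,
                      data = banknote_cm)
Anova(banknote_cm.mlm)
```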
If you wish, you can extract the univariate $t$-tests or equivalent $F = t^2$ statistics
from the `"mlm"` object using `broom::tidy.mlm()`.
What is given as the `estimate` is the difference in the mean for the genuine banknotes relative
to the counterfeit ones.
```{r banknote-tests}
broom::tidy(banknote.mlm) |>
filter(term != "(Intercept)") |>
dplyr::select(-term) |>
rename(t = statistic) |>
mutate(F = t^2) |>
relocate(F, .after = t)
```
The individual $F_{(1, 198)}$ statistics can be compared to the $F_{(6, 193)} = 392$
value for the overall multivariate test. While all of the individual tests are highly significant,
the average of the univariate $F$s is only 236. The multivariate test gains power by taking the
correlations of the size measures into account.
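That average is easy to verify from the same `broom::tidy()` summary used above (a quick sketch):
```{r}
# Sketch: average of the univariate F = t^2 statistics across the six responses
broom::tidy(banknote.mlm) |>
  filter(term != "(Intercept)") |>
  summarize(mean_F = mean(statistic^2))
```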
## Variance accounted for: Eta square ($\eta^2$)
In a univariate multiple regression model, the coefficient of determination $R^2 = \text{SS}_H / \text{SS}_\text{Total}$ gives the proportion of variance accounted for by hypothesized terms in $H$ relative to the total variance. An analog
for ANOVA-type models with categorical, group factors as predictors is $\eta^2$ [@Pearson-1903], defined as
$$
\eta^2 = \frac{\text{SS}_\text{Between groups}}{\text{SS}_\text{Total}}
$$
For multivariate response models, the generalization of $\eta^2$ uses multivariate analogs of these sums of squares, $\mathbf{Q}_H$
and $\mathbf{Q}_T = \mathbf{Q}_H + \mathbf{Q}_E$, and there are different calculations for a single
measure corresponding to the various test statistics (Wilks' $\Lambda$, etc.), as described in @sec-mlm-review.
Let's calculate the $\eta^2$ for the multivariate model `banknote.mlm` with `Status` as the only predictor, giving $\eta^2 = 0.92$, or 92% of the total variance.
```{r banknote-etasq}
heplots::etasq(banknote.mlm)
```
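To connect this value to the ingredients of the multivariate test, the sketch below computes it from Wilks' $\Lambda$: for a 1-df hypothesis, $\eta^2 = 1 - \Lambda$, where $\Lambda = \det(\mathbf{E}) / \det(\mathbf{H} + \mathbf{E})$. The $\mathbf{H}$ and $\mathbf{E}$ matrices are computed the same way as in @sec-t2-heplot.
```{r}
# Sketch: eta^2 from Wilks' Lambda for the 1-df Status effect
Y    <- as.matrix(banknote[, -1])
ybar <- colMeans(Y)
H <- crossprod(fitted(banknote.mlm)) - nrow(Y) * outer(ybar, ybar)
E <- crossprod(residuals(banknote.mlm))
Lambda <- det(E) / det(H + E)
c(Lambda = Lambda, eta2 = 1 - Lambda)
```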
This can be compared to the principal components analysis and the biplot in @fig-banknote-biplot,
where two components (less favorably) accounted for 70% of total variance and it took four PCA dimensions to account for over 90%.
The goals of PCA and MANOVA are different, of course, but they are both concerned with accounting for variance
of multivariate data. We will meet another multivariate juicer, **canonical discriminant analysis**, in @sec-vis-mlm.
<!-- ## Canonical analysis -->
## What we've learned
This chapter was designed to illustrate the main ideas for visualizing differences between
means on multiple response variables in a two-group design.
Hotelling's $T^2$ is the generalization of a simple univariate $t$-test and
works by combining the responses into a weighted sum that has the maximum possible
univariate $t$ for all choices of weights.
@fig-HE-framework summarizes what was shown in @sec-t2-heplot and @sec-t2-discrim.
The data ellipses for the two groups in the `mathscore` data summarize the information
about means and within-group variances. In the HE plot, the difference between the means is
itself summarized by the line through them, which represents the $\mathbf{H} = \mathbf{Q}_H$ matrix,
while within-group variation is represented by the "Error" ellipse, which is the
$\mathbf{E} = \mathbf{Q}_E$ matrix, proportional to $\mathbf{S}_p$.
```{r}
#| label: fig-HE-framework
#| echo: false
#| fig-align: center
#| out-width: "100%"
#| fig-cap: "The Hypothesis Error plot framework for a two-group design. Above: Data ellipses can be summarized in an HE plot showing the pooled within-group error ($\\mathbf{E}$) ellipse and the $\\mathbf{H}$ 'ellipse' for the group means.
#| Below: Observations projected on the line joining the means give discriminant scores which correspond to a one-dimensional canonical space, represented by a boxplot of their scores and arrows reflecting the variable weights."
knitr::include_graphics("images/HE-framework.png")
```
As we will see later (@sec-vis-mlm), the $\mathbf{H}$ ellipse is scaled so that it provides a visual
test of significance: it projects somewhere outside the $\mathbf{E}$ ellipse if and only if the
means differ significantly. The direction of the line between the means is also the discriminant
axis, and scores on this axis are the weighted sum of the responses that gives the greatest possible
mean difference.
## Exercises
1. The value of Hotelling's $T^2$ found by `hotelling.test()` is 64.17. The value of
the equivalent $F$ statistic found by `Anova()` is 28.9. Verify that the two-sample version of @eq-Fstat, given at the end of @sec-t2-properties, gives this result.
```{r}
#| echo: false
#| results: asis
cat("**Packages used here**:\n\n")
write_pkgs(file = .pkg_file)
```
<!-- ## References {.unnumbered} -->