-
Notifications
You must be signed in to change notification settings - Fork 2
/
11-mlm-viz.qmd
207 lines (165 loc) · 9.31 KB
/
11-mlm-viz.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch11/")
```
::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::
# Visualizing Multivariate Models {#sec-vis-mlm}
Tests of multivariate models, including multivariate analysis of variance (MANOVA) for group differences and multivariate multiple regression (MMRA) can be easily visualized by plots of a hypothesis ("H") data ellipse for the fitted values relative to the corresponding plot of the error ellipse ("E") of the residuals, which I call the HE plot framework.
For more than a few response variables, these result can be projected onto a lower-dimensional "canonical discriminant" space providing an even simpler description.
**Packages**
In this chapter we use the following packages. Load them now
```{r}
library(car)
library(heplots)
library(ggplot2)
library(dplyr)
library(tidyr)
```
## HE plot framework {#sec-he-framework}
@sec-Hotelling illustrated the basic ideas of the framework for visualizing multivariate
linear models in the context of a simple two group design, using Hotelling's $T^2$. The main ideas were illustrated in @fig-HE-framework.
Having described the statistical ideas behind the MLM in @sec-mlm-review, we can proceed to
extend this framework to larger designs. @fig-dogfood-quartet illustrates these ideas using the
simple one-way MANOVA design of the dogfood data from @sec-dogfood-data.
```{r}
#| label: fig-dogfood-quartet
#| echo: false
#| fig-align: center
#| out-width: "100%"
#| fig-cap: "**Dogfood quartet**: Illustration of the conceptual ideas of the HE plot framework for the dogfood data. (a) Scatterplot of the data; (b) Summary using data ellipses; (c) HE plot shows the variation in the means in relation to pooled within group variance; (d) Transformation from data space to canonical space"
knitr::include_graphics("images/dogfood-quartet.png")
```
* In data space (a), each group is summarized by its **data ellipse** (b), representing the means and covariances.
* Variation against the hypothesis of equal means can be seen by the $\mathbf{H}$ ellipse in the **HE plot** (c), representing the data ellipse of the fitted values. Error variance is shown in the $\mathbf{E}$ ellipse,
representing the pooled within-group covariance matrix, $\mathbf{S}_p$ and the data ellipse of the residuals from the model.
For the dogfood data, the group means have a negative relation: longer time to start eating is associated with a smaller amount eaten.
* The MANOVA (or Hotelling's $T^2$) is formally equivalent to a **discriminant analysis**, predicting
group membership from the response variables which can be seen in data space. (The main difference is
emphasis and goals: MANOVA seeks to test differences among group means, while discriminant analysis aims
at classification of the observations into groups.)
* This effectively projects the $p$-dimensional
space of the predictors into the smaller **canonical space** (d) that shows the greatest differences among
the groups. As in a biplot, vectors show the relations of the response variables with the canonical dimensions.
<!--
```{r}
#| label: fig-HE-framework
#| echo: false
#| fig-align: center
#| out-width: "100%"
#| fig-cap: "The Hypothesis Error plot framework for a two-group design. Above: Data ellipses can be summarized in an HE plot showing the pooled within-group error ($\\mathbf{E}$) ellipse and the $\\mathbf{H}$ 'ellipse' for the group means.
#| Below: Observations projected on the line joining the means give discriminant scores which correpond to a one-dimensional canonical space, represented by a boxplot of their scores and arrows reflecting the variable weights."
knitr::include_graphics("images/HE-framework.png")
```
-->
<!--
Having described the statistical ideas behind the MLM in @sec-mlm-review, we can proceed to
extend this framework to larger designs. A conceptual overview is shown in @fig-arcmanov1 for a one-way MANOVA
design with 8 groups.
```{r}
#| label: fig-arcmanov1
#| echo: false
#| fig-align: center
#| out-width: "100%"
#| fig-cap: "Conceptual plots showing the essential ideas behind multivariate tests, in terms of the hypothesis
#| ($\\mathbf{H}$) and error ($\\mathbf{E}$) matrices for a 1-way MANOVA design with two response variables, $Y_1$ and $Y_2$"
knitr::include_graphics("images/arcmanov.png")
```
-->
For more complex models such as MANOVA with multiple factors or multivariate multivariate regression,
there is one sum of squares and products matrix (SSP), and therefore one
$\mathbf{H}$ ellipse for each term in the model. For example, in a two-way MANOVA design with the model
formula `(y1, y2) ~ A + B + A*B` and equal sample sizes in the groups, the total sum of squares accounted for by the model is
\begin{align*}
\mathbf{SSP}_{\text{Model}} & = \mathbf{SSP}_{A} + \mathbf{SSP}_{B} + \mathbf{SSP}_{AB} \\
& = \mathbf{H}_A + \mathbf{H}_B + \mathbf{H}_{AB} \period
\end{align*}
## HE plot construction
The HE plot is constructed to allow a direct visualization of the "size" of hypothesized terms in
a multivariate linear model in relation to unexplained error variation.
These can be displayed in 2D or 3D plots, so I use the term "ellipsoid" below to cover all cases.
Error variation is represented by a standard 68\% data ellipsoid of the $\mat{E}$ matrix
of the residuals in $\Epsilon$. This is divided by the residual degrees of freedom, so the size of $\mat{E} / \text{df}_e$ is analogous to a mean square error
in univariate tests. The choice of 68\% coverage allows you to ``read'' the residual standard deviation as the half-length of the
shadow of the $\mat{E}$ ellipsoid on any axis (see @fig-galton-ellipse-r).
The $\mat{E}$ ellipsoid is then translated to the overall (grand) means $\bar{\mathbf{y}}$ of the variables plotted, which allows us to show the means for factor levels on the same scale, facilitating interpretation.
In the notation of @eq-ellE, the error ellipsoid is given by
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{E}) = \bar{\mathbf{y}} \; \oplus \; c\,\mathbf{E}^{1/2} \comma
$$
where $c = \sqrt{2 F_{2, n-2}^{0.68}}$ for 2D plots and $c = \sqrt{3 F_{3, n-3}^{0.68}}$ for 3D.
An ellipsoid representing variation in the means of a factor (or any other term reflected in a general linear hypothesis test, @eq-hmat) in the $\mat{H}$ matrix is simply the data ellipse of the fitted values for that term.
Dividing the hypothesis matrix by the error degrees of freedom, giving
$\mat{H} / \text{df}_e$,
puts this on the same scale as the \E ellipse.
<!-- , as shown in the left panel of \figref{fig:heplot-iris1}. -->
I refer to this as _effect size scaling_, because it is similar to an effect size index used in
univariate models, e.g., $ES = (\bar{y}_1 - \bar{y}_2) / s_e$ in a two-group, univariate design.
### Example: Iris data
**TODO**: Introduce the iris data earlier. Where???
@fig-iris-HE1 shows two views of the relationship between sepal length and sepal width for the iris data. ...
<!--
```{r}
#| eval: false
op <- par(mar = c(4, 4, 1, 1) + .5,
mfrow = c(1, 2))
col <-c("blue", "darkgreen", "brown")
clr <- c(col, "red")
covEllipses(cbind(Sepal.Length, Sepal.Width) ~ Species, data=iris,
pooled = TRUE,
fill=TRUE,
fill.alpha = 0.1,
lwd = 3,
col = clr,
cex = 1.5, cex.lab = 1.5,
label.pos = c(3, 1, 3, 0),
xlim = c(4,8), ylim = c(2,4))
iris.mod <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~
Species, data=iris)
heplot(iris.mod, size = "effect",
cex = 1.5, cex.lab = 1.5,
fill=TRUE, fill.alpha=c(0.3,0.1),
xlim = c(4,8), ylim = c(2,4))
par(op)
```
-->
```{r}
#| label: fig-iris-HE1
#| out-width: "100%"
#| fig-cap: "Iris data: Data ellipses and HE plot"
knitr::include_graphics("images/iris-HE1.png")
```
### Significance scaling
The geometry of ellipsoids and multivariate tests allow us to go further with another re-scaling of the $\mat{H}$ ellipsoid
that gives a _visual test of significance_ for any term in a MLM.
This is done simply by dividing $\mat{H} / df_e$ further
by the $\alpha$-critical value of the corresponding test statistic to show the strength of evidence against
the null hypothesis.
Among the various multivariate test statistics,
Roy's maximum root test, based on the largest eigenvalue $\lambda_1$ of $\mat{H} \mat{E}^{-1}$,
gives $\mat{H} / (\lambda_\alpha df_e)$
which has the visual property that the
scaled $\mat{H}$ ellipsoid will protrude _somewhere_ outside the standard $\mat{E}$ ellipsoid if and only if
Roy's test is significant at significance level $\alpha$. The critical value $\lambda_\alpha$ for Roy's
test is
$$
\lambda_\alpha = \left(\frac{\text{df}_1}{\text{df}_2}\right) \; F_{\text{df}_1, \text{df}_2}^{1-\alpha} \comma
$$
where $\text{df}_1 = \max(p, \text{df}_h)$ and $\text{df}_2 = \text{df}_h + \text{df}_e - \text{df}_1$.
For these data, the HE plot using
significance scaling is shown in the right panel of \figref{fig:heplot-iris1}.
```{r}
#| label: fig-iris-HE2
#| out-width: "100%"
#| fig-cap: "HE plots for sepal width and sepal length in the iris dataset. Left: effect scaling of the $\\mathbf{H}$ magtrix; right: significance scaling"
knitr::include_graphics("images/iris-HE2.png")
```
### Contrasts
### HE plot matrices
## Canonical discriminant analysis {#sec-candisc}
```{r}
#| echo: false
cat("Writing packages to ", .pkg_file, "\n")
write_pkgs(file = .pkg_file)
```