diff --git a/10-mlm-review.qmd b/10-mlm-review.qmd index d8139bc..70dd713 100644 --- a/10-mlm-review.qmd +++ b/10-mlm-review.qmd @@ -586,10 +586,12 @@ the following comparisons among the formulas Old, New, Major and Alps: (c) Major vs. Alps. The contrasts that do this are: \begin{align*} -L_1 & = \frac12 (\mu_O + \mu_N) - \frac12 (\mu_M + \mu_A) & \rightarrow\: & \mathbf{c}_1 = -\frac12 \begin{pmatrix} - 1 & 1 & -1 & -1 - \end{pmatrix} \\ +L_1 & = \textstyle{\frac12} (\mu_O + \mu_N) - + \textstyle{\frac12} (\mu_M + \mu_A) & \rightarrow\: & \mathbf{c}_1 = +\textstyle{\frac12} + \begin{pmatrix} + 1 & 1 & -1 & -1 + \end{pmatrix} \\ L_2 & = \mu_O - \mu_N & \rightarrow\: & \mathbf{c}_2 = \begin{pmatrix} 1 & -1 & 0 & 0 @@ -622,10 +624,12 @@ C t(C) %*% C ``` -For a dataset, `data`, with this factor as `group`, you can set up the analyses to use these -contrasts by assigning the matrix `C` to `contrasts()`. When the contrasts are changed, +For the `dogfood` data, with `formula` as the group factor, +you can set up the analyses to use these +contrasts by assigning the matrix `C` to `contrasts()` for that factor in the dataset itself. +When the contrasts are changed, it is necessary to refit the model. The estimated coefficients then become the estimated -mean differences of the contrasts. +mean differences for the contrasts. ```{r dogfood-contrasts} contrasts(dogfood$formula) <- C @@ -633,7 +637,8 @@ dogfood.mod <- lm(cbind(start, amount) ~ formula, data=dogfood) coef(dogfood.mod) ``` - +For example, Ours vs. Theirs estimated by `formulac1` takes 0.69 less time to start eating +and eats 5.44 more on average. For multivariate tests, when all contrasts are pairwise orthogonal, the overall test of a factor with @@ -668,15 +673,15 @@ Then, we can illustrate @eq-H-contrasts by extracting the 1 df $\mathbf{H}$ mat from the results of `linearHypothesis`. -```{r dogfood-Eqn, results='asis', eval=FALSE, echo=FALSE} +```{r dogfood-Eqn, results='asis'} options(print.latexMatrix = list(display.labels=FALSE)) SSP_H1 <- H1$SSPH |> round(digits=2) SSP_H2 <- H2$SSPH |> round(digits=2) SSP_H3 <- H3$SSPH |> round(digits=2) -Eqn(latexMatrix(SSP_H), "=", +Eqn(latexMatrix(SSP_H), "=", latexMatrix(SSP_H1), "+", latexMatrix(SSP_H2), "+", - latexMatrix(SSP_H3)) + latexMatrix(SSP_H3), quarto=TRUE) ``` $$ diff --git a/docs/08-collinearity-ridge.html b/docs/08-collinearity-ridge.html index ed5248b..3b10180 100644 --- a/docs/08-collinearity-ridge.html +++ b/docs/08-collinearity-ridge.html @@ -1026,7 +1026,7 @@

There is a close connection with principal components regression mentioned in Section 8.3. Ridge regression shrinks all dimensions in proportion to \(\text{df}_k(i)\), so the low variance dimensions are shrunk more. Principal components regression discards the low variance dimensions and leaves the high variance dimensions unchanged.
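To make the contrast concrete, here is a minimal sketch of the shrinkage factors \(d_i^2/(d_i^2 + k)\) and the resulting effective degrees of freedom \(\text{df}_k\); the simulated collinear predictors and the value k = 0.1 are illustrative assumptions, not data from this chapter.

```r
# Sketch: ridge shrinkage factors per SVD dimension vs. PCR's keep/drop rule
set.seed(1)
Sigma <- 0.9 + 0.1 * diag(4)                  # highly correlated predictors (assumed)
X <- scale(matrix(rnorm(200 * 4), ncol = 4) %*% chol(Sigma))
d <- svd(X)$d                                 # singular values, largest first
k <- 0.1                                      # an arbitrary ridge constant
shrink <- d^2 / (d^2 + k)                     # ridge shrinks dimension i by this factor
round(shrink, 3)                              # smallest d_i ==> shrunk the most
sum(shrink)                                   # effective degrees of freedom, df_k
# Principal components regression would instead set the trailing entries of
# `shrink` to 0 and leave the leading ones at 1.
```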

8.4.2 The genridge package

-

Ridge regression and other shrinkage methods are available in several packages including MASS (the lm.ridge() function), glmnet (Friedman et al., 2023), and penalized (R-penalized?), but none of these provides insightful graphical displays. glmnet::glmnet() also implements a method for multivariate responses with a `family=“mgaussian”.

+

Ridge regression and other shrinkage methods are available in several packages including MASS (the lm.ridge() function), glmnet (Friedman et al., 2023), and penalized (Goeman et al., 2022), but none of these provides insightful graphical displays. glmnet::glmnet() also implements a method for multivariate responses with a family = "mgaussian" argument.

Here, I focus on the genridge package (Friendly, 2024), where the ridge() function is the workhorse and pca.ridge() transforms these results to PCA/SVD space. vif.ridge() calculates VIFs for class "ridge" objects and precision() calculates precision and shrinkage measures.
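As a hedged sketch of this workflow (the classic longley data and the grid of ridge constants are my illustrative choices, not necessarily those used in the chapter):

```r
library(genridge)
data(longley, package = "datasets")
longley.y <- longley[, "Employed"]
longley.X <- data.matrix(longley[, c("GNP", "Unemployed", "Armed.Forces",
                                     "Population", "Year", "GNP.deflator")])
lambda <- c(0, 0.005, 0.01, 0.02, 0.04, 0.08)   # illustrative shrinkage constants
lridge <- ridge(longley.y, longley.X, lambda = lambda)

traceplot(lridge)     # univariate ridge trace plot
pairs(lridge)         # all pairwise bivariate ridge trace plots
precision(lridge)     # precision and shrinkage measures
```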

A variety of plotting functions is available for univariate, bivariate and 3D plots:

    @@ -1382,6 +1382,9 @@

Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal components analysis. Biometrics, 58(3), 453–467. https://doi.org/10.2307/2334381
+
+Goeman, J., Meijer, R., Chaturvedi, N., & Lueder, M. (2022). Penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model. https://CRAN.R-project.org/package=penalized
+
    Gower, J. C., & Hand, D. J. (1996). Biplots. Chapman & Hall.
diff --git a/docs/10-mlm-review.html b/docs/10-mlm-review.html
index 2f9723b..02d52b3 100644
--- a/docs/10-mlm-review.html
+++ b/docs/10-mlm-review.html
@@ -843,10 +843,12 @@

    TODO: Should use \(\frac12\) here so contrast L_1 is mean diff

    For example, with the \(g=4\) groups for the dogfood data, the company might want to test the following comparisons among the formulas Old, New, Major and Alps: (a) Ours vs. Theirs: The average of (Old, New) compared to (Major, Alps); (b) Old vs. New; (c) Major vs. Alps. The contrasts that do this are:

\[\begin{aligned}
-L_1 & = \frac12 (\mu_O + \mu_N) - \frac12 (\mu_M + \mu_A) & \rightarrow\: & \mathbf{c}_1 =
-\frac12 \begin{pmatrix}
-         1 &  1 & -1 & -1
-        \end{pmatrix} \\
+L_1 & = \textstyle{\frac12} (\mu_O + \mu_N) -
+       \textstyle{\frac12} (\mu_M + \mu_A) & \rightarrow\: & \mathbf{c}_1 =
+\textstyle{\frac12}
+  \begin{pmatrix}
+         1 &  1 & -1 & -1
+  \end{pmatrix} \\
 L_2 & = \mu_O - \mu_N                     & \rightarrow\: & \mathbf{c}_2 =
     \begin{pmatrix}
      1 &  -1 & 0 & 0
@@ -879,7 +881,7 @@

#> c2  0  2  0
#> c3  0  0  2
-

    For a dataset, data, with this factor as group, you can set up the analyses to use these contrasts by assigning the matrix C to contrasts(). When the contrasts are changed, it is necessary to refit the model. The estimated coefficients then become the estimated mean differences of the contrasts.

    +

    For the dogfood data, with formula as the group factor, you can set up the analyses to use these contrasts by assigning the matrix C to contrasts() for that factor in the dataset itself. When the contrasts are changed, it is necessary to refit the model. The estimated coefficients then become the estimated mean differences for the contrasts.

    contrasts(dogfood$formula) <- C
     dogfood.mod <- lm(cbind(start, amount) ~ formula, 
    @@ -891,6 +893,7 @@ 

#> formulac2   -0.500   3.25
#> formulac3    0.125   1.88

    +

For example, the Ours vs. Theirs comparison estimated by formulac1 indicates that, with our formulas, dogs take 0.69 less time to start eating and eat 5.44 more on average.

    For multivariate tests, when all contrasts are pairwise orthogonal, the overall test of a factor with \(\text{df}_h = g-1\) degrees of freedom can be broken down into \(g-1\) separate 1 df tests. This gives rise to a set of \(\text{df}_h\) rank 1 \(\mathbf{H}\) matrices that additively decompose the overall hypothesis SSCP matrix,

\[ \mathbf{H} = \mathbf{H}_1 + \mathbf{H}_2 + \cdots + \mathbf{H}_{\text{df}_h} \:\: ,
@@ -929,6 +932,37 @@

    Then, we can illustrate Equation 10.5 by extracting the 1 df \(\mathbf{H}\) matrices (SSPH) from the results of linearHypothesis.

    +
    +
    options(print.latexMatrix = list(display.labels=FALSE))
    +SSP_H1 <- H1$SSPH |> round(digits=2)
    +SSP_H2 <- H2$SSPH |> round(digits=2)
    +SSP_H3 <- H3$SSPH |> round(digits=2)
    +Eqn(latexMatrix(SSP_H),  "=", 
    +    latexMatrix(SSP_H1), "+", 
    +    latexMatrix(SSP_H2), "+", 
    +    latexMatrix(SSP_H3), quarto=TRUE)
$$
\begin{pmatrix} 10 & -71 \\ -71 & 586 \end{pmatrix} =
\begin{pmatrix} 8 & -60 \\ -60 & 473 \end{pmatrix} +
\begin{pmatrix} 2 & -13 \\ -13 & 84 \end{pmatrix} +
\begin{pmatrix} 0.1 & 1.9 \\ 1.9 & 28.1 \end{pmatrix}
$$

\[ \overset{\mathbf{H}} {\begin{pmatrix}
@@ -1002,7 +1036,7 @@

    The main questions concern whether the group means differ on these scales, and the nature of these differences. That is, do the means differ significantly on all three measures? Is there a consistent order of groups across these three aspects of parenting?

    More specific questions are: (a) Do the fathers of typical children differ from the other two groups on average? (b) Do the physical and mental groups differ? These questions can be tested using contrasts, and are specified by assigning a matrix to contrasts(Parenting$group); each column is a contrast whose values sum to zero. They are given labels "group1" (normal vs. other) and "group2" (physical vs. mental) in some output.

    -
    data(Parenting, package="heplots")
    +
    data(Parenting, package="heplots")
     C <- matrix(c(1, -.5, -.5,
                   0,  1,  -1), 
                 nrow = 3, ncol = 2) |> print()
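# (sketch) the assignment step described in the text above; the chapter's
# actual code may differ
contrasts(Parenting$group) <- C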
    @@ -1015,7 +1049,7 @@ 

    Exploratory plots

    Before setting up a model and testing, it is well-advised to examine the data graphically. The simplest plots are side-by-side boxplots (or violin plots) for the three responses. With ggplot2, this is easily done by reshaping the data to long format and using faceting. In Figure 10.4, I’ve also plotted the group means with white dots.

    -
    See the ggplot code
    parenting_long <- Parenting |>
    +
    See the ggplot code
    parenting_long <- Parenting |>
       tidyr::pivot_longer(cols=caring:play, names_to = "variable")
     
     ggplot(parenting_long, 
    @@ -1048,7 +1082,7 @@ 

    In this figure, differences among the groups on play are most apparent, with fathers of non-disabled children scoring highest. Differences among the groups on emotion are very small, but one high outlier for the fathers of mentally disabled children is apparent. On caring, fathers of children with a physical disability stand out as highest.

For exploratory purposes, you might also make a scatterplot matrix. Here, because the MLM assumes homogeneity of the within-group variance–covariance matrices \(\mathbf{S}_i\), I show only the data ellipses in scatterplot matrix format, using heplots::covEllipses() (with 50% coverage, for clarity):

    -
    colors <- scales::hue_pal()(3) |> rev()  # match color use in ggplot
    +
    colors <- scales::hue_pal()(3) |> rev()  # match color use in ggplot
     covEllipses(cbind(caring, play, emotion) ~ group, data=Parenting,
       variables = 1:3,
       fill = TRUE, fill.alpha = 0.2,
    @@ -1071,7 +1105,7 @@ 

    10.4.1.1 Testing the model

    Let’s proceed to fit the multivariate model predicting all three scales from the group factor. lm() for a multivariate response returns an object of class "mlm", for which there are many methods (use methods(class="mlm") to find them).

    -
    parenting.mlm <- lm(cbind(caring, play, emotion) ~ group, 
    +
    parenting.mlm <- lm(cbind(caring, play, emotion) ~ group, 
                         data=Parenting) |> print()
     #> 
     #> Call:
    @@ -1086,7 +1120,7 @@ 

    The coefficients in this model are the values of the contrasts set up above. group1 is the mean of the typical group minus the average of the other two, which is negative on caring and emotion but positive for play. group2 is the difference in means for the physical vs. mental groups.

    Before doing multivariate tests, it is useful to see what would happen if we ran univariate ANOVAs on each of the responses. These can be extracted from an MLM using stats::summary.aov():

    -
    summary.aov(parenting.mlm)
    +
    summary.aov(parenting.mlm)
     #>  Response caring :
     #>             Df Sum Sq Mean Sq F value Pr(>F)    
     #> group        2    130    65.2    18.6  6e-07 ***
    @@ -1108,7 +1142,7 @@ 

If you like, you can also extract the univariate model fit statistics from the "mlm" object using the broom::glance() method for multivariate model objects.

    -
    glance(parenting.mlm) |>
    +
    glance(parenting.mlm) |>
       select(response, r.squared, fstatistic, p.value)
     #> # A tibble: 3 × 4
     #>   response r.squared fstatistic       p.value
    @@ -1119,7 +1153,7 @@ 

    From this, one might conclude that there are differences only in caring and play and therefore ignore emotion, but this would be short-sighted. car::Anova() gives the overall multivariate test \(\mathcal{H}_0: \mathbf{B} = 0\) of the group effect. Note that this has a much smaller \(p\)-value than any of the univariate \(F\) tests.

    -
    Anova(parenting.mlm)
    +
    Anova(parenting.mlm)
     #> 
     #> Type II MANOVA Tests: Pillai test statistic
     #>       Df test stat approx F num Df den Df Pr(>F)    
    @@ -1129,7 +1163,7 @@ 

Anova() returns an object of class "Anova.mlm" which has various methods. The summary() method for this gives more details, including all four test statistics. These tests have \(s = \min(p, \text{df}_h) = \min(3,2) = 2\) dimensions and the \(F\) approximations are not equivalent here. All four tests are highly significant.

    -
    parenting.summary <- Anova(parenting.mlm) |>  summary() 
    +
    parenting.summary <- Anova(parenting.mlm) |>  summary() 
     print(parenting.summary, SSP=FALSE)
     #> 
     #> Type II MANOVA Tests:
    @@ -1149,7 +1183,7 @@ 

    The summary() method by default prints the SSH = \(\mathbf{H}\) and SSE = \(\mathbf{E}\) matrices, but I suppressed them above. They can be extracted from the object using purrr::pluck():

    -
    H <- parenting.summary |> 
    +
    H <- parenting.summary |> 
       purrr::pluck("multivariate.tests", "group", "SSPH") |> 
       print()
     #>         caring    play emotion
    @@ -1168,7 +1202,7 @@ 

    With three or more groups or with a more complex MANOVA design, contrasts provide a way of testing questions of substantive interest regarding differences among group means.

The comparison of the typical group with the average of the other two is tested using the contrast \(c_1 = (1, -\frac12, -\frac12)\), which produces the coefficients labeled "group1". The function car::linearHypothesis() carries out the multivariate test that these coefficients are all zero. This is a 1 df test, so all four test statistics produce the same \(F\) and \(p\)-values.

    -
    coef(parenting.mlm)["group1",]
    +
    coef(parenting.mlm)["group1",]
     #>  caring    play emotion 
     #> -0.3833  2.4167 -0.0667
     linearHypothesis(parenting.mlm, "group1") |> 
    @@ -1185,7 +1219,7 @@ 

    Similarly, the difference between the physical and mental groups uses the contrast \(c_2 = (0, 1, -1)\) and the test that these means are equal is given by linearHypothesis() applied to group2.

    -
    coef(parenting.mlm)["group2",]
    +
    coef(parenting.mlm)["group2",]
     #>  caring    play emotion 
     #>   1.775  -0.225  -0.600
     linearHypothesis(parenting.mlm, "group2") |> 
    @@ -1203,7 +1237,7 @@ 

linearHypothesis() is very general. The second argument (hypothesis.matrix) corresponds to \(\mathbf{C}\), and can be specified as a numeric matrix whose rows give the linear combinations of coefficients to be tested, or as a character vector giving the hypothesis in symbolic form; "group1" is equivalent to "group1 = 0".
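As a small sketch of that equivalence (the coefficient order, intercept then group1 then group2, comes from the model above):

```r
# "group1" in symbolic form ...
linearHypothesis(parenting.mlm, "group1 = 0") |> print(SSP = FALSE)

# ... is the same hypothesis as a 1-row numeric matrix selecting the group1
# row of B; the coefficients are ordered (Intercept), group1, group2
linearHypothesis(parenting.mlm, matrix(c(0, 1, 0), nrow = 1)) |> print(SSP = FALSE)
```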

Because the contrasts used here are orthogonal, together they compose the overall test of \(\mathbf{B} = \mathbf{0}\), which implies that the means of the three groups are all equal. The test below gives the same results as Anova(parenting.mlm).

    -
    linearHypothesis(parenting.mlm, c("group1", "group2")) |> 
    +
    linearHypothesis(parenting.mlm, c("group1", "group2")) |> 
       print(SSP=FALSE)
     #> 
     #> Multivariate Tests: 
    @@ -1222,7 +1256,7 @@ 

    10.4.3 Example: Adolescent mental health

    The dataset heplots::AddHealth contains a large cross-sectional sample of participants from grades 7–12 from the National Longitudinal Study of Adolescent Health, described by Warne (2014). It contains responses to two Likert-scale (1–5) items, anxiety and depression. grade is an ordered factor, which means that the default contrasts are taken as orthogonal polynomials with linear (grade.L), quadratic (grade.Q), up to 5th degree (grade^5) trends, which decompose the total effect of grade.
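For reference, the polynomial contrast matrix that R assigns to a six-level ordered factor can be inspected directly; its column labels (.L, .Q, .C, ^4, ^5) are the suffixes that appear in the coefficient names below. A quick sketch:

```r
# orthogonal polynomial contrasts for a 6-level ordered factor
contr.poly(6) |> round(3)
```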

    -
    data(AddHealth, package="heplots")
    +
    data(AddHealth, package="heplots")
     str(AddHealth)
     #> 'data.frame':  4344 obs. of  3 variables:
     #>  $ grade     : Ord.factor w/ 6 levels "7"<"8"<"9"<"10"<..: 5 4 6 1 2 2 2 3 3 3 ...
    @@ -1251,7 +1285,7 @@ 

    Exploratory plots

    Some exploratory analysis is useful before fitting and visualizing models. As a first step, we find the means, standard deviations, and standard errors of the means.

    -
    Show the code
    library(ggplot2)
    +
    Show the code
    library(ggplot2)
     library(dplyr)
     
     means <- AddHealth |>
    @@ -1279,7 +1313,7 @@ 

Now, plot the means with \(\pm 1\) standard error bars. It appears that the average levels of both depression and anxiety increase steadily with grade, except for grades 11 and 12, which don't differ much. Alternatively, we could describe this as relationships that seem largely linear, with a hint of curvature at the upper end.

    -
    p1 <-ggplot(data = means, aes(x = grade, y = anxiety)) +
    +
    p1 <-ggplot(data = means, aes(x = grade, y = anxiety)) +
       geom_point(size = 4) +
       geom_line(aes(group = 1), linewidth = 1.2) +
       geom_errorbar(aes(ymin = anxiety - anx_se, 
    @@ -1307,7 +1341,7 @@ 

It is also useful to visualize the within-group correlations using covEllipses(), as shown in Figure 10.7. This also plots the bivariate means, showing the form of the association when anxiety and depression are treated as multivariate outcomes. (Because the variability of the scores within groups is so large compared to the range of the means, I show the data ellipses with coverage of only 10%.)

    -
    covEllipses(AddHealth[, 3:2], group = AddHealth$grade,
    +
    covEllipses(AddHealth[, 3:2], group = AddHealth$grade,
                 pooled = FALSE, level = 0.1,
                 center.cex = 2.5, cex = 1.5, cex.lab = 1.5,
                 fill = TRUE, fill.alpha = 0.05)
    @@ -1329,7 +1363,7 @@

    We fit the MANOVA model, and test the grade effect using car::Anova(). The effect of grade is highly significant, as we could tell from Figure 10.6.

    -
    AH.mlm <- lm(cbind(anxiety, depression) ~ grade, data = AddHealth)
    +
    AH.mlm <- lm(cbind(anxiety, depression) ~ grade, data = AddHealth)
     
     # overall test of `grade`
     Anova(AH.mlm)
    @@ -1342,12 +1376,12 @@ 

However, the overall test, with 5 degrees of freedom, is diffuse, in that it can be rejected if any pair of means differs. Given that grade is an ordered factor, it makes sense to examine narrower hypotheses of linear and nonlinear trends, using car::linearHypothesis() on the coefficients of model AH.mlm.

    -
    coef(AH.mlm) |> names()
    +
coef(AH.mlm) |> rownames()
#> [1] "(Intercept)" "grade.L"     "grade.Q"     "grade.C"    
#> [5] "grade^4"     "grade^5"

    The joint test of the linear coefficients \(\boldsymbol{\beta}_1 = (\beta_{1,\text{anx}}, \beta_{1,\text{dep}})^\mathsf{T}\) for anxiety and depression, \(H_0 : \boldsymbol{\beta}_1 = \boldsymbol{0}\) is highly significant,

    -
    ## linear effect
    +
    ## linear effect
     linearHypothesis(AH.mlm, "grade.L") |> print(SSP = FALSE)
     #> 
     #> Multivariate Tests: 
    @@ -1361,7 +1395,7 @@ 

The test of the quadratic coefficients \(H_0 : \boldsymbol{\beta}_2 = \boldsymbol{0}\) indicates significant curvature in the trends across grade, as we saw in the plots of their means in Figure 10.6. One interpretation might be that depression and anxiety, after increasing steadily up to grade eleven, level off thereafter.

    -
    ## quadratic effect
    +
    ## quadratic effect
     linearHypothesis(AH.mlm, "grade.Q") |> print(SSP = FALSE)
     #> 
     #> Multivariate Tests: 
    @@ -1375,7 +1409,7 @@ 

An advantage of linear hypotheses is that we can test several terms jointly. Of interest here is the hypothesis that all higher-order terms beyond the quadratic are zero, \(H_0 : \boldsymbol{\beta}_3 = \boldsymbol{\beta}_4 = \boldsymbol{\beta}_5 = \boldsymbol{0}\). Using linearHypothesis(), you can supply a vector of coefficient names to be tested for their joint effect when dropped from the model.

    -
    rownames(coef(AH.mlm))
    +
    rownames(coef(AH.mlm))
     #> [1] "(Intercept)" "grade.L"     "grade.Q"     "grade.C"    
     #> [5] "grade^4"     "grade^5"
     ## joint test of all higher terms
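## A hedged sketch of that joint test (the book's exact call may differ):
## supply the names of the cubic and higher-order coefficients; if the "^"
## in "grade^4" and "grade^5" trips the symbolic parser, backtick-quote them.
linearHypothesis(AH.mlm, rownames(coef(AH.mlm))[4:6]) |> print(SSP = FALSE)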
Major\")\n\nThen, we can illustrate Equation 10.5 by extracting the 1 df \\(\\mathbf{H}\\) matrices (SSPH) from the results of linearHypothesis.\n\n\\[\n\\overset{\\mathbf{H}}\n{\\begin{pmatrix}\n  9.7 & -70.9 \\\\\n-70.9 & 585.7 \\\\\n\\end{pmatrix}}\n=\n\\overset{\\mathbf{H}_1}\n{\\begin{pmatrix}\n  7.6 & -59.8 \\\\\n-59.8 & 473.1 \\\\\n\\end{pmatrix}}\n+\n\\overset{\\mathbf{H}_2}\n{\\begin{pmatrix}\n0.13 &  1.88 \\\\\n1.88 & 28.12 \\\\\n\\end{pmatrix}}\n+\n\\overset{\\mathbf{H}_3}\n{\\begin{pmatrix}\n  2 & -13 \\\\\n-13 &  84 \\\\\n\\end{pmatrix}}\n\\]",
    +    "text": "10.3 Multivariate test statistics\nIn the univariate case, the overall \\(F\\)-test of \\(\\mathcal{H}_0: \\boldsymbol{\\beta} = \\mathbf{0}\\) is the uniformly most powerful invariant test when the assumptions are met. There is nothing better. This is not the case in the MLM.\nThe reason is that when there are \\(p > 1\\) response variables, and we are testing a hypothesis comprising \\(\\text{df}_h >1\\) coefficients or degrees of freedom, there are \\(s > 1\\) possible dimensions in which \\(\\mathbf{H}\\) can be large relative to \\(\\mathbf{E}\\), each measured by the eigenvalue \\(\\lambda_i\\). There are several test statistics that combine these into a single measure, shown in Table 10.1.\n\n\n\n\n\n\n\n\n\n\nTable 10.1: Test statistics for multivariate tests combine the size of dimensions of \\(\\mathbf{H}\\mathbf{E}^{-1}\\) into a single measure.\n\n\n\n\nCriterion\nFormula\nPartial \\(\\eta^2\\)\n\n\n\n\n\nWilks’s \\(\\Lambda\\)\n\n\\(\\Lambda = \\prod^s_i \\frac{1}{1+\\lambda_i}\\)\n\\(\\eta^2 = 1-\\Lambda^{1/s}\\)\n\n\n\nPillai trace\n\\(V = \\sum^s_i \\frac{\\lambda_i}{1+\\lambda_i}\\)\n\\(\\eta^2 = \\frac{V}{s}\\)\n\n\n\nHotelling-Lawley trace\n\\(H = \\sum^s_i \\lambda_i\\)\n\\(\\eta^2 = \\frac{H}{H+s}\\)\n\n\n\nRoy maximum root\n\\(R = \\lambda_1\\)\n\\(\\eta^2 = \\frac{\\lambda_1}{1+\\lambda_1}\\)\n\n\n\n\n\n\n\n\nThese correspond to different kinds of “means” of the \\(\\lambda_i\\): arithmetic, geometric, harmonic and supremum. See Friendly et al. (2013) for the geometry behind these measures.\nEach of these statistics have different sampling distributions under the null hypothesis, but they can all be converted to \\(F\\) statistics, which are exact when \\(s \\le 2\\), and approximations otherwise. As well, each has an analog of the \\(R^2\\)-like partial \\(\\eta^2\\) measure, giving the partial association accounted for by each term in the MLM.\n\n10.3.1 Testing contrasts and linear hypotheses\nEven more generally, these multivariate tests apply to every linear hypothesis concerning the coefficients in \\(\\mathbf{B}\\). Suppose we want to test the hypothesis that a subset of rows (predictors) and/or columns (responses) simultaneously have null effects. This can be expressed in the general linear test, \\[\n\\mathcal{H}_0 : \\mathbf{C}_{h \\times q} \\, \\mathbf{B}_{q \\times p} = \\mathbf{0}_{h \\times p} \\:\\: ,\n\\] where \\(\\mathbf{C}\\) is a full rank \\(h \\le q\\) hypothesis matrix of constants, that selects subsets or linear combinations (contrasts) of the coefficients in \\(\\mathbf{B}\\) to be tested in a \\(h\\) degree-of-freedom hypothesis.\nIn this case, the SSP matrix for the hypothesis has the form \\[\n\\mathbf{H}  =\n(\\mathbf{C} \\widehat{\\mathbf{B}})^\\mathsf{T}\\,\n[\\mathbf{C} (\\mathbf{X}^\\mathsf{T}\\mathbf{X} )^{-1} \\mathbf{C}^\\mathsf{T}]^{-1} \\,\n(\\mathbf{C} \\widehat{\\mathbf{B}})\n\\tag{10.4}\\]\nwhere there are \\(s = \\min(h, p)\\) non-zero eigenvalues of \\(\\mathbf{H}\\mathbf{E}^{-1}\\). In Equation 10.4, \\(\\mathbf{H}\\) measures the (Mahalanobis) squared distances (and cross products) among the linear combinations \\(\\mathbf{C} \\widehat{\\mathbf{B}}\\) from the origin under the null hypothesis.  
\nFor example, with three responses \\(y_1, y_2, y_3\\) and three predictors \\(x_1, x_2, x_3\\), we can test the hypothesis that neither \\(x_2\\) nor \\(x_3\\) contributes at all to predicting the \\(y\\)s by testing that the coefficients in the corresponding rows of \\(\\mathbf{B}\\) are zero, using a two-row \\(\\mathbf{C}\\) matrix that simply selects those rows:\n\\[\\begin{aligned}\n\\mathcal{H}_0 : \\mathbf{C} \\mathbf{B} & =\n\\begin{bmatrix}\n0 & 0 & 1 & 0 \\\\\n0 & 0 & 0 & 1\n\\end{bmatrix}\n\\begin{pmatrix}\n  \\beta_{0,y_1} & \\beta_{0,y_2} & \\beta_{0,y_3} \\\\\n  \\beta_{1,y_1} & \\beta_{1,y_2} & \\beta_{1,y_3} \\\\\n  \\beta_{2,y_1} & \\beta_{2,y_2} & \\beta_{2,y_3} \\\\\n  \\beta_{3,y_1} & \\beta_{3,y_2} & \\beta_{3,y_3}\n\\end{pmatrix} \\\\ \\\\\n& =\n\\begin{bmatrix}\n  \\beta_{2,y_1} & \\beta_{2,y_2} & \\beta_{2,y_3} \\\\\n  \\beta_{3,y_1} & \\beta_{3,y_2} & \\beta_{3,y_3} \\\\\n\\end{bmatrix}\n=\n\\mathbf{0}_{(2 \\times 3)}\n\\end{aligned}\\]\nIn MANOVA designs, it is often desirable to follow up a significant effect for a factor with subsequent tests to determine which groups differ. While you can simply test for all pairwise differences among groups (using Bonferroni or other corrections for multiplicity), a more substantively driven approach uses planned comparisons or contrasts among the factor levels.\nFor a factor with \\(g\\) groups, a contrast is simply a comparison of the mean of one subset of groups against the mean of another subset. This is specified as a weighted sum, \\(L\\), of the means with weights \\(\\mathbf{c}\\) that sum to zero,\n\\[\nL = \\mathbf{c}^\\mathsf{T} \\boldsymbol{\\mu} = \\sum_i c_i \\mu_i \\quad\\text{such that}\\quad \\Sigma c_i = 0\n\\] Two contrasts, \\(\\mathbf{c}_1\\) and \\(\\mathbf{c}_2\\), are orthogonal if the sum of products of their weights is zero, i.e., \\(\\mathbf{c}_1^\\mathsf{T} \\mathbf{c}_2 = \\Sigma c_{1i} \\times c_{2i} = 0\\). When contrasts are placed as columns of a matrix \\(\\mathbf{C}\\), they are all mutually orthogonal if each pair is orthogonal, which means \\(\\mathbf{C}^\\mathsf{T} \\mathbf{C}\\) is a diagonal matrix. Orthogonal contrasts correspond to statistically independent tests.\nFor example, with the \\(g=4\\) groups for the dogfood data, the company might want to test the following comparisons among the formulas Old, New, Major and Alps: (a) Ours vs. Theirs: the average of (Old, New) compared to (Major, Alps); (b) Old vs. New; (c) Major vs. Alps. The contrasts that do this are:\n\\[\\begin{aligned}\nL_1 & = \\textstyle{\\frac12} (\\mu_O + \\mu_N) -\n        \\textstyle{\\frac12} (\\mu_M + \\mu_A) & \\rightarrow\\: & \\mathbf{c}_1 =\n\\textstyle{\\frac12}\n     \\begin{pmatrix}\n      1 &  1 & -1 & -1\n     \\end{pmatrix} \\\\\nL_2 & = \\mu_O - \\mu_N                     & \\rightarrow\\: & \\mathbf{c}_2 =\n    \\begin{pmatrix}\n     1 &  -1 & 0 & 0\n    \\end{pmatrix} \\\\\nL_3 & = \\mu_M - \\mu_A                     & \\rightarrow\\: & \\mathbf{c}_3 =\n    \\begin{pmatrix}\n     0 &  0 & 1 & -1\n    \\end{pmatrix}\n\\end{aligned}\\]\nNote that these correspond to nested dichotomies among the four groups: first we compare groups (Old and New) against groups (Major and Alps), and then within each of these sets. 
Such contrasts are always orthogonal, and therefore correspond to statistically independent tests.\nIn R, contrasts for a factor are specified as columns of a matrix, each of which sums to zero. For this example, we can set this up by creating each contrast as a vector and joining them as columns using cbind():\n\nc1 <- c(1,  1, -1, -1)/2    # Old,New vs. Major,Alps\nc2 <- c(1, -1,  0,  0)      # Old vs. New\nc3 <- c(0,  0,  1, -1)      # Major vs. Alps\nC <- cbind(c1,c2,c3) \nrownames(C) <- levels(dogfood$formula)\n\nC\n#>         c1 c2 c3\n#> Old    0.5  1  0\n#> New    0.5 -1  0\n#> Major -0.5  0  1\n#> Alps  -0.5  0 -1\n\n# show they are mutually orthogonal\nt(C) %*% C\n#>    c1 c2 c3\n#> c1  1  0  0\n#> c2  0  2  0\n#> c3  0  0  2\n\nFor the dogfood data, with formula as the group factor, you can set up the analyses to use these contrasts by assigning the matrix C to contrasts() for that factor in the dataset itself. When the contrasts are changed, it is necessary to refit the model. The estimated coefficients then become the estimated mean differences for the contrasts.\n\ncontrasts(dogfood$formula) <- C\ndogfood.mod <- lm(cbind(start, amount) ~ formula, \n                  data=dogfood)\ncoef(dogfood.mod)\n#>              start amount\n#> (Intercept)  1.688  85.56\n#> formulac1   -1.375  10.88\n#> formulac2   -0.500   3.25\n#> formulac3    0.125   1.88\n\nFor example, the Ours vs. Theirs mean differences estimated by formulac1 show that our formulas take 1.38 less time to start eating and are eaten 10.88 more on average.\nFor multivariate tests, when all contrasts are pairwise orthogonal, the overall test of a factor with \\(\\text{df}_h = g-1\\) degrees of freedom can be broken down into \\(g-1\\) separate 1 df tests. This gives rise to a set of \\(\\text{df}_h\\) rank 1 \\(\\mathbf{H}\\) matrices that additively decompose the overall hypothesis SSCP matrix,\n\\[\n\\mathbf{H} = \\mathbf{H}_1 + \\mathbf{H}_2 + \\cdots + \\mathbf{H}_{\\text{df}_h} \\:\\: ,\n\\tag{10.5}\\]\nexactly as the univariate \\(\\text{SS}_H\\) can be decomposed using orthogonal contrasts in an ANOVA.\nYou can test such contrasts or any other hypotheses involving linear combinations of the coefficients using car::linearHypothesis(). Here, \"formulac1\" refers to the contrast c1 for the difference between Ours and Theirs. Note that because this is a 1 df test, all four test statistics yield the same \\(F\\) value.\n\nhyp <- rownames(coef(dogfood.mod))[-1] |> print()\n#> [1] \"formulac1\" \"formulac2\" \"formulac3\"\nH1 <- linearHypothesis(dogfood.mod, hyp[1], title=\"Ours vs. Theirs\") |> \n  print()\n#> \n#> Sum of squares and products for the hypothesis:\n#>         start amount\n#> start    7.56  -59.8\n#> amount -59.81  473.1\n#> \n#> Sum of squares and products for error:\n#>        start amount\n#> start   25.8   11.7\n#> amount  11.7  390.3\n#> \n#> Multivariate Tests: Ours vs. Theirs\n#>                  Df test stat approx F num Df den Df Pr(>F)   \n#> Pillai            1     0.625     9.18      2     11 0.0045 **\n#> Wilks             1     0.375     9.18      2     11 0.0045 **\n#> Hotelling-Lawley  1     1.669     9.18      2     11 0.0045 **\n#> Roy               1     1.669     9.18      2     11 0.0045 **\n#> ---\n#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nSimilar tests apply for the other two contrasts, comparing groups within each of these two subsets, but we don’t print the results here:\n\nH2 <- linearHypothesis(dogfood.mod, hyp[2], title=\"Old vs. New\")\nH3 <- linearHypothesis(dogfood.mod, hyp[3], title=\"Major vs. 
Major\")\n\nThen, we can illustrate Equation 10.5 by extracting the 1 df \\(\\mathbf{H}\\) matrices (SSPH) from the results of linearHypothesis.\n\n\noptions(print.latexMatrix = list(display.labels=FALSE))\nSSP_H1 <- H1$SSPH |> round(digits=2)\nSSP_H2 <- H2$SSPH |> round(digits=2)\nSSP_H3 <- H3$SSPH |> round(digits=2)\nEqn(latexMatrix(SSP_H),  \"=\", \n    latexMatrix(SSP_H1), \"+\", \n    latexMatrix(SSP_H2), \"+\", \n    latexMatrix(SSP_H3), quarto=TRUE)\n$$\n\n10 & -71\n-71 & 586\n\n=\n\n8 & -60\n-60 & 473\n\n+\n\n2 & -13\n-13 & 84\n\n+\n\n0.1 & 1.9\n1.9 & 28.1\n\n$$\n\n\\[\n\\overset{\\mathbf{H}}\n{\\begin{pmatrix}\n  9.7 & -70.9 \\\\\n-70.9 & 585.7 \\\\\n\\end{pmatrix}}\n=\n\\overset{\\mathbf{H}_1}\n{\\begin{pmatrix}\n  7.6 & -59.8 \\\\\n-59.8 & 473.1 \\\\\n\\end{pmatrix}}\n+\n\\overset{\\mathbf{H}_2}\n{\\begin{pmatrix}\n0.13 &  1.88 \\\\\n1.88 & 28.12 \\\\\n\\end{pmatrix}}\n+\n\\overset{\\mathbf{H}_3}\n{\\begin{pmatrix}\n  2 & -13 \\\\\n-13 &  84 \\\\\n\\end{pmatrix}}\n\\]",
         "crumbs": [
           "Multivariate Linear Models",
           "10  Multivariate Linear Models"