
Commit

complete Ch06 ridge biplot
friendly committed Nov 9, 2024
1 parent 2c831e7 commit f9c6f7e
Showing 22 changed files with 261 additions and 204 deletions.
77 changes: 57 additions & 20 deletions 08-collinearity-ridge.qmd
@@ -1052,7 +1052,7 @@ c(HKB=lridge$kHKB,
The shrinkage constant $k$ doesn't have much intrinsic meaning, so
it is often easier to interpret the plot when coefficients are plotted against the equivalent degrees of freedom, $\text{df}_k$.
OLS corresponds to $\text{df}_k = 6$ degrees of freedom in the space of six parameters,
and the effect of shrinkage is to decrease the degrees of freedom, as if estimating fewer parameters.
and the effect of shrinkage is to decrease the degrees of freedom, as if estimating fewer parameters. This more natural scale also makes the changes in coefficient with shrinkage more nearly linear.
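For reference, the equivalent degrees of freedom are usually defined as $\text{df}_k = \mathrm{tr}\left[\mathbf{X} (\mathbf{X}^\top\mathbf{X} + k \mathbf{I})^{-1} \mathbf{X}^\top\right] = \sum_i d_i^2 / (d_i^2 + k)$, where the $d_i$ are the singular values of the scaled predictor matrix. Here is a minimal sketch of that calculation, assuming the six standard predictors of the Longley regression; it is an illustration only, not the `genridge` internals.

```r
# A sketch of df_k = sum_i d_i^2 / (d_i^2 + k), the trace of the ridge "hat" matrix.
# Illustration only; the values plotted by traceplot() are computed by genridge itself.
X <- scale(longley[, c("GNP.deflator", "GNP", "Unemployed",
                       "Armed.Forces", "Population", "Year")])
d <- svd(X)$d                       # singular values of the scaled predictors
df_k <- function(k) sum(d^2 / (d^2 + k))
df_k(0)      # 6: no shrinkage, the OLS solution
df_k(0.05)   # fewer equivalent parameters under shrinkage
```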
```{r echo=-1}
#| label: fig-longley-traceplot2
@@ -1108,43 +1108,49 @@ The `pairs()` method for `"ridge"` objects shows all pairwise views in scatterplot
#| fig-height: 10
#| out-width: "90%"
#| fig-cap: "Scatterplot matrix of bivariate ridge trace plots. Each panel shows the effect of shrinkage on the covariance ellipse for a pair of predictors."
pairs(lridge, radius=0.5, diag.cex = 1.5)
pairs(lridge, radius=0.5, diag.cex = 2)
```
### Visualizing the bias-variance tradeoff
The function `precision()` calculates a number of measures of the effect of shrinkage of the coefficients in relation to the "size" of
the covariance matrix $\boldsymbol{\mathcal{V}}_k \equiv \widehat{\Var} (\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k)$:
* `norm.beta` $= \left \Vert \boldsymbol{\beta}\right \Vert / \max{\left \Vert \boldsymbol{\beta}\right \Vert}$ is a summary measure of shrinkage, the normalized root mean square of the estimated coefficients.
* `det` $ = \log{| \mathcal{V}_k |}$ is an overall measure of variance of the coefficients. It is the (linearized) volume of the covariance ellipsoid and corresponds conceptually to Wilks' Lambda criterion.
* `trace` $ = \text{trace} (\boldsymbol{\mathcal{V}}_k) $ is the sum of the variances and also the sum of the eigenvalues of $\boldsymbol{\mathcal{V}}_k$, conceptually similar to Pillai's trace criterion.
the covariance matrix $\boldsymbol{\mathcal{V}}_k \equiv \widehat{\Var} (\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}_k)$. Larger shrinkage $k$ should lead
to a smaller ellipsoid for $\boldsymbol{\mathcal{V}}_k$, indicating increased precision.
```{r precision}
pdat <- precision(lridge) |> print()
```
Here,
* `norm.beta` $= \left \Vert \boldsymbol{\beta}\right \Vert / \max{\left \Vert \boldsymbol{\beta}\right \Vert}$ is a summary measure of shrinkage, the normalized root mean square of the estimated coefficients. It starts at 1.0 for $k=0$ and decreases.
* `det` $= \log{| \mathcal{V}_k |}$ is an overall measure of variance of the coefficients. It is the (linearized) volume of the covariance ellipsoid and corresponds conceptually to Wilks' Lambda criterion.
* `trace` $= \text{trace} (\boldsymbol{\mathcal{V}}_k) $ is the sum of the variances and also the sum of the eigenvalues of $\boldsymbol{\mathcal{V}}_k$, conceptually similar to Pillai's trace criterion.
* `max.eig` is the largest eigenvalue measure of size, an analog of Roy's maximum root test.
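To make these definitions concrete, here is a minimal sketch of how the four measures could be computed for a single value of $k$, given a coefficient vector and its covariance matrix. The helper `size_measures()` and its arguments are hypothetical illustrations, not part of `genridge`.

```r
# Sketch of the four "size" measures for one shrinkage constant k
# (illustration only; precision() computes these across all k in the ridge object)
size_measures <- function(beta, V, beta_ols) {
  c(norm.beta = sqrt(sum(beta^2)) / sqrt(sum(beta_ols^2)), # ||b|| relative to OLS
    det       = log(det(V)),                               # log volume of the ellipsoid
    trace     = sum(diag(V)),                              # total variance, sum of eigenvalues
    max.eig   = max(eigen(V, only.values = TRUE)$values))  # largest eigenvalue
}
```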
Plotting shrinkage against a measure of variance gives a direct view of the tradeoff
between bias and precision. Here I plot `norm.beta` against `det`, and join the
points with a curve. You can see that in this example the HKB criterion prefers a smaller degree of shrinkage, but achieves only a modest decrease in variance. Beyond that point, however, variance decreases more sharply, and the LW choice achieves greater precision.
```{r echo=-1}
#| label: fig-longley-precision-plot
#| code-fold: true
#| fig-width: 8
#| fig-height: 7
#| out-width: "80%"
#| fig-cap: "The tradeoff between bias and precision"
#| fig-cap: "The tradeoff between bias and precision. Bias increases as we move away from the OLS solution, but precision increases."
op <- par(mar=c(4, 4, 1, 1) + 0.2)
library(splines)
with(pdat, {
plot(norm.beta, det, type="b",
cex.lab=1.25, pch=16, cex=1.5, col=clr, lwd=2,
xlab='shrinkage: ||b|| / max(||b||)',
ylab='variance: log |Var(b)|')
ylab='variance: log |Var(b)|')
text(norm.beta, det,
labels = lambdaf,
cex=1.25, pos=c(rep(2,length(lambda)-1),4))
@@ -1153,8 +1159,10 @@ with(pdat, {
cex=1.5, pos=4)
})
# find locations for optimal shrinkage criteria
mod <- lm(cbind(det, norm.beta) ~ bs(lambda, df=5), data=pdat)
x <- data.frame(lambda=c(lridge$kHKB, lridge$kLW))
mod <- lm(cbind(det, norm.beta) ~ bs(lambda, df=5),
data=pdat)
x <- data.frame(lambda=c(lridge$kHKB,
lridge$kLW))
fit <- predict(mod, x)
points(fit[,2:1], pch=15, col=gray(.50), cex=1.6)
text(fit[,2:1], c("HKB", "LW"), pos=3, cex=1.5, col=gray(.50))
@@ -1170,43 +1178,72 @@ from parameter space, where the estimated coefficients are
$\beta_k$ with covariance matrices $\boldsymbol{\mathcal{V}}_k$, to the
principal component space defined by the right singular vectors, $\mathbf{V}$,
of the singular value decomposition of the scaled predictor matrix, $\mathbf{X}$.
In PCA space the total variance of the predictors is apportioned among uncorrelated linear combinations (the principal components), ordered by the amount of variance each accounts for.
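Conceptually this is just a change of basis: the coefficients are rotated to $\mathbf{V}^\top \widehat{\boldsymbol{\beta}}_k$ and their covariance matrices to $\mathbf{V}^\top \boldsymbol{\mathcal{V}}_k \mathbf{V}$. A minimal sketch of the idea follows; the `pca()` method does this for a `"ridge"` object, and its scaling conventions may differ in detail.

```r
# Sketch of the rotation into PCA/SVD coordinates (illustration only)
X <- scale(longley[, c("GNP.deflator", "GNP", "Unemployed",
                       "Armed.Forces", "Population", "Year")])
V <- svd(X)$v               # right singular vectors: predictor space -> PC space
# for a coefficient vector b and covariance matrix Vk in predictor space:
#   PC-space coefficients:  t(V) %*% b
#   PC-space covariance:    t(V) %*% Vk %*% V
```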
```{r pca-ridge}
plridge <- pca(lridge)
plridge
```
Then, a `traceplot()` of the resulting `"pcaridge"` object shows how the dimensions
are affected by shrinkage, on the scale of degrees of freedom in @fig-longley-pca-traceplot.
```{r echo=-1}
#| label: fig-longley-pca-traceplot
#| fig-cap: "Ridge traceplot for the longley regression viewed in PCA space. The dimensions are the linear combinations of the predictors which account for greatest variance."
par(mar=c(4, 4, 1, 1)+ 0.1)
plridge <- pca(lridge)
plridge
traceplot(plridge)
traceplot(plridge, X="df",
cex.lab = 1.2, lwd=2)
```
What may be surprising at first is that the coefficients for the first 4 components are not shrunk at all. Rather, the effect of shrinkage is seen only on the _last two dimensions_.
What may be surprising at first is that the coefficients for the first 4 components are not shrunk at all. These large dimensions are immune to ridge tuning.
Rather, the effect of shrinkage is seen only on the _last two dimensions_.
But those also are the directions that contribute most to collinearity as we saw earlier.
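There is a simple reason for this. Writing the SVD of the scaled predictor matrix as $\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^\top$ with singular values $d_i$, a standard result for ridge regression is that the coefficient for principal component $i$ is a rescaled version of its OLS value,

$$
\widehat{\beta}^{\mathrm{PC}}_{i,k} = \frac{d_i^2}{d_i^2 + k} \; \widehat{\beta}^{\mathrm{PC}}_{i,0} \; ,
$$

so components with large singular values (dimensions 1 to 4 here) are barely touched, while the nearly collinear directions with small $d_i$ (dimensions 5 and 6) bear nearly all of the shrinkage.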
** pairs() ? **
A `pairs()` plot gives a dramatic representation of the bivariate effects of shrinkage in PCA space: the principal components of $\mathbf{X}$ are uncorrelated, so the ellipses are all aligned with the coordinate axes and largely coincide for dimensions 1 to 4:
```{r echo = -1}
#| label: fig-longley-pca-pairs
#| out-width: "100%"
pairs(plridge)
```
If we focus on the plot of dimensions `5:6`, we can see where all the shrinkage action
is in this representation. Generally, the predictors that are related to the smallest
dimension (6) are shrunk quickly at first.
```{r echo = -1}
#| label: fig-longley-pca-dim56
#| fig-height: 7
#| fig-width: 7
#| fig-cap: "Bivariate ridge trace plot for the smallest two dimensions ... "
par(mar=c(4, 4, 1, 1)+ 0.1)
plot(plridge, variables=5:6, fill = TRUE, fill.alpha=0.2)
plot(plridge, variables=5:6,
fill = TRUE, fill.alpha=0.2)
text(plridge$coef[, 5:6],
label = lambdaf,
cex=1.5, pos=4, offset=.1)
```
### Biplot view
Finally, we can project the predictor variables into the PCA space of the smallest dimensions, where the shrinkage action mostly occurs to see how the predictor variables relate to these dimensions.
The question arises how to relate this view of shrinkage in PCA space to the original
predictors. The biplot is again your friend.
You can project variable vectors for the
predictor variables into the PCA space of the smallest dimensions, where the shrinkage action mostly occurs, to see how they relate to these dimensions.
`biplot.pcaridge()` supplements the standard display of the covariance ellipsoids for a ridge regression problem in PCA/SVD space with labeled arrows showing the contributions of the original variables to the dimensions plotted. The length of the arrows reflects the proportion of variance that each predictor shares with the components.
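One way to think about those arrows is as structure correlations between the predictors and the principal component scores. A rough sketch of that calculation follows; the scaling actually used by `biplot.pcaridge()` may differ.

```r
# Sketch: correlations of the predictors with the PCA dimensions, one way to
# think about the variable vectors in the biplot (illustration only; the
# scaling used by biplot.pcaridge() may differ).
X      <- scale(longley[, c("GNP.deflator", "GNP", "Unemployed",
                            "Armed.Forces", "Population", "Year")])
sv     <- svd(X)
scores <- sv$u %*% diag(sv$d)      # principal component scores
round(cor(X, scores)[, 5:6], 2)    # loadings on the two smallest dimensions
```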
The biplot view showing the dimensions corresponding to the two smallest singular values is particularly useful for understanding how the predictors contribute to shrinkage in ridge regression. Here, Year and Population largely contribute to dimension 5; a contrast between (Year, Population) and GNP contributes to dimension 6.
The biplot view in @fig-longley-pca-biplot showing the two smallest dimensions is particularly useful for understanding how the predictors contribute to shrinkage in ridge regression. Here, Year and Population largely contribute to dimension 5; a contrast between (Year, Population) and GNP contributes to dimension 6.
```{r echo = -1}
#| label: fig-longley-pca-biplot
#| fig-height: 7
#| fig-width: 7
#| fig-cap: "Biplot view of the ridge trace plot for the smallest two dimensions ..."
op <- par(mar=c(4, 4, 1, 1) + 0.2)
biplot(plridge, radius=0.5,
ref=FALSE, asp=1,
3 changes: 2 additions & 1 deletion _quarto.yml
@@ -145,10 +145,11 @@ format:
mainfont: "Roboto"
monofont: "Fira mono"
# monofont: "JetBrains Mono"
# monofont: "Fira mono"
# monofont: "Fira cod" # -- give ligatures for |> etc
title-block-style: default
title-block-banner: true
code-block-bg: '#E8FFFF' #'#f1f1f1'
tab-stop: 2
# include-before-body: latex-commands.qmd
# linkcolor: "#03638E"
# fontsize: "15px"