Skip to content

Commit

Permalink
add workers PCA example to Ch 04
Browse files Browse the repository at this point in the history
  • Loading branch information
friendly committed Oct 9, 2024
1 parent ef1ae4b commit f3cf31c
Show file tree
Hide file tree
Showing 67 changed files with 4,371 additions and 233 deletions.
2 changes: 1 addition & 1 deletion 03-multivariate_plots.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ In this chapter I use the following packages. Load them now:

```{r load-ggally}
#| include: false
suppressPackageStartupMessages(library(GGally))
suppressPackageStartupMessages(library(GGally, quietly = TRUE))
```


Expand Down
125 changes: 120 additions & 5 deletions 04-pca-biplot.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,7 @@ library(FactoMineR)
library(factoextra)
library(car)
library(ggpubr)
library(matlib)
```


Expand Down Expand Up @@ -226,7 +227,7 @@ squared length and energy, the positions of the `r colorize("red")` points on th
<div align="center">
<iframe width="412" height="364" src="images/pca-springs-cropped.gif"></iframe>
</div>
Animation of PCA fitted by springs. The `r colorize("blue")` data points are connected to their projections on the `r colorize("red")` line by springs perpendicular to that line. From an initial position, the springs pull that line in proportion to their squared distances, until the line finally settles down to the position where the forces are balanced and the minimum is achieved. _Source_: Amoeba, https://bit.ly/46tAicu.
Animation of PCA fitted by springs. The `r colorize("blue")` data points are connected to their projections on the `r colorize("red")` line by springs perpendicular to that line. From an initial position, the springs pull that line in proportion to their squared distances, until the line finally settles down to the position where the forces are balanced and the minimum is achieved. _Source_: Amoeba, [https://bit.ly/46tAicu].

:::

Expand All @@ -250,8 +251,10 @@ diagonal matrix $\mathbf{\Lambda}$, whose diagonal elements ($\lambda_i$) are th

To explain this, it is helpful to use a bit of matrix math:

\begin{align*}
\mathbf{S}_{p \times p} & = \mathbf{V}_{p \times p} \quad\quad \mathbf{\Lambda}_{p \times p} \quad\quad \mathbf{V}_{p \times p}^\mathsf{T} \\
\begin{align}
\mathbf{S}_{p \times p} & = \mathbf{V}_{p \times p} \phantom{0000000000}
\mathbf{\Lambda}_{p \times p} \phantom{0000000000}
\mathbf{V}_{p \times p}^\mathsf{T} \\
& = \left( \mathbf{v}_1, \, \mathbf{v}_2, \,\dots, \, \mathbf{v}_p \right)
\begin{pmatrix}
\lambda_1 & & & \\
Expand All @@ -268,7 +271,7 @@ To explain this, it is helpful to use a bit of matrix math:
\end{pmatrix}
\\
& = \lambda_1 \mathbf{v}_1 \mathbf{v}_1^\mathsf{T} + \lambda_2 \mathbf{v}_2 \mathbf{v}_2^\mathsf{T} + \cdots + \lambda_p \mathbf{v}_p \mathbf{v}_p^\mathsf{T}
\end{align*}
\end{align} {#eq-S-eigen}

In this equation,

Expand Down Expand Up @@ -307,6 +310,117 @@ of the eigenvalues.
```
<!-- UDI: I like this figure, but I would change the blue curved arrow. What do you think? I can put together a few drafts using Inkscape. Let me know if you agree and if you have any suggestions. -->

#### Example: Workers' experience and income {.unnumbered}

For a small example, consider the relation between years of experience and income in a small (contrived) sample
($n = 10$) of workers in a factory. The dataset `workers` contains these and other variables.
In a wider context, we might want to fit a regression model to predict `Income`, but here we focus on
a PCA of just these two variables.

```{r workers-data}
load(here::here("data", "workers.RData"))
head(workers)
```

Let's start with a simple scatterplot of `Income` vs. `Experience`, with points labeled by `Name`
(and colored by `Gender`). There's a fairly strong correlation
($r$ = `r with(workers, cor(Income, Experience))`). How does a PCA capture this?

<!-- fig.source: R/workers-pca.R -->

```{r}
#| label: fig-workers-scat
#| out-width: "80%"
#| fig-cap: "Scatterplot of Income vs. Experience for the `workers` data."
vars <- c("Experience", "Income")
plot(workers[, vars],
pch = 16, cex = 1.5,
cex.lab = 1.5)
text(workers[, vars],
labels = workers$Name,
col = ifelse(workers$Gender == "Female", "red", "blue"),
pos = 3, xpd = TRUE)
```

First , calculate the vector of means ($\bar{\mathbf{x}}$) and covariance matrix $\mathbf{S}$.
```{r workers-cov}
mu <- colMeans(workers[, vars]) |> print()
S <- cov(workers[, vars]) |> print()
```

Eigenvalues and eigenvectors are calculated by `eigen()`. This returns a list with components
`values` for the $\lambda_i$ and `vectors` for $\mathbf{V}$.

```{r workers-eigen}
S.eig <- eigen(S)
Lambda <- S.eig$values |> print()
V <- S.eig$vectors |> print()
```

Using these, you can express the eigenvalue decomposition of $\mathbf{S}$ in @eq-S-eigen with
`latexMatrix()` and `Eqn` from the **matlib** package [@R-matlib] as:

```{r results = 'asis', echo=-1}
op <- options(digits = 5)
rownames(S) <- colnames(S) <- c("Exp", "Inc")
spacer <- "\\phantom{00000000000000}"
Eqn("\\mathbf{S} & = \\mathbf{V} spacer
\\mathbf{\\Lambda} spacer
\\mathbf{V}^\\top", Eqn_newline(),
latexMatrix(S), "& =",
latexMatrix(V), " ", diag(Lambda), " ", latexMatrix(V, transpose=TRUE),
align = TRUE)
```

Then, you can visualize the geometry of PCA as in @fig-pca-rotation (left)
by plotting the data ellipse for the points, along with the
PCA axes (`heplots::ellipse.axes()`). @fig-workers-pca also shows the
bounding box of the data ellipse.

<!-- Nasty conflict loading `spida2`: labs() conflicts with ggplot2 -->

```{r echo=-1}
#| label: fig-workers-pca
#| out-width: 90%
#| fig-cap: "Geometry of the PCA for the `workers` data, showing the data ellipse, the
#| eigenvectors of $\\mathbf{S}$, whose half-lengths are the square roots $\\sqrt{\\lambda_i}$ of the eigenvalues,
#| and the bounding box of the ellipse."
op <- par(mar = c(4, 4, 0, 0) + 0.1)
# calculate conjugate axes for PCA factorization
pca.fac <- function(x) {
xx <- svd(x)
ret <- t(xx$v) * sqrt(pmax( xx$d,0))
ret
}
dataEllipse(Income ~ Experience, data=workers,
pch = 16, cex = 1.5,
center.pch = "+", center.cex = 2,
cex.lab = 1.5,
levels = 0.68,
grid = FALSE,
xlim = c(-10, 40),
ylim = c(10, 80),
asp = 1)
abline(h = mu[2], v = mu[1], lty = 2, col = "grey")
# axes of the ellipse = PC1, PC2
radius <- sqrt(2 * qf(0.68, 2, nrow(workers)-1 ))
heplots::ellipse.axes(S, mu,
radius = radius,
labels = TRUE,
col = "red", lwd = 2,
cex = 1.8)
# bounding box of the ellipse
lines(spida2::ellplus(mu, S, radius = radius,
box = TRUE, fac = pca.fac),
col = "darkgreen",
lwd = 2, lty="longdash")
```



### Finding principal components

In R, PCA is most easily carried out using `stats::prcomp()` or `stats::princomp()`
Expand Down Expand Up @@ -734,6 +848,7 @@ ggbiplot(crime.pca,
In this dataset the states are grouped by region and we saw some differences among regions in the plot (@fig-crime-scores-plot12) of component scores.
`ggbiplot()` provides options to include a `groups =` variable, used to
color the observation points and also to draw their data ellipses, facilitating interpretation.
```{r}
#| label: fig-crime-biplot2
#| out-width: "80%"
Expand Down Expand Up @@ -1799,7 +1914,7 @@ Animation of rotation from data space to PCA space.
```{r}
#| echo: false
#| results: asis
cat("Packages used here:\n")
cat("**Packages used here**:\n\n")
write_pkgs(file = .pkg_file)
```
Expand Down
22 changes: 14 additions & 8 deletions R/common.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
# --------------

# set a seed for all chapters
set.seed(47)
set.seed(47) # The one TRUE seed

options(
digits = 3,
Expand Down Expand Up @@ -63,10 +63,7 @@ custom_hook_output <- function(x, options) {
}


# suppress "Registered S3 method overwritten by GGally"
Sys.setenv(`_R_S3_METHOD_REGISTRATION_NOTE_OVERWRITES_` = "false")

# wrap hook
# wrap chunk output hook
# from: https://github.com/yihui/knitr-examples/blob/master/077-wrap-output.Rmd
library(knitr)
hook_output = knit_hooks$get('output')
Expand Down Expand Up @@ -141,9 +138,18 @@ $\\newcommand*{\\diag}[1]{\\ensuremath{\\mathrm{diag}\\, #1}}$
# packages
# -------------

# packages to be cited here. Code at the end automatically updates packages.bib
# These should be packages not used in actual code via `library()` or
# `require()`. Packages only referenced via `data()` need to be listed here.
# References to package in text
# TODO: add styles (color, font); do it differently for PDF output
pkg <- function(package, cite) {
ref <- paste0("**", package, "**")
if (cite) ref <- paste0(ref, " [@R-", package, "]")
cat(ref)
}


# packages to be cited here. Code at the end automatically updates `packages.bib`
# These should be packages not used in actual code via `library()` or `require()`.
# Packages only referenced via `data()` need to be listed here.
.to.cite <- c(
"rgl",
"animation",
Expand Down
70 changes: 54 additions & 16 deletions R/workers-pca.R
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,55 @@
#'

library(car)
library(spida2)
#library(spida2) # note: labs() conflicts with ggplot2
library(heplots)
library(matlib)

load(here::here("data", "workers.RData"))
str(workers)

vars <- c("Experience", "Income")
plot(workers[, vars],
pch = 16, cex = 1.5,
cex.lab = 1.5)
text(workers[, vars],
labels = workers$Name,
col = ifelse(workers$Gender == "Female", "red", "blue"),
pos = 3, xpd = TRUE)

workers.pca <- prcomp(workers[, vars]) |> print()


# covariance matrix & mean
S <- cov(workers[, 3:2])
mu <- colMeans(workers[, 3:2])
mu <- colMeans(workers[, vars]) |> print()
S <- cov(workers[, vars]) |> print()

# eigenvalues and eigenvectors
S.eig <- eigen(S)
Lambda <- S.eig$values |> print()
V <- S.eig$vectors |> print()

op <- options(digits = 5)
rownames(S) <- colnames(S) <- c("Exp", "Inc")
Eqn("\\mathbf{S} & = \\mathbf{V} \\phantom{00000000000000}
\\mathbf{\\Lambda} \\phantom{00000000000000}
\\mathbf{V}^\\top", Eqn_newline(),
latexMatrix(S), "& =",
latexMatrix(V), " ", diag(Lambda), " ", latexMatrix(V, transpose=TRUE),
align = TRUE)
options(op)

radius <- sqrt(2 * qf(0.68, 2, nrow(workers)-1 ))

# calculate conjugate axes for PCA factorization
# 'conjugate axis: The line passing through the center of the ellipse and perpendicular to the major axis
pca.fac <- function(x) {
xx <- svd(x)
ret <- t(xx$v) * sqrt(pmax( xx$d,0))
ret
}


op <- par(mar = c(4, 4, 0, 0) + 0.1)
dataEllipse(Income ~ Experience, data=workers,
pch = 16, cex = 1.5,
Expand All @@ -23,14 +61,19 @@ dataEllipse(Income ~ Experience, data=workers,
levels = 0.68,
grid = FALSE,
xlim = c(-10, 40),
ylim = c(20, 80),
ylim = c(10, 80),
asp = 1)
abline(h = mu[2], v = mu[1], lty = 2, col = "grey")
ellipse.axes(S, mu,
radius = radius,
labels = TRUE,
col = "red", lwd = 2,
cex = 1.5)
# label.ends = c(2, 3),
cex = 1.8)
lines(spida2::ellplus(mu, S, radius = radius,
box = TRUE, fac = pca.fac),
col = "darkgreen",
lwd = 2, lty="longdash")
par(op)

op <- par(mar = c(4, 4, 0, 0) + 0.1)
Expand All @@ -41,7 +84,7 @@ plot(Income ~ Experience, data=workers,
asp = 1)


ell.lines <- dellplus(workers$Experience, workers$Income,
ell.lines <- spida2::dellplus(workers$Experience, workers$Income,
radius = sqrt(qchisq(0.68, 2)),
ellipse = FALSE,
box=TRUE
Expand All @@ -52,20 +95,15 @@ str(ell.lines)
S <- cov(workers[, 3:2])
mu <- colMeans(workers[, 3:2])

lines(ellplus(mu, S,
lines(spida2::ellplus(mu, S,
radius = sqrt(qchisq(0.68, 2))
))
par(op)

# show conjugate axes for PCA factorization
pca.fac <- function(x) {
xx <- svd(x)
ret <- t(xx$v) * sqrt(pmax( xx$d,0))
ret
}

lines( ellplus(mu, shape = S,
radius = sqrt(qchisq(0.68, 2)),
lines(spida2::ellplus(mu, shape = S,
radius = radius, #sqrt(qchisq(0.68, 2)),
box=TRUE, diameters=TRUE, fac=pca.fac), col = 'red')




9 changes: 9 additions & 0 deletions bib/pkgs.bib
Original file line number Diff line number Diff line change
Expand Up @@ -357,6 +357,15 @@ @Manual{R-MASS
url = {http://www.stats.ox.ac.uk/pub/MASS4/},
}

@Manual{R-matlib,
title = {matlib: Matrix Functions for Teaching and Learning Linear Algebra and
Multivariate Statistics},
author = {Michael Friendly and John Fox and Phil Chalmers},
year = {2024},
note = {R package version 1.0.0},
url = {https://github.com/friendly/matlib},
}

@Manual{R-modelbased,
title = {modelbased: Estimation of Model-Based Predictions, Contrasts and Means},
author = {Dominique Makowski and Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil},
Expand Down
1 change: 1 addition & 0 deletions bib/pkgs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ ggpubr
imager
knitr
magrittr
matlib
patchwork
Rtsne
tidyr
Expand Down
Loading

0 comments on commit f3cf31c

Please sign in to comment.