Skip to content

Commit

Permalink
start on biplots
Browse files Browse the repository at this point in the history
  • Loading branch information
friendly committed Nov 30, 2023
1 parent 8b0929c commit 5f5ea4a
Show file tree
Hide file tree
Showing 25 changed files with 323 additions and 187 deletions.
52 changes: 44 additions & 8 deletions 04-pca-biplot.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -69,11 +69,14 @@ knitr::include_graphics("images/MV-juicer.png")
Here, I concentrate on **principal components analysis** (PCA), whose goal reflects A Square's desire to see that sparkly cloud of points
in $nD$ space in the plane showing the greatest variation (squeezing the most juice)
among all other possible views.

This appealed to his sense of geometry, but left him wondering how the variables in that
high-D cloud were related to the dimensions he could see in a best-fitting plane.

The idea of a **biplot**, showing the data points in the plane, together with thick pointed
arrows---variable vectors--- in one view is the other topic explained in this chapter (@sec-biplot).
arrows---variable vectors--- in one view is the other topic explained in this chapter (@sec-biplot). The biplot is the simplest example of a multivariate juicer. The essential idea is to project the cloud of data points in $n$ dimensions into the
2D space of principal components and simultaneously show how the original variables
relate to this space. For exploratory analysis to get an initial, incisive view of
a multivariate dataset, a biplot is often my first choice.

**Packages**

Expand Down Expand Up @@ -264,11 +267,13 @@ The components returned by various PCA methods have (confusingly) different name
Then, a simple visualization is a plot of the proportion of variance for each component (or cumulative proportion)
against the component number, usually called a **screeplot**. The idea, introduced by @Cattell1966, is that
after the largest, dominant components, the remainder should resemble the rubble, or scree formed by rocks falling
from a cliff. From this plot, imagine
from a cliff.

From this plot, imagine
drawing a straight line through the plotted eigenvalues, starting with the largest one. The typical rough guidance is that the last point to fall on this line represents the last component to extract, the idea being that beyond this, the amount of additional variance explained is non-meaningful. Another rule of thumb is to choose the number
of components to extract a desired proportion of total variance, usually in the range of 80 - 90%.

`stats::plot(crime.pca)` would give a bar plot of the variances of the components, however ggbiplot::ggscreeplot()` gives nicer and more flexible displays as shown in @fig-crime-ggscreeplot.
`stats::plot(crime.pca)` would give a bar plot of the variances of the components, however `ggbiplot::ggscreeplot()` gives nicer and more flexible displays as shown in @fig-crime-ggscreeplot.

```{r}
#| label: fig-crime-ggscreeplot
Expand Down Expand Up @@ -331,7 +336,7 @@ ellipse.

```{r}
#| label: fig-crime-scores-plot12
#| out-width: "80%"
#| out-width: "100%"
#| fig-cap: "Plot of component scores on the first two principal components for the `crime` data. States are colored by `region`."
crime.pca |>
broom::augment(crime) |> # add original dataset back in
Expand Down Expand Up @@ -413,19 +418,50 @@ vectors |>

What is shown in @fig-crime-vectors has the following interpretations:

* the lengths of the variable vectors, $||\mathbf{v}_i|| = \sqrt{\Sigma_{j = 1:2} \; v_{ij}^2}$ give the proportion of variance of each variable accounted for in a two-dimensional display.
(1) the lengths of the variable vectors, $||\mathbf{v}_i|| = \sqrt{\Sigma_{j = 1:2} \; v_{ij}^2}$ give the proportion of variance of each variable accounted for in a two-dimensional display.

* the value, $v_{ij}$, of the vector for variable $\mathbf{x}_i$ on component $j$ gives
(2) the value, $v_{ij}$, of the vector for variable $\mathbf{x}_i$ on component $j$ gives
the correlation of that variable with the $j$th principal component.

* the cosine of the angle between two variable vectors, $\mathbf{v}_i$ and $\mathbf{v}_j$
(3) the cosine of the angle between two variable vectors, $\mathbf{v}_i$ and $\mathbf{v}_j$
gives the approximation of the correlation between $\mathbf{x}_i$ and $\mathbf{x}_j$
that is shown in this space. This means that two variable vectors that point in the same direction are highly correlated, while variable vectors at right angles are approximately uncorrelated.

To illustrate point (1), the following indicates that almost 70% of the variance of
`murder` is represented in the the 2D plot shown in @fig-crime-scores-plot12, but only
40% of the variance of `robbery` is captured.

```{r}
vectors |> select(label, PC1, PC2) |>
mutate(length = sqrt(PC1^2 + PC2^2))
```




## Biplots {#sec-biplot}

The biplot is a simple and powerful idea that came from the recognition that
you can overlay a plot of observation scores in a principal components analysis
with the information of the variable loadings (weights) to give a simultaneous display
that is easy to interpret. In this sense, a biplot is generalization of a scatterplot,
going from data space to PCA space, where the observations are shown by points and variables are shown by vectors (or scaled linear axes aligned with those vectors).

Biplot methodolgy is far more general than I cover here. ...
**TODO**: mention categorical variables (category level points), correspondence analysis, CCA, MDS, ...

The idea of the biplot was introduced by Ruben Gabriel [-@Gabriel:71] and later expanded in scope by @GowerHand:96.
**TODO**: Cite Greenacre, Biplots in Practice, Gower et al. 2010.

### Constructing a biplot

SVD:

### Example




## Elliptical insights: Outlier detection

The data ellipse (@sec-data-ellipse), or ellipsoid in more than 2D is fundamental in regression. But also,
Expand Down
119 changes: 1 addition & 118 deletions R/crime-ggbiplot.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ library(corrplot)
library(patchwork)
library(broom)

data(crime)
data(crime, package = "ggbiplot")

crime |>
dplyr::select(where(is.numeric)) |>
Expand All @@ -23,123 +23,6 @@ crime.pca <-

summary(crime.pca)

# show the eigenvalue decomposition
(crime.eig <- crime.pca |>
broom::tidy(matrix = "eigenvalues"))

# add fitted line for smallest eigenvalues
p1 <- ggscreeplot(crime.pca) +
stat_smooth(data = crime.eig |> filter(PC>=4),
aes(x=PC, y=percent), method = "lm",
se = FALSE,
fullrange = TRUE) +
theme_bw(base_line_size = 14)

p2 <- ggscreeplot(crime.pca, type = "cev") +
geom_hline(yintercept = c(0.8, 0.9), color = "blue") +
theme_bw(base_size = 14)

p1 + p2

# tidy scores plot

scores <- crime.pca |> purrr::pluck("x")
cov(scores) |> zapsmall()

crime.pca |>
broom::augment(crime) |> # add original dataset back in
ggplot(aes(.fittedPC1, .fittedPC2, color = region)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_point(size = 1.5) +
geom_text(aes(label = st), nudge_x = 0.2) +
stat_ellipse(color = "grey") +
coord_fixed()
labs(x = "PC Dimension 1", y = "PC Dimnension 2") +
theme_minimal(base_size = 14) +
theme(legend.position = "top") +

crime.pca |>
broom::augment(crime) |>
ggplot(aes(.fittedPC1, .fittedPC3, color = region, label = st)) +
geom_point(size = 1.5) +
geom_text(nudge_x = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
labs(x = "PC Dimension 1", y = "PC Dimnension 3") +
theme_minimal(base_size = 14) +
theme(legend.position = "top") +
coord_fixed()



# tidy variable vectors

crime.pca |> purrr::pluck("rotation")

crime.pca |>
tidy(matrix = "rotation")



# define arrow style for plotting
arrow_style <- arrow(
angle = 20, ends = "first", type = "closed", length = grid::unit(8, "pt")
)

# try to plot the unit circle
r <- 1
theta <- c(seq(-pi, pi, length = 100))
cir <- data.frame(PC1 = r * cos(theta), PC2 = r * sin(theta))
circle <- geom_path(data = circle, color = "grey")

# plot rotation matrix
crime.pca |>
tidy(matrix = "rotation") |>
tidyr::pivot_wider(names_from = "PC",
names_prefix = "PC",
values_from = "value") |>
ggplot(aes(PC1, PC2)) +
geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
geom_text(
aes(label = column),
hjust = 1, nudge_x = -0.02,
color = "brown") +
xlim(-0.8, .2) + ylim(-.7, 0.6) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal(base_size = 14)


# do this without pivot_wider

vectors <- crime.pca |>
purrr::pluck("rotation") |>
as.data.frame() |>
mutate(PC1 = -1 * PC1, PC2 = -1 * PC2) |> # reflect axes
tibble::rownames_to_column(var = "label")
vectors

vectors |>
ggplot(aes(PC1, PC2)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_segment(xend = 0, yend = 0, linewidth=1, arrow = arrow_style) +
geom_text(aes(label = label),
size = 5,
hjust = "outward",
nudge_x = 0.05,
color = "brown") +
xlim(-0.4, 0.9) + ylim(-0.8, 0.8) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal(base_size = 14)

# interpreting variable vectors
vectors[, 2:3]^2 |> rowSums() |> sqrt()
vectors |> select(label, PC1, PC2) |> mutate(length = sqrt(PC1^2 + PC2^2))

# correlations
crime |> select(murder, auto) |> cor()


#biplot(crime.pca)

Expand Down
156 changes: 156 additions & 0 deletions R/crime-pca.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
#' ---
#' title: crime data - pca
#' ---

library(ggplot2)
library(ggbiplot)
library(dplyr)
library(corrplot)
library(patchwork)
library(broom)

data(crime, package = "ggbiplot")

crime |>
dplyr::select(where(is.numeric)) |>
cor() |>
corrplot(method = "ellipse", tl.srt = 0, tl.col = "black", mar = rep(.5, 4))

crime.pca <-
crime |>
dplyr::select(where(is.numeric)) |>
prcomp(scale. = TRUE)

summary(crime.pca)

# show the eigenvalue decomposition
(crime.eig <- crime.pca |>
broom::tidy(matrix = "eigenvalues"))

# add fitted line for smallest eigenvalues
p1 <- ggscreeplot(crime.pca) +
stat_smooth(data = crime.eig |> filter(PC>=4),
aes(x=PC, y=percent), method = "lm",
se = FALSE,
fullrange = TRUE) +
theme_bw(base_line_size = 14)

p2 <- ggscreeplot(crime.pca, type = "cev") +
geom_hline(yintercept = c(0.8, 0.9), color = "blue") +
theme_bw(base_size = 14)

p1 + p2

# tidy scores plot

scores <- crime.pca |> purrr::pluck("x")
cov(scores) |> zapsmall()

crime.pca |>
broom::augment(crime) |> # add original dataset back in
ggplot(aes(.fittedPC1, .fittedPC2, color = region)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_point(size = 1.5) +
geom_text(aes(label = st), nudge_x = 0.2) +
stat_ellipse(color = "grey") +
coord_fixed()
labs(x = "PC Dimension 1", y = "PC Dimnension 2") +
theme_minimal(base_size = 14) +
theme(legend.position = "top") +

crime.pca |>
broom::augment(crime) |>
ggplot(aes(.fittedPC1, .fittedPC3, color = region, label = st)) +
geom_point(size = 1.5) +
geom_text(nudge_x = 0.2) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
labs(x = "PC Dimension 1", y = "PC Dimnension 3") +
theme_minimal(base_size = 14) +
theme(legend.position = "top") +
coord_fixed()



# tidy variable vectors

crime.pca |> purrr::pluck("rotation")

crime.pca |>
tidy(matrix = "rotation")



# define arrow style for plotting
arrow_style <- arrow(
angle = 20, ends = "first", type = "closed", length = grid::unit(8, "pt")
)

# try to plot the unit circle
r <- 1
theta <- c(seq(-pi, pi, length = 100))
cir <- data.frame(PC1 = r * cos(theta), PC2 = r * sin(theta))
circle <- geom_path(data = circle, color = "grey")

# plot rotation matrix
crime.pca |>
tidy(matrix = "rotation") |>
tidyr::pivot_wider(names_from = "PC",
names_prefix = "PC",
values_from = "value") |>
ggplot(aes(PC1, PC2)) +
geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
geom_text(
aes(label = column),
hjust = 1, nudge_x = -0.02,
color = "brown") +
xlim(-0.8, .2) + ylim(-.7, 0.6) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal(base_size = 14)


# do this without pivot_wider

vectors <- crime.pca |>
purrr::pluck("rotation") |>
as.data.frame() |>
mutate(PC1 = -1 * PC1, PC2 = -1 * PC2) |> # reflect axes
tibble::rownames_to_column(var = "label")
vectors

vectors |>
ggplot(aes(PC1, PC2)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_segment(xend = 0, yend = 0, linewidth=1, arrow = arrow_style) +
geom_text(aes(label = label),
size = 5,
hjust = "outward",
nudge_x = 0.05,
color = "brown") +
xlim(-0.4, 0.9) + ylim(-0.8, 0.8) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal(base_size = 14)

# interpreting variable vectors
vectors[, 2:3]^2 |> rowSums() |> sqrt()
vectors |> select(label, PC1, PC2) |> mutate(length = sqrt(PC1^2 + PC2^2))

matlib::len(vectors)
matlib::len(t(vectors))

# correlations of
crime.pca |>
broom::augment(crime)

# correlations
crime |> select(murder, burglary, larceny, auto) |> cor()
crime |> select(larceny, auto) |> cor()

# angles between vectors
vec <- vectors <- crime.pca |>
purrr::pluck("rotation")
vec <- vec[, 1:2]
vec %*% t(vec)
matlib::angle(vec[1,], vec[7,]) |> cos()
Loading

0 comments on commit 5f5ea4a

Please sign in to comment.