work on 04-pca-biplot
friendly committed Nov 23, 2023
1 parent 020693c commit de13aa8
Showing 19 changed files with 315 additions and 59 deletions.
140 changes: 127 additions & 13 deletions 04-pca-biplot.qmd
Spaceland and therefore had an inkling of its majesty beyond the simple world of
> "Oh, can I, in my imagination, rotate that cloud and squeeze its' juice so that it rains down on Flatland with greatest joy?"
As it happened, our Square friend, although he could never really _see_ in three dimensions could now
at least _think_ of a world described by height as well as breadth and width.
at least _think_ of a world described by **height** as well as breadth and width.

And what a world it was, inhabited by
Pyramids, Cubes and wondrous creatures called Polyhedrons with many $C$orners, $F$aces and $E$dges.
Not only that, but all those Polyhedrons were forced in Spaceland to obey a magic formula:
$C + F - E = 2$.[^1-euler]

[^1-euler]: This is Euler's polyhedron formula. A cube, for example, has $C = 8$ corners, $F = 6$ faces and $E = 12$ edges, and indeed $8 + 6 - 12 = 2$.

Indeed, there were even exalted Spheres,
having so many faces that their surfaces became as smooth as a baby's bottom, with no need for pointed corners or edges, just as Circles were the smoothest occupants of his world.

He also marveled at Ellipsoids, as smooth as Spheres, but in Spaceland having three natural axes of different extent,
capable of appearing fatter or slimmer when rotated and viewed from different directions. An Ellipsoid
had magical properties: it could appear so thin in one or more dimensions that it became a simple
2D ellipse, or a 1D line, or even a 0D point [@Friendly-etal:ellipses:2013].
<!-- **TODO**: somehow mention the `gellipsoid` package here. -->

All of these now arose in A Square's richer 3D imagination.
And all of this came from just one more dimension than his life in Flatland.


### Multivariate juicers

Up to now, we have also been living in Flatland. We have been trying to understand data in
**data space** of possibly many dimensions, but confined to the 2D plane of a graph window.
Scatterplot matrices and parallel coordinate plots provided some relief.
The former did so by **projecting** the data into sets of 2D views in the coordinates of data
space; the latter did so by providing multiple axes in a 2D space along which we could trace
the paths of individual observations.

This chapter is about seeing data in a different space, a low-dimensional, usually 2D, space
that squeezes out the most juice
from multidimensional data for a particular purpose (@fig-MV-juicer), where what we want to
understand can be more easily seen.

```{r}
#| label: fig-MV-juicer
#| echo: false
#| fig-align: center
#| out-width: "90%"
#| fig-cap: "A multivariate juicer takes data from possibly high-dimensional data space and transforms it to a lower-dimenional space in which important effects can be more easily seen."
knitr::include_graphics("images/MV-juicer.png")
```

Here, I concentrate on **principal components analysis** (PCA), whose goal reflects A Square's desire to see that sparkly cloud of points
in $nD$ space projected onto the plane showing the greatest variation among all possible views.

This appealed to his sense of geometry, but left him wondering how the variables in that
high-D cloud were related to the dimensions he could see in a best-fitting plane.
The idea of a **biplot**, showing the data points in the plane together with thick pointed
arrows---variable vectors---in one view, is the other topic explained in this chapter.
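
As a quick preview of both ideas, here is a minimal sketch, not taken from the text, using base R's `prcomp()` and `biplot()` with the familiar `iris` data; the data and settings are my own illustrative choices.

```{r}
# a toy PCA of the four numeric iris variables, scaled so that
# no single variable dominates the total variance
iris.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# proportion of total variance captured by each component
summary(iris.pca)

# a basic biplot: points (observations) and arrows (variable vectors)
# shown together in the plane of the first two principal components
biplot(iris.pca, cex = 0.6)
```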

## Principal components analysis {#sec-pca}

His friend, Karl Pearson [-@Pearson:1896] took that idea and developed it into a
regression and a measure of correlation that would bear his name, Pearson's $r$.

But then Pearson [-@Pearson:1901] had a further inspiration, akin to that of A Square.
If he also had a cloud of sparkly points in $2, 3, 4, \dots, p$ dimensions, could he find a
point ($0D$), or line ($1D$), or plane ($2D$), or even a hyperplane ($nD$) that best summarized---
squeezed out the most juice from---multivariate data? This was the first truly multivariate
problem in the history of statistics [@FriendlyWainer:2021:TOGS, p. 186].

The best $0D$ point was easy---it was simply the centroid, the means of each of the
variables in the data, $(\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)$, because that was "closest"
to the data in the sense of minimizing the sum of squared differences, $\Sigma_i\Sigma_j (x_{ij} - \bar{x}_j)^2$.
In higher dimensions, his solution was also an application of the method of least squares, but he argued
it geometrically and visually as shown in @fig-Pearson1901.
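
This claim is easy to check numerically. The sketch below is only an illustration (the `iris` measurements serve as arbitrary data): the sum of squared distances from the centroid is smaller than from any other candidate point.

```{r}
X <- as.matrix(iris[, 1:4])          # arbitrary example data
centroid <- colMeans(X)

# sum of squared distances of all points from a candidate point p
ssq <- function(p) sum(sweep(X, 2, p)^2)

ssq(centroid)                        # smallest possible value
ssq(centroid + c(0.5, 0, 0, 0))      # a shifted point does worse
ssq(apply(X, 2, median))             # so does the vector of medians
```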

```{r}
#| label: fig-Pearson1901
#| echo: false
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Karl Pearson's (1901) geometric, visual argument for finding the line or plane of closest fit to a collection of points, P1, P2, P3, ... "
knitr::include_graphics("images/Pearson1901.png")
```

For a $1D$ summary, the line of best fit to the points $P_1, P_2, \dots, P_n$ is the line that goes through the
centroid and makes the average squared length of the _perpendicular_ segments from those points to the line as small
as possible. This differed from the case of linear regression for fitting $y$ from $x$,
where the average squared length of the _vertical_ segments, $\Sigma_i (y_i - \hat{y}_i)^2$, was minimized by
least squares.
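
The distinction is easy to verify in R. In the sketch below (my own illustration, using two `iris` variables), the slope of the ordinary regression line differs from the slope of the major axis found by `prcomp()`, because the two minimize different sets of squared distances.

```{r}
x <- iris$Sepal.Length
y <- iris$Petal.Length

# regression slope: minimizes the vertical squared distances
coef(lm(y ~ x))[2]

# major-axis (PC1) slope: minimizes the perpendicular squared distances
pc <- prcomp(cbind(x, y))            # unstandardized (covariance) PCA
pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]
```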

He went on to prove the visual insights from simple smoothing of @Galton:1886 (shown in @fig-galton-corr)
regarding the regression lines of
`y ~ x` and `x ~ y`. More importantly, he proved that the cloud of points is captured,
for the purpose of finding a best line, plane or hyperplane, by
the ellipsoid that encloses it, as seen in his diagram, @fig-Pearson1901-2. The major axis of the
2D ellipse is the line of best fit, along which the data points have the smallest average squared
distance from the line. The axis at right angles to that---the minor axis---is labeled "line of worst fit"
with the largest average squared distance.
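
A rough sketch of this geometry, in the spirit of Pearson's diagram, can be drawn with `car::dataEllipse()`; the variables and graphical settings below are illustrative assumptions, not code from the text.

```{r}
library(car)                         # for dataEllipse()

x <- iris$Sepal.Length
y <- iris$Petal.Length

# 68% data ellipse summarizing the cloud of points
dataEllipse(x, y, levels = 0.68,
            xlab = "Sepal length", ylab = "Petal length")

# regression of y on x (vertical least squares)
abline(lm(y ~ x), col = "darkgreen", lwd = 2)

# regression of x on y, re-expressed as a line in the (x, y) plane
cxy <- coef(lm(x ~ y))
abline(a = -cxy[1] / cxy[2], b = 1 / cxy[2], col = "brown", lwd = 2)

# major axis of the ellipse: the line of best fit through the centroid
pc <- prcomp(cbind(x, y))
b1 <- pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]
abline(a = mean(y) - b1 * mean(x), b = b1, col = "red", lwd = 2, lty = 2)
```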

```{r}
#| label: fig-Pearson1901-2
#| echo: false
#| fig-align: center
#| out-width: "80%"
#| fig-cap: "Karl Pearson's diagram showing the elliptical geometry of regression and principal components analysis ... _Source_: Pearson (1901), p. 566."
knitr::include_graphics("images/Pearson1901_2.png")
```

Even more importantly---and this is the basis for what we call **principal components analysis** (PCA)---he recognized that the two orthogonal axes of the ellipse gave new coordinates for the data which were uncorrelated, whatever the correlation of $x$ and $y$.

> Physically, the axes of the correlation type-ellipse are the directions of independent and uncorrelated variation. --- @Pearson:1901, p. 566.

It was but a small step to recognize that for two variables, $x$ and $y$:

* the line of best fit, the major axis (PC1), had the greatest variance of points projected onto it;
* the line of worst fit, the minor axis (PC2), had the least variance;
* these could be seen as a rotation of the data space of $(x, y)$ to a new space (PC1, PC2) with uncorrelated variables;
* the total variation of the points in data space, $\text{Var}(x) + \text{Var}(y)$, being unchanged by rotation, was equally well expressed as the total variation $\text{Var}(PC1) + \text{Var}(PC2)$ of the scores on what are
now called the principal component axes (a small numerical check follows below).
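
These claims are easy to verify numerically. The sketch below (again an illustration with two arbitrary variables) checks that the component scores are uncorrelated and that the total variance is unchanged by the rotation.

```{r}
x <- iris$Sepal.Length
y <- iris$Petal.Length
pc <- prcomp(cbind(x, y))            # rotation to principal components

# scores on PC1 and PC2 are uncorrelated (zero up to rounding error)
cor(pc$x[, 1], pc$x[, 2])

# total variance is preserved: Var(x) + Var(y) = Var(PC1) + Var(PC2)
var(x) + var(y)
sum(pc$sdev^2)
```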



### PCA by springs

Before delving into the mathematics of PCA, it is useful to see how Pearson's problem, and fitting by least squares generally, could be solved in a physical realization.

From elementary statistics, you may be familiar with a
physical demonstration that the mean, $\bar{x}$, of a sample is the value for which the sum of deviations,
$\Sigma_i (x_i - \bar{x})$, is zero, so the mean can be visualized as the point of balance on a line where
those differences $(x_i - \bar{x})$ are placed. Equally well, there is a physical realization of the mean
as the point along an axis where weights connected by springs come to rest: a spring with constant
stiffness $k$ exerts a force proportional to its extension, $k(x_i - \bar{x})$, and stores energy proportional
to $k(x_i - \bar{x})^2$, so the equilibrium position minimizes the sum of squared differences.
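
This physical analogy can be mimicked numerically. In the sketch below (an illustration, not code from the text), the total energy stored in the springs is treated as a function of the attachment point, and its minimum is found at the sample mean.

```{r}
set.seed(42)
x <- rnorm(20, mean = 10, sd = 2)    # arbitrary sample
k <- 1                               # spring stiffness

# total potential energy when every spring is attached at the point m
energy <- function(m) sum(0.5 * k * (x - m)^2)

# the energy-minimizing point coincides with the sample mean
optimize(energy, interval = range(x))$minimum
mean(x)
```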

In two dimensions, the same mechanism can be used to find the line of best fit, as shown in the animation in @fig-pca.

::: {#fig-pca}
<div align="center">
<iframe width="412" height="364" src="images/pca-springs-cropped.gif"></iframe>
</div>
Animation of PCA fitted by springs. The blue data points are connected to their projections on the red line by springs, constrained to be perpendicular to that line. From an initial position, the springs pull the line with forces proportional to their lengths, so the total energy is proportional to the sum of squared distances; the line finally settles at the position where the forces balance and that sum is at its minimum.

:::


### Elliptical insights: Outlier detection



59 changes: 59 additions & 0 deletions R/iris-3D-rgl.R
library(rgl)
library(bpca)
#source("c:/R/functions/ellipse3d.axes.R")

data(iris)
col <- c("blue", "green", "red")[iris$Species]   # point colors by species

# 3D scatterplot of the first three iris variables
plot3d(iris[, 1:3], type="s",
       size=0.4,
       col=col, cex=2,
       box=FALSE, aspect="iso")

# covariance matrix and centroid define a 68% data ellipsoid
cov <- cov(iris[, 1:3])
mu <- colMeans(iris[, 1:3])
shade3d(ellipse3d(cov, centre=mu, level=0.68),
        col="gray", alpha=0.2)

# add and label the principal axes of the ellipsoid
axes <- heplots::ellipse3d.axes(cov, centre=mu, level=0.72, labels=TRUE)

# save the initial view
M1 <- par3d("userMatrix")

# hand rotate / zoom, then save current position
M2 <- par3d("userMatrix")

# rotate to show PC1 & PC3
M3 <- par3d("userMatrix")

# rotate to show PC2 & 3
M4 <- par3d("userMatrix")

# Make a movie: rotation to PC coordinates
interp <- par3dinterp(userMatrix=list(M1, M2),
                      extrapolate="constant", method="linear")
movie3d(interp, duration=4, fps=8, movie="biplot3d-iris")

# View in PCA space: bpca package

interp3 <- par3dinterp(userMatrix=list(M1, M2, M3, M4, M3, M2, M1),
                       extrapolate="constant", method="linear")
movie3d(interp3, duration=6, fps=8, movie="biplot3d-iris3", dir="../anim")


# 3D static biplot in PCA space
iris.pca <- bpca(iris[, 1:4],
                 d=1:3)

plot(iris.pca,
     var.factor=.2,
     var.color=c('blue', 'red'),
     var.cex=1,
     obj.names=FALSE,
     obj.cex=1,
     obj.col=c('red', 'green3', 'blue')[unclass(iris$Species)],
     obj.pch=c('+', '*', '-')[unclass(iris$Species)])




2 changes: 1 addition & 1 deletion _quarto.yml
book:
author:
- name: Michael Friendly
url: https://datavis.ca
# affiliation: Department of Psychology<br>York University
date: today
cover-image: images/cover-ellipse.jpg
favicon: images/favicon/favicon.ico
Expand Down
6 changes: 3 additions & 3 deletions bib/pkgs.bib
@Manual{R-dplyr,
title = {dplyr: A Grammar of Data Manipulation},
author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
year = {2023},
note = {R package version 1.1.4},
url = {https://dplyr.tidyverse.org},
}

@Manual{R-GGally,
title = {GGally: Extension to ggplot2},
author = {Barret Schloerke and Di Cook and Joseph Larmarange and Francois Briatte and Moritz Marbach and Edwin Thoen and Amos Elberg and Jason Crowley},
year = {2023},
note = {R package version 2.2.0},
url = {https://ggobi.github.io/ggally/},
}

9 changes: 6 additions & 3 deletions child/00-flatland.qmd
But wait! Where does that 4D thing (a _tesseract_) come from?
To really see a tesseract you have to view it in an animation over time:

::: {#fig-tesseract}
<div align="center">
<iframe width="256" height="256" src="images/tesseract.gif"></iframe>
</div>
Animation of a tesseract, a cube changing over time.
:::

Albert Einstein revolutionized the Newtonian conceptions of gravity in 1915 when
The parable of _Flatland_ can provide inspiration for statistical thinking and data visualization.
Once we go beyond bivariate statistics and 2D plots, we are in a multivariate world of
possibly MANY dimensions. It takes only some imagination and suitable methods to get there.

Like Abbott's _Flatland_, this book is a romance, in many dimensions, of what we can learn from modern methods
of data visualization.
4 changes: 2 additions & 2 deletions docs/03-multivariate_plots.html
<li>a pair of one continuous and one categorical variable can be shown as side-by-side boxplots or violin plots, histograms or density plots</li>
<li>two categorical variables could be shown in a mosaic plot or by grouped bar plots.</li>
</ul>
<p>In the <code>ggplot2</code> framework, these displays are implemented in the <strong>GGally</strong> package <span class="citation" data-cites="R-GGally">(<a href="90-references.html#ref-R-GGally" role="doc-biblioref">Schloerke et al. 2023</a>)</span> in the <code><a href="https://ggobi.github.io/ggally/reference/ggpairs.html">ggpairs()</a></code> function. This allows different plot types to be shown in the lower and upper triangles and in the diagonal cells of the plot matrix. As well, aesthetics such as color and shape can be used within the plots to distinguish groups directly. As illustrated below, you can define custom functions to control exactly what is plotted in any panel.</p>
<p>The basic, default plot shows scatterplots for pairs of continuous variables in the lower triangle and the values of correlations in the upper triangle. A combination of a discrete and continuous variable is plotted as histograms in the lower triangle and boxplots in the upper triangle. <a href="#fig-peng-ggpairs1">Figure&nbsp;<span>3.29</span></a> includes <code>sex</code> to illustrate the combinations.</p>
<div class="cell" data-layout-align="center" data-hash="03-multivariate_plots_cache/html/fig-peng-ggpairs1_cf5f6c2fa254edc6f5bb67f54e064c70">
<details open=""><summary>Code</summary><div class="sourceCode" id="cb37" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="fu"><a href="https://ggobi.github.io/ggally/reference/ggpairs.html">ggpairs</a></span><span class="op">(</span><span class="va">peng</span>, columns<span class="op">=</span><span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="fl">3</span><span class="op">:</span><span class="fl">6</span>, <span class="fl">7</span><span class="op">)</span>,</span>
Sarkar, Deepayan. 2023. <em>Lattice: Trellis Graphics for r</em>. <a href="https://lattice.r-forge.r-project.org/">https://lattice.r-forge.r-project.org/</a>.
</div>
<div id="ref-R-GGally" class="csl-entry" role="listitem">
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2023. <em>GGally: Extension to Ggplot2</em>. <a href="https://ggobi.github.io/ggally/">https://ggobi.github.io/ggally/</a>.
</div>
<div id="ref-Scott1992" class="csl-entry" role="listitem">
Scott, David W. 1992. <em>Multivariate Density Estimation: Theory, Practice, and Visualization</em>. A Wiley-Interscience Publication. NY: Wiley.
