work on 04-pca-biplot
friendly committed Nov 23, 2023
1 parent 020693c commit de13aa8
Showing 19 changed files with 315 additions and 59 deletions.
140 changes: 127 additions & 13 deletions 04-pca-biplot.qmd
Spaceland and therefore had an inkling of its majesty beyond the simple world of
> "Oh, can I, in my imagination, rotate that cloud and squeeze its' juice so that it rains down on Flatland with greatest joy?"
As it happened, our Square friend, although he could never really _see_ in three dimensions could now
at least _think_ of a world described by height as well as breadth and width.
at least _think_ of a world described by **height** as well as breadth and width.

And what a world it was, inhabited by
Pyramids, Cubes and wondrous creatures called Polyhedrons with many $C$orners, $F$aces and $E$dges.
Not only that, but all those Polyhedrons were forced in Spaceland to obey a magic formula:
$C + F - E = 2$.[^1-euler]

[^1-euler]: This is Euler's polyhedron formula. A cube, for example, has $C = 8$ corners, $F = 6$ faces and $E = 12$ edges, and indeed $8 + 6 - 12 = 2$.

Indeed, there were even exalted Spheres,
having so many faces that their surfaces became as smooth as a baby's bottom, with no need for pointed corners or edges, just as Circles were the smoothest occupants of his world.

He also marveled at Ellipsoids, as smooth as Spheres, but in Spaceland having three natural axes of different extent,
capable of appearing fatter or slimmer when rotated and viewed from different directions. An Ellipsoid
had magical properties: it could appear so thin in one or more dimensions that it became a simple
2D ellipse, or a 1D line, or even a 0D point [@Friendly-etal:ellipses:2013].
<!-- **TODO**: somehow mention the `gellipsoid` package here. -->

All of these now arose in A Square's richer 3D imagination.
And all of this came from just one more dimension than his life in Flatland.


### Multivariate juicers

Up to now, we have also been living in Flatland. We have been trying to understand data in
**data space** of possibly many dimensions, but confined to the 2D plane of a graph window.
Scatterplot matrices and parallel coordinate plots provided some relief.
The former did so by **projecting** the data into sets of 2D views in the coordinates of data
space; the latter did so by providing multiple axes in a 2D space along which we could trace
the paths of individual observations.

This chapter is about seeing data in a different space, a low-dimensional, usually 2D, space
that squeezes out the most juice
from multidimensional data for a particular purpose (@fig-MV-juicer), where what we want to
understand can be more easily seen.

```{r}
#| label: fig-MV-juicer
#| echo: false
#| fig-align: center
#| out-width: "90%"
#| fig-cap: "A multivariate juicer takes data from possibly high-dimensional data space and transforms it to a lower-dimenional space in which important effects can be more easily seen."
knitr::include_graphics("images/MV-juicer.png")
```

Here, I concentrate on **principal components analysis** (PCA), whose goal reflects A Square's desire to see that sparkly cloud of points
in $nD$ space projected onto the plane showing the greatest variation among all possible views.

This appealed to his sense of geometry, but left him wondering how the variables in that
high-D cloud were related to the dimensions he could see in a best-fitting plane.
The idea of a **biplot**, showing the data points in the plane together with thick pointed
arrows---variable vectors---in one view, is the other topic explained in this chapter.
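
As a quick preview of both ideas, here is a minimal sketch, not taken from the text, using base R's `prcomp()` and `biplot()` with the familiar `iris` data; the data and settings are my own illustrative choices.

```{r}
# a toy PCA of the four numeric iris variables, scaled so that
# no single variable dominates the total variance
iris.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# proportion of total variance captured by each component
summary(iris.pca)

# a basic biplot: points (observations) and arrows (variable vectors)
# shown together in the plane of the first two principal components
biplot(iris.pca, cex = 0.6)
```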

## Principal components analysis {#sec-pca}

His friend, Karl Pearson [-@Pearson:1896] took that idea and developed it into a
regression and a measure of correlation that would bear his name, Pearson's $r$.

But then Pearson [-@Pearson:1901] had a further inspiration, akin to that of A Square.
If he also had a cloud of sparkly points in $2, 3, 4, \dots, p$ dimensions, could he find a
point ($0D$), or line ($1D$), or plane ($2D$), or even a hyperplane ($nD$) that best summarized---
squeezed out the most juice from---multivariate data? This was the first truly multivariate
problem in the history of statistics [@FriendlyWainer:2021:TOGS, p. 186].

The best $0D$ point was easy---it was simply the centroid, the means of each of the
variables in the data, $(\bar{x}_1, \bar{x}_2, \dots, \bar{x}_p)$, because that was "closest"
to the data in the sense of minimizing the sum of squared differences, $\Sigma_i\Sigma_j (x_{ij} - \bar{x}_j)^2$.
In higher dimensions, his solution was also an application of the method of least squares, but he argued
it geometrically and visually as shown in @fig-Pearson1901.
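
This claim is easy to check numerically. The sketch below is only an illustration (the `iris` measurements serve as arbitrary data): the sum of squared distances from the centroid is smaller than from any other candidate point.

```{r}
X <- as.matrix(iris[, 1:4])          # arbitrary example data
centroid <- colMeans(X)

# sum of squared distances of all points from a candidate point p
ssq <- function(p) sum(sweep(X, 2, p)^2)

ssq(centroid)                        # smallest possible value
ssq(centroid + c(0.5, 0, 0, 0))      # a shifted point does worse
ssq(apply(X, 2, median))             # so does the vector of medians
```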

```{r}
#| label: fig-Pearson1901
#| echo: false
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Karl Pearson's (1901) geometric, visual argument for finding the line or plane of closest fit to a collection of points, P1, P2, P3, ... "
knitr::include_graphics("images/Pearson1901.png")
```

For a $1D$ summary, the line of best fit to the points $P_1, P_2, \dots, P_n$ is the line that goes through the
centroid and makes the average squared length of the _perpendicular_ segments from those points to the line as small
as possible. This differed from the case of linear regression for fitting $y$ from $x$,
where the average squared length of the _vertical_ segments, $\Sigma_i (y_i - \hat{y}_i)^2$, was minimized by
least squares.
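
The distinction is easy to verify in R. In the sketch below (my own illustration, using two `iris` variables), the slope of the ordinary regression line differs from the slope of the major axis found by `prcomp()`, because the two minimize different sets of squared distances.

```{r}
x <- iris$Sepal.Length
y <- iris$Petal.Length

# regression slope: minimizes the vertical squared distances
coef(lm(y ~ x))[2]

# major-axis (PC1) slope: minimizes the perpendicular squared distances
pc <- prcomp(cbind(x, y))            # unstandardized (covariance) PCA
pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]
```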

He went on to prove the visual insights from simple smoothing of @Galton:1886 (shown in @fig-galton-corr)
regarding the regression lines of
`y ~ x` and `x ~ y`. More importantly, he proved that the cloud of points is captured,
for the purpose of finding a best line, plane or hyperplane, by
the ellipsoid that encloses it, as seen in his diagram, @fig-Pearson1901-2. The major axis of the
2D ellipse is the line of best fit, along which the data points have the smallest average squared
distance from the line. The axis at right angles to that---the minor axis---is labeled "line of worst fit"
with the largest average squared distance.
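
A rough sketch of this geometry, in the spirit of Pearson's diagram, can be drawn with `car::dataEllipse()`; the variables and graphical settings below are illustrative assumptions, not code from the text.

```{r}
library(car)                         # for dataEllipse()

x <- iris$Sepal.Length
y <- iris$Petal.Length

# 68% data ellipse summarizing the cloud of points
dataEllipse(x, y, levels = 0.68,
            xlab = "Sepal length", ylab = "Petal length")

# regression of y on x (vertical least squares)
abline(lm(y ~ x), col = "darkgreen", lwd = 2)

# regression of x on y, re-expressed as a line in the (x, y) plane
cxy <- coef(lm(x ~ y))
abline(a = -cxy[1] / cxy[2], b = 1 / cxy[2], col = "brown", lwd = 2)

# major axis of the ellipse: the line of best fit through the centroid
pc <- prcomp(cbind(x, y))
b1 <- pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]
abline(a = mean(y) - b1 * mean(x), b = b1, col = "red", lwd = 2, lty = 2)
```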

```{r}
#| label: fig-Pearson1901-2
#| echo: false
#| fig-align: center
#| out-width: "80%"
#| fig-cap: "Karl Pearson's diagram showing the elliptical geometry of regression and principal components analysis ... _Source_: Pearson (1901), p. 566."
knitr::include_graphics("images/Pearson1901_2.png")
```

Even more importantly---and this is the basis for what we call **principal components analysis** (PCA)---he recognized that the two orthogonal axes of the ellipse gave new coordinates for the data which were uncorrelated, whatever the correlation of $x$ and $y$.

> Physically, the axes of the correlation type-ellipse are the directions of independent and uncorrelated variation. --- @Pearson:1901, p. 566.

It was but a small step to recognize that for two variables, $x$ and $y$:

* the line of best fit, the major axis (PC1), had the greatest variance of points projected onto it;
* the line of worst fit, the minor axis (PC2), had the least variance;
* these could be seen as a rotation of the data space of $(x, y)$ to a new space (PC1, PC2) with uncorrelated variables;
* the total variation of the points in data space, $\text{Var}(x) + \text{Var}(y)$, being unchanged by rotation, was equally well expressed as the total variation $\text{Var}(PC1) + \text{Var}(PC2)$ of the scores on what are
now called the principal component axes (a small numerical check follows below).
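
These claims are easy to verify numerically. The sketch below (again an illustration with two arbitrary variables) checks that the component scores are uncorrelated and that the total variance is unchanged by the rotation.

```{r}
x <- iris$Sepal.Length
y <- iris$Petal.Length
pc <- prcomp(cbind(x, y))            # rotation to principal components

# scores on PC1 and PC2 are uncorrelated (zero up to rounding error)
cor(pc$x[, 1], pc$x[, 2])

# total variance is preserved: Var(x) + Var(y) = Var(PC1) + Var(PC2)
var(x) + var(y)
sum(pc$sdev^2)
```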



### PCA by springs

Before delving into the mathematics of PCA, it is useful to see how Pearson's problem, and fitting by least squares generally, could be solved in a physical realization.

From elementary statistics, you may be familiar with a
physical demonstration that the mean, $\bar{x}$, of a sample is the value for which the sum of deviations,
$\Sigma_i (x_i - \bar{x})$, is zero, so the mean can be visualized as the point of balance on a line where
those differences $(x_i - \bar{x})$ are placed. Equally well, there is a physical realization of the mean
as the point along an axis where weights connected by springs come to rest: a spring with constant
stiffness $k$ exerts a force proportional to its extension, $k(x_i - \bar{x})$, and stores energy proportional
to $k(x_i - \bar{x})^2$, so the equilibrium position minimizes the sum of squared differences.
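
This physical analogy can be mimicked numerically. In the sketch below (an illustration, not code from the text), the total energy stored in the springs is treated as a function of the attachment point, and its minimum is found at the sample mean.

```{r}
set.seed(42)
x <- rnorm(20, mean = 10, sd = 2)    # arbitrary sample
k <- 1                               # spring stiffness

# total potential energy when every spring is attached at the point m
energy <- function(m) sum(0.5 * k * (x - m)^2)

# the energy-minimizing point coincides with the sample mean
optimize(energy, interval = range(x))$minimum
mean(x)
```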

In two dimensions, the same mechanism can be used to find the line of best fit, as shown in the animation in @fig-pca.

::: {#fig-pca}
<div align="center">
<iframe width="412" height="364" src="images/pca-springs-cropped.gif"></iframe>
</div>
Animation of PCA fitted by springs. The blue data points are connected to their projections on the red line by springs, constrained to be perpendicular to that line. From an initial position, the springs pull the line with forces proportional to their lengths, so the total energy is proportional to the sum of squared distances; the line finally settles at the position where the forces balance and that sum is at its minimum.

:::


### Elliptical insights: Outlier detection



59 changes: 59 additions & 0 deletions R/iris-3D-rgl.R
library(rgl)
library(bpca)
#source("c:/R/functions/ellipse3d.axes.R")

data(iris)
col <- c("blue", "green", "red")[iris$Species]   # point colors by species

# 3D scatterplot of the first three iris variables
plot3d(iris[, 1:3], type="s",
       size=0.4,
       col=col, cex=2,
       box=FALSE, aspect="iso")

# covariance matrix and centroid define a 68% data ellipsoid
cov <- cov(iris[, 1:3])
mu <- colMeans(iris[, 1:3])
shade3d(ellipse3d(cov, centre=mu, level=0.68),
        col="gray", alpha=0.2)

# add and label the principal axes of the ellipsoid
axes <- heplots::ellipse3d.axes(cov, centre=mu, level=0.72, labels=TRUE)

# save the initial view
M1 <- par3d("userMatrix")

# hand rotate / zoom, then save current position
M2 <- par3d("userMatrix")

# rotate to show PC1 & PC3
M3 <- par3d("userMatrix")

# rotate to show PC2 & 3
M4 <- par3d("userMatrix")

# Make a movie: rotation to PC coordinates
interp <- par3dinterp(userMatrix=list(M1, M2),
                      extrapolate="constant", method="linear")
movie3d(interp, duration=4, fps=8, movie="biplot3d-iris")

# View in PCA space: bpca package

interp3 <- par3dinterp(userMatrix=list(M1, M2, M3, M4, M3, M2, M1),
                       extrapolate="constant", method="linear")
movie3d(interp3, duration=6, fps=8, movie="biplot3d-iris3", dir="../anim")


# 3D static biplot in PCA space
iris.pca <- bpca(iris[, 1:4],
                 d=1:3)

plot(iris.pca,
     var.factor=.2,
     var.color=c('blue', 'red'),
     var.cex=1,
     obj.names=FALSE,
     obj.cex=1,
     obj.col=c('red', 'green3', 'blue')[unclass(iris$Species)],
     obj.pch=c('+', '*', '-')[unclass(iris$Species)])




2 changes: 1 addition & 1 deletion _quarto.yml
book:
author:
- name: Michael Friendly
url: https://datavis.ca
# affiliation: Department of Psychology<br>York University
date: today
cover-image: images/cover-ellipse.jpg
favicon: images/favicon/favicon.ico
Expand Down
6 changes: 3 additions & 3 deletions bib/pkgs.bib
@Manual{R-dplyr,
title = {dplyr: A Grammar of Data Manipulation},
author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
year = {2023},
note = {R package version 1.1.4},
url = {https://dplyr.tidyverse.org},
}

@Manual{R-GGally,
title = {GGally: Extension to ggplot2},
author = {Barret Schloerke and Di Cook and Joseph Larmarange and Francois Briatte and Moritz Marbach and Edwin Thoen and Amos Elberg and Jason Crowley},
year = {2023},
note = {R package version 2.2.0},
url = {https://ggobi.github.io/ggally/},
}

9 changes: 6 additions & 3 deletions child/00-flatland.qmd
But wait! Where does that 4D thing (a _tesseract_) come from?
To really see a tesseract you have to view it in an animation over time:

::: {#fig-tesseract}
<div align="center">
<iframe width="256" height="256" src="images/tesseract.gif"></iframe>
</div>
Animation of a tesseract, a cube changing over time.
:::

Albert Einstein revolutionized the Newtonian conceptions of gravity in 1915 when
The parable of _Flatland_ can provide inspiration for statistical thinking and data visualization.
Once we go beyond bivariate statistics and 2D plots, we are in a multivariate world of
possibly MANY dimensions. It takes only some imagination and suitable methods to get there.

Like Abbott's _Flatland_, this book is a romance, in many dimensions, of what we can learn from modern methods
of data visualization.
4 changes: 2 additions & 2 deletions docs/03-multivariate_plots.html
<li>a pair of one continuous and one categorical variable can be shown as side-by-side boxplots or violin plots, histograms or density plots</li>
<li>two categorical variables could be shown in a mosaic plot or by grouped bar plots.</li>
</ul>
<p>In the <code>ggplot2</code> framework, these displays are implemented in the <strong>GGally</strong> package <span class="citation" data-cites="R-GGally">(<a href="90-references.html#ref-R-GGally" role="doc-biblioref">Schloerke et al. 2023</a>)</span> in the <code><a href="https://ggobi.github.io/ggally/reference/ggpairs.html">ggpairs()</a></code> function. This allows different plot types to be shown in the lower and upper triangles and in the diagonal cells of the plot matrix. As well, aesthetics such as color and shape can be used within the plots to distinguish groups directly. As illustrated below, you can define custom functions to control exactly what is plotted in any panel.</p>
<p>The basic, default plot shows scatterplots for pairs of continuous variables in the lower triangle and the values of correlations in the upper triangle. A combination of a discrete and continuous variable is plotted as histograms in the lower triangle and boxplots in the upper triangle. <a href="#fig-peng-ggpairs1">Figure&nbsp;<span>3.29</span></a> includes <code>sex</code> to illustrate the combinations.</p>
<div class="cell" data-layout-align="center" data-hash="03-multivariate_plots_cache/html/fig-peng-ggpairs1_cf5f6c2fa254edc6f5bb67f54e064c70">
<details open=""><summary>Code</summary><div class="sourceCode" id="cb37" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="fu"><a href="https://ggobi.github.io/ggally/reference/ggpairs.html">ggpairs</a></span><span class="op">(</span><span class="va">peng</span>, columns<span class="op">=</span><span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="fl">3</span><span class="op">:</span><span class="fl">6</span>, <span class="fl">7</span><span class="op">)</span>,</span>
Sarkar, Deepayan. 2023. <em>Lattice: Trellis Graphics for r</em>. <a href="https://lattice.r-forge.r-project.org/">https://lattice.r-forge.r-project.org/</a>.
</div>
<div id="ref-R-GGally" class="csl-entry" role="listitem">
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2023. <em>GGally: Extension to Ggplot2</em>. <a href="https://ggobi.github.io/ggally/">https://ggobi.github.io/ggally/</a>.
</div>
<div id="ref-Scott1992" class="csl-entry" role="listitem">
Scott, David W. 1992. <em>Multivariate Density Estimation: Theory, Practice, and Visualization</em>. A Wiley-Interscience Publication. NY: Wiley.
