more work on PCA math
friendly committed Dec 8, 2023
1 parent e84dcc9 commit de8ef38
Showing 27 changed files with 176 additions and 75 deletions.
73 changes: 60 additions & 13 deletions 04-pca-biplot.qmd
@@ -234,25 +234,63 @@ higher-dimensional ellipsoids. The bridesmaids were eigenvectors, pointing in as
different directions as space would allow, each sized according to their associated eigenvalues.
Attending the wedding were the ghosts of uncles, Leonhard Euler, Jean-Louis Lagrange,
Augustin-Louis Cauchy and others who had earlier discovered the mathematical properties
of ellipses and quadratic forms in relation to problems in physics.

The key idea in the statistical application was that, for a set of variables $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$,
the $p \times p$ covariance matrix $\mathbf{S}$ could be expressed **exactly** as a matrix
product involving a matrix $\mathbf{V}$, whose columns are _eigenvectors_ ($\mathbf{v}_i$), and a
diagonal matrix $\mathbf{\Lambda}$, whose diagonal elements ($\lambda_i$) are the corresponding _eigenvalues_.

To explain this, it is helpful to use a bit of matrix math:

$$
\begin{align*}
\mathbf{S}_{p \times p} & = \mathbf{V}_{p \times p} \quad\quad \mathbf{\Lambda}_{p \times p} \quad\quad \mathbf{V}_{p \times p}^T \\
& = \left( \mathbf{v}_1, \, \mathbf{v}_2, \, \dots, \, \mathbf{v}_p \right)
\begin{pmatrix}
\lambda_1 &           &        & \\
          & \lambda_2 &        & \\
          &           & \ddots & \\
          &           &        & \lambda_p
\end{pmatrix}
\begin{pmatrix}
\mathbf{v}_1^T \\
\mathbf{v}_2^T \\
\vdots \\
\mathbf{v}_p^T
\end{pmatrix} \\
& = \lambda_1 \mathbf{v}_1 \mathbf{v}_1^T + \lambda_2 \mathbf{v}_2 \mathbf{v}_2^T + \cdots + \lambda_p \mathbf{v}_p \mathbf{v}_p^T
\end{align*}
$$

In this equation,

* The last line follows because $\mathbf{\Lambda}$ is a diagonal matrix, so $\mathbf{S}$ is expressed as a sum of outer products of each $\mathbf{v}_i$ with itself.

* The columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{S}$. They are orthogonal
and of unit length, so $\mathbf{V}^T \mathbf{V} = \mathbf{I}$, and thus
they represent orthogonal (uncorrelated) directions in data space.

* The columns $\mathbf{v}_i$ are the weights applied to the variables to produce the scores on
the principal components. For example, the first principal component is the weighted sum:

$$
PC1 = v_{11} \mathbf{x}_1 + v_{21} \mathbf{x}_2 + \cdots + v_{p1} \mathbf{x}_p
$$

* The eigenvalues, $\lambda_1, \lambda_2, \dots, \lambda_p$, are the variances of the components, because
$\mathbf{v}_i^T \;\mathbf{S} \; \mathbf{v}_i = \lambda_i$.

* It is usually the case that the variables $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$ are linearly
independent, which means that none of them is an exact linear combination of the others. In this case,
all eigenvalues $\lambda_i$ are positive and the covariance matrix $\mathbf{S}$ is said to have **rank** $p$.

* Here is the key fact: if the eigenvalues are arranged in decreasing order, so that $\lambda_1 > \lambda_2 > \dots > \lambda_p$, then the first $d$ components give a $d$-dimensional approximation to $\mathbf{S}$, which accounts for $\sum_{i=1}^d \lambda_i \, / \, \sum_{i=1}^p \lambda_i$ of the total variance. These facts are checked numerically in the sketch below.
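
To make this concrete, here is a small numerical check, added as an illustrative sketch rather than part of the original analysis. It uses base R's `eigen()` on the covariance matrix of a few arbitrarily chosen `mtcars` variables:

```{r}
# Illustrative sketch: the spectral decomposition S = V Lambda V^T via eigen().
# The choice of mtcars variables is arbitrary; any numeric data would do.
X <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])
S <- cov(X)

e <- eigen(S)                 # e$values = eigenvalues, e$vectors = columns of V
V <- e$vectors
Lambda <- diag(e$values)

# S is reproduced exactly (up to rounding) by V %*% Lambda %*% t(V) ...
all.equal(S, V %*% Lambda %*% t(V), check.attributes = FALSE)

# ... and equivalently by the sum of outer products lambda_i * v_i v_i^T
S_sum <- Reduce(`+`, lapply(seq_along(e$values),
                            function(i) e$values[i] * outer(V[, i], V[, i])))
all.equal(S, S_sum, check.attributes = FALSE)

# The eigenvalues are the variances of the principal component scores
scores <- scale(X, center = TRUE, scale = FALSE) %*% V
rbind(eigenvalue = e$values, score_variance = apply(scores, 2, var))

# Proportion of total variance accounted for by the first d components
cumsum(e$values) / sum(e$values)
```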

For the case of two variables, $\mathbf{x}_1$ and $\mathbf{x}_2$, @fig-pca-rotation shows the
transformation from data space to component space. The eigenvectors, $\mathbf{v}_1, \mathbf{v}_2$,
are the major and minor axes of the data ellipse, whose lengths are the square roots,
$\sqrt{\lambda_1}, \sqrt{\lambda_2}$, of the eigenvalues.
```{r}
#| label: fig-pca-rotation
#| out-width: "100%"
@@ -266,8 +266,15 @@

In R, principal components analysis is most easily carried out using `stats::prcomp()` or `stats::princomp()`,
or similar functions in other packages such as `FactoMineR::PCA()`.
The **FactoMineR** package [@R-FactoMineR; @Husson-etal-2017]
has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis).

A particular strength of **FactoMineR**
for PCA is that it allows the inclusion of _supplementary variables_ (which can be categorical
or quantitative) and _supplementary points_ for individuals. These are not used in the analysis,
but are projected into the plots to facilitate interpretation. For example, in the analysis of
the `crime` data described below, it would be useful to have measures of other characteristics
of the U.S. states, such as poverty and average level of education.
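
As a minimal sketch of this feature (an added illustration using the built-in `iris` data, not the book's `crime` example), `Species` can be declared as a qualitative supplementary variable:

```{r}
library(FactoMineR)

# PCA of the four iris size measurements; Species (column 5) is declared a
# supplementary categorical variable, so it plays no role in computing the
# components, but its category means can be projected into component space.
iris.pca <- PCA(iris, scale.unit = TRUE, quali.sup = 5, graph = FALSE)

iris.pca$eig                # eigenvalues and percentages of variance
iris.pca$quali.sup$coord    # coordinates of the supplementary categories
```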

Unfortunately, although all of these perform similar calculations, the options for
analysis and the details of the results they return differ.
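
For instance, as a small added illustration (not from the text), the two base R functions store essentially the same results under different component names:

```{r}
p1 <- prcomp(iris[, 1:4], scale. = TRUE)   # SVD-based: loadings in $rotation, scores in $x
p2 <- princomp(iris[, 1:4], cor = TRUE)    # eigen-based: loadings in $loadings, scores in $scores

names(p1)
names(p2)
```
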
@@ -811,6 +856,8 @@ Yes, the correlation of number of forward gears (`gear`) and number of carburetors (`carb`)
in the upper left and lower right corners is moderately positive (0.27) while all the others
in their off-diagonal blocks are negative.

## Application: Eigenfaces

## Elliptical insights: Outlier detection

The data ellipse (@sec-data-ellipse), or ellipsoid in more than 2D, is fundamental in regression. But also,
19 changes: 9 additions & 10 deletions bib/pkgs.bib
@@ -38,7 +38,7 @@ @Manual{R-car
title = {car: Companion to Applied Regression},
author = {John Fox and Sanford Weisberg and Brad Price},
year = {2023},
note = {R package version 3.1-3},
url = {https://r-forge.r-project.org/projects/car/},
}

@@ -82,14 +82,6 @@ @Manual{R-corrplot
url = {https://github.com/taiyun/corrplot},
}

@Manual{R-datasauRus,
title = {datasauRus: Datasets from the Datasaurus Dozen},
author = {Rhian Davies and Steph Locke and Lucy {D'Agostino McGowan}},
year = {2022},
note = {R package version 0.1.6},
url = {https://github.com/jumpingrivers/datasauRus},
}

@Manual{R-datawizard,
title = {datawizard: Easy Data Wrangling and Statistical Transformations},
author = {Indrajeet Patil and Etienne Bacher and Dominique Makowski and Daniel Lüdecke and Mattan S. Ben-Shachar and Brenton M. Wiernik},
@@ -155,6 +147,13 @@ @Manual{R-forcats
url = {https://forcats.tidyverse.org/},
}

@Manual{R-genridge,
title = {genridge: Generalized Ridge Trace Plots for Ridge Regression},
author = {Michael Friendly},
year = {2023},
note = {R package version 0.7.0},
url = {https://friendly.github.io/genridge/},
}

@Manual{R-geomtextpath,
title = {geomtextpath: Curved Text in ggplot2},
@@ -209,7 +208,7 @@ @Manual{R-heplots
title = {heplots: Visualizing Hypothesis Tests in Multivariate Linear Models},
author = {Michael Friendly and John Fox and Georges Monette},
year = {2023},
note = {R package version 1.6.0},
url = {http://friendly.github.io/heplots/},
}

10 changes: 10 additions & 0 deletions bib/references.bib
@@ -808,6 +808,16 @@ @Article{Hotelling:1931
publisher = {Institute of Mathematical Statistics},
}

@Book{Husson-etal-2017,
author = {François Husson and Sébastien Lê and Jérôme Pagès},
publisher = {Chapman & Hall},
title = {Exploratory Multivariate Analysis by Example Using R},
year = {2017},
address = {New York},
doi = {10.1201/b21874},
url = {https://dokumen.pub/exploratory-multivariate-analysis-by-example-using-r-1nbsped-1439835802-9781439835807.html},
}

@ARTICLE{Inselberg:1985,
author = {A. Inselberg},
title = {The Plane with Parallel Coordinates},
4 changes: 2 additions & 2 deletions docs/03-multivariate_plots.html
@@ -1154,7 +1154,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
<li>A pair of one continuous and one categorical variable can be shown as side-by-side boxplots or violin plots, histograms or density plots;</li>
<li>Two categorical variables could be shown in a mosaic plot or by grouped bar plots.</li>
</ul>
<p>In the <strong>ggplot2</strong> framework, these displays are implemented using the <code><a href="https://ggobi.github.io/ggally/reference/ggpairs.html">ggpairs()</a></code> function from the <strong>GGally</strong> package <span class="citation" data-cites="R-GGally">(<a href="90-references.html#ref-R-GGally" role="doc-biblioref">Schloerke et al., 2023</a>)</span>. This allows different plot types to be shown in the lower and upper triangles and in the diagonal cells of the plot matrix. As well, aesthetics such as color and shape can be used within the plots to distinguish groups directly. As illustrated below, you can define custom functions to control exactly what is plotted in any panel.</p>
<p>The basic, default plot shows scatterplots for pairs of continuous variables in the lower triangle and the values of correlations in the upper triangle. A combination of a discrete and continuous variables is plotted as histograms in the lower triangle and boxplots in the upper triangle. <a href="#fig-peng-ggpairs1">Figure&nbsp;<span>3.29</span></a> includes <code>sex</code> to illustrate the combinations.</p>
<div class="cell" data-layout-align="center" data-hash="03-multivariate_plots_cache/html/fig-peng-ggpairs1_aa20fec9cffa163cd664a3f40f246b77">
<details open=""><summary>Code</summary><div class="sourceCode" id="cb37" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="fu"><a href="https://ggobi.github.io/ggally/reference/ggpairs.html">ggpairs</a></span><span class="op">(</span><span class="va">peng</span>, columns<span class="op">=</span><span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="fl">3</span><span class="op">:</span><span class="fl">6</span>, <span class="fl">7</span><span class="op">)</span>,</span>
Expand Down Expand Up @@ -1445,7 +1445,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
Sarkar, D. (2023). <em>Lattice: Trellis graphics for r</em>. <a href="https://lattice.r-forge.r-project.org/">https://lattice.r-forge.r-project.org/</a>
</div>
<div id="ref-R-GGally" class="csl-entry" role="listitem">
Schloerke, B., Cook, D., Larmarange, J., Briatte, F., Marbach, M., Thoen, E., Elberg, A., &amp; Crowley, J. (2023). <em>GGally: Extension to ggplot2</em>. <a href="https://ggobi.github.io/ggally/">https://ggobi.github.io/ggally/</a>
</div>
<div id="ref-Scott1992" class="csl-entry" role="listitem">
Scott, D. W. (1992). <em>Multivariate density estimation: Theory, practice, and visualization</em>. Wiley.