
Commit

add another crime biplot
friendly committed Dec 4, 2023
1 parent 2b1ad65 commit 6775113
Showing 14 changed files with 215 additions and 56 deletions.
115 changes: 96 additions & 19 deletions 04-pca-biplot.qmd
@@ -174,13 +174,23 @@ data, with points colored by species and the 95% data ellipsoid. This is rotated
Because this is a rigid rotation of the cloud of points, the total variability is obviously unchanged.
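A quick numerical check of this invariance (a sketch using the numeric `iris` columns; the particular variables are an assumption, not the book's code):

```r
# Sketch: the total variance is unchanged by the PCA rotation
X <- iris[, 1:4]                        # numeric measurements only (assumed columns)
pca <- prcomp(X, center = TRUE, scale. = FALSE)
sum(diag(cov(X)))                       # total variance in data space
sum(pca$sdev^2)                         # total variance of the principal components -- the same
```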


::: {#fig-pca-animation}
<div align="center">
<iframe width="946" height="594" src="images/pca-animation1.gif"></iframe>
</div>
Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially
in data space and its' data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions.
<!-- ::: {#fig-pca-animation} -->
<!-- <div align="center"> -->
<!-- <iframe width="946" height="594" src="images/pca-animation1.gif"></iframe> -->
<!-- </div> -->
<!-- Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially -->
<!-- in data space and its data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions. -->

<!-- ::: -->

::: {.content-visible unless-format="pdf"}
```{r}
#| label: fig-pca-animation
#| out-width: "100%"
#| echo: false
#| fig-cap: "Animation of PCA as a rotation in 3D space. The plot shows three variables for the #| `iris` data, initially in data space and its' data ellipsoid, with points colored according #| to species of the iris flowers. This is rotated smoothly until the first two principal axes #| are aligned with the horizontal and vertical dimensions."
knitr::include_graphics("images/pca-animation1.gif")
```
:::


@@ -227,12 +237,21 @@ The **FactoMineR** package [@R-FactoMineR]
has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis, ...).

Unfortunately, although all of these perform similar calculations, the options for
analysis and the details of the result they return differ ...
analysis and the details of the result they return differ.

The important options for analysis include:

* whether or not the data variables are **centered**, to a mean of 0
* whether or not the data variables are **scaled**, to a variance of 1.
* whether or not the data variables are **centered**, to a mean of $\bar{x}_j =0$
* whether or not the data variables are **scaled**, to a variance of $\text{Var}(x_j) =1$.

It nearly always makes sense to center the variables. The choice of
scaling determines whether the analysis is of the correlation matrix, so that
each variable contributes equally to the total variance to be accounted for,
or of the covariance matrix, where each variable contributes its
own variance to the total. Analysis of the covariance matrix makes little sense
when the variables are measured on different scales.[^pca-scales]

[^pca-scales]: For example, if two variables in the analysis are height and weight, changing the unit of height from inches to centimeters would multiply its variance by $2.54^2$; changing weight from pounds to kilograms would divide its variance by $2.2^2$.
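To see the correlation-versus-covariance distinction concretely, here is a minimal sketch using the built-in `USArrests` data (not the crime data analyzed below); the `prcomp()` options are standard, the comparison is only illustrative:

```r
# Sketch: scale. = TRUE analyzes the correlation matrix; scale. = FALSE the covariance matrix
pca_cor <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pca_cov <- prcomp(USArrests, center = TRUE, scale. = FALSE)

all.equal(pca_cor$sdev^2, eigen(cor(USArrests))$values)  # TRUE: eigenvalues of cor(X)
all.equal(pca_cov$sdev^2, eigen(cov(USArrests))$values)  # TRUE: eigenvalues of cov(X)
```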

#### Example: Crime data {.unnumbered}

@@ -304,6 +323,7 @@ of components to extract a desired proportion of total variance, usually in the

```{r}
#| label: fig-crime-ggscreeplot
#| fig-height: 4
#| out-width: "100%"
#| fig-cap: "Screeplots for the PCA of the crime data. The left panel shows the traditional version, plotting variance proportions against component number, with linear guideline for the scree rule of thumb. The right panel plots cumulative proportions, showing cutoffs of 80%, 90%."
p1 <- ggscreeplot(crime.pca) +
@@ -352,7 +372,7 @@ crime.pca |>
broom::augment(crime) |> head()
```

Then, we can use `ggplot()` to plot and pair of components.
Then, we can use `ggplot()` to plot any pair of components.
To aid interpretation, I label the points by their state abbreviation and color them
by `region` of the U.S. A geometric interpretation of the plot requires
an aspect ratio of 1.0 (via `coord_fixed()`)
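The chunk that produces this scores plot is collapsed in this diff; the following is a minimal sketch of the idea, assuming the column names (`.fittedPC1`, `.fittedPC2`) produced by `broom::augment()` and the `st` and `region` variables of `crime`:

```r
# Sketch: scores plot of the first two components (column names assumed from broom::augment())
library(ggplot2)
crime.pca |>
  broom::augment(crime) |>
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = region)) +
  geom_text(aes(label = st), size = 4) +
  coord_fixed() +                       # aspect ratio 1.0 for a geometric interpretation
  labs(x = "PC1", y = "PC2")
```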
@@ -387,9 +407,10 @@ and West Virginia. The second component has most of the southern states on the low end
and Massachusetts, Rhode Island and Hawaii on the high end. However, interpretation is
easier when we also consider how the various crimes contribute to these dimensions.

We could obviously go further and plot other pairs of components,
When, as here, there
are more than two components that seem important in the scree plot,
we could obviously go further and plot other pairs.

**TODO**: Add plot of PC1 vs. PC3
#### Variable vectors {.unnumbered}

You can extract the variable loadings using either `crime.pca$rotation` or
@@ -543,11 +564,18 @@ $\widehat{\mathbf{X}}$ as the product of two matrices,
$$
\widehat{\mathbf{X}} = (\mathbf{U} \mathbf{\Lambda}^\alpha) (\mathbf{\Lambda}^{1-\alpha} \mathbf{V}') = \mathbf{A} \mathbf{B}'
$$

The choice $\alpha = 1$, assigning the singular values totally to the left factor,
gives a distance interpretation to the row display and
This notation uses a little math trick involving a power, $0 \le \alpha \le 1$:
When $\alpha = 1$, $\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 =\mathbf{\Lambda}$,
and $\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 =\mathbf{I}$.
$\alpha = 1/2$ gives the diagonal matrix $\mathbf{\Lambda}^{1/2}$ whose elements are the square roots of the singular values.

The choice $\alpha = 1$ assigns the singular values totally to the left factor;
then, the angle between two variable vectors, reflecting the inner product
$\mathbf{x}_j^T \mathbf{x}_{j'}$, approximates their correlation or covariance,
and the distance between the points approximates their Mahalanobis distances.
$\alpha = 0$ gives a distance interpretation to the column display.
$\alpha = 1/2$ gives a symmetrically scaled biplot.
**TODO**: Explain this better.
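The factorization is easy to compute directly from `svd()`; a hedged sketch (the scaling conventions of particular biplot functions may differ from this):

```r
# Sketch: biplot factors A = U %*% Lambda^alpha and B = V %*% Lambda^(1 - alpha)
X <- scale(USArrests)          # centered and standardized data (illustrative choice)
s <- svd(X)
alpha <- 1                     # 1 = row-principal, 0 = column-principal, 1/2 = symmetric
A <- s$u %*% diag(s$d^alpha)         # coordinates for observations (rows)
B <- s$v %*% diag(s$d^(1 - alpha))   # coordinates for variables (columns)

max(abs(X - A %*% t(B)))       # ~ 0: A B' reproduces X exactly when all components are kept
# Keeping only the first two columns of A and B gives the rank-2 biplot approximation
```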

When the singular values are assigned totally to the left or to the right factor, the resultant
coordinates are called _principal coordinates_ and the sum of squared coordinates
@@ -560,13 +588,17 @@ values equal to 1.0.

### Biplots in R

There are a large number of R packages providing biplots, ...
There are a large number of R packages providing biplots. The most basic, `stats::biplot()`, provides methods for `"prcomp"` and `"princomp"` objects.
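For instance, the base-R version is a one-liner (a sketch with default scaling; `cex` just shrinks the labels):

```r
# Sketch: base-R biplot method for a "prcomp" object
biplot(crime.pca, cex = 0.6)                      # first two components by default
biplot(crime.pca, choices = c(1, 3), cex = 0.6)   # other pairs via `choices =`
```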

**TODO**: Mention **factoextra** package, `fviz()`, `fviz_pca_biplot()`, ... giving `ggplot2` graphics. Also mention **adegraphics** package

Here, I use the **ggbiplot** package ...
Here, I use the **ggbiplot** package, which aims to provide a simple interface to biplots within the `ggplot2` framework.

### Example

A basic biplot, using standardized principal components and labeling the observation by their state abbreviation is shown in @fig-crime-biplot1.
A basic biplot of the `crime` data, using standardized principal components and labeling the observations by their state abbreviations, is shown in @fig-crime-biplot1.
The correlation circle indicates that these components are uncorrelated and have
equal variance in the display.
```{r}
#| label: fig-crime-biplot1
#| out-width: "80%"
@@ -582,10 +614,13 @@ ggbiplot(crime.pca,
theme_minimal(base_size = 14)
```

In this dataset, the states are grouped by region, and we saw some differences among regions in the plot of component scores (@fig-crime-scores-plot12).
`ggbiplot()` provides options to include a `groups =` variable, used to
color the observation points and also to draw their data ellipses, facilitating interpretation.
```{r}
#| label: fig-crime-biplot2
#| out-width: "80%"
#| fig-cap: "Enhanced biplot of the crime data. ..."
#| fig-cap: "Enhanced biplot of the crime data, grouping the states by region and adding data ellipses."
ggbiplot(crime.pca,
obs.scale = 1, var.scale = 1,
groups = crime$region,
@@ -601,6 +636,48 @@ ggbiplot(crime.pca,
theme(legend.direction = 'horizontal', legend.position = 'top')
```

This plot provides what is needed to interpret the nature of the components and also the variation of the states in relation to them.
Here, the data ellipses for the regions provide a visual summary that aids interpretation.

* From the variable vectors, it seems that PC1, having all positive and nearly equal loadings, reflects a total or overall index of crimes. Nevada, California, New York and Florida are highest on this, while North Dakota, South Dakota and West Virginia are lowest.

* The second component, PC2, shows a contrast between crimes against persons (murder, assault, rape) at the top and property crimes (auto theft, larceny) at the bottom. Nearly all the Southern states are high on personal crimes; states in the North East are generally higher
on property crimes.

* Western states tend to be somewhat higher on overall crime rate, while North Central are lower on average. In these states there is not much variation in the relative proportions of personal vs. property crimes.

Moreover, in this biplot you can interpret the value for a particular state on a given crime by considering its projection on the variable vector, where the origin corresponds to the mean: positions along the direction of the vector correspond to greater than average values on that crime, and positions in the opposite direction to lower than average values. For example, Massachusetts has the highest value on auto theft, but a value less than the mean on murder; Louisiana and South Carolina, on the other hand, are highest in the rate of murder and slightly less than average on auto theft.
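This reading can be checked numerically: the inner products of the component scores with the loadings reconstruct (approximately) the standardized data, which is what the biplot projections display. A sketch, assuming `crime.pca` was fit with centering and scaling:

```r
# Sketch: rank-2 reconstruction of the standardized crime rates from scores and loadings
approx2 <- crime.pca$x[, 1:2] %*% t(crime.pca$rotation[, 1:2])
head(round(approx2, 2))   # approximate z-scores: positive = above the mean on that crime
```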

These 2D plots account for only 76.5% of the total variance of crimes, so it is useful to also examine the third principal component, which accounts for an additional 10.4%.
The `choices =` option controls which dimensions are plotted.

```{r}
#| label: fig-crime-biplot3
#| out-width: "80%"
#| fig-cap: "Biplot of dimensions 1 & 3 of the crime data."
ggbiplot(crime.pca,
         choices = c(1,3),
         obs.scale = 1, var.scale = 1,
         groups = crime$region,
         labels = crime$st,
         labels.size = 4,
         var.factor = 2,
         ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
         circle = TRUE,
         varname.size = 4,
         varname.color = "black") +
  labs(fill = "Region", color = "Region") +
  theme_minimal(base_size = 14) +
  theme(legend.direction = 'horizontal', legend.position = 'top')
```

Dimension 3 in @fig-crime-biplot3 is more subtle. One interpretation is a contrast between
larceny, which is simple theft, and robbery, which involves stealing something from a person
and is considered a more serious crime with an element of possible violence.
In this plot, murder has a relatively short variable vector, so does not contribute
very much to differences among the states.



## Elliptical insights: Outlier detection

15 changes: 15 additions & 0 deletions R/crime-ggbiplot.R
@@ -66,4 +66,19 @@ ggbiplot(crime.pca,
theme_minimal(base_size = 14) +
theme(legend.direction = 'horizontal', legend.position = 'top')

# PC1 & PC3
ggbiplot(crime.pca,
         choices = c(1,3),
         obs.scale = 1, var.scale = 1,
         groups = crime$region,
         labels = crime$st,
         labels.size = 4,
         var.factor = 2,
         ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
         circle = TRUE,
         varname.size = 4,
         varname.color = "black") +
  labs(fill = "Region", color = "Region") +
  theme_minimal(base_size = 14) +
  theme(legend.direction = 'horizontal', legend.position = 'top')

15 changes: 14 additions & 1 deletion bib/references.bib
@@ -561,7 +561,8 @@ @article{Gabriel:71
Pages = {453--467},
Title = {The Biplot Graphic Display of Matrices with Application to Principal Components Analysis},
Volume = {58},
Year = {1971}
Year = {1971},
doi = {10.2307/2334381},
}

@incollection{Gabriel:81,
@@ -944,6 +945,18 @@ @article{Mardia:1974
}


@Article{McGowan2023,
author = {McGowan, Lucy D’Agostino and Gerke, Travis and Barrett, Malcolm},
journal = {Journal of Statistics and Data Science Education},
title = {Causal inference is not just a statistics problem},
year = {2023},
issn = {2693-9169},
month = dec,
pages = {1--9},
doi = {10.1080/26939169.2023.2276446},
publisher = {Informa UK Limited},
}

@incollection{Monette:90,
Address = {Beverly Hills, CA},
Author = {Georges Monette},
3 changes: 2 additions & 1 deletion child/02-anscombe.qmd
@@ -136,7 +136,8 @@ when you look behind the scenes.
For example, in the context of causal analysis, @Gelman-etal:2023 illustrated
sets of four graphs, within each of which
all four represent the same average (latent) causal effect but with
much different patterns of individual effects.
much different patterns of individual effects; @McGowan2023 provide another illustration
with four seemingly identical data sets each generated by a different causal mechanism.
As an example of machine learning models, @Biecek-etal:2023 introduced the "Rashomon Quartet",
a synthetic dataset for which four models from different classes
a synthetic dataset for which four models from different classes
(linear model, regression tree, random forest, neural network)
5 changes: 4 additions & 1 deletion docs/02-getting_started.html
@@ -394,7 +394,7 @@ <h1 class="title"><span id="sec-getting_started" class="quarto-section-identifie
</div>
</div>
<div class="callout-body-container callout-body">
<p>The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, but yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis <span class="citation" data-cites="Gelman-etal:2023">Gelman, Hullman, and Kennedy (<a href="90-references.html#ref-Gelman-etal:2023" role="doc-biblioref">2023</a>)</span>, illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects. As an example of machine learning models, <span class="citation" data-cites="Biecek-etal:2023">Biecek et al. (<a href="90-references.html#ref-Biecek-etal:2023" role="doc-biblioref">2023</a>)</span>, introduced the “Rashamon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is solved when their visualization reveals the distinct ways of understanding structure in the data. The <a href="https://r-causal.github.io/quartets/"><strong>quartets</strong></a> package contains these and other variations on this theme.</p>
<p>The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, but yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis <span class="citation" data-cites="Gelman-etal:2023">Gelman, Hullman, and Kennedy (<a href="90-references.html#ref-Gelman-etal:2023" role="doc-biblioref">2023</a>)</span>, illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects; <span class="citation" data-cites="McGowan2023">McGowan, Gerke, and Barrett (<a href="90-references.html#ref-McGowan2023" role="doc-biblioref">2023</a>)</span> provide another illustration with four seemingly identical data sets each generated by a different causal mechanism. As an example of machine learning models, <span class="citation" data-cites="Biecek-etal:2023">Biecek et al. (<a href="90-references.html#ref-Biecek-etal:2023" role="doc-biblioref">2023</a>)</span>, introduced the “Rashamon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is solved when their visualization reveals the distinct ways of understanding structure in the data. The <a href="https://r-causal.github.io/quartets/"><strong>quartets</strong></a> package contains these and other variations on this theme.</p>
</div>
</div>
</section><section id="sec-davis" class="level3" data-number="2.1.2"><h3 data-number="2.1.2" class="anchored" data-anchor-id="sec-davis">
@@ -547,6 +547,9 @@ <h1 class="title"><span id="sec-getting_started" class="quarto-section-identifie
<div id="ref-MatejkaFitzmaurice2017" class="csl-entry" role="listitem">
Matejka, Justin, and George Fitzmaurice. 2017. <span>“Same Stats, Different Graphs.”</span> In <em>Proceedings of the 2017 <span>CHI</span> Conference on Human Factors in Computing Systems</em>. <span>ACM</span>. <a href="https://doi.org/10.1145/3025453.3025912">https://doi.org/10.1145/3025453.3025912</a>.
</div>
<div id="ref-McGowan2023" class="csl-entry" role="listitem">
McGowan, Lucy D’Agostino, Travis Gerke, and Malcolm Barrett. 2023. <span>“Causal Inference Is Not Just a Statistics Problem.”</span> <em>Journal of Statistics and Data Science Education</em>, December, 1–9. <a href="https://doi.org/10.1080/26939169.2023.2276446">https://doi.org/10.1080/26939169.2023.2276446</a>.
</div>
<div id="ref-Pearson:1896" class="csl-entry" role="listitem">
Pearson, Karl. 1896. <span>“Contributions to the Mathematical Theory of Evolution—<span>III</span>, Regression, Heredity and Panmixia.”</span> <em>Philosophical Transactions of the Royal Society of London</em>, A, 187: 253–318.
</div>
