
Commit

add another crime biplot
friendly committed Dec 4, 2023
1 parent 2b1ad65 commit 6775113
Showing 14 changed files with 215 additions and 56 deletions.
115 changes: 96 additions & 19 deletions 04-pca-biplot.qmd
@@ -174,13 +174,23 @@ data, with points colored by species and the 95% data ellipsoid. This is rotated
Because this is a rigid rotation of the cloud of points, the total variability is obviously unchanged.
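A quick numerical check of this invariance (a sketch using the numeric `iris` columns; the particular variables are an assumption, not the book's code):

```r
# Sketch: the total variance is unchanged by the PCA rotation
X <- iris[, 1:4]                        # numeric measurements only (assumed columns)
pca <- prcomp(X, center = TRUE, scale. = FALSE)
sum(diag(cov(X)))                       # total variance in data space
sum(pca$sdev^2)                         # total variance of the principal components -- the same
```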


::: {#fig-pca-animation}
<div align="center">
<iframe width="946" height="594" src="images/pca-animation1.gif"></iframe>
</div>
Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially
in data space and its' data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions.
<!-- ::: {#fig-pca-animation} -->
<!-- <div align="center"> -->
<!-- <iframe width="946" height="594" src="images/pca-animation1.gif"></iframe> -->
<!-- </div> -->
<!-- Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially -->
<!-- in data space and its data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions. -->

<!-- ::: -->

::: {.content-visible unless-format="pdf"}
```{r}
#| label: fig-pca-animation
#| out-width: "100%"
#| echo: false
#| fig-cap: "Animation of PCA as a rotation in 3D space. The plot shows three variables for the #| `iris` data, initially in data space and its' data ellipsoid, with points colored according #| to species of the iris flowers. This is rotated smoothly until the first two principal axes #| are aligned with the horizontal and vertical dimensions."
knitr::include_graphics("images/pca-animation1.gif")
```
:::


@@ -227,12 +237,21 @@ The **FactoMineR** package [@R-FactoMineR]
has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis, ...).

Unfortunately, although all of these perform similar calculations, the options for
analysis and the details of the result they return differ ...
analysis and the details of the result they return differ.

The important options for analysis include:

* whether or not the data variables are **centered**, to a mean of 0
* whether or not the data variables are **scaled**, to a variance of 1.
* whether or not the data variables are **centered**, to a mean of $\bar{x}_j =0$
* whether or not the data variables are **scaled**, to a variance of $\text{Var}(x_j) =1$.

It nearly always makes sense to center the variables. The choice of
scaling determines whether the analysis is of the correlation matrix, so that
each variable contributes equally to the total variance to be accounted for,
or of the covariance matrix, where each variable contributes its
own variance to the total. Analysis of the covariance matrix makes little sense
when the variables are measured on different scales.[^pca-scales]

[^pca-scales]: For example, if two variables in the analysis are height and weight, changing the unit of height from inches to centimeters would multiply its variance by $2.54^2$; changing weight from pounds to kilograms would divide its variance by $2.2^2$.
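To see the correlation-versus-covariance distinction concretely, here is a minimal sketch using the built-in `USArrests` data (not the crime data analyzed below); the `prcomp()` options are standard, the comparison is only illustrative:

```r
# Sketch: scale. = TRUE analyzes the correlation matrix; scale. = FALSE the covariance matrix
pca_cor <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pca_cov <- prcomp(USArrests, center = TRUE, scale. = FALSE)

all.equal(pca_cor$sdev^2, eigen(cor(USArrests))$values)  # TRUE: eigenvalues of cor(X)
all.equal(pca_cov$sdev^2, eigen(cov(USArrests))$values)  # TRUE: eigenvalues of cov(X)
```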

#### Example: Crime data {.unnumbered}

@@ -304,6 +323,7 @@ of components to extract a desired proportion of total variance, usually in the

```{r}
#| label: fig-crime-ggscreeplot
#| fig-height: 4
#| out-width: "100%"
#| fig-cap: "Screeplots for the PCA of the crime data. The left panel shows the traditional version, plotting variance proportions against component number, with linear guideline for the scree rule of thumb. The right panel plots cumulative proportions, showing cutoffs of 80%, 90%."
p1 <- ggscreeplot(crime.pca) +
@@ -352,7 +372,7 @@ crime.pca |>
broom::augment(crime) |> head()
```

Then, we can use `ggplot()` to plot and pair of components.
Then, we can use `ggplot()` to plot any pair of components.
To aid interpretation, I label the points by their state abbreviation and color them
by `region` of the U.S. A geometric interpretation of the plot requires
an aspect ratio of 1.0 (via `coord_fixed()`)
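The chunk that produces this scores plot is collapsed in this diff; the following is a minimal sketch of the idea, assuming the column names (`.fittedPC1`, `.fittedPC2`) produced by `broom::augment()` and the `st` and `region` variables of `crime`:

```r
# Sketch: scores plot of the first two components (column names assumed from broom::augment())
library(ggplot2)
crime.pca |>
  broom::augment(crime) |>
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, color = region)) +
  geom_text(aes(label = st), size = 4) +
  coord_fixed() +                       # aspect ratio 1.0 for a geometric interpretation
  labs(x = "PC1", y = "PC2")
```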
@@ -387,9 +407,10 @@ and West Virginia. The second component has most of the southern states on the low end
and Massachusetts, Rhode Island and Hawaii on the high end. However, interpretation is
easier when we also consider how the various crimes contribute to these dimensions.

We could obviously go further and plot other pairs of components,
When, as here, there
are more than two components that seem important in the scree plot,
we could obviously go further and plot other pairs.

**TODO**: Add plot of PC1 vs. PC3
#### Variable vectors {.unnumbered}

You can extract the variable loadings using either `crime.pca$rotation` or
@@ -543,11 +564,18 @@ $\widehat{\mathbf{X}}$ as the product of two matrices,
$$
\widehat{\mathbf{X}} = (\mathbf{U} \mathbf{\Lambda}^\alpha) (\mathbf{\Lambda}^{1-\alpha} \mathbf{V}') = \mathbf{A} \mathbf{B}'
$$

The choice $\alpha = 1$, assigning the singular values totally to the left factor,
gives a distance interpretation to the row display and
This notation uses a little math trick involving a power, $0 \le \alpha \le 1$:
When $\alpha = 1$, $\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 =\mathbf{\Lambda}$,
and $\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 =\mathbf{I}$.
$\alpha = 1/2$ gives the diagonal matrix $\mathbf{\Lambda}^{1/2}$ whose elements are the square roots of the singular values.

The choice $\alpha = 1$ assigns the singular values totally to the left factor;
then, the angle between two variable vectors, reflecting the inner product
$\mathbf{x}_j^T \mathbf{x}_{j'}$, approximates their correlation or covariance,
and the distance between the points approximates their Mahalanobis distances.
$\alpha = 0$ gives a distance interpretation to the column display.
$\alpha = 1/2$ gives a symmetrically scaled biplot.
**TODO**: Explain this better.
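The factorization is easy to compute directly from `svd()`; a hedged sketch (the scaling conventions of particular biplot functions may differ from this):

```r
# Sketch: biplot factors A = U %*% Lambda^alpha and B = V %*% Lambda^(1 - alpha)
X <- scale(USArrests)          # centered and standardized data (illustrative choice)
s <- svd(X)
alpha <- 1                     # 1 = row-principal, 0 = column-principal, 1/2 = symmetric
A <- s$u %*% diag(s$d^alpha)         # coordinates for observations (rows)
B <- s$v %*% diag(s$d^(1 - alpha))   # coordinates for variables (columns)

max(abs(X - A %*% t(B)))       # ~ 0: A B' reproduces X exactly when all components are kept
# Keeping only the first two columns of A and B gives the rank-2 biplot approximation
```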

When the singular values are assigned totally to the left or to the right factor, the resultant
coordinates are called _principal coordinates_ and the sum of squared coordinates
@@ -560,13 +588,17 @@ values equal to 1.0.

### Biplots in R

There are a large number of R packages providing biplots, ...
There are a large number of R packages providing biplots. The most basic, `stats::biplot()`, provides methods for `"prcomp"` and `"princomp"` objects.
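For instance, the base-R version is a one-liner (a sketch with default scaling; `cex` just shrinks the labels):

```r
# Sketch: base-R biplot method for a "prcomp" object
biplot(crime.pca, cex = 0.6)                      # first two components by default
biplot(crime.pca, choices = c(1, 3), cex = 0.6)   # other pairs via `choices =`
```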

**TODO**: Mention **factoextra** package, `fviz()`, `fviz_pca_biplot()`, ... giving `ggplot2` graphics. Also mention **adegraphics** package

Here, I use the **ggbiplot** package ...
Here, I use the **ggbiplot** package, which aims to provide a simple interface to biplots within the `ggplot2` framework.

### Example

A basic biplot, using standardized principal components and labeling the observation by their state abbreviation is shown in @fig-crime-biplot1.
A basic biplot of the `crime` data, using standardized principal components and labeling the observations by their state abbreviations, is shown in @fig-crime-biplot1.
The correlation circle indicates that these components are uncorrelated and have
equal variance in the display.
```{r}
#| label: fig-crime-biplot1
#| out-width: "80%"
@@ -582,10 +614,13 @@ ggbiplot(crime.pca,
theme_minimal(base_size = 14)
```

In this dataset, the states are grouped by region, and we saw some differences among regions in the plot of component scores (@fig-crime-scores-plot12).
`ggbiplot()` provides options to include a `groups =` variable, used to
color the observation points and also to draw their data ellipses, facilitating interpretation.
```{r}
#| label: fig-crime-biplot2
#| out-width: "80%"
#| fig-cap: "Enhanced biplot of the crime data. ..."
#| fig-cap: "Enhanced biplot of the crime data, grouping the states by region and adding data ellipses."
ggbiplot(crime.pca,
obs.scale = 1, var.scale = 1,
groups = crime$region,
@@ -601,6 +636,48 @@ ggbiplot(crime.pca,
theme(legend.direction = 'horizontal', legend.position = 'top')
```

This plot provides what is needed to interpret the nature of the components and also the variation of the states in relation to them.
Here, the data ellipses for the regions provide a visual summary that aids interpretation.

* From the variable vectors, it seems that PC1, having all positive and nearly equal loadings, reflects a total or overall index of crimes. Nevada, California, New York and Florida are highest on this, while North Dakota, South Dakota and West Virginia are lowest.

* The second component, PC2, shows a contrast between crimes against persons (murder, assault, rape) at the top and property crimes (auto theft, larceny) at the bottom. Nearly all the Southern states are high on personal crimes; states in the North East are generally higher
on property crimes.

* Western states tend to be somewhat higher on overall crime rate, while North Central are lower on average. In these states there is not much variation in the relative proportions of personal vs. property crimes.

Moreover, in this biplot you can interpret the value for a particular state on a given crime by considering its projection on the variable vector, where the origin corresponds to the mean: positions along the direction of the vector correspond to greater than average values on that crime, and positions in the opposite direction to lower than average values. For example, Massachusetts has the highest value on auto theft, but a value less than the mean on murder; Louisiana and South Carolina, on the other hand, are highest in the rate of murder and slightly less than average on auto theft.
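This reading can be checked numerically: the inner products of the component scores with the loadings reconstruct (approximately) the standardized data, which is what the biplot projections display. A sketch, assuming `crime.pca` was fit with centering and scaling:

```r
# Sketch: rank-2 reconstruction of the standardized crime rates from scores and loadings
approx2 <- crime.pca$x[, 1:2] %*% t(crime.pca$rotation[, 1:2])
head(round(approx2, 2))   # approximate z-scores: positive = above the mean on that crime
```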

These 2D plots account for only 76.5% of the total variance of crimes, so it is useful to also examine the third principal component, which accounts for an additional 10.4%.
The `choices =` option controls which dimensions are plotted.

```{r}
#| label: fig-crime-biplot3
#| out-width: "80%"
#| fig-cap: "Biplot of dimensions 1 & 3 of the crime data."
ggbiplot(crime.pca,
         choices = c(1,3),
         obs.scale = 1, var.scale = 1,
         groups = crime$region,
         labels = crime$st,
         labels.size = 4,
         var.factor = 2,
         ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
         circle = TRUE,
         varname.size = 4,
         varname.color = "black") +
  labs(fill = "Region", color = "Region") +
  theme_minimal(base_size = 14) +
  theme(legend.direction = 'horizontal', legend.position = 'top')
```

Dimension 3 in @fig-crime-biplot3 is more subtle. One interpretation is a contrast between
larceny, which is simple theft, and robbery, which involves stealing something from a person
and is considered a more serious crime with an element of possible violence.
In this plot, murder has a relatively short variable vector, so does not contribute
very much to differences among the states.



## Elliptical insights: Outlier detection

15 changes: 15 additions & 0 deletions R/crime-ggbiplot.R
@@ -66,4 +66,19 @@ ggbiplot(crime.pca,
theme_minimal(base_size = 14) +
theme(legend.direction = 'horizontal', legend.position = 'top')

# PC1 & PC3
ggbiplot(crime.pca,
         choices = c(1,3),
         obs.scale = 1, var.scale = 1,
         groups = crime$region,
         labels = crime$st,
         labels.size = 4,
         var.factor = 2,
         ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
         circle = TRUE,
         varname.size = 4,
         varname.color = "black") +
  labs(fill = "Region", color = "Region") +
  theme_minimal(base_size = 14) +
  theme(legend.direction = 'horizontal', legend.position = 'top')

15 changes: 14 additions & 1 deletion bib/references.bib
@@ -561,7 +561,8 @@ @article{Gabriel:71
Pages = {453--467},
Title = {The Biplot Graphic Display of Matrices with Application to Principal Components Analysis},
Volume = {58},
Year = {1971}
Year = {1971},
doi = {10.2307/2334381},
}

@incollection{Gabriel:81,
@@ -944,6 +945,18 @@ @article{Mardia:1974
}


@Article{McGowan2023,
author = {McGowan, Lucy D’Agostino and Gerke, Travis and Barrett, Malcolm},
journal = {Journal of Statistics and Data Science Education},
title = {Causal inference is not just a statistics problem},
year = {2023},
issn = {2693-9169},
month = dec,
pages = {1--9},
doi = {10.1080/26939169.2023.2276446},
publisher = {Informa UK Limited},
}

@incollection{Monette:90,
Address = {Beverly Hills, CA},
Author = {Georges Monette},
3 changes: 2 additions & 1 deletion child/02-anscombe.qmd
@@ -136,7 +136,8 @@ when you look behind the scenes.
For example, in the context of causal analysis, @Gelman-etal:2023 illustrated
sets of four graphs, within each of which
all four represent the same average (latent) causal effect but with
much different patterns of individual effects.
much different patterns of individual effects; @McGowan2023 provide another illustration
with four seemingly identical data sets each generated by a different causal mechanism.
As an example of machine learning models, @Biecek-etal:2023 introduced the "Rashomon Quartet",
a synthetic dataset for which four models from different classes
a synthetic dataset for which four models from different classes
(linear model, regression tree, random forest, neural network)
5 changes: 4 additions & 1 deletion docs/02-getting_started.html
@@ -394,7 +394,7 @@ <h1 class="title"><span id="sec-getting_started" class="quarto-section-identifie
</div>
</div>
<div class="callout-body-container callout-body">
<p>The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, but yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis <span class="citation" data-cites="Gelman-etal:2023">Gelman, Hullman, and Kennedy (<a href="90-references.html#ref-Gelman-etal:2023" role="doc-biblioref">2023</a>)</span>, illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects. As an example of machine learning models, <span class="citation" data-cites="Biecek-etal:2023">Biecek et al. (<a href="90-references.html#ref-Biecek-etal:2023" role="doc-biblioref">2023</a>)</span>, introduced the “Rashamon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is solved when their visualization reveals the distinct ways of understanding structure in the data. The <a href="https://r-causal.github.io/quartets/"><strong>quartets</strong></a> package contains these and other variations on this theme.</p>
<p>The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, but yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis <span class="citation" data-cites="Gelman-etal:2023">Gelman, Hullman, and Kennedy (<a href="90-references.html#ref-Gelman-etal:2023" role="doc-biblioref">2023</a>)</span>, illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects; <span class="citation" data-cites="McGowan2023">McGowan, Gerke, and Barrett (<a href="90-references.html#ref-McGowan2023" role="doc-biblioref">2023</a>)</span> provide another illustration with four seemingly identical data sets each generated by a different causal mechanism. As an example of machine learning models, <span class="citation" data-cites="Biecek-etal:2023">Biecek et al. (<a href="90-references.html#ref-Biecek-etal:2023" role="doc-biblioref">2023</a>)</span>, introduced the “Rashamon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is solved when their visualization reveals the distinct ways of understanding structure in the data. The <a href="https://r-causal.github.io/quartets/"><strong>quartets</strong></a> package contains these and other variations on this theme.</p>
</div>
</div>
</section><section id="sec-davis" class="level3" data-number="2.1.2"><h3 data-number="2.1.2" class="anchored" data-anchor-id="sec-davis">
@@ -547,6 +547,9 @@ <h1 class="title"><span id="sec-getting_started" class="quarto-section-identifie
<div id="ref-MatejkaFitzmaurice2017" class="csl-entry" role="listitem">
Matejka, Justin, and George Fitzmaurice. 2017. <span>“Same Stats, Different Graphs.”</span> In <em>Proceedings of the 2017 <span>CHI</span> Conference on Human Factors in Computing Systems</em>. <span>ACM</span>. <a href="https://doi.org/10.1145/3025453.3025912">https://doi.org/10.1145/3025453.3025912</a>.
</div>
<div id="ref-McGowan2023" class="csl-entry" role="listitem">
McGowan, Lucy D’Agostino, Travis Gerke, and Malcolm Barrett. 2023. <span>“Causal Inference Is Not Just a Statistics Problem.”</span> <em>Journal of Statistics and Data Science Education</em>, December, 1–9. <a href="https://doi.org/10.1080/26939169.2023.2276446">https://doi.org/10.1080/26939169.2023.2276446</a>.
</div>
<div id="ref-Pearson:1896" class="csl-entry" role="listitem">
Pearson, Karl. 1896. <span>“Contributions to the Mathematical Theory of Evolution—<span>III</span>, Regression, Heredity and Panmixia.”</span> <em>Philosophical Transactions of the Royal Society of London</em>, A, 187: 253–318.
</div>
