diff --git a/04-pca-biplot.qmd b/04-pca-biplot.qmd index 682e9f87..ab8db3a7 100644 --- a/04-pca-biplot.qmd +++ b/04-pca-biplot.qmd @@ -174,13 +174,23 @@ data, with points colored by species and the 95% data ellipsoid. This is rotated Because this is a rigid rotation of the cloud of points, the total variability is obviously unchanged. -::: {#fig-pca-animation} -
- -
-Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially
-in data space and its' data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions.
+
+
+
+
+
+
+
+
+::: {.content-visible unless-format="pdf"}
+```{r}
+#| label: fig-pca-animation
+#| out-width: "100%"
+#| echo: false
+#| fig-cap: "Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially in data space and its data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions."
+knitr::include_graphics("images/pca-animation1.gif")
+```
:::
@@ -227,12 +237,21 @@ The **FactoMineR** package [@R-FactoMineR] has extensive capabilities for
exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis, ...).
Unfortunately, although all of these perform similar calculations, the options for
-analysis and the details of the result they return differ ...
+analysis and the details of the results they return differ.
The important options for analysis include:
-* whether or not the data variables are **centered**, to a mean of 0
-* whether or not the data variables are **scaled**, to a variance of 1.
+* whether or not the data variables are **centered**, to a mean of $\bar{x}_j = 0$
+* whether or not the data variables are **scaled**, to a variance of $\text{Var}(x_j) = 1$.
+
+It nearly always makes sense to center the variables. The choice of
+scaling determines whether the correlation matrix is analyzed, so that
+each variable contributes equally to the total variance to be accounted for,
+or the covariance matrix, where each variable contributes its
+own variance to the total. Analysis of the covariance matrix makes little sense
+when the variables are measured on different scales.[^pca-scales]
+
+[^pca-scales]: For example, if two variables in the analysis are height and weight, changing the unit of height from inches to centimeters would multiply its variance by $2.54^2$; changing weight from pounds to kilograms would divide its variance by $2.2^2$.
#### Example: Crime data {.unnumbered}
@@ -304,6 +323,7 @@ of components to extract a desired proportion of total variance, usually in the
```{r}
#| label: fig-crime-ggscreeplot
+#| fig-height: 4
#| out-width: "100%"
#| fig-cap: "Screeplots for the PCA of the crime data. The left panel shows the traditional version, plotting variance proportions against component number, with linear guideline for the scree rule of thumb. The right panel plots cumulative proportions, showing cutoffs of 80%, 90%."
p1 <- ggscreeplot(crime.pca) +
@@ -352,7 +372,7 @@ crime.pca |>
broom::augment(crime) |> head()
```
-Then, we can use `ggplot()` to plot and pair of components.
+Then, we can use `ggplot()` to plot any pair of components.
To aid interpretation, I label the points by their state abbreviation and color them by `region` of the U.S.
A geometric interpretation of the plot requires an aspect ratio of 1.0 (via `coord_fixed()`)
@@ -387,9 +407,10 @@ and West Virginia. The second component has most of the southern states on the l
and Massachusetts, Rhode Island and Hawaii on the high end. However, interpretation is
easier when we also consider how the various crimes contribute to these dimensions.
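Returning to the centering and scaling choice discussed above, here is a minimal sketch with `prcomp()` contrasting the covariance and correlation forms of the analysis. It assumes the numeric crime-rate columns have been collected into a matrix or data frame `crime_num` (a hypothetical name; the book selects the columns from `crime` directly):

```r
# covariance-matrix PCA: variables are centered but keep their own variances
pca_cov <- prcomp(crime_num, center = TRUE, scale. = FALSE)

# correlation-matrix PCA: each variable is also standardized to variance 1,
# so all variables contribute equally to the total variance
pca_cor <- prcomp(crime_num, center = TRUE, scale. = TRUE)

# the component standard deviations (and hence variance proportions) differ
pca_cov$sdev
pca_cor$sdev
```

With `scale. = TRUE`, the second call corresponds to the correlation-matrix analysis used for `crime.pca` in the text.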
-We could obviously go further and plot other pairs of components,
+When, as here, there
+are more than two components that seem important in the scree plot,
+we could obviously go further and plot other pairs.
-**TODO**: Add plot of PC1 vs. PC3
#### Variable vectors {.unnumbered}
You can extract the variable loadings using either `crime.pca$rotation` or
@@ -543,11 +564,18 @@ $\widehat{\mathbf{X}}$ as the product of two matrices,
$$
\widehat{\mathbf{X}} = (\mathbf{U} \mathbf{\Lambda}^\alpha) (\mathbf{\Lambda}^{1-\alpha} \mathbf{V}') = \mathbf{A} \mathbf{B}'
$$
-
-The choice $\alpha = 1$, assigning the singular values totally to the left factor,
- gives a distance interpretation to the row display and
+This notation uses a little math trick involving a power, $0 \le \alpha \le 1$:
+When $\alpha = 1$, $\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 = \mathbf{\Lambda}$,
+and $\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 = \mathbf{I}$.
+$\alpha = 1/2$ gives the diagonal matrix $\mathbf{\Lambda}^{1/2}$ whose elements are the square roots of the singular values.
+
+The choice $\alpha = 1$ assigns the singular values totally to the left factor and gives
+a distance interpretation to the row display: distances between the observation points
+in the plot approximate the distances between the corresponding rows of $\widehat{\mathbf{X}}$. Similarly,
$\alpha = 0$ gives a distance interpretation to the column display.
$\alpha = 1/2$ gives a symmetrically scaled biplot.
+In the $\alpha = 0$ scaling, the inner product of two variable vectors approximates their covariance (or their correlation, for standardized variables), and distances between the observation points then approximate their Mahalanobis distances.
When the singular values are assigned totally to the left or to the right factor,
the resultant coordinates are called _principal coordinates_ and the sum of squared coordinates
@@ -560,13 +588,17 @@ values equal to 1.0.
### Biplots in R
-There are a large number of R packages providing biplots, ...
+There are a large number of R packages providing biplots. The most basic, `stats::biplot()`, provides methods for `"prcomp"` and `"princomp"` objects.
+
+Among the others, the **factoextra** package provides `fviz_pca_biplot()` and related `fviz()` functions giving `ggplot2` graphics, and the **adegraphics** package offers a further collection of displays.
-Here, I use the **ggbiplot** package ...
+Here, I use the **ggbiplot** package, which aims to provide a simple interface to biplots within the `ggplot2` framework.
### Example
-A basic biplot, using standardized principal components and labeling the observation by their state abbreviation is shown in @fig-crime-biplot1.
+A basic biplot of the `crime` data, using standardized principal components and labeling the observations by their state abbreviations, is shown in @fig-crime-biplot1.
+The correlation circle indicates that these components are uncorrelated and have
+equal variance in the display.
```{r}
#| label: fig-crime-biplot1
#| out-width: "80%"
@@ -582,10 +614,13 @@ ggbiplot(crime.pca,
theme_minimal(base_size = 14)
```
+In this dataset the states are grouped by region and we saw some differences among regions in the plot (@fig-crime-scores-plot12) of component scores.
+`ggbiplot()` provides options to include a `groups =` variable, used to
+color the observation points and also to draw their data ellipses, facilitating interpretation.
```{r}
#| label: fig-crime-biplot2
#| out-width: "80%"
-#| fig-cap: "Enhanced biplot of the crime data. ..."
+#| fig-cap: "Enhanced biplot of the crime data, grouping the states by region and adding data ellipses."
ggbiplot(crime.pca,
obs.scale = 1, var.scale = 1,
groups = crime$region,
@@ -601,6 +636,48 @@ ggbiplot(crime.pca,
theme(legend.direction = 'horizontal', legend.position = 'top')
```
+This plot provides what is needed to interpret both the nature of the components and the variation of the states in relation to them. Here, the data ellipses for the regions
+provide a visual summary that aids interpretation.
+
+* From the variable vectors, it seems that PC1, having all positive and nearly equal loadings, reflects a total or overall index of crimes. Nevada, California, New York and Florida are highest on this, while North Dakota, South Dakota and West Virginia are lowest.
+
+* The second component, PC2, shows a contrast between crimes against persons (murder, assault, rape) at the top and property crimes (auto theft, larceny) at the bottom. Nearly all the Southern states are high on personal crimes; states in the North East are generally higher
+on property crimes.
+
+* Western states tend to be somewhat higher on overall crime rate, while the North Central states are lower on average. In these states there is not much variation in the relative proportions of personal vs. property crimes.
+
+Moreover, in this biplot you can interpret the value for a particular state on a given crime by considering its projection on the variable vector: the origin corresponds to the mean, positions along the vector have greater than average values on that crime, and positions in the opposite direction have lower than average values. For example, Massachusetts has the highest value on auto theft, but a value less than the mean on murder. Louisiana and South Carolina, on the other hand, are highest in the rate of murder and slightly less than average on auto theft.
+
+These 2D plots account for only 76.5% of the total variance of crimes, so it is useful to also examine the third principal component, which accounts for an additional 10.4%.
+The `choices =` option controls which dimensions are plotted.
+
+```{r}
+#| label: fig-crime-biplot3
+#| out-width: "80%"
+#| fig-cap: "Biplot of dimensions 1 & 3 of the crime data."
+ggbiplot(crime.pca,
+         choices = c(1,3),
+         obs.scale = 1, var.scale = 1,
+         groups = crime$region,
+         labels = crime$st,
+         labels.size = 4,
+         var.factor = 2,
+         ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
+         circle = TRUE,
+         varname.size = 4,
+         varname.color = "black") +
+  labs(fill = "Region", color = "Region") +
+  theme_minimal(base_size = 14) +
+  theme(legend.direction = 'horizontal', legend.position = 'top')
+```
+
+Dimension 3 in @fig-crime-biplot3 is more subtle. One interpretation is a contrast between
+larceny, which is simple theft, and robbery, which involves stealing something from a person
+and is considered a more serious crime with an element of possible violence.
+In this plot, murder has a relatively short variable vector, so it does not contribute
+very much to differences among the states.
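The variance percentages quoted above can be read directly from the PCA object; a quick check using the `crime.pca` object fit earlier:

```r
# proportion of variance for each component, and the cumulative proportion
summary(crime.pca)

# or computed by hand from the component standard deviations
prop <- crime.pca$sdev^2 / sum(crime.pca$sdev^2)
round(rbind(proportion = prop, cumulative = cumsum(prop)), 3)
```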
+ + ## Elliptical insights: Outlier detection diff --git a/R/crime-ggbiplot.R b/R/crime-ggbiplot.R index a6830e08..762116f2 100644 --- a/R/crime-ggbiplot.R +++ b/R/crime-ggbiplot.R @@ -66,4 +66,19 @@ ggbiplot(crime.pca, theme_minimal(base_size = 14) + theme(legend.direction = 'horizontal', legend.position = 'top') +# PC1 & PC3 +ggbiplot(crime.pca, + choices = c(1,3), + obs.scale = 1, var.scale = 1, + groups = crime$region, + labels = crime$st, + labels.size = 4, + var.factor = 2, + ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1, + circle = TRUE, + varname.size = 4, + varname.color = "black") + + labs(fill = "Region", color = "Region") + + theme_minimal(base_size = 14) + + theme(legend.direction = 'horizontal', legend.position = 'top') diff --git a/bib/references.bib b/bib/references.bib index 49127022..e7d0808b 100644 --- a/bib/references.bib +++ b/bib/references.bib @@ -561,7 +561,8 @@ @article{Gabriel:71 Pages = {453--467}, Title = {The Biplot Graphic Display of Matrices with Application to Principal Components Analysis}, Volume = {58}, - Year = {1971} + Year = {1971}, + doi = {10.2307/2334381}, } @incollection{Gabriel:81, @@ -944,6 +945,18 @@ @article{Mardia:1974 } +@Article{McGowan2023, + author = {McGowan, Lucy D’Agostino and Gerke, Travis and Barrett, Malcolm}, + journal = {Journal of Statistics and Data Science Education}, + title = {Causal inference is not just a statistics problem}, + year = {2023}, + issn = {2693-9169}, + month = dec, + pages = {1--9}, + doi = {10.1080/26939169.2023.2276446}, + publisher = {Informa UK Limited}, +} + @incollection{Monette:90, Address = {Beverly Hills, CA}, Author = {Georges Monette}, diff --git a/child/02-anscombe.qmd b/child/02-anscombe.qmd index 2da26b4c..2915262f 100644 --- a/child/02-anscombe.qmd +++ b/child/02-anscombe.qmd @@ -136,7 +136,8 @@ when you look behind the scenes. For example, in the context of causal analysis @Gelman-etal:2023, illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with -much different patterns of individual effects. +much different patterns of individual effects; @McGowan2023 provide another illustration +with four seemingly identical data sets each generated by a different causal mechanism. As an example of machine learning models, @Biecek-etal:2023, introduced the "Rashamon Quartet", a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) diff --git a/docs/02-getting_started.html b/docs/02-getting_started.html index 32ce84f6..5145ecdd 100644 --- a/docs/02-getting_started.html +++ b/docs/02-getting_started.html @@ -394,7 +394,7 @@

-

The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, but yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis Gelman, Hullman, and Kennedy (2023), illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects. As an example of machine learning models, Biecek et al. (2023), introduced the “Rashamon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is solved when their visualization reveals the distinct ways of understanding structure in the data. The quartets package contains these and other variations on this theme.

+

The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, but yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis Gelman, Hullman, and Kennedy (2023), illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects; McGowan, Gerke, and Barrett (2023) provide another illustration with four seemingly identical data sets each generated by a different causal mechanism. As an example of machine learning models, Biecek et al. (2023), introduced the “Rashamon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is solved when their visualization reveals the distinct ways of understanding structure in the data. The quartets package contains these and other variations on this theme.

@@ -547,6 +547,9 @@

Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM. https://doi.org/10.1145/3025453.3025912. +
+McGowan, Lucy D’Agostino, Travis Gerke, and Malcolm Barrett. 2023. “Causal Inference Is Not Just a Statistics Problem.” Journal of Statistics and Data Science Education, December, 1–9. https://doi.org/10.1080/26939169.2023.2276446. +
Pearson, Karl. 1896. “Contributions to the Mathematical Theory of Evolution—III, Regression, Heredity and Panmixia.” Philosophical Transactions of the Royal Society of London, A, 187: 253–318.
diff --git a/docs/04-pca-biplot.html b/docs/04-pca-biplot.html index 11711cd4..a0ece077 100644 --- a/docs/04-pca-biplot.html +++ b/docs/04-pca-biplot.html @@ -373,12 +373,20 @@

the total variation of the points in data space, \(\text{Var}(x) + \text{Var}(y)\), being unchanged by rotation, was equally well expressed as the total variation \(\text{Var}(PC1) + \text{Var}(PC2)\) of the scores on what are now called the principal component axes.

It would have appealed to Pearson (and also to A Square) to see these observations demonstrated in a 3D video. Figure 4.4 shows a 3D plot of the variables Sepal.Length, Sepal.Width and Petal.Length in Edgar Anderson’s iris data, with points colored by species and the 95% data ellipsoid. This is rotated smoothly by interpolation until the first two principal axes, PC1 and PC2 are aligned with the horizontal and vertical dimensions. Because this is a rigid rotation of the cloud of points, the total variability is obviously unchanged.

+ + + + + + + +
+
-
- +

+
Figure 4.4: Animation of PCA as a rotation in 3D space. The plot shows three variables for the iris data, initially in data space and its data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions.
+
-
Figure 4.4: Animation of PCA as a rotation in 3D space. The plot shows three variables for the iris data, initially in data space and its’ data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions.

4.2.1 PCA by springs

@@ -400,12 +408,14 @@

4.2.3 Finding principal components

In R, principal components analysis is most easily carried out using stats::prcomp() or stats::princomp() or similar functions in other packages such as FactoMineR::PCA(). The FactoMineR package (Husson et al. 2023) has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis, …).

-

Unfortunately, although all of these performing similar calculations, the options for analysis and the details of the result they return differ …

+

Unfortunately, although all of these perform similar calculations, the options for analysis and the details of the results they return differ.

The important options for analysis include:

    -
  • whether or not the data variables are centered, to a mean of 0
  • -
  • whether or not the data variables are scaled, to a variance of 1.
  • +
  • whether or not the data variables are centered, to a mean of \(\bar{x}_j =0\) +
  • +
  • whether or not the data variables are scaled, to a variance of \(\text{Var}(x_j) =1\).
+

It nearly always makes sense to center the variables. The choice of scaling determines whether the correlation matrix is analyzed, so that each variable contributes equally to the total variance to be accounted for, or the covariance matrix, where each variable contributes its own variance to the total. Analysis of the covariance matrix makes little sense when the variables are measured on different scales.2

Example: Crime data

The dataset crime, analysed in Section 3.2.2, showed all positive correlations among the rates of various crimes in the corrgram, Figure 3.27. What can we see from a principal components analysis? Is it possible that a few dimensions can account for most of the juice in this data?

In this example, you can easily find the PCA solution using prcomp() in a single line in base-R. You need to specify the numeric variables to analyze by their columns in the data frame. The most important option here is scale. = TRUE

@@ -519,7 +529,7 @@

#> # .fittedPC4 <dbl>, .fittedPC5 <dbl>, .fittedPC6 <dbl>, #> # .fittedPC7 <dbl>

-

Then, we can use ggplot() to plot and pair of components. To aid interpretation, I label the points by their state abbreviation and color them by region of the U.S.. A geometric interpretation of the plot requires an aspect ratio of 1.0 (via coord_fixed()) so that a unit distance on the horizontal axis is the same length as a unit distance on the vertical. To demonstrate that the components are uncorrelated, I also added their data ellipse.

+

Then, we can use ggplot() to plot any pair of components. To aid interpretation, I label the points by their state abbreviation and color them by region of the U.S. A geometric interpretation of the plot requires an aspect ratio of 1.0 (via coord_fixed()) so that a unit distance on the horizontal axis is the same length as a unit distance on the vertical. To demonstrate that the components are uncorrelated, I also added their data ellipse.

crime.pca |>
   broom::augment(crime) |> # add original dataset back in
@@ -541,8 +551,8 @@ 

To interpret such plots, it is useful to consider the observations that are high and low on each of the axes, as well as other information, such as region here, and ask how these differ on the crime statistics. The first component, PC1, contrasts Nevada and California with North Dakota, South Dakota and West Virginia. The second component has most of the southern states on the low end and Massachusetts, Rhode Island and Hawaii on the high end. However, interpretation is easier when we also consider how the various crimes contribute to these dimensions.

-

We could obviously go further and plot other pairs of components,

-

TODO: Add plot of PC1 vs. PC3 #### Variable vectors {.unnumbered}

+

When, as here, there are more than two components that seem important in the scree plot, we could obviously go further and plot other pairs.

+

Variable vectors

You can extract the variable loadings using either crime.pca$rotation or purrr::pluck("rotation"), similar to what I did with the scores.

crime.pca |> purrr::pluck("rotation")
@@ -662,16 +672,17 @@ 

The factor, \(\alpha\) allows the variances of the components to be apportioned between the row points and column vectors, with different interpretations, by representing the approximation \(\widehat{\mathbf{X}}\) as the product of two matrices,

\[ \widehat{\mathbf{X}} = (\mathbf{U} \mathbf{\Lambda}^\alpha) (\mathbf{\Lambda}^{1-\alpha} \mathbf{V}') = \mathbf{A} \mathbf{B}' -\]

-

The choice \(\alpha = 1\), assigning the singular values totally to the left factor, gives a distance interpretation to the row display and \(\alpha = 0\) gives a distance interpretation to the column display. \(\alpha = 1/2\) gives a symmetrically scaled biplot.

+\]
This notation uses a little math trick involving a power, \(0 \le \alpha \le 1\): When \(\alpha = 1\), \(\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 = \mathbf{\Lambda}\), and \(\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 = \mathbf{I}\). \(\alpha = 1/2\) gives the diagonal matrix \(\mathbf{\Lambda}^{1/2}\) whose elements are the square roots of the singular values.

+

The choice \(\alpha = 1\) assigns the singular values totally to the left factor and gives a distance interpretation to the row display: distances between the observation points approximate the distances between the corresponding rows of \(\widehat{\mathbf{X}}\). Similarly, \(\alpha = 0\) gives a distance interpretation to the column display; in that scaling the inner product of two variable vectors approximates their covariance (or their correlation, for standardized variables), and distances between the observation points then approximate their Mahalanobis distances. \(\alpha = 1/2\) gives a symmetrically scaled biplot.

When the singular values are assigned totally to the left or to the right factor, the resultant coordinates are called principal coordinates and the sum of squared coordinates on each dimension equals the corresponding singular value. The other matrix, to which no part of the singular values is assigned, contains the so-called standard coordinates, which have sum of squared values equal to 1.0.
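As a concrete sketch of the \(\alpha\) scaling just described (purely illustrative; it assumes the centered, standardized crime rates are in a matrix built from a hypothetical object `crime_num`), the two biplot factors can be formed directly from the SVD:

```r
X  <- scale(crime_num)      # centered, standardized data (hypothetical object)
sv <- svd(X)                # X = U diag(d) V'

alpha <- 1                  # try 0, 1/2, or 1
A <- sv$u %*% diag(sv$d^alpha)         # left factor: observation coordinates
B <- sv$v %*% diag(sv$d^(1 - alpha))   # right factor: variable coordinates

# A %*% t(B) recovers X exactly when all components are kept;
# using only the first two columns of A and B gives the rank-2
# approximation that a 2D biplot displays
max(abs(X - A %*% t(B)))
```

Whatever value of \(\alpha\) is chosen, the product \(\mathbf{A}\mathbf{B}'\) is the same; only the relative scaling of the observation points and variable vectors changes.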

4.3.2 Biplots in R

-

There are a large number of R packages providing biplots, …

-

Here, I use the ggbiplot package …

+

There are a large number of R packages providing biplots. The most basic, stats::biplot(), provides methods for "prcomp" and "princomp" objects.
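For instance, a minimal base-graphics version for the crime PCA needs only the default "prcomp" method (a sketch; `crime.pca` is the object fit earlier, and the graphical settings are arbitrary):

```r
# base-R biplot of the first two components of the crime PCA
biplot(crime.pca, choices = 1:2, scale = 1, cex = 0.6)
```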

+

Among the others, the factoextra package provides fviz_pca_biplot() and related fviz() functions giving ggplot2 graphics, and the adegraphics package offers a further collection of displays.

+

Here, I use the ggbiplot package, which aims to provide a simple interface to biplots within the ggplot2 framework.

4.3.3 Example

-

A basic biplot, using standardized principal components and labeling the observation by their state abbreviation is shown in Figure 4.9.

+

A basic biplot of the crime data, using standardized principal components and labeling the observations by their state abbreviations, is shown in Figure 4.9. The correlation circle indicates that these components are uncorrelated and have equal variance in the display.

crime.pca <- reflect(crime.pca) # reflect the axes
 
@@ -689,6 +700,7 @@ 

+

In this dataset the states are grouped by region and we saw some differences among regions in the plot (Figure 4.7) of component scores. ggbiplot() provides options to include a groups = variable, used to color the observation points and also to draw their data ellipses, facilitating interpretation.

ggbiplot(crime.pca,
    obs.scale = 1, var.scale = 1,
@@ -706,34 +718,65 @@ 

-
Figure 4.10: Enhanced biplot of the crime data. …
+
Figure 4.10: Enhanced biplot of the crime data, grouping the states by region and adding data ellipses.
+
+

+
+

This plot provides what is needed to interpret both the nature of the components and the variation of the states in relation to them. Here, the data ellipses for the regions provide a visual summary that aids interpretation.

+
    +
  • From the variable vectors, it seems that PC1, having all positive and nearly equal loadings, reflects a total or overall index of crimes. Nevada, California, New York and Florida are highest on this, while North Dakota, South Dakota and West Virginia are lowest.

  • +
  • The second component, PC2, shows a contrast between crimes against persons (murder, assault, rape) at the top and property crimes (auto theft, larceny) at the bottom. Nearly all the Southern states are high on personal crimes; states in the North East are generally higher on property crimes.

  • +
  • Western states tend to be somewhat higher on overall crime rate, while North Central are lower on average. In these states there is not much variation in the relative proportions of personal vs. property crimes.

  • +
+

Moreover, in this biplot you can interpret the value for a particular state on a given crime by considering its projection on the variable vector: the origin corresponds to the mean, positions along the vector have greater than average values on that crime, and positions in the opposite direction have lower than average values. For example, Massachusetts has the highest value on auto theft, but a value less than the mean on murder. Louisiana and South Carolina, on the other hand, are highest in the rate of murder and slightly less than average on auto theft.

+

These 2D plots account for only 76.5% of the total variance of crimes, so it is useful to also examine the third principal component, which accounts for an additional 10.4%. The choices = option controls which dimensions are plotted.

+
+
ggbiplot(crime.pca,
+         choices = c(1,3),
+         obs.scale = 1, var.scale = 1,
+         groups = crime$region,
+         labels = crime$st,
+         labels.size = 4,
+         var.factor = 2,
+         ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
+         circle = TRUE,
+         varname.size = 4,
+         varname.color = "black") +
+  labs(fill = "Region", color = "Region") +
+  theme_minimal(base_size = 14) +
+  theme(legend.direction = 'horizontal', legend.position = 'top')
+
+
+

+
Figure 4.11: Biplot of dimensions 1 & 3 of the crime data.
+

Dimension 3 in Figure 4.11 is more subtle. One interpretation is a contrast between larceny, which is simple theft, and robbery, which involves stealing something from a person and is considered a more serious crime with an element of possible violence. In this plot, murder has a relatively short variable vector, so it does not contribute very much to differences among the states.

4.4 Elliptical insights: Outlier detection

The data ellipse (Section 3.1.4), or ellipsoid in more than 2D is fundamental in regression. But also, as Pearson showed, it is key to understanding principal components analysis, where the principal component directions are simply the axes of the ellipsoid of the data. As such, observations that are unusual in data space may not stand out in univariate views of the variables, but will stand out in principal component space, usually on the smallest dimension.

As an illustration, I created a dataset of \(n = 100\) observations with a linear relation, \(y = x + \mathcal{N}(0, 1)\) and then added two discrepant points at (1.5, -1.5), (-1.5, 1.5).

-
set.seed(123345)
+
set.seed(123345)
 x <- c(rnorm(100),             1.5, -1.5)
 y <- c(x[1:100] + rnorm(100), -1.5, 1.5)
-

When these are plotted with a data ellipse in Figure 4.11 (left), you can see the discrepant points labeled 101 and 102, but they do not stand out as unusual on either \(x\) or \(y\). The transformation to from data space to principal components space, shown in Figure 4.11 (right), is simply a rotation of \((x, y)\) to a space whose coordinate axes are the major and minor axes of the data ellipse, \((PC_1, PC_2)\). In this view, the additional points appear a univariate outliers on the smallest dimension, \(PC_2\).

+

When these are plotted with a data ellipse in Figure 4.12 (left), you can see the discrepant points labeled 101 and 102, but they do not stand out as unusual on either \(x\) or \(y\). The transformation from data space to principal components space, shown in Figure 4.12 (right), is simply a rotation of \((x, y)\) to a space whose coordinate axes are the major and minor axes of the data ellipse, \((PC_1, PC_2)\). In this view, the additional points appear as univariate outliers on the smallest dimension, \(PC_2\).
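A minimal sketch (assuming the `x` and `y` vectors defined in the code block above) of how the principal component scores expose the two added points:

```r
# rotate (x, y) into principal component space
pcs <- prcomp(cbind(x, y))$x

# the two discrepant points (rows 101 and 102) are typically the most
# extreme observations on the small dimension, PC2
order(abs(pcs[, "PC2"]), decreasing = TRUE)[1:2]
```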

-
Figure 4.11: Outlier demonstration: The left panel shows the original data and highlights the two discrepant points, which do not appear to be unusual on either x or y. The right panel shows the data rotated to principal components, where the labeled points stand out on the smallest PCA dimension.
+
Figure 4.12: Outlier demonstration: The left panel shows the original data and highlights the two discrepant points, which do not appear to be unusual on either x or y. The right panel shows the data rotated to principal components, where the labeled points stand out on the smallest PCA dimension.
-

To see this more clearly, Figure 4.12 shows an animation of the rotation from data space to PCA space. This uses heplots::interpPlot()

+

To see this more clearly, Figure 4.13 shows an animation of the rotation from data space to PCA space. This uses heplots::interpPlot()

-
Figure 4.12: Animation of rotation from data space to PCA space.
+
Figure 4.13: Animation of rotation from data space to PCA space.
@@ -754,7 +797,7 @@

A History of Data Visualization and Graphic Communication. Cambridge, MA: Harvard University Press. https://doi.org/10.4159/9780674259034.

-Gabriel, K. R. 1971. “The Biplot Graphic Display of Matrices with Application to Principal Components Analysis.” Biometrics 58 (3): 453–67. +Gabriel, K. R. 1971. “The Biplot Graphic Display of Matrices with Application to Principal Components Analysis.” Biometrika 58 (3): 453–67. https://doi.org/10.2307/2334381.
———. 1981. “Biplot Display of Multivariate Matrices for Inspection of Data and Diagnosis.” In Interpreting Multivariate Data, edited by V. Barnett, 147–73. London: John Wiley; Sons. @@ -787,6 +830,7 @@


1. This is Euler’s (1758) formula, which states that any convex polyhedron must obey the formula \(V + F - E = 2\) where \(V\) is the number of vertexes (corners), \(F\) is the number of faces and \(E\) is the number of edges. For example, a tetrahedron or pyramid has \((V, F, E) = (4, 4, 6)\) and a cube has \((V, F, E) = (8, 6, 12)\). Stated in words, for all solid bodies confined by planes, the sum of the number of vertexes and the number of faces is two less than the number of edges.↩︎

  2. +
3. For example, if two variables in the analysis are height and weight, changing the unit of height from inches to centimeters would multiply its variance by \(2.54^2\); changing weight from pounds to kilograms would divide its variance by \(2.2^2\).↩︎