diff --git a/03-multivariate_plots.qmd b/03-multivariate_plots.qmd
index cbee089a..1b4a517c 100644
--- a/03-multivariate_plots.qmd
+++ b/03-multivariate_plots.qmd
@@ -13,7 +13,7 @@ source("R/common.R")
> There is no excuse for failing to plot and look. --- J. W. Tukey
> (1977), *Exploratory Data Analysis*, p. 157
-
+
The quote above from John Tukey reminds us that data analysis should
rightly start with graphs to help us understand the main features of our
@@ -171,7 +171,7 @@ The most generally useful idea is a smoother that tracks an average
value, $\mathbb{E} (y | x)$, of $y$ as $x$ varies across its range
*without* assuming any particular functional form, and so avoiding the
necessity to choose among `y ~ poly(x, 1)`, or `y ~ poly(x, 2)`, or
-`y ~ poly(x, 3)` ...
+`y ~ poly(x, 3)`, etc.
Non-parametric smoothers attempt to estimate $\mathbb{E} (y | x) = f(x)$
where $f(x)$ is some smooth function. These typically use a collection
@@ -218,7 +218,7 @@ after earning their degree is about the same as a person at 15 years.
What else is going on here?
-
+
### Stratifiers
@@ -240,15 +240,15 @@ down the overall relation to see whether and how it changes over such
a continuous variable into ranges (*shingles*) and do the same
graphically. There are two general stratifying graphical techniques:
-- **grouping**: Identify subgroups in the data by assigning different
+- **Grouping**: Identify subgroups in the data by assigning different
  visual attributes, such as color, shape, line style, etc., within a
  single plot. This is quite natural for factors; quantitative
  predictors can be accommodated by cutting their range into ordered
-  intervals. Two such grouping variables can be Grouping has the
+  intervals. Grouping has the
  advantage that the levels of a grouping variable can be shown within
  the same plot, facilitating direct comparison.
-- **conditioning**: Showing subgroups in different plot panels. This
+- **Conditioning**: Showing subgroups in different plot panels. This
  has the advantages that relations for the individual groups are more
  easily discerned and one can easily stratify by two (or more) other
  variables jointly, but visual comparison is more difficult because
@@ -273,11 +273,11 @@ in *Trellis* plots [@Becker:1996:VDC; @Cleveland:85].
Trellis displays were extended in the **lattice** package [@R-lattice],
which offered:
-- a **graphing syntax** similar to that used in statistical model
+- A **graphing syntax** similar to that used in statistical model
  formulas: `y ~ x | g` conditions the data by the levels of `g`, with
  `|` read as "given"; two or more conditioning variables are specified
  as `y ~ x | g1 + g2 + ...`, with `+` read as "and".
-- **panel functions** define what is plotted in a given panel.
+- **Panel functions** define what is plotted in a given panel.
  `panel.xyplot()` is the default for scatterplots, plotting points,
  but you can add `panel.lmline()` for regression lines,
  `latticeExtra::panel.smoother()` for loess smooths and a wide
@@ -285,7 +285,7 @@ which offered:
The **car** package [@R-car] supports this graphing syntax in many of
its functions. **ggplot2** does not; it uses aesthetics (`aes()`), which
-map variables in the data to visual characteristics in displays. ...
+map variables in the data to visual characteristics in displays.
:::
The most obvious variable that affects academic salary is `rank`,
@@ -294,7 +294,7 @@ that carries through in their future salary.
What can we see if we group by `rank` and fit a separate smoothed curve
for each?
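In **lattice** terms, this is the `groups =` idea from the box above; a minimal sketch, assuming the `Salaries` data from the **carData** package:

```{r}
#| eval: false
library(lattice)
# a single panel, with points and a loess smooth for each level of rank
xyplot(salary ~ yrs.since.phd, data = carData::Salaries,
       groups = rank,
       type = c("p", "smooth"),
       auto.key = list(columns = 3))
```

Here `type = c("p", "smooth")` asks `panel.xyplot()` for points plus a loess curve, fitted separately within each `rank` group.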
In `ggplot2` thinking, grouping is accomplished simply by adding an
-aesthetic, such as `color = rank`. what happens then is that points,
+aesthetic, such as `color = rank`. What happens then is that points,
lines, smooths and other `geom_*()` layers inherit the feature that they
are differentiated by color. In the case of `geom_smooth()`, we get a
separate fit for each subset of the data, according to `rank`.
@@ -322,8 +322,7 @@ ggplot(Salaries,
           y = "Salary") +
  geom_smooth(aes(fill = rank),
              method = "loess", formula = "y ~ x",
-             linewidth = 2) +
-# theme_bw(base_size = 14) +
+             linewidth = 2) +
  legend_pos
```

Well, there is a different story here. Salaries generally occupy
separate levels, increasing with academic rank. The horizontal extents
of the smoothed curves show their ranges. Within each rank there is some
initial increase after promotion, and then some tendency to decline with
-increasing years. But by and large, years since the PhD doesn't make
-that much difference, once we've taken academic rank into account
+increasing years. But, by and large, years since the PhD doesn't make
+that much difference once we've taken academic rank into account.

-What about the `discipline`, classified, perhaps peculiarly as
-"theoretical" vs. "applied", here? The values are just `"A"` and `"B"`,
+What about `discipline`, which is classified, perhaps peculiarly, as
+"theoretical" vs. "applied"? The values are just `"A"` and `"B"`,
so I map these to more meaningful labels before making the plot.

```{r}
@@ -370,15 +369,17 @@ Salaries |>
```

For both groups, there is an approximately linear relation up to about
-30--40 years, but the smoothed curves then diverge, into the region
-where the data is thin.
+30--40 years, but the smoothed curves then diverge into the region
+where the data is thinner.
+
+
### Conditioning

The previous plots use grouping by color to plot the data for different
subsets inside the same plot window, making comparison among groups
easier, because they can be directly compared along a common vertical
-scale [^03-multivariate_plots-1]. This gets messy however when there are
+scale [^03-multivariate_plots-1]. This gets messy, however, when there are
more than just a few levels, or worse---when there are two (or more)
variables for which we want to show separate effects. In such cases, we
can plot separate panels using the `ggplot2` concept of *faceting*.
@@ -413,7 +414,6 @@ Salaries |>
              method = "lm", formula = "y ~ x",
              linewidth = 2) +
  facet_wrap(~ discipline) +
-# theme_bw(base_size = 14) +
  legend_pos
```
@@ -427,8 +427,7 @@ are there anomalous differences in salary for men and women---
we can look at differences in salary according to sex, when discipline
and rank are taken into account. To do this graphically, condition by
both variables, but use `facet_grid(discipline ~ rank)` to arrange their
-combinations in a grid whose rows are the levels of `discipline` and
-whose columns are those of `rank`. I want to make the comparison of
+combinations in a grid whose rows are the levels of `discipline` and columns are those of `rank`. I want to make the comparison of
males and females most direct, so I use `color = sex` to stratify the
panels. The smoothed regression lines and error bands are calculated
separately for each combination of discipline, rank and sex.
@@ -453,6 +452,7 @@ Salaries |>
```
+
```{r child="child/03-data-ellipse.qmd"}
```
@@ -475,7 +475,7 @@ Plotting these together gives @fig-prestige-pairs.
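A basic version of such a plot needs nothing more than the formula method of base R's `pairs()`; a minimal sketch, assuming the `Prestige` data from the **carData** package:

```{r}
#| eval: false
# all pairwise scatterplots of the four variables
pairs(~ prestige + income + education + women, data = carData::Prestige)
```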
In such plots, the diagonal cells give labels for the variables, but
they are also a guide to interpreting what is shown. In each row, say
row 2 for `income`, income is the vertical $y$ variable in plots against
other variables. In
-each column, say column 3 for `education`, education is the horizonal
+each column, say column 3 for `education`, education is the horizontal
$x$ variable.

```{r}
@@ -528,7 +528,7 @@ scatterplotMatrix(~ prestige + income + education + women,
`scatterplotMatrix()` can also label points using the `id =` argument
(though this can get messy) and can stratify the observations by a
-grouping variable with different symbols and colors. For example
+grouping variable with different symbols and colors. For example,
@fig-prestige-spm2 uses the syntax
`~ prestige + education + income + women | type` to provide separate
regression lines, smoothed curves and data ellipses for the three types
@@ -581,9 +581,10 @@ scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | spec
    regLine = list(lwd=3),
    diagonal = list(method = "boxplot"),
    smooth = FALSE,
-   plot.points = FALSE)
+   plot.points = FALSE,
+   cex.labels = 1)
```
-
+
It can be seen that the species are widely separated in most of the
bivariate plots. As well, the regression lines for species have similar
slopes and the data ellipses have similar size and shape in most of the
@@ -598,8 +599,8 @@ penguins.
## Looking ahead

@fig-peng-spm provides a reasonably complete visual summary of the data
-in relation to multivariate models that ask "do the species differ on in
-their means on these body size measures? This corresponds to the MANOVA
+in relation to multivariate models that ask "do the species differ in
+their means on these body size measures?" This corresponds to the MANOVA
model,

```{r}
@@ -612,7 +613,7 @@ Hypothesis-error (HE) plots, described in @sec-vis-mlm, provide a better
summary of the evidence for the MANOVA test of differences among means
on all variables together. These give an $\mathbf{H}$ ellipse reflecting
the differences among means, to be compared with an $\mathbf{E}$ ellipse
-reflecting within-group variation and a visual test of significance
+reflecting within-group variation and a visual test of significance.

A related question is "how well are the penguin species distinguished
by these body size measures?" Here, the relevant model is linear
@@ -626,7 +627,7 @@ peng.lda <- MASS::lda( species ~ cbind(bill_length, bill_depth, flipper_length, b
```

Both MANOVA and LDA depend on the assumption that the variances and
-correlations of the are the same for all groups. This assumption can be
+correlations between the variables are the same for all groups. This assumption can be
tested and visualized using the methods in @sec-eqcov.
:::
@@ -651,7 +652,7 @@ str(crime)
@fig-crime-spm displays the scatterplot matrix for these seven
variables, using only the regression line and data ellipse to show the
-linear relation and the loess smooth to show potential nonlinearity. To
+linear relation and the loess smooth to show potential non-linearity. To
make this even more schematic, the axis tick marks and labels are also
removed using the `par()` settings `xaxt = "n", yaxt = "n"`.
@@ -704,7 +705,7 @@ related variables together as described in @sec-pca-biplot.
knitr::include_graphics("images/corrgram-renderings.png")
```

-In R these diagrams can be created using the **corrgram** [@R-corrgram]
+In R, these diagrams can be created using the **corrgram** [@R-corrgram]
and **corrplot** [@R-corrplot] packages, with different features.
`corrgram::corrgram()` is closest to @Friendly:02:corrgram, in that it
allows different rendering functions for the lower, upper and diagonal
@@ -725,7 +726,7 @@ crime |>
`c("circle", "square", "ellipse", "number", "shade", "color", "pie")`,
but only one can be used at a time. The function
`corrplot::corrplot.mixed()` allows different options to be selected for
-the lower and upper triangles. The iconic shape is colored with with a
+the lower and upper triangles. The iconic shape is colored with a
gradient in relation to the correlation value.

```{r}
@@ -760,7 +761,24 @@ presentation purposes), makes this an attractive alternative to boring
tables of correlations.

**TODO**: Add example showing correlation ordering -- e.g., `mtcars`
-data.
+data.
+
+```{r}
+#| echo: false
+#| include: false
+#| label: fig-mtcars-corrplot-ordering
+#| fig-width: 8
+#| fig-height: 8
+#| out-width: "100%"
+#| fig-cap: "Corrplot of the `mtcars` data, showing the ordered correlation between each pair of variables with a circle (upper) and a numerical correlation value (lower), all shaded in proportion to the correlation magnitude and direction."
+library(corrplot)
+corrplot.mixed(cor(mtcars[, 1:5]),
+               lower = "number",
+               order = "FPC",
+               upper = "circle",
+               tl.col = "black")
+```

## Generalized pairs plots {#sec-ggpairs}

@@ -779,7 +797,7 @@ For example, we can tabulate the distributions of penguin species by
sex and the island where they were observed using `xtabs()`. `ftable()`
prints this three-way table more compactly. (In this example, and what
follows in the chapter, I've changed the labels for sex from ("f", "m")
-to ("Female", "Male"))
+to ("Female", "Male")).

```{r peng-table}
# use better labels for sex
@@ -827,25 +845,23 @@ species varies across island because on each island one or more
species did not occur. Row 2 and column 2 show that sex is nearly
exactly proportional among species and islands, indicating independence,
$\text{sex} \perp \{\text{species}, \text{island}\}$. More importantly,
-mosaic pairs plots can show, at a glance all (bivariate) associations
+mosaic pairs plots can show, at a glance, all (bivariate) associations
among multivariate categorical variables.

The next step, by John Emerson and others [@Emerson-etal:2013], was to
recognize that combinations of continuous and discrete (categorical)
variables could be plotted in different ways.

-- two continuous variables can be shown as a standard scatterplot of
+- Two continuous variables can be shown as a standard scatterplot of
  points and/or bivariate density contours, or simply by numeric
  summaries such as a correlation value;
-- a pair of one continuous and one categorical variable can be shown
+- A pair of one continuous and one categorical variable can be shown
  as side-by-side boxplots or violin plots, histograms or density
-  plots
-- two categorical variables could be shown in a mosaic plot or by
+  plots;
+- Two categorical variables could be shown in a mosaic plot or by
  grouped bar plots.

-In the `ggplot2` framework, these displays are implemented in the
-**GGally** package [@R-GGally] in the `ggpairs()` function. This allows
-different plot types to be shown in the lower and upper triangles and in
+In the **ggplot2** framework, these displays are implemented using the `ggpairs()` function from the **GGally** package [@R-GGally]. This allows different plot types to be shown in the lower and upper triangles and in
the diagonal cells of the plot matrix. As well, aesthetics such as color
and shape can be used within the plots to distinguish groups directly.
As illustrated below, you can define custom functions to control exactly
@@ -853,7 +869,7 @@ what is plotted in any panel.

The basic, default plot shows scatterplots for pairs of continuous
variables in the lower triangle and the values of correlations in the
-upper triangle. A combination of a discrete and continuous variable is
+upper triangle. A combination of discrete and continuous variables is
plotted as histograms in the lower triangle and boxplots in the upper
triangle. @fig-peng-ggpairs1 includes `sex` to illustrate the
combinations.
@@ -876,30 +892,30 @@ To my eye, printing the values of correlations in the upper triangle is
often a waste of graphic space. But this example shows something
peculiar and interesting if you look closely: In all pairs among the
penguin size measurements, there are positive correlations within each
-species, as we can see in @fig-peng-spm. Yet in three of these panels,
-the overall correlation ignoring species is negative. For example the
+species, as we can see in @fig-peng-spm. Yet, in three of these panels,
+the overall correlation ignoring species is negative. For example, the
overall correlation between bill depth and flipper length is
$r = -0.579$ in row 2, column 3; the scatterplot in the diagonally
opposite cell, row 3, column 2 shows the data. These cases, where the
overall correlation, ignoring a group variable, differs in sign from the
within-group correlations, are examples of **Simpson's Paradox**,
-explored later in Chapter XX. 
+explored later in Chapter XX.

The last row and column, for `sex` in @fig-peng-ggpairs1, provides an
-initial glance at the issue of sex differences among pengiun species
+initial glance at the issue of sex differences among penguin species
that motivated the collection of these data. We can go further by also
examining differences among species and island, but first we need to
understand how to display exactly what we want for each pairwise plot.

-`ggpairs()` is extremely general, in that for each of the `lower`,
+`ggpairs()` is extremely general in that for each of the `lower`,
`upper` and `diag` sections you can assign any of a large number of
-built-in functions (of the form `ggally_NAME`), or you own custom
+built-in functions (of the form `ggally_NAME`), or your own custom
function for what is plotted, depending on the types of variables in
each plot.

- `continuous`: both X and Y are continuous variables; supply this as
  the `NAME` part of a `ggally_NAME()` function or the name of a
-  custom function.
+  custom function;
- `combo`: one of the X and Y variables is discrete while the other is
  continuous, using the same convention;
- `discrete`: both X and Y are discrete variables.
@@ -926,7 +942,7 @@ discrete factors.
See the vignette,
[ggally_plots](https://ggobi.github.io/ggally/articles/ggally_plots.html)
for an illustrated list of available high-level plots. For our purpose
here, which is to illustrate enhanced displays, note that for
scatterplots of continuous variables, there are two functions which plot
the points and also add a smoother, `_lm` or `_loess`.
@@ -941,8 +957,8 @@ function that takes `data` and `mapping` arguments and returns a
`aes(color=species, alpha=0.5)`, but only if you wish to override what
is supplied in the `ggpairs()` call.

-Here is a function, `my_panel()` that plots the data points and
-regression line and loess smooth
+Here is a function, `my_panel()`, that plots the data points,
+regression line and loess smooth:

```{r}
#| label: my-panel
@@ -955,7 +971,7 @@ my_panel <- function(data, mapping, ...){
}
```

-For this example, I want to simple summaries of for the scatterplots, so
+For this example, I want only simple summaries for the scatterplots, so
I don't want to plot the data points, but do want to add the regression
line and a data ellipse.

```{r}
@@ -999,7 +1015,7 @@ ggpairs(peng, columns=c(3:6, 1, 2, 7),
There is certainly a lot going on in @fig-peng-ggpairs7, but it does
show a high-level overview of all the variables (except `year`) in the
penguins dataset.
-
+

## Parallel coordinate plots {#sec-parcoord}

As we have seen above, scatterplot matrices and generalized pairs plots
diff --git a/docs/04-pca-biplot.html b/docs/04-pca-biplot.html
index a0ece077..28849101 100644
--- a/docs/04-pca-biplot.html
+++ b/docs/04-pca-biplot.html
@@ -672,7 +672,7 @@

The factor \(\alpha\) allows the variances of the components to be apportioned between the row points and column vectors, with different interpretations, by representing the approximation \(\widehat{\mathbf{X}}\) as the product of two matrices,

\[ \widehat{\mathbf{X}} = (\mathbf{U} \mathbf{\Lambda}^\alpha) (\mathbf{\Lambda}^{1-\alpha} \mathbf{V}') = \mathbf{A} \mathbf{B}' -\] This notation uses a little math trick involving a power, \(0 \le \alpha \le 1\): When \(\alpha = 1\), \(\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 =\mathbf{\Lambda}\), and \(\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 =\mathbf{I}\). \(\alpha = 1/2\) gives the diagonal matrix \(\mathbf{\Lambda}^1/2\) whose elements are the square roots of the singular values.

+\]
This notation uses a little math trick involving a power, \(0 \le \alpha \le 1\): When \(\alpha = 1\), \(\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 =\mathbf{\Lambda}\), and \(\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 =\mathbf{I}\). \(\alpha = 1/2\) gives the diagonal matrix \(\mathbf{\Lambda}^{1/2}\) whose elements are the square roots of the singular values.

The choice \(\alpha = 1\) assigns the singular values totally to the left factor, giving a distance interpretation to the row display; \(\alpha = 0\) does the same for the column display. With \(\alpha = 0\), the angle between two variable vectors, reflecting the inner product \(\mathbf{x}_j^\mathsf{T} \mathbf{x}_{j'}\), approximates their covariance or correlation, and the distance between the row points then approximates their Mahalanobis distances. \(\alpha = 1/2\) gives a symmetrically scaled biplot. **TODO**: Explain this better.

When the singular values are assigned totally to the left or to the right factor, the resultant coordinates are called principal coordinates, and the sum of squared coordinates on each dimension equals the square of the corresponding singular value. The other matrix, to which no part of the singular values is assigned, contains the so-called standard coordinates, whose sum of squared values on each dimension equals 1.0.
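These relations are easy to verify numerically. A minimal sketch in base R, using `svd()` and the built-in `USArrests` data purely as a stand-in numeric matrix (this illustrates the scaling algebra, not the internals of any biplot package):

```{r}
#| eval: false
X <- scale(USArrests)                   # centered and standardized data
s <- svd(X)                             # X = U diag(d) V'
alpha <- 0                              # try 0, 1/2 and 1
A <- s$u %*% diag(s$d^alpha)            # row (observation) coordinates
B <- s$v %*% diag(s$d^(1 - alpha))      # column (variable) coordinates
max(abs(X - tcrossprod(A, B)))          # ~ 0 for any alpha: A B' = X
max(abs(crossprod(X) - tcrossprod(B)))  # ~ 0 for alpha = 0: B B' = X'X
colSums(A^2)                            # alpha = 0: unit SSQ (standard coordinates)
colSums((s$u %*% diag(s$d))^2) - s$d^2  # alpha = 1: SSQ = squared singular values
```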
