iris
data, initially in data space and its’ data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions.4.2.1 PCA by springs
@@ -400,12 +408,14 @@
4.2.3 Finding principal components
In R, principal components analysis is most easily carried out using stats::prcomp()
or stats::princomp()
or similar functions in other packages such as FactomineR::PCA()
. The FactoMineR package (Husson et al. 2023) has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis, …).
-Unfortunately, although all of these performing similar calculations, the options for analysis and the details of the result they return differ …
+Unfortunately, although all of these performing similar calculations, the options for analysis and the details of the result they return differ.
The important options for analysis include:
-- whether or not the data variables are centered, to a mean of 0
-- whether or not the data variables are scaled, to a variance of 1.
+- whether or not the data variables are centered, to a mean of \(\bar{x}_j =0\)
+
+- whether or not the data variables are scaled, to a variance of \(\text{Var}(x_j) =1\).
+It nearly always makes sense to center the variables. The choice of scaling determines whether the correlation matrix is analyzed, so that each variable contributes equally to the total variance that is to be accounted for versus analysis of the covariance matrix, where each variable contributes its own variance to the total. Analysis of the covariance matrix makes little sense when the variables are measured on different scales.2
Example: Crime data
The dataset crime
, analysed in Section 3.2.2, showed all positive correlations among the rates of various crimes in the corrgram, Figure 3.27. What can we see from a principal components analysis? Is it possible that a few dimensions can account for most of the juice in this data?
In this example, you can easily find the PCA solution using prcomp()
in a single line in base-R. You need to specify the numeric variables to analyze by their columns in the data frame. The most important option here is scale. = TRUE
…
@@ -519,7 +529,7 @@ #> # .fittedPC4 <dbl>, .fittedPC5 <dbl>, .fittedPC6 <dbl>,
#> # .fittedPC7 <dbl>
4.2.3 Finding principal components
In R, principal components analysis is most easily carried out using stats::prcomp()
or stats::princomp()
or similar functions in other packages such as FactomineR::PCA()
. The FactoMineR package (Husson et al. 2023) has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis, …).
Unfortunately, although all of these performing similar calculations, the options for analysis and the details of the result they return differ …
+Unfortunately, although all of these performing similar calculations, the options for analysis and the details of the result they return differ.
The important options for analysis include:
-
-
- whether or not the data variables are centered, to a mean of 0 -
- whether or not the data variables are scaled, to a variance of 1. +
- whether or not the data variables are centered, to a mean of \(\bar{x}_j =0\) + +
- whether or not the data variables are scaled, to a variance of \(\text{Var}(x_j) =1\).
It nearly always makes sense to center the variables. The choice of scaling determines whether the correlation matrix is analyzed, so that each variable contributes equally to the total variance that is to be accounted for versus analysis of the covariance matrix, where each variable contributes its own variance to the total. Analysis of the covariance matrix makes little sense when the variables are measured on different scales.2
Example: Crime data
The dataset crime
, analysed in Section 3.2.2, showed all positive correlations among the rates of various crimes in the corrgram, Figure 3.27. What can we see from a principal components analysis? Is it possible that a few dimensions can account for most of the juice in this data?
In this example, you can easily find the PCA solution using prcomp()
in a single line in base-R. You need to specify the numeric variables to analyze by their columns in the data frame. The most important option here is scale. = TRUE
…