finish ch04 edits

friendly · Oct 13, 2024 · 1a41976
1 parent d21b468
Showing 6 changed files with 54 additions and 43 deletions.
diff --git a/03-multivariate_plots.qmd b/03-multivariate_plots.qmd
@@ -1909,6 +1909,9 @@ axes with appropriate aesthetics, labels for categorical factors and so
 forth. @fig-peng-ggpcp1 illustrates this type of display, using sex and
 species in addition to the quantitative variables for the penguin data.
 
+<!-- WARN: 03-multivariate_plots.html: Unable to resolve crossref @fig-peng-ggpcp1 
+     Does this have something to do with cache?
+--> 
 ```{r}
 #| label: fig-peng-ggpcp1
 #| code-fold: show

diff --git a/04-pca-biplot.qmd b/04-pca-biplot.qmd
@@ -1468,16 +1468,23 @@ vs. "unsupervised" settings, a variety of new algorithms have been proposed for
 task of finding low-D representations of high-D data. Among these,
 t-distributed Stochastic Neighbor Embedding (t-SNE) developed by @MaatenHinton2008
 is touted as a method for revealing
-local structure and clustering better in possibly complex high-D data and at different scales.
+_local structure_ and clustering better in possibly complex high-D data and at different scales.
+
+t-SNE differs from MDS in what it tries to preserve in the mapping to low-D space:
+Multidimensional scaling aims to preserve the distances between pairs of data points, focusing on pairs of _distant points_ in the original space. t-SNE, on the other hand, focuses on preserving _neighboring_ data points: points that are close in the original data space will be close together in the t-SNE embedding.
+
 
 <!-- Simpler description, from: https://www.displayr.com/using-t-sne-to-visualize-data-before-prediction/ -->
-<!-- The t-SNE algorithm models the probability distribution of neighbors around each point. Here, the term neighbors refers to the set of points which are closest to each point. In the original, high-dimensional space this is modeled as a Gaussian distribution. In the 2-dimensional output space this is modeled as a t-distribution. The goal of the procedure is to find a mapping onto the 2-dimensional space that minimizes the differences between these two distributions over all points. The fatter tails of a t-distribution compared to a Gaussian help to spread the points more evenly in the 2-dimensional space. -->
+
+"The t-SNE algorithm models the probability distribution of neighbors around each point. Here, the term _neighbors_ refers to the set of points which are closest to each point. In the original, high-dimensional space, this is modeled as a Gaussian distribution. In the 2-dimensional output space this is modeled as a $t$-distribution. The goal of the procedure is to find a mapping onto the 2-dimensional space that minimizes the differences between these two distributions over all points. The fatter tails of a $t$-distribution compared to a Gaussian help to spread the points more evenly in the 2-dimensional space."
+(Jake Hoare, [How t-SNE works and Dimensionality Reduction](https://www.displayr.com/using-t-sne-to-visualize-data-before-prediction/)).
+
 
 t-SNE also uses the idea of mapping distance measures into a low-D space,
 but converts Euclidean distances into conditional probabilities.
 Stochastic neighbor embedding means that t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.
 
-As van der Maaten and Hinton explained: "The similarity of datapoint $\mathbf{x}_{j}$ to datapoint  $\mathbf{x}_{i}$ is the conditional probability, $p_{j|i}$, that x i $\mathbf{x}_{i}$ would pick  $\mathbf{x}_{j}$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at $\mathbf{x}_{i}$." For $i \ne j$, they define
+As van der Maaten and Hinton explained: "The similarity of datapoint $\mathbf{x}_{j}$ to datapoint  $\mathbf{x}_{i}$ is the conditional probability, $p_{j|i}$, that $\mathbf{x}_{i}$ would pick  $\mathbf{x}_{j}$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at $\mathbf{x}_{i}$." For $i \ne j$, they define:
 
 $$
 p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)} \;. 
@@ -1706,7 +1713,7 @@ The plot shows the variables as widely dispersed. There is a collection at the l
 #| fig-width: 8
 #| fig-height: 8
 #| out-width: "70%"
-#| fig-cap: "Biplot of the `mtcars` data ..."
+#| fig-cap: "Biplot of the `mtcars` data. The order of the variables around the circle, starting from \"gear\" (say), arranges them so that the most similar variables are adjacent in graphical displays."
 mtcars.pca <- prcomp(mtcars, scale. = TRUE)
 ggbiplot(mtcars.pca,
          circle = TRUE,
@@ -1720,10 +1727,10 @@ In `corrplot()` principal component variable ordering is implemented using the `
 [package vignette](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html).
 
 
-@fig-mtcars-corrplot-pcaorder shows the result. A nice feature of `corrplot()` is the ability to manually highlight blocks of variables that have a similar pattern of signs by outlining them with
+@fig-mtcars-corrplot-pcaorder shows the result of ordering the variables by this method. 
+A nice feature of `corrplot()` is the ability to manually highlight blocks of variables that have a similar pattern of signs by outlining them with
 rectangles. From the biplot, the two main clusters of positively correlated variables seemed clear,
-and are outlined in the plot using `corrplot::corrRect()`. What became clear in the corrplot is that `qsec`, the time to drive a quarter-mile from a dead start didn't fit this pattern, so I highlighted it
-separately.
+and are outlined in the plot using `corrplot::corrRect()`. What became clear in the corrplot is that `qsec`, the time to drive a quarter-mile from a dead start, didn't quite fit this pattern, so I highlighted it separately.
 
 ```{r}
 #| label: fig-mtcars-corrplot-pcaorder
@@ -1771,7 +1778,7 @@ than pure prediction, it has the disadvantage that the components may be hard to
 An interesting class of problems have to do with image processing, where an image of size
 width $\times$ height in pixels can be represented by a $w \times h$ array of 
 greyscale values $x_{ij}$ in the range of [0, 1] or $h \times w \times 3$ array $x_{ijk}$
-of (red, green, blue) color values. For example a single $640 \times 640$ photo is comprised of about 400K
+of (`r red`, `r green`, `r blue`) color values. For example, a single $640 \times 640$ photo comprises about 400K
 pixel values in B/W and 1200K values in color.
 
 The uses here include

diff --git a/docs/03-multivariate_plots.html b/docs/03-multivariate_plots.html
@@ -1378,7 +1378,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
 </div>
 </div>
 </div>
-<p>In R, these diagrams can be created using the <strong>corrgram</strong> <span class="citation" data-cites="R-corrgram">(<a href="90-references.html#ref-R-corrgram" role="doc-biblioref">Wright, 2021</a>)</span> and <strong>corrplot</strong> <span class="citation" data-cites="R-corrplot">(<a href="90-references.html#ref-R-corrplot" role="doc-biblioref">Wei &amp; Simko, 2021</a>)</span> packages, with different features. <code><a href="http://kwstat.github.io/corrgram/reference/corrgram.html">corrgram::corrgram()</a></code> is closest to <span class="citation" data-cites="Friendly:02:corrgram">Friendly (<a href="90-references.html#ref-Friendly:02:corrgram" role="doc-biblioref">2002</a>)</span>, in that it allows different rendering functions for the lower, upper and diagonal panels as illustrated in <a href="#fig-corrgram-renderings" class="quarto-xref">Figure&nbsp;<span>3.27</span></a>. For example, a corrgram similar to <a href="#fig-crime-spm" class="quarto-xref">Figure&nbsp;<span>3.26</span></a> can be produced as follows (not shown here):</p>
+<p>In R, these diagrams can be created using the <strong>corrgram</strong> <span class="citation" data-cites="R-corrgram">(<a href="90-references.html#ref-R-corrgram" role="doc-biblioref">Wright, 2021</a>)</span> and <strong>corrplot</strong> <span class="citation" data-cites="R-corrplot">(<a href="90-references.html#ref-R-corrplot" role="doc-biblioref">Wei &amp; Simko, 2024</a>)</span> packages, with different features. <code><a href="http://kwstat.github.io/corrgram/reference/corrgram.html">corrgram::corrgram()</a></code> is closest to <span class="citation" data-cites="Friendly:02:corrgram">Friendly (<a href="90-references.html#ref-Friendly:02:corrgram" role="doc-biblioref">2002</a>)</span>, in that it allows different rendering functions for the lower, upper and diagonal panels as illustrated in <a href="#fig-corrgram-renderings" class="quarto-xref">Figure&nbsp;<span>3.27</span></a>. For example, a corrgram similar to <a href="#fig-crime-spm" class="quarto-xref">Figure&nbsp;<span>3.26</span></a> can be produced as follows (not shown here):</p>
 <div class="cell" data-layout-align="center">
 <div class="sourceCode" id="cb36" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="va">crime</span> <span class="op">|&gt;</span></span>
 <span>  <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html">select</a></span><span class="op">(</span><span class="fu"><a href="https://tidyselect.r-lib.org/reference/where.html">where</a></span><span class="op">(</span><span class="va">is.numeric</span><span class="op">)</span><span class="op">)</span> <span class="op">|&gt;</span></span>
@@ -1414,7 +1414,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
 <p><strong>TODO</strong>: Add example showing correlation ordering – e.g., <code>mtcars</code> data.<!--# How's something like the code I added below? --></p>
 </section></section><section id="sec-ggpairs" class="level2" data-number="3.4"><h2 data-number="3.4" class="anchored" data-anchor-id="sec-ggpairs">
 <span class="header-section-number">3.4</span> Generalized pairs plots</h2>
-<p>When a dataset contains one or more discrete variables, the traditional pairs plot cannot cope, using only color and/or point symbols to represent categorical variables. In the context of mosaic displays and loglinear models, representing <span class="math inline">\(n\)</span>-way frequency tables by rectangular tiles depicting cell frequencies, I <span class="citation" data-cites="Friendly:94a">(<a href="90-references.html#ref-Friendly:94a" role="doc-biblioref">Friendly, 1994</a>)</span> proposed an analog of the scatterplot matrix using mosaic plots for each pair of variables. The <strong>vcd</strong> package <span class="citation" data-cites="R-vcd">(<a href="90-references.html#ref-R-vcd" role="doc-biblioref">Meyer et al., 2023</a>)</span> implements very general <code><a href="https://rdrr.io/r/graphics/pairs.html">pairs()</a></code> methods for <code>"table"</code> objects. See my book <em>Discrete Data Analysis with R</em> <span class="citation" data-cites="FriendlyMeyer:2016:DDAR">(<a href="90-references.html#ref-FriendlyMeyer:2016:DDAR" role="doc-biblioref">Friendly &amp; Meyer, 2016</a>)</span> and the <strong>vcdExtra</strong> <span class="citation" data-cites="R-vcdExtra">(<a href="90-references.html#ref-R-vcdExtra" role="doc-biblioref">Friendly, 2023</a>)</span> package for mosaic plots and mosaic matrices.</p>
+<p>When a dataset contains one or more discrete variables, the traditional pairs plot cannot cope, using only color and/or point symbols to represent categorical variables. In the context of mosaic displays and loglinear models, representing <span class="math inline">\(n\)</span>-way frequency tables by rectangular tiles depicting cell frequencies, I <span class="citation" data-cites="Friendly:94a">(<a href="90-references.html#ref-Friendly:94a" role="doc-biblioref">Friendly, 1994</a>)</span> proposed an analog of the scatterplot matrix using mosaic plots for each pair of variables. The <strong>vcd</strong> package <span class="citation" data-cites="R-vcd">(<a href="90-references.html#ref-R-vcd" role="doc-biblioref">Meyer et al., 2024</a>)</span> implements very general <code><a href="https://rdrr.io/r/graphics/pairs.html">pairs()</a></code> methods for <code>"table"</code> objects. See my book <em>Discrete Data Analysis with R</em> <span class="citation" data-cites="FriendlyMeyer:2016:DDAR">(<a href="90-references.html#ref-FriendlyMeyer:2016:DDAR" role="doc-biblioref">Friendly &amp; Meyer, 2016</a>)</span> and the <strong>vcdExtra</strong> <span class="citation" data-cites="R-vcdExtra">(<a href="90-references.html#ref-R-vcdExtra" role="doc-biblioref">Friendly, 2023</a>)</span> package for mosaic plots and mosaic matrices.</p>
 <p>For example, we can tabulate the distributions of penguin species by sex and the island where they were observed using <code><a href="https://rdrr.io/r/stats/xtabs.html">xtabs()</a></code>. <code><a href="https://rdrr.io/r/stats/ftable.html">ftable()</a></code> prints this three-way table more compactly. (In this example, and what follows in the chapter, I’ve changed the labels for sex from (“f”, “m”) to (“Female”, “Male”)).</p>
 <div class="cell" data-layout-align="center">
 <div class="sourceCode" id="cb38" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="co"># use better labels for sex</span></span>
@@ -2236,7 +2236,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
 Harrison, P. (2023). Langevitour: Smooth interactive touring of high dimensions, demonstrated with scRNA-seq data. <em>The R Journal</em>, <em>15</em>(2), 206–219. <a href="https://doi.org/10.32614/RJ-2023-046">https://doi.org/10.32614/RJ-2023-046</a>
 </div>
 <div id="ref-R-langevitour" class="csl-entry" role="listitem">
-Harrison, P. (2024). <em>Langevin tour</em>. <a href="https://CRAN.R-project.org/package=langevitour">https://CRAN.R-project.org/package=langevitour</a>
+Harrison, P. (2024). <em>Langevitour: Langevin tour</em>. <a href="https://logarithmic.net/langevitour/">https://logarithmic.net/langevitour/</a>
 </div>
 <div id="ref-R-detourr" class="csl-entry" role="listitem">
 Hart, C., &amp; Wang, E. (2022). <em>Detourr: Portable and performant tour animations</em>. <a href="https://CRAN.R-project.org/package=detourr">https://CRAN.R-project.org/package=detourr</a>
@@ -2263,10 +2263,10 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
 Lee, E.-K., &amp; Cook, D. (2009). A projection pursuit index for large p small n data. <em>Statistics and Computing</em>, <em>20</em>(3), 381–392. <a href="https://doi.org/10.1007/s11222-009-9131-1">https://doi.org/10.1007/s11222-009-9131-1</a>
 </div>
 <div id="ref-R-liminal" class="csl-entry" role="listitem">
-Lee, S. (2021). <em>Liminal: Multivariate data visualization with tours and embeddings</em>. <a href="https://CRAN.R-project.org/package=liminal">https://CRAN.R-project.org/package=liminal</a>
+Lee, S. (2021). <em>Liminal: Multivariate data visualization with tours and embeddings</em>. <a href="https://github.com/sa-lee/liminal/">https://github.com/sa-lee/liminal/</a>
 </div>
 <div id="ref-R-vcd" class="csl-entry" role="listitem">
-Meyer, D., Zeileis, A., &amp; Hornik, K. (2023). <em>Vcd: Visualizing categorical data</em>. <a href="https://CRAN.R-project.org/package=vcd">https://CRAN.R-project.org/package=vcd</a>
+Meyer, D., Zeileis, A., Hornik, K., &amp; Friendly, M. (2024). <em>Vcd: Visualizing categorical data</em>. <a href="https://CRAN.R-project.org/package=vcd">https://CRAN.R-project.org/package=vcd</a>
 </div>
 <div id="ref-Monette:90" class="csl-entry" role="listitem">
 Monette, G. (1990). Geometry of multiple regression and interactive 3-<span>D</span> graphics. In J. Fox &amp; S. Long (Eds.), <em>Modern methods of data analysis</em> (pp. 209–256). SAGE Publications.
@@ -2311,7 +2311,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
 Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. <em>Journal of the American Statistical Association</em>, <em>85</em>(411), 664–675.
 </div>
 <div id="ref-R-corrplot" class="csl-entry" role="listitem">
-Wei, T., &amp; Simko, V. (2021). <em>Corrplot: Visualization of a correlation matrix</em>. <a href="https://github.com/taiyun/corrplot">https://github.com/taiyun/corrplot</a>
+Wei, T., &amp; Simko, V. (2024). <em>Corrplot: Visualization of a correlation matrix</em>. <a href="https://github.com/taiyun/corrplot">https://github.com/taiyun/corrplot</a>
 </div>
 <div id="ref-R-tourr" class="csl-entry" role="listitem">
 Wickham, H., &amp; Cook, D. (2024). <em>Tourr: Tour methods for multivariate data visualisation</em>. <a href="https://github.com/ggobi/tourr">https://github.com/ggobi/tourr</a>