finish ch04 edits
friendly committed Oct 13, 2024
1 parent d21b468 commit 1a41976
Showing 6 changed files with 54 additions and 43 deletions.
3 changes: 3 additions & 0 deletions 03-multivariate_plots.qmd
@@ -1909,6 +1909,9 @@ axes with appropriate aesthetics, labels for categorical factors and so
forth. @fig-peng-ggpcp1 illustrates this type of display, using sex and
species in addition to the quantitative variables for the penguin data.
<!-- WARN: 03-multivariate_plots.html: Unable to resolve crossref @fig-peng-ggpcp1
Does this have something to do with cache?
-->
```{r}
#| label: fig-peng-ggpcp1
#| code-fold: show
23 changes: 15 additions & 8 deletions 04-pca-biplot.qmd
@@ -1468,16 +1468,23 @@ vs. "unsupervised" settings, a variety of new algorithms have been proposed for
task of finding low-D representations of high-D data. Among these,
t-distributed Stochastic Neighbor Embedding (t-SNE) developed by @MaatenHinton2008
is touted as a method for revealing
local structure and clustering better in possibly complex high-D data and at different scales.
_local structure_ and clustering better in possibly complex high-D data and at different scales.
t-SNE differs from MDS in what it tries to preserve in the mapping to low-D space:
Multidimensional scaling aims to preserve the distances between pairs of data points, focusing on pairs of _distant points_ in the original space. t-SNE, on the other hand, focuses on preserving _neighboring_ data points. Data points that are close in the original data space will lie close together in the t-SNE embedding.
<!-- Simpler description, from: https://www.displayr.com/using-t-sne-to-visualize-data-before-prediction/ -->
<!-- The t-SNE algorithm models the probability distribution of neighbors around each point. Here, the term neighbors refers to the set of points which are closest to each point. In the original, high-dimensional space this is modeled as a Gaussian distribution. In the 2-dimensional output space this is modeled as a t-distribution. The goal of the procedure is to find a mapping onto the 2-dimensional space that minimizes the differences between these two distributions over all points. The fatter tails of a t-distribution compared to a Gaussian help to spread the points more evenly in the 2-dimensional space. -->
"The t-SNE algorithm models the probability distribution of neighbors around each point. Here, the term _neighbors_ refers to the set of points which are closest to each point. In the original, high-dimensional space, this is modeled as a Gaussian distribution. In the 2-dimensional output space this is modeled as a $t$-distribution. The goal of the procedure is to find a mapping onto the 2-dimensional space that minimizes the differences between these two distributions over all points. The fatter tails of a $t$-distribution compared to a Gaussian help to spread the points more evenly in the 2-dimensional space."
(Jake Hoare, [How t-SNE works and Dimensionality Reduction](https://www.displayr.com/using-t-sne-to-visualize-data-before-prediction/)).
t-SNE also uses the idea of mapping distance measures into a low-D space,
but converts Euclidean distances into conditional probabilities.
Stochastic neighbor embedding means that t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.
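As a rough illustration of this conversion, the sketch below turns squared Euclidean distances into row-normalized Gaussian similarities, so that entry $(i, j)$ plays the role of $p_{j|i}$. The toy data, the helper name `cond_prob()` and the use of a single common bandwidth `sigma` are assumptions made for illustration only; t-SNE itself tunes a separate $\sigma_i$ for each point to match a chosen perplexity.

```r
# Toy data: 5 points in 3 dimensions (made-up values, for illustration only)
set.seed(1)
X <- matrix(rnorm(15), nrow = 5)

# Squared Euclidean distances between all pairs of rows
D2 <- as.matrix(dist(X))^2

# Hypothetical helper: conditional probabilities p_{j|i} in row i, column j.
# ASSUMPTION: one common sigma for all points, rather than a per-point sigma_i.
cond_prob <- function(D2, sigma = 1) {
  K <- exp(-D2 / (2 * sigma^2))  # Gaussian kernel on squared distances
  diag(K) <- 0                   # a point is never its own neighbor (k != i)
  K / rowSums(K)                 # divide row i by its sum, so each row sums to 1
}

P <- cond_prob(D2)
round(P, 3)   # nearer points get larger probabilities within each row
rowSums(P)    # each row is a probability distribution over the other points
```

Note that `P` is not symmetric in general: $p_{j|i}$ and $p_{i|j}$ are normalized over different rows.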
As van der Maaten and Hinton explained: "The similarity of datapoint $\mathbf{x}_{j}$ to datapoint $\mathbf{x}_{i}$ is the conditional probability, $p_{j|i}$, that x i $\mathbf{x}_{i}$ would pick $\mathbf{x}_{j}$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at $\mathbf{x}_{i}$." For $i \ne j$, they define
As van der Maaten and Hinton explained: "The similarity of datapoint $\mathbf{x}_{j}$ to datapoint $\mathbf{x}_{i}$ is the conditional probability, $p_{j|i}$, that $\mathbf{x}_{i}$ would pick $\mathbf{x}_{j}$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at $\mathbf{x}_{i}$." For $i \ne j$, they define:
$$
p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert\mathbf{x}_i - \mathbf{x}_k\rVert^2 / 2\sigma_i^2)} \;.
@@ -1706,7 +1713,7 @@ The plot shows the variables as widely dispersed. There is a collection at the l
#| fig-width: 8
#| fig-height: 8
#| out-width: "70%"
#| fig-cap: "Biplot of the `mtcars` data ..."
#| fig-cap: "Biplot of the `mtcars` data. The order of the variables around the circle, starting from \"gear\" (say), arranges them so that the most similar variables are adjacent in graphical displays."
mtcars.pca <- prcomp(mtcars, scale. = TRUE)
ggbiplot(mtcars.pca,
circle = TRUE,
@@ -1720,10 +1727,10 @@ In `corrplot()` principal component variable ordering is implemented using the `
[package vignette](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html).
@fig-mtcars-corrplot-pcaorder shows the result. A nice feature of `corrplot()` is the ability to manually highlight blocks of variables that have a similar pattern of signs by outlining them with
@fig-mtcars-corrplot-pcaorder shows the result of ordering the variables by this method.
A nice feature of `corrplot()` is the ability to manually highlight blocks of variables that have a similar pattern of signs by outlining them with
rectangles. From the biplot, the two main clusters of positively correlated variables seemed clear,
and are outlined in the plot using `corrplot::corrRect()`. What became clear in the corrplot is that `qsec`, the time to drive a quarter-mile from a dead start didn't fit this pattern, so I highlighted it
separately.
and are outlined in the plot using `corrplot::corrRect()`. What became clear in the corrplot is that `qsec`, the time to drive a quarter-mile from a dead start, didn't quite fit this pattern, so I highlighted it separately.
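The idea behind ordering variables by their positions around the biplot circle can be sketched in base R. This is only a minimal illustration, assuming the angle of each variable is taken from the first two eigenvectors of the correlation matrix with `atan2()`; the starting point on the circle is arbitrary, and this is not necessarily the exact rule that `corrplot()` applies.

```r
# Correlation matrix of the mtcars variables
R <- cor(mtcars)

# First two eigenvectors: coordinates of the variables in the PC1-PC2 plane
e <- eigen(R)$vectors[, 1:2]

# Angle of each variable vector around the circle, in (-pi, pi]
angle <- atan2(e[, 2], e[, 1])

# Sort variables by angle, so similar variables end up adjacent
ord <- order(angle)
colnames(R)[ord]

# Reordered correlation matrix, ready to pass to a plotting function
R_ordered <- R[ord, ord]
```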
```{r}
#| label: fig-mtcars-corrplot-pcaorder
@@ -1771,7 +1778,7 @@ than pure prediction, it has the disadvantage that the components may be hard to
An interesting class of problems have to do with image processing, where an image of size
width $\times$ height in pixels can be represented by a $w \times h$ array of
greyscale values $x_{ij}$ in the range of [0, 1] or $h \times w \times 3$ array $x_{ijk}$
of (red, green, blue) color values. For example a single $640 \times 640$ photo is comprised of about 400K
of (`r red`, `r green`, `r blue`) color values. For example a single $640 \times 640$ photo is comprised of about 400K
pixels in B/W and about 1200K values (three per pixel) in color.
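The sizes quoted above can be checked with a small sketch. The arrays are filled with random values standing in for a real image, and storing the greyscale image with rows as height is an assumption made here for illustration.

```r
w <- 640   # image width in pixels
h <- 640   # image height in pixels

# Greyscale image: one value in [0, 1] per pixel (random stand-in data)
grey <- matrix(runif(h * w), nrow = h, ncol = w)

# Color image: an h x w x 3 array, one layer each for red, green, blue
rgb_img <- array(runif(h * w * 3), dim = c(h, w, 3))

length(grey)      # 409600, i.e. about 400K
length(rgb_img)   # 1228800, i.e. about 1200K
```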
The uses here include
12 changes: 6 additions & 6 deletions docs/03-multivariate_plots.html
@@ -1378,7 +1378,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
</div>
</div>
</div>
<p>In R, these diagrams can be created using the <strong>corrgram</strong> <span class="citation" data-cites="R-corrgram">(<a href="90-references.html#ref-R-corrgram" role="doc-biblioref">Wright, 2021</a>)</span> and <strong>corrplot</strong> <span class="citation" data-cites="R-corrplot">(<a href="90-references.html#ref-R-corrplot" role="doc-biblioref">Wei &amp; Simko, 2021</a>)</span> packages, with different features. <code><a href="http://kwstat.github.io/corrgram/reference/corrgram.html">corrgram::corrgram()</a></code> is closest to <span class="citation" data-cites="Friendly:02:corrgram">Friendly (<a href="90-references.html#ref-Friendly:02:corrgram" role="doc-biblioref">2002</a>)</span>, in that it allows different rendering functions for the lower, upper and diagonal panels as illustrated in <a href="#fig-corrgram-renderings" class="quarto-xref">Figure&nbsp;<span>3.27</span></a>. For example, a corrgram similar to <a href="#fig-crime-spm" class="quarto-xref">Figure&nbsp;<span>3.26</span></a> can be produced as follows (not shown here):</p>
<p>In R, these diagrams can be created using the <strong>corrgram</strong> <span class="citation" data-cites="R-corrgram">(<a href="90-references.html#ref-R-corrgram" role="doc-biblioref">Wright, 2021</a>)</span> and <strong>corrplot</strong> <span class="citation" data-cites="R-corrplot">(<a href="90-references.html#ref-R-corrplot" role="doc-biblioref">Wei &amp; Simko, 2024</a>)</span> packages, with different features. <code><a href="http://kwstat.github.io/corrgram/reference/corrgram.html">corrgram::corrgram()</a></code> is closest to <span class="citation" data-cites="Friendly:02:corrgram">Friendly (<a href="90-references.html#ref-Friendly:02:corrgram" role="doc-biblioref">2002</a>)</span>, in that it allows different rendering functions for the lower, upper and diagonal panels as illustrated in <a href="#fig-corrgram-renderings" class="quarto-xref">Figure&nbsp;<span>3.27</span></a>. For example, a corrgram similar to <a href="#fig-crime-spm" class="quarto-xref">Figure&nbsp;<span>3.26</span></a> can be produced as follows (not shown here):</p>
<div class="cell" data-layout-align="center">
<div class="sourceCode" id="cb36" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="va">crime</span> <span class="op">|&gt;</span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/select.html">select</a></span><span class="op">(</span><span class="fu"><a href="https://tidyselect.r-lib.org/reference/where.html">where</a></span><span class="op">(</span><span class="va">is.numeric</span><span class="op">)</span><span class="op">)</span> <span class="op">|&gt;</span></span>
@@ -1414,7 +1414,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
<p><strong>TODO</strong>: Add example showing correlation ordering – e.g., <code>mtcars</code> data.<!--# How's something like the code I added below? --></p>
</section></section><section id="sec-ggpairs" class="level2" data-number="3.4"><h2 data-number="3.4" class="anchored" data-anchor-id="sec-ggpairs">
<span class="header-section-number">3.4</span> Generalized pairs plots</h2>
<p>When a dataset contains one or more discrete variables, the traditional pairs plot cannot cope, using only color and/or point symbols to represent categorical variables. In the context of mosaic displays and loglinear models, representing <span class="math inline">\(n\)</span>-way frequency tables by rectangular tiles depicting cell frequencies, I <span class="citation" data-cites="Friendly:94a">(<a href="90-references.html#ref-Friendly:94a" role="doc-biblioref">Friendly, 1994</a>)</span> proposed an analog of the scatterplot matrix using mosaic plots for each pair of variables. The <strong>vcd</strong> package <span class="citation" data-cites="R-vcd">(<a href="90-references.html#ref-R-vcd" role="doc-biblioref">Meyer et al., 2023</a>)</span> implements very general <code><a href="https://rdrr.io/r/graphics/pairs.html">pairs()</a></code> methods for <code>"table"</code> objects. See my book <em>Discrete Data Analysis with R</em> <span class="citation" data-cites="FriendlyMeyer:2016:DDAR">(<a href="90-references.html#ref-FriendlyMeyer:2016:DDAR" role="doc-biblioref">Friendly &amp; Meyer, 2016</a>)</span> and the <strong>vcdExtra</strong> <span class="citation" data-cites="R-vcdExtra">(<a href="90-references.html#ref-R-vcdExtra" role="doc-biblioref">Friendly, 2023</a>)</span> package for mosaic plots and mosaic matrices.</p>
<p>When a dataset contains one or more discrete variables, the traditional pairs plot cannot cope, using only color and/or point symbols to represent categorical variables. In the context of mosaic displays and loglinear models, representing <span class="math inline">\(n\)</span>-way frequency tables by rectangular tiles depicting cell frequencies, I <span class="citation" data-cites="Friendly:94a">(<a href="90-references.html#ref-Friendly:94a" role="doc-biblioref">Friendly, 1994</a>)</span> proposed an analog of the scatterplot matrix using mosaic plots for each pair of variables. The <strong>vcd</strong> package <span class="citation" data-cites="R-vcd">(<a href="90-references.html#ref-R-vcd" role="doc-biblioref">Meyer et al., 2024</a>)</span> implements very general <code><a href="https://rdrr.io/r/graphics/pairs.html">pairs()</a></code> methods for <code>"table"</code> objects. See my book <em>Discrete Data Analysis with R</em> <span class="citation" data-cites="FriendlyMeyer:2016:DDAR">(<a href="90-references.html#ref-FriendlyMeyer:2016:DDAR" role="doc-biblioref">Friendly &amp; Meyer, 2016</a>)</span> and the <strong>vcdExtra</strong> <span class="citation" data-cites="R-vcdExtra">(<a href="90-references.html#ref-R-vcdExtra" role="doc-biblioref">Friendly, 2023</a>)</span> package for mosaic plots and mosaic matrices.</p>
<p>For example, we can tabulate the distributions of penguin species by sex and the island where they were observed using <code><a href="https://rdrr.io/r/stats/xtabs.html">xtabs()</a></code>. <code><a href="https://rdrr.io/r/stats/ftable.html">ftable()</a></code> prints this three-way table more compactly. (In this example, and what follows in the chapter, I’ve changed the labels for sex from (“f”, “m”) to (“Female”, “Male”)).</p>
<div class="cell" data-layout-align="center">
<div class="sourceCode" id="cb38" data-source-line-numbers="nil" data-code-line-numbers="nil"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="co"># use better labels for sex</span></span>
@@ -2236,7 +2236,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
Harrison, P. (2023). Langevitour: Smooth interactive touring of high dimensions, demonstrated with scRNA-seq data. <em>The R Journal</em>, <em>15</em>(2), 206–219. <a href="https://doi.org/10.32614/RJ-2023-046">https://doi.org/10.32614/RJ-2023-046</a>
</div>
<div id="ref-R-langevitour" class="csl-entry" role="listitem">
Harrison, P. (2024). <em>Langevin tour</em>. <a href="https://CRAN.R-project.org/package=langevitour">https://CRAN.R-project.org/package=langevitour</a>
Harrison, P. (2024). <em>Langevitour: Langevin tour</em>. <a href="https://logarithmic.net/langevitour/">https://logarithmic.net/langevitour/</a>
</div>
<div id="ref-R-detourr" class="csl-entry" role="listitem">
Hart, C., &amp; Wang, E. (2022). <em>Detourr: Portable and performant tour animations</em>. <a href="https://CRAN.R-project.org/package=detourr">https://CRAN.R-project.org/package=detourr</a>
@@ -2263,10 +2263,10 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
Lee, E.-K., &amp; Cook, D. (2009). A projection pursuit index for large p small n data. <em>Statistics and Computing</em>, <em>20</em>(3), 381–392. <a href="https://doi.org/10.1007/s11222-009-9131-1">https://doi.org/10.1007/s11222-009-9131-1</a>
</div>
<div id="ref-R-liminal" class="csl-entry" role="listitem">
Lee, S. (2021). <em>Liminal: Multivariate data visualization with tours and embeddings</em>. <a href="https://CRAN.R-project.org/package=liminal">https://CRAN.R-project.org/package=liminal</a>
Lee, S. (2021). <em>Liminal: Multivariate data visualization with tours and embeddings</em>. <a href="https://github.com/sa-lee/liminal/">https://github.com/sa-lee/liminal/</a>
</div>
<div id="ref-R-vcd" class="csl-entry" role="listitem">
Meyer, D., Zeileis, A., &amp; Hornik, K. (2023). <em>Vcd: Visualizing categorical data</em>. <a href="https://CRAN.R-project.org/package=vcd">https://CRAN.R-project.org/package=vcd</a>
Meyer, D., Zeileis, A., Hornik, K., &amp; Friendly, M. (2024). <em>Vcd: Visualizing categorical data</em>. <a href="https://CRAN.R-project.org/package=vcd">https://CRAN.R-project.org/package=vcd</a>
</div>
<div id="ref-Monette:90" class="csl-entry" role="listitem">
Monette, G. (1990). Geometry of multiple regression and interactive 3-<span>D</span> graphics. In J. Fox &amp; S. Long (Eds.), <em>Modern methods of data analysis</em> (pp. 209–256). SAGE Publications.
@@ -2311,7 +2311,7 @@ <h1 class="title"><span id="sec-multivariate_plots" class="quarto-section-identi
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. <em>Journal of the American Statistical Association</em>, <em>85</em>(411), 664–675.
</div>
<div id="ref-R-corrplot" class="csl-entry" role="listitem">
Wei, T., &amp; Simko, V. (2021). <em>Corrplot: Visualization of a correlation matrix</em>. <a href="https://github.com/taiyun/corrplot">https://github.com/taiyun/corrplot</a>
Wei, T., &amp; Simko, V. (2024). <em>Corrplot: Visualization of a correlation matrix</em>. <a href="https://github.com/taiyun/corrplot">https://github.com/taiyun/corrplot</a>
</div>
<div id="ref-R-tourr" class="csl-entry" role="listitem">
Wickham, H., &amp; Cook, D. (2024). <em>Tourr: Tour methods for multivariate data visualisation</em>. <a href="https://github.com/ggobi/tourr">https://github.com/ggobi/tourr</a>