rebuild work on Ch 03

friendly · Nov 15, 2023 · c0639d1 · c0639d1
1 parent 9dcde3b
commit c0639d1
Show file tree

Hide file tree

Showing 18 changed files with 200 additions and 51 deletions.
diff --git a/03-multivariate_plots.qmd b/03-multivariate_plots.qmd
@@ -399,7 +399,11 @@ The most basic version is provided by `pairs()` in base R. When one variable
 is considered as an outcome or response, it is usually helpful to put this
 in the first row and column. For the `Prestige` data, in addition to
 income and education, we also have a measure of % women in each occupational category.
-Plotting these together gives @fig-prestige-pairs.
+
+Plotting these together gives @fig-prestige-pairs. In such plots, the diagonal cells
+give labels for the variables, but they are also a guide to interpreting what is shown.
+In each row, say row 2 for `income`, income is the vertical $y$ variable in plots against other
+variables. In each column, say column 3 for `education`, education is the horizonal $x$ variable.
 
 ```{r}
 #| label: fig-prestige-pairs
@@ -413,14 +417,23 @@ pairs(~ prestige + income + education + women,
 
 The plots in the first row show what we have seen before for the relations between prestige and income and education, adding to those the plot of prestige vs. % women. Plots in the first column show the same data, but with $x$ and $y$ interchanged. 
 
-A more feature-rich version is provided by `car::scatterplotMatrix()` which
+But this basic `pairs()` plot is very limited. A more feature-rich version is provided by `car::scatterplotMatrix()` which
 can add the regression lines, loess smooths and data ellipses for each
-pair. The diagonal panels show density curves for the distribution of each
-variable; for example, the distribution of education appears to be multi-modal
+pair, as shown in @fig-prestige-spm1.
+
+The diagonal panels show density curves for the distribution of each
+variable; for example, the distribution of `education` appears to be multi-modal
 and that of `women` shows that most of the occupations have a low percentage
-of women. The combination of the regression line with the loess smoothed
+of women. 
+
+The combination of the regression line with the loess smoothed
 curve, but without their confidence envelopes, provides about the right
 amount of detail to take in at a glance where the relations are non-linear.
+We've already seen (@fig-Prestige-scatterplot-income1)
+the non-linear relation between prestige and income
+(row 1, column 2) when occupational type is ignored. 
+But all relations with income in column 2 are non-linear, reinforcing
+our idea (@sec-log-scale) that effects of income should be assessed on a log scale.
 
 ```{r}
 #| label: fig-prestige-spm1
@@ -497,7 +510,7 @@ have similar size and shape in most of the plots.
 From the boxplots, we can also see that `r colorize("Adelie", "orange")` penguins
 have shorter bill lengths than the others, while `r colorize("Gentoo", "green")`
 penguins have smaller bill depth, but longer flippers and are heavier
-than r colorize("Chinstrap", "purple")` and `r colorize("Adelie", "orange")`
+than `r colorize("Chinstrap", "purple")` and `r colorize("Adelie", "orange")`
 penguins. 
 
 
@@ -517,7 +530,7 @@ peng.mod <- lm(cbind(bill_length, bill_depth, flipper_length, body_mass) ~ speci
 
 Hypothesis-error (HE) plots, described in @sec-vis-mlm provide a better summary of the evidence for the MANOVA test of differences among means on all variables together. These give an 
 $\mathbf{H}$ ellipse reflecting the differences among means, to be compared with
-an $\mathbf{E}$ ellipse reflecting within-group variation.
+an $\mathbf{E}$ ellipse reflecting within-group variation and a visual test of significance
 
 A related question is "how well are the penguin species distinguished by these body size measures?"
 Here, the relevant model is linear discriminant analysis (LDA), where `species` plays the role of the response in the model,
@@ -547,6 +560,47 @@ _Discrete Data Analysis with R_ [@FriendlyMeyer:2016:DDAR] and the **vcdExtra**
 package for mosaic plots and mosaic matrices.
 
 **TODO** Mosaic matrix of species, island & sex.
+For example, we can tabulate the distributions of penguin species by sex and the island
+where they were observed using `xtabs()`. `ftable()` prints this three-way table more compactly.
+
+```{r peng-table}
+# use better labels for sex
+peng <- peng |>
+  mutate(sex = factor(sex, labels = c("Female", "Male")))
+peng.table <- xtabs(~ species + sex + island, data = peng)
+
+ftable(peng.table)
+```
+
+We can see immediately that the penguin species differ by island:
+only Adelie were observed on all three islands; Biscoe Island had no Chinstraps
+and Dream Island had no Gentoos.
+
+`vcd::pairs()` produces all pairwise mosaic plots, as shown in @fig-peng-mosaic.
+The diagonal panels show the one-way
+frequencies by width of the divided bars. Each off-diagonal panel shows the bivariate
+counts, breaking down each column variable by splitting the bars in proportion to a
+second variable. Consequently, the frequency of each cell is represented by its' area.
+The purpose is to show the pattern of association between each pair,
+and so, the tiles in the mosaic are shaded according to the signed residual
+in a simple $\chi^2$ test for association--- `r colorize("blue", "blue")`
+where the observed frequency is greater than expected under independence, and
+`r colorize("red", "red")` where it is less than expected.
+
+
+```{r}
+#| label: fig-peng-mosaic
+#| fig-width: 7
+#| fig-height: 7
+#| out-width: "100%"
+#| fig-cap: "Mosaic pairs plot for the combinations of species, sex and island. ..."
+library(vcd)
+pairs(peng.table, shade = TRUE)
+```
+
+The shading patterns in cells (1,3) and (3,1) of @fig-peng-mosaic show what we've seen before in the table of frequencies: The distribution of species varies across island. Row 2 and column 2
+show that there is sex is roughly proportional among species and islands.
+
 
 The next step, by John Emerson and others [@Emerson-etal:2013] was to recognize
 that combinations of continuous and discrete, categorical variables could be plotted

diff --git a/R/common.R b/R/common.R
@@ -104,6 +104,7 @@ $\\newcommand*{\\diag}[1]{\\ensuremath{\\mathrm{diag}\\, #1}}$
   "lattice",
   "datasauRus",
   "vcd",
+  "vcdExtra", 
   "palmerpenguins")
 
 # write list of packages used at end of every chapter

diff --git a/R/penguin/peng-ggplot.R b/R/penguin/peng-ggplot.R
@@ -10,7 +10,7 @@ library(ggdensity)
 library(patchwork)
 
 load(here::here("data", "peng.RData"))
-str(peng)
+#str(peng)
 
 source("R/penguin-colors.R")
 

diff --git a/R/penguin/peng-mosaic.R b/R/penguin/peng-mosaic.R
@@ -0,0 +1,18 @@
+#' ---
+#' title: Penguins data, mosaic plots
+#' ---
+
+load(here::here("data", "peng.RData"))
+source("R/penguin-colors.R")
+
+library(vcdExtra)
+
+# change levels of sex
+peng <- peng |>
+  mutate(sex = factor(sex, labels = c("Female", "Male")))
+
+peng.table <- xtabs(~ species + sex + island, data = peng)
+
+ftable(peng.table)
+
+pairs(peng.table, shade = TRUE)
diff --git a/R/penguin/penguin-colors.R b/R/penguin/penguin-colors.R
@@ -45,7 +45,7 @@ theme_penguins <- function(shade=c("medium", "light", "dark"), ...) {
   shade = match.arg(shade)
   list(scale_color_penguins(shade=shade),
        scale_fill_penguins(shade=shade)
-  )
+      )
 }
 
 

diff --git a/bib/pkgs.bib b/bib/pkgs.bib
@@ -350,6 +350,14 @@ @Manual{R-tidyverse
 }
 
 
+@Manual{R-vcdExtra,
+  title = {vcdExtra: vcd Extensions and Additions},
+  author = {Michael Friendly},
+  year = {2023},
+  note = {R package version 0.8-5},
+  url = {https://friendly.github.io/vcdExtra/},
+}
+
 @Manual{R-VisCollin,
   title = {VisCollin: Visualizing Collinearity Diagnostics},
   author = {Michael Friendly},

diff --git a/bib/pkgs.txt b/bib/pkgs.txt
@@ -34,11 +34,13 @@ ggdensity
 ggplot2
 graphics
 grDevices
+grid
 methods
 patchwork
 stats
 tidyr
 utils
+vcd
 base
 bayestestR
 car

diff --git a/child/03-data-ellipse.qmd b/child/03-data-ellipse.qmd
@@ -272,7 +272,8 @@ scatterplot(prestige ~ education, data=Prestige,
 ```
 
 
-#### Plotting on a log scale
+#### Plotting on a log scale {#sec-log-scale}
+
 A typical remedy for the non-linear relationship of income to prestige is to plot income on a log scale. This usually makes sense, and expresses a belief that a **multiple** of
 or **percentage increase** in income has a constant impact on prestige, as opposed to
 the **additive** interpretation for income itself.
@@ -360,13 +361,14 @@ higher occupational category. A different plot entails a different story.
 #| echo: false
 #| fig-align: center
 #| out-width: "80%"
-#| fig-cap: "Penguin species observed in the Palmer Archipelago. Image: Allison Horst"
+#| fig-cap: "Penguin species observed in the Palmer Archipelago. This is a cartoon, but it illustrates some features of penguin body size measurements, and the colors typically used for species.  Image: Allison Horst"
 knitr::include_graphics("images/penguins-horst.png")
 ```
 
 The `penguins` dataset from the **palmerpenguins** package [@R-palmerpenguins] provides further instructive examples of plots and analyses of multivariate data. The data consists of measurements of body size 
 (flipper length, body mass, bill length and depth)
 of 344 penguins collected at the [Palmer Research Station](https://pallter.marine.rutgers.edu/) in Antartica.
+
 There were three different species of penguins (Adélie, Chinstrap & Gentoo)
 collected from 3 islands in the Palmer Archipelago
 between 2007--2009 [@Gorman2014]. The purpose was to examine differences in size or appearance of these species,
@@ -398,9 +400,11 @@ use `ggplot2` for these plots.
 
 I'll be using the penguins data quite a lot, so it is useful to set up custom colors like those
 used in @fig-penguin-species. These are shades of:
-Adelie: `r colorize("orange", "orange")`, 
-Chinstrap: `r colorize("purple", "purple")`, and
-Gentoo: `r colorize("green", "green")`. To use these in `ggplot2` I define a function
+* `r colorize("Adelie", "orange")`: `r colorize("orange", "orange")`, 
+* `r colorize("Chinstrap", "purple")`: `r colorize("purple", "purple")`, and
+* Gentoo: `r colorize("green", "green")`. 
+
+To use these in `ggplot2` I define a function
 `peng.colors()` that allows shades of light, medium and dark and then functions
 `scale_*_penguins()` for color and fill.
 
@@ -441,13 +445,16 @@ scale_colour_penguins <- function(shade=c("medium", "light", "dark"), ...){
 scale_color_penguins <- scale_colour_penguins
 ```
 
-This is used to define a `theme_penguins` that I use to simply change the color and fill scales
+This is used to define a `theme_penguins()` function that I use to simply change the color and fill scales
 for plots below.
 
 ```{r}
-theme_penguins <- list(
-  scale_color_penguins(shade="dark"),
-  scale_fill_penguins(shade="dark"))
+theme_penguins <- function(shade=c("medium", "light", "dark"), ...) {
+  shade = match.arg(shade)
+  list(scale_color_penguins(shade=shade),
+       scale_fill_penguins(shade=shade)
+      )
+}
 ```
 
 
@@ -468,7 +475,7 @@ ggplot(peng,
   geom_smooth(method = "loess",  formula = y ~ x,
               linewidth = 1.5, se = FALSE, alpha=0.1) +
   stat_ellipse(geom = "polygon", level = 0.95, alpha = 0.2) +
-  theme_penguins +
+  theme_penguins("dark") +
   theme(legend.position = c(0.85, 0.15))
 ```
 
@@ -495,11 +502,11 @@ ggplot(peng,
   stat_ellipse(geom = "polygon", level = 0.95, alpha = 0.2) +
   stat_ellipse(geom = "polygon", level = 0.68, alpha = 0.2) +
   stat_ellipse(geom = "polygon", level = 0.40, alpha = 0.2) +
-  theme_penguins +
+  theme_penguins("dark") +
   theme(legend.position = c(0.85, 0.15))
 ```
 
-#### Nonparamteric bivariate density plots
+#### Nonparamtric bivariate density plots
 While I emphasize data ellipses (because I like their beautiful geometry), other visual
 summaries of the bivariate density are possible and often useful. 
 
@@ -541,7 +548,7 @@ p1 <- ggplot(peng,
   geom_density_2d(linewidth = 1.1, bins = 8) +
   ggtitle("geom_density_2d") +
   theme_bw(base_size = 14) + 
-  theme_penguins +
+  theme_penguins() +
   theme(legend.position = c(0.85, 0.15))
 
 p2 <- ggplot(peng, 
@@ -551,7 +558,7 @@ p2 <- ggplot(peng,
   geom_hdr(probs = c(0.95, 0.68, 0.4), show.legend = FALSE) +
   ggtitle("ggdensity::geom_hdr") +
   theme_bw(base_size = 14) +
-  theme_penguins +
+  theme_penguins() +
   theme(legend.position = "none")
 
 p1 + p2
-Original file line number
+Diff line change
@@ Expand Up @@
       shade = match.arg(shade)
       list(scale_color_penguins(shade=shade),
            scale_fill_penguins(shade=shade)
-      )
+          )
     }
@@ Expand Down @@