Skip to content

Commit

Permalink
rebuild work on Ch 03
Browse files Browse the repository at this point in the history
  • Loading branch information
friendly committed Nov 15, 2023
1 parent 9dcde3b commit c0639d1
Show file tree
Hide file tree
Showing 18 changed files with 200 additions and 51 deletions.
68 changes: 61 additions & 7 deletions 03-multivariate_plots.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -399,7 +399,11 @@ The most basic version is provided by `pairs()` in base R. When one variable
is considered as an outcome or response, it is usually helpful to put this
in the first row and column. For the `Prestige` data, in addition to
income and education, we also have a measure of % women in each occupational category.
Plotting these together gives @fig-prestige-pairs.

Plotting these together gives @fig-prestige-pairs. In such plots, the diagonal cells
give labels for the variables, but they are also a guide to interpreting what is shown.
In each row, say row 2 for `income`, income is the vertical $y$ variable in plots against other
variables. In each column, say column 3 for `education`, education is the horizonal $x$ variable.

```{r}
#| label: fig-prestige-pairs
Expand All @@ -413,14 +417,23 @@ pairs(~ prestige + income + education + women,

The plots in the first row show what we have seen before for the relations between prestige and income and education, adding to those the plot of prestige vs. % women. Plots in the first column show the same data, but with $x$ and $y$ interchanged.

A more feature-rich version is provided by `car::scatterplotMatrix()` which
But this basic `pairs()` plot is very limited. A more feature-rich version is provided by `car::scatterplotMatrix()` which
can add the regression lines, loess smooths and data ellipses for each
pair. The diagonal panels show density curves for the distribution of each
variable; for example, the distribution of education appears to be multi-modal
pair, as shown in @fig-prestige-spm1.

The diagonal panels show density curves for the distribution of each
variable; for example, the distribution of `education` appears to be multi-modal
and that of `women` shows that most of the occupations have a low percentage
of women. The combination of the regression line with the loess smoothed
of women.

The combination of the regression line with the loess smoothed
curve, but without their confidence envelopes, provides about the right
amount of detail to take in at a glance where the relations are non-linear.
We've already seen (@fig-Prestige-scatterplot-income1)
the non-linear relation between prestige and income
(row 1, column 2) when occupational type is ignored.
But all relations with income in column 2 are non-linear, reinforcing
our idea (@sec-log-scale) that effects of income should be assessed on a log scale.

```{r}
#| label: fig-prestige-spm1
Expand Down Expand Up @@ -497,7 +510,7 @@ have similar size and shape in most of the plots.
From the boxplots, we can also see that `r colorize("Adelie", "orange")` penguins
have shorter bill lengths than the others, while `r colorize("Gentoo", "green")`
penguins have smaller bill depth, but longer flippers and are heavier
than r colorize("Chinstrap", "purple")` and `r colorize("Adelie", "orange")`
than `r colorize("Chinstrap", "purple")` and `r colorize("Adelie", "orange")`
penguins.


Expand All @@ -517,7 +530,7 @@ peng.mod <- lm(cbind(bill_length, bill_depth, flipper_length, body_mass) ~ speci

Hypothesis-error (HE) plots, described in @sec-vis-mlm provide a better summary of the evidence for the MANOVA test of differences among means on all variables together. These give an
$\mathbf{H}$ ellipse reflecting the differences among means, to be compared with
an $\mathbf{E}$ ellipse reflecting within-group variation.
an $\mathbf{E}$ ellipse reflecting within-group variation and a visual test of significance

A related question is "how well are the penguin species distinguished by these body size measures?"
Here, the relevant model is linear discriminant analysis (LDA), where `species` plays the role of the response in the model,
Expand Down Expand Up @@ -547,6 +560,47 @@ _Discrete Data Analysis with R_ [@FriendlyMeyer:2016:DDAR] and the **vcdExtra**
package for mosaic plots and mosaic matrices.

**TODO** Mosaic matrix of species, island & sex.
For example, we can tabulate the distributions of penguin species by sex and the island
where they were observed using `xtabs()`. `ftable()` prints this three-way table more compactly.

```{r peng-table}
# use better labels for sex
peng <- peng |>
mutate(sex = factor(sex, labels = c("Female", "Male")))
peng.table <- xtabs(~ species + sex + island, data = peng)
ftable(peng.table)
```

We can see immediately that the penguin species differ by island:
only Adelie were observed on all three islands; Biscoe Island had no Chinstraps
and Dream Island had no Gentoos.

`vcd::pairs()` produces all pairwise mosaic plots, as shown in @fig-peng-mosaic.
The diagonal panels show the one-way
frequencies by width of the divided bars. Each off-diagonal panel shows the bivariate
counts, breaking down each column variable by splitting the bars in proportion to a
second variable. Consequently, the frequency of each cell is represented by its' area.
The purpose is to show the pattern of association between each pair,
and so, the tiles in the mosaic are shaded according to the signed residual
in a simple $\chi^2$ test for association--- `r colorize("blue", "blue")`
where the observed frequency is greater than expected under independence, and
`r colorize("red", "red")` where it is less than expected.


```{r}
#| label: fig-peng-mosaic
#| fig-width: 7
#| fig-height: 7
#| out-width: "100%"
#| fig-cap: "Mosaic pairs plot for the combinations of species, sex and island. ..."
library(vcd)
pairs(peng.table, shade = TRUE)
```

The shading patterns in cells (1,3) and (3,1) of @fig-peng-mosaic show what we've seen before in the table of frequencies: The distribution of species varies across island. Row 2 and column 2
show that there is sex is roughly proportional among species and islands.


The next step, by John Emerson and others [@Emerson-etal:2013] was to recognize
that combinations of continuous and discrete, categorical variables could be plotted
Expand Down
1 change: 1 addition & 0 deletions R/common.R
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,7 @@ $\\newcommand*{\\diag}[1]{\\ensuremath{\\mathrm{diag}\\, #1}}$
"lattice",
"datasauRus",
"vcd",
"vcdExtra",
"palmerpenguins")

# write list of packages used at end of every chapter
Expand Down
2 changes: 1 addition & 1 deletion R/penguin/peng-ggplot.R
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ library(ggdensity)
library(patchwork)

load(here::here("data", "peng.RData"))
str(peng)
#str(peng)

source("R/penguin-colors.R")

Expand Down
18 changes: 18 additions & 0 deletions R/penguin/peng-mosaic.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#' ---
#' title: Penguins data, mosaic plots
#' ---

load(here::here("data", "peng.RData"))
source("R/penguin-colors.R")

library(vcdExtra)

# change levels of sex
peng <- peng |>
mutate(sex = factor(sex, labels = c("Female", "Male")))

peng.table <- xtabs(~ species + sex + island, data = peng)

ftable(peng.table)

pairs(peng.table, shade = TRUE)
2 changes: 1 addition & 1 deletion R/penguin/penguin-colors.R
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ theme_penguins <- function(shade=c("medium", "light", "dark"), ...) {
shade = match.arg(shade)
list(scale_color_penguins(shade=shade),
scale_fill_penguins(shade=shade)
)
)
}


Expand Down
8 changes: 8 additions & 0 deletions bib/pkgs.bib
Original file line number Diff line number Diff line change
Expand Up @@ -350,6 +350,14 @@ @Manual{R-tidyverse
}


@Manual{R-vcdExtra,
title = {vcdExtra: vcd Extensions and Additions},
author = {Michael Friendly},
year = {2023},
note = {R package version 0.8-5},
url = {https://friendly.github.io/vcdExtra/},
}

@Manual{R-VisCollin,
title = {VisCollin: Visualizing Collinearity Diagnostics},
author = {Michael Friendly},
Expand Down
2 changes: 2 additions & 0 deletions bib/pkgs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -34,11 +34,13 @@ ggdensity
ggplot2
graphics
grDevices
grid
methods
patchwork
stats
tidyr
utils
vcd
base
bayestestR
car
Expand Down
35 changes: 21 additions & 14 deletions child/03-data-ellipse.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -272,7 +272,8 @@ scatterplot(prestige ~ education, data=Prestige,
```


#### Plotting on a log scale
#### Plotting on a log scale {#sec-log-scale}

A typical remedy for the non-linear relationship of income to prestige is to plot income on a log scale. This usually makes sense, and expresses a belief that a **multiple** of
or **percentage increase** in income has a constant impact on prestige, as opposed to
the **additive** interpretation for income itself.
Expand Down Expand Up @@ -360,13 +361,14 @@ higher occupational category. A different plot entails a different story.
#| echo: false
#| fig-align: center
#| out-width: "80%"
#| fig-cap: "Penguin species observed in the Palmer Archipelago. Image: Allison Horst"
#| fig-cap: "Penguin species observed in the Palmer Archipelago. This is a cartoon, but it illustrates some features of penguin body size measurements, and the colors typically used for species. Image: Allison Horst"
knitr::include_graphics("images/penguins-horst.png")
```

The `penguins` dataset from the **palmerpenguins** package [@R-palmerpenguins] provides further instructive examples of plots and analyses of multivariate data. The data consists of measurements of body size
(flipper length, body mass, bill length and depth)
of 344 penguins collected at the [Palmer Research Station](https://pallter.marine.rutgers.edu/) in Antartica.

There were three different species of penguins (Adélie, Chinstrap & Gentoo)
collected from 3 islands in the Palmer Archipelago
between 2007--2009 [@Gorman2014]. The purpose was to examine differences in size or appearance of these species,
Expand Down Expand Up @@ -398,9 +400,11 @@ use `ggplot2` for these plots.

I'll be using the penguins data quite a lot, so it is useful to set up custom colors like those
used in @fig-penguin-species. These are shades of:
Adelie: `r colorize("orange", "orange")`,
Chinstrap: `r colorize("purple", "purple")`, and
Gentoo: `r colorize("green", "green")`. To use these in `ggplot2` I define a function
* `r colorize("Adelie", "orange")`: `r colorize("orange", "orange")`,
* `r colorize("Chinstrap", "purple")`: `r colorize("purple", "purple")`, and
* Gentoo: `r colorize("green", "green")`.

To use these in `ggplot2` I define a function
`peng.colors()` that allows shades of light, medium and dark and then functions
`scale_*_penguins()` for color and fill.

Expand Down Expand Up @@ -441,13 +445,16 @@ scale_colour_penguins <- function(shade=c("medium", "light", "dark"), ...){
scale_color_penguins <- scale_colour_penguins
```

This is used to define a `theme_penguins` that I use to simply change the color and fill scales
This is used to define a `theme_penguins()` function that I use to simply change the color and fill scales
for plots below.

```{r}
theme_penguins <- list(
scale_color_penguins(shade="dark"),
scale_fill_penguins(shade="dark"))
theme_penguins <- function(shade=c("medium", "light", "dark"), ...) {
shade = match.arg(shade)
list(scale_color_penguins(shade=shade),
scale_fill_penguins(shade=shade)
)
}
```


Expand All @@ -468,7 +475,7 @@ ggplot(peng,
geom_smooth(method = "loess", formula = y ~ x,
linewidth = 1.5, se = FALSE, alpha=0.1) +
stat_ellipse(geom = "polygon", level = 0.95, alpha = 0.2) +
theme_penguins +
theme_penguins("dark") +
theme(legend.position = c(0.85, 0.15))
```

Expand All @@ -495,11 +502,11 @@ ggplot(peng,
stat_ellipse(geom = "polygon", level = 0.95, alpha = 0.2) +
stat_ellipse(geom = "polygon", level = 0.68, alpha = 0.2) +
stat_ellipse(geom = "polygon", level = 0.40, alpha = 0.2) +
theme_penguins +
theme_penguins("dark") +
theme(legend.position = c(0.85, 0.15))
```

#### Nonparamteric bivariate density plots
#### Nonparamtric bivariate density plots
While I emphasize data ellipses (because I like their beautiful geometry), other visual
summaries of the bivariate density are possible and often useful.

Expand Down Expand Up @@ -541,7 +548,7 @@ p1 <- ggplot(peng,
geom_density_2d(linewidth = 1.1, bins = 8) +
ggtitle("geom_density_2d") +
theme_bw(base_size = 14) +
theme_penguins +
theme_penguins() +
theme(legend.position = c(0.85, 0.15))
p2 <- ggplot(peng,
Expand All @@ -551,7 +558,7 @@ p2 <- ggplot(peng,
geom_hdr(probs = c(0.95, 0.68, 0.4), show.legend = FALSE) +
ggtitle("ggdensity::geom_hdr") +
theme_bw(base_size = 14) +
theme_penguins +
theme_penguins() +
theme(legend.position = "none")
p1 + p2
Expand Down
Loading

0 comments on commit c0639d1

Please sign in to comment.