Skip to content

Commit

Permalink
Merge Udi's edits to ch03
Browse files Browse the repository at this point in the history
  • Loading branch information
friendly committed Dec 5, 2023
2 parents 55a0ecf + 7bfa030 commit 354c21d
Show file tree
Hide file tree
Showing 3 changed files with 75 additions and 59 deletions.
130 changes: 73 additions & 57 deletions 03-multivariate_plots.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ source("R/common.R")
> There is no excuse for failing to plot and look. --- J. W.Tukey
> (1997), *Exploratory Data Analysis*, p. 157
<!--# comment -->


The quote above from John Tukey reminds us that data analysis should
rightly start with graphs to help us understand the main features of our
Expand Down Expand Up @@ -171,7 +171,7 @@ The most generally useful idea is a smoother that tracks an average
value, $\mathbb{E} (y | x)$, of $y$ as $x$ varies across its' range
*without* assuming any particular functional form, and so avoiding the
necessity to choose among `y ~ poly(x, 1)`, or `y ~ poly(x, 2)`, or
`y ~ poly(x, 3)` ...
`y ~ poly(x, 3)`, etc.

Non-parametric smoothers attempt to estimate $\mathbb{E} (y | x) = f(x)$
where $f(x)$ is some smooth function. These typically use a collection
Expand Down Expand Up @@ -218,7 +218,7 @@ after earning their degree is about the same as a person at 15 years.
What else is going on here?


<!--# UA: up to here Nov 21 -->



### Stratifiers
Expand All @@ -240,15 +240,15 @@ down the overall relation to see whether and how it changes over such
a continuous variable into ranges (*shingles*) and do the same
graphically. There are two general stratifying graphical techniques:

- **grouping**: Identify subgroups in the data by assigning different
- **Grouping**: Identify subgroups in the data by assigning different
visual attributes, such as color, shape, line style, etc. within a
single plot. This is quite natural for factors; quantitative
predictors can be accommodated by cutting their range into ordered
intervals. Two such grouping variables can be Grouping has the
intervals. Grouping has the
advantage that the levels of a grouping variable can be shown within
the same plot, facilitating direct comparison.

- **conditioning**: Showing subgroups in different plot panels. This
- **Conditioning**: Showing subgroups in different plot panels. This
has the advantages that relations for the individual groups more
easily discerned and one can easily stratify by two (or more) other
variables jointly, but visual comparison is more difficult because
Expand All @@ -273,19 +273,19 @@ in *Trellis* plots @Becker:1996:VDC;@Cleveland:85.
Trellis displays were extended in the **lattice** package [@R-lattice],
which offered:

- a **graphing syntax** similar to that used in statistical model
- A **graphing syntax** similar to that used in statistical model
formulas: `y ~ x | g` conditions the data by the levels of `g`, with
`|` read as "given"; two or more conditioning are specified as
`y ~ x | g1 + g2 + ...`, with `+` read as "and".
- **panel functions** define what is plotted in a given panel.
- **Panel functions** define what is plotted in a given panel.
`panel.xyplot()` is the default for scatterplots, plotting points,
but you can add `panel.lmline()` for regression lines,
`latticeExtra::panel.smoother()` for loess smooths and a wide
variety of others.

The **car** package [@R-car] supports this graphing syntax in many of
its functions. **ggplot2** does not; it uses aesthetics (`aes()`), which
map variables in the data to visual characteristics in displays. ...
map variables in the data to visual characteristics in displays.
:::

The most obvious variable that affects academic salary is `rank`,
Expand All @@ -294,7 +294,7 @@ that carries through in their future salary. What can we see if we group
by `rank` and fit a separate smoothed curve for each?

In `ggplot2` thinking, grouping is accomplished simply by adding an
aesthetic, such as `color = rank`. what happens then is that points,
aesthetic, such as `color = rank`. What happens then is that points,
lines, smooths and other `geom_*()` inherit the feature that they are
differentiated by color. In the case of `geom_smooth()`, we get a
separate fit for each subset of the data, according to `rank`.
Expand Down Expand Up @@ -322,20 +322,19 @@ ggplot(Salaries,
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "loess", formula = "y ~ x",
linewidth = 2) +
# theme_bw(base_size = 14) +
linewidth = 2) +
legend_pos
```

Well, there is a different story here. Salaries generally occupy
separate levels, increasing with academic rank. The horizontal extents
of the smoothed curves show their ranges. Within each rank there is some
initial increase after promotion, and then some tendency to decline with
increasing years. But by and large, years since the PhD doesn't make
that much difference, once we've taken academic rank into account
increasing years. But, by and large, years since the PhD doesn't make
that much difference once we've taken academic rank into account.

What about the `discipline`, classified, perhaps peculiarly as
"theoretical" vs. "applied", here? The values are just `"A"` and `"B"`,
What about the `discipline` which is classified, perhaps peculiarly, as
"theoretical" vs. "applied"? The values are just `"A"` and `"B"`,
so I map these to more meaningful labels before making the plot.

```{r}
Expand Down Expand Up @@ -370,15 +369,17 @@ Salaries |>
```

For both groups, there is an approximately linear relation up to about
30--40 years, but the smoothed curves then diverge, into the region
where the data is thin.
30--40 years, but the smoothed curves then diverge into the region
where the data is thinner.



### Conditioning

The previous plots use grouping by color to plot the data for different
subsets inside the same plot window, making comparison among groups
easier, because they can be directly compared along a common vertical
scale [^03-multivariate_plots-1]. This gets messy however when there are
scale [^03-multivariate_plots-1]. This gets messy, however, when there are
more than just a few levels, or worse---when there are two (or more)
variables for which we want to show separate effects. In such cases, we
can plot separate panels using the `ggplot2` concept of *faceting*.
Expand Down Expand Up @@ -413,7 +414,6 @@ Salaries |>
method = "lm", formula = "y ~ x",
linewidth = 2) +
facet_wrap(~ discipline) +
# theme_bw(base_size = 14) +
legend_pos
```

Expand All @@ -427,8 +427,7 @@ are there anomalous differences in salary for men and women--- we can
look at differences in salary according to sex, when discipline and rank
are taken into account. To do this graphically, condition by both
variables, but use `facet_grid(discipline ~ rank)` to arrange their
combinations in a grid whose rows are the levels of `discipline` and
whose columns are those of `rank`. I want to make the comparison of
combinations in a grid whose rows are the levels of `discipline` and columns are those of `rank`. I want to make the comparison of
males and females most direct, so I use `color = sex` to stratify the
panels. The smoothed regression lines and error bands are calculated
separately for each combination of discipline, rank and sex.
Expand All @@ -453,6 +452,7 @@ Salaries |>
```

<!-- Data Ellipses -->
<!--# UA: Is there a reason the Data Ellipses section was added as a different .qmd file? Let me know if you want me to insert the full code/text here instead. -->
```{r child="child/03-data-ellipse.qmd"}
```

Expand All @@ -475,7 +475,7 @@ Plotting these together gives @fig-prestige-pairs. In such plots, the
diagonal cells give labels for the variables, but they are also a guide
to interpreting what is shown. In each row, say row 2 for `income`,
income is the vertical $y$ variable in plots against other variables. In
each column, say column 3 for `education`, education is the horizonal
each column, say column 3 for `education`, education is the horizontal
$x$ variable.

```{r}
Expand Down Expand Up @@ -528,7 +528,7 @@ scatterplotMatrix(~ prestige + income + education + women,

`scatterplotMatrix()` can also label points using the `id =` argument
(though this can get messy) and can stratify the observations by a
grouping variable with different symbols and colors. For example
grouping variable with different symbols and colors. For example,
@fig-prestige-spm2 uses the syntax
`~ prestige + education + income + women | type` to provide separate
regression lines, smoothed curves and data ellipses for the three types
Expand Down Expand Up @@ -581,9 +581,10 @@ scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | spec
regLine = list(lwd=3),
diagonal = list(method = "boxplot"),
smooth = FALSE,
plot.points = FALSE)
plot.points = FALSE,
cex.labels=1)
```

<!--# I added cex.labels=1 in the code above because the diag labels were on top of the boxplots, I think it looks a bit better now, but still slightly obscured-->
It can be seen that the species are widely separated in most of the
bivariate plots. As well, the regression lines for species have similar
slopes and the data ellipses have similar size and shape in most of the
Expand All @@ -598,8 +599,8 @@ penguins.
## Looking ahead

@fig-peng-spm provides a reasonably complete visual summary of the data
in relation to multivariate models that ask "do the species differ on in
their means on these body size measures? This corresponds to the MANOVA
in relation to multivariate models that ask "do the species differ in
their means on these body size measures?" This corresponds to the MANOVA
model,

```{r}
Expand All @@ -612,7 +613,7 @@ Hypothesis-error (HE) plots, described in @sec-vis-mlm provide a better
summary of the evidence for the MANOVA test of differences among means
on all variables together. These give an $\mathbf{H}$ ellipse reflecting
the differences among means, to be compared with an $\mathbf{E}$ ellipse
reflecting within-group variation and a visual test of significance
reflecting within-group variation and a visual test of significance.

A related question is "how well are the penguin species distinguished by
these body size measures?" Here, the relevant model is linear
Expand All @@ -626,7 +627,7 @@ peng.lda <- MASS:lda( species ~ cbind(bill_length, bill_depth, flipper_length, b
```

Both MANOVA and LDA depend on the assumption that the variances and
correlations of the are the same for all groups. This assumption can be
correlations between the variables are the same for all groups. This assumption can be
tested and visualized using the methods in @sec-eqcov.
:::

Expand All @@ -651,7 +652,7 @@ str(crime)

@fig-crime-spm displays the scatterplot matrix for these seven
variables, using only the regression line and data ellipse to show the
linear relation and the loess smooth to show potential nonlinearity. To
linear relation and the loess smooth to show potential non-linearity.<!--# I believe you used non-linear, hyphenated earlier so I added the - --> To
make this even more schematic, the axis tick marks and labels are also
removed using the `par()` settings `xaxt = "n", yaxt = "n"`.

Expand Down Expand Up @@ -704,7 +705,7 @@ related variables together as described in @sec-pca-biplot.
knitr::include_graphics("images/corrgram-renderings.png")
```

In R these diagrams can be created using the **corrgram** [@R-corrgram]
In R, these diagrams can be created using the **corrgram** [@R-corrgram]
and **corrplot** [@R-corrplot] packages, with different features.
`corrgram::corrgram()` is closest to @Friendly:02:corrgram, in that it
allows different rendering functions for the lower, upper and diagonal
Expand All @@ -725,7 +726,7 @@ crime |>
`c("circle", "square", "ellipse", "number", "shade", "color", "pie")`,
but only one can be used at a time. The function
`corrplot::corrplot.mixed()` allows different options to be selected for
the lower and upper triangles. The iconic shape is colored with with a
the lower and upper triangles. The iconic shape is colored with a
gradient in relation to the correlation value.

```{r}
Expand Down Expand Up @@ -760,7 +761,24 @@ presentation purposes), makes this an attractive alternative to boring
tables of correlations.

**TODO**: Add example showing correlation ordering -- e.g., `mtcars`
data.
data.<!--# How's something like the code I added below? -->

```{r}
#| echo: false
#| include: false
#| label: fig-mtcars-corrplot-ordering
#| fig-width: 8
#| fig-height: 8
#| out-width: "100%"
#| fig-cap: "Corrplot of the `mtcars` data, showing the ordered correlation between each pair of variables with a circle (upper) and a numerical correlation value (lower), all shaded in proportion to the correlation magnitude and direction."
#|
library(corrplot)
corrplot.mixed(cor(mtcars[,1:5]),
lower = "number",
order = 'FPC',
upper = 'circle',
tl.col = "black")
```

## Generalized pairs plots {#sec-ggpairs}

Expand All @@ -779,7 +797,7 @@ For example, we can tabulate the distributions of penguin species by sex
and the island where they were observed using `xtabs()`. `ftable()`
prints this three-way table more compactly. (In this example, and what
follows in the chapter, I've changed the labels for sex from ("f", "m")
to ("Female", "Male"))
to ("Female", "Male")).

```{r peng-table}
# use better labels for sex
Expand Down Expand Up @@ -827,33 +845,31 @@ species varies across island because on each island one or more species
did not occur. Row 2 and column 2 show that sex is nearly exactly
proportional among species and islands, indicating independence,
$\text{sex} \perp \{\text{species}, \text{island}\}$. More importantly,
mosaic pairs plots can show, at a glance all (bivariate) associations
mosaic pairs plots can show, at a glance, all (bivariate) associations
among multivariate categorical variables.

The next step, by John Emerson and others [@Emerson-etal:2013] was to
recognize that combinations of continuous and discrete, categorical
variables could be plotted in different ways.

- two continuous variables can be shown as a standard scatterplot of
- Two continuous variables can be shown as a standard scatterplot of
points and/or bivariate density contours, or simply by numeric
summaries such as a correlation value;
- a pair of one continuous and one categorical variable can be shown
- A pair of one continuous and one categorical variable can be shown
as side-by-side boxplots or violin plots, histograms or density
plots
- two categorical variables could be shown in a mosaic plot or by
plots;
- Two categorical variables could be shown in a mosaic plot or by
grouped bar plots.

In the `ggplot2` framework, these displays are implemented in the
**GGally** package [@R-GGally] in the `ggpairs()` function. This allows
different plot types to be shown in the lower and upper triangles and in
In the **ggplot2** framework, these displays are implemented using the `ggpairs()` function from the **GGally** package [@R-GGally]. This allows different plot types to be shown in the lower and upper triangles and in
the diagonal cells of the plot matrix. As well, aesthetics such as color
and shape can be used within the plots to distinguish groups directly.
As illustrated below, you can define custom functions to control exactly
what is plotted in any panel.

The basic, default plot shows scatterplots for pairs of continuous
variables in the lower triangle and the values of correlations in the
upper triangle. A combination of a discrete and continuous variable is
upper triangle. A combination of a discrete and continuous variables is
plotted as histograms in the lower triangle and boxplots in the upper
triangle. @fig-peng-ggpairs1 includes `sex` to illustrate the
combinations.
Expand All @@ -876,30 +892,30 @@ To my eye, printing the values of correlations in the upper triangle is
often a waste of graphic space. But this example shows something
peculiar and interesting if you look closely: In all pairs among the
penguin size measurements, there are positive correlations within each
species, as we can see in @fig-peng-spm. Yet in three of these panels,
the overall correlation ignoring species is negative. For example the
species, as we can see in @fig-peng-spm. Yet, in three of these panels,
the overall correlation ignoring species is negative. For example, the
overall correlation between bill depth and flipper length is
$r = -0.579$ in row 2, column 3; the scatterplot in the diagonally
opposite cell, row 3, column 2 shows the data. These cases, of differing
signs for an overall correlation, ignoring a group variable and the
within group correlations are examples of **Simpson's Paradox**,
explored later in Chapter XX.
explored later in Chapter XX. <!--# TODO: add chapter number when known -->

The last row and column, for `sex` in @fig-peng-ggpairs1, provides an
initial glance at the issue of sex differences among pengiun species
initial glance at the issue of sex differences among penguin species
that motivated the collection of these data. We can go further by also
examining differences among species and island, but first we need to
understand how to display exactly what we want for each pairwise plot.

`ggpairs()` is extremely general, in that for each of the `lower`,
`ggpairs()` is extremely general in that for each of the `lower`,
`upper` and `diag` sections you can assign any of a large number of
built-in functions (of the form `ggally_NAME`), or you own custom
built-in functions (of the form `ggally_NAME`), or your own custom
function for what is plotted, depending on the types of variables in
each plot.

- `continuous`: both X and Y are continuous variables; supply this as
- `continuous`: both X and Y are continuous variables, supply this as
the `NAME` part of a `ggally_NAME()` function or the name of a
custom function.
custom function;
- `combo`: one X of and Y variable is discrete while the other is
continuous, using the same convention;
- `discrete`: both X and Y are discrete variables.
Expand All @@ -926,7 +942,7 @@ discrete factors.

See the vignette,
[ggally_plots](https://ggobi.github.io/ggally/articles/ggally_plots.html)
for an illustrated list of available high-level plots. For my purpose
for an illustrated list of available high-level plots. For our purpose
here, which is to illustrate enhanced displays, note that for
scatterplots of continuous variables, there are two functions which plot
the points and also add a smoother, `_lm` or `_loess`.
Expand All @@ -941,8 +957,8 @@ function that takes `data` and `mapping` arguments and returns a
`aes(color=species, alpha=0.5)`, but only if you wish to override what
is supplied in the `ggpairs()` call.

Here is a function, `my_panel()` that plots the data points and
regression line and loess smooth
Here is a function, `my_panel()` that plots the data points,
regression line and loess smooth:

```{r}
#| label: my-panel
Expand All @@ -955,7 +971,7 @@ my_panel <- function(data, mapping, ...){
}
```

For this example, I want to simple summaries of for the scatterplots, so
For this example, I want only simple summaries of for the scatterplots, so
I don't want to plot the data points, but do want to add the regression
line and a data ellipse.

Expand Down Expand Up @@ -999,7 +1015,7 @@ ggpairs(peng, columns=c(3:6, 1, 2, 7),
There is certainly a lot going on in @fig-peng-ggpairs7, but it does
show a high-level overview of all the variables (except `year`) in the
penguins dataset.

<!--# Up to here, Nov 28 -->
## Parallel coordinate plots {#sec-parcoord}

As we have seen above, scatterplot matrices and generalized pairs plots
Expand Down
Loading

0 comments on commit 354c21d

Please sign in to comment.