---
editor:
  markdown:
    wrap: 72
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch03/")
```
::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::
# Plots of Multivariate Data {#sec-multivariate_plots}
> There is no excuse for failing to plot and look.
>
> The greatest value of a picture is when it forces us to notice what we never expected to see.
> --- John W. Tukey, _Exploratory Data Analysis_, 1977
These quotes from John Tukey remind us that data analysis should nearly always
start with graphs to help us understand the main features of our
data. It is important to understand the general _patterns_ and _trends_: Are relationships increasing or decreasing? Are they approximately linear or non-linear? But it is also important to spot
_anomalies_: "unusual" observations, groups of points that seem to differ from the rest, and so forth.
As we saw with Anscombe's quartet (@sec-anscombe), numerical summaries hide features that are immediately apparent in a plot.
This chapter introduces a
toolbox of basic graphical methods for visualizing multivariate
datasets. It starts with some simple techniques to enhance the basic
scatterplot with graphical _annotations_ such as fitted lines, curves and data
ellipses to _summarize_ the relation between two variables.
To visualize more than two variables, we can view all pairs of variables
in a scatterplot matrix or shift gears entirely to show multiple
variables along a set of parallel axes. As the number of variables
increases, we may need to suppress details with stronger summaries
for a high-level reconnaissance of our data terrain, as we do by zooming
out on a map. For example, we can simply remove the data points or make them nearly transparent
to focus on the visual summaries provided by fitted lines or other graphical summaries.
**Packages**
In this chapter I use the following packages. Load them now:
```{r load-ggally}
#| include: false
suppressPackageStartupMessages(library(GGally, quietly = TRUE))
```
```{r load-pkgs}
library(car)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(corrgram)
library(GGally)
library(ggdensity)
library(patchwork)
library(ggpcp)
library(tourr)
```
## Bivariate summaries {#sec-bivariate_summaries}
The basic scatterplot is the workhorse of multivariate data
visualization, showing how one variable, $y$, often an outcome, is
explained by or varies with another, $x$. It is a building block for
many useful techniques, so it is helpful to understand how it can be
used as a tool for thinking in a wider, multivariate context.
The essential idea is that we can start with a simple version of the
scatterplot and add annotations to show interesting features more
clearly. We consider the following here:
- **Smoothers**: Showing overall trends, perhaps in several forms, as
visual summaries such as fitted regression lines or curves and
nonparametric smoothers.
- **Stratifiers**: Using color, shape or other features to identify
subgroups; more generally, *conditioning* on other variables in
multi-panel displays.
- **Data ellipses**: A compact 2D visual summary of bivariate linear
relations and uncertainty assuming normality; more generally,
contour plots of bivariate density.
**Example: Academic salaries**
Let's start with data on the academic salaries of faculty members
collected at a U.S. college for the purpose of assessing salary
differences between male and female faculty members, and perhaps address
anomalies in compensation. The dataset `carData::Salaries` gives data on
nine-month salaries and other variables for 397 faculty members in the
2008-2009 academic year.
```{r Salaries}
data(Salaries, package = "carData")
str(Salaries)
```
The most obvious, but perhaps naive, predictor of `salary` is
`yrs.since.phd`. For simplicity, I'll refer to this as years of
"experience." Before looking at differences between males and females,
we would want to consider faculty `rank` (related also to `yrs.service`)
and `discipline`, recorded here as `"A"` ("theoretical" departments) or
`"B"` ("applied" departments). But, for a basic plot, we will ignore these
for now to focus on what can be learned from plot annotations.
<!-- figure-code: `R/Salaries-scatterplots.R` -->
```{r}
#| label: fig-Salaries-scat
#| out-width: "90%"
#| fig-cap: "Naive scatterplot of Salary vs. years since PhD, ignoring other variables, and without graphical annotations."
library(ggplot2)
gg1 <- ggplot(Salaries,
aes(x = yrs.since.phd, y = salary)) +
geom_jitter(size = 2) +
scale_y_continuous(labels = scales::dollar_format(
prefix="$", scale = 0.001, suffix = "K")) +
labs(x = "Years since PhD",
y = "Salary")
gg1 + geom_rug(position = "jitter", alpha = 1/4)
```
There is quite a lot we can see "just by looking" at this simple plot,
but the main things are:
- Salary generally increases from 0--40 years since the PhD, but then perhaps begins to drop off (partial retirement?);
- Variability in salary among those with the same experience increases with experience, a "fan-shaped" pattern that signals a violation of homogeneity of variance in simple regression;
- Data beyond 50 years are thin, but there are some quite low salaries there.

Adding rug plots to the X and Y axes is a simple but effective way to show the marginal distributions of the observations. Jitter and transparency help to avoid overplotting due to discrete values.
### Smoothers
Smoothers are among the most useful graphical annotations you can add to
such plots, giving a visual summary of how $y$ changes with $x$. The
most common smoother is a line showing the linear regression for $y$
given $x$, expressed in math notation as
$\mathbb{E} (y | x) = b_0 + b_1 x$. If there is doubt that a linear
relation is an adequate summary, you can try a quadratic or other
polynomial smoother.
In `r pkg("ggplot2")`, these are easily added to a plot using `geom_smooth()`
with `method = "lm"`, and a model `formula`, which (by default) is
`y ~ x` for a linear relation or `y ~ poly(x, k)` for a polynomial of
degree $k$.
```{r}
#| label: fig-Salaries-lm
#| out-width: "80%"
#| code-fold: show
#| fig-cap: !expr paste("Scatterplot of Salary vs. years since PhD, showing", colorize("linear", "red"),
#| "and", colorize("quadratic", "darkgreen"), "smooths with 95% confidence bands.")
gg1 +
geom_smooth(method = "lm", formula = "y ~ x",
color = "red", fill= "pink",
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ poly(x,2)",
color = "darkgreen", fill = "lightgreen",
linewidth = 2)
```
<!--# UA: This is a fantastic graph. My only (very minor) suggestion is to replace one of the colours because the most common type of colour vision deficiency makes it hard to tell the difference between red and green. So, maybe green and purple (purple would match the inline code highlight colour)-->
This serves to highlight some of our impressions from the basic
scatterplot shown in @fig-Salaries-scat, making them more apparent. And
that's precisely the point: the regression smoother draws attention to a
possible pattern that we can consider as a visual summary of the data.
You can think of this as showing what a linear (or quadratic) regression
"sees" in the data. Statistical tests <!--# (secref?) --> can help you decide if there is more evidence for a quadratic fit compared to the simpler linear relation. <!--# UA: Great paragraph!-->
It is useful to also show some indication of *uncertainty* (or
inversely, *precision*) associated with the predicted values. Both
the linear and quadratic trends are shown in @fig-Salaries-lm with 95%
pointwise confidence bands.[^pointwise] These are necessarily narrower in the center
of the range of $x$ where there is typically more data; they get wider
toward the highest values of experience where the data are thinner.
[^pointwise]:
    Confidence bands allow us to visualize the uncertainty around a fitted regression curve,
    which can be of two types: _pointwise intervals_ or _simultaneous intervals_.
    The default setting in `ggplot2::geom_smooth()` calculates pointwise intervals
    (using `stats::predict.lm(..., interval="confidence")`) at a confidence level $1-\alpha$ for the predicted response at _each value_ $x_i$ of a predictor. These have the frequentist interpretation that, over repeated sampling, only $100\,\alpha\%$ of the predictions at $x_i$ will fall outside that interval.
    In contrast, simultaneous intervals are calculated so that $1 - \alpha$ is the probability that _all of them_ cover their corresponding true values simultaneously. These are necessarily wider than pointwise intervals.
    Commonly used methods for constructing simultaneous confidence bands in regression are the Bonferroni and Scheffé methods, which control the family-wise error rate over all values of $x_i$.
    See [Confidence and prediction bands](https://en.wikipedia.org/wiki/Confidence_and_prediction_bands) for precise definitions of these terms.
    These are different from a _prediction band_, which represents the uncertainty about the value of a **new** observation on the curve, and so includes the additional variance of a single observation.
#### Non-parametric smoothers {.unnumbered}
The most generally useful idea is a smoother that tracks an average
value, $\mathbb{E} (y | x)$, of $y$ as $x$ varies across its range
*without* assuming any particular functional form, and so avoiding the
necessity to choose among `y ~ poly(x, 1)`, or `y ~ poly(x, 2)`, or
`y ~ poly(x, 3)`, etc.
Non-parametric smoothers attempt to estimate $\mathbb{E} (y | x) = f(x)$
where $f(x)$ is some smooth function. These typically use a collection
of weighted *local regressions* for each $x_i$ within a window centered
at that value. In the method called *lowess* or *loess* [@Cleveland:79;
@ClevelandDevlin:88], a weight function is applied, giving greatest
weight to $x_i$ and a weight of 0 outside a window containing a certain fraction, $s$, called *span*, of the nearest neighbors of $x_i$. The fraction, $s$, is usually within the range $1/3 \le s \le 2/3$, and it determines the
smoothness of the resulting curve; smaller values produce a wigglier
curve and larger values give a smoother fit. (An optimal
span can be determined by $k$-fold cross-validation to minimize a
measure of overall error of approximation, as sketched below.)
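Here is a minimal sketch of that idea (my own illustration, not code from the text): choosing the loess span for the salary data by $k$-fold cross-validation. The helper `cv_loess_span()`, the span grid, and the number of folds are arbitrary choices made only for illustration.

```r
# Sketch: choose a loess span by k-fold cross-validation.
# The span grid and fold count are arbitrary illustrative choices.
cv_loess_span <- function(spans = seq(0.3, 0.9, by = 0.1), k = 5, data = Salaries) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  cv_mse <- sapply(spans, function(s) {
    mean(sapply(1:k, function(i) {
      train <- data[folds != i, ]
      test  <- data[folds == i, ]
      fit <- loess(salary ~ yrs.since.phd, data = train, span = s,
                   control = loess.control(surface = "direct"))
      mean((test$salary - predict(fit, newdata = test))^2, na.rm = TRUE)
    }))
  })
  data.frame(span = spans, cv_mse = cv_mse)
}
set.seed(42)
cv_loess_span()   # the span with the smallest cv_mse is the "optimal" one
```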
Non-parametric regression is a broad topic; see @Fox:2016:ARA, Ch. 18 for
a more general treatment including smoothing splines, and @Wood:2006 for generalized additive models,
fit using `method = "gam"` in **ggplot2**, which is the default when the
largest group has more than 1,000 observations.
@fig-Salaries-loess shows the addition of a loess smooth to the plot in
@fig-Salaries-lm, suppressing the confidence band for the linear
regression. The loess fit is nearly coincident with the quadratic fit,
but has a slightly wider confidence band.
```{r}
#| label: fig-Salaries-loess
#| out-width: "80%"
#| code-fold: show
#| fig-cap: !expr glue::glue("Scatterplot of Salary vs. years since PhD, adding the loess smooth. The loess smooth curve and confidence band in {green} is nearly indistinguishable from a quadratic fit in {blue}.")
gg1 +
geom_smooth(method = "loess", formula = "y ~ x",
color = "blue", fill = scales::muted("blue"),
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE,
color = "red",
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ poly(x,2)",
color = "darkgreen", fill = "lightgreen",
linewidth = 2)
```
But now comes an important question: is it reasonable that academic
salary should increase up to about 40 years since the PhD degree and
then decline? The predicted salary for someone still working 50 years
after earning their degree is about the same as a person at 15 years.
What else is going on here?
### Stratifiers
Very often, we have a main relationship of interest, but various groups
in the data are identified by discrete factors (like faculty `rank`,
`sex`, and type of `discipline` here), or there are quantitative
predictors for which the main relation might vary. In the language of
statistical models such effects are *interaction* terms, as in
`y ~ group + x + group:x`, where the term `group:x` fits a different
slope for each group and the grouping variable is often called a
*moderator* variable. Common moderator variables are ethnicity, health
status, social class and level of education. Moderators can also be
continuous variables as in `y ~ x1 + x2 + x1:x2`.
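For instance, a hypothetical version of such a model for the salary data (a sketch of my own, not a model fit in the text) treats `rank` as the moderator of the effect of years since the PhD:

```r
# Sketch: rank as a moderator of the effect of yrs.since.phd.
# The term rank:yrs.since.phd allows a different slope for each rank.
m_mod <- lm(salary ~ rank + yrs.since.phd + rank:yrs.since.phd, data = Salaries)
coef(m_mod)
```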
I call these *stratifiers*, recognizing that we should consider breaking
down the overall relation to see whether and how it changes over such
"other" variables. Such variables are most often factors, but we can cut
a continuous variable into ranges (*shingles*) and do the same
graphically. There are two general stratifying graphical techniques:
- **Grouping**: Identify subgroups in the data by assigning different
visual attributes, such as color, shape, line style, etc. within a
single plot. This is quite natural for factors; quantitative
predictors can be accommodated by cutting their range into ordered
intervals. Grouping has the
advantage that the levels of a grouping variable can be shown within
the same plot, facilitating direct comparison.
- **Conditioning**: Showing subgroups in different plot panels. This
has the advantages that relations for the individual groups are more
easily discerned and one can stratify by two (or more) other
variables jointly; but visual comparison is more difficult because
the eye must scan from one panel to another.
::: {.callout-note title="History Corner"}
Recognition of the roles of visual grouping by factors within a panel
and conditioning in multi-panel displays was an important advance in the
development of modern statistical graphics. It began at A.T.&T. Bell
Labs in Murray Hill, NJ in conjunction with the **S** language, the
mother of R.
Conditioning displays (originally called *coplots*
[@ChambersHastie1991]) are simply a collection of 1D, 2D or 3D plots
in separate panels for subsets of the data broken down by one or more
factors, or, for quantitative variables, subdivided into a factor with
several overlapping intervals (*shingles*). The first implementation was
in *Trellis* plots [@Becker:1996:VDC;@Cleveland:85].
Trellis displays were extended in the `r package("lattice", cite=TRUE)`,
which offered:
- A **graphing syntax** similar to that used in statistical model
formulas: `y ~ x | g` conditions the data by the levels of `g`, with
`|` read as "given"; two or more conditioning variables are specified as
`y ~ x | g1 + g2 + ...`, with `+` read as "and".
- **Panel functions** define what is plotted in a given panel.
`panel.xyplot()` is the default for scatterplots, plotting points,
but you can add `panel.lmline()` for regression lines,
`latticeExtra::panel.smoother()` for loess smooths and a wide
variety of others. A brief sketch of this syntax follows.
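As a small sketch (my own, not code used in the text) of how this syntax looks in practice, using the salary data:

```r
# Sketch of lattice's formula syntax and panel functions:
# salary vs. years since PhD, conditioned on rank, with a regression
# line and a loess smooth added in each panel.
library(lattice)
xyplot(salary ~ yrs.since.phd | rank, data = Salaries,
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)            # points
         panel.lmline(x, y, col = "red")    # linear regression line
         panel.loess(x, y, col = "blue")    # loess smooth
       })
```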
The `r package("car", cite=TRUE)` supports this graphing syntax in many of
its functions. `r pkg("ggplot2")` does not; it uses aesthetics (`aes()`), which
map variables in the data to visual characteristics in displays.
:::
The most obvious variable that affects academic salary is `rank`,
because faculty typically get an increase in salary with a promotion
that carries through in their future salary. What can we see if we group
by `rank` and fit a separate smoothed curve for each?
In `ggplot2` thinking, grouping is accomplished simply by adding an
aesthetic, such as `color = rank`. What happens then is that points,
lines, smooths and other `geom_*()` inherit the feature that they are
differentiated by color. In the case of `geom_smooth()`, we get a
separate fit for each subset of the data, according to `rank`.
```{r}
#| label: fig-Salaries-rank
#| out-width: "80%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by rank."
# make some re-useable pieces to avoid repetitions
scale_salary <- scale_y_continuous(
labels = scales::dollar_format(prefix="$",
scale = 0.001,
suffix = "K"))
# position the legend inside the plot
legend_pos <- theme(legend.position = "inside",
legend.position.inside = c(.1, 0.95),
legend.justification = c(0, 1))
ggplot(Salaries,
aes(x = yrs.since.phd, y = salary,
color = rank, shape = rank)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "loess", formula = "y ~ x",
linewidth = 2) +
legend_pos
```
Well, there is a different story here. Salaries generally occupy
separate vertical levels, increasing with academic rank. The horizontal extents
of the smoothed curves show their ranges. Within each rank there is some
initial increase after promotion, and then some tendency to decline with
increasing years. But, by and large, years since the PhD doesn't make
as much difference once we've taken academic rank into account.
What about `discipline`, which is classified, perhaps peculiarly, as
"theoretical" vs. "applied"? The values are just `"A"` and `"B"`,
so I map these to more meaningful labels before making the plot.
```{r}
#| label: fig-Salaries-discipline
#| out-width: "80%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by discipline."
Salaries <- Salaries |>
mutate(discipline =
factor(discipline,
labels = c("A: Theoretical", "B: Applied")))
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary, color = discipline)) +
geom_point() +
scale_salary +
geom_smooth(aes(fill = discipline ),
method = "loess", formula = "y ~ x",
linewidth = 2) +
labs(x = "Years since PhD",
y = "Salary") +
legend_pos
```
The story in @fig-Salaries-discipline is again different. Faculty in
applied disciplines earn, on average, about \$10,000 more per year
than their theoretical colleagues.
```{r discipline-means}
Salaries |>
group_by(discipline) |>
summarize(mean = mean(salary))
```
For both groups, there is an approximately linear relation up to about
30--40 years, but the smoothed curves then diverge into the region
where the data is thinner.
This result is more surprising than
differences among faculty ranks. The effect of annotation with
smoothed curves as visual summaries is apparent, and provides a stimulus
to think about _why_ these differences (if they are real) exist
between theoretical and applied professors, and whether theoreticians
perhaps _should_ be paid more!
### Conditioning
The previous plots use grouping by color to plot the data for different
subsets inside the same plot window, making comparison among groups
easier, because they can be directly compared along a common vertical
scale [^03-multivariate_plots-1]. This gets messy, however, when there are
more than just a few levels, or worse---when there are two (or more)
variables for which we want to show separate effects. In such cases, we
can plot separate panels using the `ggplot2` concept of *faceting*.
There are two options: `facet_wrap()` takes one or more conditioning
variables and produces a ribbon of plots for each combination of levels;
`facet_grid(row ~ col)` takes two or more conditioning variables and
arranges the plots in a 2D array identified by the `row` and `col`
variables.
[^03-multivariate_plots-1]: The classic studies by
    @ClevelandMcGill:84b and @ClevelandMcGill:85 show that judgments of
    magnitude along a common scale are more accurate than those along
    separate, aligned scales.
Let's look at salary broken down by the combinations of discipline and
rank. Here, I chose to stratify using color by rank within each of the
panels, faceting by discipline. Because there is more going on in this
plot, a linear smooth is used to represent the trend.
```{r}
#| label: fig-Salaries-faceted
#| out-width: "100%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by rank, with separate panels for discipline."
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary,
color = rank, shape = rank)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "lm", formula = "y ~ x",
linewidth = 2, alpha = 1/4) +
facet_wrap(~ discipline) +
legend_pos
```
Once both of these factors are taken into account, there does not seem
to be much impact of years of service. Salaries in applied
disciplines are noticeably greater than those in theoretical disciplines at
all ranks, and there are even greater differences among ranks.
Finally, to shed light on the question that motivated this example---
are there anomalous differences in salary for men and women--- we can
look at differences in salary according to sex, when discipline and rank
are taken into account. To do this graphically, condition by both
variables, but use `facet_grid(discipline ~ rank)` to arrange their
combinations in a grid whose rows are the levels of `discipline` and columns are those of `rank`. I want to make the comparison of
males and females most direct, so I use `color = sex` to stratify the
panels. The smoothed regression lines and error bands are calculated
separately for each combination of discipline, rank and sex.
```{r}
#| label: fig-Salaries-facet-sex
#| out-width: "100%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by sex, faceted by discipline and rank."
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary, color = sex)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = sex),
method = "lm", formula = "y ~ x",
linewidth = 2, alpha = 1/4) +
facet_grid(discipline ~ rank) +
theme_bw(base_size = 14) +
legend_pos
```
## Data Ellipses {#sec-data-ellipse}
The _data ellipse_ [@Monette:90], or _concentration ellipse_ [@Dempster:69] is a
remarkably simple and effective display for viewing and understanding
bivariate relationships in multivariate data.
The data ellipse is typically used to add a visual summary to a scatterplot
that shows, all together, the means, standard deviations, correlation,
and slope of the regression line for two variables, perhaps stratified by another variable.
Under the classical assumption that the data are bivariate normally distributed,
the data ellipse is also a **sufficient** visual summary, in the sense that
it captures **all** relevant features of the data.
See @Friendly-etal:ellipses:2013 for a complete discussion of the role of
ellipsoids in statistical data visualization.
It is based on the idea that in a bivariate normal distribution, the contours
of equal probability form a series of concentric ellipses. If the variables were
uncorrelated and had the same variances, these would be circles, and Euclidean
distance would measure the distance of each observation from the mean.
When the variables are correlated, a different measure, _Mahalanobis distance_,
is the proper measure of how far a point is from the mean, taking the correlation
into account.
```{r}
#| label: fig-mahalanobis
#| echo: false
#| fig-align: center
#| out-width: "60%"
#| fig-cap: "2D data with curves of constant distance from the centroid. The blue solid ellipse shows a contour of constant Mahalanobis distance, taking the correlation into account; the dashed red circle is a contour of equal Euclidean distance. Given the data points, Which of the points **A** and **B** is further from the mean (X)? _Source_: Re-drawn from [Ou Zhang](https://ouzhang.rbind.io/2020/11/16/outliers-part4/)"
knitr::include_graphics("images/mahalanobis.png")
```
<!--
This doesn't work
#| fig-cap: !expr paste("2D data with curves of constant distance from the centroid. The", colorize('blue'), "solid ellipse shows a contour of constant Mahalanobis distance, taking the correlation into account; the dashed", colorize('blue'), "circle is a contour of equal Euclidean distance. Given the data points, Which of the points **A** and **B** is further from the mean (X)? _Source_: Re-drawn from [Ou Zhang](https://ouzhang.rbind.io/2020/11/16/outliers-part4/)")
-->
To illustrate, @fig-mahalanobis shows a scatterplot with labels for two points, "A" and "B".
Which is further from the mean, "X"?
A contour of constant Euclidean distance, shown by the `r colorize("red")` dashed circle,
ignores the apparent negative correlation, so point "A" is further.
The `r colorize("blue")` ellipse for Mahalanobis distance
takes the correlation into account, so point "B" has a greater distance from the mean.
Mathematically, Euclidean (squared) distance for $p$ variables, $j = 1, 2, \dots , p$,
is just a generalization of
the square of a univariate standardized ($z$) score, $z^2 = [(y - \bar{y}) / s]^2$,
$$
D_E^2 (\mathbf{y}) = \sum_j^p z_j^2 = \mathbf{z}^\textsf{T} \mathbf{z} = (\mathbf{y} - \bar{\mathbf{y}})^\textsf{T} \operatorname{diag}(\mathbf{S})^{-1} (\mathbf{y} - \bar{\mathbf{y}}) \; ,
$$
where $\mathbf{S}$ is the sample variance-covariance matrix,
$\mathbf{S} = ({n-1})^{-1} \sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}}) (\mathbf{y}_i - \bar{\mathbf{y}})^\textsf{T}$.
Mahalanobis' distance takes the correlations into account simply by using the covariances
as well as the variances,
$$
D_M^2 (\mathbf{y}) = (\mathbf{y} - \bar{\mathbf{y}})^\mathsf{T} \mathbf{S}^{-1} (\mathbf{y} - \bar{\mathbf{y}}) \; .
$$ {#eq-Dsq}
In @eq-Dsq, the inverse $\mathbf{S}^{-1}$ serves to "divide" the matrix $(\mathbf{y} - \bar{\mathbf{y}}) (\mathbf{y} - \bar{\mathbf{y}})^\mathsf{T}$ of squared deviations and cross-products
by the variances (and covariances) of the variables, as in the univariate case.
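As a quick numerical illustration (my own sketch, using the salary data already loaded), base R's `mahalanobis()` computes $D_M^2$ for each observation; replacing $\mathbf{S}$ by its diagonal gives the standardized Euclidean version that ignores the covariances.

```r
# D^2_M for each observation, vs. the version that ignores covariances
X  <- as.matrix(Salaries[, c("yrs.since.phd", "salary")])
S  <- cov(X)
D2_M <- mahalanobis(X, center = colMeans(X), cov = S)
D2_E <- mahalanobis(X, center = colMeans(X), cov = diag(diag(S)))
head(cbind(D2_M, D2_E))
```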
For $p$ variables, the data _ellipsoid_ $\mathcal{E}_c$ of
size $c$ is a $p$-dimensional ellipsoid,
defined as the set of points $\mathbf{y} = (y_1, y_2, \dots y_p)$
whose squared Mahalanobis distance, $D_M^2 ( \mathbf{y} )$ is less than or equal
to $c^2$,
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) := \{ D_M^2 (\mathbf{y}) \le c^2 \} \; .
$$
A computational definition recognizes that the boundary of the ellipsoid can be found by transforming
a unit sphere $\mathcal{P}$
centered at the origin, $\mathcal{P} : \{ \mathbf{x}^\textsf{T} \mathbf{x}= 1\}$, by $\mathbf{S}^{1/2}$
and then shifting that to the centroid of the data,
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) = \bar{\mathbf{y}} \; \oplus \; \mathbf{S}^{1/2} \, \mathcal{P} \comma
$$
where $\mathbf{S}^{1/2}$ represents a rotation and scaling and the notation $\oplus$ represents translation to a new centroid, $\bar{\mathbf{y}}$ here. The matrix $\mathbf{S}^{1/2}$ is commonly computed
as the Choleski factor of $\mathbf{S}$. Slightly abusing notation and taking the unit sphere as given (like an identity matrix $\mathbf{I}$),
we can write the data ellipsoid as simply:
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) = \bar{\mathbf{y}} \; \oplus \; c\, \sqrt{\mathbf{S}} \period
$$ {#eq-ellE}
When $\mathbf{y}$ is (at least approximately) bivariate normal,
$D_M^2(\mathbf{y})$ has a large-sample $\chi^2_2$ distribution
($\chi^2$ with 2 df),
so
* $c^2 = \chi^2_2 (0.68) = 2.28$ gives a "1 standard deviation
bivariate ellipse,"
an analog of the standard interval $\bar{y} \pm 1 s$, while
* $c^2 = \chi^2_2 (0.95) = 5.99 \approx 6$ gives a data ellipse of
95\% coverage.
In smaller samples, the radius $c$ of the ellipsoid is better approximated using a multiple of an $F_{p, n-p}$ distribution, becoming $c =\sqrt{ 2 F_{2, n-2}^{1-\alpha} }$
in the bivariate case ($p=2$) for coverage $1-\alpha$.
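To connect @eq-ellE to computation, here is a minimal sketch (my own code, using the salary data, not a figure from the text) that constructs an approximate 68% data ellipse by transforming unit-circle points with the Cholesky factor of $\mathbf{S}$ and shifting them to the centroid.

```r
# Build a ~68% data ellipse "by hand" following eq-ellE
X   <- as.matrix(Salaries[, c("yrs.since.phd", "salary")])
mu  <- colMeans(X)
S   <- cov(X)
rad <- sqrt(qchisq(0.68, df = 2))              # radius c for ~68% coverage
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- cbind(cos(theta), sin(theta))        # points on the unit circle
ellipse <- sweep(rad * circle %*% chol(S), 2, mu, "+")  # c * S^{1/2} P, shifted to y-bar
plot(X, xlab = "Years since PhD", ylab = "Salary")
lines(ellipse, col = "blue", lwd = 2)
```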
### Ellipse properties
The essential ideas of correlation and regression and their relation to ellipses go back to
@Galton:1886.
Galton's goal was to predict (or explain) how a heritable trait, $Y$, (e.g.,
height) of children was related to that of their parents, $X$.
He made a semi-graphic table of the frequencies of 928 observations of the average
height of father and mother versus the height of their child, shown in @fig-galton-corr.
He then drew smoothed contour lines of equal frequencies and had the wonderful
visual insight that these formed concentric shapes that were tolerably close to ellipses.
He then calculated summaries, $\text{Ave}(Y | X)$, and, for symmetry, $\text{Ave}(X | Y)$, and plotted these as lines of means on his diagram. Lo and behold, he had a second visual
insight: the lines of means of ($Y | X$) and ($X | Y$) corresponded approximately to
the loci of horizontal and vertical tangents to the concentric ellipses.
To complete the picture, he added lines showing the major and minor axes of the
family of ellipses (which turned out to be the principal components) with the result shown in @fig-galton-corr.
```{r}
#| label: fig-galton-corr
#| echo: false
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Galton's 1886 diagram, showing the relationship of height of children to the average of their parents' height. The diagram is essentially an overlay of a geometrical interpretation on a bivariate grouped frequency distribution, shown as numbers."
knitr::include_graphics("images/galton-corr.jpg")
```
For two variables, $x$ and $y$, the remarkable properties of the data ellipse are illustrated in @fig-galton-ellipse-r, a modern reconstruction of Galton's diagram.
```{r}
#| label: fig-galton-ellipse-r
#| echo: false
#| fig-align: center
#| out-width: "100%"
#| fig-cap: "Sunflower plot of Galton's data on heights of parents and their children (in.), with
#| 40%, 68% and 95% data ellipses and the regression lines of $y$ on $x$ (black) and
#| $x$ on $y$ (grey)."
knitr::include_graphics("images/galton-ellipse-r.jpg")
```
* The ellipses have the mean vector $(\bar{x}, \bar{y})$ as their center.
* The lengths of the arms of the `r colorize("blue")` dashed central cross
show the standard deviations of the variables, which correspond to the shadows of the ellipse covering 40\% of the data. These are the bivariate analogs of
the standard intervals $\bar{x} \pm 1 s_x$ and $\bar{y} \pm 1 s_y$.
* More generally, shadows (projections) on the coordinate axes, or any linear combination of them,
give any standard interval,
$\bar{x} \pm k s_x$ and $\bar{y} \pm k s_y$.
Those with $k=1, 1.5, 2.45$ have
bivariate coverage 40%, 68% and 95% respectively, corresponding to these quantiles of the $\chi^2$ distribution
with 2 degrees of freedom, i.e.,
$\chi^2_2 (.40) \approx 1^2$,
$\chi^2_2 (.68) \approx 1.5^2$, and
$\chi^2_2 (.95) \approx 2.45^2$.
The shadows of the 68% ellipse are the bivariate analog of a univariate $\bar{x} \pm 1 s_x$ interval.
<!--# and univariate coverage 68\%, 87\% and 98.6\% respectively. -->
* The regression line predicting $y$ from $x$ goes through the points where the ellipses have vertical tangents. The _other_ regression line, predicting $x$ from $y$ goes through the points of horizontal
tangency.
* The correlation $r(x, y)$ is the ratio of the vertical segment from the mean of $y$ to the regression line to the vertical segment going to the top of the ellipse as shown at the right of the figure. It is
$r = 0.46$ in this example.
* The residual standard deviation, $s_e = \sqrt{\text{MSE}} = \sqrt{\Sigma (y - \hat{y})^2 / (n-2)}$,
is the half-length of the ellipse at the mean $\bar{x}$.
Because Galton's values of `parent` and `child` height were recorded in class intervals of 1 in.,
they are shown as sunflower symbols in @fig-galton-ellipse-r,
with multiple 'petals' reflecting the number of observations
at each location. This plot (except for annotations) is constructed using `sunflowerplot()` and
`car::dataEllipse()` for the ellipses.
```{r}
#| eval: false
data(Galton, package = "HistData")
sunflowerplot(parent ~ child, data=Galton,
xlim=c(61,75),
ylim=c(61,75),
seg.col="black",
xlab="Child height",
ylab="Mid Parent height")
y.x <- lm(parent ~ child, data=Galton) # regression of y on x
abline(y.x, lwd=2)
x.y <- lm(child ~ parent, data=Galton) # regression of x on y
cc <- coef(x.y)
abline(-cc[1]/cc[2], 1/cc[2], lwd=2, col="gray")
with(Galton,
car::dataEllipse(child, parent,
plot.points=FALSE,
levels=c(0.40, 0.68, 0.95),
lty=1:3)
)
```
Finally, as Galton noted in his diagram, the principal major and minor axes of the ellipse have important statistical properties. @Pearson:1901 would later show that
their directions are determined by the eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \dots$ of the covariance matrix $\mathbf{S}$ and their radii by the
square roots, $\sqrt{\lambda_1}, \sqrt{\lambda_2}, \dots$, of the corresponding
eigenvalues.
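A small sketch (my own, not code from the text) makes this concrete for Galton's data: the eigenvectors of $\mathbf{S}$ give the directions of the major and minor axes, and the square roots of the eigenvalues give their relative radii.

```r
# Principal axes of the data ellipse from the eigen decomposition of S
data(Galton, package = "HistData")
S <- cov(Galton[, c("child", "parent")])
e <- eigen(S)
e$vectors        # columns: directions of the major and minor axes
sqrt(e$values)   # relative radii (multiplied by c for a given coverage)
```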
### R functions for data ellipses
A number of packages provide functions for drawing data ellipses in a scatterplot, with various features.
* `car::scatterplot()`: uses base R graphics to draw 2D scatterplots, with a wide variety of plot enhancements including linear and non-parametric smoothers (loess, gam), a formula method, e.g., `y ~ x | group`, and marking points and lines using symbol shape,
color, etc. Importantly, the `r pkg("car")` package generally allows automatic identification of "noteworthy" points by their labels in the plot using a variety of methods. For example, `method = "mahal"` labels cases with the most extreme Mahalanobis distances;
`method = "r"` selects points according to their value of `abs(y)`, which is
appropriate in residual plots.
* `car::dataEllipse()`: plots classical or robust data (using `MASS::cov/trob()`) ellipses for one or more groups, with the same facilities for point identification.
* `heplots::covEllipses()`: draws classical or robust data ellipses for one or more groups in a one-way design and optionally for the pooled total sample, where the focus is on homogeneity of within-group covariance matrices.
* `ggplot2::stat_ellipse()`: uses the calculation methods of `car::dataEllipse()` to add unfilled (`geom = "path"`) or filled (`geom = "polygon"`) data ellipses in a `ggplot` scatterplot, using inherited aesthetics. A minimal example follows.
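For example, a bare-bones `stat_ellipse()` version of a grouped data-ellipse plot for the salary data (my own sketch, not a figure from the text):

```r
# 68% normal-theory data ellipses for salary vs. years since PhD, by rank
ggplot(Salaries, aes(x = yrs.since.phd, y = salary, color = rank)) +
  geom_point() +
  stat_ellipse(type = "norm", level = 0.68, linewidth = 1.2)
```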
### Example: Canadian occupational prestige {#sec-prestige}
These examples use the data on the prestige of 102 occupational categories and other measures from the
1971 Canadian Census, recorded in `carData::Prestige`.[^prestige-src]
Our interest is in understanding how `prestige` (the @PineoPorter2008 prestige score for an occupational category, derived from a social survey)
is related to census measures of the average education, income, and percent of women among incumbents in those occupations.
Occupation `type` is a factor with levels `"bc"` (blue collar), `"wc"` (white collar) and `"prof"` (professional).
[^prestige-src]: The dataset was collected by Bernard Blishen, William Carroll and Catherine Moore, but apparently unpublished. A version updated to the 1981 census is described in @Blishen-etal-1987.
<!-- figure-code: R/prestige.R -->
```{r prestige}
data(Prestige, package="carData")
# `type` is really an ordered factor. Make it so.
Prestige$type <- ordered(Prestige$type,
levels=c("bc", "wc", "prof"))
str(Prestige)
```
I first illustrate the relation between `income` and `prestige` in @fig-Prestige-scatterplot-income1
using `car::scatterplot()`
with many of its bells and whistles, including marginal boxplots for the variables,
the linear regression line, loess smooth and the 68% data ellipse.
```{r}
#| label: fig-Prestige-scatterplot-income1
#| out-width: "80%"
#| fig-cap: !expr glue::glue("Scatterplot of prestige vs. income, showing the linear regression line ({red}), the loess smooth with a confidence envelope ({darkgreen}) and a 68% data ellipse. Points with the 4 largest $D^2$ values are labeled.")
scatterplot(prestige ~ income, data=Prestige,
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
```
There is a lot that can be seen here:
* `income` is positively skewed, as is often the case.
* The loess smooth, on the scale of income, shows `prestige` increasing up to $15,000 (these are 1971 incomes), and then leveling off.
* The data ellipse, centered at the means, encloses approximately 68% of the data points. It adds visual information about the correlation and precision of the linear regression; but here, the non-linear trend for higher incomes strongly suggests a different approach.
* The four points identified by their labels are those with the largest Mahalanobis distances. `scatterplot()` prints their labels to the console.
@fig-Prestige-scatterplot-educ1 shows a similar plot for education, which
from the boxplot appears to be reasonably symmetric. The smoothed curve is quite
close to the linear regression, according to which `prestige` increases
on average by
`coef(lm(prestige ~ education, data=Prestige))["education"]` =
`r coef(lm(prestige ~ education, data=Prestige))["education"]` points with each year of education.
```{r echo = -1}
#| label: fig-Prestige-scatterplot-educ1
#| out-width: "80%"
#| fig-cap: !expr glue::glue("Scatterplot of prestige vs. education, showing the linear regression line ({red}), the loess smooth with a confidence envelope ({darkgreen}) and a 68% data ellipse.")
par(mar = c(4,4,1,1)+.1)
scatterplot(prestige ~ education, data=Prestige,
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
```
In this plot, farmers, newsboys, file.clerks and physicians are identified as
noteworthy, for being furthest from the mean by Mahalanobis distance.
In relation to their typical level of education, these are mostly
understandable, but it is nice that farmers are rated as having higher prestige
than their level of education would predict.
Note that the `method` argument for point identification can take a vector
of case numbers indicating the points to be labeled. So, to
label the observations with large absolute standardized residuals
in the linear model `m`, you can use `method = which(abs(rstandard(m)) > 2)`.
```{r echo = -1}
#| label: fig-Prestige-scatterplot-educ2
#| out-width: "80%"
#| fig-cap: "Scatterplot of prestige vs. education, labeling points whose absolute standardized residual is > 2."
par(mar = c(4,4,1,1)+.1)
m <- lm(prestige ~ education, data=Prestige)
scatterplot(prestige ~ education, data=Prestige,
pch = 16, cex.lab = 1.25,
boxplots = FALSE,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "black",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = which(abs(rstandard(m))>2),
col="black", cex=1.2)) |> invisible()
```
#### Plotting on a log scale {#sec-log-scale}
A typical remedy for the non-linear relationship of income to prestige is to plot income on a log scale. This usually makes sense, and expresses a belief that a **multiplicative**
or **percentage** increase in income has a constant impact on prestige, as opposed to
the **additive** interpretation for income itself.
For example, the slope of the linear regression line in @fig-Prestige-scatterplot-income1
is given by `coef(lm(prestige ~ income, data=Prestige))["income"]` =
`r coef(lm(prestige ~ income, data=Prestige))["income"]`. Multiplying this by 1000
says that a \$1000 increase in `income` is associated with an average
increase in `prestige` of about 2.9 points.
In the plot below, `scatterplot(..., log = "x")` re-scales the x-axis to the
$\log_e()$ scale. The slope, `coef(lm(prestige ~ log(income), data=Prestige))["log(income)"]` =
`r coef(lm(prestige ~ log(income), data=Prestige))["log(income)"]`, says that a 1%
increase in income is associated with an average change of about 21.55 / 100
points in prestige.
<!-- removed: #| source-line-numbers: "2" -->
```{r echo = -1}
#| label: fig-Prestige-scatterplot2
#| out-width: "80%"
#| fig-cap: "Scatterplot of prestige vs. log(income)."
par(mar = c(4,4,1,1)+.1)
scatterplot(prestige ~ income, data=Prestige,
log = "x",
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen", col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
```
The smoothed curve in @fig-Prestige-scatterplot2
exhibits a slight tendency to bend upwards, but a linear relation is a reasonable approximation.
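A quick check of this interpretation (my own sketch, not code in the original): for a log-$x$ model, the expected change in prestige for a 1% increase in income is the slope times $\log(1.01)$, which is approximately the slope divided by 100.

```r
# Change in predicted prestige for a 1% increase in income
m_log <- lm(prestige ~ log(income), data = Prestige)
coef(m_log)["log(income)"] * log(1.01)   # roughly 21.55 / 100
```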
#### Stratifying {#sec-stratifying}
Before going further, it is instructive to ask what we could see in the relationship
between income and prestige if we stratified by type of occupation, fitting
separate regressions and smooths for blue collar, white collar and professional
incumbents in these occupations.
The formula `prestige ~ income | type` (read: income _given_ type)
is a natural way to specify grouping by `type`; separate linear regressions
and smooths are calculated for each group, applying the
color and point shapes specified by the `col` and `pch` arguments.
```{r}
#| label: fig-Prestige-scatterplot3
#| out-width: "80%"
#| fig-cap: "Scatterplot of prestige vs. income, stratified by occupational type. This implies a different interpretation, where occupation type is a moderator variable."
scatterplot(prestige ~ income | type, data=Prestige,
col = c("blue", "red", "darkgreen"),
pch = 15:17,
grid = FALSE,
legend = list(coords="bottomright"),
regLine = list(lwd=3),
smooth=list(smoother=loessLine,
var=FALSE, lwd.smooth=2, lty.smooth=1))
```
This visual analysis offers a different interpretation of the dependence of prestige
on income, which appeared to be non-linear when occupation type was ignored.
Instead, @fig-Prestige-scatterplot3 suggests an *interaction* of income by type.
In a model formula this would be expressed as one of:
```r
lm(prestige ~ income + type + income:type, data = Prestige)
lm(prestige ~ income * type, data = Prestige)
```
These models signify that there are different slopes (and intercepts) for the three
occupational types. In this interpretation, `type` is a moderator variable, with a different story.
The slopes of the fitted lines suggest that among blue collar workers, prestige
increases sharply with their income. For white collar and professional workers, there is still
an increasing relation of prestige with income, but the effect of income (slope) diminishes with
higher occupational category. A different plot entails a different story.
### Example: Penguins data {#sec-penguins}
```{r}
#| label: fig-penguin-species
#| echo: false
#| fig-align: center
#| out-width: "80%"
#| fig-cap: "Penguin species observed in the Palmer Archipelago. This is a cartoon, but it illustrates some features of penguin body size measurements, and the colors typically used for species. Image: Allison Horst"
knitr::include_graphics("images/penguins-horst.png")
```
The `penguins` dataset from the `r package("palmerpenguins", cite=TRUE)` provides further instructive examples of plots and analyses of multivariate data. The data consists of measurements of body size
(flipper length, body mass, bill length and depth)
of 344 penguins collected at the [Palmer Research Station](https://pallter.marine.rutgers.edu/) in Antarctica.
There were three different species of penguins (Adélie, Chinstrap & Gentoo)
collected from 3 islands in the Palmer Archipelago
between 2007--2009 [@Gorman2014]. The purpose was to examine differences in size or appearance of these species,
particularly differences among the sexes (sexual dimorphism) in relation to foraging and habitat.
Here, I use a slightly altered version of the dataset, `peng`, renaming variables to remove the units,
making factors of character variables and deleting a few cases with missing data.
```{r}
data(penguins, package = "palmerpenguins")
peng <- penguins |>
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
) |>
mutate(species = as.factor(species),
island = as.factor(island),
sex = as.factor(substr(sex,1,1))) |>
tidyr::drop_na()
str(peng)
```
There are quite a few variables to choose for illustrating data ellipses in scatterplots. Here I focus on
the measures of their bills, `bill_length` and `bill_depth` (indicating curvature) and show how to
use `ggplot2` for these plots.
I'll be using the penguins data quite a lot, so it is useful to set up custom colors like those
used in @fig-penguin-species, and shown in @fig-peng-colors with their color codes. These are shades of:
* `r colorize("Adelie", "orange")`: `r colorize("orange", "orange")`,
* `r colorize("Chinstrap", "purple")`: `r colorize("purple", "purple")`, and
* `r colorize("Gentoo", "green")`: `r colorize("green", "green")`.
```{r}
#| label: fig-peng-colors
#| echo: false
#| out-width: "70%"
#| fig-cap: Color palettes used for penguin species.
knitr::include_graphics("images/peng-colors.png")
```
To use these in `ggplot2` I define a function
`peng.colors()` that allows shades of light, medium and dark and then functions
`scale_*_penguins()` for color and fill.
```{r}
#| label: theme-penguins
#| code-fold: true
peng.colors <- function(shade=c("medium", "light", "dark")) {
shade = match.arg(shade)
# light medium dark
oranges <- c("#FDBF6F", "#F89D38", "#F37A00") # Adelie
purples <- c("#CAB2D6", "#9A78B8", "#6A3D9A") # Chinstrap
greens <- c("#B2DF8A", "#73C05B", "#33a02c") # Gentoo
cols.vec <- c(oranges, purples, greens)
cols.mat <-
matrix(cols.vec, 3, 3,
byrow = TRUE,
dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
shade = c("light", "medium", "dark")))
# get shaded colors
cols.mat[, shade ]
}
# define color and fill scales
scale_fill_penguins <- function(shade=c("medium", "light", "dark"), ...){
shade = match.arg(shade)
ggplot2::discrete_scale(
"fill","penguins",
scales:::manual_pal(values = peng.colors(shade)), ...)
}
scale_colour_penguins <- function(shade=c("medium", "light", "dark"), ...){
shade = match.arg(shade)
ggplot2::discrete_scale(
"colour","penguins",