---
editor:
  markdown:
    wrap: 72
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch03/")
```
::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::
# Plots of Multivariate Data {#sec-multivariate_plots}
> There is no excuse for failing to plot and look.
>
> The greatest value of a picture is when it forces us to notice what we never expected to see.
> --- John W. Tukey, _Exploratory Data Analysis_, 1977
These quotes from John Tukey remind us that data analysis should nearly always
start with graphs to help us understand the main features of our
data. It is important to understand the general _patterns_ and _trends_: Are relationships increasing or decreasing? Are they approximately linear or non-linear? But it is also important to spot
_anomalies_: "unusual" observations, groups of points that seem to differ from the rest, and so forth.
As we saw with Anscombe's quartet (@sec-anscombe), numerical summaries hide features that are immediately apparent in a plot.
This chapter introduces a
toolbox of basic graphical methods for visualizing multivariate
datasets. It starts with some simple techniques to enhance the basic
scatterplot with graphical _annotations_ such as fitted lines, curves and data
ellipses to _summarize_ the relation between two variables.
To visualize more than two variables, we can view all pairs of variables
in a scatterplot matrix or shift gears entirely to show multiple
variables along a set of parallel axes. As the number of variables
increases, we may need to suppress details with stronger summaries
for a high-level reconnaissance of our data terrain, as we do by zooming
out on a map. For example, we can simply remove the data points or make them nearly transparent
to focus on the visual summaries provided by fitted lines or other graphical summaries.
**Packages**
In this chapter I use the following packages. Load them now:
```{r load-ggally}
#| include: false
suppressPackageStartupMessages(library(GGally, quietly = TRUE))
```
```{r load-pkgs}
library(car)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(corrgram)
library(GGally)
library(ggdensity)
library(patchwork)
library(ggpcp)
library(tourr)
```
## Bivariate summaries {#sec-bivariate_summaries}
The basic scatterplot is the workhorse of multivariate data
visualization, showing how one variable, $y$, often an outcome, is
explained by or varies with another, $x$. It is a building block for
many useful techniques, so it is helpful to understand how it can be
used as a tool for thinking in a wider, multivariate context.
The essential idea is that we can start with a simple version of the
scatterplot and add annotations to show interesting features more
clearly. We consider the following here:
- **Smoothers**: Showing overall trends, perhaps in several forms, as
visual summaries such as fitted regression lines or curves and
nonparametric smoothers.
- **Stratifiers**: Using color, shape or other features to identify
subgroups; more generally, *conditioning* on other variables in
multi-panel displays.
- **Data ellipses**: A compact 2D visual summary of bivariate linear
relations and uncertainty assuming normality; more generally,
contour plots of bivariate density.
**Example: Academic salaries**
Let's start with data on the academic salaries of faculty members
collected at a U.S. college for the purpose of assessing salary
differences between male and female faculty members, and perhaps address
anomalies in compensation. The dataset `carData::Salaries` gives data on
nine-month salaries and other variables for 397 faculty members in the
2008-2009 academic year.
```{r Salaries}
data(Salaries, package = "carData")
str(Salaries)
```
The most obvious, but perhaps naive, predictor of `salary` is
`yrs.since.phd`. For simplicity, I'll refer to this as years of
"experience." Before looking at differences between males and females,
we would want to consider faculty `rank` (related also to `yrs.service`)
and `discipline`, recorded here as `"A"` ("theoretical" departments) or
`"B"` ("applied" departments). But, for a basic plot, we will ignore these
for now to focus on what can be learned from plot annotations.
<!-- figure-code: `R/Salaries-scatterplots.R` -->
```{r}
#| label: fig-Salaries-scat
#| out-width: "90%"
#| fig-cap: "Naive scatterplot of Salary vs. years since PhD, ignoring other variables, and without graphical annotations."
library(ggplot2)
gg1 <- ggplot(Salaries,
aes(x = yrs.since.phd, y = salary)) +
geom_jitter(size = 2) +
scale_y_continuous(labels = scales::dollar_format(
prefix="$", scale = 0.001, suffix = "K")) +
labs(x = "Years since PhD",
y = "Salary")
gg1 + geom_rug(position = "jitter", alpha = 1/4)
```
There is quite a lot we can see "just by looking" at this simple plot,
but the main things are:
- Salary generally increases from 0--40 years since the PhD, but then perhaps begins to drop off (partial retirement?);
- Variability in salary among those with the same experience increases with experience, a "fan-shaped" pattern that signals a violation of homogeneity of variance in simple regression;
- Data beyond 50 years are thin, but there are some quite low salaries there.

Adding rug plots to the X and Y axes is a simple but effective way to show the marginal distributions of the observations. Jitter and transparency help to avoid overplotting due to discrete values.
### Smoothers
Smoothers are among the most useful graphical annotations you can add to
such plots, giving a visual summary of how $y$ changes with $x$. The
most common smoother is a line showing the linear regression for $y$
given $x$, expressed in math notation as
$\mathbb{E} (y | x) = b_0 + b_1 x$. If there is doubt that a linear
relation is an adequate summary, you can try a quadratic or other
polynomial smoother.
In `r pkg("ggplot2")`, these are easily added to a plot using `geom_smooth()`
with `method = "lm"`, and a model `formula`, which (by default) is
`y ~ x` for a linear relation or `y ~ poly(x, k)` for a polynomial of
degree $k$.
```{r}
#| label: fig-Salaries-lm
#| out-width: "80%"
#| code-fold: show
#| fig-cap: !expr paste("Scatterplot of Salary vs. years since PhD, showing", colorize("linear", "red"),
#| "and", colorize("quadratic", "darkgreen"), "smooths with 95% confidence bands.")
gg1 +
geom_smooth(method = "lm", formula = "y ~ x",
color = "red", fill= "pink",
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ poly(x,2)",
color = "darkgreen", fill = "lightgreen",
linewidth = 2)
```
<!--# UA: This is a fantastic graph. My only (very minor) suggestion is to replace one of the colours because the most common type of colour vision deficiency makes it hard to tell the difference between red and green. So, maybe green and purple (purple would match the inline code highlight colour)-->
This serves to highlight some of our impressions from the basic
scatterplot shown in @fig-Salaries-scat, making them more apparent. And
that's precisely the point: the regression smoother draws attention to a
possible pattern that we can consider as a visual summary of the data.
You can think of this as showing what a linear (or quadratic) regression
"sees" in the data. Statistical tests <!--# (secref?) --> can help you decide if there is more evidence for a quadratic fit compared to the simpler linear relation. <!--# UA: Great paragraph!-->
It is useful to also show some indication of *uncertainty* (or
inversely, *precision*) associated with the predicted values. Both
the linear and quadratic trends are shown in @fig-Salaries-lm with 95%
pointwise confidence bands.[^pointwise] These are necessarily narrower in the center
of the range of $x$ where there is typically more data; they get wider
toward the highest values of experience where the data are thinner.
[^pointwise]:
    Confidence bands allow us to visualize the uncertainty around a fitted regression curve,
    which can be of two types: _pointwise intervals_ or _simultaneous intervals_.
    The default setting in `ggplot2::geom_smooth()` calculates pointwise intervals
    (using `stats::predict.lm(..., interval="confidence")`) at a confidence level $1-\alpha$ for the predicted response at _each value_ $x_i$ of a predictor. These have the frequentist interpretation that, over repeated sampling, only $100\,\alpha\%$ of the predictions at $x_i$ will fall outside that interval.
    In contrast, simultaneous intervals are calculated so that $1 - \alpha$ is the probability that _all of them_ cover their corresponding true values simultaneously. These are necessarily wider than pointwise intervals.
    Commonly used methods for constructing simultaneous confidence bands in regression are the Bonferroni and Scheffé methods, which control the family-wise error rate over all values of $x_i$.
    See [Confidence and prediction bands](https://en.wikipedia.org/wiki/Confidence_and_prediction_bands) for precise definitions of these terms.
    These are different from a _prediction band_, which represents the uncertainty about the value of a **new** observation on the curve, and so includes the additional variance of a single observation.
#### Non-parametric smoothers {.unnumbered}
The most generally useful idea is a smoother that tracks an average
value, $\mathbb{E} (y | x)$, of $y$ as $x$ varies across its range
*without* assuming any particular functional form, and so avoiding the
necessity to choose among `y ~ poly(x, 1)`, or `y ~ poly(x, 2)`, or
`y ~ poly(x, 3)`, etc.
Non-parametric smoothers attempt to estimate $\mathbb{E} (y | x) = f(x)$
where $f(x)$ is some smooth function. These typically use a collection
of weighted *local regressions* for each $x_i$ within a window centered
at that value. In the method called *lowess* or *loess* [@Cleveland:79;
@ClevelandDevlin:88], a weight function is applied, giving greatest
weight to $x_i$ and a weight of 0 outside a window containing a certain fraction, $s$, called *span*, of the nearest neighbors of $x_i$. The fraction, $s$, is usually within the range $1/3 \le s \le 2/3$, and it determines the
smoothness of the resulting curve; smaller values produce a wigglier
curve and larger values give a smoother fit. (An optimal
span can be determined by $k$-fold cross-validation to minimize a
measure of overall error of approximation, as sketched below.)
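Here is a minimal sketch of that idea (my own illustration, not code from the text): choosing the loess span for the salary data by $k$-fold cross-validation. The helper `cv_loess_span()`, the span grid, and the number of folds are arbitrary choices made only for illustration.

```r
# Sketch: choose a loess span by k-fold cross-validation.
# The span grid and fold count are arbitrary illustrative choices.
cv_loess_span <- function(spans = seq(0.3, 0.9, by = 0.1), k = 5, data = Salaries) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  cv_mse <- sapply(spans, function(s) {
    mean(sapply(1:k, function(i) {
      train <- data[folds != i, ]
      test  <- data[folds == i, ]
      fit <- loess(salary ~ yrs.since.phd, data = train, span = s,
                   control = loess.control(surface = "direct"))
      mean((test$salary - predict(fit, newdata = test))^2, na.rm = TRUE)
    }))
  })
  data.frame(span = spans, cv_mse = cv_mse)
}
set.seed(42)
cv_loess_span()   # the span with the smallest cv_mse is the "optimal" one
```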
Non-parametric regression is a broad topic; see @Fox:2016:ARA, Ch. 18 for
a more general treatment including smoothing splines, and @Wood:2006 for generalized additive models,
fit using `method = "gam"` in **ggplot2**, which is the default when the
largest group has more than 1,000 observations.
@fig-Salaries-loess shows the addition of a loess smooth to the plot in
@fig-Salaries-lm, suppressing the confidence band for the linear
regression. The loess fit is nearly coincident with the quadratic fit,
but has a slightly wider confidence band.
```{r}
#| label: fig-Salaries-loess
#| out-width: "80%"
#| code-fold: show
#| fig-cap: !expr glue::glue("Scatterplot of Salary vs. years since PhD, adding the loess smooth. The loess smooth curve and confidence band in {green} is nearly indistinguishable from a quadratic fit in {blue}.")
gg1 +
geom_smooth(method = "loess", formula = "y ~ x",
color = "blue", fill = scales::muted("blue"),
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ x", se = FALSE,
color = "red",
linewidth = 2) +
geom_smooth(method = "lm", formula = "y ~ poly(x,2)",
color = "darkgreen", fill = "lightgreen",
linewidth = 2)
```
But now comes an important question: is it reasonable that academic
salary should increase up to about 40 years since the PhD degree and
then decline? The predicted salary for someone still working 50 years
after earning their degree is about the same as a person at 15 years.
What else is going on here?
### Stratifiers
Very often, we have a main relationship of interest, but various groups
in the data are identified by discrete factors (like faculty `rank`,
`sex`, and type of `discipline` here), or there are quantitative
predictors for which the main relation might vary. In the language of
statistical models such effects are *interaction* terms, as in
`y ~ group + x + group:x`, where the term `group:x` fits a different
slope for each group and the grouping variable is often called a
*moderator* variable. Common moderator variables are ethnicity, health
status, social class and level of education. Moderators can also be
continuous variables as in `y ~ x1 + x2 + x1:x2`.
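For instance, a hypothetical version of such a model for the salary data (a sketch of my own, not a model fit in the text) treats `rank` as the moderator of the effect of years since the PhD:

```r
# Sketch: rank as a moderator of the effect of yrs.since.phd.
# The term rank:yrs.since.phd allows a different slope for each rank.
m_mod <- lm(salary ~ rank + yrs.since.phd + rank:yrs.since.phd, data = Salaries)
coef(m_mod)
```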
I call these *stratifiers*, recognizing that we should consider breaking
down the overall relation to see whether and how it changes over such
"other" variables. Such variables are most often factors, but we can cut
a continuous variable into ranges (*shingles*) and do the same
graphically. There are two general stratifying graphical techniques:
- **Grouping**: Identify subgroups in the data by assigning different
visual attributes, such as color, shape, line style, etc. within a
single plot. This is quite natural for factors; quantitative
predictors can be accommodated by cutting their range into ordered
intervals. Grouping has the
advantage that the levels of a grouping variable can be shown within
the same plot, facilitating direct comparison.
- **Conditioning**: Showing subgroups in different plot panels. This
has the advantages that relations for the individual groups are more
easily discerned and one can stratify by two (or more) other
variables jointly; but visual comparison is more difficult because
the eye must scan from one panel to another.
::: {.callout-note title="History Corner"}
Recognition of the roles of visual grouping by factors within a panel
and conditioning in multi-panel displays was an important advance in the
development of modern statistical graphics. It began at A.T.&T. Bell
Labs in Murray Hill, NJ in conjunction with the **S** language, the
mother of R.
Conditioning displays (originally called *coplots*
[@ChambersHastie1991]) are simply a collection of 1D, 2D or 3D plots
in separate panels for subsets of the data broken down by one or more
factors, or, for quantitative variables, subdivided into a factor with
several overlapping intervals (*shingles*). The first implementation was
in *Trellis* plots [@Becker:1996:VDC;@Cleveland:85].
Trellis displays were extended in the `r package("lattice", cite=TRUE)`,
which offered:
- A **graphing syntax** similar to that used in statistical model
formulas: `y ~ x | g` conditions the data by the levels of `g`, with
`|` read as "given"; two or more conditioning variables are specified as
`y ~ x | g1 + g2 + ...`, with `+` read as "and".
- **Panel functions** define what is plotted in a given panel.
`panel.xyplot()` is the default for scatterplots, plotting points,
but you can add `panel.lmline()` for regression lines,
`latticeExtra::panel.smoother()` for loess smooths and a wide
variety of others. A brief sketch of this syntax follows.
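As a small sketch (my own, not code used in the text) of how this syntax looks in practice, using the salary data:

```r
# Sketch of lattice's formula syntax and panel functions:
# salary vs. years since PhD, conditioned on rank, with a regression
# line and a loess smooth added in each panel.
library(lattice)
xyplot(salary ~ yrs.since.phd | rank, data = Salaries,
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)            # points
         panel.lmline(x, y, col = "red")    # linear regression line
         panel.loess(x, y, col = "blue")    # loess smooth
       })
```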
The `r package("car", cite=TRUE)` supports this graphing syntax in many of
its functions. `r pkg("ggplot2")` does not; it uses aesthetics (`aes()`), which
map variables in the data to visual characteristics in displays.
:::
The most obvious variable that affects academic salary is `rank`,
because faculty typically get an increase in salary with a promotion
that carries through in their future salary. What can we see if we group
by `rank` and fit a separate smoothed curve for each?
In `ggplot2` thinking, grouping is accomplished simply by adding an
aesthetic, such as `color = rank`. What happens then is that points,
lines, smooths and other `geom_*()` inherit the feature that they are
differentiated by color. In the case of `geom_smooth()`, we get a
separate fit for each subset of the data, according to `rank`.
```{r}
#| label: fig-Salaries-rank
#| out-width: "80%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by rank."
# make some re-useable pieces to avoid repetitions
scale_salary <- scale_y_continuous(
labels = scales::dollar_format(prefix="$",
scale = 0.001,
suffix = "K"))
# position the legend inside the plot
legend_pos <- theme(legend.position = "inside",
legend.position.inside = c(.1, 0.95),
legend.justification = c(0, 1))
ggplot(Salaries,
aes(x = yrs.since.phd, y = salary,
color = rank, shape = rank)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "loess", formula = "y ~ x",
linewidth = 2) +
legend_pos
```
Well, there is a different story here. Salaries generally occupy
separate vertical levels, increasing with academic rank. The horizontal extents
of the smoothed curves show their ranges. Within each rank there is some
initial increase after promotion, and then some tendency to decline with
increasing years. But, by and large, years since the PhD doesn't make
as much difference once we've taken academic rank into account.
What about `discipline`, which is classified, perhaps peculiarly, as
"theoretical" vs. "applied"? The values are just `"A"` and `"B"`,
so I map these to more meaningful labels before making the plot.
```{r}
#| label: fig-Salaries-discipline
#| out-width: "80%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by discipline."
Salaries <- Salaries |>
mutate(discipline =
factor(discipline,
labels = c("A: Theoretical", "B: Applied")))
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary, color = discipline)) +
geom_point() +
scale_salary +
geom_smooth(aes(fill = discipline ),
method = "loess", formula = "y ~ x",
linewidth = 2) +
labs(x = "Years since PhD",
y = "Salary") +
legend_pos
```
The story in @fig-Salaries-discipline is again different. Faculty in
applied disciplines earn, on average, about \$10,000 more per year
than their theoretical colleagues.
```{r discipline-means}
Salaries |>
group_by(discipline) |>
summarize(mean = mean(salary))
```
For both groups, there is an approximately linear relation up to about
30--40 years, but the smoothed curves then diverge into the region
where the data is thinner.
This result is more surprising than
differences among faculty ranks. The effect of annotation with
smoothed curves as visual summaries is apparent, and provides a stimulus
to think about _why_ these differences (if they are real) exist
between theoretical and applied professors, and whether theoreticians
perhaps _should_ be paid more!
### Conditioning
The previous plots use grouping by color to plot the data for different
subsets inside the same plot window, making comparison among groups
easier, because they can be directly compared along a common vertical
scale [^03-multivariate_plots-1]. This gets messy, however, when there are
more than just a few levels, or worse---when there are two (or more)
variables for which we want to show separate effects. In such cases, we
can plot separate panels using the `ggplot2` concept of *faceting*.
There are two options: `facet_wrap()` takes one or more conditioning
variables and produces a ribbon of plots for each combination of levels;
`facet_grid(row ~ col)` takes two or more conditioning variables and
arranges the plots in a 2D array identified by the `row` and `col`
variables.
[^03-multivariate_plots-1]: The classic studies by
    @ClevelandMcGill:84b and @ClevelandMcGill:85 show that judgments of
    magnitude along a common scale are more accurate than those along
    separate, aligned scales.
Let's look at salary broken down by the combinations of discipline and
rank. Here, I chose to stratify using color by rank within each of the
panels, faceting by discipline. Because there is more going on in this
plot, a linear smooth is used to represent the trend.
```{r}
#| label: fig-Salaries-faceted
#| out-width: "100%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by rank, with separate panels for discipline."
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary,
color = rank, shape = rank)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = rank),
method = "lm", formula = "y ~ x",
linewidth = 2, alpha = 1/4) +
facet_wrap(~ discipline) +
legend_pos
```
Once both of these factors are taken into account, there does not seem
to be much impact of years of service. Salaries in applied
disciplines are noticeably greater than those in theoretical disciplines at
all ranks, and there are even greater differences among ranks.
Finally, to shed light on the question that motivated this example---
are there anomalous differences in salary for men and women--- we can
look at differences in salary according to sex, when discipline and rank
are taken into account. To do this graphically, condition by both
variables, but use `facet_grid(discipline ~ rank)` to arrange their
combinations in a grid whose rows are the levels of `discipline` and columns are those of `rank`. I want to make the comparison of
males and females most direct, so I use `color = sex` to stratify the
panels. The smoothed regression lines and error bands are calculated
separately for each combination of discipline, rank and sex.
```{r}
#| label: fig-Salaries-facet-sex
#| out-width: "100%"
#| code-fold: show
#| fig-cap: "Scatterplot of Salary vs. years since PhD, grouped by sex, faceted by discipline and rank."
Salaries |>
ggplot(aes(x = yrs.since.phd, y = salary, color = sex)) +
geom_point() +
scale_salary +
labs(x = "Years since PhD",
y = "Salary") +
geom_smooth(aes(fill = sex),
method = "lm", formula = "y ~ x",
linewidth = 2, alpha = 1/4) +
facet_grid(discipline ~ rank) +
theme_bw(base_size = 14) +
legend_pos
```
## Data Ellipses {#sec-data-ellipse}
The _data ellipse_ [@Monette:90], or _concentration ellipse_ [@Dempster:69] is a
remarkably simple and effective display for viewing and understanding
bivariate relationships in multivariate data.
The data ellipse is typically used to add a visual summary to a scatterplot
that shows, all together, the means, standard deviations, correlation,
and slope of the regression line for two variables, perhaps stratified by another variable.
Under the classical assumption that the data are bivariate normally distributed,
the data ellipse is also a **sufficient** visual summary, in the sense that
it captures **all** relevant features of the data.
See @Friendly-etal:ellipses:2013 for a complete discussion of the role of
ellipsoids in statistical data visualization.
It is based on the idea that in a bivariate normal distribution, the contours
of equal probability form a series of concentric ellipses. If the variables were
uncorrelated and had the same variances, these would be circles, and Euclidean
distance would measure the distance of each observation from the mean.
When the variables are correlated, a different measure, _Mahalanobis distance_,
is the proper measure of how far a point is from the mean, taking the correlation
into account.
```{r}
#| label: fig-mahalanobis
#| echo: false
#| fig-align: center
#| out-width: "60%"
#| fig-cap: "2D data with curves of constant distance from the centroid. The blue solid ellipse shows a contour of constant Mahalanobis distance, taking the correlation into account; the dashed red circle is a contour of equal Euclidean distance. Given the data points, Which of the points **A** and **B** is further from the mean (X)? _Source_: Re-drawn from [Ou Zhang](https://ouzhang.rbind.io/2020/11/16/outliers-part4/)"
knitr::include_graphics("images/mahalanobis.png")
```
<!--
This doesn't work
#| fig-cap: !expr paste("2D data with curves of constant distance from the centroid. The", colorize('blue'), "solid ellipse shows a contour of constant Mahalanobis distance, taking the correlation into account; the dashed", colorize('blue'), "circle is a contour of equal Euclidean distance. Given the data points, Which of the points **A** and **B** is further from the mean (X)? _Source_: Re-drawn from [Ou Zhang](https://ouzhang.rbind.io/2020/11/16/outliers-part4/)")
-->
To illustrate, @fig-mahalanobis shows a scatterplot with labels for two points, "A" and "B".
Which is further from the mean, "X"?
A contour of constant Euclidean distance, shown by the `r colorize("red")` dashed circle,
ignores the apparent negative correlation, so point "A" is further.
The `r colorize("blue")` ellipse for Mahalanobis distance
takes the correlation into account, so point "B" has a greater distance from the mean.
Mathematically, Euclidean (squared) distance for $p$ variables, $j = 1, 2, \dots , p$,
is just a generalization of
the square of a univariate standardized ($z$) score, $z^2 = [(y - \bar{y}) / s]^2$,
$$
D_E^2 (\mathbf{y}) = \sum_j^p z_j^2 = \mathbf{z}^\textsf{T} \mathbf{z} = (\mathbf{y} - \bar{\mathbf{y}})^\textsf{T} \operatorname{diag}(\mathbf{S})^{-1} (\mathbf{y} - \bar{\mathbf{y}}) \; ,
$$
where $\mathbf{S}$ is the sample variance-covariance matrix,
$\mathbf{S} = ({n-1})^{-1} \sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}}) (\mathbf{y}_i - \bar{\mathbf{y}})^\textsf{T}$.
Mahalanobis' distance takes the correlations into account simply by using the covariances
as well as the variances,
$$
D_M^2 (\mathbf{y}) = (\mathbf{y} - \bar{\mathbf{y}})^\mathsf{T} \mathbf{S}^{-1} (\mathbf{y} - \bar{\mathbf{y}}) \; .
$$ {#eq-Dsq}
In @eq-Dsq, the inverse $\mathbf{S}^{-1}$ serves to "divide" the matrix $(\mathbf{y} - \bar{\mathbf{y}}) (\mathbf{y} - \bar{\mathbf{y}})^\mathsf{T}$ of squared deviations and cross-products
by the variances (and covariances) of the variables, as in the univariate case.
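As a quick numerical illustration (my own sketch, using the salary data already loaded), base R's `mahalanobis()` computes $D_M^2$ for each observation; replacing $\mathbf{S}$ by its diagonal gives the standardized Euclidean version that ignores the covariances.

```r
# D^2_M for each observation, vs. the version that ignores covariances
X  <- as.matrix(Salaries[, c("yrs.since.phd", "salary")])
S  <- cov(X)
D2_M <- mahalanobis(X, center = colMeans(X), cov = S)
D2_E <- mahalanobis(X, center = colMeans(X), cov = diag(diag(S)))
head(cbind(D2_M, D2_E))
```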
For $p$ variables, the data _ellipsoid_ $\mathcal{E}_c$ of
size $c$ is a $p$-dimensional ellipsoid,
defined as the set of points $\mathbf{y} = (y_1, y_2, \dots y_p)$
whose squared Mahalanobis distance, $D_M^2 ( \mathbf{y} )$ is less than or equal
to $c^2$,
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) := \{ D_M^2 (\mathbf{y}) \le c^2 \} \; .
$$
A computational definition recognizes that the boundary of the ellipsoid can be found by transforming
a unit sphere $\mathcal{P}$
centered at the origin, $\mathcal{P} : \{ \mathbf{x}^\textsf{T} \mathbf{x}= 1\}$, by $\mathbf{S}^{1/2}$
and then shifting that to the centroid of the data,
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) = \bar{\mathbf{y}} \; \oplus \; \mathbf{S}^{1/2} \, \mathcal{P} \comma
$$
where $\mathbf{S}^{1/2}$ represents a rotation and scaling and the notation $\oplus$ represents translation to a new centroid, $\bar{\mathbf{y}}$ here. The matrix $\mathbf{S}^{1/2}$ is commonly computed
as the Choleski factor of $\mathbf{S}$. Slightly abusing notation and taking the unit sphere as given (like an identity matrix $\mathbf{I}$),
we can write the data ellipsoid as simply:
$$
\mathcal{E}_c (\bar{\mathbf{y}}, \mathbf{S}) = \bar{\mathbf{y}} \; \oplus \; c\, \sqrt{\mathbf{S}} \period
$$ {#eq-ellE}
When $\mathbf{y}$ is (at least approximately) bivariate normal,
$D_M^2(\mathbf{y})$ has a large-sample $\chi^2_2$ distribution
($\chi^2$ with 2 df),
so
* $c^2 = \chi^2_2 (0.68) = 2.28$ gives a "1 standard deviation
bivariate ellipse,"
an analog of the standard interval $\bar{y} \pm 1 s$, while
* $c^2 = \chi^2_2 (0.95) = 5.99 \approx 6$ gives a data ellipse of
95\% coverage.
In smaller samples, the radius $c$ of the ellipsoid is better approximated using a multiple of an $F_{p, n-p}$ distribution, becoming $c =\sqrt{ 2 F_{2, n-2}^{1-\alpha} }$
in the bivariate case ($p=2$) for coverage $1-\alpha$.
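To connect @eq-ellE to computation, here is a minimal sketch (my own code, using the salary data, not a figure from the text) that constructs an approximate 68% data ellipse by transforming unit-circle points with the Cholesky factor of $\mathbf{S}$ and shifting them to the centroid.

```r
# Build a ~68% data ellipse "by hand" following eq-ellE
X   <- as.matrix(Salaries[, c("yrs.since.phd", "salary")])
mu  <- colMeans(X)
S   <- cov(X)
rad <- sqrt(qchisq(0.68, df = 2))              # radius c for ~68% coverage
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- cbind(cos(theta), sin(theta))        # points on the unit circle
ellipse <- sweep(rad * circle %*% chol(S), 2, mu, "+")  # c * S^{1/2} P, shifted to y-bar
plot(X, xlab = "Years since PhD", ylab = "Salary")
lines(ellipse, col = "blue", lwd = 2)
```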
### Ellipse properties
The essential ideas of correlation and regression and their relation to ellipses go back to
@Galton:1886.
Galton's goal was to predict (or explain) how a heritable trait, $Y$, (e.g.,
height) of children was related to that of their parents, $X$.
He made a semi-graphic table of the frequencies of 928 observations of the average
height of father and mother versus the height of their child, shown in @fig-galton-corr.
He then drew smoothed contour lines of equal frequencies and had the wonderful
visual insight that these formed concentric shapes that were tolerably close to ellipses.
He then calculated summaries, $\text{Ave}(Y | X)$, and, for symmetry, $\text{Ave}(X | Y)$, and plotted these as lines of means on his diagram. Lo and behold, he had a second visual
insight: the lines of means of ($Y | X$) and ($X | Y$) corresponded approximately to
the loci of horizontal and vertical tangents to the concentric ellipses.
To complete the picture, he added lines showing the major and minor axes of the
family of ellipses (which turned out to be the principal components) with the result shown in @fig-galton-corr.
```{r}
#| label: fig-galton-corr
#| echo: false
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Galton's 1886 diagram, showing the relationship of height of children to the average of their parents' height. The diagram is essentially an overlay of a geometrical interpretation on a bivariate grouped frequency distribution, shown as numbers."
knitr::include_graphics("images/galton-corr.jpg")
```
For two variables, $x$ and $y$, the remarkable properties of the data ellipse are illustrated in @fig-galton-ellipse-r, a modern reconstruction of Galton's diagram.
```{r}
#| label: fig-galton-ellipse-r
#| echo: false
#| fig-align: center
#| out-width: "100%"
#| fig-cap: "Sunflower plot of Galton's data on heights of parents and their children (in.), with
#| 40%, 68% and 95% data ellipses and the regression lines of $y$ on $x$ (black) and
#| $x$ on $y$ (grey)."
knitr::include_graphics("images/galton-ellipse-r.jpg")
```
* The ellipses have the mean vector $(\bar{x}, \bar{y})$ as their center.
* The lengths of the arms of the `r colorize("blue")` dashed central cross
show the standard deviations of the variables, which correspond to the shadows of the ellipse covering 40\% of the data. These are the bivariate analogs of
the standard intervals $\bar{x} \pm 1 s_x$ and $\bar{y} \pm 1 s_y$.
* More generally, shadows (projections) on the coordinate axes, or any linear combination of them,
give any standard interval,
$\bar{x} \pm k s_x$ and $\bar{y} \pm k s_y$.
Those with $k=1, 1.5, 2.45$ have
bivariate coverage 40%, 68% and 95% respectively, corresponding to these quantiles of the $\chi^2$ distribution
with 2 degrees of freedom, i.e.,
$\chi^2_2 (.40) \approx 1^2$,
$\chi^2_2 (.68) \approx 1.5^2$, and
$\chi^2_2 (.95) \approx 2.45^2$.
The shadows of the 68% ellipse are the bivariate analog of a univariate $\bar{x} \pm 1 s_x$ interval.
<!--# and univariate coverage 68\%, 87\% and 98.6\% respectively. -->
* The regression line predicting $y$ from $x$ goes through the points where the ellipses have vertical tangents. The _other_ regression line, predicting $x$ from $y$ goes through the points of horizontal
tangency.
* The correlation $r(x, y)$ is the ratio of the vertical segment from the mean of $y$ to the regression line to the vertical segment going to the top of the ellipse as shown at the right of the figure. It is
$r = 0.46$ in this example.
* The residual standard deviation, $s_e = \sqrt{\text{MSE}} = \sqrt{\Sigma (y - \hat{y})^2 / (n-2)}$,
is the half-length of the ellipse at the mean $\bar{x}$.
Because Galton's values of `parent` and `child` height were recorded in class intervals of 1 in.,
they are shown as sunflower symbols in @fig-galton-ellipse-r,
with multiple 'petals' reflecting the number of observations
at each location. This plot (except for annotations) is constructed using `sunflowerplot()` and
`car::dataEllipse()` for the ellipses.
```{r}
#| eval: false
data(Galton, package = "HistData")
sunflowerplot(parent ~ child, data=Galton,
xlim=c(61,75),
ylim=c(61,75),
seg.col="black",
xlab="Child height",
ylab="Mid Parent height")
y.x <- lm(parent ~ child, data=Galton) # regression of y on x
abline(y.x, lwd=2)
x.y <- lm(child ~ parent, data=Galton) # regression of x on y
cc <- coef(x.y)
abline(-cc[1]/cc[2], 1/cc[2], lwd=2, col="gray")
with(Galton,
car::dataEllipse(child, parent,
plot.points=FALSE,
levels=c(0.40, 0.68, 0.95),
lty=1:3)
)
```
Finally, as Galton noted in his diagram, the principal major and minor axes of the ellipse have important statistical properties. @Pearson:1901 would later show that
their directions are determined by the eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \dots$ of the covariance matrix $\mathbf{S}$ and their radii by the
square roots, $\sqrt{\lambda_1}, \sqrt{\lambda_2}, \dots$, of the corresponding
eigenvalues.
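A small sketch (my own, not code from the text) makes this concrete for Galton's data: the eigenvectors of $\mathbf{S}$ give the directions of the major and minor axes, and the square roots of the eigenvalues give their relative radii.

```r
# Principal axes of the data ellipse from the eigen decomposition of S
data(Galton, package = "HistData")
S <- cov(Galton[, c("child", "parent")])
e <- eigen(S)
e$vectors        # columns: directions of the major and minor axes
sqrt(e$values)   # relative radii (multiplied by c for a given coverage)
```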
### R functions for data ellipses
A number of packages provide functions for drawing data ellipses in a scatterplot, with various features.
* `car::scatterplot()`: uses base R graphics to draw 2D scatterplots, with a wide variety of plot enhancements including linear and non-parametric smoothers (loess, gam), a formula method, e.g., `y ~ x | group`, and marking points and lines using symbol shape,
color, etc. Importantly, the `r pkg("car")` package generally allows automatic identification of "noteworthy" points by their labels in the plot using a variety of methods. For example, `method = "mahal"` labels cases with the most extreme Mahalanobis distances;
`method = "r"` selects points according to their value of `abs(y)`, which is
appropriate in residual plots.
* `car::dataEllipse()`: plots classical or robust data (using `MASS::cov/trob()`) ellipses for one or more groups, with the same facilities for point identification.
* `heplots::covEllipses()`: draws classical or robust data ellipses for one or more groups in a one-way design and optionally for the pooled total sample, where the focus is on homogeneity of within-group covariance matrices.
* `ggplot2::stat_ellipse()`: uses the calculation methods of `car::dataEllipse()` to add unfilled (`geom = "path"`) or filled (`geom = "polygon"`) data ellipses in a `ggplot` scatterplot, using inherited aesthetics. A minimal example follows.
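For example, a bare-bones `stat_ellipse()` version of a grouped data-ellipse plot for the salary data (my own sketch, not a figure from the text):

```r
# 68% normal-theory data ellipses for salary vs. years since PhD, by rank
ggplot(Salaries, aes(x = yrs.since.phd, y = salary, color = rank)) +
  geom_point() +
  stat_ellipse(type = "norm", level = 0.68, linewidth = 1.2)
```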
### Example: Canadian occupational prestige {#sec-prestige}
These examples use the data on the prestige of 102 occupational categories and other measures from the
1971 Canadian Census, recorded in `carData::Prestige`.[^prestige-src]
Our interest is in understanding how `prestige` (the @PineoPorter2008 prestige score for an occupational category, derived from a social survey)
is related to census measures of the average education, income, and percent of women among incumbents in those occupations.
Occupation `type` is a factor with levels `"bc"` (blue collar), `"wc"` (white collar) and `"prof"` (professional).
[^prestige-src]: The dataset was collected by Bernard Blishen, William Carroll and Catherine Moore, but apparently unpublished. A version updated to the 1981 census is described in @Blishen-etal-1987.
<!-- figure-code: R/prestige.R -->
```{r prestige}
data(Prestige, package="carData")
# `type` is really an ordered factor. Make it so.
Prestige$type <- ordered(Prestige$type,
levels=c("bc", "wc", "prof"))
str(Prestige)
```
I first illustrate the relation between `income` and `prestige` in @fig-Prestige-scatterplot-income1
using `car::scatterplot()`
with many of its bells and whistles, including marginal boxplots for the variables,
the linear regression line, loess smooth and the 68% data ellipse.
```{r}
#| label: fig-Prestige-scatterplot-income1
#| out-width: "80%"
#| fig-cap: !expr glue::glue("Scatterplot of prestige vs. income, showing the linear regression line ({red}), the loess smooth with a confidence envelope ({darkgreen}) and a 68% data ellipse. Points with the 4 largest $D^2$ values are labeled.")
scatterplot(prestige ~ income, data=Prestige,
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
```
There is a lot that can be seen here:
* `income` is positively skewed, as is often the case.
* The loess smooth, on the scale of income, shows `prestige` increasing up to $15,000 (these are 1971 incomes), and then leveling off.
* The data ellipse, centered at the means, encloses approximately 68% of the data points. It adds visual information about the correlation and precision of the linear regression; but here, the non-linear trend for higher incomes strongly suggests a different approach.
* The four points identified by their labels are those with the largest Mahalanobis distances. `scatterplot()` prints their labels to the console.
@fig-Prestige-scatterplot-educ1 shows a similar plot for education, which
from the boxplot appears to be reasonably symmetric. The smoothed curve is quite
close to the linear regression, according to which `prestige` increases
on average by
`coef(lm(prestige ~ education, data=Prestige))["education"]` =
`r coef(lm(prestige ~ education, data=Prestige))["education"]` points with each year of education.
```{r echo = -1}
#| label: fig-Prestige-scatterplot-educ1
#| out-width: "80%"
#| fig-cap: !expr glue::glue("Scatterplot of prestige vs. education, showing the linear regression line ({red}), the loess smooth with a confidence envelope ({darkgreen}) and a 68% data ellipse.")
par(mar = c(4,4,1,1)+.1)
scatterplot(prestige ~ education, data=Prestige,
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
```
In this plot, farmers, newsboys, file.clerks and physicians are identified as
noteworthy, for being furthest from the mean by Mahalanobis distance.
In relation to their typical level of education, these are mostly
understandable, but it is nice that farmers are rated as having higher prestige
than their level of education would predict.
Note that the `method` argument for point identification can take a vector
of case numbers indicating the points to be labeled. So, to
label the observations with large absolute standardized residuals
in the linear model `m`, you can use `method = which(abs(rstandard(m)) > 2)`.
```{r echo = -1}
#| label: fig-Prestige-scatterplot-educ2
#| out-width: "80%"
#| fig-cap: "Scatterplot of prestige vs. education, labeling points whose absolute standardized residual is > 2."
par(mar = c(4,4,1,1)+.1)
m <- lm(prestige ~ education, data=Prestige)
scatterplot(prestige ~ education, data=Prestige,
pch = 16, cex.lab = 1.25,
boxplots = FALSE,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "black",
col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = which(abs(rstandard(m))>2),
col="black", cex=1.2)) |> invisible()
```
#### Plotting on a log scale {#sec-log-scale}
A typical remedy for the non-linear relationship of income to prestige is to plot income on a log scale. This usually makes sense, and expresses a belief that a **multiplicative**
or **percentage** increase in income has a constant impact on prestige, as opposed to
the **additive** interpretation for income itself.
For example, the slope of the linear regression line in @fig-Prestige-scatterplot-income1
is given by `coef(lm(prestige ~ income, data=Prestige))["income"]` =
`r coef(lm(prestige ~ income, data=Prestige))["income"]`. Multiplying this by 1000
says that a \$1000 increase in `income` is associated with an average
increase in `prestige` of about 2.9 points.
In the plot below, `scatterplot(..., log = "x")` re-scales the x-axis to the
$\log_e()$ scale. The slope, `coef(lm(prestige ~ log(income), data=Prestige))["log(income)"]` =
`r coef(lm(prestige ~ log(income), data=Prestige))["log(income)"]`, says that a 1%
increase in income is associated with an average change of about 21.55 / 100
points in prestige.
<!-- removed: #| source-line-numbers: "2" -->
```{r echo = -1}
#| label: fig-Prestige-scatterplot2
#| out-width: "80%"
#| fig-cap: "Scatterplot of prestige vs. log(income)."
par(mar = c(4,4,1,1)+.1)
scatterplot(prestige ~ income, data=Prestige,
log = "x",
pch = 16, cex.lab = 1.25,
regLine = list(col = "red", lwd=3),
smooth = list(smoother=loessLine,
lty.smooth = 1, lwd.smooth=3,
col.smooth = "darkgreen", col.var = "darkgreen"),
ellipse = list(levels = 0.68),
id = list(n=4, method = "mahal", col="black", cex=1.2))
```
The smoothed curve in @fig-Prestige-scatterplot2
exhibits a slight tendency to bend upwards, but a linear relation is a reasonable approximation.
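A quick check of this interpretation (my own sketch, not code in the original): for a log-$x$ model, the expected change in prestige for a 1% increase in income is the slope times $\log(1.01)$, which is approximately the slope divided by 100.

```r
# Change in predicted prestige for a 1% increase in income
m_log <- lm(prestige ~ log(income), data = Prestige)
coef(m_log)["log(income)"] * log(1.01)   # roughly 21.55 / 100
```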
#### Stratifying {#sec-stratifying}
Before going further, it is instructive to ask what we could see in the relationship
between income and prestige if we stratified by type of occupation, fitting
separate regressions and smooths for blue collar, white collar and professional
incumbents in these occupations.
The formula `prestige ~ income | type` (read: income _given_ type)
is a natural way to specify grouping by `type`; separate linear regressions
and smooths are calculated for each group, applying the
color and point shapes specified by the `col` and `pch` arguments.
```{r}
#| label: fig-Prestige-scatterplot3
#| out-width: "80%"
#| fig-cap: "Scatterplot of prestige vs. income, stratified by occupational type. This implies a different interpretation, where occupation type is a moderator variable."
scatterplot(prestige ~ income | type, data=Prestige,
col = c("blue", "red", "darkgreen"),
pch = 15:17,
grid = FALSE,
legend = list(coords="bottomright"),
regLine = list(lwd=3),
smooth=list(smoother=loessLine,
var=FALSE, lwd.smooth=2, lty.smooth=1))
```
This visual analysis offers a different interpretation of the dependence of prestige
on income, which appeared to be non-linear when occupation type was ignored.
Instead, @fig-Prestige-scatterplot3 suggests an *interaction* of income by type.
In a model formula this would be expressed as one of:
```r
lm(prestige ~ income + type + income:type, data = Prestige)
lm(prestige ~ income * type, data = Prestige)
```
These models signify that there are different slopes (and intercepts) for the three
occupational types. In this interpretation, `type` is a moderator variable, with a different story.
The slopes of the fitted lines suggest that among blue collar workers, prestige
increases sharply with their income. For white collar and professional workers, there is still
an increasing relation of prestige with income, but the effect of income (slope) diminishes with
higher occupational category. A different plot entails a different story.
### Example: Penguins data {#sec-penguins}
```{r}
#| label: fig-penguin-species
#| echo: false
#| fig-align: center
#| out-width: "80%"
#| fig-cap: "Penguin species observed in the Palmer Archipelago. This is a cartoon, but it illustrates some features of penguin body size measurements, and the colors typically used for species. Image: Allison Horst"
knitr::include_graphics("images/penguins-horst.png")
```
The `penguins` dataset from the `r package("palmerpenguins", cite=TRUE)` provides further instructive examples of plots and analyses of multivariate data. The data consists of measurements of body size
(flipper length, body mass, bill length and depth)
of 344 penguins collected at the [Palmer Research Station](https://pallter.marine.rutgers.edu/) in Antarctica.
There were three different species of penguins (Adélie, Chinstrap & Gentoo)
collected from 3 islands in the Palmer Archipelago
between 2007--2009 [@Gorman2014]. The purpose was to examine differences in size or appearance of these species,
particularly differences among the sexes (sexual dimorphism) in relation to foraging and habitat.
Here, I use a slightly altered version of the dataset, `peng`, renaming variables to remove the units,
making factors of character variables and deleting a few cases with missing data.
```{r}
data(penguins, package = "palmerpenguins")
peng <- penguins |>
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
) |>
mutate(species = as.factor(species),
island = as.factor(island),
sex = as.factor(substr(sex,1,1))) |>
tidyr::drop_na()
str(peng)
```
There are quite a few variables to choose for illustrating data ellipses in scatterplots. Here I focus on
the measures of their bills, `bill_length` and `bill_depth` (indicating curvature) and show how to
use `ggplot2` for these plots.
I'll be using the penguins data quite a lot, so it is useful to set up custom colors like those
used in @fig-penguin-species, and shown in @fig-peng-colors with their color codes. These are shades of:
* `r colorize("Adelie", "orange")`: `r colorize("orange", "orange")`,
* `r colorize("Chinstrap", "purple")`: `r colorize("purple", "purple")`, and
* `r colorize("Gentoo", "green")`: `r colorize("green", "green")`.
```{r}
#| label: fig-peng-colors
#| echo: false
#| out-width: "70%"
#| fig-cap: Color palettes used for penguin species.
knitr::include_graphics("images/peng-colors.png")
```
To use these in `ggplot2` I define a function
`peng.colors()` that allows shades of light, medium and dark and then functions
`scale_*_penguins()` for color and fill.
```{r}
#| label: theme-penguins
#| code-fold: true
peng.colors <- function(shade=c("medium", "light", "dark")) {
shade = match.arg(shade)
# light medium dark
oranges <- c("#FDBF6F", "#F89D38", "#F37A00") # Adelie
purples <- c("#CAB2D6", "#9A78B8", "#6A3D9A") # Chinstrap
greens <- c("#B2DF8A", "#73C05B", "#33a02c") # Gentoo
cols.vec <- c(oranges, purples, greens)
cols.mat <-
matrix(cols.vec, 3, 3,
byrow = TRUE,
dimnames = list(species = c("Adelie", "Chinstrap", "Gentoo"),
shade = c("light", "medium", "dark")))
# get shaded colors
cols.mat[, shade ]
}
# define color and fill scales
scale_fill_penguins <- function(shade=c("medium", "light", "dark"), ...){
shade = match.arg(shade)
ggplot2::discrete_scale(
"fill","penguins",
scales:::manual_pal(values = peng.colors(shade)), ...)
}
scale_colour_penguins <- function(shade=c("medium", "light", "dark"), ...){
shade = match.arg(shade)
ggplot2::discrete_scale(
"colour","penguins",