-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.qmd
306 lines (229 loc) · 10.2 KB
/
README.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
---
format: gfm
---
<!-- README.md is generated from README.qmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# quartets: Datasets to help teach statistics <img src="man/figures/logo.png" align="right" height="138" />
<!-- badges: start -->
[![R-CMD-check](https://github.com/r-causal/quartets/workflows/R-CMD-check/badge.svg)](https://github.com/r-causal/quartets/actions)
<!-- badges: end -->
**Authors:** [Lucy D'Agostino McGowan](https://www.lucymcgowan.com/)<br/>
**License:** [MIT](https://opensource.org/license/mit/)
The quartets package is a collection of datasets aimed to help data analysis practitioners and students learn key statistical insights in a hands-on manner. It contains:
* [Anscombe's Quartet](#anscombes-quartet)
* [Causal Quartet](#causal-quartet)
* [Datasaurus Dozen](#datasaurus-dozen)
* [Interaction Triptych](#interaction-triptych)
* [Rashomon Quartet](#rashomon-quartet)
* [Gelman Variation and Heterogeneity Causal Quartets](#gelman-variation-and-heterogenity-causal-quartets)
## Installation
You can install the quartets package from CRAN as follows:
``` r
install.packages("quartets")
```
Or the development version of quartets like so:
``` r
devtools::install_github("r-causal/quartets")
```
## Anscombe's Quartet
The goal of the `anscombe_quartet` data set is to help drive home the point that visualizing your data is important. Francis Anscombe generated these four datasets to demonstrate that statistical summary measures alone cannot capture the full relationship between two variables (here, `x` and `y`). Anscombe emphasized the importance of visualizing data prior to calculating summary statistics.
* Dataset 1 has a linear relationship between `x` and `y`
* Dataset 2 has shows a nonlinear relationship between `x` and `y`
* Dataset 3 has a linear relationship between `x` and `y` with a single outlier
* Dataset 4 has shows no relationship between `x` and `y` with a single outlier that serves as a high-leverage point.
In each of the datasets the following statistical summaries hold:
* mean of `x`: 9
* variance of `x`: 11
* mean of `y`: 7.5
* variance of y: 4.125
* correlation between `x` and `y`: 0.816
* linear regression between `x` and `y`: `y = 3 + 0.5x`
* $R^2$ for the regression: 0.67
## Example
```{r}
#| message: false
#| warning: false
library(tidyverse)
library(quartets)
ggplot(anscombe_quartet, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x") +
facet_wrap(~dataset)
anscombe_quartet |>
group_by(dataset) |>
summarise(mean_x = mean(x),
var_x = var(x),
mean_y = mean(y),
var_y = var(y),
cor = cor(x, y)) |>
knitr::kable(digits = 2)
```
## Causal Quartet
The goal of the `causal_quartet` data set is to help drive home the point that when presented with an exposure, outcome, and some measured factors, statistics alone, whether summary statistics or data visualizations, are not sufficient to determine the appropriate causal estimate. Additional information about the data generating mechanism is needed in order to draw the correct conclusions. See [this paper](https://github.com/LucyMcGowan/writing-quartet/blob/main/manuscript.pdf) for details.
## Example
```{r}
#| message: false
#| warning: false
ggplot(causal_quartet, aes(x = exposure, y = outcome)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x") +
facet_wrap(~dataset)
```
```{r}
causal_quartet |>
nest_by(dataset) |>
mutate(`Y ~ X` = round(coef(lm(outcome ~ exposure, data = data))[2], 2),
`Y ~ X + Z` = round(coef(lm(outcome ~ exposure + covariate, data = data))[2], 2),
`Correlation of X and Z` = round(cor(data$exposure, data$covariate), 2)) |>
select(-data, `Data generating mechanism` = dataset) |>
knitr::kable()
```
## Datasaurus Dozen
Similar to Anscombe's Quartet, the Datasaurus Dozen has additional data sets where the mean, variance, and Pearson's correlation are identical, but visualizations demonstrate the large difference between datasets. This dataset is re-exported from the [datasauRus](https://CRAN.R-project.org/package=datasauRus) R package.
## Example
```{r}
#| message: false
#| warning: false
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ x") +
facet_wrap(~dataset)
datasaurus_dozen |>
group_by(dataset) |>
summarise(mean_x = mean(x),
var_x = var(x),
mean_y = mean(y),
var_y = var(y),
cor = cor(x, y)) |>
knitr::kable(digits = 2)
```
## Interaction Triptych
This set of 3 datasets demonstrating that while the slopes estimated by a simple linear interaction model may be the same, the underlying data-generating mechanisms can be vastly different.
```{r}
ggplot(interaction_triptych, aes(x, y)) +
geom_point(shape = "o") +
geom_smooth(method = "lm", formula = "y ~ x") +
facet_grid(dataset ~ moderator)
```
## Rashomon Quartet
This dataset demonstrates that model diagnostics alone (such as $R^2$ and RMSE) do not tell the full story of a prediction model. Here, there are three predictors and one outcome. Models fit using a regression tree, linear regression, random forest, and neural network all yield the same $R^2$ and RMSE, but are finding different relationships between the predictors, as evidenced by the below partial dependence plots.
```{r}
#| message: false
#| warning: false
set.seed(1568)
library(tidymodels)
library(DALEXtra)
```
```{r}
#| cache: true
rec <- recipe(y ~ ., data = rashomon_quartet_train)
## Regression Tree
wf_tree <- workflow() |>
add_recipe(rec) |>
add_model(
decision_tree(mode = "regression", engine = "rpart",
tree_depth = 3, min_n = 250)
)
tree <- fit(wf_tree, rashomon_quartet_train)
exp_tree <- explain_tidymodels(
tree,
data = rashomon_quartet_test[, -1],
y = rashomon_quartet_test[, 1],
verbose = FALSE,
label = "decision tree")
## Linear Model
wf_linear <- wf_tree |>
update_model(linear_reg())
lin <- fit(wf_linear, rashomon_quartet_train)
exp_lin <- explain_tidymodels(
lin,
data = rashomon_quartet_test[, -1],
y = rashomon_quartet_test[, 1],
verbose = FALSE,
label = "linear regression")
## Random Forest
wf_rf <- wf_tree |>
update_model(rand_forest(mode = "regression",
engine = "randomForest",
trees = 100))
rf <- fit(wf_rf, rashomon_quartet_train)
exp_rf <- explain_tidymodels(
rf,
data = rashomon_quartet_test[, -1],
y = rashomon_quartet_test[, 1],
verbose = FALSE,
label = "random forest")
## Neural Network
library(neuralnet)
nn <- neuralnet(
y ~ .,
data = rashomon_quartet_train,
hidden = c(8, 4),
threshold = 0.05)
exp_nn <- explain_tidymodels(
nn,
data = rashomon_quartet_test[, -1],
y = rashomon_quartet_test[, 1],
verbose = FALSE,
label = "neural network")
```
We can see that each of these models "perform" the same.
```{r}
mp <- map(list(exp_tree, exp_lin, exp_rf, exp_nn), model_performance)
tibble(
model = c("Decision tree", "Linear regression", "Random forest", "Neural network"),
R2 = map_dbl(mp, ~.x$measures$r2),
RMSE = map_dbl(mp, ~.x$measures$rmse)
) |>
knitr::kable(digits = 2)
```
But the way they fit to the actual predictors is quite different:
```{r}
#| cache: true
pd_tree <- model_profile(exp_tree, N=NULL)
pd_lin <- model_profile(exp_lin, N=NULL)
pd_rf <- model_profile(exp_rf, N=NULL)
pd_nn <- model_profile(exp_nn, N=NULL)
plot(pd_tree, pd_nn, pd_rf, pd_lin)
```
## Gelman Variation and Heterogeneity Causal Quartets
The first set of data `variation_causal_quartet` demonstrates that you can get the same average treatment effect despite variability across some pre-treatment characteristic (here called `covariate`).
```{r}
ggplot(variation_causal_quartet, aes(x = covariate, y = outcome, color = factor(exposure))) +
geom_point(alpha = 0.5) +
facet_wrap(~ dataset) +
labs(color = "exposure group")
variation_causal_quartet |>
nest_by(dataset) |>
mutate(ATE = round(coef(lm(outcome ~ exposure, data = data))[2], 2)) |>
select(-data, dataset) |>
knitr::kable()
```
The `heterogeneous_causal_quartet` demonstrates how you can observe the same causal effect under different patterns of treatment heterogeneity.
```{r}
ggplot(heterogeneous_causal_quartet, aes(x = covariate, y = outcome, color = factor(exposure))) +
geom_point(alpha = 0.5) +
facet_wrap(~ dataset) +
labs(color = "exposure group")
heterogeneous_causal_quartet |>
nest_by(dataset) |>
mutate(ATE = round(coef(lm(outcome ~ exposure, data = data))[2], 2)) |>
select(-data, dataset) |>
knitr::kable()
```
## References
Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician. 27 (1): 17–21. doi:10.1080/00031305.1973.10478966. JSTOR 2682899.
Biecek P, Baniecki H, Krzyziński M, Cook D (2023). Performance is not enough: the story of Rashomon’s quartet. Preprint arXiv:2302.13356v2.
Lucy D’Agostino McGowan, Travis Gerke & Malcolm Barrett (2023) Causal inference is not just a statistics problem, Journal of Statistics and Data Science Education, DOI: 10.1080/26939169.2023.2276446
Davies R, Locke S, D'Agostino McGowan L (2022). _datasauRus: Datasets from the Datasaurus Dozen_. R package version 0.1.6, <https://CRAN.R-project.org/package=datasauRus>.
Gelman, A., Hullman, J., & Kennedy, L. (2023). Causal quartets: Different ways to attain the same average treatment effect. arXiv preprint arXiv:2302.12878.
Hullman J (2023). _causalQuartet: Create Causal Quartets for Interrogating Average Treatment Effects_. R package version 0.0.0.9000.
Matejka, J., & Fitzmaurice, G. (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. CHI 2017 Conference proceedings: ACM SIGCHI Conference on Human Factors in Computing Systems. Retrieved from https://www.autodesk.com/research/publications/same-stats-different-graphs
Rohrer, Julia M., and Ruben C. Arslan. "Precise answers to vague questions: Issues with interactions." Advances in Methods and Practices in Psychological Science 4.2 (2021): 25152459211007368.