-
Notifications
You must be signed in to change notification settings - Fork 26
/
gsp.Rmd
219 lines (169 loc) · 7.07 KB
/
gsp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
# Finding the best model of gross state product {#gsp}
```{r, message = FALSE, warning = FALSE, echo = FALSE}
knitr::opts_knit$set(root.dir = fs::dir_create(tempfile()))
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(
drake_make_menu = FALSE,
drake_clean_menu = FALSE,
warnPartialMatchArgs = FALSE,
crayon.enabled = FALSE,
readr.show_progress = FALSE,
tidyverse.quiet = TRUE
)
```
```{r, message = FALSE, warning = FALSE, echo = FALSE}
library(biglm)
library(drake)
library(Ecdat)
library(ggplot2)
library(knitr)
library(purrr)
library(tidyverse)
```
The following data analysis workflow shows off `drake`'s ability to generate lots of reproducibly-tracked tasks with ease. The same technique would be cumbersome, even intractable, with [GNU Make](https://www.gnu.org/software/make/).
## Get the code.
Write the code files to your workspace.
```{r, eval = FALSE}
drake_example("gsp")
```
The new `gsp` folder now includes a file structure of a serious `drake` project, plus an `interactive-tutorial.R` to narrate the example. The code is also [online here](https://github.com/wlandau/drake-examples/tree/main/gsp).
## Objective and methods
The goal is to search for factors closely associated with the productivity of states in the USA around the 1970s and 1980s. For the sake of simplicity, we use gross state product as a metric of productivity, and we restrict ourselves to multiple linear regression models with three variables. For each of the 84 possible models, we fit the data and then evaluate the root mean squared prediction error (RMSPE).
$$
\begin{aligned}
\text{RMSPE} = \sqrt{(\text{y} - \widehat{y})^T(y - \widehat{y})}
\end{aligned}
$$
Here, $y$ is the vector of observed gross state products in the data, and $\widehat{y}$ is the vector of predicted gross state products under one of the models. We take the best variables to be the triplet in the model with the lowest RMSPE.
## Data
The `Produc` dataset from the [Ecdat package](https://cran.r-project.org/package=Ecdat) contains data on the Gross State Product from 1970 to 1986. Each row is a single observation on a single state for a single year. The dataset has the following variables as columns. See the references later in this report for more details.
- `gsp`: gross state product.
- `state`: the state.
- `year`: the year.
- `pcap`: private capital stock.
- `hwy`: highway and streets.
- `water`: water and sewer facilities.
- `util`: other public buildings and structures.
- `pc`: public capital.
- `emp`: labor input measured by the employment in non-agricultural payrolls.
- `unemp`: state unemployment rate.
```{r}
library(Ecdat)
data(Produc)
head(Produc)
```
## Analysis
First, we load the required packages. `drake` is aware of all the packages you load with `library()` or `require()`.
```{r}
library(biglm) # lightweight models, easier to store than with lm()
library(drake)
library(Ecdat) # econometrics datasets
library(ggplot2)
library(knitr)
library(purrr)
library(tidyverse)
```
Next, we construct our plan. The following code uses `drake`'s special new language for generating plans (learn more [here](#plans)).
```{r}
predictors <- setdiff(colnames(Produc), "gsp")
# We will try all combinations of three covariates.
combos <- combn(predictors, 3) %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
setNames(c("x1", "x2", "x3"))
head(combos)
# We need to list each covariate as a symbol.
for (col in colnames(combos)) {
combos[[col]] <- rlang::syms(combos[[col]])
}
# Requires drake >= 7.0.0 or the development version
# at github.com/ropensci/drake.
# Install with remotes::install_github("ropensci/drake").
plan <- drake_plan(
model = target(
biglm(gsp ~ x1 + x2 + x3, data = Ecdat::Produc),
transform = map(.data = !!combos) # Remember the bang-bang!!
),
rmspe_i = target(
get_rmspe(model, Ecdat::Produc),
transform = map(model)
),
rmspe = target(
bind_rows(rmspe_i, .id = "model"),
transform = combine(rmspe_i)
),
plot = ggsave(
filename = file_out("rmspe.pdf"),
plot = plot_rmspe(rmspe),
width = 8,
height = 8
),
report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)
plan
```
We also need to define functions for summaries and plots.
```{r}
get_rmspe <- function(model_fit, data){
y <- data$gsp
yhat <- as.numeric(predict(model_fit, newdata = data))
terms <- attr(model_fit$terms, "term.labels")
tibble(
rmspe = sqrt(mean((y - yhat)^2)), # nolint
X1 = terms[1],
X2 = terms[2],
X3 = terms[3]
)
}
plot_rmspe <- function(rmspe){
ggplot(rmspe) +
geom_histogram(aes(x = rmspe), bins = 15)
}
```
We have a [`report.Rmd` file ](https://github.com/wlandau/drake-examples/blob/main/gsp/report.Rmd) to summarize our results at the end.
```{r}
drake_example("gsp")
file.copy(from = "gsp/report.Rmd", to = ".", overwrite = TRUE)
```
We can inspect the project before we run it.
```{r}
vis_drake_graph(plan)
```
Now, we can run the project.
```{r}
make(plan, verbose = 0L)
```
## Results
Here are the root mean squared prediction errors of all the models.
```{r}
results <- readd(rmspe)
library(ggplot2)
plot_rmspe(rmspe = results)
```
And here are the best models. The best variables are in the top row under `X1`, `X2`, and `X3`.
```{r}
head(results[order(results$rmspe, decreasing = FALSE), ])
```
## Comparison with GNU Make
If we were using [Make](https://www.gnu.org/software/make/) instead of `drake` with the same set of targets, the analogous [Makefile](https://www.gnu.org/software/make/) would look something like this pseudo-code sketch.
<pre><code>models = model_state_year_pcap.rds model_state_year_hwy.rds ... # 84 of these
model_%
Rscript -e 'saveRDS(lm(...), ...)'
rmspe_%: model_%
Rscript -e 'saveRDS(get_rmspe(...), ...)'
rmspe.rds: rmspe_%
Rscript -e 'saveRDS(dplyr::bind_rows(...), ...)'
rmspe.pdf: rmspe.rds
Rscript -e 'ggplot2::ggsave(plot_rmspe(readRDS("rmspe.rds")), "rmspe.pdf")'
report.md: report.Rmd
Rscript -e 'knitr::knit("report.Rmd")'
</code></pre>
There are three main disadvantages to this approach.
1. Every target requires a new call to `Rscript`, which means that more time is spent initializing R sessions than doing the actual work.
2. The user must micromanage nearly one hundred output files (in this case, `*.rds` files), which is cumbersome, messy, and inconvenient. `drake`, on the other hand, automatically manages storage using a [storr cache](https://github.com/richfitz/storr).
3. The user needs to write the names of the 84 `models` near the top of the `Makefile`, which is less convenient than maintaining a data frame in R.
## References
- Baltagi, Badi H (2003). Econometric analysis of panel data, John Wiley and sons, http://www.wiley.com/legacy/wileychi/baltagi/.
- Baltagi, B. H. and N. Pinnoi (1995). "Public capital stock and state productivity growth: further evidence", Empirical Economics, 20, 351-359.
- Munnell, A. (1990). "Why has productivity growth declined? Productivity and public investment"", New England Economic Review, 3-22.
- Yves Croissant (2016). Ecdat: Data Sets for Econometrics. R package version 0.3-1. https://CRAN.R-project.org/package=Ecdat.