# Fitting models with parsnip
**Learning objectives:**
- Identify ways in which **model interfaces can differ.**
- **Specify a model** in `{parsnip}`.
- **Fit a model** with `parsnip::fit()` and `parsnip::fit_xy()`.
- Describe how `{parsnip}` **generalizes model arguments.**
- Use `broom::tidy()` to **convert model objects to a tidy structure.**
- Use `dplyr::bind_cols()` and the `predict()` methods from `{parsnip}` to **make tidy predictions.**
- **Find interfaces to other models** in `{parsnip}`-adjacent packages.
<details>
<summary> Modeling Map </summary>
![modeling flow](images/modeling_map.png)
- __Chapter Setup Below__
```{r set-up-07, warning=FALSE, message=FALSE}
# load parsnip, recipes, rsample, broom...
library(tidymodels)
library(AmesHousing)
# attach data
data(ames)
# log scale price
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
# train/test
set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```
</details>
<br>
## Create a Model
### Different Model Interfaces
![Different-interfaces](images/interfaces-meme.jpg)
<br>
- Model Interfaces
  - Different Implementations = Different Interfaces
  - _Linear Regression_ can be implemented in many ways
    - Ordinary Least Squares
    - Regularized Linear Regression
    - ...
<br>
- __{stats}__
  - Takes a formula
  - Uses a `data.frame`
```{r lm-interface, eval=FALSE}
lm(formula, data, ...)
```
<br>
- __{glmnet}__
  - Has an x/y interface
  - Uses a matrix
```{r glmnet-interface, eval=FALSE}
glmnet(x = matrix, y = vector, family = "gaussian", ...)
```
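A minimal sketch of the same contrast using only `{stats}`, so nothing extra needs to be installed: `lm()` stands in for the formula interface and `lm.fit()` for the x/y (matrix) interface. The `mtcars` data and the predictors `wt` and `hp` are choices made for this illustration, not part of the chapter's Ames example.

```{r interface-sketch}
# formula interface: variables are looked up by name in a data.frame
fit_formula <- lm(mpg ~ wt + hp, data = mtcars)

# x/y interface: the caller builds the numeric matrix, intercept column included
x <- cbind(`(Intercept)` = 1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg
fit_matrix <- lm.fit(x = x, y = y)

# both routes estimate the same coefficients
all.equal(coef(fit_formula), fit_matrix$coefficients)
```

The modeling question is identical in both calls; only the shape of the inputs changes, which is the friction `{parsnip}` tries to remove.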
<br>
<br>
### Model Specification
![model specification](images/model_specification_process.png)
- __{tidymodels}/{parsnip}__ - Philosophy is to unify & make interfaces more predictable.
  - Specify model type (e.g. linear regression, random forest ...)
    - `linear_reg()`
    - `rand_forest()`
  - Specify engine (i.e. package implementation of algorithm)
    - `set_engine("some package's implementation")`
  - Declare mode (e.g. classification vs. regression)
    - use this when the model can do both classification & regression
    - `set_mode("regression")`
    - `set_mode("classification")`
<br>
- __Bringing it all together__
```{r model-spec}
lm_model_spec <-
  linear_reg() %>%   # specify model
  set_engine("lm")   # set engine

lm_model_spec
```
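For a model type that supports more than one mode, the mode has to be declared as well. A small sketch, assuming the `rpart` engine (`{rpart}` is a recommended package bundled with most R installations); the object name `tree_spec` is illustrative only.

```{r mode-spec-sketch}
library(parsnip)

tree_spec <-
  decision_tree() %>%            # model type
  set_engine("rpart") %>%        # engine
  set_mode("classification")     # mode, since trees can do both

tree_spec

# engines registered for a given model type
show_engines("decision_tree")
```

Nothing is fitted at this point: the specification is only a description of what to fit, so no data are involved yet.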
<br>
<br>
### Model Fitting
We will reuse the model specification created above.
<br>
- `fit()`
  - any nominal or categorical variables will be split out into dummy columns
  - _most_ formula methods do the same thing
- `fit_xy()`
  - does not create dummy variables itself; the data are passed as-is and any encoding is left to the underlying model function
```{r model-fit}
# create model fit using formula
lm_form_fit <-
  lm_model_spec %>%
  fit(Sale_Price ~ Longitude + Latitude, data = ames_train)

# create model fit using x/y
lm_xy_fit <-
  lm_model_spec %>%
  fit_xy(
    x = ames_train %>% select(Longitude, Latitude),
    y = ames_train %>% pull(Sale_Price)
  )
```
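To see the dummy-variable behaviour of `fit()` in isolation, here is a small sketch on the built-in `iris` data (chosen for illustration, since the chapter model has only numeric predictors). The factor `Species` has three levels, so the formula method expands it into two indicator columns.

```{r dummy-sketch}
library(parsnip)

iris_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(Sepal.Length ~ Species, data = iris)

# the underlying lm object is stored in the `fit` element
names(coef(iris_fit$fit))
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"
```

The first level, `setosa`, becomes the reference category and is absorbed into the intercept.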
<br>
<br>
### Generalized Model Arguments
- Like the varying interfaces, model parameters differ from implementation to implementation
- Two levels of model arguments
  - __main arguments__ - Parameters aligned with the mathematical vehicle
  - __engine arguments__ - Parameters aligned with the package implementation of the mathematical algorithm
```{r package-param-comparisions, echo=FALSE}
tribble(
  ~argument,              ~ranger,         ~randomForest, ~sparklyr,
  "sampled predictors",   "mtry",          "mtry",        "feature_subset_strategy",
  "trees",                "num.trees",     "ntree",       "num_trees",
  "data points to split", "min.node.size", "nodesize",    "min_instances_per_node"
) %>%
  knitr::kable()
```
<br>
```{r parsnip-param-comparisions, echo=FALSE}
tribble(
  ~argument,              ~parsnip,
  "sampled predictors",   "mtry",
  "trees",                "trees",
  "data points to split", "min_n"
) %>%
  knitr::kable()
```
<br>
![Parsnip in Action](images/parsnip_meme.jpg)
<br>
+ `translate()` shows how a `{parsnip}` specification maps onto each individual package's implementation of the algorithm.
```{r model-package-differences}
# stats implementation
linear_reg() %>%
  set_engine("lm") %>%
  translate()

# glmnet implementation
linear_reg(penalty = 1) %>%
  set_engine("glmnet") %>%
  translate()
```
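The same idea applies to the random forest arguments tabulated above. A sketch, assuming the `ranger` and `randomForest` engines; `translate()` only builds the call template, so to my knowledge neither package has to be installed just to inspect the mapping. The argument values are arbitrary.

```{r rf-translate-sketch}
library(parsnip)

rf_spec <-
  rand_forest(mtry = 3, trees = 500, min_n = 5) %>%
  set_mode("regression")   # translate() needs a known mode

# one specification, two engines: compare the argument names in each template
rf_spec %>% set_engine("ranger") %>% translate()
rf_spec %>% set_engine("randomForest") %>% translate()
```

The main arguments (`mtry`, `trees`, `min_n`) are written once, and the printed fit template shows the engine-specific names they are mapped to.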
## Use Model Results
Now that we have a fitted model, we need to pull some summary information from it. Two extremely _fun_ functions from the `{broom}` package help us out: `tidy()` and `glance()`.
+ `tidy()` - versatile, but in our context it takes our model object and returns the model coefficients as a tibble.
```{r broom-tidy}
broom::tidy(lm_form_fit) %>%
  knitr::kable()
```
<br>
+ `glance()` - in this context, converts our model's summary statistics into a one-row `tibble`.
```{r broom-glance}
broom::glance(lm_form_fit) %>%
  knitr::kable()
```
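Arguments passed to `tidy()` are forwarded to the method for the underlying model object. A self-contained sketch, assuming an `lm` engine fit on `mtcars` (illustrative data, not the chapter's); `conf.int = TRUE` is an argument of the `lm` tidier in `{broom}`.

```{r tidy-confint-sketch}
library(parsnip)

car_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# coefficient table with confidence interval columns added
car_coefs <- broom::tidy(car_fit, conf.int = TRUE)
car_coefs

# one-row summary of the whole model
broom::glance(car_fit)
```

Because both results are tibbles, they can be filtered, joined, or bound together with results from other models.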
## Make Predictions
![](images/crystal_ball.png)
- __Rules to Live by__ (what `predict()` in `{parsnip}` guarantees):
  - Returns a tibble
  - Column names are ... erh, predictable
  - Returns the same number of rows as are in the data set
    - Some predict functions outside `{parsnip}` omit observations with `NA` values. That is great if it's what you intend, but if you aren't expecting that behavior you have to find out the hard way.
```{r predict-new-data}
# create example test set
ames_test_small <- ames_test %>% slice(1:5)

# predict on test set
predict(lm_form_fit, new_data = ames_test_small) %>%
  knitr::kable()
```
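A quick sketch of the row-count rule, using `mtcars` with one predictor value deliberately set to `NA` (an illustrative setup, not from the chapter).

```{r predict-na-sketch}
library(parsnip)

car_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# three new rows, the second with a missing predictor
new_cars <- mtcars[1:3, c("wt", "hp")]
new_cars$wt[2] <- NA

preds <- predict(car_fit, new_data = new_cars)
preds

# the NA row is kept, so predictions still line up with the input
nrow(preds) == nrow(new_cars)
```

The prediction for the incomplete row is `NA` rather than being dropped, which is what makes a later `bind_cols()` safe.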
<br>
- Combining `bind_cols()` with `predict()`, we can merge our predictions back onto the test set.
```{r combine-07}
# put predictions next to the actuals
ames_test_small %>%
  select(Sale_Price) %>%
  bind_cols(predict(lm_form_fit, ames_test_small)) %>%
  # Add 95% prediction intervals to the results:
  bind_cols(predict(lm_form_fit, ames_test_small, type = "pred_int")) %>%
  knitr::kable()
```
## {tidymodels}-Adjacent Packages
- Opinions can be shared: other modeling packages can adopt the same conventions to replicate this workflow. The `{discrim}`^[Discrim Package [Link](https://github.com/tidymodels/discrim)] package adds a new set of mathematical models to our arsenal of tools.
  - `discrim_flexible()` `%>%` - the mathematical model or, in my terrible analogy, the car body
  - `set_engine("earth")` - the package we want to use to fit our discriminant analysis
```{r adjacent-packages, warning=FALSE, message=FALSE}
# devtools::install_github("tidymodels/discrim") # to install
# load package
library(discrim)

# create dummy data
# (`parabolic` itself comes from {modeldata}, attached by library(tidymodels))
parabolic_grid <-
  expand.grid(X1 = seq(-5, 5, length.out = 100),
              X2 = seq(-5, 5, length.out = 100))

# fit model from discrim
fda_mod <-
  discrim_flexible(num_terms = 3) %>%
  set_engine("earth") %>%
  fit(class ~ ., data = parabolic)

# assigning predictions to data frame
parabolic_grid$fda <-
  predict(fda_mod, parabolic_grid, type = "prob")$.pred_Class1

# plotting prediction
library(ggplot2)
ggplot(parabolic, aes(x = X1, y = X2)) +
  geom_point(aes(col = class), alpha = .5) +
  geom_contour(data = parabolic_grid, aes(z = fda), col = "black", breaks = .5) +
  theme_bw() +
  theme(legend.position = "top") +
  coord_equal()
```
## Summary
- __Create a Common Interface__ - All models are composed of some core components
  - mathematical model
  - engine implementation
  - mode if needed
  - Arguments
    - Main - algorithm specific (trees, mtry, penalty)
    - Engine - package/engine specific (e.g. verbose, num.threads, ...)
- __Predictable Behavior__
  - tibble in, tibble out
  - same number of observations returned for `predict()`
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/97VLayRdu-A")`
<details>
<summary> Meeting chat log </summary>
```
00:15:40 Tan Ho: YESSS
00:15:58 Tan Ho: (@ the drake meme)
00:21:51 Tyler Grant Smith: space advantage
00:23:32 Asmae : what's that
00:24:28 Jim Gruman: http://www.feat.engineering/categorical-trees.html on dummies or no-dummies with trees
00:24:37 Tony ElHabr: amazing meme
00:24:48 Tony ElHabr: I am physically applauding
00:25:03 Scott Nestler: Some additional discussion regarding fit_xy() at tidyverse.org/blog/2019/04/parsnip-internals, related to possible range of mtry variables when you don't know number of predictors before recipe is prepped.
00:25:56 Scott Nestler: Is there a typo in the book after tables toward end of 7.1 where it talks about common argument names? It mentions num_n but I think they meant min_n. What do others think? Or is num_n actually used?
00:26:30 Conor Tompkins: Scott I think Jon made a PR to fix that typo
00:26:53 Scott Nestler: Thx. I hadn't checked yet. Just caught it in a quick read right before we started.
00:31:15 Scott Nestler: There are actually 30 different model types and engines at https://www.tidymodels.org/find/parsnip/ that work with parsnip.
00:43:55 Tony ElHabr: yay volunteers
00:44:48 Andy Farina: Thanks Jordan, excellent presentation
```
</details>
### Cohort 2
`r knitr::include_url("https://www.youtube.com/embed/gjBKHu4yJlY")`
<details>
<summary> Meeting chat log </summary>
```
00:12:53 shamsuddeen: https://torch.mlverse.org
00:13:21 shamsuddeen: Book: https://mlverse.github.io/torchbook_materials/
00:16:18 Janita Botha: I'm in the meeting twice because my sound is not working this morning
00:26:10 August: fit_xy doesn't do the contrast coding automatically, fit does.
00:28:04 Luke Shaw: Is fit_xy a 'safer' option?
00:28:33 rahul bahadur: @August. If you were to have categorical predictors in `lm()` would fit_xy() be able to model it? `lm()` automatically does contrast coding internally
00:30:55 Luke Shaw: https://xkcd.com/927/
00:33:10 Amélie Gourdon-Kanhukamwe (she/they): Nice one Luke!
00:34:44 rahul bahadur: I hope tidymodels doesn't end up becoming just another standard :P
00:34:50 August: I'm not sure about safer, it depends I suppose if you have done a recipe with the coding before hand.
00:35:39 Kevin Kent: I wonder if future packages that tidy models references will be built with a tidy models integration in mind. Instead of tidy models having to do so much translation
00:36:34 Luke Shaw: Thanks August. Think I need to get my hands dirty using it to really help me understand!
00:37:34 Luke Shaw: Yeah I'm only joking with the xkcd but does feel relevant. Kinda hoping tidymodels can be the one standard I get v. familiar with and has a long life
00:40:30 Kevin Kent: I think modeltime is an example of a modeling package done with tidymodels in mind. Kind of designed first for tidy models instead of having to do a bunch of translation to fit into the framework. Hopefully more do development like this
00:41:21 Kevin Kent: But I guess that one is doing a similar job to tidymodels in the sense of pulling in functionality from other packages as engines
00:41:45 August: Absolutely.
00:42:44 August: btw I have mostly used fit() rather than fit_xy() although the latter by be preferable
00:42:54 Kevin Kent: I love these templating packages..excited to use them
00:43:18 Kevin Kent: rstudio add in looks fantastic
00:44:23 rahul bahadur: Yes, fit_xy(), though requires more code, would not lead to unknown behaviour. I believe.
00:44:41 Kevin Kent: Yeah less “magic” happening
00:46:19 Luke Shaw: This feels very scary about using "fit": "If the data were preprocessed in any way, incorrect predictions will be generated (sometimes, without errors)."
00:47:56 Luke Shaw: ^^^ that's not about the function fit() - my bad
00:48:09 Kevin Kent: Or if you accidentally read in a numeric column as categorical and then run fit it won’t throw an error even though its not meaningful as a categorical variable
```
</details>
### Cohort 3
`r knitr::include_url("https://www.youtube.com/embed/-wG0XrZe00U")`
<details>
<summary> Meeting chat log </summary>
```
00:15:22 Ildiko Czeller: > linear_reg()
Linear Regression Model Specification (regression)
> linear_reg() %>% set_mode("classification")
Linear Regression Model Specification (classification)
00:17:00 Edgar Zamora: https://www.tidymodels.org/find/parsnip/
00:22:51 Toryn Schafer (she/her): model.matrix
00:26:52 Ildiko Czeller: Hi Daniel! good to see you
00:27:14 Daniel Chen: hello hello!
00:29:02 Daniel Chen: I forgot there was an article/help page of all the supported engines
00:30:04 Daniel Chen: ooh they moved the things around: this is the page to help you find arguments: https://www.tidymodels.org/find/parsnip/
00:30:40 Daniel Chen: that page has the model/engine parsnip param name and original param name
00:31:03 Ildiko Czeller: thanks!
00:34:36 Daniel Chen: you just need to be careful with the column names after `tidy` since they don't always mean the same thing for each model output.
but it makes the column names consistent so you can combine tables
00:45:18 jiwan: Error in parsnip::rand_forest(verbose = TRUE) :
unused argument (verbose = TRUE)
00:45:48 Daniel Chen: unused arguments usually get ignored. maybe you get a warning?
00:54:29 Daniel Chen: bye all!
```
</details>
### Cohort 4
`r knitr::include_url("https://www.youtube.com/embed/JSQsTfdIB_U")`
<details>
<summary> Meeting chat log </summary>
```
00:56:18 Laura Rose: https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/linear-model-selection-and-regularization.html#forward-and-backward-stepwise-selection
00:58:07 Ryan Metcalf: https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/index.html
00:58:52 Ryan Metcalf: https://avehtari.github.io/ROS-Examples/
01:01:31 Federica Gazzelloni: https://arxiv.org/pdf/1502.06988.pdf
```
</details>