This repository has been archived by the owner on Apr 9, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 5
/
sl3_pipelines.Rmd
269 lines (216 loc) · 10.2 KB
/
sl3_pipelines.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
---
title: 'The [`sl3`](https://sl3.tlverse.org/) R package: Ensemble Machine Learning
with Pipelines'
author: '[Nima Hejazi](https://nimahejazi.org) & Jeremy Coyle'
date: '`r format(Sys.time(), "%Y %b %d (%a), %H:%M:%S")`'
output:
html_document:
toc: yes
---
Based on materials originally produced by Jeremy Coyle, Nima Hejazi, Ivana
Malenica, and Oleg Sofrygin
## Introduction
In this demonstration, we will illustrate the basic functionality of the `sl3` R
package. Specifically, we will walk through the concept of machine learning
pipelines, the construction of ensemble models, and a few simple optimality
properties of stacked regression.
## Resources
* The `sl3` R package homepage: https://sl3.tlverse.org/
* The `sl3` R package repository: https://github.com/tlverse/sl3
## Setup
First, we'll load the packages required for this exercise and load a simple data
set (`cpp_imputed` below) that we'll use for demonstration purposes:
```{r}
set.seed(49753)
# packages we'll be using
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)
# load example data set
data(cpp_imputed)
# take a peek at the data
head(cpp_imputed)
```
To use this data set with `sl3`, the object must be wrapped in a customized
`sl3` container, an __`sl3` "Task"__ object. A _task_ is an idiom for all of the
elements of a prediction problem other than the learning algorithms and
prediction approach itself -- that is, a task delineates the structure of the
data set of interest and any potential metadata (e.g., observation-level
weights).
```{r}
# here are the covariates we are interested in and, of course, the outcome
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
"sexn")
outcome <- "haz"
# create the sl3 task and take a look at it
task <- make_sl3_Task(data = cpp_imputed, covariates = covars,
outcome = outcome, outcome_type = "continuous")
# let's take a look at the sl3 task
task
```
## Interlude: Object Oriented Programming in `R`
`sl3` is designed using basic OOP principles and the `R6` OOP framework. While
we've tried to make it easy to use `sl3` without worrying much about OOP, it is
helpful to have some intuition about how `sl3` is structured. In this section,
we briefly outline some key concepts from OOP. Readers familiar with OOP basics
are invited to skip this section. The key concept of OOP is that of an object, a
collection of data and functions that corresponds to some conceptual unit.
Objects have two main types of elements: (1) _fields_, which can be thought of
as nouns, are information about an object, and (2) _methods_, which can be
thought of as verbs, are actions an object can perform. Objects are members of
classes, which define what those specific fields and methods are. Classes can
inherit elements from other classes (sometimes called base classes) --
accordingly, classes that are similar, but not exactly the same, can share some
parts of their definitions.
Many different implementations of OOP exist, with variations in how these
concepts are implemented and used. R has several different implementations,
including `S3`, `S4`, reference classes, and `R6`. `sl3` uses the `R6`
implementation. In `R6`, methods and fields of a class object are accessed using
the `$` operator. The next section explains how these concepts are used in `sl3`
to model machine learning problems and algorithms.
## `sl3` Learners
`Lrnr_base` is the base class for defining machine learning algorithms, as well
as fits for those algorithms to particular `sl3_Tasks`. Different machine
learning algorithms are defined in classes that inherit from `Lrnr_base`. For
instance, the `Lrnr_glm` class inherits from `Lrnr_base`, and defines a learner
that fits generalized linear models. We will use the term learners to refer to
the family of classes that inherit from `Lrnr_base`. Learner objects can be
constructed from their class definitions using the `make_learner` function:
```{r}
# make learner object
lrnr_glm <- make_learner(Lrnr_glm)
```
Because all learners inherit from `Lrnr_base`, they have many features in
common, and can be used interchangeably. All learners define three main methods:
`train`, `predict`, and `chain`. The first, `train`, takes an `sl3_task` object,
and returns a `learner_fit`, which has the same class as the learner that was
trained:
```{r}
# fit learner to task data
lrnr_glm_fit <- lrnr_glm$train(task)
# verify that the learner is fit
lrnr_glm_fit$is_trained
```
Here, we fit the learner to the CPP task we defined above. Both `lrnr_glm` and
`lrnr_glm_fit` are objects of class `Lrnr_glm`, although the former defines a
learner and the latter defines a fit of that learner. We can distiguish between
the learners and learner fits using the `is_trained` field, which is true for
fits but not for learners.
Now that we’ve fit a learner, we can generate predictions using the predict
method:
```{r}
# get learner predictions
preds <- lrnr_glm_fit$predict()
head(preds)
```
Here, we specified task as the task for which we wanted to generate predictions.
If we had omitted this, we would have gotten the same predictions because
predict defaults to using the task provided to train (called the training task).
Alternatively, we could have provided a different task for which we want to
generate predictions.
The final important learner method, chain, will be discussed below, in the
section on learner composition. As with `sl3_Task`, learners have a variety of
fields and methods we haven't discussed here. More information on these is
available in the help for `Lrnr_base`.
## Pipelines
Based on the concept popularized by
[`scikit-learn`](http://scikit-learn.org/stable/index.html) `sl3` implements the
notion of [machine learning pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html),
which prove to be useful in a wide variety of data analytic settings.
A pipeline is a set of learners to be fit sequentially, where the fit from one
learner is used to define the task for the next learner. There are many ways in
which a learner can define the task for the downstream learner. The chain method
defined by learners defines how this will work. Let's look at the example of
pre-screening variables. For now, we'll rely on a screener from the
`SuperLearner` package, although native `sl3` screening algorithms will be
implemented soon.
Below, we generate a screener object based on the `SuperLearner` function
`screen.corP` and fit it to our task. Inspecting the fit, we see that it
selected a subset of covariates:
```{r}
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screen_fit <- screen_cor$train(task)
print(screen_fit)
```
The `Pipeline` class automates this process. It takes an arbitrary number of
learners and fits them sequentially, training and chaining each one in turn.
Since `Pipeline` is a learner like any other, it shares the same interface. We
can define a pipeline using `make_learner`, and use `train` and `predict` just
as we did before:
```{r}
sg_pipeline <- make_learner(Pipeline, screen_cor, lrnr_glm)
sg_pipeline_fit <- sg_pipeline$train(task)
sg_pipeline_preds <- sg_pipeline_fit$predict()
head(sg_pipeline_preds)
```
## Stacks
Like `Pipelines`, `Stacks` combine multiple learners. Stacks train learners
simultaneously, so that their predictions can be either combined or compared.
Again, `Stack` is just a special learner and so has the same interface as all
other learners:
```{r}
stack <- make_learner(Stack, lrnr_glm, sg_pipeline)
stack_fit <- stack$train(task)
stack_preds <- stack_fit$predict()
head(stack_preds)
```
Above, we've defined and fit a stack comprised of a simple `glm` learner as well
as a pipeline that combines a screening algorithm with that same learner. We
could have included any abitrary set of learners and pipelines, the latter of
which are themselves just learners. We can see that the predict method now
returns a matrix, with a column for each learner included in the stack.
## The Super Learner Algorithm
Having defined a stack, we might want to compare the performance of learners in
the stack, which we may do using cross-validation. The `Lrnr_cv` learner wraps
another learner and performs training and prediction in a cross-validated
fashion, using separate training and validation splits as defined by
`task$folds`.
Below, we define a new `Lrnr_cv` object based on the previously defined stack
and train it and generate predictions on the validation set:
```{r}
cv_stack <- Lrnr_cv$new(stack)
cv_fit <- cv_stack$train(task)
cv_preds <- cv_fit$predict()
```
```{r}
risks <- cv_fit$cv_risk(loss_squared_error)
print(risks)
```
We can combine all of the above elements, `Pipelines`, `Stacks`, and
cross-validation using `Lrnr_cv`, to easily define a Super Learner. The Super
Learner algorithm works by fitting a "meta-learner," which combines predictions
from multiple stacked learners. It does this while avoiding overfitting by
training the meta-learner on validation-set predictions in a manner that is
cross-validated. Using some of the objects we defined in the above examples,
this becomes a very simple operation:
```{r}
metalearner <- make_learner(Lrnr_nnls)
cv_task <- cv_fit$chain()
ml_fit <- metalearner$train(cv_task)
```
Here, we used a special learner, Lrnr_nnls, for the meta-learning step. This
fits a non-negative least squares meta-learner. It is important to note that any
learner can be used as a meta-learner.
The Super Learner finally produced is defined as a pipeline with the learner
stack trained on the full data and the meta-learner trained on the
validation-set predictions. Below, we use a special behavior of pipelines: if
all objects passed to a pipeline are learner fits (i.e., `learner$is_trained` is
`TRUE`), the result will also be a fit:
```{r}
sl_pipeline <- make_learner(Pipeline, stack_fit, ml_fit)
sl_preds <- sl_pipeline$predict()
head(sl_preds)
```
An optimal stacked regression model (or Super Learner) may be fit in a more
streamlined manner using the `Lrnr_sl` learner. For simplicity, we will use the
same set of learners and meta-learning algorithm as we did before:
```{r}
sl <- Lrnr_sl$new(learners = stack,
metalearner = metalearner)
sl_fit <- sl$train(task)
lrnr_sl_preds <- sl_fit$predict()
head(lrnr_sl_preds)
```
We can see that this generates the same predictions as the more hands-on
definition above.