forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Evaluation.Rmd
484 lines (351 loc) · 17.5 KB
/
Evaluation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
# Evaluation
```{r setup, include = FALSE}
source("common.R")
library(rlang)
```
## Introduction
## Overscopes
## Base R
### Promises
These functions work because internally R represents function arguments with a special type of object called a __promise__. A promise captures the expression needed to compute the value and the environment in which to compute it. You're not normally aware of promises because the first time you access a promise its code is evaluated in its environment, yielding a value. \index{promises}
Promises are hard to work with because they are quantum - attempting to look at them in R changes their behaviour.
You can see both the expression and it's environment if you capture with a quosure, rather than an expression.
```{r}
capture <- function(x) {
enquo(x)
}
f <- function(x) {
x <- 10
capture(x + 1)
}
x <- 1
f1 <- capture(x + 1)
f2 <- f(x + 1)
get_expr(f1)
get_expr(f2)
get_env(f1)
get_env(f2)
eval_tidy(f1)
eval_tidy(f2)
```
We'll come back to quosures in the next chapter: because they capture the environment they're most useful for non-standard evaluation.
## Non-standard evaluation in subset {#subset}
While printing out the code supplied to an argument value can be useful, we can actually do more with the unevaluated code. Take `subset()`, for example. It's a useful interactive shortcut for subsetting data frames: instead of repeating the name of data frame many times, you can save some typing: \indexc{subset()}
```{r}
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
subset(sample_df, a >= 4)
# equivalent to:
# sample_df[sample_df$a >= 4, ]
subset(sample_df, b == c)
# equivalent to:
# sample_df[sample_df$b == sample_df$c, ]
```
`subset()` is special because it implements different scoping rules: the expressions `a >= 4` or `b == c` are evaluated in the specified data frame rather than in the current or global environments. This is the essence of non-standard evaluation.
How does `subset()` work? We've already seen how to capture an argument's expression rather than its result, so we just need to figure out how to evaluate that expression in the right context. Specifically, we want `x` to be interpreted as `sample_df$x`, not `globalenv()$x`. To do this, we need `eval()`. This function takes an expression and evaluates it in the specified environment. \indexc{eval()}
Before we can explore `eval()`, we need one more useful function: `quote()`. It captures an unevaluated expression like `substitute()`, but doesn't do any of the advanced transformations that can make `substitute()` confusing. `quote()` always returns its input as is: \indexc{quote()} \index{quoting}
```{r}
quote(1:10)
quote(x)
quote(x + y^2)
```
We need `quote()` to experiment with `eval()` because `eval()`'s first argument is an expression. So if you only provide one argument, it will evaluate the expression in the current environment. This makes `eval(quote(x))` exactly equivalent to `x`, regardless of what `x` is:
```{r, error = TRUE}
eval(quote(x <- 1))
eval(quote(x))
eval(quote(y))
```
`quote()` and `eval()` are opposites. In the example below, each `eval()` peels off one layer of `quote()`'s.
```{r}
quote(2 + 2)
eval(quote(2 + 2))
quote(quote(2 + 2))
eval(quote(quote(2 + 2)))
eval(eval(quote(quote(2 + 2))))
```
`eval()`'s second argument specifies the environment in which the code is executed:
```{r}
x <- 10
eval(quote(x))
e <- new.env()
e$x <- 20
eval(quote(x), e)
```
Because lists and data frames bind names to values in a similar way to environments, `eval()`'s second argument need not be limited to an environment: it can also be a list or a data frame.
```{r}
eval(quote(x), list(x = 30))
eval(quote(x), data.frame(x = 40))
```
This gives us one part of `subset()`:
```{r}
eval(quote(a >= 4), sample_df)
eval(quote(b == c), sample_df)
```
A common mistake when using `eval()` is to forget to quote the first argument. Compare the results below:
```{r, error = TRUE}
a <- 10
eval(quote(a), sample_df)
eval(a, sample_df)
eval(quote(b), sample_df)
eval(b, sample_df)
```
```{r, echo = FALSE}
rm(a)
```
We can use `eval()` and `substitute()` to write `subset()`. We first capture the call representing the condition, then we evaluate it in the context of the data frame and, finally, we use the result for subsetting:
```{r}
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
r <- eval(condition_call, x)
x[r, ]
}
subset2(sample_df, a >= 4)
```
### Exercises
1. Predict the results of the following lines of code:
```{r, eval = FALSE}
eval(quote(eval(quote(eval(quote(2 + 2))))))
eval(eval(quote(eval(quote(eval(quote(2 + 2)))))))
quote(eval(quote(eval(quote(eval(quote(2 + 2)))))))
```
1. `subset2()` has a bug if you use it with a single column data frame.
What should the following code return? How can you modify `subset2()`
so it returns the correct type of object?
```{r}
sample_df2 <- data.frame(x = 1:10)
subset2(sample_df2, x > 8)
```
1. The real subset function (`subset.data.frame()`) removes missing
values in the condition. Modify `subset2()` to do the same: drop the
offending rows.
1. What happens if you use `quote()` instead of `substitute()` inside of
`subset2()`?
1. The third argument in `subset()` allows you to select variables. It
treats variable names as if they were positions. This allows you to do
things like `subset(mtcars, , -cyl)` to drop the cylinder variable, or
`subset(mtcars, , disp:drat)` to select all the variables between `disp`
and `drat`. How does this work? I've made this easier to understand by
extracting it out into its own function.
```{r, eval = FALSE}
select <- function(df, vars) {
vars <- substitute(vars)
var_pos <- setNames(as.list(seq_along(df)), names(df))
pos <- eval(vars, var_pos)
df[, pos, drop = FALSE]
}
select(mtcars, -cyl)
```
1. What does `evalq()` do? Use it to reduce the amount of typing for the
examples above that use both `eval()` and `quote()`.
1. Write an equivalent to `get()` using `as.name()` and `eval()`. Write an
equivalent to `assign()` using `as.name()`, `substitute()`, and `eval()`.
(Don't worry about the multiple ways of choosing an environment; assume
that the user supplies it explicitly.)
## Scoping issues {#scoping-issues}
It certainly looks like our `subset2()` function works. But since we're working with expressions instead of values, we need to test things more extensively. For example, the following applications of `subset2()` should all return the same value because the only difference between them is the name of a variable: \index{lexical scoping}
```{r, error = TRUE}
y <- 4
x <- 4
condition <- 4
condition_call <- 4
subset2(sample_df, a == 4)
subset2(sample_df, a == y)
subset2(sample_df, a == x)
subset2(sample_df, a == condition)
subset2(sample_df, a == condition_call)
```
What went wrong? You can get a hint from the variable names I've chosen: they are all names of variables defined inside `subset2()`. If `eval()` can't find the variable inside the data frame (its second argument), it looks in the environment of `subset2()`. That's obviously not what we want, so we need some way to tell `eval()` where to look if it can't find the variables in the data frame.
The key is the third argument to `eval()`: `enclos`. This allows us to specify a parent (or enclosing) environment for objects that don't have one (like lists and data frames). If the binding is not found in `env`, `eval()` will next look in `enclos`, and then in the parents of `enclos`. `enclos` is ignored if `env` is a real environment. We want to look for `x` in the environment from which `subset2()` was called. In R terminology this is called the __parent frame__ and is accessed with `parent.frame()`. This is an example of [dynamic scope](http://en.wikipedia.org/wiki/Scope_%28programming%29#Dynamic_scoping): the values come from the location where the function was called, not where it was defined. \indexc{parent.frame()}
With this modification our function now works:
```{r}
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
r <- eval(condition_call, x, parent.frame())
x[r, ]
}
x <- 4
subset2(sample_df, a == x)
```
Using `enclos` is just a shortcut for converting a list or data frame to an environment. We can get the same behaviour by using `list2env()`. It turns a list into an environment with an explicit parent: \indexc{list2env()}
```{r}
subset2a <- function(x, condition) {
condition_call <- substitute(condition)
env <- list2env(x, parent = parent.frame())
r <- eval(condition_call, env)
x[r, ]
}
x <- 5
subset2a(sample_df, a == x)
```
### Exercises
1. What does `transform()` do? Read the documentation. How does it work?
Read the source code for `transform.data.frame()`. What does
`substitute(list(...))` do?
1. What does `with()` do? How does it work? Read the source code for
`with.default()`. What does `within()` do? How does it work? Read the
source code for `within.data.frame()`. Why is the code so much more
complex than `with()`?
## Capturing the current call {#capturing-call}
```{r, eval = FALSE, echo = FALSE}
std <- c("package:base", "package:utils", "package:stats")
names(find_uses(std, "sys.call"))
names(find_uses(std, "match.call"))
```
Many base R functions use the current call: the expression that caused the current function to be run. There are two ways to capture a current call: \indexc{calls|capturing current}
* `sys.call()` captures exactly what the user typed. \indexc{sys.call()}
* `match.call()` makes a call that only uses named arguments. It's like
automatically calling `pryr::standardise_call()` on the result of
`sys.call()` \indexc{match.call()}
The following example illustrates the difference between the two:
```{r}
f <- function(abc = 1, def = 2, ghi = 3) {
list(sys = sys.call(), match = match.call())
}
f(d = 2, 2)
```
Modelling functions often use `match.call()` to capture the call used to create the model. This makes it possible to `update()` a model, re-fitting the model after modifying some of original arguments. Here's an example of `update()` in action: \indexc{update()}
```{r}
mod <- lm(mpg ~ wt, data = mtcars)
update(mod, formula = . ~ . + cyl)
```
How does `update()` work? We can rewrite it using some tools from pryr to focus on the essence of the algorithm.
```{r, eval = FALSE}
update_call <- function (object, formula., ...) {
call <- object$call
# Use update.formula to deal with formulas like . ~ .
if (!missing(formula.)) {
call$formula <- update.formula(formula(object), formula.)
}
modify_call(call, dots(...))
}
update_model <- function(object, formula., ...) {
call <- update_call(object, formula., ...)
eval(call, parent.frame())
}
update_model(mod, formula = . ~ . + cyl)
```
The original `update()` has an `evaluate` argument that controls whether the function returns the call or the result. But I think it's better, on principle, that a function returns only one type of object, rather than different types depending on the function's arguments.
This rewrite also allows us to fix a small bug in `update()`: it re-evaluates the call in the global environment, when what we really want is to re-evaluate it in the environment where the model was originally fit --- in the formula.
```{r, error = TRUE}
f <- function() {
n <- 3
lm(mpg ~ poly(wt, n), data = mtcars)
}
mod <- f()
update(mod, data = mtcars)
update_model <- function(object, formula., ...) {
call <- update_call(object, formula., ...)
eval(call, environment(formula(object)))
}
update_model(mod, data = mtcars)
```
This is an important principle to remember: if you want to re-run code captured with `match.call()`, you also need to capture the environment in which it was evaluated, usually the `parent.frame()`. The downside to this is that capturing the environment also means capturing any large objects which happen to be in that environment, which prevents their memory from being released. This topic is explored in more detail in [garbage collection](#gc). \index{environments|capturing}
Some base R functions use `match.call()` where it's not necessary. For example, `write.csv()` captures the call to `write.csv()` and mangles it to call `write.table()` instead:
```{r}
write.csv <- function(...) {
Call <- match.call(expand.dots = TRUE)
for (arg in c("append", "col.names", "sep", "dec", "qmethod")) {
if (!is.null(Call[[arg]])) {
warning(gettextf("attempt to set '%s' ignored", arg))
}
}
rn <- eval.parent(Call$row.names)
Call$append <- NULL
Call$col.names <- if (is.logical(rn) && !rn) TRUE else NA
Call$sep <- ","
Call$dec <- "."
Call$qmethod <- "double"
Call[[1L]] <- as.name("write.table")
eval.parent(Call)
}
```
To fix this, we could implement `write.csv()` using regular function call semantics:
```{r}
write.csv <- function(x, file = "", sep = ",", qmethod = "double",
...) {
write.table(x = x, file = file, sep = sep, qmethod = qmethod,
...)
}
```
This is much easier to understand: it's just calling `write.table()` with different defaults. This also fixes a subtle bug in the original `write.csv()`: `write.csv(mtcars, row = FALSE)` raises an error, but `write.csv(mtcars, row.names = FALSE)` does not. The lesson here is that it's always better to solve a problem with the simplest tool possible.
### Exercises
1. Compare and contrast `update_model()` with `update.default()`.
1. Why doesn't `write.csv(mtcars, "mtcars.csv", row = FALSE)` work?
What property of argument matching has the original author forgotten?
1. Rewrite `update.formula()` to use R code instead of C code.
1. Sometimes it's necessary to uncover the function that called the
function that called the current function (i.e., the grandparent, not
the parent). How can you use `sys.call()` or `match.call()` to find
this function?
## Anaphoric functions
One useful application of `make_function()` is in functions like `curve()`. `curve()` allows you to plot a mathematical function without creating an explicit R function:
```{r curve-demo, fig.width = 3.5, fig.height = 2.5, small_mar = TRUE}
curve(sin(exp(4 * x)), n = 1000)
```
Here `x` is a pronoun. `x` doesn't represent a single concrete value, but is instead a placeholder that varies over the range of the plot. One way to implement `curve()` would be with `make_function()`:
```{r curve2}
curve4 <- function(expr, xlim = c(0, 1), n = 100) {
expr <- enquo(expr)
f <- new_function(alist(x = ), get_expr(expr), get_env(env))
x <- seq(xlim[1], xlim[2], length = n)
y <- f(x)
plot(x, y, type = "l", ylab = deparse(substitute(expr)))
}
curve4(sin(exp(4 * x)), n = 1000)
curve3 <- function(expr, xlim = c(0, 1), n = 100) {
expr <- enquo(expr)
e <- rlang::expr({
function(x) !!get_expr(expr)
})
f <- eval_tidy(e, env = get_env(expr))
x <- seq(xlim[1], xlim[2], length = n)
y <- f(x)
plot(x, y, type = "l", ylab = deparse(substitute(expr)))
}
curve3(sin(exp(4 * x)), n = 1000)
```
Functions that use a pronoun are called [anaphoric](http://en.wikipedia.org/wiki/Anaphora_(linguistics)) functions. They are used in [Arc](http://www.arcfn.com/doc/anaphoric.html) (a lisp like language), [Perl](http://www.perlmonks.org/index.pl?node_id=666047), and [Clojure](http://amalloy.hubpages.com/hub/Unhygenic-anaphoric-Clojure-macros-for-fun-and-profit). \index{anaphoric functions} \index{functions!anaphoric}
### Exercises
1. How are `alist(a)` and `alist(a = )` different? Think about both the
input and the output.
1. Read the documentation and source code for `pryr::partial()`. What does it
do? How does it work? Read the documentation and source code for
`pryr::unenclose()`. What does it do and how does it work?
1. The actual implementation of `curve()` looks more like
```{r curve3}
curve3 <- function(expr, xlim = c(0, 1), n = 100,
env = parent.frame()) {
env2 <- new.env(parent = env)
env2$x <- seq(xlim[1], xlim[2], length = n)
y <- eval(substitute(expr), env2)
plot(env2$x, y, type = "l",
ylab = deparse(substitute(expr)))
}
```
How does this approach differ from `curve2()` defined above?
## Sourcee
With `parse()` and `eval()`, it's possible to write a simple version of `source()`. We read in the file from disk, `parse()` it and then `eval()` each component in a specified environment. This version defaults to a new environment, so it doesn't affect existing objects. `source()` invisibly returns the result of the last expression in the file, so `simple_source()` does the same. \index{source()}
```{r}
simple_source <- function(file, envir = new.env()) {
stopifnot(file.exists(file))
stopifnot(is.environment(envir))
lines <- readLines(file, warn = FALSE)
exprs <- parse(text = lines)
n <- length(exprs)
if (n == 0L) return(invisible())
for (i in seq_len(n - 1)) {
eval(exprs[i], envir)
}
invisible(eval(exprs[n], envir))
}
```
The real `source()` is considerably more complicated because it can `echo` input and output, and also has many additional settings to control behaviour.
### Exercises
1. Compare and contrast `source()` and `sys.source()`.
1. Modify `simple_source()` so it returns the result of _every_ expression,
not just the last one.
1. The code generated by `simple_source()` lacks source references. Read
the source code for `sys.source()` and the help for `srcfilecopy()`,
then modify `simple_source()` to preserve source references. You can
test your code by sourcing a function that contains a comment. If
successful, when you look at the function, you'll see the comment and
not just the source code.