-
Notifications
You must be signed in to change notification settings - Fork 11
/
timing.Rmd
151 lines (113 loc) · 5.72 KB
/
timing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
---
title: Timing and profiling R code
layout: default
---
## 1. Benchmarking
*system.time* is very handy for comparing the speed of different
implementations. Here's a basic comparison of the time to calculate the row means of a matrix using a for loop compared to the built-in *rowMeans* function.
```{r, system-time}
n <- 10000
m <- 1000
x <- matrix(rnorm(n*m), nrow = n)
system.time({
mns <- rep(NA, n)
for(i in 1:n) mns[i] <- mean(x[i , ])
})
system.time(rowMeans(x))
```
In general, *user* time gives the CPU time spent by R and *system* time gives the CPU time spent by the kernel (the operating system) on behalf of R. Operations that fall under system time include opening files, doing input or output, starting other processes, etc.
To time code that runs very quickly, you should use the *microbenchmark*
package. Of course one would generally only care about accurately timing quick calculations if a larger operation does the quick calculation very many times. Here's a comparison of different ways of accessing an element of a dataframe.
```{r, microbenchmark}
library(microbenchmark)
df <- data.frame(vals = 1:3, labs = c('a','b','c'))
microbenchmark(
df[2,1],
df$vals[2],
df[2, 'vals']
)
```
**Challenge**: Think about what order of operations in the first and third approaches above might lead to the timing result seen above.
*microbenchmark* is also a good way to compare slower calculations, e.g., doing a crossproduct via *crossproduct()* compared to "manually" via transpose and matrix multiply:
```{r, microbenchmark2}
library(microbenchmark)
n <- 1000
x <- matrix(rnorm(n^2), n)
microbenchmark(
t(x) %*% x,
crossprod(x),
times = 10)
```
**Challenge**: What is inefficient about the manual approach above?
An alternative that also automates timings and comparisons is the *benchmark* function from the *rbenchmark* package, but there's not really a reason to use it when *microbenchmark* is available.
```{r, benchmark, cache=TRUE}
library(rbenchmark)
# speed of one calculation
benchmark(t(x) %*% x,
crossprod(x),
replications = 10,
columns=c('test', 'elapsed', 'replications'))
```
In general, it's a good idea to repeat (replicate) your timing, as there is some stochasticity in how fast your computer will run a piece of code at any given moment.
You might also checkout the *tictoc* package.
## 2. Profiling
The *Rprof* function will show you how much time is spent in
different functions, which can help you pinpoint bottlenecks in your
code. The output from *Rprof* can be hard to decipher, so you
will probably want to use the *proftools* package functions, which make use of
*Rprof* under the hood.
Here's a function that does the linear algebra to find the least squares solution in a linear regression, assuming `x`
is the matrix of predictors, including a column for the intercept.
```{r, Rprof-fun}
lr_slow <- function(y, x) {
xtx <- t(x) %*% x
xty <- t(x) %*% y
inv <- solve(xtx) ## explicit matrix inverse is slow and usually a bad idea numerically
return(inv %*% xty)
}
```
Let's run the function with profiling turned on.
```{r, Rprof-run1}
## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)
library(proftools)
pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
hotPaths(pd1, value = 'time')
```
The first call to *hotPaths* shows the percentage of time spent in each call, while the second shows the actual time.
Note the nestedness of the results. For example, essentially all the time spent in *solve* is actually spent in *solve.default*. In this case *solve* is just an S3 generic method that immediately calls the specific method *solve.default*.
We can see that a lot of time is spent in doing the two crossproducts. So let's try using *crossprod* to make those steps faster.
```{r, Rprof-run2}
lr_medium <- function(y, x) {
xtx <- crossprod(x)
xty <- crossprod(x, y)
inv <- solve(xtx) ## explicit matrix inverse is slow and usually a bad idea numerically
return(inv %*% xty)
}
pd2 <- profileExpr(lr_medium(y, x))
hotPaths(pd2)
hotPaths(pd2, value = 'time')
```
First note that this version takes less than half the time of the previous one. Second note that a fair amount of time is spent computing the explicit matrix inverse using *solve*. (There's not much we can do to speed up the *crossprod* operations, apart from making sure we are using a fast BLAS and potentially a parallelized BLAS.) It's well known that one should avoid computing the explicit inverse if one can avoid it. Here's a faster version that avoids it.
```{r, Rprof-run3}
lr_fast <- function(y, x) {
xtx <- crossprod(x)
xty <- crossprod(x, y)
U <- chol(xtx)
tmp <- backsolve(U, xty, transpose = TRUE)
return(backsolve(U, tmp))
}
pd3 <- profileExpr(lr_fast(y, x))
hotPaths(pd3)
hotPaths(pd3, value = 'time')
```
We can see we get another speedup from that final version of the code. (But beware my earlier caution that if comparing times between implementations, we should have replication.)
You might also check out *profvis* for an alternative to displaying profiling information
generated by *Rprof*.
Note that *Rprof* works by sampling - every little while (the *interval* argument) during a calculation it finds out what function R is in and saves that information to the file given as the argument to *Rprof*. So if you try to profile code that finishes really quickly, there's not enough opportunity for the sampling to represent the calculation accurately and you may get spurious results.
> *Warning*:
> *Rprof* conflicts with threaded linear algebra,
so you may need to set OMP_NUM_THREADS to 1 to disable threaded
linear algebra if you profile code that involves linear algebra.