-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path04-speed_02_loops.Rmd
545 lines (400 loc) · 16 KB
/
04-speed_02_loops.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
# For Loops {#loops}
```{r include=FALSE}
library(magrittr)
```
This is a topic that I wanted to discuss for a long time. People read blogs from 2014-2016 and assume that for loops in R are bad. You should not use them and Loops in R are slow etc. etc... This chapter will help you understand how to use them more effectively.
R loops are not too slow compared to other programming languages like python, ruby etc... But Yes they are slower than the vectorized code. You can get a faster code by doing more with vectorization than a simple loop.
## initialize objects before loops
create vectors for storing object even before the loop starts. Because it allocates memory before the loop it makes R loop a lot faster. And creating a vector is a vectorized C function call thus it's always a lot faster.
R has a few functions to create the type of vector of vector you need.`integer, numeric, character, logical` are most common function for these cases. numeric can be used to store `Date types` as well. It's always beneficial to start with a vector to store the values.
## use simple data-types
Data-types are the most common reason people don't get speed in R. If you run a loop on a Data.frame it always have to check the constraints of a data.frame like same length vectors to make sure you are not messing up the data type and it also creates a copy on each modification. But same code could be like a 1000 times faster if we just use a simple list.
R data.table packages provides an interface to set values inside a data-table without creating a copy which makes it faster for most of the use cases. Let's compare how fast will it be.
```{r}
set_dt_num = function(
data_table,
n_row
) {
for(i in 1:n_row){
data.table::set(
x = data_table,
i = i,
j = 1L,
value = i * 2L
)
}
}
set_dt_col = function(
data_table,
n_row
) {
for(i in 1:n_row){
data.table::set(
x = data_table,
i = i,
j = "x",
value = i * 2L
)
}
}
```
```{r}
n_row <- 1e3
data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))
microbenchmark::microbenchmark(
set_df_col = {
for(i in 1:n_row){
data_frame$x[[i]] <- i * 2L
}
},
set_dt_num = set_dt_num(data_table , n_row ),
set_dt_col = set_dt_col(data_table , n_row ),
times = 10
)
```
This Code used to give me around 200x increment over base R in previous versions. From R 4.0 onward R is managing memory pretty efficiently and the base is performing better in this test. Let me try it on a larger data set.
```{r}
n_row <- 1e5
data_table <- data.table::data.table(x = integer(n_row))
data_frame <- data.frame(x = integer(n_row))
microbenchmark::microbenchmark(
set_df_col = {
for(i in 1:n_row){
data_frame$x[[i]] <- i * 2L
}
},
set_dt_num = set_dt_num(data_table , n_row ),
set_dt_col = set_dt_col(data_table , n_row ),
times = 10
)
```
```{r include=FALSE}
gc()
```
Now we can see some improvement over the base. On bigger data set we are getting around 10x+ speed with data.table. It was just to establish the fact that data.table performs well over bigger data sets. Yet we can still get a better performance by moving to a lower level data structure.
```{r}
n_row <- 1e5
data_table <- data.table::data.table(x = integer(n_row))
data_list <- list(x = integer(n_row))
microbenchmark::microbenchmark(
set_list_col = {
for(i in 1:n_row){
data_list$x[[i]] <- i*2L
}
},
set_dt_num = set_dt_num(data_table , n_row ),
set_dt_col = set_dt_col(data_table , n_row ),
times = 10
)
```
Just by moving from data.frame to list we can get a substantial increment over the base. Now we are close to 400x over R data.table. Which is huge. But wait we can do it better. We haven't tried the most atomic level structure in R the VECTORS. Let's benchmark it again with vectors.
```{r}
n_row <- 1e5
x <- integer(n_row)
data_list <- list(x = integer(n_row))
microbenchmark::microbenchmark(
set_list_col = {
for(i in 1:n_row){
data_list$x[[i]] <- i*2L
}
},
set_vector = {
for(i in 1:n_row){
x[[i]] <- i*2L
}
}
)
```
We were able to squeeze 2x + speed with base R vector. So finally we have vector that can do the entire computation at 6ms while the worst dataframe would do it at 7700ms. Which makes our code around 1200x faster. All you need to do is to remember can we do it with a simpler data-type.
This is the best change you can make in your code to make it run faster. I always prefer looping over a vector like 90% of the time. In some cases where it's not possible I like to convert the dataframe to list and run a loop and then convert it back to a dataframe. Because dataframe itself is a list with some constraints. Remembering this will help you a lot in speeding up your code.
## apply family
Use apply family for concise and efficient code. apply functions make your code smaller and more readable and you don't have to write loops so you can focus on your tasks. Whenever possible you should use apply function because there are times when it could be more efficient than a loop. There are only 3 functions that you should really know. You can get away with just these 3 function to solve almost 90% of your tasks.
1. `lapply` for almost `everything`
2. `apply` for looping over `rows`
3. `mapply` for looping over `multiple vectors` as an argument to a single function
There is `vapply` too for strictly getting `vector` in return. I have used it very rarely because you can unlist the lapply for same results as well. `Map` is nothing more than mapply with parameter SIMPLIFY as true. `Reduce` can also be used but in very rare circumstances.
There are a certain things you need to know about apply function.
### apply functions are not much faster than loops
Many people wrongly assume that just because you are using apply functions that means your code is vectorized, which is not true at all. Apply functions are loops under the hood and they are meant for convenience not for speed.
```{r}
microbenchmark::microbenchmark(
lapply = lapply(1:1e3, rnorm),
forloop = for(i in 1:1e3){
rnorm(i)
},
times = 10
)
```
I have tested this on bigger vector and the results are almost identical. There difference is not too much. but lapply gives you optimized loop to begin and thus you should always prefer a lapply where ever it's possible but you should not be scared to use a loop either as the speed is mostly the same.
### Nested lapply have same speed as a normal lapply
```{r}
microbenchmark::microbenchmark(
nested = lapply(1:1e3, function(x){
rnorm(x)
}) %>%
lapply(function(x){
sum(x)
}),
normal = lapply(1:1e3, function(x){
rnorm(x) %>%
sum()
}),
times = 10
)
```
as you can see that nesting multiple lapply function doesn't slow the code. However it's not a standard practice to do it and I would not recommend anybody to make your code harder to read by nesting multiple lapply functions.
So let me summarize it lapply is better than loops but it's no where near the speed of a vectorized code. Let's talk about the fastest way to speed up your code.
## Vectorize your code
R is vectorized to the core. Every function in R is vectorized. Even the comparison operators are vectorized. This is a core strength of R. If you can break your task down to vectorized operation you can make it faster even after adding more steps to it. Let's take an example.
```{r}
dummy_text <- sample(
x = letters,
size = 1e3,
replace = TRUE
)
dummy_category <- sample(
x = c(1,2,3),
size = 1e3,
replace = TRUE
)
main_table <- data.frame(dummy_text, dummy_category)
```
Now this table has a 1000 text that I want to join into a a huge corpus based on their category. Anybody familiar with other programming languages like python or java or c++ will look for a loop that can solve it. If you try that approach it might go like this.
```{r}
join_text_norm <- function(df = main_table){
text <- character(length(unique(df$dummy_category)))
for(i in seq_along(df$dummy_category)){
if ( df$dummy_category[[i]] == 1 ) {
text[[1]] <- paste0(text[[1]], df$dummy_text[[i]])
} else if ( df$dummy_category[[i]] == 2 ) {
text[[2]] <- paste0(text[[2]], df$dummy_text[[i]])
} else {
text[[3]] <- paste0(text[[3]], df$dummy_text[[i]])
}
}
return(text)
}
join_text_norm()
```
This is not the most optimized function but this can get the job done. And I am breaking a golden rule here.
### never repeat a calculation
in the above code I could save the some time by storing the value of text into a variable and stop R from calculating it again and again.
```{r}
join_text_saved <- function(
df = main_table
){
text <- character(length(unique(df$dummy_category)))
for(i in seq_along(df$dummy_category)){
curr_text <- df$dummy_text[[i]]
curr_cat <- df$dummy_category[[i]]
if (curr_cat == 1 ) {
text[[1]] <- paste0(text[[1]], curr_text)
} else if ( curr_cat == 2 ) {
text[[2]] <- paste0(text[[2]], curr_text)
} else {
text[[3]] <- paste0(text[[3]], curr_text)
}
}
return(text)
}
join_text_saved()
```
```{r}
microbenchmark::microbenchmark(
join_text_norm(df = main_table),
join_text_saved(df = main_table)
)
```
We did not save much on it but we still saved one millisecond on just a 1000 loop. It's an excellent practice of not to repeat calculation. Especially when you are calculating multiple things again and again.
Now coming back to the point. You could try this approach just like every other programming language does. Or you can try a vectorized approach with the built in paste function with collapse argument.
```{r}
collapsed_fun <- function(
df = main_table
){
text <- df %>%
split(f = dummy_category) %>%
lapply(function(x)
paste0(x$dummy_text,collapse = "")
) %>%
unlist()
return(text)
}
collapsed_fun(main_table)
```
Let's compare it with the loop approach.
```{r}
microbenchmark::microbenchmark(
join_text_norm(),
join_text_saved(),
collapsed_fun()
)
```
Collapsed function is faster than all the other approach for just 1000 loops. Imagine doing it for 1 million. Vectorized function in those cases will be 1000 times faster than loops.
The real reason for that is vectorized code uses optimized c for looping which is almost always faster than loops in R. And at times you can get 1000x speed compared to a normal loop. and Thus sometimes you can get away with doing more with vectors than with loops.
### Vectorized code can do 2 or 3 steps more in lesser time
There is a classical example that I read in a book `efficient R` and I was amazed to see why it happened.
```{r}
if_norm <- function(
x,
size
){
y <- character(size)
for(i in 1:1e3){
value <- x[[i]]
if(value == 0){
y[[i]] <- "zero"
} else if(value == 1){
y[[i]] <- "one"
} else {
y[[i]] <- "many"
}
}
return(y)
}
if_vector <- function(
x,
size
){
y <- character(size)
y[1:size] <- "many"
y[x == 1] <- "one"
y[x == 0] <- "zero"
return(y)
}
```
Both the function will return the same vector. However we are doing minimum with the normal function while in vectorized function we are doing redundant operations.
```{r}
size <- 1e3
x <- sample(
x = c(0,1,2),
size = size,
replace = TRUE
)
all.equal(
if_norm(x, size),
if_vector(x, size)
)
```
Let's check the speed of both the functions. Even though We are doing 3 steps in the vectorized solution and doing the same thing 3 times just to get a solution yet we are faster because vectorized code is very very fast you can get away with doing more and still saving time. Compare vectorized to be FLASH GORDEN and it can be faster even when it's doing more than it should.
```{r}
microbenchmark::microbenchmark(
minimal = if_norm(x,size),
vectorized = if_vector(x, size)
)
```
## Understanding non-vectorized code
There are times when you need only scalar values and in those times it is redundant to use vectorized code. I see many people not understanding the difference and using vectorized code inside a loop on a scalar variable when non vectorized would have been more efficient. Let's check it with an example.
```{r}
n_size <- 1e4
binary_df <- data.frame(
x = sample(
x = c(TRUE, FALSE),
size = n_size,
replace = TRUE
),
y = sample(
x = c(TRUE, FALSE),
size = n_size,
replace = TRUE
),
z = sample(
x = c(TRUE, FALSE),
size = n_size,
replace = TRUE
)
)
```
Let's check if we can find all the rows where all the variable are true. The fastest method would be vectorized solution like this
```{r}
all_true <- binary_df$x & binary_df$y & binary_df$z
```
But when suppose you are using it in a loop to find the exact same thing.
```{r}
### vectorized code
vect_all_true <- function(
df
){
y <- logical(nrow(df))
for(i in seq_along(binary_df$x)){
y[i] <- df$x[i] & df$y[i] & df$z[i]
}
return(y)
}
### scalar code
scalar_all_true <- function(
df
){
y <- logical(nrow(df))
for(i in seq_along(df$x)){
y[[i]] <- df$x[[i]] && df$y[[i]] && df$z[[i]]
}
return(y)
}
```
Let's compare both the functions for speed.
```{r}
microbenchmark::microbenchmark(
vect_all_true(df = binary_df),
scalar_all_true(df = binary_df),
times = 10
)
```
I know this is not an excellent example because it can be vectorized easily but in the cases where you are working on individual scalar values using non-vectorized code gives you speed. In our current example we are getting twice the speed which is enough for such a small data set and the difference will increase with the number of rows.
## Do as little as possible inside a loop
R is an interpreted language every thing you write inside a loop runs multiple times. The best thing you can do is to be parsimonious while writing code inside a loop. There are a number of steps that you can do to speed up a loop a bit more.
1. Calculate results before the loop
2. initialize objects before the loop
3. Iterate on as few numbers as possible
4. Write as less functions inside a loop as possible
The main tip is to ***Get out of loop as quickly as possible***. There is another very crucial thing you can do to speed up your code.
### Combine Vectorized code inside a loop
The best case is to figure which part of codes can be optimized with a vectorized solution and which would require you to loop through. The key is to use ***as minimum loops as possible and as much as vectorized code as possible***. This is the same thing that helps in parallelizing the code too.
```{r}
n_size <- 1e5
hr_df <- data.table::data.table(
department = sample(
x = letters[1:5],
size = n_size,
replace = TRUE
),
salary = sample(
x = 1e3:1e4,
size = n_size,
replace = TRUE
)
)
```
Let's try to find out How much money each department is paying for salary.
```{r}
sum_salary <- function(
df
){
answer <- list()
for(dep in unique(df$department)){
value <- df[
department == dep,
sum(salary, na.rm = TRUE)
]
answer[[dep]] <- value
}
return(answer)
}
microbenchmark::microbenchmark(
sum_salary(df = hr_df)
)
```
I am doing as much vectorized calculation as possible in this scenario and this is the reason this code runs pretty fast. If I write a loop that goes through all the `r n_size` rows then I would get around 100x slower speed. This is a neat trick you must use whenever you can. Do as little as possible inside a loop.
## Conclusion
This chapter mainly focused on Loops and how to optimize them. Loops are necessary and these tips will help run them better
1. vectorize your code
2. You can do more with vectorization and still be faster than a loop
3. Use vectorized and scalar code with care
4. combine vectorized code with loops to gain maximum power
5. Initialize Your object before the loop
6. use simpler data-types inside loop
7. apply functions are not faster than loops
8. nested apply functions don't necessary mean slower code but you should avoid them
9. Don't repeat same calculation
10. cache or save the results
11. run garbage collection for heavy calculations