forked from alexd106/Rbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-programming.Rmd
executable file
·523 lines (354 loc) · 34.3 KB
/
07-programming.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
# Programming in R {#prog_r}
After learning the basics, programming in R is the next big step towards attaining coding nirvana. In R this freedom is gained by being able to wield and manipulate code to get it to do exactly what you want, rather than contort, bend and restrict yourself to what others have built. It's worth remembering that many packages are written with the goal of solving a problem the author had. Your problem may be similar, but not exactly the same. In such cases you could contort, bend and restrict your data so that it works with the package, or you could build your own functions that solve your specific problems.
But there are already a vast number of R packages available, surely more than enough to cover everything you could possibly want to do? Why then, would you ever need to use R as a programming language? Why not just stick to the functions from a package? Well, in some cases you'll want to customise those existing functions to suit your specific needs. Or you may want to implement an approach that's hot off the press or you've come up with an entirely novel idea, which means there won't be any pre-existing packages that work for you. Both of these are not particularly common early-on when you start using R, but you may well find as you progress, the reliance on existing functions becomes a little limiting. In addition, if you ever start using more advanced statistical approaches (i.e. Bayesian inference coded through Stan or JAGS) then an understanding of basic programming constructs, especially loops, becomes fundamental.
In this Chapter we'll explore the basics of the programming side of R. Even if you never create your own function, you will at the very least use functions on a daily basis. Pulling back the curtain and showing the nitty-gritty of how these work will hopefully, in the worst case, help your confidence and in the best case, provide a starting point for doing some clever coding.
## Looking behind the curtain
A good way to start learning to program in R is to see what others have done. We can start by briefly peeking behind the curtain. In [Chapter 5](#graphics_r) we made use of the `theme_classic()` function when customising our `ggplot` figures. With many functions in R, if you want to have a quick glance at the machinery behind the scenes, we can simply write the function name but without the `()`. This is the same trick we used in [Chapter 5](#graphics_r) to alter the `theme_classic()` style of `ggplot2`\index{ggplot2 package} to make `theme_rbook()`\index{theme\_rbook()}.
Note that to view the source code of base R packages (those that come with R) requires some additional steps which we won't cover here (see this [link][show-code] if you're interested), but for most other packages that you install yourself, generally entering the function name without `()` will show the source code of the function.
```{r libraries, echo = FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
```
```{r theme_function, collapse=TRUE}
theme_classic
```
What we see above is the underlying code for this particular function. As we did in [Chapter 5](#graphics_r), we could copy and paste this into our own script and make any changes we deemed necessary. This approach isn't limited to `ggplot2` functions and can be used for other functions as well, although tread carefully and test the changes you've made.
Don't worry overly if most of the code contained in functions doesn't make sense immediately. This will be especially true if you are new to R, in which case it seems incredibly intimidating. To help with that, we'll begin by making our own functions in R in the next section.
## Functions in R
Functions are your loyal servants, waiting patiently to do your bidding to the best of their ability. They're made with the utmost care and attention ... though sometimes may end up being something of a Frankenstein's monster - with an extra limb or two and a head put on backwards. But no matter how ugly they may be they're completely faithful to you.
They're also very stupid.
If we asked you to go to the supermarket to get us some ingredients to make *Francesinha*, even if you don't know what the heck that is, you'd be able to guess and bring at least *something* back. Or you could decide to make something else. Or you could ask a celebrity chef for help. Or you could pull out your phone and search online for what [*Francesinha*][francesinha] is. The point is, even if we didn't give you enough information to do the task, you're intelligent enough to, at the very least, try to find a work around.
If instead, we asked our loyal function to do the same, it would listen intently to our request, stand still for a few milliseconds, compose itself, and then start shouting `Error: 'data' must be a data frame, or other object ...`. It would then repeat this every single time we asked it to do the job. The point here, is that code and functions are not intelligent. They cannot find workarounds. It's totally reliant on you, to tell it very explicitly what it needs to do step by step.
Remember two things: the intelligence of code comes from the coder, not the computer and functions need exact instructions to work.
To prevent functions from being *too* stupid you must provide the information the function needs in order for it to function. As with the *Francesinha* example, if we'd supplied a recipe list to the function, it would have managed just fine. We call this "fulfilling an argument". The vast majority of functions require the user to fulfill at least one argument.
This can be illustrated in the pseudocode below. When we make a function we can specify what arguments the user must fulfill (e.g. `argument1` and `argument2`), as well as what to do once it has this information (`expression`):
```{r generic_function, eval = FALSE}
nameOfFunction <- function(argument1, argument2, ...) {expression}
```
The first thing to note is that we've used the function `function()`\index{function()} to create a new function called `nameOfFunction`. To walk through the above code; we're creating a function called `nameOfFunction`. Within the round brackets we specify what information (i.e. arguments) the function requires to run (as many or as few as needed). These arguments are then passed to the expression part of the function. The expression can be any valid R command or set of R commands and is usually contained between a pair of braces `{ }` (if a function is only one line long you can omit the braces). Once you run the above code, you can then use your new function by typing:
```{r run_function, out.width="75%", eval = FALSE}
nameOfFunction(argument1, argument2)
```
Confused? Let's work through an example to help clear things up.
First we are going to create a data frame called `city`, where columns `porto`, `aberdeen`, `nairobi`, and `genoa` are filled with 100 random values drawn from a bag (using the `rnorm()`\index{rnorm()} function to draw random values from a Normal distribution with mean 0 and standard deviation of 1). We also include a "problem", for us to solve later, by including 10 `NA` values within the `nairobi` column (using `rep(NA, 10)`).
```{r df, out.width="75%"}
city <- data.frame(
porto = rnorm(100),
aberdeen = rnorm(100),
nairobi = c(rep(NA, 10), rnorm(90)),
genoa = rnorm(100)
)
```
Let's say that you want to multiply the values in the variables `Porto` and `Aberdeen` and create a new object called `porto_aberdeen`. We can do this "by hand" using:
```{r manual, out.width="75%"}
porto_aberdeen <- city$porto * city$aberdeen
```
We've now created an object called `porto_aberdeen` by multiplying the vectors `city$porto` and `city$aberdeen`. Simple. If this was all we needed to do, we can stop here. R works with vectors, so doing these kinds of operations in R is actually much simpler than other programming languages, where this type of code might require loops (we say that R is a vectorised language). Something to keep in mind for later is that doing these kinds of operations with loops can be much slower compared to vectorisation.
But what if we want to repeat this multiplication many times? Let's say we wanted to multiply columns `porto` and `aberdeen`, `aberdeen` and `genoa`, and `nairobi` and `genoa`. In this case we could copy and paste the code, replacing the relevant information.
```{r simple, out.width="75%"}
porto_aberdeen <- city$porto * city$aberdeen
aberdeen_genoa <- city$aberdeen * city$aberdeen
nairobi_genoa <- city$nairobi * city$genoa
```
While this approach works, it's easy to make mistakes. In fact, here we've "forgotten" to change `aberdeen` to `genoa` in the second line of code when copying and pasting. This is where writing a function comes in handy. If we were to write this as a function, there is only one source of potential error (within the function itself) instead of many copy-pasted lines of code (which we also cut down on by using a function).
In this case, we're using some fairly trivial code where it's maybe hard to make a genuine mistake. But what if we increased the complexity?
```{r complex, eval = FALSE}
city$porto * city$aberdeen / city$porto + (city$porto * 10^(city$aberdeen)) - city$aberdeen - (city$porto * sqrt(city$aberdeen + 10))
```
Now imagine having to copy and paste this three times, and in each case having to change the `porto` and `aberdeen` variables (especially if we had to do it more than three times).
What we could do instead is generalise our code for `x` and `y` columns instead of naming specific cities. If we did this, we could recycle the `x * y` code. Whenever we wanted to multiple columns together, we assign a city to either `x` or `y`. We'll assign the multiplication to the objects `porto_aberdeen` and `aberdeen_nairobi` so we can come back to them later.
```{r manual_function, eval = FALSE}
# Assign x and y values
x <- city$porto
y <- city$aberdeen
# Use multiplication code
porto_aberdeen <- x * y
# Assign new x and y values
x <- city$aberdeen
y <- city$nairobi
# Reuse multiplication code
aberdeen_nairobi <- x * y
```
This is essentially what a function does. OK down to business, let's call our new function `multiply_columns()` and define it with two arguments, `x` and `y`. In the function code we simply return the value of `x * y` using the `return()`\index{return()} function. Using the `return()` function is not strictly necessary in this example as R will automatically return the value of the last line of code in our function. We include it here to make this explicit.
```{r first_function}
multiply_columns <- function(x, y) {
return(x * y)
}
```
No that we've defined our function we can use it. Let's use the function to multiple the columns `city$porto` and `city$aberdeen` and assign the result to a new object called `porto_aberdeen_func`
```{r first_function2, collapse=TRUE}
porto_aberdeen_func <- multiply_columns(x = city$porto, y = city$aberdeen)
porto_aberdeen_func
```
If we're only interested in multiplying `city$porto` and `city$aberdeen`, it would be overkill to create a function to do something once. However, the benefit of creating a function is that we now have that function added to our environment which we can use as often as we like. We also have the code to create the function, meaning we can use it in completely new projects, reducing the amount of code that has to be written (and retested) from scratch each time. As a rule of thumb, you should consider writing a function whenever you’ve copied and pasted a block of code more than twice.
To satisfy ourselves that the function has worked properly, we can compare the `porto_aberdeen` variable with our new variable `porto_aberdeen_func` using the `identical()`\index{identical()} function. The `identical()` function tests whether two objects are *exactly* identical and returns either a `TRUE` or `FALSE` value. Use `?identical` if you want to know more about this function.
```{r identical_check, collapse=TRUE}
identical(porto_aberdeen, porto_aberdeen_func)
```
And we confirm that the function has produced the same result as when we do the calculation manually. We recommend getting into a habit of checking that the function you've created works the way you think it has.
Now let's use our `multiply_columns()` function to multiply columns `aberdeen` and `nairobi`. Notice now that argument `x` is given the value `city$aberdeen`and `y` the value `city$nairobi`.
```{r calc_w_na, eval = T, collapse=TRUE}
aberdeen_nairobi_func <- multiply_columns(x = city$aberdeen, y = city$nairobi)
aberdeen_nairobi_func
```
So far so good. All we've really done is wrapped the code `x * y` into a function, where we ask the user to specify what their `x` and `y` variables are.
Now let's add a little complexity. If you look at the output of `nairobi_genoa` some of the calculations have produced `NA` values. This is because of those `NA` values we included in `nairobi` when we created the `city` data frame. Despite these `NA` values, the function appeared to have worked but it gave us no indication that there might be a problem. In such cases we may prefer if it had warned us that something was wrong. How can we get the function to let us know when `NA` values are produced? Here's one way.
```{r example condition, collapse=TRUE}
multiply_columns <- function(x, y) {
temp_var <- x * y
if (any(is.na(temp_var))) {
warning("The function has produced NAs")
return(temp_var)
} else {
return(temp_var)
}
}
aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)
```
The core of our function is still the same. We still have `x * y`, but we've now got an extra six lines of code. Namely, we've included some conditional statements, `if` and `else`, to test whether any `NA`s have been produced and if they have we display a warning message to the user. The next section of this Chapter will explain how these work and how to use them.
## Conditional statements
`x * y` does not apply any logic. It merely takes the value of `x` and multiplies it by the value of `y`. Conditional statements are how you inject some logic into your code. The most commonly used conditional statement is `if`. Whenever you see an `if` statement, read it as *'If X is TRUE, do a thing'*. Including an `else` statement simply extends the logic to *'If X is TRUE, do a thing, or else do something different'*.
Both the `if`\index{if()} and `else`\index{else} statements allow you to run sections of code, depending on a condition is either `TRUE` or `FALSE`. The pseudocode below shows you the general form.
```{r schematic_ifelse, eval=FALSE}
if (condition) {
Code executed when condition is TRUE
} else {
Code executed when condition is FALSE
}
```
To delve into this a bit more, we can use an old programmer joke to set up a problem.
---
A programmer's partner says: *'Please go to the store and buy a carton of milk and if they have eggs, get six.'*
The programmer returned with 6 cartons of milk.
When the partner sees this, and exclaims *'Why the heck did you buy six cartons of milk?'*
The programmer replied *'They had eggs'*
---
At the risk of explaining a joke, the conditional statement here is whether or not the store had eggs. If coded as per the original request, the programmer should bring 6 cartons of milk if the store had eggs (condition = TRUE), or else bring 1 carton of milk if there weren't any eggs (condition = FALSE). In R this is coded as:
```{r joke_logic, collapse=TRUE}
eggs <- TRUE # Whether there were eggs in the store
if (eggs == TRUE) { # If there are eggs
n.milk <- 6 # Get 6 cartons of milk
} else { # If there are not eggs
n.milk <- 1 # Get 1 carton of milk
}
```
We can then check `n.milk` to see how many milk cartons they returned with.
```{r n_milk, collapse=TRUE}
n.milk
```
And just like the joke, our R code has missed that the condition was to determine whether or not to buy eggs, not more milk (this is actually a loose example of the [Winograd Scheme][winograd], designed to test the *intelligence* of artificial intelligence by whether it can reason what the intended referent of a sentence is).
We could code the exact same egg-milk joke conditional statement using an `ifelse()`\index{ifelse()} function.
```{r ifelse, collapse=TRUE}
eggs <- TRUE
n.milk <- ifelse(eggs == TRUE, yes = 6, no = 1)
```
This `ifelse()` function is doing exactly the same as the more fleshed out version from earlier, but is now condensed down into a single line of code. It has the added benefit of working on vectors as opposed to single values (more on this later when we introduce loops). The logic is read in the same way; "If there are eggs, assign a value of 6 to `n.milk`, if there isn't any eggs, assign the value 1 to `n.milk`".
We can check again to make sure the logic is still returning 6 cartons of milk:
```{r ifelse_check, collapse=TRUE}
n.milk
```
Currently we'd have to copy and paste code if we wanted to change if eggs were in the store or not. We learned above how to avoid lots of copy and pasting by creating a function. Just as with the simple `x * y` expression in our previous `multiply_columns()` function, the logical statements above are straightforward to code and well suited to be turned into a function. How about we do just that and wrap this logical statement up in a function?
```{r joke_function, collapse=TRUE}
milk <- function(eggs) {
if (eggs == TRUE) {
6
} else {
1
}
}
```
We've now created a function called `milk()` where the only argument is `eggs`. The user of the function specifies if eggs is either `TRUE` or `FALSE`, and the function will then use a conditional statement to determine how many cartons of milk are returned.
Let's quickly try:
```{r joke_func_check, collapse=TRUE}
milk(eggs = TRUE)
```
And the joke is maintained. Notice in this case we have actually specified that we are fulfilling the `eggs` argument (`eggs = TRUE`)? In some functions, as with ours here, when a function only has a single argument we can be lazy and not name which argument we are fulfilling. In reality, it's generally viewed as better practice to explicitly state which arguments you are fulfilling to avoid potential mistakes.
OK, lets go back to the `multiply_columns()` function we created above and explain how we've used conditional statements to warn the user if `NA` values are produced when we multiple any two columns together.
```{r mult-example, collapse=TRUE}
multiply_columns <- function(x, y) {
temp_var <- x * y
if (any(is.na(temp_var))) {
warning("The function has produced NAs")
return(temp_var)
} else {
return(temp_var)
}
}
```
In this new version of the function we still use `x * y` as before but this time we've assigned the values from this calculation to a temporary vector called `temp_var` so we can use it in our conditional statements. Note, this `temp_var` variable is *local* to our function and will not exist outside of the function due something called [R's scoping rules][scoping]. We then use an `if` statement to determine whether our `temp_var` variable contains any `NA` values. The way this works is that we first use the `is.na()`\index{is.na()} function to test whether each value in our `temp_var` variable is an `NA`. The `is.na()` function returns `TRUE` if the value is an `NA` and `FALSE` if the value isn't an `NA`. We then nest the `is.na(temp_var)` function inside the function `any()`\index{any()} to test whether **any** of the values returned by `is.na(temp_var)` are `TRUE`. If at least one value is `TRUE` the `any()` function will return a `TRUE`. So, if there are any `NA` values in our `temp_var` variable the condition for the `if()` function will be `TRUE` whereas if there are no `NA` values present then the condition will be `FALSE`. If the condition is `TRUE` the `warning()`\index{warning()} function generates a warning message for the user and then returns the `temp_var` variable. If the condition is `FALSE` the code below the `else` statement is executed which just returns the `temp_var` variable.
So if we run our modified `multiple_columns()` function on the columns `city$Aberdeen` and `city$nairobi` (which contains `NA`s) we will receive an warning message.
```{r , collapse=TRUE}
aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
```
Whereas if we multiple two columns that don't contain `NA` values we don't receive a warning message
```{r, collapse=TRUE}
porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)
```
## Combining logical operators
The functions that we've created so far have been perfectly suited for what we need, though they have been fairly simplistic. Let's try creating a function that has a little more complexity to it. We'll make a function to determine if today is going to be a good day or not based on two criteria. The first criteria will depend on the day of the week (Friday or not) and the second will be whether or not your code is working (TRUE or FALSE). To accomplish this, we'll be using `if` and `else` statements. The complexity will come from `if` statements immediately following the relevant `else` statement. We'll use such conditional statements four times to achieve all combinations of it being a Friday or not, and if your code is working or not.
```{r, out.width="75%", fig.align="center", collapse=TRUE}
good.day <- function(code.working, day) {
if (code.working == TRUE && day == "Friday") {
"BEST. DAY. EVER. Stop while you are ahead and go to the pub!"
} else if (code.working == FALSE && day == "Friday") {
"Oh well, but at least it's Friday! Pub time!"
} else if (code.working == TRUE && day != "Friday") {
"So close to a good day... shame it's not a Friday"
} else if (code.working == FALSE && day != "Friday") {
"Hello darkness."
}
}
good.day(code.working = TRUE, day = "Friday")
good.day(FALSE, "Tuesday")
```
Notice that we never specified what to do if the day was not a Friday? That's because, for this function, the only thing that matters is whether or not it's Friday.
We've also been using logical operators whenever we've used `if` statements. Logical operators are the final piece of the logical conditions jigsaw. Below is a table which summarises operators. The first two are logical operators and the final six are relational operators. You can use any of these when you make your own functions (or loops).
\
| Operator | Technical Description | What it means | Example |
|---------:|:-----------------------:|:----------------------------|:-------------------------------------|
| `&&` | Logical AND | Both conditions must be met | `if(cond1 == test && cond2 == test)` |
| `||` | Logical OR | Either condition must be met| `if(cond1 == test || cond2 == test)` |
| `<` | Less than | X is less than Y | `if(X < Y)` |
| `>` | Greater than | X is greater than Y | `if(X > Y)` |
| `<=` | Less than or equal to | X is less/equal to Y | `if(X <= Y)` |
| `>=` | Greater than or equal to| X is greater/equal to Y | `if(X >= Y)` |
| `==` | Equal to | X is equal to Y | `if(X == Y)` |
| `!=` | Not equal to | X is not equal to Y | `if(X != Y)` |
\
## Loops
R is very good at performing repetitive tasks. If we want a set of operations to be repeated several times we use what's known as a loop. When you create a loop, R will execute the instructions in the loop a specified number of times or until a specified condition is met. There are three main types of loop in R: the *for* loop, the *while* loop and the *repeat* loop.
Loops are one of the staples of all programming languages, not just R, and can be a powerful tool (although in our opinion, used far too frequently when writing R code).
### For loop
The most commonly used loop structure when you want to repeat a task a defined number of times is the `for` loop.
The most basic example of a `for` loop is:
```{r basic_for_loop, collapse=TRUE}
for (i in 1:5) {
print(i)
}
```
But what's the code actually doing? This is a dynamic bit of code were an index `i` is iteratively replaced by each value in the vector `1:5`. Let's break it down. Because the first value in our sequence (`1:5`) is `1`, the loop starts by replacing `i` with `1` and runs everything between the `{ }`. Loops conventionally use `i` as the counter, short for iteration, but you are free to use whatever you like, even your pet's name, it really does not matter (except when using nested loops, in which case the counters must be called different things, like `SenorWhiskers` and `HerrFlufferkins`.
So, if we were to do the first iteration of the loop manually
```{r manual_loop, collapse=TRUE}
i <- 1
print(i)
```
Once this first iteration is complete, the for loop *loops* back to the beginning and replaces `i` with the next value in our `1:5` sequence (`2` in this case):
```{r manual_loop_2, collapse=TRUE}
i <- 2
print(i)
```
This process is then repeated until the loop reaches the final value in the sequence (`5` in this example) after which point it stops.
To reinforce how `for` loops work and introduce you to a valuable feature of loops, we'll alter our counter within the loop. This can be used, for example, if we're using a loop to iterate through a vector but want to select the next row (or any other value). To show this we'll simply add 1 to the value of our index every time we iterate our loop.
```{r altering_i, collapse=TRUE}
for (i in 1:5) {
print(i + 1)
}
```
As in the previous loop, the first value in our sequence is 1. The loop begins by replacing `i` with `1`, but this time we've specified that a value of `1` must be added to `i` in the expression resulting in a value of `1 + 1`.
```{r altering_i_1, collapse=TRUE}
i <- 1
i + 1
```
As before, once the iteration is complete, the loop moves onto the next value in the sequence and replaces `i` with the next value (`2` in this case) so that `i + 1` becomes `2 + 1`.
```{r altering_i_2, collapse=TRUE}
i <- 2
i + 1
```
And so on. I think you got the idea! In essence this is all a `for` loop is doing and nothing more.
Whilst above we have been using simple addition in the body of the loop, you can also combine loops with functions.
Let's go back to our data frame `city`. Previously in the Chapter we created a function to multiply two columns and used it to create our `porto_aberdeen`, `aberdeen_nairobi`, and `nairobi_genoa` objects. We could have used a loop for this. Let's remind ourselves what our data look like and the code for the `multiple_columns()` function.
```{r recreate_data_function, collapse=TRUE}
# Recreating our dataset
city <- data.frame(
porto = rnorm(100),
aberdeen = rnorm(100),
nairobi = c(rep(NA, 10), rnorm(90)),
genoa = rnorm(100)
)
# Our function
multiply_columns <- function(x, y) {
temp <- x * y
if (any(is.na(temp))) {
warning("The function has produced NAs")
return(temp)
} else {
return(temp)
}
}
```
To use a list to iterate over these columns we need to first create an empty list (remember [lists](#lists)?) which we call `temp` (short for temporary) which will be used to store the output of the `for` loop.
```{r loop_function, collapse=TRUE}
temp <- list()
for (i in 1:(ncol(city) - 1)) {
temp[[i]] <- multiply_columns(x = city[, i], y = city[, i + 1])
}
```
When we specify our `for` loop notice how we subtracted 1 from `ncol(city)`. The `ncol()`\index{ncol()} function returns the number of columns in our `city` data frame which is `4` and so our loop runs from `i = 1` to `i = 4 - 1` which is `i = 3`. We'll come back to why we need to subtract 1 from this in a minute.
So in the first iteration of the loop `i` takes on the value `1`. The `multiply_columns()` function multiplies the `city[, 1]` (`porto`) and `city[, 1 + 1]` (`aberdeen`) columns and stores it in the `temp[[1]]` which is the first element of the `temp` list.
The second iteration of the loop `i` takes on the value `2`. The `multiply_columns()` function multiplies the `city[, 2]` (`aberdeen`) and `city[, 2 + 1]` (`nairobi`) columns and stores it in the `temp[[2]]` which is the second element of the `temp` list.
The third and final iteration of the loop `i` takes on the value `3`. The `multiply_columns()` function multiplies the `city[, 3]` (`nairobi`) and `city[, 3 + 1]` (`genoa`) columns and stores it in the `temp[[3]]` which is the third element of the `temp` list.
So can you see why we used `ncol(city) - 1` when we first set up our loop? As we have four columns in our `city` data frame if we didn't use `ncol(city) - 1` then eventually we'd try to add the 4^th^ column with the non-existent 5^th^ column.
Again, it's a good idea to test that we are getting something sensible from our loop (remember, check, check and check again!). To do this we can use the `identical()` function to compare the variables we created `by hand` with each iteration of the loop manually.
```{r compare_3_methods, collapse=TRUE}
porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)
i <- 1
identical(multiply_columns(city[, i], city[, i + 1]), porto_aberdeen_func)
aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
i <- 2
identical(multiply_columns(city[, i], city[, i + 1]), aberdeen_nairobi_func)
```
If you can follow the examples above, you'll be in a good spot to begin writing some of your own for loops. That said there are other types of loops available to you.
### While loop
Another type of loop that you may use (albeit less frequently) is the `while` loop. The `while` loop is used when you want to keep looping until a specific logical condition is satisfied (contrast this with the `for` loop which will always iterate through an entire sequence).
The basic structure of the while loop is:
```{r, eval = FALSE}
while(logical_condition){ expression }
```
A simple example of a while loop is:
```{r, collapse=TRUE}
i <- 0
while (i <= 4) {
i <- i + 1
print(i)
}
```
Here the loop will only continue to pass values to the main body of the loop (the `expression` body) when `i` is less than or equal to 4 (specified using the `<=` operator in this example). Once `i` is greater than 4 the loop will stop.
There is another, very rarely used type of loop; the `repeat` loop. The `repeat` loop has no conditional check so can keep iterating indefinitely (meaning a break, or "stop here", has to be coded into it). It's worthwhile being aware of it's existence, but for now we don't think you need to worry about it; the `for` and `while` loops will see you through the vast majority of your looping needs.
### When to use a loop?
Loops are fairly commonly used, though sometimes a little overused in our opinion. Equivalent tasks can be performed with functions, which are often more efficient than loops. Though this raises the question when should you use a loop?
In general loops are implemented inefficiently in R and should be avoided when better alternatives exist, especially when you're working with large datasets. However, loop are sometimes the only way to achieve the result we want.
**Some examples of when using loops can be appropriate:**
- Some simulations (e.g. the Ricker model can, in part, be built using loops)
- Recursive relationships (a relationship which depends on the value of the previous relationship ["to understand recursion, you must understand recursion"])
- More complex problems (e.g., how long since the last badger was seen at site $j$, given a pine marten was seen at time $t$, at the same location $j$ as the badger, where the pine marten was detected in a specific 6 hour period, but exclude badgers seen 30 minutes before the pine marten arrival, repeated for all pine marten detection's)
- While loops (keep jumping until you've reached the moon)
### If not loops, then what?
In short, use the apply family of functions; `apply()`\index{apply()}, `lapply()`\index{lapply()}, `tapply()`\index{tapply()}, `sapply()`\index{sapply()}, `vapply()`, and `mapply()`. The apply functions can often do the tasks of most "home-brewed" loops, sometimes faster (though that won't really be an issue for most people) but more importantly with a much lower risk of error. A strategy to have in the back of your mind which may be useful is; for every loop you make, try to remake it using an apply function (often `lapply` or `sapply` will work). If you can, use the apply version. There's nothing worse than realising there was a small, tiny, seemingly meaningless mistake in a loop which weeks, months or years down the line has propagated into a huge mess. We strongly recommend trying to use the apply functions whenever possible.
#### lapply {-}
Your go to apply function will often be `lapply()` at least in the beginning. The way that `lapply()` works, and the reason it is often a good alternative to for loops, is that it will go through each element in a list and perform a task (i.e. run a function). It has the added benefit that it will output the results as a list - something you'd have to otherwise code yourself into a loop.
An `lapply()` has the following structure:
```{r, eval = FALSE}
lapply(X, FUN)
```
Here `X` is the vector which we want to do *something* to. `FUN` stands for how much fun this is (just kidding!). It's also is short for "function".
Let's start with a simple demonstration first. Let's use the `lapply()` function create a sequence from 1 to 5 and add 1 to each observation (just like we did when we used a for loop):
```{r, collapse=TRUE}
lapply(0:4, function(a) {a + 1})
```
Notice that we need to specify our sequence as `0:4` to get the output `1 ,2 ,3 ,4 , 5` as we are adding `1` to each element of the sequence. See what happens if you use `1:5` instead.
Equivalently, we could have defined the function first and then used the function in `lapply()`
```{r, collapse=TRUE}
add_fun <- function(a) {a + 1}
lapply(0:4, add_fun)
```
The `sapply()` function does the same thing as `lapply()` but instead of storing the results as a list, it stores them as a vector.
```{r, collapse=TRUE}
sapply(0:4, function(a) {a + 1})
```
As you can see, in both cases, we get exactly the same results as when we used the for loop.
## Exercise
```{block2, note-text7, type='rmdtip'}
Congratulations, you've reached the end of Chapter 7! Perhaps now's a good time to practice some of what you've learned. You can find an exercise we've prepared for you [here][exercise7] on the course companion [website][course-web]. If you want to see our solutions for this exercise you can find them [here][exercise7-sol] (don't peek at them too soon though!).
```
```{r links, child="links.md"}
```