---
title: "Introduction to R"
---
> #### Objectives
>
> * Get acquainted with the R command prompt
> * Create named objects and assign values to them
> * Create and work with vectors containing series of values
> * Call functions and use arguments to change their default options
> * Look at how R handles missing values
> * Write R code in a script
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = NA)
```
# Using R as a calculator
Open RStudio and type the following at the command prompt, **`>`** (in the console
tab pane), to add two numbers together.
```{r}
4 + 3
```
Clearly the answer is 7 but what's the `[1]` that you see printed before it?
Sometimes operations will return more than one value and these may get written
across several lines. Here's an example using one of the built-in datasets that
contains the lengths of the major North American rivers.
```{r}
rivers
```
The numbers in brackets are indexes for the first element printed on each line,
so the first line will always begin with `[1]` as that line starts with the
first element. If a subsequent line starts with `[11]`, for example, then the
first element printed on that line is the eleventh value in the series.
> **_Exercise_**
>
> Try doing some subtractions, multiplications and divisions at the R command
> prompt.
>
> The operator for multiplication is `*` and for division it is `/`.
Let's add several numbers together.
```{r}
32.33 + 28.6 + 29.49 + 25.7 + 30.81
```
And we'll divide by 5 to get the mean value.
```{r}
32.33 + 28.6 + 29.49 + 25.7 + 30.81 / 5
```
That doesn't look right. Can you see how R has interpreted this?
The last of our values, 30.81, was divided by 5 before adding the result to
the other values. This is because multiplication and division operations take
precedence over addition and subtraction and so they are calculated first.
We can use parentheses to ensure that our values are added together before
dividing by the number of values.
```{r}
(32.33 + 28.6 + 29.49 + 25.7 + 30.81) / 5
```
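As a further illustration of operator precedence, the following sketch uses made-up numbers (not part of the tumour data) to show how parentheses change the result. Exponentiation with `^` takes precedence over multiplication and division in turn.

```{r}
2 + 3 * 4      # multiplication first: 2 + 12
(2 + 3) * 4    # parentheses first: 5 * 4
2 * 3^2        # exponentiation first: 2 * 9
```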
# Creating objects in R
If we want to use our average value, perhaps in another calculation, we need
some way of storing it for use later. We need to assign the value to an
*object* and we can do this with the assignment operator, `<-`.
```{r}
average_tumour_size <- (32.33 + 28.6 + 29.49 + 25.7 + 30.81) / 5
```
It is also possible to use `=` for assignment and if you're familiar with other
programming languages this will feel more natural. `<-` is preferred though and
there are some situations where using `=` may have unforeseen consequences.
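A minimal sketch of one such situation: inside a function call, `=` matches a value to an argument name whereas `<-` creates an object. Here `x` is the name of the first argument of `mean()`; the values are made up for illustration and the sketch assumes that no object called `x` already exists in your workspace.

```{r}
# `=` inside the call names the argument x; no object is created
mean(x = c(2, 4, 6))
exists("x", inherits = FALSE)

# `<-` inside the call creates an object called x, then passes it to mean()
mean(x <- c(2, 4, 6))
x
```

Using `<-` for assignment and `=` only for naming arguments keeps the two roles visibly distinct.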
Our new object is listed in the Environment tab in the top right panel in
RStudio.
![](images/object_assignment.png){width=100%}
Objects are often referred to as *variables*, the term used in most other
programming languages.
We can now use our object in further calculations. For example, if our tumour
sizes were measured in millimetres and we wanted to convert the average value to
centimetres, we could do the following:
```{r}
average_tumour_size / 10
```
We could assign the converted value to another object,
```{r}
average_tumour_size_cm <- average_tumour_size / 10
```
or overwrite the existing one.
```{r}
average_tumour_size <- average_tumour_size / 10
```
To check the value of our object, we can get R to print it out in the
console by typing its name.
```{r}
average_tumour_size
```
# Vectors and data types
A *vector* is an ordered series of values and is the simplest data structure in
R. The `rivers` data set is an example of a vector.
We can create a vector of our tumour sizes using the `c()` function.
```{r}
tumour_sizes <- c(32.33, 28.6, 29.49, 25.7, 30.81)
tumour_sizes
```
We'll introduce functions in the next section but for now we note that `c`
stands for 'combine' and the `c()` function combines the values it is given
within the parentheses into a vector.
Most operations in R are *'vectorized'*, i.e. they can work on vectors. For
example we can convert our tumour sizes to centimetres in a single operation.
```{r}
tumour_sizes_cm <- tumour_sizes / 10
tumour_sizes_cm
```
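Vectorized operations also work between two vectors of the same length, pairing up their elements. The following is a small sketch with hypothetical values; `baseline_sizes` and `followup_sizes` are made-up names and are not used elsewhere in this material.

```{r}
baseline_sizes <- c(30, 28, 26)
followup_sizes <- c(27, 29, 20)

# element-by-element difference: 27 - 30, 29 - 28, 20 - 26
size_changes <- followup_sizes - baseline_sizes
size_changes
```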
Vectors contain values that are all of the same type. So far, we've only been
using numeric values but there are some other atomic data types including
Boolean (logical) and character values.
Character values are strings of characters enclosed in quotation marks.
```{r}
drug <- "Tamoxifen"
```
```{r}
drugs <- c("Tamoxifen", "Fulvestrant", "Olaparib", "Paclitaxel")
```
Logical values can be either `TRUE` or `FALSE`.
```{r}
positive_outcomes <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
```
Logical values are produced when using comparison operators, e.g. the greater
than operator `>`.
```{r}
average_tumour_size_cm > 3
```
We can also do this on vectors to produce logical vectors, something we'll come
back to shortly.
```{r}
tumours_larger_than_30mm <- tumour_sizes > 30
tumours_larger_than_30mm
```
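If you're unsure what type of values an object holds, the built-in `class()` function will tell you. A quick sketch using literal values like those above:

```{r}
class(29.49)        # a numeric value
class("Tamoxifen")  # a character value
class(TRUE)         # a logical value
```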
> **_Exercise_**
>
> Try creating a vector that contains values with different types (numeric,
> logical, character)
>
> Try different combinations of types and see what happens, for example
>
> `c(FALSE, 2.5)`
>
> `c(FALSE, 2.5, "hello")`
# Functions
R contains a large set of functions that do many useful operations. Let's have a
look at a simple example, the `log2()` function that calculates the base 2
logarithm of a number.
```{r}
log2(4)
```
A function usually takes one or more inputs known as *arguments*. Functions
often, but not always, return a value, which in turn can be assigned to an
object.
```{r}
a <- 10.25
b <- log2(a)
b
```
The `log2()` function only takes a single argument. Let's try a function that
can take multiple arguments: `round()`
```{r}
round(average_tumour_size)
```
By default the `round()` function rounds to the nearest whole number. We can
specify the number of digits to round to using the additional `digits` argument.
```{r}
round(average_tumour_size, digits = 2)
```
Information about a function can be found on its help page by typing `?round` or
`help(round)` at the command prompt.
```{r eval = FALSE}
?round
```
The 'Help' tab should be visible in the lower right panel in RStudio.
![](images/rstudio_help_tab_pane.png){width=100%}
The *Usage* section shows that the default value for the `digits` argument is 0
and the *Arguments* section explains that `digits` refers to the number of
decimal places. The help page also gives information for some related
functions, `signif()`, `ceiling()`, etc.
So if we don't specify the value for digits, it will round to 0 digits (decimal
places), i.e. to the nearest whole number.
If we provide the arguments in the exact same order as they are defined we
don't have to name them.
```{r}
round(average_tumour_size, 2)
```
It's good practice to put the non-optional arguments, like the number we're
rounding in this case, first in the function call, in the order they're
expected (in which case you don't need to name them), and then use names for all
the optional arguments you're specifying. It will make it much easier for
someone reading your code and is less error-prone, particularly when using
functions with many arguments.
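To illustrate, here is a short sketch using a made-up value. Named arguments can be given in any order, while unnamed arguments are matched by position.

```{r}
round(3.14159, digits = 2)      # number first, optional argument named
round(digits = 2, x = 3.14159)  # same result, but harder to read
round(3.14159, 2)               # matched by position
```

All three calls round the same number to two decimal places; the first form is the one recommended above.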
Many functions take vector arguments. Some are *vectorized* and carry out the
same operation on all the elements of the vector, e.g.
```{r}
log10(tumour_sizes)
```
Others compute a summary value from the given vector. For example, we can pass
our vector of tumour sizes to the `mean()` function to compute the average value
we calculated earlier.
```{r}
average_tumour_size <- mean(tumour_sizes)
average_tumour_size
```
> **_Exercise_**
>
> Try computing some other summary statistics on the vector of tumour sizes
> using the functions, `sd()`, `var()`, `median()`, `IQR()`, `min()` and
> `max()`.
>
> Look up the Help page for these functions. Try running some of the example
> code snippets given in the *Examples* section in the help page.
We can nest function calls, one within another,
```{r}
average_tumour_size <- round(mean(tumour_sizes), digits = 1)
average_tumour_size
```
but this can make for code that is difficult to read. Usually it is better to
keep things simple even if you end up with code that is more verbose.
```{r}
average_tumour_size <- mean(tumour_sizes)
average_tumour_size <- round(average_tumour_size, digits = 1)
average_tumour_size
```
# Extracting subsets from vectors
One of the operations we do frequently is to select subsets of our data that are
of particular interest.
To select one or more values from a vector we need to provide the index or
indices within square brackets.
```{r}
tumour_sizes[3]
```
```{r}
tumour_sizes[c(1, 4, 5)]
```
It is also quite common to extract a range of values using the `:` operator.
The `:` operator creates a sequence of integer numbers.
```{r}
2:4
tumour_sizes[2:4]
```
## Conditional subsetting
Another way of subsetting a vector is to use a logical vector.
```{r}
selected <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
tumour_sizes[selected]
```
You may be thinking that this seems very abstract and questioning why it would
ever be useful. But actually, it is probably the most commonly used way of
selecting values of interest from a vector.
Recall how we used the `>` operator to create a logical vector corresponding to
those tumours with a size greater than 30mm. We can use that to extract the
sizes of those tumours.
```{r}
tumours_larger_than_30mm <- tumour_sizes > 30
tumour_sizes[tumours_larger_than_30mm]
```
In practice, we wouldn’t really create a variable containing our logical vector
signifying which values are of interest. Instead we’d do this in a single step.
```{r}
tumour_sizes[tumour_sizes > 30]
```
Other comparison operators include `==` (equal to), `!=` (not equal to), `<` (less
than), `<=` (less than or equal to) and `>=` (greater than or equal to).
We can combine logical tests using the `&` and `|` operators. These are the R
versions of the AND and OR operations in Boolean algebra and are applied
element by element to vectors.
For example, we could obtain the sizes of tumours that are between 27.5mm and
30mm.
```{r}
tumours_of_interest <- tumour_sizes >= 27.5 & tumour_sizes <= 30
tumour_sizes[tumours_of_interest]
```
Or in a single command:
```{r}
tumour_sizes[tumour_sizes >= 27.5 & tumour_sizes <= 30]
```
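The `|` operator works in the same way. As a minimal sketch, using a hypothetical vector `example_sizes` so as not to alter the objects created above, we could select the sizes that fall outside that range.

```{r}
example_sizes <- c(32.33, 28.6, 29.49, 25.7, 30.81)

# TRUE where a size is below 27.5 OR above 30
outside_range <- example_sizes < 27.5 | example_sizes > 30
example_sizes[outside_range]
```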
# Modifying vectors
We can add new values to a vector using the `c()` function.
```{r}
tumour_sizes <- c(tumour_sizes, 31.92, 24.11)
tumour_sizes
```
We can also combine two or more vectors in the same way.
```{r}
more_tumour_sizes <- c(26.34, 29.93)
tumour_sizes <- c(tumour_sizes, more_tumour_sizes)
tumour_sizes
```
One or more values in a vector can be changed using the same subsetting
operations we used before but this time assigning new values to the subset.
```{r}
tumour_sizes[3] <- 33.67
tumour_sizes
```
```{r}
tumour_sizes[c(2, 6, 7)] <- c(29.58, 25.55, 34.51)
```
```{r}
tumour_sizes[4:6] <- c(31.83, 25.99, 27.24)
```
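Conditional subsetting can be used on the left-hand side of an assignment too. In this sketch the vector `capped_sizes` and the cap of 30 are hypothetical, chosen only to illustrate the idea.

```{r}
capped_sizes <- c(32.33, 28.6, 29.49, 25.7, 30.81)

# replace every value greater than 30 with 30
capped_sizes[capped_sizes > 30] <- 30
capped_sizes
```

The single value on the right-hand side is reused for every selected element.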
# Missing values
Missing values are quite common in scientific data and R has a way of handling
these using the special value, `NA`, which stands for 'not available'.
The `airquality` example data set that comes with R contains missing values.
This data set is a table of daily air quality measurements taken in New York
and includes observations of ozone levels, wind speed and temperature. We'll
extract the ozone measurements from the table (Ozone column) as a vector.
```{r}
ozone <- airquality$Ozone
ozone
```
We'll be looking at tabular data in the next part of the course so don't worry
about the `$` operator we used here for now.
Most functions will return `NA` if the data they work on contain missing values.
```{r}
mean(ozone)
```
The `mean()` function, and many like it, takes the view that it cannot compute
the mean for a set of values where some are unknown. This is quite annoying but
these functions usually have an argument named `na.rm` that can be set to
`TRUE` to remove the `NA` values before doing the calculation.
```{r}
mean(ozone, na.rm = TRUE)
```
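To find out how many values are missing before deciding what to do about them, the `is.na()` function returns a logical vector marking each `NA`. A small sketch with a hypothetical vector of readings:

```{r}
readings <- c(41, NA, 12, NA, 18)

is.na(readings)               # TRUE where a value is missing
sum(is.na(readings))          # TRUE counts as 1, so this is the number of NAs
mean(readings, na.rm = TRUE)  # mean of the three known values
```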
# The very useful `summary()` function
One very useful function is `summary()`. As the name suggests this produces a
summary of the values it is given. It is really flexible and can take vectors of
different types, tables and other data structures.
```{r}
summary(ozone)
```
```{r}
summary(tumours_larger_than_30mm)
```
```{r}
summary(airquality)
```
# Scripting in R
Up to now, we have mostly been typing code in the Console pane at the **`>`** prompt.
This is a very interactive way of working with R but it is also important to be
able to record the commands you've typed for when you come back to your analysis
later.
Instead we can create a script file containing our R commands; this is the way
most R coding is done.
From the RStudio '**File**' menu, select '**New File**' and then '**R Script**'.
![](images/rstudio_new_file_menu.png){width=50%}
You should now have a new file in its own tab, named 'Untitled1', at the top of
the left-hand side of RStudio. The console window no longer occupies the whole
of the left-hand side.
We can type code into this file just as we have done at the command prompt in
the Console tab pane. Save changes you make using the '**Save**' option from
the '**File**' menu. There is also a button or you can use a keyboard shortcut.
On a Mac this is <kbd>cmd</kbd> + <kbd>S</kbd> (press the <kbd>cmd</kbd> key
first and, while holding it down, press the <kbd>S</kbd> key); on Windows
it is <kbd>Ctrl</kbd> + <kbd>S</kbd>. RStudio will open a dialog box for you to
enter the file name and location the first time you try to save a new file. It is
a good idea to regularly save changes to your script.
![](images/rstudio_script.png){width=100%}
## Running scripts
Having typed an R command and hit the return key you'll notice that the
command isn't actually run like it was in the console window. That's because
you're writing your R code in an editor. To run a single line of code within
your script you can press the '**Run**' button at the top of the script.
![](images/rstudio_run_button_highlighted.png){width=60%}
This will run the line of code on which the cursor is flashing or the next line
of code if the cursor is on a blank or empty line.
The keyboard shortcut is more convenient in practice as you won't have to stop
typing at the keyboard to use your mouse. This is <kbd>cmd</kbd> +
<kbd>return</kbd> on a Mac and <kbd>Ctrl</kbd> + <kbd>enter</kbd> on Windows.
Running a line in your script will automatically move the cursor onto the
next command which can be very convenient as you'll be able to run successive
commands just by repeatedly clicking '**Run**' or using the keyboard shortcut.
You can also run the entire script by clicking on the '**Source**' button, a
little to the right of the '**Run**' button. More useful though is to run
'**Source with Echo**' from the Source drop-down menu as this will also display
your commands and the outputs from these in the Console window.
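Scripts can also be run from the console with the `source()` function. The sketch below writes two lines to a temporary file, standing in for a script you have saved, and runs it with `echo = TRUE` so that the commands and their output are printed. The object name `script_sizes` is made up for illustration.

```{r}
# create a small throwaway script; in practice this would be your saved .R file
script_path <- tempfile(fileext = ".R")
writeLines(c("script_sizes <- c(32.33, 28.6, 29.49)",
             "mean(script_sizes)"),
           script_path)

# run every command in the script, printing each one along with its output
source(script_path, echo = TRUE)
```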
## Adding comments to your code
Anything that follows a `#` character within a line of code is ignored by R.
This is useful as it allows you to add comments and explanations to your code.
```{r}
# extract tumour sizes that are greater than 30mm
large_tumour_sizes <- tumour_sizes[tumour_sizes > 30]
```
Comments usually appear at the beginning of lines but can also appear at the
end of an R statement.
```{r}
days <- c(1, 2, 4, 6, 8, 12, 16) # didn't manage to get a measurement on day 10
```
It is also quite common when looking at R code to see lines of code commented
out, usually replaced by another line that does something similar or makes a
small change.
```{r}
# random_numbers <- rnorm(100, mean = 0, sd = 1)
random_numbers <- rnorm(100, mean = 0, sd = 0.5)
```