# Working with variables
```{r, include=FALSE}
library("dplyr")
library("nasaweather")
```
## Introduction
This chapter explores the **dplyr** `select` and `mutate` verbs, as well as the related `rename` and `transmute` verbs. These four verbs are considered together because they all operate on the variables (i.e. the columns) of a data frame or tibble:
- The `select` function selects a subset of variables to retain and (optionally) renames them in the process.
- The `mutate` function creates new variables from preexisting ones and retains the original variables.
- The `rename` function renames one or more variables while keeping the remaining variable names unchanged.
- The `transmute` function creates new variables from preexisting ones and drops the original variables.
### Getting ready
A script that uses **dplyr** will typically start by loading and attaching the package:
```{r, eval=FALSE}
library("dplyr")
```
Obviously, the **dplyr** package must already be installed for this to work.
We'll use the `iris` data set in the **datasets** package to illustrate the ideas in this chapter. The **datasets** package ships with R and is loaded and attached at start up, so there's no need to do anything to make `iris` available. The `iris` data set is an ordinary data frame. Before we start working, it's handy to convert this to a tibble so that it prints to the Console in a compact way:
```{r}
# as_tibble() is used here because tbl_df() is deprecated in recent dplyr versions
iris_tbl <- as_tibble(iris)
```
We gave the new tibble version a new name. We didn't have to do this, but it will remind us that we're working with tibbles.
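As a quick check, the sketch below confirms that the converted object is a tibble and that it is still a data frame underneath. It assumes a reasonably recent version of **dplyr**, in which `as_tibble` is available after loading the package.

```{r}
library("dplyr")

# Convert the built-in iris data frame to a tibble
iris_tbl <- as_tibble(iris)

# A tibble carries extra classes on top of "data.frame"
class(iris_tbl)

# ...so functions that expect a data frame still accept it
is.data.frame(iris_tbl)

# Same dimensions as the original: 150 rows and 5 columns
dim(iris_tbl)
```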
## Subset variables with `select`
We use `select` to __select variables__ from a data frame or tibble. This is typically used when we have a data set with many variables but only need to work with a subset of these. Basic usage of `select` looks like this:
```{r, eval=FALSE}
select(data_set, vname1, vname2, ...)
```
Take note: this is not an example we can run. It is "pseudo code", designed to show, in abstract terms, how we use `select`:
- The first argument, `data_set` ("data object"), must be the name of the object containing our data.
- We then include a series of one or more additional arguments, where each one is the name of a variable in `data_set`. We've expressed this as `vname1, vname2, ...`, where `vname1` and `vname2` are names of the first two variables, and the `...` is acting as placeholder for the remaining variables (there could be any number of these).
It's easiest to understand how a function like `select` works by seeing it in action. We select the `Species`, `Petal.Length` and `Petal.Width` variables from `iris_tbl` like this:
```{r}
select(iris_tbl, Species, Petal.Length, Petal.Width)
```
Hopefully nothing about this example is surprising or confusing. There are a few things to notice about how `select` works though:
* The `select` function is one of those non-standard functions we briefly mentioned in the [Using functions] chapter. This means the variable names should not be surrounded by quotes unless they have spaces in them (which is best avoided).
* The `select` function is just like other R functions: it does not have "side effects". What this means is that it does not change the original `iris_tbl`. We printed the result produced by `select` to the Console, so we can't access the new data set. If we need to access the result we have to assign it a name using ` <- `.
* The order of variables (i.e. the column order) in the resulting object is the same as the order in which they were supplied as arguments. This means we can reorder variables at the same time as selecting them if we need to.
* The `select` function will return the same kind of data object it is working on. It returns a data frame if our data was originally in a data frame and a tibble if it was a tibble. In this example, R prints a tibble because we had converted `iris_tbl` from a data frame to a tibble.
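The points above can be checked directly. The following is a minimal sketch; the name `petal_tbl` is just an illustrative choice, not something used elsewhere in the chapter.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# Assign the result a name so that we can use it later
petal_tbl <- select(iris_tbl, Species, Petal.Length, Petal.Width)

# Columns appear in the order we asked for them
names(petal_tbl)

# The original tibble is unchanged: it still has all five variables
names(iris_tbl)

# A plain data frame in gives a plain data frame out
class(select(iris, Species))
```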
It's sometimes more convenient to use `select` to subset variables by specifying those we do __not__ need, rather than specifying the ones to keep. We use the `-` operator to indicate that variables should be dropped. For example, to drop the `Petal.Width` and `Petal.Length` columns, we use:
```{r}
select(iris_tbl, -Petal.Width, -Petal.Length)
```
This returns a tibble with just the remaining variables: `Sepal.Length`, `Sepal.Width` and `Species`.
The `select` function can also be used to grab (or drop) a set of variables that occur in a sequence next to one another. We specify a series of adjacent variables using the `:` operator. This must be used with two variable names, one on the left hand side and one on the right. When we use `:` like this, `select` will subset both those variables along with any others that fall in between them. For example, if we need the two `Petal` variables and `Species`, we use:
```{r}
select(iris_tbl, Petal.Length:Species)
```
The `:` operator can be combined with `-` if we need to drop a series of variables according to their position in a data frame or tibble:
```{r}
select(iris_tbl, -(Petal.Length:Species))
```
The extra `( )` around `Petal.Length:Species` are important here: `select` will throw an error if we don't include them.
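Because the printed tibbles are long, it can be easier to compare the different ways of using `-` and `:` by looking only at the variable names each call keeps. This is a small sketch using `names`; the object names `dropped`, `ranged` and `dropped_range` are illustrative.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# Drop two named variables
dropped <- names(select(iris_tbl, -Petal.Width, -Petal.Length))
dropped        # "Sepal.Length" "Sepal.Width" "Species"

# Keep a run of adjacent variables
ranged <- names(select(iris_tbl, Petal.Length:Species))
ranged         # "Petal.Length" "Petal.Width" "Species"

# Drop a run of adjacent variables
dropped_range <- names(select(iris_tbl, -(Petal.Length:Species)))
dropped_range  # "Sepal.Length" "Sepal.Width"
```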
### Renaming variables with `select` and `rename`
In addition to selecting a subset of variables, the `select` function can also rename variables at the same time. To do this, we have to name the arguments using `=`, placing the new name on the left hand side. For example, to select the `Species`, `Petal.Length` and `Petal.Width` variables from `iris_tbl`, but also rename `Petal.Length` and `Petal.Width` to `Petal_Length` and `Petal_Width`, we use:
```{r}
select(iris_tbl, Species, Petal_Length = Petal.Length, Petal_Width = Petal.Width)
```
Renaming variables is a common task when working with data frames and tibbles. What should we do if the _only_ thing we would like to achieve is to rename variables, rather than rename and select them? The **dplyr** package provides an additional function called `rename` for this purpose. This function renames certain variables while retaining all others. It works in a similar way to `select`. For example, to rename `Petal.Length` and `Petal.Width` to `Petal_Length` and `Petal_Width`, we use:
```{r}
rename(iris_tbl, Petal_Length = Petal.Length, Petal_Width = Petal.Width)
```
Notice that the rename function also preserves the order of the variables found in the original data.
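To make the difference between the two renaming approaches concrete, the sketch below compares the variable names each one produces. The object names `selected_tbl` and `renamed_tbl` are illustrative.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# select() keeps only the variables we list, in the order we list them
selected_tbl <- select(iris_tbl, Species,
                       Petal_Length = Petal.Length, Petal_Width = Petal.Width)
names(selected_tbl)  # "Species" "Petal_Length" "Petal_Width"

# rename() keeps every variable, in its original position
renamed_tbl <- rename(iris_tbl,
                      Petal_Length = Petal.Length, Petal_Width = Petal.Width)
names(renamed_tbl)
# "Sepal.Length" "Sepal.Width" "Petal_Length" "Petal_Width" "Species"
```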
## Creating variables with `mutate`
We use `mutate` to __add new variables__ to a data frame or tibble. This is useful if we need to construct one or more derived variables to support an analysis. Basic usage of `mutate` looks like this:
```{r, eval=FALSE}
mutate(data_set, <expression1>, <expression2>, ...)
```
Again, this is not an example we can run; it's pseudo code that highlights in abstract terms how to use `mutate`. As always with **dplyr**, the first argument, `data_set`, should be the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is a valid R expression involving one or more variables in `data_set`. We've expressed these as `<expression1>, <expression2>`, where `<expression1>` and `<expression2>` represent the first two expressions, and the `...` is acting as placeholder for the remaining expressions. Remember, this is not valid R code. It is just intended to demonstrate the general usage of `mutate`.
To see `mutate` in action, let's construct a new version of `iris_tbl` that contains a variable summarising the approximate area of sepals:
```{r}
mutate(iris_tbl, Sepal.Width * Sepal.Length)
```
This creates a copy of `iris_tbl` with a new column called `Sepal.Width * Sepal.Length` (mentioned at the bottom of the printed output). Most of the rules that apply to `select` also apply to `mutate`:
* The expression that performs the required calculation is not surrounded by quotes. This makes sense, because an expression is meant to be evaluated so that it "does something". It is not a value.
* Once again, we just printed the result produced by `mutate` to the Console, rather than assigning the result a name using ` <- `. The `mutate` function does not have side effects, meaning it does not change the original `iris_tbl` in any way.
* The `mutate` function returns the same kind of data object as the one it is working on: a data frame if our data was originally in a data frame, a tibble if it was a tibble.
Creating a variable called something like `Sepal.Width * Sepal.Length` is not exactly ideal because it's a difficult name to work with. The `mutate` function can name variables at the same time as they are created. We have to name the arguments using `=`, placing the name on the left hand side, to do this. Here's how to use this construct to name the new area variable `Sepal.Area`:
```{r}
mutate(iris_tbl, Sepal.Area = Sepal.Width * Sepal.Length)
```
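If we want to keep the new variable, we have to assign the result a name. The sketch below stores it as `iris_area` (an illustrative name) and checks that the original `iris_tbl` is left untouched.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# Store the result so the new variable is available later
iris_area <- mutate(iris_tbl, Sepal.Area = Sepal.Width * Sepal.Length)

# The new variable is added after the existing ones
names(iris_area)

# The original tibble does not gain the new column
"Sepal.Area" %in% names(iris_tbl)  # FALSE

# First value: sepal width 3.5 times sepal length 5.1
iris_area$Sepal.Area[1]
```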
We can create more than one variable by supplying `mutate` multiple (named) arguments:
```{r}
mutate(iris_tbl,
       Sepal.Area = Sepal.Width * Sepal.Length,
       Petal.Area = Petal.Width * Petal.Length,
       # ratio of petal area to sepal area, built from the two new variables
       Area.Ratio = Petal.Area / Sepal.Area)
```
Notice that here we placed each argument on a new line, remembering the comma to separate arguments. We can do this because R ignores white space between arguments. It is a useful habit, because breaking a long function call across several lines makes it easier to read.
This last example reveals a nice feature of `mutate`: we can use newly created variables in further calculations. Here we constructed approximate sepal and petal area variables, and then used these to construct a third variable containing the ratio of these two quantities, `Area.Ratio`.
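The order of the arguments matters here, because a new variable can only be used after it has been created. The sketch below illustrates this under one assumption: that the ratio of interest is petal area divided by sepal area. It first checks the ratio against a direct calculation, then shows that putting the ratio first fails. The names `areas`, `direct` and `out_of_order` are illustrative.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# Area.Ratio comes last, so both area variables already exist when it is computed
areas <- mutate(iris_tbl,
                Sepal.Area = Sepal.Width * Sepal.Length,
                Petal.Area = Petal.Width * Petal.Length,
                Area.Ratio = Petal.Area / Sepal.Area)

# The same ratio computed directly from the original columns
direct <- with(iris, (Petal.Width * Petal.Length) / (Sepal.Width * Sepal.Length))
all.equal(areas$Area.Ratio, direct)

# Asking for the ratio before the areas exist produces an error,
# which we catch here so that the chunk still runs
out_of_order <- tryCatch(
  mutate(iris_tbl,
         Area.Ratio = Petal.Area / Sepal.Area,
         Sepal.Area = Sepal.Width * Sepal.Length,
         Petal.Area = Petal.Width * Petal.Length),
  error = function(e) "failed"
)
out_of_order
```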
### Transforming and dropping variables
Occasionally we may want to construct one or more new variables, and then drop all other variables in the original dataset. The `transmute` function is designed to do this. It is used in the same way as `mutate`, but it keeps only the variables named in the call:
```{r}
transmute(iris_tbl,
          Sepal.Area = Sepal.Width * Sepal.Length,
          Petal.Area = Petal.Width * Petal.Length,
          # ratio of petal area to sepal area, built from the two new variables
          Area.Ratio = Petal.Area / Sepal.Area)
```
Here we repeated the previous example, but now only the new variables were retained in the resulting tibble. If we also want to retain one or more variables without altering them we just have to pass them as unnamed arguments. For example, if we need to retain species identity in the output, we use:
```{r}
transmute(iris_tbl,
          Species,
          Sepal.Area = Sepal.Width * Sepal.Length,
          Petal.Area = Petal.Width * Petal.Length,
          # ratio of petal area to sepal area, built from the two new variables
          Area.Ratio = Petal.Area / Sepal.Area)
```
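A quick way to see what `transmute` keeps is to compare the variable names with and without the unnamed `Species` argument. This is a minimal sketch that computes only the two area variables; `only_new` and `with_species` are illustrative names.

```{r}
library("dplyr")
iris_tbl <- as_tibble(iris)

# Only the newly created variables are kept
only_new <- transmute(iris_tbl,
                      Sepal.Area = Sepal.Width * Sepal.Length,
                      Petal.Area = Petal.Width * Petal.Length)
names(only_new)      # "Sepal.Area" "Petal.Area"

# Passing Species as an unnamed argument carries it through unchanged
with_species <- transmute(iris_tbl,
                          Species,
                          Sepal.Area = Sepal.Width * Sepal.Length,
                          Petal.Area = Petal.Width * Petal.Length)
names(with_species)  # "Species" "Sepal.Area" "Petal.Area"

# The number of rows is the same as in the original data
nrow(with_species)
```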