forked from dzchilds/eda-for-bio
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2_06_dplyr_helpers.Rmd
53 lines (36 loc) · 4.02 KB
/
2_06_dplyr_helpers.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# Helper functions
```{r, include=FALSE}
library("dplyr")
library("nasaweather")
```
## Introduction
There are a number of __helper functions__ supplied by **dplyr**. Many of these are shown in the handy **dplyr** [cheat sheat](http://www.rstudio.com/resources/cheatsheets/). This is a short chapter. We aren't going to try to cover every single helper function here. Instead, we'll highlight some of the more useful ones, and point out where the others tend to be used. We also assume that the `storms_tbl` and `iris_tbl` tibbles have already been constructed (look over the previous two chapters to see how this is done).
## Working with `select`
There are relatively few helper functions that can be used with `select`. The job of these functions is to make it easier to match variable names according to various criteria. We'll look at the three simplest of these, but look at the examples in the help file for `select` and the [cheat sheat](http://www.rstudio.com/resources/cheatsheets/) to see what else is available.
We can select variables according to the sequence of characters used at the start of their name with the `starts_with` function. For example, to select all the variables in `iris_tbl` that begin with the word "Petal", we use:
```{r}
select(iris_tbl, starts_with("petal"))
```
This returns a table containing just `Petal.Length` and `Petal.Width`. As one might expect, there is also a helper function to select variables according to characters used at the end of their name. This is the `ends_with` function (no surprises here). To select all the variables in `iris_tbl` that end with the word "Length", we use:
```{r}
select(iris_tbl, ends_with("length"))
```
Notice that we have to quote the character string that we want to match against. This is not optional. However, the `starts_with` and `ends_with` functions are not case sensitive by default. For example, I passed `starts_with` the argument `"petal"` instead of `"Petal"`, yet it still selected variables beginning with the character string `"Petal"`. If we want to select variables on a case-sensitive basis, we need to set an argument `ignore.case` to `FALSE` in `starts_with` and `ends_with`.
The last `select` helper function we will look at is called `contains`. This allows us to select variables based on a partial match anywhere in their name. Look at what happens if we pass `contains` the argument `"."`:
```{r}
select(iris_tbl, contains("."))
```
This selects all the variables with a dot in their name.
There is nothing to stop us combining the different variable selection methods. For example, we can use this approach to select all the variables whose names start with the word "Petal" or end with the word "Length":
```{r}
select(iris_tbl, ends_with("length"), starts_with("petal"))
```
When we apply more than one selection criteria like this the `select` function returns all the variables that match either criteria, rather than just the set that meets all the criteria.
## Working with `mutate` and `transmute`
There are quite a few __helper functions__ that can be used with `mutate`. These make it easier to carry out certain transformations that aren't easy to do with base R functions. We won't explore these here as they tend to be needed only in quite specific circumstances. However, in situations where we need to construct an unusual variable---for example, one that ranks the values of another variable---it's always worth looking at the that [handy cheat sheat](http://www.rstudio.com/resources/cheatsheets/) to see what options might be available.
## Working with `filter`
There's one `dplyr` __helper function__ that works with `filter` that's definitely worth knowing about: the `between` function. This is used to identify the values of a variable that lie inside a defined range:
```{r}
filter(storms_tbl, between(pressure, 960, 970))
```
This example filters the `storms` dataset such that only values of `pressure` between 960 and 970 are retained. We could do the same thing using some combination of `>` or `<`, but the `between` function makes things a bit easier to read.