forked from fels-bioinformatics/fels_bioinformatics_meetup
-
Notifications
You must be signed in to change notification settings - Fork 0
/
data_wrangling_w_tidyr_INCLASS.Rmd
300 lines (199 loc) · 9.08 KB
/
data_wrangling_w_tidyr_INCLASS.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
```{r setup, include=FALSE}
# load libraries
library(tidyverse)
library(conflicted)
library(viridis)
# configure knit settings
knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 4)
# resolve package conflicts
filter <- dplyr::filter
select <- dplyr::select
```
# Data Wrangling with tidyr
The tidyr package has some built in data-sets specifically for practicing tidying, so they'll be used for the demonstration today.
## Tidy Data
The first example data table in tidyr is the cleverly named "table1". This table is tidy! The other tableNs in this dataset are untidy versions of this one to practice on.
```{r}
table1
```
As you can see when you look at table1, each column is a variable and contains only values of the same type belonging to that variable. Going foward, we'll be tidying messed up versions of table1
## tidyr functions
### `spread()`
If you look at table 2 in the chunk below, it's not tidy because the column 'type' contains two variables, cases and population. Cases and population are separate measurements, so they should be their own columns. To make more columns from existing columns, use `spread()`.
```{r}
table2
table2 %>% spread(type, count)
```
### `gather()`
Looking at table4a in the chunk below, it's not tidy because the columns 1999 and 2000 are not variable descriptions, they're data (the year the measurement was made in). We need the column names to be values in a column. To do that we'll use gather.
```{r}
table4a
# you can gather by specifying column names
table4a %>% gather(year, cases, `1999`:`2000`)
# also with column numbers
table4a %>% gather(year, cases, 2:3)
```
### `separate()`
In the chunk below, table3 is not tidy because it has multiple values separate by a backslash in the rate column. To split one column into two or more columns, use `separate()`
```{r}
table3
# separate example
table3 %>% separate(rate, into = c('cases', 'population'), sep = '/', convert = TRUE)
```
`separate()` also lets you specify an interger position to split at
```{r}
# just so you know, you can also separate by specifying a position to separate after
table3 %>% separate(rate, into = c('cases', 'population'), sep = 3)
```
### `unite()`
In the chunk below, table5 is not tidy because the date is split between two columns, century and year. To combine columns, use the opposite of `separate()`, `unite()`
```{r}
table5
table5 %>% unite(year, c('century', 'year'), sep = '')
```
You may have noticed that the rate column in table5 is the same as the rate column in table3 and contains two values. Chain together some of the tidyr functions above to completely tidy table5.
```{r}
table5 %>%
unite(year, c('century', 'year'), sep = '') %>%
separate(rate, into = c('cases', 'population'), sep = '/')
```
## Combining Tables
Binds and joins are in the dplyr package, but fit in better with data wrangling, so they were saved for this week.
### Binds
Attach tables together. In order to bind tables, the tables being bound together have to have the same number of rows OR the same number of columns (depending if you're binding by rows or columns).
#### `rbind()` / `bind_rows()`
Attach tables together by adding additional rows. When you bind by rows, the second table is added to the bottom of the first table.
```{r}
table4a
table4b
# base R
rbind(table4a, table4b)
# not same number of columns doesn't work
rbind(table4a, table1)
# tidyverse, dplyr
bind_rows(table4a, table4b)
# not same number of columns gives NAs
bind_rows(table4a, table1)
```
The difference between `rbind()` and `bind_rows()` is that if you have a different number of columns, `rbind()` will refuse to work and return an error, while `bind_rows()` will keep all columns from all tables and add NAs where there are no values.
#### `cbind()` / `bind_cols()`
Attach tables by adding additional columns.
```{r}
table4a
table4b
# base R
cbind(table4a, table4b)
# not same number of rows, rows are recycled (see how the stuff from table4a is repeated)
cbind(table4a, table1)
# tidyverse, dplyr
bind_cols(table4a, table4b)
# notice that bind_cols() adds a 1 to columns that are repeated so that each column has a unique name
# not same number of rows doesn't work
bind_cols(table4a, table1)
```
### Joins
Joins combine two tables together by common variables. The difference between binds and joins is that binds are intended to combine tables together that have exactly the same information, like for different samples, while joins are intended to combine together tables with complimentary but different information by common variables, like samples with their metadata. The `*_join()` functions combine either by the columns you specify or by auto-detecting columns with the same name. Mutating joins add additional columns, while filtering joins filter a table by another table.
We'll use the band_* tables from the dplyr package to practice joins. As you can see when you run the tables in the chunk below, the tables have information in common, musician names, but complimentary information, what band they belong to in one table and what instruments they play in the others.
```{r}
band_members
band_instruments
band_instruments2
```
The descriptions below refer to x and y in the join. x is the first table listed in the join and y is the second. The order tables are given in joins is important!
#### Mutating Joins
Mutating joins add columns onto a table from another table.
##### `inner_join()`
Joins tables by returning all rows where x and y equal each other (in the join columns) and all columns from both x and y.
```{r}
band_members
band_instruments
inner_join(band_members, band_instruments, by = c('name'))
```
##### `full_join()`
The opposite of `inner_join()`, `full_join()` joins tables by returning all rows all columns from both x and y. If there is no matching value between x and y, an NA will be returned.
```{r}
band_members
band_instruments
full_join(band_members, band_instruments, by = c('name'))
```
##### `left_join()`
Joins tables by return all rows from x and all columns from x and y. If there is no matching value in y for x, an NA will be returned.
```{r}
band_members
band_instruments
left_join(band_members, band_instruments, by = c('name'))
# order matters
left_join(band_instruments, band_members, by = c('name'))
```
##### `right_join()`
Joins tables by the opposite of `left_join()`, returning all rows from y and all columns from x and y. If there is no matching value in x for y, an NA will be returned.
```{r}
band_members
band_instruments
right_join(band_members, band_instruments, by = c('name'))
```
#### Filtering Joins
Filtering joins filter a table by another table.
#### `semi_join()`
Filters x by returning all rows from x that match values in y.
```{r}
band_members
band_instruments
semi_join(band_members, band_instruments, by = c('name'))
```
#### `anti_join()`
The opposite of `semi_join()`, `anti_join()` returns rows from x that DON'T match a value in y
```{r}
band_members
band_instruments
anti_join(band_members, band_instruments, by = c('name'))
```
#### What if my column names don't match?
Name and artist are the same variable, who the person is, in the tables below.
```{r}
band_members
band_instruments2
```
To specify the join by two differently named columns, specify that they equal each other. You MUST list the column names in the order the tables are given to the join. 'name' comes first below, because that's the column you want to join by in the first table.
```{r}
full_join(band_members, band_instruments2, by = c('name' = 'artist'))
# if there are no common names, it won't work if you don't tell the join that two columns are the same
full_join(band_members, band_instruments2)
```
### Dealing with NAs
If there's no value in a table, or R can't figure out what's supposed to be there, it will give you an `NA` as seen in the Star Wars table below
```{r}
starwars
```
NAs are bad not only because you're missing data, but also because many functions in R either refuse to run or return weird answers if there are NAs in the data.
```{r}
mean(starwars$birth_year)
```
#### `na.omit()`
One strategy is just to drop all rows that contain NAs
```{r}
starwars
starwars %>% na.omit()
```
#### `replace_na()`
Sometimes you might want to replace NAs with a value though, which you can do using tidyr
```{r}
starwars
starwars %>% replace_na(list(hair_color = 'none', birth_year = 'unknown', gender = 'n/a'))
```
## Putting it together
*2018.10.12* We didn't quite get to this in classes, because I talked a lot, but try running it on your own!
```{r}
# what the tidy table should look like
table1
# the tables we need to tidy into one table
table4a
table4b
```
```{r}
# tidy individual tables
table4a %>% gather(year, cases, `1999`:`2000`) -> table4a_tidy
table4b %>% gather(year, population, `1999`:`2000`) -> table4b_tidy
# combine
left_join(table4a_tidy, table4b_tidy, by = c('country', 'year'))
```