---
title: "Data Manipulation with dplyr"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 4)
library(tidyverse)
library(viridis)
library(conflicted)
library(nycflights13)
filter <- dplyr::filter
```
# Data Manipulation with dplyr
For today's demonstration (and the practice), we'll use data tables from the nycflights13 package.
## Setup
First, install the package if you don't already have it (uncomment the line below to do so).
```{r}
#install.packages('nycflights13')
```
The package is also loaded in the setup chunk above, but remember that you need to load it before you can use its tables.
```{r}
library(nycflights13)
```
Before we do anything else, let's look at the first few rows of the `flights` table.
```{r}
head(flights)
```
## Piping
### Pipes
The pipe operator `%>%` pipes the output of one function into the next (like `|` in Linux). You can use it to chain together as many commands/functions as desired. For a simple example, instead of calling `head()` on `flights`, we could have piped `flights` to `head()`.
```{r}
flights %>% head()
```
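To make the idea concrete, here is a minimal sketch checking that a pipe is just another way of writing an ordinary function call. The `toy_pipe` table is made up for illustration and is not part of nycflights13.

```{r}
library(dplyr)

# Hypothetical toy table, used only for this illustration
toy_pipe <- tibble(id = 1:6, value = c(10, 20, 30, 40, 50, 60))

# x %>% f(y) is evaluated as f(x, y): the left-hand side becomes the first argument
piped <- toy_pipe %>% head(3)
nested <- head(toy_pipe, 3)

# Both spellings produce the same table
identical(piped, nested)
```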
You can chain as many pipes together as you want, and you can also pipe into `ggplot()`.
```{r}
flights %>%
  filter(sched_dep_time <= 1200) %>%
  ggplot(aes(x = origin, y = dep_time, color = origin)) +
  geom_violin()
```
### Assignment
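The right-assignment operator `->` saves the output on its left as an object (aka a data structure) under whatever name you pick. It works like the more common `<-`, just pointing the other way, which is handy at the end of a pipe. You can save data of any format this way.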
Either as a list or a table
```{r}
list(1:100) -> int_list
int_list
```
```{r}
# Q1 is January through March, so keep months 1 to 3
flights %>% select(year, month, day, dep_time, arr_time) %>% filter(month <= 3) -> sched_flights_q1
sched_flights_q1
```
Or a plot
```{r}
ggplot(flights, aes(x = as.factor(month), y = dep_delay, color = month)) +
  geom_boxplot() +
  scale_color_viridis() +
  labs(x = 'Month', y = 'Departure Delay (min)') +
  theme_classic() +
  theme(legend.position = 'none') -> dep_delay_plot
dep_delay_plot
```
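If you're used to seeing `<-`, the two arrows do the same job. A minimal sketch with made-up values (the object names are hypothetical) showing that both directions create the same object:

```{r}
# Left assignment: name first, value second
left_assigned <- c(1, 2, 3)

# Right assignment: value first, name second
c(1, 2, 3) -> right_assigned

# The two objects are the same
identical(left_assigned, right_assigned)
```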
## dplyr functions
### Review
The first two functions are a review from week 2: `filter()` and `select()`. Remember `filter()` works on rows and `select()` works on columns.
#### `filter()`
As a quick review of `filter()`, filter for carrier F9 and arrival times after noon.
```{r}
filter(flights, carrier == 'F9', arr_time > 1200)
```
#### `select()`
As a quick review of `select()`, select the year, month, and day columns.
```{r}
select(flights, year, month, day)
```
As an additional trick, you can select a range of adjacent columns with `:`, just as with numbers. Typing `1:5` gives the numbers 1 to 5; likewise, `carrier:dest` selects `carrier`, `dest`, and every column between them in `flights`, as in the example below.
```{r}
select(flights, carrier:dest)
```
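The range is inclusive and follows the column order of the table. Here is a small sketch on a made-up table (`toy_cols` is hypothetical) that prints only the column names so the effect is easy to see. Dropping a range with a minus sign is an extra not covered above.

```{r}
library(dplyr)

# Hypothetical toy table with five columns, in this order
toy_cols <- tibble(a = 1, b = 2, c = 3, d = 4, e = 5)

# b:d keeps b, d, and every column between them, in table order
names(select(toy_cols, b:d))

# A minus sign in front of the range drops those columns instead
names(select(toy_cols, -(b:d)))
```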
### New Functions
#### `mutate()`
`mutate()` adds another column onto a table. Let's say we want to add the average time spent in security as another column. The syntax is `column_name = values`.
```{r}
mutate(flights, avg_secur_time = 1)
```
If you just add a single value (which could be a number or a character), it's repeated in every row. Sometimes this is useful, for example if you're adding a sample ID to an entire table, but often you want more than one value in your column. For that, use `ifelse()` or `case_when()`. For example, `flights` has three origin airports, the three airports around New York City: EWR (Newark), JFK, and LGA (LaGuardia). Maybe the average wait time in the New York airports (JFK and LGA) is 1 hour, but it's 1.5 hours in Newark. When you have two different values, you can assign them based on another column using `ifelse()`.
```{r}
mutate(flights, avg_secur_time = ifelse(origin == 'EWR', 1.5, 1)) %>%
  select(origin, avg_secur_time) # keep only the relevant columns so the new one is visible
```
If you have more than two values, use `case_when()`. In the example below, we'll set different times for each airport.
```{r}
mutate(flights, avg_secur_time = case_when(origin == 'EWR' ~ 1.5,
                                           origin == 'JFK' ~ 1,
                                           origin == 'LGA' ~ 2)) %>% select(origin, avg_secur_time)
```
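The example above covers every airport in `flights`, but it's worth knowing what happens when a row matches none of the conditions. The sketch below uses a made-up table (`toy_airports`, with a fictional airport code 'XYZ') to show that unmatched rows become `NA`, and that a final `TRUE ~ value` line acts as a catch-all.

```{r}
library(dplyr)

# Hypothetical airports; 'XYZ' is deliberately not covered by any condition
toy_airports <- tibble(origin = c('EWR', 'JFK', 'XYZ'))

toy_airports %>%
  mutate(
    # No catch-all: rows that match no condition get NA
    no_fallback = case_when(origin == 'EWR' ~ 1.5,
                            origin == 'JFK' ~ 1),
    # TRUE matches everything left over, so it works like an "else"
    with_fallback = case_when(origin == 'EWR' ~ 1.5,
                              origin == 'JFK' ~ 1,
                              TRUE ~ 2)
  ) -> toy_secur
toy_secur
```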
#### `arrange()`
`arrange()` orders the entire table by the column(s) selected.
```{r}
# this is how it looked originally
flights
```
Flights arranged by arrival time.
```{r}
arrange(flights, arr_time)
```
Flights arranged by carrier, then arrival time.
```{r}
arrange(flights, carrier, arr_time)
```
By default, `arrange()` orders from least to greatest. If you want the greatest values first, wrap the column in `desc()`.
```{r}
arrange(flights, desc(arr_time))
```
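One detail that matters for `flights`, which has missing arrival times: `arrange()` puts `NA` values at the end whether or not you use `desc()`. A minimal sketch on a made-up table (`toy_arrivals` is hypothetical):

```{r}
library(dplyr)

# Hypothetical arrival times, including one missing value
toy_arrivals <- tibble(flight = c('A', 'B', 'C', 'D'),
                       arr_time = c(900, NA, 1500, 600))

# Ascending: 600, 900, 1500, then NA
arrange(toy_arrivals, arr_time)

# Descending: 1500, 900, 600, and NA is still last
arrange(toy_arrivals, desc(arr_time))
```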
#### `group_by()` & `summarize()`/`summarise()`
`summarize()` collapses your table into a summary. This can be a summary statistic like the mean or median, or it can be whatever function you want. Also, FYI, tidyverse functions that have both American and British spellings, like `summarize()`/`summarise()`, can be called with either version.
```{r}
summarize(flights, avg_dep_time = mean(sched_dep_time))
```
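One caveat about the example above: nycflights13 stores times such as `sched_dep_time` as HHMM integers (659 means 6:59), so a plain `mean()` of them is only a rough guide and need not be a valid clock time. Below is a minimal sketch of one possible workaround, converting to minutes since midnight before averaging. The table `toy_times` and the column `dep_minutes` are made up for illustration.

```{r}
library(dplyr)

# Hypothetical scheduled departures in HHMM format: 6:59 and 7:01
toy_times <- tibble(sched_dep_time = c(659, 701))

toy_times %>%
  # hours * 60 + minutes gives minutes since midnight
  mutate(dep_minutes = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %>%
  summarize(avg_hhmm = mean(sched_dep_time),  # 680, which is not a valid clock time
            avg_minutes = mean(dep_minutes)) -> toy_avg  # 420 minutes, i.e. 7:00
toy_avg
```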
Using `group_by()`, you can group by one (or more) column(s) and have a summary function applied to each group individually.
```{r}
flights %>% group_by(origin) %>% summarize(avg_dep_time = mean(sched_dep_time))
```
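Columns with missing values need a little extra care: `mean()` returns `NA` if any value in the group is missing, unless you pass `na.rm = TRUE`. The sketch below uses a made-up table (`toy_delays` is hypothetical, with `NA` standing in for a flight that never departed) and also counts the rows per group with `n()`.

```{r}
library(dplyr)

# Hypothetical delays in minutes; NA marks a flight with no recorded departure
toy_delays <- tibble(origin = c('EWR', 'EWR', 'JFK', 'JFK'),
                     dep_delay = c(10, NA, 5, 15))

toy_delays %>%
  group_by(origin) %>%
  summarize(n_flights = n(),                              # rows in each group
            avg_delay_raw = mean(dep_delay),              # NA if any value is missing
            avg_delay = mean(dep_delay, na.rm = TRUE)) -> toy_summary  # ignores NA
toy_summary
```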
### All Together Now
```{r}
flights %>%
  filter(origin == 'LGA') %>%
  group_by(carrier) %>%
  summarize(avg_dep_time = mean(sched_dep_time), avg_arr_time = mean(sched_arr_time)) %>%
  ggplot(aes(x = carrier, y = avg_dep_time, color = avg_arr_time)) +
  geom_point(size = 4) +
  scale_color_viridis() +
  labs(x = 'Airline Carrier', y = 'Average Departure Time (24h)', color = 'Average Arrival Time (24h)') +
  theme_classic()
```