generated from jtr13/cctemplate
-
Notifications
You must be signed in to change notification settings - Fork 139
/
joining_and_reshaping_in_r.Rmd
277 lines (196 loc) · 13 KB
/
joining_and_reshaping_in_r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
# Joining and reshaping in R
Yisi Liu
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```
```{r}
library(gridExtra)
library(dplyr)
library(tidyr)
library(ggplot2)
library(forcats)
```
## Motivation
The majority of the analyses we have done in assignments and lectures so far have been based on a single data set. In the real world, however, raw data may be saved in different files or tables, so there may not always be a meta data set containing all of the information. For example, we want a summary table showing how many students are in each department, but department information and student information are kept separate. To begin the analysis, we will need to combine the two or more data sets into a single one and, if necessary, change it into a preferred format.
During my previous intern experience as an analytics engineer, merging and transforming the source data into the desired table is a crucial step. We performed data wrangling and ETL (extract-transform-loading) pipelines with SQL back then. While working on assignments for this class, I surprisingly discovered some comparable functions within `dplyr` and `tidyr`, which I will explain in this tutorial.
## Content
- Joining / Merging data sets
- `bind_rows`, `bind_cols`
- `inner_join`, `full_join`, `left_join`, `right_join`
- `semi_join`, `anti_join`
- Reshaping data
- `pivot_longer`
- `pivot_wider`
## Combining Tables
### Setup
We will use `student` table to walk through the tutorial today. `student` includes information `student_id`, `student_name`, and `department_id` so far.
```{r fig.height=2}
set.seed(111)
student <- data.frame('student_id' = sample(100000:999999, 4, replace=FALSE),
'student_name' = c('James','Alice','Amy','Roger'),
'department_id' = c(1,1,3,4))
grid.arrange(tableGrob(student), top = 'student table')
```
### Types of Bind
**bind_rows()**
At the beginning of this semester, two new students Troy and Wendy transferred to our school. Table `new_student` records their student id, name, and department id (new students' majors remain undeclared so `NA` here).
```{r fig.height=1.5}
set.seed(123)
new_student = data.frame('student_id' = sample(100000:999999, 2, replace=FALSE),
'student_name' = c('Troy','Wendy'),
'department_id' = NA)
grid.arrange(tableGrob(new_student), top = 'new_student table')
```
For the convenience of future analyses, we want to add new students' information to our existing `student` table. To append the rows of `new_students` onto `student`, we perform `bind_rows()` function in `dplyr`.
```{r fig.height=2.5}
student <- bind_rows(student, new_student)
grid.arrange(tableGrob(student), top = 'bind_rows: student & new_student')
```
**bind_cols()**
In this semester, suppose the first half of students on `student` table are assigned to take STAT 5701 while the second half to take STAT 5702. A table `course` records the course id and course name each student takes, but there is no student name.
```{r fig.height=2.5}
course = data.frame(
'course_id' = c(rep('STAT 5701',nrow(student)/2),rep('STAT 5702',nrow(student)/2)),
'course_name' = c(rep('Probability',nrow(student)/2),rep('Data Visualization',nrow(student)/2)))
grid.arrange(tableGrob(course), top = 'course table')
```
As we are interested in questions like which class James takes or if Roger and Amy take the same class this year, we want to include these two columns in our `student` table.
To add an additional column, we use `bind_columns()` here. Now we are able to tell that James takes STAT 5701, and Roger and Amy are not taking the same class this semester.
```{r fig.height=2.5}
bind_c = bind_cols(student, course)
grid.arrange(tableGrob(bind_c) , top = 'bind_rows: student & course')
```
*Note: `rbind()` and `cbind()` function in base R perform similarly as `bind_rows()` and `bind_cols()`. You can find more information about `rbind()` and `cbind()` by trying `?rbind` and `?cbind`.*
### Mutating Joins
We have shown that `bind_cols()` could append columns to a table. However, it is less likely that a table with the same number of rows as the primary table would always exist. Consider the case where the `student` table has 10,000 rows. Another table with 10,000 rows, each indicating a student's major, would take up a lot of space. A more common way to store relational data is as follows:
```{r fig.height=2.5,fig.width=9}
department <- data.frame('department_id' = c(1,2,3,4,5),
'department_name' = c('Math','Computer Science',
'Philosophy','Linguistics',
'Economics'))
grid.arrange(tableGrob(student),tableGrob(department), nrow=1, top = 'student table, department table')
```
In this way, the `department` table will only have to store a modest amount of data. However, the new difficulty is how to link this data to the `student` table, as we want to know which students belong to which departments.
> **Keys**
We require a common variable (or combination of variables) that uniquely identifies an observation in at least one of the tables and identifies it in the same way across both tables to appropriately align observations between them. This variable is commonly referred to as a 'key'. **department id** is the key shared by `student` and `department` in our case.
To accommodate the many ways tables are arranged, we require multiple distinct forms of joins. Here is a summary table. https://dplyr.tidyverse.org/reference/mutate-joins.html
Join types Meaning
------------ -----------------------------------------------
inner_join keep rows that have matching records in both tables
full_join keep rows when there is a match in either left or right table
left_join keep rows from left table, and matched records from right table
right_join keep rows from right table, and matched records from left table
```{r, out.width = '400px', echo = F}
knitr::include_graphics('resources/joining_and_reshaping_in_r/join_pic.jpg')
```
Now, let's perform different joins to merge the tables.
**inner_join()**
```{r fig.height=2}
join_inner <- inner_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_inner), top = 'inner join: student & department')
```
**full_join()**
```{r fig.height=3}
join_full <- full_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_full), top = 'full join: student & department')
```
**left_join()**
```{r fig.height=2.5}
join_left <- left_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_left), top = 'left join: student & department')
```
**right_join()**
```{r fig.height=2.5}
join_right <- right_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_right), top = 'right join: student & department')
```
> **Multiple keys**
Multiple keys may be required in some circumstances to uniquely identify an observation. Our data sets below, for example, each student has two separate observations, therefore we need to connect them by both `student_id` and `course_id`.
```{r fig.height=4,fig.width=9}
student2 <- data.frame('student_id' = rep(student$student_id, 2),
'student_name' = rep(student$student_name, 2),
'course_id' = c(rep('STAT 5701',nrow(student)),rep('STAT 5702',nrow(student)))) %>%
arrange(student_id)
grades <- data.frame('course_id' = c(rep('STAT 5701',nrow(student)),rep('STAT 5702',nrow(student))),
'student_id' = rep(student$student_id, 2),
'final_grade' = c('A','A-','B','B+','A','B+','B','A-','A','A','A','A'))%>%
arrange(course_id, student_id)
grid.arrange(tableGrob(student2),tableGrob(grades), nrow=1, top = 'student2 table, grades table')
student_with_grades = inner_join(student2, grades, by = c('student_id','course_id'))
grid.arrange(tableGrob(student_with_grades), top = 'inner join: student2 & grades')
```
### Filtering joins
In addition to mutating joins that add columns to `x` from `y`, there are two more filtering joins functions within `dplyr`. Filtering joins filter rows from `x` based on the presence or absence of matches in `y`. https://dplyr.tidyverse.org/reference/filter-joins.html
Join types Meaning
------------ ----------------------------------------------------
semi_join keep rows from left table with a match in right table
anti_join keep rows from left table without a match in right table
Recall that `department_id` is the key for both `student` table and `department` table, and `department_id`= {1,3,4} occurs in both tables.
```{r fig.height=2.5,fig.width=9, echo = F}
grid.arrange(tableGrob(student),tableGrob(department), nrow=1, top = 'student table, department table')
```
**semi_join()**
```{r fig.height=2}
#only keep rows in student table
join_semi1 <- semi_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_semi1), top = 'semi join: student & department')
```
```{r fig.height=1.5}
#only keep rows in department table
join_semi2 <- semi_join(department,student, by = 'department_id')
grid.arrange(tableGrob(join_semi2), top = 'semi join: department & student')
```
**anti_join()**
```{r fig.height=1.5}
#only keep rows in student table
join_anti1 <- anti_join(student, department, by = 'department_id')
grid.arrange(tableGrob(join_anti1), top = 'anti join: student & department')
```
```{r fig.height=1.5}
#only keep rows in department table
join_anti2 <- anti_join(department, student, by = 'department_id')
grid.arrange(tableGrob(join_anti2), top = 'anti join: department & student')
```
### R vs. SQL
The commands for joins in R and SQL are included in the table below. We can see that both languages have mutating/filtering joins.
dplyr SQL
--------------------------- -----------------------------------------------
bind_rows(x, y) (SELECT * FROM x) UNION (SELECT * FROM y)
inner_join(x, y, by = 'z') SELECT * FROM x INNER JOIN y ON x.z = y.z
full_join(x, y, by = 'z') SELECT * FROM x FULL OUTER JOIN y ON x.z = y.z
left_join(x, y, by = 'z') SELECT * FROM x LEFT JOIN y ON x.z = y.z
right_join(x, y, by = 'z') SELECT * FROM x RIGHT JOIN y ON x.z = y.z
semi_join(x, y, by = 'z') SELECT x.* FROM x LEFT JOIN y ON x.z = y.z WHERE y.z IS NOT NULL
anti_join(x, y, by = 'z') SELECT x.* FROM x LEFT JOIN y ON x.z = y.z WHERE y.z IS NULL
## Reshaping
**pivot_wider()**
Recall the table `student_with_grades`.
```{r echo=F, fig.height=4}
grid.arrange(tableGrob(student_with_grades), top = 'student_with_grades')
```
The above data makes it difficult to read and compare students' grades in both classes. Alternatively, we can extract the courses' names and divide them into two columns, each containing the final grade as the value. We could make such a table with the `pivot wider()` function in `tidyr`.
```{r fig.height=2.5}
student_with_grades_2 = student_with_grades %>% pivot_wider(names_from = course_id, values_from = final_grade)
grid.arrange(tableGrob(student_with_grades_2), top = 'pivot_wider: student_with_grades_2')
```
**pivot_longer()**
Conversely, if we are given the 'wider' table `student_with_grades_2`, and want to convert the format to be `student_with_grades` in order to compare the distribution of final grades between two classes. We can do this by using the `pivor_longer()` function to reshape the table and plot a side-by-side bar chart from the 'longer' table.
```{r fig.height=4}
student_with_grades_3 <- student_with_grades_2 %>%
pivot_longer(cols = c('STAT 5701','STAT 5702'),
names_to = 'course_id',
values_to = 'final_grade')
grid.arrange(tableGrob(student_with_grades_3), top = 'pivot_longer: student_with_grades_3')
student_with_grades_3 %>%
ggplot(aes(fct_relevel(final_grade,'B', after = 3))) +
geom_bar() +
facet_grid(~course_id) +
ggtitle('Distribution of Final Grades in STAT 5701 & STAT 5702') +
xlab('Final grades')
```
## Self-evaluation
Before starting this project, I only knew **mutating joins** such as `inner_join()` and `left_join()`. I discovered **filtering joins** like `semi_join()` and `anti_join()` while working on it. The filtering joins are extremely useful when we only need data from `x` based on some standards from `y`.
In SQL, we normally use the `where` statement to create such a constraint in `x`, as seen in the 'R vs. SQL' comparison table I included above. This, I believe, is the most intriguing aspect of knowing many data-handling tools: we can always find other approaches that perform the same function.
Due to time constraints, this tutorial only covers as many as 'merging' and 'reshaping' functions that I am familiar with. If I get another chance, I will go over the parameters in each functions, such as `copy`, `keep`, `na_matches` in mutating joins, to assist everyone understand them better.
## Reference
R4DS Chapter 13: https://r4ds.had.co.nz/relational-data.html