forked from thunguyen177/Big-data-resources
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdplyr.Rmd
226 lines (181 loc) · 5.9 KB
/
dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: "dplyr - An accompanian to dplyr Cheat Sheet"
author: "Thu Nguyen"
output: pdf_document #beamer_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## R Cheat Sheet
- RStudio >>> Help >>> Cheatsheet
- data()
Data sets in package 'datasets':
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
...
- data(package = "MASS")
Data sets in package 'MASS':
Aids2 Australian AIDS Survival Data
Animals Brain and Body Weights for 28 Species
....
## Summarise cases
```{r, echo = TRUE}
library(dplyr)
summarise(mtcars,avg=mean(mpg))
count(iris,Species)
```
- n_distinct()
```{r, echo = TRUE}
x = 1:5
x = c(x,x)
x
n_distinct(x)
```
## Group_by and pipes
```{r, echo = TRUE}
iris %>%
group_by(Species) %>%
summarise_all(mean)
# g_iris = group_by(iris,Species)
# upgroup(g_iris)
starwars %>% summarise_at(c("height", "mass"), mean, na.rm = TRUE)
starwars %>% summarise_if(is.numeric, mean, na.rm = TRUE)
```
## Manipulate cases
### Filter
```{r, echo = TRUE}
filter(iris, Sepal.Length > 7)[1:5,]
# Multiple criteria
filter(starwars, hair_color == "none" & eye_color == "black")[1:5,]
filter(starwars, hair_color == "none" | eye_color == "black")[1:5,]
# Multiple arguments are equivalent to and
filter(starwars, hair_color == "none", eye_color == "black")[1:5,]
```
### Distinct
```{r, echo = TRUE}
df <- tibble(
g = c(1, 1, 2, 2),
x = c(1, 1, 2, 1)
)
df
df %>% distinct()
df %>% distinct(x)
```
### Sample
```{r, echo = TRUE}
sample_frac(mtcars, 0.1)
sample_frac(mtcars, 1.5, replace = TRUE)[1:5,]
sample_frac(mtcars, 0.1, weight = 1 / mpg)
```
weight: Sampling weights. This must evaluate to a vector of non-negative numbers the same length as the input. Weights are automatically standardised to sum to 1.
```{r, echo=TRUE}
by_cyl <- mtcars %>% group_by(cyl)
sample_n(by_cyl, 3)
sample_n(by_cyl, 7, replace = TRUE)
sample_n(by_cyl, 3, weight = mpg / mean(mpg))
```
### Arrange
Use desc() to sort a variable in descending order.
```{r, echo = TRUE}
arrange(mtcars, cyl, disp)[1:5,]
arrange(mtcars, desc(disp))[1:5,]
```
## Manipulate variables
- Select/rename variables by name:
```{r, echo = TRUE}
select(iris, starts_with("Petal"))[1:5,]
select(iris, ends_with("Width"))[1:5,]
# Drop variables with -
select(iris, -starts_with("Petal"))[1:5,]
```
Some useful helpers to use with select(): contains(), matches(), num_range(prefix,range), one_of, everything(), starts_with, ends_with()
- mutate() adds new variables and preserves existing; transmute(): transform existing variables
```{r, echo = TRUE}
mtcars[1:5,] %>%
transmute(displ_l = disp / 61.0237)
mtcars %>%
group_by(cyl) %>%
mutate(rank = min_rank(desc(mpg)))
iris %>%
group_by(Species)%>%
mutate_all(funs(. / 2.54))
mutate_if(iris, is.numeric,funs(log(.)))[1:5,]
mutate_at(iris,vars(-Species), funs(log(.),log2(.)))[1:5,]
```
## Vetorized function: misc:
- between()
```{r, echo = TRUE}
x = 2*(1:10)
x
between(x,6,13)
x[between(x,6,13)]
```
- case_when()
```{r, echo = TRUE}
mtcars$carb
mtcars %>%
mutate(cg = case_when(.$carb <= 2 ~ "low", .$carb > 2 ~ "high")) %>%
.$cg %>%
table()
```
- na_if()
```{r, echo = TRUE}
y <- c("abc", "def", "", "ghi")
na_if(y, "")
```
## Names: library(tibble)
```{r, echo=TRUE}
names(iris)
rename(iris, Length = Sepal.Length)[1:3,]
library(tibble)
# ? rownames_to_column
mtcars[1:3,]
rownames_to_column(mtcars)[1:3,]
```
```{r, echo = TRUE}
x= rownames_to_column(mtcars)
column_to_rownames(x,"rowname")[1:3,]
```
## Combine
- cbind(): column bind
- rbind(): row bind
```{r, echo = TRUE}
x = 1:5
u = 2*x
rbind(x,u)
cbind(x,u)
```
- inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
- left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
- right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
- full_join(): return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
- semi_join(): return all rows from x where there are matching values in y, keeping just columns from x.
A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
- anti_join(): return all rows from x where there are not matching values in y, keeping just columns from x.
```{r, echo = TRUE}
head(band_members)
band_members %>% inner_join(band_instruments)
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
band_members %>% full_join(band_instruments)
band_members %>% semi_join(band_instruments)
band_members %>% anti_join(band_instruments)
# Use a named `by` if the join variables have different names
band_members %>% full_join(band_instruments2, by = c("name" = "artist"))
```
- intersect(x, y, ...)
- union(x, y, ...)
- union_all(x, y, ...)
- setdiff(x, y, ...)
- setequal(x, y, ...)
```{r, echo = TRUE}
mtcars$model <- rownames(mtcars)
first <- mtcars[1:20, ]
second <- mtcars[10:32, ]
intersect(first, second)[1:4,]
union(first, second)[1:4,]
setdiff(first, second)[1:4,]
setdiff(second, first)[1:4,]
union_all(first, second)[1:4,]
setequal(mtcars, mtcars[12:1, ])
```