-
Notifications
You must be signed in to change notification settings - Fork 1
/
08-AdvancedDataManipulation.Rmd
277 lines (198 loc) · 15.1 KB
/
08-AdvancedDataManipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
# Chapter 8: Advanced Data Manipulation {-}
> "Every new thing creates two new questions and two new opportunities." --- Jeff Bezos
There's so much more we can do with data in `R` than what we've presented. Two main topics we need to clarify here are:
1. How do you reshape your data from wide to long form or vice versa in more complex data structures?
2. How do can we automate tasks that we need done many times?
We will introduce both ideas to you in this chapter. To discuss the first, show the use of `long()` and `wide()` from the `furniture` package. For the second, we need to talk about loops. Looping, for our purposes, refers to the ability to repeat something across many variables or data sets. There's many ways of doing this but some are better than others. For looping, we'll talk about:
1. vectorized functions,
2. `for` loops, and
3. the `apply` family of functions.
## Reshaping Your Data {-}
We introduced you to wide form and long form of your data in Chapter 2. In reality, data can take on nearly infinite forms but for most data in health, behavioral, and social science, these two forms are sufficient to know.
In some situations, your data may have multiple variables with multiple time points (known as time-variant variables) and other variables that are not (known as time-invariant variables) as shown:
```{r, echo=FALSE}
d1 <- data.frame("ID"=c(1:10),
"Var_Time1"=rnorm(10), "Var_Time2"=runif(10),
"Var2_Time1"=rnorm(10), "Var2_Time2"=rnorm(10),
"Var3"=rnorm(10))
d1
```
Notice that this data frame is in wide format (each ID is one row and there are multiple times or measurements per person for two of the variables). To change this to wide format, we'll use `long()`. The first argument is the data.frame, followed by two variable names (names that we go into the new long form), and then the numbers of the columns that are the measures (e.g., `Var_Time1` and `Var_Time2`).
```{r}
long_form <- furniture::long(d1,
c("Var_Time1", "Var_Time2"),
c("Var2_Time1", "Var2_Time2"),
v.names = c("Var", "Var2"))
long_form
```
As you can see, it took the variable names and put that in our first variable that we called "measures". The actual values of the variables are now in the variable we called "values". Finally, notice that each ID now has two rows (one for each measure).
To go in the opposite direction (long to wide) we can use the `wide()` function. All we do is provide the long formed data frame, variables that are time-varying (`Var1` and `Var2`) and the variable showing the time points (`time`).
```{r}
wide_form <- furniture::wide(long_form,
v.names = c("Var", "Var2"),
timevar = "time")
wide_form
```
And we are back to the wide form.
These steps can be followed for situations where there are many measures per person, many people per cluster, etc. In most cases, this is the way multilevel data analysis occurs (as we discussed in Chapter 6) and is a nice way to get our data ready for plotting.
The following figure shows the features of both `long()` and `wide()`.
![](DataCleaning_Handout.jpg)
## Repeating Actions (Looping) {-}
To fully go into looping, understanding how to write your own functions is needed.
### Your Own Functions {-}
Let's create a function that estimates the mean (although it is completely unnecessary since there is already a perfectly good `mean()` function).
```{r}
mean2 <- function(x){
n <- length(x)
m <- (1/n) * sum(x)
return(m)
}
```
We create a function using the `function()` function.[^functions] Within the `function()` we put an `x`. This is the argument that the function will ask for. Here, it is a numeric vector that we want to take the mean of. We then provide the meat of the function between the `{}`. Here, we did a simple mean calculation using the `length(x)` which gives us the number of observations, and `sum()` which sums the numbers in `x`.
Let's give it a try:
```{r}
v1 <- c(1,3,2,4,2,1,2,1,1,1) ## vector to try
mean2(v1) ## our function
mean(v1) ## the base R function
```
Looks good! These functions that you create can do whatever you need them to (within the bounds that `R` can do). I recommend by starting outside of a function that then put it into a function. For example, we would start with:
```{r}
n <- length(v1)
m <- (1/n) * sum(v1)
m
```
and once things look good, we would put it into a function like we had before with `mean2`. It is an easy way to develop a good function and test it while developing it.
By creating your own function, you can simplify your workflow and can use them in loops, the `apply` functions and the `purrr` package.
For practice, we will write one more function. Let's make a function that takes a vector and gives us the N, the mean, and the standard deviation.
```{r}
important_statistics <- function(x, na.rm=FALSE){
N <- length(x)
M <- mean(x, na.rm=na.rm)
SD <- sd(x, na.rm=na.rm)
final <- c(N, M, SD)
return(final)
}
```
One of the first things you should note is that we included a second argument in the function seen as `na.rm=FALSE` (you can have as many arguments as you want within reason). This argument has a default that we provide as `FALSE` as it is in most functions that use the `na.rm` argument. We take what is provided in the `na.rm` and give that to both the `mean()` and `sd()` functions. Finally, you should notice that we took several pieces of information and combined them into the `final` object and returned that.
Let's try it out with the vector we created earlier.
```{r}
important_statistics(v1)
```
Looks good but we may want to change a few aesthetics. In the following code, we adjust it so we have each one labeled.
```{r}
important_statistics2 <- function(x, na.rm=FALSE){
N <- length(x)
M <- mean(x, na.rm=na.rm)
SD <- sd(x, na.rm=na.rm)
final <- data.frame(N, "Mean"=M, "SD"=SD)
return(final)
}
important_statistics2(v1)
```
We will come back to this function and use it in some loops and see what else we can do with it.
### Vectorized {-}
By construction, `R` is the fastest when we use the vectorized form of doing things. For example, when we want to add two variables together, we can use the `+` operator. Like most functions in `R`, it is vectorized and so it is fast. Below we create a new vector using the `rnorm()` function that produces normally distributed random variables. First argument in the function is the length of the vector, followed by the mean and SD.
```{r}
v2 <- rnorm(10, mean=5, sd=2)
add1 <- v1 + v2
round(add1, 3)
```
We will compare the speed of this to other ways of adding two variables together and see it is the simplest and quickest.
### For Loops {-}
For loops have a bad reputation in the `R` world. This is because, in general, they are slow. It is among the slowest of ways to iterate (i.e., repeat) functions. We start here to show you, in essence, what the `apply` family of functions are doing, often, in a faster way.
At times, it is easiest to develop a for loop and then take it and use it within the `apply` or `purrr` functions. It can help you think through the pieces that need to be done in order to get your desired result.
For demonstration, we are using the `for` loop to add two variables together. The code between the `()`'s tells `R` information about how many loops it should do. Here, we are looping through `1:10` since there are ten observations in each vector. We could also specify this as `1:length(v1)`. When using `for` loops, we need to keep in mind that we need to initialize a variable in order to use it within the loop. That's precisely what we do with the `add2`, making it a numberic vector with 10 observations.
```{r}
add2 <- vector("numeric", 10) ## Initialize
for (i in 1:10){
add2[i] <- v1[i] + v2[i]
}
round(add2, 3)
```
Same results! But, we'll see later that the speed is much than the vectorized function.
### The `apply` family {-}
The `apply` family of functions that we'll introduce are:
1. `apply()`
2. `lapply()`
3. `sapply()`
4. `tapply()`
Each essentially do a loop over the data you provide using a function (either one you created or another). The different versions are extremely similar with some minor differences. For `apply()` you tell it if you want to iterative over the columns or rows; `lapply()` assumes you want to iterate over the columns and outputs a list (hence the `l`); `sapply()` is similar to `lapply()` but outputs vectors and data frames. `tapply()` has the most differences because it can iterative over columns by a grouping variable. We'll show `apply()`, `lapply()` and `tapply()` below.
For example, we can add two variables together here. We provide it the `data.frame` that has the variables we want to add together.
```{r}
df <- data.frame(v1, v2)
add3 <- apply(df, 1, sum)
round(add3, 3)
```
The function `apply()` has three main arguments: a) the `data.frame` or list of data, b) 1 meaning to apply the function for each row or 2 to the columns, and c) the function to use.
We can also use one of our own functions such as `important_statistics2()` within the `apply` family.
```{r}
lapply(df, important_statistics2)
```
This gives us a list of two elements, one for each variable, with the statistics that our function provides. With a little adjustment, we can make this into a `data.frame` using the `do.call()` function with `"rbind"`.
```{r}
do.call("rbind", lapply(df, important_statistics2))
```
`tapply()` allows us to get information by a grouping factor. We are going to add a factor variable to the data frame we are using `df` and then get the mean of the variables by group.
```{r}
group1 <- factor(sample(c(0,1), 10, replace=TRUE))
tapply(df$v1, group1, mean)
```
We now have the means by each group. This, however, is probably replaced by the 3 step summary that we learned earlier in `dplyr` using `group_by()` and `summarize()`.
These functions are useful in many situations, especially where there are no vectorized functions. You can always get an idea of whether to use a `for` loop or an `apply` function by giving it a try on a small subset of data to see if one is better and/or faster.
#### Speed Comparison {-}
We can test to see how fast functions are with the `microbenchmark` package. Since it wants functions, we will create a function that uses the `for` looop.
```{r}
forloop <- function(var1, var2){
add2 <- vector("numeric", length(var1))
for (i in 1:10){
add2[i] <- var1[i] + var2[i]
}
return(add2)
}
```
Below, we can see that the vectorized version is nearly 50 times faster than the `for` loop and 300 times faster than the `apply`. Although the `for` loop was faster here, sometimes it can be slower than the `apply` functions--it just depends on the situation. But, the vectorized functions will almost always be *much* faster than anything else. It's important to note that the `+` is also a function that can be used as we do below, highlighting the fact that anything that does something to an object in `R` is a function.
```{r}
library(microbenchmark)
microbenchmark(forloop(v1, v2),
apply(df, 1, sum),
`+`(v1, v2))
```
Of course, as it says the units are in nanoseconds. Whether a function takes 200 or 200,000 nanoseconds probably won't change your life. However, if the function is being used repeatedly or on on large data sets, this can make a difference.
### Using "Anonymous Functions" in Apply {-}
Last thing to know here is that you don't need to create a named function everytime you want to use `apply`. We can use what is called "Anonymous" functions. Below, we use one to get at the N and mean of the data.
```{r}
lapply(df, function(x) rbind(length(x), mean(x, na.rm=TRUE)))
```
So we don't name the function but we design it like we would a named function, just minus the `return()`. We take `x` (which is a column of `df`) and do `length()` and `mean()` and bind them by rows. The first argument in the anonymous function will be the column or variable of the data you provide.
Here's another example:
```{r}
lapply(df, function(y) y * 2 / sd(y))
```
We take `y` (again, the column of `df`), times it by two and divide by the standard deviation of `y`. Note that this is gibberish and is not some special formula, but again, we can see how flexible it is.
The last two examples also show something important regarding the output:
1. The output will be at the level of the anonymous function. The first had two numbers per variable because the function produced two summary statistics for each variable. The second we multiplied `y` by 2 (so it is still at the individual observation level) and then divide by the SD. This keeps it at the observation level so we get ten values for every variable.
2. We can name the argument anything we want (as long as it is one word). We used `x` in the first and `y` in the second but as long as it is the same within the function, it doesn't matter what you use.
Finally, we may not want our variables to be in the list format. We may want to control more tightly what is outputted from the looping. For that, we can thank the `purrr` package (part of the `tidyverse`; note the three r's in purrr). This package provides many valuable functions that you can explore. Of particular mention here, though, are some of the `map*()` functions that work just like `lapply()`.
1. `map()` -- outputs a list
2. `map_df()` -- outputs a data frame
3. `map_if()` -- outputs a list but only makes any changes to the variables that meet a condition (e.g., `is.numeric()`).
```{r}
purrr::map(df, function(y) y * 2 / sd(y))
```
```{r}
purrr::map_df(df, function(y) y * 2 / sd(y))
```
```{r}
purrr::map_if(df, is.numeric, function(y) y * 2 / sd(y))
```
## Apply It {-}
[This link](https://tysonbarrett.com/DataR/Chapter8.zip) contains a folder complete with an Rstudio project file, an RMarkdown file, and a few data files. Download it and unzip it to do the following steps.
### Step 1 {-}
Open the `Chapter8.Rproj` file. This will open up RStudio for you.
### Step 2 {-}
Once RStudio has started, in the panel on the lower-right, there is a `Files` tab. Click on that to see the project folder. You should see the data files and the `Chapter8.Rmd` file. Click on the `Chapter8.Rmd` file to open it. In this file, import the data, reshape it to long form, create your own function to do something for you, and apply the function in a loop over some of the variables of the data set.
Once that code is in the file, click the `knit` button. This will create an HTML file with the code and output knitted together into one nice document. This can be read into any browser and can be used to show your work in a clean document.
## Conclusions {-}
These are useful tools to use in your own data manipulation beyond that what we discussed with `dplyr`. It takes time to get used to making your own functions so be patient with yourself as you learn how to get `R` to do exactly what you want in a condensed, replicable format.
With these new tricks up your sleeve, we can move on to more advanced plotting using `ggplot2`.
[^functions]: That seemed like excessive use of the word function... It is important though. So, get used to it!