-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
463 lines (292 loc) · 26.2 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
---
title: "Cleaning and Checking Data in R"
output:
learnr::tutorial:
progressive: true
allow_skip: true
df_print: paged
runtime: shiny_prerendered
description: >
Learn about the basics of cleaning and checking your data within R
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
Welcome to this first workbook in the R for Data Management Module.
During these workbooks you will not need to be writing any of your own R code but will see examples of code that should help you in your own projects.
This first workbook will look at some code you can use to check and clean your data in R.
This covers
* Creating new variables
* Dealing with missing data
* Checking values
* Recoding and changing values
* Sorting data
* Subsetting data
## Installing R and RStudio
Both R and RStudio are open source software tools, so these can be downloaded from the internet for free, forever.
Even if you have prior installations of R or RStudio on your computer - please ensure you download and install the most recent versions of both programs. R and RStudio are both updated on a very regular basis, and some of the features we may highlight in this course may not be available if you are using an older version.
<a href="http://milton-the-cat.rocks/learnr/r/r_getting_started/#section-installing-r-and-RStudio" target="_blank">Video covering the installation process</a>
The first thing to install is the most recent version of R for your operating system:
<a href="https://cran.r-project.org/bin/windows/base/" target="_blank">Download R for Windows</a>
<a href="https://cran.r-project.org/bin/macosx/" target="_blank">Download R for Mac</a>
Once you have completed the installation of R you can download RStudio, the link below should provide the recommended link to download based on your operating system.
<a href="https://RStudio.com/products/RStudio/download/#download" target="_blank">Download RStudio</a>
You should not need to change any of the default settings as you work through the installation process. If you are having any issues with the downloading and installation process please make sure that you have administrator rights on the computer you are working on. Please remember that you can use the help forum at any time if you are struggling with these steps.
## Tidyverse
A big advantage of R is that it is open source.Therefore the packages which people have designed to accomplish certain tasks are freely available for anyone to download and learn how to use.
In recent years, there is a curated collection of packages which has become immensely popular.The tidyverse. This is a range of packages quite closely linked to this established concept of "Tidy data" as developed by Hadley Wickham. In fact he literally wrote the <a href="https://r4ds.had.co.nz/" target="_blank">book on the subject</a>
The tidyverese contains a wide array of packages with different purposes, but the general philosophy is keeping your data and your code clean, readable and easy to understand.
For our purposes during this module of the course, our main 2 packages will be Dplyr and Tidyr. If you took Stats4SD's course ‘Introduction to R’, you may be somewhat familiar with these packages already.
```{r echo=FALSE, out.width="20%", fig.align='left'}
knitr::include_graphics("./images/dplyr.png") # TO ADD
```
Dplyr is primarily concerned with data manipulation, from this package we will be looking into how to create new variables but also edit existing ones.In a later session we will also look at merging data. For more information on this package [visit this page](https://dplyr.tidyverse.org/)
```{r echo=FALSE, out.width="20%", fig.align='left'}
knitr::include_graphics("./images/tidyr.png")
```
Tidyr on the other hand is about some of the primary rules about Tidy data as the name implies. This includes functions to reshape your data and deal with some missing values. These will be our primary focus from this package but it does have other capabilities as well. [Visit this page for more information](https://tidyr.tidyverse.org/)
The easiest way to install these packages to your version of R would be to run the following code.
```{r installtidyverse,exercise=FALSE, echo=TRUE, message = FALSE}
install.packages("tidyverse")
#Remember that you need to load packages to use them
library(tidyverse)
```
## Dataset
For the purpose of this session workbook I have generated a small fake dataset that we can use to explore some of the basic tips for cleaning your data in R.
```{r data_generation, include=FALSE}
data1 <- data.frame(
index = 1:10,
plot_id = c(1,2,4,3,1,3,4,NA,5,5),
plant = c("Sorghum","sorg"," Sorghum", "Maize ","M","Maize", "Sorghum","sorg","M","M"),
fertilised = c(0,1,1,0,1,-99,1,0,0,0),
height = c(10,12,8,15,12,13,7,88,10,11),
n_pests = sample(1:40,10)
)
```
```{r,echo=FALSE}
data1
```
You can probably already see looking at this small rectangle of data that there are few issues here to deal with. This data is tidy in terms of having one row per observation, one column per variable and each cell is a single value. However, it is not clean. We have missing data, possible incorrect values, and inconsistent text formatting in the plant column.
## Creating new variables
Creating a new variable is something you are almost always going to do whether you are cleaning data or doing an analysis.
In R there are 2 basic ways of generating new variables. The base R way and the dplyr way.
For the case of this demonstration we are going to calculate a variable indicating the extent of the infestation of pests on the plant. This being number of pests / height of the plant. So number of pests per inch of the plant's height.
### Assign Operator
Base R is the basic version of R that comes when you just install R and does not require any packages. It is how R usually operates.
When we want to crate a new variable using base R we have to use something which is colloquially known as the assign operator. `<-` it is a less than symbol and a hyphen, to make a left pointing looking arrow. This will be familiar to any R user as it is how we save things
So to generate this new variable using base R we could write something like this
```{r assign, exercise = FALSE}
data1$pests_per_inch <- data1$n_pests/data1$height
```
Remember you will not see any output as you are saving that to a new column in your dataset, type in that column to take a look.
```{r, exercise = FALSE}
data1$pests_per_inch
```
### Mutate function
The mutate function comes from the dplyr package and is used to generate new variables. It behaves like many of the tidyverse functions in that the first argument will be the dataset.
For a brief tutorial please watch this video.
![](https://www.youtube.com/watch?v=k4Tq0hFtFno)
In the mutate function you supply the column name as a string of text in quotation marks. Then type in how this variable is to be calculated. Remember however that you still need to use the assign operator as you are making edits to the dataset that you will want to save.
```{r mutate,exercise = FALSE}
data1 <- mutate(.data = data1,"pests_per_inch" = n_pests/height)
```
For a basic example such as this the base R way perhaps makes the most sense. The advantages of mutates comes from when you want to chain some edits together, using the pipe operator ` %>%`, a slightly more advanced data manipulation technique that you may be familiar with from our previous R course.
As a brief introduction, the pipe operator allows you to feed data between subsequent data manipulation, reshaping or even plotting functions.It does this by using the pipe to note that the data argument should take the result of the previous step.
Using the pipe we can also create a column like this
```{r mutate2, exercise = FALSE}
data1 <- data1 %>%
mutate("pests_per_inch" = n_pests/height)
```
Because we have used the pipe after `data1` we do not need to specify it as our first argument in the mutate. For more information on the pipe operator please follow this [link](https://www.datacamp.com/community/tutorials/pipe-r-tutorial)
Both mutate and the assign can be used to edit existing columns as well, not just make new ones.
Mutate also allows us to create multiple columns at once. We can just keep listing new columns together for our code to create,
rather than doing them separately. For instance lets create a column converting height in inches to height in cm by multiplying it by 2.54.
```{r mutate3, exercise = FALSE}
data1 <-data1 %>%
mutate("pests_per_inch" = n_pests/height,
"height_cm" = height*2.54)
```
Dplyr comes with other functions such as `mutate_at()`, `mutate_if()` and `mutate_all()` to allow you to perform the same operation to multiple or even all your columns. So there is a lot of variability in what mutate can achieve.
## Missing Data
R deals with missing data in quite a particular way, you need to tell it what is missing in most cases. If the values are blank where there is missing data then this is usually fine, R will understand that this is missing data.
R won’t automatically understand text strings such as "na" or something numeric like -9 as meaning data is missing, it will interpret these as actual values. If missing data has been coded in this way, you can let R know that this means that the data is missing while importing the data from a csv or xlsx file. The functions for reading in these files come with an argument to specify the codes used for missing data.
In the example below we have told R that cells in our csv that are one of these 3 values are missing data.
```{r, eval=FALSE}
data1 <- read.csv("data.csv", na.strings = c("na", -9, -8))
```
If you were to do something like this with your data you would notice that those -8s and -9s have become `NA` instead. This is because R will not keep these missing codes but use its own internal character to mean missing. R will only recognise missing data if it takes the value of `NA` a special value in R to denote missing data. Note that `NA` and `"NA"` are not the same, the former is a special logical value in R and the other is a character string so again R would read `"NA"` as a literal character string.
If your data has already been properly coded and is clean then this should theoretically be all you need to do with missing data but there are a few things you can do if this is not the case.
### Using summary and table
Depending on the state of your missing data, there are 2 key ways we can check individual variables for missing data.
If your missing data is already coded as `NA` within R, i.e. R understands it is missing, use summary to look at some key summary statistics of that variable. This includes how many values are `NA`.
```{r missing1, exercise = FALSE}
summary(data1$plant)
```
Now we don't see a summary of missing data for the column plant because there aren’t any. Let's look at the column plot_id instead.
```{r missing2, exercise = FALSE}
summary(data1$plot_id)
```
Ah this time we get another value telling us the number of `NA` values in our column. This being 1. If we take another look at our data we can see by looking at plot number plot 2 only has 1 plant while plot 1, 3, 4 and 5 all have 2 plants.. In this basic example we can make the assumption that plant 8 should belong to plot 2 so we can simply overwrite this value with what it should be.
```{r,echo=FALSE}
data1
```
Other examples may not be so straightforward, if you find some missing data that is unexpected then you may need to either look at the questionnaire or data dictionary to understand the skip patterns to find if it was simply irrelevant and therefore should be missing for this observation. Or you may need to look back at an earlier version of the data to see if there was a missing data code which is not shown because of how R deals with missing data. Of course a lot of the time, missing data is truly missing and nothing needs to be done.
Alternatively, if your `NA` values have not yet been established but you have known missing data codes you could use `table` to look at how many observations in a variable are using one of these codes. Now in our data we have used `-99` to mean missing in our fertilised variable. So lets have a look at that variable using `table`
```{r, exercise = FALSE}
table(data1$fertilised)
```
Again we have one missing observation according to these codes. Now we could follow up on why this is missing as it feels as if it shouldn't. But for our purposes here lets assume this is unfortunately truly missing data and we don't know the true answer.
### Setting values to NA and replacing NA values
Now there are many many ways you can change a value in R. Some we will get onto later in this workbook. But for now lets focus on some ways that specifically deal with missing values.
Firstly, lets fix plot number. Now there are a few great ways to replace missing data with a new value. If you are editing an exiting column you do need to put quotation marks around its name.
The first of these works specifically in our case because we only need to fix one value. So what we can do is use the assign operator to save a change to this cell. Now to do this we need to specify exactly what cell we are looking at, we thankfully have a small piece of data so we can see easily it is row 8. Similar to when we made a new variable we write out the column but after it we put the row number in square brackets. This means row 8 of column `plot_id`. And then we can assign it to be the number 2
```{r, eval=FALSE}
data1$plot_id[8] <- 2
```
This is mainly useful when the number of changes you are making are quite small and targeted.
If you are making lots of the same change, an alternative would be to replace the row number with `is.na(data$column_name)` this will tell R that in the rows of this column where this is missing data, replace it with this value instead. So the could would look like this.
```{r, eval=FALSE}
data1$plot_id[is.na(data1$plot_id)] <- 2
```
`is.na` is a function that tests whether or not a particular observation is equal to `NA`.
The tidyverse offers another alternative that can be used with the mutate function. This is the `replace_na` function. This replaces a missing value with a specified value. Here we need to first specify the column we are working on and then the new value it should take. Again this is best used when all that missing data should now take the same value. If they need to be different then you will need a more targeted method that may be a little more manual. Such as the first option on this page.
```{r,exercise =FALSE}
data1 <- data1 %>%
mutate(plot_id = replace_na(plot_id,2))
```
Now there are a few options for doing the reverse.
We can modify the first 2 bits of code and use the same method for instance but sort of reverse the function as it were.
```{r, exercise =FALSE,eval=FALSE}
data1$fertilised[6] <- NA
```
or
```{r, exercise = FALSE,eval=FALSE}
data1$fertilised[data1$fertilised==-99] <- NA
```
Here the `==` operator is used to tell R to look for the rows where fertilised is equal to -9. We use `==` rather than `=` because one equals sign tells R a value is x, while two equals signs asks R if a value is x. The first is an assignment of a value, the second is a logical test.
Again the dplyr version for doing this is available and it is `na_if`. The syntax is exactly the same except we are giving the value that we want to become `NA` rather than the new value `NA` should become.
```{r,exercise = FALSE}
data1 <- data1 %>%
mutate(fertilised = na_if(fertilised, -99))
```
If your missing codes are consistent you can of course use `mutate_all` instead to edit all your columns at once rather than one by one. Note that for `mutate_all` you specify your arguments slightly differently. Therefore I recommend looking into the help page for this function as it is not always straight forward.
In this case our arguments are actually the name of the function, and then the missing code value. This will be applied to every column. Therefore you may need be a little creative if your missing codes are different across columns.
```{r,exercise = FALSE}
data1 <- data1 %>%
mutate_all(na_if, -99)
```
## Checking for implausible values
Now unless your data collection and management has been 100% robust with absolutely no room for errors at all, you will more than likely come across the odd value that doesn't quite seem right. A value that looks implausible, a bit like an extreme outlier.
It can be quite easy for these types of errors to occur. They are typically data entry errors, where a value has been incorrectly inputted for whatever reason.
We can use the same methods we used to detect missing data to seek out these potential errors. Sometimes it is quite common for these errors to go unnoticed until they cause an issue in your analysis. So seeking them out early in your data cleaning process could save time later down the line.
For instance, lets use summary to look at our number of insects variable.
```{r, exercise = FALSE}
summary(data1$n_pests)
```
A minimum of `r min(data1$n_pests)` and maximum of `r max(data1$n_pests)`, seems like a little bit of maybe large range.
Maybe we should look at this variable graphically to see if this seems a little out of place.
Ordinarily we would recommend using the tidyverse package ggplot2 for plotting your data, and we certainly recommend you do look at this package on your own time if you want to learn more, but for the sake of ease we will just quickly use the base plotting function `hist` to take a quick look. GGplot2 offers a lot more flexibility in plotting your data and creates a more visually appealing graph.
For Stats4SD's previous videos on this package, please follow this [link](https://www.youtube.com/watch?v=lbt4BH9Q82E&list=PLK5PktXR1tmM_ISpxwoiXD2SioG3MWNqg)
```{r, exercise = FALSE}
hist(data1$n_pests)
```
Based on this plot, nothing is really standing out as implausible.
Let's instead look to height of the plant
```{r, exercise = FALSE}
summary(data1$height)
```
Now that is a significant gap in our range. 7 to 88 Inches. Moreover our maximum is 88 while our 75th percentile is only 12.75. That certainty doesn't seem correct.
Let's look at the plot just to be sure
```{r, exercise = FALSE}
hist(data1$height)
```
Without a doubt this value of 88 is far too high to be correct. There are two possible reasons as to why this value has been entered incorrectly.. One possibility is that this value is meant to be 8 and the enumerator accidentally put 88 instead. Or perhaps a number like 88 was meant to be a missing code and this was incorrectly entered. In a case like this, it would probably be best to contact either the data manager for any insight, or the enumerator who collected data for this particular observation.
For our case let's assume the first option is true. Sometimes if there is not a concrete explanation you may have to use a bit of your best judgement to decide what to do about this sort of error. Whether it is changing it to a specific value or changing this data to be missing.
We will see on the next page how we could go about changing this value.
The above tips work best when dealing with a numeric variable. If you are dealing with something more categorical i would recommend either using `table` again or perhaps `unique`. This function will display a list of all the unique values in a column, whether it is numeric, categorical, a date. Any variable type will work
```{r, exercise = FALSE}
unique(data1$plant)
```
Here we can see that there is an issue with our plant variable. Seems the labels were not standardised. This also shows us that we are dealing with the issue of that hard to spot whitespace at the beginning and the end of some of our text. Note that "Sorghum" and " Sorghum" will register as different text strings. The same goes for "Maize" and "Maize ". We should fix that.
## Correcting labels / values
On the previous page we saw two more issues in our data
First, we have a data entry error in our height variable.
Secondly, our plant name variable must have been open text as our labels are not standardised.
To fix either of these issues we could use some methods we have already seen in regards to missing data, but I want to introduce a couple key functions that can be very useful in recoding your data.`ifelse` and `case_when`
`ifelse` works very much the same way IF() works in Excel.You provide a condition, a value for if that condition is true and a value for if that condition is false.
Let's use this function to correct our height variable. Again we can edit existing variables using mutate.
```{r, exercise = FALSE}
data1 <- data1 %>%
mutate(height = ifelse(height==88,8,height))
```
In this case we want to change the value if it is equal to 88, and we want it changed to just 8 instead. If height is not 88 then we want it to stay the same. `Ifelse` can be a very useful function for fixing mistakes and generating new variables for analysis.
Before we go onto standardising the plant variable we should remove that whitespace we have in our some of out data entry. This is actually something I would recommend you do with all your character variables, just in case some has made its way in, as it can be difficult to spot.
Thankfully r comes with a neat package `trimws`. You provide the column you want to remove the whitespace from, and then tell R if you want it to do this at the left (start) of the string, the right (end) or both. This is specified using the "which" argument.
```{r}
data1 <- data1 %>%
mutate(plant = trimws(plant, which = "both"))
unique(data1$plant)
```
We can see that the whitespace has been removed.
`case_when` works in a similar way to `ifelse` as it is about recoding variables.Particularly it is for generating new or editing existing categorical variables.
Again you provide a condition and a value for if that condition is true, however after each comma you provide a new condition and value. Some of you may be familiar with similar functions in STATA or SPSS. Note that the condition and value is separated by a tilde `~`. This is R notation to denote a formula. You may be familiar with it if you have done any statistical modelling in R before.
In the code below we want to change "Sorghum" and "sorg" to both be "Sorghum". We use the `%in%` operator to tell R to look to see if the value in the cell matches one of the values in a list that we then provide in the `c()` or combine function. So R will look for where plant is either "Sorghum" or "sorg". Then after the tilde we provide that new value we want it to be. We do the same for Maize. If there are any values not covered by your conditions in using case_when, it will be assumed by R that you want these to be NA. So always check that you are thorough in your code.
```{r, exercise = FALSE}
data1 <- data1 %>%
mutate(plant = case_when(
plant %in% c("Sorghum", "sorg") ~ "Sorghum",
plant %in% c("Maize", "M") ~ "Maize"
))
```
Now we need to make an important point about the order of your code in R. If you remember earlier we created a height in cm variable from our height in inches variable. However we now have changed one of those inch values. So if we look at height in cm, there is still an implausible value.
```{r, exercise = FALSE}
summary(data1$height_cm)
```
We could fix this directly but the easiest thing to do now would be to recalculate that height in cm variable.
```{r mutate_redo, exercise = FALSE}
data1 <-data1 %>%
mutate("pests_per_inch" = n_pests/height,
"height_cm" = height*2.54)
```
However, the better thing to do would be to make sure you do any and all of your analysis and cleaning in a logical and sequential order to avoid having to go backwards as it is very easy for errors to go unfound until later down the line. This is something we want to avoid. Therefore, we would advise focusing on missing and implausible data as your first steps in cleaning data in R before going onto generating any new variables.
## Sorting data
Finally we will have a quick look at a couple quick data management tips you may find useful.
While not always necessary, it may be useful to sort data based on a variable other than a unique identifier. To do this, you can use the `arrange` function. This sorts your data based on one or more variables which you supply as arguments. By default it will do this ascendingly but if you wish to do the opposite you can type the variable into the `desc` function first
For instance lets sort on plot number
```{r, exercise = FALSE}
data1 <-data1 %>%
arrange(plot_id)
data1
```
Or in descending order
```{r, exercise = FALSE}
data1 <-data1 %>%
arrange(desc(plot_id))
data1
```
Or lets even order it on both plant first and then plot number
```{r, exercise = FALSE}
data1 <-data1 %>%
arrange(plant,plot_id)
data1
```
## Subsetting data
Finally, something you may often wish to do is to subset your data. There can be many reasons for this, including an analysis that is only applicable to certain groups. Sometimes you may wish to do this subsetting early in your data cleaning process so that you have these subsets readily available.
With dplyr there is a simple function provided to do this. `filter` does exactly what you expect it to, filter the data based on some condition you provide.
For example let's create a subset of our data that is just Maize plants.
```{r, exercise = FALSE}
maize_data <- data1 %>%
filter(plant == "Maize")
```
We can provide multiple arguments to filter our data. We can separate them using `&` to mean AND, or we can use `|` to mean OR.
## External Links and Resources
<a href="https://r4ds.had.co.nz/" target="_blank">R for data science book : Covers a lot of the key areas of the tidyvesrse packages for data cleaning and manipulation</a>
[Course of Introductory R Videos](http://milton-the-cat.rocks/learnr/r/r_getting_started/#section-overview)
[Quick-R tutorial on some base R data managment mehods](https://www.statmethods.net/management/index.html)
[Stats4SD course videos on data management and analysis](https://www.youtube.com/playlist?list=PLK5PktXR1tmM_ISpxwoiXD2SioG3MWNqg)
[Stats4SD course video on tidy data and importing data to R](https://www.youtube.com/watch?v=truh-B_S-v4)
[Blog post - What's in a table](https://www.datakind.org/blog/whats-in-a-table)
[Article](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
[Blog post - What is tidy data](https://towardsdatascience.com/what-is-tidy-data-d58bb9ad2458)
[R for data science: Chapter on Tidy Data](https://r4ds.had.co.nz/tidy-data.html)