-
Notifications
You must be signed in to change notification settings - Fork 109
/
Copy pathdata_structure_and_data_cleaning_tutorial.Rmd
526 lines (363 loc) · 19 KB
/
data_structure_and_data_cleaning_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
# Data structure and cleaning 101
Zijian Wang and Shuo Yang
```{r, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Overview
From our own experiences, many of the errors or difficulties arised during implementing charts or analysis are highly likely due to the data. In this tutorial, we will discuss and summarize some basic data structure knowledge and useful data cleaning skills. We hope it will be helpful by shareing some common mistakes and practical tips, especially for the beginners in R.
## Data Structure
### Basic Data Types
+ character: strings "this is an example".
+ numeric: numbers 1, 1.1, 2.5.
+ factor: categorical labels "red", "green", "blue".
+ Date: dates "2001-02-03", "09-20-2019".
+ logical: Boolean True, Flase.
+ integer : integer values 1, 2, 3.
For these commonly used data types, we can check the data type by `is.*()` and convert to one another using `as.*()`.
### Attributes
R objects can have attributes to describe the data object. Commonly used attributes are names, dimnames, dimensions, class, length and other user-defined attibutes.
### Vector
Vector is a basic data structure in R. It cantains the element of the same type.
**Create**
+ Assign value directly by `c()`
```{r}
v<-c('Alpha','Beta') #char vector
v<-c(TRUE, FALSE, TRUE) #logi vector
v<-c(1,6,4.5,3.2) #num vector
```
+ Use `:` operator for numeric sequence with increment 1
```{r}
v<-1:4
```
+ Use functions that returns a vector
```{r}
seq(0,1,by=0.1) #increment by 0.1
rep(c("A","B"), each=2, times=3) #repeat each by 2 times and 3 complete replications
```
**Access**
+ Use integer as index
```{r}
v<-1:4
v[c(1,2)] #access 1st and 2nd element.
```
+ Use logical vector as index
```{r}
v>2 #a logical vector comparing elements in v to 2.
v[v>2] #access elements that is larger than 2.
```
+ Use element name as index
```{r}
names(v)<-c('A','B','C','D') #assign each element a name
v[c('A','B')] # using name as an index
```
**Operation**
+ Modify an element: use `<—` to assign a value to overide the original one
```{r}
v<-1:4
v[1]<-5
v
```
+ Add an element: use c() to add or combine two vectors
```{r}
c(v,c(1,2,3)) # combine two vectors together.
```
+ Delete an element: use negative index to delete the element
```{r}
v[-3] #delete the 3rd element
```
### Matrix
Matrix is similar to vector with additional dimension attribute. Matrix is 2-dimensional and requires the same type elements.
**Create**
+ Use `matrix()` to create a matrix. Dimension attribute can be set by `nrow` and/or `ncol`. Dimension names can also be set by argument. Matrix is filled by column by default.
```{r}
m<-matrix(1:6, nrow=2, #byrow=FALSE by default
dimnames=list(c("row1","row2"),
c("col1","col2","col3")))
m
```
+ Filled by row:
```{r}
matrix(1:6, nrow=2, byrow=TRUE)
```
**Access**
Elements of matrix can be accessed by integer, logical and name as index, like what we do with vector. Index is needed for both row and column.
+ Use integer as index
```{r}
m[1,3] # element in row 1 and column 3
m[,-1] # select all column except first column.
```
+ Use logical vector as index
```{r}
m[c(TRUE,FALSE),] # select rows 1
```
```{r}
m[m>2] # select element that greater than 2
```
+ Use element name as index
```{r}
m['row2',c('col2','col3')] # select element in 2nd row and 2nd and 3rd column.
```
**Operation**
+ Combine two matrices (same column) by row: `rbind()`
```{r}
x<-matrix(c(1,2,3,4), nrow = 2)
y<-matrix(c(5,6,7,8), nrow = 2)
rbind(x,y)
```
+ Combine two matrices (same row) by column: `cbind()`
```{r}
cbind(x,y)
```
+ Change dimension using `dim()`
```{r}
dim(x)<-c(1,4)# change the dimension to (1,4)
x #Matrix is rearranged by column.
```
### Array
Arrays can store data in more than two dimensions. One-dimensional arrays are similar to vectors. A two-dimensional arrays are the same as a matrix. Array requires input of same type.
**Create**
```{r}
A<-array(1:12, dim=c(3,2,2)) # create a array of three dimension.
A
```
**Access**
Accessing elements is similar to vector and matrix. The index must be consistent with the dimension. For a three-dimensional array, index is in the format `[ , , ]`.
**Operation**
+ As array is made of matrices in multiple dimensions, the operation on elements are carried out by accessing elements of the matrices.
```{r}
matrix1<-A[,,1]
matrix2<-A[,,2]
matrix1+matrix2 # add element in the two matrix together accordingly.
```
+ We can also do operation on specific dimension of an array:
```{r}
apply(A,1,sum) # sum the rows value across all matrixis.
```
```{r}
apply(A,2,sum) # sum the columns value across all matrixis.
```
```{r}
apply(A,3,sum) # sum the matrix value across all matrixis
```
### List
A vector with elements of different types is called a list. For example, a list can have a vector of character, a factor, a numeric matrix and a data frame as its four components. Great flexibility makes lists widely used.
**Create**
We can give names of the components to the list or use the index for reference.
```{r}
L<-list(a=1:4,FALSE) # assign a name to the first component
L
```
**Accesse**
+ We can access the contents in each component by either `$` notation or `[[]]`
```{r}
L$a
L[[2]]
```
+ We can obtain a sublist of the list by `[]`
```{r}
L[1]
```
+ Note that `L[[1]]` returns the content of the first component, which is a numeric vector in our example. On the other hand, `L[1]` returns a list of one component, which is the first component of L.
**Operation**
+ Add components to a list: similar like vector, we can use `c()` to add components. The list structure is dynamic, which means we can also simply assign values using new tags.
```{r}
c(L,b="s")
```
```{r}
L$b<-"s" # add a string component to the list and name it 'b'.
L
```
+ Delete components from a list: we can delete a component either by negative index or assigning `NULL` to it.
```{r}
L[-1] # delete the first component of the list.
```
```{r}
L$b<-NULL # delete the component which name is b.
L
```
+ Flatten lists: use `unlist` to produce a vector for all components in the list. The components will be converted into an approporiate unique mode.
```{r}
unlist(L) # FALSE is converted to 0
```
### Data Frame
Data frame is a special case of list. Data frame is organized in a two-dimensional way, which has equal length for each column.
**Create**
Use `data.frame()`. If the name of a column is not provided, it will make a valid name with the component. Another thing to point out is argument `stringsAsFactors`. It is set to default setting, which is `TRUE` for most of the cases. We recommend to set it to `FALSE` if there is at least column meant to be character.
```{r}
df<-data.frame('A'=c(1,2,3),'B'=c('a','b','c'), stringsAsFactors = FALSE)
df
```
**Access**
+ The content can be accessed like a list using `$` operator and `[[]]` or like a matrix using the index.
```{r}
df$A #using name to select columns
```
```{r}
df[[1]] #like a list
```
```{r}
df[,1] #like a matrix
```
**Operation**
Here we only discuss basic operation of data frame from a data structure perspective. `{dplyr}` is highly recommended for efficient data frame operation.
+ Add a column in data frame: just like list.
```{r}
df$'m'<-c(TRUE,TRUE,FALSE) # add a column m
df<-cbind(df,n=2:4) # add a column n
df
```
+ Add a row in data frame: add the new row as a data frame.
```{r}
df.new<-data.frame(A=4, B="d", m=TRUE, n=5)
rbind(df,df.new) # add a row
```
+ Delete a column in data frame: we can do this by consider the data frame as a matrix or list
```{r}
df[[1]]<-NULL # delete the first column
df
```
+ Delete a row in data frame: just like matrix, by negative the row index we want to delete
```{r}
df[-1,] # delete the first row
```
### Data Structure Conversion
+ Similar as data types, there are `as.*` functions to convert object into different structures.
+ Pay attention to the differences in attributes when performing conversion. For example, A data frame converted to a matrix will lose its variety in column data types.
### Functions to Check Data Structure Attributes
**Name Related**
+ `names()`: get or set the names of the object. For matrix-like objects (matrix and data frame), `names()` is dealing with the column names.
+ `colnames()` and `rownames()`: only work for matrix-like objects.
**Size Related**
+ `length()`: get or set the length of object.
```{r}
vector<-1:12
length(vector) # length of the vector is the number of elements.
matrix<-matrix(1:12,nrow = 3)
length(matrix) # length of the matrix is the number of all elements.
array<-array(1:12,dim=c(2,3,2))
length(array) # length of the array is the number of all elements.
list<-list('a'=c(1,2,3),'b'=c('r','b'))
length(list) # length of the list is the number of components in the list.
df<-data.frame('a'=c(1,2,3),'b'=c('A','B','C'))
length(df) # length of the data frame is the number of column.
```
+ `nrow()` and `ncol()`: returns the number of rows or column for matrix-like objects.
+ `dim()`: returns a vector `c(nrow(), ncol())` for matrix-like objects.
**Type Related**
+ `typeof()`: object's type (low-level).
+ `class()`: object's type (high-level).
**Others**
+ `attributes()`: present the attributes of the object.
+ `str()`: structure of the object and information about the class, length and content of each column.
```{r}
str(df)
```
+ `summary()`: summary statistics
```{r}
summary(df) #summary for data frame by column
summary(L) #summary for list
```
+ `head()` and `tail()`: return the first few part and last few part of the object
```{r}
head(1:100)
tail(1:100)
```
__*Tips:*__
- Getting familiar with different data types and structures will help you spot simple mistakes. Be aware of each data structures' different characteristics, like dimension and data types.
- Most of the functions require certain type of data input and produce certain type of outputs. Keeping a clear clue of what to input and what you get will help debug errors.
- Understanding the relationship and conversion among different data structures will provide more flexibility dealing with the analysis.
## Data Cleaning
Getting data ready for analysis can be quite tedious and time consuming. In this chapter, we will discuss some general workflow for the data cleaning process using {`tidyverse`}, including importing and tidying data. We will also summarise some common mistakes and provide useful tips.
### Import Data
Commonly used data file types are .txt, .csv, .xlsx and .xls.
**Import .txt and .csv using {`readr` package}**
+ The .txt and .csv files are flat files with certian character as **delimiter** to seperate fields with in a record. Normally .txt is tab (`\t`) delimited and .csv is comma (`,`) delimited.
+ General function to import a delimited file is `read_delim()`. Argument `delim` needs to be set to indicate delimiter. With commonly used delimiers "`,`" and "`\t`", `read_csv()` and `read_tsv` are the special caes of `read_delim()`. For example, `read_csv(...) = read_delim(..., delim=",")`.
+ By defualt, argument `col_names` is set to be `TRUE` for these `read_*` functions. It uses the first row as the column names. Remember to set it to `FALSE` if it starts with data directly without column names. `col_names` can also take a character vector for the known column names.
+ By defualt, argument `na` is set to be `c("","NA")`. If there is any other input interpreted as missing values, like `"N\A"` or `"*"`, it can be pass to `na` as a character vector.
```{r, warning = FALSE, message = FALSE}
library(tidyverse)
df<-read_delim("1|a|\nNA|c|5.5\n2|*|4.3", delim = "|",
col_names=c("A","B","C"), na=c("","NA","*"))
df
```
**Import .xlsx and .xls using {`readxl` package}**
+ `read_excel()` is commonly used, without specifying file type. Otherwise, `read_xls()` can be used for .xls file and `read_xlsx()` can be used for .xlsx file.
+ Argument `sheet` is for the sheet to read. It is either a string of the sheet name, or an integer $i$ for the ith sheet.
+ `col_names` is similar to {`readr`} functions
+ `na=""` by default. It can be modified similar as the above {`readr`} functions.
__*Tips:*__
- It is always helpful to open the original files to observe the data first. This will help determine which delimiter is used, whether the first line is the column name, what should be interpreted as missing value, and any notable pattern for further cleaning (for example 50* for 50 with some notes).
- Sometimes it might not be possible to open the original file directly, for example the file size is too large. Then you need to go through a trial and error process to set the proper arguments. In this case, argument of `n_max` is very useful. You can start with the general functions reading the first 10 rows by setting `n_max=10`. Then you can decide whether the first row is the column name. For delimited files, you can try a few delimiter till you find the correct one. As for missing values, you can always find a way to modify them in the later cleaning process.
- There is an easy way to familiarize yourselve with the previous two sets of function. On the Environment tab, you can manually import dataset. Select "From Text (readr)..." and "From Excel..." from the dropdown list and play around with the "Import Options". Then you can visualize the effects of different arguments directly from the "Data Preview" and learn the coding details from the "Code Preview".
- The local file to read is usually put in the working directory. But if you want to access a file in some other folder and do not know the path to it, here is the trick. Function `file.choose()` will pop up a dialog box allowing you to choose a file interactively and return the path to this file.
There are packages for other types of data to explore:
+ {`rvest`} for html data, like the `html_table`.
+ {`haven`} reads SPSS, Stata, and SAS files.
+ {`dbplyr`} returns a data frame with SQL queries against a database, along with a backend package {`DBI`} to work with different databases (e.g. RMySQL, RSQLite, RPostgreSQL etc).
### Tidy Data
Before we perform any data manipulation with fancy dplyr pipes for further visualization or modeling, we must have tidy data at hand. However in the real world, the imported data is never as clean as the textbook one. Each data has its own adjustment needs, here we try to discuss some must-check steps and useful skills, along with some tips.
**Data Overview**
+ Use `str()` to get the structure of the data frame. It shows the size of the data frame, each column's name and data types. This is a quick check of what the data looks like.
+ Use `head()` to get first few rows of data.
+ Use `summary()` to get a columnwise statistics summary. This will provide some initial ideas of the data range and distribution. It is also a good way to check NAs.
+ These functions is useful whenever a data overview is needed in the data cleaning process, especially when there is any big change in the data frame.
**Column Names**
+ Majority of the functions for data import will return a data-frame like structure or something can be easily transformed into a data frame. An important feature of data frame is its column name. `{readr}` can preserve its column names as original with white space and special characters, but it is still recommended to keep a clear short name with "_" at most.
+ Renaming the columns is basically string manipulation, which can be performed with `{stringr}`
```{r colnames}
eg.colname<-c("Number of Moons", "Ring System?")
#eg1: replace white space by "_"
eg1<-str_replace_all(eg.colname," ","")
eg1
#eg2: remove all white space and special characters
eg2<-str_replace_all(eg.colname,"(\\W)","") #\W is RegEx for any non char
eg2
#eg3: keep the first 3 char as the abbr.
eg3<-str_sub(eg.colname, 1, 3)
eg3
#eg4: design your own way for names
eg4<-sapply(sapply(eg.colname,str_extract_all,"([A-Z][a-z]{2})"),
paste0,collapse = "")
eg4
```
+ Then column names can be changed by `colnames` or `rename_*` functions, along with above string manipulation.
+ The original informative long column names can be stored in a vector separately for future reference.
+ Similar treatments can be applied to row names as needed.
**Transpose**
+ Sometimes we need to transpose data frame for a better organization of data. It is better to perform transposition before any further modification of data.
+ Simpliest way is to treat data frame a matrix and use `t()`, after which `as.data.frame()` is needed to transfer data back to a data frame. The drawback is all entries will be forced into one data type. This can be modified in later stage.
+ A better way to transpose data frame while maintaining each column's data type is to use `gather()` and `spread()`.
**Columnwise Modification**
+ Use `mutate_all()` for general patterns applied to all collumns. For example, remove any unnecessary notation.
```{r}
df<-data.frame(ID=c("test_1","test_2","test_3"),
Value=c("3","4.5*","2"),
Type=c("A","A","B*"), stringsAsFactors=F)
df<-df%>%mutate_all(function(x){str_remove_all(x,"\\*")})
#df%>%mutate_all(str_remove_all,"\\*") #produce the same result
df
```
+ Use `mutate_if()` or `mutate_at()` to modify selected columns. For example, double the value of each numeric column.
```{r}
df<-df%>%
mutate_at(vars(Type), as.factor)%>%
mutate_at("Value",as.numeric)%>%
mutate_if(is.numeric, ~.^2)
df
```
+ Use `separate()` to separate information in one column into multiple columns. Combine with string manipulation to modify character columns as desired.
```{r}
df%>%separate(ID,c("ID_Type","Index"),sep="_")
df%>%mutate(Index = str_extract(ID,"\\d"))
```
**NAs and Outliers**
Missing values and extreme values are also very important for future analysis. Depending on the purposes, there are different strategies dealing with them. We will not cover them in this tutorial, but we need to highlight their importance. We would rather to include NA and outlier processing in the later stages of data analysis, as the strategies might be adjusted along the way.
__*Tips:*__
- It always worths studying the documentations about the original data in more details. This will help you get a better understand of the data: what the number unit is, what the possible values of the data are, any natural relation presents and so on.
- Store the data as different objects from time to time. This helps to keep tracks for future reference and debug.
- Column modification is basically `mutate_*` functions with some functions to modify the columns. Simple functions without arguments are easy to use within the `mutate_*` functions. For complex modification involving multiple functions, it is recommended to define a new function which only requires the column values as the input.
- Try to make your code more generic, so that it can be applied to similar data with minimal modification.
**Reference:**
1. [R for Data Science] https://r4ds.had.co.nz/
2. [edav.info/] https://edav.info/index.html