-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathCleaningExcel.Rmd
206 lines (159 loc) · 7.58 KB
/
CleaningExcel.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
## Cleaning from Excel files
```{r setup-excel, include=F}
library(rio)
library(tidyverse)
```
Excel, being omnipresent, creates its own sets of difficulties. Excel, on top of being a data entry and
analysis platform, is also a visual platform, so tables are often created to look good rather than be tidy. Things
we commonly find in Excel files include colored rows and columns, multiple lines of headers, multiple rows with variables, typos resulting in numeric data becoming string data, and many others.
Let's start with a data set that's structured similarly to many government datasets around
the world.
![](img/ExcelClean.png)
Notice that there are two levels of headers on top, each spanning multiple columns.
Data is in paired columns, with the first column being the statistic and the second column
being its standard error. The first row is a title, which we don't need, and the
4th row is also not needed.
We'll first tidy-fy this dataset, and then we'll clean it up a bit. Let's first
think about the structure we want to get to for tidy-fication. The actual data lies in
the statistics and the standard errors, and each of the column groupings represents different
variables, so they should be in columns.
You can try to load this data using `import`, but you'll find it's a big mess.
There are two powerful packages (by the same group of developers), `tidyxl` and
`unpivotr`, that are fantastic tools for "fixing" Excel files for analysis.
Let's start with `tidyxl`.
```{r CleaningExcel-1}
library(tidyxl)
dataset1 <- xlsx_cells('data/attendance.xlsx')
dataset1
```
Notice that this pulls in a lot of meta-data in a tidy form, including information
about cell formatting. This will be really useful in many situations.
First, lets get rid of the rows we don't need.
```{r CleaningExcel-2}
dataset1 <- dataset1 %>% filter(row != 1, row != 4, row < 65)
```
Now we could manipulate this dataset using `tidyverse` tools, but `unpivotr` is much mor poweful.
First, we are going to pull off the two headers. `unpivotr` does this using the function `behead` (suggestive?),
with the first argument being the direction ('N', "S", 'E','W',etc) of the table that the header is present. We will
also consider the first column, consisting of state names, as a header on the left.
```{r CleaningExcel-3}
library(unpivotr)
dataset1 %>%
behead('N', tophead) %>%
behead('N', head2) %>%
behead('W', State) %>%
select(row, col, data_type, numeric, tophead, head2, State)
```
We need to separate the statistics and the standard errors from consecutive columns, and
also make them headers.
```{r CleaningExcel-4}
dataset1 %>%
behead('N', tophead) %>%
behead('N', head2) %>%
behead('W', State) %>%
select(row, col, data_type, numeric, tophead, head2, State) %>%
mutate(header = ifelse(col %% 2 == 0, 'stats','se'))
```
The `%%` operator computes the remainder if the left side is divided by the right side.
So the criterion is asking which columns are even. The `ifelse` statement says,
if the criterion is met, write "stats", otherwise write "se". This new variable
is assigned to the dataset with the variable name "header".
Notice that we have actually tidy-fied this dataset, but there's missing data here, since
the column headers span several rows visually but are only credited to the first column it covers.
So we need to fill in the entries for the remaining rows with the corresponding entry from the earliest column.
There is a function `fill` in `tidyr` that does this general trick, using a method called _last value carried forward_.
```{r CleaningExcel-5}
dataset1 %>%
behead('N', tophead) %>%
behead('N', head2) %>%
behead('W', State) %>%
select(row, col, data_type, numeric, tophead, head2, State) %>%
fill(tophead) %>%
fill(head2)
```
To make this really tidy, we need to make two columns titled `stats` and `se` from this.
We've seen this using `spread`, but there is a slightly more robust method from `unpivotr` called
`spatter` which is meant for this unique structure.
```{r CleaningExcel-6}
tidy_dataset <- dataset1 %>%
behead('N', tophead) %>%
behead('N', head2) %>%
behead('W', State) %>%
select(row, col, data_type, numeric, tophead, head2, State) %>%
mutate(header = ifelse(col %% 2 == 0, 'stats','se')) %>%
fill(tophead) %>%
fill(head2) %>%
select(row, numeric, tophead, head2, State, header) %>%
spatter(header, numeric) %>%
select(-row)
tidy_dataset
```
We have to clean the `State` variable. We'll use the methods in the `stringr`
package, which is already loaded with the `tidyverse`.
```{r CleaningExcel-7}
tidy_dataset <- tidy_dataset %>%
mutate(State = str_remove(State, '\\.+')) %>%
mutate(State = str_trim(State))
```
The first verb removes all the `.` in the variable, using something called a regular expression.
This particular expression means that we want to look for sequences of dots, and remove them.
The `\\` before the `.` tells R that we really mean `.`, since the `.` has a different meaning in
regular expressions.
The second verb trims away blank spaces before and after each entry.
We'll hold on to this dataset for the visualization section. Just to be safe, let's
save it.
```{r CleaningExcel-8}
saveRDS(tidy_dataset, file = 'data/attendance.rds')
```
This saves the data in an R-specific format that will allow us to load it quickly.
> The RDS format is an open standard and so it can be called from other programs if the
appropriate programs are written.
### Dealing with visual formating (colors) {-}
The dataset we'll use for this has identifiable information, so I will not expose
it publicly. It is available in your files as `data/classlist.xlsx`.
Since we're interested in background and font colors here, which are informative,
we also need to load the format information into R.
```{r CleaningExcel-9, eval = T, results='hide'}
library(tidyxl)
library(unpivotr)
dataset2 <- xlsx_cells('data/classlist.xlsx')
formats <- xlsx_formats('data/classlist.xlsx')
format_id <- dataset2$local_format_id
dataset2$font_color <- formats$local$font$color$rgb[format_id]
dataset2$bg_color <- formats$local$fill$patternFill$fgColor$rgb[format_id]
```
```{r CleaningExcel-10}
unique(dataset2$font_color)
unique(dataset2$bg_color)
```
So we can filter rows based on these two colors if we want.
To tidy-fy this dataset, we realize that there are really two interweaved datasets.
The odd rows are one dataset and the even rows are another dataset.
```{r CleaningExcel-11}
dat1 <- dataset2 %>%
filter( row %% 2 == 1) %>% # odd rows
behead('N', header) %>%
mutate(row = (row+1)/2) # make the row numbers sequential
dat2 <- dataset2 %>%
filter(row %% 2 == 0) %>% # even rows
behead('N', header) %>%
mutate(row = row/2) %>% # make row numbers sequential
mutate(col = col+4) # These will be the last 4 cols of new data
tidy_dataset2 <-
rbind(dat1, dat2) %>% # Put datsets on top of each other
select(row, data_type, numeric, character, header) %>%
spatter(header) %>%
select(-row, -numeric)
```
We'll do a couple of finesse things to finish. First, we'll make the names with
no spaces (they're a pain to write otherwise) and put the student name on
the first column.
```{r CleaningExcel-12}
tidy_dataset2 <- tidy_dataset2 %>%
set_names(make.names(names(.))) %>%
select(Student.Name, everything())
```
`make.names` changes a vector of names into "approved" space-free format, replacing
the space with `.`. One shortcut I've used is using `.` as an argument to `names`, which
means that the `.` is replaced by the "noun" that is being acted on by the "verbs". You will
notice that I can also do multiple verbs together to work sequentially.