CleaningExcel.Rmd

## Cleaning from Excel files

```{r setup-excel, include=F}
library(rio)
library(tidyverse)
```


Excel, being omnipresent, creates its own sets of difficulties. Excel, on top of being a data entry and
analysis platform, is also a visual platform, so tables are often created to look good rather than be tidy. Things
we commonly find in Excel files include colored rows and columns, multiple lines of headers, multiple rows with variables, typos resulting in numeric data becoming string data, and many others.

Let's start with a data set that's structured similarly to many government datasets around
the world.

![](img/ExcelClean.png)

Notice that there are two levels of headers on top, each spanning multiple columns. 
Data is in paired columns, with the first column being the statistic and the second column
being its standard error. The first row is a title, which we don't need, and the 
4th row is also not needed. 

We'll first tidy-fy this dataset, and then we'll clean it up a bit. Let's first
think about the structure we want to get to for tidy-fication. The actual data lies in 
the statistics and the standard errors, and each of the column groupings represents different
variables, so they should be in columns. 


You can try to load this data using `import`, but you'll find it's a big mess.
There are two powerful packages (by the same group of developers), `tidyxl` and
`unpivotr`, that are fantastic tools for "fixing" Excel files for analysis.

Let's start with `tidyxl`.

```{r CleaningExcel-1}
library(tidyxl)
dataset1 <- xlsx_cells('data/attendance.xlsx')
dataset1
```

Notice that this pulls in a lot of meta-data in a tidy form, including information
about cell formatting. This will be really useful in many situations. 

First, lets get rid of the rows we don't need.
```{r CleaningExcel-2}
dataset1 <- dataset1 %>% filter(row != 1, row != 4, row < 65)
```

Now we could manipulate this dataset using `tidyverse` tools, but `unpivotr` is much mor poweful.
First, we are going to pull off the two headers. `unpivotr` does this using the function `behead` (suggestive?), 
with the first argument being the direction ('N', "S", 'E','W',etc) of the table that the header is present. We will 
also consider the first column, consisting of state names, as a header on the left.
```{r CleaningExcel-3}
library(unpivotr)
dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State)
```

We need to separate the statistics and the standard errors from consecutive columns, and 
also make them headers.
```{r CleaningExcel-4}
dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State) %>% 
  mutate(header = ifelse(col %% 2 == 0, 'stats','se'))
```

The `%%` operator computes the remainder if the left side is divided by the right side.
So the criterion is asking which columns are even. The `ifelse` statement says, 
if the criterion is met, write "stats", otherwise write "se". This new variable 
is assigned to the dataset with the variable name "header". 


Notice that we have actually tidy-fied this dataset, but there's missing data here, since
the column headers span several rows visually but are only credited to the first column it covers.
So we need to fill in the entries for the remaining rows with the corresponding entry from the earliest column.
There is a function `fill` in `tidyr` that does this general trick, using a method called _last value carried forward_. 
```{r CleaningExcel-5}
dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State) %>% 
  fill(tophead) %>% 
  fill(head2)

```

To make this really tidy, we need to make two columns titled `stats` and `se` from this. 
We've seen this using `spread`, but there is a slightly more robust method from `unpivotr` called
`spatter` which is meant for this unique structure. 
```{r CleaningExcel-6}
tidy_dataset <- dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State) %>% 
  mutate(header = ifelse(col %% 2 == 0, 'stats','se')) %>% 
  fill(tophead) %>% 
  fill(head2) %>% 
  select(row, numeric, tophead, head2, State, header) %>% 
  spatter(header, numeric) %>% 
  select(-row)
tidy_dataset
```

We have to clean the `State` variable. We'll use the methods in the `stringr` 
package, which is already loaded with the `tidyverse`.

```{r CleaningExcel-7}
tidy_dataset <- tidy_dataset %>% 
  mutate(State = str_remove(State, '\\.+')) %>% 
  mutate(State = str_trim(State))
```

The first verb removes all the `.` in the variable, using something called a regular expression. 
This particular expression means that we want to look for sequences of dots, and remove them. 
The `\\` before the `.` tells R that we really mean `.`, since the `.` has a different meaning in 
regular expressions.

The second verb trims away blank spaces before and after each entry.

We'll hold on to this dataset for the visualization section. Just to be safe, let's 
save it. 

```{r CleaningExcel-8}
saveRDS(tidy_dataset, file = 'data/attendance.rds')
```

This saves the data in an R-specific format that will allow us to load it quickly. 

> The RDS format is an open standard and so it can be called from other programs if the 
appropriate programs are written.

### Dealing with visual formating (colors) {-}

The dataset we'll use for this has identifiable information, so I will not expose 
it publicly. It is available in your files as `data/classlist.xlsx`. 

Since we're interested in background and font colors here, which are informative,
we also need to load the format information into R.

```{r CleaningExcel-9, eval = T, results='hide'}
library(tidyxl)
library(unpivotr)

dataset2 <- xlsx_cells('data/classlist.xlsx')
formats <- xlsx_formats('data/classlist.xlsx')

format_id <- dataset2$local_format_id
dataset2$font_color <- formats$local$font$color$rgb[format_id]
dataset2$bg_color <- formats$local$fill$patternFill$fgColor$rgb[format_id]
```

```{r CleaningExcel-10}
unique(dataset2$font_color)
unique(dataset2$bg_color)
```

So we can filter rows based on these two colors if we want.

To tidy-fy this dataset, we realize that there are really two interweaved datasets. 
The odd rows are one dataset and the even rows are another dataset.

```{r CleaningExcel-11}
dat1 <- dataset2 %>% 
  filter( row %% 2 == 1) %>% # odd rows
  behead('N', header) %>% 
  mutate(row = (row+1)/2) # make the row numbers sequential

dat2 <- dataset2 %>% 
  filter(row %% 2 == 0) %>%  # even rows
  behead('N', header) %>% 
  mutate(row = row/2) %>% # make row numbers sequential
  mutate(col = col+4) # These will be the last 4 cols of new data

tidy_dataset2 <- 
  rbind(dat1, dat2) %>% # Put datsets on top of each other
  select(row, data_type, numeric, character, header) %>% 
  spatter(header) %>% 
  select(-row, -numeric)
```

We'll do a couple of finesse things to finish. First, we'll make the names with 
no spaces (they're a pain to write otherwise) and put the student name on 
the first column.

```{r CleaningExcel-12}
tidy_dataset2 <- tidy_dataset2 %>% 
  set_names(make.names(names(.))) %>% 
  select(Student.Name, everything())
```

`make.names` changes a vector of names into "approved" space-free format, replacing
the space with `.`. One shortcut I've used is using `.` as an argument to `names`, which
means that the `.` is replaced by the "noun" that is being acted on by the "verbs". You will 
notice that I can also do multiple verbs together to work sequentially.