-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-wrangle.qmd
872 lines (639 loc) · 33 KB
/
07-wrangle.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
# Data Wrangling {#sec-wrangle .incomplete-chapter}
## Intended Learning Outcomes {#sec-ilo-wrangle - .ilo}
- [ ] Be able to select and filter data for relevance
- [ ] Be able to create new columns and edit existing ones
- [ ] Be able to handle missing data
## Functions used {#sec-functions-wrangle -}
* built-in (you can always use these without loading any packages)
* base:: `round()`, `mean()`, `min()`, `max()`, ``sum()`, `as.character()`, `as.numeric()`
* utils:: `head()`, `packageVersion()`
* tidyverse (you can use all these with `library(tidyverse)`)
* readr:: `readr::read_csv()`
* dplyr:: `dplyr::filter()`, `dplyr::arrange()`, `dplyr::mutate()`, `dplyr::summarise()`, `dplyr::group_by()`, `dplyr::ungroup()`, `dplyr::case_when()`, `dplyr::na_if()`
* tidyr:: `tidyr::pivot_longer()`, `tidyr::pivot_wider()`
* tibble:: `tibble::tibble()`
* other (you need to load each package to use these)
* janitor:: `janitor::round_half_up()`
## Set-up {#sec-setup-wrangle -}
1. Open your `reprores` project
1. Create a new quarto file called `03-dataviz.qmd`
1. Update the YAML header
1. Replace the setup chunk with the one below:
```{r}
#| echo: fenced
#| label: setup
#| include: false
library(tidyverse) # data wrangling functions
library(janitor) # clean data table names
```
::: {.callout-warning}
At this point it's very likely you will have the most recent version of the <pkg>tidyverse</pkg> packages but just in case, be aware that the function `case_when()` that we will use in this chapter was updated in to introduce new arguments and ways of handling `NA`. The code we provide in this chapter will only work if you have v1.1.0 or above of <pkg>dplyr</pkg> installed. To check run `packageVersion("dplyr")` and if it's below 1.1.0, install it again to update.
:::
You'll need to make a folder called "data" and download a data file into it:
<a href="https://psyteachr.github.io/ads-v2/data/budget.csv" download>budget.csv</a>.
Download the [Data transformation cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf).
## Wrangling functions {#sec-wrangling-functions}
Data wrangling refers to the process of cleaning, transforming, and restructuring your data to get it into the format you need for analysis and it's something you will spend an awful lot of time doing. Most `r glossary("data wrangling")` involves the reshaping functions you learned in @sec-tidy and six functions from the <pkg>dplyr</pkg> package that is loaded as part of the <pkg>tidyverse</pkg>: `select`, `filter`, `arrange`, `mutate`, `summarise`, and `group_by`. You'll remember the last two from @sec-summary, so we'll only cover them briefly.
It's worth highlighting that in this chapter we're going to cover these common functions and common uses of said functions. However, <pkg>dplyr</pkg> has a huge number of additional wrangling functions and each function has many different arguments. Essentially, if you think you should be able to wrangle your data in a particular way that we haven't explicitly shown you, you almost certainly can, it might just take a bit of Googling to find out how.
```{r, echo = FALSE, eval = FALSE}
set.seed(8675309)
budget <- faux::sim_design(
between = list(
region = c("North", "South", "East", "West"),
product = c("widgets", "gadgets")
),
within = list(
type = c("sales", "expenses"),
year = 2019:2020
),
n = 1,
mu = rep(c(100, 50), each = 4, times = 4),
sd = 1000,
plot = FALSE
) |>
mutate_if(is.numeric, round) |>
select(-id) |>
mutate(satisfaction_2019 = c("high", "very high", "high", "low", "very low", "neutral", "neutral", "low"),
satisfaction_2020 = c("very high", "very high", "neutral", "low", "very low", "high", "neutral", "low"))
write_csv(budget, "data/budget.csv")
```
We'll use a small example table with the sales, expenses, and satisfaction for two years from four regions over two products. After you load the data, use `glimpse(budget)` or `View(budget)` to get familiar with the data.
```{r}
budget <- read_csv("data/budget.csv", show_col_types = FALSE)
```
```{r, echo = FALSE}
budget
```
### Select
You can select a subset of the columns (variables) in a table to make it easier to view or to prepare a table for display. You can also select columns in a new order.
#### By name or index
You can select columns by name or number (which is sometimes referred to as the column index). Selecting by number can be useful when the column names are long or complicated.
```{r}
# select single column by name
product_dat <- budget |> select(product)
# select single column by number
product_dat <- budget |> select(2)
```
You can select each column individually, separated by commas (e.g., `region, sales_2019`) but you can also select all columns from one to another by separating them with a colon (e.g., `sales_2019:expenses_2020`).
The colon notation can be much faster because you don't need to type out each individual variable name, but requires you know what order your columns are in and don't expect them to change. Always check the output to make sure you have selected what you intended.
```{r select-name}
# select columns individually
sales2019 <- budget |> select(region, product, sales_2019)
# select columns with colon
sales2019 <- budget |> select(region:sales_2019)
```
You can rename columns at the same time as selecting them by setting `new_name = old_col`.
```{r select-number}
regions <- budget |> select(`Sales Region` = 1, 3:6)
head(regions, 2)
```
#### Un-selecting columns
You can select columns either by telling R which ones you want to keep as in the previous examples, or by specifying which ones you want to exclude by using a minus symbol to un-select columns. You can also use the colon notation to de-select columns, but to do so you need to put parentheses around the span first, e.g., `-(sales_2019:expenses_2020)`, not `-sales_2019:expenses_2020`.
```{r select-unselect}
# de-select individual columns
sales <- budget |> select(-expenses_2019, -expenses_2020)
# de-select a range of columns
sales <- budget |> select(-(expenses_2019:expenses_2020))
```
#### Select helpers
Finally, you can select columns based on criteria about the column names.
| function | definition |
|----------|------------|
| `starts_with()` | select columns that start with a character string|
| `ends_with()` | select columns that end with a character string |
| `contains()` | select columns that contain a character string |
| `num_range()` | select columns with a name that matches the pattern `prefix` |
```{r select-helpers, echo = FALSE}
x1 <- budget |> select(starts_with("sales"))
x2 <- budget |> select(ends_with("2020"))
x3 <- budget |> select(contains("_"))
x4 <- budget |> select(num_range("expenses_", 2019:2020))
a <- list(names(x1), names(x2), names(x3), names(x4)) |>
sapply(paste, collapse = ", ")
a1 <- setNames(a, c("answer", "x", "x", "x")) |> longmcq()
a2 <- setNames(a, c("x", "answer", "x", "x")) |> longmcq()
a3 <- setNames(a, c("x", "x", "answer", "x")) |> longmcq()
a4 <- setNames(a, c("x", "x", "x", "answer")) |> longmcq()
```
::: {.try}
What are the resulting columns for these four examples?
* `budget |> select(contains("_"))`
`r a3`
* `budget |> select(num_range("expenses_", 2019:2020))`
`r a4`
* `budget |> select(starts_with("sales"))`
`r a1`
* `budget |> select(ends_with("2020"))`
`r a2`
:::
### Filter
Whilst `select()` chooses the columns you want to retain, `filter()` chooses the rows to retain by matching row or column criteria.
You can filter by a single criterion. This criterion can be rows where a certain column's value matches a character value (e.g., "North") or a number (e.g., 9003). It can also be the result of a logical equation (e.g., keep all rows with a specific column value larger than a certain value). The criterion is checked for each row, and if the result is FALSE, the row is removed. You can reverse equations by specifying `!=` where `!` means "not".
```{r filter, eval = FALSE}
# select all rows where region equals North
budget |> filter(region == "North")
# select all rows where expenses_2020 were exactly equal to 200
budget |> filter(expenses_2020 == 200)
# select all rows where sales_2019 was more than 100
budget |> filter(sales_2019 > 100)
# everything but the North
budget |> filter(region != "North")
```
::: {.callout-warning}
Remember to use `==` and not `=` to check if two things are equivalent. A single `=` assigns the right-hand value to the left-hand variable (much like the `<-` operator).
:::
```{r filter-mcq, echo = FALSE}
class <- tibble(id = 1:4,
grade = c("A", "A", "C", "B"),
score = c(95, 91, 76, 84)
)
x1 <- which(class$grade == "A")
x2 <- which(class$score == 91)
x3 <- which(class$score < 80)
x4 <- which(class$grade != "A")
a <- list(x1, x2, x3, x4) |>
sapply(paste, collapse = ", ")
a1 <- setNames(a, c("answer", "x", "x", "x")) |> mcq()
a2 <- setNames(a, c("x", "answer", "x", "x")) |> mcq()
a3 <- setNames(a, c("x", "x", "answer", "x")) |> mcq()
a4 <- setNames(a, c("x", "x", "x", "answer")) |> mcq()
```
::: {.try}
Which IDs are kept from the table below?
```{r, echo = FALSE}
class
```
* `demo |> filter(score < 80)`
`r a3`
* `demo |> filter(grade == "A")`
`r a1`
* `demo |> filter(grade != "A")`
`r a4`
* `demo |> filter(score == 91)`
`r a2`
:::
You can also select on multiple criteria by separating them by commas (rows will be kept if they match *all* criteria). Additionally, you can use `&` ("and") and `|` ("or") to create complex criteria.
```{r filter-symbols}
# regions and products with profit in both 2019 and 2020
profit_both <- budget |>
filter(
sales_2019 > expenses_2019,
sales_2020 > expenses_2020
)
# the same as above, using & instead of a comma
profit_both <- budget |>
filter(
sales_2019 > expenses_2019 &
sales_2020 > expenses_2020
)
# regions and products with profit in 2019 or 2020
profit_either <- budget |>
filter(
sales_2019 > expenses_2019 |
sales_2020 > expenses_2020
)
# 2020 profit greater than 1000
profit_1000 <- budget |>
filter(sales_2020 - expenses_2020 > 1000)
```
If you want the filter to retain multiple specific values in the same variable, the "match operator" (`%in%`) should be used rather than `|` (or). The `!` can also be used in combination here, but it is placed before the variable name.
```{r match-operator, eval = FALSE}
# retain any rows where region is north or south, and where product equals widget
budget |>
filter(region %in% c("North", "South"),
product == "widgets")
# retain any rows where the region is not east or west, and where the product does not equal gadgets
budget |>
filter(!region %in% c("East", "West"),
product != "gadgets")
```
`r glossary("Operator")` |Name |is TRUE if and only if
-----------|----------------------|---------------------------------
`A < B` |less than |A is less than B
`A <= B` |less than or equal |A is less than or equal to B
`A > B` |greater than |A is greater than B
`A >= B` |greater than or equal |A is greater than or equal to B
`A == B` |equivalence |A exactly equals B
`A != B` |not equal |A does not exactly equal B
`A %in% B` |in |A is an element of vector B
Finally, you can also pass many other functions to filter. For example, the package <pkg>stringr</pkg> that is loaded as part of the <pkg>tidyverse</pkg> contains many different functions for working with `r glossary("string", "strings")` (character data). For example, you you use `str_detect()` to only retain rows where the customer satisfaction rating includes the word "high"
```{r eval = TRUE}
budget |>
filter(str_detect(satisfaction_2019, "high"))
```
Note that `str_detect()` is case sensitive so it would not return values of "High" or "HIGH". You can use the function `tolower()` or `toupper()` to convert a string to lowercase or uppercase before you search for substring if you need case-insensitive matching.
::: {.callout-warning}
`filter()` is incredibly powerful and can allow you to select very specific subsets of data. But, it is also quite dangerous because when you start combining multiple criteria and operators, it's very easy to accidentally specify something slightly different than what you intended. **Always check your output**. If you have a small dataset, then you can eyeball it to see if it looks right. With a larger dataset, you may wish to compute summary statistics or count the number of groups/observations in each variable to verify your filter is correct. There is no level of expertise in coding that can substitute knowing and checking your data.
:::
### Arrange {#sec-arrange}
You can sort your dataset using `arrange()`. You will find yourself needing to sort data in R much less than you do in Excel, since you don't need to have rows next to each other in order to, for example, calculate group means. But `arrange()` can be useful when preparing data for display in tables. `arrange()` works on character data where it will sort alphabetically, as well as numeric data where the default is ascending order (smallest to largest). Reverse the order using `desc()`.
```{r arrange}
# arranging the table
# first by product in alphabetical order
# then by "region" in reverse alphabetical order
budget |>
arrange(product, desc(region))
```
::: {.try}
If you want to sort character data/categories in a specific order, turn the column into a `r glossary("factor")` and set the `levels` in the desired order.
```{r}
budget |>
mutate(region = factor(region, levels = c("North", "South", "East", "West"))) |>
filter(product == "gadgets") |>
arrange(region)
```
:::
### Mutate {#sec-mutate}
The function `mutate()` allows you to add new columns or change existing ones by overwriting them by using the syntax `new_column = operation`. You can add more than one column in the same mutate function by separating the columns with a comma. Once you make a new column, you can use it in further column definitions. For example, the creation of `profit` below uses the column `expenses`, which is created above it.
```{r mutate}
budget2 <- budget |>
mutate(
sales = sales_2019 + sales_2020,
expenses = expenses_2019 + expenses_2020,
profit = sales - expenses,
region = paste(region, "Office")
)
```
`mutate()` can also be used in conjunction with other functions and Boolean operators. For example, we can add another column to `budget2` that states whether a profit was returned that year or overwrite our `product` variable as a factor. Just like when we used `r glossary("boolean expression", "Boolean expressions")` with filter, it will evaluate the equation and return TRUE or FALSE depending on whether the observation meets the criteria.
```{r}
budget2 <- budget2 |>
mutate(profit_category = profit > 0,
product = as.factor(product))
```
::: {.callout-warning}
You can overwrite a column by giving a new column the same name as the old column (see `region` or `product`) above. Make sure that you mean to do this and that you aren't trying to use the old column value after you redefine it.
:::
You can also use `case_when()` to specify what values to return, rather than defaulting to TRUE or FALSE:
```{r}
budget3 <- budget2 |>
mutate(profit_category = case_when(profit > 0 ~ "PROFIT",
profit < 0 ~ "NO PROFIT"))
```
Use it to recode values:
```{r}
# create a column where people get a bonus if customer satisfaction was overall high or very high
bonus <- budget3 |>
mutate(bonus_2019 = case_when(satisfaction_2019 %in% c("very low", "low", "neutral") ~ "no bonus",
satisfaction_2019 %in% c("high", "very high") ~ "bonus"))
```
And combine different criteria:
```{r}
# new management takes over - people only get a bonus if customer satisfaction was overall high or very high AND if a profit was returned
bonus2 <- budget3 |>
mutate(bonus_2020 = case_when(satisfaction_2020 == "high" &
profit_category == "PROFIT" ~ "bonus",
satisfaction_2020 == "very high" &
profit_category == "PROFIT" ~ "bonus",
.default = "No bonus")) # set all other values to "no bonus"
```
::: {.callout-tip}
Be mindful that `.default` uses `=` whilst the others use `~`. Emily has lost quite a lot of her time and sanity to not realising this.
:::
Just like `filter()`, `mutate()` is incredibly powerful and the scope of what you can create is far beyond what we can cover in this book.
### Summarise {#sec-dplyr-summarise}
You were introduced to the `summarise()` function in @sec-summary-summarise. This applies summary functions to an entire table (or groups, as you'll see in the next section).
Let's say we want to determine the mean sales and expenses, plus the minimum and maximum profit, for any region, product and year. First, we need to reshape the data like we learned in @sec-tidy, so that there is a column for year and one column each for sales and expenses, instead of separate columns for each year. We'll also drop the satisfaction data as we don't need it for this analysis.
```{r}
budget4 <- budget |>
select(-satisfaction_2019, -satisfaction_2020) |>
pivot_longer(cols = sales_2019:expenses_2020,
names_to = c("type", "year"),
names_sep = "_",
names_transform = list(year = as.integer),
values_to = "value") |>
pivot_wider(names_from = type,
values_from = value)
head(budget4) # check the format
```
::: {.callout-note}
The argument `names_transform` lets you change the `r glossary("data type")` of a new column, since the names always default to `r glossary("character")` types.
:::
Now we can create summary statistics for the table.
```{r}
budget4 |>
summarise(
mean_sales = mean(sales),
mean_expenses = mean(expenses),
min_profit = min(expenses - sales),
max_profit = max(expenses - sales)
)
```
### Group By {#sec-dplyr-groupby}
You were introduced to the `group_by()` function in @sec-grouping. For example, you can break down the summary statistics above by year and product.
```{r}
year_prod <- budget4 |>
group_by(year, product) |>
summarise(
mean_sales = mean(sales),
mean_expenses = mean(expenses),
min_profit = min(expenses - sales),
max_profit = max(expenses - sales)
) |>
ungroup()
year_prod
```
Alternatively, if you have a newer version of tidyverse, you can use the `.by` argument to summarise by groups.
```{r}
budget4 |>
summarise(
.by = c(year, product),
mean_sales = mean(sales),
mean_expenses = mean(expenses),
min_profit = min(expenses - sales),
max_profit = max(expenses - sales)
)
```
::: {.try}
How would you find out the maximum sales for each region?
```{r, results='asis', echo = FALSE}
c(answer = "<pre>budget3 |>\n group_by(region) |>\n summarise(max_sales = max(sales))</pre>",
x = "<pre>budget3 |>\n group_by(region) |>\n summarise(max_sales = max(region))</pre>",
x = "<pre>budget3 |>\n group_by(sales) |>\n summarise(max_sales = max(sales))</pre>",
x = "<pre>budget3 |>\n group_by(sales) |>\n summarise(max_sales = max(region))</pre>") |>
sample() |> webexercises::longmcq() |> cat()
```
:::
You can also use `group_by()` in combination with other functions. For example, `slice_max()` returns the top N rows, ordered by a specific variable.
```{r}
# return top 3 sales
budget4 |>
slice_max(n = 3, order_by = sales)
```
But this can be combined with `group_by()` to return the top sales for each region:
```{r}
# return top sale for each region
budget4 |>
group_by(region) |>
slice_max(n = 1, order_by = sales)
```
## Complications
### Rounding
Let's say we want to round all the values to the nearest pound. The pattern below uses the `across()` function to apply the `round()` function to the columns from `mean_sales` to `max_profit`.
```{r}
year_prod |>
mutate(across(.cols = mean_sales:max_profit,
.fns = round))
```
If you compare this table to the one in @sec-dplyr-groupby, you'll see that the 2019 gadgets mean sales rounded up from `r year_prod$mean_sales[1]` to `r round(year_prod$mean_sales[1])`, while the mean expenses rounded down from `r year_prod$mean_expenses[1]` to `r round(year_prod$mean_expenses[1])`. What's going on!?
This may seem like a mistake, but R rounds .5 to the nearest even number, rather than always up, like you were probably taught in school. This prevents overestimation biases, since x.5 is *exactly* halfway between x and x+1, so there is no reason it should always round up.
```{r}
round(0.5)
round(1.5)
```
However, this might throw a monkey wrench into your own systems. For example, our school policy is to round up for course marks at x.5. The easiest solution is to use the `round_half_up()` function from the package <pkg>janitor</pkg>.
when you run this code, a new section will appear in the environment pane labelled "Functions". In addition to using functions from packages, you can also make your own. It's not something we are going to go into detail on in this course, but it's useful to know the functionality exists.
This should work as you'd expect.
```{r}
round_half_up(0.5)
round_half_up(1.5)
```
### Missing values {#sec-missing-values}
If you have control over your data, it is always best to keep missing values as empty cells rather than denoting missingness with a word or implausible number. If you used "missing" rather than leaving the cell empty, the entire variable would be read as character data (unless you note this on import), which means you wouldn't be able to perform mathematical operations like calculating the mean. If you use an implausible number (0 or 999 are common), then you risk these values being included in any calculations as real numbers.
However, we often don't have control over how the data come to us, so let's run through how to fix this.
#### Bad missing values
What if the South region hadn't returned their expenses (entered as 0) and the North region hadn't returned their sales data for 2020 yet, so someone entered it as "missing"?
First, we're going to recode the data to add in the missing values
For the South data, we can use `case_when()` to set the value of expenses to 0 if the year is 2020 and region is "South", otherwise use the value from the expenses column (i.e., don't change).
```{r}
missing_bad <- budget4 |>
mutate(expenses = case_when(
# set to 0 when year is 2020 and region is North
year == 2020 & region == "South" ~ 0,
# otherwise, set to the value in the expenses column
.default = expenses
))
```
`r hide("Using case_when() for multiple criteria")`
The `case_when()` function allows allows you to set multiple criteria, although we're only using one non-default criterion here. It can be very useful, but takes a little practice.
The example below creates a label for each row. Notice how the label for the first row is "x < 2", even though this row also fits the second criterion "y < 4". This is because `case_when()` applies the first match to each row, even if other criteria in the function also match that row.
```{r}
data <- tibble(
x = 1:5,
y = 1:5
)
data |>
mutate(label = case_when(
x < 2 ~ "x < 2",
y < 4 ~ "y < 4",
x == 5 & y == 5 ~ "both 5",
.default = "default"
))
```
`r unhide()`
For the North, we will recode these values as "missing". Since this is character data, and sales are currently numeric data, we first need to change it to a character variable.
```{r}
# set sales values to "missing" for North 2020 rows
missing_bad <- missing_bad |>
mutate(sales = as.character(sales), # Convert all sales to character
sales = case_when(year == 2020 & region == "North" ~ "missing",
TRUE ~ sales)) # Set specific condition to "missing"
str(missing_bad)
```
Now, if you try to compute the mean sales, you will get an error message and the result will be `NA`.
```{r eval = TRUE, error=TRUE}
# try to compute mean sales
missing_bad |>
summarise(mean_sales = mean(sales))
```
#### Convert missing values to NA
To set the missing values to NA, we can use the handy function `na_if()`. We'll also need to transform `sales` back to numeric.
```{r}
missing_data <- missing_bad |>
mutate(
# if expenses = 0, set as NA
expenses = na_if(expenses,0),
# if sales = "missing" set as NA
sales = na_if(sales, "missing"),
# convert to numeric
sales = as.numeric(sales)
)
```
Now, if we try to calculate the mean sales and profits, we get missing values for any summary value that used one of the North 2020 sales values or the South 2020 expenses.
```{r error=TRUE}
missing_data |>
group_by(region) |>
summarise(
mean_sales = mean(sales),
mean_expenses = mean(expenses),
min_profit = min(expenses - sales),
max_profit = max(expenses - sales),
.groups = "drop")
```
#### Ignore missing values
This is because `NA` basically means "I don't know", and the sum of 100 and "I don't know" is "I don't know", not 100. However, when you're calculating means, you often want to just ignore missing values. Set `na.rm = TRUE` in the summary function to remove missing values before calculating.
```{r}
missing_data |>
group_by(region) |>
summarise(
mean_sales = mean(sales, na.rm = TRUE),
mean_expenses = mean(expenses, na.rm = TRUE),
min_profit = min(expenses - sales, na.rm = TRUE),
max_profit = max(expenses - sales, na.rm = TRUE),
.groups = "drop"
)
```
#### Count missing values
If you want to find out how many missing or non-missing values there are in a column, use the `is.na()` function to get a `r glossary("logical")` vector of whether or not each value is missing, and use `sum()` to count how many values are TRUE or `mean()` to calculate the proportion of TRUE values.
```{r}
missing_data |>
group_by(year, product) |>
summarise(
n_valid = sum(!is.na(sales)),
n_missing = sum(is.na(sales)),
prop_missing = mean(is.na(sales)),
.groups = "drop"
)
```
#### Omit missing values
You may also want to remove rows that have missing values and only work from complete datasets. `drop_na()` will remove any row that has a missing observation. You can use `drop_na()` on the entire dataset which will remove any row that has *any* missing value, or you can specify to only remove rows that are missing a specific value.
```{r}
# remove any rows with any missing values
complete_data <- missing_data |>
drop_na()
# remove any rows that are missing a value for sales
complete_sales <- missing_data |>
drop_na(sales)
```
Missing data can be quite difficult to deal with depending on how it is represented. As always, no amount of coding expertise can make up for not understanding the structure and idiosyncrasies of your data.
## Exercises {#sec-together-wrangle}
Let's try some exercises using a dataset you already encountered in @sec-viz so that you can see how much more you're able to do with the data now.
* Save your current script, close it, and open a new script named "survey-data-mad-skillz.qmd".
* In the set-up code chunk, load the tidyverse, then load the dataset from https://psyteachr.github.io/ads-v2/data/survey_data.csv into an object named `survey_data`.
* Use your method of choice to review the dataset and familiarise yourself with its structure.
```{r webex.hide = TRUE}
# from https://www.kaggle.com/kyanyoga/sample-sales-data
library(tidyverse)
survey_data <- read_csv("https://psyteachr.github.io/reprores-v4/data/survey_data.csv")
```
### Creating new categories
Employees 1-5 were trained by Michael and employees 6-10 were trained by Dwight.
* Create a new column named `trainer` that lists the trainer for each employee.
* Then, calculate the average satisfaction scores for employees trained by each trainer and visualise the satisfaction scores for each in whatever way you think best.
` r hide("Hint")`
To add the trainer column you can use `case_when()` and specify multiple criteria (e.g., if the employee is 1-5, Michael, if the employee is 6-10 Dwight)
` r unhide()`
```{r webex.hide = TRUE}
# case_when() method
survey_data <- survey_data |>
mutate(trainer = case_when(employee_id %in% c("E01", "E02", "E03", "E04", "E05") ~ "Michael",
employee_id %in% c("E06", "E07", "E08", "E09", "E10") ~ "Dwight"))
# mean satisfaction scores
survey_data |>
group_by(trainer) |>
summarise(mean_satisfaction = mean(satisfaction))
# possible visualisation
ggplot(survey_data, aes(x = satisfaction, fill = trainer)) +
geom_histogram(binwidth = 1, show.legend = FALSE, colour = "black") +
facet_wrap(~trainer) +
labs(title = "Satisfaction scores by employee trainer")
```
### Filter by calculated score
First, calculate the average wait time and store this in an object named `mean_wait`. This should be a single value rather than a table.
`r hide("Hint")`
There are multiple ways to achieve this. You could create the table and then pull out the single value, or just calculate the single value.
`r unhide()`
```{r webex.hide = TRUE}
# method 1 - tidyverse
mean_wait <- survey_data |>
summarise(mean_wait = mean(wait_time)) |>
pull(mean_wait)
# method 2 - base R
mean_wait <- mean(survey_data$wait_time)
```
Now create a dataset named `long_wait` that just contains data from customers who waited more than the average wait time.
```{r webex.hide = TRUE}
long_wait <- survey_data |>
filter(wait_time > mean_wait)
```
Create a visualisation that shows how many customers waited more than the average wait time for each employee.
```{r webex.hide = TRUE}
long_wait |>
ggplot(aes(x = employee_id)) +
geom_bar()
```
### Multiple critera
Now, add a column to `survey_data` named `follow_up` that flags whether a customer should be followed up with a courtesy phone call. Your company is short-staffed so only customers that meet all three of the following criteria should be followed-up:
* Their wait time should be above the average **for all calls**
* Their call time should be above the average **for their category**
* Their satisfaction should be less than three 3.
This is quite complicated and there are multiple ways to achieve the desired outcome. Some approaches may need other functions that were covered in previous chapters and you may need to create intermediate objects.
Call the final object `follow_data` and keep only the customer ID, employee ID, trainer, and follow up columns.
```{r webex.hide = TRUE}
# this is one possible solution, there are many other valid approaches
# calculate mean wait time across all calls
mean_wait <- mean(survey_data$wait_time)
# calculate mean call time for each category
follow_data <- survey_data |>
group_by(issue_category) |>
summarise(mean_call = mean(call_time)) |>
#then join it to the survey data
left_join(survey_data, by = "issue_category") |>
# then add on the column
mutate(follow_up = case_when(wait_time > mean_wait &
call_time > mean_call &
satisfaction < 3 ~ "yes",
.default = "no")) |>
select(caller_id, employee_id, trainer, follow_up)
```
For all of the above, write code that stores the answer as a single value, so that you could easily use it in inline coding.
How many customers need to be followed up:
* In total? `r fitb("120")`
* From calls by employee 06? `r fitb("16")`
* From calls by employees trained by Michael `r fitb("65")`
* From calls by employees trained by Dwight `r fitb("55")`
`r hide("Hint")`
`group_by() |> count() |> filter() |> pull()`
`r unhide()`
Which employee needs to make the largest number of follow-up courtesy calls? `r fitb("E02")`
`r hide("Hint")`
As above but add in an `ungroup()` and `slice_max()` along the way.
`r unhide()`
```{r webex.hide = TRUE}
# in total
follow_data |>
group_by(follow_up) |>
count()|>
filter(follow_up == "yes") |>
pull(n)
# by employee 6
follow_data |>
group_by(follow_up, employee_id) |>
count() |>
filter(employee_id == "E06",
follow_up == "yes") |>
pull(n)
# by michael
follow_data |>
group_by(follow_up, trainer) |>
count() |>
filter(trainer == "Michael",
follow_up == "yes") |>
pull(n)
# by dwight
follow_data |>
group_by(follow_up, trainer) |>
count() |>
filter(trainer == "Dwight",
follow_up == "yes") |>
pull(n)
# most follow-ups needed
follow_data |>
group_by(follow_up, employee_id) |>
count() |>
ungroup() |>
filter(follow_up == "yes") |>
slice_max(n = 1, order_by = n) |>
pull(employee_id)
```
### Original insight
In preparation for the summative assessment, explore the data to provide one original insight of your own.
## Glossary {#sec-glossary-wwrangle -}
```{r, echo = FALSE}
glossary_table()
```
## Further resources {#sec-resources-wrangle -}
* [Data transformation cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf)
* [Chapter 18: Missing Data](https://r4ds.hadley.nz/missing-values) in *R for Data Science*
* [Chapter 3: Data Transformation ](https://r4ds.hadley.nz/data-transform) in *R for Data Science*
* [Chapter 25: Functions](https://r4ds.hadley.nz/functions) in *R for Data Science*
* [Introduction to stringr](https://stringr.tidyverse.org/articles/stringr.html)