---
engine: knitr
---
# R essentials {#sec-r-essentials}
**Prerequisites**
- Read *R for Data Science*, Chapter 4 "Data transformation", [@r4ds]
- Provides an overview of manipulating datasets using `dplyr`.
- Read *Data Feminism*, Chapter 6 "The Numbers Don't Speak for Themselves", [@datafeminism2020]
- Discusses the need to consider data within the broader context that generated them.
- Read *R Generation*, [@Thieme2018]
- Provides background information about `R`.
**Key concepts and skills**
- Understanding foundational aspects of `R` and RStudio enables a gradual improvement of workflows. For instance, being able to use key `dplyr` verbs and make graphs with `ggplot2` makes manipulating and understanding datasets easier.
- But there is an awful lot of functionality in the `tidyverse` including importing data, dataset manipulation, string manipulation, and factors. You do not need to know it all at once, but you should know that you do not yet know it.
- Beyond the `tidyverse` it is also important to know that foundational aspects, common to many languages, exist and can be added to data science workflows. For instance, class, functions, and data simulation all have an important role to play.
**Software and packages**
- Base `R`
- Core `tidyverse` [@tidyverse]
- `dplyr` [@citedplyr]
- `forcats` [@citeforcats]
- `ggplot2` [@citeggplot]
- `readr` [@citereadr]
- `stringr` [@citestringr]
- `tibble` [@tibble]
- `tidyr` [@citetidyr]
- Outer `tidyverse` [@tidyverse] (these need to be loaded separately e.g. `library("haven")`)
- `haven` [@citehaven]
- `lubridate` [@GrolemundWickham2011]
- `janitor` [@janitor]
## Introduction
In this chapter we focus on foundational skills needed to use the statistical programming language `R` [@citeR] to tell stories with data. Some of it may not make sense at first, but these are skills and approaches that we will often use. You should initially go through this chapter quickly, noting aspects that you do not understand. Then come back to this chapter from time to time as you continue through the rest of the book. That way you will see how the various bits fit into context.
`R` is an open-source language for statistical programming. You can download `R` for free from the [Comprehensive R Archive Network](https://cran.r-project.org) (CRAN). RStudio is an Integrated Development Environment (IDE) for `R` which makes the language easier to use and can be downloaded for free from Posit [here](https://www.rstudio.com/products/rstudio/).
The past ten years or so have been characterized by the increased use of the `tidyverse`. This is "...an opinionated collection of `R` packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures" [@tidyversewebsite]. There are three distinctions to be clear about: the original `R` language, typically referred to as "base"; the `tidyverse`, which is a coherent collection of packages that builds on top of base; and other packages.
Essentially everything that we can do in the `tidyverse`, we can also do in base. But, as the `tidyverse` was built especially for data science, it is often easier to use, especially when learning. Additionally, almost everything that we can do in the `tidyverse`, we can also do with other packages. But, as the `tidyverse` is a coherent collection of packages, it is often easier to use, again, especially when learning. Eventually there are cases where it makes sense to trade off the convenience and coherence of the `tidyverse` for some features of base, other packages, or languages. Indeed, we introduce SQL in @sec-store-and-share as one source of considerable efficiency gain when working with data. For instance, the `tidyverse` can be slow, and so if one needs to import thousands of CSVs then it can make sense to switch away from `read_csv()`. The appropriate use of base and non-tidyverse packages, or even other languages, rather than dogmatic insistence on a particular solution, is a sign of intellectual maturity.
Central to our use of the statistical programming language `R` is data, and most of the data that we use will have humans at the heart of it. Sometimes, dealing with human-centered data in this way can have a numbing effect, resulting in over-generalization, and potentially problematic work. Another sign of intellectual maturity is when it has the opposite effect, increasing our awareness of our decision-making processes and their consequences.
> In practice, I find that far from distancing you from questions of meaning, quantitative data forces you to confront them. The numbers draw you in. Working with data like this is an unending exercise in humility, a constant compulsion to think through what you can and cannot see, and a standing invitation to understand what the measures really capture---what they mean, and for whom.
>
> @kieranskitchen
## R, RStudio, and Posit Cloud
`R` and RStudio are complementary, but they are not the same thing. @vistransrep explain their relationship by analogy, where `R` is like an engine and RStudio is like a car---we can use engines in a lot of different situations, and they are not limited to being used in cars, but the combination is especially useful.
### R
[`R`](https://www.r-project.org/) is an open-source and free programming language that is focused on statistics generally. Free in this context does not refer to a price of zero, but instead to the freedom that the creators give users to largely do what they want with it (although it also does have a price of zero). This is in contrast with an open-source programming language designed for general-purpose use, such as `Python`, or an open-source programming language that is focused on probability, such as `Stan`. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the 1990s, and traces its provenance to `S`, which was developed at Bell Labs in the 1970s. It is maintained by the R Core Team and changes to this "base" of code occur methodically and with concern given to different priorities.
Many people build on this stable base, to extend the capabilities of `R` to better and more quickly suit their needs. They do this by creating packages. Typically, although not always, a package is a collection of `R` code, mostly functions, and this allows us to more easily do things that we want to do. These packages are managed by repositories such as CRAN and Bioconductor.
If you want to use a package, then you first need to install it on your computer, and then you need to load it when you want to use it. Dr Di Cook, Professor of Business Analytics at Monash University, describes this as analogous to a lightbulb. If you want light in your house, first you need to fit a lightbulb, and then you need to turn the switch on. Installing a package, say, `install.packages("tidyverse")`, is akin to fitting a lightbulb into a socket---you only need to do this once for each lightbulb. But then each time you want light you need to turn on the switch to the lightbulb, which in the `R` packages case, means drawing on your library, say, `library(tidyverse)`.
:::{.callout-note}
## Shoulders of giants
Dr Di Cook is Distinguished Professor of Statistics at Monash University. After earning a PhD in statistics from Rutgers University in 1993, where she focused on statistical graphics, she was appointed as an assistant professor at Iowa State University, being promoted to full professor in 2005, and in 2015 she moved to Monash. One area of her research is data visualization, especially interactive and dynamic graphics. @buja1996interactive proposes a taxonomy of interactive data visualization and the associated software XGobi, which is the focus of @ggobibook. @Cook1995 develops and explores the use of a dynamic graphical tool for exploratory data analysis, and @Buja2009 develops a framework for evaluating visual statistical methods, where plots and human cognition stand in for test statistics and statistical tests, respectively. She is a Fellow of the American Statistical Association.
:::
To install a package on your computer (again, we will need to do this only once per computer) we use `install.packages()`.
```{r}
#| eval: false
#| echo: true
install.packages("tidyverse")
```
And then when we want to use the package, we use `library()`.
```{r}
#| eval: false
#| echo: true
library(tidyverse)
```
Having downloaded it, we can open `R` and use it directly. It is primarily designed to be interacted with through the command line. While this is functional, it can be useful to have a richer environment than the command line provides. In particular, it can be useful to install an Integrated Development Environment (IDE), which is an application that brings together various bits and pieces that will be used often. One common IDE for `R` is RStudio, although others such as Visual Studio are also used.
### RStudio
RStudio is distinct from `R`; they are different entities. RStudio builds on top of `R` to make it easier to use `R`. This is in the same way that one could use the internet from the command line, but most people use a browser such as Chrome, Firefox, or Safari.
RStudio is free in the sense that we do not pay for it. It is also free in the sense of being able to take the code, modify it, and distribute that code. But the maker of RStudio, Posit, is a company, albeit a B Corp, and so it is possible that the current situation could change. It can be downloaded from Posit [here](https://www.rstudio.com/products/rstudio/).
When we open RStudio it will look like @fig-first.
![Opening RStudio for the first time](figures/01.png){#fig-first width=90% fig-align="center"}
The left pane is a console in which you can type and execute `R` code line by line. Try it with 2+2 by clicking next to the prompt ">", typing "2+2", and then pressing "return/enter".
```{r}
#| eval: true
#| echo: true
2 + 2
```
The pane on the top right has information about the environment. For instance, when we create variables a list of their names and some properties will appear there. Next to the prompt type the following code, replacing Rohan with your name, and again press enter.
```{r}
#| eval: true
#| echo: true
my_name <- "Rohan"
```
As mentioned in @sec-fire-hose the `<-`, or "assignment operator", allocates `"Rohan"` to an object called "my_name". You should notice a new value in the environment pane with the variable name and its value.
The pane in the bottom right is a file manager. At the moment it should just have two files: an `R` History file and an `R` Project file. We will get to what these are later, but for now we will create and save a file.
Run the following code, without worrying too much about the details for now. You should see a new ".rds" file in your list of files.
```{r}
#| eval: false
#| echo: true
saveRDS(object = my_name, file = "my_first_file.rds")
```
### Posit Cloud
While you can and should download RStudio to your own computer, initially we recommend using [Posit Cloud](https://posit.cloud). This is an online version of RStudio that is provided by Posit. We will use this so that you can focus on getting comfortable with `R` and RStudio in an environment that is consistent. This way you do not have to worry about what computer you have or installation permissions, amongst other things.
The free version of Posit Cloud is free, as in no financial cost. The trade-off is that it is not powerful, and it is sometimes slow, but for the purposes of getting started it is enough.
## Getting started
We will now start going through some code. Actively write this all out yourself.
While working line-by-line in the console is fine, it is easier to write out a whole script that can then be run. We will do this by making an `R` Script ("File" $\rightarrow$ "New File" $\rightarrow$ "R Script"). The console pane will fall to the bottom left and an `R` Script will open in the top left. We will write some code that will get all of the Australian federal politicians and then construct a small table about the genders of the prime ministers. Some of this code will not make sense at this stage, but just type it all out to get into the habit and then run it. To run the whole script, we can click "Run" or we can highlight certain lines and then click "Run" to just run those lines.
```{r}
#| eval: false
#| echo: true
#| warning: false
#| message: false
# Install the packages that we need
install.packages("tidyverse")
install.packages("AustralianPoliticians")
```
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
# Load the packages that we need to use this time
library(tidyverse)
library(AustralianPoliticians)
# Make a table of the counts of genders of the prime ministers
get_auspol("all") |> # Imports data from GitHub
as_tibble() |>
filter(wasPrimeMinister == 1) |>
count(gender)
```
We can see that, as at the end of 2021, one female has been prime minister (Julia Gillard), while the other 29 prime ministers were male.
One critical operator when programming is the "pipe": `|>`. We read this as "and then". This takes the output of a line of code and uses it as the first input to the next line of code. It makes code easier to read. By way of background, for many years `R` users used `%>%` as the pipe, which is from `magrittr` [@magrittr] and part of the `tidyverse`. Base `R` added the pipe that we use in this book, `|>`, in 2021, and so if you look at older code, you may see the earlier pipe being used. For the most part, they are interchangeable.
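As a small sketch of that interchangeability, both pipes pass the left-hand side as the first argument to the function on the right, so the following two lines give the same result (the `magrittr` pipe is available here because the `tidyverse` has been loaded).
```{r}
#| eval: true
#| echo: true
c(1, 4, 9) |> sqrt()
c(1, 4, 9) %>% sqrt()
```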
The idea of the pipe is that we take a dataset, and then do something to it. We used this in the earlier example. Another example follows where we will look at the first six lines of a dataset by piping it to `head()`. Notice that `head()` does not explicitly take any arguments in this example. It knows which data to display because the pipe tells it implicitly.
```{r}
#| eval: true
#| echo: true
get_auspol("all") |> # Imports data from GitHub
head()
```
We can save this `R` Script as "my_first_r_script.R" ("File" $\rightarrow$ "Save As"). At this point, our workspace should look something like @fig-third.
![After running an `R` Script](figures/03.png){#fig-third width=90% fig-align="center"}
One thing to be aware of is that each Posit Cloud workspace is essentially a new computer. Because of this, we need to install any package that we want to use in each workspace. For instance, before we can use the `tidyverse`, we need to install it with `install.packages("tidyverse")`. This contrasts with using one's own computer, where a package only needs to be installed once.
A few final notes on Posit Cloud:
1. In the Australian politicians example, we got our data from GitHub using an `R` package, but we can get data into a workspace from a local computer in a variety of ways. One way is to use the "upload" button in the "Files" panel. Another is to use `readr` [@citereadr], which is part of the `tidyverse` [@tidyverse].
2. Posit Cloud allows some degree of collaboration. For instance, you can give someone else access to a workspace that you create and even both be in the same workspace at the one time. This could be useful for collaboration.
3. There are a variety of weaknesses of Posit Cloud, in particular the RAM limits. Additionally, like any web application, things break from time to time or go down.
## The `dplyr` verbs
One of the key packages that we will use is the `tidyverse` [@tidyverse]. The `tidyverse` is actually a package of packages, which means when we install the `tidyverse`, we actually install a whole bunch of different packages. The key package in the `tidyverse` in terms of manipulating data is `dplyr` [@citedplyr].
There are five `dplyr` functions that are regularly used, and we will now go through each of these. These are commonly referred to as the `dplyr` verbs.
1. `select()`
2. `filter()`
3. `arrange()`
4. `mutate()`
5. `summarise()` or equally `summarize()`
We will also cover `.by` and `count()` here as they are closely related.
As we have already installed the `tidyverse`, we just need to load it.
```{r}
#| warning: false
#| message: false
#| eval: true
#| echo: true
library(tidyverse)
```
And we will begin by again using some data about Australian politicians from the `AustralianPoliticians` package [@citeaustralianpoliticians].
```{r}
#| eval: true
#| echo: true
library(AustralianPoliticians)
australian_politicians <-
get_auspol("all")
head(australian_politicians)
```
### `select()`
We use `select()` to pick particular columns of a dataset. For instance, we might like to select the "firstName" column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(firstName)
```
In `R`, there are many ways to do things. Sometimes these are different ways to do the same thing, and other times they are different ways to do *almost* the same thing. For instance, another way to pick a particular column of a dataset is to use the "extract" operator `$`. This is from base, as opposed to `select()` which is from the `tidyverse`.
```{r}
#| eval: true
#| echo: true
australian_politicians$firstName |>
head()
```
The two appear similar---both pick the "firstName" column---but they differ in the class of what they return, with `select()` returning a tibble and `$` returning a vector. For the sake of completeness, if we combine `select()` with `pull()` then we get the same class of output, a vector, as if we had used the extract operator.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(firstName) |>
pull() |>
head()
```
We can also use `select()` to remove columns, by negating the column name.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(-firstName)
```
Finally, we can `select()` based on conditions. For instance, we can `select()` all of the columns that start with, say, "birth".
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(starts_with("birth"))
```
There are a variety of similar "selection helpers" including `starts_with()`, `ends_with()`, and `contains()`. More information about these is available in the help page for `select()` which can be accessed by running `?select()`.
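For instance, as a small additional sketch, `ends_with()` works the same way, here picking the columns whose names end with "Date", such as "birthDate" and "deathDate".
```{r}
#| eval: true
#| echo: true
australian_politicians |>
  select(ends_with("Date"))
```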
At this point, we will use `select()` to reduce the width of our dataset.
```{r}
#| eval: true
#| echo: true
australian_politicians <-
australian_politicians |>
select(
uniqueID,
surname,
firstName,
gender,
birthDate,
birthYear,
deathDate,
member,
senator,
wasPrimeMinister
)
australian_politicians
```
One thing that sometimes confuses people who are new to `R` is that output is not "saved" unless it is assigned to an object. For instance, here the code starts with `australian_politicians <- australian_politicians |>` before `select()` is used, rather than just `australian_politicians |>`. This ensures that the changes brought about by `select()` are applied to the object, and so it is that modified version that is used at any later point in the code.
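A minimal sketch of the difference: piping to `select()` without assignment does not change the underlying object.
```{r}
#| eval: true
#| echo: true
# Not assigned, so australian_politicians itself is unchanged
australian_politicians |>
  select(uniqueID) |>
  ncol()
# The object still has all the columns we kept earlier
ncol(australian_politicians)
```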
### `filter()`
We use `filter()` to pick particular rows of a dataset. For instance, we might be only interested in politicians that became prime minister.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(wasPrimeMinister == 1)
```
We could also give `filter()` two conditions. For instance, we could look at politicians that became prime minister and were named Joseph, using the "and" operator `&`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(wasPrimeMinister == 1 & firstName == "Joseph")
```
We get the same result if we use a comma instead of an ampersand.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(wasPrimeMinister == 1, firstName == "Joseph")
```
Similarly, we could look at politicians who were named, say, Myles or Ruth using the "or" operator `|`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(firstName == "Myles" | firstName == "Ruth")
```
We could also pipe the result. For instance we could pipe from `filter()` to `select()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(firstName == "Ruth" | firstName == "Myles") |>
select(firstName, surname)
```
If we happen to know the particular row number that is of interest then we could `filter()` to only that particular row. For instance, say row 853 was of interest.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(row_number() == 853)
```
There is also a dedicated function to do this, which is `slice()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(853)
```
While this may seem somewhat esoteric, it is especially useful if we would like to remove a particular row using negation, or duplicate specific rows. For instance, we could remove the first row.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(-1)
```
We could also, say, only keep the first three rows.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(1:3)
```
Finally, we could duplicate the first two rows, taking advantage of `n()`, which provides the current group size.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(1:2, 1:n())
```
### `arrange()`
We use `arrange()` to change the order of the dataset based on the values of particular columns. For instance, we could arrange the politicians by their year of birth.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(birthYear)
```
We could modify `arrange()` with `desc()` to change from ascending to descending order.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(desc(birthYear))
```
This could also be achieved with the minus sign.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(-birthYear)
```
And we could arrange based on more than one column. For instance, if two politicians have the same first name, then we could also arrange based on their year of birth.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(firstName, birthYear)
```
We could achieve the same result by piping between two instances of `arrange()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(birthYear) |>
arrange(firstName)
```
When we use `arrange()` we should be clear about precedence. For instance, changing to year of birth and then first name would give a different arrangement.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(birthYear, firstName)
```
A nice way to arrange by a variety of columns is to use `across()`. It enables us to use the "selection helpers" such as `starts_with()` that were mentioned in association with `select()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(across(c(firstName, birthYear)))
australian_politicians |>
arrange(across(starts_with("birth")))
```
### `mutate()`
We use `mutate()` when we want to make a new column. For instance, perhaps we want to make a new column that is 1 if a person was both a member and a senator and 0 otherwise. That is to say that our new column would denote politicians that served in both the upper and the lower house.
```{r}
#| eval: true
#| echo: true
australian_politicians <-
australian_politicians |>
mutate(was_both = if_else(member == 1 & senator == 1, 1, 0))
australian_politicians |>
select(member, senator, was_both)
```
We could use `mutate()` with math, such as addition and subtraction. For instance, we could calculate the age that the politicians are (or would have been) in 2022.
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
library(lubridate)
australian_politicians <-
australian_politicians |>
mutate(age = 2022 - year(birthDate))
australian_politicians |>
select(uniqueID, age)
```
There are a variety of functions that are especially useful when constructing new columns. These include `log()` which will compute the natural logarithm, `lead()` which will bring values up by one row, `lag()` which will push values down by one row, and `cumsum()` which creates a cumulative sum of the column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(uniqueID, age) |>
mutate(log_age = log(age))
australian_politicians |>
select(uniqueID, age) |>
mutate(lead_age = lead(age))
australian_politicians |>
select(uniqueID, age) |>
mutate(lag_age = lag(age))
australian_politicians |>
select(uniqueID, age) |>
drop_na(age) |>
mutate(cumulative_age = cumsum(age))
```
As we have in earlier examples, we can also use `mutate()` in combination with `across()`. This includes the potential use of the selection helpers. For instance, we could count the number of characters in both the first and last names at the same time.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
mutate(across(c(firstName, surname), str_count)) |>
select(uniqueID, firstName, surname)
```
Finally, we use `case_when()` when we need to make a new column on the basis of more than two conditional statements (in contrast to `if_else()` from our first `mutate()` example). For instance, we may have some years and want to group them into decades.
```{r}
library(lubridate)
australian_politicians |>
mutate(
year_of_birth = year(birthDate),
decade_of_birth =
case_when(
year_of_birth <= 1929 ~ "pre-1930",
year_of_birth <= 1939 ~ "1930s",
year_of_birth <= 1949 ~ "1940s",
year_of_birth <= 1959 ~ "1950s",
year_of_birth <= 1969 ~ "1960s",
year_of_birth <= 1979 ~ "1970s",
year_of_birth <= 1989 ~ "1980s",
year_of_birth <= 1999 ~ "1990s",
TRUE ~ "Unknown or error"
)
) |>
select(uniqueID, year_of_birth, decade_of_birth)
```
We could accomplish this with a series of nested `if_else()` statements, but `case_when()` is clearer. The cases are evaluated in order, and as soon as there is a match `case_when()` does not continue to the remainder of the cases. It can be useful to have a catch-all at the end that will signal a potential issue that we might like to know about if the code were ever to reach it.
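For comparison, a minimal sketch of the same idea with nested `if_else()` statements, covering only the first few decades, suggests why `case_when()` scales better as the number of cases grows.
```{r}
#| eval: false
#| echo: true
australian_politicians |>
  mutate(
    year_of_birth = year(birthDate),
    decade_of_birth =
      if_else(
        year_of_birth <= 1929,
        "pre-1930",
        if_else(year_of_birth <= 1939, "1930s", "1940s or later")
      )
  ) |>
  select(uniqueID, year_of_birth, decade_of_birth)
```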
### `summarise()`
We use `summarise()` when we would like to make new, condensed, summary variables. For instance, perhaps we would like to know the minimum, average, and maximum of some column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarise(
youngest = min(age, na.rm = TRUE),
oldest = max(age, na.rm = TRUE),
average = mean(age, na.rm = TRUE)
)
```
As an aside, `summarise()` and `summarize()` are equivalent and we can use either. In this book we use `summarise()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarize(
youngest = min(age, na.rm = TRUE),
oldest = max(age, na.rm = TRUE),
average = mean(age, na.rm = TRUE)
)
```
By default, `summarise()` provides one row of output for the whole dataset. For instance, in the earlier example we found the youngest, oldest, and average age across all politicians. However, we can compute these summaries for groups within our dataset by specifying `.by` within the function. We can use many functions on the basis of groups, but `summarise()` is particularly powerful in conjunction with `.by`. For instance, we could group by gender, and then get age-based summary statistics.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarise(
youngest = min(age, na.rm = TRUE),
oldest = max(age, na.rm = TRUE),
average = mean(age, na.rm = TRUE),
.by = gender
)
```
Similarly, we could look at youngest, oldest, and mean age at death by gender.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
mutate(days_lived = deathDate - birthDate) |>
drop_na(days_lived) |>
summarise(
min_days = min(days_lived),
mean_days = mean(days_lived) |> round(),
max_days = max(days_lived),
.by = gender
)
```
And so we learn that female members of parliament on average lived slightly longer than male members of parliament.
We can use `.by` on the basis of more than one group. For instance, we could look at the number of days lived by gender and by whether they were in the House of Representatives or the Senate.
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
australian_politicians |>
mutate(days_lived = deathDate - birthDate) |>
drop_na(days_lived) |>
summarise(
min_days = min(days_lived),
mean_days = mean(days_lived) |> round(),
max_days = max(days_lived),
.by = c(gender, member)
)
```
We can use `count()` to create counts by groups. For instance, the number of politicians by gender.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
count(gender)
```
In addition to the count, we could calculate a proportion.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
count(gender) |>
mutate(proportion = n / (sum(n)))
```
Using `count()` is essentially the same as combining `.by` with `n()` within `summarise()`, and we get the same result that way.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarise(n = n(),
.by = gender)
```
There is also a comparably helpful function, `add_count()`, that acts similarly to `mutate()`. The difference is that the count is added in a new column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
add_count(gender) |>
select(uniqueID, gender, n)
```
## Base R
While the `tidyverse` was established relatively recently to help with data science, `R` existed long before it. There is a host of functionality built into `R`, especially around the core needs of programmers and statisticians.
In particular, we will cover:
1. `class()`.
2. Data simulation.
3. `function()`, `for()`, and `apply()`.
There is no need to install or load any additional packages, as this functionality comes with `R`.
### `class()`
In everyday usage "a, b, c, ..." are letters and "1, 2, 3,..." are numbers. And we use letters and numbers differently; for instance, we do not add or subtract letters. Similarly, `R` needs to have some way of distinguishing different classes of content and to define the properties that each class has, "how it behaves, and how it relates to other types of objects" [@advancedr].
Classes have a hierarchy. For instance, we are "human", which is itself "animal". All "humans" are "animals", but not all "animals" are "humans". Similarly, all integers are numbers, but not all numbers are integers. We can find out the class of an object in `R` with `class()`.
```{r}
#| echo: true
a_number <- 8
class(a_number)
a_letter <- "a"
class(a_letter)
```
The classes that we cover here are "numeric", "character", "factor", "date", and "data.frame".
The first thing to know is that, in the same way that a frog can become a prince, we can sometimes change the class of an object in `R`. This is called "casting". For instance, we could start with a "numeric", change it to a "character" with `as.character()`, and then a "factor" with `as.factor()`. But if we tried to make it into a "date" with `as.Date()` we would get an error, because not all numbers have the properties that are needed to be a date.
```{r}
#| echo: true
a_number <- 8
a_number
class(a_number)
a_number <- as.character(a_number)
a_number
class(a_number)
a_number <- as.factor(a_number)
a_number
class(a_number)
```
Compared with "numeric" and "character" classes, the "factor" class might be less familiar. A "factor" is used for categorical data that can only take certain values [@advancedr]. For instance, typical usage of a "factor" variable would be a binary, such as "day" or "night". It is also often used for age-groups, such as "18-29", "30-44", "45-60", "60+" (as opposed to age, which would often be a "numeric"); and sometimes for level of education: "less than high school", "high school", "college", "undergraduate degree", "postgraduate degree". We can find the allowed levels for a "factor" using `levels()`.
```{r}
age_groups <- factor(
c("18-29", "30-44", "45-60", "60+")
)
age_groups
class(age_groups)
levels(age_groups)
```
Dates are an especially tricky class and quickly become complicated. Nonetheless, at a foundational level, we can use `as.Date()` to convert a character that looks like a "date" into an actual "date". This enables us to, say, perform addition and subtraction, when we would not be able to do that with a "character".
```{r}
looks_like_a_date_but_is_not <- "2022-01-01"
looks_like_a_date_but_is_not
class(looks_like_a_date_but_is_not)
is_a_date <- as.Date(looks_like_a_date_but_is_not)
is_a_date
class(is_a_date)
is_a_date + 3
```
The final class that we discuss here is "data.frame". This looks like a spreadsheet and is commonly used to store the data that we will analyze. Formally, "a data frame is a list of equal-length vectors" [@advancedr]. It will have column and row names which we can see using `colnames()` and `rownames()`, although often the names of the rows are just numbers.
To illustrate this, we use the "ResumeNames" dataset from `AER` [@citeaer]. This package can be installed in the same way as any other package from CRAN. The dataset comprises cross-sectional data on 4,870 fictitious resumes, covering resume content, especially the name used on the resume, and whether the candidate received a call-back. It was created by @bertrand2004emily, who sent fictitious resumes in response to job advertisements in Boston and Chicago that differed in whether the resume was assigned a "very African American sounding name or a very White sounding name". They found considerable discrimination whereby "White names receive 50 per cent more callbacks for interviews". @hangartner2021monitoring generalize this using an online Swiss platform and find that immigrants and minority ethnic groups are contacted less by recruiters, as are women when the profession is male-dominated, and vice versa.
```{r}
#| eval: false
#| echo: true
#| warning: false
#| message: false
install.packages("AER")
```
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
library(AER)
data("ResumeNames", package = "AER")
```
```{r}
ResumeNames |>
head()
class(ResumeNames)
colnames(ResumeNames)
```
We can examine the class of the vectors (that is, the columns) that make up a data frame by specifying the column name.
```{r}
class(ResumeNames$name)
class(ResumeNames$jobs)
```
Sometimes it is helpful to be able to change the classes of many columns at once. We can do this by using `mutate()` and `across()`.
```{r}
class(ResumeNames$name)
class(ResumeNames$gender)
class(ResumeNames$ethnicity)
ResumeNames <-
  ResumeNames |>
  mutate(across(c(name, gender, ethnicity), as.character))
class(ResumeNames$name)
class(ResumeNames$gender)
class(ResumeNames$ethnicity)
```
There are many ways for code to fail to run, and an issue with class is always among the first things to check. Common issues are variables that we think should be "character" or "numeric" actually being "factor", and variables that we think should be "numeric" actually being "character".
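As a small sketch of the factor issue, converting a "factor" that holds numbers directly to "numeric" returns the underlying level codes rather than the numbers themselves; converting to "character" first avoids this.
```{r}
#| eval: true
#| echo: true
stored_as_factor <- factor(c("10", "20", "30"))
# Returns the level codes, not the numbers we probably wanted
as.numeric(stored_as_factor)
# Going via character recovers the numbers
as.numeric(as.character(stored_as_factor))
```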
Finally, it is worth pointing out that the class of a vector is simply the class of its contents. In Python and other languages, a similar data structure to a vector is a "list". A "list" is a class of its own, and the objects in a "list" have their own classes (for instance, `["a", 1]` is an object of class "list" with entries of class "str" and "int"). If you are coming to `R` from another language, it may be counter-intuitive that a vector is not a class of its own.
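A minimal sketch of this point:
```{r}
#| eval: true
#| echo: true
class(c(1, 2, 3))
class(c("a", "b", "c"))
```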
### Simulating data
Simulating data is a key skill for telling believable stories with data. In order to simulate data, we need to be able to randomly draw from statistical distributions and other collections. `R` has a variety of functions to make this easier, including: the normal distribution, `rnorm()`; the uniform distribution, `runif()`; the Poisson distribution, `rpois()`; the binomial distribution, `rbinom()`; and many others. To randomly sample from a collection of items, we can use `sample()`.
When dealing with randomness, the need for reproducibility makes it important, paradoxically, that the randomness is repeatable. That is to say, another person needs to be able to draw the random numbers that we draw. We do this by setting a seed for our random draws using `set.seed()`.
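A quick sketch of what `set.seed()` buys us: the same seed followed by the same draw gives identical numbers.
```{r}
#| eval: true
#| echo: true
set.seed(853)
rnorm(n = 3)
set.seed(853)
rnorm(n = 3)
```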
We could get observations from the standard normal distribution and put those into a data frame.
```{r}
#| echo: true
set.seed(853)
number_of_observations <- 5
simulated_data <-
data.frame(
person = c(1:number_of_observations),
std_normal_observations = rnorm(
n = number_of_observations,
mean = 0,
sd = 1
)
)
simulated_data
```
We could then add draws from the uniform, Poisson, and binomial distributions, using `cbind()` to bring the columns of the original dataset and the new one together.
```{r}
#| echo: true
simulated_data <-
simulated_data |>
cbind() |>
data.frame(
uniform_observations =
runif(n = number_of_observations, min = 0, max = 10),
poisson_observations =
rpois(n = number_of_observations, lambda = 100),
binomial_observations =
rbinom(n = number_of_observations, size = 2, prob = 0.5)
)
simulated_data
```
Finally, we will add a favorite color to each observation with `sample()`.
```{r}
#| echo: true
simulated_data <-
data.frame(
favorite_color = sample(
x = c("blue", "white"),
size = number_of_observations,
replace = TRUE
)
) |>
cbind(simulated_data)
simulated_data
```
We set the option "replace" to "TRUE" because we are only choosing between two items, but each time we choose we want the possibility that either is chosen. Depending on the simulation we may need to think about whether "replace" should be "TRUE" or "FALSE". Another useful optional argument to `sample()` is "prob", which adjusts the probability with which each item is drawn. The default is that all options are equally likely, but we could specify particular probabilities if we wanted to. As always with functions, we can find more in the help file, for instance `?sample`.
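For instance, a small sketch with unequal probabilities, where "blue" is drawn roughly nine times out of ten.
```{r}
#| eval: true
#| echo: true
set.seed(853)
sample(
  x = c("blue", "white"),
  size = 10,
  replace = TRUE,
  prob = c(0.9, 0.1)
)
```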
### `function()`, `for()`, and `apply()`
`R` "is a functional programming language" [@advancedr]. This means that we foundationally write, use, and compose functions, which are collections of code that accomplish something specific.
There are a lot of functions in `R` that other people have written and that we can use. Almost any common statistical or data science task that we might need to accomplish likely already has a function written by someone else and made available to us, either as part of the base `R` installation or through a package. But we will need to write our own functions from time to time, especially for more specific tasks.
We define a function using `function()`, and then assign a name. We will likely need to include some inputs and outputs for the function. Inputs are specified between round brackets. The specific task that the function is to accomplish goes between braces.
```{r}
print_names <- function(some_names) {
print(some_names)
}
print_names(c("rohan", "monica"))
```
We can specify defaults for the inputs in case the person using the function does not supply them.
```{r}
print_names <- function(some_names = c("edward", "hugo")) {
print(some_names)
}
print_names()
```
One common scenario is that we want to apply a function multiple times. Like many programming languages, we can use a `for()` loop for this. The look of a `for()` loop in `R` is similar to `function()`, in that we define what we are iterating over in the round brackets, and the code to run at each iteration in braces.
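A minimal sketch of a `for()` loop, re-using the `print_names()` function defined above to print each of three names in turn.
```{r}
#| eval: true
#| echo: true
for (a_name in c("rohan", "monica", "edward")) {
  print_names(a_name)
}
```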
<!-- ```{r} -->
<!-- for (i in 1:3) { -->
<!-- print(i) -->
<!-- } -->
<!-- ``` -->
<!-- ```{r} -->
<!-- x <- cbind(x1 = 66, x2 = c(4:1, 2:5)) -->
<!-- dimnames(x)[[1]] <- letters[1:8] -->
<!-- class(x) -->
<!-- apply(x, 2, mean, trim = .2) -->
<!-- ``` -->
Because `R` is a programming language that is focused on statistics, we are often interested in arrays or matrices. We use `apply()` to apply a function to rows ("MARGIN = 1") or columns ("MARGIN = 2").
```{r}
simulated_data
apply(X = simulated_data, MARGIN = 2, FUN = unique)
```
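That example applied `unique()` to each column. As a small sketch of "MARGIN = 1", we could instead take the mean of each row, restricting ourselves to two of the numeric columns so that the coercion to a matrix stays numeric.
```{r}
#| eval: true
#| echo: true
apply(
  X = simulated_data[, c("uniform_observations", "poisson_observations")],
  MARGIN = 1,
  FUN = mean
)
```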
## Making graphs with `ggplot2`
If the key package in the `tidyverse` in terms of manipulating data is `dplyr` [@citedplyr], then the key package in the `tidyverse` in terms of creating graphs is `ggplot2` [@citeggplot]. We will have more to say about graphing in @sec-static-communication, but here we provide a quick tour of some essentials. `ggplot2` works by defining layers which build to form a graph, based around the "grammar of graphics" (hence, the "gg"). Instead of the pipe operator (`|>`) `ggplot2` uses the add operator `+`. As part of the `tidyverse` collection of packages, `ggplot2` does not need to be explicitly installed or loaded if the `tidyverse` has been loaded.
There are three key aspects that need to be specified to build a graph with `ggplot2`:
1. Data;
2. Aesthetics / mapping; and
3. Type.
To get started we will obtain some GDP data for countries in the Organisation for Economic Co-operation and Development (OECD) [@citeoecdgdp].
```{r}
#| eval: false
#| echo: true
library(tidyverse)
oecd_gdp <-