---
engine: knitr
---
# R essentials {#sec-r-essentials}
**Prerequisites**
- Read *R for Data Science*, Chapter 4 "Data transformation", [@r4ds]
- Provides an overview of manipulating datasets using `dplyr`.
- Read *Data Feminism*, Chapter 6 "The Numbers Don't Speak for Themselves", [@datafeminism2020]
- Discusses the need to consider data within the broader context that generated them.
- Read *R Generation*, [@Thieme2018]
- Provides background information about `R`.
**Key concepts and skills**
- Understanding foundational aspects of `R` and RStudio enables a gradual improvement of workflows. For instance, being able to use key `dplyr` verbs and make graphs with `ggplot2` makes manipulating and understanding datasets easier.
- But there is an awful lot of functionality in the `tidyverse` including importing data, dataset manipulation, string manipulation, and factors. You do not need to know it all at once, but you should know that you do not yet know it.
- Beyond the `tidyverse` it is also important to know that foundational aspects, common to many languages, exist and can be added to data science workflows. For instance, class, functions, and data simulation all have an important role to play.
**Software and packages**
- Base `R`
- Core `tidyverse` [@tidyverse]
- `dplyr` [@citedplyr]
- `forcats` [@citeforcats]
- `ggplot2` [@citeggplot]
- `readr` [@citereadr]
- `stringr` [@citestringr]
- `tibble` [@tibble]
- `tidyr` [@citetidyr]
- Outer `tidyverse` [@tidyverse] (these need to be loaded separately e.g. `library("haven")`)
- `haven` [@citehaven]
- `lubridate` [@GrolemundWickham2011]
- `janitor` [@janitor]
## Introduction
In this chapter we focus on foundational skills needed to use the statistical programming language `R` [@citeR] to tell stories with data. Some of it may not make sense at first, but these are skills and approaches that we will often use. You should initially go through this chapter quickly, noting aspects that you do not understand. Then come back to this chapter from time to time as you continue through the rest of the book. That way you will see how the various bits fit into context.
`R` is an open-source language for statistical programming. You can download `R` for free from the [Comprehensive R Archive Network](https://cran.r-project.org) (CRAN). RStudio is an Integrated Development Environment (IDE) for `R` which makes the language easier to use and can be downloaded for free from Posit [here](https://www.rstudio.com/products/rstudio/).
The past ten years or so have been characterized by the increased use of the `tidyverse`. This is "...an opinionated collection of `R` packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures" [@tidyversewebsite]. There are three distinctions to be clear about: the original `R` language, typically referred to as "base"; the `tidyverse`, which is a coherent collection of packages that builds on top of base; and other packages.
Essentially everything that we can do in the `tidyverse`, we can also do in base. But, as the `tidyverse` was built especially for data science, it is often easier to use, especially when learning. Additionally, almost everything that we can do in the `tidyverse`, we can also do with other packages. But, as the `tidyverse` is a coherent collection of packages, it is often easier to use, again, especially when learning. Eventually there are cases where it makes sense to trade off the convenience and coherence of the `tidyverse` for some features of base, other packages, or languages. Indeed, we introduce SQL in @sec-store-and-share as one source of considerable efficiency gain when working with data. For instance, the `tidyverse` can be slow, and so if one needs to import thousands of CSVs then it can make sense to switch away from `read_csv()`. The appropriate use of base and non-tidyverse packages, or even other languages, rather than dogmatic insistence on a particular solution, is a sign of intellectual maturity.
Central to our use of the statistical programming language `R` is data, and most of the data that we use will have humans at the heart of it. Sometimes, dealing with human-centered data in this way can have a numbing effect, resulting in over-generalization, and potentially problematic work. Another sign of intellectual maturity is when it has the opposite effect, increasing our awareness of our decision-making processes and their consequences.
> In practice, I find that far from distancing you from questions of meaning, quantitative data forces you to confront them. The numbers draw you in. Working with data like this is an unending exercise in humility, a constant compulsion to think through what you can and cannot see, and a standing invitation to understand what the measures really capture---what they mean, and for whom.
>
> @kieranskitchen
## R, RStudio, and Posit Cloud
`R` and RStudio are complementary, but they are not the same thing. @vistransrep explain their relationship by analogy, where `R` is like an engine and RStudio is like a car---we can use engines in a lot of different situations, and they are not limited to being used in cars, but the combination is especially useful.
### R
[`R`](https://www.r-project.org/) is an open-source and free programming language that is focused on statistics generally. Free in this context does not refer to a price of zero, but instead to the freedom that the creators give users to largely do what they want with it (although it also does have a price of zero). This is in contrast with an open-source programming language designed for general-purpose use, such as `Python`, or an open-source programming language that is focused on probability, such as `Stan`. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland in the 1990s, and traces its provenance to `S`, which was developed at Bell Labs in the 1970s. It is maintained by the R Core Team and changes to this "base" of code occur methodically and with concern given to different priorities.
Many people build on this stable base, to extend the capabilities of `R` to better and more quickly suit their needs. They do this by creating packages. Typically, although not always, a package is a collection of `R` code, mostly functions, and this allows us to more easily do things that we want to do. These packages are managed by repositories such as CRAN and Bioconductor.
If you want to use a package, then you first need to install it on your computer, and then you need to load it when you want to use it. Dr Di Cook, Professor of Business Analytics at Monash University, describes this as analogous to a lightbulb. If you want light in your house, first you need to fit a lightbulb, and then you need to turn the switch on. Installing a package, say, `install.packages("tidyverse")`, is akin to fitting a lightbulb into a socket---you only need to do this once for each lightbulb. But then each time you want light you need to turn on the switch to the lightbulb, which in the `R` packages case, means drawing on your library, say, `library(tidyverse)`.
:::{.callout-note}
## Shoulders of giants
Dr Di Cook is Distinguished Professor of Statistics at Monash University. After earning a PhD in statistics from Rutgers University in 1993, where she focused on statistical graphics, she was appointed as an assistant professor at Iowa State University, being promoted to full professor in 2005, and in 2015 she moved to Monash. One area of her research is data visualization, especially interactive and dynamic graphics. @buja1996interactive proposes a taxonomy of interactive data visualization and the associated software XGobi, which is the focus of @ggobibook. @Cook1995 develops and explores the use of a dynamic graphical tool for exploratory data analysis, and @Buja2009 develops a framework for evaluating visual statistical methods, where plots and human cognition stand in for test statistics and statistical tests, respectively. She is a Fellow of the American Statistical Association.
:::
To install a package on your computer (again, we will need to do this only once per computer) we use `install.packages()`.
```{r}
#| eval: false
#| echo: true
install.packages("tidyverse")
```
And then when we want to use the package, we use `library()`.
```{r}
#| eval: false
#| echo: true
library(tidyverse)
```
Having downloaded it, we can open `R` and use it directly. It is primarily designed to be interacted with through the command line. While this is functional, it can be useful to have a richer environment than the command line provides. In particular, it can be useful to install an Integrated Development Environment (IDE), which is an application that brings together various bits and pieces that will be used often. One common IDE for `R` is RStudio, although others such as Visual Studio are also used.
### RStudio
RStudio is distinct from `R`; they are different entities. RStudio builds on top of `R` to make it easier to use `R`. This is in the same way that one could use the internet from the command line, but most people use a browser such as Chrome, Firefox, or Safari.
RStudio is free in the sense that we do not pay for it. It is also free in the sense of being able to take the code, modify it, and distribute that code. But the maker of RStudio, Posit, is a company, albeit a B Corp, and so it is possible that the current situation could change. It can be downloaded from Posit [here](https://www.rstudio.com/products/rstudio/).
When we open RStudio it will look like @fig-first.
![Opening RStudio for the first time](figures/01.png){#fig-first width=90% fig-align="center"}
The left pane is a console in which you can type and execute `R` code line by line. Try it with 2+2 by clicking next to the prompt ">", typing "2+2", and then pressing "return/enter".
```{r}
#| eval: true
#| echo: true
2 + 2
```
The pane on the top right has information about the environment. For instance, when we create variables a list of their names and some properties will appear there. Next to the prompt type the following code, replacing Rohan with your name, and again press enter.
```{r}
#| eval: true
#| echo: true
my_name <- "Rohan"
```
As mentioned in @sec-fire-hose the `<-`, or "assignment operator", allocates `"Rohan"` to an object called "my_name". You should notice a new value in the environment pane with the variable name and its value.
The pane in the bottom right is a file manager. At the moment it should just have two files: an `R` History file and an `R` Project file. We will get to what these are later, but for now we will create and save a file.
Run the following code, without worrying too much about the details for now. You should see a new ".rds" file in your list of files.
```{r}
#| eval: false
#| echo: true
saveRDS(object = my_name, file = "my_first_file.rds")
```
### Posit Cloud
While you can and should download RStudio to your own computer, initially we recommend using [Posit Cloud](https://posit.cloud). This is an online version of RStudio that is provided by Posit. We will use this so that you can focus on getting comfortable with `R` and RStudio in an environment that is consistent. This way you do not have to worry about what computer you have or installation permissions, amongst other things.
The free version of Posit Cloud is free, as in no financial cost. The trade-off is that it is not powerful, and it is sometimes slow, but for the purposes of getting started it is enough.
## Getting started
We will now start going through some code. Actively write this all out yourself.
While working line-by-line in the console is fine, it is easier to write out a whole script that can then be run. We will do this by making an `R` Script ("File" $\rightarrow$ "New File" $\rightarrow$ "R Script"). The console pane will fall to the bottom left and an `R` Script will open in the top left. We will write some code that will get all of the Australian federal politicians and then construct a small table about the genders of the prime ministers. Some of this code will not make sense at this stage, but just type it all out to get into the habit and then run it. To run the whole script, we can click "Run" or we can highlight certain lines and then click "Run" to just run those lines.
```{r}
#| eval: false
#| echo: true
#| warning: false
#| message: false
# Install the packages that we need
install.packages("tidyverse")
install.packages("AustralianPoliticians")
```
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
# Load the packages that we need to use this time
library(tidyverse)
library(AustralianPoliticians)
# Make a table of the counts of genders of the prime ministers
get_auspol("all") |> # Imports data from GitHub
as_tibble() |>
filter(wasPrimeMinister == 1) |>
count(gender)
```
We can see that, as at the end of 2021, one female has been prime minister (Julia Gillard), while the other 29 prime ministers were male.
One critical operator when programming is the "pipe": `|>`. We read this as "and then". This takes the output of a line of code and uses it as the first input to the next line of code. It makes code easier to read. By way of background, for many years `R` users used `%>%` as the pipe, which is from `magrittr` [@magrittr] and part of the `tidyverse`. Base `R` added the pipe that we use in this book, `|>`, in 2021, and so if you look at older code, you may see the earlier pipe being used. For the most part, they are interchangeable.
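As a small sketch of that interchangeability, both pipes pass the left-hand side as the first argument to the function on the right, so the following two lines give the same result (the `magrittr` pipe is available here because the `tidyverse` has been loaded).
```{r}
#| eval: true
#| echo: true
c(1, 4, 9) |> sqrt()
c(1, 4, 9) %>% sqrt()
```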
The idea of the pipe is that we take a dataset, and then do something to it. We used this in the earlier example. Another example follows where we will look at the first six lines of a dataset by piping it to `head()`. Notice that `head()` does not explicitly take any arguments in this example. It knows which data to display because the pipe tells it implicitly.
```{r}
#| eval: true
#| echo: true
get_auspol("all") |> # Imports data from GitHub
head()
```
We can save this `R` Script as "my_first_r_script.R" ("File" $\rightarrow$ "Save As"). At this point, our workspace should look something like @fig-third.
![After running an `R` Script](figures/03.png){#fig-third width=90% fig-align="center"}
One thing to be aware of is that each Posit Cloud workspace is essentially a new computer. Because of this, we need to install any package that we want to use in each workspace. For instance, before we can use the `tidyverse`, we need to install it with `install.packages("tidyverse")`. This contrasts with using one's own computer, where a package only needs to be installed once.
A few final notes on Posit Cloud:
1. In the Australian politicians example, we got our data from GitHub using an `R` package, but we can get data into a workspace from a local computer in a variety of ways. One way is to use the "upload" button in the "Files" panel. Another is to use `readr` [@citereadr], which is part of the `tidyverse` [@tidyverse].
2. Posit Cloud allows some degree of collaboration. For instance, you can give someone else access to a workspace that you create and even both be in the same workspace at the one time. This could be useful for collaboration.
3. There are a variety of weaknesses of Posit Cloud, in particular the RAM limits. Additionally, like any web application, things break from time to time or go down.
## The `dplyr` verbs
One of the key packages that we will use is the `tidyverse` [@tidyverse]. The `tidyverse` is actually a package of packages, which means when we install the `tidyverse`, we actually install a whole bunch of different packages. The key package in the `tidyverse` in terms of manipulating data is `dplyr` [@citedplyr].
There are five `dplyr` functions that are regularly used, and we will now go through each of these. These are commonly referred to as the `dplyr` verbs.
1. `select()`
2. `filter()`
3. `arrange()`
4. `mutate()`
5. `summarise()` or equally `summarize()`
We will also cover `.by` and `count()` here as they are closely related.
As we have already installed the `tidyverse`, we just need to load it.
```{r}
#| warning: false
#| message: false
#| eval: true
#| echo: true
library(tidyverse)
```
And we will begin by again using some data about Australian politicians from the `AustralianPoliticians` package [@citeaustralianpoliticians].
```{r}
#| eval: true
#| echo: true
library(AustralianPoliticians)
australian_politicians <-
get_auspol("all")
head(australian_politicians)
```
### `select()`
We use `select()` to pick particular columns of a dataset. For instance, we might like to select the "firstName" column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(firstName)
```
In `R`, there are many ways to do things. Sometimes these are different ways to do the same thing, and other times they are different ways to do *almost* the same thing. For instance, another way to pick a particular column of a dataset is to use the "extract" operator `$`. This is from base, as opposed to `select()` which is from the `tidyverse`.
```{r}
#| eval: true
#| echo: true
australian_politicians$firstName |>
head()
```
The two appear similar---both pick the "firstName" column---but they differ in the class of what they return, with `select()` returning a tibble and `$` returning a vector. For the sake of completeness, if we combine `select()` with `pull()` then we get the same class of output, a vector, as if we had used the extract operator.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(firstName) |>
pull() |>
head()
```
We can also use `select()` to remove columns, by negating the column name.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(-firstName)
```
Finally, we can `select()` based on conditions. For instance, we can `select()` all of the columns that start with, say, "birth".
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(starts_with("birth"))
```
There are a variety of similar "selection helpers" including `starts_with()`, `ends_with()`, and `contains()`. More information about these is available in the help page for `select()` which can be accessed by running `?select()`.
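For instance, as a small additional sketch, `ends_with()` works the same way, here picking the columns whose names end with "Date", such as "birthDate" and "deathDate".
```{r}
#| eval: true
#| echo: true
australian_politicians |>
  select(ends_with("Date"))
```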
At this point, we will use `select()` to reduce the width of our dataset.
```{r}
#| eval: true
#| echo: true
australian_politicians <-
australian_politicians |>
select(
uniqueID,
surname,
firstName,
gender,
birthDate,
birthYear,
deathDate,
member,
senator,
wasPrimeMinister
)
australian_politicians
```
One thing that sometimes confuses people who are new to `R` is that output is not "saved" unless it is assigned to an object. For instance, here the code starts with `australian_politicians <- australian_politicians |>` before `select()` is used, rather than just `australian_politicians |>`. This ensures that the changes brought about by `select()` are applied to the object, and so it is that modified version that is used at any later point in the code.
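A minimal sketch of the difference: piping to `select()` without assignment does not change the underlying object.
```{r}
#| eval: true
#| echo: true
# Not assigned, so australian_politicians itself is unchanged
australian_politicians |>
  select(uniqueID) |>
  ncol()
# The object still has all the columns we kept earlier
ncol(australian_politicians)
```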
### `filter()`
We use `filter()` to pick particular rows of a dataset. For instance, we might be only interested in politicians that became prime minister.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(wasPrimeMinister == 1)
```
We could also give `filter()` two conditions. For instance, we could look at politicians that became prime minister and were named Joseph, using the "and" operator `&`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(wasPrimeMinister == 1 & firstName == "Joseph")
```
We get the same result if we use a comma instead of an ampersand.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(wasPrimeMinister == 1, firstName == "Joseph")
```
Similarly, we could look at politicians who were named, say, Myles or Ruth using the "or" operator `|`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(firstName == "Myles" | firstName == "Ruth")
```
We could also pipe the result. For instance we could pipe from `filter()` to `select()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(firstName == "Ruth" | firstName == "Myles") |>
select(firstName, surname)
```
If we happen to know the particular row number that is of interest then we could `filter()` to only that particular row. For instance, say row 853 was of interest.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
filter(row_number() == 853)
```
There is also a dedicated function to do this, which is `slice()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(853)
```
While this may seem somewhat esoteric, it is especially useful if we would like to remove a particular row using negation, or duplicate specific rows. For instance, we could remove the first row.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(-1)
```
We could also, say, only keep the first three rows.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(1:3)
```
Finally, we could duplicate the first two rows, taking advantage of `n()`, which provides the current group size.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
slice(1:2, 1:n())
```
### `arrange()`
We use `arrange()` to change the order of the dataset based on the values of particular columns. For instance, we could arrange the politicians by their year of birth.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(birthYear)
```
We could modify `arrange()` with `desc()` to change from ascending to descending order.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(desc(birthYear))
```
This could also be achieved with the minus sign.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(-birthYear)
```
And we could arrange based on more than one column. For instance, if two politicians have the same first name, then we could also arrange based on their year of birth.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(firstName, birthYear)
```
We could achieve the same result by piping between two instances of `arrange()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(birthYear) |>
arrange(firstName)
```
When we use `arrange()` we should be clear about precedence. For instance, changing to year of birth and then first name would give a different arrangement.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(birthYear, firstName)
```
A nice way to arrange by a variety of columns is to use `across()`. It enables us to use the "selection helpers" such as `starts_with()` that were mentioned in association with `select()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
arrange(across(c(firstName, birthYear)))
australian_politicians |>
arrange(across(starts_with("birth")))
```
### `mutate()`
We use `mutate()` when we want to make a new column. For instance, perhaps we want to make a new column that is 1 if a person was both a member and a senator and 0 otherwise. That is to say that our new column would denote politicians that served in both the upper and the lower house.
```{r}
#| eval: true
#| echo: true
australian_politicians <-
australian_politicians |>
mutate(was_both = if_else(member == 1 & senator == 1, 1, 0))
australian_politicians |>
select(member, senator, was_both)
```
We could use `mutate()` with math, such as addition and subtraction. For instance, we could calculate the age that the politicians are (or would have been) in 2022.
```{r}
#| eval: true
#| echo: true
#| message: false
#| warning: false
library(lubridate)
australian_politicians <-
australian_politicians |>
mutate(age = 2022 - year(birthDate))
australian_politicians |>
select(uniqueID, age)
```
There are a variety of functions that are especially useful when constructing new columns. These include `log()` which will compute the natural logarithm, `lead()` which will bring values up by one row, `lag()` which will push values down by one row, and `cumsum()` which creates a cumulative sum of the column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
select(uniqueID, age) |>
mutate(log_age = log(age))
australian_politicians |>
select(uniqueID, age) |>
mutate(lead_age = lead(age))
australian_politicians |>
select(uniqueID, age) |>
mutate(lag_age = lag(age))
australian_politicians |>
select(uniqueID, age) |>
drop_na(age) |>
mutate(cumulative_age = cumsum(age))
```
As we have in earlier examples, we can also use `mutate()` in combination with `across()`. This includes the potential use of the selection helpers. For instance, we could count the number of characters in both the first and last names at the same time.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
mutate(across(c(firstName, surname), str_count)) |>
select(uniqueID, firstName, surname)
```
Finally, we use `case_when()` when we need to make a new column on the basis of more than two conditional statements (in contrast to `if_else()` from our first `mutate()` example). For instance, we may have some years and want to group them into decades.
```{r}
library(lubridate)
australian_politicians |>
mutate(
year_of_birth = year(birthDate),
decade_of_birth =
case_when(
year_of_birth <= 1929 ~ "pre-1930",
year_of_birth <= 1939 ~ "1930s",
year_of_birth <= 1949 ~ "1940s",
year_of_birth <= 1959 ~ "1950s",
year_of_birth <= 1969 ~ "1960s",
year_of_birth <= 1979 ~ "1970s",
year_of_birth <= 1989 ~ "1980s",
year_of_birth <= 1999 ~ "1990s",
TRUE ~ "Unknown or error"
)
) |>
select(uniqueID, year_of_birth, decade_of_birth)
```
We could accomplish this with a series of nested `if_else()` statements, but `case_when()` is clearer. The cases are evaluated in order, and as soon as there is a match `case_when()` does not continue to the remainder of the cases. It can be useful to have a catch-all at the end that will signal a potential issue that we might like to know about if the code were ever to reach it.
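For comparison, a minimal sketch of the same idea with nested `if_else()` statements, covering only the first few decades, suggests why `case_when()` scales better as the number of cases grows.
```{r}
#| eval: false
#| echo: true
australian_politicians |>
  mutate(
    year_of_birth = year(birthDate),
    decade_of_birth =
      if_else(
        year_of_birth <= 1929,
        "pre-1930",
        if_else(year_of_birth <= 1939, "1930s", "1940s or later")
      )
  ) |>
  select(uniqueID, year_of_birth, decade_of_birth)
```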
### `summarise()`
We use `summarise()` when we would like to make new, condensed, summary variables. For instance, perhaps we would like to know the minimum, average, and maximum of some column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarise(
youngest = min(age, na.rm = TRUE),
oldest = max(age, na.rm = TRUE),
average = mean(age, na.rm = TRUE)
)
```
As an aside, `summarise()` and `summarize()` are equivalent and we can use either. In this book we use `summarise()`.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarize(
youngest = min(age, na.rm = TRUE),
oldest = max(age, na.rm = TRUE),
average = mean(age, na.rm = TRUE)
)
```
By default, `summarise()` provides one row of output for the whole dataset. For instance, in the earlier example we found the youngest, oldest, and average age across all politicians. However, we can compute these summaries for groups within our dataset by specifying `.by` within the function. We can use many functions on the basis of groups, but `summarise()` is particularly powerful in conjunction with `.by`. For instance, we could group by gender, and then get age-based summary statistics.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarise(
youngest = min(age, na.rm = TRUE),
oldest = max(age, na.rm = TRUE),
average = mean(age, na.rm = TRUE),
.by = gender
)
```
Similarly, we could look at youngest, oldest, and mean age at death by gender.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
mutate(days_lived = deathDate - birthDate) |>
drop_na(days_lived) |>
summarise(
min_days = min(days_lived),
mean_days = mean(days_lived) |> round(),
max_days = max(days_lived),
.by = gender
)
```
And so we learn that female members of parliament on average lived slightly longer than male members of parliament.
We can use `.by` on the basis of more than one group. For instance, we could look at the number of days lived by gender and by whether they were in the House of Representatives or the Senate.
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
australian_politicians |>
mutate(days_lived = deathDate - birthDate) |>
drop_na(days_lived) |>
summarise(
min_days = min(days_lived),
mean_days = mean(days_lived) |> round(),
max_days = max(days_lived),
.by = c(gender, member)
)
```
We can use `count()` to create counts by groups. For instance, the number of politicians by gender.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
count(gender)
```
In addition to the count, we could calculate a proportion.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
count(gender) |>
mutate(proportion = n / (sum(n)))
```
Using `count()` is essentially the same as combining `.by` with `n()` within `summarise()`, and we get the same result that way.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
summarise(n = n(),
.by = gender)
```
There is also a comparably helpful function, `add_count()`, that acts similarly to `mutate()`. The difference is that the count is added in a new column.
```{r}
#| eval: true
#| echo: true
australian_politicians |>
add_count(gender) |>
select(uniqueID, gender, n)
```
## Base R
While the `tidyverse` was established relatively recently to help with data science, `R` existed long before it. There is a host of functionality built into `R`, especially around the core needs of programmers and statisticians.
In particular, we will cover:
1. `class()`.
2. Data simulation.
3. `function()`, `for()`, and `apply()`.
There is no need to install or load any additional packages, as this functionality comes with `R`.
### `class()`
In everyday usage "a, b, c, ..." are letters and "1, 2, 3,..." are numbers. And we use letters and numbers differently; for instance, we do not add or subtract letters. Similarly, `R` needs to have some way of distinguishing different classes of content and to define the properties that each class has, "how it behaves, and how it relates to other types of objects" [@advancedr].
Classes have a hierarchy. For instance, we are "human", which is itself "animal". All "humans" are "animals", but not all "animals" are "humans". Similarly, all integers are numbers, but not all numbers are integers. We can find out the class of an object in `R` with `class()`.
```{r}
#| echo: true
a_number <- 8
class(a_number)
a_letter <- "a"
class(a_letter)
```
The classes that we cover here are "numeric", "character", "factor", "date", and "data.frame".
The first thing to know is that, in the same way that a frog can become a prince, we can sometimes change the class of an object in `R`. This is called "casting". For instance, we could start with a "numeric", change it to a "character" with `as.character()`, and then a "factor" with `as.factor()`. But if we tried to make it into a "date" with `as.Date()` we would get an error, because not all numbers have the properties that are needed to be a date.
```{r}
#| echo: true
a_number <- 8
a_number
class(a_number)
a_number <- as.character(a_number)
a_number
class(a_number)
a_number <- as.factor(a_number)
a_number
class(a_number)
```
Compared with "numeric" and "character" classes, the "factor" class might be less familiar. A "factor" is used for categorical data that can only take certain values [@advancedr]. For instance, typical usage of a "factor" variable would be a binary, such as "day" or "night". It is also often used for age-groups, such as "18-29", "30-44", "45-60", "60+" (as opposed to age, which would often be a "numeric"); and sometimes for level of education: "less than high school", "high school", "college", "undergraduate degree", "postgraduate degree". We can find the allowed levels for a "factor" using `levels()`.
```{r}
age_groups <- factor(
c("18-29", "30-44", "45-60", "60+")
)
age_groups
class(age_groups)
levels(age_groups)
```
Dates are an especially tricky class and quickly become complicated. Nonetheless, at a foundational level, we can use `as.Date()` to convert a character that looks like a "date" into an actual "date". This enables us to, say, perform addition and subtraction, when we would not be able to do that with a "character".
```{r}
looks_like_a_date_but_is_not <- "2022-01-01"
looks_like_a_date_but_is_not
class(looks_like_a_date_but_is_not)
is_a_date <- as.Date(looks_like_a_date_but_is_not)
is_a_date
class(is_a_date)
is_a_date + 3
```
The final class that we discuss here is "data.frame". This looks like a spreadsheet and is commonly used to store the data that we will analyze. Formally, "a data frame is a list of equal-length vectors" [@advancedr]. It will have column and row names which we can see using `colnames()` and `rownames()`, although often the names of the rows are just numbers.
To illustrate this, we use the "ResumeNames" dataset from `AER` [@citeaer]. This package can be installed in the same way as any other package from CRAN. The dataset comprises cross-sectional data on 4,870 fictitious resumes, covering resume content, especially the name used on the resume, and whether the candidate received a call-back. It was created by @bertrand2004emily, who sent fictitious resumes in response to job advertisements in Boston and Chicago that differed in whether the resume was assigned a "very African American sounding name or a very White sounding name". They found considerable discrimination whereby "White names receive 50 per cent more callbacks for interviews". @hangartner2021monitoring generalize this using an online Swiss platform and find that immigrants and minority ethnic groups are contacted less by recruiters, as are women when the profession is male-dominated, and vice versa.
```{r}
#| eval: false
#| echo: true
#| warning: false
#| message: false
install.packages("AER")
```
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
library(AER)
data("ResumeNames", package = "AER")
```
```{r}
ResumeNames |>
head()
class(ResumeNames)
colnames(ResumeNames)
```
We can examine the class of the vectors (that is, the columns) that make up a data frame by specifying the column name.
```{r}
class(ResumeNames$name)
class(ResumeNames$jobs)
```
Sometimes it is helpful to be able to change the classes of many columns at once. We can do this by using `mutate()` and `across()`.
```{r}
class(ResumeNames$name)
class(ResumeNames$gender)
class(ResumeNames$ethnicity)
ResumeNames <-
  ResumeNames |>
  mutate(across(c(name, gender, ethnicity), as.character))
class(ResumeNames$name)
class(ResumeNames$gender)
class(ResumeNames$ethnicity)
```
There are many ways for code to fail to run, and an issue with class is always among the first things to check. Common issues are variables that we think should be "character" or "numeric" actually being "factor", and variables that we think should be "numeric" actually being "character".
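As a small sketch of the factor issue, converting a "factor" that holds numbers directly to "numeric" returns the underlying level codes rather than the numbers themselves; converting to "character" first avoids this.
```{r}
#| eval: true
#| echo: true
stored_as_factor <- factor(c("10", "20", "30"))
# Returns the level codes, not the numbers we probably wanted
as.numeric(stored_as_factor)
# Going via character recovers the numbers
as.numeric(as.character(stored_as_factor))
```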
Finally, it is worth pointing out that the class of a vector is simply the class of its contents. In Python and other languages, a similar data structure to a vector is a "list". A "list" is a class of its own, and the objects in a "list" have their own classes (for instance, `["a", 1]` is an object of class "list" with entries of class "str" and "int"). If you are coming to `R` from another language, it may be counter-intuitive that a vector is not a class of its own.
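A minimal sketch of this point:
```{r}
#| eval: true
#| echo: true
class(c(1, 2, 3))
class(c("a", "b", "c"))
```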
### Simulating data
Simulating data is a key skill for telling believable stories with data. In order to simulate data, we need to be able to randomly draw from statistical distributions and other collections. `R` has a variety of functions to make this easier, including: the normal distribution, `rnorm()`; the uniform distribution, `runif()`; the Poisson distribution, `rpois()`; the binomial distribution, `rbinom()`; and many others. To randomly sample from a collection of items, we can use `sample()`.
When dealing with randomness, the need for reproducibility makes it important, paradoxically, that the randomness is repeatable. That is to say, another person needs to be able to draw the random numbers that we draw. We do this by setting a seed for our random draws using `set.seed()`.
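A quick sketch of what `set.seed()` buys us: the same seed followed by the same draw gives identical numbers.
```{r}
#| eval: true
#| echo: true
set.seed(853)
rnorm(n = 3)
set.seed(853)
rnorm(n = 3)
```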
We could get observations from the standard normal distribution and put those into a data frame.
```{r}
#| echo: true
set.seed(853)
number_of_observations <- 5
simulated_data <-
data.frame(
person = c(1:number_of_observations),
std_normal_observations = rnorm(
n = number_of_observations,
mean = 0,
sd = 1
)
)
simulated_data
```
We could then add draws from the uniform, Poisson, and binomial distributions, using `cbind()` to bring the columns of the original dataset and the new one together.
```{r}
#| echo: true
simulated_data <-
simulated_data |>
cbind() |>
data.frame(
uniform_observations =
runif(n = number_of_observations, min = 0, max = 10),
poisson_observations =
rpois(n = number_of_observations, lambda = 100),
binomial_observations =
rbinom(n = number_of_observations, size = 2, prob = 0.5)
)
simulated_data
```
Finally, we will add a favorite color to each observation with `sample()`.
```{r}
#| echo: true
simulated_data <-
data.frame(
favorite_color = sample(
x = c("blue", "white"),
size = number_of_observations,
replace = TRUE
)
) |>
cbind(simulated_data)
simulated_data
```
We set the option "replace" to "TRUE" because we are only choosing between two items, but each time we choose we want the possibility that either is chosen. Depending on the simulation we may need to think about whether "replace" should be "TRUE" or "FALSE". Another useful optional argument to `sample()` is "prob", which adjusts the probability with which each item is drawn. The default is that all options are equally likely, but we could specify particular probabilities if we wanted to. As always with functions, we can find more in the help file, for instance `?sample`.
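For instance, a small sketch with unequal probabilities, where "blue" is drawn roughly nine times out of ten.
```{r}
#| eval: true
#| echo: true
set.seed(853)
sample(
  x = c("blue", "white"),
  size = 10,
  replace = TRUE,
  prob = c(0.9, 0.1)
)
```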
### `function()`, `for()`, and `apply()`
`R` "is a functional programming language" [@advancedr]. This means that we foundationally write, use, and compose functions, which are collections of code that accomplish something specific.
There are a lot of functions in `R` that other people have written and that we can use. Almost any common statistical or data science task that we might need to accomplish likely already has a function written by someone else and made available to us, either as part of the base `R` installation or through a package. But we will need to write our own functions from time to time, especially for more specific tasks.
We define a function using `function()`, and then assign a name. We will likely need to include some inputs and outputs for the function. Inputs are specified between round brackets. The specific task that the function is to accomplish goes between braces.
```{r}
print_names <- function(some_names) {
print(some_names)
}
print_names(c("rohan", "monica"))
```
We can specify defaults for the inputs in case the person using the function does not supply them.
```{r}
print_names <- function(some_names = c("edward", "hugo")) {
print(some_names)
}
print_names()
```
One common scenario is that we want to apply a function multiple times. Like many programming languages, we can use a `for()` loop for this. The look of a `for()` loop in `R` is similar to `function()`, in that we define what we are iterating over in the round brackets, and the code to run at each iteration in braces.
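A minimal sketch of a `for()` loop, re-using the `print_names()` function defined above to print each of three names in turn.
```{r}
#| eval: true
#| echo: true
for (a_name in c("rohan", "monica", "edward")) {
  print_names(a_name)
}
```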
<!-- ```{r} -->
<!-- for (i in 1:3) { -->
<!-- print(i) -->
<!-- } -->
<!-- ``` -->
<!-- ```{r} -->
<!-- x <- cbind(x1 = 66, x2 = c(4:1, 2:5)) -->
<!-- dimnames(x)[[1]] <- letters[1:8] -->
<!-- class(x) -->
<!-- apply(x, 2, mean, trim = .2) -->
<!-- ``` -->
Because `R` is a programming language that is focused on statistics, we are often interested in arrays or matrices. We use `apply()` to apply a function to rows ("MARGIN = 1") or columns ("MARGIN = 2").
```{r}
simulated_data
apply(X = simulated_data, MARGIN = 2, FUN = unique)
```
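That example applied `unique()` to each column. As a small sketch of "MARGIN = 1", we could instead take the mean of each row, restricting ourselves to two of the numeric columns so that the coercion to a matrix stays numeric.
```{r}
#| eval: true
#| echo: true
apply(
  X = simulated_data[, c("uniform_observations", "poisson_observations")],
  MARGIN = 1,
  FUN = mean
)
```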
## Making graphs with `ggplot2`
If the key package in the `tidyverse` in terms of manipulating data is `dplyr` [@citedplyr], then the key package in the `tidyverse` in terms of creating graphs is `ggplot2` [@citeggplot]. We will have more to say about graphing in @sec-static-communication, but here we provide a quick tour of some essentials. `ggplot2` works by defining layers which build to form a graph, based around the "grammar of graphics" (hence, the "gg"). Instead of the pipe operator (`|>`) `ggplot2` uses the add operator `+`. As part of the `tidyverse` collection of packages, `ggplot2` does not need to be explicitly installed or loaded if the `tidyverse` has been loaded.
There are three key aspects that need to be specified to build a graph with `ggplot2`:
1. Data;
2. Aesthetics / mapping; and
3. Type.
To get started we will obtain some GDP data for countries in the Organisation for Economic Co-operation and Development (OECD) [@citeoecdgdp].
```{r}
#| eval: false
#| echo: true
library(tidyverse)
oecd_gdp <-