---
engine: knitr
---

# Graphs, tables, and maps {#sec-static-communication}

::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
:::

**Prerequisites**

- Read *R for Data Science*, [@r4ds]
  - Focus on Chapter 1 "Data visualization", which provides an overview of `ggplot2`.
- Read *Data Visualization: A Practical Introduction*, [@healyviz]
  - Focus on Chapter 3 "Make a plot", which provides an overview of `ggplot2` with different emphasis.
- Watch *The Glamour of Graphics*, [@chase2020]
  - This video details ideas for how to improve a plot made with `ggplot2`.
- Read *Testing Statistical Charts: What Makes a Good Graph?*, [@vanderplas2020testing]
  - This article details best practice for making graphs.
- Read *Data Feminism*, [@datafeminism2020]
  - Focus on Chapter 3 "On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints", which provides examples of why data needs to be considered within context.
- Read *Historical development of the graphical representation of statistical data*, [@funkhouser1937historical]
  - Focus on Chapter 2 "The Origin of the Graphic Method", which discusses how various graphs developed.
- Read *Remove the legend to become one*, [@removethelegend]
  - Goes through the process of gradually improving a graph. It is all interesting, but the graphs aspect begins with "What does this have to do with line graphs?".
- Read *Geocomputation with R*, Chapter 2 "Geographic data in R", [@lovelace2019geocomputation]
  - This chapter provides an overview of mapping in `R`.
- Read *Mastering Shiny*, Chapter 1 "Your first Shiny app", [@wickham2021mastering]
  - This chapter provides a self-contained example of a Shiny app.

**Key concepts and skills**

- Visualization is one way to get a sense of our data and to communicate this to the reader. Plotting the observations in a dataset is important.
- We need to be comfortable with a variety of graph types, including: bar charts, scatterplots, line plots, and histograms. We can even consider a map to be a type of graph, especially after geocoding our data.
- We should also summarize data using tables. Typical use cases for this include showing part of a dataset, summary statistics, and regression results. 

**Software and packages**

- `babynames` [@citebabynames]
- Base R [@citeR]
- `carData` [@carData]
- `datasauRus` [@citedatasauRus]
- `ggmap` [@KahleWickham2013]
- `janitor` [@janitor]
- `knitr` [@citeknitr]
- `leaflet` [@ChengKarambelkarXie2017]
- `mapdeck` [@citemapdeck]
- `maps` [@citemaps]
- `mapproj` [@mapproj]
- `modelsummary` [@citemodelsummary]
- `opendatatoronto` [@citeSharla]
- `patchwork` [@citepatchwork]
- `shiny` [@citeshiny]
- `tidygeocoder` [@tidygeocoder]
- `tidyverse` [@tidyverse]
- `tinytable` [@tinytable]
- `troopdata` [@troopdata]
- `usethis` [@usethis]
- `WDI` [@WDI]

```{r}
#| message: false
#| warning: false

library(babynames)
library(carData)
library(datasauRus)
library(ggmap)
library(janitor)
library(knitr)
library(leaflet)
library(mapdeck)
library(maps)
library(mapproj)
library(modelsummary)
library(opendatatoronto)
library(patchwork)
library(tidygeocoder)
library(tidyverse)
library(tinytable)
library(troopdata)
library(shiny)
library(usethis)
library(WDI)
```


## Introduction

When telling stories with data, we would like the data to do much of the work of convincing our reader. The paper is the medium, and the data are the message. To that end, we want to show our reader the data that allowed us to come to our understanding of the story. We use graphs, tables, and maps to help achieve this. 

Try to show the observations that underpin our analysis. For instance, if your dataset consists of 2,500 responses to a survey, then at some point in the paper you should have one or more plots that contain each of the 2,500 observations, for every variable of interest. To do this we build graphs using `ggplot2`, which is part of the core `tidyverse` and so does not have to be installed or loaded separately. In this chapter we go through a variety of different options including bar charts, scatterplots, line plots, and histograms.

In contrast to the role of graphs, which is to show each observation, the role of tables is typically to show an extract of the dataset, or to convey summary statistics or regression results. We will build tables primarily using `knitr`. Later we will use `modelsummary` to build tables related to regression output.

Finally, we cover maps as a variant of graphs that are used to show a particular type of data. We will build static maps using `ggmap` after having obtained geocoded data using `tidygeocoder`.

## Graphs

> A world turning to a saner and richer civilization will be a world turning to charts.
> 
> @karsetn [p. 684]

Graphs\index{graphs} are a critical aspect of compelling data stories. They allow us to see both broad patterns and details [@elementsofgraphingdata, p. 5]. Graphs enable a familiarity with our data that is hard to get from any other method. Every variable of interest should be graphed.

The most important objective of a graph is to convey as much of the actual data, and its context, as possible. In a way, graphing is an information encoding process where we construct a deliberate representation to convey information to our audience. The audience must decode that representation. The success of our graph depends on how much information is lost in this process so the decoding is a critical aspect [@elementsofgraphingdata, p. 221]. This means that we must focus on creating effective graphs that are suitable for our specific audience.

To see why graphing the actual data is important\index{graphs!importance of}, after installing and loading `datasauRus` consider the `datasaurus_dozen` dataset.

```{r}
datasaurus_dozen
```

The dataset consists of values for "x" and "y", which should be plotted on the x-axis and y-axis, respectively. There are 13 different values in the variable "dataset" including: "dino", "star", "away", and "bullseye". We focus on those four and generate summary statistics for each (@tbl-datasaurussummarystats).

```{r}
#| label: tbl-datasaurussummarystats
#| tbl-cap: "Mean and standard deviation for four datasauRus datasets"

# Based on: https://juliasilge.com/blog/datasaurus-multiclass/
datasaurus_dozen |>
  filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
  summarise(across(c(x, y), list(mean = mean, sd = sd)),
            .by = dataset) |>
  tt() |> 
  style_tt(j = 2:5, align = "r") |> 
  format_tt(digits = 1, num_fmt = "decimal") |> 
  setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd"))
```

Notice that the summary statistics are similar (@tbl-datasaurussummarystats). Despite this, it turns out that the different datasets are actually very different beasts. This becomes clear when we plot the data (@fig-datasaurusgraph).

```{r}
#| eval: true
#| fig-cap: "Graph of four datasauRus datasets"
#| label: fig-datasaurusgraph
#| warning: false
#| echo: true

datasaurus_dozen |>
  filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_minimal() +
  facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
  labs(color = "Dataset")
```
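The similarity is not limited to means and standard deviations. As a minimal sketch, assuming `datasauRus` and `dplyr` are installed, we could also compute the correlation between "x" and "y" within each of the four datasets. The name `datasaurus_correlations` is only illustrative.

```{r}
#| message: false

library(datasauRus)
library(dplyr)

# Correlation between x and y within each of the four datasets
datasaurus_correlations <-
  datasaurus_dozen |>
  filter(dataset %in% c("dino", "star", "away", "bullseye")) |>
  summarise(correlation = cor(x, y), .by = dataset)

datasaurus_correlations
```

If the correlations are close to one another, then that is one more summary statistic that fails to distinguish datasets that look entirely different when plotted.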

We get a similar lesson---always plot your data---from "Anscombe's Quartet", created by the twentieth century statistician Frank Anscombe. The key takeaway is that it is important to plot the actual data and not rely solely on summary statistics.\index{graphs!not relying on summary statistics}

```{r}
head(anscombe)
```

::: {.content-visible when-format="pdf"}
Anscombe's Quartet consists of eleven observations for four different datasets, with x and y values for each observation. We need to manipulate this dataset with `pivot_longer()` to get it into the "tidy" format discussed in the ["R Essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html).
:::

::: {.content-visible unless-format="pdf"}
Anscombe's Quartet consists of eleven observations for four different datasets, with x and y values for each observation. We need to manipulate this dataset with `pivot_longer()` to get it into the "tidy" format discussed in [Online Appendix -@sec-r-essentials]. 
:::


```{r}
# From: https://www.njtierney.com/post/2020/06/01/tidy-anscombe/
# And the pivot_longer() vignette.

tidy_anscombe <-
  anscombe |>
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )
```
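To see what `names_pattern = "(.)(.)"` is doing, a toy example may help. This is a sketch with made-up values, and `toy_wide` and `toy_long` are illustrative names. Each column name is split into two characters: the first becomes the name of a value column, because of the special ".value" entry, and the second is stored in "set".

```{r}
library(tibble)
library(tidyr)

# Two made-up observations for two sets, in the same wide layout as anscombe
toy_wide <-
  tibble(
    x1 = c(10, 8),
    x2 = c(10, 8),
    y1 = c(8.0, 7.0),
    y2 = c(9.1, 8.1)
  )

toy_long <-
  toy_wide |>
  pivot_longer(
    everything(),
    # First character (x or y) names the value column; second goes into "set"
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )

toy_long
```

The result has one row for each observation in each set, with the columns "set", "x", and "y".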

We can first create summary statistics (@tbl-anscombesummarystats) and then plot the data (@fig-anscombegraph). This again illustrates the importance of graphing the actual data, rather than relying on summary statistics.

```{r}
#| label: tbl-anscombesummarystats
#| message: false
#| tbl-cap: "Mean and standard deviation for Anscombe's quartet"

tidy_anscombe |>
  summarise(
    across(c(x, y), list(mean = mean, sd = sd)),
    .by = set
    ) |>
  tt() |> 
  style_tt(j = 2:5, align = "r") |> 
  format_tt(digits = 1, num_fmt = "decimal") |> 
  setNames(c("Dataset", "x mean", "x sd", "y mean", "y sd"))
```


```{r}
#| eval: true
#| fig-cap: "Recreation of Anscombe's Quartet"
#| label: fig-anscombegraph
#| warning: false
#| echo: true

tidy_anscombe |>
  ggplot(aes(x = x, y = y, colour = set)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme_minimal() +
  facet_wrap(vars(set), nrow = 2, ncol = 2) +
  labs(colour = "Dataset") +
  theme(legend.position = "bottom")
```

### Bar charts

We typically use a bar chart\index{graphs!bar chart} when we have a categorical variable that we want to focus on. We saw an example of this in @sec-fire-hose when we constructed a graph of the number of occupied beds. The geometric object---a "geom"---that we primarily use is `geom_bar()`, but there are many variants to cater for specific situations. To illustrate the use of bar charts, we use a dataset from the 1997-2001 British Election Panel Study that was put together by @fox2006effect and made available with `BEPS`, after installing and loading `carData`.\index{gender!British Election Panel Study}\index{British Election Panel Study}

```{r}
beps <- 
  BEPS |> 
  as_tibble() |> 
  clean_names() |> 
  select(age, vote, gender, political_knowledge)
```

The dataset consists of which party the respondent supports, along with various demographic, economic, and political variables. In particular, we have the age of the respondent. We begin by creating age-groups from the ages, and making a bar chart showing the frequency of each age-group using `geom_bar()` (@fig-bepfitst-1).

```{r}
beps <-
  beps |>
  mutate(
    age_group =
      case_when(
        age < 35 ~ "<35",
        age < 50 ~ "35-49",
        age < 65 ~ "50-64",
        age < 80 ~ "65-79",
        age < 100 ~ "80-99"
      ),
    age_group = 
      factor(age_group, levels = c("<35", "35-49", "50-64", "65-79", "80-99"))
  )
```
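The conditions in `case_when()` are evaluated in order, and each observation takes the value of the first condition that it satisfies. That is why `age < 50` can stand in for "at least 35 and under 50". The following sketch checks this on a few made-up ages that sit on and around the boundaries; `boundary_ages` is an illustrative name.

```{r}
#| message: false

library(dplyr)

# Made-up ages chosen to sit on and around the group boundaries
boundary_ages <- tibble(age = c(20, 34, 35, 49, 50, 64, 65, 79, 80))

boundary_ages <-
  boundary_ages |>
  mutate(
    age_group =
      case_when(
        age < 35 ~ "<35",
        age < 50 ~ "35-49", # Only reached when age is at least 35
        age < 65 ~ "50-64",
        age < 80 ~ "65-79",
        age < 100 ~ "80-99"
      )
  )

boundary_ages
```

Any age that satisfies none of the conditions, such as 100 or more, would be given `NA`, so it is worth checking for missing values after creating groups in this way.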

```{r}
#| label: fig-bepfitst
#| eval: true
#| fig-cap: "Distribution of age-groups in the 1997-2001 British Election Panel Study"
#| echo: true
#| fig-subcap: ["Using `geom_bar()`", "Using `count()` and `geom_col()`"]
#| layout-ncol: 2

beps |>
  ggplot(mapping = aes(x = age_group)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age group", y = "Number of observations")

beps |> 
  count(age_group) |> 
  ggplot(mapping = aes(x = age_group, y = n)) +
  geom_col() +
  theme_minimal() +
  labs(x = "Age group", y = "Number of observations")
```

The default axis label used by `ggplot2` is the name of the relevant variable, so it is often useful to add more detail. We do this using `labs()` by specifying an aesthetic, such as `x` or `y`, and the label to use for it. In the case of @fig-bepfitst-1 we have specified labels for the x-axis and y-axis.

By default, `geom_bar()` creates a count of the number of times each age-group appears in the dataset. It does this because the default statistical transformation---a "stat"---for `geom_bar()` is "count", which saves us from having to create that statistic ourselves. But if we had already constructed a count (for instance, with `beps |> count(age_group)`), then we could specify a variable for the y-axis and then use `geom_col()` (@fig-bepfitst-2).
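One way to convince ourselves that the "count" stat and `count()` agree is to compare them directly. The sketch below uses made-up data and `layer_data()` from `ggplot2`, which returns the data that a layer will draw after its stat has been applied. The object names are illustrative.

```{r}
#| message: false

library(dplyr)
library(ggplot2)

# Made-up categorical data
toy_groups <- tibble(group = c("a", "a", "b", "c", "c", "c"))

bar_plot <-
  toy_groups |>
  ggplot(aes(x = group)) +
  geom_bar()

# Counts computed by the "count" stat when the plot is built
stat_counts <- layer_data(bar_plot)$count

# Counts that we compute ourselves
manual_counts <- count(toy_groups, group)$n

stat_counts
manual_counts
```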

We may also like to consider various groupings of the data to get a different insight. For instance, we can use color to look at which party the respondent supports, by age-group (@fig-bepsecond-1).

```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-bepsecond
#| fig-subcap: ["Using `geom_bar()`", "Using `geom_bar()` with dodge2"]
#| layout-ncol: 2

beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar() +
  labs(x = "Age group", y = "Number of observations", fill = "Vote") +
  theme(legend.position = "bottom")

beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar(position = "dodge2") +
  labs(x = "Age group", y = "Number of observations", fill = "Vote") +
  theme(legend.position = "bottom")
```

By default, these different groups are stacked, but they can be placed side by side with `position = "dodge2"` (@fig-bepsecond-2). (Using "dodge2" rather than "dodge" adds a little space between the bars.)

#### Themes

At this point, we may like to address the general look of the graph. There are various themes that are built into `ggplot2`. These include: `theme_bw()`, `theme_classic()`, `theme_dark()`, and `theme_minimal()`. A full list is available in the `ggplot2` [cheat sheet](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf). We can use these themes by adding them as a layer (@fig-bepthemes). We could also install more themes from other packages, including `ggthemes` [@ggthemes], and `hrbrthemes` [@hrbrthemes]. We could even build our own!

```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-groups, and vote preference, in the 1997-2001 British Election Panel Study, illustrating different themes and the use of `patchwork`"
#| label: fig-bepthemes
#| warning: false

# Name the plots differently from the theme functions to avoid confusion
plot_bw <-
  beps |>
  ggplot(mapping = aes(x = age_group)) +
  geom_bar(position = "dodge") +
  theme_bw()

plot_classic <-
  beps |>
  ggplot(mapping = aes(x = age_group)) +
  geom_bar(position = "dodge") +
  theme_classic()

plot_dark <-
  beps |>
  ggplot(mapping = aes(x = age_group)) +
  geom_bar(position = "dodge") +
  theme_dark()

plot_minimal <-
  beps |>
  ggplot(mapping = aes(x = age_group)) +
  geom_bar(position = "dodge") +
  theme_minimal()

(plot_bw + plot_classic) / (plot_dark + plot_minimal)
```

In @fig-bepthemes we use `patchwork` to bring together multiple graphs. To do this, after installing and loading the package, we assign each graph to a variable. We then use "+" to signal which should be next to each other, "/" to signal which should be on top, and use brackets to indicate precedence.
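Building our own theme can be as simple as starting from a built-in theme and changing a few elements. The following is a sketch rather than a recommended design: `theme_story()` is a hypothetical helper, and the particular adjustments are arbitrary choices. It uses the `mpg` dataset that comes with `ggplot2` so that it is self-contained.

```{r}
library(ggplot2)

# Hypothetical helper: start from theme_minimal() and adjust a few elements
theme_story <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      legend.position = "bottom",
      panel.grid.minor = element_blank(),
      plot.title = element_text(face = "bold")
    )
}

custom_theme_plot <-
  ggplot(mpg, aes(x = class)) +
  geom_bar() +
  labs(x = "Class of car", y = "Number of observations") +
  theme_story()

custom_theme_plot
```

Wrapping the adjustments in a function means that the same look can be added to every graph in a paper with one layer.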

#### Facets

We use facets\index{graphs!facets} to show variation, based on one or more variables [@grammarofgraphics, p. 219]. Facets are especially useful when we have already used color to highlight variation in some other variable. For instance, we may be interested to explain vote, by age and gender (@fig-facets). We rotate the x-axis with `guides(x = guide_axis(angle = 90))` to avoid overlapping. We also change the position of the legend with `theme(legend.position = "bottom")`.

```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group by gender, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-facets
#| warning: false

beps |>
  ggplot(mapping = aes(x = age_group, fill = gender)) +
  geom_bar() +
  theme_minimal() +
  labs(
    x = "Age-group of respondent",
    y = "Number of respondents",
    fill = "Gender"
  ) +
  facet_wrap(vars(vote)) +
  guides(x = guide_axis(angle = 90)) +
  theme(legend.position = "bottom")
```

We could change `facet_wrap()` to wrap vertically instead of horizontally with `dir = "v"`. Alternatively, we could specify the number of rows, say `nrow = 2`, or the number of columns, say `ncol = 2`.
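As a brief sketch of these layout options, the following uses the `mpg` dataset that comes with `ggplot2`, faceting by the drive-train variable `drv`. The object names are illustrative.

```{r}
library(ggplot2)

base_plot <-
  ggplot(mpg, aes(x = class)) +
  geom_bar() +
  theme_minimal() +
  guides(x = guide_axis(angle = 90))

# Lay the panels out vertically rather than horizontally
facets_vertical <- base_plot + facet_wrap(vars(drv), dir = "v")

# Ask for a layout with two columns
facets_two_columns <- base_plot + facet_wrap(vars(drv), ncol = 2)

facets_vertical
facets_two_columns
```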

By default, all facets will have the same x-axis and y-axis. We could allow each facet to have its own scales with `scales = "free"`, or free just the x-axis with `scales = "free_x"`, or just the y-axis with `scales = "free_y"` (@fig-facetsfancy).

```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group by gender, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-facetsfancy
#| warning: false

beps |>
  ggplot(mapping = aes(x = age_group, fill = gender)) +
  geom_bar() +
  theme_minimal() +
  labs(
    x = "Age-group of respondent",
    y = "Number of respondents",
    fill = "Gender"
  ) +
  facet_wrap(vars(vote), scales = "free") +
  guides(x = guide_axis(angle = 90)) +
  theme(legend.position = "bottom")
```

Finally, we can change the labels of the facets using `labeller()` (@fig-facetsfancylabels). 

```{r}
#| echo: true
#| eval: true
#| fig-cap: "Distribution of age-group by political knowledge, and vote preference, in the 1997-2001 British Election Panel Study"
#| label: fig-facetsfancylabels
#| warning: false

new_labels <- 
  c("0" = "No knowledge", "1" = "Low knowledge",
    "2" = "Moderate knowledge", "3" = "High knowledge")

beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(
    x = "Age-group of respondent",
    y = "Number of respondents",
    fill = "Voted for"
  ) +
  facet_wrap(
    vars(political_knowledge),
    scales = "free",
    labeller = labeller(political_knowledge = new_labels)
  ) +
  guides(x = guide_axis(angle = 90)) +
  theme(legend.position = "bottom")
```

We now have three ways to combine multiple graphs: sub-figures, facets, and `patchwork`. They are useful in different circumstances: 

- sub-figures---which we covered in @sec-reproducible-workflows---for when we are considering different variables;
- facets for when we are considering a categorical variable; and 
- `patchwork` for when we are interested in bringing together entirely different graphs.

#### Colors

We now turn to the colors\index{graphs!color} used in the graph. There are a variety of different ways to change the colors. The many palettes available from `RColorBrewer` [@RColorBrewer] can be specified using `scale_fill_brewer()`. In the case of `viridis` [@viridis] we can specify the palettes using `scale_fill_viridis_d()`. Additionally, `viridis` is particularly focused on palettes that remain readable for color-blind readers (@fig-usecolor). Neither `RColorBrewer` nor `viridis` needs to be explicitly installed or loaded because `ggplot2`, which is part of the `tidyverse`, takes care of that for us.

::: callout-note
## Shoulders of giants

The name of the "brewer" palette refers to Cindy Brewer\index{Brewer, Cindy} [@brewerisarealperson]. After earning a PhD in Geography from Michigan State University in 1991, she joined San Diego State University as an assistant professor, moving to Pennsylvania State University in 1994, where she was promoted to full professor in 2007. One of her best-known books is *Designing Better Maps: A Guide for GIS Users* [@brewerbook]. In 2019 she became only the ninth person to have been awarded the O. M. Miller Cartographic Medal since it was established in 1968.\index{O. M. Miller Cartographic Medal}
:::

```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| fig-cap: "Distribution of age-group and vote preference, in the 1997-2001 British Election Panel Study, illustrating different colors"
#| label: fig-usecolor
#| fig-subcap: ["Brewer palette 'Blues'", "Brewer palette 'Set1'", "Viridis palette default", "Viridis palette 'magma'"]
#| layout-ncol: 2

# Panel (a)
beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age-group", y = "Number", fill = "Voted for") +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Blues")

# Panel (b)
beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age-group", y = "Number", fill = "Voted for") +
  theme(legend.position = "bottom") +
  scale_fill_brewer(palette = "Set1")

# Panel (c)
beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age-group", y = "Number", fill = "Voted for") +
  theme(legend.position = "bottom") +
  scale_fill_viridis_d()

# Panel (d)
beps |>
  ggplot(mapping = aes(x = age_group, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age-group", y = "Number", fill = "Voted for") +
  theme(legend.position = "bottom") +
  scale_fill_viridis_d(option = "magma")
```

In addition to using pre-built palettes, we could build our own palette. That said, color is something to be considered with care. It should be used to increase the amount of information that is communicated [@elementsofgraphingdata]. Color should not be added to graphs unnecessarily---that is to say, it should play some role. Typically, that role is to distinguish different groups, which implies making the colors dissimilar. Color may also be appropriate if there is some relationship between the color and the variable.
For instance, if making a graph of the price of mangoes and raspberries, then it could help the reader decode the information if the colors were yellow and red, respectively [@franconeri2021science, p. 121].
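A minimal sketch of building our own palette, using the mangoes and raspberries example, follows. The prices are made up purely for illustration, and `fruit_prices` and `fruit_palette` are illustrative names. We pass a named vector of colors to `scale_fill_manual()` so that each level of the variable is tied to a specific color.

```{r}
library(ggplot2)
library(tibble)

# Made-up prices, purely for illustration
fruit_prices <-
  tibble(
    fruit = c("Mangoes", "Raspberries"),
    price = c(3, 5)
  )

# A named vector ties each fruit to a color that matches it
fruit_palette <- c("Mangoes" = "gold", "Raspberries" = "firebrick")

fruit_plot <-
  fruit_prices |>
  ggplot(aes(x = fruit, y = price, fill = fruit)) +
  geom_col() +
  theme_minimal() +
  labs(x = "Fruit", y = "Price (made-up units)", fill = "Fruit") +
  scale_fill_manual(values = fruit_palette)

fruit_plot
```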


### Scatterplots

We are often interested in the relationship between two numeric or continuous variables. We can use scatterplots\index{graphs!scatterplot} to show this. A scatterplot may not always be the best choice, but it is rarely a bad one [@weissgerber2015beyond]. Some consider it the most versatile and useful graph option [@historyofdataviz, p. 121]. To illustrate scatterplots, we install and load `WDI` and then use that to download some economic indicators from the World Bank\index{World Bank!economic data}. In particular, we use `WDIsearch()` to find the unique key that we need to pass to `WDI()` to facilitate the download.

:::{.callout-note}
## Oh, you think we have good data on that!

According to @EssentialMacroAggregates [p. 15], Gross Domestic Product (GDP) "combines in a single figure, and with no double counting, all the output (or production) carried out by all the firms, non-profit institutions, government bodies and households in a given country during a given period, regardless of the type of goods and services produced, provided that the production takes place within the country's economic territory." \index{Gross Domestic Product (GDP)} The modern concept was developed by the twentieth century economist Simon Kuznets and is widely used and reported. There is a certain comfort in having a definitive and concrete single number to describe something as complicated as the economic activity of a country. It is useful and informative that we have such summary statistics. But as with any summary statistic, its strength is also its weakness. A single number necessarily loses information about constituent components, and disaggregated differences can be important [@Moyer2020Measuring]. It highlights short-term economic progress over longer-term improvements. And "the quantitative definiteness of the estimates makes it easy to forget their dependence upon imperfect data and the consequently wide margins of possible error to which both totals and components are liable" [@NationalIncomeAndItsComposition, p. xxvi]. Summary measures of economic performance show only one side of a country's economy. While there are many strengths, there are also well-known areas where GDP is weak.
:::

```{r}
#| echo: true
#| eval: false

WDIsearch("gdp growth")
WDIsearch("inflation")
WDIsearch("population, total")
WDIsearch("Unemployment, total")
```

```{r}
#| echo: true
#| eval: false

world_bank_data <-
  WDI(
    indicator =
      c("FP.CPI.TOTL.ZG", "NY.GDP.MKTP.KD.ZG", "SP.POP.TOTL", "SL.UEM.TOTL.NE.ZS"),
    country = c("AU", "ET", "IN", "US")
  )
```

```{r}
#| echo: false
#| eval: false

# INTERNAL
write_csv(world_bank_data, "inputs/data/world_bank_data.csv")
```

```{r}
#| eval: true
#| warning: false
#| echo: false

# INTERNAL

world_bank_data <-
  read_csv(
    "inputs/data/world_bank_data.csv",
    show_col_types = FALSE
  )
```

We may like to change the variable names to be more meaningful, and only keep those that we need.

```{r}
#| echo: true
#| eval: true

world_bank_data <-
  world_bank_data |>
  rename(
    inflation = FP.CPI.TOTL.ZG,
    gdp_growth = NY.GDP.MKTP.KD.ZG,
    population = SP.POP.TOTL,
    unem_rate = SL.UEM.TOTL.NE.ZS
  ) |>
  select(country, year, inflation, gdp_growth, population, unem_rate)

head(world_bank_data)
```

To get started we can use `geom_point()` to make a scatterplot showing GDP growth and inflation, by country (@fig-scattorplot-1).

```{r}
#| warning: false
#| label: fig-scattorplot
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| fig-subcap: ["Default settings", "With the addition of a theme and labels"]
#| layout-ncol: 2

# Panel (a)
world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point()

# Panel (b)
world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country")
```

As with bar charts, we can change the theme, and update the labels (@fig-scattorplot-2). 

For scatterplots we use "color", rather than the "fill" that we used for bar charts, because scatterplots use dots rather than bars. This also slightly affects how we change the palette (@fig-scatterplotnicercolor). That said, with particular types of dots, for instance `shape = 21`, it is possible to have both `fill` and `color` aesthetics.

```{r}
#| echo: true
#| eval: true
#| message: false
#| warning: false
#| label: fig-scatterplotnicercolor
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| fig-subcap: ["Brewer palette 'Blues'", "Brewer palette 'Set1'", "Viridis palette default", "Viridis palette 'magma'"]
#| layout-ncol: 2

# Panel (a)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country") +
  theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Blues")

# Panel (b)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "GDP growth",  y = "Inflation", color = "Country") +
  theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Set1")

# Panel (c)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "GDP growth",  y = "Inflation", color = "Country") +
  theme(legend.position = "bottom") +
  scale_colour_viridis_d()

# Panel (d)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point() +
  theme_minimal() +
  labs(x = "GDP growth",  y = "Inflation", color = "Country") +
  theme(legend.position = "bottom") +
  scale_colour_viridis_d(option = "magma")
```
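To illustrate the point about `shape = 21`, the following sketch uses the `mtcars` dataset that comes with R, rather than the World Bank data, so that it is self-contained. The interior of each dot is controlled by "fill", which we map to the number of cylinders, while the outline is controlled by "color", which we set to black. The name `filled_points` is illustrative.

```{r}
library(ggplot2)

filled_points <-
  ggplot(mtcars, aes(x = wt, y = mpg, fill = factor(cyl))) +
  # shape = 21 is a circle with a separate outline (color) and interior (fill)
  geom_point(shape = 21, color = "black", size = 3) +
  theme_minimal() +
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon", fill = "Cylinders") +
  scale_fill_brewer(palette = "Set1")

filled_points
```

Because the grouping is now carried by "fill", we change the palette with `scale_fill_brewer()` rather than `scale_color_brewer()`.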

The points of a scatterplot sometimes overlap. We can address this situation in a variety of ways (@fig-alphajitter): 

1) Adding a degree of transparency\index{graphs!transparency} to our dots with "alpha" (@fig-alphajitter-1).\index{graphs!alpha} The value for "alpha" can vary between 0, which is fully transparent, and 1, which is completely opaque. 
2) Adding a small amount of noise, which slightly moves the points, using `geom_jitter()` (@fig-alphajitter-2).\index{graphs!jitter} By default, the movement is uniform in both directions, but we can control how much movement occurs in each direction with "width" and "height". The decision between these two options turns on the degree to which accuracy matters, and the number of points: it is often useful to use `geom_jitter()` when you want to highlight the relative density of points and not necessarily the exact value of individual points. When using `geom_jitter()` it is a good idea to set a seed, as introduced in @sec-fire-hose, for reproducibility.

```{r}
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| label: fig-alphajitter
#| warning: false
#| fig-subcap: ["Changing the alpha setting", "Using jitter"]
#| layout-ncol: 2

set.seed(853)

# Panel (a)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country")

# Panel (b)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter(width = 1, height = 1) +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country")
```
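If the exact value of one variable matters, we could restrict the jitter to the other direction. The sketch below uses made-up data in which many points share the same position. Setting `height = 0` leaves the y values untouched, while `width = 0.2` moves each point horizontally by at most 0.2 in either direction. The object names are illustrative.

```{r}
library(ggplot2)
library(tibble)

set.seed(853)

# Made-up data where points sit on a small grid
grid_points <-
  tibble(
    x = rep(1:3, each = 5),
    y = rep(1:5, times = 3)
  )

horizontal_jitter <-
  grid_points |>
  ggplot(aes(x = x, y = y)) +
  # Jitter horizontally only, so that the y values remain exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  theme_minimal() +
  labs(x = "x (jittered)", y = "y (exact)")

horizontal_jitter
```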

We often use scatterplots to illustrate a relationship between two continuous variables.\index{graphs!continuous variables} It can be useful to add a "summary" line using `geom_smooth()` (@fig-scattorplottwo).\index{graphs!best fit} We can specify the relationship using "method", change the color with "color", and add or remove standard errors with "se". 
A commonly used "method" is `lm`, which computes and plots a simple linear regression line similar to using the `lm()` function. Using `geom_smooth()` adds a layer to the graph, and so it inherits aesthetics from `ggplot()`. For instance, that is why we have one line for each country in @fig-scattorplottwo-1 and @fig-scattorplottwo-2. We could overwrite that by specifying a particular color (@fig-scattorplottwo-3). There are situations where other types of fitted lines such as splines might be preferred.

```{r}
#| message: false
#| warning: false
#| fig-cap: "Relationship between inflation and GDP growth for Australia, Ethiopia, India, and the United States"
#| label: fig-scattorplottwo
#| fig-subcap: ["Default line of best fit", "Specifying a linear relationship", "Specifying only one color"]
#| layout-ncol: 2

# Panel (a)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  geom_smooth() +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country")

# Panel (b)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  geom_smooth(method = lm, se = FALSE) +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country")

# Panel (c)
world_bank_data |>
  ggplot(aes(x = gdp_growth, y = inflation, color = country)) +
  geom_jitter() +
  geom_smooth(method = lm, color = "black", se = FALSE) +
  theme_minimal() +
  labs(x = "GDP growth", y = "Inflation", color = "Country")
```
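To check that the line drawn with `method = lm` corresponds to what `lm()` would give, we could compare the two directly. This sketch uses the `mtcars` dataset that comes with R so that it is self-contained, and the object names are illustrative. We use `layer_data()` to extract the points that `geom_smooth()` draws, and then compare them with predictions from a model that we fit ourselves.

```{r}
library(ggplot2)

# Fit the regression directly
direct_fit <- lm(mpg ~ wt, data = mtcars)

smooth_plot <-
  ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
  theme_minimal()

# The second layer holds the points that geom_smooth() draws
smooth_line <- layer_data(smooth_plot, 2)

# Predictions from lm() at the same x values
direct_predictions <-
  predict(direct_fit, newdata = data.frame(wt = smooth_line$x))

all.equal(unname(direct_predictions), smooth_line$y)
```

The comparison should return `TRUE`, because both approaches fit the same simple linear regression to the same data.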


### Line plots

We can use a line plot\index{graphs!line plot} when we have variables that should be joined together, for instance, an economic time series. We will continue with the dataset from the World Bank\index{World Bank} and focus on GDP\index{Gross Domestic Product!United States}\index{United States!Gross Domestic Product} growth in the United States using `geom_line()` (@fig-lineplot-1). The source of the data can be added to the graph using "caption" within `labs()`.

```{r}
#| fig-cap: "United States GDP growth (1961-2020)"
#| label: fig-lineplot
#| warning: false
#| layout-ncol: 2
#| fig-subcap: ["Using a line plot", "Using a stairstep line plot"]

# Panel (a)
world_bank_data |>
  filter(country == "United States") |>
  ggplot(mapping = aes(x = year, y = gdp_growth)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Year", y = "GDP growth", caption = "Data source: World Bank.")

# Panel (b)
world_bank_data |>
  filter(country == "United States") |>
  ggplot(mapping = aes(x = year, y = gdp_growth)) +
  geom_step() +
  theme_minimal() +
  labs(x = "Year", y = "GDP growth", caption = "Data source: World Bank.")
```

We can use `geom_step()`, a slight variant of `geom_line()`, to focus attention on the change from year to year (@fig-lineplot-2).

The Phillips curve\index{Phillips curve} is the name given to a plot of the relationship between unemployment and inflation over time. An inverse relationship is sometimes found in the data, for instance in the United Kingdom between 1861 and 1957 [@phillips1958relation]. We have a variety of ways to investigate this relationship in our data, including:

::: {.content-visible when-format="pdf"}
1) Adding a second line to our graph. For instance, we could add inflation (@fig-notphillips-1). This requires us to use `pivot_longer()`, which is discussed in the ["R Essentials" Online Appendix](https://tellingstorieswithdata.com/20-r_essentials.html), to ensure that the data are in a tidy format.
2) Using `geom_path()` to link values in the order they appear in the dataset. In @fig-notphillips-2 we show a Phillips curve for the United States between 1960 and 2020. @fig-notphillips-2 does not appear to show any clear relationship between unemployment and inflation.
:::

::: {.content-visible unless-format="pdf"}
1) Adding a second line to our graph. For instance, we could add inflation (@fig-notphillips-1). This requires us to use `pivot_longer()`, which is discussed in [Online Appendix -@sec-r-essentials], to ensure that the data are in a tidy format.
2) Using `geom_path()` to link values in the order they appear in the dataset. In @fig-notphillips-2 we show a Phillips curve for the United States between 1960 and 2020. @fig-notphillips-2 does not appear to show any clear relationship between unemployment and inflation.
:::

```{r}
#| fig-cap: "Unemployment and inflation for the United States (1960-2020)"
#| label: fig-notphillips
#| layout-ncol: 2
#| fig-subcap: ["Comparing the two time series over time", "Plotting the two time series against each other"]
#| warning: false

world_bank_data |>
  filter(country == "United States") |>
  select(-population, -gdp_growth) |>
  pivot_longer(
    cols = c("inflation", "unem_rate"),
    names_to = "series",
    values_to = "value"
  ) |>
  ggplot(mapping = aes(x = year, y = value, color = series)) +
  geom_line() +
  theme_minimal() +
  labs(
    x = "Year", y = "Value", color = "Economic indicator",
    caption = "Data source: World Bank."
  ) +
  scale_color_brewer(palette = "Set1", labels = c("Inflation", "Unemployment")) +
  theme(legend.position = "bottom")

world_bank_data |>
  filter(country == "United States") |>
  ggplot(mapping = aes(x = unem_rate, y = inflation)) +
  geom_path() +
  theme_minimal() +
  labs(
    x = "Unemployment rate", y = "Inflation",
    caption = "Data source: World Bank."
  )
```
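
The reshaping step used for the first panel can be seen in isolation. This is a small sketch with made-up values (the numbers in `wide_data` are hypothetical and are not World Bank data): `pivot_longer()` moves the two indicator columns into one "series" column and one "value" column, which is what allows `color = series` to draw one line per indicator.

```{r}
#| eval: false

library(tibble)
library(tidyr)

# Hypothetical values, for illustration only
wide_data <-
  tibble(
    year = c(2000, 2001, 2002),
    inflation = c(3.4, 2.8, 1.6),
    unem_rate = c(4.0, 4.7, 5.8)
  )

# One row per year and series, rather than one row per year
long_data <-
  wide_data |>
  pivot_longer(
    cols = c("inflation", "unem_rate"),
    names_to = "series",
    values_to = "value"
  )

long_data
```

Three rows with two indicator columns become six rows, one for each combination of year and indicator.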

### Histograms

A histogram\index{graphs!histogram} is useful to show the shape of the distribution of a continuous variable. The full range of the data values is split into intervals called "bins" and the histogram counts how many observations fall into each bin. In @fig-hisogramone we examine the distribution of GDP growth in Ethiopia.

```{r}
#| fig-cap: "Distribution of GDP growth in Ethiopia (1960-2020)"
#| label: fig-hisogramone
#| message: false
#| warning: false

world_bank_data |>
  filter(country == "Ethiopia") |>
  ggplot(aes(x = gdp_growth)) +
  geom_histogram() +
  theme_minimal() +
  labs(
    x = "GDP growth",
    y = "Number of occurrences",
    caption = "Data source: World Bank."
  )
```

The key component that determines the shape of a histogram is the number of bins. This can be specified in one of two ways (@fig-hisogrambins): 

1) specifying the number of "bins" to include; or 
2) specifying their "binwidth".

```{r}
#| message: false
#| warning: false
#| fig-cap: "Distribution of GDP growth in Ethiopia (1960-2020)"
#| label: fig-hisogrambins
#| fig-subcap: ["Five bins", "20 bins", "Binwidth of two", "Binwidth of five"]
#| layout-ncol: 2

# Panel (a)
world_bank_data |>
  filter(country == "Ethiopia") |>
  ggplot(aes(x = gdp_growth)) +
  geom_histogram(bins = 5) +
  theme_minimal() +
  labs(
    x = "GDP growth",
    y = "Number of occurrences"
  )

# Panel (b)
world_bank_data |>
  filter(country == "Ethiopia") |>
  ggplot(aes(x = gdp_growth)) +
  geom_histogram(bins = 20) +
  theme_minimal() +
  labs(
    x = "GDP growth",
    y = "Number of occurrences"
  )

# Panel (c)
world_bank_data |>
  filter(country == "Ethiopia") |>
  ggplot(aes(x = gdp_growth)) +
  geom_histogram(binwidth = 2) +
  theme_minimal() +
  labs(
    x = "GDP growth",
    y = "Number of occurrences"
  )

# Panel (d)
world_bank_data |>
  filter(country == "Ethiopia") |>
  ggplot(aes(x = gdp_growth)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  labs(
    x = "GDP growth",
    y = "Number of occurrences"
  )
```

Histograms\index{graphs!histograms} can be thought of as locally averaging data, and the number of bins affects how much of this occurs. When there are only two bins then there is considerable smoothing, but we lose a lot of accuracy. Too few bins result in more bias, while too many bins result in more variance [@wasserman, p. 303]. Our decision as to the number of bins, or their width, is concerned with trying to balance bias and variance. This will depend on a variety of concerns including the subject matter and the goal [@elementsofgraphingdata, p. 135]. This is one of the reasons that @Denby2009 consider histograms to be especially valuable as exploratory tools.
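
To see what the binning step does, we can count observations per bin by hand. This is a sketch on simulated data, and `count_by_bin()` is a hypothetical helper written for this illustration. Its bin edges start at a multiple of the bin width, which is not necessarily how `geom_histogram()` aligns its bins, so the counts may differ slightly from those in a plot.

```{r}
#| eval: false

set.seed(853)

# Simulated growth rates (an assumption for illustration)
simulated_growth <- rnorm(n = 60, mean = 4, sd = 6)

# Hypothetical helper: count observations in bins of a given width
count_by_bin <- function(x, binwidth) {
  # Bin edges that cover the full range of the data
  breaks <- seq(
    from = floor(min(x) / binwidth) * binwidth,
    to = ceiling(max(x) / binwidth) * binwidth,
    by = binwidth
  )
  table(cut(x, breaks = breaks, include.lowest = TRUE))
}

narrow_bins <- count_by_bin(simulated_growth, binwidth = 2)
wide_bins <- count_by_bin(simulated_growth, binwidth = 5)

narrow_bins
wide_bins
```

The same 60 observations are spread over many small counts with a width of two, and over a few large counts with a width of five, which is the smoothing trade-off described above.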

Finally, while we can use "fill" to distinguish between different types of observations, it can get quite messy. It is usually better to: 

1. trace the outline of the distribution with `geom_freqpoly()` (@fig-different-obs-1);
2. build a stack of dots with `geom_dotplot()` (@fig-different-obs-2); or
3. add transparency, especially if the differences are more stark (@fig-different-obs-3).

```{r}
#| fig-cap: "Distribution of GDP growth across various countries (1960-2020)"
#| label: fig-different-obs
#| message: false
#| warning: false
#| layout-ncol: 2
#| fig-subcap: ["Tracing the outline", "Using dots", "Adding transparency"]

# Panel (a)
world_bank_data |>
  ggplot(aes(x = gdp_growth, color = country)) +
  geom_freqpoly() +
  theme_minimal() +
  labs(
    x = "GDP growth", y = "Number of occurrences",
    color = "Country",
    caption = "Data source: World Bank."
  ) +
  scale_color_brewer(palette = "Set1")

# Panel (b)
world_bank_data |>
  ggplot(aes(x = gdp_growth, group = country, fill = country)) +
  geom_dotplot(method = "histodot") +
  theme_minimal() +
  labs(
    x = "GDP growth", y = "Number of occurrences",
    fill = "Country",
    caption = "Data source: World Bank."
  ) +
  scale_fill_brewer(palette = "Set1")

# Panel (c)
world_bank_data |>
  filter(country %in% c("India", "United States")) |>
  ggplot(mapping = aes(x = gdp_growth, fill = country)) +
  geom_histogram(alpha = 0.5, position = "identity") +
  theme_minimal() +
  labs(
    x = "GDP growth", y = "Number of occurrences",
    fill = "Country",
    caption = "Data source: World Bank."
  ) +
  scale_fill_brewer(palette = "Set1")
```

An interesting alternative to a histogram is the empirical cumulative distribution function (ECDF).\index{graphs!ECDF} The choice between this and a histogram tends to be audience-specific. It may not be appropriate for less-sophisticated audiences, but if the audience is quantitatively comfortable, then it can be a great choice because it does less smoothing than a histogram. We can build an ECDF with `stat_ecdf()`. For instance, @fig-ecdfismyfavohidonthavefavs shows the ECDF of GDP growth for each of the four countries, which is the counterpart to the distributions shown in @fig-different-obs.

```{r}
#| fig-cap: "Distribution of GDP growth in four countries (1960-2020)"
#| label: fig-ecdfismyfavohidonthavefavs
#| warning: false

world_bank_data |>
  ggplot(mapping = aes(x = gdp_growth, color = country)) +
  stat_ecdf(geom = "point") +
  theme_minimal() +
  labs(
    x = "GDP growth", y = "Proportion", color = "Country",
    caption = "Data source: World Bank."
  ) + 
  theme(legend.position = "bottom")
```
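
What an ECDF reports at any point is the proportion of observations at or below that value. A minimal sketch with base R's `ecdf()` on simulated data (the data and the cut-off of zero are assumptions for illustration) makes this explicit.

```{r}
#| eval: false

set.seed(853)

# Simulated growth rates (an assumption for illustration)
simulated_growth <- rnorm(n = 60, mean = 4, sd = 6)

# ecdf() returns a function that we can evaluate at any value
growth_ecdf <- ecdf(simulated_growth)

# Proportion of observations at or below zero, computed two ways
growth_ecdf(0)
mean(simulated_growth <= 0)
```

Both lines give the same number, and no bins had to be chosen, which is why the ECDF does less smoothing than a histogram.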

### Boxplots

A boxplot\index{graphs!boxplot} typically shows five aspects. The first three are the median, the 25th percentile, and the 75th percentile. The fourth and fifth elements differ depending on specifics. One option is the minimum and maximum values. Another option is to determine the difference between the 75th and 25th percentiles, which is the interquartile range (IQR). The fourth and fifth elements are then the most extreme observations within $1.5\times\mbox{IQR}$ of the 25th and 75th percentiles. That latter approach is used, by default, in `geom_boxplot()` from `ggplot2`. @chartingstatistics [p. 166] introduced the notion of a chart that focused on the range and various summary statistics including the median and the range, while @tukeyeda focused on which summary statistics to show and popularized the boxplot [@anotherhadleyreferencelol].
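
The five elements under the $1.5\times\mbox{IQR}$ approach can be computed by hand. The following is a sketch with a small made-up vector, `observations`, chosen so that one value falls beyond the whiskers; it uses the default settings of `quantile()`.

```{r}
#| eval: false

# Made-up observations, for illustration only
observations <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

lower_quartile <- quantile(observations, probs = 0.25, names = FALSE)
upper_quartile <- quantile(observations, probs = 0.75, names = FALSE)
iqr <- upper_quartile - lower_quartile

# Whiskers: the most extreme observations within 1.5 x IQR of the quartiles
lower_whisker <- min(observations[observations >= lower_quartile - 1.5 * iqr])
upper_whisker <- max(observations[observations <= upper_quartile + 1.5 * iqr])

# Observations beyond the whiskers are usually drawn as individual points
beyond_whiskers <-
  observations[observations < lower_whisker | observations > upper_whisker]

c(lower_whisker, lower_quartile, median(observations), upper_quartile, upper_whisker)
beyond_whiskers
```

Here the upper whisker stops at 9 rather than at the maximum, and the value of 30 would be shown as a separate point.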

One reason for using graphs is that they help us understand and embrace how complex our data are, rather than trying to hide and smooth it away [@armstrongembracecomplexity]. One appropriate use case for boxplots is to compare the summary statistics of many variables at once, such as in @Bethlehem2022. But boxplots alone are rarely the best choice because they hide the distribution of data, rather than show it. The same boxplot can apply to very different distributions. To see this, consider some simulated data from the beta distribution of two types.\index{simulation!beta distribution} The first contains draws from two beta distributions:\index{distribution!beta} one that is right skewed and another that is left skewed. The second contains draws from a beta distribution with no skew, noting that $\mbox{Beta}(1, 1)$ is equivalent to $\mbox{Uniform}(0, 1)$.

```{r}
set.seed(853)

number_of_draws <- 10000

both_left_and_right_skew <-
  c(
    rbeta(number_of_draws / 2, 5, 2),
    rbeta(number_of_draws / 2, 2, 5)
  )

no_skew <-
  rbeta(number_of_draws, 1, 1)

beta_distributions <-
  tibble(
    observation = c(both_left_and_right_skew, no_skew),
    source = c(
      rep("Left and right skew", number_of_draws),
      rep("No skew", number_of_draws)
    )
  )
```
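
The equivalence between $\mbox{Beta}(1, 1)$ and $\mbox{Uniform}(0, 1)$ can be checked numerically by comparing the two density functions. This is a minimal sketch, and the grid of points is an arbitrary choice.

```{r}
#| eval: false

# Points between zero and one at which to compare the densities
grid <- seq(from = 0, to = 1, by = 0.1)

beta_density <- dbeta(grid, shape1 = 1, shape2 = 1)
uniform_density <- dunif(grid, min = 0, max = 1)

all.equal(beta_density, uniform_density)
```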

We can first compare the boxplots of the two series (@fig-boxplotfirst-1). But if we plot the actual data then we can see how different they are (@fig-boxplotfirst-2).

```{r}
#| label: fig-boxplotfirst
#| message: false
#| warning: false
#| layout-ncol: 2
#| fig-cap: "Data drawn from beta distributions with different parameters"
#| fig-subcap: ["Illustrated with a boxplot","Actual data"]

beta_distributions |>
  ggplot(aes(x = source, y = observation)) +
  geom_boxplot() +
  theme_classic()

beta_distributions |>
  ggplot(aes(x = observation, color = source)) +
  geom_freqpoly(binwidth = 0.05) +
  theme_classic() +
  theme(legend.position = "bottom")
```

One way forward, if a boxplot is to be used, is to include the actual data as a layer on top of the boxplot.\index{graphs!boxplot} For instance, in @fig-bloxplotandoverlay we show the distribution of inflation across the four countries. The reason that this works well is that it shows the actual observations, as well as the summary statistics.

```{r}
#| fig-cap: "Distribution of inflation data for four countries (1960-2020)"
#| label: fig-bloxplotandoverlay
#| message: false
#| warning: false

world_bank_data |>
  ggplot(mapping = aes(x = country, y = inflation)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.3, width = 0.15, height = 0) +
  theme_minimal() +
  labs(
    x = "Country",
    y = "Inflation",
    caption = "Data source: World Bank."
  )
```


### Interactive graphs

`shiny` [@citeshiny] is a way of making interactive web applications using R. It is fun, but can be a little fiddly. Here we are going to step through one way to take advantage of `shiny`, which is to quickly add some interactivity to our graphs. This sounds like a small thing, but a great example of why it is so powerful is provided by @theeconomistforecasts where they show how their forecasts of the 2022 French Presidential Election changed over time.

We are going to make an interactive graph based on the "babynames" dataset from `babynames` [@citebabynames]. First, we will build a static version (@fig-babynames).

```{r}
#| fig-cap: "Popular baby names"
#| label: fig-babynames
#| message: false
#| warning: false

top_five_names_by_year <-
  babynames |>
  arrange(desc(n)) |>
  slice_head(n = 5, by = c(year, sex))

top_five_names_by_year |>
  ggplot(aes(x = n, fill = sex)) +
  geom_histogram(position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  labs(
    x = "Babies with that name",
    y = "Occurrences",
    fill = "Sex"
  )
```

One thing that we might be interested in is how the "bins" parameter shapes what we see. We might like to use interactivity to explore different values.

To get started, create a new `shiny` app ("File" -> "New File" -> "Shiny Web App"). Give it a name, such as "not_my_first_shiny" and then leave all the other options as the default. A new file "app.R" will open and we click "Run app" to see what it looks like.

Now replace the content in that file, "app.R", with the content below, and then again click "Run app".

```{r}
#| eval: false

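library(shiny)
library(tidyverse)
library(babynames)

# Prepare the data inside "app.R" so that the app is self-contained
top_five_names_by_year <-
  babynames |>
  arrange(desc(n)) |>
  slice_head(n = 5, by = c(year, sex))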

# Define UI for application that draws a histogram
ui <- fluidPage(
  # Application title
  titlePanel("Count of names for five most popular names each year."),

  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        inputId = "number_of_bins",
        label = "Number of bins:",
        min = 1,
        max = 50,
        value = 30
      )
    ),

    # Show a plot of the generated distribution
    mainPanel(plotOutput("distPlot"))
  )
)

# Define server logic required to draw a histogram
server <- function(input, output) {
  output$distPlot <- renderPlot({
    # Draw the histogram with the specified number of bins
    top_five_names_by_year |>
      ggplot(aes(x = n, fill = sex)) +
      geom_histogram(position = "dodge", bins = input$number_of_bins) +
      theme_minimal() +
      scale_fill_brewer(palette = "Set1") +
      labs(
        x = "Babies with that name",
        y = "Occurrences",
        fill = "Sex"
      )
  })
}

# Run the application
shinyApp(ui = ui, server = server)
```

We have just built an interactive graph where the number of bins can be changed. It should look like @fig-shinyone.

![Example of Shiny app where the user controls the number of bins](figures/22-shiny_one.png){#fig-shinyone width=90% fig-align="center"}


## Tables

Tables\index{tables} are an important part of telling a compelling story. Tables can communicate less information than a graph, but they do so at a high fidelity. They are especially useful to highlight a few specific values [@andersen2021presenting]. In this book, we primarily use tables in three ways:

1. To show an extract of the dataset.
2. To communicate summary statistics.
3. To display regression results.

### Showing part of a dataset

We illustrate showing part of a dataset using `tt()` from `tinytable`. We use the World Bank\index{World Bank} dataset that we downloaded earlier and focus on inflation, GDP growth, and population as unemployment data are not available for every year for every country.\index{tables!kable()}

```{r}
world_bank_data <- 
  world_bank_data |> 
  select(-unem_rate)
```

To begin, after installing and loading `tinytable`, we can display the first ten rows with the default `tt()` settings.

```{r}
world_bank_data |>
  slice(1:10) |>
  tt()
```

To be able to cross-reference a table in the text, we need to add a table caption and label to the R chunk as shown in @sec-quartocrossreferences of @sec-reproducible-workflows. We can also make the column names more informative with `setNames()` and specify the number of digits to be displayed (@tbl-gdpfirst).

```{r}
#| echo: fenced
#| label: tbl-gdpfirst
#| message: false
#| tbl-cap: "A dataset of economic indicators for four countries"

world_bank_data |>
  slice(1:10) |>
  tt() |> 
  style_tt(j = 2:5, align = "r") |> 
  format_tt(digits = 1, num_fmt = "decimal") |> 
  setNames(c("Country", "Year", "Inflation", "GDP growth", "Population"))
```

### Improving the formatting

We can specify the alignment of the columns using `style_tt()` and a string made up of "l" (left), "c" (center), and "r" (right), with one letter for each column (@tbl-gdpalign).\index{tables!alignment} We specify which columns this applies to by using `j` and specifying the column numbers. Additionally, we can change the formatting. For instance, we could specify groupings for numbers that are at least 1,000 using `num_mark_big = ","`.

```{r}
#| label: tbl-gdpalign
#| message: false
#| tbl-cap: "First ten rows of a dataset of economic indicators for Australia, Ethiopia, India, and the United States"

world_bank_data |>
  slice(1:10) |>
  mutate(year = as.factor(year)) |>
  tt() |> 
  style_tt(j = 1:5, align = "lccrr") |> 
  format_tt(digits = 1, num_mark_big = ",", num_fmt = "decimal") |> 
  setNames(c("Country", "Year", "Inflation", "GDP growth", "Population"))
```
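
What grouping large numbers with a comma does can be seen outside of any table package using base R's `format()`. This is only a sketch of the idea, with made-up population values; it is not how `tinytable` is implemented.

```{r}
#| eval: false

# Made-up population values, for illustration only
populations <- c(10276477, 22151000, 450547679)

# Insert a comma between each group of three digits
formatted_populations <- format(populations, big.mark = ",", trim = TRUE)

formatted_populations
```

Grouped digits are easier to compare at a glance, especially when the column is right-aligned so that the groups line up.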


### Communicating summary statistics

After installing and loading `modelsummary` we can use `datasummary_skim()` to create tables of summary statistics from our dataset.\index{tables!summary statistics} 

We can use this to get a table such as @tbl-testdatasummarynormal. That might be useful for exploratory data analysis, which we cover in @sec-exploratory-data-analysis. (Here we remove population to save space and do not include a histogram of each variable.)

```{r}
#| message: false
#| warning: false
#| label: tbl-testdatasummarynormal
#| tbl-cap: "Summary of economic indicator variables for four countries"

world_bank_data |>
  select(-population) |> 
  datasummary_skim(histogram = FALSE)
```

By default, `datasummary_skim()` summarizes the numeric variables, but we can ask for the categorical variables (@tbl-testdatasummary). Additionally, we can add cross-references in the same way as for `tt()`, that is, include a "tbl-cap" entry and then cross-reference the label of the R chunk.

```{r}
#| label: tbl-testdatasummary
#| tbl-cap: "Summary of categorical economic indicator variables for four countries"

world_bank_data |>
  datasummary_skim(type = "categorical")
```

We can create a table that shows the correlation\index{tables!correlation} between variables using `datasummary_correlation()` (@tbl-correlationtable).

```{r}
#| label: tbl-correlationtable
#| tbl-cap: "Correlation between the economic indicator variables for four countries (Australia, Ethiopia, India, and the United States)"

world_bank_data |>
  datasummary_correlation()
```
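
The numbers in a correlation table are pairwise correlations between numeric columns. As a rough sketch of the kind of calculation involved, we can use base R's `cor()` on simulated data; the `simulated_indicators` data frame is an assumption for illustration, and this is not a description of how `modelsummary` works internally.

```{r}
#| eval: false

set.seed(853)

# Simulated indicators with a negative relationship built in (an assumption)
simulated_indicators <- data.frame(inflation = rnorm(n = 50))
simulated_indicators$gdp_growth <-
  -0.5 * simulated_indicators$inflation + rnorm(n = 50)

# Pearson correlation for every pair of columns
correlations <- cor(simulated_indicators, use = "pairwise.complete.obs")

round(correlations, digits = 2)
```

The result is a symmetric matrix with ones on the diagonal, which is why correlation tables typically show only the lower triangle.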

We typically need a table of descriptive statistics\index{tables!descriptive statistics} that we could add to our paper (@tbl-descriptivestats). This contrasts with @tbl-testdatasummary which would likely not be included in the main section of a paper, and is more to help us understand the data. We can add a note about the source of the data using `notes`.

```{r}
#| label: tbl-descriptivestats
#| warning: false
#| tbl-cap: "Descriptive statistics for the inflation and GDP dataset"

datasummary_balance(
  formula = ~country,
  data = world_bank_data |> 
    filter(country %in% c("Australia", "Ethiopia")),
  dinm = FALSE,
  notes = "Data source: World Bank."
)
```


### Display regression results

We can report regression results\index{tables!regression results} using `modelsummary()` from `modelsummary`. For instance, we could display the estimates from a few different models (@tbl-twomodels).

```{r}
#| label: tbl-twomodels
#| tbl-cap: "Explaining GDP growth as a function of inflation"

first_model <- lm(
  formula = gdp_growth ~ inflation,
  data = world_bank_data
)

second_model <- lm(
  formula = gdp_growth ~ inflation + country,
  data = world_bank_data
)

third_model <- lm(
  formula = gdp_growth ~ inflation + country + population,
  data = world_bank_data
)

modelsummary(list(first_model, second_model, third_model))
```

The number of digits shown after the decimal point can be adjusted with "fmt" (@tbl-twomodelstwo). To help establish credibility you should generally not add as many significant digits as possible [@howes2022representing]. Instead, you should think carefully about the data-generating process and adjust based on that.

```{r}
#| label: tbl-twomodelstwo
#| tbl-cap: "Three models of GDP growth as a function of inflation"

modelsummary(
  list(first_model, second_model, third_model),
  fmt = 1
)
```
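
Rounding to a fixed number of decimal places is not the same as keeping a fixed number of significant digits. The following sketch illustrates the difference with base R, using the `mtcars` dataset that ships with R only so that the example is self-contained; `example_model` is not one of the models above.

```{r}
#| eval: false

# A small model on a built-in dataset, for illustration only
example_model <- lm(formula = mpg ~ wt, data = mtcars)

# Full precision, as stored
coef(example_model)

# One digit after the decimal point
one_decimal <- round(coef(example_model), digits = 1)

# Two significant digits
two_significant <- signif(coef(example_model), digits = 2)

one_decimal
two_significant
```

Which of these is more appropriate depends on how precisely the underlying quantities were measured, which is why it helps to think about the data-generating process.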

## Maps

In many ways maps can be thought of as another type of graph, where the x-axis is longitude, the y-axis is latitude, and there is some outline or background image.\index{tables!maps}\index{maps} It is possible that they are the oldest and best understood type of chart [@karsetn, p. 1]. We can generate a map in a straightforward manner. That said, it is not to be taken lightly; things quickly get complicated!

The first step is to get some data.\index{maps!data} There is some geographic data built into `ggplot2` that we can access with `map_data()`. There are additional variables in the `world.cities` dataset from `maps`. 

```{r}
#| message: false
#| warning: false

france <- map_data(map = "france")

head(france)

french_cities <-
  world.cities |>
  filter(country.etc == "France")

head(french_cities)
```

Using that information you can create a map of France\index{France!cities} that shows the larger cities (@fig-heyitsfrance).\index{maps!France} Use `geom_polygon()` from `ggplot2` to draw shapes by connecting points within groups. And `coord_map()` adjusts for the fact that we are making a 2D map to represent a world that is 3D.

```{r}
#| label: fig-heyitsfrance
#| fig-cap: "Map of France showing the largest cities"
#| message: false
#| warning: false

ggplot() +
  geom_polygon(
    data = france,
    aes(x = long, y = lat, group = group),
    fill = "white",
    colour = "grey"
  ) +
  coord_map() +
  geom_point(
    data = french_cities,
    aes(x = long, y = lat),
    alpha = 0.3,
    color = "black"
  ) +
  theme_minimal() +
  labs(x = "Longitude", y = "Latitude")
```
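
By default `coord_map()` uses a Mercator projection. To get a feel for what a projection does, the following sketch implements the standard spherical Mercator formula for the vertical coordinate, $y = \ln(\tan(\pi/4 + \phi/2))$, where $\phi$ is latitude in radians. The `mercator_y()` helper is hypothetical and written only for this illustration; scaling details may differ from what is used behind `coord_map()`.

```{r}
#| eval: false

# Hypothetical helper: vertical Mercator coordinate for a latitude in degrees
mercator_y <- function(latitude_degrees) {
  latitude_radians <- latitude_degrees * pi / 180
  log(tan(pi / 4 + latitude_radians / 2))
}

# Equal 15-degree steps in latitude
latitudes <- c(0, 15, 30, 45, 60)

# Distance on the projected map between successive latitudes
projected_steps <- diff(mercator_y(latitudes))

projected_steps
```

The steps get larger further from the equator, which is why areas at high latitudes look stretched on a Mercator map.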

As is often the case with R, there are many ways to get started creating static maps. We have seen how they can be built using only `ggplot2`, but `ggmap` brings additional functionality.

There are two essential components to a map: 

1) a border or background image (sometimes called a tile); and 
2) something of interest within that border, or on top of that tile. 

In `ggmap`, we use an open-source option for our tile, Stamen Maps.\index{maps!Stamen Maps} And we plot points based on latitude and longitude.


### Static maps

#### Australian polling places

In Australia,\index{Australia!elections} people have to go to "booths" in order to vote. Because the booths have coordinates (latitude and longitude), we can plot them. One reason we may like to do that is to notice spatial voting patterns.

To get started we need to get a tile.\index{maps!tiles} We are going to use `ggmap` to get a tile from Stamen Maps, which builds on [OpenStreetMap](https://www.openstreetmap.org). The main argument to the function that does this, `get_stadiamap()`, is a bounding box.\index{maps!bounding box} A bounding box is the coordinates of the edges of the area that you are interested in. This requires two latitudes and two longitudes.

It can be useful to use Google Maps,\index{maps!Google Maps} or another mapping platform, to find the coordinate values that you need. In this case we have provided it with coordinates such that it will be centered around Australia's capital Canberra.\index{Australia!Canberra}

```{r}
#| warning: false
#| message: false

bbox <- c(left = 148.95, bottom = -35.5, right = 149.3, top = -35.1)
```
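
A bounding box is just a named numeric vector, so we can use it directly to check whether a location falls inside the area of interest. This is a minimal sketch: `is_in_bbox()` is a hypothetical helper, and the coordinates in `places` are approximate values included only for illustration.

```{r}
#| eval: false

bbox <- c(left = 148.95, bottom = -35.5, right = 149.3, top = -35.1)

# Hypothetical helper: is each point inside the bounding box?
is_in_bbox <- function(longitude, latitude, bbox) {
  longitude >= bbox[["left"]] & longitude <= bbox[["right"]] &
    latitude >= bbox[["bottom"]] & latitude <= bbox[["top"]]
}

# Approximate coordinates, for illustration only
places <- data.frame(
  place = c("Canberra", "Sydney"),
  longitude = c(149.13, 151.21),
  latitude = c(-35.28, -33.87)
)

# Keep only the places that fall inside the bounding box
places_in_bbox <-
  places[is_in_bbox(places$longitude, places$latitude, bbox), ]

places_in_bbox
```

The same idea can be used to drop points that would otherwise fall outside the tiles being plotted.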

It is free, but we need to register in order to get a map. To do this go to https://client.stadiamaps.com/signup/ and create an account. Then create a new property, then "Add API Key". Copy the key and run (replacing PUT-KEY-HERE with the key) `register_stadiamaps(key = "PUT-KEY-HERE", write = TRUE)`. Then once you have defined the bounding box, the function `get_stadiamap()` will get the tiles in that area (@fig-heyitscanberra).\index{maps!tiles} The number of tiles that it needs depends on the zoom, and the type of tiles that it gets depends on the type of map. We have used "stamen_toner_lite", which is black and white, but there are others including: "stamen_terrain", "stamen_toner", and "stamen_toner_lines". We pass the tiles to `ggmap()` which will plot them. An internet connection is needed for this to work as `get_stadiamap()` downloads the tiles.\index{internet!map tiles}

```{r}
#| label: fig-heyitscanberra
#| fig-cap: "Map of Canberra, Australia"
#| warning: false
#| message: false

canberra_stamen_map <- get_stadiamap(bbox, zoom = 11, maptype = "stamen_toner_lite")

ggmap(canberra_stamen_map)
```

Now we want to get some data that we can plot on top of our tiles. We will plot the location of each polling place, colored by its "division". The data are available [from the Australian Electoral Commission (AEC)](https://results.aec.gov.au/24310/Website/Downloads/GeneralPollingPlacesDownload-24310.csv).\index{Australia!Australian Electoral Commission}

::: {.content-visible when-format="pdf"}
```{r}
#| warning: false
#| message: false
#| echo: true
#| eval: false

booths <-
  read_csv(
    paste0(
      "https://results.aec.gov.au/24310/Website/Downloads/",
      "GeneralPollingPlacesDownload-24310.csv"
    ),
    skip = 1,
    guess_max = 10000
  )
```
:::

::: {.content-visible unless-format="pdf"}
```{r}
#| warning: false
#| message: false
#| echo: true
#| eval: false

booths <-
  read_csv(
    "https://results.aec.gov.au/24310/Website/Downloads/GeneralPollingPlacesDownload-24310.csv",
    skip = 1,
    guess_max = 10000
  )
```
:::

```{r}
#| echo: false
#| eval: false

# INTERNAL

write_csv(booths, "inputs/data/booths.csv")
```

```{r}
#| echo: false
#| eval: true
#| warning: false
#| message: false

booths <- 
  read_csv(
    file = "inputs/data/booths.csv",
    guess_max = 10000
  )
```

This dataset is for the whole of Australia, but as we are only plotting the area around Canberra, we will filter the data to only booths with a geography close to Canberra.\index{Australia!Canberra}

```{r}
#| warning: false
#| message: false

booths_reduced <-
  booths |>
  filter(State == "ACT") |>
  select(PollingPlaceID, DivisionNm, Latitude, Longitude) |>
  filter(!is.na(Longitude)) |> # Remove rows without geography
  filter(Longitude < 165) # Remove Norfolk Island
```

Now we can use `ggmap` in the same way as before to plot our underlying tiles, and then build on that using `geom_point()` to add our points of interest.\index{maps!plotting locations}

```{r}
#| label: fig-heyitscanberrapolling
#| fig-cap: "Map of Canberra, Australia, with polling places"
#| warning: false
#| message: false

ggmap(canberra_stamen_map, extent = "normal", maprange = FALSE) +
  geom_point(data = booths_reduced,
             aes(x = Longitude, y = Latitude, colour = DivisionNm),
             alpha = 0.7) +
  scale_color_brewer(name = "2019 Division", palette = "Set1") +
  coord_map(
    projection = "mercator",
    xlim = c(
      attr(canberra_stamen_map, "bb")$ll.lon,
      attr(canberra_stamen_map, "bb")$ur.lon
    ),
    ylim = c(
      attr(canberra_stamen_map, "bb")$ll.lat,
      attr(canberra_stamen_map, "bb")$ur.lat
    )
  ) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

We may like to save the map so that we do not have to create it every time, and we can do that in the same way as any other graph, using `ggsave()`.

```{r}
#| eval: false

ggsave("map.pdf", width = 20, height = 10, units = "cm")
```

Finally, the reason that we used Stamen Maps and OpenStreetMap is that they are open source, but we could have also used Google Maps.\index{maps!Google Maps} This requires you to first register a credit card with Google, and specify a key, but with low usage the service should be free. Using Google Maps, through `get_googlemap()` within `ggmap`, brings some advantages over `get_stadiamap()`. For instance, it will attempt to find a place name rather than needing a bounding box to be specified.

#### United States military bases

To see another example of a static map we will plot some United States military bases after installing and loading `troopdata`.\index{maps!US military bases} We can access data about United States overseas military bases back to the start of the Cold War using `get_basedata()`.

```{r}
bases <- get_basedata()

head(bases)
```

We will look at the locations of United States military bases in Germany,\index{Germany!US military bases} Japan,\index{Japan!US military bases} and Australia.\index{Australia!US military bases} The `troopdata` dataset already has the latitude and longitude of each base, and we will use that as our item of interest. The first step is to define a bounding box for each country.\index{maps!bounding box}

```{r}
#| message: false
#| warning: false

# Use: https://data.humdata.org/dataset/bounding-boxes-for-countries
bbox_germany <- c(left = 5.867, bottom = 45.967, right = 15.033, top = 55.133)

bbox_japan <- c(left = 127, bottom = 30, right = 146, top = 45)

bbox_australia <- c(left = 112.467, bottom = -45, right = 155, top = -9.133)
```

Then we need to get the tiles using `get_stadiamap()` from `ggmap`.\index{maps!tiles} 

```{r}
#| message: false
#| warning: false

german_stamen_map <- get_stadiamap(bbox_germany, zoom = 6, maptype = "stamen_toner_lite")

japan_stamen_map <- get_stadiamap(bbox_japan, zoom = 6, maptype = "stamen_toner_lite")

aus_stamen_map <- get_stadiamap(bbox_australia, zoom = 5, maptype = "stamen_toner_lite")
```

And finally, we can bring it all together with maps showing United States military bases in Germany (@fig-mapbasesin-1), Japan (@fig-mapbasesin-2), and Australia (@fig-mapbasesin-3).\index{maps!plotting locations}

```{r}
#| fig-cap: "Map of United States military bases in various parts of the world"
#| label: fig-mapbasesin
#| message: false
#| warning: false
#| fig-subcap: ["Germany", "Japan", "Australia"]
#| layout-ncol: 2

ggmap(german_stamen_map) +
  geom_point(data = bases, aes(x = lon, y = lat)) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal()

ggmap(japan_stamen_map) +
  geom_point(data = bases, aes(x = lon, y = lat)) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal()

ggmap(aus_stamen_map) +
  geom_point(data = bases, aes(x = lon, y = lat)) +
  labs(x = "Longitude",
       y = "Latitude") +
  theme_minimal()
```

### Geocoding

So far we have assumed that we already have geocoded data. This means that we have latitude and longitude coordinates for each place. But sometimes we only have place names, such as "Sydney, Australia", "Toronto, Canada", "Accra, Ghana", and "Guayaquil, Ecuador". Before we can plot them, we need to get the latitude and longitude coordinates for each case. The process of going from names to coordinates is called geocoding.\index{maps!geocoding}

:::{.callout-note}
## Oh, you think we have good data on that!

While you almost surely know where you live, it can be surprisingly difficult to specifically define the boundaries of many places.\index{maps!boundaries} And this is made especially difficult when different levels of government have different definitions. @bronnerquantediting illustrates this in the case of Atlanta, Georgia,\index{United States!Georgia} where there are (at least) three different official definitions: 

1) the metropolitan statistical area; 
2) the urbanized area; and 
3) the census place. 

Which definition is used can have a substantial effect on the analysis, or even the data that are available, even though they are all "Atlanta". 
:::

There is a range of options to geocode data in R, but `tidygeocoder` is especially useful. We first need a dataframe of locations. 

```{r}
place_names <-
  tibble(
    city = c("Sydney", "Toronto", "Accra", "Guayaquil"),
    country = c("Australia", "Canada", "Ghana", "Ecuador")
  )

place_names
```

We then pass these to `geo()` from `tidygeocoder`, which looks up each place using the service specified by `method`, here OpenStreetMap (`"osm"`), and returns a latitude and longitude for each.

```{r}
#| message: false
#| warning: false

place_names <-
  geo(
    city = place_names$city,
    country = place_names$country,
    method = "osm"
  )

place_names
```
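
Geocoding services do not always find a match, and when `tidygeocoder` cannot find one it typically returns `NA` for the coordinates rather than stopping with an error. Before plotting, it can be worth checking for missing or implausible coordinates. The following is a minimal sketch of such a check. The `geocoded` tibble is hypothetical: the coordinates are approximate values entered by hand for illustration, and "Not a real place" is included to stand in for a failed lookup.

```{r}
#| eval: false

library(dplyr)
library(tibble)

# Hypothetical geocoder output (approximate, hand-entered coordinates),
# including one place that the service could not find
geocoded <-
  tibble(
    city = c("Sydney", "Toronto", "Accra", "Guayaquil", "Not a real place"),
    lat = c(-33.87, 43.65, 5.56, -2.19, NA),
    long = c(151.21, -79.38, -0.20, -79.89, NA)
  )

# Places that were not geocoded
failed <-
  geocoded |>
  filter(is.na(lat) | is.na(long))

# Places with coordinates outside the valid range for latitude or longitude
out_of_range <-
  geocoded |>
  filter(abs(lat) > 90 | abs(long) > 180)

failed
out_of_range
```

Any rows in `failed` could then be looked up by hand, or tried again with a more specific place name.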

And we can now plot and label these cities (@fig-mynicemap).

```{r}
#| fig-cap: "Map of Accra, Sydney, Toronto, and Guayaquil after geocoding to obtain their locations"
#| label: fig-mynicemap
#| message: false
#| warning: false

world <- map_data(map = "world")

ggplot() +
  geom_polygon(
    data = world,
    aes(x = long, y = lat, group = group),
    fill = "white",
    colour = "grey"
  ) +
  geom_point(
    data = place_names,
    aes(x = long, y = lat),
    color = "black") +
  geom_text(
    data = place_names,
    aes(x = long, y = lat, label = city),
    nudge_y = -5) +
  theme_minimal() +
  labs(x = "Longitude",
       y = "Latitude")
```


### Interactive maps

The nice thing about interactive maps is that we can let our user decide what they are interested in. For instance, some people will be interested in, say, Toronto, while others will be interested in Chennai or even Auckland. It would be difficult to present a single static map that focused on all of those, so an interactive map is a way to allow users to focus on what they want.

That said, we should be cognizant of what we are doing when we build maps, and more broadly, what is being done at scale to enable us to be able to build our own maps. For instance, with regard to Google, @mcquire2019one says:

> Google began life in 1998 as a company famously dedicated to organising the vast amounts of data on the Internet. But over the last two decades its ambitions have changed in a crucial way. Extracting data such as words and numbers from the physical world is now merely a stepping-stone towards apprehending and organizing the physical world as data. Perhaps this shift is not surprising at a moment when it has become possible to comprehend human identity as a form of (genetic) 'code'. However, apprehending and organizing the world as data under current settings is likely to take us well beyond Heidegger's 'standing reserve' in which modern technology enframed 'nature' as productive resource. In the 21st century, it is the stuff of human life itself—from genetics to bodily appearances, mobility, gestures, speech, and behaviour---that is being progressively rendered as productive resource that can not only be harvested continuously but subject to modulation over time.

Does this mean that we should not use or build interactive maps? Of course not. But it is important to be aware that this is a frontier, and the boundaries of appropriate use are still being determined. Indeed, the literal boundaries of the maps themselves are being continually determined and updated. The move to digital maps, compared with physical printed maps, means that it is possible for different users to be presented with different realities. For instance, "...Google routinely takes sides in border disputes. Take, for instance, the representation of the border between Ukraine and Russia. In Russia, the Crimean Peninsula is represented with a hard-line border as Russian-controlled, whereas Ukrainians and others see a dotted-line border. The strategically important peninsula is claimed by both nations and was violently seized by Russia in 2014, one of many skirmishes over control" [@washingtonpostmaps].


#### Leaflet

We can use `leaflet` [@ChengKarambelkarXie2017] to make interactive maps. The essentials are similar to `ggmap` [@KahleWickham2013], but there are many additional aspects beyond that. We can redo the US military deployments map from @sec-static-communication that used `troopdata` [@troopdata]. The advantage with an interactive map is that we can plot all the bases and allow the user to focus on which area they want, in comparison with @sec-static-communication where we just picked a few particular countries. A great example of why this might be useful is provided by @theeconomistmaps where they are able to show 2022 French Presidential results for the entire country by commune.

In the same way as a graph in `ggplot2` begins with `ggplot()`, a map in `leaflet` begins with `leaflet()`. Here we can specify data, and other options such as width and height. After this, we add "layers" in the same way that we added them in `ggplot2`. The first layer that we add is a tile, using `addTiles()`. In this case, the default is from OpenStreetMap. After that we add markers with `addMarkers()` to show the location of each base (@fig-canhasbase).

```{r}
#| fig-cap: "Interactive map of US bases"
#| label: fig-canhasbase
#| message: false
#| warning: false

bases <- get_basedata()

# Some of the bases include unexpected characters which we need to address
Encoding(bases$basename) <- "latin1"

leaflet(data = bases) |>
  addTiles() |> # Add default OpenStreetMap map tiles
  addMarkers(
    lng = bases$lon,
    lat = bases$lat,
    popup = bases$basename,
    label = bases$countryname
  )
```

There are two new arguments, compared with `ggmap`. The first is `popup`, which is what is shown when the user clicks on the marker. In this case, it is the name of the base. The second is `label`, which is what is shown when the user hovers over the marker. In this case, it is the name of the country.
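
To see the difference between a popup and a label in isolation, the following is a small sketch that does not depend on the bases data. The `cities` tibble and its approximate coordinates are made up for illustration. It also uses the formula interface of `leaflet`, where `~long` refers to a column of the data passed to `leaflet()`, as an alternative to writing `bases$lon`.

```{r}
#| eval: false

library(leaflet)
library(tibble)

# Hypothetical data with approximate, hand-entered coordinates
cities <-
  tibble(
    city = c("Sydney", "Toronto"),
    country = c("Australia", "Canada"),
    lat = c(-33.87, 43.65),
    long = c(151.21, -79.38)
  )

city_map <-
  leaflet(data = cities) |>
  addTiles() |> # Add default OpenStreetMap map tiles
  addMarkers(
    lng = ~long,
    lat = ~lat,
    popup = ~city, # Shown when the marker is clicked
    label = ~country # Shown when the cursor hovers over the marker
  )

city_map
```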

We can try another example, this time of the amount spent building those bases. We will introduce a different type of marker here, which is circles. This will allow us to use a different color for each outcome. There are four possible outcomes: "More than $100,000,000", "More than $10,000,000", "More than $1,000,000", "$1,000,000 or less" (@fig-canhasbaseandmoney).

```{r}
#| fig-cap: "Interactive map of US bases with colored circles to indicate spend"
#| label: fig-canhasbaseandmoney
#| message: false
#| warning: false

build <-
  get_builddata(startyear = 2008, endyear = 2019) |>
  filter(!is.na(lon)) |>
  mutate(
    cost = case_when(
      spend_construction > 100000 ~ "More than $100,000,000",
      spend_construction > 10000 ~ "More than $10,000,000",
      spend_construction > 1000 ~ "More than $1,000,000",
      TRUE ~ "$1,000,000 or less"
    )
  )

pal <-
  colorFactor("Dark2", domain = build$cost |> unique())

leaflet() |>
  addTiles() |> # Add default OpenStreetMap map tiles
  addCircleMarkers(
    data = build,
    lng = build$lon,
    lat = build$lat,
    color = pal(build$cost),
    popup = paste(
      "<b>Location:</b>",
      as.character(build$location),
      "<br>",
      "<b>Amount:</b>",
      as.character(build$spend_construction),
      "<br>"
    )
  ) |>
  addLegend(
    "bottomright",
    pal = pal,
    values = build$cost |> unique(),
    title = "Type",
    opacity = 1
  )
```
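
Two details of the `case_when()` call used to create `cost` are worth noting. The conditions are checked in order and the first one that is true is used, so the thresholds need to go from largest to smallest. And a missing value does not satisfy any of the comparisons, so it falls through to the final `TRUE` case. The sketch below illustrates both using a made-up `spend` vector. The values are hypothetical and, following the labels above, are assumed to be in thousands of dollars.

```{r}
#| eval: false

library(dplyr)

# Hypothetical spend values, assumed to be in thousands of dollars
spend <- c(250000, 50000, 5000, 500, NA)

cost <-
  case_when(
    spend > 100000 ~ "More than $100,000,000",
    spend > 10000 ~ "More than $10,000,000",
    spend > 1000 ~ "More than $1,000,000",
    TRUE ~ "$1,000,000 or less"
  )

cost

# One way to keep missing values separate is to handle them first
cost_with_missing <-
  case_when(
    is.na(spend) ~ "Unknown",
    spend > 100000 ~ "More than $100,000,000",
    spend > 10000 ~ "More than $10,000,000",
    spend > 1000 ~ "More than $1,000,000",
    TRUE ~ "$1,000,000 or less"
  )

cost_with_missing
```

If `spend_construction` had missing values, then the first version would label them "$1,000,000 or less", which may or may not be what is wanted.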


#### Mapdeck

`mapdeck` [@citemapdeck] is based on WebGL. This means the web browser will do a lot of work for us. This enables us to accomplish things with `mapdeck` that `leaflet` struggles with, such as larger datasets. 

To this point we have used "stamen maps" as our underlying tile, but `mapdeck` uses [Mapbox](https://www.mapbox.com/). This requires registering an account and obtaining a token. This is free and only needs to be done once. Once we have that token we add it to our R environment (the details of this process are covered in @sec-gather-data). Running `edit_r_environ()` from `usethis` will open a text file, and this is where we should add our Mapbox secret token.

```{r}
#| eval: false

MAPBOX_TOKEN="PUT_YOUR_MAPBOX_SECRET_HERE"
```

We then save this ".Renviron" file, and restart R ("Session" -> "Restart R").
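
After restarting, one way to check that the token is available is to look it up with `Sys.getenv()`, which returns an empty string if the variable has not been set. This is a minimal sketch that assumes the variable was named `MAPBOX_TOKEN`, as above.

```{r}
#| eval: false

# Look up the token; this is "" if the variable is not set
token <- Sys.getenv("MAPBOX_TOKEN")

# TRUE if a non-empty token was found
token_found <- nzchar(token)

if (!token_found) {
  message("MAPBOX_TOKEN was not found. Check .Renviron and restart R.")
}
```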

Having obtained a token, we can create a plot of our base spend data from earlier (@fig-canhasbaseandmoneymapdeck).

```{r}
#| fig-cap: "Interactive map of US bases using Mapdeck"
#| label: fig-canhasbaseandmoneymapdeck
#| message: false
#| warning: false

mapdeck(style = mapdeck_style("light")) |>
  add_scatterplot(
    data = build,
    lat = "lat",
    lon = "lon",
    layer_id = "scatter_layer",
    radius = 10,
    radius_min_pixels = 5,
    radius_max_pixels = 100,
    tooltip = "location"
  )
```


## Concluding remarks

In this chapter we considered many ways of communicating data. We spent substantial time on graphs, because of their ability to convey a large amount of information in an efficient way. We then turned to tables because of how they can specifically convey information. Finally, we discussed maps, which allow us to display geographic information. The most important task is to show the observations to the full extent possible.

## Exercises

### Practice {.unnumbered}

1. *(Plan)* Consider the following scenario: *Three friends---Edward, Hugo, and Lucy---each measure the height of 20 of their friends. Each of the three uses a slightly different approach to measurement and so makes slightly different errors.* Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.
2. *(Simulate)* Please further consider the scenario described and simulate the situation with every variable independent of each other. Please include three tests based on the simulated data.
3. *(Acquire)* Please specify a source of actual data about human height that you are interested in.
4. *(Explore)* Build a graph and table using the simulated data.
5. *(Communicate)* Please write some text to accompany the graph and table, as if they reflected the actual situation. The exact details contained in the paragraphs do not have to be factual but they should be reasonable (i.e. you do not actually have to get the data nor create the graphs). Separate the code appropriately into `R` files and a Quarto doc. Submit a link to a GitHub repo with a README.

### Quiz {.unnumbered}

1. What is the primary reason for always plotting data (pick one)?
    a.  To better understand our data.
    b. To ensure the data are normal.
    c. To check for missing values.
2. From @r4ds, which of the following best describes tidy data (pick one)?
    a.  Each variable is in its own column, and each observation in its own row.
    b. All data in a single row.
    c. Multiple values per cell.
    d. Data are stored in one cell.
3. From @healyviz, what does `ggplot()` require as its first argument (pick one)?
    a.  A dataframe.
    b. A geom function.
    c. A legend.
    d. An aesthetic mapping.
4. From @healyviz, in `ggplot2`, what does the `+` operator do (pick one)?
    a. Saves the plot.
    b. Adds data to the plot.
    c.  Combines layers of the plot.
    d. Removes elements from the plot.
5. From @r4ds, what is an "aesthetic" in the context of `ggplot2` (pick one)?
    a. The type of chart used.
    b. The axis labels.
    c. The color of the plot.
    d.  How variables in the dataset are mapped to visual properties.
6. From @r4ds, in `ggplot2`, what is a "geom" (pick one)?
    a. A data transformation function.
    b.  The geometrical object that a plot uses to represent data.
    c. A plot title.
    d. A statistical transformation.
7. Which geom should be used to make a scatter plot (pick one)?
    a. `geom_dotplot()`
    b. `geom_bar()`
    c. `geom_smooth()`
    d.  `geom_point()`
8. Which should be used to create bar charts (when you have already computed counts) (pick one)?
    a. `geom_line()`
    b. `geom_bar()`
    c. `geom_histogram()`
    d.  `geom_col()`
9. Which `ggplot2` geom would you use to create a histogram (pick one)?
    a. `geom_col()`
    b. `geom_bar()`
    c. `geom_density()`
    d.  `geom_histogram()`
10. Assume the `tidyverse` and `datasauRus` are installed and loaded. What would be the outcome of the following code (pick one)?
    a. Two vertical lines.
    b. Three vertical lines.
    c.  Four vertical lines.
    d. Five vertical lines.

```{r}
#| eval: false
#| echo: true

datasaurus_dozen |> 
  filter(dataset == "v_lines") |> 
  ggplot(aes(x=x, y=y)) + 
  geom_point()
```

11. From @r4ds, what happens when you map a variable to the `color` aesthetic in `ggplot2` (select all that apply)?
    a.  The points get different colors based on the variable.
    b.  A legend is automatically created.
    c. The size of points changes based on the variable.
12. From @healyviz, what is the difference between `color` and `fill` aesthetics in `ggplot2` (pick one)?
    a. Both terms are used interchangeably.
    b.  `color` applies to points and lines, while `fill` applies to area elements.
    c. `color` controls font color, while `fill` controls plot title.
    d. `color` applies to the background, while `fill` applies to text.
13. How do you add some transparency to the points when using `geom_point()` (pick one)?
    a.  By setting `alpha` to a value between 0 and 1.
    b. By removing `geom_point()` from the plot.
    c. By using `color = NULL` in `aes()`
14. What does `labs()` do in `ggplot2` (pick one)?
    a. Changes the background color of the plot.
    b.  Adds labels like legend and axis labels.
    c. Adds a line of best fit.
    d. Modifies the plot layout.
15. In the code below, what should be added to `labs()` to change the text of the legend (pick one)?
    a. `scale = "Voted for"`
    b. `legend = "Voted for"`
    c. `color = "Voted for"`
    d.  `fill = "Voted for"`

```{r}
#| eval: false
#| echo: true

beps |>
  ggplot(mapping = aes(x = age, fill = vote)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age of respondent", y = "Number of respondents")
```

16. Based on the help file for `scale_colour_brewer()` which palette diverges (pick one)?
    a. "GnBu"
    b. "Set1"
    c. "Accent"
    d.  "RdBu"
17. Which theme does not have solid lines along the x and y axes (pick one)?
    a.  `theme_minimal()`
    b. `theme_classic()`
    c. `theme_bw()`
18. Which argument to `position` should be added to `geom_bar()` to make the bars be next to each other rather than on top of each other (pick one)?
    a. `position = "adjacent"`
    b. `position = "side_by_side"`
    c. `position = "closest"`
    d.  `position = "dodge2"`

```{r}
#| eval: false
#| echo: true

beps |> 
  ggplot(mapping = aes(x = age, fill = vote)) + 
  geom_bar()
```

19. Based on @vanderplas2020testing, which cognitive principle should be considered when creating graphs (pick one)?
    a.  Proximity.
    b. Volume estimation.
    c. Relative motion.
    d. Axial positioning.
20. Based on @vanderplas2020testing, color can be used to (pick one)?
    a. Improve chart design aesthetics.
    b.  Encode categorical and continuous variables and group plot elements.
    c. Identify magnitude.
21. Which of these would result in the largest number of bins (pick one)?
    a.  `geom_histogram(binwidth = 2)`
    b. `geom_histogram(binwidth = 5)`
22. Suppose there is a dataset that contains the heights of 100 birds, each from one of three different species. If we are interested in understanding the distribution of these heights, then in a paragraph or two, please explain which type of graph should be used and why.
23. Would this code `data |> ggplot(aes(x = col_one)) |> geom_point()` work if we assume the libraries are loaded and the dataset and columns exist (pick one)?
    a.  No.
    b. Yes.
24. In `ggplot2`, what is the purpose of using facets in a plot (pick one)?
    a. To adjust the transparency of the points.
    b. To add labels to data points.
    c.  To create multiple plots split by the values of one or more variables.
    d. To change the color scheme of the plot.
25. When creating a bar chart with `ggplot2`, which aesthetic is typically mapped to a categorical variable to fill bars with different colors (pick one)?
    a.  `fill`
    b. `x`
    c. `y`
    d. `size`
26. What is the effect of adding `position = "dodge"` or `position = "dodge2"` to a `geom_bar()` in `ggplot2` (pick one)?
    a. It adds transparency to the bars.
    b.  It places the bars side by side for each group.
    c. It stacks the bars on top of each other.
    d. It changes the bar colors to grayscale.
27. In the context of `ggplot2`, what is the primary difference between `geom_point()` and `geom_jitter()` (pick one)?
    a.  `geom_jitter()` adds random noise to points to reduce overplotting.
    b. `geom_jitter()` is used for continuous data, `geom_point()` for categorical data. 
    c. `geom_point()` adds transparency, `geom_jitter()` does not.
    d. `geom_point()` plots points, `geom_jitter()` plots lines.
28. Which `ggplot2` geom would you use to add a line of best fit to a scatterplot (pick one)?
    a. `geom_histogram()`
    b.  `geom_smooth()`
    c. `geom_bar()`
    d. `geom_line()`
29. What arguments would you use in `geom_smooth()` to specify a linear model without standard errors (pick one)?
    a. `fit = lm, show_se = FALSE`
    b. `type = "linear", ci = FALSE`
    c. `model = linear, error = FALSE`
    d.  `method = lm, se = FALSE`
30. What does adjusting the number of bins, or changing the binwidth, affect for a histogram (pick one)?
    a. The labels on the x-axis.
    b. The size of the data points.
    c. The colors used in the plot.
    d.  How smooth the distribution will appear.
31. What is one disadvantage of using boxplots (pick one)?
    a. They cannot show outliers.
    b. They take too long to compute.
    c.  They hide the underlying distribution of the data.
    d. They are too colorful.
32. How can you deal with that disadvantage (pick one)?
    a. Remove the whiskers from the boxplot.
    b. Add colors for each category.
    c.  Overlay the actual data points using `geom_jitter()`.
    d. Increase the box width.
33. What does `stat_ecdf()` compute (pick one)?
    a.  A cumulative distribution function.
    b. A scatterplot with error bars.
    c. A boxplot.
    d. A histogram.
34. Which function from `modelsummary` could we use to create a table of descriptive statistics (pick one)?
    a.  `datasummary_balance()`
    b. `datasummary_skim()`
    c. `datasummary_descriptive()`
    d. `datasummary_crosstab()`
35. What is geocoding (pick one)?
    a. Converting latitude and longitude into place names.
    b. Drawing map boundaries.
    c. Picking a map projection.
    d.  Converting place names into latitude and longitude.
36. From @lovelace2019geocomputation, please explain in a paragraph or two, what is the difference between vector data and raster data in the context of geographic data?
37. Which argument to `addMarkers()` is used to specify the behavior that occurs after a marker is clicked (pick one)?
    a. `layerId`
    b. `icon`
    c.  `popup`
    d. `label`    


### Class activities {.unnumbered}

- Use the [starter folder](https://github.com/RohanAlexander/starter_folder) and create a new repo. Add a link to the GitHub repo in the class's shared Google Doc. Do all the following in `paper.qmd`.
- The following produces a scatterplot showing the level, in feet, of Lake Huron between 1875 and 1972. Please improve it.
```{r}
#| eval: false 

tibble(year = 1875:1972,
       level = as.numeric(datasets::LakeHuron)) |>
  ggplot(aes(x = year, y = level)) +
  geom_point()
```
- The following produces a bar chart of the height of 31 Black Cherry Trees. Please improve it.
```{r}
#| eval: false 

datasets::trees |> 
  as_tibble() |> 
  ggplot(aes(x = Height)) +
  geom_bar()
```

- The following produces a line plot showing the weight of chicks, in grams, by how many days old they were. Please improve it.
```{r}
#| eval: false 

datasets::ChickWeight |> 
  as_tibble() |> 
  ggplot(aes(x = Time, y = weight, group = Chick)) +
  geom_line()
```
```{r}
#| echo: false
#| eval: false

# The best I've managed (based on stealing the best bits of past student work) is:
datasets::ChickWeight |>
  group_by(Diet, Time) |>
  summarize(average_weight = mean(weight)) |>
  ggplot(aes(x = Time,
             y = average_weight,
             color = Diet)) +
  geom_line(linewidth = 1.5) +
  geom_point(data = datasets::ChickWeight,
             aes(x = Time, y = weight),
             alpha = 0.5) +
  geom_line(data = datasets::ChickWeight,
            aes(x = Time, y = weight, group = Chick),
            alpha = 0.1) +
  labs(x = "Days since birth",
       y = "Average weight (grams)",
       color = "Diet") +
  theme_classic() +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(breaks = seq(0, 400, 50))
```

- The following produces a histogram showing the annual number of sunspots between 1700 and 1988. Please improve it.
```{r}
#| eval: false 

tibble(year = 1700:1988,
       sunspots = as.numeric(datasets::sunspot.year) |> round(0)) |>
  ggplot(aes(x = sunspots)) +
  geom_histogram()
```

- Please follow [this code](https://github.com/saloni-nd/misc/blob/main/Mortality%20rates%20by%20age%20-%20HMD.R) from Saloni Dattani, and make a graph for two countries of interest to you.
- The following code, taken from the `palmerpenguins` [vignette](https://allisonhorst.github.io/palmerpenguins/articles/examples.html), produces a beautiful graph. Please modify it to create the ugliest graph that you can.^[The idea for this exercise is from Liza Bolton.]
```{r}
#| eval: false 
#| warning: false

ggplot(data = penguins,
       aes(x = flipper_length_mm,
           y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(
    title = "Penguin size, Palmer Station LTER",
    subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Penguin species",
    shape = "Penguin species"
  ) +
  theme_minimal() +
  theme(
    legend.position = c(0.2, 0.7),
    plot.title.position = "plot",
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.caption.position = "plot"
  )
```

- The following code provides estimates for the speed of light, from three experiments, each of 20 runs. Please create an average speed of light, per experiment, then use `knitr::kable()` to create a cross-referenced table, with specified column names, and no significant digits.
```{r}
#| eval: false 

datasets::morley |> 
  tibble()
```

### Task I {.unnumbered}

Please create a graph using `ggplot2` and a map using `ggmap` and add explanatory text to accompany both. Be sure to include cross-references and captions, etc. Each of these should take about a page.

Then, with regard to the graph you created, please reflect on @vanderplas2020testing. Add a few paragraphs about the different options that you considered to make the graph more effective.

And finally, with regard to the map that you created, please reflect on the following quote from Heather Krause, founder of [We All Count](https://weallcount.com): "maps only show people who aren't invisible to the makers" as well as Chapter 3 from @datafeminism2020 and add a few paragraphs related to this.

Submit a link to a high-quality GitHub repo.

### Task II {.unnumbered}

Please obtain data on the ethnic origins and number of Holocaust victims killed at Auschwitz concentration camp. Then use `shiny` to create an interactive graph and an interactive table. These should show the number of people murdered by nationality/category and should allow the user to specify the groups they are interested in seeing data for. Publish them using the free tier of shinyapps.io. 

Then, based on the themes brought up in @americaslavery, discuss your work in at least two pages. The expectation is that, similar to @kieranskitchen, you use your work as a foundation to build on and discuss what it means to use data about such a horror. 

Use the starter folder, and submit a PDF created using the Quarto doc provided there. Ensure that your essay contains links to both your app and the GitHub repo that contains all code and data, as well as extensive citations to relevant literature that you reflected on.


### Paper {.unnumbered}

::: {.content-visible when-format="pdf"}
At about this point the *Mawson* Paper in the ["Papers" Online Appendix](https://tellingstorieswithdata.com/23-assessment.html) would be appropriate.
:::

::: {.content-visible unless-format="pdf"}
At about this point the *Mawson* Paper from [Online Appendix -@sec-papers] would be appropriate.
:::