Sarah Hosking for #R-Ladies Paris 9 November 2017
# set variable to conditionally hide some non-ggplot code chunks.
# used with `echo=` chunk option.
show_all = TRUE
knitr::opts_chunk$set(echo = TRUE,
include = TRUE,
message = FALSE,
warning = FALSE)
library(tidyverse)
theme_set(theme_bw())
You can build plots like you would for this cake.
- Modular
- Easier to read
- They look good OOTB
Once you get over the initial learning curve, it's all too easy during demos to go gaga showing off great-looking faceted plots with fabulous colour palettes and themes.
But after a few presentations I noticed what people talked to me about wasn't the fancier plotting techniques. It was a few fundamentals that stumped them.
- Opaque terminology
- Untidy dataframes
- Unclear how to iterate
But before we dive into those ...
Paris air quality!
(Photo from Le Monde. http://www.lemonde.fr/planete/article/2014/11/24/a-paris-la-pollution-est-aussi-nocive-que-le-tabagisme-passif_4528203_3244.html)
My audience is often Parisian. The air quality data is available for public download, and the internet already has enough plots about cars, irises and diamonds.
Moreoever, when I was really getting into Ggplot2 in late 2016, European headlines were dominated by stories of extreme pollution spikes from fine particulate matter.
And since I live between 2 major arterial roads in the city center, my question was "Just how bad is it?" Also, being a runner and cyclist, I wanted to see if there was clearly a best time of day to exercise outdoors.
- 5 pollutants monitored
- readings collected hourly, 24/7
- from July 21 2011 to Oct 31 2017
- measuring location: Paris Hôtel de Ville
airparif <- read_rds("airparif_oct2017.rds")
str(airparif)
## 'data.frame': 55080 obs. of 10 variables:
## $ date : Date, format: "2011-07-21" "2011-07-21" ...
## $ year : Ord.factor w/ 7 levels "2011"<"2012"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month: Ord.factor w/ 12 levels "01"<"02"<"03"<..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day : Ord.factor w/ 31 levels "01"<"02"<"03"<..: 21 21 21 21 21 21 21 21 21 21 ...
## $ hour : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PM10 : num 7 6 7 7 10 10 11 15 17 19 ...
## $ PM25 : num NA NA NA NA NA NA NA NA NA NA ...
## $ NO2 : num 12 10 10 14 30 36 32 39 36 39 ...
## $ O3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CO : num NA NA NA NA NA NA NA NA NA NA ...
nrow(airparif)
## [1] 55080
head(airparif)
## date year month day hour PM10 PM25 NO2 O3 CO
## 2 2011-07-21 2011 07 21 1 7 NA 12 NA NA
## 3 2011-07-21 2011 07 21 2 6 NA 10 NA NA
## 4 2011-07-21 2011 07 21 3 7 NA 10 NA NA
## 5 2011-07-21 2011 07 21 4 7 NA 14 NA NA
## 6 2011-07-21 2011 07 21 5 10 NA 30 NA NA
## 7 2011-07-21 2011 07 21 6 10 NA 36 NA NA
The 5 pollutants being monitored are:
PM25
= Particulate matter < 2.5 µmPM10
= Particulate matter < 10 µmNO2
= Nitrogen Dioxide (Azote)CO
= Carbon monoxideO3
= Ozone
Let's build a simple histogram of PM10
.
# create data layer
ggplot(data = airparif, aes(x = PM10))
Hmmmmmm.....
- only told Ggplot2 what to plot, not how
- nonetheless, x-axis is already nicely-formatted
- next step: add a
geom
layer
# add plot-type layer
ggplot(data = airparif, aes(x = PM10)) +
geom_histogram()
Voilà. But before we go further, let's clarify a few terms.
Mental barrier #1 for new(er) Ggplot2 users is aes()
, called the aesthetic.
This term spooks people.
In plain English: aes()
is where you map data frame variables to visual elements.
The most basic visual elements to any plot are the x and y axes. Other examples are linetype
, shape
, and fill
.
This tells Ggplot2 to change this visual element for each group.
Notice how this automatically generates a nice-looking legend.
# create data layer that maps
# a variable to fill colour
p <- ggplot(data = airparif, aes(x = PM10, fill = year))
p + geom_histogram()
Before those with design backgrounds start hyperventilating, I've created a rainbow histogram only to illustrate a point. As a method to communicate data, it totally sucks.
By contrast, to change the look of the entire plot, you set a visual element to a value.
# create data layer, *without* fill=
p <- ggplot(data = airparif, aes(x = PM10))
# add plot layer & set fill color
p + geom_histogram(fill = 'blue')
This refers to the geometric projection of your data.
In plain English, it means the plot type.
So we can easily swap another distribution plot layer (in this case, density plot) that will work with our existing data layer.
# create data layer *with* fill mapped to year
p <- ggplot(data = airparif, aes(x = PM10, fill = year))
# create density plot this time
p + geom_density()
# make easier on eyes!
p + geom_density(alpha = 1/4, linetype = 0)
Why? Let's plot the distrubtion of another - NO2.
# plot another var distribution
p <- ggplot(data = airparif, aes(x = NO2))
p + geom_histogram()
Now, imagine we had 20 pollutants in our data frame.
Instead, use tidy data.
# make tidy
airparif.tidy <- airparif %>%
gather(PM10, PM25, NO2, O3, CO,
key = 'pollutant', value = 'value',
na.rm = TRUE) %>%
group_by(date, hour, pollutant) %>%
arrange(date)
airparif.tidy <- data.frame(airparif.tidy)
str(airparif.tidy)
## 'data.frame': 254769 obs. of 7 variables:
## $ date : Date, format: "2011-07-21" "2011-07-21" ...
## $ year : Ord.factor w/ 7 levels "2011"<"2012"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Ord.factor w/ 12 levels "01"<"02"<"03"<..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day : Ord.factor w/ 31 levels "01"<"02"<"03"<..: 21 21 21 21 21 21 21 21 21 21 ...
## $ hour : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pollutant: chr "PM10" "PM10" "PM10" "PM10" ...
## $ value : num 7 6 7 7 10 10 11 15 17 19 ...
Not tidy
dim(airparif)
## [1] 55080 10
head(airparif)
## date year month day hour PM10 PM25 NO2 O3 CO
## 2 2011-07-21 2011 07 21 1 7 NA 12 NA NA
## 3 2011-07-21 2011 07 21 2 6 NA 10 NA NA
## 4 2011-07-21 2011 07 21 3 7 NA 10 NA NA
## 5 2011-07-21 2011 07 21 4 7 NA 14 NA NA
## 6 2011-07-21 2011 07 21 5 10 NA 30 NA NA
## 7 2011-07-21 2011 07 21 6 10 NA 36 NA NA
Tidy
- observations in rows
- variables are columns
- values in cells
dim(airparif.tidy)
## [1] 254769 7
head(airparif.tidy, 20)
## date year month day hour pollutant value
## 1 2011-07-21 2011 07 21 1 PM10 7
## 2 2011-07-21 2011 07 21 2 PM10 6
## 3 2011-07-21 2011 07 21 3 PM10 7
## 4 2011-07-21 2011 07 21 4 PM10 7
## 5 2011-07-21 2011 07 21 5 PM10 10
## 6 2011-07-21 2011 07 21 6 PM10 10
## 7 2011-07-21 2011 07 21 7 PM10 11
## 8 2011-07-21 2011 07 21 8 PM10 15
## 9 2011-07-21 2011 07 21 9 PM10 17
## 10 2011-07-21 2011 07 21 10 PM10 19
## 11 2011-07-21 2011 07 21 11 PM10 19
## 12 2011-07-21 2011 07 21 12 PM10 16
## 13 2011-07-21 2011 07 21 13 PM10 15
## 14 2011-07-21 2011 07 21 14 PM10 17
## 15 2011-07-21 2011 07 21 15 PM10 16
## 16 2011-07-21 2011 07 21 16 PM10 19
## 17 2011-07-21 2011 07 21 17 PM10 20
## 18 2011-07-21 2011 07 21 18 PM10 24
## 19 2011-07-21 2011 07 21 19 PM10 27
## 20 2011-07-21 2011 07 21 20 PM10 31
The advantage of gathering your data into a tidy format is it generates categorical variables. In our case, pollutant
.
Categorical variables are easy to group and facet in Ggplot2.
# create plot of data layer only
p.tidy <- ggplot(data = airparif.tidy, aes(x = value))
p.tidy
# add geom layer (a.k.a. the plot type)
hg <- p.tidy + geom_histogram()
hg
# add facet layer
hg + facet_wrap(~pollutant)
# assign facet wrap to a variable
# & fix ranges for x-axes
fw <- facet_wrap(~pollutant, scales = 'free_x')
hg + fw
- save layers as variables
- update data layer with
%+%
# display earlier histogram of NO2 only
p + geom_histogram()
# update plot var to use new tidy df & x var
p <- p %+% airparif.tidy + aes(x = value)
p + geom_histogram()
# add facet layer
p +
geom_histogram() +
fw
# map fill to pollutant
p <- p %+% aes(fill = pollutant)
p +
geom_histogram() +
fw
The original plot for a single histogram required 2 lines of code.
Here we've created 5 histograms using 3 lines of code.
How do distributions change from month to month?
# update data layer to plot values by *month*
# and add a y variable.
p <- p %+% aes(x = month, y = value)
# add boxplot & facet layers
bp <- p + geom_boxplot()
bp + fw
# update fw var to free y-axis!
fw <- facet_wrap(~pollutant, scales = 'free_y')
# try again
bp + fw
# change to *hourly* boxplot
bp <- bp %+% aes(x = as.factor(hour))
bp + fw
Quite frankly, these outliers look scary, particularly for carbon monoxide & nitrogen dioxide. But the values alone don't mean anything to me. I need context.
So just how bad are these outliers, the pollution spikes?
To answer that question, I'd like to add horizontal lines showing the low-med-high-very high thresholds. These are different for each pollutant.
# first, define thresholds for diff pollutants
PM10 <- c(25,50,90,180)
PM25 <- c(15,30,55,110)
NO2 <- c(50,100,200,400)
O3 <- c(60,120,180,240)
CO <- c(5000,7500, 10000, 20000)
# create df of levels
levels_h.df <- as.data.frame(cbind(PM10, PM25, NO2, O3, CO))
rownames(levels_h.df) <- c('low', 'medium', 'high', 'very high')
# set rownames as columns
library(data.table)
setDT(levels_h.df, keep.rownames = TRUE)
# rename rn column
colnames(levels_h.df)[1] <- "level"
# put in long format
levels_h.long <- levels_h.df %>%
gather('pollutant', 'value', 2:6)
# change col order
levels_h.long <- levels_h.long[, c(2,1,3)]
# df of 1-hr threshold values for each pollutant
head(levels_h.long, 10)
## pollutant level value
## 1 PM10 low 25
## 2 PM10 medium 50
## 3 PM10 high 90
## 4 PM10 very high 180
## 5 PM25 low 15
## 6 PM25 medium 30
## 7 PM25 high 55
## 8 PM25 very high 110
## 9 NO2 low 50
## 10 NO2 medium 100
This involves adding a geom_hline()
layer to the plot. By default, the data specified in data layer is global. However, in subsequent layers you can overwrite this by specifying different data.
# add horizontal lines based on threshold levels df
hl <- geom_hline(data = levels_h.long,
aes(yintercept = value, group = pollutant),
linetype = 2)
# without hlines
bp + fw
# with hlines
bp + fw + hl
This reassures me for CO and NO2. PM10 & PM25, on the other hand...
As I recall from news headlines, Dec 2016 was a bad month for particulate matter.
How did pollution fluctuate throughout the day?
To answer this, let's make a faceted spaghetti plot, with a "strand" for each day of the Dec 2016.
dec2016 <- airparif.tidy %>%
filter(year == 2016 & month == 12)
dec2016 <- data.frame(dec2016)
# update p with new df, x & y
p.dec16 <- p %+% dec2016 + aes(hour, value)
p.dec16 +
geom_line() +
fw
This is NOT what I was after.
Why is this happening?
Remember we are plotting values along an hourly x-axis. There are 31 days in December, so for each hour on the x-axis, there 31 data points.
By default, geom_line()
connects all of them with a single line. We can change that.
# change mapping to plot is grouped by date
p.dec16 <- p.dec16 %+% aes(group = date)
# add geom layer to plot var
p.dec16 <- p.dec16 +
geom_line(colour = 'darkgrey')
p.dec16 + fw
That's more like it!
For all pollutants except O3, I see big spikes during the middle of the day. Did these all occur on the same day?
I know PM10 was extremely high that month, so if I can identify the day with the highest PM10 spike, is that the same day as the big spikes for PM25, NO2 and CO?
When did PM10 pollution spike?
We need to create a new data frame for the day of with the highest PM10 value.
# Dec 2016 particulate data
dec2016_pm <- dec2016 %>%
filter(pollutant %in% c('PM10', 'PM25'))
# find index of max value in Dec
max_val <- which.max(dec2016_pm[,'value'])
# find date
max_date <- dec2016_pm[max_val, 'date']
# filter on this date
dec2016_pm_max <- dec2016 %>%
filter(date == max_date)
Then we will create another geom_line()
layer using the data for this PM10 spike day.
# create geom layer for day of PM10 spike
p.spike_day <- geom_line(data = dec2016_pm_max, aes(hour, value,
colour = pollutant))
# add on top of the Dec 2016 spaghetti plot
p.dec16 +
p.spike_day +
fw
# add context: bring back the threshold lines
p.dec16 +
p.spike_day +
fw +
hl
The plots show that on the day PM10 reached its maximum value for December 2016, PM25, NO2, and CO also reached their peak values for the month. This suggests that those pollutants are positively correlated, especially PM25.
In terms of severity:
- CO's peak remained within 'very low'
- NO2's peak got up to 'high'
- PM10 and PM25 surged into 'very high'
Ideas to follow up:
- Look at correlations between the variables using
ggpairs()
- Find out which months and which hours have the lowest pollution levels using
geom_tile()
- Calculate statistical significance of the max daily peak vs usual daily peak.
To visualize groups:
- use tidy dataframes
- map grouping variables to
aes()
- use
facet_wrap
&facet_grid
To iterate quickly:
- update data layer with
%+%
- save frequently-used layers as variables
Much of the data munging I did for this presention used the tidyr and dplyr from the tidyverse
package. This slide show by Garret Grolemund is one of the best slide tutorials I've ever seen.
- Data wrangling with Dplyr & Tidyr: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/DataWranglingWithR.pdf
I referred to these quite frequently:
-
Ggplot2 cheatsheet: http://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
-
Online excerpts from ggplot2: Elegant Graphics for Data Analysis (Use R!) by Hadley Wickham is how, among other things, I learned about the magical
%+%
operator: http://ggplot2.org/book/
I have not looked at this book yet, but it is often recommended by Ggplot2 gurus.
- The R Graphics Cookbook by Winston Chang http://www.cookbook-r.com/Graphs/