JOM299

Week 2: air pollution on Euston Road

Introduction to ggplot

ggplot is going to be our best friend for this module

Great link to bookmark: ggplot cheatsheet

Importing data

“Importing” data means loading a file from your computer into your programming environment, then storing it in a variable to make it available to us.

Where is our data?

- London Air
- Camden Open Data

CSV files

Our preferred data format. CSV is like an Excel spreadsheet, but just plain text:

name,surname,occupation
basile,simon,journalist
mick,jagger,musician
theresa,may,prime minister

R will recognise the structure above and understand that the commas separate the columns. It will display the data as a table:

name	surname	occupation
basile	simon	journalist
mick	jagger	musician
theresa	may	prime minister
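
To see this parsing step in isolation, here is a minimal sketch that reads the same three rows from an inline string rather than a file. It uses base R's read.csv with its text argument instead of readr, purely so that the example runs without any file on disk; the variable names csv_text and people are made up for illustration.

```r
# The CSV from above, held in a string instead of a file (illustration only)
csv_text <- "name,surname,occupation
basile,simon,journalist
mick,jagger,musician
theresa,may,prime minister"

# read.csv() splits each line on commas and uses the first line as column names
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(people)        # shows the data as a table
print(names(people)) # the three column names
print(nrow(people))  # number of data rows
```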

Using CSV in R

We start by loading in the CSV file containing our data:

library(readr)

df <- read_csv("data/airpollutioneuston.csv")
View(df)

Loading ggplot

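# install.packages() only needs to be run once per computer
install.packages("ggplot2")
install.packages("dplyr")

# library() needs to be run in every new R session
library(ggplot2)
library(dplyr)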

WHO guideline: 40 µg/m3 annual mean

The WHO guideline for NO2 pollution is to keep the annual mean under 40 µg/m3.

Did this happen on Euston Road? We load dplyr to get some basic stats back from our dataset very quickly:

library(dplyr)

df %>% summary()

Calculating a mean

We could also calculate the mean ourselves with summarise, which works with many other handy functions too:

df %>% summarise(annual_mean = mean(Value))

  annual_mean
        <dbl>
1        82.8

# how many observations do we have?
df %>% summarise(observations = n())

  observations
         <int>
1          365
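
A minimal sketch of the same idea on made-up numbers, showing how several summaries can be computed side by side and compared with the guideline of 40. The readings are invented for illustration and are not the Euston Road data; base R is used so the snippet runs on its own, with a dplyr version shown in a comment.

```r
# Invented readings in ug/m3 (NOT the real Euston Road values)
toy <- data.frame(Value = c(35, 60, 90, 42, 38))

guideline <- 40  # annual-mean guideline quoted above

toy_stats <- data.frame(
  mean_value   = mean(toy$Value),            # average reading
  observations = nrow(toy),                  # how many readings
  days_over    = sum(toy$Value > guideline)  # readings above the guideline
)
print(toy_stats)

# dplyr version, assuming dplyr is loaded:
# toy %>% summarise(mean_value = mean(Value),
#                   observations = n(),
#                   days_over = sum(Value > 40))
```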

Clean data a bit

One issue with our dataset: the ReadingDateTime column is read in as a string (df %>% summary() reports it as a character column).

We will need to parse that as a date!

Dates in programming

Dates are odd creatures. We parse strings and convert them into dates, but how does the computer know the format of the date?

2018-01-02
2018/02/01

These dates could be identical or different depending on how we parse them.

Date formats to the rescue

Date format specifiers

2018-01-02 parsed with %Y-%m-%d becomes 2nd Jan 2018
2018-01-02 parsed with %Y-%d-%m becomes 1st Feb 2018
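
The two lines above can be checked directly in base R with as.Date and its format argument. This is a small sketch; the variable names raw, ymd and ydm are only for illustration.

```r
raw <- "2018-01-02"

# Year-month-day: 2 January 2018
ymd <- as.Date(raw, format = "%Y-%m-%d")

# Year-day-month: the same string now means 1 February 2018
ydm <- as.Date(raw, format = "%Y-%d-%m")

print(ymd)
print(ydm)

# A format that does not match the string gives NA rather than an error
print(as.Date(raw, format = "%d/%m/%Y"))
```

The last line is worth remembering: when a whole date column turns into NA after parsing, a mismatched format string is the usual suspect.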

Cleaning our air pollution data

We’ll use the British day/month/year format in this case:

df <- df %>% mutate(Date = as.Date(ReadingDateTime,
                                   format = "%d/%m/%Y")) %>%
  select(Date, Value)
  
  Date       Value
  <date>     <dbl>
1 2017-01-01  69.9
2 2017-01-02  57.5
3 2017-01-03  91.9
4 2017-01-04  67.9

Basic plot in ggplot

# install.packages("ggplot2")
library(ggplot2)

ggplot(df, aes(x = Date, y = Value)) +
  geom_point()

What just happened?

We just used ggplot, the leading R visualisation package, to create a scatterplot. ggplot is built on a grammar of graphics, i.e. a chart is composed of several building blocks:

a dataset,
geometries,
a coordinate system

Colours, opacity, scales

alpha is opacity
colours are written in hex codes - What to consider when choosing colours
geom_hline is a new geometry! We can also use geom_vline for a vertical line

ggplot(df, aes(Date, Value)) +
  geom_point(alpha = 0.5, color="#254251") +
  geom_hline(yintercept=40) +
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250),
                     labels = c(40, 100, 150, 200, 250))

Gratuitous styles

library(scales)

df$alpha <- rescale(df$Value, to=c(0,1))

ggplot(df, aes(Date, Value)) +
  geom_point(alpha = df$alpha, color="#254251") +
  geom_hline(yintercept=40) +
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250),
                     labels = c(40, 100, 150, 200, 250))
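
For reference, rescale(x, to = c(0, 1)) performs a min-max rescaling: the smallest value maps to 0 and the largest to 1. Here is a minimal base R sketch of that calculation on invented values. The helper name rescale01 is hypothetical and is not part of the scales package.

```r
# Hypothetical helper: linear min-max rescaling to the range [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

values <- c(40, 100, 250)  # invented readings
print(rescale01(values))   # lowest reading -> 0, highest -> 1
```

Note that the lowest reading gets an alpha of 0, so that point is fully transparent on the chart.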

Averages

We want to calculate a 30-day rolling average. This is super easy in R: we need rollmean, from the zoo package.

Syntax:

rollmean(data$column, period)

#install.packages("zoo")
library(zoo)

df_mean <- df %>%
  mutate(mean = rollmean(Value, 30, fill = NA))

ggplot(df_mean, aes(Date, Value)) +
  geom_hline(yintercept = 40) +
  geom_point(alpha = df_mean$alpha, color = "#254251") +
  geom_line(aes(x = Date, y = mean)) +
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250),
                     labels = c(40, 100, 150, 200, 250))
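
To make clear what a rolling mean computes, here is a minimal sketch of a centred rolling mean on invented numbers. It uses base R's stats::filter rather than zoo so that it runs without extra packages, and a window of 3 instead of 30 so the arithmetic can be checked by hand. The ends, where the window does not fit, come out as NA, which is what fill = NA asks for in the rollmean call above.

```r
x <- c(10, 20, 30, 40, 50)  # invented values
k <- 3                      # window width (odd, so the window is centred)

# Each output value is the average of a value and its neighbours on each side
rolling <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
print(rolling)  # first and last entries are NA: the window does not fit there
```

With a 30-day window, roughly the first and last two weeks of the year have no rolling mean, which is why the line is shorter than the scatter of points.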

All together

We can also use pipes to avoid mutating our dataset as we go along, like so:

dataframe %>%
  do something on it %>%
  like filtering, adding columns, etc %>%
  then send it to ggplot like so %>%
  ggplot() +
    add geometries, etc

library(readr)    # read_csv
library(dplyr)    # %>%, filter, mutate, select
library(zoo)      # rollmean
library(ggplot2)

df <- read_csv("data/airpollutioneuston.csv")
df %>% filter(!is.na(Value)) %>%
    mutate(Date = as.Date(ReadingDateTime,
                          format = "%d/%m/%Y"),
           mean = rollmean(Value, 30, fill = NA)) %>%
    select(Date, Value, mean) %>%
    ggplot() +
    geom_hline(yintercept = 40) +
    # constant alpha and colour go outside aes(); inside aes() they are treated as data
    geom_point(aes(x = Date, y = Value), alpha = 0.5, color = "steelblue") +
    geom_line(aes(x = Date, y = mean)) +
    scale_y_continuous(breaks = c(40, 100, 150, 200, 250),
                       labels = c(40, 100, 150, 200, 250)) +
    ggtitle("Daily NO2 concentration on Euston Road") +
    xlab("Date") + ylab("NO2 concentration") + theme(legend.position="none")

Reading list

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen

http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs

https://www.datacamp.com/community/blog/the-easiest-way-to-learn-ggplot2#gs.QnUNY8Y
