-
Notifications
You must be signed in to change notification settings - Fork 1
/
OpenResExtension.Rmd
211 lines (161 loc) · 5.2 KB
/
OpenResExtension.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
title: "OpenRes extension workshop"
author: Lucy Liu
output:
pdf_document: default
html_document:
keep_md: true
---
#Data import
Make new file. Download data from website - it is large csv 2 million rows so many take some time. Save to your new file. setwd() to be this directory.
```{r}
peds <- read.csv("./Data/Pedestrian_volume__updated_monthly_.csv")
```
#Data cleaning
This data is fairly clean. Are there any NA values?
***
CHALLENGE: Find out if there are any NA values in this dataset.
Hint: you can use the function is.na()
***
```{r, eval=FALSE, echo=TRUE}
sum(is.na(peds))
```
We can have a look at our data with
```{r}
str(peds)
```
Here we can see what data type R has labelled the columns. Which ones make sense?
DateTime does not make sense as factor. There is actually the data type date. Here we have Month, day of the week and time columns which make it really easy to use this data. If we only had the datetime column, it might be difficult to subset this data. Luckily there are the data types for date data, which makes it easier to deal with date data. When you have data that R has labelled as "date" you can do useful thing like find the number of dates between 2 dates, or the number hours!
Date is a data type for date data. We have date and time though! We need to use POXI
***
CHALLENGE: Change the Date_Time column to be of the data type POXIXct.
Hint: as.POSIXct(peds$Date_Time, format = ")
***
```{r}
peds$Date_Time <- as.POSIXct(peds$Date_Time, format = "%d-%b-%Y %H:%M")
```
#Data subsetting
```{r, message=FALSE}
library(dplyr)
```
Make new data frame with just the data from Fridays:
```{r}
fridaydata <- peds %>%
filter(Day == "Friday")
```
***
CHALLENGE:
1. Make a new dataframe with just data from the weekends.
2. Make a new dataframe with data just from Friday at 6pm.
3. Make a new dataframe with data just from the week days (i.e. NOT Saturday or Sunday). Hint: use "!"
***
```{r}
weekend <- peds %>%
filter(Day %in% c("Saturday", "Sunday"))
```
```{r}
weekend <- peds %>%
filter(Day == "Saturday" | Day == "Sunday")
```
```{r}
friday <- peds %>%
filter(Day == "Friday", Time == 18)
```
```{r}
weekday <- peds %>%
filter(! Day %in% c("Sunday", "Saturday"))
```
Use group_by and summarise. Show on board with grouping and how the number of rows == number of groups.
```{r}
peds %>%
group_by(Day) %>%
summarise(mean = mean(Hourly_Counts))
```
***
CHALLENGE:
1. Calculate the total number of counts for each day of the week.
2. Calculate mean counts for each month.
3. Calculate the mean counts on Fridays by location.
***
```{r}
peds %>%
filter(Day == "Friday") %>%
group_by(Sensor_Name) %>%
summarise(mean = mean(Hourly_Counts))
```
#Data visualisation
```{r}
library(ggplot2)
```
Let's start off by plotting the mean counts for each day of the week. Draw what plot looks like.
```{r, eval=FALSE, echo=TRUE}
peds %>%
group_by(Day) %>%
summarise(meancount = mean(Hourly_Counts)) %>%
ggplot() +
geom_bar(aes(y = meancount, x = Day))
```
Does anyone know why I get an error?
Okay to look at why lets just plot a bar graph where we don't get an error.
```{r}
ggplot(peds) +
geom_bar(aes(x = Month))
```
What is this plotting???
We don't want this. We want it to plot the actual mean values we calculated. We add stat = "identity"
```{r}
peds %>%
group_by(Day) %>%
summarise(meancount = mean(Hourly_Counts)) %>%
ggplot() +
geom_bar(aes(y = meancount, x = Day), stat = "identity")
```
***
CHALLENGE:
1. How would you find all the unique sensor names?
2. Plot the mean pedestrian counts for 5 sensor names.
***
Sensory locations:
```{r}
unique(peds$Sensor_Name)
```
```{r}
levels(peds$Sensor_Name)
```
***
CHALLENGE
In what season is the mean pedestrian count highest in? Plot a bar graph to visualise any differences.
***
```{r}
a <- peds %>%
mutate(Season = ifelse(Month %in% c("December", "January","February"), "Summer", ifelse(Month %in% c("March", "April", "May"), "Autumn", ifelse(Month %in% c("June", "July", "August"), "Winter", "Spring")))) %>%
group_by(Season) %>%
summarise(mean1 = mean(Hourly_Counts), sdcount = sd(Hourly_Counts))
a$sdcount <- c(23,45,35,34)
ggplot(a) +
geom_bar(aes(y = mean1, x = Season), stat = "identity") +
geom_errorbar(aes(ymin = mean1 - sdcount, ymax = mean1 + sdcount))
df <- data.frame(a = c("a","b"), mean1 = c(23,25), sd = c(2,3))
ggplot(df) +
geom_bar(aes(x = a, y = mean1), stat = "identity") +
geom_errorbar(aes(ymin = mean1 - sd, ymax = mean1 + sd, x = a), width = .2)
```
#Plot locations
If time at end?
```{r, message=FALSE}
#install.packages("ggmap")
library(ggmap)
```
```{r}
locations <- read.csv("./Data/Pedestrian_sensor_locations.csv")
# Creating a sample data.frame with your lat/lon points
locationdf <- data.frame(lon = locations$Longitude, lat = locations$Latitude)
# getting the map
mapgilbert <- get_map(location = c(lon = mean(locationdf$lon), lat = mean(locationdf$lat)), zoom = 14,
maptype = "roadmap", scale = 2)
# plotting the map with some points on it
ggmap(mapgilbert) +
geom_point(data = locationdf, aes(x = lon, y = lat, fill = "red"), size = 2, shape = 21) +
guides(fill=FALSE, alpha=FALSE, size=FALSE) +
theme_void()
```