forked from susanli2016/Data-Analysis-with-R
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathboston-marathon.Rmd
280 lines (211 loc) · 12.4 KB
/
boston-marathon.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
---
output:
html_document:
smart: false
---
A Data Analysis of the 2016 Boston Marathon Finishers
==============================================================================
By Susan Li
March 28, 2017
As a runner myself, this dataset from [Kaggle](https://www.kaggle.com/) naturally got my attention. The dataset contains the name, age, gender, country, city and state (where available), times at 9 different stages of the race, expected time, finish time and pace, overall place, gender place and division place of 26630 finishers of 2016 Boston Marathon.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
As alway, check the missing values, wrangle and clean up the data.
```{r}
marathon <- read.csv('marathon_finishers_2016.csv', stringsAsFactors = F)
colnames(marathon) <- c('Bib', 'Name', 'Age', 'Gender', 'City', 'State', 'Country', 'Citizen', 'X', '5K', '10K', '15K', '20K', 'Half', '25K', '30K', '35K', '40K', 'Pace', 'Proj_Time', 'official_time', 'overall_place', 'Gender_place', 'Division_place')
# Remove columns that have a lot of missing values
drops <- c("State","Citizen", "X")
marathon <- marathon[ , !(names(marathon) %in% drops)]
marathon[!complete.cases(marathon),]
```
```{r}
dim(marathon)
```
```{r}
library(chron)
marathon$`5K` <- chron(times=marathon$`5K`)
marathon$`10K` <- chron(times = marathon$`10K`)
marathon$`15K` <- chron(times = marathon$`15K`)
marathon$`20K` <- chron(times = marathon$`20K`)
marathon$Half <- chron(times = marathon$Half)
marathon$`25K` <- chron(times = marathon$`25K`)
marathon$`35K` <- chron(times = marathon$`35K`)
marathon$`40K` <- chron(times = marathon$`40K`)
marathon$Pace <- chron(times = marathon$Pace)
marathon$Proj_Time <- chron(times = marathon$Proj_Time)
marathon$official_time <- chron(times = marathon$official_time)
```
### Who were the finishers?
```{r}
by(marathon$Age, marathon$Gender, summary)
```
The youngest finishers were 18 years old, and oldest were 83, this applies to both genders.
```{r}
library(ggplot2)
library(ggthemes)
ggplot(aes(x = Age, fill = Gender), data = marathon) +
geom_histogram(position = position_dodge()) +
scale_x_continuous(breaks = round(seq(min(marathon$Age), max(marathon$Age),
by = 5),10)) +
theme_economist() +
ggtitle('Finishers by age and gender')
```
The number of female finishers is larger than male for younger runners, but this trend is reversed for older runners, and seems 48 is a good age to run for both male and female.
```{r}
ggplot(aes(x=Gender, y=Age, fill=Gender), data = marathon) +
geom_boxplot() +
theme_economist() +
ggtitle('Finishers Boxplot by Age and Gender')
```
Female finishers are younger than male finishers in first quartile, median and third quartile.
```{r}
# Create a new column for age group
marathon$age_group <- cut(marathon$Age, breaks=c(17,25,40,83), labels=c("18-25","26-40","41-83"))
```
```{r}
# need a new dataframe for barplot
library(dplyr)
gender_age_group <- group_by(marathon, Gender, age_group)
marathon_age_gender <- dplyr::summarise(gender_age_group, count = n(),
average_time = mean(official_time))
```
```{r}
marathon_age_gender$average_time <- 60 * 24 * as.numeric(times(marathon_age_gender$average_time))
ggplot(aes(x = Gender, y = count, fill = age_group), data = marathon_age_gender) +
geom_bar(stat = 'identity', position = position_dodge()) +
geom_text(stat='identity', aes(label = count),
data = marathon_age_gender, position = position_dodge(width = 1)) +
ggtitle('Total Finishers by Age and Gender')
```
The split between men and women was 54% (14,463 runners) to 46% (12,167 runners), respectively. And over half (56%) of the finishers were older than 40. There were 9196 male age over 40 compared to just 5263 under 40. The picture looks different for female, there were 5781 female age over 40 compared to 6375 under 40.
```{r}
ggplot(aes(x = Gender, y = average_time, fill=age_group), data=marathon_age_gender) + geom_bar(stat = 'identity', position = position_dodge()) +
geom_text(stat='identity', aes(label = round(average_time)),
data = marathon_age_gender, position = position_dodge(width=1)) +
ylab('Average Finish Time (mins)') +
ggtitle('Finish Time by Age and Gender')
```
The average finish time difference based on age is not significant. 26-40 years old female had the best time in female group. There was almost no differnce on average finish time between 18 and 40 years old in male group.
### Age matters
```{r}
group_mean <- group_by(marathon, Age, Gender)
marathon_mean <- dplyr::summarise(group_mean, count = n(),
mean = mean(official_time))
marathon_mean$mean <- 60 * 24 * as.numeric(times(marathon_mean$mean))
```
```{r}
ggplot(aes(x = Age, y=mean, color=Gender), data = marathon_mean) +
geom_point() + geom_line() +
ylab('Average Finish Time(mins)') +
ggtitle('Finish Time Trend by Age and Gender')
```
We can see young runners run faster as they mature, on average, there best time is at around 30 years old, after that, their finish time slowly increase. Of course there were outliers, several women age between 70 and 74 were faster than 60 years old men on average. There were men age between 75 and 79 finished within 4 hours.
### Elite and the rest of us
According to [Mercedes Marathon](https://www.mercedesmarathon.com/elite-runner-info.php), to qualify for elite runners, the PR finishing time must be at least 2:35(155 mins) for men and 3:05(185 mins) for women. It makes sense to separate them from the rest of us.
```{r}
marathon$official_time <- 60 * 24 * as.numeric(times(marathon$official_time))
marathon_m <- marathon[marathon$Gender == 'M',]
marathon_f <- marathon[marathon$Gender == 'F',]
marathon_elite_m <- marathon_m[marathon_m$official_time <= 155,]
marathon_elite_f <- marathon_f[marathon_f$official_time <= 185,]
marathon_normal_m <- marathon_m[marathon_m$official_time > 155,]
marathon_normal_f <- marathon_f[marathon_f$official_time > 185,]
marathon_elite <- rbind(marathon_elite_m, marathon_elite_f)
marathon_normal <- rbind(marathon_normal_m, marathon_normal_f)
ggplot(aes(x = official_time, color = Gender), data = marathon_elite) +
geom_density() +
xlab('Finish Time(mins)') +
ggtitle('Elite Runners Finish Time')
```
2016 Boston Marathon male and female winners were [Lemi Berhanu Hayle](https://en.wikipedia.org/wiki/Lemi_Berhanu_Hayle) and [Atsede Baysa](https://en.wikipedia.org/wiki/Atsede_Baysa), crossed the line in 2:12:45 and 2:29:19, respectively. Most of the elite male runners crossed the finish line shortly after the 150-minute mark, and a great propotion of elite female runners crossed the finish line around the 180-minute mark.
```{r}
by(marathon_normal$official_time, marathon_normal$Gender, summary)
ggplot(aes(x = official_time, color = Gender), data = marathon_normal) +
geom_density() +
scale_x_continuous(breaks = seq(155, 630, 25)) +
xlab('Finish Time(mins)') +
ggtitle('Average Runners Finish Time')
```
For the rest of us, the fastest male crossed finished line in 155 minutes (2:35) and 185 minutes (3:05) for female. The peak finish time for male was around 205 minutes (3:25) and for female was around 230 minutes (3:50).
When it comes to running, there is a gender gap between male and female, male, on average faster than female. And this gap is greater among the runners who finish last than among those who finish first.
### Split
In running , a negative split is a racing strategy that involves completing the second half of a race faster than the first half. In contrast, positive split means running the second half slower than the first half. Even splits are where the two halves of the race are run in the same amount of time.
To add split in my analysis, I will need to do some calculation and add a column.
```{r}
marathon$`5K` <- 60 * 24 * as.numeric(times(marathon$`5K`))
marathon$`10K` <- 60 * 24 * as.numeric(times(marathon$`10K`))
marathon$`15K` <- 60 * 24 * as.numeric(times(marathon$`15K`))
marathon$`20K` <- 60 * 24 * as.numeric(times(marathon$`20K`))
marathon$Half <- 60 * 24 * as.numeric(times(marathon$Half))
marathon$`25K` <- 60 * 24 * as.numeric(times(marathon$`25K`))
marathon$`30K` <- 60 * 24 * as.numeric(times(marathon$`30K`))
marathon$`35K` <- 60 * 24 * as.numeric(times(marathon$`35K`))
marathon$`40K` <- 60 * 24 * as.numeric(times(marathon$`40K`))
marathon$`5K`<- round(marathon$`5K`/5., 2)
marathon$`10K`<- round(marathon$`10K`/10., 2)
marathon$`15K` <- round(marathon$`15K`/15., 2)
marathon$`20K` <- round(marathon$`20K`/20., 2)
marathon$`25K` <- round(marathon$`25K`/25., 2)
marathon$Half <- round(marathon$Half/21., 2)
marathon$`30K` <- round(marathon$`30K`/30., 2)
marathon$`35K` <- round(marathon$`35K`/35., 2)
marathon$`40K` <- round(marathon$`40K`/40., 2)
marathon$secondhalf <- with(marathon, (`40K`+`35K`+`30K`+`25K`)/4)
marathon$firsthalf <- with(marathon, (`5K`+`10K`+`15K`+`20K`+Half)/5)
marathon$split <- with(marathon, secondhalf-firsthalf)
```
```{r}
ggplot(aes(x = split, fill = Gender, color = Gender), data = marathon) +
geom_density(position = 'stack') +
xlab('Split') +
ggtitle('Split Distribution') +
theme_economist()
```
Majority of the runners ran a positive split. They ran second half slower than the first half, but not by much. There are a very small percent of runners ran a negative split. Are they elite runners?
```{r}
marathon_elite$`5K` <- 60 * 24 * as.numeric(times(marathon_elite$`5K`))
marathon_elite$`10K` <- 60 * 24 * as.numeric(times(marathon_elite$`10K`))
marathon_elite$`15K` <- 60 * 24 * as.numeric(times(marathon_elite$`15K`))
marathon_elite$`20K` <- 60 * 24 * as.numeric(times(marathon_elite$`20K`))
marathon_elite$Half <- 60 * 24 * as.numeric(times(marathon_elite$Half))
marathon_elite$`25K` <- 60 * 24 * as.numeric(times(marathon_elite$`25K`))
marathon_elite$`30K` <- 60 * 24 * as.numeric(times(marathon_elite$`30K`))
marathon_elite$`35K` <- 60 * 24 * as.numeric(times(marathon_elite$`35K`))
marathon_elite$`40K` <- 60 * 24 * as.numeric(times(marathon_elite$`40K`))
marathon_elite$`5K`<- round(marathon_elite$`5K`/5., 2)
marathon_elite$`10K`<- round(marathon_elite$`10K`/10., 2)
marathon_elite$`15K` <- round(marathon_elite$`15K`/15., 2)
marathon_elite$`20K` <- round(marathon_elite$`20K`/20., 2)
marathon_elite$`25K` <- round(marathon_elite$`25K`/25., 2)
marathon_elite$Half <- round(marathon_elite$Half/21., 2)
marathon_elite$`30K` <- round(marathon_elite$`30K`/30., 2)
marathon_elite$`35K` <- round(marathon_elite$`35K`/35., 2)
marathon_elite$`40K` <- round(marathon_elite$`40K`/40., 2)
marathon_elite$secondhalf <- with(marathon_elite, (`40K`+`35K`+`30K`+`25K`)/4)
marathon_elite$firsthalf <- with(marathon_elite, (`5K`+`10K`+`15K`+`20K`+Half)/5)
marathon_elite$split <- with(marathon_elite, secondhalf-firsthalf)
```
```{r}
ggplot(aes(x = split, fill = Gender, color = Gender), data = marathon_elite) +
geom_density(position = 'stack') +
xlab('Split') +
ggtitle('Elite Runners Split Distribution') +
theme_economist()
```
Indeed, there seems more negative split runners in the elite group, and even they run positive split, the difference is very very small(0 to 0.2), this indicates that they were able to maintain a very steady pace.
```{r}
ggplot(aes(x = official_time, y = split, color = Gender), data = marathon) +
geom_point(alpha=1/5) +
geom_smooth(method='loess') +
scale_color_brewer(type = "qual", palette = "Set1") +
xlab('Finish Time (mins)') +
ggtitle('Split and Finish Time') +
theme_economist()
```
The general trend shows a moderate correlation between split and finish time, while a more positive split a runner runs, the more time he (or she) will need to finish.
### Something else to consider
Because this is Boston, Up until about the 25K, most of the race is downhill. when reaching what they call "The Newton Hill" that end at about the 33.7K at the top of the so called ["Heartbreak hill"](https://en.wikipedia.org/wiki/Heartbreak_Hill). This hill is a test for many runners.
### End
Boston Marathon 2017 is only less than three weeks away. If you run Boston this year, I hope this article can put your race in context. Enjoy your race day!