---
title: "Boulder Peak Triathlon Race Data"
output:
  html_document:
    df_print: paged
  pdf_document: default
---
### **The goal of this project is to take the results data from a triathlon I recently completed and analyze it to better understand where I did well and where I can improve.**
### Steps of the project will include:
1. Scraping the results table from a webpage and compiling it into a usable table.
2. Cleaning the table in R and performing basic analysis.
3. Exporting the data to Tableau, where I can perform exploratory analysis and display my findings.
```{r}
#load the necessary packages
library(tidyverse)
#general-purpose collection of data manipulation and visualization tools
library(rvest)
#functions for reading and parsing HTML
library(RSelenium)
#drives a live browser instance from R
library(netstat)
#used to find available TCP ports
library(data.table)
#fast table operations, including rbindlist() for combining scraped pages
```
I started by installing the necessary packages (a one-time setup step) and setting up my environment.
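If the packages are not already installed, a one-time `install.packages()` call takes care of it (a minimal sketch, set to `eval=FALSE` so it does not rerun on every knit):
```{r, eval=FALSE}
#one-time installation of the packages used in this project
install.packages(c("tidyverse", "rvest", "RSelenium", "netstat", "data.table"))
```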
The next step is to access the webpage with my race results, locate the results table, and scrape its data.
```{r}
rD <- rsDriver(browser = "firefox",
               chromever = NULL,
               geckover = "latest",
               verbose = FALSE,
               port = free_port(),
               check = FALSE,
               iedrver = NULL)
#set up the RSelenium server and browser, picking an open port with netstat's free_port()
#using Firefox with the latest geckodriver; chromever is NULL since Chrome is not used
remDr <- rD$client
remDr$navigate("https://racingunderground.racetecresults.com/results.aspx?CId=16436&RId=5047")
#opens Selenium and navigates to the triathlon results page
results_table <- remDr$findElement(using = "id", "ctl00_Content_Main_divGrid")
#selects the element containing the results grid
```
The race results webpage has been reached and the table containing the data located. The problem with simply storing these results in a table is that they are spread across 8 pages, and the web address does not change from page to page. I need a way to cycle through each page and store the results as I go.
```{r}
compiled_results_raw <- list()
#create a blank list to compile pages into
for (x in 2:8) {
  results_table_html <- results_table$getPageSource()
  page <- read_html(results_table_html %>% unlist())
  results <- html_table(page) %>% .[[2]]
  #reads the currently displayed page and stores its second HTML table
  compiled_results_raw <- rbindlist(list(compiled_results_raw, results))
  #appends the current page's results to the previously compiled results table
  next_button <- remDr$findElement(using = "xpath", paste("//td[(((count(preceding-sibling::*) + 1) = ", x, ") and parent::*)]//a"))
  next_button$clickElement()
  #simulates a click on the pager link for page x
  #the site has no 'next page' button, so actual page numbers have to be used, updated on each loop
}
results_table_html <- results_table$getPageSource()
page <- read_html(results_table_html %>% unlist())
results <- html_table(page) %>% .[[2]]
compiled_results_raw <- rbindlist(list(compiled_results_raw, results))
#stores the final (8th) page after the loop completes, since each iteration scrapes before clicking
remDr$close()
rD$server$stop()
rm(rD)
gc()
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
#close the browser, stop the Selenium server, and kill any leftover Java process (Windows-specific)
```
We got the data! The loop clicks through each page, adding the currently displayed table to the compiled results before moving to the next page. That's one major step down before starting analysis.
Now the table needs to be inspected to make sure the data was scraped properly and to see if any transformations are required.
```{r}
write.csv(compiled_results_raw, "Data/compiled_results_raw.csv", row.names = FALSE)
#save an untouched copy of the raw scrape
glimpse(compiled_results_raw)
```
Using "Glimpse" it looks like we have 358 total results which is great, the exact number of total rows on the webpage. Initially, it is apparent there are a couple repeat columns and some columns that have unnecessary info for my analysis, such as "Fav" and "Share" which were used in the web browser to bookmark and share data to social media. Lets remove these unnecessary columns and rename some columns to more clear names.
```{r}
library(lubridate)
#lubridate provides the hms() and period_to_seconds() helpers used below
compiled_results <- compiled_results_raw
#create a new table to transform, preserving the raw scraped data
colnames(compiled_results)[6] <- "Info"
colnames(compiled_results)[8] <- "Time_Duplicate"
colnames(compiled_results)[9] <- "Cat_Gen_Pos"
#rename the columns with duplicate names so they can be dropped by name
compiled_results <- subset(compiled_results, select = -c(Fav, Share, Info, Time_Duplicate, Cat_Gen_Pos))
#trim off the unnecessary columns
compiled_results <- compiled_results %>%
  mutate(Name = str_to_title(Name),
         Time = hms(Time),
         Swim = hms(Swim),
         T1 = hms(T1),
         Cycle = hms(Cycle),
         T2 = hms(T2),
         Run = hms(Run))
#convert the time strings to hms periods and Name to title case
compiled_results <- compiled_results %>%
  mutate(Time = period_to_seconds(Time),
         Swim = period_to_seconds(Swim),
         T1 = period_to_seconds(T1),
         Cycle = period_to_seconds(T2 <- period_to_seconds(T2)*0 + T2, T2)*0 + Cycle,
         T2 = period_to_seconds(T2),
         Run = period_to_seconds(Run))
#convert each time to a number of seconds for easier calculations in Tableau
```
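As a quick illustration of what the two-step conversion does (the time shown here is a made-up example, not a row from the data):
```{r}
#example only: a 1 hour, 23 minute, 45 second time becomes
#3600 + 23*60 + 45 = 5025 seconds
period_to_seconds(hms("01:23:45"))
```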
A new table has been created to hold all the transformations, leaving the raw scraped data untouched. Unnecessary columns have been removed, and the time columns have been converted to a more usable format using lubridate. Now I will rerun glimpse() to confirm the table was updated properly and run summary() to see if any initial insights can be gathered.
```{r}
glimpse(compiled_results)
summary(compiled_results)
```
Several column names contain spaces, which could prove problematic during analysis, so I am renaming them without spaces. There are also NAs in several columns, most likely from competitors who did not finish the race. For the purposes of my analysis, I will remove these competitors in a separate table, as shown after the check below.
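Before dropping them, a quick count of incomplete rows shows how many competitors this affects (a simple sanity check; the count depends on the scraped data):
```{r}
#count rows with at least one NA, likely competitors who did not finish
sum(!complete.cases(compiled_results))
```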
```{r}
setnames(compiled_results, "Race No", "Race_No")
setnames(compiled_results, "Cat Pos", "Cat_Pos")
setnames(compiled_results, "Gen Pos", "Gen_Pos")
#rename the potentially problematic column names
compiled_results_null <- na.omit(compiled_results)
#create a second table with NA rows removed, dropping racers who did not finish
summary(compiled_results_null)
glimpse(compiled_results_null)
print(as_tibble(compiled_results_null))
write.csv(compiled_results, "Data/compiled_results.csv", row.names = FALSE)
write.csv(compiled_results_null, "Data/compiled_results_null.csv", row.names = FALSE)
#save the .csv's to the Data folder for use in visualizations
```
The data has been successfully scraped and cleaned for use in analysis. That's another major step down!
The next step will be to load the cleaned data into Tableau and perform some exploratory analysis to see if there are any interesting insights to be gained.
See the link below to my Tableau Public profile for the visualizations created from this data.
### [Boulder Peak Triathlon Results](https://public.tableau.com/app/profile/clayton.stewart/viz/BoulderPeakTriathlonResults/MyTriathalonResults#1)