Prediction of March Madness 2018 Men's Tournament
The detailed project report can be viewed here.
- Install Apache Spark and Apache Hadoop
- Download and install R v3.4.3
- Download and install RStudio v1.1.383
- If you have not yet installed the libraries I used in this project, open the R console in RStudio and run the following lines:

      install.packages("caret")
      install.packages("SparkR")
      install.packages("dplyr")
      install.packages("magrittr")
      install.packages("tidyr")
      install.packages("ggplot2")

- Open `FullTeamData.rmd` in RStudio
- Go to line 11 (it will contain: `currentYear <- 2016`)
- Set `currentYear` to the value `2010` (it should look like: `currentYear <- 2010`)
- Knit the file; this will create a `FullTeamData.html` file in the current directory that shows the results and a `FullTeamData2010.csv` file in the Data folder
- Repeat the previous two steps (set `currentYear`, then knit) for the values `2011`, `2012`, `2013`, `2014`, `2015`, `2016`, `2017`, and `2018`
- Open `Testing.rmd` in RStudio
- Knit the file (this will take about 20 minutes to complete); this will create a `Testing.html` file that shows the results
- Open `Submission.rmd` in RStudio
- Knit the file; this will create a `Submission.html` file that shows the results, and `submission_v1.csv`, `submission_v1_forBracket.csv`, `submission_v2.csv`, and `submission_v2_forBracket.csv` in the current directory
  - `submission_v1.csv` and `submission_v2.csv` were the files submitted to the Kaggle competition; `submission_v1_forBracket.csv` and `submission_v2_forBracket.csv` were used to create brackets for the NCAA March Madness Bracket Challenge
- To view the results of the three R scripts (`FullTeamData.rmd`, `Testing.rmd`, and `Submission.rmd`), open the respective HTML files that were created in any browser
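
Editing `currentYear` and knitting `FullTeamData.rmd` nine times by hand can be scripted. The following is a minimal sketch, not part of the original project: it assumes the `rmarkdown` package is installed, that the script runs from the directory containing `FullTeamData.rmd`, and that the file contains exactly one line of the form `currentYear <- YYYY`. The helper `set_current_year` is a hypothetical name introduced here for illustration.

```r
# Replace the hard-coded year assignment in the lines of an R Markdown file.
# Assumption: the assignment sits on its own line as `currentYear <- YYYY`.
set_current_year <- function(lines, year) {
  pattern <- "^[[:space:]]*currentYear[[:space:]]*<-[[:space:]]*[0-9]{4}[[:space:]]*$"
  hits <- grepl(pattern, lines)
  if (sum(hits) != 1) {
    stop("expected exactly one 'currentYear <- YYYY' line, found ", sum(hits))
  }
  lines[hits] <- sprintf("currentYear <- %d", as.integer(year))
  lines
}

rmd <- "FullTeamData.rmd"

# Only run when the file and the rmarkdown package are actually available.
if (file.exists(rmd) && requireNamespace("rmarkdown", quietly = TRUE)) {
  original <- readLines(rmd)
  tryCatch({
    for (year in 2010:2018) {
      # Rewrite the year in place, then knit, as in the manual steps.
      writeLines(set_current_year(original, year), rmd)
      rmarkdown::render(rmd, quiet = TRUE)
    }
  }, finally = writeLines(original, rmd))  # always restore the original file
}
```

As with the manual procedure, each run overwrites `FullTeamData.html`, so only the last year's report remains; the per-year CSV files are what the later scripts rely on.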
`FullTeamData.rmd` creates 9 datasets (`FullTeamData2010.csv`, ..., `FullTeamData2018.csv`), each of which combines all of the data from the 52 datasets given by Kaggle into one dataset for that year. `Testing.rmd` tests six different Logistic Regression models on every year's data, which comes from the datasets created in `FullTeamData.rmd`. `Submission.rmd` uses the two best Logistic Regression models on the 2018 data to create the output files used in the two competitions (Kaggle and NCAA Bracket Challenge).
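
To illustrate the kind of model comparison described above, here is a minimal sketch of fitting and scoring one logistic regression in base R. It is not the project's code. The data is synthetic, the column names `seed_diff` and `win` are hypothetical, and log loss is used only as one common way to score probability predictions; this README does not state which features or metric the project used.

```r
set.seed(1)  # reproducible toy data

# Synthetic matchups, NOT tournament data. `seed_diff` is a hypothetical
# feature (team 1 seed minus team 2 seed); `win` is 1 when team 1 wins.
n <- 200
seed_diff <- sample(-15:15, n, replace = TRUE)
win <- rbinom(n, size = 1, prob = plogis(-0.15 * seed_diff))
games <- data.frame(seed_diff = seed_diff, win = win)

# Fit on the first 150 games and evaluate on the remaining 50.
train <- games[1:150, ]
test  <- games[151:200, ]

fit  <- glm(win ~ seed_diff, data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")  # win probabilities

# Mean log loss; probabilities are clipped to avoid log(0).
log_loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

log_loss(test$win, pred)
```

Comparing several candidate models would then amount to repeating the fit with different formulas and keeping the ones with the lowest held-out score.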