Brazilians love soccer; for some people here it is like a religion. Every Sunday at 4 pm the country stops to watch the weekly match, and the most popular competition is the Campeonato Brasileiro, the biggest championship in Brazil, which runs for about 9 months. It has 20 teams playing a double round-robin over 38 rounds.
Within this great championship, 11 players take the field on each side, each with their own performance and efficiency. Knowing this, we start to ask some questions:
- Can we predict the results of a football game before it starts?
- How much does a player’s performance affect the results of the match?
- Can we model a Machine Learning algorithm to predict the game’s final result?
- How far can the predictive power of an Artificial Intelligence get?
To answer these questions, I decided to develop a Machine Learning model based on the performance data of each player generated by Globo's fantasy game, CartolaFC, and on the conditions of each game. As tools, I used the R programming language in RStudio and the H2O package, a very powerful library designed specifically for Artificial Intelligence projects. Link: https://www.h2o.ai/
So, let's get started!
First, we must find a good dataset containing the clubs, matches and scores of CartolaFC. I was able to get a well-organized dataset on Kaggle: https://www.kaggle.com/schiller/cartolafc
Importing the packages and reading the data:
library(h2o)
library(plyr)
library(dplyr)

# start the local H2O cluster
h2o.init()

# match-level data (one row per game)
mydata <- read.csv("CartolaFC Predict/Datasets/2017_partidas.csv",
                   header = TRUE, sep = ",")
# player-level scout data (one row per player per round)
scouts <- read.csv("CartolaFC Predict/Datasets/2017_scouts.csv",
                   header = TRUE, sep = ",")
After importing the CSVs, I realized that the data did not contain the result of each game as a win or loss, only the score. So I transformed the data, generating a column with the result of the match from the home team's perspective: v for victory, d for defeat and e for a tie (empate).
matchResults <- data.frame(
  "resultado" = ifelse(mydata$placar_oficial_mandante > mydata$placar_oficial_visitante, "v",
                ifelse(mydata$placar_oficial_mandante == mydata$placar_oficial_visitante, "e", "d")))
Now we have one data frame with the features and another with the label, the result of each match. With this alone we could already build a model with some interesting features, such as the teams, each club's position in the table and whether it plays at home or away. However, we can get a much richer dataset by merging it with the players' scout dataset.
We will have a bit of work here. In the scout dataset, each line represents a player: his position on the field, his score in the last match, the variation of that score, his purchase value, his scoring average, the round and his team.
The idea was to aggregate the players of each team into Defense, Attack and Coach groups. For now we aggregate by summing the stats of the players in each area, so a team with more attackers gets a higher attack score than a team with fewer attackers, quantifying the offensiveness of the team, for example.
We build a dataset with these stats for the home and away teams and join it with the initial dataset, giving us many more features to train the model: features that represent the current conditions of the game plus a history of each team's players combined into attack, defense and coach. The whole process is in the code below:
# generate the defense, attack and coach stats for each team
t0 <- data.frame("rodada" = 0, "clube_id" = 0,
                 "cDefPonto" = 0, "cDefPreco" = 0, "cDefVar" = 0, "cDefMedia" = 0,
                 "cAtaPonto" = 0, "cAtaPreco" = 0, "cAtaVar" = 0, "cAtaMedia" = 0,
                 "cTecPonto" = 0, "cTecPreco" = 0, "cTecVar" = 0, "cTecMedia" = 0)
for (i in 1:19) {
  for (j in 262:373) {
    # defensive positions (posicao_id 1, 2, 3)
    t1 <- filter(scouts, rodada_id == i & clube_id == j &
                   (posicao_id == 1 | posicao_id == 2 | posicao_id == 3))
    # attacking positions (posicao_id 4, 5)
    t2 <- filter(scouts, rodada_id == i & clube_id == j &
                   (posicao_id == 4 | posicao_id == 5))
    # coach (posicao_id 6)
    t3 <- filter(scouts, rodada_id == i & clube_id == j & (posicao_id == 6))
    t6 <- data.frame("rodada" = t1$rodada_id[1], "clube_id" = t1$clube_id[1],
                     "cDefPonto" = sum(t1$pontos_num), "cDefPreco" = sum(t1$preco_num),
                     "cDefVar" = sum(t1$variacao_num), "cDefMedia" = sum(t1$media_num),
                     "cAtaPonto" = sum(t2$pontos_num), "cAtaPreco" = sum(t2$preco_num),
                     "cAtaVar" = sum(t2$variacao_num), "cAtaMedia" = sum(t2$media_num),
                     "cTecPonto" = sum(t3$pontos_num), "cTecPreco" = sum(t3$preco_num),
                     "cTecVar" = sum(t3$variacao_num), "cTecMedia" = sum(t3$media_num))
    t0 <- rbind(t0, t6)
  }
}
# drop rows for club ids that did not play in the round
teste <- filter(t0, !is.na(rodada))
# collect the stats for both teams in each match
b0 <- data.frame("rodada" = 0, "clube_casa_id" = 0,
                 "casaDef" = 0, "casaDefMedia" = 0, "casaDefPreco" = 0, "casaDefVar" = 0,
                 "casaAtk" = 0, "casaAtkMedia" = 0, "casaAtkPreco" = 0, "casaAtkVar" = 0,
                 "casaTec" = 0, "casaTecMedia" = 0, "casaTecPreco" = 0, "casaTecVar" = 0,
                 "visitanteDef" = 0, "visitanteDefMedia" = 0, "visitanteDefPreco" = 0, "visitanteDefVar" = 0,
                 "visitanteAtk" = 0, "visitanteAtkMedia" = 0, "visitanteAtkPreco" = 0, "visitanteAtkVar" = 0,
                 "visitanteTec" = 0, "visitanteTecMedia" = 0, "visitanteTecPreco" = 0, "visitanteTecVar" = 0)
dataset <- mydata
for (i in 1:19) {
  b1 <- filter(mydata, rodada_id == i)
  for (j in 1:length(b1$clube_casa_id)) {
    casa      <- filter(teste, rodada == i, clube_id == b1$clube_casa_id[j])
    visitante <- filter(teste, rodada == i, clube_id == b1$clube_visitante_id[j])
    b5 <- data.frame("rodada" = b1$rodada_id[j], "clube_casa_id" = b1$clube_casa_id[j],
                     "casaDef" = casa$cDefPonto, "casaDefMedia" = casa$cDefMedia,
                     "casaDefPreco" = casa$cDefPreco, "casaDefVar" = casa$cDefVar,
                     "casaAtk" = casa$cAtaPonto, "casaAtkMedia" = casa$cAtaMedia,
                     "casaAtkPreco" = casa$cAtaPreco, "casaAtkVar" = casa$cAtaVar,
                     "casaTec" = casa$cTecPonto, "casaTecMedia" = casa$cTecMedia,
                     "casaTecPreco" = casa$cTecPreco, "casaTecVar" = casa$cTecVar,
                     "visitanteDef" = visitante$cDefPonto, "visitanteDefMedia" = visitante$cDefMedia,
                     "visitanteDefPreco" = visitante$cDefPreco, "visitanteDefVar" = visitante$cDefVar,
                     "visitanteAtk" = visitante$cAtaPonto, "visitanteAtkMedia" = visitante$cAtaMedia,
                     "visitanteAtkPreco" = visitante$cAtaPreco, "visitanteAtkVar" = visitante$cAtaVar,
                     "visitanteTec" = visitante$cTecPonto, "visitanteTecMedia" = visitante$cTecMedia,
                     "visitanteTecPreco" = visitante$cTecPreco, "visitanteTecVar" = visitante$cTecVar)
    b0 <- rbind(b0, b5)
  }
}
# join with the original match table
datasetFinal <- merge(mydata, b0,
                      by.x = c("rodada_id", "clube_casa_id"),
                      by.y = c("rodada", "clube_casa_id"))
Now we can start to train the model.
One of the most important factors in Data Science is asking the right question to guide the process. In this case, what do we want to predict? The result of the match. The score? No, just whether the home team will win, tie or lose. Since we are not quantifying a number, we start with a classification model.
What are the possible results of the match? We have v for victory, e for tie and d for defeat. So we have a Multinomial Classification model (because the label is not binary).
Among the models offered by the H2O library, I chose to use a Generalized Linear Model (GLM). "In addition to the Gaussian (i.e., normal) distribution, these include the Poisson, binomial and gamma distributions, each serving a different purpose and, depending on the distribution and the link function, can be used for prediction or classification." (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html)
The multinomial model is a generalization of the binomial model, used for response variables with several classes. Similar to the binomial family, the GLM models the conditional probability of observing class c given x. A vector of coefficients exists for each of the output classes, and together they form a matrix β. The probabilities are calculated as a softmax over one linear predictor per class:

P(y = c | x) = exp(x'β_c + β_c0) / Σ_k exp(x'β_k + β_k0)
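As a toy illustration of that softmax step (the scores below are invented, not taken from the trained model):

```r
# Toy softmax: turn one linear score per class into class probabilities.
# The scores are hypothetical values of x' beta_c for d, e and v.
softmax <- function(z) exp(z) / sum(exp(z))
scores <- c(d = -0.2, e = -0.5, v = 0.4)
probs <- softmax(scores)
probs          # three probabilities, largest for the class with the largest score
sum(probs)     # always 1
```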
Now that we have chosen the model, we must move the datasets into the H2O environment and convert the label column into a factor.
# combine the datasets
matchResults <- data.frame(
  "resultado" = ifelse(datasetFinal$placar_oficial_mandante > datasetFinal$placar_oficial_visitante, "v",
                ifelse(datasetFinal$placar_oficial_mandante == datasetFinal$placar_oficial_visitante, "e", "d")))
matchResults <- as.h2o(matchResults)
partidas <- as.h2o(datasetFinal)
partidas <- h2o.cbind(partidas, matchResults)
# convert the response column to a factor
partidas["resultado"] <- as.factor(partidas["resultado"])
We create a vector with the feature names and the label, split the dataset into training and validation sets, and run the model.
# set the predictor names and the response column name
predictors <- c("clube_casa_posicao", "clube_visitante_posicao", "clube_casa_id", "clube_visitante_id",
                "casaDef", "casaDefMedia", "casaDefPreco", "casaDefVar",
                "casaAtk", "casaAtkMedia", "casaAtkPreco", "casaAtkVar",
                "casaTec", "casaTecMedia", "casaTecPreco", "casaTecVar",
                "visitanteDef", "visitanteDefMedia", "visitanteDefPreco", "visitanteDefVar",
                "visitanteAtk", "visitanteAtkMedia", "visitanteAtkPreco", "visitanteAtkVar",
                "visitanteTec", "visitanteTecMedia", "visitanteTecPreco", "visitanteTecVar")
response <- "resultado"
# split into train and validation sets
partidas.split <- h2o.splitFrame(data = partidas, ratios = 0.80, seed = 1234)
train <- partidas.split[[1]]
valid <- partidas.split[[2]]
# GLM model
partidas_glm <- h2o.glm(family = "multinomial", x = predictors, y = response, training_frame = train)
Validation
After training, we can inspect the Confusion Matrix to see how the model performed on the training data.
> h2o.confusionMatrix(partidas_glm)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
d e v Error Rate
d 47 2 2 0.0784 = 4 / 51
e 7 15 12 0.5588 = 19 / 34
v 1 1 67 0.0290 = 2 / 69
Totals 55 18 81 0.1623 = 25 / 154
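As a quick sanity check, the headline numbers follow from the matrix's diagonal and row totals:

```r
# Overall training accuracy from the confusion matrix counts above
correct <- 47 + 15 + 67   # diagonal: d, e and v predicted correctly
total   <- 51 + 34 + 69   # row totals (all 154 training matches)
accuracy <- correct / total
round(accuracy, 4)        # ~0.8377, i.e. an error rate of ~0.1623 (25 / 154)
```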
We can see that the model performed well, with more than 83% accuracy and about a 16% error rate. It identified victories and defeats well but struggled to predict draws, with a 56% error rate, which was expected, since draws are the hardest outcome to predict. For validation we will execute the following code:
> h2o.performance(partidas_glm, newdata = valid, train = FALSE, valid = TRUE,
xval = FALSE)
Test Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.2233906
RMSE: (Extract with `h2o.rmse`) 0.4726422
Logloss: (Extract with `h2o.logloss`) 0.6190762
Mean Per-Class Error: 0.2820513
Null Deviance: (Extract with `h2o.nulldeviance`) 79.43712
Residual Deviance: (Extract with `h2o.residual_deviance`) 43.33533
R^2: (Extract with `h2o.r2`) 0.627175
AIC: (Extract with `h2o.aic`) NaN
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
d e v Error Rate
d 8 0 0 0.0000 = 0 / 8
e 5 2 6 0.8462 = 11 / 13
v 0 0 14 0.0000 = 0 / 14
Totals 13 2 20 0.3143 = 11 / 35
As it is a multinomial model, we analyze the validation Confusion Matrix. Here the model had a 31% error rate and 69% accuracy, again hurt by the difficulty of forecasting draws. All the victories and defeats were predicted correctly. The model presented an R^2 of 0.627.
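The Mean Per-Class Error reported above can be reproduced from the per-class error rates in the validation matrix:

```r
# Mean per-class error: the average of the three class error rates (d, e, v)
per_class_error <- c(d = 0 / 8, e = 11 / 13, v = 0 / 14)
mean(per_class_error)  # ~0.2821, matching the reported 0.2820513
```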
Feature Selection is a process that aims to choose the most important variables, reducing dimensionality and avoiding the curse of dimensionality. In H2O we can use the Variable Importance function, which quantifies the importance of each variable to the model.
h2o.varimp(partidas_glm)
Standardized Coefficient Magnitudes: standardized coefficient magnitudes
names coefficients sign
1 visitanteTec 0.829742 POS
2 casaDef 0.741036 POS
3 casaTec 0.672827 POS
4 visitanteDef 0.628416 POS
5 casaAtk 0.459256 POS
---
names coefficients sign
24 visitanteAtkMedia 0.000000 POS
25 visitanteAtkPreco 0.000000 POS
26 visitanteAtkVar 0.000000 POS
27 visitanteTecMedia 0.000000 POS
28 visitanteTecPreco 0.000000 POS
29 NA NA
We can see that in this model the away team's coach, the home team's defense and coach, the away team's defense and the home team's attack were the most important variables, while the average, price and variation variables did not influence the model's accuracy and can be excluded to simplify it.
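As a sketch of that simplification (my own suggestion, not code from the original pipeline), the zero-importance media, preco and variacao columns could be dropped before retraining:

```r
# Hypothetical reduced predictor set: keep only the columns that
# h2o.varimp flagged as having nonzero importance
predictors_reduced <- c("clube_casa_posicao", "clube_visitante_posicao",
                        "clube_casa_id", "clube_visitante_id",
                        "casaDef", "casaAtk", "casaTec",
                        "visitanteDef", "visitanteAtk", "visitanteTec")
length(predictors_reduced)  # 10 features instead of 28
# Retrain exactly as before, just with fewer columns:
# partidas_glm2 <- h2o.glm(family = "multinomial", x = predictors_reduced,
#                          y = response, training_frame = train)
```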
To test the model, I went to https://www.cartolafcbrasil.com.br and took the 8th round of the 2018 Brazilian Championship to predict the game between Flamengo and Bahia.
I took the data for the two teams from the previous match (round 7) and set up a data frame for the prediction. Note: for price, variation and average I used 0 or repeated the score value; since these variables were not influencing the model, this should not be much of a problem for now.
# predict on the validation set
predict(partidas_glm, valid)

# one new row: Flamengo (home) vs Bahia (away), round 8 of 2018
x <- data.frame("clube_casa_posicao" = 2, "clube_visitante_posicao" = 16,
                "clube_casa_id" = 262, "clube_visitante_id" = 265,
                "casaDef" = 37.7, "casaDefMedia" = 37.7, "casaDefPreco" = 0, "casaDefVar" = 0,
                "casaAtk" = 37.6, "casaAtkMedia" = 37.6, "casaAtkPreco" = 0, "casaAtkVar" = 0,
                "casaTec" = 6.33, "casaTecMedia" = 6.33, "casaTecPreco" = 0, "casaTecVar" = 0,
                "visitanteDef" = 38.7, "visitanteDefMedia" = 38.7, "visitanteDefPreco" = 0, "visitanteDefVar" = 0,
                "visitanteAtk" = 40.4, "visitanteAtkMedia" = 40.4, "visitanteAtkPreco" = 0, "visitanteAtkVar" = 0,
                "visitanteTec" = 7.19, "visitanteTecMedia" = 7.19, "visitanteTecPreco" = 0, "visitanteTecVar" = 0)
newdata <- as.h2o(x)
predict(partidas_glm, newdata)
And so we get the result:
predict d e v
1 v 0.3668072 0.2423011 0.3908917
The prediction is a home win for Flamengo with a 39% probability, and the forecast turned out to be correct for this game.
The initial model showed good predictive power, with satisfactory training and validation performance, indicating that it can be refined and used to predict soccer matches using CartolaFC fantasy data as input. The model can be used for fun or even for betting. I will continue refining it and testing predictions on other matches, and I am available for tips, insights and questions. I hope this has helped in your journey through Data Science and Machine Learning. Here are some improvements that could be made in a future article.
As improvements, we could better represent the stats for each area of the team by using an average instead of a sum, since teams make substitutions and squad sizes vary.
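A minimal sketch of that change, on toy data (the column names mimic the scouts dataset; the values are invented):

```r
# Toy scouts rows for one club's attacking positions (posicao_id 4, 5):
# summing rewards squad size, averaging does not
toy <- data.frame(posicao_id = c(4, 4, 5),
                  pontos_num = c(6.0, 3.0, 9.0))
atk <- toy$pontos_num[toy$posicao_id %in% c(4, 5)]
sum(atk)   # 18: grows with the number of attackers
mean(atk)  # 6: comparable across squads of different sizes
```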
We could also quantify each team's recent form, taking into account the scores and average of the last 3 games, points in the table, goals scored, goals conceded, goal difference, and the number of wins, draws and losses. We could use a larger dataset that also includes the 2014, 2015 and 2016 championships. We could try models that may fit the data better and reduce the error, such as neural networks, Naive Bayes or Random Forest. The use of PCA for dimensionality reduction could also be interesting.