Bayesian estimation of a Poisson model for Dutch football matches odds
================
Piet Stam
August 15th, 2020
- [Introduction](#introduction)
- [Quick summary](#quick-summary)
- [Acknowledgements](#acknowledgements)
- [Data and methods](#data-and-methods)
- [Theoretical description of the
model](#theoretical-description-of-the-model)
- [Basic model](#basic-model)
- [General model](#general-model)
- [Probabilistic Graphical
Model](#probabilistic-graphical-model)
- [Read data](#read-data)
- [Estimation, simulation and
validation](#estimation-simulation-and-validation)
- [Model estimation](#model-estimation)
- [Generating predictions (in- and
out-of-sample)](#generating-predictions-in--and-out-of-sample)
- [Model validation](#model-validation)
- [Results](#results)
- [The ranking of the teams](#the-ranking-of-the-teams)
- [Predicting the future](#predicting-the-future)
- [Betting on the match outcome](#betting-on-the-match-outcome)
- [Betting on the correct score](#betting-on-the-correct-score)
- [Betting results](#betting-results)
- [Appendix](#appendix)
- [Who do you call?](#who-do-you-call)
> *Copyright 2019 [Piet Stam](http://www.pietstam.nl). The code and the
> documentation are licensed under the Creative Commons [Attribution 4.0
> International license](http://creativecommons.org/licenses/by/4.0/).*
## Introduction
### Quick summary
We applied the original work of [Rasmus
Baath](http://www.sumsar.net/about.html) to the Dutch Eredivisie
football competition. With `r-bayesian-football-odds` we estimated the
odds of football matches in the last two weeks of the 2018/2019 Dutch
Eredivisie season. We provide the code and evaluate the
results of our predictions.
### Acknowledgements
This piece of work is based on the work of [Rasmus
Baath](http://www.sumsar.net/blog/2013/07/modeling-match-results-in-la-liga-part-one/).
Rasmus Baath submitted his code to the [UseR 2013 data analysis
contest](https://www.r-project.org/conferences/useR-2013/) and licensed
it under the Creative Commons [Attribution 3.0 Unported
license](http://creativecommons.org/licenses/by/3.0/).
He predicted the results of the last 50 matches of the 2012/2013 Spanish
LaLiga season. He used data of the 2008/09-2012/13 seasons (5 seasons in
total) to estimate his regression model in a
[Bayesian](https://en.wikipedia.org/wiki/Bayes_estimator) way. See [this
thread](https://stats.stackexchange.com/questions/252577/bayes-regression-how-is-it-done-in-comparison-to-standard-regression)
for an intuitive explanation of the difference between the Bayesian
approach and the conventional approaches of linear regression and
maximum likelihood.
I slightly adapted his code to predict the results of the last two
competition rounds (that is, the last 18 matches) of the 2018/2019 Dutch
Eredivisie season. These predictions are based on soccer match data of
the 2014/15-2018/19 seasons (5 seasons in total). The source of these
data is [here](http://www.football-data.co.uk/netherlandsm.php). Out of
the three model specifications that Rasmus developed, I used the most
sophisticated model that allowed for year-to-year variability in team
skill (called “iteration 3” by Rasmus).
You can find my code at
[GitHub](https://github.com/pjastam/r-bayesian-football-odds). Rasmus
deserves all the credit; I deserve all the blame in case of any errors
in my application to the Dutch football competition.
## Data and methods
### Theoretical description of the model
#### Basic model
The first thing to notice is that not all teams are equally good.
Therefore, it will be assumed that all teams have a latent skill
variable and the skill of a team *minus* the skill of the opposing team
defines the predicted outcome of a game. As the number of goals is
assumed to be Poisson distributed, it is natural that the skills of the
teams are on the log scale of the mean of the distribution.
In its simplest form, the distribution of the number of goals for team
\(i\) when facing team \(j\) is then
![](http://latex.codecogs.com/gif.latex?%7Bgoals%7D_%7Bi,j%7D%20%5Csim%20%5Ctext%7BPoisson%7D\(%5Clambda_%7Bi,j%7D\))
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Blog%7D\(%5Clambda_%7Bi,j%7D\)%20%3D%20%5Ctext%7Bbaseline%7D%20+%20%5Ctext%7Bskill%7D_i%20-%20%5Ctext%7Bskill%7D_j)
where <code>baseline</code> is the log average number of goals when both
teams are equally good. Note that this model description does not
capture the variation in the number of goals among football seasons and
between home vs away teams.
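The basic model can be sketched in a few lines of R. Note that the
baseline and skill values below are made-up illustration numbers, not
estimates from the model:

``` r
# Minimal sketch of the basic model; the parameter values are
# made-up illustration numbers, not estimates
set.seed(1)
baseline <- 0.1    # log of the average number of goals for equal teams
skill_i  <- 0.3    # latent skill of team i
skill_j  <- -0.2   # latent skill of team j
lambda_ij <- exp(baseline + skill_i - skill_j)  # Poisson mean
goals_ij  <- rpois(10000, lambda_ij)            # simulated goal counts
mean(goals_ij)     # close to lambda_ij
```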
#### General model
In order to allow for variation in the number of goals among football
seasons and between home vs away teams, we refine the distribution of
the goal outcome of a match between home team \(i\) and away team \(j\)
in season \(s\) as follows:
![](http://latex.codecogs.com/gif.latex?%7Bgoals%7D%5E%5Ctext%7Bhome%7D_%7Bs,i,j%7D%20%5Csim%20%5Ctext%7BPoisson%7D\(%5Clambda%5E%5Ctext%7Bhome%7D_%7Bs,i,j%7D\))
with the <code>lambdas</code> defined as follows
![](http://latex.codecogs.com/gif.latex?%5Clambda%5E%5Ctext%7Bhome%7D_%7Bs,i,j%7D%20%3D%20%5Cexp\(%5Ctext%7Bbaseline%7D%5E%5Ctext%7Bhome%7D_s%20+%20%5Ctext%7Bskill%7D_%7Bs,i%7D%20-%20%5Ctext%7Bskill%7D_%7Bs,j%7D\))
![](http://latex.codecogs.com/gif.latex?%5Clambda%5E%5Ctext%7Baway%7D_%7Bs,i,j%7D%20%3D%20%5Cexp\(%5Ctext%7Bbaseline%7D%5E%5Ctext%7Baway%7D_s%20+%20%5Ctext%7Bskill%7D_%7Bs,j%7D%20-%20%5Ctext%7Bskill%7D_%7Bs,i%7D\))
Note that the <code>baseline</code> is split into
<code>home\_baseline</code> and <code>away\_baseline</code> in order to
account for the home advantage. Furthermore, we introduced the index s
for the baseline and skill parameters to allow for variation among
seasons.
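As a worked illustration of the two rate equations (the parameter values
below are made up for the example, not model estimates):

``` r
# Made-up illustration values for season s, home team i, away team j
home_baseline_s <- 0.35   # home baseline on the log scale
away_baseline_s <- -0.05  # away baseline on the log scale
skill_s_i <- 0.25         # skill of the home team
skill_s_j <- -0.10        # skill of the away team

# The two Poisson rates mirror each other: the skill difference enters
# with opposite signs for the home and the away goal count
lambda_home <- exp(home_baseline_s + skill_s_i - skill_s_j)
lambda_away <- exp(away_baseline_s + skill_s_j - skill_s_i)
```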
##### Defining the baseline distributions
I set the prior distributions of the baselines in season \(s\) to:
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bbaseline%7D%5E%5Ctext%7Bhome%7D_s%20%5Csim%20%5Ctext%7BNormal%7D\(%5Ctext%7Bbaseline%7D%5E%5Ctext%7Bhome%7D_%7Bs-1%7D,%20%5Csigma_%7B%5Ctext%7Bseasons%7D%7D%5E2\))
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bbaseline%7D%5E%5Ctext%7Baway%7D_s%20%5Csim%20%5Ctext%7BNormal%7D\(%5Ctext%7Bbaseline%7D%5E%5Ctext%7Baway%7D_%7Bs-1%7D,%20%5Csigma_%7B%5Ctext%7Bseasons%7D%7D%5E2\))
and in the *first* season to:
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bbaseline%7D%5E%5Ctext%7Bhome%7D_1%20%5Csim%20%5Ctext%7BNormal%7D\(0,%204%5E2\))
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bbaseline%7D%5E%5Ctext%7Baway%7D_1%20%5Csim%20%5Ctext%7BNormal%7D\(0,%204%5E2\))
with <code>sigma-seasons</code> defined as:
![](http://latex.codecogs.com/gif.latex?%5Csigma_%5Ctext%7Bseasons%7D%20%5Csim%20%5Ctext%7BUniform%7D\(0,%203\))
##### Defining the team skill distributions
I set the prior distributions over the skills of team \(i\) (or \(j\),
denoted as i|j) in season \(s\) to:
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bskill%7D_%7Bs,i%7Cj%7D%20%5Csim%20%5Ctext%7BNormal%7D\(%5Ctext%7Bskill%7D_%7Bs-1,i%7Cj%7D,%20%5Csigma_%7B%5Ctext%7Bseasons%7D%7D%5E2\))
and in the *first* season to:
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bskill%7D_%7B1,i%7Cj%7D%20%5Csim%20%5Ctext%7BNormal%7D\(%5Cmu_%5Ctext%7Bteams%7D,%20%5Csigma_%7B%5Ctext%7Bteams%7D%7D%5E2\))
with the <code>sigma-seasons</code> defined as above and
<code>mu-teams</code> and <code>sigma-teams</code> defined as:
![](http://latex.codecogs.com/gif.latex?%5Cmu_%5Ctext%7Bteams%7D%20%5Csim%20%5Ctext%7BNormal%7D\(0,%204%5E2\))
![](http://latex.codecogs.com/gif.latex?%5Csigma_%5Ctext%7Bteams%7D%20%5Csim%20%5Ctext%7BUniform%7D\(0,%203\))
We apply a normalizing restriction with respect to the (arbitrarily
chosen) *first* team in each season \(s\) as follows
![](http://latex.codecogs.com/gif.latex?%5Ctext%7Bskill%7D_%7Bs,1%7D%20%3D%200)
We choose very vague priors. For example, the priors on the baselines
have an SD of 4, but since this is on the log scale of the mean number of
goals, one SD from the mean \(0\) covers the range of
\([0.02, 54.6]\) goals. A very wide prior indeed.
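The quoted range follows directly from exponentiating plus and minus one
SD:

``` r
# One SD (= 4) around a mean of 0 on the log scale of the mean number
# of goals translates into a huge range on the goal scale
round(exp(c(-4, 4)), 2)  # 0.02 54.60
```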
#### Probabilistic Graphical Model
We graphed the dependencies described above with the help of a
probabilistic graphical model. To this end, we make use of the CRAN
package
[DiagrammeR](https://cran.r-project.org/web/packages/DiagrammeR/index.html),
which provides an interface to the [Graph Visualization
Software](https://graphviz.gitlab.io/).
![](results/pgm-1.png)<!-- -->
### Read data
We first read the Dutch soccer match data of the 2014/15-2018/19 Dutch
Eredivisie seasons from the original csv-files and cache them. The
result is a database called `eredivisie`.
``` r
from_year <- 2014
to_year <- 2019
source("functions/Import_Data_Eredivisie.R")
```
Then the cached `eredivisie` data are cleaned and new variables are
created.
``` r
eredivisie <- eredivisie %>%
mutate(MatchResult = sign(HomeGoals - AwayGoals)) # -1 Away win, 0 Draw, 1 Home win
# Creating a data frame d with only the complete match results
d <- na.omit(eredivisie)
# Lists with the unique names of all teams and all seasons in the database
teams <- unique(c(d$HomeTeam, d$AwayTeam))
seasons <- unique(d$Season)
# A list for JAGS with the data from d where the strings are coded as integers
data_list <- list(HomeGoals = d$HomeGoals, AwayGoals = d$AwayGoals,
HomeTeam = as.numeric(factor(d$HomeTeam, levels=teams)),
AwayTeam = as.numeric(factor(d$AwayTeam, levels=teams)),
Season = as.numeric(factor(d$Season, levels=seasons)),
n_teams = length(teams), n_games = nrow(d),
n_seasons = length(seasons))
```
The data set <code>eredivisie</code> contains data from 5 different
seasons. In this model we allow for variability in year-to-year team
performance. This variability in team performance can be demonstrated
by the following diagram, which shows that some teams do not even
participate in all seasons in the <code>eredivisie</code> data set, as a
result of dropping out of the first division:
``` r
qplot(Season, HomeTeam, data=d, ylab="Team", xlab = "Season")
```
![](results/participation_by_season-1.png)<!-- -->
## Estimation, simulation and validation
### Model estimation
Turning this into a JAGS model results in the following string. Note
that the model loops over all seasons and all match results. JAGS
parameterizes the normal distribution with precision (the reciprocal of
the variance) instead of variance so the hyper priors have to be
converted. Finally, we "anchor" the skill of one team to a constant
otherwise the mean skill can drift away freely. Doing these adjustments
results in the following model description:
``` r
m3_string <- "model {
for(i in 1:n_games) {
HomeGoals[i] ~ dpois(lambda_home[Season[i], HomeTeam[i],AwayTeam[i]])
AwayGoals[i] ~ dpois(lambda_away[Season[i], HomeTeam[i],AwayTeam[i]])
}
for(season_i in 1:n_seasons) {
for(home_i in 1:n_teams) {
for(away_i in 1:n_teams) {
lambda_home[season_i, home_i, away_i] <- exp( home_baseline[season_i] + skill[season_i, home_i] - skill[season_i, away_i])
lambda_away[season_i, home_i, away_i] <- exp( away_baseline[season_i] + skill[season_i, away_i] - skill[season_i, home_i])
}
}
}
skill[1, 1] <- 0
for(j in 2:n_teams) {
skill[1, j] ~ dnorm(group_skill, group_tau)
}
group_skill ~ dnorm(0, 0.0625)
group_tau <- 1/pow(group_sigma, 2)
group_sigma ~ dunif(0, 3)
home_baseline[1] ~ dnorm(0, 0.0625)
away_baseline[1] ~ dnorm(0, 0.0625)
for(season_i in 2:n_seasons) {
skill[season_i, 1] <- 0
for(j in 2:n_teams) {
skill[season_i, j] ~ dnorm(skill[season_i - 1, j], season_tau)
}
home_baseline[season_i] ~ dnorm(home_baseline[season_i - 1], season_tau)
away_baseline[season_i] ~ dnorm(away_baseline[season_i - 1], season_tau)
}
season_tau <- 1/pow(season_sigma, 2)
season_sigma ~ dunif(0, 3)
}"
```
We then run this model directly from R using the
<code>run.jags</code> function of the <code>runjags</code> package. This
takes about half an hour on my computer, but of course this depends on
the configuration at hand.
``` r
# Compiling the model
m3 <- run.jags(method="parallel",
model=m3_string,
monitor=c("home_baseline", "away_baseline","skill", "season_sigma", "group_sigma", "group_skill"),
data=data_list,
n.chains=3,
adapt=10000,
burnin=10000,
sample=15000,
thin=8,
summarise=FALSE,
plots=FALSE)
# Generating MCMC samples
s3 <- as.mcmc.list(m3)
# Merging the three MCMC chains into one matrix
ms3 <- as.matrix(s3)
```
The following graphs show the trace plots and probability distributions
of the team mean, team sigma and season sigma parameters, respectively.
``` r
plot(s3[, "group_skill"])
```
![](results/mu_sigma_params-1.png)<!-- -->
``` r
plot(s3[, "group_sigma"])
```
![](results/mu_sigma_params-2.png)<!-- -->
``` r
plot(s3[, "season_sigma"])
```
![](results/mu_sigma_params-3.png)<!-- -->
We can also calculate the default home advantage by looking at the
difference <code>exp(home\_baseline) -
exp(away\_baseline)</code>. The next graph shows that there is a home
advantage of more than 0.4 goals, on average, and that it differs
significantly from zero.
``` r
plotPost(exp(ms3[,col_name("home_baseline",to_year-from_year)]) - exp(ms3[,col_name("away_baseline",to_year-from_year)]), compVal = 0, xlab = "Home advantage in number of goals")
```
![](results/overall_home_advantage-1.png)<!-- -->
## mean median mode hdiMass
## Home advantage in number of goals 0.431632 0.4284996 0.4222933 0.95
## hdiLow hdiHigh compVal pcGTcompVal
## Home advantage in number of goals 0.2814409 0.5908113 0 1
## ROPElow ROPEhigh pcInROPE
## Home advantage in number of goals NA NA NA
### Generating predictions (in- and out-of-sample)
In the <code>eredivisie</code> data set included in this project, the
results of the 18 last games of the 2018/2019 season are missing. Using
our model we can now both predict and simulate the outcomes of these 18
games. The R code below calculates a number of measures for each game
(both the games with known and unknown outcomes):
- The mode of the simulated number of goals, that is, the *most
likely* number of scored goals. If we were asked to bet on the
number of goals in a game this is what we would use.
- The mean of the simulated number of goals, this is our best guess of
the average number of goals in a game.
- The most likely match result for each game.
- A random sample from the distributions of credible home scores, away
scores and match results. This is how the Eredivisie actually could
have played out in an alternative reality.
<!-- end list -->
``` r
n <- nrow(ms3)
m3_pred <- sapply(1:nrow(eredivisie), function(i) {
home_team <- which(teams == eredivisie$HomeTeam[i])
away_team <- which(teams == eredivisie$AwayTeam[i])
season <- which(seasons == eredivisie$Season[i])
home_skill <- ms3[, col_name("skill", season, home_team)]
away_skill <- ms3[, col_name("skill", season, away_team)]
home_baseline <- ms3[, col_name("home_baseline", season)]
away_baseline <- ms3[, col_name("away_baseline", season)]
home_goals <- rpois(n, exp(home_baseline + home_skill - away_skill))
away_goals <- rpois(n, exp(away_baseline + away_skill - home_skill))
home_goals_table <- table(home_goals)
away_goals_table <- table(away_goals)
match_results <- sign(home_goals - away_goals)
match_results_table <- table(match_results)
mode_home_goal <- as.numeric(names(home_goals_table)[ which.max(home_goals_table)])
mode_away_goal <- as.numeric(names(away_goals_table)[ which.max(away_goals_table)])
match_result <- as.numeric(names(match_results_table)[which.max(match_results_table)])
rand_i <- sample(seq_along(home_goals), 1)
c(mode_home_goal = mode_home_goal, mode_away_goal = mode_away_goal, match_result = match_result,
mean_home_goal = mean(home_goals), mean_away_goal = mean(away_goals),
rand_home_goal = home_goals[rand_i], rand_away_goal = away_goals[rand_i],
rand_match_result = match_results[rand_i])
})
m3_pred <- t(m3_pred)
```
### Model validation
First let's compare the distribution of the actual number of goals in
the data with the predicted mode, mean and randomized number of goals
for all the games (focusing on the number of goals for the home team).
First the actual distribution of the number of goals for the home teams.
``` r
hist(eredivisie$HomeGoals, breaks= (-1:max(eredivisie$HomeGoals, na.rm=TRUE)) + 0.5, xlim=c(-0.5, 10), main = "Distribution of the number of goals\nscored by a home team in a match",
xlab = "")
```
![](results/hist_home_goal-1.png)<!-- -->
This next plot shows the distribution of the modes from the predicted
distribution of home goals from each game. That is, what is the most
probable outcome, for the home team, in each game.
``` r
hist(m3_pred[ , "mode_home_goal"], breaks= (-1:max(m3_pred[ , "mode_home_goal"])) + 0.5, xlim=c(-0.5, 10),
main = "Distribution of predicted most \nprobable score by a home team in\na match",
xlab = "")
```
![](results/mode_home_goal-1.png)<!-- -->
For almost all games the single most likely number of goals is one.
Actually, if you know nothing about an Eredivisie game, betting on one
goal for the home team is the best bet 78% of the time.
Let's instead look at the distribution of the predicted mean number of
home goals in each game.
``` r
hist(m3_pred[ , "mean_home_goal"], breaks= (-1:max(m3_pred[ , "mean_home_goal"])) + 0.5, xlim=c(-0.5, 10),
main = "Distribution of predicted mean \n score by a home team in a match",
xlab = "")
```
![](results/mean_home_goal-1.png)<!-- -->
For most games the expected number of goals is 2. That is, even if your
safest bet is one goal, you would expect to see around two goals.
The distribution of the mode and the mean number of goals doesn’t look
remotely like the actual number of goals. This is not surprising; we
would, however, expect the distribution of randomized goals (where for
each match the number of goals has been randomly drawn from that match’s
predicted home goal distribution) to look similar to the actual number
of home goals. Looking at the histogram below, this seems to be the
case.
``` r
hist(m3_pred[ , "rand_home_goal"], breaks= (-1:max(m3_pred[ , "rand_home_goal"])) + 0.5, xlim=c(-0.5, 10),
main = "Distribution of randomly drawn \n score by a home team in a match",
xlab = "")
```
![](results/rand_home_goal-1.png)<!-- -->
We can also look at how well the model predicts the data. This should
probably be done using cross validation, but as the number of effective
parameters is much smaller than the number of data points, a direct
comparison should at least give an estimated prediction accuracy in the
right ballpark.
``` r
mean(eredivisie$HomeGoals == m3_pred[ , "mode_home_goal"], na.rm=T)
```
## [1] 0.3167989
``` r
mean((eredivisie$HomeGoals - m3_pred[ , "mean_home_goal"])^2, na.rm=T)
```
## [1] 1.508416
So on average the model predicts the correct number of home goals 31% of
the time and guesses the average number of goals with a mean squared
error of 1.51. Now we’ll look at the actual and predicted match
outcomes. The graph below shows the match outcomes in the data with 1
being a home win, 0 being a draw and -1 being a win for the away team.
``` r
hist(eredivisie$MatchResult, breaks= (-2:1) + 0.5, xlim=c(-1.5, 1.5), ylim=c(0, 1000), main = "Actual match results",
xlab = "")
```
![](results/hist_actual_match_result-1.png)<!-- -->
Now looking at the most probable outcomes of the matches according to
the model.
``` r
hist(m3_pred[ , "match_result"], breaks= (-2:1) + 0.5, xlim=c(-1.5, 1.5), ylim=c(0, 1000), main = "Predicted match results",
xlab = "")
```
![](results/hist_pred_match_result-1.png)<!-- -->
For almost all matches the safest bet is to bet on the home team. While
draws are not uncommon, a draw is *never* the safest bet.
As in the case with the number of home goals, the randomized match
outcomes have a distribution similar to the actual match outcomes:
``` r
hist(m3_pred[ , "rand_match_result"], breaks= (-2:1) + 0.5, xlim=c(-1.5, 1.5), ylim=c(0, 1000), main = "Randomized match results",
xlab = "")
```
![](results/hist_rand_match_result-1.png)<!-- -->
``` r
mean(eredivisie$MatchResult == m3_pred[ , "match_result"], na.rm=T)
```
## [1] 0.5681217
The model predicts the correct match outcome (i.e. home team wins / a
draw / away team wins) 57% of the time. Pretty good!
## Results
Disclaimer: my comments below may be out of sync with the empirical
results and graphs, because these comments (as well as the description
of my betting experience in the last section) are based on the results
of running
[VERSION 1.0](https://github.com/pjastam/r-bayesian-football-odds/releases/tag/v1.0)
instead of the current version of the app.
### The ranking of the teams
We’ll start by ranking the teams of the <code>Eredivisie</code> using
the estimated skill parameters from the 2018/2019 season, which are
based on the estimation sample of the five seasons 2014/2015-2018/2019.
Note that for one of the teams the skill parameter is “anchored at
zero”. This “anchoring” is done for the very same "identification"
reason that one of the parameters in a traditional logit analysis is
always set to zero by default: the value of a parameter automatically
follows if you already know all the other parameters in your model and
given the fact that probabilities always sum up to 1 in total.
Consequently, as Rasmus noted before, the skill parameters are difficult
to interpret, as they are relative to the skill of the team that had its
skill parameter “anchored” at zero. To put them on a more interpretable
scale, the skill parameters are first zero-centered by subtracting the
mean skill of all teams. Then the home baseline is added and the
resulting values are exponentiated. These rescaled skill parameters are
now on the scale of the expected number of goals when playing as a home
team.
``` r
team_skill <- ms3[, str_detect(string=colnames(ms3), paste0("skill\\[",to_year-from_year,","))]
team_skill <- (team_skill - rowMeans(team_skill)) + ms3[, paste0("home_baseline[",to_year-from_year,"]")]
team_skill <- exp(team_skill)
colnames(team_skill) <- teams
team_skill <- team_skill[,order(colMeans(team_skill), decreasing=T)]
old_par <- par(mar=c(2,0.7,0.7,0.7), xaxs='i')
caterplot(team_skill, labels.loc="above", val.lim=c(0.7, 3.8))
```
![](results/team_skill-1.png)<!-- -->
``` r
par(old_par)
```
Two teams are clearly ahead of the rest, Ajax and PSV. Let's look at the
credible difference between these two teams. Ajax is a better team than
PSV with a probability of 74%, i.e. the odds in favor of Ajax are 74% /
26% ≈ 3. So, on average, PSV only wins one out of four games that they
play against Ajax.
``` r
plotPost(team_skill[, "Ajax"] - team_skill[, "PSV Eindhoven"], compVal = 0, xlab = "<- PSV vs Ajax ->")
```
![](results/team_skill_PSV_Ajax-1.png)<!-- -->
## mean median mode hdiMass hdiLow
## <- PSV vs Ajax -> 0.166562 0.1581584 0.1227583 0.95 -0.3601816
## hdiHigh compVal pcGTcompVal ROPElow ROPEhigh
## <- PSV vs Ajax -> 0.6779463 0 0.7369778 NA NA
## pcInROPE
## <- PSV vs Ajax -> NA
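The 74% figure is simply the share of posterior draws in which Ajax's
rescaled skill exceeds PSV's. A toy sketch, where the two vectors are
tiny made-up stand-ins for the Ajax and PSV columns of the
`team_skill` matrix, not actual posterior samples:

``` r
# Share of MCMC draws in which one team's (rescaled) skill exceeds the
# other's; the data here are made-up stand-ins for illustration only
skill_ajax <- c(2.0, 1.8, 2.2, 1.9, 2.1)
skill_psv  <- c(1.9, 2.0, 1.7, 1.8, 2.0)
p_ajax_better <- mean(skill_ajax > skill_psv)     # 0.8 in this toy case
odds_ajax <- p_ajax_better / (1 - p_ajax_better)  # about 4 to 1 here
```

With the real `team_skill` matrix from the ranking code above, the same
comparison `mean(team_skill[, "Ajax"] > team_skill[, "PSV Eindhoven"])`
would reproduce the quoted probability.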
### Predicting the future
Now that we’ve checked that the model reasonably predicts the Eredivisie
history, let's predict the Eredivisie endgame!
At the time when I executed my version of this model applied to the
Dutch Eredivisie competition (2019-05-10), most of the matches in the
2018/2019 season had already been played. Yet two of the 34 competition
rounds were still to be played (that is, competition rounds 33 and 34). With
these two rounds still to go, Ajax and PSV both have 80 points, but Ajax
leads the competition as their goal difference is larger (111-30 = 81)
than that of PSV (95-24 = 71). The code below displays the predicted
mean and mode number of goals for the endgame and the predicted winner
of each game.
``` r
eredivisie_forecast <- eredivisie[is.na(eredivisie$HomeGoals), c("Season", "Week", "HomeTeam", "AwayTeam")]
m3_forecast <- m3_pred[is.na(eredivisie$HomeGoals),]
eredivisie_forecast$mean_home_goals <- round(m3_forecast[,"mean_home_goal"], 1)
eredivisie_forecast$mean_away_goals <- round(m3_forecast[,"mean_away_goal"], 1)
eredivisie_forecast$mode_home_goals <- m3_forecast[,"mode_home_goal"]
eredivisie_forecast$mode_away_goals <- m3_forecast[,"mode_away_goal"]
eredivisie_forecast$predicted_winner <- ifelse(m3_forecast[ , "match_result"] == 1, eredivisie_forecast$HomeTeam,
ifelse(m3_forecast[ , "match_result"] == -1, eredivisie_forecast$AwayTeam, "Draw"))
rownames(eredivisie_forecast) <- NULL
print(xtable(eredivisie_forecast, align="cccccccccc"), type="html")
```
<!-- html table generated in R 3.6.1 by xtable 1.8-4 package -->
<!-- Sun Aug 16 11:39:17 2020 -->
<table border="1">
<tr>
<th>
</th>
<th>
Season
</th>
<th>
Week
</th>
<th>
HomeTeam
</th>
<th>
AwayTeam
</th>
<th>
mean\_home\_goals
</th>
<th>
mean\_away\_goals
</th>
<th>
mode\_home\_goals
</th>
<th>
mode\_away\_goals
</th>
<th>
predicted\_winner
</th>
</tr>
<tr>
<td align="center">
1
</td>
<td align="center">
2018/2019
</td>
<td align="center">
40
</td>
<td align="center">
Ajax
</td>
<td align="center">
Utrecht
</td>
<td align="center">
2.90
</td>
<td align="center">
0.80
</td>
<td align="center">
2.00
</td>
<td align="center">
0.00
</td>
<td align="center">
Ajax
</td>
</tr>
<tr>
<td align="center">
2
</td>
<td align="center">
2018/2019
</td>
<td align="center">
40
</td>
<td align="center">
AZ Alkmaar
</td>
<td align="center">
PSV Eindhoven
</td>
<td align="center">
1.30
</td>
<td align="center">
1.80
</td>
<td align="center">
1.00
</td>
<td align="center">
1.00
</td>
<td align="center">
PSV Eindhoven
</td>
</tr>
<tr>
<td align="center">
3
</td>
<td align="center">
2018/2019
</td>
<td align="center">
40
</td>
<td align="center">
Groningen
</td>
<td align="center">
For Sittard
</td>
<td align="center">
2.10
</td>
<td align="center">
1.10
</td>
<td align="center">
2.00
</td>
<td align="center">
1.00
</td>
<td align="center">
Groningen
</td>
</tr>
<tr>
<td align="center">
4
</td>
<td align="center">
2018/2019
</td>
<td align="center">
40
</td>
<td align="center">
Feyenoord
</td>
<td align="center">
Den Haag
</td>
<td align="center">
2.70
</td>
<td align="center">
0.90
</td>
<td align="center">
2.00
</td>
<td align="center">
0.00
</td>
<td align="center">
Feyenoord
</td>
</tr>
<tr>
<td align="center">
5
</td>
<td align="center">
2018/2019
</td>
<td align="center">
40
</td>
<td align="center">
Heerenveen
</td>
<td align="center">
NAC Breda
</td>
<td align="center">
2.20
</td>
<td align="center">
1.00
</td>
<td align="center">
2.00
</td>
<td align="center">
1.00
</td>
<td align="center">
Heerenveen