This repo contains a project whose objective is to forecast the total number of vaccinations in each country based on the World Bank's Covid-19 dataset, using probabilistic models.
Probabilistic models are part of what are called "Bayesian" methodologies, as opposed to the regular machine learning or "Frequentist" methodologies. These methods focus on acknowledging and quantifying both what is known and what is unknown. Probabilistic methods describe (physical) processes with probability distributions, and by modelling the underlying processes they can make predictions and quantify their uncertainty. In particular, the project uses Markov Chain Monte Carlo methods, since they are integrated in PyStan.
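As a rough illustration of what a Markov Chain Monte Carlo method does, the sketch below draws posterior samples for the mean of some synthetic data with a random-walk Metropolis sampler. It is not the project's code: the data, the prior and the proposal scale are all assumptions made for the example, and Stan itself uses a more efficient sampler (NUTS) by default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic observations (assumption): 50 draws from Normal(2, 1).
data = rng.normal(loc=2.0, scale=1.0, size=50)

def log_posterior(mu):
    """Unnormalised log posterior of the mean mu."""
    log_prior = -0.5 * (mu / 10.0) ** 2        # prior: mu ~ Normal(0, 10)
    log_lik = -0.5 * np.sum((data - mu) ** 2)  # likelihood: data ~ Normal(mu, 1)
    return log_prior + log_lik

samples = []
mu = 0.0  # arbitrary starting point of the chain
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.5)  # random-walk proposal
    # Accept with probability min(1, posterior ratio).
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    samples.append(mu)

posterior = np.array(samples[1000:])  # discard warm-up draws
post_mean = posterior.mean()          # point estimate
post_sd = posterior.std()             # quantified uncertainty
```

The spread of the retained draws (`post_sd`) is what gives the uncertainty estimate that a single point prediction would lack.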
The project itself consists of the following stages: preprocessing, exploratory data analysis, data preparation, modelling and evaluation. It was a group project and the focus of the course 42186 - Model-Based Machine Learning, taught at the Technical University of Denmark.
The course is centered around probabilistic modelling, and so is the project. The main file, "Final_Vaccination Prediction.ipynb", consists mostly of iterations over several models, from a rather simple AR(1) to an AR(2) with a hierarchical model, somewhat closer to the "ideal" model:
Starting with the preprocessing, the dataset is expanded with a set of economic, health and education-related indicators about the countries. Based on these new indicators of the countries' wellbeing, the countries are grouped into five clusters. To do this, the dimensionality of the data is first reduced using PCA, and the countries are then clustered on the PCA results with a KMeans algorithm.
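The PCA and KMeans steps can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data, not the notebook's code: the indicator matrix, the scaling step and the choice of two principal components are assumptions; only the five clusters come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in (assumption): 40 countries x 8 wellbeing indicators.
indicators = rng.normal(size=(40, 8))

# Put indicators on a common scale so no single one dominates the PCA.
scaled = StandardScaler().fit_transform(indicators)

# Reduce to two components (assumed number, convenient for a 2-D chart).
components = PCA(n_components=2).fit_transform(scaled)

# Group the countries into five clusters on the reduced data.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)
```

`labels` holds one cluster index per country, which can then be joined back onto the vaccination data.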
This chart makes it easier to understand the similarities between countries in terms of economic, health and education-related indicators. Countries belonging to the same cluster are expected to have similar vaccination rollout strategies, and therefore their predictions should fall within the same range.
Next, the data is prepared for forecasting: it is converted to a weekly frequency and prepared both for the countries and for the clusters. In the exploratory data analysis it is easy to see that the patterns vary greatly between clusters and more abruptly within clusters:
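The conversion to a weekly frequency could look like the pandas sketch below. It assumes daily, cumulative vaccination totals for a single country; the column names `date` and `total_vaccinations` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical daily data: four weeks of cumulative totals, starting on a Monday.
dates = pd.date_range("2021-01-04", periods=28, freq="D")
daily = pd.DataFrame(
    {"date": dates, "total_vaccinations": range(100, 2900, 100)}
)

# For a cumulative series, the weekly value is the last observation of each week
# ("W" groups into weeks ending on Sunday).
weekly = daily.set_index("date")["total_vaccinations"].resample("W").last()

# Differencing the weekly totals gives the vaccinations administered per week.
weekly_new = weekly.diff()
```

The same aggregation can be repeated per cluster by first summing the countries that share a cluster label.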
Finally, the modelling and evaluation. Again, this was the main focus of the project, and all the code can be seen in the "FINAL_Vaccination prediction.ipynb" notebook. Without getting into too much detail, each model is assessed by looking at the evolution of the Markov chains, the distributions and the sample variance. Throughout the modelling phase, each model is evaluated and then either rejected or developed further, based on the accuracy of its predictions. All in all, the models tried are:

1 - AR(1), country-specific forecast
2 - AR(1), cluster-specific with priors for the country clustering
3 - AR(1), cluster-specific forecast
    3.1 - with the data standardization from the preprocessing
    3.2 - with a log scale
    3.3 - with a lag of the differenced weekly vaccinations
4 - AR(1), country-specific forecast with data standardization and temporal factor
5 - AR(2), country-specific forecast with data standardization and temporal factor
6 - AR(2), as in [5], with Hierarchical Model

In the last model we introduce a Hierarchical Model, setting the mean and standard deviation of our coefficients b and of the standard deviation W to specific values. These values were identified after running the model multiple times and observing the results from the chains and the convergence of the parameters to specific values.
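To make the autoregressive structure concrete, the sketch below simulates an AR(2) series, recovers its two coefficients and computes the R2 of the one-step-ahead predictions. It is only an illustration: the coefficient values are made up, and the fit uses ordinary least squares with NumPy rather than the Bayesian sampling in PyStan that the notebook relies on. It also leaves out the standardization, temporal factor and hierarchical priors.

```python
import numpy as np

rng = np.random.default_rng(1)
b1, b2, sigma = 0.6, 0.3, 0.1  # illustrative AR(2) coefficients and noise scale
n = 500

# Simulate y[t] = b1 * y[t-1] + b2 * y[t-2] + noise.
y = np.zeros(n)
for t in range(2, n):
    y[t] = b1 * y[t - 1] + b2 * y[t - 2] + rng.normal(scale=sigma)

# Regress each value on its two previous values.
X = np.column_stack([y[1:-1], y[:-2]])  # columns: lag 1, lag 2
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

# One-step-ahead predictions and their R2.
pred = X @ coef
ss_res = np.sum((target - pred) ** 2)
ss_tot = np.sum((target - target.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

An AR(1) model is the same construction with only the first lag column.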
To conclude, this final model results in an increase in R2 of 0.001, which suggests that the mean and standard deviation of every prior were already close to the optimal values without the hierarchical model. Moreover, this also means that the quality of the chains was already good, especially because we set iter=1000 and chains=6 in the sampling to improve the convergence of the chains.