Covid19ProbabilisticMethods

This repo contains a project whose objective is to forecast the total number of vaccinations in each country based on the World Bank's Covid-19 dataset, using probabilistic models.

Probabilistic models belong to the "Bayesian" family of methodologies, as opposed to regular machine learning or "Frequentist" methodologies. These methods focus on acknowledging and quantifying both what is known and what is unknown. They approximate (physical) processes with probability distributions, and by modelling the underlying processes they can make predictions and quantify their uncertainty. In particular, the project uses Markov chain Monte Carlo (MCMC) methods, since they are integrated in PyStan.
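As a rough illustration of what an MCMC method does, here is a minimal random-walk Metropolis sampler for the mean of a few noisy observations. It is only a sketch under stated assumptions: the data, the prior and noise scales, and the function names are invented for this example, and Stan itself relies on a more sophisticated sampler than the one shown here.

```python
import math
import random


def log_posterior(mu, data, prior_sd=10.0, noise_sd=1.0):
    """Unnormalised log posterior of a Gaussian mean with a Gaussian prior."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * ((x - mu) / noise_sd) ** 2 for x in data)
    return log_prior + log_lik


def metropolis(data, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampler for the mean `mu` (illustrative only)."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept the proposal with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


data = [4.8, 5.1, 5.3, 4.9, 5.2]  # toy observations, not project data
draws = metropolis(data)[1000:]   # discard the first 1000 draws as warm-up
posterior_mean = sum(draws) / len(draws)
posterior_sd = (
    sum((d - posterior_mean) ** 2 for d in draws) / (len(draws) - 1)
) ** 0.5
print(f"posterior mean ~ {posterior_mean:.2f}, posterior sd ~ {posterior_sd:.2f}")
```

The draws approximate the whole posterior distribution, so the same sample gives both a point estimate (the mean) and a measure of uncertainty (the standard deviation).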

The project consists of the following stages: preprocessing, exploratory data analysis, data preparation, modelling and evaluation. It was a group project and the focus of the course 42186 - Model-Based Machine Learning, taught at the Technical University of Denmark.

The course is centered on probabilistic modelling, and so is the project. The main file, "FINAL_Vaccination Prediction.ipynb", consists mostly of iterations over several models, starting from a rather simple AR(1) and ending with an AR(2) with a hierarchical model, somewhat closer to the "ideal" model.
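As a side illustration of the autoregressive recursion behind AR(1) and AR(2) (not the notebook's Stan model itself), the sketch below iterates the mean forecast of a generic AR(p) process. The function name `ar_forecast`, the coefficient values and the toy series are assumptions made for this example, and the noise term of the probabilistic model is left out.

```python
def ar_forecast(history, coeffs, intercept=0.0, steps=1):
    """Iterate the AR(p) mean forecast `steps` periods ahead.

    coeffs[0] multiplies the most recent value, coeffs[1] the one before, etc.
    """
    p = len(coeffs)
    if p == 0 or len(history) < p:
        raise ValueError("need p >= 1 coefficients and at least p observations")
    window = list(history[-p:])  # the p most recent values, oldest first
    forecasts = []
    for _ in range(steps):
        recent_first = window[::-1]
        nxt = intercept + sum(b * y for b, y in zip(coeffs, recent_first))
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # slide the window forward
    return forecasts


weekly = [10.0, 12.0, 15.0]  # toy weekly values, not project data
ar1 = ar_forecast(weekly, coeffs=[0.9], steps=2)       # uses one lag
ar2 = ar_forecast(weekly, coeffs=[0.6, 0.3], steps=2)  # uses two lags
print(ar1, ar2)
```

Moving from AR(1) to AR(2) only adds a second lag coefficient, which is why the models can be developed incrementally.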

Starting with the preprocessing, the dataset is expanded with a set of economic, health and education-related indicators for each country. Based on these new indicators of the countries' wellbeing, the countries are grouped into five clusters. To do this, the dimensionality of the data is first reduced using PCA, and the countries are then clustered on its results with a KMeans algorithm.

The resulting chart ("Clustering of Countries based on 2 Principal components.png") makes it easier to understand the similarities between countries on economic, health and education-related indicators. Countries belonging to the same cluster are expected to have similar vaccination rollout strategies, and therefore their predictions should fall within the same range.
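The PCA-then-KMeans step described above can be sketched as follows. This is a NumPy-only stand-in written for illustration, not the notebook's code: the helper names `pca_2d` and `kmeans`, the synthetic indicator table and the random seeds are all assumptions, and a library implementation would normally be preferred in practice.

```python
import numpy as np


def pca_2d(X):
    """Project standardised features onto the first two principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic stand-in for the indicator table: 60 "countries", 6 indicators.
rng = np.random.default_rng(42)
indicators = rng.normal(size=(60, 6))
components = pca_2d(indicators)              # shape (60, 2)
labels, centroids = kmeans(components, k=5)  # five clusters, as in the project
print(np.bincount(labels, minlength=5))      # number of countries per cluster
```

Clustering on the two leading components rather than on the raw indicators also makes the result easy to plot, since each country becomes a single point in a plane.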

Next, the data is prepared for forecasting: it is converted to a weekly frequency and prepared both per country and per cluster. In the exploratory data analysis it is easy to see that the patterns vary greatly within clusters and even more abruptly between clusters.
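As an illustration of the change to a weekly frequency (not the notebook's own code), the sketch below collapses a daily series to one value per ISO week. It assumes the series holds cumulative totals, so the last observation of each week is kept; the function name `to_weekly` and the toy data are invented for the example.

```python
from datetime import date, timedelta


def to_weekly(daily):
    """Collapse (date, cumulative_total) pairs to one value per ISO week.

    For a cumulative series the last observation in each week is kept.
    """
    weekly = {}
    for day, total in sorted(daily):
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])] = total  # later days overwrite earlier ones
    return weekly


start = date(2021, 1, 4)  # a Monday, first day of ISO week 1 of 2021
daily = [(start + timedelta(days=i), 100 * (i + 1)) for i in range(14)]
weekly = to_weekly(daily)
print(weekly)
```

For a series of daily increments rather than cumulative totals, the values within each week would be summed instead.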

Finally, the modelling and evaluation. Again, this was the main focus of the project, and all the code can be seen in the "FINAL_Vaccination Prediction.ipynb" notebook. Without getting into too much detail, to assess each model it is useful to look at the evolution of the Markov chains, the distributions and the sample variance. Throughout the modelling phase, each model is evaluated and then either rejected or developed further based on the accuracy of its predictions. All in all, the models tried are:

1 - AR(1), country-specific forecast
2 - AR(1), cluster-specific with priors for the country clustering
3 - AR(1), cluster-specific forecast
3.1 - with the data standardization from the preprocessing
3.2 - with a log scale
3.3 - with a lag of the differenced weekly vaccinations
4 - AR(1), country-specific forecast with data standardization and temporal factor
5 - AR(2), country-specific forecast with data standardization and temporal factor
6 - AR(2), as in [5], with Hierarchical Model

For model 6 we introduce a Hierarchical Model, constraining the means and standard deviations of the priors on our coefficients b and on the standard deviation W to specific values. These values were identified by running the model multiple times and observing the results from the chains and the convergence of the parameters.
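Looking at the evolution of the chains is usually done visually, but it can be complemented with a number. The sketch below computes the Gelman-Rubin statistic (R-hat) on simulated chains; this diagnostic is not mentioned in this README and is shown only as one common way to summarise chain agreement. The function name, the six simulated chains and their length are assumptions for the example.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equally long chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # Average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # Between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Six chains exploring the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(6)]
# Six chains stuck around different values: R-hat should be well above 1.
stuck = [[rng.gauss(offset, 1.0) for _ in range(500)] for offset in range(6)]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"mixed: {rhat_mixed:.3f}, stuck: {rhat_stuck:.3f}")
```

Values close to 1 indicate that the chains agree with each other, while clearly larger values suggest that they have not converged to the same distribution.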

To conclude, this final model results in an increase in R2 of 0.001, which suggests that the mean and standard deviation of every prior were already close to their optimal values without the hierarchical model. Moreover, this also means that the quality of the chains was already good, especially because we set iter=1000 and chains=6 in the sampling to improve the convergence of the chains.
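For reference, a minimal sketch of how an R2 score can be computed, assuming the usual coefficient of determination is meant. The function name `r_squared` and the toy numbers are invented for the example and are not results from the project.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]     # toy values, not project data
predicted = [1.1, 1.9, 3.2, 3.9]  # toy predictions
score = r_squared(actual, predicted)
print(f"R2 = {score:.3f}")
```

Because the score is bounded above by 1, a model that already fits well leaves very little room for improvement, which is consistent with a gain as small as 0.001.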
