
Data pipeline for COVID19 monitoring


Data pipeline for uploading, preprocessing, and visualising COVID19 data using Google Cloud Platform

This repo contains the implementation of a pipeline for visualizing COVID19 data: over the whole time period and over the last 7 days.

The original idea of this pipeline is to run scheduled jobs that keep the table regularly updated.

The dashboard is available here. Note that it is no longer updated after April 2023.

A detailed description of the project, without going too deep into the tooling, is given in a post on my personal page.

NOTE: starting from 08-03-2023, the data source used by Our World in Data has changed and the data is updated on a weekly basis (each Thursday) instead of daily.

Dashboard - last 7 days

Goal

COVID19 has affected our lives for quite a long period of time already. It is important to monitor the situation regularly in order to avoid surprises and to be ready to act in advance.

This project builds a pipeline that updates a dashboard for monitoring total COVID19 cases. The dashboard is updated regularly (daily until 08-03-2023 and weekly from 08-03-2023 due to the change of data source), making it easy to keep track of the situation.

The dashboard can be easily customized, e.g. by country, time period, month, etc.

Content

/data: Data source

src/dbt: dbt files and folders

/images: screenshots for README files

/infrastructure: Terraform files

/scripts: helper scripts

/src: source code

env: file with environment variables

/test: tests for the code

Data source

Data has been provided by Our World in Data.

The source file is downloaded from GitHub and was updated daily until 08-03-2023 (the source was Johns Hopkins University).

Johns Hopkins University stopped providing some metrics, and from 08-03-2023 the source was changed to the World Health Organization (WHO). The data is now updated on a weekly basis (each Thursday).

Description of architecture

Project architecture

The source data (raw layer) is originally in CSV format and is located on GitHub.

The batch pipeline is implemented using Google Cloud Platform (GCP). There is no point in implementing a streaming pipeline since the data is not real-time.

Terraform is used as the IaC (Infrastructure as Code) tool to create the resources in GCP.

The pipeline partially cleans the source CSV data, saves it as a Parquet file, and moves it sequentially first to a data lake, a GCP bucket (Google Cloud Storage, GCS), and then to a data warehouse, Google BigQuery (silver layer). The whole process is orchestrated by Prefect as daily scheduled jobs.
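As a rough illustration, the web-to-GCS step can be sketched as a Prefect flow like the one below; the URL, block name, and cleaning steps are assumptions for this sketch, not the exact code in src/web_to_gcs.py:

```python
# Hypothetical sketch of a Prefect flow mirroring the web -> GCS step.
# The source URL, GCS block name, and cleaning logic are illustrative assumptions.
import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket

SOURCE_URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"  # assumed URL

@task(retries=3)
def extract(url: str) -> pd.DataFrame:
    """Download the raw CSV (bronze layer)."""
    return pd.read_csv(url)

@task
def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Partial cleaning: parse dates and drop rows without a country code."""
    df["date"] = pd.to_datetime(df["date"])
    return df.dropna(subset=["iso_code"])

@task
def write_to_gcs(df: pd.DataFrame, path: str) -> None:
    """Save the frame as Parquet and upload it to the GCS data lake."""
    df.to_parquet(path, compression="gzip")
    gcs_bucket = GcsBucket.load("covid19-gcs-bucket")  # assumed block name
    gcs_bucket.upload_from_path(from_path=path, to_path=path)

@flow
def web_to_gcs() -> None:
    df = extract(SOURCE_URL)
    df_clean = clean(df)
    write_to_gcs(df_clean, "covid19_data.parquet")

if __name__ == "__main__":
    web_to_gcs()
```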

The pipeline follows the so-called medallion architecture, which describes data in terms of bronze, silver, and gold layers.

The silver layer data in the data warehouse is then transformed by dbt: configuring the schema, doing the final cleaning, selecting only the necessary columns, and saving the resulting data as tables in BigQuery. The data is partitioned on the date column, as the date is later used for querying, which optimizes the process. Because of the small size of the data, clustering was not applied. This data (gold layer) is ready for the dashboard.
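In the project the partitioning is configured through dbt; purely as an illustration of what date partitioning means in BigQuery, a minimal sketch with the Python client could look like this (the dataset, table, and column names are assumed):

```python
# Illustrative only: the project sets partitioning via dbt, but this shows
# equivalent date partitioning with the BigQuery Python client.
from google.cloud import bigquery

client = bigquery.Client()  # uses the service-account credentials configured earlier

table = bigquery.Table(
    "PROJECT_ID.covid19_gold.total_cases",  # assumed table id, replace PROJECT_ID
    schema=[
        bigquery.SchemaField("date", "DATE"),
        bigquery.SchemaField("location", "STRING"),
        bigquery.SchemaField("total_cases", "FLOAT"),
    ],
)
# Partition on the date column so dashboard queries scan only the needed days.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="date",
)
client.create_table(table, exists_ok=True)
```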

The dbt models use an incremental configuration, meaning that dbt transforms only the rows of the source data from the last week, i.e. rows that have been created or updated since the last time dbt ran.

The dashboard is built from the gold layer data using Looker Studio (previously Google Data Studio), which is synced with BigQuery.

Unit tests (/tests) have been written and integrated into the CI/CD pipeline via GitHub Actions.

The implementation is limited to GCP. At the same time, it does not involve any local components, which makes it more flexible for collaboration, e.g. working in a team.

Reproducibility

  1. First, read the instructions in prerequisites_readme. They give detailed, step-by-step guidelines for the configuration of the tools and services required by the pipeline.

  2. On the virtual machine, run the following commands to deploy the pipelines:

cd src/
prefect cloud login # or `prefect orion start` if you don't have or don't want to create an account
prefect deployment build web_to_gcs.py:web_to_gcs -n 'COVID19 data to GCS' --cron "0 15 * * 4" -a # creates a deployment yaml file and schedules it via cron at 15:00 every Thu
prefect deployment build gcs_to_bq.py:gcs_to_bq -n 'COVID19 data to BQ' --cron "0 16 * * 4" -a # creates a deployment yaml file and schedules it via cron at 16:00 every Thu
tmux # keeps the prefect agent running in the background after the ssh connection is closed
prefect agent start -q 'default'

Prefect Cloud scheduled pipelines

  3. Deploy the dbt core job for data transformation to Prefect and update the dbt profile:
cd src/
source ../env
sed -i "s/REGION/$REGION/g" dbt/profiles.yml # adds the region to the dbt profile needed by dbt core
sed -i "s/PROJECT_ID/$PROJECT_ID/g" dbt/profiles.yml # adds the project id to the dbt profile needed by dbt core
sed -i "s|PATH_TO_KEY|$(realpath ../.gc/sa-iam.json)|g" dbt/profiles.yml # adds the path to the json key to the dbt profile needed by dbt core
prefect deployment build run-dbt.py:dbt_transform -n 'dbt job' --cron "0 17 * * 4" -a # creates a deployment yaml file and schedules it via cron at 17:00 every Thu

This job will update the gold layer table in BigQuery with the weekly data.
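As a rough sketch, the dbt job can be thought of as a small Prefect flow that shells out to dbt; the actual run-dbt.py may use the prefect-dbt integration instead, and the paths below are assumptions:

```python
# Hypothetical sketch of a Prefect flow that runs the dbt transformation.
# The real run-dbt.py may differ; the project dir and profiles path are assumed.
import subprocess
from prefect import flow, task

@task(retries=2)
def run_dbt_build(project_dir: str = "dbt") -> None:
    """Run the dbt models against BigQuery using the configured profile."""
    subprocess.run(
        ["dbt", "run", "--project-dir", project_dir, "--profiles-dir", project_dir],
        check=True,  # fail the Prefect task if dbt exits with an error
    )

@flow
def dbt_transform() -> None:
    run_dbt_build()

if __name__ == "__main__":
    dbt_transform()
```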

  4. Follow the configuration instructions for Looker Studio.

The final dashboard is available here in Viewer mode. It contains two pages of COVID19 visualizations: over the whole time period and over the last 7 days.

If you want to reuse the dashboard, a publicly editable copy (Editor mode) is located at this link. Please create a copy of this dashboard to avoid making direct changes to the file.

Tests

Two different kinds of tests have been developed:

(1) checking the dataframe schema inside the Prefect flow (a minimal sketch is shown below)

(2) checking the consistency of the functions in `/src` that are critical for running the pipeline.

Tests (2) are integrated into the CI/CD pipeline using GitHub Actions.
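A minimal sketch of what the schema check (1) could look like as a Prefect task; the expected columns and dtypes here are assumptions for illustration, not the project's exact schema:

```python
# Hypothetical schema check run inside the Prefect flow before uploading data.
# Expected columns and dtypes are assumptions for illustration.
import pandas as pd
from prefect import task

EXPECTED_SCHEMA = {
    "iso_code": "object",
    "location": "object",
    "date": "datetime64[ns]",
    "total_cases": "float64",
}

@task
def check_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Fail the flow early if columns or dtypes deviate from the expected schema."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    for column, dtype in EXPECTED_SCHEMA.items():
        if str(df[column].dtype) != dtype:
            raise TypeError(f"Column {column!r} has dtype {df[column].dtype}, expected {dtype}")
    return df
```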

Improvements

  • The dashboard could be modified to use the `total cases per million` metric instead of `total cases`, which normalizes the values for easier comparison between countries.

  • Due to the nature of the source dataset, the current implementation copies the full file every time. This is not ideal, because the data lake already contains most of the data and only the recent rows need to be added (see the sketch after this list). It is not a problem for this project, since the data is not large and the cost of the operation is very low, but in general it is not good practice.

  • Within pipelines, there is a risk that one of the steps fails, making the remaining steps meaningless or sometimes even harmful to run. Ideally, steps in a pipeline should be triggered based on the success of the previous step instead of running on a fixed schedule. Such triggers can be easily implemented in Prefect using the Automations feature. However, because the pipeline is simple and easy to debug, automated triggers are skipped for now.

Automations in Prefect
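A rough sketch of the incremental-load idea from the second improvement above, assuming a date column and a gold table id; this is not implemented in the repo:

```python
# Illustration of the incremental-load improvement (not implemented here):
# only rows newer than the latest date already in BigQuery would be appended.
import pandas as pd
from google.cloud import bigquery

def append_new_rows(df: pd.DataFrame, table_id: str) -> None:
    """Append only rows newer than what the warehouse already contains."""
    client = bigquery.Client()
    latest = client.query(f"SELECT MAX(date) AS max_date FROM `{table_id}`").to_dataframe()
    max_date = latest["max_date"].iloc[0]
    new_rows = df if pd.isna(max_date) else df[df["date"] > pd.Timestamp(max_date)]
    if new_rows.empty:
        return  # nothing new since the last run
    job = client.load_table_from_dataframe(
        new_rows,
        table_id,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
    )
    job.result()  # wait for the load job to finish
```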
