Replicate.md


To replicate the project, follow the steps below.

First, install the required tools:

Terraform: Installation guide

Python: My installation was done with Anaconda; use the relevant commands when creating the environment

Google Cloud CLI: Installation guide

Prefect CLI: Installed as part of the instructions below

Docker: Find Docker here

dbt Core: Installed as part of the instructions below

Git and a GitHub repository: Download Git here and create a repository here if you don't have one
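Before moving on, a quick preflight sketch can confirm everything is installed. The binary names below are the usual ones for these tools (an assumption; adjust if yours differ):

```python
import shutil

# The usual binary names for the required tools (adjust if yours differ).
REQUIRED = ["terraform", "python", "gcloud", "prefect", "docker", "dbt", "git"]

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("All required tools found.")
```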

Steps

  1. Clone this repository
  2. Create a new environment
  • Install dependencies using the command pip install -r requirements.txt
  3. GCP set-up
  • Log in to GCP and create a project
  • Set up the Google Cloud CLI as described in the guide
  • Create a service account, assigning it the rights to interact with the various resources (BigQuery, GCS bucket)
  • Create the raw, staging and production datasets in BigQuery. The script only uses raw and production, but for development it is best practice to use staging as well
  4. Set up the required infrastructure using Terraform
  • Run terraform init, then terraform plan (to see an overview of the changes), then terraform apply to set up the infrastructure
  5. Follow these steps to run dbt with the BigQuery adapter using Docker
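The Terraform step (init → plan → apply) can also be scripted. A minimal sketch, assuming the .tf files live in a terraform/ directory; the -auto-approve flag skips the confirmation prompt, so drop it while you are still reviewing plans manually:

```python
import subprocess

def provision(workdir="terraform", binary="terraform"):
    """Run the standard Terraform workflow in `workdir`: init -> plan -> apply.

    `workdir` is assumed to contain the project's .tf files; `binary` can be
    swapped out (e.g. for testing). Remove -auto-approve to keep the
    interactive confirmation prompt.
    """
    steps = (["init"], ["plan"], ["apply", "-auto-approve"])
    for args in steps:
        subprocess.run([binary, *args], cwd=workdir, check=True)
    return [args[0] for args in steps]
```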

  6. Prefect was already installed when we ran requirements.txt, so start the Prefect Orion server locally with prefect orion start

  • Open another terminal window and activate the environment (conda activate <your environment name>)
  • Register the GCP blocks using the command: prefect block register -m prefect_gcp


  • Create the Prefect GCP blocks in the Prefect UI or in code. In our case we do not require dbt blocks: dbt runs in Docker, so we trigger it inside our script instead of using a Prefect block.
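Triggering dbt from inside the script can look like the sketch below. The image name and mount paths are hypothetical stand-ins for your own Docker/dbt setup, not the project's actual values:

```python
import subprocess

def dbt_command(image="my-dbt-image", project_dir="./dbt", command="run"):
    """Build the docker invocation for a dbt command.

    `image` and `project_dir` are hypothetical; substitute the values
    from your own Docker/dbt setup.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{project_dir}:/usr/app",  # mount the dbt project into the container
        image, command,
    ]

def trigger_dbt(**kwargs):
    """Run dbt inside Docker, raising if the container exits non-zero."""
    subprocess.run(dbt_command(**kwargs), check=True)
```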

At this point you can already run the flow from the command line: running python etl.py executes the script, calling all the processes in the flow.
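Conceptually, the script chains extract, transform and load steps. A plain-Python sketch of that shape, with hypothetical function names and a local JSON file standing in for GCS/BigQuery (in the real etl.py these are Prefect @task/@flow functions):

```python
import json

def extract():
    """Fetch the source data (dummy records here)."""
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def transform(records):
    """Example clean-up step: drop non-positive values."""
    return [r for r in records if r["value"] > 0]

def load(records, path="output.json"):
    """Stand-in for the GCS upload and BigQuery load steps."""
    with open(path, "w") as f:
        json.dump(records, f)
    return path

def parent_etl():
    """The parent flow simply chains the tasks."""
    return load(transform(extract()))
```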

To go further and create a Prefect deployment:

  1. Create and apply deployments. Run prefect deployment build ./your-main-flow -n "parent_etl" (parent_etl is the name of the main flow function). This creates a `parent_etl-deployment.yaml` file with the metadata required by the agent to trigger the flow.
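For orientation, the generated file looks roughly like the abridged sketch below; the field values here are illustrative assumptions, and your file will contain more metadata:

```yaml
# Abridged, illustrative sketch of a Prefect deployment file.
name: parent_etl
entrypoint: etl.py:parent_etl
work_queue_name: default
parameters: {}
schedule: null
```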


  2. Next, run the command: prefect deployment apply parent_etl-deployment.yaml
  3. Run the command prefect agent start --work-queue "default"
  4. In the Prefect UI, trigger a quick run: the data is loaded to GCS and BigQuery, and the logs of the process appear in the UI.