Replicate.md


To replicate the project, follow the steps below.

First, install the required tools:

Terraform: Installation guide

Python: My installation was done with Anaconda; use the relevant commands when creating the environment

Google Cloud CLI: Installation guide

Prefect CLI: Installed as part of the instructions below

Docker: Find Docker here

dbt Core: Installed as part of the instructions below

Git and a GitHub repository: Download Git here and create a repository here if you don't have one
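Before moving on, a quick preflight sketch can confirm everything is installed. The binary names below are the usual ones for these tools (an assumption; adjust if yours differ):

```python
import shutil

# The usual binary names for the required tools (adjust if yours differ).
REQUIRED = ["terraform", "python", "gcloud", "prefect", "docker", "dbt", "git"]

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Please install:", ", ".join(missing))
    else:
        print("All required tools found.")
```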

Steps

  1. Clone this repository
  2. Create a new environment
  • Install dependencies using the command pip install -r requirements.txt
  3. GCP set-up
  • Log in to GCP and create a project
  • Set up the Google Cloud CLI as described in the guide
  • Create a service account, assigning it the rights to interact with the various resources (BigQuery, GCS bucket)
  • Create the raw, staging and production datasets in BigQuery. The script only uses raw and production, but for development it is best practice to use staging as well
  4. Set up the required infrastructure using Terraform
  • Run terraform init, then terraform plan (to see an overview of the changes), then terraform apply to set up the infrastructure
  5. Follow these steps to run dbt with the BigQuery adapter using Docker
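The Terraform step (init → plan → apply) can also be scripted. A minimal sketch, assuming the .tf files live in a terraform/ directory; the -auto-approve flag skips the confirmation prompt, so drop it while you are still reviewing plans manually:

```python
import subprocess

def provision(workdir="terraform", binary="terraform"):
    """Run the standard Terraform workflow in `workdir`: init -> plan -> apply.

    `workdir` is assumed to contain the project's .tf files; `binary` can be
    swapped out (e.g. for testing). Remove -auto-approve to keep the
    interactive confirmation prompt.
    """
    steps = (["init"], ["plan"], ["apply", "-auto-approve"])
    for args in steps:
        subprocess.run([binary, *args], cwd=workdir, check=True)
    return [args[0] for args in steps]
```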

  6. Prefect was already installed when we ran requirements.txt, so start the Prefect Orion server locally with prefect orion start

  • Open another terminal window and activate the environment (conda activate <your environment name>)
  • Register the GCP blocks using the command: prefect block register -m prefect_gcp


  • Create the Prefect GCP blocks in the Prefect UI or in code. In our case we do not require dbt blocks: dbt runs in Docker, so we trigger it inside our script instead of using a Prefect block.
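Triggering dbt from inside the script can look like the sketch below. The image name and mount paths are hypothetical stand-ins for your own Docker/dbt setup, not the project's actual values:

```python
import subprocess

def dbt_command(image="my-dbt-image", project_dir="./dbt", command="run"):
    """Build the docker invocation for a dbt command.

    `image` and `project_dir` are hypothetical; substitute the values
    from your own Docker/dbt setup.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{project_dir}:/usr/app",  # mount the dbt project into the container
        image, command,
    ]

def trigger_dbt(**kwargs):
    """Run dbt inside Docker, raising if the container exits non-zero."""
    subprocess.run(dbt_command(**kwargs), check=True)
```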

At this point you can already run the flow from the command line: running python etl.py executes the script, calling all the processes in the flow.
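Conceptually, the script chains extract, transform and load steps. A plain-Python sketch of that shape, with hypothetical function names and a local JSON file standing in for GCS/BigQuery (in the real etl.py these are Prefect @task/@flow functions):

```python
import json

def extract():
    """Fetch the source data (dummy records here)."""
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def transform(records):
    """Example clean-up step: drop non-positive values."""
    return [r for r in records if r["value"] > 0]

def load(records, path="output.json"):
    """Stand-in for the GCS upload and BigQuery load steps."""
    with open(path, "w") as f:
        json.dump(records, f)
    return path

def parent_etl():
    """The parent flow simply chains the tasks."""
    return load(transform(extract()))
```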

To go further and create a Prefect deployment:

  1. Create and apply deployments. Run prefect deployment build ./your-main-flow -n "parent_etl" (parent_etl is the name of the main flow function). This creates a `parent_etl-deployment.yaml` file with the metadata required by the agent to trigger the flow.
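For orientation, the generated file looks roughly like the abridged sketch below; the field values here are illustrative assumptions, and your file will contain more metadata:

```yaml
# Abridged, illustrative sketch of a Prefect deployment file.
name: parent_etl
entrypoint: etl.py:parent_etl
work_queue_name: default
parameters: {}
schedule: null
```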


  2. Next, run the command: prefect deployment apply parent_etl-deployment.yaml
  3. Run the command prefect agent start --work-queue "default"
  4. In the Prefect UI, trigger a quick run: the data is loaded to GCS and BigQuery, and the logs of the process appear in the UI.