Here are initial guidelines to install the required tools.
Terraform: Installation guide
Python: My installation is done anaconda;Use the relevant commands when it comes to creating the environment)
Google cloud CLI: Installation Guide
Prefect CLI: Installed in the instructions
Docker: Find docker here
DBT Core: Installed in the instructions
Git and a Github Repository: Download git here and create a repository here if you dont have one
- Clone this repository
- Create a new environment
- Install dependencies using the command
pip install -r requirements.txt
- GCP set up
- Log in to GCP and create a project
- Set up Google cloud CLI as in the guide
- Create a service account assigning the rights to interact with the various resources (Bigquery, GCS Bucket).
- Create the raw, staging and production datasets in bigquery. In the script we only use raw and production, but for development its best practise to use staging.
- Set up the infrastructure needed using terraform
- Run
terraform init
, thenterraform plan
( see an overview), thenterraform apply
to set up the infrastructure.
-
Follow these steps to run dbt with bigquery adapter using docker.
-
Prefect is already installed, when we ran requirements.txt so start prefect orion server locally
- Open another terminal window and activate the environment (
conda activate <your environment name>
) - Register GCP Blocks using the command:
prefect block register -m prefect_gcp
- Create prefect GCP blocks on the prefect UI or using this code: In our case, we do not require dbt blocks as dbt is running on docker and so we trigger it inside our script instead of using a prefect block.
At this point you can actually run the flow but using the command line. By running python etl.py
, the script will run, calling all the processes in the flow.
To go further and create a prefect deployment:
- Create and apply deployments by running the following command.
prefect deployment build ./your-main-flow -n “parent_etl”
(this is the name of the main flow function) It creates a 'parent_etl-deployment.yaml` file with metadata required by the agent to trigger the flow.