Instructions

Introduction

This repo shows an end-to-end example of how to use dbt to model a number of BigQuery datasets. A high-level overview of the solution is provided in the README.md file.

Follow the steps in this guide to deploy your own BigQuery data model using dbt!

Click the Start button to move to the next step.

Install Terraform v1.1.7

From Cloud Shell, execute the following to install version 1.1.7 of Terraform:

export TERRAFORM_VERSION="1.1.7"
curl https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip > terraform_${TERRAFORM_VERSION}_linux_amd64.zip && \
    sudo unzip -o terraform_${TERRAFORM_VERSION}_linux_amd64.zip -d /usr/local/bin && \
    rm -f terraform_${TERRAFORM_VERSION}_linux_amd64.zip

Verify that Terraform is installed correctly by executing:

terraform -v

Install dbt

From Cloud Shell, execute the following to remove any pre-existing dbt-core installation and install the dbt BigQuery adapter:

pip3 uninstall -y dbt-core
pip3 install --user --upgrade dbt-bigquery
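
The --user flag installs the dbt executable into ~/.local/bin. If the dbt command is not found in the next step, that directory may be missing from your PATH; you can add it with:

export PATH="$HOME/.local/bin:$PATH"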

Verify that dbt is installed correctly by executing:

dbt --version

Set the PROJECT_ROOT environment variable

From the root of the cloned repository, execute the following to set an environment variable that is used throughout the rest of these instructions:

export PROJECT_ROOT=$(pwd)

Create a terraform.tfvars file in the terraform directory

Open the terraform/terraform.tfvars.renameMe file.

The file has the following format; the fields to fill in are self-explanatory:

project_prefix = "bigquery-dbt-bank"
billing_account_id = ""
bigquery_location = "US"
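
If you are unsure of your billing account id, you can list the billing accounts you have access to with the gcloud CLI (available by default in Cloud Shell):

gcloud billing accounts list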

When you have finished editing, save a copy of the file as terraform.tfvars within the terraform directory:

cp $PROJECT_ROOT/terraform/terraform.tfvars.renameMe $PROJECT_ROOT/terraform/terraform.tfvars

Terraform

Execute the following to create a new GCP project, enable the BigQuery API, and create a dataset named raw:

cd $PROJECT_ROOT/terraform
terraform init
terraform apply --auto-approve

Note that the created project will have a project id made up of the project_prefix defined in terraform.tfvars plus a random suffix. The project id is shown in the output after terraform apply completes.

Take note of the project id that was created; you will need it in the next step. You can retrieve it at any time by executing:

cd $PROJECT_ROOT/terraform
terraform output project_id
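
Optionally, you can capture this value in an environment variable for use in later steps (PROJECT_ID is a variable name introduced here for convenience, not one the repo requires):

export PROJECT_ID=$(terraform output -raw project_id)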

dbt (set up profile)

Create the file ~/.dbt/profiles.yml if it doesn't already exist. Add the following block to configure this project's connection to BigQuery:

bigquery_bank:
  outputs:
    dev:
      dataset: bigquery_bank
      fixed_retries: 1
      location: <bigquery_location>
      method: oauth
      priority: batch
      project: <project_id>
      threads: 8
      timeout_seconds: 300
      type: bigquery
  target: dev

Be sure to replace <project_id> with the project id created by Terraform (including the randomized suffix) and <bigquery_location> with the location of the BigQuery dataset as configured in terraform.tfvars.
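
If you captured the project id into a PROJECT_ID variable as suggested earlier, one way to write the profile is with a shell heredoc. This is a sketch: it assumes bigquery_location was left as "US" in terraform.tfvars, and it overwrites any existing ~/.dbt/profiles.yml.

mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml <<EOF
bigquery_bank:
  outputs:
    dev:
      dataset: bigquery_bank
      fixed_retries: 1
      location: US
      method: oauth
      priority: batch
      project: ${PROJECT_ID}
      threads: 8
      timeout_seconds: 300
      type: bigquery
  target: dev
EOF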

dbt (install dependencies)

Install the dbt dependencies, as indicated in the dbt/packages.yml file. These provide the stage_external_sources operation and the codegen macros used in the following steps.

cd $PROJECT_ROOT/dbt
dbt clean
dbt deps

dbt (load external tables)

Execute the following:

cd $PROJECT_ROOT/dbt
dbt run-operation stage_external_sources

This takes the GCS files listed in dbt/models/raw/external_tables.yml and creates corresponding external tables in BigQuery within the raw dataset.

For this tutorial, the data comes from a public Google Cloud Storage bucket, gs://williamtsoi-bigquery-bank-demo/. In your own use case, you may instead use data that has been extract-loaded by your platform of choice.
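
You can verify that the external tables were created by listing the raw dataset with the bq CLI (assuming PROJECT_ID is set as suggested earlier):

bq ls "${PROJECT_ID}:raw"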

dbt - Generate base staging schema

Execute the following:

cd $PROJECT_ROOT/dbt
dbt compile

This will execute the Jinja macros from the dbt/analyses directory, such as dbt/analyses/stg_card_transactions.sql:

{{ codegen.generate_base_model(
    source_name='raw',
    table_name='ext_card_transactions'
) }}

This macro will inspect the table schema of the ext_card_transactions external table in the raw dataset and generate a base SQL model for the staging schema.

The generated SQL models will be placed in the dbt/target/compiled/bigquery_bank/analyses directory. For example, the corresponding base schema generated for the ext_card_transactions table is:

with source as (

    select * from {{ source('raw', 'ext_card_transactions') }}

),

renamed as (

    select
        cc_number,
        trans_id,
        trans_time,
        epoch_time,
        category,
        merchant,
        merchant_lat,
        merchant_lon,
        amount,
        is_fraud,
        trans_date

    from source

)

select * from renamed

dbt - Define staging schema

Normally, at this point you would model the staging schema in the dbt/models/staging directory, based on the base schema files generated into dbt/target/compiled/bigquery_bank/analyses in the previous step.

This is also where you define partitioning and clustering for the tables. For example, in dbt/models/staging/stg_card_transactions.sql, a config block has been added at the top of the model file to partition the table by the trans_date column:

{{ config(
    materialized='table',
    partition_by={
      "field": "trans_date",
      "data_type": "date",
      "granularity": "day"
    }
)}}

with source as (

    select * from {{ source('raw', 'ext_card_transactions') }}

),

renamed as (

    select
        cc_number,
        trans_id,
        trans_time,
        epoch_time,
        category,
        merchant,
        merchant_lat,
        merchant_lon,
        amount,
        is_fraud,
        trans_date

    from source

)

select * from renamed

For the purposes of this tutorial, though, the staging schemas have already been defined, so no action is required here.
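
In your own projects, a common starting point is to copy a generated base model from the compiled output into the staging models directory and then add the config block, for example:

cp $PROJECT_ROOT/dbt/target/compiled/bigquery_bank/analyses/stg_card_transactions.sql \
    $PROJECT_ROOT/dbt/models/staging/stg_card_transactions.sql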

dbt - Deploy schema

Deploy the staging schema into BigQuery by executing:

cd $PROJECT_ROOT/dbt
dbt run
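
Once the run completes, you can sanity-check a deployed staging table with an ad hoc query. This is illustrative; it assumes PROJECT_ID is set as suggested earlier and that models are written to the bigquery_bank dataset configured in profiles.yml:

bq query --use_legacy_sql=false \
    "select count(*) from \`${PROJECT_ID}\`.bigquery_bank.stg_card_transactions"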

dbt - Execute data tests

Run the data tests in BigQuery by executing:

cd $PROJECT_ROOT/dbt
dbt test

Tests are defined in dbt/models/staging/stg_bigquery_bank.yml, as well as in dbt/tests/*.sql.
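
To iterate on the tests for a single model, you can narrow the run with a selector (supported in recent dbt versions):

cd $PROJECT_ROOT/dbt
dbt test --select stg_card_transactions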

dbt - Generate docs

dbt comes with some powerful documentation generation facilities. Once the data model has been deployed, you can generate a documentation website using the following commands:

cd $PROJECT_ROOT/dbt
dbt docs generate

This generates the data catalog JSON file (catalog.json) in the dbt/target directory, which the documentation site is built from.

dbt - Serve documentation website

Now to serve the website, run the following commands:

cd $PROJECT_ROOT/dbt
dbt docs serve

Next, click on the web preview icon at the top right of the window and click "Preview on port 8080".

When you are done browsing the documentation website, press Ctrl-C to stop the web server and return to the shell.

Terraform - destroy resources

Once you are done with this tutorial, run the following commands to destroy all the resources:

cd $PROJECT_ROOT/terraform
terraform destroy --auto-approve