The purpose of this walkthrough is to create Custom Dataflow templates.
The value of custom Dataflow templates is that they allow us to execute Dataflow jobs without installing any code. This is useful for enabling Dataflow execution from an automated process, or for enabling others without technical expertise to run jobs via a user-friendly, guided interface.
You have two options for walking through the steps to deploy this example:
- See Easy Walkthrough to walk through the steps without installing anything on your local machine.
- See Requirements and beyond for an unguided approach.
For an easy walkthrough without installing anything on your local machine, follow the Easy Walkthrough option. Otherwise, the unguided approach has the following requirements:

- Google Cloud SDK; `gcloud init` and `gcloud auth` (see the example commands below)
- Google Cloud project with billing enabled
- terraform
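If you have not yet initialized or authenticated the Google Cloud SDK, the commands typically look like the following. The exact flow depends on your environment; terraform uses application default credentials, so the last command matters when running the workflows locally.

```shell
# Initialize the gcloud CLI and choose a default configuration (interactive).
gcloud init

# Authenticate your user account for gcloud commands.
gcloud auth login

# Provide application default credentials for terraform to use.
gcloud auth application-default login
```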
Additionally, refer to each of the following folders for specific requirements.
In the examples/dataflow-custom-templates folder of this repository, run the following terraform workflows to provision resources in your GCP project.
It is recommended to go through this walkthrough using a new temporary Google Cloud project, unrelated to any of your existing Google Cloud projects.
See https://cloud.google.com/resource-manager/docs/creating-managing-projects for more details.
To simplify the following commands, set the default GCP project.
PROJECT=<CHANGE ME>
gcloud config set project $PROJECT
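You can verify that the default project is set correctly:

```shell
# Prints the currently configured default project.
gcloud config get-value project
```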
Best practice recommends that a Dataflow job:

- Utilize a worker service account to access the pipeline's files and resources
- Grant the worker service account only the minimally necessary IAM permissions
- Enable only the minimally required Google Cloud services

Therefore, this step will:

- Create service accounts
- Provision IAM credentials
- Enable required Google Cloud services
Run the terraform workflow in the infrastructure/01.setup directory. Terraform will ask your permission before provisioning resources. If you agree with terraform provisioning resources, type `yes` to proceed.
DIR=infrastructure/01.setup
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var="project=$(gcloud config get-value project)"
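Once the apply completes, you can optionally confirm that the service accounts were created and the required services were enabled. The exact resource names are defined by the terraform module, so these are generic sanity checks rather than lookups of specific names.

```shell
# List service accounts; the Dataflow worker service account should appear here.
gcloud iam service-accounts list

# List the services enabled on the project (Dataflow, Compute Engine, etc.).
gcloud services list --enabled
```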
Best practice recommends that a Dataflow job:

- Utilize a custom network and subnetwork
- Apply only the minimally necessary network firewall rules
- Execute using private IPs
Note: Building Python custom templates additionally requires a Cloud NAT. During Python Dataflow job initialization, the workers access PyPI to install dependencies. Because the workers use private IPs, a Cloud NAT is needed to reach these resources outside the Virtual Private Cloud network.
Therefore, this step will:

- Provision a custom network and subnetwork
- Provision firewall rules
- Provision a Cloud NAT and its dependent Cloud Router

Run the terraform workflow in the infrastructure/02.network directory. Terraform will ask your permission before provisioning resources. If you agree with terraform provisioning resources, type `yes` to proceed.
DIR=infrastructure/02.network
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var="project=$(gcloud config get-value project)"
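As an optional check, list the networks and subnetworks in the project; the names of the custom network and subnetwork are determined by the terraform module.

```shell
# List VPC networks in the project.
gcloud compute networks list

# List subnetworks across all regions.
gcloud compute networks subnets list
```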
The Apache Beam example that our Dataflow template executes is a derived word count, for both Java and Python.
The word count example requires a source Google Cloud Storage bucket. To make the example interesting, we copy all the files from gs://apache-beam-samples/shakespeare/* to a custom bucket in our project.
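Before provisioning, you can optionally preview the public source files that will be copied:

```shell
# List the public Shakespeare sample files used as the pipeline's input.
gsutil ls gs://apache-beam-samples/shakespeare/
```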
Therefore, this step will:

- Provision a Google Cloud Storage bucket
- Create Google Cloud Storage objects to read from in the pipeline
Run the terraform workflow in the infrastructure/03.io directory.
Terraform will ask your permission before provisioning resources. If you agree with terraform provisioning resources, type `yes` to proceed.
DIR=infrastructure/03.io
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var="project=$(gcloud config get-value project)"
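After the apply completes, you can confirm that the bucket was created and the sample files were copied. The bucket name is generated by the terraform module, so list the project's buckets first and substitute the name you find.

```shell
# List the buckets in the project to find the one created by the module.
gsutil ls -p $(gcloud config get-value project)

# Then list its contents; replace BUCKET_NAME with the bucket found above.
gsutil ls -r gs://BUCKET_NAME/
```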
We will use Cloud Build to build the custom Dataflow template. There are advantages to using Cloud Build instead of performing the necessary commands on our local machine: Cloud Build connects to our version control, GitHub in this example, so that any changes made to a specific branch automatically trigger a new build of the Dataflow template.
Therefore, this step will:

- Provision a Cloud Build trigger that will:
  - Run the language-specific build process, i.e. gradle shadowJar, go build, etc.
  - Execute the `gcloud dataflow flex-template` command with the relevant arguments (an illustrative sketch of this command is shown below)
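For reference, the command that the trigger executes looks roughly like the Java variant sketched below. This is illustrative only: the template spec path, image path, jar name, and main class are placeholders, and the actual arguments are defined by the Cloud Build trigger provisioned in this step.

```shell
# Illustrative only: build a Flex Template spec file from a pre-built pipeline jar.
# The bucket, image path, jar, and main class below are placeholders.
gcloud dataflow flex-template build gs://BUCKET_NAME/templates/word-count-java.json \
  --image-gcr-path "REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/word-count-java:latest" \
  --sdk-language "JAVA" \
  --flex-template-base-image "JAVA11" \
  --jar "build/libs/pipeline-all.jar" \
  --env "FLEX_TEMPLATE_JAVA_MAIN_CLASS=com.example.WordCount"
```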
In order to benefit from Cloud Build, the service requires that we own this repository; it will not work with just any repository, even if it is public.
See infrastructure/04.template#Requirements
First, set your GitHub organization or username:
GITHUB_REPO_OWNER=<change me>
Next, set expected defaults.
GITHUB_REPO_NAME=professional-services
WORKING_DIR_PREFIX=examples/dataflow-custom-templates
Run the terraform workflow in the infrastructure/04.template directory.
Terraform will ask your permission before provisioning resources. If you agree with terraform provisioning resources, type `yes` to proceed.
DIR=infrastructure/04.template
terraform -chdir=$DIR init
terraform -chdir=$DIR apply -var="project=$(gcloud config get-value project)" -var="github_repository_owner=$GITHUB_REPO_OWNER" -var="github_repository_name=$GITHUB_REPO_NAME" -var="working_dir_prefix=$WORKING_DIR_PREFIX"
Navigate to https://console.cloud.google.com/cloud-build/triggers.
You should see a Cloud Build trigger listed for each language of this example.
Click the RUN button next to the Cloud Build trigger for your language of choice to execute it manually.
See https://cloud.google.com/build/docs/automating-builds/create-manual-triggers?hl=en#running_manual_triggers for more information.
This step will take several minutes to complete.
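The build progress is visible in the Cloud Build console; you can also check the status of recent builds from the CLI:

```shell
# Show the most recent builds and their status.
gcloud builds list --limit=5
```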
There are multiple ways to run a Dataflow Job from a custom template. We will use the Google Cloud Web UI.
To start the process, navigate to https://console.cloud.google.com/dataflow/createjob.
Select Custom Template from the Dataflow template drop-down menu. Then, click the BROWSE button and navigate to the bucket whose name starts with dataflow-templates-. Within this bucket, select the JSON file object that represents the template details. You should see a JSON file for each of the Cloud Build triggers you ran to create the custom template.
The Google Cloud console will further prompt for required fields, such as the Job name, as well as any fields required by the custom Dataflow template. When you are satisfied with the values provided to the custom Dataflow template, click the RUN button.
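As an aside, the same template can also be launched from the command line with gcloud dataflow flex-template run. The sketch below is illustrative only: the job name, template file location, region, and parameter names are placeholders, and the actual parameter names are defined by the template's metadata.

```shell
# Illustrative only: launch a Dataflow job from a Flex Template spec file.
# The template path, region, and parameters below are placeholders.
gcloud dataflow flex-template run "word-count-$(date +%Y%m%d-%H%M%S)" \
  --template-file-gcs-location "gs://dataflow-templates-BUCKET_SUFFIX/word-count-java.json" \
  --region "us-central1" \
  --parameters "source=gs://SOURCE_BUCKET/shakespeare/*,output=gs://OUTPUT_BUCKET/results"
```

Whichever way you launch the job, you can also list jobs from the CLI with gcloud dataflow jobs list --region=us-central1.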
Navigate to https://console.cloud.google.com/dataflow/jobs to locate the job you just created. Clicking on the job takes you to the job monitoring screen.
To clean up the resources provisioned by the terraform modules, run the following. Terraform will ask you to confirm with `yes` to proceed.
Destroy the Cloud Build triggers:
DIR=infrastructure/04.template
terraform -chdir=$DIR destroy -var="project=$(gcloud config get-value project)" -var="github_repository_owner=$GITHUB_REPO_OWNER"
Destroy the Google Cloud storage resources:
DIR=infrastructure/03.io
terraform -chdir=$DIR destroy -var="project=$(gcloud config get-value project)"
Destroy the custom networking resources:
DIR=infrastructure/02.network
terraform -chdir=$DIR destroy -var="project=$(gcloud config get-value project)"
Destroy the provisioned setup resources:
DIR=infrastructure/01.setup
terraform -chdir=$DIR destroy -var="project=$(gcloud config get-value project)"
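If you followed the recommendation to use a new temporary project for this walkthrough, you can also delete the entire project once you are finished, which removes any remaining resources:

```shell
# Permanently deletes the project (after a grace period). Only do this for a
# throwaway project created specifically for this walkthrough.
gcloud projects delete $(gcloud config get-value project)
```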