This repository contains a Databricks notebook that can be used to help schedule and automate the starting, stopping, and configuration of Databricks SQL Warehouses. This solution is designed to be used with Databricks Workflows for scheduling and notifications.
This notebook automatically leverages the Databricks PAT of the user or service principal that is running the job; therefore, ensure that this account has permissions to update the SQL Warehouse.
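For reference, a minimal sketch of how a notebook can pick up the API URL and token of the identity running the job from the notebook context (the variable names here are illustrative, not taken from the notebook itself):

```python
# Illustrative sketch: grab the workspace URL and API token of the user / service
# principal running this job from the notebook context (dbutils is available in
# any Databricks notebook).
context = dbutils.notebook.entry_point.getDbutils().notebook().getContext()

workspace_url = context.apiUrl().getOrElse(None)  # e.g. https://adb-<workspace-id>.azuredatabricks.net
api_token = context.apiToken().getOrElse(None)    # token of the calling identity

headers = {"Authorization": f"Bearer {api_token}"}
```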
This notebook should be triggered using Databricks Workflows. Users can either import this notebook (`update_sql_warehouse.py`) into the Databricks Workspace, or use a Git reference to this repository.
Once a job is created, add a task that points to the `update_sql_warehouse.py` notebook.
Then leverage the Jobs scheduler to set when this task should be executed.
This notebook uses Databricks widgets to configure how the SQL warehouse should be updated.
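As a rough sketch, widgets in a Databricks notebook are typically declared and read as shown below; the widget names used here are illustrative and may differ from the ones in `update_sql_warehouse.py`:

```python
# Illustrative widget declarations; the actual widget names in the notebook may differ.
dbutils.widgets.text("warehouse_id", "")
dbutils.widgets.dropdown("action", "update", ["start", "stop", "update"])
dbutils.widgets.text("auto_stop_mins", "30")
dbutils.widgets.dropdown("size", "SMALL",
    ["XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE",
     "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE"])
dbutils.widgets.text("min_num_clusters", "1")
dbutils.widgets.text("max_num_clusters", "1")
dbutils.widgets.dropdown("spot_instance_policy", "COST_OPTIMIZED",
    ["COST_OPTIMIZED", "RELIABILITY_OPTIMIZED"])
dbutils.widgets.dropdown("enable_serverless_compute", "False", ["True", "False"])

warehouse_id = dbutils.widgets.get("warehouse_id")
action = dbutils.widgets.get("action")
```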
This is the unique ID for the SQL warehouse that should be updated.
This parameter supports the following options:
- `start`: Start the warehouse.
- `stop`: Stop the warehouse.
- `update`: Only update the warehouse definition.
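As a rough sketch of what these actions map to in the Databricks SQL Warehouses REST API (reusing the `workspace_url`, `headers`, and widget values from the earlier sketches; `edit_payload` is sketched further down under the configuration parameters, and the ordering shown is an assumption, not necessarily what the notebook does):

```python
import requests

base_url = f"{workspace_url}/api/2.0/sql/warehouses/{warehouse_id}"

# Apply the configuration first so a subsequent start picks up the new definition.
requests.post(f"{base_url}/edit", headers=headers, json=edit_payload).raise_for_status()

if action == "start":
    requests.post(f"{base_url}/start", headers=headers).raise_for_status()
elif action == "stop":
    requests.post(f"{base_url}/stop", headers=headers).raise_for_status()
# action == "update": nothing further to do, the edit above already applied the changes.
```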
This parameter sets the auto-terminate time window in minutes. If set to 0, the Warehouse will be set to never auto-terminate.
This parameter sets the cluster size:
- `XXSMALL`: 2X-Small
- `XSMALL`: X-Small
- `SMALL`: Small
- `MEDIUM`: Medium
- `LARGE`: Large
- `XLARGE`: X-Large
- `XXLARGE`: 2X-Large
- `XXXLARGE`: 3X-Large
- `XXXXLARGE`: 4X-Large
Changing the cluster size will cause the Warehouse to restart if it is already running.
This parameter sets the minimum number of clusters in an endpoint. `max_num_clusters` must be >= `min_num_clusters`.
This parameter sets the maximum number of clusters in an endpoint. `max_num_clusters` must be >= `min_num_clusters`.
This parameter sets the spot instance policy:
- `COST_OPTIMIZED`: Use spot instances when possible.
- `RELIABILITY_OPTIMIZED`: Do not use spot instances.
This parameter supports the following options:
- `True`: Use serverless compute.
- `False`: Use classic compute.
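Taken together, these parameters roughly correspond to the body of a warehouse `edit` call. A hedged sketch of such a payload is shown below; the field names follow the public SQL Warehouses API, while the mapping from the widget values is an assumption:

```python
# Illustrative mapping from the notebook parameters to an /edit request body.
size_map = {
    "XXSMALL": "2X-Small", "XSMALL": "X-Small", "SMALL": "Small",
    "MEDIUM": "Medium", "LARGE": "Large", "XLARGE": "X-Large",
    "XXLARGE": "2X-Large", "XXXLARGE": "3X-Large", "XXXXLARGE": "4X-Large",
}

edit_payload = {
    "cluster_size": size_map[dbutils.widgets.get("size")],
    "auto_stop_mins": int(dbutils.widgets.get("auto_stop_mins")),
    "min_num_clusters": int(dbutils.widgets.get("min_num_clusters")),
    "max_num_clusters": int(dbutils.widgets.get("max_num_clusters")),
    "spot_instance_policy": dbutils.widgets.get("spot_instance_policy"),
    "enable_serverless_compute": dbutils.widgets.get("enable_serverless_compute") == "True",
}
```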
- Leverage a Cluster Pool set to use All Spot Instances with the smallest VM option (see the cluster sketch after this list):
  - Azure: `Standard_F4`
  - AWS: `m4.large`
- Use a Single Node cluster & the Cluster Pool to execute the job
- Use Databricks job features to help with notifications and retries if needed.
- If using Spot Instances, make sure retries are configured in case eviction occurs.
- Resizing the endpoint (e.g. SMALL -> LARGE) triggers a restart of the endpoint, which will terminate any active queries; therefore, use caution when performing an endpoint resize.
- Job/Automated clusters usually take 3-5 minutes to spin up; therefore, account for that extra time when scheduling.
- For help with generating a quartz cron expression, use the following: Link
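A hedged sketch of a job cluster definition that follows the pool / single-node recommendations above, expressed as the `new_cluster` block of a Jobs API task written as a Python dict; the pool ID and Spark version are placeholders:

```python
# Illustrative job cluster spec following the pool / single-node recommendations.
job_cluster = {
    "spark_version": "13.3.x-scala2.12",          # placeholder LTS runtime
    "instance_pool_id": "<your-all-spot-pool-id>",  # placeholder pool ID
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```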
Let's imagine a scenario where we want to create a SQL Warehouse that dynamically changes its configuration based on a set schedule.
During core business hours (7am - 6pm MT, Monday to Friday), this Warehouse should have a size of X-Large and should not auto-terminate based on inactivity. This warehouse should also have auto-scaling set up so that it can scale between 1 and 4 clusters within the endpoint if required.
During the weekend and outside of core business hours on weekdays, we want to switch to a Small Databricks SQL Warehouse that auto-terminates after 30 mins of inactivity. We also want to turn off auto-scaling and leverage spot instances whenever possible.
In addition to these configuration changes, we want the SQL Warehouse to automatically start at 7am Monday to Friday if it is not already started.
To implement this, we will need two Databricks jobs:
1. Start the Warehouse at 7am on Monday - Friday and update the configuration for core hours.
2. Update the Warehouse at 6pm on Monday - Friday and modify the configuration for non-core hours.
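The quartz cron expressions for these two job schedules might look like the following (the time zone ID for Mountain Time is an assumption; adjust as needed):

```python
# Illustrative Jobs API schedule blocks for the two jobs
# (quartz format: seconds minutes hours day-of-month month day-of-week).
core_hours_schedule = {
    "quartz_cron_expression": "0 0 7 ? * MON-FRI",   # 7:00 am Mon-Fri: start + core-hours config
    "timezone_id": "America/Denver",
}
off_hours_schedule = {
    "quartz_cron_expression": "0 0 18 ? * MON-FRI",  # 6:00 pm Mon-Fri: non-core-hours config
    "timezone_id": "America/Denver",
}
```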