This repository contains a Databricks notebook that can be used to help schedule and automate the starting, stopping, and configuration of Databricks SQL Warehouses. This solution is designed to be used with Databricks Workflows for scheduling and notifications.
This notebook automatically leverages the Databricks PAT of the user or service principal that is running the job; therefore, ensure that this account has permissions to update the SQL Warehouse.
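For reference, a minimal sketch of how a notebook can pick up the API URL and token of the identity running the job from the notebook context (the variable names here are illustrative, not taken from the notebook itself):

```python
# Illustrative sketch: grab the workspace URL and API token of the user / service
# principal running this job from the notebook context (dbutils is available in
# any Databricks notebook).
context = dbutils.notebook.entry_point.getDbutils().notebook().getContext()

workspace_url = context.apiUrl().getOrElse(None)  # e.g. https://adb-<workspace-id>.azuredatabricks.net
api_token = context.apiToken().getOrElse(None)    # token of the calling identity

headers = {"Authorization": f"Bearer {api_token}"}
```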
This notebook should be triggered using Databricks Workflows. Users can either import this notebook (`update_sql_warehouse.py`) into the Databricks Workspace, or use a Git reference to this repository.
Once a job is created, add a task that points to the `update_sql_warehouse.py` notebook.
Then leverage the Jobs scheduler to set when this task should be executed.
This notebook uses Databricks widgets to configure how the SQL warehouse should be updated.
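As a rough sketch, widgets in a Databricks notebook are typically declared and read as shown below; the widget names used here are illustrative and may differ from the ones in `update_sql_warehouse.py`:

```python
# Illustrative widget declarations; the actual widget names in the notebook may differ.
dbutils.widgets.text("warehouse_id", "")
dbutils.widgets.dropdown("action", "update", ["start", "stop", "update"])
dbutils.widgets.text("auto_stop_mins", "30")
dbutils.widgets.dropdown("size", "SMALL",
    ["XXSMALL", "XSMALL", "SMALL", "MEDIUM", "LARGE",
     "XLARGE", "XXLARGE", "XXXLARGE", "XXXXLARGE"])
dbutils.widgets.text("min_num_clusters", "1")
dbutils.widgets.text("max_num_clusters", "1")
dbutils.widgets.dropdown("spot_instance_policy", "COST_OPTIMIZED",
    ["COST_OPTIMIZED", "RELIABILITY_OPTIMIZED"])
dbutils.widgets.dropdown("enable_serverless_compute", "False", ["True", "False"])

warehouse_id = dbutils.widgets.get("warehouse_id")
action = dbutils.widgets.get("action")
```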
This is the unique ID for the SQL warehouse that should be updated.
This parameter supports the following options:
- `start`: Start the warehouse.
- `stop`: Stop the warehouse.
- `update`: Only update the warehouse definition.
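As a rough sketch of what these actions map to in the Databricks SQL Warehouses REST API (reusing the `workspace_url`, `headers`, and widget values from the earlier sketches; `edit_payload` is sketched further down under the configuration parameters, and the ordering shown is an assumption, not necessarily what the notebook does):

```python
import requests

base_url = f"{workspace_url}/api/2.0/sql/warehouses/{warehouse_id}"

# Apply the configuration first so a subsequent start picks up the new definition.
requests.post(f"{base_url}/edit", headers=headers, json=edit_payload).raise_for_status()

if action == "start":
    requests.post(f"{base_url}/start", headers=headers).raise_for_status()
elif action == "stop":
    requests.post(f"{base_url}/stop", headers=headers).raise_for_status()
# action == "update": nothing further to do, the edit above already applied the changes.
```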
This parameter sets the auto-terminate time window in minutes. If set to 0, the Warehouse will be set to never auto-terminate.
This parameter sets the cluster size:
- `XXSMALL`: 2X-Small
- `XSMALL`: X-Small
- `SMALL`: Small
- `MEDIUM`: Medium
- `LARGE`: Large
- `XLARGE`: X-Large
- `XXLARGE`: 2X-Large
- `XXXLARGE`: 3X-Large
- `XXXXLARGE`: 4X-Large
Changing the cluster size will cause the Warehouse to restart if it is already running.
This parameter sets the minimum number of clusters in an endpoint. `max_num_clusters` must be >= `min_num_clusters`.
This parameter sets the maximum number of clusters in an endpoint. `max_num_clusters` must be >= `min_num_clusters`.
This parameter sets the spot instance policy:
- `COST_OPTIMIZED`: Use spot instances when possible.
- `RELIABILITY_OPTIMIZED`: Do not use spot instances.
This parameter supports the following options:
- `True`: Use serverless compute.
- `False`: Use classic compute.
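Taken together, these parameters roughly correspond to the body of a warehouse `edit` call. A hedged sketch of such a payload is shown below; the field names follow the public SQL Warehouses API, while the mapping from the widget values is an assumption:

```python
# Illustrative mapping from the notebook parameters to an /edit request body.
size_map = {
    "XXSMALL": "2X-Small", "XSMALL": "X-Small", "SMALL": "Small",
    "MEDIUM": "Medium", "LARGE": "Large", "XLARGE": "X-Large",
    "XXLARGE": "2X-Large", "XXXLARGE": "3X-Large", "XXXXLARGE": "4X-Large",
}

edit_payload = {
    "cluster_size": size_map[dbutils.widgets.get("size")],
    "auto_stop_mins": int(dbutils.widgets.get("auto_stop_mins")),
    "min_num_clusters": int(dbutils.widgets.get("min_num_clusters")),
    "max_num_clusters": int(dbutils.widgets.get("max_num_clusters")),
    "spot_instance_policy": dbutils.widgets.get("spot_instance_policy"),
    "enable_serverless_compute": dbutils.widgets.get("enable_serverless_compute") == "True",
}
```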
- Leverage a Cluster Pool set to use All Spot Instances with the smallest VM option (see the cluster sketch after this list):
  - Azure: `Standard_F4`
  - AWS: `m4.large`
- Use a Single Node cluster & the Cluster Pool to execute the job
- Use Databricks job features to help with notifications and retries if needed.
- If using Spot Instances, make sure retries are configured in case eviction occurs.
- Resizing the endpoint (e.g. SMALL -> LARGE) triggers a restart of the endpoint, which will terminate any active queries; therefore, use caution when performing an endpoint resize.
- Job/Automated clusters usually take 3-5 minutes to spin up; therefore, account for that extra time when scheduling.
- For help with generating a quartz cron expression, use the following: Link
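A hedged sketch of a job cluster definition that follows the pool / single-node recommendations above, expressed as the `new_cluster` block of a Jobs API task written as a Python dict; the pool ID and Spark version are placeholders:

```python
# Illustrative job cluster spec following the pool / single-node recommendations.
job_cluster = {
    "spark_version": "13.3.x-scala2.12",          # placeholder LTS runtime
    "instance_pool_id": "<your-all-spot-pool-id>",  # placeholder pool ID
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```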
Let's imagine a scenario where we want to create a SQL Warehouse that dynamically changes its configuration based on a set schedule.
During core business hours (7am - 6pm MT, Monday to Friday), this Warehouse should have a size of X-Large and should not auto-terminate based on inactivity. This warehouse should also have auto-scaling set up so that it can scale between 1 and 4 clusters within the endpoint if required.
During the weekend and outside of core business hours on weekdays, we want to switch to a Small Databricks SQL Warehouse that auto-terminates after 30 mins of inactivity. We also want to turn off auto-scaling and leverage spot instances whenever possible.
In addition to these configuration changes, we want the SQL Warehouse to automatically start at 7am Monday to Friday if it is not already started.
To implement this, we will need two Databricks jobs:
1. Start the Warehouse at 7am on Monday - Friday and update the configuration for core hours.
2. Update the Warehouse at 6pm on Monday - Friday and modify the configuration for non-core hours.
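The quartz cron expressions for these two job schedules might look like the following (the time zone ID for Mountain Time is an assumption; adjust as needed):

```python
# Illustrative Jobs API schedule blocks for the two jobs
# (quartz format: seconds minutes hours day-of-month month day-of-week).
core_hours_schedule = {
    "quartz_cron_expression": "0 0 7 ? * MON-FRI",   # 7:00 am Mon-Fri: start + core-hours config
    "timezone_id": "America/Denver",
}
off_hours_schedule = {
    "quartz_cron_expression": "0 0 18 ? * MON-FRI",  # 6:00 pm Mon-Fri: non-core-hours config
    "timezone_id": "America/Denver",
}
```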