Databricks Workflows is a highly reliable, managed orchestrator that lets you author and schedule DAGs of notebooks, Python scripts, and dbt projects as production jobs.
The capability of running dbt in a Job is currently in private preview. You must be enrolled in the private preview to follow the steps in this guide. Features, capabilities and pricing may change at any time.
In this guide, you will learn how to update an existing dbt project to run as a job, retrieve dbt run artifacts using an API, and debug common issues.
When you run a dbt project as a Databricks Job, the dbt Python process as well as the SQL generated by dbt run on the same Automated Cluster.
If you want to run the SQL elsewhere, for example on a Databricks SQL endpoint or another cloud data warehouse, you can customize the checked-in profiles.yml file accordingly (see below).
- An existing dbt project under version control in Git
- Access to a Databricks workspace
- Ability to launch job clusters (using a policy or cluster create permissions), or access to an existing interactive cluster with the dbt-core and dbt-databricks libraries installed, or CAN_MANAGE permissions to install dbt-core and dbt-databricks as cluster libraries. We recommend DBR 10.4 or later for better SQL compatibility.
- Files in Repos must be enabled. It is only supported on Databricks Runtime (DBR) 8.4+ or DBR 11+, depending on the configuration. Make sure the cluster runs an appropriate DBR version.
- Install and configure the Databricks CLI
- Install jq, a popular open source tool for parsing JSON from the command line
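Before moving on, it can help to confirm that both command-line prerequisites are actually on your PATH. The following Python sketch does that; it assumes the Databricks CLI is installed under the executable name databricks and jq under the name jq, and the helper name missing_tools is made up for this illustration.

```python
import shutil

# Assumption: the Databricks CLI executable is called "databricks"
# and the jq executable is called "jq".
REQUIRED_TOOLS = ("databricks", "jq")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command-line tools:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

This only checks that the executables exist; it does not verify that the Databricks CLI has been configured with a host and token.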
In this step, you will create a job that will run the dbt project on a schedule.
The dbt task only supports retrieving dbt projects from Git. Follow the documentation to connect Databricks to Git.
- Log in to your Databricks workspace
- Click the Data Science & Engineering persona in the left navigation bar
- Click Workflows
- Click Create Job
- Click Type and choose dbt
- Click Edit next to "Git provider"
- In the dialog, enter your Git repository URL and choose the Git provider. Also choose a branch, tag, or commit, e.g. main.
- If your dbt project is in the root of the Git repository, leave the Path field empty. Otherwise, provide the relative path, e.g. /my/relative/path.
- You can customize dbt commands as needed, including any flag accepted by the dbt CLI.
- By default, Databricks installs a recent version of dbt-databricks from PyPI, which also installs dbt-spark and dbt-core. You can customize this version if you wish.
- You can customize the Automated Cluster if you wish by clicking Edit in the Cluster dropdown.
- Click Save
You can now run your newly saved job and see its output.
- Click Run Now on the notification that shows up when you save the job
- Click the active run to see the dbt output. Note that dbt output is not real-time; it lags behind dbt's progress by several seconds to a minute.
A dbt run generates useful artifacts that you may want to retrieve for analysis. Databricks saves the contents of the /logs and /target directories as a compressed archive, which you can retrieve using the Jobs API.
It is currently not possible to refer to a previous run's artifacts, e.g. using the --state flag. You can, however, include a known good state in your repository.
dbt-artifacts is a popular dbt package for ingesting dbt artifacts into tables. This is currently not supported on Databricks. Please contact us if you are interested in Databricks supporting this package.
Follow these steps to retrieve dbt artifacts from a job run:
- Go to a job in Databricks and copy the Task Run ID. It appears in the sidebar under Task run details when you click on a run.
- Enter the following command in your terminal:
$ databricks jobs configure --version=2.1
$ databricks runs get --run-id TASK_RUN_ID | jq .tasks
- The above command will return an array of tasks with their run_ids. Find the dbt task's run_id and run this command:
$ DBT_ARTIFACT_URL="$(databricks runs get-output --run-id DBT_TASK_RUN_ID | jq -r .dbt_output.artifacts_link)"
$ curl "$DBT_ARTIFACT_URL" --output artifact.tar.gz
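If you prefer to script the lookup rather than read the jq output by eye, the selection logic in the steps above can be sketched in Python. The fields tasks, run_id, and dbt_output.artifacts_link come from the commands above; the assumption that a dbt task is marked by a dbt_task key in its task entry, the helper names, and the sample responses are illustrative and not confirmed by this guide.

```python
import json


def find_dbt_task_run_ids(run):
    """Return the run_id of every task in a `runs get` response that has a dbt_task block."""
    # Assumption: dbt tasks carry a "dbt_task" key in their task entry.
    return [task["run_id"] for task in run.get("tasks", []) if "dbt_task" in task]


def artifacts_link(output):
    """Return dbt_output.artifacts_link from a `runs get-output` response, or None."""
    return (output.get("dbt_output") or {}).get("artifacts_link")


# Made-up sample responses, trimmed to the fields used here.
SAMPLE_RUN = json.loads("""
{"run_id": 100, "tasks": [
  {"run_id": 101, "task_key": "ingest", "notebook_task": {}},
  {"run_id": 102, "task_key": "transform", "dbt_task": {}}
]}
""")
SAMPLE_OUTPUT = json.loads(
    '{"dbt_output": {"artifacts_link": "https://example.com/artifact.tar.gz"}}'
)

if __name__ == "__main__":
    print(find_dbt_task_run_ids(SAMPLE_RUN))
    print(artifacts_link(SAMPLE_OUTPUT))
```

In practice you would replace the sample dictionaries with the parsed output of the two CLI commands, for example by piping the command output to the script and reading it with json.load(sys.stdin).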
On macOS or Linux, you can run the following command to expand and decompress the archive:
$ tar -xvf artifact.tar.gz
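As an alternative to unpacking the archive by hand, the following Python sketch reads it directly and counts the node statuses recorded in dbt's run_results.json. It assumes the archive contains a file with that name somewhere inside it (the exact directory layout of the archive is not specified here), and the sample archive it builds is made up purely for demonstration.

```python
import io
import json
import tarfile
from collections import Counter


def summarize_run_results(archive_bytes):
    """Count node statuses in the first run_results.json found in a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            # Assumption: the file may sit at any depth, so match on the base name.
            if member.isfile() and member.name.split("/")[-1] == "run_results.json":
                payload = json.load(archive.extractfile(member))
                return Counter(r["status"] for r in payload.get("results", []))
    return None  # the archive holds no run_results.json


def build_sample_archive():
    """Build a tiny in-memory .tar.gz for demonstration; the contents are made up."""
    body = json.dumps({"results": [
        {"unique_id": "model.demo.orders", "status": "success"},
        {"unique_id": "model.demo.customers", "status": "success"},
        {"unique_id": "model.demo.payments", "status": "error"},
    ]}).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="target/run_results.json")
        info.size = len(body)
        archive.addfile(info, io.BytesIO(body))
    return buffer.getvalue()


if __name__ == "__main__":
    print(summarize_run_results(build_sample_archive()))
```

To run it against a real download, pass the bytes of artifact.tar.gz instead, for example summarize_run_results(open("artifact.tar.gz", "rb").read()).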
- For now, you must provide a profiles.yml file in the root of the Git repository. Check that this file is present and properly named, e.g. it is not profile.yml.
- If you do not use the automatically generated profiles.yml, check your Personal Access Token (PAT). It must not be expired.
- Consider adding dbt debug as the first command. This may give you a clue about the failure.
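The first check in the list above is easy to automate before you push. Here is a minimal Python sketch that looks for profiles.yml in a project directory and reports likely misspellings; the function name and the list of near-miss file names are hypothetical choices for this example.

```python
import os
import tempfile

# Hypothetical list of near-miss names; extend it as needed.
LIKELY_MISNAMED = ("profile.yml", "profiles.yaml", "profile.yaml")


def check_profiles_file(project_root):
    """Return (ok, message) describing whether profiles.yml exists in project_root."""
    entries = set(os.listdir(project_root))
    if "profiles.yml" in entries:
        return True, "profiles.yml found"
    near_misses = sorted(entries.intersection(LIKELY_MISNAMED))
    if near_misses:
        return False, "profiles.yml missing; found " + ", ".join(near_misses) + " instead"
    return False, "profiles.yml missing"


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that contains a misnamed file.
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "profile.yml"), "w").close()
        print(check_profiles_file(root))
```

Run it against the root of your local checkout; it checks only the file name, not the contents of the profile.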
If you checked out the Git repository before enabling the Files in Repos feature, the cached checkout might be invalid. Push a dummy commit to your repository to force a fresh checkout.
By default, the dbt task type connects to the Automated Cluster that dbt-core is running on, without any configuration changes or the need to check in any secrets. It does so by generating a default profiles.yml and telling dbt to use it. There are no restrictions on connecting to other dbt targets such as Databricks SQL, Amazon Redshift, Google BigQuery, Snowflake, or any other supported adapter. You can override the automatically generated profile by specifying an alternative profiles directory in the dbt command using --profiles-dir <dir>, where <dir> should be a relative path such as . or ./my-directory.
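If you generate the dbt commands for the task programmatically, a small guard can enforce the relative-path requirement. The Python sketch below appends --profiles-dir to a command string and rejects absolute paths; the helper name with_profiles_dir is made up for this example and is not part of dbt or Databricks.

```python
import os
import shlex


def with_profiles_dir(dbt_command, profiles_dir):
    """Append --profiles-dir to a dbt command, rejecting absolute paths."""
    if os.path.isabs(profiles_dir):
        raise ValueError("profiles directory must be a relative path: " + profiles_dir)
    # shlex.quote leaves simple paths untouched and quotes anything unusual.
    return dbt_command + " --profiles-dir " + shlex.quote(profiles_dir)


if __name__ == "__main__":
    print(with_profiles_dir("dbt run", "."))
    print(with_profiles_dir("dbt run", "./my-directory"))
```

The resulting strings are what you would enter as the dbt commands of the task.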
If you'd like to connect to multiple outputs and include the current Automated Cluster as one of those, the following configuration can be used without exposing any secrets:
databricks_demo:
  target: databricks_cluster
  outputs:
    databricks_cluster:
      type: databricks
      connect_retries: 5
      connect_timeout: 180
      schema: "<your-schema>"
      threads: 8 # This can be increased or decreased to control the parallelism
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
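To see how the templated values in this profile resolve, the following Python sketch reproduces the http_path substitution and reports which of the four environment variables are unset. It assumes, as the profile implies, that the dbt task provides these variables at run time; the helper names and the sample values are made up.

```python
# Variable names are taken from the profile above.
PROFILE_ENV_VARS = ("DBT_HOST", "DBT_ORG_ID", "DBT_CLUSTER_ID", "DBT_ACCESS_TOKEN")


def missing_profile_vars(environ):
    """Return the profile variables that are absent or empty in `environ`."""
    return [name for name in PROFILE_ENV_VARS if not environ.get(name)]


def resolve_http_path(environ):
    """Build the http_path the same way the profile template does."""
    return "sql/protocolv1/o/{}/{}".format(
        environ["DBT_ORG_ID"], environ["DBT_CLUSTER_ID"]
    )


if __name__ == "__main__":
    # Made-up values for illustration only.
    sample_env = {
        "DBT_HOST": "example.cloud.databricks.com",
        "DBT_ORG_ID": "1234567890",
        "DBT_CLUSTER_ID": "0101-123456-abcdefgh",
    }
    print(missing_profile_vars(sample_env))
    print(resolve_http_path(sample_env))
```

Passing os.environ instead of the sample dictionary lets you run the same check inside the environment where dbt executes, for example from a debugging step.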