Add databricks asset bundles docs (#4265)
* rename docs

* add new docs

* add redirection

* index

* add scaffold

* update

* style

* add index page

* spelling

* move DAB to beginning

* add back the instruction

* dbx gone for good

* add more images for running jobs and add existing cluster section

* language

* Update docs/source/deployment/databricks/databricks_ide_development_workflow.md (suggestions from review, co-authored by Merel Theisen)

* address review comments

* rename

* fix index

* fix reference

* update release notes

---------

Signed-off-by: Nok <[email protected]>
Signed-off-by: Nok Lam Chan <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
noklam and merelcht authored Nov 26, 2024
1 parent 9cbd2f7 commit cd4a7b8
Showing 10 changed files with 198 additions and 8 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -31,7 +31,7 @@ secret-scan:
trufflehog --max_depth 1 --exclude_paths trufflehog-ignore.txt .

build-docs:
-	uv pip install --system -e ".[docs]"
+	uv pip install -e ".[docs]"
./docs/build-docs.sh "docs"

show-docs:
5 changes: 3 additions & 2 deletions RELEASE.md
@@ -14,10 +14,11 @@

## Breaking changes to the API
## Documentation changes
-* Updated CLI autocompletion docs with new Click syntax.
-* Standardised `.parquet` suffix in docs and tests.
+* Added Databricks Asset Bundles deployment guide.
* Added a new minimal Kedro project creation guide.
* Added example to explain how dataset factories work.
+* Updated CLI autocompletion docs with new Click syntax.
+* Standardised `.parquet` suffix in docs and tests.

## Community contributions
* [Hyewon Choi](https://github.com/hyew0nChoi)
3 changes: 3 additions & 0 deletions docs/source/deployment/databricks/databricks_dbx_workflow.md
@@ -1,5 +1,8 @@
# Use an IDE, dbx and Databricks Repos to develop a Kedro project

+```{warning}
+`dbx` was deprecated by Databricks in 2023; the recommended workflow is now to use [Databricks Asset Bundles](./databricks_ide_databricks_asset_bundles_workflow.md).
+```
This guide demonstrates a workflow for developing Kedro projects on Databricks using your local environment for development, then using dbx and Databricks Repos to sync code for testing on Databricks.

By working in your local environment, you can take advantage of features within an IDE that are not available on Databricks notebooks:
2 changes: 1 addition & 1 deletion docs/source/deployment/databricks/databricks_deployment_workflow.md
@@ -15,7 +15,7 @@ Here are some typical use cases for running a packaged Kedro project as a Databricks job:

Running your packaged project as a Databricks job is very different from running it from a Databricks notebook. The Databricks job cluster has to be provisioned and started for each run, which is significantly slower than running it as a notebook on a cluster that has already been started. In addition, there is no way to change your project's code once it has been packaged. Instead, you must change your code, create a new package, and then upload it to Databricks again.

-For those reasons, the packaging approach is unsuitable for development projects where rapid iteration is necessary. For guidance on developing a Kedro project for Databricks in a rapid build-test loop, see the [development workflow guide](./databricks_ide_development_workflow.md).
+For those reasons, the packaging approach is unsuitable for development projects where rapid iteration is necessary. For guidance on developing a Kedro project for Databricks in a rapid build-test loop, see the [development workflow guide](./databricks_ide_databricks_asset_bundles_workflow.md).

## What this page covers

186 changes: 186 additions & 0 deletions docs/source/deployment/databricks/databricks_ide_databricks_asset_bundles_workflow.md
@@ -0,0 +1,186 @@
# Use an IDE and Databricks Asset Bundles to deploy a Kedro project

```{note}
The `dbx` package has been deprecated by Databricks, and the dbx workflow documentation has moved to a [new page](./databricks_dbx_workflow.md).
```

This guide demonstrates a workflow for developing a Kedro project on Databricks using Databricks Asset Bundles. You will learn how to develop your project in a local environment, then use `kedro-databricks` and Databricks Asset Bundles to package your code for running pipelines on Databricks. To learn more about Databricks Asset Bundles and how to customise them, read [What are Databricks Asset Bundles?](https://docs.databricks.com/en/dev-tools/bundles/index.html).

## Benefits of local development

By working in your local environment, you can take advantage of features within an IDE that are not available on Databricks notebooks:

- Auto-completion and suggestions for code, improving your development speed and accuracy.
- Linters like [Ruff](https://docs.astral.sh/ruff) can be integrated to catch potential issues in your code.
- Static type checkers like Mypy can check types in your code, helping to identify potential type-related issues early in the development process.

To set up these features, look for instructions specific to your IDE (for instance, [VS Code](https://code.visualstudio.com/docs/python/linting)).
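
These checks can also be run from a terminal; a minimal sketch, assuming you add `ruff` and `mypy` to your development environment and keep the default Kedro `src/` layout:

```bash
pip install ruff mypy

# Lint the project source
ruff check src/

# Type-check the project source
mypy src/
```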

```{note}
If you prefer to develop projects in notebooks rather than in an IDE, you should follow our guide on [how to develop a Kedro project within a Databricks workspace](./databricks_notebooks_development_workflow.md) instead.
```

## What this page covers

The main steps in this tutorial are as follows:

- [Prerequisites](#prerequisites)
- [Set up your project](#set-up-your-project)
- [Create the Databricks Asset Bundles](#create-the-databricks-asset-bundles-using-kedro-databricks)
- [Deploy Databricks Job](#deploy-databricks-job-using-databricks-asset-bundles)
- [Run Databricks Job](#how-to-run-the-deployed-job)

## Prerequisites

- An active [Databricks deployment](https://docs.databricks.com/getting-started/index.html).
- A [Databricks cluster](https://docs.databricks.com/clusters/configure.html) configured with a recent version (>= 11.3 is recommended) of the Databricks runtime.
- [Conda installed](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) on your local machine to create a virtual environment with Python >= 3.9.

## Set up your project

### Note your Databricks username and host
Note your Databricks **username** and **host** as you will need them for the remainder of this guide.

Find your Databricks username in the top right of the workspace UI and the host in the browser's URL bar, up to the first slash (e.g., `https://adb-123456789123456.1.azuredatabricks.net/`):

![Find Databricks host and username](../../meta/images/find_databricks_host_and_username.png)

```{note}
Your Databricks host must include the protocol (`https://`).
```
### Install Kedro and Databricks CLI in a new virtual environment
In your local development environment, create a virtual environment for this tutorial using Conda:

```bash
conda create --name databricks-iris python=3.10
```

Once it is created, activate it:

```bash
conda activate databricks-iris
```
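
With the environment active, install Kedro. If you do not already have the Databricks CLI, install that too. The snippet below is a minimal sketch; the installer script is the one documented by Databricks, and other installation methods (such as Homebrew or winget) are described in the [Databricks CLI documentation](https://docs.databricks.com/en/dev-tools/cli/install.html):

```bash
# Install Kedro into the active environment
pip install kedro

# Install the Databricks CLI (Linux/macOS installer script)
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```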


### Authenticate the Databricks CLI
**Now, you must authenticate the Databricks CLI with your Databricks instance.**

[Refer to the Databricks documentation](https://docs.databricks.com/en/dev-tools/cli/authentication.html) for a complete guide on how to authenticate your CLI. The key steps are:

1. Create a personal access token for your user on your Databricks instance.
2. Run `databricks configure --token`.
3. Enter your token and Databricks host when prompted.
4. Run `databricks fs ls dbfs:/` at the command line to verify your authentication.
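
Putting steps 2-4 together, a typical session looks like this (you supply your own host and token at the prompts):

```bash
# Prompts for your Databricks host and personal access token
databricks configure --token

# Verify that the CLI is authenticated against your workspace
databricks fs ls dbfs:/
```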


### Create a new Kedro project
Create a Kedro project with the `databricks-iris` starter using the following command in your local environment:

```bash
kedro new --starter=databricks-iris
```

Name your new project `iris-databricks` for consistency with the rest of this guide.

```{note}
If you are not using the `databricks-iris` starter to create a Kedro project, **and** you are working with a version of Kedro **earlier than 0.19.0**, then you should [disable file-based logging](https://docs.kedro.org/en/0.18.14/logging/logging.html#disable-file-based-logging) to prevent Kedro from attempting to write to the read-only file system.
```

## Create the Databricks Asset Bundles using `kedro-databricks`

`kedro-databricks` is a wrapper around the `databricks` CLI and is the simplest way to get started without writing the bundle configuration by hand.

1. Install `kedro-databricks`:

```bash
pip install kedro-databricks
```

2. Initialise the Databricks configuration:

```bash
kedro databricks init
```

This generates a `databricks.yml` file in the `conf` folder, which sets the default cluster type. You can override these configurations if needed; an illustrative sketch of the file appears after this list.

3. Create Databricks Asset Bundles:

```bash
kedro databricks bundle
```

This command reads the configuration from `conf/databricks.yml` (if it exists) and generates the Databricks job configuration inside a `resource` folder.
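
For reference, the file generated by `kedro databricks init` in step 2 has roughly the following shape. This is an illustrative sketch only: the exact keys and defaults depend on your `kedro-databricks` version, and the cluster settings are example placeholders.

```yaml
# conf/databricks.yml -- illustrative sketch; exact contents vary by version
default:
  job_clusters:
    - job_cluster_key: default
      new_cluster:
        spark_version: 14.3.x-scala2.12   # example Databricks runtime
        node_type_id: Standard_DS3_v2     # example (Azure) node type
        num_workers: 1
  tasks:
    - task_key: default
      job_cluster_key: default
```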

### Running a Databricks job using an existing cluster

By default, Databricks creates a new job cluster for each job. However, there are instances where you might prefer to use an existing cluster, such as:

1. Lack of permissions to create a new cluster.
2. The need for a quick start with an all-purpose cluster.

While it is generally [**not recommended** to utilise **all-purpose compute** for running jobs](https://docs.databricks.com/en/jobs/compute.html#should-all-purpose-compute-ever-be-used-for-jobs), you can configure a Databricks job to use one for testing purposes.

To begin, you need to determine the `cluster_id`. Navigate to the `Compute` tab and select the `View JSON` option.


![Find cluster ID through UI](../../meta/images/databricks_cluster_id1.png)

You will see the cluster configuration in JSON format. Copy the `cluster_id`:
![cluster_id in the JSON view](../../meta/images/databricks_cluster_id2.png)
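
If you prefer the command line, the Databricks CLI can list cluster IDs as well:

```bash
# Lists cluster names, IDs and states for the workspace
databricks clusters list
```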

Next, update `conf/databricks.yml`:
```diff
 tasks:
   - task_key: default
-    job_cluster_key: default
+    existing_cluster_id: 0502-***********
```

Then generate the bundle definition again with the `--overwrite` option:

```bash
kedro databricks bundle --overwrite
```

## Deploy Databricks Job using Databricks Asset Bundles

Once you have generated all the resources, deploy the Databricks Asset Bundles to Databricks:

```bash
kedro databricks deploy
```

You should see output similar to:

```
Uploading databrick_iris-0.1-py3-none-any.whl...
Uploading bundle files to /Workspace/Users/xxxxxxx.com/.bundle/databrick_iris/local/files...
Deploying resources...
Updating deployment state...
Deployment complete!
```
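
If deployment fails, you can troubleshoot the generated bundle with the Databricks CLI directly; `validate` is a standard `databricks bundle` subcommand:

```bash
# Check the bundle configuration for errors without deploying anything
databricks bundle validate
```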

## How to run the deployed job

There are two ways to run the deployed Databricks job:

### Run the Databricks job with the `databricks` CLI

```bash
databricks bundle run
```

This shows all the jobs you have created. Select the one you want to run:

```bash
? Resource to run:
Job: [dev] databricks-iris (databricks-iris)
```

You should see output similar to this:
```
databricks bundle run
Run URL: https://<host>/?*********#job/**************/run/**********
```

Copy that URL into your browser or go to the `Job runs` UI to see the run status.
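
If you already know the job's resource key, you can pass it directly to skip the interactive prompt, using the key shown in the listing above:

```bash
databricks bundle run databricks-iris
```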

### Run the Databricks job with the Databricks UI

Alternatively, go to the `Workflows` tab and select the desired job to run it directly:

![Run deployed Databricks Job with Databricks UI](../../meta/images/databricks-job-run.png)
2 changes: 1 addition & 1 deletion docs/source/deployment/databricks/databricks_notebooks_development_workflow.md
@@ -2,7 +2,7 @@

This guide demonstrates a workflow for developing Kedro projects on Databricks using only a Databricks Repo and a Databricks notebook. You will learn how to develop and test your Kedro projects entirely within the Databricks workspace.

-This method of developing a Kedro project for use on Databricks is ideal for developers who prefer developing their projects in notebooks rather than in an IDE. It also avoids the overhead of setting up and syncing a local environment with Databricks. If you want to take advantage of the powerful features of an IDE to develop your project, consider following the [guide for developing a Kedro project for Databricks using your local environment](./databricks_ide_development_workflow.md).
+This method of developing a Kedro project for use on Databricks is ideal for developers who prefer developing their projects in notebooks rather than in an IDE. It also avoids the overhead of setting up and syncing a local environment with Databricks. If you want to take advantage of the powerful features of an IDE to develop your project, consider following the [guide for developing a Kedro project for Databricks using your local environment](./databricks_ide_databricks_asset_bundles_workflow.md).

In this guide, you will store your project's code in a repository on [GitHub](https://github.com/). Databricks integrates with many [Git providers](https://docs.databricks.com/repos/index.html#supported-git-providers), including GitLab and Azure DevOps. The steps to create a Git repository and sync it with Databricks also generally apply to these Git providers, though the exact details may vary.

6 changes: 3 additions & 3 deletions docs/source/deployment/databricks/index.md
@@ -12,8 +12,7 @@ To avoid the overhead of setting up and syncing a local development environment

**I want a hybrid workflow model combining local IDE with Databricks**

-
-The workflow documented in ["Use an IDE, dbx and Databricks Repos to develop a Kedro project"](./databricks_ide_development_workflow.md) is for those that prefer to work in a local IDE.
+The workflow documented in ["Use Databricks Asset Bundles to deploy a Kedro project"](./databricks_ide_databricks_asset_bundles_workflow.md) is for those that prefer to work in a local IDE.

If you're in the early stages of learning Kedro, or your project requires constant testing and adjustments, choose this workflow. You can use your IDE's capabilities for faster, error-free development, while testing on Databricks. Later you can make the transition into a production deployment with this approach, although you may prefer to switch to use [job-based deployment](./databricks_deployment_workflow.md) and fully optimise your workflow for production.

@@ -46,7 +45,8 @@ Remember, the best choice of workflow is the one that aligns best with your project
:maxdepth: 1
databricks_notebooks_development_workflow.md
-databricks_ide_development_workflow.md
+databricks_ide_databricks_asset_bundles_workflow.md
databricks_deployment_workflow
databricks_visualisation
+databricks_dbx_workflow.md
```
Binary file added docs/source/meta/images/databricks-job-run.png
Binary file added docs/source/meta/images/databricks_cluster_id1.png
Binary file added docs/source/meta/images/databricks_cluster_id2.png
