From c300fa4ebd363ac28a09e1e5ae64f7c8ebdde666 Mon Sep 17 00:00:00 2001
From: Future-Outlier
Date: Tue, 28 Nov 2023 08:59:23 +0800
Subject: [PATCH] Databricks Plugin Setup Doc Enhancement (#4445)

---------

Signed-off-by: Future Outlier
Signed-off-by: Kevin Su
Co-authored-by: Future Outlier
Co-authored-by: Kevin Su
---
 rsts/deployment/plugins/webapi/databricks.rst | 110 ++++++++++++++----
 1 file changed, 89 insertions(+), 21 deletions(-)

diff --git a/rsts/deployment/plugins/webapi/databricks.rst b/rsts/deployment/plugins/webapi/databricks.rst
index 671fdb4e18..ee38a481df 100644
--- a/rsts/deployment/plugins/webapi/databricks.rst
+++ b/rsts/deployment/plugins/webapi/databricks.rst
@@ -42,30 +42,99 @@ Databricks workspace

To set up your Databricks account, follow these steps:

1. Create a `Databricks account `__.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_workspace.png
    :alt: A screenshot of Databricks workspace creation.

2. Ensure that you have a Databricks workspace up and running.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/open_workspace.png
    :alt: A screenshot of Databricks workspace.

3. Generate a `personal access token `__ to be used in the Flyte configuration.
   You can find the personal access token in the user settings within the workspace:
   ``User settings`` -> ``Developer`` -> ``Access tokens``.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_access_token.png
    :alt: A screenshot of access token.

4. Enable custom containers on your Databricks cluster before you trigger the workflow:

.. code-block:: bash

    curl -X PATCH -n -H "Authorization: Bearer <your-personal-access-token>" \
        https://<databricks-instance>/api/2.0/workspace-conf \
        -d '{"enableDcs": "true"}'

For more details, see `custom containers `__.

5. Create an `instance profile `__ for the Spark cluster.
   This profile enables the Spark job to access your data in the S3 bucket.

Create an instance profile using the AWS console (For AWS Users)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. In the AWS console, go to the IAM service.
2. Click the Roles tab in the sidebar.
3. Click Create role.

   a. Under Trusted entity type, select AWS service.
   b. Under Use case, select **EC2**.
   c. Click Next.
   d. On the Add permissions page, select the **AmazonS3FullAccess** policy, then click Next.
   e. In the Role name field, type a role name.
   f. Click Create role.

In the role summary, copy the Role ARN.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/s3_arn.png
    :alt: A screenshot of s3 arn.
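If you prefer to script the console steps above, the sketch below is a rough boto3
equivalent. It is not part of the official setup guide; the role and instance profile names
(``databricks-spark-role``, ``databricks-spark-profile``) are placeholders to adapt to your
environment.

.. code-block:: python

    import json

    import boto3

    iam = boto3.client("iam")

    # Create the role with an EC2 trust relationship (step 3 above).
    iam.create_role(
        RoleName="databricks-spark-role",  # placeholder name
        AssumeRolePolicyDocument=json.dumps(
            {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": {"Service": "ec2.amazonaws.com"},
                        "Action": "sts:AssumeRole",
                    }
                ],
            }
        ),
    )

    # Attach the S3 access policy (step 3d above).
    iam.attach_role_policy(
        RoleName="databricks-spark-role",
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
    )

    # Wrap the role in an instance profile, which is what Databricks attaches
    # to the Spark cluster, then print the two ARNs you will need below.
    iam.create_instance_profile(InstanceProfileName="databricks-spark-profile")
    iam.add_role_to_instance_profile(
        InstanceProfileName="databricks-spark-profile",
        RoleName="databricks-spark-role",
    )
    print(iam.get_role(RoleName="databricks-spark-role")["Role"]["Arn"])
    print(
        iam.get_instance_profile(InstanceProfileName="databricks-spark-profile")[
            "InstanceProfile"
        ]["Arn"]
    )

Keep both ARNs at hand: the role ARN is used in the ``iam:PassRole`` policy edit below, and
the instance profile ARN is what you reference from the Databricks cluster configuration.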
Locate the IAM role that created the Databricks deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you don't know which IAM role created the Databricks deployment, do the following:

1. As an account admin, log in to the account console.
2. Go to ``Workspaces`` and click your workspace name.
3. In the Credentials box, note the role name at the end of the Role ARN.

For example, in the Role ARN ``arn:aws:iam::123456789123:role/finance-prod``, the role name is ``finance-prod``.

Edit the IAM role that created the Databricks deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. In the AWS console, go to the IAM service.
2. Click the Roles tab in the sidebar.
3. Click the role that created the Databricks deployment.
4. On the Permissions tab, click the policy.
5. Click Edit Policy.
6. Append the following block to the end of the Statement array. Ensure that you don't
   overwrite any of the existing policy. Replace ``<account-id>`` and ``<role-name>`` with your
   AWS account ID and the role you created in Configure S3 access with instance profiles.

.. code-block:: json

    {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::<account-id>:role/<role-name>"
    }

6. Upload the following ``entrypoint.py`` file to either
`DBFS `__
(the final path will be ``dbfs:///FileStore/tables/entrypoint.py``) or S3.
This file will be executed by the Spark driver node, overriding the default command of the
`Databricks `__ job. This entrypoint file will:

1. Download the inputs from S3 to the local filesystem.
2. Execute the Spark task.
3. Upload the outputs from the local filesystem to S3 for the downstream tasks to consume.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/dbfs.png
    :alt: A screenshot of dbfs.

.. code-block:: python

@@ -101,9 +170,7 @@ This file will be executed by the Spark driver node, overriding the default comm

    def main():
        args = sys.argv
        click_ctx = click.Context(click.Command("dummy"))
        if args[1] == "pyflyte-fast-execute":
            parser = _fast_execute_task_cmd.make_parser(click_ctx)

@@ -122,6 +189,12 @@ This file will be executed by the Spark driver node, overriding the default comm

Specify plugin configuration
----------------------------

.. note::

   The demo cluster saves data to MinIO, but the Databricks job saves data to S3. Therefore,
   you need to update the AWS credentials for the single binary deployment so that the pod can
   access the S3 bucket that the Databricks job writes to. Follow the `AWS instructions
   `__ to generate access and secret keys for your preferred S3 bucket.

.. tabs::

@@ -330,7 +403,6 @@ Add the Databricks access token to FlytePropeller:

   apiVersion: v1
   data:
     FLYTE_DATABRICKS_API_TOKEN: <BASE64_ENCODED_DATABRICKS_TOKEN>
   kind: Secret
   ...

@@ -376,8 +448,4 @@ Wait for the upgrade to complete. You can check the status of the deployment pod

   kubectl get pods -n flyte

For the Databricks plugin on the Flyte cluster, please refer to the `Databricks Plugin Example `_.
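As a quick end-to-end sanity check of the setup, a Databricks task can look roughly like the
sketch below. This is adapted loosely from the linked example rather than copied from it; the
``spark_version``, ``node_type_id``, and instance profile ARN are placeholders you must replace
with values valid in your workspace.

.. code-block:: python

    import flytekit
    from flytekit import task
    from flytekitplugins.spark import Databricks


    @task(
        task_config=Databricks(
            spark_conf={"spark.driver.memory": "1000M"},
            databricks_conf={
                "run_name": "flyte databricks example",
                "new_cluster": {
                    "spark_version": "12.2.x-scala2.12",  # placeholder
                    "node_type_id": "m6i.xlarge",  # placeholder
                    "num_workers": 2,
                    "aws_attributes": {
                        # The instance profile created in step 5 (placeholder ARN).
                        "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<profile-name>",
                    },
                },
                "timeout_seconds": 3600,
            },
        ),
    )
    def count_partitions(partitions: int) -> int:
        # The Spark plugin injects a Spark session into the task context at runtime.
        sess = flytekit.current_context().spark_session
        return sess.sparkContext.parallelize(range(partitions)).count()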
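Before triggering a workflow, you can also confirm that custom containers (step 4) are still
enabled. The following is a minimal sketch against the same ``workspace-conf`` API used in
step 4, with the instance URL and token as placeholders.

.. code-block:: python

    import requests

    # Query the workspace-conf API; the expected response is {"enableDcs": "true"}.
    response = requests.get(
        "https://<databricks-instance>/api/2.0/workspace-conf",
        params={"keys": "enableDcs"},
        headers={"Authorization": "Bearer <your-personal-access-token>"},
    )
    response.raise_for_status()
    print(response.json())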