Databricks Plugin Setup Doc Enhancement (#4445)
---------

Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Co-authored-by: Future Outlier <[email protected]>
Co-authored-by: Kevin Su <[email protected]>
3 people authored Nov 28, 2023
1 parent 98dd505 commit c300fa4
Showing 1 changed file with 89 additions and 21 deletions.
110 changes: 89 additions & 21 deletions rsts/deployment/plugins/webapi/databricks.rst
@@ -42,30 +42,99 @@ Databricks workspace
To set up your Databricks account, follow these steps:

1. Create a `Databricks account <https://www.databricks.com/>`__.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_workspace.png
    :alt: A screenshot of Databricks workspace creation.

2. Ensure that you have a Databricks workspace up and running.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/open_workspace.png
    :alt: A screenshot of Databricks workspace.

3. Generate a `personal access token
<https://docs.databricks.com/dev-tools/auth.html#databricks-personal-ACCESS_TOKEN-authentication>`__ to be used in the Flyte configuration.
You can find the personal access token in the user settings within the workspace: ``User settings`` -> ``Developer`` -> ``Access tokens``.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/databricks_access_token.png
    :alt: A screenshot of access token.
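
Before wiring the token into Flyte, you can optionally sanity-check it with any read-only REST call.
The sketch below assumes ``<databricks-instance>`` is your workspace hostname; a successful response
confirms the token is valid.

.. code-block:: bash

   # List clusters as a read-only check that the personal access token works.
   curl -H "Authorization: Bearer <your-personal-access-token>" \
        https://<databricks-instance>/api/2.0/clusters/list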

4. Enable custom containers on your Databricks cluster before you trigger the workflow.

.. code-block:: bash

   curl -X PATCH -n -H "Authorization: Bearer <your-personal-access-token>" \
        https://<databricks-instance>/api/2.0/workspace-conf \
        -d '{"enableDcs": "true"}'
For more detail, check `custom containers <https://docs.databricks.com/administration-guide/clusters/container-services.html>`__.

.. note::

   When testing the Databricks plugin on the demo cluster, create an S3 bucket because the local demo
   cluster utilizes MinIO. Follow the `AWS instructions
   <https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html>`__
   to generate access and secret keys, which can be used to access your preferred S3 bucket.

5. Create an `instance profile
<https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html>`__
for the Spark cluster. This profile enables the Spark job to access your data in the S3 bucket.
Please follow all four steps specified in the documentation.

Create an instance profile using the AWS console (For AWS Users)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. In the AWS console, go to the IAM service.
2. Click the Roles tab in the sidebar.
3. Click Create role.

a. Under Trusted entity type, select AWS service.
b. Under Use case, select **EC2**.
c. Click Next.
d. At the bottom of the page, click Next.
e. In the Role name field, type a role name.
f. Click Create role.

4. In the role list, click the role you just created.
5. On the **Permissions** tab, attach the **AmazonS3FullAccess** policy so the Spark job can read and write your bucket.

In the role summary, copy the Role ARN.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/s3_arn.png
    :alt: A screenshot of s3 arn.
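
If you prefer to script this step, roughly equivalent AWS CLI calls are sketched below; the profile
and role names are placeholders, and the console flow above creates the same resources interactively.

.. code-block:: bash

   # Create an instance profile and associate the S3-access role with it (names are examples).
   aws iam create-instance-profile --instance-profile-name databricks-s3-instance-profile
   aws iam add-role-to-instance-profile \
       --instance-profile-name databricks-s3-instance-profile \
       --role-name <your-role-name>
   # Grant the role S3 access by attaching the managed policy referenced above.
   aws iam attach-role-policy \
       --role-name <your-role-name> \
       --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess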

Locate the IAM role that created the Databricks deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you don’t know which IAM role created the Databricks deployment, do the following:

1. As an account admin, log in to the account console.
2. Go to ``Workspaces`` and click your workspace name.
3. In the Credentials box, note the role name at the end of the Role ARN.

For example, in the Role ARN ``arn:aws:iam::123456789123:role/finance-prod``, the role name is ``finance-prod``.

Edit the IAM role that created the Databricks deployment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. In the AWS console, go to the IAM service.
2. Click the Roles tab in the sidebar.
3. Click the role that created the Databricks deployment.
4. On the Permissions tab, click the policy.
5. Click Edit Policy.
6. Append the following block to the end of the ``Statement`` array. Ensure that you don't overwrite any of the existing policy. Replace ``<iam-role-for-s3-access>`` with the role you created in *Create an instance profile using the AWS console* above.

.. code-block:: json

   {
     "Effect": "Allow",
     "Action": "iam:PassRole",
     "Resource": "arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>"
   }
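
To verify the change, you can simulate the ``iam:PassRole`` action for the deployment role; this is
an optional sketch, and both role names below are placeholders matching the policy above.

.. code-block:: bash

   # Check that the deployment role is now allowed to pass the S3-access role.
   aws iam simulate-principal-policy \
       --policy-source-arn arn:aws:iam::<aws-account-id-databricks>:role/<databricks-deployment-role> \
       --action-names iam:PassRole \
       --resource-arns arn:aws:iam::<aws-account-id-databricks>:role/<iam-role-for-s3-access>
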
6. Upload the following ``entrypoint.py`` file to either
`DBFS <https://docs.databricks.com/archive/legacy/data-tab.html>`__
(the final path will be ``dbfs:///FileStore/tables/entrypoint.py``) or S3 (an upload sketch follows the list below).
This file will be executed by the Spark driver node, overriding the default command of the
`Databricks <https://docs.databricks.com/dev-tools/dbx.html>`__ job. This entrypoint file will:

1. Download the inputs from S3 to the local filesystem.
2. Execute the Spark task.
3. Upload the outputs from the local filesystem to S3 for the downstream tasks to consume.
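
If you choose the DBFS location, one way to copy the file is with the Databricks CLI; this is a
sketch that assumes the CLI is already installed and authenticated against your workspace.

.. code-block:: bash

   # Copy the local entrypoint.py to DBFS at the path referenced above.
   databricks fs cp entrypoint.py dbfs:/FileStore/tables/entrypoint.py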


.. TODO: A quick-and-dirty workaround for https://github.com/flyteorg/flyte/issues/3853 is to import pandas.

.. image:: https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/plugins/databricks/dbfs.png
    :alt: A screenshot of dbfs.

.. code-block:: python
@@ -101,9 +170,7 @@
def main():
    args = sys.argv
    click_ctx = click.Context(click.Command("dummy"))
    if args[1] == "pyflyte-fast-execute":
        parser = _fast_execute_task_cmd.make_parser(click_ctx)
@@ -122,6 +189,12 @@
Specify plugin configuration
----------------------------
.. note::

   The demo cluster saves data to MinIO, but the Databricks job saves data to S3.
   Therefore, you need to update the AWS credentials for the single binary deployment so that the pod can
   access the S3 bucket that the Databricks job writes to.
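
One quick way to hand those credentials to the pod is to set them as environment variables on the
single binary deployment; the deployment and namespace names below are assumptions about a demo
install, so adjust them to match your cluster.

.. code-block:: bash

   # Inject AWS credentials into the single binary deployment (names are assumptions; verify with `kubectl get deploy -A`).
   kubectl set env deployment/flyte-sandbox -n flyte \
       AWS_ACCESS_KEY_ID=<access-key-id> \
       AWS_SECRET_ACCESS_KEY=<secret-access-key>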


.. tabs::

@@ -330,7 +403,6 @@ Add the Databricks access token to FlytePropeller:
apiVersion: v1
data:
  FLYTE_DATABRICKS_API_TOKEN: <ACCESS_TOKEN>
  client_secret: Zm9vYmFy
kind: Secret
...
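
The value must be base64-encoded before it is placed under ``data``; a sketch of one way to do this
is shown below, and the secret name is an assumption, so list the secrets in the ``flyte`` namespace
to confirm which one FlytePropeller mounts.

.. code-block:: bash

   # Base64-encode the personal access token for the secret's data section.
   echo -n "<your-personal-access-token>" | base64
   # Find the propeller secret and edit it to add FLYTE_DATABRICKS_API_TOKEN (secret name is an assumption).
   kubectl get secrets -n flyte
   kubectl edit secret flyte-secret-auth -n flyte
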
@@ -376,8 +448,4 @@ Wait for the upgrade to complete. You can check the status of the deployment pods:
kubectl get pods -n flyte
For more information about the Databricks plugin on a Flyte cluster, please refer to the `Databricks Plugin Example <https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/databricks_plugin/index.html>`_.
