[HWORKS-164] Add Airflow documentation (#328)
SirOibaf authored Nov 29, 2023
1 parent 735e941 commit b71528e
Showing 6 changed files with 90 additions and 8 deletions.
89 changes: 89 additions & 0 deletions docs/user_guides/projects/airflow/airflow.md
@@ -0,0 +1,89 @@
---
description: Documentation on how to orchestrate Hopsworks jobs using Apache Airflow
---

# Orchestrate Jobs using Apache Airflow

## Introduction

Hopsworks jobs can be orchestrated using [Apache Airflow](https://airflow.apache.org/). You can define an Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs.
You can then schedule the DAG to run at specific times using a [cron](https://en.wikipedia.org/wiki/Cron) expression.
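For reference, a cron expression consists of five space-separated fields. The sketch below (plain Python, illustrative only) splits the expression used later in this guide into its fields:

```python
# Fields of a cron expression, left to right:
# minute, hour, day of month, month, day of week
expression = "0 4 * * *"  # run every day at 04:00
fields = dict(zip(
    ["minute", "hour", "day_of_month", "month", "day_of_week"],
    expression.split()))
print(fields)
```

Here `0 4 * * *` fires at minute 0 of hour 4, on every day of every month.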

Airflow DAGs are defined as Python files. Within the Python file, different operators can be used to trigger different actions. Hopsworks provides an operator to execute jobs on Hopsworks and a sensor to wait for a specific job to finish.

### Use Apache Airflow in Hopsworks

Hopsworks deployments include a deployment of Apache Airflow. You can access it from the Hopsworks UI by clicking on the _Airflow_ button on the left menu.

Airflow is configured to enforce Role-Based Access Control (RBAC) on the Airflow DAGs. Admin users on Hopsworks have access to all the DAGs in the deployment. Regular users can access all the DAGs of the projects they are a member of.

!!! note "Access Control"
    Airflow does not have any knowledge of the Hopsworks project you are currently working on. As such, when opening the Airflow UI, you will see all the DAGs of all the projects you are a member of.

#### Hopsworks DAG Builder

<figure>
<img src="../../../../assets/images/guides/airflow/airflow_dag_builder.png" alt="Airflow DAG Builder"/>
<figcaption>Airflow DAG Builder</figcaption>
</figure>

You can create a new Airflow DAG to orchestrate jobs using the Hopsworks DAG builder tool. Click on _New Workflow_ to create a new Airflow DAG. You should provide a name for the DAG as well as a schedule interval. You can define the schedule using the dropdown menus or by providing a cron expression.

You can add Hopsworks operators and sensors to the DAG:

- **Operator**: The operator is used to trigger a job execution. When configuring the operator, you select the job to execute and can optionally provide execution arguments. You can also decide whether the operator should wait for the execution to complete. If you select the _wait_ option, the operator blocks and Airflow will not execute any tasks in parallel; the Airflow task also fails if the job fails. If you want to execute tasks in parallel, do not select the _wait_ option and use the sensor instead. When configuring the operator, you can also specify which other Airflow tasks it depends on. If you add a dependency, the task will be executed only after the upstream tasks have completed successfully.

- **Sensor**: The sensor can be used to wait for executions to complete. Similar to the _wait_ option of the operator, the sensor blocks until the job execution has finished. It can be used to launch several jobs in parallel and wait for all of them to complete. Please note that the sensor is defined at the job level rather than the execution level: it waits for the most recent execution to finish and fails the Airflow task if that execution was not successful.

You can then create the DAG and Hopsworks will generate the Python file.

#### Write your own DAG

If you prefer to code the DAGs or you want to edit a DAG built with the builder tool, you can do so. The Airflow DAGs are stored in the _Airflow_ dataset which you can access using the file browser in the project settings.

When writing the code for the DAG you can invoke the operator as follows:

```python
HopsworksLaunchOperator(dag=dag,
                        task_id="profiles_fg_0",
                        project_name="airflow_doc",
                        job_name="profiles_fg",
                        job_arguments="",
                        wait_for_completion=True)
```

You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`, `job_arguments`). You can set the `wait_for_completion` flag to `True` if you want the operator to block and wait for the job execution to be finished.

Similarly, you can invoke the sensor as shown below. You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`).

```python
HopsworksJobSuccessSensor(dag=dag,
                          task_id="wait_for_profiles_fg",
                          project_name="airflow_doc",
                          job_name="profiles_fg")
```
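Putting the two together, here is a sketch of a DAG that launches two jobs in parallel and waits for both to finish. The import paths, the `transactions_fg` job name, and the `start_date` are assumptions for illustration; check the plugin module names shipped with your Hopsworks deployment:

```python
from datetime import datetime

from airflow import DAG
# Assumed import paths; they may differ in your deployment
from hopsworks_plugin.operators.hopsworks_operator import HopsworksLaunchOperator
from hopsworks_plugin.sensors.hopsworks_sensor import HopsworksJobSuccessSensor

with DAG(dag_id="parallel_example",
         start_date=datetime(2023, 1, 1),
         schedule_interval="0 4 * * *") as dag:

    # Launch both jobs without blocking so they run in parallel
    launch_profiles = HopsworksLaunchOperator(
        task_id="launch_profiles_fg",
        project_name="airflow_doc",
        job_name="profiles_fg",
        wait_for_completion=False)

    launch_transactions = HopsworksLaunchOperator(
        task_id="launch_transactions_fg",
        project_name="airflow_doc",
        job_name="transactions_fg",  # hypothetical job name
        wait_for_completion=False)

    # Each sensor blocks until the most recent execution of its job finishes
    wait_profiles = HopsworksJobSuccessSensor(
        task_id="wait_for_profiles_fg",
        project_name="airflow_doc",
        job_name="profiles_fg")

    wait_transactions = HopsworksJobSuccessSensor(
        task_id="wait_for_transactions_fg",
        project_name="airflow_doc",
        job_name="transactions_fg")

    launch_profiles >> wait_profiles
    launch_transactions >> wait_transactions
```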

When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. The `access_control` parameter specifies which projects have access to the DAG and which actions their members can perform on it. If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI.

!!! warning "Admin access"
    The `access_control` configuration does not apply to Hopsworks admin users, who have full access to all the DAGs even if they are not members of the project.

```python
dag = DAG(
    dag_id="example_dag",
    default_args=args,
    access_control={
        "project_name": {"can_dag_read", "can_dag_edit"},
    },
    schedule_interval="0 4 * * *",
)
```

!!! note "Project Name"
    You should replace `project_name` in the snippet above with the name of your own project.

#### Manage Airflow DAGs using Git

You can leverage the [Git integration](../git/clone_repo.md) to track your Airflow DAGs in a Git repository. Airflow only considers the DAG files stored in the _Airflow_ dataset in Hopsworks.
After cloning the Git repository in Hopsworks, you can automate copying the DAG file into the _Airflow_ dataset using the [copy method](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#copy) of the Hopsworks API.
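For example, a sketch of such an automation using the `hopsworks` Python library, assuming the snippet runs inside the cluster and that the repository was cloned under `Resources` with a DAG file named `example_dag.py` (both hypothetical paths):

```python
import hopsworks

# Log in to the Hopsworks cluster (credentials are picked up automatically
# when running inside Hopsworks, e.g. from a Jupyter notebook or a job)
project = hopsworks.login()
dataset_api = project.get_dataset_api()

# Hypothetical paths: adjust to your repository location and DAG file name
dataset_api.copy("Resources/my_repo/dags/example_dag.py",
                 "Airflow/example_dag.py")
```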
3 changes: 0 additions & 3 deletions docs/user_guides/projects/git/clone_repo.md
@@ -5,9 +5,6 @@
Repositories are cloned and managed within the scope of a project. The content of the repository will reside on the Hopsworks File System. The content of the repository can be edited from Jupyter notebooks and can for example be used to configure Jobs.
Repositories can be managed from the Git section in the project settings. The Git overview in the project settings provides a list of repositories currently cloned within the project, the location of their content, as well as which branch and commit their HEAD is currently at.

!!! warning "Beta"
The feature is currently in Beta and will be improved in the upcoming releases.

## Prerequisites

- For cloning a private repository, you should configure a [Git Provider](configure_git_provider.md) with your git credentials. You can clone a GitHub and GitLab public repository without configuring the provider. However, for BitBucket you always need to configure the username and token to clone a repository.
3 changes: 0 additions & 3 deletions docs/user_guides/projects/git/configure_git_provider.md
@@ -4,9 +4,6 @@

When you perform Git operations on Hopsworks that need to interact with the remote repository, Hopsworks relies on the Git HTTPS protocol to perform those operations. Authentication with the remote repository happens through a token generated by the Git repository hosting service (GitHub, GitLab, BitBucket).

!!! warning "Beta"
The feature is currently in Beta and will be improved in the upcoming releases.

!!! notice "Token permissions"
The token permissions should grant access to public and private repositories including read and write access to repository contents and commit statuses.
    If you are using the new GitHub access tokens, make sure you choose the correct `Resource owner` when generating the token for the repositories you want to clone. For the `Repository permissions` of the new GitHub fine-grained token, you should at least grant read and write access to `Commit statuses` and `Contents`.
2 changes: 0 additions & 2 deletions docs/user_guides/projects/git/repository_actions.md
@@ -1,8 +1,6 @@
# Repository actions
## Introduction
This section explains the Git operations you can perform on Hopsworks Git repositories. These include commit, pull, push, creating branches, and more.
!!! warning "Beta"
The feature is currently in Beta and will be improved in the upcoming releases.

!!! notice "Repository permissions"
    Git repositories are private. Only the owner of the repository can perform Git actions on it, such as commit, push, and pull.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -136,6 +136,7 @@ nav:
- Run Spark Job: user_guides/projects/jobs/spark_job.md
- Run Python Job: user_guides/projects/jobs/python_job.md
- Scheduling: user_guides/projects/jobs/schedule_job.md
- Airflow: user_guides/projects/airflow/airflow.md
- OpenSearch:
- Connect: user_guides/projects/opensearch/connect.md
- Kafka:
