diff --git a/docs/assets/images/guides/airflow/airflow_dag_builder.png b/docs/assets/images/guides/airflow/airflow_dag_builder.png
new file mode 100644
index 000000000..711559714
Binary files /dev/null and b/docs/assets/images/guides/airflow/airflow_dag_builder.png differ
diff --git a/docs/user_guides/projects/airflow/airflow.md b/docs/user_guides/projects/airflow/airflow.md
new file mode 100644
index 000000000..17943c99e
--- /dev/null
+++ b/docs/user_guides/projects/airflow/airflow.md
@@ -0,0 +1,89 @@
+---
+description: Documentation on how to orchestrate Hopsworks jobs using Apache Airflow
+---
+
+# Orchestrate Jobs using Apache Airflow
+
+## Introduction
+
+Hopsworks jobs can be orchestrated using [Apache Airflow](https://airflow.apache.org/). You can define an Airflow DAG (Directed Acyclic Graph) containing the dependencies between Hopsworks jobs.
+You can then schedule the DAG to run at a specific interval using a [cron](https://en.wikipedia.org/wiki/Cron) expression.
+
+Airflow DAGs are defined as Python files. Within the Python file, different operators can be used to trigger different actions. Hopsworks provides an operator to execute jobs on Hopsworks and a sensor to wait for a specific job to finish.
+
+### Use Apache Airflow in Hopsworks
+
+Hopsworks deployments include a deployment of Apache Airflow. You can access it from the Hopsworks UI by clicking on the _Airflow_ button in the left menu.
+
+Airflow is configured to enforce Role Based Access Control (RBAC) on the Airflow DAGs. Admin users on Hopsworks have access to all the DAGs in the deployment. Regular users can access all the DAGs of the projects they are a member of.
+
+!!! note "Access Control"
+    Airflow does not have any knowledge of the Hopsworks project you are currently working on. As such, when opening the Airflow UI, you will see the DAGs of all the projects you are a member of.
+
+#### Hopsworks DAG Builder
+
+<figure>
+  <img src="../../../assets/images/guides/airflow/airflow_dag_builder.png" alt="Airflow DAG Builder">
+  <figcaption>Airflow DAG Builder</figcaption>
+</figure>
+
+You can create a new Airflow DAG to orchestrate jobs using the Hopsworks DAG builder tool. Click on _New Workflow_ to create a new Airflow DAG. You should provide a name for the DAG as well as a schedule interval. You can define the schedule using the dropdown menus or by providing a cron expression.
+
+You can add Hopsworks operators and sensors to the DAG:
+
+- **Operator**: The operator is used to trigger a job execution. When configuring the operator, you select the job you want to execute and can optionally provide execution arguments. You can also decide whether or not the operator should wait for the execution to complete. If you select the _wait_ option, the operator blocks until the execution finishes: Airflow will not execute any task in parallel, and the Airflow task fails if the job fails. If you want to execute tasks in parallel, you should not select the _wait_ option but instead use the sensor. When configuring the operator, you can also specify which other Airflow tasks it depends on. If you add a dependency, the task will be executed only after the upstream tasks have completed successfully.
+
+- **Sensor**: The sensor can be used to wait for executions to be completed. Similarly to the _wait_ option of the operator, the sensor blocks until the job execution is completed. The sensor can be used to launch several jobs in parallel and wait for their executions to be completed. Please note that the sensor is defined at the job level rather than the execution level: the sensor waits for the most recent execution to be completed, and it fails the Airflow task if that execution was not successful.
+
+You can then create the DAG and Hopsworks will generate the Python file.
+
+#### Write your own DAG
+
+If you prefer to code the DAGs, or you want to edit a DAG built with the builder tool, you can do so. The Airflow DAGs are stored in the _Airflow_ dataset, which you can access using the file browser in the project settings.
+
+When writing the code for the DAG you can invoke the operator as follows:
+
+```python
+HopsworksLaunchOperator(dag=dag,
+                        task_id="profiles_fg_0",
+                        project_name="airflow_doc",
+                        job_name="profiles_fg",
+                        job_arguments="",
+                        wait_for_completion=True)
+```
+
+You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`, `job_arguments`). You can set the `wait_for_completion` flag to `True` if you want the operator to block and wait for the job execution to finish.
+
+Similarly, you can invoke the sensor as shown below. You should provide the name of the Airflow task (`task_id`) and the Hopsworks job information (`project_name`, `job_name`).
+
+```python
+HopsworksJobSuccessSensor(dag=dag,
+                          task_id='wait_for_profiles_fg',
+                          project_name="airflow_doc",
+                          job_name='profiles_fg')
+```
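+
+Putting the two together, the sketch below shows how the operator and the sensor can be combined to run two jobs in parallel and start a third job only after both have succeeded. This is an illustrative sketch rather than a builder-generated file: the import paths, the `airflow_doc` project, and the job names (`profiles_fg`, `transactions_fg`, `training_job`) are assumptions, so check a DAG file generated by the builder tool for the exact module paths in your deployment.
+
+```python
+from datetime import datetime
+
+from airflow import DAG
+
+# Assumed import paths; verify them against a builder-generated DAG file.
+from hopsworks_plugin.operators.hopsworks_operator import HopsworksLaunchOperator
+from hopsworks_plugin.sensors.hopsworks_sensor import HopsworksJobSuccessSensor
+
+dag = DAG(dag_id="parallel_example", start_date=datetime(2023, 1, 1))
+
+# Launch both feature group jobs without waiting, so they run in parallel.
+launch_profiles = HopsworksLaunchOperator(dag=dag,
+                                          task_id="launch_profiles_fg",
+                                          project_name="airflow_doc",
+                                          job_name="profiles_fg",
+                                          wait_for_completion=False)
+launch_transactions = HopsworksLaunchOperator(dag=dag,
+                                              task_id="launch_transactions_fg",
+                                              project_name="airflow_doc",
+                                              job_name="transactions_fg",
+                                              wait_for_completion=False)
+
+# Each sensor blocks its own task until the most recent execution of its job succeeds.
+wait_profiles = HopsworksJobSuccessSensor(dag=dag,
+                                          task_id="wait_for_profiles_fg",
+                                          project_name="airflow_doc",
+                                          job_name="profiles_fg")
+wait_transactions = HopsworksJobSuccessSensor(dag=dag,
+                                              task_id="wait_for_transactions_fg",
+                                              project_name="airflow_doc",
+                                              job_name="transactions_fg")
+
+# The training job runs only after both feature group jobs have succeeded.
+train = HopsworksLaunchOperator(dag=dag,
+                                task_id="launch_training_job",
+                                project_name="airflow_doc",
+                                job_name="training_job",
+                                wait_for_completion=True)
+
+launch_profiles >> wait_profiles
+launch_transactions >> wait_transactions
+[wait_profiles, wait_transactions] >> train
+```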
+
+When writing the DAG file, you should also add the `access_control` parameter to the DAG configuration. The `access_control` parameter specifies which projects have access to the DAG and which actions the project members can perform on it. If you do not specify the `access_control` option, project members will not be able to see the DAG in the Airflow UI.
+
+!!! warning "Admin access"
+    The `access_control` configuration does not apply to Hopsworks admin users, who have full access to all the DAGs even if they are not members of the project.
+
+```python
+dag = DAG(
+    dag_id="example_dag",
+    default_args=args,
+    access_control={
+        "project_name": {"can_dag_read", "can_dag_edit"},
+    },
+    schedule_interval="0 4 * * *"
+)
+```
+
+!!! note "Project Name"
+    You should replace `project_name` in the snippet above with the name of your own project.
+
+#### Manage Airflow DAGs using Git
+
+You can leverage the [Git integration](../git/clone_repo.md) to track your Airflow DAGs in a git repository. Airflow will only consider the DAG files which are stored in the _Airflow_ Dataset in Hopsworks.
+After cloning the git repository in Hopsworks, you can automate the process of copying the DAG file into the _Airflow_ Dataset using the [copy method](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#copy) of the Hopsworks API.
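+
+As a minimal sketch of such an automation, the snippet below copies a DAG file from a cloned repository into the _Airflow_ dataset. The repository location (`Resources/my_repo`) and the file name (`my_dag.py`) are placeholders; adjust them to where your repository was actually cloned.
+
+```python
+import hopsworks
+
+project = hopsworks.login()
+dataset_api = project.get_dataset_api()
+
+# Copy the DAG file from the cloned repository into the Airflow dataset,
+# overwriting any previous version of the file.
+dataset_api.copy("Resources/my_repo/my_dag.py", "Airflow/my_dag.py", overwrite=True)
+```
\ No newline at end of file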
diff --git a/docs/user_guides/projects/git/clone_repo.md b/docs/user_guides/projects/git/clone_repo.md
index b8305839d..0e609b26a 100644
--- a/docs/user_guides/projects/git/clone_repo.md
+++ b/docs/user_guides/projects/git/clone_repo.md
@@ -5,9 +5,6 @@
 Repositories are cloned and managed within the scope of a project. The content of the repository will reside on the Hopsworks File System. The content of the repository can be edited from Jupyter notebooks and can for example be used to configure Jobs.
 Repositories can be managed from the Git section in the project settings. The Git overview in the project settings provides a list of repositories currently cloned within the project, the location of their content as well which branch and commit their HEAD is currently at.
 
-!!! warning "Beta"
-    The feature is currently in Beta and will be improved in the upcoming releases.
-
 ## Prerequisites
 
 - For cloning a private repository, you should configure a [Git Provider](configure_git_provider.md) with your git credentials. You can clone a GitHub and GitLab public repository without configuring the provider. However, for BitBucket you always need to configure the username and token to clone a repository.
diff --git a/docs/user_guides/projects/git/configure_git_provider.md b/docs/user_guides/projects/git/configure_git_provider.md
index 6d68e1cfb..94e554d79 100644
--- a/docs/user_guides/projects/git/configure_git_provider.md
+++ b/docs/user_guides/projects/git/configure_git_provider.md
@@ -4,9 +4,6 @@
 When you perform Git operations on Hopsworks that need to interact with the remote repository, Hopsworks relies on the Git HTTPS protocol to perform those operations. Authentication with the remote repository happens through a token generated by the Git repository hosting service (GitHub, GitLab, BitBucket).
 
-!!! warning "Beta"
-    The feature is currently in Beta and will be improved in the upcoming releases.
-
 !!! notice "Token permissions"
     The token permissions should grant access to public and private repositories including read and write access to repository contents and commit statuses. If you are using the new GitHub access tokens, make sure you choose the correct `Resource owner` when generating the token for the repositories you will want to clone. For the `Repository permissions` of the new GitHub fine-grained token, you should atleast give read and write access to `Commit statuses` and `Contents`.
diff --git a/docs/user_guides/projects/git/repository_actions.md b/docs/user_guides/projects/git/repository_actions.md
index 5dce43832..81bef3cf1 100644
--- a/docs/user_guides/projects/git/repository_actions.md
+++ b/docs/user_guides/projects/git/repository_actions.md
@@ -1,8 +1,6 @@
 # Repository actions
 ## Introduction
 This section explains the git operations or commands you can perform on hopsworks git repositories. These commands include commit, pull, push, create branches and many more.
-!!! warning "Beta"
-    The feature is currently in Beta and will be improved in the upcoming releases.
 !!! notice "Repository permissions"
     Git repositories are private. Only the owner of the repository can perform git actions on the repository such as commit, push, pull e.t.c.
diff --git a/mkdocs.yml b/mkdocs.yml
index b713a31bb..cb9707d92 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -136,6 +136,7 @@ nav:
       - Run Spark Job: user_guides/projects/jobs/spark_job.md
       - Run Python Job: user_guides/projects/jobs/python_job.md
      - Scheduling: user_guides/projects/jobs/schedule_job.md
+      - Airflow: user_guides/projects/airflow/airflow.md
    - OpenSearch:
      - Connect: user_guides/projects/opensearch/connect.md
    - Kafka: