Merge pull request #122 from aowen-uwmad/patch-1
Adding DAGMan guides
aowen-uwmad authored Jan 24, 2024
2 parents ca4f432 + 959db0c commit ca18afc
Showing 3 changed files with 535 additions and 12 deletions.
Binary file added documentation/assets/simple-DAG_fig.png
@@ -0,0 +1,244 @@
---
ospool:
path: htc_workloads/automated_workflows/dagman-simple-example.md
---

# Simple Example of a DAGMan Workflow

## Overview

In this guide:

1. [Introduction](#1-introduction)
2. [Structure of the DAG](#2-structure-of-the-dag)
3. [The Minimal DAG Input File](#3-the-minimal-dag-input-file)
4. [The Submit Files](#4-the-submit-files)
5. [Running the Simple DAG](#5-running-the-simple-dag)
6. [Monitoring the Simple DAG](#6-monitoring-the-simple-dag)
7. [Wrapping Up](#7-wrapping-up)

For the full details on various DAGMan features, see the HTCondor manual pages:

* [HTCondor's DAGMan Documentation](https://htcondor.readthedocs.io/en/latest/automated-workflows/index.html)

## 1. Introduction

Consider the case of two HTCondor jobs that use the submit files `A.sub` and `B.sub`.
Let's say that `A.sub` generates an output file (`output.txt`) that `B.sub` will analyze.
To run this workflow manually, we would

1. Submit the first HTCondor job with `condor_submit A.sub`.
2. Wait for the first HTCondor job to complete successfully.
3. Submit the second HTCondor job with `condor_submit B.sub`.

If the first HTCondor job using `A.sub` is fairly short, then manually running this workflow is not a big deal.
But if the first HTCondor job takes a long time to complete (for example, it runs for several hours or has to wait for special resources),
running the workflow manually can be very inconvenient.
Instead, we can use DAGMan to automatically submit `B.sub` once the first HTCondor job using `A.sub` has completed successfully.
This guide walks through the process of creating such a DAGMan workflow.
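
For comparison, the manual procedure listed above could be scripted roughly as follows. This is only a sketch under a few assumptions: the HTCondor command-line tools are on the `PATH`, `A.sub` writes its job log to `A.log` and returns `output.txt` to the submit directory, and the function name `run_manual_workflow` is our own invention. `condor_wait` only reports that the job has left the queue, so the sketch treats a non-empty `output.txt` as a rough stand-in for "completed successfully".

```shell
#!/bin/bash
# Sketch of the manual A-then-B workflow (hypothetical helper, not part of HTCondor).

run_manual_workflow() {
    # Step 1: submit the first job.
    condor_submit A.sub || return 1

    # Step 2: block until the job recorded in A.log has left the queue.
    condor_wait A.log || return 1

    # condor_wait does not check the job's own exit status, so we use the
    # presence of a non-empty output.txt as a rough success check (assumption).
    if [ ! -s output.txt ]; then
        echo "output.txt is missing or empty; not submitting B.sub" >&2
        return 1
    fi

    # Step 3: submit the second job.
    condor_submit B.sub
}

# Only attempt the workflow where the HTCondor tools are installed.
if command -v condor_submit >/dev/null 2>&1 && command -v condor_wait >/dev/null 2>&1; then
    run_manual_workflow
else
    echo "HTCondor tools not found; skipping"
fi
```

A script like this has to stay running on the access point until the first job finishes, which is the kind of bookkeeping that DAGMan takes over.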

## 2. Structure of the DAG

In this scenario, our workflow could be described as a DAG consisting of two nodes (`A.sub` and `B.sub`) connected by a single edge (`output.txt`).
To represent this relationship, we will define nodes `A` and `B` - corresponding to `A.sub` and `B.sub`, respectively - and connect them with an arrow pointing from `A` to `B`, as in this figure:

![Node A with arrow pointing to Node B](../../assets/simple-DAG_fig.png)


In order to use DAGMan to run this workflow, we need to communicate this structure to DAGMan via the `.dag` input file.

## 3. The Minimal DAG Input File

Let's call the input file `simple.dag`.
At minimum, the contents of the `simple.dag` input file are

```
# simple.dag
# Define the DAG jobs
JOB A A.sub
JOB B B.sub
# Define the connections
PARENT A CHILD B
```

In a DAGMan input file, a node is defined using the `JOB` keyword, followed by the name of the node and the name of the corresponding submit file.
In this case, we have created a node named `A` and instructed DAGMan to use the submit file `A.sub` for executing that node.
We have similarly created node `B` and instructed DAGMan to use the submit file `B.sub`.
(While there is no requirement that the name of the node match the name of the corresponding submit file, it is convenient to use a consistent naming scheme.)

To connect the nodes, we use the `PARENT .. CHILD ..` syntax.
Since node `B` requires that node `A` has completed successfully, we say that node `A` is the `PARENT` while node `B` is the `CHILD`.
Note that we do not need to define *why* node `B` is dependent on node `A`, only that it is.
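
To make the two kinds of lines concrete, here is a small sketch that reads a DAG input file and lists the nodes and edges it declares. The function names `dag_nodes` and `dag_edges` are hypothetical, the parsing only understands the `JOB` and `PARENT .. CHILD ..` lines described in this guide (written in upper case, as above), and it is not a substitute for DAGMan's own parser.

```shell
#!/bin/bash
# List the nodes and edges declared in a DAG input file (sketch only).

dag_nodes() {
    # Print "node submit_file" for every JOB line.
    awk '$1 == "JOB" { print $2, $3 }' "$1"
}

dag_edges() {
    # Print "parent -> child" for every pair named on a PARENT ... CHILD line.
    awk '$1 == "PARENT" {
        child_at = 0
        for (i = 2; i <= NF; i++) if ($i == "CHILD") child_at = i
        if (child_at == 0) next
        for (p = 2; p < child_at; p++)
            for (c = child_at + 1; c <= NF; c++)
                print $p " -> " $c
    }' "$1"
}

# Example: recreate simple.dag in a temporary directory and inspect it.
demo_dir=$(mktemp -d)
cat > "$demo_dir/simple.dag" <<'EOF'
# simple.dag
JOB A A.sub
JOB B B.sub
PARENT A CHILD B
EOF

dag_nodes "$demo_dir/simple.dag"   # prints "A A.sub" then "B B.sub"
dag_edges "$demo_dir/simple.dag"   # prints "A -> B"
```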

## 4. The Submit Files

Now let's define simple examples of the submit files `A.sub` and `B.sub`.

### Node A

First, the submit file `A.sub` uses the executable `A.sh`, which will generate the file called `output.txt`.
We have explicitly told HTCondor to transfer back this file by using the `transfer_output_files` command.

```
# A.sub
executable = A.sh
log = A.log
output = A.out
error = A.err
transfer_output_files = output.txt
+JobDurationCategory = "Medium"
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue
```

The executable file simply saves the hostname of the machine running the script:

```
#!/bin/bash
# A.sh
hostname > output.txt
sleep 1m # so we can see the job in "running" status
```

### Node B

Second, the submit file `B.sub` uses the executable `B.sh` to print a message using the contents of the `output.txt` file generated by `A.sh`.
We have explicitly told HTCondor to transfer `output.txt` as an *input* file for this job, using the `transfer_input_files` command.
Thus we have finally defined the "edge" that connects nodes `A` and `B`: the use of `output.txt`.

```
# B.sub
executable = B.sh
log = B.log
output = B.out
error = B.err
transfer_input_files = output.txt
+JobDurationCategory = "Medium"
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue
```

The executable file contains the command for printing the desired message, which will be printed to `B.out`.

```
#!/bin/bash
# B.sh
echo "The previous job was executed on the following machine:"
cat output.txt
sleep 1m # so we can see the job in "running" status
```
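
Before submitting anything, it can be reassuring to confirm locally that the commands in `B.sh` can consume what `A.sh` produces. The sketch below is a dry run with no HTCondor involved: the functions `node_a` and `node_b` are hypothetical stand-ins that copy the commands from the two scripts minus the `sleep`, and everything happens in a throwaway directory.

```shell
#!/bin/bash
# Local dry run of the A -> B hand-off (sketch; no HTCondor involved).

node_a() {
    # Same command as A.sh, minus the sleep.
    hostname > output.txt
}

node_b() {
    # Same commands as B.sh, minus the sleep.
    echo "The previous job was executed on the following machine:"
    cat output.txt
}

scratch_dir=$(mktemp -d)   # throwaway directory so no real files are overwritten
cd "$scratch_dir" || exit 1

# Run B only if A succeeded, mirroring the PARENT A CHILD B dependency.
node_a && node_b > B.out

cat B.out
```

On a real run the hostname in `B.out` would be that of the execution point that ran node `A`, not the machine you are logged in to.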

### The directory structure

Based on the contents of `simple.dag`, DAGMan is expecting that the submit files `A.sub` and `B.sub` are in the same directory as `simple.dag`.
The submit files in turn are expecting `A.sh` and `B.sh` to be in the same directory as `A.sub` and `B.sub`.
Thus, we have the following directory structure:

```
DAG_simple/
|-- A.sh
|-- A.sub
|-- B.sh
|-- B.sub
|-- simple.dag
```

It is possible to organize each job into its own directory, but for now we will use this simple, flat organization.
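
With a flat layout like this, a quick check before submitting can catch a misplaced file. The following is a minimal sketch, assuming every `JOB` line has the form shown in this guide and every submit file names its script on a single `executable = ...` line; the helper `check_dag_files` is hypothetical and not an HTCondor tool.

```shell
#!/bin/bash
# Pre-submission check for a flat DAG directory (hypothetical helper).

check_dag_files() {
    local dag_file=$1
    local dag_dir sub exe status=0
    dag_dir=$(dirname "$dag_file")

    # Every JOB line names a submit file in its third field.
    for sub in $(awk '$1 == "JOB" { print $3 }' "$dag_file"); do
        if [ ! -f "$dag_dir/$sub" ]; then
            echo "missing submit file: $sub"
            status=1
            continue
        fi
        # Pull the value of the "executable = ..." line from the submit file.
        exe=$(sed -n 's/^[[:space:]]*executable[[:space:]]*=[[:space:]]*//p' "$dag_dir/$sub")
        if [ -n "$exe" ] && [ ! -f "$dag_dir/$exe" ]; then
            echo "missing executable: $exe (named in $sub)"
            status=1
        fi
    done
    return "$status"
}

# Example: build a stand-in DAG_simple directory with placeholder files.
demo_dir=$(mktemp -d)/DAG_simple
mkdir -p "$demo_dir"
printf 'JOB A A.sub\nJOB B B.sub\nPARENT A CHILD B\n' > "$demo_dir/simple.dag"
printf 'executable = A.sh\nqueue\n' > "$demo_dir/A.sub"
printf 'executable = B.sh\nqueue\n' > "$demo_dir/B.sub"
touch "$demo_dir/A.sh"      # B.sh is deliberately left out

check_dag_files "$demo_dir/simple.dag" || echo "fix the problems above before submitting"
```

In the example, the check reports that `B.sh` is missing; once that file is added, the function returns successfully and prints nothing.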

## 5. Running the Simple DAG

To run the DAG workflow described by `simple.dag`, we use the HTCondor command `condor_submit_dag`:

```
condor_submit_dag simple.dag
```

The DAGMan utility will then parse the input file and generate an assortment of related files that it will use for monitoring and managing your workflow.
Here is the output of running the above command:

```
[user@ap40 DAG_simple]$ condor_submit_dag simple.dag
Loading classad userMap 'checkpoint_destination_map' ts=1699037029 from /etc/condor/checkpoint-destination-mapfile
-----------------------------------------------------------------------
File for submitting this DAG to HTCondor : simple.dag.condor.sub
Log of DAGMan debugging messages : simple.dag.dagman.out
Log of HTCondor library output : simple.dag.lib.out
Log of HTCondor library error messages : simple.dag.lib.err
Log of the life of condor_dagman itself : simple.dag.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 562265.
-----------------------------------------------------------------------
```

The output shows the list of standard files that are created with every DAG submission along with brief descriptions.
A couple of additional files, some of them temporary, will be created during the lifetime of the DAG.

## 6. Monitoring the Simple DAG

You can see the status of the DAG in your queue just like with any other HTCondor job submission.

```
[user@ap40 DAG_simple]$ condor_q
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:26:51
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
user simple.dag+562265 12/14 11:26 _ _ 1 2 562279.0
```

There are a few things to note about the `condor_q` output:

- The `BATCH_NAME` for the DAGMan job is the name of the input DAG file, `simple.dag`, plus the Job ID of the DAGMan scheduler job (`562265` in this case): `simple.dag+562265`.
- The total number of jobs for `simple.dag+562265` corresponds to the total number of nodes in the DAG (2).
- Only 1 node is listed as "Idle", meaning that DAGMan has only submitted 1 job so far. This is consistent with the fact that node `A` has to complete before DAGMan can submit the job for node `B`.

> Note that if you are very quick to run your `condor_q` command after running your `condor_submit_dag` command, then you may see only the DAGMan scheduler job. It may take a few seconds for DAGMan to start up and submit the HTCondor job associated with the first node.

To see more detailed information about the DAG workflow, use `condor_q -dag -nob`.
For example,

```
[user@ap40 DAG_simple]$ condor_q -dag -nob
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:27:03
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
562265.0 user 12/14 11:26 0+00:00:37 R 0 0.5 condor_dagman -p 0 -f -l . -Loc
562279.0 |-A 12/14 11:26 0+00:00:00 I 0 0.0 A.sh
```

In this case, the first entry is the DAGMan scheduler job that you created when you first submitted the DAG.
The following entries correspond to the nodes whose jobs are currently in the queue.
Nodes that have not yet been submitted by DAGMan or that have completed and thus left the queue will not show up in your `condor_q` output.

## 7. Wrapping Up

After waiting enough time, this simple DAG workflow should complete without any issues.
But of course, that will not be the case for every DAG, especially as you start to create your own.
DAGMan has many more features for managing and submitting DAG workflows, including handling errors, combining DAG workflows, and restarting failed DAG workflows.

For now, we recommend that you continue exploring DAGMan by going through our [Intermediate DAGMan Tutorial](../../tutorials/tutorial-DAGMan-intermediate). There is also our guide [Submit Workflows with HTCondor's DAGMan](dagman-workflows), which contains links to more resources in the [More Resources](dagman-workflows#more-resources) section.

Finally, the definitive guide to DAGMan and DAG workflows is [HTCondor's DAGMan Documentation](https://htcondor.readthedocs.io/en/latest/automated-workflows/index.html).