Merge pull request #122 from aowen-uwmad/patch-1
Adding DAGMan guides
Showing 3 changed files with 535 additions and 12 deletions.
244 changes: 244 additions & 0 deletions
documentation/htc_workloads/automated_workflows/dagman-simple-example.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,244 @@ | ||
--- | ||
ospool: | ||
path: htc_workloads/automated_workflows/dagman-simple-example.md | ||
--- | ||
|
||
# Simple Example of a DAGMan Workflow | ||
|
||
## Overview | ||
|
||
In this guide: | ||
|
||
1. [Introduction](#1-introduction) | ||
2. [Structure of the DAG](#2-structure-of-the-dag) | ||
3. [The Minimal DAG Input File](#3-the-minimal-dag-input-file) | ||
4. [The Submit Files](#4-the-submit-files) | ||
5. [Running the Simple DAG](#5-running-the-simple-dag) | ||
6. [Monitoring the Simple DAG](#6-monitoring-the-simple-dag) | ||
7. [Wrapping Up](#7-wrapping-up) | ||
|
||
For the full details on various DAGMan features, see the HTCondor manual pages: | ||
|
||
* [HTCondor's DAGMan Documentation](https://htcondor.readthedocs.io/en/latest/automated-workflows/index.html) | ||
|
||
## 1. Introduction | ||
|
||
Consider the case of two HTCondor jobs that use the submit files `A.sub` and `B.sub`. | ||
Let's say that `A.sub` generates an output file (`output.txt`) that `B.sub` will analyze. | ||
To run this workflow manually, we would: | |
|
||
1. Submit the first HTCondor job with `condor_submit A.sub`. | ||
2. Wait for the first HTCondor job to complete successfully. | ||
3. Submit the second HTCondor job with `condor_submit B.sub`. | ||
|
||
If the first HTCondor job using `A.sub` is fairly short, then manually running this workflow is not a big deal. | ||
But if the first HTCondor job takes a long time to complete (for example, it runs for several hours or has to wait for special resources), | |
waiting around to submit the second job by hand becomes very inconvenient. | |
Instead, we can use DAGMan to automatically submit `B.sub` once the first HTCondor job using `A.sub` has completed successfully. | ||
This guide walks through the process of creating such a DAGMan workflow. | ||
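
For comparison, the manual procedure above can be sketched as a short shell function. This is an illustration only: the function name `run_manual_workflow` is hypothetical, and it assumes that `A.sub` writes its job log to `A.log` and that a non-empty `output.txt` is an acceptable sign that the first job succeeded. `condor_wait` returns once the job recorded in the log has left the queue, but it does not check the job's exit code, hence the extra file test.

```shell
#!/bin/bash
# Hypothetical helper: run the two-step workflow by hand.
run_manual_workflow() {
    # Step 1: submit the first job.
    condor_submit A.sub || return 1
    # Step 2: block until the job recorded in A.log leaves the queue.
    condor_wait A.log || return 1
    # Assumption: a non-empty output.txt means the first job succeeded.
    [ -s output.txt ] || return 1
    # Step 3: submit the second job.
    condor_submit B.sub
}

# Usage (on an access point, from the directory holding the submit files):
#   run_manual_workflow
```

This submit, wait, submit bookkeeping is what DAGMan takes over for you.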
|
||
## 2. Structure of the DAG | ||
|
||
In this scenario, our workflow could be described as a DAG consisting of two nodes (`A.sub` and `B.sub`) connected by a single edge (`output.txt`). | ||
To represent this relationship, we will define nodes `A` and `B`, corresponding to `A.sub` and `B.sub`, respectively, and connect them with an arrow pointing from `A` to `B`, as in this figure: | |
|
||
 | ||
|
||
|
||
In order to use DAGMan to run this workflow, we need to communicate this structure to DAGMan via the `.dag` input file. | ||
|
||
## 3. The Minimal DAG Input File | ||
|
||
Let's call the input file `simple.dag`. | ||
At minimum, the contents of the `simple.dag` input file are: | |
|
||
``` | ||
# simple.dag | ||
# Define the DAG jobs | ||
JOB A A.sub | ||
JOB B B.sub | ||
# Define the connections | ||
PARENT A CHILD B | ||
``` | ||
|
||
In a DAGMan input file, a node is defined using the `JOB` keyword, followed by the name of the node and the name of the corresponding submit file. | ||
In this case, we have created a node named `A` and instructed DAGMan to use the submit file `A.sub` for executing that node. | ||
We have similarly created node `B` and instructed DAGMan to use the submit file `B.sub`. | ||
(While there is no requirement that the name of the node match the name of the corresponding submit file, it is convenient to use a consistent naming scheme.) | ||
|
||
To connect the nodes, we use the `PARENT .. CHILD ..` syntax. | ||
Since node `B` requires that node `A` has completed successfully, we say that node `A` is the `PARENT` while node `B` is the `CHILD`. | ||
Note that we do not need to define *why* node `B` is dependent on node `A`, only that it is. | ||
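
As a quick sanity check, the node and dependency lines can be read back out of the file with `awk`. The following is a minimal sketch, not a general DAG parser: it assumes the simple one-parent, one-child form used here, and the scratch directory and variable names are only for illustration.

```shell
#!/bin/bash
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Write the same minimal DAG input file described above.
cat > "$workdir/simple.dag" <<'EOF'
# simple.dag
# Define the DAG jobs
JOB A A.sub
JOB B B.sub
# Define the connections
PARENT A CHILD B
EOF

# Each JOB line is: JOB <node name> <submit file>.
nodes=$(awk '$1 == "JOB" { print $2 " -> " $3 }' "$workdir/simple.dag")
echo "$nodes"

# Each PARENT line here is: PARENT <parent node> CHILD <child node>.
edges=$(awk '$1 == "PARENT" { print $2 " before " $4 }' "$workdir/simple.dag")
echo "$edges"
```

The first command lists each node with its submit file, and the second lists the single dependency between them.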
|
||
## 4. The Submit Files | ||
|
||
Now let's define simple examples of the submit files `A.sub` and `B.sub`. | ||
|
||
### Node A | ||
|
||
First, the submit file `A.sub` uses the executable `A.sh`, which will generate the file called `output.txt`. | ||
We have explicitly told HTCondor to transfer back this file by using the `transfer_output_files` command. | ||
|
||
``` | ||
# A.sub | ||
executable = A.sh | ||
log = A.log | ||
output = A.out | ||
error = A.err | ||
transfer_output_files = output.txt | ||
+JobDurationCategory = "Medium" | ||
request_cpus = 1 | ||
request_memory = 1GB | ||
request_disk = 1GB | ||
queue | ||
``` | ||
|
||
The executable file simply saves the hostname of the machine running the script: | ||
|
||
``` | ||
#!/bin/bash | ||
# A.sh | ||
hostname > output.txt | ||
sleep 1m # so we can see the job in "running" status | ||
``` | ||
|
||
### Node B | ||
|
||
Second, the submit file `B.sub` uses the executable `B.sh` to print a message using the contents of the `output.txt` file generated by `A.sh`. | ||
We have explicitly told HTCondor to transfer `output.txt` as an *input* file for this job, using the `transfer_input_files` command. | ||
Thus we have finally defined the "edge" that connects nodes `A` and `B`: the use of `output.txt`. | ||
|
||
``` | ||
# B.sub | ||
executable = B.sh | ||
log = B.log | ||
output = B.out | ||
error = B.err | ||
transfer_input_files = output.txt | ||
+JobDurationCategory = "Medium" | ||
request_cpus = 1 | ||
request_memory = 1GB | ||
request_disk = 1GB | ||
queue | ||
``` | ||
|
||
The executable file contains the command for printing the desired message, which will be printed to `B.out`. | ||
|
||
``` | ||
#!/bin/bash | ||
# B.sh | ||
echo "The previous job was executed on the following machine:" | ||
cat output.txt | ||
sleep 1m # so we can see the job in "running" status | ||
``` | ||
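
Before submitting anything, it can help to check the file hand-off locally. The sketch below is an assumption-laden dry run, not part of the workflow itself: it reproduces the core commands of `A.sh` and `B.sh` in a scratch directory, skips the `sleep` lines, and writes the message to a local `B.out` to mimic what HTCondor would capture.

```shell
#!/bin/bash
# Local dry run of the file hand-off, without HTCondor and without the sleeps.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# What A.sh does: record the hostname in output.txt.
hostname > output.txt

# What B.sh does: print a message followed by the contents of output.txt.
# Under HTCondor this text would end up in B.out.
{
    echo "The previous job was executed on the following machine:"
    cat output.txt
} > B.out

cat B.out
```

When run locally, the hostname reported is that of the machine you are logged in to, not of an execution point.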
|
||
### The directory structure | ||
|
||
Based on the contents of `simple.dag`, DAGMan is expecting that the submit files `A.sub` and `B.sub` are in the same directory as `simple.dag`. | ||
The submit files in turn are expecting `A.sh` and `B.sh` to be in the same directory as `A.sub` and `B.sub`. | |
Thus, we have the following directory structure: | ||
|
||
``` | ||
DAG_simple/ | ||
|-- A.sh | ||
|-- A.sub | ||
|-- B.sh | ||
|-- B.sub | ||
|-- simple.dag | ||
``` | ||
|
||
It is possible to organize each job into its own directory, but for now we will use this simple, flat organization. | ||
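
A small check like the following can confirm that the flat layout is complete before submitting. It is only a sketch: the helper name `check_dag_files` is hypothetical, and the list of file names is hard-coded to match this example.

```shell
#!/bin/bash
# Hypothetical helper: report any file from the flat layout that is missing.
check_dag_files() {
    dir=$1
    missing=0
    for f in A.sh A.sub B.sh B.sub simple.dag; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    # Return 0 when every file is present, 1 otherwise.
    return "$missing"
}

# Usage: check_dag_files DAG_simple && echo "all files present"
```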
|
||
## 5. Running the Simple DAG | ||
|
||
To run the DAG workflow described by `simple.dag`, we use the HTCondor command `condor_submit_dag`: | ||
|
||
``` | ||
condor_submit_dag simple.dag | ||
``` | ||
|
||
The DAGMan utility will then parse the input file and generate an assortment of related files that it will use for monitoring and managing your workflow. | ||
Here is the output of running the above command: | ||
|
||
``` | ||
[user@ap40 DAG_simple]$ condor_submit_dag simple.dag | ||
Loading classad userMap 'checkpoint_destination_map' ts=1699037029 from /etc/condor/checkpoint-destination-mapfile | ||
----------------------------------------------------------------------- | ||
File for submitting this DAG to HTCondor : simple.dag.condor.sub | ||
Log of DAGMan debugging messages : simple.dag.dagman.out | ||
Log of HTCondor library output : simple.dag.lib.out | ||
Log of HTCondor library error messages : simple.dag.lib.err | ||
Log of the life of condor_dagman itself : simple.dag.dagman.log | ||
Submitting job(s). | ||
1 job(s) submitted to cluster 562265. | ||
----------------------------------------------------------------------- | ||
``` | ||
|
||
The output shows the list of standard files that are created with every DAG submission along with brief descriptions. | ||
A couple of additional files, some of them temporary, will be created during the lifetime of the DAG. | ||
|
||
## 6. Monitoring the Simple DAG | ||
|
||
You can see the status of the DAG in your queue just like with any other HTCondor job submission. | ||
|
||
``` | ||
[user@ap40 DAG_simple]$ condor_q | ||
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:26:51 | ||
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS | ||
user simple.dag+562265 12/14 11:26 _ _ 1 2 562279.0 | ||
``` | ||
|
||
There are a few things to note about the `condor_q` output: | |
|
||
- The `BATCH_NAME` for the DAGMan job is the name of the input DAG file, `simple.dag`, plus the Job ID of the DAGMan scheduler job (`562265` in this case): `simple.dag+562265`. | ||
- The total number of jobs for `simple.dag+562265` corresponds to the total number of nodes in the DAG (2). | ||
- Only 1 node is listed as "Idle", meaning that DAGMan has only submitted 1 job so far. This is consistent with the fact that node `A` has to complete before DAGMan can submit the job for node `B`. | ||
|
||
> Note that if you are very quick to run your `condor_q` command after running your `condor_submit_dag` command, then you may see only the DAGMan scheduler job. It may take a few seconds for DAGMan to start up and submit the HTCondor job associated with the first node. | ||
To see more detailed information about the DAG workflow, use `condor_q -dag -nob`. | |
For example: | |
|
||
``` | ||
[user@ap40 DAG_simple]$ condor_q -dag -nob | ||
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:27:03 | ||
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD | ||
562265.0 user 12/14 11:26 0+00:00:37 R 0 0.5 condor_dagman -p 0 -f -l . -Loc | ||
562279.0 |-A 12/14 11:26 0+00:00:00 I 0 0.0 A.sh | ||
``` | ||
|
||
In this case, the first entry is the DAGMan scheduler job that you created when you first submitted the DAG. | ||
The following entries correspond to the nodes whose jobs are currently in the queue. | ||
Nodes that have not yet been submitted by DAGMan or that have completed and thus left the queue will not show up in your `condor_q` output. | ||
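
If you only want the node names and their states, the node rows can be filtered out of that listing. The sketch below works on a saved copy of the sample output shown above rather than on a live queue, and it assumes the column layout of that sample (node name prefixed with `|-` in the second column, state in the sixth); a different `condor_q` configuration may need different field numbers.

```shell
#!/bin/bash
# Save the sample condor_q -dag -nob output from this guide to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:27:03
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
562265.0 user 12/14 11:26 0+00:00:37 R 0 0.5 condor_dagman -p 0 -f -l . -Loc
562279.0 |-A 12/14 11:26 0+00:00:00 I 0 0.0 A.sh
EOF

# Keep only rows whose second column starts with "|-" (the node rows),
# then print the node name and its state (ST) column.
queued=$(awk 'substr($2, 1, 2) == "|-" { print substr($2, 3), $6 }' "$sample")
echo "$queued"
```

For the sample above this prints `A I`: node `A` is in the queue and idle.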
|
||
## 7. Wrapping Up | ||
|
||
Given enough time, this simple DAG workflow should complete without any issues. | |
But of course, that will not be the case for every DAG, especially as you start to create your own. | ||
DAGMan has many more features for managing and submitting DAG workflows, including handling errors, combining DAG workflows, and restarting failed DAG workflows. | |
|
||
For now, we recommend that you continue exploring DAGMan by going through our [Intermediate DAGMan Tutorial](../../tutorials/tutorial-DAGMan-intermediate). There is also our guide [Submit Workflows with HTCondor's DAGMan](dagman-workflows), which contains links to more resources in the [More Resources](dagman-workflows#more-resources) section. | ||
|
||
Finally, the definitive guide to DAGMan and DAG workflows is [HTCondor's DAGMan Documentation](https://htcondor.readthedocs.io/en/latest/automated-workflows/index.html). |