Merge pull request #52 from ENCODE-DCC/dev
v0.7.0
leepc12 authored Mar 10, 2020
2 parents 8907e47 + 7d246aa commit f4a56d0
Showing 10 changed files with 650 additions and 308 deletions.
18 changes: 18 additions & 0 deletions DETAILS.md
@@ -41,6 +41,19 @@ list | WF_ID or STR_LABEL |List submitted workflows on a Cromwell server
metadata | WF_ID or STR_LABEL |Retrieve metadata JSONs for workflows
debug, troubleshoot | WF_ID, STR_LABEL or<br>METADATA_JSON_FILE |Analyze reason for errors

* `init`: To initialize Caper on a given platform (see the example after the table below). This command also downloads the Cromwell/Womtool JARs so that Caper can work completely offline with local data files.

**Platform**|**Description**
:--------|:-----
sherlock | Stanford Sherlock cluster (SLURM)
scg | Stanford SCG cluster (SLURM)
gcp | Google Cloud Platform
aws | Amazon Web Services
local | General local computer
sge | HPC with Sun GridEngine cluster engine
pbs | HPC with PBS cluster engine
slurm | HPC with SLURM cluster engine
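
For example, to initialize Caper for a generic SLURM cluster (a minimal sketch; the exact output and downloaded JAR versions may differ):

```bash
$ caper init slurm            # writes ~/.caper/default.conf and downloads Cromwell/Womtool JARs under ~/.caper
$ cat ~/.caper/default.conf   # then fill in the required parameters (e.g. slurm-partition) for your cluster
```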

* `run`: To run a single workflow. A string label `-s` is optional and useful for other subcommands to identify a workflow (see the example below).
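
A minimal sketch of this, with placeholder WDL/JSON file names:

```bash
$ caper run my_pipeline.wdl -i input.json -s my_workflow_label
```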

@@ -217,11 +230,16 @@ We highly recommend using a default configuration file described in the section
:-----|:-----|:-----|:-----
ip|--ip|localhost|Cromwell server IP address or hostname
port|--port|8000|Cromwell server port
no-server-heartbeat|--no-server-heartbeat||Flag to disable the server heartbeat file.
server-heartbeat-file|--server-heartbeat-file|`~/.caper/default_server_heartbeat`|Heartbeat file for Caper clients to get the IP and port of a server.
server-heartbeat-timeout|--server-heartbeat-timeout|120000|Timeout for a heartbeat file in milliseconds.

cromwell|--cromwell|[cromwell-40.jar](https://github.com/broadinstitute/cromwell/releases/download/40/cromwell-40.jar)|Path or URL for Cromwell JAR file
max-concurrent-tasks|--max-concurrent-tasks|1000|Maximum number of concurrent tasks
max-concurrent-workflows|--max-concurrent-workflows|40|Maximum number of concurrent workflows
max-retries|--max-retries|1|Maximum number of retries for failing tasks
disable-call-caching|--disable-call-caching| |Disable Cromwell's call-caching (re-using outputs)
soft-glob-output|--soft-glob-output||Use soft-linking when globbing outputs on a filesystem that does not allow hard-linking, e.g. beeGFS.
backend-file|--backend-file| |Custom Cromwell backend conf file. This will override Caper's built-in backends.
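
As a hedged sketch, a few of these keys could be set in `~/.caper/default.conf` like this, assuming the same `key=value` format shown for `no-server-heartbeat` later in this document (values are illustrative only):

```bash
# excerpt of ~/.caper/default.conf (illustrative values)
max-concurrent-workflows=40
max-retries=2
disable-call-caching=True
```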

* Troubleshoot parameters for `caper troubleshoot` subcommand.
146 changes: 144 additions & 2 deletions README.md
@@ -1,3 +1,5 @@
**IMPORTANT**: A new flag `--soft-glob-output` has been added to use soft-linking for globbing outputs. Use it with `caper server/run` (not with `caper submit`) on a filesystem that does not allow hard-linking, e.g. beeGFS.

**IMPORTANT**: Caper now defaults to **NOT** using a file-based metadata DB, which means no call-caching (re-using outputs from previous workflows) by default.

**IMPORTANT**: Even if you still want to use a file-based DB (`--db file` and `--file-db [DB_PATH]`), a metadata DB generated with Caper<0.6 (Cromwell-42) is not compatible with one generated with Caper>=0.6 (Cromwell-47). Refer to [this doc](https://github.com/broadinstitute/cromwell/releases/tag/43) for the migration.
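
As a quick, hedged illustration of the flags mentioned in these notes (WDL/JSON/DB file names are placeholders):

```bash
# use soft-linked globbing on a filesystem without hard-link support (e.g. beeGFS)
$ caper run my_pipeline.wdl -i input.json --soft-glob-output

# opt back in to a file-based metadata DB to enable call-caching
$ caper run my_pipeline.wdl -i input.json --db file --file-db ~/.caper/my_file_db
```
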
@@ -45,7 +47,8 @@ Caper is based on Unix and cloud platform CLIs (`curl`, `gsutil` and `aws`) and
export PATH=$PATH:~/.local/bin
```

5) Choose a platform from the following table and initialize Caper. This will create a default Caper configuration file `~/.caper/default.conf`, which has only the required parameters for each platform. There are special platforms for Stanford Sherlock/SCG users. This will also install the Cromwell/Womtool JARs under `~/.caper`. Downloading those files can take up to 10 minutes. Once they are installed, Caper can work completely offline with local data files.

```bash
$ caper init [PLATFORM]
```
@@ -99,11 +102,12 @@ $ sbatch ... --wrap "caper run ..."

## Running pipelines on Stanford Sherlock

> **IMPORTANT**: DO NOT INSTALL CAPER, CONDA AND PIPELINE'S WDL ON `$SCRATCH` OR `$OAK` STORAGES. You will see `Segmentation Fault` errors. Install these executables (Caper, Conda, WDL, ...) on `$HOME` OR `$PI_HOME`. You can still use `$OAK` for input data (e.g. FASTQs defined in your input JSON file) but not for outputs, which means that you should not run Caper on `$OAK`. `$SCRATCH` and `$PI_SCRATCH` are okay for both input and output data, so run Caper on them. Running Croo to organize outputs into `$OAK` is okay.

Submit a Caper leader job (`caper run`) to SLURM. For a partition `-p [SLURM_PARTITION]`, make sure that you use the same SLURM partition (`slurm-partition` in `~/.caper/default.conf`) as defined in Caper's configuration file. `-J [JOB_NAME]` identifies Caper's leader job for each workflow. Make a separate directory for each workflow; outputs will be written to that directory.

```bash
$ # DO NOT RUN THIS ON OAK STORAGE!
$ # conda activate here if required
$ cd [OUTPUT_DIR] # make a separate directory for each workflow.
$ sbatch -p [SLURM_PARTITION] -J [JOB_NAME] --export=ALL --mem 3G -t 4-0 --wrap "caper run [WDL] -i [INPUT_JSON]"
@@ -200,6 +204,144 @@ $ cd [OUTPUT_DIR] # make a separate directory for each workflow
$ caper run [WDL] -i [INPUT_JSON]
```

## Running pipelines on a custom backend

If Caper's built-in backends don't work as expected on your clusters (e.g. due to different resource settings), you can override the built-in backends with your own configuration file (e.g. `your.backend.conf`). Caper generates a `backend.conf` for the built-in backends in a temporary directory.

Find this `backend.conf` first by dry-running `caper run [WDL] --dry-run ...`. For example, for a `slurm` backend:
```
$ caper run toy.wdl --dry-run --backend slurm
[Caper] Validating WDL/input JSON with womtool...
Picked up _JAVA_OPTIONS: -Xms256M -Xmx4024M -XX:ParallelGCThreads=1
Success!
[Caper] cmd: ['java', '-Xmx3G', '-XX:ParallelGCThreads=1', '-DLOG_LEVEL=INFO', '-DLOG_MODE=standard', '-jar', '-Dconfig.file=/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/backend.conf', '/users/leepc12/.caper/cromwell_jar/cromwell-47.jar', 'run', '/mnt/data2/scratch/leepc12/test_caper_refac/toy.wdl', '-i', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/inputs.json', '-o', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/workflow_opts.json', '-l', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/labels.json', '-m', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/metadata.json']
```

Look for the file passed to the Java parameter `-Dconfig.file` and find the backend of interest (`slurm` in this example) in that file.
```
include required(classpath("application"))
backend {
  default = "slurm"
  providers {
    ...
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        default-runtime-attributes {
          time = 24
        }
        concurrent-job-limit = 1000
        script-epilogue = "sleep 10 && sync"
        root = "/mnt/data/scratch/leepc12/caper_out"
        runtime-attributes = """
          String? docker
          String? docker_user
          Int cpu = 1
          Int? gpu
          Int? time
          Int? memory_mb
          String? slurm_partition
          String? slurm_account
          String? slurm_extra_param
          String? singularity
          String? singularity_bindpath
          String? singularity_cachedir
        """
        submit = """ITER=0; until [ $ITER -ge 3 ]; do
          sbatch --export=ALL -J ${job_name} -D ${cwd} -o ${out} -e ${err} ${"-t " + time*60} -n 1 --ntasks-per-node=1 ${true="--cpus-per-task=" false="" defined(cpu)}${cpu} ${true="--mem=" false="" defined(memory_mb)}${memory_mb} ${"-p " + slurm_partition} ${"--account " + slurm_account} ${true="--gres gpu:" false="" defined(gpu)}${gpu} ${slurm_extra_param} --wrap "${if defined(singularity) then '' else '/bin/bash ${script} #'} if [ -z \"$SINGULARITY_BINDPATH\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; if [ -z \"$SINGULARITY_CACHEDIR\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi; singularity exec --cleanenv --home ${cwd} ${if defined(gpu) then '--nv' else ''} ${singularity} /bin/bash ${script}" && break
          ITER=$[$ITER+1]; sleep 30; done
        """
        kill = "scancel ${job_id}"
        exit-code-timeout-seconds = 360
        check-alive = "for ITER in 1 2 3; do CHK_ALIVE=$(squeue --noheader -j ${job_id} --format=%i | grep ${job_id}); if [ -z \"$CHK_ALIVE\" ]; then if [ \"$ITER\" == 3 ]; then /bin/bash -c 'exit 1'; else sleep 30; fi; else echo $CHK_ALIVE; break; fi; done"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
    ...
  }
  ...
```
Some parts of the script (wrapped in `${}`) are written in WDL. For example, `${true="--mem=" false="" defined(memory_mb)}` prints `--mem=` only if `memory_mb` is defined. In such WDL expressions you can use any variables defined in `runtime-attributes`.
For example, if your cluster does not allow importing all environment variables (`sbatch --export=ALL ...`), you can remove `--export=ALL` from the above script.
There is retry logic implemented in this SLURM backend: it retries submission up to three times, which helps on some SLURM clusters.
```
ITER=0; until [ $ITER -ge 3 ]; do
...
ITER=$[$ITER+1]; sleep 30; done
```
There is also logic to run tasks with Singularity. If `singularity` is not given, Cromwell simply runs `/bin/bash ${script}`; otherwise this backend collects some Singularity-specific environment variables and finally runs `singularity exec --cleanenv --home ${cwd} ${singularity} /bin/bash ${script}`. `${singularity}` is a variable holding the Singularity image location, defined in the `runtime-attributes` mentioned above.
```
sbatch ... --wrap "${if defined(singularity) then '' else '/bin/bash ${script} #'} ..."
```
There are some built-in variables (`out`, `err`, `cwd`, `script`, `cpu`, `memory_mb` and `time`) in Cromwell that are important to keep Cromwell's task running. For example, if you remove `-o ${out}` from the script, Cromwell will fail to find `stdout` in the output directory, which will lead to a pipeline failure.
See more [details](https://cromwell.readthedocs.io/en/stable/Configuring/) about a backend configuration file.
Your custom `your.backend.conf` file will override Caper's built-in backend, so keep only the modified parts (the `submit` command line in this example) in your `your.backend.conf` file.
```
backend {
  default = "slurm"
  providers {
    slurm {
      config {
        submit = """ITER=0; until [ $ITER -ge 3 ]; do
          sbatch --export=ALL -J ${job_name} -D ${cwd} -o ${out} -e ${err} ${"-t " + time*60} -n 1 --ntasks-per-node=1 ${true="--cpus-per-task=" false="" defined(cpu)}${cpu} ${true="--mem=" false="" defined(memory_mb)}${memory_mb} ${"-p " + slurm_partition} ${"--account " + slurm_account} ${true="--gres gpu:" false="" defined(gpu)}${gpu} ${slurm_extra_param} --wrap "${if defined(singularity) then '' else '/bin/bash ${script} #'} if [ -z \"$SINGULARITY_BINDPATH\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; if [ -z \"$SINGULARITY_CACHEDIR\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi; singularity exec --cleanenv --home ${cwd} ${if defined(gpu) then '--nv' else ''} ${singularity} /bin/bash ${script}" && break
          ITER=$[$ITER+1]; sleep 30; done
        """
      }
    }
  }
}
```
Then run `caper run` with your `your.backend.conf`:
```
$ caper run ... --backend-file your.backend.conf
```
## Caper server heartbeat (running multiple servers)
Caper server writes a heartbeat file (specified by `--server-heartbeat-file`) every 120 seconds (controlled by `--server-heartbeat-timeout`). This file contains the IP (hostname) and port of the running `caper server`.
Example heartbeat file:
```bash
$ cat ~/.caper/default_server_heartbeat
kadru.stanford.edu:8000
```

This heartbeat file is useful when users don't want to look up the IP (hostname) and port of a running `caper server` manually, especially when they have `qsub`bed or `sbatch`ed `caper server` on their clusters. In such cases, the server's IP (hostname of the node/instance) is only determined after the cluster engine starts the submitted `caper server` job, and it is inconvenient for users to find it with `qstat` or `squeue` and add it back to Caper's configuration file `~/.caper/default.conf`.

Therefore, Caper uses this heartbeat file by default (it can be disabled with the flag `--no-server-heartbeat`). Client-side subcommands such as `caper list` and `caper metadata` find this heartbeat file and automatically parse it to get the IP/port pair of the server.
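
When the heartbeat file is not available (e.g. after disabling it as described below), clients can be pointed at a server explicitly with the `--ip`/`--port` flags (a sketch; the hostname is taken from the heartbeat example above):

```bash
$ caper list --ip kadru.stanford.edu --port 8000
```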

However, there can be a conflict if users want to run multiple `caper server`s on the same machine (or on multiple machines sharing the same Caper configuration directory `~/.caper/`, and hence the same default heartbeat file). In such cases, users can disable the heartbeat feature by adding the following line to their configuration file, e.g. `~/.caper/default.conf`:
```bash
no-server-heartbeat=True
```

Then start multiple servers with different ports and DBs (MySQL in this example). Make sure that each server uses a different DB (a different file DB, MySQL server port, etc.), since there is no point in running multiple Caper servers against the same DB. For MySQL, do not forget to spin up multiple MySQL servers on different ports.

```bash
$ caper server --port 8000 --mysql-db-port 3306 ... &
$ caper server --port 8001 --mysql-db-port 3307 ... &
$ caper server --port 8002 --mysql-db-port 3308 ... &
```

Send queries to a specific server.
```bash
$ caper list --port 8000
$ caper list --port 8001
$ caper list --port 8002
```

## Metadata database

If you are not interested in resuming failed workflows, skip this section.