Merge pull request #52 from ENCODE-DCC/dev
v0.7.0
leepc12 authored Mar 10, 2020
2 parents 8907e47 + 7d246aa commit f4a56d0
Showing 10 changed files with 650 additions and 308 deletions.
18 changes: 18 additions & 0 deletions DETAILS.md
@@ -41,6 +41,19 @@ list | WF_ID or STR_LABEL |List submitted workflows on a Cromwell server
metadata | WF_ID or STR_LABEL |Retrieve metadata JSONs for workflows
debug, troubleshoot | WF_ID, STR_LABEL or<br>METADATA_JSON_FILE |Analyze reason for errors

* `init`: To initialize Caper on a given platform (see the example after the table below). This command also downloads the Cromwell/Womtool JARs so that Caper can work completely offline with local data files.

**Platform**|**Description**
:--------|:-----
sherlock | Stanford Sherlock cluster (SLURM)
scg | Stanford SCG cluster (SLURM)
gcp | Google Cloud Platform
aws | Amazon Web Services
local | General local computer
sge | HPC with Sun GridEngine cluster engine
pbs | HPC with PBS cluster engine
slurm | HPC with SLURM cluster engine
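
For example, to initialize Caper for a generic SLURM cluster (a minimal sketch; the exact output and downloaded JAR versions may differ):

```bash
$ caper init slurm            # writes ~/.caper/default.conf and downloads Cromwell/Womtool JARs under ~/.caper
$ cat ~/.caper/default.conf   # then fill in the required parameters (e.g. slurm-partition) for your cluster
```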

* `run`: To run a single workflow. A string label `-s` is optional and useful for other subcommands to identify a workflow (see the example below).
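
A minimal sketch of this, with placeholder WDL/JSON file names:

```bash
$ caper run my_pipeline.wdl -i input.json -s my_workflow_label
```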

@@ -217,11 +230,16 @@ We highly recommend using a default configuration file described in the section
:-----|:-----|:-----|:-----
ip|--ip|localhost|Cromwell server IP address or hostname
port|--port|8000|Cromwell server port
no-server-heartbeat|--no-server-heartbeat||Flag to disable the server heartbeat file.
server-heartbeat-file|--server-heartbeat-file|`~/.caper/default_server_heartbeat`|Heartbeat file for Caper clients to get the IP and port of a server.
server-heartbeat-timeout|--server-heartbeat-timeout|120000|Timeout for a heartbeat file in milliseconds.

cromwell|--cromwell|[cromwell-40.jar](https://github.com/broadinstitute/cromwell/releases/download/40/cromwell-40.jar)|Path or URL for Cromwell JAR file
max-concurrent-tasks|--max-concurrent-tasks|1000|Maximum number of concurrent tasks
max-concurrent-workflows|--max-concurrent-workflows|40|Maximum number of concurrent workflows
max-retries|--max-retries|1|Maximum number of retries for failing tasks
disable-call-caching|--disable-call-caching| |Disable Cromwell's call-caching (re-using outputs)
soft-glob-output|--soft-glob-output||Use soft-linking when globbing outputs on a filesystem that does not allow hard-linking, e.g. beeGFS.
backend-file|--backend-file| |Custom Cromwell backend conf file. This will override Caper's built-in backends.
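
As a hedged sketch, a few of these keys could be set in `~/.caper/default.conf` like this, assuming the same `key=value` format shown for `no-server-heartbeat` later in this document (values are illustrative only):

```bash
# excerpt of ~/.caper/default.conf (illustrative values)
max-concurrent-workflows=40
max-retries=2
disable-call-caching=True
```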

* Troubleshoot parameters for `caper troubleshoot` subcommand.
146 changes: 144 additions & 2 deletions README.md
@@ -1,3 +1,5 @@
**IMPORTANT**: A new flag `--soft-glob-output` has been added to use soft-linking for globbing outputs. Use it with `caper server/run` (not with `caper submit`) on a filesystem that does not allow hard-linking, e.g. beeGFS.

**IMPORTANT**: Caper now defaults to **NOT** using a file-based metadata DB, which means no call-caching (re-using outputs from previous workflows) by default.

**IMPORTANT**: Even if you still want to use a file-based DB (`--db file` and `--file-db [DB_PATH]`), a metadata DB generated with Caper<0.6 (Cromwell-42) is not compatible with one generated with Caper>=0.6 (Cromwell-47). Refer to [this doc](https://github.com/broadinstitute/cromwell/releases/tag/43) for the migration.
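
As a quick, hedged illustration of the flags mentioned in these notes (WDL/JSON/DB file names are placeholders):

```bash
# use soft-linked globbing on a filesystem without hard-link support (e.g. beeGFS)
$ caper run my_pipeline.wdl -i input.json --soft-glob-output

# opt back in to a file-based metadata DB to enable call-caching
$ caper run my_pipeline.wdl -i input.json --db file --file-db ~/.caper/my_file_db
```
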
@@ -45,7 +47,8 @@ Caper is based on Unix and cloud platform CLIs (`curl`, `gsutil` and `aws`) and
export PATH=$PATH:~/.local/bin
```

5) Choose a platform from the following table and initialize Caper. This will create a default Caper configuration file `~/.caper/default.conf`, which has only the required parameters for each platform. There are special platforms for Stanford Sherlock/SCG users. This will also install the Cromwell/Womtool JARs under `~/.caper`. Downloading those files can take up to 10 minutes. Once they are installed, Caper can work completely offline with local data files.

```bash
$ caper init [PLATFORM]
```
@@ -99,11 +102,12 @@ $ sbatch ... --wrap "caper run ..."

## Running pipelines on Stanford Sherlock

> **IMPORTANT**: DO NOT INSTALL CAPER, CONDA AND PIPELINE'S WDL ON `$SCRATCH` OR `$OAK` STORAGES. You will see `Segmentation Fault` errors. Install these executables (Caper, Conda, WDL, ...) on `$HOME` OR `$PI_HOME`. You can still use `$OAK` for input data (e.g. FASTQs defined in your input JSON file) but not for outputs, which means that you should not run Caper on `$OAK`. `$SCRATCH` and `$PI_SCRATCH` are okay for both input and output data, so run Caper on them. Running Croo to organize outputs into `$OAK` is okay.

Submit a Caper leader job (`caper run`) to SLURM. For a partition `-p [SLURM_PARTITION]`, make sure that you use the same SLURM partition (`slurm-partition` in `~/.caper/default.conf`) as defined in Caper's configuration file. `-J [JOB_NAME]` identifies Caper's leader job for each workflow. Make a separate directory for each workflow; outputs will be written to that directory.

```bash
$ # DO NOT RUN THIS ON OAK STORAGE!
$ # conda activate here if required
$ cd [OUTPUT_DIR] # make a separate directory for each workflow.
$ sbatch -p [SLURM_PARTITION] -J [JOB_NAME] --export=ALL --mem 3G -t 4-0 --wrap "caper run [WDL] -i [INPUT_JSON]"
@@ -200,6 +204,144 @@ $ cd [OUTPUT_DIR] # make a separate directory for each workflow
$ caper run [WDL] -i [INPUT_JSON]
```

## Running pipelines on a custom backend

If Caper's built-in backends don't work as expected on your clusters (e.g. due to different resource settings), you can override the built-in backends with your own configuration file (e.g. `your.backend.conf`). Caper generates a `backend.conf` for the built-in backends in a temporary directory.

Find this `backend.conf` first by dry-running `caper run [WDL] --dry-run ...`. For example, for a `slurm` backend:
```
$ caper run toy.wdl --dry-run --backend slurm
[Caper] Validating WDL/input JSON with womtool...
Picked up _JAVA_OPTIONS: -Xms256M -Xmx4024M -XX:ParallelGCThreads=1
Success!
[Caper] cmd: ['java', '-Xmx3G', '-XX:ParallelGCThreads=1', '-DLOG_LEVEL=INFO', '-DLOG_MODE=standard', '-jar', '-Dconfig.file=/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/backend.conf', '/users/leepc12/.caper/cromwell_jar/cromwell-47.jar', 'run', '/mnt/data2/scratch/leepc12/test_caper_refac/toy.wdl', '-i', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/inputs.json', '-o', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/workflow_opts.json', '-l', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/labels.json', '-m', '/mnt/data/scratch/leepc12/caper_out/.caper_tmp/toy/20200309_151256_331283/metadata.json']
```

Look for the file passed to the Java parameter `-Dconfig.file` and find the backend of interest (`slurm` in this example) in that file.
```
include required(classpath("application"))
backend {
  default = "slurm"
  providers {
    ...
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        default-runtime-attributes {
          time = 24
        }
        concurrent-job-limit = 1000
        script-epilogue = "sleep 10 && sync"
        root = "/mnt/data/scratch/leepc12/caper_out"
        runtime-attributes = """
          String? docker
          String? docker_user
          Int cpu = 1
          Int? gpu
          Int? time
          Int? memory_mb
          String? slurm_partition
          String? slurm_account
          String? slurm_extra_param
          String? singularity
          String? singularity_bindpath
          String? singularity_cachedir
        """
        submit = """ITER=0; until [ $ITER -ge 3 ]; do
          sbatch --export=ALL -J ${job_name} -D ${cwd} -o ${out} -e ${err} ${"-t " + time*60} -n 1 --ntasks-per-node=1 ${true="--cpus-per-task=" false="" defined(cpu)}${cpu} ${true="--mem=" false="" defined(memory_mb)}${memory_mb} ${"-p " + slurm_partition} ${"--account " + slurm_account} ${true="--gres gpu:" false="" defined(gpu)}${gpu} ${slurm_extra_param} --wrap "${if defined(singularity) then '' else '/bin/bash ${script} #'} if [ -z \"$SINGULARITY_BINDPATH\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; if [ -z \"$SINGULARITY_CACHEDIR\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi; singularity exec --cleanenv --home ${cwd} ${if defined(gpu) then '--nv' else ''} ${singularity} /bin/bash ${script}" && break
          ITER=$[$ITER+1]; sleep 30; done
        """
        kill = "scancel ${job_id}"
        exit-code-timeout-seconds = 360
        check-alive = "for ITER in 1 2 3; do CHK_ALIVE=$(squeue --noheader -j ${job_id} --format=%i | grep ${job_id}); if [ -z \"$CHK_ALIVE\" ]; then if [ \"$ITER\" == 3 ]; then /bin/bash -c 'exit 1'; else sleep 30; fi; else echo $CHK_ALIVE; break; fi; done"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
    ...
  }
  ...
```
Some parts of the script (wrapped in `${}`) are written in WDL. For example, `${true="--mem=" false="" defined(memory_mb)}` prints `--mem=` only if `memory_mb` is defined. In such WDL expressions you can use any variables defined in `runtime-attributes`.
For example, if your cluster does not allow importing all environment variables (`sbatch --export=ALL ...`), you can remove `--export=ALL` from the above script.
There is retry logic implemented in this SLURM backend: it retries submission up to three times, which helps on some SLURM clusters.
```
ITER=0; until [ $ITER -ge 3 ]; do
...
ITER=$[$ITER+1]; sleep 30; done
```
There is also logic to run tasks with Singularity. If `singularity` is not given, Cromwell simply runs `/bin/bash ${script}`; otherwise this backend collects some Singularity-specific environment variables and finally runs `singularity exec --cleanenv --home ${cwd} ${singularity} /bin/bash ${script}`. `${singularity}` is a variable holding the Singularity image location, defined in the `runtime-attributes` mentioned above.
```
sbatch ... --wrap "${if defined(singularity) then '' else '/bin/bash ${script} #'} ..."
```
There are some built-in variables (`out`, `err`, `cwd`, `script`, `cpu`, `memory_mb` and `time`) in Cromwell that are important to keep Cromwell's task running. For example, if you remove `-o ${out}` from the script, Cromwell will fail to find `stdout` in the output directory, which will lead to a pipeline failure.
See more [details](https://cromwell.readthedocs.io/en/stable/Configuring/) about a backend configuration file.
Your custom `your.backend.conf` file will override Caper's built-in backend, so keep only the modified parts (the `submit` command line in this example) in your `your.backend.conf` file.
```
backend {
  default = "slurm"
  providers {
    slurm {
      config {
        submit = """ITER=0; until [ $ITER -ge 3 ]; do
          sbatch --export=ALL -J ${job_name} -D ${cwd} -o ${out} -e ${err} ${"-t " + time*60} -n 1 --ntasks-per-node=1 ${true="--cpus-per-task=" false="" defined(cpu)}${cpu} ${true="--mem=" false="" defined(memory_mb)}${memory_mb} ${"-p " + slurm_partition} ${"--account " + slurm_account} ${true="--gres gpu:" false="" defined(gpu)}${gpu} ${slurm_extra_param} --wrap "${if defined(singularity) then '' else '/bin/bash ${script} #'} if [ -z \"$SINGULARITY_BINDPATH\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; if [ -z \"$SINGULARITY_CACHEDIR\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi; singularity exec --cleanenv --home ${cwd} ${if defined(gpu) then '--nv' else ''} ${singularity} /bin/bash ${script}" && break
          ITER=$[$ITER+1]; sleep 30; done
        """
      }
    }
  }
}
```
Then run `caper run` with your `your.backend.conf`:
```
$ caper run ... --backend-file your.backend.conf
```
## Caper server heartbeat (running multiple servers)
Caper server writes a heartbeat file (specified by `--server-heartbeat-file`) every 120 seconds (controlled by `--server-heartbeat-timeout`). This file contains the IP (hostname) and port of the running `caper server`.
Example heartbeat file:
```bash
$ cat ~/.caper/default_server_heartbeat
kadru.stanford.edu:8000
```

This heartbeat file is useful when users don't want to look up the IP (hostname) and port of a running `caper server` manually, especially when they have `qsub`bed or `sbatch`ed `caper server` on their clusters. In such cases, the server's IP (hostname of the node/instance) is only determined after the cluster engine starts the submitted `caper server` job, and it is inconvenient for users to find it with `qstat` or `squeue` and add it back to Caper's configuration file `~/.caper/default.conf`.

Therefore, Caper uses this heartbeat file by default (it can be disabled with the flag `--no-server-heartbeat`). Client-side subcommands such as `caper list` and `caper metadata` find this heartbeat file and automatically parse it to get the IP/port pair of the server.
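
When the heartbeat file is not available (e.g. after disabling it as described below), clients can be pointed at a server explicitly with the `--ip`/`--port` flags (a sketch; the hostname is taken from the heartbeat example above):

```bash
$ caper list --ip kadru.stanford.edu --port 8000
```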

However, there can be a conflict if users want to run multiple `caper server`s on the same machine (or on multiple machines sharing the same Caper configuration directory `~/.caper/`, and hence the same default heartbeat file). In such cases, users can disable the heartbeat feature by adding the following line to their configuration file, e.g. `~/.caper/default.conf`:
```bash
no-server-heartbeat=True
```

Then start multiple servers with different ports and DBs (MySQL in this example). Make sure that each server uses a different DB (a different file DB, MySQL server port, etc.), since there is no point in running multiple Caper servers against the same DB. For MySQL, do not forget to spin up multiple MySQL servers on different ports.

```bash
$ caper server --port 8000 --mysql-db-port 3306 ... &
$ caper server --port 8001 --mysql-db-port 3307 ... &
$ caper server --port 8002 --mysql-db-port 3308 ... &
```

Send queries to a specific server.
```bash
$ caper list --port 8000
$ caper list --port 8001
$ caper list --port 8002
```

## Metadata database

If you are not interested in resuming failed workflows, skip this section.