Skip to content

Latest commit

 

History

History
63 lines (36 loc) · 3.2 KB

README.md

File metadata and controls

63 lines (36 loc) · 3.2 KB

Cloud Workflow Orchestration

This repo contains resources designed to enable running genomic analysis workflows on the cloud. This includes documentation and tools for setting up Google Cloud appropriately, spinning up VMs to run the workflow, monitoring runs, and retrieving results.

It is designed to be used in concert with the workflows in the analysis-wdls repository, though other WDL workflows should certainly work as well.

Quick Start

There are two supported methods of launching workflows documented in the following pages:

  1. manual submission to Cromwell,
  2. leveraging WUSTL's GMS to automate interactions.

There are also several end-to-end demonstrations of running our WDL workflows on Google Cloud:

  • Running the Immunogenomics workflow (immuno.wdl) manually and assuming that your input data is stored on the compute1/storage1 system at WASHU: immuno-compute1

  • Running the Immunogenomics workflow (immuno.wdl) manually and assuming the your input data is on a local machine that is not associated with WASHU in any way. For example, an external institution or on a personal laptop and Google Cloud account: immuno-local

More Information

Motivation

Using the cloud to run workflows has many advantages, including avoiding reliability or access issues with local clusters, enabling scalability, and easing data-sharing.

Limitations

Cromwell does a solid job of orchestrating workflows such as the ones provided in our analysis-wdls repository, and scales well to dozens of samples running dozens of steps at a time. If you're in a situation that requires even higher scalability, to thousands of samples or uses massively parallelized workflows, this backend may not be your best option, and investigating Terra or DNAnexus might be worthwhile.

Costs

Costs of running workflows can vary wildly depending on the resources that are consumed and the size of the input data. In addition, the cost of cloud resources changes frequently. As one point of reference, at the time of this writing an end-to-end immunogenomics workflow with exome, rnaseq, and neoantigen predictions costs in the neighborhood of 20 dollars when using preemptible nodes.

Shared Helper Scripts

The README for each section includes instructions that go over relevant helper scripts for that section. If you'd like to inspect the helper scripts on their own, the best starting point is scripts/README.md

Docker Image

These scripts are contained within a Docker container image, to they can be used asynchronously with bsub. This container image can be found on dockerhub at mgibio/cloudize-workflow. Using latest is always suggested but semantic versioning will be followed in case prior behavior is needed.

After modifying any scripts, build and tag the docker image

sh infra.sh build-and-tag VERSION

This command will create docker images with tags VERSION and latest