Skip to content

Workflow language and executor designed to streamline design, execution of bioinformatic pipelines in a cloud environment.

Notifications You must be signed in to change notification settings

alexwaldrop/GAP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CloudConductor alt text

What is CloudConductor?

CloudConductor is a cloud-based workflow engine for defining and executing bioinformatics pipelines in a cloud environment. Currently, the framework has been tested extensively on the Google Cloud Platform, but will eventually support other platforms including AWS, Azure, etc.

Feature Highlights

  • User-friendly
    • Define complex workflows by linking together user-defined modules that can be re-used across pipelines
    • Config_obj for clean, readable workflows (see below example)
    • +50 pre-installed modules for existing bioinformatics tools
  • Portable
    • Docker integration ensures reproducible runtime environment for modules
    • Platform independent (currently supports GCS; AWS, Azure to come)
  • Modular/Extensible
    • User-defined Plug-N-Play modules
      • Re-used across pipelines, re-combined in any combination
      • Modules easily added, customized as new tools needed, old tools changed
      • Eliminates copy/paste re-use of code across workflows
  • Pre-Launch Type-Checking
    • Strongly-typed module declarations allow catching pipeline errors before they occur
    • Pre-launch checks make sure all external files exist before runtime
  • Scalable
    • Removes resource limitations imposed by cluster-based HPCCs
  • Elastic
    • VM usage automatically scales to match input file sizes, computational needs
  • Scatter-Gather Parallelism
    • In-built logic for dividing large tasks into small chunks and re-combining
  • Economical
    • Preemptible/Spot instances drastically cut workflow costs

Setting up your system

CloudConductor is currently designed only for Linux systems. You will need to install and configure the following tools to run your pipelines on Google Cloud:

  1. Python v2.7.*

    You can check your Python version by running the following command in your terminal:

    $ python -V
    Python 2.7.10

    To install the correct version of Python, visit the official Python website.

  2. Python packages: configobj, jsonschema, requests

    You will need pip to install the above packages. After installing pip, run the following commands in your terminal:

    # Upgrade pip
    sudo pip install -U pip
    
    # Install Python modules
    sudo pip install -U configobj jsonschema requests
  3. Google Cloud Platform SDK

    Follow the instructions on the official Google Cloud website.

Usage

For more information about CloudConductor and how to use it check the documentation.

    usage: CloudConductor [-h] --input SAMPLE_SET_CONFIG --name PIPELINE_NAME
                          --pipeline_config GRAPH_CONFIG --res_kit_config
                          RES_KIT_CONFIG --plat_config PLATFORM_CONFIG --plat_name
                          PLATFORM_MODULE [-v] -o FINAL_OUTPUT_DIR
    
    optional arguments:
      -h, --help            show this help message and exit
      --input SAMPLE_SET_CONFIG
                            Path to config file containing input files and information for one or more samples.
      --name PIPELINE_NAME  Descriptive pipeline name. Will be appended to final output dir. Should be unique across runs.
      --pipeline_config GRAPH_CONFIG
                            Path to config file defining pipeline graph and tool-specific input.
      --res_kit_config RES_KIT_CONFIG
                            Path to config file defining the resources used in the pipeline.
      --plat_config PLATFORM_CONFIG
                            Path to config file defining platform where pipeline will execute.
      --plat_name PLATFORM_MODULE
                            Platform to be used. Possible values are:
                               Google (as module 'GooglePlatform')
      -v                    Increase verbosity of the program.Multiple -v's increase the verbosity level:
                               0 = Errors
                               1 = Errors + Warnings
                               2 = Errors + Warnings + Info
                               3 = Errors + Warnings + Info + Debug
      -o FINAL_OUTPUT_DIR, --output_dir FINAL_OUTPUT_DIR
                            Absolute path to the final output directory.

A simple pipeline example

Below, we use CloudConductor's in-built scatter-gather logic to align a set of reads to a reference genome.

# Trim input FASTQ reads with Trimmomatic
[trim_reads]
	module      = Trimmomatic

# Scatter trimmed FASTQ into smaller chunks for fast alignment
[split_fastq]
	module      = FastqSplitter
	input_from  = trim_reads

# Align chunks in parallel to reference genome using BWA
[align_reads]
	module      = BWA
	input_from  = split_fastq, get_read_group

# Index BAM files output by BWA
[index_bam]
	module      = Samtools
	submodule   = Index
	input_from  = align_reads

#  Merge split BAM files into single bam
[merge_bams]
	module      = MergeBams
	input_from  = align_reads, index_bam

What Next?

Get started with our full documentation to explore the ways CloudConductor can streamline the development and execution of complex, multi-sample workflows typical in bioinformatics.

Project Status

CloudConductor is actively under development. To get involved or request features, please any of the authors listed below.

Authors

About

Workflow language and executor designed to streamline design, execution of bioinformatic pipelines in a cloud environment.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published