
Running mikropml with snakemake

Snakemake is a workflow manager that enables massively parallel and reproducible analyses. It is a good fit whenever a workflow can be broken down into discrete steps, each with its own input and output files.

mikropml is an R package for supervised machine learning pipelines. We provide this example workflow as a template to get started running mikropml with snakemake; we hope you will then customize the code to meet the needs of your particular ML task.

For more details on these tools, see the Snakemake tutorial and read the mikropml docs.
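
If you have not used mikropml before, the sketch below shows roughly what each model-training job in this workflow ultimately does. It is not part of the workflow itself; it uses mikropml's built-in otu_mini_bin example dataset and the glmnet method purely for illustration.

    library(mikropml)

    # Illustration only: preprocess once, then train a single model.
    # otu_mini_bin ships with mikropml; its outcome column is "dx".
    preproc <- preprocess_data(dataset = otu_mini_bin, outcome_colname = "dx")
    result <- run_ml(preproc$dat_transformed,
                     method = "glmnet",
                     outcome_colname = "dx",
                     kfold = 5,
                     seed = 2019)
    result$performance  # cross-validation and test performance for this seed/method pair

The workflow repeats calls like this across many seeds and ML methods and then combines the resulting performance results.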

The Workflow

The Snakefile contains rules that define the output files we want and how to make them. Snakemake automatically builds a directed acyclic graph (DAG) of jobs to figure out the dependencies between rules and the order in which to run them. This workflow preprocesses the example dataset, calls mikropml::run_ml() for each seed and ML method set in the config file, combines the results files, plots performance results (cross-validation and test AUROCs, hyperparameter AUROCs from cross-validation, and benchmark performance), and renders a simple R Markdown report as a GitHub-flavored markdown file.

(Figure: rule graph of the workflow)

The DAG shows how calls to run_ml can run in parallel if snakemake is allowed to run more than one job at a time. If we use 100 seeds and 4 ML methods, snakemake would call run_ml 400 times. Here's a small example DAG if we were to use only 2 seeds and 2 ML methods:

(Figure: example DAG with 2 seeds and 2 ML methods)
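
To make the fan-out concrete, here is a minimal sketch of what the R script behind a single run_ml job could look like; the actual scripts in this repo may be organized differently. The sketch assumes the rule exposes the ML method and seed as wildcards, passes kfold and outcome_colname as params, and reads/writes RDS files (when a rule uses script:, Snakemake makes an S4 object named snakemake available inside the R script).

    # Hypothetical sketch of one run_ml job driven by Snakemake; the scripts
    # in this repo may differ.
    library(mikropml)

    # Use the cores Snakemake allots to this rule for mikropml's internal parallelism.
    doFuture::registerDoFuture()
    future::plan(future::multisession, workers = snakemake@threads)

    dat <- readRDS(snakemake@input[["rds"]])  # preprocessed data from an upstream rule

    ml_result <- run_ml(dat,
                        method          = snakemake@wildcards[["method"]],
                        outcome_colname = snakemake@params[["outcome_colname"]],
                        kfold           = snakemake@params[["kfold"]],
                        seed            = as.integer(snakemake@wildcards[["seed"]]))

    saveRDS(ml_result, snakemake@output[["rds"]])

Because the method and seed are wildcards, the Snakefile can expand() over every seed/method combination in the config, which is what produces the hundreds of independent jobs described above.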

Quick Start

  1. Clone or download this repo and go to the directory.

    git clone https://github.com/SchlossLab/Barron_IBD-CDI_2022
    cd Barron_IBD-CDI_2022

    Alternatively, you can fork it before cloning.

  2. Install the dependencies.

    1. If you don't have conda yet, we recommend installing miniconda.

    2. Next, install mamba, a fast drop-in replacement for conda:

      conda install mamba -n base -c conda-forge
    3. Finally, create the environment and activate it:

      mamba env create -f config/environment.yml
      conda activate ibd-cdi
  3. Edit the configuration file config/config.yml.

    • outcome_colname: name of the outcome column in the dataset.
    • ml_methods: list of machine learning methods to use; each must be supported by mikropml.
    • kfold: number of folds (k) for k-fold cross-validation during model training.
    • ncores: the number of cores to use for preprocessing and for each mikropml::run_ml() call. Do not exceed the number of cores you have available.
    • nseeds: the number of different random seeds to use for training models with mikropml::run_ml().

    You can leave these options as-is if you'd like to first make sure the workflow runs without error on your machine before using your own dataset and custom parameters.

    The default config file is suitable for initial testing, but we recommend using more cores if available and more seeds for model training. A more robust configuration is provided in config/config_robust.yml.
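
    If you want to sanity-check your edits before running anything, one option (not part of the workflow) is to read the config back into R, assuming the yaml R package is installed:

      # Optional sanity check of config edits; assumes the yaml R package is installed.
      config <- yaml::read_yaml("config/config.yml")
      str(config[c("outcome_colname", "ml_methods", "kfold", "ncores", "nseeds")])

      # Don't request more cores than this machine actually has.
      stopifnot(config$ncores <= parallel::detectCores())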

  4. Do a dry run to make sure the snakemake workflow is valid.

    snakemake -n
  5. Run the workflow.

    Run it locally with:

    snakemake

    Or specify a different config file with:

    snakemake --configfile config/config_robust.yml

    To run the workflow on an HPC with Slurm:

    1. Edit your email (YOUR_EMAIL_HERE), Slurm account (YOUR_ACCOUNT_HERE), and other Slurm parameters as needed in code/submit_slurm.sh and config/cluster.json.

    2. Submit the snakemake workflow with:

      sbatch code/submit_slurm.sh

      The main job will then submit all other snakemake jobs, allowing independent steps of the workflow to run on different nodes in parallel. Slurm output files will be written to log/hpc/.

  6. View the results in report.md.

    This example report was created by running the workflow on the Great Lakes HPC at the University of Michigan with config/config_robust.yml.
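
    The report is rendered from an R Markdown source as part of the workflow. If you ever want to re-render it by hand, a call along these lines works, assuming a hypothetical source file named report.Rmd (check the repo for the actual path):

      # Hypothetical: re-render the report outside of Snakemake.
      # The real .Rmd path in this repo may differ.
      rmarkdown::render("report.Rmd", output_format = "github_document")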

Out of memory or walltime

If any of your jobs fail because they ran out of memory, you can increase the memory for the given rule in the config/cluster.json file. For example, if the combine_hp_performance rule fails, you can increase its memory from 16GB to, say, 24GB. You can also change other Slurm parameters from the defaults in this file (e.g. walltime, number of cores, etc.).

More resources