06. Running pipesnake

Ian Brennan edited this page Sep 3, 2024 · 1 revision

Whether you want to start from raw or assembled data, you can run pipesnake end-to-end easily and flexibly.


Getting Up and Running

pipesnake is a package that we're frequently updating and improving. To make sure you're working with the most recent version, pull the latest revision with Nextflow:

nextflow pull ausarg/pipesnake

Let's provide some examples of how you might customize pipesnake to suit your needs. Because of how the pipeline is designed, it's extremely flexible with many tuneable parameters. We'll use a local version of the toy dataset to show example commands and to save space we won't include absolute paths (e.g. ToyData_SampleInfo.csv instead of pipesnake-main/example/data/ToyData_SampleInfo.csv). If you need to make a sample info sheet for your own data, make sure to format it similarly (see 2.4 Quick Start: Prepare your input files).

Below we'll build up the command from basic to complex.

1. Basic

Pretty much the simplest analysis, with all parameters relying on defaults.

nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir Basic --blat_db ToyData_SqCL_25Targets.fasta -profile docker

2. Filter the Reads

Now we'll filter the raw reads against targets of close relatives to remove off-target reads before we begin assembling. This should give us a significant speed-up in the assembly step.

nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir Filtered --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta

3. SPAdes Assembly

We'll still filter our reads, but now imagine we'd prefer to use SPAdes instead of Trinity for assembling our data.

nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir wSPAdes --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta --assembly SPAdes

4. RAxML Gene-trees

Got something against IQTREE? Let's try RAxML instead. We'll continue to filter the reads and assemble with SPAdes.

nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir wSPAdesRAxML --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta --assembly SPAdes --tree_method raxml

5. Remove RAxML Outputs

Maybe we'd like to limit the output files. Let's tell pipesnake not to return all the RAxML output files. We'll still filter, assemble with SPAdes, and estimate gene trees with RAxML.

nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir wSPAdesRAxMLout --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta --assembly SPAdes --tree_method raxml --raxml_keep_output false

Starting from the Middle

Let's say you are looking to combine samples from a new project with some from an old one. Assuming you followed similar methods, you probably don't want to reassemble all the raw data.

Already have PRG* files for all the samples of interest? We can use the --stage parameter to run pipesnake on a directory of PRG files. To demonstrate this functionality, we can practice with the example data we already have in the pipesnake directory:

nextflow run ausarg/pipesnake -profile test_PRG,docker --stage from-prg --outdir <outdir>

This run will pick up at the PHYLOGENY_MAKE_ALIGNMENTS step and carry on from alignment all the way through to a species tree, skipping the raw data assembly and filtering steps.

To do this with our own data we first need a new sample info file, one that looks like this:

sample_id   prg_file
Sample1     /[FULL_PATH_TO]/Sample1.fasta
Sample2     /[FULL_PATH_TO]/Sample2.fasta
Sample3     /[FULL_PATH_TO]/Sample3.fasta
...         ...
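If your PRG files all live in one directory and are named <sample_id>.fasta, a short shell loop can draft this sheet for you. This is a convenience sketch, not part of pipesnake; it assumes a comma-separated file (like the other --input sheets) with the column names shown above, and a hypothetical PRG_DIR location.

```shell
# Build a PRG sample sheet from a directory of per-sample FASTA files.
# Assumes each file is named <sample_id>.fasta; adjust to your naming scheme.
prg_dir="${PRG_DIR:-/path/to/prg_dir}"   # set PRG_DIR to your PRG directory

printf 'sample_id,prg_file\n' > prg_sample_info.csv
for f in "$prg_dir"/*.fasta; do
    [ -e "$f" ] || continue              # skip if the glob matched nothing
    printf '%s,%s\n' "$(basename "$f" .fasta)" "$f" >> prg_sample_info.csv
done
```

Double-check the resulting file uses full paths before handing it to --input.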

Once we have the new sample info file (which we'll pass into pipesnake with --input), we will specify where the pipeline should start with --stage from-prg. By default (without having to specify), pipesnake runs --stage from-start, executing the full pipeline. To pick up from the PRG step instead, we can run pipesnake like this:

nextflow run ausarg/pipesnake --input <new_sample_info.csv> --outdir <outdir> -profile docker --stage from-prg --blat_db <target_file.fasta>

This should run comparatively quickly given that you do not need to assemble the raw data.


Ending Early

What if you want to take advantage of pipesnake's design, but aren't ready to commit to all the alignments and gene trees? We can exit the pipeline after generating PRG files by setting --stage end-prg.

nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir EndPRG --blat_db ToyData_SqCL_25Targets.fasta -profile docker --stage end-prg

A Note on PRGs

Pseudo-Reference Genome (PRG) files, as we are calling them, are the single-sample .fasta files. They contain all the named loci (best chosen contigs) for a given sample, and probably look something like this:

>AHE-L1
CAGGAGGAGAAATGTCTCAGCTC...
>AHE-L10
GTTTATAACAAATAAACAGAATA...
>AHE-L100
TTAATGCAACTCTTCAGTTGGCT...
>AHE-L101
TGTGGTGGGCTGAGTGGCTTGAA...


Configuring Resources

pipesnake allocates resources based on a config file. When no config file is specified, it defaults to using conf/base.config. This file includes parameter values for all the processes in the pipeline, but they may not be well suited to your machine. There are two easy ways to change the resource allotment:

1. Adapting the base.config file

This is probably the easier of the two options, but beware: your changes will be overwritten if you pull down a new version of the pipeline. Open the conf/base.config file and make changes as you see fit.

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    nf-core/ausargph Nextflow base config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    A 'blank slate' config file, appropriate for general use on most high performance
    compute environments. Assumes that all software is installed and available on
    the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

params {
    
    //Default values for bbmap_dedupe parameters
    bbmap_dedupe_args = "ac=f"
    bbmap_dedupe_cpus = 1
    bbmap_dedupe_memory = 24.GB
    bbmap_dedupe_walltime = 4.h
...

For example, the base.config file requests 24 GB of memory to deduplicate reads. On my local machine this is too much, so I might want to change it to 12 GB.
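With that change, the relevant line in conf/base.config (from the excerpt above) would read:

```groovy
    //Default values for bbmap_dedupe parameters
    bbmap_dedupe_memory = 12.GB
```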

2. Build a new .config

The second option is more bespoke. Start by copying base.config to your working directory and renaming it to something unique; I'll use local_mac.config. Change any and all parameters you'd like, including process defaults, then save the file. To use it, pass local_mac.config to the pipeline with the -c option:

nextflow run ausarg/pipesnake --input ... ... ... -c local_mac.config
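Since Nextflow merges a -c config over the pipeline defaults, your custom file only needs the values you want to change; everything else falls back to base.config. A pared-down local_mac.config might be just (parameter names taken from the base.config excerpt above; the values here are illustrative):

```groovy
// local_mac.config -- override only what differs from base.config
params {
    bbmap_dedupe_memory = 12.GB
    bbmap_dedupe_cpus   = 2
}
```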

That's it.