# 06. Running pipesnake
Whether you want to start from raw or assembled data, you can run `pipesnake` end-to-end easily and flexibly.

`pipesnake` is a package that we're frequently updating and improving. To make sure you're working with the most recent version, pull down the latest revision with Nextflow:
```bash
nextflow pull ausarg/pipesnake
```
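If you want to check which revision you have locally before or after pulling, Nextflow's built-in `info` command will report it:

```bash
nextflow info ausarg/pipesnake
```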
Let's go through some examples of how you might customize `pipesnake` to suit your needs. Because of how the pipeline is designed, it's extremely flexible, with many tunable parameters. We'll use a local version of the toy dataset to show example commands, and to save space we won't include absolute paths (e.g. `ToyData_SampleInfo.csv` instead of `pipesnake-main/example/data/ToyData_SampleInfo.csv`). If you need to make a sample info sheet for your own data, make sure to format it similarly (see 2.4 Quick Start: Prepare your input files).
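As a rough sketch, a raw-read sample sheet is a plain CSV with one row per sample. The column names below are illustrative only (the exact required fields are listed in 2.4 Quick Start: Prepare your input files), and the paths are placeholders:

```csv
sample_id,read1,read2
Sample1,/[FULL_PATH_TO]/Sample1_R1.fastq.gz,/[FULL_PATH_TO]/Sample1_R2.fastq.gz
Sample2,/[FULL_PATH_TO]/Sample2_R1.fastq.gz,/[FULL_PATH_TO]/Sample2_R2.fastq.gz
```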
Below we'll build up the command from basic to complex. First, pretty much the simplest analysis, with all parameters relying on defaults:
```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir Basic --blat_db ToyData_SqCL_25Targets.fasta -profile docker
```
Now we'll filter the raw reads against targets of close relatives to remove off-target reads before we begin assembling. This should give us a significant speed-up in the assembly step.
```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir Filtered --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta
```
We'll still filter our reads, but now imagine we'd prefer to use SPAdes instead of Trinity for assembling our data.
```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir wSPAdes --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta --assembly SPAdes
```
Got something against IQTREE? Let's try RAxML instead. We'll continue to filter the reads and assemble with SPAdes.
```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir wSPAdesRAxML --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta --assembly SPAdes --tree_method raxml
```
Maybe we'd like to limit the output files. Let's tell `pipesnake` not to return all the RAxML output files. We'll still filter, assemble with SPAdes, and estimate gene trees with RAxML.
```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir wSPAdesRAxMLout --blat_db ToyData_SqCL_25Targets.fasta -profile docker --filter ToyData_SqCL_Filter.fasta --assembly SPAdes --tree_method raxml --raxml_keep_output false
```
Let's say you are looking to combine samples from a new project with some from an old one. Assuming you followed similar methods, you probably don't want to reassemble all the raw data.
Already have PRG* files for all the samples of interest? We can use the `--stage` option to run `pipesnake` on a directory of PRG files. To demonstrate this functionality, we can practice with the example data we already have in the `pipesnake` directory:
```bash
nextflow run ausarg/pipesnake -profile test_PRG,docker --stage from-prg --outdir <outdir>
```
This run will pick up at the `PHYLOGENY_MAKE_ALIGNMENTS` step and proceed from alignment all the way to a species tree, skipping the raw data assembly and filtering steps.
To do this with our own data, we first need a new sample info file, one that looks like this:
| sample_id | prg_file |
|---|---|
| Sample1 | /[FULL_PATH_TO]/Sample1.fasta |
| Sample2 | /[FULL_PATH_TO]/Sample2.fasta |
| Sample3 | /[FULL_PATH_TO]/Sample3.fasta |
| ... | ... |
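Since the sample info sheet is passed in as a CSV, the table above should look something like this as a file (note the full, absolute paths):

```csv
sample_id,prg_file
Sample1,/[FULL_PATH_TO]/Sample1.fasta
Sample2,/[FULL_PATH_TO]/Sample2.fasta
Sample3,/[FULL_PATH_TO]/Sample3.fasta
```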
Once we have the new sample info file (which we'll pass into `pipesnake` with `--input`), we specify where the pipeline should start with `--stage from-prg`. By default (without having to specify), `pipesnake` runs with `--stage from-start` and executes the full pipeline. To pick up from the PRG step instead, we can run `pipesnake` like this:
```bash
nextflow run ausarg/pipesnake --input <new_sample_info.csv> --outdir <outdir> -profile docker --stage from-prg --blat_db <target_file.fasta>
```
This should run comparatively quickly given that you do not need to assemble the raw data.
What if you want to take advantage of `pipesnake`'s design, but aren't ready to commit to all the alignments and gene trees? We can exit the pipeline after generating PRG files by indicating this in the pipeline parameters with `--stage end-prg`.
```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir EndPRG --blat_db ToyData_SqCL_25Targets.fasta -profile docker --stage end-prg
```
Pseudo-Reference Genome (PRG) files, as we are calling them, are the single-sample `.fasta` files. They contain all the named loci (best-chosen contigs) for a given sample, and probably look something like this:
```
>AHE-L1
CAGGAGGAGAAATGTCTCAGCTC...
>AHE-L10
GTTTATAACAAATAAACAGAATA...
>AHE-L100
TTAATGCAACTCTTCAGTTGGCT...
>AHE-L101
TGTGGTGGGCTGAGTGGCTTGAA...
```
`pipesnake` allocates resources based on a config file. When no config file is specified, it defaults to using `conf/base.config`. This file includes parameter values for all the processes in the pipeline, but they may not be well suited to your machine. There are two easy ways to change the resource allotment: edit `conf/base.config` directly, or pass your own config file with `-c`.
Editing `conf/base.config` directly is probably the easier of the two options, but beware: you will lose your changes if you pull down a new version of the pipeline. Open the `conf/base.config` file and make changes as you see fit.
```groovy
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    nf-core/ausargph Nextflow base config file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    A 'blank slate' config file, appropriate for general use on most high performance
    compute environments. Assumes that all software is installed and available on
    the PATH. Runs in `local` mode - all jobs will be run on the logged in environment.
----------------------------------------------------------------------------------------
*/

params {
    // Default values for bbmap_dedupe parameters
    bbmap_dedupe_args     = "ac=f"
    bbmap_dedupe_cpus     = 1
    bbmap_dedupe_memory   = 24.GB
    bbmap_dedupe_walltime = 4.h
    ...
```
For example, the `base.config` file requests 24 GB of memory to deduplicate reads. For my local machine this is too much, so I might want to change this to 12 GB.
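In `conf/base.config`, that's a one-line edit inside the `params` block shown above:

```groovy
    // was: bbmap_dedupe_memory = 24.GB
    bbmap_dedupe_memory = 12.GB
```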
The second option is more bespoke. Start by copying `base.config` to your working directory and renaming the file to something unique; I'll use `local_mac.config`. Change any or all parameters that you'd like, including defaults for processes. Once you've made all the changes you'd like, save the file. To use it, pass `local_mac.config` to the pipeline with the `-c` option:
```bash
nextflow run ausarg/pipesnake --input ... ... ... -c local_mac.config
```
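For instance, combining the custom config with the basic toy-data run from earlier:

```bash
nextflow run ausarg/pipesnake --input ToyData_SampleInfo.csv --outdir Basic --blat_db ToyData_SqCL_25Targets.fasta -profile docker -c local_mac.config
```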
That's it.