
02. Quick Start

Ian Brennan edited this page Sep 3, 2024 · 1 revision

pipesnake relies on Nextflow, but the remaining software is packaged within Docker or Singularity containers, or Conda environments. Pick your poison.


1. Install Nextflow

  • version >=23.04.1
  • if you need more support, follow instructions at the top of the FAQ
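If you want to verify an existing installation against the minimum version, a small shell helper works (a sketch; `ver_ge` is a hypothetical helper name, and it relies on `sort -V`, available on most Linux systems):

```shell
# ver_ge A B: succeed if version A >= version B (relies on sort -V)
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Compare the version reported by `nextflow -version` against the minimum
if ver_ge "23.10.0" "23.04.1"; then
  echo "Nextflow version is new enough"
fi
```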

2. Install Docker, Singularity, or Conda

  • You can follow this tutorial to help install Singularity.
  • You can use Conda both to install Nextflow itself and to manage software within pipelines. Please only use it within pipelines as a last resort; see docs.

3. Download pipesnake

The options below will download the pipeline and run an example dataset in a single command:

  • Using docker:
nextflow run ausarg/pipesnake -profile test,docker --outdir <OUTDIR>
  • Using singularity:

If you are using Singularity, first use nf-core download to fetch the Singularity images for the necessary software before running the pipeline. If you don't already have nf-core (nf-core/tools) installed, you can install it easily in a variety of ways (e.g. conda, pip); see here.

nf-core download ausarg/pipesnake

Once you have downloaded pipesnake with nf-core, you can run the test:

nextflow run ausarg/pipesnake -profile test,singularity --outdir <OUTDIR>
  • Using conda:

We are temporarily recommending that users not use the conda implementation due to some outstanding issues. If you'd like to try anyway, instructions are below.

If you are using Conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs. If Conda environment creation fails, consider using mamba to create the needed environment in the cache directory, using the same hashed names reported in the Nextflow logs.
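For example, you could set the cache location before launching the pipeline (the path below is only a placeholder):

```shell
# Placeholder path: any central, writable directory will do
export NXF_CONDA_CACHEDIR="$HOME/.nextflow/conda-cache"

# The equivalent setting in nextflow.config would be:
#   conda.cacheDir = '/path/to/conda-cache'
```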

nextflow run ausarg/pipesnake -profile test --outdir <OUTDIR> -with-conda true

4. Prepare your input files

4.1 Generate a sample sheet (for --input):

sample_id read1 read2 barcode1 barcode2 adaptor1 adaptor2 lineage
Sample1 /[PATH_TO]/Sample1_A_R1.fastq /[PATH_TO]/Sample1_A_R2.fastq AGGTTTGAGC TACCTGGTCG TCAC*ATCT ACAC*ACAC Crocodile
Sample2 /[PATH_TO]/Sample2_A_R1.fastq /[PATH_TO]/Sample2_A_R2.fastq CGGTGGAAGC GTGTCTGAAG TCAC*ATCT ACAC*ACAC Gecko
Sample3 /[PATH_TO]/Sample3_A_R1.fastq /[PATH_TO]/Sample3_A_R2.fastq TACTTACTGG GAAATCCTAC TCAC*ATCT ACAC*ACAC Snake
Sample1 /[PATH_TO]/Sample1_B_R1.fastq /[PATH_TO]/Sample1_B_R2.fastq TCACCGATAA AGGCACACTC TCAC*ATCT ACAC*ACAC Crocodile
  • The sample sheet must have the above headers; additional columns (e.g. notes) may be included but will not be read.
  • A single entry (row) corresponds to a pair of sequence read files (R1 & R2) for the same sample, but an individual sample may have multiple entries (see Sample1).
  • read1 and read2 must give the absolute path to the read files.
  • The * in adaptor sequences indicates the placement of the barcode sequence. Information about standard Illumina adaptors and trimming can be found here.
  • The lineage designation is what you would like that sample to ultimately be called in output alignments, locus trees, and the species tree.
  • Save your sample sheet as a comma-separated .csv file.
  • AusARG Datasets: use the BPA_process_metadata.py script to generate your sample sheet from within your downloaded BPA metadata directory.
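As a sketch, a minimal one-sample sheet can be written by hand; the paths, barcodes, and lineage below are placeholders, but the header row must match exactly:

```shell
# Write a minimal one-sample sheet (placeholder paths and barcodes)
cat > samplesheet.csv <<'EOF'
sample_id,read1,read2,barcode1,barcode2,adaptor1,adaptor2,lineage
Sample1,/data/Sample1_A_R1.fastq,/data/Sample1_A_R2.fastq,AGGTTTGAGC,TACCTGGTCG,TCAC*ATCT,ACAC*ACAC,Crocodile
EOF

# Sanity check: every row should have exactly 8 comma-separated fields
awk -F',' 'NF != 8 {print "bad row: " $0; exit 1}' samplesheet.csv
```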

4.2 Generate a targets file (for --blat_db):

  • The target sequence file is simply a FASTA file of your focal loci. Locus names must be unique, and ideally the target sequence data is not too divergent from your samples (though BLAT is quite flexible). An example targets file is included in SqCL_Targets.fasta, and is appropriate for use with SqCL projects.
>RAG1  
TATGTTCAAATGTCCTTGGAAAACTTCTGTCT...  
>AHE-L1  
AACTTATACAAATCTTGGATGCCATGGATCCA...
>UCE-1520
ACAGAGGTCGATATACCGTAGAAGATGTCCAG...
...
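Since locus names must be unique, it can be worth checking your targets file for duplicates before running; a one-liner such as this (with `targets.fasta` standing in for your own file) prints any repeated names, so empty output is good:

```shell
# List any duplicated locus names in a targets FASTA; empty output is good
grep '^>' targets.fasta | awk '{print $1}' | sort | uniq -d
```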

4.3 Generate a filtering file (for --filter):

  • The filtering sequences file is another FASTA file of your focal loci, but from phylogenetically close samples (high similarity, e.g. intra-family). This is optional, but may speed up the assembly step: the raw reads are quickly (and loosely) mapped against these sequences to exclude off-target reads that would otherwise slow down assembly. Locus names do not have to be unique, and redundant targets from different taxa may improve filtration.
>RAG1  boa
TGTGTTCAAATGTCCTTGGAAAACTTCTGTCT...  
>RAG1  python
TATGTTCAAATGTCCTTGGAAAACTTCTGTCT... 
>AHE-L1  boa
ATCTTATACAAATCTTGGATGCCATGGATCCA...
>AHE-L1  python
AACTTATACAAATCTTGGATGCCATGGATCCA...  
...
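Because redundant locus names are allowed (and can help) here, a quick per-locus count shows how many filtering sequences cover each locus (`filter.fasta` is a placeholder name):

```shell
# Count filtering sequences per locus name
grep '^>' filter.fasta | awk '{print $1}' | sort | uniq -c
```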

5. Run your own analysis

Note that some form of configuration is needed so that Nextflow knows how to fetch the required software. This is usually done with a config profile (e.g. test,docker in the example commands above). You can chain multiple config profiles in a comma-separated string.

  • The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
nextflow run ausarg/pipesnake --input samplesheet.csv --outdir <OUTDIR> --blat_db <TARGET_SEQUENCES> --disable_filter false --filter <FILTER_SEQUENCES> -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>