# Read Processing for paired-end Illumina reads

This repository contains a Makefile to process paired-end Illumina data with bowtie2, samtools, Picard, and GATK. Much of the analysis is based on @knausb's bam processing workflow, but tweaked for a haploid plant pathogen with an available genome (here we use it for *Sclerotinia sclerotiorum*). This is designed to run on the HCC SLURM cluster using the SLURM_Array submission script in the user's `$PATH`. There are no guarantees that this will work anywhere else.

## Running the workflow

To build your analysis, add your data to directories called `reads/` and `mitochondria_genome/`, edit variables in the Makefile such as `ROOT_DIR`, and make sure you have an environment variable called `$EMAIL` so you can be spammed every time the processes start and finish. In your shell, type:

```sh
make
```

This will generate the genome index, sam files, bam files, g.vcf files, and a final `GATK/res.vcf.gz`. You can find a manifest of the files generated for each sample in `manifest.txt` (which will be updated sporadically as I work out the bugs and test things).
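For example, a minimal session might look like the following. This is only a sketch: the repository directory name, data paths, and email address are placeholders, and the `export EMAIL` and `ROOT_DIR` edit are assumptions based on the description above.

```sh
# Hypothetical session; adjust names and paths for your own data.
cd read-processing                            # this repository (name assumed)
cp /path/to/data/*_[12].fq.gz reads/          # paired-end reads
cp /path/to/mt_genome.fasta.gz mitochondria_genome/
export EMAIL=you@example.com                  # used for job notifications
$EDITOR Makefile                              # set ROOT_DIR and related variables
make
```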

## Adding steps to the workflow

If you want to add steps/rules to the workflow, you should first be familiar and comfortable with how Makefiles work. Here are some helpful guides:

One thing that is a bit unusual about this Makefile compared to traditional Makefiles is that I wrote it in a somewhat wonky way, with many prerequisites feeding into a single rule (this may change in the future).

Many of the rules in the Makefile take the form of:

```make
.PHONY: all out

all : out

# Collect the input samples and derive the corresponding output names.
SAMPLES     := $(shell ls -1d samples-dir/*.in)
OUT_SAMPLES := $(patsubst %.in,%.out,$(SAMPLES))

out : $(OUT_SAMPLES)

# Write one command line per sample to a runfile, then hand that runfile
# to SLURM_Array, which submits the array job and generates this script.
runs/ARRAY-JOB-NAME/ARRAY-JOB-NAME.sh : $(SAMPLES)
	echo $^ | sed -r 's/([^ ]+)\.in */script-to-run.sh \1.in -o \1.out\n/g' > $(RUNFILES)/run-script-to-run.txt
	SLURM_Array -c $(RUNFILES)/run-script-to-run.txt \
	            --mail $(EMAIL) \
	            -r runs/ARRAY-JOB-NAME \
	            -l $(MODULE) \
	            --hold \
	            -w $(ROOT_DIR)

# The output files are produced as a side effect of the array job above.
$(OUT_SAMPLES) : $(SAMPLES) runs/ARRAY-JOB-NAME/ARRAY-JOB-NAME.sh
```

Here `out` is a phony target that generates one `OUT_SAMPLE` for every `SAMPLE` via `runs/ARRAY-JOB-NAME/ARRAY-JOB-NAME.sh`. That shell script is itself a target: it is generated by SLURM_Array when it submits the SLURM array job. Is this an elegant solution? No. It's a bit ham-fisted, and I will probably change it in the future.
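For concreteness, if `samples-dir/` contained two hypothetical inputs, `sample_A.in` and `sample_B.in`, the `sed` line above would write a runfile along these lines, with one command per array task:

```sh
# $(RUNFILES)/run-script-to-run.txt -- one line per SLURM array task
script-to-run.sh samples-dir/sample_A.in -o samples-dir/sample_A.out
script-to-run.sh samples-dir/sample_B.in -o samples-dir/sample_B.out
```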

## Required directories

- `mitochondria_genome/`: a gzipped fasta file, such as one from ftp://ftp.broadinstitute.org/pub/annotation/fungi/sclerotinia_sclerotiorum/broad/genomes/sclerotinia_sclerotiorum/
- `reads/`: paired-end genomic data in `*_[12].fq.gz` format (see the layout sketch after this list)
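As a sketch of the expected input layout (the sample and genome file names here are hypothetical):

```sh
$ ls mitochondria_genome/
sclerotinia_sclerotiorum_mt.fasta.gz    # any gzipped fasta of the mitochondrial genome
$ ls reads/
sample_A_1.fq.gz  sample_A_2.fq.gz
sample_B_1.fq.gz  sample_B_2.fq.gz
```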

## Generated directories

- `runfiles/`: shell scripts for pre-processing submission scripts (kept in this directory for posterity)
- `bt2-index/`: genome index files generated via `make index` (see the example after this list)
- `SAMS/`: mapped sam files generated via `make map`
- `BAMS/`: filtered bam files
- `GVCF/`: `*.g.vcf` and `*.vcf` files generated via GATK
- `runs/`: stdout and stderr of runs
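The intermediate stages named above can also be built individually. A minimal sketch, assuming only the `index` and `map` targets referenced in this list:

```sh
make index   # build the bowtie2 genome index into bt2-index/
make map     # map the reads, producing sam files in SAMS/
make         # run everything through to GATK/res.vcf.gz
```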