File naming convention

To standardlize the process, we strongly suggest to start with universal file naming format. In this pipeline, we will use filename convention as: [PD|AD|HD|HC]_[subjectID]_[cellType]_<b#>_<rep#>.[R1|R2].<fq|fastq>.gz where,

Beginning with patient type, PD="Parkinson's disease", AD="Alzhimer's disease", HD="Huntington's disease" and HC="Healthy controls", in 2-letter abbreviation;
Following with subjectID (NOT lane# or sequencing date), which is unique per subject;
Following with cell type or tissue short name, e.g. SNDA="Substantial nigra dopamine neuron", MCPY="Motor cortex pyramidal neuron" etc.;
b# and rep# are optional, in format of (if any);
batch number: b1, b2 etc.
replication number (capitalized one for biological rep), e.g. rep1 is for technique replicate 1, Rep1 for biological replicate 1
Use R1 or R2 to tell the two mates of pair-end sequencing data;
Use zipped fastq;

Folder structure

For each project (e.g. PDMap), it should have a folder named with project short name and followed by the type of data, such as PDMap_RNAseq, HDPredict_smallRNA etc.

rawfiles
- raw fastq files from sequencing;
- use soft links for external location or unformated file name
- log file (readme.txt or Excel)
filtered
- filtered files (e.g. adaptor removal/clip)
run_output
- sub-folder per sample (e.g. sample1, sample2 etc.)
  - output of Tophat/Cufflinks/htseq-count runs
  - uniq subfolder for the runs for unique reads only
  - status indication file (.status*) for tracking the progress
for_display
- files used for display on UCSC / IGV, such as *.bam, *.bam.bai, *.bw, *.gtf etc.
- use soft links to the output files
results
- result files for integrative analysis, e.g. differential analysis by combining all samples.

Pipeline requirement

Install programs: tophat, cufflinks, bowtie, bedtools, samtools, fastq-mcf, fastqc, and htseq-count;
Install Jim Kent's executable programms: http://hgdownload.cse.ucsc.edu/admin/exe/;
Install R and bioconductor packages: DESeq2, MatrixEQTL, SPIA etc.
Add path of the executable programs to the $PATH;

Pipeline structure

Each sub-module has its own status hidden file (e.g. .status.$modulename.adaptorremoval). To check if any sample not completed at any step, use script below:

$grep touch ~/pipeline/modules/_RNAseq.sh | grep -v "#" | sed 's/.*modulename\.//g' | while read i; do echo $i `ll ~/neurogen/rnaseq_PD/run_output/*/.status._RNAseq.sh.$i | wc -l`; done

Main script:

RNAseq.pipeline.sh

Usage: RNAseq.pipeline.sh /data/neurogen/rnaseq_PD/rawfiles
Function: Main script for submitting RNAseq data analysis jobs to high-properformance computing cluster in batch. For now, it's configured to support only HPC environment with LSF job scheduler.
Input: absolute path of folder for the raw fastq files
Output: Tophat/Cufflinks/htseq-count/callSNP etc. runs for each sample using both multiple and unique mappers

Modules:

_RNAseq.sh

Usage: _RNAseq.sh HC_BN10-39_2.R1.fastq.gz HC_BN10-39_2.R2.fastq.gz
Function: Routine steps for pair-end RNAseq data
Input: a pair of fastq files from PE RNAsequencing
Output: BAM/SAM (from alignment), BED/GTF (from assembly) etc.

List of sub-moduels in the core script _RNAseq.sh:

$ grep touch ~/pipeline/modules/_RNAseq.sh | grep -v "#" | sed 's/.*modulename\.//g'
adaptorremoval 
shortReadsExtract 
shortReadsExtract_stemloop
fastqc
mapping
circRNA
sam2bam
bam2stat
sam2bw
rpm_vs_coverage
cufflinks.multi.denovo
cufflinks 
htseqcount
metaexon
uniq
uniq.sam2bw
uniq.bam2stat
cufflinks.denovo
uniq.bam2annotation
uniq.normalize
uniq.cufflinks.rpkm
uniq.cuffquant.rpkm
uniq.htseqcount
uniq.metaexon
makelinks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

File naming convention

Folder structure

Pipeline requirement

Pipeline structure

Main script:

Modules:

_RNAseq.sh

Files

README.md

Latest commit

History

README.md

File metadata and controls

File naming convention

Folder structure

Pipeline requirement

Pipeline structure

Main script:

Modules:

_RNAseq.sh