Merge pull request #29 from Gerbenvandervries/master

readme update and bugfixes
molgenis · Apr 15, 2021 · 01ce2ea · 01ce2ea
2 parents 1bd468a + ab7611b
commit 01ce2ea
Show file tree

Hide file tree

Showing 5 changed files with 131 additions and 50 deletions.
diff --git a/README.md b/README.md
@@ -1,3 +1,101 @@
-# NGS_RNA 
+<h1>NGS_RNA pipeline </h1>
 
-https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-best-practices.wdl
+<h2>Description of the different steps used in the RNA analysis pipeline </h2>
+
+<h3>Gene expression quantification </h3>
+The trimmed fastQ files were aligned to a reference genome using Star [1] with default settings. Before gene quantification 
+SAMtools [2] was used to sort the aligned reads. 
+The gene level quantification was performed by HTSeq-count [3] using --mode=union. 
+The gene annotation database which is included in the results dir in folder expression/. Deseq2 was used for differential expression analysis on STAR bams.
+For experimental group conditions the 'condition' column in the samplesheet was used the distinct groups within the samples.
+
+<h3>Calculate QC metrics on raw and aligned data </h3>
+Quality control (QC) metrics are calculated for the raw sequencing data. This is done using 
+the tool FastQC [4]. QC metrics are calculated for the aligned reads using 
+Picard-tools [5], CollectRnaSeqMetrics, MarkDuplicates, CollectInsertSize-
+Metrics and SAMtools flagstat.
+
+<h3>Splicing event calling using Leafcutter</h3>
+Leafcutter quantifies RNA splicing variation detection.
+
+<h3>GATK variant calling</h3>
+Variant calling was done using GATK. First, we use a GATK tool called SplitNCigarReads
+developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns
+but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.
+The variant calling itself was done using HaplotypeCaller in GVCF mode. All  samples are 
+then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs 
+on all of them together to create a set of raw SNP and indel calls. [6]
+
+Results archive
+The zipped archive contains the following data and subfolders:
+
+- alignment: merged BAM file with index, md5sums and alignment statistics (.Log.final.out)
+- expression: textfiles with gene level quantification per sample and per project. 
+- fastqc: FastQC output
+- qcmetrics: Multiple qcMetrics and images generated with Picard-tools or SAMTools Flagstat.
+- leafcutter: Leafcutter and RegTools output files.
+- expression/Deseq2 differential expression analysis.
+- multiqc_data: Combined MultiQC tables used for multiqc report html.
+- variants: Variants calls using GATK.
+- rawdata: raw sequence file in the form of a gzipped fastq file (.fq.gz)
+
+The root of the results directory contains the final QC report, README.txt, analysis results from each tool, 
+and the samplesheet which formed the basis for this analysis. 
+
+1. Alexander Dobin  1 , Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, 
+Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R Gingeras: STAR: ultrafast universal RNA-seq aligner
+2013 Jan 1;29(1):15-21.  doi: 10.1093/bioinformatics/bts635.  Epub 2012 Oct 25.
+2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R,
+Subgroup 1000 Genome Project Data Processing: The Sequence Alignment/Map format and SAMtools.
+Bioinforma 2009, 25 (16):2078–2079.
+3. Anders S, Pyl PT, Huber W: HTSeq – A Python framework to work with high-throughput sequencing data
+HTSeq – A Python framework to work with high-throughput sequencing data. 2014:0–5.
+4. Andrews, S. (2010). FastQC a Quality Control Tool for High Throughput Sequence Data [Online]. 
+Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ ${samtoolsVersion}
+5. Picard Sourceforge Web site. http://picard.sourceforge.net/ ${picardVersion}
+6. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. 
+McKenna A et al.2010 GENOME RESEARCH 20:1297-303, Version: ${gatkVersion}
+7. Li YI, Knowles DA, Humphrey J, et al. Annotation-free quantification of RNA splicing using LeafCutter. 
+Nat Genet. 2018;50(1):151-158. doi:10.1038/s41588-017-0004-9
+
+
+<h2>Manual</h2>
+
+<h3>1) Copy rawdata to raw data ngs folder </h3>
+
+```BASH
+scp –r SEQSTARTDATE_SEQ_RUNTEST_FLOWCELLXX username@yourcluster:${root}/groups/$groupname/${tmpDir}/rawdata/ngs/YOURDIR
+```
+<h3>2) Create a folder in the generatedscripts folder </h3>
+
+```BASH
+mkdir ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun
+```
+<h3>3) Copy samplesheet to generatedscripts folder </h3>
+
+```BASH
+scp –r TestRun.csv username@yourcluster:/groups/$groupname/${tmpDir}/generatedscripts/
+```
+Note: the name of the folder should be the same as samplesheet (.csv) file.
+Note2: Example samplesheet can be found in $EBROOTNGS_RNA/templates/externalSamplesheet.csv
+
+<h3>4) Run the generate script </h3>
+
+```BASH
+module load NGS_RNA
+cd ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun
+cp $EBROOTNGS_RNA/generate_template.sh .
+bash generate_template.sh
+cd scripts
+```
+Note: If you want to run the pipeline locally, you should change the backend in the CreateInhouseProjects.sh script (this can be done almost at the end of the script where you have something like: sh ${EBROOTMOLGENISMINCOMPUTE}/molgenis_compute.sh search for –b slurm and change it into –b localhost
+
+```BASH
+bash submit.sh
+```
+<h3>5) Submit jobs </h3>
+
+Navigate to jobs folder. The location of the jobs folder will be outputted at the step before this one (step 4).
+```BASH
+bash submit.sh
+```
diff --git a/README.txt b/README.txt
diff --git a/protocols/CopyToResultsDir.sh b/protocols/CopyToResultsDir.sh
@@ -11,20 +11,21 @@
 #string projectJobsDir
 #string projectHTseqExpressionTable
 #string annotationGtf
-#string anacondaVersion
 #string indexFileID
 #string seqType
 #string jdkVersion
 #string fastqcVersion
+#string TrimGaloreVersion
 #string samtoolsVersion
 #string RVersion
-#string wkhtmltopdfVersion
 #string picardVersion
-#string hisatVersion
 #string htseqVersion
 #string pythonVersion
 #string gatkVersion
-#string ghostscriptVersion
+#string RSeQCVersion
+#string starVersion
+#string leafcutterVersion
+#string multiqcVersion
 #string ensembleReleaseVersion
 #string groupname
 #string tmpName
@@ -105,7 +106,7 @@ usedWorkflow=$(basename ${workflow})
 	cp "${intermediateDir}/"*".collectrnaseqmetrics.pdf" "${projectResultsDir}/qcmetrics"
 
 # Copy leafcutter
-	cp "${intermediateDir}/leafcutter_"* "${projectResultsDir}/leafcutter/"
+	cp "${intermediateDir}/${project}_leafcutter_"* "${projectResultsDir}/leafcutter/"
 	cp "${intermediateDir}/${project}"*"_perind_"* "${projectResultsDir}/leafcutter/"
 
 #only available with PE
@@ -128,14 +129,14 @@ University of Groningen, University Medical Center Groningen, Department of Gene
 Description of the different steps used in the RNA analysis pipeline
 
 Gene expression quantification
-The trimmed fastQ files where aligned to build ${indexFileID} ensembleRelease ${ensembleReleaseVersion} 
-reference genome using ${hisatVersion} [1] with default settings. Before gene quantification 
+The trimmed fastQ files using ${TrimGaloreVersion} where aligned to build ${indexFileID} ensembleRelease ${ensembleReleaseVersion} 
+reference genome using ${starVersion} [1] with default settings. Before gene quantification 
 ${samtoolsVersion} [2] was used to sort the aligned reads. 
 The gene level quantification was performed by HTSeq-count ${htseqVersion} [3] using --mode=union, 
 Ensembl version ${ensembleReleaseVersion} was used as gene annotation database which is included
 in folder expression/. Deseq2 was used for differential expression analysis on STAR bams.
 For experimental group conditions the 'condition' column in the samplesheet was used the 
-distinct groups within the samples.
+distinct groups within the samples. 
 
 Calculate QC metrics on raw and aligned data
 Quality control (QC) metrics are calculated for the raw sequencing data. This is done using 
@@ -161,7 +162,8 @@ The zipped archive contains the following data and subfolders:
 - expression: textfiles with gene level quantification per sample and per project. 
 - fastqc: FastQC output
 - qcmetrics: Multiple qcMetrics and images generated with Picard-tools or SAMTools Flagstat.
-- leafcutter: Leafcutter and RegTools output files.
+- leafcutter: Leafcutter and RegTools output files
+- expression/Deseq2: Deseq2 was used for differential expression analysis.
 - multiqc_data: Combined MultiQC tables used for multiqc report html.
 - variants: Variants calls using GATK. (optional)
 - rawdata: raw sequence file in the form of a gzipped fastq file (.fq.gz)
@@ -178,7 +180,7 @@ ${RVersion}
 ${TrimGaloreVersion}
 ${picardVersion}
 ${htseqVersion}
-${PythonPlusVersion}
+${pythonVersion}
 ${gatkVersion}
 ${RSeQCVersion}
 ${starVersion}

diff --git a/protocols/Leafcutter.sh b/protocols/Leafcutter.sh
@@ -55,33 +55,34 @@ python "${EBROOTLEAFCUTTER}"/clustering/leafcutter_cluster_regtools.py \
 --checkchrom
 
 echo "create group_list"
-awk -F',' '{print $1".sorted.merged.bam\t"$2}' "${intermediateDir}"/metadata.csv \
-> "${intermediateDir}${project}"_groups_file.txt
+awk -F',' '{print $1".sorted.merged.bam\t"$2}' "${intermediateDir}/metadata.csv" \
+> "${intermediateDir}${project}_groups_file.txt"
 
-sed 1d "${intermediateDir}${project}"_groups_file.txt > "${intermediateDir}${project}"_groups_file.txt
+sed 1d "${intermediateDir}${project}_groups_file.txt" > "${intermediateDir}${project}"_groups_file.txt.tmp
+mv "${intermediateDir}${project}_groups_file.txt.tmp" "${intermediateDir}${project}_groups_file.txt"
 
 if [[ "${conditionCount}" -gt 1 ]]
 then
 	echo "Differential Splicing with ${conditionCount} groups."
-	Rscript "${EBROOTLEAFCUTTER}"/scripts/leafcutter_ds.R \
+	Rscript "${EBROOTLEAFCUTTER}/scripts/leafcutter_ds.R" \
 	--num_threads 4 \
 	-o "${intermediateDir}${project}_leafcutter_ds" \
-	"${intermediateDir}${project}"_leafcutter_cluster_regtools_perind_numers.counts.gz \
-	"${intermediateDir}${project}"_groups_file.txt
+	"${intermediateDir}${project}_leafcutter_cluster_regtools_perind_numers.counts.gz" \
+	"${intermediateDir}${project}_groups_file.txt"
 
-	Rscript "${EBROOTLEAFCUTTER}"/scripts/ds_plots.R \
-	-e "${EBROOTLEAFCUTTER}"/annotation_codes/gencode_hg19/gencode_hg19_all_exons.txt.gz \
+	Rscript "${EBROOTLEAFCUTTER}/scripts/ds_plots.R" \
+	-e "${EBROOTLEAFCUTTER}/annotation_codes/gencode_hg19/gencode_hg19_all_exons.txt.gz" \
 	-o "${intermediateDir}${project}_leafcutter_ds" \
-	"${intermediateDir}${project}"_leafcutter_cluster_regtools_perind_numers.counts.gz \
-	"${intermediateDir}${project}"_groups_file.txt \
-	"${intermediateDir}${project}"_leafcutter_ds_cluster_significance.txt \
+	"${intermediateDir}${project}_leafcutter_cluster_regtools_perind_numers.counts.gz" \
+	"${intermediateDir}${project}_groups_file.txt" \
+	"${intermediateDir}${project}_leafcutter_ds_cluster_significance.txt" \
 	-f 0.05
 
 else
        echo "Outlier Splicing, $NUMBERCONDITIONS conditions found."
-	Rscript	"${EBROOTLEAFCUTTER}"/scripts//leafcutterMD.R \
+	Rscript	"${EBROOTLEAFCUTTER}/scripts/leafcutterMD.R" \
 	--num_threads 8 \
-	"${intermediateDir}${project}"_leafcutter_cluster_regtools_perind_numers.counts.gz
+	"${intermediateDir}${project}_leafcutter_cluster_regtools_perind_numers.counts.gz"
 fi 
 
 cd -
diff --git a/templates/testRun.csv b/templates/testRun.csv
@@ -0,0 +1,5 @@
+internalSampleID,condition,sample_name,sampleType,externalSampleID,project,sequencer,contact,sequencingStartDate,run,flowcell,lane,seqType,prepKit,capturingKit,barcode,barcodeType,externalFastQ_1,externalFastQ_2,species
+SID.1,control,SID.1,RNA,SID.1,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.1,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset_Radboudumc/fastq/SID.1.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.1.fastq.gz,homo_sapiens|GRCh37
+SID.2,control,SID.2,RNA,SID.2,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.2,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.2.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.2.fastq.gz,homo_sapiens|GRCh37
+SID.3,test,SID.3,RNA,SID.3,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.3,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.3.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.3.fastq.gz,homo_sapiens|GRCh37
+SID.4,test,SID.4,RNA,SID.4,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.4,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq//SID.4.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.4.fastq.gz,homo_sapiens|GRCh37