Skip to content

Commit

Permalink
Merge pull request #29 from Gerbenvandervries/master
Browse files Browse the repository at this point in the history
readme update and bugfixes
  • Loading branch information
marieke-bijlsma authored Apr 15, 2021
2 parents 1bd468a + ab7611b commit 01ce2ea
Show file tree
Hide file tree
Showing 5 changed files with 131 additions and 50 deletions.
102 changes: 100 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,101 @@
# NGS_RNA
<h1>NGS_RNA pipeline </h1>

https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels/blob/master/gatk4-rna-best-practices.wdl
<h2>Description of the different steps used in the RNA analysis pipeline </h2>

<h3>Gene expression quantification </h3>
The trimmed fastQ files were aligned to a reference genome using Star [1] with default settings. Before gene quantification
SAMtools [2] was used to sort the aligned reads.
The gene level quantification was performed by HTSeq-count [3] using --mode=union.
The gene annotation database which is included in the results dir in folder expression/. Deseq2 was used for differential expression analysis on STAR bams.
For experimental group conditions the 'condition' column in the samplesheet was used the distinct groups within the samples.

<h3>Calculate QC metrics on raw and aligned data </h3>
Quality control (QC) metrics are calculated for the raw sequencing data. This is done using
the tool FastQC [4]. QC metrics are calculated for the aligned reads using
Picard-tools [5], CollectRnaSeqMetrics, MarkDuplicates, CollectInsertSize-
Metrics and SAMtools flagstat.

<h3>Splicing event calling using Leafcutter</h3>
Leafcutter quantifies RNA splicing variation detection.

<h3>GATK variant calling</h3>
Variant calling was done using GATK. First, we use a GATK tool called SplitNCigarReads
developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns
but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.
The variant calling itself was done using HaplotypeCaller in GVCF mode. All samples are
then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs
on all of them together to create a set of raw SNP and indel calls. [6]

Results archive
The zipped archive contains the following data and subfolders:

- alignment: merged BAM file with index, md5sums and alignment statistics (.Log.final.out)
- expression: textfiles with gene level quantification per sample and per project.
- fastqc: FastQC output
- qcmetrics: Multiple qcMetrics and images generated with Picard-tools or SAMTools Flagstat.
- leafcutter: Leafcutter and RegTools output files.
- expression/Deseq2 differential expression analysis.
- multiqc_data: Combined MultiQC tables used for multiqc report html.
- variants: Variants calls using GATK.
- rawdata: raw sequence file in the form of a gzipped fastq file (.fq.gz)

The root of the results directory contains the final QC report, README.txt, analysis results from each tool,
and the samplesheet which formed the basis for this analysis.

1. Alexander Dobin 1 , Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski,
Sonali Jha, Philippe Batut, Mark Chaisson, Thomas R Gingeras: STAR: ultrafast universal RNA-seq aligner
2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25.
2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R,
Subgroup 1000 Genome Project Data Processing: The Sequence Alignment/Map format and SAMtools.
Bioinforma 2009, 25 (16):2078–2079.
3. Anders S, Pyl PT, Huber W: HTSeq – A Python framework to work with high-throughput sequencing data
HTSeq – A Python framework to work with high-throughput sequencing data. 2014:0–5.
4. Andrews, S. (2010). FastQC a Quality Control Tool for High Throughput Sequence Data [Online].
Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ ${samtoolsVersion}
5. Picard Sourceforge Web site. http://picard.sourceforge.net/ ${picardVersion}
6. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
McKenna A et al.2010 GENOME RESEARCH 20:1297-303, Version: ${gatkVersion}
7. Li YI, Knowles DA, Humphrey J, et al. Annotation-free quantification of RNA splicing using LeafCutter.
Nat Genet. 2018;50(1):151-158. doi:10.1038/s41588-017-0004-9


<h2>Manual</h2>

<h3>1) Copy rawdata to raw data ngs folder </h3>

```BASH
scp –r SEQSTARTDATE_SEQ_RUNTEST_FLOWCELLXX username@yourcluster:${root}/groups/$groupname/${tmpDir}/rawdata/ngs/YOURDIR
```
<h3>2) Create a folder in the generatedscripts folder </h3>

```BASH
mkdir ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun
```
<h3>3) Copy samplesheet to generatedscripts folder </h3>

```BASH
scp –r TestRun.csv username@yourcluster:/groups/$groupname/${tmpDir}/generatedscripts/
```
Note: the name of the folder should be the same as samplesheet (.csv) file.
Note2: Example samplesheet can be found in $EBROOTNGS_RNA/templates/externalSamplesheet.csv

<h3>4) Run the generate script </h3>

```BASH
module load NGS_RNA
cd ${root}/groups/$groupname/${tmpDir}/generatedscripts/TestRun
cp $EBROOTNGS_RNA/generate_template.sh .
bash generate_template.sh
cd scripts
```
Note: If you want to run the pipeline locally, you should change the backend in the CreateInhouseProjects.sh script (this can be done almost at the end of the script where you have something like: sh ${EBROOTMOLGENISMINCOMPUTE}/molgenis_compute.sh search for –b slurm and change it into –b localhost

```BASH
bash submit.sh
```
<h3>5) Submit jobs </h3>

Navigate to jobs folder. The location of the jobs folder will be outputted at the step before this one (step 4).
```BASH
bash submit.sh
```
25 changes: 0 additions & 25 deletions README.txt

This file was deleted.

22 changes: 12 additions & 10 deletions protocols/CopyToResultsDir.sh
Original file line number Diff line number Diff line change
Expand Up @@ -11,20 +11,21 @@
#string projectJobsDir
#string projectHTseqExpressionTable
#string annotationGtf
#string anacondaVersion
#string indexFileID
#string seqType
#string jdkVersion
#string fastqcVersion
#string TrimGaloreVersion
#string samtoolsVersion
#string RVersion
#string wkhtmltopdfVersion
#string picardVersion
#string hisatVersion
#string htseqVersion
#string pythonVersion
#string gatkVersion
#string ghostscriptVersion
#string RSeQCVersion
#string starVersion
#string leafcutterVersion
#string multiqcVersion
#string ensembleReleaseVersion
#string groupname
#string tmpName
Expand Down Expand Up @@ -105,7 +106,7 @@ usedWorkflow=$(basename ${workflow})
cp "${intermediateDir}/"*".collectrnaseqmetrics.pdf" "${projectResultsDir}/qcmetrics"

# Copy leafcutter
cp "${intermediateDir}/leafcutter_"* "${projectResultsDir}/leafcutter/"
cp "${intermediateDir}/${project}_leafcutter_"* "${projectResultsDir}/leafcutter/"
cp "${intermediateDir}/${project}"*"_perind_"* "${projectResultsDir}/leafcutter/"

#only available with PE
Expand All @@ -128,14 +129,14 @@ University of Groningen, University Medical Center Groningen, Department of Gene
Description of the different steps used in the RNA analysis pipeline
Gene expression quantification
The trimmed fastQ files where aligned to build ${indexFileID} ensembleRelease ${ensembleReleaseVersion}
reference genome using ${hisatVersion} [1] with default settings. Before gene quantification
The trimmed fastQ files using ${TrimGaloreVersion} where aligned to build ${indexFileID} ensembleRelease ${ensembleReleaseVersion}
reference genome using ${starVersion} [1] with default settings. Before gene quantification
${samtoolsVersion} [2] was used to sort the aligned reads.
The gene level quantification was performed by HTSeq-count ${htseqVersion} [3] using --mode=union,
Ensembl version ${ensembleReleaseVersion} was used as gene annotation database which is included
in folder expression/. Deseq2 was used for differential expression analysis on STAR bams.
For experimental group conditions the 'condition' column in the samplesheet was used the
distinct groups within the samples.
distinct groups within the samples.
Calculate QC metrics on raw and aligned data
Quality control (QC) metrics are calculated for the raw sequencing data. This is done using
Expand All @@ -161,7 +162,8 @@ The zipped archive contains the following data and subfolders:
- expression: textfiles with gene level quantification per sample and per project.
- fastqc: FastQC output
- qcmetrics: Multiple qcMetrics and images generated with Picard-tools or SAMTools Flagstat.
- leafcutter: Leafcutter and RegTools output files.
- leafcutter: Leafcutter and RegTools output files
- expression/Deseq2: Deseq2 was used for differential expression analysis.
- multiqc_data: Combined MultiQC tables used for multiqc report html.
- variants: Variants calls using GATK. (optional)
- rawdata: raw sequence file in the form of a gzipped fastq file (.fq.gz)
Expand All @@ -178,7 +180,7 @@ ${RVersion}
${TrimGaloreVersion}
${picardVersion}
${htseqVersion}
${PythonPlusVersion}
${pythonVersion}
${gatkVersion}
${RSeQCVersion}
${starVersion}
Expand Down
27 changes: 14 additions & 13 deletions protocols/Leafcutter.sh
Original file line number Diff line number Diff line change
Expand Up @@ -55,33 +55,34 @@ python "${EBROOTLEAFCUTTER}"/clustering/leafcutter_cluster_regtools.py \
--checkchrom
echo "create group_list"
awk -F',' '{print $1".sorted.merged.bam\t"$2}' "${intermediateDir}"/metadata.csv \
> "${intermediateDir}${project}"_groups_file.txt
awk -F',' '{print $1".sorted.merged.bam\t"$2}' "${intermediateDir}/metadata.csv" \
> "${intermediateDir}${project}_groups_file.txt"
sed 1d "${intermediateDir}${project}"_groups_file.txt > "${intermediateDir}${project}"_groups_file.txt
sed 1d "${intermediateDir}${project}_groups_file.txt" > "${intermediateDir}${project}"_groups_file.txt.tmp
mv "${intermediateDir}${project}_groups_file.txt.tmp" "${intermediateDir}${project}_groups_file.txt"
if [[ "${conditionCount}" -gt 1 ]]
then
echo "Differential Splicing with ${conditionCount} groups."
Rscript "${EBROOTLEAFCUTTER}"/scripts/leafcutter_ds.R \
Rscript "${EBROOTLEAFCUTTER}/scripts/leafcutter_ds.R" \
--num_threads 4 \
-o "${intermediateDir}${project}_leafcutter_ds" \
"${intermediateDir}${project}"_leafcutter_cluster_regtools_perind_numers.counts.gz \
"${intermediateDir}${project}"_groups_file.txt
"${intermediateDir}${project}_leafcutter_cluster_regtools_perind_numers.counts.gz" \
"${intermediateDir}${project}_groups_file.txt"
Rscript "${EBROOTLEAFCUTTER}"/scripts/ds_plots.R \
-e "${EBROOTLEAFCUTTER}"/annotation_codes/gencode_hg19/gencode_hg19_all_exons.txt.gz \
Rscript "${EBROOTLEAFCUTTER}/scripts/ds_plots.R" \
-e "${EBROOTLEAFCUTTER}/annotation_codes/gencode_hg19/gencode_hg19_all_exons.txt.gz" \
-o "${intermediateDir}${project}_leafcutter_ds" \
"${intermediateDir}${project}"_leafcutter_cluster_regtools_perind_numers.counts.gz \
"${intermediateDir}${project}"_groups_file.txt \
"${intermediateDir}${project}"_leafcutter_ds_cluster_significance.txt \
"${intermediateDir}${project}_leafcutter_cluster_regtools_perind_numers.counts.gz" \
"${intermediateDir}${project}_groups_file.txt" \
"${intermediateDir}${project}_leafcutter_ds_cluster_significance.txt" \
-f 0.05
else
echo "Outlier Splicing, $NUMBERCONDITIONS conditions found."
Rscript "${EBROOTLEAFCUTTER}"/scripts//leafcutterMD.R \
Rscript "${EBROOTLEAFCUTTER}/scripts/leafcutterMD.R" \
--num_threads 8 \
"${intermediateDir}${project}"_leafcutter_cluster_regtools_perind_numers.counts.gz
"${intermediateDir}${project}_leafcutter_cluster_regtools_perind_numers.counts.gz"
fi
cd -
5 changes: 5 additions & 0 deletions templates/testRun.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
internalSampleID,condition,sample_name,sampleType,externalSampleID,project,sequencer,contact,sequencingStartDate,run,flowcell,lane,seqType,prepKit,capturingKit,barcode,barcodeType,externalFastQ_1,externalFastQ_2,species
SID.1,control,SID.1,RNA,SID.1,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.1,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset_Radboudumc/fastq/SID.1.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.1.fastq.gz,homo_sapiens|GRCh37
SID.2,control,SID.2,RNA,SID.2,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.2,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.2.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.2.fastq.gz,homo_sapiens|GRCh37
SID.3,test,SID.3,RNA,SID.3,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.3,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.3.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.3.fastq.gz,homo_sapiens|GRCh37
SID.4,test,SID.4,RNA,SID.4,testproject,testset,none,set,testset,flowcell,1,SR,None,None,SID.4,AGI,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq//SID.4.fastq.gz,/groups/umcg-atd/tmp01/generatedscripts/testset/fastq/SID.4.fastq.gz,homo_sapiens|GRCh37

0 comments on commit 01ce2ea

Please sign in to comment.