Nanjala Ruth edited this page Aug 25, 2020 · 9 revisions

Welcome to the Group-5-miniproject_RNASEQ wiki!

RNA-Seq data processing and gene expression analysis workflow

Background

RNA-Seq is widely used in gene expression studies to quantify the RNA in a sample using next-generation sequencing (NGS). It is a powerful tool with many applications in gene discovery and quantification. Here, we assume differential expression is being assessed between two experimental conditions, i.e. a simple 1:1 comparison. The sample data are human. Using an HPC environment is highly recommended for the additional RAM and computational power.

Pipeline

The pipeline for this mini-project includes:

  • FastQC for quality check
  • Trimmomatic for adaptor removal and trimming
  • HISAT2 for alignment and Subread’s featureCounts for count generation
  • Kallisto for pseudoalignment
  • MultiQC to collect the statistics
  • Statistical analysis in R using DESEQ
  • Converting the pipeline to R Markdown
  • Converting the pipeline to Snakemake/Nextflow languages

Getting Set Up

A. Installing Miniconda (if needed)

  1. Download the Miniconda installer for your OS to your home directory
    • Linux: wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    • Mac: curl -LO https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
  2. Run the installer you downloaded:
    • Linux: bash Miniconda3-latest-Linux-x86_64.sh
    • Mac: bash Miniconda3-latest-MacOSX-x86_64.sh
  3. Follow all the prompts; if unsure, accept the defaults
  4. Close and re-open your terminal
  5. If the installation was successful, the following command prints a list of installed packages:
    • conda list
     If the command cannot be found, add the Miniconda bin directory to your PATH:
    • export PATH=~/miniconda3/bin:$PATH
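
The check in step 5 can also be scripted. The following is a minimal sketch, assuming Miniconda was installed to the default `~/miniconda3` location; `MINICONDA_BIN` is a hypothetical variable name introduced here for illustration, so adjust it if you chose a different install prefix.

```shell
# Assumed default install location; change this if you picked another prefix.
MINICONDA_BIN="$HOME/miniconda3/bin"

if command -v conda >/dev/null 2>&1; then
    echo "conda found at: $(command -v conda)"
else
    # Prepend the Miniconda bin directory for the current shell session only.
    export PATH="$MINICONDA_BIN:$PATH"
    echo "conda not on PATH; added $MINICONDA_BIN"
fi
```

The `export` only affects the current session. To make it permanent, the same line would need to go into your shell start-up file (for example `~/.bashrc`).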

B. Setting up the folder structure

Clone this repository into the current working folder and create the folder structure:

git clone https://github.com/nanjalaruth/Group-5-miniproject_RNASEQ.git
# Create the working folders; README.md is a file, so create it with touch rather than mkdir
mkdir -p Rawdata Results Scripts
touch README.md
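
Before moving on, it can help to confirm that the working folders exist. This is a small sketch, assuming it is run from the folder in which `Rawdata`, `Results` and `Scripts` are expected; it creates any folder that is missing.

```shell
# Run from the project root (the folder that should contain the three working folders).
for dir in Rawdata Results Scripts; do
    if [ -d "$dir" ]; then
        echo "ok: $dir"
    else
        echo "missing: $dir (creating it)"
        mkdir -p "$dir"
    fi
done
```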

C. Download the data files (if you would like to use this dataset)

cd Rawdata/
# Create sample_id.txt (e.g. with nano) containing one sample name per line:
#sample37
#sample38
#sample39
#sample40
#sample41
#sample42

for sample in `cat sample_id.txt`
do
	wget http://h3data.cbio.uct.ac.za/assessments/RNASeq/practice/dataset/${sample}_R1.fastq.gz
	wget http://h3data.cbio.uct.ac.za/assessments/RNASeq/practice/dataset/${sample}_R2.fastq.gz
done

#download the metadata file
wget -c http://h3data.cbio.uct.ac.za/assessments/RNASeq/practice/practice.dataset.metadata.tsv
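
As an alternative to editing `sample_id.txt` by hand, the file can be generated and the download URLs previewed before anything is fetched. This is a sketch only: the sample names and base URL are copied from the commands above, while `base_url`, `read_pair` and `fastq_urls.txt` are hypothetical names introduced for illustration.

```shell
# Write the six sample names, one per line (same names as listed above).
printf 'sample%s\n' 37 38 39 40 41 42 > sample_id.txt

# Base URL taken from the download loop above.
base_url="http://h3data.cbio.uct.ac.za/assessments/RNASeq/practice/dataset"

# Build the list of FASTQ URLs (two per sample: R1 and R2) without downloading.
while read -r sample; do
    for read_pair in R1 R2; do
        echo "${base_url}/${sample}_${read_pair}.fastq.gz"
    done
done < sample_id.txt > fastq_urls.txt

# Six samples with two reads each should give twelve URLs.
wc -l < fastq_urls.txt
```

Once the list looks right, `wget -c -i fastq_urls.txt` would download every file in it, resuming any partial downloads.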

D. Download the annotation file, reference genome and transcriptome.

# Create the output folders used below (wget -O does not create them)
mkdir -p hisat2 Kallisto

# Download the human annotation file
wget -c ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz -O hisat2/Homo_sapiens.GRCh38.100.gtf.gz
gunzip hisat2/Homo_sapiens.GRCh38.100.gtf.gz

# Download the human reference genome
wget -c ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O hisat2/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip hisat2/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Download the human transcriptome reference
wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz -O Kallisto/gencode.v34.transcripts.fa.gz
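
Large downloads are sometimes interrupted, so it may be worth checking that a compressed file is intact before using it. The sketch below relies on `gzip -t`, which tests an archive without extracting it; `check_gz` is a hypothetical helper name, not part of this repository.

```shell
# Report whether a gzip archive is intact; returns non-zero if missing or corrupt.
check_gz() {
    if [ ! -f "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if gzip -t "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "corrupt or incomplete: $1"
        return 1
    fi
}

# Of the files downloaded above, only the transcriptome is left compressed.
check_gz Kallisto/gencode.v34.transcripts.fa.gz || echo "re-run the wget -c command for this file"
```

The same check could be applied to the annotation and genome archives if it is run before the `gunzip` steps.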