MicroRNA-seq libraries are built using the ENCODE protocol "Multiplexed miRNA Sequencing Library Generation Protocol".
Briefly, libraries are built from 300 ng to 1 ug of total RNA. Adapters are ligated to the 5’ and 3’ ends of small RNAs using their 5’ phosphate and 3’ hydroxyl groups, then the ligation product is reverse transcribed. The 5’ adapter adds a 6-nucleotide barcode from a one of 7 sets of 4 distinct barcodes used in downstream demultiplexing. The cDNA is amplified with 58 bp reverse and 55 bp forward primers containing additional 6-nucleotide barcodes added to the 3’ end. The 140 bp product containing mature microRNA (21-25 nucleotides) is size-selected using 10% TBE-Urea gel.
Libraries are isolated from the TBE-Urea gel by agitated incubation at 70°C, 1000 RPM for 2 hours in a buffer containing 0.5 M ammonium acetate, 0.1% SDS, and 0.1 uM EDTA, then precipitated overnight in 50% isopropanol.
Libraries are quantified using Qubit dsDNA HS Assay Kit and sequenced on an Illumina NextSeq 2000 with P1 or P2 100 cycle kits as 75 bp single-end reads (SE 75/0/6/0) to ~10 M raw read depth per library. We have good results sequencing pooled library at 850 pM (%Loading Concentration > 95). Submission to the ENCODE portal requires >5 M aligned reads, >300 microRNAs detected at >2 CPM, and a Spearman replicate correlation >0.85.
- seqtk (
conda install -c bioconda seqtk
) - python, with
os
andsys
libraries installed
Nothing! Can module load STAR and cutadapt on HPC.
- R with
tidyverse
,rtracklayer
,ggrepel
,ComplexHeatmap
, (install using BiocManager i.e.BiocManager::install("ComplexHeatmap")
),optparse
(to grab gene names from miRNA GTF) - python with
pandas
,seaborn
,matplotlib
,sklearn
,argparse
,pydeseq2
,pyranges
(to grab gene names from miRNA GTF)
Library structure deviates from standard Illumina, so samples must be demultiplexed with custom code. Illumina NextSeq2000 SampleSheet (e.g. for our practice run) needs to specify read setup but the SampleID / Index doesn't matter since we're just going to use the "undetermined" reads.
- After sequencing is complete, transfer reads to HPC.
scp /home/sharing/runs/231006_VH00582_82_AACYV55M5/Analysis/1/Data/fastq/*.fastq.gz [email protected]:/pub/erebboah/mirna_pipeline/fastq/run_1080
- Concatenate all reads to undetermined.fastq.gz and unzip, making sure you have enough space. Need 24G for a P1 100 cycle kit, x2 since you'll basically be copying it to demultiplex = 48G. If you clean up all files at the end (undetermined.fastq.gz and all the individual sample output directories, just keeping the counts directory), you'll only need ~5G, mostly for the demultiplexed fastqs.
cd /pub/erebboah/mirna_pipeline/fastq/
cat *.fastq.gz > undetermined.fastq.gz
gunzip undetermined.fastq.gz
-
Edit the "real" sample sheet. The first column is the 5' Adaptor ID (1-7) and forward primer index ID (1-12, 14, 15, 18, 21-25) separated by an underscore. The second column is your desired fastq name (sample ID). No header! Preferably matches some sample ID in your metadata. Should be in the same directory as the scripts. Example of samplesheet.csv
-
Edit demux_mirna.sh so that you are using your HPC account (change
#SBATCH -A SEYEDAM_LAB
) and your conda environment (changesource ~/miniconda3/bin/activate seqtkpython3
). As inputs, you need fastq namedundetermined.fastq
in the fastq directory of this repo (e.g./pub/erebboah/mirna_pipeline/fastq/
) and your samplesheet.csv in the scripts directory (e.g./pub/erebboah/mirna_pipeline/scripts/
). -
Run demultiplexing bash script:
sbatch demux_mirna.sh
. Output is demultiplexed, gzipped fastqs with sample IDs in samplesheet.csv, in the fastq directory along with undetermined.fastq.gz.
- Download microRNA GENCODE GTFs from ENCODE portal: vM21 mouse or v29 human.
wget https://www.encodeproject.org/files/ENCFF094ICJ/@@download/ENCFF094ICJ.gtf.gz
wget https://www.encodeproject.org/files/ENCFF470CZH/@@download/ENCFF470CZH.gtf.gz
mv ENCFF094ICJ.gtf.gz mm10_mirna.gtf.gz
mv ENCFF470CZH.gtf.gz hg38_mirna.gtf.gz
gunzip *.gtf.gz
- Download reference sequences from the portal: mm10 mouse or hg38 human.
wget https://www.encodeproject.org/files/mm10_no_alt_analysis_set_ENCODE/@@download/mm10_no_alt_analysis_set_ENCODE.fasta.gz
wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
gunzip *.fasta.gz
- Once you have the reference files, run STAR in genomeGenerate mode:
sbatch make_ref.sh
Run cutadapt to trim adapters and STAR to map. Specify human or mouse in sbatch statement.
Inputs:
- Demultiplexed and gzipped fastqs in
fastq
- STAR reference directory, e.g.
ref/mm10
orref/hg38
- samplesheet.csv in
scripts
to indicate the samples you want to map
Human: sbatch trim_map.sh hg38
Mouse: sbatch trim_map.sh mm10
Each sample will get its own output directory, named the same as the sample ID, e.g. ENC4_453_SB
. Within the sample directory, there's cutadapt
and star
directories containing intermediate files. The actual tab-separated microRNA quantifications per sample are in the main counts directory counts
(e.g. /pub/erebboah/mirna_pipeline/counts/ENC4_453_SB.tsv
). Remove the sample directories to save space if you only need the final counts.
Other than counts, you may be interested in the STAR report (e.g. ENC4_453_SB/star/Log.final.out
) and signal files to display on the UCSC genome browser (e.g. ENC4_453_SB/star/Signal.UniqueMultiple.str1.out.wig
). The percent uniquely mapped reads is low because the microRNAs are so short. On average I get ~43% multi-mapped reads, and pass ENCODE standards. For microRNA-seq, the number of multi-mapped reads plus unique reads are the total number of aligned reads (should be > 5M). Feel free to poke around my old ENCODE miRNA-seq spreadsheet for other QC stuff.
-
Concatenate counts per sample into a counts matrix. Example code in R and python using data from our practice run.
-
Convert counts to CPM (counts per million) to normalize for library depth: R and python
-
Differential microRNA expression analysis between conditions using pyDeseq2: python. Takes in command-line prompts. For this practice run we have timepoint, sex, and technician metadata. I recommend performing differential expression analysis between conditions within a single group for simplicity in interpretation of the results. This script takes in command-line arguments for sex, timepoint, technician, group to test, and output file name. For example, in our practice run, QC and PCA shows the RM samples to be QC outliers for whatever reason. Here are some examples of how to run
run_pydeseq2.py
:A ) PND 14 vs. 2 month timepoints within females:
python3 run_pydeseq2.py --sex Female --timepoint PND_14 PNM_02 --technician SB NM --group timepoint --output ../degs/pnd14_vs_pnm02_female
B) Females vs. males within PND 14 timepoint:
python3 run_pydeseq2.py --sex Female Male --timepoint PND_14 --technician SB NM --group sex --output ../degs/female_vs_male_pnd14
Inputs to
run_pydeseq2.py
must exactly match your metadata. -
Heatmap and volcano plot of differentially expressed microRNAs: R Here are some examples of how to run
deg_plotting.R
:A ) PND 14 vs. 2 month timepoints within females (using file output from
run_pydeseq2.py
), |log2FC| > 1 and padj < 0.01, removing 4 RM outliers:Rscript deg_plotting.R --fname ../degs/pnd14_vs_pnm02_female.csv --l2fc 1 --padj 0.01 --outliers "ENC4_453_RM ENC4_455_RM ENC4_465_RM ENC4_467_RM"
B) Females vs. males within PND 14 timepoint:
Rscript deg_plotting.R --fname ../degs/female_vs_male_pnd14.csv --l2fc 1 --padj 0.01 --outliers "ENC4_453_RM ENC4_455_RM ENC4_465_RM ENC4_467_RM"
sbatch demux_mirna.sh
sbatch make_ref.sh
(only need to run once to generate references!)sbatch trim_map.sh mm10
(e.g. for mouse samples)- Analysis in R and/or python.