miPyRNA: a Python-based package for small RNA-Seq data analysis
Today, massive amounts of data are generated by Next-Generation Sequencing (NGS) technologies, enabling the exploration of small RNA profiles, including microRNAs (miRNAs). In recent years, numerous algorithms, statistical methods, and software tools have been developed to address the specific steps of miRNA analysis, such as identification, quantification, and differential expression analysis. However, a streamlined and reproducible workflow for miRNA data analysis remains a significant challenge.
To address this, we have developed a Python package, miPyRNA, designed specifically for efficient, manageable, and reproducible miRNA analysis from NGS data. This tool integrates current software with custom Python scripts, providing users with a versatile platform for miRNA data processing. Unlike other tools that confine users to pre-defined workflows, miPyRNA allows for greater flexibility by combining widely used command-line tools with tailored Python-based functionality. This approach enables fast and accurate identification of miRNAs, differential expression analysis, and downstream functional studies, empowering researchers to gain deeper insights into the regulatory roles of miRNAs in biological processes.
miPyRNA requires an input file that describes the samples and their input read files. Input template:
| # Project title/Information lines should start with # | | | | |
|---|---|---|---|---|
| SampleName | Replication | Identifier | File1 | File2 |
| Add Full Sample Name Here | Add Replication Here | Add Sample Identifier Here | Add Sample File Name Here | Add Reverse File Here if Paired-End |
Example input file:
| # Arabidopsis transcriptome study under high light stress | | | | |
|---|---|---|---|---|
| SampleName | Replication | Identifier | File1 | File2 |
| GL0.5h1 | GL0.5h1 | GL0.5 | SRR6767632_001.fastq.gz | SRR6767632_002.fastq.gz |
| GL0.5h2 | GL0.5h2 | GL0.5 | SRR6767633_001.fastq.gz | SRR6767633_002.fastq.gz |
| GL6h1 | GL6h1 | GL6 | SRR6767634_001.fastq.gz | SRR6767634_002.fastq.gz |
| GL6h2 | GL6h2 | GL6 | SRR6767635_001.fastq.gz | SRR6767635_002.fastq.gz |
| GL12h1 | GL12h1 | GL12 | SRR6767636_001.fastq.gz | SRR6767636_002.fastq.gz |
| GL12h2 | GL12h2 | GL12 | SRR6767637_001.fastq.gz | SRR6767637_002.fastq.gz |
| GL24h1 | GL24h1 | GL24 | SRR6767639_001.fastq.gz | SRR6767639_002.fastq.gz |
| GL24h2 | GL24h2 | GL24 | SRR6767640_001.fastq.gz | SRR6767640_002.fastq.gz |
| GL48h1 | GL48h1 | GL48 | SRR6767642_001.fastq.gz | SRR6767642_002.fastq.gz |
| GL48h2 | GL48h2 | GL48 | SRR6767643_001.fastq.gz | SRR6767643_002.fastq.gz |
| GL72h1 | GL72h1 | GL72 | SRR6767644_001.fastq.gz | SRR6767644_002.fastq.gz |
| GL72h2 | GL72h2 | GL72 | SRR6767645_001.fastq.gz | SRR6767645_002.fastq.gz |
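To make the layout of the sample sheet concrete, the following is a minimal sketch of how such a file could be read in plain Python. It is not miPyRNA's own parser. The function names `read_sample_sheet` and `group_by_identifier` are hypothetical, and the `|` delimiter is an assumption based on how the table is displayed here; pass a tab or comma if the real file uses one.

```python
import io
from collections import defaultdict

# Column names taken from the input template.
COLUMNS = ["SampleName", "Replication", "Identifier", "File1", "File2"]


def read_sample_sheet(handle, delimiter="|"):
    """Return one dict per sample (hypothetical helper, not the miPyRNA API).

    The delimiter is an assumption; use "\t" or "," for tab- or
    comma-separated sheets.
    """
    samples = []
    for raw in handle:
        # Drop surrounding whitespace and any leading/trailing delimiter.
        line = raw.strip().strip(delimiter).strip()
        # Skip blank lines, '#' information lines, and table rule lines.
        if not line or line.startswith("#") or set(line) <= set("-| "):
            continue
        cells = [cell.strip() for cell in line.split(delimiter)]
        # Pad short rows so a missing File2 becomes an empty string.
        cells += [""] * (len(COLUMNS) - len(cells))
        row = dict(zip(COLUMNS, cells[:len(COLUMNS)]))
        if row["SampleName"] == "SampleName":
            continue  # header row
        # A sample is treated as paired-end when File2 is filled in.
        row["paired"] = bool(row["File2"])
        samples.append(row)
    return samples


def group_by_identifier(samples):
    """Group sample names by their Identifier column (replicate groups)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["Identifier"]].append(sample["SampleName"])
    return dict(groups)


if __name__ == "__main__":
    demo = io.StringIO(
        "#Demo study | ||||\n"
        "---|---|---|---|---|\n"
        "SampleName | Replication | Identifier | File1 | File2 |\n"
        "GL6h1 | GL6h1 | GL6 | SRR6767634_001.fastq.gz | SRR6767634_002.fastq.gz |\n"
        "GL6h2 | GL6h2 | GL6 | SRR6767635_001.fastq.gz | SRR6767635_002.fastq.gz |\n"
    )
    print(group_by_identifier(read_sample_sheet(demo)))
```

Grouping by `Identifier` mirrors how the example sheet pairs two replicates under one condition label such as `GL6`.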
- Quality Control
  - Perform an initial assessment of raw sequencing reads to ensure data quality.
  - Use tools like `FastQC` or custom scripts to evaluate sequence quality, GC content, and adapter contamination.
- Adapter Trimming
  - Remove adapter sequences and low-quality bases from the raw reads using tools like `Cutadapt` or `Trimmomatic`.
  - Generate clean, high-quality reads for downstream analysis.
- Read Mapping
  - Align trimmed reads to the reference genome or small RNA databases (e.g., `miRBase`) using tools like `Bowtie` or `HISAT2`, optimized for small RNA sequences.
- miRNA Identification
  - Use deep learning-based models for identifying known and novel miRNAs in plants and animals.
  - Train and implement neural networks tailored for miRNA recognition, leveraging features such as sequence composition, secondary structure, and evolutionary conservation.
  - Predict secondary structures and validate novel miRNA candidates.
- Quantification
  - Calculate expression levels of identified miRNAs in terms of reads per million (RPM) or normalized counts.
- Differential Expression Analysis
  - Perform statistical analysis to identify differentially expressed miRNAs between conditions using tools like `DESeq2`, `edgeR`, or `limma`.
- Functional Annotation
  - Annotate target genes of miRNAs using target prediction algorithms such as `TargetScan` or `miRanda`.
  - Perform enrichment analyses (e.g., Gene Ontology, KEGG) for target genes.
- Visualization
  - Generate plots such as expression heatmaps, volcano plots, and scatter plots to interpret results effectively.
  - Provide a graphical summary of significant miRNAs and their targets.
- Report Generation
  - Compile results into a detailed, reproducible report, including raw and processed data, figures, and analysis logs.
This updated workflow incorporates state-of-the-art deep learning models to enhance the accuracy and specificity of miRNA identification in both plants and animals, ensuring robust and reliable analysis with miPyRNA.
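The RPM measure named in the quantification step can be illustrated with a minimal sketch: each raw count is divided by the library size and multiplied by one million. This is not miPyRNA's own code. The function name `reads_per_million`, the miRNA names, and the counts are made up for illustration, and using the sum of the counts as the library size when none is given is an assumption.

```python
def reads_per_million(counts, total_mapped=None):
    """Scale raw read counts for one sample to reads per million (RPM).

    counts: dict mapping miRNA name -> raw read count.
    total_mapped: library size used for scaling. If omitted, the sum of
        the counts is used (an assumption made for this sketch).
    """
    if total_mapped is None:
        total_mapped = sum(counts.values())
    if total_mapped <= 0:
        raise ValueError("library size must be positive")
    return {name: count * 1_000_000 / total_mapped
            for name, count in counts.items()}


if __name__ == "__main__":
    # Made-up counts for two miRNAs in a library of 2,000,000 mapped reads.
    raw_counts = {"miR156a": 250, "miR166b": 750}
    print(reads_per_million(raw_counts, total_mapped=2_000_000))
```

Passing the total number of mapped reads explicitly keeps RPM values comparable across samples even when only a subset of reads is assigned to miRNAs.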
This source code was developed on Linux and has been tested on Linux and OS X. The main prerequisite is Python > 3.7. The external dependencies are:
- Flexbar – flexible barcode and adapter removal https://github.com/seqan/flexbar
- Trimmomatic: A flexible read trimming tool for Illumina NGS data http://www.usadellab.org/cms/?page=trimmomatic
- Trim Galore https://github.com/FelixKrueger/TrimGalore
- SortMeRNA https://github.com/sortmerna/sortmerna
- STAR Aligner https://github.com/alexdobin/STAR
- HISAT2 http://daehwankimlab.github.io/hisat2/
- Bowtie2 https://github.com/BenLangmead/bowtie2
- Subread https://subread.sourceforge.net/
- HTSeq https://github.com/simon-anders/htseq
- Samtools https://github.com/samtools/samtools
- Bamtools https://github.com/pezmaster31/bamtools
- R Language https://cran.r-project.org/bin/windows/base/
- DESeq2 https://bioconductor.org/packages/release/bioc/html/DESeq2.html
- edgeR https://bioconductor.org/packages/release/bioc/html/edgeR.html
- Python 3 https://www.python.org/downloads/
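Before running an analysis, it can help to check which of these external tools are reachable on the `PATH`. The sketch below uses only the standard library and is not part of miPyRNA. The executable names are assumptions based on how these tools are commonly installed (for example, `featureCounts` for Subread and `htseq-count` for HTSeq) and may differ on a given system; `find_tools` and `missing_tools` are hypothetical helper names.

```python
import shutil

# Display name -> executable name. The executable names are assumptions
# and may need adjusting for a particular installation.
TOOLS = {
    "Flexbar": "flexbar",
    "Trimmomatic": "trimmomatic",
    "Trim Galore": "trim_galore",
    "SortMeRNA": "sortmerna",
    "STAR": "STAR",
    "HISAT2": "hisat2",
    "Bowtie2": "bowtie2",
    "Subread": "featureCounts",
    "HTSeq": "htseq-count",
    "Samtools": "samtools",
    "Bamtools": "bamtools",
    "R": "Rscript",
}


def find_tools(tools=TOOLS, which=shutil.which):
    """Map each tool name to the path of its executable, or None."""
    return {name: which(exe) for name, exe in tools.items()}


def missing_tools(found):
    """Return the sorted names of tools that were not found."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_tools()
    for name, path in found.items():
        print(f"{name:12s} {path or 'NOT FOUND'}")
    absent = missing_tools(found)
    if absent:
        print("Missing:", ", ".join(absent))
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.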
This guide explains how to install miPyRNA using either a Miniconda environment or Docker for cross-platform compatibility.
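To set up miPyRNA in a Miniconda environment, first download and run the Miniconda installer if conda is not already available:

https://docs.conda.io/en/latest/miniconda.html#linux-installers

Then clone the repository from GitHub, create the conda environment, and install the package:

```bash
git clone https://github.com/navduhan/mipyrna.git
cd mipyrna
conda env create -f mipyrna_environment.yaml
# Activate the environment created above (its name is set in
# mipyrna_environment.yaml) before installing the package.
pip install .
```

To use Docker instead, clone the repository from GitHub and build the image by running:

```bash
git clone https://github.com/navduhan/mipyrna.git
cd mipyrna
docker build -t mipyrna .
```

To display the help message, run:

```bash
mipyrna -h
```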
Written by Naveen Duhan ([email protected]),
Kaundal Bioinformatics Lab, Utah State University,
Released under the terms of the GNU General Public License v3
In case of technical problems (bugs etc.), please contact Naveen Duhan ([email protected])
For any questions on the scientific aspects of the miPyRNA-0.2 method, please contact:
Rakesh Kaundal, ([email protected])
Naveen Duhan, ([email protected])