Skip to content

A python package for the identification of lncRNA from the assembled novel transcripts

Notifications You must be signed in to change notification settings

yaskermezli/lncScore

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

lncScore is a python package for the identification of lncRNA from the assembled novel transcripts, and it also can be used to calculate the coding potential.

Abstract

RNA-Seq based transcriptome assembly has been widely used in the identification of novel lncRNAs. However, the best-performing transcript reconstruction method merely identified 21% of full-length protein-coding transcripts from H. sapiens. Those partial-length protein-coding transcripts are more likely to be classified as lncRNAs due to their incomplete CDS, leading to higher false positive rate for lncRNA identification. Furthermore, potential sequencing or assembly error that gain or abolish stop codons also complicates ORF-based prediction of lncRNAs. Therefore, it remains a challenge to identify lncRNAs from the assembled transcripts, particularly the partial-length ones. Here, we present a novel alignment-free tool, lncScore, which uses a logistic regression model with 11 carefully selected features. Compared to other alignment-free tools, lncScore outperforms them on accurately distinguishing lncRNAs from mRNAs, especially partial-length mRNAs in the human and mouse datasets. In addition, lncScore also performed well on transcripts from five other species (Zebrafish, Fly, C. elegans, Rat, and Sheep), using models trained on human and mouse datasets. To speed up the prediction, multithreading is implemented within lncScore, and it only took 2 minute to clas-sify 64,756 transcripts and 54 seconds to train a new model with 21,000 transcripts with 12 threads, which is much faster than other tools.

Documentation

Documentation for the software is available at http://lncscore.openbioinformatics.org.

Installation

The following software should be installed in your cluster or computer before running the lncScore.py.

  •     Perl (>=5.10.1), https://www.perl.org/get.html.
    
  •     Python (>= 2.7), https://www.python.org/downloads/.
    
  •     The scikit-learn module, http://scikit-learn.org/stable/install.html.
    

In most use cases the best way to install Python and scikit-learn package on your system is by using Anaconda(https://www.continuum.io), which is an easy-to-install free Python distirbution and includes more than 400 of the most popular Python packages. Anaconda includes installers(https://www.continuum.io/downloads) for Windows, OS X, and Linux.

If the input file in .bed format, then an additional python package named 'pysam' is required to be installed first. After the installation of Anaconda, you can use the command 'conda install pysam' to install the Pysam package.

Author

lncScore is developed by Jian Zhao ([email protected]). For questions and comments, please contact Jian or submit an issue on github.

Reference

  • Zhao J, Song X*, Wang K*. lncScore: alignment-free identification of lncRNA from assembled novel transcripts. Submitted, 2016

License

WGLab MIT License.

About

A python package for the identification of lncRNA from the assembled novel transcripts

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 95.2%
  • Perl 4.8%