The correct annotation of non-coding RNAs, especially long non-coding RNAs (lncRNAs), remains a critical challenge in genome analysis because of their highly heterogeneous characteristics. Given this heterogeneity, the source of the transcriptome data can be an important factor affecting lncRNA annotation quality. Long-read technologies now have the potential to improve transcriptome annotation, especially for genomic entities that are not "classic" coding genes. However, benchmarking studies that test whether applying lncRNA predictors directly to long reads yields more precise identification of these transcripts are lacking. Since lncRNA identification tools were not trained on long reads, our study asks: how do these tools perform on them, and can they still identify lncRNAs efficiently? To answer these questions, we used short- and long-read data from human and selected plant transcriptomes. By understanding these issues, we can provide evidence of where and how lncRNA annotation approaches could be improved.
Keywords: Non-coding RNAs, high-throughput sequencing technologies, coding, methods, benchmarking, tools, NGS, transcripts
- gencode_v21_intersection_by_tools_coding.csv: Sequences classified as coding by the tools in the GENCODE dataset, version 21;
- gencode_v38_intersection_by_tools_coding.csv: Sequences classified as coding by the tools in the GENCODE dataset, version 38.
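
The file names above suggest an intersection of per-tool "coding" calls. As a minimal sketch of that idea, the snippet below computes the sequence IDs that every tool classified as coding. The tool names, sequence IDs, and the `coding_intersection` helper are hypothetical placeholders for illustration; they are not taken from the repository, and the actual layout of the CSV files may differ.

```python
# Hypothetical per-tool predictions: tool name -> set of sequence IDs
# that the tool classified as "coding". All names and IDs are placeholders.
predictions = {
    "tool_a": {"seq1", "seq2", "seq3"},
    "tool_b": {"seq2", "seq3", "seq4"},
    "tool_c": {"seq3", "seq2", "seq5"},
}


def coding_intersection(calls):
    """Return the sorted IDs that every tool classified as coding."""
    sets = list(calls.values())
    if not sets:
        # No tools means no shared predictions.
        return []
    return sorted(set.intersection(*sets))


shared = coding_intersection(predictions)
print(shared)  # ['seq2', 'seq3']
```

Using sets keeps the comparison independent of the order in which each tool reports its sequences.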
- Alisson Gaspar Chiquitto - https://github.com/chiquitto
- Lucas Otávio Leme Silva
- Liliane Santana Oliveira
- Douglas Silva Domingues
- Alexandre Rossi Paschoal - https://github.com/alerpaschoal
Contact: [email protected]