SimonHegele/rnaQAUSTcompare


rnaQUASTcompare

Small Python command-line tool that generates comparative plots for multiple rnaQUAST short reports.
rnaQUAST (https://github.com/ablab/rnaquast) is a great tool for the evaluation of transcriptome assemblies.

It generates a multitude of quality metrics for transcriptome assemblies, many of them by mapping the transcripts to an annotated genome.
rnaQUAST does a great job at rating individual assemblies. Directly comparing different reports, however, is not as easy.

This tool compares metrics from rnaQUAST's short reports.

Usage

positional arguments:
  report_dirs           paths to output directories from rnaQUAST

options:
  -h, --help            show this help message and exit
  -names NAMES [NAMES ...]
                        list of names for the assemblies (default=["auto"])
  -colors COLORS [COLORS ...]
                        list of colors in hexcode (default=["auto"])
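
The help text above can be reproduced with a small argparse parser. This is only a sketch of an interface that matches the listed arguments, not necessarily the code used in rnaQUASTcompare.py. In particular, `nargs="+"` for `report_dirs` is an assumption based on the plural "paths", and the directory names and hex colors in the example call are made up.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented arguments (sketch)."""
    parser = argparse.ArgumentParser(prog="rnaQUASTcompare.py")
    # Assumption: one or more rnaQUAST output directories are accepted.
    parser.add_argument("report_dirs", nargs="+",
                        help="paths to output directories from rnaQUAST")
    parser.add_argument("-names", nargs="+", default=["auto"],
                        help='list of names for the assemblies (default=["auto"])')
    parser.add_argument("-colors", nargs="+", default=["auto"],
                        help='list of colors in hexcode (default=["auto"])')
    return parser


# Hypothetical call: two report directories with custom names and colors.
args = build_parser().parse_args(
    ["out_A", "out_B", "-names", "A", "B", "-colors", "#1f77b4", "#ff7f0e"]
)
print(args.report_dirs, args.names, args.colors)
```

If `-names` or `-colors` are omitted, both fall back to `["auto"]`, as stated in the help text.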

Output

rnaQUASTcompare.py creates an output folder, named after the current date and time, in the same directory.

I found that the metric "Avg. mismatches per transcript" favors assemblies with shorter transcripts,
so I replaced it with "Avg. mismatches per aligned kb".
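
To illustrate why the per-transcript average favors shorter transcripts, here is a toy comparison. The formulas are an assumption about how such averages can be computed (total mismatches divided by the number of transcripts, or by the aligned length in kb); they are not taken from rnaQUAST or from this tool. The numbers are invented and describe two assemblies with the same error rate of one mismatch per 100 aligned bases.

```python
def mismatches_per_transcript(total_mismatches, n_transcripts):
    """Average number of mismatches per transcript (assumed formula)."""
    return total_mismatches / n_transcripts


def mismatches_per_aligned_kb(total_mismatches, aligned_bases):
    """Mismatches per 1000 aligned bases (assumed formula)."""
    return 1000 * total_mismatches / aligned_bases


# Invented toy data: 100 transcripts each, same error rate (1 per 100 bases).
short = {"transcripts": 100, "aligned_bases": 100 * 500, "mismatches": 500}
long_ = {"transcripts": 100, "aligned_bases": 100 * 2000, "mismatches": 2000}

for label, asm in (("short", short), ("long", long_)):
    per_transcript = mismatches_per_transcript(asm["mismatches"], asm["transcripts"])
    per_kb = mismatches_per_aligned_kb(asm["mismatches"], asm["aligned_bases"])
    print(f"{label}: {per_transcript:.1f} per transcript, {per_kb:.1f} per aligned kb")
```

In this toy case the per-transcript value is four times higher for the assembly with longer transcripts, while the per-kb value is identical for both.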

1. Dataframes

Dataframes combining the data of all short reports are written in .csv, .tsv and .tex format.

2. Plots

Metrics are grouped into four groups: "Gene metrics", "Transcript metrics", "Isoform metrics"
and "Other metrics". For each group, a bar plot and a line plot are created.
Additionally, a combined bar plot and a combined line plot covering all metrics are created.
In the combined plots all values are scaled to [0,1]; the details of the scaling operations can be found below.

Combined plots for all metrics with scaled values and individual plots for each metric group.

Example:

A comparison of three transcriptome assembly tools from the same RNA-Seq data.

Value scaling

I divided the metrics into groups:

  • Gene metrics
    "50%-assembled genes", "95%-assembled genes", "50%-covered genes", "95%-covered genes"
  • Isoform metrics
    "50%-assembled isoforms", "95%-assembled isoforms", "50%-covered isoforms", "95%-covered isoforms"
  • Transcript metrics
    "Transcripts > 500 bp", "Transcripts > 1000 bp", "Aligned", "Uniquely aligned", "Multiply aligned", "Unaligned", "Misassemblies", "Unannotated", "50%-matched", "95%-matched"
  • Scaled metrics
    "Database coverage", "Avg. aligned fraction", "Mean fraction of transcript matched"
  • Other metrics
    "Transcripts", "Avg. mismatches per aligned kb", "Duplication ratio"

Gene metrics are divided by the number of genes in the genome annotation.
Isoform metrics are divided by the number of isoforms in the genome annotation.
Transcript metrics are divided by the number of sequences in the respective assembly.
Scaled metrics are left unchanged.
Other metrics are divided by the maximum value for all assemblies.
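
The scaling rules above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `scale_reports`, the dictionary layout and the toy numbers are hypothetical, only a few metrics per group are listed for brevity, and the "Transcripts" metric is assumed to be the number of sequences in each assembly.

```python
# A few metrics per group, for brevity.
GENE = {"50%-assembled genes", "95%-assembled genes"}
ISOFORM = {"50%-assembled isoforms", "95%-assembled isoforms"}
TRANSCRIPT = {"Aligned", "Unaligned", "Misassemblies"}
OTHER = {"Transcripts", "Avg. mismatches per aligned kb", "Duplication ratio"}


def scale_reports(reports, n_genes, n_isoforms):
    """Scale the metrics of every assembly following the rules above."""
    # Maximum of each "other" metric across all assemblies.
    other_max = {m: max(r.get(m, 0) for r in reports.values()) for m in OTHER}
    scaled = {}
    for name, report in reports.items():
        n_seqs = report["Transcripts"]  # assumed: sequences in this assembly
        row = {}
        for metric, value in report.items():
            if metric in GENE:
                row[metric] = value / n_genes
            elif metric in ISOFORM:
                row[metric] = value / n_isoforms
            elif metric in TRANSCRIPT:
                row[metric] = value / n_seqs
            elif metric in OTHER:
                # Guard against a maximum of zero.
                row[metric] = value / other_max[metric] if other_max[metric] else 0.0
            else:
                row[metric] = value  # "Scaled metrics" are left unchanged
        scaled[name] = row
    return scaled


# Invented toy reports for two assemblies.
reports = {
    "A": {"50%-assembled genes": 50, "50%-assembled isoforms": 80, "Aligned": 900,
          "Database coverage": 0.4, "Transcripts": 1000, "Duplication ratio": 1.0},
    "B": {"50%-assembled genes": 25, "50%-assembled isoforms": 40, "Aligned": 400,
          "Database coverage": 0.3, "Transcripts": 500, "Duplication ratio": 2.0},
}
scaled = scale_reports(reports, n_genes=100, n_isoforms=200)
print(scaled["A"])
print(scaled["B"])
```

Dividing the "other" metrics by their maximum means the assembly with the largest value always maps to 1, so these values are only comparable within one run of the comparison.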

How to cite rnaQUASTcompare

To me it doesn't really matter if you cite this tool at all. If you think you have to, or
want to make others aware of this tool, you can refer directly to this repository.
I would also be very pleased if you could let me know if and how you use this tool.
