Skip to content

Understanding the output of SQANTI3 QC

Fabián Robledo Yagüe edited this page Oct 4, 2024 · 15 revisions

Table of contents:


SQANTI3 QC output

As a result of running SQANTI3 QC, the tool will create a series of output files in the specified directory (--dir or -d flag in the QC script), all of which will be named using the supplied prefix (--output or -o flag in the QC script). Given that SQANTI3 also runs several other tools internally (e.g. STAR, kallisto, GMST...), a series of subfolders will also appear in the output directory.

The example/SQANTI3_QC_output subfolder in the SQANTI3 main directory contains an updated example of what the output of SQ3 QC looks like:

This is the result of running SQANTI3 QC on example data, i.e. a subset of the UHR dataset. A more detailed explanation of the different output files included below.

Main output: classification and junctions files

SQANTI3 characterizes the input transcriptome by computing a series of attributes by transcript, which are written to the classification file, and a series of attributes by junction, which are written to the junctions file.

  • Classification file: a tab-separated file where transcripts are rows and QC attributes computed by SQANTI3 are columns. Isoforms are identified by their ID in the input long-read GTF file. For a full glossary of columns and their meaning, see below.

  • Juctions file: a tab-separated file containing each junction for every PB isoform as rows. Rows are identified by the isoform ID and the junction number (shown in order, i.e. the 1st, 2nd, 3rd... ith junction), which are included in the two first columns. Take into account that the same junction may appear multiple times if they are shared by multiple PB isoforms, for instance, when several FSM and ISM transcripts associated to the same reference isoform are found. For a full glossary of columns and their meaning, see below

Transcriptome files

  • GTF file: the _corrected.gtf file constitutes the main transcriptome output of SQANTI3. It contains the annotation for all transcripts supplied to SQANTI3. This file may be used for downstream analyses such as quantification, however, we recommend filtering your transcriptome using the QC attributes in SQANTI3 before proceeding with your analyses. For this purpose, you may apply one of the two filters available in SQANTI3 or, alternatively, define your own filtering criteria.

  • FASTA file: the _corrected.fasta file contains the genomic sequences of transcripts supplied to SQANTI3 QC, obtained using the _corrected.gtf file and the provided reference genome.

  • FAA file: the _corrected.faa file contains FASTA-formatted protein sequences derived from transcript sequences. These sequences are obtained by running the GeneMark (GMST) ORF prediction algorithm and will not be output if the --skipORF flag is supplied.

  • CDS GFF file: the _corrected.gtf.cds.gff file is generated by expanding the GTF file to include coding sequence (CDS) information (obtained from GMST ORF predictions). This means that, in addition to exons and transcripts, CDS will also be included as a feature in the 3rd GFF column.

  • tappAS GFF3 file (optional): if the --isoAnnotLite argument is used, IsoAnnotLite will be run internally and a tappAS-formatted GFF3 file containing all isoforms in the input transcriptome generated in the output folder. This allows direct usage of a SQANTI3-processed transcriptome in tappAS. Of note, if a pre-annotated tappAS GFF3 file is supplied to SQANTI3 QC via the --gff3 flag, the tappAS GFF3 annotation will also contain functional feature transferred by position (see details on how IsoAnnotLite works here).

Note on indel correction and GTF/FASTA output files

SQ3 currently accepts both GTF (default) and FASTA transcriptome files as input. If a FASTA input file is supplied, sequences are mapped to the genome to generate the _corrected.gtf file. Using this GTF and the reference genome, SQ3 performs indel correction and generates the _corrected.fasta file. Please note that this type of reference-based correction will remove SNPs, therefore, users who are interested in this kind of information should refrain from using a FASTA file as input. Alternatively, if users supply a GTF input file, SQ3 will NOT perform indel correction, just extract transcript sequences using the reference genome and the transcriptome GTF file, which will be equivalent to the _corrected.gtf file.

SQANTI QC report

The SQANTI QC report is generated by the SQANTI3_report.R script, which takes the classification (e.g. UHR_chr22_classification.txt) and junctions (e.g. UHR_chr22_junctions.txt) files.

The report includes over 150 plots, therefore, in order to ease interpretation, we recently implemented the --report argument with four options: pdf, html, both and skip. This means that the report can be created as an interactive HTML document as well as (or alternatively to) a PDF report -or skipped entirely if the user does not wish to see a report for the QC run.

Users who are proficient in R are welcome to modify the R script to add new figures -we will be constantly adding new figures as well.

Supplementary output

  • Parameter file (_params.txt): plain text file including the full paths of all files supplied to SQANTI3, as well as any additional parameters included in the call to sqanti3_qc.py. Here is an example the fields included in the parameter file for a full SQ3 QC run starting from raw short-read data.
Version 5.0
Input   /path/to/transcriptome/long-read_transcripts.gtf
Annotation      /path/to/reference_annotation/reference.gtf
Genome  /path/to/reference_genome/genome.fasta
MinRefLength    200
ForceIdIgnore   False
Aligner minimap2
FLCount /path/to/FL_counts/transcripts.collapsed.abundance.txt
Expression      NA
Junction        /path/to/mapping_results/STAR_mapping
CAGEPeak        /path/to/CAGE_file/CAGE_peaks.bed
PolyAMotif      /path/to/polyA_file/polyA_motif-list.txt
PolyAPeak       /path/to/polyA_files/polyA_peaks.bed
IsFusion        False
PhyloP  NA
SkipORF False
ORFInput        NA
FASTAused       False
Expression      NA
GMAPindex       NA
OutputPrefix    SQANTI3
OutputDirectory /path/to/SQANTI3_output/
Coverage        /path/to/mapping_results/STAR_mapping
CanonicalSites  ATAC,GCAG,GTAG
PostTTSWindow   20
GeneName        False
ReportType      both
RunIsoAnnotLite False
isoAnnotGFF3    NA
ShortReads      /path/to/raw_data/SR.fofn
ShortReadsBAMs  /path/to/raw_data/SR_bams.fofn

  • STAR folders contain the different outputs of the STAR aligner, which is run internally using provided short-reads during QC.

    • STAR_index: STAR index for the supplied reference genome. If you wish to supply a pre-computed index to SQ3 QC, see the shortcut described in issue #130.
    • STAR_mapping: STAR mapping files. SQANTI3 uses short-read coverage information (_SJ.out.tab file) and mapping BAM files (_Aligned.sortedByCoord.bam) to compute short-read coverage, isoform expression (see kallisto output below) and the TSS ratio metric.
  • Kallisto output folder: this folder will contain the results of running kallisto to compute isoform expression estimates using the different short-read replicates supplied. In it, users will find one subfolder per replicate, in which kallisto abundance files (i.e. quantification matrices) are stored.

  • RT-switching (RTS) folder: output files generated by the RT-switching prediction script (utilities/rt_switching.py).

  • GMST folder: output files generated after running GeneMarkST (utilities/gmst) for ORF prediction.

  • genePred files: reference annotation and long-read transriptome GTF files are converted to genePred format during isoform classification. The two .genePred files found in the output folder constitute the intermediate files that are generated in this process.

Glossary of classification file columns (classification.txt)

The output _classification.txt has the following fields:

  1. isoform: the isoform ID. Usually in PB.X.Y format.
  2. chrom: chromosome.
  3. strand: strand.
  4. length: isoform length.
  5. exons: number of exons.
  6. structural_category: one of the categories ["full-splice_match", "incomplete-splice_match", "novel_in_catalog", "novel_not_in_catalog", "genic", "antisense", "fusion", "intergenic", "genic_intron"]
  7. associated_gene: the reference gene name.
  8. associated_transcript: the reference transcript name.
  9. ref_length: reference transcript length.
  10. ref_exons: reference transcript number of exons.
  11. diff_to_TSS: distance of query isoform 5' start to reference transcript start end. Negative value means query starts downstream of TSS, so the transcript is shorter than reference transcript.
  12. diff_to_TTS: distance of query isoform 3' end to reference annotated end site. Negative value means query ends upstream of TTS, so the transcript is shorter than reference transcript.
  13. diff_to_gene_TSS: distance of query isoform 5' start to the closest start end of any transcripts of the matching gene. This field is different from diff_to_TSS since it's looking at all annotated starts of a gene. Negative value means query starts downstream of TSS.
  14. diff_to_gene_TTS: distance of query isoform 3' end to the closest end of any transcripts of the matching gene. Negative value means query ends upstream of TTS.
  15. subcategory: additional splicing categorization, separated by semi-colons. Categories include: mono-exon, multi-exon. Intron rentention is marked with intron_retention.
  16. RTS_stage: TRUE if one of the junctions could be a RT switching artifact.
  17. all_canonical: TRUE if all junctions have canonical splice sites.
  18. min_sample_cov: sample with minimum coverage.
  19. min_cov: minimum junction coverage based on short read STAR junction output file. NA if no short read given.
  20. min_cov_pos: the junction that had the fewest coverage. NA if no short read data given.
  21. sd_cov: standard deviation of junction coverage counts from short read data. NA if no short read data given.
  22. FL or FL.<sample>: FL count associated with this isoform per sample if --fl_count is provided, otherwise NA.
  23. n_indels: total number of indels based on alignment.
  24. n_indels_junc: number of junctions in this isoform that have alignment indels near the junction site (indicating potentially unreliable junctions).
  25. bite: TRUE if contains at least one "bite" positive SJ.
  26. iso_exp: short read expression for this isoform if --expression is provided, otherwise NA.
  27. gene_exp: short read expression for the gene associated with this isoform (summing over all isoforms) if --expression is provided, otherwise NA.
  28. ratio_exp: ratio of iso_exp to gene_exp if --expression is provided, otherwise NA.
  29. FSM_class: This feature classifies the transcript according to the expression of other isoforms in the gene to which the transcript belongs. Transcripts belonging to genes that only express one isoform are classified as A. Transcripts belonging to genes that express more than one isoform but none is a FSM are classified as B. Transcripts belonging to genes which express more than one isoform and other isoforms and at least one is a FSM are classified as C.
  30. coding: Coding potential capacity according to GeneMarkS-T. It can take values of "coding" or "non_coding"
  31. ORF_length: predicted ORF length.
  32. CDS_length: predicted CDS length. It does include the stop codon.
  33. CDS_start: CDS start.
  34. CDS_end: CDS end.
  35. CDS_genomic_start: genomic coordinate of the CDS start. If on - strand, this coord will be greater than the end.
  36. CDS_genomic_end: genomic coordinate of the CDS end. If on - strand, this coord will be smaller than the start.
  37. predicted_NMD: TRUE if there's a predicted ORF and CDS ends at least 50bp before the last junction; FALSE if otherwise. NA if non-coding.
  38. perc_A_downstreamTTS: percent of genomic "A"s in the downstream 20 bp window. If this number if high (say > 0.8), the 3' end site of this isoform is probably not reliable.
  39. seq_A_downstream_TTS: sequence of the downstream 20 bp window.
  40. dist_to_CAGE_peak: distance to closest TSS based on CAGE Peak data. Negative means upstream of TSS and positive means downstream of TSS. Strand-specific. SQANTI3 only searches for nearby CAGE Peaks within 10000 bp of the PacBio transcript start site. Will be NA if none are found within 10000 bp.
  41. within_CAGE_peak: TRUE if the transcript start site is within a CAGE Peak.
  42. dist_to_polyA_site: distance to the closest polyA site, based on polyA site data (e.g. Quant-seq).
  43. within_polyA_site: TRUE if the transcript start site is within a polyA site, retrieved from polyA site data (e.g. Quant-seq).
  44. polyA_motif: if --polyA_motif_list is given, shows the top ranking polyA motif found within 50 bp upstream of end.
  45. polyA_dist: if --polyA_motif_list is given, shows the location of the last base of the hexamer. Position 0 is the putative poly(A) site. This distance is hence always negative because it is upstream.
  46. polyA_motif_found: TRUE if a polyA motif given via --polyA_motif_list is detected in the 3'end of the transcript.
  47. ORF_seq: Predicted ORF sequence. These sequences are also stored in the *_corrected.faa file.
  48. ratio_TSS: Using Short-Read data, we measure the mean coverage of the 100bp upstream and downstream a reported TSS. Then we calculate the ratio coverage inside isoform + 0.01/ coverage outside isoform + 0.01. If several SR samples are provided, ratio_TSS will represent the maximum value of the ratios across the samples. This means that if an isoform have a ratio_TSS greater than 1 it is more likely that its TSS is true. Meanwhile, if the ratio_TSS is close or lower than 1, the SR coverage is similar inside and outside the isoform, something that we wouldn't expect if the TSS was true.

Glossary of junctions file columns (junctions.txt)

The _junctions.txt file contains the following columns:

  1. isoform: Isoform ID.
  2. junction_number: The i-th junction of the isoform.
  3. chrom: Chromosome.
  4. strand: Strand.
  5. genomic_start_coord: Start of the junction (1-based), note that if on - strand, this would be the junction acceptor site instead.
  6. genomic_end_coord: End of the junction (1-based), note that if on - strand, this would be the junction donor site instead.
  7. transcript_coord: Currently not implemented. Ignore.
  8. junction_category: known if the (donor-acceptor) combination is annotated in the GTF file, novel otherwise. Note that it is possible to have a novel junction even though both the donor and acceptor site are known, since the combination might be novel.
  9. start_site_category: known if the junction start site is annotated. If on - strand, this is actually the donor site.
  10. end_site_category: known if the junction end site is annotated. If on - strand, this is actually the acceptor site.
  11. diff_to_Ref_start_site: distance to closest annotated junction start site. If on - strand, this is actually the donor site.
  12. diff_to_Ref_end_site: distance to closest annotated junction end site. If on - strand, this is actually the acceptor site.
  13. bite_junction: Applies only to novel splice junctions. If the novel intron partially overlaps annotated exons the bite value is TRUE, otherwise it is FALSE.
  14. splice_site: Splice motif.
  15. RTS_junction: TRUE if junction is predicted to a template switching artifact.
  16. indel_near_junct: TRUE if there is alignment indel error near the junction site, indicating potential junction incorrectness.
  17. sample_with_cov: If --coverage (short read junction coverage info) is provided, shows the number of samples (cov files) that have uniquely mapped short read that support this junction.
  18. total_coverage_unique/multi: Total number of uniquely or multi-mapped short read support from all samples that cover this junction.