Last Updated: 02/09/2021 (v1.6.1) Update on isoAnnot lite and minor changes to v1.6
SQANTI3 is the newest version of the SQANTI tool (publication) that merges features from SQANTI, (code repository) and SQANTI2 (code repository), together with new additions. SQANTI3 will continue as an integrated development aiming to providing you the best characterization possible for your new long read-defined transcriptome. SQANTI3 is the first module of the Functional IsoTranscriptomics (FIT) framework, that also includes IsoAnnot and tappAS.
New features implemented in SQANTI3 not available in previous versions are:
- IsoAnnotLite implemented to generate tappAS compatible gff3 files. Gff3 output may incorporate functional annotation labels for model species supported by tappAS.
- CAGE peak definitions: CAGE peak will only be associated to a transcript when located upstream of the transcript stariting site. This option requires CAGE peak data.
- Updated
bite
definition and ISM subcategories5prime_fragment
and3prime_fragment
. - Accepts several Short Reads expression files at the same time or as an expression matrix.
- New plots:
- Saturation curves (number of transcripts detected as a function of the sequencing depth)
- Distance to CAGE (if any).
- Reference transcript redundance (number of FSM and ISM with the same reference transcript match)
- Installation provided as a Conda yml environment file
- Setting up SQANTI3
- Running SQANTI3 Quality Control
- Filtering Isoforms using SQANTI3
- SQANTI3 Output Explanation
2021.02.09 - updated to v1.6.1. SQANTI3 now reports in the classification file the genomic coordinates of CAGE peaks. New version of isoAnnot Lite (v2.1).
2020.12.16 - updated to v1.6. Fixed SQANTI3_report.R
num isoforms per gene plotting error.
2020.12.16 - updated to v1.5. Fixed mono-exonic mis-classification of antisense in edge cases.
2020.09.24 - updated to v1.3. Fixed minor printing bug in SQANTI3_report.R
. Also sqanti_qc3.py
supports file patterns again for coverage junction file input.
2020.09.23 - updated to v1.2. Fixed SQANTI3_report.R
bug when there are no non-canonical junctions.
2020.09.15 - updated to v1.1. Fixed report script _classification_TPM.txt
linebreak issue. rarefaction disabled for now.
2020.09.07 - updated to v1.0. Fixed SQANTI3 report missing figures (ex: intrapriming).
2020.05.12 - SQANTI3 first release.
- Perl
- Minimap2
- Python (3.7)
- pysam
- psutil
- bx-python
- BioPython
- BCBioGFF
- cDNA_Cupcake
- R (>= 3.4.0)
- R packages (for
sqanti3_qc.py
): ggplot2, scales, reshape, gridExtra, grid, dplyr, NOISeq, ggplotify
- gtfToGenePred
We recommend using Anaconda to substantially facilitate installation of all Python dependencies. Probably you already have Anaconda installed because you use Iso-Seq3 or cDNA_Cupcake. Please, follow the steps here to ensure an error-free installation. All the dependencies will be installed automatically in a conda environment, except for cDNA_Cupcake (more details in step 5). The installation will be done just once. When the environment has been entirely built, you just need to activate the conda environment of SQANTI3 and run it!
(0) Make sure you have installed Anaconda and it is updated. Find here the generic Anaconda installation for Linux environment. Currently only Linux environments are supported.
export PATH=$HOME/anacondaPy37/bin:$PATH
conda -V
conda update conda
(1) Download or clone the SQANTI3 repository.
git clone https://github.com/ConesaLab/SQANTI3.git
(2) Move into the SQANTI3 folder and create a virutal environment with all the Python packages required for installation by running the SQANTI3.conda_env.yml
script that you can find in the main folder. This script contains all the information necessary to install the required dependencies. Type y
to agree to the interactive questions. You can change the name of the environment by using -n option. By default, the name of the environment will be SQANTI3.env.
cd SQANTI3
conda env create -f SQANTI3.conda_env.yml
source activate SQANTI3.env
(3) Once you have activated the virtual environment, you should see your prompt changing to something like this:
(SQANTI3.env)$
(4) You also need to install gtfToGenePred that seems to have some issues with Python 3.7 (or openssl) when installed though conda. At this point, the easiest solution is to download it from UCSC Download Page and add it to the SQANTI3/utilities
folder (and give it execute permissions):
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred -P <path_to>/SQANTI3/utilities/
chmod +x <path_to>/SQANTI3/utilities/gtfToGenePred
(5) If you don't already have cDNA_Cupcake installed, you can do that now. First, check that you have already activated the SQANTI3.env
environment and then proceed with the following commands:
(SQANTI3.env)$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
(SQANTI3.env)$ cd cDNA_Cupcake
(SQANTI3.env)$ python setup.py build
(SQANTI3.env)$ python setup.py install
No installation for SQANTI3 itself is required. The scripts can be run directly.
Running SQANTI3 Quality Control (please read completely specially when planning to use tappAS for downstream analysis)
Before starting any SQANTI3 run, remember that you need to activate the SQANTI3 environment and add cDNA_Cupcake/sequence
to $PYTHONPATH
.
$ source activate SQANTI3.env
(SQANTI3.env)-bash-4.1$ export PYTHONPATH=$PYTHONPATH:<path_to>/cDNA_Cupcake/sequence/
This are the minimal files that you will need to run SQANTI3:
-
Long read-defined transcriptome. It can be obtained from any of the available Third Generation Sequencing techonologies like Iso-Seq (PacBio) or Nanopore. SQANTI3 accepts it in several formats such as FASTA, FASTQ and GTF (using
--gtf
option). If you provide the sequences of your transcripts, a mapping step will be performed initially with minimap2. It is strongly recommended to collapse sequences into unique transcripts BEFORE running SQANTI3 using cDNA_Cupcake or TAMA. -
Reference annotation in GTF format. This file will be taken as reference to describe the degree of novelty of each transcript. Some examples of reference transcriptomes can be, for example, GENCODE or CHESS.
-
Reference genome, in FASTA format. For example hg38. Make sure your annotation GTF is based on the correct ref genome version! Please check that the chromosome/scaffolds names are the same in the reference annotation and the reference genome.
-
CAGE Peak data (from FANTOM5). In SQANTI2, it was provided a version of CAGE Peak for hg38 genome which was originally from FANTOM5.
-
Intropolis Junction BED file. In previous versions of SQANTI, it was provided a version of Intropolis for hg38 genome, modified into STAR junction format which is still valid.
-
polyA motif list. A ranked list of polyA motifs to find upstream of the 3' end site. See human polyA list for an example.
-
FL count information. See FL count section to include Iso-Seq FL count information for each isoform.
-
Short read expression. See Short Read Expression section to include short read expression (RSEM or Kallisto output files or a user-defined expression matrix).
-
tappAS-annotation file. A gff3 file which contains functional annotations of a reference transcriptome (e.g. Ensembl) at the isoform level. When
--isoAnnotLite
option is activated and a gff3 tappAS-like is provided, SQANTI3 will run internally isoAnnot Lite. You can find some of them in this repository. For more information about tappAS functionalities visit its webpage.
The script usage is:
python sqanti3_qc.py [-h] [--min_ref_len MIN_REF_LEN] [--force_id_ignore]
[--aligner_choice {minimap2,deSALT,gmap}]
[--cage_peak CAGE_PEAK]
[--polyA_motif_list POLYA_MOTIF_LIST]
[--polyA_peak POLYA_PEAK] [--phyloP_bed PHYLOP_BED]
[--skipORF] [--is_fusion] [-g] [-e EXPRESSION]
[-x GMAP_INDEX] [-t CPUS] [-n CHUNKS] [-o OUTPUT]
[-d DIR] [-c COVERAGE] [-s SITES] [-w WINDOW]
[--genename] [-fl FL_COUNT] [-v] [--isoAnnotLite]
[--gff3 GFF3]
isoforms annotation genome
If you don't feel like running the ORF prediction part, use --skipORF
. Just know that all your transcripts will be annotated as non-coding.
If you have short read data, you can run STAR to get the junction file (usually called SJ.out.tab
, see STAR manual) and supply it to SQANTI3 as is. If you have several samples, it is recommended to provide them as separated *SJ.out.tab
files.
By way of example, the following command was used to run SQANTI3 with th example
data. The input files are a subdivision of the Melanoma Cancer Cell Line IsoSeq Data that corresponds just to those polished sequences that map to chromosome 13.
python sqanti3_qc.py --gtf input.gtf \
gencode.v35.annotation.gtf \
hg38.fa \
--cage_peak hg38.cage_peaks.chr13.bed --polyA_motif_list polyA.list \
--expression rsemQuantification.chr13.isoforms.results \
--fl_count input.abundances.txt \
-c star.SJ.out.tab \
--isoAnnotLite --gff3 tappAS.input.gff3
It is highly recommended that you run SQANTI3 using an input GFF3/GTF (adding the --gtf
option in front) instead of input fasta/fastq.
You can obtain the input GFF3/GTF using Cupcake collapse
There are two options related to parallelization. The first is -t
(--cpus
) that designates the number of CPUs used by the aliger.
If your input is GTF (using --gtf
option), the -t
option has no effect.
The second is -n
(--chunks
) that chunks the input (GTF or fasta) into chunks and run SQANTI3 in parallel before combining them.
Note that if you have -t 30 -n 10
, then each chunk gets (30/10=3) CPUs.
You can look at the example_out subfolder for a sample output. The PDF file shows all the figures drawn using R script SQANTI3_report.R, taking the melanoma_chr13_classification.txt
and melanoma_chr13_junctions.txt
of the same folder as input. If you know R well, you are free to modify the R script to add new figures! We will be constantly adding new figures as well, so check out the updates section
Detailed explanation of _classification.txt
and _junctions.txt
below.
CAGE peak data must be provided in BED file format where:
- column 1 is chromosome (chr1, chr2...)
- column 2 is 0-based start coordinate
- column 3 is 1-based end coordinate
- column 5 is strand information
For human and mouse, you can download refTSS BED files.
Provide the CAGE peak BED file with the --cage_peak
parameter in sqanti3_qc.py
.
Distance to the closest CAGE peak signal is the dist_peak
and within_peak
columns in the output .classification.txt
file.
See here for more info.
PolyA site data must be provided in BED file format where:
- column 1 is chromosome (chr1, chr2...)
- column 2 is 0-based start coordinate
- column 3 is 1-based end coordinate
- column 5 is strand information
For human, mouse, and worm, you can download the Atlas BED files.
Note however that the chromosomes in the BED files are missing the "chr" prefix!
You can modify the downloaded bed file as shown below:
sed -i 's/^/chr/' atlas.clusters.2.0.GRCh38.96.bed
Provide the PolyA site BED file with the --polyA_peak
parameter in sqanti3_qc.py
.
Distance to the closest PolyA site is the dist_to_polya_site
and within_polya_site
columns in the output .classification.txt
file.
See here for more info.
In addition, you can also provide a polyA motif list in the form of a text file. For example, the common polyA motifs for human are shown here
Provide the PolyA site BED file with the --polyA_motif_list
parameter in sqanti3_qc.py
.
Distance to the closest PolyA site is the polyA_motif
and polyA_dist
columns in the output .classification.txt
file.
See here for more info.
sqanti3_qc.py
supports single or multi-sample FL counts from PacBio Iso-Seq pipeline. There are three acceptable formats.
A single sample FL Count file is automatically produced by the Iso-Seq With Mapping pipeline in SMRTLink/SMRTAnalysis with the following format:
#
# -----------------
# Field explanation
# -----------------
# count_fl: Number of associated FL reads
# norm_fl: count_fl / total number of FL reads, mapped or unmapped
# Total Number of FL reads: 1065
#
pbid count_fl norm_fl
PB.1.1 2 1.8779e-03
PB.1.2 6 5.6338e-03
PB.1.3 3 2.8169e-03
PB.2.1 3 2.8169e-03
PB.3.1 2 1.8779e-03
PB.4.1 8 7.5117e-03
A multi-sample FL Count file produced by the chain_samples.py script in Cupcake will have the following format:
superPBID | sample1 | sample2 |
---|---|---|
PB.1.2 | 3 | NA |
PB.2.1 | 2 | NA |
PB.3.1 | 2 | 2 |
PB.3.2 | 5 | 3 |
PB.3.3 | 5 | 2 |
This is a tab-delimited file.
A multi-sample Demux FL Count file produced by the demux_isoseq_with_genome.py script in Cupcake will have the following format:
id,sample1,sample2
PB.1.1,3,10
PB.1.2,0,11
PB.1.3,4,4
This is a comma-delimited file.
For each sample provided through the --fl_count
option, sqanti3_qc.py
will create a column in the _classification.txt
output file that is FL.<sample>
. Note that this is the raw FL count provided. The sum of all the FL reads accross the samples associated to one transcript will be recorded in th FL
column of the _classification.txt
output file.
When plotted, the script SQANTI3_report.R will convert the FL counts to TPM using the formula:
raw FL count for PB.X.Y in sample1
FL_TPM(PB.X.Y,sample1) = ------------------------------------- x 10^6
total FL count in sample1
Two additional columns, FL_TPM.<sample>
and FL_TPM.<sample>_log10
will be added and output to a new classification file with the suffix _classification_TPM.txt
. Please, do not mix up _classification_TPM.txt
and _classification.txt
files. The one used in subsequent scripts (filtering, future isoAnnot, etc.) will be the _classification.txt
one.
Use --expression
to optionally provide short read expression data. Two formats are supported if you input one file per sample. In any case, you can provide several expression data files as a chain of comma-separated paths or by providing a directory were ONLY expression data is present. If more than one file is provided, iso_exp
column will represent the average expression value accross all the files.
There is also the possibility of using as input an expression matrix, as a tab-separated file, in which each transcript will be a row and each sample/replicate a column. Transcript idientifiers must be the same than the ones described in the Long read-defined transcriptome used as input and the name for that column must be ID
. The rest of the column names in the header should be the sample/replicate identifier.
Kallisto expression files have the format:
target_id length eff_length est_counts tpm
PB.1.1 1730 1447.8 0 0
PB.1.3 1958 1675.8 0 0
PB.2.1 213 54.454 0 0
PB.2.2 352 126.515 0 0
PB.3.1 153 40.3918 0 0
PB.4.1 1660 1377.8 0 0
PB.5.1 2767 2484.8 0 0
RSEM expression files have the format:
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct posterior_mean_count posterior_standard
_deviation_of_count pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_boun
d FPKM_ci_upper_bound
PB.1.1 PB.1 1516 1369.11 8.00 0.05 0.22 100.00 8.00 0.00 0.06 0.25 100.00 0.0233365 0.0984532 0.
0981584 0.413561
PB.10.1 PB.10 1101 954.11 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.04 100.00 4.80946e-08 0.0283585 2.
01809e-07 0.11905
PB.100.1 PB.100 979 832.11 0.00 0.00 0.00 0.00 6.62 6.60 0.08 0.35 4.00 3.51054e-06 0.241555 1.47313e-05 1.01397
PB.100.2 PB.100 226 81.11 20.18 2.26 9.47 100.00 16.84 5.20 1.99 8.36 96.00 0.559201 3.45703 2.
34572 14.5141
I've made a lightweight filtering script based on SQANTI3 output that filters for two things: (a) intra-priming and (b) short read junction support.
The script usage is:
usage: sqanti3_RulesFilter.py [-h] [--sam SAM] [--faa FAA] [-a INTRAPRIMING]
[-r RUNALENGTH] [-m MAX_DIST_TO_KNOWN_END]
[-c MIN_COV] [--filter_mono_exonic] [--skipGTF]
[--skipFaFq] [--skipJunction] [-v]
sqanti_class isoforms gtf_file
sqanti3_RulesFilter.py: error: the following arguments are required: sqanti_class, isoforms, gtf_file
python sqanti3_RulesFilter.py [classification] [fasta] [sam] [gtf]
[-a INTRAPRIMING] [-c MIN_COV] [-m MAX_DIST_TO_KNOWN_END]
where -a
determines the fraction of genomic 'A's above which the isoform will be filtered. The default is -a 0.6
.
-r
is another option for looking at genomic 'A's that looks at the immediate run-A length. The default is -r 6
.
-m
sets the maximum distance to an annotated 3' end (the diff_to_gene_TTS
field in classification output) to offset the intrapriming rule.
-c
is the filter for the minimum short read junction support (looking at the min_cov
field in _classification.txt
), and can only be used if you have short read data.
For example:
python sqanti3_RulesFilter.py test_classification.txt \
test.renamed_corrected.fasta \
test.gtf
The current filtering rules are as follow:
- If a transcript is FSM, then it is kept unless the 3' end is unreliable (intrapriming).
- If a transcript is not FSM, then it is kept only if all of below are true:
- (1) 3' end is reliable.
- (2) does not have a junction that is labeled as RTSwitching.
- (3) all junctions are either canonical or has short read coverage above
-c
threshold.
SQANTI/SQANTI2/SQANTI3 categorize each isoform by finding the best matching reference transcript in the following order:
-
FSM (Full Splice Match): meaning the reference and query isoform have the same number of exons and each internal junction agree. The exact 5' start and 3' end can differ by any amount.
-
ISM (Incomplete Splice Match): the query isoform has fewer 5' exons than the reference, but each internal junction agree. The exact 5' start and 3' end can differ by any amount.
-
NIC (Novel In Catalog): the query isoform does not have a FSM or ISM match, but is using a combination of known donor/acceptor sites.
-
NNC (Novel Not in Catalog): the query isoform does not have a FSM or ISM match, and has at least one donor or acceptor site that is not annotated.
-
Antisense: the query isoform does not have overlap a same-strand reference gene but is anti-sense to an annotated gene.
-
Genic Intron: the query isoform is completely contained within an annotated intron.
-
Genic Genomic: the query isoform overlaps with introns and exons.
-
Intergenic: the query isoform is in the intergenic region.
Some of the classifications have further subtypes (the subtype
) field in SQANTI3 classification output. They are explained below.
Novel isoforms are subtyped based on whether they use a combination of known junctions (junctions are pairs of , sites), a combination of known splice sites (the individual donor and acceptor sites are known, but at least combination is novel), or at least one splice site (donor or acceptor) is novel.
The output _classification.txt
has the following fields:
isoform
: the isoform ID. Usually inPB.X.Y
format.chrom
: chromosome.strand
: strand.length
: isoform length.exons
: number of exons.structural_category
: one of the categories ["full-splice_match", "incomplete-splice_match", "novel_in_catalog", "novel_not_in_catalog", "genic", "antisense", "fusion", "intergenic", "genic_intron"]associated_gene
: the reference gene name.associated_transcript
: the reference transcript name.ref_length
: reference transcript length.ref_exons
: reference transcript number of exons.diff_to_TSS
: distance of query isoform 5' start to reference transcript start end. Negative value means query starts downstream of reference.diff_to_TTS
: distance of query isoform 3' end to reference annotated end site. Negative value means query ends upstream of reference.diff_to_gene_TSS
: distance of query isoform 5' start to the closest start end of any transcripts of the matching gene. This field is different fromdiff_to_TSS
since it's looking at all annotated starts of a gene. Negative value means query starts downstream of reference.diff_to_gene_TTS
: distance of query isoform 3' end to the closest end of any transcripts of the matching gene. Negative value means query ends upstream of reference.subcategory
: additional splicing categorization, separated by semi-colons. Categories include:mono-exon
,multi-exon
. Intron rentention is marked withintron_retention
.RTS_stage
: TRUE if one of the junctions could be a RT switching artifact.all_canonical
: TRUE if all junctions have canonical splice sites.min_sample_cov
: sample with minimum coverage.min_cov
: minimum junction coverage based on short read STAR junction output file. NA if no short read given.min_cov_pos
: the junction that had the fewest coverage. NA if no short read data given.sd_cov
: standard deviation of junction coverage counts from short read data. NA if no short read data given.FL
orFL.<sample>
: FL count associated with this isoform per sample if--fl_count
is provided, otherwise NA.n_indels
: total number of indels based on alignment.n_indels_junc
: number of junctions in this isoform that have alignment indels near the junction site (indicating potentially unreliable junctions).bite
: TRUE if contains at least one "bite" positive SJ.iso_exp
: short read expression for this isoform if--expression
is provided, otherwise NA.gene_exp
: short read expression for the gene associated with this isoform (summing over all isoforms) if--expression
is provided, otherwise NA.ratio_exp
: ratio ofiso_exp
togene_exp
if--expression
is provided, otherwise NA.FSM_class
: ignore this field for now.ORF_length
: predicted ORF length.CDS_length
: predicted CDS length.CDS_start
: CDS start.CDS_end
: CDS end.CDS_genomic_start
: genomic coordinate of the CDS start. If on - strand, this coord will be greater than the end.CDS_genomic_end
: genomic coordinate of the CDS end. If on - strand, this coord will be smaller than the start.predicted_NMD
: TRUE if there's a predicted ORF and CDS ends before the last junction; FALSE if otherwise. NA if non-coding.perc_A_downstreamTTS
: percent of genomic "A"s in the downstream 20 bp window. If this number if high (say > 0.8), the 3' end site of this isoform is probably not reliable.seq_A_downstream_TTS
: sequence of the downstream 20 bp window.dist_peak
: distance to closest TSS based on CAGE Peak data. Negative means upstream of TSS and positive means downstream of TSS. Strand-specific. SQANTI3 only searches for nearby CAGE Peaks within 10000 bp of the PacBio transcript start site. Will beNA
if none are found within 10000 bp.within_peak
: TRUE if the PacBio transcript start site is within a CAGE Peak.pos_cage_peak
: First genomic coordinate of the CAGE peak.polyA_motif
: if--polyA_motif_list
is given, shows the top ranking polyA motif found within 50 bp upstream of end.polyA_dist
: if--polyA_motif_list
is given, shows the location of the last base of the hexamer. Position 0 is the putative poly(A) site. This distance is hence always negative because it is upstream.
THe .junctions.txt
file shows every junction for every PB isoform. NOTE because of this the same junction might appear multiple times if they are shared by multiple PB isoforms.
isoform
: Isoform ID.junction_number
: The i-th junction of the isoform.chrom
: Chromosome.strand
: Strand.genomic_start_coord
: Start of the junction (1-based), note that if on - strand, this would be the junction acceptor site instead.genomic_end_coord
: End of the junction (1-based), note that if on - strand, this would be the junction donor site instead.transcript_coord
: Currently not implemented. Ignore.junction_category
:known
if the (donor-acceptor) combination is annotated in the GTF file,novel
otherwise. Note that it is possible to have anovel
junction even though both the donor and acceptor site are known, since the combination might be novel.start_site_category
:known
if the junction start site is annotated. If on - strand, this is actually the donor site.end_site_category
:known
if the junction end site is annotated. If on - strand, this is actually the acceptor site.diff_to_Ref_start_site
: distance to closest annotated junction start site. If on - strand, this is actually the donor site.diff_to_Ref_end_site
: distance to closest annotated junction end site. If on - strand, this is actually the acceptor site.bite_junction
: Applies only to novel splice junctions. If the novel intron partially overlaps annotated exons the bite value is TRUE, otherwise it is FALSE.splice_site
: Splice motif.RTS_junction
: TRUE if junction is predicted to a template switching artifact.indel_near_junct
: TRUE if there is alignment indel error near the junction site, indicating potential junction incorrectness.sample_with_cov
: If--coverage
(short read junction coverage info) is provided, shows the number of samples (cov files) that have short read that support this junction.total_coverage
: Total number of short read support from all samples that cover this junction.