SQANTI SIM: design

Introduction

sqanti-sim.py design performs two main functions: produces a reduced transcriptome reference annotation and sets the expression levels to the annotated transcripts. This tool takes as input the reference annotation in GTF format and the index file output by sqanti-sim.py classif (prefix_index.tsv). SQANTI-SIM classifies the transcripts as "novel" or "known" based on a user-defined number of transcripts for each structural category, with FSM denoting known transcripts and ISM, NIC, and NNC being potential new transcripts. This tool removes transcripts classified as "novel" in this way from the annotation to produce a "reduced" GTF file. Since all of the transcripts requested to be simulated as novel are missing from the "reduced" GTF file, they will appear as novel to any pipeline that uses this annotation as a reference.

In order to remove the target transcripts without changing their structural category, the algorithm randomizes all transcripts from the reference and searches for the requested structural categories one by one. Once a matching transcript is found, the associated reference gene is anchored so that other reference transcripts from that gene cannot be deleted, except in the case of ISM, where the reference transcript is anchored instead. Any matched transcript that is already anchored will not be removed to ensure that previously classified transcripts maintain their requested structural category. To avoid simulating miRNAs or small RNA, sqanti-sim.py design only targets transcripts with a minimum length of 200 bp.

Following the selection of novel and known transcripts to simulate, expression values must be assigned to them. This stage is designed to be highly customizable and flexible, relying on three approaches:

equal: This mode will assign the same expression value to all the simulated transcripts, not considering whether they are novel or known. The same number of reads will be requested for all the transcripts, thus, generating a very controlled and steady simulation scenario.
custom: This mode permits to customize two different NB distributions (one for known and one for new transcripts) to highlight the difference in expression levels between new and known transcripts. The expression values for novel and known transcripts can be differentiated by changing the number of successes n and the probability of success p for two different NB distributions. Values are drawn at random from these distributions, swapping all zero values by one, ensuring that all targeted transcripts have at least one simulated read.
sample: Finally, expression values can be obtained using a real expression distribution learned from an existent lrRNA-seq sample (FASTA or FASTQ format). A raw counts distribution of the mapped reads is generated after the reads have been aligned to the transcriptome using Minimap2 ignoring secondary and supplementary alignments. A random expression value from the original distribution is assigned to each target transcript using this bootstrapping methodology. However, the --diff_exp parameter from the sample approach allows the simulation of different expression values for novel and known transcripts by generating an inverted vector of probabilities to assign lower expression values to novel transcripts and higher to known ones. You can customize this probability vector by selecting the start (--low_prob) and end (--high_prob) probabilities of the range. Additionally, SQANTI-SIM can capture the distribution of the number of isoforms of a gene that are expressed (--iso_complex).

design_modes

SQANTI-SIM design outputs a reduced reference annotation in GTF format and the index file with new columns indicating whether the read is novel or known, as well as the requested expression values for each transcript. IMPORTANT: The modified reference annotation must be used as reference by your transcript identification pipeline.

Usage

SQANTI-SIM design mode usage:

python sqanti-sim.py design [-h] {equal,custom,sample} ...

With the --help option you can display a complete description of the arguments:

(SQANTI-SIM.env)$ python sqanti-sim.py design --help

sqanti-sim.py design parse options

optional arguments:
  -h, --help            show this help message and exit

subcommands:
  Different modes to generate the expression matrix: equal (simulate with
  equal coverage for all reads), custom (simulate with diferent negative
  binomial distributions for novel and known transcripts) or sample
  (simulate using a real sample)

  {equal,custom,sample}
    equal               Run in equal mode
    custom              Run in custom mode
    sample              Run in sample mode

equal mode

(SQANTI-SIM.env)$ python sqanti-sim.py design equal --help

usage: sqanti-sim.py design equal [-h] -i TRANS_INDEX --gtf GTF [-o OUTPUT]
                                 [-d DIR] [-nt TRANS_NUMBER]
                                 [--read_count READ_COUNT] [--ISM ISM]
                                 [--NIC NIC] [--NNC NNC] [--Fusion FUSION]
                                 [--Antisense ANTISENSE] [--GG GG] [--GI GI]
                                 [--Intergenic INTERGENIC] [-k CORES]
                                 [-s SEED]

optional arguments:
  -h, --help            show this help message and exit
  -i TRANS_INDEX, --trans_index TRANS_INDEX
                        File with transcript information generated with
                        SQANTI-SIM (*_index.tsv)
  --gtf GTF             tComplete reference annotation in GTF format
  -o OUTPUT, --output OUTPUT
                        Prefix for output files
  -d DIR, --dir DIR     Directory for output files (default: .)
  -nt TRANS_NUMBER, --trans_number TRANS_NUMBER
                        Total number of transcripts to simulate
  --read_count READ_COUNT
                        Number of reads to simulate
  --ISM ISM             Number of incomplete-splice-matches to simulate
  --NIC NIC             Number of novel-in-catalog to simulate
  --NNC NNC             Number of novel-not-in-catalog to simulate
  --Fusion FUSION       Number of Fusion to simulate
  --Antisense ANTISENSE
                        Number of Antisense to simulate
  --GG GG               Number of Genic-genomic to simulate
  --GI GI               Number of Genic-intron to simulate
  --Intergenic INTERGENIC
                        Number of Intergenic to simulate
  -k CORES, --cores CORES
                        Number of cores to run in parallel
  -s SEED, --seed SEED  Randomizer seed

Running the equal mode with the minimum input will look as follows:

(SQANTI-SIM.env)$ python sqanti-sim.py design equal \
			--gtf reference_annotation.gtf \
			--trans_index prefix_index.tsv \
			--read_count 50000

custom mode

(SQANTI-SIM.env)$ python sqanti-sim.py design custom --help

usage: sqanti-sim.py design custom [-h] -i TRANS_INDEX --gtf GTF [-o OUTPUT]
                                  [-d DIR] [-nt TRANS_NUMBER]
                                  [--nbn_known NBN_KNOWN]
                                  [--nbp_known NBP_KNOWN]
                                  [--nbn_novel NBN_NOVEL]
                                  [--nbp_novel NBP_NOVEL] [--ISM ISM]
                                  [--NIC NIC] [--NNC NNC] [--Fusion FUSION]
                                  [--Antisense ANTISENSE] [--GG GG] [--GI GI]
                                  [--Intergenic INTERGENIC] [-k CORES]
                                  [-s SEED]

optional arguments:
  -h, --help            show this help message and exit
  -i TRANS_INDEX, --trans_index TRANS_INDEX
                        File with transcript information generated with
                        SQANTI-SIM (*_index.tsv)
  --gtf GTF             tComplete reference annotation in GTF format
  -o OUTPUT, --output OUTPUT
                        Prefix for output files
  -d DIR, --dir DIR     Directory for output files (default: .)
  -nt TRANS_NUMBER, --trans_number TRANS_NUMBER
                        Total number of transcripts to simulate
  --nbn_known NBN_KNOWN
                        Average read count per known transcript to simulate
                        (i.e., the parameter 'n' of the Negative Binomial
                        distribution)
  --nbp_known NBP_KNOWN
                        The parameter 'p' of the Negative Binomial
                        distribution for known transcripts
  --nbn_novel NBN_NOVEL
                        Average read count per novel transcript to simulate
                        (i.e., the parameter 'n' of the Negative Binomial
                        distribution)
  --nbp_novel NBP_NOVEL
                        The parameter 'p' of the Negative Binomial
                        distribution for novel transcripts
  --ISM ISM             Number of incomplete-splice-matches to simulate
  --NIC NIC             Number of novel-in-catalog to simulate
  --NNC NNC             Number of novel-not-in-catalog to simulate
  --Fusion FUSION       Number of Fusion to simulate
  --Antisense ANTISENSE
                        Number of Antisense to simulate
  --GG GG               Number of Genic-genomic to simulate
  --GI GI               Number of Genic-intron to simulate
  --Intergenic INTERGENIC
                        Number of Intergenic to simulate
  -k CORES, --cores CORES
                        Number of cores to run in parallel
  -s SEED, --seed SEED  Randomizer seed

Running the custom mode with the minimum input will look as follows:

(SQANTI-SIM.env)$ python sqanti-sim.py design custom \
			--gtf reference_annotation.gtf \
			--trans_index prefix_index.tsv \
			--nbn_known 15 --nbp_known 0.5 \
			--nbn_novel 5 --nbp_novel 0.5

sample mode

(SQANTI-SIM.env)$ python sqanti-sim.py design sample --help

usage: sqanti-sim.py design sample [-h] -i TRANS_INDEX --gtf GTF [-o OUTPUT]
                                  [-d DIR] [-nt TRANS_NUMBER] --genome GENOME
                                  [--pb_reads PB_READS | --ont_reads ONT_READS | --mapped_reads MAPPED_READS]
                                  [--iso_complex] [--diff_exp]
                                  [--low_prob LOW_PROB]
                                  [--high_prob HIGH_PROB] [--ISM ISM]
                                  [--NIC NIC] [--NNC NNC] [--Fusion FUSION]
                                  [--Antisense ANTISENSE] [--GG GG] [--GI GI]
                                  [--Intergenic INTERGENIC] [-k CORES]
                                  [-s SEED]

optional arguments:
  -h, --help            show this help message and exit
  -i TRANS_INDEX, --trans_index TRANS_INDEX
                        File with transcript information generated with
                        SQANTI-SIM (*_index.tsv)
  --gtf GTF             Complete reference annotation in GTF format
  -o OUTPUT, --output OUTPUT
                        Prefix for output files
  -d DIR, --dir DIR     Directory for output files (default: .)
  -nt TRANS_NUMBER, --trans_number TRANS_NUMBER
                        Total number of transcripts to simulate
  --genome GENOME       Reference genome FASTA
  --pb_reads PB_READS   Input PacBio reads for characterization in FASTA or
                        FASTQ format
  --ont_reads ONT_READS
                        Input ONT reads for characterization in FASTA or FASTQ
                        format
  --mapped_reads MAPPED_READS
                        Aligned reads in SAM format
  --iso_complex         If used the program will simulate the expressed
                        isoform complexity (number of isoforms per gene)
  --diff_exp            If used the program will simulate different expression
                        values for novel and known transcripts
  --low_prob LOW_PROB   Low value of prob vector (if --diff_exp)
  --high_prob HIGH_PROB
                        High value of prob vector (if --diff_exp)
  --ISM ISM             Number of incomplete-splice-matches to simulate
  --NIC NIC             Number of novel-in-catalog to simulate
  --NNC NNC             Number of novel-not-in-catalog to simulate
  --Fusion FUSION       Number of Fusion to simulate
  --Antisense ANTISENSE
                        Number of Antisense to simulate
  --GG GG               Number of Genic-genomic to simulate
  --GI GI               Number of Genic-intron to simulate
  --Intergenic INTERGENIC
                        Number of Intergenic to simulate
  -k CORES, --cores CORES
                        Number of cores to run in parallel
  -s SEED, --seed SEED  Randomizer seed

Running the sample mode with the minimum input and mode-specific parameters will look as follows:

(SQANTI-SIM.env)$ python sqanti-sim.py design sample \
			--gtf reference_annotation.gtf \
			--trans_index prefix_index.tsv \
			--genome reference_genome.fasta \
			--pb_reads reads.fastq

Arguments detailed explanation

Required input

These are the minimum parameters you will need to run sqanti-sim.py design:

Design mode: This is the first positional argument right after "design" (sqanti-sim.py design <mode>). It can take three different values: equal (simulate with equal coverage for all reads), custom (simulate with different negative binomial distributions for novel and known transcripts) or sample (simulate using a real sample).
Transcript index file (-i): This file is the prefix_index.tsv file generated in the previous classif step. New columns and information will be added to this file, so you should track it and use the most updated version of this file in the next steps of the SQANTI-SIM pipeline.
Reference annotation in GTF format (--gtf): This file must be the same as the one used in the classif step, so the transcript names and references are properly matched. From this file, SQANTI-SIM will generate a modified reference annotation to use as a reference in your transcript reconstruction pipeline. An example of a reference transcriptome and its required format is GENCODE.
Reference genome in FASTA format (--genome): Only required for sample mode, if not ignored. This is the reference genome in FASTA format. It will be used together with the reference annotation to extract the transcript sequences in FASTA format using GffRead.
Long-read mRNA sequences (--pb_reads/--ont_reads/--mapped_reads): Only required for sample mode, if not ignore. A real sample of Long-read mRNA sequences in FASTA or FASTQ format, or the aligned reads in SAM format, to quantify and extract the expression values. The reads will be mapped with Minimap2, and the expression distribution will be generated from the raw counts of the alignment.

Optional input

Novel transcripts to simulate: The number of transcripts of each structural category to simulate as novel. Use --ISM to simulate Incomplete Splice Match transcripts, --NIC for Novel In Catalog, --NNC for Novel not In Catalog, --Fusion for Fusion, --Antisense for Antisense, --GG for Genig-genomic, --GI for Genic-intron and --Intergenic for Intergenic transcripts. Use this parameter followed by the number of transcripts to simulate with that structural category.
Transcripts to simulate (-nt): This is the total number of transcripts to simulate; this means that this will be all the different transcripts that will be simulated. The number of novel transcripts will be subtracted from this value. The rest will be simulated from known annotated transcripts kept in the modified reference annotation. Then, the number of different known transcripts to simulate will be the requested number of transcripts to simulate (-nt) minus the number of novel transcripts (sum of different structural categories).
Output prefix (-o): The output prefix for the index file. SQANTI-SIM will use "sqanti-sim" as the default prefix.
Output directory (-d): Output directory for output files. SQANTI-SIM will use the directory where the script was run as the default output directory.
Parallelization (-k): Number of cores to run in parallel. Most SQANTI-SIM modes have code chunks that can be run in parallel. However, the default option is to run SQANTI-SIM in one single thread.
Number of reads (--read_count): Can only be used for equal mode, if not ignored. The number of total reads to simulate. It will be divided by the number of total transcripts to compute the requested counts for each transcript and the TPM values. All transcripts will have the same TPM and count values.
Isoform complexity per gene (--iso_complex): Can only be used in sample mode. By default, SQANTI-SIM chooses the known transcripts randomly to simulate; however, if this argument is used, SQANTI-SIM will mimic the isoform expression complexity from a given sample, i.e., it will simulate transcripts so that the expressed isoforms per gene follow the same distribution as the real sample (--pb_reads/--ont_reads). This procedure may alter the total number of simulated transcripts.
Negative binomial distribution parameters: Can only be used in custom mode, if not ignored. When running in custom modes, expression values are taken from 2 different negative binomial distributions, one for novel transcripts and the other for known so that you can customize the differences in expression levels. To generate both negative binomial distributions you need to specify the parameter n successes (--nbn_known and --nbn_novel) and p probability of success (--nbp_known and --nbp_novel) where n is > 0 and p is in the interval [0,1].
Different expression novel and known (--diff_exp): Can only be used in sample mode, if not ignored. sample mode by default simulates the same level of expression for novel and known transcripts. However, this parameter can be used to simulate different levels of expressions. It uses a probability vector in the interval [0,1] and its inverse to assign probability values for each count value of the expression distribution. You can decide the start and end of the range of the probability vector using --low_prob as the start probability and --high_prob as the end probability.

Output explanation

prefix_modified.gtf

This file contains the reference annotation in GTF format without those transcripts classified as novel in the index file. This file has the same content as the original reference annotation except for those novel transcripts and those genes left without a single annotated transcript when removing the novel transcripts. It is mandatory to use this file as the reference annotation in your transcript identification pipeline to presume novelty.

prefix_index.tsv

This is the main index file with the structural annotation of the reference transcripts. This file will be requested and modified in each other SQANTI-SIM mode. It must be parsed using -i/--trans_index. After running the SQANTI-SIM design mode, this file will contain three extra columns indicating whether the read is novel or known, as well as the requested expression values for each transcript. The columns of the prefix_index.tsv file are described below:

transcript_id: The transcript id of the query transcript taken from the reference annotation.
gene_id: The reference gene where this transcript comes from.
structural_category: The potential structural category in which this transcript could be classified if simulated as a novel.
associated_gene: The reference gene that confers the query transcript's structural category. If there is more than one associated gene, they will be written separated by "_", e.g., Gene1_Gene2_Gene3.
associated_trans: The reference transcript name that confers the structural category. If there is no specific associated transcript, "novel" will appear instead.
chrom: Chromosome.
strand: Strand.
exons: Number of exons.
donors: End of the junction (if - strand, it will be the start of the junction). Values are 1-based.
acceptors: Start of the junction (if - strand, it will be the end of the junction). Values are 1-based.
TSS_genomic_coord: Start site of the transcript. Values are 1-based.
TTS_genomic_coord: Termination site of the transcript. Values are 1-based.
length: Length of the transcript.
sim_type: If the transcript is present in the modified reference annotation (known) or not (novel).
requested_counts: The requested reads to simulate in the sim step.
requested_tpm: The requested TPM value to simulate in the sim step.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly