-
Notifications
You must be signed in to change notification settings - Fork 1
SQANTI SIM: design
sqanti-sim.py design
performs two main functions: produces a reduced transcriptome reference annotation and sets the expression levels to the annotated transcripts. This tool takes as input the reference annotation in GTF format and the index file output by sqanti-sim.py classif
(prefix_index.tsv). SQANTI-SIM classifies the transcripts as "novel" or "known" based on a user-defined number of transcripts for each structural category, with FSM denoting known transcripts and ISM, NIC, and NNC being potential new transcripts. This tool removes transcripts classified as "novel" in this way from the annotation to produce a "reduced" GTF file. Since all of the transcripts requested to be simulated as novel are missing from the "reduced" GTF file, they will appear as novel to any pipeline that uses this annotation as a reference.
In order to remove the target transcripts without changing their structural category, the algorithm randomizes all transcripts from the reference and searches for the requested structural categories one by one. Once a matching transcript is found, the associated reference gene is anchored so that other reference transcripts from that gene cannot be deleted, except in the case of ISM, where the reference transcript is anchored instead. Any matched transcript that is already anchored will not be removed to ensure that previously classified transcripts maintain their requested structural category. To avoid simulating miRNAs or small RNA, sqanti-sim.py design
only targets transcripts with a minimum length of 200 bp.
Following the selection of novel and known transcripts to simulate, expression values must be assigned to them. This stage is designed to be highly customizable and flexible, relying on three approaches:
-
equal: This mode will assign the same expression value to all the simulated transcripts, not considering whether they are novel or known. The same number of reads will be requested for all the transcripts, thus, generating a very controlled and steady simulation scenario.
-
custom: This mode permits to customize two different NB distributions (one for known and one for new transcripts) to highlight the difference in expression levels between new and known transcripts. The expression values for novel and known transcripts can be differentiated by changing the number of successes n and the probability of success p for two different NB distributions. Values are drawn at random from these distributions, swapping all zero values by one, ensuring that all targeted transcripts have at least one simulated read.
-
sample: Finally, expression values can be obtained using a real expression distribution learned from an existent lrRNA-seq sample (FASTA or FASTQ format). A raw counts distribution of the mapped reads is generated after the reads have been aligned to the transcriptome using Minimap2 ignoring secondary and supplementary alignments. A random expression value from the original distribution is assigned to each target transcript using this bootstrapping methodology. However, the
--diff_exp
parameter from the sample approach allows the simulation of different expression values for novel and known transcripts by generating an inverted vector of probabilities to assign lower expression values to novel transcripts and higher to known ones. You can customize this probability vector by selecting the start (--low_prob
) and end (--high_prob
) probabilities of the range. Additionally, SQANTI-SIM can capture the distribution of the number of isoforms of a gene that are expressed (--iso_complex
).
SQANTI-SIM design outputs a reduced reference annotation in GTF format and the index file with new columns indicating whether the read is novel or known, as well as the requested expression values for each transcript. IMPORTANT: The modified reference annotation must be used as reference by your transcript identification pipeline.
SQANTI-SIM design mode usage:
python sqanti-sim.py design [-h] {equal,custom,sample} ...
With the --help
option you can display a complete description of the arguments:
(SQANTI-SIM.env)$ python sqanti-sim.py design --help
sqanti-sim.py design parse options
optional arguments:
-h, --help show this help message and exit
subcommands:
Different modes to generate the expression matrix: equal (simulate with
equal coverage for all reads), custom (simulate with diferent negative
binomial distributions for novel and known transcripts) or sample
(simulate using a real sample)
{equal,custom,sample}
equal Run in equal mode
custom Run in custom mode
sample Run in sample mode
(SQANTI-SIM.env)$ python sqanti-sim.py design equal --help
usage: sqanti-sim.py design equal [-h] -i TRANS_INDEX --gtf GTF [-o OUTPUT]
[-d DIR] [-nt TRANS_NUMBER]
[--read_count READ_COUNT] [--ISM ISM]
[--NIC NIC] [--NNC NNC] [--Fusion FUSION]
[--Antisense ANTISENSE] [--GG GG] [--GI GI]
[--Intergenic INTERGENIC] [-k CORES]
[-s SEED]
optional arguments:
-h, --help show this help message and exit
-i TRANS_INDEX, --trans_index TRANS_INDEX
File with transcript information generated with
SQANTI-SIM (*_index.tsv)
--gtf GTF tComplete reference annotation in GTF format
-o OUTPUT, --output OUTPUT
Prefix for output files
-d DIR, --dir DIR Directory for output files (default: .)
-nt TRANS_NUMBER, --trans_number TRANS_NUMBER
Total number of transcripts to simulate
--read_count READ_COUNT
Number of reads to simulate
--ISM ISM Number of incomplete-splice-matches to simulate
--NIC NIC Number of novel-in-catalog to simulate
--NNC NNC Number of novel-not-in-catalog to simulate
--Fusion FUSION Number of Fusion to simulate
--Antisense ANTISENSE
Number of Antisense to simulate
--GG GG Number of Genic-genomic to simulate
--GI GI Number of Genic-intron to simulate
--Intergenic INTERGENIC
Number of Intergenic to simulate
-k CORES, --cores CORES
Number of cores to run in parallel
-s SEED, --seed SEED Randomizer seed
Running the equal mode with the minimum input will look as follows:
(SQANTI-SIM.env)$ python sqanti-sim.py design equal \
--gtf reference_annotation.gtf \
--trans_index prefix_index.tsv \
--read_count 50000
(SQANTI-SIM.env)$ python sqanti-sim.py design custom --help
usage: sqanti-sim.py design custom [-h] -i TRANS_INDEX --gtf GTF [-o OUTPUT]
[-d DIR] [-nt TRANS_NUMBER]
[--nbn_known NBN_KNOWN]
[--nbp_known NBP_KNOWN]
[--nbn_novel NBN_NOVEL]
[--nbp_novel NBP_NOVEL] [--ISM ISM]
[--NIC NIC] [--NNC NNC] [--Fusion FUSION]
[--Antisense ANTISENSE] [--GG GG] [--GI GI]
[--Intergenic INTERGENIC] [-k CORES]
[-s SEED]
optional arguments:
-h, --help show this help message and exit
-i TRANS_INDEX, --trans_index TRANS_INDEX
File with transcript information generated with
SQANTI-SIM (*_index.tsv)
--gtf GTF tComplete reference annotation in GTF format
-o OUTPUT, --output OUTPUT
Prefix for output files
-d DIR, --dir DIR Directory for output files (default: .)
-nt TRANS_NUMBER, --trans_number TRANS_NUMBER
Total number of transcripts to simulate
--nbn_known NBN_KNOWN
Average read count per known transcript to simulate
(i.e., the parameter 'n' of the Negative Binomial
distribution)
--nbp_known NBP_KNOWN
The parameter 'p' of the Negative Binomial
distribution for known transcripts
--nbn_novel NBN_NOVEL
Average read count per novel transcript to simulate
(i.e., the parameter 'n' of the Negative Binomial
distribution)
--nbp_novel NBP_NOVEL
The parameter 'p' of the Negative Binomial
distribution for novel transcripts
--ISM ISM Number of incomplete-splice-matches to simulate
--NIC NIC Number of novel-in-catalog to simulate
--NNC NNC Number of novel-not-in-catalog to simulate
--Fusion FUSION Number of Fusion to simulate
--Antisense ANTISENSE
Number of Antisense to simulate
--GG GG Number of Genic-genomic to simulate
--GI GI Number of Genic-intron to simulate
--Intergenic INTERGENIC
Number of Intergenic to simulate
-k CORES, --cores CORES
Number of cores to run in parallel
-s SEED, --seed SEED Randomizer seed
Running the custom mode with the minimum input will look as follows:
(SQANTI-SIM.env)$ python sqanti-sim.py design custom \
--gtf reference_annotation.gtf \
--trans_index prefix_index.tsv \
--nbn_known 15 --nbp_known 0.5 \
--nbn_novel 5 --nbp_novel 0.5
(SQANTI-SIM.env)$ python sqanti-sim.py design sample --help
usage: sqanti-sim.py design sample [-h] -i TRANS_INDEX --gtf GTF [-o OUTPUT]
[-d DIR] [-nt TRANS_NUMBER] --genome GENOME
[--pb_reads PB_READS | --ont_reads ONT_READS | --mapped_reads MAPPED_READS]
[--iso_complex] [--diff_exp]
[--low_prob LOW_PROB]
[--high_prob HIGH_PROB] [--ISM ISM]
[--NIC NIC] [--NNC NNC] [--Fusion FUSION]
[--Antisense ANTISENSE] [--GG GG] [--GI GI]
[--Intergenic INTERGENIC] [-k CORES]
[-s SEED]
optional arguments:
-h, --help show this help message and exit
-i TRANS_INDEX, --trans_index TRANS_INDEX
File with transcript information generated with
SQANTI-SIM (*_index.tsv)
--gtf GTF Complete reference annotation in GTF format
-o OUTPUT, --output OUTPUT
Prefix for output files
-d DIR, --dir DIR Directory for output files (default: .)
-nt TRANS_NUMBER, --trans_number TRANS_NUMBER
Total number of transcripts to simulate
--genome GENOME Reference genome FASTA
--pb_reads PB_READS Input PacBio reads for characterization in FASTA or
FASTQ format
--ont_reads ONT_READS
Input ONT reads for characterization in FASTA or FASTQ
format
--mapped_reads MAPPED_READS
Aligned reads in SAM format
--iso_complex If used the program will simulate the expressed
isoform complexity (number of isoforms per gene)
--diff_exp If used the program will simulate different expression
values for novel and known transcripts
--low_prob LOW_PROB Low value of prob vector (if --diff_exp)
--high_prob HIGH_PROB
High value of prob vector (if --diff_exp)
--ISM ISM Number of incomplete-splice-matches to simulate
--NIC NIC Number of novel-in-catalog to simulate
--NNC NNC Number of novel-not-in-catalog to simulate
--Fusion FUSION Number of Fusion to simulate
--Antisense ANTISENSE
Number of Antisense to simulate
--GG GG Number of Genic-genomic to simulate
--GI GI Number of Genic-intron to simulate
--Intergenic INTERGENIC
Number of Intergenic to simulate
-k CORES, --cores CORES
Number of cores to run in parallel
-s SEED, --seed SEED Randomizer seed
Running the sample mode with the minimum input and mode-specific parameters will look as follows:
(SQANTI-SIM.env)$ python sqanti-sim.py design sample \
--gtf reference_annotation.gtf \
--trans_index prefix_index.tsv \
--genome reference_genome.fasta \
--pb_reads reads.fastq
These are the minimum parameters you will need to run sqanti-sim.py design
:
-
Design mode: This is the first positional argument right after "design" (
sqanti-sim.py design <mode>
). It can take three different values: equal (simulate with equal coverage for all reads), custom (simulate with different negative binomial distributions for novel and known transcripts) or sample (simulate using a real sample). -
Transcript index file (
-i
): This file is the prefix_index.tsv file generated in the previous classif step. New columns and information will be added to this file, so you should track it and use the most updated version of this file in the next steps of the SQANTI-SIM pipeline. -
Reference annotation in GTF format (
--gtf
): This file must be the same as the one used in the classif step, so the transcript names and references are properly matched. From this file, SQANTI-SIM will generate a modified reference annotation to use as a reference in your transcript reconstruction pipeline. An example of a reference transcriptome and its required format is GENCODE. -
Reference genome in FASTA format (
--genome
): Only required for sample mode, if not ignored. This is the reference genome in FASTA format. It will be used together with the reference annotation to extract the transcript sequences in FASTA format using GffRead. -
Long-read mRNA sequences (
--pb_reads/--ont_reads/--mapped_reads
): Only required for sample mode, if not ignore. A real sample of Long-read mRNA sequences in FASTA or FASTQ format, or the aligned reads in SAM format, to quantify and extract the expression values. The reads will be mapped with Minimap2, and the expression distribution will be generated from the raw counts of the alignment.
-
Novel transcripts to simulate: The number of transcripts of each structural category to simulate as novel. Use
--ISM
to simulate Incomplete Splice Match transcripts,--NIC
for Novel In Catalog,--NNC
for Novel not In Catalog,--Fusion
for Fusion,--Antisense
for Antisense,--GG
for Genig-genomic,--GI
for Genic-intron and--Intergenic
for Intergenic transcripts. Use this parameter followed by the number of transcripts to simulate with that structural category. -
Transcripts to simulate (
-nt
): This is the total number of transcripts to simulate; this means that this will be all the different transcripts that will be simulated. The number of novel transcripts will be subtracted from this value. The rest will be simulated from known annotated transcripts kept in the modified reference annotation. Then, the number of different known transcripts to simulate will be the requested number of transcripts to simulate (-nt
) minus the number of novel transcripts (sum of different structural categories). -
Output prefix (
-o
): The output prefix for the index file. SQANTI-SIM will use "sqanti-sim" as the default prefix. -
Output directory (
-d
): Output directory for output files. SQANTI-SIM will use the directory where the script was run as the default output directory. -
Parallelization (
-k
): Number of cores to run in parallel. Most SQANTI-SIM modes have code chunks that can be run in parallel. However, the default option is to run SQANTI-SIM in one single thread. -
Number of reads (
--read_count
): Can only be used for equal mode, if not ignored. The number of total reads to simulate. It will be divided by the number of total transcripts to compute the requested counts for each transcript and the TPM values. All transcripts will have the same TPM and count values. -
Isoform complexity per gene (
--iso_complex
): Can only be used in sample mode. By default, SQANTI-SIM chooses the known transcripts randomly to simulate; however, if this argument is used, SQANTI-SIM will mimic the isoform expression complexity from a given sample, i.e., it will simulate transcripts so that the expressed isoforms per gene follow the same distribution as the real sample (--pb_reads/--ont_reads
). This procedure may alter the total number of simulated transcripts. -
Negative binomial distribution parameters: Can only be used in custom mode, if not ignored. When running in custom modes, expression values are taken from 2 different negative binomial distributions, one for novel transcripts and the other for known so that you can customize the differences in expression levels. To generate both negative binomial distributions you need to specify the parameter n successes (
--nbn_known
and--nbn_novel
) and p probability of success (--nbp_known
and--nbp_novel
) where n is > 0 and p is in the interval [0,1]. -
Different expression novel and known (
--diff_exp
): Can only be used in sample mode, if not ignored. sample mode by default simulates the same level of expression for novel and known transcripts. However, this parameter can be used to simulate different levels of expressions. It uses a probability vector in the interval [0,1] and its inverse to assign probability values for each count value of the expression distribution. You can decide the start and end of the range of the probability vector using--low_prob
as the start probability and--high_prob
as the end probability.
This file contains the reference annotation in GTF format without those transcripts classified as novel in the index file. This file has the same content as the original reference annotation except for those novel transcripts and those genes left without a single annotated transcript when removing the novel transcripts. It is mandatory to use this file as the reference annotation in your transcript identification pipeline to presume novelty.
This is the main index file with the structural annotation of the reference transcripts. This file will be requested and modified in each other SQANTI-SIM mode. It must be parsed using -i/--trans_index
. After running the SQANTI-SIM design mode, this file will contain three extra columns indicating whether the read is novel or known, as well as the requested expression values for each transcript. The columns of the prefix_index.tsv file are described below:
- transcript_id: The transcript id of the query transcript taken from the reference annotation.
- gene_id: The reference gene where this transcript comes from.
- structural_category: The potential structural category in which this transcript could be classified if simulated as a novel.
- associated_gene: The reference gene that confers the query transcript's structural category. If there is more than one associated gene, they will be written separated by "_", e.g., Gene1_Gene2_Gene3.
- associated_trans: The reference transcript name that confers the structural category. If there is no specific associated transcript, "novel" will appear instead.
- chrom: Chromosome.
- strand: Strand.
- exons: Number of exons.
- donors: End of the junction (if - strand, it will be the start of the junction). Values are 1-based.
- acceptors: Start of the junction (if - strand, it will be the end of the junction). Values are 1-based.
- TSS_genomic_coord: Start site of the transcript. Values are 1-based.
- TTS_genomic_coord: Termination site of the transcript. Values are 1-based.
- length: Length of the transcript.
- sim_type: If the transcript is present in the modified reference annotation (known) or not (novel).
- requested_counts: The requested reads to simulate in the sim step.
- requested_tpm: The requested TPM value to simulate in the sim step.