Terminitor is a deep neural network that predicts whether a sequence contains a polyadenylated (poly(A)) cleavage site (CS) at a given position.
For more information, please refer to the preprint: https://www.biorxiv.org/content/10.1101/710699v2
www.bcgsc.ca/downloads/supplementary/Terminitor
This FTP site contains two datasets (human and mouse) and the two corresponding pre-trained models for testing.
- Python3
- Numpy
- Keras
- Scikit-learn
- Pybedtools
- Pysam
- HTSeq
A Python environment with these packages can be created with conda, e.g.
conda create --name terminitor pysam pybedtools numpy keras scikit-learn htseq
For more information, consult the user guide for conda.
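Once the environment is activated, a quick check that the listed packages are importable can save a failed run later. The sketch below is only an illustration: the import names (for example `sklearn` for Scikit-learn and `HTSeq` for HTSeq) are the usual ones for these packages but are an assumption here, and `missing_modules` is a hypothetical helper, not part of Terminitor.

```python
import importlib.util

# Import names assumed for the packages listed above; they are the usual
# module names for these packages but are not stated in this README.
REQUIRED_MODULES = ["numpy", "keras", "sklearn", "pybedtools", "pysam", "HTSeq"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```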
Usage: train.py [-h] [-v] -polya POLYA -cs CS -non NON -model MODEL -l L
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-polya POLYA Poly(A) CS, fasta file
-cs CS Non-poly(A) CS, fasta file
-non NON Non-CS, fasta file
-model MODEL File name of trained model
-l L Length of input sequences
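Because `train.py` takes a single `-l` value for all input sequences, it may be worth confirming beforehand that every record in the three FASTA files has that length. Whether `train.py` performs this check itself is not stated here, so the following is a minimal sketch under that assumption; `read_fasta` and `wrong_length_records` are hypothetical helper names.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def wrong_length_records(path, expected_length):
    """Return the headers of records whose sequence length differs from expected_length."""
    return [h for h, seq in read_fasta(path) if len(seq) != expected_length]
```

Calling `wrong_length_records` on each of the `-polya`, `-cs` and `-non` files with the value intended for `-l` should return an empty list if the inputs are consistent.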
Usage: extract_from_sequences.py [-h] [-v] -t ANNOT_TRANS -a ANNOT_ALL -m ALN -g GENOME -o O [-u UP_LEN] [-d DOWN_LEN]
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t ANNOT_TRANS, --annot_trans ANNOT_TRANS
Transcript annotation file, GTF format. This file
contains only transcript level annotation, can be
downloaded from the ftp site provided on our Github
page
-a ANNOT_ALL, --annot_all ANNOT_ALL
Ensembl annotation file, GTF format. Can be downloaded
from Ensembl ftp site
-m ALN, --aln ALN The alignment file from assembled transcript contigs
to reference genome in BAM format.
-g GENOME, --genome GENOME
Indexed reference genome assembly in FASTA format, which
can be downloaded from Ensembl
-o O Output file, fasta format containing candidate
sequences to be tested
-u UP_LEN, --up_len UP_LEN
Upstream sequence length
-d DOWN_LEN, --down_len DOWN_LEN
Downstream sequence length
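The `-u` and `-d` options set how much sequence is taken upstream and downstream of a candidate site. The sketch below illustrates one plausible windowing convention on a plain string; it is not the implementation used by `extract_from_sequences.py`. The 0-based coordinate, the choice of treating the site as the last upstream base, the strand handling, and the function names are all assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def window_around_site(chrom_seq, site, up_len, down_len, strand="+"):
    """Return up_len + down_len bases around a 0-based site, in transcript orientation.

    Assumed convention: the site is the last upstream base, so the window holds
    up_len bases ending at the site followed by down_len bases after it.
    Returns None when the window would run off either end of the sequence.
    """
    if strand == "+":
        start, end = site - up_len + 1, site + down_len + 1
    else:
        # On the minus strand, upstream lies to the right on the genome.
        start, end = site - down_len, site + up_len
    if start < 0 or end > len(chrom_seq):
        return None
    window = chrom_seq[start:end]
    return window if strand == "+" else reverse_complement(window)
```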
Usage: test.py [-h] [-v] -t TEST_FILE -m MODEL -l L -o OUTPUT
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-t TEST_FILE, --test_file TEST_FILE
Fasta file to be tested
-m MODEL, --model MODEL
Pre-trained model file
-l L Length of input sequences
-o OUTPUT, --output OUTPUT
Output probabilities
- For Illumina RNA-seq short reads, run assembly with RNA-Bloom
java -jar RNA-Bloom.jar -left read2.fq -right read1.fq -revcomp-right -outdir assembly -a 4 -e 1 -stratum 01 -ss -ntcard -fpr 0.005
For PacBio CCS reads, skip this step
- Genome alignment with minimap2
minimap2 -ax splice hg38.mmi rnabloom.transcripts.fa | samtools view -u - | samtools sort -T tmp_prefix -O BAM -o aln.bam
samtools index aln.bam
- Extract candidate sequences
python extract_from_sequences.py -t Homo_sapiens.GRCh38.99.transcripts.gtf -a Homo_sapiens.GRCh38.99.gtf -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -m aln.bam -o extracted_sequences.fa
The GTF file for the -a option can be downloaded from Ensembl.
The GTF file for the -t option can be generated from the Ensembl annotation, e.g.
awk '$3=="transcript" {print}' Homo_sapiens.GRCh38.99.gtf > Homo_sapiens.GRCh38.99.transcripts.gtf
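Where awk is not available, the same filter as the awk command above can be written in Python. This is a minimal sketch: it splits on tabs (the GTF field separator), skips comment lines starting with `#`, and keeps records whose third column is `transcript`. The function name `filter_transcript_lines` is hypothetical.

```python
def filter_transcript_lines(gtf_in, gtf_out):
    """Copy GTF records whose feature column (column 3) is 'transcript'.

    Returns the number of records written.
    """
    kept = 0
    with open(gtf_in) as src, open(gtf_out, "w") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] == "transcript":
                # Make sure every written record ends with a newline.
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept
```

For example, `filter_transcript_lines("Homo_sapiens.GRCh38.99.gtf", "Homo_sapiens.GRCh38.99.transcripts.gtf")` would write the transcript-level file expected by the -t option.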
The reference genome for the -g option can be downloaded from Ensembl and must be indexed, e.g.
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa
- Test
python test.py -t extracted_sequences.fa -m pre_trained_model -l 200 -o probabilities.txt
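The extraction and test steps can also be chained from Python. The sketch below only assembles the command lines from the options documented in the usage sections and runs them in order; it assumes both scripts are in the current working directory, and the helper names (`extract_command`, `predict_command`, `run_steps`) are hypothetical.

```python
import subprocess
import sys


def extract_command(annot_trans, annot_all, aln_bam, genome_fa, out_fa):
    """Build the extract_from_sequences.py command from its documented options."""
    return [
        sys.executable, "extract_from_sequences.py",
        "-t", annot_trans,
        "-a", annot_all,
        "-m", aln_bam,
        "-g", genome_fa,
        "-o", out_fa,
    ]


def predict_command(test_fa, model, length, output):
    """Build the test.py command from its documented options."""
    return [
        sys.executable, "test.py",
        "-t", test_fa,
        "-m", model,
        "-l", str(length),
        "-o", output,
    ]


def run_steps(commands):
    """Run each command in order, stopping at the first non-zero exit status."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Passing the two command lists to `run_steps`, with the output FASTA of the first used as the input of the second, reproduces the last two steps of the workflow.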