This pipeline performs transcriptome assembly, aiming to keep one transcript per gene. 64-bit Linux and macOS are supported.
git clone https://github.com/albidgy/trans2express
cd trans2express/
Create the conda environment trans2express_env and install the required databases:
bash install.sh
* Keep in mind that one of the databases takes up several hundred GB, so installation may take several hours.
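Because one of the databases takes up several hundred GB, it may be worth checking free disk space before running install.sh. The snippet below is a minimal sketch and is not part of trans2express: the function name has_enough_space and the 500 GB threshold are assumptions made for illustration, not documented requirements.

```python
import shutil


def has_enough_space(path=".", required_gb=500):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    The 500 GB default is an assumed placeholder, not a documented requirement.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1000**3


if __name__ == "__main__":
    # Report free space for the current directory (e.g. the cloned repository).
    free_gb = shutil.disk_usage(".").free / 1000**3
    print(f"Free space in the current directory: {free_gb:.0f} GB")
    if not has_enough_space("."):
        print("Warning: there may not be enough space for the databases.")
```

Run it from the directory where the databases will be installed and adjust required_gb to match the sizes you actually observe.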
python trans2express.py [options]
-1 / --short_reads1
[Required] Forward short reads in fastq or fastq.gz format.
-2 / --short_reads2
Reverse short reads in fastq or fastq.gz format. If you have single-end reads, do not specify this parameter.
--long_reads
[Required] Nanopore long reads in fastq or fastq.gz format.
-o / --output_dir
Output directory. The default is ../res_trans2express_YEAR_MONTH_DAY_HOUR_MINUTE_SECOND.
-t / --threads
Number of threads. The default is 1.
-m / --memory_lim
Memory limit in GB. The default is 10 GB.
-h / --help
See more information.
--diamond_db
DIAMOND nr database (nr.dmnd) used to find homologous proteins for CDS prediction by TransDecoder and to remove foreign transcripts. The database is downloaded when you run the install.sh script, or you can create your own .dmnd database. The default is db/nr.dmnd.
--diamond_taxonomic_id
File with the list of taxonomic IDs for the nr.dmnd database, used to remove foreign RNA. The file is downloaded when you run the install.sh script. The default is db/taxonomic_id_to_full_taxonomy.txt.
--go_tree
File with broad GO terms. The file is downloaded when you run the install.sh script, or you can create your own goTree file. The default is db/goTree.txt.
--min_short_read_length
Minimum length of short reads for the fastp tool. The default is 50.
--seq_idy_threshold
Sequence identity threshold for the CD-HIT-EST tool. The default is 0.98.
--alignment_type
Alignment type for the CD-HIT-EST tool: 0 for local alignment, 1 for global alignment. The default is 0.
--subseq_len_matching_cov
Alignment coverage of the shorter sequence for the CD-HIT-EST tool. The default is 0.6.
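The options above can be combined into a single call. The following is a minimal sketch of how a command line might be assembled from a script, using only the flags documented above. The helper build_command and all file names are hypothetical placeholders.

```python
def build_command(long_reads, short_reads1, short_reads2=None,
                  output_dir=None, threads=1, memory_lim=10):
    """Build the argument list for trans2express.py from the documented options."""
    cmd = [
        "python", "trans2express.py",
        "--short_reads1", short_reads1,
        "--long_reads", long_reads,
        "--threads", str(threads),
        "--memory_lim", str(memory_lim),
    ]
    if short_reads2 is not None:
        # Leave this option out for single-end reads.
        cmd += ["--short_reads2", short_reads2]
    if output_dir is not None:
        # Without this option the pipeline uses its timestamped default directory.
        cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical input files; replace them with your own data.
    cmd = build_command("nanopore.fastq.gz", "reads_R1.fastq.gz",
                        short_reads2="reads_R2.fastq.gz",
                        threads=8, memory_lim=32)
    print(" ".join(cmd))
    # To launch the pipeline from the repository directory:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if file paths contain spaces.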
The pipeline creates the main output directory final_assemly, which contains the following files:
final.clust_transcripts_longest_iso.fasta - assembled transcriptome fasta file;
final.clust_annotation_longest_iso.gff3 - annotation file in gff format;
final.clust_proteins_longest_iso.fasta - proteins fasta file;
final.clust_cds_longest_iso.fasta - CDS fasta file;
GO_annotation.txt - file with GO terms.
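After a run, a quick sanity check can confirm that all listed files are present and count the assembled transcripts. This is a minimal sketch, not part of the pipeline: the helper names missing_outputs and count_fasta_records and the example directory path are assumptions, while the file names are taken from the list above.

```python
from pathlib import Path

# File names as listed in this README.
EXPECTED_FILES = [
    "final.clust_transcripts_longest_iso.fasta",
    "final.clust_annotation_longest_iso.gff3",
    "final.clust_proteins_longest_iso.fasta",
    "final.clust_cds_longest_iso.fasta",
    "GO_annotation.txt",
]


def missing_outputs(final_dir):
    """Return the expected file names that are absent from `final_dir`."""
    final_dir = Path(final_dir)
    return [name for name in EXPECTED_FILES if not (final_dir / name).is_file()]


def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting header lines starting with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # Hypothetical location; replace it with your actual output directory.
    final_dir = Path("res_trans2express") / "final_assemly"
    missing = missing_outputs(final_dir)
    print("Missing files:", ", ".join(missing) if missing else "none")
    transcripts = final_dir / EXPECTED_FILES[0]
    if transcripts.is_file():
        print("Transcripts:", count_fasta_records(transcripts))
```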
If you use Trans2express in your research, please cite the paper.