Trans2express v.1.2: de novo hybrid transcriptome assembly tool

This pipeline performs de novo hybrid transcriptome assembly from short and long reads, aiming to retain one transcript per gene. 64-bit Linux and macOS are supported.

Installation

git clone https://github.com/albidgy/trans2express
cd trans2express/

Create the conda environment trans2express_env and install the required databases:

bash install.sh

* Keep in mind that one of the databases is several hundred GB in size, so installation may take several hours.

Command line options

python trans2express.py [options]

General options

-1 / --short_reads1 [Required] Forward short reads in fastq or fastq.gz format.

-2 / --short_reads2 Reverse short reads in fastq or fastq.gz format. Omit this parameter for single-end reads.

--long_reads [Required] Nanopore long reads in fastq or fastq.gz format.

-o / --output_dir Output directory. Default: ../res_trans2express_YEAR_MONTH_DAY_HOUR_MINUTE_SECOND.

-t / --threads Number of threads. Default: 1.

-m / --memory_lim Memory limit in GB. Default: 10.

-h / --help Show help information.
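As an illustration of how the general options combine, here is a minimal Python sketch that builds the argument list for a paired-end run. The helper name build_command and all file names are hypothetical; only the option names are taken from the list above, and the sketch assumes it is launched from the trans2express/ directory.

```python
import shlex
from typing import List, Optional


def build_command(short_reads1: str,
                  long_reads: str,
                  short_reads2: Optional[str] = None,
                  output_dir: Optional[str] = None,
                  threads: int = 1,
                  memory_lim: int = 10) -> List[str]:
    """Build an argument list for trans2express.py (hypothetical helper)."""
    cmd = ["python", "trans2express.py",
           "--short_reads1", short_reads1,
           "--long_reads", long_reads,
           "--threads", str(threads),
           "--memory_lim", str(memory_lim)]
    # For single-end data the reverse-read option is simply left out.
    if short_reads2 is not None:
        cmd += ["--short_reads2", short_reads2]
    # Without --output_dir the pipeline falls back to its timestamped default.
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


# Hypothetical input files; replace with your own paths.
cmd = build_command("illumina_R1.fastq.gz", "nanopore.fastq.gz",
                    short_reads2="illumina_R2.fastq.gz",
                    output_dir="results", threads=16, memory_lim=64)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed directly to subprocess.run(cmd, check=True).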

Optional arguments

--diamond_db DIAMOND nr database (nr.dmnd) used to find homologous proteins for CDS prediction by TransDecoder and to remove foreign transcripts. The database is downloaded by the install.sh script; alternatively, you can create your own .dmnd database. Default: db/nr.dmnd.

--diamond_taxonomic_id File listing taxonomic IDs for the nr.dmnd database, used to remove foreign RNA. The file is downloaded by the install.sh script. Default: db/taxonomic_id_to_full_taxonomy.txt.

--go_tree File with broad GO terms. The file is downloaded by the install.sh script; alternatively, you can create your own goTree file. Default: db/goTree.txt.

--min_short_read_length Minimum short-read length for the fastp tool. Default: 50.

--seq_idy_threshold Sequence identity threshold for the CD-HIT-EST tool. Default: 0.98.

--alignment_type Alignment type for the CD-HIT-EST tool: 0 for local alignment, 1 for global alignment. Default: 0.

--subseq_len_matching_cov Alignment coverage of the shorter sequence for the CD-HIT-EST tool. Default: 0.6.
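When the clustering options are set from a script, it can help to check the values before starting a long run. The sketch below is one way to do that in Python. The function name cdhit_options is hypothetical, and the accepted ranges (fractions between 0 and 1) are an assumption based on the option descriptions above; the pipeline itself may apply different checks.

```python
from typing import List


def cdhit_options(seq_idy_threshold: float = 0.98,
                  alignment_type: int = 0,
                  subseq_len_matching_cov: float = 0.6) -> List[str]:
    """Validate clustering settings and return them as CLI arguments."""
    # Assumed range: identity is a fraction, so it must lie in (0, 1].
    if not 0.0 < seq_idy_threshold <= 1.0:
        raise ValueError("seq_idy_threshold must be in (0, 1]")
    # Documented values: 0 for local alignment, 1 for global alignment.
    if alignment_type not in (0, 1):
        raise ValueError("alignment_type must be 0 or 1")
    # Assumed range: coverage is a fraction of the shorter sequence.
    if not 0.0 <= subseq_len_matching_cov <= 1.0:
        raise ValueError("subseq_len_matching_cov must be in [0, 1]")
    return ["--seq_idy_threshold", str(seq_idy_threshold),
            "--alignment_type", str(alignment_type),
            "--subseq_len_matching_cov", str(subseq_len_matching_cov)]


# With no arguments the documented defaults are returned.
print(cdhit_options())
```

The returned list can be appended to the arguments passed to trans2express.py.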

Output data

The pipeline writes its results to the final_assemly main directory, which contains the following files:

final.clust_transcripts_longest_iso.fasta - assembled transcriptome in FASTA format;
final.clust_annotation_longest_iso.gff3 - annotation in GFF3 format;
final.clust_proteins_longest_iso.fasta - protein sequences in FASTA format;
final.clust_cds_longest_iso.fasta - CDS sequences in FASTA format;
GO_annotation.txt - GO term annotation.
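A quick way to confirm that a run finished is to check that all of the files listed above exist. The following Python sketch does this. The function name missing_outputs and the example path are hypothetical; the file names are copied from the list above, and the directory is passed in as an argument.

```python
from pathlib import Path
from typing import List

# File names as listed in the "Output data" section.
EXPECTED_FILES = (
    "final.clust_transcripts_longest_iso.fasta",
    "final.clust_annotation_longest_iso.gff3",
    "final.clust_proteins_longest_iso.fasta",
    "final.clust_cds_longest_iso.fasta",
    "GO_annotation.txt",
)


def missing_outputs(final_dir: str) -> List[str]:
    """Return the expected output files that are absent from final_dir."""
    base = Path(final_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


# Hypothetical location; point this at the final directory of your run.
missing = missing_outputs("results/final_dir")
if missing:
    print("Missing output files:", ", ".join(missing))
else:
    print("All expected output files are present.")
```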

Citations

If you use Trans2express in your research, please cite the paper.
