albidgy/trans2express

Trans2express v.1.2: de novo hybrid transcriptome assembly tool

This pipeline performs de novo hybrid transcriptome assembly from short reads and Nanopore long reads, aiming to keep one transcript per gene. 64-bit Linux and macOS are supported.

Installation

git clone https://github.com/albidgy/trans2express
cd trans2express/

Create the conda environment trans2express_env and install the required databases:

bash install.sh

* Keep in mind that one of the databases is several hundred GB in size, so installation may take several hours.

Command line options

python trans2express.py [options]

General options

-1 / --short_reads1 [Required] Forward short reads in fastq or fastq.gz format.

-2 / --short_reads2 Reverse short reads in fastq or fastq.gz format. Omit this parameter for single-end reads.

--long_reads [Required] Nanopore long reads in fastq or fastq.gz format.

-o / --output_dir Output directory. Default: ../res_trans2express_YEAR_MONTH_DAY_HOUR_MINUTE_SECOND.

-t / --threads Number of threads. Default: 1.

-m / --memory_lim Memory limit in GB. Default: 10.

-h / --help See more information.
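As an illustration of how the general options fit together, the following minimal sketch assembles an argument list for a paired-end run. The helper name build_command and the read file names are hypothetical; only the option names and defaults come from the list above.

```python
import shlex

def build_command(short_reads1, long_reads, short_reads2=None,
                  output_dir=None, threads=1, memory_lim=10):
    """Assemble the argument list for a trans2express.py run (hypothetical helper)."""
    # The two required inputs: forward short reads and Nanopore long reads.
    cmd = ["python", "trans2express.py",
           "--short_reads1", short_reads1,
           "--long_reads", long_reads]
    # Reverse reads are passed only for paired-end data.
    if short_reads2 is not None:
        cmd += ["--short_reads2", short_reads2]
    # Without --output_dir the pipeline picks its timestamped default.
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    cmd += ["--threads", str(threads), "--memory_lim", str(memory_lim)]
    return cmd

# Hypothetical input file names, for illustration only.
cmd = build_command("reads_R1.fastq.gz", "nanopore.fastq.gz",
                    short_reads2="reads_R2.fastq.gz", threads=8, memory_lim=32)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the trans2express directory, inside the trans2express_env environment.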

Optional arguments

--diamond_db DIAMOND nr database (nr.dmnd) used to find homologous proteins for CDS prediction by TransDecoder and to remove foreign transcripts. The database is downloaded when you run the install.sh script, or you can create your own .dmnd database. Default: db/nr.dmnd.

--diamond_taxonomic_id File listing taxonomic IDs for the nr.dmnd database, used to remove foreign RNA. The file is downloaded when you run the install.sh script. Default: db/taxonomic_id_to_full_taxonomy.txt.

--go_tree File with broad GO terms. The file is downloaded when you run the install.sh script, or you can create your own goTree file. Default: db/goTree.txt.

--min_short_read_length Minimum short-read length for the fastp tool. Default: 50.

--seq_idy_threshold Sequence identity threshold for the CD-HIT-EST tool. Default: 0.98.

--alignment_type Alignment type for the CD-HIT-EST tool: 0 for local alignment, 1 for global alignment. Default: 0.

--subseq_len_matching_cov Alignment coverage of the shorter sequence for the CD-HIT-EST tool. Default: 0.6.
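The numeric options above only need to be given when they differ from their defaults. The sketch below shows one possible way to turn a set of overrides into command-line flags; the helper optional_flags and the DEFAULTS table are hypothetical and simply mirror the defaults listed above.

```python
# Defaults copied from the option descriptions above.
DEFAULTS = {
    "min_short_read_length": 50,
    "seq_idy_threshold": 0.98,
    "alignment_type": 0,
    "subseq_len_matching_cov": 0.6,
}

def optional_flags(**overrides):
    """Return flags for the options whose value differs from the default."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    # --alignment_type accepts only 0 (local) or 1 (global).
    if overrides.get("alignment_type", 0) not in (0, 1):
        raise ValueError("alignment_type must be 0 or 1")
    flags = []
    for name, value in overrides.items():
        if value != DEFAULTS[name]:
            flags += [f"--{name}", str(value)]
    return flags

# Example: looser clustering with global alignment.
flags = optional_flags(seq_idy_threshold=0.95, alignment_type=1)
print(" ".join(flags))
```

The returned flags can be appended to the python trans2express.py command line.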

Output data

The pipeline creates the main directory final_assemly, which contains the following files:

  • final.clust_transcripts_longest_iso.fasta - assembled transcriptome in FASTA format;
  • final.clust_annotation_longest_iso.gff3 - annotation file in GFF3 format;
  • final.clust_proteins_longest_iso.fasta - protein sequences in FASTA format;
  • final.clust_cds_longest_iso.fasta - CDS sequences in FASTA format;
  • GO_annotation.txt - file with GO terms.
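A quick sanity check on the FASTA outputs listed above is to count the records and their lengths. The following sketch uses a small hand-written parser and a toy file with made-up sequences; it assumes an uncompressed, standard FASTA layout and is not part of the pipeline.

```python
import os
import tempfile

def fasta_lengths(path):
    """Return {record_id: sequence_length} for a plain-text FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token of the header.
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths

# Toy file with made-up sequences, standing in for a real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "final.clust_transcripts_longest_iso.fasta")
    with open(demo, "w") as out:
        out.write(">tr1 example\nACGTAC\nGT\n>tr2\nACG\n")
    lengths = fasta_lengths(demo)

print(len(lengths), sum(lengths.values()))  # number of records, total bases
```

On a real run, the path would point to the corresponding file inside the final_assemly directory.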

Citations

If you use Trans2express in your research, please cite the paper.
