Citation: Cheon, S., Zhang, J. and Park, C., 2020. Is phylotranscriptomics as reliable as phylogenomics?. Molecular biology and evolution, 37(12), pp.3672-3683.
##Notice: Consistent text formatting helps reders to interpret information. <text> means parameters for Linux shell and python scripts such as input file name or integer values.
Hardware
32-core processors (recommend > 8-core processors)
256 Gb (Gigabytes) of RAM (recommend > 120 Gb of RAM)
> 1 Terabytes of space require for raw sequencing data and processed data
Ubuntu 16.04 LTS (recommend LTS version)
Internet accessible environments
Software
Python2.7 with Biopython
JAVA 1.8 (or higher)
cmake
SRA Toolkit (v 2.10.8 or higher)
Trimmomatic (v 0.36 or higher)
Samtools
Trinity (v 2.2.0 or higher)
Jellyfish (v 2.3.0 or higher)
Bowtie2 (v 2.3.5.1 or higher)
TransDecoder (v 3.0.0 or higher)
CD-hit (v 4.6.6 or higher)
BLAST+ (v 2.9.0 or higher)
OrthoFinder (v.2.4.0 or higher)
DIAMOND (v. 0.9.24 or higher)
MCL
Prank (v.150803)
Phyutility (v.2.7.1)
IQ-Tree (v. 1.6.11 or higher)
- Install anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
sh https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
rm https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
- Create conda environmental and installation
conda update conda
conda env create -n phylo --file 2021.May.Phylo_env.yaml
conda activate phylo
- Phyutility linux package download
sudo apt-get install phyutility
- Trimmpmatic file download
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip
1.Quick start guideline for example raw data download
2.Quick start guideline from translated de novo assembled transcripts with example dataset
##Notice: if you have a proteome sequence or translated de novo assembled reference transcrits. you can starts from part 2.
For phylogeny with RNA-seq data. we are download RNA-seq raw data from NCBI SRA database.
fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files <SRA Accession ID>
1. De novo transcriptome assembly with Trinity
Short reads RNA sequencing data processed by Trinity assembler with Trimmomatic read trimming toool for illumina NGS data.
For data sets with known adaptor sequence and phred scores for base quality.
If you have single-end sequencing data
Trinity --seqType fq --trimmomatic --quality_trimming_params <"ILLUMINACLIP:/home/your/path/trinity-plugins/Trimmomatic-0.36/adapters/TruSeq3-PE.fa>:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"> --max_memory <200G> --CPU <32> --full_cleanup --single <single-end reads.fastq> --output <trinity_output_Name>
or paired-end sequencing data
Trinity --seqType fq --trimmomatic --quality_trimming_params <"ILLUMINACLIP:/home/your/path/trinity-plugins/Trimmomatic-0.36/adapters/TruSeq3-PE.fa>:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"> --max_memory 200G --CPU 32 --full_cleanup --left <forward reads.fastq> --right <reverse reads.fastq> --output <trinity_output_Name>
2. Find Open Reading Frames and translate using TransDecoder with blastp for orfs selection
Assembled transcripts were translated with TransDecoder programs and choosing orfs with blastp results.
For blastp, download and make database file from Uniprot/Swiss-Prot
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
makeblastdb -in uniprot_sprot.fasta -dbtype prot
TransDecoder.LongOrfs -t <transcripts> -S
blastp -query <transcripts>.transdecoder_dir/longest_orfs.pep -db uniprot_sprot.fasta -max_target_seqs 1 -outfmt 6 -evalue 10 -num_threads 32 -out Genus_Species.outfmt6
TransDecoder.Predict -t <transcripts> --retain_blastp_hits Genus_Species.outfmt6 --single_best_only
3. Clustering with CD-hit
Reduce translated sequence redundancy with CD-hit
cdhit -i <transcripts>.transdecoder.pep -o <Genus Species>.fa.cdhit -c 0.99 -n 5 -T 32
4. Sequence ID fixation.
CD-hit output file " Genus_Species.fa.cdhit" sequence ID change to shorten name to Genus_Species@seqID. The special character "@" is used to separate taxon name and sequence ID. Any "-" (hyphen) in the sequence name will be replaces py phyutility and cause problems in downstream process.
python2 fix_names_from_CDhit.py <CDhit output file.cdhit> <Genus name> <Species name>
1. Running OrthoFinder
Orthology inference, Copy all the Genus_Species.fix.fa files (or any proteom sequences) into a new directory such as <OrthoFinder_running_dir>.
orthofinder -f <OrthoFinder_running_dir> -S diamond -t <number of threads>
2. Single copy orthologous prediction
Choose the minimal number of taxa filters for single copy orthologs inference (recommend half of taxa)
python2 singlecopy_from_OrthoFinder.py <OrthoFinder_running_dir> SingleCopy <Min number of taxa>
3. Multiple sequence alignment with Prank
python2 prank_Wrapper.py SingleCopy
4. Alignment trimming with Phyutility
I usually use 0.3 for minimal aling column.
python2 phyutility_Wrapper.py SingleCopy <min_align_column>
5. Concatenate supermatrix
You can choose minimal cleaned alignment length per orthologs and minimal number of taxa filters (recommand 150, half of taxa for amino acid tree).
Concatenate with selected cleand orthologous for supermatrix.
python2 supermatrix_concatenate.py SingleCopy <min_align_length> <min_taxa> <output_name>
Run IQ-Tree with 1000 UFBoot replications and search for best fit tree. Use LG+C60+R+F model.
iqtree -s <Concatenate_matrix>.phy -spp <Concatenate_matrix>.model -m LG+C60+R+F -bb 1000 -nt 32