Hello, I'm running dNdS() on the CDS of 2 species containing 13486 orthologous pairs, but dN/dS is calculated for only 1754 genes. The rest run into this error:
ERROR: number of input seqs differ (aa: 1; nuc: 2)!!
Starting orthology inference (RBH) and dNdS estimation (YN) using the follwing parameters:
query = 'ZFcdsorth.fa'
subject = 'BFcdsorth.fa'
seq_type = 'cds'
e-value: 1E-5
aa_aln_type = 'multiple'
aa_aln_tool = 'clustalo'
comp_cores = '1'
Creating folder 'orthologr_alignment_files' to store alignment files ...
Starting Orthology Inference ...
Running blastp: 2.9.0+ ...
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds_seqs.fasta'.
You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRUE'.
All corrupted CDS were trimmed.
Building a new DB, current time: 01/24/2022 19:25:22
New DB name: /tmp/RtmpUFtcuE/_blast_db/blastdb_BFcdsorth.fa_protein.fasta
New DB title: blastdb_BFcdsorth.fa_protein.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13486 sequences in 0.380335 seconds.
Running blastp: 2.9.0+ ...
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds_seqs.fasta'.
You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRUE'.
All corrupted CDS were trimmed.
Building a new DB, current time: 01/24/2022 20:21:27
New DB name: /tmp/RtmpUFtcuE/_blast_db/blastdb_ZFcdsorth.fa_protein.fasta
New DB title: blastdb_ZFcdsorth.fa_protein.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 13486 sequences in 0.404176 seconds.
There seem to be 6 coding sequences in your input dataset which cannot be properly divided in base triplets, because their sequence length cannot be divided by 3.
A fasta file storing all corrupted coding sequences for inspection was generated and stored at '/gpfs/data/ehuertas/mfariasv/aligned_newBFV2/dNdS/ZFcdsorth.fa_corrupted_cds_seqs.fasta'.
You chose option 'delete_corrupt_cds = FALSE', thus corrupted coding sequences were retained for subsequent analyses.
The following modifications were made to the CDS sequences that were not divisible by 3:
- If the sequence had 1 residue nucleotide then the last nucleotide of the sequence was removed.
- If the sequence had 2 residue nucleotides then the last two nucleotides of the sequence were removed.
If after consulting the file 'ZFcdsorth.fa_corrupted_cds_seqs.fasta' you wish to remove all corrupted coding sequences please specify the argument 'delete_corrupt_cds = TRUE'.
All corrupted CDS were trimmed.
Orthology Inference Completed.
Starting dN/dS Estimation ...
ERROR: number of input seqs differ (aa: 1; nuc: 2)!!
aa 'A1CF'
nuc 'A1CF A1CF'
*****************************************************************
Function: Parse fasta file with aligned pairwise sequences into AXT file
Reference: Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 2006 , 4:259-263.
Web Link: Documentation, example and updates at <http://code.google.com/p/kaks-calculator>
*****************************************************************
I noticed that all the orthologous pairs for which the error DOES NOT occur have different names in the two species:
[mfariasv@login005 dNdS]$ head BFcdsorthZF.dNdS
"","query_id","subject_id","dN","dS","dNdS","method","perc_identity","num_ident_matches","alig_length","mismatches","gap_openings","n_gaps","pos_match","ppos","q_start","q_end","q_len","qcov","qcovhsp","s_start","s_end","s_len","evalue","bit_score","score_raw"
"1","ABCF2","LOC110475106",0.000859565,0.0378507,0.0227094,"YN",99.801,501,502,1,0,0,501,99.8,52,553,553,100,91,123,624,624,0,1051,2719
Did I understand the issue correctly that you have the same header names in two different fasta files (representing two different species), but behind each header name lies a different coding sequence? Can we assume that headers with the same name in two different species are supposed to be orthologous genes?
If I understood correctly, then it seems to me that internally the wrong header name is selected when computing dNdS. Did you try renaming the headers from >ABCG1 to e.g. >ABCG1_BF and >ABCG1_ZF? If yes, does the same issue remain?
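The renaming suggested here could be scripted along these lines. This is only a minimal sketch and not part of orthologr: the function name `add_species_suffix` and the output file names are hypothetical, and it assumes the gene ID is the first whitespace-delimited token of each FASTA header.

```python
def add_species_suffix(fasta_in, fasta_out, suffix):
    """Append `_<suffix>` to the ID (first token) of every FASTA header.

    Hypothetical helper for illustration; sequence lines are copied unchanged.
    """
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Split the header into the ID and an optional description.
                parts = line[1:].rstrip("\n").split(None, 1)
                seq_id = parts[0] if parts else ""
                desc = " " + parts[1] if len(parts) > 1 else ""
                line = f">{seq_id}_{suffix}{desc}\n"
            dst.write(line)


# Example usage (output file names are assumptions):
# add_species_suffix("ZFcdsorth.fa", "ZFcdsorth_renamed.fa", "ZF")
# add_species_suffix("BFcdsorth.fa", "BFcdsorth_renamed.fa", "BF")
```

After renaming, every header is unique across the two files, so a gene such as ABCG1 would appear as ABCG1_ZF in one file and ABCG1_BF in the other.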
Would it be possible to construct a small example run with only a few sequences so that I can reproduce this issue and troubleshoot at each analysis step?
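One possible way to build such a small example is to pull a handful of sequences out of each input file. The sketch below is an illustration only: `read_fasta`, `write_subset`, and the output file names are hypothetical, and the gene ID is assumed to be the first whitespace-delimited token of the header.

```python
def read_fasta(path):
    """Yield (id, sequence) pairs; the ID is the first token of the header."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                fields = line[1:].split()
                seq_id, chunks = (fields[0] if fields else ""), []
            elif line and seq_id is not None:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def write_subset(fasta_in, fasta_out, wanted_ids):
    """Write only the records whose ID is in `wanted_ids`; return the count."""
    wanted = set(wanted_ids)
    kept = 0
    with open(fasta_out, "w") as out:
        for seq_id, seq in read_fasta(fasta_in):
            if seq_id in wanted:
                out.write(f">{seq_id}\n{seq}\n")
                kept += 1
    return kept


# Example usage (output file names are assumptions):
# write_subset("ZFcdsorth.fa", "ZF_small.fa", ["A1CF", "ABCG1", "ABCF2"])
# write_subset("BFcdsorth.fa", "BF_small.fa", ["A1CF", "ABCG1", "LOC110475106"])
```

Mixing genes that fail (A1CF, ABCG1) with a pair that succeeds (ABCF2 and LOC110475106) should keep the test case small while still showing both behaviours.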
I'm running the program as follows:
The program runs:
I noticed that all the orthologous pairs for which the error DOES NOT occur have different names, while all genes for which the error occurs and dNdS is not calculated have the same name in both species:
For example, for ABCG1:
But the sequences are indeed different:
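The overlap in header names described above can be checked directly on the two input files. The following is a rough sketch rather than anything orthologr provides: `fasta_ids` and `shared_ids` are hypothetical helpers, and the gene ID is assumed to be the first whitespace-delimited token of each header.

```python
def fasta_ids(path):
    """Return the ID (first token of each header) for every record in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def shared_ids(query_fasta, subject_fasta):
    """Return the sorted IDs that occur in both FASTA files."""
    return sorted(set(fasta_ids(query_fasta)) & set(fasta_ids(subject_fasta)))


# Example usage with the file names from this report:
# common = shared_ids("ZFcdsorth.fa", "BFcdsorth.fa")
# print(len(common), common[:10])
```

If the number of shared IDs is close to the number of genes for which dNdS was not calculated, that would support the idea that identical header names are what triggers the error.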