-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathalbayrak_10_clustering_800655.pdf.txt
1327 lines (1019 loc) · 45 KB
/
albayrak_10_clustering_800655.pdf.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets</title>
<meta name="Subject" content="BMC Bioinformatics 2010 11:428. doi:10.1186/1471-2105-11-428"/>
<meta name="Author" content="Aydin Albayrak"/>
<meta name="Creator" content="Arbortext Advanced Print Publisher 10.0.1082/W Unicode"/>
<meta name="Producer" content="Acrobat Distiller 9.0.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
METHODOLOGY ARTICLE
Open Access
Clustering of protein families into functional
subtypes using Relative Complexity Measure with
reduced amino acid alphabets
Aydin Albayrak1, Hasan H Otu2,3, Ugur O Sezerman1*
Abstract
Background: Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of
experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are
based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human
intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in
underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure
(RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on
sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the
clustering.
Results: We demonstrate the robustness of RCM with reduced alphabets in clustering of protein sequences into
families in a simulated dataset and seven well-characterized protein datasets. On protein datasets, crotonases,
mandelate racemases, nucleotidyl cyclases and glycoside hydrolase family 2 were clustered into subfamilies with
100% accuracy whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates could be
assigned to subfamilies with 97.2%, 96.9% and 92.2% accuracies, respectively.
Conclusions: The overall combination of methods in this paper is useful for clustering protein families into
subtypes based on solely protein sequence information. The method is also flexible and computationally fast
because it does not require multiple alignment of sequences.
Background
Proteins that evolve from a common ancestor can
change functionality over time [1] and produce highly
divergent protein families that can be divided into subfamilies with similar but distinct functions (i.e., functional
subfamilies or subtypes) [2]. Identification of subfamilies
using protein sequence information can be carried out
using phylogenetic methods that can reveal the evolutionary relationship between proteins by clustering similar proteins together in a phylogenetic tree [3-5]. The
most common method for identifying similarities
in sequences through phylogenetic analysis starts with
the construction of a multiple alignment of homologous
sequences using a substitution matrix. Multiple
* Correspondence: [email protected]
1
Biological Sciences and Bioengineering, Sabanci University, Orhanli, Tuzla,
Istanbul, Turkey
Full list of author information is available at the end of the article
alignment scores are then transformed into a distance
matrix to construct a phylogenetic tree. Often the
branching order of a phylogenetic tree exactly matches
the known functional split between proteins [1] and
branch lengths are proportional to the extent of evolutionary changes since the last common ancestor.
Multiple sequence alignment (MSA) is constructed
using a scoring scheme which reward or penalize each
substitution, insertion and deletion to get an optimum
alignment of the given sequences. The quality of an
MSA is connected to the chosen parameters that are
entered manually and an expert handling is almost
always required to maintain alignment integrity by
observing general trends in each protein family. As such
different alignment parameters may yield different
phylogenetic trees that are only as good as the MSA
that the trees are derived from [6,7].
© 2010 Albayrak et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
Phylogenetic analysis is broadly divided into two
groups of methods. Algorithms in the first group calculate a matrix representing the distance between each
pair of sequences and then transform this matrix into a
tree using a tree-clustering algorithm. Algorithms in the
first category utilize various distance measures with different models to account for nucleotide or amino acid
substitutions. In the second group, the tree that can
best explain the observed sequences under the chosen
evolutionary model is found by evaluating the fitness of
different tree topologies [6,8]. The second category can
further be divided into two groups based on the optimality criterion used in tree evaluation: maximum parsimony and maximum likelihood. Under maximum
parsimony [9], the preferred phylogenetic tree is the tree
that requires the least evolutionary change to explain
the observed data whereas under maximum likelihood
[9,10], it is the most probable tree under the chosen
evolutionary assumption.
The prediction of subfamilies from protein MSAs have
been carried out previously by comparing subfamily hidden Markov models, subfamily specific sequence profiles, analyzing positional entropies in an alignment, and
ascending hierarchical method [4,5,11,12]. All of these
methods require an alignment of biological sequences
that assume some sort of an evolutionary model. Computational complexity and the inherent ambiguity of the
alignment cost criteria are two major problems in MSA
along with controversial evolutionary models that are
used to explain them.
A novel approach for phylogenetic analysis based on
Relative Complexity Measure (RCM) of whole genomic
sequences have been previously proposed by Otu et al,
that eliminates the need for MSA and produces successful phylogenies on real and simulated datasets [8]. The
algorithm employs Lempel-Ziv (LZ) complexity [13] and
produces a score for each sequence pair that can be
interpreted as the “closeness” of the sequence pairs.
Unequal sequence length or different positioning of
similar regions along sequences (such as different gene
order in genomes) is not an issue as the method has
been shown to handle both cases naturally. Moreover,
RCM does not use any approximations and assumptions
in calculating the distance between sequences. Therefore, RCM utilizes the information contained in
sequences and requires no human intervention.
Application of RCM to genomic sequences for phylogenetic analysis was successfully carried out on various
datasets containing genomic sequences [8,14]. Moreover,
Liu et al [15] extended this method further to integrate
the hydropathy profile and a different LZ-based distance
measure for phylogenetic analysis of protein sequences
while Russell et al integrated a merged amino acid
alphabet containing 11 characters to represent all amino
Page 2 of 10
acids to reduce complexity prior to calculating a pairwise distance measure to be used as a pairwise scoring
function in determining the order with which sequences
should be joined in a multiple sequence alignment
problem [16].
Application of RCM to evaluate genomic sequences is
relatively straight forward since RCM based on LempelZiv complexity scores can capture each mutation in
DNA sequences and register it as an increase in the
complexity scores of compared sequences. However,
substitution of one residue into another in proteins is
tolerable as long as the substituted residue is not highly
conserved and physicochemical and structural properties
of the substituted and the native residues are not fundamentally different [17-19]. Employment of hydropathyindex-based grouping of residues is one way of a preprocessing requirement to capture only the mutations
that would not be tolerated in a protein sequence since
LZ algorithm is not capable of accounting for amino
acid substitution frequencies and similarity scores.
Hence, any application that uses RCM to generate a distance matrix of protein sequences should be linked to
treating the sequence with a reduced amino acid alphabet (RAAA) prior to calculating their RCMs.
In this paper, we utilize RCM with different reduced
amino acid alphabets and assess RCM’s potential in
clustering protein families into functional subtypes
based solely on sequence data. This method clustered
seven well-characterized protein families into their functional subtypes with 92% - 100% accuracy.
Methods
Datasets
Simulated Dataset
Performance of RCM was tested on a simulated dataset
that contains 10 randomly evolved protein sequences
from a root sequence of length 500 by using INDELible
V1.02 [20]. Simulated protein sequences were generated
according to the following parameters:
1. JTT-dcmut [21] was chosen as the amino acid
substitution model.
2. Power law insertion/deletion length distribution
model with a = 1.7 and maximum allowed insertion/
deletion length of 500 were used.
3. Both insertion and deletion rates were set to the
default parameter of 0.1 relative to average substitution
rate of 1%.
4. Length of the root protein sequence was set to 500.
5. The rooted tree with 10 taxa that reflects the true
phylogenetic evolution of the sequences was generated
along with the true MSA from which the true tree was
inferred.
6. The true MSA was then inputted into ClustalW2
[22] and the bootstrap tree was generated (1000
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
Page 3 of 10
in procedures employed in directed evolution experiments [26,31]. One of the finest examples is the reduction of amino acid alphabet into a binary code that is
composed of characters representing polar and nonpolar amino acid residues [27]. Grouping of amino acid
residues has also been used extensively in HydrophobicPolar (HP) lattice model to explain the hydrophobic
collapse theory of protein folding [32].
A recent study was carried out by Peterson et al to
test the performance of over 150 RAAAs on the
sequence library from DALIpdb90 database and showed
that RAAAs improves sensitivity and specificity in fold
prediction between protein sequence pairs with high
structural similarity and low sequence identity [33].
We tested performances of six amino acid reduction
schemes with 15 different level of groupings to separate
proteins into functional subfamilies (Table 2). These
included three top performing RAAA (HSDM17,
SDM12, GBMR4) from reference [33] and three random
RAAA of size 4.
bootstrap trials, including positions with gaps, and
correcting for multiple substitutions)
Protein Datasets
RCM was tested on seven protein datasets. Number of
sequences, number of subfamilies, average length, standard deviation of sequence lengths and mean percent
identities (PID) [23] of sequences for each family are
summarized in Table 1. Protein sequences for mandelate
racemases, crotonases, haloacid dehalogenases and vicinal oxygen chelates (VOC) were extracted from extensively curated Structure-Function Linkage Database
which contains sets of subfamily grouping for a large set
of protein families. SFLD contains protein families with
a hierarchical classification scheme based on sequence,
structure and conserved chemical reactions at the superfamily, subgroup, and family levels [24]. Crotonases and
haloacid dehalogenases were filtered such that subfamilies that contain less than 3 sequences or more than 200
sequences were removed to prevent sequence number
bias and to reduce computational complexity. Unknown
or unspecified amino acids were discarded (21, 22 and
10 occurrences in mandelate racemase, crotonase and
VOC family, respectively). The protein sequences for
acyl transferase (AT) domains and nucleotidyl cyclases
were obtained from reference [25]. The protein
sequences in the hard-to-align dataset that contains glycoside hydrolase family 2 (GH2) members were adapted
from reference [3]. Expert curated annotations of protein sequences and abbreviations used for sequences in
this study are provided in Additional File 1.
Substitution Matrices
Amino acids that are within the same group in a RAAA
are considered identical [33]. Substitution matrices that
assign the same similarity score to each amino acid
within the same group were obtained from reference
[33]. For those RAAAs in the EB scheme and the three
random RAAAs, new substitution matrices were created
from BLOSUM62 frequency counts using the same
procedure outlined in reference [33].
Reduced Amino Acid Alphabets
Lempel-Ziv Complexity
Sequence space of proteins is redundant and generates
only a limited number of folds, domains, and structures
[26]. Various strategies have been devised that take a
coarse-grained approach to account for the degeneracy
of sequences by grouping similar amino acids together
[17-19,27-30]. Grouping is usually carried out based on
structural and physiochemical similarities of amino acids
[28]. Grouping of amino acids in sequence space can
help develop prediction methods for various sequence
determinants and decrease the amount of search space
In this paper, a normalized distance measure that was
previously used for phylogenetic tree construction of
whole genome sequences was employed. The distance
measure was based on Lempel-Ziv [34] complexity and
was known to accurately cluster all related genomic
sequences under one branch of the tree [8].
Lempel-Ziv (LZ) complexity score of a sequence is
obtained by counting the number of steps required to
generate a copy of the primary sequence starting from a
null state. At each step, an amino acid or a series of
Table 1 General Properties of the Datasets
Family
μ Length
s Length
μ PID*
# of sequences
# of subfamilies
Crotonases
467
13
332
87
21
Mandelate racemases
184
8
416
74
27
Vicinal oxygen chelates
309
18
294
108
14
Haloacid dehalogenases
195
14
303
137
12
Nucleotidyl cyclases
Acyl transferases
75
177
2
2
1059
290
200
12
21
41
GH2 hydrolases
33
4
872
160
15
* Mean Percent Identity (μ PID) is the average of all pairwise sequence identities in a given family.
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
Page 4 of 10
Table 2 Reduced Amino Acid Alphabets
Size
Matrix
Gaps#
Reference
ML*
4,8,10,15
BL50
12/2
[28]
EB§
13,11,9,8,5
BL62
11/1
[18]
HSDM*
17
HSDM
19/1
[29]
SDM*
12
SDM
7/1
4
BL62
11/1
[30]
4,4,4
BL62
11/1
This study
LZ-based distance measures defined in Out et al performed slightly worse than the above distance (data not
shown). Although in performance between five measures
were not significant, we adopted the aforementioned
distance for its ability to account for length variance.
[29]
Scheme
GBMR*
§
RANDOM
Reduced amino acid schemes used in this study.* Substitution matrices for
these reduced alphabets were obtained from reference [33]. § BL62 frequency
counts were used to derive these substitution matrices using the formula
outlined in reference [33]. #Gap opening/gap extension penalties used for
MSAs in ClustalW2.
amino acids are copied from the subsequence that has
been constructed thus far allowing for a single letter innovation. The number of steps needed to obtain the whole
sequence is identified as the LZ-complexity score of the
given sequence. The exhaustive library of a sequence is
defined as the smallest number of distinct amino acid or
amino acid combinations required to construct the
sequence using a copying process described by Lempel
and Ziv [34]. For example, the LZ-complexity of the simple sequence ‘AAILNAIIANNL’ would be obtained as
shown in Table 3. Since seven steps are needed to generate the whole sequence, the LZ-complexity score for this
sequence is 7. The LZ-complexity of a sequence ‘X’ compared to a sequence ‘Y’ is known as the RCM of ‘X’ with
respect to ‘Y’. This is the number of steps required to construct sequence ‘X’ beginning with ‘Y’ instead of a null
sequence. Five different distance metrics have been suggested by Otu et al [8], however, this work used only the
following normalized distance metric that accounts for the
differences in sequence lengths:
D XY =
c( XY )+ c(YX )− c( X )− c(Y )
c( XY )+ c(YX )
2
The relative complexity measure (RCM) for creation of
the distance matrix was utilized as previously described
[8]. Phylogenetic trees were generated from distance
matrices using neighbor-joining [35] program of the
phylogeny inference package, PHYLIP 3.68 [36]. Unrooted trees were rooted with midpoint rooting by placing the root halfway between the two most distinct
taxa. Midpoint-rooted trees were converted to cladograms (i.e., branch lengths are discarded) using the
Retree program of PHYLIP package [36]. Phylogenetic
trees for all protein families and RAAAs are shown in
supplementary materials (Additional File 2) in Newick
format and can be visualized with a tree-drawing
program.
ClustalW2
Protein sequences in each family were aligned using
ClustalW2 [22] for comparison with RCM. MSAs were
performed using updated substitution matrices with gap
extension and gap opening penalties provided in Table
2. Bootstrap analyses were carried out 100 times and
trees containing bootstrap values were created using
ClustalW2 with the neighbor-joining clustering algorithm. For convenience, MSAs that were carried out
using ClustalW2 will be referred as the MSA or the
MSA method for the rest of the article.
Tree Based Classification (TBC)
where c(XY) and c(YX) are RCM of X appended to
Y and Y appended to X, respectively. Remaining four
Table 3 Lempel-Ziv Complexity
Sequence X = AAILNAIIANNL
Exhaustive History
Distance Matrix & Phylogenetic Tree
Complexity
A
1
AI
2
L
3
N
AII
4
5
AN
6
NL
7
HE(X)
C(X) = 7
The exhaustive library construction and Lempel-Ziv complexity score
calculation of sequence X.
TBC algorithm [4] was used to check the accuracy of
each tree in separating protein families into subfamilies.
TBC divides a tree into disjoint subtrees and assigns a
protein subfamily to a subtree that maximizes the number of true positives when the proportions of fp/(tp+fp)
and fn/(tp+fn) are both equal to 0.5 for a given subtree,
where fp is the number of false positives, fn is the number of false negatives and tp is the number of true positives. Above proportions correspond to the “maximal
allowed contamination” level that minimizes the TBC
error over the whole tree.
TBC requires a bifurcating tree of sequences in a protein family and an attribute file that contains expert
curated assignment of each sequence to a particular
subfamily. TBC accuracy (i.e., the percentage of correctly classified sequences) is the primary performance
measure to evaluate the division of protein families into
subtypes using the TBC algorithm. TBC accuracy is
equal to 1- %TBC error where %TBC error is the total
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
Page 5 of 10
number of fp, fn, and unclassified sequences divided by
the total number of sequences. For a detailed analysis of
the TBC algorithm, refer to reference [4].
Protocol
The proposed algorithm operates on a set of
sequences in FASTA format. After one of the alphabets given in Table 1 is applied to all the sequences in
the dataset, RCMs are calculated and used to obtain
the distance between each pair for the neighbor-joining clustering to create a phylogenetic tree. For each
RAAA, a single tree based on RCM is generated and
analyzed using TBC algorithm to determine how well
it clusters different subfamilies under different
branches of the tree.
For simulated dataset, three phylogenetic trees were
compared: The true tree generated by INDELible, the
bootstrap tree and the RCM tree. INDELible creates a
true MSA of the simulated protein sequences. This
alignment was used in ClustalW2 and bootstrapped
1000 times and the resulting tree was called the bootstrap tree. The third tree is the RCM tree that was
generated by the proposed approach.
For seven protein datasets, first, the original fasta
sequences were used to calculate RCMs and their associated RCM trees. Second, the original fasta sequences
were re-coded using different RAAAs (Table 2) and the
reduced sequences were used to calculate their RCMs
and the associated RCM trees.
A similar procedure was applied to the phylogenetic
trees using the MSA method. For each protein family,
MSA was carried out using the corresponding substitution matrices and gap penalties provided in Table 2.
MSA-based trees were created following bootstrap analysis (100 replicates) with ClustalW2.
Finally, for each family, a total of 16 phylogenetic trees
(1 for 20-letter alphabet, 12 for RAAAs, and 3 for random RAAAs) for each method are generated and
checked how well they separated families into subfamilies. A summary of the overall workflow is depicted in
Figure 1.
Sequences in
Protein Datasets
ClustalW2
Original & Reduced Amino
Acid Alphabets
MSA
(Different Substitution Matrices)
LZ Algorithm
ClustalW2
Results and Discussion
Simulated Dataset
Phylogenetic analysis of protein sequences has been intimately connected with MSA. A phylogenetic tree is generated from an evolutionary distance matrix using MSA
of sequences. However, for real biological datasets, the
true tree is rarely known. Therefore, protein sequence
evolution was simulated to study the reliability of the
RCM method. A simulated protein dataset containing
10 protein sequences was generated to show that RCM
coupled with a RAAA can produce a phylogenetic tree
(RCM tree) that is consistent with the true tree and the
bootstrap tree. The true tree is produced by INDELible
and is the original tree that reflects the evolution of 10
simulated sequences. On the other hand, the bootstrap
tree is the tree that was produced by ClustalW2 using
the true MSA implied by INDELible. The bootstrap tree
is identical to the true tree and the bootstrap supports
for all branches are high reflecting the consistency [37]
in the branching. The RCM tree was produced by the
alignment-free RCM approach. The RCM tree is identical to both the true tree and the bootstrap tree reflecting its potential for use in phylogenetic analysis of
protein sequences. The tree topology of only one of the
trees is shown in Figure 2 since they are all identical.
Performance of the RCM approach
We applied the RCM approach to seven protein datasets. RCM method showed an efficient division of protein families into subfamilies using RAAAs. Phylogenetic
trees of the seven protein families using RCM approach
are shown in Figure 3 for ML15 alphabet. Detailed comparison of RCM with MSA in terms of TBC accuracy,
the number and percentage of TBC error for each
RAAA and each dataset is provided in Additional File 3.
Crotonases
Members of crotonase family contain 467 protein
sequences from 13 different subfamilies and catalyze
diverse metabolic reactions with certain family members
displaying dehalogenase, hydratase, and isomerase activities. TBC accuracy varied between 96.4% and 100% for
RCM
Calculation
Bootstrap
Phylip
Neighbor & Retree
Phylip
Retree
RCM
Tree
TBC
Misclassified
Sequences
MSA
Tree
TBC
Misclassified
Sequences
Figure 1 Protocol Overview. For RCM, the original sequences and sequences recoded with reduced alphabets are used to calculate RCM-based
distances which are then inputted sequentially to the Neighbor-Joining and Retree programs of the PHYLIP v3.68 package. For MSA, first,
alignments are carried out using ClustalW2 with substitution matrices corresponding to each amino acid alphabet. Following bootstrap analysis
with ClustalW2, Retree program is used to root the trees with midpoint rooting and to discard branch lengths. Each phylogenetic tree is then
inputted to the TBC algorithm along with its attribute file that shows the expert assignment of each sequence to each family to calculate the
TBC error.
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
Page 6 of 10
mandelate racemases contain a conserved histidine, presumably acting as an active site base [38]. When the
RCM approach was tested on mandelate racemases, all
resulting trees showed correct assignment of functional
subfamilies into 8 different clusters with 100% accuracy
using all alphabets except GBMR4 that resulted in
96.7% TBC accuracy.
3
6
8
9
4
Vicinal oxygen chelates (VOC)
5
VOC family contains 309 sequences from 18 different
subfamilies. The number of TBC accuracy varied
between 77.7% and 92% for RCM and 81.9% to 91.3%
for MSA. Members of VOC have an average sequence
length of 294 amino acids and a mean PID of 14%
(Table 1). The low PID and the highly divergent nature
of this family make its subfamilies susceptible to misclassification more than other families based on
sequence information alone. In this dataset, EB8 performed better than 20-letter alphabet (92.2% vs. 91.3%)
with RCM while GBMR4, ML4, EB8, EB, EB13 and 20letter alphabets resulted in 91.3% TBC errors with MSA.
2
7
10
1
Figure 2 Tree topology of the simulated dataset. The identical
topology of the three phylogenetic trees (i.e., RCM tree, bootstrap
tree and true tree) for the simulated dataset is shown.
RCM. The top performing RAAA with the smallest size
was GBMR4 that resulted in 100% TBC accuracy. TBC
accuracy was 100% for all RAAAs tested with MSA.
Mandelate Racemases
The mandelate racemase dataset contains 184 sequences
that are assigned to 8 expert curated subfamilies. All
A
D
Haloacid dehalogenases
Haloacid dehalogenases contains 195 sequences that
belong to 14 different subfamilies. Haloacid dehalogenase family is similar to VOCs in its highly divergent nature based on the low mean PID (12%) that places the
sequences in this family in the “twilight zone” to infer
C
B
E
F
G
Figure 3 Phylogenetic trees of protein families. RCM trees were drawn using ML15 alphabet. For each family, the taxa corresponding to
different subfamilies are colored differently. (A) Crotonases (B) Mandelate racemases (C) Vicinal oxygen chelates (D) Haloacid dehalogenase (E)
Nucleotidyl cyclases (F) Acyl transferases (G) GH2 hydrolases
Albayrak et al. BMC Bioinformatics 2010, 11:428
http://www.biomedcentral.com/1471-2105/11/428
Page 7 of 10
Table 4 TBC errors for top performing RAAA
Crotonases
Mandelate
racemases
Vicinal
oxygen
chelates
Haloacid
dehalogenases
Nucleotidyl
cyclases
Acyl
transferases
GH2
hydrolases
RCM
MSA
RCM
MSA
RCM
MSA
RCM
MSA
RCM
MSA
RCM
MSA
RCM
MSA
Accuracy
100
100
100
100
91.6
91.3
93.3
99.5
100
100
91.5
97.2
87.9
100
Error
0
0
0
0
8.4
8.7
6.7
0.5
0
0
8.5
2.8
12.1
0
Statistics for top
performing
RAAA
Accuracy
100
100
100
100
92.2
91.3
96.9
99.5
100
100
97.2
97.2
100
100
Error
0
0
0
0
0
0
2.8
2.8
0
0
Top performing
RAAAs
RAAA
GBMR4
20 letter
ML4 ML4 GBMR4
GBMR4
ML4
7.8
8.7
3.1
0.5
EB8
GBMR4
ML4
ML15
ML8
ML4 GBMR4 ML4 ML4
ML4
ML4
GBMR4
GBMR4 GBMR4 GBMR4
TBC accuracy and percentage of TBC error are reported for the 20-letter alphabet and the top performing RAAA. If two RAAAs with the same size have identical
TBC accuracies, both RAAAs are reported at the final row in the table. Bold entries correspond to top performers using RCM and MSA for the specified datasets
any relation between sequences based on sequence
information alone. ML15 was the best performing
RAAA for RCM with 96.9% accuracy (Table 4). The size
of the best performing RAAA for this family is larger
compared to other families hinting that highly divergent
sequences may require larger alphabets with lower level
of grouping.
Nucleotidyl cyclases
Nucleotidyl cyclase family has two functional subfamilies, adenylate and guanylate cyclases that correspond to
use of the substrates ATP and GTP respectively. The
nucleotidyl cyclase family with 33 adenylate cyclases and
42 guanylate cyclases was clustered into two distinct
subfamilies with 100% accuracy using both methods and
all RAAAs except EB5 and EB8 for RCM and ML4 and
EB5 for MSA, all of which resulted in 98.7% accuracy
(Table 4). Moreover, the clustering result for the nucleotidyl cyclases are in agreement with the result obtained
previously by the MSA-dependent clustering algorithm
that uses the residues with the highest evolutionary split
statistic to split protein families into functional subfamilies [25].
Acyl transferases (AT)
The AT domains of Type I modular polyketide
synthases are responsible for the substrate selection.
Most incorporate either a C2 unit (malonyl-CoA substrate) or a C3 unit (methylmalonyl-CoA substrate). The
choice of substrate can be deduced from the chemical
structure of the polyketide product [25]. In the acyl
transferase dataset, 99 of the 177 sequences use C2
units whereas 78 use C3 units as substrate.
Previously, Goldstein et al [25] used evolutionary split
statistic and clustered the AT domains into 2 subfamilies with 2 false assignments for the 5 residue-long
motif. The number of false assignments increased to 5
with increasing motif length (up to 30-residue long)
suggesting that the utilization of a larger motif increases
the noise and error rate. As such, inclusion of only
5 residues (less noise) with high split statistics increases
the assignment accuracy (5 vs. 2 false assignments).
A similar trend is observed in the case of RCM. While
the TBC accuracy for AT domains was only 91% (15
false assignments) with the 20-letter alphabet (Table 4),
the accuracy increased to 97% (5 false assignments) with
the utilization of the ML4, ML8, EB9, ML10, EB11,
SDM12, EB13, and HSDM17 alphabets. Furthermore,
4 of the 5 misclassified sequences using the above
reduced alphabets are contained in the 2, 3 and 4 false
assignments produced by the Goldstein et al ’s approach
using the 5,10 and 15 residue-long motifs, respectively.
Although the accuracy was higher previously, it should
be noted that the RCM approach did neither require an