Comparison of Metaxa2 results using blastn and megablast
Metaxa2 v2.1.1 is a bioinformatic tool designed to assign 16S rRNA sequences from a metagenomic dataset to an archaeal, bacterial, nuclear eukaryotic, mitochondrial or chloroplast origin (1). Metaxa2 detects candidate 16S rRNA sequences among the full set of reads using hidden Markov models built from the SILVA database. The ribosomal sequences are then compared against the Metaxa2 database with BLAST, and the method takes the five best hits into account to assign a taxonomic identity to each sequence. In case of ambiguity (reliability score < 80), the algorithm aligns the five best hits using MAFFT and recalculates the reliability score for the next taxonomic level in the lineage until the score is > 80. The method uses blastn by default and offers megablast as an option. We tested the method with the V3V4 lib1 dataset, available in the datasets_16SrRNA directory of this repo, using both options.
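Taxonomic level | Sens (Metaxa2-mtx blastn) | Spec (Metaxa2-mtx blastn) | ACC (Metaxa2-mtx blastn) | MCC (Metaxa2-mtx blastn) | Sens (Metaxa2-mtx megablast) | Spec (Metaxa2-mtx megablast) | ACC (Metaxa2-mtx megablast) | MCC (Metaxa2-mtx megablast) |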
domain | 1.000000 | 0.999840 | 0.999992 | 0.999916 | 1.000000 | 0.999867 | 0.999994 | 0.999930 |
phylum | 1.000000 | 0.999973 | 0.999999 | 0.999986 | 1.000000 | 0.999973 | 0.999999 | 0.999986 |
class | 1.000000 | 0.904833 | 0.994992 | 0.948723 | 1.000000 | 0.904833 | 0.994992 | 0.948723 |
order | 0.969362 | 0.829477 | 0.961332 | 0.699140 | 0.969362 | 0.829477 | 0.961332 | 0.699140 |
family | 0.967010 | 0.382080 | 0.894109 | 0.433813 | 0.967010 | 0.382070 | 0.894108 | 0.433803 |
genus | 0.909425 | 0.602545 | 0.885172 | 0.409323 | 0.909425 | 0.602529 | 0.885171 | 0.409312 |
species | 0.199642 | 0.795149 | 0.235305 | -0.003090 | 0.199642 | 0.795127 | 0.235304 | -0.003103 |
subspecies | 0.111160 | 0.978576 | 0.153370 | 0.062514 | 0.111160 | 0.978576 | 0.153370 | 0.062514 |
For this dataset, the performance statistics were almost identical: every value differs by less than 0.0001 between the two runs at every taxonomic level. This indicates that the choice of BLAST algorithm makes no substantial difference to the sensitivity or specificity of the Metaxa2 taxonomic assignments.
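The four descriptors in the table can be derived from the counts of a confusion matrix. The sketch below uses the standard definitions of sensitivity (Sens), specificity (Spec), accuracy (ACC) and the Matthews correlation coefficient (MCC). It assumes the column abbreviations carry those standard meanings. How true and false positives were counted at each taxonomic level is not described here, so the function name and the input counts are hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Sens, Spec, ACC and MCC from confusion-matrix counts.

    Standard definitions only; this is not the repo's evaluation script.
    """
    # Sensitivity: fraction of real positives that were recovered.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of real negatives that were rejected.
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    # Accuracy: fraction of all calls that were correct.
    acc = (tp + tn) / (tp + tn + fp + fn)
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sens": sens, "Spec": spec, "ACC": acc, "MCC": mcc}


# Hypothetical counts, for illustration only (not taken from the benchmark).
example = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in example.items():
    print(f"{name}\t{value:.6f}")
```

Because MCC uses all four cells of the confusion matrix, it can stay close to zero even when one of the other descriptors is high, as in the species row of the table.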
We ran both configurations on the server of the Biotechnology Institute (UNAM), splitting the dataset into chunks of 1,000 sequences. Each chunk was processed as a separate job using 4 threads with 7G each. The following table reports the average and standard deviation of the time and memory used in each case.
Metaxa2 alignment algorithm | wall clock (s) | ru_utime | ru_stime | CPU | mem | io | maxvmem(G) |
Blastn (average) | 817.833 | 3169.473 | 16.812 | 3191.073 | 2477.961 | 0.096 | 2.456 |
Blastn (SD) | 35.304 | 116.549 | 0.692 | 116.465 | 90.759 | 0.009 | 0.239 |
Megablast (average) | 824.895 | 3166.722 | 16.654 | 3188.276 | 2475.763 | 0.097 | 2.475 |
Megablast (SD) | 33.459 | 114.924 | 1.017 | 115.157 | 89.731 | 0.009 | 0.221 |
Again, the two BLAST algorithms showed highly similar time and memory usage when the method was evaluated on subsets of 1,000 amplicon sequences: the average wall-clock times differ by about 7 s, well within one standard deviation (about 35 s). At least for this kind of data, dividing the main dataset is a good strategy to make more efficient use of computational resources. You will find the chunkMaker.pl script in the bin directory of this repo.
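As an illustration of the splitting step, here is a minimal Python sketch that writes a FASTA file into consecutive chunks of at most 1,000 sequences. It is not the chunkMaker.pl script and does not reproduce its options. The function names, the output file naming and the assumption that the input is plain FASTA are all hypothetical.

```python
from pathlib import Path


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(fasta_path, out_dir, chunk_size=1000, prefix="chunk"):
    """Write chunks of at most chunk_size records and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, out, count = [], None, 0
    with open(fasta_path) as handle:
        for header, seq in read_fasta(handle):
            # Start a new output file every chunk_size records.
            if count % chunk_size == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}_{count // chunk_size + 1:04d}.fasta"
                paths.append(path)
                out = open(path, "w")
            out.write(f"{header}\n{seq}\n")
            count += 1
    if out is not None:
        out.close()
    return paths
```

Each returned path can then be submitted as an independent job, so that time and memory can be measured per chunk as in the table above.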
1. Bengtsson-Palme, J. et al. METAXA2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Mol. Ecol. Resour. 15, 1403–1414 (2015).