Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misclassification of Melissococcus plutonius #2

Open
duancopeland opened this issue Sep 12, 2022 · 5 comments
Open

misclassification of Melissococcus plutonius #2

duancopeland opened this issue Sep 12, 2022 · 5 comments

Comments

@duancopeland
Copy link

Hello and thanks for this tool. Recently I applied the BEExact database to a dataset from an EFB outbreak. The number one OTU came up as:

Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillales_unclassified(100);Lactobacillales_unclassified(100);Lactobacillales_unclassified(100);

When I BLAST this sequence in the database it indeed comes up as Melissococcus plutonius

Not sure why this hit would come up under Bacilli, but I hope this helps to improve this awesome tool.

@bdaisley
Copy link
Owner

Hi @wapoga

Thanks for bringing this up. Melissococcus plutonius is a member of Bacilli at the 'class' rank, and also Lactobacillales at the 'order' rank, so this would appear correct for the higher rank levels of the original taxonomy string you got. But I agree, it seems quite odd that it came back as unclassified at the family level and lower ranks.

I did a quick validation check with the 'IDTAXA___BEEx-V3V4-TS.RData' classifier database (v2021.0.2) using a 16S rRNA gene sequence from a confirmed Melissococcus plutonius isolate (NCBI ref NR_074098.1 - https://www.ncbi.nlm.nih.gov/nuccore/NR_074098.1) and was able to get the correct species-level identification as follows:

Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Enterococcaceae(98.90);Melissococcus(98.90);Melissococcus_plutonius(98.90)

If you don't mind, could you please provide the sequence you were trying to classify (copy and paste here or email is fine) as well as the the following details so that I may reproduce the result you got and address this issue more thoroughly:

  • Which BEExact database did you use? (e.g. V4, V3V4, V5V6, etc)
  • Which classifier software did you use and were any of the settings different than default? (e.g. IDTAXA, QIIME2, others)
  • Did you use clustering methods (97-99% OTUs) or denoising algorithms (100% ASVs/SVs) to determine your sequence reads?

@duancopeland
Copy link
Author

i'll get these to you in the next day

@duancopeland
Copy link
Author

Maybe there's an explanation for this in our protocol:
We used mothur for the whole analysis
using mothur I took the dadaBEEx-FL-TS.fa and trimmed it to v3v4 region, then aligned our sequences to this
Then I classified the seqs with the BEEx_FL_bxid_refs_sequences.fasta reference
In mothur I used the make.shared command to get unique seqs (ASVs)

Here is the representative sequence for the Melissococcus that is coming up as:
Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillales_unclassified(100);Lactobacillales_unclassified(100);Lactobacillales_unclassified(100);

AGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAATAGGGGAAAGAGTAACTGTTTTCCTCGTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGAAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACA

@bdaisley
Copy link
Owner

Okay, so I believe I figured out what was going wrong here. Replicating the steps you indicated, I ran the align.seqs command in mothur and found that the misclassified sequence aligned with 100% identity to BX004500;Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Melissococcus;Melissococcus_plutonius.

QueryName QueryLength TemplateName TemplateLength SearchMethod SearchScore AlignmentMethod QueryStart QueryEnd TemplateStart TemplateEnd PairwiseAlignmentLength GapsInQuery GapsInTemplate LongestInsert SimBtwnQuery&Template
misclassified_seq 426 BX004500 253 kmer 58.23 needleman 176 426 1 251 251 0 0 0 100.00

However, when I ran the classify.seqs command in mothur it was misclassified the same as you mentioned:

>misclassified_seq
AGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAATAGGGGAAAGAGTAACTGTTTTCCTCGTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGAAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACA

mothur > classify.seqs(fasta=misclassified_seq.fasta, reference=BEEx-FL-bxid-refs_sequences.fa, taxonomy=BEEx-FL-bxid-refs_taxonomy.txt)

Result:
misclassified_seq=Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Lactobacillales_unclassified(100);Lactobacillales_unclassified(100);Lactobacillales_unclassified(100); 

To see if this could be a formating issue with the RDP classifier implemented in mothur, I tested the misclassified_seq using another RDP classifier algorithm implemented in the dada2 R package. The sequence was classified correctly:

library(dada2)
library(data.table)

query <- readDNAStringSet("misclassified_seq.fasta", format="fasta")
database <- "dada2___BEEx-FL-TS.fa.gz"

dada2::assignTaxonomy(query, database, multithread=FALSE, outputBootstraps=TRUE)
Result:
misclassified_seq=Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Enterococcaceae(81);Melissococcus(72);Melissococcus_plutonius(72);

Based on this, I constructed a single seed subset reference databse from the full-length BEExact database for use with mothur. The respective files are available here:
BEEx-FL-seeds_sequences.fasta
BEEx-FL-seeds_taxonomy.txt

To format the taxa headers and make a mothur-specific classification databse, I ran the following:

awk 'FNR==NR{
  a[">"$1]=$2;next
}
$1 in a{
  sub(/>/, $1 "\t" "Root;" a[$1])
}1' BEEx-FL-seeds_taxonomy.txt BEEx-FL-seeds_sequences.fasta > mothur___BEEx-FL-seeds_sequences.fasta

I've uploaded this formatted database to the classifier databases directory, available here:
mothur___BEEx-FL-seeds_sequences.fasta

To confirm these changes solved the issue, I ran the following in mothur:


mothur > classify.seqs(fasta=misclassified_seq.fasta, reference=mothur___BEEx-FL-seeds_sequences.fasta, taxonomy=BEEx-FL-seeds_taxonomy.txt)

Result:
misclassified_seq=Bacteria(100);Firmicutes(100);Bacilli(100);Lactobacillales(100);Enterococcaceae(100);Melissococcus(98);Melissococcus_plutonius(98);

Everything looks to be in order now. I would suggest re-classifying the rest of your sequences using the mothur-formatted database, as it is likely other taxa were also incorrectly classified.

Let me know if this solves the problem on your end.

***Additional info: For clarity to others who may read this, before running align.seqs, I first aligned the full-length BEEx-FL-TS.fa database to the 50,000-character global SILVA alignment (via SINA v1.2.11), and then trimmed the alignment file to the the V3-V4 region (i.e. positions 13862 to 23444) using the pcr.seqs command in mothur. I've uploaded the 50K alginment file to the BEExact seeds directory available here in case it may be helpful to others.

@duancopeland
Copy link
Author

Thanks Brendan, you are awesome!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants