Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect annotations with custom reference database #2083

Open
kodnerlab opened this issue Feb 17, 2025 · 1 comment
Open

Incorrect annotations with custom reference database #2083

kodnerlab opened this issue Feb 17, 2025 · 1 comment

Comments

@kodnerlab
Copy link

Hello! I have created a custom reference database for my ecosystem (18S V4) to assign taxonomic annotations using assignTaxonomy().

I just used the database on a small test dataset and the output looks good, but I noticed something unexpected. The most abundant ASV is assigned to the correct genus, but a different species that I would expect. I BLASTed the ASV sequence and get a 100% match to Chloromonas brevispina, which happens to be the first sequence in the custom database that is formatted like this (first two sequences below):

Eukaryota;Viridiplantae;Chlorophyta;core_chlorophytes;Chlorophyceae;CS_clade;Chlamydomonadales;Chlamydomonadaceae;Chloromonadinia;Chloromonas;Chloromonas brevispina
CCGTAGTAATTCTAGAGCTAATACGTGCGTAAATCCCGACTTCTGGAAGGGACGTATTTATTAGATAAAAGGCCAGCCGGGCTTGCCCGACTTGAGGTGAATCATGATAACTCCACGAATCGCACGGCCTTGTGCCGGCAATGTTTCATTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGAGGCCTACCATGGTGGTAACGGGTGACGGAGGATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAATGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCCGACACGGGGAGGTAGTGACAATAAATAACAATACCGGGCATTTTATGTCTGGTAATTGGAATGAGCACAATCTAAATCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGTGGGTTGCAATGGTCCGGTTCGCTGTGTACTGTTGCGGCCTTCCTTTCTGCCGGGGACGGGCTCTTGGGCTTCACTGTCTGGGATCCGGAGTCGGCGAGGTTACTTTGAGTAAATTAGAGTGTTCAAAGCAAGCTTACGCTCTGAATACATTAGCATGGAATAACACGATAGGACTCTGGCCTATCTTGTTGGTCTGTAGGACTGGAGTAATGATTAAGAGGGACAGTCGGGGGCATTCGTATTTCATTGTCAGAGGTGAAATTCTTGGATTTATGAAAGACGAACTTCTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGCTCGAAGACGATTAGATACCGTCGTAGTCTCAACCATAAACGATGCCGACTAGGGATTGGCAGGTGTTCTTTTGATGACCCTGCCAGCACCTTATGAGAAATCAAAGTTTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGCGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAAACTTACCAGGTCCAGACACGGGGAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGGTTGCCTTGTCAGGTTGATTCCGGTAACGAACGAGACCTCAGCCTGCTAAATAGTCACATCTGCCTCTCGCAGCTGGCCGACTTCTTAGAGGGACTATTGTCGTTTAGGCAATGGAAGTATGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGCGTTCAACGAGCCTATCCTTGGCCGAGAGGCCCGGGTAATCTTCGAAACCGCATCGTGATGGGGATAGATTATTGCAATTATTAGTCTTCAACGAGGAATGCCTAGTAAGCGCG
Eukaryota;Viridiplantae;Chlorophyta;core_chlorophytes;Chlorophyceae;CS_clade;Chlamydomonadales;Chlamydomonadaceae;Chloromonadinia;Chloromonas;Chloromonas_sp Hakkoda-1
AGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATAAACTGCTTTATACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTATTTGATGGTACTTACTACTCGGATAACCGTAGTAATTCTAGAGCTAATACGTGCGTAAATCCCGACTTCTGGAAGGGACGTATTTATTAGATAAAAGGCCAGCCGGGCTTGCCCGACTTGAGGTGAATCATGATAACTCCACGAATCGCATGGCCTTGTGCCGGCGATGTTTCATTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGAGGCCTACCATGGTGGTAACGGGTGACGGAGGATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAATGGCTACCACATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCAATCCCGACACGGGGAGGTAGTGACAATAAATAACAATACCGGGCATTTTATGTCTGGTAATTGGAATGAGCACAATCTAAATCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGTGGGTTGCAATGGTCCGGTTCGCTGTGTACTGTTGCGGCCTTCCTTTCTGCCGGGGACGGGCTCTTGGGCTTCACTGTCTGGGATCCGGAGTCGGCGAGGTTACTTTGAGTAAATTAGAGTGTTCAAAGCAAGCTTACGCTCTGAATACATTAGCATGGAATAACACGATAGGACTCTGGCCTATCTTGTTGGTCTGTAGGACTGGAGTAATGATTAAGAGGGACAGTCGGGGGCATTCGTATTTCATTGTCAGAGGTGAAATTCTTGGATTTATGAAAGACGAACTTCTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGCTCGAAGACGATTAGATACCGTCGTAGTCTCAACCATAAACGATGCCGACTAGGGATTGGCAGGTGTTCTTTTGATGACCCTGCCAGCACCTTATGAGAAATCAAAGTTTTTGGGTTCCGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGCGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAAACTTACCAGGTCCAGACACGGGGAGGATTGACAGATTGAGAGCTCTTTCTTGATTCTGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGGTTGCCTTGTCAGGTTGATTCCGGTAACGAACGAGACCTCAGCCTGCTAAATAGTCACATCTGCCTCTCGCAGCTGGCCGACTTCTTAGAGGGACTATTGTCGTTTAGGCAATGGAAGTATGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATGCGTTCAACGAGCCTATCCTTGGCCGAGAGGCCCGGGTAATCTTCGAAACCGCATCGTGATGGGGATAGATTATTGCAATTATTAGTCTTCAACGAGGAATGCCTAGTAAGCGCGAGTCATCAGCTCGCGTTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGAGTGTGCTGGTGAAGTGTCGGGATTGGCTCTGTTTGATGGCAACATCGAGCAAGGCTAAGAACGTCATTAAACCCTCCCACTTAGAGGAAGGAGAAGTCGTAACAAGGTTTCC

However, the annotation of the top ASV is to Chloromonas hindakii, with a confidence of .63. It's BLAST hit is only 99.4% similar to the ASV.
I don't understand why a 100% match would be passed over during the assignTaxonomy() step. Any thoughts?
The Chloromonas hindakii is a longer reference sequence than Chloromonas brevispina by maybe 1000 bps. Do I have to trim my database to the region of my amplicon? There are only about 5000 sequences in my custom reference database. Thanks!

@benjjneb
Copy link
Owner

For short-read species assignment, using assignTaxonomy is not what we recommend. The RDP classifier algorithm that assignTaxonomy implements was never really intended to perfectly discriminate between reference sequences differering by just a single-nucleotide. There is an alternative function called assignSpecies which uses exact unambiguous matching to assign at the species level that presumably would work correctly here. Also, the default confidence threshold of 0.5 is tuned for genus-level assignment on 16S data -- for species-level with such closely related references a higher threshold may be more appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants