
Issue with incorrect virus classification using metabuli binning2report function and suggestion for ICTV taxonomy update #96

Open
Enkabloza opened this issue Oct 14, 2024 · 6 comments


@Enkabloza

Hi Metabuli team,

First of all, thank you for such an excellent tool; I really enjoy using it. I’m currently using Metabuli to classify my viral metagenome sequences with the viral database provided by Metabuli. After running the classification, I obtain a report file that looks like this:

98.6127 39657793 39657793 no rank 0 unclassified
1.3873 557902 166 no rank 1 root
1.3787 554456 812 superkingdom 10239 Viruses
1.2988 522339 0 clade 2731341 Duplodnaviria
1.2988 522339 31 kingdom 2731360 Heunggongvirae
1.2938 520316 0 phylum 2731618 Uroviricota
1.2938 520316 14652 class 2731619 Caudoviricetes
0.5720 230022 229876 genus 2843396 Jouyvirus
0.0003 105 0 species 2844245 Jouyvirus ev017
0.0003 105 105 no rank 2847060 Escherichia phage ev017

After generating this report, I attempt to convert it into a Kraken-style format using the metabuli binning2report function. However, during this conversion, I encounter an issue where the output file no longer includes virus classifications but instead focuses solely on bacteria. The output looks like this:

30.75 1051 1051 no rank 0 unclassified
54.13 1850 485 no rank 1 root
39.88 1363 0 no rank 131567 cellular organisms
39.47 1349 230 superkingdom 2 Bacteria
17.50 598 0 phylum 1224 Pseudomonadota
7.72 264 0 class 28211 Alphaproteobacteria
5.47 187 0 order 356 Hyphomicrobiales
4.27 146 0 family 335928 Xanthobacteraceae
4.04 138 67 genus 6 Azorhizobium
2.08 71 71 species 7 Azorhizobium caulinodans

Why does this issue occur? My goal is to convert my report to Kraken format so that I can eventually transform the files into a BIOM (Biological Observation Matrix) format. This would allow me to combine all my reports into a single file, and then I could use the R phyloseq package to generate various statistics from my samples.
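As a rough illustration of the merging step described above, here is a minimal Python sketch that parses several Kraken-style reports and combines them into one taxon-by-sample count table. It assumes each report line has six tab-separated columns (percentage, clade reads, taxon reads, rank, taxonomy ID, name), which matches the excerpts in this thread. The function names `parse_report` and `merge_reports` and the sample labels are hypothetical, and the two excerpts are reused only as stand-in inputs. Writing an actual BIOM file from the table is not shown.

```python
def parse_report(text):
    """Parse a Kraken-style report (assumed: 6 tab-separated columns per line)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, taxon, rank, taxid, name = line.split("\t", 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "taxon_reads": int(taxon),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # names are often indented by depth
        })
    return rows


def merge_reports(reports):
    """Combine {sample: report_text} into ({taxid: name}, {taxid: {sample: taxon_reads}})."""
    names = {}
    table = {}
    for sample, text in reports.items():
        for row in parse_report(text):
            names[row["taxid"]] = row["name"]
            table.setdefault(row["taxid"], {})[sample] = row["taxon_reads"]
    return names, table


# Stand-in inputs: a few lines copied from the two report excerpts in this thread.
SAMPLE_A = "\n".join([
    "98.6127\t39657793\t39657793\tno rank\t0\tunclassified",
    "1.3873\t557902\t166\tno rank\t1\troot",
    "1.3787\t554456\t812\tsuperkingdom\t10239\t  Viruses",
])
SAMPLE_B = "\n".join([
    "30.75\t1051\t1051\tno rank\t0\tunclassified",
    "54.13\t1850\t485\tno rank\t1\troot",
    "39.47\t1349\t230\tsuperkingdom\t2\t  Bacteria",
])

names, table = merge_reports({"sample_a": SAMPLE_A, "sample_b": SAMPLE_B})
for taxid in sorted(table):
    print(taxid, names[taxid], table[taxid])
```

Taxa missing from a sample simply have no entry for that sample; a downstream step would fill those with zero before building a count matrix.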

Additionally, I noticed there have been previous requests regarding updating the taxonomy to align with ICTV. I came across two resources that might be helpful:

This one explains how to construct NCBI-style taxdump files for the International Committee on Taxonomy of Viruses (ICTV):
https://github.com/shenwei356/ictv-taxdump

This other resource provides a tutorial on how to build a protein FASTA database for ICTV (though it is adapted for MMseqs2, it might help in building an ICTV viral database for Metabuli):
https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/README.md

Thank you for your time and attention. I hope you have a great start to the week!

Best regards,

@jaebeom-kim
Member

Thank you so much for the ICTV resources! They sound very useful.

binning2report was implemented for internal use and hasn't been well maintained.
I didn't realize it was visible to users.
It doesn't convert Metabuli's report to a Kraken report; it converts read-by-read classification results into a Metabuli report.
I wrote "Kraken-style report file" because I considered Metabuli's report to follow Kraken's style.
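To make the distinction concrete, the sketch below shows the general idea of turning per-read classifications into report counts: each read adds one to its own taxon and to every ancestor up to the root. This is only an illustration of the technique, not Metabuli's actual code. The function name `clade_counts`, the in-memory inputs, and the tiny parent map are assumptions made for the example.

```python
from collections import Counter


def clade_counts(read_taxids, parent):
    """Return (direct, clade) read counts per taxonomy ID.

    read_taxids: one taxonomy ID per read (0 = unclassified).
    parent: child taxid -> parent taxid; the root maps to itself.
    """
    direct = Counter(read_taxids)
    clade = Counter()
    for taxid, n in direct.items():
        node = taxid
        while True:
            clade[node] += n
            up = parent.get(node, node)  # unknown or root nodes stop the walk
            if up == node:
                break
            node = up
    return direct, clade


# Hypothetical toy taxonomy using IDs that appear in the report above.
parent = {1: 1, 10239: 1, 2731341: 10239}
reads = [0, 0, 10239, 2731341, 2731341, 1]

direct, clade = clade_counts(reads, parent)
for taxid in sorted(clade):
    print(taxid, clade[taxid], direct[taxid])
```

The two counters correspond to the second and third columns of the report: reads assigned to the clade as a whole, and reads assigned directly to that taxon.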

Let me check BIOM format and see if I can make a module for the conversion you want.
Thanks again!

@jaebeom-kim
Member

jaebeom-kim commented Oct 15, 2024

Thank you so much for the tips !!
I was able to build a viral DB based on ICTV VMR39.2.
It's available here.
https://hulk.mmseqs.com/jaebeom/vmr39.2/
Could you try this and give feedback?
Thanks again :)

Please use this Metabuli version.
https://github.com/jaebeom-kim/Metabuli
The DB is not compatible with the latest release.

@Lelouchzhu

Thank you so much for the tips !! I was able to build a viral DB based on ICTV VMR39.2. It's available here. https://hulk.mmseqs.com/jaebeom/vmr39.2/ Could you try this and give feedback? Thanks again :)

Please use this Metabuli version. https://github.com/jaebeom-kim/Metabuli The DB is not compatible with the latest release.

Thanks for the effort. Out of curiosity, I just checked the latest ICTV taxonomy. They now put SARS-CoV-1 and SARS-CoV-2 together under the odd species name Betacoronavirus pandemicum! I hope you don't follow the ICTV taxonomy too soon, as it is quite confusing right now...

@Enkabloza
Author

Thank you very much, Jaebeom. I hope you're having an excellent start to the week, and I appreciate you taking the time to read my comments and share your database. I apologize for the delayed response. On another note, I have downloaded the ICTV VMR39.2 database that you shared.

Currently, I am using this version of Metabuli: Metabuli Version 1.0.8. I have a question: is the version I downloaded different from the one available in this directory? -> https://github.com/jaebeom-kim/Metabuli

Another question: Should I download the nodes.dmp and names.dmp files from here > https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/ or do I need to build my own taxonomy files using TaxonKit and create an ICTV taxdump file?

Thank you again for your help, and I wish you an excellent day!

Best regards

@jaebeom-kim
Member

Hi! Your comment was very helpful!

Currently, I am using this version of Metabuli: Metabuli Version 1.0.8. I have a question: is the version I downloaded different from the one available in this directory? -> https://github.com/jaebeom-kim/Metabuli

Yes, please use https://github.com/jaebeom-kim/Metabuli when you try the ICTV database. Sorry for the inconvenience. TaxonKit uses the full range of a 32-bit integer for taxonomy IDs, but Metabuli used only 31 bits, so I made a quick fix in my fork.
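The 31-bit versus 32-bit mismatch can be illustrated with a short Python sketch. It assumes the 31-bit limit behaves like a signed 32-bit integer, which is one common way such a limit arises; whether that is exactly how Metabuli stored the IDs is an assumption. The helper names and the ID 3000000000 are hypothetical and chosen only to exceed the limit.

```python
import struct

INT31_MAX = 2**31 - 1   # largest value representable with 31 bits
UINT32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit integer


def fits_in_31_bits(taxid):
    """True if the taxonomy ID can be stored without using the 32nd bit."""
    return 0 <= taxid <= INT31_MAX


def as_signed_32(taxid):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", taxid))[0]


ncbi_id = 2731341        # Duplodnaviria, from the report above
large_id = 3_000_000_000  # hypothetical ID above 2**31 - 1

for taxid in (ncbi_id, large_id):
    print(taxid, fits_in_31_bits(taxid), as_signed_32(taxid))
```

An ID that uses the top bit turns negative when read as a signed value, so any lookup keyed on it would fail or hit the wrong taxon.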

Another question: Should I download the nodes.dmp and names.dmp files from here > https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/ or do I need to build my own taxonomy files using TaxonKit and create an ICTV taxdump file?

You don't need to download any dmp files to try the ICTV database.
That said, I have shared the dmp files at https://hulk.mmseqs.com/jaebeom/vmr39.2/ictv-taxdump/ just in case.

Thanks again!

