How to handle species values ending in 'bacterium', sometimes with bin information #134

Open
Dries-B opened this issue Oct 29, 2024 · 3 comments

Comments

@Dries-B

Dries-B commented Oct 29, 2024

Hi!

When using CAT, I get results where the taxonomy columns look as follows:

Archaea; Euryarchaeota; Methanomicrobia; Methanomicrobiales; NA; NA; Methanomicrobiales archaeon

I don't think the species value Methanomicrobiales archaeon has added value here, so I can remove it. (The same goes for values ending in bacterium.)

However, I also get the following:

Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon Mx-03

I assume that the Mx-03 is bin information coming from the database. Is that correct? In that case, I think keeping the information would be best for my analysis.

When processing the data, my intuition is to fill the NA values in the genus and family columns with the species information, because it does not make sense to me that we could assign a species but not its genus. (This is relevant to my research because I am comparing different taxonomy levels in my analysis.) What is your view on this?

@Dries-B Dries-B changed the title Species ending in bacterium, sometimes with bin information How to handle species values ending in 'bacterium', sometimes with bin information Oct 29, 2024
@thauptfeld
Collaborator

Hi!

Honestly, I understand why you feel that this doesn't add a lot of information. Like you said, this is indeed an issue coming from the database itself. CAT only parses the taxonomy files and names that NCBI and GTDB provide. In NCBI, the official lineage is given as a list of taxids, and each taxid has a taxonomic rank attached. So, an organism might have the lineage (not a real example): 1,2,3,4,5; and then the ranks would be: root,superkingdom,phylum,order,species. So, adding the name to the family rank doesn't work because, according to NCBI, the organism doesn't have a family rank.
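To illustrate why the NA values appear, here is a minimal sketch of mapping such a lineage onto fixed rank columns. It is not CAT's actual code: the function name `lineage_to_columns`, the `CANONICAL_RANKS` list, and the taxids 1 to 5 are assumptions made up for this example.

```python
# Hypothetical sketch, not taken from CAT: place each lineage member in the
# column of its own rank and leave ranks that the lineage lacks as "NA".
CANONICAL_RANKS = ["superkingdom", "phylum", "class", "order",
                   "family", "genus", "species"]


def lineage_to_columns(lineage, ranks, names):
    """Return one name per canonical rank, or "NA" where the rank is absent."""
    by_rank = {rank: names[taxid] for taxid, rank in zip(lineage, ranks)}
    return [by_rank.get(rank, "NA") for rank in CANONICAL_RANKS]


# Made-up lineage mirroring the example above (these taxids are not real).
lineage = [1, 2, 3, 4, 5]
ranks = ["root", "superkingdom", "phylum", "order", "species"]
names = {
    1: "root",
    2: "Archaea",
    3: "Euryarchaeota",
    4: "Methanomicrobiales",
    5: "Methanomicrobiales archaeon",
}

columns = lineage_to_columns(lineage, ranks, names)
print("; ".join(columns))
# Archaea; Euryarchaeota; NA; Methanomicrobiales; NA; NA; Methanomicrobiales archaeon
```

The class, family, and genus columns stay NA because no taxid in the lineage carries those ranks, not because the information was lost during parsing.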

How to handle this really depends on what you are trying to do with this information. For your own analysis, you can of course add or remove any information that you find (not) useful. But once it comes to sharing your results, things become a bit more complicated. Under the FAIR data principles, you should adhere to a particular taxonomy; in this case you seem to have picked NCBI. Each taxonomy has a "correct" set of taxids and attached names, so renaming your Methanomicrobiales archaeon would make your research less interoperable and reusable for other people, since there would be "arbitrary" changes to the organism names that people cannot easily trace.

I think your options would be to either keep the names as they are, or work with the taxids instead of the names to compare samples etc., then only add the names at the very end. Then you also wouldn't see NA values until the end.
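As a rough sketch of the second option, the comparison can run entirely on taxids, with names looked up only when the final report is built. The taxids 5, 6, and 7 and the per-contig lists are placeholders invented for this example, not real NCBI identifiers or CAT output.

```python
from collections import Counter

# Placeholder taxids (not real NCBI identifiers), one entry per classified contig.
sample_a = [5, 5, 6]
sample_b = [5, 7]

# Name lookup, consulted only when the final report is built.
names = {
    5: "Methanomicrobiales archaeon",
    6: "Methanomassiliicoccales archaeon Mx-03",
    7: "Methanomassiliicoccales archaeon",
}

counts_a, counts_b = Counter(sample_a), Counter(sample_b)

# The comparison itself uses taxids only.
shared = sorted(set(counts_a) & set(counts_b))

# Names are attached at the very end, purely for display.
report = [(taxid, names[taxid], counts_a[taxid], counts_b[taxid])
          for taxid in shared]
print(report)
```

Keeping the taxid in the report alongside the name lets others map the result back to the taxonomy release that was used.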

Let me know if you have any more questions!

@Dries-B
Author

Dries-B commented Nov 11, 2024

Thank you for your reply!

My question does indeed revolve around analysis. I'm trying to define taxonomic features for a classifier and, of course, the rank of the definition is relevant here. (I agree that it's preferable not to alter a tool's output in a non-standard way.)


> So, an organism might have the lineage (not a real example): 1,2,3,4,5; and then the ranks would be: root,superkingdom,phylum,order,species. So, adding the name to the family rank doesn't work, as according to NCBI, the organism doesn't have a family rank.

For me, it would make more sense to skip the species-level annotation here. But as you state, this is a choice made by the databases.


> I think your options would be to either keep the names as they are, or work with the taxids instead of the names to compare samples etc., then only add the names at the very end. Then you also wouldn't see NA values until the end.

Thank you for the suggestion, but it seems to me that switching from names to lineages/taxids would not get rid of this problem; it would only make it less visible to the eye.


Furthermore, I think that using taxids does not prevent the following problem: species 1906667: Methanomassiliicoccales archaeon matches neither the taxid, nor the name of species 1713724: Methanomassiliicoccales archaeon RumEn M1 (while both share order 1577790 Methanomassiliicoccales).

To me it seems that 1906667 should be seen as a 'rest taxid', and we should not take it into account at species level. Taxid 1713724 seems more specific, as it is distinguished from the rest contigs in taxid 1906667 (as well as another, more specific, taxid: 1820006 Methanomassiliicoccales archaeon Mx-03). We might choose to take these more specific taxids into account, or choose not to.
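A minimal sketch of how this distinction could be made in practice, assuming it is based purely on the name pattern: NCBI does not flag 'rest' taxids as such, so the regular expression and the category labels `rest`, `labelled`, and `named` are my own heuristic, not part of CAT or NCBI.

```python
import re

# Heuristic pattern (an assumption): "<parent taxon> archaeon|bacterium [label]".
PLACEHOLDER = re.compile(
    r"^(?P<parent>\S+) (?:archaeon|bacterium)(?: (?P<label>.+))?$"
)


def classify_species(name):
    """Classify a species name as 'rest', 'labelled', or 'named'."""
    match = PLACEHOLDER.match(name)
    if match is None:
        return "named"  # does not follow the placeholder pattern
    # A trailing label such as "Mx-03" marks a more specific taxid.
    return "labelled" if match.group("label") else "rest"


species = {
    1906667: "Methanomassiliicoccales archaeon",
    1713724: "Methanomassiliicoccales archaeon RumEn M1",
    1820006: "Methanomassiliicoccales archaeon Mx-03",
}
categories = {taxid: classify_species(name) for taxid, name in species.items()}
print(categories)
# {1906667: 'rest', 1713724: 'labelled', 1820006: 'labelled'}
```

Whether the `labelled` taxids are kept as species-level features or collapsed to their parent rank would then be an explicit, documented choice rather than a by-product of the names.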

Do you agree with this reasoning?

@Dries-B
Author

Dries-B commented Nov 11, 2024

As this issue relates to the NCBI database, I think it may be a more general bioinformatics question. I have therefore rephrased it and posted it on Biostars: https://www.biostars.org/p/9605332/.
