How to handle species values ending in 'bacterium', sometimes with bin information #134

Dries-B · 2024-10-29T15:20:36Z

Hi!

When using CAT, I get results where the taxonomy columns look as follows:

Archaea; Euryarchaeota; Methanomicrobia; Methanomicrobiales; NA; NA; Methanomicrobiales archaeon

I don't think the species value Methanomicrobiales archaeon has added value here, so I can remove it. (The same goes for values ending in bacterium.)

However, I also get the following:

Archaea; Euryarchaeota; Thermoplasmata; Methanomassiliicoccales; NA; NA; Methanomassiliicoccales archaeon Mx-03

I assume that the Mx-03 is bin information coming from the database. Is that correct? In that case, I think keeping the information would be best for my analysis.

In processing the data, intuitively, I think it would be good to fill the NA values of the genus and family columns with the species information, because it does not make sense to me that we would assign a species, but not its genus. (This is relevant to my research because I am comparing different taxonomy levels in my analysis.) What is your view on this?

thauptfeld · 2024-11-01T07:23:09Z

Hi!

Honestly, I understand why you feel that this doesn't add a lot of information. Like you said, this is indeed an issue coming from the database itself. CAT only parses the taxonomy files and names that NCBI and GTDB provide. In NCBI, the official lineage is given as a set of taxids, and each taxid has a taxonomic rank attached. So, an organism might have the lineage (not a real example): 1,2,3,4,5; and then the ranks would be: root,superkingdom,phylum,order,species. So, adding the name to the family rank doesn't work, as according to NCBI, the organism doesn't have a family rank.
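To illustrate why the NA values appear, here is a minimal Python sketch of placing a lineage into fixed rank columns. It is not CAT's actual code: the function name `lineage_to_columns`, the `OFFICIAL_RANKS` list, and the taxids 1 to 5 with their names are all made up for illustration, mirroring the non-real example above.

```python
# Ranks shown as output columns (an assumption for this illustration).
OFFICIAL_RANKS = ["superkingdom", "phylum", "class", "order",
                  "family", "genus", "species"]


def lineage_to_columns(lineage, rank_of, name_of):
    """Put each taxid's name under its own rank; ranks absent from the lineage stay 'NA'."""
    columns = {rank: "NA" for rank in OFFICIAL_RANKS}
    for taxid in lineage:
        rank = rank_of.get(taxid)
        if rank in columns:  # skips 'root' and any unranked nodes
            columns[rank] = name_of[taxid]
    return [columns[rank] for rank in OFFICIAL_RANKS]


# Made-up taxids and rank assignments, not real NCBI entries.
rank_of = {1: "root", 2: "superkingdom", 3: "phylum", 4: "order", 5: "species"}
name_of = {
    1: "root",
    2: "Archaea",
    3: "Euryarchaeota",
    4: "Methanomicrobiales",
    5: "Methanomicrobiales archaeon",
}

row = lineage_to_columns([1, 2, 3, 4, 5], rank_of, name_of)
print("; ".join(row))
```

Because no taxid in this made-up lineage carries the class, family, or genus rank, those columns have nothing to show and remain NA.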

How to handle this really depends on what you are trying to do with this information. For your own analysis, you can of course add or remove any information that you find (not) useful. But once it comes to sharing your results, things become a bit more complicated. Under the FAIR data principles, you should adhere to a particular taxonomy; in this case you seem to have picked NCBI. Each taxonomy has a "correct" set of taxids and attached names, so renaming your Methanomicrobiales archaeon would make your research less interoperable and reusable for other people, since there would be "arbitrary" changes to the organism names that people cannot easily figure out.

I think your options would be to either keep the names as they are, or work with the taxids instead of the names to compare samples etc., then only add the names at the very end. Then you also wouldn't see NA values until the end.
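As a rough sketch of the second option, the comparison can be keyed on taxids and the names joined in only when reporting. The per-contig assignments in `sample_a` and `sample_b` are made-up data, and the variable names are hypothetical; the three taxid/name pairs are the ones mentioned in this thread.

```python
from collections import Counter

# Made-up species-level taxid assignments per contig for two hypothetical samples.
sample_a = [1906667, 1713724, 1713724, 1820006]
sample_b = [1906667, 1906667, 1820006]

# Names are looked up only at the very end.
names = {
    1906667: "Methanomassiliicoccales archaeon",
    1713724: "Methanomassiliicoccales archaeon RumEn M1",
    1820006: "Methanomassiliicoccales archaeon Mx-03",
}

counts_a = Counter(sample_a)
counts_b = Counter(sample_b)

# Compare samples on taxids alone.
shared = sorted(set(counts_a) & set(counts_b))

# Attach names only for the final report.
report = [(taxid, names[taxid], counts_a[taxid], counts_b[taxid]) for taxid in shared]
for taxid, name, n_a, n_b in report:
    print(taxid, name, n_a, n_b)
```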

Let me know if you have any more questions!

Dries-B · 2024-11-11T10:14:24Z

Thank you for your reply!

My question does indeed revolve around analysis. I'm trying to define taxonomic features for a classifier and, of course, the rank of the definition is relevant here. (I agree that it's preferable not to alter a tool's output in a non-standard way.)

> So, an organism might have the lineage (not a real example): 1,2,3,4,5; and then the ranks would be: root,superkingdom,phylum,order,species. So, adding the name to the family rank doesn't work, as according to NCBI, the organism doesn't have a family rank.

For me, it would make more sense to skip the species-level annotation here. But as you state, this is a choice of the databases.

> I think your options would be to either keep the names as they are, or work with the taxids instead of the names to compare samples etc., then only add the names at the very end. Then you also wouldn't see NA values until the end.

Thank you for the suggestion, but it seems to me that switching from names to lineage/taxids would not get rid of this problem; it would only make it less visible to the eye.

Furthermore, I think that using taxids does not prevent the following problem: species 1906667: Methanomassiliicoccales archaeon matches neither the taxid, nor the name of species 1713724: Methanomassiliicoccales archaeon RumEn M1 (while both share order 1577790 Methanomassiliicoccales).

To me it seems that 1906667 should be seen as a 'rest taxid', and we should not take it into account at species level. Taxid 1713724 seems more specific, as it is distinguished from the rest contigs in taxid 1906667 (as well as another, more specific, taxid: 1820006 Methanomassiliicoccales archaeon Mx-03). We might choose to take these more specific taxids into account, or choose not to.

Do you agree with this reasoning?
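One way to put the 'rest taxid' idea into practice is a name-based heuristic, sketched below in Python. This is only an assumption about how such a filter could look: the function `classify_species_name` and the category labels are hypothetical, the rule looks at the name string alone rather than at any NCBI metadata, and it does not establish whether a suffix such as Mx-03 really is a bin label.

```python
import re

# "<single-word parent taxon> archaeon|bacterium", optionally followed by a label.
PLACEHOLDER = re.compile(
    r"^(?P<parent>\S+) (?:archaeon|bacterium)(?: (?P<label>.+))?$"
)


def classify_species_name(name):
    """Return 'generic', 'labelled', or 'regular' for a species-level name."""
    match = PLACEHOLDER.match(name)
    if match is None:
        return "regular"   # e.g. an ordinary binomial name
    if match.group("label"):
        return "labelled"  # placeholder name plus a strain/isolate/bin-like label
    return "generic"       # bare placeholder, a candidate 'rest taxid'


for name in [
    "Methanomassiliicoccales archaeon",
    "Methanomassiliicoccales archaeon RumEn M1",
    "Methanomassiliicoccales archaeon Mx-03",
]:
    print(name, "->", classify_species_name(name))
```

With this rule, names in the 'generic' category could be set to NA at species level, while 'labelled' ones are kept or dropped depending on the analysis. Placeholder names with a multi-word parent would fall through to 'regular', so the pattern would need extending for those.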

Dries-B · 2024-11-11T13:35:59Z

As this issue relates to the NCBI database, I think it may be a more general bioinformatics question. I have therefore rephrased it and posted it on Biostars: https://www.biostars.org/p/9605332/.

Dries-B changed the title ~~Species ending in bacterium, sometimes with bin information~~ How to handle species values ending in 'bacterium', sometimes with bin information Oct 29, 2024
