How to handle species values ending in 'bacterium', sometimes with bin information #134
Hi! Honestly, I understand why you feel that this doesn't add a lot of information. Like you said, this is indeed an issue coming from the database itself. CAT only parses the taxonomy files and names that NCBI and GTDB provide.
In NCBI, the official lineage is given as a set of taxids, and each taxid has a taxonomic rank attached. So, an organism might have the lineage (not a real example): 1,2,3,4,5; and then the ranks would be: root,superkingdom,phylum,order,species. Adding the name to the family rank therefore doesn't work because, according to NCBI, the organism doesn't have a family rank.
How to handle this really depends on what you are trying to do with this information. For your own analysis, you can of course add or remove any information that you find (not) useful. But once it comes to sharing your results, things become a bit more complicated. Considering FAIR data principles, you should adhere to a particular taxonomy; in this case you seem to have picked NCBI. Each taxonomy has a "correct" set of taxids and attached names, so renaming your Methanomicrobiales archaeon would make your research less interoperable and reusable for other people, since there would be "arbitrary" changes to the organism names that people cannot easily figure out.
I think your options would be to either keep the names as they are, or work with the taxids instead of the names to compare samples etc., and only add the names at the very end. Then you also wouldn't see NA values until the end. Let me know if you have any more questions!
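
To illustrate why the family column ends up empty, here is a minimal sketch using the made-up lineage from above. The function name `lineage_to_columns` and the `WANTED_RANKS` list are hypothetical and only serve this illustration; this is not CAT's actual code.

```python
# Ranks we want as output columns (an assumption for this sketch).
WANTED_RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_to_columns(taxids, ranks, wanted=WANTED_RANKS):
    """Map each wanted rank to the taxid that carries it, or None if absent."""
    by_rank = dict(zip(ranks, taxids))
    return {rank: by_rank.get(rank) for rank in wanted}


# The made-up lineage from the comment above (not real taxids).
taxids = [1, 2, 3, 4, 5]
ranks = ["root", "superkingdom", "phylum", "order", "species"]

columns = lineage_to_columns(taxids, ranks)
for rank, taxid in columns.items():
    # Ranks missing from the lineage are reported as NA.
    print(rank, taxid if taxid is not None else "NA")
```

Because no taxid in the lineage has the rank "family", there is nothing to put in that column; any value written there would be an edit on top of the taxonomy rather than something the taxonomy states.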
Thank you for your reply! My question does indeed revolve around analysis. I'm trying to define taxonomic features for a classifier and, of course, the rank of the definition is relevant here. (I agree that it's preferable not to alter a tool's output in a non-standard way.)
For me, it would make more sense to skip the species-level annotation here. But as you state, this is a choice of the databases.
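
For the classifier features, skipping such species values could be sketched roughly as follows. This is a heuristic for one's own analysis only, not an NCBI or CAT rule: the pattern and the function name `species_feature` are assumptions, and names that merely contain these words for other reasons would need to be checked by hand.

```python
import re

# Heuristic (an assumption, not an NCBI rule): a species name is treated as a
# placeholder when it contains the standalone word "bacterium" or "archaeon",
# optionally followed by extra information such as a bin label.
PLACEHOLDER = re.compile(r"\b(bacterium|archaeon)\b")


def species_feature(name):
    """Return the species name, or None for NA and placeholder-style names."""
    if name is None or PLACEHOLDER.search(name):
        return None
    return name


print(species_feature("Methanomicrobiales archaeon"))
print(species_feature("Escherichia coli"))
```

The original CAT output stays untouched in this approach; the filtering only happens when the feature table is built.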
Thank you for the suggestion, but it seems to me that switching from names to lineage/taxids would not get rid of this problem; it would only make it less visible to the eye. Furthermore, I think that using taxids does not prevent the following problem: species To me it seems that 1906667 should be seen as a 'rest taxid', and we should not take it into account at species level. Taxid 1713724 seems more specific, as it is distinguished from the rest contigs in taxid 1906667 (as well as another, more specific, taxid: Do you agree with this reasoning?
As this issue relates to the NCBI database, I think it may be a more general bioinformatics question. I have therefore rephrased it and posted it on Biostars: https://www.biostars.org/p/9605332/.
Hi!
When using CAT, I get results where the taxonomy columns look as follows:
I don't think the species value 'Methanomicrobiales archaeon' has added value here, so I can remove it. (The same goes for values ending in 'bacterium'.)
However, I also get the following:
I assume that 'Mx-03' is bin information coming from the database. Is that correct? In that case, I think keeping the information would be best for my analysis.
In processing the data, intuitively, I think it would be good to fill the NA values of the genus and family columns with the species information, because it does not make sense to me that we would assign a species but not its genus. (This is relevant to my research because I am comparing different taxonomy levels in my analysis.) What is your view on this?