No support below kingdom level for known E. coli contigs #92
Comments
Dear @mpgriesh, Thanks for this report! It's something that we have noticed ourselves as well, and I think it stems from the fact that many E. coli sequences are misannotated in NCBI nr, as human for example. It may also be that foreign vectors within lab E. coli strains are annotated as E. coli... There is currently no easy solution for this that we can implement except cleaning nr ourselves (we're thinking about how to do this automatically). For now, could you try one of the latest databases (see https://tbb.bio.uu.nl/tina/CAT_prepare/)? NCBI is removing misannotations, so newer databases may be better; then again, they may also contain more misannotations. A more viable alternative may be to use the GTDB database instead of NCBI nr (we have implemented it in the latest version of CAT). GTDB does not have the nr misannotations, but of course it does have a smaller search space. You can also find the latest GTDB database formatted for CAT on https://tbb.bio.uu.nl/tina/CAT_prepare/. Do keep me updated! Best wishes, Bastiaan
Hi, first of all, thanks for the awesome tool. I am using it in my metagenome/transcriptome pipeline, but I am having similar problems. In one example, there was one viral contig whose proteins have hundreds of hits against the correct virus sequences in NR, but there was also one artificial construct whose I know GTDB is curated and a potential solution, but it doesn't have the viruses and eukaryotes that I need. I think there may be two potential solutions.
Hi, I am running CAT to annotate contigs from metagenomes that are known to be E. coli by several other approaches. For example, they share greater than 99% average nucleotide identity with reference E. coli genomes, and BLASTx also identifies the vast majority of ORFs as E. coli. For me, CAT identifies several other contigs with species support in the metagenome, and the db and taxonomy folders from CAT prepare seem to be correct as no errors are encountered.
Despite this, across many samples, of 239 contigs expected to be E. coli, 9 are classified at the family level (Enterobacteriaceae) and the rest are classified as Bacteria.
When I look at the ORFs themselves following add_names, I see something similar as the vast majority of individual ORFs receive no support below Bacteria.
I tried this with a lab E. coli reference genome as a single "contig" and it is classified as Bacteria.
I am using CAT v5.2.3 and the original version of the DB from 2021-01-07 as described in the repo. I do see several strains and species of E. coli in names.dmp and the taxids are present in nodes.dmp.
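For anyone who wants to repeat the names.dmp / nodes.dmp check, here is a minimal sketch of one way to do it. It assumes the standard NCBI taxonomy dump layout (fields separated by tab-pipe-tab, each row ending in tab-pipe). The function names are hypothetical helpers written for this example and are not part of CAT.

```python
def parse_dmp_line(line):
    """Split one NCBI taxonomy dump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def load_nodes(path):
    """Return (parent, rank) dicts keyed by taxid, read from nodes.dmp."""
    parent, rank = {}, {}
    with open(path) as handle:
        for line in handle:
            fields = parse_dmp_line(line)
            parent[fields[0]] = fields[1]
            rank[fields[0]] = fields[2]
    return parent, rank


def load_scientific_names(path):
    """Return {taxid: scientific name}, read from names.dmp."""
    names = {}
    with open(path) as handle:
        for line in handle:
            fields = parse_dmp_line(line)
            if fields[3] == "scientific name":
                names[fields[0]] = fields[1]
    return names


def lineage(taxid, parent):
    """Walk from taxid up to the root, returning the taxids visited."""
    chain, seen = [], set()
    # The root node is its own parent, so stop once a taxid repeats.
    while taxid in parent and taxid not in seen:
        chain.append(taxid)
        seen.add(taxid)
        taxid = parent[taxid]
    return chain
```

With the two files loaded, `lineage("562", parent)` lists the path from Escherichia coli (NCBI taxid 562) up to the root, and each entry can be printed together with `rank[t]` and `names.get(t)`. An unbroken lineage only shows that the taxonomy files are consistent. It does not rule out misannotated sequences in nr.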