No support below kingdom level for known E. coli contigs #92

Open
mpgriesh opened this issue Aug 28, 2023 · 2 comments

Comments

@mpgriesh

Hi, I am running CAT to annotate contigs from metagenomes that are known to be E. coli by several other approaches: they share greater than 99% average nucleotide identity with reference E. coli genomes, and BLASTx also identifies the vast majority of ORFs as E. coli. CAT does classify several other contigs in the metagenome with species-level support, and the db and taxonomy folders from CAT prepare seem to be correct, as no errors are encountered.

Despite this, across many samples, of 239 contigs expected to be E. coli, 9 are classified at the family level (Enterobacteriaceae) and the rest are classified as Bacteria.

When I look at the ORFs themselves following add_names, I see something similar as the vast majority of individual ORFs receive no support below Bacteria.

I tried this with a lab E. coli reference genome as a single "contig" and it is classified as Bacteria.

image

I am using CAT v5.2.3 and the original version of the DB from 2021-01-07 as described in the repo. I do see several strains and species of E. coli in names.dmp, and the taxids are present in nodes.dmp.
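
One way to go a step beyond checking that the taxids are present is to walk nodes.dmp upward and confirm that each one connects to the root. A minimal sketch, assuming the standard NCBI dump layout (fields separated by `|`, taxid first, parent taxid second); the helper names `read_parents` and `lineage` and the embedded excerpt are illustrative and not part of CAT:

```python
import io

# Illustrative excerpt in nodes.dmp layout: taxid | parent taxid | rank |
TOY_NODES = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
    "1224\t|\t2\t|\tphylum\t|\n"
    "1236\t|\t1224\t|\tclass\t|\n"
    "91347\t|\t1236\t|\torder\t|\n"
    "543\t|\t91347\t|\tfamily\t|\n"
    "561\t|\t543\t|\tgenus\t|\n"
    "562\t|\t561\t|\tspecies\t|\n"
)


def read_parents(handle):
    """Map each taxid to its parent taxid from nodes.dmp-style lines."""
    parents = {}
    for line in handle:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        parents[fields[0]] = fields[1]
    return parents


def lineage(taxid, parents):
    """Return the path from taxid up to the root, or [] if the taxid is unknown."""
    path = []
    # The root is its own parent, so stop once a taxid repeats.
    while taxid in parents and taxid not in path:
        path.append(taxid)
        taxid = parents[taxid]
    return path


parents = read_parents(TOY_NODES)
print(lineage("562", parents))
```

To run this against a real taxonomy folder, replace `TOY_NODES` with `open("nodes.dmp")` (the path is whatever your CAT prepare output uses). An empty list, or a path that does not end in `1`, would point to a broken taxonomy rather than a database annotation problem.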

@bastiaanvonmeijenfeldt
Collaborator

Dear @mpgriesh,

Thanks for this report! It's something that we have noticed ourselves as well, and I think it stems from the fact that many E. coli sequences are misannotated in NCBI nr, as human for example. It may also be that foreign vectors within lab E. coli strains are annotated as E. coli... There is currently no easy solution for this that we can implement except cleaning nr ourselves (we're thinking about how to do this automatically).

For now, could you try one of the latest databases (see https://tbb.bio.uu.nl/tina/CAT_prepare/)? NCBI is removing misannotations, so newer databases may be better; then again, they may also contain more misannotations. A more viable alternative may be to use the GTDB database instead of NCBI nr (we have implemented it in the latest version of CAT). GTDB does not have the nr misannotations, but of course it does have a smaller search space. You can also find the latest GTDB database formatted for CAT on https://tbb.bio.uu.nl/tina/CAT_prepare/.

Do keep me updated!

Best wishes,

Bastiaan

@zxl124

zxl124 commented May 20, 2024

Hi, first of all, thanks for the awesome tool. I am using it in my metagenome/transcriptome pipeline, but I am running into a similar problem. In one example, there was a viral contig whose proteins have hundreds of hits against the correct virus sequences in NR, but there was also one artificial construct whose fastaid2LCAtaxid is 1. This ruins the taxonomy assignment of that entire contig. The same happens for other contigs. I imagine similar entries in the NR database, like artificial sequences or metagenome sequences, would ruin taxonomy assignment in the same way.
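
To illustrate why a single hit mapped to taxid 1 is enough to do this, here is a minimal sketch of a plain lowest-common-ancestor over hit lineages. It is not CAT's actual code; the `lca` helper and the taxon names ("Family X", "Virus Y") are hypothetical, and "root" stands in for taxid 1:

```python
def lca(lineages):
    """Deepest taxon shared by all root-first lineages, or None if there are none."""
    shared = []
    for level in zip(*lineages):  # zip stops at the shortest lineage
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return shared[-1] if shared else None


# Hypothetical hits for one ORF: many consistent viral hits ...
viral_hits = [["root", "Viruses", "Family X", "Virus Y"]] * 300
# ... plus one artificial construct whose lineage is just the root.
construct_hit = ["root"]

print(lca(viral_hits))                    # Virus Y
print(lca(viral_hits + [construct_hit]))  # root
```

Because the LCA has to cover every hit, the number of correct hits does not matter: one root-level entry pulls the ORF to the root.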

I know GTDB is curated and a potential solution, but it doesn't have the viruses and eukaryotes that I need. I think there may be two potential solutions.

  1. Create a blacklist of tax IDs that the find_LCA_for_ORF function would just ignore. This is easier to do, and maybe it's something you are already considering. In addition to tax ID 1, is there anything else you think could be ignored?
  2. Use a voting approach in find_LCA_for_ORF. Just as CAT assigns taxonomy to a contig by voting over the taxonomy of all its ORFs, you could assign taxonomy to each ORF based on the consensus of all its hits, using a fraction option similar to -f. In this case, a small number of "poisonous" proteins in the database wouldn't matter. This could make the program a lot slower, though.
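
Both ideas can be sketched in a few lines. This is only an illustration of the proposal, not CAT's implementation: the function `vote_taxon`, its `fraction` and `ignored` parameters, the bit-scores, and the taxon names are all hypothetical, and "root" again stands in for tax ID 1. Support is weighted by bit-score here, which is an assumption; counting hits would work the same way.

```python
from collections import defaultdict


def vote_taxon(hits, fraction=0.5, ignored=frozenset()):
    """Pick the deepest taxon backed by more than `fraction` of the total bit-score.

    hits: list of (root-first lineage, bit-score) pairs.
    ignored: taxa whose hits are dropped before voting (the blacklist idea).
    """
    kept = [(lin, score) for lin, score in hits if lin and lin[-1] not in ignored]
    total = sum(score for _, score in kept)
    if total == 0:
        return None
    support = defaultdict(float)
    for lin, score in kept:
        for depth, taxon in enumerate(lin):
            support[(depth, taxon)] += score
    winners = [key for key, score in support.items() if score > fraction * total]
    if not winners:
        return None
    return max(winners)[1]  # tuples sort by depth first, so this is the deepest


hits = [(["root", "Viruses", "Family X", "Virus Y"], 200.0)] * 3
hits.append((["root"], 210.0))  # the "poisonous" artificial construct

print(vote_taxon(hits, ignored=frozenset({"root"})))  # blacklist only: Virus Y
print(vote_taxon(hits, fraction=0.5))                 # voting only: Virus Y
print(vote_taxon(hits, fraction=0.9))                 # stricter vote: root
```

With `fraction` at 0.5 or higher, at most one taxon per depth can pass the threshold, so the winners always form a single lineage. The extra cost is one pass over every hit's lineage per ORF, which is where the slowdown mentioned in option 2 would come from.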


3 participants