cenote-taker2 vs (blastn nt & diamond nr) #42

Open
NailouZhang opened this issue Dec 20, 2022 · 1 comment

Comments

@NailouZhang

NailouZhang commented Dec 20, 2022

Hi Mike,
I recently ran Cenote-Taker 2, blastn against the nt database, and diamond against the nr database on contigs assembled by Megahit. Cenote-Taker 2 classified about 10,000 sequences as viral, while BLAST identified only about 1,000. I am confused about why BLAST returns ten times fewer viral sequences than Cenote-Taker 2.

As you pointed out, "Many virus genomes are integrated into host chromosomes" and "viral genes and genomes are often misidentified as host sequences" (Tisza M J, Belford A K, Dominguez-Huerta G, et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation[J]. Virus evolution, 2021, 7(1): veaa100.). BLAST may therefore produce some false-negative results. Is there a threshold for classifying sequences as viral or non-viral using both tools (e.g. BLAST E-value, percent identity, or alignment length)?

Wishing you a merry Christmas in advance!

Nailou Zhang

@mtisza1
Owner

mtisza1 commented Jan 11, 2023

Hi Nailou,

Thanks for your comment. It's a bit complicated to assess this without more information about how Cenote-Taker 2 was run and what settings you used with blast and diamond.

Using blastn against nt could be a great way to look for viruses present in this database and their close relatives. However, the vast majority of the viruses that exist on earth are not catalogued in nt. Recent estimates suggest that there are around 1 billion virus species on earth, while the number of virus species in nt is in the tens of thousands.

Of course, as a general statement, Cenote-Taker 2 will return false positives at some unknown rate. If you are querying contigs assembled from WGS reads and you use -db virion --lin_minimum_hallmark_genes 2 --circ_minimum_hallmark_genes 2, I would estimate the false positive rate is about 1%, maybe less. It's hard to measure this meaningfully, in my opinion.
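
For readers who want to compare the two call sets directly, here is a minimal Python sketch of the kind of threshold filtering asked about above. It is an illustration under stated assumptions, not a method from Cenote-Taker 2 or a recommended protocol: it assumes the BLAST results are in the standard 12-column tabular format (-outfmt 6) and have already been restricted to hits on viral subjects, and the function names, contig IDs, and cutoff values are all hypothetical placeholders.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, max_evalue=1e-5, min_pident=80.0, min_length=200):
    """Return query IDs with at least one hit passing all three cutoffs.

    The default cutoffs are illustrative assumptions, not recommendations.
    The input is assumed to contain only hits against viral subjects.
    """
    passing = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        if (float(hit["evalue"]) <= max_evalue
                and float(hit["pident"]) >= min_pident
                and int(hit["length"]) >= min_length):
            passing.add(hit["qseqid"])
    return passing


def compare_calls(blast_ids, cenote_ids):
    """Split contig IDs by which tool called them viral."""
    return {
        "both": blast_ids & cenote_ids,
        "blast_only": blast_ids - cenote_ids,
        "cenote_only": cenote_ids - blast_ids,
    }


# Hypothetical two-line BLAST table, for illustration only.
example = (
    "contig_1\tsubjA\t95.0\t500\t25\t0\t1\t500\t1\t500\t1e-50\t800\n"
    "contig_2\tsubjB\t70.0\t150\t45\t0\t1\t150\t1\t150\t1e-3\t60\n"
)
blast_ids = filter_hits(io.StringIO(example))
overlap = compare_calls(blast_ids, {"contig_1", "contig_3"})
```

Contigs that land in the "cenote_only" group are the ones with no close relative in the database under the chosen cutoffs, which is consistent with the point above that most viruses are not catalogued in nt. Loosening or tightening the cutoffs changes the group sizes, so any single threshold should be treated as a judgment call.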
