Unusual abundance results with custom classifier index #252

Open
hirosatosd opened this issue Feb 1, 2023 · 6 comments

Comments

@hirosatosd

hirosatosd commented Feb 1, 2023

Hi,

I recently created a custom classifier index, and the abundance results don't make sense to me. What would cause the abundance to be so high (0.984) when numReads is only 99 and numUniqueReads is 0? The total read count for this sample was 314521.

image

@mourisl
Collaborator

mourisl commented Feb 2, 2023

This is indeed very strange. Are many reads assigned to the ancestors of 52461?

@hirosatosd
Author

Yeah, at the genus level numReads is 2301874 and numUniqueReads is 2242384.

@mourisl
Collaborator

mourisl commented Feb 2, 2023

I think that when computing the abundance, Centrifuge trickles the abundance down from ancestor taxa to the leaves, which is why this species gets a very high abundance.
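To make the trickle-down idea concrete, here is a toy sketch. It is not Centrifuge's code: the function `trickle_down`, the proportional split rule, the taxon names, and the read counts are all assumptions made up for illustration. It only shows how a leaf with few direct reads can end up with most of the signal once reads sitting on its ancestor are pushed down to it.

```python
def trickle_down(parent, direct_reads):
    """Push reads assigned to internal nodes down to the leaves below them.

    parent:       dict child -> parent (the root has no entry)
    direct_reads: dict node -> reads assigned directly to that node
    Assumed rule: an internal node's reads are split among its leaf
    descendants in proportion to the leaves' own direct reads
    (equal split if none of those leaves has any reads).
    """
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def leaves_under(node):
        kids = children.get(node, [])
        if not kids:
            return [node]
        found = []
        for kid in kids:
            found.extend(leaves_under(kid))
        return found

    nodes = set(parent) | set(parent.values()) | set(direct_reads)
    # Leaves start with their own direct reads.
    leaf_reads = {n: float(direct_reads.get(n, 0)) for n in nodes if n not in children}

    for node in nodes:
        mass = direct_reads.get(node, 0)
        if node not in children or mass == 0:
            continue  # leaf, or internal node with nothing to push down
        leaves = leaves_under(node)
        weights = [direct_reads.get(leaf, 0) for leaf in leaves]
        total = sum(weights)
        if total == 0:
            weights, total = [1] * len(leaves), len(leaves)
        for leaf, weight in zip(leaves, weights):
            leaf_reads[leaf] += mass * weight / total
    return leaf_reads


# Hypothetical numbers: most reads sit on the genus, few on the species.
parent = {"species_A": "genus", "species_B": "genus"}
direct_reads = {"genus": 10000, "species_A": 99, "species_B": 1}

result = trickle_down(parent, direct_reads)
share_A = result["species_A"] / sum(result.values())
print(result["species_A"], result["species_B"], round(share_A, 3))
```

In this toy case species_A has only 99 direct reads but ends up with nearly all of the redistributed reads, which is the kind of effect described above.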

@pjtorres

pjtorres commented Feb 2, 2023

Hi, I also had an issue with the abundance calculation, except in my case 710 unique reads out of 710 total reads (numReads) were assigned to a taxon and the relative abundance is 0.0.

2710803 strain 1080 710 710 0.0

In the same run and report file, another taxon had 0 unique reads and only 5 multi-mapped reads, yet that taxon did get a relative abundance.

2042592 strain 807 5 0 3.67064e-292

Appreciate any advice or guidance. Thanks.

@mourisl
Collaborator

mourisl commented Feb 2, 2023

@pjtorres I think this is more like a rounding error in the EM algorithm. I also recently noticed that some strains may get a very high abundance because their genomes are short. Is that the case for you?

I'm thinking about adding a parameter to ignore the short ones. This might be related to the issue you just opened.
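As a rough illustration of the genome-size effect, here is a minimal sketch of a length-normalized EM. This is a simplified toy and not Centrifuge's implementation: `em_abundance`, the genome names, the lengths, and the read counts are all hypothetical. Each read is shared among the genomes it hits in proportion to the current abundances (E-step), and abundances are then re-estimated as expected reads divided by genome length (M-step).

```python
def em_abundance(read_hits, genome_len, iterations=200):
    """Toy length-normalized EM.

    read_hits:  list of tuples, the genomes each read is compatible with
    genome_len: dict genome -> genome length in bases
    Returns dict genome -> estimated relative abundance (sums to 1).
    """
    names = list(genome_len)
    p = {g: 1.0 / len(names) for g in names}  # uniform start
    for _ in range(iterations):
        # E-step: expected number of reads per genome under current p.
        n = {g: 0.0 for g in names}
        for hits in read_hits:
            denom = sum(p[g] for g in hits)
            if denom == 0.0:
                continue
            for g in hits:
                n[g] += p[g] / denom
        # M-step: abundance is proportional to reads per base.
        rate = {g: n[g] / genome_len[g] for g in names}
        total = sum(rate.values())
        if total == 0.0:
            break
        p = {g: rate[g] / total for g in names}
    return p


# Hypothetical data: 1000 reads unique to a long genome, and
# 5 reads that map equally well to the long and the short genome.
genome_len = {"long_genome": 5_000_000, "short_genome": 1_000}
read_hits = [("long_genome",)] * 1000 + [("long_genome", "short_genome")] * 5

abundance = em_abundance(read_hits, genome_len)
print({g: round(a, 3) for g, a in abundance.items()})
```

With these made-up inputs the short genome has no unique reads and only 5 multi-mapped reads, yet it takes most of the estimated abundance because its reads-per-base rate is far higher than that of the long genome.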

@pjtorres

pjtorres commented Feb 2, 2023

@mourisl thank you for your quick response! I agree that it is some issue with the EM part. I also think it might have to do with genome length. However, in the specific example above you can see that both strains are about the same size (807 vs. 1080), and the strain with no unique reads and 5 multi-mapped reads is the one that got an abundance estimate > 0.0. But adding a parameter to ignore genomes below a particular length would be great. Thanks again.
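Until such a parameter exists, one possible stopgap is to filter the report after the fact. The sketch below is only an assumption-laden illustration: `drop_short_genomes` is a hypothetical helper, the taxon names and numbers are made up, and it assumes a tab-separated report whose header includes `genomeSize` and `abundance` columns. Renormalizing the remaining abundances is not the same as rerunning the EM without the short genomes, so treat the result as approximate.

```python
import csv
import io


def drop_short_genomes(report_text, min_genome_size):
    """Drop report rows below a genome-size cutoff and renormalize abundance.

    report_text:     contents of a tab-separated report with a header row
                     (assumed to contain 'genomeSize' and 'abundance')
    min_genome_size: rows with a smaller genomeSize are removed
    Returns a list of row dicts with 'abundance' rescaled to sum to 1.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    kept = [row for row in reader if int(row["genomeSize"]) >= min_genome_size]
    total = sum(float(row["abundance"]) for row in kept)
    for row in kept:
        row["abundance"] = float(row["abundance"]) / total if total > 0 else 0.0
    return kept


# Hypothetical report contents, for illustration only.
example_report = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "taxon_a\t1\tstrain\t807\t5\t0\t0.5\n"
    "taxon_b\t2\tspecies\t5000000\t900\t850\t0.25\n"
    "taxon_c\t3\tspecies\t3000000\t800\t700\t0.25\n"
)

for row in drop_short_genomes(example_report, min_genome_size=1000):
    print(row["name"], row["abundance"])
```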
