Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scientific notation in abundance file result in rounding errors #39

Open
donovan-h-parks opened this issue Sep 11, 2023 · 5 comments
Open
Labels

Comments

@donovan-h-parks
Copy link

Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances flag reports pair counts in scientific notation when numbers get large, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Archaea 	439684927 	461 	0.0130496%
domain 	Bacteria 	609216830 	1.05068e+06 	29.7417%

This can result in small errors due to rounding. For example, in this case there is really 1050675 Bacterial read pairs, but it gets rounded up to 1050680. While having 5 extra read pairs is minor in terms of the resulting abundance estimates it makes it challenging to track the fait of all reads. In our code, we have a check that the number of input reads is equal to the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of data, but, more generally, not being able to account for all reads is a bit scary.

I imagine the intent is for this profile to produced integers, so am hoping this can be fixed in the next release. Thanks.

@muellan muellan added the bug label Sep 12, 2023
@muellan
Copy link
Owner

muellan commented Sep 12, 2023

This is clearly a bug. I'll look into it.

@donovan-h-parks
Copy link
Author

Thanks André. Much appreciated.

@muellan
Copy link
Owner

muellan commented Mar 11, 2024

The latest release should fix this issue.

@muellan
Copy link
Owner

muellan commented Mar 11, 2024

@donovan-h-parks: could you maybe check, if this fixes your outputs?

@donovan-h-parks
Copy link
Author

donovan-h-parks commented Mar 11, 2024

Hi @muellan,

I'm still encountering this issue in v2.4.2, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Bacteria 	81602897 	9.96505e+06 	13.2788%
domain 	Archaea 	1337977286 	13005 	0.0173297%
phylum 	Cyanobacteriota 	15887558 	3226 	0.00429877%
phylum 	Fibrobacterota 	31572563 	34 	4.53063e-05%
phylum 	Spirochaetota 	39784000 	1217 	0.0016217%
...

I run MetaCache as follows:

> metacache query my_db P00001898_1.fq.gz P0000189_2.fq.gz -pairfiles -no-map -taxids -lineage -separate-cols -separator      -threads 32 -abundances metacache_raw_profile.tsv -lowest subspecies -out metacache_read_classification.log

MetaCache version details:

> metacache info
------------------------------------------------
MetaCache version    2.4.2 (20240311)
database version     20200820
------------------------------------------------
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants