Scientific notation in abundance file result in rounding errors #39

donovan-h-parks · 2023-09-11T21:10:56Z

Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances flag reports pair counts in scientific notation when numbers get large, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Archaea 	439684927 	461 	0.0130496%
domain 	Bacteria 	609216830 	1.05068e+06 	29.7417%

This can result in small errors due to rounding. For example, in this case there is really 1050675 Bacterial read pairs, but it gets rounded up to 1050680. While having 5 extra read pairs is minor in terms of the resulting abundance estimates it makes it challenging to track the fait of all reads. In our code, we have a check that the number of input reads is equal to the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of data, but, more generally, not being able to account for all reads is a bit scary.

I imagine the intent is for this profile to produced integers, so am hoping this can be fixed in the next release. Thanks.

The text was updated successfully, but these errors were encountered:

muellan · 2023-09-12T11:51:22Z

This is clearly a bug. I'll look into it.

donovan-h-parks · 2023-09-18T13:56:06Z

Thanks André. Much appreciated.

muellan · 2024-03-11T10:02:51Z

The latest release should fix this issue.

muellan · 2024-03-11T10:19:41Z

@donovan-h-parks: could you maybe check, if this fixes your outputs?

donovan-h-parks · 2024-03-11T15:48:29Z

Hi @muellan,

I'm still encountering this issue in v2.4.2, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Bacteria 	81602897 	9.96505e+06 	13.2788%
domain 	Archaea 	1337977286 	13005 	0.0173297%
phylum 	Cyanobacteriota 	15887558 	3226 	0.00429877%
phylum 	Fibrobacterota 	31572563 	34 	4.53063e-05%
phylum 	Spirochaetota 	39784000 	1217 	0.0016217%
...

I run MetaCache as follows:

> metacache query my_db P00001898_1.fq.gz P0000189_2.fq.gz -pairfiles -no-map -taxids -lineage -separate-cols -separator      -threads 32 -abundances metacache_raw_profile.tsv -lowest subspecies -out metacache_read_classification.log

MetaCache version details:

> metacache info
------------------------------------------------
MetaCache version    2.4.2 (20240311)
database version     20200820
------------------------------------------------
...

muellan added the bug label Sep 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scientific notation in abundance file result in rounding errors #39

Scientific notation in abundance file result in rounding errors #39

donovan-h-parks commented Sep 11, 2023

muellan commented Sep 12, 2023

donovan-h-parks commented Sep 18, 2023

muellan commented Mar 11, 2024

muellan commented Mar 11, 2024

donovan-h-parks commented Mar 11, 2024 •

edited

Loading

Scientific notation in abundance file result in rounding errors #39

Scientific notation in abundance file result in rounding errors #39

Comments

donovan-h-parks commented Sep 11, 2023

muellan commented Sep 12, 2023

donovan-h-parks commented Sep 18, 2023

muellan commented Mar 11, 2024

muellan commented Mar 11, 2024

donovan-h-parks commented Mar 11, 2024 • edited Loading

donovan-h-parks commented Mar 11, 2024 •

edited

Loading