Merging bsseq objects is currently extraordinarily memory inefficient. In initial tests, merging a list of bsseq objects (even a list of only 2 elements!) often drives peak memory above 10 times the summed memory footprint of the individual objects. Even in the worst case, a merge should not need more than about 2 times this sum (the inputs plus one merged copy). Note that these tests quantified only the in-memory portions of the objects, not the HDF5-backed assays.
The end goal here is to (1) fix any inefficient code on my end (e.g. is do.call even expected to do things in a memory-efficient way?), and/or (2) raise GitHub issues against the buggy packages we depend on.
(1) Questionable pieces of my code include:
- using do.call on a list of objects: should we expect do.call to iterate over the list in a memory-efficient way?
- using rbind instead of combine or combineList (the officially intended methods for this purpose), though I'm only using rbind because of this open bug, and a bsseq contributor claims rbind is suitable for our case
- possibly not using required HDF5Array or DelayedArray settings, such as setAutoRealizationBackend("HDF5Array")
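
To help separate the do.call question from the rest, here is a minimal base-R sketch for comparing approximate peak memory between `do.call(rbind, ...)` and a pairwise `Reduce(rbind, ...)`. The helper `peak_mem_mb` is a hypothetical name, and plain numeric matrices stand in for bsseq objects, so this only probes how base R handles the call pattern, not the bsseq `rbind` method. The "max used" counters from `gc()` are updated at garbage-collection time, so the figures are approximate.

```r
# Hypothetical helper: approximate peak memory (in Mb) used while running f().
# Relies on the "max used" counters that gc(reset = TRUE) resets.
peak_mem_mb <- function(f) {
  base <- gc(reset = TRUE)              # collect garbage and reset the maxima
  baseline <- sum(base[, 2])            # Mb in use before running f()
  result <- f()
  after <- gc()
  peak <- sum(after[, ncol(after)])     # last column is "max used" in Mb
  list(result = result, peak_mb = peak - baseline)
}

# Stand-in inputs: two plain matrices (assumption: NOT bsseq objects).
objs <- replicate(2, matrix(runif(1e6), ncol = 10), simplify = FALSE)
input_mb <- sum(vapply(objs, function(x) as.numeric(object.size(x)),
                       numeric(1))) / 2^20

via_do_call <- peak_mem_mb(function() do.call(rbind, objs))
via_reduce  <- peak_mem_mb(function() Reduce(rbind, objs))

# Ratio of extra peak memory to the summed size of the inputs.
cat("do.call ratio:", via_do_call$peak_mb / input_mb, "\n")
cat("Reduce ratio: ", via_reduce$peak_mb / input_mb, "\n")
```

The same helper could wrap the real merge call, which would show whether the blow-up appears only with bsseq objects or already with the call pattern itself.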
(2) Regarding probable issues in dependent packages: combining just the rowRanges of two bsseq objects results in peak memory usage hitting ~4 times the summed memory size of the individual ranges. This is arguably a bug, and I'll need to make a reprex and open an issue for it on the GenomicRanges GitHub. But, as mentioned above, merging 2 bsseq objects is even more memory inefficient, so this GenomicRanges bug is only part of the problem. It's also possible that the HDF5 backend is not being used when merging assays, as noted in this (currently) open issue.
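
As a possible starting point for that reprex, the sketch below combines two synthetic `GRanges` objects with `c()` and reports approximate peak memory relative to the summed input size. The synthetic ranges, their sizes, and the helper `make_ranges` are assumptions for illustration; they are not the rowRanges from the tests described above and may not reproduce the ~4x figure.

```r
library(GenomicRanges)

# Hypothetical helper: n width-1 ranges on a single chromosome.
make_ranges <- function(n, chr) {
  GRanges(seqnames = chr, ranges = IRanges(start = seq_len(n), width = 1L))
}

gr1 <- make_ranges(1e6, "chr1")
gr2 <- make_ranges(1e6, "chr2")
input_mb <- (as.numeric(object.size(gr1)) +
             as.numeric(object.size(gr2))) / 2^20

base <- gc(reset = TRUE)            # reset the "max used" counters
baseline <- sum(base[, 2])          # Mb in use before combining
combined <- c(gr1, gr2)             # the operation under test
after <- gc()
peak_mb <- sum(after[, ncol(after)]) - baseline

# Approximate: "max used" is only updated when the garbage collector runs.
cat("peak / input size:", peak_mb / input_mb, "\n")
```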