Re-think validity method of BSseq #86

PeteHaitch · 2019-09-25T20:06:13Z

The check that 0 <= M <= Cov < Inf is very expensive when these are disk-backed matrices. The validity check code is written in C++ using beachmat with early termination, so it's unlikely to get much faster by re-writing. It may be worth removing this part of the validity check.

The text was updated successfully, but these errors were encountered:

kasperdanielhansen · 2019-09-26T00:59:54Z

I think optionally skipping the check is a no-brainer. I also think we can figure out how to not do the check when an object is loaded (under the assumption that the check was made earlier) or subsetted and probably many other cases.

But in my experience there are always new users out there who are doing strange things, and it is nice to catch that.

PeteHaitch · 2019-09-26T03:15:12Z

But in my experience there are always new users out there who are doing strange things, and it is nice to catch that.

Appreciate this concern. Perhaps BSseq() can check M and Cov but not have it as part of the S4 validity method. The trouble with it being a part of the S4 validity method is that lots of things can trigger the validity check AFAIU,

hpages · 2019-09-26T06:05:49Z

The trouble with it being a part of the S4 validity method is that lots of things can trigger the validity check AFAIU

True. And you don't necessarily have control over these things. Sounds sensible to check M and Cov in the BSseq() constructor only.

BTW the use of beachmat/C++ to implement bsseq:::.checkMandCov() feels overkill. Also the column-wise processing strategy might not play well with some chunking. It's generally more efficient to use a block processing strategy where the blocks are compatible with the chunks and make no assumption about their geometry. This can easily be implemented in R with something like:

check_M_and_Cov <- function(M, Cov)
{
    if (anyNA(M) || anyNA(Cov))
        return("'M' and 'Cov' cannot contain NAs.")
    if (!all(is.finite(Cov)))
        return("All values in 'Cov' must be finite.")
    if (!all(M >= 0L))
        return("All values in 'M' must be >= 0.")
    if (!all(M <= Cov))
        return(paste("All values in 'M' must be less than or equal",
                     "to the corresponding value in 'Cov'."))
    NULL
}
        
### An R version of bsseq:::.checkMandCov() that uses block processing and does
### early bailout.
BLOCK_check_M_and_Cov <- function(M, Cov, grid=NULL)
{
    M_dim <- dim(M)
    Cov_dim <- dim(Cov)
    stopifnot(length(M_dim) == length(Cov_dim), all(M_dim == Cov_dim))
    grid <- DelayedArray:::normarg_grid(grid, M)
    for (b in seq_along(grid)) {
        viewport <- grid[[b]]
        M_block <- read_block(M, viewport)
        Cov_block <- read_block(Cov, viewport)
        msg <- check_M_and_Cov(M_block, Cov_block)
        if (is.character(msg))
            break
    }
    msg
}

To my surprise, using BLOCK_check_M_and_Cov() on Stephanie's data (Bioconductor/Contributions#1207) where the chunks are favorable to a column-wise processing strategy still makes a small difference compared to using bsseq:::.checkMandCov():

library(ExperimentHub)
eh <- ExperimentHub()
path_to_assays_h5 <- eh[["EH3127"]]  # this EH id will probably change very soon!

library(HDF5Array)
M <- HDF5Array(path_to_assays_h5, "assay001")
Cov <- HDF5Array(path_to_assays_h5, "assay002")
dim(M)
# [1] 29039352       44
dim(Cov)
# [1] 29039352       44

### The chunks are very favorable to a column-wise processing strategy:
chunkdim(M)
# [1] 1000000       1
chunkdim(Cov)
# [1] 1000000       1

library(bsseq)
system.time(aa <- bsseq:::.checkMandCov(M, Cov))
#    user  system elapsed 
# 128.094  29.299 157.433 
system.time(bb <- BLOCK_check_M_and_Cov(M, Cov))
#    user  system elapsed 
# 133.949   8.069 142.056

However the difference might become more significant when the chunks are not favorable to a column-wise processing strategy.

H.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-think validity method of BSseq #86

Re-think validity method of BSseq #86

PeteHaitch commented Sep 25, 2019

kasperdanielhansen commented Sep 26, 2019

PeteHaitch commented Sep 26, 2019

hpages commented Sep 26, 2019 •

edited

Loading

Re-think validity method of BSseq #86

Re-think validity method of BSseq #86

Comments

PeteHaitch commented Sep 25, 2019

kasperdanielhansen commented Sep 26, 2019

PeteHaitch commented Sep 26, 2019

hpages commented Sep 26, 2019 • edited Loading

hpages commented Sep 26, 2019 •

edited

Loading