Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-think validity method of BSseq #86

Open
PeteHaitch opened this issue Sep 25, 2019 · 3 comments
Open

Re-think validity method of BSseq #86

PeteHaitch opened this issue Sep 25, 2019 · 3 comments

Comments

@PeteHaitch
Copy link
Contributor

The check that 0 <= M <= Cov < Inf is very expensive when these are disk-backed matrices. The validity check code is written in C++ using beachmat with early termination, so it's unlikely to get much faster by re-writing. It may be worth removing this part of the validity check.

@kasperdanielhansen
Copy link
Contributor

I think optionally skipping the check is a no-brainer. I also think we can figure out how to not do the check when an object is loaded (under the assumption that the check was made earlier) or subsetted and probably many other cases.

But in my experience there are always new users out there who are doing strange things, and it is nice to catch that.

@PeteHaitch
Copy link
Contributor Author

But in my experience there are always new users out there who are doing strange things, and it is nice to catch that.

Appreciate this concern. Perhaps BSseq() can check M and Cov but not have it as part of the S4 validity method. The trouble with it being a part of the S4 validity method is that lots of things can trigger the validity check AFAIU,

@hpages
Copy link
Contributor

hpages commented Sep 26, 2019

The trouble with it being a part of the S4 validity method is that lots of things can trigger the validity check AFAIU

True. And you don't necessarily have control over these things. Sounds sensible to check M and Cov in the BSseq() constructor only.

BTW the use of beachmat/C++ to implement bsseq:::.checkMandCov() feels overkill. Also the column-wise processing strategy might not play well with some chunking. It's generally more efficient to use a block processing strategy where the blocks are compatible with the chunks and make no assumption about their geometry. This can easily be implemented in R with something like:

check_M_and_Cov <- function(M, Cov)
{
    if (anyNA(M) || anyNA(Cov))
        return("'M' and 'Cov' cannot contain NAs.")
    if (!all(is.finite(Cov)))
        return("All values in 'Cov' must be finite.")
    if (!all(M >= 0L))
        return("All values in 'M' must be >= 0.")
    if (!all(M <= Cov))
        return(paste("All values in 'M' must be less than or equal",
                     "to the corresponding value in 'Cov'."))
    NULL
}
        
### An R version of bsseq:::.checkMandCov() that uses block processing and does
### early bailout.
BLOCK_check_M_and_Cov <- function(M, Cov, grid=NULL)
{
    M_dim <- dim(M)
    Cov_dim <- dim(Cov)
    stopifnot(length(M_dim) == length(Cov_dim), all(M_dim == Cov_dim))
    grid <- DelayedArray:::normarg_grid(grid, M)
    for (b in seq_along(grid)) {
        viewport <- grid[[b]]
        M_block <- read_block(M, viewport)
        Cov_block <- read_block(Cov, viewport)
        msg <- check_M_and_Cov(M_block, Cov_block)
        if (is.character(msg))
            break
    }
    msg
}

To my surprise, using BLOCK_check_M_and_Cov() on Stephanie's data (Bioconductor/Contributions#1207) where the chunks are favorable to a column-wise processing strategy still makes a small difference compared to using bsseq:::.checkMandCov():

library(ExperimentHub)
eh <- ExperimentHub()
path_to_assays_h5 <- eh[["EH3127"]]  # this EH id will probably change very soon!

library(HDF5Array)
M <- HDF5Array(path_to_assays_h5, "assay001")
Cov <- HDF5Array(path_to_assays_h5, "assay002")
dim(M)
# [1] 29039352       44
dim(Cov)
# [1] 29039352       44

### The chunks are very favorable to a column-wise processing strategy:
chunkdim(M)
# [1] 1000000       1
chunkdim(Cov)
# [1] 1000000       1

library(bsseq)
system.time(aa <- bsseq:::.checkMandCov(M, Cov))
#    user  system elapsed 
# 128.094  29.299 157.433 
system.time(bb <- BLOCK_check_M_and_Cov(M, Cov))
#    user  system elapsed 
# 133.949   8.069 142.056 

However the difference might become more significant when the chunks are not favorable to a column-wise processing strategy.

H.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants