Changes in subsetByColData between in 3.14 #303

gogonzo · 2021-10-29T07:31:31Z

Hi there,
I spotted changes in version for Bioc 3.14 when using subsetByColData. Code below used to work but now it throws an error

subset(
  colData(miniACC),
  subset = race == "white"
)

#> Error: logical subscript contains NAs

Previously it went fine because function use to handle NA in the condition. I think previous state was what user expects. See the definition of subset.data.frame where you can notice r & !is.na(r) which excludes NA from the subsetting operation.

function (x, subset, select, drop = FALSE, ...) 
{
    r <- if (missing(subset)) 
        rep_len(TRUE, nrow(x))
    else {
        e <- substitute(subset)
        r <- eval(e, x, parent.frame())
        if (!is.logical(r)) 
            stop("'subset' must be logical")
        r & !is.na(r)
    }
   ...
    x[r, vars, drop = drop]
}

I think that the problem lays in this line

MultiAssayExperiment/R/subsetBy-methods.R

Line 264 in f256a29

newcoldata <- coldata[y, , drop = FALSE]

which I believe can be modified to

    newcoldata <- if (is.logical(y)) {
      subset(coldata, subset = y, drop = FALSE)
    } else {
      # subset would normally throws an error here
      # but MAE can be filtered also by indices (integer or rowname)
      coldata[y, , drop = FALSE]
    }

What do you think?
Regards,
Dawid Kaledkowski

The text was updated successfully, but these errors were encountered:

LiNk-NY · 2021-10-29T14:56:36Z

Hi Dawid, @gogonzo

We've increased the efficiency of the subsetting operation by cleaning up some of the code and resolving an issue with multiple subsequent subsets. We directly pass along the subset vector to the colData and let DataFrame handle it.
It is possible to use subset in our code but that may change depending on whether subset is meant to replicate the bracket operation [ (which is what we use now). That question would have to be answered by the DataFrame maintainers.

In the meantime, you can definitely use a more complete subset vector, e.g.,
miniACC$race == "white" & !is.na(miniACC$race) or which(miniACC$race == "white").

NAs should be handled explicitly either by removing them beforehand or using an argument like na.rm.

Best regards,
Marcel

gogonzo · 2021-11-03T13:59:40Z

@LiNk-NY
Yup,
I understand that changes made were needed but I don't see why DataFrame should be responsible for handling NAs. If we refer to the base R, good example is iris[c(1, NA), ] which will return column with NA and that is why I don't expect [.DataFrame to manage missing values.

object[c(1, NA), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1           5.1         3.5          1.4         0.2  setosa
# NA           NA          NA           NA          NA    <NA>

I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]

var <- c(T, NA, rep(F, 148))
subset(iris, var)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa

Do you agree or disagree?

LiNk-NY · 2021-11-03T18:31:32Z

DataFrame is modeled after data.frame but it does not necessarily follow the same behavior in
all cases. If subset is meant to replicate the behavior of [, then the subset code that you mention
would no longer work after a corresponding update in the DataFrame interface.

I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]

We are not following the subset.default contract because subsetByColData is not an S3
subset method.

In order to make the handling of NAs more explicit, one possible enhancement could be to
include a na.rm argument in the function. Otherwise, the user has to make sure that the
input does not include NAs as when subsetting a DataFrame using the bracket method.

Best,
Marcel

gogonzo · 2021-11-05T10:56:19Z

Hi @LiNk-NY thank you for the answer. I still think that problem lays in MAE, I'm even more sure now about this. There are still problems with the S4Vector which I have reported but it's not case for the MAE::subsetByColData, but it's a smaller problem as far as subset.DataFrame works as expected.
After the (change)[https://github.com/Bioconductor/S4Vectors/issues/89]
subsetting by empty vector is possible, and change in the MAE adjusted to this
case correctly.

library(MultiAssayExperiment)

# S4Vectors and MultiAssayExperiment
DF <- miniACC@colData

subset(DF, race %in% c())
DF[DF$race %in% c(), ]
subsetByColData(miniACC, DF$race %in% c())

# base R
df <- data.frame(a = c(1, NA, 3), b = 1:3)

subset(df, a %in% c())
df[df$a %in% c(), ]

However, MAE failed to cover the case didn't fail in subsetByColData (bioc 3.13). Users see this as a change of the API of the public method. Function also has a name which suggest it's a subset (applied on a colData) and not [ which was correct until the last update.

# S4Vectors and MultiAssayExperiment
subset(DF, race == "white") # 
DF[DF$race == "white", ] # error - reported in S4Vectors
subsetByColData(miniACC, DF$race == "white") # error


# base R
df[df$a > 0, ] # return includes NAs
subset(df, a > 0) # omits NA

I fully understand the reason - it's because subsetByColData uses [ directly to subset which is good, but not entirely.

MultiAssayExperiment/R/subsetBy-methods.R

Line 264 in f256a29

newcoldata <- coldata[y, , drop = FALSE]

This line is the reason of the failure, and I wouldn't expect S4Vectors to fix this issue, because it should rather be like I've proposed below or in the first comment.

newcoldata <- coldata[y & !is.na(y), , drop = FALSE]

I can even help you to make a PR if you agree.

Regards,
DK

LiNk-NY · 2021-11-05T15:05:02Z

There is a general misconception that subsetByColData should act like subset and that is not the case.
Any subsetting done in MultiAssayExperiment follows the established DataFrame and related S4 data structures.
If DataFrame changes how they subset, we will follow.

The removal of NA's should be explicit and preferably done by the user. I consider this to be usual and customary practice in R.

As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.

Best regards,
Marcel

gogonzo · 2021-11-08T10:00:38Z

@LiNk-NY

As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.

Ok, forget the whole discussion. +1 to na.rm

Regards,
DK

gogonzo mentioned this issue Oct 29, 2021

teal with bioconductor 3.14 insightsengineering/teal#345

Closed

LiNk-NY closed this as completed Nov 5, 2021

gogonzo mentioned this issue Nov 8, 2021

na.rm argument in subsetByColData #304

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes in subsetByColData between in 3.14 #303

Changes in subsetByColData between in 3.14 #303

gogonzo commented Oct 29, 2021

LiNk-NY commented Oct 29, 2021

gogonzo commented Nov 3, 2021

LiNk-NY commented Nov 3, 2021 •

edited

Loading

gogonzo commented Nov 5, 2021

LiNk-NY commented Nov 5, 2021

gogonzo commented Nov 8, 2021 •

edited

Loading

Changes in subsetByColData between in 3.14 #303

Changes in subsetByColData between in 3.14 #303

Comments

gogonzo commented Oct 29, 2021

LiNk-NY commented Oct 29, 2021

gogonzo commented Nov 3, 2021

LiNk-NY commented Nov 3, 2021 • edited Loading

gogonzo commented Nov 5, 2021

LiNk-NY commented Nov 5, 2021

gogonzo commented Nov 8, 2021 • edited Loading

LiNk-NY commented Nov 3, 2021 •

edited

Loading

gogonzo commented Nov 8, 2021 •

edited

Loading