Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes in subsetByColData between in 3.14 #303

Closed
gogonzo opened this issue Oct 29, 2021 · 6 comments
Closed

Changes in subsetByColData between in 3.14 #303

gogonzo opened this issue Oct 29, 2021 · 6 comments

Comments

@gogonzo
Copy link

gogonzo commented Oct 29, 2021

Hi there,
I spotted changes in version for Bioc 3.14 when using subsetByColData. Code below used to work but now it throws an error

subset(
  colData(miniACC),
  subset = race == "white"
)

#> Error: logical subscript contains NAs

Previously it went fine because function use to handle NA in the condition. I think previous state was what user expects. See the definition of subset.data.frame where you can notice r & !is.na(r) which excludes NA from the subsetting operation.

function (x, subset, select, drop = FALSE, ...) 
{
    r <- if (missing(subset)) 
        rep_len(TRUE, nrow(x))
    else {
        e <- substitute(subset)
        r <- eval(e, x, parent.frame())
        if (!is.logical(r)) 
            stop("'subset' must be logical")
        r & !is.na(r)
    }
   ...
    x[r, vars, drop = drop]
}

I think that the problem lays in this line

newcoldata <- coldata[y, , drop = FALSE]

which I believe can be modified to

    newcoldata <- if (is.logical(y)) {
      subset(coldata, subset = y, drop = FALSE)
    } else {
      # subset would normally throws an error here
      # but MAE can be filtered also by indices (integer or rowname)
      coldata[y, , drop = FALSE]
    }

What do you think?
Regards,
Dawid Kaledkowski

@LiNk-NY
Copy link
Collaborator

LiNk-NY commented Oct 29, 2021

Hi Dawid, @gogonzo

We've increased the efficiency of the subsetting operation by cleaning up some of the code and resolving an issue with multiple subsequent subsets. We directly pass along the subset vector to the colData and let DataFrame handle it.
It is possible to use subset in our code but that may change depending on whether subset is meant to replicate the bracket operation [ (which is what we use now). That question would have to be answered by the DataFrame maintainers.

In the meantime, you can definitely use a more complete subset vector, e.g.,
miniACC$race == "white" & !is.na(miniACC$race) or which(miniACC$race == "white").

NAs should be handled explicitly either by removing them beforehand or using an argument like na.rm.

Best regards,
Marcel

@gogonzo
Copy link
Author

gogonzo commented Nov 3, 2021

@LiNk-NY
Yup,
I understand that changes made were needed but I don't see why DataFrame should be responsible for handling NAs. If we refer to the base R, good example is iris[c(1, NA), ] which will return column with NA and that is why I don't expect [.DataFrame to manage missing values.

object[c(1, NA), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1           5.1         3.5          1.4         0.2  setosa
# NA           NA          NA           NA          NA    <NA>

I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]

var <- c(T, NA, rep(F, 148))
subset(iris, var)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa

Do you agree or disagree?

@LiNk-NY
Copy link
Collaborator

LiNk-NY commented Nov 3, 2021

DataFrame is modeled after data.frame but it does not necessarily follow the same behavior in
all cases. If subset is meant to replicate the behavior of [, then the subset code that you mention
would no longer work after a corresponding update in the DataFrame interface.

I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]

We are not following the subset.default contract because subsetByColData is not an S3
subset method.

In order to make the handling of NAs more explicit, one possible enhancement could be to
include a na.rm argument in the function. Otherwise, the user has to make sure that the
input does not include NAs as when subsetting a DataFrame using the bracket method.

Best,
Marcel

@gogonzo
Copy link
Author

gogonzo commented Nov 5, 2021

Hi @LiNk-NY thank you for the answer. I still think that problem lays in MAE, I'm even more sure now about this. There are still problems with the S4Vector which I have reported but it's not case for the MAE::subsetByColData, but it's a smaller problem as far as subset.DataFrame works as expected.
After the (change)[https://github.com/Bioconductor/S4Vectors/issues/89]
subsetting by empty vector is possible, and change in the MAE adjusted to this
case correctly.

library(MultiAssayExperiment)

# S4Vectors and MultiAssayExperiment
DF <- miniACC@colData

subset(DF, race %in% c())
DF[DF$race %in% c(), ]
subsetByColData(miniACC, DF$race %in% c())

# base R
df <- data.frame(a = c(1, NA, 3), b = 1:3)

subset(df, a %in% c())
df[df$a %in% c(), ]

However, MAE failed to cover the case didn't fail in subsetByColData (bioc 3.13). Users see this as a change of the API of the public method. Function also has a name which suggest it's a subset (applied on a colData) and not [ which was correct until the last update.

# S4Vectors and MultiAssayExperiment
subset(DF, race == "white") # 
DF[DF$race == "white", ] # error - reported in S4Vectors
subsetByColData(miniACC, DF$race == "white") # error


# base R
df[df$a > 0, ] # return includes NAs
subset(df, a > 0) # omits NA 

I fully understand the reason - it's because subsetByColData uses [ directly to subset which is good, but not entirely.

newcoldata <- coldata[y, , drop = FALSE]

This line is the reason of the failure, and I wouldn't expect S4Vectors to fix this issue, because it should rather be like I've proposed below or in the first comment.

newcoldata <- coldata[y & !is.na(y), , drop = FALSE]

I can even help you to make a PR if you agree.

Regards,
DK

@LiNk-NY
Copy link
Collaborator

LiNk-NY commented Nov 5, 2021

There is a general misconception that subsetByColData should act like subset and that is not the case.
Any subsetting done in MultiAssayExperiment follows the established DataFrame and related S4 data structures.
If DataFrame changes how they subset, we will follow.

The removal of NA's should be explicit and preferably done by the user. I consider this to be usual and customary practice in R.

As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.

Best regards,
Marcel

@LiNk-NY LiNk-NY closed this as completed Nov 5, 2021
@gogonzo
Copy link
Author

gogonzo commented Nov 8, 2021

@LiNk-NY

As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.

Ok, forget the whole discussion. +1 to na.rm

Regards,
DK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants