Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

case_when returns length-0 option if any option is length-0 #7082

Open
antdurrant opened this issue Sep 5, 2024 · 11 comments
Open

case_when returns length-0 option if any option is length-0 #7082

antdurrant opened this issue Sep 5, 2024 · 11 comments

Comments

@antdurrant
Copy link

case_when always returns length-0 option if any option is length-0

I would expect an error as in case_match. This seems to only be an issue if the input to case_when is length 1 - it errors otherwise.

 x <- 5

# silently returns the wrong thing
dplyr::case_when(
    x == 5 ~ "hello",
    x > 10 ~ character(),
    .default = "other"
)
#> character(0)

# errors
dplyr::case_match(
    x,
    5 ~ "hello",
    10 ~ character(),
    .default = "other"
)
#> Error in `dplyr::case_match()`:
#> ! `..2 (right)` must have size 1, not size 0.

packageVersion("dplyr")
#> [1] '1.1.4'

y <- 5:7

# errors out as expected
dplyr::case_when(
    y == 5 ~ "hello",
    y > 10 ~ character(),
    .default = "other"
)
#> Error in `dplyr::case_when()`:
#> ! Can't recycle `..1 (left)` (size 3) to match `..2 (right)` (size 0).

# errors
dplyr::case_match(
    y,
    5 ~ "hello",
    10 ~ character(),
    .default = "other"
)
#> Error in `dplyr::case_match()`:
#> ! `..2 (right)` must have size 3, not size 0.

Created on 2024-09-05 with reprex v2.1.1

@dchiu911

This comment was marked as off-topic.

@DavisVaughan

This comment was marked as resolved.

@DavisVaughan
Copy link
Member

First off, just know that I do agree with you here. My ideal version of case_when() would follow two rules:

  • The LHSs must all have the same size. No recycling is done. That size of the LHSs determines the output size.
  • The RHSs must either have size 1 or size size. Size 1 RHSs get recycled to size size.

And in fact the lower level internal dplyr:::vec_case_when() is implemented this way, and we patch case_when() on top of it to be more liberal with the recycling.

In case_when() itself, we still have to deal with the legacy behavior of TRUE ~ meaning "everything else gets this default value", so all of the LHS values get recycled against each other and we can't easily change that (so that goes against bullet 1).

For some reason we also recycle the LHSs and RHSs together to compute the common size, rather than only using the LHSs (that goes against bullet 2 and causes the issue you are reporting). I can't think of any reason this is useful.


It's possible we could look into getting rid of this 2nd behavior, and make case_when() follow:

  • The LHSs get recycled against each other to compute a size of the output (i.e. a modified first bullet from above so TRUE ~ is supported still)
  • The RHSs must either have size 1 or size size. Size 1 RHSs get recycled to size size (i.e. the exact second bullet from above)

The change would be around here

dplyr/R/case-when.R

Lines 168 to 170 in 1d17672

# `case_when()`'s formula interface finds the common size of ALL of its inputs.
# This is what allows `TRUE ~` to work.
.size <- vec_size_common(!!!conditions, !!!values, .size = .size)

@DavisVaughan
Copy link
Member

DavisVaughan commented Sep 6, 2024

Also, in case_match() the behavior is more obvious. We get a primary input, .x, so we get to assert that all the RHSs should have size 1 or size equal to size of .x. No recycling needed.

@DavisVaughan
Copy link
Member

This also seems to only really affect things when the LHSs all have length 1, causing them to get recycled to any other size (i.e. the size 0 from character() on the RHS). So it doesn't come up that often, which gives me hope that we can possibly make this stricter.

x <- 1
y <- 1:2

dplyr::case_when(
    x == 5 ~ "hello",
    x > 10 ~ character(),
    .default = "other"
)
#> character(0)

dplyr::case_when(
    y == 5 ~ "hello",
    y > 10 ~ character(),
    .default = "other"
)
#> Error in `dplyr::case_when()`:
#> ! Can't recycle `..1 (left)` (size 2) to match `..2 (right)` (size 0).
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::case_when(y == 5 ~ "hello", y > 10 ~ character(), .default = "other")
#>  2. │ └─vctrs::vec_size_common(!!!conditions, !!!values, .size = .size)
#>  3. └─vctrs::stop_incompatible_size(...)
#>  4.   └─vctrs:::stop_incompatible(...)
#>  5.     └─vctrs:::stop_vctrs(...)
#>  6.       └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

Created on 2024-09-06 with reprex v2.0.2

@antdurrant
Copy link
Author

@DavisVaughan - yep, those two rules match my mental model pretty precisely.

I suspect this realistically only happens by accident. I certainly only found out about this behaviour by having something that wound up being length-0 on one of the RHS (pulling text on a filter where a third party had silently stopped providing anything that matched the filter at all). Obviously not the most defensive programming on my part.

I can see having to hand the legacy TRUE causing issues (and actually kind of understand slightly better now why .default is preferred)

@DavisVaughan
Copy link
Member

See also #7088 where RHS recycling could possibly be used in some cases to incorrectly mimic a simple if statement. I'd argue this should be an error, and changing the behavior as suggested above where the RHSs need to be size 1 or size size (where size comes from the common size of the LHSs) would indeed help make this an error too.

dplyr::case_when(
  FALSE ~ c(1, 2),
  TRUE ~ 3
)
#> [1] 3 3

@benhmin
Copy link

benhmin commented Oct 1, 2024

I have had similar issues to those discussed here.

What surprises me as a user is that the case_when statement doesn't bypass an element that is false instead of checking whether the right side is an acceptable length.

example:

a <- NULL

b <- case_when(is.character(a) ~ a, TRUE ~ NA)

Error in `dplyr::case_when()`:
! `..1 (right)` must be a vector, not `NULL`.
Hide Traceback1. ├─dplyr::case_when(is.character(a) ~ a, TRUE ~ NA)
 2. │ └─dplyr:::vec_case_when(...)
 3. │   └─vctrs::list_check_all_vectors(values, arg = values_arg, call = call)
 4. └─vctrs:::stop_scalar_type(`<fn>`(NULL), "..1 (right)", `<env>`)
 5.   └─vctrs:::stop_vctrs(...)
 6.     └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

I would expect case_when not to care what the right side of the first argument is given that the left side is false. In this use case I have gone back to an if/else workflow but it's a little less readable and not my usual when working with tidy data.

@DavisVaughan
Copy link
Member

@benhmin if you are using case_when() for single TRUE / FALSE values, then I would highly suggest you use a simple if/else instead. if/else is absolutely the way to go there. case_when() is for vectorized cases. I think we really need to make this clearer in the docs because it seems to trip up a lot of people.

@benhmin
Copy link

benhmin commented Oct 1, 2024

@benhmin if you are using case_when() for single TRUE / FALSE values, then I would highly suggest you use a simple if/else instead. if/else is absolutely the way to go there. case_when() is for vectorized cases. I think we really need to make this clearer in the docs because it seems to trip up a lot of people.

Thank you, this makes sense.

I think one of the main uses cases I have for case_when() is in a piped data manipulation workflow where there are more than two outcomes and so ifelse()/if_else() can't capture them all. I think that would be ok still based on this, yes?

@DavisVaughan
Copy link
Member

Like, the outcome possibilities are "A", "B", or "C", but its always a single scalar value? In that case, I'd say you just want a "simple" switch() statement

FWIW

  • case_when() is vectorized if/else
  • case_match() is vectorized switch()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants