New function to drop instead of keep observations (anti filter) #6888

Closed
evanmorier opened this issue Jul 21, 2023 · 2 comments

@evanmorier

evanmorier commented Jul 21, 2023

There are a couple of problems I regularly run into when subsetting data with multiple conditions.

The first has been brought up several times (including in this open issue): filter() drops any observations where any of the variables used in the conditions is missing.
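
As a minimal illustration of that behavior (a toy tibble, not data from this issue; `df`, `kept_hi`, and `kept_lo` are made-up names), a row with a missing value is returned by neither a condition nor its negation, because `filter()` only keeps rows where the condition evaluates to `TRUE`:

```r
library(dplyr)

# Toy data: one missing value in x
df <- tibble(x = c(1, NA, 3))

# The comparison is NA for the missing row, so filter() drops it
kept_hi <- filter(df, x > 2)

# Negating the condition does not bring the NA row back: !NA is still NA
kept_lo <- filter(df, !(x > 2))

# Together the two results cover only 2 of the 3 original rows
nrow(kept_hi) + nrow(kept_lo)
```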

A less obvious issue is readability. I often have to drop observations where any of several conditions is true. Writing this using filter() usually requires complex use of negation and parentheses, even if there are no missings to be concerned about. Understanding and explaining these statements in my code is fraught to say the least.

I propose a new dplyr function called drop_if(). It would drop all rows meeting any of the specified conditions. This would greatly reduce the need for !() and would, by default, keep observations that are missing any of the referenced variables.

The way I wrote this out for now is straightforward to understand but likely not computationally efficient: filter to keep the observations matching the condition, then use anti_join() to drop them from the original data set.

library(tidyverse)

mpg_miss <- mpg %>% 
  mutate(
    class = na_if(class, "pickup"),
    cty   = na_if(cty, 18)
  )

mpg_miss_rown <- mpg_miss %>% 
  mutate(rown = row_number())

# if we want to drop where class is "suv" or cty < 15,
# first collect the row numbers of the matching rows in a temp df
to_drop <- mpg_miss_rown %>% 
  filter(
    class == "suv" | cty < 15
  ) %>% 
  select(rown)

to_drop %>% 
  nrow()
## [1] 89

# then remove those rows from the original data
kept <- mpg_miss_rown %>% 
  anti_join(to_drop) %>% 
  select(-rown)
## Joining, by = "rown"

kept %>% nrow()
## [1] 145

# drops if class or cty is NA
mpg_miss %>% 
  filter(!(class == "suv" | cty < 15)) %>% 
  nrow()
## [1] 114

# annoying syntax but produces desired result here
# (a row where both class and cty are NA would still be dropped)
mpg_miss %>% 
  filter(!(class == "suv" | cty < 15) | (is.na(class) & !(cty < 15)) | (class != "suv" & is.na(cty))) %>%
  nrow()
## [1] 145

## Not run:
# What syntax could look like:
mpg_miss %>% 
  drop_if(class == "suv" | cty < 15)
## End(Not run)
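
For illustration only, here is a minimal sketch of how the proposed behavior could be approximated with existing dplyr verbs. `drop_if()` is hypothetical and not part of dplyr; this version assumes a single logical condition and treats a condition that evaluates to `NA` as "do not drop", which is one possible reading of the proposal. The toy data and the name `kept` are also made up.

```r
library(dplyr)

# Hypothetical helper (NOT a dplyr function): drop rows where `condition`
# is TRUE; keep rows where it is FALSE or NA.
drop_if <- function(.data, condition) {
  # coalesce() turns an NA condition into FALSE, so the negation keeps the row
  filter(.data, !coalesce({{ condition }}, FALSE))
}

# Toy data with missing values in both columns
df <- tibble(
  class = c("suv", "compact", NA, NA, "midsize", "midsize"),
  cty   = c(20L, 12L, 20L, NA, NA, 21L)
)

# Drops row 1 (suv) and row 2 (cty < 15); the rows whose condition
# evaluates to NA are kept along with the ordinary non-matching row
kept <- drop_if(df, class == "suv" | cty < 15)
nrow(kept)
```

Unlike the anti_join() workaround, this sketch needs no row-number column and no join. Note that a row with class missing and cty below 15 would still be dropped, since `NA | TRUE` is `TRUE`.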
@DavisVaughan
Member

We've been noodling on this for a while. Closing in favor of a full write-up in #6891

@evanmorier
Author

Thanks, I searched before writing this up but didn't see the open issue you linked. Looking forward to seeing how this plays out. It would be a huge huge help.
