
New function to drop instead of keep observations (anti filter) #6888

Closed
@evanmorier

There are a couple of problems I regularly run into when subsetting data with multiple conditions.

The first has been brought up several times (including in this open issue): filter() keeps only rows where the condition is TRUE, so it drops any observation for which the condition evaluates to NA because a variable used in it is missing.

A less obvious issue is readability. I often have to drop observations where any of several conditions is true. Writing this with filter() usually requires complex use of negation and parentheses, even when there are no missing values to worry about. Understanding and explaining these statements in my code is fraught, to say the least.

I propose a new dplyr function called drop_if(). It would drop all rows meeting any of the specified conditions. This would greatly reduce the need for !() and would, by default, keep observations that are missing any of the referenced variables in the resulting data set.

The way I wrote this out for now is straightforward to understand but probably not computationally efficient: filter() to keep the observations matching the condition, then anti_join() to drop them from the original data set.

library(tidyverse)

mpg_miss <- mpg %>% 
  mutate(
    class = na_if(class, "pickup"),
    cty   = na_if(cty, 18)
  )

mpg_miss_rown <- mpg_miss %>% 
  mutate(rown = row_number())

# temp df of the rows we want to drop: class is "suv" or cty < 15
# (called `keep` because filter() keeps them here)
keep <- mpg_miss_rown %>% 
  filter(
    class == "suv" | cty < 15
  ) %>% 
  select(rown)

keep %>% 
  nrow()
## [1] 89

drop <- mpg_miss_rown %>% 
  anti_join(keep) %>% 
  select(-rown)
## Joining, by = "rown"

drop %>% nrow()
## [1] 145

# negating inside filter() also drops rows where class or cty is NA
mpg_miss %>% 
  filter(!(class == "suv" | cty < 15)) %>% 
  nrow()
## [1] 114

# annoying syntax but produces desired result
mpg_miss %>% 
  filter(
    !(class == "suv" | cty < 15) |
      (is.na(class) & !(cty < 15)) |
      (class != "suv" & is.na(cty))
  ) %>%
  nrow()
## [1] 145

## Not run:
# What syntax could look like:
mpg_miss %>% 
  drop_if(class == "suv" | cty < 15)
## End(Not run)
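
To make the proposal concrete, here is a minimal sketch of how drop_if() could be written on top of filter(). It is only an illustration under stated assumptions, not an existing dplyr function: the name drop_if(), its signature, and the toy data are all hypothetical. It relies on rlang::enquos() to capture the conditions and on `%in% TRUE` to treat an NA condition as "not matched", so those rows are kept.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: drop rows where ANY condition is TRUE.
# Rows where the combined condition is FALSE or NA are kept.
drop_if <- function(.data, ...) {
  conds <- rlang::enquos(...)
  if (length(conds) == 0) {
    return(.data)  # no conditions, nothing to drop
  }
  dplyr::filter(
    .data,
    # OR the conditions together, then map NA to FALSE via %in% TRUE
    !(Reduce(`|`, list(!!!conds)) %in% TRUE)
  )
}

# Toy data (made up for illustration) covering TRUE, FALSE and NA outcomes
toy <- tibble(
  class = c("suv", NA, "compact", NA, "compact"),
  cty   = c(20, 10, NA, NA, 16)
)

kept <- drop_if(toy, class == "suv" | cty < 15)
kept
# Rows 1 and 2 are dropped (condition is TRUE); rows 3 to 5 are kept,
# including those where the condition evaluates to NA.
```

Passing the conditions separately, as in drop_if(toy, class == "suv", cty < 15), should give the same rows, since the sketch ORs its arguments. For a one-off, filter(!coalesce(class == "suv" | cty < 15, FALSE)) expresses the same idea inline, and applied to mpg_miss either form should agree with the anti_join() result above.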
