There are a couple of problems I regularly run into when subsetting data with multiple conditions.
The first has been brought up several times (including in this open issue): filter() drops any observations where any of the variables used in the conditions is missing.
A less obvious issue is readability. I often have to drop observations where any of several conditions is true. Writing this using filter() usually requires complex use of negation and parentheses, even if there are no missings to be concerned about. Understanding and explaining these statements in my code is fraught to say the least.
I propose a new dplyr function called drop_if(). It would drop all rows meeting any of the specified conditions. This would greatly reduce the need to use !() and, by default, would keep observations that are missing any of the referenced variables in the resulting data set.
The way I wrote this out for now is straightforward to understand but likely not computationally efficient: filter() to keep the observations matching the condition, then anti_join() to drop them from the original data set.
library(tidyverse)

mpg_miss <- mpg %>%
  mutate(
    class = na_if(class, "pickup"),
    cty = na_if(cty, 18)
  )

mpg_miss_rown <- mpg_miss %>%
  mutate(rown = row_number())

# if we want to drop where class is "suv" or cty < 15, temp df of these
keep <- mpg_miss_rown %>%
  filter(
    class == "suv" | cty < 15
  ) %>%
  select(rown)

keep %>%
  nrow()
## [1] 89

drop <- mpg_miss_rown %>%
  anti_join(keep) %>%
  select(-rown)
## Joining, by = "rown"

drop %>% nrow()
## [1] 145

# drops if class or cty is NA
mpg_miss %>%
  filter(!(class == "suv" | cty < 15)) %>%
  nrow()
## [1] 114

# annoying syntax but produces desired result
mpg_miss %>%
  filter(!(class == "suv" | cty < 15) | (is.na(class) & !cty < 15) | (class != "suv" & is.na(cty))) %>%
  nrow()
## [1] 145

## Not run:
# What syntax could look like:
mpg_miss %>%
  drop_if(class == "suv" | cty < 15)
## End(Not run)
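For illustration, here is a minimal sketch of how the proposed behavior could be prototyped on top of filter() without the row-number join. drop_if() is hypothetical and not part of dplyr; the sketch assumes that a condition evaluating to NA should count as "not met", so those rows are kept, which matches the anti_join() approach above. The toy data frame is made up for the example.

```r
library(dplyr)

# Hypothetical helper, NOT part of dplyr: drop rows where any condition is TRUE.
# Rows where a condition evaluates to NA are treated as not matching and are kept.
drop_if <- function(.data, ...) {
  conds <- rlang::enquos(...)
  # Turn each condition into "keep unless TRUE"; coalesce() maps NA to FALSE first.
  keep_exprs <- lapply(unname(conds), function(q) {
    rlang::expr(!dplyr::coalesce(!!q, FALSE))
  })
  # filter() combines several expressions with AND, so a row survives
  # only if none of the conditions is TRUE for it.
  dplyr::filter(.data, !!!keep_exprs)
}

# Toy data (made up for this example)
df <- tibble(
  class = c("suv", "compact", NA, NA, "suv", "compact", "compact"),
  cty   = c(20, 10, 20, 10, NA, 20, NA)
)

result <- drop_if(df, class == "suv" | cty < 15)
result
```

Only the rows where the condition is FALSE or NA remain (the third, sixth, and seventh rows of the toy data). Passing the conditions as separate arguments, as in drop_if(df, class == "suv", cty < 15), drops the same rows in this sketch. For a one-off, filter(!coalesce(class == "suv" | cty < 15, FALSE)) does the same thing without a helper.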
Thanks, I searched before writing this up but didn't see the open issue you linked. Looking forward to seeing how this plays out. It would be a huge huge help.