You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current behavior of join functions with non-joined duplicated columns is sub-optimal, in my opinion. I've seen many people fall into traps and/or get error messages they can't interpret. The situation generally arises when we're joining two tables that have a duplicated variable we don't want to join by. For example:
library(tidyverse)
ages<- tibble(
ID=1:3,
name= c("John", "Judy", "Carl"),
age= c(24, 27, 31)
)
heights<- tibble(
ID=1:3,
name= c("John", "Judy", "Carl"),
height= c(173, 159, 181)
)
left_join(ages, heights, by="ID")
#> # A tibble: 3 x 5#> ID name.x age name.y height#> <int> <chr> <dbl> <chr> <dbl>#> 1 1 John 24 John 173#> 2 2 Judy 27 Judy 159#> 3 3 Carl 31 Carl 181
I think that by default, left_join() should generate a warning that there were duplicated columns that were renamed. Without the warning, people get confused that their column name is gone, for example in code such as this:
left_join(ages, heights, by="ID") %>%
filter(name=="John")
#> Error: Problem with `filter()` input `..1`.#> x object 'name' not found#> ℹ Input `..1` is `name == "John"`.
It is surprisingly difficult for beginners to debug this, because they just have no mental framework for why the column name is suddenly gone. I've also seen it happen that the code worked when written initially, but over time things change in the data tables and suddenly the error pops up in filter() and it's highly confusing.
Second, I find myself repeatedly writing the pattern where I have to remove the duplicated column right before the join, as in:
ages %>%
left_join(select(heights, -name), by="ID") %>%
filter(name=="John")
#> # A tibble: 1 x 4#> ID name age height#> <int> <chr> <dbl> <dbl>#> 1 1 John 24 173
I think that's cumbersome, and it would be great if something like the following was possible:
ages %>%
left_join(heights, by="ID", keep_duplicated=FALSE) %>%
filter(name=="John")
#> # A tibble: 1 x 4#> ID name age height#> <int> <chr> <dbl> <dbl>#> 1 1 John 24 173
This should then also suppress the warning, since I've explicitly told left_join() how to handle duplicated columns.
The text was updated successfully, but these errors were encountered:
The current behavior of join functions with non-joined duplicated columns is sub-optimal, in my opinion. I've seen many people fall into traps and/or get error messages they can't interpret. The situation generally arises when we're joining two tables that have a duplicated variable we don't want to join by. For example:
Created on 2021-04-21 by the reprex package (v1.0.0)
I think that by default,
left_join()
should generate a warning that there were duplicated columns that were renamed. Without the warning, people get confused that their columnname
is gone, for example in code such as this:It is surprisingly difficult for beginners to debug this, because they just have no mental framework for why the column
name
is suddenly gone. I've also seen it happen that the code worked when written initially, but over time things change in the data tables and suddenly the error pops up infilter()
and it's highly confusing.Second, I find myself repeatedly writing the pattern where I have to remove the duplicated column right before the join, as in:
I think that's cumbersome, and it would be great if something like the following was possible:
This should then also suppress the warning, since I've explicitly told
left_join()
how to handle duplicated columns.The text was updated successfully, but these errors were encountered: