Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of non-joined duplicate variables in join functions #5860

Closed
clauswilke opened this issue Apr 21, 2021 · 2 comments
Closed
Labels
feature a feature request or enhancement tables 🧮 joins and set operations

Comments

@clauswilke
Copy link
Member

clauswilke commented Apr 21, 2021

The current behavior of join functions with non-joined duplicated columns is sub-optimal, in my opinion. I've seen many people fall into traps and/or get error messages they can't interpret. The situation generally arises when we're joining two tables that have a duplicated variable we don't want to join by. For example:

library(tidyverse)

ages <- tibble(
  ID = 1:3,
  name = c("John", "Judy", "Carl"),
  age = c(24, 27, 31)
)

heights <- tibble(
  ID = 1:3,
  name = c("John", "Judy", "Carl"),
  height = c(173, 159, 181)
)

left_join(ages, heights, by = "ID")
#> # A tibble: 3 x 5
#>      ID name.x   age name.y height
#>   <int> <chr>  <dbl> <chr>   <dbl>
#> 1     1 John      24 John      173
#> 2     2 Judy      27 Judy      159
#> 3     3 Carl      31 Carl      181

Created on 2021-04-21 by the reprex package (v1.0.0)

I think that by default, left_join() should generate a warning that there were duplicated columns that were renamed. Without the warning, people get confused that their column name is gone, for example in code such as this:

left_join(ages, heights, by = "ID") %>%
  filter(name == "John")
#> Error: Problem with `filter()` input `..1`.
#> x object 'name' not found
#> ℹ Input `..1` is `name == "John"`.

It is surprisingly difficult for beginners to debug this, because they just have no mental framework for why the column name is suddenly gone. I've also seen it happen that the code worked when written initially, but over time things change in the data tables and suddenly the error pops up in filter() and it's highly confusing.

Second, I find myself repeatedly writing the pattern where I have to remove the duplicated column right before the join, as in:

ages %>%
  left_join(select(heights, -name), by = "ID") %>%
  filter(name == "John")
#> # A tibble: 1 x 4
#>      ID name    age height
#>   <int> <chr> <dbl>  <dbl>
#> 1     1 John     24    173

I think that's cumbersome, and it would be great if something like the following was possible:

ages %>%
  left_join(heights, by = "ID", keep_duplicated = FALSE) %>%
  filter(name == "John")
#> # A tibble: 1 x 4
#>      ID name    age height
#>   <int> <chr> <dbl>  <dbl>
#> 1     1 John     24    173

This should then also suppress the warning, since I've explicitly told left_join() how to handle duplicated columns.

@hadley hadley added feature a feature request or enhancement tables 🧮 joins and set operations labels Apr 28, 2021
@hadley
Copy link
Member

hadley commented Apr 28, 2021

Duplicate of #4028 ?

@clauswilke
Copy link
Member Author

Yes, I had even commented on that issue already. Apologies, I didn't see it when I did a search for related issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement tables 🧮 joins and set operations
Projects
None yet
Development

No branches or pull requests

2 participants