Better handling of non-joined duplicate variables in join functions #5860

clauswilke · 2021-04-21T21:25:53Z

The current behavior of join functions with non-joined duplicated columns is sub-optimal, in my opinion. I've seen many people fall into traps and/or get error messages they can't interpret. The situation generally arises when we're joining two tables that have a duplicated variable we don't want to join by. For example:

library(tidyverse)

ages <- tibble(
  ID = 1:3,
  name = c("John", "Judy", "Carl"),
  age = c(24, 27, 31)
)

heights <- tibble(
  ID = 1:3,
  name = c("John", "Judy", "Carl"),
  height = c(173, 159, 181)
)

left_join(ages, heights, by = "ID")
#> # A tibble: 3 x 5
#>      ID name.x   age name.y height
#>   <int> <chr>  <dbl> <chr>   <dbl>
#> 1     1 John      24 John      173
#> 2     2 Judy      27 Judy      159
#> 3     3 Carl      31 Carl      181

^{Created on 2021-04-21 by the reprex package (v1.0.0)}

I think that by default, left_join() should generate a warning that there were duplicated columns that were renamed. Without the warning, people get confused that their column name is gone, for example in code such as this:

left_join(ages, heights, by = "ID") %>%
  filter(name == "John")
#> Error: Problem with `filter()` input `..1`.
#> x object 'name' not found
#> ℹ Input `..1` is `name == "John"`.

It is surprisingly difficult for beginners to debug this, because they just have no mental framework for why the column name is suddenly gone. I've also seen it happen that the code worked when written initially, but over time things change in the data tables and suddenly the error pops up in filter() and it's highly confusing.

Second, I find myself repeatedly writing the pattern where I have to remove the duplicated column right before the join, as in:

ages %>%
  left_join(select(heights, -name), by = "ID") %>%
  filter(name == "John")
#> # A tibble: 1 x 4
#>      ID name    age height
#>   <int> <chr> <dbl>  <dbl>
#> 1     1 John     24    173

I think that's cumbersome, and it would be great if something like the following was possible:

ages %>%
  left_join(heights, by = "ID", keep_duplicated = FALSE) %>%
  filter(name == "John")
#> # A tibble: 1 x 4
#>      ID name    age height
#>   <int> <chr> <dbl>  <dbl>
#> 1     1 John     24    173

This should then also suppress the warning, since I've explicitly told left_join() how to handle duplicated columns.

The text was updated successfully, but these errors were encountered:

hadley · 2021-04-28T13:32:12Z

Duplicate of #4028 ?

clauswilke · 2021-04-28T15:20:51Z

Yes, I had even commented on that issue already. Apologies, I didn't see it when I did a search for related issues.

hadley added feature a feature request or enhancement tables 🧮 joins and set operations labels Apr 28, 2021

clauswilke closed this as completed Apr 28, 2021

clauswilke mentioned this issue Apr 28, 2021

Better way to handle name collisions in joins #4028

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better handling of non-joined duplicate variables in join functions #5860

Better handling of non-joined duplicate variables in join functions #5860

clauswilke commented Apr 21, 2021 •

edited

Loading

hadley commented Apr 28, 2021

clauswilke commented Apr 28, 2021

Better handling of non-joined duplicate variables in join functions #5860

Better handling of non-joined duplicate variables in join functions #5860

Comments

clauswilke commented Apr 21, 2021 • edited Loading

hadley commented Apr 28, 2021

clauswilke commented Apr 28, 2021

clauswilke commented Apr 21, 2021 •

edited

Loading