Better way to handle name collisions in joins #4028

krlmlr · 2018-12-17T17:17:48Z

Currently, non-join columns present in both tables are given the suffixes .x and .y. Occasionally one might want to raise an error or keep only the lhs columns in these situations. (This would also make it easier to adopt universal/unique renaming here.)

library(tidyverse)
x <- tibble(a = 1, b = 2)
y <- tibble(a = 1, b = 3)

# left_join(x, y, by = "a", .resolve = "rename")
left_join(x, y, by = "a")
#> # A tibble: 1 x 3
#>       a   b.x   b.y
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

# left_join(x, y, by = "a", .resolve = "unique")
left_join(x, y, by = "a") %>%
  rename(b..2 = b.x, b..3 = b.y)
#> # A tibble: 1 x 3
#>       a  b..2  b..3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

# left_join(x, y, by = "a", .resolve = "stop")
rlang::abort('Column `b` found in both tables, and .resolve = "stop".')
#> Error: Column `b` found in both tables, and .resolve = "stop".

# left_join(x, y, by = "a", .resolve = "left")
left_join(x, y %>% select(-b), by = "a")
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

Created on 2018-12-17 by the reprex package (v0.2.1.9000)
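For illustration, three of the proposed modes can be prototyped as a thin wrapper around `left_join()`. This is only a sketch: `left_join_resolve()` and its `resolve` argument are hypothetical names, not part of dplyr, and the helper assumes `by` is a plain character vector of key names shared by both tables. The "unique" mode is left out for brevity.

```r
library(dplyr)

# Hypothetical helper: `left_join_resolve()` and `resolve` are illustrative
# only and do not exist in dplyr.
left_join_resolve <- function(x, y, by, resolve = c("rename", "stop", "left")) {
  resolve <- match.arg(resolve)

  # Non-key columns that occur in both tables.
  clashes <- setdiff(intersect(names(x), names(y)), by)

  if (length(clashes) > 0 && resolve == "stop") {
    stop(
      "Column(s) found in both tables: ",
      paste0("`", clashes, "`", collapse = ", "),
      call. = FALSE
    )
  }

  if (resolve == "left") {
    # Keep only the lhs version: drop the clashing columns from `y`.
    y <- y[setdiff(names(y), clashes)]
  }

  # "rename" falls through to the usual .x/.y suffixes.
  left_join(x, y, by = by)
}

x <- tibble(a = 1, b = 2)
y <- tibble(a = 1, b = 3)

left_join_resolve(x, y, by = "a", resolve = "left")
```

With `resolve = "stop"` the same call raises an error instead of renaming `b`.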


alex-l-m · 2019-02-05T16:09:55Z

I would appreciate something like this. Unintentional column renamings are such a common error that I would probably just make a habit of putting in .resolve="stop", even when I don't anticipate it being necessary.

Currently, as a hack, I set both suffixes to be equal, which only raises an error when they end up getting used. Arguably I am making my code unnecessarily wordy, but I think it's always worth some extra text to make my expectations explicit ¯\_(ツ)_/¯

hadley · 2020-01-11T14:39:02Z

I think the simplest interface might be to make the default suffixes NULL, which would generate an informative error if they turned out to be needed.
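A minimal sketch of what that could look like from the user's side, written as a wrapper because this behaviour is not in dplyr. The name `left_join_strict()` is hypothetical, and `by` is assumed to be a character vector of key names shared by both tables.

```r
library(dplyr)

# Hypothetical wrapper, not part of dplyr: `suffix = NULL` means
# "renaming is not allowed", so a clash becomes an error.
left_join_strict <- function(x, y, by, suffix = NULL) {
  clashes <- setdiff(intersect(names(x), names(y)), by)

  if (length(clashes) > 0 && is.null(suffix)) {
    stop(
      "Suffixes are needed for column(s) ",
      paste0("`", clashes, "`", collapse = ", "),
      " but `suffix` is NULL.",
      call. = FALSE
    )
  }

  # Without clashes the suffixes are never used, so any valid value works.
  if (is.null(suffix)) suffix <- c(".x", ".y")
  left_join(x, y, by = by, suffix = suffix)
}

x <- tibble(a = 1, b = 2)
y <- tibble(a = 1, b = 3)

# Opting in to renaming by supplying suffixes explicitly.
left_join_strict(x, y, by = "a", suffix = c("_lhs", "_rhs"))
```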

clauswilke · 2020-04-27T19:45:28Z

I'd also like to vote in favor of some improvement here. I just saw a bunch of unexpected behavior in an analysis when a table had gained a column(*) that was already present in a table it was being joined with. The fact that there is no warning or message stating that columns are being renamed makes finding these kinds of bugs very difficult.

I think the default should either be no renaming at all, or, if you want to keep backwards compatibility, rename but issue at least a message, if not a warning.

(*) To clarify: I was rerunning the analysis with new input data, and one of the input tables had unexpectedly gained a new column.
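Until something like this exists, the silent renaming can be surfaced with a small check that runs before the join. The sketch below uses base R only; `report_collisions()` is a hypothetical helper name, and `by` is assumed to be a character vector of key names shared by both tables.

```r
# Hypothetical helper, not part of dplyr: report non-key columns that occur
# in both tables, so the renaming a join would perform is no longer silent.
report_collisions <- function(x, y, by) {
  clashes <- setdiff(intersect(names(x), names(y)), by)
  if (length(clashes) > 0) {
    message(
      "Columns present in both tables will be renamed by the join: ",
      paste(clashes, collapse = ", ")
    )
  }
  # Return the clashing names invisibly so callers can act on them.
  invisible(clashes)
}

x <- data.frame(a = 1, b = 2)
y <- data.frame(a = 1, b = 3, c = 4)

report_collisions(x, y, by = "a")
```

Swapping `message()` for `warning()` or `stop()` gives the stricter variants discussed in this thread.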

clauswilke · 2021-04-28T15:24:17Z

> I think the simplest interface might be to make the default suffixes NULL, which would generate an informative error if they turned out to be needed.

Lately I have wished to have something like .resolve = "left" proposed here. When you have a bunch of tables with duplicated columns that you need to merge, it's cumbersome to constantly select() them away. See also my explanation and reprex in #5860.

hadley · 2021-04-28T16:34:02Z

My feature wish list

- I'd like to be able to automatically rename all non-key columns during a join. A frequent pattern I use is to do this using rename_with():
  ```
  left_join(
    df_x |> rename_with(~ paste0("x.", .), -key),
    df_y |> rename_with(~ paste0("y.", .), -key),
    by = join_by(key)
  )
  ```
  I find this a bit annoying to type and a bit inelegant. I think it would be worth baking into the *_join() functions.
- I'd like to be able to keep the join keys from both tables. Yes, they would (mostly) contain the same information so in theory there's rarely a need to do this, but in practice it would occasionally be useful. As a motivating example, here's another pattern I frequently reach for:
  ```
  full_join(
    df_x |> mutate(x_exists = TRUE),
    df_y |> mutate(y_exists = TRUE),
    by = join_by(key)
  )
  ```
  In this case, retaining the key column from both tables in the output would make it easier to investigate where matches have/haven't been found without resorting to the mutate(). Another way I can see this being useful is that you would always then have ncol(x) + ncol(y) == ncol(*_join(x, y)); in other words it would be easier in some cases to predict the structure of the resulting data frame.
- Echoing previous comments in this thread, it would be great to be able to control when and how renaming occurs in the case of namespace conflicts. I can think of a few strategies:
  - Fail with an error; this is usually the behaviour I want.
  - Transform the names of the conflicting columns for one or both of the input data frames, possibly with a warning. More control here than just being able to add a suffix would be really useful; I'd like to be able to pass a function to do this.
    - 'Packing' the conflicting columns into data frame columns might be another useful option, but I'm not sure this feels consistent with other tidyverse functions, e.g. pivot_wider().
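On keeping the keys from both tables: as far as I know, recent dplyr versions already accept a `keep` argument in the mutating joins, which comes close to this wish. The sketch below assumes a dplyr version that supports `keep = TRUE`; the data frames and their columns are made up for illustration.

```r
library(dplyr)

# Made-up example data; only `key` is shared between the two tables.
df_x <- tibble(key = c(1, 2), x_val = c("a", "b"))
df_y <- tibble(key = c(2, 3), y_val = c("c", "d"))

# keep = TRUE retains the key column from both inputs. Because the key has
# the same name on both sides, the two copies are disambiguated by suffix.
out <- full_join(df_x, df_y, by = "key", keep = TRUE)

# Unmatched rows show up as NA in the key column of the other table.
out

# The column count is the sum of the inputs' column counts.
ncol(out) == ncol(df_x) + ncol(df_y)
```

This does not cover the renaming wishes, and the key copies still go through the suffix mechanism discussed in this thread.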

Ideas for an API:

Possible arguments which could be used to achieve the above:

- conflict_repair = NULL: could be analogous to suffix and have the following options:
  - NULL: Use a c(".x", ".y") suffix, as per the current behaviour.
  - A length-2 character vector. I think it's worth considering that if this is supplied, elements should be used as prefixes rather than suffixes for conflicting columns. IMO this form is a bit easier to work with, and would have a nice symmetry with names_prefix in pivot_longer()/pivot_wider().
  - A pair of functions which would be used to transform any conflicting names.
- conflict_action = c("warn", "repair", "error"): how to handle namespace conflicts. Could have the following options:
  - "warn": throw a warning/message.
  - "error": throw an error.
  - "repair": silently repair.

  An alternative would be to have NULL as a default, meaning warn if conflict_repair is not supplied and repair silently if it is, but I think the bugs caused by silent renaming are horrible enough that it would probably be better not to activate this implicitly.
- names_transform = NULL: Similar to conflict_repair but would apply to all columns besides the ones used as keys, not just the ones with conflicts. The user would only be able to supply one of names_transform and conflict_repair.
- keys_strategy = c("keep_x", "keep_y", "keep_both"): which key columns should be retained in the output:
  - "keep_x": the output would have key columns which follow the naming given in the left data frame, as per the current behaviour.
  - "keep_y": the output would have key columns which follow the naming given in the right data frame.
  - "keep_both": the output would include both sets of keys. The user would need to be responsible for ensuring keys were uniquely named in the left and right data frames.

The existing suffix argument could still be used instead of conflict_repair, but would be superseded in favour of it.
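To make the conflict_repair / conflict_action part of the proposal concrete, here is a rough user-level sketch. Everything in it is an assumption for illustration: `join_with_conflicts()` is a hypothetical wrapper, it only covers `left_join()`, it assumes `by` is a character vector of key names shared by both tables, and it does not attempt names_transform or keys_strategy.

```r
library(dplyr)

# Hypothetical sketch: the function and its arguments mirror the proposal
# above; none of this exists in dplyr.
join_with_conflicts <- function(x, y, by,
                                conflict_repair = NULL,
                                conflict_action = c("warn", "repair", "error")) {
  conflict_action <- match.arg(conflict_action)
  clashes <- setdiff(intersect(names(x), names(y)), by)

  # Turn `conflict_repair` into a pair of name-transforming functions.
  if (is.null(conflict_repair)) {
    # Current behaviour: .x / .y suffixes.
    repair <- list(function(nm) paste0(nm, ".x"), function(nm) paste0(nm, ".y"))
  } else if (is.character(conflict_repair)) {
    # Length-2 character vector: used as prefixes, as suggested above.
    stopifnot(length(conflict_repair) == 2)
    repair <- lapply(conflict_repair, function(prefix) {
      force(prefix)
      function(nm) paste0(prefix, nm)
    })
  } else {
    # Otherwise assume a list of two functions.
    repair <- conflict_repair
  }

  if (length(clashes) > 0) {
    msg <- paste0("Columns present in both tables: ", paste(clashes, collapse = ", "))
    if (conflict_action == "error") stop(msg, call. = FALSE)
    if (conflict_action == "warn") warning(msg, call. = FALSE)

    # Rename only the clashing columns before joining.
    x <- rename_with(x, repair[[1]], all_of(clashes))
    y <- rename_with(y, repair[[2]], all_of(clashes))
  }

  left_join(x, y, by = by)
}

x <- tibble(a = 1, b = 2)
y <- tibble(a = 1, b = 3)

# Prefixes instead of suffixes, repaired silently.
join_with_conflicts(x, y, by = "a",
                    conflict_repair = c("x_", "y_"),
                    conflict_action = "repair")
```

Passing a pair of functions, e.g. `conflict_repair = list(toupper, identity)`, goes through the same code path as the prefix form.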

krlmlr added the feature label (a feature request or enhancement) on Dec 17, 2018

hadley added the tables 🧮 label (joins and set operations) on Dec 11, 2019

hadley changed the title from "FR: Handling column name collisions in joins" to "Better way to handle name collisions in joins" on Dec 11, 2019

hadley mentioned this issue Apr 28, 2021

Better handling of non-joined duplicate variables in join functions #5860

Closed


DavisVaughan mentioned this issue Nov 9, 2023

[Feature Request] Option to throw error if tables in join have shared column names other than join key #6960

Closed


krlmlr commented Dec 17, 2018

alex-l-m commented Feb 5, 2019

hadley commented Jan 11, 2020

clauswilke commented Apr 27, 2020 (edited)

clauswilke commented Apr 28, 2021

hadley commented Apr 28, 2021

olivroy commented May 7, 2024 (edited)

wurli commented Sep 30, 2024 (edited)
