Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better way to handle name collisions in joins #4028

Open
krlmlr opened this issue Dec 17, 2018 · 8 comments
Open

Better way to handle name collisions in joins #4028

krlmlr opened this issue Dec 17, 2018 · 8 comments
Labels
feature a feature request or enhancement tables 🧮 joins and set operations

Comments

@krlmlr
Copy link
Member

krlmlr commented Dec 17, 2018

Currently, non-join columns available in both tables are given suffixes .x and .y . Occasionaly one might want to raise an error or keep only the lhs columns in these situations. (This would also make it easier to adopt universal/unique renaming here.)

library(tidyverse)
x <- tibble(a = 1, b = 2)
y <- tibble(a = 1, b = 3)

# left_join(x, y, by = "a", .resolve = "rename")
left_join(x, y, by = "a")
#> # A tibble: 1 x 3
#>       a   b.x   b.y
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

# left_join(x, y, by = "a", .resolve = "unique")
left_join(x, y, by = "a") %>%
  rename(b..2 = b.x, b..3 = b.y)
#> # A tibble: 1 x 3
#>       a  b..2  b..3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

# left_join(x, y, by = "a", .resolve = "stop")
rlang::abort('Column `b` found in both tables, and .resolve = "stop".')
#> Error: Column `b` found in both tables, and .resolve = "stop".

# left_join(x, y, by = "a", .resolve = "left")
left_join(x, y %>% select(-b), by = "a")
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

Created on 2018-12-17 by the reprex package (v0.2.1.9000)

@krlmlr krlmlr added the feature a feature request or enhancement label Dec 17, 2018
@alex-l-m
Copy link

alex-l-m commented Feb 5, 2019

I would appreciate something like this. Unintentional column renamings are such a common error that I would probably just make a habit of putting in .resolve="stop", even when I don't anticipate it being necessary.

Currently as a hack I set both suffixes to be equal, which only raises an error when they end up getting used. Arguably I am making my code unnecessarily wordy but I think it's always worth some extra text to make my expectations explicit ¯_(ツ)_/¯

@hadley hadley added the tables 🧮 joins and set operations label Dec 11, 2019
@hadley hadley changed the title FR: Handling column name collisions in joins Better way to handle name collisions in joins Dec 11, 2019
@hadley
Copy link
Member

hadley commented Jan 11, 2020

I think the simplest interface might be to make default suffixes to NULL, which would generate an informative error if they turned out to be needed.

@clauswilke
Copy link
Member

clauswilke commented Apr 27, 2020

I'd also like to vote in favor of some improvement here. I just saw a bunch of unexpected behavior in an analysis when a table had gained a column(*) that was already present in a table it was being joined with. The fact that there is no warning or message stating that columns are being renamed makes finding these kinds of bugs very difficult.

I think the default should either be no renaming at all, or, if you want to keep backwards compatibility, rename but issue at least a message, if not a warning.

(*) To clarify: I was rerunning the analysis with new input data, and one of the input tables had unexpectedly gained a new column.

@clauswilke
Copy link
Member

I think the simplest interface might be to make default suffixes to NULL, which would generate an informative error if they turned out to be needed.

Lately I have wished to have something like .resolve = "left" proposed here. When you have a bunch of tables with duplicated columns that you need to merge, it's cumbersome to constantly select() them away. See also my explanation and reprex in #5860.

@hadley
Copy link
Member

hadley commented Apr 28, 2021

See also #5700

@alnajar

This comment was marked as off-topic.

@olivroy
Copy link

olivroy commented May 7, 2024

Would love the option suffix = NA or suffix = NULL to error in non-duplicate columns are found. My situation is exactly what is discussed above, but wanted to chime in.

This should only have an effect if by is specified. I think the rationale has been discussed, but it would be great to halt the process there, similar to how unmatched works. More sophisticated behavior would be great as sometimes, you trust more x or y to provide the correct information.

Maybe should inform also if you have

df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("name1", "name2", "name3"),
  value1 = c(1, 5, 7)
)
df2 <- data.frame(
  id = c(1, 2, 3),
  name = c("name1", "name2", "name3"),
  value2 = c(1, 6, 7)
)
df2 <- data.frame(
  id = c(1, 2, 3),
  name = c("name.1", "name.2", "name.3"),
  value2 = c(1, 5, 7)
)

df1 |> left_join(df2, by = "id", suffix = NA)
#> Error in `left_join()`
#> `name` is found in `x`, and `y`
#> Mapping is compatible, you should use `join_by(id, name)`
df1 |> left_join(df3, by = "id", suffix = NA)
#> Error in `left_join()`
#> `name` is found in `x`, and `y` and is not the same
#> Either delete the `name` variable from `x` or `y`, or use suffix.

my main reasoning behind specifying by is to avoid printing everytime Joining with ..., but the message is useful when something unexpected happened, so hopefully there is a way to reconcile printing the message everytime and debugging when you notice something weird and "hard-coding" your join and error when something unexpected happens.

@wurli
Copy link

wurli commented Sep 30, 2024

I'd suggest that there's room for enhancements to checking/transformation of column names and join keys that goes beyond the scope of the suffix argument.

My feature wish list

  1. I'd like to be able to automatically rename all non-key columns during a join. A frequent pattern I use is to do this using rename_with():

    left_join(
      df_x |> rename_with(~ paste0("x.", .), -key),
      df_y |> rename_with(~ paste0("y.", .), -key),
      by = join_by(key)
    )

    I find this a bit annoying to type and a bit inelegant. I think it would be worth baking into join_ functions.

  2. I'd like to be able to keep the join keys from both tables. Yes, they would (mostly) contain the same information so in theory there's rarely a need to do this, but in practice it would occasionally be useful. As a motivating example, here's another pattern I frequently reach for:

    full_join(
      df_x |> mutate(x_exists = TRUE),
      df_y |> mutate(y_exists = TRUE),
      by = join_by(key)
    )

    In this case, retaining the key column from both tables in the output would make it easier to investigate where matches have/haven't been found without resorting to the mutate(). Another way I can see this being useful is that you would always then have ncol(x) + ncol(y) == ncol(*_join(x, y)); in other words it would be easier in some cases to predict the structure of the resulting data frame.

  3. Echoing previous comments in this thread, it would be great to be able to control when and how renaming occurs in the case of namespace conflicts. I can think of a few strategies:

    • Fail with an error; this is usually the behaviour I want.

    • Transform the names of the conflicting columns for one or both of the input data frames, possibly with a warning. More control here than just being able to add a suffix would be really useful; I'd like to be able to pass a function to do this.

      • 'Packing' the conflicting columns into data frame columns might be another useful option, but I'm not sure this feels consistent with other tidyverse functions, e.g. pivot_wider().

Ideas for an API:

Possible arguments which could be used to achieve the above:

  • conflict_repair = NULL: could be analogous to suffix and have the following options:

    • NULL: Use a c(".x", ".y") suffix, as per the current behaviour.
    • A length-2 character vector. I think it's worth considering that if this is supplied, elements should be used as prefixes rather than suffixes for conflicting columns. IMO this form is as bit easier to work with, and would have a nice symmetry with names_prefix in pivot_longer()/pivot_wider().
    • A pair of functions which would be used to transform any conflicting names.
  • conflict_action = c("warn", "repair", "error"): how to handle namespace conflicts. Could have the following options:

    • "warn": throw a warning/message.
    • "error": throw an error.
    • "repair": silently repair.

    An alternative would be to have NULL as a default, meaning warn if conflict_repair is not supplied and repair silently if it is, but I think the bugs caused by silent renaming are horrible enough that it would probably be better not to activate this implicitly.

  • names_transform = NULL: Similar to conflict_repair but would apply to all columns besides the ones used as keys, not just the ones with conflicts. The user would only be able to supply one of names_transform and conflict_repair.

  • keys_strategy = c("keep_x", "keep_y", "keep_both"). Which key columns should be retained in the output:

    • "keep_x": the output would have key columns which follow the naming given in the left data frame, as per the current behaviour.
    • "keep_y": the output would have key columns which follow the naming given in the right data frame.
    • "keep_both": the output would include both sets of keys. The user would need to be responsible for ensuring keys were uniquely named in the left and right data frames.

The existing suffix argument could still be used instead of conflict_repair, but would be superseded in favour of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement tables 🧮 joins and set operations
Projects
None yet
Development

No branches or pull requests

7 participants