Better way to handle name collisions in joins #4028
Comments
I would appreciate something like this. Unintentional column renaming is such a common error that I would probably make a habit of putting in `.resolve = "stop"`, even when I don't anticipate it being necessary. Currently, as a hack, I set both suffixes to be equal, which only raises an error when they end up getting used. Arguably I am making my code unnecessarily wordy, but I think it's always worth some extra text to make my expectations explicit ¯\_(ツ)_/¯
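One way to get the "stop" behavior today, without any change to the join functions, is to check for collisions before joining. The following is only a sketch in base R: `assert_no_name_collisions()` is a hypothetical helper name, not part of dplyr, and it assumes the join keys have the same names in both tables.

```r
# Hypothetical helper (not part of dplyr): stop if `x` and `y` share any
# column names other than the join keys listed in `by`.
assert_no_name_collisions <- function(x, y, by) {
  shared <- intersect(setdiff(names(x), by), setdiff(names(y), by))
  if (length(shared) > 0) {
    stop(
      "Non-key columns present in both tables: ",
      paste(shared, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

x <- data.frame(id = 1:3, value = c(10, 20, 30))
y <- data.frame(id = 1:3, value = c(1, 2, 3), other = c("a", "b", "c"))

# `value` appears in both tables, so the check fails; the message is
# captured here instead of aborting the script.
result <- tryCatch(
  assert_no_name_collisions(x, y, by = "id"),
  error = function(e) conditionMessage(e)
)
```

Calling the helper on the line before each join keeps the expectation explicit at the call site, which is the intent behind the equal-suffix hack.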
I think the simplest interface might be to make default |
I'd also like to vote in favor of some improvement here. I just saw a bunch of unexpected behavior in an analysis when a table had gained a column(*) that was already present in a table it was being joined with. The fact that there is no warning or message stating that columns are being renamed makes finding these kinds of bugs very difficult. I think the default should either be no renaming at all, or, if you want to keep backwards compatibility, rename but issue at least a message, if not a warning.
(*) To clarify: I was rerunning the analysis with new input data, and one of the input tables had unexpectedly gained a new column.
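The "rename but warn" variant can be approximated with a thin wrapper. This sketch uses base R's `merge()` rather than a dplyr join so that it runs without extra packages; `merge_with_warning()` is a hypothetical name, and the warning text is only an illustration.

```r
# Hypothetical wrapper: join as usual, but warn when non-key columns
# collide and will therefore be renamed with the default suffixes.
merge_with_warning <- function(x, y, by) {
  shared <- intersect(setdiff(names(x), by), setdiff(names(y), by))
  if (length(shared) > 0) {
    warning(
      "Columns renamed with suffixes .x/.y: ",
      paste(shared, collapse = ", "),
      call. = FALSE
    )
  }
  merge(x, y, by = by)
}

old_input <- data.frame(id = 1:2, score = c(0.1, 0.2))
new_input <- data.frame(id = 1:2, score = c(0.3, 0.4))

# Emits a warning naming `score`; the result has `score.x` and `score.y`.
joined <- merge_with_warning(old_input, new_input, by = "id")
```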
Lately I have wished to have something like |
See also #5700
Would love the option This should only have an effect if Maybe should inform also if you have
df1 <- data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
value1 = c(1, 5, 7)
)
df2 <- data.frame(
id = c(1, 2, 3),
name = c("name1", "name2", "name3"),
value2 = c(1, 6, 7)
)
df3 <- data.frame(
id = c(1, 2, 3),
name = c("name.1", "name.2", "name.3"),
value2 = c(1, 5, 7)
)
df1 |> left_join(df2, by = "id", suffix = NA)
#> Error in `left_join()`
#> `name` is found in `x`, and `y`
#> Mapping is compatible, you should use `join_by(id, name)`
df1 |> left_join(df3, by = "id", suffix = NA)
#> Error in `left_join()`
#> `name` is found in `x`, and `y` and is not the same
#> Either delete the `name` variable from `x` or `y`, or use suffix.
My main reasoning behind specifying
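The distinction drawn in the two proposed error messages (a shared column whose values agree across matched rows versus one whose values differ) can be checked by hand. The sketch below is an assumption about how such a check might work, not how dplyr behaves; `shared_column_status()` is a hypothetical name, and the data frames are redefined so the example stands alone.

```r
# Hypothetical check: for each non-key column present in both tables,
# report whether its values agree on all rows matched by `by`.
shared_column_status <- function(x, y, by) {
  shared <- intersect(setdiff(names(x), by), setdiff(names(y), by))
  merged <- merge(x[c(by, shared)], y[c(by, shared)], by = by)
  vapply(shared, function(col) {
    identical(
      as.character(merged[[paste0(col, ".x")]]),
      as.character(merged[[paste0(col, ".y")]])
    )
  }, logical(1))
}

df1 <- data.frame(id = c(1, 2, 3), name = c("name1", "name2", "name3"))
df2 <- data.frame(id = c(1, 2, 3), name = c("name1", "name2", "name3"))
df3 <- data.frame(id = c(1, 2, 3), name = c("name.1", "name.2", "name.3"))

compatible <- shared_column_status(df1, df2, by = "id")   # `name` agrees
conflicting <- shared_column_status(df1, df3, by = "id")  # `name` differs
```

A `TRUE` entry corresponds to the "mapping is compatible" case, where adding the column to the join keys would be safe; a `FALSE` entry corresponds to the case where one copy has to be dropped or suffixed.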
I'd suggest that there's room for enhancements to checking/transformation of column names and join keys that goes beyond the scope of the
My feature wish list
Ideas for an API: Possible arguments which could be used to achieve the above:
The existing |
Currently, non-join columns available in both tables are given the suffixes `.x` and `.y`. Occasionally one might want to raise an error or keep only the lhs columns in these situations. (This would also make it easier to adopt universal/unique renaming here.)

Created on 2018-12-17 by the reprex package (v0.2.1.9000)
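The "keep only the lhs columns" option can be emulated by dropping the colliding columns from the right-hand table before joining. This is a minimal base R sketch under that assumption; `merge_keep_left()` is a hypothetical name and not an existing dplyr or base function.

```r
# Hypothetical helper: left join that keeps the left-hand copy of any
# non-key column found in both tables and discards the right-hand copy.
merge_keep_left <- function(x, y, by) {
  shared <- intersect(setdiff(names(x), by), setdiff(names(y), by))
  y_trimmed <- y[setdiff(names(y), shared)]
  merge(x, y_trimmed, by = by, all.x = TRUE)
}

lhs <- data.frame(id = 1:3, value = c(10, 20, 30))
rhs <- data.frame(id = 1:3, value = c(1, 2, 3), extra = c("a", "b", "c"))

# No suffixes are added: `value` comes from `lhs`, `extra` from `rhs`.
kept <- merge_keep_left(lhs, rhs, by = "id")
```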