Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Option to throw error if tables in join have shared column names other than join key #6960

Closed
shearerpmm opened this issue Nov 9, 2023 · 2 comments

Comments

@shearerpmm
Copy link

Today, if two tables are joined and they have shared columns other than the join key, suffixes are automatically appended to disambiguate. For example:

library(tidyverse)
dat <- as_tibble(mtcars) %>%
  distinct(mpg, wt, vs)
dat %>% left_join(dat, by = c('mpg', 'wt'))
#> # A tibble: 32 × 4
#>      mpg    wt  vs.x  vs.y
#>    <dbl> <dbl> <dbl> <dbl>
#>  1  21    2.62     0     0
#>  2  21    2.88     0     0
#>  3  22.8  2.32     1     1
#>  4  21.4  3.22     1     1
#>  5  18.7  3.44     0     0
#>  6  18.1  3.46     1     1
#>  7  14.3  3.57     0     0
#>  8  24.4  3.19     1     1
#>  9  22.8  3.15     1     1
#> 10  19.2  3.44     1     1
#> # ℹ 22 more rows

In many data pipelines, overlapping columns are unexpected and cause issues later in the pipeline. (I think this is why most SQL databases throw an error in this case, rather than automatically appending suffixes.) For such cases, it would be nice to have function-level or package-level options to throw an error.

@DavisVaughan
Copy link
Member

dup of #4028

@philibe
Copy link

philibe commented Nov 15, 2023

When I have overlapping columns, when the left and right columns are similar, I suffix all the right columns, like in SQL with prefix.

With that I handle name collisions in joins by avoid them: when an automatic setting (#4028) could be unpredictable or to much complex or with too many warnings or errors I prefer manual settings. :)

library(tidyverse)

dat <- as_tibble(mtcars) %>%  distinct(mpg, wt, vs)

Tools.suffix_all_columns <- function (flux, suffix) {
  flux %>% rename_all(list(~paste0(., suffix)))
}  

dat %>% 
  left_join( dat %>% 
               Tools.suffix_all_columns(.,".ref"), 
             by=c("mpg"="mpg.ref","wt"="wt.ref")
  )
#> # A tibble: 32 × 4
#>      mpg    wt    vs vs.ref
#>    <dbl> <dbl> <dbl>  <dbl>
#>  1  21    2.62     0      0
#>  2  21    2.88     0      0
#>  3  22.8  2.32     1      1
#>  4  21.4  3.22     1      1
#>  5  18.7  3.44     0      0
#>  6  18.1  3.46     1      1
#>  7  14.3  3.57     0      0
#>  8  24.4  3.19     1      1
#>  9  22.8  3.15     1      1
#> 10  19.2  3.44     1      1
#> # … with 22 more rows

Created on 2023-11-15 with reprex v2.0.2

And after I rename the columns with suffix if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants