Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Renamed columns are not being handled correctly in join_by() translation #1524

Open
francisbarton opened this issue Jul 16, 2024 · 6 comments

Comments

@francisbarton
Copy link

francisbarton commented Jul 16, 2024

I discovered a strange error in my work, it seems like a bug?
It probably looks like a pretty contrived situation in the reprex below, but that is what I had to do to reproduce what I found.

For some reason, the column name y$rhs_val is treated differently when being compared to the right argument of between(), to when compared to the left argument.

In the former case it retains its renamed name, in the latter case it reverts to its original name (VALUE).
I would expect the "new" column name to be used identically on both halves of the between step.

library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

tbl_a <- tibble::tribble(
  ~ID, ~LEFT, ~RIGHT,
  "A", 1, 3,
  "B", 2, 4,
  "C", 3, 5
)
  
tbl_b <- tibble::tribble(
  ~ID, ~VALUE,
  "A", 4,
  "B", 4,
  "C", 4
) 

dplyr::copy_to(con, tbl_a)
dplyr::copy_to(con, tbl_b)

tbl_b2 <- dplyr::tbl(con, "tbl_b") |>
  janitor::clean_names() |>
  dplyr::rename(rhs_val = "value")

dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::semi_join(
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  )
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! no such column: tbl_b.rhs_val

dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::semi_join(
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  ) |>
  dplyr::show_query()
#> <SQL>
#> SELECT `ID` AS `id`, `LEFT` AS `left`, `RIGHT` AS `right`
#> FROM `tbl_a`
#> WHERE EXISTS (
#>   SELECT 1 FROM `tbl_b`
#>   WHERE
#>     (`tbl_a`.`ID` = `tbl_b`.`ID`) AND
#>     (`tbl_a`.`LEFT` <= `tbl_b`.`VALUE`) AND
#>     (`tbl_a`.`RIGHT` >= `tbl_b`.`rhs_val`)
#> )

Created on 2024-07-16 with reprex v2.1.1

Needless to say, if you collect() both tables before attempting the semi_join, it works fine.
So it seems a dbplyr issue specifically, as the show_query() code seems to confirm.

I have dbplyr v2.5.0 (CRAN)


(PS I was expecting the SQL translation here to use WHERE ... rhs_val BETWEEN LEFT AND RIGHT instead of the two-step comparison using <= and >= - but maybe BETWEEN ... AND ... doesn't work across all SQL implementations?)

@davidfoord1
Copy link

davidfoord1 commented Jul 17, 2024

To add on:

#> ! no such column: tbl_b.rhs_val

The rename() behaviour has been dropped by between() - there is no longer a

`VALUE` AS `rhs_val`

As in

show_query(tbl_b2)
#> <SQL>
#> SELECT `ID` AS `id`, `VALUE` AS `rhs_val`
#> FROM `tbl_b`

Trying with column-based joins the rename is in the final SELECT, but not applied to the table being joined to, so you still get the no such column error, as well as the mismatch between left and right.

col_query <- dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::left_join(  # left instead of semi join
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  )
  
col_query
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! no such column: tbl_b.rhs_val
#> Run `rlang::last_trace()` to see where the error occurred.

show_query(col_query)
#> <SQL>
#> SELECT
#>   `tbl_a`.`ID` AS `id`,
#>   `LEFT` AS `left`,
#>   `RIGHT` AS `right`,
#>   `VALUE` AS `rhs_val`
#> FROM `tbl_a`
#> LEFT JOIN `tbl_b`
#>   ON (
#>     `tbl_a`.`ID` = `tbl_b`.`ID` AND
#>     `tbl_a`.`LEFT` <= `tbl_b`.`VALUE` AND
#>     `tbl_a`.`RIGHT` >= `tbl_b`.`rhs_val`
#>   )

@francisbarton
Copy link
Author

I've just discovered https://dbplyr.tidyverse.org/articles/reprex.html and I'm going to try to supply a tighter, smaller, lower-level reprex of the issue if I can home in on it.

@francisbarton
Copy link
Author

francisbarton commented Jul 17, 2024

between() doesn't seem to be implemented for MSSQL in the same way as it is for several other DBs simulated in dbplyr:

library(dbplyr)
translate_sql(between(x, 2L, 4L), con = simulate_sqlite())
#> <SQL> `x` BETWEEN 2 AND 4
translate_sql(between(x, 2L, 4L), con = simulate_postgres())
#> <SQL> `x` BETWEEN 2 AND 4
translate_sql(between(x, 2L, 4L), con = simulate_mssql())
#> Error in if (context$clause == "WHERE") {: argument is of length zero

Created on 2024-07-17 with reprex v2.1.1

@francisbarton
Copy link
Author

Another reprex - sorry, I'm sorting of using this issue as a place to work through this.

it doesn't make a difference if you use simulate_mssql() or simulate_sqlite():

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

v1 <- list(id = LETTERS[1:5], x = 1:5, y = 5:1)
v2 <- list(id = LETTERS[1:5], w = 2:6)
lf1 <- lazy_frame(!!!v1, con = simulate_mssql())
lf2 <- lazy_frame(!!!v2, con = simulate_mssql()) |>
  rename(z = "w")

lf1 |>
  semi_join(lf2, join_by("id", between(y$z, x$x, x$y)))
#> <SQL>
#> SELECT `df_LHS`.*
#> FROM `df` AS `df_LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM `df` AS `df_RHS`
#>   WHERE
#>     (`df_LHS`.`id` = `df_RHS`.`id`) AND
#>     (`df_LHS`.`x` <= `df_RHS`.`w`) AND
#>     (`df_LHS`.`y` >= `df_RHS`.`z`)
#> )

lf1 <- lazy_frame(!!!v1, con = simulate_sqlite())
lf2 <- lazy_frame(!!!v2, con = simulate_sqlite()) |>
  rename(z = "w")

lf1 |>
  semi_join(lf2, join_by("id", between(y$z, x$x, x$y)))
#> <SQL>
#> SELECT `df_LHS`.*
#> FROM `df` AS `df_LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM `df` AS `df_RHS`
#>   WHERE
#>     (`df_LHS`.`id` = `df_RHS`.`id`) AND
#>     (`df_LHS`.`x` <= `df_RHS`.`w`) AND
#>     (`df_LHS`.`y` >= `df_RHS`.`z`)
#> )

Created on 2024-07-17 with reprex v2.1.1

@francisbarton
Copy link
Author

francisbarton commented Jul 17, 2024

There is a test that seems to relate to this here - with a comment to say that the query generated is wrong, although all the tests there pass.

If I amend the example situation in that test so that the last condition with lf4 is a join_by() criterion with the second part involving a comparison with < (forget about all the stuff about between() above), then we get:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

# from https://github.com/tidyverse/dbplyr/blob/860bd6adf988a2b039485030d234eba05ee46a3c/tests/testthat/test-verb-joins.R#L656-L681
lf1 <- lazy_frame(x = 1, a = 2, .name = "lf1")
lf2 <- lazy_frame(x = 1, b = 3, .name = "lf2")
lf3 <- lazy_frame(x = 1, c = 4, .name = "lf3")
lf4 <- lazy_frame(x = 1, a = 5, .name = "lf4") |>
  rename(a4 = "a")

lf1 |>
  inner_join(lf2, by = "x") |>
  inner_join(lf3, by = "x") |>
  inner_join(lf4, by = join_by("x", a < a4))
#> <SQL>
#> SELECT `lf1`.*, `b`, `c`, `lf4`.`a` AS `a4`
#> FROM `lf1`
#> INNER JOIN `lf2`
#>   ON (`lf1`.`x` = `lf2`.`x`)
#> INNER JOIN `lf3`
#>   ON (`lf1`.`x` = `lf3`.`x`)
#> INNER JOIN `lf4`
#>   ON (`lf1`.`x` = `lf4`.`x` AND `lf1`.`a` < `lf4`.`a`)

Created on 2024-07-18 with reprex v2.1.1

Suggesting that the rename is being kept in the top-level SELECT line in an attempt to avoid having a nested query, but a rename like this probably needs to be operationalised sooner, in the final INNER JOIN, something like this? :

SELECT `lf1`.*, `b`, `c`, `a4`
FROM `lf1`
INNER JOIN `lf2`
  ON (`lf1`.`x` = `lf2`.`x`)
INNER JOIN `lf3`
  ON (`lf1`.`x` = `lf3`.`x`)
INNER JOIN (SELECT `x`, `a` AS `a4` FROM `lf4`) AS `lf4`
  ON (`lf1`.`x` = `lf4`.`x` AND `lf1`.`a` < `lf4`.`a4`)

@francisbarton francisbarton changed the title RHS col name is treated differently on either side of between() Renamed columns are not being handled correctly in join_by() translation Jul 17, 2024
@francisbarton
Copy link
Author

francisbarton commented Jul 17, 2024

Here's a really small reprex showing the issue with rename() - I think this behaviour is incorrect? In real life this SQL will cause an error saying that lf2.x is not found.

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

lf1 <- lazy_frame(x = seq(2), con = simulate_sqlite(), .name = "lf1")
lf2 <- lazy_frame(x = 1, con = simulate_sqlite(), .name = "lf2") |>
  rename(y = "x")

lf1 |>
  left_join(lf2, by = c(x = "y"), keep = TRUE)
#> <SQL>
#> SELECT `lf1`.`x` AS `x`, `lf2`.`x` AS `y`
#> FROM `lf1`
#> LEFT JOIN `lf2`
#>   ON (`lf1`.`x` = `lf2`.`x`)

Created on 2024-07-18 with reprex v2.1.1

The behaviour in my first post above with the left and right sides of between() being treated differently is more bizarre, but at root I am guessing it is something to do with where renames are actioned?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants