Renamed columns are not being handled correctly in `join_by()` translation #1524

francisbarton · 2024-07-16T19:15:11Z

I discovered a strange error in my work, it seems like a bug?
It probably looks like a pretty contrived situation in the reprex below, but that is what I had to do to reproduce what I found.

For some reason, the column name y$rhs_val is treated differently when being compared to the right argument of between(), to when compared to the left argument.

In the former case it retains its renamed name, in the latter case it reverts to its original name (VALUE).
I would expect the "new" column name to be used identically on both halves of the between step.

library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

tbl_a <- tibble::tribble(
  ~ID, ~LEFT, ~RIGHT,
  "A", 1, 3,
  "B", 2, 4,
  "C", 3, 5
)
  
tbl_b <- tibble::tribble(
  ~ID, ~VALUE,
  "A", 4,
  "B", 4,
  "C", 4
) 

dplyr::copy_to(con, tbl_a)
dplyr::copy_to(con, tbl_b)

tbl_b2 <- dplyr::tbl(con, "tbl_b") |>
  janitor::clean_names() |>
  dplyr::rename(rhs_val = "value")

dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::semi_join(
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  )
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! no such column: tbl_b.rhs_val

dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::semi_join(
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  ) |>
  dplyr::show_query()
#> <SQL>
#> SELECT `ID` AS `id`, `LEFT` AS `left`, `RIGHT` AS `right`
#> FROM `tbl_a`
#> WHERE EXISTS (
#>   SELECT 1 FROM `tbl_b`
#>   WHERE
#>     (`tbl_a`.`ID` = `tbl_b`.`ID`) AND
#>     (`tbl_a`.`LEFT` <= `tbl_b`.`VALUE`) AND
#>     (`tbl_a`.`RIGHT` >= `tbl_b`.`rhs_val`)
#> )

^{Created on 2024-07-16 with reprex v2.1.1}

Needless to say, if you collect() both tables before attempting the semi_join, it works fine.
So it seems a dbplyr issue specifically, as the show_query() code seems to confirm.

I have dbplyr v2.5.0 (CRAN)

(PS I was expecting the SQL translation here to use WHERE ... rhs_val BETWEEN LEFT AND RIGHT instead of the two-step comparison using <= and >= - but maybe BETWEEN ... AND ... doesn't work across all SQL implementations?)

The text was updated successfully, but these errors were encountered:

davidfoord1 · 2024-07-17T10:29:29Z

To add on:

#> ! no such column: tbl_b.rhs_val

The rename() behaviour has been dropped by between() - there is no longer a

`VALUE` AS `rhs_val`

As in

show_query(tbl_b2)
#> <SQL>
#> SELECT `ID` AS `id`, `VALUE` AS `rhs_val`
#> FROM `tbl_b`

Trying with column-based joins the rename is in the final SELECT, but not applied to the table being joined to, so you still get the no such column error, as well as the mismatch between left and right.

col_query <- dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::left_join(  # left instead of semi join
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  )
  
col_query
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! no such column: tbl_b.rhs_val
#> Run `rlang::last_trace()` to see where the error occurred.

show_query(col_query)
#> <SQL>
#> SELECT
#>   `tbl_a`.`ID` AS `id`,
#>   `LEFT` AS `left`,
#>   `RIGHT` AS `right`,
#>   `VALUE` AS `rhs_val`
#> FROM `tbl_a`
#> LEFT JOIN `tbl_b`
#>   ON (
#>     `tbl_a`.`ID` = `tbl_b`.`ID` AND
#>     `tbl_a`.`LEFT` <= `tbl_b`.`VALUE` AND
#>     `tbl_a`.`RIGHT` >= `tbl_b`.`rhs_val`
#>   )

francisbarton · 2024-07-17T21:27:50Z

I've just discovered https://dbplyr.tidyverse.org/articles/reprex.html and I'm going to try to supply a tighter, smaller, lower-level reprex of the issue if I can home in on it.

francisbarton · 2024-07-17T21:45:33Z

between() doesn't seem to be implemented for MSSQL in the same way as it is for several other DBs simulated in dbplyr:

library(dbplyr)
translate_sql(between(x, 2L, 4L), con = simulate_sqlite())
#> <SQL> `x` BETWEEN 2 AND 4
translate_sql(between(x, 2L, 4L), con = simulate_postgres())
#> <SQL> `x` BETWEEN 2 AND 4
translate_sql(between(x, 2L, 4L), con = simulate_mssql())
#> Error in if (context$clause == "WHERE") {: argument is of length zero

^{Created on 2024-07-17 with reprex v2.1.1}

francisbarton · 2024-07-17T22:18:11Z

Another reprex - sorry, I'm sorting of using this issue as a place to work through this.

it doesn't make a difference if you use simulate_mssql() or simulate_sqlite():

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

v1 <- list(id = LETTERS[1:5], x = 1:5, y = 5:1)
v2 <- list(id = LETTERS[1:5], w = 2:6)
lf1 <- lazy_frame(!!!v1, con = simulate_mssql())
lf2 <- lazy_frame(!!!v2, con = simulate_mssql()) |>
  rename(z = "w")

lf1 |>
  semi_join(lf2, join_by("id", between(y$z, x$x, x$y)))
#> <SQL>
#> SELECT `df_LHS`.*
#> FROM `df` AS `df_LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM `df` AS `df_RHS`
#>   WHERE
#>     (`df_LHS`.`id` = `df_RHS`.`id`) AND
#>     (`df_LHS`.`x` <= `df_RHS`.`w`) AND
#>     (`df_LHS`.`y` >= `df_RHS`.`z`)
#> )

lf1 <- lazy_frame(!!!v1, con = simulate_sqlite())
lf2 <- lazy_frame(!!!v2, con = simulate_sqlite()) |>
  rename(z = "w")

lf1 |>
  semi_join(lf2, join_by("id", between(y$z, x$x, x$y)))
#> <SQL>
#> SELECT `df_LHS`.*
#> FROM `df` AS `df_LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM `df` AS `df_RHS`
#>   WHERE
#>     (`df_LHS`.`id` = `df_RHS`.`id`) AND
#>     (`df_LHS`.`x` <= `df_RHS`.`w`) AND
#>     (`df_LHS`.`y` >= `df_RHS`.`z`)
#> )

^{Created on 2024-07-17 with reprex v2.1.1}

francisbarton · 2024-07-17T23:43:30Z

There is a test that seems to relate to this here - with a comment to say that the query generated is wrong, although all the tests there pass.

If I amend the example situation in that test so that the last condition with lf4 is a join_by() criterion with the second part involving a comparison with < (forget about all the stuff about between() above), then we get:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

# from https://github.com/tidyverse/dbplyr/blob/860bd6adf988a2b039485030d234eba05ee46a3c/tests/testthat/test-verb-joins.R#L656-L681
lf1 <- lazy_frame(x = 1, a = 2, .name = "lf1")
lf2 <- lazy_frame(x = 1, b = 3, .name = "lf2")
lf3 <- lazy_frame(x = 1, c = 4, .name = "lf3")
lf4 <- lazy_frame(x = 1, a = 5, .name = "lf4") |>
  rename(a4 = "a")

lf1 |>
  inner_join(lf2, by = "x") |>
  inner_join(lf3, by = "x") |>
  inner_join(lf4, by = join_by("x", a < a4))
#> <SQL>
#> SELECT `lf1`.*, `b`, `c`, `lf4`.`a` AS `a4`
#> FROM `lf1`
#> INNER JOIN `lf2`
#>   ON (`lf1`.`x` = `lf2`.`x`)
#> INNER JOIN `lf3`
#>   ON (`lf1`.`x` = `lf3`.`x`)
#> INNER JOIN `lf4`
#>   ON (`lf1`.`x` = `lf4`.`x` AND `lf1`.`a` < `lf4`.`a`)

^{Created on 2024-07-18 with reprex v2.1.1}

Suggesting that the rename is being kept in the top-level SELECT line in an attempt to avoid having a nested query, but a rename like this probably needs to be operationalised sooner, in the final INNER JOIN, something like this? :

SELECT `lf1`.*, `b`, `c`, `a4`
FROM `lf1`
INNER JOIN `lf2`
  ON (`lf1`.`x` = `lf2`.`x`)
INNER JOIN `lf3`
  ON (`lf1`.`x` = `lf3`.`x`)
INNER JOIN (SELECT `x`, `a` AS `a4` FROM `lf4`) AS `lf4`
  ON (`lf1`.`x` = `lf4`.`x` AND `lf1`.`a` < `lf4`.`a4`)

francisbarton · 2024-07-17T23:55:40Z

Here's a really small reprex showing the issue with rename() - I think this behaviour is incorrect? In real life this SQL will cause an error saying that lf2.x is not found.

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

lf1 <- lazy_frame(x = seq(2), con = simulate_sqlite(), .name = "lf1")
lf2 <- lazy_frame(x = 1, con = simulate_sqlite(), .name = "lf2") |>
  rename(y = "x")

lf1 |>
  left_join(lf2, by = c(x = "y"), keep = TRUE)
#> <SQL>
#> SELECT `lf1`.`x` AS `x`, `lf2`.`x` AS `y`
#> FROM `lf1`
#> LEFT JOIN `lf2`
#>   ON (`lf1`.`x` = `lf2`.`x`)

^{Created on 2024-07-18 with reprex v2.1.1}

The behaviour in my first post above with the left and right sides of between() being treated differently is more bizarre, but at root I am guessing it is something to do with where renames are actioned?

francisbarton changed the title ~~RHS col name is treated differently on either side of between()~~ Renamed columns are not being handled correctly in join_by() translation Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Renamed columns are not being handled correctly in `join_by()` translation #1524

Renamed columns are not being handled correctly in `join_by()` translation #1524

francisbarton commented Jul 16, 2024 •

edited

Loading

davidfoord1 commented Jul 17, 2024 •

edited

Loading

francisbarton commented Jul 17, 2024

francisbarton commented Jul 17, 2024 •

edited

Loading

francisbarton commented Jul 17, 2024

francisbarton commented Jul 17, 2024 •

edited

Loading

francisbarton commented Jul 17, 2024 •

edited

Loading

Renamed columns are not being handled correctly in join_by() translation #1524

Renamed columns are not being handled correctly in join_by() translation #1524

Comments

francisbarton commented Jul 16, 2024 • edited Loading

davidfoord1 commented Jul 17, 2024 • edited Loading

francisbarton commented Jul 17, 2024

francisbarton commented Jul 17, 2024 • edited Loading

francisbarton commented Jul 17, 2024

francisbarton commented Jul 17, 2024 • edited Loading

francisbarton commented Jul 17, 2024 • edited Loading

Renamed columns are not being handled correctly in `join_by()` translation #1524

Renamed columns are not being handled correctly in `join_by()` translation #1524

francisbarton commented Jul 16, 2024 •

edited

Loading

davidfoord1 commented Jul 17, 2024 •

edited

Loading

francisbarton commented Jul 17, 2024 •

edited

Loading

francisbarton commented Jul 17, 2024 •

edited

Loading

francisbarton commented Jul 17, 2024 •

edited

Loading