Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distinct(.keep_all = TRUE) does not work with filtering joins #1534

Open
Xitian9 opened this issue Aug 14, 2024 · 1 comment
Open

distinct(.keep_all = TRUE) does not work with filtering joins #1534

Xitian9 opened this issue Aug 14, 2024 · 1 comment

Comments

@Xitian9
Copy link

Xitian9 commented Aug 14, 2024

An error is produced when trying to use distinct(..., .keep_all = TRUE) within the second argument of a filtering join. The error does not occur when used within the first argument of a filtering join, within either argument of a mutating join, or when using .keep_all = FALSE.

df1 <- dbplyr::lazy_frame(id = 1)
df2 <- dbplyr::lazy_frame(id = 1)

anti_join(df1, df2 |> distinct(id, .keep_all = TRUE))
Error in `get_env()`:
! Can't extract an environment from a call.
@gadenbuie
Copy link
Contributor

Here's another variation on the above reprex that's also a little smaller (by eliminating some of the extra calls that distinct() adds when .keep_all = TRUE).

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

df1 <- dbplyr::lazy_frame(id = 1)
df2 <- dbplyr::lazy_frame(id = 1)

df2 |>
  filter(row_number() == 1L) |>
  anti_join(df1, y = _)
#> Joining with `by = join_by(id)`
#> Error in `get_env()`:
#> ! Can't extract an environment from a call.

Here's the generated SQL for the filtered df2:

df2 |>
  filter(row_number() == 1L)
#> <SQL>
#> SELECT `id`
#> FROM (
#>   SELECT `df`.*, ROW_NUMBER() OVER () AS `col01`
#>   FROM `df`
#> ) AS `q01`
#> WHERE (`col01` = 1)

git bisect shows #1396 as the place where this change was introduced and I think it stems from sql_build.lazy_semi_join_query introducing a sql() call in the expression.

dbplyr/R/lazy-join-query.R

Lines 266 to 273 in 1ff6363

y_vars <- op_vars(op$y)
replacements <- purrr::map(y_vars, ~ call2("$", call2("sql", op$by$y_as), sym(.x)))
where <- purrr::map(
op$where,
~ replace_sym(.x, y_vars, replacements)
)
where_sql <- translate_sql_(where, con = con, context = list(clause = "WHERE"))

# debugging inside sql_build.lazy_semi_join_query()
str(where)
#> List of 1
#>  $ : language sql("`RHS`")$col01 == 1L

Prior to this commit, the where clause would be

# debugging inside sql_build.lazy_semi_join_query()
str(where)
#> List of 1
#>  $ : language RHS$col01 == 1L

which doesn't contain any function calls and therefore doesn't hit the code paths in sql_data_mask() that try to call get_env() on the call expression. The first place where get_env() is called is if any unknown function calls appear in the expression:

dbplyr/R/translate-sql.R

Lines 164 to 166 in 1ff6363

unknown <- setdiff(all_calls(expr), names(variant))
op <- if (strict) missing_op else default_op
top_env <- ceply(unknown, op, parent = empty_env(), env = get_env(expr))

If you mark sql as a known function (since it's defined inside sql_data_mask() as a special call)

unknown <- setdiff(all_calls(expr), c(names(variant), "sql"))

you get past this error but still run into sql() also wanting to access the environment of expr

dbplyr/R/translate-sql.R

Lines 192 to 199 in 1ff6363

special_calls2$sql <- function(...) {
dots <- exprs(...)
env <- get_env(expr)
dots <- purrr::map(dots, eval_tidy, env = env)
exec(sql, !!!dots)
}

I'm not sure what the best fix would be but here are two options I've thought of:

  1. Default to empty_env() or global_env() in the sql() definition if expr doesn't have an environment. The first option seems closest to the desired behavior since it will error if sql() somehow is called inside an expression that should have had an attached environment. (Change would be here.)

  2. Promote the where clauses to quosures if needed when injecting the sql() call. (Change would be here.)

If the first option sounds reasonable, I'd be happy to submit a PR. (The second option would probably be more easily handled by someone with more experience with dbplyr internals.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants