A complete tidyselect integration for the `columns` argument in validation functions #493
…aluation in all_of
A small update on the one test that's failing in this PR:
outdated (failing test now resolved - tidyselect errors "lazily")
I mischaracterized the issue. Regardless of whether the input is an agent or a table, the validation step will immediately resolve the column. The difference is just whether So even on the main branch, if I swap
library(pointblank)
create_agent(small_table) %>% col_vals_not_null(vars("z")) %>% is("ptblank_agent")
#> [1] TRUE
create_agent(small_table) %>% col_vals_not_null(vars(z)) %>% is("ptblank_agent")
#> [1] TRUE
create_agent(small_table) %>% col_vals_not_null(z) %>% is("ptblank_agent")
#> Error in `resolve_expr_to_cols()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.
create_agent(small_table) %>% col_vals_not_null("z") %>% is("ptblank_agent")
#> Error in `resolve_expr_to_cols()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `z` doesn't exist.
So I think this raises an additional question (which would also be nice to resolve with this PR): Is the current behavior of
Full exploration of
library(pointblank) # devel @main
library(testthat)
#>
#> Attaching package: 'testthat'
#> The following object is masked from 'package:pointblank':
#>
#> matches
test_that("Exceptional behavior of `vars()` with non-existent columns", {
  withr::local_options(lifecycle_verbosity = "quiet")
  # Supported variety of column selection patterns work for an existing column:
  col_a <- "a"
  expect_no_error({
    # `vars()` syntax
    small_table %>% col_vals_not_null(vars(a))
    # Other column selection expressions
    small_table %>% col_vals_not_null(a)
    small_table %>% col_vals_not_null("a")
    small_table %>% col_vals_not_null(col_a)
  })
  # All patterns error when the input is a **table** and column is non-existent
  col_z <- "z"
  expect_error( small_table %>% col_vals_not_null(vars("z")) )
  expect_error( small_table %>% col_vals_not_null(z) )
  expect_error( small_table %>% col_vals_not_null("z") )
  expect_error( small_table %>% col_vals_not_null(col_z) )
  # When input is an **agent**, all error except `vars()` - is this a feature?
  # - `vars()` is special-cased: it's deparsed immediately and bypasses tidyselect
  small_agent <- create_agent(small_table)
  expect_error( small_agent %>% col_vals_not_null(z) )
  expect_error( small_agent %>% col_vals_not_null("z") )
  expect_error( small_agent %>% col_vals_not_null(col_z) )
  expect_no_error( small_agent %>% col_vals_not_null(vars("z")) )
  # The current test treats this behavior of `vars()` as a feature and expects:
  # 1) Interrogation of the agent with the problematic validation runs
  # 2) Upon interrogation, it captures the error and sends a message about it gracefully
  expect_no_error(
    x <- small_agent %>% col_vals_not_null(vars("z")) %>% interrogate()
  )
  expect_true(x$validation_set$eval_error)
})
#> Test passed 🎉
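For reference, the eager failure in the agent case can be reproduced with tidyselect alone. What follows is only a minimal sketch: the toy data frame `df` and the helper `try_select()` are hypothetical names made up for illustration, not pointblank internals. It shows `eval_select()` erroring as soon as a non-existent column is selected, and one way the condition could be captured instead of thrown.

```r
library(tidyselect)
library(rlang)

# Toy data standing in for `small_table` (assumption, not the real dataset)
df <- data.frame(a = 1:3, b = letters[1:3])

# Hypothetical helper: resolve a defused column expression against `data`.
# Returns the selected column names on success, or the error condition
# object (instead of throwing) when the selection is invalid.
try_select <- function(expr, data) {
  tryCatch(
    names(eval_select(expr, data)),
    error = function(cnd) cnd
  )
}

ok  <- try_select(quo(a), df)  # existing column: resolves to its name
bad <- try_select(quo(z), df)  # non-existent column: condition is captured

ok
inherits(bad, "error")
conditionMessage(bad)
```

Capturing the condition like this is one possible way a validation step could hold on to a selection failure and report it later; whether pointblank should do that is exactly the open question in this thread.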
I just finished hunting down tidyselect-related deprecation warnings and the PR now stands at:
Which still includes the 1 error from Rest of test warn/fails
I think the PR is ready for course-correcting now. Thanks in advance for considering!
Thanks so much for all the careful work you've put into this! The story with If we can gracefully move away from that, have laziness like we currently do (i.e., failures happen in the report, not during development of the validation plan), and not break current workflows using
I'll read over your notes/changes a bit more carefully in the coming days (just so happens that I'm busier than usual right now) but for now I just wanted to say this is headed in a great direction and, again, I appreciate the work you've been doing here!
Awesome - thanks for being receptive to this! And I appreciate the additional context as well - they're very helpful to know at this point of the PR. I'll let you take the time to have a closer look at the changes. In the meantime, if I understand correctly, the next steps should be: 1) Make sure that column selection failures are only signaled at
The work you’re doing here is quite significant so please add yourself as an author (but only if you want to be a co-author!).
Thanks - yes, I'd be honored to!
The basic groundwork for a complete tidyselect integration in the
The remaining issues that'd be nice to resolve as a part of this PR are:
I have a few other smaller issues that I'm keeping track of in my mind but they could probably be tackled as separate PRs, so I'll refrain from complicating this PR any further than this.
This is really excellent work! It's good that there's backwards compatibility with Line 700 in dc1b917
I think we should instead remove references to
We should definitely move over all examples to the preferred form and make sure that the YAML writing is modified to remove its use of
Thanks for thinking further ahead on this. Feel free to create issues or just go for it with new PRs! I'll do a bit of manual testing with this PR, just to see if there are any new issues. Again, thank you for all the work!
Gotcha - I'll let you finish your side of tests before I do anything more here (though I think I've put in everything I wanted - happy to move residual issues into their own PRs once this big one is merged). Please let me know if any part of the code is unclear!
Excellent work!
Hi Rich -
Making my way over from your R/Pharma pointblank workshop yesterday - thanks for the great package! I've actually been accumulating some thoughts about a complete tidyselect integration for pointblank, and the workshop nudged me to wrap up my thoughts into a PR.
Intro
My understanding of the column selection logic as it currently exists is that the `columns` argument of user-facing validation functions is always `enquo()`-ed, and forwarded to (at least) three different implementations:
1. If input is a `vars()` call, deparse the elements and return the column names in a character vector.
2. If input is a call to one of the supported {tidyselect} functions in `exported_tidyselect_fns`, forward the quosure to `eval_select()`.
3. For all other cases, the input (sometimes character vector, sometimes a symbol) is forwarded to `vars_select()` as a quosure and defused there.
This PR tries to unify these various behaviors under `tidyselect::eval_select()` (mostly by rewriting the `resolve_columns()` internal function). The proposal is to make all user-supplied expressions to `columns` pass through {tidyselect}, as much as possible.
Problems
There are (or were, if I did this right) four points of friction for a complete `{tidyselect}`-backend refactoring:
1. Both users and developers of pointblank can freely switch between passing expressions vs. character vectors to the column argument without being explicit about which one we're intending. The PRO is that the function is smart and it "just works", which is convenient. The CON is that it requires/causes the kind of fractured implementation of the selection logic that I described above. The solution is to expect the argument to always be captured/quoted by default, unless it's specifically marked for ordinary evaluation. `dplyr::select()` has already paved the way for this by introducing `all_of()`. So anytime in the source code we do something like `columns = columns` and expect the `columns` argument to be ordinarily evaluated as a character vector, this PR ensures that the input is wrapped in `all_of()` to let `eval_select()` handle it:
2. The way we currently implement `vars()` will become somewhat of an off-label usage in a `{tidyselect}` context. There's no way to make this straightforwardly work, so I just special case this for backwards compatibility. Luckily, `{tidyselect}` already offers this capability with `c()`, so the PR simply intercepts the `vars()` expression and replaces `vars` with `c` before sending it off to `eval_select()`.
3. `eval_select()` is strict about resolving column selections - it throws an error if the selection is invalid. In our case, this is a double-edged sword (see also the error in tests at the bottom of this PR). On the one hand, we get the really informative tidyverse-style error messages it generates for free, instead of the generic message that pointblank currently throws. On the other hand, the dataframe that `resolve_columns()` receives is sometimes empty, with no columns (ex: in the case of `serially()`). In the PR, I decided to just special case empty data - the `vars()` expression in that case is simply deparsed into a character vector, bypassing tidyselect entirely (leaving it up to pointblank to catch any column selection errors downstream).
4. Some validation functions (ex: `col_exists()`) simply deparse the user-supplied column expression immediately without doing anything more with it. These should get re-routed to `resolve_columns()` for consistency (of having all user-facing column selection logic be powered by `{tidyselect}`).
The first three actually turned out to be pretty trivial to address, as far as refactoring is concerned - it took a re-write of the `resolve_columns()` function + wrapping the input to `columns` in `all_of()` whenever we're passing a character vector, forcing the source code to be intentional about it. The fourth friction point requires editing the body of individual validation functions, so I'm holding off on that for now.
Example
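As a hedged illustration of the first two friction points listed under Problems, here is a minimal sketch that uses tidyselect and rlang directly. It is not the PR's actual `resolve_columns()` code: the data frame `df` and the helper `vars_to_c()` are hypothetical names introduced only for this example. The first part marks a character vector for ordinary evaluation with `all_of()`; the second rewrites a defused `vars(...)` call into `c(...)` before handing it to `eval_select()`.

```r
library(tidyselect)
library(rlang)

# Toy data frame (assumption for illustration only)
df <- data.frame(a = 1:3, b = 4:6, d = 7:9)

# 1) A character vector that should be evaluated ordinarily is wrapped
#    in all_of(), so eval_select() knows it is not a column expression.
cols <- c("a", "b")
from_chr <- names(eval_select(quo(all_of(cols)), df))

# 2) Hypothetical helper: if the defused expression is a `vars(...)` call,
#    swap the function name for `c`, leaving the arguments untouched.
vars_to_c <- function(quo) {
  expr <- quo_get_expr(quo)
  if (is_call(expr, "vars")) {
    expr[[1]] <- sym("c")
    quo <- quo_set_expr(quo, expr)
  }
  quo
}

# `vars()` is never called here; quo() only captures the expression.
from_vars <- names(eval_select(vars_to_c(quo(vars(a, "b"))), df))

from_chr
from_vars
```

Because the rewrite happens on the defused expression, a mix of symbols and strings inside `vars()` ends up being handled by tidyselect's own `c()` semantics, which is the backwards-compatibility idea described in point 2.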
To demonstrate what the PR enables, here it is in action for `rows_distinct()`, which seemed like the prime candidate for this refactoring:
One welcome consequence of this is that the `column` information is no longer NA for the `rows_distinct()` validation step:
A new test file `test-tidyselect_integration.R` has more comprehensive coverage of {tidyselect} features that we expect to work.
Remaining considerations
This is a draft PR because of these outstanding issues:
Deprecation
We might want to consider signalling deprecation for 1) `vars()` (use `c()` instead), and 2) passing character vectors expecting it to "just work" (wrap it in `all_of()` instead). Users have already built up intuition about this behavior from using dplyr, so it hopefully shouldn't be too difficult to grok. I haven't implemented a `vars()` deprecation warning in this PR yet (I don't know how disruptive that'd be) but in the case of `all_of()` the PR just lets the deprecation warning trickle up from tidyselect.
Coverage and WIP status
This PR needs to finish integrating the new tidyselect implementation for some of the validation functions that still use the old implementation (ex: `rows_complete()`) and also for those that don't use tidyselect at all (ex: `col_exists()`).
Tests
outdated
When I `devtools::test()` (Windows), I get:
Compared to main, this PR introduces 1 new FAIL and 10 new WARNs:
The warnings are mostly deprecation warnings from using or missing `all_of()` in the appropriate places. I'm in the process of hunting the rest of these down.
The one new error may be a breaking change - the PR changes pointblank's behavior on this test:
pointblank/tests/testthat/test-get_agent_report.R
Lines 84 to 92 in dc1b917
Whereas pointblank agents strive to delay the evaluation of the validation checks until `interrogate()`, passing the column inputs through `eval_select()` triggers immediate errors for non-existent columns. In other words, the agent fails to "build" entirely so it fails the tests for this agent. I'm sure it's possible to put the old behavior back in (of gracefully signaling the "evaluation issue requires attention" error) but just wanted to put it out there first before I do anything about it since I think this one requires some more thought.
Related GitHub Issues and PRs
I'm sure this cuts across several open issues, but to list a few that this PR would close if worked to completion:
- `vars()` seems required for `rows_*()` but not other functions #416
- `ends_with()` doesn't work with `expect_col_exists()` #433
- Allow use of select helpers in `col_exists()` #221
Checklist
`testthat` unit tests to `tests/testthat` for any new functionality.