Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unordered row_number() on Snowflake, MSSQL, and Teradata #1331

Merged
merged 15 commits into from
Aug 22, 2023

Conversation

fh-mthomson
Copy link
Contributor

@fh-mthomson fh-mthomson commented Jul 8, 2023

Of note, from https://stackoverflow.com/questions/44105691/row-number-without-order-by per second comment, I opted to use SELECT NULL over SELECT 1 (which also aligns with the implementation for Teradata).

SELECT `df`.*, ROW_NUMBER() OVER (PARTITION BY `y` ORDER BY `y`) AS `rown`
SELECT
`df`.*,
ROW_NUMBER() OVER (PARTITION BY `y` ORDER BY (SELECT NULL)) AS `rown`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, Teradata (via win_rank_tdata()) was previously defaulting to ORDER BY the result of win_current_group() (added in 4109051#diff-c4f086980e5692d16e3337acd7bfefe3a9315135688133ee4f24ea2944ccba3bR174 by @overmar ).

By generalizing the behavior to fall back to SELECT NULL for all back ends (incl. Snowflake / MSSQL) this now no longer defaults to ordering by the group.

I'm likely not familiar enough with (1) PARTITION vs ORDER and (2) Teradata to determine whether this would have adverse impacts, so would welcome input from more knowledgeable folks!

@fh-mthomson
Copy link
Contributor Author

@mgirlich would you be open to reviewing? thank you!

@fh-mthomson
Copy link
Contributor Author

@mgirlich @hadley would you be open to including this PR in the 2.4.0 release?

Apologies for the direct pester, but it's last (known) Snowflake <> dbplyr blocker for us; would love to close out if the timing works for y'all!

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

@mgirlich I'm ok with allowing an argument-less row_number(); it doesn't really make a lot of sense for databases, but it's still sometimes useful.

R/translate-sql-window.R Outdated Show resolved Hide resolved

#' @export
#' @rdname win_over
win_rank_tdata <- function(f) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this exported in a released dbplyr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, as of 2.3.0 (added in 4109051#diff-c4f086980e5692d16e3337acd7bfefe3a9315135688133ee4f24ea2944ccba3bR167). I'd guess it was primarily to enable row_number() for Teradata backend.

Open to any recommended patterns for deprecation!

#' @rdname win_over
#' @export
win_rank <- function(f) {
win_rank <- function(f, use_default_order_null = FALSE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument name feels a bit long to me, and needs to be documented. Maybe empty_order = TRUE or order_by = c("required", "optional"), or something along those lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to empty_order (defaulting to FALSE to not apply to all backends) and added some docs!

@mgirlich mgirlich merged commit 9ff37e0 into tidyverse:main Aug 22, 2023
14 checks passed
@mgirlich
Copy link
Collaborator

Thanks for the PR! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support row_number() on Snowflake
3 participants