Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSSQL slice_sample() translation always returns the same rows #1503

Open
thomashulst opened this issue May 16, 2024 · 0 comments · May be fixed by #1504
Open

MSSQL slice_sample() translation always returns the same rows #1503

thomashulst opened this issue May 16, 2024 · 0 comments · May be fixed by #1504

Comments

@thomashulst
Copy link
Contributor

thomashulst commented May 16, 2024

Using slice_sample() on an MSSQL backend always return the same rows:

df <- tbl(con, "mtcars")
df |> slice_sample(n = 3)
#> # Source:   SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
df |> slice_sample(n = 3)
#> # Source:   SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
df |> slice_sample(n = 3)
#> # Source:   SQL [3 x 12]
#> # Database: Microsoft SQL Server 14.00.2052[***]
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1

Created on 2024-05-16 with reprex v2.1.0

The issue is with the RAND() function. RAND() on an MSSQL backend will always return the same value for each row in the result set. As such, ordering the rows by the return value of RAND() and selecting the top n rows will just return the data in its initial order:

dbGetQuery(con, "SELECT TOP(3) *, RAND() as rand FROM mtcars")
#>             car  mpg cyl disp  hp drat    wt  qsec vs am gear carb      rand
#> 1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.1899416
#> 2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.1899416
#> 3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.1899416

Created on 2024-05-16 with reprex v2.1.0

An alternative approach to generating random numbers on an MSSQL backend is given here. It works by using NEWID() to return a globally unique identifer for each row, hashing the GUID with CHECKSUM() and using this as a seed for RAND():

dbGetQuery(con, "SELECT TOP(3) *, RAND(CHECKSUM(NEWID())) as rand FROM mtcars")
#>             car  mpg cyl disp  hp drat    wt  qsec vs am gear carb       rand
#> 1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.78626466
#> 2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.31999640
#> 3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.03604049

Created on 2024-05-16 with reprex v2.1.0

Changing the runif() function translation in the MSSQL backend should be straightforward:

      runif = function(n = n(), min = 0, max = 1) {
        sql_runif(RAND(CHECKSUM(NEWID())), n = {{ n }}, min = min, max = max)
      }
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant