Using across() within mutate() is appreciably slower than mutating columns individually #6897

Closed
MattJEM opened this issue Jul 27, 2023 · 2 comments

MattJEM commented Jul 27, 2023

If I group a dataset and apply the same function to multiple columns using mutate(across(...)), it runs much slower than mutating the columns individually. The reprex takes 5.31s vs 0.053s (roughly 100x), and on my actual, much larger dataset the difference is 1.5 hours vs a couple of minutes.


sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2019 x64 (build 17763)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] modeldata_1.1.0 dplyr_1.1.2    

loaded via a namespace (and not attached):
 [1] utf8_1.2.3        R6_2.5.1          tidyselect_1.2.0  magrittr_2.0.3    glue_1.6.2        tibble_3.2.1     
 [7] pkgconfig_2.0.3   generics_0.1.3    lifecycle_1.0.3   cli_3.6.1         fansi_1.0.4       vctrs_0.6.3      
[13] withr_2.5.0       compiler_4.3.1    rstudioapi_0.15.0 tools_4.3.1       pillar_1.9.0      crayon_1.5.2     
[19] rlang_1.1.1

Reprex

library(dplyr)
library(modeldata)

data(ames)

start_time1 <- Sys.time()
ames_calc1 <- ames %>%
  group_by(Year_Built, MS_Zoning) %>%
  mutate(across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')) %>%
  ungroup()
stop_time1 <- Sys.time()

start_time2 <- Sys.time()
ames_calc2 <- ames %>%
  group_by(Year_Built, MS_Zoning) %>%
  mutate(First_Flr_SF_mean = mean(First_Flr_SF, na.rm = TRUE),
         Second_Flr_SF_mean = mean(Second_Flr_SF, na.rm = TRUE),
         Gr_Liv_Area_mean = mean(Gr_Liv_Area, na.rm = TRUE),
         Bsmt_Full_Bath_mean = mean(Bsmt_Full_Bath, na.rm = TRUE),
         Bsmt_Half_Bath_mean = mean(Bsmt_Half_Bath, na.rm = TRUE),
         Full_Bath_mean = mean(Full_Bath, na.rm = TRUE),
         Half_Bath_mean = mean(Half_Bath, na.rm = TRUE),
         Bedroom_AbvGr_mean = mean(Bedroom_AbvGr, na.rm = TRUE),
         Kitchen_AbvGr_mean = mean(Kitchen_AbvGr, na.rm = TRUE),
         TotRms_AbvGrd_mean = mean(TotRms_AbvGrd, na.rm = TRUE)) %>%
  ungroup()
stop_time2 <- Sys.time()

stop_time1 - start_time1 # Using across() takes 5.313556 secs
stop_time2 - start_time2 # Individually mutating takes 0.05337405 secs

ynsec37 commented Jul 28, 2023

I found this interesting, and when I ran the example I got a warning:

Supply arguments directly to .fns through an anonymous function instead.

After switching to across(a:b, \(x) mean(x, na.rm = TRUE)), across() no longer seems noticeably slower.
The median time with across() is 47.35ms, and the median time when mutating individually is 47.62ms.

library(dplyr, warn.conflicts = FALSE)
library(modeldata)

bench::mark(
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = "{.col}_mean")) %>%
    ungroup(),
  
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(across(First_Flr_SF:TotRms_AbvGrd, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
    ungroup(),
  
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(First_Flr_SF_mean = mean(First_Flr_SF, na.rm = TRUE),
      Second_Flr_SF_mean = mean(Second_Flr_SF, na.rm = TRUE),
      Gr_Liv_Area_mean = mean(Gr_Liv_Area, na.rm = TRUE),
      Bsmt_Full_Bath_mean = mean(Bsmt_Full_Bath, na.rm = TRUE),
      Bsmt_Half_Bath_mean = mean(Bsmt_Half_Bath, na.rm = TRUE),
      Full_Bath_mean = mean(Full_Bath, na.rm = TRUE),
      Half_Bath_mean = mean(Half_Bath, na.rm = TRUE),
      Bedroom_AbvGr_mean = mean(Bedroom_AbvGr, na.rm = TRUE),
      Kitchen_AbvGr_mean = mean(Kitchen_AbvGr, na.rm = TRUE),
      TotRms_AbvGrd_mean = mean(TotRms_AbvGrd, na.rm = TRUE)) %>%
    ungroup()
)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names =
#>   "{.col}_mean")`.
#> ℹ In group 1: `Year_Built = 1872`, `MS_Zoning = Residential_Medium_Density`.
#> Caused by warning:
#> ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
#> Supply arguments directly to `.fns` through an anonymous function instead.
#> 
#>   # Previously
#>   across(a:b, mean, na.rm = TRUE)
#> 
#>   # Now
#>   across(a:b, \(x) mean(x, na.rm = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 ames %>% group_by(Year_Built, MS…   6.95s   6.95s     0.144   31.64MB     4.75
#> 2 ames %>% group_by(Year_Built, MS…  43.8ms 47.35ms    20.6      1.24MB     7.50
#> 3 ames %>% group_by(Year_Built, MS… 39.39ms 47.62ms    17.9      1.15MB     5.96

Created on 2023-07-28 with reprex v2.0.2
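
As a quick sanity check on the comparison above, the anonymous-function form of across() and the column-by-column mutate() can be compared on a small example. This is only a sketch: the `toy` tibble and its column names are made up for illustration and are not part of the ames data.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data standing in for ames
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, NA, 3, 5),
  y   = c(2, 4, 6, NA)
)

# across() with an anonymous function (the non-deprecated form)
via_across <- toy %>%
  group_by(grp) %>%
  mutate(across(x:y, \(v) mean(v, na.rm = TRUE), .names = "{.col}_mean")) %>%
  ungroup()

# The same columns written out one by one
via_explicit <- toy %>%
  group_by(grp) %>%
  mutate(x_mean = mean(x, na.rm = TRUE),
         y_mean = mean(y, na.rm = TRUE)) %>%
  ungroup()

# Both forms should produce the same tibble
stopifnot(isTRUE(all.equal(via_across, via_explicit)))
```

If the two results agree, any timing difference between the forms comes from overhead rather than from different work being done.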

@DavisVaughan
Member

This slowdown comes from repeated calls to lifecycle::deprecate_soft(). They are triggered by passing na.rm = TRUE through the ... of across(), which is now deprecated. We are working on making these lifecycle functions faster (r-lib/lifecycle#176). In the meantime, if you follow the advice of this warning and switch to an anonymous function (or write your own standalone mean() wrapper function), the slowdown should go away because the repeated deprecation warnings are no longer thrown.

! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

library(dplyr)
library(modeldata)

data(ames)

ames <- ames %>%
  group_by(Year_Built, MS_Zoning)

# no deprecation warning in this case
bench::mark(
  mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, \(x) mean(x, na.rm = TRUE), .names = '{.col}_mean')),
  iterations = 5
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 mutate(ames, across(First_Flr_SF:T… 8.18ms 8.55ms      119.    4.17MB     79.1

# dev lifecycle + https://github.com/r-lib/lifecycle/pull/177
bench::mark(
  mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')),
  iterations = 20
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 mutate(ames, across(First_Flr_SF:To… 240ms  251ms      3.93    15.1MB     38.9

# CRAN lifecycle
bench::mark(
  mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')),
  iterations = 5
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 mutate(ames, across(First_Flr_SF:To… 1.26s  1.29s     0.770    27.3MB     20.2
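
The other option mentioned above, a standalone mean() wrapper, might look like the sketch below. The helper name `mean_na_rm` and the `toy` tibble are hypothetical and chosen only for illustration. Because na.rm = TRUE lives inside the wrapper, nothing is passed through the ... of across().

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper: fixes na.rm = TRUE so across() needs no extra arguments
mean_na_rm <- function(x) mean(x, na.rm = TRUE)

# Hypothetical toy data standing in for ames
toy <- tibble(
  grp = c("a", "a", "b"),
  x   = c(1, NA, 3),
  y   = c(2, 4, 6)
)

out <- toy %>%
  group_by(grp) %>%
  mutate(across(c(x, y), mean_na_rm, .names = "{.col}_mean")) %>%
  ungroup()

print(out)
```

A named wrapper is convenient when the same summary is reused in several pipelines. For a one-off call the anonymous function is shorter.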
