Using `across()` within `mutate()` is appreciably slower than mutating columns individually #6897

MattJEM · 2023-07-27T15:58:10Z

If I group a dataset and apply the same function to multiple columns using mutate(across(...)), it runs far slower than mutating the columns individually: 5.31 s vs 0.053 s in the reprex below, and about 1.5 hours vs a couple of minutes on my actual, much larger dataset.

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2019 x64 (build 17763)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] modeldata_1.1.0 dplyr_1.1.2    

loaded via a namespace (and not attached):
 [1] utf8_1.2.3        R6_2.5.1          tidyselect_1.2.0  magrittr_2.0.3    glue_1.6.2        tibble_3.2.1     
 [7] pkgconfig_2.0.3   generics_0.1.3    lifecycle_1.0.3   cli_3.6.1         fansi_1.0.4       vctrs_0.6.3      
[13] withr_2.5.0       compiler_4.3.1    rstudioapi_0.15.0 tools_4.3.1       pillar_1.9.0      crayon_1.5.2     
[19] rlang_1.1.1

Reprex

library(dplyr)
library(modeldata)

data(ames)

start_time1 <- Sys.time()
ames_calc1 <- ames %>%
  group_by(Year_Built, MS_Zoning) %>%
  mutate(across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')) %>%
  ungroup()
stop_time1 <- Sys.time()

start_time2 <- Sys.time()
ames_calc2 <- ames %>%
  group_by(Year_Built, MS_Zoning) %>%
  mutate(First_Flr_SF_mean = mean(First_Flr_SF, na.rm = TRUE),
         Second_Flr_SF_mean = mean(Second_Flr_SF, na.rm = TRUE),
         Gr_Liv_Area_mean = mean(Gr_Liv_Area, na.rm = TRUE),
         Bsmt_Full_Bath_mean = mean(Bsmt_Full_Bath, na.rm = TRUE),
         Bsmt_Half_Bath_mean = mean(Bsmt_Half_Bath, na.rm = TRUE),
         Full_Bath_mean = mean(Full_Bath, na.rm = TRUE),
         Half_Bath_mean = mean(Half_Bath, na.rm = TRUE),
         Bedroom_AbvGr_mean = mean(Bedroom_AbvGr, na.rm = TRUE),
         Kitchen_AbvGr_mean = mean(Kitchen_AbvGr, na.rm = TRUE),
         TotRms_AbvGrd_mean = mean(TotRms_AbvGrd, na.rm = TRUE)) %>%
  ungroup()
stop_time2 <- Sys.time()

stop_time1 - start_time1 # Using across() takes 5.313556 secs
stop_time2 - start_time2 # Individually mutating takes 0.05337405 secs

ynsec37 · 2023-07-28T04:41:03Z

I found this interesting, so I ran the example and got a warning:

Supply arguments directly to .fns through an anonymous function instead.

After switching to across(a:b, \(x) mean(x, na.rm = TRUE)), across() is no longer noticeably slower.
The median time with across() is 47.35ms, and the median time when mutating individually is 47.62ms.

library(dplyr, warn.conflicts = FALSE)
library(modeldata)

bench::mark(
 ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = "{.col}_mean")) %>%
    ungroup(),
  
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(across(First_Flr_SF:TotRms_AbvGrd, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
    ungroup(),
  
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(First_Flr_SF_mean = mean(First_Flr_SF, na.rm = TRUE),
      Second_Flr_SF_mean = mean(Second_Flr_SF, na.rm = TRUE),
      Gr_Liv_Area_mean = mean(Gr_Liv_Area, na.rm = TRUE),
      Bsmt_Full_Bath_mean = mean(Bsmt_Full_Bath, na.rm = TRUE),
      Bsmt_Half_Bath_mean = mean(Bsmt_Half_Bath, na.rm = TRUE),
      Full_Bath_mean = mean(Full_Bath, na.rm = TRUE),
      Half_Bath_mean = mean(Half_Bath, na.rm = TRUE),
      Bedroom_AbvGr_mean = mean(Bedroom_AbvGr, na.rm = TRUE),
      Kitchen_AbvGr_mean = mean(Kitchen_AbvGr, na.rm = TRUE),
      TotRms_AbvGrd_mean = mean(TotRms_AbvGrd, na.rm = TRUE)) %>%
    ungroup()
)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names =
#>   "{.col}_mean")`.
#> ℹ In group 1: `Year_Built = 1872`, `MS_Zoning = Residential_Medium_Density`.
#> Caused by warning:
#> ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
#> Supply arguments directly to `.fns` through an anonymous function instead.
#> 
#>   # Previously
#>   across(a:b, mean, na.rm = TRUE)
#> 
#>   # Now
#>   across(a:b, \(x) mean(x, na.rm = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 ames %>% group_by(Year_Built, MS…   6.95s   6.95s     0.144   31.64MB     4.75
#> 2 ames %>% group_by(Year_Built, MS…  43.8ms 47.35ms    20.6      1.24MB     7.50
#> 3 ames %>% group_by(Year_Built, MS… 39.39ms 47.62ms    17.9      1.15MB     5.96

Created on 2023-07-28 with reprex v2.0.2
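
As a quick sanity check that the anonymous-function form of across() is a drop-in replacement for spelling out each column, the sketch below compares the two on a small made-up data frame. The `toy` data, its column names, and the object names are assumptions for illustration only; they are not part of the ames reprex.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up stand-in for the grouped ames data (hypothetical columns).
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  sf1 = c(1, NA, 3, 5),
  sf2 = c(2, 4, NA, 8)
)

# Non-deprecated form: na.rm is supplied inside an anonymous function.
via_across <- toy %>%
  group_by(grp) %>%
  mutate(across(sf1:sf2, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
  ungroup()

# Equivalent column-by-column form.
via_mutate <- toy %>%
  group_by(grp) %>%
  mutate(
    sf1_mean = mean(sf1, na.rm = TRUE),
    sf2_mean = mean(sf2, na.rm = TRUE)
  ) %>%
  ungroup()

# Compare as plain data frames so only names and values are checked.
same_result <- isTRUE(all.equal(as.data.frame(via_across), as.data.frame(via_mutate)))
```

If `same_result` is TRUE, the two pipelines produce the same columns and values, so the choice between them can be made on readability alone.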

DavisVaughan · 2023-11-03T20:45:52Z

This slowdown is caused by repeated calls to lifecycle::deprecate_soft(), which are triggered by passing na.rm = TRUE through the ... of across(); that usage is now deprecated. We are working on making these lifecycle functions faster (r-lib/lifecycle#176). In the meantime, if you follow the advice of this warning and switch to an anonymous function (or write your own standalone mean() wrapper function), the slowdown should go away because the repeated deprecation warnings are no longer thrown.

! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
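
The standalone wrapper option mentioned above could look roughly like the following sketch. The helper name `mean_na_rm` and the small `toy` data frame are hypothetical and only serve to keep the example self-contained; the point is that nothing is passed through the `...` of across().

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: a named function that fixes na.rm = TRUE,
# so no extra arguments go through the deprecated `...` of across().
mean_na_rm <- function(x) mean(x, na.rm = TRUE)

# Made-up data standing in for the grouped ames columns.
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  sf1 = c(1, NA, 3, 5),
  sf2 = c(2, 4, NA, 8)
)

toy_means <- toy %>%
  group_by(grp) %>%
  mutate(across(sf1:sf2, mean_na_rm, .names = "{.col}_mean")) %>%
  ungroup()
```

A named wrapper is handy when the same summary is reused in several pipelines; for a one-off, the anonymous function from the warning message does the same job.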

library(dplyr)
library(modeldata)

data(ames)

ames <- ames %>%
  group_by(Year_Built, MS_Zoning)

# no deprecation warning in this case
bench::mark(
  mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, \(x) mean(x, na.rm = TRUE), .names = '{.col}_mean')),
  iterations = 5
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 mutate(ames, across(First_Flr_SF:T… 8.18ms 8.55ms      119.    4.17MB     79.1

# dev lifecycle + https://github.com/r-lib/lifecycle/pull/177
bench::mark(
  mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')),
  iterations = 20
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 mutate(ames, across(First_Flr_SF:To… 240ms  251ms      3.93    15.1MB     38.9

# CRAN lifecycle
bench::mark(
  mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')),
  iterations = 5
)
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 mutate(ames, across(First_Flr_SF:To… 1.26s  1.29s     0.770    27.3MB     20.2

DavisVaughan mentioned this issue Nov 3, 2023

Only generate the trace_back() as needed r-lib/lifecycle#177

Merged

DavisVaughan closed this as completed Nov 3, 2023

nirguk mentioned this issue Jan 18, 2024

perfromance slowdown using across within mutate #6985

Open
