Using across() within mutate() is appreciably slower than mutating columns individually
#6897
I found this interesting, so I ran the example and got the deprecation warning shown in the output below. Here is the benchmark after adding the updated form of the call:
library(dplyr, warn.conflicts = FALSE)
library(modeldata)
bench::mark(
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = "{.col}_mean")) %>%
    ungroup(),
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(across(First_Flr_SF:TotRms_AbvGrd, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
    ungroup(),
  ames %>%
    group_by(Year_Built, MS_Zoning) %>%
    mutate(First_Flr_SF_mean = mean(First_Flr_SF, na.rm = TRUE),
           Second_Flr_SF_mean = mean(Second_Flr_SF, na.rm = TRUE),
           Gr_Liv_Area_mean = mean(Gr_Liv_Area, na.rm = TRUE),
           Bsmt_Full_Bath_mean = mean(Bsmt_Full_Bath, na.rm = TRUE),
           Bsmt_Half_Bath_mean = mean(Bsmt_Half_Bath, na.rm = TRUE),
           Full_Bath_mean = mean(Full_Bath, na.rm = TRUE),
           Half_Bath_mean = mean(Half_Bath, na.rm = TRUE),
           Bedroom_AbvGr_mean = mean(Bedroom_AbvGr, na.rm = TRUE),
           Kitchen_AbvGr_mean = mean(Kitchen_AbvGr, na.rm = TRUE),
           TotRms_AbvGrd_mean = mean(TotRms_AbvGrd, na.rm = TRUE)) %>%
    ungroup()
)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names =
#> "{.col}_mean")`.
#> ℹ In group 1: `Year_Built = 1872`, `MS_Zoning = Residential_Medium_Density`.
#> Caused by warning:
#> ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
#> Supply arguments directly to `.fns` through an anonymous function instead.
#>
#> # Previously
#> across(a:b, mean, na.rm = TRUE)
#>
#> # Now
#> across(a:b, \(x) mean(x, na.rm = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 ames %>% group_by(Year_Built, MS… 6.95s 6.95s 0.144 31.64MB 4.75
#> 2 ames %>% group_by(Year_Built, MS… 43.8ms 47.35ms 20.6 1.24MB 7.50
#> 3 ames %>% group_by(Year_Built, MS… 39.39ms 47.62ms 17.9 1.15MB 5.96
Created on 2023-07-28 with reprex v2.0.2
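For anyone switching to the anonymous-function form, here is a minimal sketch for checking that it gives the same columns as writing each mean out by hand. It is not part of the original reprex: it uses the built-in mtcars data instead of ames so that it runs without modeldata, and the object names (by_cyl, with_across, by_hand) are made up for illustration. It assumes R >= 4.1 for the \(x) lambda syntax and dplyr >= 1.0.0 for across().

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative data only: mtcars grouped by cyl stands in for the grouped ames data
by_cyl <- group_by(mtcars, cyl)

# Anonymous-function form recommended by the deprecation message
with_across <- by_cyl %>%
  mutate(across(disp:drat, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
  ungroup()

# The same three columns (disp, hp, drat) written out individually
by_hand <- by_cyl %>%
  mutate(disp_mean = mean(disp, na.rm = TRUE),
         hp_mean = mean(hp, na.rm = TRUE),
         drat_mean = mean(drat, na.rm = TRUE)) %>%
  ungroup()

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(with_across, by_hand)))
```

If the check passes, the two forms are interchangeable for the result, so the choice between them only affects timing.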
This slowdown is due to the repeated calls to
library(dplyr)
library(modeldata)
data(ames)
ames <- ames %>%
group_by(Year_Built, MS_Zoning)
# no deprecation warning in this case
bench::mark(
mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, \(x) mean(x, na.rm = TRUE), .names = '{.col}_mean')),
iterations = 5
)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 mutate(ames, across(First_Flr_SF:T… 8.18ms 8.55ms 119. 4.17MB 79.1
# dev lifecycle + https://github.com/r-lib/lifecycle/pull/177
bench::mark(
mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')),
iterations = 20
)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 mutate(ames, across(First_Flr_SF:To… 240ms 251ms 3.93 15.1MB 38.9
# CRAN lifecycle
bench::mark(
mutate(ames, across(First_Flr_SF:TotRms_AbvGrd, mean, na.rm = TRUE, .names = '{.col}_mean')),
iterations = 5
)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 mutate(ames, across(First_Flr_SF:To… 1.26s 1.29s 0.770 27.3MB 20.2
If I group a dataset and apply the same function across multiple columns using mutate(across(...)), I see a performance decrease compared to mutating the columns individually. It is 5.31s vs 0.053s in the reprex, but in my actual, much larger dataset the difference is 1.5 hours vs a couple of minutes.