Allow pure R expression #4
Comments
this variant does not use edit: ok that would require a new tailored
Thank you!
That's beyond my knowledge, I'll leave that to you if you want to explore (but developing
Is there a reason to prefer $map over $apply? Mimicking
99% of the time one should pick $map in select contexts and $apply in GroupBy contexts. $apply in a select context acts like a scalar lapply over each value, with double the overhead of lapply. $map in a GroupBy context ignores the grouping, so $apply should be used there instead. I would rename the two right ones of the four methods map() and the two wrong ones map_dont_ever_use_me() :)
Rust-polars recently added a feature that detects if
Probably a more realistic way is to encourage people to write their custom functions directly in Polars expressions. Then I could check whether the function returns a Polars expression and warn the user if it doesn't. That said, it's gonna introduce some ambiguity because:
Example for writing functions with Polars expressions:

    foo <- function(x, y) {
      tmp <- polars::pl$mean(x)
      tmp2 <- polars::pl$mean(y)
      tmp + tmp2
    }
    foo("a", "b")
    #> polars Expr: [(col("a").mean()) + (col("b").mean())]
    class(foo("a", "b"))
    #> [1] "Expr"
    polars::pl$DataFrame(mtcars)$groupby("am")$agg(
      foo("drat", "mpg")$alias("test")
    )
    #> shape: (2, 2)
    #> ┌─────┬───────────┐
    #> │ am  ┆ test      │
    #> │ --- ┆ ---       │
    #> │ f64 ┆ f64       │
    #> ╞═════╪═══════════╡
    #> │ 1.0 ┆ 28.442308 │
    #> │ 0.0 ┆ 20.433684 │
    #> └─────┴───────────┘
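
The "check whether the function returns a Polars expression and warn the user if it doesn't" idea could look roughly like the sketch below. `returns_polars_expr()` is a hypothetical helper, not something polars or tidypolars exports. It assumes expressions carry the class "Expr", as in the `class()` output above; I believe later r-polars releases use "RPolarsExpr", so both are accepted. A stand-in object is used so the sketch runs without polars installed.

```r
# Hypothetical helper (not part of polars or tidypolars): call a user function
# and warn when the result does not look like a Polars expression.
returns_polars_expr <- function(fn, ...) {
  out <- fn(...)
  # Assumption: expressions have class "Expr" (older r-polars) or
  # "RPolarsExpr" (later releases).
  is_expr <- inherits(out, c("Expr", "RPolarsExpr"))
  if (!is_expr) {
    warning("function did not return a Polars expression")
  }
  is_expr
}

# Stand-in object with the "Expr" class, so no polars install is needed here
fake_expr <- structure(list(), class = "Expr")

# A function that returns an expression-like object passes the check
returns_polars_expr(function(x) fake_expr, "a")

# A plain R function returns an integer, so the check warns and gives FALSE
returns_polars_expr(function(x) nchar(x), "a")
```

A class check like this only inspects the return value, so it says nothing about whether the function used R-only code along the way.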
Now possible: use custom functions that return a Polars expression:

    library(tidypolars)
    library(dplyr, warn.conflicts = FALSE)
    foo <- function(x, y, z) {
      tmp <- x$mean() + y$mean()
      tmp / z$sum()
    }
    foo_dplyr <- function(x, y, z) {
      tmp <- mean(x) + mean(y)
      tmp / sum(z)
    }
    large_iris <- data.table::rbindlist(rep(list(iris), 100000))
    large_iris_pl <- as_polars(large_iris)
    bench::mark(
      dplyr = large_iris |>
        group_by(Species) |>
        mutate(foo = foo_dplyr(Sepal.Length, Sepal.Width, Petal.Length)),
      tidypolars = large_iris_pl |>
        group_by(Species) |>
        mutate(foo = foo(Sepal.Length, Sepal.Width, Petal.Length)),
      iterations = 10,
      check = FALSE
    )
    #> # A tibble: 2 × 6
    #>   expression      min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 dplyr         877ms    877ms      1.14  689.72MB     14.8
    #> 2 tidypolars    152ms    174ms      5.72    2.65MB      0
Hey @etiennebacher I really like tidypolars and how it integrates with polars! Very smart.
I was thinking it could be possible to also allow pure R syntax, at the cost of some performance. Sometimes a user cannot figure out how to do something in polars, and performance does not matter for that step.
Here it is likely slower than dplyr, as the columns used must first be converted from polars to R vectors, and the output then converted back to polars. It would be possible to use the Arrow C Data Interface plus R ALTREP to avoid the polars->R conversion, and maybe also the R->polars conversion. Then monkey_mutate would be just as fast as dplyr.
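
To make the round trip concrete, here is a minimal sketch of the polars -> R -> polars path described above. `r_fallback_mutate()` is a hypothetical name and is not the monkey_mutate referred to in this issue. It assumes a polars release that exports `as_polars_df()`; on older versions, `pl$DataFrame(r_df)` should play the same role. The demo only runs when polars is installed.

```r
# Hypothetical helper, NOT the monkey_mutate from this issue: evaluate a plain
# R expression on a polars DataFrame by round-tripping through a data.frame.
r_fallback_mutate <- function(pl_df, name, expr) {
  r_df <- as.data.frame(pl_df)                      # polars -> R conversion
  r_df[[name]] <- eval(expr, r_df, parent.frame())  # ordinary R evaluation
  polars::as_polars_df(r_df)                        # R -> polars conversion
}

# Only run the demo when polars is installed
if (requireNamespace("polars", quietly = TRUE)) {
  out <- r_fallback_mutate(
    polars::as_polars_df(mtcars),
    "kpl",
    quote(mpg * 0.425)  # any R expression over the columns
  )
  print(head(as.data.frame(out)[, c("mpg", "kpl")]))
}
```

This version converts every column in both directions, which is where the overhead comes from; converting only the columns the expression actually uses would be the obvious first saving.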
Created on 2023-06-09 with reprex v2.0.2