More support for lubridate #53

Open
etiennebacher opened this issue Nov 9, 2023 · 7 comments

@etiennebacher
Owner

etiennebacher commented Nov 9, 2023

polars has tons of datetime functions (not all are supported in the R implementation for now), but I don't use lubridate enough to thoroughly test them (I don't have real workflows where I can check that they work as expected).

Some help on this would be greatly appreciated. The way to add support for new functions is a bit convoluted, and I should make that easier. Happy to help if someone wants to take a shot.

@etiennebacher added the "help wanted" (extra attention is needed) label on Nov 9, 2023
@etiennebacher
Owner Author

If you come across this issue and want to help, take a look at https://tidypolars.etiennebacher.com/contributing#how-to-add-support-for-an-r-function-in-tidypolars

@frankiethull

I second this!

As someone new to polars, I've found {tidypolars} very helpful and a great tool. I came here to ask for more lubridate support and saw the current issue and wanted to give a +1.

I was trying to use floor_date and make_date with no luck. Also, there isn't a lot of documentation on polars for time series in R, so I put together my own workaround (for now). Not sure if anyone has any recommendations, but based on the r-polars vignette and the Python user guide, this is the best workflow I have found so far.
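For readers less familiar with lubridate, here is a minimal sketch of what the two functions mentioned above do on ordinary R vectors (not on polars data). The example values are made up for illustration; the only lubridate functions used are `make_date()` and `floor_date()`.

```r
library(lubridate)

# Build a Date directly from integer components (no string parsing).
d <- make_date(year = 2009L, month = 8L, day = 3L)

# Round a datetime down to the start of its hour and of its month.
x <- as.POSIXct("2009-08-03 12:01:59", tz = "UTC")
x_hour  <- floor_date(x, unit = "hour")
x_month <- floor_date(x, unit = "month")

print(d)
print(x_hour)
print(x_month)
```

This is the behavior a tidypolars translation would need to reproduce on polars columns.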

initial test trying to make a date using paste0, too slow

# method 1, slow concatenation with paste:
tictoc::tic()
data_pl |>
  mutate(
    hour  = hour(started_at),
    month = month(started_at),
    year  = year(started_at),
    mday  = mday(started_at)
  ) |> 
  mutate( # already too slow w/o even converting str to datetime
    str_dt = paste0(year, "-", month, "-", mday, " ", hour)
  )
tictoc::toc() #6 secs

making a date with concat_str and to_date is way faster

# method 2, multi-step method mutate & with_column:
tictoc::tic()
chpt <- data_pl |>
  mutate( # each piece
    hour  = hour(started_at),
    month = month(started_at),
    year  = year(started_at),
    mday  = mday(started_at),
    dash = "-"
  ) 
chpt <- chpt$with_columns(
   # faster concatenate, needs a spacer [dash]:
    pl$concat_str("year", "dash", "month", "dash", "mday")$alias("x")
  )
# convert to date/datetime
chpt$with_columns(
  pl$col("x")$str$to_date("%Y-%m-%d")
)
tictoc::toc() #.4 secs

Note: this is on ~1 million rows of Citi Bike data.

@etiennebacher
Owner Author

Hi, thanks for your interest in tidypolars. I think the best matches for lubridate::make_date() and make_datetime() are pl.date() and pl.datetime() in python polars, but apparently they're not in r-polars yet.

As I said above, I don't have much time to dedicate to compatibility with lubridate as I rarely use datetime variables, but I'd be happy to review a PR, even an incomplete one. Let me know if you want to try making one and if you need some help. The link above should give most necessary info to get started, but feel free to ask here if I forgot something.


For reference, here's a small reprex for your example, with a shorter syntax for method 2:

library(tidypolars)
library(polars)
library(dplyr, warn.conflicts = FALSE)

foo <- pl$DataFrame(x = rep("2009-08-03 12:01:59", 1e6))$select(pl$col("x")$str$to_datetime())

foo2 <- foo |>
  mutate(
    hour  = hour(x),
    month = month(x),
    year  = year(x),
    mday  = mday(x)
  )

system.time({
  foo2 |> 
    mutate( 
      str_dt = paste0(year, "-", month, "-", mday)
    )
})
#>    user  system elapsed 
#>    1.18    0.05    1.25

system.time({
  foo2$with_columns(
    # eventually this should be replaced by `str_dt = pl$date("year", "month", "mday")`
    str_dt = pl$concat_str("year", pl$lit("-"), "month", pl$lit("-"), "mday")$str$to_date("%Y-%m-%d")
  )
})
#>    user  system elapsed 
#>    0.09    0.01    0.11

@etiennebacher
Owner Author

> I think the best matches for lubridate::make_date() and make_datetime() are pl.date() and pl.datetime() in python polars, but apparently they're not in r-polars yet.

They are now available in the development version of r-polars and will be included in polars 0.16.0. Here's an example with 50M obs:

library(polars)

test <- pl$DataFrame(
  y = sample(2000:2019, 5*1e7, TRUE),
  m = sample(1:12, 5*1e7, TRUE),
  d = sample(1:31, 5*1e7, TRUE)
)

system.time({
  test$with_columns(
    date = pl$concat_str("y", pl$lit("-"), "m", pl$lit("-"), "d")$str$to_date("%Y-%m-%d", strict = FALSE)
  )$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬────────────┐
#> │ y    ┆ m   ┆ d   ┆ date       │
#> │ ---  ┆ --- ┆ --- ┆ ---        │
#> │ i32  ┆ i32 ┆ i32 ┆ date       │
#> ╞══════╪═════╪═════╪════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 │
#> │ …    ┆ …   ┆ …   ┆ …          │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 │
#> └──────┴─────┴─────┴────────────┘
#>    user  system elapsed 
#>    4.76    0.82    5.66

### NEW

system.time({
  test$with_columns(date = pl$date("y", "m", "d"))$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬────────────┐
#> │ y    ┆ m   ┆ d   ┆ date       │
#> │ ---  ┆ --- ┆ --- ┆ ---        │
#> │ i32  ┆ i32 ┆ i32 ┆ date       │
#> ╞══════╪═════╪═════╪════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 │
#> │ …    ┆ …   ┆ …   ┆ …          │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 │
#> └──────┴─────┴─────┴────────────┘
#>    user  system elapsed 
#>    2.64    0.41    3.06

system.time({
  test$with_columns(date = pl$datetime("y", "m", "d"))$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬─────────────────────┐
#> │ y    ┆ m   ┆ d   ┆ date                │
#> │ ---  ┆ --- ┆ --- ┆ ---                 │
#> │ i32  ┆ i32 ┆ i32 ┆ datetime[μs]        │
#> ╞══════╪═════╪═════╪═════════════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 00:00:00 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 00:00:00 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 00:00:00 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 00:00:00 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 00:00:00 │
#> │ …    ┆ …   ┆ …   ┆ …                   │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 00:00:00 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 00:00:00 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 00:00:00 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 00:00:00 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 00:00:00 │
#> └──────┴─────┴─────┴─────────────────────┘
#>    user  system elapsed 
#>    2.25    0.53    2.78
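One detail of the benchmark above: `d` is sampled from 1:31 independently of `m`, so some rows (for example month 2, day 31) are not real calendar dates, which is presumably why the string-based version passes `strict = FALSE`. As a rough base-R analogy (not polars code), parsing such a string gives a missing value rather than a date:

```r
# A valid combination parses to a Date; a single-digit month is accepted.
ok <- as.Date("2011-2-9", format = "%Y-%m-%d")

# February 31st does not exist, so parsing yields NA instead of a date.
bad <- as.Date("2011-2-31", format = "%Y-%m-%d")

print(ok)
print(is.na(bad))
```

How `pl$date()` treats such rows is not shown in this thread, so it is worth checking on your own data before relying on it.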

@frankiethull

first of all, thank you for your modification of method 2 with lit()! Using a column called "dash" felt hackish but was the only way I could figure this out. I knew I was missing something related to the polars interface.

second, I had not thought about checking the development version. The new support for $date and $datetime is mainly what I am after! This is great to hear.

lastly, I still give this ticket a +1 for more support for lubridate, but don't think I'm ready for a PR on it. The help you gave me is exactly what I am after for now

@etiennebacher
Owner Author

> second, I had not thought about checking the development version.

Even if you did, I only added it in polars because you participated here 😉

> lastly, I still give this ticket a +1 for more support for lubridate, but don't think I'm ready for a PR on it. The help you gave me is exactly what I am after for now

Once polars 0.16.0 is out, I'll make a PR to add support for make_date() and make_datetime(). I'll try to make it as clear as possible so that other people can imitate it to add support for other functions.

@etiennebacher
Owner Author

@frankiethull I have added support for make_date() in #108. As you can see here, only 3 lines were needed to add support, and the rest is only testing (there's one small change in the internals but that's not something you'd have to implement yourself). Of course, that doesn't mean it's always so easy, but most of the time it shouldn't be too long.

If you want to try to implement some lubridate function, I'd be happy to provide some guidance (but you can already take a look at the required steps here: https://tidypolars.etiennebacher.com/contributing#how-to-add-support-for-an-r-function-in-tidypolars).
