-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
107 lines (76 loc) · 4.08 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
library("purrr") # prevent `purrr` load message by `furrr`
devtools::load_all()
```
# tidier
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/tidier)](https://CRAN.R-project.org/package=tidier) [![R-CMD-check](https://github.com/talegari/tidier/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/talegari/tidier/actions/workflows/R-CMD-check.yaml) `r badger::badge_devel(color = "blue")`
<!-- badges: end -->
`tidier` package provides '[Apache Spark](https://spark.apache.org/)' style window aggregation for R dataframes and remote `dbplyr` tbls via '[mutate](https://dplyr.tidyverse.org/reference/mutate.html)' in '[dplyr](https://dplyr.tidyverse.org/index.html)' flavour.
## Example
**Create a new column with average temp over last seven days in the same month**.
```{r}
set.seed(101)
airquality |>
# create date column
dplyr::mutate(date_col = lubridate::make_date(1973, Month, Day)) |>
# create gaps by removing some days
dplyr::slice_sample(prop = 0.8) |>
# compute mean temperature over last seven days in the same month
tidier::mutate(avg_temp_over_last_week = mean(Temp, na.rm = TRUE),
.order_by = Day,
.by = Month,
.frame = c(lubridate::days(7), # 7 days before current row
lubridate::days(-1) # do not include current row
),
.index = date_col
)
```
## Features
- `mutate` supports
- `.by` (group by),
- `.order_by` (order by),
- `.frame` (endpoints of window frame),
- `.index` (identify index column like date column, in df version only),
- `.complete` (whether to compute over incomplete window, in df version only).
- `mutate` automatically uses a future backend (via [`furrr`](https://furrr.futureverse.org/), in df version only).
## Motivation
This implementation is inspired by Apache Spark's [`windowSpec`](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.Column.over.html?highlight=windowspec) class with [`rangeBetween`](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.WindowSpec.rangeBetween.html) and [`rowsBetween`](https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.WindowSpec.rowsBetween.html).
## Ecosystem
1. [`dbplyr`](https://dbplyr.tidyverse.org/) implements this via [`dbplyr::win_over`](https://dbplyr.tidyverse.org/reference/win_over.html?q=win_over#null) enabling [`sparklyr`](https://spark.rstudio.com/) users to write window computations. Also see, [`dbplyr::window_order`/`dbplyr::window_frame`](https://dbplyr.tidyverse.org/reference/window_order.html?q=window_fr#ref-usage). `tidier`'s `mutate` wraps this functionality via uniform syntax across dataframes and remote tbls.
2. [`tidypyspark`](https://talegari.github.io/tidypyspark/_build/html/index.html) python package implements `mutate` style window computation API for pyspark.
## Installation
- dev: `remotes::install_github("talegari/tidier")`
- cran: `install.packages("tidier")`
## Acknowledgements
`tidier` package is deeply indebted to three amazing packages and people behind it.
1. [`dplyr`](https://cran.r-project.org/package=dplyr):
<!-- -->
```
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.0,
<https://CRAN.R-project.org/package=dplyr>.
```
2. [`slider`](https://cran.r-project.org/package=slider):
<!-- -->
```
Vaughan D (2021). _slider: Sliding Window Functions_. R package
version 0.2.2, <https://CRAN.R-project.org/package=slider>.
```
3. [`dbplyr`](https://cran.r-project.org/package=dbplyr):
<!-- -->
```
Wickham H, Girlich M, Ruiz E (2023). _dbplyr: A 'dplyr' Back End
for Databases_. R package version 2.3.2,
<https://CRAN.R-project.org/package=dbplyr>.
```