---
output:
  github_document:
    html_preview: false
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# polars
<!-- badges: start -->
[![R-universe status badge](https://rpolars.r-universe.dev/badges/polars)](https://rpolars.r-universe.dev)
[![CRAN status](https://www.r-pkg.org/badges/version/polars)](https://CRAN.R-project.org/package=polars)
[![Dev R-CMD-check](https://github.com/pola-rs/r-polars/actions/workflows/check.yaml/badge.svg)](https://github.com/pola-rs/r-polars/actions/workflows/check.yaml)
[![Docs release](https://img.shields.io/badge/docs-release-blue.svg)](https://rpolars.github.io)
<!-- badges: end -->
The **polars** package for R gives users access to [a lightning
fast](https://duckdblabs.github.io/db-benchmark/) Data Frame library written in
Rust. [Polars](https://www.pola.rs/)' embarrassingly parallel execution, cache-efficient
algorithms and expressive API make it perfect for efficient data
wrangling, data pipelines, snappy APIs, and much more besides. Polars also supports
"streaming mode" for out-of-memory operations. This allows users to analyze
datasets many times larger than RAM.
Documentation can be found on the **r-polars**
[homepage](https://rpolars.github.io).
The primary developer of the upstream Polars project is
Ritchie Vink ([@ritchie46](https://github.com/ritchie46)).
This R port is maintained by
Søren Welling ([@sorhawell](https://github.com/sorhawell)) and
[contributors](https://github.com/pola-rs/r-polars/graphs/contributors).
Consider joining our [Discord](https://discord.com/invite/4UfP5cfBE7) (subchannel) for
additional help and discussion.
## Install
The package can be installed from R-universe or GitHub.
Some platforms can install pre-compiled binaries, and others will need to build from source.
````{comment}
### CRAN
CRAN provides pre-compiled binaries for Windows (x86_64) and macOS.
Binary packages on CRAN are compiled with stable Rust, with nightly features disabled.
```r
install.packages("polars")
```
````
### R-universe
[R-universe](https://rpolars.r-universe.dev/polars#install) provides
pre-compiled **polars** binaries for Windows (x86_64), macOS (x86_64) and Ubuntu 22.04 (x86_64)
with source builds for other platforms.
Binary packages on R-universe are compiled with stable Rust, with nightly features disabled.
```r
install.packages("polars", repos = "https://rpolars.r-universe.dev")
```
```r
# For Ubuntu binary installation
install.packages("polars", repos = "https://rpolars.r-universe.dev/bin/linux/jammy/4.3")
```
Special thanks to Jeroen Ooms ([@jeroen](https://github.com/jeroen)) for the
excellent R-universe support.
### GitHub releases
We also provide pre-compiled binaries for various operating systems on our
[GitHub releases](https://github.com/pola-rs/r-polars/releases) page. You can
download and install these files manually, or install directly from R. Simply
match the URL for your operating system and the desired release. For example, to
install the latest release of **polars**, one can use:
#### Linux (x86_64)
```r
install.packages(
  "https://github.com/pola-rs/r-polars/releases/latest/download/polars__x86_64-pc-linux-gnu.gz",
  repos = NULL
)
```
#### Windows
```r
install.packages(
  "https://github.com/pola-rs/r-polars/releases/latest/download/polars.zip",
  repos = NULL
)
```
#### macOS (x86_64)
```r
install.packages(
  "https://github.com/pola-rs/r-polars/releases/latest/download/polars__x86_64-apple-darwin20.tgz",
  repos = NULL
)
```
Remember to pass the `repos = NULL` argument if you are installing these
binary builds directly from within R.
Binary packages on GitHub releases are compiled with nightly Rust, with nightly features enabled.
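Matching the URL to the operating system can also be scripted. The following is a minimal sketch, not part of **polars**: `polars_release_url()` is a hypothetical helper, and it assumes an x86_64 machine because those are the only builds listed above.

```r
# Hypothetical helper (not part of polars): map the operating system name
# reported by Sys.info()[["sysname"]] to the release asset URLs listed above.
# Assumption: an x86_64 machine, since those are the only builds shown here.
polars_release_url = function(sysname = Sys.info()[["sysname"]]) {
  base = "https://github.com/pola-rs/r-polars/releases/latest/download/"
  asset = switch(
    sysname,
    Linux = "polars__x86_64-pc-linux-gnu.gz",
    Windows = "polars.zip",
    Darwin = "polars__x86_64-apple-darwin20.tgz",
    stop("no pre-compiled binary listed for: ", sysname)
  )
  paste0(base, asset)
}

# Usage (commented out so that sourcing this sketch does not install anything):
# install.packages(polars_release_url(), repos = NULL)
```

The helper stops with an error for any other system, in which case installing from R-universe or building from source is the fallback.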
### Build from source
For source installation,
the Rust toolchain (Rust `r RcppTOML::parseTOML("src/rust/Cargo.toml")$package$"rust-version"` or later) must be configured.
Please check the <https://github.com/r-rust/hellorust> repository for more about Rust code in R packages.
```{r, include = FALSE}
rust_toolchain_version = brio::read_file("Makefile") |>
  stringr::str_extract(r"((?<=RUST_TOOLCHAIN_VERSION).*nightly.*(?=\n))") |>
  stringr::str_extract(r"(nightly.*)")
```
During source installation, some environment variables can be set to enable Rust features and profile changes.
- `RPOLARS_FULL_FEATURES="true"` (Build with nightly features enabled, requires Rust toolchain `r rust_toolchain_version`)
- `RPOLARS_PROFILE="release-optimized"` (Build with more optimization, requires Rust 1.66 or later)
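One way to set these variables from within R before a source installation is sketched below. It assumes that the build scripts read the variables from the environment of the R session that runs the installation, and that the R-universe repository serves the source package. `type = "source"` is the standard `install.packages()` argument for requesting a source build.

```r
# Assumption: the build scripts read these variables from the environment of
# the R session that runs the installation.
Sys.setenv(RPOLARS_PROFILE = "release-optimized")

# Uncomment only if the nightly toolchain mentioned above is installed.
# Sys.setenv(RPOLARS_FULL_FEATURES = "true")

# type = "source" asks R to build from source rather than use a binary.
install.packages(
  "polars",
  repos = "https://rpolars.r-universe.dev",
  type = "source"
)
```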
## Quickstart example
The [Get Started](https://rpolars.github.io/articles/polars/) vignette (`vignette("polars")`) contains a series of detailed
examples, but here is a quick illustration.
**polars** is a very powerful package with many functions. To avoid conflicts
with other packages and base R function names, **polars**'s top level functions
are hosted in the `pl` namespace, and accessible via the `pl$` prefix. To
convert an R data frame to a Polars `DataFrame`, we call:
```{r}
library(polars)
dat = pl$DataFrame(mtcars)
dat
```
This `DataFrame` object can be manipulated using many of the usual R functions and accessors, e.g.:
```{r}
dat[1:4, c("mpg", "qsec", "hp")]
```
However, the true power of Polars is unlocked by using *methods*, which are
encapsulated in the `DataFrame` object itself. For example, we can chain the
`$groupby()` and the `$mean()` methods to compute group-wise means for each
column of the dataset:
```{r}
dat$groupby("cyl", maintain_order = TRUE)$mean()
```
Note that we use `maintain_order = TRUE` so that `polars` always keeps the groups
in the same order as they are in the original data.
[The **polars** vignette](https://rpolars.github.io/articles/polars/)
contains many more examples of how to use the package to:
* Read CSV, JSON, Parquet, and other file formats.
* Filter rows and select columns.
* Modify and create new columns.
* Group by and aggregate.
* Reshape data.
* Join and concatenate different datasets.
* Sort data.
* Work with dates and times.
* Handle missing values.
* Use the lazy execution engine for maximum performance and memory-efficient operations.
* Etc.
## Development and Contributions
Contributions are very welcome!
As of March 2023, **polars** has reached nearly 100% coverage of the
underlying "lazy" Expr syntax. While translation of the "eager" syntax is still
a little further behind, you should be able to do just about everything using
`$select()` + `$with_columns()`. Most of the methods associated with
`DataFrame` and `LazyFrame` classes have been implemented, but not all. There
is still much to do, and your help would be much appreciated!
If you spot missing functionality---implemented in Python but not
R---please let us know on GitHub.
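As a rough illustration of the `$select()` + `$with_columns()` pattern mentioned above, the sketch below derives a new column and then keeps a subset of columns. The column name `hp_per_wt` is made up for this example, and the printed output may differ between versions.

```r
library(polars)

df = pl$DataFrame(mtcars)

# $with_columns() adds (or overwrites) columns while keeping the existing ones.
# $select() returns only the listed expressions.
out = df$with_columns(
  (pl$col("hp") / pl$col("wt"))$alias("hp_per_wt")
)$select(
  pl$col("cyl"),
  pl$col("hp_per_wt")
)
out
```

Both methods take Polars expressions, so the same expression can be used in either place.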
### System dependencies
To install the development version of Polars or develop new features, you will
need to install the Rust toolchain:
* Install [`rustup`](https://rustup.rs/), the cross-platform Rust installer. Then:
  ```sh
  rustup toolchain install `r rust_toolchain_version`
  rustup default `r rust_toolchain_version`
  ```
* Windows: Make sure the latest version of [Rtools](https://cran.r-project.org/bin/windows/Rtools/) is installed and on your PATH.
* macOS: Make sure [`Xcode`](https://developer.apple.com/support/xcode/) is installed.
* Install [CMake](https://cmake.org/) and add it to your PATH.
### Implementing new features
Here are the steps required for an example contribution, where we are implementing the
[cosine expression](https://rpolars.github.io/reference/Expr_cos/):
* Look up the [polars.Expr.cos method in py-polars documentation](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.cos.html).
* Press the `[source]` button to see the [Python implementation](https://github.com/pola-rs/polars/blob/d23bbd2f14f1cd7ae2e27e1954a2dc4276501eef/py-polars/polars/expr/expr.py#L5892-L5914).
* Find the cos [py-polars Rust implementation](https://github.com/pola-rs/polars/blob/a1afbc4b78f5850314351f7e85ded95fd68b6453/py-polars/src/lazy/dsl.rs#L396) (likely just a simple call to the Rust-Polars API).
* Adapt the Rust part and place it [here](https://github.com/pola-rs/r-polars/blob/c56c49a6fc172685f50c15fffe3d14231297ad97/src/rust/src/rdataframe/rexpr.rs#L754).
* Adapt the Python frontend syntax to R and place it [here](https://github.com/pola-rs/r-polars/blob/c56c49a6fc172685f50c15fffe3d14231297ad97/R/expr__expr.R#L3138). Add the roxygen docs + examples above.
* Notice we use `Expr_cos = "use_extendr_wrapper"`, which means we're just using the unmodified [extendr auto-generated wrapper](https://github.com/pola-rs/r-polars/blob/c56c49a6fc172685f50c15fffe3d14231297ad97/R/extendr-wrappers.R#L253).
* Write a test [here](https://github.com/pola-rs/r-polars/blob/c56c49a6fc172685f50c15fffe3d14231297ad97/tests/testthat/test-expr.R#L1921).
* Run `renv::restore()` and resolve all R packages
* Run `rextendr::document()` to recompile and confirm the added method functions as intended, e.g. `pl$DataFrame(a=c(0,pi/2,pi,NA_real_))$select(pl$col("a")$cos())`
* Run `devtools::test()`. See below for how to set up your development environment correctly.
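The testing step might look roughly like the following **testthat** sketch. The test description and the choice to compare against base R's `cos()` are assumptions for illustration, not the test that actually lives in the repository.

```r
library(testthat)
library(polars)

# Illustrative only: check that the polars cosine expression agrees with
# base R's cos(), including propagation of a missing value.
test_that("Expr cos agrees with base R cos()", {
  x = c(0, pi / 2, pi, NA_real_)
  out = pl$DataFrame(a = x)$select(pl$col("a")$cos())$to_list()
  expect_equal(out$a, cos(x))
})
```

`expect_equal()` compares with a numeric tolerance, which suits floating point results such as `cos(pi / 2)`.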
Note that PRs to **polars** will automatically be built and tested on all
platforms as part of our GitHub Actions workflow. A more detailed description of
the development environment and workflow for local builds is provided below.
### Development workflow
Assuming the system dependencies have been met (above), the typical **polars**
development workflow is as follows:
**Step 1:** Fork the **polars** repo on GitHub and then clone it locally.
```sh
git clone git@github.com:<YOUR-GITHUB-ACCOUNT>/r-polars.git
cd r-polars
```
**Step 2:** Build the package and install the suggested package dependencies.
* Option A: Using **devtools**.
  ```sh
  Rscript -e 'devtools::install(pkg = ".", dependencies = TRUE)'
  ```
* Option B: Using **renv**.
  ```sh
  # Rscript -e 'install.packages("renv")'
  Rscript -e 'renv::activate(); renv::restore()'
  ```
**Step 3:** Make your proposed changes to the R and/or Rust code. Don't forget to run:
```r
rextendr::document() # compile Rust code + update wrappers & docs
devtools::test() # run all unit tests
```
**Step 4 (optional):** Build the package locally.
```sh
R CMD INSTALL --no-multiarch --with-keep.source .
```
**Step 5:** Commit your changes and submit a PR to the main **polars** repo.
* As an aside, note that `./renv.lock` sets all R packages during the server build.
*Tip:* To speed up the local `rextendr::document()` or `R CMD check`, run the following:
```r
source("inst/misc/develop_polars.R")
# rextendr::document() + not_cran + load packages + all_features
load_polars()
# check package + reuse previous compilation in check, protect against deletion
check_polars() # assumes Rust target at `paste0(getwd(), "/src/rust")`
```
* The `RPOLARS_RUST_SOURCE` environment variable allows **polars** to recover the Cargo cache even if source files have been moved. Replace with your own absolute path to your local clone!
* `filter_rcmdcheck.R` removes known warnings from the final check report.
* `unlink("check")` cleans up.
### Misc
If you experience unexpectedly sluggish performance when using polars in a given IDE, we'd like to hear about it. You can try activating `pl$set_polars_options(debug_polars = TRUE)` to profile which methods are being touched (not necessarily run) and how fast. Below is an example of good behavior.
```r
#run e.g. an eager query after setting debug_polars = TRUE
pl$DataFrame(iris)$select("Species")
[TIME? ms]
pl$DataFrame() -> [0.73ms]
.pr$DataFrame$new_with_capacity() -> [0.56ms]
.pr$DataFrame$set_column_from_robj() -> [11.04ms]
.pr$DataFrame$set_column_from_robj() -> [0.3309ms]
.pr$DataFrame$set_column_from_robj() -> [0.283ms]
.pr$DataFrame$set_column_from_robj() -> [0.2761ms]
.pr$DataFrame$set_column_from_robj() -> [12.54ms]
DataFrame$select() -> [0.3681ms]
ProtoExprArray$push_back_rexpr() -> [0.21ms]
pl$col() -> [0.1669ms]
.pr$Expr$col() -> [0.212ms]
.pr$DataFrame$select() -> [1.229ms]
DataFrame$print() -> [0.1781ms]
.pr$DataFrame$print() -> shape: (150, 1)
┌───────────┐
│ Species │
│ --- │
│ cat │
╞═══════════╡
│ setosa │
│ setosa │
│ setosa │
│ setosa │
│ … │
│ virginica │
│ virginica │
│ virginica │
│ virginica │
└───────────┘
```