Add type = "parquet" #729

Merged: 6 commits, Mar 6, 2023
2 changes: 2 additions & 0 deletions NEWS.md
@@ -4,6 +4,8 @@

* `board_s3()` now uses pagination for listing and versioning (#719, @mzorko).

* Added `type = "parquet"` to read and write Parquet files (#729).

# pins 1.1.0

## Breaking changes
21 changes: 17 additions & 4 deletions R/pin-read-write.R
@@ -56,9 +56,9 @@ pin_read <- function(board, name, version = NULL, hash = NULL, ...) {
#' When retrieving the pin, this will be stored in the `user` key, to
#' avoid potential clashes with the metadata that pins itself uses.
#' @param type File type used to save `x` to disk. Must be one of
#' "csv", "json", "rds", "arrow", or "qs". If not supplied, will use JSON for
#' bare lists and RDS for everything else. Be aware that CSV and JSON are
#' plain text formats, while RDS, Arrow, and
#' "csv", "json", "rds", "parquet", "arrow", or "qs". If not supplied, will
#' use JSON for bare lists and RDS for everything else. Be aware that CSV and
#' JSON are plain text formats, while RDS, Parquet, Arrow, and
#' [qs](https://CRAN.R-project.org/package=qs) are binary formats.
#' @param versioned Should the pin be versioned? The default, `NULL`, will
#' use the default for `board`
@@ -133,6 +133,7 @@ object_write <- function(x, path, type = "rds") {
switch(type,
rds = write_rds(x, path),
json = jsonlite::write_json(x, path, auto_unbox = TRUE),
parquet = write_parquet(x, path),
arrow = write_arrow(x, path),
pickle = abort("'pickle' pins not supported in R"),
joblib = abort("'joblib' pins not supported in R"),
@@ -168,13 +169,19 @@ write_qs <- function(x, path) {
invisible(path)
}

write_parquet <- function(x, path) {
check_installed("arrow")
arrow::write_parquet(x, path)
invisible(path)
}

write_arrow <- function(x, path) {
check_installed("arrow")
arrow::write_feather(x, path)
invisible(path)
}

object_types <- c("rds", "json", "arrow", "pickle", "csv", "qs", "file")
object_types <- c("rds", "json", "parquet", "arrow", "pickle", "csv", "qs", "file")

object_read <- function(meta) {
path <- fs::path(meta$local$dir, meta$file)
@@ -189,6 +196,7 @@ object_read <- function(meta) {
switch(type,
rds = readRDS(path),
json = jsonlite::read_json(path, simplifyVector = TRUE),
parquet = read_parquet(path),
arrow = read_arrow(path),
pickle = abort("'pickle' pins not supported in R"),
joblib = abort("'joblib' pins not supported in R"),
@@ -217,6 +225,11 @@ read_qs <- function(path) {
qs::qread(path, strict = TRUE)
}

read_parquet <- function(path) {
check_installed("arrow")
arrow::read_parquet(path)
}

read_arrow <- function(path) {
check_installed("arrow")
arrow::read_feather(path)
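The new `write_parquet()` and `read_parquet()` helpers above are thin wrappers around the arrow package, alongside the existing Feather-based `write_arrow()` and `read_arrow()`. As a minimal sketch of what those underlying calls do (assuming the arrow package is installed; the data frame and temporary file paths are made up for illustration):

```r
# Illustrative data; not taken from the PR
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Hypothetical temporary paths for the two formats
parquet_path <- tempfile(fileext = ".parquet")
feather_path <- tempfile(fileext = ".arrow")

# type = "parquet" goes through arrow::write_parquet(),
# type = "arrow" goes through arrow::write_feather()
arrow::write_parquet(df, parquet_path)
arrow::write_feather(df, feather_path)

# Read both files back and normalise to plain data frames
from_parquet <- as.data.frame(arrow::read_parquet(parquet_path))
from_feather <- as.data.frame(arrow::read_feather(feather_path))

# Both formats should hand back the same columns
identical(from_parquet$x, from_feather$x)
identical(from_parquet$y, from_feather$y)
```

Both are binary formats, so the files are not meant to be inspected as text.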
2 changes: 1 addition & 1 deletion README.Rmd
@@ -70,7 +70,7 @@ It takes three arguments: the board to pin to, an object, and a name:
board %>% pin_write(head(mtcars), "mtcars")
```

As you can see, the data is saved as an `.rds` file by default, but depending on what you're saving and who else you want to read it, you might use the `type` argument to instead save it as a `csv`, `json`, or `arrow` file.
As you can see, the data is saved as an `.rds` file by default, but depending on what you're saving and who else you want to read it, you might use the `type` argument to instead save it as a Parquet, Arrow, CSV, or JSON file.

You can later retrieve the pinned data with `pin_read()`:

7 changes: 4 additions & 3 deletions README.md
@@ -61,7 +61,7 @@ board <- board_temp()
board
#> Pin board <pins_board_folder>
#> Path:
#> '/var/folders/hv/hzsmmyk9393_m7q3nscx1slc0000gn/T/RtmpTxyyP1/pins-114c073a9ddd2'
#> '/var/folders/hv/hzsmmyk9393_m7q3nscx1slc0000gn/T/RtmpwGre3p/pins-15a8b4f3f602c'
#> Cache size: 0
```

@@ -71,13 +71,14 @@ arguments: the board to pin to, an object, and a name:
``` r
board %>% pin_write(head(mtcars), "mtcars")
#> Guessing `type = 'rds'`
#> Creating new version '20230223T220424Z-a800d'
#> Creating new version '20230303T233508Z-a800d'
#> Writing to pin 'mtcars'
```

As you can see, the data is saved as an `.rds` file by default, but depending on
what you’re saving and who else you want to read it, you might use the
`type` argument to instead save it as a `csv`, `json`, or `arrow` file.
`type` argument to instead save it as a Parquet, Arrow, CSV, or JSON
file.

You can later retrieve the pinned data with `pin_read()`:

6 changes: 3 additions & 3 deletions man/pin_read.Rd

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion tests/testthat/_snaps/pin-read-write.md
@@ -23,7 +23,7 @@
pin_write(board, mtcars, name = "mtcars", type = "froopy-loops")
Condition
Error in `object_write()`:
! `type` must be one of "rds", "json", "arrow", "pickle", "csv", or "qs", not "froopy-loops".
! `type` must be one of "rds", "json", "parquet", "arrow", "pickle", "csv", or "qs", not "froopy-loops".
Code
pin_write(board, mtcars, name = "mtcars", metadata = 1)
Condition
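The snapshot above records the error raised for an unsupported `type`. A minimal sketch of triggering and capturing that error, assuming the pins package is installed (the `result` variable is just an illustrative name):

```r
library(pins)

# Temporary board, as used in the README
board <- board_temp()

# An unknown `type` makes pin_write() fail; capture the message
# instead of stopping the script
result <- tryCatch(
  pin_write(board, mtcars, name = "mtcars", type = "froopy-loops"),
  error = function(cnd) conditionMessage(cnd)
)

# Print the captured error message
result
```

The message should list the accepted types, which after this change include "parquet".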
9 changes: 6 additions & 3 deletions tests/testthat/test-pin-read-write.R
@@ -8,13 +8,16 @@ test_that("can round trip all types", {
pin_write(board, df, "df-1", type = "rds")
expect_equal(pin_read(board, "df-1"), df)

pin_write(board, df, "df-2", type = "arrow")
pin_write(board, df, "df-2", type = "parquet")
expect_equal(pin_read(board, "df-2"), df)

pin_write(board, df, "df-3", type = "csv")
pin_write(board, df, "df-3", type = "arrow")
expect_equal(pin_read(board, "df-3"), df)

pin_write(board, df, "df-4", type = "csv")
expect_equal(pin_read(board, "df-4"), df)

pin_write(board, df, "df-4", type = "qs")
pin_write(board, df, "df-5", type = "qs")
expect_equal(pin_read(board, "df-5"), df)

# List
9 changes: 5 additions & 4 deletions vignettes/pins.Rmd
@@ -61,10 +61,11 @@ The only rule for a pin name is that it can't contain slashes.
As you can see from the output, pins has chosen to save this data to an `.rds` file.
But you can choose another option depending on your goals:

- `type = "rds"` uses `saveRDS()` to create a binary R data file. It can save any R object but it's only readable from R, not other languages.
- `type = "csv"` uses `write.csv()` to create a `.csv` file. CSVs can be read by any application, but only support simple columns (e.g. numbers, strings, dates), can take up a lot of disk space, and can be slow to read.
- `type = "arrow"` uses `arrow::write_feather()` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.
- `type = "json"` uses `jsonlite::write_json()` to create a `.json` file. Pretty much every programming language can read json files, but they only work well for nested lists.
- `type = "rds"` uses `saveRDS()` to create a binary R data file. It can save any R object (including trained models), but it's only readable from R, not other languages.
- `type = "csv"` uses `write.csv()` to create a CSV file. CSVs are plain text and can be read easily by many applications, but they only support simple columns (e.g. numbers, strings), can take up a lot of disk space, and can be slow to read.
- `type = "parquet"` uses `arrow::write_parquet()` to create a Parquet file. [Parquet](https://parquet.apache.org/) is a modern, language-independent, column-oriented file format for efficient data storage and retrieval. Parquet is an excellent choice for storing tabular data but requires the [arrow](https://arrow.apache.org/docs/r/) package.
- `type = "arrow"` uses `arrow::write_feather()` to create an Arrow/Feather file.
- `type = "json"` uses `jsonlite::write_json()` to create a JSON file. Pretty much every programming language can read JSON files, but they only work well for nested lists.
- `type = "qs"` uses `qs::qsave()` to create a binary R data file, like `saveRDS()`. This format achieves faster read/write speeds than RDS, and compresses data more efficiently, making it a good choice for larger objects. Read more on the [qs package](https://github.com/traversc/qs).

After you've pinned an object, you can read it back with `pin_read()`:
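To illustrate the `type = "parquet"` option described in the list above, here is a minimal sketch of a round trip through a temporary board. It assumes the pins and arrow packages are installed; the data frame and the pin name `"df-parquet"` are made up for illustration.

```r
library(pins)

# Temporary board, as in the README
board <- board_temp()

# Illustrative tabular data; not taken from the PR
df <- data.frame(x = 1:10, y = letters[1:10], stringsAsFactors = FALSE)

# Write as Parquet instead of the default RDS (needs the arrow package)
pin_write(board, df, "df-parquet", type = "parquet")

# Read the pin back; pins picks the reader from the stored metadata
out <- pin_read(board, "df-parquet")

# The pin's metadata records which file type was used
meta <- pin_meta(board, "df-parquet")
meta$type
```

Because Parquet is language-independent, a pin written this way is a reasonable choice when the same tabular data needs to be read outside R.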