Use nanoparquet package, to read/write parquet files #843

Merged: 2 commits, Oct 3, 2024
1 change: 1 addition & 0 deletions DESCRIPTION
@@ -54,6 +54,7 @@ Suggests:
     Microsoft365R,
     mime,
     mockery,
+    nanoparquet,
     openssl,
     paws.storage,
     qs,
2 changes: 2 additions & 0 deletions NEWS.md
@@ -16,6 +16,8 @@

 * Fixed how previously deleted pin versions are detected (#838, @MichalLauer)

+* Switched writing with `type = "parquet"` to use the nanoparquet package (#843).
+
 # pins 1.3.0

 ## Breaking changes
8 changes: 4 additions & 4 deletions R/pin-read-write.R
@@ -194,8 +194,8 @@ write_qs <- function(x, path) {
 }

 write_parquet <- function(x, path) {
-  check_installed("arrow")
-  arrow::write_parquet(x, path)
+  check_installed("nanoparquet")
+  nanoparquet::write_parquet(x, path)
   invisible(path)
 }

@@ -251,8 +251,8 @@ read_qs <- function(path) {
 }

 read_parquet <- function(path) {
-  check_installed("arrow")
-  arrow::read_parquet(path)
+  check_installed("nanoparquet")
+  nanoparquet::read_parquet(path)
 }

 read_arrow <- function(path) {
9 changes: 8 additions & 1 deletion tests/testthat/test-pin-read-write.R
@@ -1,6 +1,7 @@
 test_that("can round trip all types", {
   skip_if_not_installed("qs")
   skip_if_not_installed("arrow")
+  skip_if_not_installed("nanoparquet")
   board <- board_temp()

   # Data frames
@@ -9,7 +10,13 @@ test_that("can round trip all types", {
   expect_equal(pin_read(board, "df-1"), df)

   pin_write(board, df, "df-2", type = "parquet")
-  expect_equal(pin_read(board, "df-2"), df)
+  expect_equal(
+    withr::with_options(
+      list(nanoparquet.class = c("tbl_df", "tbl")),
+      pin_read(board, "df-2")
+    ),
+    df
+  )

   pin_write(board, df, "df-3", type = "arrow")
   expect_equal(pin_read(board, "df-3"), df)
2 changes: 1 addition & 1 deletion vignettes/pins.Rmd
@@ -73,7 +73,7 @@ But you can choose another option depending on your goals:

 - `type = "rds"` uses `writeRDS()` to create a binary R data file. It can save any R object (including trained models) but it's only readable from R, not other languages.
 - `type = "csv"` uses `write.csv()` to create a CSV file. CSVs are plain text and can be read easily by many applications, but they only support simple columns (e.g. numbers, strings), can take up a lot of disk space, and can be slow to read.
-- `type = "parquet"` uses `arrow::write_parquet()` to create a Parquet file. [Parquet](https://parquet.apache.org/) is a modern, language-independent, column-oriented file format for efficient data storage and retrieval. Parquet is an excellent choice for storing tabular data but requires the [arrow](https://arrow.apache.org/docs/r/) package.
+- `type = "parquet"` uses `nanoparquet::write_parquet()` to create a Parquet file. [Parquet](https://parquet.apache.org/) is a modern, language-independent, column-oriented file format for efficient data storage and retrieval. Parquet is an excellent choice for storing tabular data but requires the [nanoparquet](https://nanoparquet.r-lib.org/) package.
 - `type = "arrow"` uses `arrow::write_feather()` to create an Arrow/Feather file.
 - `type = "json"` uses `jsonlite::write_json()` to create a JSON file. Pretty much every programming language can read json files, but they only work well for nested lists.
 - `type = "qs"` uses `qs::qsave()` to create a binary R data file, like `writeRDS()`. This format achieves faster read/write speeds than RDS, and compresses data more efficiently, making it a good choice for larger objects. Read more on the [qs package](https://github.com/traversc/qs).
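
For readers skimming the diff, here is a minimal sketch of the round trip this change affects. It assumes pins (with this change applied) and nanoparquet are both installed. The pin name `df-parquet` and the sample data frame are made up for illustration and do not appear in the PR.

```r
library(pins)

# Temporary board, as used in the package's own tests
board <- board_temp()

# Illustrative data (hypothetical, not from the PR)
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# With this change, type = "parquet" goes through nanoparquet::write_parquet()
pin_write(board, df, "df-parquet", type = "parquet")

# ... and reading goes through nanoparquet::read_parquet()
out <- pin_read(board, "df-parquet")

# Compare column values rather than the whole object, since the class of
# the returned data frame depends on the nanoparquet.class option
stopifnot(identical(out$x, df$x), identical(out$y, df$y))
```

The updated test in this PR wraps `pin_read()` in `withr::with_options(list(nanoparquet.class = c("tbl_df", "tbl")), ...)` so that the result carries tibble classes before it is compared with `expect_equal()`. The sketch above avoids that by checking columns only.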