[R] Enable `col_select` or similar in `open_csv_dataset` to read files with a shared subset of columns #38031
Could you do something like this?

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")
columns_i_care_about <- c("mpg", "hp")
open_dataset(tf, format = "csv") %>%
  select(!!columns_i_care_about) %>%
  collect()
#> # A tibble: 32 × 2
#>      mpg    hp
#>    <dbl> <int>
#>  1  21.4   110
#>  2  18.7   175
#>  3  18.1   105
#>  4  14.3   245
#>  5  24.4    62
#>  6  22.8    95
#>  7  19.2   123
#>  8  17.8   123
#>  9  16.4   180
#> 10  17.3   180
#> # ℹ 22 more rows

Created on 2023-10-05 with reprex v2.0.2

If not, please can you make me a small reprex showing the kind of thing you'd like to do?
@thisisnic I don't know if this was updated in a recent Arrow version, but it looks like what I want works now! Below is a reprex for it. Closing this issue.

suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(dplyr))
mtcars_part_1 <- mtcars |>
  filter(am == 0) |>
  select(mpg, cyl, disp)
mtcars_part_2 <- mtcars |>
  filter(am == 1) |>
  select(mpg, cyl, hp)
tf <- tempfile()
dir.create(tf)
tf1 <- tempfile(tmpdir=tf)
dir.create(tf1)
tf2 <- tempfile(tmpdir=tf)
dir.create(tf2)
write_csv_arrow(mtcars_part_1, file.path(tf1, "mtcars_subset.csv"))
write_csv_arrow(mtcars_part_2, file.path(tf2, "mtcars_subset.csv"))
csv_files <- list.files(tf, full.names = TRUE, recursive=TRUE)
basename(csv_files)
#> [1] "mtcars_subset.csv" "mtcars_subset.csv"
columns_i_care_about <- c("mpg", "cyl")
# This used to fail, but it seems to be working now...
open_csv_dataset(csv_files, unify_schemas = TRUE) |>
  collect()
#> # A tibble: 32 × 4
#>      mpg   cyl  disp    hp
#>    <dbl> <int> <dbl> <int>
#>  1  21.4     6  258     NA
#>  2  18.7     8  360     NA
#>  3  18.1     6  225     NA
#>  4  14.3     8  360     NA
#>  5  24.4     4  147.    NA
#>  6  22.8     4  141.    NA
#>  7  19.2     6  168.    NA
#>  8  17.8     6  168.    NA
#>  9  16.4     8  276.    NA
#> 10  17.3     8  276.    NA
#> # ℹ 22 more rows
open_csv_dataset(csv_files, unify_schemas = TRUE) |>
  select(!!columns_i_care_about) |>
  collect()
#> # A tibble: 32 × 2
#>      mpg   cyl
#>    <dbl> <int>
#>  1  21       6
#>  2  21       6
#>  3  22.8     4
#>  4  32.4     4
#>  5  30.4     4
#>  6  33.9     4
#>  7  27.3     4
#>  8  26       4
#>  9  30.4     4
#> 10  15.8     8
#> # ℹ 22 more rows
# read_csv(col_select = ) actually doesn't work...
readr::read_csv(csv_files)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp
readr::read_csv(csv_files, col_select = !!columns_i_care_about)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp

Created on 2023-10-08 with reprex v2.0.2
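For anyone who needs to stay with readr, one possible workaround for the error above is to read each file on its own with `col_select` and stack the results afterwards. This is only a minimal sketch and was not suggested in the thread: `read_shared_columns()` is a hypothetical helper name, and it assumes every file contains all of the requested columns.

```r
library(readr)
library(dplyr)

# Hypothetical helper (not part of readr): read only `cols` from each CSV,
# then row-bind the per-file results.
read_shared_columns <- function(files, cols) {
  files |>
    lapply(function(f) {
      # Reading one file at a time sidesteps the multi-file
      # "consistent column names" check. all_of() raises an error
      # if a file is missing one of the requested columns.
      read_csv(f, col_select = all_of(cols), show_col_types = FALSE)
    }) |>
    bind_rows()
}

# Self-contained demo: two files that share only mpg and cyl
demo_dir <- tempfile()
dir.create(demo_dir)
mtcars |>
  filter(am == 0) |>
  select(mpg, cyl, disp) |>
  write_csv(file.path(demo_dir, "part_1.csv"))
mtcars |>
  filter(am == 1) |>
  select(mpg, cyl, hp) |>
  write_csv(file.path(demo_dir, "part_2.csv"))

shared <- read_shared_columns(
  list.files(demo_dir, full.names = TRUE),
  c("mpg", "cyl")
)
dim(shared)
```

Unlike the arrow dataset route, this reads every file eagerly into memory, so it is probably only suitable for data that fits in RAM.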
I'm glad it's working for you now, and thanks for supplying the reprex anyway, as it's always useful for us to know the kinds of things that users want to be able to do with arrow!
Describe the enhancement requested
Per the documentation, `col_select` is currently not supported in `arrow::open_csv_dataset`, and it is recommended to "instead, subset columns after dataset creation".

This approach doesn't work, however, when the files don't all share the same full schema. Often, though, I only care about a subset of columns that I know are shared by all the files, even if other columns have been added to some of them. It would be great if there were a way to specify that columns outside the schema should be ignored, or to enable `col_select`.

Component(s)
R
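
When the shared columns are not known in advance, they could be found by intersecting the column names of each file before selecting. The following is a rough sketch under assumptions, not an arrow feature: `shared_columns()` is a hypothetical helper, and it opens every file once just to inspect its schema, which may be slow for a large number of files.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow): names of the columns
# that appear in every one of the given CSV files.
shared_columns <- function(files) {
  per_file <- lapply(files, function(f) open_csv_dataset(f)$schema$names)
  Reduce(intersect, per_file)
}

# Self-contained demo: two files that share only mpg and cyl
demo_dir <- tempfile()
dir.create(demo_dir)
mtcars |>
  filter(am == 0) |>
  select(mpg, cyl, disp) |>
  write_csv_arrow(file.path(demo_dir, "part_1.csv"))
mtcars |>
  filter(am == 1) |>
  select(mpg, cyl, hp) |>
  write_csv_arrow(file.path(demo_dir, "part_2.csv"))
csv_files <- list.files(demo_dir, full.names = TRUE)

cols <- shared_columns(csv_files)

# Subset after dataset creation, as the documentation recommends
result <- open_csv_dataset(csv_files, unify_schemas = TRUE) |>
  select(all_of(cols)) |>
  collect()
```

Using `all_of()` instead of unquoting the vector makes the selection fail loudly if a column turns out to be missing from the unified schema.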