[R] Enable col_select or similar in open_csv_dataset to read files with a shared subset of columns #38031

Closed
orgadish opened this issue Oct 5, 2023 · 3 comments

Comments

@orgadish
Contributor

orgadish commented Oct 5, 2023

Describe the enhancement requested

Per the documentation, col_select is currently not supported in arrow::open_csv_dataset and it is recommended to "instead, subset columns after dataset creation".

This approach doesn't work, however, when the files don't all share the same schema. Often I only care about a subset of columns that I know every file contains, even if other columns have been added to some of the files. It would be great if there were a way to specify that columns outside the schema should be ignored, or to enable col_select.

Component(s)

R

@thisisnic thisisnic changed the title Enable col_select or similar in open_csv_dataset to read files with a shared subset of columns [R] Enable col_select or similar in open_csv_dataset to read files with a shared subset of columns Oct 5, 2023
@thisisnic
Member

Could you do something like this?

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

columns_i_care_about <- c("mpg", "hp")

open_dataset(tf, format = "csv") %>%
  select(!!columns_i_care_about) %>%
  collect()
#> # A tibble: 32 × 2
#>      mpg    hp
#>    <dbl> <int>
#>  1  21.4   110
#>  2  18.7   175
#>  3  18.1   105
#>  4  14.3   245
#>  5  24.4    62
#>  6  22.8    95
#>  7  19.2   123
#>  8  17.8   123
#>  9  16.4   180
#> 10  17.3   180
#> # ℹ 22 more rows

Created on 2023-10-05 with reprex v2.0.2

If not, please can you make me a small reprex showing the kind of thing you'd like to do?

@orgadish
Contributor Author

orgadish commented Oct 8, 2023

@thisisnic I don't know if this was updated in a recent Arrow version, but it looks like what I want works now!

Below is a reprex for it. readr::read_csv(col_select = ...) does not actually work for files like these, so I'm glad open_csv_dataset does!

Closing this issue.

suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(dplyr))

mtcars_part_1 <- mtcars |> 
  filter(am == 0) |> 
  select(mpg, cyl, disp)

mtcars_part_2 <- mtcars |> 
  filter(am == 1) |> 
  select(mpg, cyl, hp)

tf <- tempfile()
dir.create(tf)
tf1 <- tempfile(tmpdir=tf)
dir.create(tf1)
tf2 <- tempfile(tmpdir=tf)
dir.create(tf2)
write_csv_arrow(mtcars_part_1, file.path(tf1, "mtcars_subset.csv"))
write_csv_arrow(mtcars_part_2, file.path(tf2, "mtcars_subset.csv"))
csv_files <- list.files(tf, full.names = TRUE, recursive=TRUE)
basename(csv_files)
#> [1] "mtcars_subset.csv" "mtcars_subset.csv"

columns_i_care_about <- c("mpg", "cyl")

# This used to fail, but it seems to be working now...
open_csv_dataset(csv_files, unify_schemas = TRUE) |> 
  collect()
#> # A tibble: 32 × 4
#>      mpg   cyl  disp    hp
#>    <dbl> <int> <dbl> <int>
#>  1  21.4     6  258     NA
#>  2  18.7     8  360     NA
#>  3  18.1     6  225     NA
#>  4  14.3     8  360     NA
#>  5  24.4     4  147.    NA
#>  6  22.8     4  141.    NA
#>  7  19.2     6  168.    NA
#>  8  17.8     6  168.    NA
#>  9  16.4     8  276.    NA
#> 10  17.3     8  276.    NA
#> # ℹ 22 more rows
open_csv_dataset(csv_files, unify_schemas = TRUE) |> 
  select(!!columns_i_care_about) |> 
  collect()
#> # A tibble: 32 × 2
#>      mpg   cyl
#>    <dbl> <int>
#>  1  21       6
#>  2  21       6
#>  3  22.8     4
#>  4  32.4     4
#>  5  30.4     4
#>  6  33.9     4
#>  7  27.3     4
#>  8  26       4
#>  9  30.4     4
#> 10  15.8     8
#> # ℹ 22 more rows

# read_csv(col_select = ) actually doesn't work...
readr::read_csv(csv_files)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp
readr::read_csv(csv_files, col_select = !!columns_i_care_about)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp

Created on 2023-10-08 with reprex v2.0.2

@orgadish orgadish closed this as completed Oct 8, 2023
@thisisnic
Member

I'm glad it's working for you now, and thanks for supplying the reprex anyway, as it's always useful for us to know the kinds of things that users want to be able to do with arrow!
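
For anyone landing here with the same question, here is a minimal sketch of the pattern from the reprex above, extended to work out the shared columns instead of hard-coding them. The file names and the `shared_columns` variable are made up for illustration. It assumes an arrow version in which `open_csv_dataset(..., unify_schemas = TRUE)` accepts files with differing columns, which, as noted above, may not hold for older releases.

```r
library(arrow)
library(dplyr)

# Two hypothetical CSV files that share only `mpg` and `cyl`.
dir <- tempfile()
dir.create(dir)
write.csv(mtcars[mtcars$am == 0, c("mpg", "cyl", "disp")],
          file.path(dir, "part_1.csv"), row.names = FALSE)
write.csv(mtcars[mtcars$am == 1, c("mpg", "cyl", "hp")],
          file.path(dir, "part_2.csv"), row.names = FALSE)
csv_files <- list.files(dir, full.names = TRUE)

# Read each file's schema on its own and keep the column names
# that appear in every file.
shared_columns <- Reduce(
  intersect,
  lapply(csv_files, function(f) open_csv_dataset(f)$schema$names)
)

# Unify the schemas across files, then keep only the shared columns.
shared <- open_csv_dataset(csv_files, unify_schemas = TRUE) |>
  select(all_of(shared_columns)) |>
  collect()
```

Using `all_of()` rather than `!!` makes the selection fail loudly if one of the requested columns is missing from the unified schema, which seems like the safer default when the files are not fully under your control.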
