[R] Enable `col_select` or similar in `open_csv_dataset` to read files with a shared subset of columns #38031

orgadish · 2023-10-05T08:30:52Z

Describe the enhancement requested

Per the documentation, `col_select` is currently not supported in `arrow::open_csv_dataset`, and the recommendation is to "instead, subset columns after dataset creation".

This approach doesn't work, however, when the files don't all share the same schema. Often I only care about a subset of columns that I know are present in every file, even if other, unrelated columns have been added to some of them. It would be great if there were a way to specify that columns outside the schema should be ignored, or to enable `col_select`.

Component(s)

R

thisisnic · 2023-10-05T10:18:17Z

Could you do something like this?

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

columns_i_care_about <- c("mpg", "hp")

open_dataset(tf, format = "csv") %>%
  select(!!columns_i_care_about) %>%
  collect()
#> # A tibble: 32 × 2
#>      mpg    hp
#>    <dbl> <int>
#>  1  21.4   110
#>  2  18.7   175
#>  3  18.1   105
#>  4  14.3   245
#>  5  24.4    62
#>  6  22.8    95
#>  7  19.2   123
#>  8  17.8   123
#>  9  16.4   180
#> 10  17.3   180
#> # ℹ 22 more rows

Created on 2023-10-05 with reprex v2.0.2

If not, please can you make me a small reprex showing the kind of thing you'd like to do?

orgadish · 2023-10-08T23:56:04Z

@thisisnic I don't know if this was updated in a recent Arrow version, but it looks like what I want works now!

Below is a reprex for it. read_csv(col_select = ...) actually does not work, so I'm glad open_dataset does!

Closing this issue.

suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(dplyr))

mtcars_part_1 <- mtcars |> 
  filter(am == 0) |> 
  select(mpg, cyl, disp)

mtcars_part_2 <- mtcars |> 
  filter(am == 1) |> 
  select(mpg, cyl, hp)

tf <- tempfile()
dir.create(tf)
tf1 <- tempfile(tmpdir=tf)
dir.create(tf1)
tf2 <- tempfile(tmpdir=tf)
dir.create(tf2)
write_csv_arrow(mtcars_part_1, file.path(tf1, "mtcars_subset.csv"))
write_csv_arrow(mtcars_part_2, file.path(tf2, "mtcars_subset.csv"))
csv_files <- list.files(tf, full.names = TRUE, recursive=TRUE)
basename(csv_files)
#> [1] "mtcars_subset.csv" "mtcars_subset.csv"

columns_i_care_about <- c("mpg", "cyl")

# This used to fail, but it seems to be working now...
open_csv_dataset(csv_files, unify_schemas = TRUE) |> 
  collect()
#> # A tibble: 32 × 4
#>      mpg   cyl  disp    hp
#>    <dbl> <int> <dbl> <int>
#>  1  21.4     6  258     NA
#>  2  18.7     8  360     NA
#>  3  18.1     6  225     NA
#>  4  14.3     8  360     NA
#>  5  24.4     4  147.    NA
#>  6  22.8     4  141.    NA
#>  7  19.2     6  168.    NA
#>  8  17.8     6  168.    NA
#>  9  16.4     8  276.    NA
#> 10  17.3     8  276.    NA
#> # ℹ 22 more rows
open_csv_dataset(csv_files, unify_schemas = TRUE) |> 
  select(!!columns_i_care_about) |> 
  collect()
#> # A tibble: 32 × 2
#>      mpg   cyl
#>    <dbl> <int>
#>  1  21       6
#>  2  21       6
#>  3  22.8     4
#>  4  32.4     4
#>  5  30.4     4
#>  6  33.9     4
#>  7  27.3     4
#>  8  26       4
#>  9  30.4     4
#> 10  15.8     8
#> # ℹ 22 more rows

# read_csv(col_select = ) actually doesn't work...
readr::read_csv(csv_files)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp
readr::read_csv(csv_files, col_select = !!columns_i_care_about)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp

Created on 2023-10-08 with reprex v2.0.2

thisisnic · 2023-10-09T08:46:40Z

I'm glad it's working for you now, and thanks for supplying the reprex anyway, as it's always useful for us to know the kinds of things that users want to be able to do with arrow!

orgadish added the Type: enhancement label Oct 5, 2023

github-actions bot added the Component: R label Oct 5, 2023

thisisnic changed the title from "Enable `col_select` or similar in `open_csv_dataset` to read files with a shared subset of columns" to "[R] Enable `col_select` or similar in `open_csv_dataset` to read files with a shared subset of columns" Oct 5, 2023

orgadish closed this as completed Oct 8, 2023
