[R] Add option to record file path in `open_dataset` and `open_csv_dataset` #38036

orgadish · 2023-10-05T10:02:48Z

Describe the enhancement requested

I often have to work with data where information is stored in the file path (e.g. in the name of the directory containing the file, or in the file name itself).

When I use `readr::read_csv`, there is an `id` argument:

`id`: The name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date. If NULL (the default) no extra column is created.
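
For reference, a minimal sketch of the readr behaviour described above. The file names, the `source_file` column name, and the sample values are made up for illustration, and it assumes readr 2.0.0 or later, where `read_csv()` accepts a vector of paths.

```r
library(readr)

# Two small CSV files in a temporary directory (hypothetical names and data).
dir_path <- tempfile()
dir.create(dir_path)
files <- file.path(dir_path, c("2023-10-01.csv", "2023-10-02.csv"))
write_csv(data.frame(x = 1:2), files[1])
write_csv(data.frame(x = 3:5), files[2])

# `id` names the column that stores each row's source file path.
combined <- read_csv(files, id = "source_file", show_col_types = FALSE)

# The date encoded in the file name can then be recovered from that column.
combined$collected_on <- as.Date(
  tools::file_path_sans_ext(basename(combined$source_file))
)
```

Every row of `combined` carries the path of the file it came from, which is the behaviour being requested for the dataset functions.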

As far as I can tell, the only way to recreate this with open_csv_dataset currently is to read each file, re-save it with the file path added as a column, and then use open_csv_dataset. (I know the list of files is stored in the Dataset object, but I don't know if the data associated with each file is stored.)

It would be great if there was an id equivalent for the open_dataset functions.

Component(s)

R

thisisnic · 2023-10-05T10:11:47Z

There's a function add_filename() which might be able to help you, e.g.

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

list.files(tf, recursive = TRUE)
#> [1] "am=0/part-0.csv" "am=1/part-0.csv"
open_dataset(tf, format = "csv") %>%
  mutate(filename = add_filename()) %>%
  arrange(mpg) %>% # just to change up the order
  select(mpg, filename) %>% # so it displays more nicely in the reprex
  collect()
#> # A tibble: 32 × 2
#>      mpg filename                                        
#>    <dbl> <chr>                                           
#>  1  10.4 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  2  10.4 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  3  13.3 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  4  14.3 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  5  14.7 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  6  15   /tmp/Rtmp5HPoh4/file3e584718df9b/am=1/part-0.csv
#>  7  15.2 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  8  15.2 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  9  15.5 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#> 10  15.8 /tmp/Rtmp5HPoh4/file3e584718df9b/am=1/part-0.csv
#> # ℹ 22 more rows

Created on 2023-10-05 with reprex v2.0.2
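
If the information of interest lives in a directory or file name rather than in the full path, the `filename` column can be split up after `collect()`. The following is a minimal sketch that reuses the example above; the `folder` and `file` column names are made up for illustration, and the path handling is plain base R rather than anything Arrow-specific.

```r
library(arrow)
library(dplyr)

# Same partitioned CSV dataset as in the example above.
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

with_parts <- open_dataset(tf, format = "csv") %>%
  mutate(filename = add_filename()) %>%
  select(mpg, filename) %>%
  collect() %>%
  # After collect() the data is an ordinary tibble, so base R path helpers apply.
  mutate(
    folder = basename(dirname(filename)), # e.g. the partition directory name
    file = basename(filename)             # e.g. the CSV file name
  )
```

Doing the split after `collect()` keeps the Arrow query itself limited to `add_filename()`, at the cost of pulling the rows into memory first.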

orgadish · 2023-10-06T16:34:55Z

That's great, I didn't even know this existed! add_filename() is exactly what I needed.

orgadish added the Type: enhancement label Oct 5, 2023

github-actions bot added the Component: R label Oct 5, 2023

thisisnic changed the title ~~Add option to record file path in open_dataset and open_csv_dataset~~ [R] Add option to record file path in open_dataset and open_csv_dataset Oct 5, 2023

orgadish closed this as completed Oct 6, 2023
