[R] Add option to record file path in open_dataset and open_csv_dataset #38036

Closed
orgadish opened this issue Oct 5, 2023 · 2 comments

Comments

@orgadish
Contributor

orgadish commented Oct 5, 2023

Describe the enhancement requested

I often have to work with data where there is information stored in the file path (e.g. in the directory containing this file, or in the file name).

When I use readr::read_csv, there is an id argument:

id The name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date. If NULL (the default) no extra column is created.
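For readers who have not used it, here is a minimal sketch of the readr behaviour described above. The directory layout, the file names, and the column names `path` and `collected` are made up for illustration; only `read_csv()` and its `id` argument come from readr itself.

```r
library(readr)

# Hypothetical layout: one CSV per collection date in a temporary directory.
dir <- tempfile()
dir.create(dir)
write_csv(data.frame(x = 1:2), file.path(dir, "2023-10-01.csv"))
write_csv(data.frame(x = 3:4), file.path(dir, "2023-10-02.csv"))

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# id = "path" adds a column named `path` holding each row's source file.
df <- read_csv(files, id = "path", show_col_types = FALSE)

# Recover the date encoded in the file name (assumes YYYY-MM-DD.csv names).
df$collected <- as.Date(tools::file_path_sans_ext(basename(df$path)))
```

This reads every file into memory, which is the part an Arrow dataset would avoid.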

As far as I can tell, the only way to recreate this with open_csv_dataset currently is to read each file, resave it with the file path added as a column, and then call open_csv_dataset. (I know the list of files is stored in the Dataset object, but I don't know whether the data associated with each file is stored.)

It would be great if there was an id equivalent for the open_dataset functions.

Component(s)

R

@thisisnic
Member

thisisnic commented Oct 5, 2023

There's a function add_filename() which might be able to help you, e.g.

library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

list.files(tf, recursive = TRUE)
#> [1] "am=0/part-0.csv" "am=1/part-0.csv"
open_dataset(tf, format = "csv") %>%
  mutate(filename = add_filename()) %>%
  arrange(mpg) %>% # just to change up the order
  select(mpg, filename) %>% # so it displays more nicely in the reprex
  collect()
#> # A tibble: 32 × 2
#>      mpg filename                                        
#>    <dbl> <chr>                                           
#>  1  10.4 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  2  10.4 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  3  13.3 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  4  14.3 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  5  14.7 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  6  15   /tmp/Rtmp5HPoh4/file3e584718df9b/am=1/part-0.csv
#>  7  15.2 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  8  15.2 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  9  15.5 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#> 10  15.8 /tmp/Rtmp5HPoh4/file3e584718df9b/am=1/part-0.csv
#> # ℹ 22 more rows

Created on 2023-10-05 with reprex v2.0.2
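If the goal is to pull information out of the path, one possible follow-up is to split the `filename` column after `collect()`. This is only a sketch built on the reprex above: the column names `file` and `partition_dir` are invented for illustration, and it assumes the same two-level `am=<value>/part-0.csv` layout that `write_dataset()` produced.

```r
library(arrow)
library(dplyr)

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

result <- open_dataset(tf, format = "csv") %>%
  mutate(filename = add_filename()) %>% # called directly on the dataset
  select(mpg, filename) %>%
  collect() %>%
  # After collect() this is an ordinary data frame, so base R path helpers apply.
  mutate(
    file = basename(filename),                   # e.g. "part-0.csv"
    partition_dir = basename(dirname(filename))  # e.g. "am=0"
  )
```

Doing the string handling after `collect()` keeps the example independent of which string functions the Arrow query engine supports.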

@thisisnic thisisnic changed the title Add option to record file path in open_dataset and open_csv_dataset [R] Add option to record file path in open_dataset and open_csv_dataset Oct 5, 2023
@orgadish
Contributor Author

orgadish commented Oct 6, 2023

That's great, I didn't even know this existed! add_filename() is exactly what I needed.

@orgadish orgadish closed this as completed Oct 6, 2023