I often have to work with data where there is information stored in the file path (e.g. in the directory containing this file, or in the file name).
When I use readr::read_csv, there is an id argument:
id
The name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date. If NULL (the default) no extra column is created.
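For readers unfamiliar with that argument, here is a minimal sketch of how `id` behaves when reading several files at once. The directory layout, the file names, and the `path` and `date` column names are made up for illustration; they are not from the original report.

```r
library(readr)

# Hypothetical layout: one CSV per collection date, with the date only in the file name
dir <- tempfile()
dir.create(dir)
write_csv(data.frame(x = 1:2), file.path(dir, "2023-10-01.csv"))
write_csv(data.frame(x = 3:4), file.path(dir, "2023-10-02.csv"))

files <- list.files(dir, full.names = TRUE)

# `id = "path"` adds a column holding the file each row was read from
combined <- read_csv(files, id = "path", show_col_types = FALSE)

# Recover the collection date from the file name
combined$date <- as.Date(tools::file_path_sans_ext(basename(combined$path)))
```

Reading a vector of files in one call assumes readr 2.0 or later.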
As far as I can tell, the only way to recreate this with open_csv_dataset currently is to read each file, resave it with the file path as an existing column, and then use open_csv_dataset. (I know the list of files is stored in the Dataset object, but I don't know if the data associated with each file is stored.)
It would be great if there was an id equivalent for the open_dataset functions.
Component(s)
R
There's a function add_filename() which might be able to help you, e.g.
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format="csv")
list.files(tf, recursive=TRUE)
#> [1] "am=0/part-0.csv" "am=1/part-0.csv"
open_dataset(tf, format = "csv") %>%
  mutate(filename = add_filename()) %>%
  arrange(mpg) %>% # just to change up the order
  select(mpg, filename) %>% # so it displays more nicely in the reprex
  collect()
#> # A tibble: 32 × 2
#>      mpg filename
#>    <dbl> <chr>
#>  1  10.4 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  2  10.4 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  3  13.3 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  4  14.3 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  5  14.7 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  6  15   /tmp/Rtmp5HPoh4/file3e584718df9b/am=1/part-0.csv
#>  7  15.2 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  8  15.2 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#>  9  15.5 /tmp/Rtmp5HPoh4/file3e584718df9b/am=0/part-0.csv
#> 10  15.8 /tmp/Rtmp5HPoh4/file3e584718df9b/am=1/part-0.csv
#> # ℹ 22 more rows
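If the goal is to pull information out of the path rather than keep the whole string, one possible follow-up is to split the column after collecting. This is only a sketch: the `partition_dir` and `file` column names are made up here, and the splitting is done after `collect()` with base R helpers, on the assumption that the result fits in memory.

```r
library(arrow)
library(dplyr)

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, am), tf, format = "csv")

result <- open_dataset(tf, format = "csv") %>%
  mutate(filename = add_filename()) %>%
  select(mpg, filename) %>%
  collect() %>%
  # after collect() this is a regular tibble, so base R path helpers apply
  mutate(
    partition_dir = basename(dirname(filename)), # e.g. "am=0"
    file = basename(filename)                    # e.g. "part-0.csv"
  )

# Number of rows that came from each directory
count(result, partition_dir)
```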
thisisnic changed the title from "Add option to record file path in open_dataset and open_csv_dataset" to "[R] Add option to record file path in open_dataset and open_csv_dataset" on Oct 5, 2023