multiple sources (wish) #80
Hello, I am constantly working with multi-file netCDF datasets, and often have questions that span long time periods. Suppose, as in my case, that you have netCDF files (stratified by time) with dimensions lat, lon, and time. To be clear, the file directory would look something like this: 6km_reanalysis_194801.nc, and so on through time.

A goal one might have is to apply a function across time (the unlimited dimension) at each lat/lon point, on one or multiple variables. At the moment, I do not think this is possible with tidync (and I am not aware of a "clean" and efficient way of doing it in R). If the files were small enough, the solution would be to first concatenate the files, then proceed by calling tidync on the single result.

I am wondering if it is possible for tidync to have a lazy-loading solution to this problem: in essence, a tidync equivalent of Python xarray's open_mfdataset() and its lazy evaluation. If tidync worked as I am suggesting, the first argument could be a vector of file names (in our case, 6km_reanalysis*.nc). A tidync object would be created by calling tidync() on this list, with metadata derived mostly from the first file (as suggested above). At this point, no data would be loaded.

Now, suppose I have a function written to work on a dataframe grouped by location (unique lat/lon pairs in the case of the .nc file). That is to say, if I had a dataframe with columns lat, lon, time, var1, var2, and var3, my function would output a dataframe with columns lat, lon, out_var1, and out_var2. Notice that the time dimension has "collapsed" here, but geographic points are still kept distinct.

The problem is that the data are split into multiple files over time. To construct such a dataframe for even a few geographic points, we would currently need to loop through all the file names, extract the .nc data as tbls or dataframes, then bind them together. This involves a lot of reading and writing, and dirty code.

Ideally, we could create a tidync object from a list of file names, lazily command that it be made into a tbl/df, then apply a function to this tbl/df (with sequential pipes). In this process, only small geographical chunks would be converted to a tbl/df at a time, and the function would be applied to these chunks (sequentially, or simultaneously on different cores). The size of the geographical chunks would need to depend on the spatial resolution, the number of input and output variables, the temporal resolution, and the timespan. The result would be a tbl with one dimension collapsed or reduced: in the case of time, there might not be a time column anymore, or time might now represent the end of a time range over which a function was applied (think of a climatological summary).

I ask whether it is technically possible for tidync to behave this way. I am ignorant of the low-level workings of R, so I might be asking too much. However, xarray has shown me that such things can be done in Python. Unfortunately, I do not think any of the Python plotting features approach the quality of ggplot, and dplyr is also unmatched. I would be over the moon if R had a good way of dealing with multi-file netCDF datasets. Many thanks.
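To make the shape of the described function concrete, a minimal sketch (the names and summary operations are illustrative only, assuming dplyr):

```r
library(dplyr)

## Input: a tibble with columns lat, lon, time, var1, var2, var3.
## Output: one row per lat/lon pair, with the time dimension collapsed.
collapse_time <- function(df) {
  df %>%
    group_by(lat, lon) %>%
    summarise(out_var1 = mean(var1 + var2),   # placeholder summary
              out_var2 = max(var3),           # placeholder summary
              .groups = "drop")
}
```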
thanks for the detailed questions, I think I understand what you're asking. Fully lazy and chunked processing is a lot of work, but I think you can achieve what you're after with a bit of custom application. I'd like to know what workflow you have at the moment: are you getting the output as …
Hello, I have created an example analogous to my use case. I have a function that works on a dataframe/tibble and summarizes a variable (currently just one variable) in terms of "events". These are defined as streaks of timesteps in which the selected variable surpasses a threshold. In this example, a precipitation event is a streak of hours in which hourly precipitation surpasses a rate of 1 mm/hr. These events are summarized with their starting time, their cumulative magnitude (e.g. total precip), maximum magnitude (e.g. max hourly precip rate), and length.

The netCDF files with my data are broken along the time dimension. In this example, there is a file for each day, and each day has a timestep for every hour. In order to create precipitation events, one must consider multiple files at once. In the case of precipitation in the Western US, one would need to consider as many files as there are days in a rainy season (this depends on your exact location and the purpose of your study, but the most you would need to consider is 366 files, with the cutoff on a day in which precipitation is very unlikely).

Here is my create_event() function, which takes in a "raw" df/tibble and returns a summary of events, as described above:
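The original function body is not shown here; below is a minimal sketch of what a create_event() matching that description might look like, with `var` and `threshold` as illustrative parameter names and dplyr assumed:

```r
library(dplyr)

## Summarize streaks ("events") where `var` exceeds `threshold`, per grid cell.
create_event <- function(df, var = "precip", threshold = 1) {
  df %>%
    arrange(lat, lon, time) %>%
    group_by(lat, lon) %>%
    ## a new event starts wherever the series first rises above the threshold
    mutate(above = .data[[var]] > threshold,
           event_id = cumsum(above & !lag(above, default = FALSE))) %>%
    filter(above) %>%
    group_by(lat, lon, event_id) %>%
    summarise(start_time = min(time),          # starting time of the event
              total      = sum(.data[[var]]),  # cumulative magnitude
              max_rate   = max(.data[[var]]),  # maximum magnitude
              length     = n(),                # event length in timesteps
              .groups = "drop")
}
```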
This is how I currently apply the function to a tibble obtained from a for-loop of tidync over 31 files (only 1 month, or 31 * 24 timesteps):
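The loop looks roughly like this (a sketch; fl_names, w_us_bnds, and event_var are the vector of 31 file names, the lon/lat bounding box, and the variable name, as they appear in the reply below):

```r
library(tidync)
library(dplyr)

big_tbl <- tibble()
for (f in fl_names) {
  day_tbl <- tidync(f) %>%
    hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
                 lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
    hyper_tibble(select_var = event_var)
  big_tbl <- bind_rows(big_tbl, day_tbl)  # repeated binding inside the loop
}
event_summary_tbl <- big_tbl %>% create_event()
```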
This creates the desired output after a bit of a wait. It would be great if I could do this with the following syntax:
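Something along these lines, purely hypothetical since tidync() does not currently accept a vector of files:

```r
event_summary_tbl <- tidync(fl_names) %>%   # wished-for: all files at once, lazily
  hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
               lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
  hyper_tibble(select_var = event_var) %>%
  create_event()
```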
Going even further, it would be useful to not even need to write this (intermediate) event_summary_tbl to memory at all. If my goal is to summarize the properties of some subset of events at each gridcell, it would be nice to write

```r
event_summary_tbl <- tidync(file_names) %>% # all file names inputted at once
  ...
```

Do you think this would be technically feasible? Three of the netCDF files and an example image of some of the output are available on this drive. Many thanks for your time and interest!
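A sketch of what that per-gridcell wish might look like, again hypothetical (tidync() taking multiple files is the wished-for behaviour, and the summary columns n_events and mean_total are illustrative):

```r
gridcell_summary <- tidync(file_names) %>%  # all file names inputted at once
  hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
               lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
  hyper_tibble(select_var = event_var) %>%
  create_event() %>%
  group_by(lat, lon) %>%
  summarise(n_events   = n(),
            mean_total = mean(total),
            .groups = "drop")
```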
honestly, it's probably the bind_rows() that is taking all the time (growing a data frame inside a loop copies the accumulated rows on every iteration), but thanks for the example, I will try it out and try to explain more. You could collect in a list and do the bind_rows() in one final step, and that might be fast enough. Untested code:

```r
library(tidync)
library(dplyr)

## fl_names, w_us_bnds, and event_var as defined in the example above
listof <- vector("list", length(fl_names))
for (i in seq_along(fl_names)) {
  current_nc <- tidync(fl_names[i])
  #print(current_nc)
  listof[[i]] <- current_nc %>%
    hyper_filter(lon = lon >= w_us_bnds[1] & lon <= w_us_bnds[2],
                 lat = lat >= w_us_bnds[3] & lat <= w_us_bnds[4]) %>%
    hyper_tibble(select_var = event_var)
}
big_tbl <- bind_rows(listof)
```
I hear you about better ways ... but realistically I probably can't achieve that without external help.
If you create an ncml file that contains the metadata for your list of files, you can use that ncml file as your file name and otherwise use tidync as normal.
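For reference, a minimal NcML aggregation file might look like the sketch below (the file name aggregate.ncml and the dimension name time are illustrative, and whether this works depends on your underlying NetCDF library supporting NcML):

```xml
<!-- aggregate.ncml: join the daily .nc files in this directory along time -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <scan location="." suffix=".nc"/>
  </aggregation>
</netcdf>
```

You would then, hypothetically, call tidync("aggregate.ncml") in place of a single .nc file name.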
oh wow, I always wondered about that but never saw it in action, that would be awesome to wrap up 👌
Old code experiment