AWS Retrospective #226
@fernando-aristizabal I would never discourage the development of new tools. :) There's actually a stale issue about building a retrospective client here: #157 We never actually got around to building the tool, but you may find some of the discussion useful. We and others have encountered some difficulty reliably retrieving and validating the zarr data.
@fernando-aristizabal, thanks for opening this! I share @jarq6c's sentiment. What do you envision the api(s) would return? A …
Hey @aaraney, my initial thought was to keep it to xarray, since that's what natively works best with these zarr/netcdf files. It would also keep the data lazily loaded and leave it up to the user to slice or convert to a desired object type. Given some of the issues with Zarr, has anyone produced a kerchunk index of the NetCDF retro data that we could use? It would load in a similar fashion and likely avoid some of the problems introduced in the Zarr rechunking.
We might ask @mgdenno to contribute to this conversation. The TEEHR project (https://github.com/RTIInternational/teehr) has a system in place to retrieve these data (time series, point, and gridded) for exploratory evaluations. There may be an opportunity to collaborate with CIROH.
I have a few thoughts to contribute to the conversation.
Regardless, we are certainly interested in collaborating on common tooling so we can try not to reinvent "the wheel". FRSA @samlamont
@fernando-aristizabal Please take a look. @igarousi, we should connect about this and add some material to the comment thread here. |
Hey everyone! Thanks for contributing to this! It seems like a great survey of the various efforts to better access NWM data. I'll start off commenting on @mgdenno's insightful points.
Moving on to @jameshalgren's info on some of the work that CIROH has been doing on this. It seems very helpful, as a few CIROH people have reached out to me or mentioned their questions on NWM data access. I took a look at the README.md but wasn't able to get a successful request.
My understanding is that these are single-file jsons for the forcing data? The forcing data seems to be of interest to people based on feedback. Is there a single multi-file json, or a plan to create one? I'd also like to share some work here showing how Zarr data rechunked along time, instead of along features, yielded significant improvement in time-series queries. The repo for this is available here, as well as more specifically here and here. This work was influenced by @jarq6c, @sudhirshrestha-noaa, and @AustinJordan-NOAA. Hopefully this adds to the various efforts at improving NWM data access and builds towards a comprehensive solution for research and dissemination applications.
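The claim above — that chunking along time rather than along features speeds up time-series queries — comes down to how many chunks a per-feature read has to touch. A toy sketch (illustrative array shapes and names, not the actual retro data):

```python
import numpy as np
import xarray as xr

# Toy array: 8 timesteps x 6 features.
da = xr.DataArray(
    np.arange(48.0).reshape(8, 6),
    dims=("time", "feature_id"),
)

# Feature-major chunking (small time chunks spanning all features):
# a full time series for one feature touches every time-chunk.
feature_major = da.chunk({"time": 2, "feature_id": 6})

# Time-major chunking (full time span, one feature per chunk):
# the same query touches a single chunk.
time_major = da.chunk({"time": 8, "feature_id": 1})

series_a = feature_major.isel(feature_id=0)  # spans 4 chunks
series_b = time_major.isel(feature_id=0)     # spans 1 chunk
```

On the real retro archive, each chunk is a separate object read from S3, so the difference in chunks-touched translates directly into request count and latency.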
Hi all, this is Sam. I'm working with @mgdenno on the TEEHR tooling and have a few points/questions to add. Regarding TEEHR, yes, we create the single file jsons and then, in some cases, the combined json using …

Also, just to clarify, is the overall discussion here around how best to support a variety of querying schemes for the NWM retrospective (and forecast?) dataset(s) (for instance, fetching the entire time series for one feature vs. the partial time series of many features)? If so, I'm curious what the advantages are of accessing the data through the Kerchunk reference files vs. optimizing the chunking scheme of the provided Zarr dataset. As I understand it, with Kerchunk we're tied to the original chunking scheme (or multiples of it) of the …

I'm also curious if …

I hope these comments are helpful, happy to discuss further if not!
@samlamont Thanks for jumping in with interesting input.
This is my general understanding as well, since kerchunk doesn't actually rechunk the files; it just builds an index around them, allowing access to metadata and lazy loading. What would you say the advantage of single file jsons is without aggregating them?
Building on the previous comment, it's my understanding that the value of kerchunk comes when building a multi-file json: that's where you get the advantages we've previously mentioned. Zarr offers the same benefits while also rechunking and recompressing into cloud-optimized formats. The link I shared previously demonstrates how this can speed up access if done properly for the right applications.
This discussion started with wanting to add some of the AWS retro Zarr references to the hydrotools repo to supplement the repo's existing NWM data access tools. @jarq6c brought up some concerns with the Zarr rechunking there, and the conversation expanded to various indexing/chunking efforts. It's apparent there are many efforts here across groups without a clear, consistent solution to gather around. During the course of this thread, I learned that @GautamSood-NOAA and @sudhirshrestha-noaa will also be doing some rechunking; they want to solicit feedback from SMEs on what variables, in addition to streamflow, qSfcLatRunoff, and qBucket, might be useful. I suggested the forcing data as well as the lake variables, as they may all eventually be relevant for FIM. They are eager for people's opinions, so feel free to communicate your needs to them.
Lastly, sharding seems like a partial chunk read? It's hard to tell because some of their links appear to be down. If so, I'm sure this would add value when we have large chunks with specific queries.
Hi @fernando-aristizabal, thanks for the feedback. On the single json vs. aggregated approach for NWM forecasts, we noticed a much smoother Dask task stream when using the single file jsons as opposed to aggregating with …

Thanks for the additional clarification. I'm happy to contribute to this effort in any way, and I'll post back here if I learn of any benefits to sharding.
As the NWM Client seems to focus on forecast data from 2018 onwards on GCP or the past two days on NOMADS, I've thought about the retrospective somewhat.
AWS publishes three versions of NWM retrospective analysis:
At least two of the versions, 2.1 and 2.0, have been rechunked to Zarr, which makes for easy ingest:
After your imports:
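(The snippet originally posted here isn't preserved in this copy of the thread; presumably it was just the two libraries used throughout, along the lines of:)

```python
import fsspec
import xarray as xr
```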
The 2.1 dataset is as follows:
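(The original snippet isn't preserved here. A sketch of what it likely showed — the bucket and store names below are assumed from the public v2.1 retrospective Zarr archive, and the open call needs `s3fs` plus network access:)

```python
import fsspec
import xarray as xr

# Assumed public bucket for the v2.1 retrospective Zarr stores.
NWM_21_BUCKET = "s3://noaa-nwm-retrospective-2-1-zarr-pds"


def retro_21_uri(store: str = "chrtout.zarr") -> str:
    """Build the URI for one of the v2.1 Zarr stores (e.g. chrtout.zarr)."""
    return f"{NWM_21_BUCKET}/{store}"


def open_retro_21(store: str = "chrtout.zarr") -> xr.Dataset:
    """Lazily open a v2.1 store; requires s3fs and network access."""
    return xr.open_zarr(
        fsspec.get_mapper(retro_21_uri(store), anon=True),
        consolidated=True,
    )
```

Opening the store is lazy: only metadata is fetched until a variable is sliced and computed.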
Other variables are available such as precipitation:
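(Again the original snippet is lost; a plausible sketch for the forcing precipitation, with the store name assumed from the same v2.1 bucket and network access required to actually open it:)

```python
import fsspec
import xarray as xr

# Assumed store name for the v2.1 rechunked forcing precipitation.
PRECIP_URI = "s3://noaa-nwm-retrospective-2-1-zarr-pds/precip.zarr"


def open_precip() -> xr.Dataset:
    """Lazily open the precipitation store; requires s3fs and network access."""
    return xr.open_zarr(
        fsspec.get_mapper(PRECIP_URI, anon=True),
        consolidated=True,
    )
```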
The 2.0 data looks like this:
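(Original snippet not preserved; a sketch assuming the v2.0 data lives as a single Zarr store at the root of its own public bucket, with network access required:)

```python
import fsspec
import xarray as xr

# Assumed location of the v2.0 retrospective Zarr store.
NWM_20_URI = "s3://noaa-nwm-retro-v2-zarr-pds"


def open_retro_20() -> xr.Dataset:
    """Lazily open the v2.0 store; requires s3fs and network access."""
    return xr.open_zarr(
        fsspec.get_mapper(NWM_20_URI, anon=True),
        consolidated=True,
    )
```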
While this seems easy enough, it might be useful to write a function that abstracts some of these URIs away into something that caters to domain scientists. Please let me know if this is of interest to you all; I might be able to get to it in the next few weeks, as @GregoryPetrochenkov-NOAA and I will be doing some FIM evals using NWM data in the near future.