-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Add Dask integration #205
Comments
Thanks for reaching out! As far as I was aware, these "netcdf mocks" I wrote used to be supported by dask. All the I would love someone to pick up this project! Other voices on this repo have said that the diagnostic variables from salem are useful to them. What I think would be more future-proof would be to use some of the ideas in salem and make a new, dedicated project just for that (WRF diagnostic variables, maybe plotting although metpy might be used for that as well). Back in the days, I was never 100% sure that mocking netcdf variables was the best way to do it (it feels quite hacky). The other alternative was to use dask to create delayed variables which are computed on demand, but I'm not sure this is possible given the complexity here (unstaggering, etc.) |
As someone on the MetPy side who's been very interested in better WRF handling, but haven't been able to prioritize the cycles to work on spinning up something building off of salem on my own, I'd definitely be interested in collaborating on such a new project. Based on the discussion in Unidata/MetPy#1004, @tjwixtrom has some initial implementations of a Dask-centric approach to WRF output handling, so that could be another useful starting point. That being said, based on the success of xgcm in the pangeo ecosystem, I'd also be inclined to directly build-in staggered grid handling using xgcm rather than just support unstaggering. Would it be worth trying to get interested individuals together to spin up such a community project (perhaps under the pangeo umbrella)? |
@fmaussion Thanks for pointing me to those tests, I completely missed them. Generating dask-powered xarrays using When I first saw those mock variables, I immediately had to think about using delayed variables instead. Using xgcm for handling staggered grids does look interesting indeed. I'll have a proper look at it soon, however at the moment I'm not convinced that it is able to delay the unstaggering operation. @jthielen It's good to know that there is some interest in crafting a robust solution for this problem. I will spend the next 3 years of my PhD working on high-resolution WRF modelling, so having a parallel and robust framework available for analysis is a priority for me and I'd be willing to invest some time. However, while I'd definitely be interested in joining the effort I don't have a huge amount of software engineering background apart from the usual university data analysis hacking stuff. So this would definitely be a learning experience for me but I am motivated to do this. |
Maybe you two can reach out in the Pangeo discourse to gain momentum. My experience however is that you'll need one or two highly motivated people able to work on a working prototype once you have sorted out the design details (e.g. mocks vs delayed). |
Regarding the code mentioned by @jthielen, that is a relatively simple solution based in Xarray/dask. I have been using it as the basis for my current work. It scales relatively well (for my purposes at least) and can handle quite large datasets. The major difference from something like XGCM is that it has been designed to be WRF-specific, which results in a simpler user interaction but is much less flexible in terms of other models/grids. I am happy to contribute this as the starting point to get something going based on existing projects. |
@tjwixtrom I think your code could be a nice starting point indeed. Some ideas on how to build upon it might be: formalizing the CF-compliance by using e.g. cfdm or by just abstracting the cf-vars into a class to reduce code duplication, implementing the remaining grid types (possibly using xgcm grids), supporting multiple datasets, abstracting diagnostic variable definitions and maybe adding additional ones from #18. I don't think that keeping it WRF-specific is a bad thing per se, however it could be nice to try to keep the possibility of extending the software for models like MPAS or ICON in mind. In the end, I think prioritizing providing the functionality of wrf-python and salem in a scalable way would be a good focus. |
@lpilz Definitely good ideas! I might suggest building CF-related functionality around cf-xarray to best integrate with the pangeo stack, instead of using something separate from xarray. Keeping it WRF-specific helps manage the scope of the project, and if we combine that with a philosophy of contributing any new general-purpose functionality we need to upstream projects (e.g., cf-xarray, xgcm, MetPy), then it should be pretty easy to extend to other models later on. If you're willing to dedicate the time to working on this, I'd be glad to join in and help on the software engineering/package management side. Taking the ideas brought up so far, here's what I'd envision this package looking like:
Which, in practice, could look something loosely like (for post-processing to isobaric levels then a quick preview plot of 700 hPa RH): from metpy.units import units
import numpy as np
import xarray as xr
import xgcm
import newWrfPackage
ds = xr.open_dataset("wrfout_d01_2021-01-01_00:00:00", engine="netcdf").wrf.parse_wrf(
include_diagnostics=["air_temperature", "geopotential_height"]
)
grid = xgcm.Grid(ds)
variables = ["air_temperature", "humidity_mixing_ratio", "x_wind", "y_wind", "geopotential_height"]
isobaric_levels = np.linspace(1000, 100, 37) * units.hPa
ds_isobaric = grid.transform(
ds.cf[variables], "Z", isobaric_levels, target_data=ds.cf["air_pressure"], method="logarithmic"
)
rh = ds_isobaric.metpy.calc("relative_humidity")
rh.metpy.sel(vertical=700 * units.hPa, time="2021-01-01 12:00").plot() Initial real examples would almost certainly be messier, as this sample does leverage some not-yet-ready features like xgcm unit-awareness and MetPy's automated field solver. What are everyone's thoughts on this? Also, once we get a good set of ideas together and a list of who's interested in spinning up the package, I'd be glad to write up a post on the pangeo discourse to see if they'd be interested in hosting it. |
@jthielen Thanks for picking this up. I'd definitely be interested in dedicating some time to this. Also, cf-xarray is probably more sensible than using cfdm, didn't know this project existed. Maybe cf-xarray could be added to this website. Your ideas around this package are very nice, I just have some comments regarding two of them. Regarding the handling of the diagnostic variables, I would propose a solution similar to @tjwixtrom's code. I'm not quite sure this is feasible, but I really like the idea of exposing all the diagnostics in a delayed fashion just waiting for the With regards to |
Sorry for loosing track of following up on this! As far as the follow-up points raised:
That being said, I wanted to mention what reminded me of this again! I just stumbled across NCAR's new |
No worries, I was busy the last weeks too so it actually kinda worked out.
It's very interesting to see NCAR picking up WRF postprocessing again, good catch. However, I'm not sure if they will add the computation of diagnostic variables and automatic destaggering, which would be the core of this project. But it is definitely a good idea to ask them exactly these kinds of questions before proceeding. If you want, you can put me in CC of the email, I'd also be interested in their take on our points and initiative. |
Tagging @mgrover1 who is involved with the NCAR xWRF effort. |
@andersy005 put together a prototype backend to help with reading WRF data into xarray, ( It is not clear yet what should all be included in Perhaps setting up a time to chat about this would be helpful? We may want to include some of the GeoCAT developers (@michaelavs and @erogluorhan), who have been maintaining wrf-python. |
Hello, all the discussion here as well as the recent prototype xwrf effort of @andersy005 and blog posting of @mgrover1 are invaluable for Xarray/Dask-compatible WRF diagnostics. We have the objective of making WRF-Python have improved Xarray compatibility, which could potentially leverage xwrf or other discussions here, even though we could not take actions for that yet. We are happy to contribute any community-owned discussion and implementation. |
Thank you all for chiming in (looks like I don't have to reach out in a separate channel)! Setting up a call as @mgrover1 suggests sounds like a good idea to me! Though, if that doesn't work out, perhaps we could organize asynchronous discussion in another location (perhaps GitHub discussions on the NCAR/xwrf repo)? |
Hi everyone! @mgrover and @jthielen, I would be open to/interested in a call where these items can be discussed, so please let me know any updates in that area! |
Hi everyone, great to see so much enthusiasm for new developments in wrf data postprocessing. I concur with @jthielen that a more discussion-friendly communications format would certainly be helpful and second @mgrover1's initiative for a call. What I would be very interested in, is whether NCAR is aiming to continue developing wrf-python towards dask, supersede it with xwrf or is aiming to start over with a new package. This could also decide on where to continue the discussion, as we certainly don't want to overstay our welcome at @fmaussion's ;) |
Hi @lpilz , thanks for asking! There are pros & cons of either way (i.e. continuing with WRF-Python, or starting over a new package), and a few of them that immediately come into my mind are:
Out of all of these, I am giving the biggest weight to the first one: user community of WRF-Python. It does not only provides us with a great opportunity of having such a big user base but also brings us (i.e. NCAR's GeoCAT team) the responsibility of maintaining the toolkit's existing functionality and improving it for the current users. That said, our GeoCAT team's preliminary plan would be keeping the existing functionality of WRF-Python for existing users as well as leveraging xwrf or other technologies to improve our Xarray compatibility and introduce Dask scalability. However, we are open to other thoughts on this as well as to any extent of collaboration to take the best action for the community. |
I'm currently looking around for a python package which is able to extract some diagnostics from WRF output and do computations in a distributed way, as I will handle very large datasets in the future. As salem is built on pure python and uses xarray in the background, I prefer it over wrf-python (also because development on wrf-python pretty much ceased).
I already tried to get salem to run with dask in the most naive way possible, however I stumbled over the problem that workers die when executing the open_wrf_dataset function with the error
distributed.protocol.core - CRITICAL - Failed to deserialize
andTypeError: Could not serialize object of type ImplicitToExplicitIndexingAdapter
. I had a quick look into the salem code and I think this might be due to the FakeVariable objects not having serializers associated with them. Does this sound sensible to you?Since I know that you, @fmaussion, don't have a lot of time to spend on this project I thinking about adding dask compatibility myself. However, before jumping the gun I just wanted to ask if there was any prior work I could maybe build on or if you have any advice.
The text was updated successfully, but these errors were encountered: