
Division of Features in Top-Level API #20

Closed
jthielen opened this issue Nov 23, 2021 · 10 comments

@jthielen
Collaborator

While detailed API discussions will be ongoing based on #13 and other issues/discussions that follow from that, #14 (comment) and #14 (comment) raised a more high-level API point that would be good to clear up first: what features go into the xwrf backend, and what goes elsewhere (such as a .wrf accessor)?

Original comments:


If so, I think this means we can't have direct Dask operations within the backend, but would rather need to design custom backend arrays that play nicely with the Dask chunking xarray itself does, or re-evaluate the approach for derived quantities so that they are outside the backend. Perhaps the intake-esm approach could help in that regard at least?

Wouldn't creating custom backend arrays be overkill? Assuming we want to support reading files via the netCDF4-python library, we might be able to write a custom data store that borrows from xarray's NetCDF4DataStore: https://github.com/pydata/xarray/blob/5db40465955a30acd601d0c3d7ceaebe34d28d11/xarray/backends/netCDF4_.py#L291. With this custom data store, we would have more control over what to do with variables, dimensions, and attrs before passing them to xarray. Wouldn't this suffice for the data loading (without the derived quantities)?

I think there's value in keeping the backend plugin simple (e.g. performing simple tasks such as decoding coordinates, fixing attributes/metadata, etc) and everything else outside the backend. Deriving quantities doesn't seem simple enough to warrant having this functionality during the data loading.

Some of the benefits of deriving quantities outside the backend are that this approach:

(1) doesn't obfuscate what's going on,
(2) gives users the opportunity to fix aspects of the dataset that might be missed by xwrf during data loading before passing this cleaned dataset to the functionality for deriving quantities.
(3) removes the requirement for deriving quantities to be a lazy operation, i.e., if your dataset is in memory, deriving the quantity is done eagerly...

Originally posted by @andersy005 in #14 (comment)


Some of the benefits of deriving quantities outside the backend are that this approach:

Also, wouldn't it be beneficial for deriving quantities to be backend-agnostic? I'm imagining cases in which the data have been post-processed and saved in a different format (e.g. Zarr) and you still want to be able to use the same code for deriving quantities on the fly.

Originally posted by @andersy005 in #14 (comment)


Deriving quantities doesn't seem simple enough to warrant having this functionality during the data loading.

This sounds like it factors directly into the "keep the solutions as general as possible (so that maybe also MPAS can profit from it)" discussion. However, I feel that we have to think about the user perspective too. I don't have any set opinions on this, and we should definitely discuss it, maybe in a larger group too. Here are some thoughts on this so far:

I think the reason users like wrf-python is that it is an easy one-stop shop for getting WRF output to work with Python - this is especially true because lots of users are scientists rather than software engineers or programmers. I personally take from this that it would be prudent to keep the UX as easy as possible, which is what the backend approach does really well: users just have to add the engine='xwrf' kwarg and then it just works (TM), providing them with CF-compliant, de-WRFified meteo data. Also, given that the de-WRFification of the variable data is not too difficult (it's basically just adding fields for three variables), I think the overhead in complexity wouldn't be too great. However, while I do see that it breaks the conceptual barrier between data loading (and decoding etc.) and computation, this breakage would be required in order to provide the user with meteo data rather than raw WRF fields.

@andersy005 do you already have some other ideas on how one could handle this elegantly?

Also, should we move this discussion to a separate issue maybe?

Originally posted by @lpilz in #14 (comment)

@andersy005 andersy005 pinned this issue Nov 23, 2021
@andersy005
Member

andersy005 commented Nov 23, 2021

@andersy005 do you already have some other ideas on how one could handle this elegantly?

I'm not thoroughly familiar with xarray's backend API :( So, I'm going to tinker with the backend API source code sometime next week to figure out how much we can get out of it at minimal cost (performance- and user experience-wise), and then I'll post some updates.

@jthielen
Collaborator Author

jthielen commented Nov 23, 2021

So here are my thoughts on this! I see four core categories of features for xwrf:

I initially envisioned (as I believe @lpilz is alluding to) that all four of these could work in the xwrf backend. However, given the hiccup that Dask chunking is a post-backend operation in xarray (which I didn't realize when writing the first cut of #14), I'm not sure that the workarounds required to fit diagnostics and destaggering lazily into the backend are worth it... they could be, though...

So, two options came to mind:

1) Do it all in the backend

This is the cleanest from the user API point-of-view, as they can get their cleaned, ready-to-use dataset as simply

ds = xr.open_dataset(
    'wrfout_d01_2017-12-02_12:00:00',
    engine='xwrf',
    backend_kwargs={...}
)

However, making the core diagnostics and destaggering lazy in the backend will take much more work behind the scenes. Like @andersy005, I'm not thoroughly familiar with xarray's backend API, but what I can say is that, if we go this path, we will need some way to create "mocked variables" whose _data arrays look like a regular variable's (i.e., MemoryCachedArray(CopyOnWriteArray(LazilyIndexedArray(NetCDF4ArrayWrapper)))) but have a simple function (add a constant, divide by a constant, add two arrays from different variables, or add two different slices of a single array) that gets applied lazily. I'd be curious whether directly copying FakeVariable from Salem (https://github.com/fmaussion/salem/blob/master/salem/wrftools.py#L143-L177, possibly with some Dask-compatibility modifications) and injecting these after the NetCDF4 backend call is sufficient. Otherwise, we'd probably have to do some more low-level subclassing/wrapping of the NetCDF4 backend and/or backend array.
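To make the "mocked variable" idea concrete, here is a minimal sketch (class and names hypothetical, not Salem's actual FakeVariable or any xwrf API) of a backing array that defers a simple transform until data are actually indexed:

```python
import numpy as np

class LazyDerivedArray:
    """Hypothetical sketch: wraps a base array and applies a simple
    transform only when the data are indexed, keeping reads lazy."""

    def __init__(self, base, func):
        self.base = base    # any array supporting numpy-style indexing
        self.func = func    # simple transform, e.g. lambda a: a + 300
        self.shape = base.shape
        self.dtype = base.dtype

    def __getitem__(self, key):
        # The transform runs only on the requested slice
        return self.func(self.base[key])

# Example: expose WRF perturbation potential temperature T as full theta
raw_T = np.arange(6.0).reshape(2, 3)
theta = LazyDerivedArray(raw_T, lambda a: a + 300.0)
```

Indexing `theta[0, 1]` reads just that element from `raw_T` and adds the offset; nothing is computed up front. A real backend array would additionally need to cooperate with xarray's lazy-indexing wrappers and Dask chunking.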

2) Do coordinates and CF attrs in the backend, but diagnostics and destaggering externally

This would be fairly easy to implement! The diagnostics could look just like that in https://github.com/jthielen/xwrf/blob/salem-style-coords-backup-0/xwrf/diagnostics.py, and the destaggering a "search and replace" routine that looks for a *_stag dimension, applies (a[:-1] + a[1:])/2 (on the appropriate dimension of course), and corrects the named dimension to the proper unstaggered version. Dask compatibility should be "automatic," given that chunks should already have been applied prior to these methods being run on a dataset.
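The destaggering step described above could be sketched roughly as follows (function name and exact handling are illustrative, not xwrf's actual routine):

```python
import numpy as np
import xarray as xr

def destagger(da, stag_dim):
    """Average adjacent values along a staggered dimension and rename it
    to the unstaggered name. Sketch only; attrs/coords handling omitted."""
    axis = da.dims.index(stag_dim)
    data = da.data

    # Build (a[:-1] + a[1:]) / 2 along the staggered axis
    left = data[tuple(slice(None, -1) if i == axis else slice(None)
                      for i in range(data.ndim))]
    right = data[tuple(slice(1, None) if i == axis else slice(None)
                       for i in range(data.ndim))]

    # Drop the "_stag" suffix to recover the unstaggered dimension name
    new_dims = tuple(d.replace("_stag", "") if d == stag_dim else d
                     for d in da.dims)
    return xr.DataArray(0.5 * (left + right), dims=new_dims, attrs=da.attrs)

# Toy staggered wind field: 4 staggered points -> 3 mass points
u_stag = xr.DataArray(np.arange(8.0).reshape(2, 4),
                      dims=("y", "west_east_stag"))
u = destagger(u_stag, "west_east_stag")
```

Because the slicing and arithmetic are plain array operations, the same code works unchanged on Dask-backed arrays once chunks have been applied, which is the "automatic" compatibility mentioned above.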

It is less ideal, however, from the user side. If we have a "combo" method for both the core diagnostics and destaggering together, it would look something like

ds = xr.open_dataset(
    'wrfout_d01_2017-12-02_12:00:00',
    engine='xwrf',
    backend_kwargs={...}
).wrf.diag_and_destagger()

or

ds = xr.open_dataset(
    'wrfout_d01_2017-12-02_12:00:00',
    engine='xwrf',
    backend_kwargs={...}
).pipe(xwrf.diag_and_destagger)

Not terrible, but always needing two separate operations to read in WRF data doesn't seem like the greatest UX. Also, we'll have to come up with some "CF-adjacent" metadata for the raw T, P, PB, PH, and PHB fields so they still make sense post-backend but pre-diagnostic-calculation and are not confused with the real physical variables.

Does anyone have strong preferences between these two options? Or another option altogether?

@jthielen
Collaborator Author

Also, in response to the raised points about not including diagnostics in the backend:

(1) doesn't obfuscate what's going on,

👍

(2) gives users the opportunity to fix aspects of the dataset that might be missed by xwrf during data loading before passing this cleaned dataset to the functionality for deriving quantities.

As far as I can tell, we're only talking about 5 raw fields ("T", "P", "PB", "PH" and "PHB") being mapped to 4 derived fields ("air_potential_temperature", "air_pressure", "geopotential" and "geopotential_height"). I can't think of any case where the result coming out of the xwrf backend with respect to those fields would need correction by the user before xwrf diagnostics (i.e., if those fields are messed up so that the xwrf diagnostics won't work, they almost certainly would have been messed up in a way that the xwrf backend wouldn't recognize them either). But I could be mistaken too!

(3) removes the requirement for deriving quantities to be a lazy operation i.e. if your dataset is in memory, deriving the quantity is done eagerly...

👍

It also makes it much easier to implement, as we'd just write the expected code (e.g., ds['T'] + 300) and the "laziness" should be handled automatically.
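As an illustration of how simple the eager/lazy-agnostic version is (toy values; the raw field names follow WRF output conventions, while the derived names are the CF standard names discussed above):

```python
import numpy as np
import xarray as xr

# Toy dataset mimicking WRF raw fields (values illustrative)
ds = xr.Dataset(
    {
        "T":  ("bottom_top", np.array([0.0, 5.0])),          # perturbation potential temperature [K]
        "P":  ("bottom_top", np.array([500.0, 400.0])),      # perturbation pressure [Pa]
        "PB": ("bottom_top", np.array([90000.0, 80000.0])),  # base-state pressure [Pa]
    }
)

theta = ds["T"] + 300.0    # air_potential_temperature
pres = ds["P"] + ds["PB"]  # air_pressure
# Had ds been opened with chunks=..., these identical expressions would
# build lazy dask graphs instead of computing eagerly - no extra code.
```

This is exactly the point: the derivation code is a handful of arithmetic expressions, and xarray handles eager vs. lazy execution depending on how the dataset was opened.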

Also, wouldn't it be beneficial for deriving quantities to be backend-agnostic? I'm imagining cases in which the data have been post-processed and saved in a different format (e.g. Zarr) and you still want to be able to use the same code for deriving quantities on the fly.

If you're post-processing (and possibly saving to another format), I'm not sure why you would want to re-save the raw components instead of the combined fields. It's more data and harder to describe with quality metadata (and I shudder at the thought of intentionally re-saving data with WRF-style metadata).

@jthielen
Collaborator Author

Also, re: MPAS

This sounds like it factors directly into the "keep the solutions as general as possible (so that maybe also MPAS can profit from it)" discussion.

Prefacing this by saying I have not worked with MPAS itself at all: taking a look through the raw example MPAS files available through MPAS-Analysis's documentation, I don't see any overlap with WRF-style metadata or weird offset/split components. So, if we stick to the limited scope for xwrf mentioned in #20 (comment), then I wouldn't foresee anything where MPAS can also profit from it. However, the upstream improvements to MetPy and xgcm that will be needed to achieve feature parity with wrf-python (e.g., grid-capable CAPE, sigma/hybrid-sigma to isobaric vertical coordinate transformation) will likely benefit MPAS too (assuming that MPAS output can be made CF-friendly as well)!

I believe FV3 would be a similar story (based on the limited experience I have with FV3-LAM short range weather app).

@andersy005 andersy005 added this to Xdev Nov 24, 2021
@andersy005 andersy005 moved this to 🌳 Todo in Xdev Nov 24, 2021
@kmpaul kmpaul added this to xWRF Dec 16, 2021
@kmpaul kmpaul added this to the Project xWRF milestone Dec 21, 2021
@lpilz
Collaborator

lpilz commented Jan 3, 2022

Does anyone have strong preferences between these two options? Or another option altogether?

Of the two options, I would prefer 2), for some of the same reasons @andersy005 mentioned. I was trying to remember whether we had discussed this and didn't find anything in the meeting notes on the topic, so I would like to add a third option:

3) Do everything in an xarray accessor

I guess this would be the easiest option. I looked through the xwrf code and couldn't find any hard reason why we have to use a custom xarray backend instead of the stock netCDF4 one - if not for the nice UX and some slight optimizations. But I might be wrong on this, so please correct me if that's the case.

To me, there are quite a few reasons for considering this option:

  1. It quite naturally keeps the division between data loading (xarray backends) and data modification (xarray accessor)
  2. It enables us to use dask in post-processing
  3. It avoids having to deal with partially processed data (T, PH, etc. vs air_temperature, geopotential, etc.)
  4. We only have one operation and not two

Again, this is only true provided we aren't missing any data from just loading wrfout files with the NetCDF4 backend, but I don't think we are. The main reason for considering this, to me, is that we could keep the xarray backend's original scope - dealing with different file formats and translating them to xarray - and not bloat it with custom logic. However, if we feel like the nice xarray backend UX (or the optimization) is worth breaking this division, I'm totally on board. In the end I am all about building a product people will actually use.
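A bare-bones sketch of what the accessor-only option could look like (the accessor name and method here are hypothetical, chosen to avoid clashing with any real xwrf accessor):

```python
import numpy as np
import xarray as xr

@xr.register_dataset_accessor("xwrf_demo")  # hypothetical accessor name
class XWRFDemoAccessor:
    """Sketch of option 3: all WRF-specific cleaning and derivation
    happens after a plain netCDF4-backend open_dataset call."""

    def __init__(self, ds):
        self._ds = ds

    def postprocess(self):
        ds = self._ds.copy()
        if "T" in ds:
            # Derive full potential temperature from the perturbation field
            ds["air_potential_temperature"] = ds["T"] + 300.0
        return ds

# Usage: ds = xr.open_dataset("wrfout...").xwrf_demo.postprocess()
ds = xr.Dataset({"T": ("bottom_top", np.array([0.0, 5.0]))})
clean = ds.xwrf_demo.postprocess()
```

Since the accessor runs on an already-opened (and possibly chunked) dataset, points 2 and 3 above come for free: Dask laziness is inherited from how the dataset was opened, and the accessor sees only raw WRF fields, never a partially processed mix.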

@andersy005
Member

3) Do everything in an xarray accessor

I guess this would be the easiest option. I looked through the xwrf code and couldn't find any hard reason as to why we had to use an xarray backend over NetCDF4 - if not for the nice UX and some slight optimizations.

I'm planning to start working on the xarray accessor unless there are objections and/or someone has already started working on this.

@lpilz
Collaborator

lpilz commented Feb 2, 2022

That would be great 👍

@andersy005 andersy005 moved this from 🌳 Todo to ▶ In Progress in Xdev Project Board Feb 10, 2022
@lpilz
Collaborator

lpilz commented Feb 15, 2022

I think #44 closes this discussion, do you agree?

@jthielen
Collaborator Author

I think #44 closes this discussion, do you agree?

I concur! (Feel free to re-open if otherwise.)

Repository owner moved this from 🌳 Todo to ✅ Done in Xdev Feb 15, 2022
Repository owner moved this from ▶ In Progress to ✅ Done in Xdev Project Board Feb 15, 2022
@jthielen jthielen moved this to Done in xWRF Feb 15, 2022
@kmpaul
Contributor

kmpaul commented Feb 15, 2022

Progress!!!! 😄
