
Division of Features in Top-Level API #20

Closed
jthielen opened this issue Nov 23, 2021 · 10 comments

@jthielen
Collaborator

While detailed API discussions will be ongoing based on #13 and other issues/discussions that follow from that, #14 (comment) and #14 (comment) raised a more high-level API point that would be good to clear up first: what features go into the xwrf backend, and what goes elsewhere (such as a .wrf accessor)?

Original comments:


If so, I think this means we can't have direct Dask operations within the backend, but would rather need to design custom backend arrays that play nicely with the Dask chunking xarray itself does, or re-evaluate the approach for derived quantities so that they are outside the backend. Perhaps the intake-esm approach could help in that regard at least?

Wouldn't creating custom backend arrays be overkill? Assuming we want to support reading files via the netCDF4-python library, we might be able to write a custom data store that borrows from xarray's NetCDF4DataStore: https://github.com/pydata/xarray/blob/5db40465955a30acd601d0c3d7ceaebe34d28d11/xarray/backends/netCDF4_.py#L291. With this custom data store, we would have more control over what to do with variables, dimensions, and attrs before passing them to xarray. Wouldn't this suffice for the data loading (without the derived quantities)?

I think there's value in keeping the backend plugin simple (e.g. performing simple tasks such as decoding coordinates, fixing attributes/metadata, etc) and everything else outside the backend. Deriving quantities doesn't seem simple enough to warrant having this functionality during the data loading.

Some of the benefits of deriving quantities outside the backend are that this approach:

(1) doesn't obfuscate what's going on,
(2) gives users the opportunity to fix aspects of the dataset that might be missed by xwrf during data loading before passing this cleaned dataset to the functionality for deriving quantities.
(3) removes the requirement for deriving quantities to be a lazy operation, i.e., if your dataset is in memory, deriving the quantity is done eagerly...

Originally posted by @andersy005 in #14 (comment)


Some of the benefits of deriving quantities outside the backend are that this approach:

Also, wouldn't it be beneficial for deriving quantities to be backend-agnostic? I'm imagining cases in which the data have been post-processed and saved in a different format (e.g. Zarr) and you still want to be able to use the same code for deriving quantities on the fly.

Originally posted by @andersy005 in #14 (comment)


Deriving quantities doesn't seem simple enough to warrant having this functionality during the data loading.

This sounds like it factors directly into the "keep the solutions as general as possible (so that maybe also MPAS can profit from it)" discussion. However, I feel that we have to think about the user perspective too. I don't have any set opinions on this, and we should definitely discuss it, maybe in a larger group too. Here are some thoughts on this so far:

I think the reason users like wrf-python is that it is an easy one-stop shop for getting WRF output to work with Python - this is especially true because lots of users are scientists rather than software engineers or programmers. I personally take from this that it would be prudent to keep the UX as easy as possible, which is what the backend approach does really well: users just have to add the engine='xwrf' kwarg and then it just works (TM), providing them with CF-compliant, de-WRFified meteo data. Also, given that the de-WRFification of the variable data is not too difficult (it's basically just adding fields for three variables), I think the overhead in complexity wouldn't be too great. However, while I do see that it breaks the conceptual barrier between data loading (and decoding etc.) and computation, this breakage would be required in order to provide the user with meteo data rather than raw WRF fields.

@andersy005 do you already have some other ideas on how one could handle this elegantly?

Also, should we move this discussion to a separate issue maybe?

Originally posted by @lpilz in #14 (comment)

@andersy005 andersy005 pinned this issue Nov 23, 2021
@andersy005
Member

andersy005 commented Nov 23, 2021

@andersy005 do you already have some other ideas on how one could handle this elegantly?

I'm not thoroughly familiar with xarray's backend API :( So, I'm going to tinker with the backend API source code sometime next week to figure out how much we can get out of it at minimal cost (performance- and user experience-wise), and then I'll post some updates.

@jthielen
Collaborator Author

jthielen commented Nov 23, 2021

So here are my thoughts on this! I see four core categories of features for xwrf:

I initially envisioned (as I believe @lpilz is alluding to) that all four of these could work in the xwrf backend. However, given the hiccup that Dask chunking is a post-backend operation in xarray (which I didn't realize when writing the first cut of #14), I'm not sure that the workarounds required to fit diagnostics and destaggering lazily into the backend are worth it... they could be, though...

So, two options came to mind:

1) Do it all in the backend

This is the cleanest from the user API point-of-view, as they can get their cleaned, ready-to-use dataset as simply

ds = xr.open_dataset(
    'wrfout_d01_2017-12-02_12:00:00',
    engine='xwrf',
    backend_kwargs={...}
)

However, making the core diagnostics and destaggering lazy in the backend will take much more work behind the scenes. Like @andersy005, I'm not thoroughly familiar with xarray's backend API, but what I can say is that, if we go this path, we will need some way to create "mocked variables" whose _data arrays look like a regular variable's (i.e., MemoryCachedArray(CopyOnWriteArray(LazilyIndexedArray(NetCDF4ArrayWrapper)))) but have a simple function (add a constant, divide by a constant, add two arrays from different variables, or add two different slices of a single array) that gets applied lazily. I'd be curious whether directly copying FakeVariable from Salem (https://github.com/fmaussion/salem/blob/master/salem/wrftools.py#L143-L177, possibly with some Dask-compatibility modifications) and injecting these after the NetCDF4 backend call is sufficient. Otherwise, we'd probably have to do some more low-level subclassing/wrapping of the NetCDF4 backend and/or backend array.
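To make the "mocked variable" idea concrete, here is a minimal sketch (class and names hypothetical, not Salem's actual FakeVariable or any xwrf API) of a backing array that defers a simple transform until data are actually indexed:

```python
import numpy as np

class LazyDerivedArray:
    """Hypothetical sketch: wraps a base array and applies a simple
    transform only when the data are indexed, keeping reads lazy."""

    def __init__(self, base, func):
        self.base = base    # any array supporting numpy-style indexing
        self.func = func    # simple transform, e.g. lambda a: a + 300
        self.shape = base.shape
        self.dtype = base.dtype

    def __getitem__(self, key):
        # The transform runs only on the requested slice
        return self.func(self.base[key])

# Example: expose WRF perturbation potential temperature T as full theta
raw_T = np.arange(6.0).reshape(2, 3)
theta = LazyDerivedArray(raw_T, lambda a: a + 300.0)
```

Indexing `theta[0, 1]` reads just that element from `raw_T` and adds the offset; nothing is computed up front. A real backend array would additionally need to cooperate with xarray's lazy-indexing wrappers and Dask chunking.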

2) Do coordinates and CF attrs in the backend, but diagnostics and destaggering externally

This would be fairly easy to implement! The diagnostics could look just like that in https://github.com/jthielen/xwrf/blob/salem-style-coords-backup-0/xwrf/diagnostics.py, and the destaggering a "search and replace" routine that looks for a *_stag dimension, applies (a[:-1] + a[1:])/2 (on the appropriate dimension of course), and corrects the named dimension to the proper unstaggered version. Dask compatibility should be "automatic," given that chunks should already have been applied prior to these methods being run on a dataset.
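The destaggering step described above could be sketched roughly as follows (function name and exact handling are illustrative, not xwrf's actual routine):

```python
import numpy as np
import xarray as xr

def destagger(da, stag_dim):
    """Average adjacent values along a staggered dimension and rename it
    to the unstaggered name. Sketch only; attrs/coords handling omitted."""
    axis = da.dims.index(stag_dim)
    data = da.data

    # Build (a[:-1] + a[1:]) / 2 along the staggered axis
    left = data[tuple(slice(None, -1) if i == axis else slice(None)
                      for i in range(data.ndim))]
    right = data[tuple(slice(1, None) if i == axis else slice(None)
                       for i in range(data.ndim))]

    # Drop the "_stag" suffix to recover the unstaggered dimension name
    new_dims = tuple(d.replace("_stag", "") if d == stag_dim else d
                     for d in da.dims)
    return xr.DataArray(0.5 * (left + right), dims=new_dims, attrs=da.attrs)

# Toy staggered wind field: 4 staggered points -> 3 mass points
u_stag = xr.DataArray(np.arange(8.0).reshape(2, 4),
                      dims=("y", "west_east_stag"))
u = destagger(u_stag, "west_east_stag")
```

Because the slicing and arithmetic are plain array operations, the same code works unchanged on Dask-backed arrays once chunks have been applied, which is the "automatic" compatibility mentioned above.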

It is less ideal, however, from the user side. If we have a "combo" method for both the core diagnostics and destaggering together, it would look something like

ds = xr.open_dataset(
    'wrfout_d01_2017-12-02_12:00:00',
    engine='xwrf',
    backend_kwargs={...}
).wrf.diag_and_destagger()

or

ds = xr.open_dataset(
    'wrfout_d01_2017-12-02_12:00:00',
    engine='xwrf',
    backend_kwargs={...}
).pipe(xwrf.diag_and_destagger)

Not terrible, but always needing two separate operations to read in WRF data doesn't seem like the greatest UX. Also, we'll have to come up with some "CF-adjacent" metadata for the raw T, P, PB, PH, and PHB fields so they still make sense post-backend but pre-diagnostic-calculation and are not confused with the real physical variables.

Does anyone have strong preferences between these two options? Or another option altogether?

@jthielen
Collaborator Author

Also, in response to the raised points about not including diagnostics in the backend:

(1) doesn't obfuscate what's going on,

👍

(2) gives users the opportunity to fix aspects of the dataset that might be missed by xwrf during data loading before passing this cleaned dataset to the functionality for deriving quantities.

As far as I can tell, we're only talking about 5 raw fields ("T", "P", "PB", "PH" and "PHB") being mapped to 4 derived fields ("air_potential_temperature", "air_pressure", "geopotential" and "geopotential_height"). I can't think of any case where the result coming out of the xwrf backend with respect to those fields would need correction by the user before xwrf diagnostics (i.e., if those fields are messed up so that the xwrf diagnostics won't work, they almost certainly would have been messed up in a way that the xwrf backend wouldn't recognize them either). But I could be mistaken too!

(3) removes the requirement for deriving quantities to be a lazy operation i.e. if your dataset is in memory, deriving the quantity is done eagerly...

👍

It also makes it much easier to implement, as we'd just write the expected code (e.g., ds['T'] + 300) and the "laziness" should be handled automatically.
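As an illustration of how simple the eager/lazy-agnostic version is (toy values; the raw field names follow WRF output conventions, while the derived names are the CF standard names discussed above):

```python
import numpy as np
import xarray as xr

# Toy dataset mimicking WRF raw fields (values illustrative)
ds = xr.Dataset(
    {
        "T":  ("bottom_top", np.array([0.0, 5.0])),          # perturbation potential temperature [K]
        "P":  ("bottom_top", np.array([500.0, 400.0])),      # perturbation pressure [Pa]
        "PB": ("bottom_top", np.array([90000.0, 80000.0])),  # base-state pressure [Pa]
    }
)

theta = ds["T"] + 300.0    # air_potential_temperature
pres = ds["P"] + ds["PB"]  # air_pressure
# Had ds been opened with chunks=..., these identical expressions would
# build lazy dask graphs instead of computing eagerly - no extra code.
```

This is exactly the point: the derivation code is a handful of arithmetic expressions, and xarray handles eager vs. lazy execution depending on how the dataset was opened.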

Also, wouldn't it be beneficial for deriving quantities to be backend-agnostic? I'm imagining cases in which the data have been post-processed and saved in a different format (e.g. Zarr) and you still want to be able to use the same code for deriving quantities on the fly.

If you're post-processing (and possibly saving to another format), I'm not sure why you would want to re-save the raw components instead of the combined fields. It's more data and harder to describe with quality metadata (and I shudder at the thought of intentionally re-saving data with WRF-style metadata).

@jthielen
Collaborator Author

Also, re: MPAS

This sounds like it factors directly into the "keep the solutions as general as possible (so that maybe also MPAS can profit from it)" discussion.

Prefacing this by saying I have not worked with MPAS itself at all: taking a look through the raw example MPAS files available through MPAS-Analysis's documentation, I don't see any overlap with WRF-style metadata or weird offset/split components. So, if we stick to the limited scope for xwrf mentioned in #20 (comment), then I wouldn't foresee anything where MPAS can also profit from it. However, the upstream improvements to MetPy and xgcm that will be needed to achieve feature parity with wrf-python (e.g., grid-capable CAPE, sigma/hybrid-sigma to isobaric vertical coordinate transformation) will likely benefit MPAS too (assuming that MPAS output can be made CF-friendly as well)!

I believe FV3 would be a similar story (based on the limited experience I have with FV3-LAM short range weather app).

@andersy005 andersy005 added this to Xdev Nov 24, 2021
@andersy005 andersy005 moved this to 🌳 Todo in Xdev Nov 24, 2021
@kmpaul kmpaul added this to xWRF Dec 16, 2021
@kmpaul kmpaul added this to the Project xWRF milestone Dec 21, 2021
@lpilz
Collaborator

lpilz commented Jan 3, 2022

Does anyone have strong preferences between these two options? Or another option altogether?

Of the two options, I would prefer 2), for some of the same reasons @andersy005 mentioned. I was trying to remember whether we had discussed this and didn't find anything in the meeting notes on the topic, so I would like to add a third option:

3) Do everything in an xarray accessor

I guess this would be the easiest option. I looked through the xwrf code and couldn't find any hard reason why we have to use a custom xarray backend instead of the stock netCDF4 one - if not for the nice UX and some slight optimizations. But I might be wrong on this, so please correct me if that's the case.

To me, there are quite a few reasons for considering this option:

  1. It quite naturally keeps the division between data loading (xarray backends) and data modification (xarray accessor)
  2. It enables us to use dask in post-processing
  3. It avoids having to deal with partially processed data (T, PH, etc. vs air_temperature, geopotential, etc.)
  4. We only have one operation and not two

Again, this is only true provided we aren't missing any data from just loading wrfout files with the NetCDF4 backend, but I don't think we are. The main reason for considering this, to me, is that we could keep the xarray backend's original scope - dealing with different file formats and translating them to xarray - and not bloat it with custom logic. However, if we feel like the nice xarray backend UX (or the optimization) is worth breaking this division, I'm totally on board. In the end I am all about building a product people will actually use.
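A bare-bones sketch of what the accessor-only option could look like (the accessor name and method here are hypothetical, chosen to avoid clashing with any real xwrf accessor):

```python
import numpy as np
import xarray as xr

@xr.register_dataset_accessor("xwrf_demo")  # hypothetical accessor name
class XWRFDemoAccessor:
    """Sketch of option 3: all WRF-specific cleaning and derivation
    happens after a plain netCDF4-backend open_dataset call."""

    def __init__(self, ds):
        self._ds = ds

    def postprocess(self):
        ds = self._ds.copy()
        if "T" in ds:
            # Derive full potential temperature from the perturbation field
            ds["air_potential_temperature"] = ds["T"] + 300.0
        return ds

# Usage: ds = xr.open_dataset("wrfout...").xwrf_demo.postprocess()
ds = xr.Dataset({"T": ("bottom_top", np.array([0.0, 5.0]))})
clean = ds.xwrf_demo.postprocess()
```

Since the accessor runs on an already-opened (and possibly chunked) dataset, points 2 and 3 above come for free: Dask laziness is inherited from how the dataset was opened, and the accessor sees only raw WRF fields, never a partially processed mix.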

@andersy005
Member

3) Do everything in an xarray accessor

I guess this would be the easiest option. I looked through the xwrf code and couldn't find any hard reason as to why we had to use an xarray backend over NetCDF4 - if not for the nice UX and some slight optimizations.

I'm planning to start working on the xarray accessor unless there are objections and/or someone has already started working on this.

@lpilz
Collaborator

lpilz commented Feb 2, 2022

That would be great 👍

@andersy005 andersy005 moved this from 🌳 Todo to ▶ In Progress in Xdev Project Board Feb 10, 2022
@lpilz
Collaborator

lpilz commented Feb 15, 2022

I think #44 closes this discussion, do you agree?

@jthielen
Collaborator Author

I think #44 closes this discussion, do you agree?

I concur! (Feel free to re-open if otherwise.)

Repository owner moved this from 🌳 Todo to ✅ Done in Xdev Feb 15, 2022
Repository owner moved this from ▶ In Progress to ✅ Done in Xdev Project Board Feb 15, 2022
@jthielen jthielen moved this to Done in xWRF Feb 15, 2022
@kmpaul
Contributor

kmpaul commented Feb 15, 2022

Progress!!!! 😄
