-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PR]: Update Regrid2 missing and fill value behaviors to align with CDAT and add unmapped_to_nan
arg for output data
#613
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #613 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 15 15
Lines 1521 1543 +22
=========================================
+ Hits 1521 1543 +22 ☔ View full report in Codecov by Sentry. |
Hey @lee1043, at our past meeting on 3/13, Jason mentioned having an Do you agree with this idea? Thanks. |
@tomvothecoder I agree, thanks! |
- Load dask arrays into memory using `.values`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#615 should be fixed with this commit, 85d1c7c
(#613).
The issue was that Dask Arrays are being maintained for bounds and the data variable array. I added calls to convert Dask Arrays to Numpy Arrays using replaced calls to .compute()
and replacing.data
with .values
.
I ran the example code in #615 and it works now. @lee1043 can you confirm?
# %%
import os
import xcdat as xc
target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)
input_data = os.path.join(
"/p/css03/esgf_publish/",
"CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/psl/gn/v20181218/",
"psl_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc",
)
ds = xc.open_dataset(input_data)
ds_regrid = ds.regridder.horizontal("psl", target_grid, tool="regrid2") # WORKS FINE
# %%
ds = xc.open_mfdataset(input_data)
ds_regrid_mf = ds.regridder.horizontal("psl", target_grid, tool="regrid2") # WORKS FINE
# %%
ds_regrid.identical(ds_regrid_mf) # True
def _get_latitude_weights( | ||
bounds: List[Tuple[np.ndarray, np.ndarray]] | ||
) -> List[np.ndarray]: | ||
weights = [] | ||
|
||
for x, y in bounds: | ||
cell_weight = np.sin(np.deg2rad(x)) - np.sin(np.deg2rad(y)) | ||
cell_weight = cell_weight.reshape((-1, 1)) | ||
|
||
weights.append(cell_weight) | ||
|
||
return weights | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extracted this section of code into its own private function and used an explicit for loop to make it more readable compared to list comprehension (IMO).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I don't think Dask Arrays support np.deg2rad
and/or np.sin
with np.nan
values, resulting in ValueError: cannot convert float NaN to integer
in #615.
xcdat/regridder/regrid2.py
Outdated
@@ -505,4 +528,4 @@ def _get_bounds_ensure_dtype(ds, axis): | |||
if bounds.dtype != np.float32: | |||
bounds = bounds.astype(np.float32) | |||
|
|||
return bounds.data | |||
return bounds.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to use .values
over .data
to convert Dask Arrays to np.ndarray
xcdat/regridder/regrid2.py
Outdated
@@ -113,7 +114,7 @@ def _regrid( | |||
lon_mapping, lon_weights = _map_longitude(src_lon_bnds, dst_lon_bnds) | |||
|
|||
# convert to pure numpy | |||
input_data = input_data_var.astype(np.float32).data | |||
input_data = input_data_var.astype(np.float32).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to use .values
to convert Dask Arrays to .ndarray
. Otherwise we get a TypeError: take() got an unexpected keyword argument 'mode'
downstream with np.take()
.
I confirm this resolves #615. |
It looks like the current version of the code in this PR could introduce unexpected behavior (guess it's a bug?) when there are NaN values in the original field and regrid using regrid2. import os
import xcdat as xc
from pcmdi_metrics.utils import apply_landmask
input_data = os.path.join(
"/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
"ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")
target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)
ds = xc.open_dataset(input_data)
# Apply land mask
ds_masked = ds.copy()
da_masked = apply_landmask(ds, "ts", keep_over="ocean")
ds_masked["ts"] = da_masked
ds_masked["ts"].isel(time=0).plot()
# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")
ds_masked_regrid["ts"].isel(time=0).plot() ds_masked_regrid (from code in main branch) -- what I expect ds_masked_regrid (from this branch) -- what I get @jasonb5 @tomvothecoder do you have any idea? |
For the above, it looks like 1e20 is filled for grid with NaN then do the regridding, which results 1e20 over the land and ~1e19 along the coastlines. If I plot using the below command, I get the 3rd plot in the above comment instead of the last one. ds_masked_regrid["ts"].where(ds_masked_regrid["ts"] < 1e16).isel(time=0).plot() |
@jasonb5 any updates? Also #615 is fixed. Jiwoo opened PR #634 to get that one in separately if this PR is going to take longer. We can push back v0.7.0 another week if needed. |
@tomvothecoder @lee1043 I'll have time today to finish this up. |
Thanks for the update @jasonb5! |
@lee1043 Fixed the issue, after regridding the missing values weren't replaced with Nan's again which resulted in the bad plot. There is still potentially an issue with coastline values ~1e19 so you still need |
@jasonb5 thank you very much for following up and for the update. I test with the latest code in this PR, which has changed 1e20 to nan. However, as you mentioned, there still are issues with coastlines that still concerns me. I think that indicates some of the original information is missed by averaging with 1e20, occurring along the coastlines in this case. Having ~1e19 numbers cause substantial impact on the statistical numbers when conducting model evaluation. I can exclude them by taking < 1e16, but still, that make the coastline grids as blank, which influences to the statistics. Is there any chance to make it those 1e20 or nan to be completely excluded before the regrinding process starts, so they won't affect to the regridded output at all? ds_masked_regrid (from this branch) -- what I get By applying |
@lee1043 I'm investigating that issue how, I'll probably update this PR to just include a few fixes. |
@lee1043 I added another fix, your example still displays the same issue. The following example works: import os
import xcdat as xc
import numpy as np
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask
input_data = os.path.join(
"/p/css03/esgf_publish/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/historical/r1i1p1f1/Amon/ts/gn/v20200511",
"ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc")
target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)
ds = xc.open_dataset(input_data)
# Apply land mask
ds_masked = ds.copy()
mask = create_land_sea_mask(ds_masked)
# invert mask to keep ocean
ds_masked["mask"] = np.absolute(mask - 1)
# ds_masked["psl"].isel(time=0).plot()
# regrid
ds_masked_regrid = ds_masked.regridder.horizontal("ts", target_grid, tool="regrid2")
ds_masked_regrid["ts"].isel(time=0).plot() I'm still looking into why your example still fails. |
Hi @jasonb5, thanks for following up. But I don't think the code in this comment is working. This is what I get which still has coastline issue. Seeing the result plot below, on top of colorbar color range indicates 1e19, which appears along coastline. It becomes more clear when plotting where >1e10 I think the key issue here is that 1e20 is being used when interpolating. In this PR, logic of the code is checking if there is attributes for FillValue or missing value, and if so, 1e20 is injected instead of NaN in the input variable (here). This input variable is passed to regrid2 (here), which include 1e20 when actually interpolating values. Instead, I think inclusion of 1e20 when interpolating can be prevented by replacing 1e20 or any missing value if exists ( |
@lee1043 Merged your change, and added a new one. Both pre-applied mask and passing mask as a variable should both work now. Also note that how the mask is applied by cdms2 and passing the mask as a variable to xcdat differs from pcmdi_metrics import os
import xcdat as xc
import numpy as np
import xarray as xr
import cdms2
import matplotlib.pyplot as plt
from pcmdi_metrics.utils import apply_landmask, create_land_sea_mask
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(10,10))
input_data = "ts_Amon_AWI-CM-1-1-MR_historical_r1i1p1f1_gn_201301-201312.nc"
target_grid = xc.create_uniform_grid(-90, 89, 1.0, 0, 359, 1.0)
cdms2_grid = cdms2.createUniformGrid(-90, 180, 1.0, 0, 360, 1.0)
ds = xc.open_dataset(input_data)
cdms2_ds = cdms2.open(input_data)
cdms2_var = cdms2_ds("ts")
mask = create_land_sea_mask(ds)
# cdms2 automatically inverts the mask
cdms2_mask = mask.values
cdms2_regrid = cdms2_var.regrid(cdms2_grid, mask=cdms2_mask, regridTool="regrid2")
cdms2_regrid = xr.DataArray(cdms2_regrid.data, dims=("time", "lat", "lon"))
cdms2_regrid = cdms2_regrid.where(cdms2_regrid!=1e20)
cdms2_regrid_t0 = cdms2_regrid.isel(time=0)
cdms2_regrid_t0.plot(ax=axs[0,0])
(cdms2_regrid_t0 - cdms2_regrid_t0).plot(ax=axs[0,1])
ds1 = ds.copy()
ds1_masked = apply_landmask(ds1, "ts", keep_over="ocean")
ds1["ts"] = ds1_masked
ds1_regrid = ds1.regridder.horizontal("ts", target_grid, tool="regrid2")
ds1_regrid_t0 = ds1_regrid["ts"].isel(time=0)
ds1_regrid_t0.plot(ax=axs[1,0])
# replace nan with 1e20 to show where data is missing
(ds1_regrid_t0.fillna(1e20) - cdms2_regrid_t0.fillna(1e20)).plot(ax=axs[1,1])
ds2 = ds.copy()
# need to manually invert the mask for ocean
ds2["mask"] = np.absolute(mask - 1)
ds2_regrid = ds2.regridder.horizontal("ts", target_grid, tool="regrid2")
ds2_regrid_t0 = ds2_regrid["ts"].isel(time=0)
ds2_regrid_t0.plot(ax=axs[2,0])
# replace nan with 1e20 to show where data is missing
(ds2_regrid_t0.fillna(1e20) - cdms2_regrid_t0.fillna(1e20)).plot(ax=axs[2,1])
plt.tight_layout()
plt.draw() |
Hi @jasonb5, thank you for the update. I checked the updated code in this PR works well for my example code above.
Thank you for pointing out. It is kind of intended behavior that in PMP I made the apply landmask function to be more strict when selecting either land or ocean grid. Default criteria for land is land fraction >= 0.8 while for the ocean < 0.2. I might have to rethink on this more to make it more consistent to what CDAT does, to keep minimal impact of xcdat conversion in PMP results. |
@lee1043 Sounds good, just wanted to make a note. @tomvothecoder I think this is good to merge, and I'll tackle #625 in the next release. |
@jasonb5 Great! I'll do a quick final review and then merge. |
unmapped_to_nan
arg for output data
unmapped_to_nan
arg for output data unmapped_to_nan
arg for output data
I'm going to add the codecov token to the repo and GitHub Actions build to hopefully fix the codecov upload problem and prevent/limit future occurrences. https://community.codecov.com/t/upload-issues-unable-to-locate-build-via-github-actions-api/3954 |
Description
_FillValue
,missing_value
, or default to 1e20Todo
Checklist
If applicable: