-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix get_grid when region_mask_regions data is present #64
Conversation
Also add a test to ensure get_grid().to_netcdf() works and fixed a typo in a function name in region_masks. Still need to add region_mask_regions back as a DataArray.
Co-authored-by: Anderson Banihirwe <[email protected]>
tests/test_grid.py
Outdated
ds = pop_tools.get_grid(grid) | ||
for format in ['NETCDF4', 'NETCDF3_64BIT']: | ||
gridfile = f'{grid}_{format}.nc' | ||
ds.to_netcdf(gridfile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we want to confirm that we are getting an identical dataset?
ds.to_netcdf(gridfile) | |
ds.to_netcdf(gridfile, format=format) | |
ds_2 = xr.open_dataset(gridfile) | |
xr.testing.assert_identical(ds, ds_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a good idea. Currently, the test (as @mnlevy1981 created it) confirms that writing the dataset to file does not fail, which is the point. Rereading the dataset and testing that it is identical seems (to me) to test something else...that xarray
can round-trip via NetCDF. I'm not sure that is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the test is generating the correct netCDF files: if I comment out the os.system()
call deleting the netCDF file in test_grid.py
, the resulting netCDF files are missing the region-specific variables @kmpaul added:
$ ncdump -h POP_gx1v7_NETCDF3_64BIT.nc | grep region
:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;
I think grid_attrs
is more permanent than we were giving it credit for, and the regions = grid_attrs.pop('region_mask_regions')
is only being applied the first time we call get_grid()
for a given grid? The following kludgy update
$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..e0f2462 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -324,8 +324,12 @@ def get_grid(grid_name, scrip=False):
dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
dso['region_val'].attrs['coordinate'] = 'region_name'
+ else:
+ regions = None
dso.attrs = grid_attrs
+ if regions:
+ grid_attrs['region_mask_regions'] = regions
return dso
Provides the expected result:
$ ncdump -h POP_gx1v7_NETCDF3_64BIT.nc | grep region
char region_name(nreg, string21) ;
region_name:_Encoding = "utf-8" ;
int region_val(nreg) ;
region_val:coordinate = "region_name" ;
:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;
$ ncdump -h POP_gx1v7_NETCDF4.nc | grep region
string region_name(nreg) ;
int64 region_val(nreg) ;
region_val:coordinate = "region_name" ;
:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;
Though I assume there's a more pythonic way to say dso.attrs
should equal everything in grid_attrs
except region_mask_regions
.
Note that I only stumbled on this because I was curious about how the NETCDF3_64BIT
format handled the string
variable -- on the plus side, it looks very similar to what @klindsay28 proposed in #45 (comment)
Perhaps $ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..597d804 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -318,14 +318,14 @@ def get_grid(grid_name, scrip=False):
# Remove region_mask_regions
if 'region_mask_regions' in grid_attrs:
- regions = grid_attrs.pop('region_mask_regions')
+ regions = grid_attrs['region_mask_regions']
region_names, region_vals = list(zip(*regions.items()))
region_coord = list(range(len(regions)))
dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
dso['region_val'].attrs['coordinate'] = 'region_name'
- dso.attrs = grid_attrs
+ dso.attrs = {key : grid_attrs[key] for key in grid_attrs if key != 'region_mask_regions'}
return dso |
@mnlevy1981 I cannot reproduce the error you are finding. I am able (with the code in its current form) to write out all of the new region variables in both NETCDF4 and NETCDF3_64BIT format. (poptools) kpaul@cisl-brainerd: pop-tools> ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
data:
region_name =
"Black Sea",
"Baltic Sea",
"Red Sea",
"Southern Ocean",
"Pacific Ocean",
"Indian Ocean",
"Persian Gulf",
"Atlantic Ocean",
"Mediterranean Sea",
"Lab. Sea & Baffin Bay",
"GIN Seas",
"Arctic Ocean",
"Hudson Bay" ;
} Personally, I find the use of the |
@kmpaul What happens if you run >>> pop_tools.get_grid('POP_gx1v7')
>>> pop_tools.get_grid('POP_gx1v7').to_netcdf('POP_gx1v7_NETCDF3_64BIT.nc') Because I am seeing $ ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
ncdump: region_name: No such variable My concern with this is that the test for generating the netCDF file is being run after a different test has already run |
Ah! I see what you are suggesting. The Is that the behavior you want? It seems dangerous to let the Another safe solution would be to make a copy of It seems that the number of times that def get_grid_defs():
with open(grid_defs_file) as f:
grid_defs = yaml.safe_load(grid_defs_file)
return grid_defs |
Maybe the simplest solution would be to copy all of $ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..7404aa2 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -316,16 +316,16 @@ def get_grid(grid_name, scrip=False):
grid_attrs.update({'title': f'{grid_name} grid'})
+ dso.attrs = grid_attrs
# Remove region_mask_regions
- if 'region_mask_regions' in grid_attrs:
- regions = grid_attrs.pop('region_mask_regions')
+ if 'region_mask_regions' in dso.attrs:
+ regions = dso.attrs.pop('region_mask_regions')
region_names, region_vals = list(zip(*regions.items()))
region_coord = list(range(len(regions)))
dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
dso['region_val'].attrs['coordinate'] = 'region_name'
- dso.attrs = grid_attrs
return dso |
Yeah. That solves this problem. Do you think that exposing a dict that should not be mutable is not a problem? |
...that previous question what written with horrible English. Let me re-ask the question: Currently, the |
@mnlevy1981: I implemented your "pop from dso.attrs" approach. |
I think it should be fixed, but my vote would be to take care of it in a future PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great and I really like how you added test_get_grid_twice()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kmpaul
You're welcome! Happy to help. |
I've implemented a simple fix for the
pop_tools.get_grid
error described in #45, and I've included a test. This solution was developed mostly under the guidance of @mnlevy1981, but my solution differs from the solution proposed by @klindsay28 in the discussions under #45.I opted for a
string
datatype in the NetCDF file, rather than thechar
datatype. This appears to work for bothNETCDF4
andNETCDF3_64BIT
NetCDF file formats, which is surprising to me! The advantage of thestring
datatype is that you do not need thenchar
dimension to indicate maximum length of the strings. However, documentation on NetCDF seems to suggest thatstring
types are only supported by NetCDF4 and not the more general CDF5.Since this PR appears to work (so far), I might suggest that either the resulting NetCDF grid files be tested by other people now or we accept this PR for now and work on a simple fix if it is discovered to be a problem in the future.
The new data added to the NetCDF grid files is summarized below: