Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix get_grid when region_mask_regions data is present #64

Merged
merged 7 commits into from
Sep 8, 2020

Conversation

kmpaul
Copy link
Contributor

@kmpaul kmpaul commented Sep 3, 2020

I've implemented a simple fix for the pop_tools.get_grid error described in #45, and I've included a test. This solution was developed mostly under the guidance of @mnlevy1981, but my solution differs from the solution proposed by @klindsay28 in the discussions under #45.

I opted for a string datatype in the NetCDF file, rather than the char datatype. This appears to work for both NETCDF4 and NETCDF3_64BIT NetCDF file formats, which is surprising to me! The advantage of the string datatype is that you do not need the nchar dimension to indicate maximum length of the strings. However, documentation on NetCDF seems to suggest that string types are only supported by NetCDF4 and not the more general CDF5.

Since this PR appears to work (so far), I might suggest that either the resulting NetCDF grid files be tested by other people now or we accept this PR for now and work on a simple fix if it is discovered to be a problem in the future.

The new data added to the NetCDF grid files is summarized below:

dimensions:
        nreg = 14 ;
variables:
        int64 nreg(nreg) ;
        string region_name(nreg) ;
        int64 region_val(nreg) ;
                region_val:coordinate = "region_name" ;

mnlevy1981 and others added 4 commits September 2, 2020 15:59
Also add a test to ensure get_grid().to_netcdf() works and fixed a typo in a
function name in region_masks. Still need to add region_mask_regions back as a
DataArray.
tests/test_grid.py Outdated Show resolved Hide resolved
Co-authored-by: Anderson Banihirwe <[email protected]>
ds = pop_tools.get_grid(grid)
for format in ['NETCDF4', 'NETCDF3_64BIT']:
gridfile = f'{grid}_{format}.nc'
ds.to_netcdf(gridfile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we want to confirm that we are getting an identical dataset?

Suggested change
ds.to_netcdf(gridfile)
ds.to_netcdf(gridfile, format=format)
ds_2 = xr.open_dataset(gridfile)
xr.testing.assert_identical(ds, ds_2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good idea. Currently, the test (as @mnlevy1981 created it) confirms that writing the dataset to file does not fail, which is the point. Rereading the dataset and testing that it is identical seems (to me) to test something else...that xarray can round-trip via NetCDF. I'm not sure that is necessary.

Copy link
Contributor

@andersy005 andersy005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me 👍

Copy link
Collaborator

@mnlevy1981 mnlevy1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test is generating the correct netCDF files: if I comment out the os.system() call deleting the netCDF file in test_grid.py, the resulting netCDF files are missing the region-specific variables @kmpaul added:

$ ncdump -h POP_gx1v7_NETCDF3_64BIT.nc | grep region
		:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;

I think grid_attrs is more permanent than we were giving it credit for, and the regions = grid_attrs.pop('region_mask_regions') is only being applied the first time we call get_grid() for a given grid? The following kludgy update

$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..e0f2462 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -324,8 +324,12 @@ def get_grid(grid_name, scrip=False):
         dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
         dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
         dso['region_val'].attrs['coordinate'] = 'region_name'
+    else:
+        regions = None

     dso.attrs = grid_attrs
+    if regions:
+        grid_attrs['region_mask_regions'] = regions

     return dso

Provides the expected result:

$ ncdump -h POP_gx1v7_NETCDF3_64BIT.nc | grep region
	char region_name(nreg, string21) ;
		region_name:_Encoding = "utf-8" ;
	int region_val(nreg) ;
		region_val:coordinate = "region_name" ;
		:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;
$ ncdump -h POP_gx1v7_NETCDF4.nc | grep region
	string region_name(nreg) ;
	int64 region_val(nreg) ;
		region_val:coordinate = "region_name" ;
		:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;

Though I assume there's a more pythonic way to say dso.attrs should equal everything in grid_attrs except region_mask_regions.

Note that I only stumbled on this because I was curious about how the NETCDF3_64BIT format handled the string variable -- on the plus side, it looks very similar to what @klindsay28 proposed in #45 (comment)

@mnlevy1981
Copy link
Collaborator

Perhaps

$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..597d804 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -318,14 +318,14 @@ def get_grid(grid_name, scrip=False):

     # Remove region_mask_regions
     if 'region_mask_regions' in grid_attrs:
-        regions = grid_attrs.pop('region_mask_regions')
+        regions = grid_attrs['region_mask_regions']
         region_names, region_vals = list(zip(*regions.items()))
         region_coord = list(range(len(regions)))
         dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
         dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
         dso['region_val'].attrs['coordinate'] = 'region_name'

-    dso.attrs = grid_attrs
+    dso.attrs = {key : grid_attrs[key] for key in grid_attrs if key != 'region_mask_regions'}

     return dso

@kmpaul
Copy link
Contributor Author

kmpaul commented Sep 8, 2020

@mnlevy1981 I cannot reproduce the error you are finding. I am able (with the code in its current form) to write out all of the new region variables in both NETCDF4 and NETCDF3_64BIT format.

(poptools) kpaul@cisl-brainerd: pop-tools> ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
data:

 region_name =
  "Black Sea",
  "Baltic Sea",
  "Red Sea",
  "Southern Ocean",
  "Pacific Ocean",
  "Indian Ocean",
  "Persian Gulf",
  "Atlantic Ocean",
  "Mediterranean Sea",
  "Lab. Sea & Baffin Bay",
  "GIN Seas",
  "Arctic Ocean",
  "Hudson Bay" ;
}

Personally, I find the use of the pop method of a dict to be appropriate and "Pythonic." I don't think that reconstructing an entirely new dict just to leave the region_mask_regions key out is necessary.

@mnlevy1981
Copy link
Collaborator

@mnlevy1981 I cannot reproduce the error you are finding. I am able (with the code in its current form) to write out all of the new region variables in both NETCDF4 and NETCDF3_64BIT format.

(poptools) kpaul@cisl-brainerd: pop-tools> ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
data:

 region_name =
  "Black Sea",
  "Baltic Sea",
  "Red Sea",
  "Southern Ocean",
  "Pacific Ocean",
  "Indian Ocean",
  "Persian Gulf",
  "Atlantic Ocean",
  "Mediterranean Sea",
  "Lab. Sea & Baffin Bay",
  "GIN Seas",
  "Arctic Ocean",
  "Hudson Bay" ;
}

@kmpaul What happens if you run get_grid() twice in a row? E.g.

>>> pop_tools.get_grid('POP_gx1v7')
>>> pop_tools.get_grid('POP_gx1v7').to_netcdf('POP_gx1v7_NETCDF3_64BIT.nc')

Because I am seeing

$ ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
ncdump: region_name: No such variable

My concern with this is that the test for generating the netCDF file is being run after a different test has already run get_grid(), so as the code stands pytest is not actually testing the ability to write these fields to file.

@kmpaul
Copy link
Contributor Author

kmpaul commented Sep 8, 2020

@kmpaul What happens if you run get_grid() twice in a row? E.g.

>>> pop_tools.get_grid('POP_gx1v7')
>>> pop_tools.get_grid('POP_gx1v7').to_netcdf('POP_gx1v7_NETCDF3_64BIT.nc')

Ah! I see what you are suggesting. The grid_defs file is read an import time (i.e., only once) and the pop method is removing the region_mask_regions key/value upon first processing of the selected grid.

Is that the behavior you want? It seems dangerous to let the grid_defs dictionary variable be mutable and not hidden (i.e., preceded by a _). You could easily secure this variable by using a frozendict, if you were willing to either add a dependency or code up the pattern in a utility class. Regardless, I understand what you are saying, now.

Another safe solution would be to make a copy of grid_defs in the get_grid function. Or, you could actually read the grid_defs variable from file in the get_grid function. Or, you could cache the result of get_grid so that you don't have to regenerate the dataset with every call. Any of these solutions seem reasonable to me.

It seems that the number of times that get_grid gets called is small, though, which would suggest that reading the grid_defs_file in the get_grid function would not create a bottleneck. ...Or, perhaps better if you need to use the grid_defs dictionary in other places, you could just replace its usage with a get_grid_defs function:

def get_grid_defs():
    with open(grid_defs_file) as f:
        grid_defs = yaml.safe_load(grid_defs_file)
    return grid_defs

@mnlevy1981
Copy link
Collaborator

mnlevy1981 commented Sep 8, 2020

Maybe the simplest solution would be to copy all of grid_attrs into dso.attrs and then pop from dso.attrs if necessary?

$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..7404aa2 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -316,16 +316,16 @@ def get_grid(grid_name, scrip=False):

     grid_attrs.update({'title': f'{grid_name} grid'})

+    dso.attrs = grid_attrs
     # Remove region_mask_regions
-    if 'region_mask_regions' in grid_attrs:
-        regions = grid_attrs.pop('region_mask_regions')
+    if 'region_mask_regions' in dso.attrs:
+        regions = dso.attrs.pop('region_mask_regions')
         region_names, region_vals = list(zip(*regions.items()))
         region_coord = list(range(len(regions)))
         dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
         dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
         dso['region_val'].attrs['coordinate'] = 'region_name'

-    dso.attrs = grid_attrs

     return dso

@kmpaul
Copy link
Contributor Author

kmpaul commented Sep 8, 2020

Yeah. That solves this problem. Do you think that exposing a dict that should not be mutable is not a problem?

@kmpaul
Copy link
Contributor Author

kmpaul commented Sep 8, 2020

...that previous question what written with horrible English. Let me re-ask the question: Currently, the grid_defs variable is a dict and is exposed as part of the public API. It should be immutable, but it is not. Do you feel that this is a problem that should be fixed by this (or a future) PR?

@kmpaul
Copy link
Contributor Author

kmpaul commented Sep 8, 2020

@mnlevy1981: I implemented your "pop from dso.attrs" approach.

@mnlevy1981
Copy link
Collaborator

Currently, the grid_defs variable is a dict and is exposed as part of the public API. It should be immutable, but it is not. Do you feel that this is a problem that should be fixed by this (or a future) PR?

I think it should be fixed, but my vote would be to take care of it in a future PR.

Copy link
Collaborator

@mnlevy1981 mnlevy1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great and I really like how you added test_get_grid_twice().

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kmpaul

@kmpaul
Copy link
Contributor Author

kmpaul commented Sep 8, 2020

You're welcome! Happy to help.

@dcherian dcherian merged commit ca3c67c into NCAR:master Sep 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants