Fix get_grid when region_mask_regions data is present #64

kmpaul · 2020-09-03T16:34:30Z

I've implemented a simple fix for the pop_tools.get_grid error described in #45, and I've included a test. This solution was developed mostly under the guidance of @mnlevy1981, but my solution differs from the solution proposed by @klindsay28 in the discussions under #45.

I opted for a string datatype in the NetCDF file, rather than the char datatype. This appears to work for both NETCDF4 and NETCDF3_64BIT NetCDF file formats, which is surprising to me! The advantage of the string datatype is that you do not need the nchar dimension to indicate maximum length of the strings. However, documentation on NetCDF seems to suggest that string types are only supported by NetCDF4 and not the more general CDF5.

Since this PR appears to work (so far), I might suggest that either the resulting NetCDF grid files be tested by other people now or we accept this PR for now and work on a simple fix if it is discovered to be a problem in the future.

The new data added to the NetCDF grid files is summarized below:

dimensions:
        nreg = 14 ;
variables:
        int64 nreg(nreg) ;
        string region_name(nreg) ;
        int64 region_val(nreg) ;
                region_val:coordinate = "region_name" ;

Also add a test to ensure get_grid().to_netcdf() works and fixed a typo in a function name in region_masks. Still need to add region_mask_regions back as a DataArray.

tests/test_grid.py

Co-authored-by: Anderson Banihirwe <[email protected]>

andersy005 · 2020-09-03T17:44:43Z

tests/test_grid.py

+        ds = pop_tools.get_grid(grid)
+        for format in ['NETCDF4', 'NETCDF3_64BIT']:
+            gridfile = f'{grid}_{format}.nc'
+            ds.to_netcdf(gridfile)


Also, do we want to confirm that we are getting an identical dataset?

Suggested change

ds.to_netcdf(gridfile)

ds.to_netcdf(gridfile, format=format)

ds_2 = xr.open_dataset(gridfile)

xr.testing.assert_identical(ds, ds_2)

I think that's a good idea. Currently, the test (as @mnlevy1981 created it) confirms that writing the dataset to file does not fail, which is the point. Rereading the dataset and testing that it is identical seems (to me) to test something else...that xarray can round-trip via NetCDF. I'm not sure that is necessary.

andersy005

This looks good to me 👍

mnlevy1981

I don't think the test is generating the correct netCDF files: if I comment out the os.system() call deleting the netCDF file in test_grid.py, the resulting netCDF files are missing the region-specific variables @kmpaul added:

$ ncdump -h POP_gx1v7_NETCDF3_64BIT.nc | grep region
		:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;

I think grid_attrs is more permanent than we were giving it credit for, and the regions = grid_attrs.pop('region_mask_regions') is only being applied the first time we call get_grid() for a given grid? The following kludgy update

$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..e0f2462 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -324,8 +324,12 @@ def get_grid(grid_name, scrip=False):
         dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
         dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
         dso['region_val'].attrs['coordinate'] = 'region_name'
+    else:
+        regions = None

     dso.attrs = grid_attrs
+    if regions:
+        grid_attrs['region_mask_regions'] = regions

     return dso

Provides the expected result:

$ ncdump -h POP_gx1v7_NETCDF3_64BIT.nc | grep region
	char region_name(nreg, string21) ;
		region_name:_Encoding = "utf-8" ;
	int region_val(nreg) ;
		region_val:coordinate = "region_name" ;
		:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;
$ ncdump -h POP_gx1v7_NETCDF4.nc | grep region
	string region_name(nreg) ;
	int64 region_val(nreg) ;
		region_val:coordinate = "region_name" ;
		:region_mask_fname = "inputdata/ocn/pop/gx1v7/grid/region_mask_20151008.ieeei4" ;

Though I assume there's a more pythonic way to say dso.attrs should equal everything in grid_attrs except region_mask_regions.

Note that I only stumbled on this because I was curious about how the NETCDF3_64BIT format handled the string variable -- on the plus side, it looks very similar to what @klindsay28 proposed in #45 (comment)

mnlevy1981 · 2020-09-04T04:49:38Z

Perhaps

$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..597d804 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -318,14 +318,14 @@ def get_grid(grid_name, scrip=False):

     # Remove region_mask_regions
     if 'region_mask_regions' in grid_attrs:
-        regions = grid_attrs.pop('region_mask_regions')
+        regions = grid_attrs['region_mask_regions']
         region_names, region_vals = list(zip(*regions.items()))
         region_coord = list(range(len(regions)))
         dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
         dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
         dso['region_val'].attrs['coordinate'] = 'region_name'

-    dso.attrs = grid_attrs
+    dso.attrs = {key : grid_attrs[key] for key in grid_attrs if key != 'region_mask_regions'}

     return dso

kmpaul · 2020-09-08T14:59:59Z

@mnlevy1981 I cannot reproduce the error you are finding. I am able (with the code in its current form) to write out all of the new region variables in both NETCDF4 and NETCDF3_64BIT format.

(poptools) kpaul@cisl-brainerd: pop-tools> ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
data:

 region_name =
  "Black Sea",
  "Baltic Sea",
  "Red Sea",
  "Southern Ocean",
  "Pacific Ocean",
  "Indian Ocean",
  "Persian Gulf",
  "Atlantic Ocean",
  "Mediterranean Sea",
  "Lab. Sea & Baffin Bay",
  "GIN Seas",
  "Arctic Ocean",
  "Hudson Bay" ;
}

Personally, I find the use of the pop method of a dict to be appropriate and "Pythonic." I don't think that reconstructing an entirely new dict just to leave the region_mask_regions key out is necessary.

mnlevy1981 · 2020-09-08T15:12:00Z

@mnlevy1981 I cannot reproduce the error you are finding. I am able (with the code in its current form) to write out all of the new region variables in both NETCDF4 and NETCDF3_64BIT format.

(poptools) kpaul@cisl-brainerd: pop-tools> ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
data:

 region_name =
  "Black Sea",
  "Baltic Sea",
  "Red Sea",
  "Southern Ocean",
  "Pacific Ocean",
  "Indian Ocean",
  "Persian Gulf",
  "Atlantic Ocean",
  "Mediterranean Sea",
  "Lab. Sea & Baffin Bay",
  "GIN Seas",
  "Arctic Ocean",
  "Hudson Bay" ;
}

@kmpaul What happens if you run get_grid() twice in a row? E.g.

>>> pop_tools.get_grid('POP_gx1v7')
>>> pop_tools.get_grid('POP_gx1v7').to_netcdf('POP_gx1v7_NETCDF3_64BIT.nc')

Because I am seeing

$ ncdump -v region_name POP_gx1v7_NETCDF3_64BIT.nc | tail -n 17
ncdump: region_name: No such variable

My concern with this is that the test for generating the netCDF file is being run after a different test has already run get_grid(), so as the code stands pytest is not actually testing the ability to write these fields to file.

kmpaul · 2020-09-08T15:49:51Z

@kmpaul What happens if you run get_grid() twice in a row? E.g.
>>> pop_tools.get_grid('POP_gx1v7')
>>> pop_tools.get_grid('POP_gx1v7').to_netcdf('POP_gx1v7_NETCDF3_64BIT.nc')

Ah! I see what you are suggesting. The grid_defs file is read an import time (i.e., only once) and the pop method is removing the region_mask_regions key/value upon first processing of the selected grid.

Is that the behavior you want? It seems dangerous to let the grid_defs dictionary variable be mutable and not hidden (i.e., preceded by a _). You could easily secure this variable by using a frozendict, if you were willing to either add a dependency or code up the pattern in a utility class. Regardless, I understand what you are saying, now.

Another safe solution would be to make a copy of grid_defs in the get_grid function. Or, you could actually read the grid_defs variable from file in the get_grid function. Or, you could cache the result of get_grid so that you don't have to regenerate the dataset with every call. Any of these solutions seem reasonable to me.

It seems that the number of times that get_grid gets called is small, though, which would suggest that reading the grid_defs_file in the get_grid function would not create a bottleneck. ...Or, perhaps better if you need to use the grid_defs dictionary in other places, you could just replace its usage with a get_grid_defs function:

def get_grid_defs():
    with open(grid_defs_file) as f:
        grid_defs = yaml.safe_load(grid_defs_file)
    return grid_defs

mnlevy1981 · 2020-09-08T15:58:03Z

Maybe the simplest solution would be to copy all of grid_attrs into dso.attrs and then pop from dso.attrs if necessary?

$ git diff
diff --git a/pop_tools/grid.py b/pop_tools/grid.py
index 398ab65..7404aa2 100644
--- a/pop_tools/grid.py
+++ b/pop_tools/grid.py
@@ -316,16 +316,16 @@ def get_grid(grid_name, scrip=False):

     grid_attrs.update({'title': f'{grid_name} grid'})

+    dso.attrs = grid_attrs
     # Remove region_mask_regions
-    if 'region_mask_regions' in grid_attrs:
-        regions = grid_attrs.pop('region_mask_regions')
+    if 'region_mask_regions' in dso.attrs:
+        regions = dso.attrs.pop('region_mask_regions')
         region_names, region_vals = list(zip(*regions.items()))
         region_coord = list(range(len(regions)))
         dso['region_name'] = xr.DataArray(list(region_names), coords=[region_coord], dims=['nreg'])
         dso['region_val'] = xr.DataArray(list(region_vals), coords=[region_coord], dims=['nreg'])
         dso['region_val'].attrs['coordinate'] = 'region_name'

-    dso.attrs = grid_attrs

     return dso

kmpaul · 2020-09-08T16:01:44Z

Yeah. That solves this problem. Do you think that exposing a dict that should not be mutable is not a problem?

kmpaul · 2020-09-08T16:13:42Z

...that previous question what written with horrible English. Let me re-ask the question: Currently, the grid_defs variable is a dict and is exposed as part of the public API. It should be immutable, but it is not. Do you feel that this is a problem that should be fixed by this (or a future) PR?

kmpaul · 2020-09-08T16:14:55Z

@mnlevy1981: I implemented your "pop from dso.attrs" approach.

mnlevy1981 · 2020-09-08T16:17:34Z

Currently, the grid_defs variable is a dict and is exposed as part of the public API. It should be immutable, but it is not. Do you feel that this is a problem that should be fixed by this (or a future) PR?

I think it should be fixed, but my vote would be to take care of it in a future PR.

mnlevy1981

This looks great and I really like how you added test_get_grid_twice().

dcherian

Thanks @kmpaul

kmpaul · 2020-09-08T17:05:51Z

You're welcome! Happy to help.

mnlevy1981 and others added 4 commits September 2, 2020 15:59

Remove region_mask_regions from get_grid attrs

35cff1b

Also add a test to ensure get_grid().to_netcdf() works and fixed a typo in a function name in region_masks. Still need to add region_mask_regions back as a DataArray.

First pass at writing region names/values arrays

d8a24c2

Ignore VSCode files

1f861ff

Adding region name/val arrays to dataset

6f79f73

kmpaul requested review from andersy005, dcherian, klindsay28, matt-long and mnlevy1981 as code owners September 3, 2020 16:34

kmpaul mentioned this pull request Sep 3, 2020

get_grid returns a file that cannot be written to netCDF #45

Closed

andersy005 reviewed Sep 3, 2020

View reviewed changes

tests/test_grid.py Outdated Show resolved Hide resolved

Use NetCDF format when writing file

3a5f2c1

Co-authored-by: Anderson Banihirwe <[email protected]>

andersy005 reviewed Sep 3, 2020

View reviewed changes

andersy005 approved these changes Sep 4, 2020

View reviewed changes

mnlevy1981 requested changes Sep 4, 2020

View reviewed changes

Pop from dso.attrs not grid_attrs

b2b7b09

Add test for 2x get_grid call

d77ce57

mnlevy1981 approved these changes Sep 8, 2020

View reviewed changes

dcherian approved these changes Sep 8, 2020

View reviewed changes

klindsay28 approved these changes Sep 8, 2020

View reviewed changes

dcherian merged commit ca3c67c into NCAR:master Sep 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix get_grid when region_mask_regions data is present #64

Fix get_grid when region_mask_regions data is present #64

kmpaul commented Sep 3, 2020

andersy005 Sep 3, 2020

kmpaul Sep 3, 2020

andersy005 left a comment

mnlevy1981 left a comment •

edited

Loading

mnlevy1981 commented Sep 4, 2020

kmpaul commented Sep 8, 2020

mnlevy1981 commented Sep 8, 2020

kmpaul commented Sep 8, 2020 •

edited

Loading

mnlevy1981 commented Sep 8, 2020 •

edited

Loading

kmpaul commented Sep 8, 2020 •

edited

Loading

kmpaul commented Sep 8, 2020

kmpaul commented Sep 8, 2020

mnlevy1981 commented Sep 8, 2020

mnlevy1981 left a comment

dcherian left a comment

kmpaul commented Sep 8, 2020

-            ds.to_netcdf(gridfile)
+            ds.to_netcdf(gridfile, format=format)
+            ds_2 = xr.open_dataset(gridfile)
+            xr.testing.assert_identical(ds, ds_2)

Fix get_grid when region_mask_regions data is present #64

Fix get_grid when region_mask_regions data is present #64

Conversation

kmpaul commented Sep 3, 2020

andersy005 Sep 3, 2020

Choose a reason for hiding this comment

kmpaul Sep 3, 2020

Choose a reason for hiding this comment

andersy005 left a comment

Choose a reason for hiding this comment

mnlevy1981 left a comment • edited Loading

Choose a reason for hiding this comment

mnlevy1981 commented Sep 4, 2020

kmpaul commented Sep 8, 2020

mnlevy1981 commented Sep 8, 2020

kmpaul commented Sep 8, 2020 • edited Loading

mnlevy1981 commented Sep 8, 2020 • edited Loading

kmpaul commented Sep 8, 2020 • edited Loading

kmpaul commented Sep 8, 2020

kmpaul commented Sep 8, 2020

mnlevy1981 commented Sep 8, 2020

mnlevy1981 left a comment

Choose a reason for hiding this comment

dcherian left a comment

Choose a reason for hiding this comment

kmpaul commented Sep 8, 2020

mnlevy1981 left a comment •

edited

Loading

kmpaul commented Sep 8, 2020 •

edited

Loading

mnlevy1981 commented Sep 8, 2020 •

edited

Loading

kmpaul commented Sep 8, 2020 •

edited

Loading