Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best practice when the _Unsigned attribute is present in NetCDF files #1444

Closed
deeplycloudy opened this issue Jun 6, 2017 · 5 comments
Closed

Comments

@deeplycloudy
Copy link
Contributor

deeplycloudy commented Jun 6, 2017

Some (large) data providers are writing NetCDF-4-extended files but using an _Unsigned attribute to indicate that a signed data type should be interpreted as unsigned bytes.

Background: Unidata/netcdf4-python#656

From the background discussion above, it is my understanding that xarray does not honor the attribute because it’s not a part of the CF spec, is only mentioned as a proposed attribute in the NetCDF Best Practices, and because "xarray wants the Variable dtype to be the same as the dtype of the data returned."

Taking the above as a given, it is necessary for xarray users encountering such variables to do the following after reading the data:

scale_factor = data.encoding['scale_factor']
add_offset = data.encoding['add_offset']
unscale = ((data - add_offset)/scale_factor).data.astype(dtype).astype('float64')
fixed = unscale * scale_factor + add_offset

The un-scaling step can be saved by turning off auto mask and scale.

In order to automate the above process while still being able to use the functionality of Dataset, one approach might be to automatically perform the above steps on some known list of variables, and then reassign those variables to the Dataset. The downside is the need to read all variables up front, which could be expensive when processing large datasets where not all variables are needed.

Is there another approach that would preserve lazy data loading, for instance by providing pre/post hooks for transformation functions at the __getitem__ stage? Is there something I could do to help document that as a best practice?

@shoyer
Copy link
Member

shoyer commented Jun 7, 2017

To clarify: I didn't want to let NetCDF4-python parse this attribute, because it would break an invariant in our data model. I'm not opposed to supporting this separately in xarray (using xarray's machinery for handling netcdf conventions).

I will say though that this convention makes little sense to me for netcdf4/hdf5 files, which already provide native support for unsigned types, so separately I would also encourage this data provider to change their practices here :).

@dopplershift
Copy link
Contributor

HDF5 doesn't come without its own draw backs...such as vastly more complicated (and admittedly more capable) on-disk format. We should not presume to know better than the data providers what their trade-offs are.

Also, while unsigned types are a feature of the netCDF4/HDF5, strictly speaking, the netCDF4 unsigned types are not compatible with any released version of the CF specification (see list of acceptable data types here: http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_data_types).

So it's not exactly as simple as you'd like it to be.

@shoyer
Copy link
Member

shoyer commented Jun 7, 2017

HDF5 doesn't come without its own draw backs...such as vastly more complicated (and admittedly more capable) on-disk format.

Yes, of course. But I thought netCDF4-extended meant NetCDF on HDF5, making use of all the HD5 features. So I didn't think that was the tradeoff here.

Also, while unsigned types are a feature of the netCDF4/HDF5, strictly speaking, the netCDF4 unsigned types are not compatible with any released version of the CF specification

Sure, though if you're strictly following CF, you probably would use valid_min/valid_max/valid_range instead of _Unsigned...

Anyways, again I don't really have an objection here. This attribute can be interpreted in a pretty unambiguous way, so adding support for this would be mostly harmless.

@deeplycloudy
Copy link
Contributor Author

deeplycloudy commented Jun 7, 2017

Ah, I see the distinction concerning the xarray implementation being made in the earlier discussion. There certainly are some tradeoffs being made by the data provider, and I'm far enough removed from those decisions that there's not much I can do.

Thanks for the clarification on the background discussion and the willingness to consider support for the attribute. It's my first time dealing with xarray internals, but given that there's a reference implementation in NetCDF4-python, I'm willing to have a go at a PR for this.

If I had to take a guess, it belongs in conventions.py, fitting with the logic beginning line 789 where the mask_and_scale is handled. It could be built into the MaskedAndScaledArray class, though perhaps a new UnsignedIntArray class might be preferred?

The docstrings note that the implementation is designed around CF-style data. As noted above, this attribute is outside the CF conventions, so I might note that exception in a comment in the code.

@deeplycloudy
Copy link
Contributor Author

deeplycloudy commented Jun 20, 2017

I'm dropping a quick note here to flag that PR #1453 has been written to address this issue. I have used the draft PR in earnest as part of some ongoing analyses of the data that led me to raise the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants