Best practice when the _Unsigned attribute is present in NetCDF files #1444

deeplycloudy · 2017-06-06T19:05:07Z

Some (large) data providers are writing NetCDF-4-extended files but using an _Unsigned attribute to indicate that a signed data type should be interpreted as unsigned bytes.

Background: Unidata/netcdf4-python#656

From the background discussion above, it is my understanding that xarray does not honor the attribute because it’s not a part of the CF spec, is only mentioned as a proposed attribute in the NetCDF Best Practices, and because "xarray wants the Variable dtype to be the same as the dtype of the data returned."

Taking the above as a given, it is necessary for xarray users encountering such variables to do the following after reading the data:

scale_factor = data.encoding['scale_factor']
add_offset = data.encoding['add_offset']
unscale = ((data - add_offset)/scale_factor).data.astype(dtype).astype('float64')
fixed = unscale * scale_factor + add_offset

The un-scaling step can be saved by turning off auto mask and scale.

In order to automate the above process while still being able to use the functionality of Dataset, one approach might be to automatically perform the above steps on some known list of variables, and then reassign those variables to the Dataset. The downside is the need to read all variables up front, which could be expensive when processing large datasets where not all variables are needed.

Is there another approach that would preserve lazy data loading, for instance by providing pre/post hooks for transformation functions at the __getitem__ stage? Is there something I could do to help document that as a best practice?

The text was updated successfully, but these errors were encountered:

shoyer · 2017-06-07T07:34:05Z

To clarify: I didn't want to let NetCDF4-python parse this attribute, because it would break an invariant in our data model. I'm not opposed to supporting this separately in xarray (using xarray's machinery for handling netcdf conventions).

I will say though that this convention makes little sense to me for netcdf4/hdf5 files, which already provide native support for unsigned types, so separately I would also encourage this data provider to change their practices here :).

dopplershift · 2017-06-07T15:32:03Z

HDF5 doesn't come without its own draw backs...such as vastly more complicated (and admittedly more capable) on-disk format. We should not presume to know better than the data providers what their trade-offs are.

Also, while unsigned types are a feature of the netCDF4/HDF5, strictly speaking, the netCDF4 unsigned types are not compatible with any released version of the CF specification (see list of acceptable data types here: http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_data_types).

So it's not exactly as simple as you'd like it to be.

shoyer · 2017-06-07T15:39:55Z

HDF5 doesn't come without its own draw backs...such as vastly more complicated (and admittedly more capable) on-disk format.

Yes, of course. But I thought netCDF4-extended meant NetCDF on HDF5, making use of all the HD5 features. So I didn't think that was the tradeoff here.

Also, while unsigned types are a feature of the netCDF4/HDF5, strictly speaking, the netCDF4 unsigned types are not compatible with any released version of the CF specification

Sure, though if you're strictly following CF, you probably would use valid_min/valid_max/valid_range instead of _Unsigned...

Anyways, again I don't really have an objection here. This attribute can be interpreted in a pretty unambiguous way, so adding support for this would be mostly harmless.

deeplycloudy · 2017-06-07T16:25:01Z

Ah, I see the distinction concerning the xarray implementation being made in the earlier discussion. There certainly are some tradeoffs being made by the data provider, and I'm far enough removed from those decisions that there's not much I can do.

Thanks for the clarification on the background discussion and the willingness to consider support for the attribute. It's my first time dealing with xarray internals, but given that there's a reference implementation in NetCDF4-python, I'm willing to have a go at a PR for this.

If I had to take a guess, it belongs in conventions.py, fitting with the logic beginning line 789 where the mask_and_scale is handled. It could be built into the MaskedAndScaledArray class, though perhaps a new UnsignedIntArray class might be preferred?

The docstrings note that the implementation is designed around CF-style data. As noted above, this attribute is outside the CF conventions, so I might note that exception in a comment in the code.

deeplycloudy · 2017-06-20T18:35:11Z

I'm dropping a quick note here to flag that PR #1453 has been written to address this issue. I have used the draft PR in earnest as part of some ongoing analyses of the data that led me to raise the issue.

deeplycloudy mentioned this issue Jun 14, 2017

Automate interpretation of _Unsigned attribute #1453

Merged

4 tasks

jhamman added topic-backends contrib-help-wanted labels Jun 28, 2017

shoyer closed this as completed in #1453 Jul 28, 2017

fujiisoup mentioned this issue Aug 31, 2017

WIP: indexing with broadcasting #1473

Closed

4 tasks

forman mentioned this issue Sep 20, 2017

Support for unsigned data #1579

Closed

fujiisoup mentioned this issue Oct 19, 2017

indexing with broadcasting #1639

Merged

4 tasks

aleksandervines mentioned this issue Jan 11, 2018

Check CF-compliance of test files nansencenter/nansat#184

Closed

snowman2 mentioned this issue Sep 13, 2022

BUG: Support integer data types for netCDF _Unsigned convention OSGeo/gdal#6352

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Best practice when the _Unsigned attribute is present in NetCDF files #1444

Best practice when the _Unsigned attribute is present in NetCDF files #1444

deeplycloudy commented Jun 6, 2017 •

edited

Loading

shoyer commented Jun 7, 2017

dopplershift commented Jun 7, 2017

shoyer commented Jun 7, 2017

deeplycloudy commented Jun 7, 2017 •

edited

Loading

deeplycloudy commented Jun 20, 2017 •

edited

Loading

Best practice when the _Unsigned attribute is present in NetCDF files #1444

Best practice when the _Unsigned attribute is present in NetCDF files #1444

Comments

deeplycloudy commented Jun 6, 2017 • edited Loading

shoyer commented Jun 7, 2017

dopplershift commented Jun 7, 2017

shoyer commented Jun 7, 2017

deeplycloudy commented Jun 7, 2017 • edited Loading

deeplycloudy commented Jun 20, 2017 • edited Loading

deeplycloudy commented Jun 6, 2017 •

edited

Loading

deeplycloudy commented Jun 7, 2017 •

edited

Loading

deeplycloudy commented Jun 20, 2017 •

edited

Loading