-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Best practice when the _Unsigned attribute is present in NetCDF files #1444
Comments
To clarify: I didn't want to let NetCDF4-python parse this attribute, because it would break an invariant in our data model. I'm not opposed to supporting this separately in xarray (using xarray's machinery for handling netcdf conventions). I will say though that this convention makes little sense to me for netcdf4/hdf5 files, which already provide native support for unsigned types, so separately I would also encourage this data provider to change their practices here :). |
HDF5 doesn't come without its own draw backs...such as vastly more complicated (and admittedly more capable) on-disk format. We should not presume to know better than the data providers what their trade-offs are. Also, while unsigned types are a feature of the netCDF4/HDF5, strictly speaking, the netCDF4 unsigned types are not compatible with any released version of the CF specification (see list of acceptable data types here: http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_data_types). So it's not exactly as simple as you'd like it to be. |
Yes, of course. But I thought netCDF4-extended meant NetCDF on HDF5, making use of all the HD5 features. So I didn't think that was the tradeoff here.
Sure, though if you're strictly following CF, you probably would use Anyways, again I don't really have an objection here. This attribute can be interpreted in a pretty unambiguous way, so adding support for this would be mostly harmless. |
Ah, I see the distinction concerning the xarray implementation being made in the earlier discussion. There certainly are some tradeoffs being made by the data provider, and I'm far enough removed from those decisions that there's not much I can do. Thanks for the clarification on the background discussion and the willingness to consider support for the attribute. It's my first time dealing with xarray internals, but given that there's a reference implementation in NetCDF4-python, I'm willing to have a go at a PR for this. If I had to take a guess, it belongs in The docstrings note that the implementation is designed around CF-style data. As noted above, this attribute is outside the CF conventions, so I might note that exception in a comment in the code. |
I'm dropping a quick note here to flag that PR #1453 has been written to address this issue. I have used the draft PR in earnest as part of some ongoing analyses of the data that led me to raise the issue. |
Some (large) data providers are writing NetCDF-4-extended files but using an
_Unsigned
attribute to indicate that a signed data type should be interpreted as unsigned bytes.Background: Unidata/netcdf4-python#656
From the background discussion above, it is my understanding that xarray does not honor the attribute because it’s not a part of the CF spec, is only mentioned as a proposed attribute in the NetCDF Best Practices, and because "xarray wants the
Variable
dtype to be the same as the dtype of the data returned."Taking the above as a given, it is necessary for xarray users encountering such variables to do the following after reading the data:
The un-scaling step can be saved by turning off auto mask and scale.
In order to automate the above process while still being able to use the functionality of
Dataset
, one approach might be to automatically perform the above steps on some known list of variables, and then reassign those variables to theDataset
. The downside is the need to read all variables up front, which could be expensive when processing large datasets where not all variables are needed.Is there another approach that would preserve lazy data loading, for instance by providing pre/post hooks for transformation functions at the
__getitem__
stage? Is there something I could do to help document that as a best practice?The text was updated successfully, but these errors were encountered: