xr.open_dataset() reading ubyte variables as float32 from DAP server #7782

Open

Articoking opened this issue Apr 24, 2023 · 12 comments

@Articoking (Contributor)
What happened?

I am trying to open and save a netCDF file through CEDA's DAP server (http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc) whose variables scfv and scfv_unc are of type ubyte. The file's DDS is as follows:

Dataset {
    Float64 lat_bnds[lat = 18000][nv = 2];
    Float64 lon_bnds[lon = 36000][nv = 2];
    Float32 time[time = 1];
    Float64 lat[lat = 18000];
    Float64 lon[lon = 36000];
    Grid {
     ARRAY:
        Byte scfv[time = 1][lat = 18000][lon = 36000];
     MAPS:
        Float32 time[time = 1];
        Float64 lat[lat = 18000];
        Float64 lon[lon = 36000];
    } scfv;
    Grid {
     ARRAY:
        Byte scfv_unc[time = 1][lat = 18000][lon = 36000];
     MAPS:
        Float32 time[time = 1];
        Float64 lat[lat = 18000];
        Float64 lon[lon = 36000];
    } scfv_unc;
    Int32 spatial_ref;
} neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc;

Its DAS sets the attribute _Unsigned to "true":

Attributes {
    lat_bnds {
    }
    lon_bnds {
    }
    time {
        String axis "T";
        String standard_name "time";
        String long_name "time";
        String calendar "standard";
        String units "hours since 1950-01-01 00:00:00";
        Int32 _ChunkSizes 1024;
    }
    lat {
        String axis "Y";
        String standard_name "latitude";
        String long_name "WGS84 latitude coordinates, center of pixel";
        String units "degrees_north";
        Float64 actual_range -89.995, 89.995;
        String bounds "lat_bnds";
        Int32 _ChunkSizes 18000;
    }
    lon {
        String axis "X";
        String standard_name "longitude";
        String long_name "WGS84 longitude coordinates, center of pixel";
        String units "degrees_east";
        Float64 actual_range -179.995, 179.995;
        String bounds "lon_bnds";
        Int32 _ChunkSizes 36000;
    }
    scfv {
        String _Unsigned "true";
        Int16 _FillValue -1;
        String standard_name "snow_area_fraction_viewable_from_above";
        String long_name "Snow Cover Fraction Viewable";
        String units "percent";
        Int16 valid_range 0, -2;
        Byte actual_range 0, 100;
        Int16 flag_values -51, -50, -46, -41, -4, -3, -2;
        String flag_meanings "Cloud Polar_Night_or_Night Water Permanent_Snow_and_Ice Classification_failed Input_Data_Error No_Satellite_Acquisition";
        Int16 missing_value -1;
        String ancillary_variables "scfv_unc";
        String grid_mapping "spatial_ref";
        Int32 _ChunkSizes 1, 1385, 2770;
    }
    scfv_unc {
        String _Unsigned "true";
        Int16 _FillValue -1;
        String standard_name "snow_area_fraction_viewable_from_above standard_error";
        String long_name "Unbiased Root Mean Square Error for Snow Cover Fraction Viewable";
        String units "percent";
        Int16 valid_range 0, -2;
        Byte actual_range 0, 100;
        Int16 flag_values -51, -50, -46, -41, -4, -3, -2;
        String flag_meanings "Cloud Polar_Night_or_Night Water Permanent_Snow_and_Ice Classification_failed Input_Data_Error No_Satellite_Acquisition";
        Int16 missing_value -1;
        String grid_mapping "spatial_ref";
        Int32 _ChunkSizes 1, 1385, 2770;
    }
    ...

Using xr.open_dataset("http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc"), the mentioned variables are read as float32 instead of ubyte (or at least byte).

What did you expect to happen?

The returned Dataset should have scfv and scfv_unc with dtype ubyte.

Minimal Complete Verifiable Example

import xarray as xr
ds = xr.open_dataset("http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc")
print(ds)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

/usr/local/install/python-3.9/lib/python3.9/site-packages/xarray/conventions.py:523: SerializationWarning: variable 'scfv' has multiple fill values {-1, 255}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
/usr/local/install/python-3.9/lib/python3.9/site-packages/xarray/conventions.py:523: SerializationWarning: variable 'scfv_unc' has multiple fill values {-1, 255}, decoding all values to NaN.
  new_vars[k] = decode_cf_variable(
<xarray.Dataset>
Dimensions:      (lat: 18000, nv: 2, lon: 36000, time: 1)
Coordinates:
  * time         (time) datetime64[ns] 2010-01-01
  * lat          (lat) float64 -90.0 -89.98 -89.97 -89.97 ... 89.98 89.98 90.0
  * lon          (lon) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
Dimensions without coordinates: nv
Data variables:
    lat_bnds     (lat, nv) float64 ...
    lon_bnds     (lon, nv) float64 ...
    spatial_ref  int32 ...
    scfv         (time, lat, lon) float32 ...
    scfv_unc     (time, lat, lon) float32 ...
Attributes: (12/43)
    title:                           ESA CCI viewable snow product level L3C ...
    institution:                     ENVEO IT GmbH
    source:                          TERRA MODIS, Collection 6.1: calibrated ...
    history:                         2021-12-06: ESA snow_cci processing line...
    references:                      http://snow-cci.enveo.at/
    tracking_id:                     2be7fb10-d660-497c-99b9-086372ec6c83
    ...                              ...
    platform:                        TERRA
    sensor:                          MODIS
    spatial_resolution:              0.01 degree
    key_variables:                   scfv
    doi:                             10.5285/ebe625b6f77945a68bda0ab7c78dd76b
    DODS_EXTRA.Unlimited_Dimension:  time

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.15 (main, Nov 24 2022, 14:31:59)
[GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.80.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.11.0
pandas: 1.5.2
numpy: 1.23.5
scipy: 1.9.3
netCDF4: 1.6.2
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.3
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2022.11.1
distributed: None
matplotlib: 3.5.2
cartopy: 0.21.0
seaborn: 0.12.1
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: 0.20.1
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.2.2
conda: 22.9.0
pytest: 7.2.0
IPython: 7.33.0
sphinx: 5.3.0

@Articoking added the bug and needs triage labels on Apr 24, 2023

@kmuehlbauer (Contributor)

@Articoking

As both variables have a _FillValue attached, xarray converts those values to NaN, which effectively casts the data to float32 in this case.

You can inspect the .encoding property of the respective variables for information about the source dtype.

You can deactivate the automatic conversion by adding the kwarg mask_and_scale=False.

There is more information in the docs: https://docs.xarray.dev/en/stable/user-guide/io.html
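For illustration, the cast can be reproduced with a small in-memory dataset standing in for the remote scfv variable (the values here are synthetic, not taken from the file):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the DAP variable: int8 storage,
# _Unsigned="true", and a _FillValue of -1.
raw = xr.Dataset(
    {
        "scfv": (
            "x",
            np.array([0, 100, -1], dtype=np.int8),
            {"_Unsigned": "true", "_FillValue": np.int8(-1)},
        )
    }
)

# Default decoding: the fill value becomes NaN, so the data floats up.
decoded = xr.decode_cf(raw)
print(decoded["scfv"].dtype)  # a floating dtype, e.g. float32

# The source dtype is still recorded in .encoding.
print(decoded["scfv"].encoding["dtype"])

# Deactivating mask_and_scale keeps the raw integers
# (note it also skips the _Unsigned handling).
undecoded = xr.decode_cf(raw, mask_and_scale=False)
print(undecoded["scfv"].dtype)
```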

@Articoking (Contributor, Author)

Thank you for your quick reply. Adding the mask_and_scale=False kwarg solves the issue of conversion to float, but the resulting variable is of dtype int8 instead of uint8. Is there any way of making open_dataset() directly interpret the values as unsigned?

It would save me quite a lot of processing time, since using DataArray.astype(np.uint8) takes a while to run.

@kmuehlbauer (Contributor) commented Apr 24, 2023

Then you are somewhat deadlocked: mask_and_scale=False also deactivates the _Unsigned decoding.

You might be able to achieve what you want by using decode_cf=False (which deactivates CF decoding completely). Then you would have to remove the _FillValue and missing_value attributes from the variables. Finally, you can run xr.decode_cf(ds) to correctly decode your data.

I'll add a code example tomorrow if no one beats me to it.
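A sketch of that recipe, using a synthetic in-memory stand-in for the remote variable (in practice the first step would be ds = xr.open_dataset(url, decode_cf=False)):

```python
import numpy as np
import xarray as xr

# Stand-in for xr.open_dataset(url, decode_cf=False): raw int8 values
# plus the attributes that trigger the float32 cast.
ds = xr.Dataset(
    {
        "scfv": (
            "x",
            np.array([0, 100, -41, -1], dtype=np.int8),
            {
                "_Unsigned": "true",
                "_FillValue": np.int8(-1),
                "missing_value": np.int8(-1),
            },
        )
    }
)

# Remove the attributes that would force masking to NaN...
ds["scfv"].attrs.pop("_FillValue", None)
ds["scfv"].attrs.pop("missing_value", None)

# ...then decode: the _Unsigned handling still applies, yielding uint8.
decoded = xr.decode_cf(ds)
print(decoded["scfv"].dtype)   # uint8
print(decoded["scfv"].values)  # [  0 100 215 255]
```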

@Articoking (Contributor, Author)

Your suggestion worked perfectly, thank you very much! Avoiding using astype() reduced processing time massively

@dcherian (Contributor)

mask_and_scale=False will also deactivate the Unsigned decoding.

Do these two have to be linked? I wonder if we can handle the filling later:

        if "_FillValue" in attrs:
            new_fill = unsigned_dtype.type(attrs["_FillValue"])
            attrs["_FillValue"] = new_fill
elif data.dtype.kind == "u":
    if unsigned == "false":
        signed_dtype = np.dtype(f"i{data.dtype.itemsize}")
        transform = partial(np.asarray, dtype=signed_dtype)
        data = lazy_elemwise_func(data, transform, signed_dtype)
        if "_FillValue" in attrs:
            new_fill = signed_dtype.type(attrs["_FillValue"])
            attrs["_FillValue"] = new_fill

It seems like this code is setting fill values to the right type for CFMaskCoder, which is the next step:

if mask_and_scale:
    for coder in [
        variables.UnsignedIntegerCoder(),
        variables.CFMaskCoder(),
        variables.CFScaleOffsetCoder(),
    ]:
        var = coder.decode(var, name=name)

@dcherian added the topic-CF conventions label and removed the needs triage label on Apr 24, 2023
@dcherian reopened this on Apr 24, 2023
@kmuehlbauer (Contributor)

@dcherian Yes, that would work.

We would want to check the different attributes and apply the coders only as needed. That might need some refactoring. I've been wrapping my head around this for several weeks now.

@dcherian (Contributor) commented Apr 24, 2023

We would want to check the different attributes and apply the coders only as needed.

The current approach seems OK, no? It seems like the bug is that UnsignedIntegerCoder should be outside if mask_and_scale.

We would want to check the different attributes and apply the coders only as needed.

EDIT: I mean that each coder checks whether it is applicable, so we already do that

@kmuehlbauer (Contributor)

@dcherian The main issue here is that two different CF mechanisms are applied: _Unsigned and _FillValue/missing_value.

For netCDF4-python the values would just be masked and the dtype preserved; for xarray the data is cast to float32 because of _FillValue/missing_value.

I agree, moving the UnsignedIntegerCoder out of mask_and_scale should help in that particular case.

@kmuehlbauer (Contributor)

This is how netCDF4-python handles this data with different parameters:

import netCDF4 as nc
with nc.Dataset("http://dap.ceda.ac.uk/thredds/dodsC/neodc/esacci/snow/data/scfv/MODIS/v2.0/2010/01/20100101-ESACCI-L3C_SNOW-SCFV-MODIS_TERRA-fv2.0.nc") as ds_dap:
    v = ds_dap["scfv"]
    print(v)
    print("\n- default")
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
    print("\n- maskandscale False")
    ds_dap.set_auto_maskandscale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
    print("\n- mask/scale False")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
    print("\n- mask True / scale False")
    ds_dap.set_auto_mask(True)
    ds_dap.set_auto_scale(False)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
    print("\n- mask False / scale True")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
    print("\n- mask True / scale True")
    ds_dap.set_auto_mask(True)
    ds_dap.set_auto_scale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}")
    print("\n- maskandscale True")
    ds_dap.set_auto_mask(False)
    ds_dap.set_auto_scale(False)
    ds_dap.set_auto_maskandscale(True)
    v = ds_dap["scfv"]
    print(f"variable dtype: {v.dtype}")
    print(f"first 2 elements: {v[0, 0, :2].dtype} {v[0, 0, :2]}")
    print(f"last 2 elements: {v[0, 0, -2:].dtype} {v[0, 0, -2:]}") 
<class 'netCDF4._netCDF4.Variable'>
int8 scfv(time, lat, lon)
    _Unsigned: true
    _FillValue: -1
    standard_name: snow_area_fraction_viewable_from_above
    long_name: Snow Cover Fraction Viewable
    units: percent
    valid_range: [ 0 -2]
    actual_range: [  0 100]
    flag_values: [-51 -50 -46 -41  -4  -3  -2]
    flag_meanings: Cloud Polar_Night_or_Night Water Permanent_Snow_and_Ice Classification_failed Input_Data_Error No_Satellite_Acquisition
    missing_value: -1
    ancillary_variables: scfv_unc
    grid_mapping: spatial_ref
    _ChunkSizes: [   1 1385 2770]
unlimited dimensions: time
current shape = (1, 18000, 36000)
filling off

- default
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- maskandscale False
variable dtype: int8
first 2 elements: int8 [-41 -41]
last 2 elements: int8 [-41 -41]

- mask/scale False
variable dtype: int8
first 2 elements: int8 [-41 -41]
last 2 elements: int8 [-41 -41]

- mask True / scale False
variable dtype: int8
first 2 elements: int8 [-- --]
last 2 elements: int8 [-- --]

- mask False / scale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- mask True / scale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

- maskandscale True
variable dtype: int8
first 2 elements: uint8 [215 215]
last 2 elements: uint8 [215 215]

First, the dataset was created with filling off (read more about that in the netCDF file format specs: https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html). This should not be a problem for the analysis, but it tells us that the writer is expected to have written every data point explicitly.

As we can see from the above output, in netCDF4-python it is the scaling setting, not masking, that adapts the dtype to unsigned. This is also reflected in the docs: https://unidata.github.io/netcdf4-python/#Variable

If Xarray wants to align with netCDF4-python, it should separate mask and scale the way netCDF4-python does. It already does this internally by using different coders, but it doesn't separate them API-wise.

We would need a similar approach here for Xarray, with additional kwargs scale and mask alongside mask_and_scale. We cannot just move the UnsignedIntegerCoder out of mask_and_scale and apply it unconditionally.
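Until such kwargs exist, a user-level stopgap is to reinterpret the stored bytes with a zero-copy view rather than the (slower) astype. This is a hypothetical helper, not xarray API, and it assumes the variables are already numpy-backed (i.e. loaded into memory):

```python
import numpy as np
import xarray as xr

def reinterpret_unsigned(ds):
    """Hypothetical helper: apply the CF _Unsigned convention to
    numpy-backed signed-integer variables via a zero-copy view."""
    for name in list(ds.data_vars):
        var = ds[name]
        if var.attrs.get("_Unsigned") == "true" and var.dtype.kind == "i":
            utype = np.dtype(f"u{var.dtype.itemsize}")
            # .view() reinterprets the same bytes; no data is copied.
            ds[name] = (var.dims, np.asarray(var.values).view(utype), var.attrs)
    return ds

# Example with synthetic data, as if loaded with mask_and_scale=False:
ds = xr.Dataset(
    {"scfv": ("x", np.array([0, 100, -41], dtype=np.int8), {"_Unsigned": "true"})}
)
print(reinterpret_unsigned(ds)["scfv"].dtype)  # uint8
```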

@dcherian (Contributor)

Thanks for the in-depth investigation!

As we can see from the above output, in netCDF4-python it is the scaling setting, not masking, that adapts the dtype to unsigned. This is also reflected in the docs unidata.github.io/netcdf4-python/#Variable.

Do we know why this is so?

If Xarray wants to align with netCDF4-python, it should separate mask and scale the way netCDF4-python does. It already does this internally by using different coders, but it doesn't separate them API-wise.

👍

@kmuehlbauer (Contributor)

As we can see from the above output, in netCDF4-python it is the scaling setting, not masking, that adapts the dtype to unsigned. This is also reflected in the docs unidata.github.io/netcdf4-python/#Variable.

Do we know why this is so?

TL;DR: a NETCDF3-era detail that allows signaling unsigned integers, still used in recent formats.

  • more discussion of this over at Interpretation of reserved _Unsigned attribute written by netCDF-Java (Unidata/netcdf4-python#656)

  • from the NetCDF Users Guide on packed data:

    A conventional way to indicate whether a byte, short, or int variable is meant to be interpreted as unsigned, even for the netCDF-3 classic model that has no external unsigned integer type, is by providing the special variable attribute _Unsigned with value "true". However, most existing data for which packed values are intended to be interpreted as unsigned are stored without this attribute, so readers must be aware of packing assumptions in this case. In the enhanced netCDF-4 data model, packed integers may be declared to be of the appropriate unsigned type.

My suggestion would be to nudge the user by issuing warnings and linking to new, to-be-added documentation on the topic. This would be in line with the CF-coding conformance checks that were discussed yesterday in the dev meeting.
