[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

valeriupredoi · 2024-08-07T16:37:48Z

extract_time() is causing an attribute loss that leads to fill_value not being taken into account in the next preprocessor, see example code below:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [extract_time(cube, **time_slice) for cube in cubes]

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}
for c in cubes:
    reg_cub = c.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))

toggle the time slicing and if it's on: one gets some infs in the mean computation with lazy data ie values of 1e+36 for data points, since the dataset looks like:

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

This is happening only with the new Dask; and not with the old 2024.7.1, see #2503

The text was updated successfully, but these errors were encountered:

valeriupredoi · 2024-08-07T16:40:25Z

I can't see where the issue is coming from since if I look at a cube (netCDF4 Dataset) before and after extract time, it looks the same from netCDF4's point of view:

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (62, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

so this means iris is missing something it needs

valeriupredoi · 2024-08-07T17:21:57Z

the code above, edited for the last cube in the list:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cube = cubes[-1]
cubes = extract_time(cube, **time_slice)

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}

reg_cub = cube.regrid(**regrid_kwargs)
print(np.mean(reg_cub.data))

returns a valid numerical result - my brain is currently fried from trying to understand wth is going on in here, so am just gonna call it a day, for now.

valeriupredoi · 2024-08-08T13:48:25Z

OK got the bugger (almost)! It proves out if one constructs a list of cubes, then the bug creeps in:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}

for cube in cubes:
    reg_cub = cube.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))

this will make the output from regrid to contain 1e36s instead of masked values for them!

valeriupredoi · 2024-08-08T13:51:01Z

even simpler:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]

print(cubes[0].data)

that cube will have 1e36s inside its data instead of masked elements!

valeriupredoi · 2024-08-08T13:56:07Z

OK - pushing the enemy even closer to the town center: it appears that the problem is with the sample data loader - saving the "problem" cube to disk, and loading from there ie:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

# cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
# iris.save(cubes[-1], "problem_cube.nc")
# cubes = [cubes[-1]]
cubes = [iris.load_cube("problem_cube.nc")]
cubes = [extract_time(cube, **time_slice) for cube in cubes]
print(cubes[0].data)

will result in correct behaviour of the masked bit

valeriupredoi · 2024-08-08T14:45:10Z

OK the core_data() members differ significantly for those two cubes, even though they should be identical:

dask.array<concatenate, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(365, 2, 3, 2), chunktype=numpy.ndarray>
dask.array<array, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(1095, 2, 3, 2), chunktype=numpy.ndarray>

valeriupredoi · 2024-08-08T15:42:31Z

blithering Hell! I isolated the problem - it's an iris issue, here the MRE/MRC:

import iris
import numpy as np


c1 = iris.load_cube("cubb-1.nc")
c2 = iris.load_cube("cubb-2.nc")

# apply slice to concatenated cube
slicer = (
    np.random.choice(a=[False, True], size=(730,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None)
)

# can use this to slice each of cubes c1 and/or c2
slicer1 = (
    np.random.choice(a=[False, True], size=(365,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None)
)

cube = iris.cube.CubeList([c1, c2]).concatenate_cube()
cubes = cube[slicer]
print("After slicing", cubes.data)

Gonna close this, after I open an issue at iris 🍺

valeriupredoi · 2024-08-08T16:00:06Z

issue in iris SciTools/iris#6109 - will close this, and open a separate issue that monitors progress on the iris issue

valeriupredoi added the bug Something isn't working label Aug 7, 2024

valeriupredoi changed the title ~~extract_time() preprocessor and new dask=2024.8.0~~ [fairly critical] extract_time() preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements Aug 7, 2024

valeriupredoi assigned valeriupredoi and schlunma Aug 7, 2024

valeriupredoi closed this as completed Aug 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

valeriupredoi commented Aug 7, 2024

valeriupredoi commented Aug 7, 2024

valeriupredoi commented Aug 7, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

[fairly critical] extract_time() preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

[fairly critical] extract_time() preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

Comments

valeriupredoi commented Aug 7, 2024

valeriupredoi commented Aug 7, 2024

valeriupredoi commented Aug 7, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

valeriupredoi commented Aug 8, 2024

[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

[fairly critical] `extract_time()` preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505