Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fairly critical] extract_time() preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements #2505

Closed
valeriupredoi opened this issue Aug 7, 2024 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@valeriupredoi
Copy link
Contributor

extract_time() is causing an attribute loss that leads to fill_value not being taken into account in the next preprocessor, see example code below:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [extract_time(cube, **time_slice) for cube in cubes]

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}
for c in cubes:
    reg_cub = c.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))

toggle the time slicing and if it's on: one gets some infs in the mean computation with lazy data ie values of 1e+36 for data points, since the dataset looks like:

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

This is happening only with the new Dask; and not with the old 2024.7.1, see #2503

@valeriupredoi valeriupredoi added the bug Something isn't working label Aug 7, 2024
@valeriupredoi
Copy link
Contributor Author

I can't see where the issue is coming from since if I look at a cube (netCDF4 Dataset) before and after extract time, it looks the same from netCDF4's point of view:

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (1095, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

<class 'netCDF4._netCDF4.Variable'>
float32 ta(time, plev, lat, lon)
    standard_name: air_temperature
    long_name: Air Temperature
    units: K
    cell_methods: time: mean (interval: 1 month)
unlimited dimensions: 
current shape = (62, 2, 3, 2)
filling on, default _FillValue of 9.969209968386869e+36 used

so this means iris is missing something it needs

@valeriupredoi
Copy link
Contributor Author

the code above, edited for the last cube in the list:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cube = cubes[-1]
cubes = extract_time(cube, **time_slice)

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}

reg_cub = cube.regrid(**regrid_kwargs)
print(np.mean(reg_cub.data))

returns a valid numerical result - my brain is currently fried from trying to understand wth is going on in here, so am just gonna call it a day, for now.

@valeriupredoi valeriupredoi changed the title extract_time() preprocessor and new dask=2024.8.0 [fairly critical] extract_time() preprocessor, masked data and new dask=2024.8.0 return cubes with fill values instead of masked elements Aug 7, 2024
@valeriupredoi
Copy link
Contributor Author

OK got the bugger (almost)! It proves out if one constructs a list of cubes, then the bug creeps in:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]

regrid_kwargs = {
    'grid': c0,
    'scheme': iris.analysis.Nearest(),
}

for cube in cubes:
    reg_cub = cube.regrid(**regrid_kwargs)
    print(np.mean(reg_cub.data))

this will make the output from regrid to contain 1e36s instead of masked values for them!

@valeriupredoi
Copy link
Contributor Author

even simpler:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
c0 = cubes[0]
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
cubes = [cubes[-1]]
cubes = [extract_time(cube, **time_slice) for cube in cubes]

print(cubes[0].data)

that cube will have 1e36s inside its data instead of masked elements!

@valeriupredoi
Copy link
Contributor Author

OK - pushing the enemy even closer to the town center: it appears that the problem is with the sample data loader - saving the "problem" cube to disk, and loading from there ie:

import cf_units
import iris
import numpy as np
import pytest

from esmvalcore.preprocessor import extract_time

esmvaltool_sample_data = pytest.importorskip("esmvaltool_sample_data")

# cubes = esmvaltool_sample_data.load_timeseries_cubes(mip_table='day')
time_slice = {'start_year': 2001, 'end_year': 2002, 'start_month': 12, 'end_month': 2, 'start_day': 1, 'end_day': 1}
# iris.save(cubes[-1], "problem_cube.nc")
# cubes = [cubes[-1]]
cubes = [iris.load_cube("problem_cube.nc")]
cubes = [extract_time(cube, **time_slice) for cube in cubes]
print(cubes[0].data)

will result in correct behaviour of the masked bit

@valeriupredoi
Copy link
Contributor Author

OK the core_data() members differ significantly for those two cubes, even though they should be identical:

dask.array<concatenate, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(365, 2, 3, 2), chunktype=numpy.ndarray>
dask.array<array, shape=(1095, 2, 3, 2), dtype=float32, chunksize=(1095, 2, 3, 2), chunktype=numpy.ndarray>

@valeriupredoi
Copy link
Contributor Author

blithering Hell! I isolated the problem - it's an iris issue, here the MRE/MRC:

import iris
import numpy as np


c1 = iris.load_cube("cubb-1.nc")
c2 = iris.load_cube("cubb-2.nc")

# apply slice to concatenated cube
slicer = (
    np.random.choice(a=[False, True], size=(730,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None)
)

# can use this to slice each of cubes c1 and/or c2
slicer1 = (
    np.random.choice(a=[False, True], size=(365,)),
    slice(None, None, None),
    slice(None, None, None),
    slice(None, None, None)
)

cube = iris.cube.CubeList([c1, c2]).concatenate_cube()
cubes = cube[slicer]
print("After slicing", cubes.data)

Gonna close this, after I open an issue at iris 🍺

@valeriupredoi
Copy link
Contributor Author

issue in iris SciTools/iris#6109 - will close this, and open a separate issue that monitors progress on the iris issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants