-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for expver dimension #56
Conversation
Hey @alxmrs We for the majority of the data files this fix seems working (Verified after back filling the data). But following two variables are hitting the assertion
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. If this has been tested on realistic data, than it makes sense to me.
src/arco_era5/source_data.py
Outdated
@@ -372,12 +372,12 @@ def _read_nc_dataset(gpath_file): | |||
# and: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Dataupdatefrequency # pylint: disable=line-too-long | |||
# for further details. | |||
|
|||
all_dims_except_time = tuple(set(dataarray.dims) - {"time"}) | |||
all_dims_except_time = tuple(set(dataarray.dims) - {"time", "expver"}) | |||
# Should have only trailing nans. | |||
a = dataarray.sel(expver=1).isnull().any(dim=all_dims_except_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to Alvaro ([email protected]), the fix should be:
dataarray.sel(expver=1, drop=True)
src/arco_era5/source_data.py
Outdated
@@ -372,12 +372,12 @@ def _read_nc_dataset(gpath_file): | |||
# and: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Dataupdatefrequency # pylint: disable=line-too-long | |||
# for further details. | |||
|
|||
all_dims_except_time = tuple(set(dataarray.dims) - {"time"}) | |||
all_dims_except_time = tuple(set(dataarray.dims) - {"time", "expver"}) | |||
# Should have only trailing nans. | |||
a = dataarray.sel(expver=1).isnull().any(dim=all_dims_except_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly the reason why this was not working is that dataarray.sel(expver=1)
and dataarray.sel(expver=5)
,
even though they no longer have the "expver" dim, they both remember which expver coordinate they came from (they keep a scalar expver
coordinate), and then they complain when running arithmetics between them because the scalar coordinate don't match. Then by adding "expver" to the dimensions to reduce the scalar coordinate diappears too. However this is unintuitive, because the propose code tries to reduce across a dim, which no longer exists.
I believe the right fix, rather than setting:
all_dims_except_time = tuple(set(dataarray.dims) - {"time", "expver"})
should be to use:
dataarray.sel(expver=1, drop=True)
dataarray.sel(expver=5, drop=True)
If you confirm this alternative fix works, this would be preferable.
I think the right fix for this should be:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking more with Alvaro here: I think these errors only crop up if we download ERA5T, not ERA5. If we simply updated the data, or put limits into how recently we were updating the data, I believe these issues would go away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've downloaded the data for 2023 on 21-22 September. Will that cause any trouble?
Tried below snipped giving an error.
all_dims_except_time = tuple(set(dataarray.dims) - {"time"})
a = dataarray.sel(expver=1, drop=True).isnull().any(dim=all_dims_except_time)
b = dataarray.sel(expver=5, drop=True).isnull().any(dim=all_dims_except_time)
disjoint_nans = bool(next(iter((a ^ b).all().data_vars.values())))
assert disjoint_nans, "The nans are not disjoint in expver=1 vs 5"
dataarray = dataarray.sel(expver=1).combine_first(dataarray.sel(expver=5))
Error
ValueError: 'expver' not found in array dimensions ('time', 'latitude', 'longitude')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you indicate which line is throwing the error?
Perhaps the problem is that this file has an "expver" coordinate, but not an "expver" dim, which is something we have not found before, but can probably be solved by changing:
if "expver" in dataarray.coords:
# previous code
to
if "expver" in dataarray.dims:
# previous code
elif "expver" in dataarray.coords:
dataarray = dataarray.drop_vars('expver')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've downloaded the data for 2023 on 21-22 September. Will that cause any trouble?
In short, yes I do think this is the root issue. Our goal with keeping data up-to-date is to have a ~1-2 month delay in freshness exactly for this reason. Unless we plan to re-download and update ERA5T data to ERA5 (I don't recommend this), then it would be best to change the interval at which we ingest data to avoid the case that this PR is trying to solve for.
@alxmrs After verifying the code for April, May and June 2023 data, It seems that the changes are required for data ingestion. As the data for April and May also downloaded on 21st September so it should be fresh enough for the process further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Thanks for the discussion in this PR. |
Added the fix for
expver
dimension for recent months data.