
Fix for yearly resampling #69

Closed
pimmeerdink opened this issue Feb 27, 2024 · 10 comments · Fixed by #75

Comments

@pimmeerdink

Hey guys,

We have recently discovered a feature of the lilio code that leads to some undesired behaviour. We have implemented a solution and were wondering whether you might be interested in incorporating it into the package.

Basically, the problem comes down to the fact that lilio uses intervals which are closed on the right. The problem arises in two scenarios:

1. You have a target interval that should include dates up to the 31st of December; say the start point is the 1st of December. For example, intervals are defined as [2000-12-01, 2001-01-01), [2001-12-01, 2002-01-01), etc. The last interval will be [2015-12-01, 2016-01-01). When indexing this on our input array we encounter missing data, as we try to index 2016-01-01 but our data only runs until 2015-12-31. Hence, lilio discards the year 2015 as it cannot fit the data to the interval. The obvious solution, then, is to close the interval on both sides: our intervals become [2000-12-01, 2000-12-31].
2. The same problem arises when you have preprocessed target data with a single data point per year (say, the yield per year), to which you have assigned the timestamp of the first of October. The calendar should check [2015-10-01, 2015-10-02], but the last data point in 2015 is 2015-10-01; it cannot find any closing date that is >= 2015-10-02 and thus it discards the year 2015.
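A minimal sketch of scenario 1 (illustrative only, plain pandas rather than lilio): the last data point itself lies inside the closed-left interval, yet the year is dropped because the interval's right bound lies beyond the last timestamp of the data.

```python
import pandas as pd

# Closed-left interval as built for the last anchor year: [2015-12-01, 2016-01-01)
interval = pd.Interval(
    pd.Timestamp("2015-12-01"), pd.Timestamp("2016-01-01"), closed="left"
)
last_timestamp = pd.Timestamp("2015-12-31")  # last data point of the series

print(last_timestamp in interval)       # True: the data point fits the interval
print(interval.right > last_timestamp)  # True: the bound check drops 2015
```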

This is what we have implemented in our fork of lilio. What do you guys think?

Note that this fork also lets you request a yearly frequency via length="1Y".

Best,

Pim

@BSchilperoort
Contributor

Hi Pim,

Thanks for opening this issue. I am a bit confused by your statement:

the problem comes down to the fact that lilio uses intervals which are closed on the right

Because they are not supposed to be closed on the right. They are only supposed to be closed on the left:

    intervals.append(pd.Interval(left_date, right_date, closed="left"))

To avoid the exact issue that you described.

This means that the interval [2000-12-01 , 2001-01-01) should be equivalent to date >= 2000-12-01 and date < 2001-01-01.

Perhaps something is going wrong somewhere, looking at the unit tests I am not sure if this is working 100% correctly 🤔

By the way, square brackets denote a closed interval while round brackets denote an open interval.
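For reference, pandas makes the closed side explicit, so the intended closed-left semantics can be checked directly (a quick sketch, not lilio code):

```python
import pandas as pd

# Closed-left interval: left bound included, right bound excluded.
iv = pd.Interval(
    pd.Timestamp("2000-12-01"), pd.Timestamp("2001-01-01"), closed="left"
)
print(pd.Timestamp("2000-12-01") in iv)  # True: left bound is included
print(pd.Timestamp("2000-12-31") in iv)  # True: interior point
print(pd.Timestamp("2001-01-01") in iv)  # False: right bound is excluded
```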

@semvijverberg
Member

semvijverberg commented Feb 28, 2024

Hi Bart,

Thanks for the fast response! Sorry for the confusion, you're right. It should be:

the problem comes down to the fact that lilio uses intervals which are not closed on the right.

The issues described in examples 1 and 2 happen because the intervals are not closed on the right. Hence, we propose closing the intervals on both sides.

I did not mean for the PR to be opened right away; I wanted to first await your response. Since it is already open, I have rewritten the problem statement. The tests of #68 are failing because of linting / docstring checks. We can fix that later; first let's get the problem statement clear.

@BSchilperoort
Contributor

BSchilperoort commented Feb 29, 2024

OK. It seems that something is going wrong in the checks then.

The function that tries to find out if a date is within a certain interval should be computing this correctly:

lilio/lilio/resampling.py

Lines 112 to 133 in c0ef954

def _contains(interval_index: pd.IntervalIndex, timestamps) -> np.ndarray:
    """Check elementwise if the intervals contain the timestamps.

    Will return a boolean array of the shape (n_timestamps, n_intervals).

    Args:
        interval_index: An IntervalIndex containing all intervals that should be
            checked.
        timestamps: A 1-D array containing

    Returns:
        np.ndarray: 2-D mask array
    """
    if interval_index.closed_left:
        a = np.greater_equal(timestamps, interval_index.left.values[np.newaxis].T)
    else:
        a = np.greater(timestamps, interval_index.left.values[np.newaxis].T)
    if interval_index.closed_right:
        b = np.less_equal(timestamps, interval_index.right.values[np.newaxis].T)
    else:
        b = np.less(timestamps, interval_index.right.values[np.newaxis].T)
    return a & b

(note the "less" vs. "less_equal").

Perhaps this is going wrong in some other check, causing a year to be dropped when it should not be (e.g. the calendar map_years or map_to_data methods).
[2001-12-01 , 2002-01-01)
should be equivalent to
[2001-12-01 , 2001-12-31].
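A small self-contained sketch (not lilio code) of the same elementwise logic as the quoted _contains, using a closed-left IntervalIndex and the same broadcasting trick:

```python
import numpy as np
import pandas as pd

# Two closed-left intervals, built the way lilio builds them.
intervals = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2000-12-01", "2001-12-01"]),
    pd.to_datetime(["2001-01-01", "2002-01-01"]),
    closed="left",
)
timestamps = pd.to_datetime(["2000-12-31", "2001-01-01", "2001-12-31"]).values

# Compare every timestamp against every interval bound at once.
a = np.greater_equal(timestamps, intervals.left.values[np.newaxis].T)
b = np.less(timestamps, intervals.right.values[np.newaxis].T)  # open right side
mask = a & b
print(mask)
# [[ True False False]
#  [False False  True]]
```

Note that 2001-01-01 falls in neither interval: the right bound of the first interval is excluded by np.less.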

@BSchilperoort
Contributor

BSchilperoort commented Feb 29, 2024

Hi Pim, Sem,

We had a look and did find a small bug in the calendar mapping. That's fixed in #70

However, the behavior you described is actually the intended behavior. In lilio we do not try to infer the bounds for the time coordinates, and thus cannot know for sure what these are.

Let's say you have a dataset with time coordinates separated by 48 hours. The last data point is at 2015-12-30. How can Lilio properly infer that it can map an (open) interval with the right side of 2016-01-01 to this data point? Therefore we check if the final date of the input data exceeds the right bound of the last calendar interval.

lilio/lilio/calendar.py

Lines 485 to 495 in c0ef954

def _set_year_range_from_timestamps(self):
    min_year = self._first_timestamp.year  # type: ignore
    max_year = self._last_timestamp.year  # type: ignore
    # ensure that the input data could always cover the advent calendar
    # last date check
    if self._map_year(max_year).iloc[0].right > self._last_timestamp:
        max_year -= 1
    # first date check
    while self._map_year(min_year).iloc[-1].right <= self._first_timestamp:
        min_year += 1

To summarize: the problem lies in the fact that timestamps of data usually (but not always...) represent the start of the interval.
An interesting change would be to support CF-convention time bounds, where each time coordinate has a left and a right bound. In that case we would be able to determine without ambiguity how to map a calendar to input data.

For your case I'd advise padding the input data with an extra timestamp if you really need that last year of data.
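A sketch of that padding workaround (shown with pandas for brevity; the same idea works for xarray data via xr.concat): append one extra NaN timestamp so the right bound of the last calendar interval is covered and the year is no longer dropped.

```python
import numpy as np
import pandas as pd

# Daily data ending on 2015-12-31: the last interval's right bound
# (2016-01-01) is not reached, so the year 2015 would be dropped.
time = pd.date_range("2015-01-01", "2015-12-31", freq="D")
series = pd.Series(np.random.randn(time.size), index=time)

# Pad with a single NaN data point at the interval's right bound.
pad = pd.Series([np.nan], index=[pd.Timestamp("2016-01-01")])
padded = pd.concat([series, pad])
print(padded.index[-1])  # 2016-01-01 00:00:00
```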

@semvijverberg
Member

semvijverberg commented Mar 6, 2024

I'm not sure if I follow you.

I agree that, these:
[2001-12-01 , 2002-01-01)
[2001-12-01 , 2001-12-31].
should be identical, but currently they are not if your data ends in 2001 (even though you actually have the data to aggregate up to 2001-12-31, the year is dropped because 2002-01-01 cannot be found).

self._map_year(max_year).iloc[0].right is currently 2002-01-01 (because of the open interval), which is > self._last_timestamp which is 2001-12-31.

Hence, self._map_year(max_year).iloc[0].right > self._last_timestamp is True, and so max_year is decremented by 1, thereby completely dropping the year 2001 (even though the data to actually include 2001 is there).

By the way, this test does not cover the issue I am raising here, since self._last_timestamp of the data is not precisely at 2009-12-31.

Hope this helps to clarify. Maybe we should call sometime soon to discuss 'in-person'.

Thanks for the assistance and the recent PR @ClaireDons @BSchilperoort!

@BSchilperoort
Contributor

I agree that, these intervals should be identical, but currently they are not if your data ends in 2001

The problem is not in the interval code, but actually that we need to ensure that:

self._last_timestamp > self._map_year(max_year).iloc[0].right

Let's say you have these two sets of timestamps as the index of input data:

timeseries_a = [..., 2001-12-29, 2001-12-30, 2001-12-31].
timeseries_b = [..., 2001-12-14, 2001-12-21, 2001-12-28].

For both you would technically have enough data to fill the interval [2001-12-01, 2002-01-01), but you would have to assume or somehow infer that series A is a daily timeseries and series B is a weekly one. This can of course only work if your data is regularly spaced (or more specifically, regularly spaced in the Gregorian calendar).

Correctly inferring the frequency of a timeseries can be challenging and very finicky. For example, if you have a noleap calendar (many GCMs), there would be random gaps in a Gregorian calendar. Or your timeseries might not be regularly spaced due to missing data points, etc.

If you had a magical function infer_freq that could do this, then you could solve the issue with:

    input_data_frequency: pd.Timedelta = infer_freq(input_data)
    self._last_timestamp + input_data_frequency > self._map_year(max_year).iloc[0].right

But to err on the safe side, we currently just make sure that there is a data point beyond the right side of the last calendar interval.

The code currently does contain infer_input_data_freq, but that function takes the smallest step in the calendar to be cautious. Here we would want to actually take the largest step in the calendar if pandas.infer_freq or xarray.infer_freq don't work.
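A sketch of the proposed check using pandas' real pd.infer_freq on regularly spaced daily data (the right_bound value mirrors the example interval above):

```python
import pandas as pd

daily = pd.date_range("2001-12-01", "2001-12-31", freq="D")
step = pd.tseries.frequencies.to_offset(pd.infer_freq(daily))

right_bound = pd.Timestamp("2002-01-01")
# Current check: the data must reach the right bound itself -> 2001 is dropped.
print(daily[-1] >= right_bound)         # False
# Proposed check: last timestamp plus one inferred step covers the bound.
print(daily[-1] + step >= right_bound)  # True
```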

@semvijverberg
Member

Dear Bart,

Thanks for diving into this issue.

I don't mean to be annoying here, but I still believe that our solution is both on the safe side and solves our issue.

Minimal example testing data resolution of "1d" and "5d":

import numpy as np
import pandas as pd
import xarray as xr
import lilio

# Hide the full data when displaying a dataset in the notebook
xr.set_options(display_expand_data=False)

for FREQUENCY in [1, 5]:
    time_index = pd.date_range(start="2015-01-01", end="2016-12-31", freq=f"{FREQUENCY}d")
    time_coord = {"time": time_index}
    y = xr.DataArray(np.random.randn(time_index.size), coords=time_coord, name="target")
    calendar = lilio.Calendar(anchor="12-01")
    calendar.add_intervals("target", length="31d")
    calendar.map_to_data(y)
    y_resampled = lilio.resample(calendar, y)
    print(f"Available years {np.unique(y.time.dt.year)}")
    print(y_resampled)

Current implementation output:

Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 1, i_interval: 1)> Size: 8B
0.0008229
Coordinates:
  * anchor_year  (anchor_year) int64 8B 2015
  * i_interval   (i_interval) int64 8B 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 8B 2015-12-01
    right_bound  (anchor_year, i_interval) datetime64[ns] 8B 2016-01-01
    is_target    (i_interval) bool 1B True
Attributes:
    lilio_version:               0.4.2
    lilio_calendar_anchor_date:  12-01
    lilio_calendar_code:         Calendar(\n    anchor='12-01',\n    allow_ov...
    history:                     2024-03-21 13:13:31 UTC - Resampled with a L...
Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 1, i_interval: 1)> Size: 8B
0.1837
Coordinates:
  * anchor_year  (anchor_year) int64 8B 2015
  * i_interval   (i_interval) int64 8B 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 8B 2015-12-01
    right_bound  (anchor_year, i_interval) datetime64[ns] 8B 2016-01-01
    is_target    (i_interval) bool 1B True
Attributes:
    lilio_version:               0.4.2
    lilio_calendar_anchor_date:  12-01
    lilio_calendar_code:         Calendar(\n    anchor='12-01',\n    allow_ov...
    history:                     2024-03-21 13:13:31 UTC - Resampled with a L...

Our suggested implementation:

Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 2, i_interval: 1)> Size: 16B
-0.1375 0.1385
Coordinates:
  * anchor_year  (anchor_year) int64 16B 2015 2016
  * i_interval   (i_interval) int64 8B 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 16B 2015-12-01 2016...
    right_bound  (anchor_year, i_interval) datetime64[ns] 16B 2015-12-31 2016...
    is_target    (i_interval) bool 1B True
Attributes:
    lilio_version:               0.4.1
    lilio_calendar_anchor_date:  12-01
    lilio_calendar_code:         Calendar(\n    anchor='12-01',\n    allow_ov...
    history:                     2024-03-21 13:12:25 UTC - Resampled with a L...
Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 2, i_interval: 1)> Size: 16B
0.01442 -0.6337
Coordinates:
  * anchor_year  (anchor_year) int64 16B 2015 2016
  * i_interval   (i_interval) int64 8B 1
    left_bound   (anchor_year, i_interval) datetime64[ns] 16B 2015-12-01 2016...
    right_bound  (anchor_year, i_interval) datetime64[ns] 16B 2015-12-31 2016...
    is_target    (i_interval) bool 1B True
Attributes:
    lilio_version:               0.4.1
    lilio_calendar_anchor_date:  12-01
    lilio_calendar_code:         Calendar(\n    anchor='12-01',\n    allow_ov...
    history:                     2024-03-21 13:12:25 UTC - Resampled with a L...

The core adaptations are:

  1. The change from closed-left to closed-both intervals here.
  2. The adapted interval check here.

I know you're a smart guy @BSchilperoort, so maybe I'm missing something, but I still don't see it :p

@BSchilperoort
Contributor

Hi Sem, because of Felix's issue (#74) I now actually have a good idea on how to solve this.

Default mode:

  • Keep the calendar intervals open on the right
  • Use pd.infer_freq or xr.infer_freq to estimate the time step between data points (which should work if the data is regularly spaced)
  • Compare the right side of the last calendar interval to the last data point + 1*the inferred frequency.

"Greedy mode":

  • Compare the last data point to the left side of the last interval of the calendar.
    • I.e. if there is any data at all in the interval it's OK. Even if it's fewer data points than the other years.

The default mode should be OK for your use case, and the greedy mode should allow users to cover as many years as possible, at the cost of technically not correctly sampling the last interval.

To solve #74, the minimum year check should receive the same treatment. (As this issue is just about the max_year).

As an extra addition to Lilio, we could make it compatible with CF-convention-like bounds. Then we would know the left/right bounds of the data and could map the years and resample 100% correctly. However, that might take more time than it's worth.

@ClaireDons should be able to tackle this issue soon.
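The two proposed modes can be sketched with a small hypothetical helper (year_fits is not lilio API, just an illustration of the checks described above):

```python
import pandas as pd

def year_fits(last_timestamp, interval, freq, greedy=False):
    """True if the last calendar interval counts as covered by the data."""
    if greedy:
        # "Greedy mode": any data at all inside the interval is enough.
        return last_timestamp >= interval.left
    # Default mode: last data point plus one inferred step must reach the
    # (open) right bound of the interval.
    return last_timestamp + pd.tseries.frequencies.to_offset(freq) >= interval.right

interval = pd.Interval(
    pd.Timestamp("2015-12-01"), pd.Timestamp("2016-01-01"), closed="left"
)
last = pd.Timestamp("2015-12-31")
print(year_fits(last, interval, "D"))    # True: daily data covers the interval
print(year_fits(last, interval, "5D"))   # True: one 5-day step passes the bound
print(year_fits(pd.Timestamp("2015-12-02"), interval, "D", greedy=True))  # True
```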

@ClaireDons ClaireDons linked a pull request Jun 10, 2024 that will close this issue
@BSchilperoort
Contributor

#75 should close this issue.

We have addressed your problems in 2 ways:

  • We now try to infer the time bounds of the data.
  • You can now specify calendar.map_to_data(..., safe=False), where even "half full" intervals will still be generated.

@semvijverberg
Member

I think that second option calendar.map_to_data(..., safe=False) would be an alternative solution to the "closing intervals on both sides" solution that I implemented in our forked repo. I will try that out in the future! Would be nice to sync back with the AI4S2S OG repo.

This ticket can be closed. Thanks!
