Fix for yearly resampling #69
Comments
Hi Pim, Thanks for opening this issue. I am a bit confused that you state:
Because they are not supposed to be closed on the right. They are only supposed to be closed on the left: Line 380 in c0ef954
Line 390 in c0ef954
To avoid the exact issue that you described. This means that the interval [2000-12-01, 2001-01-01) should be equivalent to
Perhaps something is going wrong somewhere; looking at the unit tests, I am not sure if this is working 100% correctly 🤔
By the way, square brackets denote a closed interval while round brackets denote an open interval. |
Hi Bart, Thanks for the fast response! Sorry for the confusion, you're right. It should be: the problem comes down to the fact that lilio uses intervals which are not closed on the right. The issues described in examples 1 and 2 occur when the interval is not closed on the right. Hence, we propose to close the intervals on both sides. I did not mean for the PR to be opened right away; I wanted to await your response first. Since it is already open, I have rewritten the problem statement. The tests of #68 are failing because of linting / docstring tests. We can fix that later, but first let's get the problem statement clear. |
OK. It seems that something is going wrong in the checks then. The function that tries to find out if a date is within a certain interval should be computing this correctly: Lines 112 to 133 in c0ef954
(note the "less" vs. "less_equal"). Perhaps this is going wrong in some other check, causing a year to be dropped when it should not be (e.g. the calendar |
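For readers without the referenced lines at hand, the "less" vs. "less_equal" distinction can be sketched in plain Python. This is only an illustration of a left-closed, right-open membership test; the function name `in_left_closed` is hypothetical and this is not lilio's actual implementation.

```python
from datetime import date


def in_left_closed(ts: date, left: date, right: date) -> bool:
    """Return True if ts lies in [left, right): left included, right excluded."""
    # "less_equal" on the left bound, strict "less" on the right bound
    return left <= ts < right


left, right = date(2000, 12, 1), date(2001, 1, 1)

first_day_inside = in_left_closed(date(2000, 12, 1), left, right)   # left bound itself
last_day_inside = in_left_closed(date(2000, 12, 31), left, right)   # last day of December
bound_inside = in_left_closed(date(2001, 1, 1), left, right)        # right bound itself
```

Under this convention every day of December 2000 falls inside the interval, while 2001-01-01 belongs to whatever interval starts there.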
Hi Pim, Sem, We had a look and did find a small bug in the calendar mapping. That's fixed in #70. However, the behavior you described is actually the intended behavior. In lilio we do not try to infer the bounds of the time coordinates, and thus cannot know for sure what they are. Let's say you have a dataset with time coordinates separated by 48 hours, and the last data point is at 2015-12-30. How can Lilio properly infer that it can map an (open) interval with the right side of 2016-01-01 to this data point? Therefore we check whether the final date of the input data exceeds the right bound of the last calendar interval. Lines 485 to 495 in c0ef954
To summarize: the problem lies in the fact that timestamps of data usually (but not always...) represent the start of the interval. For your case I'd advise padding the input data with an extra timestamp if you really need that last year of data. |
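The padding advice can be sketched as follows. This is a rough standard-library illustration, assuming daily data and assuming the check is "last timestamp strictly beyond the right bound"; the helpers `last_interval_is_covered` and `pad_past` are hypothetical names, not part of lilio. In a real dataset the padded timestamps would carry missing values.

```python
from datetime import date, timedelta


def last_interval_is_covered(timestamps, right_bound):
    """Assumed check: the data must extend strictly beyond the right bound."""
    return max(timestamps) > right_bound


def pad_past(timestamps, right_bound, step):
    """Append timestamps (sorted input assumed) until one lies beyond right_bound."""
    padded = list(timestamps)
    while padded[-1] <= right_bound:
        padded.append(padded[-1] + step)
    return padded


step = timedelta(days=1)
daily = [date(2015, 12, 1) + i * step for i in range(31)]  # 2015-12-01 .. 2015-12-31
right_bound = date(2016, 1, 1)  # right side of [2015-12-01, 2016-01-01)

covered_before = last_interval_is_covered(daily, right_bound)
padded = pad_past(daily, right_bound, step)
covered_after = last_interval_is_covered(padded, right_bound)
```

Without padding the year 2015 fails the check; once the index extends past 2016-01-01 it passes.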
I'm not sure if I follow you. I agree that, these:
Hence, Btw, this test is not testing the issue I am raising here, since the data self._last_timestamp is not precisely at 12-31-2009. Hope this helps to clarify. Maybe we should call sometime soon to discuss 'in-person'. Thanks for the assistance and the recent PR @ClaireDons @BSchilperoort! |
The problem is not in the interval code, but rather that we need to ensure that:
self._last_timestamp > self._map_year(max_year).iloc[0].right
Let's say you have these two sets of timesteps as the index of the input data:
timeseries_a = [..., 2001-12-29, 2001-12-30, 2001-12-31]
timeseries_b = [..., 2001-12-14, 2001-12-21, 2001-12-28]
For both you would technically have enough data to fill the interval
Correctly inferring the frequency of a time series can be challenging and very finicky. For example, if you have a noleap calendar (many GCMs), there would be random gaps in a Gregorian calendar. Or your time series may not be regularly spaced due to missing data points, etc.
If you had a magical function that could do this, you could use
input_data_frequency: pd.Timedelta = infer_freq(input_data)
self._last_timestamp + input_data_frequency > self._map_year(max_year).iloc[0].right
to solve the issue. But to err on the safe side, we currently just make sure that there is a data point beyond the right side of the last calendar interval. The code currently does contain |
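As a rough illustration of the two checks applied to the two example series: the sketch below uses only the standard library, takes the frequencies as given rather than inferred (1 day and 7 days), and assumes the last calendar interval ends at 2002-01-01, since the interval itself is not shown in the thread. The function names are hypothetical.

```python
from datetime import date, timedelta


def current_check(last_timestamp, right_bound):
    """The safe check: a data point must lie beyond the right bound."""
    return last_timestamp > right_bound


def frequency_aware_check(last_timestamp, frequency, right_bound):
    """The frequency-based check exactly as written in the comment above."""
    return last_timestamp + frequency > right_bound


right_bound = date(2002, 1, 1)  # assumed right side of the last interval

a_last, a_freq = date(2001, 12, 31), timedelta(days=1)  # timeseries_a, daily
b_last, b_freq = date(2001, 12, 28), timedelta(days=7)  # timeseries_b, weekly

a_current = current_check(a_last, right_bound)
b_current = current_check(b_last, right_bound)
a_freq_aware = frequency_aware_check(a_last, a_freq, right_bound)
b_freq_aware = frequency_aware_check(b_last, b_freq, right_bound)
```

Both series fail the current check. With the strict comparison as written, the weekly series passes the frequency-aware check, while the daily series lands exactly on the bound and does not; whether a non-strict comparison would be appropriate depends on whether timestamps mark the start of their period, which is usual but not guaranteed.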
Dear Bart, Thanks for diving into this issue. I don't mean to annoy here, but I still believe that our solution is both on the safe side, as well as solving our issue. Minimal example testing data resolution of "1d" and "5d":
Current implementation output:
Our suggested implementation:
The core adaptions are: I know you're a smart guy @BSchilperoort so maybe I'm missing something, but I still don't see it:p |
Hi Sem, because of Felix's issue (#74) I now actually have a good idea on how to solve this. Default mode:
"Greedy mode":
The default mode should be OK for your use case, and the greedy mode should allow users to cover as many years as possible, at the cost of technically not correctly sampling the last interval. To solve #74, the minimum year check should receive the same treatment (as this issue is just about the max_year). As an extra addition to Lilio, we could make it compatible with CF-convention-like bounds. Then we would know the left/right bounds of the data and could map the years and resample 100% correctly. However, that might take more time than it's worth. @ClaireDons should be able to tackle this issue soon. |
#75 should close this issue. We have addressed your problems in 2 ways:
|
I think that second option This ticket can be closed. Thanks! |
Hey guys,
We have recently discovered a feature of the lilio code that leads to some undesired behaviour. We have implemented a solution and were wondering whether you might be interested in incorporating it into the package.
Basically, the problem comes down to the fact that lilio uses intervals which are not closed on the right. The problem arises in 2 scenarios:
1. You have a target interval that should include dates up to the 31st of December, with, say, the 1st of December as the start point. For example, intervals are defined as [2000-12-01, 2001-01-01), [2001-12-01, 2002-01-01), etc. The last interval will be [2015-12-01, 2016-01-01). When indexing this on our input array we encounter missing data, as we try to index 2016-01-01 but our data only runs until 2015-12-31. Hence, lilio discards the year 2015 as it cannot fit the data to the interval. The obvious solution, then, is to use intervals that are closed on both sides, so that our intervals become [2000-12-01, 2000-12-31].
2. The same problem arises when you have some preprocessed target data with a single datapoint per year (say, the yield per year), to which you have assigned the timestamp of the first of October. The calendar should check [2015-10-01, 2015-10-02], but the last datapoint in 2015 is 2015-10-01. It cannot find any closing date that is >= 2015-10-02, and thus it will discard the year 2015.
This is what we have implemented in our fork of lilio. What do you guys think?
Note that this fork also lets you request a 1Y frequency with length="1Y".
Best,
Pim