First lag day is same as first forecast day #256

ampersandmcd · 2024-05-02T07:15:19Z

Lines 463 to 464 in f128536

    
           pd.Timestamp(forecast_date - dt.timedelta(days=n)) 
        
           for n in range(num_channels)

Howdy there! I've been running a toy experiment with one-day lag, one-day forecasts to debug my diffusion model code and seem to be getting predictors that are from the same day as targets. I think the issue originates from this line of code: when num_channels is 1, n only ever takes the value of 0, and hence we are left with forecast_date - dt.timedelta(0) meaning we get predictors for the same day as the target forecast date. This allows the model to "cheat" in my case as the target siconca is the same as the predictor siconca.

Fixing this would just require L463 to be pd.Timestamp(forecast_date - dt.timedelta(days=n+1)) I think.

Below is a script I wrote to confirm this behaviour and a few screenshots from my debugging pane if they are helpful!

import os
import pandas as pd
from icenet.data.processors.era5 import IceNetERA5PreProcessor
from icenet.data.processors.meta import IceNetMetaPreProcessor
from icenet.data.processors.osi import IceNetOSIPreProcessor
from icenet.data.loaders import IceNetDataLoaderFactory

if "datasets.ipynb" in os.listdir():
    os.chdir("../")
print("Running in {}".format(os.getcwd()))

processing_dates = dict(
    train=[pd.to_datetime(el) for el in pd.date_range("2020-01-01", "2020-01-01")],
    val=[pd.to_datetime(el) for el in pd.date_range("2020-01-01", "2020-01-01")],
    test=[pd.to_datetime(el) for el in pd.date_range("2020-01-01", "2020-01-01")],
)
processed_name = "single_day"

pp = IceNetERA5PreProcessor(
    ["uas", "vas"],
    ["tas", "zg500", "zg250"],
    processed_name,
    processing_dates["train"],
    processing_dates["val"],
    processing_dates["test"],
    linear_trends=tuple(),
    north=False,
    south=True
)
osi = IceNetOSIPreProcessor(
    ["siconca"],
    [],
    processed_name,
    processing_dates["train"],
    processing_dates["val"],
    processing_dates["test"],
    linear_trends=tuple(),
    north=False,
    south=True
)
meta = IceNetMetaPreProcessor(
    processed_name,
    north=False,
    south=True
)
pp.init_source_data(
    lag_days=1,
)
pp.process()
osi.init_source_data(
    lag_days=1,
)
osi.process()
meta.process()

implementation = "dask"
loader_config = "loader.single_day.json"
dataset_name = "single_day"
lag = 1

dl = IceNetDataLoaderFactory().create_data_loader(
    implementation,
    loader_config,
    dataset_name,
    lag,
    n_forecast_days=1,
    north=False,
    south=True,
    output_batch_size=1,
    generate_workers=1
)

x, y, sw = dl.generate_sample(pd.Timestamp("2020-12-01"))
diff = x[:, :, 2] - y.squeeze()
print(diff)  # should not be zero but is zero

Note in the screenshots below that forecast_date == channel_dates.

The text was updated successfully, but these errors were encountered:

JimCircadian · 2024-05-03T07:38:02Z

Hi @ampersandmcd thank you so much for such a detailed issue, perfectly described. There are two issues here.

Definitely, the first element of the predictor should not be the same as the input siconca! @bnubald and I will validate this on our own datasets and accept your PR.
The lead / lag issue is a separate issue of how we describe things. The "forecast date" is the date of initialisation for the forecast, so you want a lag of 0 to present the data for the forecast date (as that is the data for which you have data upon which you want to base the forecast, as opposed to the first date of the forecast period), and then the first output of the element is lead time == 1, e.g. the first date forecast after the forecast date. @bnubald and I were discussing this the other week and to be honest it's always been a source of confusion

The really key point, as you say, is that the first element of the output should not be the same as the input. I'll validate with my own dataset to ensure this is the case but your demo looks good, it's just the amendment we need to be careful with so the semantics of all these terms is clear(er) than it has been.

Really interested to hear your thoughts on this, thank you so much! This is why we need power users such as yourself! 😉

bnubald · 2024-05-03T08:50:05Z

Thanks for providing a detailed breakdown @ampersandmcd!

I'll have a look at this, as @JimCircadian mentioned, we'll have a look at the first aspect. The second point mentioned by @JimCircadian is something I've had confusion over understanding, so, some updates to the documentation along with visuals would be a great addition.

JimCircadian · 2024-05-05T19:35:36Z

I've confirmed this behaviour, but want to be clear that we have the correct fix. We need this to be clearly defined appropriate to what we're doing: there are plenty of definitions around lag and lead that are clear, the use of the terms and indexing are OK, we have the wrong data in when considering the "forecast date", which is totally a mess up on my part. My impression is that the forecast date is the forecast initialisation date: the date we have data to provide for, so we have a lag of zero on that date (the last day we have data for), then forecast ahead from 1 to n days. The X below marks the forecast initialisation date.

     .-.-.-.-.-.-.-.-.-.-.-.-.
LEAD | | | | |X|1|2|3|4|5|6|7|
     :-:-:-:-:-:-:-:-:-:-:-:-:
 LAG | |3|2|1|0| | | | | | | |
     '-'-'-'-'-'-'-'-'-'-'-'-'
              ^
      FORECAST_INIT_DATE

Therefore generating a sample for 2020-12-01 should result in the targets starting from 2020-12-2, which is where the bug is? @scotthosking I'm interested to hear whether I'm munging the terminology here, but let's put the right fix in, as @bnubald has mentioned this needs clearly documenting as it's always been a headache and now it's causing an actual problem (though I hope we would've caught it before doing any serious work on forecasting! 😉 )

ots22 · 2024-05-08T16:50:26Z

On the terminology: I remember being caught out by 'forecast date' before. Could it be worth switching to something less ambiguous, like forecast_init_date as you have above?

Is 'forecast time' standard terminology in this community? I know 'data time' and 'valid time' are used in weather forecasting to refer to these times, but 'forecast time' seems to be used less consistently from a quick glance around (appreciating that they might use different terms).

JimCircadian · 2024-05-09T07:39:29Z

Excellent, it's probably worth us doing that @ots22, no doubt. This was a topic of confusion for me too. I have come across data time and valid time in operational forecasting scenarios, so will have a review and we can settle the terms once and for all, and document appropriately.

JimCircadian · 2024-05-09T08:02:50Z

Part of this confusion I think comes from differences between certain forecast organisations and certain other ones, so we just need to settle on which approach to take. 😉

ampersandmcd · 2024-05-10T07:30:43Z

The WeatherBench documentation has a nice explainer between init_time and valid_time conventions, and the library looks like it's built to support both. Could be a good precedent to follow!

https://weatherbench2.readthedocs.io/en/latest/init-vs-valid-time.html

JimCircadian · 2024-05-10T09:36:05Z

I think we'll stick with init-time convention which is what @tom-andersson, I believe, originally intended. That also means less changes, but results in the notion that the forecast initialisation date we're specifying is in fact part of the forecast at step zero: "Each forecast starts with the atmospheric conditions at a specific 'initialization time'.".

We should definitely support both, so will raise that as a separate issue. Just to be clear, I'll accept your change to this @ampersandmcd as the "quick fix" to the bug, but we'll need to add a feature in to allow for specifying dates according to valid-time approach, which is what I think I was on.

So this is where we are now, which is probably what was meant to be happening all along. Oops

     .-.-.-.-.-.-.-.-.-.-.-.-.
LEAD | | | | |0|1|2|3|4|5|6|7|
     :-:-:-:-:-:-:-:-:-:-:-:-:
 LAG | |3|2|1| | | | | | | | |
     '-'-'-'-'-'-'-'-'-'-'-'-'
              ^
      FORECAST_INIT_DATE

ampersandmcd added the bug Something isn't working label May 2, 2024

ampersandmcd mentioned this issue May 2, 2024

Resolve issue 256-first-lag-day-is-same-as-first-forecast-day #257

Merged

ampersandmcd linked a pull request May 2, 2024 that will close this issue

Resolve issue 256-first-lag-day-is-same-as-first-forecast-day #257

Merged

bnubald mentioned this issue May 3, 2024

[DOCS] Include a breakdown of the concept of lag/forecast dates for icenet #258

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First lag day is same as first forecast day #256

First lag day is same as first forecast day #256

ampersandmcd commented May 2, 2024

JimCircadian commented May 3, 2024

bnubald commented May 3, 2024

JimCircadian commented May 5, 2024

ots22 commented May 8, 2024

JimCircadian commented May 9, 2024

JimCircadian commented May 9, 2024

ampersandmcd commented May 10, 2024

JimCircadian commented May 10, 2024

First lag day is same as first forecast day #256

First lag day is same as first forecast day #256

Comments

ampersandmcd commented May 2, 2024

JimCircadian commented May 3, 2024

bnubald commented May 3, 2024

JimCircadian commented May 5, 2024

ots22 commented May 8, 2024

JimCircadian commented May 9, 2024

JimCircadian commented May 9, 2024

ampersandmcd commented May 10, 2024

JimCircadian commented May 10, 2024