Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First lag day is same as first forecast day #256

Open
ampersandmcd opened this issue May 2, 2024 · 8 comments · Fixed by #257
Open

First lag day is same as first forecast day #256

ampersandmcd opened this issue May 2, 2024 · 8 comments · Fixed by #257
Labels
bug Something isn't working

Comments

@ampersandmcd
Copy link

pd.Timestamp(forecast_date - dt.timedelta(days=n))
for n in range(num_channels)

Howdy there! I've been running a toy experiment with one-day lag, one-day forecasts to debug my diffusion model code and seem to be getting predictors that are from the same day as targets. I think the issue originates from this line of code: when num_channels is 1, n only ever takes the value of 0, and hence we are left with forecast_date - dt.timedelta(0) meaning we get predictors for the same day as the target forecast date. This allows the model to "cheat" in my case as the target siconca is the same as the predictor siconca.

Fixing this would just require L463 to be pd.Timestamp(forecast_date - dt.timedelta(days=n+1)) I think.

Below is a script I wrote to confirm this behaviour and a few screenshots from my debugging pane if they are helpful!

import os
import pandas as pd
from icenet.data.processors.era5 import IceNetERA5PreProcessor
from icenet.data.processors.meta import IceNetMetaPreProcessor
from icenet.data.processors.osi import IceNetOSIPreProcessor
from icenet.data.loaders import IceNetDataLoaderFactory

if "datasets.ipynb" in os.listdir():
    os.chdir("../")
print("Running in {}".format(os.getcwd()))

processing_dates = dict(
    train=[pd.to_datetime(el) for el in pd.date_range("2020-01-01", "2020-01-01")],
    val=[pd.to_datetime(el) for el in pd.date_range("2020-01-01", "2020-01-01")],
    test=[pd.to_datetime(el) for el in pd.date_range("2020-01-01", "2020-01-01")],
)
processed_name = "single_day"

pp = IceNetERA5PreProcessor(
    ["uas", "vas"],
    ["tas", "zg500", "zg250"],
    processed_name,
    processing_dates["train"],
    processing_dates["val"],
    processing_dates["test"],
    linear_trends=tuple(),
    north=False,
    south=True
)
osi = IceNetOSIPreProcessor(
    ["siconca"],
    [],
    processed_name,
    processing_dates["train"],
    processing_dates["val"],
    processing_dates["test"],
    linear_trends=tuple(),
    north=False,
    south=True
)
meta = IceNetMetaPreProcessor(
    processed_name,
    north=False,
    south=True
)
pp.init_source_data(
    lag_days=1,
)
pp.process()
osi.init_source_data(
    lag_days=1,
)
osi.process()
meta.process()

implementation = "dask"
loader_config = "loader.single_day.json"
dataset_name = "single_day"
lag = 1

dl = IceNetDataLoaderFactory().create_data_loader(
    implementation,
    loader_config,
    dataset_name,
    lag,
    n_forecast_days=1,
    north=False,
    south=True,
    output_batch_size=1,
    generate_workers=1
)

x, y, sw = dl.generate_sample(pd.Timestamp("2020-12-01"))
diff = x[:, :, 2] - y.squeeze()
print(diff)  # should not be zero but is zero

Note in the screenshots below that forecast_date == channel_dates.

image image
@JimCircadian
Copy link
Member

Hi @ampersandmcd thank you so much for such a detailed issue, perfectly described. There are two issues here.

  1. Definitely, the first element of the predictor should not be the same as the input siconca! @bnubald and I will validate this on our own datasets and accept your PR.
  2. The lead / lag issue is a separate issue of how we describe things. The "forecast date" is the date of initialisation for the forecast, so you want a lag of 0 to present the data for the forecast date (as that is the data for which you have data upon which you want to base the forecast, as opposed to the first date of the forecast period), and then the first output of the element is lead time == 1, e.g. the first date forecast after the forecast date. @bnubald and I were discussing this the other week and to be honest it's always been a source of confusion

The really key point, as you say, is that the first element of the output should not be the same as the input. I'll validate with my own dataset to ensure this is the case but your demo looks good, it's just the amendment we need to be careful with so the semantics of all these terms is clear(er) than it has been.

Really interested to hear your thoughts on this, thank you so much! This is why we need power users such as yourself! 😉

@bnubald
Copy link
Collaborator

bnubald commented May 3, 2024

Thanks for providing a detailed breakdown @ampersandmcd!

I'll have a look at this, as @JimCircadian mentioned, we'll have a look at the first aspect. The second point mentioned by @JimCircadian is something I've had confusion over understanding, so, some updates to the documentation along with visuals would be a great addition.

@JimCircadian
Copy link
Member

I've confirmed this behaviour, but want to be clear that we have the correct fix. We need this to be clearly defined appropriate to what we're doing: there are plenty of definitions around lag and lead that are clear, the use of the terms and indexing are OK, we have the wrong data in when considering the "forecast date", which is totally a mess up on my part. My impression is that the forecast date is the forecast initialisation date: the date we have data to provide for, so we have a lag of zero on that date (the last day we have data for), then forecast ahead from 1 to n days. The X below marks the forecast initialisation date.

     .-.-.-.-.-.-.-.-.-.-.-.-.
LEAD | | | | |X|1|2|3|4|5|6|7|
     :-:-:-:-:-:-:-:-:-:-:-:-:
 LAG | |3|2|1|0| | | | | | | |
     '-'-'-'-'-'-'-'-'-'-'-'-'
              ^
      FORECAST_INIT_DATE

Therefore generating a sample for 2020-12-01 should result in the targets starting from 2020-12-2, which is where the bug is? @scotthosking I'm interested to hear whether I'm munging the terminology here, but let's put the right fix in, as @bnubald has mentioned this needs clearly documenting as it's always been a headache and now it's causing an actual problem (though I hope we would've caught it before doing any serious work on forecasting! 😉 )

@ots22
Copy link
Member

ots22 commented May 8, 2024

On the terminology: I remember being caught out by 'forecast date' before. Could it be worth switching to something less ambiguous, like forecast_init_date as you have above?

Is 'forecast time' standard terminology in this community? I know 'data time' and 'valid time' are used in weather forecasting to refer to these times, but 'forecast time' seems to be used less consistently from a quick glance around (appreciating that they might use different terms).

@JimCircadian
Copy link
Member

Excellent, it's probably worth us doing that @ots22, no doubt. This was a topic of confusion for me too. I have come across data time and valid time in operational forecasting scenarios, so will have a review and we can settle the terms once and for all, and document appropriately.

@JimCircadian
Copy link
Member

Part of this confusion I think comes from differences between certain forecast organisations and certain other ones, so we just need to settle on which approach to take. 😉

@ampersandmcd
Copy link
Author

The WeatherBench documentation has a nice explainer between init_time and valid_time conventions, and the library looks like it's built to support both. Could be a good precedent to follow!

@JimCircadian
Copy link
Member

I think we'll stick with init-time convention which is what @tom-andersson, I believe, originally intended. That also means less changes, but results in the notion that the forecast initialisation date we're specifying is in fact part of the forecast at step zero: "Each forecast starts with the atmospheric conditions at a specific 'initialization time'.".

We should definitely support both, so will raise that as a separate issue. Just to be clear, I'll accept your change to this @ampersandmcd as the "quick fix" to the bug, but we'll need to add a feature in to allow for specifying dates according to valid-time approach, which is what I think I was on.

So this is where we are now, which is probably what was meant to be happening all along. Oops

     .-.-.-.-.-.-.-.-.-.-.-.-.
LEAD | | | | |0|1|2|3|4|5|6|7|
     :-:-:-:-:-:-:-:-:-:-:-:-:
 LAG | |3|2|1| | | | | | | | |
     '-'-'-'-'-'-'-'-'-'-'-'-'
              ^
      FORECAST_INIT_DATE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants