predict and forecast with polars dataframe, output ds has different time unit than source #703

Closed
deanm0000 opened this issue Nov 20, 2023 · 2 comments · Fixed by #705

deanm0000 commented Nov 20, 2023

What happened + What you expected to happen

When training from a polars dataframe, the time unit of the original df isn't used for predictions/forecasts. This means we can't join the two dfs.

One way to fix it would be the following. I see that there's this line in core.py:

    def _prepare_fit(...):
        ...
        self.df_constructor = type(df)

I think there needs to be something like

if self.df_constructor == pl_DataFrame:
    self.df_time_unit = df.select('ds').dtypes[0].time_unit

and then in:

            fcsts_df = pl.from_pandas(fcsts_df)

that should be

            fcsts_df = pl.from_pandas(fcsts_df).with_columns(ds=pl.col('ds').dt.cast_time_unit(self.df_time_unit))

I don't know if anything else needs to change, but that looks like it would fix it.
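
Putting the two pieces together, the change I have in mind would look roughly like this. It's only a sketch: apart from df_constructor and _prepare_fit, the surrounding method names and structure are my guesses, not the actual core.py code.

import polars as pl
from polars import DataFrame as pl_DataFrame

class CoreSketch:  # hypothetical stand-in for the class in core.py
    def _prepare_fit(self, df):
        self.df_constructor = type(df)
        if self.df_constructor == pl_DataFrame:
            # remember the source time unit ('ms', 'us' or 'ns')
            self.df_time_unit = df.select('ds').dtypes[0].time_unit

    def _finalize_forecast(self, fcsts_df):  # hypothetical name for the output step
        # fcsts_df is built internally as a pandas DataFrame
        if self.df_constructor == pl_DataFrame:
            fcsts_df = pl.from_pandas(fcsts_df).with_columns(
                ds=pl.col('ds').dt.cast_time_unit(self.df_time_unit)
            )
        return fcsts_df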

Versions / Dependencies

statsforecast version=1.6.0

polars version = 0.19.13

Reproduction script

I took the Electricity Load Forecast example and used polars from the start instead of pandas, like this:

import polars as pl
df = pl.read_csv('https://raw.githubusercontent.com/panambY/Hourly_Energy_Consumption/master/data/PJM_Load_hourly.csv')

df.columns = ['ds', 'y']

df = (
    df.with_columns(
        unique_id=pl.lit('PJM_Load_hourly'),
        ds=pl.col('ds').str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"),
    )
    .sort(['unique_id', 'ds'])
)

from statsforecast import StatsForecast
from statsforecast.models import MSTL, AutoARIMA, SeasonalNaive

mstl = MSTL(
    season_length=[24, 24 * 7], # seasonalities of the time series 
    trend_forecaster=AutoARIMA() # model used to forecast trend
)

sf = StatsForecast(
    models=[mstl], # model used to fit each time series 
    freq='H', # frequency of the data
)

sf = sf.fit(df=df)

prediction = sf.predict(h=24)
prediction
shape: (24, 3)
┌─────────────────┬─────────────────────┬──────────────┐
│ unique_id       ┆ ds                  ┆ MSTL         │
│ ---             ┆ ---                 ┆ ---          │
│ str             ┆ datetime[ns]        ┆ f32          │
╞═════════════════╪═════════════════════╪══════════════╡
│ PJM_Load_hourly ┆ 2002-01-01 01:00:00 ┆ 29956.744141 │
│ PJM_Load_hourly ┆ 2002-01-01 02:00:00 ┆ 29057.691406 │
│ PJM_Load_hourly ┆ 2002-01-01 03:00:00 ┆ 28654.699219 │
│ PJM_Load_hourly ┆ 2002-01-01 04:00:00 ┆ 28499.009766 │
│ …               ┆ …                   ┆ …            │
│ PJM_Load_hourly ┆ 2002-01-01 21:00:00 ┆ 36605.328125 │
│ PJM_Load_hourly ┆ 2002-01-01 22:00:00 ┆ 35468.988281 │
│ PJM_Load_hourly ┆ 2002-01-01 23:00:00 ┆ 33793.449219 │
│ PJM_Load_hourly ┆ 2002-01-02 00:00:00 ┆ 32008.271484 │
└─────────────────┴─────────────────────┴──────────────┘

Note how ds is a datetime[ns]

but

df
shape: (32_896, 3)
┌─────────────────────┬─────────┬─────────────────┐
│ ds                  ┆ y       ┆ unique_id       │
│ ---                 ┆ ---     ┆ ---             │
│ datetime[μs]        ┆ f64     ┆ str             │
╞═════════════════════╪═════════╪═════════════════╡
│ 1998-04-01 01:00:00 ┆ 22259.0 ┆ PJM_Load_hourly │
│ 1998-04-01 02:00:00 ┆ 21244.0 ┆ PJM_Load_hourly │
│ 1998-04-01 03:00:00 ┆ 20651.0 ┆ PJM_Load_hourly │
│ 1998-04-01 04:00:00 ┆ 20421.0 ┆ PJM_Load_hourly │
│ …                   ┆ …       ┆ …               │
│ 2001-12-31 21:00:00 ┆ 35082.0 ┆ PJM_Load_hourly │
│ 2001-12-31 22:00:00 ┆ 33890.0 ┆ PJM_Load_hourly │
│ 2001-12-31 23:00:00 ┆ 32590.0 ┆ PJM_Load_hourly │
│ 2002-01-01 00:00:00 ┆ 31569.0 ┆ PJM_Load_hourly │
└─────────────────────┴─────────┴─────────────────┘

in the source df it is datetime[μs].

That means we can't join them directly.
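
In the meantime I can work around it by casting the forecast back to the source's time unit before joining, e.g. (a sketch, assuming the source ds is datetime[μs] as above):

# cast the forecast's ds back to microseconds so the dtypes match
prediction = prediction.with_columns(pl.col('ds').dt.cast_time_unit('us'))

# now the join goes through
combined = df.join(prediction, on=['unique_id', 'ds'], how='outer')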

Issue Severity

Low: It annoys or frustrates me.

deanm0000 added the bug label Nov 20, 2023
@jmoralez
Member

Hey @deanm0000, thanks for using statsforecast and for the excellent example. I believe it's happening here

last_date_f = lambda x: pd.date_range(
    x + self.freq, periods=h, freq=self.freq
)
since for polars the dates are a numpy array, which is then plugged into a pandas date_range, and they lose their units there. We're currently working on changing those functions, so we'll make sure the units are preserved in the new version. Thanks for raising this.
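
A standalone sketch of that conversion path (not the statsforecast code, just reproducing the mechanism with made-up data):

import numpy as np
import pandas as pd
import polars as pl

# a polars datetime[us] column, like the one in the report
src = pl.DataFrame({'ds': ['2002-01-01 00:00:00']}).with_columns(
    pl.col('ds').str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
)
print(src.schema['ds'])  # Datetime(time_unit='us', time_zone=None)

# the last dates are pulled out as numpy and fed to pd.date_range,
# which builds nanosecond timestamps by default
last = src['ds'].to_numpy()[-1]  # numpy datetime64[us]
future = pd.date_range(last + np.timedelta64(1, 'h'), periods=3, freq='H')
fcst = pl.from_pandas(pd.DataFrame({'ds': future}))
print(fcst.schema['ds'])  # Datetime(time_unit='ns', time_zone=None)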

@deanm0000
Author

Yeah, I noticed it was building up a pandas df and then converting to polars. Of course, long term, if it were set up so that it only uses the library of the input, that'd be even better, but I figured my snippet was the fastest way to address the issue in the short term.

I was actually pleasantly surprised the library has polars recognition and support at all, so I appreciate where it is; don't take that as a complaint.
