[FEATURE] Time Series Target Encoding Transformer #535

Open
canerturkseven opened this issue Sep 21, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@canerturkseven

Hi all,

I am a data scientist working mainly on time series problems. Usually, the best features are lags, rolling means and target encodings. I already have a transformer for Spark DataFrames that I use in my daily work to create these features.

I would like to contribute a time series target encoding transformer (for creating lags, rolling means and target encodings) for pandas as well.

For instance, I use the following class to create rolling means at the item, store and region levels with rolling windows of 13, 52 and 104 weeks, plus some skip periods to prevent data leakage.
[screenshot of the Spark transformer code]
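Roughly, the idea looks like this (a simplified sketch rather than the exact class in the screenshot; the helper function and column names are made up):

```python
# Simplified sketch, not the exact class from the screenshot: a Spark
# Window whose frame ends `skip_periods` rows before the current row,
# so the rolling mean never sees the most recent target values.
from pyspark.sql import functions as F
from pyspark.sql.window import Window


def add_rolling_mean(df, group_cols, order_col, target_col, window, skip_periods):
    w = (
        Window.partitionBy(*group_cols)
        .orderBy(order_col)
        .rowsBetween(-(skip_periods + window - 1), -skip_periods)
    )
    out_col = f"{target_col}_rolling_mean_{window}"
    return df.withColumn(out_col, F.avg(target_col).over(w))


# e.g. a 13-week rolling mean at item/store/region level with a 2-week gap:
# sales = add_rolling_mean(sales, ["item", "store", "region"], "week", "sales", 13, 2)
```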

The transformer I created is built on Spark's Window functionality and is meant to be used in the preprocessing step. However, I am willing to create one for scikit-learn pipelines.

If you also think this is a good idea, I would be happy to discuss the implementation.

Best,

@canerturkseven canerturkseven added the enhancement New feature or request label Sep 21, 2022
@koaning
Owner

koaning commented Sep 21, 2022

I'm interested in discussing this in more detail, but let's first go over a few nitpicks.

  1. In the future, please use the code feature of markdown to share code, not a screenshot. A screenshot does not allow one to search or copy/paste.
  2. Is a TargetEncoder the best name for such a component? It feels like LaggedFeatureEncoder or LaggedFeaturizer might be more specific to what the component does.
  3. Does the component output a pandas dataframe? Is that required? Would it be better to just output a numpy array instead? I can imagine the manual labor of keeping output_cols aligned with the other params becoming a source of human error.
  4. What does skip_periods do?
  5. Is it possible to make sure that the conventions in this tool are in sync with the RepeatingBasisFunctionTransformer?
  6. Is scikit-lego the best place for this transformer? Wouldn't it perhaps be better to add this to a tool that is specific to time series?

@canerturkseven
Author

Thanks a lot for your answer, Vincent. I am also a data scientist based in Amsterdam, and I really appreciate the work you have done on this package.

I have answered your questions below, but I want to ask one question first:

1- Calculating lags and rolling statistics is quite easy on the whole dataset. To make it usable in pipelines, the way I imagine it is:

  • Fit: we estimate these statistics for both the training and test dates.
  • Transform: we join (by date and any other grouping columns) to attach these statistics to the training and test rows.
    That makes it a bit cumbersome to use in pipelines (a rough sketch is below). I would really like your view on this: do you think this should still happen inside the pipeline, or should it belong to a separate preprocessing step?
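A simplified sketch of what I have in mind (the class name and parameters are just placeholders, not a proposed API; it assumes the target is available as a column during fit):

```python
# Placeholder sketch of the fit-then-join pattern described above;
# the class name and parameters are illustrative only.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupedRollingEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, groupby_cols, date_col, target_col, window=13, skip_periods=2):
        self.groupby_cols = groupby_cols
        self.date_col = date_col
        self.target_col = target_col
        self.window = window
        self.skip_periods = skip_periods

    def fit(self, X, y=None):
        # Estimate the rolling statistic per group, shifted by skip_periods
        # so a row only sees target values that are at least that old.
        df = X.sort_values(self.date_col)
        shifted = df.groupby(self.groupby_cols)[self.target_col].transform(
            lambda s: s.shift(self.skip_periods).rolling(self.window).mean()
        )
        self.stats_ = df[self.groupby_cols + [self.date_col]].assign(rolling_mean=shifted)
        return self

    def transform(self, X):
        # Join the precomputed statistics onto the incoming rows
        # by date and the grouping columns.
        return X.merge(self.stats_, on=self.groupby_cols + [self.date_col], how="left")
```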

Answers to your questions:
1- Yes, I will do so from now on. Thank you for the suggestion.
2- I thought this transformer could be used to calculate lags, rolling statistics and target encodings. Although target encodings are widely known in the industry, people may not expect to compute lags and rolling statistics with a transformer by that name. So I agree that a better name might be needed.
3- I don't think that is a requirement. I just thought that calculating those features would be easier with pandas. However, numpy output could work as well.
4- It lags the data before calculating the statistics. For example, if I am training a model that will make Week+2 predictions, then I need to calculate all of these statistics with a gap of 2 weeks to prevent data leakage (see the small example after this list).
5- Could you elaborate on the specific conventions used in the RepeatingBasisFunctionTransformer?
6- I believe lags/rolling statistics are fundamental features that can be used in any problem with a time dimension. I think this transformer would combine well with TimeGapSplit.
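A tiny example of the skip_periods idea from answer 4 (made-up numbers, pandas syntax):

```python
# Illustration only: for a Week+2 forecast, shift the target by 2 weeks
# before computing the rolling mean so the feature never uses the two
# weeks immediately before the prediction date.
import pandas as pd

df = pd.DataFrame({
    "week": pd.date_range("2022-01-03", periods=8, freq="W-MON"),
    "sales": [10, 12, 11, 13, 15, 14, 16, 18],
})
skip_periods = 2  # gap matching the forecast horizon
df["sales_rolling_mean_4"] = df["sales"].shift(skip_periods).rolling(4).mean()
```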

@glevv
Contributor

glevv commented Sep 23, 2022

Is it like the CatBoostEncoder with has_time=True?

Edit: CatBoostEncoder uses cumulative statistics rather than lags or window functions.

@canerturkseven
Author

As far as I know, CatBoost uses cross validation and regularization to encode high-cardinality categorical variables while preventing data leakage. The purpose is the same, but here there is no need for cross validation or regularization because we already have a time dimension. We just need to respect the time dimension to prevent data leakage (a small illustration below).
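Something along these lines (a simplified illustration with made-up data; each row's encoding only uses target values from strictly earlier weeks of the same store):

```python
# Illustration only: a time-respecting target encoding where each row's
# value is the expanding mean of the target over strictly earlier weeks
# of the same store, so no future information leaks in.
import pandas as pd

df = pd.DataFrame({
    "week": pd.date_range("2022-01-03", periods=6, freq="W-MON"),
    "store": ["a", "b", "a", "b", "a", "b"],
    "sales": [10, 20, 12, 18, 14, 22],
}).sort_values("week")

df["store_target_enc"] = (
    df.groupby("store")["sales"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
```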
