[FEATURE] Time Series Target Encoding Transformer #535

Open
canerturkseven opened this issue Sep 21, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

@canerturkseven

Hi all,

I am a data scientist working mainly on time series problems. Usually, the best features are lags, rolling means and target encodings. I already have a transformer for Spark DataFrames that I use in my daily work to create these features.

I would like to contribute a time series target encoding transformer (for creating lags, rolling means and target encodings) for pandas as well.

For instance, I use the following class to create rolling means at the item, store and region levels with rolling windows of 13, 52 and 104 weeks, plus some skip periods to prevent data leakage.
[screenshot of the Spark transformer code]
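Roughly, the idea looks like this (a simplified sketch rather than the exact class in the screenshot; the helper function and column names are made up):

```python
# Simplified sketch, not the exact class from the screenshot: a Spark
# Window whose frame ends `skip_periods` rows before the current row,
# so the rolling mean never sees the most recent target values.
from pyspark.sql import functions as F
from pyspark.sql.window import Window


def add_rolling_mean(df, group_cols, order_col, target_col, window, skip_periods):
    w = (
        Window.partitionBy(*group_cols)
        .orderBy(order_col)
        .rowsBetween(-(skip_periods + window - 1), -skip_periods)
    )
    out_col = f"{target_col}_rolling_mean_{window}"
    return df.withColumn(out_col, F.avg(target_col).over(w))


# e.g. a 13-week rolling mean at item/store/region level with a 2-week gap:
# sales = add_rolling_mean(sales, ["item", "store", "region"], "week", "sales", 13, 2)
```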

The transformer I created is built on Spark's Window functionality and is meant to be used in the preprocessing step. However, I am willing to create one for scikit-learn pipelines.

If you also think this is a good idea, I would be happy to discuss the implementation.

Best,

@canerturkseven canerturkseven added the enhancement New feature or request label Sep 21, 2022
@koaning
Owner

koaning commented Sep 21, 2022

I'm interested in discussing this in more detail, but let's first go over a few nitpicks.

  1. In the future, please use the code feature of markdown to share code, not a screenshot. A screenshot does not allow one to search or copy/paste.
  2. Is a TargetEncoder the best name for such a component? It feels like LaggedFeatureEncoder or LaggedFeaturizer might be more specific to what the component does.
  3. Does the component output a pandas dataframe? Is that required? Would it be better to just output a numpy array instead? I can imagine the manual labor of keeping output_cols aligned with the other params becoming a source of human error.
  4. What does skip_periods do?
  5. Is it possible to make sure that the conventions in this tool are in sync with the RepeatingBasisFunctionTransformer?
  6. Is scikit-lego the best place for this transformer? Wouldn't it perhaps be better to add this to a tool that is specific to time series?

@canerturkseven
Author

Thanks a lot for your answer, Vincent. I am also a data scientist based in Amsterdam, and I really appreciate the work you have done on this package.

I have answered your questions below, but I want to ask one question first:

1- Calculating lags and rolling statistics is quite easy on the whole dataset. To make it usable in pipelines, the way I imagine it is:

  • Fit: we estimate these statistics for both the training and test dates.
  • Transform: we join (by date and any other grouping columns) to attach these statistics to the training and test rows.
    That makes it a bit cumbersome to use in pipelines (a rough sketch is below). I would really like your view on this: do you think this should still happen inside the pipeline, or should it belong to a separate preprocessing step?
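A simplified sketch of what I have in mind (the class name and parameters are just placeholders, not a proposed API; it assumes the target is available as a column during fit):

```python
# Placeholder sketch of the fit-then-join pattern described above;
# the class name and parameters are illustrative only.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupedRollingEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, groupby_cols, date_col, target_col, window=13, skip_periods=2):
        self.groupby_cols = groupby_cols
        self.date_col = date_col
        self.target_col = target_col
        self.window = window
        self.skip_periods = skip_periods

    def fit(self, X, y=None):
        # Estimate the rolling statistic per group, shifted by skip_periods
        # so a row only sees target values that are at least that old.
        df = X.sort_values(self.date_col)
        shifted = df.groupby(self.groupby_cols)[self.target_col].transform(
            lambda s: s.shift(self.skip_periods).rolling(self.window).mean()
        )
        self.stats_ = df[self.groupby_cols + [self.date_col]].assign(rolling_mean=shifted)
        return self

    def transform(self, X):
        # Join the precomputed statistics onto the incoming rows
        # by date and the grouping columns.
        return X.merge(self.stats_, on=self.groupby_cols + [self.date_col], how="left")
```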

Answers to your questions:
1- Yes, I will do so from now on. Thank you for the suggestion.
2- I thought this transformer could be used to calculate lags, rolling statistics and target encodings. Although target encodings are widely known in the industry, people may not expect to compute lags and rolling statistics with a transformer by that name. So I agree that a better name might be needed.
3- I don't think that is a requirement. I just thought that calculating those features would be easier with pandas. However, numpy output could work as well.
4- It lags the data before calculating the statistics. For example, if I am training a model that will make Week+2 predictions, then I need to calculate all of these statistics with a gap of 2 weeks to prevent data leakage (see the small example after this list).
5- Could you elaborate on the specific conventions used in the RepeatingBasisFunctionTransformer?
6- I believe lags/rolling statistics are fundamental features that can be used in any problem with a time dimension. I think this transformer would combine well with TimeGapSplit.
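A tiny example of the skip_periods idea from answer 4 (made-up numbers, pandas syntax):

```python
# Illustration only: for a Week+2 forecast, shift the target by 2 weeks
# before computing the rolling mean so the feature never uses the two
# weeks immediately before the prediction date.
import pandas as pd

df = pd.DataFrame({
    "week": pd.date_range("2022-01-03", periods=8, freq="W-MON"),
    "sales": [10, 12, 11, 13, 15, 14, 16, 18],
})
skip_periods = 2  # gap matching the forecast horizon
df["sales_rolling_mean_4"] = df["sales"].shift(skip_periods).rolling(4).mean()
```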

@glevv
Contributor

glevv commented Sep 23, 2022

Is it like the CatBoostEncoder with has_time=True?

Edit: CatBoostEncoder uses cumulative statistics rather than lags or window functions.

@canerturkseven
Author

As far as I know, CatBoost uses cross validation and regularization to encode high-cardinality categorical variables while preventing data leakage. The purpose is the same, but here there is no need for cross validation or regularization because we already have a time dimension. We just need to respect the time dimension to prevent data leakage (a small illustration below).
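Something along these lines (a simplified illustration with made-up data; each row's encoding only uses target values from strictly earlier weeks of the same store):

```python
# Illustration only: a time-respecting target encoding where each row's
# value is the expanding mean of the target over strictly earlier weeks
# of the same store, so no future information leaks in.
import pandas as pd

df = pd.DataFrame({
    "week": pd.date_range("2022-01-03", periods=6, freq="W-MON"),
    "store": ["a", "b", "a", "b", "a", "b"],
    "sales": [10, 20, 12, 18, 14, 22],
}).sort_values("week")

df["store_target_enc"] = (
    df.groupby("store")["sales"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
```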
