[FEATURE] Time Series Target Encoding Transformer #535
Comments
I'm interested in discussing this in more detail, but let's first discuss a few nitpicks.
Thanks a lot for your answer, Vincent. I am also a data scientist based in Amsterdam, and I really appreciate the work you did on this package. I have answered your questions below, but I want to ask a question first: 1- Calculating lags and rolling statistics is quite easy on the whole dataset. To be able to use them in pipelines, the way I thought of it is:
Answers to your questions:
Is it like Edit:
As far as I know, CatBoost uses cross-validation and regularization when encoding high-cardinality categorical variables, in order to prevent data leakage. The purpose here is the same, but cross-validation and regularization are not needed because we already have a time dimension. We just need to respect the time ordering to prevent data leakage.
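To make "respecting the time dimension" concrete, here is a minimal pandas sketch, not a proposed implementation. It encodes each row with the mean target of strictly earlier rows in the same category. The function name and the column names (`store`, `week`, `sales`) are hypothetical, and the sketch assumes a unique index and at most one row per category per timestamp.

```python
import pandas as pd


def expanding_target_encode(df, cat_col, time_col, target_col):
    """Mean of the target over strictly earlier rows of the same category."""
    # Order rows by time so that cumulative statistics only look backwards.
    ordered = df.sort_values(time_col, kind="stable")
    grouped = ordered.groupby(cat_col)[target_col]
    # Sum and count of the rows that come before the current one in its category.
    prior_sum = grouped.cumsum() - ordered[target_col]
    prior_count = grouped.cumcount()
    # The first row of each category has no history, so it is encoded as NaN.
    encoded = prior_sum / prior_count.where(prior_count > 0)
    return encoded.reindex(df.index)


# Toy data with hypothetical column names.
toy = pd.DataFrame(
    {
        "store": ["a", "b", "a", "a", "b"],
        "week": [1, 1, 2, 3, 2],
        "sales": [1.0, 0.0, 3.0, 5.0, 4.0],
    }
)
toy["sales_te"] = expanding_target_encode(toy, "store", "week", "sales")
```

Because every encoded value depends only on earlier rows, no fold splitting is needed. The rows with no history still need a fill strategy, for example a global prior.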
Hi all,
I am a data scientist who works mainly on time series problems. Usually, the best features are lags, rolling means, and target encodings. I already have a transformer for Spark DataFrames that I use in my daily work to create these features.
I would like to contribute a pandas version of this time series target encoding transformer, which would be used to create lags, rolling means, and target encodings.
For instance, I use the following class to create rolling means at the item, store, and region levels, with rolling windows of 13, 52, and 104 weeks and some skip periods to prevent data leakage.
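The class itself is not reproduced in this thread. As a rough illustration of the idea only, a pandas sketch of a per-group rolling mean with a skip period might look like the following. The function name, its parameters, and the column names are hypothetical, and the sketch assumes a unique index and one row per group per week with no gaps.

```python
import pandas as pd


def rolling_mean_with_skip(df, group_col, time_col, target_col, window, skip):
    """Per-group rolling mean over `window` rows, ending `skip` rows before each row."""
    ordered = df.sort_values([group_col, time_col])
    # Shifting by `skip` within each group keeps the current and most recent
    # targets out of the window, which is what prevents leakage.
    shifted = ordered.groupby(group_col)[target_col].shift(skip)
    rolled = shifted.groupby(ordered[group_col]).transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    # Return the feature aligned to the caller's original row order.
    return rolled.reindex(df.index)


# Toy weekly data with hypothetical column names.
weekly = pd.DataFrame(
    {
        "item": ["a"] * 5 + ["b"] * 3,
        "week": [1, 2, 3, 4, 5, 1, 2, 3],
        "sales": [10.0, 20.0, 30.0, 40.0, 50.0, 1.0, 2.0, 3.0],
    }
)
weekly["sales_rm2_skip1"] = rolling_mean_with_skip(
    weekly, "item", "week", "sales", window=2, skip=1
)
```

The same function could be called once per level (item, store, region) and per window (13, 52, 104) to build the full feature set. Data with missing weeks would need a time-based window instead of a row-based one.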
The transformer I created is built on Spark's Window functionality and is meant to be used in a preprocessing step. However, I am willing to create one for scikit-learn pipelines.
If you also think this is a good idea, I would be happy to discuss the implementation.
Best,