Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preprocessing.RepeatingBasisFunction encodes different labels as the same[BUG] #574

Open
yug030 opened this issue Aug 3, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@yug030
Copy link

yug030 commented Aug 3, 2023

Hi,

I'm working on a task to encode timestamps (more specifically, weekdays) and I noticed RepeatingBasisFunction does not work correctly with weekday=0 and weekday=6: the outputs for both are the same.

Here's the code snippet to reproduce the bug:

import pandas as pd
from sklego.preprocessing import RepeatingBasisFunction

start_date = '2023-06-25 00:00:00'
end_date = '2023-07-03 23:59:59'
timestamps = pd.date_range(start=start_date, end=end_date, freq='1D')
df = pd.DataFrame({'Timestamps': timestamps})
df['wkday'] = df['Timestamps'].dt.weekday
print(df)
rbf = RepeatingBasisFunction(n_periods=4, column="wkday", input_range=(0,6), remainder="drop")
rbf.fit(df)
print(rbf.transform(df))

The output for me is:

  Timestamps  wkday
0 2023-06-25      6
1 2023-06-26      0
2 2023-06-27      1
3 2023-06-28      2
4 2023-06-29      3
5 2023-06-30      4
6 2023-07-01      5
7 2023-07-02      6
8 2023-07-03      0

array([[1.        , 0.36787944, 0.01831564, 0.36787944],
       [1.        , 0.36787944, 0.01831564, 0.36787944],
       [0.64118039, 0.89483932, 0.16901332, 0.06217652],
       [0.16901332, 0.89483932, 0.64118039, 0.06217652],
       [0.01831564, 0.36787944, 1.        , 0.36787944],
       [0.16901332, 0.06217652, 0.64118039, 0.89483932],
       [0.64118039, 0.06217652, 0.16901332, 0.89483932],
       [1.        , 0.36787944, 0.01831564, 0.36787944],
       [1.        , 0.36787944, 0.01831564, 0.36787944]])

Update:
If I add 1 to "wkday" column (so it's 1 to 7 now) and use rbf = RepeatingBasisFunction(n_periods=7, column="wkday", input_range=(0,7), remainder="drop") and call fit and transform, I will get a square matrix with 1.0 on the diagonal which looks correct, but it's different from the documentation.

@yug030 yug030 added the bug Something isn't working label Aug 3, 2023
@FBruzzesi
Copy link
Collaborator

Hi, thank you for reporting the issue and sorry for the late reply!

This seems to be unexpected/buggy behavior. My understanding is that some assumptions have been made in the implementation and it may be worth to adjust or clearly state them.

I will take the time to have a closer look at it.

@Alex-Cremers
Copy link

It's a bit confusing, but I think that this is expected behavior. Theinput_range argument is to be understood as an interval that loops on itself at the boundaries, so if you provide (0,6) you assume that 6 is the same as 0, leaving only distinct 6 integers in the range. In this case, you want to provide (0,7), which translates to the interval [0, 7) and does contain 7 distinct integers.

If you think in terms of continuous variables rather than integer, it possibly makes more sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants