Breakpoints found in DataFrame but not in NumpyArray with same data #318

Closed
jdkworld opened this issue Dec 15, 2023 · 3 comments

jdkworld commented Dec 15, 2023

I have this signal. When I pass it to Binseg as a pandas DataFrame, I get the correct breakpoints, but when I pass it as a NumPy array, no breakpoints are found.
Am I missing something? Why is the behaviour different? Could it be due to the way NaNs are handled in each case?
Also, when my DataFrame has two identical columns, no breakpoints are found.

signal.csv

import matplotlib.pyplot as plt
import pandas as pd
import ruptures as rpt

dataframe = pd.read_csv('signal.csv', header=None)

# as dataframe
signal = dataframe
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(signal)
result = algo.predict(pen=100)  
rpt.display(signal, result)
plt.show()
# WORKS

# as numpy array
signal = dataframe.values
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(signal)
result = algo.predict(pen=100)  
rpt.display(signal, result)
plt.show()
# DOES NOT WORK

# as dataframe with 2 columns with exactly the same data
signal = dataframe
signal['1'] = signal[0]
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(signal)
result = algo.predict(pen=100)  
rpt.display(signal, result)
plt.show()
# DOES NOT WORK
@oboulant
Collaborator

Hi @jdkworld,

Thanks for your interest in ruptures.

This is because of the NaN values in your series! ruptures expects users to handle missing data themselves. If the input series contains missing data, the behaviour is undefined.
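A quick way to confirm that missing values are the culprit is to count them before fitting. The snippet below is a minimal sketch: `count_missing` is a hypothetical helper (not part of ruptures), and the toy array stands in for the contents of signal.csv.

```python
import numpy as np


def count_missing(signal):
    """Return the number of NaN entries in an array-like signal.

    Hypothetical helper for illustration; it is not a ruptures function.
    """
    arr = np.asarray(signal, dtype=float)
    return int(np.isnan(arr).sum())


# Toy signal with made-up values, standing in for signal.csv.
toy = np.array([1.0, 2.0, np.nan, 4.0, np.nan])

n_missing = count_missing(toy)
if n_missing > 0:
    print(f"{n_missing} missing values: handle them before calling fit()")
```

The same check should work on a DataFrame, since `np.asarray` converts it to an array first.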

If you remove the missing data, the output "looks" fine.

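import numpy as np

# Convert to a float array, representing missing entries as NaN.
series = dataframe.to_numpy(dtype='float', na_value=np.nan)
print(f"Raw data : shape is {series.shape}")
# Drop the NaN entries (boolean indexing also flattens the array to 1-D).
series = series[~np.isnan(series)]
print(f"After removing the nans : shape is {series.shape}")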
algo = rpt.Binseg(model="normal", min_size=12*24*7, jump=12*24).fit(series)
result = algo.predict(pen=100)  
rpt.display(series, result)
plt.show()

which outputs
image

I hope this helps! Let us know!

Olivier

@jdkworld
Author

Hi Olivier,

Thanks a lot for your answer.
The signal is a time series, and I still want min_size and jump to correspond to the correct time periods, so simply removing the NaNs is not an option.
If I understand you correctly, I should instead fill in all missing data so that no NaN values are left and the time interval between steps is constant?

Josien

@oboulant
Collaborator

If you want to keep the time series' structure along the time axis, then yes, you have to fill the missing values with something.

There are many strategies for this (0.0, last known value, random draws from the series, mean or median over a particular time window, etc.), but the choice depends on your use case. It is a decision you have to make according to the underlying goal of the task you are trying to solve!

Hope it helps!
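A few of these strategies can be sketched with pandas as follows. This is only an illustration on a toy series with made-up values (standing in for signal.csv), and the window size of 3 is an arbitrary assumption. Each option keeps the series length unchanged, so min_size and jump still map to the same time periods.

```python
import numpy as np
import pandas as pd

# Toy series with gaps (made-up values, standing in for signal.csv).
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Strategy 1: fill with a constant.
filled_zero = s.fillna(0.0)

# Strategy 2: carry the last known value forward.
filled_ffill = s.ffill()

# Strategy 3: fill with the median of the whole series.
filled_median = s.fillna(s.median())

# Strategy 4: fill with the median of a centred rolling window.
# min_periods=1 lets a window produce a value as long as it contains
# at least one non-NaN observation.
rolling_median = s.rolling(window=3, center=True, min_periods=1).median()
filled_rolling = s.fillna(rolling_median)

for name, filled in [
    ("zero", filled_zero),
    ("ffill", filled_ffill),
    ("median", filled_median),
    ("rolling", filled_rolling),
]:
    print(name, filled.tolist())
```

Note that forward fill leaves leading NaNs untouched, and a rolling window leaves NaNs wherever a gap is wider than the window, so it is worth checking `filled.isna().any()` before passing `filled.to_numpy()` to ruptures.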

Olivier
