TimeSeriesDataset not feeding data in consecutive order #968

hjarraya · 2020-03-25T11:53:48Z

Currently, TimeSeriesDataset does not feed data to the model in consecutive time order if there is an interval gap in the data.
This causes poor results in LSTM-based models.

epa095 · 2020-03-30T07:31:27Z

Do you actually mean non-consecutive (i.e. non-sorted, e.g. 1,5,3,2), or do you mean that it has gaps (e.g. 1,2,5,6)? If it is the former, then that is reasonably easy to fix: each data provider should deliver its timeseries sorted, and then the timeseries dataset will give a sorted output.
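To make the two cases concrete, here is a minimal sketch using pandas. The helper names `make_index` and `describe_index` and the 10-minute resolution are assumptions made for illustration; none of them come from gordo.

```python
import pandas as pd


def make_index(steps, resolution="10min"):
    """Build a DatetimeIndex from integer step numbers (illustrative helper)."""
    base = pd.Timestamp("2020-01-01")
    step = pd.Timedelta(resolution)
    return pd.DatetimeIndex([base + n * step for n in steps])


def describe_index(index, resolution="10min"):
    """Report whether a time index is sorted and how many gaps it contains."""
    step = pd.Timedelta(resolution)
    is_sorted = bool(index.is_monotonic_increasing)
    # Gaps are only meaningful on a sorted index, so sort a copy first.
    diffs = index.sort_values().to_series().diff().dropna()
    n_gaps = int((diffs > step).sum())
    return {"sorted": is_sorted, "gaps": n_gaps}


unsorted_report = describe_index(make_index([1, 5, 3, 2]))
gapped_report = describe_index(make_index([1, 2, 5, 6]))
```

The first index is not sorted, while the second is sorted but is missing the rows between steps 2 and 5. Only the first case can be fixed by sorting in the data provider.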

If the problem is that it has holes, then it's a bit harder to change, as that is just how it is when there is row filtering. One could of course change the signature to return a list/stream of (X,y) pairs instead of a single (X,y) pair. But then what do you do when it is time to train the model? Most/many scikit-learn models do not support iterative training on several datasets, so you can fit them several times, but only the data from the last fit is used. Estimators which support iterative fitting implement partial_fit, but I don't think scikit-learn pipelines or TransformedTargetRegressor play nicely with these (that is something to investigate). If they work nicely with partial_fit, then one way forward could be to:

Change the signature of the dataset so it returns a list of (X,y) pairs.
Implement partial_fit in the gordo keras wrapper classes.
Change the builder so it uses partial_fit instead of fit (maybe dynamically, depending on whether it gets a single (X,y) pair or a list of them).
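The dynamic choice in the last step could look roughly like the sketch below. It uses a bare scikit-learn `SGDRegressor`, which does implement `partial_fit`; the helper `fit_on_segments` is hypothetical and is not the gordo builder, and the sketch says nothing about whether pipelines or TransformedTargetRegressor cooperate.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def fit_on_segments(model, segments):
    """Fit `model` on a list of (X, y) pairs (hypothetical helper).

    A single pair goes through the ordinary fit; several pairs are fed one
    at a time through partial_fit so that no segment is discarded.
    """
    if len(segments) == 1:
        X, y = segments[0]
        return model.fit(X, y)
    if not hasattr(model, "partial_fit"):
        raise TypeError(f"{type(model).__name__} does not implement partial_fit")
    for X, y in segments:
        model.partial_fit(X, y)
    return model


# Two synthetic segments standing in for the data on each side of a gap.
rng = np.random.default_rng(0)
true_coef = np.array([1.0, -2.0, 0.5])
segments = []
for n_rows in (50, 30):
    X = rng.normal(size=(n_rows, 3))
    segments.append((X, X @ true_coef))

model = fit_on_segments(SGDRegressor(random_state=0), segments)
```

Calling `fit` once per segment instead would retrain from scratch each time, which is the "only the last fit is used" problem described above.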

If pipelines do not support partial_fit nicely, then there is another alternative (but it's not pretty), and that is to try to identify the different segments inside the fit function of the gordo keras LSTM class, and call the tensorflow fit function iteratively on the identified sequences. The issue is, how does one identify the sequences? They can be identified from the index of the timeseries dataframe, but the problem is that the dataframe is long gone when we get down to the fit function, since the scalers return numpy arrays, not dataframes! There is a project for pandas-sklearn integration, https://github.com/scikit-learn-contrib/sklearn-pandas, which has a class DataFrameMapper that can make the scikit-learn stuff return dataframes instead, so that could be used. But how? Should it dynamically ("magically") wrap all the scikit-learn stuff in a provided model definition?
Alternatively, one can copy the index into the dataframe as a column (or create a new boolean column which just says whether the current row directly follows the previous one, or whether there is a gap between them), but then one must be careful not to start doing predictions on it.
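As a rough sketch of the boolean-column idea, assuming a sorted DatetimeIndex and a known resolution (10 minutes here), contiguous segments can be recovered with pandas alone. The function name `split_on_gaps` and the column name `tag` are made up for illustration.

```python
import pandas as pd


def split_on_gaps(df, resolution="10min"):
    """Split a dataframe with a sorted DatetimeIndex into contiguous segments."""
    step = pd.Timedelta(resolution)
    # True where a row directly follows the previous one. The first row has
    # no predecessor, so its diff is NaT and the comparison yields False.
    follows_previous = df.index.to_series().diff() <= step
    # Every row that does not follow its predecessor starts a new segment.
    segment_id = (~follows_previous).cumsum().to_numpy()
    return [segment for _, segment in df.groupby(segment_id)]


base = pd.Timestamp("2020-01-01")
index = pd.DatetimeIndex([base + pd.Timedelta(minutes=10 * n) for n in (1, 2, 5, 6)])
df = pd.DataFrame({"tag": [0.1, 0.2, 0.5, 0.6]}, index=index)

segments = split_on_gaps(df)
segment_lengths = [len(segment) for segment in segments]
```

Here the rows at steps 1,2 and 5,6 end up in two separate segments. If `follows_previous` were stored as a column instead, it would have to be dropped before the features reach the model, as noted above.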

Before starting on this, I would try to collect some good data on how big a problem this is on some real machines.

hjarraya assigned hjarraya and koropets Mar 25, 2020
