
[API] Redesign towards pytorch-forecasting 2.0 #1736

Open
fkiraly opened this issue Dec 20, 2024 · 32 comments
Labels: API design (API design & software architecture), enhancement (New feature or request)


fkiraly commented Dec 20, 2024

Discussion thread for the API re-design of pytorch-forecasting, for the next 1.X releases and towards 2.0. Comments appreciated from everyone!

Summary of discussion on Dec 20, 2024 and prior, about re-design of pytorch-forecasting.

FYI @agobbifbk, @thawn, @sktime/core-developers.

High-level directions:

High-level features for 2.0 with MoSCoW analysis:

  • M: unified model API which is easily extensible and composable, similar to sktime and DSIPTS, but as close to the pytorch level as possible. The API need not cover forecasters in general, only torch-based forecasters.
    • M: unified monitoring and logging API, also see [API] redesign of logging and monitoring for 2.0 #1700
    • M: extension templates need to be created
    • S: skbase can be used to curate the forecasters as records, with tags, etc
    • S: model persistence
    • C: third party extension patterns, so new models can "live" in other repositories or packages, for instance thuml
  • M: reworked and unified data input API
    • M: support static variables and categoricals
    • S: support for multiple data input locations and formats - pandas, polars, distributed solutions etc
  • M: MLops and benchmarking features as in DSIPTS
  • S: support for pre-training, model hubs, foundation models, but this could be post-2.0

Todos:
0. update documentation on dsipts to signpost the above (README etc.)

  1. highest priority - consolidated API design for model and data layer.
    • Depending on distance to current ptf and dsipts, use one or the other location for improvements (separate 2.0 -> dsipts, 1.X -> ptf as current).
    • ptf = stable and downwards compatible; dsipts = "playground"
    • first step for that: side-by-side comparisons of code, defined core workflows
  2. planning sessions & sprints from Jan 2025

fkiraly commented Jan 1, 2025

Having reviewed multiple code bases - pytorch-forecasting, DSIPTS, neuralforecast, thuml, I have come to understand that the DataLoader and DataSet conventions are key, in particular the input convention for forward. Interestingly, all the above-mentioned packages have different conventions here, and none seems satisfactory. What is probably most promising is a "merge" of pytorch-forecasting and DSIPTS.

The model layer will mostly follow the data layer, given that torch has an idiosyncratic forward interface.

My suggestions for high-level requirements on data loaders:

  • easy to use in pure torch, detour via pandas can be avoided (this is currently possible but not easy)
  • support for future-known and unknown, endo- and exogenous, group ID and static variables
  • if possible, downwards compatibility to pytorch-forecasting

Observations of the current API:

  • neither package spells out the forward API properly, or has checking utilities for the containers.
  • pytorch-forecasting seems to do a resampling for a decoder/encoder structure already in the data loader - this may not be necessary for all models
  • DSIPTS is closer to the abstract data type, but lacks support for static variables

The explicit specifications can be reconstructed from usage and docstrings, for convenience listed below:

pytorch-forecasting

From the docstring of TimeSeriesDataSet.to_dataloader:

DataLoader: dataloader that returns Tuple.
    First entry is ``x``, a dictionary of tensors with the entries (and shapes in brackets)

    * encoder_cat (batch_size x n_encoder_time_steps x n_features): long tensor of encoded
        categoricals for encoder
    * encoder_cont (batch_size x n_encoder_time_steps x n_features): float tensor of scaled continuous
        variables for encoder
    * encoder_target (batch_size x n_encoder_time_steps or list thereof with each entry for a different
        target):
float tensor with unscaled continuous target or encoded categorical target,
        list of tensors for multiple targets
    * encoder_lengths (batch_size): long tensor with lengths of the encoder time series. No entry will
        be greater than n_encoder_time_steps
    * decoder_cat (batch_size x n_decoder_time_steps x n_features): long tensor of encoded
        categoricals for decoder
    * decoder_cont (batch_size x n_decoder_time_steps x n_features): float tensor of scaled continuous
        variables for decoder
    * decoder_target (batch_size x n_decoder_time_steps or list thereof with each entry for a different
        target):
float tensor with unscaled continuous target or encoded categorical target for decoder
        - this corresponds to first entry of ``y``, list of tensors for multiple targets
    * decoder_lengths (batch_size): long tensor with lengths of the decoder time series. No entry will
        be greater than n_decoder_time_steps
    * group_ids (batch_size x number_of_ids): encoded group ids that identify a time series in the dataset
    * target_scale (batch_size x scale_size or list thereof with each entry for a different target):
        parameters used to normalize the target.
        Typically these are mean and standard deviation. Is list of tensors for multiple targets.


Second entry is ``y``, a tuple of the form (``target``, ``weight``)

    * target (batch_size x n_decoder_time_steps or list thereof with each entry for a different target):
        unscaled (continuous) or encoded (categories) targets, list of tensors for multiple targets
    * weight (None or batch_size x n_decoder_time_steps): weight

There is a custom collate_fn; it is (oddly?) stored as a static method in TimeSeriesDataSet._collate_fn, which is then passed to the data loader.
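For concreteness, a minimal sketch of consuming one such batch; `dataloader` is assumed to come from `TimeSeriesDataSet.to_dataloader`, and only some of the entries listed above are shown:

```python
# illustrative sketch, based on the docstring above
for x, (target, weight) in dataloader:
    encoder_cont = x["encoder_cont"]  # (batch_size, n_encoder_time_steps, n_features)
    decoder_cont = x["decoder_cont"]  # (batch_size, n_decoder_time_steps, n_features)
    group_ids = x["group_ids"]        # (batch_size, number_of_ids)
    # target: (batch_size, n_decoder_time_steps), or a list of tensors for
    # multiple targets; weight: None or same shape as target
    break
```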

DSIPTS

Specifies a simpler structure that is closer to the abstract data type of the time series data - and therefore imo better.

The data loader needs to return batches as follows, from the docstring of Base.forward:

            batch (dict): the batch structure. The keys are:
                y : the target variable(s). This is always present
                x_num_past: the numerical past variables. This is always present
                x_num_future: the numerical future variables
                x_cat_past: the categorical past variables
                x_cat_future: the categorical future variables
                idx_target: index of target features in the past array

This is missing group ID and static variables, but imo is closer to the end state we want to reach.

Ensuring downwards compatibility

Downwards compatibility can be ensured by:

  • providing converter functions between the two types of batches (see the sketch after this list). This can be achieved with additional decoder/encoder layers, or a DataLoader depending on another DataLoader.
  • neural networks being tagged with input assumptions on forward. This is probably a good idea in general as well.
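To make the first bullet concrete, a hypothetical converter from a DSIPTS-style batch to a pytorch-forecasting-style x dict, using the key names from the two docstrings above; it assumes the past window maps to the encoder and the future window to the decoder, which is exactly the modelling choice under discussion:

```python
def dsipts_to_ptf_batch(batch: dict) -> dict:
    # hypothetical sketch - not existing code in either package
    idx = batch["idx_target"]  # index of target features in the past array
    return {
        "encoder_cont": batch["x_num_past"],
        "encoder_cat": batch.get("x_cat_past"),
        "encoder_target": batch["x_num_past"][..., idx],
        "decoder_cont": batch.get("x_num_future"),
        "decoder_cat": batch.get("x_cat_future"),
        "decoder_target": batch["y"],
    }
```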

Also, currently none of the libraries seems to have stringent tests for the API - we should probably introduce these. scikit-base can be used to draw these up quickly.
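To illustrate the kind of stringent test meant here, a minimal hand-rolled contract check for ptf-style batches - not an existing utility in either package; scikit-base would mainly supply the scaffolding around checks like this:

```python
def check_batch_contract(x: dict, y: tuple) -> None:
    # minimal shape/key contract for a ptf-style batch (single-target case)
    for key in ("encoder_cont", "decoder_cont"):
        assert key in x, f"missing batch entry: {key}"
        assert x[key].ndim == 3, f"{key} must be (batch, time, features)"
    target, weight = y
    assert target.shape[0] == x["decoder_cont"].shape[0], "inconsistent batch size"
    assert weight is None or weight.shape == target.shape
```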


fkiraly commented Jan 1, 2025

question for @jdb78 - why did you design the forward API with encoder/decoder specific variables? Personally, I consider this a modelling choice, since not every deep learning forecaster is encoder/decoder based.

Side note: one possible design is to have data loaders that are specific to neural networks, facing a more general API


geetu040 commented Jan 1, 2025

question for @jdb78 - why did you design the forward API with encoder/decoder specific variables? Personally, I consider this a modelling choice, since not every deep learning forecaster is encoder/decoder based.

The forward method is part of the API design of torch.nn.Module, which is the base class for every layer and model in pytorch. So I would say it is not encoder/decoder based or a modelling choice specific to just this context.


fkiraly commented Jan 1, 2025

@geetu040, what I mean is the format of the x in forward, not the choice of forward itself (which indeed is fixed by torch). This can be an arbitrarily nested structure of dict and tuple, with leaf entries being tensors. The convention on the exact structure of x is up to the user, and this is where a core part of the API definition is "hidden" - all listed packages differ in their choices for the type of x that needs to be passed.

So, for pytorch-forecasting, the choice of having decoder/encoder related fields is indeed a choice.

@Sohaib-Ahmed21
Contributor

@fkiraly what are your views on model initialization in pytorch_forecasting via the from_dataset class method, given that other packages initialize models from the __init__ method?


jdb78 commented Jan 2, 2025

The idea is that basically all models can be represented as encoder/decoder. In some cases they are the same.


fkiraly commented Jan 2, 2025

The idea is that basically all models can be represented as encoder/decoder. In some cases they are the same.

Is that really true though for all models out there? And do we need this as forward args at the top level - as opposed to inside a layer?

See for instance Amazon Chronos:
https://github.com/amazon-science/chronos-forecasting/blob/main/src/chronos/chronos.py

or Google TimesFM:
https://github.com/google-research/timesfm/blob/master/src/timesfm/pytorch_patched_decoder.py

What I think we need for 2.0 is an API design that can cover all torch-based forecasting models.


fkiraly commented Jan 2, 2025

@fkiraly what are your views on model initialization in pytorch_forecasting via the from_dataset class method, given that other packages initialize models from the __init__ method?

I think there is no serious problem with that as ultimately it calls __init__ which in turn calls the hooks. I have three main feelings here:

  • positive: I think it is a smart idea, since many parameters will be the same for multiple models, given one dataset
  • negative: it complicates the interface, since we are passing information about the model and the dataset in many places.
  • question: I would really like @jdb78's thoughts on why/how you decided to put which parameters where - in TimeSeriesDataSet args, the model __init__, or the forward args, e.g., allowed_encoder_known_variable_names
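For reference, the pattern under discussion, roughly as in the current ptf tutorials (hyperparameters abbreviated; `training` is a TimeSeriesDataSet):

```python
from pytorch_forecasting import TemporalFusionTransformer

# dataset-derived parameters (variable lists, encoders, scalers) are inferred
# from the dataset; model-specific hyperparameters stay explicit
tft = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=16,
    learning_rate=0.03,
)
```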


Sohaib-Ahmed21 commented Jan 2, 2025

I think there is no serious problem with that as ultimately it calls __init__ which in turn calls the hooks. I have three main feelings here:

  • positive: I think it is a smart idea, since many parameters will be the same for multiple models, given one dataset
  • negative: it complicates the interface, since we are passing information about the model and the dataset in many places.

Yup, the interface complication is the main concern, especially for use cases involving a single model. But yes, the positive and negative sides need a cost-benefit analysis before a final decision.


fkiraly commented Jan 3, 2025

Some further thoughts about the design:

  • I think there should be a DataSet that provides the time series raw, without any transformation or resampling.
    • optimally, this will be decoupled from pandas, using pandas only as one possible source.
    • this should be more similar to DSIPTS
  • I think we should clearly follow the idiomatic DataSet vs DataLoader separation - the usual separation is DataSet = sample-level operations, loading; DataLoader = shuffling, batching, etc.
    • the current pytorch-forecasting does not follow this separation! The data loader mostly just copies what the DataSet does, introducing high coupling between layers that should be separated.

I would hence suggest, on the pytorch-forecasting side, a refactor that introduces a clear layer separation, but leaves the current interfaces intact until 2.0:

  • introduction of a DataSet subclass C similar to DSIPTS, close to the data. This can be subclassed for non-memory data sources
    • optionally, there can be subclasses that take sktime data types as lazy arguments. This would greatly facilitate interfacing.
  • introduction of a SlidingDataLoader that unifies the current logic in TimeSeriesDataSet.__getitem__, TimeSeriesDataSet._construct_index and the dataloader returned by TimeSeriesDataSet.to_dataloader. This DataLoader would take C as argument, and the parameters used in the above, and return the same batches as the current TimeSeriesDataSet.
    • for downwards compatibility, it can also take a TimeSeriesDataSet - this is polymorphism, just to ensure downwards compatibility.
  • on the model side, we design the API so that each model comes with its own loader - or loaders. There is a default loader for each model - current pytorch-forecasting models all point to the loader implied by TimeSeriesDataSet.
    • optionally, we could introduce a composite class closer to sktime, which consists of a loader and a model - one for each model.
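A rough skeleton of this separation, with all names speculative placeholders rather than existing ptf API:

```python
from torch.utils.data import DataLoader, Dataset


class RawTimeSeriesDataSet(Dataset):
    """Sketch of 'C': one item = one raw series, no resampling or windowing."""

    def __init__(self, series):
        # series: list of dicts, e.g. {"y": (T,) tensor, "x": (T, F) tensor, "group": id}
        self.series = series

    def __len__(self):
        return len(self.series)

    def __getitem__(self, i):
        return self.series[i]


class SlidingDataLoader(DataLoader):
    """Sketch: window slicing and encoder/decoder resampling live here, not in the DataSet."""

    def __init__(self, dataset, max_encoder_length, max_prediction_length, **kwargs):
        self.max_encoder_length = max_encoder_length
        self.max_prediction_length = max_prediction_length
        super().__init__(dataset, collate_fn=self._collate, **kwargs)

    def _collate(self, items):
        # placeholder: a real implementation would sample window positions,
        # slice encoder/decoder tensors, pad and stack into ptf-style batches
        return items
```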


fkiraly commented Jan 3, 2025

@geetu040, @benHeid, @jdb78, I would appreciate your thoughts and opinions!


benHeid commented Jan 5, 2025

I would like to add the following topics to the discussion of redesigning the API:

  • Currently, the Trainer from lightning is used, i.e., we cannot modify the Trainer if necessary. Thus, it might be worth adding a Trainer that just wraps the Trainer from lightning, to give us the flexibility. E.g., transformers from huggingface and gluonts also have their own Trainer implementations.
  • There exist different architectures with different inheritances. Do we want to touch this or leave it as it is? E.g., BaseModelWithCovariates, BaseModel, AutoRegressiveBaseModelWithCovariates.
  • Should the datasets be capable of applying preprocessing, like transformations? Or should we introduce such datasets? They might apply sktime transformer pipelines, or apply bootstrapping etc. (Probably a separate dataset should be implemented for each of these ideas.)

Specific replies to thoughts from above.

introduction of a SlidingDataLoader that unifies the current logic in TimeSeriesDataSet.__getitem__, TimeSeriesDataSet._construct_index and the dataloader returned by TimeSeriesDataSet.to_dataloader.

Should the DataLoader do the sliding? For me this is more about how the samples are created. Thus, this should be part of the dataset (At least if I got this correctly: https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

on the model side, we design the API so that each model comes with its own loader - or loaders. There is a default loader for each model - current pytorch-forecasting models all point to the loader implied by TimeSeriesDataSet.

Why should the loader be model specific? The loader's task is to provide an iterable by combining a Dataset and a Sampler (https://pytorch.org/docs/stable/data.html#module-torch.utils.data). The structure of each sample is determined by the Dataset. Thus, I would argue that if any of the data-related code should be model specific, it should be the Dataset. Furthermore, in the bullet point above, you also proposed to introduce a SlidingDataLoader. This would mean we need to implement a DataLoader, a SlidingDataLoader, and potentially further DataLoaders for each model separately.

Unfortunately, I haven't attended any of the planning sessions, thus the following question might be already answered. Why are we currently aiming for having one DataLoader/Dataset per model? E.g., if we have two models that only support endogenous features, I would suppose that their datasets would be identical. Thus, I do not see the need for having multiple ones. Furthermore, models that support exogenous features do not always require exogenous features; they might also be applicable to only endogenous features. Would this then require multiple datasets/dataloaders per model?


fkiraly commented Jan 5, 2025

@benHeid, excellent points!

Replies to your raised design angles:

Currently, the Trainer from lightning is used, i.e., we cannot modify the Trainer if necessary. Thus, it might be worth adding a Trainer that just wraps the Trainer from lightning, to give us the flexibility. E.g., transformers from huggingface and gluonts also have their own Trainer implementations.

Good idea to introduce a separation here. Question: why do you think we cannot modify the trainer in the current state of pytorch-forecasting? It is extraneous to the model. Or do you think the dependency on lightning is too strong to allow other methods of training? Would we have to re-architect the models one level lower, removing the lightning inheritance, to change this?

More generally, what architectures can you think of that allow treating all the above-mentioned trainers under one interface?

There exist different architectures with different inheritances. Do we want to touch this or leave it as it is? E.g., BaseModelWithCovariates, BaseModel, AutoRegressiveBaseModelWithCovariates.

I think this is more complex than it needs to be and needs a redesign, but it is not affecting the base API as such, as far as I can see - therefore I would leave it to later, after the base API rework. My approach would be to replace inheritance with tags and/or polymorphism in a lower number of base classes. The model-specific ones perhaps need to stay, but the type-specific ones could be merged.

However, I think it makes sense only after we have reworked the base API, as that will determine what the base classes look like.

Should the datasets be capable of applying preprocessing, like transformations? Or should we introduce such datasets? They might apply sktime transformer pipelines, or apply bootstrapping etc. (Probably a separate dataset should be implemented for each of these ideas.)

Good question. The idiomatic architecture for torch extensions has DataSet deal with instance level transformations (in our case "series-to-series"), and DataLoader with batch level ones, that would include augmentation and/or more general panel level transformations.

As mentioned, the current architecture violates this idiom, since the DataSet does decoder/encoder level window subsampling, which is a panel level transformation.


fkiraly commented Jan 5, 2025

Specific replies to thoughts from above.

introduction of a SlidingDataLoader that unifies the current logic in TimeSeriesDataSet.__getitem__, TimeSeriesDataSet._construct_index and the dataloader returned by TimeSeriesDataSet.to_dataloader.

Should the DataLoader do the sliding? For me this is more about how the samples are created. Thus, this should be part of the dataset (At least if I got this correctly: https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

I would disagree with the opinion that slicing should be part of the DataSet.

I interpret the architectural intention (which is not very clear in the torch docs) as: DataSet should be as close to the raw data as possible, possibly including instance (__getitem__ level) pre-processing, while DataLoader should collect all concerns related to batching, slicing, sampling, and batch level transformations such as data augmentation.

If we accept this architectural intention as correct, this implies:

  • the current architecture does not align with it, as DataSet carries out a significant part of batching, slicing, sampling, batch level transformations
  • the optimal end state has these concerns transferred to a DataLoader subclass, and removed from DataSet subclasses. The DataLoader can be composite.
  • one possible version of this end state has DataSet closer to the DSIPTS design, which I consider for that reason superior (though also not perfect), when it comes to the DataSet / DataLoader separation.

on the model side, we design the API so that each model comes with its own loader - or loaders. There is a default loader for each model - current pytorch-forecasting models all point to the loader implied by TimeSeriesDataSet.

Why should the loader be model specific? The loader's task is to provide an iterable by combining a Dataset and a Sampler (https://pytorch.org/docs/stable/data.html#module-torch.utils.data). The structure of each sample is determined by the Dataset.

My reasoning is that some, but not all, models require a loader that has encoder/decoder specifics. Examples are most current models; counterexamples are some models in DSIPTS and the foundation models linked above.

If we accept that some models require encoder/decoder batches, while some others do not, and we think that this should be done in a DataLoader (and not in, say, extra layers), then this necessarily implies that loaders will have to be model specific.

Thus, I would argue that if any of the data-related code should be model specific, it should be the Dataset.

I think this is a corollary of how you envision the separation of DataSet and DataLoader, which is a point where we differ. If you start from my proposal of the separation, you get the same consequence but for the DataLoader.

Why are we currently aiming for having one DataLoader/Dataset per model?

To clarify, that is not the aim in my design. In my design, there would probably be a small number of DataLoader-s; it would be an n:1 relationship of models to loaders, while all would use the same DataSet interface and possibly an additional unified middle DataLoader layer.

We should of course not need to define one DataSet per model - as in sktime or sklearn, it should be easy to loop over models, so only one DataSet overall.

The key problem is actually the state you are pointing out as problematic - currently, we have to define one DataSet per package (or even model) for using the various foundation models etc. In sktime, this now happens under the hood in sktime classes, but the unification should happen one layer deeper, in torch or lightning.

Unfortunately, I haven't attended any of the planning sessions, thus the following question might be already answered.

No problem, we will have a big one with new mentees, @agobbifbk plus team, in the new year (the FBK team will return next week from holiday). @thawn, you are very welcome to attend and participate in planning too.

We will align on a date for this in the pytorch-forecasting channel on the sktime discord.


benHeid commented Jan 5, 2025

Good idea to introduce a separation here. Question, why do you think we cannot modify the trainer in the current state of pytorch-forecasting? It is extraneous to the model. Or do you think the dependency on lightning is too strong to allow other methods of training? Would we have to rearchitect the models one level lower, removing the lightning inheritance, to change this?

Technically it is possible. However, the user has to change the Trainer in that case. So I think introducing a Trainer with a major version release is more intuitive; then the user does not have to change it.

More generally, what are architectures you can think of that allow to treat all the abovementioned trainers under one interface?

I think all architectures can be trained with that trainer. However, there might be time-series-specific features that we would like to implement, e.g., truncated backpropagation through time, which was removed from the lightning trainer: #1581


benHeid commented Jan 5, 2025

With regards to the discussion of where the slicing is located: I am still not convinced that a custom DataLoader should be the solution, and I still think that tasks like slicing are intended to be part of datasets and not of DataLoaders. The reasons why I think this are:

  • I haven't found any good example of a custom DataLoader on the PyTorch site, but plenty of tutorials and documentation showing different datasets.
  • Regarding slicing, I would say that from the level where they are applied, these operations are comparable to creating subset datasets or concat datasets. For those operations, there are extra dataset implementations available in PyTorch. Thus, I could imagine that a possible solution would be to have preprocessing datasets or augmentation datasets that are afterwards passed to a slicing dataset, which then slices the transformed and augmented time series (see the sketch after this list). However, this might be complicated for users who need to chain various Datasets.
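A rough sketch of this dataset-side alternative, analogous to torch's built-in wrapper datasets such as Subset or ConcatDataset; all names are placeholders:

```python
from torch.utils.data import Dataset


class SlicingDataset(Dataset):
    """Wraps a base dataset of whole series and exposes sliding windows as items."""

    def __init__(self, base, window, stride=1):
        self.base, self.window, self.stride = base, window, stride
        # precompute (series_index, start) pairs for all admissible windows;
        # assumes base[i] is a (T, F) tensor
        self.index = [
            (i, s)
            for i in range(len(base))
            for s in range(0, base[i].shape[0] - window + 1, stride)
        ]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        series_idx, start = self.index[i]
        return self.base[series_idx][start : start + self.window]
```

A preprocessing or augmentation dataset would wrap the base dataset in the same way, before SlicingDataset - which is the chaining concern mentioned above.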


fkiraly commented Jan 8, 2025

Technically it is possible. However, the user has to change the Trainer in that case. So I think introducing a Trainer with a major version release is more intuitive; then the user does not have to change it.

I do not completely grasp the context - could you perhaps add two code snippets, current and speculative, for modifying trainer?

With regards to the discussion of where the slicing is located: I am still not convinced that a custom DataLoader should be the solution, and I still think that tasks like slicing are intended to be part of datasets and not of DataLoaders.

Hm, I see how one could make an argument either way.

Then, would the implied design be (a) a raw time series dataset without slicing, e.g., as a DataSet, and (b) a WindowSlicedTimeSeriesDataSet(base_dataset, params)?

To avoid chaining, we can always have an all-in-one delegator.


benHeid commented Jan 9, 2025

I do not completely grasp the context - could you perhaps add two code snippets, current and speculative, for modifying trainer?

Currently, users are importing directly from lightning: from lightning.pytorch import Trainer. If we change to our own implementation at some point, users need to adapt their imports. I think this adaptation is better to happen with a major release, so that with 2.0 the user has to do from pytorch_forecasting.trainer import Trainer. This trainer can at the beginning be an empty wrapper around the lightning Trainer:

from lightning.pytorch import Trainer as PL_trainer

class Trainer(PL_trainer):
    # initially an empty wrapper; time-series-specific features
    # (e.g., truncated BPTT) can be added here later
    ...

Then, would the implied design be (a) a raw time series dataset without slicing, e.g., as a DataSet, and (b) a WindowSlicedTimeSeriesDataSet(base_dataset, params)?

Yes

fkiraly added a commit that referenced this issue Jan 10, 2025
This PR carries out a clean-up refactor of `TimeSeriesDataSet`. No
changes are made to the logic.

This is in preparation for major work items impacting the logic, e.g.,
removal of the `pandas` coupling (see
#1685), or a 2.0
rework (see #1736).
In general, a clean state would make these easier.

Work carried out:

* clear and complete docstrings, in numpydoc format
* separating logic, e.g., for parameter checking, data formatting,
default handling
* reducing cognitive complexity and max indentations, addressing "code
smells"
* linting

thawn commented Jan 13, 2025

@fkiraly I am back from vacation. I hope you had a relaxing time during the Holidays. Before I get started with figuring out how I can help with the integration of Time-Series-Library: Have you tried contacting the original developers of that library? I think this is important for three reasons:

  1. They know their code much better than me
  2. I would feel bad using their code without asking them first (even though the license technically allows this)
  3. Even though my pull request to time-series-library may look like a lot of changes, it is a relatively simple refactor that took me under two hours (the major part being running tests to check that TimesNet still works, because that is what I am using downstream)


fkiraly commented Jan 13, 2025

Have you tried contacting the original developers of that library?

I think somewhere I pinged them, but you are right - we should also let them know and ask very explicitly, with the current state of discussions. I see multiple options that make technical sense:

  • developing a common framework layer for models, like sktime, and people can have their own packages or estimators, see here: https://www.sktime.net/en/latest/estimator_overview.html (estimators have "author", "maintainer" ownership tags, or can even be in other packages entirely, like prophetverse or skchange)
  • merge into a common code base, e.g., pytorch-forecasting which already has a well-maintained package and release layer, and an active community

Both options will require agreeing on a common framework level API - I still think DSIPTS is closest to what we want, of course comments are much appreciated.

FYI @wuhaixu2016, @ZDandsomSP, @GuoKaku, @DigitalLifeYZQiu, @Musongwhk (please feel free to add further pings if I forgot someone)


fkiraly commented Jan 13, 2025

I would feel bad using their code without asking them first (even though the license technically allows this)

Agree, at least credit needs to be given clearly to the authors of time-series-library!

A complication is that there is much copy-pasting going on in the neural network area historically, some (but not all) code in time-series-library is even copied from elsewhere afaik.

I think we should also come up with a way to fairly assign credit while backtracking the entire copy-merge-modify tree, although that might be a bit of work. A complication is that the people actively maintaining the code may be different from those who wrote parts of it (but no longer maintain it), and all deserve credit!
A solution could look like the tagging system in sktime, which separates maintainer-type owners from authorship credits; some example pull requests on this topic are here:

sktime/sktime#6850
sktime/sktime#7518

The first PR includes handling of cure-lab algorithms, which is historically at a central position of the copy-modify-tree for deep learning forecasters, but was never turned into a package. In that latter respect, it is similar to time-series-library.


thawn commented Jan 13, 2025

Regarding the API of Time-Series-Library

from my (limited) grasp of the code, this consists of the following:

My feeling is that adapting this to any more general API will require major work. Using TimesNet in my code required me to write my own dataset and dataloaders as well as a model wrapper.

Edit: this should not be taken as criticism of Time-Series-Library. I am very grateful to the authors for their code and their papers. Their work helped me a lot for my project. It just shows that the library was written for a specific purpose (benchmarking) and not as a general purpose API like what is planned here.

@agobbifbk

Regarding the API of Time-Series-Library

from my (limited) grasp of the code, this consists of the following:

My feeling is that adapting this to any more general API will require major work. Using TimesNet in my code required me to write my own dataset and dataloaders as well as a model wrapper.

Edit: this should not be taken as criticism of Time-Series-Library. I am very grateful to the authors for their code and their papers. Their work helped me a lot for my project. It just shows that the library was written for a specific purpose (benchmarking) and not as a general purpose API like what is planned here.

A lot of models in DSIPTS are adaptations of online repositories with the same logic as Time-Series-Library. What I've done so far is to align and standardize these API calls in the DSIPTS data preparation. I also found it hard to understand the input parameters sometimes, which is why I leverage Hydra.


fkiraly commented Jan 16, 2025

For discussion on Fri, here is a speculative design for a raw data container.
#1755


fkiraly commented Jan 17, 2025

Outcome from the prioritization meeting on Jan 15:

data layer - dataset, dataloader 👍👍👍👍👍👍👍 💬 ✔️✔️✔️

  • dataset and dataloader API consolidation

model layer - base classes, configs, unified API 👍👍👍👍 ✔️

  • more refined base classes, maybe, and proper documentation of them
  • tests: input-output shapes of the batches
  • operationalization of the models (inference must be a clear process/output)
  • think about how model configs are stored, e.g., versioning so that we know how to load model weights
  • easy interface for adding new architectures

foundation models, model hubs 👍 👍 👍

  • think about how to handle pre-trained models in ptf and how to interface them
  • integration with model hubs (hf/kaggle/...)
  • [IDEA] look into Chronos-Bolt and integrate it into the repo

documentation 👍 ✔️

  • adding tutorials and examples for the new users
  • proper documentation of base classes

benchmarking 👍 👍💬

  • easy way to use benchmark datasets
  • reproducibility, scalability, concept of experiment (same dataset, different models/parameters to be compared)

mlops and scaling (distributed, cluster etc) 👍 👍

  • operationalization of the models (inference)
  • hooks for: slurm cluster, multiprocess, OPTUNA
  • think about how model configs are stored, e.g., versioning so that we know how to load model weights

more learning tasks supported

  • [IDEA] continuous learning / active learning
  • [IDEA] easy to convert DL architectures from regression to classification

@agobbifbk

I came up with this: https://zarr.readthedocs.io/en/stable/. It is widely used when you have a large dataset that cannot fit in RAM/VRAM. The idea is to create the zarr before creating the dataset. It generates chunks of data along a given dimension(s), in our case the temporal dimension; the dataset opens the zarr and in the __getitem__ function you retrieve the window you asked for.
Google has created a bucket with ALL ERA5 data (all the European hourly meteorological variables!).
While creating the zarr we can:

  • update the scaler(s) on the training set
  • update the label encoder(s)
  • keep track of the valid samples (not NaN)

PROs:

  • our models can run on an enormous dataset
  • it should not be difficult to fit it into the API we are thinking of
  • Linux can automatically cache the most used files, so maybe we gain some speed
  • most used technology for this kind of application (similar to xarray, nc, and other compression/storage options for BIG data)

CONs:

  • need to create a zarr file on the disk (redundancy, space)
  • slower compared to having the whole numpy/tensor dataset in RAM
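A minimal sketch of the idea (paths, shapes and the class name are illustrative):

```python
import numpy as np
import zarr
from torch.utils.data import Dataset

# build the store once, chunked along the temporal dimension
z = zarr.open("series.zarr", mode="w", shape=(10, 100_000, 8),
              chunks=(1, 1024, 8), dtype="f4")
z[:] = np.random.randn(10, 100_000, 8).astype("f4")


class ZarrWindowDataset(Dataset):
    """__getitem__ reads only the chunks overlapping the requested window."""

    def __init__(self, path, window):
        self.z = zarr.open(path, mode="r")
        self.window = window
        self.per_series = self.z.shape[1] - window + 1

    def __len__(self):
        return self.z.shape[0] * self.per_series

    def __getitem__(self, i):
        series, start = divmod(i, self.per_series)
        return np.asarray(self.z[series, start : start + self.window])
```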


thawn commented Jan 21, 2025

I came up with this https://zarr.readthedocs.io/en/stable/.

+1 from me. We are also using zarr as file backend in our time series project.


thawn commented Jan 21, 2025

As the default data format in our minimal dataset class, I realized that xarray may be a great choice: xarray natively supports column names (i.e., metadata) as well as n-dimensional columns, which makes it an ideal replacement for both numpy and pandas. Furthermore, it has the advantage of dask support (which handles memory issues very well). By using dask.array, xarray enables native chunking (ideal for non-overlapping time windows) as well as overlapping chunks (for sliding time windows).
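A small sketch of what this enables, assuming xarray with dask installed (all names illustrative):

```python
import numpy as np
import xarray as xr

# named dimensions and coordinates replace positional numpy axes / pandas columns
da = xr.DataArray(
    np.random.randn(1_000, 3),
    dims=("time", "feature"),
    coords={"feature": ["y", "x1", "x2"]},
)

da = da.chunk({"time": 200})  # dask-backed, lazy chunks
windows = da.rolling(time=24).construct("window")  # sliding windows as a new dimension
```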


fkiraly commented Jan 22, 2025

@agobbifbk, @thawn, I think there are multiple solutions that have a similar feature set - dask and polars as well.

As long as we have a consistent __getitem__ return format, nothing stops us from adding dataset classes that take any of these as input.


fkiraly commented Jan 22, 2025

FYI, a simplified version of the previous design sketch here: #1757 - for discussion

@agobbifbk

Desiderata for the DataSet component:

  • possibility to cut data based on a list of ranges (for identifying the training dataset); ranges can be temporal values or percentages
  • scalers (minmax) trained on the train dataset --> maybe it is better to train them outside the DataSet, so only the transform function lives here and the dataset can work both for training and inference!
  • label encoder (trained on the train dataset) --> if there are strings in columns, we need to convert them into 0 ... N-1 integers
  • groups: suppose we have M entities sharing the same data (think of M meteorological stations collecting temperatures); in this case time alone is not an index, but the tuple (time, group) is. Maybe sometimes you want to normalize by group
  • since interpolating is not always a good choice, and AFAIU the slicing is performed there, we need to ensure that each sample is valid --> precompute all the possible good starting points? (see the sketch below)
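On the last point, a sketch of precomputing the valid starting points; `valid_window_starts` is a hypothetical helper, not existing code:

```python
import numpy as np


def valid_window_starts(values: np.ndarray, window: int) -> np.ndarray:
    """All start indices t such that values[t : t + window] contains no NaN,
    computed in O(T) via prefix sums over per-timestep validity flags."""
    ok = ~np.isnan(values).any(axis=1)           # (T,) - timestep has no NaN
    csum = np.concatenate(([0], np.cumsum(ok)))  # prefix sums of valid flags
    starts = np.arange(len(ok) - window + 1)
    return starts[csum[starts + window] - csum[starts] == window]
```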


fkiraly commented Jan 31, 2025

Completed a design proposal draft here: sktime/enhancement-proposals#39

Input appreciated!
