How to utilize the HuggingFace dataset for pre-training #162
Replies: 2 comments 4 replies
-
@FrankHo-Hwc Thanks for raising this. You can convert the dataset into a GluonTS-compatible format using something like this:

```python
import datasets
import numpy as np
from gluonts.dataset.arrow import ArrowWriter
from tqdm.auto import tqdm


def hf_to_gluonts_univariate(hf_dataset: datasets.Dataset, batch_size: int = 1000):
    series_fields = [
        col
        for col in hf_dataset.features
        if isinstance(hf_dataset.features[col], datasets.Sequence)
    ]
    series_fields.remove("timestamp")
    dataset_length = hf_dataset.info.splits["train"].num_examples
    pbar = tqdm(total=dataset_length)
    for batch in hf_dataset.iter(batch_size=batch_size):
        # Transpose the batch from a dict of columns to a list of row dicts
        batch = [dict(zip(batch, t)) for t in zip(*batch.values())]
        for hf_entry in batch:
            for field in series_fields:
                yield {
                    "start": np.datetime64(hf_entry["timestamp"][0], "s"),
                    "target": np.array(hf_entry[field]),
                }
        pbar.update(batch_size)
    pbar.close()


if __name__ == "__main__":
    # Load TSMixup data and convert it into GluonTS arrow format
    ds = datasets.load_dataset(
        "autogluon/chronos_datasets",
        "training_corpus_tsmixup_10m",
        split="train",
    )
    ds.set_format("numpy")
    ArrowWriter(compression="lz4").write_to_file(
        hf_to_gluonts_univariate(ds),
        path="./tsmixup-data.arrow",
    )

    # Load KernelSynth data and convert it into GluonTS arrow format
    ds = datasets.load_dataset(
        "autogluon/chronos_datasets",
        "training_corpus_kernel_synth_1m",
        split="train",
    )
    ds.set_format("numpy")
    ArrowWriter(compression="lz4").write_to_file(
        hf_to_gluonts_univariate(ds),
        path="./kernelsynth-data.arrow",
    )
```
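The dict-of-columns to list-of-rows transpose inside the generator is easy to sanity-check on a toy batch; the values below are made up purely for illustration:

```python
import numpy as np

# A toy batch in the column-oriented layout that Dataset.iter yields
batch = {
    "id": ["s0", "s1"],
    "timestamp": [
        ["2020-01-01T00:00:00", "2020-01-01T01:00:00"],
        ["2021-06-01T00:00:00", "2021-06-01T01:00:00"],
    ],
    "target": [[1.0, 2.0], [3.0, 4.0]],
}

# Same transpose used in hf_to_gluonts_univariate:
# dict of columns -> list of per-series row dicts
rows = [dict(zip(batch, t)) for t in zip(*batch.values())]

# Each row then maps onto a GluonTS-style entry
entry = {
    "start": np.datetime64(rows[0]["timestamp"][0], "s"),
    "target": np.array(rows[0]["target"]),
}
```

Iterating a `zip` of the column values walks the batch row by row, and `dict(zip(batch, t))` re-attaches the column names, so each `rows[i]` holds one complete series.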
-
I have a follow-up question on this. In your evaluate.py file, where you use HuggingFace datasets to perform zero-shot/in-domain evaluation, you use the function below to get the test data template:

`def load_and_split_dataset(backtest_config: dict):`

My question is: why do you load the train split of the HF dataset there (`ds = datasets.load_dataset(`…)? And is the unused return value `_` the training split you used when training Chronos on the in-domain datasets?
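For readers following along, the kind of split such a function performs can be sketched as follows. Note that `split_for_backtest` and its arguments are illustrative stand-ins, not the actual code from evaluate.py:

```python
import numpy as np


def split_for_backtest(entries, prediction_length):
    """Split each series into visible context and a held-out forecast horizon."""
    test_inputs, test_labels = [], []
    for entry in entries:
        target = np.asarray(entry["target"])
        # Everything except the last `prediction_length` points is visible context
        test_inputs.append({**entry, "target": target[:-prediction_length]})
        # The held-out tail is the ground truth the forecasts are scored against
        test_labels.append(target[-prediction_length:])
    return test_inputs, test_labels


inputs, labels = split_for_backtest(
    [{"start": "2020-01-01", "target": list(range(10))}], prediction_length=3
)
```

The point is that evaluation only needs one split of the data: the tail of each series is held out as the label, and the rest is fed to the model as context.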
-
I want to use the datasets you provided on HuggingFace. However, I ran into some problems.
As you mention in the ChronosDataset class, the entries in the dataset must have `start` and `target` fields.
However, the datasets provided on HuggingFace don't follow this format: their entries have `id` and `timestamp` fields instead of `start`. As a result, I hit an error when using the HuggingFace dataset directly.
So is there any way to fix this problem? Thanks.