How to utilize the HuggingFace dataset for pre-training #162
Replies: 2 comments 4 replies
-
@FrankHo-Hwc Thanks for raising this. You can convert the dataset into a GluonTS-compatible format using something like this:

```python
import datasets
import numpy as np
from gluonts.dataset.arrow import ArrowWriter
from tqdm.auto import tqdm


def hf_to_gluonts_univariate(hf_dataset: datasets.Dataset, batch_size: int = 1000):
    series_fields = [
        col
        for col in hf_dataset.features
        if isinstance(hf_dataset.features[col], datasets.Sequence)
    ]
    series_fields.remove("timestamp")
    dataset_length = hf_dataset.info.splits["train"].num_examples
    pbar = tqdm(total=dataset_length)
    for batch in hf_dataset.iter(batch_size=batch_size):
        # Transpose the batch from a dict of columns to a list of row dicts
        batch = [dict(zip(batch, t)) for t in zip(*batch.values())]
        for hf_entry in batch:
            for field in series_fields:
                yield {
                    "start": np.datetime64(hf_entry["timestamp"][0], "s"),
                    "target": np.array(hf_entry[field]),
                }
        pbar.update(batch_size)
    pbar.close()


if __name__ == "__main__":
    # Load TSMixup data and convert it into GluonTS arrow format
    ds = datasets.load_dataset(
        "autogluon/chronos_datasets",
        "training_corpus_tsmixup_10m",
        split="train",
    )
    ds.set_format("numpy")
    ArrowWriter(compression="lz4").write_to_file(
        hf_to_gluonts_univariate(ds),
        path="./tsmixup-data.arrow",
    )

    # Load KernelSynth data and convert it into GluonTS arrow format
    ds = datasets.load_dataset(
        "autogluon/chronos_datasets",
        "training_corpus_kernel_synth_1m",
        split="train",
    )
    ds.set_format("numpy")
    ArrowWriter(compression="lz4").write_to_file(
        hf_to_gluonts_univariate(ds),
        path="./kernelsynth-data.arrow",
    )
```
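The dict-of-columns to list-of-rows transpose inside the generator is easy to sanity-check on a toy batch; the values below are made up purely for illustration:

```python
import numpy as np

# A toy batch in the column-oriented layout that Dataset.iter yields
batch = {
    "id": ["s0", "s1"],
    "timestamp": [
        ["2020-01-01T00:00:00", "2020-01-01T01:00:00"],
        ["2021-06-01T00:00:00", "2021-06-01T01:00:00"],
    ],
    "target": [[1.0, 2.0], [3.0, 4.0]],
}

# Same transpose used in hf_to_gluonts_univariate:
# dict of columns -> list of per-series row dicts
rows = [dict(zip(batch, t)) for t in zip(*batch.values())]

# Each row then maps onto a GluonTS-style entry
entry = {
    "start": np.datetime64(rows[0]["timestamp"][0], "s"),
    "target": np.array(rows[0]["target"]),
}
```

Iterating a `zip` of the column values walks the batch row by row, and `dict(zip(batch, t))` re-attaches the column names, so each `rows[i]` holds one complete series.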
-
I have a follow-up question on this. In your evaluate.py file, where you use HuggingFace datasets to perform zero-shot/in-domain evaluation, you use the function below to get the test data template:

`def load_and_split_dataset(backtest_config: dict):`

My question is: why do you load the train split of the HF dataset there (`ds = datasets.load_dataset(`…)? And is the unused return value `_` the training split you used when training Chronos on the in-domain datasets?
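For readers following along, the kind of split such a function performs can be sketched as follows. Note that `split_for_backtest` and its arguments are illustrative stand-ins, not the actual code from evaluate.py:

```python
import numpy as np


def split_for_backtest(entries, prediction_length):
    """Split each series into visible context and a held-out forecast horizon."""
    test_inputs, test_labels = [], []
    for entry in entries:
        target = np.asarray(entry["target"])
        # Everything except the last `prediction_length` points is visible context
        test_inputs.append({**entry, "target": target[:-prediction_length]})
        # The held-out tail is the ground truth the forecasts are scored against
        test_labels.append(target[-prediction_length:])
    return test_inputs, test_labels


inputs, labels = split_for_backtest(
    [{"start": "2020-01-01", "target": list(range(10))}], prediction_length=3
)
```

The point is that evaluation only needs one split of the data: the tail of each series is held out as the label, and the rest is fed to the model as context.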
-
I want to use the datasets you provided on HuggingFace. However, I ran into some problems.
As you mention in the ChronosDataset class, the entries in the dataset must have `start` and `target` fields.
However, the datasets provided on HuggingFace don't follow this format: their entries have `id` and `timestamp` fields instead of `start`. As a result, I hit an error when using the HuggingFace dataset directly.
So is there any way to fix this problem? Thanks.