Replies: 1 comment
-
@tanz63 these configurations are more or less the defaults. Regarding fine-tuning, it's a dataset-agnostic proof of concept; we did not deliberate hard about the setting. One could potentially obtain significantly better fine-tuning performance by carefully validating the hyperparameters.
-
In the Chronos paper, training is run for a fixed number of steps: _"The models were optimized for 200K steps using the AdamW optimizer with a weight decay of 0.01. The learning rate was annealed linearly from its initial value of 0.001 to 0 over the training steps."_ What is the logic behind this configuration? Since there is no downstream-task fine-tuning as in the LLM setting, how is overfitting avoided? Is there a heuristic behind the step count, e.g. that it corresponds to roughly 1 or 2 epochs over the training corpus?
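For concreteness, here is a minimal PyTorch sketch of the recipe quoted above (this is not the authors' code; the model, batch, loss, and the `make_optimizer_and_scheduler` helper are placeholders): AdamW with weight decay 0.01 and a learning rate annealed linearly from 0.001 to 0 over a fixed budget of 200K steps, with no notion of epochs.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def make_optimizer_and_scheduler(model, lr=1e-3, weight_decay=0.01, total_steps=200_000):
    """Hypothetical helper: AdamW plus a linear learning-rate decay from `lr` to 0."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    # LambdaLR multiplies the base lr by the factor below at every scheduler step:
    # lr(t) = lr * (1 - t / total_steps), reaching 0 at the final step.
    scheduler = LambdaLR(optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))
    return optimizer, scheduler


# Pretraining-style setting from the quoted passage: a fixed budget of 200K steps.
model = torch.nn.Linear(512, 512)      # stand-in for the actual Chronos model
optimizer, scheduler = make_optimizer_and_scheduler(model, total_steps=200_000)

for step in range(200_000):
    batch = torch.randn(32, 512)       # stand-in for a sampled batch of token sequences
    loss = model(batch).pow(2).mean()  # stand-in for the training loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                   # anneal the learning rate once per optimizer step
```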
Also, the fine-tuning is implemented in a dataset-agnostic fashion with an initial learning rate of 0.001, annealed linearly to 0 over 1000 steps. What are the insights behind this choice?
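For comparison, the fine-tuning setting described here has the same schedule shape with a much smaller step budget; in terms of the hypothetical helper sketched above, it amounts to:

```python
# Fine-tuning setting from the question: same linear-decay recipe, but with an
# initial lr of 0.001 annealed to 0 over only 1000 steps.
ft_optimizer, ft_scheduler = make_optimizer_and_scheduler(
    model, lr=1e-3, weight_decay=0.01, total_steps=1_000
)
```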