I pretrained the model on LibriSpeech 960h and got a loss of 0.2. However, when I used the checkpoint to fine-tune on LibriSpeech 100h, I got a dev WER of about 100. Did I make a mistake during the pretraining phase or the fine-tuning phase?
Hi,
Your training loss seems too low; it should be around ~1.4 after 200k steps and ~1.1 after 400k steps.
A very low loss in self-distillation usually means the teacher model has collapsed (constant output regardless of input), so training degenerates into a trivial task.
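A quick way to catch this early is to monitor the spread of the teacher's targets: a collapsed teacher produces nearly identical target vectors for every input, so their variance drops toward zero while the student loss keeps falling. Below is a minimal sketch, assuming PyTorch and that you can extract one batch of teacher targets as a `(batch, time, dim)` tensor; the function name and threshold are illustrative, not taken from the codebase:

```python
import torch

def check_teacher_collapse(teacher_targets: torch.Tensor, eps: float = 1e-3) -> bool:
    """Heuristic collapse check for self-distillation targets.

    teacher_targets: (batch, time, dim) tensor of the EMA teacher's
    target representations for one batch.
    Returns True if the targets look collapsed (near-constant).
    """
    # Flatten batch and time so each row is one target vector.
    flat = teacher_targets.reshape(-1, teacher_targets.size(-1)).float()
    # Standard deviation of each feature dimension across all positions.
    per_dim_std = flat.std(dim=0)
    # If nearly every dimension is constant, the student can reach a very
    # low loss by predicting that constant, which matches the symptom above.
    return bool((per_dim_std.mean() < eps).item())

# Usage inside a training loop, with `targets` produced by the EMA teacher:
# if check_teacher_collapse(targets):
#     print("warning: teacher targets look collapsed; the low loss is trivial")
```

The check is model-agnostic; it only needs access to the teacher's per-batch targets, wherever they are produced in the training loop.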
Previously, I had changed the config from fp16 to bf16 and reduced max_tokens from 3.8 million to 2.4 million. I have now reverted both changes, and the pretraining loss is consistent with the values you mentioned. I didn't expect these two parameters to have such a significant impact.
Hi, I ran into a similar issue with a very low loss and cluster collapse. Apart from the batch size (4), I haven't changed anything in the base configuration, but it also happened with the default batch size. What can I do to prevent it?