Hi, thank you for maintaining this great repo! We are currently exploring how muP interacts with our unit scaling method, and whether there is a scheme that satisfies both at once.
I have tried to recreate the RHS of your Figure 1 using `examples/Transformer/main.py` to serve as our baseline. Whilst my results look sensible (a nice stable optimal learning rate across varying widths, and the pleasing tick shape), I have been unable to choose hyperparameters that exactly recreate your plot. In particular, my training losses are higher (e.g., width 128 reaches a minimum training loss of 5.2, whereas yours reaches a minimum around 4.75) and my optimal learning rate is slightly different.
I am using the default arguments from `main.py` except where they are contradicted by the paper's description of Fig. 1.
Can you point to a description of the training parameters you used for Fig. 1, or highlight which of the below might be incorrect? (For concreteness, I have sketched the sweep I am running after the table.)
| Param | Val | Reason |
| --- | --- | --- |
| `ffn_ratio` | 4 | Section 3, pg. 5 |
| `epochs` | 5 | Section 3, pg. 5 |
| `optimizer` | `'muadam'` | as per Fig. 1 caption |
| `norm` | postnorm | as per Fig. 18 caption |
| base width | 128 | used by the other transformer experiments in the paper |
| `output_mult` | 1 | default |
| `nlayers` | 2 | default |
| `nhead` | 2 | default |
| `batch_size` | 20 | default |
| `bptt` | 35 | default |
| `dropout` | 0.2 | default |
| etc... | ... | default |
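For reference, here is a minimal sketch of the sweep I am running. The flag spellings are taken from the parameter names in the table above, and the width and learning-rate flags (`--d_model`, `--lr`) plus the widths and LR grid are my assumptions rather than anything stated in the paper, so they may not match the actual argparse names in `examples/Transformer/main.py`.

```python
# Minimal sketch of my LR sweep over widths (flag names assumed, see note above).
import itertools
import subprocess

WIDTHS = [128, 256, 512, 1024]      # widths swept for the muP curves (my assumption)
LOG2_LRS = range(-14, -5)           # learning rates 2**-14 ... 2**-6 (assumed grid)

for width, log2_lr in itertools.product(WIDTHS, LOG2_LRS):
    subprocess.run(
        [
            "python", "examples/Transformer/main.py",
            "--optimizer", "muadam",    # as per the Fig. 1 caption
            "--epochs", "5",
            "--ffn_ratio", "4",
            "--nlayers", "2",
            "--nhead", "2",
            "--batch_size", "20",
            "--bptt", "35",
            "--dropout", "0.2",
            "--output_mult", "1",
            "--d_model", str(width),    # hypothetical flag name for model width
            "--lr", str(2.0 ** log2_lr),
        ],
        check=True,
    )
```

If your actual sweep differed (e.g. warmup, a different LR grid, or averaging over seeds), that could easily explain the gap in minimum loss.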
Thanks very much. My plot is quite close to yours already, but we would prefer to know that our results are directly comparable, and would therefore like to be able to recreate your figure exactly for the baseline.