Reproducing Figure 1 using 'examples/Transformer/main.py' #69

Open
jndean opened this issue Jan 16, 2024 · 0 comments
jndean commented Jan 16, 2024

Hi, thank you for maintaining this great repo! We are currently exploring how muP interacts with our unit scaling method, and whether there is a scheme that satisfies both at once.

I have tried to recreate the RHS of your Figure 1 using examples/Transformer/main.py to serve as our baseline. While my results look sensible (a stable optimal learning rate across varying widths, and a pleasing tick shape), I have been unable to choose hyperparameters that exactly recreate your plot. In particular, my training losses are higher (e.g., width 128 reaches a minimum training loss of 5.2, whereas yours bottoms out around 4.75) and my optimal learning rate is slightly different.

I am using the default arguments from main.py except where they are contradicted by the paper's description of Fig. 1.
Can you point to a description of the training parameters you used for Fig 1, or highlight which of the below might be incorrect?

| Param | Val | Reason |
| --- | --- | --- |
| ffn_ratio | 4 | Section 3, pg. 5 |
| epochs | 5 | Section 3, pg. 5 |
| optimizer | 'muadam' | as per Fig. 1 caption |
| norm | postnorm | as per Fig. 18 caption |
| base width | 128 | used by the other transformer experiments in the paper |
| output_mult | 1 | default |
| nlayers | 2 | default |
| nhead | 2 | default |
| batch_size | 20 | default |
| bptt | 35 | default |
| dropout | 0.2 | default |
| etc... | ... | default |
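For concreteness, this is roughly the invocation I have been running. The flag names below are my assumption that main.py's argparse options mirror the parameter names in the table; please correct me if any differ:

```shell
# Hypothetical invocation -- flag names assumed to match the
# parameter names above; check main.py's argparse definitions.
python examples/Transformer/main.py \
    --optimizer muadam \
    --ffn_ratio 4 \
    --epochs 5 \
    --output_mult 1 \
    --nlayers 2 \
    --nhead 2 \
    --batch_size 20 \
    --bptt 35 \
    --dropout 0.2
```

Learning rate and width are then swept over a grid, as in Figure 1, with everything else held fixed.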

Thanks very much. My plot is already quite close to yours, but we would prefer to know our results are directly comparable, and would therefore like to recreate your figure exactly for the baseline.
