Questions about the training settings #5

qwertier24 · 2019-06-26T06:04:06Z

Hi! I am really interested in this fascinating work. However, I have some questions about the training methods for the transformer model.

In the paper you mention the transformer model is trained with learning rate = 6e-4 but do not say which lr decay method you are using, which I am curious about. I am also curious about the number of layers in the encoder and decoder.

Could you please demonstrate more specifically about the training settings? It will be more convenient for someone like me who want to reproduce your results if you could just publish your training source codes.

Thank you very much!

ischlag · 2019-08-07T15:53:57Z

I don't think they use a learning rate decay. At least it is not mentioned anywhere. They do train on all tasks simultaneously (according to the first author) for 500k steps with the attention-is-all-you-need transformer architecture which is a 6 layer encoder and decoder with a hidden size of 512 and a dense filter size of 2048. The batch size is 1024 so you will need some serious compute in order to reproduce this. With this config trained on 4 V100 GPUs, you can do 50k steps in ~13h.

They used the tensor2tensor implementation of the transformer. So technically the code is public. Good luck with that.

Have you had any success @mayukuner?

davidsaxton · 2019-08-07T16:24:55Z

That's correct: no learning rate decay for the results reported in the paper. 6 layers in the decoder and encoder.

qwertier24 · 2019-08-08T02:37:16Z

@ischlag Not even close to success. I used transformer_base_v1 as the base parameter set and modified it a little by adding a constant lr scheduler, a warmup procedure, and other stuff like this:

PROBLEM=algorithmic_math_deepmind_all
MODEL=transformer
HPARAMS=transformer_base_v1

t2t-trainer --data_dir=$DATA_DIR --problem=$PROBLEM --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --worker_gpu=8 --hparams='max_length=64,num_hidden_layers=6,learning_rate=6e-4,learning_rate_schedule=constant*linear_warmup,learning_rate_constant=6e-4,clip_grad_norm=0.1,optimizer_adam_beta2=0.995,batch_size=8192'

Here I am using a batch size of 8192 because 8 GTX 1080TI is utilized and 1024 sentences have approximately 8192 * 8 tokens so I think this is not a problem.

And the result is:

By the way, I changed the dataset generator a little by random selecting training data from train-easy, train-medium, and train-hard to make the dataset's size approximately 2M. The validation set is sampled from interpolation. The code can be found here: https://gist.github.com/mayukuner/dd7f88c0309cc926f1b02cf596b010c4

@davidsaxton Have you used curriculum training? I don't think you have while I really couldn't figure out why I cannot reproduce your results. Am I missing something here?

ischlag · 2019-08-08T15:53:55Z

I'd highly recommend to not deviate from the hyperparameters that are given in the paper. The transformer architecture is rather sensible to those. Remove your schedule and set the batch_size to 1024. Then train for 500k steps. Make sure your accuracy is 1 for getting all output tokens right and 0 for getting even just one wrong (no per symbol accuracy is reported).

qwertier24 · 2019-08-09T02:21:45Z

@ischlag I am trying my best to get close to the settings in the paper.

As you can see, the batch_size here is the maximum number of tokens per batch per GPU, so overall each batch contains 8192*8 tokens, which is close to 1024 sentences per batch.

Plus, I did not use any scheduler except for the warmup, the learning rate curve is as follows:

Also, the reported accuracy_per_sequence is exactly the criterion for this dataset as the paper states.

So I guess I am not doing anything wrong here, right?

ischlag · 2019-08-09T02:52:08Z

Well, I'm not sure how I'm supposed to "see" that. If you are certain that batch_size is actually the number of tokens per GPU instead of the number of samples used for one step, then so be it.

I changed the dataset generator a little by random selecting training data

Are you sure this is not going to skip data? The tf.data pipeline might do some caching and only goes through the generator once. Unfortunately, it is virtually impossible for me to tell by looking at the t2t code.

qwertier24 · 2019-08-09T02:59:20Z

@ischlag Sorry I did not explain it well because I thought you are familiar with T2T. The generator in T2T generates 2 million (question, answer) pairs per module. So I have to change it to make it generate 2M samples in total. The data generation procedure has been verified.

ischlag · 2019-08-09T03:23:08Z

I'm somewhat familiar with it but I decided to not use it due to its obscurity. I'm just trying to help you here. We are working on reproducing it ourselves with a clean PyTorch implementation and I'll post the results one we managed.

That said, you should not have 2M samples in total but n * 2M where n is the number of modules (I think 56 or so). ~~Furthermore, it is not clear to me how you encode the characters. Your gist file says text_problems.VocabType.CHARACTER which indicates byte-level encoding. I might be wrong on that though.~~

If that also doesn't help then I'm out of ideas. As a dummy experiment, you could train only on numbers__place_value, which in my case takes ca. 3-5k steps to train for virtually 100% accuracy.

qwertier24 · 2019-08-09T03:46:09Z

@ischlag You are right, I missed per module in the paper. So I guess 2M * 56 sequences should be used for training. Thank you for your help and look forward to your result!

ischlag · 2019-08-13T17:23:14Z

@mayukuner I'm currently training 3 baselines with my PyTorch implementation. The best result so far is 50% accuracy on all interpolation data after 45k steps and improving. So this starts to look promising. However, this is with a learning rate of 1e-4, not 6e-4. The 6e-4 run is stuck at a loss of 3.15 and 0% train accuracy even after 50k steps.

@davidsaxton Are you sure your learning rate in the paper is 6e-4 and not 6e-5?

qwertier24 · 2019-08-14T02:16:06Z

@ischlag Have you clipped the gradients of the tensors? You may also try to use warmup in the beginning of the training stage. The LR of 6e-4 seems OK to me. With tensor2tensor, the model can be trained to have an accuracy of 70% on interpolation test after 300k steps.

ischlag · 2019-08-14T16:35:18Z

Yes, I'm clipping the gradient norm of the parameters at 0.1. 6e-4 doesn't work at all. Even 3e-4 doesn't work at all. I've been very carefully going through my implementation several times.

My parameters are initialized from U[-a,a] with a=sqrt(6/in_out_avg). I share the Embedding matrix with the last layer before the softmax. I only scale the Embedding by sqrt(d_model) and I scale the dot products by 1/sqrt(d_k). Beta1 0.9, beta2 0.995. With the default epsilon. The Embedding I scale just like in the official transformer code but I'm not sure why it is sqrt(d_model). The one for the keys makes sense though. @mayukuner are you doing the same?

I'm still training and I'm now at 60% interpolation accuracy after 120k steps. So it looks good, just not with the right learning rate for me.

davidsaxton · 2019-08-15T16:29:15Z

@ischlag I clipped the gradient absolute value not norm (i.e., |g_i| <= 0.1 for every gradient index i)

chauhanjatin10 · 2020-01-29T18:00:52Z

Hi @mayukuner , can you share your implementation with me?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Questions about the training settings #5

Questions about the training settings #5

qwertier24 commented Jun 26, 2019

ischlag commented Aug 7, 2019

davidsaxton commented Aug 7, 2019

qwertier24 commented Aug 8, 2019 •

edited

Loading

ischlag commented Aug 8, 2019 •

edited

Loading

qwertier24 commented Aug 9, 2019 •

edited

Loading

ischlag commented Aug 9, 2019

qwertier24 commented Aug 9, 2019

ischlag commented Aug 9, 2019 •

edited

Loading

qwertier24 commented Aug 9, 2019

ischlag commented Aug 13, 2019

qwertier24 commented Aug 14, 2019

ischlag commented Aug 14, 2019 •

edited

Loading

davidsaxton commented Aug 15, 2019

chauhanjatin10 commented Jan 29, 2020

Questions about the training settings #5

Questions about the training settings #5

Comments

qwertier24 commented Jun 26, 2019

ischlag commented Aug 7, 2019

davidsaxton commented Aug 7, 2019

qwertier24 commented Aug 8, 2019 • edited Loading

ischlag commented Aug 8, 2019 • edited Loading

qwertier24 commented Aug 9, 2019 • edited Loading

ischlag commented Aug 9, 2019

qwertier24 commented Aug 9, 2019

ischlag commented Aug 9, 2019 • edited Loading

qwertier24 commented Aug 9, 2019

ischlag commented Aug 13, 2019

qwertier24 commented Aug 14, 2019

ischlag commented Aug 14, 2019 • edited Loading

davidsaxton commented Aug 15, 2019

chauhanjatin10 commented Jan 29, 2020

qwertier24 commented Aug 8, 2019 •

edited

Loading

ischlag commented Aug 8, 2019 •

edited

Loading

qwertier24 commented Aug 9, 2019 •

edited

Loading

ischlag commented Aug 9, 2019 •

edited

Loading

ischlag commented Aug 14, 2019 •

edited

Loading