Why the max_seq_length = 512 for XLNet? #263

vr25 · 2020-04-23T12:23:40Z

Hi,

Just a conceptual question:
The paper mentions that XLNet borrows parts of Transformer-XL, which isn't limited to a fixed context, but the hyperparameters section says that the max length is 512.

Can you please help me better understand it?

Thanks!

mihaidobri · 2020-09-30T02:10:56Z

I have the same question. @zihangdai could you please help us with this?

mihaidobri · 2020-09-30T02:13:43Z

or maybe @kimiyoung ?

zihangdai · 2020-10-01T17:11:51Z

Assuming you are familiar with Transformer-XL, max_seq_length is the length of each training segment, i.e. the span over which you can back-prop (the gradient does not pass into the memory in Transformer-XL).

Then, why the value 512?
(1) Longer sequences require more pretraining time.
(2) Most of the tasks considered at that time did not really require handling long sequences: GLUE -> 128, SQuAD -> 512. RACE performance can be improved slightly if you also increase max_seq_length during finetuning. Technically, you can increase the sequence length during finetuning if you want. But if it's too long, generalization may suffer, as longer sequences are not seen during pretraining.
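
To make the segment/memory distinction concrete, here is a minimal sketch in plain Python (not code from the XLNet repo). It splits a long token sequence into training segments of max_seq_length and carries a cache of the most recent positions between segments. The function name `segment_with_memory`, the `mem_len` value of 384, and the 1200-token input are illustrative assumptions; real models cache hidden states rather than token ids, and the sketch only tracks which positions are visible.

```python
def segment_with_memory(tokens, max_seq_length, mem_len):
    """Yield (memory, segment) pairs for a long token sequence.

    `segment` is the part a model would back-prop through;
    `memory` stands in for cached states that are attended to
    but receive no gradient. Hypothetical helper for illustration.
    """
    memory = []
    for start in range(0, len(tokens), max_seq_length):
        segment = tokens[start:start + max_seq_length]
        yield memory, segment
        # Keep only the most recent mem_len positions for the next segment.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []


# Illustrative values only: a 1200-token document, 512-token segments.
tokens = list(range(1200))
for memory, segment in segment_with_memory(tokens, max_seq_length=512, mem_len=384):
    # Context visible to the segment = cached positions + the segment itself.
    print(len(memory), len(segment), len(memory) + len(segment))
```

In this sketch the back-prop span never exceeds 512 positions, while the attended context of the second segment is 384 + 512 positions, which is the sense in which the context is not fixed even though max_seq_length is.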

mihaidobri · 2020-10-01T19:19:45Z

@zihangdai thank you for your fast reply!
