Sequence Parallel is incompatible with Rotary Positional Embedding #385

anogkongda · 2024-05-09T12:43:04Z

I would like to fine-tune Llama 2 on long-sequence data (sequence length of 32K or more).

I followed the example below for sequence parallelism:

https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/deepspeed4science/megatron_long_seq_support/pretrain_gpt_30B_seq_parallel.sh

Unfortunately, the LM loss is NaN when I use rotary positional embedding.
When I disable rotary positional embedding, the loss is fine, even though all other parameters and arguments are unchanged.
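For readers less familiar with the two pieces involved, here is a minimal pure-Python sketch of what rotary positional embedding computes and why the position index matters once a sequence is split across ranks. It is not taken from Megatron-DeepSpeed: `rope`, `dot`, and `shard_positions` are hypothetical helpers, the interleaved pairing and base of 10000 are assumptions, and nothing in this thread establishes that position handling is the cause of the NaN reported here.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`.

    Assumption: interleaved pairing (x0, x1), (x2, x3), ... and base 10000.
    """
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def shard_positions(seq_len, world_size, rank):
    """Global token positions held by `rank` when the sequence is split evenly.

    Hypothetical helper: assumes contiguous, equal-sized shards.
    """
    chunk = seq_len // world_size
    start = rank * chunk
    return list(range(start, start + chunk))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# The attention score depends only on the relative offset (here 3).
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# If a rank used shard-local indices instead of global ones, the offset changes.
# Global position 5 lives on rank 1 (local index 1); global position 2 on rank 0.
rank1 = shard_positions(seq_len=8, world_size=2, rank=1)
local_index_of_5 = rank1.index(5)
score_local = dot(rope(q, local_index_of_5), rope(k, 2))
```

Here `score_a` and `score_b` agree up to floating-point error because both pairs are three positions apart, while `score_local` differs because the shard-local index changes the relative offset. Comparing the position indices each rank actually feeds into the rotary embedding is therefore one cheap thing to inspect when debugging.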

anogkongda · 2024-05-10T08:32:54Z

After testing, I found the following:

Reducing the model size (e.g., the original 32-layer LLaMA 7B reduced to 16 layers) prevents the loss from becoming NaN.
Switching from BF16 to FP16 also prevents the loss from becoming NaN.
When the loss becomes NaN, there's no protection mechanism, which causes all model parameters to turn into NaN.
When Sequence Parallel is enabled, the BF16 optimizer might overflow under certain circumstances, possibly due to numerical errors.
I am still monitoring how the loss evolves under FP16 training.
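The "no protection mechanism" point above can be illustrated with a small, framework-agnostic sketch of a guard that skips the parameter update whenever the loss or any gradient is non-finite. FP16 setups with dynamic loss scaling commonly skip a step when overflow is detected, which may be part of why FP16 behaves differently here, though the thread does not confirm that. The names `all_finite`, `guarded_step`, and `sgd_update` are hypothetical, and the list of floats is only a stand-in for real model parameters.

```python
import math
from typing import Callable, Iterable, Sequence


def all_finite(values: Iterable[float]) -> bool:
    """Return True only if every value is finite (no NaN, no +/-inf)."""
    return all(math.isfinite(v) for v in values)


def guarded_step(
    loss: float,
    grads: Sequence[float],
    apply_update: Callable[[Sequence[float]], None],
) -> bool:
    """Apply the update only when the loss and all gradients are finite.

    Returns True if the update was applied, False if the step was skipped.
    """
    if not math.isfinite(loss) or not all_finite(grads):
        return False  # skip the step: parameters stay untouched
    apply_update(grads)
    return True


# Toy usage: plain SGD on a list of floats (hypothetical stand-in for a model).
params = [1.0, -2.0]
lr = 0.5


def sgd_update(grads: Sequence[float]) -> None:
    for i, g in enumerate(grads):
        params[i] -= lr * g


applied_ok = guarded_step(0.5, [1.0, 1.0], sgd_update)            # finite: applied
applied_bad = guarded_step(float("nan"), [1.0, 1.0], sgd_update)  # NaN loss: skipped
```

A guard like this does not fix the underlying numerical problem, but it keeps a single bad step from turning every parameter into NaN, and counting the skipped steps gives a rough signal of how often the overflow occurs.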

inkcherry · 2024-06-05T07:31:37Z

Hi @anogkongda, I also encountered the NaN issue and resolved it with #399. Could you try it and see whether it solves your problem?

anogkongda · 2024-06-11T11:54:52Z

> Hi @anogkongda, I also encountered the NaN issue and resolved it with #399. Could you try it and see whether it solves your problem?

Thank you, I will try this and report my results ASAP.

anogkongda · 2024-06-18T03:29:30Z

It doesn't work in my case. I'm still investigating to get training to run correctly.
