MPMD detected error when using `optimum-neuron` with TP #24

michaelbenayoun · 2024-06-27T14:31:29Z

I am trying to train Llama / Mistral.

I run the following command:

`NEURON_RT_LOG_LEVEL=info XLA_USE_BF16=1 ./train_mistral.sh`

Here is the link to train_mistral.sh

The issue is that I get an `MPMD detected` error, which means that at some point at least two workers try to execute different graphs. To find out why, I diffed the two HLO graphs. I ran the script multiple times, and although the diff is not always the same, several runs produced this:

Basically:

In one case, a single parameter is the input to a select op.
In the other case, there are two parameters: the same one as in the first case (call it p) and a scalar (call it s). The input to the select op is then p - broadcast(s).
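The comparison described above can be reproduced with a plain text diff of the per-worker HLO dumps. This is a minimal sketch using only the standard library. The two HLO-like snippets are hypothetical and heavily simplified stand-ins, not the real graphs from this run, and `diff_hlo` is an illustrative helper name.

```python
import difflib

# Hypothetical, simplified HLO-like text for two workers. Real dumps are far
# larger; these only mirror the shape of the difference described above.
hlo_worker_a = """\
p = s64[8] parameter(0)
mask = pred[8] parameter(1)
zeros = s64[8] broadcast(constant(0))
out = s64[8] select(mask, zeros, p)
"""

hlo_worker_b = """\
p = s64[8] parameter(0)
mask = pred[8] parameter(1)
s = s64[] parameter(2)
shifted = s64[8] subtract(p, broadcast(s))
zeros = s64[8] broadcast(constant(0))
out = s64[8] select(mask, zeros, shifted)
"""


def diff_hlo(a: str, b: str) -> list:
    """Return a unified diff (list of lines) between two HLO text dumps."""
    return list(
        difflib.unified_diff(
            a.splitlines(),
            b.splitlines(),
            fromfile="worker_a.hlo",
            tofile="worker_b.hlo",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # An empty diff would mean both workers traced the same graph.
    print("\n".join(diff_hlo(hlo_worker_a, hlo_worker_b)))
```

Lines prefixed with `+` exist only in the second worker's graph. In this toy input they are the extra scalar parameter and the subtraction feeding the select.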

After analyzing it a bit, I think this computation comes from the ParallelEmbedding layer. For some reason, what is treated as a constant equal to 0 in one case is treated as a parameter in the other.
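One way such a constant-versus-parameter split could arise is sketched here, as an assumption rather than a confirmed diagnosis. In a Megatron-style vocabulary-parallel embedding, each rank subtracts the start offset of its vocabulary shard from the token ids and zeroes out ids it does not own. That offset is 0 on rank 0 and nonzero elsewhere. The sketch is plain Python with hypothetical helper names (`vocab_range`, `local_lookup_ids`); it is not the ParallelEmbedding implementation.

```python
def vocab_range(vocab_size: int, tp_size: int, rank: int) -> tuple:
    """Return the [start, end) slice of the vocabulary owned by `rank`.

    Assumes vocab_size is evenly divisible by tp_size.
    """
    per_rank = vocab_size // tp_size
    start = rank * per_rank
    return start, start + per_rank


def local_lookup_ids(token_ids, start, end):
    """Shift ids into the local shard; ids owned by other ranks become 0."""
    return [(t - start) if start <= t < end else 0 for t in token_ids]


if __name__ == "__main__":
    tp_size, vocab_size = 2, 32000  # illustrative values only
    for rank in range(tp_size):
        start, end = vocab_range(vocab_size, tp_size, rank)
        # Rank 0 subtracts 0, while every other rank subtracts a nonzero offset.
        print(rank, start, end, local_lookup_ids([5, 16005], start, end))
```

If the tracer folds the zero offset into the graph as a constant but passes a nonzero offset as a scalar input, rank 0 would trace a different graph from the other ranks, which would match the p versus p - broadcast(s) diff.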

I thought it could be linked to scalar specialization by XLA, so I also ran the job with XLA_NO_SPECIAL_SCALARS=1, but I still ended up with an MPMD detected error.

So I tried not using ParallelEmbedding. When sequence parallelism is enabled, I end up with:

In one case, it does [16, 64] -> [1, 16, 64], so the layout seems to be B x S x H. It then adds a reshape at the end to get S / 2 x B x H.
In the other case, it does [1, 64] -> [16, 1, 64], so the layout is S x B x H, and we again end up with S / 2 x B x H.

Finally, I tried disabling sequence parallelism and ended up with:

Note: when I disable tensor parallelism, it seems to work properly.


michaelbenayoun · 2024-07-17T13:50:18Z

It was linked to torch.autocast.


michaelbenayoun changed the title ~~MPMD detected when using optimum-neuron.~~ MPMD detected when using optimum-neuron with TP Jun 27, 2024

michaelbenayoun changed the title ~~MPMD detected when using optimum-neuron with TP~~ MPMD detected error when using optimum-neuron with TP Jun 27, 2024

michaelbenayoun mentioned this issue Jul 3, 2024

Fix MPMD detected error during training with TP huggingface/optimum-neuron#648

Merged

aws-taylor added the bug Something isn't working label Nov 11, 2024

awsjoshir pushed a commit that referenced this issue Dec 22, 2024

Merge pull request #24 from aws-neuron/release_cut_2.21

3aa65c6

Release 2.21
