[BUG] Using different distributed strategies of Megatron-LM to train the llama3.1-8B model results in inconsistent training loss #1324

Open · cailun01 opened this issue Dec 16, 2024 · 0 comments

Describe the bug
I trained llama-3.1-8B from scratch with 4 different distributed training strategies, using the same dataset and hyperparameters for every run, but the resulting training losses were inconsistent.
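
For reference, a minimal sketch of how four parallelism layouts could be launched with Megatron-LM's pretrain_gpt.py on a single 8-GPU node while holding seed, data, and hyperparameters fixed. The issue does not state the actual four strategies, so the layouts, batch sizes, seed, and omitted model/data flags below are placeholders:

```python
"""Illustrative only: the parallelism layouts below are assumptions,
not the reporter's actual configurations."""
import shlex

# Hypothetical parallelism layouts, each using all 8 GPUs.
# (tensor-parallel size, pipeline-parallel size, extra flags)
LAYOUTS = [
    (1, 1, []),                               # pure data parallelism (DP=8)
    (2, 1, ["--sequence-parallel"]),          # TP=2 with sequence parallelism
    (4, 2, []),                               # TP=4, PP=2
    (1, 1, ["--use-distributed-optimizer"]),  # DP=8 with distributed optimizer
]

# Arguments that must stay identical across runs; the llama-3.1-8B
# architecture, tokenizer, and data-path flags are omitted here.
COMMON_ARGS = [
    "--seed", "1234",
    "--micro-batch-size", "1",
    "--global-batch-size", "128",
    "--train-iters", "1000",
]

for tp, pp, extra in LAYOUTS:
    cmd = [
        "torchrun", "--nproc_per_node", "8", "pretrain_gpt.py",
        "--tensor-model-parallel-size", str(tp),
        "--pipeline-model-parallel-size", str(pp),
        *extra,
        *COMMON_ARGS,
    ]
    print(shlex.join(cmd))  # print each launch command for inspection
```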

To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.

Expected behavior
Regardless of the distributed training strategy used, the training loss should remain consistent.

Stack trace/logs
[Screenshot of the training-loss comparison attached in the original issue; not reproduced here.]
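
Since only a screenshot is attached, a small optional sketch for pulling the per-iteration `lm loss` values out of the text logs so the divergence between runs can be quantified. It assumes Megatron-LM's standard iteration log line (e.g. `... | lm loss: 6.123456E+00 | ...`); the log file names are placeholders:

```python
"""Sketch: compare `lm loss` across the four runs numerically.
Log format regex and file names below are assumptions."""
import re

LOSS_RE = re.compile(r"lm loss:\s*([\d.]+(?:E[+-]\d+)?)")

def read_losses(path):
    """Return all lm-loss values found in one training log."""
    with open(path) as f:
        return [float(m.group(1))
                for m in (LOSS_RE.search(line) for line in f) if m]

# Placeholder log files, one per distributed strategy.
runs = {name: read_losses(f"{name}.log")
        for name in ("dp8", "tp2_sp", "tp4_pp2", "dp8_distopt")}

# Report the spread (max - min) across strategies at each logged step.
n = min(len(v) for v in runs.values())
for i in range(n):
    vals = [losses[i] for losses in runs.values()]
    print(f"step {i}: spread = {max(vals) - min(vals):.6e}")
```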

Environment (please complete the following information):

| item | version |
| --- | --- |
| GPU model | 8*H20 |
| GPU driver | 535.183.06 |
| torch | 2.4.0+cu124 |
| nccl | 2.20.5+cuda12.4 |
| megatron-core | 0.9.0 (commit: 1afee59) |
| flash-attn | 2.5.7 |
| TransformerEngine | 1.12.0+7f2afaa |
| apex | commit: 2d8302 |

Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.

Additional context
Add any other context about the problem here.
