[BUG] Using different distributed strategies of Megatron-LM to train the llama3.1-8B model results in inconsistent training loss #1324

Open · cailun01 opened this issue Dec 16, 2024 · 0 comments

Describe the bug
I trained llama-3.1-8B from scratch with 4 different distributed training strategies, using the same dataset and hyperparameters for every run, but the resulting training losses were inconsistent.
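
For reference, a minimal sketch of how four parallelism layouts could be launched with Megatron-LM's pretrain_gpt.py on a single 8-GPU node while holding seed, data, and hyperparameters fixed. The issue does not state the actual four strategies, so the layouts, batch sizes, seed, and omitted model/data flags below are placeholders:

```python
"""Illustrative only: the parallelism layouts below are assumptions,
not the reporter's actual configurations."""
import shlex

# Hypothetical parallelism layouts, each using all 8 GPUs.
# (tensor-parallel size, pipeline-parallel size, extra flags)
LAYOUTS = [
    (1, 1, []),                               # pure data parallelism (DP=8)
    (2, 1, ["--sequence-parallel"]),          # TP=2 with sequence parallelism
    (4, 2, []),                               # TP=4, PP=2
    (1, 1, ["--use-distributed-optimizer"]),  # DP=8 with distributed optimizer
]

# Arguments that must stay identical across runs; the llama-3.1-8B
# architecture, tokenizer, and data-path flags are omitted here.
COMMON_ARGS = [
    "--seed", "1234",
    "--micro-batch-size", "1",
    "--global-batch-size", "128",
    "--train-iters", "1000",
]

for tp, pp, extra in LAYOUTS:
    cmd = [
        "torchrun", "--nproc_per_node", "8", "pretrain_gpt.py",
        "--tensor-model-parallel-size", str(tp),
        "--pipeline-model-parallel-size", str(pp),
        *extra,
        *COMMON_ARGS,
    ]
    print(shlex.join(cmd))  # print each launch command for inspection
```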

To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.

Expected behavior
Regardless of the distributed training strategy used, the training loss should remain consistent.

Stack trace/logs
[Screenshot of the training-loss comparison attached in the original issue; not reproduced here.]
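
Since only a screenshot is attached, a small optional sketch for pulling the per-iteration `lm loss` values out of the text logs so the divergence between runs can be quantified. It assumes Megatron-LM's standard iteration log line (e.g. `... | lm loss: 6.123456E+00 | ...`); the log file names are placeholders:

```python
"""Sketch: compare `lm loss` across the four runs numerically.
Log format regex and file names below are assumptions."""
import re

LOSS_RE = re.compile(r"lm loss:\s*([\d.]+(?:E[+-]\d+)?)")

def read_losses(path):
    """Return all lm-loss values found in one training log."""
    with open(path) as f:
        return [float(m.group(1))
                for m in (LOSS_RE.search(line) for line in f) if m]

# Placeholder log files, one per distributed strategy.
runs = {name: read_losses(f"{name}.log")
        for name in ("dp8", "tp2_sp", "tp4_pp2", "dp8_distopt")}

# Report the spread (max - min) across strategies at each logged step.
n = min(len(v) for v in runs.values())
for i in range(n):
    vals = [losses[i] for losses in runs.values()]
    print(f"step {i}: spread = {max(vals) - min(vals):.6e}")
```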

Environment (please complete the following information):

| item | version |
| --- | --- |
| GPU model | 8*H20 |
| GPU driver | 535.183.06 |
| torch | 2.4.0+cu124 |
| nccl | 2.20.5+cuda12.4 |
| megatron-core | 0.9.0 (commit: 1afee59) |
| flash-attn | 2.5.7 |
| TransformerEngine | 1.12.0+7f2afaa |
| apex | commit: 2d8302 |

Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.

Additional context
Add any other context about the problem here.
