Fix gradient clipping #5150
I'm actually fascinated: how did this bug go unnoticed for so long? Isn't grad clip on by default? Does this mean literally all of Hugging Face Trainer, Lightning, etc. that leverage the DeepSpeed backend were faulty, yet somehow everyone succeeded at training? Like, HOW? Is this not as critical? Just to be clear, I'm a huge fan and user of DeepSpeed and I am very glad this tool exists. I am just genuinely curious how this could have had so little impact for so long (in terms of training dynamics or so), given it looks like it impacts all of
@cloneofsimo One reason is that this function is called only in limited cases. I noticed this issue when I set zero_stage=0 and precision=fp32. I didn't see it when I changed zero_stage. Probably not many users have used DeepSpeed with this config.
@tohtana thanks for the clarification! Edit: yeah @SeunghyunSEO, @tohtana is right. It looks like other functions are used for ZeRO, which is why people use this in the first place.
@tohtana ty for your kind explanation. My understanding is that when ZeRO-3 (or 1, 2) is activated, there is no problem because we use this function to clip and scale the gradient, right?
The gradient clipping API doesn't apply the coefficient correctly. This PR resolves the issue and adds a test case. Co-authored-by: Logan Adams <[email protected]>
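For readers unfamiliar with what "applying the coefficient" means here, below is a minimal sketch of standard global-norm gradient clipping in plain Python. It is not DeepSpeed's code and does not reproduce the diff in this PR, which the thread does not show. The function name `clip_grad_norm`, the `eps` default, and the list-of-lists gradient layout are assumptions made for illustration only.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale gradients in place so their global L2 norm is at most max_norm.

    Hypothetical helper for illustration; `grads` is a list of flat lists of floats.
    Returns the global norm measured before clipping.
    """
    # Global L2 norm over every element of every gradient.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Clipping coefficient: below 1 only when the norm exceeds max_norm.
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        # The coefficient must be multiplied into every gradient element.
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= clip_coef
    return total_norm


# Gradients with global norm sqrt(9 + 16 + 144) = 13 are scaled down.
grads = [[3.0, 4.0], [12.0]]
norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))

# Gradients already inside the limit are left untouched.
small = [[0.3, 0.4]]
clip_grad_norm(small, max_norm=1.0)
```

A test in the spirit of the one this PR mentions could check that the norm after clipping does not exceed `max_norm`, and that gradients already under the limit are unchanged.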