Loss does not drop when using Liger Kernel at Qwen2.5 #257
Comments
Can you update your code example to show how you're applying Liger Kernel?
Fwiw, Qwen2.5 uses the same model architecture as Qwen2, so Liger should still work correctly: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json#L3
@tyler-romero
@Se-Hun cc @tyler-romero This is the same issue as #268: the monkey-patch methods, when applied to an already instantiated model, do not copy the weights of the original model. HF Trainer and TRL SFTTrainer rely on these methods, while axolotl does not. You can use axolotl until the issue is fixed.
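To illustrate the failure mode described in the comment above, here is a toy sketch in plain Python. It is not Liger Kernel's actual patching code: `ToyLinear`, `ToyFastLinear`, `ToyModel`, and both patch functions are hypothetical names invented for this example, and a single float stands in for a layer's weights. It only shows why swapping in a freshly constructed module loses the loaded weights, while patching the existing instance keeps them.

```python
import random


class ToyLinear:
    """Stand-in for a layer whose weights would come from a checkpoint."""

    def __init__(self, weight=None):
        # A freshly constructed layer gets a random weight in [0, 1),
        # like an untrained layer.
        self.weight = weight if weight is not None else random.random()

    def forward(self, x):
        return self.weight * x


class ToyFastLinear(ToyLinear):
    """Hypothetical optimized replacement that computes the same result."""

    def forward(self, x):
        return x * self.weight


class ToyModel:
    def __init__(self):
        # Pretend 2.0 was loaded from a pretrained checkpoint.
        self.layer = ToyLinear(weight=2.0)


def patch_by_replacing(model):
    # Problematic pattern: a new module instance is created and the
    # original (pretrained) weight is discarded.
    model.layer = ToyFastLinear()


def patch_in_place(model):
    # Safer pattern: keep the existing instance and its weight,
    # and swap only the class that provides forward().
    model.layer.__class__ = ToyFastLinear


broken = ToyModel()
patch_by_replacing(broken)

fixed = ToyModel()
patch_in_place(fixed)

print("weight after replacing the module:", broken.layer.weight)
print("weight after patching in place:  ", fixed.layer.weight)
```

In this toy setup, a model patched the first way would start training from random weights, which would be consistent with a loss that does not behave as expected when fine-tuning a pretrained model.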
In my case, training Qwen2.5-14B-Instruct, the grad norm quickly increases and becomes NaN.
@Arcmoon-Hu which version of liger-kernel are you on, and did you not see the issue without applying the kernel?
Hi, are there any updates? Thanks!
Thanks for the quick reply.
If you need any other information, I can provide it.
@Arcmoon-Hu could you provide a minimal reproducible script for the issue? Thanks!
The problem is solved; I just pulled the latest code and rebuilt it. It's really awesome! I tested the Qwen2.5-14B-Instruct model on one 8×A800 machine: the per-device batch size doubled (2 ➡️ 4), and with the total batch size kept equal, the training time dropped from 14 hours to 10.5 hours.
@Arcmoon-Hu good to know. I am aware of the transformer issue and will fix it ASAP.
🐛 Describe the bug
I am trying to instruction-tune Qwen2.5-14B-Instruct with Liger Kernel.
I know that Liger Kernel is supported in the dev version of Hugging Face transformers. However, when training the Qwen2.5 model with Liger Kernel, the loss value does not drop. Is Qwen2.5 not supported yet?
Reproduce
Python Code Example:
Run Example:
Versions
Environment Report:
Operating System: Linux-5.15.0-1047-oracle-x86_64-with-glibc2.35
Python version: 3.10.14
PyTorch version: 2.4.0+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.45.0.dev0