Encountered errors when reproducing lightning training example #271
Comments
I think it's related to the DeepSpeed model init method. When using DeepSpeed, the model should be initialized in a context where every newly created tensor has zero shape; the sharding and broadcast are implemented inside the DeepSpeed source. Something could have broken either somewhere in the Liger diffs or in a new DeepSpeed/HF release. Will take a look and get back to this issue ASAP.
So it was
Thanks @yundai424, the above issue was solved by installing liger-kernel-nightly.
Is this expected? And what should the baseline be for the Lightning trainer optimization?
🐛 Describe the bug
I tried to reproduce the Liger Kernel optimization on the Lightning trainer with DeepSpeed ZeRO-3 but encountered several errors.
Reproduce
script:
output:
I fixed the above error by adding "import deepspeed" to training.py, but after that another error was raised:
Versions
Environment Report:
Operating System: Linux-6.5.0-1025-azure-x86_64-with-glibc2.31
Python version: 3.10.14
PyTorch version: 2.4.1+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.42.4
deepspeed version: 0.15.0
liger_kernel version: 0.3.0
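For anyone comparing environments, a report like the one above can be regenerated with a short standard-library script. This is only a sketch, not the tool that produced the report in this issue: the helper names `collect_versions` and `format_report` are made up for illustration, and the distribution names in the list (for example `liger-kernel` and `lightning`) are assumptions about how the packages were installed. The CUDA version is not covered, since it is not available from package metadata.

```python
from importlib import metadata
import platform


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_report(packages):
    """Build a plain-text report in the same 'name version: x.y.z' layout."""
    lines = [
        f"Operating System: {platform.platform()}",
        f"Python version: {platform.python_version()}",
    ]
    for name, version in collect_versions(packages).items():
        lines.append(f"{name} version: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to match your install.
    print(format_report(
        ["torch", "triton", "transformers", "deepspeed", "lightning", "liger-kernel"]
    ))
```

Running it before and after switching Liger Kernel builds makes it easier to see which package version actually changed between a failing and a working run.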