Support Granite 3.0 and 3.1 models #558
Conversation
Fixes #557
Conversion Tests (except for Multimodal ones) have been done with the changes in here
@DRXD1000 Thank you so much for the support! I really appreciate that you caught that config PEBCAK; I'd been trying to figure out why the layer has different output behavior from Llama!
@JamesKunstle It was a pleasure! I can't see what causes the merge conflict; if I can be of further assistance, let me know :)
@DRXD1000 I'd like to request pausing the merge for just a bit for two things:
I'll also fix the merge conflicts.
@DRXD1000 For my future education: how did you debug the logit scaling value problem and pick the values per data type?
@JamesKunstle I was looking at the Granite implementation in Hugging Face and tried to load the model with Llama directly (in transformers, not Liger, to see whether a separate implementation could be skipped). After a short benchmark I noticed this did not work. Reading modeling_granite.py in transformers again, I noticed the logits_scaling with the very obvious # Main difference to llama comment (I must have been blind the first time reading it...). After that I checked your PR, changed the tests, and ran them. The tests failed at the logits. With the source code in mind, I then looked at the default settings in GraniteConfig and at some values from the trained models. Since they scaled with model size, I gradually increased them until the tests passed. Since normal users will not do pretraining, and a proper value of logits_scaling will already exist, this trial-and-error approach was fine for me. Wow, this got longer than planned 😅
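For context, the step being described can be sketched as follows. This is a minimal illustration with made-up weights and a hypothetical scaling value, not Liger's or transformers' actual code; the point is only that in modeling_granite.py the output head divides the projected logits by config.logits_scaling, which is the main difference from Llama:

```python
# Minimal sketch of the Granite-vs-Llama logit step (hypothetical values).

def llama_logits(hidden, weight):
    # Plain output projection: logits[i] = sum_j hidden[j] * weight[i][j]
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def granite_logits(hidden, weight, logits_scaling):
    # Same projection, then divide by the logits_scaling term.
    return [x / logits_scaling for x in llama_logits(hidden, weight)]

hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

print(llama_logits(hidden, weight))           # [0.5, 1.0]
print(granite_logits(hidden, weight, 8.0))    # [0.0625, 0.125]
```

With identical weights, the two outputs differ only by the division, which is why a Granite checkpoint loaded through a plain Llama path produces mismatched logits.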
@DRXD1000 That's an excellent explanation, thanks a lot! I hadn't considered that value; I figured it was defaulted in the config, so I didn't inspect it. I was trying to debug by isolating the layer from the …
Granite 3.(0,1) models are Llama-architecture models with some different scaling terms in various places. This commit adds granite model patching for decoder-only granite 3 models (not multimodal) and the corresponding tests. Signed-off-by: James Kunstle <[email protected]>
Sorry for any reviewers-
@ByronHsu Would like to have the workflow tests approved for merge!
Granite 3.(0,1) models are Llama-architecture models with some different scaling terms in various places. This commit adds granite model patching for decoder-only granite 3 models (not multimodal) and the corresponding tests.
Summary
This change enables patching Granite 3.(0,1) models with Liger kernels. We would like to use Liger kernels in our training implementation, but we're a Granite-first codebase for the moment.
Testing Done
Convergence tests confirm that loss and model parameters are equivalent with and without Liger kernels. Logits, however, are not equivalent even when only swapping the SwiGLUMLP layer. The atol and rtol may need to be tuned for Granite vs. Llama; I'm going to continue investigating this before this PR is merged.
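For reference, the tolerance criterion behind those logit comparisons has the usual shape |a - b| <= atol + rtol * |b| (the same criterion torch.allclose applies elementwise). A minimal pure-Python sketch with made-up logit values shows why larger-magnitude Granite logits can need retuned tolerances:

```python
def within_tol(a, b, atol=1e-8, rtol=1e-5):
    # Elementwise closeness: |a_i - b_i| <= atol + rtol * |b_i|,
    # mirroring the criterion torch.allclose uses.
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Hypothetical logits: a small absolute error on large-magnitude values.
ref = [120.0, -80.0, 15.0]
got = [120.1, -80.05, 15.02]

print(within_tol(got, ref))                       # False at default tolerances
print(within_tol(got, ref, atol=0.2, rtol=1e-3))  # True once tuned
```

The trade-off is that loosening atol/rtol too far would also let real divergence pass, so any Granite-specific values should be just wide enough to absorb the expected numerical differences.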
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence