
Support Granite 3.0 and 3.1 models #558

Open
wants to merge 5 commits into main
Conversation


@JamesKunstle commented Feb 5, 2025

Granite 3.(0,1) models are Llama-architecture models with different scaling terms in various places. This commit adds Granite model patching for decoder-only Granite 3 models (not multimodal) and the corresponding tests.

Summary

This change enables patching Granite 3.(0,1) models with Liger kernels. We would like to use Liger kernels in our training implementation, but we're a Granite-first codebase for the moment.
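For readers new to Liger, usage would presumably mirror the library's existing per-architecture entry points. A minimal sketch, assuming this PR follows the apply_liger_kernel_to_<model> naming convention used by the Llama and Mistral patchers (the function name and keyword set here are assumptions, not read off the diff):

```python
from transformers import AutoModelForCausalLM

# Assumed entry point, by analogy with apply_liger_kernel_to_llama et al.
from liger_kernel.transformers import apply_liger_kernel_to_granite

# Monkey-patch the transformers Granite modeling classes BEFORE instantiation.
apply_liger_kernel_to_granite(
    rope=True,      # Liger rotary embeddings
    rms_norm=True,  # Liger RMSNorm
    swiglu=True,    # Liger SwiGLU MLP
)

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.1-8b-instruct")
```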

Testing Done

Convergence tests confirm that loss and model parameters are equivalent with and without Liger kernels. Logits, however, are not equivalent even when only swapping in the SwiGLUMLP layer. The atol and rtol may need to be tuned for Granite vs. Llama; I'm going to continue investigating before this PR is merged.
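(For context, the logits check being tuned amounts to an elementwise comparison of this shape; the tolerance values below are illustrative placeholders, not the repo's actual test settings.)

```python
import torch

def assert_logits_match(liger_logits, hf_logits, atol=1e-3, rtol=1e-2):
    # Fails wherever |liger - hf| > atol + rtol * |hf|; these two knobs are
    # what may need Granite-specific values instead of the Llama defaults.
    torch.testing.assert_close(liger_logits, hf_logits, atol=atol, rtol=rtol)
```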

  • Hardware Type: EC2 g6e.12xlarge; 4xL40s
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@JamesKunstle

Fixes #557

@DRXD1000 left a comment

Convergence tests (except for the multimodal ones) have been run with the changes in here.

@JamesKunstle

@DRXD1000 Thank you so much for the support! I really appreciate that you caught that config PEBCAK; I've been trying to figure out why the layer has different output behavior from Llama!

A comment from @DRXD1000 was marked as outdated.

@DRXD1000

@JamesKunstle it was a pleasure! I can't see what's causing the merge conflict; if I can be of further assistance, let me know :)

@JamesKunstle

JamesKunstle commented Feb 17, 2025

@DRXD1000 I'd like to request pausing the merge for just a bit, for two reasons:

  1. I want to run the convergence tests myself so I can support this code in the future,
  2. I need to get the FusedLinearCrossEntropy layer working correctly for Granite, and to raise a clear error if someone selects it before that support lands. I think Granite may require logit materialization for scaling during the backward pass, but I'm going to try to find a solution so we can still use that kernel (see the sketch below).

I'll also fix the merge conflicts.
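To make point 2 concrete, here is a sketch (not Liger's kernel code) of the unfused reference semantics a fused linear + cross-entropy path must reproduce for Granite; the extra division is exactly what a fusion that never materializes the full logits matrix has to fold into both the forward loss and the backward gradient:

```python
import torch
import torch.nn.functional as F

def granite_unfused_loss(hidden, lm_head_weight, labels, logits_scaling):
    # hidden: [tokens, hidden_dim], lm_head_weight: [vocab, hidden_dim],
    # labels: [tokens]
    logits = hidden @ lm_head_weight.T  # materializes the [tokens, vocab] matrix
    logits = logits / logits_scaling    # the Granite-specific scaling step
    return F.cross_entropy(logits, labels)
```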

@JamesKunstle

@DRXD1000 For my future education: how did you debug the logit scaling value problem and pick the values per data type?

@DRXD1000

DRXD1000 commented Feb 18, 2025

@JamesKunstle I was looking at the Granite implementation in Hugging Face and tried to load the model with Llama directly (in transformers, not Liger, to see if it was possible to skip a separate implementation). After a short benchmark I noticed this did not work.

After reading modeling_granite.py in transformers again, I noticed the logits_scaling with the very obvious #Main difference to llama comment (I must have been blind the first time reading it...).
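(For anyone following along, the difference being described looks roughly like the toy module below; this is a paraphrase of the idea, not the verbatim transformers source.)

```python
import torch
import torch.nn as nn

class ToyGraniteHead(nn.Module):
    """Toy reproduction of the Granite-vs-Llama difference: the lm_head
    output is divided by the config's logits_scaling before the loss."""

    def __init__(self, hidden_dim, vocab_size, logits_scaling):
        super().__init__()
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.logits_scaling = logits_scaling

    def forward(self, hidden_states):
        logits = self.lm_head(hidden_states)
        return logits / self.logits_scaling  # Llama has no such division
```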

After that I checked your PR, changed the tests, and ran them. The test failed at the logits comparison. With the source code in mind, I then looked at the default settings in GraniteConfig and at some values from the trained models. Since they scaled with model size, I gradually increased them until the test passed.

Since normal users will not do pretraining, and a proper value of logits_scaling will already exist, it was fine for me to take this trial-and-error approach.

Wow this got longer than planned 😅

@JamesKunstle

@DRXD1000 That's an excellent explanation; thank you, it helps a lot! I hadn't considered that value: I figured it was defaulted in the config, so I didn't inspect it. I was trying to debug by isolating the SwiGLUMLP layer from the GraniteMLP layer and comparing the individual logits; those were different too in my testing, so I was pretty confused. Your approach seems like a much better way to investigate.

JamesKunstle and others added 5 commits February 18, 2025 15:32
Granite 3.(0,1) models are Llama-architecture models with some different scaling
terms in various places. This commit adds granite model patching for
decoder-only granite 3 models (not multimodal) and the corresponding
tests.

Signed-off-by: James Kunstle <[email protected]>
@JamesKunstle

Sorry to any reviewers: ruff reformatted a lot of code when I ran make checkstyle.

@JamesKunstle

@ByronHsu I'd like to have the workflow tests approved for merge!

@lancerts self-requested a review February 19, 2025