Refactor llama / mixtral / grok for shared features #267
Conversation
Force-pushed 62aecb3 to bb5be7e
@@ -122,33 +123,32 @@ def prefill(
     self._assert_device(seq_block_ids)
     self._assert_device(*cache_state, dtype=self.activation_dtype)
     h = self.token_embedding(tokens)
-    h *= 78.38367176906169
+    h *= math.sqrt(h.shape[-1])
This requires `import math`.
Better as torch? I don't think `math` traces well.
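A minimal sketch of the two spellings under discussion (function names are illustrative, not the model's actual code): the `math.sqrt` form evaluates to a Python float when the embedding dimension is a static int, while the torch form keeps the scale inside the graph, which may be what the reviewer is after for tracing/export.

```python
import math
import torch

def scale_embeddings_math(h: torch.Tensor) -> torch.Tensor:
    # Spelling from the diff: math.sqrt over the (static) embedding dim,
    # which bakes a Python float constant into the traced graph.
    return h * math.sqrt(h.shape[-1])

def scale_embeddings_torch(h: torch.Tensor) -> torch.Tensor:
    # Torch-only alternative suggested in review: compute the scale as a
    # tensor op so it stays inside the traced/exported graph.
    dim = torch.tensor(h.shape[-1], dtype=h.dtype, device=h.device)
    return h * torch.sqrt(dim)

# For a 6144-wide embedding both reproduce the old hardcoded constant:
# math.sqrt(6144) == 78.38367176906169
```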
Do we not have prefill here testing export? If not, that needs to be added.
We should try to make faked versions of each model that are locally exportable. @KyleHerndon can you look into making faked theta parameters for the model? It can just be a single layer with smaller tensors.
Here are the theta generators for attention/FFN/MoE blocks. We already have attention and prefill tests; we might need to add an export component in there.
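A rough sketch of what a faked single-layer theta could look like, written as plain PyTorch. The parameter names (`token_embd`, `blk.0.*`, etc.) are hypothetical placeholders, not necessarily the repo's actual theta layout or generator API:

```python
import torch

def make_fake_layer_theta(hidden_dim: int = 64, ffn_dim: int = 128,
                          vocab_size: int = 256) -> dict:
    """Build tiny random parameters for a single transformer block so the
    model can be constructed and exported locally without real weights."""
    g = torch.Generator().manual_seed(0)  # deterministic fakes for tests
    rand = lambda *shape: torch.randn(*shape, generator=g, dtype=torch.float16)
    return {
        "token_embd.weight": rand(vocab_size, hidden_dim),
        "blk.0.attn_q.weight": rand(hidden_dim, hidden_dim),
        "blk.0.attn_k.weight": rand(hidden_dim, hidden_dim),
        "blk.0.attn_v.weight": rand(hidden_dim, hidden_dim),
        "blk.0.attn_output.weight": rand(hidden_dim, hidden_dim),
        "blk.0.ffn_gate.weight": rand(ffn_dim, hidden_dim),
        "blk.0.ffn_up.weight": rand(ffn_dim, hidden_dim),
        "blk.0.ffn_down.weight": rand(hidden_dim, ffn_dim),
        "output_norm.weight": rand(hidden_dim),
        "output.weight": rand(vocab_size, hidden_dim),
    }
```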
Force-pushed bb5be7e to 770f83f
Force-pushed 770f83f to 2bd1bbc
Force-pushed 5beb357 to 4fb7663
Looks good to me
Force-pushed 4fb7663 to 80891fe
Many of these features can be toggled depending on architecture. Replumbing the configurations separately allows better reuse and a clearer understanding of how the models vary from each other.

grok uses a softcap; plumbing a value enables `sc * tanh(v / sc)`.

grok has some hardcoded values that have better representations, e.g. `sqrt(6144)` and `sqrt(3)`.

Output normalization is optional but used by mixtral. Presence of the tensor is sufficient for performing the normalization.

We remove the sparse moe block as we now know it will not be used due to poor performance.
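A rough sketch of two of the toggles described above. The names (`softcap`, `output_norm_weight`) are illustrative stand-ins for whatever the refactored config exposes, and the RMS-style form of the output normalization is an assumption, not confirmed by this PR:

```python
import torch
from typing import Optional

def apply_softcap(logits: torch.Tensor, softcap: Optional[float]) -> torch.Tensor:
    # grok-style softcapping: sc * tanh(v / sc); a no-op when no value
    # is plumbed through the config.
    if softcap is None:
        return logits
    return softcap * torch.tanh(logits / softcap)

def maybe_output_norm(h: torch.Tensor,
                      output_norm_weight: Optional[torch.Tensor],
                      eps: float = 1e-6) -> torch.Tensor:
    # Optional output normalization (used by mixtral): presence of the
    # weight tensor alone decides whether the normalization runs.
    if output_norm_weight is None:
        return h
    variance = h.pow(2).mean(-1, keepdim=True)
    return output_norm_weight * (h * torch.rsqrt(variance + eps))
```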