
Refactor llama / mixtral / grok for shared features #267

Merged: 6 commits into nod-ai:main, Oct 16, 2024

Conversation

rsuderman (Contributor)

Many of these features toggle on or off depending on the architecture. Replumbing the configurations separately allows better reuse and a clearer understanding of how the models vary from each other.

grok uses a softcap; plumbing a cap value through enables `sc * tanh(v / sc)`. grok also has some hardcoded values that have better representations, e.g. `sqrt(6144)` and `sqrt(3)`.

Output normalization is optional but used by mixtral. Presence of the tensor is sufficient for performing the normalization.
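
A minimal sketch of the three mechanisms described above, in plain PyTorch. The function and parameter names (`soft_cap`, `scale_embeddings`, `maybe_output_norm`, `softcap`, `output_norm_weight`) are illustrative assumptions, not the PR's actual plumbing:

```python
import math

import torch


def soft_cap(logits: torch.Tensor, softcap: float | None) -> torch.Tensor:
    # grok-style soft capping: sc * tanh(v / sc); skipped when no cap value
    # is plumbed through the config.
    if softcap is None:
        return logits
    return softcap * torch.tanh(logits / softcap)


def scale_embeddings(h: torch.Tensor) -> torch.Tensor:
    # Replaces the hardcoded 78.38367176906169 (i.e. sqrt(6144)) with a value
    # derived from the embedding width, so it holds for any model size.
    return h * math.sqrt(h.shape[-1])


def maybe_output_norm(h: torch.Tensor,
                      output_norm_weight: torch.Tensor | None,
                      eps: float = 1e-6) -> torch.Tensor:
    # Optional output normalization (used by mixtral): the presence of the
    # weight tensor is what enables it. Sketched here as an RMS-style norm.
    if output_norm_weight is None:
        return h
    variance = h.pow(2).mean(-1, keepdim=True)
    return output_norm_weight * h * torch.rsqrt(variance + eps)
```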

```diff
@@ -122,33 +123,32 @@ def prefill(
         self._assert_device(seq_block_ids)
         self._assert_device(*cache_state, dtype=self.activation_dtype)
         h = self.token_embedding(tokens)
-        h *= 78.38367176906169
+        h *= math.sqrt(h.shape[-1])
```
Collaborator:

This requires `import math`.

Member:

Better as torch? I don't think `math` traces well.
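
A hedged sketch of the two options under discussion (plain PyTorch, not the PR's code); whether `math` folds cleanly depends on the exporter treating the embedding width as static:

```python
import math

import torch


def scale_with_math(h: torch.Tensor) -> torch.Tensor:
    # math.sqrt runs on the Python int shape, so the scale is baked into the
    # trace as a constant; fine only while the last dim stays static.
    return h * math.sqrt(h.shape[-1])


def scale_with_torch(h: torch.Tensor) -> torch.Tensor:
    # A torch-native alternative keeps the sqrt in the graph, which is
    # friendlier if the embedding width ever becomes dynamic.
    return h * torch.sqrt(torch.tensor(h.shape[-1], dtype=h.dtype))
```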

Contributor:

Do we not have prefill export testing here? If not, that needs to be added.

Contributor (Author):

We should try to make faked versions of each model that are locally exportable. @KyleHerndon can you look into making faked theta parameters for the model? It can just be a single layer with smaller tensors.

Collaborator:

Here are the theta generators for the attention/FFN/MOE blocks. We already have attention and prefill tests; we might need to add an export component in there.
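
A rough, hypothetical illustration of what "faked" single-layer parameters could look like, written as a plain dict of small random torch tensors rather than sharktank's actual theta types; all names and shapes here are made up for the sketch:

```python
import torch


def make_tiny_fake_params(hidden: int = 32, heads: int = 4, ffn: int = 64):
    # One toy transformer layer with tiny tensors: enough to exercise export
    # paths locally without downloading real weights.
    head_dim = hidden // heads
    return {
        "token_embedding.weight": torch.randn(256, hidden),
        "attn.q.weight": torch.randn(heads * head_dim, hidden),
        "attn.k.weight": torch.randn(heads * head_dim, hidden),
        "attn.v.weight": torch.randn(heads * head_dim, hidden),
        "attn.o.weight": torch.randn(hidden, heads * head_dim),
        "ffn.up.weight": torch.randn(ffn, hidden),
        "ffn.down.weight": torch.randn(hidden, ffn),
        "output_norm.weight": torch.ones(hidden),
    }
```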

Resolved (outdated) thread on sharktank/sharktank/export_layer/export_moe.py.

Resolved (outdated) thread on sharktank/sharktank/models/grok/grok.py.

@IanNod (Contributor) left a review comment:

Looks good to me.

Many of these features toggle on or off depending on the architecture.
Replumbing the configurations separately allows better reuse and a
clearer understanding of how the models vary from each other.

grok uses a softcap; plumbing a cap value through enables `sc * tanh(v / sc)`.
grok also has some hardcoded values that have better representations,
e.g. `sqrt(6144)` and `sqrt(3)`.

Output normalization is optional but used by mixtral. Presence of the
tensor is sufficient for performing the normalization.

We remove the sparse moe block as we now know it will not be used due
to poor performance.

rsuderman merged commit f5fcd00 into nod-ai:main on Oct 16, 2024 (8 of 9 checks passed).
rsuderman deleted the refactor_llm branch on October 16, 2024 at 19:56.