Duplicate of the question asked on the mutransformers repository (microsoft/mutransformers#3).
Hi!
I was wondering whether (learned) positional embeddings should be MuReadout layers, since they map to a finite-dimensional space. Specifically, this line:
https://github.com/microsoft/mutransformers/blob/480287ce7b18a07a3432e8f2fbc0f0e5b71e2599/mutransformers/models/bert/modeling_bert.py#L174
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
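For context, here is a minimal sketch of how I understand the standard recipe from the microsoft/mup package to be wired up, with the learned position embeddings left as a plain nn.Embedding and only the output head as MuReadout. The model and all its names (TinyEncoder, tok_emb, pos_emb, etc.) are made up for illustration; whether the position embeddings should instead get readout-style treatment is exactly what I'm asking above.

import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam

class TinyEncoder(nn.Module):
    """Toy model (hypothetical, for illustration only) with learned position embeddings."""
    def __init__(self, width, vocab_size=1000, max_positions=512, num_classes=2):
        super().__init__()
        # Input-side embeddings: in the standard muP recipe these stay as plain
        # nn.Embedding (they map a finite index set into the width dimension).
        self.tok_emb = nn.Embedding(vocab_size, width)
        self.pos_emb = nn.Embedding(max_positions, width)
        self.body = nn.Linear(width, width)
        # The output head maps width -> finite classes, i.e. the canonical MuReadout layer.
        self.head = MuReadout(width, num_classes)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        h = self.tok_emb(input_ids) + self.pos_emb(positions)
        h = torch.relu(self.body(h))
        return self.head(h.mean(dim=1))

# set_base_shapes infers which dimensions scale with width by comparing
# a base model and a delta model of different widths.
base = TinyEncoder(width=64)
delta = TinyEncoder(width=128)
model = TinyEncoder(width=256)
set_base_shapes(model, base, delta=delta)

# muP-aware optimizer from the mup package.
opt = MuAdam(model.parameters(), lr=1e-3)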
In addition to that, have you tried using muP for sparse MoE models? I'm curious about any findings there. Specifically, I was wondering whether the routing gate (hdim, num_experts) should also be a MuReadout layer (if we don't scale the number of experts). A sketch of the two parametrizations I have in mind follows.
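To make the question concrete, here are the two candidate treatments of the gate. This is just an illustration of what I'm asking, not a claim about which one is correct; the shapes and names are made up.

import torch.nn as nn
from mup import MuReadout

# Hypothetical router dimensions: hdim scales with width, num_experts is fixed.
hdim, num_experts = 256, 8

# Option A: leave the gate in the standard parametrization.
gate_sp = nn.Linear(hdim, num_experts)

# Option B: the variant being asked about. Since the gate maps a
# width-scaled dimension to a fixed number of experts, it is shaped
# like a readout, so one could wrap it in MuReadout instead.
# (As with any MuReadout, set_base_shapes must be called on the full
# model before this layer is used in a forward pass.)
gate_mup = MuReadout(hdim, num_experts)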
Would be grateful for any advice :)
Thank you!