❓ The question
I was curious whether there are any explicit mechanisms in place to prevent activation norms from exploding with the initialization OLMo 2 uses.
Specifically, with an N(0, 0.02) init followed by layers of the form
x = x + norm(attn(x, ...))
I would expect the variance of the activations to keep increasing across layers. The paper suggests otherwise, though, in Section 3.2. Unfortunately, when I initialize a random model, I don't see the behaviour described in the paper; instead, I observe a steady increase in activation variance.
I was wondering if someone from the team could shed light on what prevents this from happening.
FYI, my random model isn't an OLMo 2 model, but a similar transformer-based architecture; I do use QK layer norm in my attention layers.
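For context, here's a minimal sketch of the kind of toy setup I'm describing (a hypothetical post-norm residual stack in PyTorch, not the actual OLMo 2 code). With every projection weight drawn from N(0, 0.02) and the norm applied to the sub-layer output before the residual add, the printed activation variance grows roughly linearly with depth, which is the increase I'm referring to:

```python
import torch
import torch.nn as nn

# Hypothetical toy block of the form x = x + norm(attn(x, ...)),
# not the actual OLMo 2 implementation.
class ToyBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + self.norm(out)

d_model, n_layers = 512, 16
blocks = nn.ModuleList([ToyBlock(d_model) for _ in range(n_layers)])

# N(0, 0.02) init on every projection weight, zero biases;
# leave the LayerNorm scale/offset at their defaults (1 and 0).
for name, p in blocks.named_parameters():
    if ".norm." in name:
        continue
    if p.dim() >= 2:
        nn.init.normal_(p, mean=0.0, std=0.02)
    else:
        nn.init.zeros_(p)

x = torch.randn(4, 128, d_model)  # (batch, seq_len, d_model), unit-variance input
with torch.no_grad():
    for i, block in enumerate(blocks):
        x = block(x)
        print(f"layer {i:2d}: activation variance = {x.var().item():.3f}")
```

Since the norm renormalizes each sub-layer output to roughly unit variance before it is added back, the residual stream gains about one unit of variance per layer in this toy setup, which is why I'd expect the growth.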