Activations Exploding Across Layers #797

Open
c3-utsavdutta98 opened this issue Feb 15, 2025 · 0 comments
Labels
type/question An issue that's a question

Comments


c3-utsavdutta98 commented Feb 15, 2025

❓ The question

I was curious whether there are any explicit mechanisms in place to prevent activation norms from exploding under the initialization OLMo 2 uses.
Specifically, with an N(0, 0.02) init followed by layers of the form x = x + norm(attn(x, ...)), I would expect the variance of the activations to keep increasing across layers, since each normalized sublayer output adds a roughly unit-variance contribution to the residual stream. The paper suggests otherwise, though, in Section 3.2.

Unfortunately, when I initialize a random model, I don't see the behaviour described in the paper; instead I observe a steady increase in activation variance across layers.

Could someone from the team shed some light on what prevents this from happening?

FYI, my random model isn't an OLMo 2 model, but a similar transformer-based architecture, and I do use QK layer norm in my attention layers.
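For concreteness, here is a minimal sketch of the kind of growth I mean. This is not OLMo 2 code: a plain `nn.Linear` stands in for the attention/MLP sublayer, `nn.LayerNorm` stands in for the output norm, and the sizes (`D_MODEL`, `N_LAYERS`) and the `Block` module are made up for illustration. The only assumptions carried over from my setup are the N(0, 0.02) weight init and the x = x + norm(sublayer(x)) residual pattern.

```python
# Toy sketch, NOT OLMo 2's actual code.
# Each block does x = x + norm(sublayer(x)) with weights drawn from N(0, 0.02).
import torch
import torch.nn as nn

D_MODEL = 512    # hypothetical hidden size
N_LAYERS = 32    # hypothetical depth


class Block(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Linear projection as a stand-in for the attention/MLP sublayer.
        self.proj = nn.Linear(d_model, d_model, bias=False)
        # LayerNorm applied to the sublayer *output*, as in the post-sublayer-norm layout.
        self.norm = nn.LayerNorm(d_model)
        nn.init.normal_(self.proj.weight, mean=0.0, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual add of the normalized sublayer output.
        return x + self.norm(self.proj(x))


@torch.no_grad()
def main() -> None:
    torch.manual_seed(0)
    blocks = nn.ModuleList(Block(D_MODEL) for _ in range(N_LAYERS))
    x = torch.randn(8, 128, D_MODEL)  # (batch, seq, hidden); std ~1 at the input
    for i, block in enumerate(blocks, start=1):
        x = block(x)
        print(f"layer {i:2d}: residual-stream std = {x.std().item():.3f}")


if __name__ == "__main__":
    main()
```

Under these assumptions each normalized sublayer output has roughly unit variance, so the residual-stream variance after L layers is about 1 + L and the printed std grows roughly like sqrt(1 + L), which is the kind of steady increase I'm seeing.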

c3-utsavdutta98 added the type/question label on Feb 15, 2025