❓ The question
I was curious whether there are any explicit mechanisms in place to prevent activation norms from exploding with the initialization OLMo 2 uses.
Specifically, with an N(0, 0.02) init followed by layers of the form
x = x + norm(attn(x, ...))
I would expect the variance of the activations to keep increasing across layers. The paper suggests otherwise, though, in Section 3.2. Unfortunately, when I initialize a random model, I don't see the behaviour described in the paper; instead, I observe a steady increase in activation variance.
I was wondering if someone from the team could shed light on what prevents this from happening.
FYI, my random model isn't an OLMo 2 model, but a similar transformer-based architecture; I do use QK layer norm in my attention layers.
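For context, here's a minimal sketch of the kind of toy setup I'm describing (a hypothetical post-norm residual stack in PyTorch, not the actual OLMo 2 code). With every projection weight drawn from N(0, 0.02) and the norm applied to the sub-layer output before the residual add, the printed activation variance grows roughly linearly with depth, which is the increase I'm referring to:

```python
import torch
import torch.nn as nn

# Hypothetical toy block of the form x = x + norm(attn(x, ...)),
# not the actual OLMo 2 implementation.
class ToyBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + self.norm(out)

d_model, n_layers = 512, 16
blocks = nn.ModuleList([ToyBlock(d_model) for _ in range(n_layers)])

# N(0, 0.02) init on every projection weight, zero biases;
# leave the LayerNorm scale/offset at their defaults (1 and 0).
for name, p in blocks.named_parameters():
    if ".norm." in name:
        continue
    if p.dim() >= 2:
        nn.init.normal_(p, mean=0.0, std=0.02)
    else:
        nn.init.zeros_(p)

x = torch.randn(4, 128, d_model)  # (batch, seq_len, d_model), unit-variance input
with torch.no_grad():
    for i, block in enumerate(blocks):
        x = block(x)
        print(f"layer {i:2d}: activation variance = {x.var().item():.3f}")
```

Since the norm renormalizes each sub-layer output to roughly unit variance before it is added back, the residual stream gains about one unit of variance per layer in this toy setup, which is why I'd expect the growth.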