Last week, we realized that the bias terms added to the residual stream by other layers are always passed through a layernorm (with its own bias) before reaching their destination within another layer. Given that, we might analyze a fused bias term between each pair of layers rather than splitting them apart.
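A minimal sketch of what "fusing" could mean here, under the assumption that we fold the layernorm gain into the downstream weights and combine the layernorm bias with the layer's own bias (all names and shapes below are hypothetical, not taken from the repo):

```python
import numpy as np

# Hypothetical dimensions: residual stream feeding a linear layer.
d_model, d_out = 8, 4
rng = np.random.default_rng(0)

# Layernorm parameters (gain gamma, bias beta) and the downstream linear layer.
gamma = rng.normal(size=d_model)
beta = rng.normal(size=d_model)
W = rng.normal(size=(d_out, d_model))
b = rng.normal(size=d_out)

def layernorm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def normalize(x, eps=1e-5):
    # Layernorm without the affine part.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Original computation: layernorm, then linear layer.
x = rng.normal(size=d_model)
out = W @ layernorm(x, gamma, beta) + b

# Folded version: absorb gamma into the weights (column-wise) and
# fuse the layernorm bias beta into the layer's bias term.
# The normalization itself is unchanged.
W_fold = W * gamma          # broadcasts gamma over columns of W
b_fused = W @ beta + b      # the single fused bias term to analyze

out_folded = W_fold @ normalize(x) + b_fused
assert np.allclose(out, out_folded)
```

The identity is just W @ (n(x) * gamma + beta) + b = (W * gamma) @ n(x) + (W @ beta + b), so `b_fused` carries the combined contribution of both biases.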
On initial analysis, the layernorm biases seem to account for a large share of the largest terms.
https://github.com/nicholasturner1/gpt-omics/blob/main/notebooks/220509_initial_gptj_run.ipynb
These terms also reach over surprisingly long distances (longer than the attention heads). There could be something interesting here.