Epsilon change in normalise for stability #2421

billera · 2024-04-07T13:42:30Z

Normalise accepts an optional epsilon term intended to improve numerical stability. Previously, epsilon was added after computing the standard deviation of the input. Because the standard deviation involves a square root, gradients that depend on normalise become NaN when the variance is very low; for instance, a LayerNorm applied to low-variance inputs produces NaN gradients. By computing the variance first and taking the square root only after adding epsilon^2 (squared to preserve scale), we prevent NaNs in the gradients at low variance. See the following LayerNorm example, run without this change.

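using Flux
using Zygote

# LayerNorm over 256 features with eps = 1f-3
ln = LayerNorm(256; eps = 1f-3)
for i in 1:10
    # Ones plus Gaussian noise of scale 10^(-i), so the variance shrinks each iteration
    x = ones(Float32, 256) .+ randn(Float32, 256) .* 10f0^(-i)
    # Jacobian of the layer output with respect to x
    l, gs = Zygote.withjacobian(ln, x)
    @show maximum(gs[1])
end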


>>> maximum(gs[1]) = 9.44178f0
>>> maximum(gs[1]) = 95.85736f0
>>> maximum(gs[1]) = 477.4946f0
>>> maximum(gs[1]) = 910.05457f0
>>> maximum(gs[1]) = 985.8402f0
>>> maximum(gs[1]) = 995.0282f0
>>> maximum(gs[1]) = 995.9835f0
>>> maximum(gs[1]) = NaN32
>>> maximum(gs[1]) = NaN32
>>> maximum(gs[1]) = NaN32

We observe that while the gradients are capped at low variance by the epsilon added in the denominator, this does not prevent NaNs, which arise from the unpadded square root in the std computation. With the updated normalise, however, these NaNs disappear:

>>> maximum(gs[1]) = 9.531697f0
>>> maximum(gs[1]) = 105.468056f0
>>> maximum(gs[1]) = 674.7051f0
>>> maximum(gs[1]) = 991.67163f0
>>> maximum(gs[1]) = 996.03973f0
>>> maximum(gs[1]) = 996.09314f0
>>> maximum(gs[1]) = 996.0937f0
>>> maximum(gs[1]) = 996.0937f0
>>> maximum(gs[1]) = 996.0937f0
>>> maximum(gs[1]) = 996.0937f0

and remain fixed to the implicitly capped value. A simple test verifying this computation's equivalence with the previous one (modulo the differences at very low standard deviations) could be added if desired.
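
The two forms of the computation described above can be sketched using only the Statistics standard library. The names normalise_old, normalise_new, dsqrt_old and dsqrt_new are hypothetical helpers written for this illustration and are not the Flux source; the use of corrected=false is likewise an assumption. The sketch compares the forward results at ordinary variance, then evaluates the derivative of each denominator with respect to the variance at exactly zero, which is where the unpadded square root becomes unbounded.

```julia
using Statistics

# Hypothetical helper names, for illustration only; not the Flux source.

# Previous form: epsilon is added after the square root (inside std).
normalise_old(x; dims=1, eps=1f-5) =
    (x .- mean(x; dims)) ./ (std(x; dims, corrected=false) .+ eps)

# Proposed form: epsilon^2 is added to the variance before the square root.
normalise_new(x; dims=1, eps=1f-5) =
    (x .- mean(x; dims)) ./ sqrt.(var(x; dims, corrected=false) .+ eps^2)

# Derivative of each denominator with respect to the variance v.
dsqrt_old(v) = 1 / (2 * sqrt(v))               # unbounded as v -> 0
dsqrt_new(v, eps) = 1 / (2 * sqrt(v + eps^2))  # bounded by 1 / (2 * eps)

x = randn(Float32, 256)

# At ordinary variance the two forms give nearly the same output.
@show maximum(abs.(normalise_old(x) .- normalise_new(x)))

# At exactly zero variance the old derivative is infinite, the new one finite.
@show dsqrt_old(0f0)
@show dsqrt_new(0f0, 1f-3)

# Largest Jacobian entry of the new form for constant input: (1 - 1/n) / eps.
@show (1 - 1f0 / 256) / 1f-3
```

For n = 256 and eps = 1f-3 the last expression evaluates to roughly 996.09, which is consistent with the value at which the updated output above levels off, assuming the layer's affine scale is still at its initial value of one.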

codecov · 2024-04-07T14:44:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.02%. Comparing base (caa1cee) to head (b600f7a).

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #2421       +/-   ##
===========================================
+ Coverage   46.13%   74.02%   +27.88%     
===========================================
  Files          32       32               
  Lines        1877     1925       +48     
===========================================
+ Hits          866     1425      +559     
+ Misses       1011      500      -511

src/layers/stateless.jl

CarloLucibello · 2024-04-07T16:04:13Z

I agree with this change, and PyTorch does the same thing. It should be considered a breaking change though, so let's wait until we are near v0.15 before merging.

mcabbott · 2024-04-09T14:34:26Z

Can this have a test with an input that triggered the NaN behaviour before the change?

Ideally testing not just the function, but also LayerNorm, maybe BatchNorm, anything which uses this internally. Then if the implementation of these layers finally gets replaced, it will be harder to lose the change.

ToucheSir · 2024-04-13T02:40:29Z

Putting a backlink to #2096 because this work should close that.

epsilon change for stability

da2061d

billera closed this Apr 7, 2024

billera reopened this Apr 7, 2024

CarloLucibello reviewed Apr 7, 2024

CarloLucibello added this to the v0.15 milestone Apr 7, 2024

CarloLucibello added the breaking label Apr 7, 2024

Change comment for eps

b600f7a

Co-authored-by: Carlo Lucibello <[email protected]>
