Code which implements BADAM.
The Adam algorithm uses a preconditioner which scales parameter updates by the square root of an exponentially weighted moving average (EWMA) estimate of the squared gradients. This can readily be translated into a geometric notion of uncertainty: if this uncertainty is large, then the stepsize will be close to zero. "This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true gradient" (Kingma & Ba, 2015).
Likewise, the diagonal Hessian of the log-likelihood required for a Laplace approximation around the optimal parameters of a NN uses the squared gradients (via the Generalized Gauss-Newton approximation), but, importantly, with no square root or EWMA, unlike Adam. By interpreting adaptive subgradient methods in a Bayesian manner, can we construct approximate posteriors by leveraging the preconditioners from these optimizers, cheaply and essentially for free?
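As a rough sketch of the idea (assuming the networks are trained with PyTorch's Adam; `diag_posterior_variance`, `dataset_size` and `prior_precision` are illustrative names, not this repo's exact interface), the diagonal curvature estimate can be read straight out of the optimizer state after training:

```python
import torch

def diag_posterior_variance(optimizer, dataset_size, prior_precision=1.0):
    """Sketch: turn Adam's EWMA of squared gradients ('exp_avg_sq') into a
    diagonal Laplace-style posterior variance, assuming the training loss is
    an average over examples so the curvature scales with dataset_size.
    Bias correction is omitted for brevity."""
    variances = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state[p]
            if "exp_avg_sq" not in state:
                continue  # parameter never updated by Adam
            # GGN/Fisher-style diagonal curvature (no square root, unlike Adam's update)
            curvature = dataset_size * state["exp_avg_sq"]
            # Posterior precision = data curvature + Gaussian prior precision
            variances.append(1.0 / (curvature + prior_precision))
    return variances
```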
Required packages are listed in requirements.txt, which contains instructions for creating a conda env with the correct versions. Then run:
mkdir logs
mkdir plots
There are a couple of different ways to use the curvature estimates from Adam as an estimate of uncertainty in the loss landscape. One option is to perform a Laplace approximation of the likelihood and then introduce a Gaussian prior; by conjugacy the resulting posterior over the weights is again Gaussian, with a mean that shrinks the trained weights towards zero and a precision that adds the prior precision to the curvature estimate.
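A hedged sketch of this first variant, writing $\theta^{*}$ for the trained weights, $\hat{v}$ for Adam's EWMA of the squared gradients, $N$ for the dataset size and $\lambda$ for the prior precision (the exact scaling used in the code may differ):

$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}\Big(\big(N\,\mathrm{diag}(\hat{v}) + \lambda I\big)^{-1} N\,\mathrm{diag}(\hat{v})\,\theta^{*},\ \big(N\,\mathrm{diag}(\hat{v}) + \lambda I\big)^{-1}\Big)$$

so weights whose curvature estimate is small relative to $\lambda$ have their mean shrunk towards zero.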
Alternatively, one can introduce an explicit L^2 regularisation for the weights of the NN during training and then use the curvature estimates for a Laplace approximation around the posterior mode (the MAP weights).
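With the same notation, and $\theta_{\mathrm{MAP}}$ denoting the weights trained with an explicit penalty $\frac{\lambda}{2}\lVert\theta\rVert^{2}$ (again a sketch, not necessarily the exact expression used in the code):

$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}\Big(\theta_{\mathrm{MAP}},\ \big(N\,\mathrm{diag}(\hat{v}) + \lambda I\big)^{-1}\Big)$$

Here the mean stays at the MAP solution and only the covariance uses the curvature estimate.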
For pruning, the exact version does not matter too much, since the signal-to-noise ratios of the two versions are equal. For the regression experiments, on the other hand, the Laplace-around-the-posterior version of BADAM works better: the shrinkage coefficient on the mean in the first version can zero out some important weights in the small NN models used for the regression experiments.
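For illustration, a minimal sketch of SNR-based pruning given flattened posterior means and variances from either variant (`prune_by_snr` and `keep_fraction` are hypothetical names, not this repo's interface):

```python
import torch

def prune_by_snr(mean, variance, keep_fraction=0.5):
    """Sketch: rank weights by signal-to-noise ratio |mean| / std and zero out
    the least certain ones. `mean` and `variance` are flat tensors from the
    approximate posterior."""
    snr = mean.abs() / variance.sqrt()
    k = int(keep_fraction * snr.numel())
    threshold = torch.topk(snr, k).values.min()  # SNR of the k-th best weight
    mask = snr >= threshold
    return mean * mask, mask
```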
I haven't had time to tidy up the code, so please get in touch if you have any questions.