Add nuts-rs's metric adaptation #311

Open
sethaxen opened this issue Jan 10, 2023 · 15 comments

@sethaxen (Member)

nuts-rs is a Rust implementation of NUTS that is wrapped by nutpie for use in PyMC. It has a novel metric adaptation approach, described as:

In a first window we sample using the identity as mass matrix and adapt the step size using the normal dual averaging algorithm. After discard_window draws we start computing a diagonal mass matrix using an exponentially decaying estimate for sqrt(sample_var / grad_var). After 2 * discard_window draws we switch to the estimated mass matrix and keep adapting it live until stop_tune_at.
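A rough sketch of that schedule in Julia (the window lengths, decay rate, and estimator details below are illustrative guesses, not nuts-rs's actual code):

```julia
# Illustrative sketch of the windowed diagonal-metric adaptation described
# above; window lengths, decay rate, and the estimator are assumptions.

mutable struct ExpDecayVar
    mean::Vector{Float64}
    var::Vector{Float64}
    weight::Float64
end

# Treat the initial mean/var as a single pseudo-observation (weight = 1),
# which acts as mild regularization.
ExpDecayVar(d::Integer) = ExpDecayVar(zeros(d), ones(d), 1.0)

# Exponentially decaying estimate of the elementwise mean and variance.
function update!(est::ExpDecayVar, x::AbstractVector; decay=0.98)
    est.weight = decay * est.weight + 1
    lr = 1 / est.weight
    delta = x .- est.mean
    est.mean .+= lr .* delta
    est.var .= (1 - lr) .* (est.var .+ lr .* delta .^ 2)
    return est
end

function adapt_diag_metric(draws, grads; discard_window=50, stop_tune_at=900)
    ndim = length(first(draws))
    draw_var = ExpDecayVar(ndim)
    grad_var = ExpDecayVar(ndim)
    metric = ones(ndim)                 # first window: identity mass matrix
    for (i, (x, g)) in enumerate(zip(draws, grads))
        i > stop_tune_at && break
        if i > discard_window           # start estimating after the first window
            update!(draw_var, x)
            update!(grad_var, g)
        end
        if i > 2 * discard_window       # switch to the estimated diagonal metric
            metric .= sqrt.(draw_var.var ./ grad_var.var)
        end
        # (the step size would be adapted with dual averaging throughout)
    end
    return metric
end

# Toy usage with fake draws/grads from a standard normal target:
draws = [randn(5) for _ in 1:400]
grads = [-x for x in draws]
adapt_diag_metric(draws, grads)
```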

On Mastodon, @aseyboldt shared a plot of benchmark results.

On Slack, he explained:

This shows the effective sample size divided by the total number of gradient evals (during warmup + sampling), for 200 and 1000 warmup steps, for nutpie and Stan (with a diagonal mass matrix), across different models.

The models were selected from posteriordb. In general, this seems to outperform Stan's metric adaptation and also work well for more models. It also seems to allow for much shorter warm-up phases.

@sethaxen (Member Author)

There's also a work-in-progress paper here explaining the theory behind the method: https://github.com/aseyboldt/covadapt-paper/blob/main/main.pdf

@aseyboldt

Would be great to have a second implementation of this!

In a first window we sample using the identity as mass matrix and adapt the step size using the normal dual averaging algorithm. After discard_window draws we start computing a diagonal mass matrix using an exponentially decaying estimate for sqrt(sample_var / grad_var). After 2 * discard_window draws we switch to the estimated mass matrix and keep adapting it live until stop_tune_at.

This is no longer up to date unfortunately, sorry about that.

I guess the ground truth really is just the Rust code right now, but roughly:

The logic for the mass matrix update is here: https://github.com/pymc-devs/nuts-rs/blob/main/src/adapt_strategy.rs#L202-L220

If something is unclear (or could be improved :-) ), feel free to ask.

@sethaxen (Member Author)

Thanks, @aseyboldt for the pointers! I'll check back in if I get lost. Btw, I was scanning your draft of the paper, and it seems like a really nice contribution. I'm excited to read the finished version.

@sethaxen (Member Author) commented Jan 26, 2023

The mass matrix is initialized using the gradient (https://github.com/pymc-devs/nuts-rs/blob/main/src/adapt_strategy.rs#L267)

Why this particular initialization choice? Sometimes $J^TJ$ for Jacobian $J$ is used to approximate the Hessian, which would correspond to using the square of the gradient instead of the absolute value; if the target distribution is IID Gaussian and the initial point is drawn from the same distribution, then the expected value of this initialization is the true inverse covariance matrix.
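Spelling that last claim out for one coordinate of an independent Gaussian (just a sanity check, nothing specific to nuts-rs): if $x_i \sim N(\mu_i, \sigma_i^2)$, then

$$
\frac{\partial \log p}{\partial x_i} = -\frac{x_i - \mu_i}{\sigma_i^2},
\qquad
\mathbb{E}\left[\left(\frac{\partial \log p}{\partial x_i}\right)^2\right]
= \frac{\mathbb{E}\left[(x_i - \mu_i)^2\right]}{\sigma_i^4}
= \frac{1}{\sigma_i^2},
$$

so the squared gradient at a draw from the target has expectation equal to the corresponding diagonal entry of the inverse covariance.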

@aseyboldt

That's a good question; initially I was using the square of the gradient. From some experiments I'm pretty sure that just doesn't work in practice, though (but verification of that would be great ;-) ).
My motivation for 1/abs(grad) was that if we initialize the gradient variance with grad^2 and the draw variance with ones, the resulting estimate sqrt(draw_var / grad_var) will be sqrt(1 / grad^2) = 1/abs(grad). I guess we can also interpret it as a regularized estimate.

So unfortunately I don't really have a good theoretical answer, more of a "that's what turned out to work well"...

@aseyboldt

@sethaxen Someone also asked on the Stan Slack, and I made a small notebook comparing the two initializations on a diagonal normal posterior: https://gist.github.com/aseyboldt/46ce8a0967026fd3b7ae11a53c884975

@nsiccha commented Jan 27, 2023

@aseyboldt that was me 👍
I performed the same experiment you did, just for a standard normal target and for varying numbers of dimensions. From both experiments it seems clear that which initialization method you would expect to perform better depends strongly on the posterior, via the number of dimensions and (in this simple example) the condition number of the (true) covariance matrix. In my experiment as well, the squared gradient "performed" much worse (higher condition number), because of the sampling distribution of the condition number it implies.

I wonder whether the amount of regularization has to depend on the number of dimensions, or whether one would want to use some hierarchical model, or whether one might want to sample a few additional points (from the initialization distribution) to get a better "feel" for the distribution of the gradients.
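For concreteness, here is a rough sketch of the kind of comparison I mean (not the notebook's actual code; the setup and all names are illustrative): draw an initial point from a diagonal normal target, build the diagonal inverse mass matrix from the gradient at that point in the two ways discussed, and look at the condition number of the preconditioned covariance.

```julia
using Random, Statistics

# Condition number of the covariance after preconditioning with a diagonal
# inverse mass matrix m; both are diagonal, so the eigenvalues are just ratios.
precond_cond(m, sd) = (r = sd .^ 2 ./ m; maximum(r) / minimum(r))

function init_experiment(ndim; log_sd_scale=0.0, rng=Random.default_rng())
    sd = exp.(log_sd_scale .* randn(rng, ndim))  # true sds; log(sd) ~ N(0, scale^2)
    x0 = sd .* randn(rng, ndim)                  # initial point drawn from the target
    g0 = -x0 ./ sd .^ 2                          # gradient of the log density at x0
    return (
        identity = precond_cond(ones(ndim), sd),
        sq_grad  = precond_cond(1 ./ abs2.(g0), sd),  # 1 / grad^2
        abs_grad = precond_cond(1 ./ abs.(g0), sd),   # 1 / |grad|
    )
end

# Average condition numbers over replicates; log_sd_scale = 0 gives the
# standard normal case, log_sd_scale = 1 varies the true scales.
for ndim in (10, 100, 1000)
    res = [init_experiment(ndim; log_sd_scale=1.0) for _ in 1:100]
    println(ndim, " => (identity = ", mean(getfield.(res, :identity)),
            ", sq_grad = ", mean(getfield.(res, :sq_grad)),
            ", abs_grad = ", mean(getfield.(res, :abs_grad)), ")")
end
```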

@mike-lawrence

one might want to sample a few additional points (from the initialization distribution) to get a better "feel" for the distribution of the gradients.

Maybe it would make sense to use Pathfinder to get to the typical set first, then do the metric adaptation using initialization points from the typical set?

@aseyboldt

@nsiccha I still think looking at a standard normal is a bit misleading. I think it is quite normal that the initial mass matrix gets worse with increasing dimension: the condition number only looks at the max and min eigenvalues, and with more dimensions there are just more variables that you could scale incorrectly. This also happens if you initialize using the identity and vary the true (diagonal) covariance.

With the code from the notebook and $\log(\text{true sd}) \sim N(0, 1)$:

[plots: condition number of the initial mass matrix vs. number of dimensions, for identity and gradient initialization]

So both initialization methods get worse when n_dim increases, but in both cases using the gradient to initialize is better.

I think what matters more is what happens in real-world posteriors. And there I observed that initializing with the gradient avoids a phase very early in sampling where the step size is very small and the tree size very big, i.e. where we do a lot of gradient evals for very little change in position.

@mike-lawrence

Maybe it would make sense to use pathfinder to get to the typical set first, then do the metric adaptation from initialization points from the typical set?

Yes, running Pathfinder or ADVI (which is also pretty straightforward to do using the Fisher distance, by the way) might improve things further, I think. There is also quite a bit of relatively low-hanging fruit in generalizing this to low-rank updates of the mass matrix (see for instance an experiment that I think could be simplified quite a bit here: https://github.com/aseyboldt/covadapt) or to non-linear transformations of the posterior (normalizing flows, etc.).
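To illustrate structurally what such a low-rank generalization could look like (this is just a generic diagonal-plus-low-rank form, not covadapt's actual parameterization), the inverse mass matrix stays cheap to apply to a momentum vector:

```julia
# Generic "diagonal plus low rank" inverse mass matrix: inv(M) = Diagonal(d) + V*V'.
# With d .> 0 this is positive definite, and applying it costs O(n*k).
struct DiagPlusLowRank
    d::Vector{Float64}   # positive diagonal part
    V::Matrix{Float64}   # n x k low-rank factor
end

apply_inv_metric(m::DiagPlusLowRank, p::AbstractVector) =
    m.d .* p .+ m.V * (m.V' * p)

# Toy usage:
m = DiagPlusLowRank(ones(1000), randn(1000, 4) ./ sqrt(1000))
p = randn(1000)
v = apply_inv_metric(m, p)   # velocity from momentum, without forming an n x n matrix
```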

@nsiccha commented Jan 30, 2023

@aseyboldt

I think more important is what happens in real world posteriors.

Absolutely. I've also realized that the dependence on the dimension is lower than I thought, or rather is less important for posteriors further from a standard normal. The above experiments still give a "pseudo-theoretical" justification for initializing the way you do: depending on the posterior, the gradient initialization performs best in expectation. 🎉

@mike-lawrence

Sounds like a job for posteriordb

@sethaxen (Member Author)

Thanks @aseyboldt and @nsiccha for this discussion! When I have a chance I'd like to play with some of these metric initialization experiments to build some intuition for why this works.

There is also quite a bit of relatively low hanging fruit with generalizing this to low rank updates to the mass matrix (see for instance an experiment that I think could be simplified quite a bit here: https://github.com/aseyboldt/covadapt) or non-linear transformations of the posterior (normalizing flows etc).

Yeah, it feels like there's something here. Pathfinder similarly uses a low-rank update for the metric, but it tends not to do well when the Hessian of the negative log density is not everywhere positive definite. I wonder if some of the ideas of both methods could be unified/generalized.

Sounds like a job for posteriordb

💯 This is the way to go for benchmarking. FYI, there's a Julia wrapper: https://github.com/sethaxen/PosteriorDB.jl. BridgeStan.jl should be registered soon, and then I have some code I'll release in a package that converts a PosteriorDB.Posterior to a LogDensityProblem, which will make benchmarking with AdvancedHMC, DynamicHMC, Pathfinder, and others much easier.
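For anyone unfamiliar with what that conversion involves: the LogDensityProblems.jl interface only needs a few methods, so the wrapper will look roughly like the sketch below. The `ToyModel` here just stands in for a BridgeStan-backed evaluator; none of this is the actual package code.

```julia
using LogDensityProblems

# Stand-in for a BridgeStan model compiled from the posterior's Stan code and
# data; here it is just a fixed-dimension standard normal.
struct ToyModel
    dim::Int
end
toy_logdensity(m::ToyModel, x) = -sum(abs2, x) / 2

# Wrapper exposing the LogDensityProblems.jl interface.
struct StanProblem{M}
    model::M
end

LogDensityProblems.logdensity(p::StanProblem{ToyModel}, x) = toy_logdensity(p.model, x)
LogDensityProblems.dimension(p::StanProblem{ToyModel}) = p.model.dim
# Order 0 here; the real wrapper would also provide gradients (order 1).
LogDensityProblems.capabilities(::Type{<:StanProblem}) =
    LogDensityProblems.LogDensityOrder{0}()

prob = StanProblem(ToyModel(10))
LogDensityProblems.logdensity(prob, zeros(10))  # -> 0.0
```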

@aseyboldt

Sounds like a job for posteriordb

@mike-lawrence That's where the plot in the description comes from :-)

@torfjelde (Member)

💯 This is the way to go for benchmarking. FYI, there's a Julia wrapper: https://github.com/sethaxen/PosteriorDB.jl. BridgeStan.jl should be registered soon, and then I have some code I'll release in a package that converts a PosteriorDB.Posterior to a LogDensityProblem, which will make benchmarking with AdvancedHMC, DynamicHMC, Pathfinder, and others much easier.

I'm excited 👀

@nsiccha commented Feb 1, 2023

Discussion on Stan Slack for any of the Julia folks not in the Stan Slack.
