Add nuts-rs's metric adaptation #311

Open
sethaxen opened this issue Jan 10, 2023 · 15 comments

@sethaxen (Member)

nuts-rs is a Rust implementation of NUTS that is wrapped by nutpie for use in PyMC. It has a novel metric adaptation approach, described as:

In a first window we sample using the identity as mass matrix and adapt the step size using the normal dual averaging algorithm. After discard_window draws we start computing a diagonal mass matrix using an exponentially decaying estimate for sqrt(sample_var / grad_var). After 2 * discard_window draws we switch to the estimated mass matrix and keep adapting it live until stop_tune_at.
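A rough sketch of that schedule in Julia (the window lengths, decay rate, and estimator details below are illustrative guesses, not nuts-rs's actual code):

```julia
# Illustrative sketch of the windowed diagonal-metric adaptation described
# above; window lengths, decay rate, and the estimator are assumptions.

mutable struct ExpDecayVar
    mean::Vector{Float64}
    var::Vector{Float64}
    weight::Float64
end

# Treat the initial mean/var as a single pseudo-observation (weight = 1),
# which acts as mild regularization.
ExpDecayVar(d::Integer) = ExpDecayVar(zeros(d), ones(d), 1.0)

# Exponentially decaying estimate of the elementwise mean and variance.
function update!(est::ExpDecayVar, x::AbstractVector; decay=0.98)
    est.weight = decay * est.weight + 1
    lr = 1 / est.weight
    delta = x .- est.mean
    est.mean .+= lr .* delta
    est.var .= (1 - lr) .* (est.var .+ lr .* delta .^ 2)
    return est
end

function adapt_diag_metric(draws, grads; discard_window=50, stop_tune_at=900)
    ndim = length(first(draws))
    draw_var = ExpDecayVar(ndim)
    grad_var = ExpDecayVar(ndim)
    metric = ones(ndim)                 # first window: identity mass matrix
    for (i, (x, g)) in enumerate(zip(draws, grads))
        i > stop_tune_at && break
        if i > discard_window           # start estimating after the first window
            update!(draw_var, x)
            update!(grad_var, g)
        end
        if i > 2 * discard_window       # switch to the estimated diagonal metric
            metric .= sqrt.(draw_var.var ./ grad_var.var)
        end
        # (the step size would be adapted with dual averaging throughout)
    end
    return metric
end

# Toy usage with fake draws/grads from a standard normal target:
draws = [randn(5) for _ in 1:400]
grads = [-x for x in draws]
adapt_diag_metric(draws, grads)
```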

On Mastodon, @aseyboldt shared a plot of benchmark results.

On Slack, he explained:

This shows the effective sample size divided by the total number of gradient evals (during warmup + sampling), for 200 and 1000 warmup steps, for nutpie and Stan (with a diagonal mass matrix), across different models.

The models were selected from posteriordb. In general, this seems to outperform Stan's metric adaptation and also work well for more models. It also seems to allow for much shorter warm-up phases.

@sethaxen (Member Author)

There's also a work-in-progress paper here explaining the theory behind the method: https://github.com/aseyboldt/covadapt-paper/blob/main/main.pdf

@aseyboldt

Would be great to have a second implementation of this!

In a first window we sample using the identity as mass matrix and adapt the step size using the normal dual averaging algorithm. After discard_window draws we start computing a diagonal mass matrix using an exponentially decaying estimate for sqrt(sample_var / grad_var). After 2 * discard_window draws we switch to the estimated mass matrix and keep adapting it live until stop_tune_at.

This is no longer up to date unfortunately, sorry about that.

I guess the ground truth really is just the Rust code right now, but roughly:

The logic for the mass matrix update is here: https://github.com/pymc-devs/nuts-rs/blob/main/src/adapt_strategy.rs#L202-L220

If something is unclear (or could be improved :-) ), feel free to ask.

@sethaxen (Member Author)

Thanks, @aseyboldt for the pointers! I'll check back in if I get lost. Btw, I was scanning your draft of the paper, and it seems like a really nice contribution. I'm excited to read the finished version.

@sethaxen (Member Author) commented Jan 26, 2023

The mass matrix is initialized using the gradient (https://github.com/pymc-devs/nuts-rs/blob/main/src/adapt_strategy.rs#L267)

Why this particular initialization choice? Sometimes $J^TJ$ for Jacobian $J$ is used to approximate the Hessian, which would correspond to using the square of the gradient instead of the absolute value; if the target distribution is IID Gaussian and the initial point is drawn from the same distribution, then the expected value of this initialization is the true inverse covariance matrix.
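Spelling that last claim out for one coordinate of an independent Gaussian (just a sanity check, nothing specific to nuts-rs): if $x_i \sim N(\mu_i, \sigma_i^2)$, then

$$
\frac{\partial \log p}{\partial x_i} = -\frac{x_i - \mu_i}{\sigma_i^2},
\qquad
\mathbb{E}\left[\left(\frac{\partial \log p}{\partial x_i}\right)^2\right]
= \frac{\mathbb{E}\left[(x_i - \mu_i)^2\right]}{\sigma_i^4}
= \frac{1}{\sigma_i^2},
$$

so the squared gradient at a draw from the target has expectation equal to the corresponding diagonal entry of the inverse covariance.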

@aseyboldt

That's a good question; initially I was using the square of the gradient. From some experiments I'm pretty sure that just doesn't work in practice, though (but verification of that would be great ;-) ).
My motivation for 1/abs(grad) was that if we initialize the gradient variance with grad^2 and the draw variance with ones, the resulting estimate sqrt(draw_var / grad_var) will be sqrt(1 / grad^2) = 1/abs(grad). I guess we can also interpret it as a regularized estimate.

So unfortunately I don't really have a good theoretical answer, more of a "that's what turned out to work well"...

@aseyboldt

@sethaxen Someone also asked on the Stan Slack, and I made a small notebook comparing the two initializations on a diagonal normal posterior: https://gist.github.com/aseyboldt/46ce8a0967026fd3b7ae11a53c884975

@nsiccha commented Jan 27, 2023

@aseyboldt that was me 👍
I performed the same experiment you did, just for a standard normal target and for varying numbers of dimensions. From both experiments it seems clear that which initialization method you would expect to perform better depends strongly on the posterior, via the number of dimensions and (in this simple example) the condition number of the (true) covariance matrix. In my experiment as well, the squared gradient "performed" much worse (higher condition number), because of the sampling distribution of the condition number it implies.

I wonder whether the amount of regularization has to depend on the number of dimensions, or whether one would want to use some hierarchical model, or whether one might want to sample a few additional points (from the initialization distribution) to get a better "feel" for the distribution of the gradients.
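For concreteness, here is a rough sketch of the kind of comparison I mean (not the notebook's actual code; the setup and all names are illustrative): draw an initial point from a diagonal normal target, build the diagonal inverse mass matrix from the gradient at that point in the two ways discussed, and look at the condition number of the preconditioned covariance.

```julia
using Random, Statistics

# Condition number of the covariance after preconditioning with a diagonal
# inverse mass matrix m; both are diagonal, so the eigenvalues are just ratios.
precond_cond(m, sd) = (r = sd .^ 2 ./ m; maximum(r) / minimum(r))

function init_experiment(ndim; log_sd_scale=0.0, rng=Random.default_rng())
    sd = exp.(log_sd_scale .* randn(rng, ndim))  # true sds; log(sd) ~ N(0, scale^2)
    x0 = sd .* randn(rng, ndim)                  # initial point drawn from the target
    g0 = -x0 ./ sd .^ 2                          # gradient of the log density at x0
    return (
        identity = precond_cond(ones(ndim), sd),
        sq_grad  = precond_cond(1 ./ abs2.(g0), sd),  # 1 / grad^2
        abs_grad = precond_cond(1 ./ abs.(g0), sd),   # 1 / |grad|
    )
end

# Average condition numbers over replicates; log_sd_scale = 0 gives the
# standard normal case, log_sd_scale = 1 varies the true scales.
for ndim in (10, 100, 1000)
    res = [init_experiment(ndim; log_sd_scale=1.0) for _ in 1:100]
    println(ndim, " => (identity = ", mean(getfield.(res, :identity)),
            ", sq_grad = ", mean(getfield.(res, :sq_grad)),
            ", abs_grad = ", mean(getfield.(res, :abs_grad)), ")")
end
```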

@mike-lawrence

one might want to sample a few additional points (from the initialization distribution) to get a better "feel" for the distribution of the gradients.

Maybe it would make sense to use Pathfinder to get to the typical set first, then do the metric adaptation using initialization points from the typical set?

@aseyboldt

@nsiccha I still think looking at a standard normal is a bit misleading. I think it is quite normal that the initial mass matrix gets worse with increasing dimension: the condition number only looks at the max and min eigenvalues, and with more dimensions there are just more variables that you could scale incorrectly. This also happens if you initialize using the identity and vary the true (diagonal) covariance.

With the code from the notebook and $\log(\text{true sd}) \sim N(0, 1)$:

[plots: condition number of the initial mass matrix vs. number of dimensions, for identity and gradient initialization]

So both initialization methods get worse when n_dim increases, but in both cases using the gradient to initialize is better.

I think what matters more is what happens in real-world posteriors. And there I observed that initializing with the gradient avoids a phase very early in sampling where the step size is very small and the tree size very big, i.e. where we do a lot of gradient evals for very little change in position.

@mike-lawrence

Maybe it would make sense to use pathfinder to get to the typical set first, then do the metric adaptation from initialization points from the typical set?

Yes, running Pathfinder or ADVI (which is also pretty straightforward to do using the Fisher distance, by the way) might improve things further, I think. There is also quite a bit of relatively low-hanging fruit in generalizing this to low-rank updates of the mass matrix (see for instance an experiment that I think could be simplified quite a bit here: https://github.com/aseyboldt/covadapt) or to non-linear transformations of the posterior (normalizing flows, etc.).
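To illustrate structurally what such a low-rank generalization could look like (this is just a generic diagonal-plus-low-rank form, not covadapt's actual parameterization), the inverse mass matrix stays cheap to apply to a momentum vector:

```julia
# Generic "diagonal plus low rank" inverse mass matrix: inv(M) = Diagonal(d) + V*V'.
# With d .> 0 this is positive definite, and applying it costs O(n*k).
struct DiagPlusLowRank
    d::Vector{Float64}   # positive diagonal part
    V::Matrix{Float64}   # n x k low-rank factor
end

apply_inv_metric(m::DiagPlusLowRank, p::AbstractVector) =
    m.d .* p .+ m.V * (m.V' * p)

# Toy usage:
m = DiagPlusLowRank(ones(1000), randn(1000, 4) ./ sqrt(1000))
p = randn(1000)
v = apply_inv_metric(m, p)   # velocity from momentum, without forming an n x n matrix
```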

@nsiccha commented Jan 30, 2023

@aseyboldt

I think more important is what happens in real world posteriors.

Absolutely. I've also realized that the dependence on the dimension is lower than I thought, or rather is less important for posteriors further from a standard normal. The above experiments still give a "pseudo-theoretical" justification for initializing the way you do: depending on the posterior, the gradient initialization performs best in expectation. 🎉

@mike-lawrence

Sounds like a job for posteriordb

@sethaxen (Member Author)

Thanks @aseyboldt and @nsiccha for this discussion! When I have a chance I'd like to play with some of these metric initialization experiments to build some intuition for why this works.

There is also quite a bit of relatively low hanging fruit with generalizing this to low rank updates to the mass matrix (see for instance an experiment that I think could be simplified quite a bit here: https://github.com/aseyboldt/covadapt) or non-linear transformations of the posterior (normalizing flows etc).

Yeah, it feels like there's something here. Pathfinder similarly uses a low-rank update for the metric, but it tends not to do well when the Hessian of the negative log density is not everywhere positive definite. I wonder if some of the ideas of both methods could be unified/generalized.

Sounds like a job for posteriordb

💯 This is the way to go for benchmarking. FYI, there's a Julia wrapper: https://github.com/sethaxen/PosteriorDB.jl. BridgeStan.jl should be registered soon, and then I have some code I'll release in a package that converts a PosteriorDB.Posterior to a LogDensityProblem, which will make benchmarking with AdvancedHMC, DynamicHMC, Pathfinder, and others much easier.
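For anyone unfamiliar with what that conversion involves: the LogDensityProblems.jl interface only needs a few methods, so the wrapper will look roughly like the sketch below. The `ToyModel` here just stands in for a BridgeStan-backed evaluator; none of this is the actual package code.

```julia
using LogDensityProblems

# Stand-in for a BridgeStan model compiled from the posterior's Stan code and
# data; here it is just a fixed-dimension standard normal.
struct ToyModel
    dim::Int
end
toy_logdensity(m::ToyModel, x) = -sum(abs2, x) / 2

# Wrapper exposing the LogDensityProblems.jl interface.
struct StanProblem{M}
    model::M
end

LogDensityProblems.logdensity(p::StanProblem{ToyModel}, x) = toy_logdensity(p.model, x)
LogDensityProblems.dimension(p::StanProblem{ToyModel}) = p.model.dim
# Order 0 here; the real wrapper would also provide gradients (order 1).
LogDensityProblems.capabilities(::Type{<:StanProblem}) =
    LogDensityProblems.LogDensityOrder{0}()

prob = StanProblem(ToyModel(10))
LogDensityProblems.logdensity(prob, zeros(10))  # -> 0.0
```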

@aseyboldt

Sounds like a job for posteriordb

@mike-lawrence That's where the plot in the description comes from :-)

@torfjelde (Member)

💯 This is the way to go for benchmarking. FYI, there's a Julia wrapper: https://github.com/sethaxen/PosteriorDB.jl. BridgeStan.jl should be registered soon, and then I have some code I'll release in a package that converts a PosteriorDB.Posterior to a LogDensityProblem, which will make benchmarking with AdvancedHMC, DynamicHMC, Pathfinder, and others much easier.

I'm excited 👀

@nsiccha commented Feb 1, 2023

Discussion on Stan Slack for any of the Julia folks not in the Stan Slack.
