Significantly Faster Adjoint Solves #49
Yes, I saw it. That'll only work on neural ODEs though. There are more than a few well-known results showing that you need very accurate gradients for the optimization of physical parameters in differential equations. Adjoint methods generally have difficulty because of this, and this approach just accentuates that difficulty. Pumas actually has a ton of examples showing how it can fail... they need a limitations section mentioning the non-generalizability of the results, and it should mention some of those.

FWIW, it's one line in DiffEqFlux. You just do

I can't share more details on the model here since IIRC it was on an FDA submission, but this plot really showcases how it fails. If your tolerances are too loose, you can get non-smooth changes in the gradient, and then the optimizer is not able to hone in on the saddle point. So what we had to do for the FDA submissions to work was ensure that all gradient calculations were done to at least 1e-8 tolerance, since otherwise you could get divergence due to the stiffness.

So saying that this gives "faster ODE adjoints" is misleading: the adjoint solve just doesn't take the error of the integral into account, and that can have some very adverse effects. In fact, in the latest version of the UDE paper there's a paragraph with, I think, 10-15 sources that demonstrate ways adjoint methods can fail because of this accuracy issue. So 🤷 it's a one-line thing with enough well-known counterexamples that I don't think any reviewer would accept it, so I've effectively ignored it.

Lars Ruthotto also has a nice paper demonstrating that these errors have a major effect on the training performance of even neural ODEs, which you can only see if you train with backprop AD: https://arxiv.org/pdf/2005.13420.pdf. But the traditional discrete sensitivity analysis literature in particular has a ton of damning examples that show you should probably not do this in any general-purpose code.
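For context, here is a minimal sketch of the kind of setup being described: adjoint gradients of an ODE solve taken at tight (1e-8) tolerances. It is not the snippet elided above; it assumes the DifferentialEquations.jl / DiffEqSensitivity.jl (now SciMLSensitivity.jl) / Zygote stack, and the dynamics, loss, and parameters are placeholders rather than the model from the thread.

```julia
using DifferentialEquations, DiffEqSensitivity, Zygote

# Placeholder dynamics standing in for the (undisclosed) physical model.
function lotka!(du, u, p, t)
    du[1] =  p[1] * u[1] - p[2] * u[1] * u[2]
    du[2] = -p[3] * u[2] + p[4] * u[1] * u[2]
end

u0 = [1.0, 1.0]
p0 = [1.5, 1.0, 3.0, 1.0]
prob = ODEProblem(lotka!, u0, (0.0, 10.0), p0)

# Placeholder loss over the trajectory; the gradient w.r.t. the physical
# parameters is taken through the solve with tight (1e-8) tolerances.
function loss(p)
    sol = solve(prob, Tsit5(); p = p, saveat = 0.1,
                abstol = 1e-8, reltol = 1e-8,
                sensealg = InterpolatingAdjoint())
    return sum(abs2, Array(sol))
end

grad, = Zygote.gradient(loss, p0)
```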
@ChrisRackauckas have you seen this work by Ricky already?
https://arxiv.org/abs/2009.09457
"Hey, that's not an ODE": Faster ODE Adjoints with 12 Lines of Code
Patrick Kidger, Ricky T. Q. Chen, Terry Lyons
Neural differential equations may be trained by backpropagating gradients via the adjoint method, which is another differential equation typically solved using an adaptive-step-size numerical differential equation solver. A proposed step is accepted if its error, *relative to some norm*, is sufficiently small; else it is rejected, the step is shrunk, and the process is repeated. Here, we demonstrate that the particular structure of the adjoint equations makes the usual choices of norm (such as L2) unnecessarily stringent. By replacing it with a more appropriate (semi)norm, fewer steps are unnecessarily rejected and the backpropagation is made faster. This requires only minor code modifications. Experiments on a wide range of tasks, including time series, generative modeling, and physical control, demonstrate a median improvement of 40% fewer function evaluations. On some problems we see as much as 62% fewer function evaluations, so that the overall training time is roughly halved.
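For concreteness, here is a sketch of the seminorm idea from the abstract (not the paper's actual code): the solver's error norm is replaced by one that ignores components whose accuracy does not matter for step selection. The layout assumed here, adjoint variables first and parameter-gradient accumulators last in an augmented state, is purely illustrative; in DifferentialEquations.jl a swapped norm can be passed through the standard `internalnorm` solver option.

```julia
using OrdinaryDiffEq

# RMS-style norm over only the leading `n_keep` components; the trailing
# components (assumed here to be the accumulated parameter-gradient part of
# an augmented adjoint state) are ignored in the step-error estimate.
seminorm(u::Number, t; n_keep = 1) = abs(u)   # scalar fallback
function seminorm(u::AbstractArray, t; n_keep = length(u))
    return sqrt(sum(abs2, view(u, 1:n_keep)) / n_keep)
end

# Hypothetical usage on a backward/adjoint problem `adj_prob` whose state has
# `n_p` trailing gradient-accumulator components:
# solve(adj_prob, Tsit5();
#       internalnorm = (u, t) -> seminorm(u, t; n_keep = length(u) - n_p))
```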