Adjoint ode design #37

wds15 · 2021-02-17T20:37:10Z

This PR discusses how the adjoint ODE approach can be integrated with Stan math. The focus of the document is to

describe the merits of the approach
how the CVODES user manual notation lines up with Stan math notation
shows how the CVODES functionality is integrated with the Stan math autodiff library
proposes a super-set of tuning parameters for the numerical procedures which need to be assessed by the community in order to weed them out somewhat (or not at all)

I am not sure on how to present a rendered version of this. I used this R command to render the text:

rmarkdown::render("0027-adjoint-ode.md", output_format="bookdown::html_document2")

Here is a rendered pdf of the document of easier reading:
0027-adjoint-ode.pdf

updated version 7. March 2021

0027-adjoint-ode.pdf

update 23rd March 2021

0027-adjoint-ode.pdf

update 7th April 2021

0027-adjoint-ode.pdf

wds15 · 2021-03-04T20:55:37Z

For anyone who wants to get his hands dirty with the running prototype of this, you need to:

clone cmdstan with the stan and stan-math submodule
cd into cmdstan/stan/lib/stan_math/
checkout the branch feature/adjoint-odes
move back to cmdstan directory
grab the latest stanc experimental version created by Rok as listed on the PR Feature/adjoint odes math#1905
Currently that means to head to https://github.com/stan-dev/stanc3/actions/runs/621697392
from there you grab mac-stanc (for mac) which you put into cmdstan/bin. I then copy this thing manually to bin/stanc and make sure that macos allows you to run this thing by right-clicking on the binary in the Finder while holding down the Control key. Then press "open" and from now on you can run the binary from the command line. This stuff is not needed on Linux... no idea about windows.
at this point it should be a matter of building cmdstan the usual way
Stan models can now include a ode_adjoint_tol or ode_adjoint_tol_ctl call.

An example model I used for benchmarks is in cmdstan/stan/lib/stan_math/ode-adjoint/linked-mass-flow.stan .

bbbales2 · 2021-03-04T21:15:39Z

@wds15 use the cmdstan tarball builder. Docs for it are hidden here

It will allow you to build a tarball that is easy to share and use in cmdstanr (see the closures tarball here)

funko-unko · 2021-03-05T12:17:45Z

Alright, so this is how it worked for me via @wds15 's instructions:

# Assumes that a compatible `stanc` file is located at ./stanc
# Uses 8 cores to build cmdstan
git clone --recurse-submodules https://github.com/stan-dev/cmdstan
cd cmdstan/stan/lib/stan_math
git checkout -b feature/adjoint-odes origin/feature/adjoint-odes
cd ../../../
mkdir bin
cd ..
cp stanc cmdstan/bin/stanc
cd cmdstan
chmod +x bin/stanc
make build -j8

And then this showed up, which also worked for me:
stan-dev/math#1905 (comment)

I've found this template for the design doc:
https://github.com/stan-dev/design-docs/blob/92e173efc04aaa3d3dae277dcc7b8681f1008543/0000-template.md

I guess we want to keep to that template? Is a "stan programmer" as mentioned in the template a user of stan ("programming stan models") or a developer of stan? I guess a simple example comparing forward/adjoint would be helpful both for users and developers?

Anyhow, some comments:

As described in the doc, the complexity of the forward mode grows quadratically, not exponentially.
num_checkpoints should be renamed into something like num_steps_between_checkpoints (as we talked about).
Increasing num_steps_between_checkpoints too much degrades performance, I suspect because the memory needed gets allocated anew at every leapfrog step? Not something that we should worry/optimize about for now I guess. Also, making this too large will of course result in failure to allocate the memory. I guess this should not really be a problem?
Different values of num_steps_between_checkpoints appear to lead to different results and then to (not insignificantly) different samples. Maybe something to keep in mind.
The interface looks fine to me.

What needs to be done?

wds15 · 2021-03-05T14:09:36Z

What we need to do in my view is

Find out if the simplified interface has sane defaults setup which work "OK" for many problems.
Gather experience with the adjoint method in problems. So find out when it is better to use adjoint is what would be good to be able to answer to some degree with some accuracy. Ideally people do that on problems they know well.
I came up with a scalable example - should there be other scalable examples?
Collect as much information as we can about making recommendations for the different knobs of the Ctl method. This will be hard to do in general, but maybe we can still distill some rules of thumb.
For example: I found that it is better - I think - to have greater precision (absolute tolerance being smaller) for the forward solve as for the backward solve and the quadrature stuff should have the least precision (highest absolute tolerance value).

Did I miss any option to expose for the Ctl thing?

Thanks for the other comments, will update the doc accordingly.

funko-unko · 2021-03-05T14:47:54Z

Hmmm, I believe at least for the interpolation method the polynomial interpolation appears to be preferred. The manual (https://computing.llnl.gov/sites/default/files/public/cvs_guide-5.7.0.pdf) states

The Hermite cubic interpolation option is present because it was implemented chronologically first and it is also used
by other adjoint solvers (e.g. daspkadjoint). The variable-degree polynomial is more memory-efficient (it requires
only half of the memory storage of the cubic Hermite interpolation) and is more accurate.

As for the number of steps between checkpoints, there probably is no way to tell a priori which number is best. 250 sounds fine I guess, most ODE problems in Stan will still be rather small, won't they? With 1k double valued unknowns, that would mean 2 MBs of RAM? The AD tape is probably larger (and also kept in memory?)?

As for the tolerances, 🤷‍♂️. We'll need a suite of testcases for that.

I do believe more and simpler (scalable) examples would be useful, not only to evaluate the use cases of the adjoint method. They could also be presented to users as examples when which methods perform well?

wds15 · 2021-03-05T16:32:22Z

I will switch to the polynomial method for the simpler method then. Maybe we even drop the option for the polynomial entirely?

For the steps between checkpoints: I tested with the example values like 10, 50, 250, 500 and found that you see speedups when going from 10 to 250, but then it got flat. In the CVODES examples they use 100s (100 or 500) as I recall. Memory wise I would not expect an issue here at all. ODE problems are small in a Bayesian context as I have seen it (or you will have to wait forever for your results).

Tolerances: I opt for making the chosen tolerance the least tolerance you get from this. So forward & backward have higher precision and the tolerance given is the one from the computed parameter gradients.

For now we need to come to the conclusion that this is all good enough to land in Stan. Maybe we mark the simple interface as "experimental" in the doc to warn people that we may change the underlying numeric details? Do you have any speicfic examples in mind before proceeding?

Tagging @charlesm93 who wanted to get his hands dirty on this.

betanalpha · 2021-03-05T21:18:12Z

Section 1: We should be consistent with what terms are kept in the scaling. Either N (M + 1) and 2 * N + M or N * M and N + M. I think that it’s worth noting the overheads here, even just C_{sens} = N * (M + 1) + overhead_{sens} and C_{adj} = 2 * N + M + overhead_{adj}, and mentioning that the adjoint overhead is larger and can be limiting for small systems. This will also better prepare the reader for the discussion in Section 6. Section 3: "the adjoints of the input operands are directly computed given the input adjoints” reads pretty awkwardly. Firstly to avoid confusion I recommend being explicit and redundant and everywhere referring to "autodiff adjoints" when referring to the tangents/cotangents passed along the expression graph and “system adjoint” or “ODE adjoint” when referring to backwards solve method for propagating the autodiff adjoints through the ODE solve. Secondly using input to refer to both the operands and the autodiff adjoints is a big confusing given that each refers to a different direction through the expression graph. Section 4: Derivations for general l(theta) can be hard to connect back to the autodiff circumstance; I recommend always mentioning that taking l(theta) = delta * y(t) will recover the adjoint-jacobian product needed for reverse mode autodiff. I recommend noting that lambda is a vector, one for each state, and the backwards solve (5) requires solving a coupled system of N equations (hence the cost of N). Each parameter sensitivity is one-dimensional which is why each of the corresponding integrals can be solved with a straightforward, one-dimensional quadrature (it might also help to refer to “one-dimensional quadrature” everywhere instead of just “quadrature”). I think it would help to list the control parameters currently assumed for the current integrate_ode_bdf signature. I really don’t think that we understand these solvers well enough to implement a `ode_adjoint_tol` function with the initial feature. Even ignoring the danger of having some implicit defaults this signature assumes that that uniform tolerances across all of the solvers is useful for which I haven’t seen any strong support. Users will be able to quickly try out the new function with recommended defaults for all of the control parameters that can be cut and paste into the Stan program. This has the added benefit of communicating exactly what was run (to correlate with any performance results they share) instead of relying on the exact repo commit when the default parameters might be changing. In the end I don’t see any strong reason for including this signature. Section 6: Blanket statements like “ODEs with many states and parameters are currently out of scope for Stan” can be more confusing than useful (everyone had different computing resources and hence thresholds that define practically useful tools, “scope” is ambiguous, performance depends on not just the solve time but also the behavior of the likelihood functions as the number of states and parameters increases, etc). What’s wrong with just saying that the performance of the current methods scale poorly with the number of states and parameters, but the adjoint method offers better scaling? For a fixed computational budget the adjoint method may allow for larger systems to be fit, and for a fixed system it may speed up existing fits. Section 7: I think that it will be helpful to differentiate between prior uses of the adjoint method for specific functions and the adjoin method for general autodiff. The adjoint method for general functions is used all over the sciences but the requirement of a fixed function makes it too rigid for general purpose tools like Stan (indeed this is why we didn’t consider the adjoint method much easier even though it was always in CVODES). Using the adjoint method for adjoint-Jacobian products, however, can be integrated into any reverse mode autodiff implementation. The application to autodiff is the really exciting new bit (the functionality built into tools like JAX for neural ODEs would be a recent prior art).

wds15 · 2021-03-06T22:11:32Z

@betanalpha thanks for looking over it.

Section 1: I adapted accordingly.
Section 3 & 4: Will work on that.
Section 6: I reworded it.
Section 7: I adapted as suggested.

Wrt. to the suggested ode_adjoint_tol signature which sets some defaults: The thought is that users should be able to quickly try out the adjoint method. So when users already spend some thinking on their ODE problem as they use the tol flavor, then it should really be easy for users to just try adjoint. I think we should be able to setup a default which gives reasonable performance (maybe not best). This is also a big argument to encourage users to try the prototype. It's just about swapping out the function name. I would suggest to keep the simpler interface at least for getting users to test this prototype during this month... @funko-unko liked the short function call for basically those reasons given.

wds15 · 2021-03-07T15:52:21Z

@betanalpha ok, now I hopefully got your other comments straight as well. Thanks for looking into it. I am now talking of "autodiff adjoints" and "ODE adjoint states" as you suggest, which makes sense to hopefully get down the confusion level...

I suggest to now launch until Easter an experimental phase with Stan users. The goal is to collect some feedback on

the utility of the method
the utility of the tuning parameters (we can probably drop the interpolation_polynomial, for example... I think)
the utility of the ode_adjoint_tol call

We can collect feedback on the forum and in summarized form in the wiki. After Easter we then proceed with finishing this up and merging it to stan math during April such that it will be ready for the next May release.

Sounds like a plan?

Tagging @charlesm93, @bbbales2 , @rok-cesnovar & @funko-unko ... @yizhang-yiz

funko-unko · 2021-03-08T09:17:02Z

Can we maybe just slap a warning on the ode_adjoint_tol function? Something along the lines of

The other implicit tuning parameters may change in the future, the current setting is equivalent to ode_adjoint_tol_ctl(..., values of implicit parameters).

such that the user may just swap out the function call? Maybe with a notice to include this message if they post in the forums? This would discourage the use of the ode_adjoint_tol function while still making it easy to "just use" the adjoint method, and we would always know all of the tuning parameters?

PS: Clearly my opinion should have very little weight.

rok-cesnovar · 2021-03-08T09:38:38Z

The laid out plan work for me! Maybe just mention that Easter is the 4th of April.

Longterm we might want to use tuples or even better structs (though I am not sure there is any work being done onr structs) for default arguments.

So the signature becomes:

vector[] ode_adjoint_tol_ctl(F f,
    vector y0,
    real t0, real[] times,
    (real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
    real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
    int interpolation_polynomial, int solver_f, int solver_b)
    T1 arg1, T2 arg2, ...)

and you could call it as:

ode_adjoint_tol_ctl(udf, y0, t0, times, default_ode_adjoint_args(), ...)

where default_ode_adjoint_args() would be a built-in function that would return a tuple with the default arguments.

One could also do this:

transformed data {
  (real, vector, real, vector, real, real, int, int, int, int, int) args = default_ode_adjoint_args();
  args.7 = 500; // would set max_num_steps
}

Structs with named elements obviously works much better for this.

betanalpha · 2021-03-09T19:01:25Z

Great, thanks. That all sounds good, although again I think `ode_adjoint_tol` is premature at the moment. Introducing it offers the danger that users will not understand its limitations (it doesn’t matter how many warnings you put in, if you allow defaults people will use them and expect them to work) and we can always add it later without any real cost.

…

On Mar 7, 2021, at 10:52 AM, wds15 ***@***.***> wrote: @betanalpha <https://github.com/betanalpha> ok, now I hopefully got your other comments straight as well. Thanks for looking into it. I am now talking of "autodiff adjoints" and "ODE adjoint states" as you suggest, which makes sense to hopefully get down the confusion level... I suggest to now launch until Easter an experimental phase with Stan users. The goal is to collect some feedback on the utility of the method the utility of the tuning parameters (we can probably drop the interpolation_polynomial, for example... I think) the utility of the ode_adjoint_tol call We can collect feedback on the forum and in summarized form in the wiki. After Easter we then proceed with finishing this up and merging it to stan math during April such that it will be ready for the next May release. Sounds like a plan? Tagging @charlesm93 <https://github.com/charlesm93>, @bbbales2 <https://github.com/bbbales2> , @rok-cesnovar <https://github.com/rok-cesnovar> & @funko-unko <https://github.com/funko-unko> ... @yizhang-yiz <https://github.com/yizhang-yiz> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AALU3FTZJ3MOO4FCMOFTHATTCOOMDANCNFSM4XZBH45Q>.

betanalpha · 2021-03-09T19:03:14Z

Could we not get the same effect using variadic arguments for all of the control parameters?

…

On Mar 8, 2021, at 4:38 AM, Rok Češnovar ***@***.***> wrote: Longterm we might want to use tuples or even better structs (though I am not sure there is any work being done onr structs) for default arguments. So the signature becomes: vector[] ode_adjoint_tol_ctl(F f, vector y0, real t0, real[] times, (real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b, real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints, int interpolation_polynomial, int solver_f, int solver_b) T1 arg1, T2 arg2, ...) and you could call it as: ode_adjoint_tol_ctl(udf, y0, t0, times, default_ode_adjoint_args(), ...) where default_ode_adjoint_args() would be a built-in function that would return a tuple with the default arguments. One could also do this: transformed data { (real, vector, real, vector, real, real, int, int, int, int, int) args = default_ode_adjoint_args(); args.7 = 500; // would set max_num_steps } Structs with named elements obviously works much better for this. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AALU3FQFJMCAPGT4DDML6UTTCSLK3ANCNFSM4XZBH45Q>.

rok-cesnovar · 2021-03-09T19:25:34Z

Current the compiler can not deduce where the control arguments stop and UDF args start.
For example:

vector[] ode_adjoint_tol_ctl(F f,
    vector y0, real t0, real[] times,
    real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
    real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
    int interpolation_polynomial, int solver_f, int solver_b)

vector[] ode_adjoint_tol_ctl(F f,
    vector y0, real t0, real[] times,
    real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
    real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
    int interpolation_polynomial, int arg1, int arg2)

are the same. We could add logic where we check what args f has and deduce based on that and add missing control args for the C++ call (C++ compiler can not do the same). But that would mean stanc3 would hold the ode default values and I am not sure I like that.

So doable but not sure its worth. I agree with "if you allow defaults people will use them and expect them to work".

betanalpha · 2021-03-09T21:34:31Z

Yeah, I guess we’d need named arguments. Regardless the defaults would have to be stored somewhere, either in each program, in the ODE function implementation, or in some default parameter object. Personally I’d prefer the first initially and then the second eventually, if only to reduce the code associated with the ODE implementations (i.e. default configurations are defined and docs in the ODE functions directly and no searching is required).

…

On Mar 9, 2021, at 2:25 PM, Rok Češnovar ***@***.***> wrote: Current the compiler can not deduce where the control arguments stop and UDF args start. For example: vector[] ode_adjoint_tol_ctl(F f, vector y0, real t0, real[] times, real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b, real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints, int interpolation_polynomial, int solver_f, int solver_b) vector[] ode_adjoint_tol_ctl(F f, vector y0, real t0, real[] times, real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b, real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints, int interpolation_polynomial, int arg1, int arg2) are the same. We could add logic where we check what args f has and deduce based on that and add missing control args for the C++ call (C++ compiler can not do the same). But that would mean stanc3 would hold the ode default values and I am not sure I like that. So doable but not sure its worth. I agree with "if you allow defaults people will use them and expect them to work". — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AALU3FR2FEF7FN2WOQZA5T3TCZY35ANCNFSM4XZBH45Q>.

wds15 · 2021-03-20T21:51:59Z

I just updated some more details.

@betanalpha did I get your comments in as you had in mind?

It would be good to get the design doc agreed so that it's ready to be merge by the end of this month.

The only points left open at this stage should be settled during the experimental phase which ends at Easter. The questions open are

Does the ode_adjoint_tol_ctl function have all required tuning parameters?
- Can some parameters be dropped?
- Are some additional ones needed?
Should we keep the simpler ode_adjoint_tol function signature?
- Are the defaults sensible?

As it has been a sticking point to include the simpler ode_adjoint_tol: Is it also a possibility to include the function, but mark it in the documentation clearly that the defaults can be changed with releases? That would mimic essentially the statement about the default priors in rstanarm, which are not guaranteed to stay the same between releases.

Pinging @bbbales2 @yizhang-yiz @rok-cesnovar @funko-unko @charlesm93

betanalpha · 2021-03-22T17:27:41Z

Mostly small comments. I'm still not comfortable with including a simpler function in the initial release. Users will not respect any "experimental feature" warnings and they will complain about changing behavior from version to version even if the behavior is explicitly documented as not stable. Even when they don't complain they often end up using the same function without realizing the difference, which is already a huge problem with rstanarm priors.

Section 1

“severely costly to fit” -> “expensive to evaluate”. The autodiff method is independent of how hard ODE-based models are to fit.

ressources -> resources

This matters less in the design doc and more in final user facing doc, but given the heterogeneity in terminology across fields we should try to mention as many common terms as possible. For example not just “so-called forward-mode ODE method” but rather “forward-mode, direct, sensitivity”.

Section 2

“susceptible infections & removed” -> “susceptible, infected, and recovered (SIR)”

Section 3

“merely” -> “many”, although I think “more” by itself is sufficient here.

“in addition to solve M one-dimensional quadrature problems" -> "in addition to solve M one-dimensional quadrature problems that depend on the adjoint states". To harm being clear to hear to avoid any misunderstanding about the solves parallelizeable or the like.

"We need to first collect some experience before a simpler version can be made available (if that is feasible at all)." I very much agree, but then why is a simpler version being considered for the initial implementation?

Are the quadrature tolerances different for each of the M integrations?

Section 4

Use theta or p consistently to denote the parameters throughout.

g(t, y, p) will not equal l(p) for the autodiff application -- it will equal dy/dt = f(t, y, p) which allows us to propagate the autodiff adjoint information from any subsequent expression graph.

wds15 · 2021-03-22T20:45:23Z

Thanks @betanalpha !

Mostly small comments. I'm still not comfortable with including a simpler function in the initial release. Users will not respect any "experimental feature" warnings and they will complain about changing behavior from version to version even if the behavior is explicitly documented as not stable. Even when they don't complain they often end up using the same function without realizing the difference, which is already a huge problem with rstanarm priors.

I did change my mind in the process of all this. So I do think that we should supply a simplified version to make it easy for users to try this thing out (which is why there could be left-overs saying otherwise). Let's be pragmatic here and I suggest to proceed as follows: We don't provide the simpler version in the Stan language itself. Instead, we pack the simplified function call into the Stan user documentation in the form of a Stan function. ... unless another reviewer here can convince @betanalpha otherwise, of course (sounds unlikely ;) ). Is that reasonable?

Are the quadrature tolerances different for each of the M integrations?

Right now the quadrature tolerances are all the same... but I just checked, the CVODES interface allows to specify the absolute tolerance for each parameter separately (I thought that is not possible). Should we change that and make that also a vector for the absolute tolerances? That's an awful lot of stuff to specify then and it's a bit inconvenient as the length of the vector changes whenever you add or drop a parameter...even worse is that the correspondence of absolute error tolerance to actual parameter is not so transparent in that we have a variadic interface, but the absolute tolerances would correspond to a "flattened" version of the variadic parameter pack... for the sake of simplifying at least something already now, I would leave things as they are right now and keep the same absolute tolerance for all parameters. Agree?

Use theta or p consistently to denote the parameters throughout.

My intention is to keep "theta" as denoting a super-set of parameters and p is a subset of parameters needed for the ODE. I would like to keep notation as is, since this sets apart notation in stan-math (theta) and notation in the CVODES manual (p).

g(t, y, p) will not equal l(p) for the autodiff application -- it will equal dy/dt = f(t, y, p) which allows us to propagate the autodiff adjoint information from any subsequent expression graph.

Maybe not ideal as it is right now... I will have one more look, but I would like to make it as simple as possible. Possibly we just say that l(theta) is - as an example - a normal lpdf of an ODE state at time T.

rok-cesnovar · 2021-03-23T07:30:28Z

Instead, we pack the simplified function call into the Stan user documentation in the form of a Stan function

I agree, this way the default values are out in the open in the UDF code and you also get the option to switch to adjoint ODEs "easily".

wds15 · 2021-03-23T08:12:10Z

There is just one catch with putting the function into the manual: It would be a user-defined Stan function and we do not support variadic arguments to user defined functions (or dow we?).

Anyway, we can still agree at a later time-point on the simpler function being part of the language itself... unless anyone else reviewer this has different views.

@betanalpha is that a plan?

charlesm93 · 2021-04-12T20:48:50Z

I read through the design doc and I think overall it's very well written. The only comments I have is that I expect the adjoint method to work better than forward diff also in the small number of parameters, large number of states regime -- to be tested, I'll put something together -- and that one reference we could include is the work on Neural ODEs by Chen et al (https://arxiv.org/abs/1806.07366).

I agree with @betanalpha and @funko-unko that we should have per-parameter absolute tolerances for the quadrature step. This is consistent with other absolute tolerances we have, I don't think it's a burden to users who won't use it, and it can help us study adjoint methods for ODEs in a Bayesian context. It seems a reasonable application would be a case where we use varying absolute tolerances for the forward and backward solves. CVODEs looks at different contexts and the Julia package is fairly novel -- though I'd be interested to hear their reasoning.

In my view, the most compelling reason for not having varying absolute tolerances right now is that it's more work for the developers and it'll delay the feature's release -- Can someone confirm this? That's fair enough. We can release the version as is but the design doc should acknowledge varying absolute tolerance is a desirable extension if not a top priority. And also that design as is doesn't hinder this extension. This strikes me as an acceptable compromise.

I'd love some documentation on the benefits of scaling parameters in ODEs, some examples, and how well this works in practice!

funko-unko · 2021-04-13T07:49:13Z

I believe part of the problem with settling the question whether or not these parameterwise tolerances are needed is that there is no quick/easy way to estimate the impact of these tolerances, neither for us in the very general setting, nor for users with a specific model.

I believe the only way we have right now to estimate this is to just run full HMC and see how the sampler performs, which will take "forever". I think it would help a lot if it were (easily or at all) possible to explore how the solver performs for samples from the prior, i.e. what tolerances are needed for a given (estimated) final error in the density and in the gradient.

The first you can do in a slightly hacky way, but for the latter you have to jump through a thousand hoops. I strongly believe that this should be made very very (very) easy for the user.

One way to do this would be to be able to pass a csv file to fixed_param. And while we're at it, users should also be very interested in the amount of time it takes to integrate the ODE for a given set of parameters, which as we all know may vary widely across the parameter space.

wds15 · 2021-04-13T09:37:31Z

At this point it would be great to really get to the point of rejecting (option 1) or accepting (option 2) with your post. Rationale as to why a specific option has been chosen is a great bonus, of course.

The design doc already has a short discussion under "Drawbacks" about the lack of the per-parameter tolerances. Here are a few more points to this:

We have not yet seen any case where the per-parameter absolute tolerances would have been needed. The empirical observation is that other communities (CVODES, Julia) do not make use of such a feature and hence the need for this feature should be driven by an actual problem which we cannot solve with the simplified approach. Moreover, I think that we should have a discussion about wether we need this feature or not only once we have proven that there is a need. Per se the default is that the feature is not needed for real-world problems for which Stan is being used for.
Our current ODE solvers also do not allow for different tolerances for anything. They are all the same and no one has ever complained about it. There were discussions on extending that to by-state, but that did not happen for various reasons. In any case, all Stan programs using ODEs so far have been working nicely with just a single absolute tolerance for states and gradients. Thus, if this assumption is wrong by now, then this actually affects all our solvers and we should treat this separately from adjoint would be my view.
The usability is severely complicated given one has a variadic interface. One would have to respect ordering of parameters and their storage order. That can be avoided if the user restricts to only using 1d array inputs, but still.
Indeed, my own resources are limited and in addition I do personally not think that the per-parameter thing is good for users as it complicates usability a lot.
The per-parameter tolerance thing can be emulated - to some extend and with some work - using scaling and shifting of parameters and states. This won't work perfect, but that is totally possible to do,
Finally, the code will be written in a way so that the per-parameter tolerance thing can be added at a later point - should it be deemed useful for problems we encounter (and deem these problems to be important enough to warrant an implementation).

So far we do not even have any strong use-case for per-parameter tolerances for the parameters. Instead, we do have seen that the existing system gives users huge speedups.

yizhang-yiz · 2021-04-13T19:14:44Z

What's the established voting procedure on design doc and on github?

wds15 · 2021-04-14T07:51:49Z

I am not calling out for an established formal ruling here. Instead I brought up the topic at the Stan Gathering last week Thursday April 8th and it was suggested to go through a voting here in the way I started it (at least this is how I understood the meeting outcome). To me that is reasonable given that

we are in principle in alignment on the proposal in that we want adjoint ODE
we seem to have different views on a specific feature
I would assume that we can resolve this now with a voting procedure in a constructive and collaborative manner
I also assume that we can trust each other here as I don't see any reason as to why someone would intend to manipulate, cheat or do any other odd things... we do agree on the goal to get adjoint ODE in such that I would expect people to be constructive here.

I hope that makes sense to everyone involved.

SteveBronder · 2021-04-14T18:10:10Z

In the past for these we had a one week voting period where folks would use the github review process to review or request changes. If the majority is approve then we merge the design doc.

If we wanted anyone to be able to vote we could also just have a comment people could thumbs up or down. I'm fine with either but prefer the review process so that all the tallies stay at the bottom of the page and dissents show up as request review comments which are easier to note looking over a proposal's history

yizhang-yiz · 2021-04-14T18:15:12Z

Thanks @SteveBronder . Looks like this procedure is missing in the voting proposal discussion. I suggest we stick to the process in that proposal.

SteveBronder · 2021-04-14T22:40:18Z

I don't want to get the doc discussion off track so maybe this would be better to discuss in the general meeting Thursday. The actual review process in the design doc readme is a bit vague, which tbh I think is good actually as it lets the scope of the voting pool be decided by the reviewer(s). Whatever way people want to vote is cool and good by me

betanalpha · 2021-04-15T15:39:42Z

We’re in an infinite loophole and no progress will be made without a vote, but I want to point out a few things to have my peace.

Our current ODE solvers also do not allow for different tolerances for anything. They are all the same and no one has ever complained about it. There were discussions on extending that to by-state, but that did not happen for various reasons. In any case, all Stan programs using ODEs so far have been working nicely with just a single absolute tolerance for states and gradients. Thus, if this assumption is wrong by now, then this actually affects all our solvers and we should treat this separately from adjoint would be my view.

By this reasoning the adjoint signature should not have any vector tolerances. The fact that the non-variadic arguments expose vector tolerances, as was strongly motivated by stability issues that are starting to arise in more complicated ODE models that are outside of the typical CVODES scope, completely invalidates this line of reasoning.

The usability is severely complicated given one has a variadic interface. One would have to respect ordering of parameters and their storage order. That can be avoided if the user restricts to only using 1d array inputs, but still.

The adjoint solver requires a sophisticated configuration that can have strong affects on the solver stability that we do not yet fully understand (not only how to optimally configure but also how to diagnose poor configurations). Yes we will always have to balance functionality with robustness, and we will never be able to release functionality the we understand completely. The original ODE solvers in Stan were developed when users were working with much simpler ODE systems, but contemporary applications, especially the large systems for which the adjoint method will be advertised, as much more complex. Prioritizing simple function signatures to maximize easy of use externalizes the consequences of any poor solver configuration to users who are least prepared to investigate any problems that arise. A non-variadic signature would be a frustration but not an obstruction for the more experienced users working with complex ODE systems who would benefit from this speedup the most.

Indeed, my own resources are limited and in addition I do personally not think that the per-parameter thing is good for users as it complicates usability a lot.

Exposing _any_ of the solver configuration complicates usability. Limiting what is exposed by forcing suboptimal defaults only shifts the complication from how easy it is to incorporate the function into a Stan program to how easy it is to diagnose and investigate stability problems in the model fit. Usability arguments like this would make a lot more sense if the decision was between exposing _no_ solver configuration options vs exposing _all_ solver configuration options, but the current decision is between exposing just the configuration options that are immediately compatible with the variadic signature verses all configuration options.

The per-parameter tolerance thing can be emulated - to some extend and with some work - using scaling and shifting of parameters and states. This won't work perfect, but that is totally possible to do,

I would love to see any non-speculative guidance on how this can be accomplished. The quadrature tolerances will depend on how parameter Jacobians evolve between the observed times _relative_ the adjoint states. These differential behaviors cannot be controlled by constant, linear transformations of the system that leave the Jacobians invariant.

Finally, the code will be written in a way so that the per-parameter tolerance thing can be added at a later point - should it be deemed useful for problems we encounter (and deem these problems to be important enough to warrant an implementation). So far we do not even have any strong use-case for per-parameter tolerances for the parameters. Instead, we do have seen that the existing system gives users huge speedups.

Placing the burden of proof on one outcome but not the other by framing that outcome as the default and the other as alternative is entirely artificial here. Right now every comparison of ODE solver performance has been complicated by the tolerance configurations — difference solvers can leapfrog each other, and become more or less stable, with non-default tuning. There is no question that the adjoint method will be extremely useful in the right circumstances, but it also exposes even more configuration options that can drastically effect its performance making the method more fragile. State tolerance sensitivity has proven to be important for previous solvers, and the current signature exposes all of the state tolerances accordingly for both the forward and reverse solves. While the exposure of the full configuration is somewhat independent of introducing an adjoint solver, this will be the first solver that we’ve introduced since the community has really appreciated how sensitive performance can be to these configurations. This is all great so far. The adjoint solver, however, also requires one-dimensional quadrature that parallels the reverse adjoint solve. No existing ODE method requires a similar quadrature and so there’s no local evidence from which we can draw about how important the quadrature tolerances might be. The closest analog is the 1D integrator exposed in Stan; although the quadrature algorithm isn’t exactly the same the tolerances in the 1D integrator can be very important for getting reasonable results out. Beyond the lack of empirical evidence no one has made any theoretical arguments regarding under what circumstances the adjoint quadrature would be robust or not so we have little understanding on how fragile the quadrature might be and, more importantly, how to identify any problems arising from a poor configuration. The situation is very much parallel to the state tolerances and to me that requires burden of proof for any argument to _not_ include quadrature tolerances. The only arguments for including solver tolerances but not quadrature tolerances — independent of design considerations — are based on extrapolating a few documentation examples in packages focusing on different use cases than Stan. What frustrates me here is that the current arguments are all based on maintaining the variadic signature, which was introduced without any consideration for parameter-dependent solver configurations. Instead of considering that fundamental design incompatibility might be an argument for limiting the use of variadic signatures when parameter-dependent solver configurations are relevant the variadic solver is being taken as mandatory and incompatible functionality being rejected by default. I very much hope that the quadrature tolerances do not end up that important, or that there are straightforward ways to identify poor configurations of those tolerances so that users can at least understand when the adjoint functionality might not be stable enough for their use, so that simpler signatures are sufficient. I do not think we have anywhere near enough evidence to support a simpler signature, however, and exposing one prematurely shifts all risks to unprepared users. At the very least I hope that everyone voting yes here will accept the responsibility of monitoring the forums for problems and providing consistent user support to minimize the costs of a poor user experience. I continue to advocate for a more careful rollout, first exposing only a signature that requires the full adjoint solver configuration to be tuned so that users understand the requirements of using the functionality and can experiment and provide feedback. If that feedback motivates a simpler signature then we can expose new functions to the language in the future. Forcing speculative signatures, on the other hand, risks deprecated functions that complicate documentation and user experience.

mitzimorris · 2021-04-15T17:25:04Z

I have read through this thread and was present at the online meeting where the issues were clearly laid out and discussed by Sebastian and Charles, among others.
Sebastian's proposal is both pragmatic and promising. I vote yes.

wds15 · 2021-04-15T19:22:27Z

As per discussion today with @charlesm93 I have added a small sentence to clarify that a vector absolute tolerance for the parameters can be added later to the implementation. That is, the C++ code should be written in a way to allow for an extension should this be deemed necessary at a later point. Otherwise no change.

Great to see one vote here (even better that it is positive from my personal perspective).

bbbales2 · 2021-04-16T04:14:18Z

I vote yes. Reasons over here

funko-unko · 2021-04-16T10:15:12Z

For what it's worth, I also vote yes. But I do see @betanalpha's point.

wds15 · 2021-04-19T19:27:43Z

BUMP.

One more week to go for voting. So far we have 3 positive votes and no vote against it... though @betanalpha has concerns against the proposal, but I have not read a clear vote for or against it.

Tagging again interested people I am aware of:

@charlesm93 @yizhang-yiz @bob-carpenter @rok-cesnovar @SteveBronder

rok-cesnovar · 2021-04-20T10:31:27Z

I also give a positive vote. Even though I am familiar with the underlying implementation and have worked on the Stanc3 side of things, I do have to acknowledge that I am by no means an expert in modeling with ODEs so put a lesser weight on my vote :)

Comments:

I agree with Micheal's comments about this being a somewhat arbitrary subset of exposed arguments, though I think the reasoning for this particular one seems reasonable to me (easier to implement, no per-parameter tolerances are currently being used in ODE solvers in Stan/CVODES/Julia, etc) so I think this is a good step 1 that does not hinder us for any future evolutions of the ODE adjoint interface.
This implementation does not make a per-parameter tolerances implementation any more difficult to implement in the future if we find that to be needed. The only cost is that we would have 2 Stan function signatures but the backend code would overlap so there is not much maintenance cost there. I do not think we will have to deprecate the existing one as it seems to be working for at least the problems currently used.
By waiting with the per-parameter tolerances implementation we could add much friendlier and less error-prone signature in the future once tuples are introduced (maybe for this release but probably 50/50 at this point) as instead of parameters we would supply a (parameter, tolerance) tuple. So the signature would be

vector[] ode_adjoint_tol_ctl(F f,
    vector y0,
    real t0, real[] times,
    (real rel_tol_f, vector abs_tol_f, real rel_tol_b, vector abs_tol_b,
    real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints,
    int interpolation_polynomial, int solver_f, int solver_b)
    (T1 arg1, T1 arg_tol), (T2 arg2, T2 arg2_tol) ...)

So bottom line for me is: this seems useful (I trust the judgement of Sebastian, Charles and the rest), should not put a
significant maintenance burden on us even if the per-parameter tolerance implementation would become available.

SteveBronder · 2021-04-20T20:53:18Z

+1 I vote yes!

charlesm93 · 2021-04-20T21:04:03Z

I also vote yes.

wds15 · 2021-04-22T08:30:24Z

BUMP.

6 positive votes and no (definite) negative vote by now.

Thanks for everyone taking the time to dive into this so far.

Voting closes on Monday 26th April 2021.

syclik · 2021-04-25T13:31:42Z

My vote: yes.

wds15 · 2021-04-26T07:55:28Z

BUMP.

Last day for voting is today!

7 positive votes and no (definite) negative vote by now.

betanalpha · 2021-04-26T16:28:10Z

I vehemently vote no. The intent of this implementation was to expose all of the tolerances (and not give them defaults) so that advanced users would be forced to experiment and provide feedback. Prioritizing the variadic signature negates this intent, both in limiting what tolerances can be configured and allowing unexamined use of the method. The design fundamentally changed halfway through the review process solely to rationalize merging a feature. There is zero principled arguments for why the quadrature tolerances, and only the quadrature tolerances, can be ignored and no one has provided any strategy for how problems due to misconfigured quadrature tolerances can be, let alone will be, identified and studied. Note also that in the history of Stan we have never actually deprecated a function from the language. Premature features will persist and users will continue to incorporate them into their Stan programs.

yizhang-yiz · 2021-04-26T17:01:28Z

Note also that in the history of Stan we have never actually deprecated a function from the language

IMO Allowing deprecation is a healthy to do in the long term. Maybe we should have a discussion on that at some point.

rok-cesnovar · 2021-04-26T18:14:03Z

Note also that in the history of Stan we have never actually deprecated a function from the language.

The functions are deprecated in the sense that the compiler will warn the users that a function or some syntax (˙#˙ is deprecated for comments, as is <-, etc.) is depreacted. We have not removed them as that would require a bump of the major version and when that will happen requires a bit more planning and discussion then just a typical release cycle. There is no immediate plan on when we will do that.

I have personally been slowly collecting things that will need to be removed once we bump (will open an issue somewhere at some point). Some already have an issue in one of the repos, others do not.

So far collected:

Language:

_log, _cdf_log, _ccdf_log suffixed functions
multiply_log, binomial_coefficient_log, cov_exp_quad
integrate_ode_* functions
get_lp, if_else, increment_log_prob
abs in favor of fabs (there is flip-flopping so this one is a maybe)
the old array syntax in favor of array[] (another maybe, I dont think this was decided, though it would simplify code)

Others:

bin/print in cmdstan
remove Math code associated with all deprecated functions

EDIT: Lets not derail this thread, we can discuss that elsewhere if needed. Just wanted to point out removing deprecated is in plans once we decide to bump.

wds15 · 2021-04-26T18:53:48Z

My understanding form the start was to expose a function signature which exposes a large set of tuning options - and we do have that now. In fact, I personally was for including an additional bdf variant, but in discussions over it I did concur with others that this functionality is not needed. Everything is a trade off.

There is no variadic vs per-parameter thing here, not at all. Also, I really do not see a need to deprecate at some point the proposed signature in favor for the more complicated one with per-parameter tolerances. Deprecating the simpler signature would only be needed if almost all cases we encounter require the more complicated signature - but that is really unlikely (time will tell).

Let's see it positive: We are really making progress here! Solving large ODE systems in Stan will be faster for a very large class of problems (at least for those we benchmarked).

betanalpha · 2021-04-26T20:05:44Z

The functions are deprecated in the sense that the compiler will warn the users that a function or some syntax (˙#˙ is deprecated for comments, as is <-, etc.). We have not removed them as that would require a bump of the major version and when that will happen requires a bit more planning and discussion then just a typical release cycle. There is no immediate plan on when we will do that.

Yes, a major version bump has been “planned” for many years now with political and logistical issues preventing any progress. Deprecation is absolutely a useful tool in software development but until we can demonstrate that we can actually implement deprecation as a development team its use to motivate aggressive feature releases is, in my opinion, weak.

There is no variadic vs per-parameter thing here, not at all.

Quoting Sebastian: "Right now the quadrature tolerances are all the same... but I just checked, the CVODES interface allows to specify the absolute tolerance for each parameter separately (I thought that is not possible). Should we change that and make that also a vector for the absolute tolerances? That's an awful lot of stuff to specify then and it's a bit inconvenient as the length of the vector changes whenever you add or drop a parameter...even worse is that the correspondence of absolute error tolerance to actual parameter is not so transparent in that we have a variadic interface, but the absolute tolerances would correspond to a "flattened" version of the variadic parameter pack... for the sake of simplifying at least something already now, I would leave things as they are right now and keep the same absolute tolerance for all parameters. Agree?" A signal quadrature tolerance was used in the initial signature based on your misunderstanding of CVODES. Honest mistake. But reframing that initial signature as a default that’s “good enough” is not.

Also, I really do not see a need to deprecate at some point the proposed signature in favor for the more complicated one with per-parameter tolerances. Deprecating the simpler signature would only be needed if almost all cases we encounter require the more complicated signature - but that is really unlikely (time will tell).

The exact same argument holds for _any_ of the solver configuration arguments. Why are we exposing any of them? Why are we just now exposing vector tolerances? Why did everyone agree that a no-tolerance signature was a bad idea? We’re starting to recognize that they have substantial influence on how the solvers behave and we have _no idea_ which are important. At best the choice of which tolerances are exposed and not exposed is arbitrary, at worst it’s based on forcing the variadic signature.

Let's see it positive: We are really making progress here! Solving large ODE systems in Stan will be faster for a very large class of problems (at least for those we benchmarked).

Introducing a new feature that is not fully understood is progress only when there are not poor consequences for users. Being able to fit larger problems is useful only if the solvers maintain their robustness in those larger problems, especially as prior specification becomes more and more subtle in higher-dimensional spaces. I have no illusions of convincing anyone here; everyone is very optimistic that things will be fine based on little to no evidence. I again just hope that everyone voting “yes” because they speculate that there won’t be problems will take responsibility and do the work to review user feedback for stability problems due to the adjoint quadrature configuration and help users work through those issues.

wds15 · 2021-04-27T07:48:55Z

Voting period is over as of today 27th April 2021

Result:

positive votes: 7
negative votes: 1

Thus, the decision is that we accept this PR as is ... and merge it.

Thanks for everyone taking the time to go through the material!

The PR for this in stan-math is getting along very well.

martinmodrak · 2021-04-27T13:03:26Z

I don't have an opinion on the technical aspects of the design doc, but the discussion prompted me to think about culture I would personally like to see in the Stan dev community (while others may obviously differ).

I think it is important to bear in mind that everybody here shares a big portion of values - we all want Stan to be useful, reliable, usable, fast, ... We all value the effort others put into improving Stan, we all want our perspectives to be treated seriously. Disagreements like this are IMHO rarely due to some deep fundamental differences. Instead they arise partly from genuine uncertainty about what a design choice will actually entail in the real world and partly from somewhat different weights each of us puts on individual values/goals and how we evaluate the uncertainty. And I think this should be a reason for humility - some of the judgements any of us makes will be wrong. Some of our values/goals may turn out to lead to outcomes we didn't envision and don't like. Moreover whether a decision will in fact turn out to be good in practice doesn't in my experience correlate neatly with ability to convince others or to come up with good arguments. To me a consequence is that in most projects one should expect to rarely have all their requirements/expectations/... met - otherwise it would likely mean that the project only reflects the values of some of its team members.

This humility also IMHO entails that we should try to avoid looking at results of votes (or any decision) as personal wins or losses, or try to keep scores on who was right how many times, or engage in some form of "I told you so" when it turns out somebody was wrong. We share most of our goals and we all benefit if Stan works great and lose out when there are issues. And I felt it is important to remind ourselves of this from time to time. End of my two cents.

SteveBronder

Approving since votes were good!

wds15 added 3 commits February 2, 2021 21:59

started to work on design doc

739fc3c

finish analytical stuff

1855347

add missing bits

4b11957

wds15 mentioned this pull request Feb 17, 2021

Feature/adjoint odes stan-dev/math#1905

Merged

5 tasks

wds15 added 2 commits February 18, 2021 11:39

fix links

10832ab

refine function calls

92e173e

wds15 added 2 commits March 6, 2021 23:25

address Mikes comments part 1

59c96a5

adress remaining sections

1b94e37

update

6139843

bump

3185b30

note that vector absoute tolerances can be retrofitted if deemed needed

fb171b9

SteveBronder approved these changes Apr 29, 2021

View reviewed changes

SteveBronder merged commit 91177e7 into stan-dev:master Apr 29, 2021

Adjoint ode design #37

Adjoint ode design #37

Conversation

wds15 commented Feb 17, 2021 • edited Loading

wds15 commented Mar 4, 2021

bbbales2 commented Mar 4, 2021

funko-unko commented Mar 5, 2021

wds15 commented Mar 5, 2021 • edited Loading

funko-unko commented Mar 5, 2021

wds15 commented Mar 5, 2021

betanalpha commented Mar 5, 2021 via email

wds15 commented Mar 6, 2021

wds15 commented Mar 7, 2021

funko-unko commented Mar 8, 2021

rok-cesnovar commented Mar 8, 2021 • edited Loading

betanalpha commented Mar 9, 2021 via email

betanalpha commented Mar 9, 2021 via email

rok-cesnovar commented Mar 9, 2021

betanalpha commented Mar 9, 2021 via email

wds15 commented Mar 20, 2021

betanalpha commented Mar 22, 2021

wds15 commented Mar 22, 2021

rok-cesnovar commented Mar 23, 2021

wds15 commented Mar 23, 2021

charlesm93 commented Apr 12, 2021

funko-unko commented Apr 13, 2021

wds15 commented Apr 13, 2021

yizhang-yiz commented Apr 13, 2021

wds15 commented Apr 14, 2021

SteveBronder commented Apr 14, 2021

yizhang-yiz commented Apr 14, 2021 • edited Loading

SteveBronder commented Apr 14, 2021

betanalpha commented Apr 15, 2021 via email

mitzimorris commented Apr 15, 2021

wds15 commented Apr 15, 2021

bbbales2 commented Apr 16, 2021

funko-unko commented Apr 16, 2021

wds15 commented Apr 19, 2021

rok-cesnovar commented Apr 20, 2021

SteveBronder commented Apr 20, 2021

charlesm93 commented Apr 20, 2021

wds15 commented Apr 22, 2021

syclik commented Apr 25, 2021

wds15 commented Apr 26, 2021

betanalpha commented Apr 26, 2021 via email

yizhang-yiz commented Apr 26, 2021

rok-cesnovar commented Apr 26, 2021 • edited Loading

wds15 commented Apr 26, 2021

betanalpha commented Apr 26, 2021 via email

wds15 commented Apr 27, 2021

martinmodrak commented Apr 27, 2021

SteveBronder left a comment

Choose a reason for hiding this comment

wds15 commented Feb 17, 2021 •

edited

Loading

wds15 commented Mar 5, 2021 •

edited

Loading

rok-cesnovar commented Mar 8, 2021 •

edited

Loading

yizhang-yiz commented Apr 14, 2021 •

edited

Loading

rok-cesnovar commented Apr 26, 2021 •

edited

Loading