In the following we will question some fundamental aspects of the formulations so far, namely the update step computed via gradients. To re-cap, the approaches explained in the previous chapters either dealt with purely supervised training, integrated the physical model as a physical loss term, or included it via differentiable physics (DP) operators embedded into the training graph. The latter two methods are more relevant in the context of this book. They share similarities, but in the loss term case, the physics evaluations are only required at training time. For DP approaches, the solver itself is usually also employed at inference time, which enables an end-to-end training of NNs and numerical solvers. All three approaches employ first-order derivatives to drive optimizations and learning processes, with the latter two also using them for the physics terms. This is a natural choice from a deep learning perspective, but we haven't questioned at all whether it is actually the best choice.
Not too surprising after this introduction: A central insight of the following chapter will be that regular gradients can be a sub-optimal choice for learning problems involving physical quantities.
It turns out that both supervised and DP gradients have their pros and cons, and leave room for custom methods that are aware of the physics operators.
In particular, we'll show how scaling problems of DP gradients affect NN training (as outlined in {cite}`holl2021pg`), and revisit the problems of multi-modal solutions. Finally, we'll explain several alternatives to prevent these issues.
% It turns out that a key property that is missing in regular gradients is a proper inversion of the Jacobian matrix.
Below, we'll proceed in the following steps:
- Show how the properties of different optimizers and the associated scaling issues can negatively affect NN training.
- Identify the problem with our GD or Adam training runs so far. Spoiler: they're missing an _inversion_ process to make the training scale-invariant.
- We'll then explain two alternatives to alleviate these problems: an analytical full-inversion and a numerical half-inversion scheme.
% note, re-introduce multi-modality at some point...
Before diving into the details of different optimizers, the following paragraphs should provide some intuition for why this inversion is important. As mentioned above, all methods discussed so far use gradients, which come with fundamental scaling issues: even for relatively simple linear cases, the direction of the gradient can be negatively distorted, thus preventing effective progress towards the minimum. (In non-linear settings, the length of the gradient anticorrelates with the distance from the minimum point, making it even more difficult to converge.)
In 1D, this problem can be alleviated by tweaking the learning rate, but it becomes very clear in higher dimensions. Let's consider a very simple toy "physics" function in two dimensions that simply applies a factor $\alpha$ to the second component, followed by an $L^2$ loss:

$$ \mathcal P(x_1,x_2) = \begin{bmatrix} x_1 \\ \alpha ~ x_2 \end{bmatrix} \quad \text{with} \quad L(x) = |\mathcal P(x)|^2 = x_1^2 + \alpha^2 x_2^2 . $$

For $\alpha=1$ everything is simple: the loss landscape is radially symmetric, and the negative gradient points directly towards the minimum at the origin.
_Figure (`physgrad-scaling`): Loss landscapes in $x$ for different $\alpha$ of the 2D example problem. The green arrows visualize an example update step $- \nabla_x$ (not exactly to scale) for each case._
However, within this book we're targeting physical learning problems, and hence we have physical functions integrated into the learning process, as discussed at length for differentiable physics approaches. This is fundamentally different! Physical processes pretty much always introduce different scaling behavior for different components: some changes in the physical state are sensitive and produce massive responses, others have barely any effect. In our toy problem we can mimic this by choosing different values for $\alpha$.
For larger $\alpha$, the loss landscape becomes strongly anisotropic, and the gradient $(2 x_1, \, 2 \alpha^2 x_2)$ is dominated by its second component. An update $-\eta \nabla_x$ hence points mostly along this sensitive direction instead of towards the minimum, and a learning rate small enough to keep the $x_2$ updates stable makes progress along $x_1$ painfully slow. This is precisely the scaling behavior that an inversion of $\mathcal P$ could correct.
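To make this scaling issue tangible, here's a minimal NumPy sketch of the toy problem; the function names and the chosen values of $\alpha$ are ours and purely illustrative. It measures the angle between the GD direction $-\nabla_x$ and the direction towards the minimum at the origin:

```python
import numpy as np

def grad_loss(x, alpha):
    """Analytic gradient of L = |P(x)|^2 = x1^2 + alpha^2 * x2^2."""
    return np.array([2.0 * x[0], 2.0 * alpha**2 * x[1]])

x = np.array([1.0, 1.0])
for alpha in [1.0, 10.0]:
    g = grad_loss(x, alpha)
    gd_dir  = -g / np.linalg.norm(g)    # normalized GD direction
    opt_dir = -x / np.linalg.norm(x)    # direction towards the minimum at the origin
    angle = np.degrees(np.arccos(np.clip(gd_dir @ opt_dir, -1.0, 1.0)))
    print(f"alpha={alpha:5.1f}  GD direction={gd_dir}  angle to optimum={angle:4.1f} deg")
```

For $\alpha=1$ the angle is zero, while for $\alpha=10$ the GD direction already deviates from the direction to the minimum by more than 40 degrees.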
Note that inversion, naturally, does not mean negation ($\mathcal P^{-1} \ne -\mathcal P$), but rather the regular mathematical inverse of the function $\mathcal P$.
A scale-invariant optimization for a given function yields the same result for different parametrizations (i.e. scalings) of the function.
E.g., for our toy problem above this means that optimization trajectories are identical no matter what value we choose for $\alpha$.
We'll now evaluate and discuss how different optimizers perform in comparison. As before, let $L(x)$ denote a scalar loss function that we want to minimize w.r.t. the parameters $x$, with the physics solver $\mathcal P$ entering via $L(\mathcal P(x))$.
All NNs of the previous chapters were trained with gradient descent (GD) via backpropagation. GD with backprop was also employed for the PDE solver (simulator) $\mathcal P$, resulting in the composite derivative
$$ \Big( \frac{\partial L}{\partial x} \Big)^T = \Big( \frac{\partial \mathcal P}{\partial x} \Big)^T \Big( \frac{\partial L}{\partial \mathcal P} \Big)^T $$ (loss-deriv)
As the loss $L$ is scalar, $\big( \frac{\partial L}{\partial \mathcal P} \big)^T$ is simply a vector, and the transposed solver Jacobian $\big( \frac{\partial \mathcal P}{\partial x} \big)^T$ maps it back to the space of the inputs $x$, just like backpropagation through any other layer.
We've shown in previous chapters that using this combined derivative to train NNs together with solvers works, and that it can substantially outperform purely supervised baselines.
Note that we exclusively consider multivariate functions, and hence all symbols represent vector-valued expressions unless noted otherwise.
%techniques such as Newton's method or BFGS variants are commonly used to optimize numerical processes since they can offer better convergence speed and stability. These methods likewise employ gradient information, but substantially differ from GD in the way they compute the update step, typically via higher order derivatives.
%{figure} resources/placeholder.png %--- %height: 220px %name: pg-training %--- %TODO, visual overview of PG training %
The optimization update $\Delta x_{\text{GD}}$ of GD scales with the derivative of the objective w.r.t. the input $x$,
$$ \Delta x_{\text{GD}} = -\eta \cdot \frac{\partial L}{\partial x} $$ (GD-update)
where $\eta$ denotes the learning rate. This is the update we've used for the purely supervised training in {doc}`supervised-airfoils`, but we've also used it in the differentiable physics approaches. E.g., in {doc}`diffphys-code-sol` we've computed the derivative of the fluid solver. In the latter case, we've still only updated the NN parameters, but the fluid solver Jacobian was part of equation {eq}`GD-update`, as shown in {eq}`loss-deriv`.
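The decomposition in equation {eq}`loss-deriv` is easy to verify by hand for a small example. The following sketch (our own construction, using the toy solver from above with hard-coded analytic Jacobians) assembles the GD update of equation {eq}`GD-update` from the solver Jacobian and the loss derivative:

```python
import numpy as np

alpha, eta = 10.0, 0.01

def P(x):                       # toy solver: y = (x1, alpha * x2)
    return np.array([x[0], alpha * x[1]])

def dP_dx(x):                   # solver Jacobian (2x2), analytic for this linear toy case
    return np.array([[1.0, 0.0], [0.0, alpha]])

def dL_dy(y):                   # derivative of L = |y|^2 w.r.t. the solver output
    return 2.0 * y

x = np.array([1.0, 1.0])
y = P(x)
grad_x = dP_dx(x).T @ dL_dy(y)  # (dL/dx)^T = (dP/dx)^T (dL/dP)^T, cf. (loss-deriv)
dx_gd  = -eta * grad_x          # GD update, cf. (GD-update)
print(grad_x, dx_gd)            # -> [2. 200.] [-0.02 -2.  ]
```

The large $\alpha$ again shows up as a strongly distorted update.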
We'll jointly evaluate GD and several other methods with respect to a range of categories: their handling of units, function sensitivity, and behavior near optima. While these topics are related, they illustrate differences and similarities of the approaches.
**Units** 📏
A first indicator that something is amiss with GD is that it inherently misrepresents dimensions.
Assume two parameters $x_1$ and $x_2$ with different physical units. Then the GD updates scale with the inverse of those units, because the parameters appear in the denominator of the update above (i.e. the $\partial x$ in $\partial L / \partial x$). The learning rate $\eta$ could compensate for this discrepancy, but since $x_1$ and $x_2$ have different units, no single choice of $\eta$ can yield updates with the correct units for both parameters.
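To make the argument concrete, here's the dimensional bookkeeping for a single parameter; the bracket notation $[\cdot]$ for "units of" is only a shorthand for this paragraph:

$$ \big[\Delta x_{\text{GD}}\big] = [\eta] \cdot \frac{[L]}{[x]} \qquad \Rightarrow \qquad [\eta] = \frac{[x]^2}{[L]} \ \text{ would be required to obtain } \big[\Delta x_{\text{GD}}\big] = [x] , $$

and this required unit differs for every parameter with a different unit.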
One could argue that units aren't very important for the parameters of NNs, but nonetheless it's unnerving from a physics perspective that they're wrong, and it hints at some more fundamental problems.
**Function sensitivity** 🔍
As illustrated above, GD also has inherent problems when functions are not normalized.
Consider a simplified version of the toy example above, consisting only of the function $L(x) = c \cdot x$ with some constant $c$. Then the GD update is simply $\Delta x_{\text{GD}} = -\eta \cdot c$: its magnitude is dictated by the sensitivity of the function, not by how far we are from an optimum.
More specifically, if we look at how the loss changes, the expansion around $x$ for the update step of GD gives $L(x+\Delta x_{\text{GD}}) = L(x) - \eta \, c^2 + \cdots$.
This demonstrates that
for sensitive functions, i.e. functions where small changes in $x$ cause large changes in $L$, GD produces large updates in $x$, which can easily overshoot or even destabilize the optimization, while for insensitive functions the updates become vanishingly small and progress stalls.
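A few lines of Python make this behavior explicit; the values of $c$ and $\eta$ below are arbitrary and only serve as an illustration:

```python
eta = 0.01
for c in [100.0, 1.0, 0.01]:      # sensitive, normalized, insensitive
    grad  = c                     # dL/dx for L(x) = c * x
    dx_gd = -eta * grad           # GD step
    dL    = c * dx_gd             # resulting change of the loss
    print(f"c={c:7.2f}  GD step={dx_gd:10.2e}  loss change={dL:10.2e}")
```

Sensitive functions ($c \gg 1$) receive huge updates, while insensitive ones ($c \ll 1$) barely move, which is the opposite of what we'd like.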
Such sensitivity problems can occur easily in complex functions such as deep neural networks where the layers are typically not fully normalized.
Normalization in combination with a correctly chosen learning rate $\eta$ can be used to counteract this behavior for NNs, but these remedies are usually not available for the physics simulations embedded in the training loop: we can't simply rescale intermediate simulation states without changing the problem that is being solved.
**Convergence near optimum** 💎
Finally, the loss landscape of any differentiable function necessarily becomes flat close to an optimum,
as the gradient approaches zero upon convergence.
Therefore the magnitude of the GD updates shrinks as well, and the optimization slows down just where precision would matter most.
This is an important point, and we will revisit it below. It's also somewhat surprising at first, but it can actually stabilize the training. On the other hand, it makes the learning process difficult to control.
Newton's method employs the gradient $\frac{\partial L}{\partial x}$ as well as the Hessian $\frac{\partial^2 L}{\partial x^2}$ to compute the update
$$ \Delta x_{\text{QN}} = -\eta \cdot \left( \frac{\partial^2 L}{\partial x^2} \right)^{-1} \frac{\partial L}{\partial x} . $$ (quasi-newton-update)
More widely used in practice are Quasi-Newton methods, such as BFGS and its variants, which approximate the Hessian matrix.
However, the resulting update $\Delta x_{\text{QN}}$ retains the structure of equation {eq}`quasi-newton-update`, so we'll discuss the exact and the approximated variants jointly in the following.
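For the quadratic toy loss from above, the Newton step can be written out directly. The sketch below (our construction, with an analytic Hessian and $\eta=1$ for simplicity) shows that the resulting update is independent of $\alpha$, i.e. scale-invariant:

```python
import numpy as np

def newton_step(x, alpha, eta=1.0):
    """Update from (quasi-newton-update) for L(x) = x1^2 + alpha^2 * x2^2."""
    grad = np.array([2.0 * x[0], 2.0 * alpha**2 * x[1]])
    hess = np.diag([2.0, 2.0 * alpha**2])
    return -eta * np.linalg.solve(hess, grad)

x = np.array([1.0, 1.0])
for alpha in [1.0, 10.0, 100.0]:
    print(alpha, newton_step(x, alpha))   # always [-1. -1.], independent of alpha
```

With $\eta=1$ this jumps straight to the minimum of the quadratic loss, no matter how large $\alpha$ is.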
**Units and Sensitivity** 📏
Quasi-Newton methods definitely provide a much better handling of physical units than GD.
The quasi-Newton update from equation {eq}`quasi-newton-update` produces the correct units for all parameters to be optimized.
As a consequence, the learning rate $\eta$ can stay dimensionless.
If we now consider how the loss changes via
$L(x+\Delta x_{\text{QN}}) = L(x) - \eta \cdot \frac{\partial L}{\partial x} \left( \frac{\partial^2 L}{\partial x^2} \right)^{-1} \Big(\frac{\partial L}{\partial x}\Big)^T + \cdots$ , the second term correctly cancels out the units of $x$ and leaves us with the units of $L$ itself. Hence the quasi-Newton step does not suffer from the sensitivity issues of GD.
**Convergence near optimum** 💎
Quasi-Newton methods also exhibit much faster convergence when the loss landscape is relatively flat.
Instead of slowing down, they take larger steps, even when the gradient becomes small: near a flat optimum the curvature shrinks along with it, and the inverse Hessian compensates for the vanishing gradient.
**Consistency in function compositions**
So far, quasi-Newton methods address both shortcomings of GD. However, similar to GD, the update of an intermediate space still depends on all functions before that. This behavior stems from the fact that the Hessian of a composite function carries non-linear terms of the gradient.
Consider a function composition $L(y(x))$: the Hessian w.r.t. $x$ contains squared terms of the inner Jacobian $\partial y / \partial x$ as well as its second derivatives. Hence, while the inversion corrects the overall scale of the update, the change induced in the intermediate quantity $y$ still depends on the inner function and its parametrization.
% chain of function evaluations: Hessian of an outer function is influenced by inner ones; inversion corrects and yields quantity similar to IG, but nonetheless influenced by "later" derivatives
**Dependence on Hessian** 🎩
In addition, a fundamental disadvantage of quasi-Newton methods that becomes apparent from the discussion above is their dependence on the Hessian. It plays a crucial role for all the improvements discussed so far.
The first obvious drawback is the computational cost.
While evaluating the exact Hessian only adds one extra pass to every optimization step, this pass involves higher-dimensional tensors than the computation of the gradient.
As the number of optimized parameters grows, e.g. to the millions of weights of typical NNs, computing and storing the full Hessian quickly becomes infeasible: its size grows quadratically with the parameter count.
% Many algorithms therefore avoid computing the exact Hessian and instead approximate it by accumulating the gradient over multiple update steps. The memory requirements also grow quadratically.
The quasi-Newton update above additionally requires the inverse Hessian matrix. Thus, a Hessian that is close to being non-invertible typically causes numerical stability problems, while inherently non-invertible Hessians require a fallback to a first order GD update.
Another related limitation of quasi-Newton methods is that the objective function needs to be twice-differentiable. While this may not seem like a big restriction, note that many common neural network architectures use ReLU activation functions of which the second-order derivative is zero. % Related to this is the problem that higher-order derivatives tend to change more quickly when traversing the parameter space, making them more prone to high-frequency noise in the loss landscape.
_Quasi-Newton Methods_
are still a very active research topic, and hence many extensions have been proposed that can alleviate some of these problems in certain settings. E.g., the memory requirement problem can be sidestepped by storing only lower-dimensional vectors that can be used to approximate the Hessian. However, these difficulties illustrate the problems that often arise when applying methods like BFGS.
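Such methods are readily available in standard libraries. As a small illustration (the use of SciPy here is our choice, not a requirement of anything discussed in this chapter), L-BFGS-B can be applied to the toy loss from above:

```python
import numpy as np
from scipy.optimize import minimize

alpha = 10.0

def loss_and_grad(x):
    L = x[0]**2 + alpha**2 * x[1]**2
    g = np.array([2.0 * x[0], 2.0 * alpha**2 * x[1]])
    return L, g

# L-BFGS builds a low-memory approximation of the inverse Hessian from past
# gradients instead of storing the full matrix.
res = minimize(loss_and_grad, np.array([1.0, 1.0]), jac=True, method="L-BFGS-B")
print(res.x, res.nit)   # converges to ~[0, 0] within a few iterations
```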
%\nt{In contrast to these classic algorithms, we will show how to leverage invertible physical models to efficiently compute physical update steps. In certain scenarios, such as simple loss functions, computing the inverse gradient via the inverse Hessian will also provide a useful building block for our final algorithm.} %, and how to they can be used to improve the training of neural networks.
As a first step towards fixing the aforementioned issues, we'll consider what we'll call inverse gradients (IGs). These methods actually use an inverse of the Jacobian, but as we always have a scalar loss at the end of the computational chain, this results in a gradient vector. Unfortunately, they come with their own set of problems, which is why they only represent an intermediate step (we'll revisit them in a more practical form later on).
Instead of the scalar loss $L$, we now consider a generic, potentially vector-valued function $y(x)$ together with a desired change of its output $\Delta y$, and define
$$ \Delta x_{\text{IG}} = \frac{\partial x}{\partial y} \cdot \Delta y. $$ (IG-def)
to be the IG update.
Here, the Jacobian $\frac{\partial x}{\partial y}$, i.e. the inverse of the forward Jacobian $\frac{\partial y}{\partial x}$, encodes how the input $x$ needs to change in order to produce the change $\Delta y$ in the output.
Note that instead of using a learning rate, here the step size is determined by the desired increase or decrease of the value of the output, $\Delta y$.
% Positive Aspects Units 📏
IGs scale with the inverse derivative. Hence the updates are automatically of the same units as the parameters without requiring an arbitrary learning rate: the Jacobian $\frac{\partial x}{\partial y}$ carries units of $x$ per unit of $y$, which cancel the units of $\Delta y$.
**Function sensitivity** 🔍
They also don't have problems with normalization, as the parameter updates for an example function $y(x) = c \cdot x$ now scale with $1/c$: sensitive functions receive small, careful updates, while insensitive functions receive large ones.
**Convergence near optimum and function compositions** 💎
Like Newton's method, IGs show the opposite behavior of GD close to an optimum: they produce updates that still progress the optimization, which usually improves convergence.
Additionally, IGs are consistent in function composition.
The change in any intermediate space of a chain of functions depends only on the functions between that space and the output $y$: the earlier parts of the chain, and in particular their scaling, have no influence on the update.
Note that even Newton's method with its inverse Hessian didn't fully get this right. The key here is that if the Jacobian is invertible, we'll directly get the correctly scaled direction at a given layer, without helper quantities such as the inverse Hessian.
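For the linear toy problem, the IG update of equation {eq}`IG-def` can be evaluated directly, since the inverse Jacobian is available analytically. A minimal sketch, with an arbitrarily chosen $\Delta y$ that moves the output 10% towards zero:

```python
import numpy as np

def P(x, alpha):                 # forward solver: y = (x1, alpha * x2)
    return np.array([x[0], alpha * x[1]])

def dx_dy(alpha):                # inverse Jacobian (dP/dx)^-1 for this linear solver
    return np.diag([1.0, 1.0 / alpha])

x = np.array([1.0, 1.0])
for alpha in [1.0, 10.0, 100.0]:
    y  = P(x, alpha)
    dy = -0.1 * y                # desired change of the output
    dx_ig = dx_dy(alpha) @ dy    # IG update, cf. (IG-def)
    print(alpha, dx_ig)          # always [-0.1 -0.1]: scale-invariant
```

The resulting $\Delta x_{\text{IG}}$ is identical for every $\alpha$, and it carries the units of $x$ rather than those of the loss.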
% Consistency in function compositions
% In the example in table~\ref{tab:function-composition-example}, the change
**Dependence on the inverse Jacobian** 🎩
So far so good. The above properties are clearly advantageous, but unfortunately IGs require the inverse of the Jacobian, $\frac{\partial x}{\partial y} = \big( \frac{\partial y}{\partial x} \big)^{-1}$. It is only well-defined if the Jacobian is square and non-singular, and computing it explicitly quickly becomes prohibitively expensive for high-dimensional problems.
Thus, we now consider the fact that inverse gradients are linearizations of inverse functions and show that using inverse functions provides additional advantages while retaining the same benefits.
% --- split --- ?
So far we've discussed the problems of existing methods, and a common theme among the methods that do better, Newton and IGs, is that the regular gradient is not sufficient. We somehow need to address its problems with some form of inversion to arrive at scale invariance. Before going into details of NN training and numerical methods to perform this inversion, we will consider one additional "special" case that will further illustrate the need for inversion: if we can make use of an inverse simulator, this likewise addresses many of the inherent issues of GD. It actually represents the ideal setting for computing update steps for the physics simulation part.
Let $\mathcal P^{-1}$ denote the inverse of our simulator $\mathcal P$, i.e. a function for which $\mathcal P^{-1}(\mathcal P(x)) = x$ holds.
Trying to employ this inverse solver in the minimization problem from the top, somewhat surprisingly, makes the whole minimization obsolete (at least if we consider single cases with one $x, y^{*}$ pair). We just need to evaluate $\mathcal P^{-1}(y^{*})$ to solve the inverse problem and obtain the solution $x$.
Now, instead of evaluating the inverse for the final target $y^{*}$ right away, it is usually preferable to take smaller steps: we apply the inverse to an output $y_0 + \Delta y$ that moves the current output only part of the way towards $y^{*}$.
It also turns out to be a good idea to employ a local inverse that is conditioned on an initial guess for the solution, denoted by $\mathcal P^{-1}(y \,; x_0)$ below: among the potentially many pre-images of $y$, it returns one that is close to $x_0$.
Equipped with these changes, we can formulate an optimization problem where a current state of the optimization $x_0$, with corresponding output $y_0 = \mathcal P(x_0)$, is updated via
$$ \Delta x_{\text{PG}} = \frac{ \big( \mathcal P^{-1} (y_0 + \Delta y; x_0) - x_0 \big) }{\Delta y} \cdot \Delta y . $$ (PG-def)
Here the $\Delta y$ in equation {eq}`PG-def` effectively cancels out, but writing the update in this form highlights its relation to the IG update from equation {eq}`IG-def`: the fraction plays the role of the inverse Jacobian $\frac{\partial x}{\partial y}$, evaluated with the local inverse simulator instead of a linearization.
The update $\Delta x_{\text{PG}}$ obtained in this way requires an evaluation of the (local) inverse simulator, and we'll refer to it as the update from an inverse simulator, or physical gradient (PG), in the following.
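To get a feeling for this update, here's a deliberately simple 1D sketch with $\mathcal P(x)=x^2$, for which the local inverse can be written down analytically by picking the root on the same side as the current guess $x_0$; the step size in $y$ is an arbitrary choice for illustration:

```python
import numpy as np

def P(x):
    return x**2

def P_inv_local(y, x0):
    """Local inverse of P(x)=x^2, conditioned on the guess x0:
    returns the root on the same branch as x0."""
    return np.sign(x0) * np.sqrt(y)

x0 = -2.0                    # current estimate
y0 = P(x0)                   # = 4.0
y_target = 1.0               # output value we'd like to reach
dy = 0.5 * (y_target - y0)   # move the output half-way towards the target

dx_pg = P_inv_local(y0 + dy, x0) - x0   # update from (PG-def), with the dy/dy cancelled
print(dx_pg, x0 + dx_pg)                # moves towards x=-1, staying on the negative branch
```

Conditioning the inverse on $x_0$ resolves the ambiguity between the two pre-images $\pm\sqrt{y}$, which also hints at how this construction helps with the multi-modality issues mentioned at the beginning of this chapter.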
The update obtained with a regular gradient descent method has surprising shortcomings due to scaling issues.
Classical, inversion-based methods like IGs and Newton's method remove some of these shortcomings,
with the somewhat theoretical construct of the update from inverse simulators ({eq}`PG-def`) resolving them most completely.
In contrast to the second- and first-order approximations from Newton's method and IGs, it can potentially take highly nonlinear effects into account. As constructing an inverse simulator can be difficult, the main goal of the following sections is to illustrate how much there is to gain from including all the higher-order information. Note that all three methods successfully include a rescaling of the search direction via inversion, in contrast to the previously discussed GD training. All of these methods represent different forms of differentiable physics, though.
Before moving on to including improved updates in NN training processes, we will discuss some additional theoretical aspects, and then illustrate the differences between these approaches with a practical example.
The following sections will provide an in-depth look ("deep dive") into
optimizations with inverse solvers. If you're interested in practical examples
and connections to NNs, feel free to skip ahead to {doc}`physgrad-comparison` or
{doc}`physgrad-nn`, respectively.
We'll now derive and discuss the update from inverse simulators in more detail.
Update steps computed as described above also have some nice theoretical properties, e.g., that the optimization converges given that $\mathcal P^{-1}$ is a sufficiently accurate approximation of the true inverse. The corresponding proofs can be found in {cite}`holl2021pg`.
% We now show that these terms can help produce more stable updates than the IG alone, provided that
To more clearly illustrate the advantages in non-linear settings, we
apply the fundamental theorem of calculus to rewrite the ratio $\Delta x_{\text{PG}} / \Delta y$ from above. This yields

$$ \frac{\Delta x_{\text{PG}}}{\Delta y} = \frac{\int_{y_0}^{y_0+\Delta y} \frac{\partial x}{\partial y} \, dy}{\Delta y} . $$

Here the expression inside the integral is the local gradient $\frac{\partial x}{\partial y}$, and we assume it exists at all points between $y_0$ and $y_0+\Delta y$.
The equations naturally generalize to higher dimensions by replacing the integral with a path integral along any differentiable path connecting $y_0$ and $y_0+\Delta y$.
Let $y = \mathcal P(x)$ as before. A global inverse function $\mathcal P^{-1}$ has to map every admissible output $y$ back to a unique input $x$, and hence exists only if $\mathcal P$ is bijective.
Instead of using this "perfect" inverse, which would need to be accurate over the whole output space, we can relax the requirements.
By contrast, a local inverse only needs to exist and be accurate in the vicinity of $y_0$ and $x_0$.
Non-injective functions can be inverted, for example, by choosing the $x$ closest to $x_0$ among all inputs that map to the desired output $y$.
For differentiable functions, a local inverse is guaranteed to exist by the inverse function theorem as long as the Jacobian is non-singular.
That is because the inverse Jacobian $\frac{\partial x}{\partial y}$ itself defines a local inverse via linearization around $(x_0, y_0)$, albeit not necessarily a very accurate one.
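When no analytic inverse is at hand, such a local inverse around $(x_0, y_0)$ can often be obtained numerically. The sketch below uses a few Newton iterations on the forward solver (one possible choice among many) and relies on the Jacobian being non-singular along the way; the solver and target values are made up for illustration:

```python
import numpy as np

def local_inverse(P, jac_P, y_target, x0, steps=20, tol=1e-10):
    """Numerically invert y = P(x) around the initial guess x0 via Newton iterations.
    Requires a non-singular Jacobian of P along the way."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        r = P(x) - y_target
        if np.linalg.norm(r) < tol:
            break
        x = x - np.linalg.solve(jac_P(x), r)   # Newton step on the residual
    return x

P      = lambda x: np.array([x[0] + 0.1 * x[1]**2, 2.0 * x[1]])   # mildly non-linear "solver"
jac_P  = lambda x: np.array([[1.0, 0.2 * x[1]], [0.0, 2.0]])
x0     = np.array([1.0, 1.0])
y_goal = P(x0) + np.array([-0.5, -0.5])
x_new  = local_inverse(P, jac_P, y_goal, x0)
print(x_new, P(x_new))   # P(x_new) matches y_goal up to the tolerance
```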
The inverse function of a simulator is typically the time-reversed physical process.
In some cases, inverting the time axis of the forward simulator, i.e. integrating backwards with $t \rightarrow -t$, already provides an adequate approximation of the inverse.
However, the simulator itself needs to be of sufficient accuracy to provide the correct estimate. For more complex settings, e.g., fluid simulations over the course of many time steps, the first- and second-order schemes as employed in {doc}`overview-ns-forw` would not be sufficient.
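The accuracy requirement is easy to see even for a single explicit Euler step: naively integrating backwards in time with the same scheme only recovers the input up to the truncation error of the scheme. A minimal sketch with an arbitrarily chosen ODE right-hand side:

```python
def f(x):                        # made-up non-linear right-hand side
    return -x**2

def forward_euler(x, dt):        # one step of the forward "simulator"
    return x + dt * f(x)

def backward_in_time(y, dt):     # naive inverse: same scheme, negated time step
    return y - dt * f(y)

x0 = 1.0
for dt in [0.1, 0.01, 0.001]:
    x_rec = backward_in_time(forward_euler(x0, dt), dt)
    print(f"dt={dt}: reconstruction error = {abs(x_rec - x0):.2e}")   # shrinks roughly with dt^2
```

Chaining many such steps accumulates this error, which is why low-order schemes quickly become the limiting factor for this kind of inverse.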
Since introducing IGs, we've only considered a simulator with an output $y$ and its linearization: in equation {eq}`IG-def` we've introduced the inverse gradient (IG) update, which gives $\Delta x$ in terms of a desired change $\Delta y$ via the inverse Jacobian. Substituting the IG from equation {eq}`IG-def` for the linearization of the local inverse in equation {eq}`PG-def` yields, up to first order:

$$ \Delta x_{\text{PG}} = \mathcal P^{-1}(y_0 + \Delta y \,; x_0) - x_0 = \frac{\partial x}{\partial y} \, \Delta y + \mathcal O(\Delta y^2) . $$
This shows that equation {eq}`PG-def` is equal to the IG from the section above up to first order, but contains nonlinear terms, i.e.
$ \Delta x_{\text{PG}} / \Delta y = \frac{\partial x}{\partial y} + \mathcal O(\Delta y) $.
Also, we have turned the step w.r.t. the loss $L$ into a step in terms of the output $y$: since the loss as a function of $y$ is typically simple (e.g. an $L^2$ distance to a target), a cheap Newton step as in equation {eq}`quasi-newton-update` can be used there to determine $\Delta y$.