Structure preserving deep learning

Over the past few years, deep learning has risen to the foreground as a topic of massive interest, mainly as a result of successes obtained in solving large-scale image processing tasks. There are multiple challenging mathematical problems involved in applying deep learning: most deep learning methods require the solution of hard optimisation problems, and a good understanding of the tradeoff between computational effort, amount of data and model complexity is required to successfully design a deep learning approach for a given problem. A large amount of progress made in deep learning has been based on heuristic explorations, but there is a growing effort to mathematically understand the structure in existing deep learning methods and to systematically design new deep learning methods to preserve certain types of structure in deep learning. In this article, we review a number of these directions: some deep neural networks can be understood as discretisations of dynamical systems, neural networks can be designed to have desirable properties such as invertibility or group equivariance, and new algorithmic frameworks based on conformal Hamiltonian systems and Riemannian manifolds to solve the optimisation problems have been proposed. We conclude our review of each of these topics by discussing some open problems that we consider to be interesting directions for future research.


Introduction
Structure-preserving numerical schemes have their roots in geometric integration [62], and numerical schemes that is build on characterisations of PDEs as metric gradient flows [5], just to name a few. The overarching aim of structure-preserving numerics is to preserve certain properties of the continuous model, e.g. mass or energy conservation, in its discretisation. But structure preservation is not just restricted to play a role in classical numerical analysis of ODEs and PDEs. Indeed, through the advent of continuum interpretations of neural networks [60,46,47,116], Structure-preserving deep learning 889 structure preservation is also entering the field of deep learning. Here, the main objectives are to use the continuum model and structure-preserving schemes to derive stable and converging neural networks and associated training procedures and algorithms.

Neural networks
Neural networks are a rich class of machine learning models that can be leveraged for many different tasks including regression, classification, natural language processing, reinforcement learning and image generation [85]. While it is difficult to provide an all-encompassing definition for neural networks, they can generally be characterised as a combination of simple, parametric functions between feature spaces. These functions act as individual building blocks (commonly called the layers of the network). The main mechanism for combining these layers, which will be adopted in this work, is by function composition.
For any k ∈ {0, . . . , K − 1}, let X k denote a vector space (our feature space). While in most applications, these are simply finite-dimensional Euclidean spaces, we will assume more general structures (such as Banach spaces) when appropriate. With this, we then define a generic layer where k is the set of possible parameter values of this layer. A neural network can then be defined via the iteration such that X 0 = X and X K = Y, where θ := (θ 0 , . . . , θ K−1 ) ∈ 0 × · · · × K−1 =: denotes the entirety of the network's parameters. The first layer is commonly referred to as the neural network's input layer, the final layer as the neural network's output layer and all of the remaining layers are called hidden layers. While the above definitions are still quite general, in practice several standard layer types are employed. Most ubiquitous are those that can be written as a learnable affine combination of the input, followed by a simple, non-linear function: the layer's activation function. The quintessential example are fully connected layers such that the neural network's output entries can be regarded as the individual class membership probabilities of the input. An important extension to the concept of fully connected layers lies in convolutional layers [86] where the matrix-vector product is replaced by the application of a (multichannel) convolution operator. These are the main building block of neural networks used in imaging applications. Suppose we are given a set of paired training data (x n , y n ) N n=1 ⊂ X × Y, which is the case for predictive tasks like regression or classification. Training the model then amounts to solving the optimisation problem min θ∈ E(θ ) = 1 N N n=1 L n ( (x n , θ )) + R(θ ) . (1.4) Here, L n (y) := L(y, y n ) : Y → R ∞ is the loss for a specific data point where L : Y × Y → R ∞ := R ∪ {∞} is a general loss function, which usually satisfies L 0 and L(y 1 , y 2 ) = 0 if and only if y 1 = y 2 . The function L n is usually smooth on its effective domain {y | L n (y) < ∞} and convex. The term R : → R ∞ acts as a regulariser which penalises and constrains unwanted solutions. In this setting, solving (1.4) is a form of empirical risk minimisation [117]. Typically, variants of stochastic gradient descent are employed to solve this task. The calculation of the necessary gradients is performed using the famous backpropagation algorithm, which can be understood as an application of reverse-mode auto-differentiation [92].

Residual networks and differential equations
In the following, we will discuss artificial neural network architectures that arise from the numerical discretisation of time-and parameter-dependent differential equations. Differential equations have a long history in the mathematical treatment of neural networks. Initially, neural networks were motivated by biological neurons in the brain. Mathematical models for the interactions of biological neurons are based on non-linear, time-dependent differential equations. These have inspired some famous artificial neural networks such as the Hopfield networks [68]. On a time interval [0, T], an approximation to the solution of the differential equation is obtained at discrete times 0 = t 0 < t 1 < · · · < t K = T, e.g. t k = k h and h = T/K, corresponding to the different layers of the network architecture. For a fixed final time T, the existence of an underlying continuous model guarantees the existence of a continuous limit as the number of layers goes to infinity and h goes to zero.
In the spirit of geometric numerical integration [62], we discuss structural properties of the artificial neural network as arising from the structure-preserving discretisation of a differential equation with appropriate qualitative features such as having an underlying symmetry, an energy function or a Lyapunov function or preserving a volume form or a symplectic structure.
Contrary to the 'classical' design principle for layers of affine transformations followed by activation functions (cf. (1.3)), the so-called residual layers are a variation to this principle, which has risen to become a standard design concept of neural networks. Here, the output of one such layer is again added to its input, which again defines a network, the ResNet [65], : X × → X , (x, θ ) = z K that is now given by the iteration It is well-known that the optimal control formulation can be phrased as a closed dynamical system by using Pontryagin's principle and that this results in a constrained Hamiltonian boundary value problem [12]. The Hamiltonian of this system is H(z, p, θ ) = p, f (z, θ ) , (1.9) where p ∈ T * z X ≡ R M is a vector of Lagrange multipliers. The dynamical system is then given bẏ z = ∂ p H,ṗ = −∂ z H, (1.10) subject to the constraint 0 = ∂ θ H. (1.11) The adjoint equation for p can be expressed aṡ L n (z(T)). (1.12) In what follows we will review some of the guiding principles of structure-preserving deep learning, emphasising recent contributions on new neural networks architectures as discretisations of ODEs and PDEs in Section 2, invertible neural networks in Section 3, the interpretation of the training of neural networks as an optimal control problem in Section 4, equivariant neural networks in Section 5 and structure-preserving numerical schemes for the training of neural networks in Section 6.

Structure-preserving ODE formulations
In this section, we look at how the ODE formulation (1.7) can be restricted or extended in order to ensure that its flow has favourable properties. The relationship of these flow properties to the performance of the network and its optimisation is an emerging topic of investigation, presently being explored theoretically and experimentally.
We are zooming in on the forward problem itself, assuming that the parameters θ are fixed. By abuse of notation, we will in this section write f (t, z) for f (z, θ (t)) in that we consider θ (t) a known function of t. It is perhaps not so obvious what the desirable properties of the flow should be, and to some extent we lean here on prior work by Haber and Ruthotto [60] as well as Chang et al. [23].
A central consideration in both ODEs and neural networks is that of stability, so the welldeveloped stability theory of ODEs and their discretisation offers valuable insights into the stability of neural networks. However, it is important to qualify in which sense one is talking about stability, and how the different notions of stability of ODEs carry over to neural networks. A classical notion of stability of ODEs considers the stability of equilibrium points, i.e. points z 0 = z(0) with f (t, z 0 ) = 0 for t ∈ [0, ∞). This type of stability can be studied in terms of Lyapunov functions, that is, functions V (z) that are non-increasing along solution trajectories. Functions that are constant along solutions are called first integrals, and a particular instance is the energy function of autonomous Hamiltonian systems.

893
In this view, trajectories for perturbations of the initial value are studied for all t ∈ [0, ∞). However, this notion of stability has limited applicability to neural networks, as one typically only considers finite time horizons t ∈ [0, T] (cf. Section 1.2). Furthermore, equilibria hold no particular significance in neural networks.
Therefore, a more directly applicable view of stability is whether z(T) does not change 'too much' when the initial value z 0 is perturbed. That is, small perturbations in the data should not lead to large deviations in the end result. Such considerations lead to asking whether z(T) is, e.g. Lipschitz continuous (or more generally, uniformly continuous) in z 0 . This means that for any two solutions z 1 (t) and z 2 (t) to (1.7), we have for some C 0. It is well-known that this type of estimate can be obtained in several different ways, depending on the properties of the underlying vector field in (1.7). In particular, the Lipschitz constant C now depends on f as well as θ . Thus, in order to guarantee that C is not too large, one may want to impose certain restrictions on f or θ .
Lipschitz constant for the activation function σ . Note that for a discretisation of this ODE, the smallest Lipschitz constant not only depends on f and θ , but also on number of discretisation steps N, as well as the chosen numerical discretisation scheme. If one is to design a neural network as a discretised ODE which is supposed to be stable in the sense stated above, one has to take all of these into account.
For stability of non-linear ODEs one may, for instance, consider growth and contractivity in the L 2 -norm using a one-sided Lipschitz condition. This is similar to the analysis proposed in [139]. Suppose there exists a constant ν ∈ R such that for all admissible z 1 , z 2 , and t ∈ [0, T] we have In this case it is easily seen that for any two solutions z 1 (t), z 2 (t) to (1.7) one has so that the problem is non-expansive if ν 0 [64]. For instance, the vector field (1.8) satisfies (2.2) for σ absolutely continuous if where the supremum is taken over all diagonal matrices D with diagonal entries in σ (R  the weights and biases may change slowly enough for the eigenvalue analysis to be valid when considering the stability of the forward problem. We illustrate this issue through an example. We use the standard vector field (1.8) and approximate its solution by the Euler method to obtain the standard ResNet iteration (1.6). Suppose that R(θ ) in (1.4) is chosen to be a discretisation of in order to penalise rapid variations in the parameters θ (t) through the layers. A discretised version could be where h is the time step t j − t j−1 . We first train the model on the well-known spiral problem in two dimensions, using K = 70 two-dimensional layers and N = 8000 data points. We apply three different values for the regularisation parameter, λ ∈ {0, 0.01, 0.5}. In Figure 2 we have plotted how the weights and biases change over time.
The three trained models were run for a test dataset of 4000 points and the classification success rate was 99.2%, 94.9% and 94.3%, respectively, for λ = 0, 0.01, 0.5. We wanted to compare the eigenvalue analysis to the contractivity principle for this experiment. This was done locally in time by first computing the Jacobian matrix J (t) := ∂f ∂z (t, z(t)) of the vector field f (t, z(t)) = σ (A(t)z(t) + b) for every time step along a trajectory of the trained model. We compared the smallest real part of the eigenvalues of J with the logarithmic norm of J , which is a local version of the one-sided Lipschitz constant discussed above. The result is seen in Figure 3. Notice that the discrepancy between these two measures of stability does not seem to decrease as more regularisaton is added as one might expect. Clearly, the set of all vector fields of the form (1.8) include both stable and unstable cases. There are different ways of ensuring that a vector field has stable or contractive trajectories. One is to put restrictions on the family of matrices A(t); another option is to alter the form of (1.8) either by adding symmetry to it or by embedding it in a larger space where it can be given desirable geometric properties. For example, by doubling the dimension, it is possible in several different ways to obtain a non-autonomous Hamiltonian vector field. In [23] and [60] several models are suggested. We mention a few of them in Section 2.1.2.

Dissipative models and gradient flows
For different reasons, it is desirable to use vector fields with good stability properties [135]. This will ensure that data which are close to each other in a chosen norm initially remain close as the features propagate through the neural network. A model suggested in [60] is to consider weight matrices A(t) of the form where S(t) is arbitrary and γ is a small dissipation parameter. Flows of vector fields of the forṁ z = A(t)z + b(t) exactly preserve the L 2 -norm of the flow when A(t) is skew-symmetric. The nonlinearity will generally alter this behaviour, but adding a small amount of dissipation can improve the stability. It is, however, not guaranteed to reduce the one-sided Lipschitz condition.
Zhang and Schaeffer [139] analyse the stability of a model where the activation function σ is always chosen to be the Rectified Linear Unit (ReLU) function. The form of the discretised model is such that in the limit when h tends to zero, it must be replaced by a differential inclusion rather than an ODE of the form discussed above, meaning thatż − f (t, z) belongs to some specified subdomain of R M . Their model vector field is Growth and stability estimates are derived for this class of vector fields, as well as for cases where restrictions are imposed on the parameters, such as A 2 (t) having positive elements or the case A 1 (t) = A 2 (t) T =: A(t) and b 2 (t) = 0. For this last case, we consider for simplicity the ODĖ which is a gradient system in the sense thatż = −∇ z V with V = γ (A(t)z + b(t)), 1 where γ = σ and 1 is the vector of ones.

2) for any choice of parameters A(t) and b(t) with
where μ * = min t μ(t) and where μ(t) is the smallest singular value of A(t). In particular, Therefore, by the convexity of V (t, z), (2) Let z 1 and z 2 be vectors in R M . Using (2.4) while suppressing the t-dependence in the parameters, we find For real scalars ζ , η and β we have and since 0 σ (s + β) 1 a.e. we have Using this inequality in (2.5) we obtain Remark. In Theorem 2.1 we restricted the class of activation functions to be absolutely continuous with 0 σ (s) 1. This is true for many of the activation functions proposed in the literature, in particular for the commonly used activation functions ReLU σ = tanh and the sigmoid σ (s) = 1/(1 + exp(−s)).
This leads to differential equations of the forṁ There are different ways to construct a Hamiltonian system from (1.8). In [23] the following model is suggested:ż A simple case is obtained by choosing σ 1 (s) := s, A 1 (t) ≡ I, b 1 (t) ≡ 0 and σ 2 (s) := σ (s), which after eliminating p yields the second-order ODË A second example is considered by Chang et al. [23], who set σ 1 = σ 2 = σ and apply the resulting deep Hamiltonian neural network discretised by an explicit symplectic integrator to several standard image classification tasks. They find that its performance is competitive with state-of-the-art ResNet networks, requiring a similar number of parameters but much less memory. When the training set was drastically reduced or the number of layers was very large (up to 1202), performance was better than ResNet, a result they attribute to the intrinsic stability of the network architecture.
From the outset, it might not be so clear which geometric features such a non-autonomous Hamiltonian system has. There seem to be at least two ways to understand this problem [97]. Let us assume that u = (z, p) ∈ T * R M ≡ R M ⊕ R M with 'positions' z and 'momenta' p forming the phase space. We have the natural symplectic form ω 0 = dp ∧ dz on the phase space. This form can be represented using the Darboux matrix J as ω 0 (ξ , η) = ξ , J η , J = 0 I −I 0 and the Hamiltonian vector field f (t, z, p) defined via dH(·) = ω 0 (f (t, z, p), ·).
Let us now introduce as phase space T * R M × R by including the time t as a dependent variable. The natural projection is τ : (z, p, t) → (z, p). The space T * R M × R can then be furnished with a contact structure defined as 898 E. Celledoni et al.
The first term is just the original form, ignoring the t-component of the tangents, while the second form is only concerned with the 't-direction'. One can write the matrix of ω H by adding a row and a column to J coming from the second term −dH ∧ dt The resulting vector field is then f H = (f T , f t ) T and can be expressed through the equations Extended phase space. One can in fact recover an autonomous Hamiltonian problem from the non-autonomous one by adding two extra dependent variables, say t and p t . We do this by considering the manifold T * (R M × R). One needs a new projection μ : (z, p, t, p t ) → (z, p, t) and we can define an extended (autonomous) Hamiltonian on T * (R M × R) as and two-form The corresponding matrix, J E , is just the original Darboux matrix J where each of the M × M identity matrices has been replaced by (M + 1) × (M + 1) identity matrices. The extended Hamiltonian K is exactly preserved along the flow so that the new conjugate momentum variable p t will keep track of how H(z, p, t) varies along solutions. The resulting vector field f E can be written as dK(·) = 0 (f E , ·) and in coordinates the ODE system becomeṡ We see at once that since the equations forż,ṗ do not depend on p t and since the solution for t is explicitly given, we solve the same problem as before. After solving for z and p, we obtain p t independently by integration. The second thing one may observe is that if a numerical method φ h has the property that 1 then what we obtain by just considering the first 2M components of the numerical solution to the extended system is precisely the same as what we would have obtained applying the same method to the original non-autonomous problem. This observation was used by Asorey et al. [7] to define what is meant by canonical transformations in the non-autonomous case, and we refer to this paper for further results on the structural connections between the two systems.
In applications to deep learning, one should note that geometric properties of the solution can mostly be deduced from the extended system rather than the original non-autonomous one; there are numerical methods, which preserve energy or the symplectic form 0 and rigorous results can be proved for the long time behaviour of such integrators [62].

Structure-preserving numerical methods for the ODE model
The rationale behind proposing ODE formulations with geometric structure is to enforce a controlled behaviour of the solution as it is propagated through the hidden layers of the network. It is therefore also important that when the ODE is approximated by a numerical method, this behaviour should be retained by the numerical scheme.

Numerical methods preserving dissipativity
When numerically solving ODEs, which satisfy the one-sided Lipschitz condition (2.2), a desirable property of the numerical scheme would be that it contracts two nearby numerical solutions whenever ν 0. That is, it should satisfy for each time step k, preferably without too severe restrictions on the time step h. Methods that can achieve this for any step size exist, and are called B-stable. There are many B-stable methods, and for Runge-Kutta methods, B-stability can be easily checked by the condition of algebraic stability. Examples of B-stable methods are the implicit Euler method and all implicit Runge-Kutta methods in the Gauss, Radau IA, Radau IIA and Lobatto IIIC classes [64]. In deep learning algorithms it has been more usual to consider explicit schemes since they are much cheaper per step than implicit ones, but they are subject to restrictions on the time step. For instance, the explicit Euler method is unstable for any step size when applied even to linear constant coefficient systems where there are eigenvalues of the coefficient matrix on the imaginary axis. To consider contractivity of explicit schemes, we need to replace (2.2) by a different monotonicity condition where we assumeν < 0. For every Runge-Kutta method with strictly positive weights b i there is a constant r such that the numerical solution is contractive whenever the step size satisfies The value of r can be calculated for every Runge-Kutta method with positive weights. For example, for the explicit Euler method as well as for the classical fourth-order method of Kutta one has r = 1 [38].

Numerical methods preserving energy or symplecticity
For autonomous Hamiltonian systems there are two important geometric properties which are conserved. One is the energy or Hamiltonian H(z, p); the other is the symplectic structure, a closed non-degenerate differential two-form. These two properties are also the main targets for structure-preserving numerical methods. All Hamiltonian deep learning models presented here can be extended to separable nonautonomous canonical systems (2.6)-(2.7). Such systems preserve the symplectic two-form 900 E. Celledoni et al. dp ∧ dq and there are many examples of explicit numerical methods that also preserve this same form in the sense that dp k+1 ∧ dq k+1 = dp k ∧ dq k . The simplest example of such a scheme is the symplectic Euler method, defined for the variables z and p as (2.10) The symplectic Euler method is explicit for separable Hamiltonian systems and is an example of a splitting method [100]. Many other examples and a comprehensive treatment of symplectic integrators can be found in [62]. When applying symplectic integrators to Hamiltonian problems, one has the important notion of backward error analysis. The numerical approximations obtained can be interpreted as the exact flow of a perturbed system with HamiltonianH(z, p) = H(z, p) + hH 2 (z, p) + · · · 2 . This partly explains the popularity of symplectic integrators, since many of the characteristics of Hamiltonian flows are inherited by the numerical integrator. Similarly, there exist many numerical methods that preserve energy exactly for autonomous problems. For instance, there is a large class of schemes based on discrete gradients [101]. A discrete gradient of a function H(z) is a continuous map ∇H : For the Hamiltonian ODEż = J ∇H(z) it is easily seen that the method defined by is energy preserving in the sense that H(z k ) = H(z 0 ) for all k > 0. There are many choices of discrete gradients, but most of them lead to implicit schemes and therefore have the disadvantage of being computationally expensive.
Another disadvantage is that even if it makes sense to impose energy conservation for the extended autonomised system explained above for deep learning models, it is not clear what that would mean for the original problem. It remains an open problem to understand the potential for and the benefits of using energy-preserving schemes for non-autonomous Hamiltonian systems in deep learning.

Splitting methods and shears
Splitting methods are very popular time integration methods that can be easily applied to preserve geometric structures of the underlying ODE problems such as symplecticity, invertibility and some symmetries. The idea of splitting and composition is simply to split the vector field in the sum of two (or more) vector fields, to integrate separately each of the parts and compose the corresponding flows. For example, splitting a Hamiltonian vector field into the sum of two Hamiltonian vector fields and composing their flows results into a symplectic integrator. If the individual parts are easy to integrate exactly, the resulting time integration method has often low computational cost. We refer to [100] for an overview on splitting methods.

901
An ODE on R d is a shear if there exist a basis of R d in which the ODE takes the forṁ A diffeomorphism on R d is called a shear if it is the flow of a shear. Splitting vector fields into the sum of shears allows the computation of the individual flows exactly simply applying the forward Euler method to each term.
Consider the shears R n × R n → R n × R n : In the case of autonomous, separable, Hamiltonian systems, the symplectic Euler method (2.10) can be seen as the composition of two such maps where Another popular example is the Störmer-Verlet integrator or leapfrog integrator which is also the composition of two shears. It is possible to represent quite complicated functions with just two shears. The 'standard map' (also known as Chirikov-Taylor map) is an area preserving, chaotic composition of two shears, much studied in dynamics and accelerator physics. Shears are useful tools to construct numerical time integration methods that preserve geometric features. As already mentioned symplecticity is one such property (if f and g are gradients), another is the preservation of volume and, as we will see in Section 3, shears can be used to obtain invertible neural networks.

Features evolving on Lie groups or homogeneous manifolds
If the data belongs naturally to a differentiable manifold M of finite dimension and f (z, θ ) in (1.7) is a vector field on M for all θ ∈ then z(t) ∈ M. Concrete examples of datasets in this category are manifold-valued images and signals. One example is diffusion tensor imaging consisting of tensor data, which at each point (i, j) in space corresponds to a 3 × 3 matrix A i,j ∈ Sym + (3), the 3 × 3 symmetric positive definite matrices [36]. If m × l is the number of pixels in the image then M = Sym + (3) m×l . Another example is InSAR imaging data taking values on the unit circle S 1 , in which M = S 1 m×l [115]. In both these examples denoising autoencoder neural networks [126] can be used to learn image denoising. The loss function (1.4) can take the form where X and Y are symmetric 3 × 3 and A is symmetric and positive definite, and with trace denoting the trace of matrices. With this metric Sym + (3) is a Riemannian symmetric space and is geodesically complete [130]. A third example concerns classification tasks where the data are Lie group valued curves for activity recognition. Here, M = G m with m the number of points where the curve is sampled and G = SO (3) is the group of rotations [18]. A loss function can be built using the following distance between two sampled curves where g is the Lie algebra of G, · g is a norm deduced from an inner product on g, and log : G → g denotes the matrix logarithm. An important feature of this distance is that it is re-parametrisation invariant and taking the infimum over all (discrete) parametrisations of the second curve one obtains a well defined distance for curves independent of their parametrisation [19]. The ODE (1.7) should in this setting be adapted to be an ODE on can be modified adding constraints. Alternatively, sometimes one can consider intrinsic manifold formulations, which allow the representation of the data with d − p degrees of freedom instead of d. The numerical integration of this ODE must then respect the manifold structure.

Geometric properties of Hamiltonian models
Autonomous Hamiltonian systems and their numerical solution are by now very well understood. For such models, one has conservation of energy that is attractive when considering stability of the neural network map. The same cannot, to our knowledge, be said about the non-autonomous case. One can approach this problem in different ways. One is to consider canonicity in the sense of [7] and study the geometric properties of canonical transformations via the extended autonomous Hamiltonian system. This is, however, not so straightforward for a number of reasons. One issue is that every level set of the extended Hamiltonian will be non-compact. Another issue is that the added conjugate momentum variable, denoted p t above, is artificial and is only used to balance the time-varying energy function.
One should also consider the effect of regularisation, since one may expect that smoothing of the parameters will cause the system to behave more similarly to the autonomous problem. See Section 4.2 of this paper.

Measure-preserving models
The most common example of an invariant measure is phase space volume. In invertible networks, some attention is given to volume-preserving maps, see [113,40] and also Section 3 of this paper. For the ODE model, this amounts to the vector field f (t, z) being divergence free, and there are several ways to parametrise such vector fields. All Hamiltonian vector fields are volume preserving, but the converse is not true. Volume-preserving integrators can be obtained, for example, via splitting and composition methods [100].

Manifold based models
A generalisation of most of the approaches presented in this section to the manifold setting is still missing. For data evolving on manifolds the ODE (1.7) should be adapted to be an ODE on M. Just as an example, a gradient flow on M analogous to (2.4) can be considered starting from defining a function V : M × → R using the antiderivative of the activation function σ and the Riemannian metric. The Hamiltonian formulations of Section 2.1.2 could be also generalised to manifolds in a similar way, starting from the definition of the Hamiltonian function.
An appropriate numerical time discretisation of the ODEs for deep learning algorithms must also guarantee that the evolution through the layers remains on the manifold so that one can make use of the Riemannian structure of M and obtain improved convergence of the gradient descent optimisation, see Section 6.2. The numerical time discretisation of this ODE must then respect the manifold structure and there is a vast literature on this subject, see, for example, [62,Chapter IV]. For numerical integration methods on Lie groups and homogeneous manifolds including symplectic and energy-preserving integrators, see [72,22].

Invertible neural networks and normalising flows
In the previous sections, we have viewed certain neural networks as discretisations of continuous systems. In the following, we will return to the classical view of discrete networks, which are additionally endowed with a certain structure. Invertible neural networks, i.e. bijective neural networks, are such an example. They have been an emerging topic in deep learning over the last few years. They are naturally connected to neural networks that are inspired by ordinary differential equations, as flows of ODEs are themselves invertible. Two of the main applications of invertible neural networks lie in memory-efficient backpropagation as well as generative modelling with density estimation. For simplicity, in the following we will use the abbreviation f k = f k (·, θ k ) when appropriate. Much like, in general, invertible networks are typically parametrised as a function composition = f K • · · · • f 1 , with the additional constraint that each layer f k is a bijective function. The inverse can then simply be computed via A main restriction this imposes on the architecture is the fact that for where the X k are the feature spaces. As a consequence, the dimension of the input space determines the dimensions of the feature spaces in a (fully) invertible neural network.

Types of invertible layers
Invertible networks and layers can roughly be divided into two categories: Those that are algebraically invertible and those that are inverted with a numerical approximation scheme.

Coupling layers
Coupling layers [40] work by splitting a layer's input into two parts and applying a suitable transformation that is easily invertible for one of the two parts. Mathematically, for an M-dimensional input, the index set I = {1, . . . , M} is partitioned into two sets. In convolutional 904 E. Celledoni et al. networks, the partition is often done channel-wise, i.e. the channels are split into two sets. A different type of partitioning is invertible downsampling (also known as a masking or squeeze operation [40]). With x = (x 1 , x 2 ) ∈ R M 1 × R M 2 and y = (y 1 , y 2 ) ∈ R M 1 × R M 2 , a coupling layer g : x → y is defined via the mapping y → x is given by where γ −1 is the inverse of γ in its first argument (for fixed second argument). The coupling law γ can be as simple as γ : (a, b) → a + b, with which f is called an additive coupling layer [40], such that Another commonly used class of coupling layers are affine coupling layers [40,41]. More complex, but in theory, more expressive coupling laws can be constructed from strictly monotonic splines [45]. Note that (3.1) and (3.2) are shear mappings (Section 2.2.3), which in the case of ODEs are used to construct, e.g. symplecticity-preserving numerical solutions.
There is in principle no restriction on the function f . This allows for the utilisation of arbitrarily expressive subnetworks f (e.g. a sequence of convolutional layers with non-linear activation functions), without rendering g (algebraically) non-invertible. The numerical stability of this inversion may still pose an issue. Stability guarantees (both for the forward and the inverse mapping) can, for instance, be controlled via the Lipschitz constant of f and γ , as shown in [10]. This mirrors stability considerations of ODEs (see Section 2.1), where guarantees can be formulated as conditions on the Lipschitz constants. Note that while f has to map to R M 2 , the individual layers which comprise f may change the dimensionality throughout this subnetwork -only the final layer has to transform the data to the required space R M 2 . Since coupling layers perform learnable computations only on a part of the input, the partitioning should change throughout the network, for example, by switching the roles of x 1 and x 2 .
As demonstrated by Teshima et al. [120], neural networks built using coupling layers are universal approximators of diffeomorphic functions. This means that they can approximate a diffeomorphic function up to arbitrary precision, given a sufficiently large architecture. This can be seen as an extension of various other universal approximation theorems (cf. [37,69]), which state similar results for other classes of neural networks and functions.

Invertible layers through iterative schemes
Another class of invertible layers are those that are invertible with an iterative scheme. Invertible residual networks [9] are a special case of the commonly used residual networks [65], which allow for the inversion with a simple fixed-point iteration. As in Section 1.2, a residual layer can be framed as a function g : where Lip(f ) is the Lipschitz constant of f . According to Banach's fixed-point theorem, if Lip(f ) < 1, then y is guaranteed to have a unique fixed point, which is the inverse of y under g, i.e. z * = g −1 (y). This inverse can be approximated to arbitrary precision via the iteration for any initial value x 0 . While most common layer types (such as dense or convolutional layers equipped with activation functions with bounded derivative) are Lipschitz, the condition Lip(f ) < 1 is not necessarily met. Behrmann et al. [9] enforce the required Lipschitz constraint by spectral normalisation: Let the linear layer A θ depend linearly on its parameters θ (e.g. convolutional layers without non-linear activation functions and biases; these are linear layers that linearly depend on their kernel). Then Aθ parametrised bȳ Lipschitz constant Lip(f ) c for any activation function σ with Lip(σ ) 1, where b is a bias vector. Thus, by updating the layers' parameters via spectral normalisation according to (3.6) after each gradient step, the required Lipschitz constraint for invertibility is guaranteed, if one chooses c < 1. In practice, the spectral norm A θ 2 is calculated with the power method. To save computations, Behrmann et al. [9] only perform a single power iteration, but reuse the estimation of the leading singular vector from the previous training step as the initial guess. While the power method (with a finite number of iterations) technically only provides a lower bound to A θ 2 , the Lipschitz constraint is still met in practice [9]. In Figure 4, the dynamics of an invertible residual network and its non-invertible counterpart in 1D are compared. Both images depict a 30-layer residual network, which was trained to learn the identity function on [−2, 2] on 15 noisy samples. The Lipschitz constant of the residual blocks f were restricted to be smaller than one, whereas no such restriction was put on the non-invertible network. As invertible functions on the real line are strictly monotonic, the paths created by the invertible networks are non-crossing, which can happen in non-invertible networks.
In the case of invertible residual networks, the connection to neural networks as numerical solutions to ODEs is particularly strong. As noted in Section 1.2, ResNets can generally be viewed as Euler discretisations of an ODE, if one views the activations, weights and biases as observations of time-dependent variables. Fittingly, Behrmann et al. [9] originally motivate the inversion of such a ResNet by looking at the dynamics of the associated ODE backwards in time.

Linear invertible components
Above, two general approaches for constructing invertible, non-linear layers were presented. In the following, we list a few linear, invertible layers, which are used to increase the expressivity of invertible networks.  Coupling layers (Section 3.1.1) work by dimension splitting, which in the context of convolutional neural networks usually consists in splitting the channels into two groups, which are processed independently from one another. This is in contrast to standard, multichannel convolutions, where each output channel depends on all input channels. In order to overcome this limitation, the channels can subsequently be linearly combined (via an endomorphism). If the matrix of these coefficients is invertible, the automorphism created this way can be integrated as a linear layer into the invertible network framework. These invertible matrices can be parametrised in different ways: One approach [78] is to directly parametrise the desired matrix via a LU factorisation (with additional fixed permutation), which can then be inverted. While not discussed in the original publication, the diagonal entries themselves can be, e.g. parametrised to be larger than some ε > 0 to enforce stable invertibility. This is known as invertible 1 × 1 convolution, because this 'channel mixing' can equivalently be realised as a convolution with (in 2D) one-byone filters. An extension to this idea to convolutional layers (with larger filters) is presented in [67]. A computationally particularly cheaply (and stably) invertible class of matrices are orthogonal matrices. In [110], these are parametrised as a sequence of Householder transformations. Other possible parametrisations of (special) orthogonal matrices include exponentials of skewsymmetric matrices, Cayley transforms or sequences of Givens rotations. For a discussion about the numerical optimisation of such classes of parameter matrices compare also the forthcoming Section 6 and in particular (6.7).
Another type of linear, invertible layer is determined by invertible down-and up-sampling operations for image data. While classical down-and up-sampling methods (such as bilinear interpolation or nearest-neighbour methods) from image processing are inherently non-invertible (as they change the dimensionality of the input), it is possible to construct invertible down-and up-sampling methods. Since the dimensionality of their output must be the same as the dimensionality of their input, decreasing (or increasing) the spatial dimensionality of an image must be accompanied by a suitable increase (respectively decrease) in the number of channels. One such transformation for invertible down-sampling is the pixel shuffle transform [40], which subsamples the pixels and reorders them into new channels. The inverse of this transformation is an invertible up-sampling operation. In [52], this is generalised to learnable invertible down-and up-sampling operations, which contain the (inverse) pixel shuffle as a special case. The general idea is to construct orthogonal strided convolutional operators, where the kernel size matches the stride s ∈ N d , with spatial dimensionality d. It can be shown that by a specific reordering of an orthogonal τ -by-τ matrix (where τ = s 1 · · · s d ) into a filter kernel, the resulting convolution is orthogonal. This implies that the inverse is simply given by the corresponding transposed convolution. The authors propose to parametrise the required orthogonal matrices via exponentiating skew-symmetric matrices. As the matrix exponential is a surjective mapping from the Lie algebra of skew-symmetric matrices to the special orthogonal matrices, any special orthogonal matrix can be parametrised this way.

Memory-efficient backpropagation
One possible area of application for invertible neural networks is memory-efficient backpropagation [57]. Let z k+1 = f (z k , θ k ) be the output of a neural network's layer with non-linear mapping f : X k × k → X k+1 , where θ k are the layer's parameters. Let L be the loss of the network. Then provides the weight gradient necessary for the training of the network. For the calculation of this gradient, one needs both the gradient of the loss with respect to the output node (i.e. ∇ z k+1 L), as well as the ability to calculate the derivative ∂ θ k f (z k , θ k ). Unless f is linear, the calculation of the derivative requires access to z k , meaning that z k needs to be stored in memory. This typically represents the bulk of the memory demand in training neural networks. If, however, f is invertible, one can simply calculate z k from z k+1 via for the additional computational cost of calculating the inverse (where f −1 is the inverse of f in its first argument). Thus, instead of storing activations in memory during the forward pass, intermediate activations can simply be successively reconstructed from the output layer's activation. The memory requirement of training an invertible network using this scheme is thus independent of the number of invertible layers.

Invertible networks as subnetworks
In practice, many classical applications of neural networks such as classification, regression or segmentation do not typically map between vector spaces of the same dimensionality. Hence, a bijective function that maps between these two spaces does not exist, which is why a fully invertible neural network typically cannot solve the desired task. It is, however, possible to use an invertible neural networks as a subnetwork in a neural network. For example, if : X → X is an invertible network and η : X → Y is a suitable non-invertible layer, the combined network η • : X → Y can be used for a classification task of predicting labels in Y from features X . As demonstrated in [74], such a neural network can have competitive performance to a comparably sized residual network [65] on ImageNet [39]. Adding an output layer which transforms between spaces mirrors the approach for discretised ODEs, see Section 1.2.
The use of invertible networks as subnetworks allows for the utilisation of the memory efficient backpropagation for these subnetworks (cf. Section 3.2.1), while the respective gradients of non-invertible subnetworks can be calculated conventionally.

Density estimation and generative modelling
Aside from generative adversarial networks (GANs) and variational autoencoders (VAEs), normalising flows are another class of machine learning models that can be used to artificially generate data. While GANs are able to generate realistic-looking images [75], they lack the ability to estimate the likelihood of data under the generative model at hand. Likewise, VAEs can only estimate a lower bound of the likelihood -the variational lower bound. This is in contrast to normalising flows, which are trained via maximum likelihood estimation.
Like most generative models, normalising flows generate data from a simple base distribution (usually Gaussian) via a learned transformation. Let the random variable z have probability density function q, for which we will write z ∼ q(z). For any diffeomorphism f , it holds that due to the change-of-variables theorem 3 [14]. This means that for the probability density of x (denoted P), the log-likelihood of x can be expressed as Let f be parametrised by an invertible neural network, i.e. f (x) = (x, θ ) for all x ∈ X . Given training data (x n ) N n=1 , the minimiser of is a maximum likelihood estimator of the training data under the neural network. Models trained this way are called normalising flows, because they are usually trained to convert complicated data into normal distributions. In this framework, artificial data can be generated by sampling f −1 (z), where z ∼ q(z). An example of learning the 'two half-moons' dataset this way is provided in Figure 5. This case highlights that continuously transforming a normal distribution into a distribution whose probability density function has disconnected support is not possible. However, the normalising flow instead assigns low probability density to areas outside of the support, which results in a distribution that resembles the true distribution.
One of the main difficulties in designing normalising flows lies in the evaluation (respectively differentiation) of the determinant term. Note that for a neural network of the form (1.2) it holds that det(∂ z 0 (z 0 , θ )) = det(∂ z K−1 z K ) · · · det(∂ z 0 z 1 ), Structure-preserving deep learning 909 FIGURE 5. Left: 'Two half-moons' dataset. Middle: An approximation of the probability density function of the 'two half-moons' dataset by a normalising flow, generated by the code from [24]. Right: Points sampled from the approximated distribution.
such that the computation of the determinant for the whole network can be performed in a layer-by-layer fashion. Still, naïvely computing each determinant by the Leibniz formula yields prohibitive computation times. Depending on the types of layers used, the structure of the individial layers' Jacobians makes employing different determinant identities possible. If, for example, the Jacobian is lower triangular (as is the case with coupling layers [40]), the determinant reduces to the product of the Jacobian's diagonal entries. Invertible residual networks [9] on the other hand induce no such structure. By using the spectral norm constraint on f in equation (3.3), it holds that det(∂ z 0 (z 0 , θ )) = det(∂ z 0 (z 0 , θ )) and using the identity from [131], the authors show that log det(∂ z 0 (z 0 , θ )) = trace log ∂ z 0 (z 0 , θ ) holds. The matrix logarithm is then approximated as a truncated Taylor series, whereas for the evaluation of the trace, the Hutchinson estimator [70] is employed. An extension to this concept is presented in [24], where the truncation with a fixed number of steps is replaced by a 'Russian roulette' estimator, a type of Monte Carlo estimator. This estimator introduces a stochastic truncation of the series, which results in an unbiased estimator of the log-likelihood (3.7).
In the context of normalising flows, another connection to ODEs becomes apparent: As noted in [25], a continuous version of the change-of-variables theorem may be used to formulate a continuous normalising flow, which in turn can be solved numerically via a discretisation scheme. A maximum likelihood estimator can in turn be obtained by training a neural ODE [25].

Fast inverses without coupling
Coupling layers (Section 3.1.1) allow both for quick forward and inverse computations. Their expressivity is, however, hindered by the fact that expressive, learned transformations are only applied to a part of their input, while the remaining portion of the input is not transformed. However, while this problem can be mitigated by using multiple coupling layers (with different partitions of the input), it may be desirable to construct invertible layers where each output 910 E. Celledoni et al. neuron depends on all input neurons (as is the case with 'conventional' fully connected or convolutional layers). Invertible residual layers (Section 3.1.2) on the other hand fulfil this criterion, but they rely on a numerical inversion via a fixed-point iteration, which requires multiple evaluations of the forward operation. Furthermore, in order to guarantee their invertibility, a Lipschitz constant needs to be controlled, which requires additional computation time. For practical purposes, it would be desirable to combine both methods' advantages in order to obtain both fast and expressive invertible layers.

Stability guarantees
As discussed in [10], controlling the Lipschitz constant of invertible layers is a conceptionally simple, but practically costly method of guaranteeing a certain stability of the inverse. Furthermore, as discussed above, they need to be controlled in order to guarantee the convergence of the iterative scheme for the inversion of residual layers. In the interest of computational efficiency, is there a computationally cheaper way to control the Lipschitz constant than via the proposed power method? Work in this direction has already been performed in [24], as the authors experiment with Lipschitz constants corresponding to mixed norms.
In a similar vein, are there alternative ways of guaranteeing stability other than by controlling the Lipschitz constant?

Deep Learning meets optimal control
As outlined in the introduction, supervised deep learning with the ResNet architecture can be seen as solving a discretised optimal control problem. This observation has been made in [29,46,60,93] with further extensions to PDEs described in [116]. An advantage of this strategy is that many tools from scientific computing for the solution of optimal control problems, ODEs and PDEs can be utilised, see, for example, [59,106] for layer-parallel training.
Recall (1.4) where the neural network (·, θ ) : X → X is either defined by a recursion like (1.5) or the solution at the final time of an ODE (1.7). Another approach is to view the training as an optimisation problem over × X N where X is the space of the dynamics z, i.e.

Training algorithms
The discrete or continuous optimal control problem can be solved in multiple ways, some of which we will discuss next. The advantage of the reduced problem formulation (4.1) is that it is a smooth and unconstrained optimisation problem such that if is a Hilbert space, derivative-based algorithms such as gradient descent are applicable. Due to the non-linearity of we can at most expect to converge to a critical point E (θ ) = 0 with a derivative-based algorithm. The most basic algorithm to train neural networks is stochastic gradient descent (SGD) [114]. Given an initial estimate θ 0 , SGD consists of iterating two simple steps. First, sample a set of data points S j ⊂ {1, . . . , N} and then iterate

Structure-preserving deep learning
Other first-order algorithms used for deep learning are the popular Adam method [77] and the Gauss-Newton [60] method. A discussion on the convergence of SGD is out of the scope of this paper. Instead, we focus on how to compute the derivatives (L n • (x n , ·)) (θ j ) in the continuous model which is the central building block for all first-order methods. This following theorem is very common and the main idea dates back to Pontryagin [109]. This formulation is inspired by Bonnans [16, Lemma 2.47]. For finite dimensional θ a similar theorem can be proven, which dates back to Grönwall in 1919, see [58,63]. Defining neural networks via ODEs and computing gradients via continuous formulas similar to Theorem 4.1 has been first proposed in [25] with extensions in [56] and [44]. In the deep learning community this is referred to as Neural ODEs.
Similar to Theorem 4.1 a discrete version can be derived when the ODE (1.7) is discretised with a Runge-Kutta method [12,61]. For simplicity we just state the special case of the explicit Euler discretisation (ResNet (1.5)) here.
then the derivative of A := L n • (x n , ·) is given by ∂ θ k A(θ ) = h∂ θ f (z k n , θ k ) T p k+1 n .
Some readers will spot the similarity between Theorem 4.2 and what is called backpropagation in the deep learning community. This observation was already made in the 1980s, see, e.g. [83].

E. Celledoni et al.
If all functions in (4.5) are discretised by constant functions on [t k , t k+1 ], then the gradient of the discrete problem (4.6) approximates the Fréchet derivative of the continuous problem (4.5). In more detail, let e k,j ∈ L ∞ be supported on [t k , t k+1 ] and e k,j (t) = e j ∈ R M 2 +M with e j i = 1 if j = i and 0 else. If we denote by A = L n • (x n , ·) the data fit using the ODE solution (1.7) and A = L n • (x n , ·) the data fit with the ResNet (1.5), then In other words, in this case of piecewise constant functions, discretise-then-optimise is the same as the optimise-then-discretise.

Method of successive approximation
Another strategy to train neural networks has been recently proposed by [88] and is not based on the gradients of the reduced formulation (4.1) but on the necessary conditions of (4.2) also known as Pontryagin's maximum principle [109] instead. Given an initial estimate of θ 0 , each iteration of the method of successive approximation has three simple steps. First, solve (4.2b) with θ j which we denote by z j n , i.e.
Then solve the adjoint equation (4.4) with θ j , z j n and denote the solution by p j n , i.e. p j n = −∂ z f (z j n (t), θ j (t)) T p j n , p j n (T) = L n (z j n (T)), n = 1, . . . , N. (4.10) The third and final step is to maximise the Hamiltonian (1.9) given p j n , z j n . That is, for any t ∈ Note that this algorithm is potentially well-defined also in the non-smooth case, i.e. if f is not differentiable with respect to θ . If f is indeed smooth, R = 0 and θ j+1 = θ j , then and the Fréchet derivative of (4.1) vanishes. Analysis, extensions and numerical examples of the method of successive approximation are presented in [88].

Regularisation
The training of a neural network (1.4) may include explicit regularisation and several different regularisers R have been proposed in the literature, e.g. [12,122,60,103,111,49], which we want to discuss in this section. Before we dive into the specifics of these regularisers, we would like to answer the question if regularisation is necessary for training neural networks.

Example 4.3
In order to shed some light on this, we consider the most trivial example which is taken from [122]. To this end we use ResNet (1.5) and let N = 1, K = 1, M = 1, L 1 (z) = (z − 1) 2 , x 1 = 0 and σ = tanh. When no regulariser is present, R = 0, the training problem (1.4) simplifies to min A∈R,b∈R Since tanh(R) = (−1, 1), there are two possible problems here. First, (4.13) does not have a solution, so the task of training does not really make sense. Second, even if we ignore the first problem and just apply a descent algorithm on (4.13), then we encounter another problem: minimising sequences are not bounded. For example, let A j := 0, b j := j, then lim j→∞ E(A j , b j ) = 0 but (A j , b j ) j∈N is unbounded and does not even contain a convergent subsequence. Thus, we cannot expect our training algorithm to converge.
The key problem in the previous example was that the range of the neural network (x n , ·) was not closed. The non-closedness of the range is a characterisation of ill-posedness for linear inverse problems in infinite dimensions, see, e.g. [30,Theorem 3.7]. While this may never be the case for finite-dimensional linear inverse problems, the non-linearity in (4.13) results in exactly this property.
In order to overcome this problem, we can either pose constraints on the data fidelity (network architecture, link function, etc.) or we cure the ill-posedness by regularisation as is classically done when considering ill-posed inverse problems [50,73]. To overcome the problem in (4.13) it is sufficient to add a regulariser R which is coercive, meaning that for any sequence of parameters (θ j ) j∈N it holds that (4.14) This condition implies that minimising sequences, which are sequences (θ j ) j∈N such that lim j→∞ E(θ j ) = inf θ∈ E(θ ), are bounded. In reflexive Banach spaces, this is sufficient to guarantee at least a convergent subsequence. For more information on regularisation of non-linear ill-posed inverse problems, we refer to classical textbooks [50,73].
In the remainder of this section, we discuss a couple of specific choices for the regulariser R. In finite dimensions, due to the equivalence of all norms, any norm is coercive. In addition, the regulariser may impose additional properties on the estimated parameters. By now, it is standard to use the 1-norm θ 1 = i |θ i | or the squared 2-norm θ 2 2 = i |θ i | 2 for regularisation in deep learning to promote solutions, which are sparse or have small coefficients, respectively, [103,111]. The interpretation of a deep neural network as a process that changes with time naturally calls for other norms. For instance, the next section relies on the squared H 1 -norm as a regularisation. For θ : [0, T] → R M it is defined as This regularisation and its discrete counterpart will promote solutions which vary smoothly across the layers [60,122]. Finally, the connection of deep neural networks to discretised ODEs motivate other nonstandard regularisers, too. To this end we consider ResNet with time-varying discretisation 914 E. Celledoni et al. (4.16) and extend the parameters to θ = (A k , b k , h k ) K−1 k=0 . Then it is natural to ensure that the time steps h := (h k ) K−1 k=0 are non-negative and sum to T. More precisely, we regularise the time steps with R : Note that R is non-smooth and convex and since its proximal operator can be computed efficiently [43], proximal algorithms can be efficiently employed. This regulariser has been used for deep learning in [12]. For residual neural networks, it has also been proposed to penalise the 'weighted path norm', which takes into account the 'effective' depth of a network [49].
It should be noted that the 'problem' of non-existent solutions is usually ignored in practice since most optimisation algorithms do not converge (or at least not in reasonable time) and even if they converge, they may not converge to the global solution. For some neural networks, one can prove that even if the iterates do not converge, an associated quantity called 'normalized margin' still converges [94].

Deep limits
Let θ (K) denote a minimiser of (1.4) with a ResNet (1.5) with K layers and by θ ∞ a minimiser of (1.4) with the ODE constraint (1.7). In what sense and under which conditions do discrete solutions θ (K) converge when the number of layers K tends to infinity? If these converge, do they converge to a solution of the optimal control problem θ ∞ ?
In order to answer these questions one needs a topology in which we can compare θ (K) ∈ (R M 2 +M ) K and θ ∞ : [0, T] → R M 2 +M . With the discrete measure μ (K) = 1 K K−1 k=0 δ kT/K we make the identification (R M 2 +M ) K ∼ = L 2 (μ (K) , [0, T], R M 2 +M ) =: L 2 (μ (K) ) where by abusing notation, we associate the discrete object θ (K) with the piecewise constant function θ (K) : . It turns out that H 1 := H 1 ([0, T], R M 2 +M ) is a suitable solution space for the optimal control problem, so that L 2 := L 2 ([0, T], R M 2 +M ) is a natural candidate for the convergence of θ (K) to θ ∞ since both L 2 (μ (K) ) and H 1 can be embedded into L 2 .
The following theorem is a special case of Theorem 2.1 in [122] which answers the question of deep limits for ResNet (1.6). Its proof relies on the equivalence of convergence in L 2 and a certain transport metric [55], see [122] for more details.

Open problems
Connecting deep learning to optimal control has opened up new avenues to advance the field of deep learning. In this section we discussed algorithms motivated by this connection which are based on derivatives and necessary optimality conditions. We discussed the need and potential for variational regularisation of deep learning and understanding the behaviour of deep neural networks as we increase the number of layers. All of these routes have natural extensions, which will pave the way for better understanding of deep learning and even more powerful tools.

Algorithms with built-in errors
Using ODEs as a network architecture and computing gradients via the adjoint, i.e. Theorem 4.1, is theoretically appealing. However, in practice, both the forward and the adjoint ODE have to be solved numerically which induces errors into the gradient. When using off-the-shelf first-order methods like SGD, it is assumed that the gradients are computed exactly, which may either hinder performance or require prohibitively accurate computations, see, for instance, the discussion in [56]. That being said, state-of-the-art algorithms like SGD and Adam are stochastic and the updates are not guaranteed to decrease the objective so the impact of discretisation errors is not clear. Since the numerical solution of the ODE can be computed to any given tolerance, this naturally poses the question how to use such knowledge and control over the accuracy in optimisation algorithms. Some algorithms have been extended to include such errors, see for instance [35,Chapter 8] and [133].

Algorithms without gradients
The method of successive approximations and its extended version have been proposed for deep learning [88,89], but potentially more development is needed to fully exploit this direction including stochastic updates with respect to the data points and efficient maximisation of the Hamiltonian. The method of successive approximations exploits the structure of deep learning only to the point of optimal control (4.2) but is generic in terms of the architecture, e.g. the choice of f . For discretised system, a more tailored algorithm has been proposed in [119] but without any theoretical guarantees on its convergence.

Architectures, rates and topologies for deep limits
The question if the ResNet converges with increasing number of layers was satisfactorily answered in [121], but these results still leave a number of questions unanswered. First, do other architectures also have deep limits? The most likely candidates here are discretisations of ODEs.

E. Celledoni et al.
Second, are these results tightly linked to H 1 -regularisaton or can they be extended to other topologies, e.g. the one induced by the total variation? Third, are there convergence rates for these limits?
Another work on the convergence of discretised ODEs with finer discretisations is [61]. In particular this work not only proves convergence but also convergence rates. It utilises assumptions on smoothness and coercivity of certain Hessians but it is not clear if these are met in a deep learning context. It would be interesting to verify or falsify the assumptions for relevant learning problems. In case these are not met, a different architecture or regularisation might be a way out.

Equivariant neural networks
In recent years, the use of convolutional architectures in deep neural networks [84] for imaging tasks in machine learning has proven to be an extremely fruitful idea. A particularly well-known example of this is the work of Krizhevksy et al. [82], where deep convolutional neural networks (CNNs) were used to achieve state-of-the-art performance in the ImageNet contest (a challenging image classification task), outperforming other approaches by a large margin. Another striking example of the power of CNNs is given by Ulyanov et al. [124], where it is shown that even an untrained CNN can be used as an effective prior for natural images. It is commonly understood that the effectiveness of CNNs in imaging tasks is in large part due to them being in some sense right for the problems at hand. CNNs combine the flexibility of neural networks (in the form of many learnable filters) with the known symmetries of images: both convolutions and pointwise non-linearities commute with translations of the underlying domain, so that the outputs of a CNN transform in a predictable way when their inputs are translated. By constraining the search space in a principled way (in theory, fully connected networks are at least as expressive as CNNs), CNNs can make efficient use of training data to learn to perform tasks to a higher standard than fully connected networks and it is easier to interpret the action of a CNN on its inputs than it is do the same for a fully connected network.
Given the success of CNNs in machine learning, it is natural to ask whether the concept of convolutional architectures can be generalised to incorporate other symmetries into neural network architectures. One current line of research in this direction is the study of group equivariant neural networks, which has been gaining considerable traction since the work on group convolutional neural networks by Cohen and Welling [33]. A neural network can be thought of as a function taking inputs from a feature space X 1 and returning outputs from a feature space X 2 . We will call a function F : X 1 → X 2 G-equivariant if there are group actions of G on X 1 and X 2 (denoted by T X 1 and T X 2 respectively to emphasise that the group actions need not be of the same nature) such that To elucidate this definition, let us note some examples of behaviour covered by it: • if G acts trivially on X 2 , i.e.T X 2 g = id, we recover invariance of F to group transformations of its input, which is often a desirable property of an image classifier; • whereas if X 1 = X 2 and T X 1 = T X 2 , the output of F transforms in exactly the same way as the input does, which is a useful property to have in many image-to-image tasks such as segmentation.
If we are given two G-equivariant functions F 1 : X 1 → X 2 , F 2 : X 2 → X 3 (with the same action of G on X 2 for both functions), their composition is easily seen to be G-equivariant: Appealing to this result and noting the usual structure of a neural network as a composition of an alternating sequence of affine maps and non-linearities, there is a promising way of designing G-equivariant neural networks: design G-equivariant linear maps and G-equivariant nonlinearities, add biases as appropriate to get affine maps from the linear maps and compose the affine maps and non-linearities as you would in an ordinary neural network.
As it turns out, when the group actions considered above are in fact group representations (i.e. they act linearly), the problem of finding and characterising G-equivariant linear maps reduces to the well-studied problem of finding and characterising intertwiners in representation theory. This insight has recently been used to unify existing approaches to G-equivariant neural networks and to show that G-equivariant linear maps necessarily take the form of a type of group convolution [31]. Let us describe this work in some more detail here.

Homogeneous spaces
It has been observed [31,81] that there is a common setting unifying many of the existing approaches to G-equivariant neural networks. We are given a topological group G, which acts continuously and transitively on a domain X (X is a so-called homogeneous space of G). With this assumption we can cover the cases where X = R d and G = SE(d) := R d SO(d), the group of rotations and translations, and where X = S d−1 and G = SO(d), which are two commonly studied cases. Fixing a point p ∈ X as the origin and denoting by G p = {g ∈ G | gp = p} the stabiliser of p, we can identify X with the quotient space G/G p : since G acts transitively on X , for any q ∈ X there is a g ∈ G such that gp = q. On the other hand if g 1 p = g 2 p, then g −1 2 g 1 p = p, so that g −1 2 g 1 ∈ G p , or g 2 G p = g 1 G p . Hence, there is a well-defined bijective map X → G/G p mapping q to gG p , where g ∈ G is such that gq = p. Conversely, given a closed subgroup H < G, G acts transitively on the quotient space G/H by left multiplication, making it into a homogeneous space of G. From this reasoning, we see that the homogeneous spaces of G can be identified precisely with the quotient spaces G/H as H ranges over closed subgroups of G. Henceforth, we will consider two arbitrary subgroups H 1 , H 2 < G and the associated homogeneous spaces G/H 1 , G/H 2 .

Equivariant linear maps
Scalar-valued features on these spaces can be modelled as functions G/H i → R and these can be arbitrarily stacked to get feature maps x : G/H i → R C that transform under the action of G according to [π 1 (g)x](p) = x(g −1 p), but in this setting one can also consider more general features: given a representation (V i , ρ i ) of H i , we can consider signals to be fields of V i -valued features on G/H i , which are strictly more general than the stacks of scalar-valued fields since their components are mixed under the action of G. This is mathematically formalised by noting that G is a principal H i -bundle, constructing from this the associated vector bundle P i with fibre space V i , the sections of which, (P i ), are the signals of interest. Under the 918 E. Celledoni et al. action of the group G, these signals naturally transform according to the representation of G induced by H, π G H i . To put this in the notation of (5.1), we are taking X 1 = (P 1 ), X 2 = (P 2 ) and T X 1 g = π G H 1 (g), T X 2 g = π G H 2 (g) and we are asking what the general form is of a linear map F : X 1 → X 2 that satisfies (5.1) in this case. There are multiple ways in which (P i ) and π G H i can be modelled, but for this purpose it is easiest to consider Mackey functions: we identify x ∈ (P i ) with x : G → V i satisfying x(gh) = ρ i (h −1 )x(g) for all h ∈ H i , in which case the induced representation is just given by [π G H i (g)x](g ) = x(g −1 g ). With this notation in place, the following result on equivariant linear maps can be proved (see also Theorem 6.1 in [31]): that G is a locally compact unimodular group and that H 1 , H 2 < G are closed subgroups. Suppose that F : X 1 → X 2 is an integral operator (with respect to the Haar measure on G): Then F is equivariant, in the sense that if and only if F is given by convolution with a kernel k : satisfying the additional constraint k(h 2 gh 1 ) = ρ 2 (h 2 )k(g)ρ 1 (h 1 ) ∀h 1 ∈ H 1 , g ∈ G, h 2 ∈ H 2 .
Derivation. As alluded to in the statement of the theorem, given the assumptions on G, there is a unique (up to multiplication by a scalar constant) Haar measure on G that is both left invariant and right invariant. Writing F as integration against a kernel K : G × G → Hom(V 1 , V 2 ), the equivariance condition tells us that The final expression on the right-hand side can be rewritten using left invariance of the measure to give G (K(g −1 g, g ) − K(g, g g ))x(g ) dg = 0 ∀x ∈ (P 1 ), which is the case if and only if K(g −1 g, g ) = K(g, g g ) ∀g, g , g ∈ G.
Structure-preserving deep learning 919 This final condition tells us that K(g, g ) = K(e, g −1 g ) =: k(g −1 g ), so we find that F is equivariant and only if it is given by a convolution-type operation: Since we have assumed that F(x) ∈ (P 2 ) is a Mackey function, we must have Hence On the other hand, x ∈ (P 1 ) is also a Mackey function, so for any Here we have used right invariance of the measure. This implies that k(gh 1 ) = k(g)ρ 1 (h 1 ) for h 1 ∈ H 1 , g ∈ G. Taking together the above results, the condition for equivariance of an integral operator F : (P 1 ) → (P 2 ) is that we perform a convolution-type operation against a kernel k satisfying the linear constraints k(h 2 gh 1 ) = ρ 2 (h 2 )k(g)ρ 1 (h 1 ) ∀h 1 ∈ H 1 , g ∈ G, h 2 ∈ H 2 .

Note 5.2
The assumptions on the group G are immediately satisfied for compact groups such as SO (3), which is the case that is considered in rotationally equivariant neural networks for spherical image data [32,51,80] (if G = SO(3) and H i = SO(2), then S 2 = G/H i ). It is not necessary for G to be compact though; much of the work on equivariant neural networks has been focused on the case where G is a group of rigid motions of a Euclidean space [33,34,128,129,132]. In that case G = T R where T = R d or T = Z d and R is a group of rotations and reflections (such as SO(d), O(d) or finite subgroups thereof). With any of these choices G is a locally compact unimodular group.
The constraints on the convolution kernels stated in Theorem 5.1 can be solved once the type of features V i have been chosen (so before training time) to find a basis for the convolution kernels that give rise to equivariant linear maps, and at training time we can learn equivariant linear maps by learning the parameters of an expansion in this basis. 920 E. Celledoni et al.

Equivariant non-linearities
Having established the general form for a large class of equivariant linear maps, the question remains how to choose equivariant non-linearities. If we are to propose a non-linearity F : (P 1 ) → (P 1 ), we cannot in general just apply a pointwise non-linearity as in ordinary neural networks: this will only work if the chosen representation of H 1 does not mix components, as is the case for instance if the representation of H 1 is trivial (which is always the case if H 1 = {e}) or if the regular representation of H 1 is chosen [129]. One way to make this work, in general, is by having the first layer of the network be a linear layer mapping feature maps defined on the base domain X = G/H 1 to feature maps defined on the group G = G/{e} and letting all feature maps after that be defined on the group [33,11]. If the chosen representation does not work well with pointwise non-linearities, another thing that can be done is to take the pointwise norm of the feature map (which is a scalar-valued feature map), pass it through a pointwise non-linearity, and multiply this by the feature map: F(x)(p) = f ( x(p) )x(p), as is done for instance in [132,121]. Note that f could include a learnable parameter such as a bias parameter. Another option is to combine features of different types in a tensor product [80], which is particularly convenient when working in Fourier space: in Fourier space the convolution operation becomes a pointwise multiplication, but it is not immediately obvious how to apply equivariant non-linearities, so other methods performing the convolution in Fourier space have to transform back to 'real' space (which is computationally expensive) before applying the non-linearity [32,51].

A numerical demonstration of the use of equivariant neural networks
In this section, let us consider an example that demonstrates some of the advantages that can be gained by using equivariant neural networks. We will use a supervised learning approach to learn a denoiser: we assume that we are given pairs of random variables representing clean and noisy images (x * , x) and attempt to solve the following optimisation problem: Here, C is a class of functions, λ 0 is a small constant, ∇ is a finite difference image gradient operator and · 1,ε is a smoothed version of the 1-norm. In this specific experiment, we let the clean images x * be of size 60 × 60, containing random rectangles with sides aligned to the grid and with random colours and we let the noisy images be generated from clean images by adding Gaussian white noise (as shown in the top right row of Figure 6). It is natural to ask for the denoiser to commute with rotations and translations, that is, for the denoiser to be equivariant with respect to rotations and translations. We compare two choices of C in Problem (5.3), which we will refer to as the class of ordinary and equivariant denoisers respectively. In both cases, the functions in C the form of a ResNet (as defined in (1.5)) preceded by a learnable lifting layer and succeeded by a learnable projection layer. The distinction between the ordinary and equivariant denoisers, is that the ordinary denoiser uses ordinary convolution operations and a pointwise ReLU non-linearity (resulting in translation equivariance), whereas the equivariant denoiser uses the rotation and translation-equivariant versions of these operations as defined in [132]. To give a fair comparison, we fix the same width and depth in both classes. In this case, the equivariant denoiser has a number of degrees of freedom that is less than 10% of the number of degrees of freedom of the ordinary denoiser (62,994 vs. 741,891). The denoisers are FIGURE 6. A demonstration of the use of equivariance in a denoising task. On the left, we have plotted training errors for five runs of each denoiser. On the right, we have displayed the outputs of the denoisers on an example (the 'on-grid' example) similar to the ones used for training and on a rotated version.
trained by performing 1500 iterations of Adam [77] on minibatches of size 64, decreasing the step size when validation error stagnates, and for each denoiser, we perform 5 training runs.
The results are shown in Figure 6: as we see in the plot of the training errors, the equivariant denoiser consistently converges in fewer iterations than the ordinary denoiser and achieves a better objective function value. Besides this, we can be confident that the equivariant denoiser will generalise to rotated examples despite not having seen them in training, whereas it is hard to say anything about how the ordinary denoiser generalises to rotated examples.

Sample complexity bounds
One of the main motivations that is given for the use of equivariant neural networks is that they should be able to use training data more efficiently than neural networks that are not designed to be equivariant. This is particularly important in applications in which training data are in short supply, as is often the case in inverse problems [6]. In the machine learning community, the number of samples needed to estimate a given object is known as the sample complexity. There is recent work [42] establishing sample complexity bounds for some simple CNN models, and showing that these bounds compare favourably to the corresponding ones for fully connected networks. In a similar vein, it would be interesting to establish rigorous sample complexity bounds for equivariant neural network models that guarantee their data efficiency.

Approximation properties
When applying existing group equivariant neural network architectures as in the framework described above, there are a large number of choices that need to be made: the domains on which the feature maps are defined, the choices of the types of features (the representation of H that is used), the choice of non-linearity. While there is a vast amount of literature on the approximation properties for ordinary neural networks (including older works such as [37,69] and some more recent works that apply to CNNs [15,108]), there is not yet the same theoretical guidance on how the choices of the various aspects of group equivariant neural networks can be made to guarantee that the networks are sufficiently expressive. There is some theoretical work 922 E. Celledoni et al.
on hypothetical equivariant neural networks and approximation results relating to them [136], but it is not yet of practical use in choosing an equivariant architecture.

Approximate equivariance
Many of the symmetries we would like to work with in equivariant neural networks are continuous, but when we implement them in practice it is necessary to discretise them. Generally in the literature, one of two approaches is taken: the group is discretised and exact group convolutions are taken with respect to the discrete subgroup (in which case we have exact equivariance to the discrete symmetry), or the group convolutions are derived in the continuous setting and eventually discretised (so that we have approximate equivariance to the continuous symmetry). In either case, it has not been studied in detail how much of an error we make when we make these discretisations and whether there is an optimal discretisation to use. It would be of great interest to provide bounds on the equivariance error, as these could be used to decide, for example, whether a theoretically invariant classifier is actually invariant in practice.

Structure-exploiting learning
The training of neural networks amounts to the numerical optimisation of typically smooth, but high-dimensional and highly non-linear objectives as in (1.4), and as discussed in the optimal control framework in Section 4. The most widely used numerical method for (1.4) is Adam [77]. As before let : X × → X be a (deep) neural network that depends on the data and the parameters θ ∈ . The training of amounts to the optimisation over the parameters θ with respect to a loss function as in (1.4). While the loss function L usually is a convex function, the dependence of the parameter θ on the network is in general highly non-linear which makes the overall optimisation problem in (1.4) non-convex. For what follows, let us denote by L j (θ ) = 1 |S j | n∈S j (L(η • (x n , θ ), y n ) + 1 N R(θ )), for j = 0, 1, . . ., where S j is a randomly chosen set of indices from the N training samples. For fixed positive parameters α and β 1 , β 2 < 1, and for appropriate initialisations for the parameters and the auxiliary moment functions m 0 and v 0 , Adam amounts to the following iteration: for a small ε > 0 and where √ v j is taken component-wise. Stochasticity, in the form of choosing subsets S j randomly in every iteration, is crucial to deal with high dimensionality of the problem.
Adam is being used for almost all of neural network training because of its easy implementation, its robustness to rescaling, its computational efficiency and small memory requirements and for its suitability for problems which are large-scale in terms of training data and parameters. On the other hand, its theoretical convergence properties do in general not guarantee convergence to a solution of (1.4), cf. [112] where the authors propose a convergent variant in the convex setting and [137], which provides convergence guarantees in the non-convex and stochastic setting.
The literature for optimising smooth (non-convex) objectives is, however, much richer than Adam alone. It is tightly linked to developments in convex analysis and operations research, as well as the numerical discretisation of dynamical systems, ODEs and PDEs as discussed in parts in Sections 2 and 4. In the context of neural network training other optimisation schemes have also been investigated recently. Here, we will mainly focus on those that have some structurepreserving property such as Hamiltonian descent [95,105], which are guaranteed to dissipate a Hamiltonian energy, optimisation approaches when the features or network parameters are elements in a Riemannian rather than Euclidean space [1,87,123], and -as a special case of the latter -information geometry approaches that describe optimisation on statistical manifolds [4].

Conformal Hamiltonian systems
The most classic approach for minimising E(·) over R L is gradient descent, i.e. to seek a stationary point of E by evolvingθ Several optimisation methods for E can then be derived through different discretisations of (4.6), with the simplest one being explicit gradient descent and its stochastic versions [114]. Another route for the derivation of optimisers for E is obtained by replacing the gradient flow (6.2) with a Hamiltonian flow as in (1.10) with dissipation, for example, a conformal Hamitonian system, i.e.
with γ , μ > 0. Rewriting the system (6.3) in one equation gives which is gradient descent accelerated by momentum, cf. also [118,90]. More generally, (6.3) is a special case of a conformal Hamiltonian system of the forṁ  While the gradient system (6.2) dissipates E, for a conformal Hamiltonian system (6.4) the Hamiltonian H is dissipated as For separable Hamiltonians with the kinetic energy T chosen so it has a global minimiser in 0, limit points of (6.4) recover stationary points of E. More precisely, for H = T(p) + E(θ ) being separable the equilibria of (6.4) fulfil If T is chosen so it has a unique global minimum in p * = 0 (e.g. in the case of (6.3)), we have that (0, θ * ) is a zero of ∇H(p, θ ) if and only if u * is a zero of ∇E(θ ), i.e. θ * is a stationary point of E(θ ). Moreover, we have that (0, θ * ) is the solution of the conformal Hamiltonian system as t → ∞. Then, using a numerical integration method that preserves the property of Hamiltonian energy dissipation (such a numerical method is called conformal symplectic scheme [13]), we get an approximation of (θ * , 0), where (θ * , 0) is an equilibrium of (6.4) and θ * a stationary point of E [99]. In recent work, Frana et al. [53] take advantage of the connection between optimisation schemes for E and conformal symplectic Hamiltonian schemes for H = T + E for the design of new optimisation approaches for E -similar to the connections we have seen before, e.g. between Adam and conformal Hamiltonian descent. See also Figure 7 for a comparison between different optimisation methods applied to the camelback function.
Structure-preserving deep learning 925

Learning in Riemannian metric spaces
After identifying the dissipation of an appropriate Hamiltonian as a structure worth preserving for numerical optimisation of neural networks, as discussed in the last section, explicit conditions on the parameter matrices themselves, such as orthogonality, seem to impose structure on the optimisation that can be of advantage and in particular lead to better (at least empirical) convergence rates, better generalisability and accuracy [87]. Such conditions usually pose a parameter optimisation problem within a Riemannian metric space rather than Euclidean space. Moreover, there are several important applications, cf. in particular the next subsection, where the training data and the parameters naturally lie in a Riemannian space.
While in Section 2.3 we discussed the case when the training data and features lie on a Riemannian manifold, in this section, we consider the setting in which the parameters belong to a manifold. This in particular means that ordinary gradient descent a la (6.2), i.e. with the gradient ∇ being the ordinary gradient in R L , does in general not describe steepest descent of E in such a Riemannian metric space. Instead descent with respect to a metric-induced gradient needs to be considered [5]. In what follows, we consider some representative examples of how Riemannian optimisation can arise in the context of deep learning.

Network parameters evolving on manifolds
The parameters to be learned evolve on a manifold when additional structure or constraints are imposed on the parameters. This is done to improve stability of the training algorithm and of the trained model. Pioneering work advocating the use of Riemannian gradient and orthogonality constraints can be found in [2,3,17] and there is an extensive follow-up literature. Examples of this procedure in the context of deep learning have recently appeared in [91] and earlier in the context of CNNs in [8,28]. It is crucial to implement the Riemannian gradient descent efficiently, making good use of the tensor structure of the data and of the layers of the neural network. One way to proceed is to introduce an evolution equation for θ and replace (1.7) by the extended system of ODEsż = f (z, θ ), (6.6) θ = g(θ ). (6.7) Of particular importance is the case where the second equation evolves on a Lie group G or on a homogeneous manifold M = G/H. 4 In [3] G is simply the general linear group, while orthogonality constraints have been adopted in [71] in the context of independent component analysis.
Other concrete examples that we have in mind include the affine group, the special Euclidean group SE(n), the special orthogonal group SO(n), the Stiefel manifold V n,p = SO(n)/SO(n − p) and the Grassmann manifold G n,p = SO(n)/(SO(n − p) × SO(p)) including the n-sphere. For matrix Lie groups G (and homogeneous manifolds) (6.7) can simply take the form of a linear matrix differential equationθ Celledoni et al. and M(t) ∈ g, and the optimisation can be performed in terms of the variable M. To preserve the manifold structure one can consider a discretisation of this equation of the form where exp : g → G is the matrix exponential. This discretisation can be seen as the prototype for constructing local coordinates 5 on the manifold of parameters. Alternatively, any approximation of the exponential map φ ≈ exp (e.g. by a rational approximant) preserving the property φ : g → G, can be used to construct local parametrisations of the manifold 6 . On homogeneous manifolds such as V n,p and G n,p additional structure on M(t) can be exploited to reduce the computational cost of matrix exponentials and similar mappings [20]. After time discretisation, this approach guarantees that the parameters at each layer of the network belong to manifolds which are all naturally equipped with Riemannian metrics and some of which are compact. In particular, a metric on G that is H-right invariant (i.e. the right multiplication R h with h ∈ H is an isometry) descends to a Riemannian metric on M = G/H [54]. Gradient descent techniques to train the network should then exploit the Riemannian structure [2,91]. A sufficient condition for convergence of Riemannian gradient descent is that the manifolds are geodesically complete [130,123]. All Riemannian homogeneous manifolds as well as compact Riemannian manifolds are geodesically complete [79,IV.4].

Information geometry
A special case of the Riemannian structure discussed in the previous paragraph arises when taking into account the inherent statistical properties of the underlying unknown distributions of training pairs and, connected to this, statistical properties of the network parameters θ .
Treating the parameters θ as probability distributions, they can be modelled as elements on a statistical manifold with an appropriate metric. A statistical manifold is a Riemannian manifold whose points correspond to probability distributions. Gradient descent on statistical manifolds is studied in information geometry. Here, the so-called natural gradient is the proposed notion of gradient on a statistical manifold [2]. For an L-dimensional parameter space, equipped with a Riemannian metric tensor G = G(θ ) = (g ij (θ )) ∈ R L×L that depends on θ , the natural gradient of E(θ ) is defined as∇ where ∇E(θ ) denotes the ordinary gradient of E in R L . The natural metric considered in this context is the Fisher information, with G being the Fisher information matrix of the parameters θ . Natural gradient descent then readsθ = −G −1 (θ )∇E(θ ). (x n , θ ) − y n 2 a squared error loss and G being the Fisher information matrix, we have where J n is the Jacobian of (x n , θ ) with respect to θ , cf. [96]. Note that in this case natural gradient descent, discretised with forward Euler, is equivalent to Gauss-Newton iteration [104]. This connection between natural gradient descent and (extended) Gauss-Newton methods can be extended to more general losses as well [107]. In the continuum limit, (6.8) is a gradient flow with respect to the Fisher-Rao metric, see [102,Definition 3.1]. Several works [2,4,107] have demonstrated advantages of using the natural gradient over the ordinary gradient in (6.2) for neural network training. The Riemannian structure seems to guard against the gradient descent being trapped in flat areas of the loss function's surface [134], and as a result, the network to feature better generalisation capabilities, cf. [66,76] for different characterisations of flatness of the loss function's surface.

Optimisation of two-layer ReLU neural networks as Wasserstein gradient flows
Continuing our discussion on the mathematical structure of training deep neural networks, we now consider using such structure to analyse optimality and generalisation properties of trained networks. More precisely, the inherent gradient flow structure of training neural networks also appears when studying global optimality and generalisation properties of trained networks. Bach and Chizat [26] pick up the gradient flow formulation of neural network learning and study convergence of the learning problem (1.4) to a global minimiser of E for two-layer ReLU neural networks in the ∞-width limit. In particular, their analysis makes use of the structure of a Wasserstein gradient flow formulation of (6.2) over the space of probability measures in the ∞-width network limit. Proving convergence to a global minimiser of E also allows the study of optimal generalisation capabilities of the trained two-layer ReLU neural network by characterising the limit (for a certain class of loss functions with exponential tails) as a max-margin classifier [27].
In [26] they consider a two-layer neural network of the form where f (θ j , x) = c j max{a t j x + b j , 0}, for x ∈ R M and θ j = (a j , b j , c j ) ∈ R M+2 . The weights θ in (6.9) are learned by minimising a loss function such as (1.4) with optional regularisation R(θ ) = λ L L j=1 θ j 2 2 . In this setting, they investigate the performance (generalisation capabilities) of the learned network by studying the associated gradient flow of the loss function E for initial weights θ (0) = θ 0 ∼ i.i.d. μ 0 ∈ P 2 (R n+1 )θ = −m∇E(θ , x).
Note that this gradient flow formulation of neural network training also has connections to several interesting works on the mean-field theory of neural networks, see for instance [48].

Port-Hamiltonian optimisation methods
Generally, investigating new optimisation methods for E by considering different instances of Hamiltonian descent is a promising research direction. Here, different choices of Hamiltonians or special cases of Hamiltonian systems might be advantageous for classes of loss functions and network architectures, imposing different descent dynamics. For example, interesting schemes arise when symplectic numerical integration is applied to port-Hamiltonian systems (with different Hamiltonians), cf. [125] for an introduction to port-Hamiltonian systems and [21] for the development of structure-preserving numerical integrators for port-Hamiltonian systems by using discrete gradient and splitting approaches. Massaroli et al. [98] design a loss function and parameter optimisation dynamics of the network in such a way that the neural network itself behaves like an autonomous port-Hamiltonian system. This, in turn, allows a proof of convergence of the optimisation algorithm to a minimum of the loss. Taking this a step further, port-Hamiltonian systems also lend themselves to the design of locally adaptive optimisation schemes as the port-Hamiltonian structure is preserved under concatenation of port-Hamiltonians with orthogonal input-output relation.

Convergence analysis for natural gradient optimisation
While several papers seem to suggest (mostly empirically or for linear networks) that natural gradient descent helps to mitigate nuisance curvature in the neural network parameter optimisation, very little seems to have been done on this theoretically, in particular for non-linear neural networks [138]. In general, while for particular choices of the Riemannian metric (6.8) boils down to classical optimisation schemes, such as Gauss-Newton for G being the Fisher information metric, analytic results on convergence properties of natural gradient flow discretisations are generally open. Indeed, the study of (6.8) for different choices of G could be very interesting. For instance, if G is a BFGS approximation to the Hessian of E, then a stochastic method was proposed and analysed by Wang et al. [127].

Studying generalisation properties of neural networks by metric gradient flows
As the example of the Wasserstein gradient flow in Section 6.3 has shown, metric gradient flows can serve as a useful tool for studying convergence of the network training and for the study of generalisation properties of the minimisers [26,27]. It would be interesting to see if other metric gradient flows would also lend themselves to such an analysis. In connection with information geometry in Section 6.2.2 we have seen the gradient flow with respect to the Fisher-Rao metric appearing. Could this be used to study convergence properties for other network architectures, beyond two-layer ReLU? Are there other metrics that could be interesting to investigate for that purpose?

Conclusion
Structure-preserving approaches to deep learning are a means to design neural networks with guaranteed mathematical properties, to derive optimisation schemes that improve their training and to analyse their global optimality and generalisation capabilities. In this paper, we discuss some recent examples from this emerging topic of structure-preserving deep learning. These include ODE and PDE parametrisations of neural networks for improved stability properties, an optimal control formulation for neural network training that gives rise to systematic training and regularisation procedures, invertible neural networks for large-scale deep learning, equivariant neural network architectures for the design of neural networks that preserve group transformations and structure-preserving training of neural networks by means of Hamiltonian descent and Riemannian metric gradient flows. Together with the discussion of state-of-the-art results, we also suggest a range of open problems that we identified as interesting mathematical avenues that could help to shed light on the systematic design and training of deep neural networks.