1. Introduction
In this work, we propose a novel technique for numerically approximating the solution of a class of McKean–Vlasov stochastic differential equations (MV-SDEs) using stochastic gradient descent (SGD). We cast the solution of the MV-SDE as the fixed point of a certain functional, which we then find by minimizing a related objective function. As the natural domain of this problem is infinite-dimensional, we look for an approximation to the true solution in a finite-dimensional domain via SGD. We show that the approximation in the finite-dimensional domain approaches the true solution in the limit.
MV-SDEs play an important role in a variety of fields, including, but not limited to, kinetic theory of gases and plasmas [Reference Cattiaux, Guillin and Malrieu12, Reference Malrieu38], chemotaxis modelling [Reference Suzuki45], population dynamics including crowd [Reference Maury and Faure39] and pedestrian [Reference Mahato, Klar and Tiwari37] dynamics, machine learning (ML) [Reference Carmona and Laurière10, Reference Carmona and Laurière11], finance and optimal control [Reference Carmona and Delarue9], quantum mechanics [Reference Golse, Mouhot and Paul27], and mean-field games [Reference Carmona and Delarue9]. Typically, the solution of an MV-SDE is approximated via an
$N_0$
-dimensional interacting particle system (IPS) and it is well-known, via the so-called ‘propagation of chaos’ results (e.g. [Reference Méléard40, Reference Sznitman46]), that the empirical law of the IPS converges to the law of the MV-SDE as
$N_0 \to \infty$
. Numerically, one can then use time-discretization and simulation to obtain accurate approximate solutions of the MV-SDE via the IPS ([Reference Antonelli and Kohatsu-Higa2, Reference Bossy and Talay8, Reference Talay and Vaillant48]). The applicability of this approach has been recently extended across a variety of settings under varying regularity and growth assumptions on its coefficients; see, for example, [Reference Biswas and Kumar7, Reference Chen and Dos Reis17–Reference Chen and Dos Reis20, Reference Haji-Ali and Tempone28, Reference Soni, Neelima and Dos Reis42, Reference Sun, Yang and Zhao44, Reference Yuanping, Xiaoyue and Yi50].
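As a concrete illustration of this classical route (the coefficients below are our own illustrative choices, not a model from the cited references), an Euler time-stepping of a one-dimensional N-particle system, in which the empirical mean replaces the expectation, can be sketched in Python as follows.

```python
import numpy as np

# Euler scheme for an N-particle approximation of a 1-D MV-SDE of the form
#   dX_t = E[phi(X_t)] alpha(X_t) dt + sigma dW_t,
# with illustrative choices phi = sin, alpha = cos, constant sigma.
# The empirical mean over the particles replaces the expectation E[phi(X_t)].

def simulate_ips(n_particles=5_000, n_steps=200, T=1.0, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = rng.normal(0.0, 1.0, n_particles)       # X_0 ~ N(0, 1)
    for _ in range(n_steps):
        mean_phi = np.mean(np.sin(X))           # empirical E[phi(X_t)]
        drift = mean_phi * np.cos(X)            # interaction enters via mean_phi
        X = X + drift * dt + sigma * rng.normal(0.0, np.sqrt(dt), n_particles)
    return X                                    # particle cloud at time T
```

The cost per step is O(N) here only because the interaction is of scalar mean-field type; for a general interaction kernel each step costs O(N²), which is one source of the computational expense discussed in the next paragraph.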
Even though the ‘IPS’ approach is very powerful and can be applied to many different classes of MV-SDEs, depending on the type of interaction kernel, the computational cost of simulating an IPS can be high. As such, there have been many other works in the literature exploring alternative techniques for numerically approximating the solution of an MV-SDE. In [Reference Szpruch, Tan and Tse47], an iterative multilevel Monte Carlo scheme for creating a particle system is combined with Picard iteration to reduce the computational cost. In [Reference Gobet and Pagliarani25], the technique of analytical expansions is used to derive accurate approximations of the density of the solution of an MV-SDE with elliptic constant diffusion coefficient and bounded drift. In another approach, [Reference Baladron, Fasoli, Faugeras and Touboul3, Reference Cazacu13, Reference Goddard, Gooding, Short and Pavliotis26] use the Fokker–Planck equation representation of the density of the solution to apply numerical techniques developed for partial differential equations (PDEs). In [Reference Agarwal and Pagliarani1], the authors consider a class of MV-SDEs for which the solution can be readily computed using the Picard iteration. They use a Fourier-transform approach to solve a fixed-point equation, which also allows them to incorporate Lévy jumps in their analysis in a relatively straightforward manner. In [Reference Hutzenthaler, Kruse and Nguyen32], the authors use a multilevel Picard iteration to develop numerical estimates of the solution of a class of MV-SDEs in high dimensions. They are able to prove that their numerical approach is computationally efficient and, unlike the IPS approach, avoids the curse of dimensionality.
In a way, many of the methods mentioned are of decoupling-type: in a first step the measure component (or a functional thereof) is approximated, said approximation is plugged back into the McKean–Vlasov equation’s dynamics yielding an SDE that is then approximated in the usual fashion – such approaches have been used for importance sampling [21] or to deploy Koopman transfer operator approaches to the nonlinear semi-group operator of MV-SDEs [Reference Ioannou, Klus and Dos Reis33]. In [Reference Chaudru de Raynal and Garcia Trillos15] the authors developed a cubature-method-based approach to solve decoupled forward–backward MV-SDEs. Noteworthy of mention is the recent work of [Reference Chassagneux and Pagès14] in the context of ergodic simulation of the invariant distribution of McKean–Vlasov equations – a setting we do not explore. There, one replaces the measure argument in the coefficients of the MV-SDE by the process’ occupation (empirical) measure. This transforms the MV-SDE into a self-interacting diffusion that has full history dependence in its coefficients. The methodology remains then single-trajectory (no interacting particles) and thus avoids the classical particle-system propagation-of-chaos costs.
Recently, there have been tremendous developments in using machine-learning techniques to efficiently solve nonlinear PDEs in high dimensions owing to the seminal work [Reference Han and Jentzen29]. Typically in this approach, an appropriate approximation of the forward process related to the nonlinear PDE and an approximation of the solution in terms of its gradient are provided as inputs to a feedforward neural network. The network is then trained to learn the mapping between the forward process and the gradient of the solution using an SGD method, which minimizes a suitable loss function based on the terminal condition of the nonlinear PDE. It has been proven under different settings of nonlinear PDEs (for example, [Reference Hutzenthaler, Jentzen, Kruse, Anh Nguyen and von Wurstemberger30, Reference Hutzenthaler, Jentzen, Kruse and Nguyen31]) that the machine-learning-based method reliant on SGD avoids the curse of dimensionality. This approach of using machine learning to estimate the solution of a nonlinear PDE has also been successfully extended to solve McKean–Vlasov stochastic control problems; see, for example, [Reference Carmona and Laurière10, Reference Carmona and Laurière11]. Our proposed approach to solve an MV-SDE using SGD is inspired by these aforementioned works (and the references therein). We do not rely on a neural network to learn the solution but instead use SGD to learn optimal weights in a polynomial basis function representation of the unknown functions in the considered MV-SDE. The particular class of MV-SDE we consider in this work (2) allows us to develop a more targeted approach, which is naturally more efficient than a general neural-network-based estimation approach.
For clarity of exposition, we work with a class of MV-SDEs in which the coefficients are separable with respect to their dependence on the measure and state of the MV-SDE, namely
\begin{equation} \textrm{d} X_t = \sum_{j=1}^K \Big( \mathrm{E}[\varphi_{1,j}(X_t)] \alpha_j(t,X_t) \, \textrm{d} t + \mathrm{E}[ \varphi_{2,j}(X_t)] \beta_{j}(t,X_t) \, \textrm{d} W_t \Big), \quad X_0=\xi.\end{equation}
Based on the recent work [Reference Belomestny and Schoenmakers5], a wide class of coefficients can be projected on a Fourier basis giving rise to the class of MV-SDEs in (1). We show that the unknown functions of time
$\bar{\gamma}_{1,j}(t)\,:\!=\,\mathrm{E}[\varphi_{1,j}(X_t)]$
and
$\bar{\gamma}_{2,j}(t)\,:\!=\,\mathrm{E}[\varphi_{2,j}(X_t)]$
satisfy a fixed-point equation, which we then pose as a minimization problem over the space of continuous functions. Next, we reformulate the infinite-dimensional minimization problem as a finite-dimensional one by looking for an approximate solution in a suitable finite-dimensional sub-domain, a typical choice being the space of nth-order polynomials. Once a basis for the latter is fixed, the objective function can be parametrized as a function on a Euclidean space. To minimize it we propose an algorithm based on SGD, where the stochastic gradient of the objective function depends on the solution to a Markovian SDE and on its gradient with respect to the parameters. To compute the gradient in SGD, we employ the Euler discretization of the relevant SDEs. To the best of our knowledge, our approach is the first to deploy an SGD-type algorithm to solve MV-SDEs. Moreover, we also provide a ‘mini-batch’ version of our SGD-based algorithm that greatly improves the efficiency of the algorithm.
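The descent step underlying such schemes is the classical stochastic-approximation iteration with decaying step sizes. The following generic sketch illustrates its shape; the gradient oracle's signature, the 1/k step-size rule, and the quadratic test problem are our own illustrative assumptions, not the estimator constructed in this paper.

```python
import numpy as np

# Generic SGD iteration a_{k+1} = a_k - (lr0 / k) * g_k, where g_k is an
# unbiased estimate of the gradient of the objective at a_k.

def sgd(grad_estimate, a0, n_iter=200, lr0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    a = np.asarray(a0, dtype=float).copy()
    for k in range(1, n_iter + 1):
        a -= (lr0 / k) * grad_estimate(a, rng)  # Robbins-Monro decaying step
    return a

# Sanity check on a strongly convex toy problem: minimize |a - 3|^2 with a
# noisy gradient 2(a - 3) + noise; the iterates should settle near 3.
noisy_grad = lambda a, rng: 2.0 * (a - 3.0) + 0.01 * rng.normal(size=a.shape)
```

In the non-convex setting of this paper, such an iteration can only be expected to reach a stationary point, which is exactly the type of guarantee established below.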
In this work, we essentially build a theoretical foundation for the study of the SGD approach applied to the numerical approximation of MV-SDEs. We provide a convergence study and establish the related properties for such a purpose. Concretely, the convergence of our method is reliant on stability estimates of the related Markovian SDE and tangent processes, including a general interest result on the higher-order time-regularity property of the maps
$t \mapsto \mathrm{E}[\varphi_{\cdot,\cdot}(X_t)]$
. The moment estimates and Lipschitz property of the tangent process with respect to the unknown function also allow us to establish an explicit formula for the gradient of the loss function. Moreover, by performing a suitable penalization of the loss function, we are able to obtain Lipschitz continuity of its gradient. These intermediate results allow us to use the existing convergence results of descent methods by [Reference Bertsekas and Tsitsiklis6, Reference Patel and Zhang41] to show convergence of our algorithm to a stationary point. As our minimization problem is non-convex, classical results that ensure convergence of SGD algorithms to the global minimum do not apply to our setting. Instead, we rely on the developments in [Reference Bertsekas and Tsitsiklis6, Reference Patel and Zhang41] for the convergence of SGD algorithms to stationary points in non-convex settings. We observe that the study of gradient-descent algorithms is an active research field, including attempts at proving convergence to the global minimum in the non-convex setting. For instance, in [Reference Fehrman, Gess and Jentzen23], the authors prove global convergence for a particular version of the SGD algorithm that makes use of mini-batches and a randomized initial point, under suitable geometric assumptions on the objective function. Checking these assumptions in our setting would be quite complex; thus, for the purpose of readability, we only verify the assumptions in [Reference Bertsekas and Tsitsiklis6, Reference Patel and Zhang41] to ensure convergence to a stationary point. Through extensive numerical studies, we demonstrate the excellent performance of our SGD algorithm in solving well-studied MV-SDEs. In particular, we measure the accuracy of our method by computing the relative error with respect to benchmarks obtained by the IPS approach, and show that we achieve very low error values with a small computational budget.
Overall, the results of this paper can be viewed as a proof of concept for the application of SGD algorithms as a valid tool to deliver numerical approximations of MV-SDEs without the simulation of IPSs. In order to reduce the technicalities and keep the proofs as transparent as possible, the convergence analysis is performed here in the case of bounded coefficients. However, it could be extended to the case of
$\alpha_j(t,\cdot)$
,
$\beta_j(t,\cdot)$
with sub-linear growth and
$\varphi_{1,j}$
and
$\varphi_{2,j}$
with polynomial growth. We discuss this issue further below (see Remark 2). In particular, the numerical tests presented in Section 4.2 confirm that the approach is viable also when the coefficients
$\varphi_{1,j}$
and
$\varphi_{2,j}$
have quadratic growth.
An extension of this methodology to the case of Hölder continuous drift coefficients seems feasible, up to dealing with some technical issues arising from the fact that the gradient process becomes a diffusion with singular (distributional) drift coefficients. We also expect this approach to be suitable for extensions to the jump-diffusion framework. Finally, an interesting and challenging extension regards adapting this methodology to frameworks where the simulation of the associated IPS is less amenable, e.g. conditional MV-SDEs or MV-SDEs with moderate interaction among others.
The rest of the paper is organized as follows. At the start of Section 2, we provide the main setting of the MV-SDEs we consider and, for the convenience of the reader, Tables 1 and 2 summarize the notation used throughout. We formulate the fixed-point problem for the unknown measure-dependent function in the MV-SDEs and then cast it as a minimization problem over the space of continuous functions. In Section 2.2, we introduce the finite-dimensional approximation of the original domain and provide an upper bound for the error we incur by solving the minimization problem on the reduced domain. We propose our SGD-based algorithm to solve the minimization problem at the beginning of Section 3, where we also provide a ‘mini-batch’ version of it. The convergence proof of the algorithm is presented in Section 3.3 along with the necessary intermediate results. The results from the extensive set of numerical experiments are presented in Section 4. The proofs of all the main theoretical results are relegated to the Appendix.
Table 1. Main notation for spaces and classic operators.

Table 2. Main notation for primary processes and auxiliary functions.

2. McKean–Vlasov SDEs with separable coefficients
Let
$T>0$
be a fixed time horizon. All the SDEs in this article are considered up to time T. We consider the class of
$\mathbb{R}^d$
-valued MV-SDEs in (1), where the interaction through the state is separated from the interaction through the law. Their dynamics can be compactly written as
\begin{equation}\begin{cases} \textrm{d} X_t = \bar{\gamma}(t) \big( \alpha(t,X_t) \, \textrm{d} t + \beta(t,X_t)\, \textrm{d} W_t \big), \quad X_0=\xi,\\ \bar{\gamma}(t)= \mathrm{E}[\varphi(X_t)], \end{cases}\end{equation}
for measurable maps
$\alpha\,:\, [0,T] \times \mathbb{R}^d \to \mathbb{R}^{K\times d}$
,
$\beta\,:\, [0,T] \times \mathbb{R}^d \to \mathbb{R}^{K \times d \times q}$
,
$\varphi \, : \, \mathbb{R}^d \to \mathbb{R}^K $
,
$\xi$
a random variable, and with W being a q-dimensional Brownian motion (column vector). The full probabilistic framework can be found in the Appendix.
Remark 1. The framework (2) of
$\bar{\gamma}$
maps can be extended to the type
$\bar{\gamma}(\!\cdot\!)= h \big( \mathrm{E}[\varphi(X_\cdot)] \big)$
for some invertible and suitably regular function h. We present our results under the setting of (2) only for the sake of simplicity.
Our starting point could also have been an MV-SDE with more general coefficients, to which we could have applied a projection over a Fourier basis (see [Reference Belomestny and Schoenmakers5]), but such an exercise would still have resulted in the class of MV-SDEs considered here. For example, in Section 4.3 we show, following [Reference Belomestny and Schoenmakers5, Section 2], that an MV-SDE with drift coefficient given by
$\mathrm{E}\left[\exp\left(-{(X_t- x)^2}/{2}\right)\right]|_{x=X_t}$
can be approximated by a system in the form of (2), to which we apply our SGD algorithm.
To proceed further, we first present the well-posedness and regularity assumptions.
Assumption 1. The random variable
$\xi\in L^2$
.
Assumption 2. The functions
$\alpha,\beta\in C([0,T]\times \mathbb{R}^d)$
and there exists
$R>0$
such that
\begin{equation}|\alpha(t,x)-\alpha(t,y)| + |\beta(t,x)-\beta(t,y)| + |\varphi(x)-\varphi(y)| \leq R\,|x-y|, \quad t\in[0,T],\ x,y\in\mathbb{R}^d,\end{equation}
and
$\alpha$
,
$\beta$
, and
$\varphi$
are bounded by R.
The following well-posedness result is classical.
Proposition 1. Let Assumptions 1 and 2 hold true. Then, there exists a unique solution
$X = (X_t)_{t\in [0,T]}$
to MV-SDE (2) in
$\mathcal{S}^2_{[0,T]}.$
Furthermore,
$\bar{\gamma}\in C_b([0,T])$
.
The above result follows from [Reference Carmona and Delarue9, Theorem 4.21 (p. 236)]. In particular, the fact that
$\bar{\gamma}$
is bounded and continuous is straightforward as
$\varphi$
is bounded and continuous.
Remark 2. (Growth assumptions.) We expect the conclusion of Proposition 1 to still hold true when replacing the condition in Assumption 2 of boundedness on
$\alpha(t,\cdot)$
and
$\beta(t,\cdot)$
with sub-linear growth, the latter implied by the Lipschitz condition (3). However, we are not aware of this result in the literature. The proof would take advantage of the specific structure of the coefficients, which are separable with respect to the state and measure component. In addition, we expect the result to be true by assuming the function
$\varphi$
only locally Lipschitz continuous with polynomial growth, provided the initial datum has suitably many finite moments.
Remark 3. (Regularity assumptions.) Concerning the well-posedness of MV-SDEs and the moment estimates of its solutions, Lipschitz continuity of the coefficients is a rather standard assumption. The literature on MV-SDEs also contains results for less regular coefficients (see [Reference Chaudru de Raynal16] and the references therein), but these typically require additional structural assumptions, e.g. ellipticity conditions for the diffusion coefficient. In addition, we rely on the Lipschitz continuity of
$\alpha(t,\cdot)$
and
$\beta(t,\cdot)$
in the convergence analysis of the SGD algorithm (see Section 3). Thus, we prefer to introduce this regularity requirement now.
Example 1. Assumption 2 requires the coefficient functions
$\alpha(t,\cdot)$
,
$\beta(t,\cdot)$
, and
$\varphi$
to be bounded and Lipschitz continuous. Among the elementary functions, this class includes several trigonometric functions (e.g. the model in Section 4.1), bounded rational functions, Gaussian-type functions, as well as sums and products of these. On the other hand, polynomial functions do not satisfy Assumption 2. However, affine functions
$\alpha(t,\cdot), \beta(t,\cdot)$
, and polynomial-type functions
$\varphi$
are allowed in the extended framework discussed in Remark 2 (e.g. the model in Section 4.2).
We now establish a time-regularity result for
$\bar{\gamma}$
, which can be achieved by requiring more regularity on the function
$\varphi$
and on the coefficients
$\alpha,\beta$
. We will see in Section 2.2 that one can take advantage of this higher regularity to have a faster convergence of the solution to the finite-dimensional minimization problem to the true
$\bar{\gamma}$
(see Proposition 4).
Assumption 3. There exists
$N\in \mathbb{N}$
and
$R>0$
such that
$\alpha,\beta \in C_b^{N-1,2(N-1)}([0,T] \times \mathbb{R}^d)$
and
$\varphi\in C^{2 N}_b(\mathbb{R}^d)$
, with
$\alpha$
,
$\beta$
, and
$\varphi$
and all their derivatives bounded by R.
Proposition 2. (Higher-order smoothness of
$\bar{\gamma}$
.) Let Assumptions 1–3 with some
$N\in\mathbb{N}$
hold true. Then, the function
$\bar{\gamma}$
defined in (2) is in
$C^N_b([0,T])$
and, for any
$m=1,\ldots, N$
, we have
\begin{equation}\frac{\textrm{d}^m}{\textrm{d} t^m}\bar{\gamma}(t) = \mathrm{E}\big[ (\mathscr{K}^m \varphi)(t,X_t) \big],\end{equation}
where
$\mathscr{K}$
is the Kolmogorov operator acting on a function
$u\in C_b^{1,2}([0,T] \times \mathbb{R}^d)$
as
\begin{equation*}\mathscr{K} u(t,x) = \partial_t u(t,x) + \big(\bar{\gamma}(t)\alpha(t,x)\big) \cdot \nabla_x u(t,x) + \tfrac{1}{2}\, \mathrm{Tr}\Big[ \big(\bar{\gamma}(t)\beta(t,x)\big)\big(\bar{\gamma}(t)\beta(t,x)\big)^{\top} \nabla^2_x u(t,x) \Big],\end{equation*}
with
$\mathscr{K}^m u$
being the m-times composition of
$\mathscr{K}$
acting on u as
$\mathscr{K}^m u = \mathscr{K}\big(\mathscr{K}^{m-1} u\big)$
,
$\mathscr{K}^0 u = u$
.
The proof is found in Appendix B.
Remark 4. By inspecting the proof of Proposition 2 and by Remark 2, one can see that the same conclusion could be derived lifting the boundedness condition in Assumption 3, if the derivatives of
$\alpha(t,\cdot)$
,
$\beta(t,\cdot)$
, and
$\varphi$
have polynomial growth, provided that
$X_0$
(thus,
$X_t$
) has suitably many finite moments.
Example 2. All the elementary functions satisfying Assumption 2 discussed in Example 1 are smooth with bounded derivatives and thus fulfil Assumption 3. On the other hand, affine
$\alpha(t,\cdot), \beta(t,\cdot)$
, and polynomial
$\varphi$
are allowed in the extended framework discussed in Remark 4.
2.1. Fixed-point equation for
$\bar{\gamma}$
We introduce an SDE closely related to (2), which takes a function
$\gamma$
as input. Namely, given
$\gamma\in L^{\infty}([0,T],\mathbb{R}^K)$
, denote by
$Z^{\gamma}$
the solution to the Markovian SDE
\begin{equation}\textrm{d} Z^{\gamma}_t = \gamma(t) \big( \alpha(t,Z^{\gamma}_t) \, \textrm{d} t + \beta(t,Z^{\gamma}_t) \, \textrm{d} W_t \big), \quad Z^{\gamma}_0=\xi.\end{equation}
The following result is a direct application of Lemma 3.
Proposition 3. Let Assumptions 1 and 2 hold true for some
$R>0$
. Then (5) is strongly well-posed for any
$\gamma\in L^{\infty}([0,T],\mathbb{R}^K) $
. Furthermore, for any
$\kappa>0$
, there exists
$C>0$
dependent on
$\kappa, T,R$
, and K, such that
\begin{equation*}\sup_{t \in [0,T]} \mathrm{E}\Big[ \big|Z^{\gamma}_t - Z^{\widetilde{\gamma}}_t\big|^2 \Big] \leq C\, \lVert \gamma - \widetilde{\gamma} \rVert^2_{\infty}\end{equation*}
for any
$\gamma,\widetilde{\gamma} \in L^{\infty}([0,T],\mathbb{R}^K)$
such that
$\|\gamma\|_{\infty},\|\widetilde{\gamma}\|_{\infty} \leq \kappa$
.
Now, define the operator
$\Psi\,:\,C\big([0,T],\mathbb{R}^K\big) \to C\big([0,T],\mathbb{R}^K\big)$
given as
\begin{equation*}\Psi(\gamma)(t) \,:\!=\, \mathrm{E}\big[\varphi\big(Z^{\gamma}_t\big)\big], \quad t \in [0,T],\end{equation*}
which is well-defined under Assumptions 1 and 2.
Let us start by observing that solving the MV-SDE (2) is equivalent to finding a fixed point for
$\Psi$
. Indeed, if
$\bar{\gamma}$
is a solution to
\begin{equation}\Psi(\gamma) = \gamma,\end{equation}
then the process
$Z^{\bar{\gamma}}$
solves (2) and vice versa. Also note that, if
$\bar{\gamma}$
solves (6), then
$\bar{\gamma}$
solves the minimization problem
\begin{equation}\min_{\gamma \in C([0,T],\mathbb{R}^K)} F(\gamma),\end{equation}
where
\begin{equation*}F(\gamma) \,:\!=\, \lVert \Psi(\gamma) - \gamma \rVert_{L^2([0,T])}.\end{equation*}
Therefore, under Assumptions 1 and 2, Proposition 1 implies that there exists a unique
$\bar{\gamma} \in C\big([0,T],\mathbb{R}^K\big)$
that solves (7); in particular,
$F(\bar{\gamma}) = 0$
.
Lemma 1. Let Assumptions 1 and 2 hold true for some
$R>0$
. Then, the functional F in (7) is locally Lipschitz continuous in sup-norm
$\|\cdot\|_\infty$
. Precisely, for any
$\kappa > 0$
there exists
$C > 0$
such that
\begin{equation}|F(\gamma) - F(\hat\gamma)| \leq C\, \lVert \gamma - \hat\gamma \rVert_{\infty} \quad \textrm{for all } \gamma, \hat\gamma \in C([0,T],\mathbb{R}^K) \ \textrm{with } \|\gamma\|_{\infty},\|\hat\gamma\|_{\infty} \leq \kappa,\end{equation}
where
$C>0$
depends on
$\kappa, T,R$
, and K.
Proof. Given
$\kappa>0$
, take
$\gamma, \hat\gamma \in C([0,T],\mathbb{R}^K) \ \textrm{with } \|\gamma\|_{\infty},\|\hat\gamma\|_{\infty} \leq \kappa$
. By the reverse triangle inequality we have
\begin{equation}|F(\gamma) - F(\hat\gamma)| \leq \lVert \Psi(\gamma) - \Psi(\hat\gamma) \rVert_{L^2([0,T])} + \lVert \gamma - \hat\gamma \rVert_{L^2([0,T])}.\end{equation}
The second term in the right-hand side of (11) is dominated by
$\sqrt{T} \lVert \gamma- \hat\gamma\rVert_{\infty}.$
For the first term, by the definition of
$\Psi$
and by the Lipschitz continuity of
$\varphi$
, we obtain
\begin{align*} \lVert \Psi(\gamma) - \Psi(\hat\gamma) \rVert^2_{L^2([0,T])} =& \int_0^T\Big| \mathrm{E}\big[\varphi(Z^\gamma_t) - \varphi(Z^{\hat\gamma}_t)\big]\Big|^2 \, \textrm{d} t \leq C \int_0^T\mathrm{E}\Big[\big|Z^\gamma_t - Z^{\hat\gamma}_t\big|^2\Big] \, \textrm{d} t\\ \leq& C\, T \lVert \gamma-\hat\gamma \rVert^2_{\infty}, \end{align*}
where the last inequality follows from Proposition 3. Thus, by combining the two bounds for the terms in the right-hand side of (11), we get the result in (10).
A quick inspection of the above proof shows that (10) also holds with
$\|\cdot\|_\infty$
replaced by
$\|\cdot\|_{L^2([0,T])}$
since the same is also true for Proposition 3.
Remark 5. Depending on the numerical accuracy requirements, instead of
$F(\gamma)$
defined above, we could also use an objective function
$\widehat F(\gamma)$
given as
\begin{equation*}\widehat F(\gamma) \,:\!=\, \bigg( \int_0^T w(t)\, \big| \Psi(\gamma)(t) - \gamma(t) \big|^2 \, \textrm{d} t \bigg)^{1/2},\end{equation*}
with a kernel (weight) function w over [0,T]. For example, we can take
$w(t)=C_1e^{C_2 t}$
for
$C_1,C_2>0$
. Such a choice of objective function
$\widehat F(\gamma)$
in the minimization problem (7) would force the SGD algorithm to be more accurate towards the end of the interval [0,T].
2.2. Projection on a finite-dimensional domain
In order to solve (7) numerically using a gradient-descent method, it is imperative that we first narrow down the infinite-dimensional domain
$C([0,T],\mathbb{R}^{K})$
to a finite-dimensional one. This approach is a natural choice in our setting of solving (7) as the solution
$\bar \gamma$
is a continuous map. Hereafter, we fix
$n\in\mathbb{N}$
and a family
$g_0,\ldots,g_n$
of linearly independent vectors of
$C([0,T],\mathbb{R})$
. Define the finite-dimensional sub-space of
$C([0,T],\mathbb{R}^{K})$
as
\begin{equation}S_n \,:\!=\, \Bigg\{ t \mapsto \sum_{i=0}^n a_i\, g_i(t) \, : \, a_0,\ldots,a_n \in \mathbb{R}^K \Bigg\},\end{equation}
and the distance between
$\gamma \in C([0,T],\mathbb{R}^{K})$
and
$S_n$
as
\begin{equation*}d_n(\gamma) \,:\!=\, \inf_{p \in S_n} \lVert \gamma - p \rVert_{\infty}.\end{equation*}
Also introduce the linear lifting operator
\begin{equation*}\mathscr{L}\,:\,\mathbb{R}^{(n+1)K} \to S_n, \quad \mathscr{L} a\,(t) \,:\!=\, \sum_{i=0}^n a_i\, g_i(t), \quad a=(a_0,\ldots,a_n),\ a_i \in \mathbb{R}^K.\end{equation*}
Remark 6. The operator
$\mathscr{L}$
is an isomorphism between finite-dimensional normed vector spaces. In particular, both
$\mathscr{L}$
and its inverse
$\mathscr{L}^{-1}$
are continuous and bounded operators. Furthermore, by a compactness argument, it can be easily seen that for any
$\gamma \in C([0,T],\mathbb{R}^K)$
, there exists
$p_n(\gamma) \in S_n$
such that
\begin{equation}\lVert \gamma - p_n(\gamma) \rVert_{\infty} = \min_{p \in S_n} \lVert \gamma - p \rVert_{\infty} = d_n(\gamma).\end{equation}
In other words,
$d_n$
is well-defined as a minimum and such a minimum is attained at
$p=p_n(\gamma)$
. It is important to observe that both
$p_n(\gamma)$
and
$d_n(\gamma)$
only depend on
$S_n$
and not on the particular choice of the basis
$g_i$
,
$i=0,\ldots,n$
.
Example 3. (Polynomial of best approximation.) Consider
$n\in\mathbb{N}$
and
$g_i (t) = t^i$
for
$i=0,\ldots, n$
Then the elements of
$S_n$
read as
\begin{equation*}p(t) = \sum_{i=0}^n a_i\, t^i, \quad a_0,\ldots,a_n \in \mathbb{R}^K.\end{equation*}
This choice clearly leads to
$S_n = \mathcal{P}_n(\mathbb{R}^K)$
, the space of
$\mathbb{R}^K$
-valued polynomials of order n. We stress that
$S_n = \mathcal{P}_n(\mathbb{R}^K)$
could be also obtained by fixing a basis
$g_i$
,
$i=0,\ldots,n$
, which is not the canonical one. Possible choices include Legendre, Chebyshev, Laguerre, and Hermite polynomials [Reference Süli and Mayers43]. In particular, the numerical results presented in Section 4 refer to the specific choice of
$g_i$
given by the Lagrange polynomials centred at Chebyshev points, i.e.
\begin{equation}g_i(t) = \Bigg( \prod_{\substack{ 0\leqslant l \leqslant n \\ l\neq i}} \frac{t-t_l}{t_i - t_l} \Bigg), \quad t_i = \frac{T}{2} + \frac{T}{2} \cos\Bigl( \frac{2i+1}{2n+2}\pi\Bigr),\quad i=0,1,\ldots,n.\end{equation}
Finally note that, in particular, the distance
$d_n(\gamma)$
and the projection
$p_n(\gamma)$
(see Remark 6) are the same for all these choices of basis, for they all generate the same finite-dimensional space
$S_n = \mathcal{P}_n(\mathbb{R}^K)$
. In this case,
$p_n$
in (14) is called the polynomial of best approximation of
$\gamma$
(see [Reference Isaacson and Keller34, Chapter 5]).
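A minimal sketch of this particular basis, transcribing the displayed formula for the Chebyshev points on [0, T] and the corresponding Lagrange polynomials, might read as follows (function names are ours).

```python
import numpy as np

def chebyshev_points(n, T=1.0):
    """Chebyshev points t_i on [0, T], as in the displayed formula."""
    i = np.arange(n + 1)
    return T / 2 + (T / 2) * np.cos((2 * i + 1) / (2 * n + 2) * np.pi)

def lagrange_basis(t, nodes):
    """Values of the n+1 Lagrange basis polynomials g_i at time t."""
    m = len(nodes)
    vals = np.ones(m)
    for i in range(m):
        for l in range(m):
            if l != i:
                vals[i] *= (t - nodes[l]) / (nodes[i] - nodes[l])
    return vals
```

By construction $g_i(t_j)=\delta_{ij}$, so the coefficient $a_i$ in $\mathscr{L} a$ is simply the value of the approximating function at the node $t_i$; the basis also sums to one at every t, since polynomial interpolation reproduces constants exactly.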
Example 4. Other examples for the basis
$\{g_i, i = 0, 1, \ldots\}$
include splines [Reference Beirão da Veiga, Buffa, Sangalli and Vázquez4] and smooth Fourier wavelet expansions [Reference Fang and Oosterlee22].
Since the target map
$\bar{\gamma}$
is bounded by
$\|\varphi\|_\infty$
, we introduce the auxiliary map
$\textbf{ h}\,:\,\mathbb{R}^{K}\to\mathbb{R}^K$
so that we can ensure that
$\textbf{h}(\mathscr{L} a)$
is also bounded. Concretely, let
$\textbf{h}$
be a smooth bounded function
$\mathbb{R}^{K}\ni x \mapsto \textbf{h}(x)=\big(\textbf{h}_1(x_1),\ldots,\textbf{h}_K(x_K)\big)$
defined as
\begin{align}\mathbb{R}^K \ni x \mapsto \textbf{h}(x)=\begin{cases}x ,& \textrm{if}\ |x| \leq \| \varphi \|_{\infty} + 2 d_n (\bar{\gamma})\\[4pt]\| \varphi \|_{\infty} + 2 d_n (\bar{\gamma}),& \textrm{if}\ |x| \geq \| \varphi \|_{\infty} + 2 d_n (\bar{\gamma})\end{cases},\end{align}
where it is clear that
$\|\textbf{h}\|_{\infty} \leq \| \varphi \|_{\infty} + 2 d_n (\bar{\gamma})$
.
Let the constant
$\tau_n>0$
be such that
\begin{equation}\mathscr{L}^{-1}\big( B_{\| \varphi \|_{\infty} + d_n (\bar{\gamma})} \big) \subseteq B_{\tau_n},\end{equation}
where
$B_{\tau_n}$
and
$B_{\| \varphi \|_{\infty} + d_n (\bar{\gamma})}$
denote the balls in
$\mathbb{R}^{(n+1) K}$
and
$S_n$
, respectively, with radius
$\tau_n$
and
$\| \varphi \|_{\infty} + d_n (\bar{\gamma})$
, centred at zero. Note that such
$\tau_n$
exists as
$\mathscr{L}^{-1}$
is a linear, and thus a bounded, operator (see Remark 6). Finally, we consider a non-negative real-valued function
$H\in C^1(\mathbb{R}^{(n+1) K})$
such that
\begin{equation}H(a) = 0 \ \textrm{ for all } a \in B_{\tau_n}, \quad \textrm{and} \quad H(a) \to +\infty \ \textrm{ as } |a| \to \infty.\end{equation}
Example 5. A function H satisfying hypothesis (17) above is given by
\begin{equation*}H(a)\,:\!=\,\begin{cases}0, & \textrm{if } a \in B_{\tau_n}\\(|a|-\tau_n)^2, & \textrm{otherwise,}\end{cases} \quad {\nabla H(a)\,:\!=\,\begin{cases}0, & \textrm{if } a \in B_{\tau_n}\\\frac{2 (|a| - \tau_n)}{|a|}\,a, & \textrm{otherwise.}\end{cases}}\end{equation*}
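This penalty and its gradient are straightforward to implement; the sketch below transcribes the two displayed formulas, with the norm taken as the Euclidean norm on $\mathbb{R}^{(n+1)K}$ (function names are ours).

```python
import numpy as np

def H(a, tau):
    """Penalty of Example 5: zero on the ball B_tau, quadratic growth outside."""
    r = np.linalg.norm(a)
    return 0.0 if r <= tau else (r - tau) ** 2

def grad_H(a, tau):
    """Gradient of H; continuous across the boundary |a| = tau."""
    r = np.linalg.norm(a)
    if r <= tau:
        return np.zeros_like(a)
    return (2.0 * (r - tau) / r) * a
```

A finite-difference check confirms the closed-form gradient; note that H is $C^1$ but not $C^2$ at the boundary $|a| = \tau_n$.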
Remark 7. Let
$\bar{\gamma} \in C([0,T],\mathbb{R}^K)$
be the unique minimizer to (7) and consider the element
$p_n(\bar{\gamma})$
introduced in Remark 6. Since
$\|\bar{\gamma} \|_{\infty} \leq \| \varphi \|_{\infty}$
, it is clear that
$\|p_n(\bar{\gamma})\|_{\infty}\leq \| \varphi \|_{\infty} + d_n(\bar{\gamma})$
. Therefore, by (16) and (17), we obtain that
\begin{equation*}H\big(\mathscr{L}^{-1} p_n(\bar{\gamma})\big) = 0.\end{equation*}
In place of (7), we consider the following optimization problem:
\begin{equation}\min_{a \in \mathbb{R}^{(n+1) K}} G(a), \qquad G(a) \,:\!=\, F^2(\textbf{h}\circ \mathscr{L} a) + H(a).\end{equation}
The role of map
$\mathbf{h}$
is to ensure that the estimate of
$\bar{\gamma}$
, defined in (9), remains bounded when computing the loss function F. This is reflected when we compute
$F^2(\textbf{h}\circ \mathscr{L} a)$
in (18). Moreover, the function H penalizes the coefficient vector a when the norm of our approximation
$\mathscr{L} a$
grows beyond
$\| \varphi \|_{\infty} + d_n (\bar{\gamma}).$
This is reflected in the definition of H in (17) and also motivated in Remark 7.
Lemma 2. Let Assumptions 1 and 2 hold true for some
$R>0$
. Let
$\bar{\gamma} \in C([0,T],\mathbb{R}^K)$
be the unique solution to (7) and assume that
$\| \varphi \|_{\infty} + d_n(\bar{\gamma})\leq \kappa$
for some
$\kappa>0$
. Then the minimization problem (18) admits a solution. Furthermore, for any
\begin{equation*}\bar a_n \in \mathop{\textrm{argmin}}_{a \in \mathbb{R}^{(n+1) K}} G(a),\end{equation*}
we have
\begin{equation}F(\textbf{h}\circ \mathscr{L} \bar a_n) \leq C\, d_n(\bar{\gamma}),\end{equation}
where
$C>0$
depends on
$\kappa, T,R$
, and the dimensions.
Proof. The boundedness of
$\varphi$
implies that
$F^2(\textbf{h}\circ \mathscr{L} a)$
is bounded. Thus, by (17), G(a) tends to
$+\infty$
as
$|a| \to \infty$
. On the other hand, Lemma 1 together with the continuity of
$\textbf{h}$
and H imply that G is continuous on
$\mathbb{R}^{(n+1) K}$
. Therefore, the fact that G has at least a minimum
$\bar a_n$
on
$\mathbb{R}^{(n+1) K}$
stems from Weierstrass’ extreme value theorem.
Furthermore, let
$\bar a_n$
be a minimizer of G; we then have
\begin{align*} F^2(\textbf{h}\circ \mathscr{L} \bar a_n)\leq F^2(\textbf{h}\circ \mathscr{L}\bar a_n) + H(\bar a_n)& \leq F^2\big(\textbf{h}\circ p_n(\bar{\gamma})\big) + H \big(\mathscr{L}^{-1} p_n(\bar{\gamma})\big)\\ & = F^2\big(p_n(\bar{\gamma})\big) = \big|F\big( p_n(\bar{\gamma})\big) - F(\bar \gamma)\big|^2\\ & \leq C \, \lVert p_n(\bar{\gamma}) - \bar \gamma \rVert^2_{\infty} = C \, d^2_n(\bar \gamma). \end{align*}
In the above, we used Remark 7 and Lemma 1. The result follows since
$\| \bar{\gamma} \|_{\infty}\leq \| \varphi \|_{\infty}$
and
$\| p_n(\bar{\gamma}) \|_{\infty}\leq \| \varphi \|_{\infty} + d_n(\bar{\gamma})\leq \kappa$
.
From the above result we observe that if we can compute the distance
$d_n$
explicitly, we have an explicit upper bound for the error encountered when solving (18) instead of (7). For instance, when choosing
$S_n$
as the space of the nth-order polynomials (see Example 3), an explicit error estimate can be obtained as shown next.
Proposition 4. Let Assumptions 1–3 hold true for some
$N\in\mathbb{N}$
,
$R>0$
. Let
$g_0,\ldots,g_n$
be a family of linearly independent vectors of
$C([0,T],\mathbb{R})$
generating the space of the
$\mathbb{R}^K$
-valued polynomials of order n, i.e.
$S_n = \mathcal{P}_n(\mathbb{R}^K)$
in (12). Then the minimization problem (18) admits a solution. Furthermore, for any
\begin{equation*}\bar a_n \in \mathop{\textrm{argmin}}_{a \in \mathbb{R}^{(n+1) K}} G(a),\end{equation*}
we have
\begin{equation}F(\textbf{h}\circ \mathscr{L} \bar a_n) \leq \frac{C}{n^N},\end{equation}
where
$C>0$
depends on T,N,R, and the dimensions (and is independent of n).
Proof. By Lemma 2 the minimization problem (18) admits a solution and (19) holds true. By Proposition 2,
$\bar{\gamma}$
is N-times differentiable on [0, T]. In particular, (4) implies that its derivatives are bounded on [0, T] by a constant that only depends on N, R, and the dimensions.
A theorem of Jackson (see [Reference Timan49, Section 5.1.5 (p. 261)]) then yields
\begin{equation*}d_n(\bar{\gamma}) \leq \frac{C}{n^N},\end{equation*}
which concludes the proof.
Remark 8. We stress that the result of Proposition 4 does not depend on the choice of the basis
$g_i$
,
$i=0,\ldots,n$
. In particular, it applies to all the possible choices discussed in Example 3. We claim that the results of this section would still hold true after lifting the boundedness assumption on
$\alpha$
,
$\beta$
, and
$\varphi$
. Indeed, in light of Remark 2, we expect (7) to admit a unique solution
$\bar{\gamma}\in C([0,T],\mathbb{R}^K)$
even if
$\alpha(t,\cdot)$
,
$\beta(t,\cdot)$
have sub-linear growth and
$\varphi$
has polynomial growth (provided that the initial datum
$X_0=\xi$
has suitably many finite moments). Furthermore, the proof of Lemma 2 only relies on the estimates of Lemma 1, whose proof itself would be identical in this more general context. Similarly, by Remark 4, we expect Proposition 4 to still hold true if the derivatives of
$\alpha(t,\cdot)$
,
$\beta(t,\cdot)$
, and
$\varphi$
have polynomial growth.
3. SGD for MV-SDEs
We propose an SGD algorithm to solve the minimization problem (18). For the reader's convenience we rewrite the objective function G in (18) explicitly as
where we use the shorthand notation
$Z^{a}_t \,:\!=\, Z^{\textbf{h}\circ \mathscr{L} a}_t$
to denote the solution to (5) when
$\gamma$
is taken as
$\gamma=\textbf{h}\circ \mathscr{L} a$
, i.e.
$Z^{a}$
takes values in
$\mathbb{R}^d$
and solves
We point out that the objective function G only depends on the law of
$Z^{a}$
(and a). Thus, when weak uniqueness holds (for instance, under Assumptions 1 and 2), G can be defined in terms of any weak solution to (21). In particular, it only depends on the coefficients of (21) and on the law of
$\xi$
, but not on the specific realizations of
$\xi$
or the Brownian motion W. However, in the following, we will sometimes consider strong solutions associated to specific realizations of Brownian motion and initial datum. In these cases we will reinforce the notation by denoting
$Z^{a}(\xi , W)$
as the strong solution to (21) with respect to a given initial random variable
$\xi$
and a given Brownian motion W.
Lastly, it is clear that within our setting,
$\min_{a\in \mathbb{R}^{(n+1) K}} G(a)$
is not a convex minimization problem.
3.1. Some preliminaries
To explain the SGD algorithm we begin with some heuristics. Hereafter we write a generic element
$a\in\mathbb{R}^{(n+1)K}$
as
-
• For any
$k = 0, \ldots, n,$
and
$j = 1,\ldots, K$
, by formally differentiating (21) with respect to the element
$a_{k,j}$
, we obtain that the pair
$(Z^a,\partial_{a_{k,j}} Z^a)$
solves the system given by (21) and (22), with
\begin{align}\textrm{d} Y^{k,j}_t &= g_k(t) \nabla \textbf{h}_j\big( \mathscr{L} a(t) \big) \Big( \alpha(t, Z_t )\, \textrm{d} t + \beta(t, Z_t ) \,\textrm{d} W_t \Big) \nonumber\\&+ \sum^d_{i=1} Y^{k,j,i}_t \textbf{h}\big( \mathscr{L} a(t) \big) \Big( \partial_{z_i} \alpha(t, Z_t ) \, \textrm{d} t + \partial_{z_i} \beta (t, Z_t)\, \textrm{d} W_t \Big), \quad Y^{k,j}_0 = 0, \end{align}
$\textbf{h}$
introduced in (16). Note that
$Y^{k,j}$
and Z take values in
$\mathbb{R}^d$
, with components denoted by
$Y^{k,j,i}$
and
$Z^i$
, for
$i=1,\ldots,d$
, respectively. In the following, we will sometimes reinforce the notation by denoting
$(Z^{a}_t (\xi,W),Y^{a;k,j}_t (\xi,W))$
as the strong solution to (21), (22) with respect to a given initial random variable
$\xi$
and Brownian motion W.
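To make the coupled dynamics concrete, here is a minimal Euler–Maruyama sketch of the pair $(Z^a,\partial_{a_{k,j}} Z^a)$ in the simplest scalar setting with $\textbf{h}$ the identity; the form of (21) is inferred from (2), and the specific coefficients, basis, and coefficient vector below are illustrative assumptions, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(2)

# Scalar sketch (d = q = K = 1) of Euler-Maruyama steps for the pair
# (Z, Y^k) of (21)-(22) with h = identity, so that, inferring (21) from (2),
#   dZ   = (L a)(t) ( alpha(t,Z) dt + beta(t,Z) dW ),
#   dY^k = g_k(t) ( alpha(t,Z) dt + beta(t,Z) dW )
#        + Y^k (L a)(t) ( d_z alpha(t,Z) dt + d_z beta(t,Z) dW ),  Y^k_0 = 0.
# All concrete choices below are illustrative placeholders.
alpha = lambda t, z: np.sin(z)     # bounded drift factor, bounded derivative
alpha_z = lambda t, z: np.cos(z)
beta = lambda t, z: 0.2            # constant diffusion factor
beta_z = lambda t, z: 0.0

T, dt, n = 1.0, 1e-2, 3
g = [lambda t, k=k: t ** k for k in range(n + 1)]         # toy polynomial basis
a = np.array([0.5, 0.1, 0.0, 0.0])                        # coefficients of L a
La = lambda t: sum(a[k] * g[k](t) for k in range(n + 1))  # lifted curve (L a)(t)

Z, Y = 0.5, np.zeros(n + 1)        # Y[k] approximates the tangent process Y^k
for i in range(round(T / dt)):
    t = i * dt
    dW = rng.normal(0.0, np.sqrt(dt))
    dA = alpha(t, Z) * dt + beta(t, Z) * dW        # common driving increment
    dAz = alpha_z(t, Z) * dt + beta_z(t, Z) * dW   # derivative driving increment
    Y = Y + np.array([g[k](t) for k in range(n + 1)]) * dA + Y * La(t) * dAz
    Z = Z + La(t) * dA
```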
-
• Suppose that we can always exchange derivatives with time integrals and expected values (we justify this exchange of operations below). For any
$a\in\mathbb{R}^{(n+1) K}$
, and for any
$k=0,\ldots,n$
and
$j=1,\ldots, K$
, we obtain (23)
\begin{align}\nonumber &\partial_{a_{k,j}} \bigg( \int_0^T \big| \mathrm{E} \big[ \varphi(Z^{a}_t) - \textbf{h}\big((\mathscr{L} a) (t)\big) \big] \big|^2 \,\textrm{d} t \bigg) \\&= 2 \int_0^T \Big\langle \mathrm{E}\Big[ \varphi(Z^{a}_t) - \textbf{h}\big((\mathscr{L} a) (t)\big) \Big] , \mathrm{E} \Big[ \partial_{a_{k,j}} \Big( \varphi(Z^{a}_t) - \textbf{h}\big((\mathscr{L} a) (t)\big) \Big) \Big] \Big\rangle \, \textrm{d} t.\end{align}
Let now W,
$\tilde W$
be two independent Brownian motions and
$\xi$
,
$\tilde \xi$
be two identically distributed independent random variables, which are also independent of
$(W,\tilde W)$
. Assuming weak uniqueness for the solutions to (21), we have
\begin{align}&\Big\langle \mathrm{E}\Big[ \varphi(Z^{ a}_t)- \textbf{h}\big((\mathscr{L} a) (t)\big) \Big] , \mathrm{E} \Big[ \partial_{a_{k,j}} \Big( \varphi(Z^{a}_t) - \textbf{h}\big((\mathscr{L} a) (t)\big) \Big) \Big] \Big\rangle \nonumber\\ &= \Big\langle \mathrm{E}\Big[ \varphi \big(Z^{a}_{t}(\xi, W)\big) - \textbf{h}\big((\mathscr{L} a) (t)\big)\Big] , \mathrm{E} \Big[ \partial_{a_{k,j}} \Big( \varphi \big(Z^{a}_{t}(\tilde\xi,\tilde W)\big) - \textbf{ h}\big((\mathscr{L} a) (t)\big) \Big) \Big] \Big\rangle \nonumber\\ &= \mathrm{E}\Big[ \Big\langle \varphi \big(Z^{a}_{t}(\xi, W)\big) - \textbf{h}\big((\mathscr{L} a) (t)\big) , \partial_{a_{k,j}} \Big( \varphi \big(Z^{a}_{t}(\tilde\xi,\tilde W)\big) - \textbf{h}\big((\mathscr{L} a) (t)\big) \Big) \Big\rangle \Big].\end{align}
-
• Therefore, combining (23) and (24), and since
$\partial_{a_{k,j}} Z^a$
solves (22), we formally obtain
\begin{equation*} \nabla_a G(a) = \mathrm{E}[ v(a; \xi, {W} ; \, \tilde\xi, \tilde W) ] + \nabla_a H(a),\end{equation*}
with
$v(a; \xi, {W} ; \, \tilde\xi, \tilde W)\in\mathbb{R}^{(n+1)K}$
, whose components are given by (25):
\begin{align}\nonumber&v_{k,j}(a; \xi, {W} ; \, \tilde\xi, \tilde W) \\& \,:\!=\, 2 \int_0^T \Big\langle \varphi \big(Z^{a}_{t}(\xi, W)\big) - \textbf{h}\big((\mathscr{L} a) (t)\big) , \nabla_x \varphi \big(Z^{a}_{t}(\tilde\xi,\tilde W)\big) Y^{a;k,j}_{t}(\tilde\xi,\tilde W) - \partial_{a_{k,j}} \textbf{h}\big( (\mathscr{L} a) (t) \big) \Big\rangle \, \textrm{d} t,\end{align}
where
$\nabla_x \varphi$
stands for the Jacobian matrix of
$\varphi\,:\,\mathbb{R}^d \to \mathbb{R}^K$
.
3.2. Algorithm
In the previous subsection we presented heuristics for the computations needed to implement the stochastic gradient algorithm; we can now introduce the algorithm itself. First, let:
-
•
$(\eta_m)_{m\in\mathbb{N}}$
be a deterministic sequence of positive scalars (learning rates) such that (26)
\begin{equation}\sum_{m=0}^{\infty} \eta_m = + \infty\quad \textrm{and} \quad \sum_{m=0}^{\infty} \eta^2_m < + \infty ;\end{equation}
-
•
$( \bar{\Omega} , \bar{\mathcal{F}} , (\bar{\mathcal{F}}_m)_{m\in\mathbb{N}} , \bar{\mathrm{P}})$
be a filtered probability space rich enough to support four mutually independent sequences
$(W_m)_{m\in\mathbb{N}}$
,
$(\tilde W_m)_{m\in\mathbb{N}}$
,
$(\xi_{m})_{m \in\mathbb{N}}$
, and
$(\tilde\xi_{m})_{m \in\mathbb{N}}$
such that
-
•
$(W_m)_{m\in\mathbb{N}}$
and
$(\tilde W_m)_{m\in\mathbb{N}}$
are sequences of independent q-dimensional standard Brownian motions adapted to
$(\bar{\mathcal{F}}_m)_{m\in\mathbb{N}}$
; namely, for any
$m\in\mathbb{N}$
,
$W_m = (W_{m,t})_{t\in[0,T]}$
is a q-dimensional standard Brownian motion such that
$\mathcal{F}^{W_m}_T \subset \bar{\mathcal{F}}_m$
, and the same holds for
$\tilde W_m$
; -
•
$(\xi_{m})_{m \in\mathbb{N}}$
and
$(\tilde\xi_{m})_{m \in\mathbb{N}}$
are sequences of independent
$\mathbb{R}^d$
-valued random variables distributed like
$\xi$
.
For a deterministic
$\textbf{a}_0 \in \mathbb{R}^{(n+1) K},$
the estimate update is obtained iteratively as
The function v is defined component-wise by (25). Note that the
$\mathbb{R}^{(n+1) K}$
-valued stochastic processes
$(\textbf{a}_m)_{m\in\mathbb{N}}$
and
$(\textbf{v}_m)_{m\in\mathbb{N}}$
are recursively defined, as in Algorithm 1, and are adapted to the filtration
$(\bar{\mathcal{F}}_m)_{m\in\mathbb{N}}$
, which then represents the history of the algorithm.
Algorithm 1: SGD-MVSDE

Figure 1 below contains the flow-chart of Algorithm 1. We can also formulate a mini-batch version of the SGD-MVSDE algorithm by averaging the stochastic gradient v over M replications
$\bigl(W^i_{m+1},\tilde W^i_{m+1},\xi^i_{m+1},\tilde\xi^i_{m+1}\bigr)^M_{i=1}$
of
$(W_{m+1},\tilde W_{m+1},\xi_{m+1},\tilde\xi_{m+1})$
for a given iteration index m.
Algorithm 2: Mini-batch-SGD-MVSDE

Note that Algorithm 2 for
$M=1 $
is equivalent to Algorithm 1.
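To illustrate the update rule shared by Algorithms 1 and 2: the true stochastic gradient (25) requires simulating the system (21), (22), so the schematic sketch below replaces it with a noisy gradient of a toy quadratic objective; every numerical choice is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mini-batch SGD skeleton mirroring Algorithm 2: at step m, draw M
# independent stochastic gradients, average them, and move against the
# average with learning rate eta_m = r0/(m+1)^rho, in compliance with (26).
a_star = np.array([1.0, -2.0, 0.5])   # toy minimizer of G(a) = |a - a_star|^2

def grad_sample(a):
    """Unbiased noisy gradient of the toy objective (stand-in for v in (25))."""
    return 2.0 * (a - a_star) + rng.normal(0.0, 1.0, a.shape)

def sgd(a0, M=100, n_iter=500, r0=1.0, rho=0.7):
    a = a0.copy()
    for m in range(n_iter):
        eta = r0 / (m + 1) ** rho                             # learning rate
        v_bar = np.mean([grad_sample(a) for _ in range(M)], axis=0)
        a = a - eta * v_bar                                   # SGD update
    return a

a_out = sgd(np.zeros(3))   # Algorithm 1 corresponds to the case M = 1
```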

Figure 1. Flow-chart of Algorithm 1.
3.3. Convergence of Algorithm 1
We recall that H is a function in
$C^1(\mathbb{R}^{(n+1) K})$
and that it satisfies (17). In order to obtain a convergence result for Algorithm 1, we introduce the following assumption.
Assumption 4. The function H is such that
$\nabla_a H$
is globally Lipschitz continuous on
$\mathbb{R}^{(n+1) K}$
.
For example, the H specified in Example 5 satisfies Assumption 4. This requirement of Lipschitz regularity is essential to prove convergence of any iterative method with a sequential step (see Proposition 6 below).
Our main convergence result shows the convergence of the iterates
$\textbf{a}_{m}$
in Algorithm 1 to a stationary point of G.
Theorem 1. Let Assumptions 1 and 4 be in force. Assume also that
$\alpha(t,\cdot),\beta(t,\cdot)\in C^2(\mathbb{R}^d)$
, for any
$t\in[0,T]$
, and
$\varphi\in C^2(\mathbb{R}^K)$
, with derivatives bounded by some
$R>0$
. The map G is differentiable and there exists an
$\mathbb{R}^{(n+1)K}$
-valued random variable
$\textbf{a}_{\infty}$
such that, in Algorithm 1, we have
To prove Theorem 1, we first prove two intermediate results. These results enable us to employ the developments in [Reference Bertsekas and Tsitsiklis6, Reference Patel and Zhang41] for the convergence of SGD algorithms in non-convex settings. The first preliminary result contains: (i) a standard consistency condition that allows us to understand the process
$(\textbf{v}_{m})_{m\in\mathbb{N}}$
as a sequence of stochastic gradients of G; and (ii) a bound on the second conditional moment of the gradient noise.
Proposition 5. Under the assumptions of Theorem 1, we have
where we set
$\bar{\mathcal{F}}_0\,:\!=\,\{ \bar{\Omega}, \emptyset \}$
. Furthermore, the random variables
are bounded, uniformly with respect to
$m\in \mathbb{N}_0$
.
The next proposition is a regularity result for G and its gradient.
Proposition 6. (Regularity of G and
$\nabla_a G$
.) Under the assumptions of Theorem 1, the function G is continuously differentiable and its gradient
$\nabla_a G$
is globally Lipschitz continuous on
$\mathbb{R}^{(n+1) K}$
.
We are now in the position to prove our main result in Theorem 1.
Proof of Theorem 1. Differentiability of G follows from Proposition 6. By Propositions 5 and 6, the hypotheses of [Reference Bertsekas and Tsitsiklis6, Proposition 3] and [Reference Patel and Zhang41, Theorem 1] are fulfilled. In particular, Theorem 1 in [Reference Patel and Zhang41] (see the remark on p. 5 there) implies that
$\bar{\mathrm{P}}$
-almost surely either
${\textbf{a}_m}$
converges in
$\mathbb{R}^{(n+1) K}$
or
$|{\textbf{a}_m}|\to +\infty$
. On the other hand, by Proposition 3 in [Reference Bertsekas and Tsitsiklis6], either
$G(\textbf{a}_m)\to -\infty$
or
$G(\textbf{a}_m)$
converges to a finite value and
$\nabla G(\textbf{a}_m)\to 0$
,
$\bar{\mathrm{P}}$
-almost surely. Since
we have that
$\textbf{a}_m$
converges
$\bar{\mathrm{P}}$
-almost surely to a stationary point of G.
Remark 9. We claim that the convergence analysis of this section could be derived assuming
$\alpha(t,\cdot)$
,
$\beta(t,\cdot)$
, and
$\varphi$
with sub-linear growth (instead of bounded). Indeed, Propositions 5 and 6 (on which Theorem 1 relies) would be proved with similar, though slightly more involved, arguments. In particular, the estimates of Lemmas 4 and 5 for the solutions to the system (21), (22), which are the crux of the argument, should extend to the case of unbounded
$\alpha(t,\cdot),\beta(t,\cdot)$
, and
$\varphi$
while keeping the benign boundedness assumption for their derivatives. Finally, we believe the assumptions on
$\varphi$
could be further relaxed by allowing for polynomial growth, provided additional integrability conditions are assumed on the initial datum.
Remark 10. The Lipschitz regularity of the derivatives of
$\alpha(t,\cdot),\beta(t,\cdot)$
is needed to apply standard stability estimates (see Lemma A.1) to the system (21) and (22). While dropping this assumption would allow us to include popular models with Hölder-continuous coefficients (e.g. square-root diffusions), it would also result in the tangent process solving an SDE with singular coefficients. This would require the development of new fundamental stability estimates for the tangent SDE (22). Therefore, this direction is left for future research.
4. Numerical study
We conduct several numerical experiments to study the numerical performance of Algorithm 2 for different values of the mini-batch sample size M. Precisely, we test it for the following models.
-
(i) The Kuramoto–Shinomoto–Sakaguchi model (Section 4.1), which is in the form of (2) with a low value of
$K=3$
. The goal here is to test the accuracy and efficiency of the scheme in a case where the dependence on the state is truly separated from that on the measure. -
(ii) A model with a quadratic drift (Section 4.2), which is outside the scope of our Assumption 2. Here we show, numerically, that our SGD algorithm efficiently converges, at least for small times, even if the coefficients are super-linear in the measure variable.
-
(iii) An MV-SDE with a Gaussian convolution kernel (Section 4.3). The SDE has no separable coefficients, but it can be approximated by dynamics in the form of (2) with a sufficiently large K by employing a projection technique based on the generalized Fourier series (see [Reference Belomestny and Schoenmakers5]). The goal is to show that our algorithm can be successfully employed, in combination with the aforementioned projection technique, to obtain good approximations of the law of the original SDE.
The following choices are common to all our numerical experiments.
-
• The finite-dimensional space
$S_n$
for the domain projection of Section 2.2 is that of the nth-order polynomials. In particular, for a given
$n\in\mathbb{N}$
and
$T >0$
, the basis
$g_0,\ldots,g_n$
is given by the Lagrange polynomials centred at the Chebyshev nodes
$t_0,\ldots,t_n$
defined in (15). Therefore, recalling the lifting operator
$\mathscr{L}$
defined in (13), for any
$a\in \mathbb{R}^{(n+1)\times K}$
we have
\begin{equation*}(\mathscr{L} a) (t_j) = a_j, \quad j=0,\ldots, n.\end{equation*}
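This basis and the lifting operator $\mathscr{L}$ can be sketched as follows; the node formula below is one standard Chebyshev–Lobatto convention, stated as an assumption, since (15) fixes the exact choice:

```python
import numpy as np

# Sketch of the lifting operator L: coefficients a = (a_0, ..., a_n) in
# R^{(n+1) x K} |-> the polynomial curve (L a)(t) = sum_k g_k(t) a_k, where
# g_k are the Lagrange polynomials centred at Chebyshev nodes on [0, T].
def chebyshev_nodes(n, T):
    """Chebyshev-Lobatto points mapped to [0, T] (one common convention)."""
    j = np.arange(n + 1)
    return 0.5 * T * (1.0 - np.cos(np.pi * j / n))

def lagrange_basis(nodes, t):
    """Evaluate all Lagrange basis polynomials g_k at the points t."""
    t = np.atleast_1d(t)
    G = np.ones((len(t), len(nodes)))
    for k, tk in enumerate(nodes):
        for i, ti in enumerate(nodes):
            if i != k:
                G[:, k] *= (t - ti) / (tk - ti)
    return G                                  # shape (len(t), n + 1)

def lift(a, nodes, t):
    """(L a)(t) = sum_k g_k(t) a_k, for a of shape (n + 1, K)."""
    return lagrange_basis(nodes, t) @ a

n, K, T = 3, 2, 0.5
nodes = chebyshev_nodes(n, T)
a = np.arange((n + 1) * K, dtype=float).reshape(n + 1, K)
# Interpolation property: (L a)(t_j) = a_j.
assert np.allclose(lift(a, nodes, nodes), a)
```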
-
• Inspired by the learning-rate suggestion in [Reference Fehrman, Gess and Jentzen23], the sequence of learning rates
$(\eta_m)_{m\in\mathbb{N}}$
is set as
\begin{equation*}\eta_m = r_0/(m+1)^{\rho}, \quad r_0 > 0, \ \rho \in (0.5,1],\end{equation*}
in compliance with the conditions (26).
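A quick numerical sanity check (illustration only, with the assumed values $r_0=5$, $\rho=0.7$) that this schedule behaves as (26) requires: the partial sums of $\eta_m$ keep growing, while those of $\eta_m^2$ stabilize.

```python
import numpy as np

r0, rho = 5.0, 0.7
m = np.arange(1_000_000)
eta = r0 / (m + 1) ** rho

s1 = np.cumsum(eta)        # grows like m^{1 - rho}: divergent partial sums
s2 = np.cumsum(eta ** 2)   # rho > 1/2, so the squared series converges

# Doubling the horizon still increases the first sum noticeably...
assert s1[-1] > 1.15 * s1[len(m) // 2]
# ...while the second sum barely moves over the same range.
assert s2[-1] - s2[len(m) // 2] < 0.01 * s2[-1]
```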
-
• The functions
$\textbf{h}$
and H appearing in (18) are chosen as the identity and null functions, respectively. Although this choice does not meet the assumptions of the convergence results of Section 3.3, it does simplify the implementation. In addition, we observe that the algorithm still converges in our experiments. -
• The initial iterate
$\textbf{a}_0$
(algorithm initialization) is always taken as
\begin{equation*}\textbf{a}_0 = \big(\varphi(\tilde X_0), \ldots, \varphi(\tilde X_0)\big),\end{equation*}
with
$\tilde X_0$
being a random realization of
$X_0$
, which yields
$\mathscr{L} \textbf{a}_0 (t) \equiv \varphi(\tilde X_0)$
. In particular, when
$\mu_{X_0} = \delta_x$
, then
$\textbf{a}_0 = (\varphi(x),\ldots,\varphi(x))$
is deterministic.
-
• The benchmark value for the curve
$\bar{\gamma}_\cdot = \mathrm{E}[\varphi(X_{\cdot})]$
is given by (28), where
\begin{equation}\bar{\gamma}^{\textrm{MC}}_\cdot = \frac{1}{N_0} \sum_{i=1}^{N_0} \varphi(X^{N_{0},i}_{\cdot}),\end{equation}
$X^{N_0}$
is the
$N_0$
-dimensional mean-field particle system associated with (2). The relative error after m iterations of our SGD algorithm is (29), with
\begin{equation}\varepsilon_m\,:\!=\,\frac{ \| \mathscr{L} {\bf{a}}_{m} - {\bar{\gamma}}^{\textrm{MC}}\|_{L^2([0,T])}}{\|{\bar{\gamma}}^{\textrm{MC}} \|_{L^2([0,T])}},\end{equation}
$\mathbf{a}_{m}$
being the mth iterate of Algorithm 2 and the
$L^2({[0,T]})$
-norm over the interval [0, T] is as defined in Table 1. The algorithm is always halted when the error
$\varepsilon_m$
as computed above is below 1%.
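In practice the norms in (29) are evaluated on the time grid; here is a minimal sketch using the trapezoidal rule (the curves below are synthetic placeholders, not output of the algorithm):

```python
import numpy as np

def trapz(y, h):
    """Composite trapezoidal rule on a uniform grid of spacing h."""
    return h * (y[0] / 2.0 + y[1:-1].sum() + y[-1] / 2.0)

def relative_error(curve, benchmark, h):
    """Discrete version of (29); curve, benchmark have shape (n_times, K)."""
    diff = trapz(np.sum((curve - benchmark) ** 2, axis=1), h)
    ref = trapz(np.sum(benchmark ** 2, axis=1), h)
    return np.sqrt(diff / ref)

t = np.linspace(0.0, 0.5, 51)
bench = np.stack([np.sin(t), np.cos(t)], axis=1)  # synthetic benchmark curve
curve = bench + 0.005                             # a perturbed trial curve
eps = relative_error(curve, bench, t[1] - t[0])
assert eps < 0.01                                 # would trigger the 1% halt
```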
-
• Algorithm 2 and the mean-field particle system associated with (2) make use of a time-stepping scheme to approximate the solution of the involved SDEs. In particular, the system (21), (22) is discretized by employing the Euler–Maruyama scheme with time-step
$h=10^{-2}$
. Note that this further approximation is absolutely standard under the assumption of
$\alpha(t,\cdot),\beta(t,\cdot)$
being differentiable with bounded derivatives. The discretized mean-field particle system is written as
\begin{equation*}X^{N_0,i}_t = X^{N_0,i}_{t_k} +\Bigl( \frac{1}{N_0} \sum_{j=1}^{N_{0}} \varphi(X^{N_{0},j}_{\eta(t)})\Bigr) \cdot \Bigl( \alpha(\eta(t),X^{N_{0},i}_{\eta(t)}) (t - t_k) +\beta(\eta(t),X^{N_{0},i}_{\eta(t)}) \bigl(W^i_t - W^i_{t_k} \bigr) \Bigr),\end{equation*}
with
$\eta(t)\,:\!=\,t_k$
if
$t \in [t_k, t_{k+1}),$
and
$t_{k+1}-t_k = h$
. Here
$\{t_k\}_{k}$
is a partition of [0, T]. Due to interactions between discretized diffusions, implementation of the above particle system will require
$O(N_{0}^2K h^{-1})$
arithmetic operations. On the other hand, in Algorithm 2, one sample of the mini-batch will involve
$O(Kdqh^{-1}), O(nK^2d^2qh^{-1})$
, and
$O(nK^2dh^{-1})$
number of arithmetic operations for implementing discretized (21), (22), and computing entries of v given in (25), respectively. Thus, the total number of arithmetic operations in one iteration of Algorithm 2 is
$O(MnK^2d^2qh^{-1})$
.
-
• Computational setup: All numerical experiments are conducted on the computing server of the Department of Mathematics, University of Bologna, featuring a 12th Gen Intel
$^{\circledR}$
Core
$^{\textrm{TM}}$
i9-12900K CPU, 125 GB RAM, and running Ubuntu
$^{\circledR}$
24.04.1 LTS. The computations were implemented in PYTHON
$^{\textrm{TM}}$
3.11.11 using NUMPY.
4.1. Kuramoto–Shinomoto–Sakaguchi MV-SDE
Consider the following MV-SDE, which is related to the famous Kuramoto–Shinomoto–Sakaguchi model (see [Reference Frank24, Section 5.3.2]):
The above MV-SDE can be seen in the form of (2) with
$K=3$
,
$d=1$
,
$q=1$
, and
Here,
$\bar{\gamma} (t) = \mathrm{E}[\varphi(X_t)]$
takes values on
$\mathbb{R}^3$
. However, as
$\bar{\gamma}_3(t)\equiv 1$
, we run the SGD algorithm with
$K=2$
to approximate
$\bar{\gamma}_1(t)=\mathrm{E}[\!\sin\!(X_t)]$
and
$\bar{\gamma}_2(t) = \mathrm{E}[\!\cos\!(X_t)]$
, which are the real unknown functions. Note that
$\varphi, \alpha$
, and
$\beta$
in this example satisfy the assumptions required in the convergence result of Section 3.3.
We test different choices of algorithm hyperparameters:
$r_0 = 1, 5, 10$
,
$\rho = 0.6, 0.7, 0.8, 0.9$
,
$n=3,4,5,6,7,8$
, and
$M= 1, 10, 100, 1000, 10{,}000$
. The benchmark
${\bar{\gamma}}^{\textrm{MC}}$
in (28), based on the Monte Carlo particle-system approximation, is obtained with
$N_{0}=10^6$
particles. For each combination of hyperparameters, we perform 1000 independent runs of the algorithm, and record the average number of iterations of the SGD algorithm required to achieve the relative error accuracy
$\varepsilon_m<1\%$
.
In Tables 3–5 we report the lowest average number of iterations obtained over different
$(r_0,\rho)$
combinations for a given M and n for time horizons
$T=0.5, 1, 2,$
respectively. We observe that the average number of iterations decreases with the size of the mini-batch sample M, and increases with the time horizon T. By taking into account the computational time, it is clear that
$M=1000$
gives the best performance. It is evident that our mini-batch-SGD algorithm converges very quickly to the Monte Carlo particle-system-based benchmark.
Table 3. Kuramoto–Shinomoto–Sakaguchi MV-SDE. Average number of iterations m (execution time in seconds) over 1000 independent runs of the algorithm to achieve relative error accuracy
$\varepsilon_m<1\%$
with the best combination of
$(r_0,\rho)$
for each pair (n,M). Here
$T=0.5$
,
$x_0=0.5$
, and
$\sigma=0.5$
, and the Monte Carlo benchmark exhibited an execution time of 1.98 seconds.

Table 4. Kuramoto–Shinomoto–Sakaguchi MV-SDE. Average number of iterations m (execution time in seconds) over 1000 independent runs of the algorithm to achieve relative error accuracy
$\varepsilon_m<1\%$
with the best combination of
$(r_0,\rho)$
for each pair (n,M). Here
$T=1$
,
$x_0=0.5$
, and
$\sigma=0.5$
, and the Monte Carlo benchmark exhibited an execution time of 3.60 seconds.

Table 5. Kuramoto–Shinomoto–Sakaguchi MV-SDE. Average number of iterations m (execution time in seconds) over 1000 independent runs of the algorithm to achieve relative error accuracy
$\varepsilon_m<1\%$
with the best combination of
$(r_0,\rho)$
for each pair (n,M). Here
$T=2$
,
$x_0=0.5$
, and
$\sigma=0.5$
, and the Monte Carlo benchmark exhibited an execution time of 6.59 seconds.

In Figures 2–4 we plot the output curves
$\mathscr{L} \textbf{a}_m=(\mathscr{L} \textbf{a}_m^1, \mathscr{L} \textbf{a}_m^2)$
of the SGD algorithm versus the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2),$
for all values of the polynomial degree
$n=3, 4, \ldots, 8$
and for time horizons
$T=0.5, 1, 2$
respectively. Note that the approximation does not deteriorate as the time horizon T increases. In particular, the slight loss of precision that can be observed towards the right end of the time interval can be explained by the choice of polynomial approximations. For instance, the gap between
$(\mathscr{L} \textbf{a}_m)^1$
and
${\bar{\gamma}}^{\textrm{MC}}_1$
, which can be seen in Figure 2 in the region
$t\in[0.4,0.5]$
, does not appear in the analogous plot of Figure 3 (or Figure 4) after the same number of steps. This shows that the error does not propagate over time: this is mainly due to the fact that ours is a full history-type minimization algorithm, in that it approximates the minimum in (7) over the set of the continuous curves on [0, T].

Figure 2. Kuramoto–Shinomoto–Sakaguchi MV-SDE. Comparison of the output curves
$\mathscr{L} \textbf{a}_m=(\mathscr{L} \textbf{a}_m^1, \mathscr{L} \textbf{a}_m^2)$
of the SGD algorithm versus the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2)$
for all values of n, for time-step size
$h=10^{-2}$
,
$T=0.5$
,
$M=1000$
,
$x_0=0.5$
, and
$\sigma=0.5$
.

Figure 3. Kuramoto–Shinomoto–Sakaguchi MV-SDE. Comparison of the output curves
$\mathscr{L} \textbf{a}_m=(\mathscr{L} \textbf{a}_m^1, \mathscr{L} \textbf{a}_m^2)$
of the SGD algorithm versus the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2)$
for all values of n, for time-step size
$h=10^{-2}$
,
$T=1$
,
$M=1000$
,
$x_0=0.5$
, and
$\sigma=0.5$
.
Finally observe that the algorithm works best for low values of the polynomial order, namely
$n=3,4$
. Indeed, typically the number of iterations needed to reach the target relative error accuracy
$\varepsilon_m<1\%$
(cf. Tables 3–5) increases as higher values of n are considered.
$M=100,1000$
, this phenomenon seems less evident. However, we can see from Figures 2–4 that the quality of the approximation deteriorates when the number of iterations does not increase with n. This can be interpreted as an overfitting phenomenon, which is expected as the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2)$
are relatively flat.
4.2. Polynomial drift MV-SDE
Consider the MV-SDE with polynomial drift given as follows:
The latter can be cast in the form of (2) with
$K=3$
,
$d=1$
,
$q=1$
, and
Once more, the function
$\bar{\gamma} (t) = \mathrm{E}[\varphi(X_t)]$
takes values on
$\mathbb{R}^3$
but
$\bar{\gamma}_3(t)\equiv 1$
. Thus, we run the SGD algorithm with
$K=2$
to approximate
$\bar{\gamma}_1(t)=\mathrm{E}[(X_t)^2]$
and
$\bar{\gamma}_2(t) = \mathrm{E}[X_t]$
, which are the real unknown functions. Also note that
$\varphi, \alpha$
, and
$\beta$
in this example are unbounded and thus do not satisfy the assumptions in the results of Section 3.3. In particular, the drift coefficient has super-linear growth in the measure component.
We test different choices of algorithm hyperparameters:
$r_0 = 1, 5, 10$
,
$\rho = 0.6, 0.7,$
and
$M= 1, 10, 100, 1000$
. We set the polynomial degree
$n=3,$
and test the performance of the SGD algorithm for different time horizons,
$T=0.1, 0.5, 1$
. The benchmark
${\bar{\gamma}}^{\textrm{MC}}$
in (28) based on the Monte Carlo particle-system approximation is obtained with
$N_{0}=10^{7}$
particles. For each combination of the hyperparameters, we perform 1000 independent runs of the algorithm, and record the average number of iterations of the SGD algorithm required to achieve the relative error accuracy
$\varepsilon_m<1\%$
.
In Table 6 we report the lowest average number of iterations over different
$(r_0,\rho)$
combinations for a given M and T. Once more, as the value of the mini-batch increases the number of steps to reach convergence decreases. Compared to the model tested in Section 4.1, increasing the time horizon T has a stronger impact on the convergence, which is due to the quadratic growth of the coefficient. We stress that for this model even the MC particle method requires an increasing number of particles to obtain reliable benchmark curves for increasing values of T.
Table 6. Polynomial drift MV-SDE. Average number of iterations (execution time in seconds) over 1000 independent runs of the algorithm to achieve relative error accuracy
$\varepsilon_m<1\%$
with the best combination of
$(r_0,\rho)$
for each pair (T,M). Here
$x_0=1$
,
$\delta=0.8$
and the Monte Carlo benchmark exhibited an execution time of 2.57 seconds for
$T=0.1$
, 9.78 seconds for
$T=0.5$
, and 19.53 seconds for
$T=1$
.


Figure 4. Kuramoto–Shinomoto–Sakaguchi MV-SDE. Comparison of the output curves
$\mathscr{L} \textbf{a}_m=(\mathscr{L} \textbf{a}_m^1, \mathscr{L} \textbf{a}_m^2)$
of the SGD algorithm versus the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2)$
for all values of n, for time-step size
$h=10^{-2}$
,
$T=2$
,
$M=1000$
,
$x_0=0.5$
, and
$\sigma=0.5$
.
In Figure 5 we plot the output curves
$\mathscr{L} \textbf{a}_m=(\mathscr{L} \textbf{a}_m^1, \mathscr{L} \textbf{a}_m^2)$
of the SGD algorithm versus the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2)$
for all time horizons
$T=0.1$
,
$T=0.5$
, and
$T=1$
. In the plots, particularly for
$T=0.5$
, notice the considerable loss of accuracy in our approximation
$\mathscr{L} \textbf{a}_m^2$
of
$\bar{\gamma}_2(t) = \mathrm{E}[X_t]$
near the right boundary of the interval (similar effects were visible in the plots for the Kuramoto model). After running additional tests, we do not believe that this effect is due to the polynomial projection (e.g. Runge-type phenomena in polynomial interpolation over compact sets); rather, we claim it is due to the norm used to define the cost function
$F^2$
(cf. (8)), in combination with the chosen threshold level we set for
$\varepsilon_m$
in (29). Indeed, lowering this threshold from
$10^{-2}$
to
$10^{-3}$
considerably reduces the error, including near the boundaries of the interval. However, one could look for strategies to distribute the error more uniformly across the interval, obtaining a better qualitative result while keeping the same error threshold. One possible improvement is to employ the alternative objective function presented in Remark 5, i.e. using a weight function to force the SGD algorithm to be more accurate towards the end of the interval. One can also use a similar weight function to compute a weighted
$L^2({[0,T]})$
norm for error
$\varepsilon_m$
in the stopping criterion. Another possibility is to experiment with different choices of basis functions listed in Examples 3 and 4 and choose the one with the best error performance.

Figure 5. Polynomial drift MV-SDE. Comparison of the output curves
$\mathscr{L} \textbf{a}_m=(\mathscr{L} \textbf{a}_m^1, \mathscr{L} \textbf{a}_m^2)$
of the SGD algorithm versus the benchmark curves
${\bar{\gamma}}^{\textrm{MC}}=({\bar{\gamma}}^{\textrm{MC}}_1 , {\bar{\gamma}}^{\textrm{MC}}_2)$
for all values of T, for time-step size
$h=10^{-2}$
,
$n=3$
,
$M=1000$
,
$x_0=1$
, and
$\delta=0.8$
.
4.3. Convolution-type MV-SDE
We consider the following convolution-type MV-SDE:
As shown in [Reference Belomestny and Schoenmakers5], the equation above can be approximated by employing a projection technique based on generalized Fourier series, which we sketch out here for the reader’s convenience. Noticing that the drift coefficient in (30) can be written as
we can expand
$b(x,\cdot)$
in a suitable weighted
$L^2$
space as follows. Setting the measure
$\mu(\textrm{d}x) = e^{-x^2/2} \, \textrm{d}x$
on the Borel sets of
$\mathbb{R}$
, we consider the following orthonormal basis of
$L^2(\mathbb{R},\mu)$
given by the normalized Hermite polynomials
Therefore, for
$K\in\mathbb{N}$
sufficiently large,
$b(x,\cdot)$
can be approximated in
$L^2(\mathbb{R},\mu)$
as
\begin{equation}b(x,y) \approx \sum_{k=0}^{K} \alpha_k(x) \varphi_k(y),\end{equation}
where
Therefore, the MV-SDE (30) can be approximated by the MV-SDE with separable coefficients
\begin{equation*}\textrm{d} X^{(K)}_t = \sum^K_{k=0}\mathbb {E}\big[\varphi_k(X^{(K)}_t)\big] \alpha_k\big(X^{(K)}_t\big) \, \textrm{d} t + \sigma \,\textrm{d} W_t,\quad X_0 = \xi \sim \mathcal{N}_{(0,1)},\end{equation*}
which can be clearly cast in the form of (2). For a given
$K\in\mathbb{N}$
, we run the SGD algorithm in order to approximate
$\bar{\gamma}^{(K)}_k(t) = \mathbb {E}[\varphi_k(X^{(K)}_t)]$
for
$k=0,\ldots,K$
.

Figure 6. Monte Carlo method densities
$\tilde{w}^{(K),\textrm{MC}}_T(x)$
over the interval
$[\!-3,4]$
, for
$T=1$
and
$K = 3, 5, 10, 20$
. The benchmark vector
${\bar{\gamma}}^{\textrm{MC}}(T)$
was computed with
$N_{0}=10^7$
particles. Here
$X_0 \sim \mathcal{N}_{(0,1)}$
and
$\sigma = 0.1$
.
For this model, we test the accuracy of the SGD algorithm by checking the quality of the density approximation of the solution
$X_T$
to (30) that can be derived from the
$\bar{\gamma}^{(K)}_k$
curves as follows (see also [Reference Belomestny and Schoenmakers5]). For a fixed
$T>0$
denote by
$w_T(x)$
and
$w^{(K)}_T(x)$
the densities of
$X_T$
and
$X^{(K)}_T$
, respectively. Following the same argument that led to (31), for
$K\in\mathbb{N}$
sufficiently large,
$w_T(x)$
can be approximated in
$L^2(\mathbb{R},\mu)$
as
\begin{equation*}w_T(x) \approx w^{(K)}_T(x) \approx \sum_{k=0}^{K} \int_{\mathbb{R}} \varphi_k(y) w^{(K)}_T(y) \, \textrm{d}y\, \varphi_k(x) = \sum_{k=0}^{K} \bar{\gamma}^{(K)}_k(T) \varphi_k(x) =: \tilde{w}^{(K)}_T(x).\end{equation*}
Denote by
$\tilde{w}^{(K),\textrm{SGD}}_T$
and
$\tilde{w}^{(K),\textrm{MC}}_T$
, the approximations of
$\tilde{w}^{(K)}_T$
obtained by replacing the vector
$\bar{\gamma}^{(K)}(T)$
with the output of the SGD algorithm,
$\mathscr{L} \textbf{a}_m(T)$
, and with the particle-system benchmark,
${\bar{\gamma}}^{\textrm{MC}}(T)$
, respectively.

Figure 7. Comparison of
$\tilde{w}^{(K),\textrm{SGD}}_T(x)$
and
$\tilde{w}^{(K),\textrm{MC}}_T(x)$
over the interval
$[\!-3,4]$
, for
$T=1$
and
$K = 10$
. The benchmark vector
${\bar{\gamma}}^{\textrm{MC}}(T)$
was computed with
$N_{0}=10^7$
particles. The SGD output
$\mathscr{L} \textbf{a}_m(T)$
was obtained after
$m=172$
iterations (35 seconds of computation time), with parameters
$n = 3$
,
$M = 100$
,
$r_0 = 5$
, and
$\rho = 0.9$
. In the model (30) we had
$X_0 \sim \mathcal{N}_{(0,1)}$
and
$\sigma = 0.1$
.
In our tests we set
$\sigma = 0.1$
and
$T = 1$
as in [Reference Belomestny and Schoenmakers5]. The number of particles to derive the benchmark
${\bar{\gamma}}^{\textrm{MC}}(T)$
is set as
$N_{0}=10^7$
. Note that in [Reference Belomestny and Schoenmakers5] the authors used
$N_{0}=5\times 10^2$
, which is why our density plots differ considerably from theirs.
In Figure 6 we plot the approximate density
$\tilde{w}^{(K),\textrm{MC}}_T(x)$
for
$K = 3, 5, 10, 20$
, with x ranging on the interval
$[\!-3,4]$
. From this we conclude that the value
$K=10$
is high enough in order for
$\tilde{w}^{(K)}$
to accurately approximate
${w}_T$
over the interval
$[\!-3,4]$
. In Figure 7 we test the accuracy of the SGD algorithm by plotting
$\tilde{w}^{(K),\textrm{SGD}}_T(x)$
versus
$\tilde{w}^{(K),\textrm{MC}}_T(x)$
, with
$K=10$
, over the same interval. The parameters of the SGD algorithm were chosen as
$n = 3$
,
$M = 100$
,
$r_0 = 5$
,
$\rho = 0.9$
, and the algorithm halted after
$m=172$
iterations (35 seconds of computation time) when the relative error reached
$\varepsilon_m<1\%$
.
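The iteration-and-halting logic can be sketched generically as follows. This is not Algorithm 1 itself (which uses the specific mini-batch gradient estimator $\textbf{v}_m$ with batch size $M$): it is a plain SGD loop with geometrically decaying steps $r_m = r_0 \rho^m$, and we interpret the halting criterion $\varepsilon_m < 1\%$ as a relative update size, which is an assumption on our part.

```python
import numpy as np

def sgd(grad_sample, a0, r0=5.0, decay=0.9, eps=0.01, max_iter=10_000, rng=None):
    """Plain SGD with geometrically decaying learning rate r_m = r0 * decay**m,
    stopped when the relative update size falls below eps.
    `grad_sample(a, rng)` must return an unbiased estimate of the gradient."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a0, dtype=float)
    for m in range(max_iter):
        a_new = a - (r0 * decay**m) * grad_sample(a, rng)
        rel = np.linalg.norm(a_new - a) / max(np.linalg.norm(a), 1e-12)
        a = a_new
        if rel < eps:
            return a, m + 1
    return a, max_iter
```

With geometrically decaying steps the total step budget $\sum_m r_0\rho^m$ is finite, so the iterates settle quickly; this matches the short runtimes (tens of seconds) reported above.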
Appendix A. SDE stability estimates
Throughout this section, fix
$T>0$
and let
$({\Omega}, {\mathcal{F}}, ( {\mathcal{F}}_t )_{t \in [0,T]}, \mathrm{P})$
be a filtered probability space, with
$( {\mathcal{F}}_t)_{0\leq t \leq T}$
satisfying the usual hypotheses, supporting a $q$-dimensional Brownian motion
${W} = ( W_t)_{t \in [0,T]}$
. Consider the
$\mathbb{R}^d$
-valued SDE
\begin{equation}\mathcal{Y}_t = \zeta + \int_0^t \rho_s\, \mu(s, \mathcal{Y}_s) \,\textrm{d}s + \int_0^t \rho_s\, \sigma(s, \mathcal{Y}_s) \,\textrm{d}W_s, \quad t\in[0,T],\tag{A1}\end{equation}
where
$\zeta$
is an
$\mathbb{R}^d$
-valued
$\mathcal{F}_0$
-measurable random variable,
$\mu: [0,T] \times \mathbb{R}^d \to \mathbb{R}^{m\times d}$
and
$\sigma\,:\, [0,T] \times \mathbb{R}^d \to \mathbb{R}^{m \times d \times q}$
are deterministic measurable coefficients, and
$\rho=(\rho_t)_{t\in[0,T]}$
is a progressively measurable
$\mathbb{R}^m$
-valued stochastic process. Hereafter, we denote by
$\mathcal{Y}^{\rho}$
a solution to (A1).
Lemma 3. Assume that
$\zeta\in L^2(\Omega,\mathcal{F}_0,\mathrm{P})$
,
$\rho$
is bounded, and there exists a constant
$\Lambda>0$
such that
\begin{equation}|\mu(t,x) - \mu(t,y)| + |\sigma(t,x) - \sigma(t,y)| \leq \Lambda\, |x-y|, \quad t\in[0,T], \ x,y\in\mathbb{R}^d,\tag{A2}\end{equation}
and
\begin{equation}|\mu(t,x)| + |\sigma(t,x)| \leq \Lambda\, (1+|x|), \quad t\in[0,T], \ x\in\mathbb{R}^d.\tag{A3}\end{equation}
Then (A1) has a unique solution
$\mathcal{Y}^{\rho}$
in the sense of indistinguishability.
Furthermore, for a given
$\kappa>0$
and for any
$\rho$
and
$\hat\rho$
with
$\|\rho\|_{\mathcal{S}^\infty_{[0,T]}},\|\hat\rho\|_{\mathcal{S}^\infty_{[0,T]}}\leq\kappa$
we have:
-
(a) if
$\zeta\in L^{2p}(\Omega,\mathcal{F}_0,\mathrm{P})$
for some
$p\geq 1$
, then
\begin{equation} \lVert \mathcal{Y}^{\rho}\rVert_{\mathcal{S}^{2p}_{[0,T]}} \leq C_1, \tag{A4}\end{equation}
where
$C_1>0$
only depends on
$\kappa, p, T,\Lambda$
,
$\mathrm{E} [ |\zeta|^{2p} ]$
, and the dimensions;
-
(b) if
$\zeta\in L^{4p}(\Omega,\mathcal{F}_0,\mathrm{P})$
for some
$p\geq 1$
, then
\begin{equation} \lVert \mathcal{Y}^{\rho} - \mathcal{Y}^{\hat\rho} \rVert_{\mathcal{S}^{2p}_{[0,T]}} \leq C_2 \, \|\rho - \hat\rho\|_{\mathbb{L}^{4p}_{[0,T]}}, \tag{A5}\end{equation}
where
$C_2>0$
only depends on
$\kappa, p, T,\Lambda$
,
$\mathrm{E} [ |\zeta|^{4p} ]$
, and the dimensions;
-
(c) if
\begin{equation}|\mu(t,x)| + |\sigma(t,x)| \leq \Lambda , \quad t\in[0,T], \ x\in\mathbb{R}^d,\tag{A6}\end{equation}
then for any
$p\geq 1$
we have
\begin{equation} \lVert \mathcal{Y}^{\rho} - \mathcal{Y}^{\hat\rho} \rVert_{\mathcal{S}^{2p}_{[0,T]}} \leq C_3 \, \|\rho - \hat\rho\|_{\mathbb{L}^{2p}_{[0,T]}} \leq C_3 \, T^{\frac{1}{2p}} \|\rho - \hat\rho\|_{\mathcal{S}^\infty_{[0,T]}}, \tag{A7}\end{equation}
where
$C_3>0$
only depends on
$\kappa, p, T,\Lambda$
, and the dimensions. Note that (A7) holds assuming only
$\zeta\in L^{2}(\Omega,\mathcal{F}_0,\mathrm{P})$
and that
$C_3$
does not depend on
$\mathrm{E}[|\zeta|^2]$
.
Proof. The well-posedness of
$\mathcal{Y}^{\rho}$
and part (a) of the result are particular cases of [Reference Krylov35, Theorem 2.5.7 (p. 77)] and [Reference Krylov35, Theorem 2.5.10 (p. 85)], respectively.
Parts (b) and (c) can be readily obtained by combining (A4) with [Reference Krylov35, Theorem 2.5.9 (p. 83)]. Precisely, from [Reference Krylov35, Theorem 2.5.9], there exists
$C>0$
, only dependent on
$\kappa, p, T,\Lambda$
, and K, such that
\begin{align*}\mathrm{E}&\Big[ \sup_{t \leq T} |\mathcal{Y}^\rho_t - \mathcal{Y}^{\hat\rho}_t |^{2p} \Big]\\&\quad\leq C\, \mathrm{E}\bigg[ \int^T_0 \Big( \big|\rho_t \mu(t,\mathcal{Y}^{\hat\rho}_t) - \hat\rho_t\mu(t,\mathcal{Y}^{\hat\rho}_t)\big|^{2p} + \big|\rho_t \sigma(t,\mathcal{Y}^{\hat\rho}_t) - \hat\rho_t\sigma(t,\mathcal{Y}^{\hat\rho}_t)\big|^{2p} \Big) \,\textrm{d} t \bigg]\\&\quad\leq C\, \mathrm{E}\bigg[ \int^T_0 |\rho_t - \hat\rho_t|^{2p} \Big( \big| \mu(t,\mathcal{Y}^{\hat\rho}_t)\big|^{2p}+ \big| \sigma(t,\mathcal{Y}^{\hat\rho}_t)\big|^{2p} \Big) \,\textrm{d} t \bigg]\\&\quad\leq C \Lambda\, \|\rho - \hat\rho\|^{2p}_{\mathbb{L}^{4p}_{[0,T]}} \, \bigg( \mathrm{E}\bigg[ \int^T_0 \big( 1 + \big| \mathcal{Y}^{\hat\rho}_t\big| \big)^{4p}\, \textrm{d} t\bigg] \bigg)^{\frac{1}{2}},\end{align*}
where the last inequality follows from (A3) and the Cauchy–Schwarz inequality. Combining this with (A4) yields (A5). When (A6) holds, we have the stronger estimate
\begin{equation*}\mathrm{E}\Big[ \sup_{t \leq T} |\mathcal{Y}^\rho_t - \mathcal{Y}^{\hat\rho}_t |^{2p} \Big] \leq C\, \mathrm{E}\bigg[ \int^T_0 |\rho_t - \hat\rho_t|^{2p} \Big( \big| \mu(t,\mathcal{Y}^{\hat\rho}_t)\big|^{2p}+ \big| \sigma(t,\mathcal{Y}^{\hat\rho}_t)\big|^{2p} \Big) \,\textrm{d} t \bigg] \leq C\, \Lambda^{2p}\, \|\rho - \hat\rho\|^{2p}_{\mathbb{L}^{2p}_{[0,T]}},\end{equation*}
which yields (A7).
Appendix B. Proofs
B.1. Proof of the time-regularity of
$\bar{\gamma}$
Proof of Proposition
2. Let $X$ be the unique solution to (2), which exists by Proposition 1. A straightforward application of Itô’s formula shows that its time-marginal laws
$\mu_t(\textrm{d} x)$
solve the Fokker–Planck equation in distributional form over
$[0, T] \times \mathbb{R}^d$
, which is
By Assumption 2 and Proposition 1, the coefficients of
$\mathscr{K}$
are continuous. Therefore, a standard approximation argument yields
for any test function
$f \in C^{1,2}_b ([0,T] \times \mathbb{R}^d)$
. Now, by Assumption 3,
$\varphi\in C^2_b(\mathbb{R}^d)$
. Thus, (B1) with
$f(t,x)= \varphi(x)$
becomes
which is (4) with
$n=1$
, and in particular
$\bar{\gamma}\in C^{1}_b([0,T])$
.
Let now
$m\in \{ 1,\ldots, N-1\}$
, assume
$\bar{\gamma}\in C^{m}_b([0,T])$
and that (4) holds for $m$. By Assumption 3 we obtain
Hence, (B1) with
$f= \mathscr{K}^{m} \varphi $
yields
and, in particular,
$\bar{\gamma}\in C^{m+1}_b([0,T])$
. This concludes the proof.
B.2. Auxiliary stability and moment-estimate results
Hereafter, we denote by
$(Z^a,Y^{a;k,j})$
any solution to (21), (22) for a given
$a\in\mathbb{R}^{(n+1)K}$
and
$k=0,\ldots,n$
,
$j=1,\ldots,K$
. Throughout the remainder of this section we denote by $C$ a generic positive constant, possibly changing from line to line, that does not depend on
$a \in\mathbb{R}^{(n+1)K}$
.
We start with the following stability and moment estimates for the solution to the system (21) and (22).
Lemma 4. Let Assumption 1 be in force and assume that
$\alpha(t,\cdot),\beta(t,\cdot)\in C^2(\mathbb{R}^d)$
, for any
$t\in[0,T]$
, with derivatives bounded by some
$R>0$
. Then, for any
$a\in\mathbb{R}^{(n+1)K}$
, the system (21) and (22) is strongly well-posed for any
$k = 0, \ldots, n$
and
$j = 1,\ldots, K$
. Furthermore, we have
\begin{align}\lVert Z^{a}\rVert_{\mathcal{S}^2_{[0,T]}} + \sum_{\substack{k=0,\ldots,n \\ j=1,\ldots, K}} \lVert Y^{a;k,j} \rVert_{\mathcal{S}^2_{[0,T]}} & \leq C , && a\in\mathbb{R}^{(n+1)K} ,\end{align}
\begin{align} \lVert Z^{a} - Z^{a'} \rVert_{\mathcal{S}^2_{[0,T]}} + \sum_{\substack{k=0,\ldots,n \\ j=1,\ldots, K}} \lVert Y^{a;k,j} - Y^{a';k,j} \rVert_{\mathcal{S}^2_{[0,T]}} &\leq C \, |a - a' |, && a,a'\in\mathbb{R}^{(n+1)K}, \end{align}
for some constant
$C>0$
independent of a.
Proof. Well-posedness and estimates (B2) and (B3) for Z follow directly from Lemma 3 using that
$\textbf{h}$
is bounded and Lipschitz continuous. In particular, Lemma 3(a) yields
\begin{equation*}\lVert Z^{a}\rVert_{\mathcal{S}^4_{[0,T]}} \leq C\end{equation*}
for
$C>0$
independent of a. Lemma 3(c) yields
\begin{align} \nonumber \lVert Z^{a} - Z^{a'} \rVert_{\mathcal{S}^4_{[0,T]}} & \leq C \| \textbf{h}\circ \mathscr{L} a - \textbf{h}\circ \mathscr{L} a' \|_{\infty} \nonumber\\ & \leq C \| \nabla \textbf{h} \|_{\infty} \| \mathscr{L} a - \mathscr{L} a' \|_{\infty} \leq C \| \nabla \textbf{h} \|_{\infty} \Big(\sum_{k=0,\ldots, n} \| g_k \|_{\infty} \Big) | a - a' |,\end{align}
with C depending on
$\| \textbf{h} \|_{\infty}$
but not on a, and
$\nabla\textbf{h}$
stands for the Jacobian of
$\textbf{h}\,:\,\mathbb{R}^K \to \mathbb{R}^K$
.
As for Y, to ease the notation we only consider the case
$q=1$
. The general case is completely analogous. We observe that SDE (22) can be cast in the form of (A1) with
$\zeta=0$
,
$m = 2 d(1+d) $
,
\begin{align*}\rho_t= \rho_t (a)&= \Big( g_k(t) \nabla {\textbf{h}_j} \big( \mathscr{L} a (t) \big) \alpha(t,Z_t), \textbf{h}\big( \mathscr{L} a(t) \big) \partial_{z_1} \alpha(t, Z_t ) , \ldots, \textbf{h}\big( \mathscr{L} a(t) \big) \partial_{z_d} \alpha(t, Z_t ) ,\\&\qquad \quad g_k(t) \nabla {\textbf{h}_j} \big( \mathscr{L} a (t) \big) \beta(t,Z_t), \textbf{h}\big( \mathscr{L} a(t) \big) \partial_{z_1} \beta(t, Z_t ) , \ldots, \textbf{h}\big( \mathscr{L} a(t) \big) \partial_{z_d} \beta(t, Z_t ) \Big),\end{align*}
and
\begin{equation*}\mu(t,y) = \begin{pmatrix} I_d \\ y_{1} I_d \\ \vdots\\ y_{d} I_d \\ 0_{d(1+d),d} \end{pmatrix}, \quad \sigma(t,y) = \begin{pmatrix} 0_{d(1+d),d} \\ I_d \\ y_{1} I_d \\ \vdots\\ y_{d} I_d \end{pmatrix}, \quad t\in [0,T], \ y=(y_1,\ldots,y_d)\in\mathbb{R}^d,\end{equation*}
where
$I_d$
is the
$d\times d$
identity matrix and
$0_{d(1+d),d}$
is the
$d(1+d)\times d$
-dimensional matrix with null entries. Under our assumptions, the functions
$\mu,\sigma$
satisfy conditions (A2) and (A3) for some
$\Lambda>0$
that is independent of a and the stochastic process
$\rho$
is bounded by a constant that is also independent of a (recall that by construction
$\textbf{h}$
is constant outside a ball – see (16)). Therefore, by Lemma 3, (22) is strongly well-posed and we have (by part (b) of Lemma 3) that for any
$k = 0, \ldots, n$
and
$j = 1,\ldots, K$
\begin{align*} \lVert Y^{a;k,j}\rVert_{\mathcal{S}^2_{[0,T]}} & \leq C , \\ \lVert Y^{a;k,j} - Y^{a';k,j} \rVert_{\mathcal{S}^2_{[0,T]}} & \leq C \, \|\rho(a) - \rho(a')\|_{\mathbb{L}^4_{[0,T]}}. \end{align*}
In order to conclude, we write
\begin{align*}| \rho_t(a) - \rho_t(a') |& \leq |g_k(t)| \Big( \big| \nabla {\textbf{h}_j} \big( \mathscr{L} a (t) \big) \alpha(t,Z^a_t) -\nabla {\textbf{h}_j} \big( \mathscr{L} a' (t) \big) \alpha(t,Z^{a'}_t) \big| \\&+ \big| \nabla {\textbf{h}_j} \big( \mathscr{L} a (t) \big)\beta(t,Z^a_t) - \nabla {\textbf{h}_j} \big( \mathscr{L} a' (t) \big) \beta(t,Z^{a'}_t) \big| \Big) \hspace{0pt}\\&+ \sum_{i=1}^d \Big( \big| \textbf{h}\big( \mathscr{L} a(t) \big) \partial_{z_i} \alpha(t, Z^a_t ) - \textbf{ h}\big( \mathscr{L} a'(t) \big) \partial_{z_i} \alpha(t, Z^{a'}_t ) \big|\\&+ \big| \textbf{h}\big( \mathscr{L} a(t) \big) \partial_{z_i} \beta(t, Z^a_t ) - \textbf{h}\big( \mathscr{L} a'(t) \big) \partial_{z_i} \beta(t, Z^{a'}_t ) \big| \Big),\end{align*}
whereby from the boundedness of
$\textbf{h},\alpha,\beta$
and their derivatives, we obtain
\begin{equation*}| \rho_t(a) - \rho_t(a') | \leq C \big( |a - a'| + |Z^{a}_t - Z^{a'}_t| \big).\end{equation*}
This then yields, via (B4),
\begin{equation*}\|\rho(a) - \rho(a')\|_{\mathbb{L}^4_{[0,T]}} \leq C\, |a - a'|,\end{equation*}
and completes the proof.
Lemma 5. Let Assumption 1 be in force. Assume also that
$\alpha(t,\cdot),\beta(t,\cdot)\in C^2(\mathbb{R}^d)$
, for any
$t\in[0,T]$
, and
$\varphi\in C^{2}(\mathbb{R}^K)$
, with derivatives bounded by some
$R>0$
. Then, the function
$\mathbb{R}^{(n+1)K}\ni a\mapsto\mathrm{E} \big[ \varphi(Z^{a}_t) \big] $
is differentiable, and
\begin{equation}\partial_{a_{k,j}} \mathrm{E} \big[ \varphi(Z^{a}_t) \big] = \mathrm{E} \big[ \nabla_x \varphi \big(Z^{a}_{t}\big) Y^{a;k,j}_{t} \big], \quad t\in[0,T],\tag{B5}\end{equation}
for any
$k = 0, \ldots, n$
and
$j = 1,\ldots, K$
. Furthermore, the functions
$\mathbb{R}^{(n+1)K}\ni a\mapsto \mathrm{E} \big[ \nabla_x \varphi \big(Z^{a}_{t}\big)Y^{a;k,j}_{t} \big] $
are bounded and Lipschitz continuous, uniformly with respect to
$t\in [0,T]$
(for any k,j).
Proof. To ease the notation, we prove the statement for
$K=1$
and
$n=0$
, the general case being completely analogous.
Since
$\textbf{h},\alpha(t,\cdot),\beta(t,\cdot)$
and their derivatives up to order 2 are bounded, uniformly w.r.t.
$t\in[0,T]$
, [Reference Kunita36, Theorem 2.3.1 (p. 218)] yields
\begin{equation}\partial_a Z^{a}_t = Y^{a}_{t}\tag{B6}\end{equation}
for any
$t\in[0,T]$
. Moreover, given the differentiability of
$\varphi$
and the integrability of
$Y^a$
from Lemma 4, the right-hand side of (B5) is well-defined. Therefore, the identity (B5) can be obtained once the exchange of derivative and expectation operations is justified. Let us set
\begin{equation*}f(t;\,a) \,:\!=\, \mathrm{E} \big[ \varphi' \big(Z^{a}_{t}\big) Y^{a}_{t} \big].\end{equation*}
By Lemma 4 and the boundedness of
$\varphi',\varphi''$
we have
\begin{align}|f(t;\,a)| &\leq C \mathrm{E} \big[|Y^{a}_{t} | \big] \leq C, \nonumber\\|f(t;\,a)- f(t;\,a')| &\leq \mathrm{E} \big[ \big| \varphi' \big(Z^{a}_{t}\big) - \varphi' \big(Z^{a'}_{t}\big) \big|\times \big|Y^{a}_{t} \big| \big] + \mathrm{E} \big[ \big| \varphi' \big(Z^{a}_{t}\big) \big| \times \big|Y^{a}_{t}- Y^{a'}_{t} \big| \big| \big] \leq C |a-a'|,\tag{B7}\end{align}
for any
$t\in [0,T]$
,
$a\in\mathbb{R}$
, with
$C>0$
independent of t and a. Therefore, we obtain
\begin{align*}\mathrm{E} \big[ \varphi(Z^{a}_t) \big] - \mathrm{E} \big[ \varphi(Z^{0}_t) \big]= \mathrm{E}\big[ \varphi(Z^{a}_t) - \varphi(Z^{0}_t) \big]& = \mathrm{E} \bigg[ \int_0^a \partial_{\hat a} \big\{\varphi(Z^{\hat a}_t) \big\} \, \textrm{d} \hat a \bigg]\\&= \mathrm{E} \bigg[ \int_0^a \varphi'(Z^{\hat a}_t) Y^{\hat a}_{t} \, \textrm{d} \hat a \bigg] = \int_0^a f(t; \, \hat a) \, \textrm{d} \hat a,\end{align*}
where the second last equality follows by using (B6), and the last equality from Fubini’s theorem (using (B7)).
Lastly, using the continuity of
$f(t;\cdot)$
we conclude that
\begin{equation*}\partial_a \mathrm{E} \big[ \varphi(Z^{a}_t) \big] = f(t;\,a) = \mathrm{E} \big[ \varphi' \big(Z^{a}_{t}\big) Y^{a}_{t} \big],\end{equation*}
which completes the proof.
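The identity (B5) is a pathwise (tangent-process) differentiation formula, and it can be sanity-checked numerically on a toy scalar SDE (our own example, not the system (21)-(22)): simulate the Euler scheme for $\textrm{d}Z = a\sin(Z)\,\textrm{d}t + \sigma\,\textrm{d}W$ jointly with its tangent process $Y = \partial_a Z$, and compare $\mathrm{E}[\varphi'(Z_T) Y_T]$ for $\varphi(z)=z^2$ with a central finite difference of $a \mapsto \mathrm{E}[\varphi(Z_T)]$ under common random numbers.

```python
import numpy as np

def euler_tangent(a, n_paths, n_steps, T, sigma, seed):
    """Euler scheme for the toy SDE dZ = a*sin(Z) dt + sigma dW, jointly with
    the Euler recursion for its tangent process Y = dZ/da:
    Y_{k+1} = Y_k * (1 + a*cos(Z_k)*dt) + sin(Z_k)*dt."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = rng.standard_normal(n_paths)   # initial condition, independent of a
    y = np.zeros(n_paths)              # Y_0 = dZ_0/da = 0
    for _ in range(n_steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)
        y = y * (1.0 + a * np.cos(z) * dt) + np.sin(z) * dt   # tangent first (uses old z)
        z = z + a * np.sin(z) * dt + sigma * dw
    return z, y
```

Because the tangent recursion is exactly the derivative of the Euler map, fixing the seed makes the two estimators agree to $O(h^2)$ path by path, mirroring the exchange of derivative and expectation justified in the proof.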
B.3. Proof of the main auxiliary propositions
We are now in the position to prove Propositions 5 and 6.
Proof of Proposition
5. To ease the notation, we prove the statement for
$K=1$
and
$n=0$
, the general case being completely analogous. By (B5) in Lemma 5 we have
\begin{equation*}\partial_a G(a) = 2 \int_0^T \mathrm{E}\Big[ \varphi \big(Z^{a}_{t}\big) - \textbf{h}\big((\mathscr{L} a) (t)\big) \Big]\, \mathrm{E}\Big[ \varphi' \big(Z^{a}_{t}\big) Y^{a}_{t} - \partial_{a} \textbf{h}\big( (\mathscr{L} a) (t) \big) \Big] \,\textrm{d} t + \partial_a H(a).\end{equation*}
Recall the heuristics in (24) and (25). For any independent copies
$(\xi,W)$
and
$(\tilde\xi,\tilde W)$
, since
$\textbf{h},\textbf{h}',\varphi,\varphi'$
are bounded, we observe that
\begin{align}\nonumber\mathrm{E}\big[ \big| \varphi \big(Z^{a}_{t}(\xi,W)\big) - \textbf{h}\big((\mathscr{L} a) (t)\big) \big|^2 \big]\times\mathrm{E}\big[ \big| \varphi' \big(Z^{a}_{t}(\tilde\xi,\tilde W)\big) Y^{a}_{t}(\tilde\xi,\tilde W) -&\ \partial_{a} \textbf{h}\big( (\mathscr{L} a) (t) \big) \big|^2 \big]\\&\leq C \big( 1 + \lVert Y^{a} \rVert_{\mathcal{S}^2_{[0,T]}} \big)\leq C,\tag{B8}\end{align}
where the last inequality follows from (B2). Therefore, we can exchange the integration and derivative operations to obtain
\begin{equation*}\partial_a G(a) = \mathrm{E}_{\bar{\mathrm{P}}}\big[ v(a;\, \xi, W;\, \tilde\xi, \tilde W) \big] + \partial_a H(a),\end{equation*}
with v as defined in (25), namely
\begin{align*}v&(a;\, \xi, {W} ;\, \tilde\xi, \tilde W)\\&\quad \,:\!=\, 2 \int_0^T \Big( \varphi \big(Z^{a}_{t}(\xi, W)\big) - \textbf{ h}\big((\mathscr{L} a) (t)\big) \Big) \Big( \varphi' \big(Z^{a}_{t}(\tilde\xi,\tilde W)\big) Y^{a}_{t}(\tilde\xi,\tilde W) - \partial_{a} \textbf{h}\big( (\mathscr{L} a) (t) \big) \Big) \,\textrm{d} t.\end{align*}
Moreover, Jensen’s inequality and Fubini’s theorem, together with (B8), prove that
\begin{equation}\mathrm{E}_{\bar{\mathrm{P}}}\big[ \big| v(a;\, \xi, W;\, \tilde\xi, \tilde W) \big|^2 \big] \leq C,\tag{B9}\end{equation}
with $C$ independent of a. Hence, for
$m\in\mathbb{N}$
and recalling
$\textbf{v}_{m}$
as defined recursively in Algorithm 1, we obtain for any
$m \in \mathbb{N}_0$
\begin{align*}&\hspace{-8pt}\mathrm{E}_{\bar{\mathrm{P}}}\big[ \textbf{v}_{m+1} - \nabla_a G( \textbf{a}_m) | \bar{\mathcal{F}}_m \big]\\&\hspace{-8pt} = \mathrm{E}_{\bar{\mathrm{P}}}\Big[ v(\textbf{a}_m; \xi_{m+1}, W_{m+1};\tilde\xi_{m+1}, \tilde W_{m+1})- \mathrm{E}_{\bar{\mathrm{P}}}\big[ v(a; \xi_{m+1}, W_{m+1};\tilde\xi_{m+1}, \tilde W_{m+1}) \big]_{a=\textbf{a}_m}\Big| \bar{\mathcal{F}}_m \Big] = 0,\end{align*}
which is (27). From (B9) and Assumption 4 on H we also obtain
which concludes the proof.
Proof of Proposition
6. To ease the notation, we prove the statement for
$K=1$
and
$n=0$
, the general case being completely analogous.
By the same arguments as in the proof of Proposition 5, and using Assumption 4, one obtains
From this expression we obtain the identity
\begin{equation*}\partial_a G(a) - \partial_a G(a') = 2 \int_0^T \big( I_1(t) + I_2(t) \big) \,\textrm{d} t + I_3,\end{equation*}
where
\begin{align*} I_1(t) & = \mathrm{E}\Big[ \varphi \big(Z^{a}_{t}\big) - \varphi \big(Z^{a'}_{t}\big) + \textbf{h}\big((\mathscr{L} a') (t)\big) - \textbf{h}\big((\mathscr{L} a) (t)\big) \Big] \mathrm{E}\Big[ \varphi' \big(Z^{a}_{t}\big) Y^{a}_{t} - \partial_{a} \textbf{h}\big( (\mathscr{L} a) (t) \big) \Big] , \\ I_2(t) & = \mathrm{E}\Big[ \varphi \big(Z^{a'}_{t}\big) - \textbf{h}\big((\mathscr{L} a') (t)\big) \Big] \mathrm{E}\Big[ \varphi' \big(Z^{a}_{t}\big) Y^{a}_{t} - \varphi' \big(Z^{a'}_{t}\big) Y^{a'}_{t} + \partial_{a} \textbf{ h}\big( (\mathscr{L} a') (t) \big) - \partial_{a} \textbf{h}\big( (\mathscr{L} a) (t) \big) \Big] , \\ I_3 & = \partial_a H(a) - \partial_a H(a') .\end{align*}
Now, by the boundedness of
$\varphi,\varphi',\varphi'',\textbf{h}, \textbf{h}'$
we have
\begin{align*}| I_1 (t) | & \leq C\, \mathrm{E}\big[ | Z^{a}_{t} - Z^{a'}_{t}| + |a - a'| \big] \mathrm{E}\big[ | Y^{a}_{t} | + 1 \big] , \\| I_2 (t) | & \leq C\, \mathrm{E}\big[ |a - a'| ( | Y^{a}_{t}| + 1 ) + | Y^{a}_{t} - Y^{a'}_{t}| \big].\end{align*}
Therefore, Lemma 4 and Assumption 4 yield
\begin{equation*}\big| \partial_a G(a) - \partial_a G(a') \big| \leq C \, |a - a'|,\end{equation*}
which concludes the proof.
Supplementary material
The supplementary material for this article can be found at https://github.com/amatoandre/Numerical-approximation-of-MKV-SDEs-via-SGD
Funding Information
The research of SP was partially supported by the PRIN22 project no. CUP_E53C23001670001. The author SP has also received funding from the European Union’s Horizon Europe research and innovation programme under the Marie Skłodowska-Curie Actions Staff Exchanges (Grant Agreement No. 101183168, Call: HORIZON-MSCA-2023-SE-01). GdR acknowledges support from the Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through the project UIDB/00297/2020 and UIDP/00297/2020 (Centro de Matemática e Aplicações – NOVA Math) and from the UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding Guarantee [Project UKRI313].
Competing Interests
Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Education and Culture Executive Agency (EACEA). Neither the European Union nor EACEA can be held responsible for them.