
Self-normalized Cramér-type moderate deviation of stochastic gradient Langevin dynamics

Published online by Cambridge University Press:  02 March 2026

Hongsheng Dai*
Affiliation:
Newcastle University
Xiequan Fan*
Affiliation:
Northeastern University at Qinhuangdao
Jianya Lu*
Affiliation:
University of Essex
*Postal address: School of Mathematics, Statistics and Physics, Newcastle University, Newcastle upon Tyne NE1 7RU, UK. Email: hongsheng.dai@newcastle.ac.uk
**Postal address: School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China. Email: fanxiequan@hotmail.com
***School of Mathematics, Statistics and Actuarial Science, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK. Email: jianya.lu@essex.ac.uk

Abstract

In this paper, we study the self-normalized Cramér-type moderate deviation of the empirical measure of the stochastic gradient Langevin dynamics (SGLD). Consequently, we also derive the Berry–Esseen bound for the SGLD. Our approach is by constructing a stochastic differential equation to approximate the SGLD and then applying Stein’s method to decompose the empirical measure into a martingale difference series sum and a negligible remainder term.

Information

Type
Original Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

For a non-convex stochastic loss function $\psi(\omega,\zeta)\,:\, \mathbb{R}^d\times\mathbb{R}^r\to\mathbb{R}$ , where $\zeta\in\mathbb{R}^r$ is a random variable with probability distribution $\nu$ , we consider the following optimization problem:

\begin{align*} \omega^*=\text{argmin}_{\omega\in\mathbb{R}^d}P(\omega),\quad P(\omega)=\mathrm{E}_{\zeta\sim\nu}\psi(\omega,\zeta).\end{align*}

To find the minimizer $\omega^*$ , in [Reference Welling and Teh30] the authors proposed the stochastic gradient Langevin dynamics (SGLD) algorithm, which has been widely applied to optimization problems. The iteration of the SGLD is given by

(1) \begin{align} \omega_k=\omega_{k-1}-\eta\nabla\psi(\omega_{k-1},\zeta_k)+\sqrt{\eta\delta}\xi_{k},\end{align}

where $\eta>0$ is the step size, $\delta>0$ is the inverse temperature parameter, $(\xi_k)_{k\ge1}$ is a sequence of independent and identically distributed (i.i.d.) standard d-dimensional normal random vectors and $(\zeta_k)_{k\ge1}$ are i.i.d. samples from $\nu$ .
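To make the recursion concrete, the following sketch implements (1) for a toy least-squares loss $\psi(\omega,\zeta)=|\omega-\zeta|^2/2$ of our own choosing (so $\nabla\psi(\omega,\zeta)=\omega-\zeta$ and $\omega^*=\mathrm{E}[\zeta]$); it is an illustration of the iteration, not the paper's experimental set-up.

```python
import numpy as np

def sgld(grad_psi, w0, eta, delta, zetas, rng):
    """Run the SGLD recursion (1): w_k = w_{k-1} - eta*grad_psi(w_{k-1}, zeta_k)
    + sqrt(eta*delta)*xi_k, and return the whole trajectory."""
    d = len(w0)
    traj = [np.asarray(w0, dtype=float)]
    for z in zetas:
        xi = rng.standard_normal(d)  # xi_k ~ N(0, I_d)
        w = traj[-1] - eta * grad_psi(traj[-1], z) + np.sqrt(eta * delta) * xi
        traj.append(w)
    return np.array(traj)

rng = np.random.default_rng(0)
# Toy data: zeta ~ N((1,1), 0.01 I), so the minimizer is w* = (1, 1)
zetas = 1.0 + 0.1 * rng.standard_normal((20000, 2))
traj = sgld(lambda w, z: w - z, np.zeros(2), eta=0.05, delta=0.01,
            zetas=zetas, rng=rng)
w_bar = traj[10000:].mean(axis=0)  # time average of the iterates, near w*
```

The long-run time average of the iterates concentrates near $\omega^*$, in line with the ergodicity discussed below.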

As the number of iterations k tends to infinity, [Reference Raginsky, Rakhlin and Telgarsky25] showed that (1) can find the approximate global minimizer. See [Reference Chen, Du and Tong4, Reference Lamperski20, Reference Xu, Chen, Zou and Gu31] for more details on the convergence of the SGLD. Unlike the stochastic gradient descent (SGD) algorithm, which may converge to local minima in non-convex optimization problems, the SGLD algorithm benefits from the inclusion of Gaussian noise in its iterations. This added noise allows SGLD to more effectively explore the parameter space, making it well suited for solving non-convex problems [Reference Chen, Du and Tong4, Reference Zhang, Liang and Charikar32].

It is natural to consider the iteration (1) as a discretization to a continuous dynamics for a given step size $\eta$ . We consider the following stochastic differential equation (SDE) to approximate the SGLD algorithm:

(2) \begin{align} {\mathrm{d}} X_t=-\nabla P(X_t){\mathrm{d}} t+Q_{\eta,\delta}(X_t){\mathrm{d}} B_t,\end{align}

where $B_t$ is a d-dimensional standard Brownian motion and the diffusion matrix $Q_{\eta,\delta}(\!\cdot\!)\in\mathbb{R}^{d\times d}$ will be defined later. Significant work has been done in [Reference Feng, Gao, Li, Liu and Lu12, Reference Hu, Li, Li and Liu17] on comparing stochastic algorithms with their corresponding SDE approximations, and on establishing the diffusion approximation bound of

(3) \begin{align} \sup\nolimits_{h\in\mathcal H}|\mathrm{E} h(\omega_k)-\mathrm{E} h(X_{k\eta})|\end{align}

for a family $\mathcal H$ of test functions h. Different choices of $\mathcal{H}$ correspond to different distance metrics, such as the Wasserstein-1 distance for the Lipschitz function h and the total variation distance for bounded h. The diffusion approximation provides valuable insights into algorithms by viewing them as continuous dynamics. Acting as a bridge, it enables the application of continuous dynamic analysis methods to study the properties of stochastic algorithms. See [Reference Chen, Lu and Xu2, Reference Li, Tai and Weinan21, Reference Li, Tai and Weinan22] for more details.

Under suitable conditions on $\psi$ , (1) and (2) are exponential ergodic with invariant measures $\pi_{\eta}$ and $\pi$ , respectively. For the SGLD and its invariant measure $\pi_{\eta}$ , we construct an empirical measure $\Pi_{\eta}$ as a statistic of $\pi_{\eta}$ , where

\begin{align*} \Pi_{\eta}(\!\cdot\!)=\frac{1}{m}\sum_{k=0}^{m-1}\delta_{\omega_k}(\!\cdot\!).\end{align*}

Here $\delta_{\omega_k}(\!\cdot\!)$ is the Dirac measure of $\omega_k$ . Since (3) converges to zero as $k\to\infty$ and $\eta\to0$ , given a test function $h\;:\;\mathbb{R}^d \rightarrow \mathbb{R}$ , it is natural to consider the asymptotic property of $\Pi_{\eta}(h)$ , namely, $\int h{\mathrm{d}} \Pi_{\eta}$ .

The study of self-normalized Cramér-type moderate deviation (SNCMD) explores the deviation properties of random variables and has been developed in recent decades; see [Reference Jing, Shao and Wang19, Reference Shao26] for the results for independent random variables. For dependent random variables, [Reference Chen, Shao, Wu and Xu5] studied the moderate deviation under mixing conditions, [Reference Fan, Grama, Liu and Shao7] focused on the SNCMD for martingales, and [Reference Fan, Grama, Liu and Shao8] analysed stationary sequences. We refer the reader to [Reference Jiang, Wan and Yang18, Reference Shao and Zhou27, Reference Zhang33] for further details. However, for the iterative output of a stochastic algorithm, such as (1), which is a sequence of dependent random variables, there has been limited analysis of SNCMD. See [Reference Gao and Yiu13, Reference Teh, Thiery and Vollmer28] for the fluctuation analysis of stochastic algorithms.

In this paper, we analyse the SNCMD of $\Pi_{\eta}(h)$ with Lipschitz test function h. Specifically, given a normalizing factor $\mathcal{Y}_{\eta}$ , we compare the tail probability of $\Pi_{\eta}(h)/\sqrt{\mathcal{Y}_{\eta}}$ after scaling and centralization (i.e., $\sqrt{m\eta}(\Pi_{\eta}(h)-{\pi}(h))/\sqrt{\delta\mathcal{Y}_{\eta}}$ ) with the tail probability of the standard normal distribution $N(0,1)$ . Using the diffusion approximation and Stein’s method, we investigate, for the first time, the SNCMD of the SGLD algorithm, which provides a novel approach to, and perspective on, the analysis of its asymptotic properties. As a further extension, we also establish the corresponding Berry–Esseen bound. These non-asymptotic results quantify the finite-sample accuracy of the normal approximation to the distribution of the SGLD algorithm, thereby enhancing the theoretical reliability of the algorithm. In particular, they provide a quantitative guarantee for constructing confidence intervals with controlled error when the sample size is limited. By constructing the corresponding continuous-time dynamics as a bridge, the associated SDE offers a way to better understand the dynamic behaviour of stochastic algorithms, such as their convergence properties and the effects of hyperparameter choices. Related analyses can be found in [Reference Li, Tai and Weinan21, Reference Li, Tai and Weinan22].

Within this theoretical framework, a broader class of stochastic algorithms based on Langevin dynamics can also be analysed. For instance, algorithms such as stochastic variance-reduced gradient Langevin dynamics benefit from their variance-reduction mechanism, which leads to smoother updates and can be approximated by stochastic differential delay equations. Momentum-based accelerated stochastic algorithms, which exhibit faster convergence, can be well approximated by an underdamped Langevin diffusion. See [Reference Chen and Xu3, Reference Guillin, Wang, Xu and Yang15] for related approximation results.

Although [Reference Fan, Hu and Xu9, Reference Lu, Tan and Xu23] carried out a similar analysis, examining the SNCMD of the Langevin dynamics, their results are based on the relatively restrictive assumption that both the gradient and test functions are bounded. In contrast, our results extend this assumption by replacing the boundedness assumption with a Lipschitz condition. To relax this condition, we employ a truncation technique in the proof. In addition, our work provides a new application of Stein’s method within the realm of machine learning.

The approach to proving the main results relies on Stein’s method and a standard decomposition of $\Pi_{\eta}(h)$ , with similar ideas found in [Reference Lu, Tan and Xu23, Reference Teh, Thiery and Vollmer28]. The strategy of the proof begins with a diffusion approximation for the stochastic algorithm, constructing a corresponding SDE. Under some mild conditions, the SDE has an ergodic measure $\pi$ , and its associated Stein’s equation has a solution with good regularity properties. Using Stein’s equation, we decompose $\Pi_{\eta}(h)$ into a martingale difference series sum $\mathcal{H}_{\eta}$ and a remainder $\mathcal{R}_{\eta}$ . For $\mathcal{H}_{\eta}$ , we apply the martingale SNCMD theorem in [Reference Fan, Hu and Xu9] to compare it with the standard normal distribution. In addition, we show that the remainder $\mathcal R_{\eta}$ is exponentially negligible using concentration inequalities.

The paper is organized as follows. Diffusion approximation and our main results are stated in Section 2. In Section 3, we provide some preliminary lemmas. In Section 4, we give the proof of the SNCMD and the Berry–Esseen bound. The details of the proof of preliminary lemmas are deferred to Sections 5 and 6.

We conclude this section by introducing some notation which will be frequently used in what follows. The Euclidean norm and the inner product of vectors $x, y \in \mathbb{R}^d$ are denoted by $|x|$ and $\langle x, y \rangle$ , respectively. The norm of higher-rank tensors is denoted by $\|\cdot\|$ . For any two matrices $A,B\in \mathbb{R}^{d\times d}$ , the Hilbert–Schmidt norm is denoted by $\|A\|_{\text{HS}}=\sqrt{\sum_{i,j=1}^d A_{ij}^2}=\sqrt{\textrm{Tr}(A^\top A)}$ and their inner product by ${\langle} A,B{\rangle}_{\text{HS}}\,:\!=\,\sum_{i,j=1}^d A_{ij}B_{ij}$ , where $\top$ is the transpose operator. The symbols C and c represent positive constants whose values may vary from line to line. Let $\text{Lip}_1(\mathbb{R}^d)$ be the family of Lipschitz functions defined on $\mathbb{R}^d$ with Lipschitz constant 1. We denote the conditional expectation $\mathrm{E}[\! \cdot \! |\omega_k]$ and conditional probability $\mathrm{P}(\! \cdot \!|\omega_k)$ by $\mathrm{E}_k[\! \cdot \!]$ and $\mathrm{P}_k(\!\cdot\!)$ , respectively. Finally, $\Phi(x)$ denotes the cumulative distribution function of the standard normal distribution.

2. Diffusion approximation and main results

We first construct the diffusion approximation. Rewriting (1), we have

(4) \begin{align} \omega_k&=\omega_{k-1}-\eta\nabla P(\omega_{k-1})+\eta\nabla P(\omega_{k-1})-\eta\nabla\psi(\omega_{k-1},\zeta_k)+\sqrt{\eta\delta}\xi_{k}\nonumber\\[3pt] &\,:\!=\,\omega_{k-1}-\eta\nabla P(\omega_{k-1})+\sqrt\eta V_{\eta,\delta}(\omega_{k-1},\zeta_k,\xi_{k}), \\[-30pt] \nonumber \end{align}

where

\begin{align*} V_{\eta,\delta}(\omega_{k-1},\zeta_k,\xi_{k})= \sqrt\eta\nabla P(\omega_{k-1})-\sqrt\eta\nabla\psi(\omega_{k-1},\zeta_k)+\sqrt{\delta}\xi_{k}.\end{align*}

As $\mathrm{E}\psi(\cdot,\zeta)=P(\!\cdot\!)$ , a straightforward calculation implies

\begin{align*} \mathrm{E}_{k-1}[V_{\eta,\delta}(\omega_{k-1},\zeta_k,\xi_{k})]=0\end{align*}

and

\begin{align*} {\mathrm{cov}}[V_{\eta,\delta}(\omega_{k-1},\zeta_k,\xi_{k})|\omega_{k-1}]=\eta\Sigma(\omega_{k-1})+\delta I_d,\end{align*}

where

\begin{align*} \Sigma(x)=\mathrm{E}[\nabla\psi(x,\zeta)\nabla\psi(x,\zeta)^\top]-\nabla P(x)\nabla P(x)^\top\end{align*}

and $I_d$ is the d-dimensional identity matrix. Following the idea of [Reference Li, Tai and Weinan21, Reference Li, Tai and Weinan22], it is natural to consider the following SDE to approximate (1):

(5) \begin{align} {\mathrm{d}} X_t=-\nabla P(X_t){\mathrm{d}} t+Q_{\eta,\delta}(X_t){\mathrm{d}} B_t,\end{align}

where

\begin{align*} Q_{\eta,\delta}(x)=\big(\eta\Sigma(x)+\delta I_d\big)^{1/2}\end{align*}

is a positive definite matrix and $B_t$ is a d-dimensional standard Brownian motion. For the cost function and random variable $\zeta$ , we introduce the following conditions.

Assumption 1. There exist constants $L, K_1 >0$ and $K_2 \ge 0$ such that for every $x,y\in \mathbb{R}^d$ , $z\in\mathbb{R}^r$ ,

(6) \begin{align} |\nabla\psi(x,z)-\nabla\psi(y,z)|\le L|x-y|, \\[-25pt] \nonumber \end{align}
(7) \begin{align} {\langle} x-y,-\nabla \psi(x,z)+\nabla \psi(y,z){\rangle}\le -K_1|x-y|^2+K_2. \\[-10pt] \nonumber \end{align}

Assumption 2. The random variable $\nabla\psi(x,\zeta)$ is sub-Gaussian for any $x\in\mathbb{R}^d$ , that is, there exist positive constants $K_\zeta$ and C such that

\begin{align*} \mathrm{E}\exp\{K_\zeta|\nabla\psi(x,\zeta)|^2\} \le C. \end{align*}

Remark 1. For ease of proof, we assume that $K_\zeta=1$ .
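The approximating SDE (5) can itself be simulated by an Euler–Maruyama scheme, with $Q_{\eta,\delta}(x)$ computed as the symmetric square root of $\eta\Sigma(x)+\delta I_d$ . The sketch below uses a toy quadratic $P$ and a constant $\Sigma$ of our own choosing; it illustrates the discretisation only, not the paper's analysis.

```python
import numpy as np

def sqrtm_psd(a):
    """Symmetric PSD square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def euler_maruyama(grad_p, sigma, x0, eta, delta, dt, n_steps, rng):
    """Simulate dX_t = -grad P(X_t) dt + Q_{eta,delta}(X_t) dB_t with
    Q_{eta,delta}(x) = (eta*Sigma(x) + delta*I_d)^{1/2}."""
    d = len(x0)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        q = sqrtm_psd(eta * sigma(x) + delta * np.eye(d))
        x = x - grad_p(x) * dt + np.sqrt(dt) * (q @ rng.standard_normal(d))
    return x

rng = np.random.default_rng(1)
# Toy model: P(x) = |x|^2/2 and constant Sigma = 0.01 I
x_T = euler_maruyama(lambda x: x, lambda x: 0.01 * np.eye(2),
                     np.array([3.0, -3.0]), eta=0.05, delta=0.01,
                     dt=0.01, n_steps=2000, rng=rng)
```

For this linear drift the process is an Ornstein–Uhlenbeck diffusion, so $X_T$ is close to the origin for large T.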

Lemma 2 (see Section 3) implies that the SGLD algorithm (1) and its corresponding SDE (5) are exponential ergodic with invariant measures $\pi_{\eta}$ and $\pi$ , respectively. Then we have the following Wasserstein-1 distance bound between $\pi$ and $\pi_{\eta}$ .

Theorem 1. Suppose Assumptions 1 and 2 hold. Then one has

(8) \begin{align} W_1(\pi,\pi_{\eta})\le C\eta^{1/2}, \end{align}

where

\begin{align*} W_1(\pi,\pi_{\eta})=\sup\nolimits_{h\in\mathrm{Lip}_1}|\pi(h)-{\pi}_{\eta}(h)| \end{align*}

is the Wasserstein-1 distance.
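In one dimension, the Wasserstein-1 distance between two equal-size empirical samples reduces to the mean absolute difference of their order statistics, which gives a cheap numerical proxy for bounds such as (8). The samples below are our own toy choice:

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-d samples:
    the mean absolute difference of order statistics."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 100000)
b = rng.normal(0.5, 1.0, 100000)
d = w1_empirical_1d(a, b)  # for a pure shift of normals, W1 equals the shift
```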

Let f be the solution to the following Stein equation:

(9) \begin{align} h-{\pi}(h)=\mathcal{L}f,\end{align}

where $\mathcal L$ is the generator of (5) given by

(10) \begin{align} \mathcal L g(x)={\langle}\! -\nabla P(x), \nabla g(x) {\rangle}+\frac1 2{\langle} Q_{\eta,\delta}(x)Q_{\eta,\delta}(x)^\top,\nabla^2g(x) {\rangle}_{\text{HS}}\,, \quad g\in\mathcal D(\mathcal L). \\[-30pt] \nonumber \end{align}

Denote

(11) \begin{align} {\mathcal{Y}}_{\eta}=\frac{1}{m}\sum_{k=0}^{m-1} |\nabla f(\omega_k)|^2,\quad {\mathcal{W}}_{\eta}= \frac{\sqrt{m\eta}(\Pi_{\eta}(h)- \pi(h))}{\sqrt{\delta {\mathcal{Y}}_{\eta}}}.\end{align}

We have the following SNCMD of the SGLD algorithm.

Theorem 2. Suppose Assumptions 1 and 2 hold. Let $\omega_0\sim\pi_{\eta}$ and $h \in \mathrm{Lip}_1(\mathbb{R}^d,\mathbb{R})$ , and set $m=o(\eta^{-2})$ with $m\eta\to\infty$ . Then:

  1. (i) if $m\le\eta^{-13/8}\delta^{-9/8}$ ,

    \begin{align*} \Big| \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} -1 \Big| \, & \le \, C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+x^6\eta^{1/2}\delta^{-1/2}\big) \end{align*}
    uniformly for $0 \le x=o(\eta^{-1/12}\delta^{1/12})$ as $\eta$ tends to zero and m tends to infinity.
  2. (ii) if $m>\eta^{-13/8}\delta^{-9/8}$ ,

    \begin{align*} \Big| \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} -1 \Big| \, & \le \, C\big(x^3(m\eta\delta)^{-1/2}+\sqrt m\eta\delta x+ m^{-1/4}\ln m\big) \end{align*}
    uniformly for $0 \le x=o( (m\eta\delta)^{1/6}\wedge (\sqrt m\eta\delta)^{-1})$ as $\eta$ tends to zero and m tends to infinity.

Moreover, the same results hold when $\mathcal{W}_{\eta}$ is replaced by $-\mathcal{W}_{\eta}$ .
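The statistic $\mathcal{W}_{\eta}$ can be illustrated numerically in a degenerate toy case of our own choosing in which $\psi$ does not depend on $\zeta$ : take $\psi(\omega,\zeta)=\omega^2/2$ in one dimension, so $\Sigma=0$ and $\pi=N(0,\delta/2)$ , and $h(x)=x$ , for which the Stein solution is $f(x)=-x$ , hence $\nabla f\equiv-1$ and $\mathcal{Y}_{\eta}=1$ . Theorem 2 then suggests the tail of $\mathcal{W}_{\eta}$ is close to the standard normal tail:

```python
import numpy as np

rng = np.random.default_rng(5)
eta, delta, m, reps = 0.05, 0.01, 2000, 2000

# Run `reps` independent SGLD chains for the toy loss psi(w) = w^2/2,
# i.e. w_{k+1} = (1 - eta) w_k + sqrt(eta*delta) xi_{k+1}
w = np.zeros(reps)
acc = np.zeros(reps)
for _ in range(m):
    acc += w                       # accumulate w_0, ..., w_{m-1}
    w = (1.0 - eta) * w + np.sqrt(eta * delta) * rng.standard_normal(reps)

pi_h = 0.0                         # pi(h) = 0 by symmetry of pi = N(0, delta/2)
w_eta = np.sqrt(m * eta / delta) * (acc / m - pi_h)  # Y_eta = 1 in this toy case
tail = np.mean(w_eta > 1.645)      # compare with 1 - Phi(1.645) = 0.05
```

The empirical exceedance frequency is close to the normal tail probability, as the moderate deviation bound predicts for x in the stated range.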

Remark 2. To the best of our knowledge, there are currently few SNCMD results for iterative algorithms of this type, so the sharpness of the rate with respect to m is not yet fully understood.

Related and stronger bounds for algorithms or the Euler–Maruyama scheme can be found in [Reference Fan, Hu and Xu9]. Specifically, their result can be written in a form comparable to ours, yielding the upper bound $C\big(x^2m^{-1/4}+(1+x)m^{-1/4}(\ln m)^{1/2}\big)$ . While this bound is sharper, their analysis requires substantially stronger regularity assumptions, including bounded second derivatives of the test function h and the drift term. In contrast, our framework does not impose this boundedness assumption, and under weaker conditions the term with respect to m remains comparable. This is mainly due to martingale moderate deviation results combined with the concentration inequalities for stochastic processes established in [Reference Dedecker and Gouëzel6].

To handle the unbounded setting considered in this work, we employ a truncation method, which introduces the additional error term involving $\eta$ . This term is the primary reason why our bound is less sharp than that of [Reference Fan, Hu and Xu9].

Based on Theorem 2, we also derive the Berry–Esseen bound for the SGLD algorithm.

Theorem 3. Under the assumption of Theorem 2, we have

\begin{align*} \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|\le Cm^{-1/4}\ln m. \end{align*}

Further assuming $m=C\eta^{-2}/|\ln\eta|$ , we have

\begin{align*} \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|\le C\eta^{1/2}|\ln \eta|^{5/4}. \end{align*}

Remark 3. Theorem 3 establishes the convergence rate, in the Kolmogorov distance, of the iterate-averaged self-normalized estimator towards the normal distribution when either the number of iterations m or the step size $\eta$ is fixed. In practice, given a prescribed m, this theorem guides the choice of the constant step size $\eta$ with order $(\ln m/m)^{1/2}$ under which the convergence rate is of order $\eta^{1/2}$ with a logarithmic correction, which is close to the rate in Theorem 1. This result also has the same order as that in [Reference Fan, Hu and Xu9] up to a logarithmic correction, though their result is under stronger conditions. The resulting non-asymptotic bound offers a quantitative guarantee for the accuracy of normal-approximation-based confidence intervals, thereby improving the reliability of inference in finite-sample regimes.

In practical implementations, computing the normalizing factor $\mathcal Y_{\eta}$ defined in equation (11) can be challenging. The main difficulty lies in estimating the derivative of the Stein solution $\nabla f$ in Lemma 3. A Monte Carlo approach may be adopted by simulating multiple trajectories of the SDE (2) starting from x (i.e., $X_t(x)$ ), together with their gradient $Y_t=\nabla_x X_t(x)$ , and then approximating $\nabla f(x)$ via time-averaged integrals of $\nabla h(X_t(x))Y_t$ along these simulated paths. However, this simulation becomes computationally expensive in practice, especially in high-dimensional settings. Moreover, the resulting approximation inevitably introduces additional numerical errors, which may affect the accuracy of the SNCMD bounds. Addressing these computational challenges in practical implementations will be an important direction of our future work.
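As a sanity check on this Monte Carlo strategy, consider a one-dimensional toy case of our own choosing: $P(x)=x^2/2$ with a constant diffusion coefficient, so the derivative flow reduces to ${\mathrm{d}} Y_t=-Y_t\,{\mathrm{d}} t$ (we neglect the $O(\sqrt\eta)$ state dependence of $Q_{\eta,\delta}$ , an assumption of this sketch). For $h(x)=x$ the Stein solution is $f(x)=-x$ , so the estimator should return approximately $-1$ :

```python
import numpy as np

def grad_f_mc(x, grad_h, sig, dt=0.01, t_max=10.0, n_paths=2000, seed=4):
    """Monte Carlo estimate of grad f(x) = -int_0^inf E[grad h(X_t(x)) Y_t] dt
    for dX_t = -X_t dt + sig dB_t, with derivative flow dY_t = -Y_t dt."""
    rng = np.random.default_rng(seed)
    xs = np.full(n_paths, float(x))
    ys = np.ones(n_paths)          # Y_0 = dX_0/dx = 1
    acc = 0.0
    for _ in range(int(t_max / dt)):
        acc += np.mean(grad_h(xs) * ys) * dt   # left-endpoint quadrature
        xs = xs - xs * dt + sig * np.sqrt(dt) * rng.standard_normal(n_paths)
        ys = ys - ys * dt
    return -acc

g = grad_f_mc(0.7, lambda x: np.ones_like(x), sig=0.1)  # h(x) = x, grad h = 1
```

Here the exact answer is $\nabla f(x)=-1$ for every x, and the estimate matches it up to discretisation error.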

Remark 4. The assumption $\omega_0\sim\pi_{\eta}$ in Theorems 2 and 3 is not essential. Due to the exponential ergodicity of the SGLD algorithm, one can extend it to the case in which $\omega_0$ has a sub-Gaussian distribution. The advantage of taking $\omega_0\sim\pi_{\eta}$ is that in the calculation the terms describing the difference between the distribution of $\omega_k$ and $\pi_{\eta}$ will vanish, while in the general case one has to use exponential ergodicity of $\omega_k$ to bound the difference. Since $\omega_k$ converges to $\pi_{\eta}$ exponentially fast, the difference will not cause any great difficulty. For the ease of calculation, we only considered the case of $\omega_0\sim\pi_{\eta}$ .

Example 1. (Linear quadratic regulator policy [Reference Hambly, Xu and Yang16].) We illustrate our results with the linear quadratic regulator (LQR) problem, a fundamental model in optimal control with extensive applications in optimization and reinforcement learning. Practical implementations include autonomous driving and medical treatment systems such as insulin-level regulation in diabetes therapy. It concerns finding an optimal controller for a linear dynamical system

(12) \begin{equation} \min _{\left\{u_t\right\}_{t=0}^{T-1}} \mathrm{E}\left[\sum_{t=0}^{T-1}\left(x_t^{\top} Q_t x_t+u_t^{\top} R_t u_t\right)+x_T^{\top} Q_T x_T \right]\!, \end{equation}

subject to, for $t=0,1,\ldots,T-1$ ,

\begin{align*}x_{t+1}=Ax_t+Bu_t+W_t,\end{align*}

where $A \in \mathbb{R}^{d \times d}$ , $B \in \mathbb{R}^{d \times p}$ , $Q_t \in \mathbb{R}^{d \times d}$ , and $R_t \in \mathbb{R}^{p \times p}$ are given positive definite matrices. The variable $x_t \in \mathbb{R}^d$ denotes the system state, $u_t \in \mathbb{R}^p$ is the control input at time t, and $W_t$ are i.i.d. random noise terms with zero mean and finite second moment.

The optimal control $u_t$ can be represented as a linear state feedback $u_t=-K_t^*x_t$ , where $K_t^*$ denotes the feedback gain matrix. Consequently, the entire control sequence can be parameterized by the feedback gain matrices $\mathbf{K}^* = \{K_0^*, K_1^*, \ldots, K_{T-1}^*\}$ , and the optimization problem can equivalently be written as

\begin{equation*} \min_{\mathbf{K}} \mathrm{E} [C(\mathbf{K})]=\min_{\mathbf{K}}\mathrm{E}\left[\sum_{t=0}^{T-1}\left(x_t^{\top} Q_t x_t+u_t^{\top} R_t u_t\right)+x_T^{\top} Q_T x_T\right]\!. \end{equation*}

More details of the LQR problem can be found in [Reference Bertsekas1]. For a specific time index t, the optimal vectorized gain matrix, denoted by $\operatorname{vec}\{K_t^*\}$ , can be efficiently computed using the SGLD algorithm,

(13) \begin{equation} \operatorname{vec}\{{K}_t^{k+1}\} = \operatorname{vec}\{{K}_t^k\}-\eta\nabla_{t} C\big(\mathbf{K}^k\big)+\sqrt{\eta\delta}\xi_k \end{equation}

where $\nabla_t C(\mathbf{K})$ is the gradient of $C(\mathbf{K})$ with respect to $\operatorname{vec}\{K_t\}$ . In [Reference Hambly, Xu and Yang16, Lemma 3.5], the gradient of the objective function has been explicitly derived, and it can be easily verified that it satisfies Assumptions 1 and 2. Therefore, within the framework developed in this paper, we can verify that the empirical distribution of $\operatorname{vec}\{K_t\}$ , represented by $\frac{1}{m}\sum_{k=0}^{m-1}h(\! \operatorname{vec}\{K_t^{k}\})$ , satisfies a self-normalized moderate deviation principle and a Berry–Esseen bound for some test function h, after appropriate scaling. Moreover, these non-asymptotic results provide a theoretical foundation for constructing reliable confidence intervals for the learned controller, thereby enhancing the reliability of inference and decision-making in practical applications.

3. Auxiliary lemmas for the proof

The strategy for proving our main result is to decompose $\sqrt{m\eta/\delta}(\Pi_{\eta}(h)-{\pi}(h))$ into a martingale term and a remainder term as in (24), and show that the remainder is negligible and that the martingale satisfies the SNCMD. In this section, we give the decomposition and some auxiliary lemmas needed for the proof.

Lemma 1. Under Assumption 1, we have that $\nabla P$ and $Q_{\eta,\delta}$ are Lipschitz and satisfy the dissipative condition, that is, for any $x,y\in \mathbb{R}^d$ ,

(14) \begin{align} |\nabla P(x)-\nabla P(y)| &\le L|x-y|, \\[-8pt] \nonumber \end{align}
(15) \begin{align} {\langle} x-y,-\nabla P(x)+\nabla P(y){\rangle} &\le -K_1|x-y|^2+K_2, \\[-7pt] \nonumber \end{align}
(16) \begin{align} \|Q_{\eta,\delta}(x)-Q_{\eta,\delta}(y)\| &\le C\sqrt \eta |x-y|. \\[8pt] \nonumber \end{align}

We also have that $\nabla \psi$ and $\nabla P$ have linear growth, that is,

(17) \begin{align} |\nabla\psi(x,\zeta)| &\le L|x|+|\nabla\psi(0,\zeta)|, \\[-7pt] \nonumber \end{align}
(18) \begin{align} |\nabla P(x)|&\le L|x|+|\nabla P(0)|. \\[8pt] \nonumber \end{align}

Proof. The proof will be given in Appendix A.

Lemma 2. Under Assumption 1, $(\omega_k)_{k\ge0}$ and the SDE (5) are both exponential ergodic with invariant measures $\pi_{\eta}$ and $\pi$ , respectively.

Proof. The proof will be given in Appendix B.

Lemma 3. Let $h\in \mathrm{Lip}_1(\mathbb{R}^d,\mathbb{R})$ . A solution to Stein’s equation

\begin{align*} h-{\pi}(h)=\mathcal L f \end{align*}

is given by

(19) \begin{align} f(x)=-\int_0^\infty \mathrm{E}[h(X_t(x))-{\pi}(h)]{\mathrm{d}} t,\\[-10pt] \nonumber \end{align}

where $X_t(x)$ is the solution of equation (5) with initial value x. Moreover, there exists a positive constant C such that

(20) \begin{align} |f(x)|&\le C(1+|x|^2), \\[-10pt] \nonumber \end{align}
(21) \begin{align} |\nabla f(x)|&\le C(1+|x|^3), \\[-10pt] \nonumber \end{align}
(22) \begin{align} \|\nabla ^2 f(x)\|&\le C(1+|x|^4), \\[-10pt] \nonumber \end{align}
(23) \begin{align} \sup\nolimits_{y:|y-x|\le1}\frac{\|\nabla ^2 f(x)-\nabla ^2 f(y)\|}{|x-y|}&\le C(1+|x|^{5}). \end{align}

Proof. The proof will be given in Section 5.

We now introduce the decomposition. By Stein’s equation (9),

\begin{align*} \Pi_{\eta}(h)-{\pi}(h)&=\frac{1}{m}\sum_{k=0}^{m-1}\left(h(\omega_k )-{\pi}(h)\right) \nonumber \\[5pt] &=\frac1{m\eta}\sum_{k=0}^{m-1}\left[\mathcal{L}f(\omega_k )\eta-\left(f(\omega_{k+1} )-f(\omega_k )\right)\right]+\frac1{m\eta}\sum_{k=0}^{m-1}\left(\, f(\omega_{k+1} )-f(\omega_k )\right) \nonumber \\[5pt] &=\frac1{m\eta}[f(\omega_m )-f(\omega_0 )]+\frac1{m\eta}\sum_{k=0}^{m-1}\left[\mathcal{L}f(\omega_k )\eta-(f(\omega_{k+1} )-f(\omega_k ))\right]\!.\end{align*}

Equations (1), (10) and the Taylor expansion yield that

\begin{align*} &\mathcal{L}f(\omega_k )\eta-(f(\omega_{k+1} )-f(\omega_k )) \nonumber \\[5pt] &= \frac\eta2{\langle} \nabla^2f(\omega_k ), \eta\Sigma(\omega_k)+\delta I_d{\rangle}_{\mathrm{HS}} -{\langle} \nabla f(\omega_k),\eta\nabla P(\omega_k)-\eta\nabla\psi(\omega_k,\zeta_{k+1}){\rangle} \nonumber \\[5pt] &\quad -\sqrt{\eta\delta}{\langle} \nabla f(\omega_k ),\xi_{k+1}{\rangle} -\int_0^1\int_0^1s{\langle}\nabla^2 f(\omega_k+ss'\Delta\omega_k), \Delta \omega_k\Delta \omega_k^\top{\rangle}_{\mathrm{HS}}{\mathrm{d}} s'{\mathrm{d}} s,\end{align*}

where $\Delta\omega_k=-\eta \nabla \psi(\omega_{k},\zeta_{k+1})+\sqrt{\eta\delta} \xi_{k+1}$ . Thus we have the decomposition

(24) \begin{equation} \frac{\sqrt{m\eta}}{\sqrt\delta}(\Pi_{\eta}(h)-{\pi}(h))= \mathcal{H}_{\eta}+\mathcal{R}_{\eta},\end{equation}

where, as we shall see below, $\mathcal{H}_{\eta}$ is a martingale and $\mathcal{R}_{\eta}$ is a remainder. A similar decomposition can be found in [Reference Lu, Tan and Xu23, Reference Teh, Thiery and Vollmer28]. $\mathcal{H}_{\eta}$ and $\mathcal{R}_{\eta}$ are given by

\begin{align*} \mathcal{H}_{\eta} = -\frac{1}{\sqrt{m}}\sum_{k=0}^{m-1}{\langle}\nabla f(\omega_k ),\xi_{k+1}{\rangle},\quad \mathcal{R}_{\eta}=-\sum_{i=1}^4\mathcal{R}_{\eta,i},\end{align*}

with

\begin{align*} \mathcal{R}_{\eta,1}&=\frac1{\sqrt{m\eta\delta}}(\, f(\omega_0)-f(\omega_m )), \nonumber \\[3pt] \mathcal{R}_{\eta,2}&=\frac{\sqrt\eta}{\sqrt{m\delta}}\sum_{k=0}^{m-1}{\langle} \nabla f(\omega_k),\nabla P(\omega_k)-\nabla\psi(\omega_k,\zeta_{k+1}){\rangle},\nonumber \\[3pt] \mathcal{R}_{\eta,3}&=\frac{1}{\sqrt{m\eta\delta}}\sum_{k=0}^{m-1} \int_0^1\int_0^1s {\langle} \nabla^2 f(\omega_k+ss'\Delta\omega_k)-\nabla^2f(\omega_k ), \Delta \omega_k\Delta \omega_k^\top{\rangle}_{\mathrm{HS}}{\mathrm{d}} s'{\mathrm{d}} s,\nonumber \\[3pt] \mathcal{R}_{\eta,4}&=\frac{1}{2\sqrt{m\eta\delta}}\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), \eta^2\Sigma(\omega_k)+\eta\delta I_d-\Delta\omega_k\Delta\omega_k^\top{\rangle}_{\mathrm{HS}}.\end{align*}

The estimation of $\mathcal H_{\eta}$ and $\mathcal R_{\eta}$ depends on the following two lemmas.

Lemma 4. Suppose that Assumptions 1 and 2 hold. Let $h\in \mathrm{Lip}_1(\mathbb{R}^d,\mathbb{R})$ and $f\;:\;\mathbb{R}^d\to\mathbb{R}$ be the solution of (9). Then the inequality

\begin{align*} \mathrm{P}(|\mathcal{R}_{\eta}|>y)\le C\Big( {\mathrm{e}}^{-cy\eta^{1/2}\delta^{1/2}m^{1/2} } +{\mathrm{e}}^{-cy^{2/5}\delta^{1/5}\eta^{-1/5}} +{\mathrm{e}}^{-cy^{2/9}\eta^{-2/9}\delta^{-2/9} } +{\mathrm{e}}^{-cy^{2/7}\delta^{1/7}\eta^{-3/7}}\Big) \end{align*}

holds for any y satisfying $c_{m,\eta}\le y\le C\eta^{-7/2}\delta^{-7/2}$ , where $c_{m,\eta}=c(\eta^{1/2}\delta^{-1/2}\vee m^{1/2}\eta\delta)$ .

Proof. The proof will be given in Section 6.

Lemma 5. [Reference Fan, Hu and Xu9, Lemma 3.5] Let $\left(\beta_i, \mathcal{F}_i\right)_{i=1, \ldots, m}$ be a finite sequence of martingale differences. Assume that the following conditions hold.

  1. (A1) There exists a number $\epsilon_m \in(0, \frac{1}{2}]$ such that

    \begin{align*} \left|\mathrm{E}[\beta_i^k | \mathcal{F}_{i-1}]\right| \leq \frac{1}{2} k!\epsilon_m^{k-2} \mathrm{E}[\beta_i^2|\mathcal{F}_{i-1}], \quad \text { for all } k \geq 2 \text { and } 1 \leq i \leq m ; \end{align*}
  2. (A2) There exist a number $\delta_m \in(0, \frac{1}{2}]$ and a positive constant C such that for all $x>0$ ,

    \begin{align*} \mathrm{P}\big(\big|\sum_{i=1}^m\mathrm{E}[\beta_i^2|\mathcal{F}_{i-1}]-\mathrm{E}[\beta_i^2]\big| \geq x\big) \leq C \exp \left\{-x^2 \delta_m^{-2}\right\}. \end{align*}

Then the following inequality holds for all $0 \leq x=o\left(\min \left\{\epsilon_m^{-1}, \delta_m^{-1}\right\}\right)$ :

\begin{align*} &\Big|\ln \frac{\mathrm{P}\big(\sum_{i=1}^m\beta_i / \sqrt{\sum_{i=1}^m\mathrm{E}[\beta_i^2|\mathcal{F}_{i-1}]}\ge x\big)}{1-\Phi(x)}\Big| \\[4pt] &\quad\leq C\left(x^3\left(\epsilon_m+\delta_m\right)+(1+x)\left(\delta_m\left|\ln \delta_m\right|+\epsilon_m\left|\ln \epsilon_m\right|\right)\right) . \end{align*}

4. Proof of main result

In this section, we present the proof of our main result. The proof of Theorem 1 is based on Stein’s method as developed in [Reference Fang, Shao and Xu11, Theorem 2.5]. For Theorems 2 and 3, we analyse the self-normalized Cramér-type moderate deviations for martingales and demonstrate that the remainder term $\mathcal{R}_{\eta}$ is negligible. See [Reference Fan, Hu and Xu9, Reference Fan and Shao10] for more details.

Proof of Theorem 1. Let $(\omega_k)_{k\ge0}$ be the Markov chain with initial value $\omega_0\sim\pi_{\eta}$ . The Taylor expansion implies that

(25) \begin{align} 0 &= \mathrm{E}[f(\omega_1)-f(\omega_0)]\nonumber\\[3pt] &= \mathrm{E}\big[{\langle} \nabla f(\omega_0), \Delta\omega_0 {\rangle} +\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'\big]\nonumber\\[3pt] &= \mathrm{E}[{\langle} \nabla f(\omega_0), \Delta\omega_0 {\rangle}]+\frac12\mathrm{E}[{\langle} \nabla^2f (\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}]\nonumber\\[3pt] &\quad +\mathrm{E}\big[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'\big], \end{align}

where $\Delta \omega_0=\omega_1-\omega_0$ . Following (1) and (4), one obtains

\begin{align*} \mathrm{E}[{\langle} \nabla f(\omega_0), \Delta \omega_0 {\rangle}] &= \mathrm{E}[{\langle} \nabla f(\omega_0), \mathrm{E}_0[\Delta \omega_0] {\rangle}]\nonumber\\[4pt] &= \mathrm{E}[{\langle} \nabla f(\omega_0),-\eta\nabla P(\omega_{0}){\rangle}] \end{align*}

and

\begin{align*} \mathrm{E}[{\langle} \nabla^2f (\omega_0),\Delta \omega_0\Delta \omega_0^\top {\rangle}_{\text {HS}}] &= \mathrm{E}[{\langle} \nabla^2f (\omega_0),\mathrm{E}_0[\Delta \omega_0\Delta \omega_0^\top] {\rangle}_{\text {HS}}] \nonumber\\[4pt] &= \mathrm{E}[{\langle} \nabla^2f (\omega_0), \eta^2\nabla P(\omega_{0})\nabla P(\omega_{0})^\top+\eta^2\Sigma(\omega_0)+\delta\eta I_d {\rangle}_{\text {HS}}]. \end{align*}

Recall the generator of $(X_t)_{t\ge0}$ ,

\begin{align*} \mathcal L f(\omega_0)={\langle} \!-\nabla P(\omega_0), \nabla f(\omega_0) {\rangle}+\frac1 2{\langle} \eta\Sigma(\omega_0)+\delta I_d,\nabla^2f(\omega_0) {\rangle}_{\text{HS}}. \end{align*}

Combining the equalities above with (9), for any Lipschitz test function h, we obtain

(26) \begin{align} \mathrm{E}[h(\omega_0)]-\pi(h) &= \mathrm{E}[\mathcal L f(\omega_0)]\nonumber\\[5pt] &= -\frac1\eta \mathrm{E}\left[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'\right]\nonumber\\[5pt] &\quad -\frac12\mathrm{E}[{\langle} \nabla^2f(\omega_0),\eta\nabla P(\omega_0)\nabla P(\omega_0)^\top {\rangle}_{\text{HS}}]. \end{align}

For the integral term of (26), one has

\begin{align*} &\mathrm{E}\left[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'\right]\\[4pt] &\quad=\mathrm{E}\left[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'1_{\{|\Delta\omega_0|\le1\}}\right]\\[4pt] &\qquad+\mathrm{E}\left[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'1_{\{|\Delta\omega_0| > 1\}}\right]\!. \end{align*}

For the first term above, (23) and the Cauchy–Schwarz inequality imply that

\begin{align*} & \left|\mathrm{E}\left[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'1_{\{|\Delta\omega_0|\le1\}}\right]\right|\\[4pt] &\quad\le\mathrm{E}\left[\int_0^1\int_0^1 s' s^2 \frac{|\nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0)|}{|ss'\Delta\omega_0|}|\Delta\omega_0|^3 {\mathrm{d}} s{\mathrm{d}} s'1_{\{|\Delta\omega_0|\le1\}}\right]\\[4pt] &\quad\le C\mathrm{E}\left[(1+|\omega_0+\Delta\omega_0|^5)|\Delta\omega_0|^3 1_{\{|\Delta\omega_0|\le1\}}\right]\\[4pt] &\quad\le C\big(\mathrm{E}\big[(1+|\omega_0+\Delta\omega_0|^5)^2\ 1_{\{|\Delta\omega_0|\le1\}}\big]\big)^{1/2} \big(\mathrm{E}|\Delta\omega_0|^6\big)^{1/2} \le C\eta^{3/2}. \end{align*}

For the second term, (22) and the Cauchy–Schwarz inequality yield

\begin{align*} &\left|\mathrm{E}\left[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s' 1_{\{|\Delta\omega_0| > 1\}}\right]\right|\\[4pt] &\quad\le C\mathrm{E}\big[(1+|\omega_0|^4+|\Delta\omega_0|^4)|\Delta\omega_0|^21_{\{|\Delta\omega_0|>1\}}\big]\\[4pt] &\quad\le C\big(\mathrm{E}\big[(1+|\omega_0|^4+|\Delta\omega_0|^4)^2|\Delta\omega_0|^4\big]\big)^{1/2}\big(\mathrm{E}1_{\{|\Delta\omega_0| > 1\}}\big)^{1/2}. \end{align*}

By the Markov inequality and (1), we obtain

\begin{align*} \mathrm{E}1_{\{|\Delta\omega_0| > 1\}}=\mathrm{P}(|\Delta\omega_0| > 1)\le \mathrm{E}|\Delta\omega_0|^6\le C\eta^3. \end{align*}

Since $\mathrm{E}\left[(1+|\omega_0|^4+|\Delta\omega_0|^4)^2|\Delta\omega_0|^4\right]$ is bounded, we obtain

(27) \begin{align} \Big|\mathrm{E}\big[\int_0^1\int_0^1 s{\langle} \nabla^2f\big(\omega_0+ss'\Delta\omega_0\big)-\nabla^2f(\omega_0),\Delta\omega_0\Delta\omega_0^\top {\rangle}_{\text {HS}}{\mathrm{d}} s{\mathrm{d}} s'\big]\Big| \le C\eta^{3/2}. \end{align}

For the second term of (26), similarly to the estimation of (27), (22) implies

(28) \begin{align} \mathrm{E}[{\langle} \nabla^2f(\omega_0),\eta\nabla P(\omega_0)\nabla P(\omega_0)^\top {\rangle}_{\text{HS}}]\le C\eta. \end{align}

Combining (26)–(28) and taking the supremum over all 1-Lipschitz test functions h, noting that $\omega_0\sim\pi_{\eta}$, we have

\begin{align*} W_1(\pi,\pi_{\eta})\le C\eta^{1/2}. \end{align*}

Proof of Theorem 2. According to the decomposition in (24), we have

\begin{equation*} \frac{\sqrt{m\eta}}{\sqrt\delta}(\Pi_{\eta}(h)-{\pi}(h))= \mathcal{H}_{\eta}+\mathcal{R}_{\eta}. \end{equation*}

Thus, for any $x>0$ and $0 < y < x$ , we have

(29) \begin{align} \mathrm{P}(\mathcal W_{\eta}> x)\le \mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y)+\mathrm{P}(\mathcal R_{\eta}/\sqrt{\mathcal Y_{\eta}}>y). \end{align}

Recall that

\begin{align*} \mathcal{H}_{\eta} = -\frac{1}{\sqrt{m}}\sum_{k=0}^{m-1}{\langle}\nabla f(\omega_k ),\xi_{k+1}{\rangle},\quad {\mathcal{Y}}_{\eta}=\frac{1}{m}\sum_{k=0}^{m-1} |\nabla f(\omega_k)|^2. \end{align*}

We denote

\begin{align*} \widehat{\nabla f}(\omega_k)=\nabla f(\omega_k)1_{\{|\omega_k|\le m^{1/12}\}},\quad \hat{\mathcal{Y}}_{\eta}=\frac{1}{m}\sum_{k=0}^{m-1} |\widehat{\nabla f}(\omega_k)|^2. \end{align*}

For the probability $\mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y)$ , we have

(30) \begin{align} &\frac{\mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y )}{1-\Phi(x)} \nonumber\\[4pt] &\quad\le\frac{\mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y, \,|\omega_k|\le m^{1/12}\,\text{for any}\, k \in [0, m-1])}{1-\Phi(x-y)}\frac{1-\Phi(x-y)}{1-\Phi(x)} \nonumber \\[4pt] &\qquad +\ \frac{\sum_{k=0}^{m-1}\mathrm{P}(|\omega_k|>m^{1/12})}{1-\Phi(x)}. \end{align}

For the first term above,

\begin{align*} &\frac{\mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y, \,|\omega_k|\le m^{1/12}\,\text{for any}\, k \in [0, m-1])}{1-\Phi(x-y)}\\[4pt] &\quad= \frac{\mathrm{P}\left(\frac{1}{\sqrt{m \hat {\mathcal{Y}}_{\eta}}}\sum_{k=0}^{m-1}{\langle}\widehat{\nabla f}(\omega_k ),\xi_{k+1}{\rangle} >x-y, \,|\omega_k|\le m^{1/12}\,\text{for any}\, k \in [0, m-1]\right)}{1-\Phi(x-y)}\\[4pt] &\quad\le \frac{\mathrm{P}\left(\frac{1}{\sqrt{m \hat {\mathcal{Y}}_{\eta}}}\sum_{k=0}^{m-1}{\langle}\widehat{\nabla f}(\omega_k ),\xi_{k+1}{\rangle} >x-y\right)}{1-\Phi(x-y)}. \end{align*}

It is easy to see that $\Big(\frac{1}{\sqrt m}{\langle} \widehat{\nabla f}(\omega_k),\xi_{k+1}{\rangle}, \mathcal{F}_{k+1} \Big)_{k\ge0}$ is a sequence of martingale differences and $\sum_{k=0}^{m-1}\mathrm{E}_k[\frac{1}{m}{\langle} \widehat{\nabla f}(\omega_k),\xi_{k+1} {\rangle}^2]=\hat{\mathcal{Y}}_{\eta}$. Since $\xi_{k+1}$ is a standard normal random vector, the martingale differences satisfy the conditional Bernstein condition, and hence condition (A1) of Lemma 5 is satisfied. For (A2),

(31) \begin{align} \mathrm{P}\big( |\hat{\mathcal{Y}}_{\eta}-\mathrm{E}\hat{\mathcal{Y}}_{\eta}|\ge x' \big)&= \mathrm{P}\left( \left|\sum_{k=0}^{m-1}(|\widehat{\nabla f}(\omega_k)|^2-\mathrm{E}|\widehat{\nabla f}(\omega_k)|^2) \right|\ge mx' \right) \nonumber\\[4pt] &\le 2\exp\{-c\, m^{1/2}x'^2\}, \end{align}

where the last inequality follows from [Reference Dedecker and Gouëzel6, Theorem 2]. Thus the conditions of Lemma 5 are satisfied with $\varepsilon_m=m^{-1/4}$ and $\delta_m=m^{-1/4}$ therein. By Lemma 5, we obtain for all $0 \le x=o(m^{1/4})$ ,

\begin{align*} &\frac{\mathrm{P}\left(\frac{1}{\sqrt{m \hat {\mathcal{Y}}_{\eta}}}\sum_{k=0}^{m-1}{\langle}\widehat{\nabla f}(\omega_k ),\xi_{k+1}{\rangle}>x-y\right)}{1-\Phi(x-y)}\\[4pt] &\quad\le \exp\{C((x-y)^3m^{-1/4}+(1+x-y)m^{-1/4}\ln m)\}. \end{align*}

For the tail of the normal distribution, one has the estimation

\begin{align*} \frac{1}{\sqrt{2\pi}(1+x)} {\mathrm{e}}^{-{x^2}/{2}}\le 1-\Phi(x)\le \frac{1}{\sqrt{\pi}(1+x)} {\mathrm{e}}^{-{x^2}/{2}}, \quad x \ge 0, \end{align*}

and

\begin{align*} \frac{1-\Phi(x-y)}{1-\Phi(x)} = 1+\frac{\int_{x-y}^x {\mathrm{e}}^{-s^2/2}{\mathrm{d}} s}{\int_x^\infty {\mathrm{e}}^{-s^2/2}{\mathrm{d}} s} \le 1+(1+x)y{\mathrm{e}}^{x^2/2-(x-y)^2/2} \le {\mathrm{e}}^{Cxy}. \end{align*}
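These standard Gaussian tail bounds can be checked numerically; the sketch below (stdlib only; the grid and function names are our own) uses $1-\Phi(x)=\tfrac12\operatorname{erfc}(x/\sqrt2)$ and verifies the two-sided estimate on a fine grid.

```python
import math

def normal_tail(x):
    """Q(x) = 1 - Phi(x) for the standard normal law, via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_bounds_hold(xs):
    """Check  e^{-x^2/2}/(sqrt(2*pi)(1+x)) <= Q(x) <= e^{-x^2/2}/(sqrt(pi)(1+x))."""
    for x in xs:
        g = math.exp(-x * x / 2.0)
        lower = g / (math.sqrt(2.0 * math.pi) * (1.0 + x))
        upper = g / (math.sqrt(math.pi) * (1.0 + x))
        if not (lower <= normal_tail(x) <= upper):
            return False
    return True

grid = [k / 100.0 for k in range(0, 801)]   # x in [0, 8]
```

Both inequalities hold at every grid point, consistent with the display above.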

Thus, for the first term of (30), we obtain for all $0 \le x=o(m^{1/4})$ ,

\begin{align*} &\frac{\mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y, \,|\omega_k|\le m^{1/12}\,\text{for any}\, k \in [0, m-1] )}{1-\Phi(x-y)}\frac{1-\Phi(x-y)}{1-\Phi(x)}\\[4pt] &\quad\le \exp\big\{C\big((x-y)^3m^{-1/4}+(1+x-y)m^{-1/4}\ln m+xy\big)\big\}. \end{align*}

For the second term of (30), Lemma 7 and the Markov inequality yield that for all $0\le x =o(m^{1/12}) ,$

\begin{align*} \frac{\sum_{k=0}^{m-1}\mathrm{P}(|\omega_k|>m^{1/12})}{1-\Phi(x)} &\le \sum_{k=0}^{m-1} \sqrt{2\pi}(1+x)\mathrm{E}\exp\{C|\omega_k|^2\}{\mathrm{e}}^{-Cm^{\frac1{6}}+x^2/2}\\[4pt] &\le C \exp\{-c(m^{\frac1{6}}-x^2)\}. \end{align*}

Combining the above estimation for (30), we obtain for all $0 \le x=o(m^{1/12})$ ,

(32) \begin{align} &\frac{\mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x-y )}{1-\Phi(x)} \nonumber\\[4pt] &\le \exp\big\{C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+xy\big)\big\}+ C \exp\{-c(m^{1/6}-x^2)\}. \end{align}

We now estimate the remainder term $\mathcal{R}_{\eta}$ ,

\begin{align*} \mathrm{P}(\mathcal{R}_{\eta}/\sqrt{{\mathcal{Y}}_{\eta}}\ge y) &\le \mathrm{P}\left(\mathcal{R}_{\eta}/\sqrt{{\mathcal{Y}}_{\eta}}\ge y,\, {\mathcal{Y}}_{\eta}\ge \frac12\mathrm{E}{\mathcal{Y}}_{\eta}\right)+\mathrm{P}\left({\mathcal{Y}}_{\eta}\le \frac12\mathrm{E}{\mathcal{Y}}_{\eta}\right)\\[4pt] &\le \mathrm{P}(\mathcal{R}_{\eta}\ge y\sqrt{\mathrm{E}{\mathcal{Y}}_{\eta}/2})+\mathrm{P}\left( \mathrm{E}{\mathcal{Y}}_{\eta}-{\mathcal{Y}}_{\eta}\ge\frac12\mathrm{E}{\mathcal{Y}}_{\eta}\right). \end{align*}

According to Lemma 4, we have

\begin{align*} \mathrm{P}(\mathcal R_{\eta} & \ge y\sqrt{\mathrm{E}{\mathcal{Y}}_{\eta}/2})\\ &\le C\Big({\mathrm{e}}^{-c\eta^{-1/5}\delta^{1/5}y^{2/5}}1_{\{y < m^{-5/6}\eta^{-7/6}\delta^{-1/2}\}}+{\mathrm{e}}^{-cm^{1/2}\eta^{1/2}\delta^{1/2}y}1_{\{y\ge m^{-5/6}\eta^{-7/6}\delta^{-1/2}\}}\Big), \end{align*}

for $c_{m,\eta}\le y$ . Similarly to the calculation of (31), we obtain

\begin{align*} \mathrm{P}( \mathrm{E}{\mathcal{Y}}_{\eta}-{\mathcal{Y}}_{\eta}\ge\mathrm{E}{\mathcal{Y}}_{\eta}/2)\le \mathrm{P}\big( \mathrm{E}\hat{\mathcal{Y}}_{\eta}-\hat{\mathcal{Y}}_{\eta}\ge \mathrm{E}{\mathcal{Y}}_{\eta}/2 \big)+\sum_{k=0}^{m-1}\mathrm{P}(|\omega_k|\ge m^{1/12})\le C{\mathrm{e}}^{-cm^{1/6}}. \end{align*}

This yields

(33) \begin{align} & \frac{\mathrm{P}(\mathcal R_{\eta}\ge y\sqrt{\mathrm{E}{\mathcal{Y}}_{\eta}/2})}{1-\Phi(x)} \nonumber\\[4pt] &\quad \le C\big(\!\exp\{-c(m^{1/6}-x^2)\}+\exp\{-c\eta^{-1/5}\delta^{1/5}y^{2/5}+x^2\}1_{\{c_{m,\eta}\le y\le m^{-5/6}\eta^{-7/6}\delta^{-1/2}\}}\nonumber\\[4pt] &\qquad +\exp\{-cm^{1/2}\eta^{1/2}\delta^{1/2}y+x^2\}1_{\{y\ge m^{-5/6}\eta^{-7/6}\delta^{-1/2}\}}\big). \end{align}

For the case $m\le\eta^{-{13}/8}\delta^{-9/8}$, combining (29), (32), and (33) with $y=x^5\eta^{1/2}\delta^{-1/2}+\eta^{1/2}\delta^{-1/2} |\ln\eta|$, we have that

\begin{align*} \frac{\mathrm{P}(\mathcal W_{\eta} > x)}{1-\Phi(x)} &\le \exp\big\{C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+x^6\eta^{1/2}\delta^{-1/2}+x\eta^{1/2}\delta^{-1/2}|\ln\eta|\big)\big\}\\[4pt] &+ C\big(\exp\{-c(x^5+|\ln\eta|)^{2/5}\}1_{\{0\le x < m^{-1/6}\eta^{-1/3}\}}+\exp\{-c(m^{1/6}-x^2)\}\\[4pt] &\quad\quad+\exp\{-c(x^5\eta m^{1/2})\}1_{\{x\ge m^{-1/6}\eta^{-1/3}\}}\big)\\[4pt]&\le 1+ C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+x^6\eta^{1/2}\delta^{-1/2}\big) \end{align*}

holds uniformly for $0\le x =o(\eta^{-1/12}\delta^{1/12})$ . On the other hand, we similarly obtain

\begin{align*} \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} &\ge \mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x+y)-\mathrm{P}(\mathcal R_{\eta}/\sqrt{\mathcal Y_{\eta}}<-y)\\[4pt] &\ge 1- C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+x^6\eta^{1/2}\delta^{-1/2}\big). \end{align*}

Thus, we obtain

\begin{align*} \left| \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} -1 \right| \le C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+x^6\eta^{1/2}\delta^{-1/2}\big) \end{align*}

uniformly for $0\le x=o(\eta^{-1/12}\delta^{1/12})$ as $\eta$ tends to zero and m tends to infinity.

For the case $m>\eta^{-13/8}\delta^{-9/8}$ , it is easy to verify that $c_{m,\eta}\ge m^{-5/6}\eta^{-7/6}\delta^{-1/2}$ . Combing (29), (32) and (33) with $y=x^2(m\eta\delta)^{-1/2}+\sqrt m\eta\delta$ , we obtain

\begin{align*} \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} &\le \exp\big\{C\big(x^3m^{-1/4}+(1+x)m^{-1/4}\ln m+xy\big)\big\}\\[4pt] &\quad + C\big(\exp\{-cm^{1/2}\eta^{1/2}\delta^{1/2}y+x^2\}+\exp\{-c(m^{1/6}-x^2)\}\big)\\[4pt] &\le 1+ C\big(x^3(m\eta\delta)^{-1/2}+\sqrt m\eta\delta x+ m^{-1/4}\ln m\big), \end{align*}

which holds uniformly for $0\le x =o( (m\eta\delta)^{1/6}\wedge (\sqrt m\eta\delta)^{-1})$. On the other hand, using similar arguments, we have

\begin{align*} \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} &\ge \mathrm{P}(\mathcal H_{\eta}/\sqrt{\mathcal Y_{\eta}}>x+y)-\mathrm{P}(\mathcal R_{\eta}/\sqrt{\mathcal Y_{\eta}}<-y)\\[4pt] &\ge 1- C\big(x^3(m\eta\delta)^{-1/2}+\sqrt m\eta\delta x+ m^{-1/4}\ln m\big). \end{align*}

Thus,

\begin{align*} \left| \frac{\mathrm{P}(\mathcal W_{\eta}> x)}{1-\Phi(x)} -1 \right| & \le C\big(x^3(m\eta\delta)^{-1/2}+\sqrt m\eta\delta x+ m^{-1/4}\ln m\big) \end{align*}

uniformly for $0\le x=o( (m\eta\delta)^{1/6}\wedge (\sqrt m\eta\delta)^{-1})$ as $\eta$ tends to zero and m tends to infinity.

Proof of Theorem 3. For the case $m\le\eta^{-{13}/8}\delta^{-9/8}$ , denote $C_{m,\eta}=\eta^{{-{1}/{24}}}\delta^{{1}/{24}} $ . It is easy to obtain the following decomposition:

\begin{align*} & \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta} < x)-\Phi(x)|\\[4pt]&\quad \le \sup\nolimits_{x\le -C_{m,\eta}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|+\sup\nolimits_{-C_{m,\eta}\le x\le 0 }|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|\\[4pt] &\qquad +\sup\nolimits_{0\le x\le C_{m,\eta}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|+\sup\nolimits_{x> C_{m,\eta}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|\\[4pt] & \quad \;=\!:\; I_1+I_2+I_3+I_4. \end{align*}

For $I_1$ and $I_4$ , Theorem 2 implies

\begin{align*} I_1 &= \sup\nolimits_{x\le -C_{m,\eta}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|\\[4pt] & \le \sup\nolimits_{x\le -C_{m,\eta}}\mathrm{P}(\mathcal{W}_{\eta}\le x)+\Phi(\! -C_{m,\eta})\\[4pt] &\le \Phi(\!-C_{m,\eta}){\mathrm{e}}^C+\Phi(\! -C_{m,\eta})\le Cm^{-1/4}\ln m. \end{align*}

Similarly,

\begin{align*} I_4\le Cm^{-1/4}\ln m. \end{align*}

For $I_2$ and $I_3$ , Theorem 2 and the inequality $|{\mathrm{e}}^x-1|\le|x|{\mathrm{e}}^{|x|}$ imply

\begin{align*} I_2 &= \sup\nolimits_{-C_{m,\eta}\le x\le 0 }|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)|\\[4pt] &\le \sup\nolimits_{-C_{m,\eta}\le x\le 0 }C\Phi(x)\big(|x|^3m^{-1/4}+(1+|x|)m^{-1/4}\ln m+x^6\eta^{1/2}\delta^{-1/2}\big)\\[4pt] &\le Cm^{-1/4}\ln m. \end{align*}

Similarly,

\begin{align*} I_3\le Cm^{-1/4}\ln m. \end{align*}

Combining the estimations for the terms $I_1,\ldots,I_4$ , we have

\begin{align*} \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta}< x)-\Phi(x)| \le Cm^{-1/4}\ln m. \end{align*}

For the case $m>\eta^{-13/8}\delta^{-9/8}$ , taking $C_{m,\eta}= (m\eta\delta)^{1/12}\wedge (\sqrt m\eta\delta)^{-1/2}$ instead of $m^{1/24}$ , we can similarly show that

\begin{align*} \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta} < x)-\Phi(x)| \le Cm^{-1/4}\ln m. \end{align*}

Thus, for any $\eta^{-1} < m=o(\eta^{-2})$ , we have

\begin{align*} \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta}< x)-\Phi(x)| \le Cm^{-1/4}\ln m. \end{align*}

Further assuming $m=C\eta^{-2}/|\ln\eta|$ , we obtain

\begin{align*} \sup\nolimits_{x\in\mathbb{R}}|\mathrm{P}(\mathcal{W}_{\eta}\le x)-\Phi(x)| \le C\eta^{1/2}|\ln \eta|^{5/4}. \end{align*}

5. Proof of Lemma 3

The proof of Lemma 3 follows from [Reference Gilbarg and Trudinger14, Corollary 6.3]. For ease of reading, their result is given below. Let $\Omega$ be an open subset of $\mathbb{R}^d$ and $\alpha\in(0,1]$. For any function f defined on $\mathbb{R}^d$, denote

\begin{align*} \|f\|_{0;\Omega}&=\sup\nolimits_{x\in\Omega}\|f(x)\|,\\[4pt] \,[f]_{\alpha;\Omega}&=\sup\nolimits_{x,y\in\Omega, x\neq y}\frac{\|f(x)-f(y)\|}{\|x-y\|^{\alpha}},\\[4pt] \|f\|_{0, \alpha;\Omega}&=\|f\|_{0;\Omega}+[f]_{\alpha;\Omega}.\end{align*}

Let $C^{k}(\mathbb{R}^d)$ , where $k \ge1$ , denote the collection of all kth-order continuously differentiable functions on $\mathbb{R}^d$ . $C^{k,\alpha}(\mathbb{R}^d)$ , with $\alpha\in(0,1]$ , refers to the collection of kth-order continuously differentiable functions whose kth-order partial derivatives are $\alpha$ -Hölder continuous. For the case $k=0$ , we simplify the notation to $C^{\alpha}(\mathbb{R}^d)$ .

Lemma 6. [Reference Gilbarg and Trudinger14, Corollary 6.3] Let $f\in C^{2,\alpha}(\Omega)$ , $h \in C^{\alpha}(\bar{\Omega})$ satisfy $\mathcal{L}f=h$ in a bounded domain $\Omega$ where

\begin{equation*} \mathcal{L}f(x)={\langle} a(x), \nabla^2 f(x) {\rangle}_{\mathrm{HS}}+{\langle} b(x),\nabla f(x){\rangle} \end{equation*}

is strictly elliptic and its coefficients are in $ C^{\alpha}(\bar{\Omega})$ . Then if $ \Omega' \subset \subset \Omega$ with $\mathrm{dist}(\Omega', \partial \Omega) \geq \bar d$ , there is a constant C such that

(34) \begin{equation} \bar d \| \nabla f \|_{0; \Omega'} + \bar d^2 \| \nabla^2 f \|_{0; \Omega'} + \bar d^{2+\alpha} [ \nabla^2 f ]_{\alpha; \Omega'} \leq C ( \| f \|_{0; \Omega} + \|h \|_{0, \alpha; \Omega}), \end{equation}

where the positive constant C depends only on the ellipticity constant and the $ C^{\alpha}(\bar{\Omega}) $ norms of the coefficients of $\mathcal{L}$ .

Proof of Lemma 3. The existence of, and the expression for, the solution f can be proved similarly to [Reference Fang, Shao and Xu11, Proposition 6.1]. We now establish its regularity. According to (19),

\begin{align*} |f(x)|\le \int_0^\infty |\mathrm{E}[h(X_t(x))]-{\pi}(h)|{\mathrm{d}} t\le \int_0^\infty C(1+|x|^2){\mathrm{e}}^{-ct}{\mathrm{d}} t\le C(1+|x|^2), \end{align*}

where the second inequality follows from (54).

For any $x\in\mathbb{R}^d$ , define $r(x)=\frac{1}{2(1+|x|)}\in(0,\frac12]$ and

\begin{align*}B_{r(x)}(x)=\{z\in\mathbb{R}^d\;:\; |x-z|\le r(x)\}.\end{align*}

Consider $\Omega=B_{r(y)}(y)$ and $\Omega'=B_{r(y)/2}(y)$ for any $y\in \mathbb{R}^d$ in Lemma 6. Then we have

\begin{align*}\mathrm{dist}(\Omega',\partial\Omega)\ge \frac{r(y)}{2}=\frac{1}{4(1+|y|)}.\end{align*}

Therefore, we take $\bar d=\frac{1}{4(1+|y|)}$ . Taking $\alpha=1$ in Lemma 6 and considering the operator (10),

\begin{equation*} \mathcal L f={\langle} -\nabla P, \nabla f {\rangle}+\frac1 2{\langle} Q_{\eta,\delta},\nabla^2f {\rangle}_{\text{HS}}. \end{equation*}

Since $Q_{\eta,\delta}$ is positive definite, the operator $\mathcal{L}$ is strictly elliptic; moreover, (14) and (16) yield that its coefficients are Lipschitz functions on $\bar\Omega$, so the conditions of Lemma 6 are satisfied. Then we have

(35) \begin{align} r(y) \| \nabla f \|_{0; \Omega'} \leq C ( \| f \|_{0; \Omega} + \|h \|_{0, 1; \Omega}), \\[-28pt] \nonumber \end{align}
(36) \begin{align} r(y)^2 \| \nabla^2 f \|_{0; \Omega'} \leq C ( \| f \|_{0; \Omega} + \|h \|_{0, 1; \Omega}), \\[-28pt] \nonumber \end{align}
(37) \begin{align} r(y)^3 [ \nabla^2 f ]_{1; \Omega'} \leq C ( \| f \|_{0; \Omega} + \|h \|_{0, 1; \Omega}). \end{align}

For inequality (21), since

\begin{align*}\int_{\mathbb{R}^d}r(x){\mathrm{d}} x=\infty, \end{align*}

for any $0 < r_0\le1$ , we have

\begin{align*} B_{r_0}(x)\subset \bigcup_{y\in B_{r_0}(x)} B_{r(y)/2}(y)=\bigcup_{y\in B_{r_0}(x)} \Omega'. \end{align*}

Combining with (35), we obtain

\begin{align*} \| \nabla f \|_{0; B_{r_0}(x)} \le \sup\nolimits_{y\in B_{r_0}(x)} \| \nabla f \|_{0; B_{r(y)/2}(y)}\le \sup\nolimits_{y\in B_{r_0}(x)} C(1+|y|)( \| f \|_{0; \Omega} + \|h \|_{0, 1; \Omega}). \end{align*}

Inequality (20) implies that

\begin{align*} \| f \|_{0; \Omega}\le \sup\nolimits_{z\in B_{r(y)}(y)}|f(z)|\le \sup\nolimits_{z\in B_{r(y)}(y)}C(1+|z|^2)\le C(1+|y|^2). \end{align*}

Since h is a Lipschitz function, we have

\begin{align*} \| h \|_{0,1; \Omega}\le \sup\nolimits_{z\in B_{r(y)}(y)}|h(z)|+\sup\nolimits_{z_1,z_2\in B_{r(y)}(y)}\frac{|h(z_1)-h(z_2)|}{|z_1-z_2|}\le C(1+|y|^2). \end{align*}

Thus,

\begin{align*} \| \nabla f \|_{0; B_{r_0}(x)} \le \sup\nolimits_{y\in B_{r_0}(x)} C(1+|y|)( 1+|y|^2)\le C(1+|x|^3), \end{align*}

which yields

\begin{align*} | \nabla f(x) | \le C(1+|x|^3). \end{align*}

Similarly, (36) and (37) imply

\begin{align*} \| \nabla^2 f \|_{0; B_{r_0}(x)} \le C(1+|x|^4), \end{align*}
\begin{align*} [\nabla^2 f ]_{1; B_{r_0}(x)} \le C(1+|x|^5). \end{align*}

Thus, we obtain (22) and (23).

6. Estimation of the remainder $ \mathcal{R}_{\eta}$

In this section we give several lemmas on $\mathcal{R}_{\eta}$ that play a crucial role in proving the main results. In order to estimate the tail probability of $\mathcal{R}_{\eta}$, we need the following four lemmas, the first three of which pave the way for proving Lemma 4.

Lemma 7. For small enough $\gamma>0$ , one has

\begin{align*} \mathrm{E}\exp\{\gamma|\omega_{k}|^2\}\le C, \end{align*}

for any k.

Proof. For small enough $\gamma>0$ and any k, (1) implies

\begin{align*} \mathrm{E}\exp\{\gamma|\omega_{k+1}|^2\} &= \mathrm{E}\left[\exp\big\{\gamma(|\omega_k|^2+|\eta \nabla \psi(\omega_k, \zeta_{k+1})|^2+2{\langle}\omega_k,-\eta \nabla \psi(\omega_k, \zeta_{k+1}){\rangle})\big\}\right. \\[5pt] & \left.\quad \cdot\mathrm{E}_k[\exp \{\eta \delta \gamma|\xi_{k+1}|^2+2\gamma\langle\omega_k-\eta \nabla \psi(\omega_k, \zeta_{k+1}), \sqrt{\eta \delta} \xi_{k+1}\rangle \}|\zeta_{k+1}]\right] \end{align*}

By a straightforward calculation of the conditional expectation with respect to the Gaussian random variable $\xi_{k+1}$ , we have

\begin{align*} &\mathrm{E}_k[\exp \{\eta \delta \gamma|\xi_{k+1}|^2+2\gamma\langle\omega_k-\eta \nabla \psi(\omega_k, \zeta_{k+1}), \sqrt{\eta \delta} \xi_{k+1}\rangle \}| \zeta_{k+1}]\\[5pt] &\quad = \frac{1}{\sqrt{1-2 \eta \delta \gamma}} \exp \left\{ \frac{2\eta \delta \gamma^2}{1-2 \eta\delta\gamma}\left|\omega_k-\eta \nabla \psi\left(\omega_k, \zeta_{k+1}\right)\right|^2 \right\}\\[5pt] &\quad \le \frac{1}{\sqrt{1-2 \eta \delta \gamma}} \exp \left\{ \frac{4\eta \delta \gamma^2}{1-2 \eta\delta\gamma}(|\omega_k|^2+\eta^2 |\nabla \psi\left(\omega_k, \zeta_{k+1}\right)|^2) \right\}. \end{align*}

Here $\gamma$ is chosen to be small enough that $1-2\eta\delta\gamma>0$. It follows that

\begin{align*} &\mathrm{E}\exp\{\gamma|\omega_{k+1}|^2\}\\[5pt] &\quad\le\frac{1}{\sqrt{1-2\eta\delta\gamma}}\mathrm{E}\Big[\exp \big\{\big(1+\frac{4 \eta \delta \gamma}{1-2 \eta \delta \gamma}\big) \gamma|\omega_k|^2+\big(1+\frac{4 \eta \delta \gamma}{1-2 \eta \delta \gamma}\big) \gamma \eta^2 |\nabla \psi(\omega_k, \zeta_{k+1})|^2\\[5pt] &\qquad +2 \gamma{\langle} \omega_k,-\eta \nabla \psi(\omega_k,\zeta_{k+1}){\rangle}\big\}\Big]\\[5pt] &\quad\le \frac{\exp\{2\gamma\eta K_2\}}{\sqrt{1-2\eta\delta\gamma}}\mathrm{E}\Big[\exp \big\{\big(1+\frac{4 \eta \delta \gamma}{1-2 \eta \delta \gamma}+2\big(1+\frac{4 \eta \delta \gamma}{1-2 \eta \delta \gamma}\big)\eta^2L^2-K_1\eta\big) \gamma|\omega_k|^2\\[5pt] &\qquad +\big(2\gamma\eta^2\big(1+\frac{4 \eta \delta \gamma}{1-2 \eta \delta \gamma}\big)+\frac{\gamma\eta}{K_1}\big)|\nabla \psi(0, \zeta_{k+1})|^2\big\}\Big]. \end{align*}

Since $\omega_k$ and $\zeta_{k+1}$ are independent, and $\nabla\psi(0,\zeta_{k+1})$ is sub-Gaussian, we can choose small enough $\gamma$ such that

\begin{align*} \mathrm{E}\exp\{\gamma|\omega_{k+1}|^2\} &\le \frac{\exp\{2\gamma\eta K_2\}}{\sqrt{1-2\eta\delta\gamma}}\mathrm{E}\left[\exp \left\{\left(1-\frac12K_1\eta\right)\gamma|\omega_k|^2\right\}\right] \\[5pt] &\times \left( \mathrm{E}\left[\exp\left\{\left(2\gamma\eta\left(1+\frac{4\eta\delta\gamma}{1-2\eta\delta\gamma}\right)+\frac{\gamma}{K_1}\right)|\nabla\psi(0,\zeta_{k+1})|^2 \right\}\right] \right)^\eta\\[5pt] &\le \frac{C^\eta}{\sqrt{1-2\eta\delta\gamma}}\big(\mathrm{E}\exp \{\gamma|\omega_k|^2\}\big)^{1-K_1\eta/2}, \end{align*}

where the last line follows from the Hölder inequality. Inductively, we obtain

\begin{align*} \mathrm{E}\exp\{\gamma|\omega_{k+1}|^2\} &\le \frac{C^\eta}{\sqrt{1-2\eta\delta\gamma}}\big(\mathrm{E}\exp \{\gamma|\omega_k|^2\}\big)^{1-K_1\eta/2}\\[4pt] &\le \Big( \frac{C^\eta}{\sqrt{1-2\eta\delta\gamma}} \Big)^{c/\eta}\big(\mathrm{E}\exp\{\gamma|\omega_0|^2\}\big)^{(1-K_1\eta/2)^{k+1}}\le C. \end{align*}

Thus the exponential moment of $|\omega_k|^2$ exists for any k and all small enough $\gamma$.
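The Gaussian computation used above rests on the identity $\mathrm{E}\exp\{a\xi^2+b\xi\}=(1-2a)^{-1/2}\exp\{b^2/(2(1-2a))\}$ for a scalar standard normal $\xi$ and $a<\frac12$ (the multivariate case factorizes over coordinates). This identity can be sanity-checked numerically; the following stdlib-only sketch, with our own function names and sample values of a and b, compares trapezoidal quadrature against the closed form.

```python
import math

def gaussian_mgf_closed_form(a, b):
    """E exp(a*xi^2 + b*xi) for xi ~ N(0, 1), valid for a < 1/2."""
    return math.exp(b * b / (2.0 * (1.0 - 2.0 * a))) / math.sqrt(1.0 - 2.0 * a)

def gaussian_mgf_numeric(a, b, half_width=30.0, n=120_000):
    """Trapezoidal approximation of the same expectation over [-half_width, half_width]."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n + 1):
        t = -half_width + i * h
        w = 0.5 if i in (0, n) else 1.0     # trapezoid endpoint weights
        total += w * math.exp(-t * t / 2.0 + a * t * t + b * t)
    return h * total / math.sqrt(2.0 * math.pi)

a, b = 0.2, 0.7
mgf_exact = gaussian_mgf_closed_form(a, b)
mgf_numeric = gaussian_mgf_numeric(a, b)
```

The quadrature agrees with the closed form to high precision, as expected for a smooth, rapidly decaying integrand.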

Lemma 8. Let Assumptions 1 and 2 hold, and consider a martingale difference sequence $(\Psi(\omega_k,\theta_{k+1}), \mathcal F_{k+1})_{k\ge0}$ satisfying

(38) \begin{align} \mathrm{E}_k|\Psi(\omega_k,\theta_{k+1})|^i\le C^i(1+|\omega_k|^{\alpha i}+i!), \end{align}

where $\alpha\ge0$ and i is any positive integer. Then for $\sqrt m=o(x)$ , we have

(39) \begin{align} \mathrm{P}\left(\sum_{k=0}^{m-1} {\langle} \nabla f(\omega_k), \Psi(\omega_k,\theta_{k+1}) {\rangle} >x\right)\le C\exp\big\{-c(x^2/m)^{1/(4+\alpha)}\big\}. \end{align}

Similarly,

(40) \begin{align} \mathrm{P}\left(\sum_{k=0}^{m-1} {\langle} \nabla^2 f(\omega_k), \Psi(\omega_k,\theta_{k+1}) {\rangle}_{\mathrm{HS}} >x\right)\le C\exp\big\{-c(x^2/m)^{1/(5+\alpha)}\big\}. \end{align}

Here f is the solution of Stein’s equation given in Lemma 3.

Proof. Denote $\hat\omega_k=\omega_k1_{\{|\omega_k|\le y\}}$ for a large enough y to be chosen later, let $A=\{|\omega_k|\le y,\, k=0,1,\ldots,m-1\}$, and let $A^C$ be its complement. Then we have

(41) \begin{align} &\mathrm{P}\left(\sum_{k=0}^{m-1} {\langle} \nabla f(\omega_k), \Psi(\omega_k,\theta_{k+1}) {\rangle} > x\right)\nonumber\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} {\langle} \nabla f(\omega_k), \Psi(\omega_k,\theta_{k+1}) {\rangle} > x, A\right)+\mathrm{P}\big(A^C\big)\nonumber\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} {\langle} \nabla f(\hat\omega_k), \Psi(\hat\omega_k,\theta_{k+1}) {\rangle} > x\right)+\sum_{k=0}^{m-1}\mathrm{P}\left(|\omega_k|>y\right)\nonumber\\[4pt] &\quad\le {\mathrm{e}}^{-\lambda x}\mathrm{E}\exp\left\{\sum_{k=0}^{m-1}\lambda{\langle} \nabla f(\hat\omega_k),\Psi(\hat\omega_k,\theta_{k+1}){\rangle}\right\} +{\mathrm{e}}^{-\gamma y^2}\sum_{k=0}^{m-1}\mathrm{E}\exp\{\gamma|\omega_k|^2\}, \end{align}

where the last inequality follows from the Markov inequality, $\lambda$ is a positive constant to be chosen later, and $\gamma$ is a sufficiently small positive constant. For the second term of (41), Lemma 7 implies

\begin{align*} {\mathrm{e}}^{-\gamma y^2}\sum_{k=0}^{m-1}\mathrm{E}\exp\{\gamma|\omega_k|^2\}\le mC{\mathrm{e}}^{-\gamma y^2}. \end{align*}

For the first term of (41), it is easy to see that

\begin{align*} &\mathrm{E}\exp\left\{\sum_{k=0}^{m-1}\lambda{\langle} \nabla f(\hat\omega_k),\Psi(\hat\omega_k,\theta_{k+1}){\rangle}\right\} \\[4pt] &\quad = \mathrm{E}\left[\exp\left\{\sum_{k=0}^{m-2}\lambda{\langle} \nabla f(\hat\omega_k),\Psi(\hat\omega_k,\theta_{k+1}){\rangle}\right\} \mathrm{E}_{m-1}\exp\{\lambda{\langle} \nabla f(\hat\omega_{m-1}),\Psi(\hat\omega_{m-1},\theta_{m}){\rangle}\}\right]\!. \end{align*}

Noticing

\begin{align*}\lambda\mathrm{E}_{m-1}{\langle} \nabla f(\hat\omega_{m-1}),\Psi(\hat\omega_{m-1},\theta_{m}){\rangle}=0,\end{align*}

by the Taylor expansion of the conditional expectation above, (21) and (38) imply

\begin{align*} &\mathrm{E}_{m-1}\exp\{\lambda{\langle} \nabla f(\hat\omega_{m-1}),\Psi(\hat\omega_{m-1},\theta_{m}){\rangle}\}\\[4pt] &\quad =1+\sum_{i=2}^{\infty}\frac{\lambda^i}{i!}\mathrm{E}_{m-1}{\langle} \nabla f(\hat\omega_{m-1}), \Psi(\hat\omega_{m-1},\theta_{m}){\rangle}^i\\[4pt] &\quad\le 1+\sum_{i=2}^\infty\frac{(C\lambda)^i}{i!}(1+y^3)^i \big(1+y^{\alpha i}+i!\big)\\[4pt] &\quad\le 1+\frac{(C\lambda y^{3+\alpha})^2}{1-C\lambda y^{3+\alpha}}, \end{align*}

if $C\lambda y^{3+\alpha}<1$ . By induction, we obtain

\begin{align*} \mathrm{E}\exp\!\left\{\sum_{k=0}^{m-1}\lambda{\langle} \nabla f(\hat\omega_k),\Psi(\hat\omega_k,\theta_{k+1}){\rangle}\right\}\le \Big( 1+\frac{(C\lambda y^{3+\alpha})^2}{1-C\lambda y^{3+\alpha}}\Big)^m. \end{align*}

Thus, for (41), we have

\begin{align*} \mathrm{P}\!\left(\sum_{k=0}^{m-1} {\langle} \nabla f(\omega_k), \Psi(\omega_k,\theta_{k+1}) {\rangle} >x\right)\le \bigg( 1+\frac{(C\lambda y^{3+\alpha})^2}{1-C\lambda y^{3+\alpha}}\bigg)^m {\mathrm{e}}^{-\lambda x} + mC{\mathrm{e}}^{-\gamma y^2}. \end{align*}

Let

\begin{align*}\lambda=\frac{x}{2mC^2y^{6+2\alpha}}\end{align*}

and

\begin{align*}y=\left(\frac{x^2}{2mC^2}\right)^{{1}/({8+2\alpha})}.\end{align*}
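With these choices the two exponents balance; the following short check (a reconstructed intermediate step, using only the two displays above and the bound preceding them) makes the rate explicit. Since $y^{8+2\alpha}=x^2/(2mC^2)$,

\begin{align*} \lambda x=\frac{x^2}{2mC^2y^{6+2\alpha}}=y^2=\Big(\frac{x^2}{2mC^2}\Big)^{{1}/({4+\alpha})} \quad\text{and}\quad m\big(C\lambda y^{3+\alpha}\big)^2=\frac{x^2}{4mC^2y^{6+2\alpha}}=\frac{\lambda x}{2}, \end{align*}

while $C\lambda y^{3+\alpha}=y/\sqrt{2m}$. Hence, provided $y/\sqrt{2m}\le\frac18$ (which holds in the relevant range of x), one has $m(C\lambda y^{3+\alpha})^2/(1-C\lambda y^{3+\alpha})\le\frac{3}{4}\lambda x$, so by $1+u\le{\mathrm{e}}^{u}$ the first term of the bound is at most ${\mathrm{e}}^{-\lambda x/4}={\mathrm{e}}^{-c(x^2/m)^{1/(4+\alpha)}}$; the second term $mC{\mathrm{e}}^{-\gamma y^2}$ carries the same exponent $y^2=(x^2/(2mC^2))^{1/(4+\alpha)}$, the prefactor m being absorbed since $\sqrt m=o(x)$.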

Then for large enough m and $\sqrt m=o(x)$ , one obtains

\begin{align*} \mathrm{P}\left(\sum_{k=0}^{m-1} {\langle} \nabla f(\omega_k), \Psi(\omega_k,\theta_{k+1}) {\rangle} >x\right)\le C{\mathrm{e}}^{-c(x^2/m)^{\frac{1}{4+\alpha}}}. \end{align*}

We can show (40) similarly. The details are omitted here.

Proof of Lemma 4. Recalling the definition of $\mathcal{R}_{\eta}$ , we have

\begin{align*} \mathrm{P}(|\mathcal{R}_{\eta}|>y)\le\sum_{i=1}^4\mathrm{P}\left(|\mathcal{R}_{\eta,i}|>\frac{y}{4}\right), \end{align*}

and we shall prove below that the following estimates hold:

(42) \begin{align} \mathrm{P}(|\mathcal{R}_{\eta,1}| > y/4)&\le C {\mathrm{e}}^{-c \sqrt{m\eta\delta}y}, \\[-26pt] \nonumber \end{align}
(43) \begin{align} \mathrm{P}\left(\left\vert\mathcal{R}_{\eta,2}\right\vert>y/4\right) &\le C{\mathrm{e}}^{-c\delta^{1/5} y^{2/5}\eta^{-1/5}}, \\[-26pt] \nonumber \end{align}
(44) \begin{align} \mathrm{P}\left(\left\vert\mathcal{R}_{\eta,3}\right\vert >y/4\right)& \le C{\mathrm{e}}^{-cy^{2/9}\eta^{-2/9}\delta^{-2/9}},\\[-26pt] \nonumber \end{align}
(45) \begin{align} \mathrm{P}\left(\left\vert\mathcal{R}_{\eta,4}\right\vert >y/4\right)& \le C{\mathrm{e}}^{-cy^{2/7}\delta^{1/7}\eta^{-3/7}}+C{\mathrm{e}}^{-cy^{2/5}\delta^{-1/5}\eta^{-1/5}}. \end{align}

Combining these estimates, we immediately get

\begin{equation*} \mathrm{P}(|\mathcal{R}_{\eta}|>y)\le C\Big( {\mathrm{e}}^{-cy\eta^{1/2}\delta^{1/2}m^{1/2} } +{\mathrm{e}}^{-cy^{2/5}\delta^{1/5} \eta^{-1/5}} +{\mathrm{e}}^{-cy^{1/6}\eta^{-5/12}\delta^{-5/12} } +{\mathrm{e}}^{-cy^{2/7}\delta^{1/7}\eta^{-3/7}}\Big), \end{equation*}

for $c(\eta^{1/2}\delta^{-1/2}\vee m^{1/2}\eta\delta)\le y\le C\eta^{-7/2}\delta^{-7/2}$ . We now show (42)–(45).

(a) Control of $\mathcal{R}_{\eta,1}$ . By the Markov inequality and (20),

\begin{align*} \mathrm{P}(|\mathcal{R}_{\eta,1}|>y/4) &= \mathrm{P}(\gamma|f(\omega_0)-f(\omega_m)|>\gamma \sqrt{m\eta\delta}y/4)\\[4pt] &\le \mathrm{E}\exp\{C\gamma(1+|\omega_0|^2+|\omega_m|^2)\}{\mathrm{e}}^{-\gamma \sqrt{m\eta\delta}y/4}\\[4pt] &\le (\mathrm{E}\exp\{2C\gamma|\omega_0|^2\})^{1/2}(\mathrm{E}\exp\{2C\gamma|\omega_m|^2\})^{1/2}{\mathrm{e}}^{-\gamma\sqrt{m\eta\delta}y/4+C\gamma}, \end{align*}

where $\gamma$ is a positive constant. Lemma 7 implies that the exponential moments of $\omega_0$ and $\omega_m$ are finite for small enough $\gamma$ . Thus

\begin{align*} \mathrm{P}(|\mathcal{R}_{\eta,1}|>y/4) \le C {\mathrm{e}}^{-c \sqrt{m\eta\delta}y}. \end{align*}

(b) Control of $\mathcal{R}_{\eta,2}$ . According to the definition of $\mathcal{R}_{\eta,2}$ , we have

\begin{align*} \mathrm{P}\left(\mathcal{R}_{\eta,2}>y/4\right) = \mathrm{P}\left( \sum_{k=0}^{m-1}{\langle} \nabla f(\omega_k),\nabla P(\omega_k)-\nabla\psi(\omega_k,\zeta_{k+1}){\rangle} >\frac{\sqrt{m\delta}y}{4\sqrt\eta} \right). \end{align*}

Since $\nabla\psi(0,\zeta_{k+1})$ is sub-Gaussian from Assumption 2, we have

\begin{align*} \mathrm{E}|\nabla\psi(0,\zeta_{k+1})|^i\le C^ii!. \end{align*}

By (17) and (18), we have

\begin{align*} \mathrm{E}_{k}|\nabla P(\omega_{k})-\nabla\psi(\omega_{k},\zeta_{k+1})|^i &\le \mathrm{E}_k\big[2L|\omega_k|+|\nabla P(0)|+|\nabla \psi(0,\zeta_{k+1})| \big]^i\\[4pt] &\le C^i(1+|\omega_k|^i+i!), \end{align*}

which satisfies the condition of Lemma 8 with $\alpha=1$ . Thus, (39) yields

\begin{align*} \mathrm{P}\left(\mathcal{R}_{\eta,2}>y/4\right) \le C\exp\{-c\delta^{1/5} y^{2/5}\eta^{-1/5}\}, \end{align*}

under the condition $y>\sqrt{\eta/\delta}$ . $\mathrm{P}\left(\mathcal{R}_{\eta,2}<-y/4\right)$ can be estimated similarly. Thus (43) is proved.

(c) Control of $\mathcal{R}_{\eta,3}$ . Let $A=\{|\Delta\omega_k| < y_1\le1, k=0,1,\ldots,m-1\}$ . We have

\begin{align*} &\mathrm{P}(\mathcal{R}_{\eta,3}>y/4)\\[4pt] &\quad =\mathrm{P}\left(\sum_{k=0}^{m-1} \int_0^1\int_0^1s \big{\langle} \frac{\nabla^2 f(\omega_k+ss'\Delta\omega_k)-\nabla^2f(\omega_k )}{|ss'\Delta\omega_k|}, \Delta \omega_k\Delta \omega_k^\top\big{\rangle}_{\mathrm{HS}}|ss'\Delta\omega_k|{\mathrm{d}} s'{\mathrm{d}} s >\frac{\sqrt{m\eta\delta}y}4 \right)\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} \int_0^1\int_0^1 \frac{|\nabla^2 f(\omega_k+ss'\Delta\omega_k)-\nabla^2f(\omega_k )|}{|ss'\Delta\omega_k|} |\Delta \omega_k|^3{\mathrm{d}} s'{\mathrm{d}} s>\frac{\sqrt{m\eta\delta}y}4, A \right)+\mathrm{P}(A^C). \end{align*}

For the first term, (23) implies

\begin{align*} &\mathrm{P}\left(\sum_{k=0}^{m-1} \int_0^1\int_0^1 \frac{|\nabla^2 f(\omega_k+ss'\Delta\omega_k)-\nabla^2f(\omega_k )|}{|ss'\Delta\omega_k|} |\Delta \omega_k|^3{\mathrm{d}} s'{\mathrm{d}} s>\frac{\sqrt{m\eta\delta}y}4, A \right)\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} C(1+|\omega_k|^5)|\Delta\omega_k|^31_{\{|\Delta\omega_k| < y_1\}}\ge \sqrt{m\eta\delta}y\right)\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} C(1+|\omega_k|^5)|\Delta\omega_k|^31_{\{|\Delta\omega_k| < y_1\}}\ge \sqrt{m\eta\delta}y, |\omega_k| < y_2 \text{ for any } k\right)\\[4pt] &\qquad +\mathrm{P}(\max_{k\in\{0,\ldots,m-1\}}|\omega_k|\ge y_2)\\[4pt] &\quad \le \exp\left\{-\frac{C(\sqrt{m\eta\delta}y-m(\eta\delta)^{3/2})^2}{my_2^{10}y_1^6}\right\}+Cm{\mathrm{e}}^{-y_2^2}, \end{align*}

where the last inequality follows from [Reference Dedecker and Gouëzel6, Theorem 2] and the fact that $\mathrm{E}|\Delta\omega_k|^3\le C(\eta\delta)^{3/2}$.

For the second term, a straightforward calculation implies

\begin{align*} \mathrm{P}(A^C)&\le\sum_{k=0}^{m-1}\mathrm{P}(|\Delta \omega_k|>y_1)\\[4pt] &\le\sum_{k=0}^{m-1}\mathrm{P}\left(\eta|\nabla \psi(\omega_k,\zeta_{k+1})|>y_1/2\right)+\sum_{k=0}^{m-1}\mathrm{P}(\sqrt{\eta\delta}|\xi_{k+1}|>y_1/2)\\[4pt] &\le \sum_{k=0}^{m-1}\mathrm{P}\left( |\omega_k|>\frac{Cy_1}\eta\right)+\sum_{k=0}^{m-1}\mathrm{P}\left(|\nabla\psi(0,\zeta_{k+1})|>\frac{Cy_1}\eta\right)+\sum_{k=0}^{m-1}\mathrm{P}\left(|\xi_{k+1}|>\frac{Cy_1}{\sqrt{\eta\delta}}\right)\\[4pt] &\le 2m{\mathrm{e}}^{-Cy_1^2/\eta^2}+m{\mathrm{e}}^{-Cy_1^2/(\eta\delta)}, \end{align*}

where the second inequality follows from the iteration of $\omega_k$, and the last inequality follows from Lemma 7 and Assumption 2. Combining the calculations above, we obtain

\begin{align*} \mathrm{P}(\mathcal{R}_{\eta,3}>y/4)\le \exp\left\{-\frac{C(\sqrt{m\eta\delta}y-m(\eta\delta)^{3/2})^2}{my_2^{10}y_1^6}\right\}+Cm{\mathrm{e}}^{-y_2^2}+2m{\mathrm{e}}^{-Cy_1^2/\eta^2}+m{\mathrm{e}}^{-Cy_1^2/(\eta\delta)}. \end{align*}

Taking $y_1=y^{1/9}\eta^{7/18}\delta^{7/18} $ and $y_2= y^{1/9}\eta^{-1/9}\delta^{-1/9}$ , we complete the proof of (44), that is,

\begin{align*} \mathrm{P}\left(\left\vert\mathcal{R}_{\eta,3}\right\vert >y/4\right)& \le C{\mathrm{e}}^{-cy^{2/9}\eta^{-2/9}\delta^{-2/9}} , \end{align*}

for $c(\sqrt{m}\eta\delta \vee \eta\delta) <y<C\eta^{-7/2}\delta^{-7/2}$ .
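The exponent bookkeeping behind this choice of $y_1$ and $y_2$ can be checked directly (the subtracted term $m(\eta\delta)^{3/2}$ is dominated by $\sqrt{m\eta\delta}\,y$ when $y\ge c\sqrt{m}\,\eta\delta$): with $y_1=y^{1/9}\eta^{7/18}\delta^{7/18}$ and $y_2=y^{1/9}\eta^{-1/9}\delta^{-1/9}$,

```latex
\begin{align*}
\frac{\big(\sqrt{m\eta\delta}\,y\big)^2}{m\,y_2^{10}y_1^6}
  &= \frac{\eta\delta\, y^2}{\big(y^{10/9}\eta^{-10/9}\delta^{-10/9}\big)\big(y^{6/9}\eta^{7/3}\delta^{7/3}\big)}
   = y^{2/9}\eta^{-2/9}\delta^{-2/9},\\[4pt]
y_2^2 &= y^{2/9}\eta^{-2/9}\delta^{-2/9},
\qquad
\frac{y_1^2}{\eta\delta} = y^{2/9}\eta^{-2/9}\delta^{-2/9},
\end{align*}
```

so three of the four exponents in the previous display balance at $y^{2/9}\eta^{-2/9}\delta^{-2/9}$, while the remaining one is $y_1^2/\eta^2=y^{2/9}\eta^{-11/9}\delta^{7/9}$.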

(d) Control of $\mathcal{R}_{\eta,4}$ . Following the notation $\Sigma(\omega_k)$ and $\Delta\omega_k$ , we have

(46) \begin{align} \mathrm{P}(\mathcal{R}_{\eta,4}>y/4) &= \mathrm{P}\left(\frac{1}{2\sqrt{m\eta\delta}}\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), \eta^2I_{1,k}+\eta\delta I_{2,k}+\eta^{3/2}\delta^{1/2} I_{3,k}+\eta^{2} I_{4,k} {\rangle}_{\mathrm{HS}}>y/4 \right)\nonumber\\ &\quad \le \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{1,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\delta^{1/2}\eta^{-3/2} y \right) \nonumber \\ &\quad+ \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{2,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\eta^{-1/2}\delta^{-1/2} y \right)\nonumber\\ &\quad+ \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{3,k}{\rangle}_{\mathrm{HS}} > C m^{1/2}\eta^{-1} y \right)\nonumber \\&\quad+\mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{4,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\delta^{1/2}\eta^{-3/2} y \right), \end{align}

where

\begin{align*} I_{1,k} &= \mathrm{E}_k[\nabla\psi(\omega_k,\zeta_{k+1})\nabla\psi(\omega_k,\zeta_{k+1})^\top]-\nabla\psi(\omega_k,\zeta_{k+1})\nabla\psi(\omega_k,\zeta_{k+1})^\top,\\[4pt] I_{2,k} &= I_d-\xi_{k+1}\xi_{k+1}^\top,\\[4pt] I_{3,k} &= \nabla \psi(\omega_k,\zeta_{k+1})\xi_{k+1}^\top +\xi_{k+1}\nabla \psi(\omega_k,\zeta_{k+1})^\top,\\[4pt] I_{4,k} &= -\nabla P(\omega_k)\nabla P(\omega_k)^\top. \end{align*}
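This decomposition can be checked directly from the iterate $\Delta\omega_k=\omega_{k+1}-\omega_k=-\eta\nabla\psi(\omega_k,\zeta_{k+1})+\sqrt{\eta\delta}\,\xi_{k+1}$ used above: expanding the outer product,

```latex
\begin{align*}
\Delta\omega_k\Delta\omega_k^\top
 &= \eta^2\nabla\psi(\omega_k,\zeta_{k+1})\nabla\psi(\omega_k,\zeta_{k+1})^\top
   +\eta\delta\,\xi_{k+1}\xi_{k+1}^\top
   -\eta^{3/2}\delta^{1/2} I_{3,k},\\[6pt]
\eta^2 I_{1,k}+\eta\delta I_{2,k}+\eta^{3/2}\delta^{1/2}I_{3,k}+\eta^2 I_{4,k}
 &= \eta^2\mathrm{E}_k\big[\nabla\psi(\omega_k,\zeta_{k+1})\nabla\psi(\omega_k,\zeta_{k+1})^\top\big]
   +\eta\delta I_d\\
 &\quad-\eta^2\nabla P(\omega_k)\nabla P(\omega_k)^\top
   -\Delta\omega_k\Delta\omega_k^\top,
\end{align*}
```

where the second identity follows by substituting the first and cancelling the $\nabla\psi\nabla\psi^\top$, $\xi\xi^\top$ and cross terms.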

For the first term of (46), according to (17) and Assumption 2, it is easy to verify that

\begin{align*} \mathrm{E}_{k}|I_{1,k}|^i \le C^i(1+|\omega_k|^{2i}+i!), \end{align*}

which satisfies the condition of Lemma 8 with $\alpha=2$ . Thus, (40) yields

(47) \begin{align} \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{1,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\delta^{1/2}\eta^{-3/2} y \right) \le C\exp\big\{-c\delta^{1/7} y^{2/7}\eta^{-3/7}\big\} \end{align}

as $(\eta/\delta)^{1/2}<y$ . Similarly to the estimation of (47), one can also verify that $I_{2,k}$ and $I_{3,k}$ satisfy condition (38) with $\alpha=0$ and $\alpha=1$ respectively, thus (40) implies

(48) \begin{align} \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{2,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\eta^{-1/2}\delta^{-1/2} y \right)\le C\exp\big\{-c\eta^{-1/5}\delta^{-1/5}y^{2/5}\big\} \end{align}

and

(49) \begin{align} \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{3,k}{\rangle}_{\mathrm{HS}} > C m^{1/2}\eta^{-1} y\right)\le C\exp\{-cy^{1/3}\eta^{-1/3}\}. \end{align}

For the last term of (46), let $\hat\omega_k=\omega_k1_{\{|\omega_k|\le y_3\}}$ . Similarly to the estimation of (39), expressions (18) and (22) yield

\begin{align*} &\mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{4,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\delta^{1/2}\eta^{-3/2} y \right)\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\hat\omega_k ), -\nabla P(\hat\omega_k)\nabla P(\hat\omega_k)^\top{\rangle}_{\mathrm{HS}}>C m^{1/2}\delta^{1/2}\eta^{-3/2} y\right)+\sum_{k=0}^{m}\mathrm{P}(|\omega_k|\ge y_3)\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} (1+|\hat\omega_k|^6)>C m^{1/2}\delta^{1/2}\eta^{-3/2} y\right)+mC{\mathrm{e}}^{-cy_3^2}. \end{align*}

For the above probability, we have

\begin{align*} & \mathrm{P}\left(\sum_{k=0}^{m-1} (1+|\hat\omega_k|^6)>C m^{1/2}\delta^{1/2}\eta^{-3/2} y\right)\\[4pt] &\quad = \mathrm{P}\left(\sum_{k=0}^{m-1} (|\hat\omega_k|^6-\mathrm{E}|\hat\omega_k|^6)>Cm^{1/2}\delta^{1/2}\eta^{-3/2} y-m-\sum_{k=0}^{m-1}\mathrm{E}|\hat\omega_k|^6\right)\\[4pt] &\quad\le \mathrm{P}\left(\sum_{k=0}^{m-1} (|\hat\omega_k|^6-\mathrm{E}|\hat\omega_k|^6)>C m^{1/2}\delta^{1/2}\eta^{-3/2} y-m\right)\\[4pt] &\quad\le \exp\big\{\!-cy^2\delta\eta^{-3}y_3^{-12}\big\}. \end{align*}

Thus,

(50) \begin{align} \mathrm{P}\left(\sum_{k=0}^{m-1}{\langle} \nabla^2f(\omega_k ), I_{4,k}{\rangle}_{\mathrm{HS}}>C m^{1/2}\delta^{1/2}\eta^{-3/2} y \right)\le C\exp\{-cy^{2/7}\delta^{1/7}\eta^{-3/7}\}, \end{align}

by taking $y_3=(y^2\eta^{-3}\delta)^{1/14}$ and $y>m^{1/2}\eta^{3/2}\delta^{-1/2}$. Combining the results of (47)–(50), we obtain the bound of (46), that is,

\begin{align*} \mathrm{P}(|\mathcal{R}_{\eta,4}|>y/4)\le C{\mathrm{e}}^{-cy^{2/7}\delta^{1/7}\eta^{-3/7}}+C{\mathrm{e}}^{-cy^{2/5}\delta^{-1/5}\eta^{-1/5}}. \end{align*}
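The choice $y_3=(y^2\eta^{-3}\delta)^{1/14}$ equalizes the two exponents obtained above:

```latex
\begin{align*}
y^2\delta\eta^{-3}y_3^{-12}
 &= y^{2-12/7}\,\delta^{1-6/7}\,\eta^{-3+18/7}
  = y^{2/7}\delta^{1/7}\eta^{-3/7},\\[4pt]
y_3^2 &= \big(y^2\eta^{-3}\delta\big)^{1/7} = y^{2/7}\delta^{1/7}\eta^{-3/7},
\end{align*}
```

which is exactly the exponent appearing in (50).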

Combining parts (a)–(d), we obtain

\begin{equation*} \mathrm{P}(|\mathcal{R}_{\eta}|>y)\le C\Big( {\mathrm{e}}^{-cy\eta^{1/2}\delta^{1/2}m^{1/2} } +{\mathrm{e}}^{-cy^{2/5}\delta^{1/5} \eta^{-1/5}} +{\mathrm{e}}^{-cy^{2/9}\eta^{-2/9}\delta^{-2/9} } +{\mathrm{e}}^{-cy^{2/7}\delta^{1/7}\eta^{-3/7}}\Big) \end{equation*}

for $c(\eta^{1/2}\delta^{-1/2}\vee m^{1/2}\eta\delta)\le y\le C\eta^{-7/2}\delta^{-7/2}$ .

Appendix A. Proof of Lemma 1

Proof of Lemma 1. Since $\nabla P(x)=\mathrm{E}[\nabla\psi(x,\zeta)]$, it is easy to see that $\nabla P$ inherits the same Lipschitz and dissipativity properties, that is,

\begin{align*} |\nabla P(x)-\nabla P(y)|\le L|x-y|, \\[-25pt] \nonumber \end{align*}
\begin{align*} {\langle} x-y,-\nabla P(x)+\nabla P(y){\rangle}\le -K_1|x-y|^2+K_2, \end{align*}

for any $x,y\in \mathbb{R}^d$ .

From assumptions (6) and (7), we further obtain the following bounds for $\nabla \psi(x,\zeta)$ and $\nabla P(x)$:

\begin{align*} |\nabla\psi(x,\zeta)|\le L|x|+|\nabla\psi(0,\zeta)|, \\[-25pt] \nonumber \end{align*}
\begin{align*} |\nabla P(x)|\le L|x|+|\nabla P(0)|. \end{align*}

Assumptions (7), (15) and Young’s inequality imply

(51) \begin{align} {\langle} x, -\nabla \psi(x,\zeta){\rangle}&={\langle} x-0, -\nabla \psi(x,\zeta)+\nabla \psi(0,\zeta){\rangle}- {\langle} x, \nabla \psi(0,\zeta){\rangle} \nonumber \\[4pt] &\le-K_1|x|^2+K_2+\frac{K_1}{2}|x|^2+\frac{1}{2K_1}|\nabla\psi(0,\zeta)|^2 \nonumber\\[4pt] &=-\frac{K_1}{2}|x|^2+K_2+\frac{1}{2K_1}|\nabla\psi(0,\zeta)|^2. \end{align}

Similarly,

(52) \begin{align} {\langle} x, -\nabla P(x){\rangle} &\le -\frac{K_1}{2}|x|^2+K_2+\frac{1}{2K_1}|\nabla P(0)|^2. \end{align}

Moreover,

(53) \begin{align} \|\Sigma(x)\|\le 2\mathrm{E}|\nabla\psi(x,\zeta)|^2\le 4L^2|x|^2+C. \end{align}

For the Lipschitz property of $Q_{\eta,\delta}$ , recall that

\begin{equation*} Q_{\eta,\delta}(x)=\big(\mathrm{E}[V_{\eta,\delta}(x,\zeta,\xi)V_{\eta,\delta}(x,\zeta,\xi)^\top]\big)^{1/2}. \end{equation*}

By assumptions (6) and (14), the definition of $V_{\eta,\delta}(x,\zeta,\xi)$ implies that

\begin{align*} |V_{\eta,\delta}(x,\zeta,\xi)-V_{\eta,\delta}(y,\zeta,\xi)|\le 2\sqrt\eta L|x-y|, \end{align*}

which is Lipschitz. Denote the $L^2$ norm $\|X\|_{L^2}=(\mathrm{E}\|X\|^2)^{1/2}$ for any random variable X. Then we have

\begin{align*} Q_{\eta,\delta}(x)=\| (V_{\eta,\delta}(x,\zeta,\xi)V_{\eta,\delta}(x,\zeta,\xi)^\top)^{1/2} \|_{L^2}. \end{align*}

Thus,

\begin{align*} &\|Q_{\eta,\delta}(x)-Q_{\eta,\delta}(y)\|\\[4pt] &\quad=\big\| \| (V_{\eta,\delta}(x,\zeta,\xi)V_{\eta,\delta}(x,\zeta,\xi)^\top)^{1/2} \|_{L^2}-\| (V_{\eta,\delta}(y,\zeta,\xi)V_{\eta,\delta}(y,\zeta,\xi)^\top)^{1/2} \|_{L^2} \big\|\\[4pt] &\quad\le\| ( V_{\eta,\delta}(x,\zeta,\xi)V_{\eta,\delta}(x,\zeta,\xi)^\top)^{1/2} - (V_{\eta,\delta}(y,\zeta,\xi)V_{\eta,\delta}(y,\zeta,\xi)^\top)^{1/2} \|_{L^2}. \end{align*}

Since the mapping $V_{\eta,\delta}\mapsto \big(V_{\eta,\delta}V_{\eta,\delta}^\top \big)^{1/2}=V_{\eta,\delta}V_{\eta,\delta}^\top/|V_{\eta,\delta}|$ is Lipschitz, we have

\begin{align*} \|Q_{\eta,\delta}(x)-Q_{\eta,\delta}(y)\|\le C\sqrt \eta |x-y|. \end{align*}
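The Lipschitz property of the map $v\mapsto vv^\top/|v|$ invoked above can be verified directly. For $a,b\neq0$ with, say, $|a|\ge|b|$,

```latex
\begin{align*}
\Big\|\frac{aa^\top}{|a|}-\frac{bb^\top}{|b|}\Big\|
&\le \frac{\|aa^\top-bb^\top\|}{|a|}
    +\|bb^\top\|\,\Big|\frac{1}{|a|}-\frac{1}{|b|}\Big|\\[4pt]
&\le \frac{(|a|+|b|)\,|a-b|}{|a|}
    +\frac{|b|}{|a|}\,\big||a|-|b|\big|
 \;\le\; 3|a-b|,
\end{align*}
```

using $\|aa^\top-bb^\top\|\le\|a(a-b)^\top\|+\|(a-b)b^\top\|$ and $\big||a|-|b|\big|\le|a-b|$.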

Appendix B. Proof of ergodicity

Proof of Lemma 2. We first give the proof of the ergodicity of $(X_t)_{t\ge0}$ . For the Lyapunov function $V(x)=|x|^2+1$ on $\mathbb{R}^d$ , expressions (10), (52) and (53) imply

\begin{align*} \mathcal L V(x) &= -{\langle} \nabla P(x), 2x{\rangle}+{\langle} \eta\Sigma(x)+\delta I_d, I_d {\rangle}_{\textrm {HS}}\\[3pt] &\le -K_1|x|^2+4\eta L^2|x|^2+C. \end{align*}

For small enough $\eta\le K_1/(8L^2)$ , one has

\begin{align*} \mathcal L V(x) \le -\frac {K_1}4V(x) +\left(C+\frac {K_1}4\right)1_{\{|x|^2\le K_1+4C\}}. \end{align*}

By [Reference Meyn and Tweedie24, Theorem 6.1], $(X_t)_{t\ge0}$ is exponentially ergodic with invariant measure $\pi$, that is, there exist constants C and c such that

(54) \begin{align} \sup_{|h|\le V}|\mathrm{E} h(X_t(x))-{\pi}(h)|\le CV(x){\mathrm{e}}^{-ct}. \end{align}

The ergodicity of $(\omega_k)_{k\ge0}$ follows from [Reference Tuominen and Tweedie29, Theorem 2.1]. Notice that

\begin{align*} \mathrm{E}_k[V(\omega_{k+1})] &= 1+ \mathrm{E}_{k}|\omega_k-\eta\nabla\psi(\omega_k,\zeta_{k+1})+\sqrt{\eta\delta}\xi_{k+1}|^2\\[4pt] &= 1+ |\omega_k|^2+\eta^2\mathrm{E}_{k}|\nabla\psi(\omega_k,\zeta_{k+1})|^2+\eta\delta d- 2\eta{\langle} \omega_k, \nabla P(\omega_k){\rangle}\\[4pt] &\le (1+2\eta^2L^2-\eta K_1)|\omega_k|^2+1+C\eta. \end{align*}

Denote the transition probability of $(\omega_k)_{k\ge0}$ by $P(x,{\mathrm{d}} y)$ for $x,y\in\mathbb{R}^d$ and let

\begin{align*} V^n(x)={\mathrm{e}}^{c_1 n\eta}V(x),\quad r(n)= c_1\eta {\mathrm{e}}^{c_1 n\eta}. \end{align*}

A straightforward calculation implies

\begin{align*} &PV^{n+1}(x)+r(n)V(x)\\[4pt] &\quad ={\mathrm{e}}^{c_1(n+1)\eta}P V(x)+c_1\eta {\mathrm{e}}^{c_1n\eta}V(x)\\[4pt] &\quad\le {\mathrm{e}}^{c_1(n+1)\eta}\big((1+2\eta^2L^2-\eta K_1)|x|^2+1+C\eta\big)+c_1\eta {\mathrm{e}}^{c_1n\eta}V(x)\\[4pt] &\quad = {\mathrm{e}}^{c_1n\eta}V(x)+c_1\eta {\mathrm{e}}^{c_1n\eta}\\[4pt] &\quad \times \left[\left(\frac {{\mathrm{e}}^{c_1\eta}}{c_1\eta}(1+2\eta^2L^2-\eta K_1)+1-\frac{1}{c_1\eta}\right)V(x)+\frac {{\mathrm{e}}^{c_1\eta}}{c_1\eta}(C\eta+\eta K_1-2\eta^2L^2)\right]\\[4pt] &\quad = V^n(x)+r(n)\left[\frac {1}{c_1\eta}\!\left({\mathrm{e}}^{c_1\eta}(1+2\eta^2L^2-\eta K_1)+c_1\eta-1\!\right)\!V(x)+\frac {{\mathrm{e}}^{c_1\eta}}{c_1\eta}(C\eta+\eta K_1-2\eta^2L^2)\!\right]\!. \end{align*}

Choosing $\eta$ small enough such that ${\mathrm{e}}^{c_1\eta}(1+2\eta^2L^2-\eta K_1)+c_1\eta<1$ , we obtain

\begin{align*} P V^{n+1}(x)+r(n)V(x) &\le V^n(x)+br(n)1_{\{x\in \mathcal{C}\}}, \end{align*}

where

\begin{align*}b=\frac {{\mathrm{e}}^{c_1\eta}}{c_1\eta}(C\eta+\eta K_1-2\eta^2L^2),\quad\mathcal{C}=\bigg\{x\;:\;V(x)\le \frac{{\mathrm{e}}^{c_1\eta}(C\eta+\eta K_1-2\eta^2 L^2)}{1-{\mathrm{e}}^{c_1\eta}(1+2\eta^2L^2-\eta K_1)-c_1\eta}\bigg\}.\end{align*}

A theorem due to Tuominen and Tweedie [Reference Tuominen and Tweedie29, Theorem 2.1] implies that $(\omega_k)_{k\ge0}$ is ergodic with invariant measure $\pi_{\eta}$, that is, there exist constants C and c such that

(55) \begin{align} \sup\nolimits_{|h|\le V}|\mathrm{E} h(\omega_k^x)-{\pi}_{\eta}(h)|\le C\eta^{-1}V(x){\mathrm{e}}^{-ck\eta}. \end{align}
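The SGLD iterate $\omega_{k+1}=\omega_k-\eta\nabla\psi(\omega_k,\zeta_{k+1})+\sqrt{\eta\delta}\,\xi_{k+1}$ used in the drift computation above, and the ergodic averaging that (55) justifies, can be illustrated numerically. The following is only a minimal sketch, not the paper's setup: it assumes a hypothetical quadratic loss $\psi(\omega,\zeta)=|\omega-\zeta|^2/2$ with $\zeta\sim\mathcal N(0,I_2)$, so that $P(\omega)=\mathrm{E}\,\psi(\omega,\zeta)$ is minimized at $0$; the names `sgld`, `grad_psi`, and `sample_zeta` are illustrative.

```python
import numpy as np

def sgld(grad_psi, sample_zeta, omega0, eta, delta, n_steps, rng):
    """Run the SGLD iteration
        omega_{k+1} = omega_k - eta * grad_psi(omega_k, zeta_{k+1})
                      + sqrt(eta * delta) * xi_{k+1},
    with xi_{k+1} ~ N(0, I_d), and return the whole trajectory."""
    omega = np.asarray(omega0, dtype=float)
    path = [omega.copy()]
    for _ in range(n_steps):
        zeta = sample_zeta(rng)                      # fresh data sample zeta_{k+1}
        xi = rng.standard_normal(omega.shape)        # injected Gaussian noise xi_{k+1}
        omega = omega - eta * grad_psi(omega, zeta) + np.sqrt(eta * delta) * xi
        path.append(omega.copy())
    return np.array(path)

# Hypothetical example: psi(w, z) = |w - z|^2 / 2, so grad_psi(w, z) = w - z.
rng = np.random.default_rng(0)
path = sgld(grad_psi=lambda w, z: w - z,
            sample_zeta=lambda rng: rng.standard_normal(2),
            omega0=np.ones(2), eta=0.05, delta=0.1, n_steps=20000, rng=rng)

# By ergodicity, the time average along the second half of the trajectory
# concentrates near the minimizer 0 of P.
est = path[len(path) // 2:].mean(axis=0)
```

The time average `est` plays the role of the empirical measure applied to the identity test function; for this quadratic example it settles close to the origin.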

Acknowledgements

We would like to thank two anonymous referees and the Associate Editor for their valuable comments which improved the paper considerably.

Funding information

Hongsheng Dai is supported by the EPSRC research grant ‘Pooling INference and COmbining Distributions Exactly: A Bayesian approach (PINCODE)’, EP/X027872/1, and by the UKRI grant EP/Y014650/1 as part of the ERC Synergy project OCEAN. Xiequan Fan is partially supported by the National Natural Science Foundation of China (Grant No. 12371155) and Natural Science Foundation of Hebei Province (Grant No. A2025501005). Jianya Lu is the corresponding author.

Competing interests

There were no competing interests to declare that arose during the preparation or publication process of this article.

References

Bertsekas, D. (2012). Dynamic Programming and Optimal Control: Volume I, vol. 4. Athena Scientific.
Chen, P., Lu, J. and Xu, L. (2022). Approximation to stochastic variance reduced gradient Langevin dynamics by stochastic delay differential equations. Applied Mathematics & Optimization 85, 15.
Chen, P. and Xu, L. (2019). Approximation to stable law by the Lindeberg principle. Journal of Mathematical Analysis and Applications 480, 123338.
Chen, X., Du, S. S. and Tong, X. T. (2020). On stationary-point hitting time and ergodicity of stochastic gradient Langevin dynamics. Journal of Machine Learning Research.
Chen, X., Shao, Q.-M., Wu, W. B. and Xu, L. (2016). Self-normalized Cramér-type moderate deviations under dependence. Annals of Statistics 44, 1593–1617.
Dedecker, J. and Gouëzel, S. (2015). Subgaussian concentration inequalities for geometrically ergodic Markov chains. Electronic Communications in Probability 20.
Fan, X., Grama, I., Liu, Q. and Shao, Q.-M. (2019). Self-normalized Cramér type moderate deviations for martingales. Bernoulli 25, 2793–2823.
Fan, X., Grama, I., Liu, Q. and Shao, Q.-M. (2020). Self-normalized Cramér type moderate deviations for stationary sequences and applications. Stochastic Processes and their Applications 130, 5124–5148.
Fan, X., Hu, H. and Xu, L. (2024). Normalized and self-normalized Cramér-type moderate deviations for the Euler-Maruyama scheme for the SDE. Science China Mathematics 67, 1865–1880.
Fan, X. and Shao, Q.-M. (2024). Cramér’s moderate deviations for martingales with applications. Annales de l’Institut Henri Poincaré Probabilités et Statistiques 60, 2046–2074.
Fang, X., Shao, Q.-M. and Xu, L. (2019). Multivariate approximations in Wasserstein distance by Stein’s method and Bismut’s formula. Probability Theory and Related Fields 174, 945–979.
Feng, Y., Gao, T., Li, L., Liu, J.-G. and Lu, Y. (2020). Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation. Communications in Mathematical Sciences 18, 163–188.
Gao, M. and Yiu, K.-F. C. (2023). Moderate deviations and invariance principles for sample average approximations. SIAM Journal on Optimization 33, 816–841.
Gilbarg, D. and Trudinger, N. S. (2001). Elliptic Partial Differential Equations of Second Order. Springer Science & Business Media.
Guillin, A., Wang, Y., Xu, L. and Yang, H. (2024). Error estimates between SGD with momentum and underdamped Langevin diffusion. arXiv preprint arXiv:2410.17297.
Hambly, B., Xu, R. and Yang, H. (2021). Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. SIAM Journal on Control and Optimization 59, 3359–3391.
Hu, W., Li, C. J., Li, L. and Liu, J.-G. (2019). On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications 4.
Jiang, H., Wan, Y. and Yang, G. (2022). Deviation inequalities and Cramér-type moderate deviations for the explosive autoregressive process. Bernoulli 28, 2634–2662.
Jing, B.-Y., Shao, Q.-M. and Wang, Q. (2003). Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31, 2167–2215.
Lamperski, A. (2021). Projected stochastic gradient Langevin algorithms for constrained sampling and non-convex learning. In Conference on Learning Theory. PMLR, pp. 2891–2937.
Li, Q., Tai, C. and Weinan, E. (2017). Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning. PMLR, pp. 2101–2110.
Li, Q., Tai, C. and Weinan, E. (2019). Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. Journal of Machine Learning Research 20, 1474–1520.
Lu, J., Tan, Y. and Xu, L. (2022). Central limit theorem and self-normalized Cramér-type moderate deviation for Euler-Maruyama scheme. Bernoulli 28, 937–964.
Meyn, S. P. and Tweedie, R. L. (1993). Stability of Markovian processes III: Foster-Lyapunov criteria for continuous-time processes. Advances in Applied Probability 25, 518–548.
Raginsky, M., Rakhlin, A. and Telgarsky, M. (2017). Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory. PMLR, pp. 1674–1703.
Shao, Q.-M. (1999). A Cramér type large deviation result for Student’s t-statistic. Journal of Theoretical Probability 12, 385–398.
Shao, Q.-M. and Zhou, W.-X. (2016). Cramér type moderate deviation theorems for self-normalized processes. Bernoulli 22, 2029–2079.
Teh, Y. W., Thiery, A. H. and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, 193–225.
Tuominen, P. and Tweedie, R. L. (1994). Subgeometric rates of convergence of f-ergodic Markov chains. Advances in Applied Probability 26, 775–798.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688.
Xu, P., Chen, J., Zou, D. and Gu, Q. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems 31.
Zhang, Y., Liang, P. and Charikar, M. (2017). A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory. PMLR, pp. 1980–2022.
Zhang, Z.-S. (2023). Cramér-type moderate deviation of normal approximation for unbounded exchangeable pairs. Bernoulli 29, 274–299.