
Non-asymptotic analysis of online noisy stochastic gradient descent

Published online by Cambridge University Press: 03 March 2026

Riddhiman Bhattacharya*
Affiliation: University of California, Santa Cruz
Tiefeng Jiang**
Affiliation: The Chinese University of Hong Kong, Shenzhen

*Postal address: Statistics, University of California, Santa Cruz, Santa Cruz, USA. Email: briddhiman1729@gmail.com
**Postal address: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. Email: jiang040@umn.edu
Rights & Permissions [Opens in a new window]

Abstract

Past research indicates that the covariance of the error in stochastic gradient descent (SGD) performed via minibatching plays a critical role in determining its regularization and its escape from low-potential points. Motivated by recent research in this area, we prove universality results by showing that noise classes with the same mean and covariance structure as minibatch SGD have similar properties. We mainly consider the SGD algorithm with multiplicative noise introduced in previous work (Wu et al. (2020), Int. Conf. on Machine Learning, PMLR, pp. 10367–10376), which admits a much more general noise class than SGD via minibatching. We establish non-asymptotic bounds for the multiplicative SGD algorithm in the Wasserstein distance. We also show that the error term of the algorithm is approximately a scaled Gaussian distribution with mean zero at any fixed point.
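To fix ideas, a schematic form of the multiplicative-noise update (a hedged sketch; the step size $\eta_k$ and the exact weight class are as specified in the paper, not here) on an objective $f(\theta)=\frac{1}{n}\sum_{i=1}^{n}f_i(\theta)$ is
\[
\theta_{k+1} = \theta_k - \eta_k \sum_{i=1}^{n} w_{i,k}\,\nabla f_i(\theta_k), \qquad \mathbb{E}[w_{i,k}] = \frac{1}{n},
\]
where the random weight vector $W_k=(w_{1,k},\ldots,w_{n,k})^{\mathsf{T}}$ is drawn with the same mean and covariance structure as minibatch SGD with batch size $m$; minibatching itself is recovered by placing weight $1/m$ on a uniformly sampled batch of $m$ indices and $0$ elsewhere.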

Information

Type
Original Article
Creative Commons licence: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Applied Probability Trust

Algorithm 1 Online Multiplicative Stochastic Gradient Descent (Online M-SGD).
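Below is a minimal, hypothetical Python sketch of such an online M-SGD loop, assuming the schematic update above with symmetric Dirichlet weights; the names (online_msgd, grads) and the Dirichlet choice are illustrative assumptions, not a transcription of Algorithm 1.

import numpy as np

def online_msgd(grads, theta0, eta, n_steps, alpha, rng):
    # grads: list of n per-sample gradient functions theta -> R^p.
    # alpha: Dirichlet concentration; alpha = (m - 1) / (n - m) matches
    # the per-coordinate weight variance of minibatch SGD with batch size m.
    n = len(grads)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        w = rng.dirichlet(np.full(n, alpha))  # fresh multiplicative weights each step
        theta = theta - eta * sum(w[i] * grads[i](theta) for i in range(n))
    return theta

For instance, with per-sample least-squares gradients grads[i] = lambda th, i=i: (x[i] @ th - y[i]) * x[i], the loop performs a noisy gradient descent whose weight mean and variance mimic minibatching.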


Figure 1. Histogram of the 10,000 samples of $\sqrt{m}\sum_{i=1}^{n}w_{i,k}U_i$. Here $p=6$ with $n=10^4$, $m=2000$. The weight vector $W=(w_1,w_2,\ldots,w_n)^{\mathsf{T}}$ is distributed according to $N(\mu, \Sigma)$, where $\mu$ and $\Sigma$ are as specified in Assumption 3.


Figure 2. Histogram of the 10,000 samples of $\sqrt{m}\sum_{i=1}^{n}w_{i,k}U_i$. Here we have $p=1$ with $n=10^4$, $m=2000$. The weight vector $W=(w_1,w_2,\ldots,w_n)^{\mathsf{T}}$ is simulated from $\textrm{Dir}\left(\left(\frac{1999}{8000},\frac{1999}{8000},\ldots,\frac{1999}{8000}\right)\right)$. The plot indicates the Gaussian nature of the samples.
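The Figure 2 experiment can be mimicked with the short simulation below. It is a hedged sketch in which the $U_i$ are taken to be i.i.d. standard normal scalars, an illustrative assumption since the caption does not restate their distribution; the Gaussian-weight experiment of Figure 1 is analogous, with $W$ drawn from $N(\mu,\Sigma)$ instead.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, m, n_samples = 10_000, 2_000, 10_000
alpha = (m - 1) / (n - m)                 # = 1999/8000, as in the caption

U = rng.standard_normal(n)                # assumed distribution of the U_i
samples = np.empty(n_samples)
for k in range(n_samples):
    w = rng.dirichlet(np.full(n, alpha))  # W ~ Dir(1999/8000, ..., 1999/8000)
    samples[k] = np.sqrt(m) * (w @ U)     # sqrt(m) * sum_i w_{i,k} U_i

plt.hist(samples, bins=60, density=True)  # should look approximately Gaussian
plt.xlabel(r"$\sqrt{m}\sum_i w_{i,k} U_i$")
plt.show()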


Figure 3. MSE vs iteration with $\gamma=0.5$.


Figure 4. MSE vs iteration with $\gamma=0.1$.