The latest research in training modern machine learning models: ‘A deterministic modification of gradient descent that avoids saddle points’

In this post, Lorenz Kuger reflects on the recent success of machine learning models and the challenges that come with training them. Against this backdrop, Kuger introduces a recently published article that addresses how to modify gradient descent to avoid saddle points, a direction that has so far received comparatively little attention. The research is published and available now in the European Journal of Applied Mathematics (EJAM).

Machine learning models based on deep neural networks

Machine learning models, particularly those based on deep neural networks, have enjoyed remarkable success in recent years. They have revolutionised the fields of data analysis, image recognition, and natural language processing, among many other applications across the sciences.

A key factor in the training of these models is the use of variants of gradient descent algorithms, which optimise model parameters by minimising a loss function. However, the training optimisation problem for neural networks is highly non-convex, presenting a unique set of challenges.
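
As a point of reference for the discussion below, here is a minimal sketch of a vanilla gradient descent routine; the gradient function grad_f, the starting point x0, the step size eta and the iteration count are generic placeholders, not quantities taken from the paper:

    import numpy as np

    def gradient_descent(grad_f, x0, eta=0.1, num_steps=1000):
        """Vanilla gradient descent: x_{k+1} = x_k - eta * grad f(x_k)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(num_steps):
            x = x - eta * grad_f(x)  # step against the gradient of the loss
        return x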

Avoiding saddle points in the training process is crucial

For convex problems, the behaviour of gradient descent-type algorithms is typically very robust: under mild assumptions, the iterates converge to a global minimiser. In non-convex settings, the situation is trickier, since the iterates might get stuck in local but non-global minima or other stationary points.

Saddle points are a special type of stationary point that is not a minimum of the function: they are points that are a local minimum along one direction in space but a local maximum along another, perpendicular direction, which makes the objective function look like a saddle in those two dimensions. Avoiding saddle points in the training process is crucial, since the algorithm otherwise returns a non-optimal point as output.
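
A standard textbook illustration (not taken from the paper) is the two-dimensional function

    f(x, y) = x^2 - y^2,

whose gradient (2x, -2y) vanishes at the origin. The origin is a minimum along the x-axis and a maximum along the y-axis, so it is a saddle point; gradient descent started anywhere on the x-axis converges to it and returns a point that is not a minimiser of f.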

Techniques have been developed to escape from a saddle point once the iterates of gradient descent get stuck in one. Avoiding saddle points from the start, however, is a less researched direction.

A deterministic gradient-based approach to avoid saddle points

A new paper ‘A deterministic gradient-based approach to avoid saddle points’ by Lisa Maria Kreusser, Stanley Osher and Bao Wang [1] was published recently in the European Journal of Applied Mathematics. It precisely addresses this question of how to modify gradient descent to avoid saddle points. The algorithm proposed and analysed by the authors is a modification of Laplacian smoothing gradient descent.

Gain exclusive access to the paper ‘A deterministic gradient-based approach to avoid saddle points’ for one month, until 31 May 2023, here.

The difference between vanilla gradient descent and this algorithm is that the gradient directions are pre-multiplied by a Laplacian smoothing matrix with periodic boundary conditions. The additional step can be carried out in linear extra time and does not require any stochastic input or higher-order information about the objective function. This makes it easily applicable to a wide range of problems.
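
The precise smoothing matrix, its sign conventions and the step-size rules are specified in the paper; the snippet below is only a rough, hypothetical sketch of what a linear-time periodic smoothing of the gradient can look like, with the smoothing strength sigma and the one-dimensional stencil chosen purely for illustration:

    import numpy as np

    def smooth_gradient(g, sigma=1.0):
        """Multiply g by I + sigma * L, where L is the 1-D discrete Laplacian
        with periodic boundary conditions (illustrative choice only)."""
        laplacian_g = np.roll(g, 1) - 2.0 * g + np.roll(g, -1)  # periodic stencil, O(n)
        return g + sigma * laplacian_g

    def smoothed_gradient_descent(grad_f, x0, eta=0.1, sigma=1.0, num_steps=1000):
        """Gradient descent with the gradient pre-multiplied by a smoothing matrix."""
        x = np.asarray(x0, dtype=float)
        for _ in range(num_steps):
            x = x - eta * smooth_gradient(grad_f(x), sigma)
        return x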

The first theoretical part of the paper is devoted to an analysis of the scheme’s attraction region for quadratic functions with a saddle point. The attraction region is the subspace of initial points for which an iterative algorithm will converge to the suboptimal saddle point.

For a canonical quadratic function, this subspace is computed exactly. While gradient descent’s attraction region for this objective in n-dimensional space has dimension n-1, for the proposed method it is only the integer part of (n-1)/2, roughly half as large. The authors extend this computation in parts to general quadratic functions in n dimensions, and provide intuition for why the attraction region of their algorithm has equal or smaller dimension than that of gradient descent.
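
To get a feeling for where the dimension n-1 for gradient descent comes from (a generic illustration, not necessarily the paper’s exact canonical function), consider

    f(x) = (x_1^2 + ... + x_{n-1}^2 - x_n^2) / 2,

which has a saddle point at the origin. A gradient descent step with step size eta maps x_i to (1 - eta) x_i for i < n and x_n to (1 + eta) x_n, so the last coordinate is amplified at every iteration unless it starts at exactly zero. The iterates therefore converge to the saddle point precisely when the initial point lies in the hyperplane x_n = 0, which has dimension n-1.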

This shows that the new method might be able to avoid saddle points that gradient descent gets stuck in for bad initial values. However, it is also crucial to examine its convergence speed and to show that the modification does not sacrifice speed in return.

This is the subject of the paper’s second theoretical contribution. Using a classical framework for the convergence analysis of gradient descent algorithms, the authors show that their method’s convergence rate is consistent with that of gradient descent in the non-convex setting, under the typical assumption of L-smoothness and a feasible choice of step sizes.
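
For reference, the classical bound in this setting (a standard textbook statement, not a result quoted from the paper) says that gradient descent with step size 1/L on an L-smooth function f bounded below by f* satisfies

    min_{0 <= k <= K} ||∇f(x_k)||^2 <= 2L (f(x_0) - f*) / (K + 1),

so the smallest gradient norm encountered decays at rate O(1/√K); the authors show that their smoothed scheme attains this type of guarantee as well.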

The theoretical analysis of the algorithm is further supported by numerical experiments. The authors test the modified scheme and compare it with classical gradient descent in cases where the latter fails to converge to a minimiser. Indeed, the additional Laplacian smoothing step manages to avoid saddle points that gradient descent gets stuck in when the initial point lies in a certain subspace.

Loss functions with a large number of saddle points are one of the major obstacles for training modern machine learning (ML) models efficiently.

You can read ‘A deterministic gradient-based approach to avoid saddle points’ by Lisa Maria Kreusser, Stanley Osher and Bao Wang in the European Journal of Applied Mathematics.

Gain exclusive access to the article for one month here, valid until 31 May 2023.

