1 Introduction
The purpose of this paper is to describe, and illustrate with numerical examples, the fundamentals of a branch of continuous optimization dedicated to problems in imaging science, in particular image reconstruction, inverse problems in imaging, and some simple classification tasks. Many of these problems can be modelled by means of an ‘energy’, ‘cost’ or ‘objective’ which represents how ‘good’ (or bad!) a solution is, and must be minimized.
These problems often share a few characteristic features. One is their size, which can be very large (typically involving at most around a billion variables, for problems such as three-dimensional image reconstruction, dense stereo matching, or video processing) but usually not ‘huge’ like some recent problems in learning or statistics. Another is the fact that for many problems, the data are structured in a two- or three-dimensional grid and interact locally. A final, frequent and fundamental feature is that many useful problems involve non-smooth (usually convex) terms, for reasons that are now well understood and concern the concepts of sparsity (DeVore Reference DeVore1998, Candès, Romberg and Tao Reference Candès, Romberg and Tao2006b, Donoho Reference Donoho2006, Aharon, Elad and Bruckstein Reference Aharon, Elad and Bruckstein2006) and robustness (Ben-Tal and Nemirovski Reference Ben-Tal and Nemirovski1998).
These features have strongly influenced the type of numerical algorithms used and further developed to solve these problems. Due to their size and lack of smoothness, higher-order methods such as Newton’s method, or methods relying on precise line-search techniques, are usually ruled out, although some authors have suggested and successfully implemented quasi-Newton methods for non-smooth problems of the kind considered here (Ito and Kunisch Reference Ito and Kunisch1990, Chan, Golub and Mulet Reference Chan, Golub and Mulet1999).
Hence these problems will usually be tackled with first-order descent methods, which are essentially extensions and variants of a plain gradient descent, appropriately adapted to deal with the lack of smoothness of the objective function. To tackle non-smoothness, one can either rely on controlled smoothing of the problem (Nesterov Reference Nesterov2005, Becker, Bobin and Candès Reference Becker, Bobin and Candès2011) and revert to smooth optimization techniques, or ‘split’ the problem into smaller subproblems which can be exactly (or almost) solved, and combine these resolutions in a way that ensures that the initial problem is eventually solved. This last idea is now commonly referred to as ‘proximal splitting’ and, although it relies on ideas from as far back as the 1950s or 1970s (Douglas and Rachford Reference Douglas and Rachford1956, Glowinski and Marroco Reference Glowinski and Marroco1975), it has been a very active topic in the past ten years in image and signal processing, as well as in statistical learning (Combettes and Pesquet Reference Combettes and Pesquet2011, Parikh and Boyd Reference Parikh and Boyd2014).
Hence, we will focus mainly on proximal splitting (descent) methods, and primarily for convex problems (or extensions, such as finding zeros of maximal-monotone operators). We will introduce several important problems in imaging and describe in detail simple first-order techniques to solve these problems practically, explaining how ‘best’ to implement these methods, and in particular, when available, how to use acceleration tricks and techniques to improve the convergence rates (which are generally very poor for such methods). This point of view is very similar to the approach in a recent tutorial of Burger, Sawatzky and Steidl (Reference Burger, Sawatzky and Steidl2014), though we will describe a larger class of problems and establish connections between the most commonly used first-order methods in this field.
Finally, we should mention that for many imaging problems, the grid structure of the data is well suited for massively parallel implementations on GPUs, and hence it is beneficial to develop algorithms that preserve this property.
The organization of this paper is as follows. We will first describe typical (simple) problems in imaging and explain how they can be reduced to the minimization of relatively simple functions, usually convex. Then, after a short introduction to the basic concepts of convexity in Section 3, we will describe in Sections 4 and 5 the classes of algorithms that are currently used to tackle these problems, illustrating each algorithm with applications to the problems introduced earlier. Each time, we will discuss the basic methods, convergence results and expected rates, and, when available, acceleration tricks which can sometimes turn a slow and inefficient method into a useful practical tool. We will focus mainly on two families of methods (whose usefulness depends on the structure of the problem): first-order descent methods and saddle-point methods. Both can be seen as either variants or extensions of the ‘proximal-point algorithm’ (Martinet Reference Martinet1970), and are essentially based on iterations of a $1$-Lipschitz operator; therefore, in Appendix A we will very briefly recall the general theory for such iterative techniques. It does not apply to accelerated variants which are not usually contractive (or not known to be contractive), but rates of convergence can be estimated; see Appendices B and C.
In a final theoretical section (Section 6) we will briefly introduce some extensions of these techniques to non-convex problems.
Then, in Section 7, we will review a series of practical problems (e.g., first- and higher-order regularization of inverse problems, feature selection and dictionary learning, segmentation, basic inpainting, optical flow), each time explaining which methods can be used (and giving the implementations in detail), and how methods can be tuned to each problem. Of course, we do not claim that we will always give the ‘optimal’ method to solve a problem, and we will try to refer to the relevant literature where a more thorough study can be found.
Our review of first-order algorithms for imaging problems is partly inspired by our own work and that of many colleagues, but also by important textbooks in optimization (Polyak Reference Polyak1987, Bertsekas Reference Bertsekas2015, Ben-Tal and Nemirovski Reference Ben-Tal and Nemirovski2001, Nesterov Reference Nesterov2004, Boyd and Vandenberghe Reference Boyd and Vandenberghe2004, Nocedal and Wright Reference Nocedal and Wright2006, Bauschke and Combettes Reference Bauschke and Combettes2011). However, we have tried to keep the level of detail as simple as possible, so that most should be accessible to readers with very little knowledge of optimization theory. Naturally we refer the interested reader to these references for a deeper understanding of modern optimization.
Finally we should mention that we will overlook quite a few important problems and methods in imaging. First, we will not discuss combinatorial optimization techniques for regularization/segmentation, as we fear that this would require us to almost double the size of the paper. Such methods, based on graph cuts or network flows, are very efficient and have been extensively developed by the computer vision community to tackle most of the problems we address here with continuous optimization. As an example, the paper of Boykov, Veksler and Zabih (Reference Boykov, Veksler and Zabih2001), which shows how to minimize the ‘Potts’ model (7.25) using graph-cuts, attains almost 6000 citations in Google Scholar, while the maximal flow algorithm of Boykov and Kolmogorov (Reference Boykov and Kolmogorov2004) is cited more than $3500$ times. We believe the two approaches complement one another nicely: they essentially tackle the same sort of problems, with similar structures, but from the perspective of implementation they are quite different. In particular, Hochbaum (Reference Hochbaum2001) presents an approach to solve exactly a particular case of Problem 2.6 in polynomial time; see also Darbon and Sigelle (Reference Darbon and Sigelle2006a, Reference Darbon and Sigelle2006b) (the variant in Chambolle and Darbon Reference Chambolle, Darbon, Lézoray and Grady2012 might be more accessible for the reader unfamiliar with combinatorial optimization). In general, graph-based methods are harder to parallelize, and can approximate fewer general energies than methods based on continuous optimization. However, they are almost always more efficient than non-parallel iterative continuous implementations for the same problem.
We will also ignore a few important issues and methods in image processing: we will not discuss many of the ‘non-local’ methods, which achieve state of the art for denoising (Dabov, Foi, Katkovnik and Egiazarian Reference Dabov, Foi, Katkovnik and Egiazarian2007, Buades, Coll and Morel Reference Buades, Coll and Morel2005, Buades, Coll and Morel Reference Buades, Coll and Morel2011). Although these approaches were not introduced as ‘variational’ methods, it is now known that they are closely related to methods based on structured sparsity (Danielyan, Katkovnik and Egiazarian Reference Danielyan, Katkovnik and Egiazarian2012) or (patch-based) Gaussian mixture models (Mallat and Yu Reference Mallat and Yu2010, Yu, Sapiro and Mallat Reference Yu, Sapiro and Mallat2012, Lebrun, Buades and Morel Reference Lebrun, Buades and Morel2013) and can be given a ‘variational’ form (Gilboa, Darbon, Osher and Chan Reference Gilboa, Darbon, Osher and Chan2006, Kindermann, Osher and Jones Reference Kindermann, Osher and Jones2005, Peyré, Bougleux and Cohen Reference Peyré, Bougleux and Cohen2008, Arias, Facciolo, Caselles and Sapiro Reference Arias, Facciolo, Caselles and Sapiro2011). The numerical algorithms to tackle these problems still need a lot of specific tuning to achieve good performance. We will address related issues in Section 7.12 (on ‘Lasso’-type problems) and present alternatives to non-local denoising.
Moreover, we will not mention the recent developments in computer vision and learning based on convolutional neural networks, or CNNs (LeCun et al. Reference LeCun, Boser, Denker, Henderson, Howard, Hubbard and Jackel1989, Krizhevsky, Sutskever and Hinton Reference Krizhevsky, Sutskever and Hinton2012), which usually achieve the best results in classification and image understanding. These models (also highly non-local) are quite different from those introduced here, although there is a strong connection with dictionary learning techniques (which could be seen as a basic ‘first step’ of CNN learning). Due to the complexity of the models, the optimization techniques for CNNs are very specific and usually rely on stochastic gradient descent schemes for smoothed problems, or stochastic subgradient descent (Krizhevsky et al. Reference Krizhevsky, Sutskever and Hinton2012, LeCun, Bottou, Orr and Muller Reference LeCun, Bottou, Orr, Muller, Orr and Muller1998b). The second author of this paper has recently proposed a framework which in some sense bridges the gap between descent methods or PDE approaches and CNN-based learning (Chen, Ranftl and Pock Reference Chen, Ranftl and Pock2014b).
More generally, we will largely ignore recent developments in stochastic first-order methods in optimization, which have been driven by big data applications and the need to optimize huge problems with often billions of variables (in learning and statistics, hence also with obvious applications to image analysis and classification). We will try to provide appropriate references when efficient stochastic variants of the methods described have recently been developed.
We now describe, in the next section, the key exemplary optimization problems which we are going to tackle throughout this paper.
2 Typical optimization problems in imaging
First let us give the reader a taste of typical optimization problems that arise from classical models in image processing, computer vision and machine learning. Another of our goals is to give a short overview of typical applications of variational models in imaging; more specific models will then be described in Section 7. Among the most important features in images are edges and texture. Hence, an important property of models in image processing is the ability to preserve sharp discontinuities in their solutions in order to keep precise identification of image edges. Another goal of most models is robustness (Ben-Tal and Nemirovski Reference Ben-Tal and Nemirovski1998, Ben-Tal, El Ghaoui and Nemirovski Reference Ben-Tal, El Ghaoui and Nemirovski2009), that is, the solution of a model should be stable in the presence of noise or outliers. In practice this implies that successful models should be non-smooth and hence non-differentiable. Indeed, a successful approach to these issues is known to be realized by the minimization of robust error functions based on norm functions. Classical optimization algorithms from non-linear optimization, such as gradient methods, Newton or quasi-Newton methods, cannot be used ‘out of the box’ since these algorithms require a certain smoothness of the objective function or cannot be applied to large-scale problems – hence the need for specialized algorithms that can exploit the structure of the problems and lead efficiently to good solutions.
2.1 Sparse representations
An important discovery in recent years (Candès et al. Reference Candès, Romberg and Tao2006b, Donoho Reference Donoho2006, Aharon et al. Reference Aharon, Elad and Bruckstein2006) is the observation that many real-world signals can be modelled via sparse representation in a suitable basis or ‘dictionary’. This property can be used to reconstruct a signal from far fewer measurements than required by the Shannon–Nyquist sampling theorem, for example, which states that the sampling frequency should be at least twice as high as the highest frequency in the signal. Furthermore, a sparse representation of a signal is desirable since it implies a certain robustness in the presence of noise. Given an input signal $b\in \mathbb{R}^{m}$, a sparse representation in the dictionary $A=(a_{i,j})_{i,j}\in \mathbb{R}^{m\times n}$ of $n$ column vectors $(a_{i,j})_{i=1}^{m}$ can be found by solving the following optimization problem (Mallat and Zhang Reference Mallat and Zhang1993, Chen, Donoho and Saunders Reference Chen, Donoho and Saunders1998):
where $x\in \mathbb{R}^{n}$ is the unknown coefficient vector. This model is usually known by the name basis pursuit (Chen and Donoho Reference Chen and Donoho1994). Since each column of $A$ can be interpreted as a basis atom, the equality constraint $Ax=b$ describes the fact that the signal $b$ should be represented as a sparse linear combination of those atoms. The function $f(x)$ is a sparsity-inducing function, such as $f(x)=\Vert x\Vert _{1}:=\sum _{i}|x_{i}|$ in the most simple case.
If some further prior knowledge concerning a relevant group structure is available, one can encode such information in the sparsity-inducing function. This idea is known as group sparsity, and is widely used in data analysis. It consists in using $\ell _{1,p}$-norms, with $p=2$ or $p=\infty$. The $p$-norm is taken within the groups and the $1$-norm is taken between the groups. This forces the solution to have only a few active groups, but within the active groups the coefficients can be dense.
For problems such as matrix factorization (Paatero and Tapper Reference Paatero and Tapper1994, Lee and Seung Reference Lee and Seung1999) or robust principal component analysis (Candès, Li, Ma and Wright Reference Candès, Li, Ma and Wright2011), where $x$ is tensor-valued, the sparsity-inducing norm could also be a function promoting the sparsity of the singular values of $x$ and hence forcing $x$ to be of low rank. A popular choice to achieve this goal is the $1$-Schatten norm (or nuclear norm) $\Vert \cdot \Vert _{{\mathcal{S}}_{1}}$, which is given by the $1$-norm of the singular values of $x$, and is polar to the spectral/operator norm $\Vert \cdot \Vert _{{\mathcal{S}}_{\infty }}$.
A more general formulation that also allows for noise in the observed signal $b$ is given by the following optimization problem, popularized by the name ‘Lasso’, least absolute shrinkage and selection operator (Tibshirani Reference Tibshirani1996):
where ${\it\lambda}>0$ is a parameter that can be adapted to the noise level of $b$. The parameter ${\it\lambda}$ can also be interpreted as a Lagrange multiplier for the constraint ${\textstyle \frac{1}{2}}\Vert Ax-b\Vert _{2}^{2}\leq {\it\sigma}^{2}$, where ${\it\sigma}$ is an estimate of the noise level. This shows the close connection between (2.1) and (2.2).
The Lasso approach can also be interpreted as a model that tries to synthesize the given signal $b$ using only a small number of basis atoms. A closely related problem is obtained by moving the linear operator $A$ from the data-fitting term to the regularization term, that is,
where $B$ is again a linear operator. If $A$ is invertible and $B=A^{-1}$, a simple change of variables shows that the two problems are equivalent. However, the more interesting cases are for non-invertible $B$, and the two problems can have very different properties. Here, the linear operator $B$ can be interpreted as an operator analysing the signal, and hence the model is known as the co-sparse analysis model (Nam, Davies, Elad and Gribonval Reference Nam, Davies, Elad and Gribonval2013). The basic idea behind this approach is that the scalar product of the signal with a given family of filters should vanish most of the time. The most influential model in imaging utilizing such sparse analysis regularizers is the total variation regularizer.
Here, we recall the ‘ROF’ (Rudin, Osher and Fatemi Reference Rudin, Osher and Fatemi1992, Chambolle and Lions Reference Chambolle and Lions1997) model for total variation based image denoising. We consider a scalar-valued digital image $u\in \mathbb{R}^{m\times n}$ of size $m\times n$ pixels.Footnote 1 A simple and standard approach for defining the (discrete) total variation is to use a finite difference scheme acting on the image pixels. We introduce a discrete gradient operator $\text{D}:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}^{m\times n\times 2}$, which is defined by
We will also frequently need the operator norm $\Vert \text{D}\Vert$, which is estimated as
(see Chambolle Reference Chambolle2004b). The discrete ROF model is then defined by
where $u^{\diamond }\in \mathbb{R}^{m\times n}$ is the given noisy image, and the discrete total variation is defined by
that is, the $\ell _{1}$-norm of the $p$-norm of the pixelwise image gradients.Footnote 2 The parameter $p$ can be used, for example, to realize anisotropic ($p=1$) or isotropic ($p=2$) total variation. Some properties of the continuous model, such as the co-area formula, carry over to the discrete model only if $p=1$, but the isotropic total variation is often preferred in practice since it does not exhibit a grid bias.
From a sparsity point of view, the idea of the total variation denoising model is that the $\ell _{1}$-norm induces sparsity in the gradients of the image, hence it favours piecewise constant images with sparse edges. On the other hand, this property – also known as the staircasing effect – might be considered a drawback for some applications. Some workarounds for this issue will be suggested in Example 4.7 and Section 7.2. The isotropic case ($p=2$) can also be interpreted as a very simple form of group sparsity, grouping together the image derivatives in each spatial dimension.
In many practical problems it is necessary to incorporate an additional linear operator in the data-fitting term. Such a model is usually of the form
where $A:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}^{k\times l}$ is a linear operator, $u^{\diamond }\in \mathbb{R}^{k\times l}$ is the given data, and $k,l$ will depend on the particular application. Examples include image deblurring, where $A$ models the blur kernel, and magnetic resonance imaging (MRI), where the linear operator is usually a combination of a Fourier transform and the coil sensitivities; see Section 7.4 for details.
The quadratic data-fitting term of the ROF model is specialized for zero-mean Gaussian noise. In order to apply the model to other types of noise, different data-fitting terms have been proposed. When the noise is impulsive or contains gross outliers, a simple yet efficient modification is to replace the quadratic data-fitting term with an $\ell _{1}$-data term. The resulting model, called the TV-$\ell _{1}$ model, is given by
This model has many nice properties such as noise robustness and contrast invariance (Nikolova Reference Nikolova2004, Chan and Esedoḡlu Reference Chan and Esedoḡlu2005). However, this does not come for free. While the ROF model still contains some regularity in the data term that can be exploited during optimization, the TV-$\ell _{1}$ model is completely non-smooth and hence significantly more difficult to minimize.
2.2 Three introductory examples for image restoration
We now will present three prototypical examples of image restoration, to which we will frequently refer in the algorithmic parts of the paper.
Example 2.1 (ROF model).
In the first example we consider standard image denoising using the ROF model (2.6) in the presence of Gaussian noise. Figure 2.1 shows the result of total variation based image denoising using this model. It is now well understood that efficient ways to solve this problem rely on convex duality (Chambolle and Lions Reference Chambolle and Lions1995, Chan et al. Reference Chan, Golub and Mulet1999, Chambolle Reference Chambolle2004b); for details on the particular algorithm used here, see Examples 4.8 and 5.6.
Figure 2.1(a) shows the original input image of size $360\times 270$ pixels and intensity values in the range $[0,1]$. Figure 2.1(b) shows its noisy variant, obtained by adding Gaussian noise of standard deviation ${\it\sigma}=0.1$. Figure 2.1(c) shows the result obtained by minimizing the ROF model using the FISTA algorithm (Algorithm 5). We used isotropic total variation $(p=2)$ and we set the regularization parameter ${\it\lambda}=0.1$. Observe that the ROF model successfully removes the noise from the image while preserving the main edges in the image. One can also observe that the ROF model is not very successful at reconstructing textured regions, as it favours piecewise constant images. State-of-the-art denoising methods will usually revert to non-local techniques that treat patches as a whole, allowing better representation of textures (Buades et al. Reference Buades, Coll and Morel2005, Buades et al. Reference Buades, Coll and Morel2011, Dabov et al. Reference Dabov, Foi, Katkovnik and Egiazarian2007). These approaches are not variational at first glance, but variants can be obtained by alternating minimization of non-local energies (Peyré et al. Reference Peyré, Bougleux and Cohen2008, Arias et al. Reference Arias, Facciolo, Caselles and Sapiro2011).
Example 2.2 (TV-deblurring).
In this second example we assume that the observed blurry image $u^{\diamond }$ has been obtained by convolving the unknown image $u$ with a two-dimensional blur kernel $a$ of size $k\times l$ pixels. We can ‘deblur’ the given image by minimizing the model (2.7) with $Au=a\ast u$. If we choose ${\it\lambda}=0$ in (2.7), then unless the original image $u^{\diamond }$ has no noise at all, it is well known that the noise will be amplified by the deconvolution process and ruin the quality of the deconvolution.
Figure 2.2 shows an example of image deblurring with known blur kernel. Figure 2.2(a) shows the original image of size $317\times 438$ pixels and intensity values in the range $[0,1]$. Figure 2.2(b) shows the blurry image together with the blur kernel of size $31\times 31$ pixels. The blurry image has been further degraded by adding zero-mean Gaussian noise with standard deviation $0.01$. Moreover, to get rid of unwanted boundary effects, we modified the input image by setting its intensity values to its average values at the image boundaries. This allows us to approximately assume periodic boundary conditions and hence to use a fast Fourier transform (FFT) to compute the convolution. Another way to deal with the boundary, which works better but is computationally more expensive, is suggested in Almeida and Figueiredo (Reference Almeida and Figueiredo2013).
Figure 2.2(c) shows the deblurred image using no regularization (${\it\lambda}=0$) and Figure 2.2(d) the deblurred image using the total variation regularized deblurring model. The regularization parameter was set to ${\it\lambda}=5\times 10^{-4}$. Observe that the regularization is essential to reduce the noise in the deblurred image. This particular example has been computed using the PDHG algorithm (Algorithm 6); see also Example 5.7 for details. Note that when the blur kernel is also unknown, the problem becomes non-convex and hence significantly more complex to solve (Levin, Weiss, Durand and Freeman Reference Levin, Weiss, Durand and Freeman2011).
Example 2.3 (TV-$\ell _{1}$ model).
In this third example we consider image restoration in the presence of salt-and-pepper noise. For this we utilize the TV-$\ell _{1}$ model (2.8). Figure 2.3 shows an example where the TV-$\ell _{1}$ model can successfully denoise an image of size $375\times 500$ pixels that has been degraded by adding $20\%$ salt-and-pepper noise. The intensity values of the input image are again in the range $[0,1]$. For comparison we also show the results of the ROF model (2.6) for this example. For the TV-$\ell _{1}$ model the regularization parameter was set to ${\it\lambda}=0.6$; for ROF, the regularization parameter was set to ${\it\lambda}=0.25$. It can be seen that the results of the ROF model are significantly inferior, since the quadratic data term of the ROF model does not fit the distribution of the salt-and-pepper noise at all well. The example was computed again using the PDHG algorithm (Algorithm 6); see also Example 5.8 for details.
3 Notation and basic notions of convexity
We recall some basic notions of convexity, and introduce our notation. Throughout the paper, at least in the theoretical parts, ${\mathcal{X}}$ (and ${\mathcal{Y}}$) is a Hilbert or Euclidean space endowed with a norm $\Vert \cdot \Vert =\langle \cdot ,\cdot \rangle ^{1/2}$. The results in this section and the next should usually be understood in finite dimensions, but most of them do not depend on the dimension, and often hold in a Hilbert space. If $M$ is a bounded positive definite symmetric operator, we define $\Vert x\Vert _{M}=\langle Mx,x\rangle ^{1/2}$, which in finite-dimensional space is a norm equivalent to $\Vert x\Vert$.
In two-dimensional image processing we usually consider norms acting on images $u$ defined on a regular Cartesian grid of $m\times n$ pixels. When the pixels are scalar-valued, that is, $u_{i,j}\in \mathbb{R}$, the image can also be written in the form $u=(u_{1,1},\ldots ,u_{m,n})\in \mathbb{R}^{m\times n}$.
A $p$-vector norm acting on the image is hence given by
When the pixels of an image $\mathbf{u}$ of size $m\times n$ pixels are vector-valued, we will adopt the notation $\mathbf{u}=(\mathbf{u}_{1,1},\ldots ,\mathbf{u}_{m,n})\in \mathbb{R}^{m\times n\times r}$, with bold-font variables $\mathbf{u}_{i,j}\in \mathbb{R}^{r}$ referring to the vector-valued pixel. In such images we will consider mixed $p,q$-vector norms which are given by
with $|\mathbf{u}_{i,j}|_{p}=(\sum _{k=1}^{r}|u_{i,j,k}|^{p})^{1/p}$ denoting the $p$-vector norm acting on the single pixels. Similarly, if the pixels are matrix-valued (or tensor-valued), that is, $\mathbf{U}_{i,j}\in \mathbb{R}^{r\times s}$, we have $\mathbf{U}=(\mathbf{U}_{1,1},\ldots ,\mathbf{U}_{m,n})\in \mathbb{R}^{m\times n\times r\times s}$, and we will consider matrix norms, acting on the single pixels $\mathbf{U}_{i,j}$.
3.1 Convex functions
An extended real valued function $f:{\mathcal{X}}\rightarrow [-\infty ,+\infty ]$ is said to be convex if and only if its epigraph
is a convex set, that is, if when ${\it\lambda}\geq f(x)$, ${\it\mu}\geq f(y)$, and $t\in [0,1]$, we have $t{\it\lambda}+(1-t){\it\mu}\geq f(tx+(1-t)y)$.Footnote 3 It is proper if it is not identically $+\infty$ and nowhere $-\infty$: in this case, it is convex if and only if, for all $x,y\in {\mathcal{X}}$ and $t\in [0,1]$,
It is strictly convex if the above inequality is strict whenever $x\neq y$ and $0<t<1$. It is lower semi-continuous (l.s.c.) if, for all $x\in {\mathcal{X}}$, if $x_{n}\rightarrow x$, then
A trivial but important example is the characteristic function or indicator function of a set $C$:
which is convex, l.s.c., and proper when $C$ is convex, closed and non-empty. The minimization of such functions will allow us to easily model convex constraints in our problems.
3.2 Subgradient
Given a convex, extended real valued, l.s.c. function $f:{\mathcal{X}}\rightarrow [-\infty ,+\infty ]$, we recall that its subgradient at a point $x$ is defined as the set
An obvious remark which stems from the definition is that this notion allows us to generalize Fermat’s stationary conditions (${\rm\nabla}f(x)=0$ if $x$ is a minimizer of $f$) to non-smooth convex functions: we indeed have
The function is strongly convex or ‘${\it\mu}$-convex’ if in addition, for $x,y\in {\mathcal{X}}$ and $p\in \partial f(x)$, we have
or, equivalently, if $x\mapsto f(x)-{\it\mu}\Vert x\Vert ^{2}/2$ is also convex. It is then, obviously, strictly convex as it satisfies
for any $x,y$ and any $t\in [0,1]$. A trivial but important remark is that if $f$ is strongly convex and $x$ is a minimizer, then we have (since $0\in \partial f(x)$)
for all $y\in {\mathcal{X}}$.
The domain of $f$ is the set $\text{dom}\,f=\{x\in {\mathcal{X}}:f(x)<+\infty \}$, while the domain of $\partial f$ is the set $\text{dom}\,\partial f=\{x\in {\mathcal{X}}:\partial f(x)\neq \emptyset \}$. Clearly $\text{dom}\,\partial f\subset \text{dom}\,f$; in fact if $f$ is convex, l.s.c. and proper, then $\text{dom}\,\partial f$ is dense in $\text{dom}\,f$ (Ekeland and Témam Reference Ekeland and Témam1999). In finite dimensions, one can show that for a proper convex function, $\text{dom}\,\partial f$ contains at least the relative interior of $\text{dom}\,f$ (that is, the interior in the vector subspace which is generated by $\text{dom}\,f$).
3.3 Legendre–Fenchel conjugate
To any function $f:{\mathcal{X}}\rightarrow [-\infty ,+\infty ]$ one can associate the Legendre–Fenchel conjugate (or convex conjugate)
which, as a supremum of linear and continuous functions, is obviously convex and lower semi-continuous. The biconjugate $f^{\ast \ast }$ is then the largest convex l.s.c. function below $f$ (from the definition it is easy to see that $f^{\ast \ast }\leq f$); in particular, if $f$ is already convex and l.s.c., we have $f^{\ast \ast }=f$. This is a consequence of the convex separation theorem (a corollary of the Hahn–Banach theorem), which is a difficult result in general (see Brézis Reference Brézis1983 for an introduction to convex duality in infinite dimensions which includes a detailed proof of this result), but it is a trivial consequence of the projection onto closed convex sets in Euclidean or Hilbert spaces.
By definition, we see that $x$ realizes the sup in (3.3) if and only if $y\in \partial f(x)$, and we have $f(x)+f^{\ast }(y)=\langle y,x\rangle$. In this case we easily deduce that $f^{\ast \ast }(x)=f(x)=\langle y,x\rangle -f^{\ast }(x)$, so that in particular, $y$ realizes the sup which defines $f^{\ast \ast }(x)$. Also, it follows that $x\in \partial f^{\ast }(y)$. We deduce the celebrated Legendre–Fenchel identity:
In particular, $\partial g$ and $\partial g^{\ast }$ are inverses. From the definition, it is clear that the subgradient of a convex function is a monotone operator, that is, it satisfies
while it is strongly monotone if $f$ is strongly convex:
An important remark is that a convex l.s.c. function $f$ is ${\it\mu}$-strongly convex if and only if its conjugate $f^{\ast }$ is $C^{1}$ with $(1/{\it\mu})$-Lipschitz gradient. In fact, $f$ is ${\it\mu}$-strongly convex if and only if, for any $x\in {\mathcal{X}}$ and $p\in \partial f(x)$, the ‘parabola’
touches the graph of $f$ at $x$ from below. But then, fairly simple computations show that the graph of $f^{\ast }$ is touched from above at $p$ by the conjugate parabola
which is equivalent to saying that $x={\rm\nabla}f^{\ast }(p)$ and (if this holds at all $p$) that $p\mapsto x={\rm\nabla}f^{\ast }(p)$ is $(1/{\it\mu})$-Lipschitz. Observe that in this case, the strong monotonicity of $\partial f$ also reads
which expresses that ${\rm\nabla}f^{\ast }$ is a ${\it\mu}$-co-coercive monotone operator: in general the gradient of a convex function with $L$-Lipschitz gradient is $(1/L)$-co-coercive.
We must mention here that subgradients of convex l.s.c. functions are only a particular class of maximal monotone operators, which are multivalued operators $T:{\mathcal{X}}\rightarrow {\mathcal{P}}({\mathcal{X}})$ such that
and whose graph $\{(x,p):x\in Tp\}\subset {\mathcal{X}}\times {\mathcal{X}}$ is maximal (with respect to inclusion) in the class of graphs of operators which satisfy (3.5). Strongly monotone and co-coercive monotone operators are defined accordingly. It is also almost obvious from the definition that any maximal monotone operator $T$ has an inverse $T^{-1}$ defined by $x\in T^{-1}p\Leftrightarrow p\in Tx$, which is also maximal monotone. The operators $\partial f$ and $\partial f^{\ast }$ are inverse in this sense. Examples of maximal monotone operators which are not subgradients of a convex function are given by skew-symmetric operators. See, for instance, Brézis (Reference Brézis1973) for a general study of maximal monotone operators in Hilbert spaces.
3.4 Proximal map and resolvent
Another important role in optimization is played by the so-called proximal map or proximity operator of a convex function defined as follows. If $f$ is convex, proper and l.s.c., then clearly, for any $x$, there is a unique minimizer ${\hat{y}}$ to the strongly convex problem
which also satisfies
for any $y$ (thanks to strong convexity). We let ${\hat{y}}:=\text{prox}_{{\it\tau}f}(x)$. It is easy to show that this defines a $1$-Lipschitz, monotone operator, which is itself the gradient of a convex function. Basic subdifferential calculus (Rockafellar Reference Rockafellar1997) shows that
in other words ${\hat{y}}=(I+{\it\tau}\partial f)^{-1}x$ is given by the resolvent of the maximal monotone operator ${\it\tau}\partial f$ at $x$. In general it is shown that $T$ is maximal monotone if and only if its resolvent $(I+T)^{-1}$ is well defined and single-valued; this is an important theorem due to Minty (Reference Minty1962). The resolvent is also a weak contraction, as well as a ‘firmly non-expansive operator’ (Bauschke, Moffat and Wang Reference Bauschke, Moffat and Wang2012), or equivalently a ‘$1/2$-averaged operator’; see Appendix A.
Playing with this expression and (3.4), we can easily deduce Moreau’s identity (Moreau Reference Moreau1965)
which in fact holds for any maximal monotone operators $T,T^{-1}$. It shows in particular that if we know how to compute $\text{prox}_{{\it\tau}f}$, then we also know how to compute $\text{prox}_{f^{\ast }/{\it\tau}}$. Finally, we will sometimes let $\text{prox}_{{\it\tau}f}^{M}(x)$ denote the proximity operator computed in the metric $M$, that is, the solution of
3.5 Fenchel–Rockafellar duality
We now introduce an essential notion in convex programming, that is, convex duality. This notion allows us to transform convex problems into other problems which sometimes have a nicer structure and are easier to tackle. A fairly extensive and very enlightening recent survey on duality for imaging and inverse problems can be found in Borwein and Luke (Reference Borwein, Luke and Scherzer2015).
Consider the minimization problem
where
are convex l.s.c. functions and $K:{\mathcal{X}}\rightarrow {\mathcal{Y}}$ is a bounded linear operator. Then, since $f=f^{\ast \ast }$, one can write
Under very mild conditions on $f,g$ (such as $f(0)<\infty$ and $g$ continuous at $0$ (see e.g. Ekeland and Témam Reference Ekeland and Témam1999, (4.21)); in finite dimensions it is sufficient to have a point $x$ with both $Kx$ in the relative interior of $\text{dom}\,f$ and $x$ in the relative interior of $\text{dom}\,g$ (Rockafellar Reference Rockafellar1997, Corollary 31.2.1)), one can swap the min and sup in the relation above and write
The last problem in this formula is the (Fenchel–Rockafellar) dual problem. Under the assumptions above, it has at least a solution $y^{\ast }$. If $x^{\ast }$ is any solution of the initial primal problem, then $(x^{\ast },y^{\ast })$ is a saddle point of the primal–dual formulation: for any $(x,y)\in {\mathcal{X}}\times {\mathcal{Y}}$ we have
where
denotes the Lagrangian. In particular, it satisfies
Observe that the primal–dual gap
which is always non-negative (even if the min and sup cannot be swapped), vanishes if and only if $(x,y)$ is a saddle point.
Finally we remark that
is a maximal monotone operator, being the sum of two maximal monotone operators, only one of which is a subgradient, and the conditions above can be written $T\binom{x^{\ast }}{y^{\ast }}\ni 0$.
Example 3.1 (dual of the ROF model).
As an example, consider the minimization problem (2.6) above. This problem has the general form (3.9), with $x=u$, $K=\text{D}$, $f={\it\lambda}\Vert \cdot \Vert _{p,1}$ and $g=\Vert \cdot -u^{\diamond }\Vert ^{2}/2$. Hence the dual problem (3.11) reads
where $\mathbf{p}\in \mathbb{R}^{m\times n\times 2}$ is the dual variable. Equation (3.13) shows that the solution $u$ of the primal problem is recovered from the solution $\mathbf{p}$ of the dual by letting $u=u^{\diamond }-\text{D}^{\ast }\mathbf{p}$. One interesting observation is that the dual ROF model has almost exactly the same structure as the Lasso problem (2.2).
In this example, $f$ is a norm, so $f^{\ast }$ is the indicator function of the polar ball: in this case the dual variable has the structure $\mathbf{p}=(\mathbf{p}_{1,1},\ldots ,\mathbf{p}_{m,n})$, where $\mathbf{p}_{i,j}=(p_{i,j,1},p_{i,j,2})$ is the per pixel vector-valued dual variable, and therefore
where $q$ is the parameter of the polar norm ball which is defined via $1/p+1/q=1$. The most relevant cases are $p=1$ or $p=2$. In the first case we have $q=+\infty$, so the corresponding constraint reads
In the second case we have $q=2$, and the corresponding constraint reads
Of course, more complex norms can be used, such as the nuclear norm for colour images. In this case the per pixel dual variable $\mathbf{p}_{i,j}$ will be matrix-valued (or tensor-valued) and should be constrained to have its spectral (operator) norm less than ${\it\lambda}$, for all $i,j$. See Section 7.3 for an example and further details.
In practice, we will (improperly) use ‘dual problem’ to denote the minimization problem
which is essentially a projection problem. For this problem, it is interesting to observe that the primal–dual gap
gives a bound on the $\ell _{2}$-error $\frac{1}{2}\Vert u-u^{\ast }\Vert ^{2}$, where $(u^{\ast },\mathbf{p}^{\ast })$ is a saddle point. More precisely, if we use both the strong convexity of the energy (with respect to $u$) and the strong convexity of the dual energy (with respect to $\text{D}^{\ast }\mathbf{p}$) and recall that $u^{\ast }=u^{\diamond }-\text{D}^{\ast }\mathbf{p}^{\ast }$ (so $\Vert \text{D}^{\ast }\mathbf{p}-\text{D}^{\ast }\mathbf{p}^{\ast }\Vert ^{2}=\Vert (u^{\diamond }-\text{D}^{\ast }\mathbf{p})-u^{\ast }\Vert ^{2}$), we find that
We can even provide a slightly finer criterion, since if we introduce the middle value
then it follows from (3.19) that
Using (3.18) we obtain the error criterion
This can be used to test the convergence of algorithms for this problem, at least when a dual variable $\mathbf{p}$ is correctly identified, since in that case, if $u$ is not provided by the algorithm, one can let $u=\tilde{u} =u^{\diamond }-\text{D}^{\ast }\mathbf{p}$. It also shows how to obtain, in a primal–dual method which provides both $u$ and $\mathbf{p}$ with $u\not =\tilde{u}$, a slightly better estimate of the primal $\ell _{2}$-error (and of the root-mean-square error $\Vert \tilde{u} -u^{\ast }\Vert /\sqrt{mn}$) than that given by the gap.
We will now describe, starting from the simplest, the first-order optimization methods that can be implemented to solve the problems described so far and a few others that will be introduced in Section 7.
4 Gradient methods
The first family of methods we are going to describe is that of first-order gradient descent methods. It might seem a bit strange to introduce such simple and classical tools, which might be considered outdated. However, as mentioned in the Introduction, the most efficient way to tackle many simple problems in imaging is via elaborate versions of plain gradient descent schemes. In fact, as observed in the 1950s, such methods can be considerably improved by adding inertial terms or performing simple over-relaxation steps (or less simple steps, such as Chebyshev iterations for matrix inversion: Varga Reference Varga1962), line-searches, or more elaborate combinations of these, such as conjugate gradient descent; see for instance Polyak (Reference Polyak1987, Section 3.2) or Bertsekas (Reference Bertsekas2015, Section 2.1). Also, if second-order information is available, Newton’s method or quasi-Newton variants such as the (l-)BFGS algorithm (Byrd, Lu, Nocedal and Zhu Reference Byrd, Lu, Nocedal and Zhu1995) can be used, and are known to converge very fast. However, for medium/large non-smooth problems such as those described above, such techniques are not always convenient. It is now acknowledged that, if not too complex to implement, then simpler iterations, which require fewer operations and can sometimes even be parallelized, will generally perform better for a wide class of large-dimensional problems, such as those considered in this paper.
In particular, first-order iterations can be accelerated by many simple tricks such as over-relaxation or variable metrics – for instance Newton’s method – but this framework can be transferred to fairly general schemes (Vũ Reference Vũ2013b, Combettes and Vũ Reference Combettes and Vũ2014), and since the seminal contribution of Nesterov (Reference Nesterov1983) it has been understood that some of the over-relaxation techniques developed for matrix inversion can be adapted to the non-linear setting, providing efficient first-order schemes for non-linear or non-smooth minimization problems. Let us start with the simplest approach and show how it can be improved.
4.1 Gradient descent
We therefore start by describing gradient descent methods, and we will see that these are sufficient to provide efficient methods for solving simple problems such as the Lasso (2.2) or dual ROF (3.17) problems.
Assume we need to find a minimizer of a convex function $f$, that is, to solve
and let us first assume that $f$ is differentiable. Then, the most straightforward approach to solving the problem is to implement a gradient descent scheme with fixed step size ${\it\tau}>0$: see Algorithm 1. The major issue is that this will typically not work if $f$ is not sufficiently smooth. The natural assumption is that ${\rm\nabla}f$ is Lipschitz with some constant $L$, and $0<{\it\tau}L<2$. If ${\it\tau}$ is too large, this method will oscillate: if for instance $f(x)=x^{2}/2$, then $x^{k+1}=(1-{\it\tau})x^{k}$, and it is obvious that this recursion converges if and only if ${\it\tau}<2$. On the other hand, a Taylor expansion shows that
so that if ${\it\tau}<2/L$, then we see both that $f(x^{k})$ is a strictly decreasing sequence (unless ${\rm\nabla}f(x^{k})=0$ at some point) and that $\sum _{k}\Vert {\rm\nabla}f(x^{k})\Vert ^{2}<+\infty$ if $f$ is bounded from below. If $f$ is, in addition, coercive (with bounded level sets), it easily follows in finite dimensions that $f(x^{k})$ converges to a critical value and that every converging subsequence of $(x^{k})_{k\geq 0}$ goes to a critical point. If $f$ is convex, then $x\mapsto x-{\it\tau}{\rm\nabla}f(x)$ is also a (weak) contraction, which shows that $\Vert x^{k}-x^{\ast }\Vert$ is also non-increasing, for any minimizerFootnote 4$x^{\ast }$ of $f$. In this case we can deduce the convergence of the whole sequence $(x^{k})_{k}$ to a solution, if $0<{\it\tau}<2/L$. In fact, this is a particular case of the fairly general theory of averaged operators, for which such iterations converge: see Theorem A.1 in the Appendix for details and references.
4.2 Implicit gradient descent and the proximal-point algorithm
The above analysis breaks down if ${\rm\nabla}f$ is not Lipschitz, as clearly it will be much harder to understand how $f$ and its gradient behave at the new point given by (4.1). One typical workaround is to use varying steps that converge to zero (see Section 4.3), but this leads to very slow algorithms. Another one is to try to implement an implicit gradient descent where the iteration (4.1) is replaced with
Now the question is how to implement this iteration. We clearly see that if such an $x^{k+1}$ exists, then it satisfies
so it is a critical point of the function
If, in addition, $f$ is convex and l.s.c. (rather than $C^{1}$), then this critical point is the unique minimizer of (4.4), that is, the proximal map of ${\it\tau}f$ evaluated at $x^{k}$ (see Section 3.4). We will say that the convex function $f$ is simple if the proximal maps ${\it\tau}f$, ${\it\tau}>0$, can be easily evaluated. An important observation is that its definition does not require any smoothness of $f$. Now consider the function
which is the Moreau–Yosida regularization of $f$ with parameter ${\it\tau}>0$. It is a standard fact that ${\rm\nabla}f_{{\it\tau}}(\bar{x})$ is a $(1/{\it\tau})$-Lipschitz function whose gradient is given by
Indeed, for any $x,y$, letting ${\it\xi}=\text{prox}_{{\it\tau}f}(x)$ and ${\it\eta}=\text{prox}_{{\it\tau}f}(y)$,
which actually shows that $(x-{\it\xi})/{\it\tau}$ is a subgradient of $f_{{\it\tau}}$ at $x$. In the third line we have used the fact that ${\it\xi}$ is the minimizer of a $(1/{\it\tau})$-strongly convex problem. The last term in this equation expresses the fact that the map $\bar{x}\mapsto (\bar{x}-\text{prox}_{{\it\tau}f}(\bar{x}))/{\it\tau}$ is ${\it\tau}$-co-coercive, which implies that it is also $(1/{\it\tau})$-Lipschitz, as claimed.
Now we can rewrite (4.6) in the form
which shows that an iteration of implicit gradient descent for $f$, which in its general form reads
is exactly the same as an iteration of explicit gradient descent for $f_{{\it\tau}}$, with step ${\it\tau}$ (which is admissible since ${\rm\nabla}f_{{\it\tau}}$ is $(1/{\it\tau})$-Lipschitz). As it is obvious from its definition (4.5) that both $f$ and $f_{{\it\tau}}$ have the same set of minimizers, the convergence theory for the implicit algorithm is a simple corollary of the convergence theory for explicit gradient descent with fixed step. Moreover, as we know we can take slightly larger steps (we can let $x^{k+1}=x^{k}-{\it\sigma}{\rm\nabla}f_{{\it\tau}}(x^{k})$ for any choice of ${\it\sigma}<2{\it\tau}$), then we also deduce the convergence for the iterations:
for any ${\it\theta}\in (-1,1)$. These elementary remarks are in fact at the heart of the rich theory of contraction semigroups in Hilbert and Banach spaces, when applied to more general monotone or accretive operators (indeed, (4.6) is exactly the Yosida regularization of the subgradient operator $\partial f$). See for instance Brézis (Reference Brézis1973).
In optimization theory, the extension of these remarks to a general monotone operator is known as the ‘proximal-point algorithm’ (PPA), introduced in Martinet (Reference Martinet1970). The general form of the PPA algorithm is the iteration
(with a possible relaxation as in (4.8)), which is shown to converge to a zero of $T$ under certain conditions.Footnote 5 This method has been studied extensively since the 1970s (Martinet Reference Martinet1970, Rockafellar Reference Rockafellar1976), and in fact many of the methods we consider later on are special instances. Convergence proofs and rates of convergence can be found, for instance, in Brézis and Lions (Reference Brézis and Lions1978) (these require $\sum _{k}{\it\tau}_{k}^{2}=+\infty$, but $\sum _{k}{\it\tau}_{k}=+\infty$ is sufficient if $T=\partial f$); see also the work of Güler (Reference Güler1991) when $T=\partial f$. In fact some of the results mentioned in Section 4.7 below will apply to this method as a particular case, when $T=\partial f$, extending some of the results of Güler (Reference Güler1991).
Fairly general convergence rates for gradient methods are given in the rich book of Bertsekas (Reference Bertsekas2015, Propositions 5.1.4, 5.1.5), depending on the behaviour of $f$ near the set of minimizers. In the simplest case of the descent (4.1) applied to a function $f$ with $L$-Lipschitz gradient, the convergence rate is found in many other textbooks (e.g. Nesterov Reference Nesterov2004) and reads as follows.
Theorem 4.1. Let $x^{0}\in {\mathcal{X}}$ and $x^{k}$ be recursively defined by (4.1), with ${\it\tau}\leq 1/L$. Then not only does $(x^{k})_{k}$ converge to a minimizer, but the value $f(x^{k})$ decays with the rate
where $x^{\ast }$ is any minimizer of $f$. If in addition $f$ is strongly convex with parameter ${\it\mu}_{f}>0$, we have
where ${\it\omega}=(1-{\it\tau}{\it\mu}_{f})<1$.
A short (standard) proof is given in Appendix B.
Remark 4.2. This form of the result is slightly suboptimal, allowing a very elementary proof in Appendix B. However, it can be checked that the first rate holds for larger steps ${\it\tau}<2/L$, while the second can be improved by taking larger steps (${\it\tau}=2/(L+{\it\mu}_{f})$), yielding linear convergence with a factor ${\it\omega}=(1-{\it\mu}_{f}/L)/(1+{\it\mu}_{f}/L)$; see for instance Nesterov (Reference Nesterov2004, Theorem 2.1.15). However, we will see very soon that this too can be improved.
Of course, the observations above show that similar rates will also hold for the implicit form (4.7): indeed, recalling that $f_{{\it\tau}}(x^{\ast })=f(x^{\ast })$ for any ${\it\tau}>0$, we have that a bound on $f_{{\it\tau}}(x^{k})-f_{{\it\tau}}(x^{\ast })$ is, by definition, also a bound on
We remark that in this implicit case it would seem that we only have to choose the largest possible ${\it\tau}$ to solve the minimization accurately. We will see further (Example 3.1) that in practice, we are not always free to choose the step or the metric which makes the algorithm actually implementable. In other situations the choice of the step might eventually result in a trade-off between the precision of the computation, the overall rate and the complexity of one single iteration (which should also depend on ${\it\tau}$).
4.3 Subgradient descent
Another way to implement a gradient scheme for a non-smooth convex objective is to perform a subgradient descent, that is, try to reduce the energy by following the direction of an arbitrarily chosen subgradient: see Algorithm 2. In general, this method performs poorly, as shown by Polyak (Reference Polyak1987) and Nesterov (Reference Nesterov2004) for example, since the typical rate for such a method (which is also optimal: Nesterov Reference Nesterov2004) is $O(1/\sqrt{k})$ for the best objective found after the $k$th iteration. However, if $f$ is not ‘simple’ but $\partial f$ is easy to compute, this might be an option, but it is preferable to try to use a splitting strategy as described in Section 4.7 and the following. A condition for convergence is that $f$ should be $M$-Lipschitz (at least near the optimum), which is not too restrictive in finite dimensions since $f$ is always locally Lipschitz in the interior of its domain.
The study of convergence of this algorithm is based on the following simple chain of (in)equalities: given $x^{\ast }$ a minimizer, we have
and hence (using $\Vert g_{k}\Vert \leq M$ and letting $x_{k}^{o}=\arg \min _{x_{i},i\leq k}f(x_{i})$)
which goes to $0$ as $k\rightarrow \infty$ by assumption. If we now choose $h_{i}=C/\sqrt{k+1}$ (Nesterov Reference Nesterov2004) for the first $k$ iterations, then at iteration $k$ we find
Clearly, this is much slower than the rate of descent with fixed steps which can be reached when ${\rm\nabla}f$ is Lipschitz.
A variant proposed by Polyak (Reference Polyak1987) consists in choosing $h_{k}=c_{k}(f(x^{k})-f(x^{\ast }))/\Vert g_{k}\Vert ^{2}$, with $c_{k}\in ({\it\alpha},2-{\it\alpha})$ for some ${\it\alpha}>0$. However, this requires knowledge, or a good estimate, of the optimal value $f(x^{\ast })$. The rate for this approach is again $O(1/k)$.
Inexact variants, based on the notion of ${\it\varepsilon}$-subgradients (Rockafellar Reference Rockafellar1997), have also been introduced (Bertsekas and Mitter Reference Bertsekas and Mitter1973, Polyak Reference Polyak1987). These have been studied recently by Benfenati and Ruggiero (Reference Benfenati and Ruggiero2013) to tackle non-linear inverse problems in imaging (see also Bonettini, Benfenati and Ruggiero Reference Bonettini, Benfenati and Ruggiero2014). They have also been used by Bonettini and Ruggiero (Reference Bonettini and Ruggiero2012) to reinterpret a primal–dual scheme of Zhu and Chan (Reference Zhu and Chan2008) (see Section 5.1) and prove its convergence.
4.4 Lower bounds for smooth convex optimization
An important question is what is the best possible rate of convergence of a first-order method applied to a convex optimization problem. Of course, the answer depends on the properties of the function to minimize. An answer in the form of lower bounds has been given by Nemirovski and Yudin (Reference Nemirovski and Yudin1983), and is also found in Nesterov (Reference Nesterov2004). The idea is to consider a fairly general class of first-order methods where the iterates $x^{k}$ are restricted to the subspace spanned by the gradients of earlier iterates, that is, for $k\geq 0$,
where $x^{0}$ is an arbitrary starting point. Then, for $x\in \mathbb{R}^{n}$, we consider $L>0$, ${\it\mu}\geq 0$, $1\leq p\leq n$, the minimization of functions of the form
Starting from an initial point $x^{0}=0$, any first-order method of the class considered can transmit the information of the data term only at the speed of one coordinate index per iteration. This makes such problems very hard to solve by any first-order methods in the class (4.11). Indeed, if we start from $x^{0}=0$ in the above problem (whose solution is given by $x_{k}^{\ast }=1$, $k=1,\ldots ,p$, and $0$ for $k>p$), then at the first iteration, only the first component $x_{1}^{1}$ will be updated (since $\partial f/\partial x_{i}(x^{0})=0$ for $i\geq 2$), and by induction we can check that $x_{l}^{k}=0$ for $l\geq k+1$.
For convenience we reproduce a variant of the results in Nesterov (Reference Nesterov2004) (where a slightly different function is used: see Theorems 2.1.7 and 2.1.13). If ${\it\mu}=0$, using (possible translates of) the function (4.12), which is very ill conditioned (and degenerate if defined in dimension $n>p$), the following general lower bound for smooth convex optimization can be shown.
Theorem 4.3. For any $x^{0}\in \mathbb{R}^{n}$, $L>0$, and $k<n$ there exists a convex, continuously differentiable function $f$ with $L$-Lipschitz-continuous gradient, such that for any first-order algorithm satisfying (4.11), we have
where $x^{\ast }$ denotes a minimizer of $f$.
This particular bound is reached by considering the function in (4.12) with $p=k+1$, and an appropriate change of variable which moves the starting point to the origin. Observe that the above lower bound is valid only if the number of iterates $k$ is less than the problem size. We cannot improve this with a quadratic function, as the conjugate gradient method (which is a first-order method) is then known to find the global minimizer after at most $n$ steps. But the practical problems we encounter in imaging are often so large that we will never be able to perform as many iterations as the dimension of the problem.
If choosing ${\it\mu}>0$ so that the function (4.12) becomes ${\it\mu}$-strongly convex, a lower bound for first-order methods is given in Theorem 2.1.13 of Nesterov (Reference Nesterov2004), which reads as follows.
Theorem 4.4. For any $x^{0}\in \mathbb{R}^{\infty }\simeq \ell _{2}(\mathbb{N})$ and ${\it\mu},L>0$ there exists a ${\it\mu}$-strongly convex, continuously differentiable function $f$ with $L$-Lipschitz-continuous gradient, such that, for any algorithm in the class of first-order algorithms defined by (4.11), we have
for all $k$, where $q=L/{\it\mu}\geq 1$ is the condition number, and $x^{\ast }$ is the minimizer of $f$.
In finite dimensions, one can adapt the proof of Nesterov (Reference Nesterov2004) to show the same result for sufficiently small $k$, with respect to $n$. It is important to bear in mind that these lower bounds are inevitable for any first-order algorithm (assuming the functions are ‘no better’ than with $L$-Lipschitz gradient and ${\it\mu}$-strongly convex). Of course, one could ask if these lower bounds are not too pessimistic, and whether such hard problems will appear in practice. We will indeed see that these lower bounds are highly relevant to our algorithms, and are observed when minimizing relatively simple problems such as the ROF model. Let us mention that many other types of interesting lower bounds can be found in the literature for most of the algorithmic techniques described in this paper, and a few others; see in particular the recent and fairly exhaustive study by Davis and Yin (Reference Davis and Yin2014a).
4.5 Accelerated gradient descent
Let us return to standard gradient descent. It turns out that the rates in Theorem 4.1 are suboptimal, in the sense that smaller upper bounds can be obtained, which (almost) match the lower bounds presented in the previous section. Accelerated variants of the gradient method for non-linear problems were first proposed by Nesterov (Reference Nesterov1983); see also Güler (Reference Güler1992) for the implicit form and variants, including inexact forms, and Salzo and Villa (Reference Salzo and Villa2012) for a more general result. The method consists in simply performing a varying relaxation step at each iteration: see Algorithm 3.
Theorem 4.5. Let $\{x^{k}\}$ be a sequence generated by the accelerated gradient descent (4.15). Then if $x^{\ast }$ is a minimizer, we have
This rate is clearly better than the rate in Theorem 4.1, and, in fact, optimal when comparing to the lower bound in Theorem 4.3. We leave the case where $f$ is strongly convex to Section 4.7 below, where we will present a more general result. Both are particular cases of Theorem B.1 in the Appendix.
Example 4.6 (minimizing the worst-case function).
In this example we show the actual performance of gradient methods for the worst-case function presented in (4.12) using $p=n=100$. Figure 4.1 compares the speed of convergence of gradient descent (GD), accelerated gradient descent (AGD), and conjugate gradient (CG), together with the lower bound for smooth optimization provided in (4.13). The results show that AGD is significantly faster than GD. For comparison we also applied CG, which is known to be an optimal method for quadratic optimization and provides convergence, in finitely many steps, to the true solution, in this case after at most $k=100$ iterations. Observe that CG exactly touches the lower bound at $k=99$ (black cross), which shows that the lower bound is sharp for this problem. Before and after $k=99$, however, the lower bound is fairly pessimistic.
4.6 Descent on the Moreau–Yosida regularization
Let us consider a certain class of problems in which the objective function is the sum of a simple convex function and a quadratic function. This is the case for the dual of the ROF problem or the Lasso problem. We want to solve
with $g$ simple (e.g., a characteristic function of a polar ball or an $\ell _{1}$-norm). An important observation is that in this case the Moreau–Yosida regularization (4.5) of $f$ is actually computable, provided that we choose the metric carefully. Let
which is positiveFootnote 6 if ${\it\tau}\Vert K\Vert ^{2}<1$. Then the Moreau–Yosida regularization of $f$ in the metric $M$ is given by
and the point $\hat{x}=\text{prox}_{f}^{M}(\bar{x})$ which solves this problem is given by
In other words, we can perform an implicit gradient descent (4.7) of $f$ in the metric $M$. For the Lasso problem, this iteration is known as the ‘iterative soft-thresholding’ algorithm (Donoho Reference Donoho1995, Chambolle, DeVore, Lee and Lucier Reference Chambolle, DeVore, Lee and Lucier1998, Daubechies, Defrise and De Mol Reference Daubechies, Defrise and De Mol2004, Bect, Blanc-Féraud, Aubert and Chambolle Reference Bect, Blanc-Féraud, Aubert, Chambolle, Pajdla and Matas2004), as the proximity operator of $g=\Vert \cdot \Vert _{1}$ consists of a ‘soft-thresholding’ of the values. From (4.6), we see that ${\rm\nabla}f_{M}(\bar{x})=\hat{x}-\bar{x}$ (where the gradient is computed also in the metric $M$), and is (still in this metric) $1$-Lipschitz. Therefore, we can solve the problem with a simple gradient descent (4.1) on the function $f_{M}$ (which is equivalent to (4.7) for $f$), or the accelerated descent described in Theorem 4.5.
It turns out that the operator in (4.17) can also be written (in the initial metric)
combining a step of implicit (‘backward’) gradient descent for $g$ and a step of explicit (‘forward’) gradient descent for the smooth part ${\textstyle \frac{1}{2}}\Vert K\cdot -x^{\diamond }\Vert ^{2}$ of (4.16). This is a particular case of a more general gradient descent algorithm which mixes the two points of view explained so far, and which we describe in Section 4.7 below.
These first elementary convergence results can already be applied to quite important problems in imaging and statistics. We first consider plain gradient descent for the primal ROF problem and then show how we can use implicit descent to minimize the dual of the ROF problem (3.17), which has the same structure as the Lasso problem (2.2).
Example 4.7 (minimizing the primal ROF model).
In this example we consider gradient descent methods to minimize the primal ROF model, in (2.6), for $p=2$. As mentioned above, this will work only if the gradient of our energy is Lipschitz-continuous, which is not the case for (2.6). Hence we consider a smoothed version of the total variation, which is obtained by replacing the norm $\Vert \text{D}u\Vert _{2,1}$, which is singular at $0$, with a smoothed approximation; this means in practice that we solve a different problem, but we could theoretically estimate how far the solution to this problem is from the solution to the initial problem. A classical choice is
where ${\it\varepsilon}>0$ is a (usually small) parameter. While this approximation is $C^{\infty }$, it tends to promote large gradients near the discontinuities of the image. A good alternative is the ‘Huber regularizer’. Letting
which is merely $C^{1}$ but smooths the absolute value function only locally around zero, we consider the following Huber-ROF problem:
with
Observe that (4.19) is strongly convex with parameter ${\it\mu}=1$. Although we want to minimize the primal problem here, we remark that the dual of the Huber-ROF model is
with
where ${\it\delta}_{\{\Vert \cdot \Vert _{2,\infty }\leq {\it\lambda}\}}(\mathbf{p})$ denotes the characteristic function of the polar ball $\{\mathbf{p}:\Vert \mathbf{p}\Vert _{2,\infty }\leq {\it\lambda}\}$ as in (3.16): it is simply the dual (3.17) of ROF, to which a small ‘smoothing term’ ${\it\varepsilon}/(2{\it\lambda})\Vert \mathbf{p}\Vert ^{2}$ has been added.
The gradient of (4.19) is computed as
where $\tilde{\mathbf{p}}={\rm\nabla}H_{{\it\varepsilon}}(Du)$, and it can be written as $\tilde{\mathbf{p}}=(\tilde{\mathbf{p}}_{1,1},\ldots ,\tilde{\mathbf{p}}_{m,n})$, with $\tilde{\mathbf{p}}_{i,j}$ given by
A simple computation shows that ${\rm\nabla}f(u)$ is Lipschitz-continuous with parameter $L=1+(\Vert \text{D}\Vert ^{2}{\it\lambda})/{\it\varepsilon}$, where $\Vert \text{D}\Vert \leq \sqrt{8}$ is the operator norm of $\text{D}$; see (2.5).
It can be observed that the auxiliary variables $\tilde{\mathbf{p}}$ are feasible (that is, ${\it\delta}_{\{\Vert \cdot \Vert _{2,\infty }\leq {\it\lambda}\}}(\tilde{\mathbf{p}})=0$) dual variables, as by definition they satisfy (3.14). Hence we can use these expressions to compute the primal–dual gap (3.18) (where the regularizer and its conjugate now need to be replaced with $H_{{\it\varepsilon}}(\text{D}u)$ and $H_{{\it\varepsilon}}^{\ast }(\mathbf{p})$):
Using (3.20), we also obtain that
where $u^{\ast }$ is the solution of (4.19) and $\tilde{u} =(u+u^{\diamond }-\text{D}^{\ast }\tilde{\mathbf{p}})/2$. We implement the gradient descent algorithm (4.1) using a constant step size ${\it\tau}=2/(L+{\it\mu})$ and apply the algorithm to Example 2.1. Figure 4.2 shows the convergence of the primal–dual gap using different values of ${\it\varepsilon}$. Since the objective function is smooth and strongly convex, the gradient descent converges linearly. However, for smaller values of ${\it\varepsilon}$, where the smoothed ROF model approaches the original ROF model, the convergence of the algorithm becomes very slow. The next example shows that it is actually a better idea to minimize the dual of the ROF model.
Example 4.8 (minimizing the dual ROF model).
Let us turn to the problem of minimizing the dual ROF model using the explicit representation of the Moreau–Yosida envelope. We consider (4.16) with $K=\text{D}$ and $g={\it\delta}_{\{\Vert \cdot \Vert _{2,\infty }\leq {\it\lambda}\}}$. The Moreau–Yosida regularization is given by
with ${\it\tau}^{\prime }$ such that $M=(1/{\it\tau}^{\prime })\,I-\text{D}\text{D}^{\ast }>0$, and the minimum of the right-hand side is attained for
where ${\rm\Pi}_{\{\Vert \cdot \Vert _{2,\infty }\leq {\it\lambda}\}}$ denotes the (pixelwise) orthogonal projection onto 2-balls with radius ${\it\lambda}$, that is, for each pixel $i,j$, the projection is computed by
As shown before, the gradient in the $M$-metric is given by
The advantages of minimizing the dual ROF model, rather than the primal ROF model as in Example 4.7, are immediate. Thanks to the implicit smoothing of the Moreau–Yosida regularization, we do not need to artificially smooth the objective function and hence any gradient method will converge to the exact minimizer. Second, the step size of a gradient method will just depend on $\Vert \text{D}\Vert$, whereas the step size of a gradient method applied to the primal ROF model is proportional to the smoothing parameter ${\it\varepsilon}$. We implement both a standard gradient descent (GD) with step size ${\it\tau}=1.9$ and the accelerated gradient descent (AGD) with step size ${\it\tau}=1$. The parameter ${\it\tau}^{\prime }$ in the $M$-metric is set to ${\it\tau}^{\prime }=0.99/\Vert \text{D}\Vert ^{2}$.
Since we are dealing with a smooth, unconstrained optimization in (4.22), we can also try to apply a black-box algorithm, which only needs information about the gradients and the function values. A very popular algorithm is the limited memory BFGS quasi-Newton method (Byrd et al. Reference Byrd, Lu, Nocedal and Zhu1995, Zhu, Byrd, Lu and Nocedal Reference Zhu, Byrd, Lu and Nocedal1997, Morales and Nocedal Reference Morales and Nocedal2011). We applied a $1$-memory variant of the l-BFGS algorithmFootnote 7 to the Moreau–Yosida regularization of the dual ROF model and supplied the algorithm with function values (4.22) (using the correct values of $\hat{p}$) and gradients (4.24). The idea of using variable metric approaches to the Moreau–Yosida regularization of the operator has been investigated in many papers (Bonnans, Gilbert, Lemaréchal and Sagastizábal Reference Bonnans, Gilbert, Lemaréchal and Sagastizábal1995, Burke and Qian Reference Burke and Qian1999, Reference Burke and Qian2000) and can lead to very fast convergence under simple smoothness assumptions. However, it is not always suitable or easily implementable for many of the problems we address in this paper.
The plot in Figure 4.3 represents the decay of the primal–dual gap (which bounds the energy and the $\ell _{2}$-error) obtained from gradient descent (GD), accelerated gradient descent (AGD) and the limited memory BFGS quasi-Newton method (l-BFGS). It appears that the energy actually decreases faster for the accelerated method and the quasi-Newton method, with no clear advantage of one over the other (the first being of course simpler to implement). Also observe that both AGD and l-BFGS are only slightly faster than the lower bound $O(1/k^{2})$ for smooth convex optimization. This shows that the dual ROF model is already quite a hard optimization problem. We should mention here that the idea of applying quasi-Newton methods to a regularized function as in this example has been recently extended to improve the convergence of some of the methods introduced later in this paper, namely the forward–backward and Douglas–Rachford splittings, with very interesting results: see Patrinos, Stella and Bemporad (Reference Patrinos, Stella and Bemporad2014) and Stella, Themelis and Patrinos (Reference Stella, Themelis and Patrinos2016).
4.7 Forward–backward splitting
We can write problem (4.16) in the general form
where $g$ is, as before, a ‘simple’ convex l.s.c. function and $f$ is a convex function with Lipschitz gradient. The basic idea of the forward–backward (FB) splitting scheme is to combine an explicit step of descent in the smooth part $f$ with a implicit step of descent in $g$. We thus introduce the operator
Another name found in the literature (Nesterov Reference Nesterov2013) is ‘composite gradient’ descent, as one may see $(\hat{x}-\bar{x})/{\it\tau}$ as a generalized gradient for $F$ at $\bar{x}$ (in particular, note the analogy with (4.6)). The essential reason justifying this is that a fixed point $\hat{x}=\bar{x}$ will clearly satisfy the stationary condition ${\rm\nabla}f(\bar{x})+\partial g(\bar{x})\ni 0$ of (4.26). Observe that in the particular case where $g={\it\delta}_{C}$ is the characteristic function of a closed, convex set $C$, then $\text{prox}_{{\it\tau}g}(x)$ reduces to ${\rm\Pi}_{C}(x)$ (the orthogonal projection onto $C$) and the mapping $T_{{\it\tau}}$ defines a projected gradient descent method (Goldstein Reference Goldstein1964). See Algorithm 4.
The theoretical convergence rate of plain FB splitting descent is not very good, as one can simply show the same as for gradient descent.
Theorem 4.9. Let $x^{0}\in {\mathcal{X}}$ and $x^{k}$ be recursively defined by (4.25), with ${\it\tau}\leq 1/L$. Then not only does $x^{k}$ converge to a minimizer but we have the rates
where $x^{\ast }$ is any minimizer of $f$. If in addition $f$ or $g$ is strongly convex with parameters ${\it\mu}_{f},{\it\mu}_{g}$ (with ${\it\mu}={\it\mu}_{f}+{\it\mu}_{g}>0$), we have
where ${\it\omega}=(1-{\it\tau}{\it\mu}_{f})/(1+{\it\tau}{\it\mu}_{g})$.
However, its behaviour is improved if the objective is smoother than actually known. Moreover, it is fairly robust to perturbations and can be over-relaxed; see in particular Combettes and Wajs (Reference Combettes and Wajs2005).
An ‘optimal’ accelerated version, generalizing Theorem 4.5, is also available for this method. This is introduced in Nesterov (Reference Nesterov2004) (for projected gradient descent). In the case ${\it\mu}={\it\mu}_{f}+{\it\mu}_{g}=0$, a more general algorithm, popularized under the name ‘FISTA’, is proposed in Beck and Teboulle (Reference Beck and Teboulle2009). The algorithm we present here unifies these approaches. The general iteration takes the form shown in Algorithm 5.
In (4.35), we can assume $L>{\it\mu}_{f}$, and hence ${\it\tau}{\it\mu}_{f}<1$; otherwise $f$ is quadratic and the problem is trivial. We have the following result, which unifies Nesterov (Reference Nesterov2004) and Beck and Teboulle (Reference Beck and Teboulle2009). See also Nesterov (Reference Nesterov2005, Reference Nesterov2013) and Tseng (Reference Tseng2008) for more general variants that enjoy the same convergence rates.
Theorem 4.10. Assume $t_{0}=0$ and let $x^{k}$ be generated by the algorithm, in either case ${\it\mu}=0$ or ${\it\mu}>0$. Then we have the decay rate
It must be mentioned that for ${\it\mu}=0$, a classical choice for $t_{k}$ is also $t_{k}=(k+1)/2$, which gives essentially the same rate. Variants of this choice which ensure, in addition, convergence of the iterates $(x_{k})_{k}$ to a solution, are discussed in Chambolle and Dossal (Reference Chambolle and Dossal2015). An important issue is the stability of these rates when the proximal operators can only be evaluated approximately: the situation here is worse than for the non-accelerated algorithm. Several papers address this issue and derive the corresponding rates, for example Schmidt, Roux and Bach (Reference Schmidt, Roux, Bach, Shawe-Taylor, Zemel, Bartlett, Pereira and Weinberger2011), Villa, Salzo, Baldassarre and Verri (Reference Villa, Salzo, Baldassarre and Verri2013) and Aujol and Dossal (Reference Aujol and Dossal2015); see also Güler (Reference Güler1992) and Salzo and Villa (Reference Salzo and Villa2012) for the backward step only, and d’Aspremont Reference d’Aspremont2008 in the smooth case. Naturally, a rate of convergence for the errors is required to obtain an improved global rate.
Proofs of both Theorems 4.9 and 4.10 are given in Appendix B, where more cases are discussed, including more possibilities for the choice of the parameters. They rely on the following essential but straightforward descent rule.Footnote 8 Let $\hat{x}=T_{{\it\tau}}\bar{x}$. Then, for all $x\in {\mathcal{X}}$,
In particular, if ${\it\tau}L\leq 1$,
The proof is elementary, especially if we follow the lines of the presentation in Tseng (Reference Tseng2008), in a more general setting. By definition, $\hat{x}$ is the minimizer of the $({\it\mu}_{g}+(1/{\it\tau}))$-strongly convex function
It follows that for all $x$ (see (3.7))
But since ${\rm\nabla}f$ is $L$-Lipschitz, $f(\bar{x})+\langle {\rm\nabla}f(\bar{x}),\hat{x}-\bar{x}\rangle \geq f(\hat{x})-(L/2)\Vert \hat{x}-\bar{x}\Vert ^{2}$, so equation (4.36) follows.
Remark 4.11. One can more precisely deduce from this computation that
where $D_{f}(x,y):=f(x)-f(y)-\langle {\rm\nabla}f(y),x-y\rangle \leq (L/2)\Vert x-y\Vert ^{2}$ is the ‘Bregman $f$-distance’ from $y$ to $x$ (Bregman Reference Bregman1967). In particular, (4.37) holds once
which is always true if ${\it\tau}\leq 1/L$ but might also occur in other situations, and in particular, be tested ‘on the fly’ during the iterations. This allows us to implement efficient backtracking strategies of the type of Armijo (Reference Armijo1966) (see Nesterov Reference Nesterov1983, Nesterov Reference Nesterov2013, Beck and Teboulle Reference Beck and Teboulle2009) for the algorithms described in this section when the Lipschitz constant of $f$ is not a priori known.
Remark 4.12. Observe that if $X\subset {\mathcal{X}}$ is a closed convex set containing the domain of $F$, and on which the projection ${\rm\Pi}_{X}$ can be computed, then the same inequality (4.37) holds if $\hat{x}=T_{{\it\tau}}{\rm\Pi}_{X}\bar{x}$ (requiring only that ${\rm\nabla}f$ is Lipschitz on $X$), provided $x\in X$; see Bonettini, Porta and Ruggiero (Reference Bonettini, Porta and Ruggiero2015). This means that the same rates are valid if we replace (4.30) with
which is feasible if $X$ is the domain of $F$.
Discussion
The idea of forward–backward splitting is very natural, and appears in many papers in optimization for imaging: it would not be possible to mention all the related literature. Historically, it is a generalization of projected gradient descent, which dates back at least to Goldstein (Reference Goldstein1964) (see Passty Reference Passty1979, Lions and Mercier Reference Lions and Mercier1979, Fukushima and Mine Reference Fukushima and Mine1981). For minimization problems, it can be viewed as successive minimizations of a parabolic upper bound of the smooth part added to the non-smooth part. It has been generalized, and popularized in the imaging community by Combettes and Wajs (Reference Combettes and Wajs2005), yet a few particular forms were already well known, such as iterative soft-thresholding for the Lasso problem (Daubechies et al. Reference Daubechies, Defrise and De Mol2004). It is not always obvious how to choose parameters correctly when they are unknown. Several backtracking techniques will work, such as those of Nesterov (Reference Nesterov2013), for both the Lipschitz constants and strong convexity parameters; see also Nesterov (Reference Nesterov1983), Beck and Teboulle (Reference Beck and Teboulle2009) and Bonettini et al. (Reference Bonettini, Porta and Ruggiero2015) for estimates of the Lipschitz constant.
For simpler problems such as Lasso (2.2), convergence of the iterates (more precisely of $Ax^{k}$) yields that after some time (generally unknown), the support $\{i:x_{i}^{\ast }=0\}$ of the solution $x^{\ast }$ should be detected by the algorithm (under ‘generic’ conditions). In that case, the objective which is solved becomes smoother than during the first iterations, and some authors have succeeded in exploiting this ‘partial smoothness’ to show better (linear) convergence of the FB descent (Bredies and Lorenz Reference Bredies and Lorenz2008, Grasmair, Haltmeier and Scherzer Reference Grasmair, Haltmeier and Scherzer2011, Liang, Fadili and Peyré Reference Liang, Fadili and Peyré2014, Tao, Boley and Zhang Reference Tao, Boley and Zhang2015). Liang, Fadili and Peyré (Reference Liang, Fadili and Peyré2015) have extended this approach to the abstract setting of Appendix A, so that this remark also holds for some of the saddle-point-type algorithms introduced in Section 5 below.
Another interesting and alternative approach to convergence rates is to use the ‘Kurdyka–Łojasiewicz’ (KL) inequality, which in practice will bound a function of the distance of a point to the critical set by the norm of the (sub)gradient. As shown by Bolte, Daniilidis and Lewis (Reference Bolte, Daniilidis and Lewis2006), such a property will hold for ‘most’ of the functions optimized in practice, including non-smooth functions, and this can lead to improved convergence rates for many algorithms (Attouch, Bolte and Svaiter Reference Attouch, Bolte and Svaiter2013). It is also possible to derive accelerated schemes for problems with different types of smoothness (such as Hölder-continuous gradients); see Nesterov (Reference Nesterov2015).
Finally, a heuristic technique which often works to improve the convergence rate, when the objective is smoother than actually known, consists simply in ‘restarting’ the method after a certain number of iterations: in Algorithm 5 (for ${\it\mu}=0$), we start with a new sequence $(t_{k})_{k}$ letting $t_{\bar{k}}=1$ for some sufficiently large $\bar{k}$. Ideally, we should restart when we are sure that the distance of $x^{\bar{k}}$ to the optimum $x^{\ast }$ (unique if the objective is strongly convex) has shrunk by a given, sufficiently small factor (but the corresponding value $\bar{k}$ depends on the strong convexity parameter). There is a simple way to implement such a scheme while still keeping the global $O(1/k^{2})$ rate (it consists in adapting the idea of the ‘Monotone FISTA’ scheme: see Remark B.3). A rigorous justification of a restarting scheme is discussed in O’Donoghue and Candès (Reference O’Donoghue and Candès2015).
4.8 Extensions and variants
4.8.1 Mirror descent
A natural extension of the proximal descent methods consists in replacing the function $(2{\it\tau})^{-1}\Vert y-x\Vert ^{2}$ in (3.6) with other distances between $y$ and $x$. There can be several good reasons for this.
∙ One may wish to use a (smooth) distance $d(y,x)$ which blows up when $y$ reaches the boundary of certain constraints. This is the principle of barriers and penalty functions.
∙ The proximity operator of a function is not easily computable with the squared Euclidean distance but it is simple in some non-linear metrics.
∙ The ambient space ${\mathcal{X}}$ is neither a Hilbert space nor a Euclidean space, and we need to optimize a function whose gradient is Lipschitz with respect to some non-Hilbert space norm.
The ‘mirror’ descent algorithm was introduced by Nemirovski and Yudin (Reference Nemirovski and Yudin1983) as a tool for optimization in Banach spaces. It requires the introduction of an auxiliary convex function ${\it\psi}$ whose gradient will act as a map between the space ${\mathcal{X}}$ and its dual ${\mathcal{X}}^{\prime }$. In the Hilbert space case, ${\it\psi}(x)=\Vert x\Vert ^{2}/2$ is the most natural choice, but there might be reasons to use other choices (whereas in Banach spaces it is not natural at all). The basic idea is to replace the gradient descent iteration (4.1) with
If we introduce the Bregman ${\it\psi}$-distance
we readily see that it is equivalent to defining $x^{k+1}$ as a point which minimizes
that is, we find $x^{k+1}$ by minimizing the linear approximation of $f$ at $x^{k}$ with some penalization of the distance between the two points. The natural ‘mirror prox’ alternative will consist in solving iteratively, if possible,
and defining $x^{k+1}$ to be the solution. In general it is required that ${\it\psi}$ be smooth and strongly convex with respect to the norm of the space ${\mathcal{X}}$, which is not necessarily Euclidean or Hilbertian. Convergence of these algorithms under various conditions on ${\it\psi}$ are studied in a few important papers; see in particular the papers of Eckstein (Reference Eckstein1993), Teboulle (Reference Teboulle1992), Chen and Teboulle (Reference Chen and Teboulle1993), Kiwiel (Reference Kiwiel1997), Beck and Teboulle (Reference Beck and Teboulle2003) and Auslender and Teboulle (Reference Auslender and Teboulle2004). The extensive monograph by Ben-Tal and Nemirovski (Reference Ben-Tal and Nemirovski2001)Footnote 9 presents many possible variants, with rates.
An important remark is that a non-linear variant of (4.37) is as easy to show in the non-linear case, since if $\hat{x}$ is the minimizer of
for some $f$ and admissible ${\it\psi}$, it satisfies
from which we deduce from simple computations that for any $x\in {\mathcal{X}}$,
It is relatively easy to deduce basic rates of convergence for the mirror and mirror prox schemes from this inequality, in the same way as for the Hilbertian FB splitting.
Quite naturally, this can be generalized to the full forward–backward splitting, where now the problem is of the form (4.26), and the point $\hat{x}$ is obtained from $\bar{x}$ by solving
A non-linear analogue of (4.37) will easily follow from (4.41) and the Lipschitz property of ${\rm\nabla}f$, which reads
where $\Vert \cdot \Vert _{\ast }$ is the norm in ${\mathcal{X}}^{\prime }$ induced by the norm $\Vert \cdot \Vert$ of ${\mathcal{X}}$, with respect to which ${\it\psi}$ is strongly convex. The simple FB descent method (using $\bar{x}=x^{k}$, $x^{k+1}=\hat{x}$) will then converge with essentially the same rate (but constants which depend on the new distance $D_{{\it\psi}}$); see Tseng (Reference Tseng2008) for details. More interesting is the fact that, again, Tseng (Reference Tseng2008) has also introduced accelerated variants which reach a convergence rate in $O(1/k^{2})$, as before (see also Allen-Zhu and Orecchia Reference Allen-Zhu and Orecchia2014). A different way to introduce barriers and non-linearities for solving (4.26) by smoothing is proposed in Nesterov (Reference Nesterov2005), where another $O(1/k^{2})$ algorithm is introduced.
The idea of considering non-linear proximity operator is not purely formal and can be useful. The most classical example is the case of optimization over the unit simplex
Then it is known (Teboulle Reference Teboulle1992, Beck and Teboulle Reference Beck and Teboulle2003) that the entropy
is $1$-strongly convex with respect to the $\ell _{1}$-normFootnote 10
In this case, the (constrained) mirror step takes the form
and the solution $\hat{x}$ satisfies
where ${\it\lambda}$ is a Lagrange multiplier for the constraint $\sum _{i}x_{i}=1$. We obtain that, for $i=1,\ldots ,n$,
There might be two advantages: one is that we do not have to project back onto the simplex (although this projection is very cheap), the other is that the parameters of the problem in the $\ell _{1}$-norm (such as the Lipschitz constant of the smooth part of the objective) might allow us to take a larger time step or yield better constants in the estimates for the rates (Beck and Teboulle Reference Beck and Teboulle2003).
Non-linear smoothing or mirror descent is also useful for solving optimal transportation problems; for applications in imaging see Benamou et al. (Reference Benamou, Carlier, Cuturi, Nenna and Peyré2015) and Ferradans, Papadakis, Peyré and Aujol (Reference Ferradans, Papadakis, Peyré and Aujol2014).
4.8.2 Inertial and over-relaxed algorithms
The accelerated methods described in Theorems 4.5 and 4.10 are based on a particular example of ‘overshooting’, where the new point is obtained by applying an operator to the old point with a ‘momentum’ (here, a multiple of the difference between the two last iterates).
Gradient descent type methods can be accelerated in many similar ways. A very efficient method is the heavy ball (HB) method (Polyak Reference Polyak1987), which consists in iterating
For strongly convex problems, that is, assuming $\ell \,\text{Id}\leq {\rm\nabla}^{2}f\leq L\text{Id}$, this can be optimal: convergence is ensured for $0\leq {\it\beta}<1$ and $0<{\it\alpha}<2(1+{\it\beta})/L$, and the choices ${\it\beta}=q^{2}$, where
yield the optimal rate $\Vert x^{k+1}-x^{\ast }\Vert =O(q^{k})$ (Polyak Reference Polyak1987, Theorem 1).
The heavy ball method has been generalized to monotone operators by Alvarez and Attouch (Reference Alvarez and Attouch2001) (see also Alvarez Reference Alvarez2003, Moudafi and Oliny Reference Moudafi and Oliny2003), so there exist general convergence results that allow for non-smooth terms.
We should of course also mention the conjugate gradient descent method, which is of the same sort, except that the parameters ${\it\alpha}$ and ${\it\beta}$ are updated dynamically at each iteration. Ideally we want to choose ${\it\alpha},{\it\beta}$ which solve
(see Polyak Reference Polyak1987). For a quadratic function this problem is easily solved, and it is known that the descent method obtained minimizes the quadratic function exactly in $\text{rank}A$ iterations, where $A={\rm\nabla}^{2}f$. It is the fastest method in this case (Polyak Reference Polyak1987); see the plot ‘CG’ in Figure 4.1. In practice, this method should be implemented on a sufficiently smooth problem when the cost of performing a line-search (which requires evaluations of the function) is not too large; as for non-quadratic problems, the optimal step cannot be computed in closed form.
A generalization of the HB algorithm to a strongly convex function given by the sum of a smooth, twice continuously differentiable function with Lipschitz-continuous gradient and a non-smooth function, with easily computed proximal map, was investigated for quadratic functions in Bioucas-Dias and Figueiredo (Reference Bioucas-Dias and Figueiredo2007) and for more general smooth functions in Ochs, Brox and Pock (Reference Ochs, Brox and Pock2015). It is of the form
The proximal HB algorithm offers the same optimal convergence rate as the HB algorithm, but can be applied only if the smooth function is twice continuously differentiable. It is therefore very efficient; see Figure 4.4 below for a comparison of this method with other accelerated methods.
Another standard and simple way to speed up such algorithms consists in simply over-relaxing the iterates, that is, replacing $x^{k+1}$ with the value $x^{k+1}+{\it\theta}(x^{k+1}-x^{k})$ such as in (4.8); this is not exactly the same as (4.30)–(4.31). Convergence is generally guaranteed as long as ${\it\theta}<1$; this has been studied in a very general setting by Combettes and Wajs (Reference Combettes and Wajs2005). The theoretical convergence rates are in general only slightly improved by such over-relaxations, but sometimes the empirical rates are much better. On the other hand, there do not seem to be many studies of over-relaxed accelerated algorithms, although a recent paper on the ‘FISTA’ method shows that it is actually possible and improves the convergence (Yamagishi and Yamada Reference Yamagishi and Yamada2011).
4.8.3 (Two) block(s) coordinate descent
It is obvious from the proof in Appendix B that any algorithm which ensures a descent rule such as (4.37) will enjoy the same convergence properties (Theorem 4.9) and can be accelerated by the same techniques as FB splitting. As a particular case, one can efficiently solve problems of the form
Indeed, in its dual formulation, this problem can be written as
and if we minimize successively with respect to $y_{1},y_{2}$, it turns out that we obtain a descent rule similar to (4.37).
Lemma 4.13. Given $\bar{y}_{1},\bar{y}_{2}$, let
Then, for all $(y_{1},y_{2})\in {\mathcal{X}}^{2}$, we have
This is even improved if either $f_{1}^{\ast }$ or $f_{2}^{\ast }$ is strongly convex (equivalently, if at least one of the functions $f_{1}$ or $f_{2}$ has Lipschitz gradient). It clearly follows that the scheme of the proof of Theorem 4.10 will also work for this method: see Appendix B. The proof of the lemma is elementary. Moreover, we can observe that since
is a convex function of $y_{1}$ with $1$-Lipschitz gradient, the alternating minimization method is simply a forward–backward splitting applied to the problem $\min _{y_{1}}f_{1}^{\ast }(y_{1})+\tilde{f}_{2}^{\ast }(y_{1})$; see for instance Combettes and Pesquet (Reference Combettes and Pesquet2011, Example 10.11). Less elementary is the fact that this descent rule still holds if the exact minimizations are replaced with proximal (implicit descent) steps or if the quadratic part is linearized, which can be useful if it involves linear operators; see Chambolle and Pock (Reference Chambolle and Pock2015b) for details.
A particular case of this splitting is used in Section 7.8 to compute Figure 7.9; see the explanations there (Chambolle and Pock Reference Chambolle and Pock2015b, Kolmogorov, Pock and Rolinek Reference Kolmogorov, Pock and Rolinek2016). It can also be used to implement fast parallel solvers for the ROF problem (2.6): the idea is to split the dual variable into two groups, one ‘living’ on the ‘odd’ squares (or cubes in three dimensions), that is, the edges connecting the vertices $(i,j)+\{0,1\}^{2}$, $i,j$ odd, and the other in the ‘even’ squares. Then one can use a dedicated solver to solve exactly or approximately the subproblem on each odd/even square, which are low-dimensional decoupled problems. This is particularly well adapted to implementation on GPUs; details can be found in Chambolle and Pock (Reference Chambolle and Pock2015b).
In general, block coordinate (or Gauss–Seidel) descent schemes can be implemented in many ways, and many generalizations involving non-smooth terms are proved to converge in the literature (Grippo and Sciandrone Reference Grippo and Sciandrone2000, Auslender Reference Auslender1976, Attouch et al. Reference Attouch, Bolte and Svaiter2013). As long as some energy decay is guaranteed, $O(1/k)$ rates are easy to prove. In the context of this chapter, see in particular Beck and Tetruashvili (Reference Beck and Tetruashvili2013); see also Tseng (Reference Tseng2001), Tseng and Yun (Reference Tseng and Yun2009), Beck (Reference Beck2015), Chouzenoux, Pesquet and Repetti (Reference Chouzenoux, Pesquet and Repetti2016) and Nesterov (Reference Nesterov2012).
For more than two blocks, efficient methods can be developed in two different directions: sums of ‘simple’ objective can be dualized and their proximity operators then computed in parallel (Attouch, Briceño-Arias and Combettes Reference Attouch, Briceño-Arias and Combettes2009/10, Raguet, Fadili and Peyré Reference Raguet, Fadili and Peyré2013, Becker and Combettes Reference Becker and Combettes2014, Pustelnik, Chaux and Pesquet Reference Pustelnik, Chaux and Pesquet2011). Acceleration is then possible in this framework (Goldfarb and Ma Reference Goldfarb and Ma2012). On the other hand, randomized algorithms seem to be a very efficient alternative for tackling problems with a huge number of variables or blocks (Nesterov Reference Nesterov2012). In particular, whereas in the deterministic setting it is hard to implement acceleration techniques for problems involving more than two blocks, stochastic block descent methods will typically average out antisymmetric terms in the descent rules and lead to much nicer inequalities which can be exploited to derive very efficient methods (Lin, Lu and Xiao Reference Lin, Lu and Xiao2015). A few recent algorithms recover optimal rates (in particular when specialized to the one-block case) and allow for descent steps which are optimal for each block (Fercoq and Richtárik Reference Fercoq and Richtárik2015, Fercoq and Richtárik Reference Fercoq and Richtárik2013).
4.8.4 FBF splitting
In the context of maximal monotone operators, an important generalization of the FB splitting algorithm is due to Tseng (Reference Tseng2000). The standard FB splitting algorithm requires the forward operator to be co-coercive, for example the gradient of a smooth function. This clearly limits the applicability of the algorithm to more general problems. The following modification, called the forward–backward–forward (FBF) algorithm simply assumes that the forward monotone operator is single-valued and Lipschitz-continuous. It can therefore be applied, for example, if the forward operator is a skew-symmetric matrix. Let $A$, $B$ be two maximal monotone operators with $A$ single-valued on $\text{dom}A\supset \text{dom}B$. The FBF algorithm is defined by the following scheme:
where ${\mathcal{X}}$ is a suitable non-empty set (e.g. $\mathbb{R}^{n}$) and ${\it\tau}_{k}$ is the largest number satisfying, for any ${\it\delta}\in (0,1)$,
which in practice can be determined by an Armijo-type backtracking procedure (Armijo Reference Armijo1966). An important application of this algorithm is to convex-concave saddle-point problems, which we will investigate in more detail in the next section.
4.9 Examples
We conclude this section by providing two examples. In the first example we consider minimizing the dual of the Huber-ROF problem, which is strongly convex and can therefore be minimized using accelerated proximal gradient descent for strongly convex problems. The second example uses the explicit representation of Moreau–Yosida regularization to transform the dual of an anisotropic variant of the ROF model into a form consisting of a smooth plus a non-smooth function, which can be tackled by accelerated forward–backward splitting.
Example 4.14 (minimizing the dual of Huber-ROF).
Let us revisit the dual of the Huber-ROF model introduced in (4.21):
where $u^{\diamond }$ is again the noisy image of size $m\times n$ from Example 2.1, and $\text{D}$ is the (two-dimensional) finite difference operator. This problem is the sum of a smooth function with Lipschitz-continuous gradient,
plus a non-smooth function with easily computed proximal map,
The gradient of the smooth function is given by
and its Lipschitz parameter is estimated again as $L\leq 8$. The non-smooth function is strongly convex with parameter ${\it\mu}={\it\varepsilon}/{\it\lambda}$ and its pixelwise proximal map is given by
Let us now apply the Huber-ROF model to the image in Example 2.1 using the parameters ${\it\lambda}=0.1$ and ${\it\varepsilon}=0.001$. We implemented the FISTA algorithm (Algorithm 5) using the extrapolation parameters corresponding to ${\it\mu}=0$ and the correct ${\it\mu}={\it\varepsilon}/{\it\lambda}$. For comparison, we also implemented the proximal heavy ball algorithm (4.43) and used the optimal parameter settings
Figure 4.4 shows that it is generally not a good idea to apply the classical FISTA algorithm using ${\it\mu}=0$ to a strongly convex problem. On the other hand, applying the FISTA algorithm with the correct settings for the strong convexity, that is, ${\it\mu}={\it\varepsilon}/{\it\lambda}$, largely improves the convergence rate of the algorithm. Interestingly, it turns out that the proximal HB algorithm converges almost twice as fast as the FISTA algorithm (${\it\omega}^{k}$ as opposed to ${\it\omega}^{2k}$ with $q=L/{\it\mu}_{g}$ and ${\it\omega}=(\sqrt{q}-1)/(\sqrt{q}+1)$). In fact the proximal HB algorithm seems to exactly obey the lower bound of first-order algorithms for the strongly convex problems presented in Theorem 4.14.
Example 4.15 (total variation on chains).
We have already seen that when the smooth function is quadratic, the forward–backward algorithm is equivalent to a plain gradient method applied to Moreau–Yosida regularization. The aim of this example is to give a practical problem where such an equivalence does not hold. Consider again the dual of the ROF model:
which differs slightly from our previous ROF problems by the choice of the norm constraining the dual variables. First, application of the adjoint of the finite difference operator to the dual variables $\mathbf{p}=(p_{1},p_{2})$ can be decomposed via
where $\text{D}_{d}^{\ast }$ is the adjoint finite difference operator in the direction $d$. Second, by a change of variables $t_{d}=\text{D}_{d}^{\ast }p_{d}$ and using the property that the constraint on $\mathbf{p}$ is also decomposable, we can rewrite the problem in the equivalent form
where
Hence, as shown in Section 4.8.3, this problem could be easily solved via accelerated alternating minimization in $t_{d}$ if we were able to efficiently compute the proximal maps with respect to ${\it\delta}_{C_{d}}(t_{d})$. Moreover, we have shown that the (accelerated) alternating minimization corresponds to an (accelerated) forward–backward algorithm on the partial Moreau–Yosida regularization that is obtained by partially minimizing (4.47) with respect to one variable, hence corresponding to a non-trivial instance of the forward–backward algorithm. Observe that the characteristic functions of the sets $C_{d}$ are exactly the convex conjugates of the total variation in each dimension $d$, that is,
In other words, if we were able to solve the proximal maps for one-dimensional total variation problems along chains, we could – thanks to Moreau’s identity – also efficiently solve the proximal maps for the functions ${\it\delta}_{C_{d}}(t_{d})$.
As a matter of fact, there exist several direct algorithms that can solve one-dimensional ROF problems very efficiently, and hence the proximal maps for one-dimensional total variation. Some of the algorithms even work in linear time; see Davies and Kovac (Reference Davies and Kovac2001), Condat (Reference Condat2013a), Johnson (Reference Johnson2013) and Kolmogorov et al. (Reference Kolmogorov, Pock and Rolinek2016), and references therein.
Figure 4.5 presents a comparison between the convergence rates of accelerated block descent (FISTA-chains) applied to (4.47) and a standard implementation of FISTA applied to (4.46). To solve the one-dimensional total variation subproblems on chains we used the linear-time dynamic programming approach from Kolmogorov et al. (Reference Kolmogorov, Pock and Rolinek2016). Figure 4.5(a) shows that in terms of iterations, the accelerated block descent is about 10–20 times as fast. Clearly, one iteration of the accelerated block descent is computationally more expensive compared to one iteration of the standard implementation; in our C++ implementation, one iteration of standard FISTA was approximately three times as fast compared to the accelerated block descent. Yet overall the block splitting technique turns out to be more efficient for a given precision, as shown in Figure 4.5(b). Later, in Section 7.8, we will come back to a similar example and show how accelerated block descent can be used to solve large-scale stereo problems.
5 Saddle-point methods
In this section we will briefly describe the main optimization techniques for finding saddle points, which are commonly used for imaging problems. The goal of these approaches is, as before, to split a complex problem into simpler subproblems which are easy to solve – although depending on the structure and properties of the functions, one form might be more suitable than another. We will mostly concentrate on one type of algorithm known as the ‘primal–dual’ algorithm, ‘ADMM’, or ‘Douglas–Rachford splitting’ (see references below) in a Euclidean setting, although more complex splitting techniques can be useful (e.g. Tseng Reference Tseng2000), as well as descent with respect to non-linear metrics or in Banach spaces (Nemirovski Reference Nemirovski2004, Chambolle and Pock Reference Chambolle and Pock2015a). We will mention the simplest useful results. These have been generalized and improved in many ways; see in particular Davis (Reference Davis2015) and Davis and Yin (Reference Davis and Yin2014a,Reference Davis and Yinb) for an extensive study of convergence rates, Chen, Lan and Ouyang (Reference Chen, Lan and Ouyang2014a), Ouyang, Chen, Lan and Pasiliao (Reference Ouyang, Chen, Lan and Pasiliao2015) and Valkonen and Pock (Reference Valkonen and Pock2015) for optimal methods exploiting partial regularity of some objectives, and Fercoq and Bianchi (Reference Fercoq and Bianchi2015) for efficient stochastic approaches.
The natural order in which to present these algorithms should be to start with the Douglas–Rachford splitting (Douglas and Rachford Reference Douglas and Rachford1956; the modern form we will describe is found in Lions and Mercier Reference Lions and Mercier1979) and the ADMM, which have been used for a long time in non-smooth optimization. However, since the convergence results for primal–dual methods are in some sense much simpler and carry on to the other algorithms, we first start by describing these methods.
5.1 Primal–dual algorithms
The problems we consider here are in the ‘standard’ form (3.9)
where $f,g$ are convex, l.s.c. and ‘simple’, and $K:{\mathcal{X}}\rightarrow {\mathcal{Y}}$ is a bounded linear operator. When $f$ is smooth, FB splitting can be used efficiently for such a problem. In other situations we usually have to revert to Lagrangian techniques or primal–dual methods. This is the case for Examples 2.2 and 2.3 below, for example.
The idea is to write the problem as a saddle point as in (3.10):
Then (this dates back to Arrow, Hurwicz and Uzawa Reference Arrow, Hurwicz and Uzawa1958), we alternate a (proximal) descent in the variable $x$ and an ascent in the dual variable $y$:
It is not clear that such iterations will converge. (We can easily convince ourselves that a totally explicit iteration, with $x^{k+1}$ above replaced with $x^{k}$, will in general not converge.) However, this scheme was proposed in Zhu and Chan (Reference Zhu and Chan2008) for problem (2.6) and observed to be very efficient for this problem, especially when combined with an acceleration strategy consisting in decreasing ${\it\tau}$ and increasing ${\it\sigma}$ at each step (e.g., following the rules in Algorithm 8 below). Proofs of convergence for the Zhu–Chan method have been proposed by Esser, Zhang and Chan (Reference Esser, Zhang and Chan2010), Bonettini and Ruggiero (Reference Bonettini and Ruggiero2012) and He, You and Yuan (Reference He, You and Yuan2014). For a general problem there exist several strategies to modify these iterations into converging subsequences. Popov (Reference Popov1981) proposed incorporating a type of ‘extragradient’ strategy into these iterations, as introduced by Korpelevich (Reference Korpelevich1976, Reference Korpelevich1983): the idea is simply to replace $y^{k}$ with $\text{prox}_{{\it\sigma}f^{\ast }}(y^{k}+{\it\sigma}K^{\ast }x^{k})$ in (5.1). This makes the algorithm convergent; moreover, an $O(1/k)$ (ergodic) convergence rate is shown in Nemirovski (Reference Nemirovski2004) (for a class of schemes including this one, using also non-linear ‘mirror’ descent steps: see Section 4.8.1). A variant with similar properties, but not requiring us to compute an additional step at each iteration, was proposed at roughly the same time by Esser et al. (Reference Esser, Zhang and Chan2010) (who gave it the name ‘PDHG’Footnote 11), and Pock, Cremers, Bischof and Chambolle (Reference Pock, Cremers, Bischof and Chambolle2009). The iterations can be written as in Algorithm 6.
The over-relaxation step $2x^{k+1}-x^{k}=x^{k+1}+(x^{k+1}-x^{k})$ can be interpreted as an approximate extragradient, and indeed it is possible to show convergence of this method with a rate which is the same as in Nemirovski (Reference Nemirovski2004) (see also Chambolle and Pock Reference Chambolle and Pock2011, Reference Chambolle and Pock2015a). On the other hand, this formula might recall similar relaxations present in other standard splitting algorithms such as the Douglas–Rachford splitting or the ADMM (see Sections 5.3 and 5.4 below), and indeed, we then see that this algorithm is merely a variant of these other methods, in a possibly degenerate metric. He et al. (Reference He, You and Yuan2014) observed that, letting $z=(x,y)$, the iterations above can be written as
where $T$ is the monotone operator in (3.15) and $M$ is the metric
which is positive definite if ${\it\tau}{\it\sigma}\Vert K\Vert ^{2}<1$. Hence, in this form the primal–dual algorithm is simply a proximal-point algorithm applied to the monotone operator $T$, and standard convergence results or rates (Brézis and Lions Reference Brézis and Lions1978) can be deduced.
This can be extended to a slightly more general form. Assume that we want to minimize the problem
where $f,g$ are convex, l.s.c. and ‘simple’, $K$ is a bounded linear operator and $h$ is convex with $L_{h}$-Lipschitz gradient term which we will treat explicitly. The primal–dual method for such a problem was suggested by Condat (Reference Condat2013b) and its convergence studied by Vũ (Reference Vũ2013a) and Boţ, Csetnek, Heinrich and Hendrich (Reference Boţ, Csetnek, Heinrich and Hendrich2015) (the latter papers dealing with more general monotone operators). Rates of convergence, including control of the primal–dual gap, are established in Chambolle and Pock (Reference Chambolle and Pock2015a) (a variant is studied in Drori, Sabach and Teboulle Reference Drori, Sabach and Teboulle2015), and a close (different) algorithm which mixes general monotone operators and subgradients and establishes similar rates is found in Davis and Yin (Reference Davis and Yin2015). The idea is simply to replace the descent step in $x$ with an FB splitting step, letting
Let us now write the algorithm in a general form, as in Algorithm 7. The first case (5.3)–(5.4) corresponds to iterations with fixed steps ${\it\tau},{\it\sigma}$, $\bar{x}=x^{k}$, $\bar{y}={\tilde{y}}=y^{k}$, $\tilde{x}=2x^{k+1}-x^{k}$. In this case we have the following result, which also obviously applies to the PDHG method (5.3)–(5.4). Here we let $L=\Vert K\Vert$.
Theorem 5.1. Let ${\it\tau},{\it\sigma}>0$ and $(x^{0},y^{0})\in {\mathcal{X}}\times {\mathcal{Y}}$ be given, and for $k\geq 0$ let
Assume
Then, for any $(x,y)\in {\mathcal{X}}\times {\mathcal{Y}}$, we have
whereFootnote 12
Moreover, if the inequality is strict in (5.9), then $(x^{k},y^{k})$ converge (weakly in infinite dimension) to a saddle point.
Proof. This is a particular case of Theorem 1, Remark 1 and Remark 3 in Chambolle and Pock (Reference Chambolle and Pock2015a). Under additional assumptions, one can derive a similar rate of convergence for the ‘true’ primal–dual gap ${\mathcal{G}}(X^{k},Y^{k})$.◻
Note that this result says little, in general, about the ‘best’ choice of the parameters ${\it\tau},{\it\sigma}$, and the empirical speed of convergence often depends a lot on this choice. Heuristic approaches have been proposed (which in general try to ensure that the primal and dual variables evolve at roughly the ‘same speed’); an efficient one, together with a backtracking strategy and convergence guarantees, is proposed in Goldstein et al. (Reference Goldstein, Li, Yuan, Esser and Baraniuk2015).
Acceleration
An interesting feature of these types of primal–dual iteration is the fact they can be ‘accelerated’, in cases when the objective function has more regularity. The first case is when $g+h$ (or $f^{\ast }$) is strongly convex: see Algorithm 8. Observe that if $f^{\ast }$ is ${\it\mu}_{f}$-strongly convex, then $x\mapsto f(Kx)$ has $(L^{2}/{\it\mu}_{f})$-Lipschitz gradient, and it is natural to expect that one will be able to decrease the objective at rate $O(1/k^{2})$ as before. Similarly, we expect the same if $g$ or $h$ is strongly convex. This is the result we now state. We should assume here that $g$ is ${\it\mu}_{g}$-convex, $h$ is ${\it\mu}_{h}$-convex, and ${\it\mu}={\it\mu}_{g}+{\it\mu}_{h}>0$. However, in this case it is no different to assuming that $g$ is ${\it\mu}$-convex, as one can always replace $h$ with $h(x)-{\it\mu}_{h}\Vert x\Vert ^{2}/2$ (which is convex with $(L_{h}-{\it\mu}_{h})$-Lipschitz gradient ${\rm\nabla}h(x)-{\it\mu}_{h}x$), and $g$ with $g(x)+{\it\mu}_{h}\Vert x\Vert ^{2}/2$ (whose proximity operator is as easy to compute as $g$’s). For notational simplicity, we will thus restrict ourselves to this latter case – which is equivalent to the general case upon replacing ${\it\tau}$ with ${\it\tau}^{\prime }={\it\tau}/(1+{\it\tau}{\it\mu}_{h})$.
Theorem 5.2. Let $(x^{k},y^{k})_{k\geq 0}$ be the iterations of Algorithm 8. For each $k\geq 1$, define $t_{k}={\it\sigma}_{k-1}/{\it\sigma}_{0}$, $T_{k}=\sum _{i=1}^{k}t_{i}$ and the averaged points