A STOCHASTIC APPROXIMATION ALGORITHM FOR STOCHASTIC SEMIDEFINITE PROGRAMMING

Motivated by applications to multi-antenna wireless networks, we propose a distributed and asynchronous algorithm for stochastic semidefinite programming. This algorithm is a stochastic approximation of a continuous-time matrix exponential scheme which is further regularized by the addition of an entropy-like term to the problem's objective function. We show that the resulting algorithm converges almost surely to an ɛ-approximation of the optimal solution requiring only an unbiased estimate of the gradient of the problem's stochastic objective. When applied to throughput maximization in wireless systems, the proposed algorithm retains its convergence properties under a wide array of mobility impediments such as user update asynchronicities, random delays and/or ergodically changing channels. Our theoretical analysis is complemented by extensive numerical simulations, which illustrate the robustness and scalability of the proposed method in realistic network conditions.


INTRODUCTION
Semidefinite programming (i.e., the minimization of a convex function over a convex subset of the cone of positive-semidefinite matrices) comprises a rich class of convex optimization problems that is both relatively tractable (interior-point methods can often be used with polynomial worst-case complexity [26]) and also very powerful (many optimization problems in engineering and combinatorial optimization can be recast as semidefinite programs [9]). Especially in an engineering context, many applications involve a certain degree of randomness (either in the objective function itself or in the feedback provided to the optimizer) [15,41] so many standard semidefinite optimization algorithms cannot be applied "off the shelf". For instance, minimum volume convering problems (where quadratic functions are expressed as semidefinite constraints) have been a very active research topic for the last fifty years [38]. In wireless telecommunications, transmission ranges of mobile devices have also been modeled as Euclidean balls with random parameters, hence expressible via semidefinite constraints with stochastic perturbations; as a result, route discovery in mobile ad-hoc networks is typically addressed using stochastic semidefinite programming approaches [43]. In view of the above, we focus in this paper on stochastic semidefinite programming, a subclass of semidefinite programs where the objective function is given in the form of an expectation with possibly unknown randomness.
In this framework, there are two main algorithmic approaches. In the "offline" approach, it is assumed that the optimizing agent (or agents in the case of multi-agent optimization) knows the expectation of his objective function in some semi-explicit form (possibly quite complicated) and tries to optimize it by calling an appropriate semidefinite optimization algorithm. On the other hand, in the "online" approach to optimization, the functional form of the objective function (and any inherent randomness) is unknown and the agent seeks to optimize his objective based on indirect (and possibly imperfect) performance indicators. The former approach is usually employed in large-scale industrial optimization problems where the collection of data is not costly, but their processing is. Instead, the latter approach applies to distributed optimization problems in complex systems (such as networks) where the optimizing agents are not capable of collecting a lot of data, but the agents have the computing power to handle the data they do collect.
Motivated by applications to wireless networks, our paper adopts the second approach with the aim of proposing a fully distributed algorithm for stochastic multi-agent semidefinite optimization problems that (a) requires minimal (and possibly imperfect) gradient information; and (b) is fully parallelizable and does not require any coordination between the optimizing agents. This algorithm is obtained as a variable step-size stochastic approximation [7,8] of a continuous-time matrix exponential learning scheme which has important ties to the mirror descent machinery of [17,25,27]. In contrast to mirror descent methods however, we establish the convergence of the algorithm's last iterate and not only the convergence of its empirical time-average, properly weighed by the step-size sequence employed. In applications to wireless mobile systems, this is crucial because it implies the convergence of the network to a stable, optimum state in a strong sense instead of a weaker, average sense.
To complement our abstract theoretical analysis (Sections 2 and 3), we also present a concrete application to multi-antenna wireless mobile networks with ergodically changing channel conditions. As explained in Section 4, this case fits squarely within the core framework of Section 2: First, this is due to the problem's inherently distributed aspect (since it is often impossible-or impractical-to coordinate and/or synchronize the mobile users' updates). Second, it is due to the lack of full system information at the user end and the fact that users do not necessarily know the stochastic law of their channels.
The users' objective in this setting is to maximize their information transmission rate by optimizing the covariance matrix of their input signal distribution. Two cases are considered. First, we consider the case where the users have perfect feedback from the receiver but their channels evolve following a stationary, ergodic process (the fast-fading regime) [15]. In this case, the users' transmission rate is the stochastic average of their achievable rate over all channel realizations and the problem boils down to a multi-agent stochastic semidefinite program. The second case concerns static channel conditions (i.e., the wireless medium is assumed to evolve at a much slower rate than the transmitters' update time-scale). In this case, a major challenge arises if the users only have access to imperfect receiver feedback and channel state information; thus, even though the underlying problem is deterministic, stochasticity arises from the noise in the users' measurements and observations. A partial description of our method applied to this "imperfect information" case was presented at our earlier conference paper [12].
In either case, we show how the proposed algorithmic scheme can be implemented in both synchronous and asynchronous ways. Additionally, we provide a procedure to compute an unbiased estimator of the gradient of the transmission rate for each transmitter via receiver-transmitter reciprocity. Finally, we also provide a suite of numerical simulations to illustrate the robustness of our algorithm and to compare it to more traditional water-filling techniques (which it outperforms).

PROBLEM FORMULATION AND PRELIMINARIES
As we mentioned in the introduction, our main goal is to provide an efficient and robust solution method for semidefinite optimization problems where the objective function depends on a controlled matrix variable X and a random variable ω (that cannot be controlled by the optimizer). More precisely, we consider problems where, through repeated iterations, the optimizing agent (or agent for short) seeks to converge to a value of X that optimizes the expected value of the objective function with respect to ω (i.e., that solves the agent's stochastic optimization problem "on average"). Obviously, if the agent's "mean" objective function can be calculated explicitly, the above boils down to a deterministic problem; however, a major challenge occurs if this expectation cannot be calculated-or, worse, if the distribution of ω is not even known to begin with.
The above problem will comprise the core of our considerations and we will formalize it in the following section; a variant formulation for multi-agent environments is then provided in Section 2.2. From a mathematical point of view, both models are essentially equivalent but, from a practical standpoint, they describe problems of a very different nature.

The Core Problem
Let H M = {X ∈ C M ×M : X = X † } denote the space of M × M Hermitian matrices and let X = {X 0 : tr(X) = 1} denote the spectrahedron of positive-semidefinite matrices with unit trace. In what follows, we will focus on the stochastic semidefinite optimization problem (SSP): where ω is an abstract random variable taking values in some probability space Ω, the expectation E[· ] is taken with respect to the law of ω, and f : X × Ω → R is a smooth random function which is convex with respect to X ∈ X for all ω ∈ Ω. Importantly, the simple formulation (SSP) above accounts for a fairly wide class of stochastic optimization problems over compact spectrahedra (the semidefinite equivalent of polytopes, either real or complex). In fact, as long as the feasible region X of a semidefinite program is a spectrahedron that to optimizing a convex function over X (at the cost of increasing the problem's dimensionality) [29]. As such, (SSP) can be seen as a canonical form for stochastic optimization problems over compact, unitary-invariant spectrahedra.
In the above framework, the mean objective function is itself convex over X . For simplicity, we will also assume that F is finite and smooth over X . In this way, we obtain the (convex) semidefinite optimization problem which could be solved by standard convex programming methods, provided that F is known to the optimizer. As such, the main difficulty in solving (SSP)/(SP) is precisely that the law of ω may not be known, in which case the functional form of F is also unknown. To circumvent this difficulty, standard results in convex analysis [37] show that the gradient matrix of the mean objective function F may be calculated by interchanging differentiation with expectation, that is, Thus, following the stochastic approximation approach of [9,25,27], we will focus on solving (SSP) based only on random (sample-dependent) estimates of the stochastic gradient matrices ∇ X f (X; ω) (for posterity, we note here that ∇ X f (X; ω) is Hermitian (on account of the fact that f is real)).
To make this precise, let X(1), X(2), . . . , be a (possibly random) sequence of play by the optimizing agent-that is, at stage n, the agent chooses X(n) and incurs an expected cost of F (X(n)). Then, at each stage n = 1, 2, . . . , we will assume that the agent has access to a random matrixV(n) which satisfies the statistical unbiasedness hypothesis: where F n denotes the filtration induced by the history process X(n).
The statistical hypothesis above allow us to account for a very wide range of estimation oracles: in particular, we will not be assuming independent and identically distributed (i.i.d) observations (a feature which is crucial in the context of wireless networks where observations are typically correlated with the state of the system). Instead, we will only assume there is an oracle mechanism that returns a F n -measurable estimateV(n) of V(X(n)) once the agent plays X(n); the construction of such an oracle for specific applications will be detailed in Section 4.
Remark 2.1: An important special case of the problem (SSP) is when the expectation in (2.1) is deterministic, that is, f (· , ω) = f (· , ω ) for almost every ω, ω ∈ Ω. In that case, Assumption (A1) accounts for problems where the optimizer is called to solve a deterministic semidefinite program with imperfect gradient feedback and stochasticity stems from the https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0269964816000127 Downloaded from https://www.cambridge.org/core. IP address: 54.70.40.11, on 30 Jun 2019 at 00:13:28, subject to the Cambridge Core terms of use, available at random noise perturbing the agent's observations. More generally, depending on the structure of the probability space Ω, the randomness in the stochastic optimization problem (SSP) and the randomness in the gradient observationsV(n) could be completely decoupled; the only assumption that we will make regarding these different degrees of randomness is (A1).

Multi-Agent Optimization and Game Theory
In multi-agent environments, we assume that there are multiple optimizing agents k = 1, . . . , K, each one controlling an individual control variable X k that impacts the agents' global objective f in a different way. Specifically, this amounts to the following multi-agent version of (SSP): minimize

4)
where X k = {X k ∈ C M k ×M k : X k 0, tr(X k ) = 1} denotes the feasible region of agent k and f : k X k × Ω → R satisfies the same convexity and smoothness assumptions as before.
In this setting, if there is no central controller to coordinate the agents' actions and provide global feedback, it will be assumed that agents can only access an estimateV k of their individual gradient matrices

5)
where X = (X 1 , . . . , X K ) denotes the agents' aggregate action profile and F (X) = E[f (X; ω)]. Thus, mutatis mutandis, we will assume that Assumption (A1) applies to each agent separately, and we will seek to provide a distributed optimization algorithm that solves (2.4) under these assumptions. As a further extension of the above framework, we will also consider the case where each agent seeks to minimize unilaterally an individual objective function f k (i.e., there is no global objective). This situation is known as a game in normal form (or, more simply, a game) and the solution concept that we will focus on is that of Nash equilibrium [14,23,24,30]. Formally, we will say that an action profile X * = (X * 1 , . . . , X * K ) is a Nash equilibrium of the game induced by the mean individual objective functions with (X k ; X * −k ) denoting the tuple (X * 1 , . . . , X k , . . . , X * K ). Put differently, Nash equilibria are simply action profiles which are unilaterally stable in that no agent has any incentive to deviate from them.
The connection between game theory and distributed optimization is recovered in the class of potential games [23], that is, games where the players' mean objective functions are aligned along a common potential function F . More precisely, following Monderer and Shapley [23], we will say that F is a potential function for a game with mean objectives F k when for all actions X k , X k ∈ X k of agent k, and for all action profiles X −k ∈ X −k ≡ =k X of k's opponents. As can be easily seen, if a game admits a potential function, its Nash equilibria necessarily coincide with the critical points of its potential function [23]. Thus, if the game's potential F is convex over X ≡ k X k , it follows that the equilibrium problem (NE) Algorithm 1. Algorithmic implementation of (DXL).
can be reduced to the distributed optimization problem (2.4) (conversely, every distributed optimization problem can be seen as a potential game by setting f k = f for all k). We will use this observation freely throughout our paper-and, especially, in Section 4.

Single-Agent Optimization Analysis
The main algorithmic scheme that we will use to solve (SSP)/(SP) will be based on the following discounted exponential learning (DXL) recursion: where (1) Y(n) is an auxiliary scoring matrix which aggregates gradient information.
(3) γ n , n = 1, 2, . . . , is a non-increasing step-size sequence (specific assumptions for γ n will be discussed below). (4) τ > 0 is a (small) discount parameter which acts as a failsafe against the iterates Y(n) getting out of bounds.
Intuitively, (DXL) acts as a "regularized" stochastic gradient descent process: if we ignore the parameter τ for the moment, each iteration of (DXL) simply aggregates the received gradient information (in the update of Y) and then "projects" back to the primal variable X to receive a new gradient and continue the process. The reason that the exponentiation step acts as a "projection" operator is that it aligns the eigenvalues of X with those of Y, so, in a certain sense, X is an "exponential projection" of Y to X .
The role of the discount parameter τ in (DXL) (and the reason for calling it a "discount" in the first place) is more subtle. To understand it, note first that the recursive step of (DXL) where, assuming that γ n is small enough, we have set: By expanding the logarithm to leading order in (3.2), this last sum is asymptotically equal Accordingly, to leading order, (3.1) can be rewritten for large n (and small γ n ) as: with t n,m given by (3.3) and r = exp(−τ ). Of course, the above derivation is approximate in nature but it highlights the discount role of τ . For a constant step-size sequence γ n = γ, we have t n,m = γ · (n − m), so the exponential sum in (3.4) means that (DXL) assigns (exponentially) more weight to recent observations rather than older ones. In a sense, this discounting counters the use of a vanishing γ n . A decreasing step-size implies that more recent gradient observations enter the algorithm with a decreasing weight; by contrast, the use of a positive discount parameter τ > 0 tempers this (somewhat counter-intuitive) behavior by increasing the relative weight of more recent gradient observations. Moreover, from a calculational standpoint, the use of a positive discount parameter τ has the added benefit that the auxiliary score matrices Y(n) cannot grow too large. If the step-size sequence γ n is chosen in a way such that the geometric series n m=1 τ m r tn,m+1 remains summable (we will elaborate more on the choice of γ n below), then Y(n) will be uniformly bounded on account of (3.4) and Assumption (A1).
Of course, in so doing, the discount parameter τ also introduces a systematic deterministic bias to the gradient observationsV(n), that is, a perturbation that persists even in the noiseless regime whereV(n) is actually deterministic. Indeed, if (DXL) is run with perfect gradient observationsV(n) = V(X(n)), then any fixed point X * of (DXL) will satisfy: .

7)
for κ = τ (1 + log(tr[exp(−V * /τ )])). Importantly, the right-hand side (RHS) of (3.7) can be written more simply as V * + τ (log X * + I) = ∇F τ (X * ) where the perturbed objective denoting the so-called von Neumann (or quantum) entropy of X [40] (that the gradient of F τ is ∇F τ (X) = V(X) + τ (log X + I) follows from standard arguments in matrix calculus-see for example, [13,Appendix D]). Thus, given that X * must satisfy the trace constraint tr(X * ) = 1, it follows that any fixed point X * of (DXL) will be a solution of the perturbed optimization problem: Obviously, the solution set of (SP τ ) is asymptotically close to that of the unperturbed problem (SP) in the limit τ → 0 (where the entropic perturbation term h(X) vanishes): more precisely, if X * τ is a solution of (SP τ ), there exists a solution X * of (SP) such that the distance between X * τ and X * vanishes as τ → 0. That said, an important difference between (SP) and (SP τ ) is that the latter is strictly convex (because h is). As a result, (SP τ ) admits a unique solution, even when the solution set of (SP) is a non-singleton convex set.
With all this in mind, we are in a position to state our main result for (DXL): Then, the iterates X(n) of (DXL) converge almost surely to a solution of the perturbed optimization problem (SP τ ); in particular, X(n) converges (a.s.) within ε(τ ) of a solution of the stochastic problem (SSP) and the error ε(τ ) vanishes in the limit τ → 0. Theorem 3.1 will be our main result for (DXL) so, before proving it, some remarks are in order: Remark 3.1: The statement of Theorem 3.1 suggests that the discount parameter τ should be taken as small as possible in order to ensure the algorithm's convergence to a state X * τ ∈ X that is as close as possible to the solution set of (SSP). On the other hand, very small τ > 0 could mean that the iterates Y(n) of (DXL) could grow quite large, potentially exceeding the numerical capacity of the optimizing's agent calculating device-recall the discussion surrounding (3.4). As a result, the discount parameter τ > 0 essentially reflects the algorithm's accuracy versus memory trade-off: lower values of τ > 0 lead to better solutions of (SSP), but at the expense of higher memory requirements and more processing power. Ultimately, the choice of τ relies on the technical specifications of the optimization problem to be solved so the "optimal" choice of τ can only be made on a case-by-case basis.

Remark 3.2:
In a similar vein to the above remark, Assumption 1 can actually be relaxed to account for gradient observations that are only bounded in mean square (instead of being bounded almost surely). In this context however, a given observationV of V could exceed the storage/processing capacity of the agent's optimizing device, thus introducing additional arithmetic stability errors to running (DXL). Such issues lie beyond the scope of the current work so we opted to work with the almost sure boundedness assumption for simplicity.

Remark 3.3:
We should also note here that the " 2 − 1 " summability condition ∞ n=1 γ 2 n < ∞ n=1 γ n = ∞ can also be relaxed in the context of Assumption 1. Specifically, Theorem 3.1 remains true even with significantly more aggressive step-size sequences of the form γ n = n −a for some arbitrarily small a > 0. The reason for stating (and proving) Theorem 3.1 in the " 2 − 1 " framework was only done for simplicity; in practice, the use of a (nearly) constant step-size greatly accelerates the algorithm, a fact that we explore in Section 4. Now, to prove Theorem 3.1, our strategy will be as follows: First, we will show that the iterates of (DXL) constitute a so-called asymptotic pseudotrajectory (APT) of the mean, continuous-time dynamics:Ẏ , that is, the iterates of (DXL) are asymptotically close to solution segments of (DXL c ) of arbitrary length [7]. We will then show that (DXL c ) converges to the (unique) solution of the perturbed optimization problem (SP τ ); the claim of (3.1) will then follow from standard results in the theory of stochastic approximation [7]. We begin by showing that the iterates of (DXL) comprise an asymptotic pseudotrajectory of the dynamics (DXL c ) in the sense of [7], that is, whereX(t), t ≥ 0 is the linear interpolation of the iterates X(n) of (DXL) while Φ t (X) denotes the flow induced on X by (DXL c )-that is, Φ t (X), t ≥ 0, is the solution trajectory of (DXL c ) that starts at X ∈ X . To that end, we will first need the following boundedness result: Lemma 3.2: If γ n < 1/τ for all sufficiently large n, then the iterates Y(n) of (DXL) under Assumption 1 are bounded (a.s.).
Proof: First, let V > 0 be such that V (n) ≤ V almost surely (that such a V exists is a consequence of Assumption 1); additionally, let n 0 be such that 0 < 1 − γ n τ ≤ 1 for all n ≥ n 0 . Then, for n ≥ n 0 , the definition (DXL) of Y(n) and the bound V (n) We are thus reduced to the following cases: It follows that Y(n + 1) will either decrease or be uniformly bounded by V , so our claim follows by induction.
Thanks to this lemma, we readily obtain: Proposition 3.3: With notation as in Lemma 3.2, the sequence Y(n) comprises an asymptotic pseudotrajectory of (DXL c ).
Proof: First, taking expectations in the RHS of (DXL) yields:

11)
where we used Assumption 1 and the fact that Y(n) and X(n) are fully determined by F n . Since Y(n) is bounded (a.s.) by Lemma 3.2, our claim follows from Proposition 4.1 in [7].
We now proceed to show that the dynamics (DXL c ) converge to the (unique) solution of the perturbed optimization problem (SP τ ) from any initial condition Y(0). To that end, we will first need to derive the dynamics of the primal control variable X(t): Lemma 3.4: Let X(t) be a solution orbit of the continuous-time dynamical system (DXL c ). Then, X(t) satisfies the dynamics:

12)
where Then, differentiating X(t) with respect to t, we get:

14)
where the second equality is an application of Fréchet's derivative formula for matrix exponentials [16] and the last one follows by recalling that With this explicit expression for the evolution of X at hand, we are almost in a position to show that the perturbed objective function F τ (X) = F (X) + τ tr[X log X] of (SP τ ) is a strict Lyapunov function for the dynamics (3.12). The only other result that we will need is the following Jensen-like inequality for positive-definite matrices: where the last equality follows from the fact that tr(XW) is real (recall that X is positivedefinite while W is Hermitian). This inequality holds as an equality if and only if X 1/2 ∝ X a WX b so, with a + b = 1/2, this last condition is equivalent to W ∝ I, as claimed. With all this in hand, we obtain: Proposition 3.6: Let X(t) be an interior solution orbit of the continuous-time dynamics (DXL c ). Then, d dt F τ (X(t)) ≤ 0 for all t ≥ 0, with inequality if and only if V τ (X(t)) ∝ I-that is, at interior stationary points of (3.12).
Proof: By a simple application of the chain rule, we readily get:

16)
where we have used the definition of V τ and the fact that tr[Ẋ] = 0 (since tr[X(t)] = 1 for all t ≥ 0). Invoking Lemma 3.4, we then obtaiṅ

17)
and our assertion follows from Lemma 3.5 above.
As a corollary of the above, we then get: From the proof of Theorem 3.1, we can identify two points where the positivity of τ plays a crucial role. The first is the boundedness of the iterates Y(n) of the algorithm (Lemma 3.2) which guarantees that (DXL) is a stochastic approximation of the mean dynamics (DXL c ). The second is the fact that the problem (SP τ ) admits a unique, interior solution. In the limit case τ = 0, it is still possible to show that (DXL) comprises an asymptotic pseudotrajectory of (DXL c ) but the Lyapunov argument of Proposition 3.6 is more subtle. Since we are only interested in algorithms with finite iterates (for computer arithmetic reasons), we will not press this issue further, delegating it instead to future work.

Distributed Optimization in Asynchronous Multi-Agent Environments
Of course, even though the information requirements of (DXL) are relatively minimal (an imperfect oracle call to the gradient of the agent's stochastic objective), it is not clear whether it can be readily extended to a distributed optimization setting (or a game-theoretic context) where agents update independently of one another and there is often a delay between agent updates and observations. To overcome these limitations, we examine here a fully decentralized variant of (DXL) which addresses the issues above.

Repeat
At each UpdateEvent get gradient estimateV; n ← n + 1; until termination criterion is reached.
To make all this precise, we will work with the multi-agent stochastic optimization problem (2.4) and we will assume that the agents seek to converge to a solution thereof through repeated play. To that end, let n denote the nth overall update epoch in the system, let K n ⊂ K denote the subset of agents who update at this epoch (typically |K n | = 1 if agents update at random times), and let d k (n) be the number of periods that have elapsed at period n since the last update of agent k. With all this in mind, we will focus on the following asynchronous variant of (DXL):

18)
where n k = n j=1 1(k ∈ K j ) denotes the number of updates performed by agent k up to epoch n while the (asynchronous) gradient estimateV k (n) satisfies the unbiasedness assumption: where, as before, V k (X 1 , . . . , X n ) = ∇ X k F (X 1 , . . . , X K ). By definition, Y k (n) and X k (n) are updated at the (n + 1)th update period if and only if k ∈ K n , so every agent follows his individual update timer, independently of what other agents in the system do (for a pseudocode implementation, see Algorithm 2). Remarkably, in this completely decentralized context (with out-of-date and/or imperfect gradient observations), we still get: Theorem 3.8: Assume that the agents' delay process d k (n) are bounded (a.s.) and the set of agents K n that updates at the nth overall update epoch is a homogeneous recurrent Markov chain-that is, all agents update a strictly positive rate. Assume further that Algorithm 2 is run with step-sizes γ n ∝ 1/n and imperfect gradient estimatesV k (n) satisfying the unbiasedness assumption (A1'). Then, the algorithm's iterates converge (a.s.) to the (unique) minimizer of the perturbed objective F τ (X 1 , . . . , X K ) = F (X 1 , . . . , In particular, Algorithm 2 converges within ε(τ ) of a solution of the distributed stochastic optimization problem (2.4) and the approximation error ε(τ ) vanishes as τ → 0 + . Proof: Following Theorems 2 and 3 in [8], the asynchronous recursion (3.18) may be seen as a stochastic approximation of the rate-adjusted dynamics:

19)
where ρ k = lim n→∞ n k /n > 0 is the asymptotic update rate of user k (the existence and positivity of this limit follows from the ergodicity assumption on the set-valued process K n ). This multiplicative factor does not alter the rest points of the original dynamics (DXL c ) and an easy calculation shows that the perturbed objective function F τ (X 1 , . . . , X K ) remains a strict Lyapunov function for the rate-adjustment dynamics above. The rest of our proof then follows essentially as that of Theorem 3.1.

APPLICATIONS TO WIRELESS NETWORKS
We now turn to a concrete application of the algorithmic framework presented in the previous sections to distributed throughput maximization in multi-user wireless systems.

Problem Formulation
Throughout this section, we will focus on mobile systems where a set K = {1, . . . , K} of different transmitters (or users) communicate simultaneously with a single receiver (for instance, a base station or a wireless terminal). Following recent developments in wireless communication technology [1][2][3], we will further assume that each user k ∈ K is using M k antennas for transmission (multiplexing) while the receiver is using N antennas for signal reception and decoding. More precisely, this means that the aggregate signal reaching the receiver can be described by the standard channel model where: 1. y ∈ C N is the aggregate signal reaching the receiver (the channel's output). 2.
x k ∈ C M k is the transmitted signal (input) of the kth transmitter (the channel's input).
3. H k ∈ C N ×M k denotes the transfer matrix between the kth transmitter and the receiver, representing how the transmit signal is affected by the wireless medium.
To account for channel fading, we will assume in what follows that the users' channel matrices evolve over time following a bounded stationary process [15] and we will denote expectations over this distribution by E[· ] (as a special case, in the static channel regime). 4. z ∈ C N is the noise in the channel (including thermal, atmospheric and other peripheral interference effects). Following standard information-theoretic caveats, we will further assume that z can be modeled as a circularly symmetric, zero-mean Gaussian vector with unit covariance [11,39,42].
for signal encoding (on the structure of the channel matrices H k , the channel model (4.1) actually applies to several telecommunications systems, ranging from digital subscriber line (DSL) uplink networks with Toeplitz circulant H k , to code division multiple access (CDMA) radio networks [36]. For concreteness, we will stick here with the interpretation of the signal model (4.1) as an ad hoc multi-user MIMO multiple access channel with H k representing the channel of each link). Specifically, let denote the covariance matrix of the transmitters' input signal distribution, with the expectation E cb being taken over the user's input codebooks (importantly, the expectation E cb [· ] is not related to the expectation E[· ] taken over the distribution of the users' channels). Then, assuming perfect CSI at the receiver, the maximum achievable information transmission rate of user k will be given by the familiar expression [15,39]: where Q = (Q 1 , . . . , Q K ) denotes the users' covariance profile, Q −k is the corresponding profile for all users except k, the expectation E[· ] is taken over the users' channel law, and denotes the multi-user interference-plus-noise (MUI) of user k (from an informationtheoretic perspective, we are also assuming single user decoding (SUD) and perfect CSI at the receiver; for a more detailed account, see for example, [32][33][34]39,42]). In this context, the users' objective is to select input signal covariance matrices Q k so as to maximize their individual information transmission rate R k (Q) subject to the constraints tr(Q k ) = P k , (4.5) where P k is the transmit power of user k [39,42]. More formally, in the language of Section 2.2, the above boils down to the rate maximization game: where, for convenience, we have set X k = Q k /P k and the users' utility function u k is simply defined as: Clearly, in the presence of fading, the users' objectives are stochastic in nature because of the expectation over H in (4.3); otherwise, in the case of static channels, this expectation is trivial, so (RM) is deterministic. As a result, the game-theoretic problem (RM) can be seen as a special case of the multi-agent stochastic problem (SSP). Indeed, as was shown in [6], the users' reward functions u k satisfy the potential property [23]: where the game's potential function F is defined as: and the problem's feasible region is X = k X k with X k = {X k ∈ H M k : X k 0 and tr(X k ) = 1} (from an information-theoretic perspective, F simply represents the users' sum rate under a centralized successive interference cancellation (SIC) decoding scheme [42]. As a result, the above rate maximization problem can be seen both as a game (under single user decoding) or as a distributed, multi-agent optimization problem (under more sophisticated SIC schemes). For a more detailed discussion, see [6,12,18,19,21] and references therein).
Since the function M → log det(I + M) is concave in M over the entire cone of positivesemidefinite matrices [9], it follows that F is itself convex over X (in fact, if the law of the users' channel matrices does not contain any atoms, F is actually strictly convex). Accordingly, the rate maximization game (RM) falls squarely in the framework of Section 2.2: in realistic network situations, the distribution of the users' channel matrices is not known to the users, so the rate functions R k (or the game's potential F ) cannot be calculated a priori. As a result, to reach a Nash equilibrium of (RM), the system's users cannot rely on gradient observations of R k (or F ), but only on stochastic (and possibly imperfect and/or delayed) information on the quantities inside the expectation of (4.3) and (4.8), themselves obtained through an interplay between the transmitters and the receiver.
The framework described above naturally calls for a distributed solution method, so the algorithmic material of Section 3 seems particularly well-suited to the occasion. In the rest of this section, we will describe the specifics of this application.

Algorithmic Implementation-Synchronous Updates
The first step required to apply the algorithmic tools of Section 3 is to calculate the stochastic gradient of the users' rate functions R k . To that end, let so u k (X) = E[r k (X)]. Then, some matrix calculus readily yields where, in a slight abuse of notation,

W(X; H) = I + P H X H † (4.11)
denotes the aggregate signal covariance matrix at the receiver. Thus, if H(n) denotes the realization of the users' channel matrices at each update period n = 1, 2, . . . , and X(n) is their corresponding transmit profile, we will assume that a) H k (n) is measured at each transmitter k ∈ K; and b) W(X(n); H(n)) is measured at the receiver and is then broadcast to the transmitters. Under these assumptions, each transmitter k ∈ K can recreate their individual (stochastic) gradient matrices at period n as: (4.12) and, by construction, we will have: n)), for all k = 1, . . . , K. (P4) It is stable: the matrix exponentials can be calculated in a numerically stable and efficient manner [22].

Asynchronous Implementation
Let us now consider a more realistic wireless environment where the transmitters do not share a common update clock-so synchronous decisions are not possible. In this context,

Algorithm 3. MIMO rate maximization with SU
Parameters: discount parameter τ > 0; decreasing step-size sequence γ n . n ← 1; for each transmitter k ∈ K do simultaneously initialize Hermitian score matrix Y k ∈ H M k ; Repeat n ← n + 1; Receiver measures and broadcasts P = W −1 ; for each transmitter k ∈ K do simultaneously Measure channel matrix H k ; . until termination criterion is reached. the synchronous update structure of Algorithm 3 is no longer appropriate, so we will employ Algorithm 2 (which is fully decentralized) instead.
To that end, assume that each transmitter is equipped with an individual timer τ k whose ticks indicate the update events of user k. More precisely, we assume here that τ k : N → R + is an increasing (and possibly random) sequence such that τ k (n) marks the instance at which the kth user updates his covariance matrix X k for the nth time-so X k does not change between τ k (n) and τ k (n + 1). Similarly, we assume that the receiver is equipped with a timer τ 0 (n) that triggers the measurements of W(t) ≡ W(X(t); H(t)) = I + P H (t)X (t)H † (t). (4.14) Thus, at every tick of τ k , user k measures H k and updates X k while, at every tick of τ 0 , the receiver measures and broadcasts W. This asynchronous operating mode fits naturally within the framework of Algorithm 2, leading in turn to the following implementation of Algorithm 3 with asynchronous updates (AU): Algorithm 4 is run independently by each transmitter-though, of course, if all transmitters share a common timer, Algorithm 4 reduces to the synchronous context of Algorithm 3. Moreover, provided that all individual timers τ k have positive finite rate (i.e., lim τ k (n)/n exists and is finite), it is easy to see that the update sequence generated by Algorithm 4 satisfies the assumptions of Theorem 3.8. Indeed, the set-valued process K n used in (3.18) to indicate the set of transmitters updating their covariance matrices at the nth overall update event may be obtained from the users' individual timers τ k as follows: First, let K(t) = {k ∈ K : τ k (n) = t for some n ∈ N} denote the set of players updating at time t and let n(t) = card{s ≤ t : K(s) = ∅} be the total number of update epochs up to time t. Then, K n = K(inf{t : n(t) ≥ n}) and n k (n) = n r=1 1(k ∈ K r ), so the limit lim τ k (n)/n exists and is finite if and only if the limit lim n k (n)/n exists and is positive. With all this in mind, we readily obtain:

Learning with Imperfect Information
As a final application of the algorithmic framework of Section 3 to the problem at hand, we turn to the case where the users' channel matrices H k are static but channel and interference measurements are subject to observation noise and measurement errors. In this case, the rate maximization game (RM) becomes deterministic but the system is still subject to stochasticity originating from noise and uncertainty in the users' measurements.
In the perfect information case, the (deterministic) semidefinite problem (RM) may be solved by water-filling techniques [11], properly adapted to multi-user environments [31,32,42]. Such methods can be either iterative (with users updating their covariance matrices one after the other, in a round-robin fashion) [42] or simultaneous (with users updating all at once) [31]. The benefit of the former (iterative) scheme is that its convergence is guaranteed [42]; however, the algorithm's convergence rate is inversely proportional to the number of users in the system (making such methods unsuitable for large networks). On the other hand, simultaneous water-filling methods are faster [31], but their convergence is conditional on certain "mild interference" conditions which fail to hold even in very simple 2 × 2 systems [20]. Making matters worse, water-filling methods rely on perfect channel state information at the transmitter (CSIT) and perfect measurements of the output signal covariance matrix W at the receiver; when it is impossible (or impractical) to obtain such noiseless measurements, it is not known whether water-filling methods converge.
In light of the above, our goal here will be to provide a viable alternative to water-filling based on (DXL). To that end, we will focus on two sources of measurement noise: (1) The (static) transfer matrices H k can only be measured at the transmitter up to some random observational error.
(2) The receiver can only estimate the covariance W of the aggregate received signal y via random sampling (assumed to occur between the updates of the transmitters).
Even though these two randomness sources are independent of one another, the gradient matrices V k = H k W −1 H † k of (4.12) depend nonlinearly on H k and W, so care must be taken to construct an unbiased estimator of V k from noisy estimates of H k and W.
We first consider the random perturbations induced on the estimation of W −1 by signal sampling at the receiver end. On that account, recall that W is simply the covariance matrix of the aggregate received signal y ∈ C N : As a result, an unbiased estimate for the covariance W of y may be obtained from a systematically unbiased sample y 1 , . . . , y S of y by means of the classical estimator W = S −1 S s=1 y s y † s . Since E cb [y] = 0, we do not need to include an S/(S − 1) bias correction factor in the estimate of W. Also, in a slight abuse of notation, the measurement expectations here are taken with respect to the law of x, y, and z.
On the other hand, given thatŴ −1 is a biased estimator of W −1 (and hence introduces a systematic error to the measurement process) [4], we cannot use this classical covariance estimate for the received signal precision (inverse covariance) matrix W −1 . Instead, following [4], an unbiased estimate of the precision matrix P = W −1 of y is given by the corrected expression:P whereŴ = S −1 S s=1 y s y † s as before. Thus, to obtain W −1 , the receiver only needs to take S > N + 1 independent measurements of y and then broadcast the unbiased estimateP of W −1 to the network's users.
Similarly, in the absence of perfect channel state information at the transmitter, the users must obtain an unbiased estimate of the unilateral gradient matrices V k = H † k W −1 H k from the broadcasted value of W and imperfect measurements of their channel matrices H k . However, an added complication here is that the estimated matrixV k must be itself Hermitian-Q k need not be positive-definite and the DXL scheme may fail to be well-posed.
To accommodate this requirement, if each transmitter takes S > 1 independent measure-mentsĤ k,1 , . . . ,Ĥ k,S of their individual channel matrix H k , such an estimator is given by the expression:V whereP is the broadcast estimate (4.16) of W −1 . Indeed, given that the sample measurements H k,s are assumed stochastically independent, we will have:

18)
where we have used the independence of the samples to decorrelate the expectations in the second equality, and we relied on the unbiasedness ofP andĤ k for the last one. Thus, with E[V k ] = V k , our construction of an unbiased estimator for V k is complete. From an implementation viewpoint, the above leads to the following distributed operation protocol. First, with notation as in the previous section, let τ 0 denote the receiver's measurement timer (so τ 0 (n) is the nth instance in time at which the receiver measures and broadcast W −1 ). Then, at each tick of τ 0 , the receiver takes a sample of the received signal y of size S > M + 1 and computes the estimateP(τ 0 (n)) as above. (We implicitly assume here that this measurement process takes a negligible amount of time. This assumption is justified by the fact that the characteristic time at which the receiver estimates W for decoding purposes is much shorter than the interval between user updates.) Likewise, if τ k is the update timer of user k (so τ k (n) is the nth update time for user k), each transmitter k ∈ K is assumed to measure his individual channel matrix and calculate his gradient estimateV k using the recipe (4.17) with the latest broadcasted value ofP (obviously, if τ 0 = τ k for all k ∈ K, the above process boils down to the synchronous regime of Algorithm 3). Theorems 3.1 and 3.8 then yield: Corollary 4.3: With notation as before, the iterates of Algorithm 4 with imperfect feedback converge (a.s.) within ε(τ ) of a Nash equilibrium of the rate maximization game (RM) with static channels; moreover, the approximation error ε(τ ) vanishes as τ → 0 + .

Numerical Results
To assess the performance of (DXL) applied to realistic network conditions, we simulated in Figure 1 a multi-user uplink MIMO system consisting of a wireless base receiver with 5  antennas and K = 25 transmitters, each with an arbitrary number m k of transmit antennas picked uniformly between 2 and 6 (for simplicity, throughout our numerical simulations, we focused on the synchronous updates case (Algorithm 3)). For the static channel case, each user's channel matrix H k was drawn from a complex Gaussian distribution at the outset of the transmission (but remained static once chosen), and Algorithm 3 was ran with a large constant step size for different values of the discount parameter τ . The performance of the algorithm over time was then assessed by plotting the normalized efficiency ratio

19)
where F n denotes the users' sum rate at the nth iteration of the algorithm, and F max (resp. F min ) is the maximum (resp. minimum) value of F over the system's set X of feasible covariance matrices. Thus, by definition, an efficiency measure of 1 corresponds to a Nash equilibrium of the rate maximization game (RM), while an efficiency ratio of 0 means that the system is very far from equilibrium. The reason for using this efficiency measure instead of the user's sum rate F directly, was to eliminate any scaling artifacts arising for example, from F taking values in a very narrow band close to its maximum value. In tune with Theorem 3.1, Figure 1 reveals that Algorithm 3 converges within a few iterations (effectively, within a single iteration for low τ ), but the end value of the users' sum rate deteriorates for higher values of the discount parameter τ . In Figure 2, we fix the algorithm's discount parameter to a low level (τ = 10 −3 ) that ensures effective convergence to Nash equilibrium, and we investigate the algorithm's convergence speed as a function of the number of transmitters, using existing water-filling methods as a benchmark. Specifically, in Figure 2(a), we ran Algorithm 3 for a multi-user uplink MIMO system with K = 10, 25, 50 and 100 users using a large, constant step size; as a result of this parameter tuning, Algorithm 3 effectively attains the system's sum capacity within one or two iterations, even for large numbers of users. Importantly, as can be seen in Figure 2(b), this represents a marked improvement over water-filling methods, even in moderately-sized systems with K = 25 users: on the one hand, iterative water-filling (IWF) [42] Figure 3. The robustness of entropy-driven learning in the presence of measurement errors: in contrast to water-filling methods, the entropy-driven learning attains the channel's sum capacity, even in the presence of very high measurement errors.
performance level as the first iteration of Algorithm 3), whereas simultaneous water-filling (SWF) [31] fails to converge altogether. The robustness of discounted exponential learning is investigated further in Figure 3 where we simulate an uplink MIMO system consisting of K = 25 transmitters with imperfect CSI and noisy measurements at the receiver. For simplicity, we modeled these errors as additive i.i.d. zero-mean Gaussian perturbations to the matrices V k = H k W −1 H † k that are used in the update step of SU, and the strength of these perturbations was controlled by the ratio of the errors' standard deviation to the matrix norm of V k (so a relative error level of η = 100% means that the measurement error has the same magnitude as the measured variable). We then plotted the efficiency ratio achieved by Algorithm 3 over time for average error levels of η = 15% and η = 100%; for benchmarking purposes, we then also ran the iterative and simultaneous water-filling algorithms with the same relative error levels (and noise realizations). As can be seen in Figure 3, the performance of water-filling methods remains acceptable at low error levels (attaining 90-95% of the system's sum capacity), but when the measurement noise gets higher, water-filling offers no perceptible advantage over the users' initial choice of covariance matrices. By contrast, discounted exponential learning retains its convergence properties even for relative error levels as high as 100%-though, of course, the algorithm's convergence speed is negatively impacted.
Finally, to account for changing channel conditions, we also plotted the performance of Algorithm 3 for non-static channels following the well-known Jakes model of Rayleigh fading [10]. More precisely, in Figure 4, we consider a MIMO uplink system with 3 receive antennas and K = 10 users with 2 antennas each, transmitting at a frequency of f = 2 GHz  and with average pedestrian velocities of v = 5 km/h (corresponding to a channel coherence time of 108 ms). We then ran Algorithm 3 with an update period of δ = 3 ms, and we plotted the achieved sum rate F (t) at time t versus the maximum attainable sum rate F max (t) given the channel matrices H k (t) at time t (and versus the "uniform" sum rate that users could achieve by spreading their power uniformly over their antennas). As a result of its high convergence speed, Algorithm 3 tracks the system's sum capacity remarkably well, despite the changing channel conditions. Moreover, the sum rate difference between the learned transmit covariance profile and the uniform one shows that this tracking is not an artifact of the system's sum capacity always being within a narrow band of its (evolving) maximum, but a real consequence of learning.

CONCLUSIONS
In this paper, we introduced a class of distributed algorithms based on a regularized variant of matrix exponential learning for stochastic semidefinite programming with applications to robust spectrum management in multi-user MIMO systems. This adjustment of classical exponential learning generates a discrete-time algorithm which tracks the continuous-time dynamics of adjusted exponential learning and converges arbitrarily close to the system's optimum configuration. Thanks to this adjustment term, the algorithm remains robust in the presence of stochastic perturbations: it converges even when the agents only have imperfect (or delayed) information at their disposal, or even if they update in a fully asynchronous manner and independently of one another.
The optimization method of adjusted exponential learning method actually applies to a wide range of semidefinite problems; we focused here on the MIMO MAC where our approach dominates classical water filling techniques, both in terms of speed of convergence and robustness to random perturbations.
In the case of multi-user MIMO systems, it out-performs traditional water-filling methods, both in terms of robustness to imperfect signal measurements and speed of convergence: in practice, the algorithm converges within a few iterations, even for large numbers of antennas.