FAST NON-NEGATIVE LEAST-SQUARES LEARNING IN THE RANDOM NEURAL NETWORK

The random neural network is a biologically inspired neural model where neurons interact by probabilistically exchanging positive and negative unit-amplitude signals that has superior learning capabilities compared to other artificial neural networks. This paper considers non-negative least squares supervised learning in this context, and develops an approach that achieves fast execution and excellent learning capacity. This speedup is a result of significant enhancements in the solution of the non-negative least-squares problem which regard (a) the development of analytical expressions for the evaluation of the gradient and objective functions and (b) a novel limited-memory quasi-Newton solution algorithm. Simulation results in the context of optimizing the performance of a disaster management problem using supervised learning verify the efficiency of the approach, achieving two orders of magnitude execution speedup and improved solution quality compared to state-of-the-art algorithms.


INTRODUCTION
The random neural network (RNN) is a neural network model inspired by the spiking behavior of biophysical neurons [16,17]. When a neuron is excited, it transmits unit amplitude signals called spikes to either excite or inhibit the receiving neurons. The combined effect of excitatory and inhibitory inputs changes the potential level of the receiving neuron and determines whether it will become excited. RNN has attracted a lot of attention in the scientific community due to its analytical solvability, excellent learning capacity, implementation ease, as well as its representational, modeling and universal approximation capabilities [19]. RNN has also been applied for the solution for different types of problems including optimization (e.g., minimum Steiner tree [24], assignment of assets to tasks under uncertainty [31], task assignment in distributed systems [2], rescuer assignment of emergency evacuation [25], cognitive packet networks [28]) and modeling (e.g., G-networks [3,18,20,22], gene regulatory networks [23], and protein interaction networks [42]) problems. Nonetheless, the most important application area of RNN regards the solution of supervised learning problems such as laser intensity vehicle classification [35], wafer surface reconstruction [27], mine detection [1] and denial-of-service attack detection [41].
In the context of supervised RNN learning, several algorithms have been designed over the years aiming to optimize excitatory and inhibitory weight values to minimize some non-convex cost function associated with the learning task. In [21] the standard gradient descent supervised learning algorithm for recurrent RNN was developed while gradient descent learning algorithms have also been developed for extension models such as the Multiple-Class RNN [26], the RNN with synchronized interactions [30] and the feedforward RNN [32]. In the same context more advanced optimization algorithms have also been developed based on contrastive learning [43], quasi-Newton [38] and Levenberg-Marquardt [4] methods. A survey of RNN models, learning algorithms and applications can be found in [47].
Departing from the traditional approach of solving the supervised RNN learning problem as a non-linear, non-convex optimization problem, in [45,46] the problem was reformulated into a convex program and solved to optimality. This was accomplished by approximating the equations governing the RNN into a linear system of equations with non-negativity constraints, when the excitation levels of all neurons are fixed for each input pattern (either to desired or random values). This approximation yielded a linear leastsquares problem with non-negativity constraints (NNLS) which is a quadratic programming (QP) problem that can be optimally solved using convex optimization algorithms. Because the large-scale nature of the underlying problem prohibited the accurate solution of the NNLS problem using QP, a first-order recursive algorithm was developed for its solution. NNLS learning has been shown to provide comparable performance to gradient descent RNN learning with lower execution times.
Linear least-squares techniques for learning have also been utilized in feedforward connectionist neural networks and shown to be very efficient, obtaining smaller training errors and faster training times compared to backpropagation techniques. These methods are based on the observation that the inputs to the neurons of a given layer is a linear function of the outputs of the preceding layer. The non-linearity arises from the application of the activation function to the input of each neuron in order to obtain its output. Hence, if the outputs of two consecutive layers are known then the optimal weights connecting the two layers can be derived by minimizing the mean square error (MSE) between the actual and the desired input to the second layer [6]. One problem with this approach is that it does not take into consideration the scaling effect of the non-linear activation function. Attempts to improve this deficiency include approximating the activation function with a linear combination of convex functions [10], considering the slope of the activation function at the desired output values to achieve better scaling [14] and restricting the output neuron values in the non-saturation region of the activation function [48,49].
Least squares have also been considered in hybrid algorithms. One approach is to obtain the weights of all layers by standard backpropagation algorithms, apart from the output layer where least squares are used to exploit the desired output values from the training data [15,33]. Other hybrid algorithms have sophisticated iterative methods for choosing the desired weights or output values of the non-output neurons such as penalized functions [12] and sensitivity analysis [11], but employ least squares to optimize the performance of the network for given values of those parameters. Extreme learning machines [34] also rely on least squares learning in a feed-forward architecture upon performing random feature mapping, but the activation function can take any non-linear piecewise continuous form.
This paper revisits the NNLS supervised learning problem in RNN to develop a customized algorithm for the solution of the NNLS learning problem that achieves more than two orders of magnitude execution speedup. This is a result of significant enhancements which regard (a) the development of analytical expressions for the computation of the gradient and objective NNLS functions and (b) a novel limited-memory, quasi-Newton solution algorithm for the NNLS problem. Our approach also differs from existing least-squares techniques for connectionist neural networks because it is developed for a different neural network model so that the approximation and solution approach are different; moreover, it is applied to a fully recurrent network with the least-squares method applied to the whole network rather than on a layer-by-layer basis.
The paper is organized as follows. Section 2 introduces some preliminary material on the mathematical representation of RNN and the solution of the NNLS problem. Sections 3 and 4 outline the procedure to transform the supervised RNN learning problem into an NNLS formulation and the overall solution approach. Section 5 describes the proposed approach for the solution of the NNLS problem, while Section 6 presents the derived analytical expressions for the fast computation of the NNLS gradient and objective function. Section 7 demonstrates the efficiency of the developed approach and Section 8 concludes the paper.
Notation: All boldface letters indicate vectors (lower case) or matrices (upper case), while calligraphic letters denote sets. The superscript (·) T denotes the matrix transpose. Matrix I is the identity matrix. diag(x) is a diagonal matrix with elements of the main diagonal given by the entries of vector x, and all other elements equal to zero. 1 = [1, . . . , 1] T is a vector of all ones, while e i = [0, . . . , 1, . . . , 0] T is a vector of all zeros apart from a one at position i. ||z|| 2 denotes the Euclidean norm of a vector z. Operator denotes elementwise multiplication and ⊗ denotes the Kronecker product which if applied on x ∈ R N ×1 and y ∈ R M ×1 yields: ) and U int (a, b) represent the uniform distribution in the interval [a, b] generating real and integer numbers, respectively.

The RNN Model
RNN is a recurrent network of N fully connected neurons which exchange positive and negative signals in the form of unit amplitude spikes. At any time t, the state of neuron i is described by its signal potential k i (t) which is a non-negative integer associated with the accumulation of positive signals at the neuron. We say that neuron i is excited when k i (t) > 0, else if k i (t) = 0 then it is idle or quiescent. A closely related parameter is q i (t) = P r[k i (t) > 0] ≤ 1, which is the excitation probability of neuron i.
When neuron i is excited, it can randomly fire according to the exponential distribution with rate r i resulting in the reduction of its potential by 1. The fired spike either reaches neuron j as a positive signal with probability p + (i, j) or as a negative signal with probability p − (i, j), or it departs from the network with probability d(i). These probabilities must sum up to one yielding Hence, when neuron i is excited, it fires positive and negative signals to neuron j with rates: Combining Eqs. (1)-(3) an expression which associates r i with w + (i, j) and w − (i, j) is derived: Positive and negative signals can also arrive from the outside world according to Poisson processes of rates Λ i and λ i , respectively. Positive signals have an excitatory effect in the sense that they increase the signal potential of neuron j by 1. Contrary, negative signals have an inhibitory effect and cancel a positive spike if k j (t) > 0, while if k j (t) = 0 the negative signal has no effect. The values of the stationary parameters of the network, the stationary excitation probabilities q i = lim t→∞ q i (t), i = 1, . . . , N and the stationary probability distribution π(k) are derived from Theorem 1.
Theorem 1 [16]: Let the total arrival rates of positive and negative signals λ + (i) and λ − (i), i = 1, . . . , N be given by the following system of equations where If a unique non-negative solution {λ − (i), λ + (i)} exists for the non-linear system of Eqs. (5)- (7) such that q i < 1 ∀i, then: The theorem states that whenever a solution to the signal flow Eqs. (5)-(7) can be found such that q i < 1, ∀i, then the stationary joint probability distribution of the network has the simple product form (8) associated with the marginal probabilities of each neuron, π i (k i ). The condition q i < 1 can be viewed as a "stability condition" that guarantees that the excitation level of each neuron remains finite with probability one.

Non-negative Least Squares
NNLS is a convex QP optimization problem [37] defined as: Due to the special structure of NNLS, apart from classical QP solution approaches, several other approaches has been proposed to solve large scale problems which can generally be classified into active set algorithms and iterative approaches.
In active set algorithms variables are divided into two sets: the active set and the passive set. A variable belongs to the active set if it is negative or zero when the unconstrained problem is solved, otherwise it belongs to the passive set. When the unconstrained leastsquares problem is solved, negative or zero variables do not contribute to the constrained problem; therefore, if the true active set is known then the solution can be found by solving the unconstrained problem for the passive set of variables and setting the active variables equal to zero.
The most widely known active set algorithm is the one proposed by Lawson and Hanson [37]. In this approach, initially all the variables are inserted into the active set. Then an iterative procedure is followed where in each iteration variables that result in a strictly better evaluation of the cost function are identified and removed from the active set. The procedure continues until no more active variables can be freed to reduce further the cost function. Although it is possible to free many variables at a single iteration, general practice has shown that it is better to free only one variable at a time from the active set [13].
A modified version of this algorithm identifies calculations that can be computed beforehand to reduce the computational cost. The algorithm called FNNLS (Fast Non-negative Least Squares) [7] speeds-up the procedure, but requires the storage of the square matrix B T B; thus, it is not suitable for our problem. Active set methods are also not appropriate in our case because they require the solution of the unconstrained least-squares problem which involves a matrix inversion operation, that is an operation we want to avoid.
Iterative approaches adhere to non-linear optimization methods to update w. Usually the update of the current solution is based on projected gradient methods which can identify several active set constraints in one iteration. In the projected gradient methods, the update takes place toward the steepest descent direction; nonetheless, by using the projection operation it is ensured that the new point is within the feasible region as shown in Eqs. (10) and (11).
Projected gradient methods usually differ in the procedure used for the selection of the step-size s τ and the update of the gradient scaling matrix D τ , which must be symmetric and positive definite. They generally require simple matrix-vector operations and can perform well in ill-conditioned systems. D τ is computed based on second-order gradient information and results in fast convergence to the solution. An efficient and simple projected gradient quasi-Newton method that uses the gradient scaling matrix D τ was proposed in [36]. The method exploits the active and passive variables and also requires only simple matrix-vector operations. Nonetheless, it requires the explicit storage of D τ and cannot be used in our case. In Section 5, we develop a projected gradient NNLS (PGNNLS) algorithm with all the above desirable characteristics.

NON-NEGATIVE LEAST-SQUARES RNN LEARNING FORMULATION
In RNN, the kth input training pattern x k is represented by the vectors Λ k = [Λ 1k , . . . , Λ Nk ] and λ k = [λ 1k , . . . , λ Nk ], k = 1, . . . , K. Usually the approach taken is to assign the input training values, x ik to the exogenous arrival rates such that Λ ik = max{0, x ik } and λ ik = max{0, −x ik }. The desired values of the kth pattern, y k , are represented by the steady-state excitation probabilities of the neurons q k = [q 1k , . . . , q Nk ] emanating from applying input training pattern k to the network. The RNN weights updated during the learning process are w + (i, j) and w − (i, j), ∀i, j.
Without loss of generality we assume that the error function to be minimized is a general quadratic function of the form: where E k is the error function of the kth input-output pair,c i ∈ {0, 1} shows whether neuron i belongs to the set of output neurons (i ∈ I out ) or not (i ∈ I out ) and g i (q ik ) is a differentiable function of neuron i. Ideally we would like to observe the desired output y k for all patterns. This means, that for all patterns we should have: . This is achieved, if the positive and negative weights are selected so that the appropriate q ik values are achieved for all neurons. Without loss of generality, in the sequel we assume that g i (q ik ) = q ik , so that g −1 i (y ik ) = y ik . Combining Eqs. (5)-(7) we obtain: If we further assume that λ + (i, k) < r i + λ − (i, k) ∀i, k and also substitute Eq. (4) into (13) we obtain: If the network is only composed of output neurons, and if we assume that q ik = y ik , ∀i, k, then Eq. (14) becomes a linear system of NK equations with 2N 2 non-negative unknowns, the weights w + (i, j) and w − (i, j). If there are both output and non-output neurons then by selecting appropriate values for the excitation probabilities of the latter we can still obtain a linear system, as discussed in Section 4.
An accurate solution to Eq. (14) may not be available for two reasons. First, the number of equations may be larger that the number of unknowns; this is true when K > 2N . Second, the non-negativity constraints restrict the values of the variables and a solution may not exist even if K < 2N . As a result we formulate Eq. (14) as an NNLS problem to approach equality as much as possible in the least square sense yielding (9) The gradient of the objective function is given by: The ik row of matrix B and vector b in Eq. (9), which correspond to the ith signal flow equation of the kth pattern, are given by the following expressions: The column indices of B, ij + and ij − , indicate the position of the variables w + (i, j) and w − (i, j) in w, respectively. Notice that every value of B can be found by only using matrix Q = [q 1 , . . . , q k , . . . , q K ], Q ∈ R N ×K , which holds the q ik values of both output and non-output neurons; the d(i) values are usually constant and for simplicity we assume that d(i) = 0, ∀i. Also, despite the fact that every row of matrix B has 2N 2 elements, only 4N of them are non-zero and hence the density of non-zero elements in B is 2/N . One difficulty associated with the above formulations is the large dimensionality of B which implies that it may not be possible to be stored in memory. For example, in Section 7 we consider supervised RNN problems with dimensions up to N = 300 and K = 1000, so that matrix B has dimensions 300, 000 × 180, 000 which prohibits its storage in memory. Moreover, initial experimentation showed that B is ill-conditioned. Therefore, in Section 5 we develop a PGNNLS algorithm that does not require either storing large matrices or performing matrix inversion operations. It is important that only simple operations are performed, such as matrix-vector products, avoiding inefficient matrix-matrix multiplications or matrix inversion operations. To achieve the requirements of the solution approach, it is also important to consider the sparseness of B.

RNN-NNLS LEARNING ALGORITHM
Solving the NNLS problem can provide good learning performance only when all neurons are output neurons. Nonetheless, because this is not the case we take the following approach: if neuron i ∈ I out then we set q ik = y ik , ∀k, while if neuron i ∈ I out then we set q ik = U (a, b), ∀k, with 0 ≤ a ≤ b ≤ 1. Following this approach, we obtain q ik values for all neurons and patterns; thus, an NNLS problem is derived (Eqs. (9), (16), (17)), which can be solved using the PGNNLS algorithm developed in Section 5.
As already mentioned, due to the non-negativity constraints and depending on the dimensions N , K the system of Eq. (14) may not have a solution; therefore, the obtained weights from the solution of the NNLS optimization problem will not accurately satisfy Eqs. (5)- (7), and the obtained weights will not result in good performance. To deal with this issue, we use the weights acquired from the execution of Algorithm 2, to compute q ik from Eqs. (5)- (7). Then, a weighted version of the desired values q d ik and the exact values q ik is computed and used as the new q d ik values in PGNNLS. Using this iterative procedure, we progressively move toward weights that satisfy q d ik ≈ q ik . To retain our original goal of achieving the desired output values y ik , i ∈ I out we update the output q d ik values in two different ways: (a) there is a different weighting parameter for these neurons, 0 ≤ α o ≤ 1, typically close to one so that their desired values slowly vary, and (b) we restrict the neuron values within a desired region so that neurons corresponding to "0" decisions must have q ik ≤ 0.4 and neurons corresponding to "1" decisions must have q ik ≥ 0.6. By selecting the specific boundary values, we achieve to constrain each q ik in the desired region and to have a large variation range for the parameters.
The overall procedure is outlined in Algorithm 1, called RNN-NNLS. It is important to note that the NNLS algorithm does not require matrix B as input, which would be prohibitive for a large network. Due to Eq. (16), we can perform all the computations involving B using matrix Q which holds all the q ik values. Thereby, the order of memory required is the same with the standard RNN learning algorithm. The iterative procedure RNN-NNLS needs only a small number of iterations, NI RNN-NNLS , before the error stabilizes.

Algorithm 1 RNN-NNLS: RNN supervised learning algorithm based on NNLS formulation
Input: x k , y k , ∀k Output: w

PROJECTED GRADIENT NON-NEGATIVE LEAST-SQUARES ALGORITHM
In this section we develop a PGNNLS algorithm based on updating the search-direction using a limited-memory BFGS formula and performing an "Armijo rule along the projection arc" (APA) line-search [5]. Our approach is a modified version of the quasi-Newton NNLS algorithm proposed in [36]; nevertheless it is different both in terms of the employed linesearch (hyper-exponential instead of standard APA) and the procedure for updating the search direction (limited memory instead of full BFGS); the developed approach is outlined in Algorithm 2.
The key aspect of Algorithm 2 is that at iteration τ we only perform a line-search for the variables that are in the free-set F τ defined as: To understand the reason behind this, let us define the complement of F τ , called the binding set B τ : For the variables belonging to the binding set there are two possibilities about the search direction: (a) d τ i ≥ 0, and (b) d τ i < 0. In the first case, we have that w τ +1 i ] = 0 so that this variable remains constant and does not affect the cost function. In the second case, we have that w τ +1 is undesirable, as it contributes negatively toward the condition that guarantees function reduction at the particular direction (−(d τ ) T ∇f (w τ ) < 0).

Algorithm 2 PGNNLS: Projected Gradient Algorithm for the NNLS problem
Input and g τ ← ∇f (w τ ) using Eqs. (9) and (15); Find the binding set B τ according to expression (19); (9) and (15); Find the binding set B τ according to expression (19); (21); end if until a stopping criterion is met As a result, variables belonging to B τ should not affect the line-search procedure of iteration τ . This is achieved by considering a modified direction d defined as: whered τ = S τ ∇ P f (w τ ) and ∇ P f (w τ ) is the projected gradient given by Eq. (21). [ In this way, Eq. (10) becomes: An equivalent expression can be obtained by updating only the variables belonging to the free-set: , where S τ ∈ R |F |×|F | is the principal submatrix of S τ corresponding to the free variables and similarly, g τ i,F = [∇f (w τ )] F (i) , i = 1, . . . , |F|. As mentioned above, Algorithm 2 relies on a limited-memory BFGS update of the scaling matrix. In each iteration, the BFGS formula is updated so that the new matrix is symmetric, satisfies the secant equation and also is the closest to the current approximation matrix in the least-squares sense. In addition, if the associated problem is strictly convex and an appropriate line-search is considered, then the updated matrices are also positive definite [44]. The BFGS formula for updating S τ is given by: Notice that S τ is a rank-two modification of S τ −1 which can be obtained using Δw τ −1 and Δg τ −1 . Hence, if we store all vectors Δw k and Δg k from the start of the algorithm, we can obtain S τ without storing any matrix.
In the limited-memory variant of BFGS, instead of storing all vectors, we update S τ based on the M most recent Δw k and Δg k vector pairs. This is achieved with the use of the following recursive formula which is directly derived from (22) [8].
Using Eq. (23) we can efficiently update the search direction d τ = S τ ∇f (w τ ), without storing S τ at any iteration. As a result, the required memory for the quasi-Newton update is reduced from 2N 2 × 2N 2 to 2M × 2N 2 . This is a substantial memory saving, as it has been observed in practice that even small values of M (say M ∈ [3,7]) provide satisfactory results [8]. Nocedal and Wright [40] describe in detail the limited memory BFGS method and outline an iterative procedure for updating the search direction based on (23); we outline this procedure in Algorithm 3 The use of the limited-memory BFGS scheme also provides computational benefits. Note that updating the scaling matrix using the BFGS method requires several matrixvector operations whose computational complexity is O ((2N 2 ) 2 ). On the other hand, the use of Algorithm 3 requires 5M vector-vector products so that its computational complexity is O (5M (2N 2 )) which is significantly less than the complexity of a single matrix-vector product.
Let us now turn our attention to the line-search procedure. As mentioned above, the step-size s τ is found by employing the APA line-search [5]. In APA the step-size is chosen to be equal to s τ = β m , where m is the smallest non-negative integer satisfying the APA condition: where , 0 < σ APA < 1/2 and 0 < β < 1. An important advantage of the APA over other step-size rules is that it identifies many active constraints in one iteration. In addition, it is proven that the sequence {w τ } produced when applying the APA rule, converges to a stationary point {w * } [5], which in our case is a global minimum. In [9], a more detailed analysis of projected gradient algorithms further relaxed the convergence conditions. The authors showed that convergence to a stationary point can be achieved by choosing any step-size satisfying condition (24), under the assumptions that s τ is not too small, the cost function is bounded below and the gradient is uniformly continuous (Theorem 2.3 in [9]), which are true in the NNLS case. Hence, convergence is guaranteed even if we choose any value of m satisfying (24) rather than the smallest integer, as long as the selected β m are not too small.
Nevertheless, sequentially identifying the appropriate m value may require a large number of function evaluations and projections. In [46] we proposed to hyper-exponentially alternate s τ for the identification of an appropriate step-size value. In the hyper-exponential line-search (lineSearchHE), the first trial starts from s τ −1 and if the APA condition is satisfied, (s τ ≥ s τ −1 ), we hyper-exponentially increase the step-size (division of the initial step-size by β 2 k , k = 0, 1, 2, . . .) until a step-size s τ init /β 2 kv violating condition (24) is found; otherwise, we hyper-exponentially decrease the step-size (multiplication of the initial stepsize by β 2 k , k = 0, 1, 2, . . .) until a step-size s τ init β 2 ks satisfying condition (24) is found. In this way, an appropriate region for the optimal step-size is identified; then, a divide and conquer procedure is followed to find the largest value β m satisfying Eq. (24).
Formally the stopping criterion that should be met for the termination of the PGNNLS algorithm is related to the Karush-Kuhn-Tucker (KKT) optimality conditions. However, we do not require the accurate solution of the NNLS problem as it is only used to approximately train RNN. Hence, we employ the maximum number of iterations as stopping criterion.
The most costly operations that need to be performed at each iteration of Algorithm 2 involve the computation of f (w) and ∇f (w), which require matrix-vector product operations. In particular, at the start of each iteration, it is needed to evaluate f (w) and ∇f (w) once. Additionally, each trial of the line-search procedure requires the evaluation of f (w τ +1 cand ). In Section 6, we discuss two different approaches for the efficient evaluation of f (w) and ∇f (w) and derive analytical expressions.

EFFICIENT COMPUTATION OF NNLS COSTLY FUNCTIONS
As already mentioned, the most computationally expensive functions in Algorithm 2 are f (w) and ∇f (w). However, computing these functions directly is very inefficient, so that the structure and sparsity of matrix B should be exploited to find efficient ways to compute ∇f (w). In this section, we develop two such approaches. The first is based on the efficient computation of B T z 1 and Bz 2 where z 1 and z 2 are vectors of appropriate dimensions. The second is based on first computing and storing B T B in order to compute (B T B)z. We show that the computational complexity of the former approach is O(KN 2 ) per evaluation, while the complexity of the latter is O(N 3 ) per evaluation plus an initialization cost of O(KN 3 ). This indicates that each of the two approaches can be faster than the other depending on the problem dimensions (number of training pairs, K, and number of neurons, N ). The second approach is faster that the first if K N , otherwise the first approach is better.

The Structure of Matrix B
Matrix B is composed of many different matrix blocks which correspond to entries associated with positive or negative weights as well as different input-output training pairs. As a result we can represent B as: where B +k and B −k indicate the entries associated with the kth input-output training pair of the positive and negative weights, respectively. These matrices are sparse and are also of particular structure, as shown below for the case that N = 3, when d(i) = 0, i = 1, . . . , N.
Note that the structure of the above matrices allows their further decomposition into: For example, for the case that N = 3 matrices C k , D +k and D −k take the form: Notice that matrices C k , D +k and D −k also have a special structure while all can be decomposed further into N × N sized submatrices such that C k = [C k1 , . . . , C kN ], D +k = [D +k1 , . . . , D +kN ], and D −k = [D −k1 , . . . , D −kN ]. Sub-matrices C ki , D +ki , D −ki ∈ R N ×N are given by:

The First Approach for the Computation of f (w) and ∇f (w)
As already mentioned, functions f (w) and ∇f (w) can be computed according to Eqs. (9) and (15), respectively, which can be written as: where we have definedẑ = Bw,ẑ ∈ R NK×1 and z =ẑ − b. As a result, for the computation of f (w) the only expensive step is the calculation ofẑ = Bw. Similarly, the expensive steps for the computation of ∇f (w) are the calculation ofẑ = Bw and z = B T z, where z ∈ R 2N 2 ×1 . Note that the matrix-vector product Bw appears in both terms. As a result, only two matrix-vector products are needed for the evaluation of both functions at the same point w c :ẑ c = Bw c and z = B T z c , where z c =ẑ c − b. As the naive calculation of these matrix-vector products is not efficient, we manipulate the special structure and sparsity of matrix B to derive expressions of low computational complexity. Let us first examine the termẑ = Bw. Expanding B and w we obtain: . . .
whereẑ k ∈ R N ×1 and w + represents the positive weights so that value w + (iN − N + j) ≡ w + (i, j) and w − represents the negative weights such that w − (iN − N + j) ≡ w − (i, j). Note that to evaluateẑ it is sufficient to derive expressions for termsẑ k : Hence, the computation ofẑ k requires the efficient evaluation of C k w + , C k w − , D +k w + and D −k w − . Manipulation of these terms using Eqs. (28)- (30) and matrix algebra yieldŝ where the N × 1 vectors σ W + and σ W − are given by This definition implies that the ith element of σ W + or σ W − is equal to the sum of the elements belonging to the ith row of the associated matrix.
Having computedẑ and hence z, we can now proceed with the computation of z = B T z. If we define z T = z T 1 , . . . , z T K , where z k ∈ R N ×1 , and use Eq. (25) to expand matrix B we obtain .
(38) By exploiting the structure of B and using matrix algebra for each of the appearing terms C T k z k , D T +k z k and D T −k z k yields the following expression: Having derived expressions to efficiently derive functions f (w) and ∇f (w) the computational complexity of this approach is now examined. For the computation of Bw the most costly operations are the evaluation of the matrix-vector products (W + ) T q k and (W − ) T q k that appear in vectorsẑ k in Eq. (35). The time complexity of these operations is O(N 2 ), and as there are K such terms to be computed, the total complexity of evaluating Bw is O(KN 2 ). For the computation of term B T z, for each k we need to evaluate q k z k , q k ⊗ z k and q k ⊗ δ k which have O(N ), O(N 2 ) and O(N 2 ) complexity, respectively. In addition, summation of the latter two terms for all k requires O(KN 2 ). If the required multiplications are performed naively, then the computation of both Bw and B T z matrixvectors products would require O (2KN 3 ), as the dimensions of B are KN × 2N 2 , while the dimensions of w and z are 2N 2 × 1 and KN × 1, respectively. Hence this approach provides an O(N ) complexity reduction compared to naive matrix-vector multiplication.
With respect to memory requirements, this approach involves the storage of the necessary vectors that is, matrices W + and W − which have N 2 elements, and matrix Q which have KN elements in total, as well as a small number of auxiliary vectors. Naively storing B requires memory for 2KN 3 elements which is min{KN, 2N 2 } times larger than the memory required by our approach.
In sum, the computational complexity of computing f (w) and ∇f (w) is O(KN 2 ), while the approach does not require the storage of additional matrices other than the necessary W + , W − and Q which require KN + 2N 2 memory.

The Second Approach for the Computation of ∇f (w)
A second approach for the evaluation of functions f (w) and ∇f (w) is based on computing (during the initialization phase) the quantities Γ = B T B and β = B T b. Then, functions f (w) and ∇f (w) can be expressed using these quantities as: Based on the above expressions only the matrix-vector product Γw is required for their evaluation and hence at a particular point both functions can be computed by just evaluating Γw. Notice that B ∈ R KN×2N 2 and Γ ∈ R 2N 2 ×2N 2 so that the computation of f (w) and ∇f (w) are depended both on K, N in the first approach and only on N in the second. Expansion of matrix B according to Eq. (25) yields: The terms Γ lm , l, m = 1, 2, can be further decomposed into expressions involving C k , D +k and D −k through substitution of Eqs. (26) and (27), associated with B +k and B −k , into (42). For example, Γ 11 yields: Further substitution of Eqs. (28)-(30) into each of the subsequent terms of Γ lm and matrix algebra yields that these terms can be reproduced by only storing five vectors/matrices computed during the initialization phase. These include vector σ q ∈ R N ×1 and matrices For the derivation of Γw it is true that: where vectors w + and w − have already been defined in the first approach, while vectors z l = Γ l1 w + + Γ l2 w − , z l ∈ R N 2 ×1 , l = 1, 2 can be further decomposed into z T l = [z T l1 , . . . , z T lN ] with elements z li ∈ R N ×1 . In order to obtain low complexity expressions for these terms, we take advantage of the expressions derived for the composing matrices of Γ lm , and of Eq. (44), yielding the following expressions for z 1i and z 2i , i = 1, . . . , N where vectors σ W + and σ W − have already been defined in (36)- (37), vectors m c i , m r i ∈ R N ×1 are the ith column and row of matrix M, while the vector σ z ∈ R N ×1 is defined as The computational complexity of computing f (w) and ∇f (w) is dominated by the computation of Γw. In this approach, we need to examine both the initialization cost and the cost per function evaluation. The initialization cost is dominated by the derivation of matrices R s,i , i = 1, . . . , N which is of computational complexity O(KN 3 ). The computational cost per objective function or gradient evaluation involves the computation of vectors z 1i and z 2i , for i = 1, . . . , N given by Eqs. (46) and (47) In terms of memory requirements, this approach requires the storage of σ q , M, M s , R i and R s,i for i = 1, . . . , N apart from the necessary W + , W − and q k , k = 1, . . . , K. As each matrix R i or R s,i has N 2 elements, the total storage space required for this approach is O (N 3 + KN ) which is limiting for large values of N . As a result, this approach is more suitable for cases that K > N and N is small enough so that we can store at least 2N 3 + KN elements. Notice that if the above matrices are not used, then Γ requires the storage of 4N 4 elements. In

SIMULATION RESULTS
In this section, we evaluate the performance of the developed RNN-NNLS and PGNNLS algorithms in the context of emergency management optimization by considering the problem of assigning emergency units to incidents (AEUI), a problem first discussed in [29].

Problem Description
Consider that N L incidents occur simultaneously at different locations with I j people injured at incident j. N U emergency units or ambulances (say) are spatially distributed before the time of the incident with unit i being able to collect c i ∈ N + injured and having response time to incident j given by T ij > 0. We also assume that decisions are irrevocable so that after a unit is allocated to some incident, it cannot be re-assigned to some other incident.
The objective is to collect all injured at the minimum possible response time to ensure the quick collection and treatment of the civilians. The problem can be defined using mathematical programming as: Constraint (49b) indicates that an emergency unit must be allocated to exactly one incident, while (49c) expresses the fact that the total capacity of the units allocated to an incident must be at least equal to the number of people injured there. The allocation matrix X with elements X ij indicates whether agent i has been allocated to incident j (X ij = 1) or not (X ij = 0). The above problem is NP-hard in the strong sense since it is a generalization of the 0-1 Multiple Knapsack Problem which is of the same complexity class [39]. This means that no known algorithms exist to solve the problem in polynomial time. For this reason we rely upon heuristic algorithms that can provide fast and close to optimal solutions to the problem. In the next section, we discuss a heuristic method based on supervised learning that will be used to obtain fast and decentralized decision making, as well as close to optimal results.

Supervised Learning Solution Approach
The approach taken for the solution of AEUI problem is to train a RNN using numerous instances of the optimization problem with exact solutions which are obtained off-line. Then, if a problem instance is presented to the trained neural network, it will be able to provide a solution that is close to optimal, due to its generalization ability. As a result, the trained RNN can be "handed out" to all decision agents (emergency units) to serve as an "oracle" for decision making. When the emergency happens, each individual agent uses its "oracle" to obtain fast and decentralized decisions. Since all agents have the same "oracle", if they have the same information there will be no conflicts in their decisions; the "oracle" provides the same allocation matrix X to each agent, so that agent i is allocated to incident j with X ij = 1.

Training Architecture
By fixing the T ij and c i parameters, the problem can be mapped to a supervised learning context by representing the inputs to the network by I j and the outputs by X ij . Because I j ≥ 0 ∀j, in the RNN they will be represented by the parameters Λ j of the input neurons. The output variables are associated with the excitation probabilities of output neurons. Specifically, output neuron with index (i, j) represents decision variable X ij . During the training phase, we represent decision X ij = 1 with q (i,j) = 1 − q , 0 ≤ q < 1/2 and decision X ij = 0 with q (i,j) = q . During the testing phase, if the value of the particular neuron is q (i,j) > 0.5 then we assume that X ij = 1, otherwise we take X ij = 0.
With respect to the number of hidden neurons, we consider a configuration where the number of hidden neurons is twice the number of output neurons, that is, N H = 2N O . Furthermore, we always assume that our network is fully connected in terms of the W + and W − weight matrices. For the solution of the problem we considered two general NN architectures. In the "collective" NN architecture we construct a single neural network for all decisions which is comprised N O = N U N L output neurons. As the output of the neural network provides the actions for all agents, each agent only performs the action corresponding to him/her. In the "individual" NN architecture we construct and train a different NN for each agent's decision, so that we need to train N U architectures of N L output neurons. In this case, the ith NN is trained using as outputs only the variables X ij , j = 1, . . . , N L to advise agent i. Despite the fact that each NN provides a single action, decision making is still consistent because training is performed using the optimal solutions to the problem instances which are globally consistent.
To train the NNs, we have first generated at random 1000 problem instances for different numbers of emergency units and locations of incidents. The remaining parameters have been chosen at random with T ij = U (0, 1) and c i = U int (1,4). For each problem instance, the number of injured at location I j is also chosen from the uniform distribution such that I j = U int (0.5c t /N L , c t /N L ), where c t = i c i is the total capacity of the emergency units.
To evaluate the developed approach we have performed experiments with the following numbers of emergency units and incidents: N U = {5, 10, 15, 20} and N L = {3, 5}. Among the test cases considered, we only chose those whose required capacity was within the total available capacity of the emergency units. The optimal solution in each case was then obtained accurately by solving the combinatorial optimization problem in Matlab using function bintprog which employs a branch and bound procedure for the solution of binary combinatorial optimization problems. Testing after training was performed using a distinct but similarly generated set of 250 test cases so that the training and testing were disjoint, but with the same probability distributions for all parameters. The effectiveness of the learning algorithms was evaluated on the basis of the following metrics: • The percentage of instances that were solved so that all of the injured were evacuated • The percentage of people collected averaged over all testing instances • The average relative percentage deviation from the optimal, σ opt , which evaluates the closeness of the solution to optimality, taken over the number of problem instances that the emergency units covered all casualties N F , defined as: where f i NN (X) and f i opt (X) are the cost function values obtained from the heuristic neural network learning algorithm and the exact algorithm for instance i, respectively.

Performance Evaluation of PGNNLS
Before discussing the effectiveness of the RNN-NNLS algorithm for the solution of the investigated problem, results are presented concerning the computational efficiency and convergence speed of the developed PGNNLS algorithm. Specifically, the empirical computational performance of PGNNLS is examined for: (a) the fast computation of the objective and gradient NNLS functions, and (b) the convergence speed of PGNNLS compared to two other approaches.
To evaluate the efficiency of the two developed approaches for the computation of the costly NNLS functions in Section 6, we have measured the execution time required for the evaluation of 200 B T (Bw) operations which involve two matrix-vector products; the execution time also includes the initialization time. To demonstrate the benefit from using these approaches, a "naive" method for the computation of these products was implemented, that takes into consideration the sparsity of B, but performs no analytical manipulation. Fig. 1, illustrates the execution time ratio of the "naive" against the developed approaches (speedup) for different RNN architectures. In the figures, Approaches 1 and 2 correspond to the computation methods discussed in Sections 6.2 and 6.3, respectively. Fig. 1(a), shows the results of the "individual" RNN architecture when N L = {3, 5} and various ratios of hidden to output neurons. In this case the N U parameter is not important as the size of the network for each emergency unit depends only on N L . Because the constructed neural network for the "individual" architecture is small, Approach 2 is significantly better than Approach 1, while both approaches have an order of magnitude speedup compared to the "naive" implementation. In fact, Approach 2 reaches an overall speedup of fifty for N L = 5 and N H /N O = 2. On the contrary, for the "collective" architecture the (c) (d) Figure 1. Performance of approaches for computing the objective and gradient NNLS function compared to a "naive" one; the metric used is the ratio of execution times between the naive and another approach.
number of neurons is significantly larger that the "individual" one, which is in favor of Approach 1. Indeed, this is verified by the results which show that Approach 1 is better than Approach 2 by up to seven times. Also as the network size increases, with the addition of more hidden neurons, Approach 1 becomes more efficient and Approach 2 less efficient. These results show that both architectures are useful, as they perform better under different conditions, while they both provide a significant speedup over a "naive" implementation, as discussed in the derivation of these approaches. For the PGNNLS algorithm, we have chosen to perform 200 iterations with five correction vectors (M = 5) and hyper-exponential line search with parameters β = 0.4, σ APA = 0.25 and S τ 0 = I, ∀τ (see Eqs. (24) and (23)). To examine the efficiency of the proposed PGNNLS algorithm, we have compared its convergence in terms of iterations and execution time with two other algorithms, gradNNLS [46], and PQN-SPG [44]. The former is a projected gradient algorithm with first-order information (S τ = I) which was developed in conjuction with the introduction of the NNLS learning approach for RNN. The latter is a state-of-the-art limited-memory projected quasi-Newton algorithm for box constrained problems, such as NNLS. It is evident from Fig. 2 that the PGNNLS algorithm outperforms gradNNLS and PQN-SPG both in terms of the attained NNLS objective function value and the number of iterations to achieve convergence. In fact, for N L = 5 where the   training architecture is larger, the attained results are even better compared to the two other algorithms.  gradNNLS convergences faster for smaller problems (with N L = 3), while for larger problems PQN-SPG is up to 2.5 times faster. Note that all three NNLS algorithms have employed Approach 1 for the evaluation of the costly NNLS functions, which implies that the combined benefit obtained from the developed approach is more that two orders of magnitude.

Solving the AEUI Problem
In this section the performance of the RNN-NNLS algorithm for the solution of the AEUI problem is evaluated. To solve AEUI using the RNN-NNLS learning algorithm, we have employed the algorithm developed in 4. Specifically, we perform ten iterations (NI RNN-NNLS = 10), checking the solution quality after each iteration and storing the weights corresponding to the largest percentage of instances where all injured were collected. For updating the desired values of the non-output and output weights we have set α no = 0.75 and α o = 0.9 respectively. We have also normalized the inputs of the RNN so that Λ i ∈ [0.2, 1], while for the output neurons we have chosen q = 1/3, so that "low" and "high" neurons take values 1/3 and 2/3 respectively. The initial desired excitation probabilities of the non-output neurons are generated according to U (0.25, 0.75). Fig. 4 depicts the MSE with respect to the desired and attained excitation probabilities for the output neurons for ten iterations of the RNN-NNLS algorithm. It is clear that the MSE error decreases for subsequent iterations leading to the converge of the algorithm. In fact, stabilization of the MSE is accomplished after a very small number of iterations (around five). Although monotonic convergence cannot be guaranteed, the observed behavior is sufficient to produce good trained weights that will derive high quality solutions.
Finally, Fig. 5 summarizes the best results for the "collective" and "individual" NN architectures of the RNN-NNLS and RNN approaches. It is evident that the "collective" RNN-NNLS algorithm yields the best results in terms of percentage of instances were all injured were collected, as it is the most effective for N L = 5 and highly competitive for N L = 3. On the other hand, the "individual" RNN architecture produces the best results in terms of deviation from the optimal but has the worst performance in terms of the other two metrics. Among the other three architectures the "collective" RNN is the one with the best performance in terms of σ opt but it is not as effective in collecting injured, especially for larger problems.

CONCLUSIONS
In this work, we have studied non-negative least-squares learning in the context of the RNN. For this problem, a solution algorithm has been developed that achieves two orders of magnitude speedup compared to other state-of-the-art approaches. The speedup is a result of a customized limited-memory quasi-Newton method for the solution of the problem, as well as two efficient approaches for the computation of the gradient and objective functions that appear in the problem. Apart from improved learning efficiency, the developed approach has been shown to provide very good results for the solution of an emergency management optimization problem.