Adaptive Learning with Artificial Barriers Yielding Nash Equilibria in General Games

Artificial barriers in Learning Automata (LA) are a powerful yet under-explored concept, although they were first proposed in the 1980s. Introducing artificial non-absorbing barriers makes LA schemes resilient to being trapped in absorbing barriers, a phenomenon often referred to as lock-in probability, which leads to the exclusive choice of one action after convergence. Within the field of LA, and reinforcement learning in general, there is a scarcity of theoretical works on, and applications of, schemes with artificial barriers. In this paper, we devise a LA with artificial barriers for solving a general form of stochastic bimatrix game. Classical LA systems possess absorbing barriers; they are a powerful tool in game theory and were shown to converge to the game's Nash equilibrium under limited information. However, the stream of works on LA for solving game-theoretical problems can only handle the case where the Saddle Point of the game exists in pure strategies, and fails to reach a mixed Nash equilibrium when no Saddle Point exists in pure strategies. In this paper, by resorting to the powerful concept of artificial barriers, we suggest a LA that converges to an optimal mixed Nash equilibrium even when there is no Saddle Point in pure strategies. Our deployed scheme is of the Linear Reward-Inaction ($L_{R-I}$) flavor, which is originally an absorbing LA scheme; however, we render it non-absorbing by introducing artificial barriers in an elegant and natural manner, in the sense that the well-known legacy $L_{R-I}$ scheme can be seen as an instance of our proposed algorithm for a particular choice of the barrier. Furthermore, we present an $S$-Learning version of our LA with artificial barriers that is able to handle $S$-Learning environments, in which the feedback is continuous and not binary as in the case of the $L_{R-I}$.


I. INTRODUCTION
Narendra and Thathachar first presented the term Learning Automata (LA) in their 1974 survey [4]. A LA consists of an adaptive learning agent interacting with a stochastic environment with incomplete information. Lacking prior knowledge, the LA attempts to determine the optimal action by first choosing an action randomly and then updating the action probabilities based on the reward/penalty input that it receives from the environment. This process is repeated until the optimal action is finally reached. The LA update process can be described by the learning loop shown in Figure 1.
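The learning loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the environment is assumed stationary with one penalty probability per action, the update is the classical linear reward-inaction rule, and the names `run_lri`, `penalty_probs` and `theta` are hypothetical and do not come from the paper.

```python
import random

def run_lri(penalty_probs, theta=0.01, steps=20000, seed=0):
    """Illustrative LA loop using classical linear reward-inaction updates."""
    rng = random.Random(seed)
    r = len(penalty_probs)
    probs = [1.0 / r] * r  # no prior knowledge: start from the uniform distribution
    for _ in range(steps):
        # 1) Choose an action according to the current action probabilities.
        action = rng.choices(range(r), weights=probs)[0]
        # 2) The environment penalizes the chosen action with its penalty probability.
        penalized = rng.random() < penalty_probs[action]
        # 3) On a reward, move probability mass towards the chosen action;
        #    on a penalty, leave the probabilities unchanged (inaction).
        if not penalized:
            probs = [(1.0 - theta) * pj for pj in probs]
            probs[action] += theta
    return probs

final_probs = run_lri([0.7, 0.2, 0.5])
```

With a small `theta`, the probability vector tends to concentrate on the action with the lowest penalty probability, although any single run may lock into another action.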
Formally, a LA is defined by means of a quintuple $\langle A, B, Q, F(\cdot,\cdot), G(\cdot) \rangle$, where the elements of the quintuple are defined term by term as:
1) $A = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ gives the set of actions available to the LA, while $\alpha(t)$ is the action selected at time instant $t$ by the LA. Note that the LA selects one action at a time, and the selection is sequential.
2) $B = \{\beta_1, \beta_2, \ldots, \beta_m\}$ denotes the set of possible input values that the LA can receive. $\beta(t)$ denotes the input at time instant $t$, which is a form of feedback.
3) $Q = \{q_1, q_2, \ldots, q_s\}$ represents the states of the LA, where $Q(t)$ is the state at time instant $t$.
4) $F(\cdot,\cdot): Q \times B \to Q$ is the transition function at time $t$, such that $q(t+1) = F[q(t), \beta(t)]$. In simple terms, $F(\cdot,\cdot)$ returns the next state of the LA at time instant $t+1$ given the current state and the input from the environment, both at time $t$, using either a deterministic or a stochastic mapping.
5) $G(\cdot)$ defines the output function; it represents a mapping $G: Q \to A$ which determines the action of the LA as a function of the state.

The Environment $E$ is characterized by:
• $C = \{c_1, c_2, \ldots, c_r\}$, a set of penalty probabilities, where $c_i \in C$ corresponds to the penalty probability of action $\alpha_i$.

Learning Automata: Research into LA over the past four decades is extensive, leading to the proposal of various types throughout the years. LA are mainly characterized as being Fixed Structure Learning Automata (FSLA) or Variable Structure Learning Automata (VSLA). In FSLA, the probability of the transition from one state to another state is fixed, and the action probability of any action in any state is also fixed. Early research into LA centered around FSLA. Early pioneers in LA such as Tsetlin, Krylov, and Krinsky [5] proposed several examples of this type of automata. The research into LA moved gradually towards VSLA. Introduced by Varshavskii and Vorontsova in the early 1960s [6], VSLA have transition and output functions that evolve as the learning process continues [1]. The state transitions or the action probabilities are updated at every time step.

Continuous or Discretized: VSLA can also be defined as being Continuous or Discretized depending on the values that the action probabilities can take. In continuous LA, the action probabilities can take any value in the interval $[0, 1]$. The drawback of continuous LA is that they approach a goal but never reach it, and they have a slow rate of convergence. The concept of discretization was introduced in the 1980s to address these shortcomings. The proposed mitigation increased the speed of convergence of LA [7], [8] by permitting an action probability that was close enough to zero or unity to jump to that end point in a single step. The method employed by the authors constrained the action selection probability to be one of a finite number of values in the interval $[0, 1]$.

Ergodic or Absorbing: Depending on their Markovian properties, LA can further be classified as either ergodic or equipped with absorbing barriers. In an ergodic LA system, the final steady state does not depend on the initial state. In contrast, for LA with absorbing barriers, the steady state depends on the initial state, and when the LA converges, it gets locked into an absorbing state.
Absorbing barrier VSLA are preferred in static environments, while ergodic VSLA are suitable for dynamic environments.

Environment types: The feedback $\beta(t)$ received by the LA is a scalar that falls in the interval $[0, 1]$. If the feedback is binary, meaning 0 or 1, then the Environment is called P-type. Whenever the feedback takes discrete values, we call the environment Q-type. In the third case, where the feedback is any real number in the interval $[0, 1]$, we call the environment S-type. Traditionally, the schemes that handle P-type environments assume, at least in their proofs, stochastic environments where the feedback is a realization of a Bernoulli process; in general, however, the penalties and rewards need not obey any statistical law and can, for instance, be generated by an adversary. The schemes that handle P-type environments, such as the famous $L_{R-I}$ and a large class of fixed structure LA, do not automatically handle S-type and Q-type environments. Mason presented in [9] a LA scheme to handle S-learning environments. In this paper, we generalize our scheme to the S-type environment and provide the corresponding update scheme.

LA with Artificially Absorbing Barriers: LA with artificially absorbing barriers were introduced in the 1980s. In [1], Oommen turned a discretized ergodic scheme into an absorbing one by introducing an artificially absorbing barrier that forces the scheme to converge to one of the absorbing barriers. Such a modification led to the advent of new LA families with previously unknown behavior. For instance, the $ADL_{R-P}$ and $ADL_{I-P}$ are absorbing schemes that result from introducing absorbing barriers into their original ergodic counterparts. Those absorbing schemes were shown to be $\epsilon$-optimal in all random environments.

Applications of LA: LA have been utilized in many applications over the years. Recent applications of LA include achieving fair load balancing based on the two-time-scale separation paradigm [10], detection of malicious social bots [11], secure socket layer certificate verification [12], a protocol for intrusion detection [13], an efficient decision making mechanism for stochastic nonlinear resource allocation [14], link prediction in stochastic social networks [15], user behavior analysis-based smart energy management for webpage ranking [16], and resource selection in computational grids [17], to mention a few.

LA Applied to Game Theory: Studies on strategic games with LA have focused mainly on the traditional $L_{R-I}$, which is desirable to use as it can yield Nash equilibria in pure strategies [2]. Although other ergodic schemes such as $L_{R-P}$ were used in games with limited information [18], they did not gain popularity, at least when it comes to applications, due to their inability to converge to Nash equilibria. LA have found numerous game-theoretical applications, such as sensor fusion without knowledge of the ground truth [19], distributed power control in wireless networks and more particularly NOMA [20], optimization of cooperative tasks [21], content placement in cooperative caching [22], congestion control in the Internet of Things [23], QoS satisfaction in autonomous mobile edge computing [24], opportunistic spectrum access [25], scheduling of domestic shiftable loads in smart grids [26], anti-jamming channel selection for interference mitigation [27], and relay selection in vehicular ad-hoc networks [28].

Objective and Contribution of this paper: In this paper, we propose an algorithm addressing bimatrix games, which are a more general version of the zero-sum game treated in [3].
First, we consider a stochastic game where the outcomes are binary, being either a reward or a penalty. The reward probabilities are given by the corresponding payoff matrix of each player. The game we treat is of limited information, which is a flavor of games often treated in LA. In such a game, each player only observes the outcome of his action in the form of a reward or penalty, without observing the action chosen by the other player. The player might not even be aware that he is playing against an opponent. By virtue of the design principles of our scheme, at each round of the repeated game, the players revise their strategies upon receiving a reward while maintaining their strategies unchanged upon receiving a penalty. This is in concordance with the Linear Reward-Inaction, $L_{R-I}$, paradigm. Please note that this is radically different from the paradigm by Lakshmivarahan [3], where the players always revise their strategies at each round and the magnitude of the adjustment of the action probabilities depends only on whether a reward or a penalty is received at every time instant.
Furthermore, we provide an extension of our scheme to handle S-learning environments, where the feedback is not binary but rather continuous. The informed reader will notice that our main focus is on the case of the P-type environment, for the sake of brevity and due to space limitations, while we still give sufficient attention to the S-type environment. The remainder of this article is organized as follows.
In Section II, we present the game model for both P-type and S-type environments. In Section III, we introduce our devised $L_{R-I}$ with artificial barriers for handling P-type environments. In Section IV, we present the S-LA scheme with artificial barriers for handling the general case of S-type environments. The experimental results related to the $L_{R-I}$ are presented in Section V, while some experiments with the S-LA for handling S-type environments are given in Appendix B.

II. THE GAME MODEL
In this section, we formalize the game model that is being investigated. Let $P(t) = [p_1(t) \; p_2(t)]^{\intercal}$ denote the mixed strategy of player A at time instant $t$, where $p_1(t)$ accounts for the probability of adopting strategy 1 and $p_2(t)$ stands for the probability of adopting strategy 2. Thus, $P(t)$ describes the distribution over the strategies of player A. Similarly, we can define the mixed strategy of player B at time $t$ as $Q(t) = [q_1(t) \; q_2(t)]^{\intercal}$. The extension to more than two actions per player is straightforward, following a method analogous to that used by Papavassilopoulos [29], who extended the work of Lakshmivarahan and Narendra [3].
Let $\alpha_A(t) \in \{1, 2\}$ be the action chosen by player A at time instant $t$ and $\alpha_B(t) \in \{1, 2\}$ be the one chosen by player B, following the probability distributions $P(t)$ and $Q(t)$, respectively. The pair $(\alpha_A(t), \alpha_B(t))$ constitutes the joint action at time $t$, and its components are pure strategies. Specifically, if $(\alpha_A(t), \alpha_B(t)) = (i, j)$, the probability of reward for player A is determined by $r_{ij}$ while that of player B is determined by $c_{ij}$. Player A is in this case the row player while player B is the column player.
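When we are operating in the P-type mode, the game is defined by two payoff matrices, $R$ and $C$, describing the reward probabilities of player A and player B respectively:
$$R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}, \quad (1)$$
and the matrix
$$C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}, \quad (2)$$
where, as aforementioned, all the entries of both matrices are probabilities.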
In the case where the environment is of the S-model type, the latter two matrices are deterministic and describe the feedback as a scalar in the interval $[0, 1]$. For instance, if we operate in the S-type environment, the feedback when both players choose their respective first actions will be the scalar $c_{11}$ for player B, and not a Bernoulli feedback as in the previous case of the P-type environment. It is also possible to consider $c_{11}$ as a stochastic continuous variable with mean $c_{11}$ and whose realizations lie in $[0, 1]$; however, for the sake of simplicity, we consider $c_{11}$, and consequently $C$ and $R$, as deterministic. The asymptotic convergence proofs for the S-type environment remain valid independently of whether $C$ and $R$ are deterministic or whether they are obtained from a distribution with support in the interval $[0, 1]$ and with means defined by the matrices.
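Independently of whether the environment is of P-type or S-type, three cases are to be distinguished for the equilibria:
• Case 1: if $(r_{11} - r_{21})(r_{12} - r_{22}) < 0$, $(c_{11} - c_{12})(c_{21} - c_{22}) < 0$ and $(r_{11} - r_{21})(c_{11} - c_{12}) < 0$, there is just one mixed equilibrium.
• Case 2: if $(r_{11} - r_{21})(r_{12} - r_{22}) > 0$ or $(c_{11} - c_{12})(c_{21} - c_{22}) > 0$, then there is just one pure equilibrium, since there is at least one player who has a dominant strategy.
• Case 3: if $(r_{11} - r_{21})(r_{12} - r_{22}) < 0$, $(c_{11} - c_{12})(c_{21} - c_{22}) < 0$ and $(r_{11} - r_{21})(c_{11} - c_{12}) > 0$, there are two pure equilibria and one mixed equilibrium.

In strategic games, Nash equilibria are equivalently called the "Saddle Points" of the game. Since the outcome for a given joint action is stochastic, the game is of stochastic genre.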
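The case distinction can be checked mechanically from the two payoff matrices. The sketch below assumes the standard dominance tests for 2×2 bimatrix games (a player has a dominant strategy when his two column-wise or row-wise payoff differences share the same sign); the function name `classify_2x2` and the returned labels are hypothetical, and ties are simply reported as degenerate.

```python
def classify_2x2(R, C):
    """Classify the equilibrium structure of a 2x2 bimatrix game (illustrative)."""
    # a > 0: one row of R is better for player A whatever column B plays.
    a = (R[0][0] - R[1][0]) * (R[0][1] - R[1][1])
    # b > 0: one column of C is better for player B whatever row A plays.
    b = (C[0][0] - C[0][1]) * (C[1][0] - C[1][1])
    if a > 0 or b > 0:
        return "one pure equilibrium"
    if a < 0 and b < 0:
        # No dominant strategy: the sign of s separates Case 1 from Case 3.
        s = (R[0][0] - R[1][0]) * (C[0][0] - C[0][1])
        if s < 0:
            return "one mixed equilibrium"
        if s > 0:
            return "two pure equilibria and one mixed equilibrium"
    return "degenerate"

# Assumed example matrices, chosen only to exercise the three branches.
case1 = classify_2x2([[0.2, 0.6], [0.4, 0.5]], [[0.4, 0.25], [0.3, 0.6]])
case2 = classify_2x2([[0.7, 0.9], [0.6, 0.8]], [[0.6, 0.8], [0.8, 0.9]])
case3 = classify_2x2([[0.3, 0.1], [0.2, 0.2]], [[0.3, 0.2], [0.1, 0.2]])
```

The same test applies to P-type and S-type environments, since it only depends on the ordering of the matrix entries.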

III. GAME THEORETICAL LA ALGORITHM BASED ON THE $L_{R-I}$ WITH ARTIFICIAL BARRIERS
In this section, we present our $L_{R-I}$ with artificial barriers, which is devised specifically for P-type environments.

A. Non-Absorbing Artificial Barriers
As we have seen above from surveying the literature, an originally ergodic LA can be rendered absorbing by modifying its end states. However, what is unknown in the literature is whether a scheme which is originally absorbing can be rendered ergodic. In many cases, this can be achieved by making the scheme behave according to the rule of the absorbing scheme over the probability simplex and pushing the probability back inside the simplex whenever the scheme approaches the absorbing barriers. Such a scheme is novel in the field of LA, and its advantage is that the strategies avoid being absorbed in undesirable absorbing barriers. Further, by countering the absorbing barriers, the scheme can migrate stochastically towards a desirable mixed strategy. Interestingly, as we will see later in the paper, even if the optimal strategy corresponds to an absorbing barrier, the scheme will approach it. Thus, the scheme converges to mixed strategies whenever they correspond to optimal strategies, while approaching the absorbing states whenever they are the optimal strategies. We shall give the details of our devised scheme, which enjoys the above-mentioned properties, in the next section.

B. Non-Absorbing Game Playing
At this juncture, we shall present the design of our proposed LA scheme together with some theoretical results demonstrating that it can converge to the Saddle Points of the game even if the Saddle Point is a mixed Nash equilibrium. Our solution presents a new variant of the $L_{R-I}$ scheme, which is made ergodic by modifying the update rule in a general form; the original $L_{R-I}$, whose absorbing barriers correspond to the corners of the simplex, is an instance of this general scheme for a particular choice of its parameters. The proof of convergence is based on Norman's theory for learning processes characterized by small learning steps [30], [31].
We introduce $p_{\max}$ as the artificial barrier, which is a real value close to 1. Similarly, we introduce $p_{\min} = 1 - p_{\max}$, which corresponds to the lowest value any action probability can take. In order to enforce the constraint that the probability of any action for both players remains within the interval $[p_{\min}, p_{\max}]$, one should start by choosing initial values of $p_1(0)$ and $q_1(0)$ in the same interval, and further resort to update rules that ensure that each update keeps the probabilities within the same interval.
If the outcome from the environment at time $t$ is a reward for action $i \in \{1, 2\}$, the update rule is given by:
$$p_i(t+1) = p_i(t) + \theta\,(p_{\max} - p_i(t)), \qquad p_s(t+1) = p_s(t) + \theta\,(p_{\min} - p_s(t)) \ \text{ for } s \neq i, \quad (3)$$
where $\theta$ is a learning parameter. The informed reader observes that the update rules coincide with those of the classical $L_{R-I}$, except that $p_{\max}$ replaces unity for updating $p_i(t+1)$ and that $p_{\min}$ replaces zero for updating $p_s(t+1)$.
Following the Inaction principle of the $L_{R-I}$, whenever the player receives a penalty, its action probabilities are kept unchanged, which is formally given by:
$$p_i(t+1) = p_i(t), \qquad p_s(t+1) = p_s(t) \ \text{ for } s \neq i. \quad (4)$$
The update rules for the mixed strategy $q(t+1)$ are defined in a similar fashion. We shall now move to a theoretical analysis of the convergence properties of our proposed algorithm for solving a strategic game. In order to denote the optimal Nash equilibrium of the game, we use the pair $(p_{opt}, q_{opt})$.
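As an illustration of the reward and penalty updates described in this subsection, the following sketch applies one update step to a two-action probability vector. It assumes the rule as stated in the text ($p_{\max}$ replaces unity for the rewarded action, $p_{\min}$ replaces zero for the other action, and nothing changes on a penalty); the function name `update_with_barriers` and its argument names are hypothetical.

```python
def update_with_barriers(p, action, rewarded, theta, p_max):
    """One reward-inaction step with artificial barriers (illustrative sketch)."""
    p_min = 1.0 - p_max
    if not rewarded:
        # Inaction principle: a penalty leaves the probabilities unchanged.
        return list(p)
    # Non-chosen actions are pulled towards p_min instead of zero ...
    new_p = [pj + theta * (p_min - pj) for pj in p]
    # ... and the rewarded action is pulled towards p_max instead of unity.
    new_p[action] = p[action] + theta * (p_max - p[action])
    return new_p

with_barrier = update_with_barriers([0.5, 0.5], 0, True, 0.1, 0.99)
classical = update_with_barriers([0.5, 0.5], 0, True, 0.1, 1.0)
unchanged = update_with_barriers([0.3, 0.7], 1, False, 0.1, 0.99)
```

Each updated value is a convex combination of the old value and a barrier, so a vector that starts inside $[p_{\min}, p_{\max}]$ stays inside it, and setting `p_max = 1.0` recovers the classical rule.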
We should also distinguish the details of the equilibrium according to the entries of the payoff matrices $R$ and $C$, starting with Case 1.
a) Case 1: Only One Mixed Nash Equilibrium (No Saddle Point in pure strategies): The first case depicts the situation where no Saddle Point exists in pure strategies. In other words, the only Nash equilibrium is a mixed one. Based on the fundamentals of Game Theory, the optimal mixed strategies can be shown to be the following:
$$p_{opt} = \frac{c_{22} - c_{21}}{L'}, \qquad q_{opt} = \frac{r_{22} - r_{12}}{L},$$
where $L = (r_{11} + r_{22}) - (r_{12} + r_{21})$ and $L' = (c_{11} + c_{22}) - (c_{12} + c_{21})$. This case can be divided into two sub-cases. The first sub-case is given by:
$$r_{11} > r_{21}, \quad r_{12} < r_{22}; \qquad c_{11} < c_{12}, \quad c_{21} > c_{22}. \quad (5)$$
The second sub-case is given by:
$$r_{11} < r_{21}, \quad r_{12} > r_{22}; \qquad c_{11} > c_{12}, \quad c_{21} < c_{22}. \quad (6)$$
Let the vector $X(t) = [p_1(t) \; q_1(t)]^{\intercal}$. We resort to the notation $\Delta X(t) = X(t+1) - X(t)$. To denote the conditional expected value operator, we use the nomenclature $E[\cdot \mid \cdot]$. Using these notations, we introduce the next theorem of the article.
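Theorem 1. Consider a two-player game with payoff matrices as in Eq. (1) and Eq. (2), and a learning algorithm defined by Eq. (3) and Eq. (4) for both players A and B, with learning rate $\theta$. Then, $E[\Delta X(t) \mid X(t)] = \theta W(x)$ and, for every $\epsilon > 0$, there exists a unique stationary point $X^* = [p^*, q^*]^{\intercal}$ satisfying $W(X^*) = 0$ and $\|X^* - X_{opt}\| < \epsilon$ for a sufficiently small $p_{\min}$.

Proof. We start by first computing the conditional expected value of the increment $\Delta X(t)$:
$$E[\Delta X(t) \mid X(t)] = \theta \begin{pmatrix} W_1(x) \\ W_2(x) \end{pmatrix} = \theta W(x),$$
where the above format is possible since all possible updates share the form $\Delta X(t) = \theta W(t)$, for some $W(t)$, as given in Eq. (3).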
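For ease of notation, we drop the dependence on $t$, with the implicit assumption that all occurrences of $X$, $p_1$ and $q_1$ represent $X(t)$, $p_1(t)$ and $q_1(t)$ respectively. $W_1(x)$ is then:
$$W_1(x) = p_1 D_1^A(q_1)\,(p_{\max} - p_1) + (1 - p_1) D_2^A(q_1)\,(p_{\min} - p_1),$$
where
$$D_1^A(q_1) = r_{11} q_1 + r_{12}(1 - q_1), \qquad D_2^A(q_1) = r_{21} q_1 + r_{22}(1 - q_1).$$
By replacing $p_{\max} = 1 - p_{\min}$ and rearranging the expression we get:
$$W_1(x) = p_1(1 - p_1) D_{12}^A(q_1) - p_{\min}\left(p_1 D_1^A(q_1) - (1 - p_1) D_2^A(q_1)\right),$$
with $D_{12}^A(q_1) = D_1^A(q_1) - D_2^A(q_1)$.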
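Similarly, we can get
$$W_2(x) = q_1 D_1^B(p_1)\,(p_{\max} - q_1) + (1 - q_1) D_2^B(p_1)\,(p_{\min} - q_1),$$
where $D_1^B(p_1) = c_{11} p_1 + c_{21}(1 - p_1)$ and $D_2^B(p_1) = c_{12} p_1 + c_{22}(1 - p_1)$. By replacing $p_{\max} = 1 - p_{\min}$ and rearranging the expression we get:
$$W_2(x) = q_1(1 - q_1) D_{12}^B(p_1) - p_{\min}\left(q_1 D_1^B(p_1) - (1 - q_1) D_2^B(p_1)\right),$$
with $D_{12}^B(p_1) = D_1^B(p_1) - D_2^B(p_1)$.

We need to address the three identified cases. Consider Case 1, where there is only a single mixed equilibrium. We get
$$D_{12}^A(q_1) = (r_{12} - r_{22}) + L q_1 = L\,(q_1 - q_{opt}).$$
For the sake of brevity, we consider the first sub-case, given by condition Eq. (5). We have $L > 0$, since $r_{11} > r_{21}$ and $r_{22} > r_{12}$. Therefore $D_{12}^A(q_1)$ is an increasing function of $q_1$ and
$$D_{12}^A(q_1) \begin{cases} < 0 & \text{if } q_1 < q_{opt}, \\ = 0 & \text{if } q_1 = q_{opt}, \\ > 0 & \text{if } q_1 > q_{opt}. \end{cases}$$
For a given $q_1$, $W_1(X)$ is quadratic in $p_1$. Also, denoting by $D_1^A(q_1)$ and $D_2^A(q_1)$ the reward probabilities of the first and second action of player A for a given $q_1$, we have:
$$W_1([0, q_1]^{\intercal}) = p_{\min} D_2^A(q_1) > 0, \qquad W_1([1, q_1]^{\intercal}) = -p_{\min} D_1^A(q_1) < 0. \quad (16)$$
Since $W_1(X)$ is quadratic with a negative second derivative with respect to $p_1$, and since the inequalities in Eq. (16) are strict, it admits a single root $p_1$ for $p_1 \in [0, 1]$. Using a similar argument, we can see that there exists a single solution for each $p_1$, and as $p_{\min} \to 0$, we conclude that $W_1(X) = 0$ whenever $p_1 \in \{0, p_{opt}, 1\}$. Arguing in a similar manner for $W_2(X)$, we see that there exists a small enough value of $p_{\min}$ such that $X^* = [p^*, q^*]^{\intercal}$ satisfies $W_2(X^*) = 0$, proving Case 1.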
In the proof of Case 1, we took advantage of the fact that, for a small enough $p_{\min}$, the learning algorithm admits a stationary point, and we also identified the corresponding possible values of this point. It is thus always possible to select a small enough $p_{\min} > 0$ such that $X^*$ approaches $X_{opt}$, concluding the proof for Case 1. Case 2 and Case 3 can be derived in a similar manner, and the details are omitted to avoid repetition.
In the next theorem, we show that the expected value of ∆X(t) has a negative definite gradient.
Theorem 2. The matrix of partial derivatives, $\frac{\partial W(X^*)}{\partial x}$, is negative definite.
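Proof. We start the proof by writing the explicit format of the matrix,
$$\frac{\partial W(x)}{\partial x} = \begin{pmatrix} \frac{\partial W_1(x)}{\partial p_1} & \frac{\partial W_1(x)}{\partial q_1} \\ \frac{\partial W_2(x)}{\partial p_1} & \frac{\partial W_2(x)}{\partial q_1} \end{pmatrix},$$
and then computing each of its entries from the expressions of $W_1$ and $W_2$. As seen in Theorem 1, for a small enough value of $p_{\min}$, we can ignore the terms that are weighted by $p_{\min}$, and we will thus have $\frac{\partial W(X^*)}{\partial X} \approx \frac{\partial W(X_{opt})}{\partial X}$. We now subdivide the analysis into the three cases.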
b) Case 1: No Saddle Point in pure strategies: In this case, the entries of the matrix are computed at $X_{opt}$ and simplified, resulting in the matrix of Eq. (22). We know that this case can be divided into two sub-cases. Let us consider the first sub-case, given by Eq. (23). Thus, $L > 0$ and $L' < 0$ as a consequence of Eq. (23), and the matrix given in Eq. (22) satisfies a condition which implies that the $2 \times 2$ matrix is negative definite.

c) Case 2: Only one pure equilibrium: The condition for only one pure equilibrium can be divided into four different sub-cases. Without loss of generality, we can consider a particular sub-case where $q_{opt} = 1$ and $p_{opt} = 1$. This reduces to $r_{11} - r_{21} > 0$ and $c_{11} - c_{12} > 0$. Computing the entries of the matrix for this case, and simplifying the entry $\frac{\partial W_2(X_{opt})}{\partial p_1}$, results in a matrix which, for a sufficiently small value of $p_{\min}$, satisfies a condition that again implies that the $2 \times 2$ matrix is negative definite.

d) Case 3: Two pure equilibria and one mixed equilibrium: Without loss of generality, we suppose that $(p_{opt}, q_{opt}) = (1, 1)$ and $(p_{opt}, q_{opt}) = (0, 0)$ are the two pure Nash equilibria. This corresponds to a sub-case where $r_{11} > r_{21}$, $r_{22} > r_{12}$, $c_{11} > c_{12}$ and $c_{22} > c_{21}$. Whenever $(p_{opt}, q_{opt}) = (1, 1)$, we obtain stability of the fixed point as demonstrated in the previous case, Case 2. Now, let us consider the stability for $(p_{opt}, q_{opt}) = (0, 0)$.
Computing the entries of the matrix for this case, and simplifying the entry $\frac{\partial W_2(X_{opt})}{\partial p_1}$, results in the matrix of Eq. (36), which, for a sufficiently small value of $p_{\min}$, satisfies a condition that again implies that the $2 \times 2$ matrix is negative definite. Now, what remains to be shown is that the mixed Nash equilibrium in this case is unstable.
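The matrix evaluated at the mixed equilibrium is given in Eq. (38). Using Eq. (31), we can see that $L > 0$ and $L' > 0$, and thus the mixed equilibrium is unstable.

Theorem 3. We consider the update equations given by the $L_{R-I}$ scheme. For a sufficiently small $p_{\min}$ approaching 0, and as $\theta \to 0$ and as time goes to infinity:
$$\begin{pmatrix} E(p_1(t)) \\ E(q_1(t)) \end{pmatrix} \to \begin{pmatrix} p^*_{opt} \\ q^*_{opt} \end{pmatrix},$$
where $[p^*_{opt} \; q^*_{opt}]^{\intercal}$ corresponds to a Nash equilibrium of the game.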
Proof. The proof of the result is obtained by applying a classical result due to Norman [31], which is given in Appendix A in the interest of completeness.
Norman's theorem has traditionally been used to prove a considerable number of results in the field of LA. In the context of game-theoretical LA schemes, Norman's theorem was adapted by Lakshmivarahan and Narendra to derive similar convergence properties of the $L_{R-\epsilon P}$ [3] for the zero-sum game. It is straightforward to verify that Assumptions (1)-(6), as required for Norman's result in the appendix, are satisfied. Thus, by further invoking Theorem 1 and Theorem 2, the result follows.

IV. GAME THEORETICAL LA ALGORITHM BASED ON THE S-LEARNING WITH ARTIFICIAL BARRIERS
In this section, we give the update equations for the LA when the environment is of S-type.
In the case of the S-type, the game is defined by two payoff matrices, $R$ and $C$, describing the deterministic feedback of player A and player B respectively.
All the entries of both matrices are deterministic numbers, as in classical game theory settings.
The environment returns $u_i^A(t)$, the payoff defined by the matrix $R$ for player A at time $t$, whenever player A chooses an action $i \in \{1, 2\}$.
The update rule for player A takes $u_i^A(t)$ into account, with $\theta$ as the learning parameter. Note that $u_i^A$ is the feedback for action $i$ of player A, which is one entry in the $i$-th row of the matrix $R$, depending on the action of player B.
Similarly, we can define $u_i^B(t)$, the payoff defined by the matrix $C$ for player B at time $t$, whenever player B chooses an action $i \in \{1, 2\}$.
For instance, if at time $t$ player A takes action 1 and player B takes action 2, then $u_1^A(t) = r_{12}$ and $u_2^B(t) = c_{12}$. The update rules for player B can be obtained by analogy to those given for player A.
Theorem 4. We consider the update equations given by the S-Learning scheme presented above in this section. For a sufficiently small $p_{\min}$ approaching 0, and as $\theta \to 0$ and as time goes to infinity:
$$\begin{pmatrix} E(p_1(t)) \\ E(q_1(t)) \end{pmatrix} \to \begin{pmatrix} p^*_{opt} \\ q^*_{opt} \end{pmatrix},$$
where $[p^*_{opt} \; q^*_{opt}]^{\intercal}$ corresponds to a Nash equilibrium of the game.
Proof. The proof of this theorem follows the same lines as the proofs given in Section III and is omitted here for the sake of brevity.

V. EXPERIMENTAL RESULTS
In this section, we focus on providing thorough experiments for the $L_{R-I}$ scheme. Some experiments with the S-LA for handling S-type environments, which mainly aim to verify our theoretical findings, are given in Appendix B.
To verify the theoretical properties of the proposed learning algorithm, we conducted several simulations that will be presented in this section.By using different instances of the payoff matrices R and C, we can experimentally cover the three cases referred to in Section III.

A. Convergence in Case 1
We examine a case of the game where only one mixed Nash equilibrium exists, meaning that there is no Saddle Point in pure strategies. The game matrix $R$ is given by:
$$R = \begin{pmatrix} 0.2 & 0.6 \\ 0.4 & 0.5 \end{pmatrix}, \quad (41)$$
which, together with the matrix $C$ of Eq. (42), admits $p_{opt} = 0.6667$ and $q_{opt} = 0.3333$. We ran our simulation for $5 \times 10^6$ iterations, and present the error in Table I for different values of $p_{\max}$ and $\theta$ as the difference between $X_{opt}$ and the mean over time of $X(t)$ after convergence. The high number of iterations was chosen in order to eliminate the Monte Carlo error. A significant observation is that the error monotonically decreases as $p_{\max}$ goes towards 1 (i.e., when $p_{\min} \to 0$). For instance, for $p_{\max} = 0.998$ and $\theta = 0.001$, the proposed scheme yields an error of $5.27 \times 10^{-3}$, and further reducing $\theta$ to 0.0001 leads to an error of $3.34 \times 10^{-3}$.

Table I: Error for different values of $\theta$ and $p_{\max}$, when $p_{opt} = 0.6667$ and $q_{opt} = 0.3333$, for the game specified by the $R$ matrix given by Eq. (41) and the $C$ matrix given by Eq. (42).

The behavior of the scheme is illustrated in Figure 2, showing the trajectory of the mixed strategies for both players (given by $X(t)$) for an ensemble of 1,000 runs using $\theta = 0.01$ and $p_{\max} = 0.99$.
The trajectory of the ensemble enables us to notice the mean evolution of the mixed strategies.The spiral pattern results from one of the players adjusting to the strategy used by the other before the former learns by readjusting its strategy.The process is repeated, thus leading to more minor corrections until the players reach the Nash equilibrium.
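A single run of such an experiment can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the $R$ matrix is the one quoted in the text, whereas the $C$ matrix below is a hypothetical choice picked only because it yields $p_{opt} = 2/3$; the helper names `mixed_equilibrium` and `play`, the averaging over the second half of the run, and the short run length are likewise assumptions.

```python
import random

def mixed_equilibrium(R, C):
    """Mixed equilibrium of a 2x2 bimatrix game from the indifference conditions."""
    L = (R[0][0] + R[1][1]) - (R[0][1] + R[1][0])
    L_prime = (C[0][0] + C[1][1]) - (C[0][1] + C[1][0])
    return (C[1][1] - C[1][0]) / L_prime, (R[1][1] - R[0][1]) / L

def play(R, C, theta, p_max, steps, seed=0):
    """Two reward-inaction learners with artificial barriers (illustrative)."""
    rng = random.Random(seed)
    p_min = 1.0 - p_max
    p1 = q1 = 0.5
    sum_p = sum_q = 0.0
    count = 0
    for t in range(steps):
        i = 0 if rng.random() < p1 else 1   # action of player A (row)
        j = 0 if rng.random() < q1 else 1   # action of player B (column)
        if rng.random() < R[i][j]:          # A is rewarded: move p1 towards a barrier
            p1 += theta * ((p_max if i == 0 else p_min) - p1)
        if rng.random() < C[i][j]:          # B is rewarded: move q1 towards a barrier
            q1 += theta * ((p_max if j == 0 else p_min) - q1)
        if t >= steps // 2:                 # average over the second half of the run
            sum_p += p1
            sum_q += q1
            count += 1
    return sum_p / count, sum_q / count

R = [[0.2, 0.6], [0.4, 0.5]]
C = [[0.4, 0.25], [0.3, 0.6]]  # hypothetical matrix, not taken from the paper
p_opt, q_opt = mixed_equilibrium(R, C)
p_bar, q_bar = play(R, C, theta=0.001, p_max=0.998, steps=200000)
```

The quantities `abs(p_bar - p_opt)` and `abs(q_bar - q_opt)` play the role of the error reported in the text; much longer runs than the one used here are needed before the averages settle.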
This adjustment process can be visualized in Figure 3, which presents the time evolution of the strategies of both players for a single experiment with $p_{\max} = 0.99$ and $\theta = 0.00001$ over $3 \times 10^7$ steps. We observe an oscillatory behavior which vanishes as the players play for more iterations. It is worth noting that a larger value of $\theta$ will cause more steady state error (as specified in Theorem 1), but it will also disrupt this behavior as the players take larger updates whenever they receive a reward. Furthermore, decreasing $\theta$ results in a smaller convergence error, but also negatively affects the convergence speed, as more iterations are required to achieve convergence. Figure 4 depicts the trajectories of the probabilities $p_1$ and $q_1$ for the same settings as those in Figure 3.

Figure 2: Trajectory of $[p_1(t), q_1(t)]^{\intercal}$ for the case of the $R$ matrix given by Eq. (41) and the $C$ matrix given by Eq. (42) with $p_{opt} = 0.6667$ and $q_{opt} = 0.3333$, and using $p_{\max} = 0.99$ and $\theta = 0.01$.

Figure 3: Time evolution of $X(t)$ for the case of the $R$ matrix given by Eq. (41) and the $C$ matrix given by Eq. (42) with $p_{opt} = 0.6667$ and $q_{opt} = 0.3333$, and using $p_{\max} = 0.99$ and $\theta = 0.00001$.

Now, we turn our attention to the analysis of the deterministic Ordinary Differential Equation (ODE) corresponding to our LA with barriers, and plot it in Figure 5. The trajectory of the ODE conforms with our intuition and with the results of the LA run in Figure 4. The two ODEs are given by:
$$\frac{dp_1}{dt} = W_1(p_1, q_1), \qquad \frac{dq_1}{dt} = W_2(p_1, q_1).$$

Figure 4: Trajectory of $X(t)$ where $p_{opt} = 0.6667$ and $q_{opt} = 0.3333$, using $p_{\max} = 0.99$ and $\theta = 0.00001$.
To obtain the ODE for a particular example, we only need to replace the entries of $R$ and $C$ in the ODE by their values. In this sense, to plot the ODE trajectories we only need to know $R$, $C$ and, of course, $p_{\max}$.
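The ODE trajectories can be approximated numerically with a forward Euler scheme. The sketch below is built on assumptions that should be read as such: the drift is taken to be the expected one-step change of the barrier-based reward-inaction rule (an action is chosen with its current probability, rewarded with the corresponding averaged payoff, and then pulled towards $p_{\max}$ or $p_{\min}$), the $C$ matrix is hypothetical, and the names `drift` and `integrate`, the step size and the horizon are illustrative choices.

```python
def drift(p, q, R, C, p_max):
    """Expected per-step drift (W1, W2) of the barrier-based reward-inaction rule."""
    p_min = 1.0 - p_max
    d1a = q * R[0][0] + (1 - q) * R[0][1]   # reward probability of A's action 1
    d2a = q * R[1][0] + (1 - q) * R[1][1]   # reward probability of A's action 2
    d1b = p * C[0][0] + (1 - p) * C[1][0]   # reward probability of B's action 1
    d2b = p * C[0][1] + (1 - p) * C[1][1]   # reward probability of B's action 2
    w1 = p * d1a * (p_max - p) + (1 - p) * d2a * (p_min - p)
    w2 = q * d1b * (p_max - q) + (1 - q) * d2b * (p_min - q)
    return w1, w2

def integrate(R, C, p_max, p0=0.5, q0=0.5, dt=0.1, steps=50000):
    """Forward Euler integration of dp/dt = W1, dq/dt = W2."""
    p, q = p0, q0
    for _ in range(steps):
        w1, w2 = drift(p, q, R, C, p_max)
        p, q = p + dt * w1, q + dt * w2
    return p, q

R = [[0.7, 0.9], [0.6, 0.8]]     # R matrix quoted for the single pure equilibrium case
C = [[0.6, 0.8], [0.8, 0.9]]     # hypothetical C matrix, not taken from the paper
p_star, q_star = integrate(R, C, p_max=0.99)
```

With these assumed inputs the trajectory settles at an interior point near the corner $(1, 0)$ rather than on it, which is the qualitative effect of the barrier; changing `p_max` moves that point closer to or further from the corner.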

B. Case 2: One Pure equilibrium
At this juncture, we shall experimentally show that the scheme still possesses plausible convergence properties even when there is a single Saddle Point in pure strategies, and that our proposed LA approaches the optimal pure equilibrium. To this end, we consider a case of the game where there is a single pure equilibrium, which falls in the category of Case 2, with $p_{opt} = 1$ and $q_{opt} = 0$. The payoff matrix $R$ for the game is given by:
$$R = \begin{pmatrix} 0.7 & 0.9 \\ 0.6 & 0.8 \end{pmatrix}, \quad (45)$$
with the matrix $C$ given by Eq. (46). We first show the convergence errors of our method in Table II. As in the previous simulation for Case 1, the errors are on the order of $10^{-2}$ for larger values of $p_{\max}$. We also observe that the steady state error is slightly higher compared to the previous case of the mixed Nash equilibrium described by Eq. (41) and Eq. (42), which was treated in the previous section.
Table II: Error for different values of $\theta$ and $p_{\max}$ for the game specified by the $R$ matrix and the $C$ matrix given by Eq. (45) and Eq. (46).

We then plot the ODE for $p_{\max} = 0.99$, as shown in Figure 6. According to the ODE in Figure 6, we expect that the LA will converge towards the attractor of the ODE, which corresponds to $(p^*, q^*) = (0.917, 0.040)$, as $\theta$ goes to zero. We see that $(p^*, q^*) = (0.917, 0.040)$ approaches $(p_{opt}, q_{opt}) = (1, 0)$, but there is still a gap between them. This is also illustrated in Figure 7, where we consistently observe that the LA converges towards $(p^*, q^*) = (0.916, 0.041)$ after running our LA for 30,000 iterations with an ensemble of 1,000 experiments.
Observing the small discrepancy between $(p^*, q^*) = (0.917, 0.040)$ and $(p_{opt}, q_{opt}) = (1, 0)$, both from the ODE and from the LA trajectory as shown in Figure 6 and Figure 7, motivates us to choose an even larger value of $p_{\max}$. Thus, we increase $p_{\max}$ from 0.99 to 0.999 and observe the expected convergence results from the ODE in Figure 8. We observe a single attraction point of the ODE close to the pure Nash equilibrium. We can read from the ODE trajectory that $(p^*, q^*) = (0.991, 0.004)$, which is closer to $(p_{opt}, q_{opt}) = (1, 0)$ than in the previous case with a smaller $p_{\max}$.
In Figure 9a, we depict the time evolution of the two components of the vector $X(t)$ using the proposed algorithm for an ensemble of 1,000 runs. In the case of a pure Nash equilibrium, there is no oscillatory behavior: when a player assigns more probability to an action, the other player reinforces the strategy. However, Figure 9a could mislead the reader into believing that the LA method has converged to a pure strategy for both players. In order to clarify that we are not converging to an absorbing state for player A, we provide Figure 9b, which zooms in on Figure 9a around the region where the strategy of player A has converged, in order to visualize that its maximum first action probability is limited by $p_{\max}$, as per the design of our updating rule. Similarly, we zoom in on the evolution of the first action probability of player B in Figure 9c. We observe that the first action probability, instead of converging to zero as it would if we did not have artificial barriers, rather converges to a small probability limited by $p_{\min}$, which approaches zero. This property of evading the lock-in probability even for pure optimal strategies, which emanates from the ergodicity of our $L_{R-I}$ scheme with artificial barriers, is desirable especially when the payoff matrices are time-varying and the optimal equilibrium point might thus change over time. Such a case deserves a separate study to better understand the behavior of the scheme, the effect of the tuning parameters, and how to control and vary them to yield a compromise between learning and forgetting stale information.
Figure 9 depicts the time evolution of the probabilities for each player, with θ = 0.01, p max = 0.999 and for an ensemble with 1,000 runs.

C. Case 3: 2 Pure equilibria and 1 mixed
We now consider the last case, case 3.
As an instance of case 3, we consider the payoff matrices R and C given by Eq. (47) and Eq. (48). In Figure 10, we plot 9 trajectories of the LA over 1,000,000 iterations. We observe that, depending on the initial conditions, our LA converges to one of the two pure equilibria, usually the one closest to the starting point. We have also performed extensive simulations with initial probabilities (0.5, 0.5) and found that the LA converges to the Nash equilibrium close to (1, 1) almost 50% of the time and to the one close to (0, 0) the remaining 50% of the time. As future work, we would like to explore how to push the LA to favor one of the two equilibria, as one equilibrium is usually superior to the other and it is thus more desirable for both players to converge to the superior Nash equilibrium.

Figure 7: Evolution over time of X(t) for θ = 0.01 and p max = 0.99 for the R matrix given by Eq. (45) and the C matrix given by Eq. (46).
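The ensemble experiment started from (0.5, 0.5) can be sketched as follows. Everything specific here is an assumption for illustration: the symmetric coordination game in `R` and `C` is hypothetical (it is not the paper's pair of payoff matrices), entries are treated as reward probabilities, the update is an assumed reward-inaction step towards p max or p min = 1 − p max, and the run counts are kept small so that the sketch finishes quickly.

```python
import random

# Hypothetical symmetric coordination game: entries are reward probabilities.
R = [[0.8, 0.2], [0.2, 0.8]]  # player A
C = [[0.8, 0.2], [0.2, 0.8]]  # player B


def run_game(R, C, theta, p_max, steps, rng, p=0.5, q=0.5):
    """Simulate one run; p and q are the first-action probabilities of A and B."""
    p_min = 1.0 - p_max
    for _ in range(steps):
        a = 0 if rng.random() < p else 1
        b = 0 if rng.random() < q else 1
        # Assumed reward-inaction update: a player moves only when rewarded.
        if rng.random() < R[a][b]:
            p += theta * ((p_max if a == 0 else p_min) - p)
        if rng.random() < C[a][b]:
            q += theta * ((p_max if b == 0 else p_min) - q)
    return p, q


def basin_fractions(runs=100, steps=5000, theta=0.01, p_max=0.99, seed=0):
    """Fraction of runs ending on each side of the mixed point (0.5, 0.5)."""
    rng = random.Random(seed)
    near_11 = near_00 = 0
    for _ in range(runs):
        p, q = run_game(R, C, theta, p_max, steps, rng)
        if p > 0.5 and q > 0.5:
            near_11 += 1
        elif p < 0.5 and q < 0.5:
            near_00 += 1
    return near_11 / runs, near_00 / runs


if __name__ == "__main__":
    f11, f00 = basin_fractions()
    print(f"ended near (1, 1): {f11:.2f}, ended near (0, 0): {f00:.2f}")
```

Because the hypothetical game is symmetric, neither pure equilibrium is favored from the midpoint, so a roughly even split between the two would be the expected outcome of such a sketch.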
We plot the ODE corresponding to our LA for case 3 in Figure 11. We can see two attraction points, which approach the two pure Nash equilibria.
Although for p max < 1 our scheme is in theory ergodic and not absorbing, this is not the case in practice, as shown by the simulations reported in Table III. In fact, for θ = 0.0001 and p max greater than or equal to 0.995, we observe that the error is zero, meaning that the LA has already converged to an absorbing state. This lock-in probability phenomenon is due to the limited accuracy of the machine and the limitations of the random number generator. For a smaller θ = 0.00001, we expect the LA to approximate the ODE better. Indeed, this is the case: absorption no longer happens for p max = 0.995 and p max = 0.996 as in the previous case, but only for p max greater than or equal to 0.997. Solving the ODE for p max = 0.999 gives two solutions, namely (p*, q*) = (0.99699397, 0.99699397) and (0.00200603, 0.00200603), which approach (p opt, q opt) = (1, 1) and (p opt, q opt) = (0, 0) respectively.

Table III: Error for different values of θ and p max for the game specified by the R matrix and the C matrix given by Eq. (47) and Eq. (48).

D. Real-life Application Scenario
The application of game theory in cybersecurity is a promising research area attracting considerable attention [32], [33], [34]. Our LA-based solution is well suited for that purpose. In cybersecurity, algorithms that can converge to mixed equilibria are preferred over those that get locked into pure ones, since randomization reduces an attacker's ability to predict the strategy implemented by the defender. For example, consider a repeated two-person security game between a hacker and a network administrator. The hacker intends to disrupt the network by launching a Distributed Denial of Service (DDoS) attack whose magnitude can be classified as high or low. The administrator can use varying levels of security measures to protect the assets. We can assume that adopting a strong defense strategy costs the defender more than a weak one. Similarly, a high-magnitude attack costs the attacker more than a low-magnitude one. Another example of a security game is the jammer and transmitter game [35], where a jammer tries to guess the communication channel of the transmitter in order to interfere with and block the communication. The transmitter probabilistically chooses a channel to transmit over and the jammer probabilistically chooses a channel to attack. Clearly, converging to a pure strategy is desirable for neither the jammer nor the transmitter, as it would give a predictive advantage to the opponent.
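To make the jammer and transmitter example concrete, the sketch below computes the mixed equilibrium of a 2×2 bimatrix game from the standard indifference conditions. The two-channel payoffs are hypothetical and chosen only for illustration: the transmitter scores 1 when the channels differ and the jammer scores 1 when they coincide. The function name is likewise an assumption.

```python
def mixed_equilibrium_2x2(R, C):
    """Fully mixed equilibrium (p, q) of a 2x2 bimatrix game, or None.

    p is the probability that the row player uses its first action and
    q the probability that the column player uses its first action.
    Each player mixes so that the opponent is indifferent between actions.
    """
    denom_p = C[0][0] - C[0][1] - C[1][0] + C[1][1]
    denom_q = R[0][0] - R[0][1] - R[1][0] + R[1][1]
    if denom_p == 0 or denom_q == 0:
        return None
    p = (C[1][1] - C[1][0]) / denom_p
    q = (R[1][1] - R[0][1]) / denom_q
    if not (0 < p < 1 and 0 < q < 1):
        return None  # no interior mixed equilibrium
    return p, q


# Hypothetical two-channel game: rows = transmitter, columns = jammer.
R_tx = [[0, 1], [1, 0]]   # transmitter is paid when the channels differ
C_jam = [[1, 0], [0, 1]]  # jammer is paid when the channels coincide

if __name__ == "__main__":
    print(mixed_equilibrium_2x2(R_tx, C_jam))  # (0.5, 0.5)
```

In this hypothetical game neither player has a pure equilibrium, and uniform randomization over the two channels is the only equilibrium, which is the situation in which a scheme that can only lock into a pure strategy is at a disadvantage.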

VI. CONCLUSION
Learning Automata (LA) with artificially absorbing barriers were first introduced in the 1980s [1] and have recently been adopted in Estimator LA [36], [37], [38]. In this paper, we instead propose a LA with artificial non-absorbing barriers that is able to solve game-theoretical problems. The scheme converges to the game's Nash equilibrium under limited information and has clear advantages over the well-known LA solution for game-theoretical problems due to Sastry et al. [2]. In fact, the latter family of solutions [2], which has found a huge number of applications, can only converge to pure strategies and fails to converge to an optimal mixed equilibrium. This is a clear disadvantage, especially in cases where no Saddle Point exists for a pure strategy and thus the only Nash equilibrium of the game is a mixed one. Our scheme is ergodic and illustrates a design by which an inherently absorbing scheme, in our case Linear Reward-Inaction (L R−I), is rendered ergodic. Interestingly, while being able to solve the mixed Nash equilibrium case, our scheme maintains the desirable properties of the original L R−I, as it converges to a near-optimal point close to the pure strategies in the probability simplex whenever a Saddle Point exists for pure strategies. Furthermore, we also provide a general S-type learning scheme for handling feedback that is continuous and not necessarily binary. As future work, we would like to extend our scheme to Stackelberg games, which are often employed in security and which assume that the defender deploys a mixed strategy that can be fully observed by the attacker, who then replies optimally to it. The extension would be interesting but is far from obvious.

APPENDIX B EXPERIMENTAL RESULTS FOR S-TYPE ENVIRONMENTS
In this section, we present the results of the experiments for the S-type learning game. We conducted several simulations similar to those presented in Section V. The same instances of the payoff matrices R and C were used, covering the cases referred to in Section III.
For all the experiments conducted for the S-LA, 9 trajectories were plotted for 2,000,000 iterations, with p max = 0.99 and θ = 0.0001. A general observation from these experiments is that the S-LA converges more slowly than the L R−I. Therefore, we doubled the number of iterations to allow the S-LA to converge.

A. Case 1: Only one mixed Nash equilibrium exists
Figure 12 depicts the situation where the only Nash equilibrium that exists is a mixed one. We can easily observe that the S-LA approaches the trajectories of the ODE given in Figure 5. Please note that the ODE is the same regardless of the LA type, whether it is L R−I or S-LA.
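One way to see why a single ODE can govern both schemes is to compare their expected one-step increments. The derivation below is a sketch under assumptions not stated in this section: the L R−I update moves p towards p max by a fraction θ when the first action is rewarded (with probability d), and the S-type update scales that same step by a continuous feedback β ∈ [0, 1] whose mean is d.

```latex
% Assumed updates when the first action is chosen:
%   L_{R-I}: p <- p + \theta (p_{\max} - p) with probability d, unchanged otherwise
%   S-type : p <- p + \theta \beta (p_{\max} - p), with \beta \in [0,1] and E[\beta] = d
\begin{align*}
E[\Delta p \mid \text{first action},\ L_{R-I}]
  &= d\,\theta\,(p_{\max} - p),\\
E[\Delta p \mid \text{first action},\ S\text{-type}]
  &= E[\beta]\,\theta\,(p_{\max} - p) = d\,\theta\,(p_{\max} - p).
\end{align*}
```

Under these assumptions the mean drift of the two schemes coincides, so they would share the same limiting ODE as θ goes to zero; only the fluctuations around that ODE differ.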

B. Case 2: One Pure equilibrium
We also examined the case where the game has a single pure equilibrium. The exhibited behavior is comparable to that reported in Section V. The trajectory of the LA depicted in Figure 13 tightly follows the trajectories of the ODE depicted in Figure 6. As θ goes to zero, the trajectories of the LA and those of the ODE become indistinguishable [2].

C. Case 3: Two Pure equilibria and one mixed
Figure 14 shows the situation where there are two pure equilibria and one mixed.
We observe that the LA converges to whichever of the two pure equilibria is closest to the starting point. The S-LA behaves much like the L R−I LA shown in Figure 10. We also observe that the S-LA converged to the Nash equilibrium close to (1, 1) and to the one close to (0, 0) approximately 50% of the time each.

Figure 1: LA interacting with the environment.

Figure 6: Trajectory of the deterministic ODE using p max = 0.99 for case 2.
Figure 9: (a) Evolution over time of X(t) for θ = 0.01 and p max = 0.999 when applied to the game with payoffs specified by the R matrix and the C matrix given by Eq. (45) and Eq. (46); (b) zoomed version around the strategy of player A; (c) zoomed version around the strategy of player B.

Figure 10: Nine trajectories of the LA with barriers starting from random initial points, with p max = 0.99 and θ = 0.0001.