OPTIMAL ADMISSION AND ROUTING WITH CONGESTION-SENSITIVE CUSTOMER CLASSES

This paper considers optimal admission and routing control in multi-class service systems in which customers can either receive quality regular service, which is subject to congestion, or congestion-free but less desirable service at an alternative station, which we call the self-service station. We formulate the problem within the Markov decision process framework and focus on characterizing the structure of dynamic optimal policies which maximize the expected long-run rewards. For this, value function and sample path arguments are used. The congestion sensitivity of customers is modeled with class-independent holding costs at the regular service station. The results show how the admission rewards of customer classes affect their priorities at the regular and self-service stations. We show that the priority for regular service may depend not only on the regular service admission rewards of classes but also on the difference between regular and self-service admission rewards. We show that optimal policies have monotonicity properties regarding the optimal decisions for individual customer classes, such that they divide the state space into three connected regions per class.


INTRODUCTION
This paper focuses on the challenge service providers face in coping with dynamic and varying customer demands with limited resources. Service providers have to find effective control mechanisms to manage their revenues, or customer satisfaction, to make the best use of their service capacities. Most commonly, these service systems are modeled in the literature as multi-class queueing systems, and admission and/or dynamic routing controls are studied within the queueing formulations to help manage the dynamic demands of distinct customer classes.
Admission control of multi-class queueing systems for revenue management has been studied intensively. Miller [21] proved the optimality of the trunk reservation policy for a multi-class loss queueing system in which the rewards that customer classes pay for being admitted to the system are orderable and the service rates of customers do not depend on their classes. In this policy, acceptance decisions on individual customer classes have threshold structures with respect to the number of customers in the system; if customer class i is accepted when there are n customers in the system, then class i should also be accepted when the system is less crowded. Moreover, the so-called trunks reserved per customer class depend on the admission rewards; if customer class i offers reward r_i which is greater than r_j, the admission reward that class j offers, then at any congestion level at which class j is accepted, class i must also be admitted to the system. After Miller's result, the optimality of trunk reservation policies has been shown in various other multi-class single-station queueing models [8,9,13,14]. Feinberg and Yang [9] focused on the trunk reservation optimality for an M/M/c/N queue. Later, Fan-Orzechowski and Feinberg [8] extended this by proving the optimality of randomized trunk reservation policies for an M/M/c/N model with constraints on the blocking probabilities of customer classes. The admission control studies which consider class-dependent service rates have focused on providing heuristic policies. The most common approaches of these studies include Linear Programming (LP) techniques [18] and asymptotic analyses [13].
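The threshold structure of a trunk reservation policy can be illustrated with a small sketch. The following is our illustration, not from the paper; the capacity and threshold values are hypothetical, chosen only to show the ordering of thresholds by admission reward.

```python
# Illustrative sketch (ours, not from the paper): the threshold structure of a
# trunk reservation policy for a single-station loss system.

def accept(n, thresholds, cls):
    """Accept class `cls` when `n` customers are present iff n < thresholds[cls]."""
    return n < thresholds[cls]

# Hypothetical 3-class example with capacity C = 10 and rewards r_0 > r_1 > r_2,
# so the thresholds are ordered C >= t_0 >= t_1 >= t_2 (more trunks are
# reserved against lower-paying classes).
thresholds = [10, 7, 4]
print(accept(3, thresholds, 2))   # lightly loaded: every class is accepted
print(accept(5, thresholds, 2))   # the lowest-paying class is blocked first
print(accept(5, thresholds, 0))   # the highest-paying class is still accepted
```

Note that the policy is monotone in n for each class, which is exactly the threshold property described above.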
In many service systems, the waiting times of customers are important service quality indicators, which can affect the revenues, or customer satisfaction levels, that service systems can achieve. For instance, waiting times can affect customer behavior in choosing among different service providers, and thus, for a specific service provider, waiting times can be a determinant of the demand intensity of their system. The effects of system congestion on customer behavior in a queueing system were first considered by Naor [22] for an M/M/1 queue. Knudsen [15] and Stidham [27] extended Naor's results to M/M/k and GI/M/1 queues, respectively. For brevity of discussion, these studies do not consider customers of different types; their focus is on explaining how the potential of a system to collect admission rewards is lost when customers behave greedily based on congestion levels.
Koçaǧa and Ward [16] considered congestion-related costs through the abandonment of customers in their single-class multi-server model for controlling the admission decisions of arriving customers. Atar and Lev-Ari [2] studied admission control in a single-server model with retrials. In their model, holding costs are used to incorporate the congestion sensitivity of customers. We note that congestion sensitivity of customers is a recognized issue in call centers. Ata and Peng [1] studied the callback option to mitigate congestion in call centers. In their study, arriving customers are routed to an offline queue to be called back later when they accept the callback offer; otherwise, they are routed to the online queue, in which customers incur congestion-related waiting time costs. Feinberg and Yang [10] considered congestion effects through class-dependent holding costs in the admission control problem of a multi-class queueing model. Inspired by Miller [21], Feinberg and Yang [10] used a continuous-time Markov decision process formulation and relative bias functions in their policy iteration algorithm for obtaining the optimal policy of an M/M/c/N queue with class- and congestion-dependent admission rewards. The authors focused on extending trunk reservation properties to their model. Similarly, we also consider a service system in which the admission rewards of customer classes depend on the congestion levels they experience in the system. However, differently, we assume that congestion-sensitive customers have an alternative service station in which they can commence their services immediately without waiting. In this paper, the service system has two stations which work in parallel. Our model is inspired by systems in which customers can either get quality regular service at a station staffed with professionals, say the regular station, or can serve themselves at the self-service station, in an unsupervised fashion.
There are several examples of such service systems in practice. For instance, in many service stations, thanks to the availability of online tools, self-help desks (or self-check out points) are available for customers. For a more concrete example, let us consider the education sector. These days, online learning is assisting the teacher-led instruction in many schools with flexible educational models such as personalized learning. In these schools, learners may opt for learning by themselves, in a self-service fashion, instead of waiting for teacher assistance.
In these types of service systems, the regular service station is most likely costlier to operate than the self-service station, so the system is likely to have a lower capacity at the regular station than at the self-service station. One can imagine that in such systems, customers would prefer the regular station for getting quality professional-assisted service. However, since the preference for the regular station will be prevalent among many customers, and since the capacity of the regular station would not be high due to costs, congestion is likely to become an issue at the regular station. As a result, the preference of customers for regular service might decrease in favor of self-service as the regular service station gets crowded.
For incorporating the congestion effects at the regular station, our model includes a finite buffer for this station, and holding costs are incurred for the customers present at the station. With the holding costs, it is assumed that both the time spent waiting for service in the buffer and the time spent in service degrade the rewards obtained from customer admissions. Considering the nature of self-service, in which each self-server can only serve a single customer at a time, the self-service station is modeled as a loss network. In some service systems, there might also be a waiting room for using self-servers (e.g., self-checkouts in supermarkets). In this study, we assume that waiting for service is more relevant for the regular service station and thus only consider a waiting room for this station. With the introduction of the self-service station, the routing decisions can also be controlled. In this paper, for every customer class we dynamically decide, as a function of the state of the system, whether to admit an arriving customer of the class and, if so, to which of the stations to route the customer, so as to maximize long-run expected rewards. The problem of admission and routing control in such a service system is illustrated in Figure 1.
Dynamic routing control has also been studied intensively. Compared with the dynamic admission control literature, congestion-related costs such as holding costs are much more commonly considered in the control of routing arriving customers. In some cases, admission and routing controls are considered simultaneously. Bertsimas and Chryssikou [5] provided approximate LPs to extract heuristic admission and routing policies. Chong et al. [6] studied both routing and admission control for a two-class two-station system with class priorities. For systems with many service stations and/or customer classes, the use of asymptotic analysis is common in the literature to provide efficient routing policies. Bassamboo et al. [4] studied optimal admission and routing control in the stochastic fluid approximation for a multi-class service system with multiple service pools. Dai and Tezcan [7] presented an asymptotically optimal routing policy under a heavy traffic regime for a parallel server system. Atar et al. [3] studied the routing control problem in the diffusion models of multi-class many-server queueing systems. Ward and Armony [28] considered routing control under the heavy traffic regime for multi-class systems with heterogeneous servers. For single-class systems with parallel servers and holding costs, there are many results on optimal dynamic routing policies in the literature. For these systems, the route-to-least-workload policy is commonly explored [11,12,29].
Figure 1: Illustrating the admission and routing control problem in the service system with the regular and self-service stations, for customer classes 1, 2, . . . , m with arrival rates λ_1, λ_2, . . . , λ_m.

The slow server problem, which considers a supporting slower server that can be switched on or off depending on the number of customers waiting in the common buffer of the system, is also relevant to the problem described in this paper, if we consider that self-servers are slower than regular ones [19,23]. If we only have a single customer class with no rejection option, then the routing decision to the slower self-servers can be seen as the switching-on decision of the slow server problem. Lin and Kumar [19] considered the threshold-type policy which switches the slow server on when the queue length exceeds a threshold. Nobel and Tijms [23] additionally considered setup costs involved in switching the slow server on and off and focused on a two-level hysteretic switching rule, which switches the slow server on if the queue length exceeds an upper threshold and switches it off if the queue length drops below a lower threshold. There are also studies on the slow server problem which considered the waiting option for getting service at the fast server. Rubinovitch [25] considered a model with a slow and a fast server, with a buffer in front of the fast server, letting customers greedily choose between immediate service at the slow server and queueing up for the busy fast server.
We formulate the admission and routing control problem in this system as a Markov decision process and focus on characterizing the properties of the optimal policy. We make use of a discrete-time Markov decision process formulation (following [20,26]) to investigate the structure of optimal policies. For this, value function and sample path arguments are used intensively. The paper investigates inter-class and intra-class properties of the optimal stationary policies. In the inter-class properties, the focus is on the priorities of customer classes for receiving service at both stations. In the intra-class properties, we focus on characterizing monotonicity properties of the optimal policies with respect to the number of customers present at stations, or say the remaining capacity levels at stations.
The remainder of the paper is organized as follows. Section 2 presents the Markov decision process formulation. Section 3 presents the properties of optimal policies. Lastly, we conclude the paper with Section 4.

MODEL FORMULATION
We consider a set of customer classes i ∈ I = {1, . . . , m} with distinct characteristics arriving at a service system with two parallel service stations, which we refer to as the regular and self-service stations. Each class i ∈ I arrives according to a Poisson process with rate λ_i. Arriving customers can be dynamically routed to one of the two service stations, or can be rejected. It is assumed that all service times are exponentially distributed and independent of customer classes. Let μ_R and μ_S denote the service rates, and c_R and c_S the numbers of servers, at the regular and self-service stations, respectively. The self-service station is considered as an M/M/c_S/c_S loss network, and the regular station is considered to have a buffer with capacity B = N − c_R throughout the paper. Let n_R ∈ S_1 = {0, 1, . . . , N} and n_S ∈ S_2 = {0, 1, . . . , c_S} denote the states, the numbers of customers present, at the regular and self-service stations, respectively. We consider that the non-negative reward R^R_i that customer class i pays for regular service is not less than the non-negative reward R^S_i that the class would pay for self-service. It is assumed that customers pay the admission rewards upon entry. Let μ_R(n_R) and μ_S(n_S) denote the effective service rates when there are n_R and n_S customers present at the regular and self-service stations, respectively. Although regular service is preferred by customers, it might involve holding costs. Let H(n_R) denote the holding cost rate that the system incurs for having n_R customers at the regular station. In this study, we restrict attention to linear holding costs; we let h denote the unit-time holding cost rate per customer at the regular station, so H(n_R) = hn_R.
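The effective service rates are not written out above; under the usual multi-server dynamics (each busy regular server works on one customer, and each self-server serves exactly one customer), they would presumably take the form

```latex
\mu_R(n_R) = \mu_R \min(n_R, c_R), \qquad
\mu_S(n_S) = \mu_S\, n_S, \qquad
n_R \in S_1,\ n_S \in S_2 .
```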
We can consider S = S_1 × S_2 as the state space of the system, which records the number of customers present at both stations. On this state space, we formulate a Markov decision process. Let us focus on stationary policies and let Π denote the set of such policies. Inter-event times are exponentially distributed and their rates are always bounded above by Λ := Σ_i λ_i + μ_R(N) + μ_S(c_S). Thus, we can apply uniformization (see [20,26]) and consider an equivalent decision process in discrete time. We let, without loss of generality, the uniformization constant Λ be equal to 1. Let v^π_{T,α}(s) denote the finite T-horizon α-discounted total expected reward under policy π ∈ Π for the process starting at state s ∈ S. We define

v^π_{T,α}(s) := E[ Σ_{t=1}^{T} α^{t−1} r(S^π_t, A^π_t) | S^π_1 = s ],

where S^π_t denotes the state at the beginning of the tth period, A^π_t denotes the action picked by policy π for it, and r(S^π_t, A^π_t) denotes the corresponding net reward collected at state S^π_t (the sum of admission rewards resulting from the action picked minus the holding costs incurred during the tth period). v^π_{T,α}(s) is well-defined for each initial state s and T, as the reward obtained at any state is bounded above by Σ_i λ_i R^R_i. We let v_{T,α}(s) = sup_{π∈Π} v^π_{T,α}(s) denote the optimal finite T-horizon α-discounted total expected reward obtained from the process which starts at state s. We next define the infinite-horizon α-discounted total expected reward under policy π ∈ Π, for the process starting at state s,

v^π_α(s) := lim_{T→∞} v^π_{T,α}(s),

and

v^π(s) := lim_{T→∞} (1/T) v^π_{T,1}(s)

for its long-run average reward counterpart. Note that for any policy π, the long-run average reward v^π(s) is independent of the initial state s, as the Markov chain induced by any stationary policy π on our finite state space S is unichain.
Next, we define the optimal expected rewards; let v_α(s) = sup_{π∈Π} v^π_α(s) denote the α-discounted total expected reward obtained by an optimal policy for the process which starts at state s, and let v = sup_{π∈Π} v^π(s) denote the optimal long-run average reward.
By Theorem 6.2.6 of Puterman [24], we write the optimality equations v_α(s) = T_α v_α(s) for s = (n_R, n_S), where

T_α v(n_R, n_S) = −H(n_R) + Σ_{i∈I} λ_i R^i_α v(n_R, n_S) + μ_R(n_R) αv(n_R − 1, n_S) + μ_S(n_S) αv(n_R, n_S − 1) + (1 − Σ_{i∈I} λ_i − μ_R(n_R) − μ_S(n_S)) αv(n_R, n_S).

Here, we let

R^i_α v(n_R, n_S) = max{ R^R_i + αv(n_R + 1, n_S), R^S_i + αv(n_R, n_S + 1), αv(n_R, n_S) },

where the first and second choices are only available when n_R < N and n_S < c_S, respectively. The three choices in R^i_α v_α(n_R, n_S), for each customer class i at state s, correspond to routing the class to the regular and self-service stations and to its rejection, respectively. By letting v_{0,α} = 0, we write the optimality equations for the finite-horizon counterpart as v_{T+1,α}(s) = T_α v_{T,α}(s).
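The finite-horizon recursion v_{T+1,α} = T_α v_{T,α} lends itself directly to value iteration. The following is a minimal numerical sketch of our own (not the paper's code); all parameter values are hypothetical and are pre-scaled so that Σ_i λ_i + μ_R(N) + μ_S(c_S) ≤ 1, i.e., the uniformization constant can be taken as 1.

```python
import numpy as np

# Hypothetical, pre-scaled parameters (ours): two classes, small capacities.
lam = np.array([0.10, 0.15])        # arrival rates lambda_i
RR  = np.array([10.0, 6.0])         # regular-service rewards R^R_i
RS  = np.array([4.0, 3.0])          # self-service rewards R^S_i (R^S_i <= R^R_i)
muR, muS, cR, cS, N, h, alpha = 0.15, 0.10, 2, 3, 4, 0.5, 0.95

def T(v):
    """One application of the uniformized operator T_alpha on S_1 x S_2."""
    Tv = np.empty_like(v)
    for nR in range(N + 1):
        for nS in range(cS + 1):
            mR = muR * min(nR, cR)          # effective rate mu_R(n_R)
            mS = muS * nS                   # effective rate mu_S(n_S)
            val = -h * nR                   # holding cost for the period
            for i in range(len(lam)):
                choices = [alpha * v[nR, nS]]                      # reject
                if nR < N:
                    choices.append(RR[i] + alpha * v[nR + 1, nS])  # regular
                if nS < cS:
                    choices.append(RS[i] + alpha * v[nR, nS + 1])  # self-service
                val += lam[i] * max(choices)
            val += mR * alpha * v[max(nR - 1, 0), nS]
            val += mS * alpha * v[nR, max(nS - 1, 0)]
            val += (1.0 - lam.sum() - mR - mS) * alpha * v[nR, nS]
            Tv[nR, nS] = val
    return Tv

v = np.zeros((N + 1, cS + 1))       # v_{0,alpha} = 0
for _ in range(1500):               # v_{T+1,alpha} = T_alpha v_{T,alpha}
    v = T(v)
```

On the computed approximation of v_α, one can check, for instance, that v is componentwise nonincreasing in n_R and in n_S, in line with the results of Section 3.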
For the long-run average reward case, we let y(s) denote the relative value function such that

y(s) := lim_{α→1} [v_α(s) − v_α(0, 0)],

and then, we write the optimality equation

v + y(s) = T_1 y(s), for all s ∈ S. (8)
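A standard way to solve the average-reward optimality equation numerically is relative value iteration, i.e., iterating the operator at α = 1 and renormalizing so that the value at a reference state (here (0, 0)) stays at zero. The sketch below is again our own illustration with the same hypothetical, pre-scaled parameters.

```python
import numpy as np

# Hypothetical, pre-scaled parameters (ours), as in the discounted sketch.
lam = np.array([0.10, 0.15])
RR  = np.array([10.0, 6.0])
RS  = np.array([4.0, 3.0])
muR, muS, cR, cS, N, h = 0.15, 0.10, 2, 3, 4, 0.5

def T1(y):
    """The uniformized operator at alpha = 1."""
    Ty = np.empty_like(y)
    for nR in range(N + 1):
        for nS in range(cS + 1):
            mR, mS = muR * min(nR, cR), muS * nS
            val = -h * nR
            for i in range(len(lam)):
                choices = [y[nR, nS]]
                if nR < N:
                    choices.append(RR[i] + y[nR + 1, nS])
                if nS < cS:
                    choices.append(RS[i] + y[nR, nS + 1])
                val += lam[i] * max(choices)
            val += mR * y[max(nR - 1, 0), nS]
            val += mS * y[nR, max(nS - 1, 0)]
            val += (1.0 - lam.sum() - mR - mS) * y[nR, nS]
            Ty[nR, nS] = val
    return Ty

y = np.zeros((N + 1, cS + 1))
for _ in range(5000):
    Ty = T1(y)
    g = Ty[0, 0]        # running estimate of the optimal average reward v
    y = Ty - g          # normalization keeping y(0,0) = 0
```

Convergence relies on the chain being unichain and aperiodic, which holds here since every state retains a positive self-transition probability after uniformization.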

CHARACTERIZATION OF THE OPTIMAL POLICY
We characterize the properties of the optimal policy in this section.
In some single-station multi-class networks with static admission rewards and no holding costs, in which customer rewards can be strictly ordered, trunk reservation policies are optimal (see [21]). In these policies, acceptance decisions of customer classes are directly related to their admission rewards. On the contrary, in our system with two stations, the optimal acceptance and routing decisions of different classes are affected not only by the regular station admission rewards R^R_i, but also by the involved holding costs H(n_R) and by the self-service admission rewards R^S_i.

Basic Properties
Let us begin by exploring some basic properties of the value functions.
Lemma 3.1: For any α ∈ [0, 1), we have that (i) v_α(n_R, n_S) ≥ v_α(n_R + 1, n_S) for n_R < N, and (ii) v_α(n_R, n_S) ≥ v_α(n_R, n_S + 1) for n_S < c_S. Similarly, for the long-run average case, if v and y are the solutions of the optimality equation (8), then the above statements will hold with v_α replaced by y.
Proof: We use sample path arguments to show these results. Let us consider property (i) for v_α only, for a specific α ∈ [0, 1). For y, or for property (ii), a similar proof will follow. Consider a process which starts at state (n_R + 1, n_S) and follows an optimal policy π*; let us call this Process 1. On the other hand, consider another process which starts at (n_R, n_S) and uses a (potentially) suboptimal policy π which imitates π*; let us call this Process 2. We suppose that these two processes are defined on the same probability space and thus move in parallel, observing the same arrival and service completion transitions simultaneously whenever possible. However, some events cannot be observed in both processes. Firstly, Process 1 can observe a service completion which Process 2 cannot. Secondly, when Process 1 reaches a state in which there is no capacity left at either station, the follower Process 2 would be at a state where there is still one more spot at the regular station. So, in this situation, if an arrival occurs, Process 1 has to reject this arrival, although Process 2 can admit the arrival to the regular station. We call these events coupling events, in that after their occurrences, both processes transition into the same state and behave identically thereafter.
Let δ be a random variable denoting the difference in the (net) reward obtained by Process 2 from that of Process 1, until one of the coupling events occurs.
The admission events occurring before coupling provide the same rewards to both processes (R^R_i and R^S_i for each class i). We also need to consider the holding costs for the congestion at the regular station. Since, until coupling occurs, Process 1 always experiences a regular station which is at least as crowded as the one in Process 2, the net rewards collected by Process 2 are at least as large as the net rewards obtained by Process 1 on any sample path. This implies that δ ≥ 0 pathwise, and thus E(δ) ≥ 0.

Lemma 3.1 tells us that the service system benefits from more idle servers, or spots, at both of the stations, under the α-discounted total expected or long-run average reward optimality criteria. This is already intuitive.
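The last step of the argument, left implicit above, can be spelled out. Interpreting δ as the difference in discounted net rewards, the imitation construction gives

```latex
v_\alpha(n_R, n_S) \;\ge\; v^{\pi}_\alpha(n_R, n_S)
\;=\; v^{\pi^*}_\alpha(n_R+1, n_S) + \mathbb{E}(\delta)
\;\ge\; v_\alpha(n_R+1, n_S),
```

where the first inequality holds because π is (potentially) suboptimal at (n_R, n_S), and the last because E(δ) ≥ 0 and π* is optimal at (n_R + 1, n_S).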
We can additionally show that having idle servers or spots at the regular station is more beneficial than having them at the self-service station. This is possible because R^R_i ≥ R^S_i for every class i.

Lemma 3.2: For any α ∈ [0, 1), we have that v_α(n_R, n_S + 1) ≥ v_α(n_R + 1, n_S) for n_R < N and n_S < c_S. Similarly, for the long-run average case, if v and y are the solutions of the optimality equation (8), then the above statement will hold with v_α replaced by y.
Proof: We again use sample path arguments. Let us consider this for v_α, for a specific α ∈ [0, 1). For y, a similar proof will follow. Consider a process which starts at state (n_R + 1, n_S) and follows an optimal policy π*; let us call this Process 1. On the other hand, consider another process which starts at (n_R, n_S + 1) and uses a (potentially) suboptimal policy π imitating π*; let us call this Process 2. We again assume that both processes are defined on the same probability space; they observe the same arrival and service completion events whenever possible. Let δ be a random variable denoting the difference in the (net) reward obtained by Process 2 from that of Process 1, until a coupling event occurs. Events occurring before coupling provide the same admission rewards to both processes. Also note that, until coupling occurs, Process 1 always experiences a regular station which is at least as crowded as the one in Process 2, thus incurring larger holding costs. The first possible coupling event consists of a service completion at the regular station for Process 2 together with a service completion at the self-service station for Process 1. In this event, no admission rewards are realized in either process. The second possible coupling event can occur when Process 1 is at a state in which there are no spots left to admit a customer to the regular station while Process 2 still has one spot left. In this situation, Process 1 can admit an arriving customer to the self-service station, but Process 2 can obtain a larger reward by admitting the customer to the regular station instead, as R^R_i ≥ R^S_i for each class i, and couple with Process 1 as a result. Thirdly, we can consider a situation in which Process 2 can no longer admit customers to the self-service station, but Process 1 still can, with its single remaining idle server.
If π* chooses to admit an arriving customer to the self-service station in such a situation, then Process 2 can admit the same customer to the regular station and obtain a larger admission reward. So, whichever coupling event occurs first, we have that δ ≥ 0 pathwise. Thus, E(δ) ≥ 0, which gives v_α(n_R, n_S + 1) ≥ v_α(n_R + 1, n_S).

Inter-Class Properties
Firstly, from the optimality equations (4)-(8), we can directly infer the following result.

Proposition 3.3:
In an α-discounted (α ∈ [0, 1)) total expected reward optimal policy (finite- or infinite-horizon formulations), or in a long-run average reward optimal policy, if class i is routed to the regular (self-service) station at state (n_R, n_S), then any class j with R^R_j ≥ R^R_i and R^R_j − R^S_j ≥ R^R_i − R^S_i (R^S_j ≥ R^S_i and R^S_j − R^R_j ≥ R^S_i − R^R_i) is also routed to the regular service (self-service) station at state (n_R, n_S) in the optimal policy.

Highest priority for service
In multi-class single-station systems with state-independent rewards, we can usually talk about a customer class with the highest priority to receive service. For instance, if we only have the regular station with h = 0, then, by the result of Miller [21], any class j with R^R_j = max_i R^R_i is among the customer classes with the highest service priority (for receiving service whenever there is capacity at the service station).
For the case that h = 0 in our network with two stations, we have the following value function properties, which indicate that the priority for regular service is related both to the regular service rewards R^R_i and to the difference between regular and self-service admission rewards.
Lemma 3.4: Let h = 0. For any α ∈ [0, 1), we have that v_α(n_R, n_S) − v_α(n_R + 1, n_S) ≤ max_i R^R_i for n_R < N. Similarly, for the long-run average case, if v and y are the solutions of the optimality equation (8), then the above statement will hold with v_α replaced by y.
Proof: Let us show this for v α , for a specific α ∈ [0, 1), with sample path arguments. Consider a process which starts at state (n R , n S ) and follows an optimal policy π * , let us call this Process 1. On the other hand, consider another process which starts at (n R + 1, n S ) and follows a (potentially) suboptimal policy π by imitating π * , let us call this Process 2. We again make the assumption that both processes are defined on the same probability space.
Events occurring before coupling provide the same admission rewards to both processes. Firstly, Process 2 can observe a service completion which Process 1 cannot. After this event, both processes couple, without changing the rewards obtained in either of them. Secondly, when Process 2 reaches a state in which there is no capacity left at either station, Process 1 would be at a state where there is still one more spot at the regular station. When an arrival from class i occurs in this situation, Process 1 can obtain R^R_i while Process 2 obtains no reward, before they couple. This disadvantage of Process 2 can be at most max_i R^R_i. That is why, in any of these coupling cases, the (net) reward obtained by Process 2 is at most max_i R^R_i less than that of Process 1 pathwise; then v_α(n_R, n_S) − v_α(n_R + 1, n_S) ≤ max_i R^R_i, which is sufficient to confirm the lemma.
Lemma 3.5: Let h = 0. For any α ∈ [0, 1), we have that v_α(n_R, n_S + 1) − v_α(n_R + 1, n_S) ≤ max_i (R^R_i − R^S_i) for n_R < N and n_S < c_S. Similarly, for the long-run average case, if v and y are the solutions of the optimality equation (8), then the above statement will hold with v_α replaced by y.
Proof: Let us show this for v α , for a specific α ∈ [0, 1), by using sample path arguments. Consider a process which starts at state (n R , n S + 1) and follows an optimal policy π * , let us call this Process 1. On the other hand, consider another process which starts at (n R + 1, n S ) and follows a (potentially) suboptimal policy π by imitating π * , let us call this Process 2. We assume that both processes are constructed in the same probability space.
Again, events occurring before coupling provide the same admission rewards to both processes. The first possible coupling event consists of a service completion at the regular station for Process 2 together with a service completion at the self-service station for Process 1. In this event, no admission rewards are realized in either process. The second possible coupling event can occur when Process 2 is at a state in which there are no spots left to admit a customer to the regular station while Process 1 still has one spot left. At this event, Process 2 can admit the arriving customer to the self-service station, but Process 1 can obtain a larger reward by admitting the customer to the regular station. In this situation, considering arrivals from all possible customer classes, in any arrival coupling event, the advantage of the reward obtained by Process 1 over Process 2 can be at most max_i (R^R_i − R^S_i). Thirdly, we can consider a situation in which Process 1 can no longer admit customers to the self-service station, but Process 2 still has one free server at the station. In this case, when π* admits an arriving customer to the regular service, we let Process 2 admit the customer to the self-service station for coupling. So, in any of these coupling cases, the (net) reward obtained by Process 2 is at most max_i (R^R_i − R^S_i) less than that of Process 1 pathwise; then we also have v_α(n_R, n_S + 1) − v_α(n_R + 1, n_S) ≤ max_i (R^R_i − R^S_i), which is sufficient to confirm the lemma.
With these two lemmas, we can show the following.
Proposition 3.6: When the regular station is free of holding costs (h = 0), if for any customer class j we have that R R j = max i R R i and R S j = min i R S i , then this class is routed to the regular station at any state s = (n R , n S ) whenever n R < N, in an α-discounted (α ∈ [0, 1)) total expected or long-run average reward optimal policy.
Proof: We show this for α-discounted (α ∈ [0, 1)) total expected reward optimality. Class j is preferably routed to the regular station, rather than rejected, at any state s = (n_R, n_S), n_R < N, when R^R_j + αv_α(n_R + 1, n_S) > αv_α(n_R, n_S). By Lemma 3.4, this holds for class j always. Moreover, this class is preferably routed to the regular station, rather than to the self-service station, at any state s = (n_R, n_S), n_R < N, n_S < c_S, when R^R_j + αv_α(n_R + 1, n_S) > R^S_j + αv_α(n_R, n_S + 1). By Lemma 3.5, this holds for class j always.
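Written out, the two comparisons in the proof reduce to reward-gap bounds (assuming max_i R^R_i > 0 so that the inequalities are strict):

```latex
R_j^R + \alpha v_\alpha(n_R+1, n_S) > \alpha v_\alpha(n_R, n_S)
\iff R_j^R > \alpha \big[ v_\alpha(n_R, n_S) - v_\alpha(n_R+1, n_S) \big],
```

where the bracket is at most max_i R^R_i = R^R_j by Lemma 3.4, so the right-hand side is at most αR^R_j < R^R_j; and

```latex
R_j^R - R_j^S > \alpha \big[ v_\alpha(n_R, n_S+1) - v_\alpha(n_R+1, n_S) \big],
```

which holds by Lemma 3.5, since R^R_j − R^S_j = max_i (R^R_i − R^S_i) under the assumptions of Proposition 3.6 and α < 1.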
For the self-service station, which is modeled as a loss network and in which there are no holding costs involved, we have the following value function property.
Lemma 3.7: For any α ∈ [0, 1), we have that v_α(n_R, n_S) − v_α(n_R, n_S + 1) ≤ max_i R^S_i for n_S < c_S. Similarly, for the long-run average case, if v and y are the solutions of the optimality equation (8), then the above statement will hold with v_α replaced by y.
Proof: We again use sample path arguments. Let us show this for v α , for a specific α ∈ [0, 1). Consider a process which starts at state (n R , n S ) and follows an optimal policy π * , let us call this Process 1. On the other hand, consider another process which starts in (n R , n S + 1) and uses a (potentially) suboptimal policy π which imitates π * , let us call this Process 2. We again make the assumption that both processes are constructed in the same probability space. Again events occurring before coupling provide the same admission rewards to both processes. Also, before coupling, both processes incur the same holding costs.
Firstly, Process 2 can observe a service completion which Process 1 cannot. After this event, both processes couple, without changing the rewards obtained in either of them. Secondly, when Process 2 reaches a state in which there is no capacity left at either station, Process 1 would be at a state where there is still one more spot at the self-service station. When an arrival from class i occurs in this situation, Process 1 can obtain R^S_i more reward than Process 2, before they couple. This disadvantage of Process 2 can be at most max_i R^S_i. That is why, in any of these coupling cases, the (net) reward obtained by Process 2 is at most max_i R^S_i less than that of Process 1 pathwise; then v_α(n_R, n_S) − v_α(n_R, n_S + 1) ≤ max_i R^S_i, which is sufficient to confirm the lemma.
Proposition 3.8: Under an α-discounted (α ∈ [0, 1)) total expected or long-run average reward optimal policy, at any state s = (n R , n S ) with n S < c S , any class j with R S j = max i R S i is not rejected.
Proof: We show this for α-discounted (α ∈ [0, 1)) total expected reward optimality. Class j is preferably routed to the self-service station, rather than rejected, at any state s = (n_R, n_S), n_S < c_S, when R^S_j + αv_α(n_R, n_S + 1) > αv_α(n_R, n_S). By Lemma 3.7, this holds for class j always.

Intra-Class Properties
In this section, we show the monotonicity properties of the optimal policy, regarding the optimal admission and routing decisions of individual customer classes as functions of the state of the system, or say the free capacities of the stations. Value function arguments are used for this.
The following lemma is useful for inferring the monotonicity properties of the optimal policy. In naming these value function properties, we adopt the terminology of Koole [17]. Lemma 3.9: For any α ∈ [0, 1), we have that

(i) v_α(n_R, n_S) − v_α(n_R, n_S + 1) ≥ v_α(n_R, n_S − 1) − v_α(n_R, n_S) (concavity in n_S);
(ii) v_α(n_R, n_S) + v_α(n_R + 1, n_S + 1) ≤ v_α(n_R + 1, n_S) + v_α(n_R, n_S + 1) (submodularity);
(iii) v_α(n_R, n_S + 1) + v_α(n_R + 1, n_S + 1) ≥ v_α(n_R + 1, n_S) + v_α(n_R, n_S + 2) (superconvexity in n_S);
(iv) v_α(n_R + 1, n_S) + v_α(n_R + 1, n_S + 1) ≥ v_α(n_R, n_S + 1) + v_α(n_R + 2, n_S) (superconvexity in n_R);
(v) v_α(n_R, n_S) − v_α(n_R + 1, n_S) ≥ v_α(n_R − 1, n_S) − v_α(n_R, n_S) (concavity in n_R),

for all states at which the terms are defined. Similarly, for the long-run average case, if v and y are the solutions of the optimality equation (8), then the above statements will hold with v_α replaced by y.
The proof of the lemma can be found in the Appendix. When we consider α-discounted total expected or long-run average reward optimality, these properties suggest the following. Property (i) tells us that more capacity at the self-service station means that more customers can be admitted to that station. Property (ii) can be interpreted as: more capacity at the self-service (regular) station means that more customers can be admitted to the regular (self-service) station. Property (iii) shows that more capacity at the self-service station has more potential to increase admissions to the self-service station than to the regular service station. Likewise, property (iv) is the counterpart of this for the regular service station. Lastly, property (v) indicates that more capacity at the regular service station means that more customers can be admitted to that station.
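These properties can also be checked numerically on a small instance. The sketch below is our own illustration, with hypothetical parameters pre-scaled for a uniformization constant of 1; it approximates v_α by value iteration and verifies the two concavity properties (i) and (v).

```python
import numpy as np

# Numerical check (ours) of properties (i) and (v) of Lemma 3.9.
lam = np.array([0.10, 0.15]); RR = np.array([10.0, 6.0]); RS = np.array([4.0, 3.0])
muR, muS, cR, cS, N, h, alpha = 0.15, 0.10, 2, 3, 4, 0.5, 0.95

def T(v):
    """Uniformized discounted optimality operator on S_1 x S_2."""
    Tv = np.empty_like(v)
    for nR in range(N + 1):
        for nS in range(cS + 1):
            mR, mS = muR * min(nR, cR), muS * nS
            val = -h * nR
            for i in range(len(lam)):
                choices = [alpha * v[nR, nS]]
                if nR < N:
                    choices.append(RR[i] + alpha * v[nR + 1, nS])
                if nS < cS:
                    choices.append(RS[i] + alpha * v[nR, nS + 1])
                val += lam[i] * max(choices)
            val += mR * alpha * v[max(nR - 1, 0), nS]
            val += mS * alpha * v[nR, max(nS - 1, 0)]
            val += (1.0 - lam.sum() - mR - mS) * alpha * v[nR, nS]
            Tv[nR, nS] = val
    return Tv

v = np.zeros((N + 1, cS + 1))
for _ in range(1500):
    v = T(v)

# (i) concavity in n_S: v(., nS) - v(., nS+1) is nondecreasing in nS
conc_S = bool(np.all(v[:, 1:-1] - v[:, 2:] >= v[:, :-2] - v[:, 1:-1] - 1e-8))
# (v) concavity in n_R: v(nR, .) - v(nR+1, .) is nondecreasing in nR
conc_R = bool(np.all(v[1:-1, :] - v[2:, :] >= v[:-2, :] - v[1:-1, :] - 1e-8))
print(conc_S, conc_R)
```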
The following propositions imply that α-discounted total expected or long-run average reward optimal policies have monotonicity properties.
Proposition 3.10: Under an α-discounted (α ∈ [0, 1)) total expected or long-run average reward optimal policy, if any class j is routed to the self-service station at any state s = (n R , n S ), n S < c S , then class j is routed to the self-service station also at state (n R , n S − 1).
Proof: We show this for α-discounted total expected reward optimality. Class j is preferably routed to the self-service station, rather than being rejected, at any state s = (n R , n S ), n R < N, n S < c S when R S j + αv α (n R , n S + 1) > αv α (n R , n S ). With property (i) of Lemma 3.9, we have αv α (n R , n S ) − αv α (n R , n S + 1) ≥ αv α (n R , n S − 1) − αv α (n R , n S ), which shows that R S j > αv α (n R , n S − 1) − αv α (n R , n S ). This confirms that class j will not be rejected from the self-service station at state (n R , n S − 1). Moreover, class j is routed to the self-service station, rather than to the regular service station, at state s = (n R , n S ), n R < N, n S < c S when R S j + αv α (n R , n S + 1) ≥ R R j + αv α (n R + 1, n S ). With property (iii) of Lemma 3.9, we have αv α (n R + 1, n S − 1) − αv α (n R + 1, n S ) ≤ αv α (n R , n S ) − αv α (n R , n S + 1), so that R S j + αv α (n R , n S ) ≥ R R j + αv α (n R + 1, n S − 1), which confirms the advantage of routing class j to the self-service over the regular service at state (n R , n S − 1).
Proposition 3.11: Under an α-discounted (α ∈ [0, 1)) total expected or long-run average reward optimal policy, if any class j is routed to the regular service station at any state s = (n R , n S ), n R < N, then class j is routed to the regular service station also at state (n R − 1, n S ).
Proof: We show this for α-discounted total expected reward optimality. We can prove this with the help of properties (iv) and (v) of Lemma 3.9, in a similar fashion to the proof of Proposition 3.10.
In our model, we also control admission decisions such that we have the option to reject customers. The following property presents the monotonicity of rejection decisions of customer classes. Proposition 3.12: Under an α-discounted (α ∈ [0, 1)) total expected or long-run average reward optimal policy, if any class j is rejected at any state s = (n R , n S ), then class j is rejected also at states (n R + 1, n S ) and (n R , n S + 1).
Proof: We show this for α-discounted total expected reward optimality. Let us show this for any state s = (n R , n S ) with n R < N, n S < c S (in which both of the routing options are feasible). At state s = (n R , n S ), n R < N, n S < c S class j is rejected when R R j + αv α (n R + 1, n S ) < αv α (n R , n S ) and R S j + αv α (n R , n S + 1) < αv α (n R , n S ). First, let us show that class j is rejected also at state (n R + 1, n S ). By property (v) of Lemma 3.9, αv α (n R , n S ) − αv α (n R + 1, n S ) ≤ αv α (n R + 1, n S ) − αv α (n R + 2, n S ). So, we have R R j < αv α (n R + 1, n S ) − αv α (n R + 2, n S ), which shows that class j is rejected from the regular service station at state (n R + 1, n S ). By property (ii) of Lemma 3.9, αv α (n R , n S ) − αv α (n R , n S + 1) ≤ αv α (n R + 1, n S ) − αv α (n R + 1, n S + 1). With this, we have that R S j < αv α (n R + 1, n S ) − αv α (n R + 1, n S + 1), which confirms that class j is rejected also from the self-service station at state (n R + 1, n S ). For showing that class j is rejected from the self-service and regular service stations also at state (n R , n S + 1), we can use properties (i) and (ii) of Lemma 3.9 in a similar fashion.
These monotonicity properties imply that the discounted total expected or long-run average reward optimal policies divide the state space into three connected regions for any customer class such that in each region the class is routed to the regular service station, routed to the self-service station, or rejected. For illustrating this, we present Figure 2 for a system with two classes. In obtaining this figure, we let λ 1 = λ 2 = 3, μ R = 1, μ S = 0.5, h = 2, c R = 3, c S = 10, B = 5 and consider long-run average reward optimal policies. Policies in Figure 2(a) and (b) correspond to the scenario in which we have R R 1 = 15 and R S 1 = 2 for the first customer class and R R 2 = 13 and R S 2 = 3 for the second customer class. Figure 2(a) and (b) presents the optimal admission and routing decisions of the first and second customer classes, respectively, for this scenario. We observe that the first class is accepted to the regular station more than the second class. Note that the regular service admission reward of the first class is higher than that of the second class (R R 1 > R R 2 ) and also the reward difference R R 1 − R S 1 is higher than R R 2 − R S 2 . For the systems with h = 0, we have the result in Proposition 3.6, which confirms the regular service priority of the class that has the highest regular service admission reward and the lowest self-service admission reward. For the system in this figure with h = 2, we are able to observe the regular service priority of such a customer class.
In order to illustrate how the optimal policies change with respect to the admission rewards of customers, we then look at Figure 2(c) and (d). In contrast to the setting in Figure 2(a) and (b), we have R S 1 = 6, instead of R S 1 = 2. Figure 2(c) and (d) presents the optimal admission and routing decisions of the first and second customer classes, respectively, for this different scenario. In this scenario, we still have that the first class has the highest regular service admission reward. However, this time the difference between the regular service and self-service admission rewards is higher for the second customer class. We can observe from the figure how this increase in the self-service admission reward of the first customer class reduces the priority of the first customer class for the regular service and at the same time increases the regular service priority of the second class.
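The three-region structure discussed above can be reproduced in outline with the sketch below. The one-step recursion is our own reconstruction of the model, N = c R + B is our reading of the buffer parameter, and a discount factor close to one is used as a proxy for the long-run average criterion; the check asserts the monotonicity of Propositions 3.10–3.12 for both classes of the first scenario (R R 1 = 15, R S 1 = 2, R R 2 = 13, R S 2 = 3).

```python
def value_iteration(lam, RR, RS, mu_R, mu_S, h, c_R, c_S, N, alpha, iters):
    # Uniformized discounted value iteration (reconstructed model; our assumption).
    Lam = sum(lam) + c_R * mu_R + c_S * mu_S
    v = {(r, s): 0.0 for r in range(N + 1) for s in range(c_S + 1)}
    for _ in range(iters):
        w = {}
        for (r, s) in v:
            t = -h * r / Lam                          # holding cost (assumed linear)
            for i in range(len(lam)):
                opts = [alpha * v[r, s]]              # reject
                if r < N:
                    opts.append(RR[i] + alpha * v[r + 1, s])   # regular service
                if s < c_S:
                    opts.append(RS[i] + alpha * v[r, s + 1])   # self-service
                t += lam[i] / Lam * max(opts)
            bR = min(r, c_R)
            t += bR * mu_R / Lam * alpha * v[max(r - 1, 0), s]
            t += (c_R - bR) * mu_R / Lam * alpha * v[r, s]     # dummy transition
            t += s * mu_S / Lam * alpha * v[r, max(s - 1, 0)]
            t += (c_S - s) * mu_S / Lam * alpha * v[r, s]      # dummy transition
            w[r, s] = t
        v = w
    return v

def decision(v, i, r, s, RR, RS, N, c_S, alpha):
    # Greedy one-step action for class i at state (r, s).
    best, act = alpha * v[r, s], "reject"
    if s < c_S and RS[i] + alpha * v[r, s + 1] > best:
        best, act = RS[i] + alpha * v[r, s + 1], "self"
    if r < N and RR[i] + alpha * v[r + 1, s] > best:
        act = "regular"
    return act

# Figure-2-style parameters (our reading of the instance).
lam, RR, RS = [3.0, 3.0], [15.0, 13.0], [2.0, 3.0]
c_R, c_S, B = 3, 10, 5
N = c_R + B                 # assumption: buffer B on top of the c_R servers
alpha = 0.98                # discounted proxy for the long-run average criterion
v = value_iteration(lam, RR, RS, 1.0, 0.5, 2.0, c_R, c_S, N, alpha, 1500)
ok = True
for i in range(2):
    pol = {(r, s): decision(v, i, r, s, RR, RS, N, c_S, alpha)
           for r in range(N + 1) for s in range(c_S + 1)}
    for (r, s), a in pol.items():
        if a == "reject":                 # Proposition 3.12
            ok &= (r == N or pol[r + 1, s] == "reject")
            ok &= (s == c_S or pol[r, s + 1] == "reject")
        if a == "self" and s > 0:         # Proposition 3.10
            ok &= pol[r, s - 1] == "self"
        if a == "regular" and r > 0:      # Proposition 3.11
            ok &= pol[r - 1, s] == "regular"
print(ok)
```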

CONCLUSIONS
This paper focuses on the optimal dynamic admission and routing control problem for revenue management in a specific service system setting. In this setting, we imagine that arriving customers are not identical with respect to their rewards for service and that they are sensitive to congestion. For modeling these differences, we consider customer classes. We get inspiration from service systems in which customers have the option to receive congestion-free services, for instance by self-serving their own demands through self-help desks, instead of opting for regular service, which, as we imagine, is provided by professionals. We argue that in such systems, the regular service can entail congestion-related costs due to high demands by customers and/or low staff levels. For studying the optimal admission and routing control of such systems, a queueing model with two parallel multi-server stations is devised. The congestion-free self-service station is represented as a loss system. For the regular service station, a finite buffer where customers incur holding costs is used. The focus of this paper is on the discounted total expected and long-run average reward optimal policies. We use Markov decision process formulations to characterize the structure of optimal policies and show the well-structuredness of optimal policies, through value function and sample path arguments.

APPENDIX
We show Lemma 3.9 on the α-discounted finite-horizon formulation v T,α , by using induction on T (the number of remaining time periods). We can then reason that the lemma will hold for v α or y. Let us start by showing the following.
Lemma A.1: For any T ≥ 1 and α ∈ [0, 1), we have that
v T,α (n R , n S ) − v T,α (n R + 1, n S ) ≤ v T +1,α (n R , n S ) − v T +1,α (n R + 1, n S ) and
v T,α (n R , n S ) − v T,α (n R , n S + 1) ≤ v T +1,α (n R , n S ) − v T +1,α (n R , n S + 1).
Proof: This lemma can be shown using sample path arguments. Here, we do it for the first statement. Construct two processes on the same probability space. Let Process 1 start at state (n R , n S ) and let Process 2 start at state (n R + 1, n S ). If these processes couple somewhere in the first T periods, then the difference in the rewards obtained by the two processes will be the same whether both processes have T or T + 1 periods remaining. If not, then Process 1 will obtain at least as much reward as Process 2 in the first T periods, and in the last remaining period, the advantage of Process 1 over Process 2 cannot decrease. Now it is sufficient to show the following statements, to prove Lemma 3.9. For any T ≥ 1 and α ∈ [0, 1), we have that
(I) v T,α (n R , n S ) − v T,α (n R , n S + 1) ≤ v T,α (n R , n S + 1) − v T,α (n R , n S + 2),
(II) v T,α (n R , n S ) − v T,α (n R , n S + 1) ≤ v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1),
(III) v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1) ≤ v T,α (n R , n S + 1) − v T,α (n R , n S + 2),
(IV) v T,α (n R , n S + 1) − v T,α (n R + 1, n S + 1) ≤ v T,α (n R + 1, n S ) − v T,α (n R + 2, n S ).
Statements (I)–(IV) yield properties (i)–(iv) of Lemma 3.9 in the limit, and property (v) follows by chaining statements (II) and (IV).

Proving Statement (I):
This is the convexity property of the value functions with respect to the number of customers in the self-service station. We show this by induction on the value functions. We can show the initial induction step by letting v 0,α (·, ·) = 0. Then, assuming that (I) holds for some T and α ∈ [0, 1), we need to show that (I) also holds for T + 1.

v T +1,α (n R , n S ) − v T +1,α (n R , n S + 1) = Σ i (λ i /Λ)[R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1)] + (min(n R , c R )μ R /Λ)[αv T,α (n R − 1, n S ) − αv T,α (n R − 1, n S + 1)] + (((c R − min(n R , c R ))μ R + (c S − n S − 1)μ S )/Λ)[αv T,α (n R , n S ) − αv T,α (n R , n S + 1)] + (n S μ S /Λ)[αv T,α (n R , n S − 1) − αv T,α (n R , n S )], (A.1)
where Λ is the uniformization constant, R i α denotes the admission-decision operator for class i, and the self-service server that is busy only in state (n R , n S + 1) contributes a zero difference term.
Note that the holding costs (h(n R )) cancel out in Rαv T,α (n R , n S ) − Rαv T,α (n R , n S + 1). We show this statement by proving that each term in brackets in Eq. (A.1) is bounded above by v T +1,α (n R , n S + 1) − v T +1,α (n R , n S + 2). Then, we can be sure that the statement will hold as the coefficients of these terms sum up to 1.
Let us first look into the three parts of (A.1) that do not relate to the rewards.
(a) v T,α (n R − 1, n S ) − v T,α (n R − 1, n S + 1) ≤ v T,α (n R , n S ) − v T,α (n R , n S + 1), by the induction hypothesis of statement (II); the rest follows as in (c).
(b) v T,α (n R , n S − 1) − v T,α (n R , n S ) ≤ v T,α (n R , n S ) − v T,α (n R , n S + 1), by the induction hypothesis of statement (I); the rest follows as in (c).
(c) v T,α (n R , n S ) − v T,α (n R , n S + 1) ≤ v T,α (n R , n S + 1) − v T,α (n R , n S + 2), by the induction hypothesis of statement (I), and v T,α (n R , n S + 1) − v T,α (n R , n S + 2) ≤ v T +1,α (n R , n S + 1) − v T +1,α (n R , n S + 2) by Lemma A.1.
Now, we check the reward differences. R α v T,α (n R , n S ) − R α v T,α (n R , n S + 1) is affected by the admission and routing choices made. We look into all possible combinations of these decisions for any arbitrary customer class i.
(a) Consider that it is optimal to route class i to the regular station at both (n R , n S ) and (n R , n S + 1) states. Then, we know that R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] ≤ α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)], by the induction hypothesis of statement (III); Lemma A.1 then completes the bound as in (d).
(b) Consider that it is optimal to route class i to the regular station at (n R , n S ) but to the self-service station at (n R , n S + 1). We know that, as it is not optimal to route class i to the regular station at (n R , n S + 1), R R i + αv T,α (n R + 1, n S + 1) ≤ R S i + αv T,α (n R , n S + 2), so that R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) ≤ α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)]. For the rest, we can follow (a).
(c) Similarly, for the case that it is optimal to route class i to the regular station at (n R , n S ) but to reject at (n R , n S + 1), we can infer that R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) ≤ α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] and follow the lines in (a).
(d) Consider that it is optimal to route class i to the self-service station at both (n R , n S ) and (n R , n S + 1) states. Thus, R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) = α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)] ≤ v T +1,α (n R , n S + 1) − v T +1,α (n R , n S + 2), by Lemma A.1.
(e) Consider that it is optimal to route class i to the self-service station at (n R , n S ) but to the regular station at (n R , n S + 1). As it is not optimal to route class i to the self-service station at (n R , n S + 1), R S i + αv T,α (n R , n S + 2) ≤ R R i + αv T,α (n R + 1, n S + 1), so that R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) ≤ α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)]. Then, we can follow (d) to show the rest.
(f) Consider that it is optimal to route class i to the self-service station at (n R , n S ) but to reject at (n R , n S + 1). Then, R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) = R S i and, as it is not optimal to admit class i to the self-service station at (n R , n S + 1), R S i ≤ α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)]. We can then follow the arguments in (d).
(g) Consider that it is optimal to reject class i at both (n R , n S ) and (n R , n S + 1) states. Then, R i α v T,α (n R , n S ) − R i α v T,α (n R , n S + 1) = α[v T,α (n R , n S ) − v T,α (n R , n S + 1)] ≤ α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)], by the induction hypothesis of statement (I). Then, Lemma A.1 completes the proof.
The case that class i is rejected at state (n R , n S ) but routed to the self-service station at state (n R , n S + 1) is impossible due to the induction hypothesis of statement (I). Likewise, it is also impossible that class i is rejected at state (n R , n S ) but routed to the regular station at state (n R , n S + 1), due to the induction hypothesis of statement (II).
Proving Statement (II): This is the supermodularity property of the value functions. We show this with induction on T by sample path arguments. Consider four processes on the same probability space such that all have T + 1 periods remaining. Let Process 1 and Process 4 start at (n R , n S ) and (n R + 1, n S + 1), respectively, and assume that both processes use an optimal policy π * . On the other hand, let Process 2 and Process 3 start at states (n R , n S + 1) and (n R + 1, n S ), respectively, and let them follow (potentially) suboptimal policies that deviate from the optimal policy π * only during the first time period.
We let R k and R * k be the random variables denoting the (net) rewards obtained by the policies that Process k ∈ {1, 2, 3, 4} follows and the rewards that could be obtained if Process k was following an optimal policy instead, respectively. In order to show that v T +1,α (n R , n S ) − v T +1,α (n R , n S + 1) ≤ v T +1,α (n R + 1, n S ) − v T +1,α (n R + 1, n S + 1), it suffices to show that E(R * 1 − R 2 ) ≤ E(R 3 − R * 4 ), since E(R 2 ) ≤ v T +1,α (n R , n S + 1) and E(R 3 ) ≤ v T +1,α (n R + 1, n S ). We now condition on the possible events that might occur in the first time period, by using the fact that after this event, we have T periods left in the horizon. The first transition event partitions the sample space. By using the law of total expectation, it suffices to show that E(R * 1 − R 2 |An) ≤ E(R 3 − R * 4 |An) for any transition event An. We skip writing the holding costs incurred at the states (during the first time period) as they cancel out in E(R * 1 − R 2 |An) and E(R 3 − R * 4 |An) irrespective of the transition events (Ans).
First focus on arrival events. Let class i be an arbitrary customer class whose arrival we observe in the first time period. Let A 1 denote this arrival event. As we do not know the decisions that the optimal policy π * will take after observing this event at states (n R , n S ) and (n R + 1, n S + 1), below we consider all possible scenarios for the decisions that the optimal policy can take at these states.
Scenario 1 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the regular station at both (n R , n S ) and (n R + 1, n S + 1) states. Let us consider that Process 2 and Process 3 also route this class to the regular station. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] ≤ α[v T,α (n R + 2, n S ) − v T,α (n R + 2, n S + 1)] = E(R 3 − R * 4 |A 1 ), by the induction hypothesis of statement (II). Scenario 2 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the regular station at (n R , n S ) and to the self-service at (n R + 1, n S + 1). Let Process 2 and Process 3 mimic Process 1 and Process 4, respectively. Then, E(R * 1 − R 2 |A 1 ) = (R R i + αv T,α (n R + 1, n S ) − R R i − αv T,α (n R + 1, n S + 1)) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)]. By the induction hypothesis of statement (I), we know that α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] ≤ α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)], where α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)] = (R S i + αv T,α (n R + 1, n S + 1) − R S i − αv T,α (n R + 1, n S + 2)) = E(R 3 − R * 4 |A 1 ). Scenario 3 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the regular station at (n R , n S ) and rejects at (n R + 1, n S + 1). Let Process 2 and Process 3 mimic Process 1 and Process 4, respectively. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] = E(R 3 − R * 4 |A 1 ). Scenario 4 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the self-service station at both (n R , n S ) and (n R + 1, n S + 1) states. Let us consider that also Process 2 and Process 3 route this class to the self-service station. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)] ≤ α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)] = E(R 3 − R * 4 |A 1 ), by the induction hypothesis of statement (II). Scenario 5 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the self-service station at (n R , n S ) but to the regular station at (n R + 1, n S + 1). Let Process 2 mimic Process 4 and Process 3 mimic Process 1. Then, E(R * 1 − R 2 |A 1 ) = R S i − R R i + α[v T,α (n R , n S + 1) − v T,α (n R + 1, n S + 1)] and E(R 3 − R * 4 |A 1 ) = R S i − R R i + α[v T,α (n R + 1, n S + 1) − v T,α (n R + 2, n S + 1)]; the induction hypotheses of statements (IV) and (II) together give E(R * 1 − R 2 |A 1 ) ≤ E(R 3 − R * 4 |A 1 ).
Scenario 6 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the self-service station at (n R , n S ) and rejects at (n R + 1, n S + 1). Let Process 2 mimic Process 4 and Process 3 mimic Process 1. Then, E(R * 1 − R 2 |A 1 ) = (R S i + αv T,α (n R , n S + 1) − αv T,α (n R , n S + 1)) = R S i = E(R 3 − R * 4 |A 1 ). Scenario 7 for A 1 : Assume that the optimal policy rejects class i at both (n R , n S ) and (n R + 1, n S + 1) states. Then, let Process 2 and Process 3 reject as well. We have E(R * 1 − R 2 |A 1 ) = α[v T,α (n R , n S ) − v T,α (n R , n S + 1)] ≤ α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] = E(R 3 − R * 4 |A 1 ), by the induction hypothesis of statement (II). Note that it is impossible that an optimal policy rejects class i at (n R , n S ) but admits it to either the self-service or the regular station at (n R + 1, n S + 1), by the induction hypothesis.
In all decision scenarios possible after observing event A 1 , we can confirm that E(R * 1 − R 2 |A 1 ) ≤ E(R 3 − R * 4 |A 1 ). Now, we focus on the service completion events, which may also occur in the first time period. The effect of a service completion event does not depend on the decisions of the optimal policy; however, this time we need to consider scenarios for the states of the systems, due to the fact that we have a bounded state space.
Let us first consider the service completion event from the regular service station. Let A 2 denote this event.
Scenario 1 for A 2 : First, let us consider the case that n R > 0 such that all processes can observe this event. Then, E(R * 1 − R 2 |A 2 ) = α[v T,α (n R − 1, n S ) − v T,α (n R − 1, n S + 1)] ≤ α[v T,α (n R , n S ) − v T,α (n R , n S + 1)] = E(R 3 − R * 4 |A 2 ), by the induction hypothesis of statement (II). Scenario 2 for A 2 : Now consider that n R = 0 such that we can observe this event only in Process 3 and Process 4. For this case, we have E(R * 1 − R 2 |A 2 ) = α[v T,α (0, n S ) − v T,α (0, n S + 1)] = E(R 3 − R * 4 |A 2 ). Now, let us consider the service completion event from the self-service station. Let A 3 denote this event. Scenario 1 for A 3 : If n S > 0, all processes can observe this event and E(R * 1 − R 2 |A 3 ) = α[v T,α (n R , n S − 1) − v T,α (n R , n S )] ≤ α[v T,α (n R + 1, n S − 1) − v T,α (n R + 1, n S )] = E(R 3 − R * 4 |A 3 ), by the induction hypothesis of statement (II). Scenario 2 for A 3 : If n S = 0, only Process 2 and Process 4 can observe this event, and E(R * 1 − R 2 |A 3 ) = α[v T,α (n R , 0) − v T,α (n R , 0)] = 0 = α[v T,α (n R + 1, 0) − v T,α (n R + 1, 0)] = E(R 3 − R * 4 |A 3 ). Lastly, a dummy transition event might occur due to uniformization, which does not change the state of any of the processes; the induction hypothesis of statement (II) covers this case. This completes the proof of statement (II).
Proving Statement (III): This is a superconvexity property of the value functions. We again show this with induction on T by sample path arguments. Consider four processes on the same probability space such that all have T + 1 periods remaining. Let Process 1 and Process 4 start at (n R + 1, n S ) and (n R , n S + 2), respectively, and assume that both processes use an optimal policy π * . On the other hand, let Process 2 and Process 3 start at (n R + 1, n S + 1) and (n R , n S + 1). For these processes, we do not assume that they follow an optimal policy. However, we consider that after the first time period, all policies follow an optimal policy.
We again let R k and R * k be the random variables denoting the (net) rewards obtained by the policies that Process k ∈ {1, 2, 3, 4} follows and the rewards that could be obtained if Process k was following an optimal policy instead, respectively. In order to show that v T +1,α (n R + 1, n S ) − v T +1,α (n R + 1, n S + 1) ≤ v T +1,α (n R , n S + 1) − v T +1,α (n R , n S + 2), it suffices to show that E(R * 1 − R 2 ) ≤ E(R 3 − R * 4 ), since E(R 2 ) ≤ v T +1,α (n R + 1, n S + 1) and E(R 3 ) ≤ v T +1,α (n R , n S + 1). We now condition on the possible events that might occur in the first time period, by using the fact that after this event, we have T periods left in the horizon. The first transition event partitions the sample space. By using the law of total expectation, it suffices to show that E(R * 1 − R 2 |An) ≤ E(R 3 − R * 4 |An) for any transition event An. We skip writing the holding costs incurred at the states (during the first time period) as they cancel out in E(R * 1 − R 2 |An) and E(R 3 − R * 4 |An) irrespective of the transition events (Ans).
First focus on arrival events. Let class i be an arbitrary customer class whose arrival we observe in the first time period. Let A 1 denote this arrival event. As we do not know the decisions that the optimal policy π * will take after observing this event at states (n R + 1, n S ) and (n R , n S + 2), below we consider all possible scenarios for the decisions that the optimal policy can take at these states.
Scenario 1 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the regular station at both (n R + 1, n S ) and (n R , n S + 2) states. Let us consider that also Process 2 and Process 3 route this class to the regular station. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 2, n S ) − v T,α (n R + 2, n S + 1)]. By the induction hypothesis of statement (III), α[v T,α (n R + 2, n S ) − v T,α (n R + 2, n S + 1)] ≤ α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)] = E(R 3 − R * 4 |A 1 ). Scenario 2 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the regular station at (n R + 1, n S ) but to the self-service station at (n R , n S + 2). Let us consider that Process 2 and Process 3 mimic Process 1 and Process 4, respectively. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 2, n S ) − v T,α (n R + 2, n S + 1)] and E(R 3 − R * 4 |A 1 ) = α[v T,α (n R , n S + 2) − v T,α (n R , n S + 3)]; applying the induction hypothesis of statement (III) twice gives E(R * 1 − R 2 |A 1 ) ≤ E(R 3 − R * 4 |A 1 ). Scenario 3 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the regular station at (n R + 1, n S ) but rejects at (n R , n S + 2). Let Process 2 mimic Process 4 and Process 3 mimic Process 1. Then, E(R * 1 − R 2 |A 1 ) = R R i + αv T,α (n R + 2, n S ) − αv T,α (n R + 1, n S + 1) and E(R 3 − R * 4 |A 1 ) = R R i + αv T,α (n R + 1, n S + 1) − αv T,α (n R , n S + 2); the induction hypotheses of statements (III) and (IV) together give E(R * 1 − R 2 |A 1 ) ≤ E(R 3 − R * 4 |A 1 ). Scenario 4 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the self-service station at both (n R + 1, n S ) and (n R , n S + 2) states. Let us consider that also Process 2 and Process 3 route this class to the self-service station. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)]. By the induction hypothesis of statement (III), α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)] ≤ α[v T,α (n R , n S + 2) − v T,α (n R , n S + 3)], where α[v T,α (n R , n S + 2) − v T,α (n R , n S + 3)] = E(R 3 − R * 4 |A 1 ). Scenario 5 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the self-service station at (n R + 1, n S ) but to the regular station at (n R , n S + 2). Let us consider that Process 2 and Process 3 mimic Process 1 and Process 4, respectively. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)], where α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)] = E(R 3 − R * 4 |A 1 ).
Scenario 6 for A 1 : Assume that the optimal policy routes the arrived customer of class i to the self-service station at (n R + 1, n S ) but rejects at (n R , n S + 2). Let us consider that Process 2 and Process 3 mimic Process 4 and Process 1, respectively. Then, E(R * 1 − R 2 |A 1 ) = R S i + α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 1)] = R S i , where R S i = R S i + α[v T,α (n R , n S + 2) − v T,α (n R , n S + 2)] = E(R 3 − R * 4 |A 1 ). Scenario 7 for A 1 : Assume that the optimal policy rejects the arrived customer of class i at both (n R + 1, n S ) and (n R , n S + 2) states. Let Process 2 and Process 3 reject as well. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)], where α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] ≤ α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)] = E(R 3 − R * 4 |A 1 ) holds due to the induction hypothesis of statement (III). Scenario 8 for A 1 : Assume that the optimal policy rejects the arrived class i at (n R + 1, n S ) and routes it to the regular station at (n R , n S + 2). Let us consider that Process 2 and Process 3 mimic Process 1 and Process 4, respectively. Then, E(R * 1 − R 2 |A 1 ) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)]; by the induction hypothesis of statement (I), α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] ≤ α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)], where α[v T,α (n R + 1, n S + 1) − v T,α (n R + 1, n S + 2)] = E(R 3 − R * 4 |A 1 ). By the induction hypothesis, it is impossible that an optimal policy routes class i to the self-service at (n R , n S + 2) but rejects at (n R + 1, n S ). If it is not optimal to route class i to the self-service at (n R + 1, n S ), then R S i < α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)].
By the induction hypothesis of statement (III), α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S + 1)] ≤ α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)] and by the induction hypothesis of statement (I), α[v T,α (n R , n S + 1) − v T,α (n R , n S + 2)] ≤ α[v T,α (n R , n S + 2) − v T,α (n R , n S + 3)]. Then, it is impossible that R S i + αv T,α (n R , n S + 3) > αv T,α (n R , n S + 2).
In all decision scenarios possible after observing event A 1 , we can confirm that E(R * 1 − R 2 |A 1 ) ≤ E(R 3 − R * 4 |A 1 ). Now let us look into the case in which service completion events occur in the first time period. Let us first consider the service completion event from the regular service station. Let A 2 denote this event.
Scenario 1 for A 2 : First, let us consider the case that n R > 0 such that all processes can observe this event. Then, E(R * 1 − R 2 |A 2 ) = α[v T,α (n R , n S ) − v T,α (n R , n S + 1)] ≤ α[v T,α (n R − 1, n S + 1) − v T,α (n R − 1, n S + 2)] = E(R 3 − R * 4 |A 2 ), by the induction hypothesis of statement (III). Scenario 2 for A 2 : Now consider that n R = 0 such that only Process 1 and Process 2 can observe this event. Then, we will have that E(R * 1 − R 2 |A 2 ) = α[v T,α (0, n S ) − v T,α (0, n S + 1)] ≤ α[v T,α (0, n S + 1) − v T,α (0, n S + 2)] = E(R 3 − R * 4 |A 2 ), by the induction hypothesis of statement (I).
Let us now consider the service completion event from the self-service station. Let A 3 denote this event.
Scenario 1 for A 3 : First, let us consider the case that n S > 0 such that all processes can observe this event. Then, E(R * 1 − R 2 |A 3 ) = α[v T,α (n R + 1, n S − 1) − v T,α (n R + 1, n S )] ≤ α[v T,α (n R , n S ) − v T,α (n R , n S + 1)] = E(R 3 − R * 4 |A 3 ), by the induction hypothesis of statement (III). Scenario 2 for A 3 : Now consider the case that n S = 0. For this case, Process 1 cannot observe this event. We have that E(R * 1 − R 2 |A 3 ) = α[v T,α (n R + 1, n S ) − v T,α (n R + 1, n S )] = 0 ≤ α[v T,α (n R , n S ) − v T,α (n R , n S + 1)] = E(R 3 − R * 4 |A 3 ), by Lemma 3.1. Lastly, a dummy transition event might occur due to uniformization, which does not change the state of any of the processes. So, the induction hypothesis of statement (III) is sufficient in this case.
Proving Statement (IV): This is a superconvexity property of the value functions. We again show this with induction on T by sample path arguments. Consider four processes on the same probability space such that all have T + 1 periods remaining. Let Process 1 and Process 4 start at (n R , n S + 1) and (n R + 2, n S ), respectively, and assume that both processes use an optimal policy π * . On the other hand, let Process 2 and Process 3 start at (n R + 1, n S + 1) and (n R + 1, n S ). For these processes, we do not assume that they follow an optimal policy. However, we consider that after the first time period, all policies follow an optimal policy.
We again let R k and R * k be the random variables denoting the (net) rewards obtained by the policies that Process k ∈ {1, 2, 3, 4} follows and the rewards that could be obtained if Process k was following an optimal policy instead, respectively. In order to show that v T +1,α (n R , n S + 1) − v T +1,α (n R + 1, n S + 1) ≤ v T +1,α (n R + 1, n S ) − v T +1,α (n R + 2, n S ), it suffices to show that E(R * 1 − R 2 ) ≤ E(R 3 − R * 4 ). We now condition on the possible events that might occur in the first time period, by using the fact that after this event, we have T periods left in the horizon. The first transition event partitions the sample space. By using the law of total expectation, it suffices to show that E(R * 1 − R 2 |An) ≤ E(R 3 − R * 4 |An) for any transition event An.