Premium control with reinforcement learning

Abstract We consider a premium control problem in discrete time, formulated in terms of a Markov decision process. In a simplified setting, the optimal premium rule can be derived with dynamic programming methods. However, these classical methods are not feasible in a more realistic setting due to the dimension of the state space and lack of explicit expressions for transition probabilities. We explore reinforcement learning techniques, using function approximation, to solve the premium control problem for realistic stochastic models. We illustrate the appropriateness of the approximate optimal premium rule compared with the true optimal premium rule in a simplified setting and further demonstrate that the approximate optimal premium rule outperforms benchmark rules in more realistic settings where classical approaches fail.


Introduction
An insurance company's claim costs and investment earnings fluctuate randomly over time. The insurance company needs to determine the premiums before the coverage periods start, that is before knowing what claim costs will appear and without knowing how its invested capital will develop. Hence, the insurance company is facing a dynamic stochastic control problem. The problem is complicated because of delays and feedback effects: premiums are paid before claim costs materialise and premium levels affect whether the company attracts or loses customers.
An insurance company wants a steady high surplus. The optimal dividend problem introduced by de Finetti (1957) (and solved by Gerber, 1969) has the objective to maximise the expected present value of future dividends. Its solution takes into account that paying dividends too generously is suboptimal since a probability of default that is too high affects the expected present value of future dividends negatively. A practical problem with implementing the optimal premium rule, that is a rule that maps the state of the stochastic environment to a premium level, obtained from solving the optimal dividend problem is that the premiums would be fluctuating more than what would be feasible for a real insurance market with competition. A good premium rule needs to generate premiums that do not fluctuate wildly over time.
For a mutual insurance company, different from a private company owned by shareholders, maximising dividends is not the main objective. Instead, the premiums should be low and suitably averaged over time, while the surplus should be sufficiently high to avoid a too high probability of default. Solving this multiple-objective optimisation problem is the focus of the present paper. Similar premium control problems have been studied by Martin-Löf (1983, 1994), and these papers have been a source of inspiration for our work. Martin-Löf (1983) carefully sets up the balance equations for the key economic variables of relevance for the performance of the insurance company and studies the premium control problem as a linear control problem under certain simplifying assumptions enabling application of linear control theory. The paper analyses the effects of delays in the insurance dynamical system on the linear control law with feedback and discusses designs of the premium control that ensure that the probability of default is small. Martin-Löf (1994) considers an application of general optimal control theory in a setting similar to, but simpler than, the setting considered in Martin-Löf (1983). The paper derives and discusses the optimal premium rule that achieves low and averaged premiums and also targets sufficient solvency of the insurance company.
© The Author(s), 2023. Published by Cambridge University Press on behalf of The International Actuarial Association. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
The literature on optimal control theory in insurance is vast, see for example the textbook treatment by Schmidli (2008) and references therein. Our aim is to provide solutions to realistic premium control problems in order to allow the optimal premium rule to be used with confidence by insurance companies. In particular, we avoid considering convenient stochastic models that may fit well with optimal control theory but fail to take key features of real dynamical insurance systems into account. Instead, we consider an insurance company that has enough data to suggest realistic models for the insurance environment, but the complexity of these models does not allow for explicit expressions for transition probabilities of the dynamical system. In this sense, the model of the environment is not fully known. However, the models can be used for simulating the behaviour of the environment.
Increased computing power and methodological advances during recent decades make it possible to revisit the problems studied in Martin-Löf (1983, 1994) and in doing so allow for more complex and realistic dynamics of the insurance dynamical system. Allowing realistic complex dynamics means that optimal premium rules, if they can be obtained, will not only give insurance companies guidance on how to set premiums but can actually be used with some confidence. The methodological advance that we use in this work is reinforcement learning, in particular reinforcement learning combined with function approximation, see for example Bertsekas and Tsitsiklis (1996) and Sutton and Barto (2018) and references therein. In the present paper, we focus on the temporal difference control algorithms SARSA and Q-learning. SARSA was first proposed by Rummery and Niranjan (1994) and named by Sutton (1995). Q-learning was introduced by Watkins (1989). By using reinforcement learning methods combined with function approximation, we obtain premium rules in terms of Markovian controls for Markov decision processes whose state spaces are much larger and more realistic than what was considered in the premium control problem studied in Martin-Löf (1994).
There exist other methods for solving general stochastic control problems with a known model of the environment, see for example Han and E (2016) and Germain et al. (2021). However, the deep learning methods in these papers are developed to solve fixed finite-time horizon problems. The stochastic control problem considered in the present paper has an indefinite-time horizon, since the terminal time is random and unbounded. A random terminal time also causes problems for the computation of gradients in deep learning methods. There are reinforcement learning methods, such as policy gradient methods (see e.g., Williams, 1992; Sutton et al., 1999, or for an overview Sutton and Barto, 2018, Ch. 13), that enable direct approximation of the premium rule by neural networks (or other function approximators) when the terminal time is random. However, for problems where the terminal time can be quite large (as in the present paper) these methods likely require an additional approximation of the value function (so-called actor-critic methods).
In the mathematical finance literature, there has recently been significant interest in the use of reinforcement learning, in particular related to hedging combined with function approximation, for instance the influential paper by Buehler et al. (2019) on deep hedging. Carbonneau (2021) uses the methodology in Buehler et al. (2019) and studies approaches to risk management of long-term financial derivatives motivated by guarantees and options embedded in life-insurance products. Another approach to deep hedging based on reinforcement learning for managing risks stemming from long-term life-insurance products is presented in Chong et al. (2021). Dynamic pricing has been studied extensively in the operations research literature. For instance, the problem of finding the optimal balance between learning an unknown demand function and maximising revenue is related to reinforcement learning. We refer to den Boer and Zwart (2014) and references therein. Reinforcement learning is used in Krasheninnikova et al. (2019) for determining a renewal pricing strategy in an insurance setting. However, the problem formulation and solution method are different from what is considered in the present paper. Krasheninnikova et al. (2019) considers retention of customers while maximising revenue and does not take claims costs into account. Furthermore, in Krasheninnikova et al. (2019) the state space is discretised in order to use standard Q-learning, while the present paper solves problems with a large or infinite state space by combining reinforcement learning with function approximation.
The paper is organised as follows. Section 2 describes the relevant insurance economics by presenting the involved cash flows, key economic quantities such as surplus, earned premium, reserves and how such quantities are connected to each other and their dynamics or balance equations. Section 2 also introduces stochastic models (simple, intermediate and realistic) giving a complete description of the stochastic environment in which the insurance company operates and aims to determine an optimal premium rule. The stochastic model will serve us by enabling simulation of data from which the reinforcement learning methods gradually learn the stochastic environment in the search for optimal premium rules. The models are necessarily somewhat complex since we want to take realistic insurance features into account, such as delays between accidents and payments and random fluctuations in the number of policyholders, partly due to varying premium levels.
Section 3 sets up the premium control problem we aim to solve in terms of a Markov decision process and standard elements of stochastic control theory such as the Bellman equation. Finding the optimal premium rule by directly solving the Bellman (optimality) equation numerically is not possible when considering state spaces for the Markov decision process matching a realistic model for the insurance dynamical system. Therefore, we introduce reinforcement learning methods in Section 4. In particular, we present basic theory for the temporal difference learning methods Q-learning and SARSA. We explain why these methods will not be able to provide us with reliable estimates of optimal premium rules unless we restrict ourselves to simplified versions of the insurance dynamical system. We argue that SARSA combined with function approximation of the so-called action-value function will allow us to determine optimal premium rules. We also highlight several pitfalls that the designer of the reinforcement learning method must be aware of and make sure to avoid.
Section 5 presents the necessary details in order to solve the premium control problem using SARSA with function approximation. We analyse the effects of different model/method choices on the performance of different reinforcement learning techniques and compare the performance of the optimal premium rule with those of simpler benchmark rules.
Finally, Section 6 concludes the paper. We emphasise that the premium control problem studied in the present paper is easily adjusted to fit the features of a particular insurance company and that the excellent performance of a carefully set up reinforcement learning method with function approximation provides the insurance company with an optimal premium rule that can be used in practice and communicated to stakeholders.

A stochastic model of the insurance company
The number of contracts written during year $t+1$ is denoted $N_{t+1}$, known at the end of year $t+1$. The premium per contract $P_t$ during year $t+1$ is decided at the end of year $t$. Hence, $P_t$ is $\mathcal{F}_t$-measurable, where $\mathcal{F}_t$ denotes the $\sigma$-algebra representing the available information at the end of year $t$. Contracts are assumed to be written uniformly in time over the year and provide insurance coverage for one year. Therefore, assuming that the premium income is earned linearly with time, the earned premium during year $t+1$ is
$$\mathrm{EP}_{t+1} = \tfrac{1}{2}\big(P_t N_{t+1} + P_{t-1} N_t\big),$$
that is, for contracts written during year $t+1$, on average half of the premium income $P_t N_{t+1}$ will be earned during year $t+1$, and half during year $t+2$. Since only half of the premium income $P_t N_{t+1}$ is earned during year $t+1$, the other half, which should cover claims during year $t+2$, will be stored in the premium reserve. The balance equation for the premium reserve is $V_{t+1} = V_t + P_t N_{t+1} - \mathrm{EP}_{t+1}$. Note that when we add cash flows or reserves occurring at time $t+1$ to cash flows or reserves occurring at time $t$, the time $t$ amounts should be interpreted as adjusted for the time value of money. We choose not to write this out explicitly in order to simplify notation.
That contracts are written uniformly in time over the year means that $I_{t,k}$, the incremental payment to policyholders during year $t+k$ for accidents during year $t+1$, will consist partly of payments to contracts written during year $t+1$ and partly of payments to contracts written during year $t$. Hence, we assume that $I_{t,k}$ depends on both $N_{t+1}$ and $N_t$. Table 1 shows a claims triangle with entries $I_{j,k}$ representing incremental payments to policyholders during year $j+k$ for accidents during year $j+1$. For ease of presentation, we will assume that the maximum delay between an accident and a resulting payment is four years; other choices could of course be made. Entries $I_{j,k}$ with $j+k \le t$ are $\mathcal{F}_t$-measurable and coloured blue in Table 1.
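The premium reserve recursion can be illustrated with a short Python sketch. It assumes that earned premium is half of the premium written in the current year plus half of the premium written in the previous year, as described in the text. The function names and all numbers are hypothetical.

```python
def earned_premium(written_now, written_prev):
    """Earned premium: half of this year's and half of last year's written premium."""
    return 0.5 * (written_now + written_prev)


def premium_reserve_path(premiums, contracts, v0):
    """Roll the premium reserve forward with V_new = V_old + written - earned.

    premiums[i]  : premium per contract for contracts written in year i
    contracts[i] : number of contracts written in year i
    v0           : premium reserve at the end of year 0
    """
    written = [p * n for p, n in zip(premiums, contracts)]
    reserves = [v0]
    for i in range(1, len(written)):
        ep = earned_premium(written[i], written[i - 1])
        reserves.append(reserves[-1] + written[i] - ep)
    return reserves


# Hypothetical inputs: three years of premiums and contract counts.
premiums = [10.0, 12.0, 11.0]
contracts = [100, 90, 95]
reserves = premium_reserve_path(premiums, contracts, v0=0.5 * premiums[0] * contracts[0])
print(reserves)
```

If the initial reserve equals half of the premium written in year 0, the reserve stays equal to half of the most recently written premium, which matches the interpretation that the unearned half is carried into the next year.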
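Let
$$\begin{aligned}
\mathrm{IC}_{t+1} &= I_{t,1} + \mathrm{E}\big[I_{t,2}+I_{t,3}+I_{t,4} \mid \mathcal{F}_{t+1}\big],\\
\mathrm{PC}_{t+1} &= I_{t,1}+I_{t-1,2}+I_{t-2,3}+I_{t-3,4},\\
\mathrm{RP}_{t+1} &= \mathrm{E}\big[I_{t-1,2}+I_{t-1,3}+I_{t-1,4}+I_{t-2,3}+I_{t-2,4}+I_{t-3,4} \mid \mathcal{F}_t\big] - \big(I_{t-1,2}+I_{t-2,3}+I_{t-3,4}\big)\\
&\quad - \mathrm{E}\big[I_{t-1,3}+I_{t-1,4}+I_{t-2,4} \mid \mathcal{F}_{t+1}\big],
\end{aligned} \tag{2.1}$$
where IC, PC and RP denote, respectively, incurred claims, paid claims and runoff profit. The balance equation for the claims reserve $R_t$ is $R_{t+1} = R_t + \mathrm{IC}_{t+1} - \mathrm{RP}_{t+1} - \mathrm{PC}_{t+1}$. The profit or loss during year $t+1$ depends on changes in the reserves:
$$\begin{aligned}
\mathrm{P\&L}_{t+1} &= P_t N_{t+1} - \mathrm{PC}_{t+1} + \mathrm{IE}_{t+1} - \mathrm{OE}_{t+1} - (R_{t+1}-R_t) - (V_{t+1}-V_t)\\
&= \mathrm{EP}_{t+1} + \mathrm{IE}_{t+1} - \mathrm{OE}_{t+1} - \mathrm{IC}_{t+1} + \mathrm{RP}_{t+1},
\end{aligned}$$
where IE denotes investment earnings and OE denotes operating expenses. The dynamics of the surplus fund is therefore
$$G_{t+1} = G_t + \mathrm{EP}_{t+1} + \mathrm{IE}_{t+1} - \mathrm{OE}_{t+1} - \mathrm{IC}_{t+1} + \mathrm{RP}_{t+1}. \tag{2.2}$$
We consider three models of increasing complexity. The simple model allows us to solve the premium control problem with classical methods. In this situation, we can compare the results obtained with classical methods with the results obtained with more flexible methods, allowing the assessment of the performance of a chosen flexible method. Classical solution methods are not feasible for the intermediate model. However, the similarity between the simple and intermediate model allows us to understand how increasing model complexity affects the optimal premium rule. Finally, we consider a realistic model, where the models for the claims payments and investment earnings align closer with common distributional assumptions for these quantities. Since the simple model is a simplified version of the intermediate model, we begin by defining the intermediate model in Section 2.1, followed by the simple model in Section 2.2. In Section 2.3, the more realistic models for claims payments and investment earnings are defined.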

Intermediate model
We choose to model the key random quantities as integer-valued random variables with conditional distributions that are either Poisson or Negative Binomial distributions. Other choices of distributions on the integers are possible without any major effects on the analysis that follows. Let
$$\mathcal{L}\big(N_{t+1} \mid \mathcal{F}_t\big) = \mathcal{L}\big(N_{t+1} \mid P_t\big) = \mathrm{Pois}\big(a P_t^{\,b}\big), \tag{2.3}$$
where $a > 0$ is a constant, and $b < 0$ is the price elasticity of demand. The notation says that the conditional distribution of the number of contracts written during year $t+1$ given the information at the end of year $t$ depends on that information only through the premium decided at the end of year $t$ for those contracts.
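To see what a constant price elasticity means for demand, the following Python sketch evaluates the expected number of contracts under the assumed constant-elasticity form $a P^{b}$. The functional form is an assumption of this sketch and the parameter values are hypothetical.

```python
def expected_contracts(premium, a, b):
    """Expected number of contracts at a given premium, assuming mean a * premium**b."""
    if premium <= 0 or a <= 0:
        raise ValueError("premium and a must be positive")
    return a * premium ** b


# Hypothetical parameters: scale a and price elasticity b < 0.
a, b = 5000.0, -1.5
base = expected_contracts(10.0, a, b)
raised = expected_contracts(10.1, a, b)   # premium increased by 1%

# Relative change in expected demand after a 1% premium increase.
relative_change = raised / base - 1.0
print(base, raised, relative_change)
```

With elasticity $b$, a 1% premium increase changes expected demand by the factor $1.01^{b}$, that is by roughly $b$ percent, regardless of the premium level.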
Let $\widetilde{N}_{t+1} = (N_{t+1} + N_t)/2$ denote the number of contracts during year $t+1$ that provide coverage for accidents during year $t+1$. Let
$$\mathrm{OE}_{t+1} = \beta_0 + \beta_1 \widetilde{N}_{t+1},$$
saying that the operating expenses have both a fixed part and a variable part proportional to the number of active contracts. The appearance of $\widetilde{N}_{t+1}$ instead of $N_{t+1}$ is due to the assumption that contracts are written uniformly in time over the year and that accidents occur uniformly in time over the year.
The constant $\alpha_k$ is, for a given accident year, the expected fraction of claim costs paid during development year $k$. Let
$$\mathcal{L}\big(I_{t,k} \mid \mathcal{F}_t, \widetilde{N}_{t+1}\big) = \mathrm{Pois}\big(\alpha_k \mu \widetilde{N}_{t+1}\big), \tag{2.5}$$
where $\mu$ denotes the expected claim cost per contract. We assume that different incremental claims payments $I_{j,k}$ are conditionally independent given information about the corresponding numbers of contracts written. Formally, the elements in the set are conditionally independent given $N_{t-l}, \dots, N_{t+1}$. Therefore, using (2.1) and (2.5), (2.6)
The model for the investment earnings $\mathrm{IE}_{t+1}$ is chosen so that $G_t \le 0$ implies $\mathrm{IE}_{t+1} = 0$ since $G_t \le 0$ means that nothing is invested. Moreover, we assume that where $\mathrm{NegBin}(r, p)$ denotes the negative binomial distribution with probability mass function which corresponds to mean and variance
Given a premium rule $\pi$ that given the state $S_t = (G_t, P_{t-1}, N_{t-3}, N_{t-2}, N_{t-1}, N_t)$ generates the premium $P_t$, the system $(S_t)$ evolves in a Markovian manner according to the transition probabilities that follow from (2.3)-(2.7) and (2.2). Notice that if we consider a less long-tailed insurance product so that $\alpha_3 = \alpha_4 = 0$ (at most one year delay from occurrence of the accident to final payment), then the dimension of the state space reduces to four, that is $S_t = (G_t, P_{t-1}, N_{t-1}, N_t)$.

Simple model
Consider the situation where the insurer has a fixed number $N$ of policyholders, who at some initial time point bought insurance policies with automatic contract renewal at the price $P_t$ for year $t+1$. The state at time $t$ is $S_t = (G_t, P_{t-1})$. In this simplified setting, $\mathrm{OE}_{t+1} = \beta_0 + \beta_1 N$, all payments $I_{t,k}$ are independent, $\mathcal{L}(I_{t,k}) = \mathrm{Pois}(\alpha_k \mu N)$, $\mathrm{IC}_{t+1} - \mathrm{RP}_{t+1} = \mathrm{PC}_{t+1}$ and $\mathcal{L}(\mathrm{PC}_{t+1}) = \mathrm{Pois}(\mu N)$.

Realistic model
In this model, we change the distributional assumptions for both investment earnings and the incremental claims payments from the previously used integer-valued distributions. Let $(Z_t)$ be a sequence of iid standard normals and let Let $C_{t,j}$ denote the cumulative claims payments for accidents occurring during year $t+1$ up to and including development year $j$. Hence, $I_{t,1} = C_{t,1}$, and $I_{t,j} = C_{t,j} - C_{t,j-1}$ for $j > 1$. We use the following model for the cumulative claims payments: where $c_0$ is interpreted as the average claims payment per policyholder during the first development year, and $Z_{t,j}$ are iid standard normals. Then, We do not impose restrictions on the parameters $\mu_j$ and $\nu_j^2$, and as a consequence values $f_j = \exp(\mu_j + \nu_j^2/2) \in (0, 1)$ are allowed (allowing for negative incremental paid amounts $C_{t,j+1} - C_{t,j} < 0$). This is in line with the model assumption $\mathrm{E}[C_{t,j+1} \mid C_{t,1}, \dots, C_{t,j}] = f_j C_{t,j}$ for some $f_j > 0$ of the classical distribution-free Chain Ladder model by Mack (1993).

The control problem
We consider a set of states S + , a set of non-terminal states S ⊆ S + , and for each s ∈ S a set of actions A(s) available from state s, with A = ∪ s∈S A(s). We assume that A is discrete (finite or countable).
In order to simplify notation and limit the need for technical details, we will here and in Section 4 restrict our presentation to the case where $\mathcal{S}^+$ is also discrete. However, we emphasise that when using function approximation in Section 4.2 the update Equation (4.3) for the weight vector is still valid when the state space is uncountable, as is the case for the realistic model. For each $s \in \mathcal{S}$, $s' \in \mathcal{S}^+$, $a \in \mathcal{A}(s)$ we define the reward received after taking action $a$ in state $s$ and transitioning to $s'$, $-f(a, s, s')$, and the probability of transitioning from state $s$ to state $s'$ after taking action $a$, $p(s' \mid s, a)$. We assume that rewards and transition probabilities are stationary (time-homogeneous). This defines a Markov decision process (MDP). A policy $\pi$ specifies how to determine what action to take in each state. A stochastic policy describes, for each state, a probability distribution on the set of available actions. A deterministic policy is a special case of a stochastic policy, specifying a degenerate probability distribution, that is a one-point distribution.
Our objective is to find the premium policy that minimises the expected value of the premium payments over time, but that also results in $(P_t)$ being more averaged over time, and further ensures that the surplus $(G_t)$ is large enough so that the risk that the insurer cannot pay the claim costs and other expenses is small. By combining rewards with either constraints on the available actions from each state or the definition of terminal states, this will be accomplished with a single objective function, see further Sections 3.1-3.2. We formulate this in terms of an MDP, that is, we want to solve the following optimisation problem:
$$\underset{\pi}{\text{minimise}}\ \ \mathrm{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t f(P_t, S_t, S_{t+1}) \,\Big|\, S_0 = s\Big] \quad \text{subject to } P_t \in \mathcal{A}(S_t),$$
where $\pi$ is a policy generating the premium $P_t$ given the state $S_t$, $\mathcal{A}(s)$ is the set of premium levels available from state $s$, $\gamma$ is the discount factor, $f$ is the cost function, and $\mathrm{E}_\pi[\cdot]$ denotes the expectation given that policy $\pi$ is used. Note that the discount factor $\gamma^t$ should not be interpreted as the price of a zero-coupon bond maturing at time $t$, since the cost that is discounted does not represent an economic cost. Instead $\gamma$ reflects how much weight is put on costs that are immediate compared to costs further in the future. The transition probabilities are $p(s' \mid s, a) = \mathrm{P}(S_{t+1} = s' \mid S_t = s, P_t = a)$, and we consider stationary policies, letting $\pi(a \mid s)$ denote the probability of taking action $a$ in state $s$ under policy $\pi$, $\pi(a \mid s) = \mathrm{P}_\pi(P_t = a \mid S_t = s)$.
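If there are no terminal states, we have $T = \infty$, and $\mathcal{S}^+ = \mathcal{S}$. We want to choose $\mathcal{A}(s)$, $s \in \mathcal{S}$, $f$, and any terminal states such that the objective discussed above is achieved. We will do this in two ways, see Sections 3.1 and 3.2. The value function of state $s$ under a policy $\pi$ generating the premium $P_t$ is defined as
$$v_\pi(s) = \mathrm{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t \big(-f(P_t, S_t, S_{t+1})\big) \,\Big|\, S_0 = s\Big].$$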

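The Bellman equation for the value function is
$$v_\pi(s) = \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \sum_{s' \in \mathcal{S}^+} p(s' \mid s, a)\big(-f(a, s, s') + \gamma v_\pi(s')\big).$$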
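When the policy is deterministic, we let $\pi$ be a mapping from $\mathcal{S}$ to $\mathcal{A}$, and
$$v_\pi(s) = \sum_{s' \in \mathcal{S}^+} p(s' \mid s, \pi(s))\big(-f(\pi(s), s, s') + \gamma v_\pi(s')\big).$$
The optimal value function is $v_*(s) = \sup_\pi v_\pi(s)$. When the action space is finite, the supremum is attained, which implies the existence of an optimal deterministic stationary policy (see Puterman).
We use policy iteration in order to find the solution numerically. Let $k = 0$, and choose some initial deterministic policy $\pi_k(s)$ for all $s \in \mathcal{S}$. Then
(i) Determine $V_k(s)$ as the unique solution to the system of equations
$$V_k(s) = \sum_{s' \in \mathcal{S}^+} p(s' \mid s, \pi_k(s))\big(-f(\pi_k(s), s, s') + \gamma V_k(s')\big), \quad s \in \mathcal{S}.$$
(ii) Determine an improved policy $\pi_{k+1}(s)$ by computing
$$\pi_{k+1}(s) = \underset{a \in \mathcal{A}(s)}{\arg\max} \sum_{s' \in \mathcal{S}^+} p(s' \mid s, a)\big(-f(a, s, s') + \gamma V_k(s')\big), \quad s \in \mathcal{S}.$$
(iii) If $\pi_{k+1}(s) \neq \pi_k(s)$ for some $s \in \mathcal{S}$, then increase $k$ by 1 and return to step (i).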
Note that if the state space is large enough, solving the system of equations in step (i) directly might be too time-consuming. In that case, this step can be solved by an additional iterative procedure, called iterative policy evaluation, see for example Sutton and Barto (2018, Ch. 4.1).
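The policy iteration steps can be sketched in Python on a toy problem. This is a minimal illustration, not the implementation used in the paper: the two surplus states, the default state, the transition probabilities and the cost values are all hypothetical, and step (i) is carried out with iterative policy evaluation rather than a direct linear solve.

```python
def policy_iteration(states, actions, p, cost, gamma, tol=1e-10):
    """Policy iteration for a small MDP with rewards -cost(a, s, s_next).

    p[s][a] is a dict {s_next: probability}; states missing from `value`
    (terminal states) are given value 0.
    """
    policy = {s: actions[s][0] for s in states}
    value = {s: 0.0 for s in states}

    def q(s, a):
        return sum(prob * (-cost(a, s, s2) + gamma * value.get(s2, 0.0))
                   for s2, prob in p[s][a].items())

    while True:
        # (i) iterative policy evaluation for the current policy
        while True:
            delta = 0.0
            for s in states:
                new = q(s, policy[s])
                delta = max(delta, abs(new - value[s]))
                value[s] = new
            if delta < tol:
                break
        # (ii) greedy policy improvement
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        # (iii) stop once the policy no longer changes
        if stable:
            return policy, value


# Hypothetical toy problem: surplus is "low" or "high", premium is 1 or 2.
states = ["low", "high"]
actions = {"low": [1, 2], "high": [1, 2]}
p = {
    "low": {1: {"default": 0.5, "low": 0.5}, 2: {"default": 0.05, "high": 0.95}},
    "high": {1: {"high": 0.7, "low": 0.3}, 2: {"high": 1.0}},
}


def cost(a, s, s_next):
    # Convex premium cost a**2, and a large cost on default (both hypothetical).
    return 50.0 if s_next == "default" else float(a ** 2)


policy, value = policy_iteration(states, actions, p, cost, gamma=0.9)
print(policy, value)
```

In this toy problem the resulting rule charges the high premium when the surplus is low and the low premium when the surplus is high.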

MDP with constraint on the action space
The premiums $(P_t)$ will be averaged if we minimise $\sum_t c(P_t)$, where $c$ is an increasing, strictly convex function. Thus for the first MDP, we let $f(a, s, s') = c(a)$. To ensure that the surplus $(G_t)$ does not become negative too often, we combine this with the constraint saying that the premium needs to be chosen so that the expected value, given the current state, of the surplus stays nonnegative, that is
$$\mathcal{A}(S_t) = \big\{a \in \mathcal{A} : \mathrm{E}[G_{t+1} \mid S_t, P_t = a] \ge 0\big\}, \tag{3.2}$$
and the optimisation problem becomes
$$\underset{\pi}{\text{minimise}}\ \ \mathrm{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t c(P_t) \,\Big|\, S_0 = s\Big] \quad \text{subject to } \mathrm{E}_\pi[G_{t+1} \mid S_t] \ge 0 \text{ for all } t.$$
The choice of the convex function $c$, together with the constraint, will affect how quickly the premium can be lowered as the surplus or previous premium increases, and how quickly the premium must be increased as the surplus or previous premium decreases. Different choices of $c$ affect how well different parts of the objective are achieved. Hence, one choice of $c$ might put a higher emphasis on the premium being more averaged over time but slightly higher, while another choice might promote a lower premium level that is allowed to vary a bit more from one time point to another. Furthermore, it is not clear from the start what choice of $c$ will lead to a specific result, thus designing the reward signal might require searching through trial and error for the cost function that achieves the desired result.

MDP with a terminal state
The constraint (3.2) requires a prediction of $N_{t+1}$ according to (2.3). However, estimating the price elasticity in (2.3) is a difficult task; hence, it would be desirable to solve the optimisation problem without having to rely on this prediction. To this end, we remove the constraint on the action space, that is we let $\mathcal{A}(s) = \mathcal{A}$ for all $s \in \mathcal{S}$, and instead introduce a terminal state which has a larger negative reward than all other states. This terminal state is reached when the surplus $G_t$ is below some predefined level, and it can be interpreted as the state where the insurer defaults and has to shut down. If we let $\mathcal{G}$ denote the set of non-terminal states for the first state variable (the surplus), then
$$f(P_t, S_t, S_{t+1}) = \begin{cases} c(P_t), & G_{t+1} \in \mathcal{G},\\ (1+\eta)\, c(\max \mathcal{A}), & G_{t+1} \notin \mathcal{G}, \end{cases}$$
where $\eta > 0$. The optimisation problem becomes
$$\underset{\pi}{\text{minimise}}\ \ \mathrm{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t f(P_t, S_t, S_{t+1}) \,\Big|\, S_0 = s\Big], \quad T = \min\{t : G_{t+1} \notin \mathcal{G}\}.$$
The reason for choosing $\eta > 0$ is to ensure that the reward when transitioning to the terminal state is lower than the reward when using action $\max \mathcal{A}$ (the maximal premium level), that is, it should be more costly to terminate and restart compared with attempting to increase the surplus when the surplus is low.
The particular value of the parameter $\eta > 0$ together with the choice of the convex function $c$ determines the reward signal, that is the compromise between minimising the premium, averaging the premium and ensuring that the risk of default is low. One way of choosing $\eta$ is to set it high enough so that the reward when terminating is lower than the total reward using any other policy. Then, we require that the cost of terminating exceeds $\sum_{t=0}^{\infty} \gamma^t c(\max \mathcal{A}) = c(\max \mathcal{A})/(1-\gamma)$, the largest total discounted cost that a policy that never terminates can incur. This choice of $\eta$ will put a higher emphasis on ensuring that the risk of default is low, compared with using a lower value of $\eta$.

Choice of cost function
The function $c$ is chosen to be an increasing, strictly convex function. That it is increasing captures the objective of a low premium. As discussed in Martin-Löf (1994), that it is convex means that the premiums will be more averaged, since
$$\frac{1}{n}\sum_{t=1}^{n} c(p_t) \ \ge\ c\Big(\frac{1}{n}\sum_{t=1}^{n} p_t\Big)$$
for any premium levels $p_1, \dots, p_n$. The more convex the function is, the more stable the premium will be over time. One could also force stability by adding a term related to the absolute value of the difference between successive premium levels to the cost function. We have chosen a slightly simpler cost function, defined by $c$, and for the case with terminal states, by the parameter $\eta$.
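The averaging effect of a convex cost can be checked numerically. The Python sketch below uses the hypothetical cost $c(p) = p^2$ and hypothetical premium paths; it is not the cost function used in Section 5.

```python
def total_cost(premiums, c):
    """Undiscounted total cost of a premium path under cost function c."""
    return sum(c(p) for p in premiums)


def c(p):
    # Hypothetical cost: increasing and strictly convex for p > 0.
    return p ** 2


# Two hypothetical premium paths with the same average premium.
fluctuating = [0.8, 1.2, 0.7, 1.3]
mean_premium = sum(fluctuating) / len(fluctuating)
averaged = [mean_premium] * len(fluctuating)

cost_fluctuating = total_cost(fluctuating, c)
cost_averaged = total_cost(averaged, c)
print(cost_fluctuating, cost_averaged)
```

Both paths collect the same premium on average, but the constant path has the lower total cost, so a minimiser of the convex cost prefers smoother premiums.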
As for the specific choice of the function c used in Section 5, we have simply used the function suggested in Martin-Löf (1994), but with slightly adjusted parameter values. That the function c, together with the constraint or the choice of terminal states and the value of η, leads to the desired goal of a low, stable premium and a low probability of default needs to be determined on a case by case basis, since we have three competing objectives, and different insurers might put different weight on each of them. This is part of designing the reward function. Hence, adjusting c and η will change how much weight is put on each of the three objectives, and the results in Section 5 can be used as basis for adjustments.

Reinforcement learning
If the model of the environment is not fully known, or if the state space or action space are not finite, the control problem can no longer be solved by classical dynamic programming approaches. Instead, we can utilise different reinforcement learning algorithms.

Temporal-difference learning
Temporal-difference (TD) methods can learn directly from real or simulated experience of the environment. Given a specific policy $\pi$ which determines the action taken in each state, and the sampled or observed state at time $t$, $S_t$, state at time $t+1$, $S_{t+1}$, and reward $R_{t+1}$, the iterative update for the value function, using the one-step TD method, is
$$V(S_t) \leftarrow V(S_t) + \alpha_t\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big),$$
where $\alpha_t$ is a step size parameter. Hence, the target for the TD update is $R_{t+1} + \gamma V(S_{t+1})$, so the estimate $V(S_t)$ is updated based on another estimate, namely $V(S_{t+1})$. The intuition behind using $R_{t+1} + \gamma V(S_{t+1})$ as the target in the update is that this is a slightly better estimate of $v_\pi(S_t)$, since it consists of an actual (observed or sampled) reward at $t+1$ and an estimate of the value function at the next observed state.
It has been shown in, for example, Dayan (1992) that the value function (for a given policy $\pi$) computed using the one-step TD method converges to the true value function if the step size parameter $0 \le \alpha_t \le 1$ satisfies the following stochastic approximation conditions
$$\sum_{k=1}^{\infty} \alpha_{t_k(s)} = \infty, \qquad \sum_{k=1}^{\infty} \alpha_{t_k(s)}^2 < \infty, \qquad \text{for all } s \in \mathcal{S},$$
where $t_k(s)$ is the time step when state $s$ is visited for the $k$th time.
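A minimal Python sketch of one-step TD policy evaluation is given below. It uses the step size $1/k$ at the $k$th visit to a state, which satisfies the stochastic approximation conditions. The two-state chain, its rewards and the discount factor are hypothetical and chosen so that the true values are $-2.5$ and $-3.5$.

```python
import random


def td0_evaluate(step, reward, gamma, start, n_steps, seed=0):
    """One-step TD estimate of the value function of a fixed policy.

    step(s, rng) samples the next state under the policy and
    reward(s, s_next) returns the reward of that transition.
    """
    rng = random.Random(seed)
    value, visits = {}, {}
    s = start
    for _ in range(n_steps):
        s_next = step(s, rng)
        r = reward(s, s_next)
        visits[s] = visits.get(s, 0) + 1
        alpha = 1.0 / visits[s]                      # step size at the k-th visit to s
        target = r + gamma * value.get(s_next, 0.0)  # TD target
        value[s] = value.get(s, 0.0) + alpha * (target - value.get(s, 0.0))
        s = s_next
    return value


# Hypothetical chain: the next state is "A" or "B" with probability 1/2 each,
# the reward is -1 when leaving "A" and -2 when leaving "B".
def step(s, rng):
    return rng.choice(("A", "B"))


def reward(s, s_next):
    return -1.0 if s == "A" else -2.0


estimates = td0_evaluate(step, reward, gamma=0.5, start="A", n_steps=200_000)
print(estimates)
```

The exact values follow from the Bellman equation for this chain, so the output can be compared against $v(A) = -2.5$ and $v(B) = -3.5$.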

TD control algorithms
The one-step TD method described above gives us an estimate of the value function for a given policy $\pi$. To find the optimal policy using TD learning, a TD control algorithm, such as SARSA or Q-learning, can be used. The goal of these algorithms is to estimate the optimal action-value function $q_*(s, a) = \max_\pi q_\pi(s, a)$, where $q_\pi$ is the action-value function for policy $\pi$,
$$q_\pi(s, a) = \mathrm{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t R_{t+1} \,\Big|\, S_0 = s, A_0 = a\Big].$$
To keep a more streamlined presentation, we will here focus on the algorithm SARSA. The main reason for this has to do with the topic of the next section, namely function approximation. While there are some convergence results for SARSA with function approximation, there are none for standard Q-learning with function approximation. In fact, there are examples of divergence when combining off-policy training (as is done in Q-learning) with function approximation, see for example Sutton and Barto (2018, Ch. 11). However, some numerical results for the simple model with standard Q-learning can be found in Section 5, and we do provide complete details on Q-learning in the Supplemental Material, Section 2.
The iterative update for the action-value function, using SARSA, is
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t\big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big).$$
Hence, we need to generate transitions from state-action pairs $(S_t, A_t)$ to state-action pairs $(S_{t+1}, A_{t+1})$ and observe the rewards $R_{t+1}$ obtained during each transition. To do this, we need a behaviour policy, that is a policy that determines which action is taken in the state we are currently in when the transitions are generated. Thus, SARSA gives an estimate of the action-value function $q_\pi$ given the behaviour policy $\pi$. Under the condition that all state-action pairs continue to be updated, and that the behaviour policy is greedy in the limit, it has been shown in Singh et al. (2000) that SARSA converges to the true optimal action-value function if the step size parameter $0 \le \alpha_t \le 1$ satisfies the following stochastic approximation conditions
$$\sum_{k=1}^{\infty} \alpha_{t_k(s,a)} = \infty, \qquad \sum_{k=1}^{\infty} \alpha_{t_k(s,a)}^2 < \infty, \qquad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s),$$
where $t_k(s, a)$ is the time step when a visit in state $s$ is followed by taking action $a$ for the $k$th time.
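A tabular SARSA loop can be sketched in Python as follows. The environment (two surplus states, two premium levels, a default event), the step size schedule $k^{-0.7}$ and the decay of the exploration rate are all hypothetical choices made for this illustration, not the settings used in Section 5.

```python
import random


def sarsa(env_step, actions, start, gamma, n_steps, seed=0):
    """Tabular SARSA with an exploration rate that decays towards zero."""
    rng = random.Random(seed)
    q, visits = {}, {}

    def choose(s, eps):
        # With probability 1 - eps take the greedy action, otherwise another action.
        greedy = max(actions, key=lambda a: q.get((s, a), 0.0))
        others = [a for a in actions if a != greedy]
        return rng.choice(others) if rng.random() < eps else greedy

    s = start
    a = choose(s, 0.5)
    for t in range(n_steps):
        eps = 0.5 / (1.0 + t / 1000.0) ** 0.5
        s_next, r, done = env_step(s, a, rng)
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = visits[(s, a)] ** -0.7       # sum diverges, sum of squares converges
        if done:
            target = r                        # terminal states have value zero
            s_next = start                    # restart a new episode
            a_next = choose(s_next, eps)
        else:
            a_next = choose(s_next, eps)
            target = r + gamma * q.get((s_next, a_next), 0.0)
        q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
        s, a = s_next, a_next
    return q


def toy_step(s, a, rng):
    """Hypothetical environment: returns (next state, reward, default flag)."""
    p_default = {("low", 1): 0.5, ("low", 2): 0.05, ("high", 1): 0.0, ("high", 2): 0.0}[(s, a)]
    p_high = {("low", 1): 0.0, ("low", 2): 0.95, ("high", 1): 0.7, ("high", 2): 1.0}[(s, a)]
    u = rng.random()
    if u < p_default:
        return None, -50.0, True
    if u < p_default + p_high:
        return "high", -float(a ** 2), False
    return "low", -float(a ** 2), False


q = sarsa(toy_step, actions=(1, 2), start="high", gamma=0.9, n_steps=200_000)
greedy = {s: max((1, 2), key=lambda a: q[(s, a)]) for s in ("low", "high")}
print(greedy)
```

The greedy policy extracted from the learned table can be compared with the policy obtained by dynamic programming on the same toy problem, which charges the high premium in the low-surplus state and the low premium otherwise.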
To ensure that all state-action pairs continue to be updated, the behaviour policy needs to be exploratory. At the same time, we want to exploit what we have learned so far by choosing actions that we believe will give us large future rewards. A common choice of policy that compromises in this way between exploration and exploitation is the $\varepsilon$-greedy policy, which with probability $1 - \varepsilon$ chooses the action that maximises the action-value function in the current state, and with probability $\varepsilon$ chooses any other action uniformly at random:
$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon, & a = \arg\max_{a' \in \mathcal{A}(s)} Q(s, a'),\\ \varepsilon / (|\mathcal{A}(s)| - 1), & \text{otherwise}. \end{cases}$$
Another example is the softmax policy
$$\pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in \mathcal{A}(s)} e^{Q(s,a')/\tau}}.$$
To ensure that the behaviour policy $\pi$ is greedy in the limit, it needs to be changed over time towards the greedy policy that maximises the action-value function in each state. This can be accomplished by letting $\varepsilon$ or $\tau$ slowly decay towards zero.
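The two behaviour policies can be written as action probabilities for a fixed state. The Python sketch below assumes at least two available actions and uses hypothetical action values; subtracting the maximum before exponentiating is a numerical safeguard added here and does not change the softmax probabilities.

```python
import math


def epsilon_greedy_probs(q_values, eps):
    """Action probabilities: 1 - eps on the greedy action, eps spread over the others.

    q_values maps each available action to Q(s, a) for a fixed state s.
    Assumes at least two actions.
    """
    greedy = max(q_values, key=q_values.get)
    n_other = len(q_values) - 1
    return {a: (1.0 - eps) if a == greedy else eps / n_other for a in q_values}


def softmax_probs(q_values, tau):
    """Softmax (Boltzmann) action probabilities with temperature tau > 0."""
    m = max(q_values.values())               # subtract the max for numerical stability
    weights = {a: math.exp((v - m) / tau) for a, v in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical action values for three premium levels in one state.
q_values = {1: -3.0, 2: -1.0, 3: -2.0}
eg = epsilon_greedy_probs(q_values, eps=0.1)
sm_warm = softmax_probs(q_values, tau=1.0)
sm_cold = softmax_probs(q_values, tau=0.01)
print(eg, sm_warm, sm_cold)
```

As $\varepsilon$ or $\tau$ decreases, both distributions concentrate on the action with the largest action value, which is the sense in which the behaviour policy becomes greedy in the limit.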

Function approximation
The methods discussed thus far are examples of tabular solution methods, where the value functions can be represented as tables. These methods are suitable when the state and action space are not too large, for example for the simple model in Section 2.2. However, when the state space and/or action space is very large, or even continuous, these methods are not feasible, due to not being able to fit tables of this size in memory, and/or due to the time required to visit all state-action pairs multiple times. This is the case for the intermediate and realistic models presented in Sections 2.1 and 2.3. In both models, we allow the number of contracts written per year to vary, which increases the dimension of the state space. Thus, to solve the optimisation problem for the intermediate and the realistic model, we need approximate solution methods, in order to generalise from the states that have been experienced to other states. In approximate solution methods, the value function v π (s) (or action-value function q π (s, a)) is approximated by a parameterised function,v(s; w) (orq(s, a; w)). When the state space is discrete, it is common to minimise the following objective function, where μ π (s) is the fraction of time spent in state s. For the model without terminal states, μ π is the stationary distribution under policy π . For the model with terminal states, to determine the fraction of time spent in each transient state, we need to compute the expected number of visits η λ,π (s) to each transient state s ∈ S before reaching a terminal (absorbing) state, where λ(s) = P (S 0 = s) is the initial distribution. For ease of notation, we omit λ from the subscript below, and write η π and P π instead of η λ,π and P λ,π . Let p(s|s ) be the probability of transitioning from state s to state s under policy π , that is or, in matrix form, η π = λ + P η π , where P is the part of the transition matrix corresponding to transitions between transient states. 
If we label the states $0, 1, \ldots, |\mathcal{S}|$ (where state 0 represents all terminal states), then $P = (p_{ij} : i, j \in \{1, 2, \ldots, |\mathcal{S}|\})$, where $p_{ij} = p(j \mid i)$. After solving this system of equations, the fraction of time spent in each transient state under policy $\pi$ can be computed according to
$$\mu_\pi(s) = \frac{\eta_\pi(s)}{\sum_{s' \in \mathcal{S}} \eta_\pi(s')}, \quad \text{for all } s \in \mathcal{S}.$$
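As a small illustration of this computation, the following sketch solves the system for the expected number of visits by fixed-point iteration and then normalises to obtain the fraction of time spent in each transient state. The two-state transition matrix and the initial distribution are made-up numbers, and the function names are hypothetical; for a large state space a direct linear solver would be the more natural choice.

```python
def expected_visits(P, lam, tol=1e-13, max_iter=100_000):
    """Solve eta = lam + P^T eta by fixed-point iteration.

    P[i][j] is the probability of moving from transient state i to
    transient state j; rows may sum to less than one, the remainder
    being the probability of absorption in a terminal state.
    """
    n = len(lam)
    eta = list(lam)
    for _ in range(max_iter):
        new = [lam[j] + sum(P[i][j] * eta[i] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, eta)) < tol:
            return new
        eta = new
    raise RuntimeError("fixed-point iteration did not converge")


def time_fractions(eta):
    """Normalise expected visits to fractions of time, mu(s) = eta(s) / sum(eta)."""
    total = sum(eta)
    return [e / total for e in eta]


# Hypothetical example with two transient states (illustrative numbers only).
P = [[0.5, 0.3],
     [0.2, 0.6]]
lam = [1.0, 0.0]  # always start in the first transient state

eta = expected_visits(P, lam)
mu = time_fractions(eta)
```

The iteration converges here because every transient state leaks probability mass to the terminal state, so the powers of $P$ vanish.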
This computation of $\mu_\pi$ relies on the model of the environment being fully known and the transition probabilities explicitly computable, as is the case for the simple model in Section 2.2. However, for the situation at hand, where we need to resort to function approximation and determine $\hat{v}(s; w)$ (or $\hat{q}(s, a; w)$) by minimising (4.2), we cannot explicitly compute $\mu_\pi$. Instead, $\mu_\pi$ in (4.2) is captured by learning incrementally from real or simulated experience, as in semi-gradient TD learning. Using semi-gradient TD learning, the iterative update for the weight vector $w$ becomes
$$w_{t+1} = w_t + \alpha_t \big(R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)\big) \nabla \hat{v}(S_t; w_t).$$
This update can be used to estimate $v_\pi$ for a given policy $\pi$, generating transitions from state to state by taking actions according to this policy. Similarly to standard TD learning (Section 4.1), the target $R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t)$ is an estimate of the true (unknown) $v_\pi(S_{t+1})$. The name "semi-gradient" comes from the fact that the update is not based on the true gradient of $\big(R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)\big)^2$; instead, the target is seen as fixed when the gradient is computed, despite the fact that it depends on the weight vector $w_t$.
As in the previous section, estimating the value function given a specific policy is not our final goal; instead, we want to find the optimal policy. Hence, we need a TD control algorithm with function approximation. One example of such an algorithm is semi-gradient SARSA, which estimates $q_\pi$. The iterative update for the weight vector is
$$w_{t+1} = w_t + \alpha_t \big(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}; w_t) - \hat{q}(S_t, A_t; w_t)\big) \nabla \hat{q}(S_t, A_t; w_t). \qquad (4.3)$$
As with standard SARSA, we need a behaviour policy that generates transitions from state-action pairs to state-action pairs, and that both explores and exploits, for example an ε-greedy or softmax policy. Furthermore, for the algorithm to estimate $q_*$ we need the behaviour policy to be changed over time towards the greedy policy. However, convergence guarantees only exist when using linear function approximation, see Section 4.2.1 below.
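The semi-gradient SARSA update can be sketched in a few lines for the special case of a linear approximator, where the action-value estimate is an inner product of weights and features and its gradient is the feature vector itself. The function names, the numbers, and the convention that the value of a terminal state is zero are assumptions made for this illustration, not taken from the paper.

```python
def q_hat(w, x):
    """Linear action-value estimate, the inner product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sarsa_update(w, x, reward, x_next, alpha, gamma, terminal=False):
    """One semi-gradient SARSA step for a linear approximator.

    The target reward + gamma * q_hat(w, x_next) is treated as a constant,
    so the gradient is taken only through q_hat(w, x), which for a linear
    approximator is the feature vector x itself.
    """
    # Assumption: the value of a terminal state is zero.
    target = reward if terminal else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


# Illustrative numbers only: two features, one transition.
w = [0.0, 0.0]
w = sarsa_update(w, x=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
                 alpha=0.5, gamma=0.9)
```

In a full control loop, the feature vectors would come from the state-action pairs generated by the behaviour policy, and the step size and exploration parameter would be decreased between episodes.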

Linear function approximation
The simplest form of function approximation is linear function approximation. The value function is approximated by $\hat{v}(s; w) = w^\top x(s)$, where $x(s)$ are basis functions. Using the Fourier basis as defined in Konidaris et al. (2011), the $i$th basis function for the Fourier basis of order $n$ is (here $\pi \approx 3.14$ is a number)
$$x_i(s) = \cos\big(\pi\, s^\top c^{(i)}\big),$$
where $s = (s_1, s_2, \ldots, s_k)^\top$, $c^{(i)} = (c^{(i)}_1, \ldots, c^{(i)}_k)^\top$, and $k$ is the dimension of the state space. The $c^{(i)}$'s are given by the $k$-tuples over the set $\{0, \ldots, n\}$, and hence, $i = 1, \ldots, (n + 1)^k$. This means that $x(s) \in \mathbb{R}^{(n+1)^k}$. One-step semi-gradient TD learning with linear function approximation has been shown to converge to a weight vector $w^*$. However, $w^*$ is not necessarily a minimiser of $J$. Tsitsiklis and Van Roy (1997) derive the upper bound
$$J(w^*) \le \frac{1}{1 - \gamma} \min_w J(w).$$
Since $\gamma$ is often close to one, this bound can be quite large.
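To make the construction of the Fourier basis concrete, the following sketch enumerates all coefficient vectors and evaluates the corresponding cosine features for an input that has already been scaled to the unit cube. The function name and the example input are hypothetical, and the ordering of the basis functions (lexicographic in the coefficient vectors) is a choice made for this illustration.

```python
import itertools
import math


def fourier_features(s, n):
    """Order-n Fourier basis for a state s with components scaled to [0, 1].

    Returns the (n + 1)**k features cos(pi * <c, s>), one for each
    coefficient vector c in {0, ..., n}**k, in lexicographic order of c.
    """
    k = len(s)
    return [
        math.cos(math.pi * sum(ci * si for ci, si in zip(c, s)))
        for c in itertools.product(range(n + 1), repeat=k)
    ]


x = fourier_features([0.25, 0.5], n=3)  # 2-dimensional input, 3rd order
```

The first feature corresponds to the zero coefficient vector and is therefore constant, so it plays the role of an intercept. The number of features grows as $(n + 1)^k$, which is what makes the full basis impractical in high dimensions.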
The convergence results for semi-gradient SARSA with linear function approximation depend on what type of policy is used in the algorithm. When using an ε-greedy policy, the weights have been shown to converge to a bounded region and might oscillate within that region, see Gordon (2001). Furthermore, Perkins and Precup (2003) have shown that if the policy improvement operator $\Gamma$ is Lipschitz continuous with constant $L > 0$ and ε-soft, then SARSA will converge to a unique policy. The policy improvement operator maps every $q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ to a stochastic policy and gives the updated policy after iteration $t$ as $\pi_{t+1} = \Gamma(q^{(t)})$, where $q^{(t)}$ corresponds to a vectorised version of the state-action values after iteration $t$, that is $q^{(t)} = x w_t$ for the case where we use linear function approximation, where $x \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times d}$ is a matrix with rows $x(s, a)^\top$, for each $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $d$ is the number of basis functions. That $\Gamma$ is Lipschitz continuous with constant $L$ means that $\|\Gamma(q) - \Gamma(q')\|_2 \le L \|q - q'\|_2$, for all $q, q' \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. That $\Gamma$ is ε-soft means that it produces a policy $\pi = \Gamma(q)$ that is ε-soft, that is $\pi(a \mid s) \ge \varepsilon/|\mathcal{A}|$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. In both Gordon (2001) and Perkins and Precup (2003), the policy improvement operator was not applied at every time step; hence, it is not the online SARSA algorithm considered in the present paper that was investigated. The convergence of online SARSA under the assumption that the policy improvement operator is Lipschitz continuous with a small enough constant $L$ was later shown in Melo et al. (2008). The softmax policy is Lipschitz continuous, see further the Supplemental Material, Section 3.
However, the value of the Lipschitz constant L that ensures convergence depends on the problem at hand, and there is no guarantee that the policy the algorithm converges to is optimal. Furthermore, for SARSA to approximate the optimal action-value function, we need the policy to get closer to the greedy policy over time, for example by decreasing the temperature parameter when using a softmax policy. Thus, the Lipschitz constant L, which is inversely proportional to the temperature parameter, will increase as the algorithm progresses, making the convergence results in Perkins and Precup (2003) and Melo et al. (2008) less likely to hold. As discussed in Melo et al. (2008), this is not an issue specific to the softmax policy. Any Lipschitz continuous policy that over time gets closer to the greedy policy will in fact approach a discontinuous policy, and hence, the Lipschitz constant of the policy might eventually become too large for the convergence result to hold. Furthermore, the results in Perkins and Precup (2003) and Melo et al. (2008) are not derived for a Markov decision process with an absorbing state. Despite this, it is clear from the numerical results in Section 5 that a softmax policy performs substantially better compared to an ε-greedy policy, and for the simple model approximates the true optimal policy well.
The convergence results in Gordon (2001), Perkins and Precup (2003) and Melo et al. (2008) are based on the stochastic approximation conditions
$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty,$$
where $\alpha_t$ is the step size parameter used at time step $t$. Note that when using tabular methods (see Section 4.1), we had a vector of step sizes, one for each state-action pair. Here, this is not the case, for two reasons: a vector of this size might not be possible to store in memory when the state space is large, and we want to generalise from state-action pairs visited to other state-action pairs that are rarely or never visited, which makes the number of visits to each state-action pair less relevant.

Simple model
We use the following parameter values: $\gamma = 0.9$, $N = 10$, $\mu = 5$, $\beta_0 = 10$, $\beta_1 = 1$, $\xi = 0.05$, and $\nu = 1$. This means that the expected yearly total cost for the insurer is 70 and the expected yearly cost per customer is 7. We emphasise that parameter values are meant to be interpreted in suitable units to fit the application in mind. The cost function is given by (5.1).
Remark 5.1 (Cost function). The cost function (5.1) was suggested in Martin-Löf (1994), since it is an increasing, convex function, and thus will lead to the premium being more averaged over time. However, in Martin-Löf (1994), $c_2 = 1.5$ was used in the calculations. We have chosen a slightly lower value of $c_2$, since too extreme rewards can lead to numerical problems when using SARSA with linear function approximation.
Remark 5.2 (Truncation). Truncating the surplus process at 150 does not have a material effect on the optimal policy. However, the minimum value (here -20) will have an effect on the optimal policy for the MDP with a terminal state and should be seen as another parameter value that needs to be chosen to determine the reward signal, see Section 3.2.

Policy iteration
The top row in Figure 1 shows the optimal policy and the stationary distribution under the optimal policy, for the simple model with a constraint on the action space (Section 3.1) using policy iteration. The bottom row in Figure 1 shows the optimal policy and the fraction of time spent in each state under the optimal policy, for the simple model with terminal state (Section 3.2) using policy iteration. In both cases, the premium charged increases as the surplus or the previously charged premium decreases. Based on the fraction of time spent in each state under each of these two policies, we note that in both cases the average premium level is close to the expected cost per contract (7), but the average surplus level is slightly lower when using the policy for the model with a constraint on the action space compared to when using the policy for the model with the terminal state. However, the policies obtained for these two models are quite similar, and since (as discussed in Section 3.2) the model with the terminal state is more appropriate in more realistic settings, we focus the remainder of the analysis only on the model with the terminal state.

Linear function approximation
We have a 2-dimensional state space, and hence, $k + 1 = 3$. When using the Fourier basis we should have $s \in [0, 1]^k$, $a \in [0, 1]$; hence, we rescale the inputs to the unit interval and use $(\tilde{s}_1, \tilde{s}_2, \tilde{a})$ as input. We use a softmax policy, that is
$$\pi(a \mid s) = \frac{e^{\hat{q}(s, a; w)/\tau}}{\sum_{a' \in \mathcal{A}(s)} e^{\hat{q}(s, a'; w)/\tau}},$$
where the temperature parameter $\tau$ is slowly decreased between episodes, with $\tau_t$ denoting the parameter used during episode $t$. This schedule for decreasing the temperature parameter is somewhat arbitrary, and the parameters have not been tuned. The choice of a softmax policy is based on the results in Perkins and Precup (2003) and Melo et al. (2008), discussed in Section 4.2.1. Since a softmax policy is Lipschitz continuous, convergence of SARSA to a unique policy is guaranteed, under the condition that the policy is also ε-soft and that the Lipschitz constant $L$ is small enough. However, since the temperature parameter $\tau$ is slowly decreased, the policy chosen is not necessarily ε-soft for all states and time steps, and the Lipschitz constant increases as $\tau$ decreases. Despite this, our results show that the algorithm converges to a policy that approximates the optimal policy derived with policy iteration well when using a 3rd-order Fourier basis, see the first column in Figure 2. The same cannot be said for an ε-greedy policy. In this case, the algorithm converges to a policy that in general charges a higher premium than the optimal policy derived with policy iteration, see the fourth column in Figure 2. For the ε-greedy policy, the parameter is likewise decreased between episodes, with $\varepsilon_t$ denoting the parameter used during episode $t$.
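A softmax policy of this kind can be sketched as follows. The function names and the action values are hypothetical, and the schedule for decreasing the temperature is deliberately left to the caller; the only implementation detail added here is the subtraction of the largest action value before exponentiating, which does not change the probabilities.

```python
import math
import random


def softmax_policy(q_values, tau):
    """Softmax action probabilities for action values q_values at temperature tau.

    The maximum is subtracted before exponentiating; this leaves the
    probabilities unchanged but avoids overflow when tau is small.
    """
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, tau, rng=random):
    """Draw an action index according to the softmax probabilities."""
    probs = softmax_policy(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 3.0]          # illustrative action values, not from the paper
warm = softmax_policy(q, tau=1.0)
cold = softmax_policy(q, tau=0.1)
```

Comparing `warm` and `cold` shows the effect discussed in the text: as the temperature decreases, the probability mass concentrates on the action with the highest estimated value, so the policy approaches the greedy policy and is no longer ε-soft for a fixed ε.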
The starting state is selected uniformly at random from $\mathcal{S}$. Furthermore, since discounting will lead to rewards after a large number of steps having a very limited effect on the total reward, we run each episode for at most 100 steps, before resetting to a starting state, again selected uniformly at random from $\mathcal{S}$. This has the benefit of diversifying the states experienced, enabling us to achieve an approximate policy that is closer to the policy derived with dynamic programming as seen over the whole state space. The step size parameter $\alpha_t$ used during episode $t$ is determined by an initial value $\alpha_0$ and a parameter $\theta$, with $0 < \theta \le 0.5$. The largest $\alpha_0$ that ensures that the weights do not explode can be found via trial and error. However, the value of $\alpha_0$ obtained in this way coincides with the "rule of thumb" for setting the step size parameter suggested in Sutton and Barto (2018, Ch. 9.6), namely
$$\alpha_0 = \big(E_\pi[x^\top x]\big)^{-1}, \qquad E_\pi[x^\top x] = \sum_{s,a} \mu_\pi(s)\,\pi(a \mid s)\, x(s, a)^\top x(s, a).$$
If $x(S_t, A_t)^\top x(S_t, A_t) \approx E_\pi[x^\top x]$, then this step size ensures that the error (i.e., the difference between the updated estimate $w_{t+1}^\top x(S_t, A_t)$ and the target $R_{t+1} + \gamma w_t^\top x(S_{t+1}, A_{t+1})$) is reduced to zero after one update. Hence, using a step size larger than $\alpha_0 = \big(E_\pi[x^\top x]\big)^{-1}$ risks overshooting the optimum, or even divergence of the algorithm. For the examples studied here, using the Fourier basis of order $n$, the step sizes are as follows. For the simple model, we have used $\alpha_0 = 0.2$ for $n = 1$, $\alpha_0 = 0.07$ for $n = 2$, and $\alpha_0 = 0.03$ for $n = 3$. For the intermediate model, we used $\alpha_0 = 0.06$ for $n = 1$, $\alpha_0 = 0.008$ for $n = 2$, and $\alpha_0 = 0.002$ for $n = 3$. For the realistic model, we have used $\alpha_0 = 0.002$ for $n = 3$. In all cases $\alpha_0 \approx \big(E_\pi[x^\top x]\big)^{-1}$. For $\theta$, we tried values in the set $\{0.001, 0.1, 0.2, 0.3, 0.4, 0.5\}$. For the simple model, the best results were obtained with $\theta = 0.001$ irrespective of $n$. For the intermediate model, we used $\theta = 0.5$ for $n = 1$, $\theta = 0.2$ for $n = 2$, and $\theta = 0.3$ for $n = 3$. For the realistic model, we used $\theta = 0.2$ for $n = 3$.
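The rule of thumb for the initial step size can be illustrated with a short sketch. It assumes that the expectation of $x^\top x$ is approximated by an average over feature vectors visited along a trajectory, which is an assumption of this example rather than the procedure used in the paper; the function name and the feature vectors are hypothetical.

```python
def initial_step_size(feature_vectors):
    """Rule-of-thumb step size: the reciprocal of the average of x^T x."""
    mean_sq_norm = sum(
        sum(xi * xi for xi in x) for x in feature_vectors
    ) / len(feature_vectors)
    return 1.0 / mean_sq_norm


# Illustrative feature vectors standing in for x(S_t, A_t) along a trajectory.
visited = [[1.0, 0.5], [1.0, -0.5]]
alpha0 = initial_step_size(visited)      # each vector has x^T x = 1.25

# One update towards a fixed target with this step size.
w = [0.0, 0.0]
x = visited[0]
target = 2.0
error = target - sum(wi * xi for wi, xi in zip(w, x))
w = [wi + alpha0 * error * xi for wi, xi in zip(w, x)]
estimate_after = sum(wi * xi for wi, xi in zip(w, x))
```

Because the feature vector used in the update has exactly the average squared norm, the updated estimate coincides with the target after a single step, which is the property described in the text. A larger step size would move the estimate past the target.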

Remark 5.3 (Step size). There are automatic methods for adapting the step size. One such method is the Autostep method from Mahmood et al. (2012), a tuning-free version of the Incremental Delta-Bar-Delta (IDBD) algorithm from Sutton (1992). When using this method, with parameters set as suggested by Mahmood et al. (2012), the algorithm performs marginally worse compared to the results below.
Figure 2 shows the optimal policy for the simple model with terminal state using linear function approximation with 3rd-, 2nd-, and 1st-order Fourier basis using a softmax policy, and with 3rd-order Fourier basis using an ε-greedy policy. Comparison with Figure 1 shows that the approximate optimal policy using 3rd-order Fourier basis is close to the optimal policy derived with policy iteration. Using 1st- or 2nd-order Fourier basis also gives a reasonable approximation of the optimal policy, but worse performance. Combining 3rd-order Fourier basis and an ε-greedy policy gives considerably worse performance. The same conclusions can be drawn from Table 2, where we see the expected total discounted reward per episode for these policies, together with the results for the optimal policy derived with policy iteration, the policy derived with Q-learning, and several benchmark policies (see Section 5.1.3). Clearly, the performance of 3rd-order Fourier basis is very close to the performance of the optimal policy derived with policy iteration, and hence, we conclude that the linear function approximation with 3rd-order Fourier basis using a softmax policy appears to converge to approximately the optimal policy. The policy derived with Q-learning shows worse performance than both the 3rd- and 2nd-order Fourier basis, while the number of episodes run for the Q-learning algorithm is approximately a factor 30 bigger than the number of episodes run before convergence of SARSA with linear function approximation.
Hence, even for this simple model, the number of states is too large for the Q-learning algorithm to converge within a reasonable amount of time. Furthermore, we see that all policies derived with linear function approximation using a softmax policy outperform the benchmark policies. Note that the optimal policy derived with policy iteration, the best constant policy, and the myopic policy with the terminal state require full knowledge of the underlying model and the transition probabilities, and the myopic policy with the constraint requires an estimate of the expected surplus one time step ahead, while the policies derived with function approximation or Q-learning only require real or simulated experience of the environment.
To analyse the difference between some of the policies, we simulate 300 episodes for the policy with the 3rd-order Fourier basis, the best constant policy, and the myopic policy with terminal state, p min = 5.8, for a few different starting states, two of which can be seen in Figure 3. A star in the figure corresponds to one or more terminations. The total number of terminations (of 300 episodes) are as follows: S 0 = (−10, 2): Fourier 3:1, best constant: 291, myopic p min = 5.8: 20. S 0 = (50, 7): Fourier 3: 1, best constant: 0, myopic p min = 5.8: 20. For other starting states, the comparison is similar to that in Figure 3. We see that the policy with the 3rd order Fourier basis appears to outperform the myopic policy in all respects, that is on average the premium is lower, the premium is more stable over time, and we have very few defaults. The best constant policy naturally is the most stable, but leads to in general a higher premium compared to the other policies, and will for more strained starting states quickly lead to a large number of terminations. The myopic policy is given by the lowest premium level that satisfies the constraint. For details on how (5.3) is solved, see the Supplemental Material, Section 4.2. For the simple, intermediate, and realistic model, the myopic policy charges the minimum premium level for a large number of states. Since this policy so quickly reduces the premium to the minimum level as the surplus or previously charged premium increases, it is not likely to work that well. Hence, we suggest an additional benchmark policy where we set the minimum premium level to a higher value, p min . Thus, this adjusted myopic policy is given byπ (s) = max{π (s), p min }, where π denotes the policy that solves (5.3). 
Based on simulations of the total expected discounted reward per episode for different values of p min , we conclude that p min = 6.4 achieves the best results for both the simple and intermediate model, and p min = 10.5 achieves the best result for the realistic model.  that achieves the best results based on simulations of the total expected discounted reward per episode.
For the intermediate and realistic model, this myopic policy is too complex and is therefore not a good benchmark.  −19.95, . . . , 149.95, 150.00}. Figure 4 shows the optimal policy for the intermediate model using linear function approximation, with 3rd-, 2nd-, and 1st-order Fourier basis, for $N_t = N_{t-1} = 10$. Comparing the policy with the 3rd-order Fourier basis with the policy with the 2nd-order Fourier basis, the former appears to require a slightly lower premium when the surplus or previously charged premium is very low. The policy with the 1st-order Fourier basis appears quite extreme compared to the other two policies. Comparing the policy with the 3rd-order Fourier basis for $N_t = N_{t-1} = 10$ with the optimal policy for the simple model (bottom row in Figure 1), we note that $\pi_i(g, p, 10, 10) \ne \pi_s(g, p)$, where $\pi_s$ and $\pi_i$ denote, respectively, the policy for the simple and the intermediate model. There is a qualitative difference between these policies, since even given that we are in a state where $N_t = N_{t-1} = 10$ using the intermediate model, the policy from the simple model does not take into account the effect the premium charged will have on the number of contracts issued at time $t + 1$. The policy with 3rd-order Fourier basis for $N_t, N_{t-1} \in \{5, 10, 15\}$ can be seen in Figure 5.

Intermediate model
To determine the performance of the policies for the intermediate model, we simulate the expected total discounted reward per episode for these policies. The results can be seen in Table 3. Here we clearly see that the policy with 3rd-order Fourier basis outperforms the other policies and that the policy with 1st-order Fourier basis performs quite badly, since it is not flexible enough to be used in this more realistic setting. We also compare the policies with the optimal policy derived with policy iteration from the simple model while simulating from the intermediate model. Though this policy performs worse compared to the policy with 3rd- and 2nd-order Fourier basis, it outperforms the policy with 1st-order Fourier basis. Note that the policies derived with function approximation only require real or simulated experience of the environment. The results for the myopic policy in Table 3 use the true parameters when computing the expected value of the surplus. Despite this, the policy derived with the 3rd-order Fourier basis outperforms the myopic policy.
To analyse the difference between some of the policies, we simulate 300 episodes for the policy with the 3rd-order Fourier basis, the best constant policy and the policy from the simple model, for a few different starting states, two of which can be seen in Figure 6. Each star in the figures corresponds to one or more terminations at that time point. The total number of terminations (of 300 episodes) are as follows: $S_0 = (0, 7, 10, 10)$: Fourier 3: 1, best constant: 13, simple: 5. $S_0 = (100, 15, 5, 5)$: Fourier 3: 0, best constant: 0, simple: 10. Comparing the policy with the 3rd-order Fourier basis with the policy from the simple model, we see that it tends to give a lower premium on average and leads to very few defaults, but is slightly more variable compared to the premium charged by the simple policy. This is not surprising, since the simple policy does not take the variation in the number of contracts issued into account. At the same time, this is to the detriment of the simple policy, since it cannot correctly take the risk of the varying number of contracts into account, hence leading to more defaults. For example, for the more strained starting state $S_0 = (-10, 2, 20, 20)$ (not shown in figure), the number of defaults for the policy with the 3rd-order Fourier basis is 91 of 300, and for the simple policy it is 213 of 300. Similarly, for starting state $S_0 = (100, 15, 5, 5)$ (second column in Figure 6), the simple policy will tend to set the premium much too low during the first time step, hence leading to more early defaults compared to for example starting state $S_0 = (0, 7, 20, 20)$ (first column in Figure 6), despite the fact that the latter starting state has a much lower starting surplus.

Realistic model
To estimate the parameters of the model for the cumulative claims payments in (2.8), we use the motor third party liability data considered in Miranda et al. (2012). The data consist of incremental runoff triangles for number of reported accidents and incremental payments, with 10 development years. We have no information on the number of contracts. For parameter estimation only, we assume a constant number of contracts over the ten observed years, that is $N_{t+1} = N$, and that the total number of claims for each accident year is approximately 5% of the number of contracts. We further assume that the total number of claims per accident year is well approximated by the number of reported claims over the first two development years. This leads to an estimate of the number of contracts $N_{t+1}$ in (2.8) of $N = 2.17 \cdot 10^5$. For parameter estimation, we assume that $\mu_{0,t} = \mu_0$ in (2.9). Hence, $\mu_0$ and $\nu_0^2$ are estimated as the sample mean and variance of $(\log C_{t,1})_{t=1}^{10}$. Similarly, $\mu_j$ and $\nu_j^2$ are estimated as the sample mean and variance of $\big(\log(C_{t,j+1}/C_{t,j})\big)_{t=1}^{10-j}$ for $j = 1, \ldots, 8$. Since we only have one observation for $j = 9$, we let $\mu_9 = \log(C_{1,10}/C_{1,9})$ and $\nu_9 = 0$. $c_0$ is estimated by $c_0 = \exp(\mu_0 + \nu_0^2/2)/N \approx 2.64$. The parameter estimates can be seen in Table 1 in the Supplemental Material, Section 5.
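The estimation of the lognormal parameters from a cumulative run-off triangle can be sketched as follows. The triangle below is a made-up 3 x 3 example, not the data from Miranda et al. (2012), the function name is hypothetical, and the use of the unbiased sample variance (divisor one less than the number of observations) is an assumption of this illustration.

```python
import math
import statistics


def estimate_lognormal_parameters(triangle):
    """Estimate (mu_j, nu_j^2) from a cumulative run-off triangle.

    triangle[t][j] is the cumulative payment for accident year t after
    development year j + 1; row t holds len(triangle) - t entries.
    Index 0 uses log C_{t,1}; index j >= 1 uses log(C_{t,j+1} / C_{t,j}).
    """
    n = len(triangle)
    samples = [[math.log(row[0]) for row in triangle]]
    for j in range(1, n):
        samples.append(
            [math.log(row[j] / row[j - 1]) for row in triangle if len(row) > j]
        )
    mu = [statistics.mean(obs) for obs in samples]
    # With a single observation the variance is set to zero, as for j = 9 in the text.
    nu2 = [statistics.variance(obs) if len(obs) > 1 else 0.0 for obs in samples]
    return mu, nu2


# Made-up 3 x 3 cumulative triangle, for illustration only.
toy = [[100.0, 150.0, 165.0],
       [110.0, 154.0],
       [120.0]]
mu, nu2 = estimate_lognormal_parameters(toy)
```

Given an assumed number of contracts, the cost per contract would then follow from the first pair of estimates as the lognormal mean divided by the number of contracts, in line with the expression for $c_0$ above.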
For the model for investment earnings, the parameters are set to $\sigma = 0.03$ and $\mu = \log(1.05) - \sigma^2/2$, which gives similar variation in investment earnings as in the intermediate model (2.7) when the surplus is approximately 50. The remaining parameters are $\gamma = 0.9$, $\beta_0 = 2 \cdot 10^5$, $\beta_1 = 1$, $\eta = 10$, and $b = -0.3$. The parameter $a$ is set so that the expected number of contracts is $2 \cdot 10^5$ when the premium level corresponds to the expected total cost per contract, $\beta_0/(2 \cdot 10^5) + \beta_1 + c_0/\alpha_1 \approx 10.4$. Hence, $a \approx 4.03 \cdot 10^5$. The premium level is truncated and discretised according to $\mathcal{A} = \{0.5, 1.0, \ldots, 29.5, 30.0\}$. The cost function is as before (5.1), but now adjusted to give rewards of similar size as in the simple and intermediate setting. Hence, when computing the reward, the premium is adjusted to lie in $[0.2, 20.0]$.
The number of contracts is truncated according to $\mathcal{N} = \{144170, \ldots, 498601\}$. This is based on the 0.001-quantile of $\mathrm{Pois}(a (\max \mathcal{A})^b)$, and the 0.999-quantile of $\mathrm{Pois}(a (\min \mathcal{A})^b)$. The truncation of the cumulative claims payments $C_{t,j}$ is based on the same quantile levels and lognormal distributions with parameters $\mu = \log(c_0 \min \mathcal{N}) - \nu_0^2/2 + \sum_{k=1}^{j-1} \mu_k$ and $\sigma^2 = \sum_{k=0}^{j-1} \nu_k^2$, and with parameters $\mu = \log(c_0 \max \mathcal{N}) - \nu_0^2/2 + \sum_{k=1}^{j-1} \mu_k$ and $\sigma^2 = \sum_{k=0}^{j-1} \nu_k^2$, respectively, for $j = 1, \ldots, 9$. The truncation for the cumulative claims payments can be seen in Table 2 in the Supplemental Material, Section 5. The surplus is truncated to lie in $[-0.6, 4.5] \cdot 10^6$.
Note that for the realistic model, with 10 development years the state space becomes 13-dimensional. Using the full 3rd-order Fourier basis is not possible since it consists of $4^{14}$ basis functions. We reduce the number of basis functions, allowing for more flexibility where the model is likely to need it. This means less flexibility for variables corresponding to the cumulative payments and no interaction terms between a variable corresponding to a cumulative payment and any other variable. Figure 7 shows the optimal policy for the realistic model using linear function approximation for $N_t, N_{t-1} \in \{1.75, 2.00, 2.50\} \cdot 10^5$, $C_{t-1,1} = c_0 \cdot 2 \cdot 10^5$, and $C_{t-j,j} = c_0 \cdot 2 \cdot 10^5 \prod_{k=1}^{j-1} f_k$ for $j = 2, \ldots, 9$.
To determine the performance of the approximate optimal policy for the realistic model, we simulate the expected total discounted reward per episode for this policy. The results can be seen in Table 4. The approximate optimal policy outperforms all benchmark policies. The best performing benchmark policy, the "interval policy," corresponds to choosing the premium level to be equal to the expected total cost per contract, when the number of contracts is $2 \cdot 10^5$, as long as the surplus lies in the interval $[1.2, 2.8] \cdot 10^6$.
This is based on a target surplus of $2 \cdot 10^6$ and choosing $\varphi \in \{0.1, 0.2, \ldots, 1.0\}$ that results in the best expected total reward, where the interval for the surplus is given by $[1 - \varphi, 1 + \varphi] \cdot 2 \cdot 10^6$. When the surplus $G_t$ is below (above) this interval, the premium is increased (decreased) in order to increase (decrease) the surplus. The premium for this benchmark policy is
$$P_t = \begin{cases} \min\Big\{10.5 + \dfrac{(1 - \varphi) \cdot 2 \cdot 10^6 - G_t}{2 \cdot 10^5},\ \max \mathcal{A}\Big\}, & \text{if } G_t < (1 - \varphi) \cdot 2 \cdot 10^6, \\ 10.5, & \text{if } (1 - \varphi) \cdot 2 \cdot 10^6 \le G_t \le (1 + \varphi) \cdot 2 \cdot 10^6, \\ \max\Big\{10.5 + \dfrac{(1 + \varphi) \cdot 2 \cdot 10^6 - G_t}{2 \cdot 10^5},\ \min \mathcal{A}\Big\}, & \text{if } (1 + \varphi) \cdot 2 \cdot 10^6 < G_t, \end{cases}$$
and rounded to the nearest half integer, to lie in $\mathcal{A}$. As before, the approximate optimal policy outperforms the benchmark policies despite the fact that both the myopic and the interval policy use the true parameters when computing the expected surplus or the expected total cost per contract. A comparison of the approximate optimal policy and the best benchmark policy, including a figure similar to Figure 6, is found in the Supplemental Material, Section 5.
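The interval benchmark policy is simple enough to be written out directly. In the sketch below, the default interval half-width of 0.4 is the value implied by the surplus interval $[1.2, 2.8] \cdot 10^6$ around the target $2 \cdot 10^6$; the function and argument names are hypothetical, and rounding to the premium grid is implemented as rounding to the nearest multiple of 0.5.

```python
def interval_premium(surplus, phi=0.4, target=2e6, base=10.5,
                     contracts=2e5, p_min=0.5, p_max=30.0):
    """Premium under the interval benchmark policy, rounded to the grid A.

    Inside [(1 - phi) * target, (1 + phi) * target] the base premium is
    charged; outside, the shortfall or excess relative to the nearest
    interval end point is spread over `contracts` contracts.
    """
    lower = (1 - phi) * target
    upper = (1 + phi) * target
    if surplus < lower:
        premium = min(base + (lower - surplus) / contracts, p_max)
    elif surplus > upper:
        premium = max(base + (upper - surplus) / contracts, p_min)
    else:
        premium = base
    # Round to the nearest multiple of 0.5 so that the premium lies in A.
    return round(premium * 2) / 2
```

For example, `interval_premium(1.0e6)` charges one unit more than the base premium, since the surplus is $0.2 \cdot 10^6$ below the lower end of the interval and this shortfall is spread over $2 \cdot 10^5$ contracts.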

Conclusion
Classical methods for solving premium control problems are suitable for simple dynamical insurance systems, and the model choice must to a large extent be based on how to make the problem solvable, rather than reflecting the real dynamics of the stochastic environment. For this reason, the practical use of the optimal premium rules derived with classical methods is often limited.
Reinforcement learning methods enable us to solve premium control problems in realistic settings that adequately capture the complex dynamics of the system. Since these techniques can learn directly from real or simulated experience of the stochastic environment, they do not require explicit expressions for transition probabilities. Further, these methods can be combined with function approximation in order to overcome the curse of dimensionality as the state space tends to be large in more realistic settings. This makes it possible to take key features of real dynamical insurance systems into account, for example payment delays and how the number of contracts issued in the future will vary depending on the premium rule. Hence, the optimal policies derived with these techniques can be used as a basis for decisions on how to set the premium for insurance companies.
We have illustrated strengths and weaknesses of different methods for solving the premium control problem for a mutual insurer and demonstrated that given a complex dynamical system, the approximate policy derived with SARSA using function approximation outperforms several benchmark policies. In particular, it clearly outperforms the policy derived with classical methods based on a more simplistic model of the stochastic environment, which fails to take important aspects of a real dynamical insurance system into account. Furthermore, the use of these methods is not specific to the model choices made in Section 2. The present paper provides guidance on how to carefully design a reinforcement learning method with function approximation for the purpose of obtaining an optimal premium rule, which together with models that fit the experience of the specific insurance company allows for optimal premium rules that can be used in practice.
The models may be extended to include dependence on covariates. However, it should be noted that if we want to model substantial heterogeneity among policyholders and consider a large covariate set, then the action space becomes much larger and function approximation also for the policy may become necessary.