
Premium control with reinforcement learning

Published online by Cambridge University Press: 11 April 2023

Lina Palmborg*
Affiliation: Department of Mathematics, Stockholm University, Stockholm 106 91, Sweden
Filip Lindskog
Affiliation: Department of Mathematics, Stockholm University, Stockholm 106 91, Sweden
*Corresponding author. E-mail: lina.palmborg@math.su.se

Abstract

We consider a premium control problem in discrete time, formulated in terms of a Markov decision process. In a simplified setting, the optimal premium rule can be derived with dynamic programming methods. However, these classical methods are not feasible in a more realistic setting, due to the dimension of the state space and the lack of explicit expressions for the transition probabilities. We explore reinforcement learning techniques, using function approximation, to solve the premium control problem for realistic stochastic models. In a simplified setting, we show that the approximate optimal premium rule is close to the true optimal premium rule, and in more realistic settings, where classical approaches fail, we demonstrate that the approximate optimal premium rule outperforms benchmark rules.
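
To make the dynamic programming step concrete, the sketch below runs policy iteration on a toy premium-control MDP (the method used for the simple model in Figure 1): the state is a discretised surplus level, the action is the premium charged, and the surplus moves with premium income minus claims. The surplus grid, premium levels, claim distribution, reward and discount factor are invented for illustration; they are not the model studied in the paper.

```python
# Minimal policy-iteration sketch for a toy premium-control MDP.
# The surplus grid, premium actions, claim distribution and reward
# below are invented placeholders, NOT the paper's model.
import numpy as np

surplus = np.arange(-10, 11)               # discretised surplus levels
n_states = len(surplus)
actions = np.array([0.0, 1.0, 2.0])        # hypothetical premium levels
claims = np.array([0, 1, 2, 3])            # possible aggregate claims
claim_p = np.array([0.3, 0.4, 0.2, 0.1])   # their probabilities
gamma = 0.95                               # discount factor

def step_index(s_idx, a, c):
    """Index of next surplus after premium income a and claim cost c, clipped to the grid."""
    s_next = np.clip(surplus[s_idx] + a - c, -10, 10)
    return int(s_next + 10)

def reward(s_idx, a):
    """Toy reward: penalise high premiums and negative surplus."""
    return -a**2 - max(0.0, -float(surplus[s_idx]))

def q_value(V, s_idx, a_idx):
    """One-step lookahead value of action a_idx in state s_idx."""
    a = actions[a_idx]
    exp_next = sum(p * V[step_index(s_idx, a, c)] for c, p in zip(claims, claim_p))
    return reward(s_idx, a) + gamma * exp_next

policy = np.zeros(n_states, dtype=int)
V = np.zeros(n_states)
stable = False
while not stable:
    # policy evaluation: repeated synchronous backups under the fixed policy
    for _ in range(500):
        V = np.array([q_value(V, s, policy[s]) for s in range(n_states)])
    # policy improvement: greedy with respect to the evaluated value function
    new_policy = np.array([np.argmax([q_value(V, s, a) for a in range(len(actions))])
                           for s in range(n_states)])
    stable = np.array_equal(new_policy, policy)
    policy = new_policy

print(dict(zip(surplus, actions[policy])))  # optimal premium per surplus level
```

Policy evaluation is done here by repeated backups rather than an exact linear solve, which keeps the sketch short; for the handful of states in a toy model either choice converges quickly.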

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of The International Actuarial Association

Table 1. Paid claim amounts from accidents during years $t-2,\dots,t+1$.

Figure 1. Simple model using policy iteration. Top: with constraint. Bottom: with terminal state. First and second column: optimal policy. Third column: fraction of time spent in each state under the optimal policy.

Figure 2. Optimal policy for simple model with terminal state using linear function approximation. First column: 3rd-order Fourier basis. Second column: 2nd-order Fourier basis. Third column: 1st-order Fourier basis. Fourth column: 3rd-order Fourier basis with $\varepsilon$-greedy policy.
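
Figures 2, 4 and 5 use Fourier basis features for linear value-function approximation. In the standard construction (e.g., Konidaris et al., 2011), an order-$n$ basis over a $d$-dimensional state rescaled to $[0,1]^d$ consists of the $(n+1)^d$ features $\cos(\pi\, c\cdot x)$ for multi-indices $c\in\{0,\dots,n\}^d$. A minimal sketch, with placeholder state bounds that are not taken from the paper:

```python
# Sketch of an n-th order Fourier basis feature map for linear
# function approximation. State bounds below are placeholders.
import itertools
import numpy as np

def fourier_features(state, low, high, order):
    """Features cos(pi * c . x) for all multi-indices c in {0,...,order}^d,
    with the state x rescaled componentwise to [0, 1]^d."""
    x = (np.asarray(state, dtype=float) - low) / (high - low)
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(x))))
    return np.cos(np.pi * coeffs @ x)

# e.g. a 2-dimensional state (surplus, premium) and a 3rd-order basis
phi = fourier_features([5.0, 3.0],
                       low=np.array([-10.0, 0.0]),
                       high=np.array([50.0, 10.0]),
                       order=3)
print(phi.shape)   # (16,) = (order + 1) ** d features
# approximate action value: q(s, a) ~ w[a] . phi for per-action weights w[a]
```

With such a linear approximation, an $\varepsilon$-greedy policy simply picks $\arg\max_a w_a\cdot\phi(s)$ with probability $1-\varepsilon$ and a uniformly random action otherwise.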

Table 2. Expected discounted total reward (uniformly distributed starting states) for simple model with terminal state. The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Figure 3. Simple model. Top row: policy with 3rd-order Fourier basis. Bottom row: myopic policy with terminal state, $p_{\min}=5.8$. Left: starting state $S_0=({-}10,2)$. Right: starting state $S_0=(50,7)$. The red line shows the best constant policy. A star indicates at least one termination.

Figure 4. Optimal policy for intermediate model with terminal state using linear function approximation, $N_t=N_{t-1}=10$. Left: 3rd-order Fourier basis. Middle: 2nd-order Fourier basis. Right: 1st-order Fourier basis.

Figure 5. Optimal policy for the intermediate model with terminal state using linear function approximation with 3rd-order Fourier basis, for $N_t,N_{t-1}\in\{5,10,15\}$.

Table 3. Expected discounted total reward based on simulation (uniformly distributed starting states). The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Figure 6. Intermediate model. Top row: policy with 3rd-order Fourier basis. Bottom row: policy from the simple model. Left: starting state $S_0=(0, 7, 20, 20)$. Right: starting state $S_0=(100, 15, 5, 5)$. The red line shows the best constant policy. A star indicates at least one termination.

Figure 7. Optimal policy for realistic model with terminal state using linear function approximation, for $N_t,N_{t-1}\in\{1.75, 2.00, 2.50\}\cdot10^5$, $C_{t-1,1} = c_0\cdot2\cdot 10^5$, and $C_{t-j,j} = c_0\cdot2\cdot 10^5\prod_{k=1}^{j-1}f_k$.
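
The starting states in Figure 7 follow the caption's formula: $C_{t-1,1}=c_0\cdot2\cdot 10^5$ and $C_{t-j,j}=c_0\cdot2\cdot 10^5\prod_{k=1}^{j-1}f_k$, i.e., each older cohort's cumulative claim amount is the base amount scaled by the accumulated development factors. A small worked sketch, with placeholder values for $c_0$ and the $f_k$ (the paper's values are not reproduced here):

```python
# Worked example of the Figure 7 starting-state construction:
# cumulative claim amounts C_{t-j,j} grow from the base amount
# c_0 * 2e5 by the product of development factors f_1, ..., f_{j-1}.
# The values of c_0 and f_k below are placeholders, not the paper's.
from math import prod

c0 = 1.0
base = c0 * 2e5
f = [1.6, 1.2, 1.05]               # hypothetical f_1, f_2, f_3
C = {1: base}                      # C_{t-1,1} = c_0 * 2e5
for j in range(2, 5):
    C[j] = base * prod(f[:j - 1])  # C_{t-j,j} = base * f_1 * ... * f_{j-1}
print(C)
```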

Table 4. Expected discounted total reward based on simulation (uniformly distributed starting states). The right column shows the fraction of episodes that end in the terminal state within 100 time steps.

Supplementary material

Palmborg and Lindskog supplementary material (PDF, 2.6 MB)