Pseudo-model-free hedging for variable annuities via deep reinforcement learning

This paper proposes a two-phase deep reinforcement learning approach for hedging variable annuity contracts with both GMMB and GMDB riders, which can address model miscalibration in Black-Scholes financial and constant force of mortality actuarial market environments. In the training phase, an infant reinforcement learning agent interacts with a pre-designed training environment, collects sequential anchor-hedging reward signals, and gradually learns how to hedge the contracts. As expected, after a sufficient number of training steps, the trained reinforcement learning agent hedges, in the training environment, as well as the correct Delta, while outperforming misspecified Deltas. In the online learning phase, the trained reinforcement learning agent interacts with the market environment in real time, collects single terminal reward signals, and self-revises its hedging strategy. The hedging performance of the further-trained reinforcement learning agent is demonstrated via an illustrative example on a rolling basis to reveal the self-revision capability of the hedging strategy by online learning.


Introduction
Variable annuities are long-term life insurance products, in which policyholders participate in financial investments for profit sharing with insurers. Various guarantees are embedded in these contracts, such as the guaranteed minimum maturity benefit (GMMB), guaranteed minimum death benefit (GMDB), guaranteed minimum accumulation benefit (GMAB), guaranteed minimum income benefit (GMIB), and guaranteed minimum withdrawal benefit (GMWB). According to the Insurance Information Institute in 2020, sales of variable annuity contracts in the United States amounted to, on average, 100.7 billion annually from 2016 to 2020.
Due to their popularity in the market and their dual-risk bearing nature, valuation and risk management of variable annuities have been substantially studied in the literature. By the risk-neutral option pricing approach, to name a few, Milevsky & Posner (2001) studied the valuation of the GMDB rider; valuation and hedging of the GMMB rider under the Black-Scholes (BS) financial market model were covered in Hardy (2003); the GMWB rider was extensively investigated by Milevsky & Salisbury (2006), Dai et al. (2008), and Chen et al. (2008); valuation and hedging of the GMMB rider were studied in Cui et al. (2017) under the Heston financial market model; valuation of the GMMB rider, together with the feature that a contract can be surrendered before its maturity, was examined by Jeon & Kwak (2018), in which optimal surrender strategies were also provided. For a comprehensive review of this approach, see Feng (2018).
Valuation and risk management of variable annuities have recently been advanced via various approaches as well. Trottier et al. (2018) studied the hedging of variable annuities in the presence of basis risk based on a local optimisation method. Chong (2019) revisited the pricing and hedging problem of equity-linked life insurance contracts utilising the so-called principle of equivalent forward preferences. Feng & Yi (2019) compared the dynamic hedging approach to the stochastic reserving approach for the risk management of variable annuities. Moenig (2021a) investigated the valuation and hedging problem of a portfolio of variable annuities via a dynamic programming method. Moenig (2021b) explored the impact of market incompleteness on the policyholder's behaviour. Wang & Zou (2021) solved for the optimal fee structure for the GMDB and GMMB riders. Dang et al. (2020, 2022) proposed and analysed efficient simulation methods for measuring the risk of variable annuities.
Recently, state-of-the-art machine learning methods have been deployed to revisit the valuation and hedging problems of variable annuities at a portfolio level. Gan (2013) proposed a three-step technique, by (i) selecting representative contracts with a clustering method, (ii) pricing these contracts with Monte Carlo (MC) simulation, and (iii) predicting the value of the whole portfolio based on the values of representative contracts with the kriging method. To further boost the efficiency and the effectiveness of selecting and pricing the representative contracts, as well as valuating the whole portfolio, various methods at each of these three steps have been proposed. For instance, Gan & Lin (2015) extended the ordinary kriging method to the universal kriging method; Hejazi & Jackson (2016) used a neural network as the predictive model to valuate the whole portfolio; Gan & Valdez (2018) implemented the generalised beta of the second kind method instead of the kriging method to capture the non-Gaussian behaviour of the market price of variable annuities. See also Gan (2018), Gan & Valdez (2020), Gweon et al. (2020), Liu & Tan (2020), Lin & Yang (2020), Feng et al. (2020), and Quan et al. (2021) for recent developments in this three-step technique. A similar idea has also been applied to the calculation of Greeks and risk measures of a portfolio of variable annuities; see Gan & Lin (2017), Gan & Valdez (2017), and Xu et al. (2018). All of the above works applying machine learning methods involve supervised learning, which requires a pre-labelled dataset (in this case, the set of fair prices of the representative contracts) to train a predictive model.
Other than valuating and hedging variable annuities, supervised learning methods have also been applied in different actuarial contexts. Wüthrich (2018) used a neural network for the chain-ladder factors in the chain-ladder claim reserving model to include heterogeneous individual claim features. Gao & Wüthrich (2019) applied a convolutional neural network to classify drivers using their telematics data. Cheridito et al. (2020) estimated the risk measures of a portfolio of assets and liabilities with a feedforward neural network. The mortality rate forecasting problem has also been studied: the traditional Lee-Carter model was extended to multiple populations using a neural network, while Perla et al. (2021) applied deep learning techniques directly to time-series data of mortality rates. Hu et al. (2022) modified the loss function in tree-based models to improve the predictive performance when applied to imbalanced datasets, which are common in insurance practice.
To the best of our knowledge, this paper is the first work to implement RL algorithms with online learning to hedge contingent claims, particularly variable annuities. Contrary to Xu (2020) and Carbonneau (2021), both of which adapted the state-of-the-art deep hedging (DH) approach in Bühler et al. (2019), this paper is in line with the recent works by Kolm & Ritter (2019) and Cao et al. (2021), while extending them with actuarial components. We shall outline the differences between the RL and DH approaches throughout sections 3 and 4, as well as Appendices A and B. Kolm & Ritter (2019) discretised the action space and implemented RL algorithms for finitely many possible actions; however, as mentioned above, this paper does not discretise the action space but adopts the recently advanced policy gradient method, namely, the proximal policy optimisation (PPO). Compared with Cao et al. (2021), in addition to the actuarial elements, this paper puts forward online learning to self-revise the hedging strategy.
In the illustrative example, we assume that the market environment is the BS financial and constant force of mortality (CFM) actuarial markets, and the focus is on contracts with both GMMB and GMDB riders. Furthermore, we assume that the model of the market environment presumed by the insurer, which shall be supplied as the training environment, is also the BS and the CFM, but with a different set of parameters. That is, while the insurer constructs correct dynamic models of the market environment for the training environment, the parameters in the model are not the same as those in the market environment. Section 2.4 shall set the stage for this illustrative example and shall show that, if the insurer forwardly implements, in the market environment, the incorrect Delta hedging strategy based on their presumed model of the market environment, then its hedging performance for the variable annuities is worse than that of the correct Delta hedging strategy based on the market environment. In sections 4 and 6, this illustrative example shall be revisited using the two-phase RL approach. As we shall see in section 6, the hedging performance of the RL agent is even worse than that of the incorrect Delta at the very beginning of hedging in real time. However, delicate analysis shows that, with a fair number of future trajectories (which are different from simulated scenarios; more details in section 6), the hedging performance of the RL agent becomes comparable with that of the correct Delta within a reasonable amount of time. Therefore, the illustrative example addresses the model miscalibration issue, which is common in practice, in hedging variable annuity contracts with GMMB and GMDB riders in BS financial and CFM actuarial market environments.
This paper is organised as follows. Section 2 formulates the continuous hedging problem for variable annuities, reformulates it in the discrete-time and Markov setting, and motivates as well as outlines the two-phase RL approach. Section 3 discusses the RL approach to hedging variable annuities and provides a self-contained review of RL, particularly the PPO, which is a temporal-difference (TD) policy gradient method, while section 5 presents the implementation details of the online learning phase. Sections 4 and 6 revisit the illustrative example in the training and online learning phases, respectively. Section 7 collates the assumptions underlying the two-phase RL approach for hedging contingent claims, as well as their implications in practice. This paper finally concludes and comments on future directions in section 8.

Classical hedging problem and model-based approach
We first review the classical hedging problem for variable annuities and its model-based solution to introduce some notations and to motivate the RL approach.
There are one risk-free asset and one risky asset in the financial market. Let $B_t$ and $S_t$, for $t \in [0, T]$, be the time-$t$ values of the risk-free asset and the risky asset, respectively. Let $\mathbb{G}^{(1)} = \{\mathcal{G}^{(1)}_t\}_{t \in [0,T]}$ be the filtration which contains all financial market information; in particular, both processes $B = \{B_t\}_{t \in [0,T]}$ and $S = \{S_t\}_{t \in [0,T]}$ are $\mathbb{G}^{(1)}$-adapted. There are $N$ policyholders in the actuarial market. For each policyholder $i = 1, 2, \ldots, N$, denote $T^{(i)}_{x_i}$ as their random future lifetime, who is of age $x_i$ at the current time 0. Define, for each $i = 1, 2, \ldots, N$, and for any $t \geq 0$, $J^{(i)}_t = \mathbb{1}_{\{T^{(i)}_{x_i} > t\}}$ as the corresponding time-$t$ jump value generated by the random future lifetime of the $i$-th policyholder; that is, if the $i$-th policyholder survives at some time $t \in [0, T]$, $J^{(i)}_t = 1$; otherwise, $J^{(i)}_t = 0$. Let $\mathbb{G}^{(2)} = \{\mathcal{G}^{(2)}_t\}_{t \in [0,T]}$ be the filtration which contains all actuarial market information; in particular, all single-jump processes $J^{(i)} = \{J^{(i)}_t\}_{t \in [0,T]}$, for $i = 1, 2, \ldots, N$, are $\mathbb{G}^{(2)}$-adapted.
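The single-jump processes $J^{(i)}$ can be illustrated numerically; the sketch below (hypothetical lifetimes and time grid, not taken from the paper) shows how the survival indicators evolve on a discrete grid:

```python
import numpy as np

def survival_indicators(lifetimes, grid):
    """J^(i)_t = 1{T^(i) > t} for each policyholder i on a time grid."""
    lifetimes = np.asarray(lifetimes)[:, None]   # shape (N, 1)
    grid = np.asarray(grid)[None, :]             # shape (1, K)
    return (lifetimes > grid).astype(int)        # shape (N, K)

# Two policyholders: one dies at t = 0.4, one survives past T = 1.
J = survival_indicators([0.4, 2.5], [0.0, 0.25, 0.5, 0.75, 1.0])
```

Each row of `J` is a single-jump process: it starts at 1 and drops to 0 at (and after) the grid point exceeding the policyholder's lifetime.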
Let $\mathbb{F} = \{\mathcal{F}_t\}_{t \in [0,T]}$ be the filtration which contains all actuarial and financial market information; that is, $\mathbb{F} = \mathbb{G}^{(1)} \vee \mathbb{G}^{(2)}$. Therefore, the filtered probability space is given by $(\Omega, \mathcal{F}, \mathbb{F}, \mathbb{P})$.

Variable annuities with guaranteed minimum maturity benefit and guaranteed minimum death benefit riders
At the current time 0, an insurer writes a variable annuity contract to each of these $N$ policyholders. Each contract is embedded with both GMMB and GMDB riders. Assume that all these $N$ contracts expire at the same fixed time $T$. In the following, fix a generic policyholder $i = 1, 2, \ldots, N$. At the current time 0, the policyholder deposits $F^{(i)}_0$ into their segregated account to purchase $\rho^{(i)} > 0$ shares of the risky asset; that is, $F^{(i)}_0 = \rho^{(i)} S_0$. Assume that the policyholder does not revise the number of shares $\rho^{(i)}$ throughout the effective time of the contract.
For any $t \in [0, T^{(i)}_{x_i} \wedge T]$, the time-$t$ segregated account value of the policyholder is given by $F^{(i)}_t = \rho^{(i)} S_t e^{-m^{(i)} t}$, where $m^{(i)} \in (0, 1)$ is the continuously compounded annualised rate at which the asset-value-based fees are deducted from the segregated account by the insurer. For any $t \in (T^{(i)}_{x_i} \wedge T, T]$, the time-$t$ segregated account value $F^{(i)}_t$ must be 0; indeed, if the policyholder dies before the maturity, i.e. $T^{(i)}_{x_i} < T$, then, due to the GMDB rider with a minimum guarantee $G^{(i)}_D > 0$, the beneficiary inherits $\max\{F^{(i)}_{T^{(i)}_{x_i}}, G^{(i)}_D\}$ at the policyholder's death time $T^{(i)}_{x_i}$ right away. Due to the GMMB rider with a minimum guarantee $G^{(i)}_M > 0$, if the policyholder survives beyond the maturity, i.e. $T^{(i)}_{x_i} \geq T$, the policyholder receives $\max\{F^{(i)}_T, G^{(i)}_M\}$ at the maturity.
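Assuming the fees are deducted continuously so that the account value takes the form $F^{(i)}_t = \rho^{(i)} S_t e^{-m^{(i)} t}$, a minimal numerical sketch (all parameter values illustrative) is:

```python
import math

def account_value(rho, S_t, m, t):
    """Segregated account value F_t = rho * S_t * exp(-m * t):
    rho shares of the risky asset, eroded by the continuously
    deducted asset-value-based fee at annualised rate m."""
    return rho * S_t * math.exp(-m * t)

F0 = account_value(rho=1.0, S_t=100.0, m=0.02, t=0.0)   # equals rho * S_0
F5 = account_value(rho=1.0, S_t=130.0, m=0.02, t=5.0)   # 5 years of fee erosion
```

At $t = 0$ this reduces to $F^{(i)}_0 = \rho^{(i)} S_0$, matching the deposit condition above.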

Net liability of insurer
The liability of the insurer thus has two parts. The liability from the GMMB rider at the maturity for the $i$-th policyholder, where $i = 1, 2, \ldots, N$, is given by $\max\{G^{(i)}_M - F^{(i)}_T, 0\}$ if the $i$-th policyholder survives beyond the maturity, and is 0 otherwise. The liability from the GMDB rider at the death time $T^{(i)}_{x_i}$ for the $i$-th policyholder, where $i = 1, 2, \ldots, N$, is given by $\max\{G^{(i)}_D - F^{(i)}_{T^{(i)}_{x_i}}, 0\}$ if the $i$-th policyholder dies before the maturity, and is 0 otherwise. Therefore, at any time $t \in [0, T]$, the future gross liability of the insurer accumulated to the maturity for these $N$ contracts is given by the sum of these two parts over all policyholders. Denote $V^{GL}_t$, for $t \in [0, T]$, as the time-$t$ value of the discounted (via the risk-free asset $B$) future gross liability of the insurer; if the liability is 0, the value will be 0.
From the asset-value-based fees collected by the insurer, a portion, known as the rider charge, is used to fund the liability due to the GMMB and GMDB riders; the remaining portion is used to cover overhead, commissions, and any other expenses. From the $i$-th policyholder, the rider charge is deducted at a continuously compounded annualised rate $m^{(i)}_e \in (0, m^{(i)})$. Therefore, the cumulative future rider charge to be collected, from any time $t \in [0, T]$ onward, till the maturity, by the insurer from these $N$ policyholders, accrues from the surviving policyholders' segregated accounts. Denote $V^{RC}_t$, for $t \in [0, T]$, as its time-$t$ discounted (via the risk-free asset $B$) value; if the cumulative rider charge is 0, the value will be 0.
Hence, due to these $N$ variable annuity contracts with both GMMB and GMDB riders, for any $t \in [0, T]$, the time-$t$ net liability of the insurer for these $N$ contracts is given by $L_t = V^{GL}_t - V^{RC}_t$. One of the many ways to set the rate $m^{(i)} \in (0, 1)$ for the asset-value-based fees, and the rate $m^{(i)}_e \in (0, m^{(i)})$ for the rider charge, for $i = 1, 2, \ldots, N$, is based on the time-0 net liability of the insurer for the $i$-th policyholder. More precisely, $m^{(i)}$ and $m^{(i)}_e$ are determined via $V^{GL(i)}_0 = V^{RC(i)}_0$, where $V^{GL(i)}_0$ and $V^{RC(i)}_0$ are the time-0 values of, respectively, the discounted future gross liability and the discounted cumulative future rider charge, of the insurer for the $i$-th policyholder.
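As a hedged sketch of how such a fee calibration might be carried out numerically: under the simplifying assumptions of a single policyholder, a GMMB rider only, risk-neutral BS dynamics, and a CFM lifetime (all parameters below are illustrative, not the paper's), the discounted rider charges are linear in $m_e$, so the rate equating the two time-0 values can be solved in one step by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (for illustration only).
r, sigma, m = 0.02, 0.2, 0.03          # risk-free rate, volatility, total fee rate
nu, T, S0, rho, G_M = 0.01, 10.0, 100.0, 1.0, 100.0
n_paths, n_steps = 20000, 120
dt = T / n_steps

# Risk-neutral GBM paths and exponential (CFM) lifetimes.
Z = rng.standard_normal((n_paths, n_steps))
logS = np.log(S0) + np.cumsum((r - 0.5 * sigma**2) * dt
                              + sigma * np.sqrt(dt) * Z, axis=1)
S = np.hstack([np.full((n_paths, 1), S0), np.exp(logS)])
Tx = rng.exponential(1.0 / nu, size=n_paths)

t = np.linspace(0.0, T, n_steps + 1)
F = rho * S * np.exp(-m * t)           # segregated account paths
alive = Tx[:, None] > t                # survival indicators on the grid

# Time-0 value of the GMMB shortfall (paid only if the policyholder survives to T).
V0_GL = np.mean(np.exp(-r * T) * np.maximum(G_M - F[:, -1], 0.0) * alive[:, -1])

# E[ integral_0^{T ∧ Tx} e^{-rt} F_t dt ]: the rider-charge value is m_e times
# this base, so the rate solving V0_RC(m_e) = V0_GL is a simple ratio.
base = np.mean(np.sum(np.exp(-r * t) * F * alive, axis=1) * dt)
m_e = V0_GL / base
```

In practice both $m$ and $m_e$ would be solved jointly and the GMDB rider included; this sketch only shows the linearity trick for the rider-charge side.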

Continuous hedging and hedging objective
The insurer aims to hedge this dual-risk bearing net liability via investing in the financial market. To this end, let $\bar{T}$ be the death time of the last policyholder; that is, $\bar{T} = \max_{i=1,2,\ldots,N} T^{(i)}_{x_i}$, which is random.
While the net liability $L_t$ is defined for any time $t \in [0, T]$, as the difference between the values of the discounted future gross liability and the discounted cumulative future rider charge, $L_t = 0$ for any $t \in (\bar{T} \wedge T, T]$. Indeed, if $\bar{T} < T$, then, for any $t \in (\bar{T} \wedge T, T]$, one has $T^{(i)}_{x_i} < t \leq T$ for all $i = 1, 2, \ldots, N$, and hence, the future gross liability accumulated to the maturity and the cumulative rider charge from time $\bar{T}$ onward are both 0, and so are their values. Therefore, the insurer only hedges the net liability $L_t$ for $t \in [0, \bar{T} \wedge T]$.
Let $H_t$ be the hedging strategy, i.e. the number of shares of the risky asset being held by the insurer, at time $t \in [0, T)$. Hence, $H_t = 0$ for any $t \in (\bar{T} \wedge T, T)$. Let $\mathcal{H}$ be the admissible set of hedging strategies, in which each strategy (i) satisfies an integrability condition with respect to the Lebesgue measure $\mathcal{L}$ on $\mathbb{R}$, and (ii) takes values in $\mathbb{R}$; the condition (ii) indicates that there is not any constraint on the hedging strategies. Let $P_t$ be the time-$t$ value, for $t \in [0, T]$, of the insurer's hedging portfolio. Then $P_0 = 0$, and the portfolio evolves self-financingly, together with the rider charges collected from the $N$ policyholders, as well as the withdrawals for paying the liabilities due to the beneficiaries' inheritance from those policyholders who have already died, for any $t \in (0, T]$; the portfolio value obviously depends on $\{H_s\}_{s \in [0, t)}$. As in Bertsimas et al. (2000), the insurer's hedging objective function at the current time 0 should be given by the root-mean-square error (RMSE) of the terminal profit and loss (P&L), which is, for any $H \in \mathcal{H}$, $\sqrt{\mathbb{E}\left[\left(P_{\bar{T} \wedge T} - L_{\bar{T} \wedge T}\right)^2\right]}$. If the insurer has full knowledge of the objective probability measure $\mathbb{P}$, and hence the correct dynamics of the risk-free asset and the risky asset in the financial market, as well as the correct mortality model in the actuarial market, the optimal hedging strategy, being implemented forwardly, is given by minimising the RMSE of the terminal P&L.

Pitfall of model-based approach
However, having a correct model is usually not the case in practice. Indeed, the insurer, who is the hedging agent above, usually has little information regarding the objective probability measure $\mathbb{P}$ and hence easily misspecifies the financial market dynamics and the mortality model, which will in turn yield poor performance from the supposedly optimal hedging strategy when it is implemented forwardly in the future. Section 2.4 outlines such an illustrative example, which shall be discussed throughout the remainder of this paper.
To rectify this, we propose a two-phase (deep) RL approach to solve for an optimal hedging strategy. In this approach, an RL agent, which is not the insurer themselves but is built by the insurer to hedge on their behalf, does not have any knowledge of the objective probability measure $\mathbb{P}$, the financial market dynamics, or the mortality model; section 2.5 shall explain this approach in detail. Before that, in section 2.3, the classical hedging problem shall first be reformulated as a Markov decision process (MDP) in a discrete-time setting so that RL methods can be implemented. The illustrative example outlined in section 2.4 shall be revisited using the proposed two-phase RL approach in sections 4 and 6.
In the remainder of this paper, unless otherwise specified, all expectation operators shall be taken with respect to the objective probability measure $\mathbb{P}$ and denoted simply as $\mathbb{E}[\cdot]$.
While the hedging agent decides the hedging strategy at the discrete time points, the actuarial and financial market models are continuous. Hence, the net liability $L_t = V^{GL}_t - V^{RC}_t$ is still defined for any time $t \in [0, T]$ as before. Moreover, if $t \in [t_k, t_{k+1})$, for some $k = 0, 1, \ldots, n-1$, $H_t = H_{t_k}$; thus, $P_0 = 0$, and the hedging portfolio value is updated recursively over each interval $[t_k, t_{k+1})$, for $k = 0, 1, \ldots, n-1$. For any $H \in \mathcal{H}$, the hedging objective of the insurer at the current time 0 is $\mathbb{E}\left[\left(P_{t_{\tilde{n}}} - L_{t_{\tilde{n}}}\right)^2\right]$.
Hence, the optimal discrete hedging strategy, being implemented forwardly, is given by
$$H^* \in \arg\min_{H \in \mathcal{H}} \mathbb{E}\left[\left(P_{t_{\tilde{n}}} - L_{t_{\tilde{n}}}\right)^2\right]. \quad (2)$$
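The discrete rebalancing mechanics can be sketched as a self-financing recursion, $P_{t_{k+1}} = H_{t_k} S_{t_{k+1}} + (P_{t_k} - H_{t_k} S_{t_k}) e^{r \Delta t}$; the sketch below omits the rider-charge inflows and death-benefit outflows for brevity, so it only illustrates the rebalancing step (all numbers are hypothetical):

```python
import numpy as np

def hedge_portfolio_terminal(S, H, r, dt, P0=0.0):
    """Roll a discrete self-financing hedge forward: over [t_k, t_{k+1})
    hold H[k] shares of the risky asset, with the residual cash earning
    the risk-free rate. (Rider charges and death benefits omitted.)"""
    P = P0
    for k in range(len(H)):
        cash = P - H[k] * S[k]                 # cash left after buying H[k] shares
        P = H[k] * S[k + 1] + cash * np.exp(r * dt)
    return P

S = np.array([100.0, 101.0, 99.0, 102.0])      # observed asset prices on the grid
H = np.array([0.5, 0.5, 0.5])                  # constant holding for illustration
P_T = hedge_portfolio_terminal(S, H, r=0.0, dt=1.0 / 12)
```

With $r = 0$ the recursion collapses to $P_{t_n} = \sum_k H_{t_k}(S_{t_{k+1}} - S_{t_k})$, which is a quick sanity check on the implementation.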

Markov decision process
An MDP can be characterised by its state space, action space, Markov transition probability, and reward signal. In turn, these derive the value function and the optimal value function, which are equivalently known in optimisation as, respectively, the objective function and the value function, as in the previous sections. In the remainder of this paper, we shall adopt the MDP language.
• (State) Let $\mathcal{X}$ be the state space in $\mathbb{R}^p$, where $p \in \mathbb{N}$. Each state in the state space represents a possible observation with $p$ features in the actuarial and financial markets. Denote $X_{t_k} \in \mathcal{X}$ as the observed state at any time $t_k$, where $k = 0, 1, \ldots, n$; the state should minimally include information related to the number of surviving policyholders $\sum_{i=1}^N J^{(i)}_{t_k}$ and the term to maturity $T - t_k$, in order to terminate the hedging at time $t_{\tilde{n}}$, which is the first time when the number of surviving policyholders reaches 0, or the maturity $t_n = T$ otherwise. The state space shall be specified in sections 4 and 5.
• (Action) Let $\mathcal{A}$ be the action space in $\mathbb{R}$. Each action in the action space is a possible hedging strategy. Denote $H_{t_k}(X_{t_k}) \in \mathcal{A}$ as the action at any time $t_k$, where $k = 0, 1, \ldots, n-1$, which is assumed to be Markovian with respect to the observed state $X_{t_k}$; that is, given the current state $X_{t_k}$, the current action $H_{t_k}(X_{t_k})$ is independent of the past states $X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}$.
In the sequel, for notational simplicity, we simply write $H_{t_k}$ to represent $H_{t_k}(X_{t_k})$, for $k = 0, 1, \ldots, n-1$. If the feature of the number of surviving policyholders $\sum_{i=1}^N J^{(i)}_{t_k} = 0$, for $k = 0, 1, \ldots, n-1$, in the state $X_{t_k}$, then $H_{t_k} = 0$; in particular, for any $t_k$, where $k = \tilde{n}, \tilde{n}+1, \ldots, n-1$, the hedging strategy $H_{t_k} = 0$.
• (Markov property) At any time $t_k$, where $k = 0, 1, \ldots, n-1$, given the current state $X_{t_k}$ and the current hedging strategy $H_{t_k}$, the transition probability distribution of the next state $X_{t_{k+1}}$ in the market is independent of the past states $X_{t_0}, X_{t_1}, \ldots, X_{t_{k-1}}$ and the past hedging strategies $H_{t_0}, H_{t_1}, \ldots, H_{t_{k-1}}$; that is, for any Borel set $B \in \mathcal{B}(\mathcal{X})$,
$$\mathbb{P}\left(X_{t_{k+1}} \in B \mid X_{t_0}, H_{t_0}, \ldots, X_{t_k}, H_{t_k}\right) = \mathbb{P}\left(X_{t_{k+1}} \in B \mid X_{t_k}, H_{t_k}\right).$$
• (Reward) At any time $t_k$, where $k = 0, 1, \ldots, n-1$, given the current state $X_{t_k}$ in the market and the current hedging strategy $H_{t_k}$, a reward signal $R_{t_{k+1}}(X_{t_k}, H_{t_k}, X_{t_{k+1}})$ is received by the hedging agent as a result of the transition to the next state $X_{t_{k+1}}$. The reward signal shall be specified after introducing the (optimal) value function below. In the sequel, occasionally, for notational simplicity, we simply write $R_{t_{k+1}}$ to represent $R_{t_{k+1}}(X_{t_k}, H_{t_k}, X_{t_{k+1}})$, for $k = 0, 1, \ldots, n-1$.
• (State, action, and reward sequence) The states, actions (which are hedging strategies herein), and reward signals form an episode, which is sequentially given by:
$$X_{t_0}, H_{t_0}, R_{t_1}, X_{t_1}, H_{t_1}, R_{t_2}, \ldots, X_{t_{n-1}}, H_{t_{n-1}}, R_{t_n}, X_{t_n}.$$
• (Optimal value function) Based on the reward signals, the value function, at any time $t_k$, where $k = 0, 1, \ldots, n-1$, with the state $x \in \mathcal{X}$, is defined by, for any hedging strategies $H_{t_k}, H_{t_{k+1}}, \ldots, H_{t_{n-1}}$,
$$V(t_k, x) = \mathbb{E}\left[\sum_{j=k}^{n-1} \gamma^{j-k} R_{t_{j+1}} \,\middle|\, X_{t_k} = x\right], \quad (4)$$
where $\gamma \in [0, 1]$ is the discount rate; the value function, at the time $t_n = T$ with the state $x \in \mathcal{X}$, is defined by $V(t_n, x) = 0$. Hence, the optimal discrete hedging strategy, being implemented forwardly, is given by
$$H^* \in \arg\max_{H_{t_0}, \ldots, H_{t_{n-1}}} V(t_0, X_{t_0}). \quad (5)$$
In turn, the optimal value function, at any time $t_k$, where $k = 0, 1, \ldots, n-1$, with the state $x \in \mathcal{X}$, is
$$V^*(t_k, x) = \max_{H_{t_k}, \ldots, H_{t_{n-1}}} V(t_k, x). \quad (6)$$
• (Reward engineering) To ensure that the hedging problem is correctly reformulated with the MDP, the objective in (5) and the negative of that in (2) should coincide; that is,
$$V(t_0, X_{t_0}) = -\mathbb{E}\left[\left(P_{t_{\tilde{n}}} - L_{t_{\tilde{n}}}\right)^2\right]. \quad (7)$$
Hence, two possible constructions for the reward signals are proposed as follows; each choice of the reward signals shall be utilised in one of the two phases in the proposed RL approach.
− (Single terminal reward) An obvious choice is to only have a reward signal from the negative squared terminal P&L; that is, for any time $t_{k+1}$, where $k = 0, 1, \ldots, n-1$,
$$R_{t_{k+1}} = \begin{cases} -\left(P_{t_{\tilde{n}}} - L_{t_{\tilde{n}}}\right)^2 & \text{if } k+1 = \tilde{n}, \\ 0 & \text{otherwise.} \end{cases} \quad (8)$$
Necessarily, the discount rate is given as $\gamma = 1$.
− (Sequential anchor-hedging reward) A less obvious choice is via telescoping the RHS of Equation (7), in that
$$-\left(P_{t_{\tilde{n}}} - L_{t_{\tilde{n}}}\right)^2 = \sum_{k=0}^{\tilde{n}-1} \left[\left(P_{t_k} - L_{t_k}\right)^2 - \left(P_{t_{k+1}} - L_{t_{k+1}}\right)^2\right] - \left(P_{t_0} - L_{t_0}\right)^2.$$
Therefore, when $L_0 = P_0$, another possible construction for the reward signal is, for any time $t_{k+1}$, where $k = 0, 1, \ldots, n-1$,
$$R_{t_{k+1}} = \begin{cases} \left(P_{t_k} - L_{t_k}\right)^2 - \left(P_{t_{k+1}} - L_{t_{k+1}}\right)^2 & \text{if } k+1 \leq \tilde{n}, \\ 0 & \text{otherwise.} \end{cases} \quad (9)$$
Again, the discount rate is necessarily given as $\gamma = 1$. The constructed reward in (9) outlines an anchor-hedging scheme. First, note that, at the current time 0, when $L_0 = P_0$, there is no local hedging error. Then, at each future hedging time before the last policyholder dies and before the maturity, the hedging performance is measured by the local squared P&L, i.e. $(P_{t_k} - L_{t_k})^2$, which serves as an anchor. At the next hedging time, if the local squared P&L is smaller than the anchor, it will be rewarded, i.e. $R_{t_{k+1}} > 0$; however, if the local squared P&L becomes larger, it will be penalised, i.e. $R_{t_{k+1}} < 0$.
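The telescoping identity behind the anchor-hedging construction can be verified numerically; a minimal sketch with hypothetical portfolio and liability paths (with $\gamma = 1$ and $L_0 = P_0$, the summed sequential rewards recover the single terminal reward):

```python
import numpy as np

def anchor_hedging_rewards(P, L):
    """Sequential anchor-hedging rewards
    R_{t_{k+1}} = (P_{t_k} - L_{t_k})^2 - (P_{t_{k+1}} - L_{t_{k+1}})^2."""
    e2 = (np.asarray(P) - np.asarray(L)) ** 2   # local squared P&L (the anchors)
    return e2[:-1] - e2[1:]

P = [0.0, 0.3, -0.1, 0.6]      # hypothetical hedging portfolio values
L = [0.0, 0.1, 0.2, 0.5]       # hypothetical net liabilities; L_0 = P_0

R = anchor_hedging_rewards(P, L)
total = R.sum()                        # undiscounted sum of sequential rewards
terminal = -(P[-1] - L[-1]) ** 2       # the single terminal reward
```

The two quantities coincide path by path, which is exactly why the two reward constructions induce the same value function when $L_0 = P_0$ and $\gamma = 1$.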

Illustrative example
The illustrative example below demonstrates the poor hedging performance of the Delta hedging strategy when the insurer miscalibrates the parameters of the market environment. We consider that the insurer hedges a variable annuity contract, with both GMMB and GMDB riders, of a single policyholder, i.e. $N = 1$, with the contract characteristics given in Table 1. The market environment follows the Black-Scholes (BS) model on the financial side and the constant force of mortality (CFM) on the actuarial side. The risk-free asset earns a constant risk-free interest rate $r > 0$ such that, for any $t \in [0, T]$, $dB_t = r B_t \, dt$, while the value of the risky asset evolves as a geometric Brownian motion such that, for any $t \in [0, T]$, $dS_t = \mu S_t \, dt + \sigma S_t \, dW_t$, where $\mu$ is a constant drift, $\sigma > 0$ is a constant volatility, and $W = \{W_t\}_{t \in [0,T]}$ is a standard Brownian motion. The random future lifetime of the policyholder $T_x$ has a CFM $\nu > 0$; that is, for any $0 \leq t \leq s \leq T$, the conditional survival probability is $\mathbb{P}(T_x > s \mid T_x > t) = e^{-\nu(s-t)}$. Moreover, the Brownian motion $W$ in the financial market and the future lifetime $T_x$ in the actuarial market are independent. Table 2 summarises the parameters in the market environment. Note that the risk-free interest rate, the risky asset initial price, the initial age of the policyholder, and the investment strategy of the policyholder are observable by the insurer.
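A minimal simulation sketch of such a BS-plus-CFM market environment (all parameter values below are illustrative, not those of Table 2):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical market-environment parameters.
mu, sigma, r, nu = 0.08, 0.2, 0.02, 0.015
S0, T, n_steps, n_paths = 100.0, 10.0, 120, 50000
dt = T / n_steps

# Geometric Brownian motion paths for the risky asset (exact log-Euler scheme).
Z = rng.standard_normal((n_paths, n_steps))
S = S0 * np.exp(np.cumsum((mu - 0.5 * sigma**2) * dt
                          + sigma * np.sqrt(dt) * Z, axis=1))

# CFM lifetimes: exponential with rate nu, drawn independently of W.
Tx = rng.exponential(1.0 / nu, size=n_paths)
surv_mc = np.mean(Tx > T)      # Monte Carlo estimate of P(T_x > T)
```

Under the CFM assumption the survival probability has the closed form $e^{-\nu T}$, so the Monte Carlo estimate `surv_mc` should sit close to `np.exp(-nu * T)`, which is a convenient sanity check for the actuarial side of the simulator.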
Based on their best knowledge of the market, the insurer builds a model of the market environment. Suppose that the model happens to be the BS and the CFM, as in the market environment, but the insurer miscalibrates the parameters. Table 3 lists these parameters in the model of the market environment. In particular, the risky asset drift and volatility, as well as the force of mortality constant, are different from those in the market environment; the observable parameters are the same as those in the market environment. At any time $t \in [0, T]$, the value of the hedging portfolio of the insurer is given by (17), with $N = 1$, in which the values of the risky asset and the single-jump process follow the market environment with the parameters in Table 2. At any time $t \in [0, T]$, the value of the net liability of the insurer is given by (16), with $N = 1$, in both the market environment and its model; we defer its detailed derivation to section 4.1, as the model of the market environment, with multiple homogeneous policyholders for effective training, shall be supplied as the training environment. Since the parameters in the model of the market environment (see Table 3) are different from those in the market environment (see Table 2), the net liability evaluated by the insurer using the model is different from that in the market environment. There are two implications. Firstly, the Delta hedging strategy of the insurer using the parameters in Table 3 is incorrect, while the correct Delta hedging strategy should use the parameters in Table 2. Secondly, the asset-value-based fee $m$ and the rider charge $m_e$ given in Table 4, which are determined by the insurer based on the time-0 value of their net liability under Table 3 via the method in section 2.1.3, are mispriced; they would not lead to a zero time-0 value of the net liability in the market environment, which is based on Table 2.
To evaluate the hedging performance of the incorrect Delta strategy by the insurer in the market environment for the variable annuity with contract characteristics in Table 1, 5,000 market scenarios using the parameters in Table 2 are simulated to realise terminal P&Ls. For comparison, the terminal P&Ls by the correct Delta hedging strategy are also obtained. Figure 1 shows the empirical density and cumulative distribution functions of the 5,000 realised terminal P&Ls by each Delta hedging strategy, while Table 5 reports the summary statistics of the empirical distributions, in which RMSE is the estimated RMSE of the terminal P&L similar to (2). In Figure 1(a), the empirical density function of realised terminal P&Ls by the incorrect Delta hedging strategy is more heavy-tailed on the left than that by the correct Delta strategy. In fact, the terminal P&L by the incorrect Delta hedging strategy is stochastically dominated by that by the correct Delta strategy in the first order; see Figure 1(b). Table 5 shows that the terminal P&L by the incorrect Delta hedging strategy has a mean and a median farther from zero, a higher standard deviation, larger left-tail risks in terms of Value-at-Risk and Tail Value-at-Risk, and a larger RMSE than that by the correct Delta strategy.
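The summary statistics reported for the empirical P&L distributions can be computed along the following lines (a sketch; the quantile conventions for VaR and TVaR below are one common choice, not necessarily the paper's exact ones):

```python
import numpy as np

def pnl_summary(pnl, alpha=0.95):
    """Summary statistics of realised terminal P&Ls: mean, median,
    standard deviation, RMSE, and left-tail VaR / TVaR at level alpha."""
    pnl = np.asarray(pnl, dtype=float)
    var = -np.quantile(pnl, 1.0 - alpha)      # loss at the alpha level
    tvar = -pnl[pnl <= -var].mean()           # average loss beyond VaR
    return {
        "mean": pnl.mean(),
        "median": np.median(pnl),
        "std": pnl.std(ddof=1),
        "RMSE": np.sqrt(np.mean(pnl**2)),
        f"VaR{alpha:.0%}": var,
        f"TVaR{alpha:.0%}": tvar,
    }

# Synthetic P&Ls for illustration only (not the paper's simulation output).
stats = pnl_summary(np.random.default_rng(1).normal(-0.5, 2.0, 5000))
```

Two internal consistency checks follow from the definitions: the RMSE is never smaller than the absolute mean (since $\mathrm{RMSE}^2 = \text{mean}^2 + \text{variance}$), and the TVaR is never smaller than the VaR at the same level.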
These observations conclude that, even in a market environment as simple as the BS and the CFM, the incorrect Delta hedging strategy based on the miscalibrated parameters by the insurer does not perform well when it is being implemented forwardly. In general, the hedging performance of model-based approaches depends crucially on the calibration of parameters for the model of the market environment.

Two-phase reinforcement learning approach
In an RL approach, at the current time 0, the insurer builds an RL agent to hedge on their behalf in the future. The agent interacts with a market environment, by sequentially observing states, taking, as well as revising, actions, which are the hedging strategies, and collecting rewards.
Without possessing any prior knowledge of the market environment, the agent needs to explore the environment while exploiting the collected reward signals for effective learning.
An intuitive proposition would be allowing an infant RL agent to learn directly from such a market environment, like the one in section 2.4, moving forward. However, recall that the insurer does not actually know the exact market dynamics in the environment and thus is not able to provide any theoretical model for the net liability to the RL agent. In turn, the RL agent cannot receive any sequential anchor-hedging reward signal in (9) from the environment, but instead receives the single terminal reward signal in (8). Since the rewards, except the terminal one, are all zero, the infant RL agent would learn ineffectively from such sparse rewards; i.e. the RL agent would take a tremendous amount of time to finally learn a nearly optimal hedging strategy in the environment. Most importantly, while the RL agent is exploring and learning from the environment, which is not a simulated one, the insurer could suffer a huge financial burden from any sub-optimal hedging performance.
In view of this, we propose that the insurer should first designate the infant RL agent to interact with and learn from a training environment, which is constructed by the insurer based on their best knowledge of the market, for example, the model of the market environment in section 2.4. Since the training environment is known to the insurer (but is unknown to the RL agent), the RL agent can be supplied with a theoretical model of the net liability, and consequently learn from the sequential anchor-hedging reward signal in (9) of the training environment. Therefore, the infant RL agent would be guided by the net liability to learn effectively from the local hedging errors. After interacting with and learning from the training environment for a period of time, in order to gauge the effectiveness, the RL agent shall be tested for its hedging performance in simulated scenarios from the same training environment. This first phase is called the training phase.

Training Phase:
(i) The insurer constructs the MDP training environment.
(ii) The insurer builds the infant RL agent which uses the PPO algorithm.
(iii) The insurer assigns the RL agent in the MDP training environment to interact and learn for a period of time, during which the RL agent collects the anchor-hedging reward signal in (9).
(iv) The insurer deploys the trained RL agent to hedge in simulated scenarios from the same training environment and documents the baseline hedging performance.
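A stripped-down sketch of step (i), a single-policyholder MDP training environment emitting the anchor-hedging reward (the class name, the `reset`/`step` interface, and all parameter values are illustrative assumptions; a PPO learner would consume the `reset`/`step` transitions):

```python
import numpy as np

class TrainingEnv:
    """Minimal MDP training-environment skeleton: GBM risky asset, CFM
    lifetime, one policyholder. `liability_fn(t, S)` stands in for the
    insurer's theoretical net-liability model, which is what allows
    step() to emit the sequential anchor-hedging reward."""

    def __init__(self, liability_fn, mu=0.05, sigma=0.2, r=0.02, nu=0.01,
                 S0=100.0, T=10.0, n_steps=120, seed=0):
        self.f, self.rng = liability_fn, np.random.default_rng(seed)
        self.mu, self.sigma, self.r, self.nu = mu, sigma, r, nu
        self.S0, self.T, self.n = S0, T, n_steps
        self.dt = T / n_steps

    def reset(self):
        self.k, self.S, self.P = 0, self.S0, 0.0
        self.Tx = self.rng.exponential(1.0 / self.nu)      # CFM lifetime
        self.err2 = (self.P - self.f(0.0, self.S)) ** 2    # initial anchor
        return np.array([self.S, 1.0, self.T])             # (price, alive, term)

    def step(self, H):
        cash = self.P - H * self.S                         # self-financing rebalance
        Z = self.rng.standard_normal()
        self.S *= np.exp((self.mu - 0.5 * self.sigma**2) * self.dt
                         + self.sigma * np.sqrt(self.dt) * Z)
        self.P = H * self.S + cash * np.exp(self.r * self.dt)
        self.k += 1
        t = self.k * self.dt
        alive = float(self.Tx > t)
        err2 = (self.P - self.f(t, self.S)) ** 2
        reward, self.err2 = self.err2 - err2, err2         # anchor-hedging reward
        done = (self.k == self.n) or (alive == 0.0)
        return np.array([self.S, alive, self.T - t]), reward, done

env = TrainingEnv(liability_fn=lambda t, S: 0.0)   # trivial liability for the demo
x0 = env.reset()
x1, R1, done = env.step(0.0)
```

Rider-charge inflows, death-benefit outflows, and the multi-policyholder state are omitted here; the point is only the shape of the interaction loop that a PPO learner would drive.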
If the hedging performance of the trained RL agent in the training environment is satisfactory, the insurer should then proceed to assign it to interact and learn from the market environment. Since the training and market environments are usually different, such as having different parameters as in section 2.4, the initial hedging performance of the trained RL agent in the market environment is expected to diverge from the fine baseline hedging performance in the training environment. However, different from an infant RL agent, the trained RL agent is experienced so that the sparse reward signal in (8) should be sufficient for the agent to revise the hedging strategy, from the nearly optimal one in the training environment to that in the market environment, within a reasonable amount of time. This second phase is called the online learning phase.

Online Learning Phase:
(v) The insurer assigns the RL agent in the market environment to interact and learn in real time, during which the RL agent collects the single terminal reward signal in (8).
This completes the outline of the proposed two-phase RL approach; Figure 2 depicts the above sequence. Several assumptions underlie this two-phase RL approach in order for it to apply effectively to a hedging problem of a contingent claim; as they involve specifics from later sections, we collate their discussion and elaborate their implications in practice in section 7. In the following section, we briefly review the training essentials of RL in order to introduce the PPO algorithm. The details of the online learning phase are deferred to section 5.

Stochastic action for exploration
One of the fundamental ideas in RL is that, at any time t_k, where k = 0, 1, . . . , n − 1, given the current state X_{t_k}, the RL agent does not take a deterministic action H_{t_k} but instead takes a stochastic action, in order to explore the MDP environment and, in turn, learn from the reward signals. The stochastic action is sampled through a so-called policy, which is defined below.
Let P(A) be a set of probability measures over the action space A; each probability measure μ(·) ∈ P(A) maps a Borel set A ∈ B(A) to μ(A) ∈ [0, 1]. The policy π(·) is a mapping from the state space X to the set of probability measures P(A); that is, for any state x ∈ X, π(x) = μ(·) ∈ P(A). The value function and the optimal value function, at any time t_k, where k = 0, 1, . . . , ñ − 1, with the state x ∈ X, are then generalised as in (10), for any policy π(·), together with their terminal values at any time t_k, where k = ñ, ñ + 1, . . . , n − 1. If the set P(A) contains only all Dirac measures over the action space A, which is the case in the DH approach of Bühler et al. (2019) (see Appendix A for more details), the value function and the optimal value function reduce to (4) and (6). With this relaxed setting, solving for the optimal hedging strategy H* boils down to finding the optimal policy π*(·).

Policy approximation and parameterisation
As the hedging problem has an infinite action space A, tabular solution methods for problems with finite state and action spaces (such as Q-learning), and value function approximation methods for problems with an infinite state space but a finite action space (such as deep Q-learning), are not suitable. Instead, a policy gradient method is employed.
To this end, the policy π(·) is approximated and parameterised by the weights θ_p of an artificial neural network (ANN); in turn, denote the policy by π(·; θ_p). The ANN N_p(·; θ_p) (to be defined in (11) below) takes a state x ∈ X as the input vector and outputs the parameters of a probability measure in P(A). In the sequel, the set P(A) contains all Gaussian measures, each of which has a mean c and a variance d², depending on the state input x ∈ X and the ANN weights θ_p. Therefore, for any state x ∈ X, π(x; θ_p) is the Gaussian measure with mean c(x; θ_p) and variance d²(x; θ_p). With such approximation and parameterisation, solving for the optimal policy π* further boils down to finding the optimal ANN weights θ*_p. Hence, denote the value function and the optimal value function in (10) by V(t_k, x; θ_p) and V(t_k, x; θ*_p), for any t_k, where k = 0, 1, . . . , ñ − 1, with x ∈ X. However, the (optimal) value function still depends on the objective probability measure P, the financial market dynamics, and the mortality model, which are unknown to the RL agent. Before formally introducing the policy gradient methods to tackle this issue, we first explicitly construct the ANNs for the approximated policy, as well as for an estimate of the value function (to prepare for the policy gradient algorithm reviewed below).

Network architecture
As alluded to above, in this paper, the ANN involves two parts: the policy network and the value function network.

Policy network
Let N_p be the number of layers of the policy network. For l = 0, 1, . . . , N_p, let d_p^(l) be the dimension of the l-th layer, where the 0-th layer is the input layer, the (1, 2, . . . , N_p − 1)-th layers are hidden layers, and the N_p-th layer is the output layer. In particular, d_p^(0) = p, which is the number of features in the actuarial and financial parts, and d_p^(N_p) = 2, which outputs the mean c and the variance d² of the Gaussian measure. The policy network N_p : R^p → R² is defined as, for any x ∈ R^p, N_p(x) = (W^(N_p) ∘ ψ_p ∘ W^(N_p−1) ∘ · · · ∘ ψ_p ∘ W^(1))(x), (11) where, for l = 1, 2, . . . , N_p, the mapping W^(l) is affine, and the mapping ψ_p is a componentwise activation function. Let θ_p be the parameter vector of the policy network, and in turn denote the policy network in (11) by N_p(x; θ_p), for any x ∈ R^p.
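A minimal NumPy sketch of such a policy network follows. The layer sizes and random initialisation are hypothetical, and the softplus applied to the second output, which keeps the variance positive, is an implementation choice assumed here rather than something specified by the network definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def affine(d_in, d_out):
    """One affine mapping W^(l): small random weights and zero biases (assumed)."""
    return rng.normal(0.0, 0.1, (d_out, d_in)), np.zeros(d_out)

p = 4                                             # number of features in the state
layers = [affine(p, 16), affine(16, 16), affine(16, 2)]   # N_p = 3 layers

def policy_network(x):
    """N_p(x; theta_p): alternate affine maps with a componentwise ReLU."""
    h = np.asarray(x, dtype=float)
    for W, b in layers[:-1]:
        h = np.maximum(W @ h + b, 0.0)            # psi_p = ReLU on hidden layers
    W, b = layers[-1]
    out = W @ h + b                               # output layer of dimension 2
    mean = out[0]
    var = np.log1p(np.exp(out[1]))                # softplus (assumed): variance > 0
    return mean, var

c, d2 = policy_network([0.1, -0.2, 1.0, 0.5])     # mean and variance of the Gaussian
```

The two outputs parameterise the Gaussian measure from which the stochastic action is drawn.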

Value function network
The value function network is constructed similarly to the policy network, except that all subscripts p (policy) are replaced by v (value). In particular, the value function network N_v : R^p → R is defined as, for any x ∈ R^p, N_v(x) = (W^(N_v) ∘ ψ_v ∘ W^(N_v−1) ∘ · · · ∘ ψ_v ∘ W^(1))(x), (12) which models an approximated value function V̂ (see section 3.4 below). Let θ_v be the parameter vector of the value function network, and in turn denote the value function network in (12) by N_v(x; θ_v), for any x ∈ R^p.

Shared layers structure
Since the policy and value function networks should extract features from the input state vector in a similar manner, they are assumed to share the first few layers. More specifically, let N_s < min{N_p, N_v} be the number of shared layers of the policy and value function networks; that is, the first N_s affine mappings of the two networks coincide. Let θ be the parameter vector of the policy and value function networks together. Figure 3 depicts such a shared-layers structure.
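The shared-layers idea can be sketched as one trunk feeding two heads; the sizes below, with a single shared layer, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def affine(d_in, d_out):
    """Affine mapping with small random weights (hypothetical initialisation)."""
    return rng.normal(0.0, 0.1, (d_out, d_in)), np.zeros(d_out)

p, d_hidden = 4, 16
shared = [affine(p, d_hidden)]                    # N_s = 1 shared layer
policy_head = [affine(d_hidden, d_hidden), affine(d_hidden, 2)]   # outputs (c, raw var)
value_head = [affine(d_hidden, d_hidden), affine(d_hidden, 1)]    # outputs V-hat

def forward(x, head):
    """Shared trunk, then head-specific layers; ReLU on all but the output layer."""
    h = np.asarray(x, dtype=float)
    for W, b in shared + head[:-1]:
        h = np.maximum(W @ h + b, 0.0)
    W, b = head[-1]
    return W @ h + b

x = [0.1, -0.2, 1.0, 0.5]
policy_out = forward(x, policy_head)              # two policy parameters
value_out = forward(x, value_head)                # one value estimate
```

The parameter vector θ then collects the shared weights once, together with both heads, so gradient updates to either objective also refine the common feature extractor.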

Proximal policy optimisation: a temporal-difference policy gradient method
A policy gradient method entails that, starting from initial ANN weights θ^(0), and via interacting with the MDP environment to observe the states and collect the reward signals, the RL agent gradually updates the ANN weights by (stochastic) gradient ascent on a certain surrogate performance measure defined for the ANN weights. That is, at each update step u = 1, 2, . . . , the weights θ^(u) are obtained from θ^(u−1) by ascending the estimated gradient of the surrogate performance measure, where the hyperparameter α ∈ [0, 1] is the learning rate of the RL agent and the gradient estimate is based on the experience collected from the environment. REINFORCE, pioneered by Williams (1992), is a Monte Carlo policy gradient method, which updates the ANN weights once per episode. As this paper applies a temporal-difference (TD) policy gradient method, we relegate the review of REINFORCE to Appendix B, where the Policy Gradient Theorem, the foundation of any policy gradient method, is presented.
PPO, pioneered by Schulman et al. (2017), is a TD policy gradient method, which updates the ANN weights by a batch of K ∈ N realisations. At each update step u = 1, 2, . . . , based on the ANN weights θ^(u−1), and thus the policy π(·; θ_p^(u−1)), the RL agent experiences E^(u) ∈ N realised episodes for the K realisations.
• If E^(u) = 1, the K realisations come from a single episode, starting from the time when the episode is initiated in this update, in which h_{t_k}^(u−1), for k = 0, 1, . . . , ñ − 1, is the time-t_k realised hedging strategy sampled from the Gaussian distribution with the mean c(x_{t_k}^(u−1); θ_p^(u−1)) and the variance d²(x_{t_k}^(u−1); θ_p^(u−1)).
• If E^(u) = 2, 3, . . . , the K realisations are collected across the E^(u) episodes in the same manner.
The surrogate performance measure of PPO consists of three components. In the following, fix an update step u = 1, 2, . . . . Inspired by Schulman et al. (2015), in which the time-0 value function difference between two policies is shown to be equal to the expected advantage, together with importance sampling and a KL-divergence constraint reformulation, the first component in the surrogate performance measure of PPO is the clipped objective, in which the importance sampling ratio q^(u−1)(θ_p) compares the current policy against π(·; θ_p^(u−1)); the estimated advantage is evaluated at θ_p = θ_p^(u−1) and bootstrapped through the approximated value function V̂; and the function clip(q^(u−1)(θ_p)) truncates the ratio to a neighbourhood of 1. The approximated value function V̂ is given by the output of the value network, i.e. (12), for k = 0, 1, . . . , ñ − 1.
Similar to REINFORCE in Appendix B, the second component in the surrogate performance measure of PPO minimises the loss between the bootstrapped sum of reward signals and the approximated value function, defined for the cases E^(u) = 1 and E^(u) = 2, 3, . . . accordingly. Finally, to encourage the RL agent to explore the MDP environment, the third component in the surrogate performance measure of PPO is the entropy bonus, defined based on the Gaussian density function. Therefore, the surrogate performance measure of PPO is given by the combination of these three components in (14), where the hyperparameters c_1, c_2 ∈ [0, 1] are the loss coefficients of the RL agent. Its estimated gradient, based on the K realisations, is then computed via automatic differentiation; see, for example, Baydin et al. (2018).
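Under the Gaussian-policy setting, the three components can be sketched on toy arrays. The clipping threshold, loss coefficients, and the advantage, return, and value inputs below are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def gaussian_log_pdf(a, mean, var):
    """Log-density of a Gaussian with the given mean and variance."""
    return -0.5 * (np.log(2 * np.pi * var) + (a - mean) ** 2 / var)

def ppo_surrogate(actions, mean_old, var_old, mean_new, var_new,
                  advantages, returns, values, eps=0.2, c1=0.5, c2=0.01):
    # component 1: clipped importance-sampling objective
    ratio = np.exp(gaussian_log_pdf(actions, mean_new, var_new)
                   - gaussian_log_pdf(actions, mean_old, var_old))
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    policy_obj = np.mean(np.minimum(ratio * advantages, clipped * advantages))
    # component 2: loss between bootstrapped returns and value estimates
    value_loss = np.mean((returns - values) ** 2)
    # component 3: entropy bonus of the Gaussian policy
    entropy = np.mean(0.5 * np.log(2 * np.pi * np.e * var_new))
    return policy_obj - c1 * value_loss + c2 * entropy

K = 8                                   # illustrative batch size
rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, K)             # realised actions under the old policy
L = ppo_surrogate(a, 0.0, 1.0, 0.1, 1.0, rng.normal(size=K),
                  rng.normal(size=K), rng.normal(size=K))
```

When the new and old policies coincide, the ratio is identically 1 and the first component reduces to the mean advantage, which is a quick sanity check on any implementation.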

Illustrative Example Revisited: Training Phase
Recall that, in the training phase, the insurer constructs a model of the market environment for an MDP training environment, while the RL agent, which does not know any specifics of this MDP environment, observes states and receives the anchor-hedging reward signals in (9) from it and hence gradually learns the hedging strategy by the PPO algorithm reviewed in the last section. This section revisits the illustrative example in section 2.4 via the two-phase RL approach in the training phase.

Markov decision process training environment
The model of the market environment is the BS in the financial part and the CFM in the actuarial part. However, unlike the model following the market environment, in which a single contract is written to a single policyholder, for effective training, the insurer writes identical contracts to N homogeneous policyholders in the training environment. Because of the homogeneity of the contracts and the policyholders, the corresponding contract quantities coincide for all i = 1, 2, . . . , N. At any time t ∈ [0, T], the future gross liability of the insurer accumulated to the maturity, and its time-t discounted value, are valued under the probability measure Q defined on (Ω, F), an equivalent martingale measure with respect to P. Herein, the probability measure Q is chosen to be the product measure of the individual equivalent martingale measures in the actuarial and financial parts, which implies the independence between the Brownian motion W and the future lifetimes τ_x, clarifying the first term in the second equality of the discounted value; the second term in that equality is due to the fact that, for i = 1, 2, . . . , N, the single-jump process J^(i) is F-adapted. Under the probability measure Q, all future lifetimes are identically distributed and have a CFM ν > 0, which are the same as those under the probability measure P in section 2.4. Therefore, for each policyholder i = 1, 2, . . . , N, by the independence and the Markov property, the conditional survival probability over any 0 ≤ t ≤ s ≤ T is e^{−ν(s−t)}. Moreover, under the probability measure Q, the risky asset dynamics are driven by W^Q = (W^Q_t)_{t ∈ [0,T]}, the standard Brownian motion under the probability measure Q.
Hence, the time-t value of the discounted future gross liability, for t ∈ [0, T], admits a closed form. As for the cumulative future rider charge to be collected by the insurer from any time t ∈ [0, T] onward, it is given by Σ_{i=1}^{N} ∫_t^T m_e F_s J_s^(i) e^{r(T−s)} ds, and its time-t discounted value follows, again by the independence and the Markov property. Under the probability measure Q, E^Q[F_s | F_t] = e^{(r−m)(s−t)} F_t. Together with (15), the time-t net liability of the insurer, for t ∈ [0, T], is obtained, which contributes parts of the reward signals in (9). The time-t value of the insurer's hedging portfolio, for t ∈ [0, T], as in (1), is given by (17), with P_0 = 0 and t ∈ (t_k, t_{k+1}], for some k = 0, 1, . . . , n − 1; this is also supplied to the reward signals in (9). At each time t_k, where k = 0, 1, . . . , n, the RL agent observes four features from this MDP environment, which are summarised in the state vector (18). The first feature is the natural logarithm of the segregated account value of the policyholder. The second feature is the hedging portfolio value of the insurer, normalised by the initial number of policyholders. The third feature is the ratio of the number of surviving policyholders to the initial number of policyholders. These features are either log-transformed or normalised to prevent the RL agent from exploring and learning from features with high variability. The last feature is the term to maturity. In particular, when either the third or the last feature first hits zero, i.e. at time t_ñ, an episode is terminated, which determines the state space. Recall that, at each time t_k, where k = 0, 1, . . . , ñ − 1, with the state vector (18) being the input, the output of the policy network in (11) is the mean c(X_{t_k}; θ_p) and the variance d²(X_{t_k}; θ_p) of a Gaussian measure; herein, the Gaussian measure represents the distribution of the average number of shares of the risky asset being held by the insurer at the time t_k for each surviving policyholder. Hence, for k = 0, 1, . . . , ñ − 1, the hedging strategy H_{t_k} in (17) is given via the per-survivor holding sampled from the Gaussian measure. Since the hedging strategy is assumed to be Markovian with respect to the state vector, it can be shown, albeit tediously, that the state vector in (18) and the hedging strategy together satisfy the Markov property in (3).
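The four features and the Gaussian action draw can be sketched as follows; the numerical inputs are illustrative, and the aggregation of the sampled per-survivor holding over the surviving policyholders is an assumption about how the total position is formed.

```python
import math, random

def state_vector(fund_value, portfolio_value, survivors, n_initial, term_to_maturity):
    """Four features of the state: log fund value, normalised portfolio value,
    survivor ratio, and term to maturity."""
    return (math.log(fund_value),
            portfolio_value / n_initial,
            survivors / n_initial,
            term_to_maturity)

def hedging_action(mean, var, survivors, rng):
    """Sample a per-survivor holding from the Gaussian policy, then aggregate;
    the aggregation over survivors is an assumption for illustration."""
    per_policyholder = rng.gauss(mean, math.sqrt(var))
    return survivors * per_policyholder

rng = random.Random(3)
x = state_vector(fund_value=100.0, portfolio_value=40.0,
                 survivors=480, n_initial=500, term_to_maturity=7.5)
H = hedging_action(mean=0.6, var=0.01, survivors=480, rng=rng)
```

An episode terminates as soon as the third feature (survivor ratio) or the fourth feature (term to maturity) first hits zero.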
Also recall that the infant RL agent is trained in the MDP environment with multiple homogeneous policyholders. The RL agent should then effectively update the ANN weights θ and learn the hedging strategies, via a more direct inference on the force of mortality from the third feature in the state vector. The RL agent hedges daily, so that the difference between consecutive discrete hedging times is δt_k = t_{k+1} − t_k = 1/252, for k = 0, 1, . . . , n − 1. In this MDP training environment, the parameters of the model are given in Table 3, but with N = 500.

Building reinforcement learning agent
After constructing this MDP training environment, the insurer builds the RL agent, which implements the PPO reviewed in section 3.4. Table 6(a) summarises all hyperparameters of the implemented PPO, three of which are determined via grid search, while the remaining two are fixed a priori since they alter the surrogate performance measure itself and thus should not be chosen by grid search. Table 6(b) outlines the hyperparameters of the ANN architecture in section 3.3, which are all pre-specified; therein, ReLU stands for Rectified Linear Unit, that is, the componentwise activation function is given by, for any z ∈ R, ψ(z) = max{z, 0}.

Training of reinforcement learning agent
With all of these set up, the insurer assigns the RL agent to experience this MDP training environment, in order to observe the states, decide, as well as revise, the hedging strategy, and collect the anchor-hedging reward signals based on (9), as much as possible. Let U ∈ N be the number of update steps on the ANN weights in the training environment. Hence, the policy of the experienced RL agent is given by π(·; θ^(U)) = π(·; θ_p^(U)). Figure 4 depicts the training log of the RL agent in terms of the bootstrapped sum of rewards and the batch entropy. In particular, Figure 4(a) shows that the value function in (2) reduces to almost zero after around 10^8 training timesteps, which is equivalent to around 48,828 update steps for the ANN weights; within the same number of training timesteps, Figure 4(b) illustrates a gradual depletion of the batch entropy; hence, the Gaussian measure gently becomes more concentrated around its mean, which implies that the RL agent progressively diminishes its degree of exploration of the MDP training environment, while increasing its degree of exploitation of the learned ANN weights.
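The link between entropy depletion and policy concentration has a closed form for a Gaussian policy, H = ½ ln(2πe d²), so a falling batch entropy is equivalent to a falling variance; a quick check:

```python
import math

def gaussian_entropy(var):
    """Differential entropy of a Gaussian with variance var: 0.5 * ln(2*pi*e*var)."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

# as the policy concentrates around its mean (variance decreases), entropy decreases
variances = [1.0, 0.5, 0.1, 0.01]
entropies = [gaussian_entropy(v) for v in variances]
```

Because the entropy is monotone in the variance, the depletion in Figure 4(b) directly reflects diminishing exploration.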

Baseline hedging performance
In the final step of the training phase, the trained RL agent is assigned to hedge in simulated scenarios from the same MDP training environment, except that N = 1, which is in line with hedging in the market environment. The trained RL agent takes the deterministic action c(·; θ_p^(U)), which is the mean of the Gaussian measure.
The number of simulated scenarios is 5,000. For each scenario, the insurer documents the realised terminal P&L, i.e. P_{t_ñ} − L_{t_ñ}. After all scenarios are experienced by the trained RL agent, the insurer examines the baseline hedging performance via the empirical distribution and the summary statistics of the realised terminal P&Ls. The baseline hedging performance of the RL agent is also benchmarked against those of other methods, namely the classical Deltas and the DH; see Appendix C for the implemented hyperparameters of the DH training. The following four classical Deltas are implemented in the simulated scenarios from the training environment, in which the (in)correctness of the Deltas is with respect to the training environment:
• (correct) Delta of the CFM actuarial and BS financial models, with the model parameters as in Table 3;
• (incorrect) Delta of the increasing force of mortality (IFM) actuarial and BS financial models, with the model parameters as in Tables 3(b) and 8;
• (incorrect) Delta of the CFM actuarial and Heston financial models, in which the two Brownian motions satisfy d⟨W^(1), W^(2)⟩_t = φ dt, with the model parameters as in Tables 3(b) and 7;
• (incorrect) Delta of the IFM actuarial and Heston financial models, with the model parameters as in Tables 7 and 8.
Figure 5 shows the empirical density and cumulative distribution functions via the 5,000 realised terminal P&Ls by each hedging approach, while Table 9 outlines the summary statistics of these empirical distributions. To clearly illustrate the comparisons, Figure 6 depicts the empirical density functions via the 5,000 pathwise differences of the realised terminal P&Ls between the RL agent and each of the other approaches, while Table 10 lists the summary statistics of these empirical distributions; for example, comparing with the DH approach, the pathwise difference of the realised terminal P&Ls for the e-th simulated scenario, for e = 1, 2, . . . , 5,000, is calculated accordingly. As expected, the baseline hedging performance of the trained RL agent in this training environment is comparable with those of the correct CFM and BS Delta, as well as the DH approach. Moreover, the RL agent outperforms all three incorrect Deltas, which are based on an incorrect IFM actuarial model, an incorrect Heston financial model, or both.

Online Learning Phase
Given the satisfactory baseline hedging performance of the experienced RL agent in the MDP training environment, the insurer finally assigns the agent to interact with and learn from the market environment. To distinguish them from the simulated times in the training environment, let t̃_k, for k = 0, 1, 2, . . . , be the real times when the RL agent decides the hedging strategy in the market environment, such that 0 = t̃_0 < t̃_1 < t̃_2 < · · · , and δt̃_k = t̃_{k+1} − t̃_k = 1/252. Note that the current time t = t̃_0 = 0 and the RL agent shall hedge daily on behalf of the insurer. At the current time 0, the insurer writes a variable annuity contract with the GMMB and GMDB riders to the first policyholder. When this first contract terminates, due to either the death of the first policyholder or the expiration of the contract, the insurer shall write an identical contract, i.e. a contract with the same characteristics, to the second policyholder, and so on. These contract re-establishments ensure that the insurer holds only one written variable annuity contract with the GMMB and GMDB riders at a time, and the RL agent solely hedges the contract being effective at that moment.
In the online learning phase, the trained RL agent carries on with the PPO policy gradient method in the market environment. That is, as in section 3.4, starting from the ANN weights θ^(U) at the current time 0, and via interacting with the market environment to observe the states and collect the reward signals, the RL agent further updates the ANN weights by a batch of K̃ ∈ N realisations and the (stochastic) gradient ascent in (13) with the surrogate performance measure in (14), at each update step.
However, there are subtle differences in applying the PPO in the market environment compared with the training environment. At each further update step v = 1, 2, . . . , based on the ANN weights θ^(U+v−1), and thus the policy π(·; θ_p^(U+v−1)), the RL agent hedges each effective contract of Ẽ^(v) ∈ N realised policyholders for the K̃ realisations. Indeed, the concept of episodes in the training environment, via state re-initiation when one episode ends, is replaced by sequential policyholders in the real-time market environment, via contract re-establishment when one policyholder dies or a contract expires.

For each j = 1, 2, . . . , Ẽ^(v), the time τ̃^(j) is when the last state is observed for the j-th policyholder in this update; necessarily, the batch of K̃ realisations is exhausted across these policyholders. Moreover, the first two features in the state vector (18) are based on the real-time risky asset price realisation from the market, while all features depend on the particular effective policyholder; the state vector in the market environment is given in (19), for ι ∈ N and k = 0, 1, . . . , ñ^(ι) − ñ^(ι−1). Recall also that the reward signals collected from the market environment should be based on (8); that is, for ι ∈ N and k = 0, 1, . . . , ñ^(ι) − ñ^(ι−1), the RL agent receives only the single terminal reward when the contract of an effective policyholder terminates. Table 11 summarises all hyperparameters of the implemented PPO in the market environment, while the hyperparameters of the ANN architecture are still given in Table 6(b). In the online learning phase, the insurer should choose a smaller batch size K̃ compared with that in the training phase; this yields a higher updating frequency of the PPO, ensuring that the experienced RL agent can revise the hedging strategy within a reasonable amount of time. However, fewer realisations in the batch cause less credible updates; hence, the insurer should also tune down the learning rate α̃, from that in the training phase, to reduce the reliance on each further update step.

Illustrative Example Revisited: Online Learning Phase
This section revisits the illustrative example in section 2.4 via the two-phase RL approach in the online learning phase. In the market environment, the policyholders to whom the contracts with both GMMB and GMDB riders are sequentially written are homogeneous. Due to the contract re-establishments to these sequential homogeneous policyholders, the number and age of policyholders are reset to the values in Table 3(b) at each contract inception time. Furthermore, via the approach discussed in section 2.1.3, to determine the fee structures of each contract at its inception time, the insurer relies on the parameters of the model of the market environment in Table 3, except that the risky asset initial price therein is replaced by the risky asset price observed at the contract inception time. Note that the fee structures of the first contract are still given as in Table 4, since the risky asset price observed at t = 0 is exactly the risky asset initial price.
Let V ∈ N be the number of further update steps in the market environment on the ANN weights. In order to showcase that the further trained RL agent with the online learning phase (RLw/OL) can gradually revise the hedging strategy, from the nearly optimal one in the training environment to the one in the market environment, we evaluate the hedging performance of RLw/OL on a rolling basis. That is, right after each further update step v = 1, 2, . . . , V, we first simulate M̃ = 500 market scenarios stemming from the real-time realised state vector, and then, for each hedging strategy S and each future trajectory ω_f, compute the sample mean of the terminal P&L in (20) based on these simulated scenarios, which estimates a conditional expectation taken with respect to the scenarios from that time onward. Figure 8 plots the sample means of the terminal P&L in (20), right after each further update step and implementing each hedging strategy, in two future trajectories. Firstly, notice that, in both future trajectories, the average hedging performance of RLw/oOL (the RL agent without online learning) is even worse than that of ID (the incorrect Delta). Secondly, the average hedging performances of RLw/OL between the two future trajectories are substantially different. In the best-case future trajectory, RLw/OL is able to swiftly self-revise the hedging strategy and hence quickly catch up with the average hedging performance of ID by merely twelve further updates of the ANN weights, as well as with that of CD (the correct Delta) in around two years; however, in the worst-case future trajectory, within 3 years, RLw/OL is not able to improve the average hedging performance to even the level of ID, let alone to that of CD.
In view of the second observation above, the hedging performance of RLw/OL should not be judged on each future trajectory alone; instead, it should be studied across the future trajectories. To this end, for each f = 1, 2, . . . , 1,000, define v_CD(ω_f) as the first further update step such that the sample mean of the terminal P&L by RLw/OL is strictly greater than that by CD, for the future trajectory ω_f; herein, let min ∅ = 26, and also define t_CD(ω_f) = v_CD(ω_f) × K̃/252 as the corresponding number of years. Therefore, the estimated proportion of the future trajectories in which RLw/OL is able to exceed the average hedging performance of CD within 3 years can be computed. For each f = 1, 2, . . . , 1,000, define v_ID(ω_f) and t_ID(ω_f) similarly for comparing RLw/OL with ID. Figure 9 shows the empirical conditional density functions of t_CD and t_ID, both conditional on RLw/OL exceeding the average hedging performance of CD within 3 years. Table 12 lists the summary statistics of the empirical conditional distributions.
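The first-passage definition of v_CD, with the min ∅ = 26 convention and the conversion t = v × K̃/252, can be sketched as follows; the batch size of 30 and the sample-mean sequences below are illustrative stand-ins.

```python
def first_exceeding_step(rl_means, benchmark_means, cap=26):
    """First further update step v at which the RLw/OL sample mean strictly
    exceeds the benchmark's; min of the empty set is the cap (here 26)."""
    for v, (rl, bench) in enumerate(zip(rl_means, benchmark_means), start=1):
        if rl > bench:
            return v
    return cap

def years(v, batch_size=30):
    """t = v * K-tilde / 252; the batch size 30 here is purely illustrative."""
    return v * batch_size / 252

# illustrative rolling sample means of terminal P&L, one entry per update step
rl = [-3.0, -2.0, -1.0, 0.5, 0.7]
cd = [0.2, 0.2, 0.3, 0.3, 0.4]
v_cd = first_exceeding_step(rl, cd)   # first v with RLw/OL strictly above CD
```

Applying this over all 1,000 trajectories yields the conditional distributions of t_CD and t_ID summarised in Figure 9 and Table 12.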
The above analysis neglects the variance, due to the simulated scenarios, of the hedging performance of each hedging strategy. In the following, for each future trajectory, we define a refined first further update step such that the expected terminal P&L by RLw/OL is, with statistical significance, strictly greater than that by CD. To this end, for each f = 1, 2, . . . , 1,000, and v = 1, 2, . . . , 25, consider the null hypothesis that the expected terminal P&L by RLw/OL does not exceed that by S, against the alternative hypothesis that it does, where S = CD or ID; the analysis before supports this choice of the alternative hypothesis. The test statistic and the p-value are defined accordingly, and the refined first further update step is the first v at which the p-value falls below a significance level α*, so that RLw/OL exceeds CD with statistical significance within 3 years. For example, with α* = 0.1, Figure 10 and Table 14 illustrate that, compared with Figure 9 and Table 12, the distributions are right-shifted as well as more spread out, and the summary statistics all increase. Finally, to further examine the hedging performance of RLw/OL in terms of the sample mean of the terminal P&L in (20), as well as to take the random future trajectories into account, Figure 11 shows snapshots of the empirical density functions, among the future trajectories, of the sample mean by each hedging strategy over time at t = 0, 0.6, 1.2, 1.8, 2.4, and 3; Table 15 outlines their summary statistics. Note that, at the current time t = 0, since none of the future trajectories has been realised yet, the empirical density functions are given by Dirac deltas at the corresponding sample means by each hedging strategy, which depend only on the simulated scenarios. As time progresses, one can observe that the empirical density function by RLw/OL gradually shifts to the right, substantially passing the one by ID and almost catching up with the one by CD at t = 1.8. This sheds light on the high probability that RLw/OL is able to self-revise the hedging strategy from a very sub-optimal one to a nearly optimal one close to the CD.
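A one-sided two-sample test of this kind can be sketched on hypothetical P&L samples, using a Welch statistic with a normal approximation to the p-value (an assumption made here for self-containment; with 500 scenarios per sample the approximation is reasonable, though the paper's exact test statistic may differ).

```python
import math

def welch_one_sided_p(sample_a, sample_b):
    """p-value for H1: E[A] > E[B], via a Welch statistic and a normal
    approximation (assumed here; suitable for large samples)."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)   # unbiased variances
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 0.5 * (1 - math.erf(t / math.sqrt(2)))          # P(Z > t)

# hypothetical terminal P&L samples for RLw/OL and CD at one update step
rl = [0.3, 0.1, 0.4, 0.2, 0.5, 0.3, 0.2, 0.4]
cd = [0.0, -0.1, 0.1, 0.0, 0.1, -0.2, 0.0, 0.1]
p = welch_one_sided_p(rl, cd)   # a small p rejects H0 in favour of E[RL] > E[CD]
```

Rejecting at level α* at step v then marks v as the refined first further update step for that trajectory.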
Figure 11. Snapshots of empirical density functions of sample mean of terminal P&L by reinforcement learning agent with online learning phase, reinforcement learning agent without online learning phase, correct Delta, and incorrect Delta at different time points.

Methodological Assumptions and Implications in Practice
To apply the proposed two-phase RL approach to a hedging problem of contingent claims, there are at least four assumptions to be satisfied. This section discusses these assumptions and elaborates their implications in practice.

Observable, sufficient, relevant, and transformed features in state
One of the crucial components in an MDP environment of the training phase or the online learning phase is the state, in which the features provide information from the environment to the RL agent. First, the features must be observable by the RL agent for learning. For instance, in our proposed state vectors (18) and (19), all four features, namely the segregated account value, the hedging portfolio value, the number of surviving policyholders, and the term to maturity, are observable. Any unobservable, albeit desirable, features cannot be included in the state, such as insider information, which could provide a better inference on the future value of a risky asset, or the exact health condition of a policyholder. Second, the observable features in the state should be sufficient for the RL agent to learn. For example, due to the dual-risk bearing nature of the contract in this paper, the proposed state vectors (18) and (19) incorporate both financial and actuarial features; also, the third and the fourth features in the state vectors (18) and (19) inform the RL agent to halt its hedging at the terminal time. However, incorporating sufficient observable features in the state does not imply that every observable feature in the environment should be included; the observable features in the state need to be relevant for learning efficiently. Since the segregated account value and the term to maturity have already been included in the state vectors (18) and (19) as features, the risky asset value and the hedging time carry similar information from the environment and thus would be redundant features in the state. Finally, features in the state which have high variance might be appropriately transformed to reduce the volatility due to exploration. For instance, the segregated account value in the state vectors (18) and (19) is log-transformed in both phases.

Reward engineering
Another crucial component in an MDP environment is the reward, which supplies signals to the RL agent to evaluate its actions, i.e. the hedging strategy, for learning. First, the reward signals, if available, should suggest the local hedging performance. For example, in this paper, the RL agent is provided with the sequential anchor-hedging reward, given in (9), in the training phase; through the net liability value in the MDP training environment, the RL agent often receives a positive (resp. negative) signal for encouragement (resp. punishment), which is more informative than collecting a zero reward. However, any informative reward signals need to be computable from the MDP environment. In this paper, since the insurer does not know the MDP market environment, the RL agent cannot be supplied with the sequential anchor-hedging reward signals, which consist of the net liability values, in the online learning phase, even though they are more informative; instead, the RL agent is given the less informative single terminal reward, given in (8), in the online learning phase, which can be computed from the market environment.

Markov property in state and action
In an MDP environment of the training phase or the online learning phase, the state and action pair needs to satisfy the Markov property as in (3). In the training phase, since the MDP training environment is constructed, the Markov property can be verified theoretically for the state, with the included features in line with section 7.1, and the action, which is the hedging strategy. For example, in this paper, with the model of the market environment being the BS and the CFM, the state vector in (18) and the Markovian hedging strategy satisfy the Markov property in the training phase. Since the illustrative example in this paper assumes that the market environment also follows the BS and the CFM, the state vector in (19) and the Markovian hedging strategy satisfy the Markov property in the online learning phase as well. However, in general, as the market environment is unknown, the Markov property for the state and action pair would need to be checked statistically in the online learning phase, as follows.
After the training phase and before the RL agent proceeds to the online learning phase, historical state and action sequences over a time frame are derived by hypothetically writing identical contingent claims against the historical realisations from the market environment. For instance, historical values of risky assets are publicly available, and an insurer can retrieve the historical survival status of its policyholders with demographic information and medical history similar to those of the policyholder actually being written. These historical samples of the state and action pair are then used to test the hypothesis that the Markov property in (3) holds for the pair in the market environment, by, for example, the test statistics proposed in Chen & Hong (2012). If the Markov property holds statistically, the RL agent can begin the online learning phase. If the property does not hold statistically, the state and action pair should be revised and the training phase revisited; since the hedging strategy is the action in a hedging problem, only the state can be amended, by including more features from the environment. Moreover, during the online learning phase, right after each further update step, new historical state and action sequences over a shifted time frame of the same duration are obtained from the most recent historical realisations of the market environment, with the action samples drawn from the updated policy. These refreshed samples should be used to verify the Markov property statistically on a rolling basis. If the property fails to hold at any time, the state needs to be revised and the RL agent must be re-trained before resuming online learning.
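The rolling verification loop described above can be sketched as follows. The test statistic of Chen & Hong (2012) is not implemented here; `markov_pvalue` is a hypothetical placeholder that would, in practice, return the p-value of such a test applied to the windowed state-action samples.

```python
def markov_pvalue(states, actions):
    """Placeholder for a Markov-property test (e.g. Chen & Hong 2012).
    A real implementation would return the p-value of the test
    statistic; this dummy always 'passes'."""
    return 1.0

def rolling_markov_check(history, window, alpha=0.05):
    """Re-test the Markov property on each most-recent window of
    state-action samples, as after each online update step; return the
    first time index at which the property fails (triggering a state
    revision and re-training), or None if it holds throughout."""
    for t in range(window, len(history) + 1):
        states, actions = zip(*history[t - window:t])
        if markov_pvalue(states, actions) < alpha:
            return t   # property rejected: revise state, re-train
    return None        # property held on every rolling window
```

The shifted time frames of equal duration correspond to the sliding `window` above; only the decision logic is shown, since the testing procedure itself is model-specific.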

Re-establishment of contingent claims in online learning phase
Any contingent claim must have a finite terminal time realisation. In the training phase, that is the time when an episode ends and the state is re-initialised, so that the RL agent can be trained in the training environment for as long as possible. In the online learning phase, by contrast, the market environment, and hence the state, cannot be re-initialised; instead, at each terminal time realisation, the seller re-establishes identical contingent claims with the same contract characteristics, written on (more or less) the same assets, so that the RL agent can be trained in the market environment successively. In this paper, the terms to maturity and the minimum guarantees of all variable annuity contracts in the online learning phase are the same. Moreover, all re-established contracts are written on the same financial risky asset, though the initial values of the asset are given by the real-time realisations in the market environment. Finally, while a new policyholder is written at each contract inception time, these policyholders have similar, if not identical, distributions of their random future lifetimes, as judged from their demographic information and medical history.
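The re-establishment rule can be made concrete with a minimal sketch. The `Contract` record and its fields are illustrative, not notation from this paper: at each terminal time, the term and minimum guarantee carry over unchanged, while the inception value of the risky asset is reset to its real-time market realisation.

```python
from dataclasses import dataclass

@dataclass
class Contract:
    term: float          # years to maturity (same for every re-issue)
    guarantee: float     # minimum guarantee level (same for every re-issue)
    asset_value: float   # risky-asset value at contract inception

def reestablish(prev: Contract, market_asset_value: float) -> Contract:
    """Re-establish an identical contract at a terminal time: same term
    and guarantee, written on the same risky asset, with the inception
    value given by the real-time market realisation. A new policyholder
    with a similar future-lifetime distribution is implicitly written."""
    return Contract(term=prev.term,
                    guarantee=prev.guarantee,
                    asset_value=market_asset_value)
```

Chaining `reestablish` at successive maturities is what lets the online learning phase run indefinitely even though each individual contract has a finite horizon.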

Concluding Remarks and Future Directions
This paper proposed a two-phase deep RL approach which can tackle the practically common model miscalibration in hedging variable annuity contracts with both GMMB and GMDB riders in the BS financial and CFM actuarial market environments. The approach is composed of the training phase and the online learning phase. While the satisfactory hedging performance of the trained RL agent in the training environment was anticipated, the performance of the further trained RL agent in the market environment, via the illustrative example, should be highlighted. First, comparing sample means of terminal P&L from simulated scenarios, in most future trajectories and within a reasonable amount of time, the further trained RL agent was able to outperform both the correct Delta from the market environment and the incorrect Delta from the training environment. Second, through a more delicate hypothesis testing analysis, similar conclusions can be drawn in a fair amount of future trajectories. Finally, snapshots of empirical density functions, across the future trajectories, of the sample means of terminal P&L from simulated scenarios by each hedging strategy, shed light on the high probability that the further trained RL agent is indeed able to self-revise its hedging strategy.
There are at least two future directions stemming from this paper. (I) The market environment in the illustrative example was assumed to follow the BS financial and CFM actuarial models, which turned out to be the same as those designed by the insurer for the training environment, though with different parameters. Moreover, the policyholders were assumed to be homogeneous, in that their survival probabilities and investment behaviours are all the same, with identical contracts of the same minimum guarantee and maturity. In the market environment, the agent only had to hedge one contract at a time, instead of a portfolio of contracts. Obviously, if any of these assumptions is relaxed, the RL agent trained in the current training environment should not be expected to produce satisfactory hedging performance in a market environment. Therefore, the training environment will certainly need to be substantially extended in terms of its sophistication for the trained RL agent to be able to further learn and hedge well in realistic market environments. (II) Beyond this, an even more ambitious question to be addressed is how similar the training and market environments have to be for online learning to self-revise the hedging strategy to be possible, if not efficient. This second direction is related to adapting transfer learning to the variable annuity hedging problem and shall be investigated carefully in the future.

Algorithm 1. Pseudo-code for deep hedging method
Compared with the policy gradient methods introduced in section 3.4, the DH method shows two key differences. First, it assumes that the hedging portfolio value P_{t_ñ}^{(u−1)} is differentiable with respect to υ_a at each update u = 1, 2, . . . . Second, the update of ANN weights does not depend on intermediate rewards collected during an episode; that is, to update the weights, the DH agent has to experience a complete episode to realise the terminal P&L. Therefore, the update frequency of the DH method is lower than that of an RL method with the TD feature. REINFORCE directly takes the time-0 value function V^{(u−1)}(0, x; θ_p), for any x ∈ X, as a part of the surrogate performance measure. In Williams (1992), the Policy Gradient Theorem was proved, which states that
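The distinction between the two gradient estimators can be illustrated on a one-step toy problem, which is not from this paper: a Gaussian policy a ~ N(θ, 1) with a quadratic terminal reward R(a) = −(a − 2)². The DH-style (pathwise) estimator differentiates R along the sampled path, which requires R to be differentiable in the action; the REINFORCE (score-function) estimator never differentiates R, using instead ∇_θ log π(a; θ) = (a − θ).

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(theta, n=100_000):
    """Score-function (REINFORCE) estimate of d/dtheta E[R]: weight each
    reward by the score grad_theta log pi(a; theta) = (a - theta).
    Works even when R itself is not differentiable."""
    a = theta + rng.standard_normal(n)   # sample actions from the policy
    r = -(a - 2.0) ** 2                  # terminal-reward analogue
    return np.mean(r * (a - theta))

def pathwise_grad(theta, n=100_000):
    """DH-style (pathwise) estimate: reparameterise a = theta + eps and
    differentiate the reward through the action, dR/dtheta = -2(a - 2).
    Requires the reward to be differentiable in the action."""
    a = theta + rng.standard_normal(n)
    return np.mean(-2.0 * (a - 2.0))
```

At θ = 0 the true gradient is 4, and both estimators converge to it; the pathwise estimator typically has much lower variance, which mirrors why DH can exploit differentiability of the hedging portfolio value when it is available.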