EXTENDING THE LEE–CARTER MODEL WITH VARIATIONAL AUTOENCODER: A FUSION OF NEURAL NETWORK AND BAYESIAN APPROACH

Abstract In this study, we propose a nonlinear Bayesian extension of the Lee–Carter (LC) model using a single-stage procedure with a dimensionality reduction neural network (NN). LC is originally estimated using a two-stage procedure: dimensionality reduction of the data by singular value decomposition followed by time series model fitting. To address the limitations of LC, which are attributed to the two-stage estimation and insufficient model fit to the data, single-stage procedures using Bayesian state-space (BSS) approaches and extensions of modeling flexibility by NNs have been proposed. As a fusion of these two approaches, we propose a NN extension of LC with a variational autoencoder that performs the variational Bayesian estimation of a state-space model and dimensionality reduction by autoencoding. Despite being a NN model that performs single-stage estimation of parameters, our model has excellent interpretability and the ability to forecast with confidence intervals, as with the BSS models, without using Markov chain Monte Carlo methods.


INTRODUCTION
Longevity risk management and economic valuation of insurance and pension liabilities, which have recently received considerable attention, require statistical mortality models that provide stable long-term predictions even for single populations with limited data. The Lee–Carter (LC) model (Lee and Carter, 1992) is a basic statistical mortality model with desirable properties, wherein the log mortality is determined by the sum of a fixed age impact and a product of a time index and an age sensitivity.
The model fitting procedure of LC originally comprises two stages: singular value decomposition (SVD)-based dimensionality reduction of the log-mortality data and fitting of a time series model to the obtained time components. Various extensions of LC have been proposed in the literature, for example, a Poisson regression version of LC by Brouhns et al. (2002), a cohort extension of LC by Renshaw and Haberman (2006), and a two-factor period effect model by Cairns et al. (2006). Cairns et al. (2009) provides a quantitative comparison of eight stochastic mortality models including LC.
Although LC achieves good interpretability and ease of estimation using SVD and a drifted random walk (RW), its accuracy is limited primarily because of two issues: incoherency between parameters caused by the two-stage estimation and insufficient fitting to the nonlinearity of the data.
To address the first issue, various single-stage estimations of LC in Bayesian settings, implemented by Markov chain Monte Carlo (MCMC) methods, have been proposed. Czado et al. (2005) considers a single-stage Bayesian estimation of LC in a Poisson regression form; estimation in the state-space form was considered by Pedroza (2006), Kogure and Kurachi (2010), Cairns et al. (2011), and Fung et al. (2017). Notable extensions of the Bayesian LC model are proposed by Kogure and Kurachi (2010) in a risk-neutral form and Cairns et al. (2011) in a multi-population form. Moreover, Fung et al. (2017) introduces a general Bayesian state-space (BSS) modeling framework for extensions of LC, which allows stochastic volatility in the period effect.
To address the second issue, many nonlinear extensions of LC using neural network (NN) methods have been proposed. NNs are typically defined by a network structure comprising multiple layers of neurons and an activation function that outputs a transformation of the weighted input to each neuron. As Cybenko (1989) demonstrates that any compactly supported continuous function can be uniformly approximated by a two-layer NN with a continuous sigmoidal activation function, NNs have a universal approximation capability for any function in a broad function class. The most basic NNs are feedforward NNs (FNNs), which have no cyclic connections, whereas NNs that have cyclic connections are called recurrent NNs (RNNs). NNs are also classified into supervised and unsupervised models. Furthermore, the convolutional NN (CNN), a sparsely connected FNN that learns the neighborhood effect of the data, and the RNN are often used for learning sequential data. Richman and Wüthrich (2021) proposes a NN-based generalization of LC using a fully connected network (FCN) with multiple hidden layers, which is called a deep FCN. Perla et al. (2021) considers many supervised NN extensions of LC and shows the superiority of the one-dimensional (1D) CNN over the deep FCN and long short-term memory (LSTM), an RNN suitable for learning long sequential data. Wang et al. (2021) proposes mortality forecasts using two-dimensional (2D) CNN to capture the neighborhood effect of the mortality data. Schnürch and Korn (2021) also considers 2D CNN and achieves mortality forecasts with confidence intervals. These NN approaches calibrate their models with huge training datasets, such as the data of all countries in the Human Mortality Database (HMD; http://www.mortality.org). Such large training datasets contribute to the stability and accuracy of predictions as a countermeasure to the seed robustness and overlearning problems common in NN approaches.
While these NN approaches achieve single-stage estimation with relatively high prediction accuracy, they lose interpretability of the model and are not necessarily suitable for relatively small training data (e.g., single population data).
On the other hand, Hainaut (2018) proposes a replacement of SVD with an unsupervised NN for dimensionality reduction, called NN-analyzer, which is more commonly known as an autoencoder (AE), and Nigri et al. (2019) proposes an application of LSTM to the time components obtained from SVD; both retain the interpretability of the model but with the limitation of a two-stage estimation. Thus, the existing NN approaches for mortality prediction are subject to either a two-stage estimation or a loss of interpretability.
To achieve a single-stage estimation of the parameters without losing interpretability, we introduce a variational AE (VAE) proposed by Kingma and Welling (2013) to the mortality prediction; our method implies a fusion of NN and Bayesian approach. VAE, one of the representative generative NNs (i.e., NNs for generating new data by learning features of training data), performs the variational Bayesian estimation of the state-space model and the AE-based dimensionality reduction simultaneously.
The rest of this study is organized as follows. Section 2 discusses how the existing approaches extend the original LC model and the limitations. Section 3 presents an overview of the VAE approach. We propose a model in a generalized state-space form in Section 4 and discuss how to apply the VAE algorithm to the inference of the proposed model in Section 5. In Section 6, we apply our model to the data from the HMD and present numerical results including the calibration procedure of the model, performance comparison with LC, parameter comparison with LC to show the interpretability of the model, forecasts with confidence intervals, and remarks on the effects of changing the number of neurons in the model. Finally, Section 7 concludes this study.

EXTENSIONS OF LC
LC defines the log-mortality rate at age x in calendar year t as follows:

log m_{x,t} = α_x + β_x κ_t + ε_{x,t},   (2.1)

where α_x is the average log mortality at age x measured over the observation period, and the bilinear term β_x κ_t can be interpreted as the product of an age-specific sensitivity factor and a year-specific mortality improvement factor. The age-specific factor β_x and the year-specific factor κ_t are obtained by the SVD of the log-mortality data net of the age-specific averages {α_x}, where the following model identification constraints are required:

Σ_x β_x = 1,   Σ_t κ_t = 0.   (2.2)

The prediction of future mortality in LC can be performed via a two-stage procedure: obtaining {κ_t} by SVD and then fitting a time series model to {κ_t}.
It is common to apply a drifted RW model determined by:

κ_t = κ_{t−1} + μ + σ ε_t,   (2.3)

where ε_t follows an iid (i.e., independent and identically distributed) standard normal distribution. The interpretability of LC has given rise to various extensions; for example, Renshaw and Haberman (2006) (RH) makes a cohort extension by adding a cohort term β^{(2)}_x γ_{t−x} on the right-hand side of Equation (2.1).
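The two-stage procedure above can be sketched in a few lines. The following is a minimal illustration on synthetic data (not the HMD data or any published code); the identification constraints Σβ_x = 1 and Σκ_t = 0 are imposed after the SVD, and the drifted RW is fitted to the first differences of κ_t.

```python
import numpy as np

# Synthetic log-mortality surface: n_ages x n_years, with a linear time trend.
rng = np.random.default_rng(0)
n_ages, n_years = 5, 40
ages = np.arange(n_ages)[:, None]
years = np.arange(n_years)[None, :]
log_m = -4.0 + 0.05 * ages - 0.02 * years + 0.01 * rng.standard_normal((n_ages, n_years))

# Stage 1: age-specific averages, then SVD of the centered matrix.
alpha = log_m.mean(axis=1)
U, S, Vt = np.linalg.svd(log_m - alpha[:, None], full_matrices=False)
beta, kappa = U[:, 0], S[0] * Vt[0, :]

# Identification: rescale so that sum(beta) = 1 and sum(kappa) = 0.
scale = beta.sum()
beta, kappa = beta / scale, kappa * scale
alpha = alpha + beta * kappa.mean()   # absorb any mean of kappa into alpha
kappa = kappa - kappa.mean()

# Stage 2: drifted RW fitted to kappa; central 10-year forecast.
d = np.diff(kappa)
mu_hat, sigma_hat = d.mean(), d.std(ddof=1)
kappa_forecast = kappa[-1] + mu_hat * np.arange(1, 11)
```

Because the synthetic trend is negative, the estimated drift mu_hat is negative and the central forecast of κ_t keeps declining, mirroring the mortality improvement that LC typically extrapolates.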
However, LC is known to have limitations in estimation accuracy, primarily because of incoherency among variables caused by the two-stage estimation and the limitation of nonlinear representation capability of the bilinear form.
For the first issue, single-stage estimations of LC in Bayesian settings have been proposed. A direct translation of LC into a state-space form in Pedroza (2006) is given by:

x_t = α + β κ_t + ε_t,   ε_t ~ N(0, σ_ε² I),   (2.4)
κ_t = μ + κ_{t−1} + σ ω_t,   ω_t ~ N(0, 1),

where x_t = (log m_{t,0}, log m_{t,1}, . . . , log m_{t,n})^T; α = (α_0, α_1, . . . , α_n)^T; β = (β_0, β_1, . . . , β_n)^T; and n denotes the maximum age observed. Here, MCMC is required to obtain the joint posterior distributions of the parameters α, β, σ_ε, μ, and σ. Because MCMC is computationally intensive, the number of age categories to be estimated is often limited in many previous studies. Although many Bayesian extensions of LC have been proposed, they follow the bilinear form as in LC or RH, which results in limited nonlinear representation capabilities. For the second issue, the extension of the nonlinear representation capabilities, NN approaches have recently been proposed; our study is in this context. The reason behind the application of NNs to the nonlinear extension of models is the universal approximation capability of NNs demonstrated by Cybenko (1989). NNs generally comprise an input layer, hidden layers, an output layer, neurons in each layer, links between the neurons in different layers, and activation functions. In an FNN, the d_{i+1}-dimensional vector y_{i+1} representing the output values of the neurons in the (i+1)-th layer is determined by the d_i-dimensional output vector y_i of the previous layer, the activation function φ, the weights W_i ∈ R^{d_{i+1}×d_i}, and the bias b_i ∈ R^{d_{i+1}} as follows:

y_{i+1} = φ(W_i y_i + b_i).   (2.5)

NNs are trained to obtain weights W_i that minimize loss functions under a given network structure and activation function, typically using a gradient descent method (GDM). Nigri et al. (2019) proposes a two-stage NN extension of LC in which the LSTM was applied to the extrapolation of the time series components obtained by SVD. Perla et al.
(2021) considers NN extensions of LC, which can perform single-stage estimations, and demonstrates that the 1D CNN outperforms the deep FCN proposed by Richman and Wüthrich (2021) and LSTM. CNN, proposed by LeCun et al. (1990), is generally known as an effective method for 2D images and 3D spatial data; however, it has recently been used for 1D time series data. CNN replaces the product term W_i y_i in Equation (2.5) with a convolution, as described below; it is often performed in three stages. In the first stage, a convolution with shared weights is performed; in the second stage, the value obtained by the convolution is nonlinearly transformed by an activation function; finally, a pooling function is used to output the value. The 1D CNN uses a convolution filter W ∈ R^{J×d×m}, where m ∈ N denotes the kernel size of the filter and J denotes the number of filters. Then, the output data of layer i, y_i ∈ R^{d×T}, is transformed as follows:

(y_{i+1})_{j,t} = φ( Σ_{k=1}^{d} Σ_{s=1}^{m} W_{j,k,s} (y_i)_{k,t+s−1} + b_j ),   j = 1, . . . , J,   t = 1, . . . , T+1−m.   (2.6)

After the above transformation, the pooling function is employed. Generally, the pooling function returns the maximum, minimum, or average value within each window region of the input data. For an h-dimensional CNN input, y_i ∈ R^{d×T} and y_{i+1} ∈ R^{J×(T+1−m)} are changed to y_i ∈ R^{h×d×T} and y_{i+1} ∈ R^{h×J×(T+1−m)}. Wang et al. (2021) proposes mortality forecasts using two-dimensional (2D) CNN to capture the neighborhood effect of the mortality data. Schnürch and Korn (2021) also considers 2D CNN and achieves mortality forecasts with confidence intervals. The confidence intervals are not based on the endogenous randomness as in the BSS models, but on the exogenous randomness driven by the random seed for the NN, which corresponds to the model uncertainty. Hainaut (2018) proposes a two-stage estimation of LC with a NN-based dimensionality reduction as an alternative to SVD. The NN algorithm called NN-analyzer in Hainaut (2018) can be classified as an AE.
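The dense layer of Equation (2.5) and the 1D convolution that replaces its product term can be sketched at the shape level as follows. This is a toy illustration; `dense` and `conv1d` are hypothetical helper names, and the dimensions are arbitrary.

```python
import numpy as np

def dense(y, W, b, phi=np.tanh):
    # Equation (2.5): y_{i+1} = phi(W y_i + b); y: (d_i,), W: (d_{i+1}, d_i).
    return phi(W @ y + b)

def conv1d(y, W, b, phi=np.tanh):
    # 1D convolution: y: (d, T) input, W: (J, d, m) filters, b: (J,);
    # output has shape (J, T+1-m) as in the text.
    J, d, m = W.shape
    T = y.shape[1]
    out = np.empty((J, T + 1 - m))
    for t in range(T + 1 - m):
        # contract the (d, m) window against each of the J filters
        out[:, t] = np.tensordot(W, y[:, t:t + m], axes=([1, 2], [0, 1])) + b
    return phi(out)

rng = np.random.default_rng(0)
h1 = dense(rng.standard_normal(4), rng.standard_normal((3, 4)), np.zeros(3))
y = rng.standard_normal((4, 20))        # d=4 channels, T=20 time steps
Wc = rng.standard_normal((3, 4, 5))     # J=3 filters, kernel size m=5
out = conv1d(y, Wc, np.zeros(3))
print(out.shape)  # (3, 16) = (J, T+1-m)
```

The shared filter weights are what distinguish the convolution from the dense layer: the same (d, m) kernel slides over all time windows, which is why the 1D CNN is parameter-efficient for sequential mortality data.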
Generally, an AE is a NN that uses a low-dimensional hidden layer and learns such that the input data are close to the output of the AE reconstructed via the hidden layer; it can extract nonlinear features and characterize multidimensional data in low-dimensional components. The first part of the AE, which outputs low-dimensional features from the original data, is called an encoder f_enc, and the second part, which reconstructs data from the low-dimensional features, is called a decoder f_dec.
For the input data X(t) at time t, given by a vector of age-specific log-mortality rates net of the age-specific averages, the reconstructed data X̂(t) by f_dec, and the d-dimensional latent factor κ^{nn}_t = (κ^{nn,1}_t, . . . , κ^{nn,d}_t), the NN-analyzer is described as follows:

κ^{nn}_t = f_enc(X(t)),   X̂(t) = f_dec(κ^{nn}_t).   (2.7)

The loss function is given by the squared error between X(t) and X̂(t), and the parameters of f_enc and f_dec are estimated using a GDM to minimize the loss function. Moreover, the NN-analyzer has the following features:
• It has a symmetric network structure with three hidden layers, using the hyperbolic tangent sigmoidal function and the identity function as activation functions for both f_enc and f_dec.
• The input and output data are evenly divided into subgroups, and each subgroup is exclusively connected to a specific neuron in the input and output layers, resulting in a sparsely connected AE (SAE), which is expected to prevent overlearning.
• A genetic algorithm is used to identify appropriate initial values for the GDM.
Finally, using the latent factors, the mortality model is expressed as follows:

log m_t = α + f_dec(κ^{nn}_t),   (2.8)

where m_t and the average mortality rate α are vectors of values for each age. The decoder term in Equation (2.8) gives a nonlinear generalization of the bilinear term of LC; however, the prediction requires extrapolating the latent factor κ^{nn}_t by a time series model, resulting in a two-stage estimation. We introduce VAE to perform the single-stage estimation of the AE-based extension of LC. For more theoretical background on NNs and AEs, we refer to Wüthrich and Merz (2022).
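The encoder/decoder structure of Equations (2.7)–(2.8) can be sketched as follows, with untrained random weights and toy dimensions, purely for shape intuition. This is not Hainaut's (2018) implementation, which additionally uses a sparse connection pattern and a genetic algorithm for initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ages, d = 100, 3   # toy: 100 ages, 3-dimensional latent factor

# Untrained random weights for a one-hidden-layer encoder and decoder.
W_e, b_e = 0.01 * rng.standard_normal((d, n_ages)), np.zeros(d)
W_d, b_d = 0.01 * rng.standard_normal((n_ages, d)), np.zeros(n_ages)

f_enc = lambda x: np.tanh(W_e @ x + b_e)   # kappa_nn_t = f_enc(X(t)), Eq. (2.7)
f_dec = lambda k: W_d @ np.tanh(k) + b_d   # X_hat(t) = f_dec(kappa_nn_t)

alpha = rng.uniform(-8, -1, n_ages)                # age-specific average log mortality
x_t = alpha + 0.05 * rng.standard_normal(n_ages)   # observed log m_t (toy)
kappa_nn = f_enc(x_t - alpha)                      # encode net of averages
log_m_hat = alpha + f_dec(kappa_nn)                # reconstruction as in Eq. (2.8)
```

In the actual two-stage procedure, the weights would be trained by a GDM on the squared reconstruction error, and a time series model would then be fitted to the sequence {kappa_nn}.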

VARIATIONAL AUTOENCODER (VAE)
VAE, proposed by Kingma and Welling (2013), is a type of deep generative model with an AE structure that performs unsupervised learning with dimensionality reduction to a latent space. Unlike the conventional AE, VAE assumes a probability distribution for the latent space and can be implemented without using MCMC techniques. The aim of VAE is to identify p_θ(x), the generative distribution of the data x with generative parameter θ; in the process, it can acquire a dimensionally reduced latent representation of the data as a probability distribution. Assume that, for T ∈ N, there exists a set of latent variables Z = {z_t}_{t=1}^T that generates a sample dataset X = {x_t}_{t=1}^T with probability p_θ(x_t|z_t) for each t, and that z_t follows the prior distribution p_θ(z_t), where p_θ(z_t) and p_θ(x_t|z_t) are probability distributions differentiable with respect to the parameters to be identified.
Because it is generally difficult to directly estimate the multidimensional posterior distribution p_θ(Z|X), a variational approximation q_ϕ(Z|X) with parameter ϕ is used; the approximation is often implemented in the form of a mean field approximation via a factor-decomposable distribution as follows:

q_ϕ(Z|X) = ∏_{t=1}^{T} q_ϕ(z_t|x_t).   (3.1)

From the AE perspective, the approximate distribution q_ϕ(z_t|x_t) and the generative distribution p_θ(x_t|z_t) are considered a probabilistic encoder and decoder, respectively. The generative parameter θ and the variational parameter ϕ are learned as network parameters in VAE to maximize the evidence lower bound (ELBO) given by:

L(θ, ϕ; X) = E_{q_ϕ(Z|X)}[log p_θ(X, Z) − log q_ϕ(Z|X)].   (3.2)

The meaning of the ELBO becomes clear by rewriting Equation (3.2) with the Kullback–Leibler (KL) divergence D_KL[·||·], which gives asymmetric distances between distributions, as shown in Equations (3.3) and (3.4); we also refer to Section 11.6.3 of Wüthrich and Merz (2022) for more details:

L(θ, ϕ; X) = log p_θ(X) − D_KL[q_ϕ(Z|X) || p_θ(Z|X)],   (3.3)
L(θ, ϕ; X) = E_{q_ϕ(Z|X)}[log p_θ(X|Z)] − D_KL[q_ϕ(Z|X) || p_θ(Z)].   (3.4)

Equation (3.3) shows that the ELBO maximization implies maximizing the log-likelihood of the data while penalizing the approximation error given by D_KL[q_ϕ(Z|X) || p_θ(Z|X)]. From the perspective of AE, Equation (3.4) shows that the first term of the ELBO is the expected value of the negative reconstruction error of the data by the decoder p_θ(X|Z), because log p_θ(X|Z) is closer to 0 for a higher reconstruction probability and takes a greater negative value for a lower reconstruction probability; the second (KL) term of the ELBO acts as a regularizer that penalizes the KL distance between the approximate posterior distribution q_ϕ(Z|X) and the prior distribution p_θ(Z). Thus, the ELBO maximization simultaneously performs the data reconstruction error minimization, which is essential for AEs, and regularization. While the distributions are usually selected such that the KL term of Equation (3.4) can be calculated analytically, the first term is difficult to calculate analytically.
Thus, a sampling approximation with z_t following q_ϕ(z_t|x_t) and a reparameterization technique (called the reparameterization trick) for z_t are required to optimize the parameters using the GDM. The reparameterization is determined by a differentiable bivariate function of the data point x_t and an auxiliary noise variable ε_t as follows:

z_t = g_ϕ(x_t, ε_t).   (3.5)

In particular, assuming that the iid noise ε_t follows a standard normal distribution, g_ϕ(x_t, ε_t) can be expressed as follows:

g_ϕ(x_t, ε_t) = μ_ϕ(x_t) + σ_ϕ(x_t) ε_t,   (3.6)

where the parameters (μ_ϕ(x_t), σ_ϕ(x_t)) are given by a vector-valued function f^enc_ϕ(x_t) in the encoder.
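The reparameterization in Equation (3.6) and the closed-form Gaussian KL that makes the regularizer analytic can be illustrated as follows; `reparameterize` and `kl_normal` are hypothetical helper names used only for this sketch.

```python
import numpy as np

def reparameterize(mu, sigma, rng, L=10):
    # Equation (3.6): z_{l,t} = mu + sigma * eps_l, with eps_l iid N(0, 1);
    # gradients w.r.t. (mu, sigma) pass through the deterministic transform.
    return mu + sigma * rng.standard_normal(L)

def kl_normal(mu_q, s_q, mu_p, s_p):
    # Closed-form D_KL[N(mu_q, s_q^2) || N(mu_p, s_p^2)] for univariate normals.
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

rng = np.random.default_rng(0)
z = reparameterize(-1.0, 0.2, rng, L=10)   # 10 samples around mu = -1.0
print(kl_normal(0.0, 1.0, 0.0, 1.0))       # 0.0: identical distributions
```

Because the noise is drawn outside the network, the samples z depend on (mu, sigma) differentiably, which is exactly what the GDM needs; the KL term never requires sampling at all.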
For L ∈ N denoting the number of samples from g_ϕ(x_t, ε_t), the sampling approximation of the first term is given by:

E_{q_ϕ(z_t|x_t)}[log p_θ(x_t|z_t)] ≈ (1/L) Σ_{l=1}^{L} log p_θ(x_t|z_{l,t}),   z_{l,t} = g_ϕ(x_t, ε_{l,t}).   (3.7)

THE MODEL
We propose a nonlinear extension of LC, written as a state-space model with a latent variable that follows a drifted RW process. For the data {x_t}_{t=1}^T comprising vectors of age-specific log-mortality rates for each observation year t and the latent variables {z_t}_{t=1}^T for all observation years, their joint probability (with the graphical representation in Figure 1) is given by:

p_{θ,ξ}(X, Z) = p_ξ(z_1) ∏_{t=1}^{T−1} p_ξ(z_{t+1}|z_t) ∏_{t=1}^{T} p_θ(x_t|z_t),   (4.1)
p_ξ(z_{t+1}|z_t) = N(z_t + μ_ξ, σ_ξ²).   (4.2)

Although a drifted RW is assumed for p_ξ(z_{t+1}|z_t) as in the original LC model, we propose a generalization by replacing α + κ_t β in Equation (2.4) with a nonlinear vector-valued function f_θ(z_t), where the function belongs to a broad nonlinear function class that can be obtained by a NN.
Assuming that p_ξ(z_1) follows a normal distribution, the log-likelihood of the generative distribution to be maximized is given as follows:

log p_{θ,ξ}(X) = log ∫ p_ξ(z_1) ∏_{t=1}^{T−1} p_ξ(z_{t+1}|z_t) ∏_{t=1}^{T} p_θ(x_t|z_t) dZ.
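The factorization of the joint density can be sketched numerically as follows, with a toy scalar latent path and a stand-in nonlinear decoder `f_theta`; all parameter values are illustrative assumptions, not fitted quantities.

```python
import numpy as np

def norm_logpdf(x, mu, sigma):
    # Elementwise log density of N(mu, sigma^2).
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def joint_log_density(x, z, f_theta, mu_xi, s_xi, s_eps, mu0=0.0, s0=1.0):
    # log p_xi(z_1) + sum_t log p_xi(z_{t+1}|z_t) + sum_t log p_theta(x_t|z_t)
    lp = norm_logpdf(z[0], mu0, s0)                       # normal prior on z_1
    lp += norm_logpdf(z[1:], z[:-1] + mu_xi, s_xi).sum()  # drifted-RW transitions
    lp += norm_logpdf(x, f_theta(z), s_eps).sum()         # Gaussian observations
    return lp

rng = np.random.default_rng(0)
T, n = 6, 4
alpha, beta = rng.uniform(-6, -2, n), rng.uniform(0.5, 1.0, n)
f_theta = lambda z: alpha + np.outer(np.tanh(z), beta)    # toy nonlinear decoder
z = np.cumsum(-0.1 + 0.05 * rng.standard_normal(T))       # simulated latent path
x = f_theta(z) + 0.01 * rng.standard_normal((T, n))
lp = joint_log_density(x, z, f_theta, mu_xi=-0.1, s_xi=0.05, s_eps=0.01)
```

The three summed terms correspond one-to-one to the three factors of the joint probability; replacing the toy `f_theta` with a NN decoder gives the proposed generalization of the bilinear term.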

VAE FOR THE MODEL
In this section, a VAE is used to estimate the parameters of the state-space model; the chosen state-space specification determines the expression of the loss function used when fitting the VAE to the data.

The loss function
To apply the GDM for the variational inference, it is necessary to derive the loss function given by the sign-reversed ELBO of the model. The joint posterior distribution p_θ(z_1, z_2, . . . , z_T | x_1, x_2, . . . , x_T) with the generative parameter θ is approximated by q_ϕ(z_1, z_2, . . . , z_T | x_1, x_2, . . . , x_T) with the variational parameter ϕ, where the approximating distribution q_ϕ is assumed to be factorizable as expressed in Equation (3.1). The structure of the VAE used for the variational inference of the proposed model is represented in a graphical model (Figure 2), where the vertical network at each time t has an AE structure with encoder q_ϕ(z_t|x_t) and decoder p_θ(x_t|z_t). The ELBO to be maximized for the proposed model is given by:

L(θ, ξ, ϕ; X) = Σ_{t=1}^{T} E_{q_ϕ(z_t|x_t)}[log p_θ(x_t|z_t)] − D_KL[q_ϕ(Z|X) || p_ξ(Z)],   (5.1)

where the KL terms can be calculated analytically from the normality assumptions given by Equations (3.6) and (4.2). Replacing z_t in Equation (5.1) with the sampling approximations {z_{l,t}} that follow the reparameterized distribution g_ϕ(x_t, ε_t) given by Equation (3.6), the ELBO is approximated by a Monte Carlo average over the L samples. Finally, the loss function (i.e., the sign-reversed ELBO) is given by:

Loss(θ, ξ, ϕ; X) = −(1/L) Σ_{l=1}^{L} Σ_{t=1}^{T} log p_θ(x_t|z_{l,t}) + D_KL[q_ϕ(Z|X) || p_ξ(Z)],   (5.3)

where the KL term is evaluated analytically, and log p_θ(x_t|z_{l,t}) is obtained by replacing z_t in the observation density with the samples z_{l,t}. The mortality predictions can be obtained by decoding the projected state random variable z_t, in the form of mean values or confidence intervals.
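One plausible shape of such a loss, a Monte Carlo reconstruction term plus analytic Gaussian KL terms matching the factorized posterior and the RW prior, can be sketched as follows. The toy `enc`/`dec` functions and all numerical values are illustrative assumptions; the paper's exact KL expansion is not reproduced here.

```python
import numpy as np

def norm_logpdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)

def kl_normal(mu_q, s_q, mu_p, s_p):
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

def loss(x, enc, dec, mu_xi, s_xi, s_eps, rng, L=10):
    # x: (T, n_ages); enc returns (mu_t, s_t) per year; dec maps scalar z -> vector.
    mu_q, s_q = enc(x)
    recon = 0.0
    for _ in range(L):                                    # Monte Carlo over z_{l,t}
        z = mu_q + s_q * rng.standard_normal(len(mu_q))   # reparameterized samples
        recon += norm_logpdf(x, np.stack([dec(zt) for zt in z]), s_eps).sum()
    recon /= L
    # Analytic KL of each encoded posterior against the one-step RW prediction.
    kl = kl_normal(mu_q[0], s_q[0], 0.0, 1.0)
    kl += kl_normal(mu_q[1:], s_q[1:], mu_q[:-1] + mu_xi, s_xi).sum()
    return -recon + kl                                    # sign-reversed ELBO

rng = np.random.default_rng(0)
T, n = 5, 3
x = rng.standard_normal((T, n))
enc = lambda v: (v.mean(axis=1), np.full(len(v), 0.5))    # toy "encoder"
dec = lambda zt: np.full(n, zt)                           # toy "decoder"
val = loss(x, enc, dec, mu_xi=-0.1, s_xi=0.1, s_eps=1.0, rng=rng)
```

Minimizing this quantity with a GDM simultaneously improves reconstruction and keeps the encoded latent path close to a drifted RW, which is what allows the fitted latent mean to be extrapolated for forecasting.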

The network configuration
The proposed VAE has an FCN architecture comprising three hidden layers between the input and output layers; it has the following specifications:
• Although the numbers of neurons in the first and third hidden layers are variable, the number of neurons in the latent layer (second hidden layer) is fixed at 1 to be consistent with the dimension of the state random variable z_t.
• Both the encoder and decoder use the hyperbolic tangent sigmoidal function and the identity function as activation functions.
• The encoder (first hidden layer) includes the reparameterization unit given in Equation (3.6).
• A fixed age-impact vector corresponding to α in Equation (2.4) is used as a learning parameter that is deducted before encoding and added after decoding; its initial value is given by the vector of age-specific averages of the observed log-mortality rates.
The architecture is illustrated in Figure 3. The model is coded from scratch in Python without using optimization packages.
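At the shape level, the forward pass of the described h-1-h architecture can be sketched as follows, with untrained random weights; this is a hypothetical illustration rather than the authors' from-scratch code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ages, h = 100, 20   # ages 0-99; h neurons in the first and third hidden layers

params = {
    "We": 0.01 * rng.standard_normal((h, n_ages)), "be": np.zeros(h),
    "Wmu": 0.01 * rng.standard_normal((1, h)), "bmu": np.zeros(1),
    "Wsd": 0.01 * rng.standard_normal((1, h)), "bsd": np.zeros(1),
    "Wd": 0.01 * rng.standard_normal((h, 1)), "bd": np.zeros(h),
    "Wo": 0.01 * rng.standard_normal((n_ages, h)), "bo": np.zeros(n_ages),
    "alpha": rng.uniform(-8, -1, n_ages),   # learnable age-impact vector
}

def forward(x, p, eps):
    hid1 = np.tanh(p["We"] @ (x - p["alpha"]) + p["be"])  # alpha deducted, 1st layer
    mu = p["Wmu"] @ hid1 + p["bmu"]                       # encoder mean
    sd = np.exp(p["Wsd"] @ hid1 + p["bsd"])               # encoder std (positive)
    z = mu + sd * eps                                     # reparameterization (3.6)
    hid3 = np.tanh(p["Wd"] @ z + p["bd"])                 # 3rd hidden layer
    return p["alpha"] + p["Wo"] @ hid3 + p["bo"], mu, sd  # alpha added back

x = params["alpha"] + 0.05 * rng.standard_normal(n_ages)
x_hat, mu, sd = forward(x, params, rng.standard_normal(1))
```

The one-neuron bottleneck (the scalar z) is what ties the network to the 1D latent state of the RW model, and the deduct/add treatment of alpha mirrors how LC works net of the age-specific averages.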

NUMERICAL APPLICATION
This section shows the results of applying the proposed model to data from the HMD. The HMD data used here are the central mortality rates for ages 0-99 years, for both genders combined, from 1961 to 2018. In this study, the observation period begins in 1961 to coincide with the introduction of universal health insurance in Japan. For the accuracy evaluation, the data are divided into training data from 1961 to 2000 and test data from 2001 to 2018.

Calibration procedures
The hyperparameters to be determined for our model are the learning rate, the number of learning epochs, and the numbers of neurons in the first and third hidden layers. The number of neurons in the second hidden layer is fixed at 1 to be consistent with the dimension of the state variable of the model. The hyperparameters are selected to minimize the median of 10 mean squared errors (MSEs), corresponding to 10 random seeds, over the validation data consisting of the last 5 years (i.e., from 1996 to 2000) of the original training data; the selection universe of the hyperparameters contains the numbers of epochs from 5000 to 50,000 and the numbers of neurons in the three hidden layers from 10-1-10 to 50-1-50, where the configurations are limited to the symmetric type (i.e., the same number of neurons in the first and third hidden layers) common in AEs. Note that standard cross-validation is not applicable to time series models, including ours. Because a larger number of epochs reduces the effect of the number of samples for the Monte Carlo integration of the loss function, the number L in Equation (5.3) is fixed at 10. The validated hyperparameters for six countries, including Japan and the United States (US), are given in Table 1.
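The selection criterion, the median over 10 random seeds of the validation MSE minimized over the hyperparameter grid, can be sketched as follows; the `evaluate` function is a placeholder for training the VAE and scoring it on the 1996-2000 validation years.

```python
import numpy as np

def evaluate(n_neurons, epochs, seed):
    # Placeholder: would train the VAE with the given seed and return the
    # MSE on the validation years; here a dummy score for illustration.
    rng = np.random.default_rng(seed)
    return 1.0 / n_neurons + epochs * 1e-7 + 0.01 * rng.random()

# Symmetric configurations only: n-1-n neurons, epochs from the stated range.
grid = [(n, e) for n in (10, 20, 30, 40, 50) for e in (5000, 20000, 50000)]
scores = {
    cfg: np.median([evaluate(*cfg, seed) for seed in range(10)]) for cfg in grid
}
best = min(scores, key=scores.get)   # configuration with lowest median MSE
```

Using the median rather than the mean over seeds makes the criterion robust to the occasional badly initialized run, which matters for NNs trained from random starts.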
The model requires a relatively large number of epochs, as shown in Table 1, but the total run-time per country is about 10 min on Google Colaboratory (https://colab.research.google.com) owing to its shallow network structure. Although not used in this study, normalizing the input data to be centered with unit variance can reduce the number of epochs.

Performance comparison with LC
Using the hyperparameters given in Table 1, prediction performances are evaluated in the same way over the test data. Table 2 summarizes the comparison of the prediction accuracy between LC (SVD+RW) and the VAE over the test data for the six countries, where the accuracy measure is the same MSE as that used for the validation of the hyperparameters.
The observation period for the training data begins in 1961, when universal health insurance was introduced in Japan, but longer observation periods would likely improve forecasting accuracy for the other countries. This model belongs both to the literature on BSS mortality models, where prediction accuracy is not often discussed, and to the literature on NN models for mortality prediction, where prediction accuracy is the focus. As a NN model, it focuses more on achieving features not found in previous NN models than on prediction accuracy, as shown in the following sections, where we restrict our analysis to Japan and the US.

Interpretability of the model
In this section, the interpretability of the model, one of its key features, is demonstrated by comparing the components of the model with all parameters of LC over the training data for Japan and the US, using the hyperparameters given in Table 1. Figure 4 compares LC's year-specific factor κ_t with μ_t, the mean of the latent factor of the VAE generated by one random seed, over the training data for Japan and the US. The descending curves of μ_t can be interpreted as indicating medical progress, as can LC's year-specific factor κ_t. The slope of μ_t is more gradual than that of κ_t for both Japan and the US, indicating that excessive mortality reductions in long-term predictions are less likely to occur than in LC. Figure 5 compares LC's age sensitivity factor β with the decoder's sensitivity to μ_t, given by df_θ(μ_t)/dμ_t for one random seed, for all ages in 1970, 1980, 1990, and 2000, over the training data for Japan and the US. The number of humps in the decoder's sensitivity curves is roughly similar to that in LC's age sensitivity factor β for both Japan and the US, capturing country-specific characteristics. Figure 6 shows that the learning parameter α of the VAE, generated by one random seed, is consistent with the fixed age factor α of LC, given by the averaged mortality, over the training data for Japan and the US.

Forecasts with confidence intervals
Existing NN models for mortality forecasts with confidence intervals are based solely on exogenously given randomness, such as the random seed for NNs or the randomness added to the time series models in two-stage estimations.
In this section, we show that our model can forecast mortality with confidence intervals based on the endogenous randomness, as with BSS formulations, using the hyperparameters given in Table 1. Figure 7 shows the forecasts with 97.5% confidence intervals for the latent factor z_t over the test data for Japan and the US, appending μ_t over the training data. The natural connection between the mean μ_t of the latent variable encoded over the training data and the mean of the state model z_t estimated over the test data shows the effect of the variational approximation. Figure 8 gives mortality forecasts with 97.5% confidence intervals at ages 30, 40, 50, 60, 70, and 80 years over the test data for Japan and the US, appending the actual mortality and the reconstructed mortality over the training data.
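Forecasting with confidence intervals amounts to simulating the drifted-RW latent forward from the last encoded state and decoding each path. A minimal sketch with a toy one-age decoder follows; the fitted values (mu_xi, s_xi, z_T, alpha, beta) are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_xi, s_xi, z_T = -0.1, 0.05, 0.3     # assumed fitted drift, volatility, last state
horizon, n_paths = 18, 1000            # 18-year test horizon, 1000 simulated paths

# Simulate drifted-RW increments and cumulate them from the last state.
steps = mu_xi + s_xi * rng.standard_normal((n_paths, horizon))
z_paths = z_T + np.cumsum(steps, axis=1)          # (n_paths, horizon)

# Decode each latent path into log mortality (toy stand-in for f_theta).
alpha, beta = -4.0, 1.2
log_m_paths = alpha + beta * np.tanh(z_paths)

# Pointwise quantiles give the central forecast and the confidence band.
lo, mid, hi = np.quantile(log_m_paths, [0.025, 0.5, 0.975], axis=0)
```

Because the randomness enters through the latent state model itself, the resulting band reflects the endogenous process uncertainty rather than the seed-driven model uncertainty of earlier NN approaches.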

Remarks on changing number of neurons
The prediction accuracy of the VAE model is generally improved by increasing the number of neurons in the first and third hidden layers. Table 3 summarizes the comparison of the prediction accuracy by MSE of LC and the VAE models (from 10-1-10 to 50-1-50) over the test data for Japan and the US, using a fixed number of epochs per country. The results demonstrate that the number of neurons at which the VAE outperforms LC varies among countries, and that if the number of neurons is extremely large, the prediction accuracy decreases because of overlearning.
We also consider the smoothness of mortality curves in ultra-long-term predictions, which is desirable in the economic valuation of insurance/pension liabilities and longevity risk management. Figure 9 shows the 50-year predictions of Japanese mortality by the VAE (from 10-1-10 to 50-1-50), where the dark-colored curves are the predictions. For the 50-year projection in Japan, all data (i.e., from 1961 to 2018) are used as training data.
Improving the prediction accuracy of the VAE trades off against the smoothness of the prediction curves, but introducing an asymmetric configuration into the hidden layers of the VAE can improve the smoothness of the predictions while maintaining high prediction accuracy. It is effective to increase the number of neurons in the first hidden layer and decrease that in the third.
For example, VAE (50-1-3) yields relatively smooth prediction curves, as shown in Figure 10, and its prediction accuracy (MSE: 3.0416) remains better than that of LC (MSE: 3.2696) on the same data as in Table 3.

CONCLUSIONS
This paper proposes a NN-based generalization of LC using VAE that performs mortality forecasts with confidence intervals based on the endogenous randomness (i.e., not on seed randomness for NNs), as with the BSS models, in a single-stage procedure without losing interpretability of the model. Our model fills a gap in the literature on NN extensions of LC, since previous NN models either enable single-stage estimation of parameters but lose interpretability of the model, or retain interpretability but estimate parameters in two stages. The model can also yield relatively smooth mortality curves in long-term predictions owing to the dimensionality reduction capability of the VAE. However, our model has the limitation that it employs a 1D RW with iid residuals for the latent state model, and thus the number of neurons in the second hidden layer is limited to one; this limitation is intended to avoid sampling approximations of multiple integrals, which reduce estimation efficiency and often require MCMC. Dimensional extensions of the latent state model (i.e., multiple neurons in the second hidden layer) and the introduction of non-iid residuals to the latent state model are future work; they can improve the representation capabilities of the model and may allow its extension to cohort and multi-population models. However, with multiple neurons in the second hidden layer, one may have to deal with an identifiability issue; see Example 7.28 in Wüthrich and Merz (2022).