Neural networks for quantile claim amount estimation: a quantile regression approach

Abstract In this paper, we discuss the estimation of conditional quantiles of aggregate claim amounts for non-life insurance embedding the problem in a quantile regression framework using the neural network approach. As the first step, we consider the quantile regression neural networks (QRNN) procedure to compute quantiles for the insurance ratemaking framework. As the second step, we propose a new quantile regression combined actuarial neural network (Quantile-CANN) combining the traditional quantile regression approach with a QRNN. In both cases, we adopt a two-part model scheme where we fit a logistic regression to estimate the probability of positive claims and the QRNN model or the Quantile-CANN for the positive outcomes. Through a case study based on a health insurance dataset, we highlight the overall better performances of the proposed models with respect to the classical quantile regression one. We then use the estimated quantiles to calculate a loaded premium following the quantile premium principle, showing that the proposed models provide a better risk differentiation.


Introduction
The use of machine learning in insurance pricing has flourished in recent years and the related literature is increasing accordingly. For example, Guelman (2012) adopts gradient boosting for auto insurance cost modelling; Spedicato et al. (2018) performs insurance pricing optimisation using several machine learning models; Henckaerts et al. (2021) use tree-based techniques such as GBM in order to produce car insurance tariffs; Schelldorfer & Wüthrich (2019) employ neural networks to enhance GLMs performances in non-life insurance and Wüthrich (2020) proposes two different techniques to overcome the unbiasedness of neural network models for insurance portfolios. Machine learning techniques have also found extensive application in the context of insurance claim reserving: Gabrielli et al. (2020) improve the performances of over-dispersed Poisson model for general insurance claims reserving by means of neural networks embedding, and Wüthrich (2018) propose several machine learning algorithms for individual claim reserving.
However, such techniques only give an estimation of the expected value of the chosen variable (claim frequency or claim severity), since they are designed to return the pure premium of a specific policy, where the expected value of the total claim amount is given by the product between the expected values of claim frequency and claim severity. Hence, these models, even if they offer insight into the average loss of a policy, are unable to provide the modeler with some valuable information about its potential riskiness, e.g. the quantile of the total claim amount. To overcome this problem a quantile regression approach, originally introduced by Koenker

& Bassett
The analysis shows that premiums produced by the Quantile-CANN and QRNN provide a better risk diversification for the portfolio.
The remainder of the paper is structured as follows: Section 2 explores the two-part model considered in this work. Section 3 introduces the use of QRNN and of Quantile-CANN for the estimation of the quantile of the total claim amount. Section 4 presents the empirical application carried out on an Italian health insurance claim dataset. Section 5 concludes.
• the first stage allows to estimate the no claim probability p i = Pr(I N i = 0) = Pr(N i = 0) as function of covariates. To achieve this goal, we use the logistic regression: where the no claim probability is obtained as p i = 1 1−exp(x i β) ; • The second stage uses p i to obtain the τ i conditional quantile level ofS i , QS i (τ i |x i ), corresponding to the τ quantile level defined on the total claim amount S i , Q S i (τ |x i ). Following Heras et al. (2018), the τ i level can be calculated as for which In the literature, QS i (τ i |x i ) is generally calculated using the well known quantile regression approach of Koenker & Bassett (1978). In this paper, we will generalised this approach by introducing two alternative methods, the QRNN, a particular specification of Regression Neural Network introduced by Taylor (2000), and the Quantile Combined Actuarial Neural Network (Quantile-CANN) which is a new method to calculate quantiles embedding the CANN approach of Schelldorfer & Wüthrich (2019) in a quantile regression framework. These approaches allow performing quantile estimation without imposing any predetermined structure for the relations between the claim severity and the related covariates.

Quantile Claim Severity Models
The standard approach to tackle the problem of quantile claim severity estimation refers to traditional QR models, see for example Kudryavtsev (2009), Heras et al. (2018) and Baione & Biancalana (2019). In this paper, we generalise this approach by proposing the use of neural network models to estimate the conditional quantile of the aggregate claim severity. This approach allows performing these calculations without imposing any predetermined structure for the relations between the aggregate claim severity and the related covariates. In this way, we can capture nonlinear and complex patterns in the data and possible interactions between predictors.
The first model we introduce is QRNN, which is basically a feed-forward neural network minimising the same quantile loss function minimised by a QR model. The second one is a new model that generalises the CANN approach introduced by Schelldorfer & Wüthrich (2019) by combining the QR model with a QRNN to improve the model's performance, we call this model Quantile-CANN.
Since all the aforementioned models are somehow based on quantile regression, we give a light insight on QR models in the next subsection.

Quantile regression
Quantile regression, originally introduced by Koenker & Bassett (1978), is a distribution free method providing a way to model the conditional quantiles of a response variable with respect to a set of covariates in order to have a more robust and complete picture of the entire conditional distribution with respect to the classical mean regression. Quantile regression approach is quite suitable method used in all the situations where specific features, like skewness, fat-tails, outliers, truncation, censoring, and heteroscedasticity arise. In this section, we show how to calculate QS i (τ i |x i ) using QR standard tools, in particular, we analyse how to calculate the quantile of the log(S i ). The use of the logarithmic function derives from the need to transform the dependent variable from R + to R to apply the QR approach. Moreover, the use of the logarithmic transformation is coherent within the insurance pricing context, since it allows to consider multiplicative tariffs. In addition, by considering the equivariance under monotone transformation property of the quantile, it is possible to retrive the quantity of interest, i.e. Q log(S i ) (τ i |x i ) = log(QS i (τ i |x i )). For notational simplicity, hereafter, we will use τ instead of τ i in the formulas below.
The quantile regression model can be stated as follows: where β(τ ) = (β 1 (τ ), β 2 (τ ), . . . , β q 0 (τ )) ∈ R q 0 is the vector of unknown regression parameters and ε i having the τ -th conditional equal to zero for all i = 1, 2, . . . , I. The estimation of the of the regression parameters β(τ ) can be obtained by solving the following minimisation problem: where ρ τ is the quantile loss function defined as https://doi.org/10.1017/S1748499523000106 Published online by Cambridge University Press with I (u) being the indicator function. The conditional quantile ofS i is then estimated aŝ In what follows we explore the model proposed based on the neural network approach.

Quantile Regression Neural Networks
QRNN is a modeling technique introduced by Taylor (2000) based on the neural networks that enables to estimate the conditional probability distribution of multiperiod financial return within a quantile regression framework. With this approach, it is possible to estimate potential non-linear quantile relations without imposing any distributional assumption or functional relations between dependent and independent variables. Several applications of the methodology have already been implemented in different fields, see for instance Xu et al. (2017) and Cannon (2011). Up to our knowledge, the QRNN methodology has never been considered in an insurance pricing context. At a deeper insight QRNN model, for a fixed depth K ∈ N, and fixed τ , follows a typical feedforward structure: where the output of the network is given by the exponential activation function applied to the scalar product between the readout parameter vector θ (K+1) returning one neuron in the output layer and the composition of the different K hidden layers z (1) (θ (1) ), · · · , z (K) (θ (K) ), where θ (1) , · · · , θ (K) are the parameters belonging to each layer. The generic s-th hidden layer s (s) (θ (s) ) of dimension q s ∈ N is defined as where the j-th neuron in the s-th hidden layer is given by where φ is the activation function and θ (s) j = (θ (s) j,0 , θ (s) j,1 , . . . , θ (s) j,q s−1 ) is the vector of parameters belonging to j-th neuron in the s-th hidden layer. Considering the vector of parameters θ (s) 1 , . . . , θ (s) q s for each neuron in (8), we can define the matrix of parameters for the s-th hidden layer as θ (s) Since the network in (7) has K hidden layers, it is possible to denote with θ the full set of parameters for the network gathering the matrix of parameters of each layer: with dimension r, where r = K+1 s=1 q s (1 + q s−1 ). To obtain the optimal set of parametersθ for (10), we train the network (7) minimising the quantile loss function: where we fix the starting value of the network parameter θ 0 at the beginning of the training. From (7) and (11), we estimateQ log(S i ) (τ ), then given the equivariance to monotone transformation of the quantile function we retriveQS i (τ ) = exp(Q log(S i ) (τ )). It is worth noting that if we replace the exp and the φ activation functions in (7) and in (9) with the linear activation function and we consider no hidden layers, then the QRNN boils down to the classical QR model in (5). For a visualisation of the QRNN architecture, see Figure 1.

Quantile-CANN
The innovative model we propose in this section to estimate the quantiles for the distribution of interest is an extension of the combined actuarial neural network (CANN) approach proposed by Schelldorfer & Wüthrich (2019) and launched in the editorial of Wüthrich & Merz (2019). In particular, the CANN framework nests a generic regression model into the neural network architecture to enhance the estimates given by the generic regression model. Following the CANN approach, but in a quantile framework, we boost the QR model with the neural network features proposing the so-called Quantile-CANN model. Our approach allows for exploiting the quantile neural networks to improve the conditional quantile estimates given by a QR model. In such a way, the neural network directly improves the classical QR estimates, preserving the information contained therein.
The main advantage of the Quantile-CANN approach compared to the QRNN is that it combines the flexibility of the networks with the interpretability of a QR approach, so providing higher explainability. According to Wüthrich & Merz (2019), when the reference regression model is already close to optimal, its maximum likelihood estimator can be used as initialisation of the neural network fitting algorithm, we use the QR estimator to initialise the network parameter of the Quantile-CANN, then obtaining lower computational time for the network parameters calibration than the QRNN model.
Formally, we define the Quantile-CANN as where the first term of the right hand side of (12) refers to the QR model in (4) with vector of parameters β(τ ), while the second term is the QRNN model displayed in (7) (except for the missing exponential activation in the output layer). Therefore, the Quantile-CANN model combines the models discussed in the two previous subsections (3.1 and 3.2) by embedding the QR into the network architecture using a skip connection that links the input given by the QR estimates to the output layer (see Figure 2 for a graphical representation), where the models are merged by summing the two parts as in (12). The network parameter of the Quantile-CANN model is denoted by ϑ and consists of The optimal set for the network parameterθ of (13) is obtained training the Quantile-CANN (12) minimising the quantile loss function: The optimisation process to estimate ϑ in (13) works as follows: we first obtainβ(τ ) parameters minimising the quantile loss function for the quantile in (5). Then we use such parameters to initialise the network parameter of the Quantile-CANN by considering ϑ 0 = β (τ ), θ 0 , where θ 0 is the starting value of the network parameter belonging to the QRNN part of model (12). Therefore, starting from ϑ 0 , we optimise ϑ minimising (14) by means of the gradient descent algorithm 2 . During the optimisation process, both θ 0 andβ(τ ) parameters are trained.
Given the optimal set of parametersθ, from (12), we estimatê then, the quantile of the claim severity is obtained asQ CANÑ ). Note that if the Quantile-CANN model does not return any improvement with respect to the QR model it means that the latter is already able to capture all the relevant information incorporated in the data.

Data description
To assess the ability of the models proposed to capture additional information incorporated in the data, we carry out an empirical case study based on an Italian health insurance claim dataset. The data stem from an Italian insurer and report the claims collected in a general health insurance plan during 2018. The plan provides risk coverage for managers and retired managers belonging to companies of a specific industry in Italy. More specifically, the dataset consists of 132, 499 policyholders (employees or former employees). Since each policyholder can also enroll his relatives (spouse and children below 25 years of age) in the insurance coverage, we totally have 301,405 insured. For each one of them, the following information is available: a binary variable signaling whether the insured submitted at least a claim during the year, total claim severity per year, age, gender, region, firm dimension, and time of coverage permanence (in years). Table 1 provides a summary for the information available in the dataset. Among the insured, we count a total of 217,006 claimants, the monetary volume of the submitted claims is about euro 243 m.
In Figure 3 and Table 2, we report the histograms and the frequency tables for the variables in the dataset. The age (AG) variable is strongly concentrated at older ages, we notice a "dip" in the distribution between 25 and 40 years, this is due to the specific subscription policies of the Italian insurer: since the policyholders are managers, it is unfrequent they have less than 40 years In the right bottom panel in Figure 3, we also plot the histogram of the aggregate claim severity for the 217,006 claimants. We recall that the aggregate claim severity is defined as the sum of the cost of all claims submitted by a given claimant. As it is common in this field, the distribution is right-skewed, from the plot we observe the existence of some large claims; however, the tail doesn't seem to be particularly fat.

Evaluate model performance
To assess the general performance of the QR, QRNN, and Quantile-CANN models, we use the aforementioned dataset at different τ levels. Specifically, the performance of each model is measured in terms of quantile loss function (see Equation (6)), where the lower the loss the better the model.
The results reported in Table 3 are obtained by means of five-fold cross validation, where the dataset is divided in five folds and at each iteration, three of the five folds are used as training set, one as validation and one as testing set. The network models are trained over 2, 000 epochs using early stopping on the validation set to avoid overfitting. For the early stopping, we employ a patience parameter of 200 epochs; hence, the training stops when the validation loss does not decrease for at least 200 epochs. When training stops, the network weights obtained at the 200th last epoch are restored and saved as the optimal parameters for the model. Both models adopt a three hidden layer structure of dimension (20, 15, 10), we consider the hyperbolic tangent 4 activation function for QRNN, while we use the ReLu 5 activation function for Quantile-CANN. As for the variables presented in Table 1: AG, PE, DM are Min-Max scaled, GE is dummy encoded   and RE is treated using a d = 1 embedding layer. As for the QR model, we consider a splines function to model the AG and the PE effects.
For each model, we estimate the conditional quantile of the total claim amount at levels τ = (0.7, 0.75, 0.80, 0.85, 0.9) and compute the respective in sample and out of sample quantile loss function to evaluate their performance.
For our dataset, QRNN and Quantile-CANN exhibit an overall better performance in terms of the quantile loss function compared to the classical QR for each quantile level τ (see Table 3). It is interesting to note that Quantile-CANN always yields a lower quantile loss function compared to the QRNN, suggesting that building the network around the QR has not only improved the performance given by the QR model but also beats the QRNN. The results above illustrated seem to suggest that using a neural network approach can be appealing. However, such models often face one major drawback: the lack of explainability. Indeed, neural networks have a huge number of parameters and a complex inner structure made up of several hidden layers, that make make very difficult for the modeler to understand the results. To overcome such limitations, in recent years, a wide literature covering the topic of model agnostic tools has flourished, see Friedman & Popescu (2008a) and for an actuarial case study Lorentzen & Mayer (2020) or Henckaerts et al. (2021), aiming at providing interpretative tools for machine learning models.
The corresponding literature has a plethora of well established model agnostic techniques. In this work, we exploit the permutation variable importance of Breiman et al. (1984) to gauge the relevance of each covariate in the model; the ICE curves and the PD plots to showcase the marginal effect of a covariate over the prediction produced by the model; the H-squared statistic Friedman & Popescu (2008a) in order to spot potential interaction effects between the covariates.

Variable importance
Permutation variable importance of Breiman et al. (1984) measures the increase in the deviance of a model after permuting the values of a given covariate. The basic idea is quite simple: the importance of a covariate is measured by calculating the increase in the model's deviance after permuting the covariate. In other terms, the variable importance is the increase in model deviance when the information provided by an explanatory variable is destroyed. Such variable is deemed important if randomly shuffling its values increases the model deviance, because in this case the model relied on the feature for the prediction. While, a covariate is not important if shuffling its values produces little to no increase in the model's deviance In what follows the general framework for the algorithm. Let f be the generic prediction function given by a model, where x is the feature matrix, y is the vector of observations and D(y,f ) is the model's deviance. For instance in QR we have f = exp(x β (τ )) and D(y, f ) = ρ τ (y − f ).
1. We estimate the original model Deviance D 0 = D(y, f (x)); 2. For each covariate j = 1, ..., p: • take the covariate matrix x i , that can also be represented as x i = x .,j , x .,−j , where x .,j is the column vector belonging to covariate j and x .,−j is the matrix for the other covariates. Get the set of covariates x perm by permuting the values in x .,j ; • estimate model deviance D perm = D(y, f (x perm )); • compute the Permutation Variable Importance for covariate j as I j = D perm − D 0 3. Sort covariates by descending order I j for j = 1, ..., p.
Permutation variable importance can be either performed on the training set or the test set. Performing the analysis on the test set informs the modeler on how much the model counts on each covariate for the predictions, while using the test set would give a hint on the relevance of the covariate for the performance of the model on new and unseen data.
In Figure 4, we report the variable importance metric to find the most relevant variables in our dataset for the quantile models. The variables are ranked from top to bottom, starting with the most important one as measured by the variable importance. For all models, the most important variable is the Age (AG) followed by the spatial variable (RE), the other variables are far less relevant; however, they seem to have a somewhat higher importance in network models with respect to the QR.

Main effects
ICE profiles are a useful tool to study the marginal effect of a covariate over the response provided by the model. Such a profile, for a given covariate j, shows how the prediction provided by the model for an observation obs i = x i reacts when the covariate x i,j slides over its range of possible values.
In particular, producing ICE profiles for a set of observations gives a hint on how the response evolves with respect to the different values of the variable. An ICE plot represents the relationship between the prediction and a specific covariate for each single observation separately, producing one profile (or line) per observation. The values for a line of the data matrix is obtained by fixing the values of all other covariates and creating variants of this observation by replacing the covariate's value with values coming from a grid and producing predictions with the model for these new observations. This procedure, for a specific observation, results in a set of points given by the feature value from the grid and the respective predictions. From an algorithmic standpoint, we have: 1. take an observation obs i = x i and the corresponding prediction f (x i ) given by the model. 2. x i is a p dimensional vector that can also represented as x i = x i,j , x i,−j , where j is the covariate we want to study. 3. Consider the grid of V possible values (v 1 , v 2 , . . . , v V ) for the selected covariate j.
Then for each v k with k = 1, ..., V we repeat: Once the process ends, we obtain a curve ICE . The process is repeated potentially for each observation in the dataset and each variable.  The ICE profiles are useful to highlight the presence of interactions. Infact, the stronger the interaction effects associated with the variable j, the greater the differences in shape observed across ICE profiles. However, this model agnostic tool does not reveal with which other variable the interaction arises. Note that, by construction, ICE profiles of a given covariate j are parallel as long as the underlying model does not incorporate interactions (like in a GLM or in a QR on the log scale transformation).
The partial dependence (PD) profiles of Friedman & Popescu (2008b) are obtained averaging different ICE profiles of a given variable j. PD profiles can be viewed as the main effect of covariate j merged over all observations. In other terms, they represent the average effect of variable j and are able to display whether the relationship between the response and a covariate is linear, monotonic, or more complex. PD marginalises the model's prediction over the distribution of the covariates in x .,−j , so that the profile displays the relationship between variable j and the output produced by the model.
If we consider the ICE profiles obtained for variable j over a set of n observations, ICE j i,v k where i = 1, ..., n and k = 1, ..., V, the PD profile is obtained as where, again, V is the grid of possible values for the selected covariate. The PD j v k function, for a given value v k of variable j, reveals the average marginal effect on the prediction returned by the model. Note that PD profiles are not restricted to a single variable, it is also possible to consider multiple variables at the same time in order to study their joint effect on the response.
In Figures 5 and 6, we consider PD plots and individual conditional expectations (ICEs) to gain an understanding of the main effects of the variables over the conditional quantile of the total claim severity, for the different models. The different ICE profiles in Figure 6 were coloured to report the amount of the observed claim (the darker the colour the higher the amount) to detect possible peculiar profiles. The top-left plot in Figure 5 compares the PD plots for the AG variable produced by the different models. At first glance, the curves look quite similar, however taking a closer look we notice an important difference in the leftmost part of the plot, for the values below 25 years of age. More specifically, for both QRNN and Quantile-CANN, we observe an upward trend in the riskiness of the insured, starting from 500 euros and peaking at 1, 500 euros at around 15 years of age, then the curve falls off down to 700 euros at 25 years. While the QR shows a flat trend kicking off at around 1, 000 euro and slowly increasing after 25 years of age. The behaviour displayed by network models seems to be more reasonable, since it captures the cost for dental treatment in younger ages, as it is usually low for children below 10 years, while it raises significantly for teenagers often associated with the use of dental braces. The ICE curves associated with the PD plot displayed in the top-left plot of Figure 6 do not seem to reveal specific interactions within the neural network models, since all the profiles look almost parallel in the plot, which is the case when the variable has little to no interactions with other variables. Of course, the ICE profiles for the QR are always parallel (on a log scale), since the variables are always modelled additively.
The permanence (PE) PD plots display a similar evolution but at different levels (top-right pane in Figure 5). Furthermore, for this variable, we observe some wild ICE profiles (top-right pane in Figure 6), that may signal the presence of interactions with other variables. The Gender (GE) variable does not seem to have a particular behaviour, in fact, the range for the y-axis is pretty narrow. As for the regional variable (RE), the PD plots show almost identical profiles, with a particularly riskiness associated with insured living in Lazio and Valle d'Aosta.

Interaction effects
After having a look at main effects and at some clues on possible interactions in Section 4.4, here we closely study possible interaction effects between covariates captured by the network models. When a given model incorporates an interaction effect, its predictions cannot be expressed as the sum of its variables' main effects, because the effect of one variable depends on the value of another variable. To assess the presence of interaction effects, we adopt the H-statistic introduced by Friedman & Popescu (2008b), which estimates the interaction strength between two covariates by measuring how much of the prediction variance originates from their interaction. To measure pairwise interaction strength between covariates j and k, H-statistic is defined as follows: where the sums run over a subset of n randomly selected observations,P D k is the centered version 6 of the PD profile for variable k, andP D jk is the centered two-way PD for variable j and k. In other words, H jk measures the proportion of variability in the joint effect of x .,j and x .,k unexplained by their main effects. A value close to zero indicates almost no pairwise interaction, while a value close to one means that most effects come from the pairwise interaction. The H-statistic can also be larger than 1, this can happen when the variance of the joint interaction is larger than the variance of the 2-dimensional PD plot. Hence, the interaction strength is measured as the share of variance explained by the interaction. In Figure 7, we plot the values of the H-statistic for each model and each possible pairwise interaction. As discussed above, the QR model does not display any interaction between covariates, since it is not designed to do so. While, both QRNN and Quantile-CANN display some specific interactions. The strongest common interaction between the two models is the one between the firm's dimension (DM) and the gender of the insured (GE), followed by the one between the Permanence (PE) and, again, the Gender (GE). To gain insight on the behaviour of the interaction effects, in Figures 8 and 9, we report PD plots for Dimension (DM) and Permanence (PE), grouped with respect to Gender (GE). The interaction appears to be relevant if the curves display a different behaviour when conditioned to the different values of the gender variable.  For the first interaction, in Figure 8, we notice that both network models recognise higher riskiness to male insured, indeed in the two panels, the blue line (in Figure 8) lies always above the red curve. However, there are some relevant differences between QRNN and Quantile-CANN that are worthy of discussion. In particular, in the QRNN plot, the two curves display a downward parabolic trend which is rather steep for male insured and far flatter for females. Both curves reach their minimum at around 25 years of permanence and then they start increasing again, reaching 1, 900 euro for males and 1, 700 euros for females, the difference between the two curves is rather stable across the plot (though not constant). For Quantile-CANN, we observe a somewhat different behaviour, in fact, the two curves display a large distance for the insured with less than 10 years of permanence, while in the right part of the plot the distance in terms of potential riskiness between males and females is almost negligible.
Also for the interaction between dimension (DM) and gender (GE), the grouped PD plots (Figure 9) report an higher riskiness associated with male insured. Both panels show an upward parabolic trend with a maximum at around 500 for both plots for the PD plots associated with males insured. In the QRNN panel, we notice a quasi-sinusoidal trend for the females grouped PD plot, with a very low variability for the values in the y-axis. In other terms, for the QRNN, the dimension variable only has some sort of significant effect for males, while for females the effect of this variable seems negligible. As for the Quantile-CANN female plot, we have first had a substantially stable trend, then followed by a downward parabolic trend, with a good variability for the insured potential riskiness.
Thus, even though the gender variable did not appear to be significant in Figures 4 and 7, it has some relevance when interacting with other covariates. As shown above, network models have proven to be good at detecting such interactions.

Ratemaking
In Section 4.2, we have investigated models' behaviours evaluated at different quantile levels. We are now interested in evaluating models' performances when the focus is estimating the quantile premium principle introduced by Heras et al. (2018), where the premium paid by the insured is loaded according to its potential riskiness. In particular, we consider the convex combination of the quantile claim severity QS i (τ i |x i ) and the conditional expected value of the total claim amount S i : where 0 < γ < 1 is the loading factor and E(S i |x i ) the conditional expected value of the total claim amount that can be decomposed as E(S i |x i ) = Pr(N i > 0|x i ) · E(S i |x i ). Note that Pr(N i > 0|x i ) = 1 − p i can be obtained from Equation (1), while E(S i |x i ) is estimated using a Gamma regression model as in Frees (2010). In order to compute (18), we need to obtainQS i (τ i |x i ) using the two-part model discussed in Section 2.
The two-part model is estimated in taking the first fold in the 5-fold cross-validation, where we consider a 60-20-20 split between learning, validation, and testing set. The first step of the two-part model consists in estimating the no claim probability p i for each insured using logistic regression. Then as discussed in Section 2, the estimated p i is employed to compute the τ i level as in Equation (2) for each insured setting τ = 0.95. As a result of this first step, we obtain 47,120 unique values for τ i ranging between 0.799 and 0.896. Following the two-part approach designed by Heras et al. (2018) would involve fitting a regression model for each different quantile level τ i . Doing this would be rather time-consuming. Thus, to avoid that, we approximate the τ i values up to the second digit, where the second digit is rounded to the closest even digit 7 we perform the second step of the two-part model by computing the conditional quantile of the claim severitŷ QS i (τ i |x i ) using QR, QRNN, and Quantile-CANN.
In order to test the ability of the different models to accurately estimate the desired quantile level ofS i , we use the backtest criteria approach. More specifically, the models are backtested using the unconditional coverage (UC) test proposed by Kupiec (1995), which is a widespread testing technique generally employed to validate VaR models in the financial literature. This technique is equal to τ c = (1 − τ ). More formally, given a generic insured i, we can define the violation function as Hence, given a portfolio composed of I insured, we define the total amount of the violations occurred in the portfolio as I(τ ) = I i=1 I i (τ ), and the ratio of occurred violations aŝ It is now possible to define the UC test statistic as The null and the alternative hypothesis for the UC test are defined as In other terms, considering a portfolio of I insured, if the number of occurred violationsτ c · I is close enough to τ c · I, the test statistic LR uc is low, and the null hypothesis is not rejected. There is no evidence of any inadequacy for the tested quantile estimate. While, if the number of violations strongly differs from τ c · I, the test statistic increases indicating growing evidence that the proposed quantile either systematically understates or overstates the portfolio's potential riskiness, and thus the null hypothesis is rejected. In Table 4, we report the backtesting results for QR, QRNN, and Quantile-CANN. In the left part of the table, we report the unconditional coverage test statistic LR uc , while on the right side of the table, we display the corresponding p-values, keeping in mind that the null hypothesis is not rejected when the p-value is larger than 0.05. From Table 4, we observe that QRNN and Quantile-CANN pass the backtest at almost all quantile levels since the null hypothesis of unconditional coverage is not rejected. More specifically, the QRNN and Quantile-CANN pass the test at all quantile levels, with the sole exception of the Once the models are estimated, we are able to calculate tariffs P i as in Equation (18) and evaluate them using the ordered Lorenz curve introduced by Frees et al. (2014). This tool is a twist on the classical Lorenz curve usually employed in welfare economics to represent social inequality via the Gini index, see Farris (2010). In insurance literature, the ordered Lorenz curve is employed to compare different tariff structures issued by a set of competing models. Given a base tariff structure P base i and a competing tariff P comp i the Lorenz curve proposed by Frees et al. (2014) is ordered with respect to the relativity r i : A relativity r i consistently below 1 reveals a largely profitable policy for the company, that is likely to be lost to a competing insurance company proposing a cheaper premium. Instead a relativity r i greater than 1 signals an underpriced policy. Of course these statement hold true only if we assume P comp i to give a sharper representation of the real risk compared to P base i . Given r i , the ordered Lorenz curve can be defined as follows: for s ∈ [0, 1] where F n (r i ) is the empirical cumulative distribution function of the relativities r i . The idea behind the ordered Lorenz curve is that a model producing tariffs with a greater Gini index produces a stronger separation among premiums paid by the insured, signaling that such model is more capable to distinguish good risks from bad risks. Hence, a tariff structure P comp i that yields a larger Gini index is likely to result in a more profitable portfolio because of a better risk differentiation. Table 5 displays the two-way comparison of Gini indices for the network models and the QR. The rows report the model generating the base tariff structure P base i whereas the column stores the model from which the competing tariff structure P comp i is generated. The approach we use for selecting a tariff based on the Gini index is the "mini-max" strategy designed by Frees et al. (2014), which consists of selecting the model that provides the minimum Gini index among the maximal Gini indices taken over the competing models. The strategy is rather intuitive: if we have to choose a base premium we chose the one that has the minimal maximum improvement when compared to other models, meaning that the selected base premium is the least vulnerable to alternatives. In practice, we look for the base model with the lowest value in bold in Table 5. The Quantile-CANN appears as a clear winner since the other models are not able to achieve an high Gini index when considered as an alternative (see the last row), signaling that this model leads to a tariff structure that is the least likely to incur in adverse selection. The QRNN tariff structure achieves second place, followed by QR.

Conclusions
In the domain of insurance ratemaking, neural networks may be considered a tool to perform high-dimensional nonlinear regressions. Following this line, in this paper, we extend such techniques in order to estimate the conditional quantile of the total claim amount in the context of a two-part model devoted to the definition of a loaded premium. To fulfill the quantile estimation task, we propose two models. First, we use QRNN of Taylor (2000), which is a particular neural network specification that has never been applied before in actuarial sciences. Next, we generalise the CANN approach of Schelldorfer & Wüthrich (2019) introducing the Quantile-CANN, a flexible tool that merges the classical QR with a QRNN allowing to improve the results given by the QR model and the QRNN.
In the first part of the empirical application, Section 4.2, we test the performance of the proposed network models against the traditional QR model over a health insurance claim dataset. The results show that our models outperform QR in terms of quantile loss function. Furthermore, for QRNN and Quantile-CANN, we apply the set of model agnostic tools discussed in Lorentzen & Mayer (2020) to gain additional insight in the dataset.
In the second part of the empirical application, Section 4.6, we compute the Quantile Premium Principle introduced by Heras et al. (2018) using the different models discussed in the paper (QRNN, Quantile-CANN, and QR). To compare the tariff structures issued by the models, we adopt the ordered Lorenz designed by Frees et al. (2014). According to this technique, the proposed network models exhibit a better tariff structure w.r.t QR since they provide a better risk differentiation for the portfolio. This feature is paramount for the insurer since a better differentiation between good and bad risks is likely to increase profits.
This work focuses on the comparison between the neural network-based models and the quantile regression, further research could explore different network configurations, i.e., increasing the number of dimensions for the embedding layer used for the regional variable or even different machine learning models, such as tree-based models or gradient boosting machines. Another possible development of this work could consider a multivariate QRNN or Quantile-CANN approach in order to jointly model the conditional quantile of the total claim severity for different and possibly correlated claim types.