A NEW MULTIVARIATE ZERO-INFLATED HURDLE MODEL WITH APPLICATIONS IN AUTOMOBILE INSURANCE

Abstract The fact that a large proportion of insurance policyholders make no claims during a one-year period highlights the importance of zero-inflated count models when analyzing the frequency of insurance claims. There is a vast literature focused on the univariate case of zero-inflated count models, while work in the area of multivariate models is considerably less advanced. Given that insurance companies write multiple lines of insurance business, where the claim counts on these lines of business are often correlated, there is a strong incentive to analyze multivariate claim count models. Motivated by the idea of Liu and Tian (Computational Statistics and Data Analysis, 83, 200–222; 2015), we develop a multivariate zero-inflated hurdle model to describe multivariate count data with extra zeros. This generalization offers more flexibility in modeling the behavior of individual claim counts while also incorporating a correlation structure between claim counts for different lines of insurance business. We develop an application of the expectation–maximization (EM) algorithm to enable the statistical inference necessary to estimate the parameters associated with our model. Our model is then applied to an automobile insurance portfolio from a major insurance company in Spain. We demonstrate that the model performance for the multivariate zero-inflated hurdle model is superior when compared to several alternatives.


INTRODUCTION
Generalized linear models (GLMs) have been widely applied in insurance ratemaking over the last few decades. Within the GLM framework, Poisson and negative binomial distributions have been routinely applied to the analysis Astin Bulletin 52(2), 393-416. doi:10.1017/asb.2021.39 c The Author(s), 2022. Published by Cambridge University Press on behalf of The International Actuarial Association. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited. of insurance claim frequencies. Some detailed illustrations can be found in Cameron and Trivedi (1998) and Frees (2009). In automobile insurance, many attempts have also been made in the actuarial literature to find an appropriate distribution to model the dollar cost associated with individual claims. It is often the case that automobile insurance policyholders do not incur any loss in the period of insurance coverage. Even if the insureds do incur a loss, they may opt not to put forward a claim in order to maintain a high level of no claims discount to their annual insurance premium. In short, insurance claim frequency data are often characterized by a large number of zero claims. This, in turn, leads to the study of zero-inflated versions of typical count distributions (see, e.g., Yip andYau, 2005 andBoucher et al., 2007).
In practice, an insurer may provide multiple lines of insurance business, where claims from different sources are bundled into one single policy. For example, in automobile insurance, one policy may cover the payment in respect of third-party liability and also losses sustained by the insured party. It is clear from data that claims from these two forms of insurance are often triggered from the same event leading to correlation between observed claim counts. Dependence between costs associated with two (or more) different types of claims must be considered when building a robust ratemaking system. As pointed in Bermúdez (2009), even a minor correlation between the claim counts can lead to major differences in ratemaking. Failure to take into account the positive correlation between the claims will often result in premiums which are too low relative to the underlying risk.
In the literature, there were some papers discussing zero-inflated models in the multivariate version. Most of these models concentrated on the case where the marginal claim count distributions are Poisson. Li et al. (1999) proposed a multivariate zero-inflated Poisson model as a mixture of m + 2 components of m-dimensional discrete distributions. The complexity of this model renders the implementation of maximum likelihood estimation not an easy job for large m. Bermúdez and Karlis (2011) considered zero-inflated versions for the multivariate Poisson model with common and full covariance structures. The inference procedure was completed using a Bayesian framework. Using a similar structure as in Li et al. (1999), Dong et al. (2014) gave a multivariate zero-inflated negative binomial model.
Recently Liu and Tian (2015) examined a new multivariate zero-inflated Poisson model. This model had the advantage of avoiding the computation issues resulting from an increase of dimension. However, all components in this model share a common zero-inflation parameter which restricts scope for application of this model. In addition, the fact that each marginal distribution is assumed to be Poisson imposes a restriction on the ability of the model to work with many different data sets. In an attempt to alleviate these limitations, we propose a multivariate zero-inflated hurdle model. A univariate hurdle model (Mullahy, 1986) is a two-part model that separates the occurrence of an event from the number of those events actually observed. Constructing a multivariate hurdle model by assuming a hurdle distribution in every margin has two advantages. First, it can easily handle the zero-inflation or zero-deflation feature in each margin. Second, it provides users with greater choice in modeling marginal behaviors. See some examples regarding the implementation of hurdle models in Boucher et al. (2007). As a result, our proposed multivariate zero-inflated hurdle model has greater in-built flexibility than the multivariate zero-inflated Poisson model considered in Liu and Tian (2015) in respect of diversifying marginal zero-inflation parameters and employing non-Poisson marginal distributions. The hurdle model has also been considered in a multivariate context by Zhang et al. (2020), but to study the Type I multivariate zero-truncated data, which has a very different feature from the type of data studied in this paper.
In our work, the inference process is enabled using the EM algorithm (Dempster et al., 1977). The EM algorithm is a two-step iterative method to find the maximum likelihood estimates (MLEs). It is particularly useful when working with zero-inflated models. Examples illustrating the implementation of the EM algorithm in zero-inflated models can be found in Lambert (1992) and Hall (2000). Unfortunately, when covariates are introduced in our model, there is no closed-form representation in the M-step. We could find the optimal values in the M-step using the Newton-Raphson method however this was shown to be computationally expensive. Rai (1993) provided an alternative approach in which the Newton-Raphson method is carried out for only one step in the M-step. This can reduce the computation time considerably.
Our work contributes to the existing literature in several ways. First, we provide a very efficient way to generalize the zero-inflated model from univariate case to multivariate case. Our proposed framework can easily handle high dimensional data without any computational issues. Second, our model differs from the model of Liu and Tian (2015) in the sense that hurdle margins are assumed here instead of just Poisson margins. This generalization preserves the flexibility of capturing types of features in the insurance claims data. It is also worth stressing that no closed-form exists anymore in the M-step due to the introduction of covariates, enabling us to use a generalized EM algorithm for inference. Third, we emphasize the better performance of our proposed model compared with several existing candidate models when fitted using the same insurance data set.
The rest of the article is organized as follows. Section 2 provides the definition of general multivariate zero-inflated distribution and investigates some of its distributional properties. In Section 3, we propose our multivariate zero-inflated hurdle model, followed by the corresponding EM algorithm for model inference. In Section 4, the proposed model is applied to an automobile insurance data set. The last section concludes the paper.

Definition
We now define our new multivariate zero-inflated distribution. Let Y = (Y 1 , . . . , Y m ) denote a discrete random vector where Y j , j = 1, . . . , m, are independent of each other and defined on N. Then Z = (Z 1 , . . . , Z m ) is said to follow the multivariate zero-inflated distribution if where U ∼ Bernoulli(π 0 ), 0 < π 0 < 1, and U is independent of Y. The symbol " d = " means that the random variables on both sides of the equality share the same distribution. The probability mass function (pmf) of Z can be derived as where z = (z 1 , . . . , z m ) is a vector of observed values, v = I(z = 0 m ) and I( · ) is an indicator function.

Marginal distributions
We first derive the marginal distribution for a single variable. The marginal distribution of Z j is Thus, Next we derive an expression for the marginal distribution of an arbitrary random sub-vector of Z. Denote J = (j 1 , j 2 , . . . , j r ) ⊂ (1, 2, . . . , m) where 1 < r < m and J C = (j r+1 , j r+2 , . . . , j m ) as the complementary set. Let Z r = (Z j 1 , Z j 2 , . . . , Z j r ) and z r = (z j 1 , z j 2 , . . . , z j r ) , the distribution of Z r is Proof. If z r = 0 r , Thus, The marginal distributions can also be obtained from the definition Z d = UY.

Conditional distribution
Let Z m−r = (Z j r+1 , Z j r+2 , . . . , Z j m ) and z m−r = (z j r+1 , z j r+2 , . . . , z j m ) . The conditional distribution of Z r |Z m−r is If z m−r = 0 m−r and z r = 0 r , If z m−r = 0 m−r and z r = 0 r ,

Expectation, variance and covariance
The expectation and variance of Z j , j = 1, . . . , m, are and the covariance between Z j and Z k , Proof. This is easily obtained from Z d = UY.

Moment generating function
The moment generating function of Z is Proof. The moment generating function of Z is

Two special cases
In this subsection, we introduce two multivariate zero-inflated distributions where the margins are assumed to be either all Poisson or all negative binomial distributed. Users do not have the flexibility to vary the marginal distribution types for these two distributions. We note that the multivariate zero-inflated Poisson distribution was first proposed by Liu and Tian (2015). To our best knowledge, the multivariate zero-inflated negative binomial distribution has not been studied in the literature before, but it has the same limitations as the Poisson model, which was discussed above in Section 1. We will use these two models in our model comparison in Section 4. For the purpose of simplification, we have put the EM algorithms of parameter estimation for these two models in Appendix A, where the location parameters are all covariate-dependent.

Model characterization
We shall assume that each underlying random variable Y j in (2.1) follows a zero-modified distribution, which can be characterized as follows: where W j follows a count distribution defined on N + , U j ∼ Bernoulli(π j ), 0 < π j < 1, and U j is independent of W j . Again, we assume that all Y j are independent of each other. Then Z constructed by (2.1) is said to follow the multivariate zero-inflated hurdle distribution with parameter vectors π = (π 1 , . . . , π m ) , = ( 1 , . . . , m ) and a zero-inflation parameter π 0 . Here j is the set of parameters related to W j . The pmf of Z can be expressed as where v = I(z = 0 m ).

Model inference
Suppose each Z i , i = 1, . . . , n, independently follows a multivariate zeroinflated hurdle distribution. The corresponding observed values are z 1 , . . . , z n , where z i = (z i1 , . . . , z im ) . The latent variables are v 1 , . . . , v n , where v i = I(z i = 0 m ). Now we introduce some covariates, x 1 , . . . , x n , where The parameter π ij can then be modeled as For the purpose of easy interpretation, we do not inject covariates in π 0 . We denote β = (β 1 , . . . , β m ) as the set of parameters related to all π ij , and as the set of parameters related to all W j , the likelihood function then can be written as The observed log-likelihood function can be divided into two parts: Thus, the maximization procedure can be completed for 1 and 2 , respectively. For 2 , the estimation can proceed in respect of the zero-truncation part of each margin separately. For 1 , we implement the EM algorithm as described below.
The observed log-likelihood function 1 can be rewritten as In addition to the known values z i , suppose we also know the value u i , one for each individual to take the value 1 if the observation of common zeros is inflated and 0 otherwise. The complete data log-likelihood function then becomes Note in our case, v i z ij = 0. Given initial values of parameters β and π 0 , the EM algorithm proceeds as follows.
• E-step: Replace u i bȳ -For π 0 , we can get There is no closed-form representation for β j , so we update the regression parameters, respectively, for each j by implementing the Newton-Raphson method for one step. The first-and second-order derivatives are given as follows: • Iterate through the E-step and M-step until some convergence criterion is met, for example, the relative change of observed log-likelihood between two consecutive iterations is smaller than a tolerance level ε.

Remark 1.
The initial values of parameters β j for EM algorithm can be obtained by implementing univariate logistic regression with observed values z 1j , . . . , z nj .
The initial value of parameter π 0 can be set as 0.5. The standard errors for the estimates can be approximated using the approach in Louis (1982).
For the case without covariates incorporated in π ij , the EM algorithm is simplified as shown below.
• E-step: Replace u i bȳ Initial value for π j can be set as the proportion of non-zeros for margin j.
• M-step: -Update π 0 through the following equation: -Update each π j through the following equation:

Data description
This application is based on an automobile portfolio from a major insurance company operating in Spain in 1995. The whole data set consists of 80,994 policyholders. We have access to a rich set of information for each individual. The detailed description for each predictor is presented in Table 1. The mean of each covariate is also provided in the table to show the proportion of corresponding group. For example, the mean of v1 tells us that 16.0% of the policyholders are female. The simplest policy only includes third-party liability (denoted as Z 1 type) and a set of basic guarantees such as emergency roadside assistance, legal assistance or insurance covering medical costs (denoted as Z 2 type). The comprehensive coverage (damage to one's vehicle caused by any unknown party, for example, damage resulting from theft, flood or fire) and the collision coverage (damage resulting from a collision with another vehicle or object when the policyholder is at fault) are also denoted as Z 2 type. For our purpose, we only select policyholders with full coverage (v9 = 0, v10 = 1) to implement our analysis. This is consistent with the analysis in Bermúdez and Karlis (2017). This subset only contains information for 28,590 policyholders. The empirical joint   Table 2. The overall Pearson's correlation coefficient between these two types of claim is 0.189. This observed correlation is consistent with our model's positive correlation assumption. For our study, we randomly take 70% of the observations from the subset as training data to develop the models, and the remaining 30% are reserved as hold-out sample for validation purpose.

Model fitting
Prior to fitting our proposed hurdle model, we need to determine the distributions for the zero-truncated univariate components W j , j = 1, 2. In addition to the commonly used zero-truncated Poisson (ZTP) and zero-truncated negative binomial (ZTNB) distributions, we also try unit-shifted Poisson (USP) and unit-shifted negative binomial (USNB) distributions. Implementing a unitshifted distribution on W j is equivalent to using a standard count distribution for W j − 1. These four univariate models can be implemented in R using the packages gamlss and gamlss.tr. The results are summarized in Table 3. Both the log-likelihood values and the χ 2 statistics indicate that the USNB distribution outperforms the alternatives for both claim type counts. The small differences between the observed and fitted frequencies demonstrate the adequacy of adopting a USNB distribution for both margins. We next turn to parameter estimation when covariates are introduced in our multivariate zero-inflated hurdle (MZIH) model. The estimation results in Table 4 correspond to π j , j = 0, 1, 2. The 95% confidence interval of π 0 is (0.318, 0.372), where the upper bound is far below the boundary 1. This reveals the zero-inflation feature reflected in the data set. Focusing on the claims of Z 1 type, parameters v4 and v7 are statistically significant. It can be concluded that policyholders with more years with the insurance company (v7) exhibit a lower probability of claiming. On the other hand, driving in a high-risk region (v4) is associated with an increase in the probability of making a claim. If we focus on the claims of Z 2 type, we notice that v3, v5, v7 and v11 are all statistically significant. This suggests that driving in a zone of medium risk (v3), driving license between 4 and 14 years (v5) and greater horsepower (v11) are all associated with increased chances of a claim in this category. However, clients with the company for more than 5 years (v7) have a lower probability of making a claim. Table 5 is associated with the estimates for W j , j = 1, 2. In this table, we report the coefficient estimates when a linear model with logarithmic link   Furthermore, we present in Table 6 the observed and expected frequencies (numbers in the brackets) based on the MZIH model. It can be observed that the overall fit is acceptable. The result of the chi-squared test indicates that only a few cells contribute to this goodness-of-fit. It is thus our belief that the overall quality of fit is good considering the number of cells.
Finally, we compare different available models. Apart from our proposed MZIH model, fitted with the USNB marginal distributions, our candidate models also include the multivariate zero-inflated Poisson (MZIP) and multivariate zero-inflated negative binomial (MZINB) models. It is worth mentioning that for the zero-truncation parts in our MZIH model, only intercepts are adopted to avoid the over-fitting problem. Furthermore, a model with two independent hurdle margins (Ind) is fitted as a benchmark. The marginal zerotruncated distributions in the independent hurdle model are the same as for the MZIH model. Different information criteria are provided in Table 7. By assuming the existence of a zero-inflation parameter, we are able to improve the model fitting considerably. This confirms the fact that extra common zeros exist in our data set. Also the superior performance of MZIH model over MZIP and MZINB models verifies the benefit of having extra flexibility in our MZIH model when handling marginal distributions.

Predictive analysis
To evaluate the predictive performance, we calculate the predicted claim frequencies and compare these to the observed frequencies from the test sample for the following scenarios: no claims in any line, claims occur in only one of the lines and claims occur in both lines. The candidate models include MZIH, MZIP, MZINB and Ind models. The goodness-of-fit results are shown in Table 8. Our conclusions are consistent with those made based on Table 7. As anticipated, we observe very poor performance from the Ind model, which can be explained by the failure to model the excess common zeros in the data set as well as the positive correlation between the two lines of claims. The more accurate prediction of MZIH model against MZIP and MZINB models is largely due to the introduction of hurdle margins. The tiny discrepancies between observed and predicted frequencies of our MZIH model suggest a satisfactory quality of fit for this particular data set.

Model comparison
In this subsection, we apply the four models from the previous section using the whole data set with 80,994 policyholders to facilitate the comparison between these models and the ones in Bermúdez (2009) and Bermúdez and Karlis (2012). All eleven covariates are considered in this case. The first candidate model is the best model studied in Bermúdez (2009) that is the zero-inflated bivariate Poisson (ZIBP) model with regressors on the third Poisson parameter λ 3 . The second candidate model is the best model studied in Bermúdez and Karlis (2012), which is the 2-finite mixture of bivariate Poisson (2-FMBP) model with regressors on the mixing proportion p. For readers' convenience, description of the ZIBP and 2-FMBP models is given in Appendix B. Our MZIH model is still fitted with the USNB marginal distributions for zerotruncation parts with only significant predictors reserved. The comparative analysis is shown in Table 9. As is evident from the table, all information criteria agree that our proposed MZIH model still performs the best.

CONCLUDING REMARKS
In this article, we introduced a new type of multivariate model, the so called multivariate zero-inflated hurdle (MZIH) model. We started with the definition for the multivariate zero-inflated distribution and provided some of its properties. Next, two special multivariate zero-inflated models, namely the MZIP and MZINB, were presented with associated EM algorithms necessary for estimating their parameters given in Appendix A. However, these two models often cannot accommodate the observed characteristics in the marginal distributions of claim counts. Our proposed MZIH model could address these limitations by allowing any zero-modified distribution for each margin. The separation of zeros from the positive parts provided us with more freedom to deal with zero features in each dimension and the marginal distributions as well. The usefulness of our model was illustrated with the help of real data from automobile insurance. The results from fitting several relevant models were shown and compared. As expected, the superiority of our proposed MZIH model was supported by different information criteria as well as predictive performance. Our proposed MZIH model provides significantly enhanced flexibility compared to univariate models for the actuary who is dealing with multiple related insurance coverage. Our model takes into account the correlations between observed claim frequencies for different lines of insurance business, meaning that the models for one line of business are enhanced by information in the data relating to other lines of business. Univariate models are unable to incorporate this given their separate focus on individual lines of business. Second, the MZIH model builds on the analysis from univariate modeling of claim counts by adding a hurdle requirement within a multivariate structure. More importantly, despite the significantly enhanced model flexibility inherent in MZIH model analysis, the additional effort in coding and computation is limited. Starting form appropriate values, our program for the EM algorithm only takes several minutes to obtain the results when working on tens of thousands of observations. The estimation aimed at the positive part of each margin can be easily handled in R using publicly available packages. All of these advantages make the MZIH model an attractive candidate when studying claims with the zero-inflation feature in a multivariate context.

R CODE
The authors used the R package gamlss and gamlss.tr in the model fitting of this paper. The authors are happy to share their R code when needed. Interested readers please contact the corresponding author. λ ij = exp (x i β j ) with new parameters β j = (β j0 , β j1 , . . . , β jp ) . If we denote β = (β 1 , . . . , β m ), the likelihood function becomes Following the same data augmentation method as for the multivariate zero-inflated hurdle model, we obtain the complete data likelihood function given as The complete data log-likelihood function is where C 0 is a constant not related to (β, π 0 ). Note in our case, v i z ij = 0. The associated EM algorithm is given below.
• E-step: Replace u i byū where λ ij = exp (x i β j ). • M-step: -For π 0 , we can get There is no closed-form representation for β j , so we update the regression parameters, respectively, for each j by implementing the Newton-Raphson method for one step. The first-and second-order derivatives are given as follows: For the case without covariates incorporated in λ ij , the corresponding E-step and M-step can be simplified as follows.
• E-step: Replace u i byū • M-step: -Update π 0 using the following equation: -Update each λ j using the following equation:
The EM algorithm can be described as follows.
There is no closed-form representation for β j and θ j , so we update the regression parameters, respectively, for each j by implementing the Newton-Raphson method for one step.
-Update each μ j using the following equation: -Substitute the value of μ j obtained from the last step into¯ C j , update θ j using the one variable Newton-Raphson method.