A NEW MULTIVARIATE ZERO-INFLATED HURDLE MODEL WITH APPLICATIONS IN AUTOMOBILE INSURANCE

Pengcheng Zhang; David Pitt; Xueyuan Wu

doi:10.1017/asb.2021.39

A NEW MULTIVARIATE ZERO-INFLATED HURDLE MODEL WITH APPLICATIONS IN AUTOMOBILE INSURANCE

Published online by Cambridge University Press: 07 January 2022

Pengcheng Zhang ,

David Pitt and

Xueyuan Wu

Show author details

Pengcheng Zhang: Affiliation:
School of Insurance, Shandong University of Finance and Economics, Jinan250014, China, E-Mail: qingdaozpc@163.com
David Pitt: Affiliation:
Centre for Actuarial Studies, Department of Economics, The University of Melbourne, Melbourne, VIC3010, Australia, E-Mail: david.pitt@unimelb.edu.au
Xueyuan Wu*: Affiliation:
Centre for Actuarial Studies, Department of Economics, The University of Melbourne, Melbourne, VIC3010, Australia, E-Mail: xueyuanw@unimelb.edu.au
*: E-Mail: xueyuanw@unimelb.edu.au

Article contents

Abstract
INTRODUCTION
Multivariate zero-inflated distribution
Multivariate zero-inflated hurdle model
Application
Concluding remarks
R code
References

Rights & Permissions

Abstract

The fact that a large proportion of insurance policyholders make no claims during a one-year period highlights the importance of zero-inflated count models when analyzing the frequency of insurance claims. There is a vast literature focused on the univariate case of zero-inflated count models, while work in the area of multivariate models is considerably less advanced. Given that insurance companies write multiple lines of insurance business, where the claim counts on these lines of business are often correlated, there is a strong incentive to analyze multivariate claim count models. Motivated by the idea of Liu and Tian (Computational Statistics and Data Analysis, 83, 200–222; 2015), we develop a multivariate zero-inflated hurdle model to describe multivariate count data with extra zeros. This generalization offers more flexibility in modeling the behavior of individual claim counts while also incorporating a correlation structure between claim counts for different lines of insurance business. We develop an application of the expectation–maximization (EM) algorithm to enable the statistical inference necessary to estimate the parameters associated with our model. Our model is then applied to an automobile insurance portfolio from a major insurance company in Spain. We demonstrate that the model performance for the multivariate zero-inflated hurdle model is superior when compared to several alternatives.

Keywords

Multivariate zero-inflation Hurdle model EM algorithm automobile insurance

Type: Research Article
Information: ASTIN Bulletin: The Journal of the IAA , Volume 52 , Issue 2 , May 2022 , pp. 393 - 416

DOI: https://doi.org/10.1017/asb.2021.39 [Opens in a new window]
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright: © The Author(s), 2022. Published by Cambridge University Press on behalf of The International Actuarial Association

1. INTRODUCTION

Generalized linear models (GLMs) have been widely applied in insurance ratemaking over the last few decades. Within the GLM framework, Poisson and negative binomial distributions have been routinely applied to the analysis of insurance claim frequencies. Some detailed illustrations can be found in Cameron and Trivedi (Reference Cameron and Trivedi1998) and Frees (Reference Frees2009). In automobile insurance, many attempts have also been made in the actuarial literature to find an appropriate distribution to model the dollar cost associated with individual claims. It is often the case that automobile insurance policyholders do not incur any loss in the period of insurance coverage. Even if the insureds do incur a loss, they may opt not to put forward a claim in order to maintain a high level of no claims discount to their annual insurance premium. In short, insurance claim frequency data are often characterized by a large number of zero claims. This, in turn, leads to the study of zero-inflated versions of typical count distributions (see, e.g., Yip and Yau Reference Yip and Yau2005 and Boucher et al. Reference Boucher, Denuit and GuillÉn2007).

In practice, an insurer may provide multiple lines of insurance business, where claims from different sources are bundled into one single policy. For example, in automobile insurance, one policy may cover the payment in respect of third-party liability and also losses sustained by the insured party. It is clear from data that claims from these two forms of insurance are often triggered from the same event leading to correlation between observed claim counts. Dependence between costs associated with two (or more) different types of claims must be considered when building a robust ratemaking system. As pointed in BermÚdez (Reference BermÚdez2009), even a minor correlation between the claim counts can lead to major differences in ratemaking. Failure to take into account the positive correlation between the claims will often result in premiums which are too low relative to the underlying risk.

In the literature, there were some papers discussing zero-inflated models in the multivariate version. Most of these models concentrated on the case where the marginal claim count distributions are Poisson. Li et al. (Reference Li, Lu, Park, Kim, Brinkley and Peterson1999) proposed a multivariate zero-inflated Poisson model as a mixture of $m+2$ components of m-dimensional discrete distributions. The complexity of this model renders the implementation of maximum likelihood estimation not an easy job for large m. BermÚdez and Karlis (Reference BermÚdez and Karlis2011) considered zero-inflated versions for the multivariate Poisson model with common and full covariance structures. The inference procedure was completed using a Bayesian framework. Using a similar structure as in Li et al. (Reference Li, Lu, Park, Kim, Brinkley and Peterson1999), Dong et al. (Reference Dong, Clarke, Yan, Khattak and Huang2014) gave a multivariate zero-inflated negative binomial model.

Recently Liu and Tian (Reference Liu and Tian2015) examined a new multivariate zero-inflated Poisson model. This model had the advantage of avoiding the computation issues resulting from an increase of dimension. However, all components in this model share a common zero-inflation parameter which restricts scope for application of this model. In addition, the fact that each marginal distribution is assumed to be Poisson imposes a restriction on the ability of the model to work with many different datasets. In an attempt to alleviate these limitations, we propose a multivariate zero-inflated hurdle model. A univariate hurdle model (Mullahy Reference Mullahy1986) is a two-part model that separates the occurrence of an event from the number of those events actually observed. Constructing a multivariate hurdle model by assuming a hurdle distribution in every margin has two advantages. First, it can easily handle the zero-inflation or zero-deflation feature in each margin. Second, it provides users with greater choice in modeling marginal behaviors. See some examples regarding the implementation of hurdle models in Boucher et al. (Reference Boucher, Denuit and GuillÉn2007). As a result, our proposed multivariate zero-inflated hurdle model has greater in-built flexibility than the multivariate zero-inflated Poisson model considered in Liu and Tian (Reference Liu and Tian2015) in respect of diversifying marginal zero-inflation parameters and employing non-Poisson marginal distributions. The hurdle model has also been considered in a multivariate context by Zhang et al. (Reference Zhang, CalderÍn-Ojeda, Li and Wu2020), but to study the Type I multivariate zero-truncated data, which has a very different feature from the type of data studied in this paper.

In our work, the inference process is enabled using the EM algorithm (Dempster et al. Reference Dempster, Laird and Rubin1977). The EM algorithm is a two-step iterative method to find the maximum likelihood estimates (MLEs). It is particularly useful when working with zero-inflated models. Examples illustrating the implementation of the EM algorithm in zero-inflated models can be found in Lambert (Reference Lambert1992) and Hall (Reference Hall2000). Unfortunately, when covariates are introduced in our model, there is no closed-form representation in the M-step. We could find the optimal values in the M-step using the Newton–Raphson method however this was shown to be computationally expensive. Rai (1993) provided an alternative approach in which the Newton–Raphson method is carried out for only one step in the M-step. This can reduce the computation time considerably.

Our work contributes to the existing literature in several ways. First, we provide a very efficient way to generalize the zero-inflated model from univariate case to multivariate case. Our proposed framework can easily handle high dimensional data without any computational issues. Second, our model differs from the model of Liu and Tian (Reference Liu and Tian2015) in the sense that hurdle margins are assumed here instead of just Poisson margins. This generalization preserves the flexibility of capturing types of features in the insurance claims data. It is also worth stressing that no closed-form exists anymore in the M-step due to the introduction of covariates, enabling us to use a generalized EM algorithm for inference. Third, we emphasize the better performance of our proposed model compared with several existing candidate models when fitted using the same insurance dataset.

The rest of the article is organized as follows. Section 2 provides the definition of general multivariate zero-inflated distribution and investigates some of its distributional properties. In Section 4, we propose our multivariate zero-inflated hurdle model, followed by the corresponding EM algorithm for model inference. In Section 5, the proposed model is applied to an automobile insurance dataset. The last section concludes the paper.

2 Multivariate zero-inflated distribution

2.1 Definition

We now define our new multivariate zero-inflated distribution. Let $ \textbf{Y}=(Y_1, \ldots,Y_m)^\top $ denote a discrete random vector where $Y_j$ , $j=1,\ldots, m$ , are independent of each other and defined on $\mathbb{N}$ . Then $\textbf{Z}=(Z_1,\ldots,Z_m)^\top $ is said to follow the multivariate zero-inflated distribution if

(2.1)

\begin{eqnarray}\textbf{Z} \overset{d}{=} U\textbf{Y} = \left\{ \begin{gathered} \textbf{0}_m, \quad \hfill U = 0, \\[3pt] \textbf{Y}, \quad \hfill U = 1, \\\end{gathered} \right.\end{eqnarray}

where $U \sim Bernoulli(\pi_0)$ , $0<\pi_0<1$ , and U is independent of $\textbf{Y}$ . The symbol $``\overset{d}{=}"$ means that the random variables on both sides of the equality share the same distribution. The probability mass function (pmf) of $\textbf{Z}$ can be derived as

(2.2)

\begin{eqnarray}\Pr(\textbf{Z}=\textbf{Z})= \left[1-\pi_0+\pi_0\prod_{j=1}^{m}\Pr(Y_j=0)\right]^{v}\left[\pi_0 \prod_{j=1}^{m}\Pr(Y_j=z_j)\right]^{1-v},\end{eqnarray}

where $\textbf{Z}=(z_1,\ldots,z_m)^\top$ is a vector of observed values, $v=\mathbb{I}(\textbf{Z}=\textbf{0}_m)$ and $\mathbb{I}(\cdot)$ is an indicator function.

2.2 Properties of distribution

2.2.1 Marginal distributions

We first derive the marginal distribution for a single variable. The marginal distribution of $Z_j$ is

\begin{eqnarray*}\Pr ({Z_j} = {z_j}) = \left\{ \begin{gathered} \pi_0 f_{Y_j}(z_j), \quad \hfill z_j > 0, \\ 1 -\pi_0+\pi_0 f_{Y_j}(0), \quad \hfill z_j = 0.\end{gathered} \right.\end{eqnarray*}

Proof. If $z_j>0$ ,

\begin{eqnarray*} \Pr (Z_j=z_j) &=& \sum_{z_1=0}^\infty \cdots \sum_{z_{j-1}=0}^\infty \sum_{z_{j+1}=0}^\infty \cdots \sum_{z_m=0}^\infty \Pr(\textbf{Z}=\textbf{Z}) \\ &=& \pi_0 f_{Y_j}(z_j)\prod_{k=1,k\ne j}^m \sum_{z_k=0}^\infty f_{Y_k}({z_k}) \\ &=& \pi_0 f_{Y_j}(z_j).\end{eqnarray*}

Thus,

\begin{align*} \Pr (Z_j=0)=1-\sum_{z_j=1}^\infty \Pr(Z_j=z_j)=1-\pi_0+\pi_0 f_{Y_j}(0).\\[-40pt] \end{align*}

Next we derive an expression for the marginal distribution of an arbitrary random sub-vector of $\textbf{Z}$ . Denote $J = ({j_1},{j_2}, \ldots ,{j_r}) \subset (1,2, \ldots ,m)$ where $1<r<m$ and $J^C = ({j_{r+1}},{j_{r+2}}, \ldots ,{j_m})$ as the complementary set. Let ${\textbf{Z}_{r}}=(Z_{j_1},Z_{j_2}, \ldots ,Z_{j_r})^\top$ and ${\textbf{Z}_{r}}=(z_{j_1},z_{j_2}, \ldots ,z_{j_r})^\top$ , the distribution of $\textbf{Z}_{r}$ is

\begin{eqnarray*}\Pr (\textbf{Z}_r = \textbf{Z}_r) = \left\{ \begin{gathered} \pi_0 \prod_{j \in J} f_{Y_j}(z_j),\hfill \textbf{Z}_{r} \neq \textbf{0}_r, \\[3pt] 1-\pi_0+\pi_0\prod_{j \in J} f_{Y_j}(0), \quad \hfill \textbf{Z}_{r}=\textbf{0}_{r}.\end{gathered} \right.\end{eqnarray*}

Proof. If $\textbf{Z}_{r} \neq \textbf{0}_{r}$ ,

\begin{eqnarray*} \Pr (\textbf{Z}_r=\textbf{Z}_r) &=& \sum_{z_{j_{r+1}}=0}^\infty \cdots \sum_{z_{j_m}=0}^\infty{\Pr(\textbf{Z}=\textbf{Z})} \\[3pt] &=& \pi_0 \prod_{j \in J} f_{Y_j}(z_j) \prod_{j \in J^C}\sum_{z_j= 0}^\infty f_{Y_j}(z_j)\\[3pt] &=& \pi_0 \prod_{j \in J} f_{Y_j}(z_j).\end{eqnarray*}

Thus,

\begin{align*} \Pr (\textbf{Z}_r=\textbf{0}_r) &= 1 - \sum_{\textbf{Z}_r \neq \textbf{0}_r}\Pr(\textbf{Z}_r=\textbf{Z}_r) \\[3pt] &= 1-\pi_0\sum_{\textbf{Z}_r \neq \textbf{0}_r}{\prod_{j \in J}{f_{Y_j}}(z_j)}\\[3pt] &= 1-\pi_0\left[ 1 - \sum_{\textbf{Z}_r = \textbf{0}_r} \prod_{j \in J} f_{Y_j}(z_j) \right] \\[3pt] &= 1- \pi_0\left[ 1-\prod_{j \in J} f_{Y_j}(0) \right] \\[3pt] &= 1-\pi_0+\pi_0\prod_{j \in J} f_{Y_j}(0).\\[-50pt] \end{align*}

The marginal distributions can also be obtained from the definition $\textbf{Z} \overset{d}{=} U\textbf{Y}$ .

2.2.2 Conditional distribution

Let $\textbf{Z}_{m-r}=(Z_{j_{r+1}},Z_{j_{r+2}}, \ldots ,Z_{j_m})^\top$ and ${\textbf{Z}_{m-r}}=(z_{j_{r+1}},z_{j_{r+2}}, \ldots ,z_{j_m})^\top$ . The conditional distribution of ${\textbf{Z}_{r}} | {\textbf{Z}_{m-r}}$ is

\begin{eqnarray*}\Pr ({\textbf{Z}_r}&=& {\textbf{Z}_r} | {\textbf{Z}_{m-r}} = {\textbf{Z}_{m-r}})\\&&\qquad = \left\{ \begin{gathered} \prod_{j \in J} f_{Y_j}(z_j), \hfill \textbf{Z}_{m-r} \neq \textbf{0}_{m-r}, \\ \pi_0^{\ast}\prod_{j \in J} f_{Y_j}(z_j), \quad \hfill \textbf{Z}_r \neq \textbf{0}_r, \textbf{Z}_{m-r}=\textbf{0}_{m-r}, \\ 1-\pi_0^{\ast}+\pi_0^{\ast}\prod_{j \in J} f_{Y_j}(0), \quad \hfill \textbf{Z}_r=\textbf{0}_r, \textbf{Z}_{m-r}=\textbf{0}_{m-r},\end{gathered} \right.\end{eqnarray*}

where $\pi_0^{\ast}=\frac{\pi_0\prod_{j \in J^C} f_{Y_j}(0)}{1-\pi_0+\pi_0\prod_{j \in J^C} f_{Y_j}(0)}$ .

Proof. If $\textbf{Z}_{m-r} \neq \textbf{0}_{m-r}$ ,

\begin{eqnarray*} \Pr ({\textbf{Z}_{r}}= {\textbf{Z}_{r}} | {\textbf{Z}_{m-r}} = {\textbf{Z}_{m-r}}) =\frac{\pi_0\prod_{j = 1}^m f_{Y_j}(z_j)}{\pi_0 \prod_{j \in {J^C}} f_{Y_j}(z_j)} =\prod_{j\in J}f_{Y_j}(z_j).\end{eqnarray*}

If $\textbf{Z}_{m-r} = \textbf{0}_{m-r}$ and $\textbf{Z}_{r} \neq \textbf{0}_{r}$ ,

\begin{eqnarray*}\Pr (\textbf{Z}_{r}= \textbf{Z}_{r} | \textbf{Z}_{m-r}=\textbf{0}_{m-r}) &=&\frac{\pi_0 \prod_{j \in {J}} f_{Y_j}(z_j)\prod_{j \in {J^C}} f_{Y_j}(0)}{1-\pi_0+\pi_0\prod_{j \in J^C} f_{Y_j}(0)} \\[2pt]&=& \pi_0^{\ast}\prod_{j \in J} f_{Y_j}(z_j).\end{eqnarray*}

If $\textbf{Z}_{m-r} = \textbf{0}_{m-r}$ and $\textbf{Z}_{r} =\textbf{0}_{r}$ ,

\begin{eqnarray*}\Pr (\textbf{Z}_r= \textbf{0}_r | \textbf{Z}_{m-r}=\textbf{0}_{m-r}) &=&\frac{1-\pi_0+\pi_0\prod_{j=1}^{m} f_{Y_j}(0) }{1-\pi_0+\pi_0\prod_{j \in J^C} f_{Y_j}(0)} \\[2pt]&=& 1-\pi_0^{\ast}+\pi_0^{\ast}\prod_{j \in J} f_{Y_j}(0).\\[-40pt]\end{eqnarray*}

2.2.3 Expectation, variance and covariance

The expectation and variance of $Z_j$ , $j=1,\ldots,m$ , are

\begin{eqnarray*}\mbox{E}(Z_j)=\pi_0\mu _{1j}, \quad \mbox{Var}(Z_j)=\pi_0\mu_{2j}-\pi_0^2\mu_{1j}^2,\end{eqnarray*}

and the covariance between $Z_j$ and $Z_k$ , $j,k=1,\ldots,m$ , $j \neq k$ , is

\begin{eqnarray*}\mbox{Cov}(Z_j,Z_k) = \pi_0(1-\pi_0)\mu_{1j}\mu_{1k}>0,\end{eqnarray*}

where $\mu_{1j}= \mbox{E}(Y_j)$ , $\mu_{1k} = \mbox{E}(Y_k)$ and $\mu_{2j}=\mbox{E}(Y_j^2)$ .

Proof. This is easily obtained from $\textbf{Z} \overset{d}{=} U\textbf{Y}$ .

2.2.4 Moment generating function

The moment generating function of $\textbf{Z}$ is

\begin{eqnarray*}M_{\textbf{Z}}(\textbf{t})=1-\pi_0+\pi_0 \prod_{j=1}^m M_{Y_j}(t_j),\end{eqnarray*}

where $\textbf{t}=(t_1,\ldots,t_m)^\top$ .

Proof. The moment generating function of $\textbf{Z}$ is

\begin{eqnarray*}M_{\textbf{Z}}(\textbf{t}) &=& \mbox{E}\left[ {\exp (\textbf{t}^\top{\textbf{Z}})} \right] = \mbox{E}\left[ \exp (U\textbf{t}^\top {\textbf{Y}}) \right] = \mbox{E}\left\{ \mbox{E}\left[\exp(U\textbf{t}^\top{\textbf{Y}})|U \right] \right\} \\ &=& \mbox{E}\left[ M_{\textbf{Y}}(U\textbf{t})\right]=1-\pi_0+\pi_0 M_{\textbf{Y}}(\textbf{t})=1-\pi_0+\pi_0 \prod_{j=1}^m M_{Y_j}(t_j).\\[-40pt]\end{eqnarray*}

2.3 Two special cases

In this subsection, we introduce two multivariate zero-inflated distributions where the margins are assumed to be either all Poisson or all negative binomial distributed. Users do not have the flexibility to vary the marginal distribution types for these two distributions. We note that the multivariate zero-inflated Poisson distribution was first proposed by Liu and Tian (Reference Liu and Tian2015). To our best knowledge, the multivariate zero-inflated negative binomial distribution has not been studied in the literature before, but it has the same limitations as the Poisson model, which was discussed above in Section 1. We will use these two models in our model comparison in Section 5. For the purpose of simplification, we have put the EM algorithms of parameter estimation for these two models in Appendix A, where the location parameters are all covariate-dependent.

2.3.1 Multivariate zero-inflated Poisson distribution

Let $Y_j {\sim} Poisson(\lambda_j)$ , for $j=1,\ldots,m$ . Then $\textbf{Z}$ is said to follow the multivariate zero-inflated Poisson distribution with the parameter vector $\boldsymbol{\lambda}=(\lambda_1,\ldots, \lambda_m)^\top$ and a zero-inflation parameter $\pi_0$ , denoted by $\textbf{Z} \sim MZIP(\boldsymbol{\lambda},\pi_0)$ . The pmf of $\textbf{Z}$ is

(2.3)

\begin{eqnarray}\Pr(\textbf{Z}=\textbf{Z})= \left(1-\pi_0+\pi_0e^{-\sum_{j=1}^{m}\lambda_j}\right)^{v}\left(\pi_0\prod_{j=1}^{m} \frac{\lambda_j^{z_j}e^{-\lambda_j}}{z_j!}\right)^{1-v},\end{eqnarray}

where $v=\mathbb{I}(\textbf{Z}=\textbf{0}_m)$ .

2.3.2 Multivariate zero-inflated negative binomial distribution

Let $Y_j {\sim} NB(\mu_j,\theta_j)$ , for $j=1,\ldots,m$ . Then $\textbf{Z}$ is said to follow the multivariate zero-inflated negative binomial distribution with two parameter vectors $\boldsymbol{\mu}=(\mu_1,\ldots, \mu_m)^\top$ , $\boldsymbol{\theta}=(\theta_1,\ldots, \theta_m)^\top$ and a zero-inflation parameter $\pi_0$ , denoted by $\textbf{Z} \sim MZINB(\boldsymbol{\mu},\boldsymbol{\theta},\pi_0)$ . The pmf of $\textbf{Z}$ is

(2.4)

\begin{eqnarray}\Pr(\textbf{Z}=\textbf{Z}) &=& \left[1-\pi_0+\pi_0\prod_{j=1}^m{\left(\frac{\theta_j}{\mu_j+\theta_j} \right)}^{\theta_j}\right]^{v} \nonumber\\ && \times \left[\pi_0\prod_{j=1}^{m} \frac{\Gamma(z_j+\theta_j)}{\Gamma(\theta_j){z_j}!} \left(\frac{\mu_j}{\mu_j+\theta_j}\right)^{z_j}\left(\frac{\theta_j}{\mu_j+\theta_j}\right)^{\theta_j}\right]^{1-v},\end{eqnarray}

where $v=\mathbb{I}(\textbf{Z}=\textbf{0}_m)$ .

3 Multivariate zero-inflated hurdle model

3.1 Model characterization

We shall assume that each underlying random variable $Y_j$ in (2.1) follows a zero-modified distribution, which can be characterized as follows:

(3.1)

\begin{eqnarray}Y_j \overset{d}{=} U_j W_j = \left\{ \begin{gathered} 0, \quad \hfill U_j = 0, \\ W_j, \quad \hfill U_j = 1, \\\end{gathered} \right.\end{eqnarray}

where $W_j$ follows a count distribution defined on $\mathbb{N}_{+}$ , $U_j \sim Bernoulli(\pi_j)$ , $0<\pi_j<1$ , and $U_j$ is independent of $W_j$ . Again, we assume that all $Y_j$ are independent of each other. Then $\textbf{Z}$ constructed by (2.1) is said to follow the multivariate zero-inflated hurdle distribution with parameter vectors $\boldsymbol{\pi}=(\pi_1,\ldots, \pi_m)^\top$ , $\boldsymbol{\Theta}=(\boldsymbol{\Theta}_1,\ldots, \boldsymbol{\Theta}_m)^\top$ and a zero-inflation parameter $\pi_0$ . Here $\boldsymbol{\Theta}_j$ is the set of parameters related to $W_j$ . The pmf of $\textbf{Z}$ can be expressed as

(3.2)

\begin{eqnarray}\Pr(\textbf{Z}=\textbf{Z}) &=& \left[1-\pi_0+\pi_0\prod_{j=1}^{m}(1-\pi_j)\right]^{v} \nonumber \\ && \times \left[\pi_0 \prod_{j:z_j=0}(1-\pi_j) \prod_{j:z_j \ne 0}\pi_j f_{W_j}(z_j) \right]^{1-v},\end{eqnarray}

where $v=\mathbb{I}(\textbf{Z}=\textbf{0}_m)$ .

3.2 Model inference

Suppose each $\textbf{Z}_i$ , $i=1,\ldots, n$ , independently follows a multivariate zero-inflated hurdle distribution. The corresponding observed values are $\textbf{Z}_1,\ldots, \textbf{Z}_n$ , where $\textbf{Z}_i=(z_{i1},\ldots, z_{im})^\top$ . The latent variables are $v_1 ,\ldots, v_n$ , where $v_i=\mathbb{I}(\textbf{Z}_i=\textbf{0}_m)$ . Now we introduce some covariates, $\textbf{x}_1,\ldots, \textbf{x}_n$ , where $\textbf{x}_i=(1, x_{i1},\ldots, x_{ip})^\top$ . The parameter $\pi_{ij}$ can then be modeled as

(3.3)

\begin{eqnarray}\pi_{ij} = \frac{\exp({\textbf{x}_i^\top \boldsymbol{\beta}_j})}{1+\exp({\textbf{x}_i^\top \boldsymbol{\beta}_j})},\end{eqnarray}

where $\boldsymbol{\beta}_j=(\beta_{j0}, \beta_{j1},\ldots, \beta_{jp})^\top$ . For the purpose of easy interpretation, we do not inject covariates in $\pi_0$ . We denote $\boldsymbol{\beta}=(\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_m)$ as the set of parameters related to all $\pi_{ij}$ , and $\boldsymbol{\Theta}$ as the set of parameters related to all $W_j$ , the likelihood function then can be written as

(3.4)

\begin{eqnarray}L(\boldsymbol{\beta},\boldsymbol{\Theta},\pi_0) &=& \prod_{i=1}^{n}\left[1-\pi_0+\pi_0\prod_{j=1}^{m}(1-\pi_{ij})\right]^{v_i} \nonumber\\[3pt]&&\times \prod_{i=1}^{n}\left[\pi_0\prod_{j:z_{ij}=0}(1-\pi_{ij}) \prod_{j:z_{ij} \ne 0}\pi_{ij} f_{W_j}(z_{ij}) \right]^{1-v_i} .\end{eqnarray}

The observed log-likelihood function can be divided into two parts:

\begin{eqnarray*}\ell_1(\boldsymbol{\beta},\pi_0) &=&\sum_{i=1}^n v_i\log{\left[1-\pi_0+\pi_0\prod_{j=1}^{m}(1-\pi_{ij})\right]}+\sum_{i=1}^n (1-v_i)\log \pi_0\\[3pt]&&+\sum_{i = 1}^n (1-v_i){\left[\sum_{j:z_{ij} = 0}\log(1-{\pi_{ij}})+\sum_{j:z_{ij} \ne 0}\log \pi_{ij} \right]},\\\ell_2(\boldsymbol{\Theta}) &=&\sum_{i=1}^n\sum_{j:z_{ij} \ne 0}{(1-v_i)\log f_{W_{j}}(z_{ij})} = \sum_{j = 1}^m {\sum_{i:z_{ij} \ne 0} \log f_{W_{j}}(z_{ij})}.\end{eqnarray*}

Thus, the maximization procedure can be completed for $\ell_1$ and $\ell_2$ , respectively. For $\ell_2$ , the estimation can proceed in respect of the zero-truncation part of each margin separately. For $\ell_1$ , we implement the EM algorithm as described below.

Denote $\textbf{Z}'=(Z'_1,\ldots,Z'_m)^\top$ where $Z'_j=\mathbb{I}(Z_j>0)$ . The corresponding observed values are denoted by $\textbf{Z}'_1,\ldots,\textbf{Z}'_n$ where $\textbf{Z}'_i=(z'_{i1},\ldots,z'_{im})^\top$ . The observed log-likelihood function $\ell_1$ can be rewritten as

\begin{eqnarray*}\ell_1(\boldsymbol{\beta},\pi_0) &=&\sum_{i = 1}^n v_i\log{\left[1-\pi_0+\pi_0\prod_{j=1}^{m}(1-\pi_{ij})\right]}+\sum_{i = 1}^n (1-v_i)\log \pi_0\\[3pt]&&+\sum_{j=1}^m \sum_{i=1}^n (1-v_i)\left[z'_{ij}\log{\pi_{ij}} +(1-z'_{ij})\log (1-\pi_{ij}) \right].\end{eqnarray*}

In addition to the known values $\textbf{Z}'_i$ , suppose we also know the value $u'_i$ , one for each individual to take the value 1 if the observation of common zeros is inflated and 0 otherwise. The complete data log-likelihood function then becomes

\begin{eqnarray*}{\ell^c_1}(\boldsymbol{\beta},\pi_0) &=&\sum_{i=1}^n \left[u'_i v_i\log(1-\pi_0)+(1-u'_i v_i)\log\pi_0\right]\\&&+\sum_{j=1}^m \sum_{i=1}^n \left[z'_{ij}\log{\pi_{ij}}+(1-u'_i v_i-z'_{ij})\log (1-\pi_{ij}) \right] .\end{eqnarray*}

Note in our case, $v_i z'_{ij}=0$ . Given initial values of parameters $\boldsymbol{\beta}$ and $\pi_0$ , the EM algorithm proceeds as follows.

• E-step: Replace $u'_i$ by
\begin{eqnarray*}\bar{u}'_i = \frac{1-\pi_0}{1-\pi_0+\pi_0\prod_{j=1}^{m}(1-\pi_{ij})}, \quad i=1,\ldots,n,\end{eqnarray*}
where ${\pi_{ij}}=\frac{\exp({\textbf{x}_i^\top \boldsymbol{\beta}_j})}{1+\exp({\textbf{x}_i^\top \boldsymbol{\beta}_j})}$ .
• M-step:
• For $\pi_0$ , we can get
\begin{eqnarray*}\pi_0=1-\frac{1}{n}\sum_{i=1}^{n}\bar{u}'_i v_i.\end{eqnarray*}
• For $\boldsymbol{\beta}$ , let
\begin{eqnarray*}\bar{\ell}_{1j}^c(\boldsymbol{\beta}_j)=\sum_{i=1}^n \left[ z'_{ij}\log{\pi_{ij}}+(1-\bar{u}'_i v_i-z'_{ij})\log (1-\pi_{ij}) \right].\end{eqnarray*}
There is no closed-form representation for $\boldsymbol{\beta}_j$ , so we update the regression parameters, respectively, for each j by implementing the Newton–Raphson method for one step. The first- and second-order derivatives are given as follows:
\begin{eqnarray*}\frac{\partial \bar{\ell}_{1j}^c}{\partial {\boldsymbol{\beta}_j}} &=& \sum_{i=1}^n { \left[z'_{ij}-(1-\bar{u}'_i v_i)\pi_{ij}\right]{\textbf{x}}_i } , \\ \frac{\partial^2\bar{\ell}_{1j}^c}{\partial {\boldsymbol{\beta}_j}\partial \boldsymbol{\beta}_j^\top} &=& -\sum_{i=1}^n (1-\bar{u}'_i v_i){\pi_{ij}(1-\pi_{ij})}{\textbf{x}}_i{\textbf{x}}_i^\top.\end{eqnarray*}
• Iterate through the E-step and M-step until some convergence criterion is met, for example, the relative change of observed log-likelihood between two consecutive iterations is smaller than a tolerance level $\varepsilon$ .

Remark 1 The initial values of parameters $\boldsymbol{\beta} _{j}$ for EM algorithm can be obtained by implementing univariate logistic regression with observed values $z'_{1j},\ldots,z'_{nj}$ . The initial value of parameter $\pi_0$ can be set as 0.5. The standard errors for the estimates can be approximated using the approach in Louis (1982).

For the case without covariates incorporated in $\pi_{ij}$ , the EM algorithm is simplified as shown below.

• E-step: Replace $u'_i$ by
\begin{eqnarray*}\bar{u}'=\bar{u}'_i=\frac{1-\pi_0}{1-\pi_0+\pi_0\prod_{j=1}^{m}(1-\pi_{j})},\quad i=1,\ldots,n.\end{eqnarray*}
Initial value for $\pi_j$ can be set as the proportion of non-zeros for margin j.
• M-step:
• Update $\pi_0$ through the following equation:
\begin{eqnarray*}\pi_0=1-\frac{\bar{u}'}{n}\sum_{i=1}^{n}v_i.\end{eqnarray*}
• Update each $\pi_j$ through the following equation:
\begin{eqnarray*}\pi_j=\frac{ \sum_{i=1}^{n}z'_{ij}}{n-\bar{u}'\sum_{i=1}^{n}v_i},\quad j=1,\ldots,m.\end{eqnarray*}

4 Application

4.1 Data description

This application is based on an automobile portfolio from a major insurance company operating in Spain in 1995. The whole data set consists of 80,994 policyholders. We have access to a rich set of information for each individual. The detailed description for each predictor is presented in Table 1. The mean of each covariate is also provided in the table to show the proportion of corresponding group. For example, the mean of v1 tells us that 16.0% of the policyholders are female.

Table 1. The description for explanatory variables.

The simplest policy only includes third-party liability (denoted as $Z_1$ type) and a set of basic guarantees such as emergency roadside assistance, legal assistance or insurance covering medical costs (denoted as $Z_2$ type). The comprehensive coverage (damage to one’s vehicle caused by any unknown party, for example, damage resulting from theft, flood or fire) and the collision coverage (damage resulting from a collision with another vehicle or object when the policyholder is at fault) are also denoted as $Z_2$ type. For our purpose, we only select policyholders with full coverage ( $v9=0$ , $v10=1$ ) to implement our analysis. This is consistent with the analysis in BermÚdez and Karlis (Reference BermÚdez and Karlis2017). This subset only contains information for 28,590 policyholders. The empirical joint distribution for claim numbers $Z_1$ and $Z_2$ is displayed in Table 2. The overall Pearson’s correlation coefficient between these two types of claim is 0.189. This observed correlation is consistent with our model’s positive correlation assumption. For our study, we randomly take 70% of the observations from the subset as training data to develop the models, and the remaining 30% are reserved as hold-out sample for validation purpose.

Table 2. The empirical joint distribution of $Z_1$ and $Z_2$ .

4.2 Model fitting

Prior to fitting our proposed hurdle model, we need to determine the distributions for the zero-truncated univariate components $W_j$ , $j=1,2$ . In addition to the commonly used zero-truncated Poisson (ZTP) and zero-truncated negative binomial (ZTNB) distributions, we also try unit-shifted Poisson (USP) and unit-shifted negative binomial (USNB) distributions. Implementing a unit-shifted distribution on $W_j$ is equivalent to using a standard count distribution for $W_j-1$ . These four univariate models can be implemented in R using the packages gamlss and gamlss.tr. The results are summarized in Table 3. Both the log-likelihood values and the $\chi^2$ statistics indicate that the USNB distribution outperforms the alternatives for both claim type counts. The small differences between the observed and fitted frequencies demonstrate the adequacy of adopting a USNB distribution for both margins.

Table 3. Goodness-of-fit of marginal models.

We next turn to parameter estimation when covariates are introduced in our multivariate zero-inflated hurdle (MZIH) model. The estimation results in Table 4 correspond to $\pi_j$ , $j=0,1,2$ . The 95% confidence interval of $\pi_0$ is $(0.318, 0.372)$ , where the upper bound is far below the boundary 1. This reveals the zero-inflation feature reflected in the data set. Focusing on the claims of $Z_1$ type, parameters v4 and v7 are statistically significant. It can be concluded that policyholders with more years with the insurance company (v7) exhibit a lower probability of claiming. On the other hand, driving in a high-risk region (v4) is associated with an increase in the probability of making a claim. If we focus on the claims of $Z_2$ type, we notice that v3, v5, v7 and v11 are all statistically significant. This suggests that driving in a zone of medium risk (v3), driving license between 4 and 14 years (v5) and greater horsepower (v11) are all associated with increased chances of a claim in this category. However, clients with the company for more than 5 years (v7) have a lower probability of making a claim. Table 5 is associated with the estimates for $W_j$ , $j=1, 2$ . In this table, we report the coefficient estimates when a linear model with logarithmic link function is used to describe the location parameter in the USNB marginal distributions of the MZIH model. As can be observed from this table, conditional on the occurrence of claims, no predictor is significant for the expected number of $W_1$ and $W_2$ .

Table 4. Estimates and t-ratios associated with the covariates of $\pi_j$ in MZIH model.

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05.

Table 5. Estimates and t-ratios associated with the covariates of $W_j$ in MZIH model.

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05.

Furthermore, we present in Table 6 the observed and expected frequencies (numbers in the brackets) based on the MZIH model. It can be observed that the overall fit is acceptable. The result of the chi-squared test indicates that only a few cells contribute to this goodness-of-fit. It is thus our belief that the overall quality of fit is good considering the number of cells.

Table 6. Observed and expected frequencies of MZIH model for the joint distribution of $Z_1$ and $Z_2$ .

Finally, we compare different available models. Apart from our proposed MZIH model, fitted with the USNB marginal distributions, our candidate models also include the multivariate zero-inflated Poisson (MZIP) and multivariate zero-inflated negative binomial (MZINB) models. It is worth mentioning that for the zero-truncation parts in our MZIH model, only intercepts are adopted to avoid the over-fitting problem. Furthermore, a model with two independent hurdle margins (Ind) is fitted as a benchmark. The marginal zero-truncated distributions in the independent hurdle model are the same as for the MZIH model. Different information criteria are provided in Table 7. By assuming the existence of a zero-inflation parameter, we are able to improve the model fitting considerably. This confirms the fact that extra common zeros exist in our data set. Also the superior performance of MZIH model over MZIP and MZINB models verifies the benefit of having extra flexibility in our MZIH model when handling marginal distributions.

Table 7. Information criteria of four fitted models.

To evaluate the predictive performance, we calculate the predicted claim frequencies and compare these to the observed frequencies from the test sample for the following scenarios: no claims in any line, claims occur in only one of the lines and claims occur in both lines. The candidate models include MZIH, MZIP, MZINB and Ind models. The goodness-of-fit results are shown in Table 8. Our conclusions are consistent with those made based on Table 7. As anticipated, we observe very poor performance from the Ind model, which can be explained by the failure to model the excess common zeros in the data set as well as the positive correlation between the two lines of claims. The more accurate prediction of MZIH model against MZIP and MZINB models is largely due to the introduction of hurdle margins. The tiny discrepancies between observed and predicted frequencies of our MZIH model suggest a satisfactory quality of fit for this particular data set.

Table 8. Predicted frequencies of four models under four different scenarios.

In this subsection, we apply the four models from the previous section using the whole data set with 80,994 policyholders to facilitate the comparison between these models and the ones in BermÚdez (Reference BermÚdez2009) and BermÚdez and Karlis (Reference BermÚdez and Karlis2012). All eleven covariates are considered in this case. The first candidate model is the best model studied in BermÚdez (Reference BermÚdez2009) that is the zero-inflated bivariate Poisson (ZIBP) model with regressors on the third Poisson parameter $\lambda_3$ . The second candidate model is the best model studied in BermÚdez and Karlis (Reference BermÚdez and Karlis2012), which is the 2-finite mixture of bivariate Poisson (2-FMBP) model with regressors on the mixing proportion p. For readers’ convenience, description of the ZIBP and 2-FMBP models is given in Appendix B. Our MZIH model is still fitted with the USNB marginal distributions for zero-truncation parts with only significant predictors reserved. The comparative analysis is shown in Table 9. As is evident from the table, all information criteria agree that our proposed MZIH model still performs the best.

Table 9. Information criteria of six candidate models.

5 Concluding remarks

In this article, we introduced a new type of multivariate model, the so called multivariate zero-inflated hurdle (MZIH) model. We started with the definition for the multivariate zero-inflated distribution and provided some of its properties. Next, two special multivariate zero-inflated models, namely the MZIP and MZINB, were presented with associated EM algorithms necessary for estimating their parameters given in Appendix A. However, these two models often cannot accommodate the observed characteristics in the marginal distributions of claim counts. Our proposed MZIH model could address these limitations by allowing any zero-modified distribution for each margin. The separation of zeros from the positive parts provided us with more freedom to deal with zero features in each dimension and the marginal distributions as well. The usefulness of our model was illustrated with the help of real data from automobile insurance. The results from fitting several relevant models were shown and compared. As expected, the superiority of our proposed MZIH model was supported by different information criteria as well as predictive performance.

Our proposed MZIH model provides significantly enhanced flexibility compared to univariate models for the actuary who is dealing with multiple related insurance coverage. Our model takes into account the correlations between observed claim frequencies for different lines of insurance business, meaning that the models for one line of business are enhanced by information in the data relating to other lines of business. Univariate models are unable to incorporate this given their separate focus on individual lines of business. Second, the MZIH model builds on the analysis from univariate modeling of claim counts by adding a hurdle requirement within a multivariate structure. More importantly, despite the significantly enhanced model flexibility inherent in MZIH model analysis, the additional effort in coding and computation is limited. Starting form appropriate values, our program for the EM algorithm only takes several minutes to obtain the results when working on tens of thousands of observations. The estimation aimed at the positive part of each margin can be easily handled in R using publicly available packages. All of these advantages make the MZIH model an attractive candidate when studying claims with the zero-inflation feature in a multivariate context.

R code

The authors used the R package gamlss and gamlss.tr in the model fitting of this paper. The authors are happy to share their R code when needed. Interested readers please contact the corresponding author.

Acknowledgements

The authors would like to thank reviewers for valuable comments and suggestions that led to significant improvement from the first version. Mr. Pengcheng Zhang was supported by a University of Melbourne Faculty of Business and Economics Doctoral Program Scholarship. Acknowledgements also go to the Research Group of Risk in Insurance and Finance at the University of Barcelona for providing the data used in this paper.

APPENDIX

APPENDIX A. EM algorithms for two models

A.1 Multivariate zero-inflated Poisson model

Suppose for each independent individual, $\textbf{Z}_i \sim MZIP(\boldsymbol{\lambda}_i,\pi_0)$ , $i=1,\ldots, n$ , where $\boldsymbol{\lambda}_i=(\lambda_{i1},\ldots, \lambda_{im})^\top$ . Taking covariates into account, the parameter $\lambda_{ij}$ can be modeled as $\lambda_{ij} = \exp({\textbf{x}_i^\top \boldsymbol{\beta}_j})$ with new parameters $\boldsymbol{\beta}_j=(\beta_{j0}, \beta_{j1},\ldots, \beta_{jp})^\top$ . If we denote $\boldsymbol{\beta}=(\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_m)$ , the likelihood function becomes

\begin{eqnarray*}L(\boldsymbol{\beta},\pi_0)=\prod_{i=1}^{n}\left(1-\pi_0+\pi_0e^{-\sum_{j=1}^{m}\lambda_{ij}}\right)^{v_i} \prod_{i=1}^{n}\left(\pi_0\prod_{j=1}^{m}\frac{\lambda_{ij}^{z_{ij}}e^{-\lambda_{ij}}}{z_{ij}!}\right)^{1-v_i}.\end{eqnarray*}

Following the same data augmentation method as for the multivariate zero-inflated hurdle model, we obtain the complete data likelihood function given as

\begin{eqnarray*}L^c(\boldsymbol{\beta},\pi_0) &=&\prod_{i=1}^{n}(1-\pi_0)^{u'_i v_i}\prod_{i=1}^{n}\left(\pi_0 e^{-\sum_{j=1}^{m}\lambda_{ij}}\right)^{(1-u'_i)v_i} \\&& \times \prod_{i=1}^{n}\left(\pi_0\prod_{j=1}^{m}\frac{\lambda_{ij}^{z_{ij}}e^{-\lambda_{ij}}}{z_{ij}!}\right)^{1-v_i}.\end{eqnarray*}

The complete data log-likelihood function is

\begin{eqnarray*}{\ell^c}(\boldsymbol{\beta},\pi_0)&=&\sum_{i=1}^n \left[u'_i v_i\log(1-\pi_0)+(1-u'_i v_i)\log\pi_0\right] \\&&+\sum_{j=1}^m \sum_{i=1}^n \left[z_{ij}\log{\lambda_{ij}}-(1-u'_i v_i) \lambda_{ij}\right]+ C_0,\end{eqnarray*}

where $C_0$ is a constant not related to $(\boldsymbol{\beta}, \pi_0)$ . Note in our case, $v_i z_{ij}=0$ .

The associated EM algorithm is given below.

• E-step: Replace $u'_i$ by
\begin{eqnarray*}\bar{u}'_i= \frac{1-\pi_0}{1-\pi_0+\pi_0 e^{-\sum_{j=1}^{m}\lambda_{ij}}},\quad i=1,\ldots,n,\end{eqnarray*}
where ${\lambda}_{ij}=\exp(\textbf{x}_{\textbf{i}}^\top \boldsymbol{\beta_j})$ .
• M-step:
• For $\pi_0$ , we can get
\begin{eqnarray*}\pi_0=1-\frac{1}{n}\sum_{i=1}^{n}\bar{u}'_i v_i.\end{eqnarray*}
• For $\boldsymbol{\beta}$ , let
\begin{eqnarray*}\bar{\ell}_j^c(\boldsymbol{\beta}_j)=\sum_{i=1}^n \left[z_{ij}\log{\lambda_{ij}}-(1-\bar{u}'_i v_i)\lambda_{ij}\right].\end{eqnarray*}

There is no closed-form representation for $\boldsymbol{\beta}_j$ , so we update the regression parameters, respectively, for each j by implementing the Newton–Raphson method for one step. The first- and second-order derivatives are given as follows:

\begin{eqnarray*}\frac{\partial \bar{\ell}_j^c}{\partial \boldsymbol{\beta_j}} &=&\sum_{i=1}^n \left[z_{ij}-(1-\bar{u}'_i v_i)\lambda_{ij}\right] \textbf{x}_{\textbf{i}} , \\\frac{\partial^2 \bar{\ell}_j^c}{\partial \boldsymbol{\beta_j}\partial \boldsymbol{\beta_j}^\top}&=&-\sum_{i=1}^n (1-\bar{u}'_i v_i)\lambda _{ij}\textbf{x}_{\textbf{i}}\textbf{x}_{\textbf{i}}^\top.\end{eqnarray*}

For the case without covariates incorporated in $\lambda_{ij}$ , the corresponding E-step and M-step can be simplified as follows.

• E-step: Replace $u_i$ by
\begin{eqnarray*}\bar{u}'=\bar{u}'_i= \frac{1-\pi_0}{1-\pi_0+\pi_0 e^{-\sum_{j=1}^{m}\lambda_j}}, \quad i=1,\ldots,n.\end{eqnarray*}
• M-step:
• Update $\pi_0$ using the following equation:
\begin{eqnarray*}\pi_0=1-\frac{\bar{u}'}{n}\sum_{i=1}^{n}v_i.\end{eqnarray*}
• Update each $\lambda_j$ using the following equation:
\begin{eqnarray*}\lambda_j=\frac{ \sum_{i=1}^{n}z_{ij}}{n-\bar{u}'\sum_{i=1}^{n}v_i},\quad j=1,\ldots,m.\end{eqnarray*}

A.2 Multivariate zero-inflated negative binomial model

Suppose for each independent individual, $\textbf{Z}_i \sim MZINB(\boldsymbol{\mu}_i,\boldsymbol{\theta},\pi_0)$ , $i=1,\ldots, n$ , where $\boldsymbol{\mu}_i=(\mu_{i1},\ldots,\mu_{im})^\top$ and $\boldsymbol{\theta}=(\theta_1,\ldots, \theta_m)^\top$ . Similarly, the covariates could be introduced as ${\mu}_{ij}=\exp(\textbf{x}_i^\top \boldsymbol{\beta_j})$ . If we denote $\boldsymbol{\beta}=(\boldsymbol{\beta}_1,\ldots,\boldsymbol{\beta}_m)$ , the likelihood function becomes

\begin{eqnarray*}L(\boldsymbol{\beta},\boldsymbol{\theta},\pi_0) &=& \prod_{i=1}^{n} \left[1-\pi_0+\pi_0\prod_{j=1}^m{\left(\frac{\theta_j}{\mu_{ij}+\theta_j} \right)}^{\theta_j}\right]^{v_i} \nonumber\\ && \times \prod_{i=1}^{n} \left[\pi_0\prod_{j=1}^{m}\frac{\Gamma({z_{ij}}+{\theta_j})}{\Gamma({\theta_j}){z_{ij}}!} \left(\frac{\mu_{ij}}{\mu_{ij}+\theta_j}\right)^{z_{ij}}\left(\frac{{{\theta_j}}}{{\mu_{ij}}+{\theta_j}} \right)^{\theta_j}\right]^{1-v_i}.\end{eqnarray*}

Following the same data augmentation method as for the multivariate zero-inflated hurdle model, we obtain the complete data likelihood function given as

\begin{eqnarray*}L^c(\boldsymbol{\beta},\boldsymbol{\theta},\pi_0) &=&\prod_{i=1}^{n}(1-\pi_0)^{u'_i v_i}\prod_{i=1}^{n}\left[\pi_0\prod_{j=1}^{m}\left( \frac{\theta_j}{\mu_{ij}+\theta_j}\right)^{\theta_j}\right]^{(1-u'_i)v_i} \\ && \times \prod_{i=1}^{n} \left[ \pi_0\prod_{j=1}^{m}\frac{\Gamma (z_{ij}+\theta_j)}{\Gamma (\theta_j)z_{ij}!}{\left(\frac{\mu_{ij}}{\mu_{ij}+\theta_j} \right)}^{z_{ij}}{\left(\frac{\theta_j}{\mu_{ij}+\theta_j} \right)}^{\theta_j}\right]^{1-v_i}.\end{eqnarray*}

The complete data log-likelihood function is

\begin{eqnarray*}{\ell^c}(\boldsymbol{\beta},\boldsymbol{\theta},\pi_0) &=& \sum_{i=1}^n\left[ u'_i v_i\log(1-\pi_0)+(1-u'_i v_i)\log\pi_0\right]\\&&+\sum_{j=1}^m \sum_{i=1}^n\left\{(1-u'_i v_i)\theta_j\log{\theta_j}-\left[(1-u'_i v_i)\theta_j+z_{ij}\right]\log(\mu_{ij}+\theta_j) \right.\\&& \left.+z_{ij}\log{\mu_{ij}} +\log\left(\Gamma(z_{ij}+\theta_j)\right)-\log\left(\Gamma(\theta_j) \right)\right\}+{C_1},\end{eqnarray*}

where $C_1$ is a constant not related to $(\boldsymbol{\beta}$ , $\boldsymbol{\theta},\pi_0)$ . Note in our case, $v_i z_{ij}=0$ .

The EM algorithm can be described as follows.

• E-step: Replace $u_i$ by
\begin{eqnarray*}\bar{u}'_i=\frac{1-\pi_0}{1-\pi_0+\pi_0 \prod_{j=1}^{m}\left(\frac{\theta_j}{\mu_{ij}+\theta_j}\right)^{\theta_j}},\quad i=1,\ldots,n,\end{eqnarray*}
where $\mu_{ij}=\exp(\textbf{x}_{\textbf{i}}^\top \boldsymbol{\beta_j})$ .
• M-step:
• For $\pi_0$ , we can get
\begin{eqnarray*}\pi_0=1-\frac{1}{n}\sum_{i=1}^{n}\bar{u}'_i v_i.\end{eqnarray*}
• For $(\boldsymbol{\beta},\boldsymbol{\theta})$ , let
\begin{eqnarray*}\bar{\ell}_j^C(\boldsymbol{\beta}_j,{\theta}_j) &=& \sum_{i=1}^n\left\{(1-\bar{u}'_i v_i)\theta_j\log{\theta_j} -\left[(1-\bar{u}'_i v_i)\theta_j+z_{ij}\right]\log(\mu_{ij}+\theta_j) \right.\\[3pt]&& \left. +z_{ij}\log{\mu_{ij}}+\log\left(\Gamma(z_{ij}+\theta_j)\right)-\log\left(\Gamma(\theta_j)\right)\right\}.\end{eqnarray*}

There is no closed-form representation for $\boldsymbol{\beta} _{j}$ and $\theta_{j}$ , so we update the regression parameters, respectively, for each j by implementing the Newton–Raphson method for one step. The first- and second-order derivatives are given as follows:

\begin{align*}\frac{\partial \bar{\ell}_j^C}{\partial \boldsymbol{\beta _j}} &=\sum_{i=1}^n \frac{\left[z_{ij}-(1-\bar{u}'_i v_i)\mu_{ij}\right]\theta_j}{\mu_{ij}+\theta _j}\textbf{x}_{\textbf{i}},\\[3pt]\frac{\partial \bar{\ell}_j^C}{\partial \theta_j} &= \sum_{i=1}^n \bigg[(1-\bar{u}'_i v_i)\log \frac{\theta _j}{\mu_{ij}+\theta_j} + \frac{(1-\bar{u}'_i v_i)\mu_{ij}- z_{ij}}{\mu_{ij}+\theta_j} \\[3pt] &\quad +\psi(z_{ij}+\theta_j)-\psi(\theta_j) \bigg], \\[3pt]\frac{\partial^2\bar{\ell}_j^C}{\partial \boldsymbol{\beta_j}\partial \boldsymbol{\beta_j}^\top} &=-\sum_{i=1}^n {\frac{\left[z_{ij}+(1-\bar{u}'_i v_i)\theta_j\right]\mu_{ij}\theta _j}{(\mu_{ij}+\theta_j) ^2}\textbf{x}_{\textbf{i}}\textbf{x}_{\textbf{i}}^\top}, \\\frac{\partial ^2\bar{\ell}_j^C}{\partial \theta_j^2} &=\sum_{i= 1}^n \left[\frac{(1-\bar{u}'_i v_i)\mu_{ij}^2 + z_{ij}\theta _j}{\theta_j(\mu_{ij}+\theta_j)^2}+\psi_1(z_{ij} + \theta_j)-\psi_1(\theta_j)\right], \\[5pt]\frac{\partial^2\bar{\ell}_j^C}{\partial\boldsymbol{\beta_j}\partial \theta_j} &=\sum_{i=1}^n \frac{\left[z_{ij}-(1-\bar{u}'_i v_i)\mu_{ij}\right]\mu_{ij}}{(\mu_{ij}+{\theta_j})^2} \textbf{x}_{\textbf{i}}.\end{align*}

where $\psi(\cdot)$ and $\psi_1(\cdot)$ denote the digamma and trigamma functions, respectively.

For the case without covariates incorporated in $\mu_{ij}$ , the corresponding E-step and M-step can be simplified as follows.

• E-step: Replace $u_i$ by
\begin{eqnarray*}\bar{u}'=\bar{u}'_i=\frac{1-\pi_0}{1-\pi_0+\pi_0 \prod_{j=1}^{m}\left(\frac{\theta_j}{\mu_{j}+\theta_j}\right)^{\theta_j}},\quad i=1,\ldots,n.\end{eqnarray*}
• M-step:
• For $\pi_0$ , we can get
\begin{eqnarray*}\pi_0=1-\frac{\bar{u}'}{n}\sum_{i=1}^{n} v_i.\end{eqnarray*}
• For $(\boldsymbol{\mu},\boldsymbol{\theta})$ , let
\begin{eqnarray*}\bar{\ell}_j^C(\mu_j,{\theta}_j) &=& \sum_{i=1}^n\left\{(1-\bar{u}'_i v_i)\theta_j\log{\theta_j} -\left[(1-\bar{u}'_i v_i)\theta_j+z_{ij}\right]\log(\mu_{j}+\theta_j) \right.\\&& \left.+z_{ij}\log{\mu_{j}}+\log\left(\Gamma(z_{ij}+\theta_j)\right)-\log\left(\Gamma(\theta_j)\right)\right\}.\end{eqnarray*}
• Update each $\mu_j$ using the following equation:
\begin{eqnarray*}\mu_j=\frac{\sum_{i=1}^{n}z_{ij}}{n-\bar{u}'\sum_{i=1}^{n}v_i},\quad j=1,\ldots,m.\end{eqnarray*}
• Substitute the value of $\mu_j$ obtained from the last step into $\bar{\ell}_j^C$ , update $\theta_j$ using the one variable Newton–Raphson method.

APPENDIX B. Description for two previous models

B.1 Zero-inflated bivariate Poisson (ZIBP)

The bivariate Poisson (BP) regression model is defined as follows. Let $Y_1$ , $Y_2$ and $Y_3$ denote three independent Poisson random variables with parameters $\lambda_1$ , $\lambda_2$ and $\lambda_3$ , respectively. Then $Z_1=Y_1+Y_3$ and $Z_2=Y_2+Y_3$ follow jointly a BP distribution, denoted as $\textbf{Z}=(Z_1,Z_2) \sim BP(\lambda_1,\lambda_2,\lambda_3).$ Covariates can be introduced into the model through the following schemes:

\begin{eqnarray*}\lambda_j=\exp(\textbf{x}^\top \boldsymbol{\beta}_j), \quad j=1,2,3.\end{eqnarray*}

where $\textbf{x}=(1, x_{1},\ldots, x_{p})^\top$ and $\boldsymbol{\beta}_j=(\beta_{j0}, \beta_{j1},\ldots, \beta_{jp})^\top$ .

A zero-inflated bivariate Poisson (ZIBP) model is specified by the following pmf:

\begin{eqnarray*}f_{ZIBP}(z_1,z_2) = \left\{ \begin{gathered} \pi_0 f_{BP}(z_1,z_2), \quad \hfill (z_1,z_2)\neq(0,0), \\[3pt] 1-\pi_0+\pi_0f_{BP}(0,0), \quad \hfill (z_1,z_2)=(0,0). \end{gathered} \right.\end{eqnarray*}

B.2 2-finite mixture of bivariate Poisson (2-FMBP)

Let $\textbf{Z}_{\textbf{1}}=(Z_{11}$ , $Z_{12})$ and $\textbf{Z}_{2}=(Z_{21}$ , $Z_{22})$ denote two independent BP random variables with parameters $(\lambda_{11}, \lambda_{12}, \lambda_{13})$ and $(\lambda_{21}, \lambda_{22}, \lambda_{23}),$ respectively. Then the 2-finite mixture of bivariate Poisson (2-FMBP) model is defined by the following pmf:

\begin{eqnarray*}f(z_1,z_2) = pf_{\textbf{Z}_1}(z_1,z_2)+(1-p)f_{\textbf{Z}_2}(z_1,z_2).\end{eqnarray*}

Covariates can be introduced into the model through the following schemes:

\begin{eqnarray*}\lambda_{kj}=\exp(\textbf{x}^\top \boldsymbol{\beta}_{kj}), \quad k=1,2,\quad j=1,2,3,\end{eqnarray*}

where $\textbf{x}=(1, x_{1},\ldots, x_{p})^\top$ and $\boldsymbol{\beta}_{kj}=(\beta_{kj0}, \beta_{kj1},\ldots, \beta_{kjp})^\top$ . Covariates can also be incorporated in the mixing proportion p through a logit-link function.

References

Boucher, J.P., Denuit, M. and GuillÉn, M. (2007) Risk classification for claim counts: A comparative analysis of various zero-inflated mixed Poisson and hurdle models. North American Actuarial Journal, 11(4), 110–131.CrossRef Google Scholar

BermÚdez, L. (2009) A priori ratemaking using bivariate Poisson regression models. Insurance: Mathematics and Economics, 44(1), 135–141.Google Scholar

BermÚdez, L. and Karlis, D. (2011) Bayesian multivariate Poisson models for insurance ratemaking. Insurance: Mathematics and Economics, 48(2), 226–236.Google Scholar

BermÚdez, L. and Karlis, D. (2012) A finite mixture of bivariate Poisson regression models with an application to insurance ratemaking. Computational Statistics and Data Analysis, 56, 3988–3999.CrossRef Google Scholar

BermÚdez, L. and Karlis, D. (2017). A posteriori ratemaking using bivariate Poisson models. Scandinavian Actuarial Journal, 2017(2), 148–158.CrossRef Google Scholar

Cameron, A.C. and Trivedi, P.K. (1998) Regression Analysis of Count Data. Cambridge University Press.CrossRef Google Scholar

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.Google Scholar

Dong, C., Clarke, D.B., Yan, X., Khattak, A. and Huang, B. (2014) Multivariate random-parameters zero-inflated negative binomial regression model: An application to estimate crash frequencies at intersections. Accident Analysis and Prevention, 70, 320–329.CrossRef Google Scholar PubMed

Frees, E.W. (2009) Regression Modeling with Actuarial and Financial Applications. Cambridge University Press.CrossRef Google Scholar

Hall, D.B. (2000) Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics, 56(4), 1030–1039.CrossRef Google Scholar PubMed

Lambert, D. (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34, 1–14.CrossRef Google Scholar

Li, C., Lu, J., Park, J., Kim, K., Brinkley, P. and Peterson, J. (1999). Multivariate zero-inflated Poisson models and their applications. Technometrics, 41(1), 29–38.CrossRef Google Scholar

Liu, Y. and Tian, G.L. (2015) Type I multivariate zero-inflated Poisson distribution with applications. Computational Statistics and Data Analysis, 83, 200–222.CrossRef Google Scholar

Louis, T.A. (1982) Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 226–233.Google Scholar

Mullahy, J. (1986) Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341–365.CrossRef Google Scholar

Rai, S.N. and Matthews, D.E. (1993) Improving the EM algorithm. Biometrics, 49, 587–591.CrossRef Google Scholar

Yip, K.C.H. and Yau, K.K.W. (2005). On modeling claim frequency data in general insurance with extra zeros. Insurance Mathematics and Economics, 36(2), 153–163.CrossRef Google Scholar

Zhang, P., CalderÍn-Ojeda, E., Li, S. and Wu, X. (2020) On the type I multivariate zero-truncated hurdle model with applications in health insurance. Insurance Mathematics and Economics, 90, 35–45.CrossRef Google Scholar