Telematics Combined Actuarial Neural Networks for Cross-Sectional and Longitudinal Claim Count Data

We present novel cross-sectional and longitudinal claim count models for vehicle insurance built upon the Combined Actuarial Neural Network (CANN) framework proposed by Mario W\"uthrich and Michael Merz. The CANN approach combines a classical actuarial model, such as a generalized linear model, with a neural network. This blending of models results in a two-component model comprising a classical regression model and a neural network part. The CANN model leverages the strengths of both components, providing a solid foundation and interpretability from the classical model while harnessing the flexibility and capacity to capture intricate relationships and interactions offered by the neural network. In our proposed models, we use well-known log-linear claim count regression models for the classical regression part and a multilayer perceptron (MLP) for the neural network part. The MLP part is used to process telematics car driving data given as a vector characterizing the driving behavior of each insured driver. In addition to the Poisson and negative binomial distributions for cross-sectional data, we propose a procedure for training our CANN model with a multivariate negative binomial (MVNB) specification. By doing so, we introduce a longitudinal model that accounts for the dependence between contracts from the same insured. Our results reveal that the CANN models exhibit superior performance compared to log-linear models that rely on manually engineered telematics features.


Introduction and Motivations
Vehicle insurance products have traditionally been priced based on self-reported attributes provided by insureds.These attributes commonly include various risk factors, including gender, age, vehicle usage, and claim history.Insurers rely on this information to assess the level of risk associated with each insurance contract and determine appropriate premium rates.With the introduction of telematics technology, insurers can now collect a wide range of driving data through devices installed in the vehicles of policyholders or through mobile applications.This includes information such as vehicle speed, acceleration and braking behavior, mileage, location data, and factors like the time of day or types of roads frequently traveled.By leveraging this wealth of data, insurers can gain a more accurate and objective understanding of each individual's driving habits and style, enabling them to customize insurance offerings and pricing based on their actual driving behavior.This emerging paradigm, known as Usage-Based Insurance (UBI), revolutionizes the insurance landscape in various ways.For insurers, telematics data means more accurate risk assessment algorithms, which can often translate into a competitive advantage.For insureds, it means fairer premium rates that align more closely with their actual risk profiles rather than being computed based on broad demographic categories.It also means that they are priced based on risk indicators over which they have control.From a societal perspective, UBI also offers many advantages.One of the key benefits is the potential to improve road safety.By giving incentives for safe driving behavior and reduced mileage, UBI not only helps reduce the frequency and severity of accidents, ultimately saving lives and reducing the economic burden associated with road accidents, but also contributes to reducing greenhouse gas emissions.Additionally, telematics provide insurers with viable alternatives to sensitive risk factors, thereby helping to prevent unfair discrimination.For a more extensive overview of the benefits of UBI, we refer to the works of [Litman, 2007], [Bordoff and Noel, 2008] and [Ziakopoulos et al., 2022].
One of the most prominent questions related to UBI is how to make the most out of the collected driving data.A significant subset of the literature has focused on incorporating mileage into pricing models due to its acknowledged importance as a risk factor in assessing risk and determining premium rates (see, for instance, [Boucher et al., 2017], [Lemaire et al., 2015], and [Turcotte and Boucher, 2023]).However, mileage alone fails to provide the whole story about an insured individual's driving behavior, prompting researchers to consider additional telematics information.One prevalent approach involves drawing upon domain knowledge to craft telematics features from raw data.By applying their expertise in the field, researchers can engineer features that capture critical aspects of driving behavior, specifically driving characteristics that are thought to be correlated with the risk of accident.Common examples of such features include harsh braking/acceleration events, cornering events, speeding, distracted driving, the fraction of driving during both different time slots (e.g., rush hour, late-night hours, weekdays), and on different road types (e.g., urban roads, highways), as well as the fraction of driving in different speed slots.While this approach captures signals missed by traditional risk factors and mileage (thereby improving pricing accuracy), it relies heavily on human judgment, with its inherent flaws and biases.With countless possible telematics features that can be engineered from raw telematics data, selecting the optimal ones for pricing is not straightforward.Furthermore, this process necessitates the setting of thresholds.For example, how should night driving or harsh braking be precisely defined?
The limitations of the aforementioned approach have motivated researchers to explore a new set of methods that rely more on data and decrease the need for human judgment.As highlighted in a recent study by [Embrechts and Wüthrich, 2022], the increasing amount of data available presents a challenge in manually designing features, leading actuaries to increasingly depend on tools like neural networks to learn and extract meaningful representations from the data.[Blier-Wong et al., 2021] underline the importance of learning valuable representations from emerging data sources such as text, image, and sensor data.These sources, which include telematics car driving data, can enrich traditional data and offer improved insights for predicting future losses in insurance contracts.Neural networks are regarded as the most effective means for automatically extracting valuable features from raw data, which validates their practical application.In recent years, researchers have successfully applied the toolbox of deep learning, namely neural networks architectures with a large number of hidden layers, to handle telematics data and other types of unstructured data.In their work, [Wüthrich, 2017] introduce the speed-acceleration heatmap, a matrix representation that characterizes the driving style of an insured driver, which is well-suited for processing by deep learning algorithms.Subsequent studies ( [Gao and Wüthrich, 2018], [Gao et al., 2019], [Gao and Wüthrich, 2019], [Gao et al., 2022]) have effectively leveraged these heatmaps by employing neural networks to learn representations from them.In [Meng et al., 2022], the authors propose a supervised driving risk scoring convolutional neural network model that uses telematics car driving data to improve automobile insurance claims frequency prediction.[Blier-Wong et al., 2020] propose a Convolutional Regional Autoencoder model for generating geographical risk encodings using convolutional neural networks.The resulting encodings, which aim to replace the traditional territory variable, proved beneficial for risk-related regression tasks.
In this paper, we present novel claim count models based on the Combined Actuarial Neural Network (CANN) approach, initially proposed by [Wüthrich and Merz, 2019].The CANN approach involves embedding a classical regression model, such as a generalized linear model (GLM; see [Nelder and Wedderburn, 1972] and [Dionne and Vanasse, 1989]), into a neural network, achieved by blending the regression functions of both models.Consequently, the resulting model comprises the two following components: the classical regression (or actuarial) model and the neural network.This blending process can be interpreted as a form of neural network boosting for the actuarial model, combining the strengths of both approaches.The calibration of the CANN neural network is performed using the classical actuarial model as the initial value in the gradient descent algorithm, with the negative log-likelihood of the specified distribution used as the loss function.One of the key benefits of this specific architecture is the solid foundation offered by the classical model, complemented by the network component's flexibility and pattern recognition capabilities.Neural networks excel in approximating highly nonlinear functions and possess the ability to compute valuable interactions between input variables automatically.Consequently, the CANN approach combines the best of both worlds, leveraging the interpretability and reliability of the classical model while capitalizing on the power of neural networks to capture complex relationships and patterns in the data.A few studies have successfully leveraged this approach: [Schelldorfer and Wuthrich, 2019] present a case study where a Poisson GLM for predicting claims frequencies is initially used, then enhanced through generalized additive models (GAMs) with natural cubic splines and finally combined with a neural network, resulting in a CANN approach.The study also explores the use of embedding layers for more efficient treatment of categorical variables; [Gabrielli et al., 2020] boost an overdispersed Poisson model with a multilayer perceptron to improve individual loss reserving; [Tzougas and Kutzkov, 2023] use the CANN approach to enhance binary classification; [Laporta et al., 2023] apply the CANN architecture in the context of quantile regression.
Our models employ a log-linear model for the actuarial model part and a multilayer perceptron (MLP) for the network part.Telematics information is incorporated into the MLP as a telematics vector, which is given as input to represent the driving behavior of each insured driver.The MLP part additionally includes traditional risk factors as inputs, enabling interactions between traditional and telematics inputs, while the log-linear part, constrained in estimating complex functions, is only given traditional risk factors.We explore three distinct distribution specifications for the claim count: Poisson and negative binomial for cross-sectional analysis, and multivariate negative binomial (MVNB; see [Hausman et al., 1984] and [Boucher et al., 2008]), also known as negative multinomial, for longitudinal analysis.The MVNB distribution is a popular choice for modeling longitudinal claim count data, as it captures the dependence between contracts from the same insured.However, to our knowledge, this specification has never been adapted to a neural network model for claim count regression.In this study, we extend the application of the MVNB distribution by incorporating it into the neural network framework, specifically the CANN architecture, for modeling longitudinal claim count data.This adaptation allows us to leverage the strengths of both the MVNB distribution and the neural network architecture.Our findings indicate that the CANN models perform better than their log-linear counterparts that rely on manually engineered telematics features.Furthermore, the CANN model using the MVNB specification exhibits a significant improvement compared to the two cross-sectional specifications.
In Section 2, we present the two datasets available to us: the contract dataset and the telematics dataset.Following that, in Section 3, we delve into the theory behind the CANN claim count models and also discuss the log-linear models that serve as benchmarks.Moving on to Section 4, we provide an explanation of how we apply the models on our specific dataset and show how we preprocess telematics data.In Section 5, we assess the performance of the models on a holdout sample and interpret the CANN models through permutation feature importance and partial dependence plots.Lastly, we conclude in Section 6.

Data
We have access to data from a Canadian property and casualty insurance company, which comes in two distinct datasets: the contract and the telematics dataset.

Contract dataset
In the contract dataset, each row represents a unique insurance contract.Contracts typically last for one year, but there are instances where their duration may be shorter or longer.Each vehicle is observed over one or more contracts; therefore, one vehicle can be represented by one or more rows in this dataset.Based on risk factors, a premium must be computed for each contract.When using a cross-sectional data model, contracts from the same vehicle are assumed to be independent of each other.On the other hand, a longitudinal data model assumes a dependence between contracts, allowing it to use information from previous contracts (including traditional risk factors, telematics data, past claims, etc.) to compute the premium.The contract dataset includes attributes commonly used in vehicle insurance pricing models.These traditional risk factors, displayed in Table 1, are recorded for 117,268 insurance contracts initiated between December 15 th , 2015 and December 31 st , 2018.In cases where multiple drivers are associated with a particular contract, attributes of the principal driver are used.Additionally, the dataset includes the vehicle identification number (VIN), allowing us to identify the insured vehicle accurately, alongside the reported claim count.As our goal is to perform claim count regression on contracts, the claim count variable will serve as the response for our supervised learning algorithms, namely the log-linear and the CANN models.The 117,268 contracts are associated with 49,671 distinct vehicles, resulting in an average of approximately 2.36 contracts per vehicle.The histogram of the number of contracts per vehicle is shown in Figure 1.

Telematics dataset
All 117,268 contracts have been logged using an on-board diagnostics (OBD) device, capturing driving information.This data is stored as trip summaries in the telematics dataset, which comprises 117,566,259 trips.Each row in the dataset represents a specific trip, and every trip is described by 4 attributes: the departure and arrival date and time, the distance driven, and the maximum speed reached.Additionally, each trip is associated with a VIN, and with the date information, it is thus possible to link each trip with one of the 117,268 contracts.An extract from the telematics dataset is presented in Table 2.

Training, validation, and testing datasets
In supervised learning analysis, splitting the available data into training, validation, and testing sets is paramount for ensuring the reliability and ability to generalize of the learned model.performance.However, relying solely on the training set for performance assessment can lead to overfitting, particularly when the model has a high capacity.To address this, the validation set is used during the modeling process to assess the model's performance on unseen data.It plays an important role in tuning hyperparameters, selecting the optimal model architecture, and preventing overfitting.By assessing the model's performance on the validation set, one can obtain an estimate of its generalization performance and make necessary adjustments to improve its ability to generalize well to new, unseen examples.However, it is important to note that the back-and-forth process of evaluating the model on the validation set and adjusting its hyperparameters can introduce information leakage from the validation set into the training set.This can create an illusion of better performance than the model would exhibit in real-world scenarios.As a result, the testing set is reserved for the final evaluation of the learned model.It serves as an unbiased assessment of how well the model will perform on completely unseen data.This final evaluation provides an estimate of the model's true performance and helps determine its reliability in real-world scenarios.By keeping the testing set separate from the training and validation sets, we can ensure an unbiased evaluation and avoid any potential data leakage.We partition the data as outlined in Table 3 where y i,(1:t−1) = (y i1 , . . ., y i,t−1 ) is the vector of past claims and x i,(1:t) = {x i1 , . . ., x it } is the set of past and current covariate vectors for vehicle i.

Cross-sectional models
In addition to assuming independence between vehicles, cross-sectional models also assume independence between contracts from the same vehicle.Consequently, these models do not use the history of a vehicle to estimate its future risk.The PMF of the number of claims can thus be written as: (2)

Poisson regression
The Poisson distribution is widely used in supervised learning analysis for claim count data due to its good properties and simplicity.Under the Poisson specification, the PMF of the claim count for the t th contract of vehicle i, denoted by Y it , given its predictor vector, denoted by x it , is defined by with The mean parameter µ(x it ) denotes the conditional expectation (and conditional variance) of Y it .The regression function µ(•) captures the relationship between the predictors x it and the mean parameter in the Poisson distribution, indicating how the conditional expected count is influenced by the predictors.Subsequently, one must choose a specific functional form for µ(•), which defines a hypothesis function space H that includes all the candidate functions for modeling µ(•).The next step involves selecting the optimal function µ ∈ H, equivalent to estimating the parameters of the specified functional form based on the available data.
In order to define what constitutes a good regression function, it is necessary to select a suitable loss function that quantifies the dissimilarity between the estimated probability mass and the true label.The goal is to minimize this dissimilarity, improving the model's predictive performance.The cross-entropy loss, also known as the negative log-likelihood loss, is a commonly chosen option.For a specific observation i, the cross-entropy loss is given by − ln(p i ), where p i is the estimated probability of observing the true label y i .This loss function assigns a higher penalty to larger discrepancies between the true label and the predicted probability, incentivizing the model to converge towards more accurate predictions.To estimate the parameters, we typically aim to minimize the average loss function over the training set, also called the empirical risk.
In the case of Poisson regression, this involves minimizing the average Poisson cross-entropy by solving the following optimization problem: Note that this is equivalent to maximizing the likelihood function.For some specifications of µ(•), notably the log-linear specification, the criterion in Equation ( 4) is convex, which enables various convex optimization techniques to be applied.Alternative estimation techniques can also be used.One common option is regularization techniques, including lasso, Ridge, and elastic-net regressions.Instead of solely minimizing the average cross-entropy, these methods involve minimizing a modified objective function that includes a penalty term.
Regularization is particularly beneficial for addressing common issues such as multicollinearity and overfitting.

Log-linear Poisson regression.
In the Poisson regression context, one notable specification for the regression function is the log-linear form, where the mean parameter is expressed as the exponential of a linear function of the predictors: where β denotes a vector of parameters, and ⟨x, β⟩ stands for the inner product between the predictor vector x and the coefficient vector β.The use of the exponential function ensures that the mean parameter remains positive.Log-linear Poisson regression has favorable properties, notably its interpretability stemming from the quasi-linearity of the link function µ(•).Moreover, when maximum likelihood is used for parameter estimation, this regression model falls within the framework of generalized linear models.GLMs provide valuable properties, such as the asymptotic Gaussian distribution of the parameters β, allowing for the estimation of standard errors, hypothesis testing, and construction of confidence intervals.
However, log-linear regression does have a significant drawback -its regression function, being linear, lacks flexibility.To address this limitation, various techniques can be employed.In fact, any supervised learning technique could be used for the specification of µ(•).One simple approach to incorporating non-linearity involves adding polynomial terms of the predictors to the model alongside the linear terms.Splines, on the other hand, offer a flexible and powerful method for modeling non-linear relationships.Instead of fitting a single global function, splines divide the predictor range into smaller intervals and fit separate polynomial functions within each interval.This approach enables more localized and flexible modeling of the relationship between the predictors and the mean parameter.

CANN Poisson regression.
In some cases, the supervised learning problem may require even more flexibility, and neural networks are particularly useful in such scenarios.Neural networks are formidable function approximation machines, well-known for their ability to estimate a wide range of highly non-linear multivariate functions.One of the key advantages of neural networks is their ability to handle raw and unstructured data effectively.Because we deal with detailed telematics data, this capability forms the basis for adopting the Combined Actuarial Neural Network (CANN) approach of [Wüthrich and Merz, 2019] In our specific case, we use log-linear count regression as the classical model and a multilayer perceptron (MLP) as the neural network component in the CANN model.As a result, we have the following specification for the regression function: where µ MLP (•) is the regression function learned by a multilayer perceptron parametrized with θ.In a nutshell, an MLP consists of interconnected layers, including an input layer, hidden layers, and an output layer.Each layer applies an affine transformation to the inputs it receives, followed by a non-linear activation function.This combination of linear transformations and non-linear activations allows MLPs to model complex non-linear relationships in the data.To delve into the mathematical description of an MLP, we can break down its structure starting from the input layer and progressing towards the output layer: 1. Input layer (l = 0): The input layer consists of n 0 nodes representing the input variables x = [x 1 , x 2 , . . ., x n0 ].

First hidden layer (l = 1):
The first hidden layer contains n 1 nodes, connected to the nodes from the input layer (l = 0) and the nodes in the subsequent layer (l = 2).The computations in the first hidden layer involve an affine transformation of the input variables followed by the application of a non-linear activation function, which introduces non-linearity into the network.Let us denote the weight matrix between layers l = 0 and l = 1 as W (1) with dimensions (n 1 , n 0 ) and the bias vector as b (1) with dimensions (n 1 , 1).The activation function applied to the transformed inputs is denoted as ϕ.The computations in the first hidden layer can then be expressed as: where a (1) represents the preactivation values in the first hidden layer, and z (1) represents the postactivation values.It is worth noting that the activation function ϕ is applied element-wise on the preactivation vector a (1) .
3. Subsequent hidden layers (l = 2, 3, . . ., L − 2): Each subsequent hidden layer l contains n l nodes, connected to the nodes from the previous layer (l−1) and the nodes in the following layer (l+1).Similar to the first layer, the computations in the subsequent hidden layers involve an affine transformation of the inputs z (l−1) followed by the application of the non-linear activation function ϕ.Let us denote the weight matrix between layers l − 1 and l as W (l) with dimensions (n l , n l−1 ) and the bias vector as b (l) with dimensions (n l , 1).The computations in the hidden layers can then be expressed as: where a (l) represents the preactivation values in the l th hidden layer, and z (l) represents the postactivation values.
4. Output layer (l = L − 1): The output layer consists of n L−1 nodes, representing the final output(s) of the MLP.Similar to the hidden layers, the output layer involves an affine transformation followed by an activation function g.We denote the weight matrix between layers L − 2 (last hidden layer) and L − 1 as W (L−1) with dimensions (n L−1 , n L−2 ) and the bias vector as b (L−1) with dimensions (n L−1 , 1).The computations within the output layer can be expressed as: Note that the number of output neurons n L−1 should match the number of modeled distribution parameters.
In the context of Poisson regression, where we are modeling a single parameter µ, only one output neuron is necessary.The choice of the output activation function g(•) is important and should be aligned with the specific problem being tackled since it determines the range and properties of the output values.For instance, in the classic case of a multi-class classification problem (where the multinoulli distribution is used as a specification for the target variable), each output neuron represents a class, and the predicted probabilities for each class should be positive and sum up to 1.In this scenario, a common choice for the activation function is the softmax function, which normalizes the outputs and ensures they are positive and sum up to 1.In our case, we need to ensure that the parameter µ, which represents the expected count, is always positive.While the exponential function is a natural choice to enforce positivity, it can sometimes lead to numerical instability, especially for large input values.As a better alternative, we choose to use the softplus function as the activation function for the output layer, defined as ζ(x) = log(1 + exp(x)).The softplus function is well-behaved even for large input values, mitigating the issue of numerical instability that can arise with the exponential function.The parameters of the generic MLP described above, consisting of weight matrices and bias vectors, can be denoted as: Naturally, these parameters must be estimated.However, the criterion in Equation ( 4) is typically not convex, making it challenging to find the global minimum of the empirical risk.In practice, the goal is to find a "good enough" local minimum that yields satisfactory performance on the task.Gradient descent algorithms, such as stochastic gradient descent (SGD) and its variants, are commonly employed to update iteratively the parameters θ in the direction of the steepest descent.The backpropagation algorithm efficiently computes the gradients and propagates them through the network, enabling parameter updates.With the introduced notation for the MLP, we can now express the specification in (6) as The corresponding computational graph for L = 5 (i.e., 3 hidden layers) is shown in Figure 2. Like a standard  MLP, the network parameters β and θ can be estimated using gradient descent.A simplified pseudo-algorithm for the training of the Poisson CANN model is provided in Algorithm 1.The learning rate η is a hyperparameter that determines the step size taken every time a gradient descent step is performed.In other words, it controls how quickly or slowly the network parameters are updated during training.A higher learning rate allows for larger steps, which can lead to faster convergence.However, an excessively high learning rate may cause the optimization process to overshoot or oscillate around the minimum, hindering convergence.Conversely, a very low learning rate might result in slow convergence, requiring more iterations to reach an acceptable solution.Finding the right learning rate is important and is typically an empirical process that requires experimentation and tuning.In practice, mini-batch gradient descent is commonly used for training neural for epoch ← 1 to E do 1.For each contract (i, t), apply the CANN regression function with current network parameters to compute the current estimated mean parameter μit : 2. Compute the empirical risk over the training dataset: 3. Perform backpropagation to compute the gradients of R with respect to the network parameters: 4. Perform gradient descent using the learning rate: end networks.It works by dividing the training data into smaller subsets, called mini-batches, and computing the gradients and parameter updates based on these mini-batches.This approach offers computational efficiency and improved generalization compared to regular gradient descent, making it a preferred choice in practice.For a comprehensive understanding of neural networks, we refer to the excellent book [Goodfellow et al., 2016].

Negative binomial regression
One issue with the Poisson distribution is its equidispersion assumption.Indeed, we have that µ In practice, claim count data often exhibit overdispersion, where the observed variance of the claim count is greater than the mean.To address this limitation, alternative distributions allowing for overdispersion can be used.Among them, the negative binomial distribution (see, for instance, [Denuit et al., 2007] and [Cameron and Trivedi, 2013]) stands out as a common choice.Under the negative binomial specification, the PMF of the claim count for the t th contract of vehicle i (Y it ), given its predictor vector (x it ), can be written as where ϕ > 0 is a dispersion parameter.This can be seen as a generalization of the Poisson distribution.Indeed, the Poisson distribution is recovered when 1 ϕ → 0. The first two centered moments are given by: As can be seen, the negative binomial specification assumes overdispersion since Var[ Once the specification for the regression function µ(•) has been chosen, which defines a set of candidate functions H, one can estimate the parameters of the regression function µ(•) along with the dispersion parameter ϕ by maximum likelihood or, equivalently, by minimizing the empirical risk over the training set: Log-linear negative binomial regression.As in the Poisson case, a common specification for µ(•) is the log-linear form, defined in Equation ( 5).In this case, the criterion in ( 14) is convex, and convex optimization can be used to estimate β and ϕ.
CANN negative binomial regression.Similar to the approach used for the Poisson case, a CANN architecture can be used to model the mean parameter in the negative binomial distribution.The specification for the regression function µ(•) remains identical to the Poisson case, as defined in Equation ( 11).In order to incorporate the extra distribution parameter ϕ, an additional output neuron is introduced in the network.This output neuron is connected to a neural network weight w ϕ ∈ R through the softplus function, ensuring that ϕ remains positive, i.e., ϕ = ζ(w ϕ ).It is important to highlight that the distribution parameter ϕ is not directly connected to the input variables x.As a result, no heterogeneity is incorporated into this parameter, and a common estimated value ϕ is used for all observations.The exact architecture for the negative binomial CANN model is depicted in Figure 3.The network parameters β, θ, and w ϕ can be learned by minimizing the criterion in Equation ( 14) using the procedure described in Algorithm 2.

Longitudinal models
Cross-sectional models assume independence between all contracts.However, in our case, the data exhibits clustering due to contracts being grouped by vehicle.While it is reasonable to assume independence between contracts from distinct vehicles, this assumption is less valid for contracts from the same vehicle.In reality, the claim counts of contracts within the same vehicle may be influenced by shared vehicle-specific characteristics, unobserved risk factors, or policy-level effects, resulting in dependence between observations within each vehicle cluster.To appropriately address this dependence, we transition from cross-sectional to longitudinal models, enabling the introduction of within-vehicle dependence.In the case of claim count data, a longitudinal model can efficiently leverage the history of the vehicles to refine the risk estimation for future contracts.
While various models are available to analyze longitudinal data, such as random effects models, fixed effects models, generalized estimating equations (GEE), and autoregressive models (AR), among others, empirical evidence in the context of claim count regression supports the effectiveness of random (or mixed) effects models (see [Boucher et al., 2008]).In these models, a random effect, which is a random variable, is introduced in the 1 is added to the log-linear model's preactivation output value ⟨x, β⟩ before being transformed with the softplus function ζ(•) to obtain the µ value of the negative binomial distribution.The ϕ value is obtained by transforming a real-valued parameter w ϕ through the softplus function.The resulting parameters µ and ϕ are then compared to the ground truth y using negative binomial cross-entropy loss.The architecture shown employs a 3-hidden-layer MLP, but can be customized with any number of layers.specified distribution.For instance, in the case of count data, the specified distribution could be the Poisson distribution.The random effect is assumed to follow a certain distribution, such as a normal, gamma, or another appropriate distribution.The inclusion of the random effect allows for capturing the unobserved heterogeneity or individual-specific effects that cannot be accounted for by the observed covariates.It introduces additional variability into the model and accounts for the dependence within clusters.In longitudinal analysis, we need, for each vehicle i, to model the random vector of claim counts Y i,(1:Ti) = (Y i1 , . . ., Y i,Ti ).The joint PMF can be expressed with where f (θ i ) is the PDF of the ramdom effect.

Multivariate negative binomial regression
A multivariate negative binomial regression model is obtained by introducing a gamma-distributed random effect in the mean parameter of the Poisson distribution.Specifically, we assume that the conditional distribution of Y it , given 4. Perform backpropagation to compute the gradients of R with respect to the network parameters: 5. Perform gradient descent using the learning rate: end random variable with mean 1 and variance 1/ϕ.The density of Θ i is given by By using Equation ( 15), one can derive the joint distribution for the vector of claim counts: where µ i• = Ti t=1 µ it and y i• = Ti t=1 y it .This joint distribution is commonly referred to as the multivariate negative binomial (MVNB) or negative multinomial distribution.Note that the Poisson distribution is retrieved when 1 ϕ → 0. Furthermore, given the past claim history denoted as y i,(1:t−1) = (y i1 , . . ., y i,t−1 ) as well as current and past covariate vectors denoted as x i,(1:t) = (x i1 , . . ., x it ), one can show that the number of claims at time (or contract) t follows a negative binomial distribution.The probability of observing y it claims at time t, given the past claim history as well as past and current covariate vectors, is thus expressed with where ) represent the number of past claims and the sum of past µ values for contract (i, t), resepctively.In the special case when t = 1, there is no past history and we set Σ (y) it = Σ (µ) it = 0, which yields α i1 = γ i1 = ϕ.The expected claim count, given the past history, is given by: Fitting an MVNB model, therefore, amounts to fitting a negative binomial model, where the parameters α it and γ it depend on the vehicle's history.Once the specification for the regression function µ(•) is chosen, the parameter ϕ and the parameters in the regression function µ(•) can be estimated by minimizing the empirical risk over the training set.This can be achieved through the following optimization problem: Log-linear multivariate negative binomial regression.If the specification for µ(•) is the log-linear form, defined in Equation ( 5), the criterion in ( 21) is convex, and convex optimization can be used to estimate β and ϕ.

CANN multivariate negative binomial regression.
In the MVNB case, the CANN architecture, as defined in Equation ( 11), can also be used as a specification for the regression function µ(•).To incorporate the additional distribution parameters α it and γ it , two additional output neurons are introduced in the network, as depicted in Figure 4.The distribution parameters α it and γ it stem from a common parameter ϕ > 0, and for the t th contract of vehicle i, we have it .The neuron representing ϕ is connected to a network weight w ϕ ∈ R through the softplus function, i.e., ϕ = ζ(w ϕ ).An MVNB CANN model can be trained with backpropagation and gradient descent, as outlined in Algorithm 3. Notice that for a vehicle i, the parameter γ it depends on the µ parameter values for its past contracts.As the training procedure of the CANN model is iterative, the estimated µ values change at each iteration.Hence, it is crucial to update Σ (µ) it for each contract (i, t) at every iteration.This updating procedure is carried out in step 2 of Algorithm 3.

Pratical Application with Telematics Data
In this section, we explain how our CANN regression models are applied to our datastet.Additionally, we describe the application of the log-linear models, which serve as benchmark models in our analysis.

Log-linear models
The Poisson, negative binomial, and MVNB log-linear models are benchmarks for the Poisson, negative binomial, and MVNB CANN models.These models incorporate all 11 traditional risk factors from Table 1, including the real distance driven (although not strictly classified as a traditional risk factor).For each contract (i, t), these traditional risk factors are denoted by the vector x (trad) it .Notice that among the 11 traditional risk factors, 4 are categorical: gender, marital_status, pmt_plan, and veh_use.For these risk factors, the approach involves initially grouping all rare categories, defined as those representing 5% or less of the total number of observations, and labeling them as "others."We then encode them numerically using dummy encoding.are shown in Table 5. Notably, when using telematics information, the estimated ϕ parameter in the MVNB log-linear model is higher.A higher ϕ value brings the correcting factor in Equation 20 closer to one, indicating reduced importance on past experience when telematics features are used.This underscores the relevance of the engineered telematics features.

CANN models
For the CANN regression models, we extract low-level descriptor vectors that are specifically designed to accurately describe the driving patterns within a particular contract, at least with the dataset we have.We expect the MLP component within the CANN models to learn meaningful high-level features from these low-level vectors.The hope is that the learned features in the hidden layers will be more relevant than the handcrafted features of Table 4.Each contract (i, t) is described by the following descriptor vectors, which provide a summary of its telematics information: it,14 ∈ R 14 , it,10 ∈ R 10 .

• The elements in vector x
(h) it represent the fraction of driving during each of the 24 hours of the day.Therefore, x (h) it,j is the fraction of driving during the j th hour of the day for contract (i, t).

• The elements in vector x (d)
it represent the fraction of driving during each of the 7 days of the week.Therefore, x (d) it,j is the fraction of driving during the j th day of the week for contract (i, t).Monday, 1 20h-0h 2 0h-6h  Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday are denoted by j = 1, 2, 3, 4, 5, 6, 7, respectively.
• The elements in vector x it,j denotes the fraction of trips between 5(j − 1) and 5j kilometers.
These descriptor vectors capture specific aspects of the driving patterns, such as hourly, weekly, average speed, and maximum speed distribution, providing valuable information for the MLPs.Since MLPs can only accept vectors as input, we concatenate these four vectors into a global telematics vector: We incorporate this telematics vector into the MLP component of the CANN models, together with the traditional risk factors x (trad) , enabling interactions between telematics and traditional inputs.In contrast, the log-linear part of the CANN models only includes the traditional risk factors due to the difficulty of processing low-level information.The regression function for the µ parameter can thus be written as where x (trad, tele) is the concatenation of x (trad) and x (tele) .
The CANN models are trained using the torch library in the R programming language, using mini-batch gradient descent with 256 observations per batch.The optimizer we use to perform gradient descent is the Adam optimizer, which is a fairly popular choice for training neural networks.Additionally, we use the reduce-on-plateau learning rate scheduler, which dynamically adjusts the learning rate based on the model's performance, automatically reducing it when the improvement plateaus, allowing for better optimization and convergence during training.For the MLP component of our CANN models, we opt for 3 hidden layers (L = 5) with 128, 64, and 32 hidden units, respectively (n 1 = 128, n 2 = 64, n 3 = 32).We choose the rectified linear unit (ReLU) as the activation function ϕ(•) used in the hidden layers.Additionally, we add batch normalization and dropout layers in-between fully connected layers.Batch normalization applies a normalization transformation to the input of a layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation.By maintaining a stable mean and variance throughout the network, it can mitigate the vanishing or exploding gradients problem, enabling more effective and efficient training.The dropout layers, on the other hand, serve as a regularization technique that helps prevent overfitting.During training, dropout randomly sets a fraction of the hidden units of a given hidden layer to zero at each iteration, which forces the network to learn redundant representations and reduces the reliance on specific features.This regularization technique improves the model's ability to generalize well to unseen data.

CANN hyperparameter tuning
To maximize the performance of our CANN models, we use grid search for hyperparameter tuning, with the average loss observed on the validation dataset V a as our optimization criterion.Additionally, we incorporate a regularization technique known as "early stopping" to determine the best number of epochs.This approach allows us to prevent overfitting and select the optimal number of epochs based on the lowest average loss achieved during training.We focus on three key hyperparameters: p, which represents the probability of dropout in the dropout layers, l_start, denoting the initial learning rate used in the reduce-on-plateau learning rate scheduler, and factor, indicating the factor by which the learning rate is multiplied upon reaching a plateau.A plateau is the point where there is no observed improvement in the validation loss for two consecutive epochs.We compute the average validation loss for all 45 combinations derived from the following hyperparameter values: • l_start: 0.00001, 0.00005, 0.0001, 0.0005, 0.001 • factor: 0.3, 0.4, 0.5 • p: 0.2, 0.3, 0.4.
Remember that the network parameters in the classical components of the CANN models are initialized with the maximum likelihood estimators of the corresponding log-linear model, which is why we use relatively small learning rates.At the initialization stage, the network already produces reasonable predictions, reducing the need for large gradient descent steps.The validation loss for each of the 45 combinations and the three specifications is presented in Table 6.It is worth noting that each model is trained for 30 epochs, and as early stopping is employed, the displayed average validation loss is based on the optimal number of epochs, which can be less than 30.As can be seen, for all learning rates higher than 0.00001, the minimum average validation loss is achieved after a very small number of epochs, indicating that the network learns too quickly.Although the negative binomial and MVNB models perform best at a learning rate of 0.0001, we believe that with more epochs, we could achieve a lower average loss with a learning rate of 0.00001.This is particularly true since the average losses are quite similar for lr_start = 0.00001 and lr_start = 0.0001.When examining the first 9 rows of Table 6, it becomes apparent that the factor hyperparameter has a negligible effect on the validation loss.On the other hand, the p hyperparameter only seems to have an impact on the validation loss for the Poisson model, performing best when p = 0.4.Although the dropout rate does not significantly affect the performance for both the negative binomial and MVNB models, we also choose p = 0.4 for these two models since the best performance is achieved at a high number of epochs (29 and 30 epochs, respectively).This suggests that with more epochs, there is potential for further performance improvement.Therefore, we select lr_start = 0.00001, factor = 0.3, and p = 0.4 as the hyperparameters for all three specifications.We train the models again on the training set, this time for 100 epochs.The performance of the three models on the validation set is displayed in Table 7.As can be seen, all 3 specifications require 35 epochs to minimize the average validation loss.

Performance assessment on the testing set
After carefully tuning the hyperparameters of our CANN models, we have at hand promising claim count models that are now nearing implementation.The next crucial step is to estimate their generalization capabilities accurately.To achieve this, we cannot rely on the validation set, as it has been extensively used during the hyperparameter tuning process.Instead, we assess the models' generalization performance using the testing set T e , which has remained untouched until now.Using this independent dataset, we can estimate the models' predictive performance on unseen data points and determine their suitability for real-world applications.Furthermore, we perform a comparative analysis between the CANN and the benchmark models, namely the log-linear models that use telematics information in the form of handcrafted telematics features.This comparative assessment allows us to evaluate our CANN models' relative performance and effectiveness against established approaches.In order to fully capture the value of telematics data, we also evaluate the performance of all 6 models (Poisson, negative binomial and MVNB log-linear and CANN models) using only the 11 traditional risk factors as covariates.In the CANN models, the MLP component therefore only comprises the 11 traditional risk factors.This analysis helps us understand the contribution of telematics information in improving the predictive power of the models.
All 12 models are trained on the learning set, and their performance is evaluated on the testing set.To assess the performance, we employ 3 different scoring rules, namely the Poisson deviance, the logarithmic score, and the squared error.For each scoring rule, we compute the average value on the testing set.To assess the magnitude of the achieved performance, we begin by calculating the average scoring rule values for a baseline model.This baseline model is defined as a homogeneous Poisson log-linear model, where the estimation of the mean (and variance) parameter µ is estimated by the average number of claims per contract observed in the learning set: The average scoring rule values for this baseline model on the testing set are reported in Table 8.We can then evaluate the performance of each of the 6 models in terms of percentage improvement over the baseline model, as shown in Table 9.As can be seen, our CANN models consistently outperform their corresponding log-linear benchmark models across all scoring rules.Moreover, our longitudinal MVNB CANN model offers a significative improvement over both Poisson and negative binomial distributions, suggesting a substantial dependence among contracts within a vehicle.

Permutation feature importance and partial dependence plots
One substantial drawback of neural networks is their difficulty of interpretation.However, researchers have developed tools to shed light on the inner workings of these black box algorithms.Two particularly useful tools in this context are permutation feature importance and partial dependence plots.
Permutation feature importance is a model-agnostic technique that computes an importance score for each input (or variable) in a supervised learning algorithm.It achieves this by randomly permuting the values of a specific input while holding the other inputs constant and observing the resulting effect on the model's performance.By comparing the original model's performance with the permuted performance, we can determine the variable's importance relative to the chosen performance metric.Suppose we have a trained model and a holdout sample for evaluation purposes.We can initially score the model on this sample and measure its performance using a chosen metric, such as the average loss.Let us denote the average loss obtained with the original holdout sample as ℓ original .To assess the importance of input j in the prediction process, we randomly shuffle the values of input j in the holdout sample and rescore the model.This process yields a new average loss, denoted as ℓ permuted , where the superscript j indicates that input j has been permuted.If input j is indeed important for the model's prediction, the permuted average loss ℓ (j) permuted is expected to be greater than the original average loss ℓ original .This suggests that permuting the values of input j has a detrimental effect on the model's performance.To obtain an importance score for input j, we can compute the difference between the permuted average loss and the original average loss, resulting in the feature importance score FI j : To obtain a more reliable estimate of the importance score, this procedure can be repeated a certain number of times for input j, creating a distribution of the increase or decrease in the average loss.The whole procedure can then be repeated for all inputs.In Figure 5, the importance scores of the 20 most important variables for our best model, the MVNB CANN, are visualized using boxplots.Please note that the names used for the telematics inputs in Figure 5 differ from the introduced notation.However, a translation table is provided in Table 10 of Appendix A to clarify the correspondence between the names used and the introduced notation.Each boxplot represents the distribution of the 100 importance scores assigned to a specific input obtained by shuffling and assessing the model 100 times.The performance metric used is the average cross-entropy loss.The analysis reveals interesting findings regarding the claim count model.As can be seen, the top 5 most important variables are from our set of 11 traditional risk factors.Notably, veh_age, distance, and expo play a significant role in the model's performance.When it comes to telematics inputs, those related to maximum speed demonstrate a substantial impact on the model's performance.Particularly, vma_16, representing the fraction of trips made at a maximum speed exceeding 150 kilometers per hour, stands out as the most important input.In general, the fraction of trips made at high maximum speeds, such as vma_14, vma_15, and vma_16, proves to be valuable for predicting claims.Additionally, it is interesting to observe that h_22 and h_2, which represent the fraction of driving during night hours, contribute substantially to the assessment of risk.Importantly, the gender variable, often used by insurers as a risk factor, is rendered useless in the presence of telematics inputs.It ranks as the 70 th most important variable (not showed in Figure 5), indicating its insignificance in the model's predictive power.
Partial Dependence Plots (PDP) are valuable tools for understanding the relationship between a specific input variable and the output of a supervised learning model.PDPs are also model-agnostic, meaning they can be applied to different types of models.They provide insights into how changes in a particular input variable influence the model's predictions while keeping all other variables at fixed values.In other words, they illustrate the marginal effect of an input variable on the predicted outcome.To compute a PDP for a specific input variable j, the process involves the following steps.First, a grid of values is defined to cover the entire or plausible range of the variable's values.Next, while holding all other variables fixed, the input vector in the holdout dataset is sequentially replaced with each value from the defined grid.Subsequently, predictions are obtained using the trained model on the modified holdout dataset for each value.By plotting the input variable values on the x-axis and the corresponding average prediction on the y-axis, the resulting PDP visually showcases the relationship between the input variable and the model's predictions.Figure 6 displays the PDPs of the 8 most important telematics inputs in the MVNB CANN model.The plots reveal that the risk, expressed as the expected number of claims, appears to increase in a linear fashion with the proportion of trips made at high maximum speeds, indicated by the input variables vma_14, vma_15, and vma_16.Additionally, there appears to be a positive linear association between the expected number of claims and the proportion of driving taking place during nighttime hours, specifically between 9 p.m. and 10 p.m. (h_22) and between 1 a.m. and 2 a.m.(h_2).It is important to emphasize that when interpreting partial dependence plots, caution must be exercised, as the procedure assumes that the input variables are independent of each other.In particular, the interpretation of the PDPs related to the fraction of driving on Tuesdays (p_2) is challenging due to the correlation between the proportions of driving on different days of the week.For instance, if an insured individual drives in smaller proportions on Tuesdays, they will systematically drive in larger proportions on other days of the week.

Conclusions
In this study, we developed three novel claim count regression models leveraging telematics data in the form of trip summaries.Our models are based on the Combined Actuarial Neural Network architecture, specifically designed to address actuarial problems and harness rich and complex information such as data provided by telematics technology.One key aspect of our work is the adaptation of the CANN architecture to accommodate the MVNB distribution specification.This adaptation allows us to effectively capture the time dependence between insurance contracts, which is important for accurately modeling claim counts.Furthermore, our findings highlight the importance of telematics inputs related to the maximum speed reached during trips in the claim count models.With partial dependence plots, we found that claim frequency is positively correlated with the fraction of trips made at high maximum speeds.Overall, the new approaches developed in this article represent a significant advancement in accurately modeling claim counts and enhancing the performance of predictive models in the context of usage-based insurance.Remarkably, the CANN regression models consistently outperform traditional log-linear models using handcrafted telematics features, as demonstrated by the superior performance across three performance metrics.These results are further supported by the use of a proper machine learning methodology that effectively prevents data leakage and mitigates the risk of producing falsely optimistic results.
While the available telematics data has been instrumental in improving our claim count models, we believe that further improvement can be achieved with access to richer data.For instance, if second-by-second data or additional information such as harsh acceleration/braking and distracted driving were accessible, we believe the performance could be further improved.Depending on the data format, different types of neural networks, such as convolutional and recurrent neural networks, could be used as the network component in the CANN models.Additionally, we acknowledge that with more time and computational power, a more comprehensive fine-tuning process of the CANN models could yield even better results than what we achieved.Notably, we were constrained in adjusting the number of hidden layers and units in the MLP components of the CANN models due to time and computational limitations.Moreover, a more advanced tuning method, beyond the grid search approach used in this study, could be employed to optimize model performance.In this study, we used the MVNB distribution as our longitudinal specification.However, alternative longitudinal specifications, such as the beta-binomial distribution, exist and could be easily implemented as they share similarities with the MVNB specification.Finally, it would be interesting to conduct further research investigating the impact of using a longitudinal model on telematics variables.It is expected that the importance of certain telematics variables would decrease when considering past claim history, as this historical data can provide insights into the claiming risk of an insured.18).The ϕ value is obtained by transforming a real-valued parameter w ϕ through the softplus function.To obtain α, the sum of past claims Σ (y) is added to the ϕ parameter, while for γ, the sum of past µ values Σ (µ) is added to the same ϕ parameter.The resulting distribution parameters µ, α and γ are then compared to the ground truth y using negative binomial cross-entropy loss.The architecture shown employs a 3-hidden-layer MLP, but can be customized with any number of layers.

Figure 1 :
Figure 1: Number of contracts per vehicle.

Figure 2 :
Figure 2: CANN architecture for the Poisson specification.The MLP's preactivation output value a (4) 1 is added to the log-linear model's preactivation output value ⟨x, β⟩ before being transformed with the softplus function ζ(•).The resulting µ value is compared to the ground truth y using Poisson cross-entropy loss.The architecture shown employs a 3-hidden-layer MLP, but can be customized with any number of layers.

Figure 3 :
Figure 3: CANN architecture for the negative binomial specification.The MLP's preactivation output value a on Friday and Saturday frac_expo_mon_to_thu Fraction of driving on Monday to Thursday frac_expo_night Fraction of night driving 2 frac_expo_noon Fraction of midday driving 3 frac_expo_peak_evening Fraction of evening rush hour driving 4 frac_expo_peak_morning Fraction of morning rush hour driving 5 max_trip_max_speed Maximum of the maximum speed of the trips med_trip_avg_speed Median of the average speeds of the trips med_trip_distance Median of the distances of the trips med_trip_max_speed Median of the maximum speeds of the trips prop_long_trip Proportion of long trips (> 100km) it represent the fraction of trips made in different average speed slots.For instance, x (a) it,j denotes the fraction of trips made at an average speed between 10(j − 1) and 10j kilometers per hour.• The elements in vector x (m) it represent the fraction of trips made in different maximum speed slots.For instance, x (m) it,j denotes the fraction of trips made where the maximum speed reached falls between 10(j − 1) and 10j kilometers per hour.• The elements in vector x (k) it represent the fraction of trips made in different distance slots.For instance, x (k)

Figure 4 :
Figure 4: CANN architecture for the MVNB specification.The MLP's preactivation output value a (4) 1 is added to the log-linear model's preactivation output value ⟨x, β⟩ before being transformed with the softplus function ζ(•) to obtain the µ value of the negative binomial distribution of Equation (18).The ϕ value is obtained by transforming a real-valued parameter w ϕ through the softplus function.To obtain α, the sum of past claims Σ (y) is added to the ϕ parameter, while for γ, the sum of past µ values Σ (µ) is added to the same ϕ parameter.The resulting distribution parameters µ, α and γ are then compared to the ground truth y using negative binomial cross-entropy loss.The architecture shown employs a 3-hidden-layer MLP, but can be customized with any number of layers.

Figure 5 :
Figure 5: Importance scores of the 20 most important variables obtained for the MVNB CANN model.

Figure 6 :
Figure 6: Partial dependence plots showcasing the 8 most important telematics inputs in the MVNB CANN model.The histogram above each line plots shows the input's distribution.

Table 1 :
Variables of the contract dataset.

VIN Trip ID Departure datetime Arrival datetime Distance Maximum speed
The training set, which usually comprises the largest portion of the data, is used to train the model's parameters and optimize its

Table 2 :
Extract from the telematics dataset.Dates are displayed in the yyyy-mm-dd format.The actual VINs have been hidden for privacy purposes.

Table 3 : Data partitioning 3 Count Regression Models We
consider a training dataset denoted as T r , which consists of |T r | rows representing vehicle insurance contracts.Contracts are grouped by vehicle and each vehicle i is observed over T i contracts.We define Y it as a discrete random variable denoting the number of claims during the t th contract of vehicle i.Furthermore, we have x it a vector containing relevant predictor variables associated with the t th contract of vehicle i. Importantly, we assume independence among all insured vehicles.In claim count regression, the ultimate goal is to estimate the probability mass function (PMF) of the number of claims, given all past and current information about the vehicle.Mathematically, we seek to estimate: The classical regression model provides good initial estimations and serves as a guide for the neural network component.It offers a starting point for the network's optimization process, enabling faster convergence.The neural network component, in turn, refines the initial estimations, capturing additional signals and uncovering patterns that may have been missed by the classical model alone.
, which embeds a classical actuarial model into a neural network architecture.A CANN model consists of two distinct components: the classical regression model component and the neural network component.This architecture offers great flexibility, allowing for seamless integration of any classical model whose regression function is compatible with a neural network architecture.Likewise, the neural network component can employ various types of supervised architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and other architectures tailored to the specific problem at hand.
Parameter estimation procedure -Negative binomial CANN model Input: Training dataset {(x it , y it )} (i,t)∈Tr , learning rate η, number of epochs E Output: Trained model parameters θ, β and ŵϕ follows a Poisson distribution with mean µ(x it )θ i , where Θ i is a gamma-distributed Algorithm 2:

Table 4 :
Handcrafted telematics features extracted from the telematics dataset.

Table 5 :
Estimated parameters of the log-linear models on the training set.

Table 6 :
Coarse hyperparameter tuning for the CANN models.The training process is stopped after 30 epochs.The provided validation loss corresponds to the optimal number of epochs, consistent with the early stopping procedure.

Table 7 :
Optimal CANN models' performance on the validation set.

Table 8 :
Performance of the baseline model on the testing set.

Table 9 :
Performance comparison of the CANN models and their corresponding log-linear benchmark model on the testing set.