Machine Learning with High-Cardinality Categorical Features in Actuarial Applications

High-cardinality categorical features are pervasive in actuarial data (e.g. occupation in commercial property insurance). Standard categorical encoding methods like one-hot encoding are inadequate in these settings. In this work, we present a novel _Generalised Linear Mixed Model Neural Network_ ("GLMMNet") approach to the modelling of high-cardinality categorical features. The GLMMNet integrates a generalised linear mixed model in a deep learning framework, offering the predictive power of neural networks and the transparency of random effects estimates, the latter of which cannot be obtained from entity embedding models. Further, its flexibility to deal with any distribution in the exponential dispersion (ED) family makes it widely applicable to many actuarial contexts and beyond. We illustrate and compare the GLMMNet against existing approaches in a range of simulation experiments as well as in a real-life insurance case study. Notably, we find that the GLMMNet often outperforms, or at least performs comparably with, an entity-embedded neural network, while providing the additional benefit of transparency, which is particularly valuable in practical applications. Importantly, while our model was motivated by actuarial applications, it can have wider applicability. The GLMMNet would suit any application that involves high-cardinality categorical variables and where the response cannot be sufficiently modelled by a Gaussian distribution.


Background
The advances in machine learning (ML) over the past two decades have transformed many disciplines. Within the actuarial literature, we see an explosion of works that apply and develop ML methods for various actuarial tasks (see Richman, 2021a,b). Meanwhile, there are many distinctive challenges with insurance data that are not addressed by general-purpose ML algorithms.
One prominent example is the usual presence of high-cardinality categorical features (i.e. categorical features with many levels or categories). These features often represent important risk factors in actuarial data. Examples include the occupation in commercial property insurance, or the cause of injury in workers' compensation insurance. Unfortunately, ML algorithms cannot "understand" (process) categorical features on their own. The classical textbook approach to this problem, and also the most popular one in practice, is one-hot encoding. It turns a categorical feature of q unique categories into numeric representations by constructing q binary attributes, one for each category; for example, a categorical feature with three unique categories will be represented as [1, 0, 0], [0, 1, 0] and [0, 0, 1].
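The encoding above can be sketched in a few lines; the category names below are hypothetical, and real pipelines would typically use a library routine such as scikit-learn's `OneHotEncoder`.

```python
# A minimal sketch of one-hot encoding with illustrative category names.
def one_hot_encode(values):
    categories = sorted(set(values))               # fix an ordering of the q categories
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]
    return rows, categories

encoded, cats = one_hot_encode(["office", "retail", "office", "factory"])
# each row is a q-dimensional binary vector with exactly one 1
```

Note that the feature matrix grows one column per category, which is precisely what becomes problematic as cardinality increases.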
One-hot encoding works well with a small number of independent categories, which is the case with most applications in the ML literature (e.g. Hastie et al., 2009). However, issues arise as cardinality expands: (i) the orthogonality (i.e. independence) assumption of one-hot encoding no longer seems appropriate, since the growing number of categories will inevitably start to interact as the feature space gets more and more crowded; (ii) the resultant high-dimensional feature matrix entails computational challenges, especially when used with already computationally expensive models such as neural networks; (iii) the often uneven distribution of data across categories makes it difficult to learn the behaviour of the rare categories.
For a motivating example, consider the workers' compensation scheme of the State of New York (2022). In the four million claims observed from 2000 to 2022, the cause of injury variable has 78 unique categories, with more than 200,000 observations (about 5%) for the most common cause (lifting) and fewer than 1000 observations (0.025%) for the 10 least common causes (e.g. crash of a rail vehicle, or gunshot). The uneven coverage of claims presents a modelling challenge and has been documented by actuarial practitioners (e.g. Pettifer and Pettifer, 2012). Furthermore, the cause of injury variable alone may not serve well to differentiate claim severities. Injuries from the same cause can result in vastly different claims experiences. It seems natural to consider its interaction with the nature of injury and part of body variables, each with 57 reported categories. Exploring 4446 (78 × 57) interaction categories is clearly infeasible with one-hot encoding.
The example above highlights that one-hot encoding is an inadequate tool for handling high-cardinality categorical features. A few alternatives exist but also have their own drawbacks, as listed below. It is worth pointing out that while the following approaches appear very distinct, they share one common feature. They all encourage the categories to "communicate" with each other, in the sense that the information learned from one category should also help with learning of the others.
(i) Manual (or data-guided) regrouping of categories of similar risk behaviours. The goal here is to reduce the number of categories so that the refined categories can be tackled with the standard one-hot encoding scheme. This is a workable approach, but manual regrouping requires significant domain inputs, which are expensive. Furthermore, data-driven methods such as clustering (see e.g. Guiahi, 2017) require that data be first aggregated by categories, and this aggregation gives away potentially useful information available at more granular levels.
(ii) Entity embeddings from neural networks. Proposed by Guo and Berkhahn (2016), this seems to be the most popular approach in the ML-driven stream of actuarial literature (Delong and Kozak, 2021; Shi and Shi, 2021; Kuo and Richman, 2021). Entity embeddings work to extract a low-dimensional numeric representation of each category, so that categories closer in distance would observe similar response values. However, as the entity embeddings are trained as an early component of a black-box neural network, they offer little transparency towards their effects on the response.
(iii) Generalised linear mixed models (GLMMs) with the high-cardinality categorical features modelled as random effects. This is another popular approach among actuaries, partly due to its high interpretability (Antonio and Beirlant, 2007; Verbelen, 2019). However, GLMMs, as an extension of GLMs, also inherit their limitations: the same ones that have motivated actuaries to start turning away from GLMs and exploring ML solutions (see, e.g., Al-Mudafer et al., 2022; Henckaerts et al., 2021).
Some recent work has appeared in the ML literature aiming to extend GLMMs to take advantage of the more capable ML models, by embedding an ML model into the linear predictor of the GLMMs. The most notable examples include GPBoost (Sigrist, 2022), which combines GLMMs with a gradient boosting algorithm, and LMMNN (Simchoni and Rosset, 2022), which combines linear mixed models (LMMs) with neural networks. Unfortunately, these models present their own limitations in the face of insurance data: GPBoost assumes overly optimistic random effects variance, and LMMNN is limited to a Gaussian-distributed response. While the Gaussian assumption offers computational advantages (in terms of analytical tractability), it is often ill suited to the more general distributions that are required for the modelling of financial and insurance data.
None of the existing techniques appears satisfactory for the modelling of high-cardinality categorical features. This paper seeks an alternative approach that more effectively leverages information in the categories, offers greater transparency than entity embeddings, and demonstrates improved predictive performance. In the next section, we introduce our approach.

Contributions
Our work presents two main contributions. First, we take inspiration from the latest ML developments and propose a generalised mixed effects neural network called GLMMNet. The GLMMNet is an extension of the LMMNN proposed by Simchoni and Rosset (2022). In a similar vein, the GLMMNet fuses a deep neural network to the GLMM structure: the network component of the model provides it with the flexibility required to capture complex non-linear relationships in the data, and the GLMM-like component of the model allows a probabilistic interpretation and offers full transparency on the categorical variables (through the random effects estimates). In contrast with the Gaussian-based LMMNN, the key enhancement of the GLMMNet is its ability to model the entire class of exponential dispersion (ED) family distributions. The proposed extension requires a significant effort as it involves moving away from analytical results. The flexibility it brings, however, makes the GLMMNet approach widely applicable to many insurance contexts, such as the prediction of claim frequency or the estimation of pure risk premiums.
Second, we provide a systematic empirical comparison of some of the most popular approaches for modelling high-cardinality categorical features. Although the difficulty with such variables has been a longstanding issue in actuarial modelling, to the best of our knowledge, this paper is the first piece of work to extensively compare a range of the existing approaches (including our own GLMMNet). We believe that such a benchmarking study will be valuable, especially to practitioners facing the many options available to them.
Importantly, the methodological extensions proposed in this paper, while motivated by actuarial applications, are not limited to this context and can have much wider applicability. The extension of the LMMNN to the GLMMNet would suit any application where the response cannot be sufficiently modelled by a Gaussian distribution. This unparalleled flexibility with the form of the response distribution, in tandem with the computational efficiency of the algorithm (due to the fast and highly scalable variational inference used to implement the GLMMNet), makes the GLMMNet a promising tool for many practical ML applications.

Outline of Paper
In Section 2, we introduce the GLMMNet, which fuses neural networks with GLMMs to enjoy both the predictive power of deep learning models and the statistical strength, specifically the interpretability and likelihood-based estimation, of GLMMs. In Section 3, we illustrate and compare the proposed GLMMNet against a range of alternative approaches across a spectrum of simulation environments, each designed to mimic specific elements of insurance data commonly observed in practice. Section 4 presents a real-life insurance case study and Section 5 concludes.
The code used in the numerical experiments is available at https://github.com/agi-lab/glmmnet.

GLMMNet: A Generalised Mixed Effects Neural Network
This section describes our GLMMNet in detail. We start by defining some notation in Section 2.1. We then provide an overview of GLMMs in Section 2.2 to set the stage for the introduction of the GLMMNet. Sections 2.3-2.5 showcase the architectural design of the GLMMNet and provide the implementation details.

Setting and Notation
We first introduce some notation to formalise the problem discussed above. We write Y to denote the response variable, which is typically one of claim frequency, claim severity or risk premium for most insurance problems. We assume that the distribution of Y ∈ Y depends on some covariates or features. In this work, we draw a distinction between standard features x ∈ X (i.e. numeric features and low-dimensional categorical features) and the high-cardinality categorical features z ∈ Z, consisting of K features each with q_i (i = 1, …, K) categories. The latter is the subject of our interest. All categorical features have been one-hot encoded in the representations x and z.
We assume x ∈ R^p and z ∈ R^q, where p and q = Σ_{i=1}^K q_i represent the respective dimensions of the vectors. The empirical observations (i.e. data) are denoted D = {(y_i, x_i, z_i)}_{i=1}^n, where n is the number of observations in the dataset. For convenience, let us also write y = (y_1, …, y_n)^⊤ for the vector of responses. In general, we use bold font to denote vectors and matrices.
The purpose of any predictive model is to learn the conditional distribution p(y|x, z). It is generally assumed that the dependence of the response Y on the covariates is through a true but unknown function µ(·, ·), i.e. Y | x, z ∼ p(y; µ(x, z), φ), where φ represents a vector of any additional or auxiliary parameters of the likelihood. Most commonly µ(x, z) will be the conditional mean E(Y | x, z), although it can be interpreted more generally. The predictive model then approximates µ(·) by some parametric function µ̂(·). Specifically, a linear regression model will assume a linear structure for µ(·) with φ = σ² to account for the error variance. Indeed, for most ML models, the function µ(·) is the sole subject of interest (as opposed to the entire distribution p(y|x, z)).
There are many different options for choosing the function µ(•).We discuss and compare the main ones in this paper; see also Section 3.1.
In what follows, for notational simplicity we will assume that z consists of only one (high-cardinality) categorical feature, i.e. K = 1 and q = q_1. The extension to multiple categorical features, however, is very straightforward; see also Section 2.3.2.

Generalised Linear Mixed Models
Mixed models were originally introduced to model multiple correlated measurements on the same subject (e.g. longitudinal data), but they also offer an elegant solution to the modelling of high-cardinality categorical features (Frees, 2014). We now give a brief review of mixed models in the latter context. For further details, the books by McCulloch and Searle (2001), Gelman and Hill (2007) and Denuit et al. (2019) are all excellent references for this topic.
Let us start with the simplest linear model case. A one-hot encoded linear model assumes

µ(x, z) = β_0 + x^⊤β + z^⊤α, (2.2)

where x are standard features (e.g. sum insured, age), z is a q-dimensional binary vector representation of a high-cardinality categorical variable (e.g. occupation), and β_0, β and α = (α_1, …, α_q)^⊤ are the regression coefficients to be estimated. Note that an additional constraint Σ_{i=1}^q α_i = 0 is required to restore identifiability of the parameters. In high-cardinality settings, the vector z becomes high-dimensional, and hence so must be α. High cardinality therefore introduces a large number of parameters to the estimation problem. Moreover, in most insurance problems, the distribution of categories will be highly imbalanced, meaning that some categories will have much more exposure (i.e. data) than others. The rarely observed categories are likely to produce highly variable parameter estimates. As established in earlier sections, all this makes it extremely difficult to estimate the α accurately (Gelman and Hill, 2007). Equation (2.2) is also referred to as the no pooling model by Antonio and Zhang (2014), as each α_i, i = 1, 2, …, q, has to be learned independently. At the other end of the spectrum, there is a complete pooling model, which simply ignores the z and assumes

µ(x, z) = β_0 + x^⊤β. (2.3)

In the middle ground between the overly noisy estimates produced by (2.2) and the over-simplistic estimates from (2.3), we can find the (linear) mixed models, which assume

µ(x, z) = β_0 + x^⊤β + z^⊤u, (2.4)

where u = (u_1, …, u_q)^⊤ are known as the random effects, characterising the deviation of individual categories from the population mean, in contrast with the fixed effects β_0 and β. Model (2.4) is also known as a random intercept model.
The central idea in (2.4) is that instead of assuming some fixed coefficient for each category, we assume that the effects of individual categories come from a distribution, so we only need to estimate the (far fewer) parameters that govern the distribution of random effects rather than learning an independent parameter per category. This produces the equivalent effect of partial pooling (Gelman and Hill, 2007): categories with a smaller number of observations will have weaker influences on the parameter estimates, and extreme values get regularised towards the collective mean (modelled by the fixed effects, i.e. β_0 + x^⊤β).
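The partial-pooling effect can be illustrated with a small numerical sketch. For a Gaussian random intercept model with known variances, the predicted effect of a category shrinks that category's mean residual towards zero (the population mean) by a credibility-style weight; all variance values below are illustrative only.

```python
# Illustrative partial-pooling (shrinkage) estimate for a Gaussian random
# intercept model with known variances (values are made up): the category
# effect is pulled towards 0 in proportion to how little data it has.
def shrunk_effect(mean_residual, n_obs, sigma2, sigma2_u):
    weight = (n_obs / sigma2) / (n_obs / sigma2 + 1.0 / sigma2_u)
    return weight * mean_residual

rare = shrunk_effect(1.0, n_obs=2, sigma2=4.0, sigma2_u=0.25)       # heavy shrinkage
common = shrunk_effect(1.0, n_obs=2000, sigma2=4.0, sigma2_u=0.25)  # little shrinkage
```

A rare category with the same observed deviation is shrunk far more strongly than a well-observed one, which is exactly the regularising behaviour described above.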
In the same way that linear mixed models extend linear models, GLMMs extend GLMs by adding random effects capturing between-category variation to complement the fixed effects in the linear predictor.
Suppose that we have q categories in the data, each with n_i (i = 1, 2, …, q) observations (so the total number of observations is n = Σ_{i=1}^q n_i). A GLMM is then defined by four components: (1) Unconditional distribution of random effects. We assume that the random effects u_j ∼ iid f(u_j | γ), j = 1, 2, …, q, depend on a small set of parameters γ. It is typically assumed that u_j follows a Gaussian distribution with zero mean and variance σ_u², i.e. u_j ∼ iid N(0, σ_u²).
(2) Response distribution. GLMMs assume that the responses Y_i, i = 1, 2, …, n, given the random effect u_{j[i]} for the category each belongs to (where j[i] means that the i-th observation belongs to the j-th category, j = 1, 2, …, q), are conditionally independent with an ED distribution, i.e.

p(y_i | u_{j[i]}) = exp{ [y_i θ_i − b(θ_i)] / φ + c(y_i; φ) }, (2.6)

where θ_i denotes the location parameter and φ the dispersion parameter for the ED density, and b(·) and c(·) are known functions. It follows from the properties of ED family distributions that the conditional distribution is fully characterised by the conditional mean µ_ij = E(Y_i | u_{j[i]}) = b′(θ_i) and the dispersion parameter φ.
(3) Linear predictor. The linear predictor includes both fixed and random effects:

η_ij = β_0 + x_i^⊤β + z_i^⊤u, (2.8)

where x_i denotes a vector of fixed effects variables (e.g. sum insured, age) and z_i a vector of random effects variables (e.g. a binary representation of a high-cardinality feature such as injury code).
(4) Link function. Finally, a monotone differentiable function g(·) links the conditional mean µ_ij with the linear predictor η_ij:

g(µ_ij) = η_ij, (2.9)

completing the model.
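To make the four components concrete, the following sketch simulates from a toy Poisson GLMM with a log link: Gaussian random intercepts for component (1), a Poisson response for component (2), an intercept-only linear predictor for component (3) and the log link for component (4). All parameter values are illustrative.

```python
import math, random

# Toy simulation from a Poisson GLMM: u_j ~ N(0, sigma_u^2) and
# Y_i | u_{j[i]} ~ Poisson(mu_ij) with log(mu_ij) = beta0 + u_{j[i]}.
random.seed(0)
q, sigma_u, beta0 = 5, 0.5, 1.0
u = [random.gauss(0.0, sigma_u) for _ in range(q)]  # random effects (component 1)

def sample_poisson(lam):
    # Knuth's multiplication method; adequate for small lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# 50 responses, cycling observations through the q categories
y = [sample_poisson(math.exp(beta0 + u[i % q])) for i in range(50)]
```

Observations sharing a category share the same u_j, which is what induces the within-category correlation discussed next.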
As explained earlier, the addition of random effects allows GLMMs to account for correlation within the same category, without overfitting to an individual category as is the case of a one-hot encoded GLM.The shared parameters that govern the distribution of random effects essentially enable information to transfer between categories, which is helpful for learning.

Model Architecture
In Section 1.1, we briefly reviewed previous attempts at embedding ML into mixed models. In particular, the LMMNN structure proposed by Simchoni and Rosset (2022) is the closest to what we envision for a mixed effects neural network. However, it suffers from two major shortcomings that restrict its applicability to insurance contexts. First, the LMMNN assumes a Gaussian response with an identity link function, for which analytical results can be obtained, but which is ill suited to modelling the skewed, heavier-tailed distributions that dominate insurance and financial data. Second, LMMNNs model the random effects in a non-linear way (through a sub-network in the structure). This complicates the interpretation of random effects, which carries practical significance in many applications.
The GLMMNet aims to address these concerns. The architectural skeleton of the model is depicted in Figure 2.1. We adopt a very similar structure to that of Simchoni and Rosset (2022), except that we remove the sub-network that they used to learn a non-linear function of Z. As noted earlier, the main purpose of our modification is to retain the interpretability of random effect predictions from the GLMM's linear predictor. In addition, we also want to avoid an over-complicated structure for the random effects, whose role is to act as a regulariser (Gelman and Hill, 2007). In mathematical terms, we assume that the conditional y|u follows an ED family distribution, with

g(E[y | u]) = f(x) + z^⊤u,

where g(·) is the link function and f(·) is learned through a neural network.
In the following, we give a detailed description of each component in the architecture plotted above, followed by a discussion of how the collective network can be trained in Section 2.4. We remark that in this work we have allowed f(·) to be fully flexible to take full advantage of the predictive power of neural networks. We recognise that in some applications, interpretability of the fixed effects may also be desirable. In such cases, it is possible to replace the fixed effects module (described in Section 2.3.1) by a Combined Actuarial Neural Network (CANN; Wüthrich and Merz, 2019) structure.

Fixed Effects
The biggest limitation of GLMMs lies in their linear structure in (2.8): similar to GLMs, features need to be hand-crafted to allow for non-linearities and interactions (Richman, 2021a,b). The proposed GLMMNet addresses this issue by utilising a multi-layer network component for the fixed effects, represented as the shaded structure in Figure 2.1. For simplicity, here we consider a fully connected feedforward neural network (FFNN), although many other network structures, e.g. convolutional neural networks (CNNs), can also be easily accommodated.
A FFNN consists of multiple layers of neurons (represented by circles in the diagram) with non-linear activation functions to capture potentially complex relationships between input and output vectors; we refer to Goodfellow et al. (2016) for more details. Formally, for a network with L hidden layers and q_l neurons in the l-th layer, l = 1, …, L, the l-th layer defines a mapping z^(l): R^{q_{l−1}} → R^{q_l} whose j-th component is

z_j^(l)(a) = ϕ^(l)(⟨w_j^(l), a⟩ + b_j^(l)), j = 1, …, q_l, (2.11)

where ϕ^(l): R → R is called the activation function for the l-th layer, b_j^(l) ∈ R and w_j^(l) ∈ R^{q_{l−1}} respectively represent the bias term (or intercept in statistical terminology) and the network weights (or regression coefficients) for the j-th neuron in the l-th layer, and q_0 is defined as the number of (preprocessed) covariates entering the network.
The network therefore provides a mapping from the (preprocessed) covariates x to the desired output layer through a composition of hidden layers:

f(x) = (z^(L) ∘ z^(L−1) ∘ ⋯ ∘ z^(1))(x). (2.12)

Training a network involves making many specific choices; Appendix B.2 gives more details.
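The layer composition above can be sketched directly. The widths, random weights and ReLU activation below are arbitrary placeholders standing in for a trained fixed-effects network, not values from the paper.

```python
import random

# Forward pass of a small fixed-effects FFNN f(x): each layer computes
# phi(W a + b). Widths, weights and the ReLU activation are illustrative.
def relu(v):
    return [max(0.0, x) for x in v]

def layer(a, W, b, activation=relu):
    # one layer: phi(W a + b), with W given as a list of weight rows
    pre = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
           for row, b_i in zip(W, b)]
    return activation(pre)

def ffnn_forward(x, weights, biases):
    a = x
    for W, b in zip(weights, biases):   # composition z^(L) o ... o z^(1)
        a = layer(a, W, b)
    return a

random.seed(42)
p, q1, q2 = 4, 8, 3                     # input dimension and two hidden widths
W1 = [[random.gauss(0, 1) for _ in range(p)] for _ in range(q1)]
W2 = [[random.gauss(0, 1) for _ in range(q1)] for _ in range(q2)]
b1, b2 = [0.0] * q1, [0.0] * q2
out = ffnn_forward([random.gauss(0, 1) for _ in range(p)], [W1, W2], [b1, b2])
```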

Random Effects
Below we give a description of the random effects component of the GLMMNet, which is the top (unshaded) structure in Figure 2.1. For illustration, we stay with the case of a single high-cardinality categorical feature with q unique levels. The column of the categorical feature is first one-hot transformed into a binary matrix Z of size n × q, which forms the input to the random effects component of the GLMMNet.
Here comes the key difference from the fixed effects network: the weights in the random effects layer, denoted by u = [u_1, u_2, …, u_q]^⊤, are not fixed values but are assumed to come from a distribution. This way, instead of having to estimate the effects of every individual category (many of which may lack sufficient data to support a reliable estimate), we only need to estimate the far fewer parameters that govern the distribution of u (in our case, just a single parameter σ_u² in (2.13)). A less visible yet important distinction of our GLMMNet from the LMMNN (Simchoni and Rosset, 2022) is that our model takes a Bayesian approach to the estimation of random effects (as opposed to an exact likelihood approach in the LMMNN). The Bayesian approach helps sidestep some difficult numerical approximations of the (marginal) likelihood function (as detailed in Section 2.4). We should point out that, although the Bayesian approach is popular for estimating GLMMs (e.g. Bürkner, 2017), there are further computational challenges that prevent it from being applied to ML mixed models; we elaborate on this in Section 2.4.
Following the literature (McCulloch and Searle, 2001; Antonio and Beirlant, 2007), we assume that the random effects follow a normal distribution with zero mean, and that the random effects are independent across categories:

u_j ∼ iid N(0, σ_u²), j = 1, …, q. (2.13)

This is taken as the prior distribution of the random effects u.
In the context of our research problem, we are interested in the predictions for u, which indicate the deviation of individual categories from the population mean, or in other words, the excess risk they carry. Under the Bayesian framework, given the posterior distribution p(u|D), we can simply take the posterior mode û = argmax_u p(u|D) as a point estimate, which is also referred to as the maximum a posteriori (MAP) estimate. Alternatively, we can also derive interval predictions from the posterior distribution to determine whether the deviations captured by u carry statistical significance.
We should note that the estimation of the posterior distribution, which we treated as given above, represents a major challenge in Bayesian inference. We come back to this topic and discuss how we tackle it in Section 2.4.
It should also be pointed out that the extension to multiple categorical features is very straightforward: all we need is to add another random effects layer (and one more variance parameter to estimate for each additional feature).

Response Distribution
As explained earlier, one novel contribution of the GLMMNet is its ability to accommodate many non-Gaussian distributions, notably gamma (commonly used for claim severity), (over-dispersed) Poisson (for claim counts), and Bernoulli (for binary classification tasks). Our extension to non-Gaussian distributions is an important prerequisite for insurance applications.
Precisely, we assume that the responses Y_i, i = 1, 2, …, n, given the random effect u_{j[i]} for the category each belongs to (where j[i] means that the i-th observation belongs to the j-th category), are conditionally independent with a distribution in the ED family. Equation (2.6) implies that the conditional distribution of Y_i | u_{j[i]} is completely specified by the conditional mean µ_i and the dispersion parameter φ. In the GLMMNet, we assume that a link function g(·) connects the conditional mean µ_i with the non-linear predictor η_i formed by adding together the output from the fixed effects module and the random effects layer:

g(µ_i) = η_i = f(x_i) + z_i^⊤u = f(x_i) + u_{j[i]}. (2.14)

In implementation, we use the inverse link g^{−1}(·) as the activation function for the final layer, to map the output f(x_i) + u_{j[i]} from the fixed effects module and the random effects layer to a prediction for the conditional mean response. Finally, as the GLMMNet operates under probabilistic assumptions, the dispersion parameter φ (independent of covariates in our setting) can be estimated under maximum likelihood as part of the network. This is implemented by adding an input-independent but trainable weight to the final layer of the GLMMNet (Chollet et al., 2015), whose value will be updated by (stochastic) gradient descent on the loss function.
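The final-layer computation in (2.14) amounts to one line of arithmetic. The sketch below assumes a log link (inverse link exp, as one would use for a gamma or Poisson response); the value f(x) = 1.5 and the random-effect estimates are hypothetical stand-ins for trained network outputs.

```python
import math

# Sketch of the GLMMNet predictor in (2.14) with a log link:
# mu = g^{-1}(f(x) + u_{j[i]}). All numbers are illustrative.
def predict_mean(fx, u_hat, category_index, inv_link=math.exp):
    return inv_link(fx + u_hat[category_index])

u_hat = [0.2, -0.1, 0.0]            # hypothetical posterior mode estimates
mu = predict_mean(fx=1.5, u_hat=u_hat, category_index=0)   # exp(1.5 + 0.2)
```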

Training of GLMMNet: Variational Inference
In a LMMNN (Simchoni and Rosset, 2022), the network is trained by minimising the negative log-likelihood of y. As the LMMNN assumes a Gaussian density for y|u and u ∼ N(0, Σ), the marginal likelihood of y can be derived analytically, i.e. y ∼ N(f(X), ZΣZ^⊤ + σ²I). The network weights and biases, as well as the variance and covariance parameters σ² and σ_u², are learned by minimising the negative log-likelihood

−ℓ(y) = (1/2) (y − f(X))^⊤ V^{−1} (y − f(X)) + (1/2) log det(V) + (n/2) log(2π),

where V = ZΣZ^⊤ + σ²I and Σ := Cov(u) = σ_u²I. In making the extension to the GLMMNet, however, we no longer obtain a closed-form expression for the marginal likelihood:

p(y; β, σ_u², φ) = ∫ Π_{i=1}^n p(y_i | u_{j[i]}; β, φ) Π_{j=1}^q f(u_j | σ_u²) du, (2.15)

where β denotes the weights and biases in the fixed effects component of the GLMMNet (Section 2.3.1), j[i] indicates the category to which the i-th observation belongs, σ_u² is the variance parameter of the random effects, and φ is the usual dispersion parameter for the ED density for the response.
We take a Bayesian approach to circumvent the difficult numerical approximations required to compute the integral in Equation (2.15). Under the Bayesian framework, π(u_j) = f(u_j | σ_u) = N(0, σ_u²), j = 1, …, q, is taken as the prior for the random effects. Our goal is to make inference on the posterior of the random effects, given by

p(u | D) = π(u) p(D | u) / ∫ π(u) p(D | u) du, (2.16)

where π(u) = Π_{j=1}^q π(u_j) and D represents the data. Note that we again encounter an intractable integral in (2.16). The traditional solution is to use Markov chain Monte Carlo (MCMC) to sample from the posterior without computing its exact form; however, the computational burden of MCMC techniques restricts their applicability to large and complex models like a deep neural network. This partly explains why Bayesian methods, although frequently used for estimating GLMMs (e.g. Bürkner, 2017), tend not to be as widely adopted among ML mixed models (e.g. Sigrist, 2022; Hajjem et al., 2014).
With our GLMMNet, we apply variational inference to solve this problem. Variational inference is a popular approach among ML researchers working with Bayesian neural networks, due to its computational efficiency (Blei et al., 2017; Zhang et al., 2019). Variational inference does not try to directly estimate the posterior, but proposes a surrogate parametric distribution q ∈ Q to approximate it, therefore turning the difficult estimation into a highly efficient optimisation problem. In practice, the choice of the surrogate distribution is often made based on simplicity or computational feasibility. One particularly popular option is to use a mean-field distribution family (Blei et al., 2017), which assumes independence among all latent variables. In this work, we consider a diagonal Gaussian distribution for the surrogate posterior,

q(u) = N(µ_u, Σ_u),

where µ_u is a q-dimensional vector of the surrogate posterior mean of the q random effects, and Σ_u = diag(σ_{u,1}², …, σ_{u,q}²) is a diagonal matrix of the posterior variance. More complex distributions, such as those that incorporate dependencies among latent variables, may provide a more accurate approximation. In this work, however, as we are mainly concerned with estimating the mean and dispersion of the posterior random effects, the chosen diagonal Gaussian surrogate distribution should suffice for this purpose.
The vector µ_u and the diagonal elements of Σ_u together form the variational parameters of the diagonal Gaussian, denoted by ϑ. The task here is to find the variational parameters ϑ* that make the approximating density q_ϑ ∈ Q closest to the true posterior, where closeness is defined in the sense of minimising the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951). In mathematical terms, this means

ϑ* = argmin_ϑ KL(q_ϑ(u) ‖ p(u | D)) (2.17)
   = argmin_ϑ { E_{q_ϑ}[log q_ϑ(u) − log π(u)] − E_{q_ϑ}[log p(D | u)] }, (2.20)

where the second line follows because the (intractable) evidence log p(D) does not depend on ϑ. We take the term inside the outermost bracket, which is called the evidence lower bound (ELBO) loss or the negative of the variational free energy in Neal and Hinton (1998), to be the loss function for the GLMMNet:

L(ϑ) = E_{q_ϑ}[log q_ϑ(u) − log π(u)] − E_{q_ϑ}[log p(D | u)]. (2.22)

We see that the loss function used to optimise the GLMMNet is a sum of two parts: a first part that captures the divergence between the (surrogate) posterior and the prior, and a second part that captures the likelihood of observing the given data under the posterior. This decomposition echoes the trade-off between the prior beliefs and the data in any Bayesian analysis (Blei et al., 2017). The loss function in (2.22) can be calculated via Monte Carlo, that is, by generating posterior samples under the current estimates of the variational parameters ϑ, evaluating the respective expressions at each of the sampled values, and taking the average to approximate the expectation. The derivatives of the loss function can be computed via automatic differentiation (Chollet et al., 2015) and used by standard gradient descent algorithms for optimisation (e.g. Kingma and Ba, 2014).
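The Monte Carlo evaluation of the loss can be sketched for a toy one-effect model: a Gaussian prior, a diagonal (here one-dimensional) Gaussian surrogate, and, purely for illustration, a Gaussian likelihood in place of the general ED density. All numbers are made up.

```python
import math, random

# Monte Carlo estimate of the loss in (2.22) for a toy model:
# prior u ~ N(0, sigma_u^2), surrogate q(u) = N(mu_q, var_q),
# likelihood y | u ~ N(u, 1). In the GLMMNet the likelihood term
# would be the chosen ED density instead.
random.seed(1)

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def neg_elbo(y_obs, mu_q, var_q, sigma2_u, n_samples=2000):
    total = 0.0
    for _ in range(n_samples):
        u = random.gauss(mu_q, math.sqrt(var_q))          # sample from surrogate
        kl_term = log_normal_pdf(u, mu_q, var_q) - log_normal_pdf(u, 0.0, sigma2_u)
        log_lik = sum(log_normal_pdf(y, u, 1.0) for y in y_obs)
        total += kl_term - log_lik                        # divergence minus likelihood
    return total / n_samples

loss = neg_elbo([0.9, 1.1, 1.0], mu_q=1.0, var_q=0.05, sigma2_u=1.0)
```

A surrogate centred near the data yields a smaller loss than one placed far away, reflecting the prior-versus-data trade-off described above.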

Prediction from GLMMNet
Recall that x denotes the standard input features and z ∈ R^q is a binary representation of the high-cardinality feature. Let (x*, z*) represent a new data observation for which a prediction should be made. Note that when making predictions, it is possible to encounter a category that was never seen during training, in which case z* will be a zero vector (as it does not belong to any of the already-seen categories) and the predictor in Equation (2.14) reduces to η* = f(x*).
To make predictions from the GLMMNet, we take expectations under the posterior distribution on the random effects (Blundell et al., 2015; Jospin et al., 2022):

p(y* | x*, z*) = ∫ p(y* | x*, z*, u) q_ϑ*(u) du = E_{q_ϑ*}[p(y* | x*, z*, u)], (2.23)

where q_ϑ*(·) denotes the optimised surrogate posterior. As a result of Equation (2.23), any functional of y* | x*, z*, e.g. the expectation, can be computed in an analogous way. The expectation in (2.23) can be approximated by drawing Monte Carlo samples from the (surrogate) posterior, as outlined in Algorithm 1. When there are multiple high-cardinality features, each will have its own binary vector representation and optimised surrogate posterior, and the expectation in Equation (2.23) will be taken with respect to the joint posterior (which is simply a product of the surrogate posteriors when we assume independence between these features).
It is also worth pointing out that the model averaging in line 5 has the equivalent effect of training an ensemble of networks, which is usually otherwise computationally impractical (Blundell et al., 2015).

Algorithm 1: Prediction from GLMMNet
Input: a new data observation (x*, z*); number of Monte Carlo iterations N.
Output: predictive distribution p(y*|x*, z*) ≈ (1/N) Σ_{m=1}^{N} ED(µ*^{(m)}, φ), where µ*^{(m)} is the conditional mean evaluated at the m-th posterior sample of the random effects and φ is the dispersion parameter for the ED distribution (learned in the GLMMNet).
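The Monte Carlo averaging in Algorithm 1 can be sketched as follows. This is an illustrative reimplementation, not the paper's code: it assumes a diagonal-Gaussian surrogate posterior with parameters `mu_q`, `sigma_q`, a fitted fixed-effects network `f` and an inverse link `inv_link`, all hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_mean(x_star, z_star, f, mu_q, sigma_q, inv_link, n_mc=1000):
    # Monte Carlo approximation of E[y* | x*, z*]: draw random effects from
    # the optimised surrogate posterior and average g^{-1}( f(x*) + z*' u )
    eta_fixed = f(x_star)
    u_draws = mu_q + sigma_q * rng.standard_normal((n_mc, mu_q.size))
    # if z_star is a zero vector (unseen category), the draws contribute
    # nothing and the prediction reduces to inv_link(f(x*))
    return float(np.mean(inv_link(eta_fixed + u_draws @ z_star)))
```

Averaging over posterior draws in this way has the equivalent effect of an ensemble of networks, as noted above.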

Comparison of GLMMNet with Other Leading Models
In this section, we compare the performance of the GLMMNet against the most popular existing alternatives for handling high-cardinality categorical features. We list the candidate models in Section 3.1 and the performance metrics we use to evaluate the models in Section 3.2. Control over the true data generating process via simulation allows us to highlight the respective strengths of the models. We challenge the models under environments of varying levels of complexity, e.g. low signal-to-noise ratio, highly imbalanced distribution of categories, and/or skewed response distributions. The simulation environments detailed in Section 3.3 are designed to mimic these typical characteristics of insurance data. Section 3.4 summarises the results. Ensuing insights are used to inform the real data case study in Section 4.

Models for Comparison
Listed below are the leading models widely used in practice that allow for high-cardinality categorical features. They will be implemented and used for comparison with the GLMMNet in this experiment.
1. GLM, under three treatments of the categorical feature:
   - GLM ignore cat: complete pooling, i.e. ignoring the categorical feature altogether. In the notation of Section 2.1, it can be represented as µ(x, z) = g⁻¹(β₀ + x⊤β), where g(•) is the link function and β₀, β are the regression parameters to be estimated;
   - GLM one hot: one-hot encoding, which can be represented as µ(x, z) = g⁻¹(β₀ + x⊤β + z⊤α), where α are the additional parameters for each category in z;
   - GLM GLMM enc: GLMM encoding, which can be represented as µ(x, z) = g⁻¹(β₀ + x⊤β + z′α), where z′ represents the encoded value (a scalar) of z and α the additional parameter associated with it (more details in Appendix B.1);
2. Gradient boosting machine (GBM, with trees as base learners), under the same categorical encoding setups as above:
   - GBM ignore cat: µ(x, z) = µ(x), where µ(•) is a weighted sum of tree learners;
   - GBM one hot: µ(x, z), where µ(•, •) is a weighted sum of tree learners;
   - GBM GLMM enc: µ(x, z) = µ(x, z′), where z′ represents the encoded value (a scalar) of z;
3. Neural network with entity embeddings (NN ee). Here, µ(x, z) is composed of multiple layers of interconnected artificial neurons, where each neuron receives inputs, performs a weighted sum of these inputs, and applies an activation function to produce its output (see also Section 2.3.1). Entity embeddings add a linear layer between each one-hot encoded input and the first hidden layer and are learned as part of the training process. This is currently the most popular approach for handling categorical inputs in deep learning models and serves as our target benchmark;
4. GBM with pre-learned entity embeddings (GBM ee), which is similar to GBM GLMM enc except with z′ replaced by a vector of entity embeddings learned from NN ee;
5. Mixed models:
   - GLMM, with µ(x, z) = g⁻¹(β₀ + x⊤β + z⊤u), where u are the random effects (Section 2.2);
   - GPBoost (Sigrist, 2021, 2022), with µ(x, z) = g⁻¹(f(x) + z⊤u), where f(•) is an ensemble of tree learners;
   - The proposed GLMMNet, with µ(x, z) = g⁻¹(f(x) + z⊤u), where f(•) is an FFNN (Section 2.3.1).

Each model has its own features and properties, which renders the collection of models suitable for different contexts; Table 3.1 gives a high-level summary of their different characteristics. In the table, "interpretability" refers to the ease with which a human can understand a model, as well as the ability to derive insights from it. For instance, GLMs have high interpretability, because GLMs have a small number of parameters and each parameter carries a physical meaning, in contrast with a standard deep neural network characterised by a large number of parameters that are meaningless on their own. "Operational complexity" refers to the resources and time required to implement, train and fine-tune the model. Based on this qualitative comparison alone, the GLMMNet has good potential to improve on neural networks with entity embeddings, because it can similarly capture complex relationships while also offering a wider range of response distributions and better interpretability. However, further investigation is needed to understand and illustrate in which circumstances the GLMMNet approach can outperform the other approaches, which is the focus of the rest of this section.
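For concreteness, the GLMM encoding referenced above can be sketched as follows. The exact construction is in Appendix B.1 (not reproduced here); this sketch shows the standard random-intercept best linear unbiased predictor (BLUP) for a Gaussian response, with the variance components `sigma2_u` and `sigma2_e` assumed known for illustration.

```python
import numpy as np

def glmm_encode(categories, y, sigma2_u, sigma2_e):
    # Random-intercept GLMM encoding for a Gaussian response: each category
    # is replaced by the BLUP of its random intercept, i.e. the category mean
    # deviation shrunk towards zero, with heavier shrinkage for sparsely
    # observed categories.
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    grand_mean = y.mean()
    encoding = {}
    for c in np.unique(categories):
        y_c = y[categories == c]
        shrinkage = sigma2_u / (sigma2_u + sigma2_e / len(y_c))
        encoding[c] = shrinkage * (y_c.mean() - grand_mean)
    return encoding
```

The shrinkage factor goes to 1 as a category accumulates observations, so well-observed categories keep (almost) their raw mean deviation while rare categories are pulled towards the population mean.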

Model Evaluation Criteria
For all experiments presented in this paper (including the real data case study in Section 4), we randomly split the data into a training and a test set. The training set is used to select hyperparameters (by further splitting into an inner-training and a validation set) and fit the pool of candidate models. The different models are then evaluated and compared based on their performance on the test dataset. Specifically, to quantify the predictive performance, we consider the metrics listed below.

- Accuracy of point predictions. We report the root mean squared error (RMSE) and the mean absolute error (MAE), which are respectively given by

  RMSE = √( (1/n*) Σ_{i=1}^{n*} (y*_i − ŷ*_i)² )  and  MAE = (1/n*) Σ_{i=1}^{n*} |y*_i − ŷ*_i|,

  where n* is the number of test observations, y*_i is the target output and ŷ*_i is the (point) prediction of the mean response.
- Average accuracy of point predictions per category. In order to gain insights into the category-specific accuracy of point predictions, we consider the RMSE of the average prediction for each category:

  RMSE_avg = √( (1/q) Σ_{j=1}^{q} ( ȳ*_j − ŷ̄*_j )² ),

  where q is the number of categories, n*_j is the number of test observations in the j-th category, and ȳ*_j and ŷ̄*_j respectively denote the average of the response variable y and of its prediction ŷ over all observations in the j-th category. This metric serves to measure and compare how well a model is able to capture the between-category differences. It can be interpreted as, for example, the average accuracy of loss predictions on each sub-portfolio, which has practical significance.
- Accuracy of probabilistic predictions. The quantification of probabilistic accuracy is especially relevant to actuarial applications (Embrechts and Wüthrich, 2022). In this work, we consider two proper scoring rules, the continuous ranked probability score ("CRPS") and the negative log-likelihood ("NLL"), both of which are commonly used for assessing probabilistic forecasts (Gneiting and Raftery, 2007; Al-Mudafer et al., 2022; Delong et al., 2021). The CRPS is defined as

  CRPS = (1/n*) Σ_{i=1}^{n*} crps(F̂_i, y_i),  where  crps(F̂_i, y_i) = ∫_{−∞}^{∞} ( F̂_i(t) − 1{y_i ≤ t} )² dt,    (3.1)

  and where F̂_i(•) represents the distribution function (df) of a probabilistic forecaster for the i-th response value and y_i represents its observed value. The CRPS is a quadratic measure of the difference between the forecast predictive df and the empirical df of the observation; hence, the smaller it is, the better. The integral in (3.1) has been evaluated analytically and implemented in R for many distributions; we refer to Jordan et al. (2019) for details.
  Computation of the NLL is more straightforward:

  NLL = −(1/n*) Σ_{i=1}^{n*} log f̂_i(y_i),    (3.2)

  where f̂_i(•) is the density of the probabilistic forecaster. For models that do not produce a distributional forecast, we construct an artificial predictive distribution by using the point forecast as the mean and estimating a dispersion parameter for the assumed distribution (see Section 4.6 of Denuit et al., 2019).
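The metrics above can be sketched in Python as follows. The Gaussian closed-form CRPS is one of the analytical cases mentioned (the paper relies on the R implementation of Jordan et al., 2019); all function names are our own illustrative choices.

```python
import math
import numpy as np

def rmse(y, y_hat):
    # Root mean squared error of point predictions
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mae(y, y_hat):
    # Mean absolute error of point predictions
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def rmse_avg(y, y_hat, categories):
    # RMSE of per-category average predictions: compares the mean response
    # with the mean prediction within each category
    y, y_hat, categories = map(np.asarray, (y, y_hat, categories))
    diffs = [y[categories == c].mean() - y_hat[categories == c].mean()
             for c in np.unique(categories)]
    return float(np.sqrt(np.mean(np.square(diffs))))

def crps_gaussian(mu, sigma, y):
    # Closed-form CRPS for a Gaussian predictive distribution N(mu, sigma^2)
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def nll_gaussian(mu, sigma, y):
    # Negative log-likelihood of y under N(mu, sigma^2)
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)
```

Both probabilistic scores are averaged over the test set in the same way as (3.1) and (3.2).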

Simulation Datasets
We generate n observations from a (generalised) non-linear mixed model with q categories (q ≤ n). We assume that the responses y_i, i = 1, …, n, each belonging to a category j[i] ∈ {1, …, q}, given the random effects u = (u_j)_{j=1}^{q}, are conditionally independent with a distribution in the ED family (e.g. gamma), denoted as y_i | u_{j[i]} ∼ ED(µ_i, φ), where µ_i = E(y_i | u_{j[i]}) represents the mean and φ represents a dispersion parameter shared across all observations. A link function g(•) connects the conditional mean µ_i with a non-linear predictor:

g(µ_i) = f(x_i) + z_i⊤ u,    (3.3)

where x_i is a vector of fixed effects covariates for the i-th observation, z_i is a vector of random effects variables (in our case, a binary vector representation of a high-cardinality categorical feature), the random effects u_j ~iid N(0, σ²_u) measure the deviation of individual categories from the population mean, and f(•) is some complex non-linear function of the fixed effects.
Here, we consider

f(x) = 10 sin(π x₁ x₂) + 20 (x₃ − 0.5)² + 10 x₄ + 5 x₅,    (3.4)

which was studied in Friedman (1991) and has become a standard simulation function for regression models (used in e.g. Kuss et al., 2005; also available in the scikit-learn Python library by Pedregosa et al., 2011). In our application, we use the Friedman function to test the model's ability to filter out irrelevant predictors (note that x₆, …, x₁₀ are not used in f) as well as to detect interactions and non-linearities in the input features, all in the presence of a high-cardinality grouping variable that is subject to random effects. We set up the desired simulation environments by adjusting the following parameters; refer to Appendix B.4 for details.
- Signal-to-noise ratio: a three-dimensional vector that captures the relative ratio of signal strength (as measured by µ_f, the mean of f(x)), random effects variance (σ²_u), and variability of the response (σ²; noise, or equivalently the irreducible error, as this component captures the unexplained inherent randomness in the response).
- Response distribution: distributional assumption for the response, e.g. Gaussian, gamma, or any other member of the ED family.
- Inverse link: inverse of the link function, i.e. the inverse function of g(•) in Equation (3.3).
- Distribution of categories: whether to use a balanced or skewed distribution for the allocation of categories. A "balanced" distribution allocates an approximately equal number of observations to each category; a "skewed" distribution generates categories from a (scaled) beta distribution; see Appendix B.4.

Table 3.2 lists the simulation environments considered in our experiment. Each of the environments in experiments 2-5 has been configured to mimic one specific characteristic of practical insurance data, such as a low signal-to-noise ratio (the specification of [8, 1, 4] was selected based on estimated parameters from the real insurance case study in Section 4), an imbalanced distribution across categories, or a highly skewed response distribution for claim severity. We give a brief description of each environment below.
-Experiment 1 simulates the base scenario that adheres to the assumptions of a Gaussian GLMMNet.
- Experiment 2 simulates a gamma-distributed response, which is often used to model claim severity in lines of business (LoB) such as auto insurance or general liability.
- Experiment 3 simulates a skewed distribution of categories, which is a common characteristic of high-cardinality categorical features (e.g. car make, injury code).
- Experiments 4-5 incrementally increase the level of noise in the data and simulate LoBs that are difficult to model by covariates, such as commercial building insurance, catastrophic events and cyber risks.
- Experiment 6 represents the most challenging scenario, incorporating the complexities of all the other experiments.
For each scenario, we generate 5,000 training observations and 2,500 testing observations.
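The data generating process just described can be sketched as follows, for a Gaussian response with identity link and balanced categories. The default parameter values below are placeholders; the actual configurations are those of Table 3.2 and Appendix B.4.

```python
import numpy as np

rng = np.random.default_rng(42)

def friedman(X):
    # Friedman (1991) test function of Equation (3.4); features x6, ..., x10
    # do not enter f and are irrelevant by design
    return (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20.0 * (X[:, 2] - 0.5) ** 2 + 10.0 * X[:, 3] + 5.0 * X[:, 4])

def simulate(n=5000, q=100, sigma_u=1.0, sigma_e=2.0):
    # Gaussian response with identity link and a random intercept for a
    # high-cardinality categorical feature with q balanced categories
    X = rng.uniform(size=(n, 10))
    cat = rng.integers(0, q, size=n)        # balanced allocation of categories
    u = rng.normal(0.0, sigma_u, size=q)    # random effects u_j ~ N(0, sigma_u^2)
    y = friedman(X) + u[cat] + rng.normal(0.0, sigma_e, size=n)
    return X, cat, y
```

Swapping the Gaussian sampling of `y` for a gamma draw with mean `exp(friedman(X) + u[cat])` would give the non-Gaussian, log-link scenarios of experiments 2 and 6.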

Results
In all scenarios studied, the GLMMNet outperforms or at least performs on par with the target benchmark model of an entity embedded neural network. In low- to medium-noise environments, the GLMMNet usually outperforms the second best model (i.e. the entity embedded neural network) by a significant margin. The two outperform all the other mixed models, including GLMM and GPBoost.
Figure 3.1 compares the out-of-sample performance of the candidate models (listed in Section 3.1) over 50 simulation runs (of 5,000 training and 2,500 testing observations each) under a base scenario (experiment 1 in Table 3.2); the proposed GLMMNet is highlighted in green. For all metrics considered, smaller values mean a better fit.
From Figure 3.1, we see that GLMMNet is leading in all metrics. In particular, on average it performs better than the current state-of-the-art neural network with entity embeddings (NN ee), which is also the next best model. A paired sample Wilcoxon signed rank test on the out-of-sample MAE rejects the null hypothesis that GLMMNet and NN ee perform the same, with strong evidence (p < 0.001), in favour of the alternative that GLMMNet produces a smaller error. The same conclusion holds for all metrics tested. Indeed, upon closer examination of the paired sample data, we find that for every single simulation run, the performance of GLMMNet surpasses that of NN ee. This suggests that optimising the networks via the (more suitable) mixed model likelihood provides a significant benefit to their predictive performance, in terms of both point predictions and (more notably) probabilistic predictions.
Ignoring GLMMNet, the best performing models in panels (a)-(c) are NN ee and GBM GLMM enc (a boosting model with GLMM encoding). Both models significantly outperform the linear family of models (GLM or GLMM), showing that a flexible model structure is required to capture the non-linearities in the data.
More importantly, without a suitable way of accounting for the high-cardinality categorical feature, a flexible model structure alone is not enough to achieve good performance. For example, the struggle of tree-based models in dealing with categorical features has been well documented (Prokhorenkova et al., 2018). In this experiment, with the usual one-hot encoding, the GBM model (GBM one hot) performs even worse than its GLM counterpart (GLM one hot), although the reverse is true when both are fitted without the categorical variable altogether. This confirms the motivation for this research: more flexible ML models, when dealing with financial or insurance data, can often be constrained by their capability to model categorical variables with many levels.
The ranking of models in Figure 3.1(d), which is designed to compare the models' ability to explain between-category variations, is slightly different. GLM one hot and GLMM perform better, moving up in the ranking to join the (consistently) top-performing GLMMNet and NN ee. This is a result of the balance property (for GLMs with canonical links), which ensures that the sum of fitted values must equal exactly the sum of observed values for any level of the categorical features in the training set (Wüthrich, 2021). In this case, the training RMSE avg will be zero by definition. Remarkably, this quality often carries over to an out-of-sample set as well, as demonstrated by the performance of GLM one hot and GLMM here.
Figure 3.2 gives a more insightful picture of per-category predictive accuracy by plotting the densities of predictions from selected models against the true underlying density that generated the observations (leftmost). The plot shows a selection of representative categories (top three with the highest response values, two in the middle, and bottom three with the lowest), but the conclusions drawn here hold generally for other categories too. The colours roughly correspond to the range of values under which the predictions fall; darker colours represent smaller values (relative to an average observation), and lighter represent larger values. Ideally, the predicted density shapes should match as closely as possible to the true densities, and the colours should also agree as much as possible.
In line with the previous observations, GLMMNet produces very accurate density estimates and models the variation in the data extremely well. NN ee, on the other hand, clearly overestimates the response for Category 9, likely due to overfitting the training data points. Unsurprisingly, ignoring the categorical variable (GLM ignore cat) almost eliminates the between-category differences and leads to very poor estimates. When equipped with one-hot encoding, the GLM (GLM one hot) is able to sense the differences in the mean (as indicated by the colours), as the balanced distribution of categories provides sufficient data points for learning the between-category differences. However, GLM one hot tends to produce narrower densities than the truth. This indicates a failure to capture the remaining, or within-category, variation in the data, confirming that its linear structure is too restrictive for this dataset.
The results from experiments 2-3 are mainly consistent with what we have seen in experiment 1 (see Appendix C.1). GLMMNet takes a strong lead in both cases and showcases a statistically significant advantage over its closest competitor NN ee in all cases (p < 0.01 from the corresponding Wilcoxon signed rank tests).
The result in experiment 2 (gamma-distributed response) is well within expectation; the GLMMNet by design takes good care of the response distribution via its loss function. The result in experiment 3 (skewed distribution of categories) further suggests that the predictive strength of the GLMMNet is, at least to some degree, immune to the shape of the categorical distribution.
As the noise level increases (experiments 4-6), the differences in model performance become smaller, and the predictive advantage of the GLMMNet starts to fade away (see Appendix C.2). In experiments 5-6, the GBM family of models surpasses both GLMMNet and NN ee in CRPS measures. One possible explanation is that the variance of the random effects here is too small with respect to the error variance to warrant the need for accurately modelling the categorical variable. Although GLMMNet may not always provide a boost to the overall predictive performance, it remains one of the top performers in terms of its ability to capture the individual category means (as measured by RMSE avg). The latter may be of practical significance in many applications.
The outperformance of GBM over deep learning models as observed in experiments 5 and 6 was already discussed in the prior literature, see, e.g., Borisov et al. (2021), though it remains an open question what causes this gap in performance. It is possible that neural networks are more likely to overfit to the high noise in the background, or that they may get stuck in a suboptimal local minimum and cannot gain enough momentum from the data to escape. In general, the high level of noise in the environment prevents the network-based models (including the proposed GLMMNet) from fully realising their predictive power. Further investigations suggest that adding regularisation helps improve the overall performance of the GLMMNet to a great extent, as shown in panel (a) of Figure 3.3, though it seems to bias the point predictions (panel b).
Contrasting experiments 5 and 6 shows that the GLMMNet ranks higher when the response moves away from Gaussian (see Figure C.2 in Appendix C.2). The predictive power that the GLMMNet regains in experiment 6 is likely a result of the change in the true distribution of the response. When the shape of the response looks less normal, the outperformance of GBM models over the GLMMNet becomes less distinctive. This comparison highlights the advantage of the GLMMNet in using a loss function that is better aligned with the distribution of the response. We expect this effect to be more pronounced when we work with heavier-tailed distributions that deviate even more from the normal.
Across all experiments, we find that the GLMMNet, NN ee, and GBM GLMM enc consistently outperform the other models in terms of predictive performance. In particular, we find that our GLMMNet outperforms or at least performs on par with the target benchmark model of an entity embedded neural network in all scenarios studied. That said, each model has its own strengths and limitations, and the choice of which model to use will depend on the specific needs of the modelling problem. Table 3.3 lists some of these considerations. We should also note that GBM ee has similar properties to GBM GLMM enc and is therefore not included in the table for brevity of presentation (except that GBM ee is more complicated to implement, as it requires training an entity-embedded neural network beforehand).

NN ee
• Strong predictive performance, particularly in low-medium noise environments

GLMMNet
• Consistently outstanding predictive performance, particularly in low-medium noise environments and/or when the response deviates from the Gaussian distribution
• Transparency on the category effects (through random effects predictions)
• Offers naturally probabilistic estimates
• Compromised performance in high-noise environments when used without additional regularisation
• Limited interpretability of the fixed effects component

Table 3.3: Model comparison: strengths and limitations of the top performing models

Application to Real Insurance Data
In this section, we apply the GLMMNet to a real (proprietary) insurance dataset (described in Section 4.1), and discuss its performance relative to the other models we have considered (Section 4.2). We also demonstrate, in Section 4.3, how such an analysis can potentially inform practitioners in the context of technical pricing.

Description of Data
The data for this analysis were provided by a major Australian insurer. The original data cover 27,351 commercial building and contents reported claims by small and medium-sized enterprises (SME) over the period between 2010 and 2015. The analysis seeks to construct a regression model to predict the ultimate costs of the reported claims based on other individual claim characteristics that can be observed early in the claims process (e.g. policy-level details and claim causes). A description of the (post-engineered) input features is given in Appendix B.5.1.
The response variable is the claim amount (claim severity), which has a very skewed distribution, as shown in Figure 4.1: even on a log scale, we can still observe some degree of positive skewness. In search of a suitable distribution to model the severity, we fit both lognormal and loggamma distributions to the (unlogged) marginal. Figure 4.2 indicates that both models may be suitable, with the loggamma being slightly more advantageous. We will test both lognormal and loggamma distributions in our experiments below (Section 4.2); specifically, we fit Gaussian and gamma models to the log-transformed response.

The high-cardinality categorical variable of interest is the occupation of the business, which is coded through the Australian and New Zealand Standard Industrial Classification (ANZSIC) system. The ANZSIC system hierarchically classifies occupations into Divisions, Subdivisions, Groups and Classes (from broadest to finest). See Appendix B.5.2 for an example.
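The two marginal fits to the log-transformed response can be sketched as follows. The paper does not specify its fitting routine; here the Gaussian is fitted to log(y) by maximum likelihood, and, as a simple stand-in for a full likelihood fit, the gamma is fitted to log(y) by the method of moments (assuming the logs are positive).

```python
import numpy as np

def fit_lognormal(y):
    # Lognormal fit: the MLE is the sample mean and (population) standard
    # deviation of log(y)
    log_y = np.log(y)
    return log_y.mean(), log_y.std()

def fit_loggamma(y):
    # Loggamma fit via the method of moments on log(y): gamma shape a = m^2/v
    # and scale s = v/m, where m, v are the sample mean and variance of the logs
    log_y = np.log(y)
    m, v = log_y.mean(), log_y.var()
    return m * m / v, v / m
```

Either fit can then be compared against the empirical distribution (e.g. via a Q-Q plot, as in Figure 4.2) to judge which tail behaviour better matches the claim severities.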
We will look at occupations at the Class level. There are over 300 unique levels of such occupation classes in this data set. Panel (a) of Figure 4.3 highlights the skewed distribution of occupations: the most common occupation has more than double the number of observations of the second most common, and the number of observations decays rapidly for the rarer classes. Panel (b) shows the heterogeneity in claims experience between different occupation classes; calculations reveal a coefficient of variation of 160% for the occupation means. One challenge in modelling the occupation variable lies in determining how much confidence we can place in the observed claims experience; intuitively, we should trust the estimates more when there are more data points, and less when there are few. As explained in Section 2.2, mixed effects models are one solution to this problem, which we explore further in this section.

The dataset under consideration displays many of the typical characteristics of insurance data that we studied in Section 3 (see Section 3.3 in particular). For example, it is characterised by an extremely skewed response distribution (Figure 4.1), a low signal-to-noise ratio (there is little information that covariates can add to the marginal fits, as implied by Figure 4.2) and a highly disproportionate distribution of categories (Figure 4.3). These features of the dataset best align with those of Experiment 6 in Section 3. If the conclusions generalise, we expect that a regularised version of our GLMMNet will show strong performance.

Modelling Results
We consider the same candidate models as before; see Section 3.1. Having noted the underperformance of the GLMMNet in the presence of high noise and the benefit of adding regularisation to it, we also consider an ℓ2-regularised version of the GLMMNet (GLMMNet l2). We follow the learning pipeline described at the start of Section 3.2 and split the data into a 90% training set and a 10% test set. Table 4.1 presents the out-of-sample metrics for the top performing lognormal and loggamma models respectively, compared against a one-hot encoded GLM as a baseline (full results in Appendix C.3). Note that we have selected slightly different metrics from the simulation experiments to better accommodate the heavier-tailed response. The RMSE, for example, is no longer a suitable measure as it penalises very heavily predictions that are far from the observed values, which can happen frequently when there are a few extreme observations in the data. The three metrics tell consistent stories most of the time. The results for CRPS and NLL are almost perfectly aligned, which is not surprising given that both measure how far the observed values depart from the probabilistic predictions. In general, we consider the CRPS and NLL measures to be more reliable estimates of the goodness-of-fit of the models than the MedAE, which only takes into account the point predictions but not the uncertainty associated with them.
The results mostly match our expectations. Among the lognormal models, the regularised GLMMNet l2 takes the lead, followed by GBM GLMM enc, NN ee, GBM ee and GLMM, whose performances are so close together that it is impossible to distinguish between the four. The GLM family of models performs poorly on this data, much worse than their mixed model counterpart, i.e. the GLMM, confirming that the presence of the high-cardinality categorical variable interferes with their modelling capabilities. Among the loggamma models, although NN ee outperforms GLMMNet l2 in CRPS, GLMMNet l2 remains the most competitive model in terms of NLL (on par with GLMM). One practical consideration to note is that when the environment is highly noisy and the underlying model is uncertain, as is the case here, the interpretability of the model becomes even more important. In such cases, more interpretable models are preferred, which in turn highlights the advantages of the GLMMNet over NN ee.
An important observation is that adding the ℓ2 regularisation term clearly helps the GLMMNet navigate the data better. In the lognormal models, the regularised network represents a 1.5% improvement in the CRPS and NLL over the original GLMMNet, which suggests that the original GLMMNet overfits to the training data (as can be confirmed from its deterioration in test performance). Indeed, the improvement in the score is statistically significant (p < 0.001) under the Diebold-Mariano test (Gneiting and Katzfuss, 2014). The results for the loggamma models lead to similar conclusions.
Comparing the lognormal and loggamma models, we find that assuming a loggamma distribution for the response improves the model fit in almost all instances (except the GLM models) and across all three performance metrics. While the CRPS and NLL values are reasonably similar between the two tables, the median absolute error (MedAE) halves when moving from lognormal to loggamma. An intuitive explanation is that the lognormal models struggle more with fitting the extreme claims and thus tend to over-predict the middle-ranged values in an attempt to boost the predictions. The loggamma models, on the other hand, are more comfortable with the extremes, as those can be captured reasonably well by the long tail. These results confirm the need to consider alternative distributions beyond the Gaussian, one major motivation for our proposed extension of the LMMNN to the GLMMNet. Even within the Gaussian framework, as demonstrated by the lognormal models presented here, the GLMMNet can still be useful.
Overall, as we have seen in the simulation experiments, the differences in model performance are relatively small due to the low signal-to-noise ratio. Only a limited number of risk factors were observed, and they are not necessarily the true drivers of the claims cost. In the experiments above, a simpler structure like the GLMM sometimes performs comparably with the more complex and theoretically more capable GLMMNet. That said, the results clearly demonstrate that the GLMMNet is a promising model in terms of predictive performance. We expect to see better results with less noisy data or a greater volume of data, e.g. when more information becomes available as a policy progresses and claims develop, reducing the amount of noise in the system. This has been demonstrated in the simulation experiments (Section 3.4).

GLMMNet: Decomposing Random Effects Per Individual Category
One main advantage of mixed effects models in dealing with high-cardinality categorical features is the transparency they offer on the effects of individual categories. This is particularly important as the high-cardinality categorical feature is often itself a key risk factor. For example, in this data set, there is a lot of heterogeneity in claims experience across different occupations, but the lack of data for some occupations makes it hard to determine how much trust can be given to that experience.
The GLMMNet can provide useful insights in this regard. Figure 4.4 shows the posterior predictions for the effects of individual occupations (ordered by decreasing z-scores), plotted with 95% confidence intervals. Unsurprisingly, occupations with a larger number of observations generally have tighter intervals (i.e. lower standard deviations); see Figure C.4 of Appendix C.3. This prevents us from over-trusting extreme estimates, which may arise simply due to small sample sizes. In the context of technical pricing, such an analysis can be helpful for identifying occupation classes that are statistically more or less risky. This is not possible with neural networks with entity embeddings, which often perform no better than the GLMMNet. With entity embeddings, we can analyse the embedding vectors to identify occupations that are "similar" to each other, in the sense of producing similar outputs, but we cannot go any further.
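Under the diagonal-Gaussian surrogate posterior, the quantities displayed in Figure 4.4 follow directly from the optimised variational parameters. A sketch (our own helper function, using 1.96 as the 95% normal quantile):

```python
import numpy as np

def random_effect_summary(mu_q, sigma_q, z_crit=1.96):
    # z-scores and 95% intervals for each category's random effect, computed
    # from the optimised variational parameters of the surrogate posterior
    z = mu_q / sigma_q
    lower = mu_q - z_crit * sigma_q
    upper = mu_q + z_crit * sigma_q
    significant = (lower > 0) | (upper < 0)   # interval excludes zero
    order = np.argsort(-z)                    # decreasing z-score, as in Figure 4.4
    return z[order], lower[order], upper[order], significant[order]
```

Categories whose interval excludes zero correspond to the highlighted occupations in Figure 4.4, i.e. those with claims experience credibly better or worse than the portfolio average.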

Conclusions
High-cardinality categorical features are often encountered in real-world insurance applications. The high dimensionality creates a significant challenge for modelling. As discussed in Section 1.1, the existing techniques prove inadequate in addressing this issue. Recent advancements in ML modelling hold promise for the development of more effective approaches.
In this paper, we took inspiration from the latest developments in the ML literature and proposed a novel model architecture called GLMMNet that targets high-cardinality categorical features in insurance data.The proposed GLMMNet fuses neural networks with GLMMs to enjoy both the predictive power of deep learning models and the transparency and statistical strength of GLMMs.
GLMMNet is an extension to the LMMNN proposed by Simchoni and Rosset (2022). The LMMNN is limited to the case of a Gaussian response with an identity link function. GLMMNet, on the other hand, allows for the entire class of ED family distributions, which are better suited to the modelling of the skewed distributions that we see in insurance and financial data.
Importantly, in making this extension, we have made several modifications to the original LMMNN architecture, including:
- The removal of a non-linear transformation on the high-cardinality categorical features Z. This is to ensure the interpretability of the random effect predictions, which carries practical significance (as discussed in Section 4.3).
-The shift from optimising the exact likelihood to a variational inference approach. As discussed in Section 2.4, the integral in the likelihood function does not generally have an analytical solution (with only a few exceptions, the Gaussian case being one of them). The shift to variational inference circumvents complicated numerical approximations to the integral. This opens the door to a flexible range of alternative response distributions, including the Bernoulli, Poisson, Gamma, and any other member of the ED family. This flexibility makes the GLMMNet widely applicable to many insurance contexts (and beyond).
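The variational objective behind the second modification can be sketched as a negative ELBO: a Monte Carlo estimate of the expected conditional log-likelihood (via the reparametrisation trick) plus a closed-form KL divergence between the Gaussian variational posterior and the Gaussian prior. The sketch below uses a Poisson response with log link for concreteness; all function names are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gaussian(m, s, prior_sd=1.0):
    """KL( N(m, s^2) || N(0, prior_sd^2) ), elementwise, summed over categories."""
    return np.sum(np.log(prior_sd / s) + (s**2 + m**2) / (2 * prior_sd**2) - 0.5)

def neg_elbo_poisson(y, fixed_eta, group, m, s, n_samples=100):
    """Monte Carlo estimate of the negative ELBO for a Poisson response
    with log link: eta_i = f(x_i) + u_{j[i]}, with variational posterior
    u_j ~ N(m_j, s_j^2).  Illustrative sketch only."""
    eps = rng.standard_normal((n_samples, len(m)))
    u = m + s * eps                              # reparametrisation trick
    eta = fixed_eta[None, :] + u[:, group]       # per-sample linear predictor
    mu = np.exp(eta)                             # inverse log link
    loglik = y[None, :] * eta - mu               # Poisson log-density (constants dropped)
    return -loglik.sum(axis=1).mean() + kl_gaussian(m, s)
```

Swapping the Poisson log-density for that of a Bernoulli or Gamma response changes only the `loglik` line, which is what makes the approach applicable across the ED family.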
In our experiments, we found that the GLMMNet often performed better than, or at least comparably with, the benchmark entity embedded neural network, which is the most popular approach among actuarial researchers working with neural networks. Although both can be outperformed by boosting models in low signal-to-noise environments, adding regularisation to the GLMMNet was sufficient to bring it back to the top. The GLMMNet's outperformance of the entity embedded neural network is notable, as it comes with the additional benefits of transparency on the random effects and the ability to produce probabilistic predictions (e.g. uncertainty estimates).
To sum up, the proposed GLMMNet has at least the following advantages over existing approaches:
-Improved predictive performance under most scenarios we have tested;
-Interpretability of the random effects;
-Flexibility in the form of the response distribution;
-Ability to handle large datasets (due to the computationally efficient implementation through variational inference).
All these make the GLMMNet a worthy addition to the actuary's modelling toolbox for handling high-cardinality categorical features.

This research was supported under the Australian Research Council's Discovery Project DP200101859 funding scheme. Melantha Wang acknowledges financial support from the UNSW Australia Business School. The views expressed herein are those of the authors and are not necessarily those of the supporting organisations.
The authors declare that they have no conflicts of interest.

A. GLMMNet Implementation
Below we outline a few practical considerations that arise from the implementation of GLMMNet.

A.1. Fixed Prior
The random effects layer of GLMMNet (Section 2.3.2) requires the specification of a prior distribution. We mentioned that we would use a Gaussian prior, as per common practice with GLMMs, but we did not specify its exact parameters. With GLMMNet, it is possible to have a trainable prior whose parameters are found by gradient descent with respect to the loss function in (2.21); this approach of estimating the prior from the data is known as empirical Bayes (Casella, 1985). We experimented with both fixed and trainable priors in our implementation and found that a trainable prior generally leads to worse performance. Blundell et al. (2015) reached the same conclusion in their experiments with Bayesian neural networks, speculating that with a trainable prior the algorithm is more tempted to update the prior parameters than the posterior. We therefore choose to work with fixed priors.
As to what values to use for the fixed prior, we follow the Stan documentation by Gelman (2020), which recommends weakly informative priors, i.e. priors that give way to the likelihood in the presence of sufficient data but dominate in its absence. The exact prior parameters to use will depend on the data and the task at hand.
Furthermore, we found it important to shrink the values of the σ_j's at the start of the learning process to help guide the algorithm in the right direction. In our implementation, this is achieved by adding a constant multiplier (e.g. 0.01) to the parametrisation of the σ_j's. In the absence of such a constraint, the algorithm tends to return unreasonably large σ_j values, which significantly deteriorates the performance of GLMMNet. It appears that the unguided GLMMNet converges to strange local minima where the model attributes all the variation to the noise.
We note that, in theory, adding a constant multiplier does not modify the solution space and thus should return the same results. Indeed, we verified empirically that, given enough training time, the model eventually found the right range of values for the parameters. The tweak, however, helps the algorithm converge immediately to a better local minimum. One possible explanation is that reducing the size of the gradients on the scale parameters helps the model focus on learning the more important location (mean) parameters of the posterior random effects.
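The constant-multiplier trick can be sketched as follows. The 0.01 multiplier mirrors the constant discussed above; the softplus parametrisation that keeps σ_j positive is an illustrative choice, not necessarily the one used in the implementation.

```python
import numpy as np

def sigma_from_rho(rho, multiplier=0.01):
    """Parametrise sigma_j = multiplier * softplus(rho_j), so that the
    scale parameters start small and their gradients are damped.
    Illustrative sketch of the constant-multiplier trick."""
    return multiplier * np.log1p(np.exp(rho))  # softplus keeps sigma_j > 0
```

Because the multiplier only rescales the unconstrained parameter ρ_j, any value of σ_j remains reachable; the trick changes the optimisation path, not the solution space.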

A.3. Initialisation of trainable parameters
The network weights and biases in GLMMNet are initialised by TensorFlow's default Glorot uniform initialiser (Glorot and Bengio, 2010). The dispersion parameter of the ED family, which is also learned as part of GLMMNet, is initialised with a fixed estimate roughly calibrated to the data at hand. This choice does not seem to have a major impact on GLMMNet's performance, but we still consider it good practice to run some small-scale experiments with the initial value.

B.1. GLMM Encoding
GLMM encoding can be regarded as a simpler and more flexible alternative to the machine learning mixed models, e.g. GPBoost (Sigrist, 2021, 2022) or GLMMNet, which often involve a convoluted estimation procedure. Despite its simplicity, Pargent et al. (2022) found that this encoding scheme performed very well across a range of prediction tasks and a variety of datasets.
In the numerical experiments, we implemented a cross-validated version of GLMM encoding, as presented in Algorithm 2. The encoded values ẑ then take the place of the original categorical feature in any subsequent ML model, e.g. a gradient boosting machine.
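A minimal sketch of the cross-validated encoding, in the spirit of Algorithm 2, is given below. Assumptions to note: it specialises to a Gaussian response with identity link, for which the posterior mean of a random intercept reduces to a shrunken category mean; `sigma2_ratio` plays the role of σ²/σ_u², and the random fold assignment only approximates an equal-size partition.

```python
import numpy as np

def glmm_encode_cv(y, z, K=5, sigma2_ratio=1.0, seed=0):
    """Cross-validated GLMM encoding (simplified Gaussian-response sketch).
    Each observation is encoded using a random intercept model fitted on
    the other K-1 folds; unseen categories fall back to the intercept."""
    y, z = np.asarray(y, dtype=float), np.asarray(z)
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, K, size=n)          # random K-fold assignment
    encoded = np.empty(n)
    for k in range(K):
        train = fold != k
        beta0 = y[train].mean()                # fixed intercept on D \ D_k
        for i in np.where(~train)[0]:
            mask = train & (z == z[i])         # same category, training folds
            n_j = mask.sum()
            if n_j > 0:                        # category seen in training data
                # posterior mean of the random intercept: shrunken deviation
                u_j = n_j * (y[mask].mean() - beta0) / (n_j + sigma2_ratio)
                encoded[i] = beta0 + u_j
            else:                              # unseen category: prior mean
                encoded[i] = beta0
    return encoded
```

For non-Gaussian responses, the inner fit would be replaced by an actual GLMM (e.g. via a mixed-model library), but the cross-fitting structure stays the same.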

B.2. Neural Networks
Training a network involves making many specific choices. Below we briefly describe the specifications we use for the two network models, NN_ee and GLMMNet. To allow a fair comparison, where possible, we keep those choices consistent across both networks.
-Architecture for the hidden layers. We use three hidden layers with [64, 32, 16] hidden units (neurons) respectively. This choice follows the recommendation of Ferrario et al. (2020), who suggest that the first hidden layer should be large enough to allow new features to be constructed from the raw input covariates and that successive layers should compress information. Some quick experimentation confirmed that this choice of hidden units works relatively well.
-Activation functions. We use the ReLU function, f(x) = max(x, 0), for the hidden layer activations and the inverse of the link function for the final layer activation (see Section 2.3.3).
-Optimiser. For both networks we use TensorFlow's default Adam optimiser with learning rate 0.001 (Kingma and Ba, 2014), a state-of-the-art optimisation algorithm with demonstrated performance across a range of predictive tasks. It has also been used in many actuarial applications, e.g. Richman and Wüthrich (2021).
-Early stopping. In fitting the networks, we further split the training data into an inner-training set and a validation set. We stop training when the validation performance (as captured by the validation loss) has not improved for a fixed number of epochs.
-Loss function. Section 2.4 discusses in detail the loss function used to optimise GLMMNet. For NN_ee, we follow standard practice and use the squared error loss to optimise the network. While it is possible to design a likelihood-based loss function for NN_ee, this involves further complications (such as reparametrisation of the dispersion parameters) beyond the scope of this project.
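The architecture and activation choices above can be sketched as a bare numpy forward pass: three ReLU hidden layers of sizes [64, 32, 16] and a final exponential activation, i.e. the inverse of a log link. The input dimension of 8 and the small random weights are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def forward(x, weights, biases):
    """Minimal sketch of the networks' forward pass: ReLU hidden layers
    followed by an exponential output activation (inverse log link)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)         # ReLU hidden activation
    eta = h @ weights[-1] + biases[-1]         # final linear predictor
    return np.exp(eta)                         # inverse log link

rng = np.random.default_rng(0)
sizes = [8, 64, 32, 16, 1]                     # input dim 8 is a placeholder
weights = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
```

The exponential output guarantees positive predictions, which is what makes the log link a natural choice for claim severity responses.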
V(μ) = b''((b')^{-1}(μ)), (2.7)

where b'(·) and b''(·) denote the first and second derivatives of b(·), and V(·) is commonly referred to as the variance function. Equation (2.6) implies that the conditional distribution of Y_i | u_{j[i]} in (2.5) is completely specified by the conditional mean.

Figure 2.1: Architecture of GLMMNet

Figure 3.1: Boxplots of out-of-sample performance metrics of the different models; GLMMNet highlighted in green. Each experiment is repeated 50 times, with 5,000 training observations and 2,500 testing observations each.

Figure 3.2: True versus predicted densities for (selected) categories on the test set in experiment 1.

Figure 3.3: Boxplots of out-of-sample performance metrics of the different models in experiment 5 (middle; high noise Gaussian), shown with an ℓ2-regularised GLMMNet. Each experiment is repeated 50 times, with 5,000 training observations and 2,500 testing observations each.
Figure 4.1: Histogram of claim amounts (on log scale). The x-axis numbers have been deliberately removed for confidentiality reasons.

Figure 4.3: Skewed distribution of occupation class and heterogeneity in experience. The axis labels have been removed to preserve confidentiality.

Figure 4.4: Posterior predictions of (a randomly selected sample of) the random effects in 95% confidence intervals from the loggamma GLMMNet; ordered by decreasing z-scores. Occupations whose intervals do not overlap with the zero line are highlighted in vermillion (if above zero) or blue (if below zero); those that do overlap with the zero line are in the shaded region. The x-axis labels have been removed to preserve confidentiality.

Algorithm 2: GLMM Encoding, adapted from Pargent et al. (2022)
Data: D = {(y_i, z_i)}_{i=1}^n
Result: ẑ = ψ(z) ∈ R^n, a numeric representation of the categorical feature z
1. Randomly partition D into K subsets of equal size, D = {D_1, D_2, ..., D_K};
2. for k = 1 to K do
3.   Fit a random intercept model η = β_0 1 + Z u with u ~ N(0, σ_u^2 I) on D_k^train = D \ D_k;
4.   for (y_i, z_i) ∈ D_k do
5.     if the category j[i] of the i-th observation appears in D_k^train then
6.       Predict ẑ_i = ŷ_i = g^{-1}(β_0 + u_{j[i]});

Figure C.3: Random effects predictions by GLMMNet (posterior mode) against the ground truth under (a) experiment 1, i.e. the base scenario, and (b) experiment 6, i.e. the high complexity, high noise scenario.

Table 3.1: Overview of different characteristics of candidate models.

Table 4.1: Comparison of lognormal and loggamma model performance (median absolute error, CRPS, negative log-likelihood) on the test (out-of-sample) set. The best values are bolded.