Machine Learning with High-Cardinality Categorical Features in Actuarial Applications

Published online by Cambridge University Press:  11 April 2024

Benjamin Avanzi
Affiliation:
Centre for Actuarial Studies, Department of Economics, University of Melbourne VIC 3010, Australia
Greg Taylor
Affiliation:
School of Risk and Actuarial Studies, UNSW Australia Business School, UNSW Sydney NSW 2052, Australia
Melantha Wang*
Affiliation:
School of Risk and Actuarial Studies, UNSW Australia Business School, UNSW Sydney NSW 2052, Australia
Bernard Wong
Affiliation:
School of Risk and Actuarial Studies, UNSW Australia Business School, UNSW Sydney NSW 2052, Australia
*Corresponding author: Melantha Wang; Email: wang.melantha@gmail.com

Abstract

High-cardinality categorical features are pervasive in actuarial data (e.g., occupation in commercial property insurance). Standard categorical encoding methods like one-hot encoding are inadequate in these settings.

In this work, we present a novel Generalised Linear Mixed Model Neural Network (“GLMMNet”) approach to the modelling of high-cardinality categorical features. The GLMMNet integrates a generalised linear mixed model in a deep learning framework, offering the predictive power of neural networks and the transparency of random effects estimates, the latter of which cannot be obtained from the entity embedding models. Further, its flexibility to deal with any distribution in the exponential dispersion (ED) family makes it widely applicable to many actuarial contexts and beyond. In order to facilitate the application of GLMMNet to large datasets, we use variational inference to estimate its parameters—both traditional mean field and versions utilising textual information underlying the high-cardinality categorical features.

We illustrate the GLMMNet and compare it against existing approaches in a range of simulation experiments as well as in a real-life insurance case study. A notable feature of both the simulation experiments and the real-life case study is a comparatively low signal-to-noise ratio, which is common in actuarial applications. We find that the GLMMNet often outperforms, or at least performs comparably with, an entity-embedded neural network in these settings, while providing the additional benefit of transparency, which is particularly valuable in practical applications.

Importantly, while our model was motivated by actuarial applications, it has wider applicability. The GLMMNet would suit any application that involves high-cardinality categorical variables and where the response cannot be adequately modelled by a Gaussian distribution, especially where the data are relatively noisy.
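
As a rough illustration of the idea summarised in the abstract, the sketch below shows how a fixed-effects score (for example, the output of a neural network) might be combined with a per-category random intercept before applying an inverse link function. It is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the additive linear predictor, the log link, the use of posterior means, the `-1` code for unseen categories and the name `glmmnet_style_predict` are all illustrative choices that do not come from the text above.

```python
import numpy as np


def glmmnet_style_predict(fixed_effect_score, category_index, posterior_mean,
                          inverse_link=np.exp):
    """Combine a fixed-effects score with a per-category random intercept.

    Assumptions (hypothetical, for illustration only):
    - the linear predictor is additive: score + random intercept;
    - category_index holds integer codes, with -1 marking a category
      that was not seen during training;
    - unseen categories receive a random effect of 0 (the prior mean).
    """
    fixed_effect_score = np.asarray(fixed_effect_score, dtype=float)
    category_index = np.asarray(category_index)
    posterior_mean = np.asarray(posterior_mean, dtype=float)

    # Clip so that -1 codes index safely; their values are replaced by 0 below.
    safe_index = np.clip(category_index, 0, None)
    random_effect = np.where(category_index >= 0, posterior_mean[safe_index], 0.0)

    linear_predictor = fixed_effect_score + random_effect
    return inverse_link(linear_predictor)


# Usage: three observations with the same fixed-effects score but
# different categories (the last one unseen in training).
example_prediction = glmmnet_style_predict(
    fixed_effect_score=[1.0, 1.0, 1.0],
    category_index=[0, 1, -1],
    posterior_mean=[0.2, -0.1],
)
```

Because the random intercepts enter the linear predictor directly, each category's estimated effect can be read off `posterior_mean`, which is the kind of transparency the abstract contrasts with entity embeddings.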

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of The International Actuarial Association

Figure 1. Architecture of GLMMNet.

Algorithm 1. Prediction from GLMMNet

Table 1. Overview of different characteristics of candidate models

Table 2. Parameters used for the different simulation environments. Bold face indicates changes from the base scenario (i.e., experiment 1).

Figure 2. Boxplots of out-of-sample CRPS of the different models in Experiments 1–4; GLMMNet highlighted in blue.

Table 3. Model comparison: strengths and limitations of the top-performing models

Figure 3. Histogram and Gaussian kernel density estimate of claim amounts (on log scale). The x-axis numbers have been deliberately removed for confidentiality reasons.

Figure 4. Probability–probability (P-P) plot of empirical versus fitted lognormal and loggamma (theoretical) distributions. The parametric distributions are fitted to the (unlogged) marginal, that is, without any covariates.

Table 4. An example of ANZSIC occupation classification

Figure 5. Skewed distribution of occupation class and heterogeneity in experience. The axis labels have been removed to preserve confidentiality.

Table 5. Comparison of lognormal and loggamma model performance (median absolute error, CRPS, negative log-likelihood, RMSE of average prediction per category) on the test (out-of-sample) set. The best values are bolded.

Figure 6. Posterior predictions of (a randomly selected sample of) the random effects in 95% confidence intervals from the loggamma GLMMNet, ordered by decreasing z-scores. Occupations that do not overlap with the zero line are highlighted in vermillion (if above zero) and blue (if below zero), respectively. Occupations that do overlap with the zero line are in the shaded region. The x-axis labels have been removed to preserve confidentiality.

Table 6. Comparison of loggamma GLMMNet performance on training and test sets.

Supplementary material

Avanzi et al. supplementary material 1 (File, 248 Bytes)
Avanzi et al. supplementary material 2 (File, 685.4 KB)
Avanzi et al. supplementary material 3 (File, 11.5 KB)