Dual evaluation of performance and fairness from machine learning models for non-life insurance pricing

Published online by Cambridge University Press: 29 January 2026

Tarun Israni*
Affiliation:
School of Mathematical Sciences, University College Cork, Cork, Ireland
Linda Daly
Affiliation:
School of Mathematical Sciences, University College Cork, Cork, Ireland
John Condon
Affiliation:
School of Mathematical Sciences, University College Cork, Cork, Ireland
Eric Wolsztynski
Affiliation:
School of Mathematical Sciences, University College Cork, Cork, Ireland; Insight Research Ireland Centre for Data Analytics, University College Cork, Cork, Ireland
*Corresponding author: Tarun Israni; Email: tisrani@gmail.com

Abstract

An increasing number of reports highlight the potential of machine learning (ML) methodologies over the conventional generalised linear model (GLM) for non-life insurance pricing. In parallel, national and international regulatory institutions are sharpening their focus on pricing fairness, with a view to quantifying and mitigating algorithmic differences and discrimination. However, comprehensive studies that assess both pricing accuracy and fairness remain scarce. We propose a benchmark of the GLM against mainstream regularised linear models and tree-based ensemble models under two popular distribution modelling strategies (Poisson-gamma and Tweedie), with respect to key criteria including estimation bias, deviance, risk differentiation, competitiveness, loss ratios, discrimination and fairness. Pricing performance and fairness were assessed simultaneously on the same samples of premium estimates for GLM and ML models. The models were compared on two open-access motor insurance datasets, each with a different type of cover (fully comprehensive and third-party liability). While no single ML model outperformed the others across both pricing and discrimination metrics, the GLM significantly underperformed on most of them. The results indicate that ML may be considered a realistic and reasonable alternative to current practices. We advocate that benchmarking exercises for risk prediction models should be carried out to assess both pricing accuracy and fairness for any given portfolio.

Information

Type
Contributed Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Authors, 2026. Published by Cambridge University Press on behalf of The Institute and Faculty of Actuaries
Figure 1. Distributions of claims amounts (top) and number of claims (bottom) for Dataset 1 (left) and Dataset 2 (right).

Table 1. Variable importance metrics for frequency (left), severity (centre) and Tweedie (right) models trained on Dataset 1 (top) and Dataset 2 (bottom), as defined in Section 2.5.6. The greyed-out cells indicate variables that were removed during the stepwise selection in the GLM model-building process. A value of 0% in the table corresponds to an importance below 0.5%.

Table 2. SHAP value metrics for product (left) and Tweedie (right) models trained on Dataset 1 (top) and Dataset 2 (bottom), as defined in Section 2.5.6. The greyed-out cells indicate variables that were removed during the stepwise selection in the GLM model-building process. A value of 0% in the table corresponds to a SHAP value below 0.5%.

Figure 2. Partial dependence plot by feature level for categorical and discretised (by deciles) numerical variables for models trained on Dataset 1. Each subplot shows the average predicted premium across levels of a single feature, under the product (solid line) and Tweedie (dashed line) modelling strategies.

Figure 3. Partial dependence plot by feature level for categorical and discretised (by deciles) numerical variables for models trained on Dataset 2. Each subplot shows the average predicted premium across levels of a single feature, under the product (solid line) and Tweedie (dashed line) modelling strategies.

Figure 4. Bootstrap distributions of test Poisson (top), gamma (middle) and Tweedie (bottom) deviances, for Datasets 1 (left) and 2 (right). The solid blue line indicates the deviance measure from a null model (average value observed from training data).
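
As a rough illustration of the deviance measures compared in Figure 4, the sketch below computes mean Poisson, gamma and Tweedie deviances in plain Python and bootstraps one of them against a constant null model. The function names, the Tweedie power p = 1.5, the number of resamples and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def poisson_deviance(y, mu):
    """Mean Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (log_term - (yi - mi))
    return total / len(y)

def gamma_deviance(y, mu):
    """Mean gamma deviance; requires strictly positive y (claim severities)."""
    return sum(2.0 * (math.log(mi / yi) + yi / mi - 1.0)
               for yi, mi in zip(y, mu)) / len(y)

def tweedie_deviance(y, mu, p=1.5):
    """Mean Tweedie deviance for a power 1 < p < 2 (y >= 0, mu > 0)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        total += 2.0 * (yi ** (2 - p) / ((1 - p) * (2 - p))
                        - yi * mi ** (1 - p) / (1 - p)
                        + mi ** (2 - p) / (2 - p))
    return total / len(y)

def bootstrap_deviance(metric, y, mu, n_boot=200, seed=0):
    """Recompute `metric` on resamples of the test set drawn with replacement."""
    rng = random.Random(seed)
    n = len(y)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(metric([y[i] for i in idx], [mu[i] for i in idx]))
    return values

# Toy claim counts and hypothetical model predictions (not from the datasets).
counts = [0, 1, 0, 2, 0, 0, 1, 0]
model_mu = [0.2, 0.8, 0.3, 1.5, 0.2, 0.4, 0.9, 0.3]
null_mu = [0.5] * len(counts)  # constant prediction: assumed training average

model_boot = bootstrap_deviance(poisson_deviance, counts, model_mu)
null_dev = poisson_deviance(counts, null_mu)
```

In this toy setting, `null_dev` plays the role of the reference line in the figure, and `model_boot` the role of one bootstrap distribution; a model is informative when most of its bootstrap deviances fall below the null value.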

Table 3. Bootstrapped confidence intervals for model bias on the predicted premium values.

Table 4. Bootstrapped Gini indices (and corresponding standard errors) for comparison of risk stratification from the ML pipelines against that of the reference GLM.
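
To make the idea of a bootstrapped Gini index concrete, here is a minimal sketch under stated assumptions. It uses a plain Lorenz-curve Gini (policies sorted by predicted premium, cumulative share of observed claims), which may differ from the exact definition used in the paper, since that definition is not part of this excerpt. The function names and toy data are hypothetical.

```python
import random
import statistics

def gini_index(actual, predicted):
    """Gini index from the Lorenz curve of actual claims, with policies
    ordered by increasing predicted premium (trapezoidal area)."""
    order = sorted(range(len(actual)), key=lambda i: predicted[i])
    total = sum(actual)
    n = len(actual)
    cum = 0.0
    area = 0.0
    for i in order:
        prev = cum
        cum += actual[i] / total
        area += (prev + cum) / 2.0 / n
    return 1.0 - 2.0 * area

def bootstrap_gini(actual, predicted, n_boot=200, seed=0):
    """Bootstrap mean and standard error of the Gini index."""
    rng = random.Random(seed)
    n = len(actual)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [actual[i] for i in idx]
        if sum(resampled) == 0:
            continue  # no claims in this resample: Gini is undefined
        values.append(gini_index(resampled, [predicted[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)

# Toy claims and hypothetical premiums (not from the datasets).
claims = [0, 0, 120, 0, 300, 0, 0, 950]
premiums = [40, 55, 90, 60, 150, 45, 70, 310]
gini_mean, gini_se = bootstrap_gini(claims, premiums)
```

A higher value indicates that the model concentrates observed claims among the policies it prices highest, which is the sense in which the index measures risk stratification here.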

Figure 5. Bootstrap distributions of total book premiums (top row), total claims (second row) and loss ratios (third row) for two models using, respectively, a GLM and an XGB methodology for premium pricing, on Dataset 1 (left) and Dataset 2 (right). The dashed line indicates a loss ratio of 1. The same analysis was carried out to compare RF with GLM, with loss ratio results (bottom row) consistent with the comparison between XGB and GLM.

Figure 6. Radar charts comparing the reference GLM with LASSO and the tree-based models on all key performance indicators.

Figure 7. Distributions of actual premiums from the GLM and XGB models, by gender, under the product (left) and Tweedie (right) modelling strategies.

Table 5. Distance measure ($d_{gender}$), difference in distribution medians and Wilcoxon test p-values for the actual premium models under the product (left) and Tweedie (right) modelling strategies.

Figure 8. Distributions of actual premium across technical premium quintile bands for the XGB model (left) and the GLM model (right) under the product (top) and Tweedie (bottom) modelling strategies.

Table 6. For the product model, Wasserstein distances, median difference and Wilcoxon test p-values comparing actual premium distributions by technical premium band for actuarial group fairness (left) and calibration (right), as well as mean APTP ratios and associated standard deviation by actual premium band (centre). Significant p-values are italicised.

Table 7. For the Tweedie model, Wasserstein distances, median difference and Wilcoxon test p-values comparing actual premium distributions by technical premium band for actuarial group fairness (left) and calibration (right), as well as mean APTP ratios and corresponding standard deviation by actual premium band (centre). Significant p-values are italicised.
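
The distribution comparisons summarised in Tables 6 and 7 can be sketched as follows, assuming a one-dimensional first-order Wasserstein distance between two empirical premium samples, together with the difference in their medians. The function names and toy premiums are hypothetical, and the Wilcoxon test is not reproduced here because the excerpt does not specify which variant the authors used.

```python
import bisect
import statistics

def wasserstein_1d(x, y):
    """First-order Wasserstein distance between two empirical samples,
    computed as the integral of |F_x(t) - F_y(t)| over the pooled values."""
    xs, ys = sorted(x), sorted(y)
    grid = sorted(xs + ys)
    dist = 0.0
    for left, right in zip(grid, grid[1:]):
        # Empirical CDFs are constant on [left, right).
        fx = bisect.bisect_right(xs, left) / len(xs)
        fy = bisect.bisect_right(ys, left) / len(ys)
        dist += abs(fx - fy) * (right - left)
    return dist

def median_difference(x, y):
    """Difference between the medians of the two samples (x minus y)."""
    return statistics.median(x) - statistics.median(y)

# Toy actual premiums for two groups within one technical premium band
# (illustrative values only, not from the datasets).
group_a = [310.0, 325.0, 298.0, 340.0, 315.0]
group_b = [330.0, 345.0, 320.0, 360.0]

distance = wasserstein_1d(group_a, group_b)
shift = median_difference(group_a, group_b)
```

A distance close to zero means the two groups receive nearly identical premium distributions within the band; the sign of `shift` shows which group tends to pay more.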

Figure 9. Distributions of actual claims by actual premium quintile band for the XGB and GLM models under the product and Tweedie modelling strategies.

Figure A1. Distributions of recategorised variables. Based on the average observed claims, variables were recategorised into broader, well-defined groups to select key differentiating features and use interpretable modelling approaches to enhance the significance and understanding of predictions for each group.