
Statistical models for improved insurance risk assessment using telematics

Published online by Cambridge University Press:  26 May 2025

James Hannon*
Affiliation:
Centre for Research Training in Foundations of Data Science, University College Dublin, Dublin, Ireland; School of Mathematics and Statistics, University College Dublin, Dublin, Ireland
Adrian O’Hagan
Affiliation:
School of Mathematics and Statistics, University College Dublin, Dublin, Ireland; Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland
Corresponding author: James Hannon; Email: james.hannon1@ucdconnect.ie

Abstract

This paper uses a two-step approach to modelling the probability of a policyholder making an auto insurance claim. We first cluster policyholders via Gaussian mixture models and then fit cluster-specific binary regression models. We use telematics information alongside traditional auto insurance information and find that the best model incorporates telematics directly, without the need for dimension reduction via principal components. We also utilise the probabilistic estimates from the mixture model to account for uncertainty in the cluster assignments. The clustering process allows for the creation of driving profiles and offers a fairer method for policyholder segmentation than when clustering is not used. By fitting separate regression models to the observations in the respective clusters, we are able to offer differential pricing, recognising that policyholders have different exposures to risk despite having similar covariate information, such as total miles driven. The approach outlined in this paper offers an explainable and interpretable model that can compete with black-box models. Our comparisons are based on a synthesised telematics data set that was emulated from a real insurance data set.
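The two-step approach can be illustrated with a minimal sketch in Python, assuming scikit-learn and entirely synthetic stand-in data (the paper's actual pipeline, variables, and model settings differ): soft-cluster with a Gaussian mixture, fit one binary regression per cluster, and blend the cluster-specific predictions by the posterior cluster probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical stand-in features: two synthetic "driving profile" groups.
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(3, 1, (200, 3))])
y = rng.binomial(1, np.where(np.arange(400) < 200, 0.1, 0.3))

# Step 1: soft clustering via a Gaussian mixture model.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)  # hard assignments used to split the data

# Step 2: a separate binary regression model per cluster.
models = {k: LogisticRegression().fit(X[labels == k], y[labels == k])
          for k in range(2)}

def claim_prob(x):
    """Blend cluster-specific predictions by posterior cluster probabilities,
    reflecting the uncertainty in the cluster assignment."""
    w = gmm.predict_proba(x.reshape(1, -1))[0]
    return sum(w[k] * models[k].predict_proba(x.reshape(1, -1))[0, 1]
               for k in range(2))
```

The posterior weighting mirrors the paper's use of the mixture model's probabilistic estimates; in practice the regressions would use the traditional and telematics covariates described in Table 1.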

Information

Type
Contributed Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Institute and Faculty of Actuaries
Table 1. Variables used with their meaning and type (either a response variable, a telematics variable, or a traditional variable used in auto insurance)

Figure 1. Histogram of the ages of the insured on each policy.

Figure 2. Bar plot of the credit scores of the insured on each policy.

Figure 3. PCA explained variance ratio and cumulative PCA explained variance ratio versus number of principal components. The first three principal components were ultimately chosen to be used for modelling.
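The component-selection step behind this plot can be sketched as follows (scikit-learn assumed; the data and the 90% variance cut-off are illustrative, whereas the paper settled on three components by inspecting the scree and cumulative-variance plots).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical correlated "telematics-style" features driven by 3 latent factors.
base = rng.normal(size=(500, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 5))
               + 0.1 * rng.normal(size=(500, 5))])

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# Keep the smallest number of components explaining at least 90% of variance.
k = int(np.searchsorted(cum, 0.90) + 1)
scores = PCA(n_components=k).fit_transform(X)  # principal component scores
```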

Table 2. Loading matrix for the first three principal components. PCA was performed on the telematics variables, which are listed in the Variable column

Figure 4. Plot of logit, probit, log-log, complementary log-log, and Cauchy link functions.

Table 3. Link functions and their inverses that are available in the statsmodels package

Figure 5. BIC scores for Gaussian mixture models with number of components ranging from 1 to 12. Data used in the left subplot include continuous and discrete variables, while the right subplot includes continuous, discrete, and principal components.
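The BIC-based choice of the number of mixture components can be sketched as follows (scikit-learn assumed; the two-group synthetic data are illustrative only).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two well-separated synthetic groups, so BIC should favour ~2 components.
X = np.vstack([rng.normal(-2, 1, (150, 2)), rng.normal(2, 1, (150, 2))])

# Fit mixtures with 1..6 components; lower BIC is better.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
```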

Figure 6. Average silhouette score for Gaussian mixture models with number of components ranging from 1 to 12. Data used in the left subplot include continuous and discrete variables, while the right subplot includes continuous, discrete, and principal components.
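The average silhouette score used here can be computed as in this sketch (scikit-learn assumed; synthetic data): silhouette values lie in [-1, 1], higher values indicate tighter, better-separated clusters, and at least two clusters are required.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Two clearly separated synthetic groups.
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

scores = {}
for k in range(2, 6):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # component count with the best average silhouette
```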

Table 4. Confusion matrix comparing the clustering results for the model incorporating the principal components and the model without the principal components on the validation data set. Adjusted Rand Index = 0.0777
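The adjusted Rand index reported here corrects the raw pairwise agreement between two clusterings for chance; a tiny sketch with hypothetical labels (scikit-learn assumed):

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments for six policyholders under the two
# clusterings (without and with PCA); the label values themselves don't
# matter, only the partition structure.
labels_no_pca = [0, 0, 1, 1, 2, 2]
labels_pca = [1, 1, 0, 0, 0, 2]
ari = adjusted_rand_score(labels_no_pca, labels_pca)
# 1.0 means identical partitions; values near 0, like the 0.0777 in
# Table 4, indicate roughly chance-level agreement.
```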

Figure 7. Bar chart showing percentage of claims made by cluster (left) and pie chart showing percentage of training set that belongs to each cluster (right). Clusterings performed using the data set without PCA. Cluster 0 had a claim percentage of 5.28%, Cluster 1 had a claim percentage of 6.20%, and Cluster 2 had a claim percentage of 3.76%.

Table 5. Cluster means for Gaussian mixture model fitted on the data set without PCA

Figure 8. Bar chart showing percentage of claims made by cluster (left) and pie chart showing percentage of training set that belongs to each cluster (right). Clusterings performed using the data set with PCA. Cluster 0 had a claim percentage of 5.05%, Cluster 1 had a claim percentage of 2.43%, and Cluster 2 had a claim percentage of 5.86%.

Table 6. Cluster means for Gaussian mixture model fitted on the data set with PCA

Figure 9. The average MAvG of models based on the test data set with or without PCA (top left), for different covariance structures (top right), for different numbers of components (bottom left), and for different link functions (bottom right). Error bars represent the minimum and maximum values.
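MAvG suits imbalanced claim data because the rare claim class counts as much as the majority class; a small sketch follows, using one common definition of MAvG, the geometric mean of the per-class recalls (scikit-learn assumed; the example labels are hypothetical).

```python
import numpy as np
from sklearn.metrics import recall_score

def mavg(y_true, y_pred):
    # MAvG as the geometric mean of the per-class recalls, so the rare
    # claim class weighs as much as the majority no-claim class.
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1 / len(recalls)))

y_true = [0, 0, 0, 0, 1, 1]   # hypothetical outcomes
y_pred = [0, 0, 0, 1, 1, 0]   # hypothetical predictions
# Per-class recalls are 3/4 and 1/2, so MAvG = sqrt(0.75 * 0.5).
```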

Table 7. MAvG of the test set for the top 5 models

Table 8. Mean, standard error, and 95% confidence intervals for variables in the optimal regression models. All variables are statistically significant at the 5% level

Table 9. Cut-off points for the optimal regression model. Optimal cut-off points are based on the ROC curve, chosen to maximise the difference between the recall and the false positive rate
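This ROC-based cut-off rule, maximising recall minus the false positive rate, is Youden's J statistic; a sketch with deliberately separable synthetic scores so the optimum is unambiguous (scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.3, 500)
# Deliberately separable toy scores: non-claims land in [0, 0.3),
# claims in [0.5, 0.8).
p = 0.5 * y + 0.3 * rng.random(500)

fpr, tpr, thresholds = roc_curve(y, p)
# Youden's J: pick the threshold maximising TPR (recall) minus FPR.
best_cutoff = thresholds[np.argmax(tpr - fpr)]
```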

Figure 10. Partial dependence plots for Cluster 0’s regression model.

Figure 11. Partial dependence plots for Cluster 1’s regression model.

Figure 12. Partial dependence plots for Cluster 2’s regression model.

Figure 13. Calibration plot for regression probabilities in the test set for the optimal model.
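A calibration plot of this kind bins the predicted probabilities and compares each bin's mean prediction with the observed claim fraction; a sketch with synthetic, well-calibrated predictions (scikit-learn assumed):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(6)
pred = rng.random(2000)        # hypothetical predicted probabilities
y = rng.binomial(1, pred)      # outcomes drawn from those probabilities

# For well-calibrated predictions, the observed fraction of claims in
# each bin should track the mean predicted probability (the diagonal).
frac_pos, mean_pred = calibration_curve(y, pred, n_bins=10)
```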

Figure 14. Histogram with kernel density estimation for claims (1) and no claims (0) in the test set based on predictions from the optimal model.

Table 10. Hosmer–Lemeshow statistics, p-values, and degrees of freedom for the optimal regression model on the test set
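The Hosmer–Lemeshow test groups observations by predicted probability and compares observed with expected claim counts; a from-scratch sketch under the usual g − 2 degrees-of-freedom convention (numpy/scipy assumed; the authors' exact grouping may differ):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    # Sort by predicted probability, split into g near-equal groups, then
    # sum (observed - expected)^2 / (expected * (1 - expected / n)).
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, g):
        obs, exp, n = y[idx].sum(), p[idx].sum(), len(idx)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
    df = g - 2
    return stat, chi2.sf(stat, df), df

rng = np.random.default_rng(7)
p = rng.uniform(0.01, 0.2, 1000)   # hypothetical claim probabilities
y = rng.binomial(1, p)
stat, pval, df = hosmer_lemeshow(y, p)
```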

Table A1. Descriptive statistics for telematics variables used in this paper

Table A2. Descriptive statistics for traditional numerical auto insurance variables used in this paper

Table A3. Breakdown of traditional categorical auto insurance variables used in this paper

Table A4. Descriptive statistics for principal components used in this paper

Table A5. Univariate tests of normality for clustering variables in the training data set

Table A6. Univariate tests of normality for clustering variables in the training set, using only observations assigned to cluster 0

Table A7. Univariate tests of normality for clustering variables in the training set, using only observations assigned to cluster 1

Table A8. Univariate tests of normality for clustering variables in the training set, using only observations assigned to cluster 2

Table A9. Multivariate tests of normality for clustering variables in the training set. Note that $n = 60,000$ and $d = 34$

Table A10. Multivariate tests of normality for clustering variables in the training set, using only observations assigned to cluster 0. Note that $n = 15,990$ and $d = 34$

Table A11. Multivariate tests of normality for clustering variables in the training set, using only observations assigned to cluster 1. Note that $n = 807$ and $d = 34$

Table A12. Multivariate tests of normality for clustering variables in the training set, using only observations assigned to cluster 2. Note that $n = 43,203$ and $d = 34$

Table A13. Table of the average silhouette scores for the clustering solutions

Figure A1. The silhouette plot for the three clusters using the full covariance structure on the data set without PCA. The red dashed line represents the average silhouette score across all observations.

Figure A2. The silhouette plot for the three clusters using the full covariance structure on the data set with PCA. The red dashed line represents the average silhouette score across all observations.

Figure A3. The adjusted Rand index for the three clusters using the full covariance structure on the data set without PCA. Ten initialisations were used to assess the stability of the clustering solution. The line plot represents the average, while the bars represent the standard deviation.

Figure A4. The adjusted Rand index for the three clusters using the full covariance structure on the data set with PCA. Ten initialisations were used to assess the stability of the clustering solution. The line plot represents the average, while the bars represent the standard deviation.