On hybrid tree-based methods for short-term insurance claims

Published online by Cambridge University Press: 08 March 2023

Zhiyu Quan*
Affiliation:
Department of Mathematics, University of Illinois at Urbana-Champaign, 1409 W. Green Street (MC-382), Urbana, IL 61801, USA.
Zhiguo Wang
Affiliation:
Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA.
Guojun Gan
Affiliation:
Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA.
Emiliano A. Valdez
Affiliation:
Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA.
*Corresponding author. E-mail: zquan@illinois.edu

Abstract

The two-part framework and the Tweedie generalized linear model (GLM) have traditionally been used to model loss costs for short-term insurance contracts. Most portfolios of insurance claims contain a large proportion of zero claims, and the resulting imbalance lowers the prediction accuracy of these traditional approaches. In this article, we propose an alternative: tree-based methods with a hybrid structure built by a two-step algorithm. For example, the first step constructs a classification tree to build the probability model for claim frequency. The second step applies elastic net regression models at each terminal node of the classification tree to build the distribution models for claim severity. Because hyperparameters can be tuned at each step of the algorithm, this hybrid structure allows for improved prediction accuracy, and the tuning can be directed toward specific business objectives. A major advantage of this hybrid structure is improved model interpretability. We examine and compare the predictive performance of this hybrid structure relative to the traditional Tweedie GLM using both simulated and real datasets. Our empirical results show that these hybrid tree-based methods produce more accurate and informative predictions.
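
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a classification "tree" with a single split chosen by Gini impurity, an elastic net fitted by coordinate descent to the positive claims in each terminal node, and a prediction rule that multiplies the node's empirical claim probability by the fitted severity. The function names, the toy data, the penalty parametrization (`lam`, `alpha`), and the prediction rule are all assumptions made for illustration.

```python
# Illustrative sketch only: a one-split classification tree plus per-node
# elastic nets. Names, data, and the prediction rule are assumptions.

def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, labels):
    """Return (feature index, threshold) minimising weighted Gini impurity."""
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [l for row, l in zip(X, labels) if row[j] <= t]
            right = [l for row, l in zip(X, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(X)
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]


def soft_threshold(z, g):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    return z - g if z > g else z + g if z < -g else 0.0


def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Coordinate descent for
    (1/2n)*||y - b0 - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = len(y), len(X[0])
    b0, b = sum(y) / n, [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - b0 - sum(X[i][k] * b[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            zj = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam * alpha) / (zj + lam * (1.0 - alpha))
        # The unpenalised intercept is the mean residual.
        b0 = sum(y[i] - sum(X[i][k] * b[k] for k in range(p)) for i in range(n)) / n
    return b0, b


def fit_hybrid(X, y, lam=0.1, alpha=0.5):
    """Step 1: split on claim occurrence. Step 2: elastic net per terminal node."""
    labels = [1 if yi > 0 else 0 for yi in y]
    j, t = best_split(X, labels)
    nodes = {}
    for side in ("left", "right"):
        idx = [i for i, row in enumerate(X) if (row[j] <= t) == (side == "left")]
        pos = [i for i in idx if y[i] > 0]
        prob = len(pos) / len(idx)  # empirical claim probability in the node
        sev = elastic_net([X[i] for i in pos], [y[i] for i in pos], lam, alpha) if pos else None
        nodes[side] = (prob, sev)
    return j, t, nodes


def predict_loss(model, row):
    """Assumed rule: node claim probability times fitted severity."""
    j, t, nodes = model
    prob, sev = nodes["left" if row[j] <= t else "right"]
    if sev is None:  # node contains no positive claims
        return 0.0
    b0, b = sev
    return prob * (b0 + sum(bk * xk for bk, xk in zip(b, row)))


# Toy data (hypothetical): no claims when the first feature is small.
X = [[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 1.5],
     [0.9, 1.0], [0.6, 2.0], [0.8, 3.0], [0.7, 4.0]]
y = [0.0, 0.0, 0.0, 0.0, 150.0, 200.0, 250.0, 300.0]
model = fit_hybrid(X, y)
```

In this sketch the tree step and the regression step each have their own hyperparameters (the split rule on one side, `lam` and `alpha` on the other), which mirrors the abstract's point that tuning can happen at each step. A practical version would grow a deeper tree and choose the penalties by cross-validation.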

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press.
Figure 1. A simplified illustration of hybrid tree-based methods.

Table 1. Model performance on the simulation dataset.

Table 2. Model performance on the simulation dataset without nonrelevant explanatory variables.

Table 3. Summary statistics of the training dataset, 2006–2010.

Table 4. Summary statistics of the test dataset, 2011.

Figure 2. Heat maps of model comparison based on various validation measures. (a) Model performance based on training dataset. (b) Model performance based on test dataset.

Figure 3. Tree paths with highlighted nodes.

Figure 4. Classification tree for the frequency.

Figure 5. Variable importance for the claim frequency.

Table 5. Regression coefficients at the terminal nodes.

Table A1. Performance validation measures.

Figure A1. Classification tree for the claim frequency.