Extracting information from textual descriptions for actuarial applications

Published online by Cambridge University Press: 02 March 2021

Scott Manski, Michigan State University, East Lansing, USA
Kaixu Yang, Michigan State University, East Lansing, USA
Gee Y. Lee*, Michigan State University, East Lansing, USA
Tapabrata Maiti, Michigan State University, East Lansing, USA
*Corresponding author. E-mail: leegee@msu.edu
Abstract

Initial insurance losses are often reported with a textual description of the claim, and the claims manager must determine an adequate case reserve for each known claim. In this paper, we present a framework for predicting the loss amount from the textual description of a claim, using a large number of words found in the descriptions. Prior work has focused on classifying insurance claims based on keywords selected by a human expert, whereas this paper focuses on loss amount prediction with automatic word selection. To transform words into numeric vectors, we use word cosine similarities and word embedding matrices. When we consider all unique words found in the training dataset and fit a generalised additive model to the resulting explanatory variables, the design matrix is high dimensional. For this reason, we use a group lasso penalty to reduce the number of coefficients in the model. The proposed analytical framework is scalable and yields a parsimonious and interpretable model. Finally, we discuss the implications of the analysis, including how the framework may be used by an insurance company and how the interpretation of the covariates can lead to significant policy change. The code can be found in the TAGAM R package (github.com/scottmanski/TAGAM).
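As a rough illustration of how a claim description might be turned into cosine-similarity covariates, the following Python sketch may help. It is not the TAGAM implementation (which is an R package). The toy embedding values, the function names, and the choice to average the word vectors of a description before comparing it with each target word are all assumptions made for this example.

```python
import numpy as np

# Hypothetical 3-dimensional toy embedding; a real application would load
# pre-trained word vectors. These numbers are invented for illustration.
EMBEDDING = {
    "house": np.array([1.0, 0.0, 0.0]),
    "home": np.array([0.9, 0.1, 0.0]),
    "thunderstorm": np.array([0.0, 1.0, 0.0]),
    "hail": np.array([0.0, 0.8, 0.6]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(np.dot(u, v) / (norm_u * norm_v))


def description_vector(description, embedding):
    """Average the vectors of the known words in a description.

    Assumption: mean pooling over whitespace-separated tokens. Words missing
    from the embedding are skipped; a zero vector is returned if none match.
    """
    tokens = description.lower().split()
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        dim = len(next(iter(embedding.values())))
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def similarity_features(description, target_words, embedding):
    """One covariate per target word: its cosine similarity with the description."""
    doc = description_vector(description, embedding)
    return np.array([cosine_similarity(doc, embedding[w]) for w in target_words])


# Each claim yields one numeric value per target word; stacking these rows
# over all claims gives a design matrix with one column per word.
features = similarity_features("Hail damaged home", ["house", "thunderstorm"], EMBEDDING)
print(features)
```

With one such column for every unique word in the training data, the number of covariates grows quickly, which is the setting in which a sparsity-inducing penalty such as the group lasso becomes useful.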

Information

Type
Original Research Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries
Table 1. Summary statistics for the log(loss) for the training and validation datasets.

Figure 1. Frequency for the most common words.

Figure 2. Cosine similarity against property loss for house and thunderstorm.

Table 2. Summary statistics for the final model. The residual degrees of freedom (DF) are the estimated degrees of freedom from the GAM, and the mean squared prediction error (MSPE) is computed out of sample.

Figure 3. Function estimates for several covariates.

Figure 4. Words with the highest cosine similarity with house.

Table 3. Comparison of models.

Figure 5. Predicted property loss amounts against the true property loss amounts for the training and validation samples. The Spearman correlations are $80.61\%$ and $76.06\%$, respectively.