Actuarial applications of natural language processing using transformers: Case studies for using text features in an actuarial context

Published online by Cambridge University Press:  07 March 2024

Andreas Troxler*
Affiliation:
AT Analytics Ltd, Forchwaldstrasse 57, 6318 Walchwil, Switzerland
Jürg Schelldorfer
Affiliation:
Swiss Re Management Ltd, Mythenquai 50/60, 8022 Zurich, Switzerland
*
Corresponding author: Andreas Troxler; Email: andreas.troxler@atanalytics.ch

Abstract

This paper demonstrates workflows to incorporate text data into actuarial classification and regression tasks. The main focus is on methods employing transformer-based models. A dataset of car accident descriptions with an average length of 400 words, available in English and German, and a dataset with short property insurance claims descriptions, are used to demonstrate these techniques. The case studies tackle challenges related to a multilingual setting and long input sequences. They also show ways to interpret model output and to assess and improve model performance, by fine-tuning the models to the domain of application or to a specific prediction task. Finally, the paper provides practical approaches to handle classification tasks in situations with no or only few labelled data. The results achieved by using the language-understanding skills of off-the-shelf natural language processing (NLP) models with only minimal pre-processing and fine-tuning clearly demonstrate the power of transfer learning for practical applications.

Information

Type
Contributed Paper
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Institute and Faculty of Actuaries
Figure 1. Fictional example of text pre-processing. The text is converted to lowercase, stemmed (“occurred” becomes “occur”; “intersection” becomes “intersect”) and split into tokens. Stopwords such as “the,” “at” and “an” are removed. The misspelled word “crach” is not in the dictionary and is represented by a special token. In this example, punctuation is suppressed. The emphasis on the word “intersection” is lost, because formatting is suppressed, and all words are converted to lowercase.
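The pipeline of Figure 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the stopword list, the crude suffix-stripping "stemmer," the vocabulary and the `[UNK]` token are all toy stand-ins chosen to reproduce the fictional example.

```python
STOPWORDS = {"the", "at", "an", "a", "of"}
VOCABULARY = {"crash", "occur", "intersect"}   # toy dictionary
UNKNOWN_TOKEN = "[UNK]"

def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("red", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = []
    for raw in text.lower().split():          # lowercase, split on whitespace
        word = raw.strip(".,!?")              # suppress punctuation
        if not word or word in STOPWORDS:
            continue                          # drop stopwords
        stem = crude_stem(word)
        # out-of-vocabulary words (e.g. misspellings) become a special token
        tokens.append(stem if stem in VOCABULARY else UNKNOWN_TOKEN)
    return tokens

print(preprocess("The crach occurred at an INTERSECTION."))
# → ['[UNK]', 'occur', 'intersect']
```

As in the figure, the misspelled "crach" is mapped to the unknown token, while "occurred" and "intersection" are stemmed to in-vocabulary forms.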

Figure 2. Basic architecture of a recurrent neural network.
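The recurrence of Figure 2 can be written out explicitly: at each time step the hidden state is updated from the previous hidden state and the current input, and the final hidden state summarises the whole sequence. The dimensions and weight values below are toy choices for illustration, not from the paper.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b), in plain Python."""
    n = len(h_prev)
    h = []
    for i in range(n):
        s = b[i]
        s += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        s += sum(W_hh[i][j] * h_prev[j] for j in range(n))
        h.append(math.tanh(s))
    return h

def rnn_forward(xs, hidden_size, W_xh, W_hh, b):
    """Process a sequence; the final hidden state encodes the whole input."""
    h = [0.0] * hidden_size
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h

# Example: 2-dimensional inputs, 2-dimensional hidden state, toy weights.
W_xh = [[0.5, 0.0], [0.0, 0.5]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
h_final = rnn_forward([[1.0, 0.0], [0.0, 1.0]], 2, W_xh, W_hh, b)
print(h_final)
```

Because the state is updated strictly left to right, long sequences must be processed token by token, which is one motivation for the transformer architecture discussed next.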

Figure 3. Basic architecture of the transformer encoder. The architecture of a transformer encoder layer is shown in the right part of the graph.
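The core building block inside each transformer encoder layer of Figure 3 is scaled dot-product attention. The sketch below shows the mechanism on toy vectors in pure Python; a real implementation uses batched matrix operations and multiple attention heads.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, average the values weighted by softmax(q·k / sqrt(d))."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# One query attending over two key/value pairs (toy numbers).
out = attention(queries=[[1.0, 0.0]],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Unlike the recurrent update above, every position attends to every other position in parallel, which is what makes transformers efficient to train on long inputs.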

Figure 4. NLP techniques are used to extract additional features, which are used to augment the available tabular features.

Figure 5. An NLP encoder is used to encode the text data into additional features, which are used to augment the available tabular features.

Figure 6. The transformer encoder and the neural network to process the tabular data are connected and trained together. Parts of the transformer encoder could be frozen during the training process.
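The joint forward pass of Figure 6 can be sketched as follows. Here the "frozen" text encoder is a made-up stand-in returning a two-dimensional embedding (a real model would return, e.g., a CLS-token embedding), and the prediction head is a single linear unit; all weights are toy values for illustration.

```python
def frozen_text_encoder(text: str) -> list[float]:
    """Stand-in for a pre-trained transformer encoder with frozen parameters."""
    # toy "embedding": scaled character count and word-gap count
    return [len(text) / 100.0, text.count(" ") / 10.0]

def prediction_head(features, weights, bias):
    """A single linear unit on the combined feature vector."""
    return bias + sum(w * f for w, f in zip(weights, features))

def predict(text, tabular, weights, bias):
    # concatenate the text embedding with the tabular features
    combined = frozen_text_encoder(text) + list(tabular)
    return prediction_head(combined, weights, bias)

score = predict("vehicle one was struck at the intersection",
                tabular=[2.0, 0.0],            # e.g. number of vehicles, a flag
                weights=[0.5, 0.5, 0.1, 0.0],  # toy head weights
                bias=0.0)
print(score)
```

During joint training only the head (and any unfrozen encoder layers) receives gradient updates; freezing the encoder keeps the pre-trained language knowledge intact and reduces compute.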

Table 1. Scores of the Different Approaches, Evaluated on the Test Set

Figure 7. Confusion matrices, evaluated on the test set. The dummy classifier always predicts the most frequent class.
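The dummy baseline of Figure 7 and the tabulation of a confusion matrix can be sketched in a few lines. The labels below are toy values, not from the paper's dataset.

```python
from collections import Counter

def most_frequent_class(train_labels):
    """The dummy classifier: always predict the majority class of the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

train = ["no injury", "no injury", "injury"]
test_true = ["injury", "no injury", "no injury"]

dummy = most_frequent_class(train)           # → "no injury"
test_pred = [dummy] * len(test_true)
print(confusion_matrix(test_true, test_pred, ["no injury", "injury"]))
```

Such a baseline is useful because, with imbalanced classes, raw accuracy alone can make even a trivial model look good.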

Table 2. Scores in a Multilingual Setting, Using distilbert-base-multilingual-cased, Evaluated on the Test Set

Figure 8. Confusion matrices in a multilingual setting, using distilbert-base-multilingual-cased, evaluated on the test set.

Figure 9. Domain-specific pre-training by masked language modelling.
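The masking step at the heart of the pre-training objective in Figure 9 can be sketched as follows: a fraction of tokens is replaced by a `[MASK]` token and the model is trained to recover the originals at exactly those positions. Only the masking is shown here; the model and the 15% mask probability are conventional choices, and the example sentence is made up.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked tokens, positions whose original token must be predicted)."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(i)       # the loss is computed only at masked positions
        else:
            masked.append(tok)
    return masked, targets

tokens = "the driver was transported to the hospital".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked, targets)
```

Because the labels are the input tokens themselves, this pre-training needs no manual annotation: any in-domain corpus, such as the claims descriptions, can be used directly.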

Table 3. Scores, Evaluated on the Test Set, in a Multilingual Setting, Using distilbert-base-multilingual-cased with 2 Epochs of Masked Language Modelling on the Mixed-Language Training Set

Figure 10. Confusion matrices, evaluated on the test set, in a multilingual setting with 2 epochs of masked language modelling on the mixed-language training set.

Table 4. Scores, Evaluated on the Test Set, in a Multilingual Setting, Using distilbert-base-multilingual-cased Without Fine-Tuning, with 2 Epochs of Task-Specific Pre-Training

Table 5. Scores of the Different Approaches, Evaluated on the Test Set

Figure 11. Confusion matrices, evaluated on the test set. The result from task-specific fine-tuning is shown in Figure 12.

Table 6. Scores of the Different Approaches, Evaluated on the Test Set. Since We Have Not Implemented Logic to Combine the Predicted Probabilities of the Different Chunks, the Log Loss and Brier Loss Cannot Be Evaluated in This Case

Figure 12. Confusion matrices, evaluated on the test set.

Figure 13. Using extractive question answering to extract shorter text sequences from the raw text.

Figure 14. Extracted answers to the questions “Was someone injured?” and “Was someone transported?” from the example shown in Figure A5 (SCASEID=2007043731967). The first four sentences are the candidate answers to the first question. The last four sentences are the candidate answers to the second question. The person who was not injured is the driver of V1; the person transported to hospital is the driver of V2.

Table 7. Scores Obtained by the Different Approaches, Evaluated on the Test Set

Figure 15. Confusion matrices corresponding to the second-to-last items of Table 7.

Figure 16. Using zero-shot classification to assign probabilities to pre-defined expressions.
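The decision step behind the zero-shot approach of Figure 16 can be sketched as follows: the classifier returns a probability per candidate expression, and (as in the "adjusted threshold" variant of Table 9) a claim whose best score falls below a threshold is routed to a residual class. The scores, labels and threshold below are hypothetical; in practice the probabilities come from an NLI-based zero-shot pipeline.

```python
def assign_label(scores, threshold=0.5, fallback="Other"):
    """Pick the highest-scoring expression, or the fallback class if none is confident."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else fallback

# Hypothetical zero-shot probabilities for two claims descriptions.
print(assign_label({"Fire": 0.82, "Wind": 0.10, "Vandalism": 0.08}))  # confident
print(assign_label({"Fire": 0.40, "Wind": 0.35, "Vandalism": 0.25}))  # ambiguous
```

Raising the threshold trades recall on the named hazard types for precision, at the cost of losing calibrated probabilities for the residual class.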

Table 8. Expressions and Mapping to Hazard Types. Note That We Have Mapped Two Different Expressions for the Hazard Type "Vandalism"

Figure 17. Confusion matrices, evaluated on the test set.

Table 9. Scores of the Different Approaches, Evaluated on the Test Set. For the Zero-Shot Model with Adjusted Threshold, Predicted Probabilities Are Not Available; Therefore, the Log Loss and Brier Loss Are Not Shown. The Sentence Similarity Approach Will Be Discussed in the Next Section

Figure 18. Confusion matrices, evaluated on the test set.

Figure 19. Using sentence similarity to perform unsupervised classification.
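The mechanism of Figure 19 reduces to a nearest-neighbour search under cosine similarity: the claim description and each candidate label expression are embedded, and the claim receives the label of the closest expression. The three-dimensional embeddings below are made-up toy vectors; in practice they come from a pre-trained sentence encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(claim_embedding, label_embeddings):
    """Assign the label whose expression embedding is closest to the claim."""
    return max(label_embeddings,
               key=lambda lab: cosine(claim_embedding, label_embeddings[lab]))

label_embeddings = {                 # toy embeddings of the label expressions
    "Fire":      [0.9, 0.1, 0.0],
    "Wind":      [0.1, 0.9, 0.0],
    "Vandalism": [0.0, 0.1, 0.9],
}
claim = [0.8, 0.2, 0.1]              # toy embedding of a claim description
print(classify(claim, label_embeddings))
```

Because the label expressions are embedded only once, this approach scales well: classifying a new claim costs one encoder call plus a handful of dot products.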

Table 10. Expressions and Mapping to Hazard Types

Figure 20. Confusion matrices, evaluated on the test set.

Figure 21. Using topic clustering to generate document labels in an unsupervised setting.

Figure 22. Topic word scores for the first four identified topics. The topics appear in descending order of number of allocated samples. The first and third topics are mapped to "Vandalism," Topic 1 is mapped to "Fire" and Topic 3 to "Wind".

Figure 23. The matrix shows one column per topic. The shading indicates the frequency distribution of labels within a given topic. The presence of a single dark patch indicates that almost all of the samples of the topic are associated with a single label.

Figure 24. Confusion matrices, comparing labels obtained from topic clustering with the true labels, evaluated on the test set.

Figure A1. Distribution of NUMTOTV, the number of vehicles involved.

Table A1. Summary of Dataset Columns

Figure A2. Box plot of the length of the English case descriptions (counting words split by empty space) by number of vehicles involved. The minimum/average/maximum length is 60/419/1248 words, respectively.

Figure A3. Sample of SUMMARY_EN and SUMMARY_DE (SCASEID = 200501269400).

Figure A4. Word importance visualisation for a true positive example (SCASEID=2006008500862). Apparently, the most important word is “injuries.” The original text is padded to a length of 512 tokens (not all shown in the exhibit).

Figure A5. Word importance visualisation for a true positive example (SCASEID = 2007043731967). The original text is truncated to a length of 512 tokens. Apparently, the most important words are “transported” and “hospital.”

Figure A6. Word importance visualisation for a false positive example (SCASEID = 2007002229650). The original text is truncated to a length of 512 tokens. Apparently, the words “was taken” and “hospital” are most important, as in Figure A5. However, in this case it was only for a check out, and as such there is no strong evidence of a bodily injury in this case description.

Table A2. Distribution of INJSEVA and INJSEVB

Table A3. Sample of the LPGIF Data Set