A framework for applying data science techniques to health and care actuarial projects

Published online by Cambridge University Press:  14 May 2026

Johannes Michiel Luteijn*
Affiliation:
Hannover Re UK Life Branch, Hannover Re, London, UK
Jacky Tam
Affiliation:
Verisk, London, UK
Rebecca Denis
Affiliation:
Independent Scholar
Fiona Fan
Affiliation:
General Reinsurance Corporation, London, UK
Jaskaran Minhas
Affiliation:
Reinsurance Group of America, London, UK
Dhanesh Kadan Puthanveedu
Affiliation:
TAL Australia, Sydney, Australia
Abigail Takyi
Affiliation:
Independent Scholar
*Corresponding author: Johannes Michiel Luteijn; Email: michiel.luteijn@hannover-re.com

Abstract

The fast-moving field of data science is increasingly permeating the health and care actuarial sciences. Against this background, the Institute and Faculty of Actuaries formed a “techniques in data science in health and care” working party. The working party was tasked with creating a framework to help actuaries working in the health and care domain determine which techniques are appropriate for a project. The framework presented here was developed through a combination of literature review and synthesis of expert opinion from experienced practitioners with diverse backgrounds. It offers a structured, itemised approach and serves as a checklist to ensure that all relevant analytics and decisions are considered and documented. Each itemised topic is accompanied by a summary that provides guidance and relevant references for further reading. The checklist follows the natural workflow of a data analytics project, guiding users through each step to prevent omissions and maintain rigour in analysis, reporting and peer review. The framework blends relevant analytics elements from actuarial science, data science and epidemiology. We hope the framework will enhance transparency, reproducibility and comprehensiveness in the reporting and peer review of health and care data analytics projects.

Information

Type
Sessional Paper
Creative Commons
Creative Commons Licence - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Institute and Faculty of Actuaries, 2026. Published by Cambridge University Press on behalf of The Institute and Faculty of Actuaries
Table 1. Framework checklist

Table 2. An adapted version of the FINER criteria for a good research question and project plan (Hulley et al., 2007)

Table 3. Consistency checks classes

Table 4. Overview of feature engineering methods for numeric features

Figure 1. How to incorporate cross-validation into target encoding to prevent data leakage in training data with 3 random folds.

Figure 2. CMI expected term assurance mortality rates by age: smokers (orange) versus non-smokers (blue) (all genders, all durations; authors’ analysis).

Table 5. Main ensemble learning methods

Figure 3. XGBoost tree structure – before and after pruning, with min split loss $\gamma = 0.05$.

Figure 4. LightGBM grows trees leaf-wise (asymmetric), while CatBoost grows trees level-wise with symmetric structure.

Table 6. Comparison of functionalities between XGBoost, LightGBM and CatBoost

Figure 5. A comparison of a fully connected, feedforward NN (left) and a GLM-like architecture with sparser connectivity (right), both using two hidden layers.

Figure 6. Lift curve comparison between GAM predictions (blue) and CMI mortality table (orange).

Figure 7. Double lift plot between actual observed (blue solid line), GAM predictions (green dashed line) and CMI mortality table (orange dotted line). The bars represent life years exposure (%).

Figure 8. Poisson deviance loss by number of trees during XGBoost model training for training data (solid blue line) and validation data (dashed orange line).

Figure 9. Actual versus expected claim rates by policyholder age, comparing GAM (solid green line) and CMI predictions (solid orange line) against observed outcomes (the dashed grey line). The bars represent life years exposure (%).

Table 7. Bradford-Hill criteria