Modelling socio-economic mortality at neighbourhood level

Abstract In this study, we quantify the relationship between socio-economic status and life expectancy and identify combinations of socio-economic variables that are particularly useful for explaining mortality differences between neighbourhoods in England. We achieve this by examining socio-economic variation in mortality experiences across small areas in England known as lower layer super output areas (LSOAs). We then consider 12 socio-economic variables that are known to have a strong association with mortality. We estimate the relationship between those variables and mortality rates using a random forest algorithm. Based on the resulting estimate, we then create a new socio-economic mortality index – the Longevity Index for England (LIFE). The index is constructed in a way that eliminates the impact of care homes that might artificially increase mortality rates in LSOAs with care homes compared to LSOAs that do not contain a care home. Using mortality data for different age groups, we make the index age-dependent and investigate the impact of specific socio-economic characteristics on the age-specific mortality risk. We compare the explanatory power of the LIFE index to the English Index of Multiple Deprivation (IMD) as predictors of mortality. While we find that the IMD can explain regional mortality differences to some extent, the LIFE index has significantly greater explanatory power for mortality differences between regions. Our empirical results also indicate that income deprivation amongst the elderly and employment deprivation are the most significant socio-economic factors for explaining mortality variation across LSOAs in England.


© The Author(s), 2023. Published by Cambridge University Press on behalf of The International Actuarial Association. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.

Introduction
It is well documented that there is a strong association between mortality and socio-economic status. While this relationship has been known for many years, the availability of more granular data allows us to look more closely at the impact of socio-economic characteristics on mortality. In recent years, there have been numerous studies on this relationship: Bennett et al. (2015) discuss modelling life expectancies in different areas of England and Wales via a Bayesian model with spatial effects; Raleigh and Kiri (1997) study trends in life expectancy in relation to deprivation; Woods et al. (2005) describe mortality in England and Wales by deprivation and in each government office region during 1998; Cairns et al. (2019) identify different socio-economic groups in Denmark and model their mortality rates using an affluence index; and Wen et al. (2021) explore mortality rates in populations identified by deciles of the English Index of Multiple Deprivation (IMD). Deciles of the IMD index are also used by Lyu et al. (2022) to identify three socio-economic groups that are then used to study the effectiveness of an index-based longevity hedge for which the three groups are modelled with a generalised three-way Li-Lee model. Mayhew et al. (2020) also use the IMD to measure deprivation and study its impact on demographic differences measured by life expectancy, lifespan variation and mortality. They also consider differences in those quantities between different geographic areas. Finally, the IMD was also used by Villegas and Haberman (2014), who proposed a new approach for the joint modelling of the mortality in subpopulations of a larger national population. All of those studies documented significant differences in mortality levels and mortality improvement rates between socio-economic classes.
In this paper, we develop a socio-economic mortality index that we call the Longevity Index for England (LIFE). The LIFE index is designed to enable us to predict mortality risk in small neighbourhoods across England relative to national mortality based on specific socio-economic data for such neighbourhoods. In other words, the LIFE index models the relative risk of dying within a small population in England as a function of certain socio-economic variables. The index is created based on a regression analysis using a random forest algorithm to estimate a non-parametric regression function.
We consider a total of 12 socio-economic factors that are known to be linked to mortality. Many are domains or subdomains of the English indices of deprivation, in particular the English Index of Multiple Deprivation (IMD). Those indices are constructed and published by the Ministry of Housing, Communities & Local Government in the United Kingdom. Others are derived from the 2011 national census. We use the indices of deprivation published in 2015, see Smith et al. (2015). The indices of deprivation provide a score for small neighbourhoods, called lower layer super output areas (LSOAs), in different domains of deprivation. An LSOA is a small area in England with a population of about 1500 people. For the 2015 indices of deprivation, there were 32,844 LSOAs in England.
The main difference from the IMD-based research mentioned above is that we consider the impact of individual variables on mortality in very small neighbourhoods rather than grouping the data by one measure (IMD deciles) into rather large groups that are then considered to be homogeneous.
Two of the 12 variables used to explain mortality in LSOAs measure the proportion of an LSOA's population that lives in care homes. Since the existence of care homes has the potential to artificially increase the mortality in an LSOA, we adjust those variables for the construction of the LIFE index.
In the empirical part of this paper, we investigate the explanatory power of the LIFE index in comparison to the English Index of Multiple Deprivation. We find that the LIFE index is better able to explain regional differences in mortality than the IMD. This comparison is based on the analysis of age- and deprivation-standardised mortality rates (ADSMRs). We explain the construction of ADSMRs and show that large mortality differences between regions remain unexplained when ADSMRs are based on the IMD. Applying LIFE-based ADSMRs reduces those regional differences significantly, indicating that the LIFE index is better able to capture mortality-related factors. We also study the impact of specific predictors on LIFE scores. It turns out that old-age income deprivation is the most powerful predictor of mortality amongst those considered.
As mentioned above the index is based on an application of the random forest algorithm to estimate the non-parametric regression function linking mortality to the 12 predictors, see Breiman (2001) and James et al. (2013). This approach offers a high degree of flexibility, does not require assumptions about the underlying relationship between predictive variables, and we show that it is an effective tool for analysing large and complex mortality datasets. The random forest estimator is a well-understood and widely used method from machine learning, and it has been applied in mortality modelling as an alternative to parametric models, see for example, Bjerre (2022), Hong et al. (2021), Levantesi and Nigri (2020) and Levantesi and Pizzorusso (2019). While those authors apply the random forest method to improve the goodness-of-fit and the predictive power of mortality models, we use the method to analyse the impact of socio-economic characteristics on higher or lower mortality in any given neighbourhood compared to the national mortality levels for a population with a similar age structure.
Estimating the effect of socio-economic factors on the deviation of LSOA-specific mortality from national mortality is a regression problem. As for many other regression problems, there are several estimation methods to choose from. We have chosen the random forest method as it offers a high degree of flexibility but, in contrast to other non-parametric methods, requires only a relatively small number of hyperparameters to be chosen. In addition, computations are rather fast. Wen (2022) applies other estimators to explain the impact of socio-economic factors on mortality. He finds that non-parametric methods (random forest and local linear regression) outperform generalised linear models in terms of an out-of-sample mean squared error. It seems that the structure of the generalised linear models considered by Wen (2022) is not sufficiently flexible to capture non-linear effects of certain factors or the joint effects that some factors might have on mortality. It might be possible to extend the analysis by Wen (2022) by including GLMs with more factors or interaction terms between certain factors, but this is not in the scope of this study.
The remainder of this paper is organised as follows. In Section 2, we describe the mortality data and socio-economic data used in our study, and in Section 3, we introduce the LIFE index. Section 4 provides an overview of the random forest method and shows how it is applied in the context of this paper. In this section, we also study the performance of this method for the data in our study. We then investigate the impact of different age groups on the LIFE index ranks of LSOAs in Section 5 and compare the LIFE index to the IMD in Section 6. We then apply the LIFE index to study the distribution of low and high mortality groups across urban and rural LSOAs in Section 7 and analyse the impact of individual variables in Section 8. We return to a comparison of the LIFE index with the IMD in Section 9, where we consider mortality rates in LIFE deciles and IMD deciles and compare ADSMRs based on the two indices. Our final conclusions are presented in Section 10.

Data
The data used in this paper are for England and have been sourced from the UK's Office for National Statistics (ONS). Further details can be found in the supplementary material published online and in Wen (2022).
Socio-economic data and mortality data are available at a neighbourhood level called LSOA. An LSOA is a geographical unit that describes a small neighbourhood with a population size of around 1600, and generally with a high degree of socio-economic homogeneity within each LSOA. The number of LSOAs and their boundaries vary from time to time as populations change. This paper uses the revisions based on the 2011 Census, and there are $N = 32{,}844$ LSOAs. The data described below are available for each of the $N$ LSOAs.
The specific data considered in our study are the following:
• mid-year population estimates (exposure size) $E_{ita}$ by single LSOA $i = 1, \dots, N$, year $t$ and age $a$;
• death counts $D_{ita}$ by single LSOA $i = 1, \dots, N$, year $t$ and age $a$;
• a vector of $K$ predictive variables $X_i = (X_{i,1}, \dots, X_{i,K})$ for each LSOA $i = 1, \dots, N$. These data are not year or age-specific but describe socio-economic characteristics of the entire population of an LSOA measured at a specific point in time. Details about the predictive variables used in this study are provided in Section 2.2.
The mortality data, $E_{ita}$ and $D_{ita}$, are available for calendar years 2001-2018 by single year of age. As the total exposure in any individual LSOA is very small, we will group ages for the construction of the mortality index. In this study, we will focus on three age groups: 60-69, 70-79 and 80-89. Note that the boundaries of some LSOAs have changed during our observation period 2001-2018. All data used in this study are based on LSOA boundaries used in the 2011 census.

Mortality data and relative risk
In this study, we model the relative mortality risk in an individual LSOA $i \in \{1, \dots, N\}$ compared to the average mortality in England. To define our measure of relative risk, we first define a baseline death rate $m^b_{ta}$ for year $t$ and age $a$ for the whole of England in the usual way:
$$ m^b_{ta} = \frac{\sum_{i=1}^{N} D_{ita}}{\sum_{i=1}^{N} E_{ita}}. \tag{2.1} $$
The model will be fitted using data from years $T$ and age range $A$. Without any additional information, the expected total number of deaths $\hat{D}^0_i$ across all ages $a \in A$ and years $t \in T$ in LSOA $i$ is given by
$$ \hat{D}^0_i = \sum_{t \in T} \sum_{a \in A} m^b_{ta} E_{ita}, $$
and we define the observed relative risk of death $R^0_i$ for an individual living in LSOA $i$ as the ratio of the actual number of deaths to the expected number of deaths in that LSOA, that is,
$$ R^0_i = \frac{\sum_{t \in T} \sum_{a \in A} D_{ita}}{\hat{D}^0_i}. $$
With our definition, the realised relative risk $R^0$ in any neighbourhood is a random variable since the realised number of deaths is random. In the following, we are interested in modelling the conditional expectation of $R^0$ given a vector of socio-economic characteristics. Note that the relative risk $R^0_i$ is not age and year specific. However, as mentioned above, we will calculate and model the relative risk using mortality data for different age ranges $A$, see Section 5 for details.
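The three quantities defined above (baseline death rate, expected deaths and observed relative risk) can be sketched in a few lines. This is a minimal illustration only: the dictionary layout keyed by (LSOA, year, age), the function name `relative_risk` and the toy numbers are assumptions made for the example and are not taken from the study.

```python
from collections import defaultdict


def relative_risk(deaths, exposures):
    """Observed relative risk per LSOA.

    deaths, exposures: dicts keyed by (lsoa, year, age) -> number.
    The key layout is an assumption made for this sketch.
    """
    # Baseline death rate for each (year, age): national deaths / national exposure.
    national_deaths = defaultdict(float)
    national_exposure = defaultdict(float)
    for (i, t, a), e in exposures.items():
        national_exposure[(t, a)] += e
        national_deaths[(t, a)] += deaths.get((i, t, a), 0.0)
    baseline = {
        ta: national_deaths[ta] / national_exposure[ta]
        for ta in national_exposure
        if national_exposure[ta] > 0
    }

    # Expected deaths under the baseline, and observed deaths, per LSOA.
    expected = defaultdict(float)
    observed = defaultdict(float)
    for (i, t, a), e in exposures.items():
        expected[i] += baseline.get((t, a), 0.0) * e
        observed[i] += deaths.get((i, t, a), 0.0)

    # Relative risk: observed deaths divided by expected deaths.
    return {i: observed[i] / expected[i] for i in expected if expected[i] > 0}


# Toy data (hypothetical): two LSOAs, one year, one age.
exposures = {("A", 2001, 65): 100.0, ("B", 2001, 65): 100.0}
deaths = {("A", 2001, 65): 1.0, ("B", 2001, 65): 3.0}
risk = relative_risk(deaths, exposures)
```

In the toy data the baseline rate is 4/200, so each LSOA has two expected deaths, and the LSOA with one observed death has a relative risk below one while the other has a relative risk above one.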

Socio-economic characteristics
In Section 3, we will construct an index that explains differences in the mortality rates in different LSOAs based on differences in their socio-economic characteristics. In Wen (2022), a large universe of predictive variables for LSOA-specific mortality rates was considered. Based on findings there, we restrict our attention in this paper to 12 variables. They are listed in Table 1. Further details about those 12 variables, including data sources, can be found in the supplementary material published online.
The possible values of the first nine numerical variables $x_1, \dots, x_9$ in Table 1 are on very different scales. For the purpose of visualisation, we standardise them to have mean zero and variance one. Details of the standardisation procedure can be found in the supplementary material.
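As a rough illustration of standardising a variable to mean zero and variance one, the sketch below centres and scales a list of values. The actual procedure is described in the supplementary material and may differ; the use of the population standard deviation and the function name `standardise` are assumptions made here.

```python
from statistics import mean, pstdev


def standardise(values):
    """Centre and scale values to mean zero and variance one.

    Assumption: the population standard deviation is used for scaling.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant variable")
    return [(v - mu) / sigma for v in values]


# Toy values (hypothetical), e.g. a raw score observed in four LSOAs.
z = standardise([1.0, 2.0, 3.0, 4.0])
```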
Variable x 10 is a categorical variable representing the urban-rural class of an LSOA and taking one of five values listed in Table 2.
In summary, the socio-economic characteristics of any neighbourhood are given as a vector taking values in the $K = 12$ dimensional space
$$ L_0 = \mathbb{R}^9 \times \{1, 2, 3, 4, 5\} \times [0, 1]^2. \tag{2.3} $$
Note that our urban-rural class indicator $x_{10}$ distinguishes between urban conurbation in London and outside London. We have introduced that distinction as we found in previous research that mortality rates in London are rather different from mortality rates in other parts of England, see Cairns et al. (2021) and Wen (2022).
It is to be expected that the covariates in Table 1 are correlated. We report the empirical correlations in Table 3. We observe in Table 3 that there are some strong correlations, but we argue that none of the observed correlations is so strong that a variable should be removed. The few strong correlations we observe might be seen as problematic for parametric models as the parameters for individual covariates might not be identifiable when one variable can act as a proxy for another variable. However, since we have 32,844 observations, the inclusion of highly correlated variables is still meaningful. We would also argue that there is no surprise in the correlation table: for example, $x_1$ is positively correlated with $x_2$, but negatively correlated with $x_5$ and $x_8$, meaning that the higher the level of income deprivation, the higher the level of employment deprivation, the smaller are the houses, and the fewer are working in management positions. For completeness, we also report correlation tables for individual urban-rural classes in the supplementary material published online. The general conclusions from those tables are similar to those obtained from correlations across all LSOAs. However, correlations with the more-minor variables $x_3$ to $x_9$ do vary more between different urban-rural classes.

Table 1. Predictive variables.
Variable   Description
x_1        Old-age income deprivation
x_2        Employment deprivation (i.e. unemployment)
x_3        Proportion of the age-65+ population with no qualifications
x_4        Crime rate
x_5        Average number of bedrooms
x_6        Proportion of the population born in the UK
x_7        Wider barriers to housing (affordability, homelessness)
x_8        Employment/occupation: proportion in a management position
x_9        Proportion working more than 49 h per week (ages 16-74)
x_10       Urban-rural classification
x_11       Proportion of population aged 60+ in a care home with nursing care
x_12       Proportion of population aged 60+ in a care home without nursing care

Table 2. Values of the urban-rural classification x_10.
Value   Urban-rural class
1       Urban conurbation (except London)
2       Urban city and town
3       Rural town and village
4       Rural hamlet and isolated dwellings
5       Urban conurbation (in London)

Table 3. Correlations between the covariates, see Table 1 for details about the covariates. Empirical correlations have been calculated using all LSOAs in England and Wales regardless of their urban-rural classification. Note that x_10 (urban-rural classification) is not included in the table.
Columns: x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_11

The Longevity Index for England (LIFE)
As mentioned above, our aim in this paper is to construct an index that will explain the expected relative mortality risk in any neighbourhood based on the socio-economic characteristics of that neighbourhood. We call this index the Longevity Index for England, hereafter, the "LIFE index".

Modelling relative mortality risk
As a starting point, we model the conditional expectation of the relative mortality risk $R^0$ given characteristics $x$:
$$ f(x) = \mathrm{E}\left[ R^0_i \mid X_i = x \right], \tag{3.1} $$
where $X_i = (X_{i,1}, \dots, X_{i,K})$ is the vector of predictive socio-economic variables taking values in a $K$-dimensional space $L_0$ of possible realisations of $X_i$. For the 12 variables in our empirical study (see Table 1), $L_0$ is the 12-dimensional space defined in (2.3).
To estimate the regression function $f$, we will use a supervised machine learning algorithm called a random forest (RF), and we will denote this estimator of $f$ by $\hat{f}^{RF}$. Details of the estimation procedure are given in Section 4. Let us mention that other estimators for $f$ could be used. For example, Wen (2022) compares the RF estimator with a local linear regression estimator.

Care homes
When we construct the LIFE index as an estimator for the relative risk in any LSOA, we need to take into account that in our sample of LSOAs there are some with care homes and some without care homes. Clearly, if a significant proportion of individuals in any LSOA are living in a care home, then this will increase the mortality rate in that LSOA and, therefore, increase the relative risk. However, this does not then properly reflect the main socio-economic characteristics of the LSOA (x 1 , . . . , x 10 ).
To offset this effect when constructing the LIFE index in the next section, we will make assumptions about the proportion of people living in a care home for any LSOA rather than using the actual proportion of people living in care homes in that LSOA. In other words, we are trying to answer the question: What would be the relative risk of dying in LSOA $i$ if we kept all socio-economic variables at the values observed in that LSOA, but changed the proportion of people living in care homes to the average for the whole of England, or to some other chosen value?

The LIFE index
Based on the discussion so far, we now define our Longevity Index for England as the value of $f$ for specific neighbourhoods using the socio-economic characteristics of this neighbourhood but replacing the proportion of people living in care homes with the average for the whole of England. More precisely, we define the LIFE index for LSOA $i$ as
$$ \mathrm{LIFE}_i = f\left( X_{i,1}, \dots, X_{i,9}, X_{i,10}, \bar{X}_{11}, \bar{X}_{12} \right), \tag{3.2} $$
where $\bar{X}_{11}$ and $\bar{X}_{12}$ denote the average values of the proportion of an LSOA's population living in care homes with nursing and care homes without nursing, respectively. In replacing the true values $X_{i,11}$ and $X_{i,12}$ with the mean values $\bar{X}_{11}$ and $\bar{X}_{12}$, it is helpful to note from Table 3 that $x_{11}$ and $x_{12}$ have a very low correlation with other socio-economic variables. First, this implies that care homes are not concentrated in neighbourhoods with particular socio-economic characteristics. Second, the lack of correlation means that when we replace $X_{i,11}$ and $X_{i,12}$ with their mean values, we do not need to alter the values of other predictive variables to compensate: that is, the presence of a care home does not artificially inflate or deflate an LSOA's other socio-economic, predictive variables. Note that the index could be constructed with other adjustments to the care home variables. For example, we could choose to calculate the index based on setting $X_{i,11} = X_{i,12} = 0$. That would also be a good choice to model the relative mortality risk of the population not living in care homes. However, we prefer the setting in (3.2) as we will use the index to calculate life expectancies as a function of socio-economic characteristics. Setting $X_{i,11} = X_{i,12} = 0$ would implicitly assume that no individuals will ever be in a care home, which is, of course, not reasonable. Additionally, base mortality $m^b_{ta}$ incorporates excess care home deaths, and so we prefer to reflect this also in the index values.
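The care-home adjustment described above amounts to evaluating the fitted regression function at a modified covariate vector. The following sketch shows the idea for an arbitrary fitted function; `f_hat`, the row layout (care-home proportions in the last two positions) and the toy coefficients are hypothetical and stand in for the fitted RF estimator and the real data.

```python
def life_index(f_hat, X, care_cols=(10, 11)):
    """Evaluate f_hat with the care-home covariates replaced by their means.

    X: list of covariate rows (lists of length 12, zero-based indices).
    care_cols: positions of the two care-home proportions (assumed layout).
    """
    n = len(X)
    means = {c: sum(row[c] for row in X) / n for c in care_cols}
    scores = []
    for row in X:
        adjusted = list(row)          # leave the observed data untouched
        for c in care_cols:
            adjusted[c] = means[c]    # replace by the average over all rows
        scores.append(f_hat(adjusted))
    return scores


def f_hat(row):
    # Hypothetical stand-in for a fitted regression function.
    return 1.0 + 0.1 * row[0] + 2.0 * row[10]


# Two toy neighbourhoods (hypothetical values); index 9 is the urban-rural class.
row_a = [0.0] * 9 + [1] + [0.0, 0.0]
row_b = [1.0] + [0.0] * 8 + [1] + [0.2, 0.0]
scores = life_index(f_hat, [row_a, row_b])
```

Passing zeros instead of the means would give the alternative adjustment mentioned in the text, in which nobody is assumed to live in a care home.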
Let us mention that our index is based on the conditional expectation in (3.1) and, therefore, we can calculate the relative risk of dying for any values of the socio-economic variables; they do not need to be those observed in a specific LSOA. In other words, we can calculate the LIFE score for fictional neighbourhoods with specified socio-economic characteristics. This allows us to investigate how sensitive the relative mortality risk is with respect to changes in certain socio-economic variables. We will return to that point in Section 8.
From the construction of the LIFE index in (3.2), it is clear that the index is not explicitly age specific. Instead, it is an index summarising the socio-economic characteristics of all members of a small community regardless of their age. However, the LIFE index relies on an estimate of the relative risk function f in (3.2) that links socio-economic characteristics to death rates. The age range for death counts and exposures used to estimate the function f will of course have an impact on the obtained LIFE index values. We will investigate this further in Section 5.

Estimating the relative risk using the RF algorithm
As mentioned in Section 3, we can use a wide variety of nonparametric estimators for the regression function f in (3.1). In this study, we will use the RF algorithm. As this estimation step is at the heart of our index construction, we will explain our approach in detail.

Overview
Our RF algorithm consists of three stages. For each stage, we use R (R Core Team, 2021) and the R package randomForest (Liaw and Wiener, 2002).

Stage 1
The purpose of the first stage is exclusively to choose certain hyperparameters; we provide more details about the hyperparameters below. For this stage, we randomly split our data set, $S = \{1, \dots, N\}$, into two disjoint subsets:
• the training set, $S_{\text{train}} \subset S$, contains LSOAs used to "train" our model, that is, to choose optimal parameters determining $\hat{f}^{RF}$, see below for details; and
• the validation set, $S_{\text{val}} \subset S$, is the set of LSOAs used for model validation. In the first stage, data in the validation set are used for selecting hyperparameters; again, we explain details below.
Note that $S_{\text{train}} \cap S_{\text{val}} = \emptyset$ and $S_{\text{train}} \cup S_{\text{val}} = S$. The parameter optimisation for the observations in the training set is repeated for each possible choice of hyperparameters. We then select hyperparameters for which $\hat{f}^{RF}$ produces the best fit for the observations in the validation set, and those values are then fixed for the hyperparameters in the second stage.

Stage 2
In the second stage, we split our complete data set again into two subsets, allocating LSOAs randomly (and independently of the allocation in the first stage) to either:
• the training set, $S_{\text{train}} \subset S$, containing LSOAs used to choose optimal parameters determining $\hat{f}^{RF}$ using the optimal hyperparameters determined in stage one; or
• the test set, $S_{\text{test}} \subset S$, which is a subset of LSOAs that are only used for evaluating how well our estimated function $\hat{f}^{RF}$ (fitted to data in $S_{\text{train}}$) can predict the relative risk in out-of-sample LSOAs.
Using the values for the hyperparameters obtained in the first stage, we fit our estimator $\hat{f}^{RF}$ to the observations in our new training set and evaluate the goodness-of-fit using the test set.
So, both stages follow the same idea, but in the first we refit $\hat{f}^{RF}$ to the stage-one training set many times to choose optimal hyperparameters, while in stage two $\hat{f}^{RF}$ is fitted to the stage-two training set only once to assess the out-of-sample goodness-of-fit.
In this paper, we split the set of all $N = 32{,}844$ LSOAs into two equally sized disjoint sets to obtain a training set and a validation or test set in the two stages.
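The random splits used in stages one and two can be sketched as follows. This is an illustration only: the seeds, the function name `split_half` and the use of integer LSOA identifiers are assumptions, and the study itself was carried out in R.

```python
import random

N = 32844  # number of LSOAs in the study


def split_half(n, seed):
    """Randomly split {0, ..., n-1} into two disjoint, equally sized sets."""
    ids = list(range(n))
    random.Random(seed).shuffle(ids)
    half = n // 2
    return set(ids[:half]), set(ids[half:])


# Stage 1: training and validation sets (used to choose hyperparameters).
train_1, val = split_half(N, seed=1)
# Stage 2: a fresh split, drawn independently of stage 1 (training and test sets).
train_2, test = split_half(N, seed=2)
```

Each half contains 16,422 LSOAs, and using a separate seed for stage two keeps the second allocation independent of the first.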

Stage 3
In the final stage of the estimation procedure, we run the RF algorithm with all chosen hyperparameters for the full set of $N = 32{,}844$ LSOAs to produce the final estimate of the regression function $f$ in (3.1) and obtain the LIFE index values from (3.2).

Fitting a single tree
A RF consists of B > 1 regression trees also known as decision trees. We will here briefly discuss how each tree is constructed as a crude estimator of the regression function f in (3.1). In the next section, we will then turn to combining many trees into a RF.
For fitting an individual tree with index $b \in \{1, \dots, B\}$, we only use a subset $S_b$ of the LSOAs in the training set $S_{\text{train}}$ in both stages. The procedure for growing a tree is the same for stages one and two. The choice of $S_b$ is explained in Section 4.3.
Constructing an individual tree is an iterative procedure. We start with defining our initial estimator $\hat{f}^{(b)}_0$ as the average of all observed values of the relative risk $R^0$ of the LSOAs in the set $S_b$, that is,
$$ \hat{f}^{(b)}_0(x) = \frac{1}{|S_b|} \sum_{i \in S_b} R^0_i \quad \text{for all } x \in L_0. $$
In the next step, we choose one explanatory variable, say $x_{k^*}$, and a level $l^*$, and split the initial node $L_0$ into the two disjoint subsets:
$$ L^b_{1,1} = \{ x \in L_0 : x_{k^*} < l^* \}, \tag{4.1} $$
$$ L^b_{1,2} = \{ x \in L_0 : x_{k^*} \ge l^* \}. \tag{4.2} $$
This procedure is now repeated but, in addition to choosing an explanatory variable $x_{k^*}$ and a threshold level $l^*$, we also choose one of the sets (nodes) $L^b_{1,1}$ and $L^b_{1,2}$ which we then split in the next step. Starting with $s = 1$ and the two nodes defined in (4.1) and (4.2), we now apply the following iterative procedure:
• Choose one subset $L^b_{s,j^*}$ ($j^* \in \{1, \dots, s+1\}$) out of the $s + 1$ subsets formed by the first $s$ splits. Also, choose an explanatory variable $x_{k^*}$ and a threshold $l^*$.
• Split $L^b_{s,j^*}$ into two subsets and leave all other subsets unchanged.
With this procedure, we have the following nodes available after $s + 1$ splits:
$$ L^b_{s+1,j} = L^b_{s,j} \quad \text{for } j \in \{1, \dots, s+1\} \setminus \{j^*\}, \tag{4.3} $$
$$ L^b_{s+1,j^*} = \{ x \in L^b_{s,j^*} : x_{k^*} < l^* \}, \tag{4.4} $$
$$ L^b_{s+1,s+2} = \{ x \in L^b_{s,j^*} : x_{k^*} \ge l^* \}. \tag{4.5} $$
Equation (4.3) states that split $s + 1$ does not affect any nodes other than $L^b_{s,j^*}$. Equations (4.4) and (4.5) mean that all LSOAs with characteristics $x$ in node $L^b_{s,j^*}$ for which $x_{k^*} \ge l^*$ are put into a new node $L^b_{s+1,s+2}$ so that only those LSOAs with characteristics in $L^b_{s,j^*}$ and $x_{k^*} < l^*$ remain in that node and are then contained in $L^b_{s+1,j^*}$ after $s + 1$ splits. We now define the estimator $\hat{f}^{(b)}_s(x)$ of the regression function $f$ in (3.1) obtained from one tree $b$ after splitting $L_0$ into $s + 1$ nodes as
$$ \hat{f}^{(b)}_s(x) = \sum_{j=1}^{s+1} r_j \, \mathbf{1}\{ x \in L^b_{s,j} \}, $$
where $r_j$ is the average observed relative risk $R^0$ for LSOAs $X_i$ in node $j$, that is,
$$ r_j = \frac{\sum_{i \in S_b} R^0_i \, \mathbf{1}\{ X_i \in L^b_{s,j} \}}{\sum_{i \in S_b} \mathbf{1}\{ X_i \in L^b_{s,j} \}}. $$
As explained above, for each new split, we need to choose an existing node $L^b_{s,j^*}$, an explanatory variable $x_{k^*}$ and a threshold $l^*$. Those are chosen such that the fit of the new estimator $\hat{f}^{(b)}$ to the observed values of the relative risk, $R^0_i$ for $i \in S_b$, is optimised. More specifically, our choice minimises the residual sum of squares
$$ \mathrm{RSS}^b_s = \sum_{i \in S_b} \left( R^0_i - \hat{f}^{(b)}_s(X_i) \right)^2. \tag{4.6} $$
Within the RF algorithm, for each split, rather than optimise over all $K = 12$ of the predictive variables, we optimise over a subset of $m$ variables. This subset is chosen randomly for each new split (see, for example, James et al., 2013). The purpose of this is to increase the variability and reduce the correlation between individual trees. Without restricting the set of variables, we find that trees are very similar. The parameter $m$ is a hyperparameter, and we explain its choice in Section 4.4. Finally, we stop splitting nodes further as soon as any obtained node contains fewer than $M$ observations, where $M$ is another hyperparameter. In our empirical study, we choose $M = 200$ as that choice achieves a good balance between goodness-of-fit and overfitting. Díaz-Uriarte and Alvarez de Andrés (2006) provide a general discussion of the minimum node size $M$, principles of choosing it, and examples of its potential impact on the model performance and computation time. Wen (2022) has a more detailed discussion around $M$ in the specific context of modelling relative mortality risk at neighbourhood level using the RF algorithm, including its impact on the complexity of underlying trees and the standard deviation of the outcomes produced by individual trees. Although the model's out-of-sample performance does not appear to be very sensitive to the choice of $M$, choosing $M$ to be 200 rather than, say, 5 or 50 significantly saves computation time without sacrificing the predictive power of the RF estimator in our application.
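A single splitting step of the kind described above can be sketched as a search over candidate variables and thresholds for the split with the smallest residual sum of squares. The sketch below is a simplified illustration, not the routine used by the randomForest package: it handles one node only, treats all variables as numerical, ignores the minimum node size M, and the names `rss` and `best_split` and the toy data are assumptions.

```python
import random


def rss(values):
    """Residual sum of squares of values around their own mean."""
    if not values:
        return 0.0
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values)


def best_split(X, r, candidates):
    """Return (rss, k, l) minimising the RSS over splits x_k < l versus x_k >= l."""
    best = None
    for k in candidates:
        levels = sorted({row[k] for row in X})
        for l in levels[1:]:  # the smallest level would leave the left node empty
            left = [ri for row, ri in zip(X, r) if row[k] < l]
            right = [ri for row, ri in zip(X, r) if row[k] >= l]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, k, l)
    return best


# Toy node (hypothetical): two covariates, relative risk driven by the first one.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
r = [0.8, 0.8, 1.2, 1.2]

# Only m randomly chosen variables are considered at each split (here m = K = 2).
candidates = random.Random(0).sample(range(2), 2)
split = best_split(X, r, sorted(candidates))
```

In this toy node the first covariate separates the two risk levels exactly, so the selected split uses that variable with threshold 0.8 and leaves no residual variation.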
In Figure 1, we illustrate the construction of one tree using our data set of LSOAs. In this example, only two variables are considered for potentially splitting nodes: old-age income deprivation, $x_1$, and employment deprivation, $x_2$. In total, $s = 5$ splits have been performed, leaving us with $s + 1 = 6$ nodes, and our estimator for $f$ is given by
$$ \hat{f}^{(b)}_5(x) = \sum_{j=1}^{6} r_j \, \mathbf{1}\{ x \in L^b_{5,j} \}. $$
In Figure 2, we show the order of the performed splits: the first three splits are all based on $x_1$ (old-age income deprivation). Only after three splits using $x_1$ are two of the obtained nodes split using $x_2$ (employment deprivation). This clearly shows that, for this example, $x_1$ has a higher explanatory power than $x_2$, since splitting the early nodes in our tree according to the old-age income deprivation score reduces the residual sum of squares RSS more than early splits with respect to employment deprivation would achieve. To illustrate this specific example further, we also report the residual sum of squares, $\mathrm{RSS}^b_s$, in Figure 2. As expected, we find that the early splits result in the greatest reduction of $\mathrm{RSS}^b_s$.

Many trees form a RF
Having seen how an individual regression tree is fitted to observations in a set $S_b$, we now turn to describing how we choose the sets $S_b$ and how we combine many trees to obtain our final estimator $\hat{f}^{RF}$ for the regression function $f$ in (3.1). For each tree $b \in \{1, \dots, B\}$, the set $S_b$ is obtained by (see, for example, James et al., 2013)
1. sampling randomly with replacement from the training data set $S_{\text{train}}$ to obtain a sample of the same size as $S_{\text{train}}$, and then
2. removing all duplicates from that sample.
Repeating this procedure $B$ times, we obtain $B$ subsets $S_b \subseteq S_{\text{train}}$ of the training data set.
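The two-step construction of each subset (sample with replacement, then drop duplicates) can be sketched as below. The identifiers, the seed and the function name `bootstrap_subset` are assumptions made for the illustration.

```python
import random


def bootstrap_subset(train_ids, rng):
    """Sample len(train_ids) identifiers with replacement, then drop duplicates."""
    sample = rng.choices(train_ids, k=len(train_ids))  # step 1: with replacement
    return sorted(set(sample))                          # step 2: remove duplicates


rng = random.Random(0)
train_ids = list(range(10000))  # hypothetical training-set identifiers
subsets = [bootstrap_subset(train_ids, rng) for _ in range(5)]
fractions = [len(s) / len(train_ids) for s in subsets]
```

For a large training set, each subset obtained in this way is expected to contain about 1 − 1/e, that is roughly 63%, of the distinct training observations, so the trees are fitted to overlapping but different data.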
To introduce more randomness in the construction of the RF estimator $\hat{f}^{RF}$ we also, as remarked before, restrict the predictive variables considered at each split in any individual tree. Rather than choosing a predictive variable $x_k$ out of all $K$ variables when minimising the residual sum of squares $\mathrm{RSS}^b_s$ in (4.6), we follow James et al. (2013) and choose $x_k$ from a subset of $m$ predictive variables. As mentioned in Section 4.2, this subset of predictive covariates is randomly chosen for each split within each tree.
So, each tree $b = 1, \dots, B$ is fitted to a randomly chosen subset $S_b$ of observations from the training data set, and $\mathrm{RSS}^b_s$ is optimised with respect to $m$ randomly chosen predictive variables. In this way, we obtain a total of $B$ regression functions $\hat{f}^{(b)}$. The number $m$ of predictive variables considered for each split of nodes is a hyperparameter and will be chosen in stage one using cross-validation. As mentioned earlier, $m$ is then fixed in stage two.
Our final RF estimator $\hat{f}^{RF}$ for the regression function $f$ in (3.1) is obtained by taking the average over all individual regression trees $\hat{f}^{(b)}$, that is, for any $x \in L_0$,
$$ \hat{f}^{RF}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{(b)}(x). \tag{4.9} $$
Note that $\hat{f}^{RF}$ is piecewise constant over the full range of values of $x \in L_0$ as it is an average over a finite number of piecewise constant regression tree functions $\hat{f}^{(b)}$. However, $\hat{f}^{RF}$ can take many more values compared to any individual tree $\hat{f}^{(b)}$.
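The remark that the averaged estimator is still piecewise constant but takes more values than any single tree can be checked on a small example. The sketch uses two hypothetical one-split trees ("stumps") in a single covariate; the thresholds and node values are made up for the illustration.

```python
def stump(threshold, low, high):
    """A one-split tree in one covariate: low if x < threshold, else high."""
    return lambda x: low if x < threshold else high


def forest_predict(trees, x):
    """Average of the individual tree predictions at x."""
    return sum(tree(x) for tree in trees) / len(trees)


# Two hypothetical trees with different split points.
trees = [stump(0.3, 0.8, 1.2), stump(0.6, 0.9, 1.1)]

grid = [k / 100 for k in range(101)]
tree_values = {trees[0](x) for x in grid}
forest_values = {forest_predict(trees, x) for x in grid}
```

Each stump takes two values, while their average takes three (one for each region between the split points); with B = 2500 trees and many splits per tree, the number of distinct values grows accordingly.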

Hyperparameter selection (stage 1)
With the minimum node size, $M = 200$, fixed, the regression function $\hat{f}^{RF}$ in (4.9) will depend on two further hyperparameters: the number $m$ of predictive variables considered for each split, and the number of trees, $B$. Both of those parameters need to be chosen. One could argue that we can also choose the size $N_{\mathrm{train}}$ of the training set $S_{\mathrm{train}}$, but, for simplicity, we choose that set to include half of the available observations with the other half being included in the validation set $S_{\mathrm{val}}$ in stage one. For the $N = 32{,}844$ LSOAs in our empirical study, both sets therefore include $N_{\mathrm{train}} = N_{\mathrm{val}} = 16{,}422$ LSOAs.
The hyperparameters $B$ and $m$ are chosen in stage one in the following way: we fit $\hat{f}^{RF} = \hat{f}^{RF}_{B,m}$ to the data in the training set using different values for $B$ and $m$. For each combination $(B, m)$ considered, we then evaluate the fit of the obtained estimate $\hat{f}^{RF}_{B,m}$ to the data in the validation set using the mean squared error as criterion.

Here, $\sum_{t} \sum_{a} D_{ita}$ is the observed total number of deaths across all ages $a$ and years $t$ in LSOA $i$, and the corresponding expected total number of deaths is adjusted with the fitted relative risk $\hat{f}^{RF}_{B,m}(X_i)$ for LSOA $i$.
In Figure 3, we plot MSE(B, m) in (4.10) for different values of B (with m = 4) and different values of m (with B = 2500). The figure shows that the out-of-sample performance of the RF does not worsen as more trees are grown, and we choose B = 2500 as a good compromise between computational effort and goodness-of-fit. We also find that considering m = 4 predictive variables in (4.6) leads to the smallest mean squared error. Table 4 summarises our choice of hyperparameters.
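The selection procedure amounts to a grid search over $(B, m)$ with a validation criterion. The sketch below shows the mechanics only. The names `select_hyperparameters` and `toy_fit` are hypothetical, a plain mean squared error on a generic response stands in for the death-count criterion (4.10), and the toy fitting function is constructed so that its error is smallest at m = 4; it is not a random forest.

```python
def mean_squared_error(observed, fitted):
    """Plain mean squared error (a stand-in for the criterion in the text)."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed)

def select_hyperparameters(fit, train, val, grid):
    """Return the (B, m) pair in grid with the smallest validation error."""
    scores = {}
    for B, m in grid:
        predict = fit(train, B, m)            # fit on the training set only
        fitted = [predict(x) for x, _ in val]
        observed = [y for _, y in val]
        scores[(B, m)] = mean_squared_error(observed, fitted)
    best = min(scores, key=scores.get)
    return best, scores

def toy_fit(train, B, m):
    """Purely illustrative stand-in for an RF fit; it ignores the training data."""
    slope = m / 4 + 1 / B
    return lambda x: slope * x

data = [(x / 10, x / 10) for x in range(1, 21)]   # toy observations with y = x
train, val = data[:10], data[10:]
grid = [(B, m) for B in (500, 1000, 2500) for m in (2, 4, 6)]
best, scores = select_hyperparameters(toy_fit, train, val, grid)
```

In practice `fit` would wrap whichever random forest implementation is used, with the minimum node size held fixed while B and m vary.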

Goodness-of-fit (stage 2)
In order to assess the out-of-sample performance of the proposed RF estimator applied to the mortality data for the 32,844 LSOAs, we move on to stage two as mentioned in Section 4.1. To this end, we randomly split the set $S$ of all 32,844 LSOAs into two equally sized subsets: the training set $S_{\mathrm{train}}$ (different from before) and the test set $S_{\mathrm{test}}$. The training set $S_{\mathrm{train}}$ chosen at this stage is a random sample of $S$ and independent of the training set chosen in stage 1. The sets $S_{\mathrm{train}}$ and $S_{\mathrm{test}}$ are disjoint and $S_{\mathrm{train}} \cup S_{\mathrm{test}} = S$. We then construct an estimator $\hat{f}^{RF}$ using the data in $S_{\mathrm{train}}$ and the hyperparameters in Table 4. To quantify the goodness-of-fit of the obtained estimator $\hat{f}^{RF}$, we evaluate its out-of-sample fit to data in the test set by calculating the mean squared error as in (4.10), but now considering LSOAs in the test set rather than the validation set. We denote this quantity by $\mathrm{MSE}_{\mathrm{test}}$.
Clearly, the realised values of $\mathrm{MSE}_{\mathrm{test}}$ will depend on the randomly chosen LSOAs in $S_{\mathrm{train}}$ and $S_{\mathrm{test}}$. To get an idea of how sensitive the results are to the randomised choice of $S_{\mathrm{train}}$ and $S_{\mathrm{test}}$, we calculate $\mathrm{MSE}_{\mathrm{test}}$ for three different splits (rounds) of our data into training and test sets. The results are provided in Table 5 for data based on different age ranges. We find that there is not much variation in the values of $\mathrm{MSE}_{\mathrm{test}}$ between rounds. We also see that the goodness-of-fit of $\hat{f}^{RF}$ is much better when it is estimated from mortality data at younger ages.

Robustness
The rather small variation of the test set MSEs over different randomly chosen training sets in Table 5 is an indication that the fitted relative risk, and therefore the LIFE index, is a robust estimator of the true underlying relative mortality risk. To investigate robustness further we now split the annual data for the observation period 2001-2018 into two subsets: data for even years 2002, 2004, . . . and data for odd years 2001, 2003, . . . This split leaves us with two subsets each consisting of nine years of observations. We chose to split the observation period in this way to avoid any impact of potential trends in the relative mortality risk over time.
We now apply the above methods to obtain estimates $\hat{f}^{RF}(x)$ with data from only one of the two observation subsets and then compare the results.
We present scatter plots of the estimated values of $\hat{f}^{RF}(x)$ based on odd years (horizontal axis) and even years (vertical axis) using mortality data for different age ranges in Figure 4. The plots clearly show that the estimated values of the relative risk for individual LSOAs are very similar when mortality data from different years are used. In particular, there seem to be no systematic differences, which is further evidence that the results of our RF estimator are robust. Any variation we see is most likely due to sampling variation in the death counts rather than systematic differences.
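The odd/even split of the observation period is simple to reproduce. The sketch below is illustrative only: `total_deaths` and the yearly death counts for a single LSOA are made up for the example.

```python
years = list(range(2001, 2019))                  # observation period 2001-2018
odd_years = [t for t in years if t % 2 == 1]     # 2001, 2003, ..., 2017
even_years = [t for t in years if t % 2 == 0]    # 2002, 2004, ..., 2018

def total_deaths(deaths_by_year, subset):
    """Sum the death counts of one LSOA over the years in subset."""
    return sum(deaths_by_year[t] for t in subset)

# Hypothetical yearly death counts for a single LSOA.
deaths_by_year = {t: 10 + (t % 3) for t in years}
deaths_odd = total_deaths(deaths_by_year, odd_years)
deaths_even = total_deaths(deaths_by_year, even_years)
```

Interleaving the years in this way gives both subsets nine years centred on nearly the same point in time, which is why a gradual trend in relative risk affects them almost equally.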

Final index values (stage 3)
The hyperparameters have been chosen in stage 1 and goodness-of-fit and robustness assessed in stage 2, and it has been concluded that the RF algorithm has produced a good estimate of f with the chosen parameters. In the final stage, we simply rerun the RF algorithm but, instead of using only half of the data, we use the full set of 32,844 LSOAs.

Fitting the LIFE index to different age groups
As mentioned in Section 3.3, the LIFE index is not directly age specific. However, its estimated values depend upon the specified age range, A, and the index values obtained can be assumed to apply to either the whole of that age range or to the midpoint of that range.
To investigate the effect of different age groups on the estimated LIFE index value, we compare index values obtained from fitting $\hat{f}^{RF}(x)$ to mortality data for three age groups: 60-69, 70-79 and 80-89. We report Q-Q-plots of the obtained index values $R_i$ (Equation (3.2)) for all 32,844 LSOAs in Figure 5. Figure 5 shows a very strong dependency between the LIFE index for age groups 60-69 and 70-79 (left-hand plot). This dependency is slightly weaker when we compare the age group 80-89 with the younger ages (a greater spread of points in the middle and right-hand plots) but the dependency is still strong. Wen (2022) found similar results for an index constructed using local linear regression. Empirical distributions of the $R_i$ are plotted in Figure 6 for different age groups. We can see that variation in relative mortality risk is greater for younger ages than older ages, an observation that is consistent with previous research on socio-economic variation in mortality in various populations (see, for example, Mackenbach et al., 2003; Mackenbach et al., 2015; Chetty et al., 2016; Wen et al., 2020 and Wen et al., 2021 and references therein).

The LIFE index versus the IMD
The IMD is published by the Department for Communities and Local Government in the UK (Smith et al., 2015). The IMD is designed as a general measure for deprivation. The LIFE index, on the other hand, has been produced specifically as a measure of mortality deprivation with the aim of predicting mortality differences between LSOAs. We would expect the two indices to have a high rank correlation. To check this hypothesis, we plot the LIFE index values fitted to mortality data for ages 40-49, 60-69 and 80-89 versus the IMD scores in Figure 7. We also report Spearman's rank correlations between the LIFE index fitted to different age ranges and the IMD scores in Table 6.
We find that the rank correlation is indeed high, in particular, when the LIFE index is fitted to mortality data at younger ages, see Table 6. We also observe in Figure 7 that there is a strong dependency between the scores of the two indices, but that there are some outlier LSOAs with a rather low IMD score (little deprivation) and a rather high relative mortality risk.
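For reference, Spearman's rank correlation is the Pearson correlation of the ranks of the two score vectors. A minimal standard-library sketch is given below; the function names and the five pairs of IMD and LIFE scores are invented for illustration and are not taken from the data.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1      # average rank of the tied block
        for j in range(start, end + 1):
            ranks[order[j]] = shared
        start = end + 1
    return ranks

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

imd_scores = [5.0, 12.3, 20.1, 35.7, 60.2]     # hypothetical IMD scores
life_scores = [0.70, 0.85, 0.80, 1.10, 1.60]   # hypothetical relative risks
rho = spearman(imd_scores, life_scores)
```

Because only ranks enter the calculation, the measure is unaffected by the different scales of the IMD score and the relative risk and by any monotone transformation of either.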
We did consider the outlier LSOAs, for example, for ages 60-69 in some detail. Individual predictive variables for these LSOAs tended to be towards the tails of the data but not too extreme, but, in higher dimensions, the vectors X i for these LSOAs were clearly positioned around the fringes of the cluster of observations for the 32,844 LSOAs. As with most regression methods, estimates at the edges of a dataset do carry higher levels of uncertainty than estimates in the middle of the dataset.

Figure 7. Scatterplot of LIFE index versus IMD. The LIFE index is based on an estimated relative mortality risk fitted to mortality data for ages 40-49 (top left), 60-69 (top right) and 80-89 (bottom). Colours indicate the urban-rural class of an LSOA: conurbations (black), cities/towns (red), villages (green), rural areas (dark blue) and London (light blue).
The colouring of the dots in Figure 7 reveals how significantly the inclusion of urban-rural class has impacted on estimates of the relative risk compared to the IMD, particularly ages 40-49 and 60-69. In the upper plots, the dark blue dots representing very rural areas are mostly shifted to the left of the main diagonal. This indicates that the inclusion of urban-rural class in the RF model estimates significantly lower mortality in these rural areas than would be suggested by the IMD, which takes no explicit account of urban-rural class. Indeed the IMD includes a subdomain called geographical barriers which counts greater distance to services as meaning an area is more deprived. But in mortality terms (at least at the macro scale) the opposite is true: larger, more-rural or otherwise less-dense LSOAs have lower mortality even though one has to travel further for essential services.
The RF model also predicts lower mortality for London (light blue dots) than the IMD predicts. In this case, the reason is less clear and needs further investigation: what is missing in the IMD that is to the advantage of London and to the disadvantage of other areas, particularly other conurbations and cities?

Distribution of low- and high-risk groups across urban-rural classes
In this section, we investigate whether there are differences between urban-rural classes in the number of low- and high-risk mortality populations. More specifically, we denote by $q^R_\alpha$ the empirical $\alpha$-quantile of the estimated relative risk $R^0$ in all LSOAs, and we then define four groups of LSOAs: the lower 5% and 50% quantile groups, and the upper 5% and 50% quantile groups,
$$G^l_\alpha := \{k : R^0_k < q^R_\alpha\} \quad \text{and} \quad G^u_\alpha := \{k : R^0_k > q^R_{1-\alpha}\} \quad \text{for } \alpha = 0.05,\ 0.5. \qquad (7.1)$$
Table 7 shows how urban-rural classes are distributed in each of those groups. An interesting result in Table 7 is that out of the 2542 LSOAs that are classified as rural hamlets and isolated dwellings (urban-rural class 4) none can be found in the high-risk group $G^u_{0.05}$ when the relative risk is fitted to mortality data for ages 60-69. Similarly, only eight of the 3056 LSOAs in urban-rural class 3 are found in the high-risk group. On the other hand, 11.9% of LSOAs in large conurbations outside London made it into the top 5% risk group while only 1.7% of LSOAs in London are in that group. The data in Table 7 clearly show the strong impact that the urban-rural class has on the estimated relative mortality risk of an LSOA with the general conclusion that LSOAs in large cities tend to have a higher mortality risk than LSOAs in rural areas. London is an exception for the very high-risk group but we also find that more than half (61%) of LSOAs in London have a mortality risk greater than the median for England.
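The quantile groups in (7.1) can be formed as in the sketch below. Several conventions exist for empirical quantiles; the nearest-rank definition used here is an assumption, as are the function names and the synthetic risk values.

```python
import math

def empirical_quantile(values, alpha):
    """Nearest-rank empirical quantile (one of several possible conventions)."""
    ordered = sorted(values)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]

def risk_groups(risk, alpha):
    """Lower and upper alpha-quantile groups of LSOA labels."""
    q_low = empirical_quantile(risk.values(), alpha)
    q_high = empirical_quantile(risk.values(), 1 - alpha)
    low = {k for k, r in risk.items() if r < q_low}     # strictly below q_alpha
    high = {k for k, r in risk.items() if r > q_high}   # strictly above q_(1-alpha)
    return low, high

# Synthetic relative risks for 100 hypothetical LSOAs labelled 0..99.
risk = {k: 0.5 + k / 100 for k in range(100)}
low5, high5 = risk_groups(risk, 0.05)
low50, high50 = risk_groups(risk, 0.5)
```

Cross-tabulating group membership against the urban-rural class of each LSOA then gives a table of the kind described in the text.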
The picture changes slightly when we consider the oldest age group, 80-89, in our data set. However, the general conclusion seems to be unchanged: large cities have higher mortality than rural areas.

Impact of specific variables
Since the LIFE index is based on a non-parametric estimator of the regression function $f$ in (3.1), it is not straightforward to assess the impact of specific variables on the index value from the sign or magnitude of specific parameters as is often possible for parametric regression models. Instead, we study the values of $\hat{f}^{RF}(x)$ for a certain range of covariate values where we vary only some variables while leaving others constant.
In Wen (2022), it was found that employment and old-age income deprivation are two of the most significant predictors of mortality rates. We therefore focus on those two variables first. We also include here $x_6$, the proportion of an LSOA's population born in the UK, as an example of a less important variable. Figure 8 shows the fitted relative mortality risk as a function of those three covariates. In each of the plots, all variables except the variable on the horizontal axis have been set to their median calculated across all LSOAs. More specifically, for the first row in Figure 8 (old-age income deprivation) the value shown for LSOA $i$ is calculated as
$$\hat{f}^{RF}\left(X_{i,1}, X^{50}_{2}, \ldots, X^{50}_{11}, X^{50}_{i,12}\right)$$
where $X^{50}$ denotes the empirical 50% quantile of covariate $X$. We use a similar approach for $X_2$ and $X_6$. This allows us to zoom in on the specific effect of one variable. We can clearly see that old-age income deprivation and employment deprivation have a similar effect on the risk of dying, with high levels of deprivation associated with high levels of mortality. However, comparing the range of risk values (y-axis), we find that income deprivation is a much better variable than employment deprivation to distinguish between low- and high-risk LSOAs. This is particularly true when data for ages 60-69 are used to fit the relative risk function. Not surprisingly, old-age income deprivation is still a good variable to explain differences at the older ages 80-89, but employment in an LSOA has little explanatory power for mortality differences in that age group.
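The idea of holding all but one covariate at the median can be sketched as follows. The helper `median_profile`, the stand-in function `f_hat` and the small covariate matrix are hypothetical; in an application, `f_hat` would be the fitted RF estimator.

```python
from statistics import median

def median_profile(X, keep):
    """Replace every covariate by its median across LSOAs, except the
    columns listed in keep, which retain each LSOA's observed value."""
    p = len(X[0])
    med = [median(row[j] for row in X) for j in range(p)]
    return [[row[j] if j in keep else med[j] for j in range(p)] for row in X]

def f_hat(x):
    """Hypothetical stand-in for a fitted relative risk function."""
    return 1.0 + 0.5 * x[0] + 0.1 * x[1] * x[2]

# Three hypothetical LSOAs with three covariates each.
X = [[-1.0, 0.2, 3.0],
     [0.0, 0.4, 1.0],
     [2.0, 0.9, 2.0]]
profiles = median_profile(X, keep={0})     # vary only the first covariate
curve = [f_hat(x) for x in profiles]
```

Plotting `curve` against the first column of `X` gives a one-variable profile of the kind shown in Figure 8. Passing further column indices in `keep` retains, for example, an LSOA's observed class variable.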
Turning to $x_6$, the proportion of the population born in the UK, this variable has very limited explanatory power (a narrow range of relative risk values) when all other variables are set to the median. Nevertheless, Wen (2022) has found that it is a variable that helps the RF algorithm to better predict observed mortality risk. While we find that higher numbers of UK-born residents seem to slightly increase mortality in an LSOA, this effect is relatively small. We also observe in Figure 8 that the relative risk is below one for all LSOAs. This is clearly a consequence of setting all other covariates to the median. The relative risk values and our conclusions about $x_6$ might change when other covariates are set to different values rather than the median. Our proposed non-parametric estimation of the relative risk would allow for such more detailed empirical studies but that is beyond the scope of this paper. Also of note is the fact that the steep portion of both plots for $x_6$ is well to the left of the median ($x_6 = 0$). A potential reason for this is how $x_6$ interacts with other predictive variables. It is only for the lowest 20% of values of $x_6$ that there is a significant dependency between $x_6$ and other predictive variables. For the upper 80% of the distribution of $x_6$, there is very little dependency with other variables.
Finally, we conclude from Figure 8 that the impact of the three considered variables is comparable in all five urban-rural classes.
The LIFE index has been constructed by keeping all variables at their observed levels except the proportion of residents living in care homes, x 11 and x 12 , see Section 3.3 for details. To investigate further the impact of the three variables, we now plot the values of the LIFE index for all LSOAs as a function of one covariate in Figure 9.
We find in Figure 9 that, as expected, there is a lot more fluctuation when the variability from other covariates is not removed as in Figure 8. We can see that old-age income deprivation has the strongest correlation with estimated relative mortality risk followed by employment deprivation and, finally, there is no clear pattern linking the population born in the UK to the relative risk. In each plot, the colour of each point shows the urban-rural class. We can see that a London-effect is clearly visible: London has lower mortality as seen from the plot on the top left, and London has also a relatively large population of people born abroad which is clearly visible in the lower plot.
Finally, we consider the joint impact of two variables on the LIFE index. In Figure 10, we show heat plots of LIFE index values for ages 60-69 as functions of old-age income deprivation, $x_1$, and employment deprivation, $x_2$. All variables $x_3, \ldots, x_9$, $x_{11}$ and $x_{12}$ are set to their median and the three panels show urban-rural classes 1, 4 and 5 from left to right. It is notable that, in all three cases, for less-deprived LSOAs, the divisions between bands of colour are nearly vertical, indicating that old-age income deprivation is the main driver of the LIFE index out of the two variables. However, as we move up, the divisions between colour bands gradually tilt, indicating that there is more of a balance between the two measures of deprivation in terms of their impact on the LIFE index. This gives a good indication of how the non-linear RF algorithm is able to pick up changes in the impact of different variables as we move across the dataset.

Relative risk in different deciles
With the proposed LIFE index, we can also zoom in on specific populations identified by their mortality risk. For example, Figure 11 shows the LIFE index values where we only consider the 10% of LSOAs with the lowest relative risk (left plot) and the 10% of LSOAs with the highest relative risk (right plot).
More specifically, we introduce a decile function g which identifies the decile k for any LSOA i with g(i) = k for k ∈ {1, . . . , 10}. The 10% of LSOAs with the highest mortality risk are LSOAs i with g(i) = 1 and the LSOAs with the lowest relative risk have g(i) = 10.
The plots in Figure 11 clearly show that in both subpopulations the most important variable to explain mortality differences is old-age income deprivation while employment deprivation has less impact.
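A decile function of the kind described above can be sketched as follows. The sketch assumes deciles containing equal numbers of LSOAs, with decile 1 holding the highest relative risks; whether deciles are formed by LSOA count or weighted by population is not specified here, and the name `decile_function` and the risk values are hypothetical.

```python
def decile_function(risk):
    """Map each LSOA label to a decile 1..10; decile 1 holds the highest risks.
    Ties are broken by the sort order, which is an arbitrary choice."""
    ordered = sorted(risk, key=risk.get, reverse=True)
    n = len(ordered)
    return {lsoa: (pos * 10) // n + 1 for pos, lsoa in enumerate(ordered)}

# Twenty hypothetical LSOAs with increasing relative risk.
risk = {f"L{k}": 0.5 + 0.05 * k for k in range(20)}
g = decile_function(risk)
highest_risk = sorted(lsoa for lsoa, k in g.items() if k == 1)
lowest_risk = sorted(lsoa for lsoa, k in g.items() if k == 10)
```

Restricting the data to LSOAs with `g(i) = 1` or `g(i) = 10` yields the two subpopulations compared in Figure 11.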

Explanatory power of the LIFE index
The LIFE index has been constructed with the aim of explaining the mortality risk in an LSOA based on socio-economic variables. Therefore, the question arises whether the LIFE index can indeed differentiate between high and low mortality areas. In other words, are the socio-economic variables and the constructed index able to predict the relative mortality risk in an LSOA? To investigate this, we calculate age standardised mortality rates (ASMR) and age and deprivation standardised mortality rates (ADSMR).
The ASMR for any population $g$ in year $t$ is calculated as follows:
$$\mathrm{ASMR}_{gt} = \frac{\sum_{a \in X} E^s_a \, m_{gta}}{\sum_{a \in X} E^s_a} \qquad (9.1)$$
where $X$ refers to the age range used and $m_{gta} = D_{gta}/E_{gta}$ is the crude death rate for the underlying population $g$ in year $t$ for age $a$. The standard population $E^s_a$ used in our study is the European Standard Population (ESP) in 2013.
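As a worked illustration, the sketch below computes an ASMR assuming the usual weighted-average form, in which crude death rates by age are weighted by a standard population. The function name `asmr` and all numbers, including the standard population weights, are made up for the example and are not ESP 2013 values.

```python
def asmr(deaths, exposure, standard):
    """Age-standardised mortality rate for one population and year.
    All three arguments map age -> value over the same age range."""
    weighted = sum(standard[a] * deaths[a] / exposure[a] for a in standard)
    return weighted / sum(standard.values())

# Hypothetical inputs for a two-year age range.
deaths = {60: 10, 61: 12}
exposure = {60: 1000.0, 61: 1000.0}
standard = {60: 1.0, 61: 1.0}          # illustrative weights only
rate = asmr(deaths, exposure, standard)
```

Here the crude rates are 0.010 and 0.012 and the weights are equal, so the ASMR is their simple average, 0.011.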
To investigate how well the LIFE index can distinguish between different levels of mortality, we split the set of all LSOAs into 10 deciles as described above and plot the ASMRs for each decile over time. We then compare our results to ASMRs calculated for deciles obtained from the Index of Multiple Deprivation (IMD), see Wen et al. (2021). The results are shown in Figure 12.
We can see in Figure 12 that the 10 deciles obtained from either index, LIFE or IMD, produce very different mortality rates. The figure also shows that the LIFE index leads to a wider spread of mortality rates meaning that it is better than the IMD in identifying low and high mortality on the basis of socioeconomic variables. However, we must keep in mind that the IMD was not designed to predict mortality while the LIFE index was chosen to do so. We also find that both indices show a widening of the mortality gap, the difference between rates for the most deprived as compared to the least deprived. For the IMD, we discussed this issue in detail in Wen et al. (2021).

Explaining regional differences
While the LIFE index only uses information about socio-economic covariates, including an urban-rural class, to predict the relative mortality risk in an LSOA, it might be that LSOAs in different regions in England have different mortality rates although they have the same socio-economic characteristics.
To investigate this question further, we group all LSOAs into nine geographical regions. Table 8 lists the regions and shows the percentage of a region's LSOAs that belong to the risk groups defined in Section 7 for the LIFE index fitted to ages 60-69. The table shows that 12.4% of the LSOAs in the North West of England belong to the group of LSOAs with the 5% highest mortality risk, closely followed by the North East. The lowest LIFE scores are observed in the South East with 11.4% of those LSOAs belonging to the 5% English LSOAs with the lowest mortality risk. The results for other age groups are very similar, and, therefore, not reported in this paper.
To obtain a more detailed picture of the regional differences, we consider age standardised mortality rates. The ASMRs by region for age 60-69 and 80-89 can be seen in the Figure 13. We find that there are apparently substantial inequalities between regions, but how much of that can be explained by differences in the socio-economic mix of the nine regions?
To address this question, we propose using what we call the Age and Deprivation Standardised Mortality Rate, ADSMR. In the same way that the basic ASMR removes differences between populations that have different age profiles, the ADSMR is designed to remove the impact of different deprivation profiles. Thus, the ADSMR in region $r$ in year $t$ is defined as
$$\mathrm{ADSMR}_{rt} = \frac{1}{10} \sum_{k=1}^{10} \mathrm{ASMR}_{rkt} \qquad (9.2)$$
where $\mathrm{ASMR}_{rkt}$ is the ASMR in year $t$ of all LSOAs in region $r$ and index decile $k$. Deciles are formed on the basis of the LIFE index values as described above, and, for comparison, on an IMD basis. Assuming that all mortality differences are explained by the LIFE (or IMD) index based on socio-economic variables, the ADSMRs should be the same for different regions (since $\mathrm{ASMR}_{rkt}$ would be independent of $r$). In Figure 14, we plot $\mathrm{ADSMR}_{rt}$ for the nine regions where deciles have been obtained from using the IMD (left plots) and the LIFE index (right plots).
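The effect of the equal weighting in (9.2) can be seen in a small sketch. Two hypothetical regions share the same decile-specific ASMRs but have a different mix of deciles; the function names, the rates and the population shares are all invented for illustration.

```python
def adsmr(asmr_by_decile):
    """Equal-weight average of the ASMRs of the ten index deciles."""
    if len(asmr_by_decile) != 10:
        raise ValueError("expected one ASMR per decile")
    return sum(asmr_by_decile.values()) / 10

def mix_weighted_rate(asmr_by_decile, share_by_decile):
    """Average weighted by the share of the region's population in each decile."""
    return sum(asmr_by_decile[k] * share_by_decile[k] for k in asmr_by_decile)

# Same decile-specific rates in both regions; decile 1 has the highest mortality.
rates = {k: 0.020 - 0.001 * k for k in range(1, 11)}
# Region A is concentrated in high-mortality deciles, region B in low-mortality ones.
shares_a = {k: (0.16 if k <= 5 else 0.04) for k in range(1, 11)}
shares_b = {k: (0.04 if k <= 5 else 0.16) for k in range(1, 11)}

adsmr_value = adsmr(rates)
mixed_a = mix_weighted_rate(rates, shares_a)
mixed_b = mix_weighted_rate(rates, shares_b)
```

The mix-weighted rates differ between the two regions purely because of their deprivation profiles, whereas the ADSMR is identical by construction. Any remaining regional differences in ADSMR therefore point to effects not captured by the index used to form the deciles.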
We clearly see that the purpose built LIFE index is better suited to explain mortality differences between regions, resulting in much smaller differences in the ADSMR between the regions than the IMD-based ADSMR. While this is true for both age ranges (and other ranges as shown in Figures published online in the supplementary material), it is for the oldest age group 80-89 that the LIFE index can explain most of the differences between regions -the ADSMRs in the bottom right plot are much more similar than in the bottom left plot. This can also be observed for age group 70-79, see the online supplementary material for the relevant graphs. The reason for the much better ability of the LIFE index to explain differences between regions for higher age groups as compared to the IMD might be the inclusion of old-age income deprivation in the LIFE index rather than general income deprivation (which is one of the domains of deprivation used for the IMD).
Another interesting feature we observe in Figure 14 and similar figures in the online supplementary material is the rate of mortality improvement in London, which seems to be much greater than in other regions. Considering the oldest age group 80-89 and measuring deprivation using the IMD suggests that mortality in London improved significantly more than in other areas even after accounting for deprivation and that this improvement continued after 2011 when other regions experienced little or no improvement. However, measuring deprivation using the LIFE index changes this conclusion.
The ADSMRs for the nine regions are very close together, and there is no London-effect visible for this age range.

Summary and conclusion
We have introduced a new mortality index, the LIFE index, that uses socio-economic characteristics to explain mortality rates in individual LSOAs. The LIFE index is constructed by first modelling the relationship between mortality and explanatory variables as a non-parametric function and estimating that function using the RF method. In a second step, the resulting regression function is adjusted to account for inflated death counts in LSOAs with care homes so that the LIFE index is a good representation of the general population.
Using the RF estimator of the relationship between socio-economic variables and relative mortality risk, we are able to study the impact of specific variables in isolation. While we have only reported results for three covariates in this paper, the proposed method allows for a much more detailed analysis. However, our empirical results indicate that of all the covariates considered it is old-age income deprivation which has the highest explanatory power for mortality differences between LSOAs -the higher old-age income deprivation the higher are the mortality rates in an LSOA.
While studying the impact of individual variables helps to understand to some extent how the LIFE index allocates scores to LSOAs with certain characteristics, the impact of individual factors is still not as clear as it is in parametric models. A careful analysis of the sensitivity of the LIFE index scores with respect to changes in certain factors could serve as a first step in the development of a parametric (or semiparametric) regression model in which estimated parameters have a clear interpretation.
Comparing the LIFE index with the widely used English Index of Multiple Deprivation shows that the LIFE index is better able to explain regional variations in mortality with deprivation measures than the IMD. However, keeping in mind that the IMD has not been constructed with reference to relative mortality risk, we find that it actually is a good predictor for mortality differences. Nevertheless, for any application with the goal of explaining mortality differences between different geographic locations in England the LIFE index is more suitable than the IMD.
The proposed LIFE index could be further improved in different directions. Other covariates could, of course, be considered. One LSOA-specific covariate that might be important is the average age in an LSOA within a given age band (e.g. 60-69). The construction of the relative risk takes the age structure of LSOAs into account (both expected and observed deaths are age dependent). However, it might be that certain covariates have different effects in LSOAs with higher or lower average age. This is different from considering different age ranges for fitting the relative risk to observed covariates.
Another extension of our index would be to make it time dependent. However, that would require the measurement of the LSOA-specific relative risk for individual calendar years, which would increase the variance of the observed relative risk substantially.