Integration of traditional and telematics data for efficient insurance claims prediction

Abstract While driver telematics has gained attention for risk classification in auto insurance, scarcity of observations with telematics features has been problematic, which could be owing to either privacy concerns or favorable selection compared to the data points with traditional features. To handle this issue, we apply a data integration technique based on calibration weights for usage-based insurance with multiple sources of data. It is shown that the proposed framework can efficiently integrate traditional data and telematics data and can also deal with possible favorable selection issues related to telematics data availability. Our findings are supported by a simulation study and empirical analysis in a synthetic telematics dataset.


Introduction
Telematics generates data on many driver-level variables, including total miles driven, the number of sudden brakes or accelerations, and the times at which the driver is on the road. With technological advancements in driver telematics in the automobile industry, the insurance industry can add new features to its databases alongside the traditional features and use both for claim prediction and risk classification in a unified framework. In this regard, a framework is required to deal with multiple data sources that contain traditional and/or telematics features for insurance ratemaking, which is one of the main contributions of this paper.
Usage-based insurance (UBI) is an innovative product in the insurance industry that relies on technological advances to assess the risk profile of a driver. Past studies elaborate on the additional value of telematics-derived information for improved claims predictions, risk classification, and premium assessments. Ayuso et al. (2014) compare the driving behaviors of novice and experienced young drivers with pay-as-you-drive policies using a few telematics variables as well as traditional variables. Furthermore, Ayuso et al. (2016) examine gender discrimination in the risk of accidents using the same dataset. Baecke and Bocca (2017) illustrate the use of telematics variables to decide the risk premium and state that at least three months of data are enough to obtain efficient risk estimates. Verbelen et al. (2018) depict the importance of telematics variables, which are based on driving habits, in predicting the frequency of claims. Gao et al. (2019) show the predictive power of telematics covariates extracted from speed-acceleration heat maps in the modeling of claim frequency and support the use of telematics features for insurance pricing.
Moreover, insurance companies have access to large datasets on policyholders that contain traditional characteristics, such as driver demographics and vehicle characteristics. However, a telematics dataset can have fewer data points than a traditional dataset, as the number of policyholders with telematics records is low. Guillen et al. (2021) use a modeling approach for insurance ratemaking with traditional and telematics data, but the approach is limited to a small number of features because the available data are limited. Ma et al. (2018) mention that the lack of availability of telematics data is a challenge in identifying the factors of policyholder behavior in ratemaking. While providing a compact description of the insurability of risk using telematics data, Eling and Kraft (2020) highlight some actions that can increase the number of telematics-based policyholders. Hence, telematics data are scarce compared to traditional data.
In this regard, it is natural to expect that insurers need to deal with two types of datasets: traditional datasets with fewer features and more observations from non-UBI insureds, and telematics datasets with more features and fewer observations from UBI insureds. One could argue that insurers could treat UBI and non-UBI insureds as separate groups and that it suffices to analyze the two types of datasets separately, as more and more low-risk people would move to UBI over time and form a natural market segmentation due to asymmetric information (Rothschild and Stiglitz, 1978). According to Holzapfel et al. (2023), however, the market share of UBI contracts remains relatively low at about 5%, even though UBI contracts have been accessible to policyholders for over twenty years (NAIC, 2015; MarketsandMarkets, 2021). At the very least, the situation where there are far fewer UBI subscribers than non-subscribers can last longer than expected, and therefore the data integration framework that we propose could be valid for a considerable period of time in the future. Further, it is natural to expect that a policyholder may move back and forth between a UBI and a non-UBI contract upon renewal (Śliwiński and Kuryłowicz, 2021). Therefore, it is worthwhile to investigate the available datasets jointly to better understand the characteristics of the population, rather than to analyze the traditional and telematics datasets separately, which implicitly assumes a time-invariant business mix between the UBI and non-UBI contracts of an auto insurance company.
Data integration techniques combine information from multiple data sources into one. According to Yang and Kim (2020), data integration incorporates information from different samples to achieve efficient estimation under finite population inference while handling potential selection biases. Husnjak et al. (2015) also recognize that the integration of telematics data with traditional data can help to realize the full potential of telematics data. Accordingly, Ayuso et al. (2019) and Gao et al. (2022) propose two-step approaches that use telematics characteristics to improve a regression model that only incorporates traditional ratemaking factors.
Although these approaches are straightforward and readily available, they might be problematic when the availability of telematics features depends on the riskiness of the policyholders due to possible favorable selection. For example, Denuit et al. (2019) state that low-risk drivers would favor telematics insurance products, and Duval et al. (2023) mention that attracting safer drivers is beneficial for the insurer as it could lower the claim cost. From an analytical point of view, however, this situation may result in missing some insights about riskier drivers. According to Cohen and Siegelman (2010), one can expect that the information asymmetry between insurers and policyholders may lead to favorable selection in the sampling mechanism of observations with telematics features, as less risky drivers are more likely to provide telematics data for possible premium discounts.
Indeed, the consideration and collection of telematics data are relatively recent, and there are still ongoing concerns about privacy issues, which make many policyholders reluctant to agree to provide their telematics data to insurers. In this regard, Dewri et al. (2013) describe privacy concerns that can arise when using telematics data on driving habits. Duri et al. (2002) also mention that there is a tendency to observe a decrease in the amount of telematics data due to privacy concerns, which is similar to the trend among web users with privacy concerns. In a similar way, Milanović et al. (2020) imply that policyholders who are willing to provide telematics data tend to have less concern about privacy issues. Thus, we can observe a selection bias in the telematics dataset due to privacy issues as well as favorable selection.
In summary, the objectives of the proposed research are as follows. First, we propose a framework based on the estimation of the propensity score to combine information from multiple datasets in insurance ratemaking, considering the scarcity of telematics data and possible favorable selection regarding the availability of telematics data. Then we introduce, in detail, an algorithm to integrate a traditional insurance claim dataset and a telematics dataset based on a calibration equation approach. Finally, we test the validity and applicability of the proposed framework through a simulation study and an empirical analysis of a synthetic telematics dataset. Consequently, we hope that the proposed method can help insurance companies effectively use multiple sources of data for better risk classification.
The rest of this article is organized as follows. Section 2 provides a detailed description of the problem and the corresponding data structure with the missing mechanism. In Section 3, the proposed framework for data integration is developed based on a calibration equation approach with information projection to model the claim count data. Section 4 provides a simulation study to assess the effects of the proposed approach compared to four preexisting approaches. Section 5 conducts an empirical analysis with a synthetic telematics data portfolio that is emulated from real data, to assess the applicability of the proposed approach in practice. Section 6 concludes the paper with some constructive remarks.

Data structure and problem description
This study focuses on two data sources, as discussed in Section 1: $S_0$, a small dataset with $M_0$ observations that contains both telematics and traditional features, and $S_1$, a large dataset with $M_1$ observations that contains only traditional features. We also assume that the finite population $S$ consists of both $S_0$ and $S_1$, so that the total number of observations in $S$ is $M = M_0 + M_1$. We denote the traditional features of policyholder $i$ as $x_{i\tau}$ (available in both $S_0$ and $S_1$) and the telematics features of policyholder $i$ as $x_{iT}$ (available only in $S_0$). The full feature vector of the study is then $x_i = (x_{i\tau}, x_{iT})$. A summary of the description of the data is given in Figure 1.
Note that the observability of $x_{iT}$ could depend on the risk profile of policyholder $i$, which could make the sampling mechanism of $S_0$ from the population subject to selection biases. As mentioned in the previous section, there have been concerns about providing telematics records, such as privacy and security issues; hence, it is natural to expect that a policyholder might not be willing to provide their telematics records to the insurer unless the expected benefits from the provision outweigh the possible concerns. Therefore, we can think of the following conjectures:
• Those who are younger tend to agree to provide their telematics records more, as they could be less reluctant to adopt technology, or the compensation they require for disclosing private information to get a UBI policy is lower, according to Derikx et al. (2016). It implies that the probability of observing a data point in $S_0$ could be negatively correlated with the driver's age.
• Those who are less risky tend to agree to provide their telematics records, or UBI policyholders tend to be less risky drivers, according to Reimers and Shiller (2019) and Cather (2020), so that the accessibility of $x_T$ is prone to favorable selection. It implies that the probability of observing a data point in $S_0$ could be negatively correlated with the number of claims ($n_i$).
• Those who drive less frequently tend to agree to provide their telematics records more, since the premium is low in UBI products, as in Boucher et al. (2013). It implies that the probability of observing a data point in $S_0$ could be negatively correlated with the self-perceived mileage.
While our main task is neither to detect possible selection biases in the availability of telematics features nor prove such conjectures, we consider the situations where such conjectures could hold and discuss the benefits of the proposed framework compared to preexisting benchmarks in various situations.

Methodology
The general framework that we follow to estimate the model parameters using the proposed method is briefly described in this section. We are interested in estimating $E(N_i \mid x_i) = m(x_i\beta)$, where $m(\cdot)$ is a known function and $\beta$ is an unknown parameter, while $N_i$ is the observed number of claims for policyholder $i$ with $i = 1, \ldots, M$. We assume that the $N_i$ are independently distributed with a Poisson distribution with mean $\mu_i$. Using the canonical link function as given in Agresti (2003), we can express $m(\cdot) = \exp(\cdot)$. Let $t_i$ be an exposure variable associated with the $i$th claim count, such as the duration of a policy. Then, $\eta_i$ is the average number of claims per unit of the $i$th duration. Now we can redefine the regression model in terms of $\eta_i$ as
$$\log \eta_i = \sum_{j} x_{ij}\beta_j = x_i\beta,$$
where $x_{ij}$ is the $j$th feature of policyholder $i$ and $x_{i0} = 1$. Thus, we can reform this model using the definition $\mu_i = t_i\eta_i$ as
$$\log \mu_i = \log t_i + x_{i\tau}\beta_1 + x_{iT}\beta_2, \tag{3.1}$$
where $\beta = (\beta_1, \beta_2)$ collects the coefficients of the traditional and telematics features, respectively. Now, using model (3.1), the census estimating equation for $\beta$ can be written as
$$\sum_{i=1}^{M} U(\beta; x_i, n_i) = 0, \tag{3.2}$$
where $U(\beta; x, n) = \{n - t\exp(x\beta)\}x$ is the estimating function for $\beta$ with a Poisson distribution. However, as mentioned in Section 2, $x_{iT}$ (which corresponds to the telematics features of policyholder $i$) is subject to missingness and only observable in $S_0$. In this regard, one can consider the following equation to estimate $\beta_1$ and $\beta_2$ simultaneously:
$$\sum_{i \in S_0} \omega_i U(\beta; x_i, n_i) = 0, \tag{3.3}$$
where $\omega_i$ is a propensity weight to handle possible selection biases.
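To make the weighted estimating equation (3.3) concrete, the following is a minimal pure-Python sketch that solves $\sum_i \omega_i \{n_i - t_i \exp(x_i\beta)\}x_i = 0$ by Newton-Raphson. The function names (`solve_linear`, `weighted_score`, `fit_weighted_poisson`) are hypothetical and introduced only for illustration; the weights are taken as given, and this is not the implementation used in the paper.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def weighted_score(beta, xs, ns, ts, ws):
    """Weighted sum of U(beta; x, n) = {n - t exp(x beta)} x."""
    p = len(beta)
    total = [0.0] * p
    for x, n, t, w in zip(xs, ns, ts, ws):
        resid = n - t * math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
        for j in range(p):
            total[j] += w * resid * x[j]
    return total


def fit_weighted_poisson(xs, ns, ts, ws, iters=50, tol=1e-10):
    """Newton-Raphson for sum_i w_i U(beta; x_i, n_i) = 0, started at beta = 0."""
    p = len(xs[0])
    beta = [0.0] * p
    for _ in range(iters):
        u = weighted_score(beta, xs, ns, ts, ws)
        # Negative Jacobian of the score: sum_i w_i t_i exp(x_i beta) x_i x_i^T
        info = [[0.0] * p for _ in range(p)]
        for x, t, w in zip(xs, ts, ws):
            mu = t * math.exp(sum(xj * bj for xj, bj in zip(x, beta)))
            for j in range(p):
                for k in range(p):
                    info[j][k] += w * mu * x[j] * x[k]
        step = solve_linear(info, u)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

Setting every weight to 1 gives the usual Poisson GLM fit on the rows supplied; nonuniform weights change which policyholders dominate the fit, which is the mechanism the propensity weights rely on.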
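To incorporate the partial information in $S_1$, where we only observe $x_{i\tau}$ and $n_i$, we wish to construct the propensity weight $\omega_i$ such that
$$\sum_{i \in S} \delta_i \omega_i E\{U(\beta; x_i, n_i) \mid x_{i\tau}, n_i\} = \sum_{i \in S} E\{U(\beta; x_i, n_i) \mid x_{i\tau}, n_i\}, \tag{3.4}$$
where $\delta_i = 1$ if $i \in S_0$ and $\delta_i = 0$ otherwise. The propensity score (PS) is defined as $\pi_i = P(\delta_i = 1 \mid x_i, n_i)$. The property of the propensity score estimating equation in (3.4) is called self-efficiency, as it leads to an efficient estimation of $\beta$ as long as the conditional expectation $E\{U(\beta; x_i, n_i) \mid x_{i\tau}, n_i\}$ is correctly specified. Here, we assume that the sampling mechanism for $S_0$ is missing at random (MAR) in the sense of Rubin (1976). That is, we assume
$$P(\delta_i = 1 \mid x_i, n_i) = P(\delta_i = 1 \mid x_{i\tau}, n_i).$$
To find $\omega_i$ satisfying (3.4), we first find the basis functions satisfying
$$E\{U(\beta; x_i, n_i) \mid x_{i\tau}, n_i\} \in \mathrm{span}\{b_1(x_{i\tau}, n_i), \ldots, b_L(x_{i\tau}, n_i)\}, \tag{3.5}$$
where the span implies that the conditional expectation is represented by a combination of basis functions $b_l$, $l = 1, \ldots, L$, that are formed only using the traditional features and the observed number of claims. Under (3.5), estimating the conditional expectation $E\{U(\beta; x_i, n_i) \mid x_{i\tau}, n_i\}$ can still be difficult in practice. To avoid this difficulty, we consider an alternative method using (3.4) without estimating the conditional expectation.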
To achieve this goal, using the basis functions in (3.5), we impose the following system of equations as a constraint for finding the propensity weights $\omega_i$ in (3.4):
$$\sum_{i \in S_0} \omega_i b_{li} = \sum_{i \in S} b_{li} \quad \text{for every } l, \tag{3.6}$$
where $b_{li} = b_l(x_{i\tau}, n_i)$ is an integrable function of the traditional features and the observed number of claims. The basis is inspired by the form of the Poisson score function, which implies $L = 2v + 1$, where $v$ is the number of features in $x_{i\tau}$. Constraint (3.6) is often called the covariate-balancing property (Imai and Ratkovic, 2014) in the context of causal inference, which enables an efficient estimation of the propensity score by ensuring that the distributions of the available covariates in the propensity-weighted sample and in the population are similar. The following proposition shows that covariate balancing is a sufficient condition for self-efficiency in (3.4).
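The link between covariate balancing and self-efficiency can be sketched in one line of algebra. The display below is an illustrative sketch only, not the proof in Appendix A: it assumes that (3.5) means $E\{U_i \mid x_{i\tau}, n_i\} = \sum_l \alpha_l b_{li}$ for some coefficients $\alpha_l$ (with $U_i$ shorthand for $U(\beta; x_i, n_i)$), and that the weights balance every basis function between $S_0$ and $S$.

```latex
\begin{align*}
\sum_{i \in S_0} \omega_i \, E\{U_i \mid x_{i\tau}, n_i\}
  &= \sum_{i \in S_0} \omega_i \sum_{l} \alpha_l b_{li}
   = \sum_{l} \alpha_l \sum_{i \in S_0} \omega_i b_{li} \\
  &= \sum_{l} \alpha_l \sum_{i \in S} b_{li}
   = \sum_{i \in S} E\{U_i \mid x_{i\tau}, n_i\}.
\end{align*}
```

The third equality is where the balancing constraint enters; the coefficients $\alpha_l$ never need to be estimated because they cancel on both sides.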
Proof. See Appendix A.
Now, to uniquely determine $\omega_i$, we can use the information projection of Wang and Kim (2021) under the constraint (3.6) to obtain the weight function in (3.7), which depends on an unknown parameter $\phi = (\phi_0, \ldots, \phi_L)$. The parameters are estimated by solving the calibration equation in (3.6).
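Solving a calibration equation for the weight parameters can be sketched as follows. This is a hedged illustration that assumes an exponential-tilting weight $\omega_i = \exp(\sum_l \phi_l b_{li})$, a common choice in calibration weighting that is not necessarily the exact form of (3.7). The helper names (`solve_linear`, `calibrate`) are hypothetical, and the first basis function is assumed to be the constant 1 so that the first population total equals $M$.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def tilt_weights(phi, basis_s0):
    """Assumed weight form: w_i = exp(phi . b_i) for each unit in S0."""
    return [math.exp(sum(f * bj for f, bj in zip(phi, b))) for b in basis_s0]


def calibrate(basis_s0, totals, iters=100, tol=1e-12):
    """Find phi with sum_{i in S0} w_i(phi) b_i = totals (population sums over S)."""
    p = len(totals)
    phi = [0.0] * p
    # Start from uniform weights M / M0 (assumes b_i[0] == 1 and totals[0] == M).
    phi[0] = math.log(totals[0] / len(basis_s0))
    for _ in range(iters):
        w = tilt_weights(phi, basis_s0)
        gap = [sum(wi * b[j] for wi, b in zip(w, basis_s0)) - totals[j]
               for j in range(p)]
        jac = [[sum(wi * b[j] * b[k] for wi, b in zip(w, basis_s0))
                for k in range(p)] for j in range(p)]
        step = solve_linear(jac, [-g for g in gap])
        phi = [f + s for f, s in zip(phi, step)]
        if max(abs(s) for s in step) < tol:
            break
    return phi, tilt_weights(phi, basis_s0)
```

With a constant and one binary basis function, the calibrated weights reduce to the ratio of population count to sample count within each group, which gives a quick sanity check of the solver.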
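Once $\phi_0, \ldots, \phi_L$ are estimated by (3.6) and (3.7), we can use $\hat\omega_i = \omega_i(\hat\phi)$ as the final propensity weights for estimating $\beta$ using (3.8):
$$\sum_{i \in S_0} \hat\omega_i(\hat\phi)\, U(\beta; x_i, n_i) = 0. \tag{3.8}$$
Because the propensity weights satisfy the calibration equation in (3.6), they satisfy the self-efficiency without estimating the regression coefficients $\alpha$ in the working model $E\{U_j(\beta; x_i, n_i) \mid x_{i\tau}, n_i\} = \sum_l \alpha_{jl} b_{li}$, where $U_j$ denotes the $j$th component of $U$. However, the vector space spanned in (3.5) implicitly assumes the regression model $U_j(\beta; x_i, n_i) = \sum_l \alpha_{jl} b_{li} + e_{ij}$, where $e_{ij} \sim (0, V_j)$.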
Proof. See Appendix B.
Now, to improve the proposed method, we may use the information on the model variance. Suppose that $V(e_i) = v_i$ is available; then we can use the variance-adjusted weights in (3.10) as the final propensity weights for estimating $\beta$. These weights still achieve (3.4), where $\hat\alpha$ denotes the estimated regression coefficients of the (linear) working model explained in Equation (3.5).
We can simply use the class in (3.10) as a class of calibration weights and choose $v_i = f(b_i)$ such that (3.6) holds and the variance is reduced (by downweighting the large weights). One way is to use $v_i$ from the conditional variance of $U_i$ given the covariates.
Now, the estimation scheme for the study is listed in the order required by the estimation process:
1. Find the basis functions $b_l$ satisfying (3.5), where $U(\beta; x, n)$ is the estimating function for $\beta$.
2. Find $v_i$ using a suitable method.
3. Obtain $\hat\phi$ by solving the calibration equation in (3.6).
4. Obtain $\hat\beta$ by solving the weighted estimating equation (3.3) with $\omega_i$ replaced by $\hat\omega_i(\hat\phi)$.
The estimation of the standard error of $\hat\beta$ is presented in Appendix C.

Simulation study
In this section, we use a hypothetical and less complex finite population to test the validity and applicability of the proposed method. More specifically, it allows us to quantify the estimation performance of the regression coefficients with the proposed model (compared to the benchmarks) using finite samples from a predetermined distribution. In this regard, we assume four hypothetical scenarios in which traditional features are fully available while telematics features are partially available, depending on the sampling mechanism of observations with telematics information. We generate a finite population of size 100,000 according to a specification in which $\mathcal{P}$, $\mathcal{U}$, $\mathrm{Ber}$, $\mathcal{N}$, and $\mathcal{G}$ refer to Poisson, uniform, Bernoulli, normal, and gamma distributions, respectively. Here, $x_{Ai}$ refers to a traditional continuous variable with a quadratic effect (e.g., driver's age), $x_{Gi}$ refers to a traditional binary variable (e.g., gender), $x_{Mi}$ refers to a traditional variable like self-perceived mileage, and $x_{Ti}$ refers to a telematics variable that has significant impacts on the risk profile. Let $S^*$ be the finite population generated according to the notation used in Section 2. Once a finite population is generated, the following scheme is applied to split the data.
1. First, 10% of the data points are set aside where $\{N_i, x_{i\tau}, x_{iT}\}$ are all available, which is equivalent to $S_0$ in Section 2. Depending on the assumption on the availability of telematics information, we apply the following four sampling schemes for $S_0$:
• Random selection: Data points assigned to $S_0$ are chosen at random.
• Age selection: Each data point assigned to $S_0$ is chosen with a sampling probability proportional to $1/\{1 + \exp(3x_{Ai})\}$, which means that younger drivers are more likely to provide telematics information due to their lower resistance to new technologies. In this case, $\delta \perp N \mid x_\tau$.
• Favorable selection: Each data point assigned to $S_0$ is chosen with a sampling probability proportional to $1/\{1 + \exp(2N_i)\}$, which means that those with less risky behaviors are more likely to provide the telematics information. In this case, $\delta \not\perp N \mid x_\tau$, since the selection depends directly on the claim count.
• Mileage selection: Each data point assigned to $S_0$ is chosen with a sampling probability proportional to $1/\{1 + \exp(x_{Mi})\}$, which means that those with lower mileage are more likely to provide the telematics information. In this case, $\delta \perp N \mid x_\tau$.
2. After that, 80% of the data points are used as a large dataset, but only with traditional features and the response variable $\{N_i, x_{i\tau}\}$, which is equivalent to $S_1$ in Section 2.
3. Lastly, to incorporate the possibility that a policyholder may or may not opt for a telematics policy over time, 10% of the data points are randomly set aside as $T$ for out-of-sample validation (equivalently, the test set $T$ is a representative sample of the population), where $\{N_i, x_{i\tau}, x_{iT}\}$ are all available.
For comparison, we consider the following models to estimate $\beta_1$ and $\beta_2$:
• Naive model: Fit a usual Poisson GLM using the data points in $S_0$, which is equivalent to solving (3.3) for $\beta$ assuming that $\omega_i = 1$ for all $i$.
• Traditional model: Fit a usual Poisson GLM using only traditional features and the response variable $\{N_i, x_{i\tau}\}$ in $S_0 \cup S_1$, which is equivalent to solving (3.2) for $\beta_1$ assuming that $\beta_2 = 0$. As such, this model does not allow the use of telematics information in the risk classification.
• Full model: It uses all data points in the training set to estimate the regression coefficients, which is equivalent to solving (3.2) for $\beta$. Therefore, it is expected to provide the best estimation performance, but may not be available in practice.
• Boosting model: It uses the same estimates of $\beta_1$ from the traditional model and computes $\hat\eta_i = \exp(x_{i\tau}\hat\beta_1)$ for each observation $i$ in $S_0$. After that, another Poisson GLM is fitted with $S_0$ where the telematics information, $x_{iT}$, is the only regressor and $\log \hat\eta_i$ is used as an offset to further estimate $\beta_2$, as mentioned in Ayuso et al. (2019). It is equivalent to solving (3.3) for $\beta_2$ assuming that $\omega_i = 1$ for all $i$ while $\beta_1$ is replaced with its estimate from the traditional model.
• Proposed model: It follows the estimation procedures described in Section 3, which is equivalent to solving (3.3) for $\beta$ where $\omega_i$ is replaced by $\hat\omega_i(\hat\phi)$ for all $i$. In this study, we use $1/v_i = \mathrm{Deviance(Traditional)}_i - \mathrm{Deviance(Naive)}_i$, where $\mathrm{Deviance}_i$ is the deviance contribution of the $i$th individual in $S_0$.
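The selection schemes for $S_0$ can be sketched in a few lines of Python. The code below is an assumption-laden illustration: the function names are hypothetical, and the draw uses sequential weighted sampling without replacement (the Efraimidis-Spirakis key method), which is one reasonable reading of "sampling probability proportional to" but may differ from the sampling design actually used in the study.

```python
import math
import random


def selection_weight(scheme, n_claims, age, mileage):
    """Unnormalized selection weight of one policyholder under each scheme."""
    if scheme == "random":
        return 1.0
    if scheme == "age":
        return 1.0 / (1.0 + math.exp(3.0 * age))
    if scheme == "favorable":
        return 1.0 / (1.0 + math.exp(2.0 * n_claims))
    if scheme == "mileage":
        return 1.0 / (1.0 + math.exp(mileage))
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_sample_without_replacement(weights, k, rng):
    """Return k distinct indices; units with larger weights are favored.

    Each unit gets the key log(u) / w with u uniform on (0, 1], and the k
    largest keys are kept (log form of the Efraimidis-Spirakis method).
    """
    keys = [(math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:k]]
```

For example, applying the "favorable" weights to a population in which half of the policyholders have zero claims and half have three claims yields an $S_0$ made up almost entirely of claim-free policyholders, which is the kind of non-representative sample the naive model has to cope with.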
After fitting all models, the regression estimates of these models were used to find the predicted values $\hat N_i = \exp(x_{i\tau}\hat\beta_1 + x_{iT}\hat\beta_2)$ for $i$ in the out-of-sample validation set $T$. Note that the generation of the finite population, the data split, the estimation of the regression coefficients, and the out-of-sample validation are repeated $R = 1000$ times with different random seeds.
Table 1 shows the estimation results of the regression coefficients under different model specifications and sampling schemes. Here, bias, root mean square error (RMSE), and 90% confidence interval coverage (CI) of $\hat\beta_j$ are defined as follows:
$$\mathrm{Bias}(\hat\beta_j) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat\beta_j^{(r)} - \beta_j\right), \qquad \mathrm{RMSE}(\hat\beta_j) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat\beta_j^{(r)} - \beta_j\right)^2},$$
$$\mathrm{CI}(\hat\beta_j) = \frac{1}{R}\sum_{r=1}^{R} I\left[\,\left|\hat\beta_j^{(r)} - \beta_j\right| \le z_{0.95}\,\mathrm{SE}(\hat\beta)_j^{(r)}\right],$$
where $\hat\beta_j^{(r)}$ is the estimate of $\beta_j$ at the $r$th simulation, $\mathrm{SE}(\hat\beta)_j^{(r)}$ is the estimated standard error of $\hat\beta_j^{(r)}$, and $z_{0.95}$ is the 95th percentile of the standard normal distribution.
From Table 1, it is clearly observed that if the sampling mechanism of $S_0$ is purely random, then the use of the naive model is less problematic in terms of estimation performance. While the full model shows the best estimation performance, followed by the proposed model, the boosting model (and correspondingly the traditional model) suffers from biases in $\hat\beta_0$ and $\hat\beta_T$. One can also observe that although the naive model is unbiased in the case of random selection, it is less efficient in the parameter estimation than both the full and proposed models, as shown by its higher values of RMSE. When the sampling mechanism is age selection, the naive model has larger biases for $\hat\beta_{A1}$ and $\hat\beta_{A2}$ compared to the full and proposed models, as these coefficients correspond to the age covariate that comes with selection biases in this scenario. On the other hand, if the sampling mechanism of $S_0$ is prone to favorable selection, then the differences in estimation performance are more dramatic. Unlike the random sampling case, the naive model severely suffers from lack of fit and biases in the estimates, while only the full and proposed models provide acceptable ranges of estimates, as $S_0$ is no longer a representative sample of the finite population. Lastly, in the case of mileage selection, there is no significant improvement in the estimation performance of the proposed model compared to the naive model, but the insight from the results is similar to that of the age selection.
Note that the values of Bias, RMSE, and CI of the traditional and full models are identical across all four sampling methods, which is natural as neither model depends on the sample split between $S_0$ and $S_1$ for the estimation of the regression coefficients. The boosting model also shows identical values of Bias, RMSE, and CI for all the traditional covariates, as it shares the estimated coefficients with the traditional model by definition.
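The three estimation criteria can be computed from the replicated estimates as in the following sketch. It assumes the standard definitions (mean error, root mean squared error, and the share of normal-approximation intervals that contain the target value); the function name `estimation_metrics` is hypothetical.

```python
import math
from statistics import NormalDist


def estimation_metrics(estimates, std_errors, target, level=0.90):
    """Bias, RMSE and interval coverage of replicated estimates of one coefficient.

    estimates[r] and std_errors[r] come from the r-th replication; target is
    the reference value of the coefficient.
    """
    r = len(estimates)
    # Two-sided normal critical value, about 1.645 for a 90% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    bias = sum(b - target for b in estimates) / r
    rmse = math.sqrt(sum((b - target) ** 2 for b in estimates) / r)
    covered = sum(
        1 for b, se in zip(estimates, std_errors) if abs(b - target) <= z * se
    )
    return bias, rmse, covered / r
```

A coverage well below the nominal level, together with a nonzero bias, is the pattern one would expect from a model fitted on a non-representative $S_0$.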
Note that mileage can appear in both traditional and telematics datasets as self-perceived mileage and actual mileage, respectively. If the actual mileage is used for the selection, the sampling scheme with mileage selection becomes non-ignorable. While it could be meaningful to consider the non-ignorable missing mechanism in the UBI context (choosing a UBI policy based on telematics variables) as mentioned in Boucher et al. (2013), handling a non-ignorable missing pattern requires jointly modeling $\delta$ and $x_T$ (Heckman, 1976; Glynn et al., 2013), which comes with many more distributional assumptions and restrictions. In this regard, we leave this issue as a future research topic and refrain from discussing it further in the current paper.
Note that the efficiency gain in the estimation of $\beta_2 = \beta_T$ using the proposed model is no better than that of the naive model, unlike in the cases of the coefficients of the traditional features, $\beta_1$. This is reasonable since there is no information to borrow from $S_1$ to better estimate $\beta_2$ in the proposed model.
After assessing the estimation performance of each model, we use the out-of-sample validation set $T_r$ for each $r = 1, \ldots, R$ to compare their predictive performance. In the out-of-sample validation, we use the prediction RMSE (pRMSE) and the Poisson deviance statistic (DEV) defined as follows:
$$\mathrm{pRMSE}^{(r;k)} = \sqrt{\frac{1}{|T_r|}\sum_{i \in T_r}\left(N_i - \hat N_i^{(r;k)}\right)^2}, \qquad \mathrm{DEV}^{(r;k)} = 2\sum_{i \in T_r}\left[N_i \log\frac{N_i}{\hat N_i^{(r;k)}} - \left(N_i - \hat N_i^{(r;k)}\right)\right],$$
where $|T_r|$ is the number of observations in $T_r$, the predicted value $\hat N_i^{(r;k)}$ is generated by model $k$ with the $r$th simulation sample, and $N_i \log N_i$ is taken to be 0 when $N_i = 0$. Avg_pRMSE and Avg_DEV denote the averages of these statistics over the $R$ replications.
Table 2 presents the out-of-sample validation performance of the aforementioned models. Again, the values of Avg_pRMSE and Avg_DEV of the traditional and full models are identical across all four sampling methods, as the estimated regression coefficients used for the prediction are identical across all sampling methods. As in Table 1, the naive and boosting models are more vulnerable when the availability of telematics information is prone to favorable selection. It is also shown that the predictive performance of the traditional model is generally inferior to that of the other models, since it completely ignores the impacts of the available telematics information. Lastly, the proposed model shows satisfactory prediction performance comparable to that of the full model (ideal yet not available in practice) in all scenarios for the missing mechanism.
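The two validation statistics are simple to compute for a single validation set. The sketch below assumes the usual unnormalized Poisson deviance with the convention that $n \log n = 0$ when $n = 0$; the function names are hypothetical, and the scaling used in the study may differ.

```python
import math


def prediction_rmse(observed, predicted):
    """Root mean squared prediction error over one validation set."""
    sq = sum((n - mu) ** 2 for n, mu in zip(observed, predicted))
    return math.sqrt(sq / len(observed))


def poisson_deviance(observed, predicted):
    """Poisson deviance 2 * sum[n log(n / mu) - (n - mu)], with 0 log 0 = 0."""
    total = 0.0
    for n, mu in zip(observed, predicted):
        log_term = n * math.log(n / mu) if n > 0 else 0.0
        total += log_term - (n - mu)
    return 2.0 * total
```

Averaging each statistic over the replications then gives summary measures of the Avg_pRMSE and Avg_DEV kind for a given model.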

Data description
To assess the validity and applicability of the proposed method under a more realistic environment than the simulation study, with possible sampling biases, we use a synthetic dataset from the study of So et al. (2021) that includes traditional characteristics, telematics characteristics, and the response variable. As mentioned in Section 1, it has been difficult for researchers to access a dataset on insurance claims with telematics features due to privacy concerns and proprietary issues of insurers. In this regard, So et al. (2021) effectively emulated a synthetic dataset that shares remarkably similar statistics with the original dataset yet still preserves the privacy of the observations from the original source. Because a realized data split for $S_0$, $S_1$, and $T$ obtained simultaneously from an actual insurance portfolio is not available, we assume here that the synthetic dataset of So et al. (2021) is the finite population including both the traditional and telematics features, while the data splits follow the sampling schemes of Section 4. Note that our purpose is not to detect selection biases in an actual insurance portfolio, but to quantify the impacts of the proposed method under potential selection biases. Although the available features in the dataset are already summarized in tabular format, compared to the raw data directly obtained from the telematics device, they are still high dimensional. For example, one of the "traditional" features is Region, which is a categorical variable with 55 categories.
However, the proposed data integration approach is based on estimating equations and GLMs, so it lacks the ability to handle high dimensionality on its own, unlike neural network models or tree-based models. In this regard, some of the available features were preprocessed. Due to the high dimension of the dataset and the complexity of defining some of its features, territorial embedding and principal component analysis (PCA) were used to clean up the dataset. After data preprocessing, we retained the variables described in Table 3 for our analysis. For details of the data preprocessing, see Jeong (2022).

Estimation and prediction results
Unlike the simulation study, it is hardly possible to believe that the actual observations in the synthetic dataset follow the specified Poisson GLM. In this regard, we replicate the empirical distribution of the preprocessed dataset (which is our finite population here) by generating bootstrap samples, to ensure that each observation has the same empirical distribution as the finite population. More specifically, we take bootstrap samples $S_0$ and $S_1$ of sizes 100,000 and 800,000, respectively, under each of the sampling schemes listed in Section 4. Subsequently, a bootstrap sample $T$ of size 100,000 is taken at random for out-of-sample validation. After that, we repeat the process of fitting and testing the five models as in Section 4 $R = 500$ times to compare the estimation and predictive performance under each sampling scheme.
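To assess the in-sample estimation performance, we compare the estimated regression coefficients of each method and sampling scheme with the estimated regression coefficients obtained from the finite population. More specifically, bias, root mean squared error (RMSE), and 90% confidence interval coverage (CI) of the regression coefficients are defined as follows:
$$\mathrm{Bias}(\hat\beta_j) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat\beta_j^{(r)} - \tilde\beta_j\right), \qquad \mathrm{RMSE}(\hat\beta_j) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat\beta_j^{(r)} - \tilde\beta_j\right)^2},$$
$$\mathrm{CI}(\hat\beta_j) = \frac{1}{R}\sum_{r=1}^{R} I\left[\,\left|\hat\beta_j^{(r)} - \tilde\beta_j\right| \le z_{0.95}\,\mathrm{SE}(\hat\beta)_j^{(r)}\right],$$
where $\tilde\beta_j$ and $\hat\beta_j^{(r)}$ are the estimates of $\beta_j$ using the finite population and the $r$th bootstrap sample, respectively, $\mathrm{SE}(\hat\beta)_j^{(r)}$ is the estimated standard error of $\hat\beta_j^{(r)}$, and $z_{0.95}$ is the 95th percentile of the standard normal distribution. Note that in our case, we prefer a method with biases closer to 0, smaller RMSEs, and/or CIs closer to the theoretical benchmark of 90%.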
Table S1 shows the estimation results of the regression coefficients under different model specifications and sampling schemes of the bootstrap samples from the preprocessed synthetic data. Note that the estimated coefficients from the traditional model were omitted, as they are only available for the traditional features and are identical to those from the boosting model. Implications of the estimation results with the actual data are as follows.
• In the case of random selection, only the boosting model suffers from biases in the regression coefficients, and there are no big differences in the estimation performance between the naive and proposed models. It implies that as long as the sampling mechanism of $S_0$ (a small dataset with both traditional and telematics features) from the finite population is purely random, it is acceptable to ignore $S_1$ (a large dataset with only traditional features) and analyze $S_0$ alone for ratemaking purposes.
• In the case of age selection and mileage selection, the naive model is more biased in the estimation of the traditional covariates (especially the intercept term) compared to the proposed model. It implies that if the observability of the telematics features depends on the traditional features, then the proposed approach might be helpful in better understanding the underlying impacts of the covariates on the claim experience.
• Lastly, in the case of favorable selection, the proposed model is no longer unbiased, but the naive model is still more biased in the estimation of the regression coefficients. Therefore, if accessibility to telematics features is affected by favorable selection, it is recommended to integrate the two data sources to handle the missingness and selection biases of the telematics features.
Such differences are also visualized in Figures S1, S2, S3, and S4, where a model with biases closer to 0, smaller RMSEs, and/or CIs closer to 90% receives the higher rank for each covariate. It is consistently observed that, in the case of either age or favorable selection, the proposed model is the second best, following the full model that is unattainable in practice.
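In addition to the estimation performance, the out-of-sample validation performance is assessed using the averages over the replications of the prediction RMSE and the Poisson deviance of model $k$, together with the proportions of replications in which model $k$ performs worse than the proposed model:
$$\frac{1}{R}\sum_{r=1}^{R}\mathrm{pRMSE}^{(r;k)}, \quad \frac{1}{R}\sum_{r=1}^{R} I\left[\mathrm{pRMSE}^{(r;k)} > \mathrm{pRMSE}^{(r;\mathrm{proposed})}\right], \quad \frac{1}{R}\sum_{r=1}^{R}\mathrm{DEV}^{(r;k)}, \quad \frac{1}{R}\sum_{r=1}^{R} I\left[\mathrm{DEV}^{(r;k)} > \mathrm{DEV}^{(r;\mathrm{proposed})}\right].$$
The results demonstrate that the proposed framework can effectively integrate traditional and telematics data, while also managing potential favorable selection problems. This conclusion is supported by a simulation study and an empirical analysis using a synthetic telematics dataset, as the proposed approach achieves satisfactory performance both in the in-sample estimation and in the out-of-sample prediction, compared to the existing benchmarks for automobile insurance ratemaking practices. Thus, the proposed approach has the potential to improve risk classification in auto insurance and assist insurers in making informed decisions.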
The possible extension of this article is twofold. First, the proposed data integration approach relies on the assumption in (3.5), so it might not work well if the basis function of $E\{U(\beta; x_i, n_i) \mid x_{i\tau}, n_i\}$ is not correctly specified. To address such a problem, one can implement a doubly robust calibration approach that only requires either the basis function of the outcome variable or the propensity score to be correctly specified. Second, the proposed approach can be extended to data integration for mixed-effects models where a policyholder is observed over a period of time, so that the proposed framework can also consider random effects for experience ratemaking, as well as the fixed effects.

Table 1.
Estimation performance with the simulated data (here N, T, B, F, and P refer to the Naive, Traditional, Boosting, Full, and Proposed models, respectively).

Table 2.
Out-of-sample validation performance with the simulated data. Here $|T_r|$ is the number of observations in $T_r$, and the predicted value $\hat N_i^{(r;k)}$ is generated by model $k$ with the $r$th simulation sample.

Table 3.
Variable names and descriptions of the preprocessed dataset.