
Clustered flexible calibration plots for binary outcomes using random effects modeling

Published online by Cambridge University Press:  29 December 2025

Lasai Barreñada
Affiliation:
Department of Development and Regeneration, KU Leuven, Belgium; Leuven Unit for Health Technology Assessment Research (LUHTAR), KU Leuven, Belgium
Bavo De Cock Campo
Affiliation:
Department of Metabolism, Digestion and Reproduction, Imperial College, United Kingdom
Laure Wynants
Affiliation:
Department of Development and Regeneration, KU Leuven, Belgium; Leuven Unit for Health Technology Assessment Research (LUHTAR), KU Leuven, Belgium; Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, the Netherlands
Ben Van Calster*
Affiliation:
Department of Development and Regeneration, KU Leuven, Belgium; Leuven Unit for Health Technology Assessment Research (LUHTAR), KU Leuven, Belgium; Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands; Julius Center, Department of Data Science & Biostatistics, University Medical Center (UMC) Utrecht, Utrecht, the Netherlands
*
Corresponding author: Ben Van Calster; Email: ben.vancalster@kuleuven.be

Abstract

Evaluation of clinical prediction models across multiple clusters, whether centers or datasets, is becoming increasingly common. A comprehensive evaluation includes an assessment of the agreement between the estimated risks and the observed outcomes, also known as calibration. Calibration is of utmost importance for clinical decision making with prediction models, and it often varies between clusters. We present three approaches to take clustering into account when evaluating calibration: (1) clustered group calibration (CG-C), (2) two-stage meta-analysis calibration (2MA-C), and (3) mixed model calibration (MIX-C). These approaches produce flexible calibration plots using random effects modeling and provide confidence intervals (CIs) and prediction intervals (PIs). As a case example, we externally validate a model to estimate the risk that an ovarian tumor is malignant in multiple centers (N = 2489). We also conduct a simulation study and a study with synthetic data generated from a true clustered dataset to evaluate the methods. In the simulation study, MIX-C and 2MA-C (splines) gave estimated curves closest to the true overall curve. In the synthetic data study, MIX-C produced cluster-specific curves closest to the truth. Coverage of the PI across the plot was best for 2MA-C with splines. We recommend using 2MA-C with splines to estimate the overall curve and 95% PIs, and MIX-C for cluster-specific curves, especially when the sample size per cluster is limited. We provide ready-to-use code to construct summary flexible calibration curves, with CIs and PIs to assess heterogeneity in calibration across datasets or centers.
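The second stage of the 2MA-C approach pools cluster-specific calibration curve estimates pointwise with a random-effects meta-analysis. As an illustrative sketch only (not the authors' implementation), the following function pools estimates at one grid point of the curve (e.g., logit observed proportions) using a simple z-based DerSimonian-Laird estimator; published PIs typically use a t-distribution with k - 2 degrees of freedom instead of z.

```python
import math

def pool_random_effects(theta, se, z=1.96):
    """Pool cluster-specific estimates at one grid point of a calibration
    curve with a DerSimonian-Laird random-effects meta-analysis.
    Returns the pooled estimate, an approximate confidence interval,
    an approximate prediction interval, and the between-cluster variance."""
    var = [s * s for s in se]
    w = [1.0 / v for v in var]                       # fixed-effect weights
    sw = sum(w)
    theta_fe = sum(wi * t for wi, t in zip(w, theta)) / sw
    q = sum(wi * (t - theta_fe) ** 2 for wi, t in zip(w, theta))  # Cochran's Q
    df = len(theta) - 1
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                    # between-cluster variance
    w_re = [1.0 / (v + tau2) for v in var]           # random-effects weights
    pooled = sum(wi * t for wi, t in zip(w_re, theta)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    ci = (pooled - z * se_pooled, pooled + z * se_pooled)
    # the prediction interval additionally reflects between-cluster heterogeneity
    se_pred = math.sqrt(se_pooled ** 2 + tau2)
    pi = (pooled - z * se_pred, pooled + z * se_pred)
    return pooled, ci, pi, tau2
```

Repeating this over a grid of estimated risks yields the summary curve with its CI and PI bands. With heterogeneous cluster estimates the PI is wider than the CI; when clusters agree, tau2 shrinks to zero and the two intervals coincide.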

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Figure 1 Prevalence and mean predicted ADNEX risk for each of the 14 centers in the dataset. The dashed diagonal line indicates perfect calibration.


Figure 2 Traditional flexible calibration curves for the ADNEX model in the motivating example. The observed proportion is estimated with a logistic model using restricted cubic splines to capture non-linear effects, and estimated risks are grouped into 10 groups. Confidence intervals, based on 1000 bootstrap samples, are shown as a shaded area for the spline-based curve and as + markers for grouped calibration. The dashed diagonal line indicates perfect calibration.
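The grouped calibration points in the caption above come from binning estimated risks into quantile groups and comparing the mean predicted risk with the observed event proportion per bin. A minimal sketch of this grouping step (an illustration, not the paper's code):

```python
def grouped_calibration(p, y, n_groups=10):
    """Bin predicted risks p into n_groups quantile groups and return,
    per group, (mean predicted risk, observed proportion of events y)."""
    pairs = sorted(zip(p, y))          # sort observations by predicted risk
    n = len(pairs)
    points = []
    for g in range(n_groups):
        lo, hi = g * n // n_groups, (g + 1) * n // n_groups
        chunk = pairs[lo:hi]
        if not chunk:
            continue
        mean_pred = sum(pi for pi, _ in chunk) / len(chunk)
        observed = sum(yi for _, yi in chunk) / len(chunk)
        points.append((mean_pred, observed))
    return points
```

Plotting these points against the diagonal gives the "+" markers in a traditional calibration plot; points above the diagonal indicate underestimated risk, points below indicate overestimated risk.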


Table 1 Overview of introduced methodologies for creating flexible calibration curves accounting for clustering


Figure 3 Comparison of standard logistic regression with splines and the three introduced methodologies, with CIs (bright shading) and PIs (light shading). CG-C used 10 quantile groups, 2MA-C fitted center-specific curves with splines or LOESS, and MIX-C used random intercepts and slopes with restricted cubic splines and three knots. The dashed diagonal line indicates perfect calibration.


Figure 4 Boxplots of mean squared calibration error (log scale) for fixed prediction models, varying the number of validation events per center (EPC) and the number of centers in the validation data.


Table 2 Median (IQR) squared difference between true average probabilities and estimated observed proportions (MSCE) with logistic calibration, CG-C (10 groups), 2MA-C, and MIX-C methods for a logistic model, varying the number of validation events per center and the number of centers. Bold numbers indicate the best performance across approaches


Table 3 Median (IQR) squared difference between true average probabilities and estimated observed proportions (MSCE) with logistic calibration, CG-C (10 groups), 2MA-C, and MIX-C methods for a logistic model, varying the number of training events per center and the number of centers. Bold numbers indicate the best performance across approaches


Figure 5 Pointwise prediction interval coverage, varying the validation sample size. The validated model is the same within each superpopulation; it was trained on a center with an average event rate and an adequate sample size. The black dotted line indicates nominal coverage (95%).


Figure 6 Center-specific (grey) and average true calibration plots for the synthetic data, with 1,000,000 observations per center.


Table 4 Median MSCE for the synthetic data study for center-specific results. MSCE is presented multiplied by 100. Bold numbers indicate best performance across approaches


Figure 7 Center-specific results of log(MSCE) by the number of events per center for the logistic truth (a) and the random forest truth (b).

Supplementary material: File

Barreñada et al. supplementary material (2.1 MB)