
Variable selection methods for identifying predictor interactions in data with repeatedly measured binary outcomes

Published online by Cambridge University Press:  16 November 2020

Bethany J. Wolf*
Affiliation:
Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA
Yunyun Jiang
Affiliation:
Milken Institute School of Public Health, Biostatistics Center, George Washington University, Rockville, MD, USA
Sylvia H. Wilson
Affiliation:
Department of Anesthesia and Perioperative Medicine, Medical University of South Carolina, Charleston, SC, USA
Jim C. Oates
Affiliation:
Department of Medicine, Division of Rheumatology and Immunology, Medical University of South Carolina, Charleston, SC, USA
*
Address for correspondence: B. J. Wolf, PhD, Department of Public Health Sciences, Medical University of South Carolina, 135 Cannon Street, Charleston, SC, USA. Email: wolfb@musc.edu

Abstract

Introduction:

Identifying predictors of patient outcomes evaluated over time may require modeling interactions among variables while addressing within-subject correlation. Generalized linear mixed models (GLMMs) and generalized estimating equations (GEEs) address within-subject correlation, but identifying interactions can be difficult if they are not hypothesized a priori. We evaluate the performance of several variable selection approaches for clustered binary outcomes to provide guidance for choosing between the methods.

Methods:

We conducted simulations comparing stepwise selection, penalized GLMM, boosted GLMM, and boosted GEE for variable selection considering main effects and two-way interactions in data with repeatedly measured binary outcomes, and evaluated a two-stage approach to reduce bias and error in parameter estimates. We compared these approaches in real data applications: hypothermia during surgery and treatment response in lupus nephritis.
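The two-stage idea — use a penalized fit only to select variables, then refit the selected variables without the penalty so the final estimates are not shrunken toward zero — can be illustrated with ordinary (non-clustered) logistic regression. The sketch below is purely illustrative and is not the authors' implementation (the paper uses glmmLasso, GMMBoost, and boosted GEE in R); the `fit_logistic` helper, the penalty weight `lam`, and the synthetic data are all assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lam=0.0, lr=0.5, iters=500):
    """Fit logistic regression by proximal gradient descent.
    lam > 0 adds an L1 (lasso) penalty via soft-thresholding."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(sum(b * x for b, x in zip(beta, xi))) - yi
            for j in range(p):
                grad[j] += err * xi[j] / n
        for j in range(p):
            b = beta[j] - lr * grad[j]
            # soft-threshold step implements the L1 penalty;
            # with lam = 0 this reduces to plain gradient descent
            beta[j] = math.copysign(max(abs(b) - lr * lam, 0.0), b)
    return beta

# synthetic data: four candidate predictors, only x0 truly predictive
rng = random.Random(1)
X, y = [], []
for _ in range(400):
    xi = [rng.gauss(0, 1) for _ in range(4)]
    X.append(xi)
    y.append(1 if rng.random() < sigmoid(1.5 * xi[0]) else 0)

# Stage 1: penalized fit, used only to choose variables
b1 = fit_logistic(X, y, lam=0.1)
selected = [j for j, b in enumerate(b1) if abs(b) > 1e-8]

# Stage 2: unpenalized refit on the selected variables only,
# which removes the shrinkage bias in the stage-1 estimates
Xs = [[xi[j] for j in selected] for xi in X]
b2 = fit_logistic(Xs, y, lam=0.0)
```

In the applications reported here, stage 1 would instead be one of the clustered-outcome selection methods and stage 2 a GLMM or GEE refit restricted to the selected predictors.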

Results:

Penalized and boosted approaches recovered correct predictors and interactions more frequently than stepwise selection. Penalized GLMM recovered correct predictors more often than boosting, but included many spurious predictors. Boosted GLMM yielded parsimonious models and identified correct predictors well at large sample and effect sizes, but required excessive computation time. Boosted GEE was computationally efficient and selected relatively parsimonious models, offering a compromise between computation and parsimony. The two-stage approach reduced the bias and error in regression parameters in all approaches.

Conclusion:

Penalized and boosted approaches are effective for variable selection in data with clustered binary outcomes. The two-stage approach reduces bias and error and should be applied regardless of method. We provide guidance for choosing the most appropriate method in real applications.

Information

Type
Research Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Association for Clinical and Translational Science 2020

Table 1. Simulation study design

Fig. 1. Proportion of simulations in which the true predictors X5, X1X2, and X2X5 are selected, and the average false-discovery rate for null predictors (XNull), for the four variable selection methods by effect size and sample size in data with two observations per subject.

Fig. 2. Boxplots of bias for the stage 1 models across all simulation runs for stepwise (Step), glmmLasso (Lasso), GMMBoost (GMMB), and GEEBoost (GEEB) models in data with two repeated measures per subject. Boxes represent the 25th, 50th, and 75th percentiles, whiskers extend 1.5 × interquartile range (IQR) from the 25th and 75th percentiles, and points are values outside 1.5 × IQR. The gray dashed line indicates bias = 0.

Fig. 3. Boxplots of change in absolute bias in regression estimates from Stage 1 to Stage 2 for the true predictors X5, X1X2, and X2X5, and average bias for null predictors (XNull), for glmmLasso (Lasso), GMMBoost (GMMB), and GEEBoost (GEEB) in data with two measures per subject across effect and sample sizes. Boxes represent the 25th, 50th, and 75th percentiles, whiskers extend 1.5 × interquartile range (IQR) from the 25th and 75th percentiles, and points are values outside 1.5 × IQR. The gray dashed line indicates no difference in bias between stages.

Table 2. Models of treatment response over time in patients with lupus nephritis selected by each method. Values presented for each predictor are the regression parameter estimates (standard error). Missing values indicate that the predictor was not selected in that model. Parameter estimates for the glmmLasso, GMMBoost, and GEEBoost models are from the two-stage modeling approach

Table 3. Models of incidence of hypothermia over time in patients undergoing total joint arthroplasty selected by each method. Values presented for each predictor are the regression parameter estimates (standard error). Missing values indicate that the predictor was not selected in that model. Parameter estimates for the glmmLasso, GMMBoost, and GEEBoost models are from the two-stage modeling approach

Table 4. Guidance for selecting the optimal variable selection method in data with a repeated binary outcome

Supplementary material: PDF

Wolf et al. supplementary material

PDF 485.8 KB