
Variable Selection via Knockoffs in Missing Data Settings with Categorical Predictors

Published online by Cambridge University Press:  12 May 2026

Silvia Bacci*
Affiliation:
Department of Statistics, Computer Science, Applications, Università degli Studi di Firenze, Italy
Emanuela Dreassi
Affiliation:
Department of Statistics, Computer Science, Applications, Università degli Studi di Firenze, Italy
Leonardo Grilli
Affiliation:
Department of Statistics, Computer Science, Applications, Università degli Studi di Firenze, Italy
Carla Rampichini
Affiliation:
Department of Statistics, Computer Science, Applications, Università degli Studi di Firenze, Italy
* Corresponding author: Silvia Bacci; Email: silvia.bacci@unifi.it

Abstract

Large-scale assessment data typically include numerous variables, often affected by missing values. Motivated by the challenges arising in this framework, we extend the knockoffs method for selecting predictors to settings with missing values. Our proposal relies on a preliminary phase of multiple imputation (MI) of missing values. Each imputed dataset is then processed using a suitable knockoff filter. We evaluate the performance of the proposed method through simulation studies, showing satisfactory results consistent with a recently advocated cutting-edge method. We apply the method to large-scale assessment data collected by INVALSI on test scores of Italian students in grade 5, including many background variables. This case study is challenging, as most predictors have unordered categories, a setting not considered by traditional knockoff methods. In addition, some of the key predictors are affected by missing values. Our proposal to implement the knockoffs method within an MI framework is feasible, flexible, and effective.
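
The final aggregation step, in which a predictor is retained when it is selected in a large enough share of the imputed datasets, can be illustrated with a minimal Python sketch. The sketch assumes that a knockoff filter has already been run on each imputed dataset and has returned a set of selected predictor names. The function names, the predictor names, the five example selections, and the default threshold of 0.8 (borrowed from the rule stated in the caption of Table 4) are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Set


def selection_proportions(
    selected_per_dataset: Sequence[Set[str]], predictors: Sequence[str]
) -> Dict[str, float]:
    """Share of imputed datasets in which each predictor was selected."""
    m = len(selected_per_dataset)
    if m == 0:
        raise ValueError("at least one imputed dataset is required")
    return {
        p: sum(p in selected for selected in selected_per_dataset) / m
        for p in predictors
    }


def retain(proportions: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Keep predictors whose selection proportion reaches the threshold."""
    return [p for p, prop in proportions.items() if prop >= threshold]


# Hypothetical output of a knockoff filter on 5 imputed datasets
# (made-up predictor names, not results from the article).
selected_sets = [
    {"x1", "x2"},
    {"x1"},
    {"x1", "x2", "x3"},
    {"x1", "x2"},
    {"x1", "x2"},
]
predictors = ["x1", "x2", "x3"]

proportions = selection_proportions(selected_sets, predictors)
retained = retain(proportions, threshold=0.8)
print(proportions)
print(retained)
```

Raising the threshold towards 1 makes the rule more conservative, which corresponds to the increasing selection proportions compared in the simulation tables and figures.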

Information

Type
Application and Case Studies - Original
Creative Commons
Creative Commons License: CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of Psychometric Society

Table 1 Overview of the simulation study

Table 2 Simulation study: variable selection using XCDW, MI-lasso, MI-RWC, and MI-seq, for increasing rates of missing values (about 10%, 25%, 32%, and 45%), by selection proportion (MAR assumption; nominal PFER=2)

Table 3 Simulation study with 32% missing values: variable selection using XCDW, MI-lasso, MI-RWC, and MI-seq, for increasing selection proportion (nominal PFER = 2)

Figure 1 Distribution of empirical FDR (left) and empirical TPR (right) under XCDW, MI-lasso, MI-RWC, and MI-seq, for an increasing selection proportion (top to bottom panels: $\geq 0.7$, $\geq 0.8$, $\geq 0.9$, and $1$): box plots of the Monte Carlo values showing minimum, 1st quartile, median, 3rd quartile, and maximum.

Figure 2 Variable selection using XCDW, MI-lasso, MI-RWC, and MI-seq, for an increasing selection proportion used to retain variables across the imputed datasets: Monte Carlo means of FDR (left) and TPR (right). Dashed lines for MAR and solid lines for SMAR: $\square$ XCDW, $\times$ MI-lasso, $\circ$ MI-RWC, and $\triangle$ MI-seq.

Table 4 Variable selection via MI-seq: proportion selected over 10 imputed datasets and retained status ($\checkmark $ if proportion $\ge 0.8$)

Table 5 Random intercept null model and full model with selected predictors (Table 4): combined estimates of the residual variances based on 10 imputed datasets

Table 6 Random intercept model with selected predictors: combined estimates of the regression coefficients and standard errors based on 10 imputed datasets

Table A1 Data-driven simulation study: variable selection with MI-seq, for increasing selection proportion (nominal PFER $=2$): Monte Carlo means and standard deviations of empirical PFER, FDR, and TPR
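
The empirical quantities named in the simulation tables can be computed as in the following Python sketch. It uses the standard definitions, since this page does not spell out the article's own: per replication, the number of false selections, the false discovery proportion, and the true positive proportion. Their Monte Carlo means are the empirical PFER, FDR, and TPR. The function names and the toy replications are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence, Set


def replication_metrics(selected: Set[str], relevant: Set[str]) -> Dict[str, float]:
    """Error and power measures for a single simulated replication."""
    false_pos = len(selected - relevant)  # selected but truly irrelevant
    true_pos = len(selected & relevant)   # selected and truly relevant
    return {
        "false_positives": float(false_pos),
        # Proportion of false selections; defined as 0 when nothing is selected.
        "fdp": false_pos / max(len(selected), 1),
        # Proportion of relevant predictors recovered.
        "tpp": true_pos / len(relevant) if relevant else 0.0,
    }


def monte_carlo_means(
    runs: Sequence[Set[str]], relevant: Set[str]
) -> Dict[str, float]:
    """Average the per-replication measures over all replications."""
    per_run = [replication_metrics(selected, relevant) for selected in runs]
    return {
        "PFER": mean(r["false_positives"] for r in per_run),
        "FDR": mean(r["fdp"] for r in per_run),
        "TPR": mean(r["tpp"] for r in per_run),
    }


# Two toy replications: "a" and "b" are truly relevant, "n1" is noise.
relevant = {"a", "b"}
runs = [{"a", "b", "n1"}, {"a"}]
summary = monte_carlo_means(runs, relevant)
print(summary)
```

Comparing the empirical PFER with the nominal value (2 in the tables) indicates whether the procedure controls the expected number of false selections.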

Supplementary material

Bacci et al. supplementary material (File, 179.7 KB)