
PKLM: A Flexible MCAR Test Using Classification

Published online by Cambridge University Press:  03 January 2025

Meta-Lina Spohn
Affiliation:
ETH Zürich, Seminar for Statistics, Zürich, Switzerland
Jeffrey Näf*
Affiliation:
Inria PreMeDICaL Team, Montpellier, France
Loris Michel
Affiliation:
QuantCo, Zürich, Switzerland
Nicolai Meinshausen
Affiliation:
ETH Zürich, Seminar for Statistics, Zürich, Switzerland
*Corresponding author: Jeffrey Näf; Email: jeffrey.naf@inria.fr

Abstract

We develop a fully nonparametric, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a dataset. The test compares the distributions of different missing patterns on random projections in the variable space of the data. The distributional differences are measured with the Kullback–Leibler divergence, using probability Random Forests (Malley et al., 2011). We thus refer to it as the “Projected Kullback–Leibler MCAR” (PKLM) test. The use of random projections makes it applicable even if very few or no complete observations are available or if the number of dimensions is large. An efficient permutation approach guarantees the level for any finite sample size, resolving a major shortcoming of most other available tests. Moreover, the test can be used on both discrete and continuous data. We show empirically on a range of simulated data distributions and real datasets that our test has consistently high power and avoids inflated type-I errors. Finally, we provide an R package, PKLMtest, with an implementation of our test.
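The finite-sample level guarantee mentioned above is a general property of permutation p-values. The following minimal Python sketch illustrates that mechanism only. It is not the PKLM implementation: the statistic `abs_mean_diff` is a deliberately simple stand-in for the Kullback–Leibler-based statistic, and all function names, parameters, and data are hypothetical.

```python
import random

def permutation_p_value(values, labels, statistic, n_perm=999, seed=0):
    """Permutation p-value with the +1 correction.

    Under exchangeability of the labels, (1 + #exceedances) / (1 + n_perm)
    is a valid p-value for any finite sample size.
    """
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # re-assign labels at random
        if statistic(values, shuffled) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

def abs_mean_diff(values, labels):
    """Stand-in statistic (assumption): absolute difference in group means."""
    g0 = [v for v, l in zip(values, labels) if l == 0]
    g1 = [v for v, l in zip(values, labels) if l == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

# Hypothetical toy data: two clearly separated groups.
values = [0.1, 0.2, 0.3, 5.1, 5.2, 5.3]
labels = [0, 0, 0, 1, 1, 1]
p_value = permutation_p_value(values, labels, abs_mean_diff)
```

Because the observed statistic is counted among the permutations, the p-value can never be smaller than 1 / (1 + n_perm), which is what keeps the level controlled regardless of sample size.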

Information

Type
Theory and Methods
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of Psychometric Society

Table 1 Illustration of some of the properties of various tests

Table 2 Notation: Summary of the notation used throughout the paper, with (“partial”) and without (“full”) considering the missing values.

Figure 1 Illustration of the projections A and B in an example with $n=4$ and $p=5$. In a first step, a projection $A = \{3,4,5\} \subset \{1, \ldots , 5 \}$ is drawn. The fully observed points on A form $\mathbf {X}_{\mathcal {N}_A,A}$, as indicated in green. In a second step, a projection $B= \{2\} \subset \{1, \ldots , 5 \} \backslash A$ is drawn, as indicated in blue. The patterns in projection B then determine the labels assigned to the observations in $\mathbf {X}_{\mathcal {N}_A,A}$. In this case, we obtain two different class labels: the first observation has one label, and the second and third observations share another common label.
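The labeling step described in this caption can be sketched in a few lines of Python. The missingness mask below is hypothetical: the matrix shown in the figure is not reproduced here, so the mask is merely chosen to agree with the caption ($n=4$, $p=5$, $A=\{3,4,5\}$, $B=\{2\}$). The helper names are likewise illustrative and not taken from the PKLMtest package.

```python
# Hypothetical missingness mask (True = missing), n = 4 rows, p = 5 columns.
M = [
    [False, True,  False, False, False],
    [True,  False, False, False, False],
    [False, False, False, False, False],
    [False, False, True,  False, False],
]
A = [3, 4, 5]  # projection A, 1-based column indices as in the caption
B = [2]        # projection B, drawn from the columns outside A

def observed_rows(mask, cols):
    """Row indices (0-based) that are fully observed on the given columns."""
    return [i for i, row in enumerate(mask) if not any(row[j - 1] for j in cols)]

def pattern_labels(mask, rows, cols):
    """Assign one class label per distinct missingness pattern on cols."""
    patterns = {}
    labels = []
    for i in rows:
        key = tuple(mask[i][j - 1] for j in cols)
        labels.append(patterns.setdefault(key, len(patterns)))
    return labels

N_A = observed_rows(M, A)            # rows fully observed on A
labels = pattern_labels(M, N_A, B)   # class labels from the patterns on B
```

With this mask, the first three observations are fully observed on A, the first one receives its own label, and the second and third share a label, matching the description in the caption.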

Table 3 Simulated power and type-I error of PKLM, Q, Little and JJ for $n=200$, $p=4$ and $n=500$, $p=10$

Table 4 Simulated power and type-I error of PKLM, Q, Little, and JJ for $n=500$, $p=20$ and $n=1000$, $p=40$

Table 5 Simulated power and level of PKLM, Q, Little and JJ for $13$ real datasets

Figure 2 Example plot of the empirical cumulative distribution functions (ECDFs) of the p-values under the null (MCAR) for the four different tests. The simulation setup is $n=500$, $p=10$, $r=0.65$ in case $5$, with $500$ repetitions. The red line is the $x=y$ line, while the blue lines show $100$ ECDFs of $500$ simulated uniform random variables.
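The blue reference curves in this figure can be reproduced in spirit with a short simulation. The sketch below is an assumption-laden illustration, not the authors' plotting code: the function names, the grid resolution, and the seed are hypothetical. It draws $100$ samples of $500$ uniform random variables and records the pointwise envelope of their ECDFs, against which the ECDF of a test's null p-values could be compared.

```python
import random

def ecdf(sample, x):
    """Fraction of sample values that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def uniform_reference_band(n_curves=100, n_draws=500, grid_size=21, seed=0):
    """Pointwise min/max envelope of ECDFs of simulated uniform samples."""
    rng = random.Random(seed)
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    curves = []
    for _ in range(n_curves):
        draws = [rng.random() for _ in range(n_draws)]
        curves.append([ecdf(draws, x) for x in grid])
    lower = [min(c[k] for c in curves) for k in range(grid_size)]
    upper = [max(c[k] for c in curves) for k in range(grid_size)]
    return grid, lower, upper

grid, lower, upper = uniform_reference_band()
```

An ECDF of null p-values that rises clearly above this envelope near small values of $x$ would point to an inflated type-I error, whereas one that stays within or below it is consistent with a valid test.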

Figure 3 $X_1$ and $X_2$ of the fully observed data in the simulated example of Section 6. Red: points with missing values in $X_1$; blue: points with missing values in $X_2$. The blue points are randomly scattered, independently of the value of $X_1$, while the red points show a visible trend toward more missing values in $X_1$ for higher values of $X_2$.

Figure C1 Histogram with relative frequencies of $X_1$ if the corresponding $X_2$ is $\texttt {NA}$.

Supplementary material: File

Spohn et al. supplementary material

File 409.7 KB