When Can Multiple Imputation Improve Regression Estimates?

Vincent Arel-Bundock; Krzysztof J. Pelc

doi:10.1017/pan.2017.43


Published online by Cambridge University Press: 06 March 2018


Vincent Arel-Bundock: Department of Political Science, Université de Montréal, Canada. Email: vincent.arel-bundock@umontreal.ca
Krzysztof J. Pelc: Department of Political Science, McGill University, Canada. Email: kj.pelc@mcgill.ca


Abstract

Multiple imputation (MI) is often presented as an improvement over listwise deletion (LWD) for regression estimation in the presence of missing data. Against a common view, we demonstrate anew that the complete case estimator can be unbiased, even if data are not missing completely at random. As long as the analyst can control for the determinants of missingness, MI offers no benefit over LWD for bias reduction in regression analysis. We highlight the conditions under which MI is most likely to improve the accuracy and precision of regression results, and develop concrete guidelines that researchers can adopt to increase transparency and promote confidence in their results. While MI remains a useful approach in certain contexts, it is no panacea, and access to imputation software does not absolve researchers of their responsibility to know the data.

Keywords

multiple imputation; missing data; Monte Carlo simulation

Information

Type: Letter
Information: Political Analysis, Volume 26, Issue 2, April 2018, pp. 240–245

DOI: https://doi.org/10.1017/pan.2017.43
Copyright: Copyright © The Author(s) 2018. Published by Cambridge University Press on behalf of the Society for Political Methodology.

Observational data in the social sciences are often incomplete. The most common approach for dealing with missing data is complete case analysis (or listwise deletion), but this strategy has important shortcomings: it ignores the valuable information carried by partially observed units, and it can introduce bias in regression coefficient estimates.

In a recent Political Analysis article, Lall (2016) adds to a body of work making a powerful case for an alternative: multiple imputation (MI). The author argues that listwise deletion (LWD) often introduces severe bias in regression estimates, and he applies a popular imputation routine (Honaker, King, and Blackwell 2011) to show that several published results are affected by the way analysts handle missing data.

Here, we clear up a common misunderstanding about LWD: this approach does not introduce bias in regression estimates, as long as the dependent variable is conditionally independent of the missingness mechanism, or when the analyst can control for the determinants of missingness.

We highlight the conditions under which MI is most likely to improve the accuracy and precision of regression results, and propose a set of best practices for empiricists dealing with missing data. The premise underlying these best practices is that while complete case analysis can be problematic, MI is no panacea: the range of circumstances under which this approach guarantees bias reduction relative to LWD is limited, and results may be sensitive to violations of the imputation model’s assumptions. When results under MI and LWD diverge, analysts can make no a priori claim that one set of results is more credible than the other, and access to imputation software does not absolve researchers of their responsibility to know the data.¹

1 When Does Listwise Deletion Introduce Bias in Regression Estimates?

After Rubin (1976), it has become standard practice to distinguish between three missingness generation mechanisms.² Data are said to be missing completely at random (MCAR) if the pattern of missingness is independent of both the observed and unobserved data. Data are called missing at random (MAR) if missingness depends only on observables. Data are not missing at random (NMAR) when missingness depends on unobservables.
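To make the three mechanisms concrete, the following is a minimal sketch in Python. The data-generating process, the 30% missingness rate, and the 0.5 cutoffs are illustrative assumptions and do not come from the article. Here `x` is always observed and `y` is the variable subject to missingness.

```python
import random

rng = random.Random(0)
n = 10_000

# Hypothetical data: x is always observed, y may be missing.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

# MCAR: every unit has the same 30% chance of a missing y.
mcar = [rng.random() < 0.3 for _ in range(n)]
# MAR: missingness in y depends only on the observed x.
mar = [xi > 0.5 for xi in x]
# NMAR: missingness in y depends on the (unobserved) value of y itself.
nmar = [yi > 0.5 for yi in y]


def mean(values):
    return sum(values) / len(values)


def observed_mean(values, missing):
    """Mean of the values that remain after dropping missing units."""
    return mean([v for v, m in zip(values, missing) if not m])


for name, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    print(name, "share missing:", round(mean(mask), 2),
          "observed mean of y:", round(observed_mean(y, mask), 2))
```

The masks differ only in what drives them: pure chance, an observed variable, or the partially observed variable itself.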

Based on this typology, Lall (2016, 416) writes:

“Listwise deletion is unbiased only when the restrictive MCAR assumption holds—that is, when omitting incomplete observations leaves a random sample of the data. Under MAR or [NMAR], deleting such observations produces samples that are skewed away from units with characteristics that increase their probability of having incomplete data.”

This echoes King et al. (2001, 51), who argue that

“inferences from analyses using listwise deletion are relatively inefficient, no matter which assumption characterizes the missingness, and they are also biased unless MCAR holds.”

It is true that MI allows us to leverage more information than LWD, and that it could thus improve the efficiency of our analyses. However, the claim that LWD always introduces bias unless data are MCAR is erroneous. To demonstrate,³ let $Q_{i}$ equal 1 if the $i$th observation is fully observed, and 0 otherwise. A simple complete case model can be represented as:

$$\mathbf{QY}=\mathbf{QX}\boldsymbol{\beta}_{c}+\mathbf{Q}\boldsymbol{\varepsilon},\quad \text{with }\mathbf{Q}=\text{diag}(Q_{1},\ldots ,Q_{n}).$$

Defining $\mathbf{X}_{c}=\mathbf{QX}$ and $\mathbf{Y}_{c}=\mathbf{QY}$, the least squares complete case estimator becomes:

$$\begin{aligned}\hat{\boldsymbol{\beta}}_{c} &= (\mathbf{X}_{c}^{\prime}\mathbf{X}_{c})^{-1}\mathbf{X}_{c}^{\prime}\mathbf{Y}_{c}\\ &= (\mathbf{X}^{\prime}\mathbf{QX})^{-1}\mathbf{X}^{\prime}\mathbf{QY}\\ &= (\mathbf{X}^{\prime}\mathbf{QX})^{-1}\mathbf{X}^{\prime}\mathbf{Q}(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})\\ &= \boldsymbol{\beta}+(\mathbf{X}^{\prime}\mathbf{QX})^{-1}\mathbf{X}^{\prime}\mathbf{Q}\boldsymbol{\varepsilon}.\end{aligned}\tag{1}$$

Clearly, if $\mathbf{Q}$ is independent of $\boldsymbol{\varepsilon}$, and if the usual assumptions of the classical linear model hold, the complete case estimator is unbiased.⁴ More loosely, Equation (1) shows that the OLS estimator with LWD is unbiased in the MAR cases where the pattern of missingness is unrelated to values of the dependent variable, or where we can control for the determinants of missingness.
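The algebra in Equation (1) can be checked numerically. The sketch below assumes NumPy is available and uses hypothetical coefficients, sample size, and a missingness rule that depends on the regressor only. It confirms that the matrix expression with $\mathbf{Q}$ equals the decomposition on the last line of Equation (1), and that both match ordinary least squares run on the retained rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
beta = np.array([1.0, 2.0])  # hypothetical true coefficients

X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
Y = X @ beta + eps

# Q_i = 1 when unit i is fully observed; here missingness depends on X only.
q = (X[:, 1] < 0.5).astype(float)
Q = np.diag(q)

# Complete case estimator: (X'QX)^{-1} X'QY
XtQX = X.T @ Q @ X
beta_c = np.linalg.solve(XtQX, X.T @ Q @ Y)

# Last line of Equation (1): beta + (X'QX)^{-1} X'Q eps
rhs = beta + np.linalg.solve(XtQX, X.T @ Q @ eps)

# OLS on the rows with Q_i = 1 gives the same estimate.
keep = q == 1
beta_lwd, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)

print(beta_c, rhs, beta_lwd)
```

Because $\mathbf{Q}$ is a diagonal matrix of zeros and ones, multiplying by it is the same as dropping rows, which is why the three vectors coincide.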

Equation (1) also implies that complete case coefficient estimates are unbiased in the NMAR case “where the probability that a covariate is missing depends on the value of that covariate”, as long as “the probability of being a complete case depends on $X_{1};\ldots ;X_{p}$ but not on $Y$” (Little and Rubin 2002, 43).
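A small Monte Carlo experiment illustrates this NMAR case. It is only a sketch: the logistic missingness rule, the true slope of 2, and the number of replications are assumptions chosen for illustration, not the authors' simulation design. The covariate is more likely to be missing when its own value is large, yet the average complete case slope stays close to the truth.

```python
import math
import random

rng = random.Random(1)


def ols_slope(x, y):
    """Bivariate OLS slope: sample covariance over sample variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def one_replication(n=500, slope=2.0):
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 1.0 + slope * x + rng.gauss(0, 1)
        # NMAR for the covariate: larger x values are more likely to be missing.
        p_missing = 1 / (1 + math.exp(-2 * x))
        if rng.random() >= p_missing:  # keep complete cases only
            xs.append(x)
            ys.append(y)
    return ols_slope(xs, ys)


estimates = [one_replication() for _ in range(500)]
mean_estimate = sum(estimates) / len(estimates)
print("mean complete case slope:", round(mean_estimate, 3))
```

Roughly half of the units are discarded in each replication, so the estimates are noisier than with full data, but they are centered on the true slope.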

To be clear, the above conclusions do not depend on which variables are partially observed, but rather on the association between the values of those variables and the pattern of missingness. The outcome $Y$ may well be unobservable for the $i$th individual, but as long as the reason why data are missing for that individual relates to the value of $X_{i}$ and not $Y_{i}$ (net of $X_{i}$), then LWD does not introduce bias in regression estimates.

These results should not be surprising to political scientists, who have long been aware of the pitfalls of “selecting cases for study on the dependent variable” (Geddes 1990). To illustrate, Figure 1 shows two simulated samples where all observed units (black) fall below an arbitrary threshold, and all unobserved units (gray) fall above that threshold.⁵ The gray lines show the result of a bivariate regression model using the full data, while the black lines show analogous results based on the observed data only. In the left panel of Figure 1, sample selection is based on the values of the independent variable, and the gray and black lines overlap (no bias). In the right panel of Figure 1, sample selection is based on the values of the dependent variable, and the two linear models diverge (bias).
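The comparison behind Figure 1 can be approximated with a few lines of Python. The distribution and the threshold of 0 follow the description in footnote 5; the sample size and the random seed are assumptions, since the article does not report them, and this sketch prints slopes rather than drawing the figure.

```python
import random

rng = random.Random(2018)
n = 20_000  # assumption: the article does not report the sample size


def ols_slope(x, y):
    """Bivariate OLS slope: sample covariance over sample variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Bivariate normal with means 0, variances 1, and covariance 0.5.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + 0.75 ** 0.5 * rng.gauss(0, 1) for xi in x]


def slope_where(keep):
    """Slope estimated on the units for which keep(x_i, y_i) is true."""
    pairs = [(a, b) for a, b in zip(x, y) if keep(a, b)]
    return ols_slope([a for a, _ in pairs], [b for _, b in pairs])


slope_full = ols_slope(x, y)
slope_select_x = slope_where(lambda a, b: a < 0)  # selection on X
slope_select_y = slope_where(lambda a, b: b < 0)  # selection on Y

print("full data:", round(slope_full, 2))
print("observed if X < 0:", round(slope_select_x, 2))
print("observed if Y < 0:", round(slope_select_y, 2))
```

Selecting on $X$ leaves the slope near its full data value of 0.5, while selecting on $Y$ flattens it noticeably.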

The practical implications are considerable. In cross-national comparisons, for instance, more complete cases are typically available for advanced democracies than for developing countries. This has led analysts to worry that their estimates may suffer from an “advanced economies” or a “pro-democracy” bias (e.g., Lall 2017, 1292).

We can distinguish between two interpretations of this problem. First, one could argue that the estimated slopes should be different in democratic and authoritarian countries, and that a full data estimate of the (“averaged”) marginal effect will be sensitive to sample composition. In that case, our recommendation is that researchers model heterogeneity explicitly (Brambor, Clark, and Golder 2006; Franzese and Kam 2009), or risk misspecification bias (but not necessarily selection bias).

Second, one could think about the issue not in terms of heterogeneous marginal effects, but directly in terms of a selection problem. In that case, analysts should reflect on the nature of the association between missingness and their dependent variable. If, as in the resource curse literature, the outcome of interest is “regime type”, and we suspect that this dependent variable directly affects transparency and observability (Hollyer, Rosendorff, and Vreeland 2011), then there are good reasons to worry. In contrast, when analysts can put the drivers of missingness on the right-hand side of their regression equations, LWD need not spoil the results.
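The last point, that controlling for the drivers of missingness protects LWD estimates, can be sketched with simulated data. Everything below is a hypothetical setup that assumes NumPy is available: `z` stands for a driver of missingness that also affects the outcome, units are observed only when `x + z < 0`, and the true coefficient on `x` is 2.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

x = rng.normal(size=n)
z = rng.normal(size=n)  # driver of missingness, independent of x
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)

# Hypothetical selection rule: a unit is fully observed only if x + z < 0.
observed = (x + z) < 0


def ols(outcome, *regressors):
    """OLS coefficients, with the intercept first."""
    design = np.column_stack([np.ones(len(outcome)), *regressors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


full_short = ols(y, x)                                # full data, z omitted
lwd_short = ols(y[observed], x[observed])             # LWD, z omitted
lwd_long = ols(y[observed], x[observed], z[observed])  # LWD, z controlled

print("full data, y ~ x:    ", round(full_short[1], 2))
print("LWD, y ~ x:          ", round(lwd_short[1], 2))
print("LWD, y ~ x + z:      ", round(lwd_long[1], 2))
```

With `z` omitted, selection is related to the error term and the LWD slope on `x` is pulled well below 2. Once `z` enters the regression, selection is a function of the regressors alone and the LWD slope recovers the true value.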

Figure 1. Linear regression under two selection mechanisms.

2 When Can Multiple Imputation Improve Regression Estimates?

MI seems more likely to be beneficial in some contexts. First, as suggested by Equation (1), the use of LWD is largely unproblematic when data are MCAR, when missingness is solely a function of the regressors, or when control variables can purge the dependent variable of its association with the missingness generation mechanism. In those cases, MI does not reduce bias, but it could still improve efficiency.

Second, there are good reasons to expect that MI will be most effective where missingness affects auxiliary (or control) variables, rather than the main independent or dependent variables of interest.⁶ As Little (1992, 1227) points out, if “the X’s are complete and the missing values of $Y$ are missing at random, then the incomplete cases contribute no information to the regression of $Y$ on $X_{1},\ldots ,X_{p}$.” Relatedly, White and Carlin (2010, 2928) note that “MI is likely to be beneficial for the coefficient of a relatively complete covariate when other covariates are incomplete.”

Third, MI may produce better results when analysts can build an imputation model that accurately predicts the values of missing data points. When missing values are difficult to predict, or when analysts cannot leverage relevant auxiliary variables to build their prediction model, we are more likely to see large differences in coefficient estimates across imputed datasets, which would reduce the precision of the combined estimates obtained by Rubin’s rules.
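The link between unstable imputations and lost precision follows directly from the pooling formula known as Rubin's rules: the total variance is the average within-imputation variance plus the between-imputation variance inflated by $(1 + 1/m)$. The sketch below implements that formula for a single coefficient; the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Hypothetical results from m = 5 imputed datasets, same sampling variance.
stable = [0.50, 0.52, 0.48, 0.51, 0.49]
unstable = [0.20, 0.80, 0.35, 0.65, 0.50]
sampling_var = [0.01] * 5

est_a, se_a = pool_rubin(stable, sampling_var)
est_b, se_b = pool_rubin(unstable, sampling_var)
print("stable imputations:  ", round(est_a, 3), round(se_a, 3))
print("unstable imputations:", round(est_b, 3), round(se_b, 3))
```

Both sets of estimates average to the same value, but the pooled standard error is much larger when the estimates vary widely across imputed datasets.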

Fourth, an imputation routine is obviously more likely to be useful when its underlying statistical assumptions are satisfied. In particular, it is important to note that MI offers no guarantee of bias reduction unless data are MAR.⁷ While we still lack systematic assessments, simple simulations have shown that LWD estimates can sometimes be less biased than MI estimates under NMAR (White and Carlin 2010; Pepinsky 2017).⁸ MI performance can also be degraded when imputation routines make implausible distributional assumptions (e.g., multivariate normality) and data are not well-behaved.⁹

Finally, it seems reasonable to expect that MI will bring about larger improvements to precision where the proportion of fully observed units is small (White and Carlin 2010).

In sum, MI can often improve regression estimates, but this is not always the case. Because some of the assumptions that underpin LWD and MI are untestable, analysts will typically be unable to make an a priori claim that either set of estimates is more credible than the other. When results under LWD and MI diverge, researchers will have to exercise case-specific judgement.

3 Best Practices

To exercise this kind of case-specific judgement, researchers should take to heart the repeated admonitions of MI advocates, by developing a deep knowledge of their datasets (King et al. 2001; van Buuren 2012). They could also improve the credibility of their empirical work by following a set of simple best practices:

(1) Define the population of interest.
(2) Report the share of missing values for each variable and descriptive statistics for both complete and incomplete cases. Do fully observed units differ systematically from partially observed ones?
(3) Theorize the missingness mechanism. Is the pattern of missingness driven by (a) pure chance, (b) factors unrelated to the variables of interest, (c) values of the independent variables, (d) values of the dependent variable, or (e) unobservable factors? Under (a), (b), and (c), LWD can be used without fear that it will introduce bias in regression estimates. Under (d), MI can sometimes reduce bias, but it only offers guarantees if data are MAR and the imputation model’s assumptions are satisfied. Under (e) data are NMAR and neither LWD nor MI promise unbiased estimates.
(4) Check for divergence between LWD and MI results. If estimates do diverge, which “new” observations have a strong influence on the results? Are these observations theoretically distinct?
(5) Robustness checks. Do alternative imputation procedures or tuning parameters produce different results? Does the imputation model have good predictive power? Does it fill in reasonable values for missing observations?¹⁰
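Practice (2) in the list above is easy to automate. The following sketch uses a tiny made-up dataset, with `None` marking a missing value and hypothetical variable names (`democracy`, `gdp`, `aid`), to report the share of missing values per variable and to compare a fully observed variable across complete and incomplete cases.

```python
# Hypothetical country-level records; None marks a missing value.
rows = [
    {"democracy": 1, "gdp": 10.1, "aid": 0.1},
    {"democracy": 1, "gdp": 9.7, "aid": 0.3},
    {"democracy": 1, "gdp": 9.9, "aid": None},
    {"democracy": 0, "gdp": None, "aid": 2.5},
    {"democracy": 0, "gdp": 7.2, "aid": None},
    {"democracy": 0, "gdp": None, "aid": None},
]


def share_missing(records):
    """Share of missing values for each variable."""
    variables = records[0].keys()
    return {v: sum(r[v] is None for r in records) / len(records)
            for v in variables}


def is_complete(record):
    return all(value is not None for value in record.values())


def mean_by_completeness(records, variable):
    """Mean of `variable` among complete and among incomplete cases."""
    complete = [r[variable] for r in records if is_complete(r)]
    incomplete = [r[variable] for r in records
                  if not is_complete(r) and r[variable] is not None]
    return sum(complete) / len(complete), sum(incomplete) / len(incomplete)


shares = share_missing(rows)
dem_complete, dem_incomplete = mean_by_completeness(rows, "democracy")
print("share missing:", shares)
print("mean democracy, complete cases:", dem_complete)
print("mean democracy, incomplete cases:", dem_incomplete)
```

In this toy example every complete case is a democracy while most incomplete cases are not, which is the kind of systematic difference that should prompt the questions listed under practice (3).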

In supplementary materials, we illustrate how these guidelines can improve statistical practice by revisiting one of the political-economy studies criticized in Lall (2016). The study we replicate meets some of the conditions listed above, and thus appears as a good prima facie candidate for MI. This replication exercise highlights some of the practical pitfalls of MI, and illustrates why researchers need to familiarize themselves with the data before deploying Amelia and concluding that MI results are more credible than LWD results.¹¹

4 Conclusion

Missing data are an inevitable problem in social science. The main shortcoming of the most common way of dealing with them, LWD, is that it is often applied in an unthinking manner. This is where the benefit of Lall’s article, and the literature to which it contributes, truly lies. We, as analysts, must show greater awareness of, and transparency about, the implications of missing data.

Unfortunately, MI is no panacea. In this note, we suggest that the range of circumstances under which this approach guarantees improvement relative to LWD is narrower than is generally acknowledged by proponents of MI.

Taking the problem of missing data seriously means asking the type of questions raised above. Does the pattern of missingness suggest that LWD is biased, and that MI will be beneficial? What variables are truly unobserved, rather than nonexistent? Can we build an accurate prediction model to fill in missing values? And how does the expansion of the sample relate to the theory being tested? Multiple imputation requires a number of choices on the analyst’s part; these must be informed by knowledge of the data and of the theory being tested.

Supplementary material

For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2017.43.

Footnotes

Authors’ note: We thank Neal Beck, Timm Betz, Christina Davis, Tom Pepinsky, Amy Pond, and Erik Voeten for valuable comments. Replication files and supplementary materials are hosted on the Harvard Dataverse and the authors’ websites. https://dataverse.harvard.edu/dataverse/pan, doi:10.7910/DVN/S9G9XS. http://arelbundock.com, https://sites.google.com/site/krzysztofpelc/.

Contributing Editor: R. Michael Alvarez

1 In supplementary materials, we revisit one of the empirical studies replicated in Lall (2016) to illustrate the importance of the best practices we propose (Arel-Bundock and Pelc 2017). We also present results from Monte Carlo experiments designed to probe the performance of Amelia under different conditions.

2 Formal definitions can be found in many texts, including Little and Rubin (2002, 11–13).

3 We follow Jones (1996).

4 Allison (2001, fn. 1) offers a more general proof: “We want to estimate $f(Y|X)$, the conditional distribution of $Y$ given $X$, a vector of predictor variables. Let $A=1$ if all variables are observed; otherwise, $A=0$. Listwise deletion is equivalent to estimating $f(Y|X,A=1)$. The aim is to show that this function is the same as $f(Y|X)$. From the definition of conditional probability, we have $f(Y|X,A=1)=\frac{f(Y,X,A=1)}{f(X,A=1)}=\frac{\Pr(A=1|Y,X)f(Y|X)f(X)}{\Pr(A=1|X)f(X)}$. Assume that $\Pr(A=1|Y,X)=\Pr(A=1|X)$, that is, that the probability of data present on all variables does not depend on $Y$, but may depend on any variables in $X$. It immediately follows that $f(Y|X,A=1)=f(Y|X)$. Note that this result applies to any regression procedure, not just linear regression.”

5 $X$ and $Y$ are drawn from a multivariate normal with mean 0, variance 1, and covariance 0.5. The selection threshold is 0.

6 In supplementary materials, we use simulations to illustrate this point.

7 Lall (2016) points out that the MAR assumption is untestable (footnote 7) and that NMAR data are ubiquitous (footnote 15).

8 Lall (2016, 5) argues that “multiple imputation is not seriously biased under [NMAR] if missingness is strongly related to observed data and thus approximates MAR (Graham, Hofer, and MacKinnon 1996; Schafer 1997; Collins, Schafer, and Kam 2001).” However, Graham, Hofer, and MacKinnon (1996) is barely germane; the simulation study in Schafer (1997, 2.5.2) is useful but perfunctory; and the main focus of Collins, Schafer, and Kam (2001) is on “[f]our conditions with different varieties of MAR missing data mechanisms.” Our view is that broad pronouncements about the performance of MI under NMAR are premature, and that practitioners still lack clear guidelines to determine if their (observed) auxiliary data are rich enough for MI routines to work adequately.

9 In supplementary materials, we use simulations to illustrate how departures from multivariate normality can hinder the performance of Amelia, even in settings where all marginal distributions are normal. Note that other imputation procedures may relax the multivariate normality assumption, but that they typically open several other “researcher degrees of freedom.” For example, the reference manual for the mice routine (van Buuren and Groothuis-Oudshoorn 2011) points out that the analyst needs to make seven main choices in the specification of the imputation model.

10 We concur with Graham, Hofer, and MacKinnon (1996) who write that “[b]ecause the various [imputation] procedures may be differentially sensitive to abnormalities in the data (e.g., irregularities in the minimization function, solutions near the boundary), it is always a good strategy to approach the missing data problem from different directions.”

11 We show that Lall’s different results are largely driven by (a) the introduction of nearly 90,000 theoretically irrelevant observations, and (b) the influence of five island nations with a combined population of about 430,000.

References

Allison, Paul D. 2001. Missing data, vol. 136. Thousand Oaks, CA: Sage Publications.

Arel-Bundock, Vincent, and Pelc, Krzysztof. 2017. When can multiple imputation improve regression estimates? doi:10.7910/DVN/S9G9XS, Harvard Dataverse, V1.

Brambor, Thomas, Clark, William Roberts, and Golder, Matt. 2006. Understanding interaction models: Improving empirical analyses. Political Analysis 14(1):63–82.

Collins, Linda M., Schafer, Joseph L., and Kam, Chi-Ming. 2001. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods 6(4):330.

Franzese, Robert, and Kam, Cindy. 2009. Modeling and interpreting interactive hypotheses in regression analysis. Ann Arbor, MI: University of Michigan Press.

Geddes, Barbara. 1990. How the cases you choose affect the answers you get: Selection bias in comparative politics. Political Analysis 2(1):131–150.

Graham, John W., Hofer, Scott M., and MacKinnon, David P. 1996. Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research 31(2):197–218.

Hollyer, James R., Rosendorff, B. Peter, and Vreeland, James Raymond. 2011. Democracy and transparency. Journal of Politics 73(4):1191–1205.

Honaker, James, King, Gary, and Blackwell, Matthew. 2011. Amelia II: A program for missing data. Journal of Statistical Software 45(7):1–47, http://www.jstatsoft.org/v45/i07/.

Jones, Michael P. 1996. Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association 91(433):222–230, http://www.jstor.org/stable/2291399.

King, Gary, Honaker, James, Joseph, Anne, and Scheve, Kenneth. 2001. Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review 95(1):49–69.

Lall, Ranjit. 2016. How multiple imputation makes a difference. Political Analysis 24(4):414–433.

Lall, Ranjit. 2017. The missing dimension of the political resource curse debate. Comparative Political Studies 50(10):1291–1324, http://cps.sagepub.com/content/early/2016/09/06/0010414016666861.

Little, Roderick J. A. 1992. Regression with missing X’s: A review. Journal of the American Statistical Association 87(420):1227–1237, http://www.jstor.org/stable/2290664.

Little, Roderick J. A., and Rubin, Donald B. 2002. Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons.

Pepinsky, Thomas. 2017. A note on listwise deletion versus multiple imputation. Working paper, Cornell University.

Rubin, Donald B. 1976. Inference and missing data. Biometrika 63(3):581–592, http://biomet.oxfordjournals.org/content/63/3/581.

Schafer, Joseph L. 1997. Analysis of incomplete multivariate data. London: Chapman & Hall.

van Buuren, Stef. 2012. Flexible imputation of missing data. Boca Raton, FL: CRC Press.

van Buuren, Stef, and Groothuis-Oudshoorn, Karin. 2011. Mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3):1–67.

White, Ian R., and Carlin, John B. 2010. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine 29(28):2920–2931, http://onlinelibrary.wiley.com/doi/10.1002/sim.3944/abstract.
