Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches

Jonathan Kropko, Ben Goodrich, Andrew Gelman, and Jennifer Hill

Abstract

We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be “missing at random” (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.
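
As an illustration of the MAR condition mentioned above, the following is a minimal sketch, not the authors' simulation code. It assumes a two-variable MVN data set and a logistic missingness model; the function name `make_mar` and its parameters (`target`, `predictor`, `base_rate`) are hypothetical. The point is that the probability of a value being missing depends only on a fully observed variable, never on the value that goes missing.

```python
import numpy as np

def make_mar(data, target, predictor, rng, base_rate=0.25):
    """Return a copy of `data` with MAR missingness in column `target`.

    Hypothetical helper: the probability that `target` is missing depends
    only on the fully observed column `predictor`, not on `target` itself.
    """
    out = data.astype(float).copy()
    # Standardize the predictor so the logistic slope is on a common scale.
    z = (out[:, predictor] - out[:, predictor].mean()) / out[:, predictor].std()
    # Logistic missingness model; the intercept sets the rate at z = 0.
    intercept = np.log(base_rate / (1.0 - base_rate))
    p_missing = 1.0 / (1.0 + np.exp(-(intercept + z)))
    # Draw the missingness indicator and blank out the selected values.
    mask = rng.random(out.shape[0]) < p_missing
    out[mask, target] = np.nan
    return out, mask

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
complete = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)
incomplete, mask = make_mar(complete, target=1, predictor=0, rng=rng)
```

Because the complete data are retained alongside the incomplete copy, imputed values from any MI method can be compared directly with the true values that were removed.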

Corresponding author

e-mail: jkropko@virginia.edu (corresponding author)

Footnotes

Authors' note: An earlier version of this study was presented at the Annual Meeting of the Society for Political Methodology, Chapel Hill, NC, July 20, 2012. Replication code and data are available on the Political Analysis Dataverse, and the full citation to the replication material is included in the references. We thank Yu-sung Su, Yajuan Si, Sonia Torodova, Jingchen Liu, Michael Malecki, and two anonymous reviewers for their comments.

References

American National Election Studies (ANES; www.electionstudies.org). The ANES 2008 Time Series Study [data set]. Stanford University and the University of Michigan [producers].
Bernaards, Coen A., Belin, Thomas R., and Schafer, Joseph L. 2007. Robustness of a multivariate normal approximation for imputation of incomplete binary data. Statistics in Medicine 26(6): 1368–82.
Cranmer, Skyler J., and Gill, Jeff. 2013. We have to be discrete about this: A non-parametric imputation technique for missing categorical data. British Journal of Political Science 43(2): 425–49.
Cribari-Neto, Francisco, and Zeileis, Achim. 2010. Beta regression in R. Journal of Statistical Software 34(2): 1–24.
Demirtas, Hakan. 2010. A distance-based rounding strategy for post-imputation ordinal data. Journal of Applied Statistics 37(3): 489–500.
Dempster, Arthur P., Laird, Nan, and Rubin, Donald B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1): 1–38.
Gelman, Andrew, Jakulin, Aleks, Pittau, Maria Grazia, and Su, Yu-Sung. 2008. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics 2(4): 1360–83.
Gelman, Andrew, Su, Yu-Sung, Yajima, Masanao, Hill, Jennifer, Pittau, Maria Grazia, Kerman, Jouni, and Zheng, Tian. 2012. arm: Data analysis using regression and multilevel/hierarchical models. R package version 1.5-05. http://CRAN.R-project.org/package=arm.
Goodrich, Ben, Kropko, Jonathan, Gelman, Andrew, and Hill, Jennifer. 2012. mi: Iterative multiple imputation from conditional distributions. R package version 2.15.1.
Greenland, Sander, and Finkle, William D. 1995. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology 142(12): 1255–64.
Honaker, James, and King, Gary. 2010. What to do about missing values in time-series cross-section data. American Journal of Political Science 54(2): 561–81.
Honaker, James, King, Gary, and Blackwell, Matthew. 2011. Amelia II: A program for missing data. Journal of Statistical Software 45(7): 1–47.
Honaker, James, King, Gary, and Blackwell, Matthew. 2012. Amelia II: A program for missing data. Software documentation, version 1.6.2. http://r.iq.harvard.edu/docs/amelia/amelia.pdf.
Horton, Nicholas J., Lipsitz, Stuart R., and Parzen, Michael. 2003. A potential for bias when rounding in multiple imputation. American Statistician 57(4): 229–32.
Kropko, Jonathan, Goodrich, Ben, Gelman, Andrew, and Hill, Jennifer. 2014. Replication data for: Multiple imputation for continuous and categorical data: Comparing joint multivariate normal and conditional approaches. http://dx.doi.org/10.7910/DVN/24672. UNF:5:QuxE8nFhbW2JZT+OW9WzWw==. IQSS Dataverse Network [Distributor], V1 [Version].
Lee, Katherine J., and Carlin, John B. 2010. Multiple imputation for missing data: Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology 171(5): 624–32.
Lewandowski, Daniel, Kurowicka, Dorota, and Joe, Harry. 2010. Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis 100(9): 1989–2001.
Li, Fan, Yu, Yaming, and Rubin, Donald B. 2012. Imputing missing data by fully conditional models: Some cautionary examples and guidelines. Working paper. ftp.stat.duke.edu/WorkingPapers/11-24.pdf. Accessed 7 December 2012.
Royston, Patrick. 2005. Multiple imputation of missing values: Update. Stata Journal 5(2): 188–201.
Royston, Patrick. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. Stata Journal 7(4): 445–74.
Royston, Patrick. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3): 466–77.
Rubin, Donald B. 1978. Multiple imputations in sample surveys. Proceedings of the Survey Research Methods Section of the American Statistical Association.
Rubin, Donald B. 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4(1): 87–94.
Rubin, Donald B. 1987. Multiple imputation for nonresponse in surveys. New York: John Wiley and Sons.
Rubin, Donald B., and Little, Roderick J. A. 2002. Statistical analysis with missing data. 2nd ed. New York: John Wiley and Sons.
Schafer, Joseph L. 1997. Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, Joseph L., and Olsen, Maren K. 1998. Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research 33(4): 545–71.
StataCorp. 2013. Stata 13 base reference manual. College Station, TX: Stata Press.
Su, Yu-Sung, Gelman, Andrew, Hill, Jennifer, and Yajima, Masanao. 2011. Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software 45(2): 1–31.
Therneau, Terry. 2012. survival: A package for survival analysis in S. R package version 2.36-14.
van Buuren, Stef. 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research 16(3): 219–42.
van Buuren, Stef. 2012. Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.
van Buuren, Stef, Boshuizen, Hendriek C., and Knook, D. L. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18(6): 681–94.
van Buuren, Stef, and Groothuis-Oudshoorn, Karin. 2011. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3): 1–67.
Venables, William N., and Ripley, Brian D. 2002. Modern applied statistics with S. 4th ed. New York: Springer.
Yu, L-M, Burton, Andrea, and Rivero-Arias, Oliver. 2007. Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research 16(3): 243–58.
Yuan, Yang C. 2013. Multiple imputation for missing data: Concepts and new development (Version 9.0). SAS Software Technical Papers.