We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data

  • Skyler J. Cranmer and Jeff Gill
Abstract

Missing values are a frequent problem in empirical political science research. Surprisingly, the match between the measurement of the missing values and the correcting algorithms applied is seldom studied. While multiple imputation is a vast improvement over the deletion of cases with missing values, it is often unsuitable for imputing highly non-granular discrete data. We develop a simple technique for imputing missing values in such situations, which is a variant of hot deck imputation, drawing from the conditional distribution of the variable with missing values to preserve the discrete measure of the variable. This method is tested against existing techniques using Monte Carlo analysis and then applied to real data on democratization and modernization theory. Software for our imputation technique is provided in a free, easy-to-use package for the R statistical environment.
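As a rough illustration of the general idea in the abstract, the sketch below fills each missing discrete value by drawing an observed value from "donor" rows that match on a set of conditioning variables, so every imputed value is one that actually occurs in the data. This is a generic conditional hot deck under simplifying assumptions, not the authors' algorithm or their R package; the function names, the `by` argument and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck_impute(rows, target, by, rng=None):
    """Return a copy of `rows` with missing `target` values drawn from donors.

    Hypothetical helper: `rows` is a list of dicts, None marks a missing
    value, and `by` lists the fully observed conditioning variables.
    """
    rng = rng or random.Random()
    # Pool the observed target values by each donor's conditioning profile.
    donors = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            donors[tuple(row[v] for v in by)].append(row[target])
    # Fallback pool (all observed values) when no exact-match donor exists.
    observed = [row[target] for row in rows if row[target] is not None]

    completed = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None:
            pool = donors.get(tuple(row[v] for v in by)) or observed
            # Drawing an observed value keeps the imputation discrete.
            new_row[target] = rng.choice(pool)
        completed.append(new_row)
    return completed


def multiple_hot_deck(rows, target, by, m=5, seed=0):
    """Produce m completed copies of the data, one per random donor draw."""
    rng = random.Random(seed)
    return [hot_deck_impute(rows, target, by, rng) for _ in range(m)]


# Toy survey data (invented for illustration only).
rows = [
    {"region": "north", "urban": 1, "trust": 3},
    {"region": "north", "urban": 1, "trust": 4},
    {"region": "north", "urban": 1, "trust": None},
    {"region": "south", "urban": 0, "trust": 1},
    {"region": "south", "urban": 0, "trust": None},
]
imputed = multiple_hot_deck(rows, "trust", ["region", "urban"], m=3)
```

Exact matching on the conditioning variables is the simplest possible donor rule and is chosen here only for brevity; a similarity-based donor selection would replace the dictionary lookup.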

Footnotes
* Department of Political Science, University of North Carolina; and Department of Political Science, Washington University (email: jgill@wustl.edu), respectively. The authors wish to thank Micah Altman, James Fowler, Katie Gan, Adam Glynn, Justin Grimmer, Dominik Hangartner, Michael Kellerman, Gary King, Ryan Moore and Randolph Siverson for valuable comments. Replication data is available at http://www.unc.edu/~skylerc/.

References

1 The term ‘missing data’ can mean either missing values (e.g. item non-response in a survey) or missing observations such as refusal to take an entire survey. Throughout this work, we use the term exclusively to mean the first case.

2 Taagepera, Rein and Shugart, Matthew Soberg, Seats and Votes: The Effects and Determinants of Electoral Systems (New Haven, Conn.: Yale University Press, 1989)

3 Peter Mair and Ingrid van Biezen, ‘Party Membership in Twenty European Democracies, 1980–2000’, Party Politics, 7 (2001), 5–21

4 Palmer, Harvey D. and Whitten, Guy D., ‘The Electoral Impact of Unexpected Inflation and Economic Growth’, British Journal of Political Science, 29 (1999), 623–39

5 Reiter, Dan, ‘Does Peace Nurture Democracy?’ Journal of Politics, 63 (2001), 935–48

6 Tsiatis, Anastasios A., Semiparametric Theory and Missing Data (New York: Springer, 2010)

Enders, Craig K., Applied Missing Data Analysis (New York: The Guilford Press, 2010)

Tan, Ming T., Tian, Guo-Liang and Ng, Kai Wang, Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation (New York: Chapman & Hall/CRC, 2009)

Molenberghs, Geert and Kenward, Michael G., Missing Data in Clinical Studies (New York: Wiley, 2007)

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya and Figueredo, Aurelio Jose, Missing Data: A Gentle Approach (New York: The Guilford Press, 2007)

7 Rees, Phil H. and Duke-Williams, Oliver, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’, International Journal of Population Geography, 3 (1997), 323–68

8 Rees and Duke-Williams, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’.

9 Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data, 2nd edn (New York: Wiley, 2002), p. 42

10 Allison, Paul D., Missing Data (Thousand Oaks, Calif.: Sage, 2001)

Little, Roderick J. A., ‘Regression with Missing X's: A Review’, Journal of the American Statistical Association, 87 (1992), 1227–37

Little, Roderick J. A., ‘Approximately Calibrated Small Sample Inference about Means from Bivariate Normal Data with Missing Values’, Computational Statistics & Data Analysis, 7 (1988), 161–78

Rubin, Donald B., ‘Inference and Missing Data (with Discussion)’, Biometrika, 63 (1976), 581–92

King, Gary, Honaker, James, Joseph, Anne and Scheve, Kenneth, ‘Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation’, American Political Science Review, 95 (2001), 49–69

11 Honaker, James and King, Gary, ‘What to Do about Missing Values in Time-Series Cross-Section Data’, American Journal of Political Science, 54 (2010), 561–81

12 Rubin, ‘Inference and Missing Data’; King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Little and Rubin, Statistical Analysis with Missing Data.

13 Little and Rubin, Statistical Analysis with Missing Data, p. 12.

14 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’.

15 Gelman, Andrew and Hill, Jennifer, Data Analysis Using Regression and Multilevel/Hierarchical Models (New York: Cambridge University Press, 2007)

16 Bailar, John C. III and Bailar, Barbara A., ‘Comparison of the Biases of the “Hot Deck” Imputation Procedure with an “Equal Weights” Imputation Procedure’, Symposium on Incomplete Data (Panel on Incomplete Data of the Committee on National Statistics, National Research Council, 1997), 422–47

Cox, Brenda G., ‘The Weighted Sequential Hot Deck Imputation Procedure’, Proceedings of the Section on Survey Research Methods, American Statistical Association (1980), 721–6

Rockwell, Richard C., ‘An Investigation of Imputation and Differential Quality of Data in the 1970 Census’, Journal of the American Statistical Association, 70 (1975), 39–42

17 Rubin, Donald B., Multiple Imputation for Nonresponse in Surveys (New York: Wiley, 2004)

18 Rubin, ‘Inference and Missing Data’.

19 Rubin, Donald B., ‘Formalizing Subjective Notions about the Effect of Nonrespondents in Sample Surveys’, Journal of the American Statistical Association, 72 (1977), 538–43

Rubin, Donald B., ‘Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse’, Proceedings of the Survey Research Methods Section of the American Statistical Association (1978), 20–34

Rubin, Donald B. and Schenker, Nathaniel, ‘Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse’, Journal of the American Statistical Association, 81 (1986), 366–74

Rubin, Donald B., ‘Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations’, Journal of Business and Economic Statistics, 4 (1986), 87–94

Rubin, Donald B., Schafer, J. L. and Schenker, Nathaniel, ‘Imputation Strategies for Missing Values in Post-Enumeration Surveys’, Survey Methodology, 14 (1988), 209–21

Rubin, Donald B., ‘Multiple Imputation after 18+ Years’, Journal of the American Statistical Association, 91 (1996), 473–89

20 The combined $\bar{\theta}_{M}$ is in fact an average, but the treatment of the variability of this estimate is slightly more complicated than that of a simple average, since it must account for both within-imputation and between-imputation variation. The subject of multiple estimate combination will be discussed in some detail below. See Little and Rubin, Statistical Analysis with Missing Data, for a more detailed treatment.

21 Kim, Jae Kwang, ‘Finite Sample Properties of Multiple Imputation Estimators’, Annals of Statistics, 32 (2004), 766–83

Kim, Jae Kwang and Fuller, Wayne, ‘Fractional Hot Deck Imputation’, Biometrika, 91 (2004), 559–78

Fuller, Wayne and Kim, Jae Kwang, ‘Hot Deck Imputation for the Response Model’, Statistics Canada, 31 (2005), 139–49

22 Schafer, Joseph L., Analysis of Incomplete Multivariate Data (New York: Chapman & Hall/CRC, 1997)

23 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.

24 The articles describing the Amelia procedure have received over 330 ISI citations as of this writing.

25 Reilly, Marie, ‘Data Analysis Using Hot Deck Multiple Imputation’, The Statistician, 42 (1993), 307–13

26 Kalton, Graham and Kish, Leslie, ‘Some Efficient Random Imputation Methods’, Communications in Statistics – Theory and Methods, 13 (1984), 1919–39

Fay, Robert E., ‘Alternative Paradigms for the Analysis of Imputed Survey Data’, Journal of the American Statistical Association, 91 (1996), 490–8

27 Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.

28 Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.

29 For linguistic parsimony, we generally use the term ‘respondent’ below, but these methods are immediately applicable to datasets where the rows reflect any other type of observation.

30 Gower, J. C., ‘A General Coefficient of Similarity and Some of its Properties’, Biometrics, 27 (1971), 857–71

31 Rosenbaum, Paul R. and Rubin, Donald B., ‘The Central Role of the Propensity Score in Observational Studies for Causal Effects’, Biometrika, 70 (1983), 41–55

32 Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’; Kim and Fuller, ‘Fractional Hot Deck Imputation’; Fuller and Kim, ‘Hot Deck Imputation for the Response Model’.

33 Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’.

34 Little and Rubin, Statistical Analysis with Missing Data; Rubin, ‘Multiple Imputations in Sample Surveys’; Rubin, Multiple Imputation for Nonresponse in Surveys; Rubin, ‘Multiple Imputation after 18+ Years’.

35 Little and Rubin, Statistical Analysis with Missing Data.

36 Our software formats its output for seamless use with the R package Zelig; Kosuke Imai, Gary King and Olivia Lau, ‘Zelig: Everyone's Statistical Software’, Comprehensive R Archive Network (2006). This has the advantage of allowing the user to run, in a single line of code, a great variety of models on the multiply imputed datasets and have the combination handled automatically.

37 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.

38 Stef van Buuren, Jaap P. L. Brand, C. G. M. Groothuis-Oudshoorn and Donald B. Rubin, ‘Fully Conditional Specification in Multivariate Imputation’, Journal of Statistical Computation and Simulation, 76 (2006), 1049–64

Stef van Buuren, ‘Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification’, Statistical Methods in Medical Research, 16 (2007), 219–42

39 Dempster, A. P., Laird, N. M. and Rubin, D. B., ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’, Journal of the Royal Statistical Society, Series B, 39 (1977), 493–510

40 We also ran experiments where the missing values were MCAR, but, as we would expect theoretically, no method was biased under those conditions.

41 Lipset, Seymour M., ‘Some Social Requisites of Democracy: Economic Development and Political Legitimacy’, American Political Science Review, 53 (1959), 69–105

42 Cutright, Phillips, ‘National Political Development: Its Measurement and Social Correlates’, in Nelson W. Polsby, Robert A. Dentler and Paul A. Smith, eds, Politics and Social Life: An Introduction to Political Behavior (Boston, Mass.: Houghton Mifflin, 1963), 569–81

Deutsch, Karl W., ‘Social Mobilization and Political Development’, American Political Science Review, 55 (1961), 493–510

Dahl, Robert A., Polyarchy: Participation and Opposition (New Haven, Conn.: Yale University Press, 1971)

Burkhart, Ross E. and Lewis-Beck, Michael S., ‘The Economic Development Thesis’, American Political Science Review, 88 (1994), 903–10

Londregan, John B. and Poole, Keith T., ‘Does High Income Promote Democracy?’ World Politics, 49 (1996), 1–30

43 Przeworski, Adam, Democracy and the Market: Political and Economic Reforms in Eastern Europe (New York: Cambridge University Press, 1991)

Przeworski, Adam and Limongi, Fernando, ‘Political Regimes and Economic Growth’, Journal of Economic Perspectives, 7 (1993), 51–69

Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, ‘What Makes Democracies Endure?’ Journal of Democracy, 7 (1996), 39–55

Przeworski, Adam and Limongi, Fernando, ‘Modernization: Theories and Facts’, World Politics, 49 (1997), 155–83

Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, Democracy and Development: Political Institutions and Well-Being in the World, 1950–1990 (New York: Cambridge University Press, 2000)

44 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

45 Boix, Carles, Democracy and Redistribution (New York: Cambridge University Press, 2002)

Boix, Carles and Stokes, Susan, ‘Endogenous Democratization’, World Politics, 55 (2003), 517–49

Epstein, David L., Bates, Robert, Goldstone, Jack, Kristensen, Ida and O'Halloran, Sharyn, ‘Democratic Transitions’, American Journal of Political Science, 50 (2006), 551–69

46 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

47 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

48 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

49 The true results are true to the extent that they are the results actually obtained by analysing the complete data. They are not true in the more traditional sense of being the true population parameters an empirical analysis attempts to estimate.

50 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

51 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

52 Imai, King and Lau, ‘Zelig’.
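Note 20 observes that the combined estimate is an average whose variability must reflect both within-imputation and between-imputation variation. The sketch below illustrates this with the standard combining rules for multiply imputed estimates (commonly called Rubin's rules), assuming a scalar estimate and its squared standard error from each completed dataset. The function name and the example numbers are hypothetical and are not taken from the article or its software.

```python
from statistics import mean, variance


def combine_estimates(estimates, variances):
    """Pool M point estimates and their variances across imputed datasets.

    Hypothetical helper: `estimates` holds one scalar estimate per
    completed dataset and `variances` the matching squared standard errors.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    theta_bar = mean(estimates)       # combined point estimate (the average)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return theta_bar, total


# Invented numbers for illustration: three imputed-data estimates.
theta_bar, total_var = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The factor `1 + 1 / m` inflates the between-imputation component to account for using a finite number of imputations.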

British Journal of Political Science
  • ISSN: 0007-1234
  • EISSN: 1469-2112
  • URL: /core/journals/british-journal-of-political-science