Machine Learning Predictions as Regression Covariates

Published online by Cambridge University Press:  11 November 2020

Christian Fong
Assistant Professor, Department of Political Science, University of Michigan, Ann Arbor, MI, USA.
Matthew Tyler*
Ph.D. Candidate, Department of Political Science, Stanford University, Stanford, CA, USA.
Corresponding author Matthew Tyler


In text, images, merged surveys, voter files, and elsewhere, data sets are often missing important covariates, either because they are latent features of observations (such as sentiment in text) or because they are not collected (such as race in voter files). One promising approach for coping with this missing data is to find the true values of the missing covariates for a subset of the observations and then train a machine learning algorithm to predict the values of those covariates for the rest. However, plugging in these predictions without regard for prediction error renders regression analyses biased, inconsistent, and overconfident. We characterize the severity of the problem posed by prediction error, describe a procedure to avoid these inconsistencies under comparatively general assumptions, and demonstrate the performance of our estimators through simulations and a study of hostile political dialogue on the Internet. We provide software implementing our approach.
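
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator or their software. It is a toy setup under stated assumptions: a binary covariate, a linear outcome model with true slope 2, and a hypothetical classifier whose errors are symmetric (20% of labels flipped) and independent of the outcome given the true covariate. All function names and parameter values are illustrative. Under these assumptions, the slope from regressing the outcome on the predicted covariate concentrates near 2 × (1 − 2 × 0.2) = 1.2 rather than 2, while a regression on the hand-labeled subset alone is consistent but uses far fewer observations.

```python
import numpy as np


def ols_slope(x, y):
    """Return the OLS slope from regressing y on x with an intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def simulate(n=200_000, n_labeled=5_000, beta=2.0, error_rate=0.2, seed=0):
    """Compare a plug-in regression with a labeled-subset regression.

    Assumptions (illustrative only): binary covariate with P(X = 1) = 0.5,
    outcome Y = 1 + beta * X + standard normal noise, and predictions that
    flip the true label with probability error_rate, independently of Y.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.binomial(1, 0.5, size=n)
    y = 1.0 + beta * x_true + rng.normal(size=n)

    # Hypothetical classifier output: the true label with random flips.
    flip = rng.random(n) < error_rate
    x_pred = np.where(flip, 1 - x_true, x_true)

    # Plug-in: treat predictions as if they were the true covariate.
    plug_in = ols_slope(x_pred, y)

    # Baseline: use only the subset where the true covariate was hand-coded.
    labeled_only = ols_slope(x_true[:n_labeled], y[:n_labeled])
    return plug_in, labeled_only


plug_in, labeled_only = simulate()
print(f"plug-in slope:      {plug_in:.3f}")
print(f"labeled-only slope: {labeled_only:.3f}")
```

The labeled-only baseline discards the unlabeled observations entirely, which is the inefficiency that motivates combining both sources of information; the sketch shows only the problem, not the correction the paper proposes.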

© The Author(s) 2020. Published by Cambridge University Press on behalf of the Society for Political Methodology


Edited by Jeff Gill

