
Why Propensity Scores Should Not Be Used for Matching

Published online by Cambridge University Press:  07 May 2019

Gary King*
Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA.
Richard Nielsen
Department of Political Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. URL:∼rnielsen


We show that propensity score matching (PSM), an enormously popular method of preprocessing data for causal inference, often accomplishes the opposite of its intended goal—thus increasing imbalance, inefficiency, model dependence, and bias. The weakness of PSM comes from its attempts to approximate a completely randomized experiment, rather than, as with other matching methods, a more efficient fully blocked randomized experiment. PSM is thus uniquely blind to the often large portion of imbalance that can be eliminated by approximating full blocking with other matching methods. Moreover, in data balanced enough to approximate complete randomization, either to begin with or after pruning some observations, PSM approximates random matching which, we show, increases imbalance even relative to the original data. Although these results suggest researchers replace PSM with one of the other available matching methods, propensity scores have other productive uses.
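The abstract's last technical claim, that random matching (in effect, pruning observations at random) increases imbalance even relative to the original data, can be illustrated with a toy simulation. The sketch below is not the authors' simulation or replication code. It assumes a single standard-normal covariate, equal group sizes, and the absolute difference in group means as the imbalance measure; the names `mean_diff` and `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import random
import statistics


def mean_diff(treated, control):
    """Imbalance measure (an assumption): absolute difference in covariate means."""
    return abs(statistics.fmean(treated) - statistics.fmean(control))


def simulate(n_per_group=200, n_keep=50, reps=500, seed=0):
    """Average imbalance before and after random pruning, over `reps` datasets."""
    rng = random.Random(seed)
    full, pruned = [], []
    for _ in range(reps):
        # Data that already look like a completely randomized experiment:
        # both groups are drawn from the same covariate distribution.
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        full.append(mean_diff(treated, control))

        # Random pruning: keep a subset chosen without regard to the covariate.
        t_keep = rng.sample(treated, n_keep)
        c_keep = rng.sample(control, n_keep)
        pruned.append(mean_diff(t_keep, c_keep))
    return statistics.fmean(full), statistics.fmean(pruned)


avg_full, avg_pruned = simulate()
print(f"average imbalance, full data:    {avg_full:.3f}")
print(f"average imbalance, after pruning: {avg_pruned:.3f}")
```

Under these assumptions the pruned samples show larger average imbalance than the full data, because discarding units at random shrinks the sample without making the remaining units any more alike. Pruning that uses the covariates directly can instead reduce imbalance, which is the contrast the abstract draws with other matching methods.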

Copyright © The Author(s) 2019. Published by Cambridge University Press on behalf of the Society for Political Methodology. 



Authors’ note: The current version of this paper, along with a Supplementary Appendix, can be found at We thank Alberto Abadie, Alan Dafoe, Justin Grimmer, Jens Hainmueller, Chad Hazlett, Seth Hill, Stefano Iacus, Kosuke Imai, Simon Jackman, John Londregan, Adam Meirowitz, Giuseppe Porro, Molly Roberts, Jamie Robins, Bradley Spahn, Brandon Stewart, Liz Stuart, Chris Winship, and Yiqing Xu for helpful suggestions, and Connor Jerzak, Chris Lucas, Jason Sclar for superb research assistance. We also appreciate the insights from our collaborators on a previous related project, Carter Coberley, James E. Pope, and Aaron Wells. All data necessary to replicate the results in this article are available at Nielsen and King (2019).

Contributing Editor: Jeff Gill


Abadie, A., and Imbens, G. W. 2006. “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica 74(1):235–267.
Athey, S., and Imbens, G. W. 2015. “A Measure of Robustness to Misspecification.” American Economic Review Papers and Proceedings 105(5):476–480.
Austin, P. C. 2008. “A Critical Appraisal of Propensity-Score Matching in the Medical Literature Between 1996 and 2003.” Statistics in Medicine 27(12):2037–2049.
Austin, P. C. 2009. “Some Methods of Propensity-Score Matching had Superior Performance to Others: Results of an Empirical Investigation and Monte Carlo Simulations.” Biometrical Journal 51(1):171–184.
Banaji, M. R., and Greenwald, A. G. 2016. Blindspot: Hidden Biases of Good People. New York: Bantam.
Bansal, P. P., and Ardell, A. J. 1972. “Average Nearest-Neighbor Distances Between Uniformly Distributed Finite Particles.” Metallography 5(2):97–111.
Barnow, B. S., Cain, G. G., and Goldberger, A. S. 1980. “Issues in the Analysis of Selectivity Bias.” In Evaluation Studies, vol. 5, edited by Stromsdorfer, E. and Farkas, G. San Francisco: Sage.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. 1978. Statistics for Experimenters. New York: Wiley-Interscience.
Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., and Sturmer, T. 2006. “Variable Selection for Propensity Score Models.” American Journal of Epidemiology 163:1149–1156.
Caliendo, M., and Kopeinig, S. 2008. “Some Practical Guidance for the Implementation of Propensity Score Matching.” Journal of Economic Surveys 22(1):31–72.
Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. 2009. “Dealing with Limited Overlap in Estimation of Average Treatment Effects.” Biometrika 96(1):187.
D’Agostino, R. B. 1998. “Propensity Score Methods for Bias Reduction in the Comparison of a Treatment to a Non-Randomized Control Group.” Statistics in Medicine 17:2265–2281.
Dehejia, R. 2004. “Estimating Causal Effects in Nonexperimental Studies.” In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, edited by Gelman, A. and Meng, X.-L. New York: Wiley.
Diamond, A., and Sekhon, J. S. 2012. “Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies.” Review of Economics and Statistics 95(3):932–945.
Drake, C. 1993. “Effects of Misspecification of the Propensity Score on Estimators of Treatment Effects.” Biometrics 49:1231–1236.
Efron, B. 2014. “Estimation and Accuracy After Model Selection.” Journal of the American Statistical Association 109(507):991–1007.
Finkel, S. E., Horowitz, J., and Rojo-Mendoza, R. T. 2012. “Civic Education and Democratic Backsliding in the Wake of Kenya’s Post-2007 Election Violence.” Journal of Politics 74(01):52–65.
Glazerman, S., Levy, D. M., and Myers, D. 2003. “Nonexperimental Versus Experimental Estimates of Earnings Impacts.” The Annals of the American Academy of Political and Social Science 589:63–93.
Greevy, R., Lu, B., Silver, J. H., and Rosenbaum, P. R. 2004. “Optimal Multivariate Matching Before Randomization.” Biostatistics 5(2):263–275.
Gu, X. S., and Rosenbaum, P. R. 1993. “Comparison of Multivariate Matching Methods: Structures, Distances, and Algorithms.” Journal of Computational and Graphical Statistics 2:405–420.
Heckman, J., Ichimura, H., and Todd, P. 1998. “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program.” Review of Economic Studies 65:261–294.
Hill, J. 2008. “Discussion of Research Using Propensity-Score Matching: Comments on ‘A Critical Appraisal of Propensity-Score Matching in the Medical Literature Between 1996 and 2003’ by Peter Austin, Statistics in Medicine.” Statistics in Medicine 27(12):2055–2061.
Ho, D. E., Imai, K., King, G., and Stuart, E. A. 2007. “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference.” Political Analysis 15:199–236.
Holland, P. W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81:945–960.
Iacus, S. M., King, G., and Porro, G. 2011. “Multivariate Matching Methods that are Monotonic Imbalance Bounding.” Journal of the American Statistical Association 106:345–361.
Imai, K., King, G., and Nall, C. 2009. “The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation.” Statistical Science 24(1):29–53.
Imai, K., King, G., and Stuart, E. A. 2008. “Misunderstandings Among Experimentalists and Observationalists about Causal Inference.” Journal of the Royal Statistical Society, Series A 171(2):481–502.
Imai, K., and Ratkovic, M. 2014. “Covariate Balancing Propensity Score.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1):243–263.
Imbens, G. W. 2004. “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.” Review of Economics and Statistics 86(1):4–29.
Imbens, G. W., and Rubin, D. B. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. New York: Cambridge University Press.
Ioannidis, J. P. A. 2005. “Why Most Published Research Findings are False.” PLoS Medicine 2(8):e124.
Kahneman, D. 2011. Thinking, Fast and Slow. London: Macmillan.
Kallus, N. 2018. “Optimal A Priori Balance in The Design of Controlled Experiments.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(1):85–112.
Kang, J. D. Y., and Schafer, J. L. 2007. “Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data.” Statistical Science 22(4):523–539.
King, G., and Zeng, L. 2006. “The Dangers of Extreme Counterfactuals.” Political Analysis 14(2):131–159.
King, G., and Zeng, L. 2007. “When Can History Be Our Guide? The Pitfalls of Counterfactual Inference.” International Studies Quarterly, 183–210.
Lechner, M. 2001. “Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption.” In Econometric Evaluation of Labour Market Policies, edited by Lechner, M. and Pfeiffer, F., 43–58. Heidelberg: Physica.
Lunceford, J. K., and Davidian, M. 2004. “Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study.” Statistics in Medicine 23(19):2937–2960.
Mahoney, M. J. 1977. “Publication Prejudices: An Experimental Study of Confirmatory Bias in the Peer Review System.” Cognitive Therapy and Research 1(2):161–175.
Mielke, P., and Berry, K. 2007. Permutation Methods: A Distance Function Approach. New York: Springer.
Morgan, S. L., and Winship, C. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social Research, 2nd edn. Cambridge: Cambridge University Press.
Nielsen, R., Findley, M., Davis, Z., Candland, T., and Nielson, D. 2011. “Foreign Aid Shocks as a Cause of Violent Armed Conflict.” American Journal of Political Science 55(2):219–232.
Nielsen, R., and King, G. 2019. “Replication Data for: Why Propensity Scores Should Not Be Used for Matching.” Harvard Dataverse, V1.
Pearl, J. 2009. “Myth, Confusion, and Science in Causal Analysis.” Unpublished paper, ∼kaoru/r348.pdf.
Pearl, J. 2009. “The Foundations of Causal Inference.” Sociological Methodology 40(1):75–149.
Peikes, D. N., Moreno, L., and Orzol, S. M. 2008. “Propensity Score Matching.” The American Statistician 62(3):222–231.
Pimentel, S. D., Page, L. C., Lenard, M., and Keele, L. 2018. “Optimal Multilevel Matching Using Network Flows: An Application to a Summer Reading Intervention.” The Annals of Applied Statistics 12(3):1479–1505.
Robins, J. M., Hernan, M. A., and Brumback, B. 2000. “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11(5):550–560.
Robins, J. M., and Morgenstern, H. 1987. “The Foundations of Confounding in Epidemiology.” Computers & Mathematics with Applications 14(9):869–916.
Rosenbaum, P. R., Ross, R., and Silber, J. 2007. “Minimum Distance Matched Sampling With Fine Balance in an Observational Study of Treatment for Ovarian Cancer.” Journal of the American Statistical Association 102(477):75–83.
Rosenbaum, P. R., and Rubin, D. B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70:41–55.
Rosenbaum, P. R., and Rubin, D. B. 1984. “Reducing Bias in Observational Studies Using Subclassification on the Propensity Score.” Journal of the American Statistical Association 79:515–524.
Rosenbaum, P. R., and Rubin, D. B. 1985a. “Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score.” The American Statistician 39:33–38.
Rosenbaum, P. R., and Rubin, D. B. 1985b. “The Bias Due to Incomplete Matching.” Biometrics 41(1):103–116.
Rubin, D. B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 6:688–701.
Rubin, D. B. 1976. “Inference and Missing Data.” Biometrika 63:581–592.
Rubin, D. B. 1980. “Comments on ‘Randomization Analysis of Experimental Data: The Fisher Randomization Test’, by D. Basu.” Journal of the American Statistical Association 75:591–593.
Rubin, D. B. 2008a. “Comment: The Design and Analysis of Gold Standard Randomized Experiments.” Journal of the American Statistical Association 103(484):1350–1353.
Rubin, D. B. 2008b. “For Objective Causal Inference, Design Trumps Analysis.” Annals of Applied Statistics 2(3):808–840.
Rubin, D. B. 2009. “Should Observational Studies be Designed to Allow Lack of Balance in Covariate Distributions Across Treatment Groups?” Statistics in Medicine 28:1415–1424.
Rubin, D. B. 2010. “On the Limitations of Comparative Effectiveness Research.” Statistics in Medicine 29(19):1991–1995.
Rubin, D. B., and Stuart, E. A. 2006. “Affinely Invariant Matching Methods with Discriminant Mixtures of Proportional Ellipsoidally Symmetric Distributions.” Annals of Statistics 34(4):1814–1826.
Rubin, D. B., and Thomas, N. 2000. “Combining Propensity Score Matching with Additional Adjustments for Prognostic Covariates.” Journal of the American Statistical Association 95:573–585.
Simmons, J. P., Nelson, L. D., and Simonsohn, U. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22(11):1359–1366.
Smith, J. A., and Todd, P. E. 2005a. “Does Matching Overcome LaLonde’s Critique of Nonexperimental Estimators?” Journal of Econometrics 125(1–2):305–353.
Smith, J., and Todd, P. 2005b. “Rejoinder.” Journal of Econometrics 125:365–375.
Stuart, E. A. 2010. “Matching Methods for Causal Inference: A Review and a Look Forward.” Statistical Science 25(1):1–21.
Stuart, E. A., and Rubin, D. B. 2007. “Best Practices in Quasi-Experimental Designs: Matching Methods for Causal Inference.” In Best Practices in Quantitative Methods, edited by Osborne, J., 155–176. New York: Sage.
Stuart, E. A., and Rubin, D. B. 2008. “Matching with Multiple Control Groups with Adjustment for Group Differences.” Journal of Educational and Behavioral Statistics 33(3):279–306.
Tetlock, P. E. 2005. Expert Political Judgment: How Good Is It? How Can We Know? Princeton: Princeton University Press.
VanderWeele, T. J., and Hernan, M. A. 2012. “Causal Inference Under Multiple Versions of Treatment.” Journal of Causal Inference 1:1–20.
VanderWeele, T. J., and Shpitser, I. 2011. “A New Criterion for Confounder Selection.” Biometrics 67(4):1406–1413.
Vansteelandt, S., and Daniel, R. 2014. “On Regression Adjustment for the Propensity Score.” Statistics in Medicine 33(23):4053–4072.
Wilson, T. D., and Brekke, N. 1994. “Mental Contamination and Mental Correction: Unwanted Influences on Judgments and Evaluations.” Psychological Bulletin 116(1):117.
Zhao, Z. 2008. “Sensitivity of Propensity Score Methods to the Specifications.” Economics Letters 98(3):309–319.
Zubizarreta, J. R., Paredes, R. D., Rosenbaum, P. R., et al. 2014. “Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-Profit and Not-For-Profit High Schools in Chile.” The Annals of Applied Statistics 8(1):204–231.
Supplementary material: File

King and Nielsen supplementary material
