Logistic Regression in Rare Events Data

  • Gary King and Langche Zeng

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling designs exist for making valid inferences, such as sampling all available events (e.g., wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. We provide methods that link these two results, enabling both types of corrections to work simultaneously, and software that implements the methods developed.
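The abstract states no formulas, so the following is only a minimal sketch of one standard adjustment for samples selected on the outcome (all events plus a small fraction of nonevents), usually called prior correction. It assumes the population fraction of events, here called `tau`, is known from outside the sample. With that assumption, the slope coefficients of a logit fitted to the enriched sample can be used as they are, and only the intercept is shifted by ln[((1 − tau)/tau) · (ybar/(1 − ybar))], where `ybar` is the fraction of events in the sample. The function and variable names are illustrative and are not taken from the authors' software.

```python
import math


def prior_corrected_intercept(b0_sample, tau, ybar):
    """Shift a logit intercept fitted on an event-enriched sample.

    b0_sample: intercept estimated on the sample selected on the outcome
    tau:       assumed known fraction of events in the population
    ybar:      fraction of events in the sample actually analyzed
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return b0_sample - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def logistic(z):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical numbers: events are 1% of the population, but the analyst
# keeps every event and an equal number of nonevents, so the sample is 50/50.
tau = 0.01
ybar = 0.5

# An intercept-only logit on a balanced sample has intercept logit(0.5) = 0.
b0_sample = 0.0

b0_corrected = prior_corrected_intercept(b0_sample, tau, ybar)
p_naive = logistic(b0_sample)         # probability implied by the raw fit
p_corrected = logistic(b0_corrected)  # probability after the intercept shift

print(f"naive probability:     {p_naive:.4f}")
print(f"corrected probability: {p_corrected:.4f}")
```

In this intercept-only case the corrected probability recovers the assumed population rate, which is a quick sanity check on the adjustment. The sketch does not cover the separate small-sample bias in rare events estimates that the abstract also mentions.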

Political Analysis
  • ISSN: 1047-1987
  • EISSN: 1476-4989
  • URL: /core/journals/political-analysis