
Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data

  • David Muchlinski, David Siroky, Jingrui He and Matthew Kocher
Abstract

The most commonly used statistical models of civil war onset fail to correctly predict most occurrences of this rare event in out-of-sample data. Statistical methods for the analysis of binary data, such as logistic regression, even in their rare event and regularized forms, perform poorly at prediction. We compare the performance of Random Forests with three versions of logistic regression (classic logistic regression, Firth rare events logistic regression, and L1-regularized logistic regression), and find that the algorithmic approach provides significantly more accurate predictions of civil war onset in out-of-sample data than any of the logistic regression models. The article discusses these results and the ways in which algorithmic statistical methods like Random Forests can be useful to more accurately predict rare events in conflict data.
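
To make the rare-event problem concrete, the following is a minimal sketch of how out-of-sample predictions on class-imbalanced data might be scored. It is not the authors' replication code: the data are hypothetical (100 held-out country-years with 2 onsets), and the helper names `accuracy`, `recall`, and `roc_auc` are illustrative assumptions. It shows that a model predicting "no onset" everywhere can look accurate while missing every onset, which is why a ranking-based measure such as the area under the ROC curve is often preferred for comparisons of this kind.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted class matches the true class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual onsets (class 1) that were predicted as onsets."""
    positives = sum(y_true)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives if positives else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen onset receives a higher
    score than a randomly chosen non-onset (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical held-out sample: 98 peaceful country-years, then 2 onsets.
y_true = [0] * 98 + [1] * 2

# A "model" that never predicts onset: high accuracy, zero recall.
always_peace = [0] * 100
acc_naive = accuracy(y_true, always_peace)
rec_naive = recall(y_true, always_peace)

# Constant scores carry no ranking information, so AUC is one half.
auc_constant = roc_auc(y_true, [0.5] * 100)

# Hypothetical scores that rank both onsets above all but one peaceful case.
scores = [0.1] * 97 + [0.9] + [0.8, 0.95]
auc_ranked = roc_auc(y_true, scores)

print(acc_naive, rec_naive, auc_constant, auc_ranked)
```

In this toy sample the never-predict-onset rule scores 98 percent accuracy with zero recall, whereas the AUC separates an uninformative scorer from one that ranks onsets near the top. Any real comparison would compute such measures on predictions from fitted models evaluated on held-out data.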

e-mail: david.muchlinski@glasgow.ac.uk (corresponding author)
Author's note: Replication data are available on the Political Analysis Dataverse at http://dx.doi.org/10.7910/DVN/KRKWK8.

Political Analysis
  • ISSN: 1047-1987
  • EISSN: 1476-4989
  • URL: /core/journals/political-analysis