Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data

David Muchlinski; David Siroky; Jingrui He; Matthew Kocher

doi:10.1093/pan/mpv024

Published online by Cambridge University Press: 04 January 2017

David Muchlinski*: School of Social and Political Science, University of Glasgow, Glasgow, UK
David Siroky: Department of Political Science, Arizona State University, Tempe, AZ, e-mail: david.siroky@asu.edu
Jingrui He: Department of Computer Science and Engineering, Arizona State University, Tempe, AZ, e-mail: jingrui.he@asu.edu
Matthew Kocher: Department of Political Science, Yale University, New Haven, CT, e-mail: mathew.kocher@yale.edu
* Corresponding author, e-mail: david.muchlinski@glasgow.ac.uk

Abstract

The most commonly used statistical models of civil war onset fail to correctly predict most occurrences of this rare event in out-of-sample data. Statistical methods for the analysis of binary data, such as logistic regression, even in their rare event and regularized forms, perform poorly at prediction. We compare the performance of Random Forests with three versions of logistic regression (classic logistic regression, Firth rare events logistic regression, and L1-regularized logistic regression), and find that the algorithmic approach provides significantly more accurate predictions of civil war onset in out-of-sample data than any of the logistic regression models. The article discusses these results and the ways in which algorithmic statistical methods like Random Forests can be useful to more accurately predict rare events in conflict data.
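
The abstract's point about rare events can be illustrated with a minimal sketch. It is not the authors' analysis: the labels, the 1% onset rate, and the two sets of predictions below are hypothetical numbers chosen only to show why overall accuracy says little about how well a model predicts onsets in held-out data.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float   # share of all cases classified correctly
    recall: float     # share of true onsets that were predicted
    precision: float  # share of predicted onsets that were true onsets


def evaluate(y_true, y_pred):
    """Compute accuracy, recall, and precision for binary labels (1 = onset)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return Metrics(accuracy, recall, precision)


# Hypothetical held-out sample: 1,000 country-years, 10 of them onsets.
y_true = [1] * 10 + [0] * 990

# Baseline that never predicts an onset.
always_peace = [0] * 1000
baseline = evaluate(y_true, always_peace)

# Hypothetical model: flags 20 country-years and catches 6 of the 10 onsets.
flagged = [1] * 6 + [0] * 4 + [1] * 14 + [0] * 976
model = evaluate(y_true, flagged)

print(baseline)
print(model)
```

In this made-up sample the baseline that never predicts an onset is more accurate overall than the model that finds six of the ten onsets, which is why comparisons on class-imbalanced data usually rest on measures other than raw accuracy.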

Information

Type: Articles
Information: Political Analysis, Volume 24, Issue 1, Winter 2016, pp. 87–103

DOI: https://doi.org/10.1093/pan/mpv024
Copyright: Copyright © The Author 2015. Published by Oxford University Press on behalf of the Society for Political Methodology

Footnotes

Author's note: Replication data are available on the Political Analysis Dataverse at http://dx.doi.org/10.7910/DVN/KRKWK8.

