This is the 2018 Political Analysis in-house replication of Muchlinski, Siroky, He, and Kocher (2016), henceforth MSHK. This work, “Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data,” was published in Political Analysis in Volume 24, Issue 1 in 2016. It was accompanied by Dataverse replication material as required by the journal. While this material was checked upon submission in 2015, recent replication efforts show that it does not support the claims made by MSHK.
Specifically, I show that MSHK conducted in-sample predictions in their use of RandomForest instead of the out-sample predictions stated in the paper. RandomForest is a machine learning algorithm that constructs multiple decision trees and aggregates them to obtain more accurate predictions. Adding trees to the forest generally stabilizes the predictions, although the gains level off as the forest grows. A RandomForest model needs to be fitted, or trained, on a data sample. This model can then be used to forecast, or predict, observations. If this prediction is made for an observation that is part of the training sample, it is an in-sample prediction. If this prediction is made for an observation that is external to the training sample, it is an out-sample prediction. Predicting observations from the fitting sample based on a model derived from the same sample, i.e., in-sample prediction, will typically provide highly accurate results: We are predicting within the same sample that we trained on. To assess the viability of a RandomForest model, it is necessary to predict observations that were not used for the model fitting, i.e., to conduct out-sample predictions.
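The distinction between in-sample and out-sample prediction can be illustrated with a minimal base R sketch. It is not MSHK's code and does not use RandomForest: It assumes simulated data and a one-nearest-neighbor classifier as a deliberately flexible stand-in that reproduces its training labels exactly. All object and function names (x, y, train, test, predict_1nn) are hypothetical.

```r
# Simulated data; nothing here is taken from MSHK's replication material.
set.seed(1)
n <- 400
x <- matrix(rnorm(n * 2), ncol = 2)   # two continuous predictors
y <- rbinom(n, 1, plogis(x[, 1]))     # noisy binary outcome

train <- 1:200     # rows used for fitting
test  <- 201:400   # rows held out from fitting

# One-nearest-neighbor classifier (a stand-in, not RandomForest):
# each new point receives the label of its closest training point.
predict_1nn <- function(x_new, x_train, y_train) {
  apply(x_new, 1, function(p) {
    d <- colSums((t(x_train) - p)^2)  # squared distance to every training row
    y_train[which.min(d)]
  })
}

# In-sample: predict the rows the model was fitted on.
pred_in <- predict_1nn(x[train, ], x[train, ], y[train])

# Out-sample: predict rows the model has never seen.
pred_out <- predict_1nn(x[test, ], x[train, ], y[train])

acc_in  <- mean(pred_in == y[train])  # equals 1: each point is its own neighbor
acc_out <- mean(pred_out == y[test])  # lower, and the only informative figure
```

In this sketch the in-sample accuracy is perfect by construction and says nothing about how the classifier performs on new observations. Only acc_out speaks to predictive performance.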
I am the current replicator for Political Analysis. I have been in this position since August 2017. I was not involved in the original assessment of MSHK’s submitted replication material. I walk through MSHK’s 2016 R code step by step. I start with the loaded source files, move on to model building, and finally address the out-sample analysis and insufficient output for Table 1. All R code, including comments and typos, is copied verbatim from material provided by MSHK. Some code lines in this replication analysis have been omitted for space reasons while others have been rearranged to fit page margins. These adjustments do not affect the substantive content of the analysis.
2 Loaded Files
MSHK load three imputed .csv files: SambnisImp.csv, Amelia.Imp3.csv, and AfricaImp.csv. The first two are loaded as pre-imputed source files. The third is imputed by MSHK in a separate R script. SambnisImp.csv is loaded into the R object data, which is further subset into data.full. Amelia.Imp3.csv is loaded into data2, myvars, and newdata. AfricaImp.csv is loaded into data3. These R objects are confusingly named, which makes the replication of the material more complex than it needs to be. While renaming them to separate them more clearly from each other would solve this, I have retained MSHK’s original names here in the interest of transparency. The R code to load the .csv files into the respective R objects is shown below.
3 Model Building
MSHK build four models. Three of these stem from previous studies: Fearon and Laitin (2003), Collier and Hoeffler (2004), and Hegre and Sambanis (2006). For each of these three studies’ models, they implement uncorrected and penalized logistic regression models. The fourth model is MSHK’s implementation of RandomForest. MSHK use these three external studies to showcase the superiority of RandomForest in predicting class-imbalanced civil war onset data. All models are trained on the R object data.full, which is a subset of SambnisImp.csv.
4 Out-Sample Analysis
After training the models, MSHK create three logit models for the external studies by Fearon and Laitin (2003), Collier and Hoeffler (2004), and Hegre and Sambanis (2006) as well as one model with RandomForest. All models load the R object data.full and are thus based on the imputed source file SambnisImp.csv.
Based on these models, MSHK make one prediction per model, turn the predictions into data frames, and subsequently set the seed to draw 737 random units from each predicted data frame. Each separate set of randomly drawn units is saved as a predictor object: predictors.rf for RandomForest, predictors.fl for Fearon and Laitin, predictors.ch for Collier and Hoeffler, and predictors.hs for Hegre and Sambanis.
MSHK then create confusion matrices from the predictor objects (based on the imputed source file SambnisImp.csv) and the variable warstds. This variable is a column of the data set data3, which in turn is based on the imputed source file AfricaImp.csv. Subsequently, MSHK load the R package ROCR. As per the ROCR package documentation, the function prediction() transforms the input data into a standardized format, while the function performance() calculates the area under the ROC curve when its measure argument is set to "auc", as MSHK do in the code shown below.
To sum up: For their out-sample analysis, MSHK create models based on SambnisImp.csv, make predictions based on SambnisImp.csv, draw random samples based on SambnisImp.csv, and calculate AUC scores based on SambnisImp.csv and one external variable based on AfricaImp.csv. In other words: MSHK conduct in-sample predictions, take random samples of these in-sample predicted probabilities, and compare those probabilities with true values from out-sample data. MSHK thus use the same sample to fit the model and conduct the predictions. This is not an out-sample prediction.
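The contrast summarized above can be sketched in base R under stated assumptions. The example uses simulated data, a logit model fitted with glm(), and a hand-written auc() helper based on the rank-sum identity instead of ROCR. None of the object names (dat, train_df, test_df, fit, auc) come from MSHK's material, and the third computation only mimics the procedure as described in the text.

```r
# Simulated data; all names are hypothetical and not taken from MSHK's code.
set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$onset <- rbinom(n, 1, plogis(-2 + 1.5 * dat$x1))  # imbalanced outcome

train_df <- dat[1:700, ]     # fitting sample
test_df  <- dat[701:1000, ]  # held-out sample

fit <- glm(onset ~ x1 + x2, data = train_df, family = binomial)

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# In-sample: without newdata, predict() returns fitted values for train_df.
p_in <- predict(fit, type = "response")
auc_in <- auc(p_in, train_df$onset)

# Out-sample: predictions for held-out rows, scored against their own labels.
p_out <- predict(fit, newdata = test_df, type = "response")
auc_out <- auc(p_out, test_df$onset)

# Procedure as described in the text: a random draw of in-sample predictions
# is paired with labels from a different data set. Scores and labels belong
# to different rows, so the resulting number has no clear interpretation.
p_draw <- sample(p_in, nrow(test_df))
auc_misaligned <- auc(p_draw, test_df$onset)
```

The design point is that an out-sample score requires the held-out rows to enter predict() through newdata, and that each predicted probability must be compared with the observed outcome of the same row.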
5 Output for Main Evidence
In their paper, MSHK provide Table 1 as the main evidence for their claim of the superiority of RandomForest. This table lists the predicted probabilities for civil war onset for 19 African countries and showcases the superiority of RandomForest over logit models in terms of prediction accuracy. MSHK provide CompareCW_dat.csv and identify it as the output that forms Table 1.
As the R code shows, CompareCW_dat.csv consists of the random predictor objects (predictors.rf, predictors.fl, predictors.ch, and predictors.hs) and the variable warstds. The predictor objects are based on the imputed source file SambnisImp.csv, while warstds is based on the imputed source file AfricaImp.csv. If we now juxtapose CompareCW_dat.csv and Table 1, we can see that it is not possible to match the information provided in the output .csv with the information listed in Table 1, as Figure 1 shows.
CompareCW_dat.csv and Table 1 should show identical content. This is not the case. Table 1 consists of 19 rows, while CompareCW_dat.csv has 737. CompareCW_dat.csv does not have any identifiers that make the transition from this source file to the eventual Table 1 apparent and transparent. We do not know which predictor numbers correspond to which countries, as there is no information about the countries in the .csv file. Even if we assume that all instances where warstds == 1 sum up to the number of countries shown in Table 1, the numbers do not add up: There are 21 such instances in the .csv, but 19 in Table 1.
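What a traceable output file could look like is sketched below in base R. The data frame, its column names (country, year, pred_rf), and all values are invented purely for illustration. They are assumptions and do not reproduce CompareCW_dat.csv or Table 1.

```r
# Hypothetical output that carries identifiers next to each prediction.
# All names and values are invented for illustration only.
out <- data.frame(
  country = c("A", "B", "C", "D"),
  year    = c(2001, 2001, 2002, 2003),
  pred_rf = c(0.71, 0.05, 0.64, 0.12),  # made-up predicted probabilities
  warstds = c(1, 0, 1, 0)               # made-up observed onsets
)

# With identifiers, the rows behind a summary table can be selected explicitly.
table_rows <- out[out$warstds == 1, c("country", "year", "pred_rf")]

# A simple consistency check between the output file and the derived table.
n_onsets <- sum(out$warstds == 1)
stopifnot(nrow(table_rows) == n_onsets)
```

With country and year columns present, every row of a published table can be traced back to a specific row of the output file, and a mismatch in counts is detected immediately.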
MSHK create several models (logit, RandomForest), make predictions based on these models, and draw random samples from these predictions. The data used for all of this comes from SambnisImp.csv. MSHK then create confusion matrices and calculate AUC scores based on data from SambnisImp.csv and one external variable from AfricaImp.csv. Rephrased in more generic terms, MSHK conduct in-sample predictions and take an in-sample sample, and then compare this in-sample sample with true values from out-sample data. Out-sample data only enters the equation in the final comparison, after the predictions have already been made with in-sample data.
In addition, the provided CompareCW_dat.csv cannot be compared to Table 1 because of its lack of identifiers. It is not possible to examine and verify the origin of the numbers in Table 1, which functions as the main piece of evidence in the paper.