
Improving Supreme Court Forecasting Using Boosted Decision Trees

Published online by Cambridge University Press:  19 February 2019

Aaron Russell Kaufman*
PhD Candidate, Department of Government, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA.
Peter Kraft
PhD Candidate, Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA.
Maya Sen
Associate Professor, John F. Kennedy School of Government, Harvard University, 79 John F. Kennedy Street, Cambridge, MA 02138, USA.


Though used frequently in machine learning, boosted decision trees are largely unused in political science, despite many useful properties. We explain how to use one variant of boosted decision trees, AdaBoosted decision trees (ADTs), for social science predictions. We illustrate their use by examining a well-known political prediction problem, predicting U.S. Supreme Court rulings. We find that our ADT approach outperforms existing predictive models. We also provide two additional examples of the approach, one predicting the onset of civil wars and the other predicting county-level vote shares in U.S. presidential elections.

Copyright © The Author(s) 2019. Published by Cambridge University Press on behalf of the Society for Political Methodology. 

1 Introduction

What predicts U.S. Supreme Court rulings? What predicts whether a country will suffer a civil war? How might we forecast U.S. presidential election outcomes at the local level? These are important questions. For example, dozens of papers and hundreds of journalists have sought to predict Supreme Court rulings (e.g., Ruger et al. 2004; Epstein, Landes and Posner 2010; Black et al. 2011), which are delivered only after months of closed-door deliberation but nonetheless involve key issues in American politics, including civil rights, voting rights, presidential powers, and national security. In the ten months that the Supreme Court was privately deliberating a prominent same-sex marriage case, for example, thousands of couples married without assurances that the federal government would recognize their marriages. [Footnote 1]

In this paper, we introduce one tool that, though underused in political science, offers attractive properties for social science prediction problems: AdaBoosted decision trees (ADTs). ADTs capture gains in prediction when there are many variables, most of which add only limited predictive value. We illustrate their utility by predicting Supreme Court rulings using a novel dataset that includes case-level information alongside textual data from oral arguments. Using this approach, we predict more than 74% of all case outcomes accurately, with even higher accuracy among politically important cases. Substantively, we are able to accurately predict approximately seven more cases per year (out of around 80) compared to the baseline of predicting that the petitioner will always win, which yields 68% accuracy. To illustrate the broad applicability of ADTs, we provide two additional examples: (1) predicting whether civil war occurs in a country in a given year (which we predict with 99.0% accuracy) and (2) predicting county-level U.S. presidential election outcomes (which we predict with 96.7% accuracy, using the 2016 election as our example).

2 AdaBoosted Decision Trees and their Applicability to Social Science Questions

With exceptions (e.g., Green and Kern 2012; Montgomery and Olivella 2016; Muchlinski et al. 2016; Bansak et al. 2018), tree-based models are rarely used in political science, which tends to focus on substantive and/or causal interpretation of covariates. [Footnote 2] Tree-based models, which are designed to incorporate flexible functional forms, avoid parametric assumptions, perform vigorous variable selection, and prevent overfitting, are common, however, in machine learning. These approaches are well suited for identifying variables important for forecasting, which could include variables that are not causal in nature per se but that are nonetheless predictive, and for analyses involving large numbers of variables of potential (but uncertain) substantive importance.

The simplest tree-based models partition a dataset into “leaves” according to covariates and predict the value of each leaf. For example, a decision tree predicting Supreme Court rulings might start by splitting cases by whether the government is the respondent. If so, the algorithm may predict that the government wins. If not, the algorithm may examine the provenance of the case and, if there is a circuit split, predict that the petitioner wins. If it is not a circuit split, then it may examine whether Anthony Kennedy spoke frequently at oral arguments. If he did, the algorithm may predict that the respondent wins.
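The example tree in the preceding paragraph can be written out as nested rules. The following is a minimal sketch only: the feature names are hypothetical, and the prediction for the final leaf (Kennedy did not speak frequently) is not given in the text, so the fallback to the petitioner is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical feature names, chosen only for this illustration.
    government_is_respondent: bool
    circuit_split: bool
    kennedy_spoke_frequently: bool


def predict_winner(case: Case) -> str:
    """Walk the example tree from the text and return the predicted winner."""
    if case.government_is_respondent:
        return "respondent"  # the government, as respondent, is predicted to win
    if case.circuit_split:
        return "petitioner"
    if case.kennedy_spoke_frequently:
        return "respondent"
    # The text does not say what this last leaf predicts; defaulting to the
    # petitioner is an assumption made for this sketch.
    return "petitioner"
```

Each path from the first question to a returned label corresponds to one leaf. A fitted tree would learn both the questions and their order from data rather than having them written by hand.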

Our analysis relies on boosted decision trees, which are discussed in Montgomery and Olivella (2016) and are newer to political science. (For an application of boosted regression trees to refugee allocation, see Bansak et al. 2018.) Boosting creates trees sequentially, and as Montgomery and Olivella (2016, p. 11) explain, each new tree then “improves upon the predictive power of the existing ensemble.” The base classifier relies on “weak learners,” decision rubrics that perform only slightly better than chance.

We use one of the most widely used boosting algorithms, AdaBoost. (See Appendix G, Section 8.4 for a discussion of other boosting approaches and why we chose AdaBoost.) AdaBoost initializes by giving each observation equal weight. In the second iteration, AdaBoost assigns more weight to those units that were incorrectly classified in the first iteration. Focusing on those units that are hard to classify makes this approach well suited to social science problems, many of which involve heterogeneity and outliers. [Footnote 3]
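The reweighting step can be illustrated with the textbook update for discrete, two-class AdaBoost. This is a sketch under that assumption, not the paper's code: the function name `reweight` and the toy labels are hypothetical, and labels are coded as -1 and +1.

```python
import math


def reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost reweighting for labels in {-1, +1}."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must beat chance and make some error")
    # Vote (alpha) given to this learner in the final ensemble.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Up-weight mistakes, down-weight correct predictions, then renormalize.
    raw = [w * math.exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [r / total for r in raw], alpha


n = 5
weights = [1.0 / n] * n          # first iteration: equal weights
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]    # the last unit is misclassified
new_weights, alpha = reweight(weights, y_true, y_pred)
```

After this update the misclassified units together carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.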

2.1 Pros and Cons of ADTs

ADTs’ properties make them attractive for social science research. First, they have desirable asymptotics in improving predictive accuracy, especially when there are many features that each contribute only a small predictive gain. In predicting Court outcomes, although baseline accuracy is high, the predictive capacity of any one variable is small, leaving little room for improvement. This is common in the social sciences. Predicting the advent of civil wars has high baseline accuracy since there are few wars, but each additional predictor adds little information (Ward, Greenhill and Bakke 2010). Changes in which party controls the U.S. Presidency are often summarized by the “bread and peace” model: the incumbent party wins when the economy is growing, except during unpopular wars (Hibbs 2000). This produces high baseline accuracy, with other variables adding little (Gelman and King 1993). Second, AdaBoost provides a useful theoretical guarantee: for any given iteration, as long as that model’s predictions are consistently better than random chance, the overall model’s training error is guaranteed to decrease (Mukherjee, Rudin and Schapire 2011). [Footnote 4] Lastly, AdaBoost is agnostic to predictor or outcome data types, be they binary, continuous, or categorical (Elith, Leathwick and Hastie 2008), simplifying its implementation in dealing with mixed datasets of many predictors.
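The training-error guarantee can be made concrete with the standard bound for two-class AdaBoost. The statement below is the textbook form of that bound rather than a result quoted from this paper, and the symbols are introduced here for illustration: $\epsilon_t$ is the weighted error of the weak learner at iteration $t$, $\gamma_t = 1/2 - \epsilon_t$ is its edge over chance, and $H$ is the combined classifier after $T$ iterations on $n$ training units.

```latex
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[ H(x_i) \neq y_i \right]
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^{2}}
  \;\le\; \exp\!\left( -2 \sum_{t=1}^{T} \gamma_t^{2} \right)
```

Every iteration with $\gamma_t > 0$ multiplies the bound by a factor strictly below one, which is the sense in which weak learners that are consistently better than chance drive training error down. The bound says nothing about test error.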

We also note drawbacks. First, ADTs sacrifice some interpretability of estimates for flexibility of functional form. By avoiding assumptions about the relationship between Court rulings and covariates, for example, ADTs provide more robust predictive capacity. However, they preclude discussions of statistical significance or effect sizes; rather than interpreting coefficients on covariates, ADTs rely on “feature importance.” (Appendix C discusses how feature importance could nonetheless provide substantively important information that models like OLS miss.) Second, ADTs are computationally expensive without being parallelizable. Third, ADTs have many tuning parameters inherited from decision trees, and a few added from AdaBoost. Fourth, ADTs tend to overfit easily, especially compared to random forests (Elith et al. 2008). This can be controlled by limiting the learning rate (see Appendix G) at the cost of computation time. Lastly, there exist important problems for which AdaBoost fails. With insufficient sample sizes, primarily unpredictive covariates, or unsuitable base models, AdaBoost will show no improvement over more naive methods. Despite this, AdaBoost has been shown to work well in a wide variety of experimental settings among benchmark problems in computer science (Freund and Schapire 1996).

3 Application of AdaBoosting to the Supreme Court

We illustrate ADTs by predicting rulings by the U.S. Supreme Court. Because the Court decides cases of magnitude (including cases on presidential power, states’ rights, and national security), even small predictive gains translate into significant policy importance. The simplest predictive algorithm for Court rulings is that the petitioner (party appealing the case) wins roughly two thirds of the time (Epstein et al. 2010). [Footnote 5] In practice, guessing that the petitioner wins every time predicts 67.98% of cases since 2000 accurately (Appendix A1), though several studies have surpassed this baseline (Martin et al. 2004; Katz, Bommarito and Blackman 2014; Katz, Bommarito II and Blackman 2017). In this paper, we compare our approach to two prominent Court forecasting models, $\{\text{Marshall}\}+$ and CourtCast. [Footnote 6]

We implement ADTs using the scikit-learn Python library. [Footnote 7] We train our model (and comparison models) using two data sources from 2005 to 2015. First, we use case-level covariates from the Supreme Court Database (Spaeth et al. 2015). These include the procedural posture of the case, the issues involved, the parties’ identities, and other case-level factors, detailed in Appendix C. [Footnote 8] Second, we incorporate statements made by the Justices during oral arguments. Scholarship suggests that Justices use oral arguments to gather information and stake out positions (Johnson, Wahlbeck and Spriggs 2006). We draw on textual data from the Court’s oral argument transcripts provided by the Oyez Project (Goldman 2002), which we operationalize into 55 variables, detailed in Appendix C. Finally, we optimize our model’s tuning parameters using grid search (see Appendix G).
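For readers unfamiliar with the workflow, here is a minimal sketch of fitting AdaBoosted trees with a grid search in scikit-learn. It is not the authors' replication code: the data are synthetic, and the parameter grid and fold count are hypothetical choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: many features, only a few of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Hypothetical grid, kept small for speed; not the grid used in the paper.
param_grid = {
    "n_estimators": [25, 50],
    "learning_rate": [0.1, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),  # default weak learner: depth-1 tree
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_        # best combination found on the grid
best_cv_accuracy = search.best_score_    # its mean cross-validated accuracy
```

Lower values of `learning_rate` shrink each tree's contribution and usually need more trees to compensate, so the two parameters are searched jointly.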

4 Results and Comparisons to Other Approaches

In Figure 1 below, we compare predictions based on (1) our model (referred to as “KKS”) to (2) the “petitioner always wins” baseline rule, (3) CourtCast, (4) $\{\text{Marshall}\}+$, and (5) a generic random forest distinct from Katz et al. (2017). We evaluate all models using 10-fold cross-validation (Efron and Tibshirani 1997), which captures a model’s ability to predict withheld samples of the observed data (see Appendix D).
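To show what k-fold cross-validation measures for a rule as simple as the baseline, the sketch below evaluates an "always predict the majority class" rule using only the standard library. The outcome vector is toy data and the helper names are hypothetical. Accuracy is pooled over all held-out units rather than averaged fold by fold.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy_majority(labels, k=10, seed=0):
    """Pooled k-fold accuracy of a rule that predicts the training majority."""
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        # "Fit": pick whichever class is more common in the training folds.
        majority = max(set(train), key=train.count)
        # "Predict": score that single label against the held-out fold.
        correct += sum(labels[i] == majority for i in held_out)
    return correct / len(labels)


# Toy outcomes (1 = petitioner wins), with about two thirds petitioner wins.
outcomes = [1] * 68 + [0] * 32
baseline_accuracy = cv_accuracy_majority(outcomes, k=10)
```

Any model can be dropped in place of the majority rule, as long as it is fit only on the training folds and scored only on the held-out fold.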

Figure 1. Cross-Validation Accuracy for KKS compared to the “petitioner always wins” baseline, CourtCast, $\{\text{Marshall}\}+$ , and a generic random forest. We compare these across three datasets: Supreme Court Database (“SCDB”), oral argument data, and both datasets jointly. For $\{\text{Marshall}\}+$ and CourtCast, black dots indicate the original dataset on which those models were trained.

In Table 1, we present each model’s accuracy as reported by its authors in the original papers. For $\{\text{Marshall}\}+$, the original self-reported accuracy is much higher than we achieve (Figure 1), since it includes covariates we purposely excluded. [Footnote 9] For CourtCast, self-reported accuracy is lower than we achieve: the original CourtCast model uses fewer training years and less accurate data than in our replications and measures accuracy using a single train-test split rather than 10-fold cross-validation.

Table 1. Accuracy (self-reported) for (1) the “petitioner always wins” baseline, (2) the random forest of Katz et al. (2017), (3) $\{\text{Marshall}\}+$, (4) CourtCast, and (5) KKS. “Data” indicates the training dataset: case-level covariates from the Supreme Court Database (“SCDB”), transcript data from the oral arguments (“oral argument”), or both. The KKS model using all covariates almost triples the added accuracy of the next best model.

Figure 1 indicates for each model the dataset (Supreme Court Database, oral arguments data, or both), cross-validation accuracy, and comparison to baseline accuracy. We generate these accuracy statistics by training the respective models on data from 2005 to 2015. We find that all models perform best using the joint dataset; all perform second best with the oral argument dataset. The KKS model using only case covariates performs less well, achieving an accuracy more than 7 points below baseline. Using oral argument data, however, it exceeds baseline by about 4.5 points with an accuracy of 72.5%. With joint data, it achieves an accuracy of 74.04%. Its added accuracy of 6.06 points over baseline is almost triple the added accuracy originally reported by Katz et al. (2017). Substantively, this means our model correctly predicts about seven more cases (out of 80) per term than baseline, a meaningful improvement.

Interestingly, no model using only case covariates surpasses baseline accuracy; it is unsurprising that oral argument data, collected much closer to the decision, are more predictive than case covariates determined years prior to a ruling. We also note that by introducing the joint dataset to $\{\text{Marshall}\}+$ and CourtCast, both outperform their originally reported results, though neither performs as well as KKS on either the oral argument or the joint datasets.

5 Predictive Accuracy Conditional on Covariates

Our model enjoys an overall gain of approximately six percentage points over baseline, but this often increases when we examine subsets of cases. Close 5–4 decisions go to the petitioner 61% of the time on average, and our accuracy for 5–4 cases is 66%, five points above that baseline. We correctly predict 73% of 6–3 cases, 76% of 7–2 cases, 82% of 8–1 cases, and 77% of 9–0 cases; our model provides the biggest accuracy boost, 13 points, for 6–3 decisions.
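Subgroup comparisons of this kind reduce to computing two accuracies within each group: the model's, and that of the "petitioner always wins" rule. The sketch below shows the bookkeeping on hypothetical toy records, which are not the paper's data and do not reproduce its numbers.

```python
from collections import defaultdict

# Hypothetical toy records:
# (vote margin, petitioner won?, model predicted petitioner?)
records = [
    ("5-4", True, True), ("5-4", False, True), ("5-4", False, False),
    ("5-4", True, True),
    ("9-0", True, True), ("9-0", True, True), ("9-0", False, True),
    ("9-0", True, False),
]


def accuracy_by_group(rows):
    """Return {group: (model accuracy, always-petitioner accuracy)}."""
    counts = defaultdict(lambda: [0, 0, 0])  # [n, model correct, petitioner wins]
    for group, petitioner_won, predicted_petitioner in rows:
        c = counts[group]
        c[0] += 1
        c[1] += int(predicted_petitioner == petitioner_won)
        c[2] += int(petitioner_won)
    return {g: (c[1] / c[0], c[2] / c[0]) for g, c in counts.items()}


for group, (model_acc, base_acc) in accuracy_by_group(records).items():
    gain = 100 * (model_acc - base_acc)
    print(f"{group}: model {model_acc:.2f}, baseline {base_acc:.2f}, "
          f"gain {gain:+.0f} points")
```

Because the baseline itself varies by subgroup (61% for 5–4 cases in the text, for instance), the gain has to be computed against each subgroup's own petitioner win rate, not the overall one.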

Our model also outperforms the baseline in cases related to judicial power (nine points) and federalism (16 points) and where a state or federal government is a party (nine points). We see weaker gains in criminal procedure, civil rights, and First Amendment cases (Table 2). Our model outperforms $\{\text{Marshall}\}+$ and CourtCast in all subgroups except two: CourtCast performs one point better in unanimous cases and two points better in economic activity cases. However, both previous models often fail to exceed the baseline: $\{\text{Marshall}\}+$ in eight subgroups and CourtCast in two.

Table 2. KKS model accuracy by decision margin.

5.1 Additional Applications: County-Level U.S. Presidential Vote Share and Civil Wars

ADTs are promising for other political science applications and may outperform even other tree-based methods. To demonstrate, we examine two applications. First, we look at U.S. presidential elections. For this, we analyze data from the 2010 U.S. Census that includes county-level age, income, education, and gender. The outcome variable is whether the Democratic Party’s two-party county-level vote share in the 2016 presidential election is greater than 50%. The baseline is calculated by predicting that the Republican Party’s two-party county-level vote share is greater than 50%. To assess accuracy, we use 10-fold cross-validation for the proportion of counties correctly predicted.
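The binary outcome and its baseline can be written down directly. The sketch below uses hypothetical county names and vote counts; only the definition (Democratic two-party share above 50%) and the baseline rule (predict a Republican majority everywhere) come from the text.

```python
def democratic_majority(dem_votes: int, rep_votes: int) -> bool:
    """True if the Democratic two-party vote share exceeds 50%."""
    two_party_total = dem_votes + rep_votes  # third-party votes are excluded
    if two_party_total == 0:
        raise ValueError("no two-party votes recorded")
    return dem_votes / two_party_total > 0.5


# Hypothetical county vote counts, for illustration only.
counties = {
    "County A": (12_000, 9_000),
    "County B": (4_000, 11_000),
    "County C": (5_000, 5_000),
}
labels = {name: democratic_majority(d, r) for name, (d, r) in counties.items()}

# Baseline from the text: predict a Republican majority (label False) everywhere.
baseline_accuracy = sum(not won for won in labels.values()) / len(labels)
```

An exact tie is labeled False here because the definition requires a share strictly greater than 50%.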

Second, we look at civil war incidence, examining a dataset indicating which country-years were engaged in civil wars, alongside country-level covariates derived from Collier and Hoeffler (2002) and Fearon and Laitin (2003), including population, GDP, Polity score, ethnolinguistic fractionalization, and oil reserves. The baseline accuracy is 86.1%, achieved by predicting “no civil war” in all cases. To assess accuracy, we use 10-fold cross-validation on the proportion of country-years correctly predicted as having a civil war or not.

Table 3. ADTs outperform other methods in predicting both county-level vote share in the 2016 U.S. Presidential Election and civil war incidence.

Table 3 presents these results. ADTs outperform competing linear, nonlinear, and tree-based methods. These improvements, even when small, are substantively meaningful. As the example of 2016 shows, presidential elections are consequential and hard to predict. In our dataset of 3,082 counties, being able to predict the likely vote of 308 more counties than baseline (and 31 more counties than the next best model) may impact how campaigns distribute resources. Predicting civil wars is likewise hugely important; accurately forecasting them holds great promise for allocating scarce peacekeeping resources. Across 6,610 country-years since 1945, our model correctly predicts 853 more cases than baseline (and seven more cases than the next best model), corresponding to 11.8 additional countries each year; it also predicts around 20 more cases than a logistic regression (0.36 more per year). Both are substantively meaningful differences that would be useful for policy experts and analysts.
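The counts above can be converted into percentage points to check them against the accuracies quoted elsewhere in the paper. This is a small arithmetic sketch: the helper name is hypothetical, and the county baseline is not stated in the text, so the value computed for it is only implied by the other figures.

```python
def extra_points(extra_correct: int, n_units: int) -> float:
    """Percentage-point accuracy gain implied by extra correct predictions."""
    return 100.0 * extra_correct / n_units


# Counts quoted in the text.
civil_war_gain = extra_points(853, 6610)   # country-years
county_gain = extra_points(308, 3082)      # counties

# Civil war: stated baseline (86.1%) plus the gain.
civil_war_accuracy = 86.1 + civil_war_gain

# Counties: the Introduction gives the accuracy (96.7%) but not the baseline,
# so the baseline below is only implied by the arithmetic.
implied_county_baseline = 96.7 - county_gain

print(round(civil_war_gain, 1), round(civil_war_accuracy, 1))
print(round(county_gain, 1), round(implied_county_baseline, 1))
```

The civil war figures are mutually consistent: 853 of 6,610 country-years is about 12.9 points, which added to the 86.1% baseline gives the 99.0% accuracy reported in the Introduction.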

6 Discussion and Conclusion

Our contributions are twofold. First, we provide an overview of ADTs, a technique frequently used in machine learning but one more novel within the social sciences. The approach is promising for many social science questions owing to its robustness to small sample sizes and its treatment of weakly predictive (though not unpredictive) covariates. As our examples show, this approach performs favorably compared to other commonly used methods across several applications. We include technical overviews and best practices guides in the Appendix.

Second, we contribute to a growing literature on Supreme Court prediction. The Court is the most reclusive branch of the U.S. government, yet it rules on some of the most important and contentious policy issues of the day. Increasing the predictive accuracy of forecasting models not only improves our understanding of how this important branch of government operates, but also, we believe, allows researchers to more credibly assess which way these influential rulings may go.


Many thanks to Matthew Blackwell, Peter Dilworth, Finale Doshi-Velez, Phillipa Gill, Gary King, Brian Libgober, Chris Lucas, Luke Miratrix, Kevin Quinn, Jeff Segal, Robert Ward, and participants at the Computational Social Science Institute Seminar at University of Massachusetts, Amherst for helpful conversations and valuable feedback. We also thank Josh Blackman, Michael Bommarito, and Dan Katz for comments during early stages of this project.

Supplementary material

For supplementary material accompanying this paper, please visit


Authors’ note: Replication materials available at the Political Analysis Dataverse: (Kaufman, Kraft and Sen 2018)

Contributing Editor: Jeff Gill

1 See Appendix A0 for discussion of the substantive importance of Supreme Court prediction.

2 See Appendix E for discussion of why machine learning may be underused in political science.

3 For a more technical walk-through of the AdaBoosting algorithm, see Appendix G.

4 Train error refers to in-sample model fit, while test error refers to out-of-sample predictive accuracy. Here we measure predictive accuracy using exponential loss. This property of AdaBoost ensures that there are no local optima. Because the guarantee concerns training error only, it does not rule out overfitting.

5 A favorable ruling is at least a 5–4 majority. Note that we examine Court outcomes as opposed to the votes of individual Justices, in line with most papers in the literature.

6 Source code for CourtCast is at See Appendix H.

7 Complete replication materials are available on the Political Analysis Dataverse (Kaufman et al. 2018).

8 Some of these variables are subjectively coded after the ruling is issued (for example, issue area). We see no way in which the coding would change pre- and postdecision. Appendix C provides further detail.

9 Specifically, the original $\{\text{Marshall}\}+$ analysis includes covariates gathered after oral argument, such as the month of the ruling. When we include all original $\{\text{Marshall}\}+$ covariates, we achieve a replicated accuracy that is comparable to their original results.

10 We draw these results from the paper, as we do not have access to replication materials for this model.


Bansak, K., Ferwerda, J., Hainmueller, J., Dillon, A., Hangartner, D., Lawrence, D., and Weinstein, J. 2018. “Improving Refugee Integration through Data-Driven Algorithmic Assignment.” Science 359(6373):325–329.
Black, R. C., Treul, S. A., Johnson, T. R., and Goldman, J. 2011. “Emotions, Oral Arguments, and Supreme Court Decision Making.” The Journal of Politics 73(2):572–581.
Collier, P., and Hoeffler, A. 2002. “On the Incidence of Civil War in Africa.” Journal of Conflict Resolution 46(1):13–28.
Efron, B., and Tibshirani, R. 1997. “Improvements on Cross-Validation: The .632+ Bootstrap Method.” Journal of the American Statistical Association 92(438):548–560.
Elith, J., Leathwick, J. R., and Hastie, T. 2008. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77(4):802–813.
Epstein, L., Landes, W. M., and Posner, R. A. 2010. “Inferring the Winning Party in the Supreme Court from the Pattern of Questioning at Oral Argument.” The Journal of Legal Studies 39(2):433–467.
Fearon, J. D., and Laitin, D. D. 2003. “Ethnicity, Insurgency, and Civil War.” American Political Science Review 97(1):75–90.
Freund, Y., and Schapire, R. E. 1996. “Experiments with a New Boosting Algorithm.” In Proceedings of the Thirteenth International Conference on Machine Learning, vol. 96, 148–156. San Francisco, CA: Morgan Kaufmann Publishers.
Gelman, A., and King, G. 1993. “Why are American Presidential Election Campaign Polls so Variable when Votes are so Predictable?” British Journal of Political Science 23(4):409–451.
Goldman, J. 2002. The OYEZ Project [On-line].
Green, D. P., and Kern, H. L. 2012. “Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees.” Public Opinion Quarterly 76(3):491–511.
Hibbs, D. A. Jr. 2000. “Bread and Peace Voting in US Presidential Elections.” Public Choice 104(1–2):149–180.
Johnson, T. R., Wahlbeck, P. J., and Spriggs, J. F. 2006. “The Influence of Oral Arguments on the US Supreme Court.” American Political Science Review 100(1):99–113.
Katz, D. M., Bommarito, M. J., and Blackman, J. 2014. “Predicting the Behavior of the Supreme Court of the United States: A General Approach.” Available at SSRN.
Katz, D. M., Bommarito, M. J. II, and Blackman, J. 2017. “A General Approach for Predicting the Behavior of the Supreme Court of the United States.” PloS one 12(4): e0174698.
Kaufman, A., Kraft, P., and Sen, M. 2018. “Replication Data for: Improving Supreme Court Forecasting Using Boosted Decision Trees.” Harvard Dataverse, V1.
Martin, A. D., Quinn, K. M., Ruger, T. W., and Kim, P. T. 2004. “Competing Approaches to Predicting Supreme Court Decision Making.” Perspectives on Politics 2(4):761–767.
Montgomery, J. M., and Olivella, S. 2016. “Tree-Based Models for Political Science Data.” American Journal of Political Science 62(3):729–744.
Muchlinski, D., Siroky, D., He, J., and Kocher, M. 2016. “Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data.” Political Analysis 24(1):87–103.
Mukherjee, I., Rudin, C., and Schapire, R. E. 2011. “The Rate of Convergence of Adaboost.” In Proceedings of the 24th Annual Conference on Learning Theory, 537–558. Association for Computational Learning.
Ruger, T. W., Kim, P. T., Martin, A. D., and Quinn, K. M. 2004. “The Supreme Court Forecasting Project: Legal and Political Science Approaches to Predicting Supreme Court Decisionmaking.” Columbia Law Review 104:1150–1210.
Spaeth, H. J., Epstein, L., Martin, A. D., Segal, J. A., Ruger, T. J., and Benesh, S. C. 2015. The Supreme Court Database. Center for Empirical Research in the Law at Washington University.
Ward, M. D., Greenhill, B. D., and Bakke, K. M. 2010. “The Perils of Policy by p-value: Predicting Civil Conflicts.” Journal of Peace Research 47(4):363–375.