Re-Evaluating Machine Learning for MRP Given the Comparable Performance of (Deep) Hierarchical Models

Multilevel regression and post-stratification (MRP) is a popular use of hierarchical models in political science. Multiple papers have suggested that relying on machine learning methods can provide substantially better performance than traditional approaches that use hierarchical models. However, these comparisons are often unfair to traditional techniques as they omit possibly important interactions or nonlinear effects. I show that complex (“deep”) hierarchical models that include interactions can nearly match or outperform state-of-the-art machine learning methods. Combining multiple models into an ensemble can improve performance, although deep hierarchical models are themselves given considerable weight in these ensembles. The main limitation to using deep hierarchical models is speed. This paper derives new techniques to further accelerate estimation using variational approximations. I provide software that uses weakly informative priors and can estimate nonlinear effects using splines. This allows flexible and complex hierarchical models to be fit as quickly as many comparable machine learning techniques.

T he growing popularity of machine learning continues to revolutionize parts of political science by allowing easy estimation of flexible and powerful models.One increasingly popular application is using machine learning when performing "multilevel regression and post-stratification" (MRP) to extrapolate nationally representative surveys to smaller geographic units such as states (Lax and Phillips 2009;Park, Gelman, and Bafumi 2004).MRP is a two-step process that begins by fitting a predictive model to the survey using demographic and state-level information.Next, opinion estimates for the states are obtained by a weighted average of the predicted values for various demographic groups inside of that state using their known distribution.While the performance of MRP depends on both steps, multiple papers have found that using machine learning for the predictive model outperforms traditional methods ("multilevel regression") by considerable margins (e.g., Bisbee 2019; Broniecki, Leemann, and Wüest 2022;Goplerud et al. 2018;Ornstein 2020).A plausible justification is that the linear, additive, nature of traditional models is insufficient to capture the complex relationship between the covariates and the outcome.
The reliance on simple hierarchical models, however, unnecessarily limits their usefulness.Unlike some machine learning methods that can automatically estimate interactions (or nonlinear effects of continuous predictors), hierarchical models can only estimate interactions that the researcher has explicitly included.This is both a strength and a weakness.While hierarchical models are highly modular and allow the researcher to explicitly incorporate domain-specific knowledge as to important predictors or interactions, there is a risk of mis-specification-and thus worse performance-if important interactions are omitted.Thus, a "fair" test of MRP's performance must examine a model that explicitly includes many possibly relevant interactions.Ghitza and Gelman (2013) illustrate this by adding a broad set of interactions and uncover considerably more subtle results than traditional methods could identify.Following their usage, I refer to complex hierarchical models that explicitly include interactions or nonlinear effects as "deep MRP."1Thus, despite the understandable enthusiasm for applying machine learning to MRP, it is simply unknown in a systematic way whether machine learning outperforms deep MRP.The main reason for this gap in the literature is a practical one.Existing uses of deep MRP sometimes include nearly 20 random effects to capture the underlying heterogeneity and thus are usually very slow to estimate.Given that one might wish to fit these models repeatedly (e.g., comparing different specifications), this has quite reasonably caused researchers to "rule out" deep MRP.
Fortunately, recent work has shown that deep MRP can be estimated very quickly using variational inference while producing very similar point estimates to traditional methods (Goplerud 2022).However, that paper only tested those algorithms on the single dataset from Ghitza and Gelman (2013) and did not compare them against machine learning techniques.My initial systematic tests found that those algorithms performed unfavorably against machine learning.This paper provides two improvements to existing variational methods that result in competitive performance: First, Goplerud (2022) relied on an improperly calibrated prior that often resulted in too little regularization.Second, those algorithms cannot capture nonlinear effects of continuous covariates (e.g., presidential vote share).
Those concerns are addressed by, first, extending the variational algorithms to include a weakly informative prior (Huang and Wand 2013) that can more appropriately regularize random effects and, second, allowing the use of penalized splines for continuous predictors.After implementing a number of novel computational techniques to accelerate estimation, the accompanying open-source software can fit highly flexible deep MRP in minutes-rather than the hours possibly needed for traditional approaches.
This paper illustrates the importance of deep MRP by reanalyzing two papers that suggest that machine learning methods clearly outperform MRP (Bisbee 2019;Ornstein 2020).It demonstrates two important stylized facts: deep multilevel models (i) are given considerable weight in an ensemble of machine learning methods and (ii) are competitive with Bayesian additive regression trees (BART) in terms of performance.While recent work reports that BART performs noticeably better than traditional MRP (Bisbee 2019), I demonstrate that this is not the case.I show that, especially at moderate sample sizes, BART usually only slightly outperforms even traditional MRP.Thus, while machine learning methods that combine many methods together in an ensemble can improve performance, (deep) MRP should continue to be used as a highly competitive single method or in any ensemble approach.

FITTING DEEP MRP FAST
The key limitation in fitting MRP with interactions is the speed of estimation.Earlier research has shown that fitting a single deep MRP model can take multiple hours (e.g., Goplerud 2022).This is because of the presence of high-dimensional integrals that traditional methods either numerically approximate or address using Bayesian methods.
Variational inference provides a different approach for fast estimation; the goal is to find the best approximating distribution to the posterior given some simplifying assumption-usually that blocks of parameters are independent (Grimmer 2011).However, the accuracy of this approximation can depend heavily on the specific problem, and thus needs extensive testing to ensure its reliability.Goplerud (2022) derived a new general algorithm for binomial hierarchical models and conducted extensive explorations of its performance on the single dataset considered in Ghitza and Gelman (2013).Those algorithms fit an extremely complex hierarchical model in around 1 minute-versus hour(s) for existing approaches.It demonstrated excellent performance by recovering posterior means on coefficients and predictions that closely aligned with the gold standard approach of Bayesian estimation.2Appendix A of the Supplementary Material provides a full exposition of the variational algorithm.
To illustrate the extensions in this paper, I focus on a simplified MRP model: Equation 1 shows a hierarchical model without fixed effects and with a random intercept for state and a random intercept for race, where y i is the (binary) response for observation i.The notation follows Gelman and Hill (2006) where α state g½i selects the random effect for the state g of which observation i is a member.Appendix A of the Supplementary Material shows the generalization to random slopes, arbitrary numbers of random effects, and fixed effects.
The choice of prior on the variance of the random effect p 0 ðσ 2 j Þ is a difficult task.Some inferential techniques assume a flat prior.A risk of this strategy is that point estimates of σ 2 j could be degenerate and equal zero; this sets all random effects estimates equal to zero.This problem is rather common for the Laplace approximation in glmer (Chung et al. 2015).A proper prior prevents this problem and thus is preferable.An Inverse-Gamma prior is a popular choice, but it is difficult to calibrate the strength correctly (Gelman 2006).Figure 1 illustrates this by showing the prior used in Goplerud (2022) ("Inverse-Gamma") against the Huang and Wand (2013) prior employed in this paper.The latter prior implies the popular half-t prior on the standard deviation σ j (Gelman 2006), and it can be generalized to multidimensional random effects.In that case, it imposes half-t priors on each marginal standard deviation while maintaining (if desired) a uniform prior on the correlations.Appendix A.1 of the Supplementary Material provides more information on this prior such as its density.
Figure 1 shows that Goplerud's (2022) prior puts effectively no mass on small values of σ j (e.g., Pðσ j ≤ 0:25Þ ≈ 0:0003 ).Thus, in the event where the true value is small (i.e., the random effect is mostly irrelevant), the prior results in too large estimates of σ j , thereby underregularizing the coefficients, which likely results in poorer performance.By contrast, the Huang-Wand prior puts nontrivial weight on very small σ j and thus allows for strong regularization when appropriate.Appendix A.1 of the Supplementary Material provides a stylized example of this phenomenon.
Unfortunately, Appendix A.5 of the Supplementary Material illustrates that naively incorporating the Huang-Wand prior dramatically increases estimation time.While it does increase the time per iteration, the major problem is that estimation requires 5-10 times more iterations to converge.Thus, a key contribution of this paper is to accelerate variational algorithms when this more appropriate prior is employed.Appendices A.2 and A.3 of the Supplementary Material provide, respectively, full explanations of the techniques employed: (i) a squared iterative method and (ii) a novel application of parameter expansion.
The model in Equation 1 can be extended by adding many interactions between geographic and demographic factors ("deep MRP"; Ghitza and Gelman 2013).However, Broniecki, Leemann, and Wüest (2022) note that additional state-level predictors (e.g., unemployment rate) may also provide considerable benefits.Unlike hierarchical models, many machine learning methods can automatically estimate nonlinear effects or interactions between these continuous predictors, whereas they must be specified explicitly for MRP.
I address that scenario by allowing estimation of nonlinear effects using splines as in a generalized additive model.Appendix A.4 of the Supplementary Material demonstrates how splines can be represented as additional hierarchical terms and thus estimated using the same variational algorithms.
Appendix B of the Supplementary Material provides simulations to illustrate the importance of using hierarchical models that include interactions or nonlinear effects.It shows that ignoring important interactions or nonlinearities hurts the performance of hierarchical models vis-à-vis alternative models, especially as the sample size increases.However, after those terms are included, hierarchical models perform well even against machine learning methods.

COMPARING METHODS FOR FITTING MRP
To compare deep hierarchical models against machine learning systematically, I use Buttice and Highton's (2013) popular dataset for validating new methods for MRP (e.g., Bisbee 2019; Broniecki, Leemann, and Wüest 2022; Ornstein 2020).It consists of 89 policy questions that are collected from multiple years of the National Annenberg Election Studies (2000, 2004, and 2008) and the Cooperative Congressional Election Studies (2006 and2008).The benefit of these large samples is that it is possible to use the entire dataset to get a "ground truth" by taking the observed average in each state while drawing a smaller subsample (e.g., 1,500 respondents) to mimic the conditions under which a researcher would need to apply MRP to obtain reliable state-level estimates.
Existing comparisons, however, only rely on a simple hierarchical model outlined below (Equation 2), following the original specification in Buttice and Highton (2013).The model includes random effects for age, education (educ), gender-race combination (gXr), state, and region.The state-level continuous predictors pvote (state-level Republican presidential two-party vote share) and relig (share of population identifying as Evangelical Protestant or Mormon) are indexed with g½i as they are constant within a state.
for all j and g: (2) This model includes no interactions between variables or nonlinear effects on continuous predictors, V j

Density
Note: The dashed line shows the prior density on σ j given an Inverse-Gamma prior on σ 2 j with α 0 ¼ 1 and β 0 ¼ 1=2.The solid line shows a Huang-Wand prior on σ 2 j with ν ¼ 2 and A ¼ 5 (i.e., half-t on σ j ).
and thus is likely insufficiently rich to capture the true underlying relationship.It is reasonable to suspect that a "properly specified" MRP model should include at least some interactions to be competitive with methods that can automatically learn interactions or nonlinearities.I consider three expansions of this model's hierarchical component.First, I consider a deep MRP where all two-way interactions between demographics and geography are included (e.g., age-education and age-state), as well as a triple interaction between the three demographic variables.Second, I add splines to capture possible nonlinear effects in the state-level continuous variables.A third model includes both extensions.Table 1 summarizes the specifications; Appendix F of the Supplementary Material provides a demonstration of how to fit these models in the accompanying software (Goplerud 2023). 3t is important to stress that this paper tracks the existing analyses comparing machine learning and MRP as closely as possible.There are thus other specifications that likely improve upon Table 1, although I show that adding this set of interactions enables MRP to perform competitively against state-of-the-art machine learning techniques.

DEEP MRP IN AN ENSEMBLE
The first comparison explores whether deep MRP adds much benefit when used alongside a suite of machine learning methods.I begin by using a technique known as "stacking" that takes the predictions of many different methods and combines them into a single prediction known as an ensemble.Ornstein (2020) applied this method to MRP and found considerable gains over traditional methods.The method performs K-fold cross-validation to get an out-of-sample prediction for each observation in the survey using each constituent model.The out-of-sample predictions are combined to see which weighted average (convex combination) best predicts the outcome, and then these weights are used to combine predictions before post-stratification.It is often the case that the ensemble outperforms any single method (Broniecki, Leemann, and Wüest 2022;Ornstein 2020), although this is not guaranteed and can be empirically assessed by, for example, using a held-out dataset.
A useful property of ensembles is the ability to compare the weights given to the constituent models.The weights reflect both the performance of the method and its "distinctiveness" from the other methods in the ensemble.Using each survey in Buttice and Highton (2013), I drew 10 different samples of varying sizes and estimated an ensemble using fivefold cross-validation with the models in Ornstein (2020) where I swapped the traditional ("Simple") MRP model with the deep MRP model from Table 1. Figure 2 summarizes the weights given to each model, averaging across the surveys and simulations.
The results provide clear support for the importance of deep MRP in an ensemble: it is the highest weighted method when the sample size is over 1,500 and is given over 40% of the total weight when the sample size is 5,000 or higher.The performance of deep MRP is corroborated by the fact that, of the methods in the ensemble, it has the lowest cross-validated error on the survey data when N > 1, 500.In terms of computational time, fitting this deep model on the full survey takes around 30 seconds for the largest sample size of 10,000 observations.Thus, deep MRP can be added to an ensemble with limited cost.
Appendix D of the Supplementary Material explores the trade-off between model complexity and sample size.It examines a larger ensemble that includes all four models from Table 1. 4 While corroborating Figure 2-MRP models collectively receive around 40%-50% of the weight-it shows an expected trade-off between traditional and deep MRP where traditional (noninteractive) methods are given decreasing weight as the sample size increases.This suggests that the ensemble upweights more complex methods as the amount of Note: The second column indicates how these two variables are included.It is either "Linear" (Equation 2) or "Splines" where a spline is used to allow for nonlinear effects for each variable.The third column indicates how the random effects on age, education, gender Â race, state, and region are included."Additive" refers to five random effects added together (Equation 2)."Interacted" refers to including the interactions noted in the main text alongside the additive terms.
data increases.The spline-based methods receive relatively low weight, but this may be due to the limited variation in the continuous variables that are measured at the state level.5 Appendix D of the Supplementary Material also shows that, in terms of raw performance, a well-designed ensemble usually beats any single constituent method.The five-model ensemble beats all of its constituent methods by more than 5% on at least one sample size considered.

MRP AND BART
One limitation of ensembles is appropriately quantifying uncertainty of the post-stratified estimates.This is challenging because it may be difficult to quantify the uncertainty of the estimates from the individual machine learning methods used in the ensemble. 6It also requires careful work to interpret the effects of the included variables.Thus, researchers often seek to rely on a single model that can incorporate uncertainty and remains highly flexible.To that end, BART (Chipman, George, and McCulloch 2010) is an attractive choice.
The method is related to popular tree-based methods such as "random forests," but it is implemented in a Bayesian framework that allows for quantification of uncertainty.Bisbee (2019) applies BART to MRP and reports that it substantially outperforms (traditional) MRP.The magnitude of the improvement is large (e.g., around 20%-30% decrease in mean absolute error [MAE]).This motivates an initial question: does BART improve upon deep MRP?After some preliminary exploration, I discovered an error in Bisbee's (2019) replication archive.Appendix E of the Supplementary Material describes it in detail; see also a corrigendum to Bisbee (2019) (Goplerud and Bisbee 2022).In brief, the error arbitrarily injected random noise into the MRP estimates at the prediction stage.When this is corrected, traditional MRP's performance increases markedly and is only slightly beaten by BART.
Following the main analysis in Bisbee (2019), Figure 3 shows the predictive accuracy on the surveys in Buttice and Highton (2013) for a sample with 1,500 observations.7For simplicity, I show only three methods: the traditional ("Simple") MRP following Bisbee's (2019) provided code, a corrected traditional MRP, and deep MRP estimated using variational inference (see Table 1).
Fixing the error shows a noticeably different story; rather than being clearly beaten by BART, traditional MRP looks visually similar to BART in terms of its error across surveys.2020): LASSO, k-Nearest Neighbors (KNN), Random Forest (Forest), Gradient Boosting Machine (GBM).The final method is "Deep" MRP from Table 1.hundred simulations.A positive number indicates that BART outperforms the other method.This measure is relative as BART and MRP both decrease the observed MAE as sample size increases.
Table 2 shows that BART outperforms traditional MRP, but it does so by quite small margins (1%-4%) and its relative advantage declines as sample size increases.Deep MRP performs slightly better versus BART; for modest sample sizes (3,000-6,000), it actually slightly outperforms BART, although it does slightly worse at small and very large sample sizes.In terms of performance, across all sample sizes, deep MRP outperforms BART between 45% and 55% of the time and, thus, they can be considered to reach an effective "draw" in terms of performance.The table also suggests a smallbut-systematic improvement of deep MRP over traditional MRP.Comparing the traditional ("Simple") MRP against deep MRP shows that, except for the largest sample sizes, traditional MRP performs around 1%-3% worse than deep MRP in terms of MAE and is beaten around 60%-70% of the time.

CONCLUSION
This paper has shown that with recent advances in variational inference and novel technical extensions, it is possible to rapidly estimate deep MRP; Appendix C of the Supplementary Material shows that deep MRP can be often estimated much more quickly than machine learning methods that require tuning of external hyperparameters and as quickly as BART without tuning.It found that deep MRP is highly competitive in performance-effectively tying the state-of-the-art BART method.Compared with traditional MRP, adding  interactions provided a small, but systematic and nontrivial, gain in performance at most observed sample sizes.
The key implications of this paper for applied MRP research are twofold.First, the fast variational methods developed in this paper allow researchers to easily perform the well-established process of model comparison (e.g., by using cross-validation) when deciding which (complex) hierarchical model to use.Rather than selecting a single (traditional) specification for MRP, the results in this paper suggest that considering and comparing a variety of possible models-traditional MRP, deep MRP, or perhaps machine learning-can result in better performance for the post-stratified estimates.It is not possible to know a priori whether deep or traditional hierarchical models will perform better on a specific survey, but the methods in this paper allow for this to be easily tested rather than assumed.If quantification of uncertainty in the estimates is desired, one might employ a hybrid strategy where the variational methods are used for initial model comparison to winnow down the possible models before using a fully Bayesian approach for the final estimates.
Second, in terms of re-evaluating the role of machine learning for MRP, this paper suggests that the major benefit of machine learning comes from combining models in an ensemble.However, it also shows a clear important role for (deep) hierarchical models in those ensembles; these deep hierarchical models often have the strongest out-of-sample predictive accuracy on the survey itself and thus are given high weight in an ensemble.There is usually little downside to including additional methods in an ensemble-especially when the cost of estimation is rather low-and thus a wellspecified ensemble should probably include multiple versions of MRP (traditional and deep).As in the case of model comparison, this allows for the data to provide guidance in terms of which type of hierarchical model is most appropriate.

FIGURE 1 .
FIGURE 1. Comparing Prior Density for Random Effect Standard Deviation σ j

FIGURE 2 .
FIGURE 2. Weights Given to Models in Ensemble

TABLE 1 .
Deep MRP Specifications Table 2 provides a more concise quantitative summary.It shows the percentage gap in MAE versus BART averaged across the 89 surveys: MAE k −MAE BART ð Þ =MAE BART Â 100, where MAE k indicates the MAE of the model k averaged across two-

TABLE 2 .
Relative Mean Absolute Error versus BART