Understanding, Choosing, and Unifying Multilevel and Fixed Effect Approaches

Abstract When working with grouped data, investigators may choose between "fixed effects" models (FE) with specialized (e.g., cluster-robust) standard errors, or "multilevel models" (MLMs) employing "random effects." We review the claims given in published works regarding this choice, then clarify how these approaches work and compare by showing that: (i) random effects employed in MLMs are simply "regularized" fixed effects; (ii) unmodified MLMs are consequently susceptible to bias, but there is a longstanding remedy; and (iii) the "default" MLM standard errors rely on narrow assumptions that can lead to undercoverage in many settings. Our review of over 100 papers using MLM in political science, education, and sociology shows that these "known" concerns have been widely ignored in practice. We describe how to debias MLM's coefficient estimates, and provide an option to more flexibly estimate their standard errors. Most illuminating of all, once MLMs are adjusted in these two ways, the point estimate and standard error for the target coefficient are exactly equal to those of the analogous FE model with cluster-robust standard errors. For investigators working with observational data who are interested only in inference on the target coefficient, either approach is equally appropriate and preferable to uncorrected MLM.


Introduction
Researchers in many applied fields encounter data structures with observations that are grouped or clustered in one or more ways, for example students nested in classrooms and/or schools, and perhaps measured repeatedly over time. Such observations are referred to variably as grouped, clustered, hierarchical, multilevel, or repeated measures, and include panel, longitudinal, or time-series cross-sectional data. In one very common context, and the primary example considered here, investigators hope to estimate the effect of some "treatment," or target covariate, that varies at a lower level while accounting for confounding that is hoped to be fixed at the higher level. For example, in country-year data, some countries may adopt a treatment in some years, and we seek to account at least for country- (group-) level confounders, in the hopes that remaining, time-varying confounding is absent or less problematic.
Multilevel data structures, however, pose complications for estimation and inference by violating the independence of observations assumed under classical inference. Although a longstanding and ubiquitous issue, methodological practices for dealing with this non-independence have not converged across disciplinary traditions. One tradition, often referred to as the "fixed effects" approach (FE), advises investigators to account for group-level confounders by introducing group-level, freely varying, intercepts to their models. The nonindependence of observations then complicates only variance estimation, so investigators are instructed to choose a variance estimator that accommodates the forms of intragroup dependency assumed to exist. One popular choice, closely examined here, is the "cluster-robust standard error" (White 1984).
A different tradition holds that multilevel data must be analyzed with "multilevel models" (MLMs). These models look similar to FE models, but include terms that are estimated as "random effects," meaning that the coefficients are assumed to have a distribution rather than to be fixed in truth. In a review of 10 textbooks and 4 well-cited pedagogical articles, 1 as well as 109 empirical articles employing MLMs in top education, political science, and sociology journals, 2 we find four common reasons given to employ MLM rather than FE: (1) MLMs correctly estimate standard errors in grouped data; (2) MLMs estimate coefficients more efficiently and produce more accurate predictions than do the analogous FE models; (3) MLMs allow intercepts and slopes to vary by group, like analogous FE models; but (4) unlike those FE models, MLMs can estimate coefficients for group-level variables (and their interactions with lower-level variables) while allowing varying intercepts (and slopes). By contrast, particularly in fields closely aligned with econometrics, both long-standing (e.g., Hausman 1978) and more recent work (e.g., Clark and Linzer 2015) emphasize bias concerns with MLMs. In the common setting where one particular variable is the treatment, MLM estimates can have lower variance than do FE estimates for the coefficient of interest, but are biased when group-level confounding is present, compelling users to employ FE. 3 The choice between these approaches remains a matter of debate and disciplinary tradition, and is sometimes justified based on erroneous claims as to these models' properties. In this paper, we seek to demystify their differences, the problems associated with them, and ultimately important equivalences between them.
First, to illuminate a key difference between the approaches, we show that random effect estimates in MLMs are precisely equivalent to FE estimates that have been shrunken through a regularization process, that is, by penalizing larger coefficients. This explains the principal concern with random effects: bias that emerges because these variables are not "allowed" to adjust for confounding as intended. Because regularization reduces over-fitting, this also demystifies why MLM's out-of-sample outcome predictions are typically more accurate. Second, this bias can be eliminated in many cases by a long-standing adjustment from Mundlak (1978). For models with group-level intercepts added as random effects ("random intercept" models), which we focus on here, adding the group-level averages of all included variables as regressors (or a variety of equivalent procedures, such as group-wise centering) relieves the bias otherwise induced by this regularization. We also present a more general solution for more complex models. Finally, we consider variance estimation. Contrary to claims we document, MLMs do not automatically ensure appropriate standard error estimates for grouped data. Rather, the standard MLM approach relies on strict assumptions and can have poor coverage in even simple circumstances.
We are not the first to draw many of these conclusions. However, our review of empirical papers employing MLMs shows widespread failure to either appreciate the concerns they raise or employ suggested solutions. Bringing these concerns together, we describe a "bias-corrected MLM" that employs cluster-robust standard errors, which make no assumption about within-group dependency and assume no between-group dependence. Remarkably, once these adjustments are made, the resulting MLM produces coefficient and standard error estimates identical to those of the analogous FE model with cluster-robust standard errors. We regard this as the most important conclusion, since for investigators interested only in the target coefficient and its standard error, it resolves any debate as to which approach is more appropriate. In most cases, we thus recommend either of these unbiased approaches over uncorrected MLM, due to the dangers of heightened bias and root mean square error when MLM's strict assumptions are violated. The remaining differences between these approaches are that MLM has better (out-of-sample) accuracy for the outcome predictions, and allows the user to estimate coefficients that would otherwise be dropped by FE (e.g., group-level covariates), although interpreting these coefficients can be problematic.

1 The textbooks are: Faraway (2016), Finch, Bolin, and Kelley (2016), Fitzmaurice, Laird, and Ware (2004), Greene (2003), Heck, Thomas, and Tabata (2013), Luke (2004), Snijders and Bosker (2011), Hox and Roberts (2011), Gelman and Hill (2006), and Hoffman (2015). The articles are: Snijders and Berkhof (2008), Raudenbush (2009), Gelman (2006), and Steenbergen and Jones (2002).

2 American Journal of Political Science (17 articles), the American Political Science Review (13), the Journal of Politics (20), the American Education Research Journal (28), Educational Evaluation and Policy Analysis (8), the American Journal of Sociology (13), and the American Sociological Review (10). We decided on this selection of journals after asking specialists in each field about the top journals that often publish papers employing MLM. To find the articles, we searched on "multilevel," "multi-level," "hierarchical," "random effect," "random effects," "random-effect," and "random-effects." Our political science and sociology reviews currently cover all articles dated January 2017 through December 2018, and our education review currently covers all articles dated January 2017 through April 2019.

3 When this bias is present, the root mean square error of MLM may be higher than that of FE even if MLM has lower variance, as we demonstrate. It is also possible to construct cases wherein a MLM and FE model have equal variance. One example is when treatment assignment is "perfectly blocked" within group, so that the estimated covariance of the treatment and block indicators is numerically zero in every sample. In this case, FE, MLM with random intercepts, and even OLS produce identical estimates and thus have identical efficiency. We thank Ian Lundberg for pointing this out.

Notation
We first set notation. To aid the reader, Appendix A.1 describes (i) symbols used in our notation (Table 1) and (ii) abbreviations we use relating to models (Table 2).
Let g = 1, . . ., G index the group. We index vectors belonging to group g with the subscript g; for example, the outcome for all units in group g is given by the vector Y_g. The ith unit in group g is then indexed by g[i]; for example, unit i in group g has outcome Y_{g[i]}. This notation reminds the reader that unit i is contained in group g. Each group g has size n_g, with N = Σ_{g=1}^{G} n_g. Let X_{g[i]} be the p-dimensional vector of covariates, including an intercept term, with an associated coefficient vector β. One element of X_{g[i]} in particular will be regarded as a treatment in the settings described here. X_g is the matrix of the X_{g[i]} for group g, and X is the matrix of the X_{g[i]} for the entire sample.
Z_{g[i]} is a d-dimensional vector of covariates, often containing a subset of the covariates in X_{g[i]}, and possibly an intercept, which will later function as an indicator of membership in group g. For each group g, the Z_{g[i]} have an associated coefficient vector γ_g. Z_g is the matrix of the Z_{g[i]} for group g, Z is a block-diagonal matrix of the Z_g, and γ stacks the set of γ_g into a matrix.
Y_{g[i]} is the outcome of interest, and ε_{g[i]} is its associated residual/error term. Y and ε are N × 1 vectors containing the Y_{g[i]} and ε_{g[i]} for the entire sample.

Fixed effect and multilevel models
Varying intercepts: Group fixed effects and random intercept models. We focus mainly on uses of FE and MLM that allow each group in the data a different intercept but no other group-varying coefficients. 4 Suppose data are generated according to the simple model, written in matrix form,

Y = X β + Z γ + ε,

where Z contains only indicators for membership to each group g; that is, each Z_g is a vector of ones, meaning that only intercepts may differ between groups. While useful to write in this form because it shows the role of Z, the model is often expressed as

Y_{g[i]} = X_{g[i]} β + γ_g + ε_{g[i]},    (2)

where the γ_g are group-specific deviations from the overall intercept in β. Although a model can be fitted regardless of the correlation between the residuals and X_{g[i]}, investigators are usually interested in understanding the effect of a treatment variable (in X_{g[i]}) on Y_{g[i]}, which requires the "conditional independence assumption," E(ε_{g[i]} | X, Z) = 0. This is an identification assumption relating to the form of confounding, although it is also subject to model specification. We describe the types of permitted confounding that are used to justify this assumption in Section 2.3.

The distinction between FE and MLM for the varying-intercepts model in Equation (2) relates to the assumptions they employ during estimation. Both FE and MLM regard β as "fixed," that is, nonrandom with no distributional assumption. However, FE and MLM diverge in their handling of the γ_g. In FE, the γ_g are regarded as fixed, like β. FE thus estimates β and γ by including indicator variables for group membership (in Z) as additional regressors in an OLS of Y on X, or by equivalent demeaning/partialing-out procedures. We refer to this as "group fixed effects" (Group-FE) here. In practice, one group indicator and its corresponding γ_g must be dropped, or the intercept term must be dropped from X_{g[i]}.
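The equivalence between the dummy-variable and demeaning implementations of Group-FE is easy to check numerically. Below is a minimal sketch on simulated grouped data; all variable names are illustrative and not taken from the paper:

```python
# Sketch: the Group-FE estimate of beta can be obtained either by adding one
# indicator column per group, or by group-wise demeaning of y and x; the two
# are numerically identical (a Frisch-Waugh-Lovell partialing-out result).
import numpy as np

rng = np.random.default_rng(0)
G, n_g = 10, 20
groups = np.repeat(np.arange(G), n_g)          # group label for each unit
gamma = rng.normal(0, 2, G)                    # group intercepts
x = rng.normal(size=G * n_g) + gamma[groups]   # treatment correlated with group
y = 1.0 + 0.5 * x + gamma[groups] + rng.normal(size=G * n_g)

# (1) OLS with one dummy column per group (overall intercept dropped)
D = (groups[:, None] == np.arange(G)).astype(float)
XZ = np.column_stack([x, D])
beta_dummies = np.linalg.lstsq(XZ, y, rcond=None)[0][0]

# (2) OLS after subtracting group means from x and y
x_dm = x - np.array([x[groups == g].mean() for g in range(G)])[groups]
y_dm = y - np.array([y[groups == g].mean() for g in range(G)])[groups]
beta_demeaned = (x_dm @ y_dm) / (x_dm @ x_dm)

print(beta_dummies, beta_demeaned)  # identical up to floating-point error
```

Either implementation also shows why dropping one indicator (or the overall intercept, as here) is needed: keeping both would make the design matrix rank deficient.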
MLM, by contrast, treats the γ_g as "random effects," meaning that each γ_g is estimated under the assumption that it may have a distribution, that is, has nonzero variance rather than being a fixed quantity. This has implications for estimating both γ and β, as well as for constructing variance estimates, detailed in Section 2.4. 5 Specifically, we define the "random intercept" (RI) model by adding to Equation (2) the assumption

γ_g | X, Z ~ iid N(0, ω²),

where we also assume ε_{g[i]} ⊥ γ_{g′} | X, Z for all g, g′, and i. Additionally, the conditional independence assumption is elaborated to require a multivariate-normal distribution for the residual vectors in each group, ε_g | X, Z ~ ind N(0, Σ_g), where Σ_g ∈ R^{n_g × n_g} is group g's error covariance matrix. All γ_g can be kept and estimated by this approach, together with an intercept in β. It is also commonly assumed that the ε_{g[i]} are spherical (i.e., Σ_g = σ² I_{n_g}), although we avoid that restriction unless otherwise noted.
General fixed effect and multilevel models. Although we focus mainly on Group-FE and RI models here, the claims made below pertain to FE and MLM in their more general form,

Y_{g[i]} = X_{g[i]} β + Z_{g[i]} γ_g + ε_{g[i]},    (3)

where the γ_g represent group-specific coefficients for the variables in Z_g, which may now include variables other than group indicators as above. As before, FE estimates both β and γ by OLS regression of Y on X and Z. For identifiability, the covariates that X_{g[i]} and Z_{g[i]} share (or that result in perfect collinearity) are either dropped from Z_{g[i]} for one g, or dropped from X_{g[i]}. When instead fit with random effects in MLM, the coefficients γ_g are estimated with additional distributional assumptions, specifically

γ_g | X, Z ~ iid N(0, Ω),

where Ω ∈ R^{d × d} and we assume ε_{g[i]} ⊥ γ_{g′} | X, Z for all g, g′, and i. 6 In addition, we define var(ε | X, Z) = Σ to be the ordered block-diagonal matrix of the Σ_g, and Ω_block = var(γ | X, Z) to be the block-diagonal matrix that repeats Ω along its diagonal.

Identification: From nonparametric conditions to specification requirements
Let us assume there is no within-group confounding. In longitudinal settings, this is to say there are no unobserved time-varying confounders. Such an assumption ensures non-parametric identifiability, that is, the causal effect of the treatment can be identified conditionally on group and then averaged together as desired across groups. 7 For such nonparametric identification to be sufficient for unbiased estimation, however, would require the ability to condition on group non-parametrically (i.e., estimate the relationship between treatment and outcome within each group separately). In most settings the investigator is unable to do this, and so turns to modeling assumptions that instead "account for" group in a specific model. Because of this model dependence, identification of treatment effects then requires additional assumptions related to the specification of those models.
The traditional motive for Group-FE is an assumption that the only source of confounding is group-level confounding that takes the linear form of Equation (2): a constant, additive shift by group (i.e., the γ_g). The conditional independence assumption needed is E(ε_{g[i]} | X, Z) = 0 (e.g., Greene 2003). The Group-FE approach is powerful precisely because including Z fully purges the group-level intercepts, γ_g, from this residual, thus removing confounding. So long as other (within-group) forms of confounding do not exist, this unbiasedly estimates β.
The central concern with RI is that even under the conditional independence assumption, random effect estimation fails to remove this confounding. As we explain in Section 3.1, this is because RI does not fully account for the γ_g, biasing the estimate of β as well. Avoiding this bias requires additionally that the "uncorrelated random effects" assumption be true: that the γ_g are uncorrelated with X_{g[i]}, so that failure to account for γ_g does not bias estimates of β. 8,9 This is problematic, since γ_g is most important to account for when it is correlated with the treatment variable of interest in X_{g[i]} and thus acts as a confounder. While this assumption, also referred to simply as the "random effects assumption" (Bell and Jones 2015; Kim and Steiner 2019), is well-known in principle, it remains widely neglected in practice (see Section 3.1) despite attractive solutions (see Section 3.2).

Parameter estimation in MLM
We begin by briefly reviewing how MLM parameters are estimated. 10 Estimation proceeds in three steps: (1) estimate Ω and Σ; (2) estimate β using the estimate of (Ω, Σ); and (3) estimate γ using the estimates of (β, Ω, Σ). The first two steps require the distribution of Y given X and Z. Because, in MLM, the Z_{g[i]} γ_g are mean-zero given X and Z, the (Z_{g[i]} γ_g + ε_{g[i]}) can be treated as combined mean-zero error terms. This allows one to formulate MLM at the sample level as Y = X β + ε*, where ε* = Z γ + ε. Then, because γ and ε are both normally distributed given X and Z, we have

Y | X, Z ~ N(X β, V), where V = Z Ω_block Z′ + Σ.

The likelihood is then

L(β, Ω, Σ | Y, X, Z) = (2π)^{−N/2} |V|^{−1/2} exp( −(1/2) (Y − X β)′ V⁻¹ (Y − X β) ).    (5)

Given a choice of (Ω, Σ), which determines V, maximizing the likelihood for β would yield

β̂ = (X′ V⁻¹ X)⁻¹ X′ V⁻¹ Y.    (6)

However, estimates of Ω and Σ do not enjoy similarly simple closed solutions. Instead, they are typically found iteratively through either unrestricted maximum likelihood estimation or restricted maximum likelihood estimation, both using Equation (5). β can then be estimated by plugging estimates of Σ and Ω into Equation (6). Finally, γ is estimated by maximizing its posterior probability given Y, X, Z, and the estimated (β, Ω, Σ), that is,

γ̂ = argmax_γ p(γ | Y, X, Z, β̂, Ω̂, Σ̂).

In this article, we focus solely on estimation and inference for β and γ. Because MLM's estimates of these parameters are functions of Ω and Σ, our results hold regardless of the choice between unrestricted or restricted maximum likelihood in estimating Ω and Σ. For this reason, we refer to arbitrary MLM estimates of the parameters as Ω̂_MLM, Σ̂_MLM, β̂_MLM, and γ̂_MLM.
Finally, treating V̂_MLM as fixed, the conditional variance of β̂_MLM is simply

var(β̂_MLM | X, Z) = (X′ V̂_MLM⁻¹ X)⁻¹ X′ V̂_MLM⁻¹ var(Y | X, Z) V̂_MLM⁻¹ X (X′ V̂_MLM⁻¹ X)⁻¹,

which is commonly simplified to (X′ V̂_MLM⁻¹ X)⁻¹ as the standard model-based MLM variance estimator, by assuming that V̂_MLM is a consistent estimator of var(Y | X, Z).

8 The presumption that γ_g and X_{g[i]} are uncorrelated is sometimes described less as an "assumption" and instead as a feature of a "working model" or a prior belief, that is, a convenience that we employ but do not necessarily expect to be true, or that becomes irrelevant as the sample size grows. However, as shown here, it is a consequential modeling decision and, in finite samples, demonstrably leads to undesirable estimates.

9 In the more general setting where the entire "random effect contribution" is captured by Z_{g[i]} γ_g (i.e., Equation 3), this implies that Z_{g[i]} γ_g is uncorrelated with X_{g[i]}.

10 Both frequentist (maximum likelihood) and Bayesian estimation approaches are available. The former is the most commonly taught and employed (e.g., lme4 in R), at least for the simpler models we consider here. An excellent review of Bayesian MLM estimation can be found in Gelman and Hill (2006).
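The three-step recipe above can be sketched for the RI case. For simplicity, the sketch below takes the variance components (σ², ω²) as known rather than estimated by ML/REML, so step (2) reduces to a single generalized least squares solve; all names are illustrative:

```python
# Sketch of MLM step (2) for a random-intercept model: given variance
# components (assumed known here, rather than estimated iteratively), build
# V = omega^2 * Z Z' + sigma^2 * I and compute the GLS estimate of beta.
import numpy as np

rng = np.random.default_rng(1)
G, n_g = 8, 12
N = G * n_g
groups = np.repeat(np.arange(G), n_g)
sigma2, omega2 = 1.0, 4.0                      # var(eps), var(gamma_g): assumed known

gamma = rng.normal(0, np.sqrt(omega2), G)
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one covariate
beta_true = np.array([2.0, 0.5])
y = X @ beta_true + gamma[groups] + rng.normal(0, np.sqrt(sigma2), N)

Z = (groups[:, None] == np.arange(G)).astype(float)     # group indicators
V = omega2 * (Z @ Z.T) + sigma2 * np.eye(N)             # implied marginal covariance
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Model-based variance (X' V^{-1} X)^{-1}: valid only if V is correctly specified
var_beta = np.linalg.inv(X.T @ Vinv @ X)
print(beta_gls, np.sqrt(np.diag(var_beta)))
```

The last line is exactly the simplification discussed above: it drops the sandwich "meat" by assuming V̂ consistently estimates var(Y | X, Z), which is why the default MLM standard errors can undercover when that assumption fails.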

Analytical insights
We now analyze three features of MLM that aid in understanding how these methods compare.
3.1 Random effects as regularization and the "incomplete conditioning" problem

The first piece shows an equivalence between the random effects employed in MLM and regularization, that is, a penalized fitting approach. We begin by introducing the "regularized fixed effects" (regFE) class of models. Suppose we are interested in optimal out-of-sample generalization. Then, instead of estimating Equation (2) by minimizing (in-sample) squared error, we minimize the squared error plus a penalty proportional to the sum of squared γ_g values, known as L2 or Tikhonov regularization:

(β̂_regFE, γ̂_regFE) = argmin_{β, γ} Σ_g Σ_i (Y_{g[i]} − X_{g[i]} β − γ_g)² + λ Σ_g γ_g²,    (9)

where λ > 0 determines the extent of regularization on γ_g (larger values shrink each γ_g toward 0, while smaller values yield estimated intercepts closer to Group-FE's estimates) and may be chosen by various means, such as some form of cross-validation. 11 The key result is that, for a particular choice of λ, such a regularized model is equivalent to the RI model.

Theorem 3.1 (Equivalence of RI and regFE). Let (ω̂²_RI, Σ̂_RI, β̂_RI, γ̂_RI) be estimates from the RI model. If λ = σ̂²_RI / ω̂²_RI in Equation (9), then (β̂_regFE, γ̂_regFE) = (β̂_RI, γ̂_RI).
We omit the proof here, as Theorem 3.2 below offers a generalization. That MLM provides "shrinkage" or "partial-pooling" estimates is very well-known (e.g., Steenbergen and Jones 2002; Fitzmaurice, Laird, and Ware 2004; Luke 2004; Gelman 2006), and in recent texts MLM has even been referred to as employing a "regularizing prior" (e.g., McElreath 2018). The equivalence between regularization and holding a prior on the regularized coefficients is also well known. 12 That said, because we did not find any explicit formalization of the equivalence of estimation in MLM to regularization with a specific choice of λ in any textbooks or articles we reviewed, we fill this gap.
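The regFE objective in Equation (9) has a closed-form solution via augmented normal equations, which makes the shrinkage behavior easy to inspect. Below is a minimal sketch; the penalty value λ = 10 is an arbitrary illustration rather than the σ̂²_RI/ω̂²_RI value from Theorem 3.1, and all names are invented for the example:

```python
# Sketch of the regFE estimator: L2-penalize only the group intercepts
# gamma_g, leaving beta unpenalized. With lam = 0 this recovers Group-FE;
# larger lam shrinks the gamma_g toward zero.
import numpy as np

rng = np.random.default_rng(2)
G, n_g = 6, 8
N = G * n_g
groups = np.repeat(np.arange(G), n_g)
x = rng.normal(size=N)
gamma = rng.normal(0, 3, G)
y = 0.5 * x + gamma[groups] + rng.normal(size=N)

Z = (groups[:, None] == np.arange(G)).astype(float)
M = np.column_stack([x, Z])                      # [X Z], overall intercept omitted
P = np.diag([0.0] + [1.0] * G)                   # penalize gamma only, not beta

def reg_fe(lam):
    """Solve (M'M + lam * P) theta = M'y; theta = (beta, gamma_1..gamma_G)."""
    return np.linalg.solve(M.T @ M + lam * P, M.T @ y)

theta_fe = reg_fe(0.0)                           # plain Group-FE
theta_reg = reg_fe(10.0)                         # regularized fixed effects

# Shrinkage: the penalized intercepts are smaller in magnitude on average
print(np.abs(theta_reg[1:]).mean(), np.abs(theta_fe[1:]).mean())
```

This is the "partially absorbed" behavior discussed below: the shrunken intercepts no longer fully soak up the group means, so whatever they leave behind can load onto the coefficient on x.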
We demonstrate with a simple simulation example, drawing 1,000 datasets from a simple data generating process (DGP) with G = 25 and n_g = 15 each time, where λ for regFE is chosen by cross-validation and (σ̂²_RI, ω̂²_RI) for RI by restricted maximum likelihood. 13 We see in Figure 1 that the estimates of β and γ are nearly identical. They would be numerically equal if, instead of setting λ in regFE by cross-validation, we chose λ = σ̂²_RI / ω̂²_RI as per Theorem 3.1. However, despite the different selection procedures, we also see from Figure 1 that the λ from RI and regFE are quite similar. This is because, while the choice λ = σ̂²_RI / ω̂²_RI in RI is not made specifically with predictive accuracy in mind, it is a sensible one from a prediction point of view. To see this, note first that the portion of Y_{g[i]} that is not explained by X_{g[i]} is the sum of γ_g and ε_{g[i]}, which have variances of ω² and σ², respectively. When, for example, the γ_g have high variance in relation to that of ε_{g[i]}, λ = σ²/ω² will be small, which is appropriate since γ_g will be helpful in predicting Y_{g[i]}. The converse is true when the γ_g have low variance relative to ε_{g[i]}, leading to a higher effective λ and desirable shrinkage of the γ_g to avoid over-fitting.

11 Cross-validation partitions the data into "folds," using all of the folds except for one to estimate a model, and the held-out fold to make predictions with said model. This process is repeated, holding out each fold in turn, ultimately producing an estimate for every observation from a model that was not trained on that observation. These predictions can be used to approximate out-of-sample error. Here we use ten-fold cross-validation to choose the λ that minimizes this out-of-sample error estimate.

12 Specifically, regularization with the L2 penalty as used here is equivalent to finding the maximum a posteriori (MAP) estimates under a prior that γ_g | X, Z ~ iid N(0, σ²/λ). We use this connection in the proof of the equivalence between MLM and regFE. Note that other regularizing norms could be used. For example, regularization with an L1 norm, Σ_g |γ_g|, produces the MAP estimator when we hold a Laplacian prior on γ_g, and would have the effect of inducing sparsity, possibly shrinking some γ_g to exactly zero.

13 Here, and elsewhere where it is clear from context, we refer to the variable of interest, X^(1)_{g[i]}, as simply X_{g[i]} to reduce notation, neglecting that X_{g[i]} may also include an intercept or other terms.
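The cross-validation selection of λ described in footnote 11 can be sketched as follows. For brevity this uses five folds rather than ten and a small candidate grid, and every name is illustrative:

```python
# Sketch of choosing lam by k-fold cross-validation: fit the penalized model
# on all folds but one, score predictions on the held-out fold, and keep the
# lam with the smallest out-of-fold squared error.
import numpy as np

rng = np.random.default_rng(3)
G, n_g = 10, 6
N = G * n_g
groups = np.repeat(np.arange(G), n_g)
x = rng.normal(size=N)
y = 0.5 * x + rng.normal(0, 1.5, size=G)[groups] + rng.normal(size=N)

Z = (groups[:, None] == np.arange(G)).astype(float)
M = np.column_stack([x, Z])
P = np.diag([0.0] + [1.0] * G)          # penalize only the group intercepts

def cv_error(lam, folds, k):
    """Average out-of-fold squared error for one penalty value."""
    err = 0.0
    for f in range(k):
        tr, te = folds != f, folds == f
        theta = np.linalg.solve(M[tr].T @ M[tr] + lam * P, M[tr].T @ y[tr])
        err += np.sum((y[te] - M[te] @ theta) ** 2)
    return err / N

k = 5
folds = rng.permutation(N) % k          # one fixed fold assignment, reused for all lam
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
lam_best = min(grid, key=lambda lam: cv_error(lam, folds, k))
print(lam_best)
```

Fixing the fold assignment before the grid search matters: re-randomizing folds for each candidate λ would add noise to the comparison between penalty values.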
More generally, consider estimating Equation (3) with the regularized regression

(β̂, γ̂) = argmin_{β, γ} Σ_g Σ_i (Y_{g[i]} − X_{g[i]} β − Z_{g[i]} γ_g)² + Σ_g γ_g′ Λ γ_g,    (11)

where Λ ∈ R^{d × d} is symmetric and positive semi-definite.
Again, Λ may be obtained by various means, such as cross-validation. 14 The equivalence with MLM is then given in Theorem 3.2.

This equivalence is useful in comprehending several of MLM's most central advantages and limitations. First, the equivalence to regularization immediately explains why MLM can produce more accurate (out-of-sample) outcome predictions than does FE: regularization prevents the over-fitting that FE may allow, particularly in the case of small groups. 15 Further, MLM is sometimes understood to "wisely" adapt the level of shrinkage based on group size, but the comparison to regularization shows that such adaptation is not as sophisticated as it may appear. When MLM or regFE encounters a group that appears to have a very high or low mean relative to others, choosing an extreme γ_g would decrease the sum of squared errors for that group. For a relatively small group, the savings in squared loss would be smaller relative to the additional "cost" paid through the regularization penalty, γ_g′ Λ γ_g, and so γ_g will be left relatively close to zero. By comparison, when MLM or regFE encounters a larger group, the savings in terms of squared loss would justify the added regularization penalty, so the choice of γ_g minimizing the penalized sum of squared errors will be a more extreme one. Although this behavior appears to intelligently balance prior knowledge with allowing the data to speak, it is reproduced simply through regularization.

14 Note that the procedure in Equation (11) penalizes the magnitude of γ_g, as γ_g′ Λ γ_g ≥ 0 because Λ is positive semi-definite.
Another feature of MLM that can be understood through the regularization view is its ability to estimate coefficients for group-level variables even while including group-specific intercepts and slopes that would have prevented model identification under OLS. This occurs for the same reason that one can include more coefficients than observations in a "ridge regression." Consider attempting to find (β̂, γ̂) purely by OLS,

(β̂, γ̂) = argmin_{β, γ} Σ_g Σ_i (Y_{g[i]} − X_{g[i]} β − Z_{g[i]} γ_g)².

This has no unique solution if X_{g[i]} contains a group-level variable or its interaction with a variable in Z_{g[i]}, as [X Z]′[X Z] is then singular. However, by introducing the regularization penalty into the minimization problem, regFE essentially adds to [X Z]′[X Z] a positive semi-definite matrix that allows the sum of the two matrices to be invertible.

Finally, this equivalence illuminates the main concern with MLM: bias when the random effects are correlated with X_{g[i]}. Because the group-specific intercepts in a RI model are regularized, they do not achieve the values that would "fully absorb" group-specific confounding, leaving components unexplained that can instead be captured by biasing the coefficients on X_{g[i]}. In Equation (2), where we hope to condition on group to absorb group-level confounders, random effects thus offer only "incomplete conditioning," not fully accounting for the unobserved group-level variables' influence on the outcome. To illustrate, consider a DGP ("DGP 1") in which X_{g[i]} ∈ R is an observed observation-level variable and the W^(ℓ)_g are unobserved group-level covariates, with W^(1)_g being a confounder. While Group-FE will unbiasedly estimate β_1, a simple OLS regression of Y on X would produce a biased estimate of β_1, having failed to account for W^(1)_g. RI may seem more appealing than OLS because investigators might hope that the γ_g will capture the group-level confounding.
However, the shrinkage of the γ_g due to treating them as random effects yields estimated intercepts closer to zero than those obtained by Group-FE and required to account for W^(1)_g. This leaves some of W^(1)_g unabsorbed, allowing it to continue biasing β̂_1 as in the OLS model. We call this "incomplete conditioning" because the intended analytical strategy required to estimate β unbiasedly would condition on group, but the use of RI fails to achieve this. Figure 2 illustrates this. Define bias and root mean square error (RMSE) as

bias = (1/M) Σ_{m=1}^{M} (β̂^(m) − β)  and  RMSE = [ (1/M) Σ_{m=1}^{M} (β̂^(m) − β)² ]^{1/2},

where m indexes the number of iterations from 1 to M and β̂^(m) is the estimate from the mth iteration. Bias is large, as expected, for OLS, as it fails to account for group-level confounding at all. RI makes almost no improvement when groups are small (n_g = 5), and only a partial improvement when the groups are quite large (n_g = 50).

15 One well-known use of MLM, especially in political science, is "multilevel regression and post-stratification" (MRP; Park, Gelman, and Bafumi 2004). This is an approach to small-area estimation, in which estimates for the conditional means of small groups are attempted despite having very little data by group. Its strength in this task is derived from the ability to partially pool information across units, that is, shrinkage of the random effect estimates. Other approaches that explicitly engage regularization may also thus be effective in this task.

Figure 2. Comparison of estimates of β_1 from OLS, Group-FE, and RI in DGP 1. Note: Results across 1,000 iterations, each drawn from DGP 1 with β_0 = β_1 = 1. The dashed line represents the true β. Due to correlated random effects, RI estimates are almost as biased as OLS estimates when group size is small (5). The bias is less severe but still appreciable at a group size of 50, and RMSE remains twice that of Group-FE.
Note that unbiasedness of RI would require the absence of correlated random effects, that is, no correlation between X_{g[i]} and W^(1)_g, which is to say an absence of group-level confounding. Had this been the case, OLS would also suffice, and would differ from RI only in its efficiency and how variance is estimated. By contrast, in the presence of such correlation, Group-FE effectively eliminates confounding bias at both group sizes. In terms of efficiency, while RI has the expected slight decrease in variance, its RMSE remains about twice that of Group-FE at either group size due to these biases. That RI's average bias falls as n_g rises may seem to be a cause for hope when one has a large enough dataset, and Lockwood, McCaffrey, et al. (2007) describe the conditions under which this type of bias tends to 0 as n_g → ∞ generally. However, in practice, there is no way of knowing whether n_g is large enough to ensure negligible bias in a given case, as this depends on the correlation between the covariates and the random effects. We also note that with group-level covariates, increasing n_g may not alleviate bias (see Appendix A.4).
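The incomplete-conditioning bias is easy to reproduce in a stripped-down simulation in the spirit of DGP 1 (a stand-in with illustrative names, not the paper's exact DGP): pooled OLS inherits the group-level confounding, while the within (Group-FE) estimator removes it:

```python
# Sketch: an unobserved group-level confounder W_g shifts both the treatment
# x and the outcome y. Pooled OLS is biased upward; group-demeaning (the
# Group-FE / within estimator) recovers the true slope on average.
import numpy as np

rng = np.random.default_rng(4)
G, n_g, M = 25, 5, 300
beta1 = 1.0
groups = np.repeat(np.arange(G), n_g)
ols_est, fe_est = [], []
for _ in range(M):
    W = rng.normal(size=G)                        # unobserved group confounder
    x = rng.normal(size=G * n_g) + W[groups]      # treatment correlated with W
    y = beta1 * x + W[groups] + rng.normal(size=G * n_g)

    xc, yc = x - x.mean(), y - y.mean()
    ols_est.append((xc @ yc) / (xc @ xc))         # pooled OLS slope

    xbar = np.array([x[groups == g].mean() for g in range(G)])[groups]
    ybar = np.array([y[groups == g].mean() for g in range(G)])[groups]
    fe_est.append(((x - xbar) @ (y - ybar)) / ((x - xbar) @ (x - xbar)))

print(np.mean(ols_est) - beta1, np.mean(fe_est) - beta1)
```

With this setup the pooled OLS slope is biased upward by roughly cov(x, W)/var(x) = 0.5, while the within estimator's average bias is near zero; an RI fit would land between the two, closer to OLS when groups are small.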
Comparison to practice. For each of the three analyses in Sections 3.1, 3.2, and 3.3, we briefly contrast what is already known of these claims with what we find in practice. In this case, both textbooks and pedagogical articles remark heavily on the correlated random effects assumption, albeit not usually in terms of regularization or incomplete conditioning. Yet this most central of concerns regarding MLM is demonstrably neglected in empirical practice. Among the MLM-based studies we reviewed where such bias would be at issue, only 1 of 24 education articles, 13 of 39 political science articles, and 10 of 19 sociology articles addressed the issue.
Figure 3. Out-of-sample predictive accuracy of Group-FE and the bias-corrected RI model. Note: The RI model with Σ = σ² I_N has been debiased by including X̄_g as a covariate, and shows lower testing error, especially when groups are smaller. Results are averaged across 1,000 iterations, each drawn from DGP 1. Testing data are of the same size as the training data.

3.2 Bias-corrected MLM
The second analytical piece is that, while MLM's bias problem with correlated random effects appears to be dire, there is a simple solution dating back to Mundlak (1978). With this fix, MLM unbiasedly estimates coefficients for variables below the level of grouping or clustering, as FE does, while retaining the ability to include group-level variables in the model and the superior predictive accuracy that arises from regularization/random effects. For RI, the fix requires adding to the model the group-level means of X_{g[i]} (including any interactions or other nonlinear terms).
As a stepping stone, consider a similar approach that could be applied to OLS as a substitute for FE. We consider again DGP 1 from Section 3.1, but suppose we regress Y_{g[i]} on X_{g[i]} and the group mean X̄_g = (1/n_g) Σ_{i=1}^{n_g} X_{g[i]} by OLS. Informally, adding X̄_g "soaks up" any contribution that W^{(1)}_g could have made to Y_{g[i]} that could be correlated with X_{g[i]}, protecting the estimate of β_1 in the same way that FE would, and leaving an unbiased estimate of β_1 under the conditional independence assumption. Proof is given in Appendix A.5. This bias-curing effect of including X̄_g in OLS can similarly be applied to the RI model, and alleviates RI's bias problem, with Σ = σ²I_N, in estimating β_1. We omit the proof of this result, as we prove a more general claim below. In fact, an OLS model including X̄_g, the Group-FE model, and the RI model including X̄_g all produce exactly the same point estimates. Despite this equivalence, the RI model retains MLM's greater (out-of-sample) predictive accuracy for the outcome compared to Group-FE, a benefit of the regularization imposed on γ_g, as illustrated in Figure 3.
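To make this equivalence concrete, the following numpy sketch (with hypothetical group sizes, coefficients, and a generic group-level confounder, not the exact DGP 1 parameters) verifies that OLS with the Mundlak term X̄_g returns the same coefficient on X as the within (Group-FE) transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
G, n_g = 40, 8                                  # hypothetical sizes
g = np.repeat(np.arange(G), n_g)
W = rng.normal(size=G)                          # group-level confounder
X = 0.7 * W[g] + rng.normal(size=G * n_g)       # X correlated with W
Y = 2.0 * X + 1.5 * W[g] + rng.normal(size=G * n_g)

def ols(D, y):
    return np.linalg.lstsq(D, y, rcond=None)[0]

# observation-level vectors of group means of X and Y
Xbar = np.array([X[g == k].mean() for k in range(G)])[g]
Ybar = np.array([Y[g == k].mean() for k in range(G)])[g]

# (a) OLS with the Mundlak term: regress Y on [1, X, Xbar]
b_mundlak = ols(np.column_stack([np.ones_like(X), X, Xbar]), Y)[1]

# (b) Group-FE via the within transformation: group-demean Y and X
b_fe = ols((X - Xbar)[:, None], Y - Ybar)[0]
# b_mundlak and b_fe are numerically identical
```

The equality is exact (up to floating point), not approximate: demeaned X is orthogonal to the group-constant columns, so the coefficient on X is unchanged by the reparametrization.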
One can easily generalize this alteration to other models with varying intercepts but with multiple covariates or treatments: simply add to the RI model the group-level means of all included variables, X̄_g = (1/n_g) Σ_{i=1}^{n_g} X_{g[i]}, including any interactions or nonlinear transformations. If spherical observation-level errors are assumed, the coefficient estimates from this model for lower-level variables are exactly equal to those obtained by Group-FE. Historically, the inclusion of X̄_g was proposed by Mundlak (1978) and extended by Chamberlain (1979, 1982), and is known in some econometrics-informed traditions as the "correlated random effects" (CRE) approach (Wooldridge 2010; Schunck 2013). Although originally motivated by imposing the assumption E(γ_g | X_g, Z_g) = X̄_g α, we show unbiasedness without this assumption by showing its equivalence to "partialing out" the group-level effects, which is in turn equivalent to Group-FE. A closely related approach is the hybrid model of Allison (2009), which group-demeans X_{g[i]} in addition to including X̄_g. This model is an isomorphic variation on models that only include X̄_g without demeaning, producing the same coefficient of interest on X but altering the way in which coefficients are combined to interpret between-group differences (see Schunck 2013).
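The same equivalence holds with multiple lower-level terms, including nonlinear ones, provided the group mean of every included term is added. A sketch under an assumed quadratic specification (all sizes and coefficient values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g = 40, 8
g = np.repeat(np.arange(G), n_g)
W = rng.normal(size=G)                                  # group-level confounder
X = 0.7 * W[g] + rng.normal(size=G * n_g)
Y = 2.0 * X - 0.5 * X**2 + 1.5 * W[g] + rng.normal(size=G * n_g)

def gmean(v, g):
    # observation-level vector of group means of v
    return np.array([v[g == k].mean() for k in np.unique(g)])[g]

lower = np.column_stack([X, X**2])                      # lower-level terms, incl. the square
means = np.column_stack([gmean(X, g), gmean(X**2, g)])  # group means of *each* term

# CRE / Mundlak: include the group means of every lower-level term
D = np.column_stack([np.ones_like(X), lower, means])
b_cre = np.linalg.lstsq(D, Y, rcond=None)[0][1:3]

# Group-FE via the within transformation of Y and both terms jointly
b_fe = np.linalg.lstsq(lower - means, Y - gmean(Y, g), rcond=None)[0]
# b_cre and b_fe agree coefficient by coefficient
```

Note that the mean of the squared term is the group mean of X², not the square of X̄_g, matching the text's instruction to average every included transformation.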
We now turn to the more general debiasing approach for cases allowing multiple random coefficients and not just group intercepts. We refer to this as "bias-corrected MLM" (bcMLM), and use this label throughout the remainder of the paper to cover the special case of bias-corrected RI as well. 16 We first take the projections of the "fixed effect variables" (X_g), excluding the intercept, onto the random effect variables (Z_g) within each group, X̂_{g[i]} = Z_{g[i]} (Z_g' Z_g)^{-1} Z_g' X_g. 17 These projections are then added to the regression as "fixed effect variables," where β and α are assumed fixed, and we continue to assume that ε_g | X, Z ~ind N(0, Σ_g) and ε ⊥ γ_g | X, Z for all g, g', and i. When Σ = σ²I_N, and provided it is not overidentified (described below), bcMLM produces estimates for β that are unbiased under the conditional independence assumption (Appendix A.8) and identical to FE estimates (Appendix A.9).
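The projection step can be sketched as follows. The helper below is an illustrative single-group implementation of X̂_{g[i]} = Z_{g[i]}(Z_g' Z_g)^{-1} Z_g' X_g (all data are simulated placeholders); the final lines confirm that an intercept-only Z_g reduces the projection to the group means, i.e., the Mundlak term:

```python
import numpy as np

rng = np.random.default_rng(0)
n_g = 6
X_g = rng.normal(size=(n_g, 2))      # two lower-level covariates for one group
# random-effect design: intercept plus one random-slope variable (hypothetical)
Z_g = np.column_stack([np.ones(n_g), rng.normal(size=n_g)])

def group_projection(Z_g, X_g):
    # least-squares projection of the columns of X_g onto the column space of Z_g
    coef, *_ = np.linalg.lstsq(Z_g, X_g, rcond=None)
    return Z_g @ coef

Xhat_g = group_projection(Z_g, X_g)  # the extra bcMLM covariates for this group

# special case: intercept-only Z_g yields each column's group mean in every row
Xhat_ri = group_projection(np.ones((n_g, 1)), X_g)
```

In a full model these projections would be computed group by group and stacked alongside the original covariates.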
We make two remarks on this result. First, it provides unbiasedness only when spherical observation-level errors are assumed. 18 While the choice of variance-covariance matrix for the errors has no impact on point estimates under FE, the point estimates from MLM are sensitive to this choice. Second, the RI model including X̄_g noted above is a special case of bcMLM: when Z_{g[i]} contains only the intercept, the within-group projection X̂_{g[i]} reduces to the group mean X̄_g.

Limitations and the per-cluster regression. One limitation of bcMLM is that it can fail to eliminate potential biases for coefficients of certain variables because including X̂_{g[i]} results in an overidentified model. A simple example would be a RI model that includes a group-level covariate, U_g, in X_{g[i]}. The proposed alteration suggests that one includes Ū_g = (1/n_g) Σ_{i=1}^{n_g} U_g, but this is simply the same U_g already included (see Appendix A.6 for an example). Therefore, unless U_g is independent of the random effects and any included lower-level variables (e.g., if U_g were randomly assigned as a group-level treatment), the estimated coefficients for U_g may be biased. 19 Thankfully, as long as U_g is uncorrelated with the random intercepts, its coefficient can be unbiasedly estimated if desired by adding a "per-cluster regression" step as proposed by Bates et al. (2014). For example, suppose we have one observation-level variable X_{g[i]} and one group-level variable U_g, with coefficients β_1 and β_2, respectively. First, using Group-FE or bcMLM (here, an RI model including X̄_g), unbiasedly estimate β̂_1. Then remove the estimated marginal effect of X_{g[i]}, forming Y⊥_{g[i]} = Y_{g[i]} − β̂_1 X_{g[i]}. The per-cluster step is to then regress the G group-level means of Y⊥_{g[i]} on U_g and an intercept term by OLS. We provide an example of this process in Appendix A.12. 16 Similar suggestions have been made by Wooldridge (2005), Snijders and Berkhof (2008), and Raudenbush (2009).
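The two-step procedure can be sketched in numpy. To keep the check exact, this toy example omits observation-level noise and constructs random intercepts exactly uncorrelated with U_g in-sample; all sizes and coefficient values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
G, n_g = 30, 10
g = np.repeat(np.arange(G), n_g)
U = rng.normal(size=G)                        # group-level covariate of interest
DU = np.column_stack([np.ones(G), U])

# random intercepts, residualized so they are exactly uncorrelated with U in-sample
gamma = rng.normal(scale=0.5, size=G)
gamma -= DU @ np.linalg.lstsq(DU, gamma, rcond=None)[0]

X = 0.8 * gamma[g] + rng.normal(size=G * n_g)  # X correlated with the intercepts
Y = 2.0 * X + 1.5 * U[g] + gamma[g]            # noise-free outcome (beta_1=2, beta_2=1.5)

# step 1: estimate beta_1 by Group-FE (within transformation)
Xbar = np.array([X[g == k].mean() for k in range(G)])[g]
Ybar = np.array([Y[g == k].mean() for k in range(G)])[g]
b1 = np.linalg.lstsq((X - Xbar)[:, None], Y - Ybar, rcond=None)[0][0]

# step 2: remove b1*X, average the remainder by group, regress on [1, U]
Yperp = Y - b1 * X
Ybar_perp = np.array([Yperp[g == k].mean() for k in range(G)])
b2 = np.linalg.lstsq(DU, Ybar_perp, rcond=None)[0][1]
```

In this noise-free setting the within step recovers β_1 = 2 exactly, and the per-cluster step recovers β_2 = 1.5 exactly; with noise, both steps are unbiased rather than exact.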
Another approach (Snijders and Bosker 2011; Wooldridge 2013) is to include in an MLM the interaction of (X̄_g − X̄) with all variables in Z_{g[i]}, where X̄ = (1/N) Σ_{g,i} X_{g[i]} is the grand mean of X_{g[i]}. This amounts to including (X̄_g − X̄) ⊗ Z_{g[i]} among the X_{g[i]}, where ⊗ is the Kronecker product. However, this is not guaranteed to debias estimates of β (see Appendices A.10 and A.11). 17 Note that this is a slight abuse of notation, as X_g contains a column for the intercept term. In practice, this column should be removed from X_g when forming these projections. We keep it here in the interest of avoiding extra notation. 18 We are not aware of a proof of the general unbiasedness or consistency of bcMLM when Σ ≠ σ²I_N and is possibly misspecified.
19 Furthermore, if cor(X^(ℓ)_{g[i]}, U_g) ≠ 0 for some lower-level variable X^(ℓ)_{g[i]}, the inclusion of X̄^(ℓ)_g may induce bias in the coefficient for U_g that would otherwise not be there (demonstrated in Appendix A.7).
Another important example of an overidentification limitation to bcMLM occurs when the slope for X^(ℓ)_{g[i]} is allowed to vary, that is, when X^(ℓ)_{g[i]} is included in Z_{g[i]}. Because bcMLM includes as extra covariates the predictions of X^(ℓ)_{g[i]} using Z_{g[i]}, and Z_{g[i]} predicts X^(ℓ)_{g[i]} perfectly, including this "prediction" simply includes X^(ℓ)_{g[i]} in the model twice. One of the copies of X^(ℓ)_{g[i]} would be dropped from the model, and the "prediction" of X^(ℓ)_{g[i]} cannot soak up the bias from any potentially correlated random effects when estimating β_ℓ. This is also true of coefficients for any cross-level interactions. Here again the per-cluster regression provides an option for users who are interested in those coefficients, as demonstrated in Appendix A.13. 20

Comparison to practice. The main takeaway from this analysis is that bcMLM removes the bias of MLM due to correlation of random effects with the treatment, and in so doing produces coefficient estimates identical to FE (when Σ = σ²I_N is assumed), while retaining MLM's superior predictive accuracy for the outcome and the ability to model group-level variables, which may or may not be of interest to investigators depending on their task. Yet, among articles we reviewed where bias due to correlated random effects would be at issue, none of the 24 education articles, 1 of 39 political science articles, and 2 of 19 sociology articles employed bcMLM or any other suitable debiasing approach for MLM.

Variance estimation
The third and final analytical piece is that the "default" MLM standard errors are based on narrow assumptions that are often inappropriate, but this too can be remedied. A commonly cited motive for using MLM is the claim that it ensures appropriate standard errors for multilevel data. This is only true when the sole source of heteroskedasticity or dependency between residuals arises from the random effects included through Z_{g[i]} and their contributions to Y_{g[i]}. That a default standard error exists for MLM, and is the only choice in some software, does not imply it is always an appropriate choice.
Recall from Section 2.4 that the random effects contribution, Z_{g[i]} γ_g, and the idiosyncratic error, ε_{g[i]}, can be thought of as a single mean-zero error term ε*_{g[i]} = Z_{g[i]} γ_g + ε_{g[i]}. The dependency between two observations' outcomes within the same group, conditional on X and Z, is then cov(Y_{g[i]}, Y_{g[j]} | X, Z) = Z_{g[i]} Ω Z_{g[j]}' + cov(ε_{g[i]}, ε_{g[j]}). MLM can be understood as a framework for structuring this covariance, by specifying which variables enter the model as random effects (i.e., Z_{g[i]}) and parameterizing Ω and Σ. In the RI model, Z_{g[i]} = 1 and Ω = ω². Under the conditional independence assumption and spherical errors, cov(Y_{g[i]}, Y_{g[j]} | X, Z) equals ω² + σ² when i = j and ω² when i ≠ j. In other words, Y_{g[i]} is modeled as linear in X_{g[i]} with error, but instead of independent observations, there is constant covariance between observations in the same group, and this covariance does not differ by group. This yields a compound symmetric covariance matrix for each group's observations.

Allowing the intercept and a slope to vary, and assuming that (conditionally on X and Z) γ_{0g} and γ_{1g} are drawn from N(0, ω_0²) and N(0, ω_1²) with covariance ω_{01}, yields intragroup covariances of cov(Y_{g[i]}, Y_{g[j]} | X, Z) = ω_0² + ω_{01}(Z_{g[i]} + Z_{g[j]}) + ω_1² Z_{g[i]} Z_{g[j]} + σ² 1{i = j}, where Z_{g[i]} here denotes the random-slope variable. With the addition of more random effect variables, the variance structure becomes more complex. This complexity should not be equated with generality: the variance is still assumed to be a highly prescribed function of the data. To illustrate the potential for variance estimates with poor coverage, consider a longitudinal DGP (DGP 2), where g indexes the person and t = 1, ..., T indexes the time-point of the observation. There is an observation-level variable X_{g[t]} that is auto-correlated; a group-level variable U_g; and a random intercept W_g. The observation-level error terms are auto-correlated as well, with (heteroskedastic) variances that depend on U_g. Using the "default" variance with RI, assuming Σ = σ²I_N, Figure 4 shows coverage rates for nominal 95% confidence intervals of β_1 and β_2 across draws from DGP 2. Typically we would focus interest on β_1, motivated by an assumption of no within-group confounding. However, we also show results for β_2 because group-level variables were often of interest in the empirical works we reviewed. The RI standard errors are consistently too small for both β_1 and β_2 across all sample sizes. The undercoverage for β_1 worsens as the total number of time-periods (T) increases. Coverage improves for β_2 as T increases, but remains unsatisfactory even at T = 50. Similar undercoverage of the "default" MLM standard errors for RI has been noted by Bell, Fairbrother, and Jones (2019), Heisig, Schaeffer, and Giesecke (2017), and Jacqmin-Gadda et al. (2007).

20 The per-cluster regression does, however, require that all n_g > d, and is unstable when any non-intercept elements of Z_{g[i]} have little variation within a group. Graham and Powell (2012), extending a closely related estimator from Chamberlain (1992), had previously investigated the conditions under which β is identifiable in these cases despite correlated random effect contributions, and propose an estimator that is consistent when they hold.
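To illustrate how prescribed these structures are, the sketch below builds the compound-symmetric covariance matrix implied by the RI model for one group; the ω² and σ² values are arbitrary placeholders:

```python
import numpy as np

def ri_group_cov(n_g, omega2, sigma2):
    # random-intercept model: covariance omega2 for every pair of observations
    # in the group, plus sigma2 added on the diagonal (compound symmetry)
    return omega2 * np.ones((n_g, n_g)) + sigma2 * np.eye(n_g)

V = ri_group_cov(4, omega2=0.5, sigma2=1.0)
# every off-diagonal entry is 0.5 and every diagonal entry is 1.5,
# regardless of the group or the covariate values
```

The same function applies to every group: under RI, nothing about the group's covariates or identity alters the implied covariance, which is exactly the rigidity at issue.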
Relaxing assumptions for MLM variance estimation. If the user has strong reason to believe the errors (conditional on the random effect contributions) are spherical, then the default MLM standard errors just described would be appropriate. However, such justifications are rarely offered. Fortunately, more flexible approaches can be employed in the MLM framework, relaxing this assumption. For example, with longitudinal data, it is possible to assume an AR(1) error structure for the ε_{g[t]}, in which cor(ε_{g[t]}, ε_{g[t+k]}) = ρ^k for ρ ∈ (−1, 1). Or, one may allow var(ε_{g[i]} | X, Z) = σ_g², where σ_g² can differ by group, to accommodate heteroskedasticity by group. One may also allow a common unstructured Σ_g across g, which makes no assumptions on the intragroup covariance and assigns a separate parameter to each cov(ε_{g[i]}, ε_{g[i']}). 21

21 Another avenue toward achieving flexibility is to allow the random effects to be heteroskedastic, that is, relaxing the assumption that the γ_g have common variance, Ω. Hoffman (2015), for example, proposes directly modeling the variance of the random effects as a function of the covariates, such as var(γ_{0g} | X, Z) = exp(ν_0 + ν_1 U_g), where U_g is a group-level covariate and (ν_0, ν_1) are parameters to be estimated.
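The AR(1) structure just described implies a simple Toeplitz correlation matrix over time points; a minimal sketch (the T and ρ values are arbitrary):

```python
import numpy as np

def ar1_corr(T, rho):
    # cor(eps_t, eps_{t+k}) = rho**k, so entry (s, t) is rho**|s - t|
    t = np.arange(T)
    return rho ** np.abs(t[:, None] - t[None, :])

R = ar1_corr(5, 0.6)
# a group's error covariance under group-wise heteroskedasticity could then be
# sigma2_g * R, with sigma2_g differing by group
```

This structure uses a single parameter ρ for the whole within-group dependence, in contrast to the unstructured Σ_g, which spends one parameter per pair of time points.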
Using such specialized error structures when estimating standard errors may be advisable when researchers can defend the corresponding assumptions. On the other hand, users often cannot claim to know the correct variance structure based on theory alone. Fortunately, cluster-robust standard errors (CRSEs), popular in conjunction with FE, provide a useful low-assumption alternative, asking the user only to assume independence across groups while allowing within-group covariance to be fully estimated, albeit at the cost of requiring more data. 22 Note that both MLM and CRSEs operate as though we assume zero covariance of the residuals from units in different groups, sharing the assumption cov(ε*_{g[i]}, ε*_{g'[j]} | X, Z) = 0 for g ≠ g'. What differs is only how the within-group covariance E(ε*_g ε*_g' | X, Z) is estimated. In MLM, this covariance is parametrized as described above (e.g., in a RI model), and estimated by plugging in the parameter estimates. By contrast, CRSEs are remarkable for the lack of structure they impose on these within-group covariances. For example, after estimating β̂ with an OLS of Y on X, CRSEs would simply construct the empirical covariance estimates c ê_g ê_g', where ê_g = Y_g − X_g β̂ and c is a scalar finite-sample correction. In other words, while MLMs make a strict assumption on the within-group covariances, CRSEs impose among the weakest possible assumptions by simply employing an empirical estimate based on fitted residuals. Appendix A.14 provides a more detailed discussion of the CRSE structure. Fortunately, nothing prevents MLM users from employing CRSE assumptions during variance estimation (see also Cameron and Miller 2015), estimating the variance by substituting the empirical products ê_g ê_g' into the sandwich form, where ê_g = Y_g − X_g β̂_MLM and V is as defined in Equation (4). We discuss the choice of c in Appendix A.15. This leads to a surprising and useful equivalence. Mirroring the equivalence between estimates of β from bcMLM with Σ = σ²I_N and FE, the CRSEs for β̂ from both models are also equal if both use the same c (as we recommend in Appendix A.15).
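The CRSE computation itself is short. The sketch below implements the OLS sandwich with empirical cluster score products (c = 1 for simplicity; the data and model are hypothetical), and checks that with singleton clusters it collapses to the familiar HC0 heteroskedasticity-robust estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)
g = np.repeat(np.arange(40), 5)                  # 40 clusters of 5 observations

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
bread = np.linalg.inv(X.T @ X)

def crse(X, e, g, c=1.0):
    # cluster-robust sandwich: bread @ (sum_g X_g' e_g e_g' X_g) @ bread, scaled by c
    K = X.shape[1]
    meat = np.zeros((K, K))
    for k in np.unique(g):
        s = X[g == k].T @ e[g == k]              # cluster score
        meat += np.outer(s, s)
    return c * bread @ meat @ bread

V_cluster = crse(X, e, g)

# with every observation its own cluster and c = 1, CRSE reduces to HC0
V_hc0 = bread @ ((X.T * e**2) @ X) @ bread
```

The only structure imposed is the block-diagonal pattern given by the cluster labels; everything inside a block comes from the fitted residuals.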
This fact, proven in Appendix A.16, may be surprising since the point estimates Ŷ_{g[i]} differ between models, and thus the overall error variance differs. This equivalence avoids debate over which method is appropriate to estimate the coefficient and standard error on the covariate of interest, since the answers will be the same. Figure 5 illustrates the performance of CRSEs with MLM in DGP 2, in which RI with Σ = σ²I_N misspecifies the dependence structure. Confidence intervals for β_1 using CRSEs fix the undercoverage seen above in Figure 4 under the conventional standard errors. Coverage remains poor for β_2 with G = 15, with some undercoverage remaining at G = 50. 22 We refer readers to Cameron and Miller (2015) for a detailed review of CRSEs. We take a model-based perspective here and show the dependence structure implied by different variance estimators, including CRSEs. For a design-based approach to considering when clustering may be required, and concerns regarding the conservative effects of clustering at too high a level, see Abadie et al. (2017).

Guidance on CRSEs. We offer several notes regarding the appropriate use of CRSEs. First, CRSEs assume that observations in different groups have zero covariance in their residuals. Investigators must keep this in mind when choosing the level at which clustering is performed. 23 Clustering units that actually belong to different groups as if they are in the same group reduces the number of clusters but does not violate the CRSE assumption. By contrast, if units that actually have dependent residuals are labeled as if they belong to separate groups, the CRSE assumption of no between-group dependence will be violated and the results will be unreliable. In some cases, there may not be any choice of grouping that makes this assumption defensible, in which case CRSEs would not be defensible either.
Second, as in any modeling problem, there is a tradeoff between the ability to relax assumptions and the need for more data. Thus, while CRSEs can be a substantial improvement over the default RI model-based standard errors, they come at the cost of demanding more data: the convergence of the CRSEs depends on the number of groups. Cameron and Miller (2015) suggest that 20 to 50 clusters may be needed to ensure stable estimates. Naturally, this number depends upon many features of the data, and no guideline can be expected to be universally sufficient. Alternatively, when investigators can arguably defend the stricter assumptions of a less flexible covariance structure, such as an AR(1) or a simpler heteroskedastic model, then doing so may pay off. To this end, when the number of groups is smaller, the model-based MLM standard errors may be preferable if one can justify that the assumed dependence structure is plausible.
Comparison to practice. Most textbooks we reviewed describe alternative estimators for the variance of MLM, such as auto-regressive models. 24 Empirical works using MLM showed little attention to this issue: among articles that employed RI models to estimate a coefficient of interest, all 24 articles in education, 29 of 39 articles in political science, and 14 of 21 articles in sociology used the default MLM standard errors (Σ = σ²I_N) without discussion or justification. We surmise there are several reasons for this. The first is the misconception, documented above, that MLM automatically produces correct standard errors for any multilevel data structure without further consideration. Second, and compounding the first, software widely used for MLM estimation does not always allow alternative variance structures beyond Σ = σ²I or nonconstant Ω. 25 Finally, investigators may reasonably worry that correctly specifying covariance structures using theory or prior knowledge is not feasible, and decide to instead accept default choices. This makes the CRSE approach a particularly attractive option, at least when groups can be defined such that between-group residual dependence is arguably ruled out.

Conclusions
Different methodological traditions have responded to the challenges posed by grouped data with divergent solutions: FE with modified standard errors, or MLMs with random effects. To demystify their properties, we show that (i) random effects invoked in MLMs can be understood as regularized FE, explaining MLM's improved predictive power, ability to include group-level variables, and bias problem; (ii) this bias can be addressed; and (iii) the "default" standard errors under MLM do not necessarily address all concerns with intragroup dependency in multilevel data.
We thus recommend estimating coefficients with either FE or bcMLM, with the assumption of spherical errors. In both cases, CRSEs offer a flexible approach to variance estimation, particularly if an argument can be made for independence across clusters. Fortunately, these two approaches produce identical point estimates and standard errors for the coefficients they share. Hence, for those willing to make these adjustments and focused on inference regarding a treatment coefficient, the question of whether FE or MLM is "more appropriate" is irrelevant. Both approaches sacrifice the potential efficiency gain that an uncorrected MLM would offer had its strict assumptions been true. 26 We consider this a small and acceptable price to pay to avoid the risk of bias and higher RMSE that occurs under uncorrected MLM in the presence of correlated random effects. Accordingly, we suggest these unbiased approaches rather than any approach that seeks to mix estimators (e.g., Cheng, Liao, and Shi 2019) or to choose between FE (or bcMLM) and uncorrected MLM based on some criterion. For example, we do not advocate choosing uncorrected MLM when the number of observations per group is above some threshold: one cannot know how many observations will be enough for the bias (and RMSE) to become acceptably small, and any potential efficiency or accuracy gain of MLM relative to FE diminishes with group size anyway. We similarly do not advocate a statistical testing approach such as Hausman (1978): if one is concerned that one of those estimates (from uncorrected random effects) is incorrect, then knowing whether the observed difference in the estimates is statistically significant is of little relevance. Although our advice regarding CRSEs is less common, our debiasing recommendations echo Raudenbush (2009), Bell and Jones (2015), and Bell, Fairbrother, and Jones (2019). They are also consistent with the "correlated random effects" approaches in econometrics, which employ the Mundlak (1978) solution, noted in texts including Wooldridge (2010) and Greene (2012). Nevertheless, such advice has gone largely unheeded in political science and education, and to some degree in sociology. 27 Finally, while FE and bcMLM produce identical results for the coefficients they share, the approaches differ in that (i) bcMLM has improved predictive (out-of-sample) accuracy for the outcome, and (ii) bcMLM retains the ability to include group-level covariates and cross-level interactions in the model. Whether users seek predictive accuracy from the same model in which they estimate an unbiased effect of a key covariate is a question of research goals, not addressed here.

24 Only Snijders and Bosker (2011) discussed CRSEs in depth. Among pedagogical articles, Cameron and Miller (2015) clearly describe the connection between MLM standard errors and CRSEs, while Heisig, Schaeffer, and Giesecke (2017) compare coverage rates of model-based MLM standard errors and CRSEs in simulations.

25 At the time of writing, lme4 in R and VARCOMP in SPSS do not allow nonspherical observation-level errors, while nlme in R, SAS MIXED, and MIXED in SPSS do. SAS NLMIXED also allows random effects to be heteroskedastic. Alternatively, more general Bayesian modeling and sampling software such as WinBUGS and Stan allow very flexible models.

26 bcMLM may have efficiency gains over FE, however, if one specifies a different model for Σ that nearly enough approximates the correct structure. See Appendix A.2. However, we are not aware of a proof of the general unbiasedness or consistency of bcMLM when nonspherical errors are assumed but Σ is possibly misspecified.
We also emphasize that the estimated coefficients for group-level covariates or cross-level interactions in bcMLM may be difficult to interpret. In addition to the usual identification concerns, the bias-correction step in bcMLM applies only to coefficients it shares with FE, and may even induce bias in those that are absent from FE. For users interested in these coefficients, bcMLM or FE regression can be followed by the appropriate per-cluster regression step.