A careful consideration of CLARIFY: simulation-induced bias in point estimates of quantities of interest

Abstract Some work in political methodology recommends that applied researchers obtain point estimates of quantities of interest by simulating model coefficients, transforming these simulated coefficients into simulated quantities of interest, and then averaging the simulated quantities of interest (e.g., CLARIFY). But other work advises applied researchers to directly transform coefficient estimates to estimate quantities of interest. I point out that these two approaches are not interchangeable and examine their properties. I show that the simulation approach compounds the transformation-induced bias identified by Rainey (2017), adding bias with direction and magnitude similar to the transformation-induced bias. I refer to this easily avoided additional bias as "simulation-induced bias." Even if researchers use simulation to estimate standard errors, they should directly transform maximum likelihood estimates of the coefficients to obtain point estimates of quantities of interest.


Introduction
Political scientists employ maximum likelihood (ML) to estimate a variety of statistical models. ML estimators have desirable and widely understood properties. But for many research questions, the model coefficient estimates do not directly interest the researcher. Following King et al. (2000), researchers often use the coefficient estimates to compute substantively meaningful quantities of interest, such as predicted probabilities, expected counts, marginal effects, and first differences. The literature offers two methods to compute point estimates for these quantities of interest.
Researchers estimate quantities of interest either by simulating quantities of interest and then averaging (e.g., King et al., 2000) or by directly transforming coefficients into quantities of interest (e.g., Herron, 1999). In practice, researchers' choice between these two approaches seems idiosyncratic rather than principled, depending on their preferred software package rather than any statistical criteria. Further, the methodological literature has not distinguished or compared the two approaches to estimating quantities of interest.
How does the simulation approach compare to directly transforming coefficients? Which should researchers prefer? Or are the two approaches interchangeable, as the literature seems to imply? Rainey (2017) shows that directly transforming coefficients into quantities of interest creates "transformation-induced" bias. I show that when researchers use the average of simulations to estimate quantities of interest, they replicate the logic of transformation-induced bias and add an additional, unnecessary bias to their estimates. I refer to this additional bias as "simulation-induced" bias. I show that simulation-induced bias occurs in addition to transformation-induced bias and is approximately the same size and direction. While this bias is usually small relative to the standard error, methodologists should not recommend methods that add unnecessary bias to point estimates. Instead, we should recommend methods that better adhere to the usual justifications.

The current practice
2.1. The plug-in estimator
First, researchers can directly transform the coefficient estimates into quantities of interest. The invariance property of ML estimators allows a researcher to find the ML estimate of a function of a parameter (i.e., a quantity of interest) by first using ML to estimate the model parameter and then applying the function to that estimate (or "plugging in the estimate") (King, 1998, 75-76; Casella and Berger, 2002, 320-321). I refer to this approach as the "plug-in" estimator. Importantly, the plug-in estimator remains an ML estimator. Thus, it retains all of the (desirable) properties of ML estimators (e.g., consistency, asymptotic efficiency).
As a concrete example, suppose a researcher uses ML to estimate a statistical model in which y_i ∼ f(θ_i) for i ∈ {1, 2, …, N} and f represents a probability distribution. The parameter θ_i connects to a design matrix X of k explanatory variables and a column of ones by a link function g(·), so that g(θ_i) = X_i β, where β ∈ R^(k+1) represents a vector of parameters with length k + 1. The researcher uses ML to compute estimates β̂_mle for the parameter vector β. I denote the function that transforms model coefficients into quantities of interest as τ(·). For example, if the researcher uses a logit model and focuses on the predicted probability for a specific observation X_c, then τ(β) = logit^(−1)(X_c β) = 1/(1 + e^(−X_c β)). The researcher can use the invariance property to quickly obtain an ML estimate of the predicted probability: τ̂_mle = τ(β̂_mle) = logit^(−1)(X_c β̂_mle) = 1/(1 + e^(−X_c β̂_mle)).
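The plug-in calculation above can be sketched in a few lines. The following is a minimal Python sketch: the coefficient estimates and the covariate profile are made-up illustrative values, not estimates from any real model.

```python
import numpy as np

# Hypothetical logit coefficient estimates (intercept + one covariate);
# in practice these would come from ML estimation of the model.
b_mle = np.array([-1.0, 0.5])
x_c = np.array([1.0, 2.0])  # covariate profile of interest (leading 1 = intercept)

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invariance property: plug the ML coefficient estimates into tau directly.
tau_mle = inv_logit(x_c @ b_mle)
print(tau_mle)  # inv_logit(-1.0 + 0.5 * 2.0) = inv_logit(0.0) = 0.5
```

The key point is that no simulation is involved: the quantity of interest is a deterministic function of the ML coefficient estimates.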

The average-of-simulations estimator
Second, researchers can use the average of simulated quantities of interest as the point estimator. King et al. (2000) suggest the following approach:

1. Estimate the model coefficients. Use ML to compute the estimate β̂_mle and its estimated covariance matrix V̂(β̂_mle).

2. Simulate the model coefficients. Draw M simulated coefficient vectors β̃^(i) ∼ N(β̂_mle, V̂(β̂_mle)) for i ∈ {1, 2, …, M}.

3. Transform each simulated coefficient vector into a simulated quantity of interest, so that τ̃^(i) = τ(β̃^(i)). Most quantities of interest depend on the values of the explanatory variables. In this situation, researchers either focus on a specific observation (typically some kind of "average case") or average across all sample observations (Hanmer and Kalkan, 2013). In any case, the transformation τ(·) includes this choice. (As King et al. (2000) note, this step might require additional simulation, to first introduce and then average over fundamental uncertainty. I ignore this additional step since (1) it is usually not necessary and (2) including it does not affect my argument.)

4. Average the simulations of the quantity of interest. Estimate the quantity of interest using the average of the simulations, so that τ̂_avg = (1/M) Σ_{i=1}^{M} τ̃^(i).

I refer to this as the "average-of-simulations" estimator. But what are the properties of this estimator? While the estimates it provides are sometimes similar to well-behaved plug-in estimates, King et al. (2000) develop their method informally. Much of their theoretical argument happens quickly when they write "we draw many plausible sets of parameters from their posterior or sampling distribution" (349). One might justify their method from a frequentist perspective by first thinking of it as "informal" Bayesian posterior simulation (Gelman and Hill, 2007). This is helpful because the theory and practice of simulating from a posterior distribution are well-developed and widely understood. The Bernstein-von Mises theorem (Van der Vaart, 2000, 140-146) guarantees, under relatively weak assumptions, that posterior simulations are asymptotically equivalent to the simulation procedure suggested by King, Tomz, and Wittenberg. And because the point estimator β̂_mle (and functions of β̂_mle) are consistent, the mean of the simulations (and the mean of functions of the simulations) is consistent as well. Therefore, one can defend τ̂_avg on the grounds that it is a consistent estimator of τ.
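The steps of the average-of-simulations estimator can be sketched as follows. This is a minimal Python sketch in which the coefficient estimates, covariance matrix, and covariate profile are hypothetical stand-ins for ML output, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (hypothetical ML output): coefficient estimates and covariance matrix.
b_mle = np.array([-1.0, 0.5])
V_hat = np.array([[0.04, 0.00],
                  [0.00, 0.01]])
x_c = np.array([1.0, 4.0])  # covariate profile for the quantity of interest
M = 100_000

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 2: simulate M coefficient vectors from N(b_mle, V_hat).
b_tilde = rng.multivariate_normal(b_mle, V_hat, size=M)

# Step 3: transform each simulated coefficient vector into a simulated
# quantity of interest (here, the predicted probability at x_c).
t_tilde = inv_logit(b_tilde @ x_c)

# Step 4: average the simulated quantities of interest.
t_avg = t_tilde.mean()

t_mle = inv_logit(b_mle @ x_c)  # plug-in estimate, for comparison
```

In this sketch the linear predictor at x_c is positive, where the inverse-logit curve is concave, so t_avg lands slightly below the plug-in estimate t_mle. The two estimators are not identical.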
However, the small sample properties of the average-of-simulations estimator remain poorly understood. Methodologists and applied researchers seem to assume that τ̂_avg is interchangeable with τ̂_mle. But below, I show that the average-of-simulations algorithm compounds the transformation-induced bias described by Rainey (2017) and adds a similar bias to the estimate that I refer to as "simulation-induced bias."

A theory of simulation-induced bias
Before developing the theory of simulation-induced bias, I review the concept of transformation-induced bias from Rainey (2017). As Rainey (2017) shows, transforming unbiased model coefficient estimates can introduce bias into estimates of quantities of interest. Rainey (2017, 404) decomposes the bias in the estimate of the quantity of interest, which he refers to as total τ-bias, into two components: transformation-induced τ-bias and coefficient-induced τ-bias. Rainey (2017) defines these as

total τ-bias = E[τ(β̂_mle)] − τ(β), where
transformation-induced τ-bias = E[τ(β̂_mle)] − τ(E[β̂_mle]) and
coefficient-induced τ-bias = τ(E[β̂_mle]) − τ(β). (1)

Transformation-induced τ-bias behaves systematically. The shape of the transformation τ(·) determines the direction of the bias. In general, any strictly convex (concave) τ(·) creates upward (downward) transformation-induced τ-bias. The direction and magnitude of the coefficient-induced τ-bias depend on the choice of τ(·) and the bias in the coefficient estimates, but an unbiased estimator β̂_mle implies the absence of coefficient-induced τ-bias. But Rainey (2017) does not consider the average-of-simulations estimator. This raises the question: does the average-of-simulations estimator τ̂_avg suffer the same transformation-induced bias as the plug-in estimator τ̂_mle? I now turn to the average-of-simulations estimator and develop the idea of "simulation-induced bias." (In the discussion that follows, I assume no Monte Carlo error exists in τ̂_avg; in other words, I assume that M is sufficiently large.)

If the transformation of coefficient estimates into an estimate of the quantity of interest is always convex (or always concave), then Jensen's inequality allows the simple statement relating τ̂_avg and τ̂_mle given in Lemma 1.

Lemma 1 Suppose a nondegenerate ML estimator β̂_mle. For any strictly convex τ(·), τ̂_avg > τ̂_mle; for any strictly concave τ(·), τ̂_avg < τ̂_mle.
This result is intuitive. Since I simulate using a multivariate normal distribution, β̃ has a symmetric distribution. But the distribution of τ(β̃) is not symmetric. If β̃ happens to fall below the mode β̂_mle, then τ(·) pulls τ(β̃) in toward τ̂_mle. If β̃ happens to fall above the mode β̂_mle, then τ(·) pushes τ(β̃) away from τ̂_mle. This creates a right-skewed distribution for τ(β̃), which pushes the average τ̂_avg above τ̂_mle. See Appendix A for the proof.
For a convex transformation, Lemma 1 shows that τ̂_avg is always larger than τ̂_mle. I refer to the expectation of this difference between τ̂_avg and τ̂_mle as "simulation-induced bias," so that simulation-induced τ-bias = E(τ̂_avg) − E(τ̂_mle). Theorem 1 compares the sum of simulation- and transformation-induced τ-bias in τ̂_avg to the transformation-induced τ-bias in τ̂_mle.
Theorem 1 Suppose a nondegenerate ML estimator β̂_mle. Then for any strictly convex or strictly concave τ(·), the sum of the simulation-induced and transformation-induced τ-bias in τ̂_avg is strictly greater in magnitude than the transformation-induced τ-bias in τ̂_mle.
Regardless of the direction of simulation-induced and transformation-induced τ-bias, Theorem 1 shows that the magnitude of the combination in τ̂_avg is always larger than the transformation-induced bias alone in τ̂_mle for strictly convex or concave τ(·). The proof follows directly from Jensen's inequality, but see Appendix A for the details.
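A short numerical check makes Lemma 1 and Theorem 1 concrete. The following toy Python sketch uses τ(β) = exp(β), a strictly convex transformation, with a made-up estimate β̂ = 1 and standard error 0.5; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check of the convex case: tau(b) = exp(b), hypothetical estimate and SE.
b_mle, se = 1.0, 0.5
b_tilde = rng.normal(b_mle, se, size=1_000_000)

t_mle = np.exp(b_mle)           # plug-in estimate
t_avg = np.exp(b_tilde).mean()  # average-of-simulations estimate

# Lognormal identity: E[exp(b)] = exp(b_mle + se**2 / 2) > exp(b_mle),
# so the average of simulations always exceeds the plug-in estimate.
print(t_avg > t_mle)  # True
```

Here the gap E[exp(β̃)] − exp(β̂_mle) = exp(1.125) − exp(1) is not Monte Carlo noise: it persists no matter how many simulations are drawn, which is exactly the sense in which averaging simulations adds bias rather than removing it.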
Theorem 1 shows that τ̂_avg compounds transformation-induced τ-bias with simulation-induced τ-bias. But is this bias substantively important? An analytical approximation provides a helpful guideline.
I approximate the simulation-induced τ-bias in τ̂_avg as

simulation-induced τ-bias ≈ E{(1/2) tr[H(β̂_mle) V̂(β̂_mle)]}, (2)

where the remaining expectation occurs with respect to β̂_mle, H(β̂_mle) represents the Hessian matrix of second derivatives of τ(·) at the point β̂_mle and, conveniently, V̂(β̂_mle) represents the estimated covariance matrix for β̂_mle. This approximation appears similar to the approximation for the transformation-induced τ-bias, which (adjusting notation slightly) Rainey (2017, 405, Equation (1)) computes as

transformation-induced τ-bias ≈ (1/2) tr{H[E(β̂_mle)] V(β̂_mle)}, (3)

where H[E(β̂_mle)] represents the Hessian matrix of second derivatives of τ(·) at the point E(β̂_mle) and V(β̂_mle) represents the covariance matrix of the sampling distribution of β̂_mle. When one compares Equations (2) and (3), one yet again compares the expectation of a function to the function of the expectation. Therefore, Equations (2) and (3) are not exactly equal. But, as a rough guideline, one should expect them to be similar. And to the extent that the two are similar, the additional simulation-induced τ-bias in τ̂_avg is about the same as the transformation-induced τ-bias in τ̂_mle.
Because of the similarity between Equations (2) and (3), the simulation-induced τ-bias becomes large under the conditions identified by Rainey (2017) as leading to large transformation-induced τ-bias: when the non-linearity in the transformation τ(·) is severe and when the standard errors of β̂_mle are large. While the transformation-induced τ-bias vanishes as the number of observations grows large, it can be substantively meaningful for the sample sizes commonly encountered in social science research. In standard modeling situations, Rainey (2017) demonstrates that transformation-induced bias can (1) be larger than the bias in the estimates of the model coefficients and (2) shrink to zero more slowly as the sample size increases. By extension, the same claims hold for simulation-induced bias (which, again, appears in addition to transformation-induced bias).
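The trace approximation can be checked numerically for a single realization of the coefficient estimates. The sketch below uses a hypothetical transformation τ(β) = exp(β1 + β2), whose Hessian is exp(β1 + β2) times a matrix of ones, and a made-up covariance matrix; it compares (1/2) tr(H V) against the simulated gap between the average of simulations and the plug-in value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical coefficient estimates and covariance matrix.
b_mle = np.array([0.2, -0.1])
V = np.array([[0.09, 0.01],
              [0.01, 0.04]])

def tau(b):
    # Convex transformation of the coefficients: exp(b1 + b2).
    return np.exp(b[..., 0] + b[..., 1])

# Hessian of exp(b1 + b2) is exp(b1 + b2) * ones((2, 2)).
H = np.exp(b_mle.sum()) * np.ones((2, 2))
approx_bias = 0.5 * np.trace(H @ V)  # second-order trace approximation

# Simulated gap between the average of simulations and the plug-in value.
b_tilde = rng.multivariate_normal(b_mle, V, size=1_000_000)
actual_bias = tau(b_tilde).mean() - tau(b_mle)
```

For these values the approximation is about 0.083 and the exact gap (from the lognormal identity) is about 0.086, so the trace formula is a good rough guideline, not an exact result.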

The intuition of simulation-induced bias
To develop the intuition for the theoretical results above, I examine a stylized example with simulations, an alternative analytical approach, and an empirical example.

4.1. Using a drastic, convex transformation: τ(μ) = μ^2
To develop an intuition for the simulation-induced τ-bias in τ̂_avg, consider the simple (unrealistic, but heuristically useful) scenario in which y_i ∼ N(0, 1), for i ∈ {1, 2, …, n = 100}, and the researcher wishes to estimate μ^2. Suppose that the researcher knows that the variance equals one but does not know that the mean μ equals zero. The researcher uses the unbiased ML estimator μ̂_mle = (1/n) Σ_{i=1}^{n} y_i of μ, but ultimately cares about the quantity of interest τ(μ) = μ^2. The researcher can use the plug-in estimator τ̂_mle = (μ̂_mle)^2 of τ(μ). Alternatively, the researcher can use the average-of-simulations estimator, estimating τ(μ) as τ̂_avg = (1/M) Σ_{i=1}^{M} τ(μ̃^(i)), where μ̃^(i) ∼ N(μ̂_mle, 1/√n) for i ∈ {1, 2, …, M}. The true value of the quantity of interest is τ(0) = 0^2 = 0. However, the ML estimator τ̂_mle = (μ̂_mle)^2 equals zero if and only if μ̂_mle = 0. Otherwise, τ̂_mle > 0. Since μ̂_mle almost surely differs from zero, τ̂_mle is biased upward.
I illustrate this fact clearly by repeatedly simulating y and computing τ̂_mle and τ̂_avg. Figure 1 shows the first four of 10,000 total simulations. The figure shows how the unbiased estimate μ̂_mle is translated into τ̂_mle and τ̂_avg.
First, to find τ̂_avg, I complete three steps: (1) simulate μ̃^(i) ∼ N(μ̂_mle, 1/10) for i ∈ {1, 2, …, M = 1,000}, (2) calculate τ̃^(i) = τ(μ̃^(i)), and (3) calculate τ̂_avg = (1/M) Σ_{i=1}^{M} τ̃^(i). The rug plot along the horizontal axis shows the distribution of μ̃. The hollow points in Figure 1 show the transformation of each point μ̃^(i) into τ̃^(i). The rug plot along the vertical axis shows the distribution of τ̃. Focus on the top-left panel of Figure 1. Notice that μ̂_mle estimates the true value μ = 0 quite well. However, after simulating μ̃ and transforming μ̃ into τ̃, the τ̃s fall far from the true value τ(0) = 0. The dashed orange line shows the average of τ̃. Notice that although μ̂_mle is unusually close to the truth μ = 0 in this sample, τ̂_avg is much too large.
Second, to find τ̂_mle, I transform μ̂_mle directly using τ̂_mle = (μ̂_mle)^2. The solid green lines show this transformation. The convex transformation τ(·) has the effect of lengthening the right tail of the distribution of τ̃, pulling the average well above the mode. This provides the basic intuition for Lemma 1.
The remaining panels of Figure 1 repeat this process with three more random samples. Each sample presents a similar story: the convex transformation stretches the distribution of τ̃ to the right, which pulls τ̂_avg above τ̂_mle.
I repeat this process 10,000 total times to produce 10,000 estimates of μ̂_mle, τ̂_mle, and τ̂_avg. Figure 2 shows the density plots for the 10,000 estimates (i.e., the sampling distributions of μ̂_mle, τ̂_mle, and τ̂_avg). As I show analytically, μ̂_mle is unbiased with a standard error of σ/√n = 1/√100 = 1/10. Both τ̂_mle and τ̂_avg are biased upward, but, as Theorem 1 suggests, τ̂_avg has more bias than τ̂_mle. And as the approximation suggests, τ̂_avg has about twice the bias of τ̂_mle. Indeed, the exact bias of each estimator is easy to compute for this simple example. Appendix B shows that the biases are 1/n = 1/100 and 2/n = 2/100, respectively, in this example.
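The 10,000-repetition exercise is easy to replicate. The following is a short Python sketch of the same design (n = 100, M = 1,000, true μ = 0, known variance 1), not the paper's own replication code; the theoretical biases are 1/n = 0.01 for the plug-in estimator and 2/n = 0.02 for the average-of-simulations estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M, reps = 100, 1_000, 10_000

t_mle = np.empty(reps)
t_avg = np.empty(reps)
for r in range(reps):
    y = rng.normal(0.0, 1.0, size=n)
    m_mle = y.mean()                                    # unbiased ML estimate of mu
    t_mle[r] = m_mle ** 2                               # plug-in estimate of mu^2
    m_tilde = rng.normal(m_mle, 1 / np.sqrt(n), size=M) # simulated coefficients
    t_avg[r] = np.mean(m_tilde ** 2)                    # average of simulations

# Sampling-distribution means approximate the biases (true value is 0):
# t_mle.mean() is near 1/n = 0.01, t_avg.mean() is near 2/n = 0.02.
print(t_mle.mean(), t_avg.mean())
```

The doubling is visible directly in the two averages: the simulation step adds a second bias of the same size as the transformation-induced bias already present in the plug-in estimator.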

Using the law of iterated expectations
One can also develop the argument analytically via the law of iterated expectations. It helps to alter the notation slightly, making two implicit dependencies explicit. I explain each change below and use the alternate, more expansive notation only in this section.
The law of iterated expectations states that E Y (E X|Y (X | Y)) = E X (X), where X and Y represent random variables. The three expectations occur with respect to three different distributions: E Y denotes the expectation with respect to the marginal distribution of Y, E X|Y denotes the expectation with respect to the conditional distribution of X| Y, and E X denotes the expectation with respect to the marginal distribution of X.
Outside of this section, the reader should understand that the distribution of β̃ depends on β̂_mle and could be written as β̃ | β̂_mle. To remain consistent with previous work, especially Herron (1999) and King et al. (2000), I use β̃ to represent β̃ | β̂_mle. The definition of β̃ makes this usage clear. In this section only, I use β̃ | β̂_mle to represent the conditional distribution of β̃ and use β̃ to represent the unconditional distribution of β̃. Intuitively, one might imagine (1) generating a data set y, (2) estimating β̂_mle, and (3) simulating β̃ | β̂_mle. If I perform steps (1) and (2) just once, but step (3) repeatedly, I generate a sample from the conditional distribution β̃ | β̂_mle. If I perform steps (1), (2), and (3) repeatedly, then I generate a sample from the unconditional distribution β̃. The unconditional distribution helps us understand the nature of the simulation-induced τ-bias.
Applying the law of iterated expectations, I obtain E_β̃(β̃) = E_β̂mle(E_β̃|β̂mle(β̃ | β̂_mle)). The three identities below connect the three key quantities from Theorem 1 to three versions of E_β̂mle(E_β̃|β̂mle(β̃ | β̂_mle)), with the transformation τ(·) applied at different points:

τ(E_β̂mle(E_β̃|β̂mle(β̃ | β̂_mle))) = τ(E(β̂_mle)), (4)
E_β̂mle(τ(E_β̃|β̂mle(β̃ | β̂_mle))) = E(τ(β̂_mle)) = E(τ̂_mle), and (5)
E_β̂mle(E_β̃|β̂mle(τ(β̃) | β̂_mle)) = E(τ̂_avg). (6)
To obtain the τ-bias in τ̂_avg, I subtract Equation (4) from Equation (6). But to move from Equation (4) to (6), I must swap τ(·) with an expectation twice. Again, if τ(·) is convex, then Equation (6) must be greater than Equation (4). However, because one should expect β̂_mle and β̃ | β̂_mle to have similar distributions, one should expect the additional swap to roughly double the bias in τ̂_avg compared to τ̂_mle. This additional swap creates the additional, simulation-induced τ-bias.

4.3. Using a re-analysis of Holland (2015)
Holland (2015) presents a nuanced theory that describes the conditions under which politicians choose to enforce laws and supports the theoretical argument with a rich variety of evidence. In particular, she elaborates on the electoral incentives of politicians to enforce laws. I borrow three Poisson regressions and hypotheses about a single explanatory variable to illustrate how the plug-in estimates can differ from the average-of-simulations estimates.
Holland writes: My first hypothesis is that enforcement operations drop off with the fraction of poor residents in an electoral district. So district poverty should be a negative and significant predictor of enforcement, but only in politically decentralized cities [Lima and Santiago]. Poverty should have no relationship with enforcement in politically centralized cities [Bogota] once one controls for the number of vendors.
I use Holland's hypothesis and data to illustrate the behavior of the average-of-simulations and plug-in estimators. I refit Model 1 from Table 2. The plug-in estimate, then, is about 7 percent smaller than the average-of-simulations estimate: a small, but noticeable shrinkage. Figure 3 shows how the estimates change (usually shrink) for all districts when I switch from the average-of-simulations estimates to plug-in estimates.

Figure 3. This figure compares the average-of-simulations estimates with the plug-in estimates using three Poisson regression models from Holland (2015). The quantity of interest is the percent increase in enforcement operations when the percent of a district in the lower class drops by half. The arrows show how the estimates change when I switch from the average-of-simulations to the plug-in estimate.

Table 1 presents the details for the labeled cases in Figure 3. In Bogota, the estimate shrinks by 11 percent in Sante Fe and 16 percent in Usme. In Lima, the estimate shrinks by 5 percent in Chacalacayo and 7 percent in Villa Maria El Triunfo. The shrinkage is much larger in Santiago, where the standard errors for the coefficient estimates are much larger. The estimate shrinks by about 47 percent in San Ramon and 53 percent in La Pintana. The median shrinkage is 6 percent in Bogota, 2 percent in Lima, and 37 percent in Santiago. For many districts in Santiago, the average-of-simulations estimate is about twice the plug-in estimate. These estimates clearly show that the average of simulations and the ML estimate can differ meaningfully in actual analyses.
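The mechanics behind this shrinkage can be sketched without Holland's data. In a Poisson regression the quantity of interest runs through exp(·), a convex transformation, so a large coefficient standard error (as in the Santiago models) drives the two estimators apart. All numbers below are hypothetical illustrations, not estimates from Holland (2015).

```python
import numpy as np

rng = np.random.default_rng(4)

# Stylized Poisson-regression quantity of interest: the percent increase in
# the expected count when the lower-class share of a district drops by half.
b_poverty = -2.0         # hypothetical coefficient on district poverty share
se_poverty = 0.8         # large standard error, as in the Santiago models
x_hi, x_lo = 0.40, 0.20  # poverty share before and after the 50% drop

def pct_increase(b):
    # 100 * (exp(b * x_lo) / exp(b * x_hi) - 1): percent change in the
    # expected count when the share moves from x_hi to x_lo.
    return 100 * (np.exp(b * (x_lo - x_hi)) - 1)

t_mle = pct_increase(b_poverty)                          # plug-in estimate
b_tilde = rng.normal(b_poverty, se_poverty, size=1_000_000)
t_avg = pct_increase(b_tilde).mean()                     # average of simulations

# The convex exp transformation pushes t_avg above t_mle; shrinking the
# standard error toward zero closes the gap.
```

With these made-up values the plug-in estimate is about a 49 percent increase while the average of simulations is about 51 percent; inflating the standard error further widens the gap, mirroring the pattern across Bogota, Lima, and Santiago.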

Conclusion
Many social scientists turn to King et al. (2000) for advice on interpreting, summarizing, and presenting empirical results. The authors improved empirical research by highlighting the importance of substantively meaningful quantities of interest. I agree with King and Zeng's (2006) summary of the literature: "[w]hether such effects are calculated via analytical derivation or what is now the more common approach of statistical simulation, political scientists have made much progress in learning how to make sophisticated methods speak directly to their substantive research questions" (132).
Researchers estimate quantities of interest either by averaging simulated quantities of interest (e.g., CLARIFY in Stata, Zelig in R, or clarify in R) or using the invariance property of ML estimators (e.g., margins in Stata and R). In practice, researchers' choice between these two estimators seems idiosyncratic rather than principled, depending on their preferred software package rather than any statistical criteria. The methodological literature recommends both, but has not distinguished or compared the two approaches to estimating quantities of interest.
When researchers use the average of simulations (King et al., 2000) to estimate quantities of interest, they replicate the logic of transformation-induced bias (Rainey, 2017) and add simulation-induced bias to the estimates. This additional bias is roughly the same magnitude and direction as transformation-induced bias and occurs in addition to transformation-induced bias. While the additional bias is usually small relative to the standard error, methodologists should not recommend methods that add unnecessary bias to point estimates. Instead, we should recommend methods that better adhere to the usual evaluative standards. Even if we recommend using simulation to estimate standard errors, we should not recommend averaging simulations to obtain the point estimate. Instead, researchers should directly transform ML estimates of coefficients to obtain ML estimates of the quantities of interest. The resulting point estimates inherit the desirable properties of ML estimators and avoid unnecessary simulation-induced bias.
Supplementary material. The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2023.8. All data and computer code necessary to reproduce the results are available on Dataverse at https://doi.org/10.7910/DVN/ 5YKD4S.