## References

Adams, James, Clark, Michael, Ezrow, Lawrence, and Glasgow, Garrett. 2006. Are niche parties fundamentally different from mainstream parties? The causes and the electoral consequences of western European parties' policy shifts, 1976–1998. *American Journal of Political Science* 50(3):513–529.

Aiken, Leona S., West, Stephen G., and Reno, Raymond R. 1991. *Multiple regression: Testing and interpreting interactions*. London: Sage Publications.

Aklin, Michaël, and Urpelainen, Johannes. 2013. Political competition, path dependence, and the strategy of sustainable energy transitions. *American Journal of Political Science* 57(3):643–658.

Anderson, James H. 2013. Sunshine works—comment on 'The adverse effects of sunshine: A field experiment on legislative transparency in an authoritarian assembly'. World Bank Policy Research Working Paper (6602).

Athey, Susan, and Wager, Stefan. 2018. Estimation and inference of heterogeneous treatment effects using random forests. *Journal of the American Statistical Association*, doi:10.1080/01621459.2017.1319839.

Banks, Antoine J., and Valentino, Nicholas A. 2012. Emotional substrates of white racial attitudes. *American Journal of Political Science* 56(2):286–297.

Beck, Nathaniel, King, Gary, and Zeng, Langche. 2000. Improving quantitative studies of international conflict: A conjecture. *American Political Science Review* 94(1):21–35.

Berry, William D., Golder, Matt, and Milton, Daniel. 2012. Improving tests of theories positing interaction. *Journal of Politics* 74(3):653–671.

Bodea, Cristina, and Hicks, Raymond. 2015a. International finance and central bank independence: Institutional diffusion and the flow and cost of capital. *The Journal of Politics* 77(1):268–284.

Bodea, Cristina, and Hicks, Raymond. 2015b. Price stability and central bank independence: Discipline, credibility, and democratic institutions. *International Organization* 69(1):35–61.

Brambor, Thomas, Clark, William Roberts, and Golder, Matt. 2006. Understanding interaction models: Improving empirical analyses. *Political Analysis* 14:63–82.

Braumoeller, Bear F. 2004. Hypothesis testing and multiplicative interaction terms. *International Organization* 58(4):807–820.

Cameron, A. Colin, and Miller, Douglas L. 2015. A practitioner's guide to cluster-robust inference. *The Journal of Human Resources* 50(2):317–372.

Carpenter, Daniel, and Moore, Colin D. 2014. When canvassers became activists: Antislavery petitioning and the political mobilization of American women. *American Political Science Review* 108(3):479–498.

Chapman, Terrence L. 2009. Audience beliefs and international organization legitimacy. *International Organization* 63(4):733–764.

Clark, Michael, and Leiter, Debra. 2014. Does the ideological dispersion of parties mediate the electoral impact of valence? A cross-national study of party support in nine western European democracies. *Comparative Political Studies* 47(2):171–202.

Clark, William Roberts, and Golder, Matt. 2006. Rehabilitating Duverger's theory: Testing the mechanical and strategic modifying effects of electoral laws. *Comparative Political Studies* 39(6):679–708.

Cleveland, William S., and Devlin, Susan J. 1988. Locally weighted regression: An approach to regression analysis by local fitting. *Journal of the American Statistical Association* 83(403):596–610.

Fan, Jianqing, Heckman, Nancy E., and Wand, Matt P. 1995. Local polynomial kernel regression for generalized linear models and quasi-likelihood functions. *Journal of the American Statistical Association* 90(429):141.

Friedrich, Robert J. 1982. In defense of multiplicative terms in multiple regression equations. *American Journal of Political Science* 26(4):797–833.

Grimmer, Justin, Messing, Solomon, and Westwood, Sean J. 2014. Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods.

Hainmueller, Jens, and Hazlett, Chad. 2014. Kernel regularized least squares: Reducing misspecification bias with a flexible and interpretable machine learning approach. *Political Analysis* 22(2):143–168.

Hainmueller, Jens, Mummolo, Jonathan, and Xu, Yiqing. 2018. Replication data for: How much should we trust estimates from multiplicative interaction models? Simple tools to improve empirical practice. https://doi.org/10.7910/DVN/Q1V0OG, Harvard Dataverse, V1.

Hastie, Trevor, and Tibshirani, Robert. 1986. Generalized additive models. *Statistical Science* 1(3):297–318.

Hellwig, Timothy, and Samuels, David. 2007. Voting in open economies: The electoral consequences of globalization. *Comparative Political Studies* 40(3):283–306.

Hicken, Allen, and Simmons, Joel W. 2008. The personal vote and the efficacy of education spending. *American Journal of Political Science* 52(1):109–124.

Hosking, J. R. M. 1990. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. *Journal of the Royal Statistical Society. Series B (Methodological)* 52(1):105–124.

Huddy, Leonie, Mason, Lilliana, and Aarøe, Lene. 2015. Expressive partisanship: Campaign involvement, political emotion, and partisan identity. *American Political Science Review* 109(1):1–17.

Imai, Kosuke, Keele, Luke, and Yamamoto, Teppei. 2010. Identification, inference and sensitivity analysis for causal mediation effects. *Statistical Science* 25(1):51–71.

Imai, Kosuke, and Ratkovic, Marc. 2013. Estimating treatment effect heterogeneity in randomized program evaluation. *Annals of Applied Statistics* 7(1):443–470.

Jaccard, James, and Turrisi, Robert. 2003. *Interaction effects in multiple regression*, vol. 72. London: Sage Publications.

Kam, Cindy D., and Franzese, Robert J. Jr. 2007. *Modeling and interpreting interactive hypotheses in regression analysis*. Ann Arbor: The University of Michigan Press.

Kim, Henry A., and LeVeck, Brad L. 2013. Money, reputation, and incumbency in US House elections, or why marginals have become more expensive. *American Political Science Review* 107(3):492–504.

King, Gary, and Zeng, Langche. 2006. The dangers of extreme counterfactuals. *Political Analysis* 14(2):131–159.

Li, Qi, and Racine, Jeffrey S. 2010. Smooth varying-coefficient estimation and inference for qualitative and quantitative data. *Econometric Theory* 26:1–31.

Lin, Winston. 2013. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. *The Annals of Applied Statistics* 7(1):295–318.

Malesky, Edmund, Schuler, Paul, and Tran, Anh. 2012. The adverse effects of sunshine: A field experiment on legislative transparency in an authoritarian assembly. *American Political Science Review* 106(4):762–786.

Miratrix, Luke W., Sekhon, Jasjeet S., and Yu, Bin. 2013. Adjusting treatment effect estimates by post-stratification in randomized experiments. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 75(2):369–396.

Mosteller, Frederick, and Tukey, John Wilder. 1977. *Data analysis and regression: A second course in statistics*. Reading, MA: Addison-Wesley Publishing Co.

Neblo, Michael A., Esterling, Kevin M., Kennedy, Ryan P., Lazer, David M. J., and Sokhey, Anand E. 2010. Who wants to deliberate—and why? *American Political Science Review* 104(3):566–583.

Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. *Science* 349(6251):aac4716.

Pelc, Krzysztof J. 2011. Why do some countries get better WTO accession terms than others? *International Organization* 65(4):639–672.

Petersen, Michael Bang, and Aarøe, Lene. 2013. Politics in the mind's eye: Imagination as a link between social and political cognition. *American Political Science Review* 107(2):275–293.

Somer-Topcu, Zeynep. 2009. Timely decisions: The effects of past national elections on party policy change. *The Journal of Politics* 71(1):238–248.

Tavits, Margit. 2008. Policy positions, issue importance, and party competition in new democracies. *Comparative Political Studies* 41(1):48–72.

Truex, Rory. 2014. The returns to office in a 'rubber stamp' parliament. *American Political Science Review* 108(2):235–251.

Vernby, Kåre. 2013. Inclusion and public policy: Evidence from Sweden's introduction of noncitizen suffrage. *American Journal of Political Science* 57(1):15–29.

Williams, Laron K. 2011. Unsuccessful success? Failed no-confidence motions, competence signals, and electoral support. *Comparative Political Studies* 44(11):1474–1499.

Wood, Simon N. 2003. Thin plate regression splines. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 65(1):95–114.

1 There obviously exist many sophisticated estimation approaches that are more flexible, such as Generalized Additive Models (Hastie and Tibshirani 1986; Wood 2003), Neural Networks (Beck, King, and Zeng 2000), or Kernel Regularized Least Squares (Hainmueller and Hazlett 2014). We do not intend to critique these approaches. Our perspective in this study is that many applied scholars prefer to remain within their familiar regression framework to test conditional hypotheses, and our proposals are geared toward this audience. Also, since our replications are based on articles published in top political science journals, our conclusions about the state of empirical practice apply to political science, although similar problems might be present in other disciplines.

2 Other advice includes Friedrich (1982), Aiken, West, and Reno (1991), Jaccard and Turrisi (2003), Braumoeller (2004), Kam and Franzese Jr. (2007), Berry, Golder, and Milton (2012).

3 As of February 2018, Brambor, Clark, and Golder (2006) has been cited over 4,200 times according to Google Scholar, which makes it one of the most cited political science articles in recent decades.

4 For example, in Bodea and Hicks (2015b), the authors wrote: “At low levels of POLITY, the marginal effect of CBI is positive but statistically insignificant. Similarly, the marginal effect of CBI is negative and significant only when the FREEDOM HOUSE score is greater than about 5” (p. 49). Similarly, in Clark and Leiter (2014), the authors write that when the moderator, “party dispersion,” is set to one standard deviation above its mean, “a change from the minimum value of valence to the maximum value…” corresponds to “a 10-point increase in predicted vote share—more than double that of predicted change in vote share for the mean value of party dispersion, and a sufficient change in vote share to move a party from government to opposition.” (p. 186). We thank the Editor and anonymous reviewers for highlighting this point.

6 A moderator is different from a mediator, which is a variable that accounts for at least part of the treatment effect (see, for example, Imai, Keele, and Yamamoto (2010)).

7 Note that the designation of one of the independent variables as the treatment and the other as the moderator is done without loss of generality. The typical approach in most empirical studies in our replication sample is to designate one variable as the treatment of interest and another variable as the moderator, for example, in randomized experiments or observational studies where one variable is (quasi) randomly assigned and a pretreatment covariate moderates the treatment effect on the outcome. In other designs, such as multifactorial experiments, there might be two variables that can be viewed as treatments and the same diagnostics and estimation strategies that we propose here can be applied to estimate how the effect of one treatment is moderated by the other and vice versa.

8 The LIE assumption would also hold in the special case where both functions are nonlinear but the difference between both of these functions is a linear function. This is unlikely in empirical settings and never occurs in any of our replications.

10 Although the linear regression framework is flexible enough to incorporate higher-order terms of $X$ and their interaction with $D$ (see Kam and Franzese Jr. 2007; Berry, Golder, and Milton 2012), this is rarely done in practice. In fact, not a single study in our replication sample incorporated higher-order terms.
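To make concrete what such a specification looks like, here is a minimal sketch on simulated data (the variable names, coefficient values, and the quadratic functional form are purely illustrative, not taken from any replicated study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
D = rng.integers(0, 2, n).astype(float)   # binary treatment (illustrative)
X = rng.uniform(-2, 2, n)                  # continuous moderator (illustrative)

# Hypothetical DGP: the effect of D is quadratic in X: 1 + 0.5*X - 0.75*X**2
Y = 2 + 1.0*D + 0.3*X + 0.2*X**2 + 0.5*D*X - 0.75*D*X**2 + rng.normal(0, 0.1, n)

# Design matrix with higher-order terms of X and their interactions with D
Z = np.column_stack([np.ones(n), D, X, X**2, D*X, D*X**2])
beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)

def marginal_effect(x):
    """Estimated conditional effect of D at moderator value x."""
    return beta[1] + beta[4]*x + beta[5]*x**2
```

The conditional effect of the treatment at any moderator value is then a linear combination of the coefficients on $D$, $D \cdot X$, and $D \cdot X^2$, rather than a single constant slope shift.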

11 Of course, if the model happens to be correct, estimated conditional effects will still be consistent and unbiased despite the common support issue. We thank an anonymous reviewer for highlighting this point.

12 The same plots can be constructed after residualizing with respect to the covariates $Z$.
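A minimal sketch of this residualizing step (Frisch–Waugh–Lovell-style partialling out) on simulated data, with all variable names hypothetical:

```python
import numpy as np

def residualize(v, Z):
    """Residuals of v after a least-squares projection on an intercept
    and the covariates Z (partialling out)."""
    Z1 = np.column_stack([np.ones(len(v)), Z])
    beta, *_ = np.linalg.lstsq(Z1, v, rcond=None)
    return v - Z1 @ beta

rng = np.random.default_rng(1)
n = 300
Z = rng.normal(size=(n, 2))               # covariates
X = 0.5 * Z[:, 0] + rng.normal(size=n)    # moderator correlated with Z
Y = 1 + 2 * X + Z @ np.array([1.0, -1.0]) + rng.normal(size=n)

Y_res, X_res = residualize(Y, Z), residualize(X, Z)
# Y_res can now be plotted against X_res exactly like the raw-data plots.
```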

13 In addition, researchers can plot the estimated density of $X$ in both groups in a single plot to further judge the range of common support.
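A hedged sketch of how the two densities might be computed, using a hand-rolled Gaussian kernel density estimate on simulated data (the actual overlay plotting call is left to the reader's preferred library):

```python
import numpy as np

def gaussian_kde(sample, grid, bw=None):
    """Simple Gaussian kernel density estimate of `sample` on `grid`.
    Uses Silverman's rule of thumb when no bandwidth is supplied."""
    sample = np.asarray(sample, dtype=float)
    if bw is None:
        bw = 1.06 * sample.std() * len(sample) ** (-1 / 5)
    u = (grid[:, None] - sample[None, :]) / bw
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(sample) * bw * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
x_treat = rng.normal(1.0, 1.0, 200)     # moderator among the treated (illustrative)
x_ctrl = rng.normal(-0.5, 1.0, 200)     # moderator among the controls (illustrative)
grid = np.linspace(-5, 6, 400)
d_treat, d_ctrl = gaussian_kde(x_treat, grid), gaussian_kde(x_ctrl, grid)
# Overlaying d_treat and d_ctrl in one plot shows where the two groups overlap.
```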

14 See Appendix for more information on this strategy.

15 This idea is analogous to breaking a continuous variable, such as age, into several bins in a linear regression model. We thank the Editor for highlighting this point.
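For illustration, a minimal sketch of the binning step on simulated data (tercile cut points; the moderator and sample size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 300)  # hypothetical continuous moderator

# Split the moderator into three roughly equal-sized bins (terciles),
# then build one indicator column per bin, as one would for a regression.
cuts = np.quantile(x, [1/3, 2/3])
bin_id = np.digitize(x, cuts)     # 0 = low, 1 = medium, 2 = high
dummies = np.eye(3)[bin_id]       # indicator matrix, one column per bin
```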

17 Note that in the context of a randomized experiment, a regression of the outcome on the treatment, the demeaned covariates, and the interaction between the treatment and the demeaned covariates provides a semiparametric and asymptotically efficient estimator of the average treatment effect under the Neyman model for randomization inference (Lin 2013; Miratrix, Sekhon, and Yu 2013). In this context, our binning estimator is similar except that it applies to subgroups of the sample defined by the bins of the moderator.
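A sketch of this demeaned-covariate specification in the spirit of Lin (2013), on simulated data (the treatment effect, covariate coefficients, and heterogeneity pattern are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
D = rng.integers(0, 2, n).astype(float)   # randomized binary treatment
X = rng.normal(size=(n, 2))               # pre-treatment covariates
tau = 1.5                                 # true average treatment effect (by construction)
Y = tau*D + X @ np.array([1.0, -2.0]) + D * (0.5 * X[:, 0]) + rng.normal(size=n)

# Regress the outcome on the treatment, the demeaned covariates, and the
# treatment interacted with the demeaned covariates.
Xc = X - X.mean(axis=0)
design = np.column_stack([np.ones(n), D, Xc, D * Xc[:, 0], D * Xc[:, 1]])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
ate_hat = beta[1]   # coefficient on D estimates the average treatment effect
```

Because the covariates are demeaned, the coefficient on the treatment indicator is interpretable as the average treatment effect at the sample mean of the covariates.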

18 Clearly, one could construct cases where the distribution of $X$ within a bin is highly bimodal and therefore the bin median might involve interpolation, but this is not very common in typical political science studies. In fact, in our nearly 50 replications of recently published interaction effects, we found not a single case where this potential problem occurs (see below).

19 Berry, Golder, and Milton (2012) also recommend adding a frequency distribution of the moderator to the marginal-effect plots. We argue that in the case of a binary treatment it is advantageous to distinguish in the histogram between the two groups to get a better sense of the overlap across groups.

20 For theoretical properties of the kernel smoothing estimator, see Li and Racine (2010).

21 Compared with estimates from the conventional linear interaction model shown in Figure 2(a), we see that the kernel estimator does not result in a large increase in the uncertainty of the estimates when the linear interaction model is correct. This is mainly because when the LIE assumption is correct, the cross-validation scheme is likely to choose a large bandwidth.
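To make the kernel estimator concrete, here is a minimal local-linear sketch on simulated data. It uses a fixed bandwidth for simplicity, whereas the estimator discussed in the text selects the bandwidth by cross-validation; the data-generating process and all names are hypothetical:

```python
import numpy as np

def kernel_marginal_effect(y, d, x, x0, bw):
    """Local-linear estimate of the effect of binary d at moderator value x0:
    weighted least squares of y on d, (x - x0), and d*(x - x0), with
    Gaussian kernel weights centered at x0."""
    w = np.exp(-0.5 * ((x - x0) / bw) ** 2)
    Z = np.column_stack([np.ones_like(x), d, x - x0, d * (x - x0)])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
    return beta[1]  # coefficient on d = marginal effect at x0

rng = np.random.default_rng(5)
n = 2000
d = rng.integers(0, 2, n).astype(float)
x = rng.uniform(-2, 2, n)
y = x + d * np.sin(x) + rng.normal(0, 0.3, n)   # true effect of d is sin(x)

me_at_0 = kernel_marginal_effect(y, d, x, x0=0.0, bw=0.5)
```

Evaluating `kernel_marginal_effect` on a grid of moderator values traces out the smooth marginal-effect curve; a larger bandwidth pulls the curve toward the global linear fit, which is why the kernel estimate stays close to the linear interaction model when the LIE assumption holds.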

22 We cap the number of replicated interactions at four per study. In the rare cases with more than four interaction plots we chose the four most important ones based on our reading of the article.

23 According to Google Scholar, our replicated studies have been cited nearly 1,900 times as of December 2016. The mean number of citations per article was roughly 86.

25 The medians of the three terciles, 0.35, 0.53, 1.77, refer to the 17th, 50th, and 83rd percentile of the moderator, respectively.

26 There are four observations at an Internet penetration of 4.07, 22 at 6.47, and 20 at 8.63. Together these make up less than 10% of all observations.

27 Our replications below show that the same problem applies to the three other outcomes used by Malesky, Schuler, and Tran (2012) in their Figure 1.

28 Note that the unit of analysis in the original study is the delegate, who is exposed or not exposed to the intervention, while the moderator Internet penetration is measured at the level of the province. In our robustness check we drop only four delegates, but keep all provinces in the data including all the metropolitan areas with extreme values of Internet penetration. Also note that we have no theoretical rationale to drop these data points. This is merely a robustness check to demonstrate the fragility of estimates from a linear interaction model that relies on interpolation.

29 Note that this marginal-effect plot also appears in Brambor, Clark, and Golder (2006).

31 Because of the extreme skew in the distribution of the moderator, which only sparsely overlaps with the treatment, we manually chose a bandwidth of 1 when employing the kernel estimator in this example.

32 The L-Kurtosis is based on linear combinations of order statistics and is therefore less sensitive to outliers and has better asymptotic properties than the classical kurtosis (Hosking 1990).

33 For example, in the case of Huddy, Mason, and Aarøe (2015), the moderator has an L-Kurtosis of 0.065, which is halfway between a normal distribution (L-Kurtosis $=$ 0.12) and a uniform distribution (L-Kurtosis $=$ 0) and therefore indicates good support across the range of the moderator. 80% of the density is concentrated in about 53% of the interval reported in the marginal-effect plot. In contrast, in the case of Malesky, Schuler, and Tran (2012), the moderator has an L-Kurtosis of 0.43, which indicates severe extrapolation. In fact, about 80% of the density of the moderator is concentrated in a narrow interval that makes up only 11% of the range of the moderator over which the marginal effects are plotted in the study. We code studies where the L-Kurtosis exceeds 0.16 as exhibiting severe extrapolation. This cut point roughly corresponds to the L-Kurtosis of an exponential or logistic distribution.
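The L-Kurtosis is straightforward to compute from sample L-moments. A minimal sketch following Hosking's order-statistic formulas (not necessarily the authors' exact implementation):

```python
import numpy as np

def l_kurtosis(x):
    """Sample L-Kurtosis (tau_4 = l4 / l2) from sample L-moments, which are
    linear combinations of order statistics (Hosking 1990)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Probability-weighted moments b0..b3
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    # L-moments l2 and l4, then the L-Kurtosis ratio
    l2 = 2 * b1 - b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l4 / l2
```

Applied to draws from a uniform distribution, the statistic should be near 0, and near 0.12 for a normal distribution, matching the benchmark values cited above.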

34 In two cases we could not test for equality in the marginal effects at low and high levels of the moderator due to a lack of common support. In other cases a singular variance–covariance matrix precluded Wald tests for linearity. Dropping these cases rather than scoring them as failing the test likely improved these aggregate scores.

35 For details, see Table A1 in the Appendix.

36 In all, only 30% of the interaction effects we examine allow us to reject the null of identical marginal effects in the first and third terciles of the moderator (i.e., the low vs. the high bins) at the 5% significance level. Lowering the significance threshold to the 10% and 25% levels leads us to reject the null in 34% and 55% of cases, respectively. Note that two cases where a lack of data prevented us from conducting this $t$ test were dropped and are not included in these calculations. See the Online Appendix for a full list of $p$ values from these tests.
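A minimal sketch of such a low-vs-high comparison on simulated data: a difference-in-means effect within the first and third tercile bins and a z-statistic for their equality. This simplified version omits covariates and clustered standard errors, and all names are hypothetical:

```python
import numpy as np

def bin_effect(y, d):
    """Difference-in-means estimate of the effect of binary d, with its
    standard error, within one bin of the moderator."""
    y1, y0 = y[d == 1], y[d == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return est, se

rng = np.random.default_rng(6)
n = 3000
d = rng.integers(0, 2, n).astype(float)
x = rng.uniform(0, 1, n)
y = d * (1 + 2 * x) + rng.normal(0, 1, n)   # effect of d grows with x

lo, hi = x <= np.quantile(x, 1/3), x >= np.quantile(x, 2/3)
(e_lo, s_lo), (e_hi, s_hi) = bin_effect(y[lo], d[lo]), bin_effect(y[hi], d[hi])
z = (e_hi - e_lo) / np.sqrt(s_lo**2 + s_hi**2)  # z-stat for equal effects
```

A large |z| rejects the null that the marginal effect is the same in the low and high bins, which is evidence against a constant (or absent) interaction.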

37 Consistent with this pattern, the Open Science Collaboration (2015), which replicated 100 studies published in three psychology journals, found that the replication success rate for significant effects was much lower for studies that tested interaction effects (22% replicated) than for studies that tested main or simple effects (47% replicated).

38 See, for example, Imai and Ratkovic (2013), Grimmer, Messing, and Westwood (2014), and Athey and Wager (2018).

40 These include tests for linearity and the existence of outliers.