HOW MUCH SHOULD WE TRUST INSTRUMENTAL VARIABLE ESTIMATES IN POLITICAL SCIENCE? PRACTICAL ADVICE BASED ON 67 REPLICATED STUDIES

Instrumental variable (IV) strategies are widely used in political science to establish causal relationships, but the identifying assumptions required by an IV design are demanding, and assessing their validity remains challenging. In this paper, we replicate 67 articles published in three top political science journals from 2010-2022 and identify several concerning patterns. First, researchers often overestimate the strength of their instruments due to non-i.i.d. error structures such as clustering. Second, IV estimates are often highly uncertain, and the commonly used t-test for two-stage-least-squares (2SLS) estimates frequently underestimates the uncertainties. Third, in most replicated studies, 2SLS estimates are significantly larger than ordinary-least-squares estimates, and their ratio is inversely related to the strength of the instrument in observational studies (a pattern not observed in experimental ones), suggesting potential violations of unconfoundedness or the exclusion restriction in the former. We provide a checklist and software to help researchers avoid these pitfalls and improve their practice.


Introduction
The instrumental variable (IV) approach is a commonly used empirical method in the social sciences, including political science, for establishing causal relationships. It is often used when selection on observables is implausible, experimentation is infeasible or unethical, and rule-based assignments that allow for sharp regression-discontinuity (RD) designs are not available. In recent years, there has been a growing number of papers published in top political science journals, such as the American Political Science Review (APSR), American Journal of Political Science (AJPS), and Journal of Politics (JOP), that use IV as a primary causal identification strategy. This trend can be traced back to the publication of the textbook Mostly Harmless Econometrics (Angrist and Pischke, 2008), which popularized the modern interpretation of IV designs, and Sovey and Green (2011), which clarifies the assumptions required by an IV approach and provides a useful checklist for political scientists.
Note: Our criteria rule out IV models appearing in the online appendix only, in dynamic panel settings, with multiple endogenous variables, and with nonlinear link functions. Non-replicability is primarily due to a lack of data and/or coding errors.
Despite its popularity, some researchers have questioned the validity of the IV approach, noting that two-stage least-squares (2SLS) estimates are often much larger in magnitude than "naïve" ordinary-least-squares (OLS) estimates, even when the main concern with the latter is upward omitted-variables bias.¹ Others have raised concerns about the validity of inferential methods used for 2SLS estimation (e.g., Lee et al., 2022; Young, 2022).
These observations motivate our systematic examination of the use of IVs in the empirical political science literature. We set out to replicate all papers published in the APSR, AJPS, and JOP during the past thirteen years (2010-2022) that use an IV design with a single endogenous variable as one of the main identification strategies.² Out of 114 papers meeting this criterion, 71 have complete replication materials online, which is itself a troubling pattern. We successfully replicate at least one of the main IV results for 67 of the 71 remaining papers, with three papers having two separate IV designs producing separate 2SLS results.
Using data from these 70 IV designs, we conduct a programmatic replication exercise and find three troubling patterns. First, a significant number of IV designs in political science either do not report the first-stage partial F statistic or overestimate it by failing to adjust standard errors (SEs) for factors such as heteroskedasticity, serial correlation, or clustering structure. Using the effective F statistic (Olea and Pflueger, 2013), we find that at least 11% of the published IV studies rely on what econometricians call "weak instruments," the consequences of which have been well documented in the literature (see Andrews, Stock and Sun (2019) for a comprehensive review).
A second related issue is statistical inference. We find that a considerable number of IV designs are underpowered, with almost all studies relying on t-tests based on analytic standard errors and traditional critical values (such as 1.96 for statistical significance at the 5% level) to make inferences about the 2SLS coefficients. When we use bootstrapping procedures, the Anderson-Rubin (AR) test, or the tF procedure, an F-statistic-dependent t-test (Lee et al., 2022), to perform hypothesis testing, we find that, depending on the method employed, 19-30% of designs cannot reject the null hypothesis of no effect at the 5% level, whereas the number based on the SEs or p-values reported in the original papers is only 10%. This suggests that inferences based on traditional t-tests may not accurately reflect the uncertainties in 2SLS estimates in a significant portion of cases.
Finally, our replications corroborate evidence from economics and finance that the 2SLS estimates are often much bigger in magnitude than the OLS estimates obtained from regressing the outcome on the potentially endogenous treatment variables and covariates (Jiang, 2017). In 68 out of the 70 designs (97%), the 2SLS estimates are bigger than the OLS estimates in magnitude; among them, 24 (34%) are at least five times bigger. This is alarming because, in an IV design with observational data, researchers often say that they are most concerned about the upward bias of the treatment effect estimates produced by naïve OLS.
Even after we exclude 15 papers that explicitly claim to expect downward biases in OLS estimates, the percentages remain high (96% and 35%, respectively).
The first two patterns may be due to researchers' unfamiliarity with recent developments in the IV literature, such as the effective F statistic and the tF test, or under-utilization of inferential procedures robust to weak instruments, such as the AR test. Therefore, researchers can avoid these problems by adopting better practices. The third finding, however, is the most concerning. We cannot explain it with weak instruments alone because, at least in the case of i.i.d. errors, when instruments are exogenous, weak instruments bias 2SLS estimates toward OLS estimates in finite samples (Bound, Jaeger and Baker, 1995). But what we observe is the opposite: the ratio between the magnitudes of the 2SLS and OLS estimates is strongly negatively correlated with the strength of the first stage among studies that use non-experimental instruments, and the relationship is almost nonexistent among studies with experimental instruments. We suspect that this is primarily driven by a combination of weak instruments and failure of exogeneity, although other mechanisms such as publication bias, heterogeneous treatment effects (HTE), and measurement error may also contribute.
What do these findings mean for empirical IV studies in political science? First, traditional t-tests for the 2SLS estimates (especially those based on classic analytic SEs) mask the fact that most IV results are highly uncertain, which likely leads to selective reporting and publication bias. Second, and more importantly, many of the 2SLS estimates likely suffer from large biases due to failures of unconfoundedness or the exclusion restriction and hence are not credible. Although we cannot definitively say which estimates are problematic, the underlying issue seems to be widespread in the IV literature. However, the goal of this paper is not to discredit existing IV studies or dissuade researchers from ever using the IV method. On the contrary, we want to caution researchers against ad hoc justifications for IVs in observational studies and provide practical advice to improve future practices. This includes accurately quantifying instrument strength and 2SLS estimate uncertainties, as well as conducting additional analyses, such as placebo tests, to corroborate the identifying assumptions.
Our work builds on a growing literature evaluating IV strategies in social sciences and offering methods to improve empirical practice. Notable studies include Young (2022), which finds IV estimates to be more sensitive to outliers and conventional t-tests to understate uncertainties; Jiang (2017), which observes larger IV estimates in finance journals and attributes this to exclusion restriction violations and weak instruments; Mellon (2020), which emphasizes the vulnerability of weather instruments; Dieterle and Snell (2016), which develops a quadratic over-identification test and discovers significant non-linearities in the first-stage regression; Felton and Stewart (2022), which finds unstated assumptions and a lack of weak-instrument robust tests in top sociology journals; and Cinelli and Hazlett (2022), which proposes a sensitivity analysis for IV designs in an omitted variable bias framework.
Our study is the first large-scale replication effort focusing on IV designs and the first to link the discrepancy between 2SLS and OLS estimates to weak instruments using extensive replication data across social sciences.

Theoretical Refresher
In this section, we offer a brief overview of the IV approach, including the setup, the identifying assumptions, and the 2SLS estimators. We then discuss potential pitfalls and survey several inferential methods. To cover the vast majority of IV studies in political science, we adopt a traditional constant treatment effect approach to IV designs, which imposes a set of parametric assumptions. For example, 51 (73%) designs in our replicated sample employ continuous treatment variables and make no reference to treatment effect heterogeneity; hence, they are ill-suited for the local average treatment effect (LATE) approach (Angrist, Imbens and Rubin, 1996).
For simplicity and without loss of generality, we omit additional exogenous controls from the discussion. This is because, by the Frisch-Waugh-Lovell theorem, we can remove them by regressing the outcome, treatment, and IVs on the controls and using the residuals for all subsequent analyses. Apart from the "canonical" use of IVs in addressing non-compliance in experimental encouragement designs, we observe that in the majority of the papers we review, researchers use IVs in observational settings to establish causality between a single treatment variable d and an outcome variable y. The basic idea of this approach is to use an instrument z to isolate "exogenous" variation in d (i.e., the variation in d that is not related to potential confounders) and estimate its causal effect on y. Figure 1 illustrates the directed acyclic graph for an IV design, where ε denotes the error term that captures all unexplained variations in y. The figure shows that, because d and ε are correlated, an observed correlation between d and y does not identify the causal effect of d on y. It also shows that an IV approach relies on three crucial assumptions: (1) the relevance of the instrument, which is directly testable, meaning that z is correlated with d; (2) the unconfoundedness assumption, which states that z is quasi-randomly assigned; and (3) the exclusion restriction, which posits that z does not have a direct effect on y beyond the channel through d.
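The Frisch-Waugh-Lovell residualization mentioned above can be checked numerically. The following is a minimal sketch on simulated data; the data-generating process, the helper `residualize`, and all variable names are illustrative assumptions, not part of the replication code. It compares a just-identified IV fit that includes the controls directly with the simple IV ratio computed on residualized variables.

```python
import numpy as np

def residualize(v, controls):
    """Residuals of v after an OLS projection on the controls plus an intercept."""
    X = np.column_stack([np.ones(len(controls)), controls])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

# Hypothetical simulated data: w are exogenous controls, u is an unobserved confounder.
rng = np.random.default_rng(0)
n = 500
w = rng.normal(size=(n, 2))
z = 0.5 * w[:, 0] + rng.normal(size=n)
u = rng.normal(size=n)
d = 0.8 * z + 0.3 * w[:, 1] + u + rng.normal(size=n)
y = 1.0 * d + w @ np.array([0.5, -0.5]) + u + rng.normal(size=n)

# Route 1: just-identified IV with controls included, beta = (Z'X)^{-1} Z'y.
X = np.column_stack([np.ones(n), d, w])
Z = np.column_stack([np.ones(n), z, w])
tau_full = np.linalg.solve(Z.T @ X, Z.T @ y)[1]  # coefficient on d

# Route 2: residualize y, d, z on the controls, then take the simple IV ratio.
y_t, d_t, z_t = (residualize(v, w) for v in (y, d, z))
tau_fwl = (z_t @ y_t) / (z_t @ d_t)
```

Under these assumptions the two routes give the same coefficient on d up to floating-point error, which is why the exposition can drop the controls.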

Estimation Strategies
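Imposing a set of parametric assumptions, we define a system of simultaneous equations:

Structural equation: $y = \alpha + \tau d + \varepsilon$ (2.1)
First-stage equation: $d = \pi_0 + \pi_1' z + \nu$ (2.2)

in which y is the outcome variable; d is a scalar treatment variable; z is a vector of instruments for d; and τ captures the (constant) treatment effect and is the key quantity of interest.
Equations (2.1) and (2.2) are referred to as the structural equation and the first-stage equation, respectively. The error terms ε and ν in the two equations may be correlated.
Substituting the first-stage equation into the structural equation gives the reduced form $y = \gamma_0 + \gamma_1' z + u$, with $\gamma_0 = \alpha + \tau \pi_0$, $\gamma_1 = \tau \pi_1$, and $u = \tau \nu + \varepsilon$, so that with a single instrument $\tau = \gamma_1 / \pi_1$. The IV estimate, therefore, is the ratio of the reduced-form and first-stage coefficients. To identify τ, we make the following assumptions (Greene, 2003, Chapter 12).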
Assumption 1 (relevance). $\pi_1 \neq 0$. This assumption requires that the IVs can predict the treatment variable, and is therefore equivalently stated as d and z not being independent, $d \not\perp\!\!\!\perp z$.
Assumption 2 (exogeneity). Cov(z, ε) = 0 and E[ε] = 0. Assumption 2 is satisfied when unconfoundedness and the exclusion restriction are satisfied. However, without additional structural assumptions, failures of unconfoundedness and the exclusion restriction are observationally equivalent; therefore, we do not distinguish between them in the analysis and diagnostics.
Under Assumptions 1 and 2, the 2SLS estimator is shown to be consistent for the structural parameter τ. Consider a sample of N observations. We can write $d = (d_1, d_2, \cdots, d_N)'$ and $y = (y_1, y_2, \cdots, y_N)'$ as (N × 1) vectors of the treatment and outcome data, and $z = (z_1, z_2, \cdots, z_N)'$ as an $(N \times p_z)$ matrix of instruments, in which $p_z$ is the number of instruments. To simplify the mathematics, we residualize the original d, y, and each column of z against the exogenous covariates, obtaining $\tilde{d}$, $\tilde{y}$, and $\tilde{z}$, respectively. The 2SLS estimator is written as follows:

$$\hat{\tau}_{2SLS} = (\tilde{d}' P_{\tilde{z}} \tilde{d})^{-1} \tilde{d}' P_{\tilde{z}} \tilde{y},$$

in which $P_{\tilde{z}} = \tilde{z} (\tilde{z}' \tilde{z})^{-1} \tilde{z}'$ is the hat-maker matrix from the first stage, which projects the endogenous treatment variable $\tilde{d}$ onto the column space of $\tilde{z}$, thereby preserving only the exogenous variation in $\tilde{d}$ that is uncorrelated with ε. This formula permits the use of more than one instrument, in which case the model is said to be "overidentified." The 2SLS estimator belongs to a class of generalized method of moments (GMM) estimators taking advantage of the moment condition E[zε] = 0, including the two-step GMM (Hansen, 1982) and limited information maximum likelihood (LIML) estimators (Anderson, Kunitomo and Sawa, 1982). We use the 2SLS estimator throughout the replication exercise because of its simplicity and because every single paper in our replication sample uses it in at least one specification.
When the model is exactly identified, i.e., the number of treatment variables equals the number of instruments, the 2SLS estimator can be simplified as the IV estimator: $\hat{\tau}_{2SLS} = \hat{\tau}_{IV} = (\tilde{z}' \tilde{d})^{-1} \tilde{z}' \tilde{y}$. In the case of one instrument and one treatment, the 2SLS estimator can also be written as a ratio of two sample covariances: $\hat{\tau}_{2SLS} = \hat{\tau}_{IV} = \hat{\gamma}_1 / \hat{\pi}_1 = \widehat{Cov}(y, z) / \widehat{Cov}(d, z)$, which illustrates that the 2SLS estimator is a ratio between reduced-form and first-stage coefficients in this special case. This further simplifies to a ratio of differences in means when z is binary, which is called a Wald estimator.
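To make the estimator concrete, here is a minimal NumPy sketch of 2SLS with a single treatment, together with the covariance-ratio form for one instrument. The function name `tsls` and the simulated data are illustrative assumptions; the inputs are assumed to be already demeaned or residualized on any controls.

```python
import numpy as np

def tsls(y, d, z):
    """2SLS for one treatment: (d' P_z d)^{-1} d' P_z y.

    P_z d is obtained as first-stage fitted values, which avoids
    forming the N x N projection matrix explicitly.
    """
    z = z.reshape(len(z), -1)
    coef, *_ = np.linalg.lstsq(z, d, rcond=None)
    d_hat = z @ coef                       # P_z d
    return (d_hat @ y) / (d_hat @ d)       # d'P_z y / d'P_z d

# Hypothetical data: u confounds d and y, so OLS is biased; true tau = 2.
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.7 * z + u + rng.normal(size=n)
y = 2.0 * d - 1.5 * u + rng.normal(size=n)
y, d, z = (v - v.mean() for v in (y, d, z))

tau_2sls = tsls(y, d, z)
tau_ratio = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]  # Cov(y, z) / Cov(d, z)
tau_ols = (d @ y) / (d @ d)                          # naive OLS for comparison
```

With one instrument, `tau_2sls` and `tau_ratio` coincide up to floating-point error, matching the ratio representation in the text.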

Potential Pitfalls in Implementing an IV Strategy
The challenges with 2SLS estimation and inference are often due to the violation of the two identifying assumptions. These difficulties can result in (1) significant uncertainty around 2SLS estimates and size distortion for t-tests due to weak instruments even when Assumption 2 is valid; and (2) potentially larger biases in 2SLS estimates compared to OLS estimates when both assumptions are violated.
Inferential problem due to weak instruments. Since the IV coefficient is a ratio, the weak instrument problem is a "divide-by-zero" problem, which arises when Cov(z, d) ≈ 0 (i.e., when Assumption 1 is violated). The instability of ratio estimators like $\hat{\tau}_{2SLS}$ when the denominator is approximately zero has been extensively studied going back to Fieller (1954). The conventional wisdom in the past two decades has been that the first-stage partial F statistic needs to be bigger than 10, and it should be clearly reported (Staiger and Stock, 1997). As a rule of thumb, the original cutoff is chosen based on simulation results to meet two criteria under i.i.d. errors: (1) in the worst case, the bias of the 2SLS estimator does not exceed 10% of the bias of the OLS estimator, and (2) a t-test based on the 2SLS estimator with a nominal size of 5% does not have an actual size over 15%.
The literature has discussed at least three issues caused by weak instruments when Assumption 2 is valid. First, under i.i.d. errors, a weak first stage exacerbates the finite-sample bias of the 2SLS estimator toward the inconsistent OLS estimator, thereby reproducing the endogeneity problem that an IV design was meant to solve (Staiger and Stock, 1997).³ Second, the 2SLS estimates become very imprecise.⁴ A third and related issue is that the tests are of the wrong size and the t-statistics do not follow a t-distribution (Nelson and Startz, 1990). Issues relating to imprecision and test-statistic size arise from the fact that the distribution of $\hat{\tau}$ is derived from a linear approximation of $\hat{\tau}$ in $(\hat{\gamma}, \hat{\pi})$, wherein normality of the two OLS coefficients implies the normality of their ratio. However, this normal approximation breaks down when π ≈ 0. Moreover, this approximation failure cannot generally be rectified by bootstrapping (Andrews and Guggenberger, 2009), although Young (2022) argues that it nevertheless allows for improved inference when outliers are present. Overall, valid IV inference relies crucially on correctly identifying strong IVs.
3 The 2SLS estimator may not have a mean when the first stage is weak; its median is centered around the OLS coefficient (Hirano and Porter, 2015).
4 To illustrate, a commonly used variance estimator for $\hat{\tau}_{IV}$ can be written as $\hat{V}(\hat{\tau}_{IV}) = \hat{\sigma}^2 / \left( \sum_{i=1}^{N} (d_i - \bar{d})^2 \, R^2_{dz} \right)$, in which $\hat{\sigma}^2$ is a variance estimator for the error term and $R^2_{dz}$ is the R-squared from the first stage. The estimated variance is mechanically larger than the estimated variance of the OLS estimator as long as $R^2_{dz} < 1$. It is decreasing in $R^2_{dz}$, i.e., stronger instruments produce more precise IV estimates.
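As a rough worked example of how instrument strength drives precision, consider the single-instrument variance expression for the IV estimator next to its OLS counterpart. The first-stage R-squared values below are hypothetical, and the comparison assumes the same error-variance estimate is plugged into both expressions, which holds only approximately in practice.

```latex
\widehat{V}(\hat{\tau}_{IV}) = \frac{\hat{\sigma}^2}{\sum_{i=1}^{N}(d_i-\bar{d})^2 \, R^2_{dz}},
\qquad
\widehat{V}(\hat{\tau}_{OLS}) = \frac{\hat{\sigma}^2}{\sum_{i=1}^{N}(d_i-\bar{d})^2}
\;\Longrightarrow\;
\frac{\widehat{SE}(\hat{\tau}_{IV})}{\widehat{SE}(\hat{\tau}_{OLS})} \approx \frac{1}{\sqrt{R^2_{dz}}}
% Hypothetical first-stage fits:
%   R^2_{dz} = 0.25  ->  SE ratio = 2
%   R^2_{dz} = 0.04  ->  SE ratio = 5
%   R^2_{dz} = 0.01  ->  SE ratio = 10
```

Under these assumptions, an instrument explaining 1% of the variation in the treatment yields SEs roughly ten times those of OLS.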
In general, there are two approaches to conducting inference in an IV design: pretesting and direct testing. The pretesting approach involves using an F statistic to test the first-stage strength, and if it exceeds a certain threshold (e.g., F > 10), proceeding to test the null hypothesis about the treatment effect (e.g., τ = 0). In contrast, the direct testing approach does not rely on passing a pretest. Nearly all reviewed studies employ the pretesting approach. We examine four methods for statistical inference in IV designs, with the first three related to pretesting and the last one being a direct test.
First, Olea and Pflueger (2013) propose the effective F statistic, which applies to both just-identified and over-identified settings and accommodates robust or cluster-robust SEs. The effective F is a scaled version of the first-stage F statistic and is computed as

$$F_{Eff} = \frac{\hat{\pi}' \hat{Q}_{ZZ} \hat{\pi}}{\operatorname{tr}(\hat{\Sigma}_{\pi\pi} \hat{Q}_{ZZ})},$$

where $\hat{\Sigma}_{\pi\pi}$ is the variance-covariance matrix of the first-stage regression, and $\hat{Q}_{ZZ} = \frac{1}{N} \sum_{i=1}^{N} z_i z_i'$. In just-identified cases, $F_{Eff}$ is the same as F statistics based on robust or cluster-robust SEs.
The authors derive the critical values for $F_{Eff}$ and note that the statistic and corresponding critical values are identical to the better-known robust F statistic $\hat{\pi}' \hat{\Sigma}_{\pi\pi}^{-1} \hat{\pi}$ and corresponding Stock and Yogo (2005) critical values [...]. Hall and Horowitz (1996) show that bootstrap-t achieves an asymptotic refinement over bootstrap-c.⁵
Third, in just-identified single-treatment settings, Lee et al. (2022) propose the tF procedure, which smoothly adjusts the t-ratio inference based on the first-stage F statistic and improves upon the ad hoc screening rule of F > 10. The adjustment factor applied to 2SLS SEs is based on the first-stage t-ratio $\hat{f} := \hat{\pi} / \sqrt{\hat{V}(\hat{\pi})}$, with the first-stage $\hat{F} = \hat{f}^2$, and relies on the fact that the distortion from employing the standard 2SLS t-ratio $\hat{t} := \hat{\tau} / \sqrt{\hat{V}(\hat{\tau})}$ can be quantified in terms of $\hat{F}$ and the AR statistic, which gives rise to a set of critical values for a given pair of $\hat{t}$ and $\hat{F}$. The authors also show that, if no adjustment is made to the t-test's critical value (e.g., using 1.96 as the threshold for 5% statistical significance), a first-stage F of 104.7 is required to guarantee a correct size of 5% for a two-sided t-test for the 2SLS coefficient.
Finally, when there is one endogenous treatment variable, the AR procedure, which is essentially an F test on the reduced form, is a direct inferential method robust to weak instruments (Anderson and Rubin, 1949; Chernozhukov and Hansen, 2008). Without loss of generality, assume that we are interested in testing the null hypothesis that τ = 0, which then implies that the reduced-form coefficient from regressing y on z is zero, i.e., $\gamma_1 = 0$. This motivates the following procedure: given a set T of potential values for τ, for each value $\tilde{\tau} \in T$, construct $\tilde{y} = y - d \tilde{\tau}$, regress $\tilde{y}$ on z to obtain a point estimate $\hat{\gamma}$ and a (robust or cluster-robust) covariance matrix $\hat{V}(\hat{\gamma})$, and construct a Wald statistic $W_s(\tilde{\tau}) = \hat{\gamma}' \hat{V}(\hat{\gamma})^{-1} \hat{\gamma}$. The AR confidence set then consists of the values $\tilde{\tau}$ in T for which this statistic does not exceed the relevant χ² critical value. The AR test not only requires no pretesting but is also shown to be the uniformly most powerful unbiased test in the just-identified case (Moreira, 2009). However, it is not as commonly used as procedures that involve pretesting, possibly because researchers are more accustomed to using t-tests than F/Wald tests and reporting SEs rather than CIs.
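The AR procedure described above can be sketched as a grid inversion. The code below is a minimal illustration for a single instrument with a heteroskedasticity-robust (HC0) variance and no small-sample correction; the function names, the grid, and the simulated data are assumptions for illustration, not the authors' software.

```python
import numpy as np

CHI2_1_95 = 3.841458820694124  # 95th percentile of the chi-squared(1) distribution

def ar_wald(y, d, z, tau0):
    """Robust Wald statistic for gamma = 0 in a regression of (y - d * tau0) on z."""
    z_c = z - z.mean()
    y_adj = y - d * tau0
    y_c = y_adj - y_adj.mean()
    gamma = (z_c @ y_c) / (z_c @ z_c)
    resid = y_c - gamma * z_c
    var_gamma = np.sum(z_c**2 * resid**2) / (z_c @ z_c) ** 2  # HC0 sandwich
    return gamma**2 / var_gamma

def ar_confidence_set(y, d, z, grid, crit=CHI2_1_95):
    """Grid values of tau that the AR test fails to reject at the 5% level."""
    return np.array([t for t in grid if ar_wald(y, d, z, t) <= crit])

# Hypothetical data with a strong instrument and true tau = 2.
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.7 * z + u + rng.normal(size=n)
y = 2.0 * d - 1.5 * u + rng.normal(size=n)

tau_iv = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]
accepted = ar_confidence_set(y, d, z, np.linspace(0.0, 4.0, 401))
reject_zero = ar_wald(y, d, z, 0.0) > CHI2_1_95
```

At the IV point estimate the reduced-form coefficient is exactly zero, so the statistic vanishes there and the accepted set brackets `tau_iv`. With a weak instrument the accepted set can be very wide or even unbounded, which a finite grid will only partially reveal.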
Bias amplification and the failure of Assumption 2. When the number of instruments is bigger than the number of endogenous treatments, researchers can use an overidentification test to gauge the plausibility of Assumption 2 (Arellano, 2002). However, such a test is often underpowered and has bad finite-sample properties (Davidson and MacKinnon, 2015). In just-identified cases, Assumption 2 is not directly testable. When combined with weak instruments, even small violations of Assumption 2 can produce large inconsistency. This is because $\operatorname{plim} \hat{\tau}_{IV} = \tau + \frac{Cov(z, \varepsilon)}{Cov(z, d)}$. When Cov(z, d) ≈ 0, even small violations of exogeneity, i.e., Cov(z, ε) ≠ 0, will enlarge the second term, resulting in large biases. Thus, the two identifying assumption failures exacerbate each other: having weak instruments compounds problems from confounding or exclusion restriction violations, and vice versa. With invalid instruments, it is possible that the asymptotic bias of the 2SLS estimator is greater than that of the OLS estimator, i.e., $\left| \frac{Cov(z, \varepsilon)}{Cov(z, d)} \right| > \left| \frac{Cov(d, \varepsilon)}{Var(d)} \right|$ in the single-instrument case.
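A small worked example may help fix ideas about bias amplification. All covariance values below are hypothetical and chosen only to show how the same small exogeneity violation translates into very different asymptotic biases depending on first-stage strength.

```latex
\operatorname{plim}\hat{\tau}_{IV} - \tau = \frac{Cov(z,\varepsilon)}{Cov(z,d)},
\qquad
\operatorname{plim}\hat{\tau}_{OLS} - \tau = \frac{Cov(d,\varepsilon)}{Var(d)}
% Hypothetical values: Cov(z, eps) = 0.02, Cov(d, eps) = 0.10, Var(d) = 1
%   OLS bias:                          0.10 / 1    = 0.10
%   IV bias, strong (Cov(z,d) = 0.50): 0.02 / 0.50 = 0.04   (smaller than OLS)
%   IV bias, weak   (Cov(z,d) = 0.05): 0.02 / 0.05 = 0.40   (four times OLS)
```

In this illustration the instrument is five times "less endogenous" than the treatment, yet with the weak first stage the IV bias still exceeds the OLS bias.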
While the inference problem can be alleviated by employing alternative inferential methods as described above, addressing the failure of Assumption 2 is more challenging since it is fundamentally a research design issue that should be tackled at the design stage. Researchers often devote significant effort to arguing for unconfoundedness and exclusion restrictions in their settings. In Section A3 of the SM, we provide an exposition of the "zero-first-stage" (ZFS) test (Bound and Jaeger, 2000), which is essentially a placebo test on a subsample where the instrument is expected to be uncorrelated with the treatment, to help researchers gauge the validity of their instruments. The resulting ZFS estimates can then be used to debias the 2SLS estimate using the methods proposed in Conley, Hansen and Rossi (2012).

Data and Types of Instruments
In this section, we first discuss our case selection criteria and the replication sample, which is the focus of our subsequent analysis. We then describe the types of instruments in the replicable studies.
Data. We examine all empirical papers published in the APSR, AJPS, and JOP from 2010 to 2022 and identify studies that use an IV strategy as one of the main identification strategies, including papers that use binary or continuous treatments and that use a single or multiple instruments. We use the following criteria: (1) the discussion of the IV result needs to appear in the main text and support a main argument in the paper; (2) we consider linear models only; in other words, papers that use discrete outcome models are excluded from our sample; (3) we exclude papers that include multiple endogenous variables in a single specification (multiple endogenous variables in separate specifications are included); (4) we exclude papers that use IV or GMM estimators in a dynamic panel setting because they are subject to a separate set of empirical issues and their poor performance has been thoroughly discussed in the literature (Bun and Windmeijer, 2010). These criteria result in 30 papers in the APSR, 33 papers in the AJPS, and 51 papers in the JOP. We then strive to find replication materials for these papers from public data-sharing platforms, such as the Harvard Dataverse, and the authors' websites. We are able to locate complete replication materials for 71 (62%) papers. However, code completeness and quality of documentation vary a great deal. Data availability has significantly improved since 2016-2017 following new editorial policies requiring authors to make replication materials publicly available, though none of the journals requires full replicability administered by a third party as a condition for publication (Key, 2016), which would constitute a major improvement in our view.
Using data and code from the replication materials, we set out to replicate the main IV results in these 71 papers. Our replicability criterion is simple: as long as we can exactly replicate one 2SLS point estimate that appears in the paper, we deem the paper replicable.
We do not aim at exactly replicating SEs, z-scores, or level of statistical significance for the 2SLS estimates because they involve the choice of the inferential method, which we will discuss in the next section.
[...]
Historical instruments are based on historical differences between units that cannot be explained by current levels of the treatment. For example, Vernby (2013) uses historical immigration levels as an instrument for the current number of non-citizen residents. Similarly, Spenkuch and Tillmann (2018) use historical decisions by rulers in Europe over the religion of their region to instrument for the current religion of survey respondents. These studies use historical variation as instruments for current or modern variables.
Several studies base their choices on regional diffusion of treatment. For example, Dube and Naidu (2015) use US military aid to countries outside Latin America as an instrument for US military aid to Colombia. Grossman, Pierskalla and Boswell Dean (2017b) use over-time variation in the number of regional governments to instrument for government fragmentation in sub-Saharan Africa. Dorsch and Maarek (2019) use the regional share of democracies as an instrument for democratization in a country-year panel.
Finally, several papers rely on a unique instrument based on theories that we could not place in a category. For example, Carnegie and Marinov (2017) use the rotating presidency of the Council of the European Union as an instrument for official development aid. They argue that countries that were colonized by the country that holds the presidency receive exogenously more aid than other countries. Dower et al. (2018) use religious polarization as an instrument for the frequency of unrest and argue that religious polarization could only impact collective action through its impact on representation in local institutions.
The second-biggest category is randomized experiments. Articles in this category employ randomization, designed and conducted by researchers or a third party, to make causal inferences and use 2SLS estimation to address non-compliance issues in an encouragement design, where the IV is typically the encouragement to take the treatment. With random assignment, we have more confidence in Assumption 2 because z ⊥⊥ ν by design, and the direct effect of encouragement on the outcome is easier to rule out than without random assignment.
Another category of instruments is based on explicit rules, which generate quasi-random variation in the treatment. Sovey and Green (2011) refer to this category as "Natural Experiment." We avoid this terminology because it is widely misused. We limit this category to two circumstances: fuzzy regression discontinuity (RD) designs and variation in exposure to policies due to time of birth or eligibility.⁷ For example, Kim (2019) leverages a reform in Sweden that requires municipalities above a population threshold to adopt direct democratic institutions. Dinas (2014) uses eligibility to vote based on age at the time of an election as an instrument for whether respondents did vote.
7 The difference between the two is subtle: for the latter, the gap in the forcing variable, such as birth cohort, is fixed and cannot be arbitrarily small.
The last category of instruments is based on econometric assumptions. This category includes what Sovey and Green (2011) call "Lags." These are econometric transformations of variables argued to constitute instruments. For example, Lorentzen, Landry and Yasuda (2014) use a measure of the independent variable from 8 years earlier to mitigate endogeneity concerns. Another example is instruments relying on variable transformations to satisfy assumptions, such as shift-share "Bartik" instruments based on interactions between multiple variables. For example, Baccini and Weymouth (2021) use the share of jobs in a specific industry within a county, interacted with national-level changes in employment in that industry, to study the effect of manufacturing layoffs on voting.
Compared to IV papers published before 2010, there is a significant increase in the proportion of papers using experiment-generated IVs (from 2.9% to 17.1%) due to the growing popularity of survey and field experiments. In contrast, the number of papers relying on econometric techniques or flawed empirical tests (such as regressing y on d and z and checking if the coefficient of z is significant) has decreased, thanks to improving empirical practices in the discipline. The percentage of papers using theory-justified instruments remains nearly the same at around 60%.

Replication Procedure and Results
In this section, we describe our replication procedure and report the main findings.
Procedure. For each paper, we select the main IV specification that plays a central role in supporting a main claim in the paper; it is either referred to as the baseline specification or appears in one of the main tables or figures. Focusing on this specification, our replication procedure involves the following steps. First, we compute the first-stage partial F statistics based on (1) classic analytic SEs, (2) Huber-White heteroskedasticity-robust SEs, (3) cluster-robust SEs (if applicable and based on the original specifications), and (4) bootstrapped SEs.⁸ We also calculate $F_{Eff}$.
We then replicate the original IV result using the 2SLS estimator and apply four different inferential procedures. First, we make inferences based on analytic SEs, including robust SEs or cluster-robust SEs (if applicable). Additionally, we use two nonparametric bootstrap procedures, as described in Section 2, bootstrap-c and bootstrap-t. For specifications with only a single instrument, we also employ the tF procedure proposed by Lee et al. (2022), using 2SLS t-statistics and first-stage F statistics based on analytic SEs accounting for the originally specified clustering structure. Finally, we conduct an AR procedure and record the p-values and CIs.
We record the point estimates, SEs (if applicable), 95% CIs, and p-values for each procedure (the point estimates fully replicate the reported estimates in the original papers and are the same across all procedures). In addition, we estimate a naïve OLS model by regressing the outcome variable on the treatment and control variables, leaving out the instrument. We calculate the ratio between the magnitudes of the 2SLS and OLS estimates. We also record other useful information, such as the number of observations, the number of clusters, the types of instruments, the methods used to calculate SEs or CIs, and the rationale for each paper's IV strategy. Our replication yields the following three main findings.
8 They are calculated by $F_{boot} = \hat{\pi}' \widehat{Var}_{boot}(\hat{\pi})^{-1} \hat{\pi} / p_z$, where $\hat{\pi}$ is the vector of first-stage coefficients on the instruments, $p_z$ is the number of IVs, and $\widehat{Var}_{boot}(\hat{\pi})$ is the estimated variance-covariance matrix based on a nonparametric bootstrap procedure, in which we repeatedly sample the rows of the data matrix with replacement. If the data have a clustered structure, we use cluster-bootstrapping instead by sampling with replacement each cluster of data (Cameron and Miller, 2015; Esarey and Menger, 2019). We include $F_{boot}$ as a reference to the classic F and effective F. In Section A.2 of the SM, we compare the five types of F statistics and show that the effective F and F based on bootstrapping are usually more conservative than other F statistics.
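A bootstrapped first-stage F of this kind can be sketched as follows: take a quadratic form in the first-stage slope coefficients, weighted by the inverse of their bootstrap variance-covariance matrix, and divide by the number of instruments. This is a minimal sketch, not the authors' software; the function names, the number of bootstrap draws, and the simulated clustered data are assumptions for illustration.

```python
import numpy as np

def first_stage_slopes(d, z):
    """OLS slopes of d on the instruments z (an intercept is included and dropped)."""
    X = np.column_stack([np.ones(len(d)), z])
    coef, *_ = np.linalg.lstsq(X, d, rcond=None)
    return coef[1:]

def bootstrap_f(d, z, clusters=None, n_boot=500, seed=0):
    """pi' Var_boot(pi)^{-1} pi / p_z, resampling rows or whole clusters."""
    rng = np.random.default_rng(seed)
    z = z.reshape(len(z), -1)
    n, p_z = z.shape
    if clusters is None:
        clusters = np.arange(n)            # each row is its own cluster
    groups = [np.flatnonzero(clusters == g) for g in np.unique(clusters)]
    pi_hat = first_stage_slopes(d, z)
    draws = np.empty((n_boot, p_z))
    for b in range(n_boot):
        picked = rng.integers(0, len(groups), size=len(groups))
        idx = np.concatenate([groups[g] for g in picked])
        draws[b] = first_stage_slopes(d[idx], z[idx])
    v_boot = np.atleast_2d(np.cov(draws, rowvar=False))
    return float(pi_hat @ np.linalg.solve(v_boot, pi_hat) / p_z)

# Hypothetical clustered data: 40 clusters of 10 observations each, with
# cluster-level components in both the instrument and the first-stage error.
rng = np.random.default_rng(3)
n_clusters, per = 40, 10
clusters = np.repeat(np.arange(n_clusters), per)
z = rng.normal(size=n_clusters)[clusters] + rng.normal(size=n_clusters * per)
d = 0.3 * z + rng.normal(size=n_clusters)[clusters] + rng.normal(size=n_clusters * per)

f_boot_rows = bootstrap_f(d, z)                        # resample rows
f_boot_clusters = bootstrap_f(d, z, clusters=clusters)  # resample clusters
```

Comparing the row-level and cluster-level versions on the same data is one way to see how ignoring a clustering structure can change the apparent strength of an instrument.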
Finding 1. First-stage partial F statistics. Our first finding concerns the strengths of the instruments. To our surprise, among the 70 IV designs, 12 (17%) do not report this crucial statistic despite its key role in justifying the validity of an IV design. Among the remaining 58 studies that report F statistics, 9 (16%) use classic analytic SEs, thus not adjusting for potential heteroskedasticity or clustering structure. In Figure 3, we plot the replicated first-stage partial F statistics based on the authors' original model specifications and choices of variance estimators on the x-axis against effective F statistics (a) or bootstrapped F statistics (b) on the y-axis. Both axes are on a logarithmic scale. In the original studies, the authors used various SE estimators, such as classic SEs, robust SEs, or cluster-robust SEs. As a result, the effective F may be larger or smaller than the original ones. However, a notable feature of Figure 3 is that when a clustering structure exists, the original F statistics tend to be larger than the effective F or bootstrapped F.
When using the effective F as the benchmark, 8 studies (11%) have $F_{\text{Eff}} < 10$. This number increases to 12 (17%) when the bootstrapped F statistics are used. The median first-stage $F_{\text{Eff}}$ statistic is higher in experimental studies than in non-experimental ones (67.7 versus 53.5). It is well known that failing to cluster the SEs at appropriate levels or using the analytic cluster-robust SE with too few clusters can lead to a severe overstatement of statistical significance (Cameron, Gelbach and Miller, 2008). However, this problem has received less attention when evaluating IV strength using F statistics.
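To make the overstatement concrete, the following Python sketch compares a classic first-stage F with a cluster-robust one on simulated data. It assumes a single instrument and no covariates, a case in which the effective F is generally understood to coincide with the robust F. The function name, the data-generating process, and the omission of a small-sample correction are all choices made for this illustration.

```python
import random


def first_stage_f(z, d, clusters):
    """Return (classic F, cluster-robust F) for the first stage d = a + pi * z + u."""
    n = len(z)
    z_bar, d_bar = sum(z) / n, sum(d) / n
    zc = [zi - z_bar for zi in z]
    s_zz = sum(v * v for v in zc)
    pi_hat = sum(v * (di - d_bar) for v, di in zip(zc, d)) / s_zz
    resid = [(di - d_bar) - pi_hat * v for v, di in zip(zc, d)]

    # Classic variance assumes i.i.d. homoskedastic errors.
    var_classic = sum(u * u for u in resid) / (n - 2) / s_zz

    # Cluster-robust variance sums the scores within each cluster first
    # (no small-sample correction, an assumption of this sketch).
    scores = {}
    for v, u, g in zip(zc, resid, clusters):
        scores[g] = scores.get(g, 0.0) + v * u
    var_cluster = sum(s * s for s in scores.values()) / s_zz ** 2

    return pi_hat ** 2 / var_classic, pi_hat ** 2 / var_cluster


# Hypothetical data: 40 clusters of 25 units.
rng = random.Random(42)
z, d, clusters = [], [], []
for g in range(40):
    z_g = rng.gauss(0, 1)    # instrument varies only across clusters
    shock = rng.gauss(0, 1)  # error component shared within a cluster
    for _ in range(25):
        z.append(z_g)
        d.append(0.3 * z_g + shock + rng.gauss(0, 1))
        clusters.append(g)

f_classic, f_cluster = first_stage_f(z, d, clusters)
print(round(f_classic, 1), round(f_cluster, 1))
```

Because the instrument and part of the error are shared within clusters, the classic formula treats correlated observations as independent and reports a larger F than the clustered version.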
Finding 2. Inference. Next, we compare the reported and replicated p-values for the null hypothesis of no effect. For studies that do not report a p-value, we calculate it based on a standard normal distribution using the reported point estimates and SEs. The replicated p-values are based on (1) bootstrap-c, (2) bootstrap-t, and (3) the AR procedure. Since we can exactly replicate the point estimates for the papers in the replication sample, the differences in p-values are the result of the inferential methods used. Figure 4(a)-(c) plot reported and replicated p-values, from which we observe two patterns. First, most of the reported p-values are smaller than 0.05 or 0.10, the conventional thresholds for statistical significance.
Second, consistent with the finding of Young (2022), our replicated p-values based on the AR procedure or bootstrap methods are usually larger than the reported p-values (exceptions are mostly caused by rounding errors), which are primarily based on t statistics calculated using analytic SEs. Using the AR test, we cannot reject the null hypothesis of no effect at the 5% level in 13 studies (19%), compared with 7 (10%) in the original studies. The number increases to 15 (21%) and 20 (29%) when we use p-values from the bootstrap-t and bootstrap-c methods, respectively. Note that very few papers we review utilize inferential procedures specifically designed for weak instruments, such as the AR test (2 papers), the conditional likelihood-ratio test (Moreira, 2003) (1 paper), and confidence sets (Mikusheva and Poi, 2006) (none).
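For readers unfamiliar with the AR test, the Python sketch below shows its textbook form in the simplest case: one instrument, one treatment, and no covariates. To test a hypothesized effect tau0, it regresses y - tau0 * d on the instrument and checks whether the instrument's coefficient is zero. With tau0 = 0 this is the reduced-form test. The function names, the heteroskedasticity-robust variance, the grid, and the simulated data are assumptions of this illustration and are not taken from the replication code.

```python
import random


def ar_stat(y, d, z, tau0):
    """Squared robust t-statistic on z in the regression of (y - tau0 * d) on z."""
    n = len(y)
    e = [yi - tau0 * di for yi, di in zip(y, d)]
    e_bar, z_bar = sum(e) / n, sum(z) / n
    zc = [zi - z_bar for zi in z]
    s_zz = sum(v * v for v in zc)
    gamma = sum(v * (ei - e_bar) for v, ei in zip(zc, e)) / s_zz
    resid = [(ei - e_bar) - gamma * v for v, ei in zip(zc, e)]
    var_gamma = sum((v * u) ** 2 for v, u in zip(zc, resid)) / s_zz ** 2
    return gamma ** 2 / var_gamma


def ar_confidence_set(y, d, z, grid, crit=3.841):
    """Grid values of tau0 not rejected at the 5% level (chi-square, 1 df)."""
    return [t for t in grid if ar_stat(y, d, z, t) <= crit]


def cov(a, b):
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)


# Hypothetical data with an unobserved confounder u and a true effect of 1.
rng = random.Random(7)
n = 500
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
d = [0.8 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [1.0 * di + ui + rng.gauss(0, 1) for di, ui in zip(d, u)]

tau_iv = cov(z, y) / cov(z, d)
grid = [i / 100 for i in range(-100, 301)]
accepted = ar_confidence_set(y, d, z, grid)
print(tau_iv, min(accepted), max(accepted))
```

The confidence set is obtained by test inversion, so its width reflects instrument strength directly: a weak first stage makes the statistic nearly flat in tau0 and the accepted region wide.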
We also apply the tF procedure to 54 studies that use single IVs, using $F_{\text{Eff}}$ statistics and t statistics based on robust or cluster-robust SEs. Figure 4(d) shows that 16 studies (30%) are not statistically significant at the 5% level, and 5 studies deemed statistically significant when using the conventional fixed critical values for the t-test become statistically insignificant using the tF procedure, indicating that overly optimistic critical values due to weak instruments also contribute to overestimation of statistical power, but not as the primary factor. These results suggest that both weak instruments and non-i.i.d. errors have contributed to severe overstatements of power in IV studies in political science.
Finding 3. 2SLS-OLS discrepancy. Finally, we investigate the relationship between the 2SLS estimates and naïve OLS estimates. In Figure 5(a), we plot the 2SLS coefficients against the OLS coefficients, both normalized using reported OLS SEs. The shaded area indicates the range beyond which the OLS estimates are statistically significant at the 5% level. It shows that for most studies in our sample, the 2SLS estimates and OLS estimates share the same direction and that the magnitudes of the former are often much larger than those of the latter. Figure 5(b) plots the distribution of the ratio between the 2SLS and OLS estimates (in absolute terms). The mean and median of the absolute ratios are 12.4 and 3.4, respectively. In fact, in all but two designs (97%), the 2SLS estimates are larger in magnitude than the OLS estimates, consistent with the finding of Jiang (2017) based on finance research.
While it is theoretically possible for most OLS estimates in our sample to be biased towards zero, researchers express a belief in downward biases of the OLS estimates in only 21% of the studies. Meanwhile, 40% of the studies consider the OLS results to be their main findings. The fact that researchers use IV designs as robustness checks for OLS estimates due to concerns of upward biases is apparently at odds with the significantly larger magnitudes of the 2SLS estimates.
In Figure 5(c), we further explore whether the 2SLS-OLS discrepancy is related to IV strength, measured by $\rho(d, \hat{d})$, the estimated correlation coefficient between the treatment $d$ and its predicted value $\hat{d}$. Among studies using non-experimental instruments, the discrepancy is negatively related to IV strength: the adjusted $R^2$ is 0.268, with p = 0.000. However, the relationship is much weaker among studies using experiment-generated instruments (red dots). The adjusted $R^2$ is −0.014 with p = 0.378. In Figure 5(d), we limit our focus to the subsample in which the OLS estimates are statistically significant at the 5% level and researchers accept them as (part of) the main findings, and the strong negative correlation remains. At first glance, this result may seem mechanical: as the correlation between $d$ and $\hat{d}$ increases, the 2SLS estimates naturally converge to the OLS estimates. However, the properties of the 2SLS estimator under the identifying assumptions do not predict the negative relationship (we confirm it in simulations in the SM), and such a relationship is not found in experimental studies.
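To see why this correlation is a natural measure of IV strength, consider the simplest case. The identity below is a sketch under stated assumptions (a single instrument, no covariates, $n$ observations, and the classic homoskedastic F statistic); it is a standard result for simple regression rather than a formula taken from the paper.

```latex
\[
\rho(d,\hat{d})^{2} = R^{2}_{\text{first stage}},
\qquad
F = (n-2)\,\frac{\rho(d,\hat{d})^{2}}{1-\rho(d,\hat{d})^{2}} .
\]
```

Under these assumptions the classic first-stage F is a monotone function of $\rho(d,\hat{d})$ for a given sample size, and in the limit $\rho(d,\hat{d}) = 1$ the fitted treatment equals the treatment itself, so 2SLS and OLS coincide.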
We believe that several factors contribute to this pattern, including (1) the failure of Assumption 2, (2) publication bias, (3) HTE, and (4) measurement error in d. We suspect the first two factors are the main driving forces. As previously mentioned, when Assumption 2 is violated, weak instruments amplify the biases from endogenous IVs or exclusion restriction failures. Publication bias may also play a role. When the first stage is weak, IV estimates have a larger variance and can be very large or very small in magnitude compared to OLS estimates. If researchers selectively report statistically significant results or journals tend to publish papers with statistically significant findings, we may observe a negative relationship as in Figure 5. This phenomenon is also referred to as Type-M bias in the psychology and sociology literature (Gelman and Carlin, 2014; Felton and Stewart, 2022).
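The selection mechanism can be illustrated with a small simulation. The Python sketch below generates many hypothetical studies with the same true effect, keeps only those whose conventional IV t-statistic exceeds 1.96, and compares the "published" magnitudes under a weak and a strong first stage. The data-generating process, parameter values, and function names are assumptions chosen for illustration. They do not reproduce the simulations in the SM.

```python
import random
import statistics


def iv_estimate(z, d, y):
    """Single-instrument IV estimate and its conventional (i.i.d.) t-statistic."""
    n = len(z)
    z_bar, d_bar, y_bar = sum(z) / n, sum(d) / n, sum(y) / n
    s_zd = sum((zi - z_bar) * (di - d_bar) for zi, di in zip(z, d))
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    tau = s_zy / s_zd
    sigma2 = sum(((yi - y_bar) - tau * (di - d_bar)) ** 2
                 for di, yi in zip(d, y)) / (n - 2)
    se = (sigma2 * s_zz) ** 0.5 / abs(s_zd)
    return tau, tau / se


def published_magnitudes(pi, n_studies=300, n=300, tau=0.2, seed=0):
    """Absolute IV estimates from simulated studies that clear |t| > 1.96."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_studies):
        z = [rng.gauss(0, 1) for _ in range(n)]
        u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
        d = [pi * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
        y = [tau * di + ui + rng.gauss(0, 1) for di, ui in zip(d, u)]
        est, t = iv_estimate(z, d, y)
        if abs(t) > 1.96:
            kept.append(abs(est))
    return kept


weak = published_magnitudes(pi=0.3)    # weak first stage
strong = published_magnitudes(pi=1.5)  # strong first stage
print(len(weak), statistics.median(weak))
print(len(strong), statistics.median(strong))
```

The true effect is 0.2 in both settings, so any gap between the two medians comes from the combination of estimator noise and the significance filter rather than from a difference in the underlying effect.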
Moreover, 30% of the replicated studies in our sample mention HTE as a possible explanation for this discrepancy. OLS and 2SLS place different weights on covariate strata in the sample, and therefore if compliers, those whose treatment status is affected by the instrument, are more responsive to the treatment than the rest of the units in the sample, we might see diverging OLS and 2SLS estimates. Under the assumption that the exclusion restriction holds, this gap can be decomposed into covariate weight difference, treatment-level weight difference, and endogeneity bias components using the procedure developed in Ishimaru (2021). In the SM, we investigate this possibility and find that it is highly unlikely that HTE alone can explain the difference in magnitudes between 2SLS and OLS estimates we observe in the replication data, i.e., the variance in treatment effects needed for this gap is implausibly large.
Finally, an IV design can correct for the downward (attenuation) bias caused by measurement error in d, resulting in $|\hat{\tau}_{2SLS}/\hat{\tau}_{OLS}| > 1$. If the measurement error is large, this can weaken the relationship between $d$ and $\hat{d}$, producing a negative correlation. However, it is worth noting that only 4 papers in our sample (6%) attribute the IV strategy to measurement error; the negative correlation remains even when the OLS estimates are the main findings (indicating measurement error may not be as concerning for researchers).
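The measurement-error argument can be made explicit under the classical errors-in-variables model. The derivation below is a sketch under assumptions not stated in the text: the observed treatment is $d = d^{*} + e$, where $d^{*}$ is the true treatment and $e$ is noise independent of $d^{*}$, of the instrument $z$, and of the outcome error, and there is no other source of endogeneity. The symbols $d^{*}$, $e$, and $\lambda$ are introduced here for illustration.

```latex
\[
\operatorname{plim}\,\hat{\tau}_{OLS} = \lambda\,\tau,
\qquad
\lambda = \frac{\sigma^{2}_{d^{*}}}{\sigma^{2}_{d^{*}} + \sigma^{2}_{e}} \in (0, 1],
\qquad
\operatorname{plim}\,\hat{\tau}_{2SLS} = \tau ,
\]
\[
\left|\frac{\hat{\tau}_{2SLS}}{\hat{\tau}_{OLS}}\right| \;\to\; \frac{1}{\lambda} \ge 1,
\qquad
\rho(d, z) = \sqrt{\lambda}\,\rho(d^{*}, z) .
\]
```

Under these assumptions, more measurement error (smaller $\lambda$) raises the 2SLS-OLS ratio and lowers the first-stage correlation at the same time, which is the negative relationship described in the paragraph above.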
We summarize the main findings from our replication exercise in Table 3. The three issues we have identified are observed in all three journals included in the study. Based on these results, we believe that a significant portion of the IV results either lack credibility or do not provide new information beyond what is already provided by OLS regressions.

Recommendations
IV designs in experimental and observational studies differ fundamentally. In randomized experiments, the instruments' unconfoundedness is ensured by design, and researchers can address potential exclusion restriction failures at the design stage, e.g., by testing potential design effects through randomization (Gerber and Green, 2012, pp. 140-141). Practices like power analysis, placebo tests, and preregistration in experimental studies also help reduce improper use of IV designs. In contrast, analyzing observational IV designs based on "natural experiments" requires detailed knowledge of the assignment mechanism, making them more complex and prone to potential issues (Sekhon and Titiunik, 2012).
Our findings suggest that using an IV strategy in an observational setting is much more challenging. Since unconfoundedness is not guaranteed by design, researchers have a greater burden of proof for the validity of IVs. On the one hand, truly random (and strong) instruments are rare; on the other hand, it is difficult to conduct placebo tests, such as the ZFS test, for the exclusion restriction after data collection. Additionally, researchers often cannot easily increase the sample size to obtain sufficient statistical power. To prevent misusing IVs in observational studies, we provide a checklist for researchers to consider when applying or considering applying an IV strategy with observational data (in the case of one endogenous treatment variable):

Design
• Prior to using an IV strategy, consider how selection bias may be affecting treatment effect estimates obtained through OLS. If the main concern is underestimating an already statistically significant treatment effect, an IV strategy may not be necessary.
• During the research design phase, consider whether the chosen instrument can realistically create random or quasi-random variations in treatment assignment while remaining excluded from the outcome equation.

Characterizing the first-stage
• Calculate and report $F_{\text{Eff}}$ for the first stage, taking into account heteroscedasticity and clustering structure as needed. However, do not discard a design simply because $F_{\text{Eff}} < 10$.
• If d and z are continuous, plot d against its predicted values $\hat{d}$ (with covariates and fixed effects already partialled out from both) and visually verify whether their relationship aligns with theoretical expectations.

Hypothesis testing and inference
• Option 3. Direct testing. Apply weak-instrument-robust procedures, such as the AR test.

Communicating your findings
• Present OLS and IV estimates alongside CIs from various inferential methods in a graphical format, like in Figure 6. These CIs may not concur on statistical significance, but they collectively convey the findings' robustness to different inferential approaches.
• Remember to report first-stage and reduced-form estimation results, including 95% CIs for coefficients, as they offer insight into both instrument strength and statistical power.

Additional diagnostics
• If you expect the OLS results to be upward biased, be concerned if the 2SLS estimator yields much larger estimates.
• If there is good reason to believe that treatment effects on compliers are significantly larger in magnitude than those on non-compliers, explain this through profiling of these principal strata (Abadie, 2003; Marbach and Hangartner, 2020).
• If it is possible to identify an observational analogue of "never takers" or a subset of them, conduct a placebo test by estimating the effect of the instrument on the outcome of interest in this ZFS sample. Using results from the ZFS test, obtain local-to-zero IV estimates and CIs and compare them to the original estimates and CIs. See the SM for a detailed example.

We provide an accompanying R package, ivDiag, to implement our recommended procedures. Our aim is to address concerns regarding IVs in social science research and improve the quality of estimation and inference, especially for non-experimental IV designs.

Figure 1. IV Papers Published in the APSR, AJPS, and JOP

Figure 2. A Directed Acyclic Graph of an IV Design

Figure 3. Original vs. Effective and Bootstrapped F

Figure 5. Relationship between OLS and 2SLS Estimates

Table 1. Data availability and replicability of IV papers.
After much effort and hundreds of hours of work, we are able to replicate the main results of 67 papers. This is consistent with what is reported in Hainmueller, Mummolo and Xu (2019). The reasons for failures of replication are incomplete data (38 papers), incomplete code or poor documentation (4 papers), and replication errors (5 papers). Table 1 presents summary statistics on data availability and replicability of IV papers for each of the three journals. The rest of this paper focuses on results based on these 67 replicable papers (and 70 IV designs).
Types of instruments. Inspired by Sovey and Green (2011), in Table 2, we summarize the types of IVs in the replicable designs, although our categories differ from theirs to reflect changes in the types of instruments used in the discipline. As in Sovey and Green

Table 3. Summary of Replication Results

• Option 1. t-test with $F_{\text{Eff}}$ pretesting. If $F_{\text{Eff}} < 10$, choose Options 2 or 3. Utilize conservative methods like bootstrap-t and bootstrap-c if outliers or group structures are present.
• Option 2. tF procedure. For single treatment and instrument cases, adjust t-test critical values based on $F_{\text{Eff}}$.