ROBUST INFERENCE IN STRUCTURAL VECTOR AUTOREGRESSIONS WITH LONG-RUN RESTRICTIONS

Long-run restrictions are a very popular method for identifying structural vector autoregressions, but they suffer from weak identification when the data is very persistent, i.e., when the highest autoregressive roots are near unity. Near unit roots introduce additional nuisance parameters and make standard weak-instrument-robust methods of inference inapplicable. We develop a method of inference that is robust to both weak identification and strong persistence. The method is based on a combination of the Anderson-Rubin test with instruments derived by filtering potentially nonstationary variables to make them near stationary using the IVX instrumentation method of Magdalinos and Phillips (2009). We apply our method to obtain robust confidence bands on impulse responses in two leading applications in the literature.


INTRODUCTION
Since the seminal paper of Sims (1980), structural vector autoregressions (SVARs) have become a very popular method for analyzing dynamic causal effects in macroeconomics. SVARs can be used to decompose economic fluctuations into interpretable shocks, such as technology, demand, policy shocks, and trace the dynamic response of macroeconomic variables to such shocks, known as impulse response functions (IRFs). The success of the SVARs relies on (i) their ability to recover the true underlying structural shocks (invertibility); (ii) the validity of the identification scheme; and (iii) the informativeness of the identifying restrictions. Because an SVAR is a system of linear simultaneous equations, the third condition can be expressed as the availability of informative instruments.
In the words of Christiano, Eichenbaum, and Vigfusson (2007), "to be useful in practice, VAR-based procedures should accurately characterize [and] uncover the information in the data about the effects of a shock to the economy". In other words, confidence intervals on the model's parameters, e.g., the IRFs to an identified shock, need to have the property that they are (i) as small as possible when instruments are strong (efficiency); and (ii) large when instruments are weak/irrelevant (robustness), see Dufour (1997). Conventional methods based on standard strong-instrument and stationarity assumptions achieve the first objective but fail the second and therefore lead to unreliable inference.
This paper focuses on the identification scheme known as long-run restrictions, proposed by Blanchard and Quah (1989). This assumes that certain shocks (e.g., demand shocks) have no permanent effect on certain economic variables (e.g., output). Long-run restrictions are a popular identification scheme for SVARs, because they seem to be less contentious than short-run identifying restrictions, see e.g., Christiano et al. (2007) and the associated comments and discussion. However, it is well-known that long-run restrictions can lead to weak identification, see e.g., Pagan and Robertson (1998), and there is presently no method of inference that is fully robust to this problem. The main difficulty is that in this context weak identification arises when instruments are highly persistent, or nearly nonstationary. Therefore, all the available weak identification robust methods of inference, such as the Anderson and Rubin (1949) test, see Staiger and Stock (1997), are inapplicable because they rely on stationary asymptotics. This also applies to common pretests of weak identification, see Mark Watson's comment on Christiano et al. (2007), as well as to bootstrap methods that are not robust to weak instruments and near unit roots.
In this paper, we develop a method of inference that is robust to weak instruments as well as near nonstationarity. The method is based on combining recent advances in econometrics on inference with highly persistent data by Magdalinos and Phillips (2009) and Kostakis, Magdalinos, and Stamatogiannis (2015), see also Phillips (2014), with well-established methods of inference that are robust to weak instruments. The former methods have been developed for predictive regressions or cointegration, and their use in the context of structural inference in simultaneous equations models is new. Our new method of inference controls asymptotic size under a wide range of data generating processes, including standard local-to-unity asymptotics; it has good size in finite samples; it is asymptotically efficient under strong identification and has good power under weak identification; and it is very simple to implement and quick to compute.
For illustration, we revisit the empirical evidence in two classic applications of SVARs with long-run restrictions: the original application in Blanchard and Quah (1989) and the hours debate of Galí (1999) and Christiano, Eichenbaum, and Vigfusson (2003). In the case of Blanchard and Quah (1989), we find that long-run restrictions yield very weak identification. On the hours debate, we find that the difference specification of Galí (1999) is very well identified, while the level specification of Christiano et al. (2003) is weakly identified. Long-run restrictions are one of the most well-known approaches to the identification of SVARs, and have been extensively used in the literature since the seminal contribution of Blanchard and Quah (1989). Therefore, the scope of the present paper extends well beyond the two applications that we discuss here.
In this paper, we focus exclusively on frequentist inference in SVARs identified using long-run restrictions. A popular alternative to frequentist methods is Bayesian estimation of SVARs. It is well known that weak identification can also be problematic for Bayesian inference, see Kleibergen and Zivot (2003). How to address these issues in the context of SVARs identified using long-run restrictions remains an open question which could be addressed using, e.g., the approach of Kleibergen and Mavroeidis (2014).
The paper is structured as follows. Section 2 introduces the model and the long-run identification scheme. Section 3 discusses existing methods of inference, highlights the problem and presents our proposed solution. Section 4 gives simulations on the finite-sample size and power of our new method. Section 5 presents the two empirical applications and finally, Section 6 concludes. Proofs are given in the Appendix at the end, as well as in a Supplementary Appendix available at Cambridge Journals Online (journals.cambridge.org/ect), which also contains additional numerical and empirical results.

MODEL
A general SVAR with $m$ lags can be written as in (1), where $L$ is the lag operator, $Y_t$ is an $n \times 1$ vector of endogenous random variables, $B_j$ are $n \times n$ nonstochastic matrices of parameters, the deterministic terms $D_t$ enter with a matrix of coefficients, and $E(\varepsilon_t \mid Y_{t-1}, Y_{t-2}, \ldots) = 0$. The diagonal elements of $B_0$ are normalized to 1, and $\mathrm{var}(\varepsilon_t)$ is a diagonal matrix.
Partition the vector of structural shocks as $\varepsilon_t = (\varepsilon_{1t}, \varepsilon_{2t}')'$, where $\varepsilon_{1t}$ is scalar and $\varepsilon_{2t}$ is $(n-1) \times 1$. We are interested in identifying $\varepsilon_{1t}$ and the IRF with respect to it. The long-run identifying restriction is that $\varepsilon_{2t}$ has no long-run effect on $Y_{1t}$. In the literature this is expressed as a zero restriction on elements of the spectral density matrix of $Y_t$ at frequency zero, via a Choleski factorization of the long-run variance of $Y_t$. We work with the (equivalent) instrumental variables (IV) representation of the long-run restrictions in Pagan and Robertson (1998), see also Appendix 6. According to this representation, under the assumption that $\varepsilon_{1t}$ has a permanent effect on $Y_{1t}$, and the long-run restriction that $\varepsilon_{2t}$ has no permanent effect on $Y_{1t}$, the system (1) can be written as equations (2)-(3), where $X_{1t}, X_{2t}$ denote vectors containing lags of $Y_t$ and deterministic terms $D_t$, $\delta_1, \delta_2$ denote the coefficients on those exogenous and predetermined variables, and $u_{2t}$ is the reduced-form error in $Y_{2t}$. It is evident that the variables $Y_{2,t-1}$ are excluded from (2), and hence they can be used as instruments for the endogenous regressors $Y_{2t}$. This suffices to identify $\varepsilon_{1t}$ and hence trace out the entire IRF with respect to $\varepsilon_{1t}$. Note that $v_{2t}$ is the residual of the projection of the reduced-form error $u_{2t}$ on $\varepsilon_{1t}$. Moreover, when the data is in logs the coefficient $b_{12}$ in (2) has a direct economic interpretation as a short-run elasticity.
In the rest of the paper, we will focus our attention on the special case n = 2, because it suffices to expose the main methodological innovation of the paper and covers the two leading applications of long-run restrictions in the literature. We will also comment on how the results can be generalized to allow for n > 2.
Note that the representation (2)-(3) with $\alpha_2 < 0$ assumes that no shock has a permanent effect on $Y_{2t}$, meaning that $Y_{2t}$ is stationary. In the literature on hours (Galí, 1999; Christiano et al., 2003) this is referred to as the levels specification, which is contrasted with the differences specification that assumes $Y_{2t}$ to be nonstationary. The differences specification can be written exactly in the form (2)-(3) if we replace $Y_{2t}$ by $\Delta Y_{2t}$, see Appendix 6 for details. Since the representation (2)-(3) can accommodate both specifications, we do not need to analyze them separately in the methodological part of the paper; we study their empirical implications in Section 5.
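The objective of this paper is to develop tests of general hypotheses on the identified structural parameters $\theta$ of the form
$$H_0 : r(\theta) = 0, \qquad (4)$$
where $r : \mathbb{R}^{\dim\theta} \to \mathbb{R}^{q}$, $q \le \dim\theta$. This includes, e.g., hypotheses on the IRF and the forecast error variance decomposition.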
Example (Bivariate SVAR(1)) A bivariate SVAR(1) without deterministic terms has structural parameters $\theta = (b_{12}, \sigma_{\varepsilon_1}, \alpha_2, d_{21})$, where $\sigma_{\varepsilon_1}$ is the standard deviation of $\varepsilon_{1t}$. This is the simplest possible model that suffices to characterize the inference problem and describe our methodology, so we will use this as a running example throughout the paper. The parameter $\alpha_2$ plays a crucial role both for the persistence of the data and the identification of the structural parameters. Specifically, when $\alpha_2$ is close to zero, $Y_{2,t-1}$ has a near unit root and becomes a weak instrument for $Y_{2t}$, see Pagan and Robertson (1998) and Gospodinov (2010). An example of a simple hypothesis of interest is $r(\theta) = d_{21} - d_{21}^0$ in (4). Inverting an $\eta$-level test of this hypothesis produces a $(1-\eta)$-level confidence band for $d_{21}$, which is the impact response of $Y_{2t}$ to a unit impulse on $\varepsilon_{1t}$.

ECONOMETRIC METHODS
The conventional approach is to use Gaussian maximum likelihood (ML) estimation with conditional homoskedasticity. The ML estimator is trivial to obtain in this case. It can be computed in two steps as follows: (i) estimate equation (2) by IV (2SLS) with instrument $Y_{2,t-1}$ for $Y_{2t}$, and save the residual $\hat\varepsilon_{1t} = Y_{1t} - \hat b_{12} Y_{2t} - \hat\delta_1 X_{1t}$; (ii) substitute $\hat\varepsilon_{1t}$ for $\varepsilon_{1t}$ in the remaining equations (3) and estimate them by OLS.
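For the bivariate case without $X_{1t}$ or $X_{2t}$ terms, the two steps can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function and argument names are hypothetical, and we assume that the second equation regresses the same variable `y2` on its lagged level and the first-step residual.

```python
def two_step_svar(y1, y2, y2_lag):
    """Two-step estimator for a bivariate system with no X terms (sketch).

    y1     : left-hand side variable of the first equation
    y2     : endogenous regressor of the first equation
    y2_lag : Y_{2,t-1}, instrument in step (i) and regressor in step (ii)
    """
    # Step (i): just-identified IV (2SLS) estimate of b12 and residual eps1_hat.
    b12 = (sum(z * a for z, a in zip(y2_lag, y1))
           / sum(z * b for z, b in zip(y2_lag, y2)))
    eps1 = [a - b12 * b for a, b in zip(y1, y2)]

    # Step (ii): OLS of y2 on (y2_lag, eps1_hat) via the 2x2 normal equations.
    # Assumption: this stands in for "the remaining equations (3)".
    sxx = sum(x * x for x in y2_lag)
    see = sum(e * e for e in eps1)
    sxe = sum(x * e for x, e in zip(y2_lag, eps1))
    sxy = sum(x * y for x, y in zip(y2_lag, y2))
    sey = sum(e * y for e, y in zip(eps1, y2))
    det = sxx * see - sxe ** 2
    alpha2 = (see * sxy - sxe * sey) / det
    d21 = (sxx * sey - sxe * sxy) / det
    return b12, alpha2, d21, eps1
```

When $\alpha_2$ is close to zero, the denominator in step (i) is close to zero in population, which is the weak-instrument problem discussed in the text.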
Under strong-instrument stationary asymptotics, i.e., $\alpha_2 < 0$ and fixed, the asymptotic distribution of Wald statistics for testing general hypotheses (4) is $\chi^2$ and error bands for any smooth function of the parameters can be derived using the delta method, e.g., Mittnik and Zadrozny (1993), or by bootstrapping, e.g., Kilian (1998). When $\alpha_2$ is close to zero, asymptotic distributions of Wald tests may become nonstandard and depend on a nuisance parameter that measures the proximity of $\alpha_2$ to zero, see, e.g., Gospodinov (2010).
Thus, conventional confidence bands on SVAR coefficients and IRFs do not have correct asymptotic coverage. This includes conventional bootstrap methods, since the conditions for the validity of the bootstrap, cf. Horowitz (2001), are not satisfied here. In particular, the structural parameters are nonsmooth functions of the reduced-form parameters because a discontinuity occurs at the point of nonidentification $\alpha_2 = 0$ (see footnote 5). In this section, which contains the main contribution of the paper, we introduce a method that has correct asymptotic coverage.

Anderson-Rubin Test with Filtered Instruments
Our approach to solving the problems of weak identification and near nonstationarity consists of two components: (i) a weak-identification robust method, namely the Anderson and Rubin (1949) (henceforth AR) test, which is a natural choice since the model is typically just-identified; and (ii) filtered instruments, namely the so-called IVX approach of Magdalinos and Phillips (2009), to deal with near unit roots.
We start by looking at the special case of testing the hypothesis
$$H_0 : b_{12} = b_{12}^0. \qquad (7)$$
This hypothesis is special because it turns out that there exists a test that is both robust to weak identification/near unit roots and asymptotically efficient under strong identification. Note also that $H_0$ (which may be interpreted as a hypothesis about a short-run elasticity) is of frequent economic interest.
Because of the structure of the problem, the hypothesis (7) can be tested using just the first equation of the model (2). Given some instruments $Z_{1t} = (X_{1t}, z_t)$, the AR statistic, $AR(b_{12}^0)$, is the Wald statistic for testing $H_0^* : \delta_z = 0$ in the auxiliary regression (8), in which $\delta_z$ is the coefficient on $z_t$. When $n = 2$, i.e., when $b_{12}$ is a scalar, this AR statistic can be written analytically as in (9), where $P_{\cdot}$ denotes the projection matrix, $M_{\cdot} = I - P_{\cdot}$, $Z_1 = (X_1, z)$, and we follow standard notation that for any column vector $X_t$, $X$ denotes the matrix of $T$ stacked rows $X_t$, $t = 1, \ldots, T$. If we set $z_t = Y_{2,t-1}$, the AR statistic corresponds to the likelihood ratio test for (7). Under stationarity/strong identification ($\alpha_2 < 0$ and fixed), $AR(b_{12})$ is asymptotically distributed as $\chi^2$ under $H_0$. Moreover, the likelihood ratio test is asymptotically efficient under stationarity/strong identification. However, when $\alpha_2$ is local to zero, the $\chi^2$ asymptotic approximation breaks down, and the asymptotic distribution, if it exists, depends on the proximity of $T\alpha_2$ to zero. So, $AR(b_{12})$ is not asymptotically pivotal, and tests based on $\chi^2$ critical values will not control asymptotic size. This is straightforward to see using local-to-unity asymptotics as in Gospodinov (2010).
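The Wald interpretation of the AR statistic can be illustrated with a minimal Python sketch for a scalar instrument and no $X_{1t}$ terms. The function names, the homoskedastic variance estimate and its $T - 1$ degrees-of-freedom divisor are assumptions made for illustration, not the paper's exact expression (9).

```python
from statistics import NormalDist


def ar_statistic(y1, y2, z, b0):
    """Wald statistic for delta_z = 0 in the regression of y1 - b0*y2 on z."""
    e = [a - b0 * b for a, b in zip(y1, y2)]          # residual imposed by H0
    szz = sum(v * v for v in z)
    delta = sum(v * u for v, u in zip(z, e)) / szz     # OLS coefficient on z
    resid = [u - delta * v for u, v in zip(e, z)]
    s2 = sum(r * r for r in resid) / (len(e) - 1)      # assumed d.o.f. divisor
    return delta ** 2 * szz / s2


def ar_rejects(y1, y2, z, b0, eta=0.05):
    """Compare the statistic with the 1 - eta quantile of chi-squared(1)."""
    crit = NormalDist().inv_cdf(1.0 - eta / 2.0) ** 2  # chi2(1) = N(0,1) squared
    return ar_statistic(y1, y2, z, b0) > crit
```

A confidence set for $b_{12}$ follows by collecting all values `b0` for which `ar_rejects` returns `False`.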
Footnote 5. Nonconventional bootstrap methods, such as the grid bootstrap, see Hansen (1999) and Mikusheva (2012), or subsampling, see Andrews and Guggenberger (2010), could provide valid asymptotic coverage. A disadvantage of those methods is that they are much more computationally demanding than the method we propose here.
Our solution to the above problem is to use an instrument that relates to Y 2,t −1 but is constructed in such a way that it is less persistent than Y 2,t −1 whenever the latter has a near unit root. This is an application of the IVX method of Magdalinos and Phillips (2009) to this problem. Magdalinos and Phillips (2009) obtained nuisance-parameter-free asymptotic distribution theory for Wald tests in situations where the order of integration of the regressors is unknown, such as predictive regressions or cointegrating regressions when the right hand side variables are nearly integrated. They did so by introducing an instrument which is filtered from the original data in such a way that it is at most moderately integrated, and correlates sufficiently with the variable it is instrumenting.
In the SVAR model, the filtered instrument of Magdalinos and Phillips (2009) is given by (10). The parameter $\rho_{Tz}$ must be close to unity for efficiency, and outside an $O(1/T)$ neighborhood of unity for asymptotic size control, as we show later. Extensive simulations reported in Kostakis et al. (2015) show that setting $c_z = -1$ and $b = 0.95$ achieves a good balance between size and power in finite samples in the predictive regression model. We find that these values also work well in the context of this paper (see the Online Supplementary Appendix), and we therefore use them in our empirical implementation.
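A minimal Python sketch of an IVX-type filter is given below. It assumes the standard form used in the IVX literature, $z_t = \rho_{Tz} z_{t-1} + \Delta Y_{2t}$ with $z_0 = 0$ and $\rho_{Tz} = 1 + c_z / T^{b}$; the function name is hypothetical, and the exact timing by which the filtered series stands in for $Y_{2,t-1}$ should be taken from the paper's equation (10) rather than from this sketch.

```python
def ivx_instrument(y2, c_z=-1.0, b=0.95):
    """Filter y2 into a mildly integrated instrument (illustrative sketch).

    Assumed recursion: z_t = rho * z_{t-1} + (y2_t - y2_{t-1}), z_0 = 0,
    with rho = 1 + c_z / T**b, so rho lies strictly inside the unit circle
    for c_z < 0 but approaches one as the sample size T grows.
    """
    T = len(y2)
    rho = 1.0 + c_z / T ** b
    z = [0.0]
    for t in range(1, T):
        z.append(rho * z[-1] + (y2[t] - y2[t - 1]))
    return z, rho
```

With `c_z = 0` the filter reproduces the original series (up to its initial value), while more negative values of `c_z` or smaller values of `b` make the instrument less persistent.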
To obtain asymptotic results, we make the following assumption on $\varepsilon_t$, where $\|\cdot\|$ denotes the spectral norm.
This assumption is similar to the one used in Magdalinos and Phillips (2009), except for the addition of conditional homoskedasticity, which is typically used in the literature (e.g., the results in Galí (1999) assume conditional homoskedasticity). Heteroskedasticity robust versions of the proposed tests can be obtained using GMM, see the Appendix.
Our proposed AR test is based on the following result.
THEOREM 1. Let $AR(b_{12}^0)$ be the statistic (9) with instrument $z_t$ defined by (10). Then, under $H_0$, $AR(b_{12}^0)$ is asymptotically distributed as $\chi^2_1$.
Remarks. 1. The asymptotic size of the $\eta$-level AR test that rejects $H_0$ when $AR(b_{12}^0)$ exceeds the $1-\eta$ quantile of $\chi^2_1$ is equal to $\eta$. This can be shown using arguments analogous to those used in the proof of Andrews, Cheng, and Guggenberger (2011, Cor. 2.1 and Lemma 4.1), see the Online Supplementary Appendix for further details.
2. The case $T\alpha_2 \to -\infty$ corresponds to (near) stationarity and strong identification. In this case, the statistic $AR$ in (9) is asymptotically equivalent to the AR statistic $AR^*$ that is obtained by replacing the filtered instrument $z_t$ with $Y_{2,t-1}$. Because the model is just-identified, $AR^*$ is the standard LR statistic which is asymptotically efficient under stationary and strong-instrument asymptotics. It is also asymptotically equivalent to the standard Wald test of $H_0$. Thus, the use of the filtered instrument entails no loss of power in the case of strong identification, and so the AR test with filtered instruments weakly dominates the Wald and standard LR tests.
3. The results of the theorem, as well as the above two remarks, also apply in a model with more endogenous variables, $Y_{3t}$, that are subject to long-run restrictions, under the assumption that their coefficients, $b_{13}$, can be estimated consistently using $Y_{3,t-1}$ as instruments, and the resulting estimator $\hat b_{13}$ is asymptotically Gaussian. A sufficient condition for this is that $Y_{3t}$ is stationary.
4. Theorem 1 can be extended to cover the case when Y 2t is a vector along the lines of Magdalinos and Phillips (2009, Thm. 3.8), or Kostakis et al. (2015, Thm. 1), under the assumption that C is a diagonal matrix. In that case,

Tests of General Hypotheses
Testing general hypotheses such as (4) is complicated by the fact that $r(\theta)$ contains the potentially weakly identified parameter $b_{12}$. Let $\psi$ denote the rest of the unknown parameters in $\theta$ other than $b_{12}$. Note that when $b_{12}$ is known, the parameters $\psi$ are identified as regression coefficients and variances. So, inference on smooth functions of $\psi$, given $b_{12}$, would be straightforward, except for the complication that arises when there is a near unit root in $Y_{2t}$. We address this issue using IVX in equation (3) with instrument $z_t$ given by (10) for $Y_{2,t-1}$. General hypotheses (4) can be tested using Bonferroni or projection methods for valid inference. The Bonferroni method is as follows: first, construct a $(1-\eta_1)$-level confidence set for $b_{12}$ by inverting the AR test; second, reject $H_0$ if the Wald statistic $W(b_{12}^0)$ for the restrictions on $\psi$ given $b_{12}^0$ exceeds the $1-\eta_2$ quantile of $\chi^2_q$ for all $b_{12}^0$ in the first-step confidence set. By the Bonferroni inequality, this test has level at most $\eta_1 + \eta_2$. In fact, because it turns out that the second-step Wald test is asymptotically independent of the first-step AR test, the Bonferroni bound can be tightened somewhat by choosing a larger $\eta_2$, see Remark 4 below Theorem 2. In theory, this can be refined even further along the lines of McCloskey (2012), but this may be computationally impractical in realistic settings, due to the large number of parameters.
The projection method is as follows: perform a test of the joint null hypothesis $H_0^* : r(\theta) = 0$, $b_{12} = b_{12}^0$, and project out $b_{12}$, i.e., reject $H_0 : r(\theta) = 0$ if there is no value of $b_{12}^0$ for which $H_0^*$ is accepted. This approach requires a test of the joint hypothesis $H_0^*$. Our proposed test for $H_0^*$ is based on a novel idea that combines the $AR(b_{12})$ statistic developed above with the Wald statistic for testing the restrictions on the remaining parameters $\psi$ (this idea applies more generally, see Section C.2 and Theorem C.1 in the Appendix). We call the resulting test ARW, and derive its asymptotic properties under the null in Theorem 2 below.
We now turn to the derivation of the ARW test. Let $\hat\psi(b_{12})$ be the restricted GMM estimator of $\psi$ given $b_{12}$, defined in equation (C.4) in the Appendix, and let $\hat V_{\hat\psi}(b_{12})$ denote the estimator of the asymptotic variance matrix of $\hat\psi(b_{12})$ given in equation (C.5) in the Appendix. Provided $R(\theta) = \partial r(\theta)/\partial\psi$ exists and is of full rank $q$, define the Wald statistic $W(b_{12})$ for the restrictions on $\psi$ given $b_{12}$, and consider the combined statistic $ARW(b_{12}) = AR(b_{12}) + W(b_{12})$. The asymptotic distribution of $ARW(b_{12}^0)$ under the null $H_0^*$ is given by the following result.
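THEOREM 2. Under the conditions of Theorem 1, if the null hypothesis $H_0^* : r(\theta) = 0$, $b_{12} = b_{12}^0$ holds, then $ARW(b_{12}^0)$ is asymptotically distributed as $\chi^2_{1+q}$.
Remarks. 1. The projection ARW test rejects $H_0 : r(\theta) = 0$ when $\min_{b_{12}} ARW(b_{12})$ exceeds the $1-\eta$ quantile of $\chi^2_{1+q}$.
2. The asymptotic size of a $(1-\eta)$-level confidence set obtained by inverting an $\eta$-level ARW test, defined as the minimum coverage probability of the confidence set, is equal to $1-\eta$ uniformly in $\alpha_2$. This result is analogous to Remark 1 to Theorem 1, see the Online Supplementary Appendix for details.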
3. Remarks 3 and 4 to Theorem 1 also apply to Theorem 2.
4. The Bonferroni test rejects $H_0$ when $W(b_{12}^0)$ exceeds the $1-\eta_2$ quantile of $\chi^2_q$ for all $b_{12}^0$ in the first-step confidence set. Because $AR$ and $W$ are asymptotically independent under $H_0$, an $\eta$-level Bonferroni test can be obtained by setting $\eta_2 = \frac{\eta - \eta_1}{1 - \eta_1}$, thus avoiding the more conservative Bonferroni bound given by $\eta_2 = \eta - \eta_1$.
5. Confidence intervals for any scalar function of the parameters $g(b_{12}, \psi)$ that is smooth in $\psi$, such as an impulse response, can be obtained easily and quickly by numerical optimization methods. An algorithm for this is given in the Online Supplementary Appendix.
6. The ARW test is a Wald test of the joint hypothesis $H_0^{**} : r(\psi, b_{12}^0) = 0$ and $\delta_z = 0$ in the auxiliary regression (8), where $\delta_z$ and $r(\psi, b_{12}^0)$ are the means of two asymptotically jointly Normal random vectors with an asymptotic variance matrix that is block diagonal under the null, because $E(\varepsilon_{1t}\varepsilon_{2t}) = 0$. Hence, by the usual invariance argument of Wald (1943), the joint Wald statistic for testing $H_0^{**}$ is equal to the sum of the Wald statistic $AR(b_{12}^0)$ for testing $\delta_z = 0$, and the Wald statistic $W(b_{12}^0)$ for testing $r(\psi, b_{12}^0) = 0$. Alternative combinations of the two statistics that place different weights on each of the two components, e.g., $w\,AR(b_{12}^0) + W(b_{12}^0)$, $w > 0$, can be considered in order to direct power to specific alternatives. We explore this in Section 4.2.3, and find that there is no $w \neq 1$ that uniformly dominates the (equally weighted) ARW test. Moreover, the optimal choice of $w$ depends on the nuisance parameter $c = T\alpha_2$ that is not consistently estimable under near-unit-root asymptotics. These limitations, together with the fact that tests based on nonequally weighted combinations of $AR(b_{12}^0)$ and $W(b_{12}^0)$ require nonstandard critical values, which further limits the appeal of the procedure for practitioners, lead us to propose the equally weighted ARW test.
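The Bonferroni procedure and the level adjustment based on asymptotic independence can be sketched as follows for $q = 1$. This is a hedged illustration: the function names are hypothetical, `ar` and `w` are user-supplied callables returning the two statistics at a candidate value of $b_{12}$, and a finite grid stands in for the first-step confidence set.

```python
from statistics import NormalDist


def chi2_1_quantile(p):
    """p-quantile of chi-squared(1), using chi2(1) = (standard normal)^2."""
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2


def bonferroni_eta2(eta, eta1):
    """Second-step level under independence: (1-eta1)*(1-eta2) = 1-eta."""
    return (eta - eta1) / (1.0 - eta1)


def bonferroni_reject(ar, w, grid, eta=0.10, eta1=0.05):
    """Reject if W(b) exceeds its critical value for every b kept in step one."""
    c1 = chi2_1_quantile(1.0 - eta1)
    c2 = chi2_1_quantile(1.0 - bonferroni_eta2(eta, eta1))
    conf_set = [b for b in grid if ar(b) <= c1]   # first-step confidence set
    # Note: an empty first-step set leads to rejection (all() of nothing is True).
    return all(w(b) > c2 for b in conf_set)
```

For example, with $\eta = 0.10$ and $\eta_1 = 0.05$ the adjusted second-step level is $0.05/0.95 \approx 0.0526$, slightly larger than the value $0.05$ implied by the plain Bonferroni bound.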
. This can be expressed as the linear restriction r denote the restricted point estimate of d 21 and its standard error, respectively. An ARW projection A Bonferroni confidence interval based on an η 1 -level AR test with critical value c 1 and η 2 -level W test with critical value c 2 , where c i is the 1 − η i quantile of χ 2 1 and η 2 = η−η 1 1−η 1 , is given by

Deterministic Terms
Theorems 1 and 2 apply when model (2)-(3) does not include any deterministic terms in $X_{1t}$ and $X_{2t}$, but it can be shown using the same arguments as in Kostakis et al. (2015, Theorem A) that they continue to hold if an intercept is included in $X_{1t}, X_{2t}$. However, in that case the asymptotic approximations may deteriorate in finite samples, as was found by Kostakis et al. (2015) for predictive regression. To address this possibility, we adapt the finite sample correction proposed by Kostakis et al. (2015) to the ARW statistic as follows. The finite sample correction in Kostakis et al. (2015), applied to the AR statistic in (9), consists in modifying $P_{M_{X_1} z}$ in the numerator. When the model contains an intercept, this term is replaced by a corrected version that depends on the elements of $X_1$ excluding the intercept and on $\hat\rho_{\varepsilon_1, u_2}$, the estimated long-run correlation between $\varepsilon_{1t}$ and $u_{2t}$ in equations (2)-(3). The correction of the Wald statistic $W(b_{12})$ is analogous. It depends on the specific form of $H_0^*$ but only affects the variance related to the estimator of $\alpha_2$ in $\hat V_{\hat\psi}(b_{12})$. We provide an expression for it in the Online Supplementary Appendix. In the empirical applications we consider in this paper, $\hat\rho_{\varepsilon_1, u_2}$ is low enough that the finite sample correction does not make a material difference to the results.
In some applications, $Y_{2t}$ denotes the deviation of some observed variable (e.g., log hours, or log real GDP) from a linear deterministic trend, where the observed data $Y^{obs}_{2t}$ is given by $Y^{obs}_{2t} = Y_{2t} + \tau_x + \gamma_x t$. We then replace $Y_{2t}$ with $\hat Y_{2t} = Y^{obs}_{2t} - \hat\tau_x - \hat\gamma_x t$ in the computation of the IVX instrument $z_t$. Whether or not $Y_{2t}$ is stationary affects inference on $\gamma_x$. If $\hat\gamma_x$ is computed using the full sample, then $\hat Y_{2t}$ is a function of future values and this may affect the validity of the exclusion restrictions used in the estimation.
To avoid this issue, we follow Phillips, Park, and Chang (2004) and use a recursive detrending formula to ensure that $\hat Y_{2t}$ is not computed using future values. This formula preserves the martingale difference sequences which are needed in the asymptotic theory, so the moment conditions hold under $H_0$. Hence, the asymptotic results presented above continue to hold.
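The idea of recursive detrending can be illustrated with a short Python sketch in which the trend at date $t$ is fitted by OLS using observations up to $t$ only. This is a generic illustration under our own assumptions (function name, time index starting at 1, zero residuals for the first two dates where the fit is exact); the precise formula used in the paper is not reproduced here and may differ in its details.

```python
def recursive_detrend(y):
    """Residual at each date t from an OLS fit of (1, s) on y_1, ..., y_t only."""
    out = []
    for t in range(1, len(y) + 1):
        if t < 3:
            # With one or two observations a line fits exactly: residual is zero.
            out.append(0.0)
            continue
        window = y[:t]
        s_bar = (t + 1) / 2.0                      # mean of 1, ..., t
        y_bar = sum(window) / t
        sxx = sum((s - s_bar) ** 2 for s in range(1, t + 1))
        sxy = sum((s - s_bar) * (v - y_bar)
                  for s, v in zip(range(1, t + 1), window))
        slope = sxy / sxx
        fitted_t = y_bar + slope * (t - s_bar)     # fitted value at date t
        out.append(y[t - 1] - fitted_t)
    return out
```

Because the detrended value at date $t$ never uses observations after $t$, it remains a function of current and past data only, which is the property required for the exclusion restrictions.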

NUMERICAL RESULTS
In this section, we investigate the finite-sample properties of our proposed test and compare them with the existing nonrobust alternative. The data generating process is the bivariate SVAR(1) example introduced earlier, with $\alpha_2 = cT^{-1}$. The model is simulated in its reduced form, with initial values $Y_{10} = Y_{20} = 0$. We normalize $\omega_2 = 1$ because the statistics are invariant to scaling of the variance matrix. The AR statistic is also invariant to $\omega_1$, so in simulations involving only $AR(b_{12})$, we will also normalize $\omega_1 = 1$. The estimated model is SVAR(1), with and without deterministic terms.

Size
We conduct two sets of simulation experiments to obtain the rejection frequency of tests of the following two null hypotheses: (i) $H_0 : b_{12} = 0$ against $H_1 : b_{12} \neq 0$, using the AR test with filtered instruments, and (ii) $H_0 : d_{21} = d_{21}^0$ against $H_1 : d_{21} \neq d_{21}^0$ using the ARW test, for $d_{21} \in [-1, 1]$. In case (i), we report rejection frequencies over a few different parameterizations. We consider the parameter sets $\rho \in \{0.20, 0.95\}$ and $c \in \{0, -1, -10, -30, -100\}$ and the sample size is set to $T = 200$. We compute the null rejection frequencies of our AR test with the filtered instrument $z_t$ in (9) and the conventional t test with instrument $Y_{2,t-1}$ at the 5% and 10% levels of significance. The estimated model is SVAR(1) with an intercept, and the computation of the AR statistic uses the finite sample correction introduced in Section 3.3. The number of Monte Carlo replications is 20,000.
The rejection frequencies are reported in Table 1. We notice that the rejection frequency of the t test can deviate sharply from its asymptotic level, with considerable overrejection in the cases ρ = 0.95 and c close to zero. In contrast, the rejection frequency of our proposed AR test is close to its asymptotic level in all cases. Similar results obtain for SVAR models with more lags as well as for models with deterministic terms (further results can be found in the Online Supplementary Appendix). In case (ii), we conduct experiments for a very large number of parameter combinations over a 4-dimensional grid in d 21 ,ρ,ω 1 and c, where we exploit an invariance property of the ARW statistic that enables us to normalize ω 2 = 1 and fix b 12 as a function of the other parameters, see the Online Supplementary Appendix for details. Figure 1 reports the maximal rejection frequency of the test at three different levels of significance (10%, 5% and 1%) over ρ,ω 1 and c for each value of d 21 under the null, denotedd 21 in the figure. The estimated model coincides with the data generating process (DGP), i.e., an SVAR(1) without deterministics, and the number of Monte Carlo replications is 20,000.
We notice that the size of the projection ARW test is well below the nominal level across all values of $d_{21}$. In the Online Supplementary Appendix, we verify that the same result holds also in a large sample with $T = 2{,}000$. This indicates that there is some projection bias that could in principle be reduced by using lower critical values. However, it is not possible to reduce the critical value all the way to that of $\chi^2_1$, as would be warranted under strong identification, because the resulting test would be oversized (see the results in the Online Supplementary Appendix). An ARW test with $\chi^2_1$ critical values will only yield correct asymptotic size when $\alpha_2 < \kappa$ for some fixed $\kappa < 0$. This is because in that case, a test that rejects when $\min_{b_{12}} ARW(b_{12})$ is greater than the $1-\eta$ quantile of the $\chi^2_1$ distribution is asymptotically equivalent to a standard Wald test of the restriction on the parameter $d_{21}$. However, it does not seem possible to use the lower critical values under weak identification, so the use of the projection critical values based on $\chi^2_2$ entails some loss of power, in exchange for robustness, in the case of strong identification.
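The projection test and the gap between the two sets of critical values can be sketched in Python as follows. The sketch assumes a scalar restriction ($q = 1$), a user-supplied callable `arw` that returns the combined statistic at a candidate $b_{12}$, and a finite grid in place of numerical optimization; all names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile(p, df):
    """Closed-form p-quantile of chi-squared with 1 or 2 degrees of freedom."""
    if df == 1:
        return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2
    if df == 2:
        return -2.0 * math.log(1.0 - p)   # chi2(2) is exponential with mean 2
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


def projection_arw_rejects(arw, grid, eta=0.10):
    """Reject H0 when the minimum of ARW(b) over the grid exceeds the
    1 - eta quantile of chi-squared(2), i.e., the projection critical value."""
    min_stat = min(arw(b) for b in grid)
    return min_stat > chi2_quantile(1.0 - eta, 2), min_stat
```

At the 10% level the projection critical value is about 4.61, compared with about 2.71 for $\chi^2_1$, which gives a sense of the power that is given up under strong identification.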
It is in principle possible to reduce the projection bias, e.g., by designing a data-based identification category selection rule along the lines of Andrews and Cheng (2012), comparing the proximity of $\hat\alpha_2$ to some cutoff that diverges with $T$. This improvement will come at the cost of introducing additional tuning parameters, and so may be unappealing in applied work.

Power
We compute the power of the AR, t, projection ARW and Bonferroni tests in the running SVAR(1) example. We set $T = 200$ and use 10,000 Monte Carlo replications. In the Online Supplementary Appendix, we report large-sample power curves, obtained with $T = 2{,}000$, and note that they are very similar to the ones reported here.

Power of AR Test.
We compare the power of AR and t tests of $H_0 : b_{12} = 0$ against $H_1 : b_{12} \neq 0$ at the 10% level of significance. The remaining parameters are $\rho \in \{0.2, 0.95\}$, $\omega_1 = 1$, and $c \in \{-10, -70, -200\}$. In this model, the strength of identification is driven by $c$. To relate the results to well-known cases of weak, moderate and strong identification in linear IV, we compute an approximate measure of the strength of instruments known as the concentration parameter (denoted $\lambda$) in linear IV (see footnote 9). The chosen values of $c$ correspond to approximate values of $\lambda$ of 1.36 (weak), 10.6 (medium) and 49.5 (strong), respectively. The range of $b_{12}$ under $H_1$ is $\lambda^{-1/2}(-3 : 3)$. Figure 2 reports the resulting power curves. The figure shows that the AR test has good finite-sample power even for $c$ close to zero. This is not the case for the t test, which is both size distorted and even biased in some cases. Moreover, when identification is strong ($c = -200$), the power of the AR test is very similar to that of the t test, which is asymptotically efficient in this case (the power curves are even closer for $T = 2{,}000$). Since the DGP in this case is approximately stationary, this is a consequence of the fact that the AR and t tests are asymptotically equivalent in the case of stationarity, see Remark 2 to Theorem 1.
Footnote 9. In linear IV with fixed instruments, the concentration parameter is equal to $k[E(F) - 1]$, where $F$ is the infeasible version of the first-stage F statistic for excluding the instrument, computed when the variance of the reduced-form error is known, see Stock, Wright, and Yogo (2002). The present context does not fit into that canonical IV framework, so we use a large sample approximation of $\lambda$.

Power of Projection ARW Test.
We compare the power of the projection ARW test of $H_0 : d_{21} = 0$ against $H_1 : d_{21} \neq 0$, as defined in Remark 1 to Theorem 2, with the corresponding t test at significance level 10%. We set $b_{12} = 0$, $\omega_1 = 1$ and note that with these parameter values $\rho = d_{21}$, so the range of $d_{21}$ is bounded between $-1$ and $1$. Unlike the previous subsection, which dealt with inference on $b_{12}$ (the coefficient on the endogenous regressor in a linear IV regression), there is no direct analogy to the concentration parameter as a measure of the strength of identification. The results are reported in Figure 3 for several values of $c$; the values closest to and furthest from zero correspond to weak and strong identification, respectively, while $-50$ and $-100$ correspond to intermediate cases. The projection ARW test is conservative, as expected, and less powerful than the nonrobust t test. So, there is a clear trade-off here between power and robustness to weak identification, unlike the AR test of hypotheses on $b_{12}$, reported in Figure 2.
A comparison of the projection ARW test with the Bonferroni method discussed in Remark 4 of Theorem 2 is reported in the Online Supplementary Appendix.

Power of Weighted ARW Test.
We consider alternatives to the ARW test based on a weighted average of the AR (b 12 ) and W (b 12 ) statistics, indexed by a weight w, where w = 1 corresponds to the equally weighted ARW test. For each w, the critical value of an η-level ARW_w test is computed by simulating its asymptotic distribution obtained from Theorem 2.
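Simulating a critical value for a weighted statistic can be sketched as follows. This is only an illustration under strong assumptions that are not taken from the text: the weighted statistic is assumed to have the form w·AR + (2 − w)·W, and the two components are assumed to be asymptotically independent chi-square variables with `df_ar` and `df_w` degrees of freedom. The actual weighting scheme and limiting distribution are those of Theorem 2, and the function names here are hypothetical.

```python
import random


def chi2_draw(df, rng):
    """One draw from a chi-square distribution with df degrees of freedom."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


def simulated_critical_value(w, eta=0.10, df_ar=1, df_w=1,
                             reps=100_000, seed=0):
    """(1 - eta) quantile of w*AR + (2 - w)*W under the assumed limit law."""
    rng = random.Random(seed)
    draws = sorted(
        w * chi2_draw(df_ar, rng) + (2.0 - w) * chi2_draw(df_w, rng)
        for _ in range(reps)
    )
    return draws[int((1.0 - eta) * reps)]
```

Under these assumptions, w = 1 reproduces the usual chi-square critical value with `df_ar + df_w` degrees of freedom, while other weights require the simulated quantile because a weighted sum of chi-squares is no longer chi-square.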
We use the same DGPs as in the previous subsections with T = 200 and set η = 10% as before. We consider both tests of the joint null H 0 : b 12 = d 21 = 0, and projection tests of H 0 : d 21 = 0. We compute the power of ARW_w tests as a function of w across the different DGPs. For each DGP, the alternative is chosen such that the power of the ARW_w test is 50% when w = 1.
The left panel of Figure 4 reports the power of the ARW_w statistic as a function of w for different DGPs indexed by c. The figure shows that the optimal weight w clearly depends on c. Specifically, the optimal weight is close to 1 when c = −200 and it decreases as c increases toward zero.
Next, we explore whether the optimal weight depends on the alternative for b 12 . Specifically, we let b 12 ∈ {−1, −0.5, 0.5, 1}, while holding c fixed at c = −10. Again, for each pair (c, b 12 ), we pick ρ to ensure that power is 50% when w = 1. The right panel of Figure 4 reports the power as a function of w.
All in all, the results show that the optimal weight w depends on the DGP and there is no value of w that yields uniform power improvement over the equally weighted ARW test.
Projection Test of H 0 : d 21 = 0. We set b 12 = 0 and consider different values for c ∈ {−200, −100, −50, −10}. The value of ρ is chosen as before to ensure 50% power for the ARW test with w = 1. Figure 5 reports the power of the projection ARW_w test as a function of w. Again, we see that the optimal weight depends on c. Similarly to the test of the joint hypothesis, the optimal weight is close to 1 for c = −200, but unlike the joint test, the optimal weight is increasing rather than decreasing in c. Unreported results show a similar pattern for other values of b 12 . All in all, there is no uniformly optimal value of w for the projection ARW_w test.

Comparison with Gospodinov (2010)
In Appendix B, we provide a comparison with Gospodinov (2010), who also considers inference in the bivariate model (2)-(3). He proposes a method of inference that relies on an additional overidentifying assumption that the modeler possesses knowledge of one parameter of the system. This assumption ensures that b 12 is identified and can be estimated from a function of the coefficients in the VECM representation of the model. For example, in the SVAR(1) model, his additional restriction reduces to the assumption that b 12 is known and is equal to zero (Gospodinov, 2010, p. 4). We therefore report simulations when the estimated model is SVAR(2), so that Gospodinov's estimator is nontrivial. We compare the power of our AR test of H 0 : b 12 = 0 against H 1 : b 12 ≠ 0 to the t test based on Gospodinov's method. We find that when Gospodinov's extra assumption holds both under the null and under the alternative, and when the DGP is very persistent, his t test is correctly sized and is more powerful than the AR test. However, when the highest root is far from unity or when Gospodinov's restriction is violated, his t test becomes size distorted and biased.

Blanchard and Quah (1989)
We first revisit the application of Blanchard and Quah (1989) (BQ), where Y 1t is log real GNP, and Y 2t is the unemployment rate in deviation from a linear trend. We use the original BQ dataset, which is quarterly and covers the period 1948q1 to 1987q4. More details about the data and transformations are given in the Online Supplementary Appendix.
FIGURE 6. IRFs to supply shock from a bivariate SVAR in real output growth and the unemployment rate by Blanchard and Quah (1989). The solid line is the ML estimator. The dotted lines are 90% Wald confidence intervals, and the dashed lines are the 90% projection ARW confidence intervals. The data is from Blanchard and Quah (1989) over the period 1948q1 to 1987q4.
The specification in BQ is an SVAR(9). Figure 6 reports the estimated IRFs together with robust 90% confidence bands based on our proposed ARW method and the corresponding nonrobust Wald confidence bands. We see that the robust confidence bands are so large that the original conclusion of BQ is not borne out. In other words, long-run restrictions produce very weak identification in this application using the original data. This corroborates the criticism of Pagan and Robertson (1998).
The results in Figure 6 used full-sample detrending, which is problematic when the data is persistent, as we saw in our numerical analysis in the previous section. This can be addressed using recursive detrending. Results based on recursive detrending of the unemployment rate are given in Figure 7. We see that the results are very sensitive to the detrending method. With recursive detrending, which is more reliable than full-sample detrending, the effect of the supply shock on output becomes clearly positive but the effect on unemployment remains ambivalent.
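The difference between the two detrending schemes can be illustrated with a short sketch. It assumes one common convention for recursive detrending, in which the value at date t is the residual from a linear trend fitted only on observations 1 to t. The exact definition used in the paper is given elsewhere, and the function names and the `min_obs` parameter below are hypothetical.

```python
def ols_trend(y):
    """Intercept and slope from regressing y on (1, s) for s = 1..len(y)."""
    n = len(y)
    s_mean = (n + 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((s - s_mean) ** 2 for s in range(1, n + 1))
    sxy = sum((s - s_mean) * (ys - y_mean)
              for s, ys in zip(range(1, n + 1), y))
    slope = sxy / sxx
    return y_mean - slope * s_mean, slope


def full_sample_detrend(y):
    """Residuals from a single trend fitted on the whole sample."""
    a, b = ols_trend(y)
    return [ys - a - b * s for s, ys in enumerate(y, start=1)]


def recursive_detrend(y, min_obs=3):
    """Residual at date t from a trend fitted on observations 1..t only."""
    out = []
    for t in range(min_obs, len(y) + 1):
        a, b = ols_trend(y[:t])
        out.append(y[t - 1] - a - b * t)
    return out
```

In this sketch the recursively detrended value at date t does not depend on observations after t, whereas every full-sample residual changes when a later observation changes.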
We should emphasize that weak identification is an empirical matter, so identification of the model may become stronger over a different sample. Figure 8 reports estimates of the IRFs based on the same specification as in Figure 7, but estimated over an extended sample that runs up to 2014q4. We notice that the point estimates are very similar, but error bands become significantly tighter, and identification appears to be strong.
FIGURE 7. Estimates and confidence bands of the IRFs in Blanchard and Quah (1989) with recursive detrending, using their original data. The solid line is the ML estimator. The dotted lines are 90% Wald confidence intervals, and the dashed lines are the 90% projection ARW confidence intervals.

The Hours Debate
Next, we turn to the debate on the short-run effect of a technology shock on hours initiated by the seminal papers of Galí (1999) and Christiano et al. (2003) (CEV). The analysis in those papers is based on an SVAR where Y 1t denotes log productivity and Y 2t denotes log hours.
The original paper by Galí (1999) estimated a negative short-run effect of a technology shock on hours, where Y 2t was the growth rate in hours, i.e., total log hours in first difference. Galí (1999) argued that this finding was inconsistent with real business cycle theory, but could be explained by sticky-price models. CEV criticized Galí's data and specification. Specifically, they argued for using log hours per capita as opposed to total hours and that Y 2t should be hours in levels as opposed to growth rates because the level specification encompasses the difference one. Reestimating using per capita hours in levels, they found a positive short-run effect of a technology shock on hours, contradicting Galí's conclusions.
There has been a large subsequent literature attempting to explain the above conflicting findings, see, for example, Chaudourne, Fève, and Guay (2014), Dupaigne, Fève, and Matheron (2007), Guay (2009, 2010), Ramey (2005, 2009), Gospodinov, Maynard, and Pesavento (2011), Pesavento and Rossi (2005), and Ramey (2016) (Section 5) for a recent review. Many of those papers emphasized possible misspecification due to omission of relevant variables and shocks from the SVAR, which could be addressed by adding more variables to the SVAR. Others emphasized the sensitivity of the estimates to assumptions about the number of permanent shocks and the effect of near unit roots. Our analysis below complements the literature by providing confidence bands on the impulse responses in question that are fully robust to weak identification. We focus our empirical analysis only on the baseline specifications in the two seminal papers in the literature, Galí (1999) and CEV, but we note that our methods are applicable to the more general SVAR specifications used in the literature.
FIGURE 8. Estimates and confidence bands of the IRFs with extended Blanchard and Quah (1989) data and recursive detrending. The solid line is the ML estimator. The dotted lines are 90% Wald confidence intervals, and the dashed lines are the 90% projection ARW confidence intervals.
We use the same data as Galí and CEV, 10 so the point estimates and conventional confidence bands reported below are the same as in those papers. Galí uses total hours linearly detrended over the sample 1948q1 to 1994q4. CEV use per capita hours and their sample is 1948q1 to 2001q4. The number of lags in the VAR is 5.
FIGURE 9. IRFs to technology shock for the difference specification of Galí (1999). The model is a bivariate SVAR in the first differences of log productivity and log total hours. The solid line is the ML estimator. The dotted lines are 90% Wald confidence intervals, and the dashed lines are the 90% projection ARW confidence intervals. The data is from Galí (1999) over the period 1948q1 to 1994q4.
Figure 9 presents point estimates and 90% confidence bands from the difference specification in Galí (1999) with total hours. We see that the projection ARW confidence bands are not much wider than the nonrobust ones reported by Galí (1999), indicating that this specification does not suffer from weak identification. This conclusion is robust to using the growth in per capita hours instead of total hours. 11 However, the results on the difference specification are subject to the valid critique by CEV regarding possible misspecification if hours do not have a unit root. 12
Figure 10 presents the CEV estimates and confidence intervals based on the levels specification, together with the robust projection ARW confidence bands. Unlike the Wald bands, the robust confidence bands are so wide that the response of hours to a technology shock is no longer significant. The information in the long-run restriction is so small that the data is consistent both with a positive as well as a negative response of hours to a technology shock. Therefore, the original conclusions of CEV are not robust to weak identification.
FIGURE 10. IRFs to technology shock for the level specification of Christiano et al. (2003). The model is a bivariate SVAR in the growth of productivity and the level of log per capita hours. The solid line is the ML estimator. The dotted lines are 90% Wald confidence intervals, and the dashed lines are the 90% projection ARW confidence intervals. The data is from Christiano et al. (2003) over the period 1948q1 to 2001q4.
In the Online Supplementary Appendix, we report further results that indicate that the above conclusion on the weak identification of the level specification is robust to detrending of hours and to extensions of the estimation sample. All in all, we see that long-run restrictions are not very informative in this application, unless one is willing to impose the arguably strong assumption that hours have a unit root.

CONCLUSIONS
We proposed a method of inference on the parameters of SVARs identified using long-run restrictions that is robust to both weak instruments and near unit roots in the data. The method uses IVX-type instruments obtained by filtering the potentially nonstationary variables to make them near stationary. We propose to test hypotheses on the parameters that are potentially weakly identified using the Anderson-Rubin test with filtered instruments. Tests of general parametric restrictions, and confidence intervals for differentiable functions of the parameters, such as IRFs or forecast error variance decompositions, are obtained using a combined AR and Wald test. The robust test and associated confidence bands are easy to compute, and offer informative and reliable inference in two high-profile applications.

Case of Two Permanent Shocks.
In this case there is no cointegration, so the model is a VAR in first differences: The long-run restriction that permanent shocks to Ỹ 2t have no impact on Ỹ 1t is (Note that in this case C 22 does not need to be 0.) The long-run restriction then implies: As before, this can also be expressed as an exclusion restriction. First, from the Beveridge and Nelson (1981) (henceforth BN) decomposition we have b 12 + γ 12 (L) = b 12 + γ 12 (1) + γ* 12 (L).
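The BN decomposition of a finite lag polynomial can be checked numerically. The sketch below uses the standard form γ(L) = γ(1) + (1 − L) γ*(L), with γ*_j equal to minus the sum of the coefficients at lags above j. It writes the factor (1 − L) explicitly, whereas the notation γ* 12 (L) in the text may absorb it. The function names are illustrative.

```python
def bn_decompose(gamma):
    """Return (gamma(1), gamma_star) for gamma(L) = sum_j gamma[j] * L^j.

    The pieces satisfy gamma(L) = gamma(1) + (1 - L) * gamma_star(L).
    """
    long_run = sum(gamma)
    gamma_star = [-sum(gamma[j + 1:]) for j in range(len(gamma) - 1)]
    return long_run, gamma_star


def recompose(long_run, gamma_star):
    """Coefficients of gamma(1) + (1 - L) * gamma_star(L), lag 0 first."""
    coeffs = [0.0] * (len(gamma_star) + 1)
    coeffs[0] += long_run
    for j, g in enumerate(gamma_star):
        coeffs[j] += g       # gamma_star_j * L^j
        coeffs[j + 1] -= g   # minus gamma_star_j * L^(j+1)
    return coeffs
```

For example, the coefficients (0.5, −0.3, 0.2) have long-run sum 0.4, and recomposing the two pieces returns the original coefficients up to rounding error.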

A.2. Misspecification of Difference Specification
Using (A.3) to substitute for γ 12 (L) in (A.2) yields Similarly, the reduced form equation for the level specification imposes no extra restriction, and uses Ỹ 2,t −1 as an instrument in (A.5). The difference specification imposes b 12 + γ 12 (1) = α 2 = 0, which enables us to use Y 2,t −1 as an instrument in (A.5). The difference specification will be misspecified if b 12 + γ 12 (1) ≠ 0. In principle, this misspecification is detectable by a suitable diagnostic test. However, the power of such a test depends on the value of α 2 ≠ 0. Only when α 2 is far from zero can we reject α 2 = 0 with high probability. Otherwise, if we do not reject α 2 = 0 and impose it incorrectly, the bias that will result depends on the true value of b 12 + γ 12 (1) and can be arbitrarily large. This discussion formally corroborates CEV's critique.
We study the power of Gospodinov's t test obtained under knowledge of π = 0 and local-to-unity asymptotics, and compare it to our proposed AR test that is robust to violations of those assumptions. We first consider the SVAR(1) DGP used in our previous simulation study reported in the main text, with c = −10 and ρ = 0.5 (but note that the estimated model is SVAR(2), otherwise Gospodinov's estimator is trivial). The results are reported in Figure 11. We find that Gospodinov's t test has correct size but lower power than the AR test, despite the fact that it uses an additional assumption. This is because Gospodinov's overidentifying assumption π = 0 is violated under the alternative and this works to reduce the power of the test on b 12 in this case.
Next, we consider a DGP in which Gospodinov's extra restriction holds both under the null and under the alternative. For this purpose, we use the SVAR(2) DGP used in Gospodinov (2010). Figure 12 reports power curves when the DGP has π = 0, so Gospodinov's assumption holds and b 0 12 = −0.16. When c is close to zero, Gospodinov's test is more powerful yet slightly oversized. When c is large and negative, Gospodinov's method becomes oversized and biased (i.e., rejects less under some alternatives than under the null). Figure 13 reports power curves when π = −1/2, so Gospodinov's method is misspecified. In this case, the t test is invalid: it is oversized and biased. In contrast, the AR test remains valid across all DGPs used in Figures 11 to 13.
Let F T (θ) denote the sample moments, where f t (θ) is a k-dimensional vector-valued function of data and parameters with k ≥ p and E ( f t (θ)) = 0 at the true value of θ. Let r (θ) be a known q-dimensional function of the parameters, with q ≤ p ψ . Suppose f t (ϑ, ·) and r (ϑ, ·) are continuously differentiable with respect to ψ, and let J T (θ) = ∂F T (θ) /∂ψ′ and R (θ) = ∂r (θ) /∂ψ′. Let V f (θ) denote a k × k matrix that is positive definite almost surely, and define the GMM objective function, in which the weighting matrix may be evaluated at some one-step GMM estimator of ψ (for 2-step GMM) or at ψ itself (for continuously updated GMM). Suppose the constrained GMM estimator of ψ given ϑ exists.
To simplify notation, let ψ̃ ≡ ψ̃ (ϑ), r̃ (ϑ) = r (ϑ, ψ̃), R̃ (ϑ) = R (ϑ, ψ̃), Ṽ f (ϑ) = V f (ϑ, ψ̃), F̃ T (ϑ) = F T (ϑ, ψ̃) and Ĵ T (ϑ) = J T (ϑ, ψ̃). Also, let Ĉ (ϑ) be an almost surely full-rank k × (k − p ψ ) matrix that spans the null space of Ṽ f (ϑ)^(−1/2) Ĵ T (ϑ). Let Ĉ ψ be a square matrix such that Ĉ ψ Ĉ ψ′ = V ψ (ϑ)^(−1). The following result gives high-level conditions under which the asymptotic distribution of ARW (ϑ) is χ² with p ϑ + q degrees of freedom when ϑ is the true value of that parameter and r (θ) = 0. It can then be used to form a test of H * 0 : ϑ = ϑ 0 , r (θ) = 0 against H * 1 : ϑ ≠ ϑ 0 and/or r (θ) ≠ 0.
THEOREM C.1. Suppose that at the true value of the parameters θ = (ϑ, ψ), (iv) there exists a nonstochastic p ψ × p ψ symmetric matrix B T → 0 such that B T Ĉ ψ converges weakly to a limit that is full-rank a.s., and (v) any stochastic elements in that limit are independent of ξ = (ξ 1 , ξ 2 ).

Proof.
By assumption (ii) and Slutsky's theorem, we have R̃ (ϑ) = R (θ) + o p (1). By the singular value decomposition, R (θ) B T = Q T T U T′ , where Q T is an orthonormal q × q matrix, T → 0 is a diagonal matrix holding the singular values of R (θ) B T , and U T is a p ψ × q matrix such that U T′ U T = I q . So,