FINITE-SAMPLE SIZE CONTROL OF IVX-BASED TESTS IN PREDICTIVE REGRESSIONS

In predictive regressions with variables of unknown persistence, the use of extended IV (IVX) instruments leads to asymptotically valid inference. Under highly persistent regressors, however, the standard normal or chi-squared limiting distributions of the usual t and Wald statistics may differ markedly from the actual finite-sample distributions, which exhibit, in particular, noncentrality. Convergence to the limiting distributions is shown to occur at a rate depending on the choice of the IVX tuning parameters, and can be very slow in practice. A characterization of the leading higher-order terms of the t statistic is provided for the simple regression case, which motivates finite-sample corrections. Monte Carlo simulations confirm the usefulness of the proposed methods.


INTRODUCTION
A common inferential task of practical relevance is to decide whether a potential predictor variable does indeed forecast another variable of interest. In the simplest setup, practitioners thus test the null hypothesis of no predictability in the model y_t = µ + βx_{t−1} + u_t, t = 2, …, T, where the regressor is usually assumed to have an autoregressive structure, with initial condition bounded in probability, x_1 = O_p(1). With financial data, predictors such as dividend yields or earnings-price ratios are often quite persistent, even if still mean-reverting (typically captured by a value of ρ close to unity), and their shocks are contemporaneously correlated with the variable to be predicted (see Phillips, 2015, for a recent review). This biases the OLS estimator of the slope parameter and induces heavy non-normality of t statistics (Elliott and Stock, 1994; Stambaugh, 1999), such that tests for predictability are size-distorted. Near-to-unity asymptotics, obtained by letting ρ = 1 − c/T, offer a better approximation of the actual distribution of the OLS t statistic than the standard normal in this situation; cf. Elliott and Stock (1994). The limiting distribution of the OLS estimator and test is explicitly non-normal under near integration and depends on the mean-reversion parameter c and the correlation between u_t and v_t. Since consistent estimation of c is not possible in such highly persistent cases (Phillips, 1987), the literature has suggested several different ways of circumventing the lack of knowledge of ρ. See, among others, Campbell and Yogo (2006), Jansson and Moreira (2006), Maynard and Shimotsu (2009), Camponovo (2015), Phillips (2015), and Breitung and Demetrescu (2015).
Building on the work of Phillips and Magdalinos (2007) and Magdalinos and Phillips (2009), the extended IV (IVX) estimation and testing approach introduced by Phillips and Magdalinos (2009) is gaining momentum for predictive regressions; see, for example, Gonzalo and Pitarakis (2012), Phillips and Lee (2013), Kostakis, Magdalinos, and Stamatogiannis (2015), Demetrescu and Rodrigues (2016), Demetrescu et al. (2020), or Yang et al. (2020). In the IVX framework, x_{t−1} is instrumented by the specifically constructed instrumental variable z_{t−1} = (1 − ϱL)^{−1}_+ x_{t−1} = Σ_{j=0}^{t−2} ϱ^j x_{t−1−j}, with initial condition z_1 = 0 and ϱ = 1 − a/T^η, where a > 0 and η ∈ (0,1). This "endogenous instrumentation" method has convenient properties: the persistence of z_t is under control and lies below that of the near-integrated x_{t−1}. Under suitable regularity conditions, the resulting IV estimator has a mixed Gaussian distribution in the limit, and the limiting null distribution of the corresponding t ratio is standard normal.
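The truncated filter above is straightforward to implement recursively. The following sketch (the function name `ivx_instrument` is ours, and the exact index convention for the first observation differs slightly across papers, so treat the initialization as an assumption) constructs the instrument and its mildly integrated root:

```python
import numpy as np

def ivx_instrument(x, a=1.0, eta=0.95):
    """Construct the IVX instrument by the truncated filter
    z_t = rho_z * z_{t-1} + x_{t-1} with z_1 = 0 and rho_z = 1 - a/T^eta,
    so that z_t = sum_{j=0}^{t-2} rho_z^j x_{t-1-j}."""
    T = len(x)
    rho_z = 1.0 - a / T**eta
    z = np.zeros(T)                       # z[0] is the paper's z_1 = 0
    for t in range(1, T):
        z[t] = rho_z * z[t - 1] + x[t - 1]  # built from lagged x only
    return z, rho_z
```

The root 1 − a/T^η approaches unity as T grows, but more slowly than a near-unit root 1 − c/T, which is what keeps the persistence of z_t below that of x_t.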
Should the regressor x_t be highly persistent with localization parameter c close to zero, the IVX-based test of no predictability may still be seriously distorted in finite samples, even if less so than the OLS-based test. This is clearly the case when η is chosen too close to unity or a too close to zero, so that the difference in terms of persistence between the instrument z_t and the regressor x_t becomes small: for example, the rule of thumb proposed by Kostakis et al. (2015), which sets ϱ = 1 − 1/T^0.95, is actually equivalent to a near unit root with localizing coefficient c̃ between 1 and 2 for sample sizes between T = 100 and T = 10,000. Kostakis et al. (2015), therefore, recommend the use of a finite-sample correction leading to reliable size control for two-sided tests.¹ We find, however, that the correction is not equally effective for tests against one-sided alternatives. This is relevant in practice, as economic theory often predicts a certain sign of the slope coefficient β.
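The equivalence is easy to verify: matching ϱ = 1 − 1/T^0.95 with ϱ = 1 − c̃/T gives c̃ = T(1 − ϱ) = T^0.05, which can be checked directly:

```python
# Implied localizing coefficient of the Kostakis et al. (2015) rule of thumb:
# rho_z = 1 - 1/T^0.95 equals 1 - c/T with c = T * (1 - rho_z) = T**0.05.
for T in (100, 1000, 10000):
    c_tilde = T * (1.0 / T**0.95)   # = T**0.05
    print(T, round(c_tilde, 3))     # stays between 1 and 2 for all three T
```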
We, therefore, examine in Section 2 the behavior of components of the t statistic that vanish in the limit but still have an effect in finite samples. We do so in a setup allowing for deterministically varying variances and correlations of the errors u_t and v_t. Since the main source of distortions in finite samples appears to be the fact that the finite-sample distribution is not centered at zero (see also Stambaugh, 1999), we focus on correcting for the noncentrality of the t ratio. One way of doing so is to resort to backward and forward demeaning of the involved variables. In time series analysis, backward (or recursive, or adaptive) demeaning can be traced back at least to the work of So and Shin (1999), where recursive demeaning is shown to reduce bias in estimators of large autoregressive roots. Specifically for (panel) predictive regressions, Westerlund, Karabiyik, and Narayan (2017) resort to forward and backward demeaning to reduce endogeneity bias. While this is shown to stabilize size, we also find that it has the side effect of reducing power in a nontrivial manner. This is a specific effect of forward demeaning in the context of persistent predictors, and not of IVX. Therefore, we discuss the use of direct approximations of the higher-order terms affecting the finite-sample behavior of the t statistic. Some depend on the localizing coefficient c, which cannot be consistently estimated, so we provide a method of side-stepping this issue. In extensive Monte Carlo experiments (see Section 3), we find it to work reasonably well under various patterns of changing error variances.
The technical details of the proofs can be found in the Appendix and in an Online Supplement, which also contains additional simulation results pertaining to conditional heteroskedasticity.

Preliminaries
Let us first specify the details of the predictive regression model we work with.

Assumption 1. The data {y_t, x_t}, t = 2, …, T, are generated from (1) and (2).

To keep a realistic setup, we allow for error heterogeneity in the form of time-varying variances and correlations, as well as short-run dynamics. Specifically, we work under the following assumptions.
This would be a typical structure in predictive regressions for stock returns, where the disturbance u_t is not predictable using the past of v_t. We do not assume a particular distribution for the errors but only require finite fourth-order moments. Although daily returns may exhibit fat tails, standard predictive regression models are used in conjunction with monthly, quarterly, or even annual data, where infinite kurtosis is not an issue. For the same reason, the serial independence assumption we place on the innovations is justifiable. The 1-summability condition placed on the coefficients of the filter is standard in the literature involving integrated and near-integrated variables. Let Σ(s) = H(s)H(s)′ and notice that we have time-varying variances, covariances, and correlations of the errors, as Cov((u_t, ν_t)′) = Σ(t/T); Σ is not restricted beyond piecewise smoothness. The off-diagonal elements of Σ(t/T) are not required to be zero, thereby allowing for predictive regression endogeneity. The assumption on H(s) allows for a wide range of covariance matrices of the innovations, including, for example, single or multiple (co-)variance shifts, smooth transition (co-)variance shifts, or even trending variances.
With W a vector of two independent standard Wiener processes and "⇒" denoting weak convergence of probability measures on the space of càdlàg real functions on [0,1] equipped with the Skorokhod topology, we have (see Cavaliere, Rahbek, and Taylor, 2010) that the normalized levels of x_t converge weakly to a heteroskedastic Ornstein-Uhlenbeck-type process. IVX estimation relies on using the instrument together with Eicker-White standard errors to account for the heteroskedasticity. The residuals û_t are computed using the OLS estimator of β, as is common in the predictive IVX regression literature.²

[Figure 1. Finite-sample distribution of the IVX t statistic; data generated from (1) and (2) with ρ = 1, (u_t, v_t) ∼ iid N(0, [(1, δ); (δ, 1)]), 25,000 replications, different correlations δ, and sample sizes.]
What makes the IVX approach interesting for practitioners is that the terms involving c vanish as T → ∞ and pivotal inference on β can be obtained asymptotically. See Kostakis et al. (2015) for details on IVX-based predictive regression under strict stationarity of errors, and Demetrescu and Rodrigues (2016) for a case with time-varying variances with some (nontrivial) restrictions on the correlations. In finite samples, however, the actual distribution is not centered at zero because numerator and denominator correlate, and has a variance somewhat smaller than 1, as can be seen in Figure 1. Notice also the slow convergence to the standard normal.
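The noncentrality described here can be reproduced with a short simulation. The sketch below is our own stylized implementation of the Figure 1 setup (ρ = 1, iid normal errors with strongly negative correlation δ), with simplified demeaning and Eicker-White standard errors based on OLS residuals; it is an illustration, not the published procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def ivx_t(y, x, a=1.0, eta=0.95):
    """Stylized IVX t statistic for H0: beta = 0: y_t and x_{t-1} are
    demeaned, the instrument z is not, and Eicker-White standard errors
    use residuals from the OLS slope estimate."""
    T = len(y)
    rho_z = 1.0 - a / T**eta
    z = np.zeros(T)
    for t in range(1, T):
        z[t] = rho_z * z[t - 1] + x[t - 1]   # z built from lagged x only
    yd = y[1:] - y[1:].mean()
    xd = x[:-1] - x[:-1].mean()
    zl = z[1:]
    u_hat = yd - (xd @ yd) / (xd @ xd) * xd  # OLS residuals
    return (zl @ yd) / np.sqrt(zl**2 @ u_hat**2)

def simulate_null_t(T=200, delta=-0.95, reps=2000):
    """Simulate the IVX t statistic under the null beta = 0 with rho = 1."""
    L = np.linalg.cholesky(np.array([[1.0, delta], [delta, 1.0]]))
    out = np.empty(reps)
    for r in range(reps):
        e = rng.standard_normal((T, 2)) @ L.T
        u, v = e[:, 0], e[:, 1]
        x = np.cumsum(v)          # exact unit root, rho = 1
        out[r] = ivx_t(u, x)      # y_t = u_t under the null (mu = 0)
    return out

t_null = simulate_null_t()
# For delta = -0.95 the simulated distribution is visibly centered above
# zero, with standard deviation somewhat below one, as in Figure 1.
print(round(t_null.mean(), 2), round(t_null.std(), 2))
```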

Higher-Order Terms
We, therefore, study corrections that make IVX-based inference in predictive regressions even more reliable. To this end, we first characterize the leading terms of the IVX t statistic.
Proposition 1. Under Assumptions 1-3 and any η ∈ (1/2, 1), it holds as T → ∞ that

t_vx = Z_T + B_T + C_T + o_p(T^{η/2−1/2}),

where Z_T is asymptotically standard normal and B_T and C_T are the leading higher-order terms.

Proposition 1 provides an explanation for the finite-sample behavior of t_vx as observed in Figure 1 for c = 0. For instance, the direction and magnitude of the noncentrality depend on the average sign and magnitude of the correlation between the errors u_t and ν_t via the two terms B_T and C_T. It can be seen from the discussion below that, under constant correlation δ, the magnitude of the noncentrality is in fact proportional to δ. Moreover, the slow convergence of t_vx to the standard normal seen in Figure 1 can also be explained by the behavior of the two terms B_T and C_T: although they do vanish, they do so at rate T^{η/2−1/2}, which is slow whenever η is close to unity.
The noncentrality is mainly driven by the two terms B_T and C_T. The first depends on the user-chosen parameters a and η (with the additional restriction η > 1/2 required for the calculation of E(B_T)), as well as on a particular form of average correlation. In fact, under homoskedasticity (H = const.), the expectation of T^{1/2−η/2} B_T is asymptotically equivalent to −δ/√(2a), with δ the constant correlation of u_t and ν_t. If σ_uν(s) = 0 for all s ∈ [0,1], B_T does not affect the centering of t_vx.
The same holds for the second component, C_T: if σ_uν(s) = 0 for all s ∈ [0,1], then U_H and B_{c,H} are independent; therefore, the expectation of the limit of the normalized C_T is zero as well. Should there, however, be contemporaneous correlation, the behavior of C_T, in particular its expectation, does depend on c. Moreover, the dependence is nonlinear, as the expectation involves the functional ∫₀¹ e^{−c(1−s)} σ_uν(s) ds. As expected, the expectation decreases in magnitude as c increases. This expression simplifies under homoskedasticity, too, where T^{1/2−η/2} C_T has an asymptotic expectation depending on the (constant) correlation δ, namely −δ √(2/a) (1 − e^{−c})/c. For c = 0, the case with the largest distortions, this expectation is twice as large as that of the normalized B_T, with the relative importance of C_T diminishing as c increases. This component depends, however, on the localizing coefficient c, which cannot be consistently estimated, unlike the expectation of B_T. Figure 2 plots the contributions of both B_T and C_T to the noncentrality of the t statistic t_vx. We note that heteroskedasticity has only a secondary influence compared to the localizing coefficient c, and that most (but not all) of the finite-sample noncentrality seen in Figure 1 for c = 0 is accounted for by the two terms.

[Figure 2. Contributions of B_T and C_T to the noncentrality of t_vx; left panel: constant correlation δ = −0.95 (see Section 3 for more details); right panel: Σ_t = [(1, δ(t/T)); (δ(t/T), 1)], δ(·) switching between −0.5 and −0.95.]

Corrections
The term C_T from Proposition 1 appears because of the full-sample demeaning of the dependent variable (see the proof of Proposition 1 for details). To deal with this, we first discuss recursive demeaning as a possible correction. In particular, we use backward recursive demeaning for the regressor and forward demeaning for the dependent variable. The motivation for such demeaning schemes is that the recursively demeaned regressor and the forward-demeaned disturbance are orthogonal irrespective of the correlation between u_t and ν_t, which is not the case with usual demeaning. Such orthogonal schemes of mean adjustment have been used before in predictive regressions: for example, Westerlund et al. (2017) use such a scheme to develop a predictability test in panel predictive regressions.³ In fact, in the panel literature, forward and backward demeaning have a much longer history in dealing with the Nickell bias (Nickell, 1981); see Everaert (2013) for a recent contribution.
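A minimal sketch of the standard orthogonal scheme, in the spirit of So and Shin (1999) and Westerlund et al. (2017) (the exact adjustment formulas used in the paper may differ in detail): the regressor loses the mean of its own past, the dependent variable the mean of its present and future.

```python
import numpy as np

def backward_demean(x):
    """Recursive (backward) demeaning: x_t minus the mean of x_1, ..., x_t,
    so the adjusted regressor is a function of past data only."""
    t = np.arange(1, len(x) + 1)
    return x - np.cumsum(x) / t

def forward_demean(y):
    """Forward demeaning: y_t minus the mean of y_t, ..., y_T,
    so the adjusted variable is a function of present and future data only."""
    n = np.arange(len(y), 0, -1)
    return y - np.cumsum(y[::-1])[::-1] / n
```

Because the backward-demeaned regressor at t − 1 depends only on shocks up to t − 1 while the forward-demeaned disturbance at t depends only on shocks from t onward, their cross-products have zero mean term by term under serial independence, whatever the contemporaneous correlation between u_t and ν_t.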
This effect of recursive adjustment is very much in the spirit of the proposal of Kostakis et al. (2015), who point out that not demeaning the instrument z_{t−1} (while still demeaning the dependent variable and the predictor itself to account for a nonzero intercept in the predictive regression) reduces the finite-sample correlation between the numerator and the denominator of the IVX t statistic. We shall examine the corrections of Kostakis et al. (2015) more closely after analyzing the effect of the orthogonal mean adjustment scheme in:

Proposition 2. Under the assumptions of Proposition 1, it holds as T → ∞ that

t^rec_vx = Z_T + B_T + o_p(T^{η/2−1/2}).

Proposition 2 shows that the dependence on c of the leading higher-order terms may in fact be eliminated. Our Monte Carlo study (see Section 3) shows that t^rec_vx performs quite well in terms of size in spite of the remaining term B_T, so the first correction we suggest is orthogonal mean adjustment.
The Monte Carlo study also shows that the local power of t^rec_vx is low. To see why, examine the statistic, with ẍ_t denoting forward-demeaned quantities. Under the alternative β ≠ 0, power is driven by the cross-product Σ_t (z_{t−1} − z̄_{t−1})(x_{t−1} − ẍ_{t−1}). Rewriting the forward-demeaned x_{t−1} shows that the effect on t^rec_vx under the alternative is similar to that of a weak instrument. This phenomenon is caused by the correction itself and not by the IVX instrumentation.
Turning our attention to the corrections proposed by Kostakis et al. (2015), they result in the statistic t^W_vx, with ω̂²_u and ω̂²_v estimators of the long-run variances of u_t and v_t, and λ̂_uv an estimator of the long-run covariance of u_t and v_t. The behavior of t^W_vx is discussed in:

Proposition 3. Under the assumptions of Proposition 1, it holds as T → ∞ that t^W_vx = Z_T + B_T + C_T + o_p(T^{η/2−1/2}), with Z_T, B_T, and C_T from Proposition 1.

Proof: See Appendix B.
Although t^W_vx has the same leading terms as t_vx, we give in Section II of the Online Supplement some of the higher-order terms of t_vx which are of order O_p(T^{η−1}) and missing from t^W_vx. Since they contribute to the noncentrality (cf. Kostakis et al., 2015, p. 1516, and also the differences seen by comparing Figures 1 and 2), their absence likely improves the finite-sample behavior of t^W_vx and thus explains our findings in the Monte Carlo section that the two-sided t^W_vx statistic performs remarkably well. The impact of the terms B_T and in particular C_T on the one-sided versions of t^W_vx is, however, not negligible, and we therefore move on to propose explicit corrections for B_T and C_T.
The quantities involved in the expectation of B_T may, for instance, be estimated using smoothed residuals, delivering an estimate of the average correlation. In dealing with C_T, it may be tempting to proceed analogously. Yet, with c unknown and no consistent estimator available, this approach seems of limited applicability in general. Alternatively, one may try to match the functional ∫₀¹ e^{−c(1−s)} σ_uν(s) ds using the expectation of another functional depending on c. We illustrate this idea for the case of homoskedasticity, where 2 Var(J_{c,H}(1/2)) = (1 − e^{−c})/c, which suggests employing a quantity with this expectation to accommodate the noncentrality induced by C_T.
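The identity 2 Var(J_{c,H}(1/2)) = (1 − e^{−c})/c can be checked directly from the variance of the homoskedastic Ornstein-Uhlenbeck process, Var(J_c(r)) = ∫₀^r e^{−2c(r−s)} ds, here evaluated by simple quadrature:

```python
import math

def ou_variance(c, r, n=100_000):
    """Var(J_c(r)) = int_0^r exp(-2c(r - s)) ds via midpoint quadrature."""
    h = r / n
    return h * sum(math.exp(-2.0 * c * (r - (k + 0.5) * h)) for k in range(n))

for c in (0.5, 1.0, 5.0, 30.0):
    lhs = 2.0 * ou_variance(c, 0.5)
    rhs = (1.0 - math.exp(-c)) / c   # the claimed expectation
    assert abs(lhs - rhs) < 1e-6
```

As c → 0 the right-hand side tends to 1, matching 2 Var(W(1/2)) = 1 for a standard Wiener process, consistent with the c = 0 case being the one with the largest noncentrality.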
While it is likely possible to modify this approach in certain particular cases (say for breaks in variances and covariances at suitable times), the general case seems out of reach. We, therefore, propose the use of the correction for the homoskedastic case (and also point to Figure 2 as additional motivation for this proposal). In fact, our simulations in Section 3 and in Section III of the Online Supplement show this to work reasonably well under heteroskedasticity too.
The resulting correction term for the expectation of B_T is then b̂_T (given explicitly below), where δ may be estimated as the correlation of ν_t and u_t based on û_t and ν̂_t from an AR(p) approximation of x_t, with p selected via an information criterion (we resort to the Akaike IC).
For C_T, the natural choice following from the property of the Ornstein-Uhlenbeck process discussed above is then (1/(ψ²σ²_ν)) T^{−1} x²_{[T/2]}, leading to the correction term ĉ_T, since ω² = ψ²σ²_ν is simply the (stationary) long-run variance of v_t (which may be estimated either based on x_t or, as we proceed in our simulations, on the residuals of a first-order autoregression of x_t).
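A small simulation illustrates why T^{−1} x²_{[T/2]}/ω̂² is a workable proxy: for a pure random walk (c = 0, ω² = 1), twice this quantity has expectation 2 Var(W(1/2)) = 1, the c → 0 limit of (1 − e^{−c})/c. (The factor 2 is our reading of the elided display; the qualitative point, an unbiased but noisy proxy, does not depend on it.)

```python
import numpy as np

rng = np.random.default_rng(1)

T, reps = 400, 4000
proxy = np.empty(reps)
for r in range(reps):
    x = np.cumsum(rng.standard_normal(T))    # c = 0, long-run variance 1
    proxy[r] = 2.0 * x[T // 2 - 1] ** 2 / T  # proxy for (1 - e^{-c})/c

print(round(proxy.mean(), 3))   # close to 1, the c -> 0 limit
print(round(proxy.std(), 3))    # but very noisy, which is why the text
                                # adds finite-sample modifications
```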
It should be noted, however, that this delivers a noisy proxy for the mean of C_T: while it removes the noncentrality due to C_T (at least under homoskedasticity), it at the same time marginally inflates the variance of the corrected t statistic. The presence of the estimator ω̂² in the denominator inflates the variance further: since we employ a nonparametric estimator, its variability in finite samples is large enough to counteract the beneficial effect of the correction. Concretely, it induces outliers in the distribution of the correction and inflates the variance of the corrected statistic. To deal with these issues, we add finite-sample modifications which do not affect the asymptotics.
Finally, should x_t be stationary instead of near-integrated, this bias correction may overcorrect, since, for ρ away from unity, the standard normal asymptotics do relatively well even when ϱ is close to unity; see Kostakis et al. (2015). A practical adjustment of the correction is to restrict ϱ in t*_vx to be smaller than an estimate of ρ. In particular, we suggest ϱ̃ = min{ϱ, ρ̂}, where ρ̂ is the OLS estimator in a first-order autoregression of x_t. Asymptotically, this restriction makes no difference under near-integration, but prevents the bias correction from "overshooting." To sum up, we suggest using, with b̂_T = −δ̂/√(2T(1 − min{ϱ, ρ̂})), the statistic t*_vx, where δ̂/3 is intended to capture the finite-sample correlation of ĉ_T and t_vx and is tuned to homoskedasticity. Our Monte Carlo study in Section 3 and in Section III of the Online Supplement shows that t*_vx works well under heteroskedasticity too.
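The reparameterization behind b̂_T is that ϱ = 1 − a/T^η gives a = T^η(1 − ϱ), so the homoskedastic expectation −δ/√(2a) · T^{(η−1)/2} equals −δ/√(2T(1 − ϱ)); with the cap min{ϱ, ρ̂} this is computable without knowing a and η separately. A sketch (function name ours):

```python
import math

def b_hat(delta_hat, rho_z, rho_hat, T):
    """Correction term -delta / sqrt(2 T (1 - min(rho_z, rho_hat))).
    Capping the instrument root at the estimated AR(1) root rho_hat
    keeps the correction from overshooting when x_t is stationary."""
    return -delta_hat / math.sqrt(2.0 * T * (1.0 - min(rho_z, rho_hat)))

# Equivalence with -delta/sqrt(2a) * T^((eta-1)/2) when rho_hat >= rho_z:
T, a, eta, delta = 500, 1.0, 0.95, -0.95
rho_z = 1.0 - a / T**eta
direct = -delta / math.sqrt(2.0 * a) * T ** ((eta - 1.0) / 2.0)
assert abs(direct - b_hat(delta, rho_z, rho_hat=1.0, T=T)) < 1e-12

# A stationary regressor (rho_hat well below 1) shrinks the correction:
assert abs(b_hat(delta, rho_z, 0.5, T)) < abs(b_hat(delta, rho_z, 1.0, T))
```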

FINITE SAMPLE EVIDENCE
In this section, we provide finite-sample evidence on the merits of the remedies proposed in this paper. We use a data generating process (DGP) as outlined in equations (1) and (2), with independent innovations governed by a bivariate normal distribution with a correlation coefficient of δ = −0.95 (which is typical for predictive regressions with stock returns; see, e.g., Phillips, 2015), as well as time-varying volatility following several variance patterns. The size study results are generated using 10,000 replications and considering c ∈ {0, 1, 5, 10, 30, 50} together with β = 0 for T = 250 and 500. To analyze the behavior of the corrected tests under the alternative, we consider a sequence of local alternatives characterized by β = (b/T)√(1 − δ²) for b ∈ {−26, −24, …, −2, 0, 2, …, 26}. Note that under b = 0, the size properties of the test are recovered. Since the sign of β might be known in practice (as is often the case when the choice of the predictor is motivated by economic theory⁴), we consider local alternatives covering both situations, β < 0 and β > 0, alongside cases where two-sided testing is of interest. Throughout this section, we fix a = 1 and η = 0.95 following the recommendation of Kostakis et al. (2015).
We compare four versions of the IVX statistic testing the null β = 0: the original IVX t statistic (t_vx), the finite-sample adjusted version of Kostakis et al. (2015) (t^W_vx),⁵ the IVX t statistic computed with orthogonal mean adjustment (t^rec_vx), and our bias-corrected proposal (t*_vx). Table 1 shows the finite-sample rejection frequencies at the 5% nominal level for strong negative contemporaneous correlation δ = −0.95.⁶ The finite-sample noncentrality of the standard IVX t statistic, t_vx, leads as expected to huge size distortions, which drop to reasonable levels only for c = 10 if not c = 30. The time variation of the variance influences these distortions, but not by much. Also, they do not drop with increasing T, in line with the slow convergence rates in Proposition 1. The statistic t^W_vx, on the other hand, shows that the finite-sample corrections introduced in Kostakis et al. (2015) work excellently in the two-sided case. Only for c = 50 can one observe a very slight tendency to overreject (with rejection frequencies closer to 6% than to 5% for T = 250). However, the t^W_vx statistic does not behave too well in each tail taken alone, as it tends to overreject to the right (one sees large rejection frequencies for small c, and even for c = 50 we note rejection frequencies above 8%) and to underreject to the left (this is most visible for small c, where the rejection frequencies are below 1%). This does not improve significantly for larger T = 500, and exhibits little variation across the different variance patterns. The statistic with backward and forward recursive demeaning, t^rec_vx, has very good size control (with some exceptions for c = 0, where rejection frequencies of 7% may be observed for the test against right-sided alternatives, and some cases of underrejection: for left-sided testing under downward breaks and c = 0, 1, we observe rejection frequencies of 2% or 3%).
Finally, the t*_vx statistic has the best size control of all four tests: while it sometimes underrejects for left-sided testing (in the same situations where the t^rec_vx statistic was undersized), most rejection frequencies lie between 4% and 6%, with only a handful of cases where the 6% threshold is exceeded, and no rejection frequency above 7%.
Summing up, all three modified statistics may be used in a two-sided testing situation as far as size control is concerned. For one-sided testing, the use of t^W_vx is not recommended, as it overrejects to the right and severely underrejects to the left, which has a dampening effect on rejection frequencies under the alternative; see below. The simulations also confirm the power-reducing effect of the orthogonal mean adjustment scheme mentioned after Proposition 2.
We present in Figures 3-5 plots of the rejection frequencies of the four statistics for c = 0, 10, 30, all variance patterns, and all test variants (left-, right-, and two-sided).
For left-sided testing, it is t*_vx that has the best rejection rates in all cases. Compared to t_vx and t^W_vx, this is because t*_vx is centered correctly and therefore not undersized. Here, t_vx seems to perform a bit better than t^W_vx. The test based on backward and forward adjustment has poor rejection properties under the alternative; the gap to the other tests decreases as c increases, but rejection rates drop anyway with increasing c.

⁵ Kostakis et al. (2015) consider a Wald statistic W for which (t^W_vx)² = W.
⁶ The findings are symmetric in the sign of δ; moreover, size behavior improves uniformly for decreasing magnitude of δ, so we do not include the exact figures to save space.

[Figures 3-5. Rejection frequencies for data generated from (1) and (2) with v_t = φv_{t−1} + ν_t for φ = 0.5, where (u_t, ν_t) ∼ iid N(0, Σ_t) and Σ_t exhibits constant correlation δ = −0.95 and time-varying variances. We set ρ = 1 − c/T for various c and ϱ = 1 − 1/T^0.95 and use standard normal critical values. See the text for details.]
For right-sided testing, t_vx rejects very often, but this is of course due to its extreme liberality compared to the other tests. The test based on t^rec_vx performs, as before, worst (again, with differences decreasing as c increases). To the right, t^W_vx typically rejects more often than t*_vx, but keep in mind that it is also quite oversized, even if not as oversized as the uncorrected t_vx.
Finally, examining the two-sided tests, we observe as expected a combination of the findings for the left- and right-sided tests, with the difference that the t^W_vx test is now correctly sized and the corresponding test decisions are now reliable. The test based on t*_vx is also correctly sized, and the power ranking of the two depends on the sign of β under the alternative. While t^W_vx is more powerful against right-sided alternatives but less powerful against left-sided ones, t*_vx exhibits a more balanced behavior. Again, the larger c, the closer the rejection frequencies of the three corrected tests.
Summing up, when theory provides a clear justification for a one-sided test, we can safely recommend the use of t*_vx. For two-sided testing, one has the choice between t^W_vx and t*_vx, with the symmetry of the rejection frequencies under the alternative being an argument in favor of t*_vx, and the higher power against right-sided alternatives (or left-sided, should the correlation δ be positive) being an argument in favor of t^W_vx. Altogether, our Monte Carlo results show that t^W_vx has very good size control for two-sided testing. Allowing for conditional heteroskedasticity does not alter this general recommendation (see Section III of the Online Supplement for further Monte Carlo simulation results supporting this claim).

CONCLUDING REMARKS
A convenient approach in the context of predictive regressions where the persistence of the endogenous forecasting variable is unknown is to turn to IV regressions where a so-called extended instrumental variable with a controlled level of persistence is constructed. The resulting IVX estimator is asymptotically mixed Gaussian and allows for standard asymptotic inference. Finite-sample deviations from the asymptotic limit can, however, be quite serious. Typically manifested in the form of noncentrality, they depend heavily on how the IV estimator is constructed.
In this paper, we provide a structured approach to control the small sample noncentrality of the IVX t statistic for a given instrumental variable, and as a result control the size distortions. First, we develop a higher-order expansion of the corresponding IVX t statistic and as such provide a theoretical understanding of the small sample deviations of the t statistic from its limit. This in turn suggests ways to center the t statistic at the origin under the null. Combining forward and recursive demeaning does account for most leading terms of the bias at the cost of some loss of power. An explicit correction for the noncentrality achieves similar size control but without the power reduction. These proposals do not assume any parametric restriction on the persistence of the extended instrumental variable, and rather provide, for any given parameterization thereof, a corresponding way of reducing noncentrality.
Our recommendations do not concern Wald tests of the null of no predictability in multiple regressions, the main reason being that the corrections already proposed by Kostakis et al. (2015) are quite effective for the Wald statistic. We leave a full analysis of higher-order terms of the involved quadratic forms to future work.
Our Monte Carlo study shows that all of these proposals provide substantial remedies for small-sample size distortions of the IVX t statistic while maintaining relatively good properties under the alternative. Further, when the effect of a forecasting variable is known to be negative, we suggest using a left-sided t test with one of the corrections provided in this paper, since our Monte Carlo study provides evidence that such a strategy yields better statistical power than a two-sided test. For two-sided alternatives, the Wald test of Kostakis et al. (2015) offers the better balance between size and power.

SUPPLEMENTARY MATERIAL

Proof of Proposition 1
Begin by applying Lemma A.2; since T^{−η/2} = o(T^{η/2−1/2}), we may focus on the leading terms. For the second term, we have (using the arguments in the proof of Lemma A.1 of Demetrescu et al., 2020)