IDENTIFICATION ROBUST INFERENCE FOR MOMENTS-BASED ANALYSIS OF LINEAR DYNAMIC PANEL DATA MODELS

We use identification robust tests to show that difference (Dif), level (Lev), and nonlinear (NL) moment conditions, as proposed by Arellano and Bond (1991, Review of Economic Studies 58, 277–297), Ahn and Schmidt (1995, Journal of Econometrics 68, 5–27), Arellano and Bover (1995, Journal of Econometrics 68, 29–51), and Blundell and Bond (1998, Journal of Econometrics 87, 115–143) for the linear dynamic panel data model, do not separately identify the autoregressive parameter when its true value is close to one and the variance of the initial observations is large. We prove that combinations of these moment conditions, however, do so when there are more than three time series observations. This identification then solely results from a set of, so-called, robust moment conditions. These robust moments are spanned by the combined Dif, Lev, and NL moment conditions and only depend on differenced data. We show that, when only the robust moments contain identifying information on the autoregressive parameter, the discriminatory power of the Kleibergen (2005, Econometrica 73, 1103–1124) Lagrange multiplier (KLM) test using the combined moments is identical to the largest rejection frequencies that can be obtained from solely using the robust moments. This shows that the KLM test implicitly uses the robust moments when only they contain information on the autoregressive parameter.


INTRODUCTION
It is common to estimate the parameters of linear dynamic panel data models using the generalized method of moments (GMM; Hansen, 1982). The moment conditions for the linear dynamic panel data model either analyze it in first differences using lagged levels of the series as instruments, in levels using lagged first differences as instruments, or using a product of levels and first differences. We refer to the first set of moment conditions as Dif(ference) moment conditions (see Arellano and Bond, 1991), the second set as Lev(el) moment conditions (see Arellano and Bover, 1995; Blundell and Bond, 1998), and the third set as N(on-)L(inear) moment conditions (see Ahn and Schmidt, 1995).
The Dif, Lev, and NL moment conditions can be used separately to identify the parameters of dynamic panel data models. To exhaust all information, however, two particular combinations of Dif, Lev, and NL moment conditions have been proposed. We refer to the combined Dif and Lev moment conditions as the Sys(tem) moment conditions and the combination of the Dif and NL moment conditions as the A(hn-)S(chmidt) moment conditions. The Sys moment conditions exhaust all information on the autoregressive parameter that is present under mean stationarity (see Arellano and Bover, 1995; Blundell and Bond, 1998). The AS moment conditions exhaust all information while not assuming mean stationarity (see Ahn and Schmidt, 1995).
We analyze the identification of the autoregressive parameter by the various sets of moment conditions for a range of true values including the case of highly persistent panel data. All moment conditions involve first differences of the series to remove individual specific effects. The first difference operator removes information in the time series at the unit root value of the autoregressive parameter. It is well known that the Dif moment conditions, therefore, do not identify the autoregressive parameter when its true value is (close to) one, since lagged levels are then weak predictors of first differences. This has led to the development of the NL and Lev, and hence AS and Sys, moment conditions which were originally considered to identify the autoregressive parameter when the panel data are highly persistent.
To show the identification issues at specific values of the autoregressive parameter, we use identification robust tests, i.e., the GMM-A(nderson-)R(ubin) statistic of Anderson and Rubin (1949) and Stock and Wright (2000), and the K(leibergen) L(agrange) M(ultiplier) statistic of Kleibergen (2005). At values of the parameters where identification issues occur, the rejection frequency of these tests provably coincides with the significance level, so the identification issues are relatively easy to detect by inspecting the power curves. Using power curves of the KLM test, we show that Dif, Lev, and NL moment conditions separately do not identify the autoregressive parameter when its true value is close to one and the variance of the initial observations is large. The same holds for the Sys moment conditions with three time series observations. The power curves further show that Sys and AS moment conditions generally identify the autoregressive parameter when the number of time series observations exceeds 3.
We formally prove these identification results using an asymptotic sampling scheme in which we jointly let the variance of the initial observations and the number of cross section observations go to infinity. For a range of relative convergence rates of the variance of the initial observations compared to the cross section sample size, the Dif, Lev, and NL sample moments and their derivatives diverge. Both the population moment and the Jacobian identification condition are then ill defined, which implies that the autoregressive parameter is not separately identified by the Dif, NL, or Lev moment conditions. These results confirm and extend earlier findings in Madsen (2003), Bond, Nauges, and Windmeijer (2005), Hahn, Hausman, and Kuersteiner (2007), Kruiniger (2009), and Phillips (2018).
Using our asymptotic sampling scheme, we also prove that AS and Sys moment conditions identify the autoregressive parameter irrespective of the variance of the initial observation when the number of time series observations exceeds 3. When the variance of the initial observations is large, the identification results from a set of, so-called, robust sample moments that are a combination of the Dif, Lev, and NL sample moments (other than AS and Sys) and only depend on differenced data. These robust sample moments are spanned by the Sys sample moments and also by the AS sample moments. They identify the autoregressive parameter irrespective of the variance of the initial observation and including the case of highly persistent data. They are a subset of the moment conditions in Kruiniger (2002), which are derived under the additional assumption of time series homoskedasticity.
Despite these positive identification results for the Sys and AS moments, the large sample distributions of corresponding one-step and two-step GMM estimators are known to be nonstandard when the variance of the initial observation is large and the autoregressive parameter is close to one. This makes it hard to infer if and how standard GMM inference using the original AS or Sys sample moments exploits the information contained in the robust sample moments that they encompass. The nonstandard limiting behavior arises because the identification of the autoregressive parameter is then of so-called second order: the Jacobian of the robust sample moments is rank deficient, but the Hessian is not (see, e.g., Dovonon and Renault, 2013; Dovonon and Hall, 2018; Dovonon, Hall, and Kleibergen, 2020). It explains the large biases of the one-step and two-step GMM estimators and the size distortions of their corresponding t-statistics when the series are persistent (see, e.g., Madsen, 2003; Hahn et al., 2007; Kruiniger, 2009; Bun and Windmeijer, 2010; Dhaene and Jochmans, 2016). Because of the second-order identification, GMM estimators based on the robust sample moments also have nonstandard asymptotic distributions when the data are persistent (see Dovonon et al., 2020).
We therefore analyze how identification robust test statistics exploit the identifying information in the robust sample moments. We prove that the identification robust KLM test procedure based on either AS or Sys sample moments exploits all the identifying information contained in the robust sample moments. We do so by first determining the (infeasible) optimal weighted average of the robust sample moments that maximizes the discriminatory power of a GMM-AR test of the autoregressive parameter in settings where only the robust sample moments contain identifying information. Next, we determine the discriminatory power of KLM tests, based on AS or Sys moment conditions, under such settings and prove that it equals that of the GMM-AR test using the optimal weighted average of the robust sample moments. KLM tests using AS or Sys moment conditions thus resort to just using the robust sample moments when only the latter contain information on the autoregressive parameter. It is therefore not necessary to explicitly use the robust sample moments, which provide identification under mild conditions, since they are implicitly used in the KLM test based on either AS or Sys sample moments.
The paper is organized as follows. Section 2 introduces the linear dynamic panel data model and the different moment conditions we use to identify its parameters. It also discusses identification robust statistics, specifically the KLM test, that we use to illustrate the identification issues that occur at persistent values of the autoregressive parameter. In Section 3, we use a representation theorem, akin to the cointegration representation theorem (see Engle and Granger, 1987; Johansen, 1991), to pin down the identification properties of the different moment conditions. This theorem also allows us to obtain the robust sample moments. In Section 4, we define the GMM-AR test that uses the (infeasible) optimal weighted average of the robust sample moments and derive the large sample distribution of the KLM test using AS or Sys moment conditions under settings where only the robust sample moments contain information on the autoregressive parameter. The fifth (final) section concludes. Proofs of theorems and definitions of sample moments are provided in the Appendix. We use the following notation throughout the paper: $\mathrm{vec}(A)$ stands for the (column) vectorization of the $k \times n$ matrix $A$, $\mathrm{vec}(A) = (a_1' \ldots a_n')'$ for $A = (a_1 \ldots a_n)$; $P_A = A(A'A)^{-1}A'$ is a projection on the columns of the full-rank matrix $A$; and $M_A = I_N - P_A$ is a projection on the space orthogonal to $A$. Convergence in probability is denoted by "$\to_p$", convergence in distribution by "$\to_d$", and "$=_a$" means asymptotically equivalent.

IDENTIFICATION ROBUST GMM INFERENCE FOR DYNAMIC PANEL DATA MODELS
In this section, we briefly describe the dynamic panel data model and the different sets of moment conditions. Thereafter, we discuss identification robust GMM inference including the construction of confidence intervals. Finally, we illustrate the identification issues that occur when using the different moment conditions for dynamic panel data models, by computing power curves based on the identification robust KLM statistic.

Model and Moment Conditions
We analyze the first-order autoregressive linear dynamic panel data model

$$y_{it} = c_i + \theta y_{i,t-1} + u_{it}, \qquad i = 1,\ldots,N,\ t = 2,\ldots,T, \tag{1}$$

with T the number of time periods and N the number of cross section observations. We assume that the initial observation y i1 is observed and that the vector of observations (y i1 ,...,y iT ) for individual i is independently distributed across the N individuals. We will later on make further assumptions on the initial observations to properly define the process in (1). For expository purposes, we analyze the simple dynamic panel data model in (1), which can be extended with additional lags of y it and explanatory variables. Estimation of the parameter θ by means of least squares leads to an inconsistent estimator in samples with a finite value of T and large N (see, e.g., Nickell, 1981). We therefore estimate it using GMM. We obtain the GMM moment conditions from the unconditional moment assumptions:
Under these assumptions, the moments of the T(T − 1) interactions of Δy it and y it : can be used to construct functions which identify the parameter of interest θ . We do not use products of Δy it to identify θ , since we would need further assumptions, i.e., homoskedasticity or initial condition assumptions (see, e.g., Han and Phillips, 2010). Two different sets of moment conditions, which are functions of the moments in (3), are commonly used to identify θ :
1. Dif moment conditions: as proposed by, e.g., Anderson and Hsiao (1981) and Arellano and Bond (1991). The Dif moment conditions solely result from the conditions in (2).
2. Lev moment conditions: as proposed by Arellano and Bover (1995; see also Blundell and Bond, 1998).
In addition to the conditions in (2), the Lev moment conditions use which implies that the original data in levels have constant correlation over time with the individual-specific effects. The Lev moment conditions (5) hold under the following conditions regarding the initial observations y i1 (i = 1, . . . ,N) : The specification of the initial observations in (7)-(9) is often referred to as mean stationarity. In our analysis, we maintain the assumption of mean stationarity.
The Dif and Lev moments can be used separately or jointly to identify θ . When we use the moment conditions in (4) and (5) jointly, we refer to them as Sys moment conditions (see Arellano and Bover, 1995; Blundell and Bond, 1998). Another set of NL moment conditions, which just like the Dif moments only use the conditions in (2), results from Ahn and Schmidt (1995): The NL moments can be used separately or jointly with the Dif moments to identify θ . When we use the moment conditions in (4) and (10) jointly, we refer to them as AS moment conditions. Ahn and Schmidt (1995) show that their AS moment conditions exhaust the information on θ in the moment conditions (2) and are therefore complete. Mean stationarity adds one moment condition (6) to the moment conditions in (2). Hence, the complete set of moment conditions under (2) and (6) equals the AS moment conditions and (6). Upon rewriting, we can show that these combined moment conditions are identical to the Sys moment conditions, so they are complete under (2) and (6).
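To make the structure of the Dif and Lev moment conditions concrete, the following minimal Python sketch computes the per-individual moment contributions in their standard textbook form: lagged levels interacted with the first-differenced residual (Dif), and the lagged first difference interacted with the level residual (Lev). The function names `dif_moments`, `lev_moments`, and `sys_moments` are illustrative and not taken from the paper, whose own sample moments are specified in its Appendix and may be ordered or scaled differently. The NL moments are left out of this sketch.

```python
def dif_moments(y, theta):
    """Dif contributions y_is * (dy_it - theta * dy_i,t-1) for t = 3..T, s = 1..t-2.

    `y` is the list [y_i1, ..., y_iT] for one individual (0-based in the code).
    """
    out = []
    for t in range(2, len(y)):
        # first-differenced residual: differencing removes the individual effect c_i
        du = (y[t] - y[t - 1]) - theta * (y[t - 1] - y[t - 2])
        for s in range(t - 1):  # levels lagged at least two periods act as instruments
            out.append(y[s] * du)
    return out


def lev_moments(y, theta):
    """Lev contributions dy_i,t-1 * (y_it - theta * y_i,t-1) for t = 3..T."""
    out = []
    for t in range(2, len(y)):
        # level residual c_i + u_it, instrumented by the lagged first difference
        out.append((y[t - 1] - y[t - 2]) * (y[t] - theta * y[t - 1]))
    return out


def sys_moments(y, theta):
    """Sys contributions: the Dif and Lev moments stacked into one list."""
    return dif_moments(y, theta) + lev_moments(y, theta)
```

For a series of length T, this sketch yields (T − 1)(T − 2)/2 Dif moments and T − 2 Lev moments, so T = 3 gives one moment of each kind.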

Identification Robust GMM Tests
In GMM, we consider a k-dimensional vector of moment conditions (see Hansen, 1982):

$$E[f_i(\theta)] = 0, \tag{11}$$

where f i (θ ) is a k-dimensional (continuous and continuously differentiable) function of the observed data for individual i and the unknown parameter vector θ whose functional expression is identical for all individuals. There is a unique true value of the p-dimensional vector θ where the moment conditions are satisfied, which we denote by θ 0 , and k is at least as large as p. We only analyze the first-order autoregressive panel data model, so p = 1 for our setting. The population moments in (11) are estimated using the sample moments,

$$f_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} f_i(\theta).$$

The k × p dimensional matrix q N (θ ) contains the derivative of f N (θ ) with respect to θ :

$$q_N(\theta) = \frac{\partial f_N(\theta)}{\partial \theta'} = \frac{1}{N} \sum_{i=1}^{N} q_i(\theta).$$

Specifications of the sample moment functions f N (θ ) and q N (θ ) for the Dif, Lev, Sys, NL, and AS moment conditions are provided in the Appendix.
Statistical inference based on the two-step GMM estimator is known to be of poor quality in the case of weak identification, which leads to an inconsistent estimator with nonstandard behavior of its corresponding t-statistic (see, e.g., Phillips, 1989; Staiger and Stock, 1997; Stock and Wright, 2000). The nonstandard limiting behavior of one-step and two-step GMM estimators for dynamic panel data models in the case of weak identification has been documented in, e.g., Madsen (2003), Kruiniger (2009), and Phillips (2018).
In this study, we therefore use identification robust GMM statistics to overcome the aforementioned problems. The main advantage of identification robust statistics is that, unlike conventional two-step GMM statistics, their limiting distributions are unaffected by the identification strength. Define θ * as the hypothesized value under the null hypothesis. A particularly simple to compute identification robust GMM statistic to test H 0 : θ = θ * is the GMM extension of the AR statistic (see Anderson and Rubin, 1949; Stock and Wright, 2000): with V ff (θ ) the Eicker-White covariance matrix estimator: The GMM-AR statistic equals the continuous updating objective function (Hansen, Heaton, and Yaron, 1996) evaluated at θ * . A possible drawback of the GMM-AR statistic is its lower power in the case of overidentified models. The KLM statistic of Kleibergen (2005) partly overcomes this. The KLM statistic is a quadratic form of the score of the GMM-AR statistic with respect to θ : and The limiting distributions of the identification robust GMM-AR and KLM statistics apply under less restrictive assumptions than those of the traditional test statistics based on two-step GMM. The GMM-KLM and GMM-AR statistics converge under H 0 to χ 2 (p) and χ 2 (k) distributed random variables even when the Jacobian, J(θ 0 ) = E(q i (θ 0 )), does not have a full-rank value (see Stock and Wright, 2000; Kleibergen, 2005; Newey and Windmeijer, 2009). Other identification robust statistics for GMM are proposed in Kleibergen (2005), Andrews (2016), and Andrews and Mikusheva (2016), which all provide extensions of the conditional likelihood ratio statistic of Moreira (2003) to GMM. The conditional likelihood ratio statistic is optimal for the homoskedastic linear instrumental variables regression model with one included endogenous variable (see Andrews, Moreira, and Stock, 2006).
None of its extensions to GMM has, however, been shown to be optimal for our setting of the dynamic linear panel autoregression, so we just use the easier to implement GMM-AR and KLM statistics. The identification robust tests can be inverted to obtain corresponding identification robust confidence sets. The 100×(1−α)% confidence set for θ (denoted by CS θ (α) below) consists of all values of θ * for which the respective identification robust test does not reject using its 100 × α% asymptotic critical value: with IRT(θ * ) the identification robust statistic evaluated at θ * and CDF IRT (α) the (1 − α) × 100th percentile of the limiting distribution of IRT(θ 0 ). The identification robust tests are not quadratic functions of θ * , so they cannot directly be inverted to obtain the confidence set. The confidence sets resulting from them do, therefore, not have the usual expression of an estimator plus or minus a multiple of the standard error. Instead, we have to specify a p-dimensional grid of values of θ * and compute the identification robust statistic for every value of θ * on the grid to determine whether it is less than the appropriate critical value, in which case θ * is part of the confidence set.
Specifically, the confidence set in ( Bounded and convex confidence sets occur when the parameters of interest are well identified. Unbounded confidence sets are indicative of weak identification, so if we then test H 0 : θ = θ * at a very large, possibly infinite, value of θ * using an identification robust test at, say, the 5% significance level, it does not necessarily reject. For such instances, we thus often do not reject the hypothesis of an infinite value of θ , so we obtain an unbounded 95% confidence set. In Dufour (1997, Thms. 3.3 and 3.6), it is shown that any size correct procedure used to test parameters which can be nonidentified must have a positive probability of producing an unbounded 95% confidence set. Conversely, also any test procedure, like the Wald t test, which cannot generate an unbounded 95% confidence set, cannot be a size correct test procedure when the tested parameter can be nonidentified. Empty confidence sets occur when the model is misspecified, so there is no value of θ for which the moment condition holds. Since the GMM-AR statistic tests whether all moment conditions hold, it also tests misspecification. It can therefore result in empty confidence sets, but the KLM test cannot, since it is equal to zero at the continuous updating estimator of Hansen et al. (1996), which is the minimizer of the GMM-AR statistic.
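The grid inversion described above can be sketched in a few lines of Python. This is a minimal illustration for a single moment condition (k = p = 1), not the paper's implementation. It assumes the usual GMM-AR form N·f̄²/V with a centred variance estimator, and it uses the single Lev moment available when T = 3. The names `gmm_ar_scalar`, `confidence_set`, and `lev_moment_t3` are hypothetical.

```python
CHI2_1_CRIT_95 = 3.841458820694124  # 95th percentile of the chi-squared(1) distribution


def gmm_ar_scalar(moments):
    """GMM-AR statistic N * fbar^2 / V for a single moment condition (k = 1)."""
    n = len(moments)
    fbar = sum(moments) / n
    # ASSUMPTION: centred variance estimator; an uncentred version is also common
    v = sum((f - fbar) ** 2 for f in moments) / n
    return n * fbar ** 2 / v


def lev_moment_t3(y, theta):
    """Single Lev moment for T = 3: dy_i2 * (y_i3 - theta * y_i2)."""
    return (y[1] - y[0]) * (y[2] - theta * y[1])


def confidence_set(panel, moment_fn, grid, crit=CHI2_1_CRIT_95):
    """Return the grid points theta* at which the GMM-AR test does not reject."""
    return [th for th in grid
            if gmm_ar_scalar([moment_fn(y, th) for y in panel]) <= crit]
```

Here `panel` is a list of per-individual series `[y_i1, y_i2, y_i3]` and `grid` could be, for example, `[i / 100 for i in range(-100, 201)]`. An empty list corresponds to an empty confidence set. A list that reaches both ends of the grid suggests that the set may be unbounded, so the grid should be widened before drawing conclusions.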
The identification robust statistics conduct tests on the full parameter vector θ . Valid (1 − α) × 100% confidence sets for the individual elements of θ then result by projecting the joint p-dimensional (1 − α) × 100% confidence set for θ on the p different axes. These projection-based confidence sets are size correct, so they contain the true value of θ with a probability which is at least (1 − α) × 100% irrespective of the strength of identification. Projection-based confidence sets can face computational issues when p is rather large given the large number of points on the p-dimensional grid for which the statistic then has to be computed.
Confidence sets for the individual elements of θ can also be obtained by plugging in an estimator for the remaining elements of θ after which the (conditional) limiting distribution can be sharpened using the usual degrees of freedom correction of the χ 2 limiting distributions. The resulting confidence sets only have correct coverage when these remaining parameters are well identified (see Kleibergen, 2005). Only in some isolated cases, for example, when using the GMM-AR statistic in the homoskedastic linear instrumental variables regression model or in the linear factor model for determining risk premia in finance, can we prove that these confidence sets are valid without requiring the partialled out parameters to be well identified (see Guggenberger et al., 2012; Guggenberger, Kleibergen, and Mavroeidis, 2019; Kleibergen, 2021; Kleibergen, Kong, and Zhan, 2020).
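As a hedged illustration of how the two statistics relate, the sketch below computes GMM-AR and KLM for a scalar parameter with two moment conditions, using the Sys moments for T = 3 in their standard form. It assumes the usual expressions from the cited literature: AR = N f̄' V⁻¹ f̄ and KLM = N (f̄' V⁻¹ D)² / (D' V⁻¹ D), with D = q̄ − V_qf V⁻¹ f̄ and centred covariance estimators. All function names are hypothetical, and the paper's own equation (16) may differ in details.

```python
def sys_t3(y, theta):
    """Sys moments and their theta-derivatives for one individual when T = 3."""
    dy2, dy3 = y[1] - y[0], y[2] - y[1]
    f = [y[0] * (dy3 - theta * dy2),   # Dif moment
         dy2 * (y[2] - theta * y[1])]  # Lev moment
    q = [-y[0] * dy2, -dy2 * y[1]]     # derivatives with respect to theta
    return f, q


def _mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def _cov(a_rows, b_rows):
    """Centred 2 x 2 sample covariance between the columns of a_rows and b_rows."""
    n = len(a_rows)
    am, bm = _mean(a_rows), _mean(b_rows)
    return [[sum((a[i] - am[i]) * (b[j] - bm[j]) for a, b in zip(a_rows, b_rows)) / n
             for j in range(2)] for i in range(2)]


def _solve2(m, v):
    """Return m^{-1} v for a 2 x 2 matrix m and a 2-vector v."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * v[0] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det]


def ar_and_klm(f_rows, q_rows):
    """GMM-AR and KLM statistics for a scalar parameter and k = 2 moments."""
    n = len(f_rows)
    fbar, qbar = _mean(f_rows), _mean(q_rows)
    vff = _cov(f_rows, f_rows)
    vqf = _cov(q_rows, f_rows)
    vinv_f = _solve2(vff, fbar)
    ar = n * (fbar[0] * vinv_f[0] + fbar[1] * vinv_f[1])
    # Jacobian estimator adjusted to be uncorrelated with the sample moments
    d = [qbar[i] - (vqf[i][0] * vinv_f[0] + vqf[i][1] * vinv_f[1]) for i in range(2)]
    vinv_d = _solve2(vff, d)
    score = fbar[0] * vinv_d[0] + fbar[1] * vinv_d[1]
    klm = n * score ** 2 / (d[0] * vinv_d[0] + d[1] * vinv_d[1])
    return ar, klm
```

Because KLM is a quadratic form in the projection of the weighted moments onto the direction of D, it can never exceed GMM-AR in this sketch. The KLM value is compared with a χ²(1) critical value and the GMM-AR value with a χ²(2) critical value.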

Using Identification Robust Tests to Highlight Identification Issues
Identification robust GMM tests are size correct irrespective of the identification strength. Therefore, their rejection frequencies can be used in a straightforward manner to illustrate the identification issues at particular values of the autoregressive parameter in the dynamic panel data model. The conventional t test based on the two-step GMM estimator is not suitable for this purpose, as it is size distorted in the case of weak identification and, hence, rejection frequencies would not equal the significance level.
To illustrate the identification issues for the different moment conditions, we compute the rejection frequencies of 5% significance KLM tests of H 0 : θ = 0.5 for a range of (true data generating) values θ 0 . We do so by simulating data from the panel autoregressive model in (1) with three or four time series observations, so T = 3 or 4, and 250 individuals, so N = 250. The individual specific effects c i and idiosyncratic errors u it are independently generated from N(0,σ 2 c ) and N(0,1) distributions, respectively. We vary the value of σ 2 c to show the sensitivity of the identification of θ using the panel moment conditions to the variance of the initial observations. We assume mean stationarity, so (7)-(9) hold.
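A minimal Python sketch of a data generating process of this kind follows. It assumes that mean stationarity takes the familiar form y_i1 = c_i/(1 − θ0) + u_i1, and it additionally assumes a covariance stationary start, var(u_i1) = 1/(1 − θ0²). The latter is an illustrative choice, not necessarily the one used for the figures. The function name `simulate_panel` is hypothetical.

```python
import math
import random


def simulate_panel(n, t_periods, theta0, sigma2_c, rng):
    """Return a list of n series [y_i1, ..., y_iT] from the panel AR(1) model."""
    if not abs(theta0) < 1.0:
        raise ValueError("this sketch requires |theta0| < 1")
    sd_c = math.sqrt(sigma2_c)
    # ASSUMPTION: covariance stationary start for the initial disturbance u_i1
    sd_u1 = 1.0 / math.sqrt(1.0 - theta0 ** 2)
    panel = []
    for _ in range(n):
        c_i = rng.gauss(0.0, sd_c)       # individual specific effect
        mu_i = c_i / (1.0 - theta0)      # long-run mean of y_it for individual i
        y = [mu_i + rng.gauss(0.0, sd_u1)]
        for _ in range(t_periods - 1):
            y.append(c_i + theta0 * y[-1] + rng.gauss(0.0, 1.0))
        panel.append(y)
    return panel


if __name__ == "__main__":
    data = simulate_panel(250, 4, 0.95, 1.0, random.Random(0))
    print(len(data), len(data[0]))
```

Under these assumptions the variance of y_i1 equals σ²_c/(1 − θ0)² + 1/(1 − θ0²), so a nonzero σ²_c makes the variance of the initial observations grow very quickly as θ0 approaches one.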
We consider four KLM tests based on Dif, Lev, Sys, and AS moment conditions, which have been calculated according to equation (16) using θ * = 0.5. Figures 1 and 2 show the rejection frequencies of KLM tests of H 0 : θ = 0.5 with 5% significance for four values of σ 2 c and a range of true values θ 0 . Figure 1 does so for three time series observations, while Figure 2 covers four time series observations. The simulation experiment is designed such that the variance of the initial observations becomes very large when θ 0 gets close to one and σ 2 c exceeds zero. Figures 1a and 2a show that the rejection frequencies of the KLM test with Dif moment conditions converge to the significance level of 5% when θ 0 is close to one. It is well known that the Jacobian of the Dif moment conditions is zero when θ 0 equals one, so they then do not identify θ . The KLM test is identification robust, which explains why the rejection frequency equals the significance level both at the hypothesized value of θ * = 0.5 and when θ 0 is close to 1 for all values of σ 2 c . The latter results because the Dif moment conditions then do not identify θ ; hence, the KLM test has no discriminating power, so the power of the KLM test equals the significance level. Figures 1b and 2b show the rejection frequencies of 5% significance tests of H 0 : θ = 0.5 using the KLM test with Lev moment conditions. Interestingly, these figures show that the Lev moment conditions identify θ when the true value θ 0 is close to one only when σ 2 c = 0. Nonzero values of σ 2 c correspond with a large variance of the initial observations when θ 0 is close to one, and Figures 1b and 2b show that the Lev moment conditions do not identify θ in this case. This contradicts the common perception that the Lev moment conditions generally identify θ irrespective of the setting of nuisance parameters, like the variance of the initial observations.
Figures 1c and 2c show the rejection frequencies of 5% significance tests of H 0 : θ = 0.5 using the KLM test with Sys moment conditions. Surprisingly, these figures show that the Sys moment conditions do not identify θ when θ 0 is close to one and σ 2 c > 0 when T = 3, but do so when T = 4. Figure 2d shows the rejection frequencies of 5% significance tests of H 0 : θ = 0.5 using the KLM test with AS moment conditions. These rejection frequencies show that the AS moment conditions, which are not defined for T = 3, identify θ when its true value is close to one and the variance of the initial observations is very large. Interestingly, the rejection frequencies of KLM tests of H 0 using the Sys and AS moment conditions are very close when θ 0 is near one when paired with large variances of the initial observations. Summarizing, Figures 1 and 2 illustrate a few stylized facts that concern the identification of θ for the data generating process (DGP) used in the simulation experiment:
1. Dif moment conditions do not identify θ when θ 0 is close to one for general T.
2. Lev moment conditions do not identify θ when θ 0 is close to one for large variances of the initial observations for general T.
3. Sys moment conditions do not identify θ when θ 0 is close to one for large variances of the initial observations when T = 3.
4. Sys and AS moment conditions identify θ when θ 0 is close to one for large variances of the initial observations when T exceeds 3.
5. The rejection frequencies of KLM tests of H 0 using AS and Sys moment conditions when θ 0 is close to one and the variance of the initial observations is large are almost identical.
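Rejection frequencies of this kind can be approximated with a small Monte Carlo loop. The sketch below is a simplified illustration and not the paper's simulation code. It uses only the single Dif moment available when T = 3, for which k = p = 1 and the KLM and GMM-AR statistics coincide. It also assumes the DGP form y_i1 = c_i/(1 − θ0) + u_i1 with var(u_i1) = 1/(1 − θ0²). The function name and its defaults are hypothetical.

```python
import math
import random

CHI2_1_CRIT_95 = 3.841458820694124  # 95th percentile of the chi-squared(1) distribution


def rejection_frequency(theta0, sigma2_c, theta_star=0.5, n=250, reps=1000, seed=0):
    """Share of replications rejecting H0: theta = theta_star (Dif moment, T = 3)."""
    rng = random.Random(seed)
    sd_c = math.sqrt(sigma2_c)
    # ASSUMPTION: covariance stationary start for the initial disturbance u_i1
    sd_u1 = 1.0 / math.sqrt(1.0 - theta0 ** 2)
    rejections = 0
    for _ in range(reps):
        f = []
        for _ in range(n):
            c = rng.gauss(0.0, sd_c)
            y1 = c / (1.0 - theta0) + rng.gauss(0.0, sd_u1)
            y2 = c + theta0 * y1 + rng.gauss(0.0, 1.0)
            y3 = c + theta0 * y2 + rng.gauss(0.0, 1.0)
            # single Dif moment for T = 3, evaluated at the hypothesized value
            f.append(y1 * ((y3 - y2) - theta_star * (y2 - y1)))
        fbar = sum(f) / n
        v = sum((x - fbar) ** 2 for x in f) / n  # centred variance estimator
        if n * fbar ** 2 / v > CHI2_1_CRIT_95:
            rejections += 1
    return rejections / reps
```

Calling `rejection_frequency(theta0, sigma2_c)` over a range of `theta0` values below one traces out a rough power curve for this one moment condition. The overidentified Lev, Sys, and AS cases need the full KLM statistic rather than this scalar shortcut.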
Except for the first stylized fact, a theory backing them up is lacking, so we aim to provide one in the sections ahead. In doing so, we show that all information regarding θ, when its true value is close to one and the variance of the initial observations is large, is contained in a set of, so-called, robust moment conditions which are a combination of either the AS or Sys moment conditions. We furthermore show that the KLM test based on the original AS or Sys moment conditions, as reported in Figures 1 and 2, makes optimal use of these robust sample moments when only they contain information on θ .
Beyond revealing identification issues, the rejection frequencies in Figures 1 and 2 are also indicative of the different kinds of confidence sets that can result from the identification robust tests, as discussed previously. For example, the low rejection frequencies occurring for θ 0 around one, that result from the identification issues, show that the 95% confidence sets for θ are then typically very wide, possibly unbounded, when θ 0 has such a value paired with a large variance of the initial observations. To visualize this further, Figure 3 contains the (one minus the) p-value plots of KLM tests using AS, Dif, Lev, and Sys moment conditions for four datasets using the same DGPs as in Figures 1 and 2 with N = 250 and θ 0 = 0.95 (see footnote 6). The DGPs used for the four figures differ over the values of T and σ 2 c . The intersections of the depicted p-value plots with the line at 0.95 indicate the 95% confidence sets of KLM tests with the respective moment condition.
In Figure 3a,c, σ 2 c = 0, so identification issues only occur at θ 0 close to one when using the Dif moment conditions. Since θ 0 is 0.95, this explains why the p-value plots of the KLM test with the Dif moment conditions do not cross the line at 0.95 in Figure 3a,c, so the resulting 95% confidence sets are very wide. The p-value plots in Figure 3a,c of KLM tests with Sys and Lev moment conditions show that they lead to bounded 95% confidence sets, since these moment conditions have no identification issues when T = 3 and σ 2 c = 0. In Figure 3b, where T = 3 and σ 2 c = 0.5, none of the p-value plots crosses the line at 0.95, so 95% confidence sets that result from KLM tests with Dif, Lev, and Sys moment conditions are all very wide and possibly unbounded. This is indicative of the identification issues when T = 3 and σ 2 c = 0.5 for true values of θ close to one. In Figure 3d, where T = 4 and σ 2 c = 0.5, KLM tests with Sys and AS moment conditions both result in finite 95% confidence sets, while the KLM test with Dif and Lev moment conditions leads to very wide, possibly unbounded confidence sets. Hence, Sys and AS moment conditions have no identification issues, while Dif and Lev moment conditions do. The AS moment conditions are quadratic functions of θ , which explains the somewhat unusual shape of their p-value plots in Figure 3c,d.

IDENTIFICATION FROM DIFFERENT MOMENT CONDITIONS
Stylized Facts 1-4 illustrated by Figures 1-3 show the identification issues that occur for the autoregressive parameter θ when the variance of the initial observations is large and θ 0 , i.e., the true value in the DGP, is close to one. To pin these identification issues down precisely, we use an asymptotic sampling scheme which consists of joint drifting sequences for the autoregressive parameter and the variance of the initial observation. We indicate this dependence on the sample size N by θ 0,N and , respectively. The true value of θ, previously denoted by θ 0 , is from now on, therefore, denoted by θ 0,N . Assumptions 1 and 2 group the different requirements needed to obtain our results.
Footnote 6. We note that Figure 3a-d shows (one minus) the p-value for one realized dataset and does not show the simulated empirical distribution function of the test under the null hypothesis, which is sometimes also referred to as a p-value plot (see Davidson and MacKinnon, 2002).

Assumption 1. a. The drifting sequences of the autoregressive parameter and variance of the initial observations are such that:
with d 1 a finite, possibly zero constant. b. The initial observations satisfy the mean stationarity conditions in (7)-(9). c. The joint limit behavior of the variance of u i1 and (1 − θ 0,N ) is such that with σ 2 1,N = var(u i1 ), d 2 a finite, possibly zero constant, and (1 − θ 0,N ) 1 2 u i1 is a random variable with finite fourth-order moments.
d. The variance of the product of the initial observation y i1 and the disturbances u it is such that with σ 2 t = var(u it ), t = 2, . . . ,T. e. The errors u i1 /σ 1,N , u i2 ,...,u iT and c i , i = 1, . . . ,N, are independently distributed within individuals and over the different individuals and have mean zero, finite variance, and finite fourth-order moments and satisfy the conditions in (2).
Assumption 1(a) concerns the joint limit behavior of the variance of the initial observations and θ 0,N . By the definition of μ i in (8) and Assumption 1(a), μ i is also drifting with the sample size, since it is a function of θ 0,N , and so are y i1 and σ 2 1,N . Assumption 1(b) specifies that the initial observations follow the mean stationarity assumption, which is necessary for the Lev and Sys moment conditions to hold. Assumptions 1(c)-(e) are mainly technical assumptions, which are needed to obtain our theoretical results. Assumption 1(c) sets an upper bound on the rate at which the variance of u i1 can diverge. It implies that the variance of u i1 is at most proportional to (1−θ 0,N ) −1 (so covariance stationarity is allowed for). Assumption 1(d) holds under independence of u it and y i1 , but it can also hold under less stringent conditions. In the sequel, we analyze the identification of θ when the variance of the initial observations gets large compared to that of the subsequent disturbances. Assumption 1(d) enables such settings. Assumption 1(e) is a technical assumption, which is needed to use a central limit theorem.
Assumption 1(a) allows the variance of the initial observations to be large jointly with a large value for the autoregressive parameter. When $d_1$ in (20) equals zero, the rate at which $h_N(\theta_{0,N})$ goes to zero, or the variance of the initial observation goes to infinity, is key to the identification of $\theta$ from the sample moment conditions. We therefore put down two alternative assumptions regarding the joint convergence of the sample size and the variance of the initial observations, under which there is identification or identification is problematic for specific moment conditions.
Assumption 2. a. $d_1 = 0$ and the drifting sequence of the variance of the initial observation is such that:
$$h_N(\theta_{0,N})\sqrt{N} \rightarrow 0. \tag{23}$$
b. $d_1 \neq 0$ or the drifting sequence of the variance of the initial observation is such that:
$$h_N(\theta_{0,N})\sqrt{N} \rightarrow \infty. \tag{24}$$
Identification generically holds under Assumption 2(b) but can become problematic under Assumption 2(a) and then depends on the particular moment condition and number of time series observations, as we show later on. In the intermediate case where $h_N(\theta_{0,N})\sqrt{N}$ converges to a finite, but nonzero constant, we are in a case similar to that discussed in the weak instrument literature, where the sample Jacobian converges to a random variable, which leads to inconsistent estimators with nonstandard behavior of their corresponding t-statistics. Because of the practical similarities with Assumption 2(a), however, we do not separately discuss it.
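The role of the rate in Assumption 2 can be illustrated numerically. The following is a minimal sketch, not the paper's specification: it assumes covariance stationary initial observations, so that var(y i1 ) = σ 2 c /(1 − θ) 2 + σ 2 /(1 − θ 2 ), takes h N (θ) to be var(y i1 ) raised to the power −1/2, and uses the hypothetical drifting sequence θ 0,N = 1 − N −τ . The function name and the parameter values are illustrative choices.

```python
import math

def h_times_sqrt_n(n, tau, sigma2_c=1.0, sigma2=1.0):
    """Return h_N(theta_0N) * sqrt(N) for theta_0N = 1 - N**(-tau).

    Illustrative assumption: covariance-stationary initial observations,
    var(y_i1) = sigma2_c / (1 - theta)**2 + sigma2 / (1 - theta**2).
    """
    theta = 1.0 - n ** (-tau)
    var_y1 = sigma2_c / (1.0 - theta) ** 2 + sigma2 / (1.0 - theta ** 2)
    h = var_y1 ** -0.5          # h_N(theta)**(-2) = var(y_i1)
    return h * math.sqrt(n)

# Evaluate the product along growing sample sizes for three drift rates.
for tau in (0.25, 0.5, 1.0):
    values = [h_times_sqrt_n(n, tau) for n in (10**2, 10**4, 10**6)]
    print(tau, [round(v, 4) for v in values])
```

Under these assumptions, the product shrinks toward zero for τ > 1/2 (the case in which the variance of the initial observations outgrows the sample size), settles near 1/σ c for τ = 1/2 (the intermediate case), and grows without bound for τ < 1/2.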
Since any assumption about the convergence rates of the sample size and the variance of the initial observations is to a large extent arbitrary, the identification of θ by these conditions is also arbitrary for DGPs for which the true value of θ is close to one and the variance of the initial observations is infinite when the true value of θ equals one. Some plausible DGPs for the initial observations, all of which accord with mean stationarity (7)-(9), belong to this category:
DGPs 4 and 5 characterize an autoregressive process of order one that has started g periods in the past, while the initial observations in DGPs 2 and 3 result from an autoregressive process that has started an infinite number of periods in the past. DGPs 2 and 3 are also used by Blundell and Bond (1998), and Arellano and Bover (1995) use DGP 2, but these studies keep the variance of the initial observations fixed.
For DGPs 1-5 to imply Assumption 2(a), the limiting sequence θ 0,N has to be such that: with e a constant and some real number larger than zero. In the case of DGP 4, (25) implies that the process has been running longer than the sample size N. Kruiniger (2009) uses the above specification of DGP 3 with = 0 and DGP 4 with N/g converging to a constant to construct local to unity asymptotic approximations of the distributions of two-step GMM estimators that use the Dif, Lev, or Sys moment conditions.
We do not confine ourselves to a specific DGP for the initial observations, so our results apply more generally. While the (non)identification conditions for θ that result from the above DGPs might be (im)plausible, it is their arbitrariness that is problematic. In addition, even when the identification condition holds, Wald test statistics, such as the t-test, can still exhibit large size distortions.
To analyze the identification of θ by the different moment conditions for a general number of time periods T, we start out with a representation theorem. For the different moment conditions, it states the behavior of the sample moments and their derivatives under Assumptions 1 and 2(a).

THEOREM 1 (Representation theorem). Under Assumptions 1 and 2(a), we can characterize the large sample behavior of the Dif, Lev, NL, AS, and Sys sample moments for T time series observations and their derivatives by
Proof. See the Appendix.
The representation theorem in Theorem 1 is reminiscent of the cointegration representation theorem (see, e.g., Engle and Granger, 1987, and Johansen, 1991). As in that representation theorem, Theorem 1 shows that the behavior of the moment series differs across directions.
Theorem 1 implies that the sample moment and its derivative diverge in the direction of $A^j_f(\theta)$, since the latter components get multiplied by $\frac{1}{h_N(\theta_{0,N})\sqrt{N}}$, which under Assumption 2(a) goes off to infinity when the sample size increases. The only identifying information for $\theta$ then results from that part of the sample moment which does not depend on $\psi$. Since $\psi$ only affects the part of the sample moments spanned by $A^j_f(\theta)$, the sample moments are independent of $\psi$ in the direction of the maximal nondegenerate space spanned by vectors orthogonal to $A^j_f(\theta)$, to which we refer as the orthogonal complement of $A^j_f(\theta)$. We construct the orthogonal complement, which we denote by $A^j_f(\theta)_\perp$, as the full-rank matrix projecting on the orthogonal complement of the range space of $A^j_f(\theta)$. It consists of the minimal set of vectors spanning the null space of the columns of $A^j_f(\theta)$. When the null space has dimension zero, a full-rank specification of $A^j_f(\theta)_\perp$ cannot be constructed.
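The construction of an orthogonal complement can be sketched numerically. The example below is a minimal illustration using exact rational arithmetic from the standard library; the matrices are hypothetical and are not the paper's A j f (θ ). It returns a basis of all vectors v with A'v = 0, which is empty when A is a full-rank square matrix, so that no nonzero orthogonal complement exists in that case.

```python
from fractions import Fraction

def orthogonal_complement(a):
    """Return a basis (list of vectors) of {v : a'v = 0}.

    `a` is a k x m matrix given as a list of k rows. The basis is found by
    row-reducing a' (m x k) and reading off its null space.
    """
    k, m = len(a), len(a[0])
    # Each row of `rows` is one column of a, i.e., one row of a'.
    rows = [[Fraction(a[i][j]) for i in range(k)] for j in range(m)]
    pivots = []
    r = 0
    for col in range(k):
        pivot = next((i for i in range(r, m) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it is a free column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [x / rows[r][col] for x in rows[r]]
        for i in range(m):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
        if r == m:
            break
    free = [c for c in range(k) if c not in pivots]
    basis = []
    for f in free:
        v = [Fraction(0)] * k
        v[f] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            v[p] = -rows[row_idx][f]
        basis.append(v)
    return basis

# Hypothetical 3 x 1 matrix: the complement has dimension two.
print(orthogonal_complement([[1], [1], [0]]))
# Hypothetical full-rank 2 x 2 matrix: the complement is empty.
print(orthogonal_complement([[1, 0], [1, 1]]))
```

The second call mirrors the situation described for a full-rank square matrix: every direction is affected by the diverging component, and nothing is left to identify the parameter.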
When we premultiply the sample moments by the orthogonal complement of
Compared with expression (26) in Theorem 1, the elements multiplied by
The right-hand side of (28) now contains all the remaining identifying elements of the original moment conditions. From expression (28), it is seen that identification results only when (1)
For an illustrative example of Theorem 1, consider the large sample behavior, for $T = 3$, of the Lev sample moment, $\frac{1}{N}\sum_{i=1}^{N} \Delta y_{i2}(y_{i3} - \theta y_{i2})$, and its derivative, $-\frac{1}{N}\sum_{i=1}^{N} \Delta y_{i2}\, y_{i2}$, when $\theta_{0,N}$ converges to one according to (20) and mean stationarity (8)-(9) applies. The Lev moment condition has been proposed by Arellano and Bover (1995) and Blundell and Bond (1998) to overcome the identification problems of the Dif moment condition near the unit root. Under Assumption 1, the relevant elements for the large sample behavior are:
(see the proof of Theorem 1 in the Appendix for a derivation). The $o_p(1)$ remainder terms contain all elements in (29) that cannot dominate the large sample behavior when $\theta_{0,N}$ goes to one according to the drifting parameter sequences defined in Assumption 1. The components explicitly specified in (29) either have a nonzero mean or depend on the initial observations $y_{i1}$. Under Assumption 1, we have that
which is proved in Lemma 1 in the Appendix and where $\psi_2$ and $\psi_c$ are independent normal random variables with mean zero and variances $\sigma^2_2$ and $\sigma^2_c$, with $\sigma^2_c = \mathrm{var}(c_i)$. It explains why 1 (29). When $d_1$ in (20) equals zero, the rate at which $h_N(\theta_{0,N})$ goes to zero, or the variance of the initial observation goes to infinity, determines the behavior of the sample moments in (29). For example, when d 1 = 0 and these sequences are as in Assumption 2(b), it holds that
Although Assumption 1 does not fully pin down $d_2$, whose value depends on the particular DGP for the initial observations, it is clear that the probability limit of the sample Jacobian typically differs from zero.
Hence, the Lev moment condition seems to identify θ irrespective of its true value (see Arellano and Bover, 1995;Blundell and Bond, 1998). There is a caveat though, since, under Assumption 2(a), Theorem 1 shows that: which implies that the sample moments of the Lev population moment and Jacobian diverge when the sample size increases. The Lev sample moment then no longer identifies θ , since the components that would identify θ in the Jacobian identification condition, i.e., 1 We next discuss what Theorem 1 implies for the different sets of moment conditions discussed previously and their respective orthogonal complements of A f (θ ).

Dif and Lev conditions.
The expressions of $\mu^j_f(\theta,\sigma^2)$, $A^j_f(\theta)$, and $A^j_f(\theta)_\perp$ for the Dif and Lev moment conditions, which are stated in the proof of Theorem 1 in the Appendix, are:
The expressions of $A^{Lev}_f(\theta)$ are all such that we cannot specify a nonzero matrix
This remains so when $T$ exceeds 4 (see the Appendix). Hence, $A^{Lev}_f(\theta)_\perp$ does not exist (as a nonzero matrix). Regarding the Dif moments, when $T > 3$, the rank of the orthogonal complement of

NL condition.
The NL moment condition is not defined for $T = 3$. When $T = 4$, the expressions of μ
Since the orthogonal complement does not exist, the NL moment condition does not identify $\theta$. The expression of $A^{NL}_f(\theta)$ for a larger number of time series observations (see the Appendix) is such that the orthogonal complement $A^{NL}_f(\theta)_\perp$ also does not exist. Hence, for larger values of $T$, the NL moment conditions do not identify $\theta$ either.

AS and Sys conditions.
The expressions of μ j f (θ,σ 2 ), A j f (θ ), and A j f (θ ) ⊥ when T = 3 and 4 for the AS and Sys moment conditions result from stacking those of the Dif and NL and Dif and Lev moment conditions, respectively: When T = 3, A Sys f (θ ) is a full-rank square matrix, so its orthogonal complement does not exist. It implies that the Sys moment conditions do not identify θ when T = 3. When T = 4, the orthogonal complement of A j f (θ ), A j f (θ ) ⊥ , has rank larger than zero for both AS and Sys moments. Furthermore, the specification of μ j f (θ,σ 2 ) for the AS and Sys moment conditions in (36) = 0 which just reflects that the moment conditions hold at the true value. This implies that although the AS and Sys sample moments diverge in the direction of A j f (θ ), so that part cannot be used to identify θ, the AS and Sys sample moments identify θ by their part which is spanned by the orthogonal complement of A j f (θ ). The expressions of μ j f (θ,σ 2 ) and A j f (θ ) in the proof of Theorem 1 in the Appendix show that this argument extends to all values of T larger than 3.
Our preceding analysis is summarized by Corollary 1.

COROLLARY 1 (Identification of θ ). Under Assumptions 1 and 2(a), θ is identified by the AS and Sys moment conditions when T exceeds 3. Furthermore, θ is not identified by the Dif, Lev, and NL moment conditions separately for any value of T and the Sys moment conditions when T equals 3.
Corollary 1 proves Stylized Facts 1-4 from Section 3, which are illustrated by Figures 1 and 2. It also shows that the identification from the Lev moment condition remains problematic for larger values of T, but the Sys and AS moment conditions generally identify θ for values of T larger than 3.
Regarding the NL moments, we find that they are not robust to all settings of nuisance parameters like the variance of the initial observations. Alvarez and Arellano (2004) and Kruiniger (2013) have shown that, when the data, including the initial observation, have finite second moments and the autoregressive parameter equals one, θ is identified by the NL and, hence, the AS moment conditions if and only if T ≥ 4. Furthermore, if T ≥ 4, θ is only locally identified when the unconditional variances of the errors change at a constant rate of growth between t = 2 and t = T − 1 and only second-order but globally identified when the unconditional variances between t = 2 and t = T − 1 are equal. Unlike Alvarez and Arellano (2004) and Kruiniger (2013), our limiting sequence for the variance of the initial observations allows for unbounded values. Theorem 1 then shows that identification by the NL moment conditions is lost when its convergence rate accords with (23). The intuition is that the NL moment conditions are a product of levels and first differences, so they are unlikely to identify the parameters in limit sequences where the variance of the initial observations increases faster than the sample size.
Theorem 1 can be used to construct the nonstandard limiting behavior of one-step and two-step GMM estimators that result from the different moment conditions. These are similar to the nonstandard results in, e.g., Madsen (2003) and Kruiniger (2009), so, for brevity, we refrain from stating them.

Robust sample moments
Theorem 1 shows that the identification of θ when the variance of the initial observations is large results from the part of the (AS or Sys) moment conditions that lies in the direction of A j f (θ ) ⊥ . Expressions of the orthogonal complements of A j f (θ ) for T = 4 and 5 for the AS and Sys moment conditions are stated in (36). They can be specified (see the Appendix) as
where T indicates the number of time periods and G j 2,T is such that G j 2,T μ j f (θ,σ 2 ) = 0, for all θ . Furthermore, G j f,T (θ ) is the only part of A j f (θ ) ⊥ that depends on θ . The orthogonal complements are then such that the resulting moment conditions, which we refer to as robust moment conditions, are quadratic in θ :
where the expressions for a, b, and d are constructed in the Appendix:

T=5:
and similar specifications of a, b, and d result for larger values of T.
It is interesting to see that these robust moments only depend on differences of the data, so the initial observations get differenced out. This explains why these moments are robust to the variance of the initial observations. When the autoregressive parameter equals one and in the case of i.i.d. normal errors and time series homoskedasticity, Ahn and Thomas (2006) and Kruiniger (2013) show that the maximum likelihood estimator of Hsiao, Pesaran, and Tahmiscioglu (2002) and the random effects estimator of Anderson and Hsiao (1982) have the same limiting distributions. These results show that, similar to our findings, moment conditions involving levels of the data are redundant in this setting, and only moment conditions using differences of the data, like our robust moment conditions, are informative.
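A short worked derivation makes the differencing argument explicit for the first difference. It is a sketch under two assumptions that are not restated in this section: that the model has the standard form $y_{it} = c_i + \theta y_{i,t-1} + u_{it}$, and that the mean stationary initial observation in (7) reads $y_{i1} = \mu_i + u_{i1}$ with $c_i = (1-\theta)\mu_i$.

```latex
\begin{aligned}
y_{i2} &= c_i + \theta y_{i1} + u_{i2}
        = (1-\theta)\mu_i + \theta(\mu_i + u_{i1}) + u_{i2}
        = \mu_i + \theta u_{i1} + u_{i2}, \\
\Delta y_{i2} &= y_{i2} - y_{i1} = (\theta - 1)\,u_{i1} + u_{i2}, \\
\operatorname{var}\!\big((\theta - 1)\,u_{i1}\big)
       &= (1-\theta)\cdot\big[(1-\theta)\,\sigma^2_{1}\big].
\end{aligned}
```

Under these assumptions, the individual effect $\mu_i$ drops out of the difference entirely. The initial disturbance enters only through $(\theta-1)u_{i1}$, whose variance vanishes as $\theta$ approaches one whenever $(1-\theta)\sigma^2_1$ remains bounded, which is the bound that Assumption 1(c) is described as imposing.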

Large individual effect variance.
So far, we have focused on highly persistent panel data resulting from a large autoregressive parameter. However, the representation theorem for the moment conditions and their derivatives in Theorem 1 applies to any setting where the variance of the initial observations gets large. The expression of the initial observation in (7) shows that its variance becomes large when either the variance of the initial disturbance term, u i1 , or the individual specific effect, μ i , becomes large. Theorem 1 focuses on a large variance that results from the autoregressive parameter converging to one. Theorem 1 does, however, extend to the case where jointly with the sample size, the individual specific effect variance becomes large in such a manner that Assumption 2(a) holds. This drifting sequence applies to any value of the autoregressive parameter, so the resulting identification issues are then no longer confined to the unit root value. Hence, they also apply to the cases with only moderate autoregressive dynamics, but a large variance of the unobserved heterogeneity. The robust moments in (38) also apply to this case. Kruiniger (2002) extensively analyzes the setting of a large variance of the individual specific effects. He shows that only moment conditions based on differences of the data yield a consistent estimator, so moment conditions involving levels are redundant. He also constructs the set of optimal moment conditions assuming time series homoskedasticity. Our robust moments (38) extend his set of optimal moment conditions, since they remain valid under a large variance of the individual specific effect and also allow for time series heteroskedasticity.

KLM TEST AND ROBUST SAMPLE MOMENTS
Theorem 1 establishes identification results for the AS and Sys moment conditions, which are based on the robust sample moments. It is not clear, however, how an identification robust test procedure makes use of them. In this section, we show that the KLM test based on the original AS or Sys moment conditions just uses the robust sample moments when only the latter contain identifying information on the autoregressive parameter. We show that, under large variances of the initial observation and when the true value of θ is close to one, the KLM test based on either the AS or Sys moment conditions exploits the identifying information from the robust moment conditions in an optimal manner. For practical purposes, this implies that we do not have to explicitly use the robust sample moments, since they are implicitly used when conducting a KLM test using AS or Sys moment conditions.
We obtain the above result in four steps. First, we characterize the limit behavior of the robust sample moments. Second, we use it to determine asymptotic sequences for the true and hypothesized values, so the power properties of the corresponding identification robust test statistics when using the robust moments are not trivial and stay informative. Third, we construct the largest (infeasible) discriminatory power that can be obtained from combining the robust moments. Finally, we show that it coincides with the rejection frequency of KLM tests using either AS or Sys moment conditions. Summarizing, the KLM test based on original AS or Sys moment conditions implicitly resorts to using the robust sample moments in an optimal manner when only these contain information on θ .

Large Sample Behavior of Robust Sample Moments
To construct the limiting behavior of the robust sample moments for settings where only they contain information on $\theta$, we first state the probability limits of the quantities $a$, $b$, and $d$ in (38) under Assumption 1. The components that comprise the robust sample moments do not depend on the variance of the initial observations, so they are not affected by Assumption 2. Since we analyze the behavior when the true value $\theta_{0,N}$ is converging to one, we specify this convergence behavior of $\theta_{0,N}$ so that it is dominated by the random components present in the limit behavior of $a$, $b$, and $d$, which are of order $O_p(N^{-1/2})$. This then implies that $\theta_{0,N}$ converges rather rapidly to one, with a convergence rate that is faster than $N^{-1/2}$. Hence, $\theta_{0,N}$ is considered to be in the close neighborhood of one.
THEOREM 2. Under Assumption 1, the limit behavior of the different components of $g^j_{f,T}(\theta)$, $j$ = AS, Sys, for $\theta_{0,N} = 1 + \frac{l}{N^{\tau}}$ with $l$ a fixed constant, $l < 0$, and $\tau > \frac{1}{2}$, is characterized by:
Proof. See the Appendix.
Although the AS and Sys robust moments are different, Theorem 2 implies that, under Assumption 1, the probability limits of $a$, $b$, and $d$ are identical. Furthermore, Theorem 2 implies that the Jacobian of the robust moment equation (38) is of full column rank when $\sigma^2_t \neq \sigma^2$ for at least one value of $t = 2,\ldots,T$. This fulfills one of the sufficient conditions for standard asymptotic theory for GMM inference based on the robust sample moments. Since the other sufficient conditions can be shown to hold as well, standard asymptotic theory applies in these settings.

Asymptotic Sequence for the Hypothesized Value
We want to compare tests of $H_0: \theta = \theta^*$ using the robust sample moments to KLM tests of $H_0$ using the original AS and Sys moments for settings where the identification can be problematic, which occurs for true values of $\theta$ close to one and large variances of the initial observations. Because we want to analyze local asymptotic power while the true value $\theta_{0,N}$ is converging to one according to $\theta_{0,N} = 1 + \frac{l}{N^{\tau}}$, we also consider a local-to-unity drifting sequence for the hypothesized value $\theta^*$, which we denote by $\theta(e)$ with $e < 0$ the localizing parameter. Although less common in asymptotic power analysis, the advantage of a drifting hypothesized value is that our results hold for a range of hypothesized values.
The asymptotic sequence θ(e) is such that the behavior of the identification robust tests is not diverging and informative about θ , when the true value θ 0,N is converging to one. Theorem 3 establishes the particular rate at which θ(e) converges to one which makes these conditions hold. Note that there is a slight abuse of notation, as, from now on, we suppress the superscript j in g j f,T (θ (e)), j = AS, Sys, which is inconsequential for the results to follow. The quartic root convergence rate in Theorem 3.1 results, since the Jacobian of the robust moment equation (38) is then equal to zero, but the Hessian is not. It is thus a setting of so-called second-order identification with first-order underidentification. Estimators then generally have quartic root convergence rates (see, e.g., Dovonon and Renault, 2013;Dovonon and Hall, 2018;Dovonon et al., 2020). A quartic root convergence rate for estimators in dynamic panel data models is also found by Ahn and Thomas (2006) and Kruiniger (2013).
The quartic root convergence rate for the robust sample moments results from specifying $\theta(e) = 1 + \frac{e}{N^{1/4}}$ and $\sigma^2_t = \sigma^2$, $t = 2,\ldots,T$. All elements of the robust sample moments which are linear in $e$ then cancel out in the limit. We are then left with a quadratic term in $e$ and components that converge at the rate $\frac{1}{\sqrt{N}}$. A quartic root convergence rate makes all these components of the same order of magnitude. Theorem 3 shows that error variances which are constant over time, $\sigma^2_t = \sigma^2$, $t = 2,\ldots,T$, lead to this slow convergence rate.
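The order-of-magnitude argument can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and N = 2,000 purely as an illustrative sample size; it maps the localizing parameter e to the hypothesized value θ(e) and back, and verifies that the squared deviation of θ(e) from one, scaled by √N, equals e squared, so a quadratic term in e is of the same order as components that converge at rate 1/√N.

```python
def theta_from_e(e, n):
    """Hypothesized value theta(e) = 1 + e / N**(1/4)."""
    return 1.0 + e / n ** 0.25

def e_from_theta(theta, n):
    """Localizing parameter e = N**(1/4) * (theta - 1)."""
    return n ** 0.25 * (theta - 1.0)

n = 2000  # illustrative sample size
for e in (-2.0, -1.0, -0.5):
    theta = theta_from_e(e, n)
    # (theta - 1)**2 equals e**2 / sqrt(N), so scaling by sqrt(N)
    # recovers e**2: the quadratic term matches the N**(-1/2) noise order.
    scaled_square = (theta - 1.0) ** 2 * n ** 0.5
    print(e, round(theta, 4), round(scaled_square, 4))
```

Because the fourth root of N grows slowly, even moderate values of e correspond to hypothesized values of θ that are visibly below one at this sample size.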

Largest Rejection Frequencies of Robust Sample Moments
To show that the KLM test of $H_0$ using AS and Sys moment conditions just uses the robust sample moments when only these contain information on $\theta$, we use the largest rejection frequencies that result in such instances from the robust sample moments. To obtain these largest rejection frequencies, we first consider the GMM-AR test of $H_p: \theta(e) = 1 + \frac{e}{\sqrt[4]{N}}$ using the robust sample moments, which is specified as:
with $g_{f,T}(\theta(e))$ the moments in (38)
with δ(N) = (eσ ) 4 ι p 0 (B(N) V abd B(N)) −1 ι p 0 , $p$ the number of columns of $G_{f,T}(\theta)$, so when $T = 4$, $p = 1$, and when $T = 5$, $p = 3$, and $p_{\max}$ the number of elements of $g_{f,T}(\theta(e))$, so, when $T = 4$, $p_{\max} = 2$, while $p_{\max} = 5$ for $T = 5$, $V_{abd}$ the covariance matrix of $a$, $b$, and $d$, $I_{p_{\max}}$ the $p_{\max} \times p_{\max}$ dimensional identity matrix, $e_{1,3}$ and $e_{2,3}$ the first and second $3 \times 1$ dimensional unity vectors, and $\chi^2(\delta, p_{\max})$ a noncentral $\chi^2$ distribution with noncentrality parameter $\delta$ and $p_{\max}$ degrees of freedom.
Proof. See the Appendix.
The expression of the large sample distribution in Theorem 4 depends on the sample size. Given the quartic root convergence rate, convergence to the limiting distribution is very slow, so it is important for the accuracy of the approximation of the finite sample distribution to incorporate higher-order components. The proof of Theorem 4 in the Appendix, therefore, from the outset considers all higher-order components of g f,T (θ (e)) in order to construct a large sample approximation of the distribution of GMM-AR(θ(e)).
To obtain the maximal rejection frequencies using the robust sample moments, we use a (infeasible) weighted average of the moment equations in g f,T (θ (e)) where the weights are chosen such that the noncentrality parameter equals the one of the noncentral χ 2 limiting distribution of the GMM-AR statistic while the degrees of freedom is equal to one (i.e., the number of elements of θ ). This value of the noncentrality parameter is also the maximal one that can be obtained using a weighted average of the robust sample moments.

that uses a weighted average of the robust sample moments can be constructed that has approximately a $\chi^2(\delta(N), 1)$ distribution in large samples of size $N$.
Proof. See the Appendix.
The GMM-AR statistics in Theorems 4 and 5 both have noncentral $\chi^2$ distributions with the same noncentrality parameter, so the one with the smallest number of degrees of freedom, i.e., the statistic in Theorem 5, has the largest power. Figure 4 illustrates Theorem 5 and shows the maximal rejection frequencies obtained by combining the robust sample moments, based on either the AS or Sys moment conditions, in a GMM-AR test for $T = 4$ and 5. It uses DGP 1 from Section 3 with a true value of $\theta$ that is very close to one (0.99) and a large value of $\sigma^2_c$ (10) compared to $\sigma^2$ (one), which amplifies the variance of the initial conditions. The DGP thus satisfies mean stationarity (7)-(9) and also time series homoskedasticity, i.e., $\sigma^2_t = \sigma^2$, for $t = 2,\ldots,T$. We use $N = 2{,}000$, a relatively large value, and test for a wide range of values for $\theta$, which together with $N$ provides a mapping to the constant $e$ ($= \sqrt[4]{N}(\theta - 1)$) in Figure 4 (horizontal axis). The usual power curve, as shown earlier in Figures 1 and 2, reports the rejection frequencies of tests of the hypothesized parameter value as a function of the parameter value used in the DGP from which the data are simulated. Figure 4, however, reports, for a fixed parameter value equal to one in the DGP used to simulate the data, the rejection frequencies as a function of a varying localizing parameter $e$, and, hence, autoregressive parameter $\theta(e)$, under the tested null hypothesis. The rejection frequencies in Figure 4 thus report those observed at one for a range of the usual power curves, where the tested parameter values correspond with those on the horizontal axes in Figure 4.
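The claim that, for a given noncentrality parameter, fewer degrees of freedom give higher power can be checked by simulation. The following is a minimal Monte Carlo sketch, not the paper's simulation design: it draws noncentral χ 2 variates as one shifted squared normal plus squared standard normals and compares rejection frequencies at the 5% level. The noncentrality value 5.0, the number of draws, and the function name are illustrative assumptions.

```python
import random

# 5% critical values of the central chi-square distribution.
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 5: 11.070}

def rejection_frequency(delta, df, draws=20000, seed=0):
    """Simulate P(X > critical value) for X ~ noncentral chi2(delta, df)."""
    rng = random.Random(seed)
    crit = CRITICAL_5PCT[df]
    shift = delta ** 0.5
    rejections = 0
    for _ in range(draws):
        # One shifted squared normal carries the whole noncentrality;
        # the remaining df - 1 terms are squared standard normals.
        x = (rng.gauss(0.0, 1.0) + shift) ** 2
        x += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df - 1))
        if x > crit:
            rejections += 1
    return rejections / draws

# Same noncentrality, different degrees of freedom.
for df in (1, 2, 5):
    print(df, rejection_frequency(delta=5.0, df=df))
```

The degrees of freedom 2 and 5 match the values of p max reported for T = 4 and T = 5, while one degree of freedom corresponds to the weighted-average statistic.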
Because of the equivalence of the GMM-AR test for the AS and Sys robust moments, the rejection frequencies are identical for the AS-and Sys-based robust sample moments and only differ over T. Any remaining differences in Figure 4 are due to sampling noise.

Large Sample Behavior of the KLM Test
Finally, we construct the large sample distribution of KLM tests of $H_p: \theta(e) = 1 + \frac{e}{\sqrt[4]{N}}$ using AS and Sys moment conditions when $\theta_{0,N}$ accords with the drifting sequences in Assumptions 1 and 2(a), so only the robust sample moments contain information on $\theta$.
THEOREM 6. Under Assumptions 1 and 2(a), $\theta_{0,N} = 1 + \frac{l}{N^{\tau}}$ with $l$ a fixed constant, $l < 0$, and $\tau > \frac{1}{2}$, $\sigma^2_t = \sigma^2$, $t = 2,\ldots,T$, the large sample distribution of the KLM statistic using the AS or Sys moments for testing the hypothesis
with $\delta(N)$ defined in Theorem 4.

Proof. See the Appendix.
Under Assumptions 1 and 2(a), Theorem 1 implies that the GMM sample moments diverge in one direction and converge in another one. Identical to tests for cointegration, Theorem 6 shows that the diverging parts of the GMM sample moments cancel out in the large sample distribution of the KLM test, so it only contains elements from the converging part of the GMM sample moments. The proof of the large sample distribution of the KLM test is, therefore, rather elaborate, since this has to be shown for each of the different components of the KLM test.
Theorem 6 shows that the large sample distribution of the KLM test using AS or Sys moment conditions when only the robust sample moments contain information on θ is identical to the limiting distribution of the GMM-AR test that optimally combines the robust sample moments for these settings. It proves that KLM tests using the AS and Sys moment conditions then only use the robust sample moments. It is similar to what happens in cointegration where, since the cointegrating vector and stochastic trends operate orthogonally, a likelihood ratio test on the cointegration vector also does not depend on the stochastic trends (see, e.g., Johansen, 1991).
Theorem 6 is illustrated by Figure 5a,b, which shows the rejection frequencies of 5% significance tests using a KLM test of $H_p: \theta(e) = 1 + \frac{e}{\sqrt[4]{N}}$ with AS and Sys moment conditions when $T$ equals 4 (Figure 5a) and 5 (Figure 5b), respectively. It uses the same DGP as for Figure 4. In addition, identical to Figure 4, the rejection frequencies in Figure 5 are reported for a fixed parameter value in the DGP from which we simulate the data, as a function of a varying parameter value under the tested hypothesis.
Figure 5. Rejection frequencies of KLM tests using AS (dashed) and Sys (dash-dotted) moment conditions and of GMM-AR tests using the (infeasible) optimal weighted average of robust sample moments (solid line). Notes: 5% significance level, true value of θ is 0.99, and N = 2,000.
Figure 5 shows, for both $T = 4$ and $T = 5$, that the rejection frequencies that result from using the KLM test with either AS or Sys moment conditions are equal to the largest rejection frequencies that can be obtained with the robust moments when only they contain information on $\theta$. It illustrates that the robust sample moments are (implicitly) used when one conducts KLM tests with AS or Sys moment conditions. Hence, in practice, one can just use AS or Sys moment conditions in the construction of the KLM test, i.e., there is no need to switch to the robust sample moments. Figure 5 also provides a visual proof of Stylized Fact 5 from Section 3, i.e., rejection frequencies for the KLM test using AS or Sys moment conditions are almost identical when the true value of $\theta$ is close to one and for large variances of the initial observations, and shows that this is not specific to the tested values used there but holds generally for different tested values of $\theta$.

CONCLUSIONS
We have analyzed GMM inference for dynamic panel data models involving highly persistent panel data. We show that the Dif, Lev, and NL moment conditions separately do not identify the parameters in dynamic panel data models for a general number of time periods. This results from the divergence of the initial observations for some plausible DGP involving highly persistent panel data. When there are more than three time periods, the AS and Sys moment conditions, however, do lead to identification. The identification based on the AS and Sys moment conditions for the problematic cases of divergent initial observations results from so-called robust sample moments. They are combinations of either the AS or Sys sample moments and do not depend on the initial observations. Despite the positive identification results for AS and Sys moment conditions, conventional inference based on two-step GMM estimators is not valid, since these estimators have nonstandard limiting distributions near the unit root. Similar results hold for two-step GMM estimators based on our robust sample moments. We have, therefore, analyzed the large sample properties of identification robust GMM test procedures. These test statistics are size correct, easy to implement, and have been used in a variety of models analyzed using GMM. We show that the identification robust KLM statistic based on the AS and Sys sample moments implicitly resorts to using the robust sample moments when only the latter contain identifying information.
Based on the theoretical analysis and numerical results, a number of remarks can be made regarding the implementation of GMM inference for applied linear dynamic panel data analysis. First, statistical inference, i.e., hypothesis testing and confidence intervals, should be based on identification robust tests, like the KLM or GMM-AR test. The nonstandard limiting behavior of the two-step GMM coefficient estimator makes the use of conventional GMM inference hazardous in applied research when there are identification issues. Second, one should always use either AS or Sys moment conditions, since these deliver identification under more general conditions when T > 3. An advantage of the AS moments is that they are valid under less restrictive assumptions than the Sys moments. Third, when mean stationarity applies, the Sys moments are preferred. Although AS and Sys moments contain the same amount of identifying information when θ is close to one and the variance of the initial observations is large, in practice, the opposite may well be the case if one is not close to the unit root (or if time series heteroskedasticity is present). This is shown, for example, by our simulated KLM power curves in Section 2. Fourth, the original AS or Sys moments should be used in an identification robust GMM test statistic and not the implied robust sample moments. Although only the latter preserve identification when the variance of the initial observations is large, we have shown that the identification robust KLM test based on the AS or Sys moments implicitly uses the robust sample moments.
Finally, for expository purposes, we have only analyzed the first-order autoregressive panel data model. The extension to panel data models with multiple endogenous regressors, e.g., dynamic models with additional endogenous regressors, is an important area for future research.

A. Specification of GMM Sample Moments and Proofs
Specification of sample moment functions. For the Dif moment conditions in (4), $k_{Dif}$ equals $\tfrac{1}{2}(T-1)(T-2)$, while the sample moment functions are:

For the Lev moment conditions in (5), $k_{Lev}$ equals $T-2$, while the sample moment functions are:

For the NL moment conditions in (10), $k_{NL}$ equals $T-3$, while the sample moment functions can be specified as:

The sample moments for the AS moment conditions result from stacking the appropriate sample moments stated above, so $k_{AS}$ equals $\tfrac{1}{2}(T-1)(T-2) + T - 3$. In a similar manner, the Sys sample moments result, so $k_{Sys}$ equals $\tfrac{1}{2}(T+1)(T-2)$.
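As a quick check of these counts, the worked arithmetic below evaluates them for T = 4 and T = 5. It assumes that the number of Dif moments is $\tfrac{1}{2}(T-1)(T-2)$, which is the value implied by both $k_{AS} - k_{NL}$ and $k_{Sys} - k_{Lev}$, and that the AS moments stack Dif and NL while the Sys moments stack Dif and Lev.

```latex
\begin{align*}
T = 4:\quad & k_{Dif} = \tfrac{1}{2}\cdot 3\cdot 2 = 3, \quad k_{Lev} = 2, \quad k_{NL} = 1,\\
            & k_{AS} = 3 + 1 = 4, \quad k_{Sys} = \tfrac{1}{2}\cdot 5\cdot 2 = 5 = 3 + 2;\\
T = 5:\quad & k_{Dif} = \tfrac{1}{2}\cdot 4\cdot 3 = 6, \quad k_{Lev} = 3, \quad k_{NL} = 2,\\
            & k_{AS} = 6 + 2 = 8, \quad k_{Sys} = \tfrac{1}{2}\cdot 6\cdot 3 = 9 = 6 + 3.
\end{align*}
```

For T = 3 the NL count is zero, so the AS moments coincide with the single Dif moment.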

LEMMA 1. We state some intermediate results, which involve the different terms in the sample moments and their derivatives. Assumption 1 implies the following:
Proof of Lemma 1.

i. Under mean stationarity, we have:

Assumption 1(c) implies that $(1-\theta_{0,N})^{1/2}u_{i1}$ is a random variable with finite fourth moments, so a law of large numbers applies:

Since $c_i = (1-\theta_{0,N})\mu_i$, we can specify:

with $\psi_c$ independent of $\psi_j$, $j = 2,\ldots,T$, as $c_i$ is independent from $u_{ij}$, $j = 2,\ldots,T$. Upon combining, we obtain:

ii. Since $u_{it}$ are independently distributed, $t = 1,\ldots,T$, and $(1-\theta_{0,N})^{1/2}u_{i1}$ is a random variable with finite fourth moments, a law of large numbers applies:

iii. Finite fourth moments of $u_{it}$ imply that a law of large numbers applies:

Furthermore, since both $(1-\theta_{0,N})^{1/2}u_{i1}$ and $u_{i2}$ have finite fourth moments and are independent,

Finally, we have $E(u_{i2}^2) = \sigma_2^2$ and finite fourth moments; hence,

Along the same lines as the above, this can be shown to hold for other values of $t$ as well.

v. Similar to the above, when substituting for $y_{i2}$ and $y_{i3}$, we have

Similar derivations as before show that

so all these terms are $o_p(1)$ and have probability limit 0, implying that

Along similar lines, this can be proved to extend to the first differences at other time periods.

vi. Since $h_N(\theta_{0,N})^{-2} = \mathrm{var}(y_{i1})$, the random variable $h_N(\theta_{0,N})y_{i1}$ has variance equal to one. Since $y_{i1}$ and $u_{it}$, $t > 1$, are independent, because of Assumption 1(e), $E(h_N(\theta_{0,N})y_{i1}u_{it}) = 0$. Furthermore, Assumption 1(d) implies that $\mathrm{Var}(h_N(\theta_{0,N})y_{i1}u_{it}) = \sigma_t^2$, which is finite. A central limit theorem therefore applies:

with $\psi = (\psi_{y_{1i}u_{i2}} \ldots \psi_{y_{1i}u_{iT}})$ a $(T-1)$-dimensional, mean-zero normal random vector. Assumption 1(e) states that $u_{i1}/\sigma_{1,N}$, $u_{i2},\ldots,u_{iT}$, and $c_i$ are independently distributed within individuals and over the different individuals. It implies that $u_{i1}c_i$ and $y_{i1}u_{it}$ are uncorrelated. Since $\psi$ and $\psi_c$ are the limits of the scaled sums of $y_{i1}u_{it}$ and $u_{i1}c_i$, they are uncorrelated normal random variables and therefore independent.
As a result of this, the $T \times T$ covariance matrix of $\psi$ and $\psi_c$ is diagonal:

Proof of Theorem 1.

T = 3. Under mean stationarity, we have

Substituting these expressions, we can specify the Dif sample moment and its derivative as

Combining convergence results stated in Lemma 1, the large sample behavior of the Dif sample moment and derivative can thus be characterized by

where we note that $h_N(\theta_{0,N})\sigma_{1,N} \leq 1$, since $\mathrm{var}(y_{i1}) \geq \mathrm{var}(u_{i1})$, from which it is readily seen that

Regarding the Lev moment, using $y_{i2} = \Delta y_{i2} + y_{i1}$,

Exploiting mean stationarity and substituting for $y_{i2}$ and $y_{i3}$, we write

Regarding the Lev derivative, we have

Therefore, we can write the Lev moment condition and derivative as

Combining this and other convergence results from Lemma 1, the large sample behavior of the Lev sample moment and derivative can thus be characterized by

From this last result, it is not difficult to see that, under Assumption 2(b), we have

The reason for this is that Assumption 2(b) amounts to $h_N(\theta_{0,N}) \to_{N\to\infty} \infty$, and, since $\mathrm{var}(y_{i1}) \geq \mathrm{var}(u_{i1})$, it implies that $\sigma^2_{1,N}/N \to_{N\to\infty} 0$. Finally, the Sys sample moment and derivative simply result from stacking the Dif and Lev sample moments and derivatives:

Combining earlier convergence results, the large sample behavior of the Sys sample moment and derivative can thus be characterized by

from which it is readily seen that

T = 4. Under mean stationarity, we have

Substituting these expressions and $y_{i2} = \Delta y_{i2} + y_{i1}$, we can specify the Dif sample moments and their derivatives as

The limit behavior of the first two terms in each expression has been established before. Furthermore, Lemma 1 shows that the last term in each expression is $o_p(1)$.
Therefore, the large sample behavior of the Dif sample moment and derivative can be expressed as:

from which it is readily seen that

After some algebra, we can specify the Lev sample moments and their derivatives as

Using Lemma 1, the large sample behavior of these expressions is equal to:

We can specify the NL sample moment and its derivative as

Using Lemma 1, the large sample behavior of these expressions is equal to:

so this implies that:

Finally, regarding AS and Sys moment conditions, we simply have

Using similar calculations, we obtain:

General T. Along the lines of the above derivations, it is also possible to construct the expressions of $A^j_f(\theta)$, $A^j_q(\theta)$, $\mu^j_f(\theta,\sigma^2)$, and $\mu^j_q(\theta,\sigma^2)$ for larger values of T, which we omit for reasons of brevity.

Orthogonal complements of $A^{AS}_f(\theta)$ and $A^{Sys}_f(\theta)$ for T = 4 and 5 and the specification of the robust sample moments. We specify the orthogonal complements as in (37), which we repeat here for convenience:

where $T$ indicates the number of time periods and $G^j_{2,T}$ is such that $G^j_{2,T}\,\mu^j_f(\theta,\sigma^2) = 0$. This notation is used in the proofs of subsequent theorems.

(36), $G^j_{f,T=4}(\theta)$ and $G^j_{2,T=4}$, for $j$ = AS, Sys, result as:

T = 4. From the expressions of
From these expressions and (36), it is easily seen that The robust sample moments are defined as . . .
The robust sample moments consist of products of the above expressions. To obtain the probability limits in Theorem 2 of the elements comprising the robust sample moments, we use that

for all $s$ and $t$, $t > 1$, $t \neq s$, which is implied by Assumption 1. Therefore, the $a$, $b$, and $d$ components of the robust sample moments simplify to:

T = 4, Sys:

where the $O_p(N^{-1/2})$ remainder terms result from the interaction terms between the different errors, like $\frac{1}{N}\sum_{i=1}^{N} u_{i2}u_{i3}$, which converge at rate $N^{-1/2}$, since their correlation equals zero.

Using next that, because of Assumption 1(c),

We use similar calculations for T = 5 to obtain that:

T = 5, Sys:
Proof of Theorem 3. The proof of Theorem 3 establishes the probability limits of $a$, $b$, and $d$, for $\theta_{0,N} = 1 + l/N^{\tau}$, $l < 0$, and $\tau > \tfrac{1}{2}$. Denoting these probability limits by $a_p$, $b_p$, and $d_p$, the large sample behavior of $a$, $b$, and $d$ is characterized by, for $\theta_{0,N} = 1 + l/N^{\tau}$ with $\tau > \tfrac{1}{2}$:

with $(\varepsilon_a, \varepsilon_b, \varepsilon_d)$ jointly normal, mean-zero random variables, which follows straightforwardly from an appropriate central limit theorem applied to the highest-order remainder terms in the proof of Theorem 2, which are all sample averages over i.i.d. mean-zero random variables. We want to determine the appropriate rate for $\xi$ in $g_{f,T}(\theta(e))$, so we can analyze its behavior in a neighborhood of the true value $\theta_{0,N} = 1 + l/N^{\tau}$, $l < 0$, with $\tau > \tfrac{1}{2}$ while $N$ goes to infinity, with $\theta(e) = 1 + e/N^{\xi}$. Substituting $\theta(e)$ and the above large sample characterizations of $a$, $b$, and $d$ in (38), we can write:

To determine $\xi$, we impose two conditions: (1) $\sqrt{N}g_{f,T}(\theta(e))$ converges to a nondegenerate bounded random variable of order $O_p(1)$; and (2) $g_{f,T}(\theta(e))$ is informative about the value of $e$ when $N$ gets large. We discriminate between two different cases for $\sigma_t^2$:

1. For $\sigma_t^2 = \sigma^2$, $t = 2,\ldots,T$: since $a_p + b_p + d_p = 0$ and $b_p + 2a_p = 0$, we distinguish three settings:

$\xi > 1/4$:

This shows that the appropriate rate corresponds with $\xi = 1/4$. For a smaller value of $\xi$, $\sqrt{N}g_{f,T}(\theta(e))$ diverges. For a larger value, $\sqrt{N}g_{f,T}(\theta(e))$ converges to a mean-zero normal random variable unaffected by the choice of $e$. Although, in this case, $\sqrt{N}g_{f,T}(\theta(e))$ is not informative about $e$, we do not need to worry about $e$, because standard asymptotics apply.

2. When $\sigma_t^2 \neq \sigma_s^2$ for at least one $t \neq s$, $a_p + b_p + d_p = 0$ but $b_p + 2a_p \neq 0$, we can establish along the lines of the above that the appropriate rate corresponds with $\xi = 1/2$:

Proof of Theorem 4. Denote with $g_{f,T}(\theta(e))$ the moments in (38) evaluated at $\theta(e) = 1 + e/N^{1/4}$.
When $\sigma_t^2 = \sigma^2$ and substituting the large sample characterization of $a$, $b$, and $d$, $\sqrt{N}g_{f,T}(\theta(e))$ can be expressed as:

$\sqrt{N}g_{f,T}(\theta(e)) = e^2 a_p + \varepsilon_a(1 + 2e \ldots$

Define $\varphi(N) = e^2 a_p + \varepsilon_a(1 + 2e \ldots)(e_{1,3} \otimes I_{p_{max}}) + \ldots(e_{2,3} \otimes I_{p_{max}}) \ldots$, and $V_{abd}$ the covariance matrix of $(\varepsilon_a \,\vdots\, \varepsilon_b \,\vdots\, \varepsilon_d)$, $\iota_3$ a $3 \times 1$ dimensional vector of ones, $I_{p_{max}}$ the $p_{max} \times p_{max}$ dimensional identity matrix, $p_{max}$ the number of elements of $a$, and $e_{1,3}$ and $e_{2,3}$ the first and second $3 \times 1$ dimensional unity vectors.
Hence, $\sqrt{N}g_{f,T}(\theta(e)) = \varphi(N) + o_p(1)$, so in a sample of size $N$, $\sqrt{N}g_{f,T}(\theta(e))$ is normally distributed up to an $o_p(1)$ term. While some of the components in $\varphi(N)$ are essentially also $o_p(1)$, it is important to incorporate them for an accurate approximation of the distribution of $\sqrt{N}g_{f,T}(\theta(e))$ for a given sample of size $N$, since the low-order components, of order $N^{-1/4}$, converge very slowly to zero.
The individual moments $g_{f,n}(\theta(e))$ in the sample average $g_{f,T}(\theta(e)) = \frac{1}{N}\sum_{n=1}^{N} g_{f,n}(\theta(e))$ can be specified as:

$\ldots\,\frac{1}{N}\sum_{n=1}^{N} d_n$, $\varepsilon_{a_n} = a_n - a_p$, $\varepsilon_{b_n} = b_n - b_p$, and $\varepsilon_{d_n} = d_n - d_p$, so taking $g_{f,n}(\theta(e))$ in deviation from its sample average $g_{f,T}(\theta(e))$ results in $g_{f,n}(\theta(e)) - g_{f,T}(\theta(e)) = \varepsilon_{a_n} - \varepsilon_a + \varepsilon \ldots$

From the above, it then straightforwardly follows that

so the distribution of the GMM-AR statistic testing $H_p$ for a sample of size $N$ is characterized by

Proof of Theorem 5. When, instead of the full vector $g_{f,T}(\theta(e))$, we use a linear combination of it, say $w'g_{f,T}(\theta(e))$ with $w$ an orthonormal $p_{max} \times 1$ vector, the approximating distribution of the GMM-AR statistic for testing $H_p: \theta(e) = 1 + e/N^{1/4}$ that uses $w'g_{f,T}(\theta(e))$ as the moment vector reads $\chi^2\big(e^4 (w'a_p)'[w'B(N)'V_{abd}B(N)w]^{-1}(w'a_p),\, 1\big)$.
The optimal combination $w$ is the one that leads to the largest value of the noncentrality parameter. The noncentrality parameter can be specified as

The maximal value of $(w'a_p)^2 / \big(w'B(N)'V_{abd}B(N)w\big)$ results from the largest root of the generalized eigenvalue problem $|\lambda B(N)'V_{abd}B(N) - a_p a_p'| = 0$, and the optimal value of $w$ equals the eigenvector associated with the largest root. Since $a_p$ is only a vector, just one root of the generalized eigenvalue problem is nonzero, so it is also the largest one. This root results from using $w = (B(N)'V_{abd}B(N))^{-1}a_p$, and the largest root then equals

so the maximal value of the noncentrality parameter is

since $a_p = \sigma^2 \binom{\iota_p}{0}$ with $\iota_p$ a $p \times 1$ dimensional vector of ones and $p$ the number of columns of $G_{f,T}(\theta)$.
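The rank-one argument above can be checked numerically. The sketch below is illustrative only: `M` is an arbitrary symmetric positive definite stand-in for $B(N)'V_{abd}B(N)$, `a` is an arbitrary stand-in for $a_p$, and the function names are hypothetical. It verifies that $w = M^{-1}a$ attains the value $a'M^{-1}a$ for the ratio $(w'a)^2/(w'Mw)$, and that this value is the only nonzero root of $|\lambda M - aa'| = 0$.

```python
import numpy as np

def optimal_combination(M, a):
    """Return w = M^{-1} a and the attained maximum a' M^{-1} a."""
    w = np.linalg.solve(M, a)
    return w, float(a @ w)

def ratio(w, M, a):
    """The quantity (w'a)^2 / (w'Mw) that is maximized over w."""
    return float((w @ a) ** 2 / (w @ M @ w))

rng = np.random.default_rng(0)
p = 4
L = rng.normal(size=(p, p))
M = L @ L.T + p * np.eye(p)     # arbitrary symmetric positive definite matrix (stand-in)
a = rng.normal(size=p)          # arbitrary vector (stand-in)

w_opt, lam_max = optimal_combination(M, a)

# Roots of |lambda * M - a a'| = 0 are the eigenvalues of M^{-1} a a',
# a rank-one matrix, so at most one of them is nonzero.
roots = np.linalg.eigvals(np.linalg.solve(M, np.outer(a, a))).real
```

Any rescaling of `w_opt` gives the same ratio, so the normalization of $w$ does not affect the maximal noncentrality parameter.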
Proof of Theorem 6. Before we start out to prove Theorem 6, we first state an addendum to Theorem 1, which incorporates some higher-order components of order O p (N −1/2 ) that are needed for some of the intermediate results.
Addendum to Theorem 1: Theorem 1* (Representation theorem). Under Assumptions 1 and 2(a), we can characterize the large sample behavior of the Dif, Lev, NL, AS, and Sys sample moments and their derivatives by:

for $j$ = Dif, Lev, NL, AS, Sys, and $B^j_f(\theta)$, $B^j_q(\theta)$: $k_j \times m_j$ and $k_j \times m_j$, $k_j \times 1$ dimensional matrices, and $\psi_{uu}$ is a mean-zero, finite variance, normal random vector that is possibly dependent on $\psi$.
Proof of large sample distribution KLM statistic. For the construction of the large sample distribution of the KLM statistic under Assumptions 1 and 2(a), we use that the part of the sample moments spanned by A j f (θ(e)) and the part spanned by A j f (θ(e)) ⊥ converge at different rates. We use the normalized large sample behavior of each of these parts to construct it. This amounts to premultiplying the sample moments in the expression of the while the true value of θ is drifting to one in line with Assumption 2(a). We separately construct the behavior of the following four components: 1. √ NV ff (θ (e)) −1 f N (θ (e)), 2. q N (θ (e)), 3.V θf (θ (e)), 4.D N (θ (e)), which provide the building blocks for the large sample distribution of the KLM statistic. For each of these components, we determine their limit behavior when multiplied by for the last three components and its inverse for the first one. Taken all together, this implies that (h N (θ 0,N )A f (θ(e)) 1 . . . A f (e) ⊥ ) cancels out of the overall expression of the KLM statistic. 1. To determine the limit behavior of √ NV ff (θ(e)) −1 f N (θ(e)), we disentangle the components with different convergence rates which we do by premultiplying it by We next determine the large sample behavior of the different components under Assumptions 1 and 2(a). Our specification of A f (θ(e)) ⊥ is such that: √ NA f (θ(e)) ⊥ f N (θ(e)) = √ Ng f,T (θ(e)), so using the large sample behavior of √ Ng f,T (θ(e)) stated in the proof of Theorem 4, we have that the large sample behavior of √ NA f (θ(e)) ⊥ f N (θ(e)) for a (large) sample of size N results as: The large sample behavior of √ Nh N (θ 0,N )A f (θ(e)) 1 f N (θ(e)) results from Theorem 1 (the representation theorem) and accords with, since by Assumption 2(a) √ Nh N (θ 0,N ) → 0, √ Nh N (θ 0,N )A f (θ(e)) 1 f N (θ(e)) = A f (θ(e)) 1 A f (θ(e))ψ + o p (1), whereψ = ψ − h N (θ 0,N )σ 1,n ι T−1 ψ c , so upon combining: We next focus on the components of [(h N (θ . 
Since g f,T (θ(e)) does not depend on the initial observations y i1 , the (normalized) covariance of A f (θ(e)) 1 f N (θ(e)) and A f (θ(e)) ⊥ f N (θ(e)) equals zero: h N (θ 0,N )A f (θ(e)) 1V ff (θ(e))A f (θ(e)) ⊥ = o p (1).
Overall, the large sample behavior of A f (θ(e)) ⊥ q N (θ(e)) for both the AS and Sys moment conditions reads: where it is again important to incorporate the higher-order components. We can also specify the above convergence as: 3. We next determine the behavior ofV θf (θ(e)) : where we have rescaled, since all the higher-order terms have dropped out, and which shows that the additional components in Theorem 1 * compared to Theorem 1 do not affect the limit behavior ofD N (θ(e)) up to order N −1/4 . The specification of ν is: which is independent of the limit behavior of √ Ng f,T (θ(e)). We obtain the limit behavior of √ ND N (θ(e)) V ff (θ(e)) −1 D N (θ(e)) from: √ ND N (θ(e)) V ff (θ(e)) −1D N (θ(e))