RELEVANT MOMENT SELECTION UNDER MIXED IDENTIFICATION STRENGTH

This paper proposes a robust moment selection method that aims to pick the best model even when it is a moment condition model with mixed identification strength, that is, a model whose moment conditions include moment functions that are local to zero uniformly over the parameter set. We show that the relevant moment selection procedure of Hall et al. (2007, Journal of Econometrics 138, 488–512) is inconsistent in this setting, as it does not explicitly account for the rate of convergence of parameter estimation of the candidate models, which may vary. We introduce a new moment selection procedure based on a criterion that automatically accounts for both the convergence rate of the candidate model's parameter estimate and the entropy of the estimator's asymptotic distribution. The benchmark estimator that we consider is the two-step efficient generalized method of moments estimator, which is known to be efficient in this framework as well. A family of penalization functions is introduced that guarantees the consistency of the selection procedure. The finite-sample performance of the proposed method is assessed through Monte Carlo simulations.


INTRODUCTION
The validity of standard moment condition-based inference hinges on the strong/point identification property: strongly identified models are those solved by a unique parameter value. Many estimators have been proposed, including the generalized method of moments (GMM) and generalized empirical likelihood estimators, which are all consistent and asymptotically normal under further regularity conditions. Moment selection methods have also been developed under standard identification settings.
The literature on moment selection presents two main approaches. One is based on Lasso-type penalized estimation procedures in which both the parameter of interest and the best subset of moment restrictions are jointly estimated. This strand of literature includes Belloni et al. (2012), Caner and Fan (2015), Cheng and Liao (2015), and Windmeijer et al. (2019).
The second strand of literature on moment selection adopts a more classical model selection methodology relying on information criteria. This approach includes Andrews (1999), Andrews and Lu (2001), Donald and Newey (2001; henceforth DN), Hall and Peixe (2003), and Hall et al. (2007). The selection problem in these papers consists in selecting, among the subsets of moment restrictions useful to estimate a given parameter, the one minimizing an information criterion. In this framework, all the candidate models are expressed in terms of that same parameter of interest, and the selection methods proposed in these papers differ by their choice of information measure. Andrews (1999) and Andrews and Lu (2001) rely on the GMM overidentification test statistic with the aim of selecting correct moment restrictions. DN rely on the mean squared error (MSE) of some estimators, including the two-stage least-squares estimator, its bias-corrected version, and the limited information maximum likelihood estimator, whereas Hall et al. (2007) consider an entropy-based moment selection criterion focused on selecting, from a set of correct moment restrictions, the relevant ones: a set of moment restrictions that neither contains a proper subset carrying the same amount of information about the model parameter nor is included in a set of moment restrictions carrying more information about the parameter. In some sense, the relevant moment selection criterion (RMSC) of Hall et al. (2007) and the J-statistic selection criterion of Andrews (1999) are complementary. The best model in terms of RMSC is the smallest model (in number of moment restrictions) among those that are correct and yield the maximum information about the parameter of interest.
Common to all the papers cited above is the requirement of strong identification, both for consistency of the selection procedure and to ensure valid inference using the selected model. Nevertheless, strong identification is not always guaranteed for moment condition models, and a still growing literature is devoted to inference in models failing this property. Identification patterns range from strong identification at one extreme to a complete lack of identification at the other, where the model is uninformative about the parameter of interest; in the latter case, consistent estimation is not possible and identification is deemed weak. Between weak and strong identification lies a wide range of identification patterns. Since the seminal work of Staiger and Stock (1997), the strength of a moment restriction has been captured by the possibility that the moment function vanishes uniformly over the whole parameter space as the sample size grows: the faster the moment function of the restriction vanishes, the weaker the restriction.
Weak moment restrictions are those vanishing at least at the rate $T^{-1/2}$; strong ones are those vanishing only at the true value and not drifting to 0 at any other value, whereas those vanishing over the parameter set at rate $T^{-\alpha}$, $\alpha \in (0, 1/2)$, are considered semi-weak (or semi-strong). More importantly, the moment restrictions defining a moment condition model can have various strengths, leading the model to have mixed identification strength. Examples of such models can be found in Section 2. They include the classical linear instrumental variable (IV) model with nearly weak instruments and GMM inference on conditional moment restrictions with finite-support conditioning variables. We refer to Caner (2009), Andrews and Cheng (2012), Antoine and Renault (2012), and Han and McCloskey (2019) for further account of such models.
Even though point identification fails in the limit, consistent estimation is possible because (by the central limit theorem) these models gather information about the parameter of interest at a faster rate than they lose their potential for identification. This feature was first pointed out by Hahn and Kuersteiner (2002) and subsequently by Antoine and Renault (2009), who also show, in this setting, that consistent estimators may converge at faster rates in some directions of the parameter space. Interestingly, in this context, standard optimal GMM inference is valid without a specific characterization of the directions of faster convergence (see Antoine and Renault, 2009, 2012). More recently, Antoine and Renault (2020) have proposed a test to investigate whether a moment condition model is strong enough to warrant the validity of standard inference.
This paper proposes a robust moment selection method for moment condition models with mixed identification strength. We build on the work of Hall et al. (2007) and propose a relevant moment selection procedure that consistently selects the best model even if this model is of mixed strength. We argue that, in the configuration of heterogeneous restriction strengths, candidate models must be valued by the rate of convergence of the estimator that they deliver, and two models with the same rate of estimation should be differentiated by the amount of information they convey about the model parameter, which is encapsulated in the entropy of the asymptotic distribution of the parameter estimate. The estimator that we use as benchmark is the two-step GMM estimator, which has linear reparameterizations shown to be asymptotically efficient in this framework by Dovonon, Atchadé, and Doko Tchatoka (2022). We propose a feasible selection criterion that has these properties. This criterion turns out to be a modified version of RMSC that we label mRMSC.
mRMSC conveniently scales the information part of RMSC to provide a sequential estimation of the rate of convergence and the entropy. More precisely, mRMSC first rewards the rate of estimation and then, for models with the same rate, it rewards (negative) entropy. In addition, new penalty terms are introduced that guarantee the consistency of the selection procedure. Conditions under which mRMSC leads to consistent selection are outlined, and we show that the new selection procedure is robust to the presence of uninformative and weak models. In comparison to RMSC, and accounting for the scaling factor, mRMSC penalizes larger models more strongly. Indeed, the penalty term of mRMSC is proportional to $(1/\ln T)^{\alpha}$, $\alpha > 0$, whereas that of the Bayesian information criterion (BIC)-type RMSC, identified as the best-performing version of RMSC, is $\ln\sqrt T/\sqrt T$. The choice of penalty for mRMSC is guided both by robustness to unknown model identification strength and by selection consistency. In this case, stronger penalization seems to be required to dissociate possibly weak signals from noise.
Simulations are performed to evaluate the finite-sample properties of the proposed method. In support of our theory, the simulations reveal that, irrespective of the Monte Carlo design considered and except for the cases where all candidate models are weak, mRMSC selects the best model or set of instruments with a probability (hit rate) growing to 1 as the sample size increases. This exercise also highlights the limits of RMSC in settings of identification with mixed strength. Specifically, as identification weakens, there are many instances where its hit rate decreases to 0 with the sample size or plateaus well below 1, showing evidence of its inconsistency. This issue with RMSC is exacerbated when the number of parameters increases. Nevertheless, in standard identification settings, RMSC seems to have a slight advantage over mRMSC as it converges a bit faster; this seems to be the price for the robustness of mRMSC. We also consider the MSE-based criterion of DN and find it bested by the two entropy-based criteria in terms of hit rate.
The post-selection performance of these methods has been analyzed. For this purpose, we also consider the moment condition model including all the available instruments. According to Chao and Swanson (2005), this is a recommended practice in settings where many weak instruments are available. We find that models selected by mRMSC dominate the other models in terms of coverage probability of confidence intervals, except for the configuration where all relevant instruments are weak. In these cases, all the models in competition perform very poorly, although the model with all instruments seems marginally better in relative terms. We also consider the bias and MSE of estimation. For both measures, mRMSC outperforms the other criteria except in the cases of weak identification where, again, all of them perform poorly, with the model with all instruments having a slight edge, followed by models selected by Donald and Newey's criterion.
For further relation to the literature, it is worth mentioning the quasi-Bayesian model selection method recently proposed by Inoue and Shintani (2018). This method aims to select the most parsimonious model among those with the largest quasi-likelihood. Even though their approach can be adapted to moment selection, our goal differs from theirs as our quest is to find, among the models with maximum information about a parameter of interest, the one with the smallest number of moment restrictions.
The rest of the article is organized as follows: Section 2 introduces the setup and existing asymptotic results on inference on moment condition models with mixed strength. Section 3 analyzes the performance of RMSC in this setting and reports simulation results exposing some evidence of inconsistency of this method. mRMSC is introduced in Section 4 along with its consistency and post-selection properties. Relevant choices of penalty functions are also discussed. Simulation results are reported in Section 5, whereas Section 6 concludes. Lengthy proofs are relegated to Appendix B. The Supplementary Material of this article provides further simulation results.
Throughout the article, |a| denotes the number of nonzero entries of a if a is a vector and the determinant of a if a is a square matrix; $\|a\|$ denotes the Frobenius norm of the matrix a, i.e., $\|a\| = \sqrt{\operatorname{trace}(aa')}$; $a \vee b$ denotes max(a,b), and $a \wedge b$ denotes min(a,b).

SETUP, EXAMPLES, AND EXISTING RESULTS
Let us consider the sample {Y_tT : t = 1, ..., T}, (T ≥ 1), a triangular array with common distribution P_T, described by the population moment condition

$E\big(\phi(Y_{tT},\theta_0)\big) = 0, \qquad (1)$

where φ(·,·) is a known R^k-valued function, θ0 is the parameter value of interest, which is unknown but lies in Θ, a subset of R^p, and E(·) denotes expectation taken under P_T; we do not explicitly mention its dependence on T for simplicity. The moment condition model (1) is said to globally identify θ0 if θ0 is its unique solution over Θ. As recalled in the Introduction, consistent estimation is possible in nearly weak models and not in weak models. Of particular interest are configurations where the moment restrictions carry different levels of information about the parameters of interest. Along this line, we consider the estimating function φ(·) to be partitioned into subvectors with various strengths of identification. Specifically, writing φ = (φ1', φ2')', we assume that

$E\big(\phi(Y_{tT},\theta)\big) = \rho_T(\theta) = \begin{pmatrix} T^{-\delta_1}\,\rho_1(\theta) \\ T^{-\delta_2}\,\rho_2(\theta) \end{pmatrix}, \qquad 0 \le \delta_1 \le \delta_2 < \tfrac12. \qquad (3)$

In this representation, φ1 has the potential to more strongly identify θ0, or some of its components, than φ2. Although this moment condition model is not informative about θ0 in the limit if 0 < δ1, Antoine and Renault (2009, 2012) show that consistent estimation is possible under mild conditions. Standard identification features of moment condition models pertain to the case δ1 = δ2 = 0.
For simplicity of exposition, we maintain throughout that the moment functions exhibit either one of two identification strengths. Nevertheless, our results extend to more general settings as addressed by Appendix C (see also Remark 3).
Two examples of moment condition models where various levels of identification strengths may be expected from the components of the estimating function are presented below. Further examples of such models are detailed in Antoine and Renault (2012) and Han and McCloskey (2019).
Example 1. (Linear IV model with nearly weak instruments). Consider the classical linear IV model:

$y = X\theta_0 + U, \qquad (4)$
$X = Z\,\Pi_T + V, \qquad \Pi_T = \big(T^{-\delta_1}C_1' \ \vdots\ T^{-\delta_2}C_2'\big)', \qquad (5)$

for some 0 ≤ δ1 ≤ δ2 < 1/2, with C_i a (k_i,p)-matrix for i = 1,2 and k1 + k2 = k. Partition Z = [Z1 : Z2] according to the partition of Π_T, i.e., Z_i is a (T,k_i)-matrix for i = 1,2. When δ1 = δ2, the instruments in Z1 and Z2 have equal strength, whereas those in Z1 are stronger than those in Z2 if δ1 < δ2. Suppose that {w_t ≡ (y_t, x_t', z_t')' ∈ R × R^p × R^k : t = 1, ..., T} is a sample of independent and identically distributed random vectors with finite second moments, and E(z_t u_t) = 0, E(z_t v_t') = 0 for all t. Then, the true parameter θ0 solves the moment condition

$E\big[z_t(y_t - x_t'\theta_0)\big] = 0. \qquad (8)$

Specifically, letting $\Sigma \equiv E(z_tz_t') = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$ be partitioned conformably with $z_t = (z_{1t}', z_{2t}')'$, we have

$E\big[z_t(y_t - x_t'\theta)\big] = \Sigma\,\Pi_T\,(\theta_0 - \theta). \qquad (9)$

As indicated in Antoine and Renault (2009), we see that if the instruments z1t and z2t are orthogonal, i.e., if Σ12 = Σ21' = 0, (9) becomes

$E\big[z_{it}(y_t - x_t'\theta)\big] = T^{-\delta_i}\,\Sigma_{ii}\,C_i\,(\theta_0 - \theta), \qquad i = 1,2,$

which has the form in (3) with φ_i(w_t,θ) = z_it(y_t − x_t'θ) and ρ_i(θ) = Σ_ii C_i(θ0 − θ) as given in (9), i = 1,2.
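To fix ideas, the drifting DGP of Example 1 is straightforward to simulate. The sketch below is our illustrative helper, not code from the paper; the function name and default values are assumptions. It draws a sample in which the reduced-form coefficients on the two instrument blocks shrink at rates $T^{-\delta_1}$ and $T^{-\delta_2}$.

```python
import numpy as np

def simulate_nearly_weak_iv(T, theta0, C1, C2, delta1, delta2, rho=0.5, seed=0):
    """Simulate y = X theta0 + u, X = Z Pi_T + V, Pi_T = [T^{-d1} C1; T^{-d2} C2].

    Illustrative sketch of Example 1 (hypothetical helper): instruments in Z1
    drift to irrelevance at rate T^{-delta1}, those in Z2 at the weaker rate
    T^{-delta2}; Cov(u_t, v_t) = rho creates endogeneity of the regressors.
    """
    rng = np.random.default_rng(seed)
    theta0 = np.atleast_1d(np.asarray(theta0, dtype=float))
    k1, p = C1.shape
    Z = rng.standard_normal((T, k1 + C2.shape[0]))
    u = rng.standard_normal(T)
    # each column of V has unit variance and correlation rho with u
    V = rho * u[:, None] + np.sqrt(1.0 - rho**2) * rng.standard_normal((T, p))
    Pi_T = np.vstack([T ** (-delta1) * C1, T ** (-delta2) * C2])
    X = Z @ Pi_T + V
    y = X @ theta0 + u
    return y, X, Z

# One endogenous regressor with theta0 = 0.1 and c1 = c2 = 1.48, as in Section 3:
y, X, Z = simulate_nearly_weak_iv(1000, 0.1, np.array([[1.48]]),
                                  np.array([[1.48]]), 0.0, 0.4)
```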
Example 2. (Kernel smoothing; Antoine and Renault, 2012). Let (X_t, Y_t), t = 1, ..., T, be the observed sample on a stationary process with stationary distribution that of (X,Y). Let $\hat m_T(\theta)$ be a Nadaraya-Watson estimator of the conditional expectation E[g(Y,θ)|X = x], where g is a known function of an unknown parameter θ of interest. Letting $h_T = T^{-2\delta_1}$, for a suitably chosen 0 < δ1 < 1/2, denote the bandwidth sequence, $\sqrt{Th_T}\{\hat m_T(\theta) - E[g(Y,\theta)|X = x]\}$ is pointwise asymptotically Gaussian with zero mean and, under further mild conditions, may, as a function of θ, converge to a Gaussian process. Suppose now that, for inference about the true value θ0 of θ, the estimating equation E[g(Y,θ0)|X = x] = 0 is valid for a given value x but may not be uniformly valid over all the support of X. Then,

$E\big(\bar\phi_T(\theta)\big) = T^{-\delta_1}\rho_1(\theta)\,(1 + o(1)),$

with ρ1(θ) proportional to E[g(Y,θ)|X = x]. This property of $\bar\phi_T(\theta)$ gives rise to the moment condition:

$E\big(\bar\phi_T(\theta_0)\big) = 0, \qquad (11)$

where $\bar\phi_T(\theta)$ is a sample mean of a double array given by

$\bar\phi_T(\theta) = \frac1T\sum_{t=1}^{T}\phi_{tT}(\theta), \qquad \phi_{tT}(\theta) \propto \frac{1}{\sqrt{h_T}}\,K\Big(\frac{X_t - x}{h_T}\Big)\,g(Y_t,\theta),$

and K is a kernel function. Euler optimality conditions are fulfilled for the true unknown value θ0 of θ, ensuring that ρ(θ0) = 0. Suppose now that another conditional expectation is informative about θ. The same reasoning as above leads to additional moment restrictions similar to (11) with a possibly different degree of smoothness δ2. We then end up with vectorial functions φ_T(θ) and ρ(θ) such that

$E\big(\bar\phi_T(\theta)\big) = \begin{pmatrix} T^{-\delta_1}\rho_1(\theta) \\ T^{-\delta_2}\rho_2(\theta) \end{pmatrix},$

which has the form in (3).
Returning to the general framework, the following assumption is made to obtain consistent estimators for θ0 from model (1) under the mixed identification framework in (3). We let $\bar\phi_T(\theta) \equiv \frac1T\sum_{t=1}^{T}\phi(Y_{tT},\theta)$.

Assumption 1. (i) θ0 is the unique solution over Θ of ρ(θ) ≡ (ρ1(θ)', ρ2(θ)')' = 0, the suitably inflated version of $E(\bar\phi_T(\theta))$. (ii) $\sup_{\theta\in\Theta}\sqrt T\,\big\|\bar\phi_T(\theta) - E\big(\bar\phi_T(\theta)\big)\big\| = O_P(1)$.

Assumption 1(i) imposes global identification of θ0 by the suitably inflated estimating moment function, whereas part (ii) of the assumption requires that the sample mean of the estimating function accumulate information about its population mean at the fast rate √T. Note that this is the standard rate of convergence of a sample mean, guaranteed by the functional central limit theorem for triangular arrays; see, e.g., Ziegler (1997). Under Assumption 1, consistent estimation is possible so long as the rate of accumulation of information outweighs the rate of dilution of information.
Let the GMM estimator $\hat\theta_T$ be defined by

$\hat\theta_T = \arg\min_{\theta\in\Theta}\ \bar\phi_T(\theta)'\,W_T\,\bar\phi_T(\theta),$

where W_T is a sequence of almost surely symmetric positive definite matrices converging in probability to W, a symmetric positive definite matrix. Under Assumption 1, Antoine and Renault (2009, 2012) show that $\hat\theta_T - \theta_0 = o_P(1)$ whenever δ2 < 1/2. Hence, δ2 < 1/2 is a sufficient condition to ensure that $\hat\theta_T$ converges in probability to θ0, especially if we maintain that the parameter set is compact. Note, however, that δ2 < 1/2 is not a necessary condition for consistency, in the sense that a subset of the estimating vector can even be identically 0 and consistent estimation would still be possible. For this, though, it is important that δ1 < 1/2 and that ρ1(θ) = 0 be uniquely solved by θ = θ0.
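For linear moment functions such as those of Example 1, the estimator just defined can be computed in closed form. The following minimal sketch (ours, for illustration; not the authors' code) implements the usual two-step scheme: an identity weighting matrix in the first step, and $W_T$ equal to the inverse of the estimated variance of the moment contributions in the second step.

```python
import numpy as np

def two_step_gmm_linear_iv(y, X, Z):
    """Two-step GMM for E[z_t (y_t - x_t' theta)] = 0 (linear IV case).

    Step 1: identity weighting matrix. Step 2: W_T = inverse of the estimated
    variance of the moment contributions, evaluated at the step-1 estimate.
    """
    T = X.shape[0]
    ZX, Zy = Z.T @ X / T, Z.T @ y / T

    def gmm(W):
        # theta = (X'Z W Z'X)^{-1} X'Z W Z'y, written in sample-mean form
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    theta1 = gmm(np.eye(Z.shape[1]))                 # first step
    u = y - X @ theta1                               # step-1 residuals
    m = Z * u[:, None]                               # moment contributions z_t u_t
    S = (m - m.mean(0)).T @ (m - m.mean(0)) / T      # estimated variance of moments
    return gmm(np.linalg.inv(S))                     # efficient second step
```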
Under further regularity conditions, the GMM estimator is asymptotically normally distributed. To introduce these conditions and the main result, due to Antoine and Renault (2012), we introduce some notation. Let $s_1 = \operatorname{Rank}\big(\frac{\partial\rho_1}{\partial\theta'}(\theta_0)\big)$, which we assume strictly smaller than p, and let R = (R1 : R2) be a (p,p)-nonsingular matrix such that R1 is a (p,s1) full-column-rank matrix and the s2 = p − s1 columns of R2 span the null space of $\frac{\partial\rho_1}{\partial\theta'}(\theta_0)$. Define the scaling matrix

$\Lambda_T = \operatorname{diag}\big(T^{\frac12-\delta_1}\,I_{s_1},\ T^{\frac12-\delta_2}\,I_{s_2}\big),$

and let J denote the limit of the suitably rescaled Jacobian of the moment function at θ0, expressed in the rotated coordinates determined by R. The following assumptions are made.
Assumption 2 requires that $\sqrt T\big(\bar\phi_T(\theta_0) - \rho_T(\theta_0)\big) \xrightarrow{d} N(0,\Sigma)$ and introduces the limit Jacobian matrix C of the moment function; Assumption 3 requires θ ↦ φ(Y_tT,θ) to be twice continuously differentiable on a neighborhood $\mathcal N_{\theta_0}$ of θ0 with, for i = 1,2, suitable dominance conditions on the derivatives of φ_i. Assumptions 2 and 3 are standard and impose asymptotic normality for the sample mean $\bar\phi_T(\theta)$ at θ = θ0, as well as regularity conditions on its first- and second-order derivatives that are useful for its Taylor series expansions. Although immaterial when φ1 is linear in the parameter, the condition $\delta_2 < \frac14 + \frac{\delta_1}{2}$ in Assumption 3(i) implies that the Jacobian of the moment function is big enough to ensure that the first-order terms in the expansion of $\bar\phi_T(\hat\theta_T)$ around θ0 dominate the higher-order terms. Note also that, under some dominance conditions, the matrix C in Assumption 2 is equal to ∂ρ(θ0)/∂θ'. We have the following result.
Theorem 2.1 (Antoine and Renault, 2012). Under Assumptions 1-3,

$\Lambda_T\,R^{-1}\big(\hat\theta_T - \theta_0\big) \xrightarrow{d} N(0, V_W), \qquad (15)$

with

$V_W = (J'WJ)^{-1}\,J'W\Sigma WJ\,(J'WJ)^{-1}. \qquad (16)$

Theorem 2.1 effectively provides the asymptotic distribution of $\hat\eta_T = R^{-1}\hat\theta_T$, a linear function of $\hat\theta_T$ whose components each converge at a specific rate. In particular, the first s1 components of $\hat\eta_T$ converge at rate $T^{\frac12-\delta_1}$, and hence are faster than the remaining s2 = p − s1 components, which converge at rate $T^{\frac12-\delta_2}$. In general, since each component of $\hat\theta_T$ is typically a linear combination of all components of $\hat\eta_T$, we expect the slower rate of convergence to prevail for each component of $\hat\theta_T$. More specifically, the theorem implies that $\hat\theta_T - \theta_0 = O_P(T^{\delta_2-\frac12})$.
Remark 1. Note that this result holds in the extreme cases where s1 = 0 and s1 = p. In these cases, R = I_p; for s1 = 0, $J = \big(0'\ \vdots\ \frac{\partial\rho_2}{\partial\theta'}(\theta_0)'\big)'$ and $\Lambda_T = T^{\frac12-\delta_2}I_p$, whereas for s1 = p, $J = \big(\frac{\partial\rho_1}{\partial\theta'}(\theta_0)'\ \vdots\ 0'\big)'$ and $\Lambda_T = T^{\frac12-\delta_1}I_p$. In the case s1 = p, first-order local identification is ensured by the moment restrictions determined by φ1, which also determine the asymptotic distribution of the GMM estimator. Let W11 be the limit weighting matrix for estimation based only on φ1. If W11 matches the upper-left (k1,k1)-submatrix of W, φ2 appears redundant in the sense that, given φ1, the inclusion of the weaker moment conditions in φ2 does not improve inference about θ0. In the case where s1 = 0, it is φ2 that ensures local identification, and φ1 may turn out to be the irrelevant set of moment restrictions.
It is not hard to see that the asymptotic variance in (16) is smallest for the choice of W = Σ^{-1}, where it is equal to $V^* = (J'\Sigma^{-1}J)^{-1}$. Dovonon et al. (2022) actually show that V* stands as the semiparametric efficiency bound for the estimation of η0 = R^{-1}θ0. The properly scaled two-step efficient GMM estimator using a sequence of weighting matrices W_T (converging in probability to Σ^{-1}) has V* as its asymptotic variance, and they further show that this estimator is asymptotically minimax optimal with respect to a large class of loss functions. We next revisit our comments in Remark 1 in the setting of efficient GMM estimation.
Remark 2. In Remark 1, consider again the case s1 = p. We observe that if estimation involves optimal weighting matrices, W11 does not always match the upper-left block of W = Σ^{-1}, and an efficiency gain becomes possible when the weaker estimating function φ2 is added. To see this, consider the necessary and sufficient condition for efficiency gain derived by Breusch et al. (1999, Thm. 1) in a general GMM inference setting:

$\frac{\partial\rho_2}{\partial\theta'}(\theta_0) \neq \Sigma_{21}\Sigma_{11}^{-1}\,\frac{\partial\rho_1}{\partial\theta'}(\theta_0). \qquad (17)$

In particular, when k1 = p (just-identification by φ1) and $\frac{\partial\rho_2}{\partial\theta'}(\theta_0) = 0$, equation (17) is equivalent to Σ21 ≠ 0, which is the condition derived by Antoine and Renault (2017). In general, if φ1 is overidentifying, this condition is necessary but not sufficient for efficiency gain.
In the context of the linear IV model (Example 1), given the moment restriction E(z1t u_t) = 0, the necessary and sufficient condition that E(z2t u_t) = 0 induce an efficiency gain, as per (17), can be written:

$\big(\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\big)\,\Pi_{2T} \neq 0.$

We can show (see also Hall, Inoue, and Shin, 2008, p. 499) that this condition is equivalent to $\Pi_{2T} \neq 0$, and this regardless of the canonical correlations between z1t and z2t. In particular, if $\Pi_{2T}$ vanishes faster than $\Pi_{1T}$ and z1t identifies θ0, then including z2t would not improve efficiency. This claim is confirmed by Proposition 3.1(ii) in Section 3, which shows that the asymptotic distribution of the efficient GMM estimator is unchanged whether or not the second (weaker) set of instruments is included for inference.
Regarding inference about θ0 within the GMM framework, one may expect, in light of Theorem 2.1, that knowing s1, the δ_i's, R, and the moment function's partition in (3) is essential. Interestingly, however, Antoine and Renault (2009, 2012) have shown that such knowledge is not required. In particular, inference about θ0 using the two-step efficient GMM estimator can validly be carried out using the standard formula. (In our framework of mixed identification strength, since ρ2 is of smaller magnitude than ρ1, condition (17) amounts to $\Sigma_{21}\Sigma_{11}^{-1}\frac{\partial\rho_1}{\partial\theta'}(\theta_0) \neq 0$.) Specifically, the standard GMM inference is robust to the sorts of deviations encapsulated in the conditions of Theorem 2.1 (see Antoine and Renault (2009, p. S151)). This makes relevant the question of moment selection in the context of nearly weak moment restrictions, which is the focus of this paper. Below, we first consider the relevant moment selection methodology introduced by Hall et al. (2007) and investigate its performance in the presence of nearly weak moment equalities. We then propose an mRMSC that robustly selects the best model even when this model does not enjoy a strong identification property.

PERFORMANCE OF THE STANDARD RELEVANT MOMENT SELECTION PROCEDURE
This section investigates the performance of the RMSC model selection procedure when the best model might not be strongly identifying. This is done through Monte Carlo simulations of IV models, and we provide some intuition about potential shortcomings that paves the way for the mRMSC that we introduce in the next section. Before introducing the simulation setup, let us first introduce the RMSC. RMSC is a penalized entropy measure that is minimized over candidate models to obtain the most relevant one. Let φ denote the estimating function of the moment condition model in (1), which is supposed to have standard identification properties. RMSC uses the entropy of the asymptotic distribution of the efficient estimator $\hat\theta_T(\phi)$ of θ0 in (1), which, up to a constant, is $\frac12\ln|V(\phi)|$, with $V(\phi) = \big[G(\phi)'\Sigma(\phi)^{-1}G(\phi)\big]^{-1}$, $G(\phi) = \partial\rho(\theta_0)/\partial\theta'$, and $\Sigma(\phi) = \lim_{T\to\infty}\operatorname{Var}\big(\sqrt T\,\bar\phi_T(\theta_0)\big)$, with variance under P_T. The sample estimate of ent_θ(φ) yields RMSC:

$\mathrm{RMSC}(\phi) = \ln\big|\hat V_T(\phi)\big| + \kappa(|\phi|,T), \qquad \hat V_T(\phi) = \big[\hat G_T(\phi)'\hat\Sigma(\phi)^{-1}\hat G_T(\phi)\big]^{-1}, \qquad (18)$

where $\hat G_T(\phi)$, $\hat\Sigma(\phi)$, and $\hat V_T(\phi)$ are consistent estimators of G(φ), Σ(φ), and V(φ), respectively, and κ is the penalty function. Throughout this section, we will consider the BIC-type penalty function:

$\kappa(|\phi|,T) = (|\phi| - p)\,\frac{\ln \tau_T}{\tau_T}, \qquad (19)$

which Hall et al. (2007) identify as the best performing one compared to other alternatives, including the Hannan-Quinn penalty. In (19), τ_T represents the rate of convergence of the estimator $\hat V_T(\phi)$, i.e., $\hat V_T(\phi) - V(\phi) = O_P(\tau_T^{-1})$. Under some regularity conditions, if the process {φ(Y_tT,θ0) : t = 1, ..., T} is at most finite-lag-dependent, τ_T = √T, but if the estimator $\hat V_T(\phi)$ involves a kernel estimation of the long-run variance, then τ_T = √T/ℓ_T, where ℓ_T is the kernel bandwidth. See Andrews (1991).
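In the iid linear IV setting of Example 1, all the ingredients of (18) and (19) are available in closed form. The sketch below is an illustrative implementation under homoskedasticity, so that $\hat V_T$ reduces to the 2SLS variance estimate and $\tau_T = \sqrt T$; it is ours, not the authors' code.

```python
import numpy as np

def rmsc(y, X, Z):
    """RMSC of Hall et al. (2007) for a candidate instrument set Z, iid case.

    RMSC = ln|V_hat| + (k - p) * ln(sqrt(T)) / sqrt(T), where V_hat is the
    2SLS asymptotic variance estimate (homoskedastic errors assumed here).
    """
    T, p = X.shape
    k = Z.shape[1]
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)       # P_Z X
    theta = np.linalg.solve(X.T @ PZX, PZX.T @ y)     # 2SLS estimator
    sigma2_u = np.mean((y - X @ theta) ** 2)          # residual variance
    V_hat = sigma2_u * np.linalg.inv(X.T @ PZX / T)   # naive variance estimate
    _, logdet = np.linalg.slogdet(V_hat)
    sqrtT = np.sqrt(T)
    return logdet + (k - p) * np.log(sqrtT) / sqrtT
```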
Next, we shed some light on the performance of the RMSC procedure in the presence of moment restrictions with mixed strength. We achieve this in the context of the classical linear IV model of Example 1. For a larger perspective, it is worth deriving the asymptotic properties of the two-stage least-squares estimator. We maintain the following assumption, in which $\sigma_u^2 = E(u_t^2)$. Assumption 4(i) restricts the sample to be independent and identically distributed. While this assumption may look restrictive, it is only made for simplification purposes; the main points in this section continue to hold for stationary and ergodic time-dependent data. Assumption 4(ii) is standard: nonsingularity of Σ imposes no linear duplication of instruments, while the rank condition on C amounts to the standard rank condition on $E(z_tx_t')$. Assumption 4(iii) requires homoskedasticity for u_t and exogeneity for z_t, as well as some limit properties useful to derive the asymptotic distribution of the estimators that we will consider. We do not restrict the correlation between u_t and v_t, which is typically different from 0 in the presence of endogenous regressors.
The efficient GMM estimator of θ0 from the moment condition (8) is the two-stage least-squares estimator:

$\hat\theta_T = (X'P_ZX)^{-1}X'P_Zy,$

where $P_Z = Z(Z'Z)^{-1}Z'$. Its asymptotic distribution can be obtained readily from Theorem 2.1 if the instruments are orthogonal. The following proposition gives this distribution without such a restriction.
Proposition 3.1. Under Assumption 4, the following statements hold.
In cases (i) and (ii), the asymptotic variance is consistently estimated by $\tilde V_T = \hat\sigma_u^2\,(X'P_ZX/T)^{-1}$. This proposition highlights the expected mixture of rates of convergence of the GMM estimator when instruments have mixed strength. It also shows that if the stronger instruments locally identify the parameter of interest, consistency is achieved at a faster rate, and the weaker IVs become irrelevant as they do not affect the asymptotic variance. However, if the stronger set does not identify the true parameter in all directions (this is the case, for instance, if we have two endogenous variables and only one stronger IV), the weaker set of IVs appears relevant to estimate the remaining directions, albeit at a slower rate of convergence. The linear IV model offers a suitable framework to investigate the performance of the RMSC procedure in the presence of moment restrictions with nonstandard or mixed strength. We consider the following data generating process (DGP).
The instruments z1, z2 ∈ R^T are independent with common distribution N(0, I_T) and are independent of U and V, which lie in R^T with common distribution N(0, I_T) and Cov(u_t, v_t) = ρ for all t = 1, ..., T. We consider cases of equal strength for the instruments, with δ1 = δ2 = 0, 0.2, 0.3, 0.4, and cases of mixed strength with δ1 < δ2. We then consider the case of one endogenous variable, where we set θ0 = 0.1 and c1 = c2 = 1.48, and the case of two endogenous variables, with θ0 = (0.1, 0.1)', c1 = (1.48, 0)', and c2 = (0, 1.48)'.
We include four extra instruments, z3, z4, z5, z6, independent of each other and of z1, z2, U, and V, with common distribution N(0, I_T), and proceed to select the best set of instruments using RMSC. The RMSC of each of the 63 (57) combinations of IVs has been assessed in the case of the models with one (two) endogenous variable(s), and the best model is the one with the lowest RMSC. For a given candidate set of k instruments Z, the RMSC is:

$\mathrm{RMSC}(Z) = \ln\big|\tilde V_T\big| + (k - p)\,\frac{\ln\sqrt T}{\sqrt T}, \qquad \tilde V_T = \hat\sigma_u^2\,\Big(\frac{X'P_ZX}{T}\Big)^{-1}.$

In the case of one endogenous variable, if δ1 < δ2, only z1 is relevant while all the other IVs are redundant, and if δ1 = δ2, both z1 and z2 determine the best set of IVs while all the others are redundant. In the case of two endogenous variables, z1 and z2 constitute the best set of IVs regardless of the values of δ1 and δ2. We consider sample sizes T = 100, 200, 500, 1,000, 5,000, 10,000, 20,000, 50,000, 100,000. We include such large sample sizes because of the possibly slow rates of convergence. Figure 1 plots the proportion of correct model selection (hit rate) by sample size. The number of Monte Carlo replications is 10,000 throughout.

[Figure 1. Proportion of best model selection (hit rate) by RMSC for models with one and two endogenous variables; panels cover δ1 = δ2 and δ1 < δ2 = 0.4. Sample sizes T = 100 to 100,000; number of replications: 10,000.]
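The hit-rate exercise itself is a plain enumeration over instrument subsets. A schematic Monte Carlo loop, written in terms of the hypothetical helpers sketched earlier, could look as follows.

```python
import numpy as np
from itertools import combinations

def hit_rate(simulate, criterion, best_set, n_instruments, p,
             n_rep=1000, **dgp_kwargs):
    """Fraction of replications in which `criterion` selects `best_set`.

    Schematic version of the Section 3 exercise: every instrument subset of
    size >= p is scored, the minimizer is retained, and we record how often
    it coincides with the known best set (e.g., {0} when only z1 is relevant).
    """
    subsets = [s for r in range(p, n_instruments + 1)
               for s in combinations(range(n_instruments), r)]
    hits = 0
    for rep in range(n_rep):
        y, X, Z_all = simulate(seed=rep, **dgp_kwargs)
        scores = [criterion(y, X, Z_all[:, list(s)]) for s in subsets]
        hits += subsets[int(np.argmin(scores))] == tuple(sorted(best_set))
    return hits / n_rep

# Usage sketch (irrelevant instruments can be encoded via zero rows in C2):
# rate = hit_rate(simulate_nearly_weak_iv, rmsc, best_set=(0,),
#                 n_instruments=6, p=1, T=1000, theta0=0.1,
#                 C1=..., C2=..., delta1=0.2, delta2=0.4)
```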
The results suggest that RMSC consistently selects the best model as the sample size increases in cases where the instruments are relatively strong (low δ_i). However, the failure of RMSC is striking in models with moderately large to large values of δ_i. The probability of selecting the best model does not seem to converge to 1 as the sample size grows. Specifically, for the case δ1 = δ2 = 0.3, the best model is selected about 50% of the time for sample sizes as large as 50,000 or above. The selection procedure also fails to converge for (δ1, δ2) = (0.2, 0.4) in models with one endogenous variable, even though the sole relevant instrument in this configuration seems relatively strong. Also striking is the fact that the hit rate seems to decrease with the sample size in many instances of nearly weak instruments; this is the case when δ1, δ2 ≥ 0.3. Finally, the case of two endogenous variables and δ1 < δ2 appears to be the most difficult for RMSC to handle, since the hit rate drops with the sample size for all combinations of instrument strength, including when a strong IV (δ1 = 0) is present.
The failure of RMSC can be related to the fact that the information part of the criterion diverges to infinity under nearly weak identification, as can be seen from Proposition 3.1(iii). This makes the penalty term inappropriate to balance out effectively the noise associated with the selection procedure. Also of importance is the fact that the entropy, or the asymptotic variance, has to be estimated at a rate at least as fast as √T for consistency to be guaranteed (see Assumption 4 of Hall et al. (2007)). This is not guaranteed at all in this simulation exercise. We are rather certain that the entropy cannot be estimated at such a fast rate and that it can even have a different rate of convergence depending on the set of instruments being assessed.
Accounting for these shortcomings of RMSC, we further analyze its properties in moment condition models with mixed identification strength. We then propose a modified version of this criterion which robustly and consistently selects the best model regardless of the identification strength.

A ROBUST RELEVANT MOMENT SELECTION PROCEDURE
In this section, we propose a moment selection method to consistently select the smallest (in terms of number of moment restrictions) most relevant model while accounting for the possibility of mixed identification strength of the moment restrictions. We first motivate and introduce a new criterion which is a modified version of RMSC with some robustness properties. We then outline the conditions under which this criterion delivers consistent selection of the best model. The section ends with a discussion on the robustness of the mRMSC.

The Selection Criterion
The problem that we address is one where we have a finite but possibly large number of candidate moment restrictions available to carry out inference about a p-vector parameter θ0. These restrictions possibly do not have the same identification strength, and our goal is to propose a criterion useful to select the best and most relevant moment condition model. As in Hall et al. (2007), we define this model as one from which it is impossible to improve the inference about θ0 by adding other moment restrictions. Adding to the difficulty of the problem, we do not know a priori what the strengths of the moment restrictions are and cannot even provide a systematic ranking of them.
To simplify, we assume that the available moment restrictions fit into two categories of strength and that all the candidate models can be expressed as in (3) with 0 ≤ δ1 ≤ δ2 < 1/2. As in the previous section, we refer to a generic candidate model by φ, the vector of the estimating functions that it contains. We shall focus on candidate models φ with partition (φ1, φ2) satisfying the conditions of Theorem 2.1. Note that φ2 may be empty if all the components of φ have the same strength. The most restrictive of these assumptions may be Assumption 1(i). However, we will show that the candidate models for which this condition fails are ruled out by the proposed selection procedure, and as a result it makes sense to consider that this condition holds without loss of generality.
As established by Dovonon et al. (2022), the efficiency bound on the estimation of η0 = R^{-1}θ0 is $V^* = (J'\Sigma^{-1}J)^{-1}$. This bound happens to be the asymptotic variance of the (properly scaled) efficient GMM estimator. Recall also from Theorem 2.1 that different candidate models may lead to different rates of convergence of the GMM estimator or, equivalently, to different rates of accumulation of information. In that respect, letting $\phi^{(j)}$ (j = 1,2) be two candidate models, $\hat\theta_T(\phi^{(1)})$ may converge faster than $\hat\theta_T(\phi^{(2)})$ but with a larger information bound. In such a case, it is natural to prefer $\phi^{(1)}$ over $\phi^{(2)}$.
Hence, any relevant criterion in the current framework shall account for (i) the amount of information and (ii) the speed of information gathering, the latter being of first-order importance.
To account for the efficiency bound, we follow Hall et al. (2007), who consider the entropy of the asymptotic distribution of the efficient GMM estimator. This distribution being Gaussian, the entropy is given, up to an additive constant, by

$\mathrm{ent}_\theta(\phi) = \frac12\,\ln\big|V_\theta(\phi)\big|, \qquad V_\theta(\phi) = \big[J(\phi)'\Sigma(\phi)^{-1}J(\phi)\big]^{-1}.$

However, the dependence of J(φ) on the choice of the parameter rotation matrix R(φ) raises the question of the invariance of the entropy. The following proposition shows that, regardless of the rotation matrix chosen, ent_θ(φ) is unchanged. Hence, even though the asymptotic variance may depend on the choice of rotation, the entropy is rotation-invariant.
Proposition 4.1. Let $D = (D_1'\ \vdots\ D_2')'$ be a (k,p)-matrix of rank p, and let s1 denote the rank of D1. Assume that 0 < s1 < p, and let $\mathcal R$ denote the set of (p,p)-nonsingular matrices R = (R1 : R2) such that R1 is of full column rank s1 and the columns of R2 span the null space of D1; for $R \in \mathcal R$, let J(R) denote the limit of the suitably rescaled matrix DR. Then, for any $R, S \in \mathcal R$ and any arbitrary (k,k)-matrix V, we have

$\big|J(R)'VJ(R)\big| = \big|J(S)'VJ(S)\big|.$

It is not hard to see that the rescaled sequence $D_T(R) \to J(R)$ as T → ∞. Hence, by continuity of the determinant function of a matrix, $|D_T(R)'VD_T(R)| \to |J(R)'VJ(R)|$, and the sequence of determinants does not depend on $R \in \mathcal R$. As a consequence, the limit $|J(R)'VJ(R)|$ is also unrelated to $R \in \mathcal R$, and this concludes the proof.
Thanks to these properties, the quest for the optimal model is consistent with the minimization of entropy, as one should expect. However, if the limit amount of information about the true parameter value θ0 plays an important role in the determination of the optimal model, this information is, as mentioned, only of second-order importance relative to the rate at which it is gathered.
The setting of Hall et al. (2007) accounts only for cases where that rate is not heterogeneous for the best model, in the sense that all directions of the parameter space are estimated at the same standard rate √T. In this case, the effect of the rate can be ignored in the selection process and, as they point out in their Corollary 1(iii), any model yielding estimators that converge more slowly than the standard rate would have entropy equal to infinity and therefore would not be selected. Our framework departs from theirs by the fact that the best model may not only yield estimators converging at a slower rate than standard, but there are also possibilities of having estimators converging at different rates in various directions. For our purpose, the rate of convergence needs to be accounted for in the definition of a meaningful selection criterion. A natural summary indicator for the rates of convergence from φ is the weighted average of the corresponding exponents δ_i(φ), with weights given by the number of directions in the parameter space that they characterize. That is,

$a(\phi) = \frac{s_1(\phi)\,\delta_1(\phi) + s_2(\phi)\,\delta_2(\phi)}{p}.$

(The rates of convergence are given by the scaling matrix $\Lambda_T(\phi)$ defined above.) In the context of only two possible rates of convergence, say δ_i(φ) = δ_i (i = 1,2) for all φ, two models $\phi^{(1)}$ and $\phi^{(2)}$ can be compared along the number of fast-converging directions that they estimate, and the best model would be the one with the largest s1. Since in this case s2(φ) = p − s1(φ), it is not hard to see that

$a\big(\phi^{(1)}\big) \le a\big(\phi^{(2)}\big) \iff s_1\big(\phi^{(1)}\big) \ge s_1\big(\phi^{(2)}\big).$

This further validates the choice of a(φ) as the summary measure of the rates.
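For a concrete illustration (numbers chosen purely for exposition), take p = 2, δ1 = 0.1, and δ2 = 0.4, and compare a model $\phi^{(1)}$ whose strong restrictions span both directions ($s_1(\phi^{(1)}) = 2$) with a model $\phi^{(2)}$ spanning only one ($s_1(\phi^{(2)}) = 1$):

$$a\big(\phi^{(1)}\big) = \frac{2(0.1) + 0(0.4)}{2} = 0.1, \qquad a\big(\phi^{(2)}\big) = \frac{1(0.1) + 1(0.4)}{2} = 0.25,$$

so the model with more fast-converging directions attains the smaller value of a(·) and is ranked first, irrespective of the two entropies.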
Remark 3. In the occurrence of mixed-rate estimation involving more than two directions (see Theorem 2.1), direct comparison of two models using the analog of a(·) may look problematic, as this function no longer provides a natural ordering of the models. Nonetheless, this analog a(φ) is minimized at φ = φmax, the largest model available, which also yields the best estimation rates. Hence, so long as a(φ) is the dominant term of the selection criterion, the best model selected shall be one that matches a(φmax). Lemma C.1 in Appendix C establishes that a(φ) cannot be minimal without yielding the best estimation rates as well. The intuition is that estimation rates from φmax are determined by its strongest elements. As a result, a(φ) cannot attain its minimal value if, for instance, the number of fastest estimation directions by φ does not match that of φmax. One can proceed iteratively to claim that the map of rates for the estimator from φmax is the same as that of any φ such that a(φ) = a(φmax). This general case is formally studied in Appendix C.
These points make a(φ) a compelling summary of rates of convergence as far as model selection is concerned. As a result, the information-related part of the selection criterion that we shall consider is

$\iota_\theta(\phi) = a(\phi) + \nu_T\,\mathrm{ent}_\theta(\phi).$

The sequence ν_T depends on the sample size T and shall converge to 0 as T grows to infinity, so that the rate component dominates the entropy component, as one should expect. Nevertheless, ν_T shall not converge too fast, as this would destroy the valuable information encapsulated in the entropy function. In fact, ent_θ(φ) is the component that ranks candidate models with the same rate component a(φ).
For example, recall that candidates φ that estimate the whole parameter vector θ0 ∈ R^p at rate √T are those with s1(φ) = p and δ1(φ) = 0. For them, s2(φ) = 0, and the leading term a(φ) = 0 reaches its minimum possible value. The comparison of such candidate models is solely based on their entropies.
The natural question now is about the sample evaluation of ι_θ(φ). This question is of particular importance since, for a given model φ, s_i(φ) and δ_i(φ) (i = 1,2) are unknown. Interestingly, ι_θ(φ) can be mimicked by starting off with a naive estimator of the asymptotic variance V_θ(φ):

$\hat V_\theta(\phi) = \big[\hat G_T(\phi)'\hat\Sigma(\phi)^{-1}\hat G_T(\phi)\big]^{-1}, \qquad \hat G_T(\phi) = \frac{\partial\bar\phi_T}{\partial\theta'}\big(\hat\theta_T(\phi)\big). \qquad (23)$

Recall that, as claimed by (15), under some regularity conditions,

$\hat V_\theta(\phi) = R\,\tilde\Lambda_T\,\big(V_\theta(\phi) + o_P(1)\big)\,\tilde\Lambda_T\,R', \qquad \tilde\Lambda_T = \sqrt T\,\Lambda_T^{-1} = \operatorname{diag}\big(T^{\delta_1}I_{s_1},\,T^{\delta_2}I_{s_2}\big).$

Then, taking the determinant of $\hat V_\theta(\phi)$, ent_θ(φ) can be estimated, up to terms that are immaterial for selection, by

$\widehat{\mathrm{ent}}_{\theta,T}(\phi) = \frac12\,\ln\big|\hat V_\theta(\phi)\big| = p\,a(\phi)\ln T + \mathrm{ent}_\theta(\phi) + O_P(1).$

The choice of ν_T = 1/(p ln T) arises naturally for the definition of ι_θ(φ), which then can be estimated by $\hat\iota_\theta(\phi)$ given by

$\hat\iota_\theta(\phi) = \frac{\widehat{\mathrm{ent}}_{\theta,T}(\phi)}{p\ln T} = \frac{\ln\big|\hat V_\theta(\phi)\big|}{2p\ln T}.$

The information-related part of the selection criterion can therefore effectively be considered as

$\hat I_{\theta,T}(\phi) = \frac{\ln\big|\hat V_\theta(\phi)\big|}{2p\ln T}. \qquad (24)$

The resulting family of information criteria for model selection that we label mRMSC takes the form:

$\mathrm{mRMSC}(\phi) = \hat I_{\theta,T}(\phi) + \kappa_T,$

where κ_T is the usual penalty term aiming to filter out noise without impacting consistent selection of the correct model. The choice of κ_T will be discussed in the next section. Despite the similarities, there are some key differences between mRMSC and RMSC. (a) The term appearing in the logarithm is not an estimator of the asymptotic variance of the efficient GMM estimator in general; this is the case only when estimation is done at the standard rate √T. (b) The information-related part is scaled down by the inverse of ln T. This makes the rate component useful for moment selection in situations of interest where convergence is slower. Without scaling, this information-related term in mRMSC would explode, and standard penalization components would not be as effective at excluding redundant moment restrictions, as illustrated in Section 3.
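Operationally, mRMSC modifies the RMSC computation in exactly two places: the log-determinant is scaled by $1/(2p\ln T)$, and the penalty is replaced. The sketch below (same illustrative iid, homoskedastic setting and hypothetical helpers as before, and anticipating the penalty of Section 4.3 with α = 0.1 and h(|c|,p) = 1 − p/|c|) makes the difference explicit.

```python
import numpy as np

def mrmsc(y, X, Z, alpha=0.1):
    """Modified RMSC: scaled log-determinant plus a (1/ln T)^alpha penalty.

    mRMSC = ln|V_hat| / (2 p ln T) + (1/ln T)^alpha * (1 - p/k), with V_hat
    the same naive variance estimate as in the RMSC sketch (iid, homoskedastic
    illustration); the 1/(2 p ln T) scaling turns the diverging log-determinant
    into an estimate of the rate component a(phi).
    """
    T, p = X.shape
    k = Z.shape[1]
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    theta = np.linalg.solve(X.T @ PZX, PZX.T @ y)
    sigma2_u = np.mean((y - X @ theta) ** 2)
    V_hat = sigma2_u * np.linalg.inv(X.T @ PZX / T)
    _, logdet = np.linalg.slogdet(V_hat)
    info = logdet / (2 * p * np.log(T))               # rate + scaled entropy
    penalty = (1 / np.log(T)) ** alpha * (1 - p / k)  # h(|c|, p) = 1 - p/|c|
    return info + penalty
```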

Consistency
We now show that the proposed criterion leads to consistent selection of the relevant model. We follow Andrews (1999) and Hall et al. (2007) by relying on the following notation. Let $\phi_{\max}(\cdot) \in \mathbb R^{k_{\max}}$ be the vector of all available candidate moment restrictions. Let the selection vector $c \in \mathbb R^{k_{\max}}$, with entries 0 and 1, denote the components of φmax(·) included in a particular moment condition model. Any subvector φ(·) of the set of candidates φmax(·) is identified by a unique selection vector c, with c_j = 1 if and only if φ(·) contains the jth element of φmax(·); |c| = c'c represents the number of moment restrictions in φ(·), and we write φ(·) = φmax(·,c). The set of all possible selection vectors is denoted $\mathcal C$ and defined as

$\mathcal C = \big\{c = (c_1, \ldots, c_{k_{\max}})' \in \mathbb R^{k_{\max}} : c_j \in \{0,1\} \text{ for } j = 1, \ldots, k_{\max} \text{ and } |c| \ge p\big\}.$
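In practice, $\mathcal C$ is just the collection of 0-1 vectors with at least p ones and, for moderate $k_{\max}$, can be enumerated exhaustively, as in this small sketch (ours, for illustration).

```python
from itertools import product

def selection_vectors(k_max, p):
    """Enumerate all c in {0,1}^k_max with |c| >= p (the set C)."""
    return [c for c in product((0, 1), repeat=k_max) if sum(c) >= p]

# k_max = 6 candidate moments: 63 admissible vectors for p = 1 and 57 for
# p = 2, matching the counts of the Section 3 exercise.
assert len(selection_vectors(6, 1)) == 63
assert len(selection_vectors(6, 2)) == 57
```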
For notational simplicity, the statistics of interest are now indexed by c, and so $\hat\theta_T(c)$ denotes the GMM estimator based on φ ≡ φmax(·,c), V_θ(c) its asymptotic variance, R(c) the rotation matrix in which it is expressed, and $\hat I_{\theta,T}(c)$ the estimated information component (see (24)).
We maintain the following assumption on φmax.
is continuous on Θ and satisfies the global identification condition of Assumption 1(i) over Θ; (iv) θ0 belongs to the interior of Θ, and θ ↦ φmax(Y,θ) is twice continuously differentiable almost everywhere in a neighborhood $\mathcal N_{\theta_0}$ of θ0.
(v) $\frac{\partial\rho_{\max}}{\partial\theta'}(\theta_0)$ is full column rank and, for i = 1,2, the derivative-dominance conditions of Assumption 3 hold for φmax,i. Assumption 5 is a partial collection of Assumptions 1-3, omitting Assumption 2(ii). Note that the latter is useful to establish asymptotic normality of the GMM estimator but is not crucial to obtain consistent selection of moments. The parts of Assumptions 1-3 highlighted by Assumption 5 are those useful to establish the consistency of the GMM estimator and of the Jacobian matrix of the sample mean of the estimating function.
Since all the components of φmax(·) are valid estimating functions, inference based on the whole vector φmax(·) would lead to asymptotic efficiency. However, a plurality of moment restrictions has the adverse consequence of damaging the finite-sample properties of GMM inference. Simulation evidence has been reported by Hall and Peixe (2003), showing the negative effect of redundant moment restrictions on inference. Formal analyses have also been carried out by Newey and Smith (2004), showing that larger moment condition models inflate finite-sample bias. In this regard, researchers are motivated to select from φmax(·) the minimal set of relevant moments that achieves the same asymptotic efficiency as φmax. We next introduce a formal definition of relevance that accounts for the possibility of mixed rates of convergence.
Letting c be a selection vector, we write $c = (c_1', c_2')' \in \mathbb R^{k_1} \times \mathbb R^{k_2}$, and let s1(c) be the rank of the Jacobian matrix of ρmax,1(·,c1) at θ0 and s2(c) = p − s1(c).
Definition 1. A subset of moment restrictions characterized by $c_r \in \mathcal C$ is said to be relevant if the following two properties hold:
(i) $s_1(c_r)\delta_1 + s_2(c_r)\delta_2 = s_1(\iota_{\max})\delta_1 + s_2(\iota_{\max})\delta_2$ and $V_\theta(c_r) = V_\theta(\iota_{\max})$, where ι_max denotes the selection vector with all k_max entries equal to 1 (the full set of moment restrictions).
(ii) For any decomposition $c_r = c_{r,1} + c_{r,2}$ of c_r with $c_{r,1}, c_{r,2} \in \mathcal C$, either one of the following holds:
(a) $s_1(c_{r,i})\delta_1 + s_2(c_{r,i})\delta_2 > s_1(c_r)\delta_1 + s_2(c_r)\delta_2$ for i = 1,2; or
(b) the rates coincide and $|V_\theta(c_{r,i})| > |V_\theta(c_r)|$ for i = 1,2.
This definition is of the same flavor as Definition 2 of Hall et al. (2007) while accounting explicitly for the rate of convergence. In particular, asymptotic variances can be compared only when rates of convergence are of the same magnitude. Consistent with our presentation so far, the definition implicitly assumes that the moment function E(φmax(Y_tT,θ)) partitions at most into two components with specific rates of convergence to 0, namely $T^{-\delta_1}$ and $T^{-\delta_2}$. The general case is studied in Appendix C.
Nevertheless, because of the dependence of V_θ(c) on the choice of rotation matrix R(c), the statement V_θ(ι_max) = V_θ(c_r) requires some clarification. We take R(ι_max) = (R1(ι_max) : R2(ι_max)) with the columns of R2(ι_max) spanning the null space of $\frac{\partial\rho_{\max,1}}{\partial\theta'}(\theta_0)$. Under the condition $s_1(c_r)\delta_1 + s_2(c_r)\delta_2 = s_1(\iota_{\max})\delta_1 + s_2(\iota_{\max})\delta_2$, which is actually equivalent to s1(c_r) = s1(ι_max), Lemma B.1 in Appendix B claims that the columns of R2(ι_max) also span the null space of $\frac{\partial\rho_{\max,1}}{\partial\theta'}(\theta_0, c_{r,1})$. Hence, the asymptotic distributions of $\hat\theta_T(c_r)$ and $\hat\theta_T(\iota_{\max})$ can be explored in terms of the same rotation, and their asymptotic variances shall be compared under this rotation. V_θ(ι_max) and V_θ(c_r) in Definition 1(i) are expressed in terms of that common rotation. Similar arguments can be made about the variance comparison in Definition 1(ii.b) as well.
We base the determination of c_r, the selection vector corresponding to the relevant set of moment conditions, on the mRMSC introduced by (24), with a penalization term κ_T that is a function of the sample size and of the size of the estimating function. Note that parsimony is sought relative to the number of moment restrictions and not the number of parameter estimates, which is always p. Specifically, we write

$\mathrm{mRMSC}(c) = \hat I_{\theta,T}(c) + \kappa(|c|,T),$

where $\hat I_{\theta,T}(c)$ is given by (24) with φ(·) = φmax(·,c). To estimate c_r, consider the value $\hat c_T$ of c minimizing mRMSC(c) over $\mathcal C$:

$\hat c_T = \arg\min_{c\in\mathcal C}\,\mathrm{mRMSC}(c).$

Our next assumption pertains to the set of selection vectors. Let

$\mathcal C_{\mathrm{eff}} = \big\{c \in \mathcal C : s_1(c)\delta_1 + s_2(c)\delta_2 = s_1(\iota_{\max})\delta_1 + s_2(\iota_{\max})\delta_2 \ \text{and}\ V_\theta(c) = V_\theta(\iota_{\max})\big\}$

and $\mathcal C_{\min} = \{c \in \mathcal C_{\mathrm{eff}} : |c| \le |\tilde c| \text{ for all } \tilde c \in \mathcal C_{\mathrm{eff}}\}$.
This assumption is similar to Assumption 4 of Hall et al. (2007), which we adapt to our configuration. Part (i) is an identification condition for c_r allowing for its consistent estimation. Part (ii) may look restrictive by imposing that all candidate models extracted from φmax must globally identify θ0 and must also identify θ0 locally at first order. This, indeed, need not be the case. However, it turns out that consistency of model selection within that category of models is the most relevant. As we show in Theorem 4.3, potential nonidentifying and/or rank-deficient candidate models are strongly outscored by c_r in terms of minimum mRMSC in the limit.
Parts (iv) and (v) relate the rate of accumulation of information about θ0 to the penalty term. These conditions allow the selection mechanism to favor, with probability growing large with the sample size, the less sophisticated of two models with comparable levels of information about θ0. The convergence rate $\tau_{T,c}$ is indexed by the model choice c to stress the dependence of the rate of estimation on the model under consideration.
In standard problems, the asymptotic variance is estimated at the rate τ_T = √T in the presence of cross-sectional data, whereas for weakly dependent data, this rate is slower (τ_T = √T/ℓ_T, where ℓ_T is a bandwidth parameter; see Andrews, 1991). These rates arise when the parameter itself is estimated at the rate √T, which is not the case in our setting. Proposition 4.5 derives the order of magnitude of $\hat V_\theta(c) - V_\theta(c)$ when the parameter is nearly weakly identified. Typically, τ_T = o(√T) with cross-sectional data and τ_T = o(√T/ℓ_T) for weakly dependent data. The choice of penalty terms will be discussed after the following consistency result.
Theorem 4.2. If Assumptions 5 and 6 hold, then $\hat c_T$ converges in probability to c_r as T → ∞.
An extension of this result to the case of more than two identification strengths is given by Theorem C.1 in Appendix C. For completeness, we now analyze mRMSC when $\mathcal C$ contains candidate models that violate Assumption 6(ii). This is the case when point identification fails or when the Jacobian matrix of the moment function is rank-deficient. For a candidate model c, failure of point identification implies that $\hat\theta_T(c)$ is not consistent. If ρmax(θ,c) = 0 is solved by a continuum of values around θ0, then the Jacobian matrix of the moment function is necessarily rank-deficient at θ0.
In addition, point identification may hold while the Jacobian matrix is rank-deficient at θ0. In this case, $\hat\theta_T(c)$ is consistent, but the first-order local approximation of the moment function fails to identify θ0. Dovonon and Renault (2013, 2020), Dovonon and Hall (2018), Lee and Liao (2018), Han and McCloskey (2019), and Dovonon and Atchadé (2020), among others, have studied the behavior of the GMM estimator in this condition. The expected outcome in this setting is that, overall, $\hat\theta_T(c)$ converges at a slower rate than $T^{\frac12-\delta_2}$. We shall examine rank deficiency in these two scenarios. Common to both is that s_i(c) directions of the parameter are estimated at the rate $T^{\frac12-\delta_i}$ (i = 1,2), with

$s_1(c) = \operatorname{Rank}\Big(\frac{\partial\rho_{\max,1}}{\partial\theta'}(\theta_0,c_1)\Big) \quad\text{and}\quad s_1(c) + s_2(c) = \operatorname{Rank}\Big(\frac{\partial\rho_{\max}}{\partial\theta'}(\theta_0,c)\Big) < p.$

The remaining directions are estimated at a slower rate in the latter scenario, whereas they are inconsistently estimated in the former.
Another possibility is that the moment function is solved at isolated points including θ0. In this case, we can claim that there is point identification relative to a smaller parameter set around θ0. The case of a full-rank Jacobian matrix of the moment function at θ0 then fits into Theorem 4.2, whereas a rank-deficient Jacobian matrix at θ0 fits into the second scenario discussed above. The following result extends Theorem 4.2 and shows that $\hat c_T$ is consistent for c_r even if $\mathcal C$ includes candidate models with identification issues.
(ii) For any vector r in the null space of $\frac{\partial\rho_{\max,1}}{\partial\theta'}(\theta_0,c_1)$ (in the setting of (i.a)) or in the null space of M (in the setting of (i.b)), $\big[\frac{\partial\rho_{\max,1}}{\partial\theta'}(\hat\theta_T(c),c_1)\big]r = O_P(T^{-\frac12})$.
Under Assumption 7(i.a), θ0 is consistently estimated by $\hat\theta_T(c)$, and $\frac{\partial\rho_{\max}}{\partial\theta'}(\hat\theta_T(c),c)$ converges in probability to $\frac{\partial\rho_{\max}}{\partial\theta'}(\theta_0,c)$. The rank deficiency of the latter implies that of the former in the limit. The second part of Assumption 7(i.b) is not particularly restrictive, even though under its first part, θ0 is not consistently estimable. Indeed, thanks to Lemma A.4 of Antoine and Renault (2009), $\rho_{\max}(\hat\theta_T(c),c)$ converges to 0 in probability, so that $\hat\theta_T(c)$ solves ρmax(θ,c) = 0 in the limit. Under a mere differentiability assumption, the Jacobian matrix of ρmax(θ,c) at any accumulation point θ* ∈ N, the set of solutions of this equation, is rank-deficient. Under the first part of Assumption 7(i.b), N is a continuum set, and the fact that $\hat\theta_T(c)$ lies in the closure of N in the limit implies that the Jacobian matrix at $\hat\theta_T(c)$ is rank-deficient in the limit. This provides a motivation for the second part of the assumption. Of course, if ρmax(θ,c) is linear in θ, the first and second parts of Assumption 7(i.b) are trivially redundant. Assumption 7(ii) is useful to control the remainder of the expansion of the estimated Jacobian matrix. Note that if ρmax(θ,c) is linear in θ, $\big[\frac{\partial\rho_{\max,1}}{\partial\theta'}(\hat\theta_T(c),c_1)\big]r = O_P(T^{-\frac12})$ in both (i.a) and (i.b). Assumption 7(iii) is standard, whereas Assumption 7(iv) is guaranteed by the functional central limit theorem.
Remark 4. This result shows that candidate models that fail identification as indicated by Assumption 7 are dominated by the relevant model c_r in terms of mRMSC and cannot be picked. Meanwhile, Theorem 4.2 shows that any candidate model that satisfies the identification properties in Assumption 6(ii) is also dominated by c_r. These two results show that mRMSC is consistent in a wide range of candidate model configurations. It is not hard to see that Theorem 4.3 covers models that are completely uninformative about the true parameter value θ0; this is the case if ρmax(θ,c) = 0 for all θ in the parameter set.
Remark 5. The consistency of mRMSC is established under the condition that the GMM estimator from the estimating function φmax is consistent and asymptotically normal. If this condition fails, which is the case if φmax is uninformative about θ0 or if δ2 ≥ δ1 ≥ 1/2, mRMSC is not guaranteed to behave well. Actually, simulation results in Section 5 show that it behaves very poorly, as do the RMSC and the MSE criterion of DN. Hall et al. (2008) have studied the behavior of RMSC in the condition where all candidate models are weak. They advocate a two-step procedure in which standard identification of φmax is first tested following, e.g., the approach of Stock and Yogo (2005) and, only if identification is maintained, can the researcher proceed with RMSC for relevant model selection. Following them, we shall advocate a two-step procedure as well. Antoine and Renault (2020) have recently proposed a test to investigate whether a moment restriction is strong enough to warrant consistent and asymptotically normal GMM estimation. We recommend first applying this test to φmax and applying mRMSC for model selection only when there is indication of consistency and asymptotic normality.
The next result addresses the efficiency of inference post-selection. For this, let us consider the partition $\mathcal C = \mathcal C_0 \cup \mathcal C_1$, where $\mathcal C_0$ and $\mathcal C_1$ are the subsets of elements $c \in \mathcal C$ such that $\operatorname{Rank}(\frac{\partial\rho_{\max}}{\partial\theta'}(\theta_0,c)) = p$ and $\operatorname{Rank}(\frac{\partial\rho_{\max}}{\partial\theta'}(\theta_0,c)) < p$, respectively. We have the following result.

Choice of Penalty Function and Robustness
The conditions in Assumption 6(iii) and (iv) are particularly crucial for the consistency of the model selection procedure and provide some guidelines for the choice of penalty function. It appears important to know the rate of convergence of the estimator of the asymptotic variance used, and then to select the penalty function κ(·,T) in such a way that Assumption 6(iv) holds. The following proposition gives the rate of convergence of the asymptotic variance estimator $\hat V_\theta(\phi)$ given by (23) for a candidate model φ. We consider the case where cross-sectional independent and identically distributed data are involved and the case of weakly dependent time series data.
In the case of cross-sectional data, the estimator of the long-run variance is the sample variance given by

$\hat\Sigma(\phi) = \frac1T\sum_{t=1}^{T}\big(\phi(Y_{tT},\hat\theta_T) - \bar\phi_T(\hat\theta_T)\big)\big(\phi(Y_{tT},\hat\theta_T) - \bar\phi_T(\hat\theta_T)\big)',$

whereas in the case of time series data, one shall rely on $\hat\Sigma_{\mathrm{hac}}(\phi)$, any heteroskedasticity and autocorrelation consistent estimator of the long-run variance (see, e.g., Andrews, 1991). We let ℓ_T denote the kernel bandwidth of this estimator; vec(·) denotes the standard matrix vectorization operator.
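Both long-run variance estimators are standard. For concreteness, the sketch below implements the sample-variance estimator and a Bartlett-kernel (Newey-West) HAC estimator with user-supplied bandwidth ℓ_T; the Bartlett kernel is one common choice, not necessarily the one used in the paper's experiments.

```python
import numpy as np

def lrv_iid(m):
    """Sample variance of the moment contributions m (T, k): iid case."""
    mc = m - m.mean(axis=0)
    return mc.T @ mc / m.shape[0]

def lrv_hac_bartlett(m, bandwidth):
    """Newey-West (Bartlett kernel) long-run variance, bandwidth ell_T."""
    T = m.shape[0]
    mc = m - m.mean(axis=0)
    S = mc.T @ mc / T
    for j in range(1, int(bandwidth) + 1):
        w = 1.0 - j / (bandwidth + 1.0)   # Bartlett weights
        G = mc[j:].T @ mc[:-j] / T        # j-th sample autocovariance
        S += w * (G + G.T)
    return S
```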
We have the following result.

Proposition 4.5. (i) If {Y_tT : t = 1, ..., T} is an independent and identically distributed process, then $\hat V_\theta(\phi) - V_\theta(\phi) = O_P(1/\tau_T)$ with τ_T = o(√T), where τ_T depends on δ1 and δ2. (ii) If {Y_tT : t = 1, ..., T} is a weakly dependent time series process, δ2 < 1/6, ℓ_T ∼ T^a with $a \in (2\delta_2, \frac12 - \delta_2)$ such that condition (ii) of Proposition A.1 in Appendix A is satisfied, and, in addition, all the conditions of that proposition hold with δ = δ2, then $\hat V_\theta(\phi) - V_\theta(\phi) = O_P(1/\tau_T)$ with τ_T = o(√T/ℓ_T).

This proposition shows that the rate of convergence of $\hat V_\theta(\phi)$ depends on the identification strength of the model under consideration. In the case of cross-sectional data, the requirement in Assumption 6(v) translates into

$\tau_T\,\big[\kappa(|c|,T) - \kappa(|\tilde c|,T)\big] \to \infty$

as T → ∞ for $\tilde c, c \in \mathcal C$ such that $|c| > |\tilde c|$. Since δ1, δ2 can take any arbitrary value in [0, 1/2), the commonly used penalty functions, such as the BIC-type information criterion (κ(|c|,T) = (|c| − p) ln√T/√T) and the Hannan-Quinn type of criterion (κ(|c|,T) = (|c| − p) b ln(ln√T)/√T, b > 2), would not fulfill this requirement, since we can always find some values of δ1 and δ2 in [0, 1/2) that make these criteria violate the condition.
A natural choice of penalty function to consider is

$\kappa(|c|,T) = \Big(\frac{1}{\ln\tau_T}\Big)^{\alpha}\,h(|c|,p), \qquad \alpha > 0,$

with h(|c|,p) a nonnegative and strictly increasing function of |c| for all values of p.
Thanks to (24), the mRMSC is given by

$\mathrm{mRMSC}(c) = \frac{\ln\big|\hat V_\theta(c)\big|}{2p\ln T} + \Big(\frac{1}{\ln\tau_T}\Big)^{\alpha}\,h(|c|,p).$

Obviously, since ℓ_T is the same across the models under assessment in the selection procedure, we can simply write

$\kappa(|c|,T) = \Big(\frac{1}{\ln T}\Big)^{\alpha}\,h(|c|,p) \qquad (25)$

for some α > 0, with h(|c|,p) as introduced above. It is not hard to see that such a penalty function satisfies the requirements in Assumption 6(iv) regardless of the values of δ1 and δ2 and therefore leads to consistent selection of the best model. This penalty function also works when the data are time series, as can be seen from the order of magnitude derived in Proposition 4.5(ii) for the asymptotic variance estimator.
While the best choice of $\alpha$ in (25) is of independent interest that we shall pursue in future work, it is worth mentioning that $\alpha > 0$ is important as a condition ensuring that the penalty function is smaller than the information component. Moreover, note that the higher $\alpha$, the less "bad" models are penalized. Since the mixed identification framework is one where signals are by definition weak, it is all the more important to exercise a higher penalty on "bad" models to obtain consistent selection. In the simulation results reported in the next section, we set $\alpha = 0.1$ and use $h(|c|,p) = 1 - p/|c|$.
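To illustrate this choice, the snippet below (a sketch, not taken from the paper) evaluates $h(|c|,p) = 1 - p/|c|$: it is nonnegative, equals zero at the just-identified size $|c| = p$, and is strictly increasing in $|c|$, as required above. The full penalty in (25), which also involves $\alpha$, is not reproduced here.

```python
def h(c_size, p):
    """Size component of the penalty: h(|c|, p) = 1 - p / |c|.

    Nonnegative, equal to zero at the just-identified size |c| = p,
    and strictly increasing in the number of moment conditions |c|.
    """
    if c_size < p:
        raise ValueError("need at least p moment conditions")
    return 1.0 - p / c_size

# For p = 1, larger candidate models receive a larger penalty weight:
print([round(h(k, 1), 3) for k in range(1, 6)])
# [0.0, 0.5, 0.667, 0.75, 0.8]
```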

SIMULATION RESULTS
In this section, we study the finite-sample performance of the proposed selection criterion (mRMSC) and the post-selection properties of the GMM estimator (bias, MSE, and coverage rate of confidence sets) through a Monte Carlo experiment.
For this purpose, we use the same simulation setup as in Section 3 but increase the set of candidate instruments to 12 (i.e., $z_1, z_2, \ldots, z_{12}$). For clarity, the analysis of the performance of the selection criterion (mRMSC) is separated from that of the post-selection properties of the GMM estimator.
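For concreteness, the sketch below simulates a DGP in the spirit of this design. It is a hypothetical stand-in, not the exact Section 3 setup, and the constants ($\theta_0 = 1$, endogeneity $\rho = 0.5$) are illustrative assumptions. Its key feature is that the first-stage coefficients of $z_1$ and $z_2$ shrink at rates $T^{-\delta_1}$ and $T^{-\delta_2}$ while the remaining ten instruments are irrelevant.

```python
import numpy as np

def simulate_mixed_strength_iv(T, delta1, delta2, rho=0.5, seed=None):
    """Hypothetical linear IV DGP with mixed identification strength.

    z1 and z2 are the relevant instruments, with first-stage
    coefficients shrinking at rates T**(-delta1) and T**(-delta2);
    z3, ..., z12 are irrelevant. Illustrative only, not the paper's
    exact Section 3 design.
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((T, 12))
    # correlated structural/first-stage errors induce endogeneity
    errs = rng.multivariate_normal([0.0, 0.0],
                                   [[1.0, rho], [rho, 1.0]], size=T)
    u, v = errs[:, 0], errs[:, 1]
    pi = np.zeros(12)
    pi[0] = T ** (-delta1)  # strength of z1
    pi[1] = T ** (-delta2)  # strength of z2
    x = Z @ pi + v          # endogenous regressor (p = 1 case)
    y = 1.0 * x + u         # true parameter theta0 = 1
    return y, x, Z

y, x, Z = simulate_mixed_strength_iv(T=1000, delta1=0.1, delta2=0.3, seed=0)
```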

Performance of the mRMSC
In this section, we compare the finite-sample performance of the proposed mRMSC with existing methods in the literature, namely, the RMSC of Hall et al. (2007) and the MSE-based criterion of DN. Since entropy-based selection criteria (mRMSC and RMSC) are conceptually different from MSE-type selection criteria, the inclusion of the DN criterion is useful to determine which type of criterion performs best, at least from the finite-sample perspective. Figure 2 contains the results for both the model with one endogenous regressor ($p = 1$; Figure 2a) and the model with two endogenous regressors ($p = 2$; Figure 2b). Each subfigure shows, for a combination of identification strengths (i.e., the values of $\delta_i$, $i = 1,2$), the plots of the proportion of best model selection (hit rate) by sample size. Specifically, the first two rows in Figure 2a,b report the results where the two instruments $z_1$ and $z_2$ have different identification strengths ($\delta_1 < \delta_2$), whereas the last row contains the plots of the hit rates where both instruments $z_1$ and $z_2$ have equal identification strength (i.e., $\delta_1 = \delta_2$). As part of this, we include the case $\delta_1 = \delta_2 = 0.5$ to assess selection performance when the sample moment of the estimating function does not accumulate sufficient information to allow for consistent point estimation as the sample size grows (see, e.g., Staiger and Stock, 1997). Three main results stand out from this exercise.

First, with the exception of the Staiger and Stock (1997) weak identification setup ($\delta_1 = \delta_2 = 0.5$), the hit rate of mRMSC increases to 1 as the sample size grows, except in a very few cases where its convergence is, as expected, very slow. This confirms the consistency result for mRMSC established by Theorems 4.2 and 4.3. While mRMSC displays evidence of consistency throughout, there are many instances where the hit rate of RMSC and DN drops to 0 or plateaus well below 1, highlighting the limitation of these criteria in consistently selecting the correct model when operating on models with poor identification strength. Looking specifically at the DN criterion, the hit rate is almost flat at 0 in all cases considered, including when standard strong identification holds ($\delta_1 = \delta_2 = 0$). This is because DN is not based on maximizing the entropy of the GMM estimator's asymptotic distribution and does not penalize larger models either. As a result, the DN criterion always tends to select models that include irrelevant instruments (see Tables 1 and 2). Regarding RMSC, the cases where the hit rate drops to 0 or plateaus well below 1 are seen clearly in the subfigures "$\delta_1 = 0.3$, $\delta_2 = 0.4$" and "$\delta_1 = \delta_2 = 0.3$" of Figure 2a. The lackluster performance of RMSC and DN is more pronounced in models with two endogenous regressors (Figure 2b). In this case, RMSC seems to be consistent only when the model is strongly identified (i.e., "$\delta_1 = \delta_2 = 0$") or close to being so (e.g., "$\delta_1 = 0$, $\delta_2 = 0.2$" or "$\delta_1 = 0.1$, $\delta_2 = 0.2$"), whereas DN almost never selects the correct model even when identification is strong. Clearly, DN is the least-performing criterion in selecting the correct model. In both Figure 2a (model with one endogenous regressor) and Figure 2b (model with two endogenous regressors), the hit rate of all selection criteria is almost flat at 0 when $\delta_1 = \delta_2 = 0.5$, the weak identification framework.
The detailed empirical selection rates for the model with one endogenous regressor ($p = 1$) are presented in Table 1, and those for the model with two endogenous regressors ($p = 2$) are presented in Table 2. More specifically, each table indicates, for each sample size, the empirical selection rates of all possible models for a given criterion (RMSC, DN, or mRMSC) and given values of $\delta_i$ ($i = 1,2$). Considering first the case with one endogenous regressor (Table 1), we see that when $\delta_1 < \delta_2$ (first part of the table for each sample size), mRMSC outperforms RMSC even for relatively small sample sizes, and the latter dominates DN, which almost never selects the correct model. For example, when T = 100 and "$\delta_1 = 0.1 < \delta_2 = 0.3$," RMSC selects the relevant model (i.e., column "z1" in Table 1 for T = 100) only 68% of the time, whereas mRMSC selects this model 91% of the time. As the sample size increases to T = 50,000, these empirical selection probabilities rise to 100% for both RMSC and mRMSC. The empirical selection probabilities of DN remain flat at 0 for all T.
Furthermore, looking at column "z1" in Table 1 for $\delta_1 < \delta_2$ (first part of the table), the dominance of mRMSC is even more pronounced when "$\delta_1 = 0.3 < \delta_2 = 0.4$" (i.e., when identification strength deteriorates), regardless of the sample size, and it becomes even more visible as the sample size increases. For example, when "$\delta_1 = 0.3 < \delta_2 = 0.4$," RMSC selects the relevant model only 9% of the time when T = 100, whereas mRMSC selects this model 17% of the time. As the sample size increases to T = 50,000, the empirical selection probability for RMSC decreases drastically to 2%, whereas that of mRMSC rises to 99%. Clearly, as identification weakens, RMSC has a tendency to select less relevant and less sparse models in small samples (see the selection probabilities in columns "All I," "z1+z2," "z1+I," "z2+I," and "zj+more" in Table 1 for T = 100) or less sparse models containing at least one of the two instruments $z_1$ and $z_2$ (see, e.g., the selection probabilities in column "z1+z2" in Table 1 for T = 1,000, 5,000, 50,000). Meanwhile, mRMSC maintains an overall good performance in selecting the more relevant model. When $\delta_1 = \delta_2$ (second part of Table 1 for each sample size), both RMSC and mRMSC perform relatively well in selecting the correct model (i.e., column "z1+z2" of the table) even with moderate identification strength ($\delta_1 = \delta_2 \le 0.3$): RMSC performs slightly better for sample sizes T = 100 and 1,000, whereas this dominance is reversed for larger sample sizes (T = 5,000 and 50,000). As identification deteriorates (see the selection probabilities in columns "$\delta_1 = \delta_2 = 0.3$" in Table 1), mRMSC improves substantially over RMSC when T = 5,000 and 50,000. However, all criteria become weaker at selecting the correct model when identification is weak (see columns "$\delta_1 = \delta_2 = 0.5$" in Table 1).
We now consider the case with two endogenous regressors (Table 2), where the most relevant model is in column "z1+z2." We see that for both "$\delta_1 < \delta_2$" and "$\delta_1 = \delta_2$," mRMSC outperforms RMSC for most combinations of identification strength $\delta_i$ ($i = 1,2$), especially as the sample size increases (T = 1,000, 5,000, 50,000). Again, RMSC shows a tendency to select less relevant models when identification deteriorates (i.e., for high values of $\delta_i$, $i = 1,2$). In addition, the empirical selection probabilities of the relevant model increase with the sample size for mRMSC for all combinations of identification strength considered, whereas those of RMSC often decrease as the sample size increases for high values of $\delta_i$ ($i = 1,2$). This explains why the aggregate hit rate of RMSC decreases as the sample size increases for high values of $\delta_i$, as shown in Figure 2. Remarkably, DN never selects the correct model even when identification is strong ($\delta_1 = \delta_2 = 0$).

Performance of Post-Selection Inference
We now investigate the bias, MSE, and coverage rate of confidence sets of the GMM estimator post selection. In addition to the criteria under consideration, namely, mRMSC, RMSC, and DN, we also analyze the performance of the (naive) GMM estimator that uses all available instruments. This naive GMM model corresponds to the estimating function $\varphi_{\max}$ in Section 4.2. Its inclusion allows us to illustrate the importance of moment selection in GMM models with relatively poor identification strength, even when the set of available moment conditions is not large.
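Schematically, this part of the experiment can be organized as in the sketch below, where `simulate`, `select_model`, and `gmm_estimate` are hypothetical callables (the DGP, a selection criterion returning a tuple of instrument indices, and the two-step GMM estimator with its standard error); the naive estimator corresponds to `select_model` returning all twelve indices. This is a sketch under those assumptions, not the authors' code.

```python
import numpy as np

def post_selection_experiment(simulate, select_model, gmm_estimate,
                              theta0, n_rep=5000):
    """Monte Carlo sketch of post-selection performance (p = 1 case).

    For each replication: simulate data, select a set of instruments,
    estimate theta by two-step GMM on the selected set, and record the
    estimate and whether the 95% t-interval covers theta0.
    """
    estimates = np.empty(n_rep)
    covered = np.zeros(n_rep, dtype=bool)
    for r in range(n_rep):
        y, x, Z = simulate(seed=r)
        c = select_model(y, x, Z)            # e.g., argmin of mRMSC
        theta_hat, se = gmm_estimate(y, x, Z[:, list(c)])
        estimates[r] = theta_hat
        covered[r] = abs(theta_hat - theta0) <= 1.96 * se
    bias = estimates.mean() - theta0
    mse = ((estimates - theta0) ** 2).mean()
    return bias, mse, covered.mean()
```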
For clarity, let us focus first on the bias and MSE of the post-selection GMM estimator $\hat\theta$. We consider both models with one endogenous regressor ($p = 1$) and two endogenous regressors ($p = 2$). In the latter case, $\hat\theta = (\hat\theta_1,\hat\theta_2)$ has two components, so we show the bias and MSE results for both components. Figures 3 and 4 report the results, where the bias and MSE of the post-selection estimators $\hat\theta$ (for $p = 1$) and $\hat\theta_1,\hat\theta_2$ (for $p = 2$) are plotted against the sample size for the various selection criteria, with T = 100, 200, 500, 1,000, 5,000, 10,000, 20,000, 50,000, 100,000 (number of replications: 5,000). Looking first at the plots of the bias in Figure 3, we see similar patterns in the behavior of the GMM estimator for both $p = 1$ (Figure 3a) and $p = 2$ (Figure 3b,c). Considering the case where $p = 1$ (Figure 3a), with the exception of the Staiger and Stock (1997) weak identification setup ("$\delta_1 = \delta_2 = 0.5$"), the post-selection GMM estimator with DN has the highest bias for most combinations of the identification strengths $\delta_i$ ($i = 1,2$). Both mRMSC and RMSC outperform the naive GMM estimator except when identification is weak ("$\delta_1 = \delta_2 = 0.5$"). The dominance of the naive estimator under weak identification is in line with the widely documented weak IV literature (see, e.g., Chao and Swanson, 2005; Andrews and Stock, 2007). For all the combinations of identification strength $\delta_1 = \delta_2 = 0.3$ and $\delta_1 < \delta_2 \le 0.4$ shown in Figure 3a, mRMSC dominates or performs as well as RMSC. The dominance of mRMSC is especially pronounced when $\delta_1 = 0.3$, $\delta_2 = 0.4$, where the bias of the post-selection GMM estimator $\hat\theta$ with mRMSC vanishes as the sample size increases, whereas the bias under RMSC plateaus far from zero. Second, for the models with $p = 2$ (Figure 3b,c), the bias of the GMM estimator $\hat\theta_1$ (the most strongly identified component of θ) across the various selection criteria is quite similar to the results for $p = 1$ depicted in Figure 3a, with the exception of subfigure "$\delta_1 = \delta_2 = 0.3$," where mRMSC's dominance is even clearer. Looking at the bias of $\hat\theta_2$, the least identified component of θ (Figure 3c), we observe that in most cases the bias of the post-selection GMM estimator $\hat\theta_2$ with mRMSC is smaller than those resulting from both RMSC and DN. While the naive GMM estimator appears competitive at mitigating the bias of $\hat\theta_2$ when $\delta_1 = \delta_2 = 0.5$, it does not perform as well as mRMSC when $0.1 < \delta_1 < \delta_2 \le 0.4$. Interestingly, the post-selection GMM estimator with mRMSC performs as well as the naive GMM estimator as the sample size increases in the rare cases where the latter is competitive.
Let us now focus on the MSE results in Figure 4. Considering the model with $p = 1$ (Figure 4a), we see that all selection criteria (mRMSC, RMSC, and DN) perform quite similarly. In particular, when $\delta_1 < \delta_2 \le 0.4$ or $\delta_1 = \delta_2 < 0.5$, mRMSC and RMSC have a slight edge over DN when the sample size is small. However, this advantage of mRMSC and RMSC disappears as the sample size increases, with all criteria reaching a similar level of performance. When $\delta_1 = \delta_2 = 0.5$ (i.e., under weak identification), all the models perform poorly, with the naive GMM displaying the smallest MSE, followed by the post-selection GMM estimators with DN, mRMSC, and RMSC, respectively. Considering now the model with $p = 2$ (Figure 4b,c), we observe again that the MSE results for $\hat\theta_1$ are quite similar to those in Figure 4a for $p = 1$. However, the MSE of the estimator $\hat\theta_2$ (the estimator of the most weakly identified component) paints a different picture. Indeed, there are many instances in Figure 4c where the MSEs of the post-selection GMM estimators with mRMSC and RMSC are smaller than that with DN, especially in small samples. As identification deteriorates (see subfigures "$\delta_1 = 0$, $\delta_2 = 0.4$" and "$\delta_1 = 0.3$, $\delta_2 = 0.4$"), post-selection GMM-RMSC is dominated by GMM-mRMSC, which also matches both the naive GMM and GMM-DN as the sample size increases.
Aside from the bias and MSE, an important property of post-selection inference is whether a given selection criterion leads to confidence sets with correct coverage post selection. It is well known that standard selection methods based on information criteria, such as the Akaike information criterion (AIC) and BIC, do not enjoy this property (see, e.g., Kabaila and Leeb, 2006). To investigate this further, we consider both the models with $p = 1$ and $p = 2$ and explore the coverage rate of Wald-type confidence intervals based on the post-selection or naive GMM estimator $\hat\theta$. As θ is a scalar when $p = 1$, we consider t-type confidence intervals based on the post-selection or naive GMM estimator in that case. For $p = 2$, $\hat\theta = (\hat\theta_1,\hat\theta_2) \in \mathbb{R}^2$, so we consider Wald-type joint confidence sets based on the post-selection or naive GMM estimator $\hat\theta$. As in the previous sections, we report the results across the various selection criteria (mRMSC, RMSC, and DN), along with the naive GMM estimator that utilizes all 12 available instruments. Figure 5 shows the results, where the coverage rates are plotted against the sample size. The nominal confidence level is set to 95%, but the results are not sensitive to alternative choices of this level.
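For reference, writing $\hat V_\theta$ for the estimated asymptotic variance of the post-selection (or naive) GMM estimator, the confidence sets whose coverage is assessed take the standard t and Wald forms below; this is the usual construction, and the paper's exact studentization is not reproduced here.

```latex
% p = 1: t-type interval;  p = 2: Wald-type joint confidence set
\mathrm{CI}_{1-\alpha}
  = \Bigl[\hat\theta \pm z_{1-\alpha/2}\,\sqrt{\hat V_\theta / T}\Bigr],
\qquad
\mathrm{CS}_{1-\alpha}
  = \Bigl\{\theta \in \mathbb{R}^{2} :
      T(\hat\theta-\theta)'\,\hat V_\theta^{-1}(\hat\theta-\theta)
      \le \chi^{2}_{2,\,1-\alpha}\Bigr\}
```

with $\alpha = 0.05$ giving the nominal 95% level used in Figure 5.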
Considering first the model with $p = 1$ (Figure 5a), two main observations stand out. First, when $\delta_1 < \delta_2$, the post-selection GMM-mRMSC outperforms or performs as well as the naive GMM, GMM-RMSC, and GMM-DN. As identification deteriorates, the dominance of post-selection GMM-mRMSC becomes visible (see, e.g., subfigure "$\delta_1 = 0.3$, $\delta_2 = 0.4$" in Figure 5a). In the latter case, the coverage rate of confidence intervals from post-selection GMM-mRMSC approaches 95% as the sample size increases, whereas naive GMM, GMM-RMSC, and GMM-DN have coverage rates that plateau slightly above 85% even for a sample size as large as T = 100,000. Second, when $\delta_1 = \delta_2$, with the exception of the weak identification case ($\delta_1 = \delta_2 = 0.5$), GMM-mRMSC performs as well as GMM-RMSC, and both dominate GMM-DN and the naive GMM, with greater magnitude for small sample sizes (see, e.g., subfigure "$\delta_1 = \delta_2 = 0.3$" in Figure 5a). Moving to models with two endogenous variables, $p = 2$ (Figure 5b), the dominance of post-selection GMM-mRMSC is even more noticeable. With the exception of the weak identification case ($\delta_1 = \delta_2 = 0.5$), the edge of GMM-mRMSC over GMM-RMSC, GMM-DN, and the naive GMM is remarkable. For example, when $\delta_1 = 0.3$, $\delta_2 = 0.4$, the coverage rate of the joint confidence sets with GMM-RMSC, GMM-DN, and the naive GMM plateaus below 80% as the sample size increases, whereas the coverage rate of the joint confidence sets with GMM-mRMSC continues to increase with the sample size. In many instances in Figure 5b, GMM-RMSC outperforms both the naive GMM and GMM-DN, although it is not as competitive as GMM-mRMSC; such results are seen in all subfigures with $\delta_1 = 0.3$. Overall, this simulation exercise illustrates the good performance of our post-selection GMM-mRMSC compared with post-selection GMM-RMSC and GMM-DN, as well as the naive GMM.
Note that while the DGP in Sections 5.1 and 5.2 imposes $z_1$ to be uncorrelated with $z_2$, we report in the Supplementary Material experiment results for correlated relevant instruments. These results are qualitatively the same as those reported here (see Table S1.2 for correlated instruments and Table S1.3 for uncorrelated instruments in the Supplementary Material).

CONCLUSION
In this paper, we study model selection in moment condition models with mixed identification strength that allow for consistent and asymptotically normal parameter estimation. Our investigation reveals that standard model selection procedures, such as the relevant moment selection criterion of Hall et al. (2007), are inconsistent in this setting, as they do not explicitly account for the rate of convergence of parameter estimation of the candidate models, which may vary. We introduce a new entropy-based relevant moment selection criterion, the mRMSC. Similar to RMSC, mRMSC is evaluated using the two-step GMM estimator, which, up to a linear reparameterization, is known to be efficient in this framework as well (see Dovonon et al., 2022). When the parameter is multivariate, the asymptotic distribution of this estimator is, in general, characterized by directions of fast convergence and directions of slow convergence. The best or relevant model is the smallest model (in terms of number of moment restrictions) that delivers the same rate of convergence and the same asymptotic variance as those obtained when all the moment restrictions are used.
By construction, mRMSC first rewards the rate of estimation and then, for models with the same rate, rewards (negative) entropy. In addition, suitable penalty terms are introduced that guarantee the consistency of the selection procedure. Conditions under which mRMSC leads to consistent selection of the best model are outlined, and we show that this new selection procedure is robust to the presence of uninformative and weak models.
We illustrate the finite-sample performance of the proposed method through Monte Carlo simulations. In addition to mRMSC and RMSC, we also consider the MSE criterion of DN, along with the moment condition model including all the available instruments serving as a benchmark. In almost all the Monte Carlo designs considered, mRMSC dominates the other selection criteria in terms of hit rate. The post-selection performance is also investigated, revealing that mRMSC-selected models produce confidence intervals with the best coverage probability in most of the Monte Carlo designs. These models are also among those with the smallest bias and MSE. Nevertheless, when all the available moment restrictions are weak, mRMSC performs quite poorly, as do the other selection methods. In this case, the model with all instruments performs marginally better than the selected ones, although unreliably.

APPENDIX A. Convergence Rate of HAC Using Slow Estimators
As we have seen, under mixed-strength identifying moment restrictions, the resulting parameter estimator has a slow rate of convergence: $\hat\theta_T - \theta_0 = O_P(T^{-(1/2-\delta)})$. Standard theories for HAC estimators of the long-run variance apply to $\sqrt{T}$-consistent parameter estimators. The next proposition gives the rate of convergence of HAC estimators of the long-run variance $\Omega$ of $\varphi(Y_{tT},\theta_0)$,

$$\Omega := \lim_{T\to\infty}\operatorname{Var}\!\Bigl(\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\varphi(Y_{tT},\theta_0)\Bigr),$$

with the variance taken under $P_T$, when estimators of $\theta_0$ based on the moment condition $E(\varphi(Y_{tT},\theta_0)) = 0$ are available and the components of φ have mixed identification strength for $\theta_0$. We know in this case that standard estimators $\hat\theta_T$ are such that $T^{1/2-\delta}(\hat\theta_T - \theta_0) = O_P(1)$. Let $\hat\Omega_{hac}$ be the HAC estimator of $\Omega$ using the kernel function $k(x)$ and bandwidth parameter $\ell_T$ (see Andrews (1991) for more explicit definitions). We shall assume that $k(\cdot)$ belongs to the class $\mathcal{K}_1$, i.e., $k(\cdot)$ is symmetric, continuous at 0 and at all but a finite number of other points, square integrable, and takes values in $[-1,1]$, with $k(0) = 1$.
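As an illustration of the objects involved, here is a minimal sketch of a Bartlett-kernel HAC estimator with bandwidth $\ell_T = T^a$; the Bartlett kernel $k(x) = (1-|x|)\,\mathbb{1}\{|x|\le 1\}$ belongs to $\mathcal{K}_1$. This is the generic Newey-West-type construction, not an implementation specific to the paper.

```python
import numpy as np

def hac_long_run_variance(phi, a=0.25):
    """Bartlett-kernel HAC estimator of the long-run variance.

    phi: (T, k) array of moment functions evaluated at an estimator of
    theta_0. The Bartlett kernel k(x) = (1 - |x|) 1{|x| <= 1} belongs
    to the class K1; the bandwidth is ell_T = T**a.
    """
    T = phi.shape[0]
    phi = phi - phi.mean(axis=0)      # center at the sample mean
    ell = max(1.0, T ** a)            # bandwidth ell_T ~ T^a
    omega = phi.T @ phi / T           # lag-0 autocovariance
    for j in range(1, int(np.ceil(ell))):
        w = max(0.0, 1.0 - j / ell)   # Bartlett weight
        gamma_j = phi[j:].T @ phi[:-j] / T
        omega += w * (gamma_j + gamma_j.T)
    return omega
```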
(ii) In the case $s_1 = p$, we use the fact that $T^{\delta_1-1}X'Z = C_{10} + o_P(1)$, and (ii) then follows from Assumption 4(iii).
(iii) $\hat\sigma^2_u$ converges in probability to $\sigma^2_u$ by the law of large numbers. For the case $0 < s_1 < p$, using (B.2), we obtain the expected result. For the case $s_1 = p$, the expected result follows from (B.4).
Proof of Theorem 4.2. Analogously to previous notation, let $\hat V_\theta(c)$ denote the variance estimator in (23) applied to the candidate model c, where φ in this definition includes only the components of $\varphi_{\max}$ selected by c. Under Assumptions 5 and 6(ii), $\hat\theta_T(c) - \theta_0 = O_P(T^{-1/2+\delta_2})$ under $P_T$. Thanks to Lemma A.5 of Antoine and Renault (2009), we can claim that $\sqrt{T}\,\frac{\partial\bar\varphi_T}{\partial\theta'}(\hat\theta_T(c))\,R(c)\,\Lambda_T(c)^{-1}$ converges in probability to $J(c)$ and, as a result, $\hat V_\theta(c)$ converges in probability to $(J(c)'\Omega(c)^{-1}J(c))^{-1}$.
Note that
Case (2): $s_1(c) = s_1(c_r)$. Lemma B.1 ensures that $V_\theta(c)$, $V_\theta(c_r)$, and $V_\theta(\iota_{\max})$ can be expressed in terms of the same rotation matrix $R(\iota_{\max})$. By definition, $V_\theta(c_r) = V_\theta(\iota_{\max})$ and, considering $V_\theta(c)$ as expressed in terms of $R(\iota_{\max})$ as well, standard results of GMM theory ensure that either $V_\theta(c) = V_\theta(c_r)$ or $V_\theta(c) - V_\theta(c_r)$ is positive semidefinite. We further consider these two cases.
Thanks to Assumption 6(iv), this quantity tends to $+\infty$ with probability approaching 1 as T grows, and we can deduce that $\Delta_T(c,c_r)$ is positive with probability approaching 1 as T grows. This means that $c_r$ is eventually selected over c.
Proof of Proposition 4.4. The first conclusion is a mere consequence of Theorems 4.2 and 4.3. Since $P_T(\hat c_T = c_r) \to 1$ as $T \to \infty$, the second part follows directly from Lemma 1 of Pötscher (1991).

APPENDIX C. Mixed Identification Strength of Arbitrary Number of Levels
This section establishes the consistency of mRMSC in the more general framework of moment condition models with mixed identification strength with an arbitrary number of levels. We say that a moment condition model represented by the $\mathbb{R}^k$-valued estimating function $\varphi(\cdot)$ has mixed identification strength if, for some $l \in \mathbb{N}$,

$$\varphi \equiv (\varphi_1,\ldots,\varphi_l) \in \mathbb{R}^{k_1}\times\cdots\times\mathbb{R}^{k_l},$$

with the moment functions $\varphi_j$ local to zero at rate $T^{-\delta_j}$, $j = 1,\ldots,l$, and $0 \le \delta_1 \le \cdots \le \delta_l < 1/2$. (C.1)

The case studied in the main body of the paper corresponds to $l = 2$. We assume that the set of available moment restrictions to be selected from is collected in the $\mathbb{R}^{k_{\max}}$-valued estimating function $\varphi_{\max}$ satisfying the following assumption.
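As a hypothetical illustration of (C.1) (our example, not the paper's), a three-level model ($l = 3$) could take the form

```latex
\varphi(Y_{tT},\theta)
  = \begin{pmatrix}
      \varphi_1(Y_{tT},\theta) \\
      \varphi_2(Y_{tT},\theta) \\
      \varphi_3(Y_{tT},\theta)
    \end{pmatrix}
  \in \mathbb{R}^{k_1}\times\mathbb{R}^{k_2}\times\mathbb{R}^{k_3},
\qquad
\delta_1 = 0 < \delta_2 = 0.2 < \delta_3 = 0.4
```

so that $\varphi_1$ identifies its directions of θ at the standard $\sqrt{T}$ rate, while $\varphi_2$ and $\varphi_3$ are local to zero and yield the slower rates $T^{1/2-0.2}$ and $T^{1/2-0.4}$, respectively.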
We base the estimation of $c_r$, the selection vector corresponding to the relevant set of moment conditions, on the mRMSC introduced in (24) with a penalization term $\kappa_T = \kappa(|c|,T)$, where $\hat I_{\theta,T}(c)$ is given by (24) with $\varphi(\cdot) = \varphi_{\max}(\cdot,c)$. As in Section 4.2, the relevant model $c_r$ is estimated by $\hat c_T$ defined by

$$\hat c_T = \arg\min_{c\in C} \mathrm{mRMSC}(c).$$
Note that Assumption C.2 is the same as Assumption 6; we only replace in the latter the definition of the relevant model by the more general concept above, and $\varphi_{\max}$ and $\rho_{\max}$ are, respectively, replaced by their more general versions in Assumption C.1. Similarly to Theorem 4.2, we can now claim the following result.
Theorem C.1. If Assumptions C.1 and C.2 hold, then $\hat c_T$ converges in probability to $c_r$ as $T \to \infty$.
This result shows that the optimal model with respect to mRMSC converges to the relevant model $c_r$ as the sample size grows when selection is made among candidate models satisfying the point identification and first-order local identification properties. This is an extension of Theorem 4.2 to the case where $l \ge 2$ and, in a similar way, is at the core of the consistency of mRMSC over the whole set of candidate models. Indeed, as shown by Theorem 4.3, we can also show in this context that, with probability approaching 1, $c_r$ outperforms candidate models that fail the point identification or first-order local identification properties. We can also claim that the selection procedure yields a model that is efficient, since we can establish an analog of Proposition 4.4 in the current configuration. To save space, we do not provide a formal exposition of the analogs of Theorem 4.3 and Proposition 4.4.
Therefore, $\Delta_T(c,c_r)$ is positive with probability approaching 1 as T grows.