UNIFORM-IN-SUBMODEL BOUNDS FOR LINEAR REGRESSION IN A MODEL-FREE FRAMEWORK

For the last two decades, high-dimensional data and methods have proliferated throughout the literature. Yet, the classical technique of linear regression has not lost its usefulness in applications. In fact, many high-dimensional estimation techniques can be seen as variable selection that leads to a smaller set of variables (a “submodel”) where classical linear regression applies. We analyze linear regression estimators resulting from model selection by proving estimation error and linear representation bounds uniformly over sets of submodels. Based on deterministic inequalities, our results provide “good” rates when applied to both independent and dependent data. These results are useful in meaningfully interpreting the linear regression estimator obtained after exploring and reducing the variables and also in justifying post-model-selection inference. All results are derived under no model assumptions and are nonasymptotic in nature.


Introduction and Motivation
Least squares linear regression is one of the most widely used prediction tools in practical data analysis. With its simple form, linear regression leads to interpretable results and in many cases has predictive performance on par with sophisticated or complex models. It is, however, an open secret that in most cases the set of covariates used in the final linear regression model is rarely the same as the set of covariates initially considered by the data analyst. This is typically a consequence of the selection of a good predictive submodel based on an estimate of the out-of-sample prediction risk. We use "submodel" here to denote a subset of the full set of covariates.
Traditional analysis of the least squares linear regression estimator restricts attention to a single set of covariates to prove consistency as well as asymptotic normality; see White (1980a,b) and Buja et al. (2019). In this case, it was proved that the least squares estimator is weakly and strongly consistent for the population least squares functional; see (10) below. Also, a properly normalized estimator has an asymptotic normal distribution. However, the theoretical understanding and practical usefulness of submodel least squares estimators resulting from a covariate selection procedure requires simultaneous consistency and (asymptotic) normality of all the estimators under consideration. Such simultaneous consistency and normality properties are the major focus of the current article. These are what we call uniform-in-submodel results. To be more concrete, suppose $\mathcal{M} = \{M_1, M_2, \ldots, M_L\}$ denotes a collection of submodels, where $M_j$ represents a subset of covariates for $1 \le j \le L$. Also, let $\hat\beta_{M_j}$ represent the least squares estimator for the linear regression of the response on the covariates in $M_j$. By simultaneous consistency, we mean the existence of target vectors $\{\beta_{M_j} : 1 \le j \le L\}$ such that
$$\sup_{M \in \mathcal{M}} \|\hat\beta_M - \beta_M\| = o_p(1), \quad \text{as } n \to \infty, \tag{1}$$
for some norm $\|\cdot\|$. To claim simultaneous asymptotic normality, we prove the existence of functions $\{\psi_{M_j}(\cdot) : 1 \le j \le L\}$ such that
$$\sup_{M \in \mathcal{M}} \Big\| \sqrt{n}\big(\hat\beta_M - \beta_M\big) - \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_M(Z_i) \Big\| = o_p(1), \quad \text{as } n \to \infty. \tag{2}$$
Here $n$ represents the sample size and $Z_i = (X_i, Y_i)$, $1 \le i \le n$, represent the regression data, with detailed notation given in Section 2.
Equation (2) provides the well-known "asymptotic uniform linear representation" in the special case of the least squares linear regression estimator. This uniform linear representation is crucial for providing inference after variable selection via simultaneous inference (Bachoc et al., 2020). If $\hat M$ is a selected model, then one can perform inference on $\beta_{\hat M}$ by estimating the distribution of $\hat\beta_{\hat M}$. This can be a tricky problem, as shown in the works of Leeb and Pötscher (2005, 2006a,b, 2008).
Although various model selection criteria like $C_p$, AIC, BIC, and lasso have been recommended for covariate selection in linear regression, results of the type (1) and (2) have not been established in the literature (at least not in the full generality considered here). Our method of attack is quite non-standard. Instead of assuming that the observations are independent and identically distributed, we prove a purely deterministic inequality to bound the left hand sides of (1) and (2) using maxima of several averages. We then control these averages under both independence and functional dependence to obtain explicit rates of convergence; cf. White (2001), where a detailed classical analysis of the least squares regression estimator is provided. The functional dependence structure of data, introduced in Wu (2005), is based on the idea of coupling and covers many linear and non-linear time series. This dependence concept is very closely related to the $L_p$-approximability concept introduced in Pötscher and Prucha (1997).
Some noteworthy aspects of our results are as follows.
1. We provide a purely deterministic inequality for the least squares linear regression estimator which does not require any stochasticity of the regression data and holds for any sample size $n$. These deterministic results are sharp and by nature more widely applicable than any asymptotic results. Some deterministic inequalities for linear regression appeared in Kuchibhotla et al. (2020). Although these inequalities led to suboptimal rates, the structure of those deterministic inequalities was useful for the context of that paper.
2. All our results allow misspecification of the linear model. This means that the classical Gauss-Markov linear model need not hold true for any of the submodels under consideration; see Chapter 4 of Monahan (2008). Two important objections (for us) to the classical model are the impositions of a fixed design and of a linearity structure on the data generating distribution. Since our setting allows for misspecification, we call our framework "model-free." We note here that our results do apply to the setting of fixed covariates.
3. When studied assuming a suitable randomness structure (such as independence or functional dependence), our results are precise concentration inequalities applicable in finite samples, and they apply to high-dimensional observations. Another interesting facet of our results is that we do not assume the observations are identically distributed. This is an important generalization needed to include the case of fixed covariates.
4. For concreteness, we take the set of submodels $\mathcal{M}$ to be the set of all submodels of size bounded by $k$ (for some $1 \le k \le p$). Here $p$ represents the total number of available covariates. Under certain regularity conditions, the rates of convergence we obtain in this case for simultaneous consistency (1) and normality (2) with the Euclidean norm are $\sqrt{k\log(ep/k)/n}$ and $k\log(ep/k)/\sqrt{n}$, respectively (up to a lower-order additive term). Interestingly, the simultaneous consistency rate matches the minimax optimal rate of a well-specified high-dimensional sparse linear regression; see Raskutti et al. (2011). It should be noted that even though the rates match those of well-specified high-dimensional linear regression, we do NOT require a well-specified model in this article.
5. In the process of applying our results to functionally dependent observations, we prove a tail bound for sums of zero-mean dependent random variables, thereby extending the results of Wu and Wu (2016). For independent observations, we use the precise concentration inequality results of Kuchibhotla and Chakrabortty (2020).
In addition to the important general model selection problem above, where results of the type (1) and (2) are required, our simultaneity results can be seen to provide essential inferential validity guarantees for the following setting of growing importance. In the vast literature on high-dimensional linear regression, it has become customary to assume an underlying linear model along with a sparsity constraint on the true regression parameter. But suppose statisticians are not willing to assume sparsity of the parameter, and neither are they willing to assume a linear model. Such unwillingness is not unreasonable in light of the fact that any model is just an approximation, and sparsity is just an assumption of convenience. Now consider the following stylized description of approaches to high-dimensional data as widely practiced in applied statistics and data science: high-dimensional data is first explored either in a formal algorithmic way (e.g., using lasso or best subset selection) and/or in an informal exploratory way (e.g., using residual and leverage plots) to select a manageable small set of variables. Subsequently, the reduced data is subjected to linear regression. The combination of variable selection and linear regression is thought of as one procedure, a "high-dimensional linear regression". Even though the procedure uses only a reduced set of variables in the final regression, it uses all the variables in the preceding selection phase. Suppose $\hat M \in \mathcal{M}$ is the final selected submodel (from some collection of models $\mathcal{M}$) and $\hat\beta_{\hat M}$ is the least squares linear regression estimator thus obtained. The estimator $\hat\beta_{\hat M}$ is known as the post-regularization estimator in the high-dimensional statistics literature if $\hat M$ is obtained from some regularized least squares procedure. An important question now is: "what does $\hat\beta_{\hat M}$ estimate (consistently)?" A simultaneous result answers this question through the trivial bound
$$\|\hat\beta_{\hat M} - \beta_{\hat M}\| \le \sup_{M\in\mathcal{M}} \|\hat\beta_M - \beta_M\| = o_p(1).$$
Therefore, $\hat\beta_{\hat M}$ is estimating the quantity $\beta_{\hat M}$, which is random
through $\hat M$. If the model selection procedure is such that $\hat M$ does not stabilize as $n\to\infty$, then $\hat\beta_{\hat M}$ is only consistent for the random quantity $\beta_{\hat M}$ and may not be consistent for any non-random quantity. By comparison, if $P(\hat M = M_0) \to 1$ as $n\to\infty$ for some submodel $M_0$, then with probability converging to one $\beta_{\hat M} = \beta_{M_0}$, and hence $\hat\beta_{\hat M}$ is consistent for the non-random quantity $\beta_{M_0}$.

Literature Review
Results of the simultaneous type described in (1) and (2) are not readily available in the literature. Some works that are closely related to ours are Belloni and Chernozhukov (2013), Bachoc et al. (2018) and Chakrabortty et al. (2021). Although some of these works consider a simultaneous problem, their results are restricted to certain special cases (e.g., independent observations and/or fixed design) of our framework. Belloni and Chernozhukov (2013) prove the rate of convergence of the least squares linear regression estimator obtained after covariate selection using lasso. Bachoc et al. (2018) prove the rate of convergence of $\sup_{M\in\mathcal{M}} \|\hat\beta_M - \beta_M\|_\infty$ under the restricted isometry property (RIP). (Here $\|v\|_\infty$ for a vector $v$ denotes the maximum absolute entry of the vector.) Also, they only consider fixed covariates. We do not assume RIP because it is not a practical assumption, and we prove the simultaneous convergence guarantee in the Euclidean norm rather than $\|\cdot\|_\infty$. It should also be mentioned that Bachoc et al. (2018) appeared after the initial version of the current work, Kuchibhotla et al. (2018). Chakrabortty et al. (2021) independently prove results very similar to ours in the case of independent observations with sub-Gaussian tails. They consider a more general collection of submodels $\mathcal{M}$ than the set of $k$-sparse submodels; see Section 5 of Chakrabortty et al. (2021) for more details.
Because our results are deterministic in nature, they apply to a general collection of submodels, but for concreteness we fix the choice of the collection. Under the assumptions of Chakrabortty et al. (2021, Section 5), their results match ours exactly. We note, however, that their results are only proved for independent and identically distributed observations, which is why they do not apply to the case of fixed covariates. Further, our results, including those for independent observations, are proved under a large class of tail assumptions that subsumes theirs. Finally, we mention two recent works that discuss uniform-in-submodel type results. Rinaldo et al. (2019), in their Theorem 1 and the Remark 4 that follows, discuss uniform-in-submodel consistency for independent and identically distributed observations that are bounded. Their rates, however, are suboptimal; for instance, their Theorem 1 only proves a rate of $k\sqrt{\log(k)/n}$, while our results imply the optimal rate of $\sqrt{k/n}$. Giessing (2018, Chapter 2), following the initial version of our work, proves uniform-in-submodel consistency as well as linear representation results for quantile regression when the observations are independent. The tail assumptions on the observations there are weaker than ours, but this is expected, at least for the response, because the quantile loss is Lipschitz in the response.

Organization
The remainder of our paper is organized as follows. In Section 2, we introduce our notation and general framework. In Section 3, we derive various deterministic inequalities for linear regression that form the core of the paper. The application of these results to the case of independent observations is considered in Section 4. The application of the deterministic inequalities to the case of (functionally) dependent observations is considered in Section 5. A discussion of our results, along with their implications for post-selection inference, is given in Section 6. Some auxiliary probability results for sums of independent and functionally dependent random variables are given in Appendix A and Appendix B, respectively.

Notation
Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ are $n$ random vectors in $\mathbb{R}^p \times \mathbb{R}$. Throughout the paper, we implicitly think of $p$ as a function of $n$, and so the sequence of random vectors should be thought of as a triangular array. The term "submodel" is used to specify the subset of covariates used in the regression and does not refer to any probability model. We do not assume a linear model (in any sense) to be true anywhere, for any choice of covariates, in any section of the paper. In this sense all our results are applicable to misspecified linear regression models.
For any vector $v \in \mathbb{R}^q$ for $q \ge 1$ and $1 \le j \le q$, let $v(j)$ denote the $j$-th coordinate of $v$. For any non-empty submodel $M$ given by a subset of $\{1, 2, \ldots, q\}$, let $v(M)$ denote the sub-vector of $v$ with indices in $M$. For instance, if $M = \{2, 4\}$ and $q \ge 4$, then $v(M) = (v(2), v(4))$. The notation $|M|$ is used to denote the cardinality of $M$. For any non-empty submodel $M \subseteq \{1, 2, \ldots, q\}$ and any symmetric matrix $A \in \mathbb{R}^{q\times q}$, let $A(M)$ denote the sub-matrix of $A$ with indices in $M \times M$. For $1 \le j, k \le q$, let $A(j, k)$ denote the entry in the $j$-th row and $k$-th column of $A$. Define the $r$-norm of a vector $v \in \mathbb{R}^q$ for $1 \le r \le \infty$ as
$$\|v\|_r^r := \sum_{j=1}^q |v(j)|^r, \quad \text{for } 1 \le r < \infty, \quad\text{and}\quad \|v\|_\infty := \max_{1\le j\le q} |v(j)|.$$
Let $\|v\|_0$ denote the number of non-zero entries of $v$ (note this is not a norm). For any square matrix $A$, let $\lambda_{\min}(A)$ denote the minimum eigenvalue of $A$. Also, let the elementwise maximum norm and the operator norm be defined, respectively, as
$$|||A|||_\infty := \max_{1\le j,k\le q} |A(j,k)|, \quad\text{and}\quad \|A\|_{op} := \sup_{\|\delta\|_2 \le 1} \|A\delta\|_2.$$
The following simple inequalities are useful. For any matrix $A \in \mathbb{R}^{q\times q}$ and $v \in \mathbb{R}^q$,
$$\|v\|_1 \le \sqrt{\|v\|_0}\,\|v\|_2, \quad\text{and}\quad \|Av\|_2 \le \|A\|_{op}\,\|v\|_2. \tag{3}$$
For any $1 \le k \le p$, define the set of $k$-sparse submodels
$$\mathcal{M}(k) := \{M : M \subseteq \{1, 2, \ldots, p\},\; 1 \le |M| \le k\},$$
so that $\mathcal{M}(p)$ is the power set of $\{1, 2, \ldots, p\}$ with the empty set removed. Thus the set $\mathcal{M}(k)$ denotes the set of all non-empty submodels of size bounded by $k$. The most important aspect of our results is the "uniform-in-submodel" feature. These results are proved uniformly over $M \in \mathcal{M}(k)$ for some $k$ that is allowed to diverge with $n$.
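To fix ideas, the collection $\mathcal{M}(k)$ can be enumerated explicitly for small $p$. The following sketch (our own illustration; the helper name `sparse_submodels` is ours) confirms that $|\mathcal{M}(k)| = \sum_{j=1}^k \binom{p}{j}$, which is bounded by $(ep/k)^k$, the count that drives the union bounds later in the paper.

```python
from itertools import combinations
from math import comb, e

def sparse_submodels(p, k):
    """Enumerate M(k): all non-empty subsets of {0, ..., p-1} of size at most k."""
    for size in range(1, k + 1):
        yield from combinations(range(p), size)

p, k = 8, 3
models = list(sparse_submodels(p, k))
count = len(models)
# |M(k)| = sum_{j=1}^k C(p, j), which is at most (e p / k)^k.
assert count == sum(comb(p, j) for j in range(1, k + 1))
assert count <= (e * p / k) ** k
```

The exponential growth of this count in $k$ is exactly why naive union bounds require exponential tails; see the discussion of heavy-tailed observations in Section 3.3.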
When fitting a linear regression, it is common to include an intercept term. To avoid extra notation, we assume that all covariates under consideration are included in the vectors $X_i$; so, take the first coordinate of all $X_i$'s to be 1, that is, $X_i(1) = 1$ for all $1 \le i \le n$, if an intercept is required. For any $M \subseteq \{1, 2, \ldots, p\}$, define the ordinary least squares empirical risk (or objective) function as
$$\hat R_n(\theta; M) := \frac{1}{n}\sum_{i=1}^n \big(Y_i - X_i(M)^\top\theta\big)^2. \tag{4}$$
Expanding the square, it is clear that
$$\hat R_n(\theta; M) = \frac{1}{n}\sum_{i=1}^n Y_i^2 \;-\; \frac{2}{n}\sum_{i=1}^n Y_i X_i(M)^\top\theta \;+\; \theta^\top\Big(\frac{1}{n}\sum_{i=1}^n X_i(M)X_i(M)^\top\Big)\theta.$$
Only the second and the third terms depend on $\theta$. Because the quantities in these terms play a significant role in our analysis, define
$$\hat\Sigma_n := \frac{1}{n}\sum_{i=1}^n X_iX_i^\top \in \mathbb{R}^{p\times p}, \quad\text{and}\quad \hat\Gamma_n := \frac{1}{n}\sum_{i=1}^n X_iY_i \in \mathbb{R}^p. \tag{5}$$
The least squares linear regression estimator $\hat\beta_{n,M}$ is defined as
$$\hat\beta_{n,M} := \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} \hat R_n(\theta; M). \tag{6}$$
The notation $\arg\min_\theta f(\theta)$ denotes the minimizer of $f(\theta)$. Based on the quadratic expansion (4) of the empirical objective $\hat R_n(\theta; M)$, the estimator $\hat\beta_{n,M}$ is given by the closed form expression
$$\hat\beta_{n,M} = [\hat\Sigma_n(M)]^{-1}\hat\Gamma_n(M), \tag{7}$$
assuming non-singularity of $\hat\Sigma_n(M)$. Note that $[\hat\Sigma_n(M)]^{-1}$ is not equal to $\hat\Sigma_n^{-1}(M)$. The matrix $\hat\Sigma_n(M)$, being the average of $n$ rank-one matrices in $\mathbb{R}^{|M|\times|M|}$, has rank at most $\min\{|M|, n\}$. This implies that the least squares estimator $\hat\beta_{n,M}$ is not uniquely defined unless $|M| \le n$.
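The closed form (7) is straightforward to compute from $\hat\Sigma_n$ and $\hat\Gamma_n$ alone. The sketch below (our own illustration; the helper name `ols_submodel` is ours) forms the two averages once and extracts $\hat\beta_{n,M}$ for any submodel, and checks that it agrees with directly regressing $Y$ on the selected columns.

```python
import numpy as np

def ols_submodel(X, Y, M):
    """OLS estimator for submodel M via the closed form (7):
    beta = [Sigma_hat(M)]^{-1} Gamma_hat(M), with Sigma_hat = X^T X / n
    and Gamma_hat = X^T Y / n."""
    n = X.shape[0]
    Sigma = X.T @ X / n          # p x p Gram matrix
    Gamma = X.T @ Y / n          # p-vector
    M = list(M)
    return np.linalg.solve(Sigma[np.ix_(M, M)], Gamma[M])

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)       # no linear model assumed
M = [0, 2, 5]
beta = ols_submodel(X, Y, M)
# Agrees with directly regressing Y on the selected columns.
beta_direct, *_ = np.linalg.lstsq(X[:, M], Y, rcond=None)
assert np.allclose(beta, beta_direct)
```

Note that only $\hat\Sigma_n$ and $\hat\Gamma_n$ enter the computation, which is what motivates the general linear regression map introduced below.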
It is clear from (7) that $\hat\beta_{n,M}$ is a smooth (non-linear) function of the two averages $\hat\Sigma_n(M)$ and $\hat\Gamma_n(M)$. Assuming for a moment that the random vectors $(X_i, Y_i)$ are independent and identically distributed (iid) with finite fourth moments, it follows that $\hat\Sigma_n(M)$ and $\hat\Gamma_n(M)$ converge in probability to their expectations. The iid assumption here can be relaxed to weak dependence and non-identically distributed random vectors; see White (2001) for more details.
Getting back to the general context, define the "expected" matrix and vector as
$$\Sigma_n := \frac{1}{n}\sum_{i=1}^n E\big[X_iX_i^\top\big], \quad\text{and}\quad \Gamma_n := \frac{1}{n}\sum_{i=1}^n E\big[X_iY_i\big]. \tag{8}$$
Note that we write $\Sigma_n$ or $\Gamma_n$ (indexing by the sample size $n$) for two reasons. Firstly, we do not assume the random vectors are identically distributed, and hence the expected matrix changes with $n$ even if the dimension is fixed. Secondly, the dimension in our setting is allowed to change with $n$, and hence, even if the observations are identically distributed, the expectation matrix changes with the sample size.
To define a target vector that is consistently estimated by $\hat\beta_{n,M}$, consider the following simple calculation in the simpler setting where $|M|$ does not change with $n$. As noted above, $\hat\beta_{n,M} = [\hat\Sigma_n(M)]^{-1}\hat\Gamma_n(M)$, and if $(\hat\Sigma_n - \Sigma_n, \hat\Gamma_n - \Gamma_n) \stackrel{P}{\to} 0$ as $n\to\infty$, then by a Slutsky-type argument it follows that
$$\hat\beta_{n,M} - \beta_{n,M} \stackrel{P}{\to} 0, \tag{9}$$
where
$$\beta_{n,M} := [\Sigma_n(M)]^{-1}\Gamma_n(M) = \operatorname*{arg\,min}_{\theta\in\mathbb{R}^{|M|}} E\big[\hat R_n(\theta; M)\big]. \tag{10}$$
The convergence statement (9) only concerns a single submodel $M$ and is not uniform over $M$. By uniform-in-submodel $\|\cdot\|_2$-norm consistency of $\hat\beta_{n,M}$ to $\beta_{n,M}$ for $M \in \mathcal{M}(k)$, we mean that
$$\sup_{M\in\mathcal{M}(k)} \big\|\hat\beta_{n,M} - \beta_{n,M}\big\|_2 = o_p(1) \quad\text{as } n\to\infty.$$
As shown above, convergence of $\hat\beta_{n,M}$ to $\beta_{n,M}$ only requires convergence of $\hat\Sigma_n(M)$ to $\Sigma_n(M)$ and of $\hat\Gamma_n(M)$ to $\Gamma_n(M)$. It is not required that these matrices and vectors be averages of random matrices and random vectors.
In the following section, in proving deterministic inequalities, we generalize the linear regression estimator to the function $\beta_M : \mathbb{R}^{p\times p}\times\mathbb{R}^p \to \mathbb{R}^{|M|}$ given by
$$\beta_M(\Sigma, \Gamma) = [\Sigma(M)]^{-1}\Gamma(M), \tag{11}$$
assuming the existence of the inverse of $\Sigma(M)$. We call $\beta_M(\cdot,\cdot)$ the linear regression map. It is evident that $\hat\beta_{n,M} = \beta_M(\hat\Sigma_n, \hat\Gamma_n)$ and $\beta_{n,M} = \beta_M(\Sigma_n, \Gamma_n)$.
There are many potential applications that require replacing the sample average matrices in the linear regression estimator by a suitable non-average version, e.g., shrinkage or robust estimators. Three of these applications are listed in Section 3.3. To distinguish the estimator $\hat\beta_{n,M}$ built from sample averages from the general linear regression map, we call $\hat\beta_{n,M}$ the OLS estimator.
In the next section, we shall prove a bound of the type
$$\big\|\beta_M(\Sigma_1, \Gamma_1) - \beta_M(\Sigma_2, \Gamma_2)\big\|_2 \le F_M(\Sigma_1 - \Sigma_2,\; \Gamma_1 - \Gamma_2), \tag{12}$$
for some function $F_M(\cdot,\cdot)$. Taking $(\Sigma_1, \Gamma_1) = (\hat\Sigma_n, \hat\Gamma_n)$ and $(\Sigma_2, \Gamma_2) = (\Sigma_n, \Gamma_n)$, inequality (12) is useful for the purpose of proving (1). In regard to (12), thinking of $\beta_M$ as a function of $(\Sigma, \Gamma)$, our results are essentially about studying Lipschitz continuity properties of this map and understanding what kind of norms are best suited for this purpose. Using the smoothness of the linear regression map, we also obtain a bound on
$$\big\|\beta_M(\Sigma_1, \Gamma_1) - \beta_M(\Sigma_2, \Gamma_2) - \nabla\beta_M(\Sigma_2, \Gamma_2)\big[(\Sigma_1 - \Sigma_2,\; \Gamma_1 - \Gamma_2)\big]\big\|_2,$$
where $\nabla\beta_M(\cdot,\cdot)$ represents the gradient of the linear regression map. The following error norms will be very useful for these results:
$$\mathrm{RIP}(k,\; \Sigma_1 - \Sigma_2) := \sup_{M\in\mathcal{M}(k)} \big\|\Sigma_1(M) - \Sigma_2(M)\big\|_{op}, \quad\text{and}\quad D(k,\; \Gamma_1 - \Gamma_2) := \sup_{M\in\mathcal{M}(k)} \big\|\Gamma_1(M) - \Gamma_2(M)\big\|_2. \tag{13}$$
The quantity RIP is a norm for any $k \ge 2$ and is not a norm for $k = 1$. This error norm is very closely related to the restricted isometry property used in the compressed sensing and high-dimensional linear regression literature, where $\Sigma_2$ is the identity matrix. Also, define the $k$-sparse minimum singular value of a matrix $A \in \mathbb{R}^{p\times p}$ as
$$\Lambda(k; A) := \inf_{M\in\mathcal{M}(k)} \lambda_{\min}\big(A(M)\big). \tag{14}$$
Even though all the results in the next section are written in terms of the linear regression map (11), our main focus will still be the matrices and vectors defined in (5) and (8).
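For small $p$, the error norms in (13) can be computed by brute force, which is useful for building intuition. The following sketch (our own illustration; the function names are ours) computes both norms and checks elementwise bounds of the kind discussed around Proposition 1, namely $\mathrm{RIP}(k, \Delta) \le k\,|||\Delta|||_\infty$ and $D(k, \delta) \le \sqrt{k}\,\|\delta\|_\infty$.

```python
import numpy as np
from itertools import combinations

def rip_norm(k, Delta):
    """RIP(k, Delta): max operator norm over principal submatrices of size <= k."""
    p = Delta.shape[0]
    return max(
        np.linalg.norm(Delta[np.ix_(M, M)], ord=2)
        for j in range(1, k + 1)
        for M in map(list, combinations(range(p), j))
    )

def d_norm(k, delta):
    """D(k, delta): max Euclidean norm of a sub-vector with <= k coordinates.
    The supremum is attained at the k largest entries in absolute value."""
    return np.sqrt(np.sort(delta ** 2)[::-1][:k].sum())

rng = np.random.default_rng(1)
p, k = 6, 3
Delta = rng.standard_normal((p, p)); Delta = (Delta + Delta.T) / 2
delta = rng.standard_normal(p)
# Elementwise bounds: RIP <= k * max|entry| and D <= sqrt(k) * max|entry|.
assert rip_norm(k, Delta) <= k * np.abs(Delta).max() + 1e-12
assert d_norm(k, delta) <= np.sqrt(k) * np.abs(delta).max() + 1e-12
```

The brute-force supremum over $\sum_{j\le k}\binom{p}{j}$ submatrices is feasible only for small $p$; the point of Sections 4 and 5 is to control these norms probabilistically without enumeration.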
Deterministic Results for Linear Regression

Can we expect deterministic inequalities?
Classical asymptotic theory for linear regression, or for that matter any estimation problem, usually starts with an assumption that the observations are independent or otherwise follow a specific stochastic dependence. What we are aiming for is a purely deterministic inequality that does not even assume randomness of the observations. To see whether we can at all expect a deterministic inequality, let us consider a simple example with only one submodel $M = \{1\}$, that is, a simple regression through the origin based on one regressor. For this case, let us write $\hat\sigma_n^2 := \hat\Sigma_n(M)$, $\hat\gamma_n := \hat\Gamma_n(M)$, $\sigma_n^2 := \Sigma_n(M)$, and $\gamma_n := \Gamma_n(M)$.
Note that these are all scalar quantities. Now the regression estimator and target become $\hat\beta_{n,M} = \hat\gamma_n/\hat\sigma_n^2$ and $\beta_{n,M} = \gamma_n/\sigma_n^2$, and hence
$$\hat\beta_{n,M} - \beta_{n,M} = \frac{(\hat\gamma_n - \gamma_n) - \beta_{n,M}(\hat\sigma_n^2 - \sigma_n^2)}{\sigma_n^2} \;-\; \frac{(\hat\sigma_n^2 - \sigma_n^2)(\hat\beta_{n,M} - \beta_{n,M})}{\sigma_n^2}.$$
Solving this identity for $|\hat\beta_{n,M} - \beta_{n,M}|$, we get
$$\big|\hat\beta_{n,M} - \beta_{n,M}\big| \le \frac{|\hat\gamma_n - \gamma_n| + |\beta_{n,M}|\,|\hat\sigma_n^2 - \sigma_n^2|}{\sigma_n^2 - |\hat\sigma_n^2 - \sigma_n^2|}, \quad\text{whenever } |\hat\sigma_n^2 - \sigma_n^2| < \sigma_n^2.$$
This is a deterministic inequality that does not require any probabilistic structure on the data, and more importantly, the right hand side tends to zero if $\hat\sigma_n^2 - \sigma_n^2 = o(\sigma_n^2)$ and $\hat\gamma_n - \gamma_n = o(\sigma_n^2)$. Because this bound is a deterministic inequality, taking a supremum over a collection of submodels does not invalidate the inequality. This is not the case if we only have an asymptotic result. All our deterministic inequalities to be stated and proved in the forthcoming sections are variations of the calculation above. One might suspect that the closed form expression of the linear regression map made a deterministic inequality possible, but as shown in Kuchibhotla (2018), most "smooth" $M$-estimators satisfy this type of result.
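The scalar calculation above is easy to check numerically. The following sketch (our own illustration, with a heteroscedastic, misspecified design chosen so that $\sigma_n^2 = E[X^2]$ and $\gamma_n = E[XY]$ are available in closed form) verifies a bound of the form $|\hat\beta_{n,M}-\beta_{n,M}| \le (|\hat\gamma_n-\gamma_n| + |\beta_{n,M}|\,|\hat\sigma_n^2-\sigma_n^2|)/(\sigma_n^2 - |\hat\sigma_n^2-\sigma_n^2|)$, one way of solving the identity above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.standard_normal(n)
# Heteroscedastic errors: the linear model need not be "true" in any classical sense.
y = 2 * x + rng.standard_normal(n) * np.abs(x)

sig2_hat, gam_hat = np.mean(x * x), np.mean(x * y)
sig2, gam = 1.0, 2.0             # population E[x^2] and E[xy] for this design
beta_hat, beta = gam_hat / sig2_hat, gam / sig2

# Deterministic bound: valid whenever |sig2_hat - sig2| < sig2.
num = abs(gam_hat - gam) + abs(beta) * abs(sig2_hat - sig2)
den = sig2 - abs(sig2_hat - sig2)
assert abs(sig2_hat - sig2) < sig2
assert abs(beta_hat - beta) <= num / den + 1e-12
```

No probabilistic structure is used in the verification itself: the inequality holds for the realized data whenever the denominator is positive.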

Main Results
All our results in this section depend on the error norms $\mathrm{RIP}(k, \Sigma_1 - \Sigma_2)$ and $D(k, \Gamma_1 - \Gamma_2)$ in (13). These are, respectively, the maximal $k$-sparse eigenvalue of $\Sigma_1 - \Sigma_2$ and the maximal $k$-sparse Euclidean norm of $\Gamma_1 - \Gamma_2$. At first glance, it may not be clear how these quantities behave. We first present a simple inequality bounding RIP and $D$ in terms of $|||\cdot|||_\infty$ and $\|\cdot\|_\infty$.
Proposition 1. For any $k \ge 1$, any symmetric matrix $\Delta \in \mathbb{R}^{p\times p}$, and any vector $\delta \in \mathbb{R}^p$,
$$\mathrm{RIP}(k, \Delta) \le k\,|||\Delta|||_\infty, \quad\text{and}\quad D(k, \delta) \le \sqrt{k}\,\|\delta\|_\infty.$$
Proof. See Appendix C for a proof.
In many cases, it is much easier to control the maximum elementwise norm than the RIP error norm. However, the factor $k$ on the right hand side often leads to a sub-optimal dependence on the dimension. For the special cases of independent and dependent random vectors (to be discussed in Sections 4 and 5), we control RIP and $D$ directly.
The sequence of results to follow concerns uniform consistency in the $\|\cdot\|_2$- and $\|\cdot\|_1$-norms. To state these results, we require the following quantity, representing the strength of the regression (or linear association). For $r, k \ge 1$, define
$$S_{r,k}(\Sigma, \Gamma) := \sup_{M\in\mathcal{M}(k)} \big\|\beta_M(\Sigma, \Gamma)\big\|_r. \tag{15}$$
For the following theorem, recall the $k$-sparse minimum singular value $\Lambda(\cdot\,;\cdot)$ defined in (14) and the error metrics defined in (13).
Proof. Fix $M \in \mathcal{M}(k)$. By definition of the operator norm, we bound the two error terms $\Delta_1$ and $\Delta_2$, where $I_M$ represents the identity matrix of dimension $|M|\times|M|$. Combining the bounds on $\Delta_1$ and $\Delta_2$, and subtracting and adding $\beta_M(\Sigma_2, \Gamma_2)$ from $\beta_M(\Sigma_1, \Gamma_1)$, we obtain an inequality in $\|\beta_M(\Sigma_1,\Gamma_1) - \beta_M(\Sigma_2,\Gamma_2)\|_2$. Solving this inequality under assumption (16), the stated bound follows for all $M \in \mathcal{M}(k)$. This proves the result.
As will be seen in the application of Theorem 1, the complicated-looking bound provided above gives the "optimal" rate. Combining Proposition 1 and Theorem 1, we get the following simple corollary, which gives sub-optimal rates.
Corollary 1. Let $k \ge 1$ be any integer such that the assumption of Theorem 1 holds. Then the conclusion of Theorem 1 holds with $\mathrm{RIP}(k, \cdot)$ and $D(k, \cdot)$ replaced by the elementwise bounds $k\,|||\cdot|||_\infty$ and $\sqrt{k}\,\|\cdot\|_\infty$ of Proposition 1.

Remark 3.1 (Bounding $S_{2,k}$ in (15)) The bound for uniform $L_2$-consistency requires a bound on $\|\beta_M(\Sigma_2, \Gamma_2)\|_2$ in addition to bounds on the error norms related to the $\Sigma$-matrices and $\Gamma$-vectors. It is a priori not clear how this quantity varies as the dimension of the submodel $M$ changes. In the classical analysis of linear regression, where a true linear model is assumed, the true parameter vector $\beta$ is seen as something chosen by nature, and hence its norm is not under the control of the statistician. Hence, in the classical analysis, a growth rate on $\|\beta\|_2$ is imposed as an assumption.
From the viewpoint taken in this paper, under misspecification nature picks the whole distribution sequence of random vectors and hence the quantity $\beta_M(\cdot,\cdot)$ that arises in the analysis. In the full generality of linear regression maps considered here, we do not know of any techniques to bound the norm of this vector. It is, however, possible to bound it when $\beta_M(\cdot,\cdot)$ is defined by a least squares linear regression problem. Recall the definitions of $\Sigma_n, \Gamma_n$ from (8) and $\beta_{n,M}$ from (10). Observe that, by definition of $\beta_{n,M}$, $\Gamma_n(M) = \Sigma_n(M)\beta_{n,M}$, and hence
$$\beta_{n,M}^\top \Sigma_n(M)\,\beta_{n,M} \le \frac{1}{n}\sum_{i=1}^n E\big[Y_i^2\big].$$
This holds because
$$0 \le \frac{1}{n}\sum_{i=1}^n E\big[(Y_i - X_i(M)^\top\beta_{n,M})^2\big] = \frac{1}{n}\sum_{i=1}^n E\big[Y_i^2\big] - \beta_{n,M}^\top \Sigma_n(M)\,\beta_{n,M}.$$
Hence for every $M \in \mathcal{M}(p)$,
$$\Lambda\big(|M|; \Sigma_n\big)\,\|\beta_{n,M}\|_2^2 \le \frac{1}{n}\sum_{i=1}^n E\big[Y_i^2\big].$$
Therefore, using the definitions of $\Lambda(k; \Sigma_n)$ and $S_{r,k}$ in (14) and (15),
$$S_{2,k}(\Sigma_n, \Gamma_n) \le \left(\frac{n^{-1}\sum_{i=1}^n E[Y_i^2]}{\Lambda(k; \Sigma_n)}\right)^{1/2}.$$
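The derivation above applies verbatim with empirical quantities in place of expectations, since $\hat\beta_{n,M}$ minimizes the empirical risk. This makes the bound easy to check numerically; the following sketch (our own brute-force illustration) verifies $S_{2,k}(\hat\Sigma_n, \hat\Gamma_n) \le (n^{-1}\sum_i Y_i^2 / \Lambda(k; \hat\Sigma_n))^{1/2}$ over all submodels of size at most $k$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, k = 400, 6, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)           # no linear model assumed
Sigma = X.T @ X / n
Gamma = X.T @ Y / n

# Lambda(k; Sigma): smallest eigenvalue over all principal submatrices of size <= k.
Lambda_k = min(
    np.linalg.eigvalsh(Sigma[np.ix_(M, M)]).min()
    for j in range(1, k + 1)
    for M in map(list, combinations(range(p), j))
)
# S_{2,k}: largest Euclidean norm of a submodel coefficient vector.
S_2k = max(
    np.linalg.norm(np.linalg.solve(Sigma[np.ix_(M, M)], Gamma[M]))
    for j in range(1, k + 1)
    for M in map(list, combinations(range(p), j))
)
assert S_2k <= np.sqrt(np.mean(Y ** 2) / Lambda_k) + 1e-12
```

The inequality checked here is deterministic, holding for the realized data whatever its distribution, in keeping with the spirit of this section.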
It is immediate from these results that if the second moment of the response is uniformly bounded, then $S_{2,k}$ behaves like a constant when $\Sigma_n$ is well-conditioned. See Foygel and Srebro (2011) for a similar calculation. Based on the uniform-in-submodel $\|\cdot\|_2$-bound, the following result is trivially proved.
Theorem 2. Suppose the assumption of Theorem 1 holds. Then, simultaneously for all $M \in \mathcal{M}(k)$, the error $\|\beta_M(\Sigma_1,\Gamma_1) - \beta_M(\Sigma_2,\Gamma_2)\|_1$ is bounded by $\sqrt{k}$ times the $\|\cdot\|_2$-bound of Theorem 1.
Proof. The proof follows by using the first inequality in (3).
The results above only prove a rate of convergence that gives uniform consistency. They are therefore not readily applicable for (asymptotic) inference. For inference about a parameter, an asymptotic distribution result is required, usually asymptotic normality, which is typically proved by way of an asymptotic linear representation. In what follows we derive a uniform-in-submodel linear representation for the linear regression map. The result in terms of the regression map itself is somewhat abstract, so it may be helpful to revisit the usual estimators $\hat\beta_{n,M}$ and $\beta_{n,M}$ from (6) and (10) to understand what kind of representation is possible. From the definition of $\hat\beta_{n,M}$, we have
$$\hat\Sigma_n(M)\,\hat\beta_{n,M} = \hat\Gamma_n(M) \quad\Rightarrow\quad \hat\Sigma_n(M)\big(\hat\beta_{n,M} - \beta_{n,M}\big) = \hat\Gamma_n(M) - \hat\Sigma_n(M)\,\beta_{n,M}.$$
Assuming $\hat\Sigma_n(M)$ and $\Sigma_n(M)$ are close, one would expect
$$\hat\beta_{n,M} - \beta_{n,M} \approx [\Sigma_n(M)]^{-1}\big(\hat\Gamma_n(M) - \hat\Sigma_n(M)\,\beta_{n,M}\big). \tag{17}$$
Note, by substituting all the definitions, that
$$[\Sigma_n(M)]^{-1}\big(\hat\Gamma_n(M) - \hat\Sigma_n(M)\,\beta_{n,M}\big) = \frac{1}{n}\sum_{i=1}^n [\Sigma_n(M)]^{-1} X_i(M)\big(Y_i - X_i(M)^\top\beta_{n,M}\big).$$
This being an average (a linear functional), the difference between the left and right hand sides of (17) is called the linear representation error. Now, using the same argument with $\Sigma_1, \Gamma_1$ and $\Sigma_2, \Gamma_2$ substituted for $\hat\Sigma_n, \hat\Gamma_n$ and $\Sigma_n, \Gamma_n$, respectively, we get the following result. Recall the notations $S_{2,k}(\cdot,\cdot)$ and $\Lambda(\cdot\,;\cdot)$ from equations (15) and (14).
Proof. Adding and subtracting $\Sigma_2(M)\beta_M(\Sigma_2, \Gamma_2)$, and then adding and subtracting $\Sigma_2(M)$ from $\Sigma_1(M)$ in the resulting equation, the right hand side is almost the quantity we need to bound to establish the result. Multiplying both sides of the equation by $[\Sigma_2(M)]^{-1}$ and then applying the Euclidean norm yields the claimed bound for $M \in \mathcal{M}(k)$. This proves the first part of the result. The second part follows by an application of Theorem 1.
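The algebraic identity driving this proof can be verified numerically. The sketch below (our own illustration, assuming standard normal covariates so that $\Sigma_n(M)$ is the identity and $\Gamma_n(M)$ is available in closed form) checks that the linear representation error equals $-[\Sigma_n(M)]^{-1}(\hat\Sigma_n(M) - \Sigma_n(M))(\hat\beta_{n,M} - \beta_{n,M})$ exactly, and that it is smaller than the estimation error itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
X = rng.standard_normal((n, 5))
Y = np.sin(X[:, 0]) + rng.standard_normal(n)   # no linear model holds

M = [0, 1, 2]
Sigma_hat = (X.T @ X / n)[np.ix_(M, M)]
Gamma_hat = (X.T @ Y / n)[M]
Sigma = np.eye(len(M))                         # population Gram: E[X X^T] = I
beta_pop = np.array([np.exp(-0.5), 0.0, 0.0])  # [Sigma(M)]^{-1} E[X(M) Y]
beta_hat = np.linalg.solve(Sigma_hat, Gamma_hat)

# Linear representation term from (17), and the exact identity behind its error:
lin_rep = np.linalg.solve(Sigma, Gamma_hat - Sigma_hat @ beta_pop)
lhs = beta_hat - beta_pop - lin_rep            # linear representation error
rhs = -np.linalg.solve(Sigma, (Sigma_hat - Sigma) @ (beta_hat - beta_pop))
assert np.allclose(lhs, rhs)
assert np.linalg.norm(lhs) < np.linalg.norm(beta_hat - beta_pop)
```

The identity holds as a matter of algebra for any fixed target vector; the probabilistic content of the theorem is that the factor $\hat\Sigma_n(M) - \Sigma_n(M)$ makes the representation error of strictly smaller order than the estimation error.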
If the minimal and maximal $k$-sparse singular values of $\Sigma_1 - \Sigma_2$ are of the same order, then the upper and lower bounds for the linear representation error match up to order, under the additional assumption that the minimal and maximal sparse eigenvalues of $\Sigma_2$ are of the same order.

Remark 3.3 (Improved $\|\cdot\|_2$-Error Bounds) The uniform linear representation error bounds (18) and (19) prove more than just a linear representation. These bounds allow us to improve the bounds provided for uniform $L_2$-consistency. Assuming $\mathrm{RIP}(k, \Sigma_1 - \Sigma_2) \le \Lambda(k; \Sigma_2)/2$, bound (18) implies that for all $M \in \mathcal{M}(k)$ the estimation error is characterized exactly up to a factor of 2, which is a more precise result than that provided by Theorem 1. Also, note that in the case of the least squares estimator and target, $\hat\beta_{n,M}$ and $\beta_{n,M}$, the upper and lower bounds here are Euclidean norms of averages of random vectors. Dealing with linear functionals like averages is much simpler than dealing with non-linear functionals such as $\hat\beta_{n,M}$. If $\mathrm{RIP}(k, \Sigma_1 - \Sigma_2)$ converges to zero, then the right hand side of bound (18) is of smaller order than both terms appearing on its left hand side (which are the same as those appearing in (22)). This means that the linear representation error is of strictly smaller order than the estimation error, simultaneously over all $M \in \mathcal{M}(k)$.

Remark 3.4 (Alternative to RIP) A careful inspection of the proofs of Theorem 3 and Theorem 1 reveals that the bounds can be written in terms of
$$\sup_{M\in\mathcal{M}(k)} \big\|[\Sigma_2(M)]^{-1/2}\,\Sigma_1(M)\,[\Sigma_2(M)]^{-1/2} - I_{|M|}\big\|_{op},$$
instead of $\mathrm{RIP}(k, \Sigma_1 - \Sigma_2)$. Here $I_{|M|}$ is the identity matrix in $\mathbb{R}^{|M|\times|M|}$. Bounding this quantity might not require a bounded condition number of $\Sigma_2$; however, for convenience, we will only deal with $\mathrm{RIP}(k, \Sigma_1 - \Sigma_2)$ in the following sections. Summarizing all the results in this section: it is sufficient to control $\mathrm{RIP}(k, \Sigma_1 - \Sigma_2)$ and $D(k, \Gamma_1 - \Gamma_2)$ to derive uniform-in-submodel results in any linear regression type problem. In this respect, these are the norms in which one should measure the accuracy of the Gram matrix and of the inner product of covariates and response. Hence, if one wishes to use shrinkage estimators, for example, because $\Sigma$ and $\Gamma$ are high-dimensional "objects", then the estimation accuracy should be measured with respect to RIP and $D$ for uniform-in-submodel type results.

Applications of the Linear Regression Map
Before proceeding to the rates of convergence of these error norms for independent and dependent data, we describe the importance of defining the linear regression map with general matrices instead of just Gram matrices. The generality achieved so far would be worthless if no interesting applications existed. The goal now is to provide a few such examples.
1. Heavy-Tailed Observations: The $\mathrm{RIP}(\cdot,\cdot)$-norm is a supremum over all submodels of size $k$ or less, hence the supremum is over $\sum_{j=1}^{k}\binom{p}{j}$ submodels. This count is polynomial in the total number of covariates but exponential in the size of the largest submodel under consideration. Therefore, if the total number of covariates $p$ is allowed to diverge, then the question we are interested in is inherently high-dimensional. If the usual Gram matrices are used, then
$$\mathrm{RIP}(k, \hat\Sigma_n - \Sigma_n) = \sup_{|M|\le k} \big\|\hat\Sigma_n(M) - \Sigma_n(M)\big\|_{op};$$
hence RIP in this case is a supremum over on the order of $(ep/k)^k$ many averages. As is well understood from the literature on concentration of measure, or even from the union bound, one would require exponential tails on the initial random vectors to obtain good control of $\mathrm{RIP}(\cdot,\cdot)$ if the usual Gram matrix is used. Does this mean that the situation is hopeless if the initial random vectors do not have exponential tails? The short answer is "not necessarily." Viewing the matrix $\Sigma_n$ (the "population" Gram matrix) as a target, there have been many variations of sample-mean Gram matrix estimators that are shown to have exponential tails even though the initial observations are heavy tailed. See, for example, Catoni (2012), Wei and Minsker (2017) and Catoni and Giulini (2017), along with the references therein, for more details on specific estimators and their properties. It should be noted that these authors do not study the estimator accuracy with respect to the RIP-norm.
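To illustrate the flavor of such robust Gram estimators, the sketch below implements a crude elementwise truncation before averaging. This is our own simplified stand-in for this class of methods, not the exact construction of Catoni (2012) or Wei and Minsker (2017); the truncation level `tau` is a hypothetical tuning parameter.

```python
import numpy as np

def truncated_gram(X, tau):
    """Elementwise-truncated Gram matrix estimator: each product X_i(j) X_i(l)
    is clipped to [-tau, tau] before averaging. Illustration only; the cited
    Catoni-style estimators use more refined influence functions."""
    prods = X[:, :, None] * X[:, None, :]        # n x p x p array of products
    return np.clip(prods, -tau, tau).mean(axis=0)

rng = np.random.default_rng(5)
X = rng.standard_t(df=3, size=(300, 4))          # heavy-tailed covariates
S = truncated_gram(X, tau=25.0)
assert S.shape == (4, 4)
assert np.allclose(S, S.T)                       # symmetric by construction
```

Because such estimators are no longer sample averages, the generality of the linear regression map $\beta_M(\Sigma, \Gamma)$ with arbitrary inputs is exactly what allows our deterministic results to apply to them.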
2. Outlier Contamination: Real data, more often than not, are contaminated with outliers, and it is a difficult problem to remove or downweight observations when contamination is present. Robust statistics provides estimators that can ignore or downweight the observations suspected to be outliers and yet perform comparably when there is no contamination present in the data. Some simple examples include entry-wise medians or trimmed means. See Minsker (2015) and references therein for more examples. Almost none of these estimators are simple averages, but they behave regularly in the sense that they can be expressed as averages up to a negligible asymptotic remainder term. Chen et al. (2013) provide a simple estimator of the Gram matrix under adversarial corruption and case-wise contamination.
3. Indirect Observations: This example is taken from Loh and Wainwright (2012). The setting is as follows. Instead of observing the real random vectors (X_1, Y_1), . . ., (X_n, Y_n), we observe a sequence (Z_1, Y_1), . . ., (Z_n, Y_n), where Z_i is linked to X_i via some conditional distribution, that is, Z_i ∼ Q(·|X_i) for 1 ≤ i ≤ n.
As discussed on page 4 of Loh and Wainwright (2012), this setting includes some interesting cases such as missing data and noisy covariates. A brief sketch of the settings is given below:
- Z_i = X_i + W_i, where W_i is independent of X_i and has mean zero with a known covariance matrix.
- For some fraction ρ ∈ [0, 1), we observe a random vector Z_i ∈ R^p such that for each component j, we independently observe Z_i(j) = X_i(j) with probability 1 − ρ and Z_i(j) = ∗ with probability ρ. (Here ∗ denotes a missing value.)
- Z_i = X_i ⊙ u_i, where u_i ∈ R^p is again a random vector independent of X_i and ⊙ is the Hadamard (coordinate-wise) product. The problem of missing data is a special case.
On page 6, Loh and Wainwright (2012) provide various estimators to use in place of Σ̂_n in (5). The assumption in Lemma 12 of Loh and Wainwright (2012) is essentially a bound on the RIP-norm in our notation, and they verify this assumption in all the examples above. Hence all our results in this section apply in these settings.
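To make the missing-data case above concrete, the following sketch (our own, not taken from Loh and Wainwright (2012); the simulation setup and all variable names are illustrative assumptions) builds a bias-corrected Gram matrix estimator by rescaling the observed second moments by the observation probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 2000, 5, 0.3

X = rng.standard_normal((n, p))       # fully observed covariates (unseen in practice)
mask = rng.random((n, p)) >= rho      # each entry observed with probability 1 - rho
Z = np.where(mask, X, 0.0)            # missing entries replaced by zero

# Naive Gram matrix of the zero-filled data: biased for the full-data Gram matrix.
Sigma_naive = Z.T @ Z / n

# Bias correction: observed products carry a factor (1 - rho) on the
# diagonal and (1 - rho)^2 off the diagonal, so divide these factors out.
scale = np.full((p, p), (1 - rho) ** 2)
np.fill_diagonal(scale, 1 - rho)
Sigma_corrected = Sigma_naive / scale

Sigma_true = X.T @ X / n
err_naive = np.linalg.norm(Sigma_naive - Sigma_true, 2)
err_corr = np.linalg.norm(Sigma_corrected - Sigma_true, 2)
print(err_naive, err_corr)
```

The corrected estimator is no longer a simple average of the observed data, which is exactly why deterministic inequalities in the RIP-norm, rather than results tied to sample-mean Gram matrices, are useful here.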
Theorem 4. Let k ≥ 1 be any integer such that RIP_n(k) ≤ Λ_n(k). Then the stated bound holds for all submodels M ∈ M(k). Recall here that Γ_n and Σ_n are the non-random vector and matrix given in (8). So Theorem 4 (which is still a deterministic inequality) can be used to prove an asymptotic uniform linear representation.
Remark 3.5 (Non-uniform Bounds) The bound above applies for any k satisfying the assumption RIP_n(k) ≤ Λ_n(k). Noting that for M ∈ M(k), RIP_n(|M|) ≤ RIP_n(k) as well as Λ_n(|M|) ≥ Λ_n(k), Theorem 4 implies a bound on β̂_{n,M} − β_{n,M} that depends only on |M|. The point made here is that even though the bound in Theorem 4 only uses the maximal submodel size, it can recover submodel-size-dependent bounds, because the result is proved for every k.
Remark. A similar bound also holds for the linear representation error. Therefore, the uniform-in-submodel results above allow us to prove consistency and asymptotic normality of the least squares linear regression estimator after data exploration. See Belloni and Chernozhukov (2013) for related applications and methods of choosing the random submodel M̂.
Remark 3.7 (Bounding S_{2,k}) As shown in Remark 3.1, for the setting of averages, the quantity on the right-hand side of (24) is of order Λ_n^{−1/2}(k) under the assumption of bounded second moments of the Y_i's. Therefore, we will not write S_{2,k} explicitly any further and simply use Λ_n^{−1/2}(k) instead.
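The uniform-in-submodel notion can be illustrated with a small simulation. The sketch below (our own construction; the data generating process and all names are hypothetical) computes the submodel least squares estimator β̂_M for every submodel of size at most k together with its population target β_M = Σ(M)^{-1}Γ(M), and reports the supremum of the estimation errors. Note that Y is deliberately not linear in X, so no model is "true":

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 5000, 6, 3

# Model-free data: the regression function is not linear in X.
X = rng.standard_normal((n, p))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.standard_normal(n)

# Population targets: Sigma = E[X X^T] = I here; Gamma = E[X Y] by Monte Carlo.
Sigma = np.eye(p)
Xm = rng.standard_normal((500_000, p))
Ym = np.sin(Xm[:, 0]) + 0.5 * Xm[:, 1] ** 2
Gamma = Xm.T @ Ym / len(Ym)

sup_err = 0.0
for size in range(1, k + 1):
    for M in map(list, itertools.combinations(range(p), size)):
        # Submodel least squares estimator and its population target.
        beta_hat = np.linalg.solve(X[:, M].T @ X[:, M] / n, X[:, M].T @ Y / n)
        beta_M = np.linalg.solve(Sigma[np.ix_(M, M)], Gamma[M])
        sup_err = max(sup_err, np.linalg.norm(beta_hat - beta_M))

print(sup_err)
```

Every submodel estimator has a well-defined target even under misspecification, and the supremum over all submodels is small for large n, which is the content of the simultaneous consistency statement (1).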

4 Rates for Independent Observations
In this section, we derive bounds for RIP_n(k) and D_n(k) defined in (23) under the assumption of independence and weak exponential tails. The setting is as follows. Suppose (X_1, Y_1), . . ., (X_n, Y_n) is a sequence of independent random vectors in R^p × R. Consider the following assumptions:
(MExp) Assume that there exist positive numbers α > 0 and K_{n,p} > 0 such that max_{1≤j≤p} max{‖X_i(j)‖_{ψ_α}, ‖Y_i‖_{ψ_α}} ≤ K_{n,p} for all 1 ≤ i ≤ n.
(JExp) Assume that there exist positive numbers α > 0 and K_{n,p} > 0 such that the corresponding joint ψ_α-norm is bounded by K_{n,p}. Recall that X_i(j) denotes the j-th coordinate of X_i. The notation ‖·‖_{ψ_α} refers to the quasi-norm defined by ‖W‖_{ψ_α} := inf{η > 0 : E[exp(|W|^α/η^α)] ≤ 2} for any random variable W. Random variables W satisfying ‖W‖_{ψ_α} < ∞ are referred to as sub-Weibull of order α, because ‖W‖_{ψ_α} < ∞ implies that for all t ≥ 0, P(|W| ≥ t) ≤ 2 exp(−t^α/‖W‖_{ψ_α}^α), where the right-hand side resembles the survival function of a Weibull random variable of order α > 0 (see Kuchibhotla and Chakrabortty (2020) for more details). The special cases α = 1, 2 are widely used tail assumptions in the high-dimensional literature. A random variable W satisfying ‖W‖_{ψ_α} < ∞ is called sub-Gaussian when α = 2 and sub-exponential when α = 1 (see van der Vaart and Wellner (1996) for more details).
It is easy to see that Assumption (JExp) implies Assumption (MExp). We refer to Assumption (MExp) as a marginal assumption and Assumption (JExp) as a joint assumption. It should be noted that Assumption (JExp) is much stronger than (MExp), because (JExp) implies that the coordinates of X_i are "almost" independent (see Chapter 3 of Vershynin (2018) and Kuchibhotla and Chakrabortty (2020) for further discussion).
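A quick way to build intuition for the ψ_α quasi-norm is to estimate it by Monte Carlo. The sketch below (our own; the bisection routine and its tolerances are illustrative assumptions, not part of the paper) estimates ‖W‖_{ψ_2} for Gaussian data and checks the implied sub-Weibull tail bound:

```python
import numpy as np

rng = np.random.default_rng(2)

def psi_alpha_norm(w, alpha, tol=1e-3):
    """Monte Carlo estimate of the psi_alpha quasi-norm: the smallest
    eta with E[exp(|W|^alpha / eta^alpha)] <= 2, found by bisection."""
    lo, hi = 1e-3, 100.0
    while hi - lo > tol:
        eta = 0.5 * (lo + hi)
        if np.mean(np.exp(np.abs(w) ** alpha / eta ** alpha)) <= 2:
            hi = eta
        else:
            lo = eta
    return hi

w = rng.standard_normal(10**5)   # Gaussian data: sub-Weibull of order alpha = 2
K = psi_alpha_norm(w, alpha=2)

# Implied sub-Gaussian tail bound: P(|W| >= t) <= 2 exp(-t^2 / K^2).
t = 2.5
emp_tail = np.mean(np.abs(w) >= t)
bound = 2 * np.exp(-t**2 / K**2)
print(K, emp_tail, bound)
```

The empirical tail sits below the bound, as the sub-Weibull inequality requires; the bound is loose by a constant factor, which is all these tail assumptions promise.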
The following results bound D_n(k) and RIP_n(k) based on Theorem 10 in Appendix A. Because RIP_n(k) involves operator norms over k-sparse unit balls, we will bound it using ε-nets for the union of these unit balls. This will also be useful for bounding D_n(k). Before stating the results, we need the following preliminary calculations and notation. For any set K with metric d(·,·), a set N is called a γ-net of K with respect to d if N ⊆ K and for any z ∈ K there exists an x ∈ N such that d(x, z) ≤ γ. Let ‖·‖_2 denote the Euclidean norm, let B_{2,d} denote the d-dimensional unit ball, and let N_d(ε) represent an ε-net of B_{2,d} with respect to the Euclidean norm. Define the k-sparse subset of the unit ball in R^p as Θ_k := {θ ∈ R^p : ‖θ‖_2 ≤ 1, ‖θ‖_0 ≤ k}. With some abuse of notation, Θ_k can be decomposed as a union of the balls B_{2,s}, s ≤ k, over all supports of size s; this union includes repetitions of B_{2,s} as subsets of R^p with unequal supports. Using this decomposition, it follows that an ε-net N(ε, Θ_k) of Θ_k with respect to the Euclidean norm on R^p can be chosen support by support, and hence its cardinality is bounded by the number of supports times the per-support covering number. Applying Lemma 4.1 of Pollard (1990), the covering number of B_{2,d} is at most (3/ε)^d, and this can be improved from the proof to (1 + 1/ε)^d. Here one can choose the elements of the covering set N_s(ε) to be s-sparse in R^p. See Lemma 3.3 of Plan and Vershynin (2013) for a similar result. Based on these calculations and the covering set N(ε, Θ_k), we bound D_n(k) and RIP_n(k) by a finite maximum of mean-zero averages.
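The support-by-support construction of the net N(ε, Θ_k) can be mimicked numerically. The following sketch (our own; the greedy thinning routine and all parameter values are illustrative assumptions) builds an ε-net of the k-sparse unit sphere one support at a time and checks the covering property on fresh sparse unit vectors:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
p, k, eps = 8, 2, 0.5

def greedy_net(points, eps):
    """Greedily extract an eps-separated subset of a point cloud;
    by maximality it is also an eps-net of that cloud."""
    net = []
    for x in points:
        if all(np.linalg.norm(x - y) > eps for y in net):
            net.append(x)
    return net

# Build a net of the k-sparse unit sphere support by support, mirroring
# the decomposition of Theta_k into balls with a fixed support.
net = []
for support in itertools.combinations(range(p), k):
    samples = rng.standard_normal((600, k))
    samples /= np.linalg.norm(samples, axis=1, keepdims=True)
    for v in greedy_net(samples, eps):
        x = np.zeros(p)
        x[list(support)] = v
        net.append(x)
net = np.array(net)

# Empirical check of the covering property on fresh k-sparse unit vectors.
max_dist = 0.0
for _ in range(300):
    supp = rng.choice(p, size=k, replace=False)
    v = rng.standard_normal(k)
    v /= np.linalg.norm(v)
    x = np.zeros(p)
    x[supp] = v
    max_dist = max(max_dist, np.min(np.linalg.norm(net - x, axis=1)))

print(len(net), max_dist)
```

The net size stays bounded by (number of supports) × (per-support covering number), which is the combinatorial count behind the (ep/k)^k factor in the rates.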
Observe that the supremum defining the operator norm can be restricted to the net at the cost of a constant factor, and it is clear that the resulting bound is sharp up to that constant. By a similar calculation, the analogous bound holds for D_n(k); see Lemma 2.2 of Vershynin (2012) for a derivation. Importantly, independence of the random vectors is not used in any of these calculations. Replacing the continuous supremum by a finite maximum works irrespective of how the random vectors are distributed.
As an immediate corollary, we get the following rates of convergence.
Theorem 5. (a) Under Assumption (MExp), the following rates of convergence hold if K_{n,p} = O(1).
(b) Under Assumption (JExp), the corresponding rates hold. For simplicity, we provide here only rates of convergence; a more precise tail bound is given in Theorem 11 of Appendix A.
Remark 4.1 (Simplified Rates of Convergence) In most cases the second term in the rate of convergence is of lower order than the first. Hence, under both Assumptions (MExp) and (JExp), we get the same leading rate. We believe these rates to be optimal, because if X and Y are independent and jointly Gaussian, then the rate would be √(k log(ep/k)/n); see Theorem 3.3 of Cai and Yuan (2012) and Lemma 15 of Loh and Wainwright (2012) for related results.
A direct application of Theorem 5 to Theorem 4 implies the following uniform linear representation result for linear regression under independence. Recall the notation Λ_n(k) from (23) and also β̂_{n,M}, β_{n,M} from (7) and (10). Theorem 6. If (Λ_n(k))^{−1} = O(1) as n, p → ∞, then the following rates of convergence hold as n → ∞.
Remark 4.2 (Simplified Rates of Convergence) The result can be made much more precise by giving exact tail bounds for all the quantities using Theorem 11; we leave the details to the reader. From Theorem 6, it is clear that if k(log(ep/k))^{2/T_1(α)} = o(n), then the least squares linear regression estimator is uniformly consistent at the rate √(k log(ep/k)/n), which is well known to be the minimax optimal rate of convergence for high-dimensional linear regression estimators under a true linear model with a sparse parameter vector. We conjecture these rates to be optimal; however, we have not derived minimax rates for this problem. Also, our results are uniform over all probability distributions of the random vectors (X_i, Y_i) satisfying either Assumption (MExp) or (JExp) with K_{n,p} ≤ K for some fixed constant K < ∞.
Remark 4.3 (Fixed Covariates) The results in this section do not require any special properties of the data generating distribution such as linearity or Gaussianity. The results only require independence of random vectors with weak exponential tails; it is not assumed that the (X_i, Y_i) are identically distributed for 1 ≤ i ≤ n.
It is worth mentioning a special case of our setting that is popular in the classical as well as modern linear regression literature: the setting of fixed covariates. As explained in Buja et al. (2019), this assumption has its roots in ancillarity theory assuming the truth of a linear model. If the covariates are non-stochastic, then Σ̂_n = Σ_n, so that RIP_n(k) = 0 for all n and k. Therefore, the bounds in Theorem 4 become trivial in the sense that the uniform linear representation error becomes zero. The result applies because Assumption (MExp) holds with K_{n,p} given by the maximum of the fixed covariate magnitudes and the ψ_α-norms of the responses, which again leads to the same rate of convergence √(k log(ep/k)/n). An interesting observation here is that there is no dependence on the strength of linear association S_{2,k}(Σ_n, Γ_n), defined in Equation (15), in the case of fixed covariates.
Remark 4.4 (Are the rates optimal?) We believe the rates for the uniform linear representation error to be optimal; cf. Theorem 5.1 of Javanmard and Montanari (2018). An intuitive reason is as follows. Any symmetric function of independent random variables can be expanded as a sum of degenerate U-statistics of increasing order according to the Hoeffding decomposition; see van Zwet (1984). That is, f(W_1, . . ., W_n) = U_{1n} + U_{2n} + · · · + U_{nn} for any symmetric function f of independent random variables W_1, . . ., W_n. Here U_{in} represents an i-th order degenerate U-statistic.
For the statistic β̂_{n,M} − β_{n,M}, the first order term U^{(M)}_{1n} in the decomposition is the linear representation term. Hence the difference β̂_{n,M} − β_{n,M} − U^{(M)}_{1n} is of the same order as the second order U-statistic U^{(M)}_{2n} next in the decomposition. It is well known that, under mild conditions, a second order degenerate U-statistic is of order 1/n; see Serfling (1980, Chapter 5) for precise results. Therefore, bounding the supremum of the ‖·‖_2-norm in the uniform linear representation, we see that it is a maximum of at most (5ep/k)^k degenerate U-statistics of order 2, which is expected to be of order log((5ep/k)^k)/n = k log(5ep/k)/n. See de la Peña and Giné (1999) for results about suprema of degenerate U-statistics.
Remark 4.5 (Using covariance matrices instead of Gram matrices) The quantities Υ^Γ_{n,k} and Υ^Σ_{n,k} play an important role in determining the exact rates of convergence in Theorem 6. Under Assumption (JExp), it can easily be shown that these quantities are of the same order as K_{n,p}. In cases where the dimension grows, Assumption (JExp) cannot be justified with non-zero means of the X_i's unless ‖E[X_i]‖_2 = O(1). Under Assumption (MExp), Υ^Γ_{n,k} and Υ^Σ_{n,k} can grow with k, and it is hard to pinpoint their growth rate. In many cases, it is reasonable to assume a bounded operator norm of the covariance matrix instead of the second moment (or Gram) matrix. For this reason, it is of interest to analyze the least squares estimators with centered random vectors. In this case Σ̂_n and Γ̂_n should be replaced by their centered versions Σ̃_n := n^{−1} Σ_{i=1}^n (X_i − X̄)(X_i − X̄)ᵀ and Γ̃_n := n^{−1} Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ). Here X̄ and Ȳ represent the sample means of the covariates and the response, respectively. Without the assumption of equality of E[X_i] for 1 ≤ i ≤ n, Σ̃_n is not consistent for the covariance matrix of X.

It is easy to prove that

Σ̃_n = Σ̄_n − (X̄ − μ_X^n)(X̄ − μ_X^n)ᵀ,  where  Σ̄_n := n^{−1} Σ_{i=1}^n (X_i − μ_X^n)(X_i − μ_X^n)ᵀ  and  μ_X^n := n^{−1} Σ_{i=1}^n E[X_i].

Similarly, we get

Γ̃_n = Γ̄_n − (X̄ − μ_X^n)(Ȳ − μ_Y^n),  where  Γ̄_n := n^{−1} Σ_{i=1}^n (X_i − μ_X^n)(Y_i − μ_Y^n)  and  μ_Y^n := n^{−1} Σ_{i=1}^n E[Y_i].

Note that Γ̄_n and Σ̄_n are averages of independent random vectors and random matrices, and so the theory above applies with the target vector and matrix given by their expectations. It is important to recognize that Theorem 4 is not directly applicable, since the forms of Σ̃_n and Γ̃_n do not match the required structure. One has to apply Theorem 3 to obtain a bound on the supremum over submodels of ‖β̃_M − β̃_M^∗‖, where β̃_M := β_M(Σ̃_n, Γ̃_n) and the target β̃_M^∗ and S̃_{2,k} are defined with the population centered versions of Σ_n and Γ_n in place of Σ_n and Γ_n.
The right-hand side terms above can be controlled using Theorem 10. Thus, the linear representation changes when the sample covariance matrix is used. See Section 4.1.1 of Kuchibhotla and Chakrabortty (2020) for more details.
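As a sanity check on the centered formulation, note that solving the covariance-based normal equations is the same as running least squares with an intercept column. A minimal sketch (our own; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 1000, 3
X = rng.standard_normal((n, p)) + 1.5       # non-zero mean covariates
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Centered (covariance-based) quantities, as in the display above.
Xc, Yc = X - X.mean(0), Y - Y.mean()
Sigma_tilde = Xc.T @ Xc / n
Gamma_tilde = Xc.T @ Yc / n
beta_centered = np.linalg.solve(Sigma_tilde, Gamma_tilde)

# OLS with an explicit intercept column gives the same slope vector.
X1 = np.column_stack([np.ones(n), X])
coef = np.linalg.solve(X1.T @ X1, X1.T @ Y)
print(beta_centered, coef[1:])
```

The two slope vectors agree exactly; the centered analysis differs only in which "average" (Gram versus covariance) must be concentrated around its target.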

5 Rates for Functionally Dependent Observations
In this section, we extend all the results presented in the previous section to dependent data. The dependence structure we use is based on a notion developed by Wu (2005). It is possible to derive these results under classical dependence notions such as α-, β-, and ρ-mixing; however, verifying mixing assumptions can often be hard, and many well-known processes do not satisfy them. See Wu (2005) for more details. It has also been shown that many econometric time series can be studied under the notion of functional dependence; see Wu and Mielniczuk (2010), Liu et al. (2013) and Wu and Wu (2016). For a study of dependent processes under a similar framework called L^p-approximability, see Pötscher and Prucha (1997).
The dependence notion of Wu (2005) is written in terms of an input-output process that is easy to analyze in many settings. The process is defined as follows. Let {ε_i, ε'_i : i ∈ Z} denote a sequence of independent and identically distributed random variables on some measurable space (E, B). Define the q-dimensional process W_i with causal representation W_i = G_i(. . ., ε_{i−1}, ε_i) for some vector-valued function G_i(·) = (g_{i1}(·), . . ., g_{iq}(·)). By the Wold representation theorem for stationary processes, this causal representation holds in many cases. Define the non-decreasing filtration F_i := σ(. . ., ε_{i−1}, ε_i). Using this filtration, we also write W_i = G_i(F_i). To measure the strength of dependence, define for r ≥ 1 and 1 ≤ j ≤ q the functional dependence measure δ_{s,r,j} := max_{1≤i≤n} ‖W_i(j) − W_{i,s}(j)‖_r, and ∆_{m,r,j} := Σ_{s≥m} δ_{s,r,j}, where W_{i,s}(j) := g_{ij}(F_{i,i−s}) with F_{i,i−s} := σ(. . ., ε_{i−s−1}, ε'_{i−s}, ε_{i−s+1}, . . ., ε_{i−1}, ε_i).
The σ-field F_{i,i−s} represents a coupled version of F_i. The quantity δ_{s,r,j} measures dependence via the ‖·‖_r-distance between g_{ij}(F_i) and g_{ij}(F_{i,i−s}); in other words, it quantifies the impact of changing ε_{i−s} on g_{ij}(F_i); see Definition 1 of Wu (2005). The dependence adjusted norm for the j-th coordinate is given by ‖W(j)‖_{r,ν} := sup_{m≥0} (m + 1)^ν ∆_{m,r,j}, ν ≥ 0.
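For intuition, consider the linear process W_i = Σ_{j≥0} a^j ε_{i−j}, for which the coupled copy W_{i,s} differs from W_i only in the single term a^s ε_{i−s}, so that δ_{s,2} = |a|^s ‖ε_0 − ε'_0‖_2 = |a|^s √2 for standard normal innovations, and the dependence measure decays geometrically. The following sketch (our own toy example, not from the paper) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
a, trunc, reps, s = 0.6, 60, 50_000, 5

# W_i = sum_{j >= 0} a^j eps_{i-j}, truncated at lag `trunc`;
# the coupled copy W_{i,s} replaces the single innovation at lag s.
eps = rng.standard_normal((reps, trunc))
weights = a ** np.arange(trunc)
W = eps @ weights

eps_coupled = eps.copy()
eps_coupled[:, s] = rng.standard_normal(reps)   # eps'_{i-s}
W_coupled = eps_coupled @ weights

delta_s2 = np.sqrt(np.mean((W - W_coupled) ** 2))  # empirical delta_{s,2}
theory = a ** s * np.sqrt(2.0)                     # |a|^s * ||eps - eps'||_2
print(delta_s2, theory)
```

Geometric decay of δ_{s,2} in s makes ∆_{m,r} summable, so the dependence adjusted norm is finite for every ν ≥ 0 in this example.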
Remark 5.1 (Independent Sequences) Any notion of dependence should at least include independent random variables, so it might be helpful to understand how independent random variables fit into this framework. For independent random vectors W_i, the causal representation reduces to W_i = G_i(ε_i); it is not a function of any of the previous ε_j, j < i. By the definition (30), this implies that δ_{s,r,j} = 0 for all s ≥ 1, while δ_{0,r,j} = ‖W_i(j) − W'_i(j)‖_r, where W'_i represents an independent and identically distributed copy of W_i. Hence ∆_{m,r,j} = 0 for all m ≥ 1, and it is now clear that the dependence adjusted norm is finite for any ν > 0. Hence, if an independent sequence W_i satisfies Assumption (MExp), then ‖W‖_{ψ_α,ν} < ∞ for all ν > 0, in particular for ν = ∞. Therefore, independence corresponds to ν = ∞; as ν decreases to zero, the random vectors become more and more dependent.
All our results in this section are based on the following tail bound for the maximum of averages of functionally dependent variables, which is an extension of Theorem 2 of Wu and Wu (2016) and is similar to Theorem 10. For this result, define s(λ) := (1/2 + 1/λ)^{−1} and T_1(λ) := min{λ, 1} for all λ > 0. (31)
Theorem 7. Suppose Z_1, . . ., Z_n are mean-zero random vectors in R^q with a causal representation as in (28). Assume that for some α > 0 and ν > 0 the corresponding dependence adjusted norm is bounded. Then for all t ≥ 0, with probability at least 1 − 8e^{−t}, the stated bound on max_{1≤j≤q} |Σ_{i=1}^n Z_i(j)| holds. Here B_ν and C_α are constants depending only on ν and α, respectively.
Proof.The proof follows from Theorem 12 proved in Appendix B and a union bound.
Getting back to the application of uniform-in-submodel results for linear regression, we assume that the random vectors are elements of a causal process with exponential tails. Formally, suppose (X_1, Y_1), . . ., (X_n, Y_n) are random vectors in R^p × R satisfying the following assumption:
(DEP) Assume that there exist n vector-valued functions G_i and an iid sequence {ε_i : i ∈ Z} giving a causal representation as in (28). Also, for some ν, α > 0, ‖W‖_{ψ_α,ν} ≤ K_{n,p} and the corresponding expectations are bounded by K_{n,p}.
Based on Remark 5.1, Assumption (DEP) is equivalent to Assumption (MExp) for independent data. For independent random variables, the second part of Assumption (DEP) about the expectations follows from the ψ_α-bound assumption. The reason for the expectation bound in the assumption here is that the functional dependence measure δ_{s,r} carries no information about the expectations, since ‖W_i(j) − W_{i,s}(j)‖_r = ‖(W_i(j) − E[W_i(j)]) − (W_{i,s}(j) − E[W_{i,s}(j)])‖_r.
The coupled random variable W_{i,s} has the same expectation as W_i. Since the quantities we need to bound involve products of random variables, such a bound on the expectations is needed for our analysis.
We are now ready to state the final results of this section. Only results analogous to Theorems 11 and 6 are stated. Also, we only state the results under the marginal moment assumption; the version under the joint moment assumption can easily be derived from the proof. These results are based on Theorem 7. Recall from inequalities (26) and (27) that the quantities to be controlled involve linear combinations θᵀX_i and products θᵀX_i Y_i of functionally dependent random variables. All linear combinations and products of functionally dependent random variables again have a causal representation: if W^{(1)}_i := h^{(1)}_i(F_i) and W^{(2)}_i := h^{(2)}_i(F_i), then their linear combinations and products are again functions of F_i. Thus, they can be studied within the same framework of dependence. In Lemma 4, we bound the functional dependence measure of such linear combination and product processes.
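The Hölder-type argument underlying such product bounds can be checked numerically. In the sketch below (our own construction with two independent linear processes; Lemma 4 itself is not reproduced here), the coupling distance of the product process is bounded via the decomposition XY − X'Y' = X(Y − Y') + (X − X')Y' together with the Cauchy–Schwarz inequality, which holds exactly for empirical moments as well:

```python
import numpy as np

rng = np.random.default_rng(8)
a, b, trunc, reps, s = 0.5, 0.7, 60, 20_000, 3

# Two independent causal linear processes and their lag-s coupled copies.
eps = rng.standard_normal((reps, trunc))
eta = rng.standard_normal((reps, trunc))
wX, wY = a ** np.arange(trunc), b ** np.arange(trunc)
X, Y = eps @ wX, eta @ wY

eps_c, eta_c = eps.copy(), eta.copy()
eps_c[:, s] = rng.standard_normal(reps)
eta_c[:, s] = rng.standard_normal(reps)
Xc, Yc = eps_c @ wX, eta_c @ wY

# Coupling distance of the product process X_i Y_i, bounded by
# fourth-moment norms of the factors and their coupling distances.
norm2 = lambda v: np.sqrt(np.mean(v ** 2))
norm4 = lambda v: np.mean(v ** 4) ** 0.25

lhs = norm2(X * Y - Xc * Yc)                       # delta_{s,2} of the product
rhs = norm4(X) * norm4(Y - Yc) + norm4(X - Xc) * norm4(Yc)
print(lhs, rhs)
```

This is why the lemma trades a second-moment dependence measure of the product for fourth-moment (more generally, 2r-th moment) quantities of the factors.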
For the main results of this section, define for θ ∈ Θ_k (see (25)) the corresponding linear combination and product processes.
Theorem 8. Under Assumption (DEP), with probability at least 1 − 16e^{−t}, the stated inequalities hold simultaneously. Here T_1(α) and s(α) are the functions given in (31), and B_ν, C_α are constants depending only on ν and α, respectively.
Also, using Lemma 3 and Lemma 4, the required bounds on the dependence adjusted norms follow. Hence, applying Theorem 7, the result is proved.
Theorem 8 along with Theorem 4 implies the following uniform linear representation result for linear regression under functional dependence. Recall the notation Λ_n(k) from Equation (23) and also β̂_{n,M}, β_{n,M} from Equations (7) and (10).

In comparison to Theorem 6, the rates attained here are very similar except for two changes:
1. The exponents α/2 and T_1(α/2) are replaced by s(α/2) and T_1(s(α/2)), respectively. This is because of the use of a version of Burkholder's inequality from Rio (2009) in the proof of Theorem 12.
2. The factor Ω_n(ν) appears in the second order terms above. This factor is due to the dependence of the process. If ν > 1/2 (which corresponds to "weak" dependence), then Ω_n(ν) is of order 1, and in the boundary case ν = 1/2, Ω_n(ν) is of order (log n)^{5/2}. In both cases, the rates obtained for functionally dependent ψ_α random vectors closely match the rates obtained for independent ψ_{s(α)} random vectors.
Remark 5.2 (Some Comments on Assumption (DEP)) Assumption (DEP) is similar to the one used in Theorem 3.3 of Zhang and Wu (2017) for the derivation of a high-dimensional central limit theorem with logarithmic dependence on the dimension p. It is worth mentioning that in their notation α corresponds to the functional dependence and ν corresponds to the moment assumption; their assumption is written as sup_{r≥2} r^{−α} ‖Z‖_{r,ν} < ∞ (after swapping the dependence and moment parameters).
Hence, our parameters (α, ν) correspond to their parameters (1/ν, α). Our assumptions are weaker than those used by Zhang and Cheng (2014): from the discussion surrounding Equation (28) there, they require geometric decay of ∆_{m,r,j}, while we only require polynomial decay. Zhang and Wu (2017) deal only with stationary sequences, while Zhang and Cheng (2014) allow non-stationarity. Some useful examples verifying bounds on the functional dependence measure are also provided in Zhang and Cheng (2014).

6 Discussion and Conclusions
In this paper, we have proved uniform-in-submodel results for the least squares linear regression estimator in a model-free framework, allowing the total number of covariates to diverge "almost exponentially" in n. Our results are based on deterministic inequalities. Exact rate bounds are provided when the random vectors are independent and when they are functionally dependent. In both cases, the random variables are assumed to have weak exponential tails, which yields logarithmic dependence on the dimension p.
In this paper, we have primarily focused on ordinary least squares linear regression. The main results, uniform-in-submodel consistency and linear representation, continue to hold for a large class of M-estimators defined by twice differentiable loss functions, as shown in Kuchibhotla (2018). The implication of these results is that one can use all the information from all the observations to build a submodel (a subset of covariates) and apply a general M-estimation technique to the final selected model. These results can be extended to non-differentiable loss functions using techniques from empirical process theory, in particular the stochastic uniform equicontinuity assumption. See, for example, Giessing (2018, Chapter 2) for results under independence.
All of our results are free of the assumption of a correctly specified model. Therefore, our results provide a "target" β_{n,M} for the estimator β̂_{n,M}, irrespective of whether M is fixed or random, as long as |M| ≤ k. This implication follows from the uniform-in-submodel feature of the results. The conclusion here is that if the statistician has a target in mind, then all they need to check is whether β_{n,M} is close to that target.
As mentioned at the beginning of the article, one can rethink high-dimensional linear regression as using high-dimensional data for exploration to find a "significant" set of variables and then applying the "low-dimensional" linear regression technique. If the exploration is not restricted to a very principled method, then inference can be very difficult. This problem is exactly equivalent to the problem of valid post-selection inference, which has a rich history in both statistics and econometrics. Leeb and Pötscher (2005, 2006a,b, 2008) have provided several impossibility results regarding the estimation of the distribution of β̂_{M̂}, where M̂ represents a data-dependent selected model. One way to avoid this difficulty is to perform inference for all models simultaneously. The results in this paper allow for the construction of a simultaneous inference procedure using a high-dimensional central limit theorem and the multiplier bootstrap; see Bachoc et al. (2019, 2020), Kuchibhotla et al. (2021) and Belloni et al. (2018, Section 2) for more details. A related exploration will be provided in a future manuscript.
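A rough sketch of such a simultaneous procedure (our own illustration; the model collection, the Gaussian multipliers and the 95% level are assumptions, not prescriptions from the paper) estimates the influence functions for every submodel and bootstraps the maximum statistic:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, p, k, B = 800, 5, 2, 500

X = rng.standard_normal((n, p))
Y = X[:, 0] - X[:, 2] + rng.standard_normal(n)

models = [list(M) for s in range(1, k + 1)
          for M in itertools.combinations(range(p), s)]

# Estimated influence functions psi_M(Z_i) for each submodel, stacked.
psis = []
for M in models:
    XM = X[:, M]
    G = XM.T @ XM / n
    beta = np.linalg.solve(G, XM.T @ Y / n)
    resid = Y - XM @ beta
    psis.append(np.linalg.solve(G, (XM * resid[:, None]).T))  # |M| x n
psi = np.vstack(psis)                                         # total coords x n

# Multiplier bootstrap of the max statistic over all submodels/coordinates.
stats = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    stats[b] = np.max(np.abs(psi @ e)) / np.sqrt(n)
crit = np.quantile(stats, 0.95)   # simultaneous 95% critical value
print(crit)
```

Because the critical value is computed over all submodels at once, the resulting confidence statements remain valid regardless of which submodel the data analyst eventually selects.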

A Auxiliary Results for Independent Random Vectors
The following result proves a tail bound for the maximum of averages of mean-zero random variables and follows from Theorem 4 of Adamczak (2008). The result there is stated only for α ∈ (0, 1]; however, the proof can be extended to the case α > 1. See the forthcoming paper Kuchibhotla and Chakrabortty (2020) for a clear exposition.
Theorem 10. Suppose W_1, . . ., W_n are mean-zero independent random vectors in R^q, q ≥ 1, such that for some α > 0 and K_{n,q} > 0, max_{1≤j≤q} max_{1≤i≤n} ‖W_i(j)‖_{ψ_α} ≤ K_{n,q}. Then for any t ≥ 0, with probability at least 1 − 3e^{−t}, the stated bound holds, where T_1(α) = min{α, 1} and C_α is a constant depending only on α.
Proof. Fix 1 ≤ j ≤ q and apply Theorem 4 of Adamczak (2008) with F = {f}, where f(W_i) = W_i(j) for 1 ≤ i ≤ n. The result then follows by applying the union bound. To extend the result to the case α > 1, use Theorem 5 of Adamczak (2008) with α = 1 to bound the second part of inequality (8) there.
Using Theorem 10, we get the following results for RIP and D under independence.
Theorem 11. Fix n, k ≥ 1 and let t ≥ 0 be any real number. Then the following probability statements hold: (a) Under Assumption (MExp), with probability at least 1 − 6e^{−t}, the following two inequalities hold simultaneously. (b) Under Assumption (JExp), with probability at least 1 − 6e^{−t}, the following two inequalities hold simultaneously. Here T_1(α) = min{α, 1} and C_α is a constant depending only on α.

B Auxiliary Results for Dependent Random Vectors
In this section, we present a moment bound for sums of functionally dependent mean-zero real-valued random variables. The moment bound here is an extension of Theorem 2 of Wu and Wu (2016) to random variables with exponential tails. The main distinction is that our moment bound exhibits a partly Gaussian behavior.
To prove these moment bounds, we need a few preliminary results and some notation. Suppose Z_1, . . ., Z_n are mean-zero real-valued random variables with a causal representation Z_i = g_i(F_i) for some real-valued function g_i. We write δ_{k,r} = ‖Z_i − Z_{i,k}‖_r. The following proposition bounds the r-th moment of Z_i in terms of ‖Z‖_{r,ν}. It is based on the calculation shown after Equation (2.8) in Wu and Wu (2016).
Proposition 2. Consider the setting above. If E[Z_i] = 0 for 1 ≤ i ≤ n, then ‖Z_i‖_r ≤ ‖Z‖_{r,0} ≤ ‖Z‖_{r,ν} for any r ≥ 1 and ν > 0.
Proof. Assuming E[Z_i] = 0 for 1 ≤ i ≤ n, we can write Z_i as the telescoping sum Z_i = Σ_{s≥0} (E[Z_i | F_{i−s}] − E[Z_i | F_{i−s−1}]), and each martingale difference satisfies ‖E[Z_i | F_{i−s}] − E[Z_i | F_{i−s−1}]‖_r ≤ δ_{s,r}. The last inequality follows from Jensen's inequality. Summing over s ≥ 0 and noting that the resulting bound equals ∆_{0,r}, it follows that ‖Z_i‖_r ≤ ∆_{0,r} = ‖Z‖_{r,0}.
The following lemma provides a bound on the moments of a martingale in terms of the moments of the martingale difference sequence. This result is an improvement over Burkholder's classical inequality.
Lemma 1 (Theorem 2.1 of Rio (2009)). Let {S_n : n ≥ 0} be a martingale with S_0 = 0, adapted to a non-decreasing filtration F_n, n ≥ 0, and let X_k = S_k − S_{k−1} denote the corresponding martingale difference sequence. Then for any p ≥ 2, ‖S_n‖_p^2 ≤ (p − 1) Σ_{k=1}^n ‖X_k‖_p^2. The following simple calculation is also used in Theorem 12.
Lemma 2. The following inequalities hold: (a) For any β ≥ 0 and p ≥ 2, the stated bound holds.
Here the second inequality follows from (43). Similarly, a representation for B_{n,ℓ} with independent summands exists, satisfying the assumption of Theorem 10 with β = s(α), and so the corresponding tail bound applies. To complete the bound on III, we need to bound the moments of the sum of M_{n,ℓ} over 1 ≤ ℓ ≤ L, which are all dependent. For this, define a positive sequence {λ_ℓ} satisfying Σ_{ℓ=1}^L λ_ℓ < 1. It is then easy to derive from Hölder's inequality that the moments of |Σ_{ℓ=1}^L M_{n,ℓ}| are bounded by a λ-weighted combination of the individual moments.
Here we have used inequalities (3).A similar proof implies the second result.

Remark 3.6 (Post-selection Consistency) One of the main aspects of our results is in proving consistency of the least squares linear regression estimator after data exploration. Suppose a random submodel M̂ chosen based on the data satisfies |M̂| ≤ k with probability converging to one, that is, P(M̂ ∈ M(k)) → 1. Then, with probability converging to one, ‖β̂_{n,M̂} − β_{n,M̂}‖ is bounded by the uniform-in-submodel error over M(k), which converges to zero.