UNIFORM ASYMPTOTICS AND CONFIDENCE REGIONS BASED ON THE ADAPTIVE LASSO WITH PARTIALLY CONSISTENT TUNING

We consider the adaptive Lasso estimator with componentwise tuning in the framework of a low-dimensional linear regression model. In our setting, at least one of the components is penalized at the rate of consistent model selection and certain components may not be penalized at all. We perform a detailed study of the consistency properties and the asymptotic distribution, which includes the effects of componentwise tuning, within a so-called moving-parameter framework. These results enable us to explicitly provide a set $\mathcal{M}$ such that every open superset acts as a confidence set with uniform asymptotic coverage equal to 1, whereas removing an arbitrarily small open set along the boundary yields a confidence set with uniform asymptotic coverage equal to 0. The shape of the set $\mathcal{M}$ depends on the regressor matrix as well as the deviations within the componentwise tuning parameters. Our findings can be viewed as a broad generalization of Pötscher and Schneider (2009, Journal of Statistical Planning and Inference 139, 2775–2790; 2010, Electronic Journal of Statistics 4, 334–360), who considered distributional properties and confidence intervals based on components of the adaptive Lasso estimator for the case of orthogonal regressors.


Introduction
The least absolute shrinkage and selection operator or Lasso by Tibshirani (1996) has received tremendous attention in the statistics literature in the past two decades. The main attraction of this method lies in its ability to perform model selection and parameter estimation at very low computational cost, and the fact that the estimator can be used in high-dimensional settings where the number of variables p exceeds the number of observations n ("p > n"). For these reasons, the Lasso has also turned into a very popular and powerful tool in econometrics, and similar things can be said about the estimator's many variants, among them the adaptive Lasso estimator of Zou (2006), where the $\ell_1$-penalty term is randomly weighted according to some preliminary estimator. This particular method has been used in econometrics in the context of diffusion processes (DeGregorio & Iacus, 2012), for instrumental variables (Caner & Fan, 2015), in the framework of stationary and non-stationary autoregressions (Kock & Callot, 2015; Kock, 2016) and for autoregressive distributed lag (ARDL) models (Medeiros & Mendes, 2017), to name just a few.
Despite the popularity of this method, there are still many open questions on how to construct valid confidence regions in connection with the adaptive Lasso estimator. Pötscher & Schneider (2010) demonstrate that the oracle property from Zou (2006) and Huang et al. (2008) cannot be used to conduct valid inference and that resampling techniques also fail. They give confidence intervals with exact coverage in finite samples as well as an extensive asymptotic study in the framework of orthogonal regressors. However, settings more general than the orthogonal case have not been considered yet.
In this paper, we consider an arbitrary low-dimensional linear regression model ("p ≤ n") where the regressor matrix exhibits full column rank. We allow for the adaptive Lasso estimator to be tuned componentwise with some tuning parameters possibly being equal to zero, so that not all coordinates have to be penalized. Due to this componentwise structure, three possible asymptotic regimes arise: the one where each zero component is identified as such with asymptotic probability less than one, usually termed conservative model selection; the one where each zero component is revealed as zero with asymptotic probability equal to one, usually referred to as consistent model selection; as well as the mixed case where some components are tuned conservatively and some are tuned consistently. The framework we consider encompasses the latter two regimes.
The main challenge for inference in connection with the adaptive Lasso and related estimators lies in the fact that the finite-sample distribution depends on the unknown parameter in a complicated manner, and that this dependence persists in large samples. Consequently, the coverage probability of a confidence region varies over the parameter space, and in order to conduct valid inference, one needs to guard against the lowest possible coverage and therefore consider the minimal coverage probability. We do so in the present paper.
Since explicit expressions for the finite-sample distribution and therefore also the coverage probabilities of confidence regions are unknown when the regressors are not orthogonal, our study is set in an asymptotic framework. We determine the appropriate uniform rate of convergence and derive the asymptotic distribution of an appropriately scaled estimator that has been centered at the true parameter. While the limit distribution is still only implicitly defined through a minimization problem, the key observation and finding is that one may explicitly characterize the set of minimizers once the union over all true parameters is taken. This is done by heavily exploiting the structure of the corresponding optimization problem and leads to a compact set M that is determined by the asymptotic Gram matrix as well as the asymptotic deviations between the componentwise tuning parameters and the maximal one. Subsequently, this result can be used to show how the set M acts as a benchmark for confidence regions since, very loosely put, any larger set will necessarily have asymptotic coverage equal to one, and any smaller set will exhibit zero uniform coverage in the limit. We will formalize this statement, sharpen it for a wide class of tuning regimes and demonstrate the intrinsic problem that arises for inference in connection with this estimation method.
In this article, we show that the one-dimensional case from Pötscher & Schneider (2010) can indeed be generalized to arbitrary low-dimensional models. Our investigations reveal the geometry of confidence regions based on the adaptive Lasso which cannot be seen in the intervals of the one-dimensional setting. Our study also encompasses the effects of varying tuning schemes over different components of the parameter, which may result in a loss of dimension in the confidence set.
The paper is organized as follows. We introduce the model and the assumptions as well as the estimator in Section 2. In Section 3, we study the relationship of the adaptive Lasso to the least-squares estimator. The consistency properties with respect to parameter estimation, rates of convergence, and model selection are derived in Section 4. Section 5 looks at the asymptotic distribution of the estimator and deduces that it is always contained in a compact set, independently of the unknown parameter. These results are used to construct and discuss the confidence regions in Section 6, where their shape is also illustrated. We summarize in Section 7 and relegate all proofs to Appendix A for readability.

Setting and Notation
We consider the linear regression model y = Xβ + ε, where y ∈ R^n is the response vector, X ∈ R^{n×p} the non-stochastic regressor matrix assumed to have full column rank, β ∈ R^p the unknown parameter vector and ε ∈ R^n the unobserved stochastic error term consisting of independent and identically distributed components with mean zero and finite second moments, defined on some probability space (Ω, F, P). To define the adaptive Lasso estimator, first introduced by Zou (2006), let L_n(b) = ‖y − Xb‖² + 2 Σ_{j=1}^p λ_j |b_j|/|β̂_LS,j|, where ‖·‖ is the Euclidean norm, λ_j are non-negative tuning parameters, and β̂_LS = (X'X)^{−1}X'y is the ordinary least-squares (LS) estimator. We assume the event {β̂_LS,j = 0} to have zero probability for all j = 1, . . ., p and do not consider this event occurring in the subsequent analysis.
The adaptive Lasso estimator we employ is given by β̂_AL = argmin_{b ∈ R^p} L_n(b), which always exists and is uniquely defined in our setting. Note that, in contrast to Zou (2006), we allow for componentwise and partial tuning, where the tuning parameter may vary over coordinates and may be equal to zero, so that not all components need to be penalized. This is unlike the typical case of uniform tuning with a single positive tuning parameter. We also look at the leading case of ω_j = 1/|β̂_LS,j|^γ with γ = 1, in the notation of Zou (2006). For all asymptotic considerations, we will assume that X'X/n converges to a positive definite matrix C ∈ R^{p×p} as n → ∞.
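Since the display defining L_n is elided above, the following numpy sketch assumes the normalization L_n(b) = ‖y − Xb‖² + 2 Σ_j λ_j|b_j|/|β̂_LS,j|; the function name and this normalization are our assumptions for illustration, not the paper's notation. It computes the estimator by cyclic coordinate descent and allows componentwise, possibly zero, tuning parameters.

```python
import numpy as np

def adaptive_lasso(X, y, lam, n_iter=500, tol=1e-10):
    """Componentwise-tuned adaptive Lasso via cyclic coordinate descent.

    Minimizes ||y - X b||^2 + 2 * sum_j lam[j] * |b_j| / |b_LS[j]|
    (this normalization of the objective is an assumption here).
    Entries of lam may differ over components and may be zero (no penalty).
    """
    n, p = X.shape
    b_ls = np.linalg.solve(X.T @ X, X.T @ y)   # LS estimator
    w = lam / np.abs(b_ls)                     # effective penalty weights
    col_norm2 = (X ** 2).sum(axis=0)
    b = b_ls.copy()
    r = y - X @ b                              # current residual
    for _ in range(n_iter):
        b_old = b.copy()
        for j in range(p):
            r += X[:, j] * b[j]                # remove j-th contribution
            z = X[:, j] @ r                    # partial correlation
            b[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / col_norm2[j]
            r -= X[:, j] * b[j]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b, b_ls
```

Setting all entries of `lam` to zero recovers the LS estimator, while a diverging componentwise λ_j sets the corresponding coefficient to zero exactly, which is the soft-thresholding mechanism behind the model-selection behavior studied below.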
We define the true active set A to be A = {j : β_j ≠ 0}. The quantity λ* is given by the largest tuning parameter, λ* = max_{1≤j≤p} λ_j. We use R̄ for the extended real line. Finally, the symbol →_d stands for convergence in distribution. For the sake of readability, we suppress the dependence of the following quantities on n in the notation: y, X, ε, β̂_AL, β̂_LS, λ_j and λ*.

Relationship to LS estimator
The following finite-sample relationship between the adaptive Lasso and the LS estimator is essential for proving the results in the subsequent section and will also give some insights for understanding the idea behind the results on the shape of the confidence regions in Sections 5 and 6. The lemma shows that the difference between the adaptive Lasso and the LS estimator is always contained in a bounded and closed set that depends on the regressor matrix as well as on the tuning parameters. Note that the statements in Lemma 1 and Corollary 2 hold for all ω ∈ Ω, i.e., "surely".
Lemma 1 can be used to determine under which tuning regimes the adaptive Lasso asymptotically behaves in the same way as the LS estimator, as stated in the following corollary.
Corollary 2 shows that in case λ* → 0, the adaptive Lasso estimator is asymptotically equivalent to the LS estimator, so that this case becomes a trivial one. How the estimator behaves in terms of parameter estimation and model selection for different asymptotic tuning regimes is treated in the next section.
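Although the display of Lemma 1 is elided above, its content can be checked numerically: under the penalty normalization assumed earlier, the first-order conditions in the proof of Lemma 1 (Appendix A) give |(X'X(β̂_AL − β̂_LS))_j| ≤ λ_j/|β̂_LS,j| for every j, with the j-th component equal to zero whenever λ_j = 0. The sketch below verifies this with a proximal-gradient (ISTA) solver; the solver and the normalization are our assumptions for illustration.

```python
import numpy as np

def soft(v, t):
    # componentwise soft-thresholding operator
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def adaptive_lasso_ista(X, y, lam, n_iter=5000):
    """Proximal-gradient (ISTA) solver for the adaptive Lasso, assuming the
    objective ||y - Xb||^2 + 2 * sum_j lam[j] * |b_j| / |b_LS[j]|
    (equivalently 0.5*||y - Xb||^2 + sum_j w_j |b_j| with w_j = lam[j]/|b_LS[j]|)."""
    b_ls = np.linalg.solve(X.T @ X, X.T @ y)
    w = lam / np.abs(b_ls)                           # effective weights
    step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # 1/L for grad of 0.5||y-Xb||^2
    b = b_ls.copy()
    for _ in range(n_iter):
        b = soft(b + step * (X.T @ (y - X @ b)), step * w)
    return b, b_ls, w
```

At the minimizer, the subgradient condition reads (X'X(β̂_AL − β̂_LS))_j = −w_j s_j with |s_j| ≤ 1, which is exactly the bounded closed set described in the text; partial tuning (some λ_j = 0) pins the corresponding components of X'X(β̂_AL − β̂_LS) to zero.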

Consistency in parameter estimation and model selection
We start our investigation by deriving the pointwise convergence rate of the estimator.
Proposition 3 (Pointwise convergence rate). Let a_n = min(√n, n/λ*). Then the adaptive Lasso estimator is pointwise a_n-consistent for β in the sense that for every δ > 0, there exists a real number M_{β,δ} such that sup_n P_β(a_n‖β̂_AL − β‖ > M_{β,δ}) ≤ δ.

The fact that the pointwise convergence rate is given by n^{1/2} only if λ*/n^{1/2} does not diverge has implicitly been noted in Zou (2006)'s oracle property in Theorem 2 of that reference, reflected in the assumption λ*/n^{1/2} → 0. In the one-dimensional case, it can be learned from Theorem 5, Part 2, in Pötscher & Schneider (2009) that the sequence n^{1/2}(β̂_AL − β) is not stochastically bounded if λ*/n^{1/2} diverges. However, neither of these references determines the slower rate of n/λ* explicitly when it applies.
The uniform convergence rate is presented in the next proposition.
Proposition 4 (Uniform convergence rate). Let b_n = min(√n, (n/λ*)^{1/2}). Then the adaptive Lasso estimator is uniformly b_n-consistent for β in the sense that for every δ > 0, there exists a real number M_δ such that sup_{n∈N} sup_{β∈R^p} P_β(b_n‖β̂_AL − β‖ > M_δ) ≤ δ.

Proposition 4 shows that the uniform convergence rate is slower than n^{1/2} if λ* → ∞. The fact that the uniform rate may differ from the pointwise one has been noted in Pötscher & Schneider (2009). Unless the estimator is inconsistent in parameter estimation, the uniform convergence rate from Proposition 4 is slower than the pointwise one and can, indeed, not be improved upon. The latter statement is substantiated by Theorem 7 in Section 5, which shows that the limit of b_n(β̂_AL − β_n) is non-zero for certain sequences β_n.
Theorem 5 (Consistency in parameter estimation). The following statements are equivalent.

Condition (d) in Theorem 5 states that, asymptotically, the adaptive Lasso does not choose underparametrized models. It underlines the fact that λ*/n → 0 is a basic condition that we will assume in all subsequent statements.
Theorem 6 (Consistency in model selection). Suppose that λ*/n → 0 as n → ∞. If λ_j → ∞ as well as √n λ_j/λ* → ∞ as n → ∞ for all j = 1, . . ., p, then the adaptive Lasso estimator performs consistent model selection in the sense that P_β({j : β̂_AL,j ≠ 0} = A) → 1 for every β ∈ R^p.

Remark. Inspecting the proof of Theorem 6 shows that in fact a more refined statement holds. Assume that λ*/n → 0. We then have that P_β(β̂_AL,j = 0) → 0 whenever j ∈ A, and P_β(β̂_AL,j = 0) → 1 whenever j ∉ A, provided that λ_j → ∞ and √n λ_j/λ* → ∞ for this particular component j. This statement is in particular interesting for the case of partial tuning where some λ_j are set to zero and the corresponding components are not penalized, revealing that the other components can still be tuned consistently in this case.
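The two tuning regimes behind Theorem 6 are easy to simulate in the special case of an exactly orthogonal design X'X = nI_p (an illustrative special case, not the paper's general setting): under the normalization assumed earlier, the estimator has the closed form β̂_AL,j = sgn(β̂_LS,j) max(|β̂_LS,j| − λ_j/(n|β̂_LS,j|), 0), so a component is set to zero exactly when β̂²_LS,j ≤ λ_j/n. Bounded tuning then detects a true zero with limiting probability below one (conservative), while λ_j → ∞ with λ_j/n → 0 pushes that probability to one (consistent).

```python
import numpy as np

def zero_detection_prob(n, lam_j, beta_j=0.0, reps=200_000, seed=0):
    """Empirical P(beta_AL_j = 0) for one component under an exactly
    orthogonal design X'X = n*I with sigma = 1, where beta_LS_j ~ N(beta_j, 1/n).
    With the closed form above, the component is zero iff beta_LS_j^2 <= lam_j/n."""
    rng = np.random.default_rng(seed)
    b_ls = beta_j + rng.standard_normal(reps) / np.sqrt(n)
    return np.mean(b_ls ** 2 <= lam_j / n)
```

For a true zero coefficient, n β̂²_LS,j is χ²₁-distributed (σ = 1), so bounded λ_j = 1 yields a detection probability of P(χ²₁ ≤ 1) ≈ 0.68 for all n, whereas λ_j = √n drives it to one while leaving nonzero coefficients essentially untouched.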

Asymptotic distribution
In this section we investigate the asymptotic distribution. We perform our analysis for the case when λ* → ∞ which, by Theorem 6, encompasses the tuning regime of consistent model selection and often is the regime of choice in applications. If the estimator is tuned uniformly over all components, the condition λ* → ∞ is in fact equivalent to consistent tuning, given the basic condition λ*/n → 0.
The requirement λ* → ∞ also corresponds to the case where the convergence rate of the adaptive Lasso estimator is given by (n/λ*)^{1/2} rather than n^{1/2}, as can be seen from Proposition 4. Pötscher & Schneider (2009, 2010) demonstrate that in order to get a representative and full picture of the behavior of the estimator from asymptotic considerations, one needs to consider a moving-parameter framework where the unknown parameter β = β_n is allowed to depend on sample size. For these reasons, we study the asymptotic distribution of (n/λ*)^{1/2}(β̂_AL − β_n), which is done in the following.
Throughout Section 5 and Section 6, let λ0 ∈ [0, 1]^p and ψ ∈ [0, ∞]^p be defined by λ0_j = lim_{n→∞} λ_j/λ* and ψ_j = lim_{n→∞} √λ*/λ_j, measuring two different deviations between each tuning parameter and the maximal one. Note that we have λ0 = (1, . . ., 1)' and ψ = 0 for uniform tuning, and that not penalizing the j-th parameter leads to ψ_j = ∞ and λ0_j = 0. Note that assuming the existence of these limits does not pose a restriction, as we could always perform our analyses on convergent subsequences and characterize the limiting behavior for all accumulation points.

There are a few things worth mentioning about Theorem 7. First of all, in contrast to the one-dimensional case, the asymptotic limit of the appropriately scaled and centered estimator may still be random. However, this can only occur if ψ_j is non-zero and finite for some component j, meaning that the maximal tuning parameter diverges faster (in some sense) than the tuning parameter for the j-th component, but not too much faster. When no randomness occurs in the limit, the rate of the stochastic component of the estimator is smaller by an order of magnitude than that of the bias component. In particular, this will always be the case for uniform tuning, where ψ = 0.
As is expected, the proof of Theorem 7 is carried out by looking at the corresponding asymptotic minimization problem of the quantity of interest, which can be shown to be the minimization of V_φ. However, since this limiting function is not finite on an open subset of R^p, the reasoning for why the appropriate minimizers converge in distribution to the minimizer of V_φ is not as straightforward as might be anticipated.
The assumption in the above theorem that n^{1/2}β_{n,j}(λ*)^{1/2}/λ_j converges in R̄ for each j is not restrictive in the sense that otherwise, we simply revert to convergent subsequences and characterize the limiting behavior for all accumulation points, which will prove to be all we need for Proposition 8 and the confidence regions in Section 6. While we cannot explicitly minimize V_φ for a fixed φ ∈ R̄^p other than in trivial cases, surprisingly, we can still explicitly deduce the set of all minimizers of V_φ over all φ ∈ R̄^p, which yields the same set regardless of the realization of Z in V_φ. This is done in the following proposition.
Proposition 8 (Set of minimizers). Define Then for any ω ∈ Ω we have

So, while the limit of (n/λ*)^{1/2}(β̂_AL − β_n) will, in general, be random, the set M is not. In fact, Proposition 8 shows that for any ω, the union of limits over all possible sequences of unknown parameters is always given by the same compact set M. This observation is central for the construction of confidence regions in the following section. It also shows that while, in general, a stochastic component will survive in the limit, it is always restricted to have bounded support that depends on the regressor matrix and the tuning parameters through the matrix C and the quantities ψ and λ0. Interestingly, M only depends on ψ through the components where ψ_j = ∞, in which case the set M loses a dimension. This can be seen as a result of the j-th component being penalized much less than the maximal one, so that the scaling factor used in Theorem 7 is not large enough for this component to survive in the limit. Note that in the case of uniform tuning, where ψ = 0 and λ0 = (1, . . ., 1)', M does not depend on the sequence of tuning parameters at all. Also, we have M = [−1, 1] for p = 1 and C = 1, a fact that has been shown in Pötscher & Schneider (2009) and used in Pötscher & Schneider (2010).
A simple "quick-and-dirty" way to motivate the result in Proposition 8 is to rewrite and observe that the second term on the right-hand side is o_p(1), whereas the first term is always contained in the set by Lemma 1, which contains the set M in the limit. Theorem 7 and Proposition 8 can therefore be viewed as the theory that makes this observation precise by sharpening the set and showing that it contains only the limits. This can then be used for constructing confidence regions, which is done in the following section.
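The benchmark fact M = [−1, 1] for p = 1 and C = 1, quoted above, can be reproduced numerically under the orthogonal closed-form sketch used earlier (an illustration under our assumed normalization, not the paper's general construction): sweeping moving parameters β_n = m √(λ*/n) over a grid and recording √(n/λ*)(β̂_AL − β_n) essentially traces out the interval [−1, 1].

```python
import numpy as np

def scaled_limits(n=10**6, lam=10**3, m_grid=None, seed=0):
    """For p = 1, C = 1 (design with x'x = n, sigma = 1), return the scaled
    quantity sqrt(n/lam) * (beta_AL - beta_n) for one noise draw per moving
    parameter beta_n = m*sqrt(lam/n), using the closed-form adaptive Lasso."""
    if m_grid is None:
        m_grid = np.linspace(-4.0, 4.0, 161)
    rng = np.random.default_rng(seed)
    vals = []
    for m in m_grid:
        beta_n = m * np.sqrt(lam / n)
        b_ls = beta_n + rng.standard_normal() / np.sqrt(n)   # exact LS distribution
        b_al = np.sign(b_ls) * max(abs(b_ls) - lam / (n * abs(b_ls)), 0.0)
        vals.append(np.sqrt(n / lam) * (b_al - beta_n))
    return np.array(vals)
```

For |m| ≤ 1 the estimator is thresholded to zero and the scaled quantity is approximately −m; for |m| > 1 it is approximately −1/m; the union over m is therefore [−1, 1], up to noise of order 1/√λ*, matching the quick-and-dirty argument above.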

Confidence regions -coverage and shape
The insights from Theorem 7 and Proposition 8 can now be used for deriving the following theorem on confidence regions.
Theorem 9. (a) If λ0 ∈ (0, 1]^p, then for any d > 1 we have
(b) If ψ ∈ {0, ∞}^p, then any closed and proper subset C of M fulfills

Note that for uniform tuning, both refinements hold, since ψ = 0 and λ0 = (1, . . ., 1)'. Part (a) holds since, under the given assumptions, M_d has non-empty interior and therefore contains an open superset of M. Part (b) hinges on the fact that the limits in Theorem 7 are always non-random under the given assumptions.
Casually put, Theorem 9 and the subsequent remark show the following. The set M = M_1 acts as a benchmark for confidence sets in the sense that if we take a "slightly larger" set, multiply it with the appropriate factor and center it at the adaptive Lasso estimator, we get a confidence region with minimal asymptotic coverage probability equal to 1. If, however, we base the region on a "slightly smaller" set than M, we end up with a confidence set of minimal asymptotic coverage 0. Nothing can be concluded from the above when using M itself. We discuss this in more depth in the following.
We focus on the case where λ0 ∈ (0, 1]^p, i.e., the case where all components of λ0 are non-zero (implying ψ = 0). This means that all components are penalized at the same rate, which is obviously fulfilled for uniform tuning. In this case, the asymptotic distribution is a mere point-mass with no stochastic part surviving in the limit, as can be seen from Theorem 7. The reason for this is the fact that when controlling for the bias of the estimator (by scaling with the reciprocal of the uniform convergence rate), the stochastic part vanishes asymptotically. In other words, the appropriate scaling factor is simply not large enough to keep the random component alive in the limit, illustrating that the bias is of larger order than the stochastic component when viewed under a uniform lens, a fact that is generally inherent to penalized estimators.
Given the above considerations, one might ask what happens when the confidence region is based on M_{d_n}, where d_n may vary with n. The following theorem addresses this question by giving upper and lower bounds for the corresponding coverage probabilities. To state the theorem, we define the finite-sample version M̂_d of M_d as which differs from M_d only in that C and λ0 are replaced by their finite-sample equivalents X'X/n and λ/λ*, respectively, so that M̂_d converges to M_d (in the Hausdorff metric). We now provide lower and upper bounds depending on whether and how d_n converges to 1 in relation to λ*.
Theorem 10. Assume that λ0 ∈ (0, 1]^p and let ν = lim_{n→∞} √λ*(d_n − 1) ∈ R̄. We then have lim sup and, for ν > 0, lim inf where l_0 = Σ_{j=1}^p λ0_j, and Φ and F_{χ²_p} denote the cdfs of a standard normal and a chi-squared distribution with p degrees of freedom, respectively. The symbol κ_C stands for the condition number of C with respect to the spectral norm, i.e., the ratio of the largest and the smallest eigenvalue.
Remark. (a) Theorem 10 can be shown to still hold true when M_{d_n} is replaced by its finite-sample counterpart M̂_{d_n}, with a slight adaptation of the constant ν involving the convergence rates of X'X/n to C and of λ/λ* to λ0.
(b) If d_n = 1 for all n, implying that the confidence region is based on M_1, the above theorem provides 0 as a lower and 1/2 as an upper bound. The lower bound can, in fact, be shown to be strict, implying that using M_1 will always yield a positive asymptotic coverage (bounded by 1/2) when all components of λ0 are non-zero.
(c) Pötscher & Schneider (2010) prove that in the one-dimensional and Gaussian case, the upper bound of Theorem 10 is sharp: the interval possesses an asymptotic infimal coverage probability of Φ(ν/(2σ)), which is precisely the upper bound in the above theorem.
(d) Lemma 13, on which the proof of the second statement in the above theorem is based, reveals that for any d > 1, the coverage probability of M_d converges to 1 at a rate of at least 1/λ*.
Theorem 10 furthermore allows us to illustrate the following. Assume that the confidence region based on M_{d_n} has asymptotic coverage strictly between 0 and 1 (implying that d_n → 1). Then this region will asymptotically not differ in volume from sets that exhibit asymptotic coverage probability 1. In fact, it can be shown that there exists a sequence d̃_n such that M_{d̃_n} has asymptotic coverage 1 and satisfies where µ_p denotes p-dimensional Lebesgue measure. This states that the ratio of volumes will tend to 1, even faster than at rate (λ*)^q for any q < 1/2. It demonstrates a peculiar nature inherent to the estimation method, differing strongly from the standard approach through the LS estimator.
One might wonder now how this type of confidence region compares to the confidence ellipse based on the LS estimator. Note that the regions will be multiplied by different factors and centered at different estimators. In general, the following observation can be made. For 0 < α < 1, let E_α = {z ∈ R^p : z'Cz ≤ k_α} with k_α > 0 be such that β̂_LS − n^{−1/2}E_α is an asymptotic (1−α)-confidence region for β. If we contrast this with β̂_AL − (λ*/n)^{1/2}M, we see that, since both E_α and M have positive, finite volume and since λ* → ∞, the regions based on the adaptive Lasso are always larger by an order of magnitude. This phenomenon is a special case of what has been found for any consistently tuned model selection estimator in Pötscher (2009).
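The order-of-magnitude comparison in this paragraph can be spelled out in one line: with µ_p denoting p-dimensional Lebesgue measure, the ratio of the volumes of the two regions is

```latex
\frac{\mu_p\!\left((\lambda^*/n)^{1/2} M\right)}{\mu_p\!\left(n^{-1/2} E_\alpha\right)}
  = \frac{(\lambda^*/n)^{p/2}\,\mu_p(M)}{n^{-p/2}\,\mu_p(E_\alpha)}
  = (\lambda^*)^{p/2}\,\frac{\mu_p(M)}{\mu_p(E_\alpha)}
  \;\longrightarrow\; \infty,
```

which diverges since µ_p(M) and µ_p(E_α) are positive and finite while λ* → ∞.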
Finally, we illustrate the shape of M. We start with p = 2 and the matrix We consider the case of uniform tuning, so that λ0 = (1, 1)' and ψ = (0, 0)', and show the resulting set M in Figure 1. The color indicates the value of max_{j=1,2} m_j(Cm)_j at the specific point m inside the set. The higher the absolute value of the correlation of the covariates, the flatter and more stretched the confidence set becomes. As one may expect intuitively, in case of negative correlation, the confidence set covers more of the area where the signs of the covariates are equal, as can be seen in Figure 1. A positive correlation causes the opposite behavior. Note that the corners of the set M touch the boundary of the ellipse E_α for a certain value of k_α. For the case of p = 3, we again start with an example with uniform tuning, so that λ0 = (1, 1, 1)' and ψ = (0, 0, 0)', and consider the matrix The resulting set M is depicted in Figure 2. To give a better impression of the shape, the set is colored depending on the value of the third coordinate. Here, the high correlation between the first and third covariate stretches the set in the direction where the signs of the covariates differ. Figure 2(b) shows the projections of the three-dimensional set of Figure 2(a) onto three planes, where one component is held fixed at a time. The projection onto the plane where the second component is held constant clearly shows the behavior explained above. On the other hand, the other two projections emphasize that for covariates with a lower correlation in absolute value, the confidence set is less distorted. Finally, Figure 3 illustrates the partially tuned case with the same matrix C.
The first component is not penalized whereas the remaining ones are tuned uniformly. This implies that λ0 = (0, 1, 1)' and ψ = (∞, 0, 0)'. Due to the condition (Cm)_1 = 0 for all m ∈ M, the resulting set is the intersection of a plane with the set in Figure 2(a). The fact that the confidence set is only two-dimensional might appear odd; it stems from the unpenalized component exhibiting a faster convergence rate, so that the factor (λ*/n)^{1/2} with which M is multiplied is not large enough for this component to survive in the limit.

Summary and conclusions
We give a detailed study of the asymptotic behavior of the adaptive Lasso estimator with partially consistent and partial tuning in a low-dimensional linear regression model, in terms of both consistency and distributional properties. We do so within a framework that takes into account the non-uniform behavior of the estimator, non-trivially generalizing results from Pötscher & Schneider (2009) that were derived for the case of orthogonal regressors. We also demonstrate and formalize what these distributional results imply for valid confidence regions, namely that there exists a "benchmark" set M such that open supersets have asymptotic coverage equal to 1, whereas "slightly smaller" sets exhibit uniform coverage 0 in the limit. The reason for this phenomenon lies in the different rates of the bias component and the stochastic component of the estimator. A similar effect has been observed before for the one-dimensional case in Pötscher & Schneider (2010). We illustrate the shape of M and demonstrate the effect of componentwise tuning at different rates, as well as the implications of partial tuning on the confidence set.

A Appendix -Proofs
We introduce the following additional notation for the proofs. The symbol e_j denotes the j-th unit vector in R^p and the sign function is given by sgn(x) = 1_{x>0} − 1_{x<0} for x ∈ R. For a function g : R^p → R, the one-sided directional derivative of g at u in the direction of r ∈ R^p is denoted by D_r g(u), given by D_r g(u) = lim_{h↓0} (g(u + hr) − g(u))/h.
For a vector u ∈ R^p and an index set I ⊆ {1, . . ., p}, u_I ∈ R^{|I|} contains only the components of u corresponding to indices in I. Finally, →_p denotes convergence in probability.

A.1 Proofs for Section 3
Proof of Lemma 1. Consider the function which can, using the normal equations of the LS estimator, be rewritten to Note that G_n is minimized at β̂_AL − β̂_LS and that, since all directional derivatives have to be non-negative at the minimizer of a convex function, after some basic calculations we get for all j = 1, . . ., p. When β̂_AL,j = 0, this implies that holds. When β̂_AL,j ≠ 0, the equations in (1) imply (X'X(β̂_AL − β̂_LS))_j = −λ_j sgn(β̂_AL,j)/|β̂_LS,j|. In any case, λ_j = 0 implies (X'X(β̂_AL − β̂_LS))_j = 0, which completes the proof.
Proof of Corollary 2. By Lemma 1, we have Since X'X/n → C with C being positive definite, the claim follows.

A.2 Proofs for Section 4
Proof of Proposition 3. Consider the function H_{n,β} : R^p → R defined by H_{n,β}(u) = a_n²(L_n(u/a_n + β) − L_n(β))/n, which can be written as H_{n,β} is minimized at a_n(β̂_AL − β) and, since H_{n,β}(0) = 0, we have H_{n,β}(a_n(β̂_AL − β)) ≤ 0, which implies that where in the latter sum we have dropped the non-positive terms for j ∉ A and have used the fact that |β_j| − |u_j/a_n + β_j| ≤ |u_j/a_n| on the terms for j ∈ A. Now note that both a_n/√n and a_nλ_j/n are bounded by 1 and that the sequences X'ε/√n and 1/β̂_LS,j for j ∈ A are tight, so that we can bound the right-hand side of the above inequality by a term that is stochastically bounded times ‖a_n(β̂_AL − β)‖. Moreover, since X'X/n converges to C and all matrices are positive definite, we can bound the left-hand side of the above inequality from below by a positive constant times ‖a_n(β̂_AL − β)‖², so that we arrive at which proves the claim.
Proof of Proposition 4. Let L > 0 denote the infimum of all eigenvalues of X'X/n and C, taken over n, and note that b_n²λ*/n ≤ 1. By Lemma 1 we have For any M ≥ 2pL we therefore have The claim now follows from the uniform √n-consistency of the LS estimator.
Proof of Theorem 5. We have (c) ⟹ (b) by Proposition 4 and clearly, (b) ⟹ (a) holds. To show (a) ⟹ (c), assume that β̂_AL is consistent for β and that λ_j/n_k → c ∈ (0, ∞] for some j along a subsequence n_k. Let β_j ≠ 0. On the event β̂_AL,j ≠ 0, which by consistency has asymptotic probability equal to 1, we have by Equation (3). By consistency and the convergence of X'X/n, the left-hand side converges to zero in probability, whereas the right-hand side converges to c/|β_j| > 0 in probability along the subsequence n_k, yielding a contradiction. This shows the equivalence of the first three statements. Moreover, (a) ⟹ (d) since for j ∈ A by consistency in parameter estimation.
The final implication we show is (d) ⟹ (c). For this, assume that λ*/n does not converge to 0, so that there exists a subsequence n_k such that λ_j/n_k → c > 0 as n_k → ∞ for some j. We first look at the case of c = ∞. Note that β̂_AL is stochastically bounded, since As X'X/n → C and X'y/n converges to Cβ in probability, the quadratic term on the left-hand side dominates the linear term on the right-hand side, which is only possible if β̂_AL is O_p(1). Now note that by Equation (3), β̂_AL,j ≠ 0 implies The fact that X'X/n_k → C and that β̂_AL and β̂_LS are stochastically bounded for fixed β shows that the left-hand side of the above display is bounded in probability as well. The right-hand side, however, diverges to ∞ regardless of the value of β_j. We therefore have P_β(β̂_AL,j = 0) → 1 for all β_j ∈ R, which is a contradiction to (d). If c < ∞, we first observe that X'X/n(β̂_AL − β̂_LS) is always contained in a compact set by Lemma 1 and the convergence of X'X/n to C. This implies that ‖X'X/n(β̂_AL − β̂_LS)‖_∞ ≤ L < ∞ for some L > 0 and for all β. Again, by Equation (3), whenever β̂_AL,j ≠ 0. The left-hand side is bounded by L, whereas the right-hand side converges to c/|β_j| in probability. We therefore get P_β(β̂_AL,j = 0) → 1 for all β_j ∈ R satisfying |β_j| < c/L, also yielding a contradiction to (d).
Proof of Theorem 6. Since the condition λ*/n → 0 guards against false negatives asymptotically by Theorem 5, we only need to show that the estimator detects all zero coefficients with asymptotic probability equal to one. Assume that β_j = 0 and that β̂_AL,j ≠ 0. The partial derivative of L_n with respect to b_j at b_j ≠ 0 is given by Since β̂_AL is a_n-consistent for β, X'X/n converges, a_n/n^{1/2} ≤ 1 and X'ε/√n is tight, the left-hand side of the above display is stochastically bounded. The behavior of the right-hand side is governed by λ_j a_n/√n, as √n β̂_LS,j is also stochastically bounded for β_j = 0. If a_n/√n does not converge to zero, then the right-hand side diverges because λ_j does. If a_n/√n → 0, we have a_n = n/λ* eventually, so that λ_j a_n/√n = √n λ_j/λ*, which also diverges by assumption.

A.3 Proofs for Section 5
Lemma 11. Assume that λ*/n → 0 and λ* → ∞. Moreover, suppose that ψ_{n,j} = √λ*/λ_j → ψ_j ∈ [0, ∞] and φ_{n,j} = √n β_{n,j} √λ*/λ_j → φ_j ∈ R̄. Then for any u_j ∈ R, the term

Proof of Lemma 11. Note that if u_j = 0, the term A_{n,β_n,j} is clearly equal to 0, so that we assume u_j ≠ 0 in the following. Define ζ_{n,j} = √(n/λ*) β_{n,j} → ζ_j ∈ R̄ and notice that |ζ_j| ≤ |φ_j|, as well as We now look at the case where |φ_j| = ∞. The term |A_{n,β_n,j}(u_j)| is bounded by If |ζ_j| = ∞ also, the above expression tends to zero in probability. If 0 < |ζ_j| < ∞, the same expression converges to λ0_j|u_j|/|ζ_j| in probability. But in this case, we necessarily have λ0_j = 0, so that the limit also equals zero. If ζ_j = 0, rewrite the above bound as |u_j|/|ψ_{n,j} Z_{n,j} + φ_{n,j}|, which clearly converges to zero in probability when ψ_j < ∞. If ψ_j = ∞, note that the above display converges to zero in probability if and only if for any δ > 0, the expression converges to zero, which it does by Pólya's Theorem. We next turn to the case where ψ_j = ∞. If |φ_j| = ∞ also, the limit equals zero by the above. If |φ_j| < ∞, since |A_{n,β_n,j}(u_j)| is bounded by it will converge to zero in probability.
Let us now consider the case where $\phi_j = \psi_j = 0$. We write $A_{n,\beta_n,j}(u_j)$ as a quotient which clearly diverges, since $u_j \neq 0$, $|\zeta_{n,j}| \le |\phi_{n,j}| \to 0$, and the denominator tends to 0 in probability.
For the remaining cases where $u_j \neq 0$, $|\phi_j|, \psi_j < \infty$ and $\max(|\phi_j|, \psi_j) > 0$, note that $A_{n,\beta_n,j}(u_j)$ can be rewritten suitably and that $\zeta_{n,j} \to \zeta_j = \lambda_j^0 \phi_j$. The joint distributional convergence of $\sum_j A_{n,\beta_n,j}(u_j)$ to $\sum_j A_{\phi,j}(u_j)$ then follows trivially.
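The identity $\zeta_j = \lambda_j^0 \phi_j$ and the bound $|\zeta_j| \le |\phi_j|$ used in this proof can be checked directly from the definitions, assuming $\lambda^* = \max_{1 \le j \le p} \lambda_j$ so that $\lambda_j/\lambda^* \le 1$:

```latex
\zeta_{n,j}
= \sqrt{\tfrac{n}{\lambda^*}}\,\beta_{n,j}
= \frac{\lambda_j}{\lambda^*}\cdot\frac{\sqrt{n}\,\beta_{n,j}\,\sqrt{\lambda^*}}{\lambda_j}
= \frac{\lambda_j}{\lambda^*}\,\phi_{n,j},
\qquad
\frac{\lambda_j}{\lambda^*} \to \lambda_j^0 \in [0,1].
```

Hence $|\zeta_{n,j}| \le |\phi_{n,j}|$ for every $n$, and in the limit $\zeta_j = \lambda_j^0 \phi_j$.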
Proof of Theorem 7. Define $V_{n,\beta_n}(u) = \frac{1}{\lambda^*}\bigl[L_n(\sqrt{\lambda^*/n}\,u + \beta_n) - L_n(\beta_n)\bigr]$ and notice that $V_{n,\beta_n}$ is minimized at $\sqrt{n/\lambda^*}(\hat\beta_{AL} - \beta_n)$. The function $V_{n,\beta_n}$ can be shown to equal the sum of terms involving $X'X/n$ and $X'\varepsilon/\sqrt{n}$ and of $\sum_j A_{n,\beta_n,j}(u_j)$, where $A_{n,\beta_n,j}(u_j)$ is defined in Lemma 11. Since $X'X/n \to C$, $X'\varepsilon/\sqrt{n}$ is stochastically bounded and $\lambda^* \to \infty$, invoking Lemma 11 shows that $V_{n,\beta_n}(u)$ converges in distribution to $V_\phi(u)$. We now wish to deduce the same for the corresponding minimizers $m_n$ and $m$. As explained in Section 5, the limiting function $V_\phi$ is not finite on an open subset of $\mathbb{R}^p$, so we cannot invoke the usual theorems employed in such a context. Instead, we define a new sequence of functions whose minimizers behave similarly but whose limiting function remains finite. To this end, we let $\mathcal{I} = \{j : \max(|\phi_j|, \psi_j) > 0\}$ and assume without loss of generality that $\mathcal{I} = \{1, \ldots, \tilde p\}$ with $\tilde p \le p$ to ease notation with indices. Now consider the restricted functions $\tilde V_{n,\beta_n}, \tilde V_\phi : \mathbb{R}^{\tilde p} \to \mathbb{R}$. The sequence $m_n$ is stochastically bounded by Proposition 4, but then so is the left-hand side of the resulting inequality by Lemma 11. The right-hand side, however, tends to $-\infty$ whenever $m_{n,\mathcal{I}^c}$ does not vanish asymptotically. To piece together the missing parts for the minimizers $m_n$ and $m$ of $V_{n,\beta_n}(u)$ and $V_\phi(u)$, respectively, first note that $m_{\mathcal{I}^c} = 0$, since otherwise $V_\phi$ is infinite; combining these observations finishes the proof.

Proposition 12. The point $m \in \mathbb{R}^p$ is a minimizer of $V_\phi$ if and only if $m_j = 0$ whenever $\phi_j = \psi_j = 0$, $(Cm)_j = 0$ whenever $|\phi_j| = \infty$ or $\psi_j = \infty$, and the corresponding componentwise subgradient conditions hold in the remaining cases.

Proof of Proposition 12. Clearly, $m_j = 0$ if $\phi_j = \psi_j = 0$, as otherwise $V_\phi$ is infinite. The other conditions follow immediately by noting that $m$ is a minimizer of the convex function $V_\phi$ if and only if $0$ is a subgradient of $V_\phi$ at $m$.
Proof of Proposition 8. "⊆": We first show that the union of minimizers is contained in the set $\mathcal{M}$. For this, let $m = \arg\min_u V_\phi(u)$ for some $\phi \in \overline{\mathbb{R}}^p$. We distinguish three cases.
Firstly, if $\phi_j = \psi_j = 0$, we have $m_j = 0$, which immediately implies $m_j (Cm)_j = 0 \le \lambda_j^0$. If secondly $|\phi_j| = \infty$ or $\psi_j = \infty$, Proposition 12 implies that $(Cm)_j = 0$, which also yields $m_j (Cm)_j = 0 \le \lambda_j^0$. Thirdly, if $0 < \max(|\phi_j|, \psi_j) < \infty$, we consider two subcases. When $\psi_j > 0$, $\lambda_j^0 = 0$ necessarily holds. Here, if $m_j = 0$, we immediately have $m_j (Cm)_j = 0 = \lambda_j^0$. Otherwise, $m_j \neq 0$ implies the required inequality by Proposition 12. The other subcase of $\psi_j = 0$ can be treated analogously, using Proposition 12 in case $m_j = -\lambda_j^0 \phi_j$.

"⊇": Conversely, for $m \in \mathcal{M}$ we choose $\phi$ accordingly and show that $m$ is a minimizer of the resulting function $V_\phi$. First note that since $m \in \mathcal{M}$, $\psi_j = \infty$ immediately implies $(Cm)_j = 0$, satisfying the second condition of Proposition 12. We therefore assume that $\psi_j < \infty$ in the following and go through the three definitions in (4). If $(Cm)_j = 0$, then the second condition in Proposition 12 is satisfied. When $\phi_j = -m_j/\lambda_j^0$, the condition $\lambda_j^0 > 0$ implies that $\psi_j = 0$. So when $m_j = 0$, we are in the case where $\phi_j = \psi_j = 0$ and the first condition in Proposition 12 is fulfilled. If $m_j \neq 0$, the remaining condition in Proposition 12 can be verified directly.

According to (4) in the proof of Proposition 8, $m$ then is the unique minimizer of the corresponding function $V_\phi$. This can be seen by noting that $(Cm)_j = 0$ if and only if $\lambda_j^0 = 0$, as well as $\psi_j > 0$ implying that $\lambda_j^0 = 0$. It is crucial to observe that the function $V_\phi$ is non-random in this case and that $\mathcal{M}_d$ is closed. Now take any sequence $(\beta_n)_{n \in \mathbb{N}} \subseteq \mathbb{R}^p$ converging to $\phi$.

The following lemma is the basis for the proof of Theorem 10. For a symmetric matrix $A$, we denote by $\kappa_A$ the condition number of $A$ with respect to the spectral norm, i.e., the ratio of the largest to the smallest eigenvalue of $A$ (in absolute value).
Proof. Let $a = c_n\bigl(1 - c_n/(2\sqrt{l_n \lambda^*})\bigr)$. The statement is trivial when $a < 0$. Note that by Lemma 1, $\hat\beta_{AL} - \hat\beta_{LS}$ is an element of $\sqrt{\lambda^*/n}\,\mathcal{M}_1$. If $a = 0$, the event on the right-hand side implies $X'\varepsilon = 0$ and therefore $\hat\beta_{LS} - \beta = (X'X)^{-1} X'\varepsilon = 0$. But then we get $\hat\beta_{AL} - \hat\beta_{LS} = \hat\beta_{AL} - \beta \in \sqrt{\lambda^*/n}\,\mathcal{M}_1$, which implies the claim since $\mathcal{M}_1 \subseteq \mathcal{M}_{d_n}$.

We now prove the statement for $a > 0$. If we can show that $z + m \in \sqrt{\lambda^*/n}\,\mathcal{M}_{d_n}$ whenever $z'X'Xz \le a^2$ and $m \in \sqrt{\lambda^*/n}\,\mathcal{M}_1$, then for all $\beta \in \mathbb{R}^p$,
$$\begin{aligned}
P_\varepsilon\bigl(\varepsilon' X (X'X)^{-1} X' \varepsilon \le a^2\bigr)
&= P_\beta\Bigl((\hat\beta_{LS} - \beta)' X'X (\hat\beta_{LS} - \beta) \le a^2,\; \hat\beta_{AL} - \hat\beta_{LS} \in \sqrt{\lambda^*/n}\,\mathcal{M}_1\Bigr) \\
&\le P_\beta\Bigl(\hat\beta_{LS} - \beta + \hat\beta_{AL} - \hat\beta_{LS} \in \sqrt{\lambda^*/n}\,\mathcal{M}_{d_n}\Bigr)
= P_\beta\Bigl(\beta \in \hat\beta_{AL} - \sqrt{\lambda^*/n}\,\mathcal{M}_{d_n}\Bigr),
\end{aligned}$$
which is what we have to prove. It therefore only remains to show that $z + m \in \sqrt{\lambda^*/n}\,\mathcal{M}_{d_n}$ whenever $z'X'Xz \le a^2$ and $m \in \sqrt{\lambda^*/n}\,\mathcal{M}_1$. To do so, we show that $(z+m)_j (X'X(z+m))_j \le \lambda_j d_n$ for all $j$. As $a > 0$ implies $\lambda_j > 0$ for all $j$, this suffices to conclude $z + m \in \sqrt{\lambda^*/n}\,\mathcal{M}_{d_n}$. Clearly, $m \in \sqrt{\lambda^*/n}\,\mathcal{M}_1$ implies $m'X'Xm \le \sum_{j=1}^p \lambda_j$, which in turn bounds $\|m\|_\infty$ and $\|X'Xm\|_\infty$, where $L_n$ and $U_n$ denote the smallest and largest eigenvalue of $X'X$, respectively. With the same argument, we get $\|z\|_\infty^2 \le a^2/L_n$ and $\|X'Xz\|_\infty^2 \le a^2 U_n$. Equipped with these inequalities, we conclude for every $j$ that
$$(z+m)_j (X'X(z+m))_j = z_j(X'Xz)_j + m_j(X'Xm)_j + z_j(X'Xm)_j + m_j(X'Xz)_j \le \lambda_j d_n,$$
which completes the proof.
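The sup-norm bounds invoked in the proof above follow from standard eigenvalue inequalities; as a sketch, for $z$ with $z'X'Xz \le a^2$:

```latex
\|z\|_\infty^2 \le \|z\|_2^2 \le \frac{z'X'Xz}{L_n} \le \frac{a^2}{L_n},
\qquad
\|X'Xz\|_\infty^2 \le \|X'Xz\|_2^2 = z'(X'X)^2 z \le U_n\, z'X'Xz \le a^2 U_n,
```

using that $X'X \succeq L_n I$ and $(X'X)^2 \preceq U_n\, X'X$ in the positive-semidefinite order. The bounds for $m$ follow in the same way with $a^2$ replaced by $\sum_{j=1}^p \lambda_j$.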

Figure 1: An example of the set $\mathcal{M}$ with uniform tuning in $p = 2$ dimensions.
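Sets of this kind can be explored numerically. The sketch below tests membership using the componentwise criterion $m_j (Cm)_j \le \lambda_j^0$ suggested by the conditions in the proof of Proposition 8; this criterion, as well as the matrix `C` and the tuning limits `lam0`, are illustrative assumptions rather than the paper's exact definition of $\mathcal{M}$.

```python
import numpy as np

def in_M(m, C, lam0):
    """Check the hypothetical membership criterion m_j * (C m)_j <= lambda_j^0
    for every component j."""
    return bool(np.all(m * (C @ m) <= lam0))

# p = 2 with correlated regressors and uniform tuning lambda_j^0 = 1.
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
lam0 = np.array([1.0, 1.0])

print(in_M(np.zeros(2), C, lam0))           # True: the origin satisfies all conditions
print(in_M(np.array([3.0, 3.0]), C, lam0))  # False: 3 * (3 + 1.5) = 13.5 > 1
```

Scanning `in_M` over a grid of points traces out a region like the one depicted in Figure 1.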

Figure 2: An example of the set $\mathcal{M}$ with uniform tuning and $p = 3$ dimensions. The three-dimensional set is depicted in (a), whereas its two-dimensional projections are shown in (b).

Figure 3: An example of the set $\mathcal{M}$ with partial tuning and $p = 3$ dimensions. The first component is not penalized, resulting in the set being contained in a two-dimensional subspace.