Specification Testing in Nonparametric Instrumental Quantile Regression

There are many environments in econometrics which require nonseparable modeling of a structural disturbance. In a nonseparable model with endogenous regressors, key conditions are validity of instrumental variables and monotonicity of the model in a scalar unobservable variable. Under these conditions the nonseparable model is equivalent to an instrumental quantile regression model. A failure of the key conditions, however, makes instrumental quantile regression potentially inconsistent. This paper develops a methodology for testing the hypothesis whether the instrumental quantile regression model is correctly specified. Our test statistic is asymptotically normally distributed under correct specification and consistent against any alternative model. In addition, test statistics to justify the model simplification are established. Finite sample properties are examined in a Monte Carlo study and an empirical illustration is provided.


Introduction
Regression models that involve instrumental variables are widely used in economics to overcome endogeneity problems. In these models, assuming the structural disturbances to be additively separable implies that marginal effects do not depend on unobserved characteristics, which may be difficult to justify. This is why their nonseparable extension has received a lot of attention recently. Under certain key conditions, the nonseparable model is equivalent to an instrumental quantile regression model. These conditions are validity of instruments and monotonicity of the model in a scalar unobservable. If one of these conditions is violated, however, the quantile regression representation is misspecified. In this paper, we propose a specification test of the instrumental quantile regression model $Y = \phi(Z, q) + U(q)$ where $P(U(q) \le 0 \mid W) = q$ (1.1) for each $0 < q < 1$, where Y is a scalar dependent variable, Z a vector of potentially endogenous regressors, W a vector of instruments, and U(q) an unobservable disturbance. This quantile regression model is equivalent to a nonseparable model (cf. Horowitz and Lee [2007]) given by $Y = \phi(Z, V)$ (1.2) where (a.1) the instrumental variable W is independent of V, (a.2) the function ϕ is strictly monotonically increasing in the scalar disturbance V, and (a.3) $V \sim U(0, 1)$.
Condition (a.3) can be assumed without loss of generality if V is continuously distributed with positive density on its support which we assume to hold throughout the paper. The quantile regression model (1.1) for all 0 < q < 1 is thus misspecified if in its nonseparable version (1.2) the instrument is not valid, that is, W is not independent of V , or the function ϕ is not monotonic in V . Specification testing in instrumental variable models is a subject of considerable literature. In the context of nonparametric instrumental mean regression Y = g(Z) + U with E[U |W ] = 0, tests for correct specification have been proposed by Gagliardini and Scaillet [2017], Horowitz [2012], and Breunig [2015]. These tests are, however, not robust against potential nonseparability of the structural disturbance. On the other hand, by considering the nonseparable model (1.2) with conditions (a.1)-(a.3) a failure of the exclusion restriction of the instruments might only be one source of misspecification. Indeed, as argued by Hoderlein and Mammen [2007], in certain applications, such as consumer demand, the monotonicity restriction (a.2) might be highly unrealistic. As such, providing a specification test of model (1.2) together with conditions (a.1)-(a.3) seems paramount but, as far as we know, has not yet been addressed in the literature.
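The equivalence between (1.1) and the nonseparable model under (a.1)-(a.3) can be checked numerically. The following Python sketch is illustrative only: the structural function `phi`, the data-generating design, and the helper `conditional_frequencies` are assumptions made for this example and are not taken from the paper. It simulates Y = ϕ(Z, V) with an endogenous regressor and compares the frequency of the event {Y ≤ ϕ(Z, q)} across bins of a valid instrument and of an invalid one.

```python
import numpy as np

def phi(z, v):
    # Hypothetical structural function; strictly increasing in v because 1 + 0.5*z > 0.
    return z + (1.0 + 0.5 * z) * v

def conditional_frequencies(indicator, w, n_bins=4):
    # Mean of the indicator within equal-width bins of w on [0, 1].
    bins = np.minimum((w * n_bins).astype(int), n_bins - 1)
    return np.array([indicator[bins == b].mean() for b in range(n_bins)])

rng = np.random.default_rng(0)
n, q = 200_000, 0.3

v = rng.uniform(size=n)          # scalar disturbance, V ~ U(0, 1)
w = rng.uniform(size=n)          # valid instrument: independent of V
z = 0.5 * w + 0.5 * v            # endogenous regressor: depends on V
y = phi(z, v)

# Under monotonicity, {Y <= phi(Z, q)} coincides with {V <= q}.
hit = (y <= phi(z, q)).astype(float)

freq_valid = conditional_frequencies(hit, w)       # close to q in every bin
w_bad = 0.5 * w + 0.5 * v                          # invalid instrument: correlated with V
freq_invalid = conditional_frequencies(hit, w_bad) # far from q in the outer bins
```

With the valid instrument the binned frequencies stay close to q = 0.3, while with the invalid instrument they move far away from q, which is the kind of deviation a specification test is meant to detect.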
Research on identification and estimation in nonparametric instrumental quantile regression has been active in the last decade. Chesher [2003] establishes nonparametric identification of derivatives of the unknown functions in a triangular array structure. Chernozhukov and Hansen [2005] and Chernozhukov et al. [2007] give identification conditions and develop a nonparametric minimum distance estimator. Sufficient conditions for local identification are given by Chen et al. [2014]. Horowitz and Lee [2007] propose an estimator based on Tikhonov regularization, Chen and Pouzo [2012] study penalized sieve minimum distance estimation, and Dunker et al. [2014] consider an iteratively regularized Gauß-Newton method. Further, Gagliardini and Scaillet [2012] obtain asymptotic distribution results of a Tikhonov regularized estimator. There is also a large literature on testing quantile regression models with exogenous covariates. In this context particularly relevant is quantile regression testing using an infinite number of quantiles for parametric functions, see Escanciano and Velasco [2010] and, in the nonparametric context, Escanciano and Goh [2014]. In instrumental quantile regression (1.1) for a fixed quantile 0 < q < 1, Horowitz and Lee [2009] established a test of parametric specification of ϕ. Chen and Pouzo [2015] consider functionals of semi/nonparametric conditional moment restrictions with possibly nonsmooth generalized residuals. A test of monotonicity in unobservables of ϕ has been proposed by Hoderlein et al. [2016] but requires conditional exogeneity of Z and hence, is not related to instrumental variables methodology. Recently and independently of this paper, Fève et al. [2018] developed a test of whether Z is independent of the nonseparable disturbance V in the model (1.2). Our test statistic is based on the L 2 -norm of the empirical conditional quantile restriction and involves sieve methodology. 
The sieve approach makes the statistic easy to implement and, further, makes it convenient to impose additional constraints on the structural function ϕ.
As an example, we discuss a test of additivity of ϕ with respect to the vector of regressors Z. In addition, we establish a test statistic for testing exogeneity which is robust against nonseparability. More precisely, we establish a test of exogeneity of the regressors Z at some quantile $0 < q < 1$, that is, whether $P(Y \le \phi(Z, q) \mid Z) = q$. This extends the results on nonparametric tests of exogeneity in mean regression suggested by Blundell and Horowitz [2007] and Breunig [2015] to the quantile regression case. It should also be noted that the test proposed in this paper is a joint test of monotonicity and instrument validity. This is the nature of many nonparametric tests, see, for instance, Chiappori et al. [2015] or Lewbel et al. [2015]. On the other hand, we show in this paper how the sign of $P(Y \le \phi(Z, q) \mid W) - q$ can be exploited to make inferences on the validity of the instrumental variables. As such, in many cases it is possible to detect the cause of a rejection of our test. We establish the asymptotic distribution of our test statistic under the null hypothesis and its consistency against fixed alternatives. We study the power of our test against a sequence of local alternatives. By Monte Carlo simulations we demonstrate the power properties of our test in finite samples. As an empirical illustration, we study a nonseparable model of the effects of class size on test scores of 4th grade students in Israel. We reject the hypothesis of exogeneity of class size but fail to reject the instrumental variable model. The remainder of this work is organized as follows. In Section 2, we propose a test statistic and obtain its asymptotic distribution. We further establish consistency of our test. The power of the test is judged by considering a sequence of local alternatives. Section 3 gives several extensions of the previous results. In Sections 4 and 5 we study the finite sample properties of our test and give an empirical illustration.
All proofs can be found in the appendix.

The test statistic and its asymptotic properties
This section begins with the definition of the test statistic and states assumptions required to obtain its asymptotic distribution under the null hypothesis. Moreover, we study power and consistency properties of our test.

Definition of the test statistic
The quantile regression model (1.1) leads to a nonlinear operator equation, as we see in the following. Let Φ be a Banach space endowed with the norm $\|\varphi\|_{Z,p} := (E|\varphi(Z)|^p)^{1/p}$ for some integer $p > 0$; if $p = \infty$, then $\|\varphi\|_{Z,\infty} := \sup_z |\varphi(z)|$. For simplicity let $\|\varphi\|_Z := \|\varphi\|_{Z,2}$. Further, let us introduce the Hilbert space $L^2_W := \{\psi : \|\psi\|_W^2 := E|\psi(W)|^2 < \infty\}$. We define a nonlinear operator $T : \Phi \to L^2_W$ by $(T\varphi)(w) := E[1\{Y \le \varphi(Z)\} \mid W = w]$ (2.1) for any $\varphi \in \Phi$, where 1 denotes the indicator function. Thereby, model (1.1) can be rewritten as the operator equation $T\phi_q = q$ with $\phi_q(\cdot) := \phi(\cdot, q)$ for all $0 < q < 1$. In many economic applications, for instance when estimating a demand function or Engel curves, the structural function of interest may be assumed to be smooth. This a priori knowledge is captured by a set $B \subset \Phi$ which we introduce below. The set B may also contain constraints on the function $\phi_q$ such as monotonicity, concavity/convexity or additivity (see also Section 3.2) and can also ensure uniqueness of $\phi_q$ (see Example 2.1 below). Let us introduce the set $B_{(0,1)} = \{\varphi : \varphi(\cdot, q) \in B \text{ for all } q \in (0, 1)\}$. We consider the null hypothesis $H_0$: there exists a function $\phi \in B_{(0,1)}$ such that $T\phi_q = q$ for all $q \in (0, 1)$. (2.2) The alternative is that there exists no function $\phi \in B_{(0,1)}$ solving $T\phi_q = q$ for all $q \in (0, 1)$. We construct in the following a test statistic based on the $L^2$-distance. Throughout the paper, we assume that an independent and identically distributed n-sample of (Y, Z, W) is available. Let $\{f_j\}_{j \ge 1}$ be a sequence of approximating functions in $L^2_W$. Then, for any […] where $(\cdot)^-$ denotes a generalized inverse. Further, we define the sieve least squares estimator of $\phi_q$ by […] (2.3) where $B_{k_n}$ is a $k_n$-dimensional sieve space that becomes dense in B as the sample size n tends to infinity. If B contains additional constraints then these are imposed in $B_{k_n}$ on the finite dimensional functions. Here, $k_n$ and $l_n$ grow with sample size n.
Clearly, $k_n \le l_n$ for each n is required and in our simulations we choose $l_n = C k_n$ for some constant $C > 1$ (see also Chen and Christensen [2015] in the case of nonparametric instrumental mean regression). The estimator $\widehat{\phi}_{qn}$ is a simplified version of the penalized sieve minimum distance estimator suggested by Chen and Pouzo [2012]. The test statistic $S_n$ is then given by […] (2.4) where $m_n$ grows with sample size n. As the test is one sided, we reject the null hypothesis at level α when the standardized version of $S_n$, namely $3\sqrt{5/m_n}\,(S_n - m_n/6)$, is larger than the $(1 - \alpha)$-quantile of $N(0, 1)$. The asymptotic distribution of $S_n$ is derived below under mild restrictions on the dimension parameters $k_n$, $l_n$, and $m_n$. We require that the number of unconditional moment restrictions determined by $m_n$ is asymptotically larger than the dimension of the sieve space $B_{k_n}$. This corresponds to the test of overidentifying restrictions in parametric models. In contrast to the parametric setting, however, the number of unconditional moment restrictions used to construct the estimator (determined by $l_n$) must also be asymptotically smaller than the number of moment restrictions used for the test statistic. This ensures that the estimation error in the test statistic becomes asymptotically negligible, as we see below.
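The following Python sketch shows how a statistic of this type can be computed. Since the display defining the statistic is not reproduced in the text above, the form used here is an assumption chosen to be consistent with the stated centering $m_n/6$ and scaling $3\sqrt{5/m_n}$: the squared, $\sqrt{n}$-normalized sample moments of the generalized residual $1\{Y_i \le \widehat{\phi}_{qn}(Z_i)\} - q$ against $f_1, \dots, f_{m_n}$, integrated over a quantile grid. The function names, the cosine basis, and the oracle design (which plugs in the true structural function instead of a sieve estimate) are likewise hypothetical.

```python
import numpy as np

def quantile_grid(num_cells):
    # Interior grid 1/N, ..., (N-1)/N on (0, 1).
    return np.arange(1, num_cells) / num_cells

def cosine_basis(w, m):
    # First m cosine functions, orthonormal on [0, 1] under the uniform law.
    j = np.arange(m)
    basis = np.sqrt(2.0) * np.cos(np.pi * j[None, :] * w[:, None])
    basis[:, 0] = 1.0
    return basis

def integrated_statistic(y, fitted, basis, grid):
    # fitted[g, i] holds the fitted value of phi(Z_i, grid[g]).
    n = y.shape[0]
    total = 0.0
    for q, fit_q in zip(grid, fitted):
        resid = (y <= fit_q).astype(float) - q      # generalized residual
        moments = basis.T @ resid / np.sqrt(n)      # one entry per basis function
        total += float(moments @ moments)
    # Riemann sum with step 1/N; the integrand vanishes at q = 0 and q = 1.
    return total / (len(grid) + 1)

def standardize(s_n, m):
    # Standardization stated in the text: 3 * sqrt(5 / m) * (S_n - m / 6).
    return 3.0 * np.sqrt(5.0 / m) * (s_n - m / 6.0)

# Oracle illustration under the null: Y = Z + V, so phi(z, q) = z + q.
rng = np.random.default_rng(1)
n, m, grid = 500, 5, quantile_grid(20)
draws = []
for _ in range(200):
    w = rng.uniform(size=n)
    v = rng.uniform(size=n)
    z = w
    y = z + v
    fitted = z[None, :] + grid[:, None]
    draws.append(integrated_statistic(y, fitted, cosine_basis(w, m), grid))
mean_s_n = float(np.mean(draws))   # should be close to m / 6
```

In an actual application `fitted` would hold the sieve estimates at each grid point, and the null would be rejected when `standardize(s_n, m)` exceeds the $(1 - \alpha)$-quantile of the standard normal distribution.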
Our test statistic builds on the nonparametric specification test in instrumental mean regression suggested by Breunig [2015]. Testing in instrumental quantile regression, on the other hand, requires a different methodology. First, the test statistic is a discontinuous function of the unknown structural effect $\phi_q$. Second, instrumental quantile regression leads to a nonlinear inverse problem and hence deriving asymptotic results is more challenging. Third, to verify the conditional moment restrictions for all quantiles we need to integrate over them. In the appendix, we show that the mapping $q \mapsto \phi_q$ is continuous under mild assumptions. This justifies the use of our $L^2$-type statistic rather than a sup-norm statistic.

Assumptions and notation
In order to obtain our asymptotic result we state the following assumptions. Our first assumption gathers conditions which we require for the basis functions $\{f_j\}_{j \ge 1}$. In the following, the supports $\mathcal{Z}$ of Z and $\mathcal{W}$ of W are assumed to be bounded (see also Assumption 4). The probability density function (p.d.f.) of W, denoted by $p_W$, is assumed to be uniformly bounded from above and away from zero. Assumption 1 (i) holds for sufficiently large C if $\{f_j\}_{j \ge 1}$ are trigonometric basis functions, B-splines, or wavelets. Assumption 1 (ii) is satisfied if the marginal density of W is uniformly bounded away from zero on $\mathcal{W}$ and $f_{m_n}$ forms a vector of orthonormal basis functions. For any $\varphi \in B_{(0,1)}$ we write $\varphi_q(\cdot) := \varphi(\cdot, q)$ for all $0 < q < 1$. We denote the Fréchet derivative of T at $\phi_q$ by $(T_q\varphi)(w) := E[p_{Y|Z,W}(\phi_q(Z), Z, W)\,\varphi(Z) \mid W = w]$ where $p_{Y|Z,W}$ denotes the density of Y conditional on (Z, W). We introduce the notation […] and $\psi(\cdot, q) \in L^2_W$ for all $q \in (0, 1)$.
There exists some constant $0 < \eta < 1$ such that for all $0 < q < 1$ and all functions $\varphi \in \{\varphi \in B : \|\varphi - \phi_q\|_{Z,p} \le \varepsilon\}$ for some $\varepsilon > 0$ it holds […] (2.5) Assumption 2 (i) ensures identification of $\phi_q$ for almost all $0 < q < 1$ on the set B which we introduce below. Assumption 2 (ii) specifies an upper bound on the Taylor remainder of T in a small neighborhood around $\phi_q$. It is also known as the tangential cone condition and is frequently used in the analysis of nonlinear operator equations (cf. Hanke et al. [1995] or Dunker et al. [2014] in the case of instrumental variable estimation). We provide sufficient conditions for the tangential cone condition in Example 2.1 below and refer to Chen et al. [2014] for further discussion.
Assumption 3. There exists a sequence $(r_n)_{n \ge 1}$ with $r_n = o(1)$ such that for constants $C > 0$ and $\kappa \in (0, 1]$ it holds $\max_{1 \le j \le m_n}$ […] (2.6) Assumption 3 states that the function $\phi_q \mapsto (1\{Y \le \phi(Z, q)\} - q) f_j(W)$, $1 \le j \le m_n$, is locally uniformly $L^2_W$ continuous for almost all $0 < q < 1$. This condition has also been exploited by Chen et al. [2003] (Theorem 3), Chen [2007] (Lemma 4.2 (i)) or Chen and Pouzo [2012] (Remark c.1). Example 2.2 below gives primitive conditions under which Assumption 3 holds true. Let $\mathcal{Z} \subset \mathbb{R}^{d_z}$ and for any vector of nonnegative integers $k = (k_1, \dots, k_{d_z})$ define $|k| = \sum_{j=1}^{d_z} k_j$ and $D^k = \partial^{|k|}/(\partial z_1^{k_1} \cdots \partial z_{d_z}^{k_{d_z}})$. For some integer $p > 0$ we define the norms […] where α and $\alpha_0$ are positive integers. We denote the Sobolev spaces associated with the norm $\|\cdot\|_{\alpha,p}$ by $W^{\alpha,p} := \{\varphi : \mathcal{Z} \to \mathbb{R} : \|\varphi\|_{\alpha,p} < \infty\}$. (2.7) For some constant $\rho > 0$, define B as the Sobolev ellipsoid of radius ρ given by […] On the other hand, our sieve space $B_{k_n}$ used to approximate B is compact under $\|\cdot\|_Z$ and thus penalization is not necessary for consistent estimation (see also Chen and Pouzo [2012]). Additional constraints such as monotonicity can also be imposed by $B = \{\varphi \in W^{\alpha,p} : \|\varphi\|_{\alpha,p} \le \rho,\ \inf_{z \in \mathcal{Z}} \varphi'(z) > 0\}$ for scalar z. Such a monotonicity constraint does not necessarily lead to faster rates of convergence, in contrast to an additivity restriction on $\phi_q$. Consequently, we do not treat shape restrictions like monotonicity explicitly but only discuss a test of additivity in Section 3.2. In this context, we also refer to Chetverikov and Wilhelm [2017] for the use of shape restrictions for sieve estimation in instrumental mean regression. The following assumption gathers regularity conditions imposed on the structural function ϕ and the supports $\mathcal{Z}$ and $\mathcal{W}$.
Assumption 4. (i) Let $\alpha_0 > d_z/p$ and $\alpha > d_z/\kappa$. (ii) $\mathcal{Z}$ is bounded, convex and satisfies a uniform cone property. (iii) $\mathcal{W}$ is bounded. (iv) The marginal density of W, denoted by $p_W$, is bounded from above and uniformly bounded away from zero on $\mathcal{W}$.
Assumption 4 (i) requires α to be large if (2.6) holds only for small $\kappa > 0$ or the dimension $d_z$ is large. Assumption 4 (ii) imposes a weak regularity condition on the shape of $\mathcal{Z}$. For the uniform cone property see, for instance, Paragraph 4.4 in Adams and Fournier [2003]. This property was also used by Santos [2012]. Assumption 4 (v) ensures that $\|T_q\varphi\|_W \le C\|\varphi\|_Z$ for all $\varphi \in L^2_Z$ and some constant $C > 0$.
Example 2.1 (Primitive Conditions for Assumption 2). Let Φ coincide with the Hilbert space L 2 Z := {φ : φ Z < ∞}. If for any 0 < q < 1 the operator T q is compact then there exists an orthonormal basis in L 2 Z denoted by for some constant c 0 > 0 then, under mild assumptions on the joint distribution of (Y, Z, W ), the function ϕ q is identified on B (cf. Theorem 6 of Chen et al. [2014]). A similar restriction was also imposed by Horowitz and Lee [2007]. If B ⊂ q∈(0,1) B source,q then Assumption 2 (i) holds true. Under further assumptions, imposing bounds on the generalized Fourier coefficients is equivalent to imposing smoothness restrictions. To illustrate this relation let Z be a scalar uniformly distributed random variable and assume s qj = j −ζ , j 1, for some constant ζ > 0. In this case, if {e j } j 1 are the usual trigonometric basis functions then B source,q coincides with the Sobolev space of ζ-times differentiable functions with periodic boundary conditions, while if s 2 qj = exp(−j 2ζ ), j 1 and ζ > 1, B source,q contains only analytic functions (see also Kress [1989]). In this sense, B source,q links the smoothness of φ − ϕ q to the degree of ill-posedness determined by the degree of decay of (s qj ) j 1 , which is also known as a so-called source condition (cf. Chen and Reiß [2011] or Dunker et al. [2014] for a further discussion). Under the singular value decomposition of T q it is also possible to provide primitive conditions for the tangential cone condition (2.5). Assume that the conditional p.d.f. of Y given (Z, W ), denoted by p Y |Z,W , is continuously differentiable with |∂p Y |Z,W (·, Z, W )/∂y| c 1 and the conditional p.d.f. of Z given W satisfies p Z|W (·, W ) c 2 p Z (·), for some constants c 1 , c 2 > 0. Then by Theorem 6 of Chen et al. 
[2014] it holds (2.9) We further obtain for all φ ∈ B source,q by making use of the Cauchy-Schwarz inequality Consequently, the tangential cone condition (2.5) is satisfied if we assume c 1/2 0 c 1 c 2 < 1. We also note that for our test of exogeneity in Section (3.1) only the weaker condition (2.9) is required.
Example 2.2 (Primitive Conditions for Assumption 3). Let F Y |ZW denote the cumulative distribution function of Y given (Z, W ) and assume that it is Lipschitz continuous with constant C L > 0, that is, |F Y |ZW (y) − F Y |ZW (y )| C L |y − y | for all (y, y ). Due to Assumption 4 the Sobolev space W α,p can be embedded in W α,∞ (cf. Theorem 6 of Adams and Fournier [2003]). In particular, the supremum norm is bounded on B and moreover, Assumption 3 holds true. Indeed, 1 0 φ q −ϕ q 2 ∞ dq r 2 n implies φ q −ϕ q ∞ c r n for almost all 0 < q < 1 and some constant c > 0. Hence, ϕ(Z, q) − c r n φ(Z, q) ϕ(Z, q) + c r n for almost all 0 < q < 1 and following Chen et al. [2003] (page 1599 -1600) we observe Remark 2.1 (Local Overidentification). In this remark, we discuss local overidentification restrictions in nonparametric instrumental quantile regression for some 0 < q < 1. As Chen and Santos [2018] point out in their Example 5.2, the range of the Fréchet derivative T q , is given by Local overidentification corresponds to the case where the closure of the range R q is a strict subset of L 2 W . In this paper, the class of structural functions ϕ is restricted to belong to an ellipsoid B and thus, we consider for each q: Mild restrictions on the ellipsoid B imply local overidentification and hence, the class of functions in the alternative model is not empty.
The next result formalizes the discussion of the previous remark and shows that the regularity conditions imposed on the function set B ensure overidentification.
Proposition 2.1. Let Φ coincide with the Hilbert space $L^2_Z$ and let Assumption 4 (v) be satisfied. Then we have local overidentification, i.e., for any $0 < q < 1$ the closure of $R_q(B)$ is a strict subset of $L^2_W$.
The proof of Proposition 2.1 relies on the fact that the functions in B are bounded by some constant $\rho > 0$ and, in particular, no smoothness restrictions are employed here to achieve overidentification. It is also possible to achieve overidentification for classes containing unbounded functions, as long as they satisfy minimal smoothness conditions. The following result is due to Chen and Santos [2018, Lemma 4.1] and gives a condition for local overidentification without imposing a priori restrictions on the set of functions B.

Lemma 2.2 (Chen and Santos [2018]). The model is locally overidentified if and only if
Lemma 2.2 provides a necessary and sufficient condition for local overidentification without imposing regularity or other shape restrictions. This result involves the adjoint of the Fréchet derivative T q and can be characterized more explicitly in different cases. For instance, assume that the vector of instruments can be decomposed such that and hence, the model is locally overidentified when there exists a nontrivial function ψ such The rate ω n captures the variance and bias part for estimating T φ for a fixed function φ and also contains the bias for approximating the structural function ϕ in the weak norm induced by the Fréchet derivative of T . Following Chen and Pouzo [2012] we introduce the sieve measure of local ill-posedness by We write a n ∼ b n when there exist constants c, c > 0 such that cb n a n c b n for sufficiently large n.

Asymptotic distribution under the null hypothesis
The following theorem establishes asymptotic normality of the test statistic S n after standardization under the null hypothesis H 0 .
Theorem 2.3. Let Assumptions 1-4 be satisfied. Assume that […] (2.10) and in addition […] (2.11) for some $\epsilon > 0$. Then we have under $H_0$: $3\sqrt{5/m_n}\,(S_n - m_n/6) \xrightarrow{d} N(0, 1)$.
To motivate the constants in the sieve mean and variance, respectively, we observe […] see also the proof of Lemma A.3. The required rate imposed in (2.10) on $m_n$ is milder than the rate requirement $m_n = o(n^{1/3})$ imposed by Breunig [2015] in the case of nonparametric instrumental mean regression. This is due to the fact that in the latter case we do not have a lower bound for the sieve standard deviation in general, while in the case of quantile regression the sieve standard deviation is $\sqrt{m_n}$ within a positive constant. This can be exploited to weaken rate restrictions on $m_n$. Further, note that restriction (2.11) implies $k_n = o(\sqrt{m_n})$ (by using that $l_n \ge k_n$). This requirement essentially determines the degree of overidentification required for inference. The rate restriction $\tau_{k_n}\omega_n = o\bigl(m_n^{-(1+\epsilon)/\kappa}\bigr)$ imposed in condition (2.11) implies that the dimension parameter $m_n$ dominates the effect of estimation of the structural function. Consequently, the asymptotic behavior of our test statistic is not affected by the estimation of ϕ, regardless of the underlying degree of ill-posedness. Note that this rate restriction can be ensured by choosing $k_n$ relative to the decay of the sieve measure of local ill-posedness, which is described in more detail in Example 2.3 below. We illustrate below that condition (2.11) is satisfied under common smoothness restrictions on ϕ and mapping requirements of the Fréchet derivative $T_q$.
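The constants $1/6$ and $3\sqrt{5}$ can be traced back to two elementary integrals. The following computation is a sketch under assumptions that are not spelled out in the text above: the basis functions $f_j$ are orthonormal with respect to the distribution of W, and under $H_0$ the generalized residual behaves like $1\{V \le q\} - q$ with $V \sim U(0, 1)$ independent of W, so that its covariance across quantiles $q$ and $q'$ is $\min(q, q') - q q'$.

```latex
\begin{align*}
\int_0^1 q(1-q)\,dq &= \frac{1}{2} - \frac{1}{3} = \frac{1}{6},\\
2\int_0^1\!\!\int_0^1 \bigl(\min(q,q') - q q'\bigr)^2\,dq\,dq'
  &= 4\int_0^1 (1-q')^2 \int_0^{q'} q^2\,dq\,dq'
   = \frac{4}{3}\int_0^1 q'^3 (1-q')^2\,dq'
   = \frac{4}{3}\cdot\frac{1}{60} = \frac{1}{45}.
\end{align*}
```

Summing over $m_n$ basis functions gives a centering term $m_n/6$ and a variance of $m_n/45$, so that the standard deviation is $\sqrt{m_n}/(3\sqrt{5})$ and its reciprocal is the factor $3\sqrt{5/m_n}$ used in the standardization.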
Remark 2.2. Consider the Hilbert space case $\Phi = L^2_Z$ and let $\{e_j\}_{j \ge 1}$ be an orthonormal basis in $L^2_Z$. In this case, $\Pi_{k_n}\varphi = \sum_{j=1}^{k_n} E[\varphi(Z)e_j(Z)]\,e_j$. Let us assume the following two conditions.
If the p.d.f. p Z of Z ∈ [0, 1] dz is bounded then it is well known that the sieve approximation error condition holds for splines, wavelets, and Fourier series bases. Due to Assumption 4 (v) the link condition is always satisfied with υ j = 1 for all j 1. The link condition implies an upper bound for the sieve measure of ill-posedness; that is, τ kn Cυ kn for some constant C > 0 and all n 1 (cf. Lemma B.2 of Chen and Pouzo [2012]). Consequently, the first part of condition (2.11) simplifies to belongs to a Hölder space with Hölder parameter β. In addition, in the setting of Example 2.2, the second part of condition (2.11) simplifies to for some > 0.
In the next example, we illustrate different mapping properties of the operator T q which are usually studied in the literature.

Example 2.3. Consider the Hilbert space setting of Remark 2.2 with conditions (i) and (ii).
In addition assume that the reverse link condition ] 2 for φ ∈ B and some constant c > 0 is satisfied. In the setting of Example 2.1, we have 1 0 s 2 qj dq > υ j for all j 1 implying that T q is nonsingular for almost all 0 < q < 1 (since any countable union of null sets is null). For simplicity, let Z and W be scalars. Further, let max n −1 l n , l −2β n ∼ n −1 k n and k n ∼ n χ for some constant χ > 0 which is specified in the following two cases.
(i) Mildly ill-posed case: If υ kn ∼ k −2ζ n for some ζ 0 then in order for (2.11) to hold we require m n ∼ n ι with 0 < ι < 1/3 and In both situations we conclude that the dimension parameter m n is required to be larger than the dimension k n of the sieve space for n sufficiently large. Roughly speaking we require more moment restrictions implied by the instrument than the number of parameters we want to estimate. This corresponds to the test of overidentification in the parametric framework.
In contrast to a test integrated over all quantiles, one might be interested in checking model (1.1) for one specific quantile. In this case, we consider the test statistic […] (2.12) If $S_n(q)$ becomes too large then we reject the null hypothesis $H_0$. The derivation of the asymptotic behavior of $S_n(q)$ is similar to that in Theorem 2.3. Indeed, only the Lebesgue measure over (0, 1) has to be replaced by the Dirac measure which has its mass at the quantile of interest.
Corollary 2.4. Let Assumptions 1 and 4 be satisfied. For a fixed quantile q ∈ (0, 1), let Assumptions 2, 3, and conditions (2.10) and (2.11) hold. If there exists a function ϕ q ∈ B with T ϕ q = q then In addition, one might be interested in certain regions of quantile functions. Let µ denote any measure on (0, 1). Again, the next result is a direct implication of Theorem 2.3 and hence we omit its proof.
Corollary 2.5. Let Assumptions 1 and 4 be satisfied. For all q in the support of µ, let Assumptions 2, 3, and conditions (2.10) and (2.11) hold. If there exists a function ϕ ∈ B with |T ϕ q − q|dµ(q) = 0 then As mentioned in the introduction, our test is a joint test of instrument validity and monotonicity of ϕ in its second entry. The following remark illustrates how the test statistic S n (q) integrated over a subset of (0, 1) can be useful to detect which kind of deviation exists.
Remark 2.3 (Detecting the kind of deviation). Suppose that the structural function is strictly monotonically increasing in its second entry for values q ∈ (0, q ) given some q ∈ (0, 1) (can be checked using Corollary 2.5). Further, let q → ϕ(·, q) be either nonincreasing or decreasing on (q , q ). This can be assured by letting q close to q and assuming that ϕ does not oscillate for q q . If W is a valid instrument, employing model equation (1.2) and V ∼ U(0, 1) yields for all q q and q sufficiently close to q . The last inequality holds regardless whether the function q → ϕ(·, q) is strictly monotone or not. Consequently, if inf w∈W P(Y ϕ(Z, q)|W = w) > q for some q ∈ (q , q ) we may conclude that W is not a valid instrument. The analysis of a one sided test based on this inequality is beyond the scope of this paper. On the other hand, we can check the kind of deviation by using the estimator Further, confidence statements can be achieved by using resampling methods.
Remark 2.4 (Implementation of the test statistic). This remark provides some details on the implementation of our test. First, discretize the (0, 1)-integral by using the grid $1/N, 2/N, \dots, (N - 1)/N$ for some integer N. In different simulations, we found that a grid size of N = 20 was sufficiently large. Also note that by the choice of the grid we avoid evaluation at the boundary points zero and one. Second, for any integer $m_n \le n^{1/2}$ estimate the structural effect $\phi_q$ by the estimator given in (2.3) for each grid point q and each parameter $k_n$ with $k_n^2 \le m_n$ and $l_n = 2k_n$. Third, compute the standardized test statistic $S_n$ such that it is maximized with respect to $m_n$ and minimized with respect to $k_n$. That is, we choose $k_n$ to provide a good model fit and $m_n$ to increase the power of the test. The choice of the dimension parameters captures essential rate requirements imposed to achieve asymptotic normality and is also motivated by simulation results. This leads to a so-called minimum-maximum principle, see also Subsection 4.1 for more details.
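The search over dimension parameters described in Remark 2.4 can be sketched as follows. This is a minimal illustration under assumptions: the inequalities $m_n \le n^{1/2}$ and $k_n^2 \le m_n$ are read as the admissible ranges, the order of operations (minimize over $k_n$ for each $m_n$, then maximize over $m_n$) is one possible reading of the minimum-maximum principle, and `standardized_stat` is a hypothetical user-supplied callable that returns the standardized statistic for given dimensions.

```python
import math

def candidate_dimensions(n):
    # All triples (k, l, m) with m <= sqrt(n), k^2 <= m and l = 2k.
    triples = []
    for m in range(1, math.isqrt(n) + 1):
        for k in range(1, math.isqrt(m) + 1):
            triples.append((k, 2 * k, m))
    return triples

def min_max_statistic(n, standardized_stat):
    # For each m take the smallest statistic over admissible k (good model fit),
    # then take the largest of these values over m (power of the test).
    best = -math.inf
    for m in range(1, math.isqrt(n) + 1):
        inner = min(standardized_stat(k, 2 * k, m)
                    for k in range(1, math.isqrt(m) + 1))
        best = max(best, inner)
    return best

# Toy stand-in for the real statistic, used only to exercise the search.
toy_value = min_max_statistic(16, lambda k, l, m: m - k)
```

In practice `standardized_stat` would estimate the structural function on the quantile grid with a sieve of dimension k and l instruments, and then evaluate the standardized statistic with m moment functions.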

Consistency against a fixed alternative
Let us first establish consistency when $H_0$ does not hold, that is, there exists no function ϕ belonging to $B_{(0,1)}$ which solves $T\phi_q = q$ for all $0 < q < 1$. The following proposition shows that our test rejects a false null hypothesis with probability tending to one as the sample size grows to infinity. In the following analysis of the asymptotic power of our testing procedure we let […] Proposition 2.6. Assume that $H_0$ does not hold. Let Assumptions 1-4 be satisfied. Consider a sequence $(\gamma_n)_{n \ge 1}$ satisfying $\gamma_n = o(n/\sqrt{m_n})$. If conditions (2.10) and (2.11) hold we have $P\bigl(3\sqrt{5/m_n}\,(S_n - m_n/6) > \gamma_n\bigr) = 1 + o(1)$.

Limiting behavior under local alternatives
In the following, we study the power of the test, that is, the probability to reject a false hypothesis against a sequence of linear local alternatives that tends to zero as the sample size tends to infinity. We proceed similarly as Ait-Sahalia et al. [2001] (Section 3.3). More precisely, let (ϕ qn ) n 1 be a sequence of (nonstochastic) functions satisfying n The next result establishes asymptotic normality for the standardized test statistic S n .
Proposition 2.7. Let Assumptions 1-4 be satisfied. Assume that (ϕ qn ) n 1 satisfies (2.13) and n If conditions (2.10) and (2.11) hold we have From Proposition 2.7 we see that our test can detect local linear alternatives at the rate δ n . If {f j } j 1 forms an orthonormal basis in L 2 W then δ n coincides with m 1/4 n n −1/2 within a constant. Hence, our test has the same power against local linear alternatives as the test of Hong and White [1995] who consider parametric specification testing.

Inference based on bootstrap
Nonparametric tests that rely on the asymptotic normal approximation may perform poorly in finite samples. An alternative approach is to use a bootstrap approximation, since bootstrap-based procedures can approximate finite sample distributions more accurately. In the following, we propose a bootstrap version of our test statistic $S_n$. The bootstrap procedure is based on a sequence of independent and identically distributed random variables $\varepsilon_i$, $1 \le i \le n$, drawn independently of the original data $(Y_i, Z_i, W_i)$, $1 \le i \le n$. Following Chen and Pouzo [2015] we then consider the bootstrap residual function […] Let $\widehat{\phi}^*_{qn}$ be the bootstrap version of the sieve least squares estimator (2.3), which is computed in the same way but where only […] The bootstrap version $S^*_n$ of our test statistic $S_n$ given in (2.4) builds on $\widehat{\phi}^*_{qn}$. More precisely, $S^*_n$ is computed as the test statistic $S_n$ but where only […] Assumption 5. Let $(\varepsilon_i)_{i \ge 1}$ be an independent and identically distributed sequence of random variables drawn independently of (Y, Z, W) such that $E[\varepsilon] = 1$, $\mathrm{Var}(\varepsilon) =: \sigma_\varepsilon^2 \in (0, \infty)$ and $E[|\varepsilon - 1|^4] < \infty$. Assumption 5 corresponds to Assumption Boot.1 of Chen and Pouzo [2015]. We slightly strengthen their assumption by imposing a fourth moment restriction, which we require to derive asymptotic validity of the bootstrap procedure. Due to the bootstrap innovations $\varepsilon_i$ the constants in the sieve mean and sieve standard deviation change. For the bootstrap test $S^*_n$ we obtain the sieve mean constant […] and the sieve standard deviation constant […] Chen and Pouzo [2015] show that the bootstrap version of the sieve estimator $\widehat{\phi}^*_{qn}$ converges at the same rate as $\widehat{\phi}_{qn}$. Thus, following line by line the proof of Theorem 2.3 and using the imposed restrictions on the weights $\varepsilon_i$ we obtain the following result.
It should be emphasized that the asymptotic validity of the bootstrap procedure is, in particular, due to the rate condition (2.11), which ensures that the asymptotic distribution of $S^*_n$ is not affected by the estimation of the structural function. The next result establishes consistency of the bootstrap test against fixed alternatives.

Extensions
As we see in this section, our testing procedure can potentially be applied to a much wider range of situations. We now discuss corollaries that generalize the previous results in different ways. For the following analysis we focus on a fixed quantile q ∈ (0, 1).

Testing exogeneity
Falsely assuming exogeneity of the regressors leads to inconsistent estimators, while treating exogenous regressors as if they were endogenous can lower the rate of convergence dramatically. In this subsection, we develop a nonparametric test of exogeneity that is robust against possible nonseparability of unobservables. The test statistic is similar to the statistic $S_n(q)$ given in (2.12), but with $\varphi_{qn}$ replaced by an estimator of the conditional quantile function.
In contrast to the previous section, we assume here that there exists a unique function $\varphi_q$ satisfying $Y = \varphi_q(Z) + U_q$ with $P(U_q \leq 0 | W) = q$ for some $q \in (0, 1)$. The relation between $Z$ and $W$ is thus restricted through this maintained hypothesis. Under the maintained hypothesis, we propose a test of whether the vector of regressors $Z$ is exogenous at a quantile $q \in (0, 1)$, that is,

In the following, we denote the conditional quantile function by $\varphi^e_q$, which satisfies $P(Y \leq \varphi^e_q(Z) | Z) = q$. The null hypothesis $H^e_0$ is satisfied if and only if the structural function $\varphi_q$ coincides with the conditional quantile function $\varphi^e_q$. Further, under nonsingularity of the operator $T$, hypothesis $H^e_0$ is equivalent to

Our test of exogeneity, which we propose below, is based on this equation or, equivalently, on $P(Y \leq \varphi^e_q(Z) | W) = q$. More precisely, to test exogeneity we replace in the statistic $S_n(q)$ given in (2.12) the estimator of $\varphi_q$ by an estimator of $\varphi^e_q$. In the following, $\varphi^e_{qn}$ denotes an estimator of the conditional quantile function $\varphi^e_q$. For instance, an estimator of $\varphi^e_q$ is given by

where $\rho_q(u) = |u| - (2q - 1)u$ is the check function and $B_{k_n} = \{\phi \in B : \phi(\cdot) = \sum_{j=1}^{k_n} \beta_j e_j(\cdot)\}$. For B-spline basis functions and an additional penalty this estimator was proposed by Koenker et al. [1994]. In the following, let $p_Z$ and $p_{Z|W}$ denote the marginal density of $Z$ and the conditional density of $Z$ given $W$, respectively.
Assumption 6 (i) formalizes the maintained hypothesis of a correctly specified nonparametric instrumental quantile moment equation. Section 2 provides a test for it. Due to Assumption 6 (ii) we do not require Assumption 2 (ii) but can rather rely on an upper bound of the Taylor remainder of $T$ obtained by Chen et al. [2014]. In this sense, the test of exogeneity presented below requires weaker restrictions on the local curvature of $T$ than in the case of specification testing. Assumption 6 specifies a rate requirement for the $L^2_Z$ distance of the estimator $\varphi^e_{qn}$. For instance, under $H^e_0$, Assumption 6 (iii) is satisfied with $R^e_n = k_n/n + k_n^{-2r}$ when $\varphi^e_{qn}$ is given by the estimator (3.2) with the B-spline basis functions $\{e_j\}_{j \geq 1}$ and $Z$ is scalar, see He and Shi [1994]. The same rate is obtained by Horowitz and Lee [2005] in the case of multivariate $Z$ in an additive quantile regression model.

For a test of the null hypothesis $H^e_0$ we replace in the definition of $S_n(q)$ given in (2.12) the estimator $\varphi_{qn}$ by $\varphi^e_{qn}$. That is,

We reject the hypothesis $H^e_0$ if $S^e_n(q)$ becomes too large. The next result establishes asymptotic normality of our test statistic $S^e_n(q)$ under the null hypothesis.

Example 3.1. Let us illustrate when condition (3.3) holds true. Let $m_n \sim n^{\iota}$ with $0 < \iota < 1/3$. Then for (3.3) to hold let $k_n \sim n^{\chi}$ where $\chi > 0$ satisfies

Hence, we require $r > 2/\kappa$, which is a slightly stronger restriction than Assumption 4 (i).
In the following, we study the power of the test, that is, the probability of rejecting a false hypothesis, against a sequence of linear local alternatives that tends to zero as the sample size tends to infinity. More precisely, let $(\varphi^e_{qn})_{n \geq 1}$ be a sequence of (nonstochastic) functions satisfying

Here, $\xi^e_q \in L^2_W$ is a function satisfying $\|\xi^e_q\|^2_W > 0$. The next result establishes asymptotic normality for the standardized test statistic $S^e_n(q)$.

Testing additivity
The test statistic given in (2.4) is also convenient to check additional restrictions on the structural effect ϕ q for 0 < q < 1. These additional restrictions can be easily imposed by constraints on the functions of the sieve space B kn . For instance, one may impose an additive structure of the quantile structural effects.
By assuming an additive structure of $\varphi_q$ one might reduce the effect of the dimensionality of the regressors on the convergence rate of an estimator (cf. Chen and Pouzo [2012] in the case of instrumental quantile regression). Applying this structure leads, however, to inconsistent estimators in general if the function $\varphi_q$ does not obey an additive form. Our aim in the following is to test whether

Similarly to the above, we obtain the test statistic

Here the estimator $\varphi^{add}_{qn} = (\varphi^1_{qn}, \varphi^2_{qn})$ of $\varphi_q = (\varphi^1_q, \varphi^2_q)$ is given by (2.3) where the sieve basis is a tensor product of basis functions that depend either on $Z'$ or $Z''$. For a more detailed discussion we refer to Section 6 of Chen and Pouzo [2012]. The next asymptotic normality result is a direct consequence of Corollary 2.4 and hence its proof is omitted.

Monte Carlo simulation
In this section, we study the finite sample performance of our test by presenting the results of a Monte Carlo investigation. There are 1000 Monte Carlo replications in each experiment. Results are presented for the nominal level 0.05. Let $\Phi$ denote the cumulative standard normal distribution function. Throughout this simulation study, realizations $(Z, W)$ were generated by $Z = \Phi(\zeta\omega + \sqrt{1 - \zeta^2}\,\varepsilon)$ and $W = \Phi(\omega)$ where $\omega$ is independent of $\varepsilon$ and $\omega, \varepsilon \sim N(0, 1)$. Here, the constant $\zeta > 0$ determines the degree of correlation between $Z$ and $W$ and is varied in the experiments.
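The design for $(Z, W)$ described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the seed are hypothetical choices and not part of the original simulation code.

```python
import math
import random


def std_normal_cdf(x):
    """Standard normal distribution function Phi, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def draw_regressor_and_instrument(n, zeta, seed=0):
    """Draw n pairs with Z = Phi(zeta*omega + sqrt(1 - zeta^2)*eps) and W = Phi(omega)."""
    if not 0.0 < zeta <= 1.0:
        raise ValueError("zeta must lie in (0, 1]")
    rng = random.Random(seed)  # illustrative seed, only for reproducibility
    z, w = [], []
    for _ in range(n):
        omega = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        z.append(std_normal_cdf(zeta * omega + math.sqrt(1.0 - zeta ** 2) * eps))
        w.append(std_normal_cdf(omega))
    return z, w


def sample_correlation(a, b):
    """Pearson sample correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Both $Z$ and $W$ are uniformly distributed on (0, 1) by construction, and a larger $\zeta$ produces a stronger dependence between them, which can be checked with `sample_correlation`.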

Testing a Nonparametric Specification
We begin with the finite sample analysis of our test statistics in case of nonparametric specification testing. To analyze the finite sample power we distinguish in the following between a failure of the null hypothesis caused either by a lack of instrument validity or by non-monotonicity of the structural function in unobservables.
Failure of instrument validity. We first generate realizations of $Y$ under the null hypothesis $H_0$. Recall that under $H_0$ there exists a function $\varphi \in B_{(0,1)}$ such that $P(Y \leq \varphi(Z, q) | W) = q$ for all $q \in (0, 1)$. In the following finite sample analysis, we restrict $B_{(0,1)}$ to contain continuously differentiable functions only. Under $H_0$ we generate realizations of $Y$ from the nonseparable model

where $V = \vartheta\,\varepsilon + \sqrt{1 - \vartheta^2}\,\epsilon$ with $\epsilon \sim N(0, 1)$ independent of $(\omega, \varepsilon)$ and $\vartheta = 0.7$. We consider the function $\phi(z) = \sum_{j=1}^{\infty} j^{-4} \cos(j\pi z)$. For computational reasons we truncate the infinite

where $\rho_j(z) = 10\,j\,(z\,1\{z \leq 0.25\} + (z - 1)\,1\{z > 0.25\})$ for $j = 1, 2$ and $\rho_j(z) = (z/2c_j)\,1\{0.5 - c_j \leq z < 0.5 + c_j\}$ for $j = 3, 4$, with $c_3 = 0.1$ and $c_4 = 0.05$. Here, the variable $V$ is generated as in (4.1). Under (4.2), the structural function $\varphi$ satisfying the quantile restriction $P(Y \leq \varphi(Z, q) | W) = q$ is given by $\varphi(z, q) = (\phi(z) + \rho_j(z))(1 + F_V^{-1}(q)/6) + F_V^{-1}(q)/2$. So $\varphi(\cdot, q)$ is not continuously differentiable and thus $H_0$ is false. Because estimation of $\varphi(\cdot, q)$ is an ill-posed inverse problem, we cannot choose $k_n$ sufficiently large to capture such irregularities, which implies finite sample power of our test against those alternatives. This corresponds to the analysis of Horowitz [2011] in the instrumental mean regression case.

For each quantile $0 < q < 1$, we estimate the structural function using the estimator $\varphi_{qn}$ given in (2.3) with B-splines as approximation basis functions. More precisely, for the sieve space $B_{k_n}$ we use B-splines of order 2 with 1 knot or 2 knots (hence $k_n = 4$ or $k_n = 5$) and for the criterion function we use B-splines of order 2 with 5 knots or 7 knots (hence $l_n = 2k_n$), respectively. We thus follow Chen and Christensen [2015] and choose $l_n$ to be a constant multiple of $k_n$.
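The dimension counts quoted above are consistent with a B-spline basis of order 2 having (number of knots) + 3 basis functions. The following sketch spells out this bookkeeping; the counting convention and the helper name are assumptions inferred from the reported numbers, not a statement about the software actually used.

```python
def bspline_dimension(n_knots, order=2):
    """Number of basis functions, assuming dimension = knots + order + 1."""
    return n_knots + order + 1


# Sieve space: 1 or 2 knots give k_n = 4 or k_n = 5.
k_values = [bspline_dimension(knots) for knots in (1, 2)]

# Criterion function: 5 or 7 knots give l_n = 8 or l_n = 10, that is, l_n = 2 k_n.
l_values = [bspline_dimension(knots) for knots in (5, 7)]
```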
Also for the vector of basis functions $f_{m_n}$, used to construct the test statistic, we use a B-spline basis of order 2 with the number of knots varying between 17, 22 and 27 (hence $m_n = 20$, $m_n = 25$ or $m_n = 30$). The empirical rejection probabilities of our standardized test statistic $3\sqrt{5/m_n}\,(S_n - m_n/6)$ at nominal level 0.05 are shown in Table 1. We approximate the integral over the quantiles on (0, 1) by the mean of a random sample from the uniform (0, 1) distribution. As we see from Table 1, our test is less sensitive to the choice of $m_n$ than to the choice of $k_n$, which is not surprising and well known from nonparametric instrumental variable estimation problems, see also Chen and Pouzo [2015]. Table 1 shows the empirical rejection probabilities for the sample sizes 500 and 1000. We see that as the sample size increases the finite sample rejection probabilities become larger in the alternative models. For $k_n = 4$ we see that the finite sample coverage improves slightly as the sample size increases. This is not the case for $k_n = 5$, which appears to be an inappropriate choice implying a large variance.

Table 1: Empirical rejection probabilities for the standardized test statistic $3\sqrt{5/m_n}\,(S_n - m_n/6)$ and its bootstrap version $3\sqrt{5}\,/\,(\sqrt{m_n}\,(\sigma_\varepsilon^2 + 1))\,(S^*_n - m_n(\sigma_\varepsilon^2 + 1)/6)$ with varying dimension parameters $k_n$ and $m_n$ with $l_n = 2k_n$. Columns: sample size; model; empirical rejection probability using $S_n$; empirical rejection probability using $S^*_n$.
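The centering and scaling used for the standardized statistic can be made concrete with a short sketch. The function names are hypothetical. The fixed-quantile form is an assumption, chosen so that at $q = 0.5$ it reduces to $(2m_n)^{-1/2}(4S^e_n(0.5) - m_n)$, the standardization reported for the test of exogeneity.

```python
import math


def standardize_integrated(s_n, m_n):
    """Standardize the statistic integrated over q in (0, 1): 3*sqrt(5/m_n)*(S_n - m_n/6)."""
    return 3.0 * math.sqrt(5.0 / m_n) * (s_n - m_n / 6.0)


def standardize_fixed_quantile(s_n_q, m_n, q):
    """Assumed fixed-quantile form: (S_n(q) - m_n*q*(1-q)) / (sqrt(2*m_n)*q*(1-q))."""
    v = q * (1.0 - q)
    return (s_n_q - m_n * v) / (math.sqrt(2.0 * m_n) * v)
```

For example, a raw value equal to the centering constant $m_n/6$ is mapped to zero by `standardize_integrated`, and `standardize_fixed_quantile(s, m, 0.5)` agrees with `(4 * s - m) / math.sqrt(2 * m)`.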
In Table 1 we also compare our testing procedure to a bootstrap version of it. We consider the generalized residual bootstrap as proposed in Subsection 2.6. We generate the bootstrap weights by $\varepsilon \sim N(1, \sigma_\varepsilon^2)$, independently of $(Y, Z, W)$, where $\sigma_\varepsilon = 0.5$. We run 200 bootstrap evaluations per Monte Carlo replication. We see from Table 1 that the bootstrap leads to an improvement in the finite sample coverage in the true model. In this sense, the bootstrap test statistic is less sensitive to the choice of $k_n$ under the true model. Similar to Chen and Pouzo [2015] (see p. 1059), we see only a minor improvement of the bootstrap test in the alternative models, but we expect that it improves further as the number of bootstrap runs is increased.

As we fix the dimension parameter $l_n = 2k_n$, two dimension parameters remain to be chosen by the econometrician, namely $k_n$ and $m_n$. While proposing an adaptive testing procedure is beyond the scope of this paper, we want to provide a heuristic argument for the parameter choice. Intuitively, we want to choose $k_n$ such that we have a good model fit, i.e., a small value of the test statistic, and $m_n$ to have good power properties, i.e., a large value of the test statistic. Moreover, the choice should reflect the rate requirement from our theory, that is, $k_n l_n = o(m_n^{1/2})$ and $m_n = o(n^{1/2})$. We implement such a heuristic parameter choice criterion via the following minimum-maximum principle: if $s(k_n, m_n)$ denotes the standardized value of our test statistic $S_n$ with dimension parameters $k_n$ and $m_n$, then we choose these parameters to solve $\min_{k_n < n^{1/4}} \max_{k_n^2 \leq m_n < n^{1/2}} s(k_n, m_n)$.
The values of this minimum-maximum principle (over the range $m_n \in \{20, 25, 30\}$ and $k_n \in \{4, 5\}$) are shown in bold in Table 1. Note that the requirement $k_n < n^{1/4}$ implies $k_n \leq 4$ when $n = 500$ and $k_n \leq 5$ when $n = 1000$. Further, $m_n < n^{1/2}$ implies $m_n \leq 22$ for $n = 500$ and $m_n \leq 31$ for $n = 1000$. We see that this criterion helps to avoid choosing the dimension parameter $k_n$ too large, which would yield inaccurate coverage. Such a rule, however, does not account for ill-posedness of the estimation problem and hence $k_n$ might still be chosen too large. We thus could calculate the sieve measure of ill-posedness by estimating the first $k_n$ minimal eigenvalues of $T_q$ (see also Chen and Pouzo [2015]).
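The minimum-maximum principle and the admissible ranges quoted above can be sketched as follows. The function names are hypothetical, `standardized_stat` stands for any user-supplied routine returning the standardized statistic for a given pair $(k_n, m_n)$, and reading the inner constraint as $k_n^2 \leq m_n < n^{1/2}$ is an assumption.

```python
import math


def admissible_k(n):
    """All integers k >= 1 with k < n**(1/4), checked in exact integer arithmetic."""
    ks, k = [], 1
    while k ** 4 < n:
        ks.append(k)
        k += 1
    return ks


def admissible_m(n, k):
    """All integers m with k**2 <= m and m < n**(1/2)."""
    m_max = math.isqrt(n - 1)  # largest integer m with m*m < n
    return list(range(k * k, m_max + 1))


def min_max_choice(n, standardized_stat):
    """Return (k, m, value): maximize over m for each k, then minimize over k."""
    best = None
    for k in admissible_k(n):
        ms = admissible_m(n, k)
        if not ms:
            continue  # no admissible m for this k
        m_star = max(ms, key=lambda m: standardized_stat(k, m))
        value = standardized_stat(k, m_star)
        if best is None or value < best[2]:
            best = (k, m_star, value)
    return best
```

In the simulations the search is additionally restricted to the grids $k_n \in \{4, 5\}$ and $m_n \in \{20, 25, 30\}$; the sketch searches all admissible integers instead.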

Failure of monotonicity in unobservables.
We study the finite sample power of our test when $\varphi$ is not strictly monotonic in the structural disturbance $V$. Realizations of $Y$ were generated from

for $j = 1, 2$. In the alternative models, the structural disturbance enters the model in a nonmonotonic way. We construct the statistic $S_n$ and its bootstrap counterpart $S^*_n$ as described in the previous paragraph. Table 2 depicts the empirical rejection probabilities of our test against the alternative models (4.4) and (4.5). Again we observe that our test is not very sensitive to the choice of the dimension parameter $m_n$. Our test becomes somewhat less powerful for large $k_n$. But in contrast to the alternatives involving discontinuous functions in the previous paragraph, the test is less sensitive to the choice of $k_n$. For each choice of the parameter $k_n$, our test becomes more powerful as the sample size increases from 500 to 1000. For $n = 1000$ we see that the parameter choice $k_n = 5$ leads to a more accurate finite sample coverage. This is captured by the minimum-maximum principle as introduced above. Again, the resulting values of the test statistic using this criterion over the range $m_n \in \{20, 25, 30\}$ and $k_n \in \{4, 5\}$ are shown in bold. Again we observe that the bootstrap version of the test statistic behaves similarly to the statistic $S_n$.

Table 2: Empirical rejection probabilities for the standardized test statistic $3\sqrt{5/m_n}\,(S_n - m_n/6)$ and $3\sqrt{5}\,/\,(\sqrt{m_n}\,(\sigma_\varepsilon^2 + 1))\,(S^*_n - m_n(\sigma_\varepsilon^2 + 1)/6)$ using varying dimension parameters $k_n$ and $m_n$ with $l_n = 2k_n$. Columns: sample size; model; empirical rejection probability using $S_n$; empirical rejection probability using $S^*_n$.

Testing exogeneity
Realizations of $Y$ were generated by

where $V$ is generated as described in model (4.1), that is, $V = \vartheta\,\varepsilon + \sqrt{1 - \vartheta^2}\,\epsilon$ with $\epsilon \sim N(0, 1)$ independent of $(\omega, \varepsilon)$. The function $\varphi^e$ is given by $\varphi^e(z) = \sum_{j=1}^{\infty} (-1)^{j+1} j^{-2} \sin(j\pi z)$. Again, for computational reasons we truncate the infinite sum at 100. The resulting function is displayed in Figure 1. Note that $\vartheta$ determines the degree of endogeneity of $Z$ and is varied among the experiments. The null hypothesis $H_0 : P(Y \leq \varphi^e(Z) | Z) = q$ holds true if $\vartheta = 0$ and is false otherwise. In the following, we perform a test at the median $q = 0.5$. As our test relies on the equation $P(Y \leq \varphi^e(Z) | W) = q$ we expect our test to have more power as the correlation between $W$ and $Z$ increases.

The test statistic is implemented as described in Section 4.2. To estimate the structural effect we make use of the estimator $\varphi^e_{qn}$ of He and Shi [1994] given in (3.2). Here, we use B-splines of order 2 with 1 knot (hence $k_n = 4$) or 2 knots (hence $k_n = 5$). In contrast to the previous section, the choice of the dimension parameter $k_n$ is not affected by the ill-posedness of the underlying inverse problem. As above, the vector of basis functions $f_{m_n}$ is also constructed with a B-spline basis of order 2 with the number of knots varying between 17, 22 and 27 (hence $m_n = 20$, $m_n = 25$ or $m_n = 30$). Table 3 depicts the empirical rejection probabilities with varying number of basis functions. As we see from Table 3, our test becomes more powerful for larger $\zeta$, that is, for instruments with a stronger correlation to the covariates $Z$. From Table 3 we see that the test of exogeneity becomes somewhat less powerful for larger values of $m_n$. On the other hand, the test seems not to be too sensitive with respect to the choice of the dimension parameters $k_n$ and $m_n$. We also see from Table 3 that the finite sample coverage and power properties of the test improve as the sample size increases from 500 to 1000.
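The truncated series defining $\varphi^e$ is straightforward to evaluate. The sketch below is only an illustration of the truncation at 100 terms stated above; the function name is hypothetical.

```python
import math


def phi_e(z, terms=100):
    """Truncated series: sum over j = 1..terms of (-1)^(j+1) * j^(-2) * sin(j*pi*z)."""
    return sum(
        (-1) ** (j + 1) * math.sin(j * math.pi * z) / j ** 2
        for j in range(1, terms + 1)
    )
```

Since the coefficients decay like $j^{-2}$ and alternate in sign, adding further terms changes the value at $z = 0.5$ by less than $10^{-4}$, which can be checked by comparing `phi_e(0.5)` with `phi_e(0.5, terms=1000)`.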
Similarly to the above, a guideline for the choice of smoothing parameters in practice is given by the following minimum-maximum principle: if $s^e_q(k_n, m_n)$ denotes the standardized value of our test statistic $S^e_n(q)$ with dimension parameters $k_n$ and $m_n$, then choose these parameters to solve $\min_{k_n < n^{1/4}} \max_{k_n^2 \leq m_n < n^{1/2}} s^e_q(k_n, m_n)$. Again this criterion takes the rate condition for the asymptotic theory into account. In Table 3 the resulting values of the test statistic using this criterion over the range $m_n \in \{20, 25, 30\}$ and $k_n \in \{4, 5\}$ are shown in bold.

An empirical illustration
To illustrate our testing procedure, we present an empirical application concerning estimation of the effects of class size on students' performance on standardized tests. Angrist and Lavy [1999] studied the effects of class size on test scores of 4th and 5th grade students in Israel. In this empirical illustration, we focus on 4th grade reading comprehension, a feature that was also considered by Horowitz [2011].
Table 3: Empirical rejection probabilities for the standardized test statistic $(2m_n)^{-1/2}\,(4 S^e_n(0.5) - m_n)$ with varying dimension parameters $k_n$ and $m_n$. Columns: $\zeta$; $\vartheta$; empirical rejection probability using $S^e_n(0.5)$ with $n = 500$; empirical rejection probability using $S^e_n(0.5)$ with $n = 1000$.

In this empirical example we study the model

where $Y_{sc}$ is the average reading comprehension test score of 4th grade students in class $c$ of school $s$, $Z_{sc}$ is the number of students in class $c$ of school $s$, $D_{sc}$ is the fraction of disadvantaged students in class $c$ of school $s$ with unknown scalar function $\beta$, $V_{sc} = U_s + \varepsilon_{sc}$ where $U_s$ is an unobserved school-specific random effect, and $\varepsilon_{sc}$ is an unobserved random variable that is independently distributed across classes and schools. The class size $Z_{sc}$ may be endogenous, for instance, due to the socioeconomic background of the students. To identify the causal effect of class size on scholastic achievement, Angrist and Lavy [1999] use Maimonides' rule as an instrument. According to this administrative rule, the maximum class size is 40 pupils and a class is split if the number of enrolled students exceeds this number. More precisely, assuming that cohorts are divided into classes of equal size, Maimonides' rule is described by

where $E_s$ denotes enrollment in school $s$ and $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$. Note that Horowitz [2011] showed that a linear relation between class size and scholastic achievement, as used by Angrist and Lavy [1999], is misspecified. To apply our tests, we consider a subsample where only one representative class per school is considered. By doing so, we avoid that rejection of a hypothesis is caused by within-school correlation. Moreover, only schools with at least two classes are considered, which leads to a sample size of 707. In the following, we want to test nonparametrically whether class size is endogenous at the 0.5-quantile. The null hypothesis is that $P(Y_{sc} \leq \varphi(Z_{sc}, q) + D_{sc}\beta(q) | Z_{sc}) = q$ where $q = 0.5$.
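The class size predicted by Maimonides' rule can be computed directly from enrollment. The sketch below assumes the form used by Angrist and Lavy [1999], namely $E_s / (\lfloor (E_s - 1)/40 \rfloor + 1)$; the function name and the `cap` parameter are hypothetical.

```python
def maimonides_class_size(enrollment, cap=40):
    """Predicted class size when a cohort is split into equally sized classes of at most `cap` pupils."""
    if enrollment < 1:
        raise ValueError("enrollment must be a positive integer")
    n_classes = (enrollment - 1) // cap + 1  # floor((E - 1) / cap) + 1
    return enrollment / n_classes
```

The predicted class size drops sharply whenever enrollment crosses a multiple of 40: an enrollment of 40 gives one class of 40 pupils, whereas an enrollment of 41 gives two classes of 20.5 pupils on average.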
The value of our standardized test statistic $(2m_n)^{-1/2}\,(4 S^e_n(0.5) - m_n)$ is 1.885. For the choice of smoothing parameters $k_n$ and $m_n$ we applied the minimum-maximum principle as described in Section 4.2. The resulting dimension parameters are $k_n = 4$ and $m_n = 23$.³ We thus reject the hypothesis of exogeneity at the 0.05 nominal level. In particular, in model (5.1) under conditions (a.1)-(a.3) we conclude that $Z_{sc}$ is not independent of $V_{sc}$.

We now test whether the model (5.1) with conditions (a.1)-(a.3) is correctly specified. We construct our test statistic using B-splines as described in Section 4.1. For the choice of smoothing parameters $k_n$ and $m_n$ we applied the minimum-maximum principle as described in Section 4.2. As in the Monte Carlo section we choose $l_n = 2k_n$. Our test statistic attains the value 1.4152 and thus fails to reject the nonseparable model (5.1) with conditions (a.1)-(a.3) at the 0.05 nominal level. This value of the test statistic is obtained when $k_n = 4$ and $m_n = 26$. For the fixed quantile $q = 0.5$, we also performed a test of $P(Y_{sc} \leq \varphi(Z_{sc}, q) + D_{sc}\beta(q) | W_{sc}) = q$. In this case, our test statistic attains the value 0.981 and again fails to reject the hypothesis.

³ The parameter $m_n$ is maximized over the range $k_n^2$ to 26 (being the largest integer smaller than $\sqrt{707}$).
⁴ This is not the case if $k_n$ is chosen too small or too large. For instance, if $k_n = 4$ or $k_n = 9$, respectively, then the value of the test statistic is 2.064 or 3.420 (as above, maximized over $m_n$ with $l_n = 2k_n$).
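The test decisions reported above can be reproduced from the standardized values alone. The sketch assumes a one-sided test that rejects for large values of an asymptotically standard normal statistic, which is consistent with the decisions stated in the text; the function name and the dictionary labels are hypothetical.

```python
from statistics import NormalDist


def one_sided_decision(standardized_value, level=0.05):
    """Return (reject, p_value) for a test rejecting when the statistic is large."""
    critical = NormalDist().inv_cdf(1.0 - level)  # about 1.645 at level 0.05
    p_value = 1.0 - NormalDist().cdf(standardized_value)
    return standardized_value > critical, p_value


# Standardized values reported in the empirical illustration.
reported = {
    "exogeneity, q = 0.5": 1.885,
    "specification, all quantiles": 1.4152,
    "specification, q = 0.5": 0.981,
}
decisions = {name: one_sided_decision(value) for name, value in reported.items()}
```

Under this assumption only the exogeneity hypothesis is rejected at the 0.05 level, in line with the conclusions drawn in the text.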
For the full sample, Figure 2 depicts estimators of the structural effect ϕ q for the quantiles q ∈ {0.75, 0.5, 0.25} where the number of disadvantaged students is restricted to be smaller than 15% (which implies n = 688). The solid lines are the estimators and the dashed lines are the 90% pointwise bootstrap confidence intervals using 1000 bootstrap iterations (we account for within school correlation by using schools as the bootstrap sampling units, see also Horowitz [2011]). We can see that the confidence intervals are tight enough to reject the hypothesis that the quantile structural effects are overall upward sloping. In particular, we see that the effect of class size variation on test scores is more severe for lower performing classes.

Conclusion
In this paper, we develop a nonparametric specification test for the quantile regression model (1.1). The power of the test derives either from violations of regularity conditions imposed on the structural function, such as bounds or smoothness requirements, or from a failure of monotonicity in the nonseparable unobservable variable. The test statistic is easy to implement and is a natural extension of specification testing in a parametric framework. As the test builds on the sieve methodology, it allows restrictions under the null hypothesis to be incorporated directly on the sieve space. As examples of tests of constrained hypotheses we consider in detail a test of exogeneity and a test of additivity of the structural function. We establish the large sample behavior of our test statistics and show that our tests work well in finite sample experiments. We also obtain reasonable results in an empirical illustration concerning the effect of class size on students' performance. While we provide a heuristic guideline for choosing the sieve dimension in finite samples, providing an asymptotic justification for it via adaptive testing remains an interesting area for future research.
In the appendix, $f_{m_n}$ denotes an $m_n$-dimensional vector with entries $f_j$ for $1 \leq j \leq m_n$. Moreover, $\|\cdot\|$ is the usual Euclidean norm. For ease of notation, let $X_i = (Y_i, Z_i, W_i)$ for $1 \leq i \leq n$ with realizations $x = (y, z, w) \in \mathcal{Y} \times \mathcal{Z} \times \mathcal{W}$. Let $\mathcal{H}$ be a class of measurable functions with a measurable envelope function $H$. Then $N(\epsilon, \mathcal{H}, L^2_X)$ and $N_{[\,]}(\epsilon, \mathcal{H}, L^2_X)$, respectively, denote the covering and bracketing numbers for the set $\mathcal{H}$. In addition, let $J_{[\,]}(1, \mathcal{H}, L^2_X)$ denote a bracketing integral of $\mathcal{H}$, that is,

Throughout the proofs, we will use $C > 0$ to denote a generic finite constant that may be different in different uses. Further, for ease of notation we write for

By Assumption 1, the eigenvalues of $E[f_{m_n}(W) f_{m_n}(W)^t]$ are bounded away from zero and hence, it may be assumed that $E[f_{m_n}(W) f_{m_n}(W)^t] = I_{m_n}$ where $I_{m_n}$ denotes the $m_n$-dimensional identity matrix (cf. Newey [1997], p. 161). In the following result, we establish continuity of the mapping $q \mapsto \varphi(\cdot, q)$ under the tangential cone condition and a mild assumption on the sieve approximation error for $\varphi_q$.
Lemma A.1. Let Assumption 2 be satisfied. Assume that for almost all $q \in (0, 1)$ there exists a function $\varphi_q$ with $T\varphi_q = q$, let $T_q$ be compact, and $\|\varphi_q - \Pi_k \varphi_q\|_Z = o(1)$ as $k \to \infty$. Then the mapping $q \mapsto \varphi(\cdot, q)$ is continuous.
Proof. For some $q \in (0, 1)$, since the linear operator $T_q$ is compact there exists a singular value decomposition of it, denoted by $\{s_{qj}, e_j, f_j\}_{j \geq 1}$. For any $\varepsilon > 0$ and $k$ sufficiently large, let us define $\delta = (1 - \eta)\,\varepsilon\, s_{qk}/3$. We consider $q' \in (0, 1)$ such that $|q - q'| < \delta$. Since $q, q'$ satisfy the quantile restriction we have $\|T\varphi_q - T\varphi_{q'}\|_W < \delta$. Let us further denote $r_k(q) = \|\Pi_k \varphi_q - \varphi_q\|_Z$. We have $r_k(q) \leq \varepsilon/6$ by assumption for all $q$. By Assumption 2 (ii) and the triangle inequality it holds

using that $(s_{qj})_{j \geq 1}$ is a nonincreasing sequence. This implies

which proves the result.
Proof of Proposition 2.1. Let $\|T_q\|_{o,2}$ be the operator norm of the Fréchet derivative $T_q$ given by $\|T_q\|_{o,2} = \sup_{\{\phi \in L^2_Z : \|\phi\|_Z \leq 1\}} \|T_q \phi\|_W$. From Assumption 4 (v) we infer that the operator $T_q$ is bounded since

Since $\|\cdot\|_Z \leq \|\cdot\|_{\alpha,p}$ for any integer $p > 0$ (see e.g. Lemma A.2 of Santos [2012]) we have $\sup_{\phi \in B} \|\phi\|_Z \leq \rho$ by the definition of $B$. Consequently, for any $\phi \in B$ we obtain

We conclude that the range $R_q(B)$ is uniformly bounded by the constant $\rho\,\|T_q\|_{o,2}$ and hence, $R_q(B)$ is a strict subset of $L^2_W$, which completes the proof.
Proof of Theorem 2.3. Since we have Q n − I mn 2 = o p (m 2 n /n) it is sufficient to prove that 3 5/m n (A.1) Consider I n . We calculate further where the first summand tends in probability to zero as n → ∞. Indeed,we have for all j 1 and hence, by using sup w∈W f mn (w) 2 Cm n . Therefore, to establish 3 5/m n (nI n − m n /6)

This follows from Lemma
and the classes H qjn = {h qj (·, φ q ) : φ q ∈ B qn } and H qj = {h qj (·, φ q ) : φ q ∈ B}. We observe and hence, H qj is an envelope function of the class H qjn and due to Assumption 3 we have where the last inequality is due to Theorem 2.14.5 of van der Vaart and Wellner [2000]. We further conclude by applying the last display of Theorem 2.14.2 of van der Vaart and for all 0 < q < 1. Now since max 1 j mn E |H qj (X)| 2 dq Cm −(1+c) n for n sufficiently large it is sufficient to show that max 1 j mn J [ ] (1, H qjn , L 2 X ) < C for all 0 < q < 1. From Lemma 4.2 (i) of Chen [2007] we deduce Employing condition α 0 > d z /p and Theorem 6.2 Part II of Adams and Fournier [2003] yields that W α,p is compactly embedded in W α,∞ . Thereby, B ⊂ W α,p is totally bounded in W α,∞ which implies φ α,∞ C for all φ ∈ B. Let W α,∞ C := {W α,∞ : φ q α,∞ C}. Now Theorem 2.7.1 of van der Vaart and Wellner [2000] gives where C depends on the diameter of Z. Now due to Assumption 4 (i) it is straightforward to see that max 1 j mn J [ ] (1, H qjn , L 2 X ) < C and hence, nIII n = o p ( √ m n ).
Consider II n . We observe The Cauchy Schwarz inequality implies for all ε > 0 where the last equality follows similarly to the proof of nIII n = o p ( √ m n ). Consider C n2 .
Let us introduce the function for 1 j m n and 1 i n G q := {t qn : φ ∈ B}, and G qn := {t qn : φ ∈ D qn }. We calculate Since p W is uniformly bounded away from zero, n T φ − T ϕ 2 W √ m n , and F mn (T φ q − T ϕ q ) W C T φ q − T ϕ q W for all φ ∈ D n we have |F mn (T φ q − T ϕ q )(w)| C m 1/4 n n −1/2 for almost all 0 < q < 1 and p W -almost all w. Consequently, t qn (x, φ q ) C m 1/4 n n −1/2 p W -almost surely. We conclude by again applying the last display of Theorem 2.14.2 of van der Vaart and Wellner [2000] As above it can be seen that J [ ] (1, G qn , L 2 X ) < C for all 0 < q < 1. Indeed, from Assumption 2 (ii) we conclude T φ − T ϕ q W (1 + η) T q (φ − ϕ q ) W and further, Assumption 4 (v) Hence, the mapping φ → F mn T φ is Lipschitz continuous at ϕ q and we may apply Theorem 2.7.11 of van der Vaart and Wellner [2000] which yields Thereby, C n2 = o p ( √ m n ), which completes the proof.
In the following we make use of the notation $g_{qj}(X_i, \phi) := (1\{Y_i \leq \phi(Z_i)\} - q)\, f_j(W_i)$, $1 \leq j \leq m_n$, $1 \leq i \leq n$, for any $\phi \in B$.
Proof of Proposition 2.6. For the proof it is sufficient to show $n^{-1} S_n \geq \int_0^1 \|T\varphi_q - q\|^2_W\, dq/2 + o_p(1)$. Since

which proves the result.
Proof of Proposition 2.7. Since ϕ q = argmin φ∈B T φ − q W we obtain as in the proof of Theorem 2.3 by employing the results of Lemma A.2 that Further, we calculate E g qj (X, ϕ q ) 2 dq = I n + 2II n + III n .
Proof of Corollary 2.9. For the proof it is sufficient to show $n^{-1} S^*_n \geq \int_0^1 \|T\varphi_q - q\|^2_W\, dq/2 + o_{p^*}(1)$ with probability approaching one. Chen and Pouzo [2015] show that the bootstrap version of the sieve estimator $\varphi^*_{qn}$ converges at the same rate as $\varphi_{qn}$. In light of the proof of Proposition 2.6, it is sufficient to show

using that $\varepsilon$ is independent of $W$ and $E[\varepsilon] = 1$ as well as $Var(\varepsilon) < \infty$, which proves the result.
Proof of Corollary 3.1. In light of the proof of Theorem 2.3 it is sufficient to prove n T ϕ e qn − T ϕ q 2 W = o p ( √ m n ). Due to Assumption 6 (ii) we obtain as in the proof of Theorem 6 of Chen et al. [2014] that and consequently, Moreover, by applying sup y p Y |Z,W (y, Z, W ) C and Jensen's inequality we have by employing the rate conditions (3.3) and Assumption 6 (iii).
Thus, by following line by line the proof of Proposition 2.7, we obtain the result.

Further, it holds
T Π kn ϕ q − q 2 W dq 2 T Π kn ϕ − T ϕ W +2 T ϕ q − q 2 W dq. We thus obtain For all φ ∈ D kn and 0 < q < 1 we have which goes to zero for all n 1 as ε → ∞. Proof of (A.5). Note that T q (φ − ϕ q ) W (1 − η) −1 T φ − T ϕ q W for all φ in a sufficiently small neighborhood around ϕ q . Thereby, due to (A.3) we obtain Hence, the result follows by applying (A.4).