NONPARAMETRIC WEIGHTED AVERAGE QUANTILE DERIVATIVE

The weighted average quantile derivative (AQD) is the expected value of the partial derivative of the conditional quantile function (CQF), weighted by a function of the covariates. We consider two weighting functions: a known function chosen by researchers, and the density function of the covariates, which parallels the average mean derivative in Powell, Stock, and Stoker (1989, Econometrica 57, 1403–1430). The AQD summarizes the marginal effect of the covariates on the CQF and defines a nonparametric quantile regression coefficient. In semiparametric single-index and partially linear models, the AQD identifies the coefficients up to scale. In nonparametric nonseparable structural models, the AQD conveys an average structural effect under certain independence assumptions. With a stochastic trimming function, the proposed two-step estimator is root-n-consistent for the AQD defined over the entire support of the covariates. To facilitate tractable asymptotic analysis, a key preliminary result is a new Bahadur-type linear representation of the generalized inverse, kernel-based CQF estimator, uniform over the covariates in an expanding compact set and over the quantile levels. The weak convergence to Gaussian processes applies to differentiable nonlinear functionals of the quantile processes.


INTRODUCTION
The weighted average quantile derivative (AQD) is the weighted expected value of the partial derivatives of the conditional quantile function (CQF), defined as β_W(τ) ≡ E[W(X)∇Q(τ|X)] for a weighting function W(X). The weight W(X) can be a known function chosen by researchers, for example, to encode a counterfactual covariate distribution (Chernozhukov, Fernández-Val, and Melly, 2013). For W(X) = f(X), the density of the covariates, we estimate the data-driven density weight nonparametrically. We show that the proposed estimators of β_W(τ) are √n-consistent and converge weakly to a Gaussian process indexed by τ. It follows that we can conduct joint inference on Hadamard-differentiable functionals of the conditional or average quantile process, which can be nonlinear. In an intermediate step, we provide a new Bahadur-type linear representation for the generalized inverse, kernel-based CQF estimator, uniform over the covariates in an expanding compact set and over the quantile levels, which may be of independent interest.
The AQD is a nonparametric alternative to the linear QR of Koenker and Bassett (1978), just as the average mean derivative (AMD), E[∇E[Y|X]], is a nonparametric alternative to ordinary least-squares (OLS) estimation. Powell, Stock, and Stoker (1989) introduce the density-weighted AMD E[∇E[Y|X]f(X)], which has received considerable attention: Härdle and Stoker (1989), Powell and Stoker (1996), Nishiyama and Robinson (2000), Schafgans and Zinde-Walsh (2010), Cattaneo, Crump, and Jansson (2010, 2014a, 2014b), and Cattaneo and Jansson (2018), to mention just a few. When economic theory implies some semiparametric single-index and partially linear models, the AQD identifies the coefficient up to scale, e.g., Chaudhuri, Doksum, and Samarov (1997), Lee (2003), and Hoderlein and Mammen (2009). The AQD gives a simple picture of the impacts of the covariates on the outcome distribution and is more robust against possible extreme values than the mean estimators. In nonparametric nonseparable structural models, the derivative of the CQF has a causal interpretation of continuous quantile treatment effects, under certain conditional independence assumptions, e.g., Chesher (2003), Chernozhukov and Hansen (2005), Hoderlein and Mammen (2007), Matzkin (2007), and Sasaki (2015), among others. The AQD is a simple summary statistic for the quantile treatment effects, averaging over the covariates.
The paper is concerned with the estimation and inference of the weighted AQD in (1):

β_W(τ) ≡ E[W(X)∇Q(τ|X)]    (1)
       = −E[Q(τ|X){∇W(X) + W(X)∇f(X)/f(X)}],    (2)

where the equality follows by integration by parts, assuming the covariates have zero density on the boundary of the support. Hence, the weighted AQD can be interpreted as a weighted average CQF. More generally, we focus on the weighted average quantile response (AQR)

β_φ(τ) ≡ E[φ(X)Q(τ|X)].    (3)

When the weight φ(X) = −∇W(X) − W(X)∇f(X)/f(X), β_φ(τ) = β_W(τ) is the weighted AQD in (2). We propose a two-step estimator for β_W(τ). The first step is leave-one-out nonparametric kernel-based estimation of the unknown functions f(x), ∇f(x), and Q(τ|X). The CQF is estimated by a generalized inverse of the estimated conditional distribution function, Q̂(τ|X) ≡ inf{y : F̂_Y(y|X) ≥ τ}, where F̂_Y(y|X) is a smoothed local constant regression estimator. In the second step, the expectation is replaced by its sample analog, involving a stochastic trimming function to account for estimating the CQF near the boundary, where the density is small. The stochastic trimming function selects a compact interior subset of the support of the covariates by removing observation i if and only if f̂_XY(X_i, Q̂(τ|X_i)) < δ, where the trimming parameter δ vanishes at an appropriate rate as the sample size grows. Hence, our estimator is consistent for β_φ(τ) defined by the entire support of X.

We present three main results. First, to the best of our knowledge, this is the first paper to provide the limit theory for the weighted AQD defined by the whole support of X, which has zero density on the boundary and can be unbounded. Our limit theorems also cover the case when the estimand is defined by a compact interior subset of the support of X using a nonvanishing trimming parameter, as commonly done in the previous research discussed below. Our tractable approach may be applied to other multistep estimation problems based on a preliminary nonparametric kernel-based estimator, where stochastic trimming is required but often avoided due to technical complications.
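The integration-by-parts identity linking (1) and (2) can be verified numerically. The sketch below assumes a hypothetical location model Y = X + e with X, e independent standard normal, so that Q(τ|x) = x + z_τ and ∂Q/∂x = 1, and uses the density weight W = f; both expressions then equal ∫f(x)²dx = 1/(2√π).

```python
import numpy as np
from statistics import NormalDist

# Hypothetical illustration (not from the paper): Y = X + e with X, e ~ N(0,1)
# independent, so Q(tau|x) = x + z_tau and dQ(tau|x)/dx = 1.  Weight W(x) = f(x).
tau = 0.75
z_tau = NormalDist().inv_cdf(tau)

x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # f(x), standard normal density
fprime = -x * f                                   # f'(x)
Q = x + z_tau                                     # Q(tau|x)

# (1): beta_W(tau) = E[W(X) dQ/dx] with W = f  ->  integral of f(x)^2 dx
lhs = np.sum(f * 1.0 * f) * dx

# (2): beta_phi(tau) = E[phi(X) Q(tau|X)] with phi(x) = -2 f'(x)
# (the density-weight case, where the f(X) denominator cancels)
rhs = np.sum(-2.0 * fprime * Q * f) * dx
```

For this design, both sides equal 1/(2√π) ≈ 0.2821; note the z_τ shift drops out of (2) because ∫f′f = 0.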
The second result is the density-weighted AQD, obtained by choosing W(X) = f(X). The density weight inherits the spirit of the AMD in Powell et al. (1989). The density-weighted AQD has a simplified expression as β_φ with φ(X) = −2∇f(X), eliminating the denominator in (2). Consequently, compared with the AQD with a known weight in (2), the density-weighted AQD estimator allows the trimming parameter to vanish at a faster rate, so the estimator trims away fewer observations in finite samples and also requires weaker smoothness conditions on the distributions. We also provide an optimal bandwidth that minimizes the mean squared error (MSE), using the results of Powell and Stoker (1996).
The third result is a Bahadur-type linear representation of the CQF estimator that is uniform over the values of the covariates in a sequence of expanding compact interior subsets of the support and over quantile levels in a compact subset of (0,1). The new Bahadur-type representation allows us to use the stochastic trimming function to select the interior support of the covariates and the quantile. Hence, it is particularly useful when the CQF is involved in a multistep estimation procedure, for example, the first-price auction in Guerre, Perrigne, and Vuong (2000) and Marmer and Shneyerov (2012), and the quantile correlated random coefficients panel data model in Graham et al. (2018). We also derive the weak convergence of the conditional quantile process estimator.

Now, we discuss the contributions of the above three results to the related literature. The weighted AQDs in (1) and (2) were first estimated by Chaudhuri et al. (1997) using local polynomial estimators. Lee (2003) estimates (1) in a partially linear model. Recently, Belloni et al. (2019) develop a nonparametric series framework and perform inference on linear functionals of the CQF, including the AQD with a known weight. We add to the literature in three ways: (i) we provide an explicit expression for the first-order bias, which could be useful for robust inference in finite samples; (ii) our asymptotic analysis accounts for the estimation error of the density weight; and (iii) using a stochastic trimming function, we are able to conduct inference on the average quantile process defined by the whole support of X. In contrast, the abovementioned papers all employ a fixed trimming function for technical simplification, without losing their main focus; consequently, their estimands are defined by an interior subset of the support of the regressors. For the AQD to serve as a nonparametric QR coefficient, or in some economic applications, their estimands are therefore different objects from the AQD defined by the entire support.
Our stochastic trimming approach has two advantages over the fixed trimming approach: consistency and efficiency. In a nonparametric model, our estimator reaches the semiparametric efficiency bound (Newey, 1990). In the semiparametric single-index and partially linear models, a trimming function and the weighting function do not affect the consistency of the estimators. But when efficiency-optimal weighting is of concern, a fixed trimming parameter results in an efficiency loss, as noted in Lee (2003).
We contribute a new tractable approach to handle the stochastic trimming in the asymptotic analysis. Toward this end, we apply the results on nonparametric kernel-based estimators in Hansen (2008) and Cattaneo et al. (2013). In particular, to account for the entire support of the covariates, we face two challenges. The first is the "denominator problem" in estimating the CQF when the density is small near the boundary. Second, we derive the limit theory by plugging a Bahadur-type representation for the CQF estimator into the two-step AQD estimator, which becomes a U-statistic; the linear representation of Q̂(τ|X) is uniform over X in a compact inner subset of the support, rather than over the entire support. The key to overcoming these two problems is to incorporate a Bahadur-type representation of Q̂(τ|X) on an expanding compact interior support selected by a stochastic trimming function. The smoothed estimator F̂_Y(y|x) utilizes the uniform convergence results on an expanding interior support in Hansen (2008) and Cattaneo et al. (2013). The trimming function selects the expanding compact interior support by controlling the lower bound of the joint density f_XY(X, Q(τ|X)), which converges to zero. Then, we derive a Bahadur representation for the generalized inverse estimator of the CQF uniformly over X in this trimmed interior support, expanding at an appropriate rate as the sample grows, depending on a tradeoff between the tail behavior of the distribution and the estimation error from the CQF. Therefore, the trimmed compact interior support, where we have the uniform linear representation for Q̂(τ|X), expands to the entire support as the sample grows.
Finally, our third result contributes a Bahadur representation of the generalized inverse smoothed estimator for the CQF. Bhattacharya and Gangopadhyay (1990) provide a Bahadur representation uniform over the bandwidth, and Dabrowska (1992) derives uniformity over the quantiles. For the local polynomial estimator of the CQF, Chaudhuri et al. (1997) and Kong, Linton, and Xia (2010) derive Bahadur representations uniform in the covariates X; Qu and Yoon's (2015) result is uniform over the quantile τ; and Guerre and Sabbah (2012) and Fan and Guerre (2016) provide uniformity in X and τ. Extending their Bahadur representations to uniformity on expanding interior supports, the uniform convergence rate is penalized by the lower bound of the density at a slower rate, as noted in Hansen (2008). As a result, compared with our local constant estimator, a local linear quantile estimator would trim more observations in finite samples.
The paper proceeds as follows. In Section 2, we discuss applications of the AQD. Section 3 introduces the estimators. In Section 4, we first show a uniform linear representation and weak convergence for the nonparametric kernel-based CQF estimator. The AQD estimators are √n-consistent and asymptotically normal. We suggest a consistent estimator for the asymptotic covariance matrix and an optimal bandwidth choice. In Section 5, in a simple simulation study, we compare the proposed AQD estimator with the AMD estimator in Powell et al. (1989), the linear QR estimator in Koenker and Bassett (1978), and OLS for semiparametric partially linear models. We also implement several bootstrap-based confidence intervals (CIs). All proofs are in the Appendix.

APPLICATIONS
We discuss applications of the AQD, starting with a general nonparametric nonseparable structural model. We demonstrate how the AQD captures informative causal features under certain conditional independence assumptions. By imposing further assumptions on the data generating processes (DGPs) and the structural equations, the AQD identifies the coefficients (up to scale) in semiparametric QR models. Another application relates to the literature on counterfactual distributions and decomposition analysis. We discuss some of the earlier work most related to ours; further details can be found in the references therein.
Consider the general setting in Newey and Stoker (1993), Y = φ(X, e), where e captures unobserved individual heterogeneity and can be multidimensional. Let X = (X_1, X_2) and let X_1 be conditionally independent of e given X_2. Hoderlein and Mammen (2007) and Sasaki (2015) investigate the causal interpretation of the derivative of the CQF, which identifies a weighted average of heterogeneous structural partial effects among the subpopulation of individuals at the conditional quantile of interest.
A common identification strategy assumes that the structural function φ is strictly increasing in the scalar unobservable e. Then, the CQF of Y given X identifies the structural function φ up to a normalization on Q_e(τ|X_2):

Q(τ|X) = φ(X, Q_e(τ|X_2)).    (4)

Therefore, the partial derivative of the CQF with respect to X_1, ∂Q(τ|X)/∂X_1, identifies the structural derivative ∂φ(X, Q_e(τ|X_2))/∂X_1, which is the causal effect of X_1 while leaving the value of the unobserved variable e unchanged at Q_e(τ|X_2). Further assume quantile independence and a normalization such that (4) yields φ(X, Q_e(τ)) = φ(X, τ), which is the τth quantile treatment response defined in Chernozhukov and Hansen (2005). It follows that for a nonseparable single-index model Y = φ(X′β_0, e), β_W(τ) identifies the index coefficient β_0 up to scale, e.g., Chaudhuri et al. (1997) and the rank estimator in Khan (2001). The structural function φ can then be further estimated by a nonparametric QR of Y on the estimated index X′β̂_W(τ).¹

The partially linear and single-index models relax restrictive parametric assumptions and ease the curse of dimensionality in nonparametric estimation. For a single-index QR model, Y = φ_τ(X′β_τ) + e_τ with Q_{e_τ}(τ|X) = 0, the weighted AQD identifies the coefficient β_τ up to scale: β_W(τ) = β_τ E[φ′_τ(X′β_τ)W(X)]. For example, Wu, Yu, and Yu (2010) propose an iterative algorithm, and Kong and Xia (2012) propose an adaptive estimation procedure.
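The up-to-scale identification in the single-index case can be seen mechanically: since ∇Q(τ|x) = φ′_τ(x′β_τ)β_τ, any sample analog of E[W(X)∇Q(τ|X)] is exactly a scalar multiple of β_τ. The toy check below assumes a hypothetical cubic link and an arbitrary Gaussian-shaped weight, neither taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0])                    # hypothetical index coefficient

X = rng.normal(size=(100_000, 2))
index = X @ beta
# For Q(tau|x) = (x'beta)^3 + z_tau, the gradient is 3 (x'beta)^2 * beta
grad_Q = 3.0 * (index**2)[:, None] * beta
W = np.exp(-0.5 * np.sum(X**2, axis=1))        # an arbitrary known weight W(x)

beta_W = np.mean(W[:, None] * grad_Q, axis=0)  # sample analog of E[W(X) grad Q]
ratio = beta_W[0] / beta_W[1]                  # recovers beta[0]/beta[1] = 0.5
```

The ratio is exact (up to floating point), not just approximate, because the integrand is the fixed vector β_τ times a common scalar function of x.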
The weighted AQD relates to the counterfactual distribution literature through the choice of the weighting function W(X); see Chernozhukov et al. (2013) and the references therein. For example, the policy maker may change the covariate distribution exogenously to some probability density function f*(X), as in Rothe (2010), and ask a simple counterfactual question: "What would the AQD have been if individuals' attributes had been distributed according to f*(X)?" Choosing W(X) = f*(X)/f(X) gives β_W(τ) = ∫ ∇Q(τ|x) f*(x) dx, the counterfactual AQD. Our estimator for β_W is directly applicable when W(X) is known. The limit theory for an estimated W(X) can be obtained by modifying the general results in the Appendix.

ESTIMATION
The data consist of n observations (X_i, Y_i), i = 1,...,n, which form an independently and identically distributed random sample from a distribution F_XY. The τth CQF of Y given X is Q(τ|X) ≡ inf{y : F_Y(y|X) ≥ τ}, where F_Y(y|X) is the conditional cumulative distribution function (CDF) of Y given X.
We propose two-step estimators for three estimands: the weighted AQR β_φ(τ) in (3), the weighted AQD β_W(τ) in (2), and the density-weighted AQD

β_f(τ) ≡ E[f(X)∇Q(τ|X)] = −2E[Q(τ|X)∇f(X)].

The first step is leave-one-out nonparametric estimation of the unknown functions. The second step is the sample analog involving a stochastic trimming function 1{X_i ∈ Ŝ}:

β̂_φ(τ) ≡ (1/n) Σ_{i=1}^n φ(X_i) Q̂(τ|X_i) 1{X_i ∈ Ŝ},    (6)
β̂_W(τ) ≡ −(1/n) Σ_{i=1}^n Q̂(τ|X_i){∇W(X_i) + W(X_i)∇f̂(X_i)/f̂(X_i)} 1{X_i ∈ Ŝ},    (7)
β̂_f(τ) ≡ −(2/n) Σ_{i=1}^n Q̂(τ|X_i)∇f̂(X_i) 1{X_i ∈ Ŝ},    (8)

for a quantile level τ ∈ T = [ε, 1 − ε] with ε ∈ (0, 1/2). Following Powell et al. (1989), a more interpretable rescaled density-weighted AQD is defined by normalizing β_f(τ) by the average density. Next, we describe each component of the estimators. The leave-one-out kernel estimator of the density of X at X_i is

f̂(X_i) ≡ (n − 1)⁻¹ Σ_{j≠i} |H_1|⁻¹ K_1(H_1⁻¹(X_j − X_i)),

where K_1 is a ν_1th-order multivariate product kernel and the bandwidth matrix H_1 is the d × d identity matrix multiplied by h_1, a positive sequence in n. The covariates X can be normalized by their standard deviations, so that for simplicity the bandwidths equal the same h_1 for all components of X.

¹ This specification includes many models as special cases, for example, a selection model in which X_1, X_2, and Y = X_1′β_1 + e_1 are observed only if the unobserved Z_2* = X_2′β_2 + e_2 > 0. Assuming (e_1, e_2) is independent of (X_1, X_2), Q_Y(τ|X_1, X_2, Z_2* > 0) = X_1′β_1 + Q_{e_1}(τ|Z_2* > 0). If X_2 has no variables in common with X_1, then the AQD identifies the structural parameter β_1 and the selection parameter β_2 up to scale. If X_1′β_1 and X_2′β_2 are the same, then it is the truncated Tobit model, as discussed in Stoker (1986).
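The leave-one-out density estimator can be sketched in a few lines. The version below assumes a second-order Gaussian product kernel and H_1 = h_1 I_d; the paper allows a ν_1th-order kernel K_1, so this is a simplification for illustration.

```python
import numpy as np

def loo_density(X, h):
    """Leave-one-out estimate f_hat(X_i) with a Gaussian product kernel and
    bandwidth matrix H1 = h * I_d (a second-order kernel, for simplicity)."""
    n, d = X.shape
    u = (X[:, None, :] - X[None, :, :]) / h               # (n, n, d) differences
    K = np.exp(-0.5 * np.sum(u**2, axis=2)) / (2.0 * np.pi) ** (d / 2.0)
    np.fill_diagonal(K, 0.0)                              # leave observation i out
    return K.sum(axis=1) / ((n - 1) * h**d)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))                            # X ~ N(0, I_2)
f_hat = loo_density(X, h=0.5)
# For the bivariate standard normal, E[f(X)] = 1/(4*pi) ~ 0.0796; the sample
# mean of f_hat should be near this value, with some smoothing bias.
```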
The CQF is estimated by inverting the estimated conditional CDF, Q̂(τ|X) ≡ inf{y : F̂_Y(y|X) ≥ τ}, where F̂_Y(y|X) is a leave-one-out smoothed local constant estimator with a kernel K of order ν and a bandwidth matrix H. The indicator function 1{Y_j ≤ y} for the dependent variable is smoothed by a cumulative kernel G(z) = ∫_{−∞}^{z} g(t)dt with a second-order kernel g and bandwidth h_0. The CQF estimator obtained by the generalized inverse is monotone in τ by construction. However, the CDF estimator need not be increasing in y when we use a bias-reducing or higher-order kernel K. Chernozhukov, Fernández-Val, and Galichon (2010) propose a generic rearrangement method to obtain a monotonized version of the estimate F̂_Y(y|X_i), which preserves the same asymptotics as F̂_Y(y|X_i). The CQF can then be estimated by the generalized inverse of the monotonized CDF estimate.

The trimming function 1{X_i ∈ Ŝ} is defined by a small enough positive trimming parameter δ, which can be a constant or a positive sequence converging to zero, and S ≡ {x : f_XY(x, Q(τ|x)) ≥ δ}. To estimate the AQR and the AQD at a particular quantile level τ, let Ŝ ≡ {x : f̂_XY(x, Q̂(τ|x)) ≥ δ}. The trimming parameter δ can be a constant that defines an interior subset of the support of X, or a sequence δ = δ_n converging to zero that defines a sequence of expanding sets converging to the entire support X ⊆ R^d.³ The conditions on the tuning parameters h, h_0, h_1, ν, ν_1, δ are specified in Assumptions 3 and 4 in Section 4.

² Note that the weighted AQD β_W(τ) is a nonparametric object of interest, a summary statistic of the marginal effect of X on the CQF. We do not assume that x′β_W(τ) is the CQF or that it is monotone in τ for all x. However, quantile crossings do occur in the single-index and partially linear models in Section 2, which is a fundamental problem for (semi)parametric QR models. In particular, Phillips (2015) shows that quantile crossings are inevitably present with positive probability in quantile predictive regressions.
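The generalized-inverse CQF estimator and the stochastic trimming rule can be sketched as follows. The design Y = X + e, the Gaussian kernel in x, the logistic cumulative kernel G, and the values of h, h_0, δ are all illustrative assumptions, not the paper's choices (the paper only requires a second-order symmetric g).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)                  # toy design: Q(tau|x) = x + z_tau

def cqf_hat(tau, x0, X, Y, h, h0, grid):
    """Generalized-inverse CQF estimator Q_hat(tau|x0) = inf{y: F_hat(y|x0) >= tau}.
    F_hat(y|x0) is local constant in x, with the indicator 1{Y_j <= y} smoothed
    by a cumulative kernel G (a logistic G here, for brevity)."""
    Kx = np.exp(-0.5 * ((X - x0) / h) ** 2)                 # kernel weights in x
    G = 1.0 / (1.0 + np.exp(-(grid[:, None] - Y[None, :]) / h0))
    F_hat = (G * Kx[None, :]).sum(axis=1) / Kx.sum()        # F_hat(y|x0) on grid
    return grid[np.argmax(F_hat >= tau)]                    # generalized inverse

h, h0, delta = 0.3, 0.1, 0.01
grid = np.linspace(-4.0, 4.0, 801)
q_med = cqf_hat(0.5, 0.0, X, Y, h, h0, grid)                # true value is 0

# Stochastic trimming: keep x0 only if the estimated joint density at
# (x0, Q_hat(tau|x0)) is at least delta.
def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

f_xy_hat = np.mean(gauss((X - 0.0) / h) * gauss((Y - q_med) / h0)) / (h * h0)
keep = f_xy_hat >= delta
```

At an interior point like x0 = 0 the joint density is far above δ, so the observation is kept; near the boundary of the support f̂_XY falls below δ and the point is trimmed.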
The derivative can only be calculated for the continuous covariates. When the covariates contain discrete components, the same estimation works for each point in a finite set of the realized values of the discrete components.

ASYMPTOTIC PROPERTIES
We provide a Bahadur-type representation for the CQF estimator, uniform in the quantiles and the covariates, in Section 4.1. We also show the weak convergence of the conditional quantile process indexed by the quantile, for a given value of the regressors. Section 4.2 presents the limit theory for our estimators of the AQR β_φ, the density-weighted AQD β_f, and the weighted AQD β_W. In Section 4.3, we provide a consistent estimator of the asymptotic covariance matrix. In Section 4.4, we provide an optimal bandwidth choice. Limits are taken as n → ∞ unless otherwise noted. We start with regularity assumptions. The joint density of (X, Y), denoted by f_XY(x, y), is with respect to the Lebesgue measure on X × Y ⊆ R^{d+1}. Denote the boundary of X by ∂X.
with respect to x of order p_x are uniformly continuous in x and bounded; ∂³F_Y(y|x)/∂y³ is uniformly continuous in y and bounded.
Assumption 2 (Kernel). (K) The kernel function k is Lipschitz continuous, bounded, symmetric, with convex bounded support, and of order ν, i.e., κ_j ≡ ∫ u^j k(u)du = 0 for j < ν and κ_ν ∈ (0,∞). The first derivative k′ is bounded and integrable. (G) The second-order kernel g is bounded and symmetric. When g has unbounded support and Y ⊂ R, there exist positive constants C, L < ∞ and m > 4 such that |g(u)| ≤ C|u|^{−m} for |u| > L.
Assumption 2(K) imposes standard kernel conditions, e.g., Powell et al. (1989) and Hansen (2008). Assumption 2(G) is used to characterize the first-order bias from smoothing the indicator function with G((y − Y_j)/h_0). When g has unbounded support and Y does not have full support on R, Assumption 2(G) restricts the tail behavior of g; the commonly used Gaussian kernel satisfies it.
³ For the denominator problem, Robinson (1988), Härdle and Stoker (1989), Lavergne and Vuong (1996), Ichimura and Todd (2007), and Escanciano, Jacho-Chávez, and Lewbel (2014), among others, use a similar stochastic trimming approach that bounds the density of X away from zero. Because we face the additional problem of estimating the CQF and its uniform linear representation, trimming on f(X) alone is not sufficient. Similarly, for the series estimator in Belloni et al. (2019), the AQD is defined by a fixed interior support on which f_XY(x, Q(τ|x)) is bounded away from zero.

Conditional Quantile Function
We give a condition on the trimming parameter δ in the trimming function 1{X_i ∈ S} via an expanding set C_n ≡ {x : ‖x‖ ≤ c_n}, where c_n is a positive sequence with c_n → ∞ and ‖x‖ = (x′x)^{1/2}. We then build on and extend the uniform convergence results for kernel-based estimators on an expanding interior support in Hansen (2008) and Cattaneo et al. (2013) to specify the condition on c_n. Specifically, for any positive sequence δ = δ_n → 0, define S = S_n ≡ {x : inf_{τ∈T} f_XY(x, Q(τ|x)) ≥ δ}, which approaches X as n → ∞. There exists such an expanding set C_n equal to the convex hull of S. We can show that δ = inf_{τ∈T} f_XY(x̄, Q(τ|x̄)) with x̄ on the boundary of C_n, i.e., ‖x̄‖ = c_n. When X is convex, we can write δ = inf_{x∈C_n, τ∈T} f_XY(x, Q(τ|x)).⁴ We remark that the convex support condition in Assumption 1(i) is used to conveniently specify the condition on the trimming parameter δ and is imposed for notational simplicity. Nonetheless, the convex support assumption is not uncommon in the literature; for example, it can be implied by other smoothness conditions on f(x), f_Y(y|x), or Q(τ|x), as in Guerre and Sabbah (2012) and Qu and Yoon (2015), among others. We can allow a nonconvex support by making Assumption 1 local, following Qu and Yoon (2015), in the sense that the restrictions are on neighborhoods surrounding X × Y rather than on the support of (X, Y), by a slight abuse of notation. Then, our results apply when the support is nonconvex and is a union of convex sets.

THEOREM 1 (Bahadur representation). Let Assumptions 1 and 2 hold. Furthermore, let δ⁻¹ log n/(nh^d h_0) → 0. Then, for any τ ∈ T and x ∈ S,

Q̂(τ|x) − Q(τ|x) = (τ − F̂_Y(Q(τ|x)|x)) / f_Y(Q(τ|x)|x) + R_n(τ,x),

where the remainder term R_n(τ,x) satisfies the uniform bound in (10).

The linear representation is useful for analyzing the large sample properties of the final plug-in estimators in different applications. The CQF estimator inherits the uniform convergence rate of the conditional CDF estimator F̂_Y(y|X). The Bahadur representation allows for the optimal rate for estimating the CQF. In the uniform convergence rate of the remainder term R_n, the second part of (10) is of smaller order if the estimator is undersmoothed by assuming √(nh^d)(h_0² + h^ν) → 0. Theorem 2 below shows the weak convergence of the empirical conditional quantile process {Q̂(τ|x) : τ ∈ T} for a given x. When the interest is in conducting inference on the CQF, Theorem 2 enables the inference method of Fan and Liu (2016), who develop a new CI from any conditional quantile process estimator that weakly converges to a Gaussian process.

⁴ We can normalize X such that the support X is centered at zero without loss of generality. To see the relationship between δ and c_n, let T = {τ} be a singleton for simplicity. Since S is compact, x̄ on the boundary of the convex hull of S is also on the boundary of S, and hence f_XY(x̄, Q(τ|x̄)) ≥ δ. Suppose to the contrary that the inequality is strict, i.e., f_XY(x̄, Q(τ|x̄)) = aδ > δ for a constant a > 1. By the assumption lim_{x→∂X} f(x) = 0, we have lim_{x→∂X} f_XY(x, Q(τ|x)) = 0, and hence S is a strict subset of X for any δ > 0, so X ∩ S^c is not empty. By continuity, there exists η > 0 such that points within distance η outside C_n still satisfy f_XY(x, Q(τ|x)) ≥ δ and hence belong to S, contradicting that C_n is the convex hull of S. When X is convex, S = C_n for n large enough, and hence δ = inf_{x∈C_n} f_XY(x, Q(τ|x)) by construction.
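The leading term of the Bahadur representation can be examined numerically: Q̂(τ|x) − Q(τ|x) should track (τ − F̂_Y(Q(τ|x)|x))/f_Y(Q(τ|x)|x) with a small remainder. The toy design below (Y = X + e with standard normals, logistic smoothing, and all tuning values) is hypothetical, chosen only so the true CQF and conditional density are known in closed form.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)                 # true CQF: Q(tau|x) = x + z_tau

tau, x0, h, h0 = 0.5, 0.0, 0.25, 0.1
Kx = np.exp(-0.5 * ((X - x0) / h) ** 2)    # local kernel weights around x0

def F_hat(y):
    """Smoothed local constant conditional CDF estimator at (y, x0)."""
    G = 1.0 / (1.0 + np.exp(-(y - Y) / h0))  # logistic smoothing of 1{Y <= y}
    return np.sum(Kx * G) / np.sum(Kx)

grid = np.linspace(-4.0, 4.0, 4001)
q_hat = grid[np.argmax([F_hat(y) >= tau for y in grid])]  # generalized inverse

q_true = x0 + NormalDist().inv_cdf(tau)                   # = 0 at the median
f_cond = NormalDist().pdf(q_true - x0)                    # f_Y(Q(tau|x0)|x0)
linear = (tau - F_hat(q_true)) / f_cond                   # leading Bahadur term
remainder = (q_hat - q_true) - linear                     # should be small
```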

Weighted Average Quantile Response
We first establish the asymptotic linearity of the estimator of the weighted AQR with a known weight φ(X) in (6). Theorem 3 also provides the preliminary results needed to analyze β̂_f and β̂_W, where the weight is estimated. Let ∂_k^l g(x) denote the lth-order partial derivative of a generic function g(x) with respect to the kth component of x.

THEOREM 3 (Weighted AQR). Let the conditions in Theorem 1 hold with p_x ≥ ν + 1. For a measurable function φ : X → R^q, assume the p_x th-order derivatives of φ(x) are uniformly continuous and bounded. Let conditions (i)–(iii) hold.

Condition (i) in Theorem 3 is due to the U-process theory in Sherman (1994). The bandwidths are large enough to achieve asymptotic linearity, yet small enough, with possibly higher-order kernels, to control the bias to vanish at a rate no slower than √n. Consequently, the nonparametric estimators of the density and the CQF are undersmoothed, which is conventional in semiparametric estimation.
Our asymptotic theorems apply to both the fixed trimming and the vanishing trimming approaches. The asymptotic linear representation in Theorem 3 coincides with that of Chaudhuri et al. (1997), where the weight W serves as a trimming function defining a compact interior subset of the support of X. We further characterize the first-order bias from the CQF estimate, which vanishes at a faster rate under the condition √n(h_0² + h^ν) → 0. The trimming parameter δ vanishes at a rate specified by conditions (ii) and (iii) in Theorem 3. Condition (ii) ensures that δ is large enough to control the sampling variation of estimating the CQF. Condition (iii) controls the trimming bias, which depends on the tail behavior of f_XY, and ensures that δ is small enough for the estimator to approach the entire support. The trimming parameter δ thus trades off the estimation variance and the trimming bias,⁵ while the bandwidths h and h_0 trade off the variance and bias of the preliminary estimators.
We use a standard approach to derive the limit theory by plugging a Bahadur-type representation for the CQF estimator into the two-step estimator, which becomes a U-statistic. The linear representation of Q̂(τ|X) given in Theorem 1 has the joint density f_XY(x, Q(τ|x)) in the denominator. Heuristically, we control the remainder terms in the linear representation of the final estimator to be of smaller order based on 1/f_XY(x, Q(τ|x)) ≤ 1/δ. In a semiparametric problem that involves the CQF, the conditional density f_Y(Q(τ|x)|x) is commonly assumed to be bounded away from zero, for identification or for technical simplification. Assumption 1 implies that there exist constants δ and c such that trimming based on f_Y(Q(τ|x)|x) ≥ c and on the marginal density of X is equivalent to using the joint density f_XY(x, Q(τ|x)) in our asymptotic analysis.
Exploiting the assumption f_Y(Q(τ|x)|x) > c would not improve the large sample properties under our approach.
Next, we present the asymptotic properties of the density-weighted AQD with the estimated weight ∇f̂(X) in (8). Assumption 3 gives specific conditions on the tuning parameters for β̂_f, which are sufficient for the conditions in Theorem 3.
THEOREM 4 (Density-weighted AQD). Let the conditions in Theorem 1 hold with p_x ≥ max{ν + 1, ν_1 + 2} and φ(X) = −2∇f(X). Let Assumption 3 hold. Define the influence function accordingly. Then, the empirical process indexed by τ ∈ T converges weakly to a zero-mean Gaussian process, with covariance function given for any τ_1 ≤ τ_2 ∈ T. An analogous result holds for the scaled density-weighted AQD.

Now, we consider the general weighted AQD β̂_W(τ) in (7). Assumption 4 below gives specific conditions on the tuning parameters for β̂_W.
Our theorems show that the proposed estimators are asymptotically linear and converge weakly to Gaussian processes. It is worth noting the differences between estimating the density-weighted AQD β̂_f in Theorem 4 and the general weighted AQD β̂_W in Theorem 5. First, estimating the density weight contributes an additional term f(X)∇Q(τ|X) − β_f(τ), so the estimation error of the density function cannot be ignored for correct inference. Second, because the density appears in the denominator of β_W, the trimming bound δ vanishes at a slower rate; thus, β̂_W trims more observations in finite samples than β̂_f does. Third, the smoothness conditions are weaker for β̂_f than for β̂_W.
Remark 1 (Efficiency). Since our estimands are explicit functions of the distribution, Newey (1990) implies that these nonparametric estimators reach the semiparametric efficiency bounds when the distribution is unrestricted. That is, the influence function of any asymptotically linear and regular estimator of our estimand is unique and hence efficient. It follows that our estimators reach the efficiency bounds of the weighted AQR and the (density-)weighted AQD, respectively. Other nonparametric first-step estimators of the unknown functions, such as series or local polynomial estimators, yield the same asymptotic distribution.⁶

Remark 2 (Choice of tuning parameters). Although the tail distribution conditions are not testable, we provide an example that satisfies the condition lim sup_{n→∞} log(c_n)/log n < ∞ in Theorem 1. When the tail of the joint distribution of (X, Y) decays at an exponential rate, we can choose c_n ∝ (log n)^q for some q > 0, and δ = δ_n ∝ n^{−b}.⁷ An alternative set of sufficient conditions on the nonparametric tuning parameters in Theorems 4 and 5 is to let the positive sequences vanish at polynomial rates, h ∝ n^{−a}, h_1 ∝ n^{−c}, h_0 ∝ n^{−η}, and δ ∝ n^{−b}, for some positive constants a, b, c, η with η ∈ [1/4, 1 − 2ad) and

b < min{(1 − 2ad − η)/6, (1 − ad − c(d + 2))/4, (1 − ad − cd)/6, (1 − 2ad)/8}.

These sufficient conditions give an upper bound on the convergence rate of the tuning parameter δ ∝ n^{−b}; that is, δ cannot be too small, so that we can control the first-step estimation error. On the other hand, the condition ∫_{B_n} Q(τ|X)∇f(X)f(X)dX = o(n^{−1/2}) in Assumptions 3 and 4 implies that δ cannot be too big, so that the trimming bias is of smaller order. To see that these conditions are feasible in practice, we give a set of tuning parameters for β̂_f in our Monte Carlo simulations.

⁶ Specifically, Newey and Stoker (1993) calculate the efficiency bounds for the weighted average derivative under general loss functions, including conditional means and quantiles, where the weighting function is known. By proceeding as in the proof of Theorem 3.1 in Newey and Stoker (1993), we can calculate the efficiency bounds for the density-weighted average quantile/mean derivatives, where the density weight is estimated. We can verify that the estimators proposed in this paper and in Powell et al. (1989) are semiparametrically efficient, as implied by the result in Newey (1990). Since the proof closely follows Theorem 3.1 in Newey and Stoker (1993), we do not repeat the details to save space.

⁷ Suppose the tail of the joint distribution of Z = (X, Y) is proportional to e^{−‖z‖^p} for some p > 0. The bandwidth assumption requires the trimming parameter δ to be bounded above by inf_{‖x‖≤c_n} f_XY(x, Q(τ|x)) ∝ e^{−(log n)^{qp}}, which is larger than n^{−b} by letting qp < 1. A smaller p results in a larger q, meaning that we can use more observations when f_XY has a fatter tail.
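The polynomial-rate conditions on η and b in Remark 2 can be checked mechanically for candidate tuning rates. The values of d, a, c, η, b below are hypothetical, and only the inequalities that survive in the text are verified; any separate restrictions on a and c themselves are not reproduced here.

```python
# Polynomial-rate tuning parameters: h ~ n^-a, h1 ~ n^-c, h0 ~ n^-eta,
# delta ~ n^-b.  Candidate (hypothetical) values for dimension d = 2.
d = 2
a, c = 1 / 8, 1 / 8
eta = 1 / 4
b = 0.03

ok_eta = 1 / 4 <= eta < 1 - 2 * a * d
b_max = min((1 - 2 * a * d - eta) / 6,
            (1 - a * d - c * (d + 2)) / 4,
            (1 - a * d - c * d) / 6,
            (1 - 2 * a * d) / 8)
ok_b = 0 < b < b_max        # delta cannot vanish too fast
```

For these values the binding bound is (1 − 2ad − η)/6 = 1/24 ≈ 0.042, so b = 0.03 is feasible.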
Remark 4 (Alternative small bandwidth asymptotics and bootstrap). Resampling methods for two-step semiparametric estimators have been developed recently. Cattaneo and Jansson (2018) develop an asymptotic theory alternative to the conventional empirical process theory, which relies on the usual stochastic equicontinuity condition and is used in this paper. They allow for low precision of the first-step kernel-based estimators due to a small bandwidth and account for the resulting undersmoothing bias. They show that some nonparametric bootstrap methods automatically correct for such bias. In our simulation study, we examine the robustness of the nonparametric bootstrap, as well as our normal distributional approximation, with respect to the bandwidth choice.

Asymptotic Covariance Matrix
An asymptotically pivotal test statistic, a CI, or the corresponding hypothesis test can be constructed from a studentized version of the estimator using Slutsky's theorem with a consistent covariance matrix estimator. The covariance matrix can be consistently estimated by the sample variance of uniformly consistent estimators of the influence function. We provide a covariance matrix estimator that is composed of preliminary estimators already used in the primary AQD estimator. Thus no additional estimation is needed for the asymptotic covariance, such as estimating the derivative of the CQF, ∇Q(τ|X_i).
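The plug-in idea can be sketched as follows: estimate the covariance between quantile levels by the sample covariance of the estimated influence-function values. This is a minimal illustration in which synthetic draws stand in for the estimated influence functions; the names `psi_hat` and `cov_hat` are hypothetical, not the paper's notation.

```python
# Minimal sketch: covariance estimation from estimated influence functions.
# psi_hat[t] stands in for an (n, d) array of influence-function values at
# quantile level taus[t]; here they are synthetic normal draws.
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 2
taus = [0.25, 0.5, 0.75]
psi_hat = {t: rng.standard_normal((n, d)) for t in range(len(taus))}

def cov_hat(psi_a, psi_b):
    """Sample covariance block: n^{-1} sum_i psi_a_i psi_b_i' (centered)."""
    a = psi_a - psi_a.mean(axis=0)
    b = psi_b - psi_b.mean(axis=0)
    return a.T @ b / len(a)

# d x d covariance block between quantile levels tau = 0.25 and tau = 0.75.
sigma_12 = cov_hat(psi_hat[0], psi_hat[2])
print(sigma_12.shape)  # (2, 2)
```

In the paper's construction the influence-function values are built from the same preliminary kernel estimators as the AQD estimator itself, so no extra nonparametric step is required.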
We utilize the projection structure of the U-statistic, following Härdle and Stoker (1989) for the AMD. Define the covariance estimator as the sample analog below, where φ̂(X_i), ψ̂(X_i; τ), and γ̂(X_i; τ) are defined specifically for β̂_f and β̂_W as follows.

Optimal Bandwidth Choice
We consider a linear combination of the density-weighted AQD, a′β̂_f, where a ∈ R^d and a′a = 1. For example, when a = (1, 0, ..., 0)′, a′β̂_f estimates the density-weighted AQD with respect to the first component of X_i. We modify the estimator proposed by Powell and Stoker (1996) for the optimal bandwidth h in K that minimizes the leading terms of the asymptotic MSE of a′β̂_f. We can apply this approach to β̂_φ and β̂_W. We may also extend this approach to choose the trimming parameter δ and h_0.
The optimal bandwidth for a′β̂_f is estimated by ĥ_opt = (dV̂/(2νB̂²))^{1/(2ν+d)} n^{−2/(2ν+d)}, where V̂ estimates the leading variance associated with the bandwidth h and B̂ estimates the leading bias. As the first-order term of the asymptotic expansion of β̂_f is O_p(n^{−1/2}), by choosing this optimal bandwidth h_opt, the bias is first-order asymptotically negligible. In addition, the first-step estimators are undersmoothed.
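A plug-in sketch of this bandwidth rule, assuming the MSE leading terms take the form B²h^{2ν} + Vn^{−2}h^{−d}, whose minimizer is h_opt = (dV/(2νB²))^{1/(2ν+d)} n^{−2/(2ν+d)}. The values of V, B, n, d, and ν below are illustrative placeholders, and the grid search is only a sanity check of the closed form.

```python
# Plug-in optimal bandwidth minimizing B^2 h^{2 nu} + V n^{-2} h^{-d}.
import numpy as np

def h_opt(n, d, nu, V, B):
    """Closed-form minimizer of the leading MSE terms."""
    return (d * V / (2 * nu * B ** 2)) ** (1.0 / (2 * nu + d)) * n ** (-2.0 / (2 * nu + d))

# Sanity check against a numerical grid minimizer (illustrative values).
n, d, nu, V, B = 500, 1, 2, 1.0, 1.0
grid = np.linspace(1e-3, 1.0, 100000)
mse = B ** 2 * grid ** (2 * nu) + V * n ** (-2) * grid ** (-d)
h_num = grid[np.argmin(mse)]
assert abs(h_num - h_opt(n, d, nu, V, B)) < 1e-3
```

Because the squared-bias term grows in h and the variance term shrinks in h, the first-order condition balances them and yields the n^{−2/(2ν+d)} rate quoted above.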
and the preliminary estimators Q̂(τ|X), ∇f̂(X), and f̂_XY use a νth-order kernel with bandwidth h = h_1 = h_v. We consider a fixed trimming function 1_i with a constant δ to simplify the application of the results in Powell and Stoker (1996). Consider estimating the leading bias of β̂_f. For a positive constant u ≠ 1 and a preliminary bandwidth h = h_1 = h_b, let β̂_{f,h_b} and β̂_{f,uh_b} be the estimators using the bandwidths h_b and uh_b, respectively. Let ν = ν_1 for simplicity.

COROLLARY 1. Let the conditions in Theorem 4 hold.

(i) Then, the bandwidth that minimizes the leading terms associated with h in the asymptotic MSE of β̂_f is h_opt = (dV/(2νB²))^{1/(2ν+d)} n^{−2/(2ν+d)}.
(ii) Then, V̂ and B̂ are consistent estimators of V and B, respectively, and ĥ_opt − h_opt = o_p(n^{−2/(2ν+d)}).
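The two-bandwidth bias-estimation step can be sketched as follows. If E[β̂_f(h)] ≈ β + Bh^ν, differencing estimators computed at bandwidths h_b and uh_b (with u ≠ 1, as required above) isolates B. The helper `bias_estimate` is one natural implementation of this differencing idea, not the paper's exact formula, and the numbers are synthetic.

```python
# Two-bandwidth bias estimate: if beta_hat(h) ≈ beta + B h^nu, then
# B_hat = (beta_hat(h_b) - beta_hat(u h_b)) / (h_b^nu (1 - u^nu)).

def bias_estimate(beta_h, beta_uh, h_b, u, nu):
    """Recover the leading bias coefficient B from two bandwidths."""
    assert u != 1, "u = 1 makes the denominator vanish"
    return (beta_h - beta_uh) / (h_b ** nu * (1 - u ** nu))

# Synthetic check: beta_hat(h) = beta + B h^nu exactly, so B is recovered.
beta, B, nu = 1.0, 2.5, 2
h_b, u = 0.2, 0.5
beta_h = beta + B * h_b ** nu
beta_uh = beta + B * (u * h_b) ** nu
assert abs(bias_estimate(beta_h, beta_uh, h_b, u, nu) - B) < 1e-12
```

The assertion makes the role of u ≠ 1 explicit: at u = 1 the two estimators coincide and the bias coefficient is not identified from the difference.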
We may consider a further bias correction using the above simple bias estimator, i.e., β̂_f − h^ν B̂. Such robust bias-corrected inference may allow for a wider range of bandwidth choices h in practice, as discussed in Calonico, Cattaneo, and Farrell (2018), and is left for future research.

MONTE CARLO SIMULATIONS
We compare the finite-sample performance of our scaled density-weighted AQD estimator β̂_s with the AMD in Powell et al. (1989), the conventional Koenker and Bassett (1978) linear QR (labeled KB), and OLS. We consider partially linear models with homoskedastic and heteroskedastic errors for the DGPs. Both the AMD and the AQD identify the coefficient of interest in the partially linear model, whereas the linear OLS and KB estimators suffer from misspecification. We find that the scaled density-weighted AQD estimator β̂_s outperforms the AMD when the outcome distribution has fat tails. The results are rather robust across a range of bandwidths. We consider four DGPs that are modified from the experiments in Lee (2003).
1. Partially linear model with homoskedastic error (PL-homo).
2. Partially linear model with heteroskedastic error (PL-hetero).

We consider two error distributions: e ∼ N(0,1) and e ∼ t(2) for a fat-tailed distribution. The regressors X_1 and X_2 are jointly normal with mean zero, variance one, and covariance 0.5; thus, the regressors have unbounded support. The parameter of interest is the coefficient of X_2, i.e., the true parameter is 1. We use the fourth-order Epanechnikov kernel. Under Assumption 3, we choose the trimming bound δ ∝ n^{−0.02} and trim 5% of the sample at the tails. The bandwidths are h_1 = h = Cσ_x n^{−0.15} and h_0 = Cσ_y n^{−0.3}, where the powers satisfy Assumption 3, and σ_x and σ_y are the interquartile ranges of X and Y, respectively (Silverman, 1986). Figures 1 and 2 report the MSEs against a range of C. There are 1,000 replications in each experiment. We compute the optimal bandwidth h_opt for (0,1)′β̂_f proposed in Section 4.4 as a reference bandwidth. For the DGP PL-hetero-N(0,1), the corresponding optimal constant is C = 2.39 for n = 200 and C = 1.76 for n = 500.^9 The theoretical optimal bandwidth that minimizes the MSE appears to agree with the simulation results in the lower panel of Figure 1.
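A minimal data-generating sketch in the spirit of these designs, using the facts stated above (coefficient 1 on X_2, jointly normal regressors with covariance 0.5, N(0,1) or t(2) errors). The nonlinear component g below is a placeholder, not the Lee (2003) specification.

```python
# Illustrative partially linear DGP: Y = X2 * 1 + g(X1) + e.
import numpy as np

def simulate_pl(n, error="normal", seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])       # Var 1, Cov 0.5
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    x1, x2 = X[:, 0], X[:, 1]
    g = np.sin(x1)                                  # placeholder nonlinear part
    e = rng.standard_normal(n) if error == "normal" else rng.standard_t(2, n)
    y = x2 * 1.0 + g + e                            # true coefficient on X2 is 1
    return X, y

X, y = simulate_pl(1000, error="t2")
print(X.shape, y.shape)
```

With the t(2) error the outcome distribution has fat tails, which is the regime in which the quantile-based AQD is reported to outperform the mean-based AMD.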
For the normal error in Figure 1, AMD and AQD outperform the linear estimators, OLS and KB. For the fat-tailed error in Figure 2, the QRs (AQD and KB) outperform the mean regressions (AMD and OLS). The optimal bandwidth that minimizes the MSE for the AQD is smaller than that for the AMD, because the AQD involves additional nonparametric estimation of the CQF, and the nonparametric estimator is more undersmoothed. When the bandwidth is around the MSE-optimal bandwidth, the nonparametric estimators perform well in finite samples. Overall, AQD outperforms the linear KB.

Table 1 reports the coverage rates of three bootstrap-based CIs, following the definitions in Cattaneo and Jansson (2018). In standard notation, the superscript * denotes the bootstrap analog computed under the bootstrap distribution conditional on the data. Let q*_{n,α} = inf{q ∈ R : P*(β̂*_s − β̂_s ≤ q) ≥ α} and Q*_{n,α} = inf{q ∈ R : P*(|β̂*_s − β̂_s| ≤ q) ≥ α}. Then, Efron CI = [β̂_s + q*_{n,α/2}, β̂_s + q*_{n,1−α/2}], Percentile CI = [β̂_s − q*_{n,1−α/2}, β̂_s − q*_{n,α/2}], and Symmetric CI = [β̂_s − Q*_{n,1−α}, β̂_s + Q*_{n,1−α}]. We also consider Normal CI = [β̂_s − q_{n,1−α/2}, β̂_s + q_{n,1−α/2}], where q_{n,α} = Φ^{−1}(α)ŝe, Φ is the standard normal CDF, and we use the bootstrap standard error of β̂_s for ŝe.

^9 Specifically, we numerically compute V̂ described in Section 4.4 with a preliminary bandwidth h_v = 3.12σ_x n^{−0.15}, where 3.12 is from the Silverman rule-of-thumb bandwidth and n = 5,000. For B̂, we choose u = 0.5 and h_b = 3.12σ_x n^{−0.07}. The conditions in Corollary 1 hold.
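Given bootstrap draws of the estimator, the three bootstrap CIs can be sketched as below. This is one standard implementation consistent with the definitions in the text, with q*(α) the α-quantile of the centered bootstrap draws and Q*(α) the α-quantile of their absolute values; the draws themselves are synthetic placeholders.

```python
# Sketch of the Efron, percentile, and symmetric bootstrap CIs from
# bootstrap draws beta_star of the estimator, centered at beta_hat.
import numpy as np

def bootstrap_cis(beta_hat, beta_star, alpha=0.05):
    centered = beta_star - beta_hat
    q = lambda a: np.quantile(centered, a)
    Q = np.quantile(np.abs(centered), 1 - alpha)
    return {
        "efron": (beta_hat + q(alpha / 2), beta_hat + q(1 - alpha / 2)),
        "percentile": (beta_hat - q(1 - alpha / 2), beta_hat - q(alpha / 2)),
        "symmetric": (beta_hat - Q, beta_hat + Q),
    }

rng = np.random.default_rng(0)
cis = bootstrap_cis(beta_hat=1.0, beta_star=1.0 + 0.1 * rng.standard_normal(2000))
for name, (lo, hi) in cis.items():
    assert lo < 1.0 < hi
```

When the bootstrap distribution is symmetric, the three intervals nearly coincide; they differ when the centered draws are skewed, which is what drives the coverage differences reported in Table 1.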
The simulation results are mostly in line with the theoretical findings in Cattaneo and Jansson (2018) and this paper. In most cases, the Symmetric CI has the largest coverage rates, whereas the Efron CI has the smallest, as predicted by Cattaneo and Jansson (2018). The Normal CI is comparable with the bootstrap-based CIs, and all inference procedures perform reasonably well. One possible explanation for this result is that the small-bandwidth bias studied in Cattaneo and Jansson (2018) is relatively small in this DGP.

CONCLUSION AND OUTLOOK
We estimate weighted AQDs via a weighted average of the CQF. We show that our estimators are asymptotically linear uniformly over the quantile index and converge weakly to Gaussian processes. We also characterize the leading bias. More generally, this paper addresses a semiparametric estimation problem based on a preliminary nonparametric estimator and involving a stochastic trimming function. We demonstrate a novel application of the uniform convergence results for nonparametric kernel-based estimators on expanding interior supports in Hansen (2008) and Cattaneo et al. (2013), which makes our asymptotic analysis tractable while accounting for the stochastic trimming problem.
There are several important directions for future research. The criteria for choosing the bandwidths and the trimming parameter in finite samples remain to be investigated. For the AMD, Cattaneo et al. (2010, 2013, 2014a, 2014b) and Cattaneo and Jansson (2018) develop several methods for robust inference in terms of the bandwidth choice.^10 There are recent developments on the trimming parameter for the inverse probability weighting estimator of treatment effects; e.g., Ma and Wang (2019) and Sasaki and Ura (2021) propose inference methods that account for the trimming bias. Since our estimands are more complex, estimating the trimming bias and choosing a data-driven trimming parameter are beyond the scope of this paper and left for future research. As with the rich literature on the AMD, more theoretical and empirical research on the AQD can be expected.

APPENDIX
The Appendix is organized as follows. We first state the notation and assumptions. Then, we present preliminary lemmas, whose proofs are in Appendix C. The asymptotic theorems for the weighted AQD estimator β̂ are first derived for the infeasible estimator β̄ ≡ −n^{−1} Σ_{i=1}^{n} Q̂(τ|X_i)φ(X_i)1{X_i ∈ S}, trimmed based on the true density. Then, Lemma 4 shows √n(β̂ − β̄) = o_p(1) uniformly over τ. Appendix A.1 presents the proofs of Theorems 1 and 2. Appendix B presents the proofs of Theorems 3–6.
Notation. Let Z = (X′,Y)′. f′_Y(y|X) denotes the derivative with respect to y, which should not be confused with the transpose of a matrix, A′. C denotes a generic constant. For an m × n matrix A, we use the Frobenius norm ‖A‖ = trace(A′A)^{1/2}. Let ‖·‖_∞ be the sup-norm for a function, i.e., ‖f‖_∞ = sup_{x∈X} |f(x)|. "s.o." denotes smaller-order terms, and "w.p.a.1" means with probability approaching one. The product of two functions A(X) and B(X) is written AB(X). In some notation in the following proofs, we omit τ and X for brevity without loss of clarity; for example, Q ≡ Q(τ|x) and Q̂ ≡ Q̂(τ|x).
LEMMA 1 (Uniform convergence rate). Let Assumptions 1 and 2 hold with p_x ≥ ν. Define C_n ≡ {x : ‖x‖ ≤ c_n}, where c_n satisfies lim sup_{n→∞} log(c_n)/log n < ∞.

LEMMA 4 (Trimming). Let all assumptions in Theorem 3 hold. Let sup
LEMMA 5. Let the conditions in Theorem 3 hold. Denote p (Z i

A. Proofs of Theorems 1 and 2 in Section 4.1
Proof of Theorem 1. The proof uses the following preliminary results (i) and (ii). Suppose that, for any x ∈ S, there exists a compact convex set Y_x. To discuss these results briefly, Lemma 1 implies the uniform rate; when Y_x degenerates to a point, i.e., Q(τ|x) = Q(ε|x) = Q(1 − ε|x) for τ ∈ T, we modify (i). Then (ii) follows. Uniform convergence rate. For any x ∈ S and τ ∈ T, a Taylor series expansion yields the rate, for the following reasons. For the numerator, F̂_Y(Q̂(τ|x)|x) = τ = F_Y(Q(τ|x)|x) by construction. Then, use the result for the kernel estimator of F_Y(y|x)f(x) in the proof of Lemma 1(1). For the denominator, results (i) and (ii) imply sup_{x∈S, τ∈T} |f̂_XY(x, Q̂(τ|x))^{−1}| = O_p(δ^{−1}).
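The generalized-inverse construction used here, Q̂(τ|x) = inf{y : F̂_Y(y|x) ≥ τ}, so that F̂_Y(Q̂(τ|x)|x) = τ by construction, can be illustrated numerically. This is a simplified sketch with a Gaussian kernel and a logistic smoother standing in for the paper's integrated kernel G; the kernel orders, trimming, and bandwidth rules of the paper are not replicated.

```python
# Simplified sketch of the generalized-inverse kernel CQF estimator.
import numpy as np

def cqf_hat(tau, x, X, Y, h, h0):
    """Q_hat(tau|x) = inf{y : F_hat(y|x) >= tau} on a grid of Y values."""
    w = np.exp(-0.5 * ((x - X) / h) ** 2)          # Gaussian kernel K((x-X_i)/h)
    w = w / w.sum()
    grid = np.sort(Y)
    # Smoothed conditional CDF F_hat(y|x) = sum_i w_i G((y - Y_i)/h0),
    # with a logistic G standing in for the integrated kernel (monotone in y).
    F = np.array([(w / (1.0 + np.exp(-(y - Y) / h0))).sum() for y in grid])
    return grid[np.searchsorted(F, tau)]            # generalized inverse

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 2000)
Y = X + rng.standard_normal(2000)                   # true Q(0.5|x) = x
q_hat = cqf_hat(0.5, 0.0, X, Y, h=0.2, h0=0.1)
assert abs(q_hat) < 0.25                            # near the true median at x = 0
```

Because the smoothed CDF is monotone in y, the generalized inverse is well defined even where the estimated CDF is flat, which is the situation the degenerate-Y_x case above addresses.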
(i) The triangular array processes {f_{ni}(ω,τ)} are manageable with respect to the envelopes; the classes indexed by τ ∈ T are Euclidean and manageable by Lemma 2.13 and Example 2.10 in Pakes and Pollard (1989) and p. 221 in Kosorok (2008). In addition, F_n(ω) ≡ (F_{n1},...,F_{nn})′ is an R^n-valued function on the underlying probability space. Then (i) is proved by applying Lemma E1 in Andrews and Shi (2013).
We first calculate the following results as h_0 → 0. The second equality is a Taylor series expansion around Q_i, where Q̄_i is on the line segment between Q_i and Q_i − h_0 v. When g has bounded support, G(z/h_0) is zero for small enough h_0 and any negative z. Thus, when g has bounded support or when the support of Y is R, i.e., V = R, the term (17) is zero for small enough h_0. When ȳ or y̲ is bounded and g has unbounded support, Assumption 2(G) implies that the first term is o(h_0^3). Similarly, the second and third terms are o(h_0^3) by integration by parts. The last term of (17) is o(h_0^3) by the uniform continuity of f_Y(y|X) in y and the dominated convergence theorem.
By a similar argument, the result holds for q_1 ≤ q_2. The covariance kernel of the limiting Gaussian process is lim_{n→∞} E[Z_n(τ_1)Z_n(τ_2)] = lim_{n→∞} nE[f_n(ω,τ_1)f_n(ω,τ_2)]. This is because 1{F_{ni} > ε} = 0 for n large enough, since K is bounded.

B. Proofs of Theorems 3-6 in Section 4.2
We use the U-process theorems in Sherman (1994) to prove Lemma 2, Lemma 3, and Theorem 3. The proofs of Theorems 4 and 5 then build on these results. We start with an overview of the proof strategy based on the U-process theorems.
Let F ≡ {p(·, ·; λ) : λ ∈ Λ} be a class of measurable functions on Z². The collection {U_n p : p ∈ F} is a U-process of order 2 indexed by F, where, for each p ∈ F, U_n p ≡ [n(n−1)]^{−1} Σ_{i≠j} p(Z_i, Z_j). The proof involves the following steps.
Step 3 [Bias]. Calculate E[U_n p] = P²p, uniformly in τ ∈ T. For the weak convergence result, the asymptotic bias must converge to zero faster than n^{−1/2}.
Step 4 [Weak convergence]. By van der Vaart (2000), (i) monotone and smooth function classes are Donsker, and (ii) the Cartesian product of two Donsker classes of functions is also a Donsker class. Applying Donsker's theorem completes the proof.
Proof of Theorem 3. By Theorem 1, we obtain the decomposition in (20), where the second-order U-statistic has kernel p(Z_i, Z_j; τ, h, h_0, δ). Step 1. We claim that the classes of functions involved are Euclidean, by Lemma 2.13 and Example 2.10 in Pakes and Pollard (1989) and p. 221 in Kosorok (2008). Then, Lemma 2.14 in Pakes and Pollard (1989) implies that F_1 is Euclidean.
Thus, we can apply Corollary 4 in Sherman (1994). Lemma 6 in Sherman (1994) implies that the class of P-degenerate functions of order 2 is Euclidean.

Step 2 [Projection]. We show that the projection expansion holds uniformly in τ ∈ T. We prove equalities (i), (ii), and (iii) in this expansion in what follows.
By the argument for equality (i), it suffices to verify the bound implied by (19). Furthermore, using (18), the claim follows from the argument for equality (i).
For the stochastic trimming function, the result follows by Step 3 in the proof of Lemma 4.
where the influence function for the first part is given in Theorem 3 with φ(X) = −2∇f(X). The influence function for (II) is given in Lemma 2 with ψ(X;τ) = −2Q(τ|X). The third term is of smaller order (s.o.) by Lemma 1, Theorem 6 in Hansen (2008), and Assumption 3.

As argued in Step 4, we obtain the second part of Theorem 4.
Proof of Theorem 5.

C. Proof of Lemmas
Proof of Lemma 1. This proof modifies the proofs of Theorems 2 and 6 in Hansen (2008) and of Lemma B(1) in Cattaneo et al. (2013), which amends a truncation argument in the proof of Theorem 2 in Hansen (2008).
, for x ∈ C_n and y ∈ Y. Consider the uniformity over Y × C_n. Construct a grid using regions of the form B_j = {y : |y − y_j| ≤ a_n h} × {x : |x − x_j| ≤ a_n h}. Since the "dependent variable" G((y − Y_i)/h_0) ∈ (0,1) is bounded, the argument for the uniform bound is the same as, for example, in (A.8) of Hansen (2008). Therefore, Hansen's proof of Theorem 2 gives sup_{x∈C_n, y∈Y_n} |Φ̂(y,x) − E[Φ̂(y,x)]| = O_p((log n/(nh^d))^{1/2}), where Φ̂(y,x) denotes the kernel estimator of F_Y(y|x)f(x). By a change of variables, the smoothness assumptions, and the dominated convergence theorem, for any y ∈ Y_n and x ∈ C_n, the bias is controlled. Therefore, uniformly in y ∈ Y and x ∈ C_n, F̂_Y(y|x) = Φ̂(y,x)/f̂(x). Theorem 6 in Hansen (2008) implies sup_{x∈C_n, y∈Y} |f̂_XY(x,y) − E[f̂_XY(x,y)]| = O_p((log n/(nh^d h_0))^{1/2}). A similar calculation yields Bias(f̂_XY(x,y)) = O(h^ν + h_0^2).
The proof follows the procedure outlined at the beginning of Appendix B.
Hence, 1{X_i ∈ S^c, X_i ∉ Ŝ} = 0 w.p.a.1, for all i. So we need to consider only the second term. Define I_{ni} ≡ 1{X_i ∉ S^c, X_i ∈ Ŝ}.
where n^{−1/2} Σ_{i=1}^{n} φ_i I_{ni} = O_p(1) by the central limit theorem, since E‖φ_i‖² < ∞, and the uniform convergence of Q̂_i is implied by Theorem 1 and Assumption 3 or 4. By similar arguments, the remaining terms of (25) vanish in probability.
where X̄_i is on the line segment between X_i and X_i + HV, by the dominated convergence theorem and the uniform continuity of ∂_{k_1}···∂_{k_ν} f(X) and ∂_{k_1}···∂_{k_ν} f_Y(y|X) in X, for k_1,...,k_ν ∈ {1,...,d}.
Since 1{X_i ∉ S} = o(1) because δ → 0, and the moments exist, by the dominated convergence theorem,