Recursive Differencing for Estimating Semiparametric Models

Controlling the bias is central to estimating semiparametric models. Many methods have been developed to control bias in estimating conditional expectations while maintaining a desirable variance order. However, these methods typically do not perform well at moderate sample sizes. Moreover, and perhaps related to their performance, non-optimal windows are selected, with undersmoothing needed to ensure the appropriate bias order. In this paper, we propose a recursive differencing estimator for conditional expectations. When this method is combined with a bias control targeting the derivative of the semiparametric expectation, we are able to obtain asymptotic normality under optimal windows. As suggested by the structure of the recursion, in a wide variety of triple index designs, the proposed bias control performs much better at moderate sample sizes than regular or higher order kernels and local polynomials.


Introduction
In this paper, our primary emphasis is on semiparametric index models, which perform well at moderate sample sizes. Often, such models require estimating an expectation conditioned on a vector of indices, where each index is a parametric function of observables and an unknown finite dimensional parameter vector. We term such an expectation semiparametric due to the index structure of the conditioning variables. For models with an index structure, see for example Robinson (1988), Powell et al. (1989), Ichimura (1993), Ichimura and Lee (1991), Klein and Spady (1993), Horowitz (1996), Li et al. (2014), and Klein, Shen and Vella (2015). The first objective of this paper is to develop a recursive estimator for a semiparametric expectation that can deliver a bias of any order while maintaining desirable variance properties and finite sample performance. Second, we combine the properties of this recursive estimator with a residual property of semiparametric derivatives to obtain asymptotic normality under optimal windows.
To obtain asymptotic normality at a √N rate for a finite dimensional parameter vector in a semiparametric model, the bias in the estimator must vanish faster than N^{-1/2} while the variance must converge to zero at a sufficiently fast rate. Methods have been developed in the literature to control the bias while maintaining a desirable variance order. In some cases, an estimate of the bias in the parameter estimator can be removed from the estimator, as in Honore and Powell (2005). In other cases, it is possible to employ different estimators for conditional expectations. Higher order kernels (e.g. Muller (1984)) achieve this bias order by increasing the degree of the kernel. An extension of Newey (2004) can similarly control the bias by increasing the convolution degree. For estimating semiparametric models, both approaches require suboptimal windows and hence lead to much larger variability. Local polynomials (e.g. Fan and Gijbels (1995, 1996), Ruppert and Wand (1994), Lu (1994), Masry (1995), and Gu, Li, and Yang (2015)) obtain this bias order by increasing the degree of the local polynomial. While the performance of the local linear estimator is quite good, higher degrees require introducing more local parameters and hence also lead to higher variability. In this paper, we propose an alternative approach to bias reduction that enables us to obtain √N normality in semiparametric index models. In finite samples, we find much lower variability for the recursive differencing estimator in simulation studies.
The estimator proposed here has a recursive differencing structure, with a local linear estimator providing the basis for the first stage of the recursion. The bias in the first stage estimator depends on a localization error, defined as the difference between the expectation at a point of interest and at a nearby point. Accordingly, in the second stage of the recursion, we remove an estimator of this localization error from the previous stage. Continuing in this manner, we show that the bias declines at each stage of the recursion, with the variance order being unchanged.
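The recursion can be sketched in a few lines of code. The sketch below is a minimal illustration under simplifying assumptions (a single index, a Gaussian kernel, a local constant rather than the paper's modified local linear first stage, and no trimming): at each stage, the estimated localization error from the previous stage is subtracted from each Y_j before re-smoothing.

```python
import numpy as np

def kernel_weights(V, v, h):
    # Gaussian kernel weights for index values V about the point v.
    Z = (V - v) / h
    return np.exp(-0.5 * Z ** 2) / np.sqrt(2 * np.pi)

def recursive_differencing(Y, V, h, stages):
    # Stage 1: local constant (Nadaraya-Watson) fit at every sample point.
    W = kernel_weights(V[None, :], V[:, None], h)   # W[i, j] = K_j(V_i)
    M = W @ Y / W.sum(axis=1)                       # M[i] = stage-1 fit at V_i
    for s in range(2, stages + 1):
        # Subtract the estimated localization error M_{s-1}(V_j) - M_{s-1}(V_i)
        # from Y_j before re-smoothing; bias order falls at each stage.
        adj = Y[None, :] - (M[None, :] - M[:, None])
        M = (W * adj).sum(axis=1) / W.sum(axis=1)
    return M

# Simulated single-index example with E(Y | V) = sin(V).
rng = np.random.default_rng(0)
V = rng.normal(size=500)
Y = np.sin(V) + 0.1 * rng.normal(size=500)
M3 = recursive_differencing(Y, V, h=0.5, stages=3)
```

With a moderate window, the stage-3 fit tracks sin(V) closely on the interior of the support, consistent with the bias reduction described above.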
Taking advantage of this recursive mechanism and a residual property of semiparametric derivatives, we propose an approach to estimating semiparametric multiple index models of arbitrary dimension under optimal windows. In a Monte Carlo study, we considered four semiparametric triple index models. Employing a three stage recursion, which is appropriate in this context (Theorem 2), we found that the resulting estimator had very good finite sample performance in terms of both bias and variance. In all of the cases the RMSE decreased, often substantially, with the stage of the recursion. The resulting estimator also performed much better than either a higher order kernel estimator or a local polynomial whose bias order is below N^{-1/2}. To develop the proposed parameter estimator for a semiparametric model and the underlying conditional expectation estimator, Section 2 provides the intuition for these two estimators and their theoretical properties. Section 3 formally defines the estimators and obtains their large sample properties. Section 4 provides Monte Carlo results that demonstrate very good finite sample properties of the estimators in triple index models, exhibiting a substantial improvement over regular kernels, higher order kernels, and local polynomial estimators. We note that the local linear estimator forms the basis for the first stage of the recursion and significantly improves its performance. Section 5 contains our conclusions. The appendix contains proofs of all theorems and supporting intermediate lemmas.

Estimating Expectations under Recursive Differencing
The semiparametric model that we study assumes: where the vector {Y_i, W_i} is i.i.d. over i = 1, ..., N, and takes on values in X_Y × X_W ⊆ R^{1+d_w}, with d_w the dimension of W_i. Here, V_i ≡ V(W_i; θ_0) is a vector of d < d_w continuous indices that depend on a finite dimensional parameter vector, θ_0. For expositional purposes, in this section we take the parameter vector as known and discuss its estimation in the next section.
To motivate the form of the bias reduction, we defer discussion of trimming considerations to the next section. Denote M(V_i) ≡ E(Y_i | W_i) with V_i ≡ V(W_i; θ_0) and consider the model in localized form: Y_i = M(v) + [M(V_i) − M(v)] + ε_i, where ε_i is an error satisfying E(ε_i | W_i) = 0. With K_i(v) as a kernel weight that controls the localization error [M(V_i) − M(v)] by downweighting observations with V_i not close to v, an often employed conditional expectation estimator is given as: M̂_1(v) = Σ_i K_i(v) Y_i / Σ_i K_i(v). Substituting the local model for Y_i shows that the localization error, M(V_i) − M(v), is responsible for the bias, and we therefore propose a recursive mechanism that removes the estimated localization error. For s > 1, define the stage s estimator as: M̂_s(v) = Σ_i K_i(v) {Y_i − [M̂_{s−1}(V_i) − M̂_{s−1}(v)]} / Σ_i K_i(v), where M̂_{s−1}(V_i) − M̂_{s−1}(v) is the estimated localization error from the previous stage. Note that in the analysis the difference M̂_s(v) − M(v) is scaled by a density estimator, which is convenient for obtaining asymptotic results. With the kernel ensuring that V_i is close to v, intuitively this differencing structure reduces the estimation error because the error at V_i will be close to that at v. In the appendix, we formally prove that there is a recursion uniformly within o_p(N^{-1/2}) of that above for which the bias order diminishes with each stage of the recursion. The variance order remains the same. To provide some intuition for this result, consider stage s = 2 and expand the estimator. To deal with estimated density denominators, in the appendix we provide a recursion within o_p(N^{-1/2}) of the above, in which the remainder contains the higher order terms from an approximating expansion. It can be shown that for the leading term in this expansion the B-functions are uniformly bounded. In the case of local constant estimators, these expansions are well known in the literature, but as shown below they also hold in other cases.
With the kernel ensuring that V_i is close to v, the above differencing structure results in a canceling of bias terms, with the bias order decreasing from h^2 to h^4. We show that the higher order terms in the expansion do not have a larger bias order. As shown in the appendix, higher stages have a differencing-in-differencing structure which further reduces the bias.
To start this recursion, the theory developed here will hold for any initial estimator that satisfies certain convergence properties and has an approximating bias expansion of the above form. These conditions are satisfied by both the local constant estimator and the local linear estimator. Here, we employ a modified local linear estimator since it performs better than the local constant estimator. With V_i as a row vector of the model's d continuous indices and v a conformable row vector of fixed values, let Z_i ≡ (V_i − v)/h, where h is a kernel window parameter. The local linear estimator solves a kernel-weighted least-squares problem in which Ȳ(v) and Z̄(v) are kernel weighted averages and M̂′_L is a local linear estimator of the derivative of M(v). To simplify arguments, we define a modified derivative estimator in which Ỹ(v) is a vector with each element equal to the local constant estimator, Ȳ(v). The modified local linear estimator is then defined accordingly. In analyzing this estimator, we employ a regular kernel function, with the bias declining at each stage of the recursion. If other bias reducing kernels are employed at each stage, then the bias will be even smaller at each stage and will continue to decline over the stages. The proofs for these alternative kernels are identical to those employed here. Using a modified local linear estimator as the start of the recursion, we have found that the recursion based on regular kernels performs very well.
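For reference, a standard (unmodified) local linear fit at a point v solves a kernel-weighted least-squares problem in an intercept and a slope on Z_i = (V_i − v)/h, with the intercept estimating M(v). The sketch below shows only this standard version; the paper's modification of the derivative estimator is not implemented here.

```python
import numpy as np

def local_linear(Y, V, v, h):
    """Local linear estimate of M(v): kernel-weighted least squares of Y
    on an intercept and Z = (V - v)/h; the intercept estimates M(v)."""
    Z = (V - v) / h
    w = np.exp(-0.5 * Z ** 2)                  # Gaussian kernel weights
    X = np.column_stack([np.ones_like(Z), Z])  # design matrix [1, Z_i]
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ Y) # weighted normal equations
    return beta[0]                             # intercept = estimate of M(v)

rng = np.random.default_rng(1)
V = rng.uniform(-2, 2, size=800)
Y = V ** 2 + 0.1 * rng.normal(size=800)
est = local_linear(Y, V, v=1.0, h=0.3)         # true value M(1) = 1
```

Unlike the local constant fit, the local linear fit has no design-bias term, which is one reason it serves well as the start of the recursion.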

Estimating Index Parameters in Semiparametric Models
Many semiparametric models depend on expectations conditioned on a vector of indices. Here, we employ recursive differencing to estimate conditional expectations. In addition to the recursive differencing structure, we also propose an extra mechanism that further reduces the bias in estimating index parameters in semiparametric models. Combining these mechanisms, we will be able to estimate a wide class of multiple index semiparametric models using optimal windows. We refer to this additional control as the residual property of semiparametric derivatives, which is given in the following proposition due to Whitney Newey.

Proposition 1. Assume E(Y|W) = M[V(θ_0)] from the index assumption in (1).

Exploiting this result as a bias control for single index models, Klein and Shen (2010) show that asymptotic normality can be obtained with regular kernels. This approach does not extend to higher dimensions under regular kernels, but does extend under recursive differencing. Before providing the theorem that covers this extension, here we briefly discuss the estimation strategy. To exploit Newey's residual result, it can be shown that trimming must be based on the index. Accordingly, we employ a two-step estimation strategy where the index parameters are estimated in the first step with trimming based on X, the continuous variables in W. In the second step, trimming is based on the estimated index from the first step. Throughout, we use the term step to distinguish the initial estimator based on X-trimming from the estimator based on index trimming, while we use the term stage to refer to the recursive differencing sequence. Under recursive differencing, it is possible to show that the semiparametric derivative component of the estimation gradient can be taken as known for any index dimension. We are then able to establish asymptotic normality under optimal windows. It should be noted that this argument requires an adjustment strategy, apart from trimming, to control density denominators in the second step. We defer discussion of this issue to the next section.
Theorem 1 in the next section provides the properties of the expectation estimator. For a class of semiparametric models, Theorem 2 then combines recursive differencing with a residual control introduced below so as to provide conditions on the number of stages and the kernel window to obtain √N asymptotic normality under optimal windows.

Large Sample Results
To establish large sample results, we require the following definitions and notations.

Definitions and Notations
D1) Index Functions. Let W_i be an i.i.d. vector of continuous and discrete variables, i = 1, ..., N. Let θ be a finite dimensional parameter vector and V_i = V(W_i; θ) a vector of d continuous parametric index functions.
D3) Trimming. Denote α_1 and α_2 as lower and upper percentiles and let α ≡ [α_1, α_2]. Define q_x(α) ≡ [q_x(α_1), q_x(α_2)] : d_c × 2 as the corresponding matrix of lower and upper population quantiles for X_i, where X_i : d_c × 1 is the continuous subvector of W_i. Similarly, define q_v(α) ≡ [q_v(α_1), q_v(α_2)] : d × 2 as a matrix of population quantiles for V_i : d × 1. Then, define exterior trimming based on X_i or V_i, and corresponding interior trimming functions, as indicators that the relevant vector lies between these quantiles. Let the hatted versions denote the sample quantile analogues. For index trimming, the estimated index is constrained to lie between lower and upper sample quantiles for the estimated index vector.
D4) Kernel. Let v and V_i be d-dimensional vectors with l-th elements v(l) and V_i(l) respectively, and denote s_l as the standard deviation of V_i(l). Define the kernel as a product over the d components of densities φ(z(l)), each symmetric about 0 with finite moments of all orders, where h = O(N^{-r}) is the window. The function τ̂_i is a trimming function that may be either τ̂_v(V_i) or τ̂_x(X_i).
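A minimal sketch of a product kernel of the D4) form, with Gaussian components, componentwise scaling by the index standard deviations, and window h = N^{-r}; the trimming factor is omitted here.

```python
import numpy as np

def product_kernel(V, v, h, s):
    """Product kernel K_i(v) for d-dimensional indices.
    V: (n, d) matrix of index values; v: (d,) evaluation point;
    h: window, e.g. h = N ** (-r); s: (d,) std devs of the index components."""
    Z = (V - v) / (h * s)                        # componentwise standardization
    phi = np.exp(-0.5 * Z ** 2) / np.sqrt(2 * np.pi)
    return np.prod(phi / (h * s), axis=1)        # product over the d components

rng = np.random.default_rng(2)
N, r = 1000, 1.0 / 7.0
h = N ** (-r)
V = rng.normal(size=(N, 2))                      # two independent indices
s = V.std(axis=0)
K = product_kernel(V, np.zeros(2), h, s)
density_at_0 = K.mean()      # kernel estimate of the index density g(0, 0)
```

Averaging the kernel weights gives the density estimator ĝ(v) that appears in the denominators of the expectation estimators below.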
D5) Kernel Averages. Referring to D4), define the relevant kernel-weighted averages. When the estimators are evaluated at a data point, the average is taken over the N − 1 observations excluding that data point.
D6) Conditional Expectation Estimator. Let D̂ and D be N × N diagonal matrices with i-th elements K_i(v; θ̂) and K_i(v; θ) respectively. Let Z be an N × d matrix with i-th row (V_i − v)/h. Referring to (5), Ȳ(v) and Z̄(v) depend on ĝ_1(v) for stage s = 1; for stage s > 1, M̂_s(v) is defined by the recursion in Section 2. D7) Adjusted Densities. With δ̂_s as lower quantiles for the ĝ_s in D5), define adjustment factors A_s(v) depending on the window h and a constant 0 < a < 1. To smoothly restrict v to be between the lower quantile q_v(α_1) and the upper quantile q_v(α_2), define a smoothed indicator function of v, with a sample quantile version. Adjusted densities are then defined as: ĝ_{sa}(v) ≡ ĝ_s(v) + A_s(v). D8) Adjusted Recursive Differencing. Referring to D6) and D7), obtain the adjusted estimator M̂_{sa}(v) from M̂_s(v) by replacing ĝ_s(v) with ĝ_{sa}(v).
While the purpose of most of these definitions is clear, further discussion of the trimming and adjustment strategies can be useful. The estimators above depend on different types of trimming. Because of the recursive structure of the estimator in D6), the estimator at stage s will depend on a vector of estimated conditional mean functions from the previous stage. Trimming (exterior) is required to control this vector. To avoid boundary bias, this exterior trimming set must strictly contain the set over which the M(v) functions are defined. We refer to the set of such v-values as an interior set.
The adjustment factor in D7) is employed to control density denominators in the second step when estimating a class of semiparametric models. Recall from the previous section that we employ a two-step estimator for index parameters in semiparametric models. In the first step, the estimated indices are recovered with trimming based on X. The estimated index from this step is then used for trimming in the second step. This procedure, with expectations estimated under recursive differencing, ensures that the gradient of the estimation problem has asymptotic bias of order o(N^{-1/2}) under optimal windows. However, in the uniform convergence arguments underlying the consistency proof, density estimators must be controlled away from the truth θ_0. The adjustment factor is employed for this purpose. The smoothed indicator function exponentially approaches 0 for v in the interior of its support. Near the support boundary, where the density is allowed to approach 0 at an unrestricted rate, the adjustment factors vanish slowly at a rate of h^a, 0 < a < 1. In this manner, the rate at which the density denominator converges to zero is controlled.
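The role of the adjustment can be illustrated with a simple one-dimensional version; the logistic smoothing function and the constants below are illustrative choices, not the paper's exact construction. The adjustment is negligible in the interior of the quantile region but of order h^a near the boundary, so the adjusted denominator cannot vanish too quickly.

```python
import numpy as np

def adjusted_density(g_hat, v, q_lo, q_hi, h, a=0.5):
    """Add to a density estimate g_hat(v) an adjustment that vanishes
    exponentially fast in the interior of [q_lo, q_hi] but only at the
    slower rate h**a near the boundary (illustrative construction)."""
    # Smoothed indicator: logistic in the distance to each quantile boundary.
    scale = 0.05 * (q_hi - q_lo)
    tau = 1.0 / (1.0 + np.exp((v - q_lo) / scale)) \
        + 1.0 / (1.0 + np.exp((q_hi - v) / scale))
    return g_hat + tau * h ** a

h = 0.1
# Interior point: the adjustment is negligible.
interior = adjusted_density(0.4, v=0.5, q_lo=0.0, q_hi=1.0, h=h)
# Boundary point with a near-zero density estimate: denominator is propped up.
boundary = adjusted_density(0.001, v=0.0, q_lo=0.0, q_hi=1.0, h=h)
```

As h shrinks, the boundary adjustment h^a goes to zero, so the adjusted estimator coincides with the unadjusted one in the limit.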
In Theorem 1 below, we obtain convergence properties for the proposed expectation estimator. In Theorem 2 we prove asymptotic normality for parameter estimators in a class of semiparametric models under recursive differencing and the residual control. To this end, we make the following assumptions.

Assumptions
A1) The vector {Y_i, W_i} is i.i.d. over i = 1, ..., N, and takes on values in X_Y × X_W ⊆ R^{1+d_w}, where d_w is the dimension of W_i.

A2) The following index assumption holds: E(Y_i | W_i) = M[V(W_i; θ_0)].

A3) Refer to D1) and assume θ ∈ Θ, a compact set. Let g(v(w; θ); θ) be the density for V(W_i; θ) evaluated at v(w; θ) and let f(x | V_i = v) be the density for X conditioned on V_i = v. Derivative orders below range over 0, 1, 2 and 0, 1, ..., 2s + 1, where s is the stage, for v, w, and θ in compact sets and with c > 0.

A5) For the window parameter: r < 1/(2d).

A6) The vector of continuous random variables, X_i, is bounded.
The first three assumptions are standard for index models. Assumption A4) is useful for obtaining uniform convergence results for functions of unbounded random variables. Assumption A5) is needed to establish the bias properties of the proposed recursive differencing estimator (Theorem 1). Condition A6) is not required, but greatly simplifies the exposition. It should be noted that many and perhaps most models in applied work satisfy this condition. Moreover, when it does not hold, it is always possible to trim out a very small fraction of the observations to guarantee that A6) holds.
Convergence properties are important for obtaining √N-normality for a finite dimensional parameter vector in semiparametric models. Theorem 1 provides these properties for the conditional expectation estimator. It holds in the semiparametric case, when conditioning on a vector of indices, and in the nonparametric case.
Theorem 1. The Recursive Differencing Estimator. Assume A1)-A5). Then, for v and θ in compact sets, with the scaled estimation error defined as ĝ_s(v)[M̂_s(v) − M(v)], there exists an approximating sequence, with its form provided in Lemma 2, that satisfies the following: For estimating semiparametric models, it can readily be shown that it is the bias in this scaled estimator that is relevant. As this quantity is nonlinear, it is difficult to study directly. Lemmas 1-2 provide an approximating sequence that, from a), is uniformly within o_p(N^{-1/2}) of the scaled estimator. The bias result is then given in b), followed by a variance result in c) and a convergence rate in d).
To obtain √N normality for a finite dimensional parameter vector, we require conditions on the stage s and the window parameter r. Theorem 2 provides these conditions for a class of multiple index models estimated by Semiparametric-Least-Squares (SLS), introduced by Ichimura (1992) and Ichimura and Lee (1991). The results of Theorem 2 readily extend to Quasi-Maximum-Likelihood estimators for semiparametric binary response and ordered models, though such extensions require a more complicated trimming strategy than that employed here.
In the single index case, Klein and Shen (2010) exploit Newey's residual result to obtain √N normality. Theorem 2 obtains this result for the multiple index case by combining recursive differencing and the residual bias controls. In so doing, it is possible to employ optimal windows.

Theorem 2. Estimating Index Parameters with Recursive Differencing and Residual Bias Controls. Set the stage s and window parameter r to satisfy the required conditions. With all trimming based on sample quantiles for X as described in D3), define the first step estimator as: θ̂ = arg min_θ Σ_i τ̂_x(X_i)[Y_i − M̂_s(V(W_i; θ))]². For the second step estimator, set the stage s and window parameter r to satisfy the stated conditions, where r is the optimal window. Referring to D3) for the definitions of index trimming functions, base all trimming functions on sample quantile regions for the estimated index from the first step. Referring to D7)-D8), and letting V̂_i ≡ V(W_i; θ̂), define the adjusted second step estimator as: θ̂_a ≡ arg min_θ Σ_i τ̂_v(V̂_i)[Y_i − M̂_{sa}(V(W_i; θ))]². Then, under A1)-A5), √N(θ̂_a − θ_0) is asymptotically normally distributed, where σ²_ε is the variance of ε_i and G_i(θ_0) ≡ τ_{vI}(V_{0i}) ∇_θ M(V(W_i; θ_0))′. It should be noted that recursive differencing makes it possible to take the estimated semiparametric derivative as known in the multiple index case. The gradient, evaluated with a known semiparametric derivative and index trimming function, has no bias from the residual property of the semiparametric derivative.
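A minimal sketch of the SLS idea behind Theorem 2, with a single illustrative index V(W; θ) = W_1 + θ W_2, a leave-one-out local constant fit in place of the recursive differencing estimator, and no trimming or adjustment. The index form, window choice, and grid search below are illustrative assumptions.

```python
import numpy as np

def leave_one_out_fit(Y, V, h):
    """Leave-one-out local constant fit of Y on the index V (cf. D5)."""
    Z = (V[:, None] - V[None, :]) / h
    W = np.exp(-0.5 * Z ** 2)
    np.fill_diagonal(W, 0.0)          # exclude the own observation
    return W @ Y / W.sum(axis=1)

def sls_objective(theta, Y, W1, W2, h):
    # SLS profiles out the unknown M(.) with a nonparametric fit at the
    # index implied by theta, then measures the squared residuals.
    V = W1 + theta * W2               # illustrative single-index form
    return np.mean((Y - leave_one_out_fit(Y, V, h)) ** 2)

rng = np.random.default_rng(4)
N = 500
W1, W2 = rng.normal(size=N), rng.normal(size=N)
Y = np.sin(W1 + 1.5 * W2) + 0.1 * rng.normal(size=N)  # true theta = 1.5
h = N ** (-1.0 / 5.0)
thetas = np.linspace(0.5, 3.0, 26)
obj = [sls_objective(t, Y, W1, W2, h) for t in thetas]
theta_hat = thetas[int(np.argmin(obj))]
```

The objective is smallest near the true parameter because only the correct index makes the residual variance equal to the error variance; a smooth optimizer would replace the grid search in practice.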
X_3 + X_4, where X_1, X_2, X_3 and X_4 follow standard normal distributions. We also generated an error term ε that follows a standard normal distribution. In all designs, the outcome is a sum of three explanatory components and the error, where T_2 was set respectively to be a quadratic, cubic, exponential, or sine function of V_2 in the four designs. The standardizing constants are selected so that, in all designs, each of the three explanatory components has an approximate standard deviation (SD) of one.
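The designs can be reproduced in outline as follows. The index definitions below are assumptions for illustration (only the X_3 + X_4 component is indicated in the text), while the standardization mirrors the description of unit-SD components.

```python
import numpy as np

def make_design(N, T2, rng):
    """Simulate a triple-index design in the spirit of the Monte Carlo study:
    X1..X4 and the error are standard normal, the second component is a
    transformation T2 of its index, and each explanatory component is
    rescaled to have standard deviation one. Index choices are illustrative."""
    X = rng.normal(size=(N, 4))
    V1, V2, V3 = X[:, 0], X[:, 1], X[:, 2] + X[:, 3]
    components = [V1, T2(V2), V3]
    components = [c / c.std() for c in components]  # SD ~ 1 each, as described
    Y = sum(components) + rng.normal(size=N)        # outcome plus N(0,1) error
    return X, Y

rng = np.random.default_rng(3)
designs = {
    "quadratic": lambda v: v ** 2,
    "cubic": lambda v: v ** 3,
    "exponential": np.exp,
    "sine": np.sin,
}
X, Y = make_design(2000, designs["sine"], rng)
```

With three unit-SD components and a unit-SD error, the outcome has standard deviation near two, which keeps the signal-to-noise ratio comparable across the four designs.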
As the focus of this paper is on the semiparametric case, we begin by reporting results for different estimators of the parameters in the four designs discussed above (Table 1). We will also compare results for different estimators of the conditional mean function (Table 2).
Five measures of estimator performance are provided in Table 1: root mean squared error (RMSE), mean, standard deviation (SD), median, and median absolute deviation (MAD). The first three columns of results in Table 1 focus on the proposed recursive differencing estimator with a fixed window. We set the window parameter at r = 1/15 throughout these three stages so as to facilitate the comparison between stages. The next three columns show the results for the proposed recursive differencing estimator with optimal windows over the stages. Stage 1 had a window parameter r = 1/7, stage 2 had r = 1/11, while stage 3 had r = 1/15. The third stage estimator also has the residual control to ensure √N normality. The last three columns of Table 1 include three estimators that have a √N-asymptotic normality property due to undersmoothing without the residual control. The first of them (RDS) employs recursive differencing as the sole bias control with undersmoothing window parameter r = 1/11.99 to ensure √N asymptotic normality. The next column provides results for a higher order kernel estimator (HK), which is an extension of the twicing kernel (Newey et al., 2004). The last column is for a fifth degree local polynomial estimator (LP) with bias O(h^6), as in Ruppert and Wand (1994). For these last two estimators, we also set r = 1/11.99 to ensure that the bias is o(N^{-1/2}). The first three columns of results show that with a fixed window, the proposed recursive differencing estimator has decreasing bias over the stages while the standard deviation remains stable. This finding is consistent with the theory behind the recursive differencing mechanism. In all four designs, there was a substantial decrease in RMSE over the stages. This decline is most pronounced for the sine design. For example, the RMSE of the first parameter estimate declines from .368 at the first stage to .219 at the second stage and finally to .127 at the third stage. The next three columns show the results when we set the optimal window for each stage of the recursive differencing estimator. As expected, we found that the RMSE monotonically declined over the stages in most cases; in all cases, the third stage achieved the smallest RMSE. The small third stage RMSE is due both to recursive differencing and to the residual control. The value added by the residual control can be seen by comparing recursive differencing estimators with optimal windows with (column 6) and without (column 3) the residual control. Across all four designs, RMSE decreased by 20%-35% when the additional residual bias control is employed.
The last three columns of Table 1 provide a comparison between three √N asymptotically normal estimators. The recursive differencing estimator with undersmoothing (RDS) strongly dominates the other two estimators (HK and LP), with substantially smaller RMSE. Its advantage is most pronounced in terms of variation (i.e. standard deviation and MAD). Similarly, the proposed recursive differencing estimator with residual control also strongly dominates the HK and LP estimators. The performance of the proposed recursive differencing estimator with residual control (column 6) is overall better than that of the undersmoothing version (RDS), with smaller RMSE in 8 out of 12 parameters from the 4 designs, and the difference in RMSE is relatively small in the 4 cases where it did not dominate the undersmoothing version. In summary, recursive differencing estimators dominate the other estimators, and there is value added in employing the residual bias control with recursive differencing.
In addition to comparing parameter estimators, we also compare results for estimating the conditional mean functions. The estimation of these functions plays a fundamental role in estimating parameters. Further, it is essential in the estimation of marginal effects, an important object of interest in empirical studies. Therefore, we provide results on the estimation of conditional mean functions in Table 2. We investigated the performance of the three conditional mean estimators underlying the parameter estimators studied above: the recursive differencing estimator, the higher order kernel estimator, and the local polynomial estimator. To avoid confounding the performance of the parameter estimators with that of the conditional mean functions, all estimators for the conditional mean functions are evaluated at the true parameter values. Since marginal effects are often evaluated at different regions of the distribution, we calculated the conditional mean function estimators at every point in a trimmed set and then averaged over decile intervals so as to obtain √N asymptotic normality. In so doing, a window parameter of r = 1/11.99 was set for all three estimators. In terms of RMSE, we found that the recursive differencing estimator was superior to the higher order kernel and local polynomial estimators across designs and deciles. The advantage was especially pronounced at the higher and lower deciles, and was relatively moderate at the middle deciles. To further investigate the source of these advantages, we examined the bias and standard deviations of the estimators. In terms of bias, we found that both the recursive differencing and the higher order kernel estimators perform better than the local polynomial estimator at the middle deciles. However, the recursive differencing estimator dominates both the higher order kernel estimator and the local polynomial estimator considerably at the higher and lower deciles.
The standard deviation of the recursive differencing estimator was smaller than that of the higher order kernel and local polynomial estimators across designs and deciles, with a substantial advantage at the higher and lower deciles. Indeed, all three estimators had worse performance at extreme deciles. We remark that the performance of the local polynomial estimator near the boundary improves significantly when the sample size increases. We experimented with increasing the sample size to 10,000 for the cubic design. In that case, the RMSE of the local polynomial estimator for the first decile fell to 0.275, the bias to -0.222, and the standard deviation to 0.162. However, the recursive differencing estimator continued to dominate it, with first decile RMSE of 0.126, bias of 0.115, and standard deviation of 0.053. Results are similar for the 10th decile. There is a large and growing literature on boundary correction methods in kernel density estimation that could improve the performance of all three estimators at the boundaries. However, such a boundary correction is outside the scope of this paper.
In summary, the Monte Carlo experiment showed that the proposed recursive differencing estimator performs much better at a moderate sample size than the other methods that were considered. Further, the behavior of the recursive differencing estimator is consistent with the theory underlying this estimator.

Conclusions
In this paper, we propose recursive differencing estimators for estimating conditional expectations in semiparametric models with multiple indices. The order of the bias decreases with the stage of the recursion while the order of the variance remains the same.
While higher order kernels and local polynomials share the above properties, they differ from the proposed estimator in two important respects. First, the RMSE of the recursive differencing estimator became smaller over the stages. In contrast, higher order kernel or local polynomial estimators would require higher order terms to achieve the same bias order, often leading to higher RMSE. Second, in estimating index models, we show that with recursive differencing it is possible to exploit a residual property of semiparametric derivatives. In so doing, we obtain asymptotic normality without undersmoothing, regardless of the dimension of the index vector. This theoretical property contributes to the very good finite sample performance of the proposed estimator.

Proof of Theorem 1. The proof for part a) follows from Lemmas 1-2. Parts b)-c) follow from Lemma 6. With ĝ(v; θ) converging to g > 0, part d) follows from parts a)-c).
Proof of Theorem 2. The proofs for a) and c) are very similar. To establish a), note that under X-trimming (see, e.g., Ichimura (1993)), it can readily be shown that θ̂ − θ_0 = o_p(1). Consider a standard Taylor series expansion, where Ĥ is the (estimated) Hessian matrix, Ĝ is the estimated gradient of the SLS objective function, and θ+ is between θ̂ and θ_0. Let H denote the Hessian matrix with all estimated functions replaced by the corresponding true ones. From Lemma 9, sup ||Ĥ(θ) − H(θ)|| = o_p(1). Therefore, with θ+ →p θ_0, Ĥ(θ+) →p E[H(θ_0)]. Under an invertibility assumption, the convergence rate for θ̂ − θ_0 is then determined by the convergence rate of Ĝ(θ_0) to 0. With ε_i = Y_i − M_i, write the estimated gradient component as Ĝ_A − Ĝ_B. From a standard central limit theorem, Ĝ_A = O_p(N^{-1/2}). From Lemma 10, we may take exterior trimming as known. From a mean-square convergence argument and Lemma 7, Ĝ_{A3} = o_p(N^{-1/2}). An induction argument obtains this result for s > 1.
As above, for Ĝ_A and Ĝ_{B1}, from Cauchy-Schwarz, Lemmas 1, 6, and C1), it can be shown that the relevant terms are o_p(N^{-1/2}). For s = 1, consider the first component of M̂_{1a}(v) and the term [ĝ_a(v; θ) − g(v; θ)]/ĝ_a(v; θ). With the proofs for these terms being similar, consider the first term and write it as the sum of an A-term and a B-term. Recall that in the adjustment factor defined in D7), the smoothed indicator approaches 1 on the set where the density estimator uniformly converges to g(v; θ) > 0, while on its complement the adjustment is O(h^a). Accordingly, it suffices to show that A = o_p(1). From Lemma 8 and A5), with a < 1, A = o_p(1). For the B-term, as in Klein and Shen (2010), let τ_s be an indicator for the relevant quantile set. Under the same arguments as above, B will be uniformly o_p(1) if B* is uniformly o_p(1). Write B* as the sum of two terms. The first term is uniformly o_p(1), as the boundary bias vanishes when τ_s = 1. When τ_s = 0, it can be shown that B* = O_p(1). With the probability of τ_s = 0 being O(N^{-c}), 0 < a < c < 1, the second term uniformly converges in probability to zero. Similar arguments apply for s > 1. From the above uniform convergence, the second step objective function uniformly converges to its expectation, which is uniquely optimized at the true parameter vector under standard arguments. Consistency for the second step estimator in (7) of Theorem 2 follows.
To establish c), note that the estimator has a linear form. Employing arguments very similar to those in part a), it can be shown under C2) that the remaining terms are negligible. The proof for part c) will then follow if the B-term below is o_p(N^{-1/2}). From Lemmas 1-2 and Newey's residual result, E[B] = 0. In the single index case, Klein and Shen (2010) show that B is a degenerate U-statistic under regular kernels. This argument has been extended to multiple indices under the recursive estimator, with the extension provided in Jiang (2019).
With Lemma 10 showing that exterior trimming can be taken as known, Lemmas 1-9 are provided for the known-trimming case. For notational simplicity, let ... Lemmas 1-6 provide properties of the proposed expectations estimator, while Lemmas 7-9 provide results for estimating semiparametric models. For ease of exposition, in Lemmas 1-6 we assume that the conditional expectation of the dependent variable, Y_i, is a function of the d-dimensional vector V_i. The vector V_i may be interpreted as a vector of indices in the semiparametric case or as a vector of exogenous variables in the nonparametric case.
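As a concrete reference point for Lemmas 1-6, the following is a minimal sketch of a leave-one-out kernel estimate of E[Y_i | V_i] with a Gaussian product kernel; the function name, kernel choice, and bandwidth are illustrative and not the paper's exact construction:

```python
import numpy as np

def loo_nw(V, Y, h):
    """Leave-one-out Nadaraya-Watson estimates of E[Y | V = V_i] (illustrative)."""
    # Pairwise standardized differences, shape (n, n, d)
    Z = (V[:, None, :] - V[None, :, :]) / h
    # Gaussian product-kernel weights
    K = np.exp(-0.5 * np.sum(Z ** 2, axis=2))
    np.fill_diagonal(K, 0.0)  # drop the own observation
    return K @ Y / K.sum(axis=1)
```

With a vector of estimated indices in place of exogenous regressors, the same computation gives a stage-one estimate of the semiparametric expectation.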
We will obtain results for an approximating recursive sequence that is uniformly within N^{-1/2} of the original sequence. In constructing this sequence, we require the result on trimmed kernels in Lemma 1. Lemma 2 then provides the main result on the approximation.

Intermediate Lemmas
Lemma 1. Density Estimators and Trimming. Recall from D3) that index-trimming and X-trimming functions have the form: Let C_v and C_x be sets of the form: Then: Proof. Here we provide the proof for a), with the proof for b) being similar. Referring to a), denote V_i(m) as the m-th element of V_i, m = 1, ..., d. Let q_{1m} be the m-th element of the vector q_v(α_1) and q_{2m} the m-th element of the vector q_v(α_2). Define an indicator: ... can be written as the sum of a finite number of terms, with a typical term given as: It then suffices to examine: where K_{im}(v) is the kernel component defined in D3). Under a), with q_{1m} and q_{2m} as the m-th elements in q_v(α_{1I}) and q_v(α_{2I}) respectively: The result in (8) follows as the sup of each term in the sum is attained by selecting a value for v(m) that minimizes |v(m) − V_i(m)|. Hence, for each component in the first product of terms, the sup is attained at either q_{1m} or q_{2m}. Under symmetry for the kernel, as the analysis for each of the two terms in (8) is essentially the same, here we consider the expectation and variance of the first term and show that it has order smaller than o_p(N^{-1/2} h^{-d-1}). For T_1 and g the index density: ∫ ... = sup_v g(v) ..., where in (9) the integration is over the support of V_i. From the definition of the indicator, ... Making a change of variable with Z(m) = [q_{2m} − V_i(m)]/h, as the kernel has moments of all orders under D4), from Markov's inequality: ... With t selected to be sufficiently large, the result follows. For the variance, from the analysis of the expectation above, the second component vanishes arbitrarily fast. The Lemma follows.
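To illustrate the role of the trimming sets in Lemma 1, here is a hedged sketch of a kernel density estimate in which observations outside a componentwise sample-quantile box are trimmed away; the quantile levels and Gaussian kernel are assumptions for illustration only:

```python
import numpy as np

def trimmed_kde(v, V, h, q_lo=0.02, q_hi=0.98):
    """Gaussian product-kernel density estimate at v with quantile-box trimming."""
    n, d = V.shape
    lo = np.quantile(V, q_lo, axis=0)
    hi = np.quantile(V, q_hi, axis=0)
    # Trimming indicator: keep observations inside the quantile box
    keep = np.all((V >= lo) & (V <= hi), axis=1)
    Z = (V[keep] - v) / h
    K = np.exp(-0.5 * np.sum(Z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)
```

At interior evaluation points, the trimmed observations receive negligible kernel weight, so the estimate is essentially unaffected by the trimming; the lemma controls the behavior at the trimming boundary.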
Lemma 2. Recursion Approximation. Recall the definition of the initial estimator in D6) and the kernel functions in D4-5). With V_i = V(W_i; θ_0) and Z_i(v) = [V_i − v]/h as the i-th row of Z, define: For L, P sufficiently large and finite: Proof. For s = 1, note that: With a similar expansion holding for 1/ĝ(V_i), for L, P sufficiently large: The result now follows for s = 1.
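The recursion itself can be sketched numerically. The following assumes one natural reading of the recursive differencing construction, in which stage s re-smooths the differenced response Y_i − m̂_{s−1}(V_i) + m̂_{s−1}(v); the Gaussian kernel and all names are illustrative, not the paper's exact estimator:

```python
import numpy as np

def _nw(point, V, resp, h):
    # Nadaraya-Watson smooth of `resp` at `point` (Gaussian product kernel)
    Z = (V - point) / h
    w = np.exp(-0.5 * np.sum(Z ** 2, axis=1))
    return w @ resp / w.sum()

def recursive_stage(v, V, Y, h, stages=2):
    """Stage-s recursive differencing estimate of E[Y | V = v] (illustrative)."""
    m_at_sample = np.array([_nw(Vi, V, Y, h) for Vi in V])  # stage-1 fits
    m_at_v = _nw(v, V, Y, h)
    for _ in range(stages - 1):
        # Smooth the differenced response Y_i - m_{s-1}(V_i); since the kernel
        # weights sum to one, adding back m_{s-1}(v) gives the stage-s estimate.
        resid = Y - m_at_sample
        m_at_v = _nw(v, V, resid, h) + m_at_v
        m_at_sample = np.array([_nw(Vi, V, resid, h) for Vi in V]) + m_at_sample
    return m_at_v
```

Each pass re-smooths the residual from the previous stage, which is the differencing mechanism that reduces the bias order while leaving the variance order unchanged.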
Lemma 3. Recall the definitions of A(v) and g_1(V_i) in Lemma 2. With Z_i = (v − V_i)/h, let: ... Let U_1 and U be random variables that have zero expectation conditioned on X, and define: T_{L;i_1,...,i_L}, ... Then, there exist integers C_1, ..., C_k ≥ 2 such that a) ... Proof. The proof for a) is immediate from Lemmas 1-2. For b), due to the form of s_1(v) − s_2(v), the lemma follows because ... To study the expectations of the estimators as characterized in Lemma 3, Lemma 4 provides conditional independence results for studying the expectation of products of averages, which is the structure of the stage-s estimator. Part a) of this lemma is stated in a form more general than is required so as to illustrate a method of proof that applies to all other parts. Lemma 4. Kernel Products. Let F(V_i) be a bounded function of V_i and A(v) a matrix that depends on v. For p a positive integer, define: Referring to the definitions in Lemmas 1-2, under A5): Proof. For a), write the expectation of a typical term: ... If all of the subscripts are distinct, the result is immediate from independence. Assume there are m + 1 > 2 identical subscripts and re-order terms so that these are at the end. The expectation is then: For the three components above, from A5), the first term is O(...). With this change of variable, and the integral being multi-dimensional, we can write the third component as: With k(z)^p integrable under A5) for all finite p, the above integral is O(1), from which the result follows. The proofs for b)-c) are very similar to that for a) in that they depend on matching subscripts as in the above argument and require definitions d1)-d2). For d), from the stage characterization in Lemma 4, it suffices to consider: Employing the same matching arguments as in a), it follows that: which completes the argument for d).
Lemma 5. Bias Expansions for Kernel Expectations. Let F(v) be a bounded function of v and g(v) the density of V_i. Assume that F(v)g(v) has uniformly bounded derivatives up to order 2m + 1. Then, with the kernel functions defined in D4), and for v in a compact subset over which the trimming function (see D3) is defined: where B_i is uniformly bounded in its argument.
Proof. Recalling D4): where C_v(q_v), as defined in D3), is the set over which the trimming function is one. Let z be a vector with l-th element [v(l) − v_i(l)]/h. Then, making this change of variable and restricting all components of v to a strict subset of C_v(q_v), the expectation up to higher-order terms is given as: ∫ ... where k is the standardized kernel in D4). From a standard Taylor expansion in h about 0 and A3), the result follows.
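In the notation of D4), the change of variable and Taylor expansion just described take the standard form (a sketch; odd-order terms vanish by kernel symmetry):

```latex
E\!\left[\frac{1}{h^{d}}\,k\!\left(\frac{v - V_i}{h}\right) F(V_i)\right]
  = \int k(z)\,(Fg)(v - hz)\,dz
  = (Fg)(v) + \sum_{j=1}^{m} h^{2j} B_j(v) + O\!\left(h^{2m+1}\right),
```

where B_j(v) collects the order-2j derivatives of Fg at v, uniformly bounded under the assumed smoothness.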
Employing the above lemmas, Lemma 6 obtains uniform bias and variance rates for the proposed estimator. Lemma 6. Stage Bias and Variance. Assuming A5), with m_s(v) defined as in Lemma 2 and B_s(v) a uniformly bounded function: Proof. For a), with KP_0 and KP_1 defined in Lemma 3: Continuing with an induction argument for a), with s > 1, assume that ... Under Lemma 4d), E[m_s(v)] is given as: From Lemma 5 it now follows that ... The proof for b) at stage s = 1 is immediate. Assuming the result holds for stage s − 1, we show that it holds for stage s > 1. Recall that ..., l = 1, 2, 3, 4. The result is immediate for T_1. For T_2 and any stage s: For T_3 and any stage s, since E{E[...]} = ... from conditional independence, the expectations of cross products vanish, which implies that: For T_4, write T_4^2 = ST + CPT, where: For the squared terms (ST): For the cross-product terms (CPT), from Lemma 4, part e): As the first component is 0, part b) of the lemma follows.
Lemma 7. Mean-Square Convergence for First Derivatives. For w in a compact subset of its support: Proof. Employing the same term decomposition as in Lemma 6, the proof of Lemma 7 follows.
In providing results for a class of semiparametric models, we need uniform convergence results, where uniformity is taken with respect to the index parameters. For this purpose, we now explicitly consider the dependence of the index on the parameter vector θ. Lemma 8. Uniform Convergence Rates. Recall from D4) that K_i(v) depends on θ via v(θ) = V(w; θ) and V_i(θ) = V(W_i; θ). Under A4), with Z_i = [v − V_i(θ)]/h, for w, θ restricted to compact sets and ν = 0, 1, 2: Proof. As the arguments for b), c), and d) are similar to but less complicated than those for a), here we provide the proof for a). Let: b_i = 1 if |Y_i| > N^{1/(m+2)} and b_i = 0 otherwise. Write the term in a) as T_{1b} + T_b, where: The result now follows. Finally, for establishing asymptotic normality, we need a uniform convergence lemma for the estimated mean functions and their first two derivatives. Lemma 9 provides the required results. Proof. From Lemma 8, the lemma is immediate for s = 1. For s ≥ 2, the result follows from an induction argument.
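The truncation step in the proof of a) can be checked numerically. This sketch draws Y_i from a hypothetical heavy-tailed distribution and compares the empirical frequency of b_i = 1 with the Markov bound used in the proof:

```python
import numpy as np

# Numerical check of the truncation bound in the proof of Lemma 8, with a
# hypothetical heavy-tailed distribution for Y_i (t with 8 degrees of freedom):
# for b_i = 1{|Y_i| > N^{1/(m+2)}}, Markov's inequality gives
# E[b_i] = Pr(|Y_i|^m > N^{m/(m+2)}) <= N^{-m/(m+2)} E|Y_i|^m.
rng = np.random.default_rng(42)
N, m = 10_000, 4
Y = rng.standard_t(df=8, size=N)                 # E|Y|^m finite since m < 8
cutoff = N ** (1.0 / (m + 2))
empirical_tail = np.mean(np.abs(Y) > cutoff)     # sample analogue of E[b_i]
markov_bound = N ** (-m / (m + 2)) * np.mean(np.abs(Y) ** m)
```

The inequality even holds sample-by-sample here, since |Y_i|^m exceeds cutoff^m whenever b_i = 1 and N^{-m/(m+2)} cutoff^m = 1.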
For purposes of obtaining properties of the recursive estimator for M, above we have taken exterior X-trimming as known. For estimating index parameters in a semiparametric model, as described in Theorem 2, in the first stage we employ X-trimming based on sample quantiles. In the second stage, we trim based on the estimated index. To cover these cases, let γ̂ be either a vector of sample quantiles or a vector containing sample quantiles and the vector of index parameter estimates, θ̂. Lemma 10. Exterior Trimming. Set the stage s and window parameter r to satisfy either condition C1) or C2) in Theorem 2. With q as a vector of sample quantiles, denote m_s(v; q̂) as the corresponding scaled estimator under estimated X-trimming and m_s(v; q) as the corresponding estimator under known trimming. Then: N^{r/2} {m_s(v; q̂) − m_s(v; q)} = o_p(1). Proof. We prove the lemma first in the case of X-trimming and then show that the arguments immediately extend to index trimming. To prove a) for s = 1, let Â = [Z′D(v; q̂)Z]/N, A = [Z′D(v; q)Z]/N, B̂ = Z′D(v; q̂)/N, and B = Z′D(v; q)/N. As all terms are similar, with the third converging to zero faster than the first two, here we provide the argument for the first term, given as: It can be shown that there exists δ > 0 such that Z(v) is uniformly o_p(N^δ). Further, it can be shown that Â ...; a Taylor series then gives the result, with the smooth approximation slowing the convergence rate to N^{-(1/2−e)}, e > 0 and arbitrarily small. The result then follows for s = 1. An induction argument then completes the proof for part a). The proofs for parts b) and c) are similar.
For exterior index trimming, let γ̂ = (q̂_1, q̂_2, θ̂), where (q̂_1, q̂_2) are lower and upper sample quantiles for the estimated index, X_{1i} + X_{2i}θ̂. Write: ... where V_{0i} = X_{1i} + X_{2i}θ_0. Recalling from A6) that X_{2i} is bounded and that θ̂ − θ_0 = O_p(N^{-1/2}), the proofs for index trimming are very similar to those for X-trimming.
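Lemma 10's message, that exterior trimming may be treated as known, can be illustrated with a small simulation; the design, bandwidth, and quantile levels below are hypothetical choices, not the paper's:

```python
import numpy as np

# Illustrative design (not the paper's): exterior trimming based on estimated
# sample quantiles versus known population quantiles for a kernel average.
rng = np.random.default_rng(1)
N = 5000
V = rng.normal(size=N)
Y = np.sin(V) + rng.normal(scale=0.1, size=N)
h, v = 0.2, 0.0                                  # bandwidth and interior point

def nw_trimmed(lo, hi):
    keep = (V >= lo) & (V <= hi)                 # exterior trimming indicator
    K = np.exp(-0.5 * ((V[keep] - v) / h) ** 2)
    return K @ Y[keep] / K.sum()

known = nw_trimmed(-1.96, 1.96)                    # known 2.5%/97.5% quantiles
est = nw_trimmed(*np.quantile(V, [0.025, 0.975]))  # estimated sample quantiles
gap = abs(known - est)
```

The two estimates differ only through observations lying between the known and estimated quantile boundaries, and those observations receive exponentially small kernel weight at interior evaluation points.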

Employing arguments similar to those in Jiang (2019), it can then be shown that N^{1/2}[Ĝ_{B1} − U_N] = o_p(1), where U_N is a centered U-statistic with N^{1/2}U_N asymptotically distributed as normal. It follows that Ĝ_{B1} = O_p(N^{-1/2}). From Cauchy-Schwarz, Lemmas 6-7, and C1), Ĝ_{B2} = o_p(N^{-1/2}). The analysis for Ĝ_{B3} and Ĝ_{B4} is similar to that for Ĝ_{A4} and Ĝ_{B2}. To establish b), under index trimming at the second step, consistency follows from an extension of Lemma 5 in Klein and Shen (2010) to d multiple indices. To outline the argument, recall the definition of the adjusted estimator in D7) and D8):

1/(N h^d) ..., m > 1. It can readily be shown that the second component is O(1). For the final component, let z be a d × 1 vector with l-th component ...; the remaining terms are O(N^{-(1-r(d+2))}) and O(N^{-(m/(m+2)-r(ν+d))}).
For T_{1b}, from standard results in the literature: sup |T_{1b} − E(T_{1b})| = O_p(N^{-(1/2 - r(ν+d) - 1/(m+2))}) = O_p(N^{-(m/(2(m+2)) - r(ν+d))}). For T_b, the function K_i(V(θ)) incorporates trimming that can be taken as known (Lemma 1) and that bounds X. Then, for x restricted to a compact set, it follows that sup_{θ,i} |∇^ν[K_i(v)]| = O(1/h^{ν+d}). It then suffices to study (1/h^{ν+d}) E[b_i |Y_i|] ≤ (1/h^{ν+d}) E[b_i]^{1/2} E[Y_i^2]^{1/2}, which follows from Cauchy-Schwarz. Note that: E[b_i] = Pr(|Y_i|^m > N^{m/(m+2)}) ≤ N^{-m/(m+2)} E[|Y_i|^m] = O(N^{-m/(m+2)}). Therefore, T_b = O(N^{-[m/(2(m+2)) - r(ν+d)]}).