NEW CONTROL FUNCTION APPROACHES IN THRESHOLD REGRESSION WITH ENDOGENEITY

This paper studies control function (CF) approaches in endogenous threshold regression where the threshold variable is allowed to be endogenous. We first use a simple example to show that the structural threshold regression (STR) estimator of the threshold point in Kourtellos, Stengos and Tan (2016, Econometric Theory 32, 827–860) is inconsistent unless the endogeneity level of the threshold variable is low compared to the threshold effect. We correct the CF in the STR estimator to generate our first CF estimator using a method that extends the two-stage least squares procedure in Caner and Hansen (2004, Econometric Theory 20, 813–843). We develop our second CF estimator which can be treated as an extension of the classical CF approach in endogenous linear regression. Both these approaches embody threshold effect information in the conditional variance beyond that in the conditional mean. Given the threshold point estimates, we propose new estimates for the slope parameters. The first is a by-product of the CF approach, and the second employs generalized method of moments (GMM) procedures based on two new sets of moment conditions. Simulation studies, in conjunction with the limit theory, show that our second CF estimator and confidence interval for the threshold point together with the associated second GMM estimator and confidence interval for the slope parameter dominate the other methods. We further apply the new estimation methodology to an empirical application from international trade to illustrate its usefulness in practice.


INTRODUCTION
Endogenous threshold regression (ETR) has attracted much attention in recent econometric practice. The interest is explained partly by the recognition that economic relationships may shift over time, partly by the empirical relevance of thresholding in the design of government tax and welfare programs, and partly by the growing use of merit- or need-based awards in various social and educational programs. In all such cases, shifts in the associated regression equations may be driven by a companion variable which may itself be endogenous. For one example, in the empirical growth models used in Papageorgiou (2002) and Tan (2010), the corresponding threshold variables, the trade share to GDP in 1985 and the average expropriation risk from 1984 to 1997, are endogenous, as argued in Frankel and Romer (1999) and Acemoglu, Johnson and Robinson (2001), respectively. For another example, our empirical application later in the paper explores the effects of exports on real output growth, where the real exchange rate volatility, as the threshold variable, is endogenous as argued in Tenreyro (2007). For recent discussions of research on thresholding, including endogenously determined thresholds, and for overviews of various time series and cross-sectional applications, see Hansen (2011), Yu and Phillips (2018) [YP hereafter], and Yu, Liao and Phillips (2018) [YLP hereafter].
The usual modeling setup of the ETR has the following form:

y = x′β1 · 1(q ≤ γ) + x′β2 · 1(q > γ) + σ1 u · 1(q ≤ γ) + σ2 u · 1(q > γ),  (1)

where x ∈ R^d, the parameter of interest is θ = (γ, β′)′ with β = (β1′, β2′)′ or, equivalently, θ = (γ, β2′, δ_β′)′ with δ_β = β1 − β2 being the threshold effect in the conditional mean, and δ_σ = σ1 − σ2 is the threshold effect in the conditional variance. Quite often, the ETR literature assumes δ_σ = 0 and absorbs σ2 into u (see, e.g., Caner and Hansen, 2004 [CH hereafter]; Kourtellos, Stengos and Tan, 2016 [KST hereafter]), but allowing δ_σ ≠ 0 is often more realistic and plays an important role in this paper. The regressor x may contain some exogenous regressors such as an intercept, q may be a component of x (if not, some discussions in this paper can be simplified), and both x and q may be treated as endogenous. This setup is similar to endogenous linear regression (ELR) except that the regression coefficients and error variances depend on whether the threshold variable q crosses the threshold point γ. As shown in YP, identification of γ does not require any instruments. Nevertheless, all the estimation procedures discussed in this paper involve instruments. In this introductory discussion, the data are assumed to be sampled i.i.d., but weak dependence is permitted in the formal development later. When q is exogenous, CH use a two-stage least squares (2SLS) method to estimate γ. Suppose the first-stage regression is

x = Π′z + v,  (2)

where the instruments z contain both included (in x) exogenous regressors, such as 1 and q, and excluded exogenous regressors, with E[v|z] = 0 and E[u|z] = 0. Then, plugging (2) into (1) and taking the conditional expectation of y given z, we can show that

E[y|z] = z′Π0β10 · 1(q ≤ γ0) + z′Π0β20 · 1(q > γ0) =: g_CH(z; θ0),  (3)

where the parameter vector is augmented to (θ′, vec(Π)′)′ and the subscript 0 is used to emphasize the true value of a parameter. The CH estimator of γ minimizes the sample analog of E[(y − g_CH(z; θ))²] with g_CH(z; θ) defined in (3).
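To fix ideas, the CH procedure can be sketched numerically: a first-stage regression of x on z gives fitted values ĝ_x = Π̂′z, and γ̂ minimizes the profiled sum of squared residuals over a trimmed grid of candidate thresholds. The DGP below (coefficients, instrument design, sample size) is our own illustrative choice, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
gamma0, b1, b2 = 0.0, 1.0, 3.0           # assumed true threshold and regime slopes
z = rng.normal(size=(n, 2))              # instruments; q = z[:, 1] is exogenous here
uv = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
u, vx = uv[:, 0], uv[:, 1]               # correlated errors make x endogenous
x = z @ np.array([1.0, 0.5]) + vx        # endogenous regressor
q = z[:, 1]                              # exogenous threshold variable
y = x * np.where(q <= gamma0, b1, b2) + u

# First stage: fitted values g_x = z * pi_hat
pi_hat = np.linalg.lstsq(z, x, rcond=None)[0]
gx = z @ pi_hat

def ssr(gamma):
    # regress y on (gx*1(q<=gamma), gx*1(q>gamma)) and return the SSR
    lo = (q <= gamma).astype(float)
    X = np.column_stack([gx * lo, gx * (1.0 - lo)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e

grid = np.quantile(q, np.linspace(0.10, 0.90, 161))  # trimmed grid over q
gamma_hat = grid[np.argmin([ssr(g) for g in grid])]
print(gamma_hat)   # close to gamma0 = 0 in this design
```

Because the criterion is a step function of γ, a grid over sample quantiles of q is the standard way to compute the minimizer.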
When q is endogenous, KST call (1) a structural threshold regression (STR) model and use a type of control function (CF) approach to estimate γ (see Navarro, 2008, for an introduction and Wooldridge, 2015, for a recent summary of CF approaches in regular models). Their CF approach can be treated as an extension of CH's 2SLS method. Specifically, they derive under some assumptions that

E[y|z, q ≤ γ0] = g1(z; γ0, β10, κ0) and E[y|z, q > γ0] = g2(z; γ0, β20, κ0),  (4)

where z contains the instruments, which do not include q, κ0 is a nuisance parameter indicating the endogeneity level of q, and g1 and g2 are parametric functions of z. Then KST's STR estimator of γ, say γ̂, minimizes the sample analog of E[(y − g1(z; γ, β1, κ)1(q ≤ γ) − g2(z; γ, β2, κ)1(q > γ))²], for which a detailed construction is given in Section 2.1. KST develop their asymptotic theory of γ̂ in the small-threshold-effect framework as in CH; in particular, both δ_β and κ are O(n^{−α}) with α ∈ (0, 1/2). We demonstrate with a simple example in Section 2.2 that γ̂ is not generically consistent unless κ is relatively small compared to δ_β, and that there does not exist a nonzero κ value such that γ̂ is consistent for all possible δ_β values. We further point out that an extension of KST using copula theory by Christopoulos, McAdam and Tzavalis (2021) (CMT hereafter) suffers from a similar problem. Moreover, if γ̂ is not consistent, the corresponding β estimator, β̂, is also inconsistent, thereby leading to failure of the whole estimation procedure. Section 2.3 shows that the key problem in the KST estimator is that, when the objective function takes the form of a sum of squares, the conditioning set in (4) cannot depend on the unknown parameter γ0.
As an alternative methodology, Section 3 puts forward two CF approaches for γ, labeled CF-I and CF-II. These two CF approaches differ from KST and do not require the degree of endogeneity to shrink to zero asymptotically. Method CF-I corrects the CF in KST so that it can be treated as an extension of CH and thereby includes CH as a special case. CF-II extends the classical CF approach in ELR to the current context. The approach is new in the ETR literature and covers the standard CH model (i.e., with q exogenous) as a special case. The difference between these two approaches lies in using E[y|z, q] or E[y|z, x] to identify γ. Specifically, we first derive E[y|z, q] = g_I(z, q; θ0) in CF-I and E[y|z, x] = g_II(z, x; θ0) in CF-II under some auxiliary assumptions (which are standard in the usual CF approach), noting that CH also derive E[y|z, q] since q ∈ z, so CF-I is indeed an extension of CH. Then we estimate γ by minimizing, in generic form, the sample analogs of E[(y − g_I(z, q; θ))²] and E[(y − g_II(z, x; θ))²] for "I" and "II," respectively, where we use θ to collect all unknown parameters, which need not be the same in CF-I and CF-II. Note that the conditioning sets in E[y|z, q] and E[y|z, x] do not depend on any unknown parameter, which explains why the new estimators are consistent and the KST estimator is not.
Interestingly, although both CF-I and CF-II use the conditional mean of y to identify γ, these approaches also utilize threshold effect information in the conditional variance (i.e., δ_σ ≠ 0). This methodology differs from the usual least squares (LS) estimator (without endogeneity) and CH's 2SLS estimator (with endogenous x but exogenous q), where only the threshold effect information in the conditional mean can be exploited even if there is a threshold effect in the conditional variance. We derive the asymptotic distribution of these two CF estimators of γ in the small-threshold-effect framework, and also suggest inverting the likelihood ratio (LR) statistics to construct confidence intervals (CIs) for γ. Unlike the Heckit model

y = y* if z′γ + v > 0 (i.e., d = 1), and y = 0 if z′γ + v ≤ 0 (i.e., d = 0), with y* = x′β + u,  (5)

where (u, v) ∼ N(0, (σ², ρσ; ρσ, 1)) with ρ ≠ 0 and a CF is added to the LS objective function based on the observed data, the extra randomness from the generated regressors in our objective functions does not affect the asymptotic distribution of the CF estimators of γ, say γ̂.
Recently, Kourtellos, Stengos and Sun (2022) extended g I (z,q;θ) in CF-I to a semiparametric setup where the CFs take nonparametric forms, whereas the other components of g I (z,q;θ) retain parametric forms; their approach can also be applied to CF-II. Section 3.2 provides detailed comparisons between the parametric setup of this paper and their semiparametric setup. We emphasize here that CF-II is the preferable approach because it has a firmer theoretical foundation as discussed in Section 3 and is found to perform much better than CF-I in our simulations of Section 5. CF-I is included in our analysis only to show the correct objective function for the KST framework.
All the estimators of γ discussed above are M-estimators; another type of estimator employing instruments is the generalized method of moments (GMM) estimator (or Z-estimator). YLP propose the 2SLS estimator in the cross-sectional data case and Seo and Shin (2016) (SS hereafter) propose the FD-GMM in the dynamic panel data case (Appendix A of SS also extends CH's estimator above to the dynamic panel data case). More specifically, such GMM estimators use moment conditions like E[z(y − x′β1 · 1(q ≤ γ) − x′β2 · 1(q > γ))] = 0 to identify γ. Although GMM estimators are important and natural for handling endogeneity, compared with M-estimators they suffer from at least three drawbacks, as discussed in YLP. First, GMM changes the nature of γ from a threshold point (which is nonregular) to a quantile of q (which is regular), which implies that the convergence rate of γ̂ is n^{1/2}, much slower than the convergence rate n of M-estimators. Second, γ is not always identifiable by GMM; for example, when q is independent of the rest of the system, such as the time index in structural change models, γ cannot be identified by GMM. Third, GMM requires more instruments than our CF estimators for identification, which implies that GMM may have less applicability since good instruments are hard to find in practice. Our simulations show that the performance of GMM estimators is much worse than that of our CF-II estimator. Of course, GMM estimators require only unconditional moments to identify γ, whereas M-estimators require conditional moments; however, for identification in nonlinear models, conditional moments are routinely assumed in the literature. Also, it should be emphasized that these drawbacks of GMM are not due to the use of unconditional moments; as argued in YLP, even if conditional moments (which are equivalent to uncountably many unconditional moments) are employed, the first two drawbacks of GMM do not disappear. In summary, GMM has several less desirable features for estimation of γ in ETR.
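As an illustration of such a moment-based estimator, the sketch below profiles out the slopes by 2SLS at each candidate γ and minimizes the resulting GMM objective over a grid. The DGP, instrument design, and weight matrix are our own hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8000
gamma0, b1, b2 = 0.0, 1.0, 3.0
z = rng.normal(size=(n, 4))                  # excluded instruments
cov = np.array([[1.0, 0.5, 0.5],             # (u, v_x, v_q) covariance:
                [0.5, 1.0, 0.3],             # both x and q are endogenous
                [0.5, 0.3, 1.0]])
u, vx, vq = rng.multivariate_normal(np.zeros(3), cov, size=n).T
x = z @ np.array([1.0, 0.5, 0.0, 0.0]) + vx  # endogenous slope regressor
q = z @ np.array([0.0, 0.5, 1.0, 0.7]) + vq  # endogenous threshold variable
y = x * np.where(q <= gamma0, b1, b2) + u

W = np.linalg.inv(z.T @ z / n)               # 2SLS-style weight matrix

def J(gamma):
    # concentrate beta out by 2SLS at this gamma, then evaluate the GMM objective
    lo = (q <= gamma).astype(float)
    X = np.column_stack([x * lo, x * (1.0 - lo)])
    Zx = z.T @ X / n
    Zy = z.T @ y / n
    beta = np.linalg.solve(Zx.T @ W @ Zx, Zx.T @ W @ Zy)
    m = Zy - Zx @ beta                       # sample moments E[z * residual]
    return m @ W @ m

grid = np.quantile(q, np.linspace(0.10, 0.90, 81))
gamma_hat = grid[np.argmin([J(g) for g in grid])]
```

Consistent with the discussion above, γ enters the moments only through the quantile-like object E[z x 1(q ≤ γ)], which is why the resulting estimator converges at the slower n^{1/2} rate.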
Given γ̂, we propose two estimation procedures for β in Section 4. The first type estimates β by β̂ = (β̂1(γ̂)′, β̂2(γ̂)′)′ and is therefore a by-product of the CF approaches for γ, where β̂(γ) = (β̂1(γ)′, β̂2(γ)′)′ is the concentrated version of the estimator of β given γ based on the objective functions employed in our CF approaches. This type of estimator of β is not studied in CH or KST. Recall that in the Heckit model the generated regressor affects the asymptotic variance of the estimates of regular parameters such as β. A similar phenomenon occurs here. Interestingly, in CF-I, the extra randomness in the generated regressor and the randomness present in the original error term are correlated, whereas in CF-II they are not. The second type of estimator relies on GMM estimation and is based on two new sets of moment conditions. The first GMM estimator requires more moments than the second and can be treated as an extension of CH's GMM estimator to the endogenous q case, for which their GMM estimator is inconsistent. The second GMM estimator is an extension of the first type of estimator (i.e., the CF estimator of β). For all β estimators, the γ estimates do not affect their asymptotic variances even though γ̂ − γ0 may not be o_p(n^{−1/2}).
Section 5 reports simulation comparisons of the finite sample performance of all existing estimators and CIs of θ when instruments are available. It turns out that the CF-II estimator and CI for γ and the associated second GMM estimator and CI for β dominate the other methods. Section 6 applies the new methodology of this paper to an empirical application from international trade to illustrate its usefulness in practice. Section 7 concludes. Proofs of the main results (theorems, propositions, and lemmas) are given in Appendixes A-C. Additional assumptions, lemmas, and theorems as well as further details concerning the various procedures in this paper and the properties of existing procedures in the literature are given in the Supplementary Material which contains five sections labeled SD.1-SD.5.
A word on notation: Throughout this paper, we use the subscript 0 to designate the true value of a parameter when it is not obvious and a hat to indicate its estimate. The subscripts "≤ γ" and "> γ" signify use of the indicator functions 1(q_i ≤ γ) and 1(q_i > γ), so that, for instance, x_{i,≤γ} = x_i 1(q_i ≤ γ). For three random vectors x, y, z, x ⊥ y means x is independent of y, and x ⊥ y|z means x is independent of y conditional on z. For a matrix A, A > 0 means A is positive definite. I_m is the m × m identity matrix. λ(·) = φ(·)/Φ(·) is the inverse Mills ratio, where φ(·) and Φ(·) are the standard normal pdf and cdf. Throughout the paper, a regime subscript is used to indicate the presence of two regimes in (1) and is not always written out explicitly as "= 1, 2." The slope parameter is sometimes partitioned according to x and q, as well as regime, giving the component representation β = (β_x′, β_q)′.
A word on assumptions: Because CF-II is our preferable method, we list only the assumptions for consistency and asymptotic distributions of CF-II in the main paper, and relegate the counterparts for CF-I to Section SD.2 of the Supplementary Material.

INCONSISTENCY OF KST'S STR ESTIMATOR
This section begins with a review of the STR estimator in KST, followed by a simple example to show its inconsistency with an accompanying analysis and discussion. To simplify the development, we assume the data are randomly sampled in this section.

Review of KST's STR Estimator
KST assume the first-stage regression

x = Π_x′z + v_x and q = π′z + v_q,  (6)

where E[v_x|z] = 0, E[v_q|z] = 0, and (u, v_q)|z ∼ N(0, (σ², ρσ; ρσ, 1)) with ρ ≠ 0 due to the endogeneity of q.¹ KST also impose the following key assumption to simplify their discussion.
Assumption K excludes q as a regressor; otherwise, the analysis is more complicated, as detailed below. Under Assumption K,

E[y|z, q ≤ γ0] = g_x′β10 + κ0 λ1(γ0 − g_q) and E[y|z, q > γ0] = g_x′β20 + κ0 λ2(γ0 − g_q),

where g_x = Π_x0′z, g_q = π0′z, κ0 = ρ0σ0, and we follow the KST notation in defining λ1(·) = −λ(·) and λ2(·) = λ(−·). Note here that the conditioning events are q ≤ γ0 and q > γ0 rather than q ≤ γ and q > γ for an arbitrary γ. As shown at the end of Section SD.1 of the Supplementary Material, the formulas above cannot be extended to other γ values. When x = (x′, q)′, similar formulas hold with κ0 = ρ0σ0 + β_q0. To be consistent with KST, we maintain Assumption K below. In other words, the model becomes a nonlinear threshold regression, and γ can be estimated by minimizing

S_n(γ, β1, β2, κ) = Σ_i (y_i − ĝ_xi′β1 1(q_i ≤ γ) − ĝ_xi′β2 1(q_i > γ) − κ λ̂_i(γ))²,  (10)

where ĝ_xi = Π̂_x′z_i and ĝ_qi = π̂′z_i, with Π̂_x and π̂ obtained from a first-stage regression, and λ̂_i(γ) := λ1(γ − ĝ_qi)1(q_i ≤ γ) + λ2(γ − ĝ_qi)1(q_i > γ). As usual, we can concentrate with respect to all parameters but γ, regressing y_i on ĝ_xi 1(q_i ≤ γ), ĝ_xi 1(q_i > γ), and λ̂_i(γ) to obtain β̂1(γ), β̂2(γ), and κ̂(γ). Then γ is estimated by the extremum problem γ̂ = argmin_{γ∈Γ} Ŝ_n(γ), where Γ is a proper subset of the support of q, and Ŝ_n(γ) = S_n(γ, β̂1(γ), β̂2(γ), κ̂(γ)).
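In the intercept-only special case (x_i = 1), the profiled STR criterion can be computed as below. The DGP, the endogeneity level, and the use of the true first-stage coefficient are our own illustrative simplifications; with κ0 small relative to the threshold effect, the minimizer lands near γ0, in line with the discussion of KST's simulation designs.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 4000
gamma0, beta1, beta2, kappa0 = 0.0, 0.0, 2.0, 0.2   # small kappa0 vs. threshold effect
z = rng.normal(size=n)
cov = [[1.0, kappa0], [kappa0, 1.0]]
u, vq = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
q = -z + vq                       # endogenous threshold variable
y = np.where(q <= gamma0, beta1, beta2) + u

gq = -z                           # first-stage fit (true coefficient, for simplicity)
mills = lambda c: norm.pdf(c) / norm.cdf(c)   # inverse Mills ratio lambda(.)

def str_ssr(gamma):
    lo = (q <= gamma).astype(float)
    c = gamma - gq
    # lambda1(c) = -lambda(c) below the threshold; lambda2(c) = lambda(-c) above it
    cf = -mills(c) * lo + mills(-c) * (1.0 - lo)
    X = np.column_stack([lo, 1.0 - lo, cf])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]   # concentrates beta1, beta2, kappa
    e = y - X @ coef
    return e @ e

grid = np.quantile(q, np.linspace(0.10, 0.90, 161))
gamma_hat = grid[np.argmin([str_ssr(g) for g in grid])]
```

With a larger κ0 relative to δ_β, the same script exhibits the bias analyzed in the next subsection.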
Given γ̂, the parameter β can be estimated by 2SLS/GMM as in CH. In the small-threshold-effect framework of Hansen (2000), KST show that γ̂ is n^{1−2α}-consistent and that its asymptotic distribution is based on a functional of two-sided Brownian motion, under the assumption that both δ_β and κ are O(n^{−α}) with α ∈ (0, 1/2). The terms κ λ1(γ − ĝ_qi) and κ λ2(γ − ĝ_qi) in the criterion (10) are CFs, as in the Heckit model. Unfortunately, in the present case they do not take the correct form, as illustrated in the simple example in the next subsection. When q is exogenous or κ0 = 0, KST's STR estimator reduces to the 2SLS estimator of CH, in which case γ̂ is indeed consistent.
The arguments above provide a possible explanation for why the simulations in KST work well. The setup above is a special case of their data generating processes (DGPs). Specifically, in our setup, their x_2i = 1, x_1i is missing, and q_i = −z_i + v_qi (rather than 2 + z_i + v_qi). In our notation, they set δ0 ∈ {1, 2, 3} and κ0 ∈ {0.053, 0.316, 0.634}. Obviously, their κ0's are all relatively small compared to δ0; this parameter setting is the reason why their simulation-based estimates of γ appear unbiased.
In Section SD.1 of the Supplementary Material, we point out the key problems in KST's consistency proof using the framework of our simple example. Basically, they miss two terms in their calculation of the probability limit of the objective function. When κ is small, these two terms are dominated, so the minimizer of the limit objective function is still γ0. But when κ is large, these two terms are not negligible, so the minimizer differs from γ0. We also show that we cannot modify KST's objective function (pursuing their approach) to obtain a consistent estimator of γ.
CMT suffer from a problem similar to that of KST, where they replace the inverse Mills ratio terms λ1(·) and λ2(·) by some functions derived from copula theory. The problem here is of the same nature, that is, they calculate E[y|z, q ≤ γ] and E[y|z, q > γ] rather than E[y|z, q]. The good performance of CMT's estimators in their simulations (even in the misspecified scenarios) is likewise attributable to the large threshold effects compared to the endogeneity levels in their DGPs.

Discussion
As emphasized in Section 3.3 of Yu (2015), conditional moment restrictions rather than orthogonality conditions must be used to consistently estimate γ . Given that KST's estimation is indeed based on the conditional mean E[y|z,q ≤ γ 0 ] and E[y|z,q > γ 0 ], a natural question is why KST's estimator is inconsistent. The key point here is that when the objective function takes the form of a sum of squares (like in KST), the conditioning set in the conditional mean on which the objective function is based cannot depend on the unknown parameter to achieve a consistent estimator. This phenomenon emerges regardless of whether the parameter is nonregular (like γ ) or not (like a mean). We here use a simple example to illustrate this point.
Suppose y follows the exponential distribution with mean θ0 = 1, and we want to estimate θ0. It is easy to see that, for any x ∈ (0, ∞),

E[y|y ≤ x] = θ0 − x e^{−x/θ0}/(1 − e^{−x/θ0}) and E[y|y > x] = θ0 + x,

which certainly hold when x = θ0, so it is natural to estimate θ by minimizing

S_1n(θ) = n^{−1} Σ_i (y_i − E_θ[y|y ≤ θ] 1(y_i ≤ θ) − E_θ[y|y > θ] 1(y_i > θ))²,

where E_θ[·] denotes the conditional mean formulas above evaluated at mean θ and cutoff x = θ. The probability limit of S_1n(θ) is S_1(θ) = E[(y − E_θ[y|y ≤ θ]1(y ≤ θ) − E_θ[y|y > θ]1(y > θ))²], with y following the standard exponential distribution. As shown in Figure 2, S_1(θ) is minimized at around 1.24, different from θ0.³ On the other hand, if the conditioning sets do not depend on θ, say, {y ≤ 1} and {y > 1}, then

E_θ[y|y ≤ 1] = θ − e^{−1/θ}/(1 − e^{−1/θ}) and E_θ[y|y > 1] = θ + 1,

and we can set the objective function as the sum of squares with these two conditional means attached to the fixed indicators 1(y_i ≤ 1) and 1(y_i > 1), whose minimizer is consistent for θ0.

³ Of course, the above estimation approach need not always generate an inconsistent estimator. For example, if we replace the standard exponential distribution by the standard uniform distribution and use E[y|y ≤ θ] = θ/2 and E[y|y > θ] = (1 + θ)/2, θ ∈ (0, 1), to identify θ0 = 1/2, it indeed works. Actually, as shown in Section 2.2, when the endogeneity level is relatively low compared to the threshold effect, KST's estimator also converges to the true γ0. However, because the true DGP is unknown, we cannot use these estimators without discretion or additional information.
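The population objective S_1(θ) in this example can be evaluated in closed form; the short script below (our own verification sketch) confirms that its minimizer sits near 1.24 rather than at θ0 = 1.

```python
import numpy as np

# Closed-form pieces for y ~ Exp(1) (theta0 = 1): P(y <= t), E[y 1(y <= t)], E[y^2] = 2.
# Plug-in conditional means evaluated at mean theta and cutoff x = theta:
#   m1(theta) = theta*(1 - 2/e)/(1 - 1/e)  from E[y | y <= x] at x = theta,
#   m2(theta) = 2*theta                    from E[y | y > x] = theta + x at x = theta.
def S1(theta):
    E = np.exp(-theta)
    P1 = 1.0 - E                         # P(y <= theta)
    Ey_low = 1.0 - (1.0 + theta) * E     # E[y 1(y <= theta)]
    Ey_high = (1.0 + theta) * E          # E[y 1(y > theta)]
    m1 = theta * (1.0 - 2.0 * np.exp(-1.0)) / (1.0 - np.exp(-1.0))
    m2 = 2.0 * theta
    # E[(y - m1 1(y <= theta) - m2 1(y > theta))^2]
    return 2.0 - 2.0 * m1 * Ey_low + m1**2 * P1 - 2.0 * m2 * Ey_high + m2**2 * E

grid = np.linspace(0.5, 3.0, 2501)
theta_star = grid[np.argmin(S1(grid))]
print(round(theta_star, 2))   # prints 1.24, not theta0 = 1
```

Replacing the θ-dependent indicators with the fixed events {y ≤ 1} and {y > 1} in the same script restores a minimizer at θ0, matching the discussion above.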
KST try to calculate E[y|z, q ≤ γ0] and E[y|z, q > γ0], whose conditioning sets depend on γ0; this is why their estimator of γ may not be consistent. If we make the conditioning set independent of γ0, for example, replacing the event q ≤ γ0 by a fixed set or by q ∈ R, then we can indeed get a consistent estimator. In KST's setup, suppose x ∈ z and only q is endogenous; then E[y|z] takes a fully parametric form in (γ, β1, β2, κ), where the derivation uses q = π′z + v_q and v_q ∼ N(0, 1). The resulting least squares estimator of γ is √n-consistent (rather than n-consistent) because the objective function is smooth in γ. A √n-consistent estimator can also result from the conditional mean obtained by letting the endpoints of the conditioning interval go to −∞ and ∞. When x is also endogenous, the analysis is much messier: even with auxiliary assumptions on v, the terms involving v_x will not disappear. Note that KST's estimation is inspired by Heckman (1979). In the Heckit model (5), we note that d = 1(z′γ + v > 0) is observable, whereas the conditioning events 1(q ≤ γ0) and 1(q > γ0) in KST's estimation are unobservable. Estimation is based on E[y|d = 1, x, z], that is, the conditioning set does not depend on any unknown parameter, which results in consistent estimation of β. Similarly, our CF-I is based on E[y|z, q] and CF-II is based on E[y|z, x], that is, neither conditioning set depends on γ0; this is why our two CF estimators are consistent. Compared with CF-I, KST integrate out q over a range depending on γ0 in E[y|z, q] (or, equivalently, project the q dimension onto the two indicator functions 1(q ≤ γ0) and 1(q > γ0)).
In Section SD.1 of the Supplementary Material, we provide more discussion on why we cannot include unknown parameters in the conditioning set of M-estimation and still obtain consistent estimators, by comparing M-estimators and Z-estimators in the general sense.

TWO CONTROL FUNCTION APPROACHES FOR γ
In this section, we propose two CF approaches to estimate γ. In both approaches, we allow x and q to be endogenous, treating the exogenous q specification as a special case. First, we introduce some notation for future use. Define F_{i−1} as the sigma field generated by the instruments and past observations. When the sample is random, we can just replace F_{i−1} by z_i. Suppose the first-stage regression is written in stacked form as (x_i′, q_i)′ = Π′z_i + v_i, with components x_i and q_i taking the form (6), so that v_i = (v_xi′, v_qi)′. Denote the estimates of Π and v_i as Π̂ = (Π̂_x, π̂) and v̂_i = (v̂_xi′, v̂_qi)′, and that of g_i = Π′z_i by ĝ_i = Π̂′z_i. For exogenous covariates in x_i, such as 1, the corresponding components in Π are known and the corresponding components of v_xi are zero since such covariates are included in z_i. In the two approaches, we use only the first-stage error terms and residuals for endogenous regressors; to avoid introducing further notation, we still use v_i and v̂_i to denote such error terms and residuals. To unify the notation in the two approaches, define θ̄ = (θ′, κ′)′, where θ = (γ, β′)′ is the parameter of interest and κ collects all other nuisance parameters, which may differ in the two CF approaches. We will also use the same notation x̄_i (and x_i) for the regressors and β̄ for the slope parameters in the two approaches, although they may be different. As mentioned in Section 1, CF-II is our favorite approach, so CF-I is introduced here only for completeness, and all its properties, such as consistency, asymptotic distribution, and LR inference, are provided in Section SD.2 of the Supplementary Material. In CF-I, under some auxiliary assumptions, the conditional mean takes the form

E[y_i|F_{i−1}, q_i] = g_xi′β1 1(q_i ≤ γ0) + g_xi′β2 1(q_i > γ0) + g1(v_qi; ψ)1(q_i ≤ γ0) + g2(v_qi; ψ)1(q_i > γ0),  (16)

while in CF-II,

E[y_i|F_{i−1}, x_i] = x_i′β1 1(q_i ≤ γ0) + x_i′β2 1(q_i > γ0) + σ1 g3(v_i; ψ)1(q_i ≤ γ0) + σ2 g3(v_i; ψ)1(q_i > γ0),  (18)

and the error term e⁰_i := y_i − E[y_i|F_{i−1}, x_i] satisfies E[e⁰_i|F_{i−1}, x_i] = 0. Then γ can be estimated by minimizing S_n(θ̄), where κ = (ψ′, σ1, σ2)′ =: (ψ′, σ′)′. CH and KST use the idea of CF-I, while the classical CF approach in ELR uses the idea of CF-II, that is, conditioning on all endogenous regressors and exogenous instruments.
In other words, CF-II is the genuine CF approach in the literature, where the endogeneity is controlled by E[u_i|v_i]: the regressor set includes two components, z_i and v_i, and only v_i is correlated with u_i. We state CF-I here to show the correct form of CFs in the KST approach. In ELR, using CFs and using moment conditions are equivalent. However, in ETR, they are different (see Sections 1 and 2 for the differences between GMM estimators and M-estimators like CF-I and CF-II). CF-I imposes assumptions on both E[v_i|v_qi] and E[u_i|v_qi], whereas CF-II imposes an assumption only on E[u_i|v_i].
Of course, rigorously speaking, these two groups of assumptions do not nest each other. Because CF-II conditions on the whole first-stage error v_i rather than on v_qi alone, CF-II controls for more endogeneity effects than CF-I (when only q_i is endogenous, they are the same), so CF-II is indeed more appealing from a theory perspective. Actually, CF-I projects all endogeneity onto the v_qi dimension; that is, these two approaches are connected, although moving from one to the other requires strengthening the corresponding conditional mean assumptions. We now show the consistency of the CF-II estimator of γ based on (18) to formalize our identification scheme. First, the following assumptions are imposed. Note here that we label our assumptions as "C-II" since the label "C-I" has been used for CF-I in Section SD.2 of the Supplementary Material. Assumption C-II: the data are strictly stationary and ergodic; θ̄ ∈ Θ̄ with Θ̄ being compact; (β10′, σ10) ≠ (β20′, σ20).
Condition C-II.1 covers weakly dependent data beyond random samples but excludes time trends and integrated processes; it also assumes the existence of threshold effects in the conditional mean and/or conditional variance of y. C-II.2 collects the nonlinear endogeneity assumptions, which imply the correct specification of the conditional mean in the structural equation and reduced form. C-II.3 implies the consistency of the first-stage estimator Π̂. C-II.4 is implied by E[u_i²] < ∞, but we express it in this way for convenience. C-II.5 includes some regularity conditions on f(·); in particular, it requires the threshold variable to have a continuous distribution and excludes the possibility that γ0 falls on the boundary of q's support. C-II.6 guarantees that replacing v_i by v̂_i will not affect the consistency of γ̂. C-II.7 is the key assumption for identification of γ when v_i is replaced by v̂_i; for example, it excludes the case where g3(v; ψ) takes a linear form in ψ and implicitly assumes δ_β′E[x_i x_i′|q]δ_β > 0. It guarantees the uniqueness of argmin_θ̄ plim S_n(θ̄) at θ̄0; it is actually the form of the identification assumptions in nonlinear least squares (see, e.g., Section 2.2.2 of Newey and McFadden, 1994) required in nonlinear TR models, where linear and nonlinear TR models refer to models in which the conditional mean of y_i in each regime is linear and nonlinear in parameters, respectively.

Identification of γ and Linear Endogeneity Forms
From the analysis in Section 3.1, we can see three drawbacks of the nonlinear specification of endogeneity forms. First, different from linear TR models, it is hard to find general primitive conditions for consistency of γ̂; that is, we need to check C-II.7 case by case. Second, there is typically no guideline for the nonlinear specification, and the functional form of g3(·;·) is often arbitrary. On the other hand, linear TR models introduce nonlinearity into the system only through the threshold effect, thereby providing a class of parsimonious models; and, as shown in Petruccelli (1992), linear threshold autoregressive models can approximate a general class of time series processes (e.g., exponential autoregressive and invertible bilinear processes) almost surely. In other words, use of the linear TR model reduces the need to introduce nonlinearity in each regime. Third, and most importantly, optimization when the endogeneity takes nonlinear forms can be troublesome in practice. In particular, we must first concentrate the criterion so that it depends only on (ψ′, γ)′ and then further concentrate to γ to find γ̂; when the dimension of ψ is large, this numerical minimization is burdensome. These three drawbacks go some way toward explaining why the literature seldom considers nonlinear TR models.
To ease discussion and provide the most practical estimation procedures, we assume that endogeneity takes linear forms in both approaches. In CF-I, we further assume that E[v_i|v_qi] = ϕv_qi and E[u_i|v_qi] = κv_qi. Then

E[y_i|F_{i−1}, q_i] = g_xi′β1 1(q_i ≤ γ0) + g_xi′β2 1(q_i > γ0) + κ1 v_qi 1(q_i ≤ γ0) + κ2 v_qi 1(q_i > γ0),

where, in each regime, κ = β′ϕ + σκ (with the regime subscript suppressed), and we note that when q ∈ x and is endogenous, the last element of ϕ is 1. Different from the nonlinear endogeneity case, ϕ and κ cannot be identified separately. In fact, we must exclude the cases where g1(·;·) and g2(·;·) take linear forms in (16) to identify ϕ and κ. Since our main interest lies in θ = (γ, β′)′, absorbing ϕ and κ into the nuisance parameter κ = (κ1, κ2)′ is not critical. Now, γ is estimated by minimizing

S_n(θ̄) = Σ_i (y_i − x̄_i′β̄1 1(q_i ≤ γ) − x̄_i′β̄2 1(q_i > γ))²,  (21)

where θ̄ = (θ′, κ′)′ =: (γ, θ̃′)′, θ̃ = (β′, κ′)′ =: (β̄1′, β̄2′)′ with β̄ = (β′, κ)′ in each regime, and x̄_i = (ĝ_xi′, v̂_qi)′. Although the objective function (21) is based on the conditional mean E[y_i|F_{i−1}, q_i], (21) carries some threshold effect information in the conditional variance because δ_κ := κ1 − κ2 = δ_β′ϕ + δ_σκ involves δ_σ. In contrast, the usual LS estimator (without endogeneity) and CH's 2SLS estimator (with endogenous x_i but exogenous q_i) explore only the threshold effect information in the conditional mean even though a threshold effect in the conditional variance may be present (i.e., δ_σ ≠ 0). Note that, were δ_σ = 0, we would still have κ1 ≠ κ2 unless ϕ = 0. But when q_i ∈ z_i, ϕ = 0 and κ = 0, so we return to the framework of CH. Also, different from KST, κ is assumed to be fixed rather than to shrink to zero (i.e., we allow for strong endogeneity), and v_qi need not follow N(0, 1).
In CF-II, we assume that E[u_i|v_i] = ψ′v_i. So γ can be estimated by minimizing

S_n(θ̄) = Σ_i (y_i − x̄_i′β̄1 1(q_i ≤ γ) − x̄_i′β̄2 1(q_i > γ))²,  (23)

where the nuisance parameter κ = (κ1′, κ2′)′ with κ = σψ in each regime, β̄ = (β′, κ′)′, and x̄_i = (x_i′, v̂_i′)′. Note here that we express the objective function of CF-II in the same format as in CF-I by redefining x̄_i and β̄. Like (19), we can discuss the relationship between the parameters in CF-I and CF-II. Now, E[u_i|v_qi] = E[E[u_i|v_i]|v_qi] = ψ′ϕv_qi; that is, ψ′ϕ plays the role of κ in CF-I. So the coefficients of v_qi in the conditional mean of u_i in these two approaches (κ in CF-I and ψ_q in CF-II) are not equal unless either ψ_x or ϕ_x is zero, that is, unless either x_i is exogenous or x_i ⊥ q_i|F_{i−1}. As in CF-I, ψ cannot be identified separately, and the identification of ψ excludes the linear specification of g3(·;·) in (18); the threshold effect information in the conditional variance is also explored, through δ_κ := κ1 − κ2 = δ_σψ; and ψ is fixed, that is, CF-II also allows for strong endogeneity. Different from CF-I, even if q is exogenous (i.e., in CH's setup), the threshold effect information in the conditional variance is still explored. When q is exogenous, ψ′v_i = ψ_x′v_xi, and we need only control for the endogeneity of x. If δ_σ = 0, then κ1 = κ2 and the model is simpler. As shown in Perron and Qu (2006), the asymptotic distribution of the estimator of γ based on (23) is not affected when some constraints on the slope parameters in the two regimes are imposed, as long as the true model satisfies these constraints. The intuition is simple: the constraints can affect only the efficiency (i.e., the asymptotic variance) of the slope estimator, while as long as the slope estimator is √n-consistent, the asymptotic distribution of the γ estimator is the same, which is in turn due to the asymptotic independence between the slope estimator and the γ estimator.
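Under this linear endogeneity assumption, the CF-II criterion (23) amounts to least squares of y_i on (x_i′, v̂_i′)′ interacted with the two regime indicators, profiled over γ. The following sketch uses our own illustrative DGP (both x and q endogenous, δ_σ = 0, joint normality so that E[u|v] is exactly linear) rather than any design from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
gamma0, b1, b2 = 0.0, 1.0, 3.0
z = rng.normal(size=(n, 3))
cov = np.array([[1.0, 0.5, 0.5],    # (u, v_x, v_q): joint normality makes
                [0.5, 1.0, 0.2],    # E[u | v] exactly linear in v
                [0.5, 0.2, 1.0]])
u, vx, vq = rng.multivariate_normal(np.zeros(3), cov, size=n).T
x = z @ np.array([1.0, 0.5, 0.0]) + vx   # endogenous slope regressor
q = z @ np.array([0.0, 0.5, 1.0]) + vq   # endogenous threshold variable
y = x * np.where(q <= gamma0, b1, b2) + u

# First-stage residuals v_hat for the endogenous regressors x and q
vx_hat = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
vq_hat = q - z @ np.linalg.lstsq(z, q, rcond=None)[0]
R = np.column_stack([x, vx_hat, vq_hat])   # x_bar_i = (x_i, v_hat_i')'

def cf2_ssr(gamma):
    lo = (q <= gamma).astype(float)[:, None]
    X = np.column_stack([R * lo, R * (1.0 - lo)])   # regime-specific beta_bar
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ coef
    return e @ e

grid = np.quantile(q, np.linspace(0.10, 0.90, 161))
gamma_hat = grid[np.argmin([cf2_ssr(g) for g in grid])]
```

Note that the conditioning set (z_i, x_i, q_i) is free of γ0, so the profiled criterion is minimized near the true threshold, unlike the STR criterion under strong endogeneity.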
In Section SD.4 of the Supplementary Material, we develop the asymptotic theory for the γ and β estimators when the model satisfies some constraints, such as some components of β1 and β2 being the same, since such constraints are imposed in our simulations and empirical application. The CF-II approach has not previously been considered in the ETR literature.
As in CMT, we can extend v_{qi} in (20) and v_i in (22) to functions of them while maintaining the linear TR forms. To save space, we focus on CF-II here. As one example, if u_i ∼ N(0,1) and (u_i, v_i′)′ has a Gaussian copula, then by Proposition 2 of CMT, E[u_i | v_i] is linear in a nonlinear transformation of v_i implied by copula theory, and ψ* takes the same form as above.5 The analysis in this paper then goes through, with v_i replaced by this transformation everywhere. In short, the linear endogeneity forms are not as restrictive as they might seem.
The above extensions require that all components of v are continuous. When x_i contains discrete regressors, the linear endogeneity assumption in CF-II appears more robust than that in CF-I. In such a case, assuming E[v_i | v_{qi}] to be linear in v_{qi}, as in CF-I, seems questionable. For such discrete regressors, assuming E[x_i | F_{i−1}] to be linear in z_i also seems questionable; this is why CH assume E[x_i | F_{i−1}] = g(z_i; ·) for some possibly nonlinear (e.g., logit or probit) function of z_i. In other words, in such scenarios, we must use an objective function like (16). On the other hand, assuming E[u_i | v_i] = ψ_x′v_{xi} + ψ_q v_{qi} seems reasonable even if some components of x_i are discrete. When x_i is discrete, it is better to replace v_{xi} by the generalized error r_{xi} of Gouriéroux, Monfort, Renault and Trognon (1987), defined from the reduced form equation for x_i. However, we cannot replace v_{xi} by r_{xi} in CF-I to calculate E[v_i | v_{qi}]. As in CH, we can also allow the first-stage reduced form to be unstable, but when F_{xj} or F_q is such that x*_{ji} = x_{ji} or q*_i = q_i (e.g., when it equals T_{ν_u}), we have a multicollinearity problem, given that x_{ji} or q_i itself is a regressor. Although F_{xj} and F_q are unknown, so that x*_i needs to be estimated and x*_i and x_i are not perfectly multicollinear in finite samples, they are so asymptotically. As a result, the asymptotic distributions of γ̂ and β̂, which are not stated in CMT, cannot be well developed. On the other hand, an advantage of CMT's specification is that no instruments are required. With an unstable reduced form, the asymptotic distribution of γ̂ takes the same form with the meanings of the notations appropriately adjusted.6 Because CF-II is our favored approach and the linear endogeneity assumption is reasonable in CF-II, we maintain this assumption for CF-II in the rest of the paper.
As for CF-I, we also impose the linear endogeneity assumption to avoid technical difficulties; that is, it is better to assume that the endogenous regressors in x_i are continuous if practitioners wish to use CF-I. Finally, note that the existing literature, including CH, KST, and CMT, does not explicitly consider scenarios with discrete x_i.
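For the binary case, the generalized error just mentioned has a closed form. The sketch below is illustrative only: it assumes a probit reduced form x_i = 1{z_i′θ + v_i > 0} with v_i ∼ N(0,1), which is one case covered by Gouriéroux et al. (1987); the function names are ours, not the paper's.

```python
import math

def norm_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def generalized_error(x, index):
    """Generalized residual E[v_i | x_i, z_i] for the probit reduced form
    x_i = 1{z_i'theta + v_i > 0}, v_i ~ N(0,1), with index = z_i'theta:
    an inverse-Mills-ratio expression, positive when x_i = 1 and
    negative when x_i = 0."""
    p = norm_cdf(index)          # P(x_i = 1 | z_i)
    lam = norm_pdf(index)
    return lam / p if x == 1 else -lam / (1.0 - p)
```

By construction these residuals average to E[v_i] = 0 over the two outcomes, which is the property that lets them stand in for v_{xi} in the control function.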
In the semiparametric framework, we would replace κ_1 v_{qi} and κ_2 v_{qi} by h_1(v_{qi}) and h_2(v_{qi}) in CF-I, and replace κ_1′v_i and κ_2′v_i by h_1(v_i) and h_2(v_i) in CF-II, where h_1 and h_2 can be any (smooth) functions. The first approach does not suffer from the curse of dimensionality, in contrast with the second. Following this paper, Kourtellos et al. (2022) develop the semiparametric extension of CF-I, and their method may be applicable to the extension of CF-II. We do not discuss semiparametric approaches in this paper because the key advantage of the CF approaches above is their parametric setup. If the setup does contain nonparametric components, nonparametric estimators such as the IDKE in YP and YLP seem preferable because they do not require any instruments.7 In summary, the parametric setup in this paper is not only preferable practically (not requiring series expansions of h_1(v_{qi}) and h_2(v_{qi}) as in Kourtellos et al. (2022)), but also preferable theoretically when instruments are employed (given that nonparametric methods do not require instruments).
Because we are interested only in γ in this section, we can concentrate out β and κ in both approaches to get a concentrated objective function S_n(γ). Specifically, for any γ, let Y, X_γ, and X_⊥ denote the matrices of stacked vectors y_i, x_i 1(q_i ≤ γ), and x_i 1(q_i > γ) in both approaches. Then S_n(γ) is the residual sum of squares from an LS regression of Y on X_γ and X_⊥, and our two CF estimators of γ are its minimizers. There is an interval of γ values, [γ̂−, γ̂+), minimizing S_n(γ). Following Yu (2012), we take the midpoint of this interval, (γ̂− + γ̂+)/2, as our estimator of γ because the midpoint is more efficient than the left endpoint γ̂− in most cases. In the following two subsections, we discuss the asymptotic properties of the CF-II estimator of γ. Compared with the usual threshold regression estimation of, for example, Chan (1993) and Hansen (2000), the only new component in the objective function (23) is the presence of the generated regressors from the first step.

6 Boldea, Hall and Han (2012) and Hall, Han and Boldea (2012) also study the unstable reduced form in the context of structural change models and allow the break magnitudes to be fixed or shrinking (at rates potentially different from n^{−α}), but note that in structural change models, the threshold variable is the time index and so cannot be endogenous (see also Perron and Yamamoto, 2014, for related discussions). Note also that although the form of γ̂'s asymptotic distribution will not change, the form of β̂'s asymptotic variance matrix depends critically on the specification of the first-stage regression.
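Computationally, the concentrated criterion reduces to a one-dimensional grid search over the observed values of q_i. The following sketch (with hypothetical variable names, and written for a generic TR model) implements the split-sample least squares objective and the midpoint rule described above; for the CF estimators, the regressor matrix would simply be augmented with the first-stage residuals.

```python
import numpy as np

def concentrated_ssr(y, x, q, gamma):
    """S_n(gamma): residual sum of squares from regressing y on
    x*1(q <= gamma) and x*1(q > gamma) (split-sample least squares)."""
    lo = (q <= gamma)[:, None]
    X = np.hstack([x * lo, x * (~lo)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def estimate_gamma(y, x, q, trim=0.15):
    """Grid search over the (trimmed) observed q values; return the
    midpoint of the set of minimizers, as in the midpoint rule above."""
    grid = np.sort(q)
    k = int(len(grid) * trim)
    grid = grid[k:len(grid) - k]
    ssr = np.array([concentrated_ssr(y, x, q, g) for g in grid])
    # S_n is flat between adjacent order statistics of q, so the set of
    # minimizers is an interval; take its midpoint
    minimizers = grid[ssr <= ssr.min() + 1e-12]
    return 0.5 * (minimizers.min() + minimizers.max())
```

The trimming fraction keeps each regime away from empty subsamples, mirroring the compactness restriction on the parameter space for γ.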
We impose the following assumptions, which somewhat strengthen Assumption C-II, so that notations such as w_i and f(·) used there apply here. In line with the label "C-II," we use the label "II" for our assumptions here, since the label "I" has been used for CF-I in Section SD.2 of the Supplementary Material.
Assumption II: the data are strictly stationary and ergodic with ρ-mixing coefficients satisfying the summability condition stated there. Conditions II.1 and II.8 strengthen C-II.1; II.2-4 strengthen C-II.2-4 when x_i includes 1; II.6, II.7, II.9, and II.10 imply C-II.5 (in particular, II.10 implies that the parameter space cannot include the boundary region of q's support); II.2 and II.3 imply C-II.6 when the parameter space is compact; and II.9 and II.10 play the role of (but are not limited to) C-II.7 in the linear endogeneity scenario. This is understandable because Assumption II is used to derive the asymptotic distribution of γ̂, while Assumption C-II is used only to prove its consistency (although in the nonlinear endogeneity scenario). We next provide more explanation of the conditions in Assumption II beyond those in Assumption C-II. One mixing condition implies the other because the latter variable is a function of the former. II.3 implies that the first-stage estimation error is O_p(n^{−1/2}). II.4-10 parallel Assumptions 1.4-1.10 of CH, and we indicate the differences below. Note that II.3 implies E‖x_i‖⁴ < ∞ in II.4, but to make our assumptions comparable with CH, we still state this explicitly in II.4. Note also that II.4 and II.5 imply E‖z_i e⁰_i‖⁴ < ∞ and sup_γ E[‖z_i e⁰_i‖⁴ | q_i = γ] ≤ C, which are used in our proof. Unlike CH, II.7 allows regime-dependent heteroskedasticity. II.8 is the small-threshold-effect assumption in the original TR model (1), where the quantity c represents the threshold effect and c_β is the quantity required in CH; it implies the small-threshold-effect assumption in the augmented TR model defined by (22), where c in II.9 represents the threshold effect and is the quantity required in this paper. II.7 and II.9 implicitly assume that D(γ) and V_±(γ) are well defined in a neighborhood of γ_0. The assumption c′D_0 c > 0 in II.9 excludes the continuous threshold regression (CTR) of Chan and Tsay (1998) (see also Hansen, 2017) in the augmented regression. When c_σ = 0, the CTR in the original regression implies the CTR in the augmented regression asymptotically.
When c_σ ≠ 0, however, even if the original regression is a CTR, the augmented regression need not be, given that c′D_0 c = c²_σ E[(ψ′v)² | q = γ_0] is positive in general. Since the discontinuous threshold regression has more identification information than the CTR, the threshold effects in the conditional variance can actually improve identification power. Also, D_0 > 0 implies that dim(z_i) ≥ dim(x_i) = d, because the rank of the former is the same as that of the corresponding instrument moment matrix. In the limit distribution, φ = c′V_0⁺c / c′V_0⁻c, and W_ℓ(r), ℓ = 1, 2, are two independent standard Wiener processes on [0,∞). If δ_σ = 0, then c = (I_d, 0)′c_β simplifies; we can replace the x̃_i in both D_0 and V_0^± by x_i and c by c_β, so the endogeneity level ψ does not affect the asymptotic distribution of γ̂. From Section SD.2 of the Supplementary Material, the asymptotic distribution of the CF-I γ̂ takes the same form with ω and φ reinterpreted.

The density of ζ(φ) is reported in Appendix B of Bai (1997).
It is helpful to compare the efficiency of the two CF estimators with the LS estimator without endogeneity, although there is no clear ranking unless some simplifications are assumed. Under such simplifications, the φ of CF-II is the same as that of LS, and its ω should be larger than c_β′E[x_i x_i′ | q_i = γ_0]c_β, given that the threshold information in the conditional variance is also exploited (when c_σ = 0, they are the same). It is interesting to observe that endogeneity actually provides useful information in CF-II (in the sense that when there is no endogeneity, that is, ψ = 0, both φ and ω of CF-II are the same as those of LS, even without the simplifications at the beginning of this paragraph); the extra efficiency beyond LS originates from the extra variables z_i. Interestingly, extra randomness in the generated regressors does not affect the asymptotic distribution of γ̂. This is different from the results for regularly estimable parameters (see Theorem 2, or recall the Heckit model). From Yu (2012), this outcome arises because the extra randomness affects γ̂ only locally, so the effect does not accumulate sufficiently to influence the estimation of γ. From the proof of Theorem 1, when we replace the true regressors x̃_i by their estimates, an extra term ē_i (involving the first-stage estimation error and κ) is added to the original error term e⁰_i, and this extra term is O_p(n^{−1/2}). For regular parameters, a positive proportion of the ē_i contribute; since Σ_{i=1}^{n} ē_i ≈ O_p(√n · n^{−1/2}) = O_p(1), the effect of ē_i does not disappear. For γ, however, only an O(1/a_n) portion of the ē_i contribute; then, since Σ_{i=1}^{n/a_n} ē_i ≈ O_p(√(n/a_n) · n^{−1/2}) = O_p(√(1/a_n)) = o_p(1), the effect of ē_i disappears asymptotically.

Likelihood Ratio Tests
As a by-product of estimation, we obtain natural test statistics for hypotheses concerning γ, such as H_0: γ = γ_0. As emphasized in Hansen (2000), the following LR-like statistic has better finite-sample performance than the usual t-like statistic when the threshold effect is small, where η̂² is a consistent estimator of η² = c′V_0⁻c / c′D_0 c. Moreover, LR_n(γ) avoids having to estimate f in the asymptotic distribution of γ̂, as is apparent in the following result (see also Theorem 2 in KST, noting the correction given in the proof of the corollary in Appendix A).
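A CI for γ can then be formed by collecting the γ values at which LR_n(γ) stays below the appropriate critical value. The sketch below is a generic illustration, not the paper's exact procedure: it uses the homoskedastic (φ = 1) limit law P(ξ ≤ x) = (1 − e^{−x/2})² of Hansen (2000) for the critical value, and the plug-in η̂² = S_n(γ̂)/n mentioned later in the text, which is valid only under homoskedasticity.

```python
import numpy as np

def _split_ssr(y, x, q, gamma):
    """S_n(gamma): residual sum of squares of the split-sample LS fit."""
    lo = (q <= gamma)[:, None]
    X = np.hstack([x * lo, x * (~lo)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def lr_confidence_interval(y, x, q, level=0.95, trim=0.15, eta2=None):
    """Invert the LR_n(gamma) profile to form a CI for gamma, using the
    homoskedastic phi = 1 critical value of Hansen (2000)."""
    grid = np.sort(q)
    k = int(len(grid) * trim)
    grid = grid[k:len(grid) - k]
    ssr = np.array([_split_ssr(y, x, q, g) for g in grid])
    n = len(y)
    if eta2 is None:
        eta2 = ssr.min() / n                       # plug-in for E[e_i^2]
    lr = (ssr - ssr.min()) / eta2                  # LR_n(gamma) profile
    crit = -2.0 * np.log(1.0 - np.sqrt(level))     # approx. 7.35 at 95%
    kept = grid[lr <= crit]
    return kept.min(), kept.max()
```

Under heteroskedasticity, eta2 would instead be set to the ratio estimate η̂² discussed below, and the critical value would need to reflect φ ≠ 1.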

Corollary 1. Under the assumptions in Theorem 1,
From Section SD.2 of the Supplementary Material, the CF-I LR statistic takes the same form and has the same asymptotic distribution with φ reinterpreted. If the model is homoskedastic, then φ = 1 and η² in LR_n(γ) can be replaced by an estimate of E[e⁰²_i], such as S_n(γ̂)/n as in CH. However, the model is generally heteroskedastic. In fact, even in CH's setup, V_0⁻ is likely to be unequal to V_0⁺, as we now explain. In their setup, v does not include v_q because their x is just our x and their q is exogenous and so is included in z. Obviously, V_0⁻ is then unlikely to equal V_0⁺. For inference based on LR_n(γ), we need to estimate η² and φ. Following Hansen (2000), we can use the following procedure. Let r_{1i} = (δ_n′x̃_i)²e²_{1i}, r_{2i} = (δ_n′x̃_i)²e²_{2i}, and r_{3i} = (δ_n′x̃_i)². Then η² and φ are ratios of conditional expectations of these quantities. Since r_{1i}, r_{2i}, and r_{3i} are unobserved, let r̂_{1i} = (δ̂_n′x̂_i)²ê²_{1i}, r̂_{2i} = (δ̂_n′x̂_i)²ê²_{2i}, and r̂_{3i} = (δ̂_n′x̂_i)² denote their sample counterparts, where ê_{1i} = y_i − x̂_i′β̂_1 is defined only for i ∈ I_1 := {i | q_i ≤ γ̂}, and ê_{2i} = y_i − x̂_i′β̂_2 is defined only for i ∈ I_2 := {i | q_i > γ̂}, with β̂ one of the consistent estimators of β in Section 4. We can then estimate η² and φ by kernel regression, for some bandwidth h and (boundary) kernel functions K_±(·) and K(·). Alternatively, we can use a polynomial regression, such as a quadratic, by fitting OLS regressions of each r̂_{ji} on (1, q_i, q_i²) and setting η̂² = (μ̂_{10} + μ̂_{11}γ̂ + μ̂_{12}γ̂²)/(μ̂_{30} + μ̂_{31}γ̂ + μ̂_{32}γ̂²) and φ̂ = (μ̂_{20} + μ̂_{21}γ̂ + μ̂_{22}γ̂²)/(μ̂_{10} + μ̂_{11}γ̂ + μ̂_{12}γ̂²).
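The quadratic-regression device can be sketched directly; the helper below (our naming) takes the constructed r̂_{1i}, r̂_{2i}, r̂_{3i} series as inputs, fits the three OLS quadratics in q_i, and evaluates the two ratios at γ̂.

```python
import numpy as np

def eta2_phi_quadratic(q, r1, r2, r3, gamma_hat):
    """Estimate eta^2 and phi by the quadratic-regression device: fit
    OLS of each r_ji series on (1, q_i, q_i^2), evaluate the fitted
    polynomials at gamma_hat, and form the two ratios."""
    Z = np.column_stack([np.ones_like(q), q, q ** 2])
    mu1, *_ = np.linalg.lstsq(Z, r1, rcond=None)
    mu2, *_ = np.linalg.lstsq(Z, r2, rcond=None)
    mu3, *_ = np.linalg.lstsq(Z, r3, rcond=None)
    z0 = np.array([1.0, gamma_hat, gamma_hat ** 2])
    eta2 = (z0 @ mu1) / (z0 @ mu3)    # eta^2-hat
    phi = (z0 @ mu2) / (z0 @ mu1)     # phi-hat
    return eta2, phi
```

Because both quantities are ratios of conditional expectations evaluated at the same point, any common scale in the r̂_{ji} series cancels, which is what makes the simple polynomial fit workable.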
If E[x̃_i x̃_i′(u_i − ψ′v_i)² | q_i = γ] is continuous at γ_0, then φ = σ²_2/σ²_1 can be estimated in a simpler way: for example, under within-regime homoskedasticity, σ²_1 and σ²_2 can be estimated by the within-regime sample variances of the residuals, and φ by their ratio.

ESTIMATION OF SLOPE PARAMETERS
Having obtained γ̂, we propose two approaches for estimating β. The first estimates β as a by-product of the two CF approaches above; that is, we estimate β by β̂ = (β̂_1(γ̂)′, β̂_2(γ̂)′)′, where β̂(γ) = (β̂_1(γ)′, β̂_2(γ)′)′ is the concentrated estimator of β given γ. We label these two estimation methods of β (with either the CF-I estimator of γ or the CF-II estimator of γ as γ̂) as CF-I and CF-II, and jointly as CF. The second approach estimates β by setting up moment conditions as in CH and KST. However, we show in Section SD.3 of the Supplementary Material that the estimates based on the moment conditions in CH and KST are in general inconsistent when q is endogenous. We therefore provide two new sets of moment conditions that generate consistent estimates of β. The resulting estimation methods are labeled GMM-1 and GMM-2, respectively; when the CF-I estimator of γ is used as γ̂, the corresponding methods are labeled GMM-I1 and GMM-I2, and when the CF-II estimator of γ is used as γ̂, they are labeled GMM-II1 and GMM-II2, respectively. Since GMM-II2 is the most efficient of all methods in our simulation studies, we discuss CF/GMM-1 and GMM-II1 only briefly in the main text and leave the detailed discussions of their asymptotic properties to Sections SD.3 and SD.2 of the Supplementary Material, respectively.

CF and GMM-1 Estimators
In the CF estimators, we estimate the augmented coefficient vector (including κ) by LS in the concentrated regression and estimate its regime difference by the difference of the two regime estimates, where X_1 = X_γ̂ and X_2 is X_⊥ with γ replaced by γ̂. Given this estimate, we can estimate β by extracting the corresponding components and estimate δ_β by β̂_1 − β̂_2. This type of estimate essentially employs the moment conditions E[x̂_i e_{1i} 1(q_i ≤ γ_0)] = 0 and E[x̂_i e_{2i} 1(q_i > γ_0)] = 0 (25) to estimate the two regimes separately. Before introducing the GMM-1 estimators, we first show in Section SD.3 of the Supplementary Material that the moment conditions in CH (the resulting GMM is labeled CH's GMM) and KST do not in general generate consistent estimates of β when q is endogenous. Although the moment conditions used in CH may not hold, E[z_i u_i] = 0 seems reasonable, where we assume δ_σ = 0 and absorb σ into u as mentioned in Section 1. As a result, we can estimate β by minimizing the implied GMM objective, where W_n →p W > 0 and γ̂ is the CF-I/CF-II estimator of γ in GMM-I1/GMM-II1. Compared with the CF estimators of β, the GMM-1 estimators suffer from two drawbacks. First, the CF estimators require only d instruments, while the GMM-1 estimators require 2d instruments, because β_1 and β_2 are estimated jointly. When q is exogenous, we can estimate β_1 and β_2 separately as in CH's GMM, in which case only d instruments are required. Good instruments are often hard to find and justify in practical work, making this a relevant consideration. Second, the GMM-1 estimators are hard (although not impossible) to extend to the δ_σ ≠ 0 case (see Section SD.3 of the Supplementary Material for detailed analysis).
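Because the GMM-1 moment conditions are linear in (β_1′, β_2′)′, the minimizer has a closed form. The sketch below is a stylized implementation of joint linear GMM under E[z_i u_i] = 0 (with hypothetical variable names, and without the paper's standard-error formulas); it makes the 2d-instrument requirement concrete.

```python
import numpy as np

def gmm1_beta(y, x, z, q, gamma_hat, W=None):
    """GMM-1 sketch: estimate (beta_1', beta_2')' jointly from
    E[z_i u_i] = 0 with u_i = y_i - x_i'b1 1(q_i <= g) - x_i'b2 1(q_i > g).
    Requires dim(z_i) >= 2*dim(x_i) since both regimes are estimated
    jointly; the moment is linear in beta, so the solution is closed-form."""
    n = len(y)
    lo = (q <= gamma_hat)[:, None]
    Xbar = np.hstack([x * lo, x * (~lo)])      # stacked regime regressors
    if W is None:
        W = np.linalg.inv(z.T @ z / n)         # 2SLS-type weight matrix
    A = Xbar.T @ z @ W @ z.T @ Xbar
    b = Xbar.T @ z @ W @ z.T @ y
    return np.linalg.solve(A, b)
```

When dim(z_i) exactly equals 2d, the weight matrix is irrelevant and the estimator reduces to the just-identified IV solution.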

GMM-2 Estimators
The two drawbacks of the GMM-1 estimators of β above originate from the fact that they estimate β_1 and β_2 jointly. To avoid these drawbacks, we can estimate β_1 and β_2 separately, as in CH's GMM. Specifically, we use the moment conditions (27), based on the instruments z̆_i, to estimate β_1 and β_2. When some elements of x_i are exogenous, they can be included in z_i and x_i is correspondingly shortened. The moment conditions (27) hold because of the analysis in Section 3, especially (20) and (22). Although we could use any function of z̆_i as instruments, we do not pursue this moment selection problem in the current paper. Interestingly, although q_i and x_i may be endogenous, they can be used as instruments because the error terms here are the e_i rather than u_i. In fact, the CF estimators of β are special cases of this kind of GMM estimator; we list the CF estimators of β separately because they are natural by-products of CF estimation of γ. Solving the minimization problem min_β n·m_n(β)′W_n m_n(β) gives the GMM-2 estimator, where, as in (24), Z_1 and Z_2 are the matrices obtained by stacking z̆_i 1(q_i ≤ γ̂) and z̆_i 1(q_i > γ̂). It might seem that GMM-I2 is an extension of CH's GMM method; however, even when q is exogenous they differ, because the e_i rather than the u_i are used as the error terms in (27). The limit theory of GMM-II2 is given in the following result.
Theorem 2. Under Assumption II, the GMM-II2 estimators are asymptotically normal. When W = Ω⁻¹, the asymptotic variance reduces to (G′Ω⁻¹G)⁻¹. The asymptotic covariance matrix between n^{1/2}(β̂_1 − β_1) and n^{1/2}(β̂_2 − β_2) is denoted C_GMM. Note that the components Ω_2 stem from the contributions of the generated regressors. If the first-stage regression is homoskedastic, that is, Var(v_i | z_i) = Q, then the Ω_2 simplify, and the limit covariance matrix structure simplifies similarly. Note that the extra randomness from the generated regressors is not correlated with the original error term. Take β̂_1 as an example: the random component of the influence function from the original error term is z̆_{i,≤γ_0} e_{1i}, and that from the generated regressors involves z_i, v_i, and κ_1; these two components are obviously uncorrelated. In Section SD.2 of the Supplementary Material, we show that the asymptotic variance matrix of GMM-I2 takes a similar form to that of GMM-II2, but there are two extra terms in Ω beyond Ω_1 and Ω_2; this is because the two random components in the influence functions are correlated there. Although Ω_1 > 0 is assumed, Ω_2 is not positive definite because dim(z_i) < dim(z̆_i). Of course, Ω = Ω_1 + Ω_2 > 0, and so the asymptotic variance is positive definite; this is because G′WG > 0 and W > 0 imply that G has full column rank (necessarily, dim(z̆_i) ≥ dim(x̃_i), or dim(z_i) ≥ d). Unlike in LS estimation, β̂_1 and β̂_2 are not asymptotically independent because the generated regressors v̂_i involve all data points; this is why C_GMM ≠ 0. As a result, the asymptotic variance matrix of δ̂_n = β̂_1 − β̂_2 includes the covariance terms −C_GMM − C_GMM′ in addition to the two regime variance matrices, unlike in LS estimation.
As for estimation of the asymptotic variance, we simply replace population means by sample means, population random vectors by their sample analogs (i.e., x̃_i by its estimate, v_i by v̂_i, and e_i by ê_i), and population parameters by their estimates (i.e., θ_0 by θ̂). In practice, we can set W_n = (Z̆′Z̆/n)⁻¹ to get an initial estimator of β, and then set W_n to a consistent estimator of Ω⁻¹ to obtain the optimal estimator. Given the estimated variance, we can construct CIs for each component of θ by inverting the t-statistic; we can also test hypotheses of interest, for example, whether κ_1 = κ_2 = 0 in CF-II to check whether there is endogeneity (such a test cannot be applied in CF-I because whether or not κ = β′ϕ + σκ equals zero does not indicate the existence of endogeneity when z_i does not contain x_i in the first-stage equation). Also, the usual nonparametric bootstrap should be valid for inference on θ, although it is questionable for γ (see Yu, 2014, for further discussion on this point).
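The two-step weighting scheme just described is standard linear GMM, and can be sketched generically. This is only an illustration of the weighting logic under assumed names: it omits the generated-regressor correction (the Ω_2 contribution of Theorem 2), so it is not the paper's full variance-correct procedure.

```python
import numpy as np

def two_step_gmm(y, X, Z):
    """Two-step linear GMM: first step with W_n = (Z'Z/n)^{-1} (a
    2SLS-type weight), then an efficient step with W_n equal to the
    inverse of the estimated moment variance. The generated-regressor
    correction of Theorem 2 is deliberately omitted in this sketch."""
    n = len(y)

    def solve(W):
        A = X.T @ Z @ W @ Z.T @ X
        b = X.T @ Z @ W @ Z.T @ y
        return np.linalg.solve(A, b)

    beta0 = solve(np.linalg.inv(Z.T @ Z / n))   # first step
    e = y - X @ beta0
    Omega = (Z * (e ** 2)[:, None]).T @ Z / n   # heteroskedasticity-robust
    return solve(np.linalg.inv(Omega))          # efficient second step
```

In the just-identified case the two steps coincide; the second step matters only when there are more instruments than parameters.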
As distinct from GMM-1, only d instruments (both included and excluded) are needed in GMM-2 estimation, irrespective of whether q is endogenous. On the other hand, the asymptotic distributions of the GMM-2 estimators rely on the first-stage specification (such as stable or unstable reduced forms), which is similar to CF but different from GMM-1. To be specific, both CF and GMM-2 require v̂_i, which is generated from the first-stage estimation, while GMM-1 relies on the moment conditions E[z_i u_i] = 0, which do not involve v_i; this is an advantage of GMM-1. Of course, we can use the nonparametric bootstrap to conduct inference on θ and thereby avoid explicit formulas for the asymptotic variances of θ̂.
Because GMM-2 employs more moment conditions than CF, the resulting estimators are expected to be more efficient than the CF estimators.8 However, it is hard to compare the efficiency of CF and GMM-2 with that of GMM-1. This is because, as distinct from GMM-1, CF and GMM-2 take advantage of the triangular structure of the simultaneous equations and also of restrictions such as E[v|v_q] = ϕv_q and E[u|v_q] = κv_q in CF-I and GMM-I2, and E[u|x,z,q] = ψ′v in CF-II and GMM-II2. Roughly speaking, GMM-1 exploits a different part of the model information from CF and GMM-2. In summary, it is fair to claim that GMM-2 is the preferable method for practitioners, because GMM-1 suffers from the two drawbacks discussed above and CF is less efficient. Between GMM-I2 and GMM-II2, GMM-II2 seems more attractive because it usually uses a more efficient γ̂ (i.e., CF-II) and more instruments than GMM-I2 (i.e., dim(x_i) ≥ dim(q_i)). Of course, GMM-II2 usually estimates more parameters.
Finally, the asymptotic independence between n^{1/2}(θ̂ − θ) and a_n(γ̂ − γ_0) occurs because the global information exploited in the former and the local information exploited in the latter are independent (see Yu (2012) for more discussion on this point). This information independence also implies that γ̂ will not affect the asymptotic variance of θ̂ (as verified in the theorem), although γ̂ − γ_0 need not be o_p(n^{−1/2}).

SIMULATIONS
This section reports the results of three simulation experiments conducted to assess the performance of the various estimates of γ and β and their corresponding CIs when q is endogenous and instruments are employed. Simulation 1 seeks to reveal the bias of the STR estimator of KST in finite samples when endogeneity is severe, and to compare the 2SLS estimator of YLP and the two CF estimators in this paper for γ . Under the DGP of this simulation, the two CF estimators are the same; also, the efficient GMM estimator of γ in YLP is the 2SLS estimator. Simulation 2 compares the two CF estimators of γ under a DGP for which they differ. Simulation 3 compares the performances of seven estimators of θ , viz., the 2SLS estimator of YLP, the two CF estimators in Section 4.1, and the four GMM estimators in Sections 4.1 and 4.2. Similar to Simulation 1, the efficient GMM estimator of θ in YLP is the 2SLS estimator. In Section SD.5 of the Supplementary Material, we also report the risk of the integrated difference kernel estimator (IDKE) of γ in YP where no instruments are employed and compare the performance of the two CF approaches when q is exogenous.

Comparison of Three Estimators of γ
For the 2SLS estimator of YLP, we use the moment conditions E[z_i(y_i − δ1(q_i ≤ γ_0))] = 0 to estimate δ and γ_0 jointly; the required quantities follow from YLP, where E[u²] = κ² + 1, and the corresponding quantities for the STR estimator of KST and for our CF approaches are defined analogously. From Section 2, the limit STR objective function is symmetric and its minimizer is not unique, so we use the mean absolute deviation (MAD) as a measure of risk. In our two CF estimators, φ and η² simplify to φ = 1 and η² = E[e²_{ui}] = 1.
In the STR estimator of KST, we set φ = 1 and η² = E[e*²_i] for comparison. Table 1 summarizes the MAD outcomes for the three estimators of γ. Several conclusions can be drawn from the results. First, when endogeneity is weak, that is, κ = 0.2δ, the MAD of the STR estimator lies between those of our CF estimators and the 2SLS estimator; when endogeneity is strong, that is, κ = δ, the MAD of the STR estimator is the largest. This is because, as is apparent from Figure 1, the STR estimator is consistent in the former case but inconsistent in the latter. Indeed, in the latter case, when δ = 0.5, 1, and 2, the absolute asymptotic biases are 0.797, 0.775, and 0.778, respectively, and the MADs line up with these magnitudes. Second, the 2SLS estimator is the least efficient of the three (for the STR estimator, consider only the κ = 0.2δ case). This is because its convergence rate is n^{1/2−α} from YLP, which is much slower than the convergence rate n^{1−2α} of the other two estimators. Third, our CF estimators are the most efficient of the three in all cases. Fourth, for all estimators, endogeneity is evidently harmful: the κ = δ case has a larger MAD than the κ = 0.2δ case. Fifth, comparing the cases where n = 200 and n = 800, we see that the convergence rate of the 2SLS estimator is roughly √n, whereas that of our CF estimators is roughly n.9 Sixth, when the STR estimator is consistent, it may be exactly the same as our CF estimators; for example, when κ = 0.2δ and n = 200 with δ = 2, or κ = 0.2δ and n = 800 with δ = 1 or 2, their MADs are exactly the same because the estimators coincide in almost all simulation draws. In Table 12 in Section SD.5 of the Supplementary Material, we also report the MAD of the IDKE in YP (which reduces to the DKE since q is the only covariate).
From Table 12 in Section SD.5 of the Supplementary Material, we can see that the risk of the IDKE is between that of the CF estimators and the 2SLS estimator in all cases. This result is expected because the CF estimators have the same convergence rate as the IDKE but use more data (i.e., z i 's) while the 2SLS estimator has a slower convergence rate than the IDKE.
Tables 2 and 3 summarize the performance of the CIs: Table 2 reports coverage and Table 3 reports average length. From Table 2, we draw the following conclusions. First, when the STR estimator is consistent, the corresponding CI has good coverage (mostly over-coverage); but when it is inconsistent, the CI undercovers, and the coverage can fall below 10%. Second, the 2SLS-CI coverage is good when nδ is large; when nδ is small (especially when endogeneity is strong), the CI suffers from under-coverage. Third, our CF-CI coverage is excellent in all cases and is unaffected by the severity of endogeneity.
From Table 3, we draw these conclusions. First, our CF-CI has the shortest length in all cases, followed by the STR-CI, with the 2SLS-CI longest. Combined with Table 2, these findings show that the CF-CI is shortest while also having the best coverage. Second, the STR-CI is much longer than the CF-CI when endogeneity is strong but, even so, its coverage is very low; this outcome is of course due to the inconsistency of the STR estimator. Third, comparing the cases where n = 200 and n = 800, the length of the 2SLS-CI when n = 800 is roughly 1/2 of that when n = 200, whereas for the CF-CI it is roughly 1/4. These findings match the convergence rate findings in Table 1. Finally, from Tables 1 and 3, it is clear that larger δ tends to be beneficial: larger δ implies lower risk and shorter CIs.
Tables 4-6 summarize the performance of the two CF estimators and the corresponding CIs. For completeness, we also report the performance of the 2SLS estimator based on the moment conditions E[z_i(y_i − x_i′δ1(q_i ≤ γ_0))] = 0. From YLP, the asymptotic distribution of this 2SLS estimator takes the same form as in (28) except that G is redefined with E[z_i x_i′ | q_i = γ_0]c_β f as the relevant component. From these three tables, the following conclusions are drawn. First, CF-II dominates the other two methods, with the lowest risk, highest coverage, and shortest length. Second, and different from Simulation 1, the 2SLS estimator can be more efficient than the CF-I estimator, especially when nδ is small; also, its CIs always undercover. Third, for all methods, endogeneity is harmful and larger δ is beneficial (stronger endogeneity and smaller δ imply higher risk and longer CIs). Fourth, comparing the risks and CI lengths when n = 200 and n = 800, we see that the convergence rates of the 2SLS and CF-I estimators are slower than √n, while the convergence rate of the CF-II estimator is roughly of order n. In summary, the performance of CF-II is extremely good and the other two methods are not recommended. In Table 13 in Section SD.5 of the Supplementary Material, we also report the MAD of the IDKE in YP; its risk lies between that of the CF-II estimator and that of the CF-I estimator (and also the 2SLS estimator) in all cases. Although it is expected that the IDKE is less efficient than the CF-II estimator, since the latter uses more data (i.e., the z_i's), it is surprising that the CF-I estimator is worse than the IDKE given that it also uses the same additional data. This result corroborates the discussion in Section 3, where we argue that CF-II is the preferable CF approach compared with CF-I.

Comparison among Seven Estimators of δ
For the estimation of δ, we can use the moment conditions E[z_i(y_i − x_i′δ1(q_i ≤ γ_0))] = 0 to estimate δ and γ_0 jointly, as in the 2SLS procedure of YLP. Note that under our DGP, the moment conditions of CH cannot be used to estimate δ. Instead, we use the six estimators in Section 4, that is, CF-I, GMM-I1, GMM-II1, CF-II, GMM-I2, and GMM-II2, to estimate δ. Since δ is a regular parameter, we use the usual root-mean-squared error (RMSE) criterion to evaluate its risk. Except for YLP's 2SLS, where the nonparametric bootstrap is used to construct CIs, we invert t-statistics using the asymptotic critical values to construct the CIs for the other six estimators.
In CF-II and GMM-II2, we can estimate δ and ψ jointly to improve the efficiency of δ estimation. The details of the asymptotic variances can be found at the end of Section SD.4 of the Supplementary Material. Due to the special setup of our DGP2, the asymptotic variances of the CF-I, GMM-1, and GMM-I2 estimators can also be simplified; we detail these simplifications in Section SD.3 of the Supplementary Material after the discussion on the general results.
The performance of these seven estimators of δ is summarized in Tables 7-9. From these tables, the following conclusions are drawn. First, the ranking of risk from lowest to highest among the seven estimators in all cases is GMM-II2, CF-II, GMM-II1, GMM-I1, CF-I, GMM-I2, and 2SLS. So the methods based on the CF-II γ̂ are generally better than those based on the CF-I γ̂. The superiority of GMM-II1 over GMM-I1 can be explained by the more precise estimation of γ in CF-II, but the superiority of CF-II and GMM-II2 over CF-I and GMM-I2 cannot be explained solely by this factor; the former two methods appear to have some intrinsic advantages, given that their risks can be less than half those of the latter two. GMM-I2 is slightly worse than CF-I, while GMM-II2 is better than CF-II, especially when endogeneity is severe. Second, although the 2SLS estimator is the least efficient, the corresponding CIs have the best coverage uniformly across all cases. The superior 2SLS coverage may be explained by the fact that δ is estimated jointly with γ, so that, unlike in the other methods, the uncertainty in γ̂ is incorporated. On the other hand, the 2SLS-CI is much longer than those of the other methods. Third, CF-I and GMM-I2 suffer from some under-coverage, especially when nδ is small,10 whereas the coverage of GMM-1, CF-II, and GMM-II2 is excellent or near perfect in all cases. Fourth, as expected, the ranking of CI length matches the ranking of risk. Fifth, as in γ estimation, endogeneity is harmful to all methods; again, stronger endogeneity implies higher risk and longer CIs (with larger magnitudes than in γ estimation). Sixth, comparing the risks and CI lengths when n = 200 and n = 800, we see that the convergence rates of all estimators are roughly √n. In summary, the GMM-II2 estimator performs best and, after balancing coverage and length, GMM-II2 also delivers the best CI.
Footnote 10: As in CH, we could use Bonferroni-type methods to improve the coverage of these CIs. Because there is as yet no rigorous procedure for selecting the coverage level of γ in this Bonferroni-type CI construction, we do not investigate it in this paper.

In Section SD.5 of the Supplementary Material, we also compare the performance of the two CF approaches when q is exogenous. All the results in Simulations 2 and 3 carry over (in particular, the CF-II (GMM-II2) estimators and associated CIs of γ (δ) are again found to perform best). When q is exogenous, we can also estimate δ based on the moment conditions E[z_i(y_i − x_i'δ)1(q_i ≤ γ_0)] = 0, as in CH. It turns out that the corresponding estimator is more efficient than the one based on the moment conditions E[z_i(y_i − x_i'δ1(q_i ≤ γ_0))] = 0, both asymptotically and in finite samples. This is because δ is related only to the data points with q_i ≤ γ_0, whereas the moment conditions E[z_i(y_i − x_i'δ1(q_i ≤ γ_0))] = 0 also use data points with q_i > γ_0, which introduces redundancy into the estimation of δ and thereby blunts the sharper information in the former moment conditions. Note further that CH's GMM has the lowest risk among the four estimators of δ that are associated with the CF-I estimate of γ, that is, CF-I, CH's GMM, GMM-I1, and GMM-I2. In other words, if we use CH's estimate of γ, their GMM estimator of δ is indeed the best.
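The efficiency comparison between the two moment sets can be illustrated with a stylized Monte Carlo. The design below is not the paper's DGP: it assumes a scalar exogenous regressor with z_i = x_i and a toy outcome that is active only in the left regime, so that both moment sets are valid. The subsample-moment estimator should show roughly half the sampling variance of the full-sample one, because observations with q_i > γ_0 contribute only noise to the full-sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_estimators(n, delta=1.0, gamma0=0.0):
    """Toy model y = delta * x * 1(q <= gamma0) + e with exogenous q and z = x.
    Returns the subsample-moment and full-sample-moment estimators of delta."""
    x = rng.normal(size=n)
    q = rng.normal(size=n)
    e = rng.normal(size=n)
    y = delta * x * (q <= gamma0) + e
    left = q <= gamma0
    # Moment E[z (y - x'delta) 1(q <= gamma0)] = 0: uses regime-1 data only.
    d_sub = np.sum(x[left] * y[left]) / np.sum(x[left] ** 2)
    # Moment E[z (y - x'delta 1(q <= gamma0))] = 0: uses all observations.
    d_full = np.sum(x * y) / np.sum(x[left] ** 2)
    return d_sub, d_full

draws = np.array([two_estimators(400) for _ in range(2000)])
print(draws.var(axis=0))  # subsample variance is roughly half the full-sample one
```

In this toy design the full-sample moment adds the term Σ_{q_i>γ_0} x_i e_i to the estimating equation, which is pure noise; that is the redundancy referred to above.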

EMPIRICAL APPLICATION
As in CMT, we estimate a threshold model of the foreign-trade multiplier relationship, quantifying the effects of exports on real output growth, where the level of real exchange rate volatility is used as the threshold variable. Specifically, we employ the following TR model:

y_i = x_i'β_1 1(q_i ≤ γ) + x_i'β_2 1(q_i > γ) + ρ y_{i−1} + u_i,

where y_i is the real GDP growth rate, x_i = (1, x̃_i)' with x̃_i being the export growth rate of goods and services, and q_i is a measure of real effective exchange rate (REER) volatility. Adding y_{i−1} to the system controls for the effects of persistent shocks in y_i, but we assume such effects do not exhibit any threshold effect. As noted in Tenreyro (2007), the two regimes defined by REER volatility may reflect differences between flexible and stable (or pegged) exchange rate regimes, and the threshold variable q_i is correlated with both x_i and u_i. Following CMT, we do not consider the endogeneity of x_i because their test cannot reject that x_i is exogenous; in this case, our CF-I and CF-II estimators of γ are the same. The method of CMT does not require any instruments. For our methods, we use z_i = (1, x̃_i, y_{i−1}, q_{i−1})' as instruments, that is, we use q_{i−1} as the only excluded instrument, specify the first-stage equation as

q_i = z_i'π + v_{qi},

and assume E[u_i|v_{qi}] = ψ v_{qi}. The linear form of E[u_i|v_{qi}] can be justified as follows. The determination of GDP growth y_i and exchange rate volatility q_i involves many factors beyond x_i and z_i, and all these factors combine to influence u_i and v_{qi}. By a commonly used central limit argument, the variables u_i and v_{qi} that result from aggregating these influences would approximately follow normal distributions, which in turn implies a linear form for E[u_i|v_{qi}]. The primary purpose of this empirical application is to show the difference between KST's STR estimator and our two CF estimators in this setting.
The data of CMT cover seven OECD countries: Canada (CA), France (FR), Italy (IT), Japan (JP), the United Kingdom (UK), the United States (US), and the Netherlands (NL), from 1966 to 2014.11 Since the tests in CMT cannot reject that CA and FR exhibit no threshold effects, we consider the remaining five countries. Table 10 reports the estimates of the model parameters γ and θ based on CF-II and GMM-II2 (which are the same as CF-I and GMM-I2 in this application), since these are the most efficient in our simulations. From Table 10, we can draw the following conclusions. First, the percentages of observations in the high-volatility regime "2" are bounded away from 0 and 1, which indicates that there are sufficient observations in each regime for our estimation. Second, ρ is bounded away from 1, that is, there is no unit root problem, which indicates that our assumption of stationarity is reasonable in this application. Third, for IT, JP, and the US, we can reject δ_κ = 0 (i.e., κ_1 = σ_1ψ = σ_2ψ = κ_2), which implies σ_1 ≠ σ_2 when ψ ≠ 0 (which is the case at least for IT and JP, judging from the CIs for κ). In other words, our specification of a threshold effect in the variance of the error term is necessary in this application. Fourth, all CIs (whether for γ or β) are quite wide, owing to the small sample size n = 49. As a result, the CIs of γ in our CF approach cover the KST and CMT estimates. Fifth, our γ estimates for IT and JP are very different from the KST and CMT estimates, while for the UK, US, and NL, the three estimators are similar or even identical. This can be explained by the testing results for κ_1 = κ_2 = 0. For IT and JP, this endogeneity test rejects the null at the 5% level; in other words, the endogeneity level is comparable to the jump size. In this case, the KST and CMT estimators are inconsistent, as explained in Section 2.
On the other hand, for the remaining three countries, the endogeneity test cannot reject, or only marginally rejects, the null at the 5% level; that is, the endogeneity level is negligible compared to the jump size, and in this case all three estimators are consistent. Note that our endogeneity test results differ from those in CMT, where the null is rejected for all five countries. Sixth, we check the effects of x_i on y_i via the coefficients β_12 and β_22. Our results turn out to be qualitatively similar to those of CMT. On the one hand, β_12 is significantly positive for IT, the US, and NL, which supports the view that low real exchange rate volatility favors exports and economic growth (e.g., Grier and Smallwood, 2013). On the other hand, β_22 is significant and greater than 1 for JP and the UK; that is, high real exchange rate volatility favors exports and growth (e.g., Égert and Morales-Zumaquero, 2008); see CMT for more references on this phenomenon.

CONCLUSION
We conclude this paper with a summary of the strengths and weaknesses of the available estimators of γ and β in ETR. Table 11 lists the number of instruments required by each estimator. YLP's 2SLS estimator requires the most instruments for estimating both γ and β and has a slower convergence rate in estimating γ. Although we can use transformations of z as instruments when more than dim(z) instruments are required, such instruments provide information only through nonlinearity and seem unattractive for practical use. The CI for β (but not γ) has good coverage (but not length) because this method estimates γ and β jointly and so takes into account the randomness in estimating γ; this joint estimation distinguishes it from all the other methods. CH's 2SLS cannot be used to estimate γ, and CH's GMM cannot be used to estimate β, when q is endogenous. In such cases, we can use CF-I or GMM-I1 as substitutes for CH's 2SLS or CH's GMM. In this sense, our CF-I estimate of γ combined with the GMM-I1 estimate of β may be viewed as an extension of CH's 2SLS combined with CH's GMM. However, GMM-I1 requires more instruments than CH's GMM and cannot easily handle the δ_σ = 0 case. GMM-I2 is an extension of CF-I; both differ from GMM-I1 and avoid its problems. The CF-II estimator of γ, combined with the corresponding estimators of β, is a parallel development of the CF-I estimators.
Compared with CF-I, CF-II estimation of γ and β has advantages in both its theoretical foundations and its finite sample performance. Among all estimators and CIs for β, the GMM-II2 estimator and CI have the best finite sample performance. After balancing all dimensions of each method, we suggest using CF-II to estimate and construct CIs for γ, and GMM-II2 to estimate β, when instruments are available. When instruments are absent, we may use YP's IDKE method to estimate γ. Of course, we can use YP's IDKE even when instruments are available; however, the method is nonparametric, so it may suffer from the curse of dimensionality, and it requires bandwidth selection in practice, which may be troublesome when dim(x) is large. For β, YP's IDKE can estimate δ_β when δ_σ = 0 and E[u|x] is smooth, but in general instruments are required to estimate β. The latter is not unexpected in view of the endogeneity in the regression.

Note to Table 10: The first three rows report different γ estimates, β_2 is the coefficient of x̃_i in each regime, 95% CIs are reported in parentheses, δ_κ = 0 and κ_1 = κ_2 = 0 report the p-values for the two tests, and %obs"2" reports the percentage of observations in regime "2."

Table 11. Number of instruments required in the estimation of γ and β by each method.

where g_{3,i,ψ} = g_3(v_i; ψ). With the first inequality following from C-II.6, the second from v̂_i − v_i = −z_i'(π̂ − π), and the last equality from C-II.2-3, we need only show that

n^{−1} Σ_{i=1}^n [y_i − (x_i'β_1 + σ_1 g_{3,i,ψ})1(q_i ≤ γ) − (x_i'β_2 + σ_2 g_{3,i,ψ})1(q_i > γ)] z_i = O_p(1),

which follows from Lemma 2.4 of Newey and McFadden (1994). Although Lemma 2.4 of Newey and McFadden (1994) is stated for i.i.d. observations, the result requires only the application of a weak law of large numbers, which holds under C-II.1 by the ergodic theorem. The almost sure continuity condition in Lemma 2.4 of Newey and McFadden (1994) is implied by the continuity of g_{3,i,ψ} in ψ (C-II.6) and the continuity of the distribution of q (C-II.5).
To check that the summand is dominated by a function of w_i with finite first moment, we need only show that E[y_i^2] < ∞, E[‖x_i‖^2] < ∞, and E[g_{3,i,ψ_0}^2] < ∞, by the Cauchy-Schwarz inequality and the bound E[g_{3,i,ψ}^2] ≤ C(E[g_{3,i,ψ_0}^2] + ‖ψ − ψ_0‖^2) for any ψ (which is implied by C-II.6). E[‖x_i‖^2] < ∞ is implied by C-II.3, E[g_{3,i,ψ_0}^2] < ∞ is assumed in C-II.4, and E[y_i^2] < ∞ is implied by these two results and E[(e_i^0)^2] < ∞ (C-II.4). Second, we prove the consistency of θ̂ by applying Theorem 2.1 of Newey and McFadden (1994). For this purpose, we need only show that S_n(θ) converges uniformly in probability to S(θ), which is continuous and minimized uniquely at θ_0. By Lemma 2.4 of Newey and McFadden (1994) and the analysis above, S_n(θ) converges uniformly in probability to S(θ), which is continuous in θ. From Section 2.2.2 of Newey and McFadden (1994), S(θ) is minimized uniquely at θ_0 if g(w;θ) = g(w;θ_0) implies θ = θ_0, where

g(w;θ) := (x'β_1 + σ_1 g_3(v;ψ))1(q ≤ γ) + (x'β_2 + σ_2 g_3(v;ψ))1(q > γ).

If not, then there are two cases: γ = γ_0 and γ ≠ γ_0. For γ ≠ γ_0, there are three subcases. Case (i): γ ≠ γ_0, θ̃ = θ̃_0, where θ := (γ, θ̃')'. By C-II.5, P(γ < q ≤ γ_0) > 0, where we assume without loss of generality that γ < γ_0. So C-II.1 and C-II.7 imply P(g(w;θ) ≠ g(w;θ_0)) > 0, a contradiction, and thus γ = γ_0.
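The step establishing E[y_i^2] < ∞ in the dominance argument above can be spelled out. The display below is a sketch under the assumption that y_i has the regression representation at the true parameters (β_j^0, σ_j^0, ψ_0) with error e_i^0, consistent with the summand used earlier; C denotes a generic finite constant.

```latex
% Sketch of the bound E[y_i^2] < \infty from C-II.3 and C-II.4.
\begin{align*}
y_i &= \bigl(x_i'\beta_1^0 + \sigma_1^0\, g_{3,i,\psi_0}\bigr)\mathbf{1}(q_i \le \gamma_0)
      + \bigl(x_i'\beta_2^0 + \sigma_2^0\, g_{3,i,\psi_0}\bigr)\mathbf{1}(q_i > \gamma_0) + e_i^0,\\
E\bigl[y_i^2\bigr]
    &\le C\Bigl(E\|x_i\|^2 + E\bigl[g_{3,i,\psi_0}^2\bigr]
      + E\bigl[(e_i^0)^2\bigr]\Bigr) < \infty,
\end{align*}
% using (a+b+c)^2 \le 3(a^2+b^2+c^2), |x_i'\beta_j^0| \le \|x_i\|\,\|\beta_j^0\|,
% and the disjointness of the two regime indicators.
```

Each of the three expectations on the right is finite by C-II.3 and C-II.4, as stated above.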