Identification of Regression Models with a Misclassified and Endogenous Binary Regressor

We study identification in nonparametric regression models with a misclassified and endogenous binary regressor when an instrument is correlated with misclassification error. We show that the regression function is nonparametrically identified if one binary instrument variable and one binary covariate satisfy the following conditions. The instrumental variable corrects endogeneity; the instrumental variable must be correlated with the unobserved true underlying binary variable, must be uncorrelated with the error term in the outcome equation, but is allowed to be correlated with the misclassification error. The covariate corrects misclassification; this variable can be one of the regressors in the outcome equation, must be correlated with the unobserved true underlying binary variable, and must be uncorrelated with the misclassification error. We also propose a mixture-based framework for modeling unobserved heterogeneous treatment effects with a misclassified and endogenous binary regressor and show that treatment effects can be identified if the true treatment effect is related to an observed regressor and another observable variable.


Introduction
Misclassified endogenous binary regressors are prevalent in applications. Examples include self-reported educational attainment (Black et al., 2003), self-reported participation in job training (Krueger and Rouse, 1998), health insurance coverage reported by workers (Black et al., 2000), and participation in the Supplemental Nutrition Assistance Program (SNAP), formerly known as the Food Stamp Program (Kreider et al., 2012). For example, Black et al. (2003) find that only 66.4% of those reporting a professional degree in the 1990 Decennial Census have a professional degree, and Meyer et al. (2020) find that 49.0% of true food stamp recipient households do not report receipt in the Current Population Survey from 2002 to 2005.
We study identification in nonparametric regression models in the presence of a misclassified and endogenous binary regressor when a binary instrument controlling endogeneity is correlated with misclassification error. We consider the following model with a misclassified and endogenous binary regressor and an instrumental variable (instrument) Z:
Y = α(X) + β(X)T* + ε,  E[ε | X, Z] = 0,  (1)
where Y is the outcome variable (for example, wage), X is a vector of exogenous controls, and ε is an unobservable disturbance. T* is an unobservable binary regressor (for example, true educational qualification), which may be endogenous in the sense that it is correlated with ε. T is an observable misclassified measurement of T* (for example, self-reported schooling). Here, because the regressor T* is binary, its measurement error is necessarily nonclassical, i.e., T* − T is correlated with T*. This makes identification difficult.
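The nonclassical nature of the error can be checked numerically. The sketch below is only an illustration: the function names and the parameter values π = Pr(T* = 1), p0 = Pr(T = 1 | T* = 0), and p1 = Pr(T = 0 | T* = 1) are assumptions introduced here, not quantities from the paper. Under this setup, direct calculation gives Cov(T* − T, T*) = π(1 − π)(p0 + p1), which is strictly positive whenever some misclassification occurs.

```python
import random


def simulate_pairs(n, pi, p0, p1, seed=0):
    """Draw n pairs (T*, T); all parameter values are hypothetical."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        t_star = 1 if rng.random() < pi else 0
        # Probability that the report differs from the truth.
        flip_prob = p1 if t_star == 1 else p0
        t = 1 - t_star if rng.random() < flip_prob else t_star
        pairs.append((t_star, t))
    return pairs


def covariance(xs, ys):
    """Sample covariance with divisor n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def population_error_covariance(pi, p0, p1):
    """Cov(T* - T, T*) = pi * (1 - pi) * (p0 + p1) in this two-rate setup."""
    return pi * (1 - pi) * (p0 + p1)


pairs = simulate_pairs(200_000, pi=0.4, p0=0.1, p1=0.2)
t_star = [a for a, _ in pairs]
error = [a - b for a, b in pairs]  # measurement error T* - T

sample_cov = covariance(error, t_star)
population_cov = population_error_covariance(0.4, 0.1, 0.2)
print(sample_cov, population_cov)
```

The sample covariance should be close to the population value, and both are away from zero, so the error cannot be treated as classical noise that is unrelated to the true regressor.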
A number of papers have studied regression models with an exogenous misclassified binary regressor. Aigner (1973) characterizes the OLS asymptotic bias for such a model and develops a procedure to consistently estimate the coefficient of the misclassified binary regressor when outside information on misclassification probabilities is available. More recently, Lewbel (2007) shows that the difference E[Y|X, T* = 1] − E[Y|X, T* = 0] can be identified using an instrument that is mean independent of the change in the outcome variable associated with the change in T* when the instrument takes at least three values. Mahajan (2006) shows that the conditional mean of the outcome variable Y given T* is identified, while Hu (2008) provides related identification results when the discrete regressor takes more than two values. Battistin et al. (2014) examine the identification of the returns to educational qualifications when repeated misclassified measurements are available. Black et al. (2000) and Kane et al. (1999) show identification when repeated misclassified measurements of a binary regressor are available.
Only a few papers analyze identification of regression models when a misclassified binary regressor is endogenous. In particular, Mahajan (2006) shows that α(X) and β(X) are identified when there exists a binary instrument variable Z that is independent of T conditional on (T*, X),
T ⊥ Z | (T*, X),  (2)
in addition to the standard relevance condition and exclusion restriction as well as some other assumptions. However, DiTraglia and García-Jimeno (2019) show that the assumptions in Mahajan (2006) imply that E[ε|X, T*] = 0, namely, T* is exogenous. As a result, identification of the model (1) under endogenous T* has remained an open question.
As pointed out by DiTraglia and García-Jimeno (2019), the reason why Mahajan (2006) cannot identify α(X) and β(X) under endogenous T * is that Mahajan (2006) uses only one binary instrument Z to control two sources of endogeneity, i.e., misclassification in T and endogeneity in T * .
Some recent studies provide related identification conditions for models with an endogenous misclassified regressor T* while maintaining the assumption (2) that the instrument Z is not only independent of ε but also independent of T conditional on T*. These studies include DiTraglia and García-Jimeno (2019) and Hu et al. (2015, 2016); the latter address the identification of nonseparable models with a mismeasured endogenous regressor, but their Assumption 2.1 also assumes that the instrument Z is independent of T conditional on (T*, X). In these studies, the instrument Z has to satisfy two different exclusion restrictions: one from the outcome equation and the other from the misclassification probability.
In empirical applications, a researcher chooses the instrument Z such that Z is relevant for T * and is excluded from the outcome equation; whether Z is excluded from the misclassification probability or not is often a secondary concern given the difficulty of finding a valid instrument that satisfies both the relevance condition and the exclusion restriction from the outcome equation. When an endogenous binary regressor is a self-reported variable, however, the instrument Z may be correlated with the misclassification error as the following examples illustrate.
Example 1 (Supplemental Nutrition Assistance Program (SNAP)). Many empirical papers analyze the effect of participation in SNAP on health outcomes or food insecurity, where Y is a health outcome or food insecurity variable and T is self-reported participation in SNAP. As the instrument Z for controlling endogeneity in T*, some studies use variables that affect the cost and benefit of participating in SNAP such as whether the state uses biometric identification technology (i.e., fingerprint scanning) and the percentage of SNAP benefits issued by direct mail rather than electronic benefit transfer (EBT) (e.g., Meyerhoefer and Pylypchuk, 2008; Yen et al., 2008; Almada et al., 2016). For example, the use of EBT may help encourage SNAP participation by mitigating the stigma associated with SNAP participation (Yen et al., 2008). Such state policies may be correlated with the misreporting error for the following two reasons.
First, as Bound et al. (2001, Section 5.
Example 2 (Returns to Education). Consider analyzing returns to education on wages, where the outcome variable Y is the logarithm of wage, T* is true educational attainment, and T is self-reported educational attainment. Self-reported educational attainment may be misclassified when a respondent makes a false statement (Kane et al., 1999; Battistin et al., 2014). Card (1993) proposes using college proximity as an instrument Z for controlling endogeneity in T* given that college proximity affects the cost of attending colleges. College proximity could be correlated with the misreporting error of educational attainment for the following reasons. As discussed in footnote 1, the misclassification error could be correlated with college proximity through unobserved stigma.
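To see why dependence between Z and the misclassification error matters, the following sketch computes the population Wald (IV) estimand that results when the misclassified T is used in place of T*. It is a stylized illustration under assumptions introduced here and not taken from the paper: a constant slope β, no covariates, E[ε | Z] = 0, and hypothetical misclassification rates p0[z] and p1[z] that may vary with z. The function name and all numerical values are hypothetical.

```python
def wald_estimand(beta, pi, p0, p1):
    """Population Wald (IV) estimand when the misclassified T replaces T*.

    Illustrative assumptions: constant slope beta, no covariates, and
    E[eps | Z] = 0. Each argument after beta is a pair indexed by z in {0, 1}.
      pi[z] = Pr(T* = 1 | Z = z)
      p0[z] = Pr(T = 1 | T* = 0, Z = z)   (false positive rate)
      p1[z] = Pr(T = 0 | T* = 1, Z = z)   (false negative rate)
    """
    def mean_t(z):
        # E[T | Z = z] combines false positives and true positives.
        return p0[z] * (1 - pi[z]) + (1 - p1[z]) * pi[z]

    # E[Y | Z = 1] - E[Y | Z = 0] = beta * (pi[1] - pi[0]) since E[eps | Z] = 0.
    reduced_form = beta * (pi[1] - pi[0])
    first_stage = mean_t(1) - mean_t(0)
    return reduced_form / first_stage


beta = 1.0
pi = (0.3, 0.6)

# Misclassification rates that do not vary with Z.
wald_excluded = wald_estimand(beta, pi, p0=(0.1, 0.1), p1=(0.2, 0.2))
# False negative rate that depends on Z.
wald_dependent = wald_estimand(beta, pi, p0=(0.1, 0.1), p1=(0.2, 0.4))

print(wald_excluded, wald_dependent)
```

When the rates do not depend on Z, the first stage equals (1 − p0 − p1)(π1 − π0), so the estimand reduces to β/(1 − p0 − p1). When the rates vary with Z, this simple rescaling no longer holds and the estimand can be far from β.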
To the best of our knowledge, none of the existing papers establishes identification of models with a misclassified endogenous binary regressor when an instrument is correlated with misclassification errors. This paper fills this gap. Specifically, we relax the assumption (2) and show identification when one of the covariates in the outcome equation, denoted by V, satisfies an exclusion restriction from the misclassification probability, i.e.,
Pr(T = 1 | T*, X, Z, V) = Pr(T = 1 | T*, X, Z),  (3)
where the model (1) is now written as
Y = α(X, V) + β(X, V)T* + ε,  E[ε | X, Z, V] = 0.  (4)
Because E[ε|X, Z, V] = 0, V can be one of the covariates in the outcome equation. As in the existing literature, V also needs to be relevant for T* in that V changes the distribution of T*. Unlike the existing literature, however, we allow Z to affect the misclassification probability.
Choosing the variable V in empirical applications may not be easy. One possibility is to refer to the existing studies that examine the determinant of misreporting errors, which may be survey-specific.
negative nor false positive is statistically associated with a gender dummy. Given that gender is likely to be one of the key determinants of health outcome as well as SNAP participation, these results suggest that we may use a gender dummy for the variable V when we use ACS, CPS, or FoodAPS to study the effect of SNAP participation on health outcome.
Example 2 (Returns to Education, continued). Bingley and Martinello (2014) link the Survey of Health, Ageing and Retirement in Europe (SHARE) to Danish Administrative Registers and examine the determinants of misclassification of self-reported schooling. They find that, conditional on true educational qualification and its interaction with income levels, misclassification error is correlated with neither gender nor age. This result suggests that gender and age are possible candidates for the variable V.
Figure 1 compares the relationship among Y, T*, T, Z, and V in this paper with those in some recent studies. Our Proposition 1 in Figure 1(a) does not assume that Z is independent of T conditional on T*, while the existing studies such as DiTraglia and García-Jimeno (2019) and Nguimkeu et al. (2019) assume that Z is excluded from the misclassification probability in Figure 1(b)(c). Figure 1(d) illustrates the approach of Black et al. (2000), Kane et al. (1999), and Battistin et al. (2014), who use two conditionally independent measurements of T*. In our setup, Z and V can be correlated with each other conditional on T*, so that our identification argument is different from theirs.
Figure 1: Comparison of the relationships between the outcome Y, true unobserved regressor T*, misclassified regressor T, instrument Z, and other covariate/instrument/measurement V in some published papers. Each arrow represents dependence, while a lack of arrow represents conditional independence.
Our identification result is useful for empirical applications. To apply our identification result, the researcher needs to find one of the covariates that is correlated with the endogenous regressor T* but does not affect misclassification. As in the examples above with SNAP and educational qualification, the existing studies that link survey data to administrative data provide some guidance for choosing the covariate V. With such a covariate, we may weaken the requirement for the instrument Z by allowing Z to be correlated with the misclassification error.
The identification of the local average treatment effect (LATE) under mismeasured treatment was studied by Yanagi (2019). Botosaru and Gutierrez (2018) show that the average treatment effect on the treated is identifiable from repeated cross-section data when the treatment status is observed only either before or after the implementation of a program if there is a proxy variable for the latent treatment. Tommasi and Zhang (2020) study identification and inference for the bounds of the weighted average of local average treatment effects.
The model (4) assumes that the individual treatment effect does not depend on unobservables. To examine heterogeneous treatment effects, we extend the model (4) by allowing α(·) and β(·) to depend on an unobserved random variable U * that has a finite support. We generalize our identification result to this model with heterogeneous treatment effects when a mismeasured observable measure (proxy) for U * is available and show that the average treatment effect, the average treatment effect on the treated, the average treatment effect on the untreated, and the LATE are identified.
The remainder of this paper is organized as follows. Section 2 introduces the model and assumptions and derives identification results. Section 3 briefly discusses estimation and inference. Section 4 shows identification of a heterogeneous treatment effect model. Section 5 concludes.
Proofs are collected in Section 6. All limits below are taken as n → ∞. Let := denote "equals by definition." For a k × 1 vector a and a function f(a), let ∇_a f(a) denote the k × 1 vector of the derivative (∂/∂a)f(a).

Binary Regressor
Throughout the paper, we assume that both Z and V are binary random variables with their support given by {0, 1}. In this section, we suppress the exogenous regressor X for brevity. The whole material remains valid conditional on X if all the assumptions are imposed conditional on X. We establish the identification of the model (4) under the following assumptions.
Assumption 1. The following holds for any
Assumption 2.
Example 2 (Returns to Education, continued). In the returns to education example, gender and age are possible candidates for the variable V. Gender and age are likely to satisfy Assumption
Assumption 1 is a straightforward generalization of the assumptions in the current literature. Assumption 1(a) states that the self-reported treatment status T does not provide any additional information on the mean of ε, and hence Y, given the knowledge of T*, instrument Z, and covariate V (and the exogenous regressor X). In particular, the error term ε is mean independent of the misclassification error conditional on (T*, Z, V). This is often referred to as "non-differential measurement error." While this assumption is standard in the misclassification literature (e.g., equation (1) of Mahajan (2006) and Assumption 2.2(iii) of DiTraglia and García-Jimeno (2019)), it is potentially restrictive. In the context of the SNAP example, this assumption may be violated if misreporting of SNAP participation status due to unobserved stigma is correlated with unobserved factors that affect health outcome or food insecurity even after controlling for observed covariates and true participation status. Similarly, in the returns to education example, this assumption may be violated if lying about college completion leads to higher earnings. In both cases, conditioning on a rich set of observed characteristics may mitigate the concern for the violation of this assumption.
Assumption 1(b) and (c) are the standard instrumental variable assumptions. Assumption 1(b) states that the instrument Z has to be excluded from the outcome equation while Assumption 1(c) requires that Z must be relevant for the true regressor T * . Combining Assumptions . Instrument validity, i.e., the validity of Assumption 1(b)(c), is an important issue in empirical applications.
In Assumption 1(d), we relax one requirement on the instrumental variable Z in the existing misclassification literature by allowing Z to be correlated with the misclassification probability, thus relaxing Assumption 3 in Mahajan (2006) and Assumption 2.2(i) in DiTraglia and García-Jimeno (2019). At the same time, Assumptions 1(d) and 1(e) require the existence of a covariate V that affects the true regressor T * but does not affect misclassification error given Z and other covariates. As discussed in the above examples, existing studies linking administrative and survey data analyze how the misclassification error is associated with observed covariates, providing some guidance on how to choose V from a set of observed covariates. For instance, in the case of SNAP participation, a gender dummy can be used for V because it may not be correlated with misclassification error in SNAP participation once other covariates are conditioned on but is likely to be correlated with true SNAP participation status T * .
Assumption 1(f) requires that T* changes the mean of T and corresponds to Assumption 2 in Mahajan (2006)
In the SNAP example, this assumption holds if the conditional mean of health outcome given covariates differs between the SNAP recipients and non-recipients. Assumption 1(h) holds if model (4)
and Assumption 2 enable us to solve these equations for a unique solution. When Z is uncorrelated with T*, Assumption 2 does not hold and the system of equations (6) fails to have a unique solution. α(V) and β(V) are identified from the relation
and Assumption 1(c).
The following proposition provides the main identification result of this paper.
The key assumption in Proposition 1 that is different from those in the existing papers is that we allow Z to affect the misclassification probability. Mahajan (2006) and DiTraglia and García-Jimeno
For clarification, we make the following two remarks.
In this section, we briefly discuss estimation and inference of the model parameter. Our estimation strategy follows directly from the identification result in Section 2. Suppose we have iid ob-
{0, 1}^2} denote the vector of 12 unknowns of the system (6). Note that θ_x is obtained by replac-
(1, 0), (0, 1), (1, 1)}} denote the (12 × 1)-vector of population conditional moments of Y and T conditional on (X, Z, V) evaluated at X = x and the support of (Z, V).
Write the system (6) of 12 equations as m_x = f(φ_x). Because φ_x is uniquely identified from population moments, φ_x is the unique solution that satisfies m_x = f(φ_x). Further, equation (7) gives α(x, v) and β(x, v) as a function of φ_x and m_x; consequently, we may write θ_x as θ_x = g(φ_x, m_x) for a smooth function g. Let m̂_x be an estimator of m_x. We provide the details of the construction of m̂_x later. We estimate φ_x and θ_x by φ̂_x := arg min_φ ‖m̂_x − f(φ)‖² and θ̂_x := g(φ̂_x, m̂_x). A straightforward application of minimum distance estimation (Newey and McFadden, 1994, Theorem 3.2) and the delta-method gives the following proposition:
Proposition 3. Suppose m̂_x satisfies a_n(m̂_x − m_x) →d N(0, Ω) for a nonstochastic sequence a_n → ∞. Then, we have
When X has a finite support, we estimate m_x by sample moments. For j = 1, . . . , 4, define
and define m̂_xf := (m̂_xf1⊤, m̂_xf2⊤, m̂_xf3⊤, m̂_xf4⊤)⊤. From Theorems 3.3.1 and 3.3.2 of Bierens (1987),
where Ω_fj := Var[R|x, w_j]. Applying this result to Proposition 3 with a_n = √n and Ω = Ω_f gives the asymptotic distribution of θ̂_x.
When X is continuously distributed, we estimate m_x by a kernel estimator of E[R|x, w] following Mahajan (2006), provided that E[R|x, w] is continuous in x. For j = 1, . . . , 4, define
where K(·) is a kernel function, and h is the bandwidth satisfying h + 1/(n h^dim(X)) → 0. Define m̂_xc similarly to m̂_xf. Let f(x|w) denote the density of X conditional on W = w. Suppose that , and h satisfy Assumptions 10-14 of Mahajan (2006). Then, it follows from Lemma 2 of Mahajan (2006) (see also Theorem 3.2.1 of Bierens (1987)) that
where Ω_cj := Var[R|x, w_j] ∫K(s)² ds / (f(x|w_j) Pr(W = w_j)). Therefore, applying this result to Proposition 3 with a_n = √(nh) and Ω = Ω_c gives the asymptotic distribution of θ̂_x.
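The first-step estimation of the conditional moments can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the definition of R is not recoverable from the text, so the code assumes R = (Y, T, Y·T), which yields 3 moments for each of the 4 cells of W = (Z, V). The Gaussian kernel, the scalar X, the function names, and the simulated data are likewise assumptions made only for this example.

```python
import math
import random

# Support of W = (Z, V), ordered as w_1, ..., w_4.
CELLS = [(0, 0), (1, 0), (0, 1), (1, 1)]


def components(y, t):
    # Assumed choice of R: (Y, T, Y*T). Treat this as a placeholder.
    return (y, t, y * t)


def cell_sample_moments(rows, x):
    """Finite-support X: sample mean of R among rows with X = x and W = w_j.

    rows is an iterable of tuples (y, t, x_i, z, v).
    """
    estimates = []
    for cell in CELLS:
        selected = [components(y, t) for y, t, xi, z, v in rows
                    if xi == x and (z, v) == cell]
        if not selected:
            raise ValueError(f"no observations with X = {x} and W = {cell}")
        n = len(selected)
        estimates.extend(sum(r[k] for r in selected) / n for k in range(3))
    return estimates  # stacked vector with 4 * 3 = 12 entries


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def cell_kernel_moments(rows, x, h):
    """Continuous scalar X: kernel-weighted mean of R within each cell."""
    estimates = []
    for cell in CELLS:
        weighted = [0.0, 0.0, 0.0]
        total = 0.0
        for y, t, xi, z, v in rows:
            if (z, v) != cell:
                continue
            w = gaussian_kernel((xi - x) / h)
            total += w
            for k, r in enumerate(components(y, t)):
                weighted[k] += w * r
        if total == 0.0:
            raise ValueError(f"no kernel weight in cell W = {cell}")
        estimates.extend(s / total for s in weighted)
    return estimates


# Hypothetical data, generated only to exercise the functions.
rng = random.Random(0)
rows = []
for _ in range(2000):
    z, v = rng.randint(0, 1), rng.randint(0, 1)
    x = rng.random()
    t = 1 if rng.random() < 0.3 + 0.2 * z + 0.2 * v else 0
    y = 1.0 + 0.5 * t + x + rng.gauss(0.0, 1.0)
    rows.append((y, t, x, z, v))

m_hat_c = cell_kernel_moments(rows, x=0.5, h=0.1)
print(len(m_hat_c))
```

Either stacked vector could then be passed to a minimum distance step of the kind described above; that step depends on the system (6) and is not sketched here.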

Heterogeneous Treatment Effect
In Section 2, we assume that the effect of T* on Y does not depend on unobservables. In this section, we extend the model (4) to allow the parameters α(·) and β(·) to depend on an unobserved random variable U*. This gives a random coefficient model similar to the model in Heckman et al. (2006):
Y = α(U*, X, V) + β(U*, X, V)T* + ε,  (8)
where U* is assumed to be exogenous but T* may be correlated with ε. We allow U* and T* to be correlated. Hence, α(U*, X, V) and β(U*, X, V) may be correlated with T* conditional on (X, V). When α(U*, X, V) and T* are correlated, we have "sorting on the level," which is a common form of selection bias. When β(U*, X, V) and T* are correlated, we have "sorting on the gain," which is called essential heterogeneity by Heckman et al. (2006).

Example 3 (Heterogeneous Effect of SNAP). Deb and Gregory (2018) examine how SNAP participation affects food insecurity while allowing for heterogeneous treatment effects using finite mixture models with two latent classes: low and high food security latent classes. In this case, Y is an ordinal measure of food security taking an integer value between 0 and 10. U* represents the latent variable and takes the value 0 for a low food security class and 1 for a high food security class such that α(0, X, V) < α(1, X, V) for any value of V. Using the CPS and assuming no measurement error in self-reported SNAP participation, they find that the effect of SNAP participation is higher for a low food security class than for a high food security class. This suggests a possibility that β(0, X, V) > β(1, X, V).
Example 4 (Heterogeneous Returns to Education). As in Carneiro et al. (2011), consider a model of heterogeneous returns to education, where Y is logarithm of wage, U * may be interpreted as unobservable ability, T * is true educational attainment (college degree), V may be gender, and Z is proximity to colleges. In wage regression (8), unobserved ability U * may affect not only the intercept, α(U * , X, V ), but also the returns to schooling, β(U * , X, V ). In such a case, α(U * , X, V ) and β(U * , X, V ) may be correlated with T * conditional on (X, V ) because a person with higher unobserved ability may have a higher chance of obtaining a college degree and because a forward-looking school decision depends on the returns to schooling.
Both Deb and Gregory (2018) and Carneiro et al. (2011) in the above examples assume no measurement error in treatment variables. We relax this assumption by assuming that we have an observable binary measurement T for an unobserved binary treatment variable T*. To identify the joint distribution of T* and U* from the data, we augment the model with an observable measurement U of U*. The role of U for U* is similar to that of T for T* in that U provides information on U*.
Example 3 (Heterogeneous Effect of SNAP, continued). Deb and Gregory (2018) also analyze the determinants of latent food security class membership. They find that household size and a subjective variable indicating whether the individual met food needs or not are significant determinants of class membership. This suggests that these variables may be used for U .
Example 4 (Heterogeneous Returns to Education, continued). To study the returns to education in the United States using the National Longitudinal Survey of Youth (NLSY), the existing studies use the Armed Services Vocational Aptitude Battery (ASVAB) test scores to measure cognitive skills (e.g., Castex and Dechter, 2014; Heckman et al., 2018). Measures such as ASVAB test scores can be used for U as a proxy for unobserved ability U*.
Henceforth, we suppress the exogenous regressor X for brevity. The whole material remains valid conditional on X if all the assumptions are imposed conditional on X. Define S := (U, T ) and S * := (U * , T * ). We assume that S is conditionally independent of V given (S * , Z), where V is a binary observable variable. As in Section 2, we may choose V based on the existing studies that use administrative data while we may choose the use of fingerprint technology or EBT as an instrument Z for SNAP participation. As shown in Proposition 4 below, with additional regularity conditions (rank conditions and distinct eigenvalues), we may identify α(U * , V ), β(U * , V ), Pr(S * |Z, V ), and Pr(S|S * , Z) for all (S, S * , Z, V ). Furthermore, the conditional distribution of Y conditional on (S * , V ) is identified.
We assume that both U* and U take K_u discrete values with the support U := {u_1, . . . , u_{K_u}}. When U is a continuous variable as in the case of ASVAB, we may define K_u distinct sets by partitioning the support of U. Denote the support of S* and S by S := {s_1, . . . , s_K} with K := 2K_u. We also assume that Y can take at least K different values.
Assumption 6. There exists a partition of the support of the distribution of Y, {∆_j}_{j=1}^K, such that the matrix
Assumption 4(a)(b) corresponds to Assumption 1(a)(b), representing a non-differential measurement error assumption and an exclusion restriction on Z from the outcome equation. Assumption 4(c) corresponds to Assumption 1(c) and requires that Z must be relevant for the true regressor T* at any value of U*. Assumption 4(d) requires that V is excluded from the measurement equation for S conditional on (S*, Z), generalizing Assumption 1(d). As discussed through examples, the results of the existing studies that examine the determinants of misclassification errors may provide guidance on the choice of V.
Assumption 4(e) corresponds to Assumption 1(f) and assumes that S is sufficiently informative to identify the unobserved value of S* such that the probability of S = s given S* = s is higher than that of S = s′ for any s′ ≠ s. In Example 3, the variable S may consist of a self-reported SNAP participation T and a binary subjective variable for meeting the food needs U. Then, Assumption 4(e) requires that the probability of truthfully reporting SNAP participation and food needs is larger than the probability of falsely self-reporting any combination of SNAP participation and food needs. Assumption 4(f) corresponds to Assumption 1(h).
Assumption 5 is similar to Assumption 2 and requires that Z and V are relevant for determin-
We consider identification of treatment effects from model (8). The local average treatment effect is the average of the treatment effect on Y over the subpopulation (the compliers) whose treatment status is strictly affected by the instrument. If their Conditions 1 and 2 hold conditional on V, Imbens and Angrist (1994, Theorem 1) show that the local average treatment
For identifying other treatment effects, let Y_1 denote the potential outcome if the subject were to receive treatment and let Y_0 denote the potential outcome if the subject were not to receive treatment. Decompose Y_j into its conditional mean given V, µ_j(V), and its deviation from the mean, η_j, as
Y_j = µ_j(V) + η_j,  j = 0, 1.
We consider the following assumption to identify treatment effects.
Assumption 7 is similar to Assumption 4(b) and imposes an exclusion restriction on the instrument Z from the outcome equation given the unobserved heterogeneity U * . Assumption 7 corresponds to Assumption A-1 of Heckman et al. (2006), which assumes (η 0 , η 1 ) is independent of Z conditional on V .
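Once β(U*, V) and the joint distribution of (U*, T*) given V are available, conditional treatment effects can be formed as weighted averages of β(U*, V) over U*. The sketch below uses what are assumed here to be the natural weights: Pr(U* = u | V = v) for the ATE, Pr(U* = u | T* = 1, V = v) for the TT, and Pr(U* = u | T* = 0, V = v) for the TUT. These weights, the function name, and the numerical inputs are illustrative assumptions and are not taken from the paper's displayed formulas.

```python
def treatment_effects(beta, joint):
    """Weighted averages of beta(u, v) over u at a fixed value v.

    beta[u]       : slope beta(u, v) for each support point u of U*
    joint[(u, t)] : Pr(U* = u, T* = t | V = v), summing to one
    The weighting scheme is an assumption made for this illustration.
    """
    support = sorted(beta)
    p_u = {u: joint[(u, 0)] + joint[(u, 1)] for u in support}
    p_treated = sum(joint[(u, 1)] for u in support)
    p_untreated = sum(joint[(u, 0)] for u in support)

    ate = sum(beta[u] * p_u[u] for u in support)
    tt = sum(beta[u] * joint[(u, 1)] for u in support) / p_treated
    tut = sum(beta[u] * joint[(u, 0)] for u in support) / p_untreated
    return {"ATE": ate, "TT": tt, "TUT": tut}


# Hypothetical inputs with two latent classes, u = 0 and u = 1.
beta = {0: 2.0, 1: 1.0}
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

effects = treatment_effects(beta, joint)
print(effects)
```

In this example the class with the larger slope is more likely to be treated, so the TT exceeds the ATE, which in turn exceeds the TUT. This is the pattern associated with sorting on the gain.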
From Proposition 4 and Assumption 7, we can identify the average treatment effect (ATE), the average treatment effect on the treated (TT), and the average treatment effect on the untreated (TUT) conditional on V by taking the average of β(U*, V) over U* using appropriate weights as
This paper gives new identification results for cross-sectional regression models when a binary regressor is misclassified and endogenous. Existing studies assume that the instrument used in estimation satisfies not only the standard exclusion restriction and relevance condition but also an additional condition that it is uncorrelated with misclassification errors. Some instruments in empirical applications, however, may be correlated with misclassification errors and, thus, relaxing this additional requirement is important for applications. We show that the constant and slope parameters are identified even if a binary instrumental variable is correlated with misclassification errors when there exists a regressor that is excluded from the misclassification probability but is relevant for the true unobserved regressor.
We can identify E[Y |U * , V, Z] because we can derive Pr(Y ∈ ∆ j , S * = s|Z, V ) from Pr(Y ∈ ∆ j |S * , V ) and Pr(S * |Z, V ). We can derive Pr(T * = 1|U * , Z, V ) from Pr(S * |Z, V ). Therefore, the left hand side of (19) and the matrix on the right hand side of (19) are identified, and this matrix is invertible from Assumption 4(c). Consequently, α(U * , V ) and β(U * , V ) are identified.
It remains to show the identification of Pr(Y ∈ ∆|S * , V ) for any ∆. For a partition ∆ := {∆ j } K j=1 that does not satisfy Assumption 6, define