## 1 Introduction

Difference-in-differences is a popular evaluation strategy in the social sciences; it makes causal comparisons from observational panel data by exploiting variation across time (Ashenfelter 1978; Bertrand, Duflo, and Mullainathan 2004; Angrist and Pischke 2009; Bechtel and Hainmueller 2011; Keele and Minozzi 2013; Malesky, Nguyen, and Tran 2014; Keele *et al.* 2019; Callaway and Sant’Anna 2019). The key assumption underlying difference-in-differences is *parallel trends*: the counterfactual trends of the treatment and control groups, in the absence of treatment, are the same, possibly after conditioning on some observed covariates (Heckman, Ichimura, and Todd 1997; Abadie 2005). In practice, the parallel trends assumption can be questionable because unobserved confounders may have time-varying effects on the outcomes. A common alternative method is a regression model that adjusts for the lagged dependent variable (Ashenfelter 1978), which assumes *ignorability* conditional on past outcomes and observed covariates.

Difference-in-differences and lagged-dependent-variable adjustment (also known, respectively, as the gain score estimator and the analysis of covariance estimator in sociology and psychology) are two different methods relying on different identification assumptions. Extensive conceptual, empirical, and numerical comparisons between the two methods have been made in the literature (e.g., Allison 1990; Maris 1998; van Breukelen 2013; Ryan, Burgess, and Dimick 2015; O’Neill *et al.* 2016). In particular, in the context of linear models, Angrist and Pischke (2009) show that difference-in-differences and lagged-dependent-variable regression estimators have a *bracketing* relationship. Namely, for a true positive effect, if the ignorability assumption is correct, then mistakenly assuming parallel trends will overestimate the effect; in contrast, if the parallel trends assumption is correct, then mistakenly assuming ignorability will underestimate the effect. The opposite holds for a true negative effect.

The bracketing relationship is important in practice. Although we usually do not know which of the two assumptions holds in a real application, we can analyze the data under each assumption and treat the estimates as upper and lower bounds on the true effect. However, the linear setting in Angrist and Pischke (2009) is restrictive, particularly for applications with noncontinuous outcomes. For example, binary outcomes are common in political science (e.g., Keele and Minozzi 2013; Malesky, Nguyen, and Tran 2014) and health studies (e.g., Stuart *et al.* 2014); count outcomes are common in transportation safety studies where the before–after design is popular (e.g., Hauer 1997). Moreover, the parallel trends assumption is scale-dependent (Athey and Imbens 2006). Therefore, an extension to nonlinear settings is relevant for both theory and practice. In this paper, we prove that, within the canonical two-period two-group setting, the same bracketing relationship holds in general nonparametric and semiparametric settings. We give three examples to illustrate the theoretical results.

## 2 Setup

### 2.1 Difference-in-Differences

We proceed under the potential outcomes framework (Neyman 1923; Rubin 1974). We consider the basic two-period two-group panel design, where a sample of units, indexed by $i\in \{1,\ldots ,n\}$, is drawn from a target population of two groups, labeled by $G_{i}=0$ or $1$. Each unit can potentially be assigned to a treatment $d$, with $d=1$ for the active treatment and $d=0$ for the control. Units in both groups are followed over two time periods $T$, with $T=t$ and $T=t+1$ denoting the before and after periods, respectively. The treatment is administered only to the group with $G_{i}=1$ in the after period. For each unit $i$, let $D_{iT}$ be the observed treatment status at time $T$. The above design implies $D_{it}=0$ for all units and $D_{i,t+1}=1$ for the units in group $G_{i}=1$; thus $G_{i}=D_{i,t+1}$. Assume that each unit has two potential outcomes in each period, $\{Y_{iT}(1),Y_{iT}(0)\}$ for $T=t$ and $t+1$, and that only the one corresponding to the observed treatment status, $Y_{iT}=Y_{iT}(D_{iT})$, is observed. Therefore, $Y_{it}=Y_{it}(0)$ and $Y_{i,t+1}=(1-G_{i})Y_{i,t+1}(0)+G_{i}Y_{i,t+1}(1)$. For each unit, a vector of pretreatment covariates $X_{i}$ is also observed in the before period.

In the two-period two-group panel design, the target estimand is usually the average treatment effect for the treated (ATT) (Abadie 2005; Angrist and Pischke 2009; Lechner 2011):

$$\tau_{\text{ATT}}=E\{Y_{i,t+1}(1)-Y_{i,t+1}(0)\mid G_{i}=1\}=\mu_{1}-\mu_{0},\tag{1}$$

where $\mu_{1}=E\{Y_{i,t+1}(1)\mid G_{i}=1\}$ and $\mu_{0}=E\{Y_{i,t+1}(0)\mid G_{i}=1\}$. When the outcome is discrete, ratio versions of $\tau_{\text{ATT}}$ are often of interest, such as

$$\gamma_{\text{ATT}}=\frac{E\{Y_{i,t+1}(1)\mid G_{i}=1\}}{E\{Y_{i,t+1}(0)\mid G_{i}=1\}}=\frac{\mu_{1}}{\mu_{0}},\tag{2}$$

which is the causal risk ratio for binary outcomes and the causal rate ratio for count outcomes.

The quantity $\mu_{1}$ equals $E(Y_{i,t+1}\mid G_{i}=1)$, and thus is directly estimable from the observed data, e.g., by the moment estimator $\bar{Y}_{1,t+1}=\sum _{i=1}^{n}G_{i}Y_{i,t+1}/\sum _{i=1}^{n}G_{i}$. In contrast, the quantity $\mu_{0}$, the mean counterfactual outcome for the treatment group in the after period in the absence of treatment, is not observable, and identifying it requires additional assumptions. The central task in this design is to use the observed data to estimate the counterfactual $\mu_{0}$. Any consistent estimator of $\mu_{0}$ leads to consistent estimators of $\tau_{\text{ATT}}$ and $\gamma_{\text{ATT}}$.

With difference-in-differences, the key to identifying $\mu_{0}$ is the *parallel trends* assumption.

#### Assumption 1 (Parallel trends).

$E\{Y_{i,t+1}(0)-Y_{it}(0)\mid X_{i},G_{i}=1\}=E\{Y_{i,t+1}(0)-Y_{it}(0)\mid X_{i},G_{i}=0\}.$

The parallel trends assumption requires that, conditional on covariates $X_{i}$, the average outcomes in the treated and control groups in the absence of treatment would have followed parallel paths over time. Under Assumption 1, we have the nonparametric identification formula for $\mu_{0}$:

$$\tilde{\mu}_{0,\text{DID}}=E(Y_{it}\mid G_{i}=1)+\int E(Y_{i,t+1}-Y_{it}\mid X_{i}=x,G_{i}=0)\,\text{d}F_{X|G=1}(x),\tag{3}$$

where $F_{X|G=1}(x)=\text{pr}(X\leqslant x\mid G=1)$ is the distribution of $X$ in the treatment group. All terms on the right-hand side of (3) are identifiable from the observed data. A stronger version of Assumption 1 imposes parallel trends without conditioning on covariates, under which we can write

$$\tilde{\mu}_{0,\text{DID}}=E(Y_{it}\mid G_{i}=1)+E(Y_{i,t+1}\mid G_{i}=0)-E(Y_{it}\mid G_{i}=0).\tag{4}$$

Based on the identification formula (4), a moment estimator of $\tau_{\text{ATT}}$ is

$$\hat{\tau}_{\text{DID}}=(\bar{Y}_{1,t+1}-\bar{Y}_{1,t})-(\bar{Y}_{0,t+1}-\bar{Y}_{0,t}),\tag{5}$$

where $\bar{Y}_{g,T}$ is the mean observed outcome for group $g$ at time $T$ $(g=0,1;T=t,t+1)$. The form of this estimator underlies the name “difference-in-differences”.
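
As a minimal sketch, the moment estimator could be computed as follows in Python with NumPy. The function name `did_estimate` and the toy arrays are illustrative assumptions and are not taken from any of the data sets analyzed in this paper.

```python
import numpy as np

def did_estimate(y_before, y_after, g):
    """Moment difference-in-differences estimate for a two-period, two-group panel.

    y_before, y_after: outcomes Y_t and Y_{t+1} for each unit.
    g: group indicator (1 = treated in the after period, 0 = control).
    """
    y_before = np.asarray(y_before, dtype=float)
    y_after = np.asarray(y_after, dtype=float)
    g = np.asarray(g).astype(bool)
    change = y_after - y_before
    # Mean before-after change among treated minus the same among controls.
    return change[g].mean() - change[~g].mean()

# Hypothetical toy panel: controls change by 1, treated units change by 3.
y_before = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
y_after = [2.0, 3.0, 4.0, 5.0, 7.0, 9.0]
g = [0, 0, 0, 1, 1, 1]
tau_did = did_estimate(y_before, y_after, g)
print(tau_did)
```

Working with unit-level changes, rather than four separate group means, gives the same number in a balanced panel and makes the "difference of differences" structure explicit.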

A well-known limitation of the difference-in-differences approach is that the parallel trends assumption depends on the scale of the outcome (Athey and Imbens 2006; Lechner 2011). Specifically, the parallel trends assumption may hold for the original $Y$ but not for a nonlinear monotone transformation of $Y$, for example, $\log Y$. This scale dependence restricts the use of difference-in-differences in settings with non-Gaussian and discrete outcomes.
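
The scale dependence can be seen in a small numerical illustration. The group means below are hypothetical and chosen only so that the untreated trends are exactly parallel in levels; the helper `trend` is likewise an illustrative name.

```python
import math

# Hypothetical mean untreated outcomes (before, after) in each group.
control = (10.0, 20.0)
treated = (20.0, 30.0)

def trend(means, transform=lambda v: v):
    """Before-after change in the (possibly transformed) group mean."""
    before, after = means
    return transform(after) - transform(before)

# Parallel in levels: both groups change by 10.
level_gap = trend(treated) - trend(control)
# Not parallel in logs: log(30/20) differs from log(20/10).
log_gap = trend(treated, math.log) - trend(control, math.log)
print(level_gap, log_gap)
```

Here the gap in trends is zero on the original scale but equals $\log 1.5-\log 2\neq 0$ on the log scale, so parallel trends cannot hold on both scales at once in this example.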

### 2.2 Lagged-Dependent-Variable Adjustment

In the treatment–control panel design, a class of alternative methods relies on the assumption of ignorability conditional on the lagged dependent variable, that is, in the absence of treatment, the outcomes for the treated and control groups would have the same distributions, conditional on their lagged outcome and covariates.

#### Assumption 2 (Ignorability).

$Y_{i,t+1}(0)\bot \!\!\!\bot G_{i}\mid (Y_{it},X_{i}).$

Under ignorability, we have the following nonparametric identification formula of $\mu_{0}$:

$$\tilde{\mu}_{0,\text{LDV}}=\int E(Y_{t+1}\mid G=0,Y_{t}=y,X=x)\,\text{d}F_{Y_{t},X|G=1}(y,x),\tag{6}$$

where $F_{Y_{t},X|G=1}(y,x)$ is the joint distribution of $(Y_{t},X)$ in the treatment group. The form of $\tilde{\mu}_{0,\text{LDV}}$ is identical to the traditional identification formula for the average treatment effect for the treated in observational cross-sectional studies. We can specify a model for $E\{Y_{i,t+1}(0)\mid Y_{it},G_{i},X_{i}\}$, based on which we impute the counterfactual mean $\mu_{0}=E\{Y_{i,t+1}(0)\mid G_{i}=1\}$ by averaging over $Y_{t}$ and $X$ and thus obtain a consistent estimator for $\tau_{\text{ATT}}$.

In contrast to the parallel trends assumption, the ignorability assumption is scale-free. Three popular methods under the ignorability assumption are the synthetic control method (Abadie and Gardeazabal 2003; Abadie, Diamond, and Hainmueller 2015), matching on the lagged dependent variable (Heckman, Ichimura, and Todd 1997), and regression adjustment for the lagged dependent variable (Ashenfelter 1978). Among these, the lagged-dependent-variable adjustment approach is the easiest to implement. Through extensive simulations, O’Neill *et al.* (2016) found that, when the parallel trends assumption does not hold, the lagged-dependent-variable regression adjustment approach produces the most efficient and least biased estimates among these three methods.

## 3 Theory

Our goal in this section is to establish the analytical relationship between the difference-in-differences and lagged-dependent-variable adjustment estimators under general settings. For notational simplicity, we condition on the covariates $X$ and thus ignore them in the discussion.

### 3.1 Bracketing Relationship in Linear Models

We start with the simple case of linear regressions. Specifically, the difference-in-differences approach is usually implemented via a linear fixed-effects model:

$$E(Y_{iT}\mid G_{i})=\alpha_{i}+\lambda_{T}+\tau D_{iT},\qquad T=t,t+1,\tag{7}$$

where $\alpha_{i}$ is the individual fixed effect and $\lambda_{T}$ is the time-specific fixed effect. When model (7) is correct, the coefficient $\tau$ equals the estimand $\tau_{\text{ATT}}$; any consistent estimator of $\tau$ in (7) is also consistent for $\tau_{\text{ATT}}$. By taking the difference between outcomes at time points $t$ and $t+1$ in (7), we can eliminate the individual fixed effects $\alpha_{i}$. Because $G_{i}=D_{i,t+1}-D_{it}$, we have $E(Y_{i,t+1}-Y_{it}\mid G_{i})=(\lambda_{t+1}-\lambda_{t})+\tau G_{i}.$ Therefore, we can fit a linear regression of the difference $Y_{i,t+1}-Y_{it}$ on the group indicator $G_{i}$ to estimate $\tau$. The resulting ordinary least squares estimator is the difference between the sample means of $Y_{i,t+1}-Y_{it}$ in the treated and control groups, and thus it equals $\hat{\tau}_{\text{DID}}$ defined in (5).

The lagged-dependent-variable adjustment method can be implemented via linear models in two ways. In the first approach, motivated by (6), we can fit an ordinary least squares line $\hat{E}(Y_{t+1}\mid G=0,Y_{t}=y)=\hat{\alpha}+\hat{\beta}y$ using only the control units; then we obtain $\hat{\mu}_{0,\text{LDV}}=\hat{\alpha}+\hat{\beta}\bar{Y}_{1,t}$ as the sample analog of $\tilde{\mu}_{0,\text{LDV}}$ and $\hat{\tau}_{\text{LDV}}=\bar{Y}_{1,t+1}-\hat{\mu}_{0,\text{LDV}}$ as the estimate of $\tau_{\text{ATT}}$. In the second approach, as in Angrist and Pischke (2009, ch. 5.4), we can use the following linear model:

$$E(Y_{i,t+1}\mid Y_{it},G_{i})=\alpha+\beta Y_{it}+\tau G_{i}.\tag{8}$$

When model (8) is correct, the coefficient $\tau$ equals the causal estimand $\tau_{\text{ATT}}$, and any consistent estimator of $\tau$ is consistent for $\tau_{\text{ATT}}$. We can fit the ordinary least squares line $\hat{E}(Y_{t+1}\mid G,Y_{t})=\hat{\alpha}+\hat{\tau}_{\text{LDV}}^{\prime }G+\hat{\beta}^{\prime }Y_{t}$ using all units and take the coefficient $\hat{\tau}_{\text{LDV}}^{\prime }$ as an estimate of $\tau_{\text{ATT}}$. We have the following expressions for the two estimators $\hat{\tau}_{\text{LDV}}$ and $\hat{\tau}_{\text{LDV}}^{\prime }$ (the proof is given in the Appendix).

Proposition 1. Without covariates, the two lagged-dependent-variable adjustment estimates are

$$\hat{\tau}_{\text{LDV}}=(\bar{Y}_{1,t+1}-\bar{Y}_{0,t+1})-\hat{\beta}(\bar{Y}_{1,t}-\bar{Y}_{0,t}),\qquad \hat{\tau}_{\text{LDV}}^{\prime }=(\bar{Y}_{1,t+1}-\bar{Y}_{0,t+1})-\hat{\beta}^{\prime }(\bar{Y}_{1,t}-\bar{Y}_{0,t}).\tag{9}$$

These two estimates in (9) differ from the moment difference-in-differences estimate $\hat{\tau}_{\text{DID}}=(\bar{Y}_{1,t+1}-\bar{Y}_{0,t+1})-(\bar{Y}_{1,t}-\bar{Y}_{0,t})$ only in the coefficients $\hat{\beta}$ and $\hat{\beta}^{\prime }$. Indeed, $\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}=(\hat{\beta}-1)(\bar{Y}_{1,t}-\bar{Y}_{0,t})$, and the same identity holds with $\hat{\tau}_{\text{LDV}}^{\prime }$ and $\hat{\beta}^{\prime }$. Consider the case with $\hat{\beta}$ or $\hat{\beta}^{\prime }$ larger than $0$ but smaller than $1$. The sign of $\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}$ or $\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}^{\prime }$ then depends on the sign of $\bar{Y}_{1,t}-\bar{Y}_{0,t}$. If the treatment group has a larger lagged outcome $Y_{t}$ on average, then $\hat{\tau}_{\text{DID}}<\hat{\tau}_{\text{LDV}}$; if the treatment group has a smaller $Y_{t}$ on average, then $\hat{\tau}_{\text{DID}}>\hat{\tau}_{\text{LDV}}$. In the special case with $\hat{\beta}=1$ or $\hat{\beta}^{\prime }=1$, they are *identical*: $\hat{\tau}_{\text{DID}}=\hat{\tau}_{\text{LDV}}$ or $\hat{\tau}_{\text{DID}}=\hat{\tau}_{\text{LDV}}^{\prime }$. How much $\hat{\beta}$ or $\hat{\beta}^{\prime }$ deviates from 1 indicates how different the two estimates are. We see this phenomenon in the examples in Section 4. Importantly, the discussion in this subsection holds without imposing any stochastic assumptions. That is, Proposition 1 is a purely numerical result. In contrast, the bracketing relationship in Angrist and Pischke (2009, ch. 5.4) is proven under the linear model assumptions.
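
Because Proposition 1 is a purely numerical result, it can be checked on any data set. The sketch below uses simulated data (the data generating process, seed, and variable names are illustrative assumptions) and compares the two regression-based estimates with the closed form $(\bar{Y}_{1,t+1}-\bar{Y}_{0,t+1})-b(\bar{Y}_{1,t}-\bar{Y}_{0,t})$, where $b$ is the fitted coefficient on $Y_{t}$; this closed form follows from the least-squares intercept identity used in the Appendix proof.

```python
import numpy as np

# Simulated two-period panel; the generating model is arbitrary because the
# identity being checked is algebraic, not stochastic.
rng = np.random.default_rng(0)
n = 200
g = rng.integers(0, 2, size=n)
y_t = rng.normal(loc=1.0 + 0.5 * g, scale=1.0, size=n)
y_t1 = 0.3 + 0.6 * y_t + 1.0 * g + rng.normal(size=n)

def group_mean(v, k):
    return v[g == k].mean()

gap_after = group_mean(y_t1, 1) - group_mean(y_t1, 0)
gap_before = group_mean(y_t, 1) - group_mean(y_t, 0)
tau_did = gap_after - gap_before

# Approach 1: OLS of Y_{t+1} on Y_t among controls, then impute at the
# treated group's mean lagged outcome.
X0 = np.column_stack([np.ones((g == 0).sum()), y_t[g == 0]])
alpha_hat, beta_hat = np.linalg.lstsq(X0, y_t1[g == 0], rcond=None)[0]
tau_ldv = group_mean(y_t1, 1) - (alpha_hat + beta_hat * group_mean(y_t, 1))

# Approach 2: OLS of Y_{t+1} on (1, G, Y_t) using all units.
X = np.column_stack([np.ones(n), g, y_t])
_, tau_ldv_prime, beta_prime = np.linalg.lstsq(X, y_t1, rcond=None)[0]

closed_form = gap_after - beta_hat * gap_before
closed_form_prime = gap_after - beta_prime * gap_before
print(tau_did, tau_ldv, tau_ldv_prime)
```

Up to floating-point error, `tau_ldv` matches `closed_form` and `tau_ldv_prime` matches `closed_form_prime`, whatever process generated the data.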

Gelman (2007) pointed out that restricting $\beta=1$ in (8) gives identical least squares estimators for $\tau$ from models (7) and (8), which is also evident from Proposition 1. However, the nonparametric identification Assumptions 1 and 2 are not nested, and the difference-in-differences estimator is not a special case of the lagged-dependent-variable adjustment estimator in general. Therefore, it is natural to investigate whether the result of Angrist and Pischke (2009) is unique to the linear models (7) and (8). In the next subsection, we generalize the bracketing relationship to model-free settings.

### 3.2 Nonparametric Bracketing Relationship

For notational simplicity, below we also drop the subscript $i$. Under ignorability, the nonparametric identification formula (6) of $\mu_{0}$ simplifies to

$$\tilde{\mu}_{0,\text{LDV}}=E\{E(Y_{t+1}\mid G=0,Y_{t})\mid G=1\}=\int E(Y_{t+1}\mid G=0,Y_{t}=y)\,F_{Y_{t}}(\text{d}y\mid G=1),\tag{10}$$

where $F_{Y_{t}}(y\mid G=g)=\text{pr}(Y_{t}\leqslant y\mid G=g)$ is the cumulative distribution function of $Y_{t}$ for units in group $g$ $(g=0,1)$. The form of $\tilde{\mu}_{0,\text{LDV}}$ is identical to the identification formula for the ATT estimand in cross-sectional studies.

Write $\tilde{\tau}_{\text{DID}}=\mu_{1}-\tilde{\mu}_{0,\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}=\mu_{1}-\tilde{\mu}_{0,\text{LDV}}$, with $\tilde{\gamma}_{\text{DID}}$ and $\tilde{\gamma}_{\text{LDV}}$ the corresponding ratios. To compare $\tilde{\tau}_{\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}$ without imposing any functional form on the outcome model, we first obtain the following analytical difference between $\tilde{\mu}_{0,\text{DID}}$ and $\tilde{\mu}_{0,\text{LDV}}$ (the proof is given in the Appendix).

Lemma 1. The difference between $\tilde{\mu}_{0,\text{DID}}$ and $\tilde{\mu}_{0,\text{LDV}}$ is

$$\tilde{\mu}_{0,\text{DID}}-\tilde{\mu}_{0,\text{LDV}}=\int \Delta(y)\,F_{Y_{t}}(\text{d}y\mid G=0)-\int \Delta(y)\,F_{Y_{t}}(\text{d}y\mid G=1),$$

where $\Delta(y)=E(Y_{t+1}\mid G=0,Y_{t}=y)-y.$

The quantity $\Delta(y)=E(Y_{t+1}\mid G=0,Y_{t}=y)-y=E(Y_{t+1}-Y_{t}\mid G=0,Y_{t}=y)$ equals the expectation of the change in the outcome conditional on the lagged outcome in the control group. Lemma 1 suggests that the relative magnitude of $\tilde{\mu}_{0,\text{DID}}$ and $\tilde{\mu}_{0,\text{LDV}}$ depends on (a) the expectation of the before–after difference $Y_{t+1}-Y_{t}$ conditional on $Y_{t}$ in the control group, and (b) the difference between the distributions of the before outcome $Y_{t}$ in the treated and control groups. Both are important characteristics of the underlying data generating process: they measure (a) the dependence of the outcome on the lagged outcome and (b) the dependence of the treatment assignment on the lagged outcome, respectively. In particular, if $Y_{t}\bot \!\!\!\bot G$ or equivalently $F_{Y_{t}}(y\mid G=1)=F_{Y_{t}}(y\mid G=0)$, then $\tilde{\mu}_{0,\text{LDV}}=\tilde{\mu}_{0,\text{DID}}.$

To reach the main conclusion, we introduce two additional conditions regarding the quantities in Lemma 1. The first is a stationarity condition on the outcome.

#### Condition 1 (Stationarity).

$\partial E(Y_{t+1}\mid G=0,Y_{t}=y)/\partial y<1$ for all $y.$

In a linear model for $E(Y_{t+1}\mid G=0,Y_{t}=y)$, Condition 1 requires that, in the control group, the regression coefficient of the outcome $Y_{t+1}$ on the lagged outcome $Y_{t}$ is smaller than $1$; this is also invoked by Angrist and Pischke (2009). Its sample version is $\hat{\beta}<1$ or $\hat{\beta}^{\prime }<1$ as in Section 3.1. In general, Condition 1 ensures that the time series of the outcomes does not grow without bound over time, which is reasonable in most applications.

The second condition describes the treatment assignment mechanism with respect to the lagged outcome, with two opposite versions.

#### Condition 2 (Stochastic Monotonicity).

(a) $F_{Y_{t}}(y\mid G=1)\geqslant F_{Y_{t}}(y\mid G=0)$ for all $y$; (b) $F_{Y_{t}}(y\mid G=1)\leqslant F_{Y_{t}}(y\mid G=0)$ for all $y$.

Condition 2(a) means that the lagged outcome in the treated group is stochastically smaller than that in the control group, and Condition 2(b) means the opposite. In the case of linear models, Condition 2(a) or (b) reduces to the requirement that the *average* lagged outcome in the treated group is smaller or larger than that in the control group, respectively.

Because they involve only observed variables, Conditions 1 and 2 are testable empirically. Specifically, to check Condition 1, we can estimate the derivative of the conditional mean function $E(Y_{t+1}\mid G=0,Y_{t}=y)$; to check Condition 2, we can visually compare the empirical cumulative distribution functions of the lagged outcome $Y_{t}$ in the treatment and control groups. These conditions hold in many applications, e.g., in the examples in Section 4. In contrast, Assumptions 1 and 2 are in general untestable.
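
A rough way to carry out these checks numerically is sketched below. The function names, the use of a linear fit as a proxy for the derivative in Condition 1, and the toy data are all assumptions made for illustration; a flexible regression or a plot of the two empirical distribution functions would serve the same purpose.

```python
import numpy as np

def check_condition_2(y_t, g):
    """Compare the empirical CDFs of Y_t in the two groups on the pooled support.

    Returns "2(a)" if the treated ECDF is everywhere >= the control ECDF,
    "2(b)" if it is everywhere <=, and None if the ECDFs cross.
    (Identical ECDFs satisfy both versions; "2(a)" is returned in that case.)
    """
    y_t = np.asarray(y_t, dtype=float)
    g = np.asarray(g).astype(bool)
    grid = np.unique(y_t)
    ecdf_treated = np.array([(y_t[g] <= y).mean() for y in grid])
    ecdf_control = np.array([(y_t[~g] <= y).mean() for y in grid])
    if np.all(ecdf_treated >= ecdf_control):
        return "2(a)"
    if np.all(ecdf_treated <= ecdf_control):
        return "2(b)"
    return None

def control_slope(y_t, y_t1, g):
    """Linear proxy for Condition 1: OLS slope of Y_{t+1} on Y_t among controls."""
    y_t = np.asarray(y_t, dtype=float)
    y_t1 = np.asarray(y_t1, dtype=float)
    g = np.asarray(g).astype(bool)
    return np.polyfit(y_t[~g], y_t1[~g], 1)[0]

# Hypothetical toy data: four treated units followed by four controls.
y_t = [1, 2, 2, 3, 2, 3, 4, 5]
y_t1 = [3.0, 3.0, 4.0, 4.0, 2.0, 2.5, 3.0, 3.5]
g = [1, 1, 1, 1, 0, 0, 0, 0]
which = check_condition_2(y_t, g)
slope = control_slope(y_t, y_t1, g)
print(which, slope)
```

In this toy example the treated units have stochastically smaller lagged outcomes and the control-group slope is below one, so Conditions 1 and 2(a) both appear to hold.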

Under Conditions 1 and 2, we have the following result on the bracketing relationship between $\tilde{\tau}_{\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}$ in a nonparametric setting; see the Appendix for the proof.

Theorem 1. If Conditions 1 and 2(a) hold, then $\tilde{\mu}_{0,\text{DID}}\leqslant \tilde{\mu}_{0,\text{LDV}}$, and thus $\tilde{\tau}_{\text{DID}}\geqslant \tilde{\tau}_{\text{LDV}}$ and $\tilde{\gamma}_{\text{DID}}\geqslant \tilde{\gamma}_{\text{LDV}}$; if Conditions 1 and 2(b) hold, then $\tilde{\mu}_{0,\text{DID}}\geqslant \tilde{\mu}_{0,\text{LDV}}$, and thus $\tilde{\tau}_{\text{DID}}\leqslant \tilde{\tau}_{\text{LDV}}$ and $\tilde{\gamma}_{\text{DID}}\leqslant \tilde{\gamma}_{\text{LDV}}$.

Theorem 1 is a result on the relative magnitude of the two quantities $\tilde{\tau}_{\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}$ (and of $\tilde{\gamma}_{\text{DID}}$ and $\tilde{\gamma}_{\text{LDV}}$). On the one hand, Theorem 1 holds without requiring either Assumption 1 or 2. Specifically, under Stationarity and Stochastic Monotonicity (a), $\tilde{\tau}_{\text{DID}}$ is larger than or equal to $\tilde{\tau}_{\text{LDV}}$. Both of them can be biased for the true causal effect $\tau_{\text{ATT}}$: if $\tilde{\tau}_{\text{DID}}\geqslant \tilde{\tau}_{\text{LDV}}\geqslant \tau_{\text{ATT}}$, then $\tilde{\tau}_{\text{DID}}$ overestimates $\tau_{\text{ATT}}$ more than $\tilde{\tau}_{\text{LDV}}$; if $\tau_{\text{ATT}}\geqslant \tilde{\tau}_{\text{DID}}\geqslant \tilde{\tau}_{\text{LDV}}$, then $\tilde{\tau}_{\text{LDV}}$ underestimates $\tau_{\text{ATT}}$ more than $\tilde{\tau}_{\text{DID}}$; if $\tilde{\tau}_{\text{DID}}\geqslant \tau_{\text{ATT}}\geqslant \tilde{\tau}_{\text{LDV}}$, then $\tilde{\tau}_{\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}$ are the upper and lower bounds on $\tau_{\text{ATT}}$. Analogous arguments apply under Stationarity and Stochastic Monotonicity (b). On the other hand, only under Assumption 1 or 2 do the quantities $\tilde{\tau}_{\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}$ have the interpretation as nonparametric identification formulas of the causal estimand $\tau_{\text{ATT}}$. We stress that the bracket $(\tilde{\tau}_{\text{DID}},\tilde{\tau}_{\text{LDV}})$ provides bounds for the true effect $\tau_{\text{ATT}}$ if either Assumption 1 or 2 holds; however, when neither Assumption 1 nor 2 holds, it does not tell us whether the true effect falls inside the bracket or, if outside, on which side. The relationship in that scenario depends on the true data generating model.

For discrete outcomes, Equation (10) reduces to $\tilde{\mu}_{0,\text{LDV}}=\sum _{y}E(Y_{t+1}\mid G=0,Y_{t}=y)\text{pr}(Y_{t}=y\mid G=1)$, and the stationarity condition becomes $E(Y_{t+1}\mid G=0,Y_{t}=y+1)-E(Y_{t+1}\mid G=0,Y_{t}=y)<1$ for all values of $y$. For a binary outcome, the stationarity condition always holds because $0\leqslant E(Y_{t+1}\mid G=0,Y_{t}=y)\leqslant 1$ for $y=0,1$. Therefore, we only need to check the sign of the empirical counterpart of $\text{pr}(Y_{t}=0\mid G=1)-\text{pr}(Y_{t}=0\mid G=0)$. Specifically, if $\text{pr}(Y_{t}=0\mid G=1)\geqslant \text{pr}(Y_{t}=0\mid G=0)$, then $\tilde{\tau}_{\text{DID}}\geqslant \tilde{\tau}_{\text{LDV}}$ and $\tilde{\gamma}_{\text{DID}}\geqslant \tilde{\gamma}_{\text{LDV}}$; if $\text{pr}(Y_{t}=0\mid G=1)\leqslant \text{pr}(Y_{t}=0\mid G=0)$, then $\tilde{\tau}_{\text{DID}}\leqslant \tilde{\tau}_{\text{LDV}}$ and $\tilde{\gamma}_{\text{DID}}\leqslant \tilde{\gamma}_{\text{LDV}}$.
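
For a binary outcome, both identification formulas can be evaluated directly from cross-tabulated counts. The sketch below uses hypothetical counts (not the data of Section 4) and also evaluates the difference $\sum_{y}\Delta(y)\{\text{pr}(Y_{t}=y\mid G=0)-\text{pr}(Y_{t}=y\mid G=1)\}$ with $\Delta(y)=E(Y_{t+1}\mid G=0,Y_{t}=y)-y$, which is the discrete form of the comparison in Lemma 1.

```python
import numpy as np

# Hypothetical counts. Rows index Y_t in {0, 1}; columns index Y_{t+1} in {0, 1}.
counts_control = np.array([[60.0, 20.0],
                           [10.0, 10.0]])
# For mu_0, only the distribution of Y_t is needed in the treated group.
treated_yt_counts = np.array([30.0, 10.0])
values = np.array([0.0, 1.0])

p_yt_control = counts_control.sum(axis=1) / counts_control.sum()  # pr(Y_t=y | G=0)
p_yt_treated = treated_yt_counts / treated_yt_counts.sum()        # pr(Y_t=y | G=1)
m = counts_control @ values / counts_control.sum(axis=1)          # E(Y_{t+1} | G=0, Y_t=y)

# Ignorability: average the control-group regression over the treated Y_t distribution.
mu0_ldv = m @ p_yt_treated
# Parallel trends: treated before-mean plus the control-group change.
mu0_did = values @ p_yt_treated + m @ p_yt_control - values @ p_yt_control

# Discrete version of the difference in Lemma 1.
delta = m - values
lemma_rhs = delta @ p_yt_control - delta @ p_yt_treated
print(mu0_did, mu0_ldv, lemma_rhs)
```

In these hypothetical counts, $\widehat{\text{pr}}(Y_{t}=0\mid G=1)=0.75$ is below $\widehat{\text{pr}}(Y_{t}=0\mid G=0)=0.80$, which is the Condition 2(b) direction, and the computed values satisfy $\hat{\mu}_{0,\text{DID}}\geqslant \hat{\mu}_{0,\text{LDV}}$ as Theorem 1 predicts.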

### 3.3 Semiparametric Bracketing Relationship

Under the parallel trends Assumption 1, Abadie (2005) proposed a semiparametric inverse probability weighting estimator for $\tau_{\text{ATT}}$ based on the following identification formula of $\mu_{0}$:

$$\tilde{\mu}_{0,\text{DID}}^{\prime }=E\left\{GY_{t}+\frac{e(1-G)(Y_{t+1}-Y_{t})}{1-e}\right\}\Big/\text{pr}(G=1),\tag{11}$$

where the propensity score is defined as $e=\text{pr}(G=1)$. The estimator of Abadie (2005) based on $\tilde{\mu}_{0,\text{DID}}^{\prime }$ shares the same form as the inverse probability weighting estimator for the ATT in the cross-sectional setting, but replaces the outcome by the before–after difference $Y_{t+1}-Y_{t}$. Similarly, under Assumption 2, we can construct a semiparametric estimator based on

$$\tilde{\mu}_{0,\text{LDV}}^{\prime }=E\left\{\frac{e(Y_{t})}{1-e(Y_{t})}(1-G)Y_{t+1}\right\}\Big/\text{pr}(G=1),\tag{12}$$

where the propensity score is defined as $e(Y_{t})=\text{pr}(G=1\mid Y_{t})$.

Because (11) and (12) are alternative identification formulas for $\mu_{0}$, we can show that $(\tilde{\mu}_{0,\text{DID}},\tilde{\mu}_{0,\text{LDV}})=(\tilde{\mu}_{0,\text{DID}}^{\prime },\tilde{\mu}_{0,\text{LDV}}^{\prime })$ and thus have the following corollary of Theorem 1.

Corollary 1. Theorem 1 holds if $(\tilde{\mu}_{0,\text{DID}},\tilde{\mu}_{0,\text{LDV}})$ are replaced by $(\tilde{\mu}_{0,\text{DID}}^{\prime },\tilde{\mu}_{0,\text{LDV}}^{\prime })$.

Corollary 1 shows that the bracketing relationship between $\tilde{\tau}_{\text{DID}}$ and $\tilde{\tau}_{\text{LDV}}$ does not depend on the forms of identification formulas and estimators.
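
The equivalence behind Corollary 1 can be illustrated numerically when $Y_{t}$ is discrete, so that all conditional quantities are simple cell means. The weighting expressions coded below, $E\{GY_{t}+e(1-G)(Y_{t+1}-Y_{t})/(1-e)\}/\text{pr}(G=1)$ and $E\{e(Y_{t})(1-G)Y_{t+1}/(1-e(Y_{t}))\}/\text{pr}(G=1)$, are written under the assumption that these are the weighting forms intended; the simulated data and variable names are illustrative.

```python
import numpy as np

# Simulated panel with a discrete lagged outcome whose distribution differs by group.
rng = np.random.default_rng(1)
n = 5000
g = rng.integers(0, 2, size=n)
y_t = rng.binomial(3, np.where(g == 1, 0.3, 0.5))
y_t1 = rng.binomial(3, 0.2 + 0.1 * y_t) + g

treated, control = g == 1, g == 0
e = treated.mean()  # sample analog of pr(G=1)
levels = np.unique(y_t)

# Outcome-regression forms of the two identification formulas.
mu0_did = y_t[treated].mean() + (y_t1[control] - y_t[control]).mean()
m = {y: y_t1[control & (y_t == y)].mean() for y in levels}  # E(Y_{t+1} | G=0, Y_t=y)
mu0_ldv = np.mean([m[y] for y in y_t[treated]])

# Weighting forms, with propensity scores estimated by sample proportions.
mu0_did_w = np.mean(g * y_t + e * (1 - g) * (y_t1 - y_t) / (1 - e)) / e
e_y = {y: treated[y_t == y].mean() for y in levels}         # pr(G=1 | Y_t=y)
e_vec = np.array([e_y[y] for y in y_t])
mu0_ldv_w = np.mean(e_vec / (1 - e_vec) * (1 - g) * y_t1) / e
print(mu0_did, mu0_did_w, mu0_ldv, mu0_ldv_w)
```

With saturated (cell-mean) estimates of the conditional mean and the propensity score, each weighting form reproduces its outcome-regression counterpart up to floating-point error, so the ordering of the two estimates is unchanged.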

## 4 Examples

### 4.1 Minimum Wages and Employment

We reanalyze part of the classic Card and Krueger (1994) study on the effect of a minimum wage increase on employment. Employment data were collected at fast food restaurants in New Jersey and Pennsylvania before and after a minimum wage increase in New Jersey in 1992. The outcome is the number of full-time-equivalent employees at each restaurant.

The difference-in-differences estimate is $\hat{\tau}_{\text{DID}}=2.446$, and the lagged-dependent-variable adjustment estimates are $\hat{\tau}_{\text{LDV}}=0.302$ and $\hat{\tau}_{\text{LDV}}^{\prime }=0.865$ with coefficients of the lagged outcome $\hat{\beta}=0.288<1$ and $\hat{\beta}^{\prime }=0.475<1$. Meanwhile, because the sample means satisfy $\bar{Y}_{1,t}-\bar{Y}_{0,t}=17.289-20.299<0$, our theoretical result predicts that $\hat{\tau}_{\text{DID}}>\hat{\tau}_{\text{LDV}}$ (or $\hat{\tau}_{\text{LDV}}^{\prime }$), which exactly matches the relative magnitude of the empirical estimates above. In addition, if we adopt a quadratic specification of $E(Y_{t+1}\mid G=0,Y_{t})$, the lagged-dependent-variable regression estimate becomes $\hat{\tau}_{\text{LDV}}=0.275$, which is also smaller than $\hat{\tau}_{\text{DID}}$. This is again coherent with our theory because Stationarity and Stochastic Monotonicity hold, as depicted in Figure 1. In this example, the differences between $\hat{\tau}_{\text{DID}}$ and $\hat{\tau}_{\text{LDV}}$ (or $\hat{\tau}_{\text{LDV}}^{\prime }$) are significant at level $0.05.$
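
As a quick arithmetic check, the reported numbers can be compared with the identity $\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}=(\hat{\beta}-1)(\bar{Y}_{1,t}-\bar{Y}_{0,t})$, which follows from the least-squares intercept identity in the Appendix proof of Proposition 1. The sketch only plugs in the rounded values quoted in the text, so agreement is expected only up to rounding; the variable names are illustrative.

```python
# Rounded estimates as reported in the text above.
tau_did = 2.446
tau_ldv, beta_hat = 0.302, 0.288
tau_ldv_prime, beta_prime = 0.865, 0.475
gap_before = 17.289 - 20.299  # mean lagged outcome: treated minus control

# Differences implied by the identity versus differences of the reported estimates.
implied = (beta_hat - 1) * gap_before
implied_prime = (beta_prime - 1) * gap_before
reported = tau_did - tau_ldv
reported_prime = tau_did - tau_ldv_prime
print(round(implied, 3), round(reported, 3))
print(round(implied_prime, 3), round(reported_prime, 3))
```

Both implied differences agree with the reported ones to within about 0.001, which is the precision of the rounded inputs.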

### 4.2 Electoral Returns to Beneficial Policy

We reanalyze the Bechtel and Hainmueller (2011) study on electoral returns to beneficial policy. We focus on the short-term electoral returns by analyzing the causal effect of disaster relief aid due to the 2002 Elbe flooding in Germany. The before period is 1998 and the after period is 2002. The units of analysis are electoral districts, the treatment is the indicator of whether a district was affected by the flood, and the outcome is the vote share that the Social Democratic Party attains in that district.

The difference-in-differences estimate is $\hat{\tau}_{\text{DID}}=7.144$, and the lagged-outcome adjustment estimates are $\hat{\tau}_{\text{LDV}}=7.160$ and $\hat{\tau}_{\text{LDV}}^{\prime }=7.121$ with coefficients of the lagged outcome $\hat{\beta}=1.002>1$ and $\hat{\beta}^{\prime }=0.997<1$. The relative magnitudes match our theory in Section 3.1. However, these estimates are almost identical because the coefficients of $Y_{t}$ are extremely close to $1$. In this example, even though the empirical distributions $F_{Y_{t}}(y\mid G=1)$ and $F_{Y_{t}}(y\mid G=0)$ differ significantly as Figure 2 shows, the analysis is not sensitive to the choice between the difference-in-differences and lagged-dependent-variable adjustment estimates.

### 4.3 A Traffic Safety Intervention on Crashes

Outside the political science literature, the before–after treatment–control design is the state-of-the-art method in traffic safety evaluations (Hauer Reference Hauer1997), where count outcomes are common. Here we provide an example of evaluating the effects of rumble strips on vehicle crashes. Crash counts were collected on $n=1986$ road segments in Pennsylvania before (year 2008) and after (year 2012) the rumble strips were installed in 331 segments between year 2008 to 2012. The control group consists of 1655 sites matched to the treated sites on covariates including past accident counts, road characteristics, traffic volume. Table 1 presents the crash counts classified by $Y_{t}$ and $Y_{t+1}$ for control and treatment groups, respectively.

We first examine the dichotomized outcome of whether there has been at least one crash in that site. As noted after Theorem 1, Condition 1 automatically holds for a binary outcome. We can verify that Condition 2(a) holds because the empirical means suggest $\widehat{\text{pr}}(Y_{t}=0\mid G=1)-\widehat{\text{pr}}(Y_{t}=0\mid G=0)=232/331-1102/1655=0.701-0.666>0$ . Therefore, applying Theorem 1, we predict that $\tilde{\unicode[STIX]{x1D70F}}_{\text{DID}}>\tilde{\unicode[STIX]{x1D70F}}_{\text{LDV}}$ and $\tilde{\unicode[STIX]{x1D6FE}}_{\text{DID}}>\tilde{\unicode[STIX]{x1D6FE}}_{\text{LDV}}$ . Now we calculate the nonparametric estimate of $\unicode[STIX]{x1D707}_{0}$ under ignorability to be $\hat{\unicode[STIX]{x1D707}}_{0,\text{LDV}}=\sum _{y=0,1}\hat{E}(Y_{t+1}\mid G=0,Y_{t}=y)\widehat{\text{pr}}(Y_{t}=y\mid G=1)=0.324$ , and under parallel trends to be $\hat{\unicode[STIX]{x1D707}}_{0,\text{DID}}=0.294$ . Therefore, the empirical estimates suggest $\hat{\unicode[STIX]{x1D70F}}_{\text{DID}}>\hat{\unicode[STIX]{x1D70F}}_{\text{LDV}}$ and $\hat{\unicode[STIX]{x1D6FE}}_{\text{DID}}>\hat{\unicode[STIX]{x1D6FE}}_{\text{LDV}}$ , which matches the theoretical prediction.

We then examine the original count outcome in Table 1. The sample means $\hat{E}(Y_{t+1}\mid G=0,Y_{t}=y)$ are $0.374,0.572,0.670,0.660$ for $y=0,1,2,3+$ , respectively. Therefore, Condition 1 holds for all $y$ . We can also verify that Condition 2(a) holds because the sample probabilities are $\widehat{\text{pr}}(Y_{t}\leqslant y\mid G=1)=0.700,0.909,0.973$ and $\widehat{\text{pr}}(Y_{t}\leqslant y\mid G=0)=0.666,0.898,0.968$ for $y=0,1,2$ , respectively. Therefore, applying Theorem 1, we predict that $\tilde{\unicode[STIX]{x1D70F}}_{\text{DID}}>\tilde{\unicode[STIX]{x1D70F}}_{\text{LDV}}$ and $\tilde{\unicode[STIX]{x1D6FE}}_{\text{DID}}>\tilde{\unicode[STIX]{x1D6FE}}_{\text{LDV}}$ . Now we calculate the nonparametric estimate of $\unicode[STIX]{x1D707}_{0}$ under ignorability to be $\hat{\unicode[STIX]{x1D707}}_{0,\text{LDV}}=0.438$ , and under parallel trends to be $\hat{\unicode[STIX]{x1D707}}_{0,\text{DID}}=0.395$ . Therefore, the empirical estimates suggest $\hat{\unicode[STIX]{x1D70F}}_{\text{DID}}>\hat{\unicode[STIX]{x1D70F}}_{\text{LDV}}$ and $\hat{\unicode[STIX]{x1D6FE}}_{\text{DID}}>\hat{\unicode[STIX]{x1D6FE}}_{\text{LDV}}$ , which matches the theoretical prediction.
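
The two condition checks for the count outcome amount to simple comparisons of the summary statistics quoted above. The sketch below plugs in those reported values only (Table 1 itself is not used); the variable names are illustrative, and the top category $3+$ is treated as a single level, which is an approximation.

```python
# Reported sample means E-hat(Y_{t+1} | G=0, Y_t=y) for y = 0, 1, 2, 3+.
m = [0.374, 0.572, 0.670, 0.660]
# Reported empirical CDFs of Y_t at y = 0, 1, 2 in each group.
cdf_treated = [0.700, 0.909, 0.973]
cdf_control = [0.666, 0.898, 0.968]

# Condition 1 (discrete form): successive increments of the conditional mean below 1.
increments = [b - a for a, b in zip(m, m[1:])]
condition_1 = all(d < 1 for d in increments)
# Condition 2(a): treated CDF at least as large as the control CDF at every y.
condition_2a = all(t >= c for t, c in zip(cdf_treated, cdf_control))

# Reported estimates of mu_0 under parallel trends and under ignorability.
mu0_did, mu0_ldv = 0.395, 0.438
print(condition_1, condition_2a, mu0_did < mu0_ldv)
```

All three comparisons come out true, in line with the ordering that Theorem 1 predicts under Conditions 1 and 2(a).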

In this example, the differences between the $\hat{\gamma}_{\text{DID}}$’s and the $\hat{\gamma}_{\text{LDV}}$’s are not significant at the $0.05$ level.

## 5 Discussion

We established a model-free bracketing relationship between the difference-in-differences and lagged-dependent-variable adjustment estimators in the canonical two-period two-group setting. In practice, we cannot use the data to validate the assumptions that justify these approaches. Therefore, a practical suggestion is to report results from both approaches and ideally to conduct sensitivity analyses allowing for violations of these assumptions.

Several directions are worth investigating. First, in the setting with $K+1$ time periods, we may consider a model that incorporates both Models (7) and (8): $E(Y_{i,T}\mid X_{i},Y_{i,T-1},G_{i})=\alpha_{i}+\lambda_{T}+\beta Y_{i,T-1}+\tau G_{i}+\theta^{\mathsf{T}}X_{i}$ for $T=t+1,\ldots ,t+K$. However, Nickell (1981) and Angrist and Pischke (2009, Section 5.3) pointed out that identification and estimation under this model require much stronger assumptions. It is of interest to extend the bracketing relationship to this setting. Second, we focused on the average treatment effect on the treated; we can extend the result to other types of outcomes (e.g., categorical and ordinal) for which the averages are less interpretable (Lu, Ding, and Dasgupta 2018).

## Appendix Proofs

## Proof of Proposition 1.

First, the ordinary least squares fit $\hat{E}(Y_{t+1}\mid G=0,Y_{t}=y)=\hat{\alpha}+\hat{\beta}Y_{t}$ using the control units must satisfy $\hat{\alpha}=\bar{Y}_{0,t+1}-\hat{\beta}\bar{Y}_{0,t}$. Therefore,
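Under the assumption that $\hat{\mu}_{0,\text{LDV}}=\hat{\alpha}+\hat{\beta}\bar{Y}_{1,t}$ is the control fit evaluated at the treated group's pre-period mean, that $\hat{\tau}_{\text{LDV}}=\bar{Y}_{1,t+1}-\hat{\mu}_{0,\text{LDV}}$, and that $\hat{\tau}_{\text{DID}}=(\bar{Y}_{1,t+1}-\bar{Y}_{1,t})-(\bar{Y}_{0,t+1}-\bar{Y}_{0,t})$, substituting the expression for $\hat{\alpha}$ gives, as a sketch of this step,

$$
\begin{aligned}
\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}
&=\hat{\mu}_{0,\text{LDV}}-(\bar{Y}_{1,t}+\bar{Y}_{0,t+1}-\bar{Y}_{0,t})\\
&=\bar{Y}_{0,t+1}+\hat{\beta}(\bar{Y}_{1,t}-\bar{Y}_{0,t})-\bar{Y}_{1,t}-\bar{Y}_{0,t+1}+\bar{Y}_{0,t}\\
&=(\hat{\beta}-1)(\bar{Y}_{1,t}-\bar{Y}_{0,t}).
\end{aligned}
$$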

Second, the coefficient $\hat{\tau}_{\text{LDV}}^{\prime }$ in the ordinary least squares fit $\hat{E}(Y_{t+1}\mid G,Y_{t})=\hat{\alpha}+\hat{\tau}_{\text{LDV}}^{\prime }G+\hat{\beta}^{\prime }Y_{t}$ using all units equals the difference-in-means of $Y_{i,t+1}-\hat{\alpha}-\hat{\beta}^{\prime }Y_{it}$ in the treated and control groups. Therefore,
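Taking $\hat{\tau}_{\text{DID}}=(\bar{Y}_{1,t+1}-\bar{Y}_{1,t})-(\bar{Y}_{0,t+1}-\bar{Y}_{0,t})$ as an assumption about the earlier notation, the difference-in-means characterization leads to the following sketch of this step:

$$
\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}^{\prime }
=(\hat{\beta}^{\prime }-1)(\bar{Y}_{1,t}-\bar{Y}_{0,t}).
$$

The same algebra applied to the control-only fit gives $\hat{\tau}_{\text{DID}}-\hat{\tau}_{\text{LDV}}=(\hat{\beta}-1)(\bar{Y}_{1,t}-\bar{Y}_{0,t})$, where $\hat{\tau}_{\text{LDV}}=\bar{Y}_{1,t+1}-\hat{\alpha}-\hat{\beta}\bar{Y}_{1,t}$ is again an assumed definition. A numerical check of both identities on simulated data follows; the data-generating process and all variable names are hypothetical.

```python
import numpy as np

# Simulated two-period, two-group data (purely illustrative).
rng = np.random.default_rng(0)
n = 200
g = rng.integers(0, 2, size=n)                 # group indicator
y_pre = rng.normal(loc=0.5 * g, size=n)        # Y_t
y_post = 0.6 * y_pre + 0.3 * g + rng.normal(size=n)  # Y_{t+1}

t, c = g == 1, g == 0
gap_pre = y_pre[t].mean() - y_pre[c].mean()

# Difference-in-differences estimate.
tau_did = (y_post[t].mean() - y_pre[t].mean()) - (y_post[c].mean() - y_pre[c].mean())

# LDV estimate from the control-only fit, evaluated at the treated pre-period mean.
Xc = np.column_stack([np.ones(c.sum()), y_pre[c]])
alpha, beta = np.linalg.lstsq(Xc, y_post[c], rcond=None)[0]
tau_ldv = y_post[t].mean() - (alpha + beta * y_pre[t].mean())

# LDV estimate from the pooled fit on (1, G, Y_t).
X = np.column_stack([np.ones(n), g, y_pre])
_, tau_ldv_pooled, beta_pooled = np.linalg.lstsq(X, y_post, rcond=None)[0]

# Both gaps equal (slope - 1) times the pre-period gap in means.
print(np.isclose(tau_did - tau_ldv, (beta - 1) * gap_pre))
print(np.isclose(tau_did - tau_ldv_pooled, (beta_pooled - 1) * gap_pre))
```

Both comparisons hold up to floating-point error for any data set in which the two regressions are well defined, since they are algebraic identities of least squares rather than properties of the simulated model.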

## Proof of Lemma 1.

The conclusion follows from the law of total probability. We can write $\tilde{\mu}_{0,\text{LDV}}-\tilde{\mu}_{0,\text{DID}}$ as
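A sketch of this expression, assuming $\tilde{\mu}_{0,\text{LDV}}=E\{E(Y_{t+1}\mid G=0,Y_{t})\mid G=1\}$, $\tilde{\mu}_{0,\text{DID}}=E(Y_{t}\mid G=1)+E(Y_{t+1}\mid G=0)-E(Y_{t}\mid G=0)$, and $\Delta(y)=E(Y_{t+1}\mid G=0,Y_{t}=y)-y$ (the last being consistent with the derivative used in the proof of Theorem 1):

$$
\begin{aligned}
\tilde{\mu}_{0,\text{LDV}}-\tilde{\mu}_{0,\text{DID}}
&=E\{E(Y_{t+1}\mid G=0,Y_{t})-Y_{t}\mid G=1\}-E\{E(Y_{t+1}\mid G=0,Y_{t})-Y_{t}\mid G=0\}\\
&=E\{\Delta(Y_{t})\mid G=1\}-E\{\Delta(Y_{t})\mid G=0\},
\end{aligned}
$$

where the first line uses $E(Y_{t+1}\mid G=0)=E\{E(Y_{t+1}\mid G=0,Y_{t})\mid G=0\}$.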

## Proof of Theorem 1.

The proof relies on a lemma on stochastic ordering in Shaked and Shanthikumar (2007). Specifically, for two random variables $A$ and $B$, $\text{pr}(A\leqslant x)\geqslant \text{pr}(B\leqslant x)$ for all $x$ if and only if $E\{u(A)\}\geqslant E\{u(B)\}$ for all nonincreasing functions $u(\cdot )$.

Under Condition 1, we have $\partial \Delta(y)/\partial y=\partial E(Y_{t+1}\mid G=0,Y_{t}=y)/\partial y-1<0$, i.e., $\Delta(y)$ is a nonincreasing function of $y$. Therefore, combining Lemma 1 with the stochastic ordering result above, applied to $u=\Delta$ and the distributions of $Y_{t}$ in the treated and control groups, Condition 2(a) implies $\tilde{\mu}_{0,\text{DID}}\leqslant \tilde{\mu}_{0,\text{LDV}}$, and Condition 2(b) implies $\tilde{\mu}_{0,\text{DID}}\geqslant \tilde{\mu}_{0,\text{LDV}}$.◻