How should we estimate inverse probability weights with possibly misspecified propensity score models?

Inverse probability weighting is a common remedy for missing data problems, notably in causal inference. Despite its prevalence, practical application is prone to bias from propensity score model misspecification. Recently proposed methods try to rectify this by balancing some moments of covariates between the target and weighted groups. Yet, bias persists without knowledge of the true outcome model. Drawing inspiration from the quasi maximum likelihood estimation with misspecified statistical models, I propose an estimation method that minimizes a distance between the true and estimated weights under possibly misspecified models. This novel approach mitigates bias and controls the mean squared error by minimizing their upper bounds. As an empirical application, it gives new insights into the study of foreign occupation and insurgency in France.


Introduction
Causal inference in social sciences stands out as an increasingly attractive field. Currently, it is acknowledged as a missing data problem, where one of the potential outcomes is missing for each unit (Little and Rubin, 2019). This problem is widely addressed by inverse probability weighting (IPW) based on propensity score estimation under the conditional ignorability assumption. In a review of papers in the American Journal of Political Science, the American Political Science Review, and the Journal of Politics from 2000 to 2022, I found that 35 articles employ the IPW. Notably, 30 of them appear in the past five years, making it the most widely used method in each of those five years among the popular weighting or matching methods, including entropy balancing (Hainmueller, 2012), propensity score matching, and genetic matching (Diamond and Sekhon, 2013). Despite its increasing popularity, however, it is vulnerable to propensity score model misspecification (Kang and Schafer, 2007; Imai and Ratkovic, 2014), which is rather common in practice.
In this paper, I propose an estimation method for the IPW with a misspecified propensity score model that mitigates bias and controls the mean squared error (MSE) by minimizing the imbalance in the multivariate covariate distribution rather than only specific moments. The key idea is to use estimating equations optimized for estimating the (inverse probability) weights rather than the treatment assignment or propensity scores. Specifically, it estimates propensity scores by minimizing the Kullback-Leibler (KL) divergence between the true and estimated weights, which directly quantifies the quality of the IPW estimation. This idea builds upon the quasi maximum likelihood estimation (MLE) with misspecified models, which minimizes the KL divergence between the true and estimated distributions (Akaike, 1973; White, 1982). I demonstrate that the KL divergence between the weights can be calculated up to a constant without requiring knowledge of the true weights or propensity scores, and that the proposed method mitigates bias and controls the MSE of the parameters of interest by minimizing their upper bounds (Sections 3.3 and 3.5).
Figure 1 provides an overview of the proposed method, summarizing its two main goals and the two intermediate goals associated with each, along with the propositions establishing the theoretical results. The first main goal is (a) mitigating bias of the parameter of interest. The associated intermediate goals are (a-1) minimizing the imbalance in the multivariate covariate distribution between the target group and the weighted group and (a-2) minimizing an upper bound of the bias. The second main goal is (b) controlling the MSE of the parameter of interest. The associated intermediate goals are (b-1) minimizing an upper bound of the relative error of the (inverse probability) weights and (b-2) minimizing an upper bound of the MSE. Section 2 presents a numerical example to illustrate that the proposed method achieves these goals better than the MLE.
The proposed method has several additional attractive characteristics. First, it can incorporate the outcome regression model as a substitute for the mean balancing condition (Section 4.3). This allows it to remove bias under the same conditions as the recently proposed covariate balancing methods such as the covariate balancing propensity scores (CBPS) (Imai and Ratkovic, 2014), the bias-reduced doubly robust estimator (Vermeulen and Vansteelandt, 2015), calibrated weighting (Tan, 2020), and entropy balancing (Hainmueller, 2012). Second, like the CBPS, it models propensity scores explicitly and estimates them consistently when the model is correctly specified. While weight calibration methods such as entropy balancing have an implicit propensity score model (Hainmueller, 2012; Zhao and Percival, 2017; Wang and Zubizarreta, 2020), it is easier for empirical researchers to build a propensity score model by considering the treatment assignment mechanism with their domain knowledge than to incorporate it implicitly through moment conditions. Third, as a direct consequence of these two characteristics, it can become a doubly robust (DR) estimator (Section 4.3) (Seaman and Vansteelandt, 2018). It is consistent if either the propensity score model or the outcome model is correct, and it attains the semiparametric efficiency bound if both models are correct (Robins et al., 1994).
Despite its theoretical attractiveness, the proposed method has some difficulties in implementation. First, it is not sample-bounded because the estimated weights may not sum to one; second, it is likely to overfit extreme data points; and third, it requires non-convex optimization. I address the first problem by imposing the normalization constraint and deriving the unconstrained optimization problem by the Lagrange multiplier method (Section 4.1). I address the second problem by applying L2 regularization (Section 4.1). To address the third problem, I develop a majorization-minimization algorithm that iteratively optimizes two convex decompositions of the original function in Section 4.2 (Wu and Lange, 2010).
The proposed method is motivated by an observational study of political devolution and resistance to foreign rule in France during World War II (Ferwerda and Miller, 2014). In Section 6, I apply the proposed method to this study and uncover treatment effect heterogeneity, suggesting that spill-over effects contaminate the treatment effect. I also apply the proposed method to another empirical study in Supplementary Materials H. Section 5 demonstrates the finite-sample performance of the proposed method through simulation studies.
The proposed methodology is implemented in an open-source R package, dbw, which is available at https://github.com/hirotokatsumata/dbw and will soon be available on CRAN.

A numerical example
This section presents a numerical example illustrating that the proposed method achieves goals (a) and (b), through goals (a-1) and (b-1), better than the MLE; goals (a-2) and (b-2) are direct consequences of (a-1) and (b-1), as later shown in Propositions 2 and 4. It also provides intuition on how the proposed method achieves these goals by showing that it is optimized for estimating inverse probability weights rather than propensity scores. For simplicity, this example examines a situation with a single continuous covariate X_i following the standard normal distribution and a binary response variable R_i, indicating that the outcome Y_i is observed when R_i = 1 and missing when R_i = 0, for observations i = 1, 2, …, 1000. The propensity score model is misspecified in that it includes only a linear term for X_i, whereas the true model also includes the squared term X_i^2; specifically, the misspecified model is Pr(R_i = 1 | X_i) = exp(β_0 + β_1 X_i) / (1 + exp(β_0 + β_1 X_i)). The inverse probability weights are estimated as ŵ_i = 1 / P̂r(R_i = 1 | X_i). To consider the possible worst-case scenario for each estimator, I do not specify the data-generating process for the outcome Y_i.
a-1: Distributional imbalance in the covariate. First, I compare the performance of the proposed method with the MLE in terms of the distributional imbalance in the covariate X. A commonly used statistic for this purpose is the Kolmogorov-Smirnov (KS) statistic. The KS statistic is 0.112 for the proposed method and 0.193 for the MLE, indicating that the proposed method has a smaller covariate distributional imbalance than the MLE. To compare the distributions visually, I present a figure showing the cumulative distribution functions of X in Supplementary Material B.
b-1: Relative error of the weights. Next, I compare the proposed method with the MLE in terms of the relative error of the weights. As detailed in Section 3.5, this quantity is defined as E[(ŵ_i / w_i − 1)^2], where w_i and ŵ_i denote the true and estimated weights. The relative error of the weights is 0.38 for the proposed method and 4.14 for the MLE, indicating that the relative error of the proposed method is smaller than one-tenth that of the MLE.
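The example above can be replicated in outline. The sketch below (in Python for illustration; the paper's own implementation is the R package dbw) simulates the misspecified-model setting with assumed true coefficients, since the exact values are not reproduced here, fits the misspecified logistic model by maximum likelihood, and computes the relative error of the weights and the KS statistic for the MLE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical true model with a squared term; the paper's exact
# coefficients are not reproduced here, so these values are assumptions.
pi_true = expit(0.5 + x - x**2)
r = rng.binomial(1, pi_true)

def logistic_mle(X, r, iters=25):
    """Newton's method for the logistic MLE; X includes an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        grad = X.T @ (r - p)
        hess = -(X * (p * (1 - p))[:, None]).T @ X
        beta -= np.linalg.solve(hess, grad)
    return beta

X_mis = np.column_stack([np.ones(n), x])   # misspecified: linear term only
beta_mle = logistic_mle(X_mis, r)
w_hat = 1.0 / expit(X_mis @ beta_mle)      # estimated inverse probability weights
w_true = 1.0 / pi_true

# Relative error of the weights, E[(w_hat / w_true - 1)^2], over respondents
rel_err = np.mean((w_hat[r == 1] / w_true[r == 1] - 1.0) ** 2)

# KS statistic: target CDF of x versus the weighted respondents' CDF
grid = np.sort(x)
target_cdf = np.arange(1, n + 1) / n
wr = np.where(r == 1, w_hat, 0.0)
weighted_cdf = np.array([(wr * (x <= g)).sum() for g in grid]) / wr.sum()
ks = np.abs(weighted_cdf - target_cdf).max()
```

Because the fitted model omits the squared term, the MLE weights understate the true weights in the tails, which is what the relative-error comparison above quantifies.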

Intuition: targeting weights rather than propensity scores
To gain intuition, the left panel of Figure 2 presents the true (x-axis) and estimated (y-axis) weights for each observation. The proposed method (triangles) estimates the inverse probability weights better than the MLE (circles). Most of the triangles are within or near the narrow shaded corridor around the diagonal line, which indicates that the difference between the true and estimated weights is less than two, whereas many circles lie far from the corridor.
The right panel shows the true (x-axis) and estimated (y-axis) propensity scores for each observation. Overall, the MLE (circles) estimates propensity scores better than the proposed method (triangles), but the proposed method restricts the difference more effectively than the MLE when the difference in the resulting inverse probability weights is large, shown as the shaded area. This explains why the proposed method approximates the weights better than the MLE.
These results provide intuition about when the difference between the two estimators is large: it is when some units have large estimated weights, which is likely to happen when there are unmodeled nonlinear effects of covariates on treatment assignment, when there are unmodeled interaction effects of covariates on treatment assignment, or when the numbers of treated and control units are unbalanced. In addition, the proposed estimator is especially better than the MLE when the outcomes of these units are extreme (relative to the true average outcome), but not so much when the opposite is true. The difference in the estimated weights between the two estimators is small when no units have large estimated weights.
The root mean squared errors (RMSEs) of the estimated weights are 1.18 for the proposed method and 3.10 for the MLE, and the RMSEs of the estimated propensity scores are 0.198 for the proposed method and 0.131 for the MLE, indicating that the proposed method estimates the weights better than the MLE, whereas the MLE estimates the propensity scores better than the proposed method. For estimating the quantity of interest through inverse probability weights, the proposed method is better suited than the MLE.

Setup
Suppose that there is a random sample of n units (i = 1, 2, …, n) from a population. Each unit consists of a triplet (Y_i, R_i, X_i), where Y_i is an outcome variable, R_i is a binary response indicator variable that takes one when Y_i is observed and zero otherwise, and X_i is a vector of observed covariates whose support is denoted by 𝒳. The propensity score is defined as the conditional response probability given the observed covariates and denoted as π(x) = Pr(R_i = 1 | X_i = x). We impose the overlap assumption that the propensity score is bounded away from 0 and 1: 0 < π(x) < 1 for any x ∈ 𝒳. We also assume that the outcome is missing at random, i.e., the outcome variable is independent of the response indicator conditional on the observed covariates: Y_i ⊥ R_i | X_i (Little and Rubin, 2019). These assumptions correspond to the conditional ignorability assumption in causal inference.
For simplicity, the parameter of interest in this study is the average outcome μ = E[Y_i] for the target group comprising both R_i = 1 and R_i = 0 units; the proposed method is easily extended to other estimands such as the average treatment effect in causal inference (Kang and Schafer, 2007; Vermeulen and Vansteelandt, 2015; Zubizarreta, 2015; Wong and Chan, 2017; Seaman and Vansteelandt, 2018; Zhao, 2019; Tan, 2020; Wang and Zubizarreta, 2020).

Problems in the inverse probability weighting with the maximum likelihood estimation
The average outcome can be estimated unbiasedly and consistently by the IPW with the true propensity scores: μ̂_IPW = (1/n) Σ_{i=1}^{n} R_i w_i Y_i, where w_i = 1/π(X_i) is the true inverse probability weight for unit i. In observational studies, however, neither the true propensity scores nor the true inverse probability weights are known. In practice, researchers use a parametric propensity score model π(X_i, β), where the most common choice is the logistic regression model π(X_i, β) = exp(X_i^⊤ β) / (1 + exp(X_i^⊤ β)). Typically, the nuisance parameters β are estimated by the MLE to obtain the estimated propensity score for each unit, π̂_MLE(X_i) = π(X_i, β̂_MLE). The resulting IPW estimator with the estimated propensity scores is consistent if the propensity score model is correct.
However, if the propensity score model is misspecified, the MLE estimator of β and the resulting IPW estimator are not consistent. Generally, the limiting value of β̂_MLE is the solution to the following problem (Akaike, 1973; White, 1982; Tan, 2020):

β*_MLE = argmin_β E[ R_i log{R_i / π(X_i, β)} + (1 − R_i) log{(1 − R_i) / (1 − π(X_i, β))} ].

This implies that the MLE estimates β̂_MLE to minimize the (generalized) KL divergence between the true responses (R and 1 − R) and their estimates (π(X, β̂_MLE) and 1 − π(X, β̂_MLE)) for the target group of R_i = 1 and R_i = 0, where the (generalized) KL divergence, the most common measure of the difference between two points P = (p_1, …, p_J) and Q = (q_1, …, q_J), is defined as

D_KL(P ‖ Q) = Σ_{j=1}^{J} { p_j log(p_j / q_j) − p_j + q_j }.

The problem is that this is not the same as the KL divergence between the true and estimated inverse probability weights, although the latter determines the performance of the IPW estimator.
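As a minimal illustration of the identity underlying the IPW estimator, the following sketch (with a hypothetical data-generating process) checks that weighting by the true inverse probabilities recovers the average outcome:

```python
import numpy as np

expit = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(1)
n = 5000
x = rng.standard_normal(n)

pi = expit(0.3 + 0.8 * x)                    # hypothetical true propensity score
r = rng.binomial(1, pi)
y = 1.0 + 2.0 * x + rng.standard_normal(n)   # hypothetical outcome

# IPW with the true weights w_i = 1 / pi(X_i) is unbiased for mu = E[Y_i],
# because under missingness at random E[R_i Y_i / pi(X_i)] = E[Y_i]
w = 1.0 / pi
mu_ipw = np.mean(r * w * y)
```

Here the true average outcome is 1.0, and `mu_ipw` fluctuates around it at the usual square-root-of-n rate.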

Distribution balancing weighting
This study proposes the distribution balancing weighting (DBW), which estimates propensity scores π(X_i, β̂_DBW) that minimize the KL divergence between the true and estimated inverse probability weights. The idea of minimizing the KL divergence under possibly misspecified models originates from the quasi MLE, which minimizes the KL divergence between the true and estimated distributions when the true distribution is unknown (Akaike, 1973; White, 1982). Since the propensity scores and their inverse probability weights have a non-linear relationship, minimizing the KL divergence between the true and estimated response assignments, as the MLE does, does not in general minimize that between the inverse probability weights. The DBW instead applies the idea to the (generalized) KL divergence between the true weights w_i = 1/π(X_i) and the estimated weights w(X_i, β) = 1/π(X_i, β) over the response group (3). The DBW estimates the coefficients as the minimizers of the following loss function, which is the sample version of the KL divergence above up to a constant (the omitted constant terms involve only the unknown true weights and do not depend on β):

L_DBW(β) = (1/n) Σ_{i=1}^{n} { R_i / π(X_i, β) + log π(X_i, β) }.    (4)

As stated formally in Proposition 5, the DBW estimates the propensity scores and the resulting inverse probability weights consistently, like the MLE, when the propensity score model is correctly specified.
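For concreteness, the sketch below minimizes a candidate form of the DBW loss for a logistic propensity score model, written as L(β) = (1/n) Σ {R_i/π(X_i, β) + log π(X_i, β)}; this form is my own derivation, consistent with the calibrated-weighting-minus-MLE decomposition described in Section 4.2, and it is minimized here by plain gradient descent for simplicity rather than by the paper's algorithm:

```python
import numpy as np

expit = lambda z: 1.0 / (1.0 + np.exp(-z))

def dbw_loss(beta, X, r):
    # Sample loss, up to a constant: mean(R / pi + log pi).  This form is
    # an assumption consistent with the calibrated-weighting-minus-MLE
    # decomposition in Section 4.2, for a logistic link.
    p = expit(X @ beta)
    return np.mean(r / p + np.log(p))

def dbw_grad(beta, X, r):
    # d/dbeta of the loss: mean over i of (1 - pi)(1 - R/pi) X
    p = expit(X @ beta)
    return X.T @ ((1 - p) * (1 - r / p)) / len(r)

def fit_dbw(X, r, lr=0.05, iters=5000):
    beta = np.zeros(X.shape[1])   # simple gradient descent for illustration
    for _ in range(iters):
        beta -= lr * dbw_grad(beta, X, r)
    return beta

# With a correctly specified model, the estimating equation
# E[(1 - pi)(1 - R/pi) X] = 0 holds at the true coefficients,
# so the fit should land close to beta_true.
rng = np.random.default_rng(2)
n = 4000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.4, 0.9])
r = rng.binomial(1, expit(X @ beta_true))
beta_hat = fit_dbw(X, r)
```

Note that the estimating equation weights each unit by (1 − π)(1 − R/π), so units with small fitted propensity scores (large implied weights) dominate the fit, which is exactly the targeting-of-weights behavior described in Section 2.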

Minimizing an upper bound of bias
The main advantage of the DBW is that it also minimizes the imbalance in the multivariate covariate distribution between the target and weighted groups, as stated in Proposition 1 below; the importance of this property has been acknowledged in existing studies (Hainmueller, 2012; Zubizarreta, 2015; Li et al., 2018; Zhao, 2019). While the propensity scores estimated by the MLE minimize the distributional imbalance in the limit when the propensity score model is correctly specified, they do not when the model is misspecified. The DBW achieves this even when the model is misspecified. The following proposition summarizes this result.
Proposition 1 (Multivariate distribution balancing property): When the propensity scores and the resulting inverse probability weights are estimated using (4), the weights minimize the KL divergence for the multivariate covariate distribution between the target and weighted groups in the limit even when the propensity score model is misspecified, where β*_DBW is the probability limit of β̂_DBW, and f(x) and f_1(x) are the multivariate covariate densities for the target group (S = {i : R_i ∈ {0, 1}}) and the response group (S_1 = {i : R_i = 1}), respectively.
Proof. For any β, the stated equalities hold; applying the continuous mapping theorem then yields the result.

□
This property is advantageous because it implies that bias is bounded above as stated in the following proposition.
Proposition 2 (Minimizing a sharp upper bound of bias): Let the absolute value of the bias be denoted as B. A function of the imbalance in the multivariate covariate distribution, quantified by the KL divergence, sharply bounds the bias from above, where the last inequality follows from Pinsker's inequality, and the equalities hold when the propensity score model is correct. □

Minimizing an upper bound of the mean squared error

This section demonstrates that the proposed method minimizes an upper bound of the MSE by showing that it minimizes a sharp upper bound of the relative error of the weights (Proposition 3) and that the relative error of the weights bounds the MSE of the IPW estimator from above (Proposition 4). First, the KL divergence between the weights, which the proposed method minimizes as its loss function, bounds the relative error of the weights, ξ(β) = E[(π(X_i)/π(X_i, β) − 1)^2], from above, as the following proposition states.

Proposition 3 (Minimizing a sharp upper bound of the relative error of weights): When the propensity scores and the resulting inverse probability weights are estimated using (4), a sharp upper bound of the relative error of the weights is minimized in the limit even when the propensity score model is misspecified. If π(X_i, β) ≥ a π(X_i) almost surely for some constant a ∈ (0, 1], the equalities in the bound hold when π(X_i, β) = a π(X_i) for all i. The proof is available in Supplementary Material C.1. Note that the MLE does not minimize this upper bound because its coefficients converge to different limits than the DBW's unless the propensity score model is correctly specified. Note also that this bound is tighter than Tan (2020)'s bound (Supplementary Material C.1), whose leading term is 5/(3a) for the case where π(X_i, β) ≥ a π(X_i) almost surely for some constant a ∈ (0, 1/2] instead of a ∈ (0, 1] as in Proposition 3. Tan (2020) neither provides an intuitive interpretation of this relative error nor investigates the estimator minimizing it as a loss function.
Second, the MSE of the IPW estimator is bounded above by the relative error of the weights, as the following proposition states; this is also shown in Proposition 2 of Tan (2020) (with a typo in the original proposition corrected). Proposition 4 (Minimizing a sharp upper bound of the mean squared error): The MSE of the IPW estimator is sharply bounded above by the relative error of the weights when E[Y^2 | X] ≤ c and π(X_i) ≥ ρ almost surely for some constants c > 0 and ρ ∈ (0, 1). The equality holds when Y_i = μ and π(X_i) = π(X_i, β) = 0.5 for all i. The proof is available in Supplementary Material C.2.
Propositions 3 and 4 together imply that the proposed method minimizes the upper bound of the MSE even when the propensity score model is misspecified.

Comparison with the existing methodologies
This section briefly discusses the related literature and compares the proposed method with existing ones. A more detailed discussion, including a comparison with non-parametric methods, is provided in Supplementary Material F.
Table 1 summarizes the loss function and three important properties of the proposed and three existing methods for the average outcome estimation: the standard IPW, calibrated weighting (Tan, 2020), and entropy balancing (Hainmueller, 2012). Each of the DBW, calibrated weighting, and entropy balancing has two of the three properties. First, the distribution balancing property minimizes the imbalance in the multivariate covariate distribution between the target and weighted groups (Section 3.3) and an upper bound of the mean squared error (MSE) of the parameter of interest (Section 3.5). The DBW and entropy balancing have this property because they use the same loss function (4) with different link functions, where the DBW uses the logistic function and entropy balancing uses the exponential function (Wang and Zubizarreta, 2020; see also Supplementary Materials D and F). Second, calibrated weighting and entropy balancing have the mean balancing property, which is increasingly employed in recently proposed methods (Imai and Ratkovic, 2014; Vermeulen and Vansteelandt, 2015; Zubizarreta, 2015; Zhao, 2019; Wang and Zubizarreta, 2020). This property implies that the estimators are sample-bounded (Robins et al., 2007; Wang and Zubizarreta, 2020). Third, the estimated propensity score is bounded between 0 and 1 except for entropy balancing, whose implicit propensity scores can take values larger than one (see Wang and Zubizarreta, 2020). This unboundedness implies that the implicit propensity score model in entropy balancing is always misspecified and that it cannot be a doubly robust estimator for the average outcome estimation (Zhao, 2019), in contrast to its double robustness for the estimation of the average treatment effect for the treated (Zhao and Percival, 2017).
The comparison shows that the drawback of the DBW is its lack of the mean balancing and sample-boundedness properties. However, the DBW can incorporate or substitute for these properties. First, it can substitute outcome models for the mean balancing, as explained in Section 4.3, which removes bias under the same assumptions as the mean balancing methods such as calibrated weighting and entropy balancing. I also demonstrate that incorporating the linear outcome model is equivalent to modifying the DBW weights to have the mean balancing property. Second, it can incorporate the normalization constraint to gain sample-boundedness, as explained in Section 4.1.

Normalization, regularization, and estimation
Like the IPW with the MLE, a drawback of the DBW is that the estimated weights may not sum to one. To be normalized and thus sample-bounded, the normalized DBW (nDBW) estimator imposes the constraint that the estimated weights satisfy the sample analogue of E[R_i w_i] = 1, namely (1/n) Σ_{i=1}^{n} R_i / π(X_i, β) = 1. This nDBW estimator minimizes the distributional imbalance and an upper bound of the MSE under the normalization constraint. When the propensity score model is correctly specified, the limiting values of the DBW and nDBW estimators are the same because the (unnormalized) DBW estimator asymptotically satisfies the normalization constraint.
Using the Lagrangian function, the solution to this constrained optimization problem is obtained as the solution to the unconstrained optimization problem (19). To estimate the coefficients for the nDBW in (19), we can use M-estimation under the regularity conditions that the solution to (19) is unique and is an element of the interior of a compact set. The second condition is satisfied when the objective function in (19) is bounded below, which is similar to the non-separation condition for the MLE (Vermeulen and Vansteelandt, 2015; Tan, 2020). To satisfy this condition, we can utilize L2 regularization, adding the penalty term λ‖β_{−1}‖^2 to the objective in (19), where λ is a hyper-parameter controlling the strength of the regularization, β_{−1} is β without the intercept, and ‖·‖ denotes the Euclidean norm.
As standard properties of M-estimators, the following results on the consistency and asymptotic normality of the nDBW estimator are obtained (Stefanski and Boos, 2002).
Proposition 5 (Large sample properties): Under the conditions that the solution to (21) is unique and that (21) is bounded below, β̂_nDBW and the resulting inverse probability weights 1/π(X_i, β̂_nDBW) converge in probability to their limiting values β*_nDBW and 1/π(X_i, β*_nDBW); and, since (21) is a smooth function, β̂_nDBW is asymptotically normally distributed with a certain variance Σ, where β* is the limiting value of the coefficients, the solution to (3). When the propensity score model is correctly specified, β̂_nDBW and 1/π(X_i, β̂_nDBW) are estimated consistently.
To demonstrate the benefits of normalization and regularization, I revisit the simulation in Section 2 with the non-normalized DBW and the nDBW with regularization. First, the RMSEs of the estimated weights are 1.23 for the non-normalized DBW against 1.18 for the nDBW, indicating that the normalization improves the weight estimation.
Second, Figure 3 presents the performance of the proposed estimators with different levels of regularization. The optimal value of the regularization hyper-parameter is 0.22, as depicted by the vertical dotted line. Beyond this value, the RMSE of the weights increases as the regularization becomes stronger, approaching the RMSE of the uniform weights shown as the horizontal dotted line. Although tuning hyper-parameters remains a difficult open problem (Zubizarreta, 2015; Wong and Chan, 2017; Zhao, 2019), I recommend using regularization mainly to avoid the separation problem by choosing the smallest value that addresses it, as I do in the simulation studies in Section 5. Another approach is to choose the value that minimizes the upper bound of the bias estimated by the kernel balancing method of Hazlett (2020), which I adopt in the empirical application in Section 6.

Algorithm
In this section, I explain the algorithm for estimating the DBW without the normalization and regularization for simplicity; the algorithm for the nDBW with regularization is explained in Supplementary Material E. Since the loss function in (4) is non-convex with respect to β, standard convex optimization algorithms such as Newton's method are not directly applicable. For this non-convex optimization, I develop a majorization-minimization algorithm that iteratively optimizes two convex decompositions of the original function (Wu and Lange, 2010). It exploits the fact that minimizing the non-convex loss function (4), restated as (25), is equivalent to minimizing the difference of two convex functions: the convex loss function for calibrated weighting (26) minus the convex loss function for the MLE (27). Note that the DBW does not minimize the absolute difference of the two functions; it minimizes the sum of the loss function of calibrated weighting and the negative of the loss function of the MLE. Thus, we should not interpret the DBW as the combination of calibrated weighting and the MLE; rather, calibrated weighting should be regarded as the combination of the DBW and the MLE (Tan, 2020). Since the DBW is better at controlling the mean squared error than the MLE, as shown in Section 3.5, the DBW is also better than calibrated weighting in this regard.
The difference-of-convex-functions algorithm for the DBW repeats the following two steps until convergence. First, as the majorization step, it constructs a surrogate function u(β, β^t) for iteration t by the Taylor expansion of the second component of (25): u(β, β^t) = g_1(β) − g_2(β^t) − g_2′(β^t)^⊤ (β − β^t), where g_1 and g_2 denote the convex functions in (26) and (27), β^t is the vector of initial values or the values obtained in the previous iteration, and g_2′(β) = ∂g_2(β)/∂β. Next, as the minimization step, it estimates β^{t+1} that minimizes (28) with respect to β while keeping β^t fixed. This minimization is easily conducted, e.g., via Newton's method, because the surrogate objective function is convex. This algorithm is proved to decrease the loss monotonically in every step and to converge to a stationary point (Wu and Lange, 2010).
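A minimal sketch of this majorization-minimization scheme, under the assumed forms of the two convex components for a logistic link (the calibrated-weighting loss g1 and the MLE loss g2; these explicit forms are my reconstruction, not copied from the equations above): at each outer iteration g2 is linearized at the current β^t, and the resulting convex surrogate is minimized by damped Newton steps.

```python
import numpy as np

expit = lambda z: 1.0 / (1.0 + np.exp(-z))

# Assumed convex decomposition, consistent with the text: the DBW loss
# equals, up to a constant, the calibrated-weighting loss g1 minus the
# (negative log-likelihood) MLE loss g2, both convex in beta.
def g1(beta, X, r):
    eta = X @ beta
    return np.mean(r * np.exp(-eta) + (1 - r) * eta)

def g2(beta, X, r):
    eta = X @ beta
    return np.mean(-r * eta + np.logaddexp(0.0, eta))

def mm_dbw(X, r, outer=25, inner=25):
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(outer):
        # Majorization: linearize the subtracted convex function g2 at beta_t;
        # since g2 lies above its tangent, the surrogate majorizes the loss.
        grad_g2 = X.T @ (expit(X @ beta) - r) / n

        def u(b):   # convex surrogate, up to a constant
            return g1(b, X, r) - grad_g2 @ b

        b = beta.copy()
        for _ in range(inner):
            # Minimization: damped Newton on the surrogate
            eta = X @ b
            grad = X.T @ (-r * np.exp(-eta) + (1 - r)) / n - grad_g2
            hess = (X * (r * np.exp(-eta))[:, None]).T @ X / n
            step = np.linalg.solve(hess + 1e-8 * np.eye(k), grad)
            t = 1.0
            while u(b - t * step) > u(b) and t > 1e-10:
                t *= 0.5   # halve the step until the surrogate decreases
            b = b - t * step
        beta = b
    return beta

rng = np.random.default_rng(3)
n = 4000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
r = rng.binomial(1, expit(0.4 + 0.9 * x))
beta_hat = mm_dbw(X, r)
dbw_loss = lambda b: g1(b, X, r) - g2(b, X, r)
```

Because each accepted step decreases the convex surrogate, which lies above the loss and touches it at β^t, the DBW loss decreases monotonically across iterations, as the convergence result of Wu and Lange (2010) requires.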

A substitute for the mean balancing condition: doubly robust distribution balancing weighting estimator
Another drawback of the DBW is that it lacks the mean balancing property, which is increasingly employed by recently proposed methods as a safeguard against model misspecification (Hainmueller, 2012; Imai and Ratkovic, 2014; Vermeulen and Vansteelandt, 2015; Zubizarreta, 2015; Zhao, 2019; Tan, 2020; Wang and Zubizarreta, 2020). This property removes bias when the true outcome model is a linear regression model in the balanced covariates (Zubizarreta, 2015; Zhao and Percival, 2017; Zhao, 2019; Fan et al., 2021).
As a substitute for the mean balancing property, the (n)DBW can incorporate the outcome regression model. Specifically, this study considers the augmented IPW, where the weights are estimated by the (n)DBW instead of the MLE. The augmented IPW combines the outcome model and the propensity score model in the following manner (Robins et al., 1994; Seaman and Vansteelandt, 2018):

μ̂_DR = (1/n) Σ_{i=1}^{n} [ R_i Y_i / π(X_i, β̂) − { R_i / π(X_i, β̂) − 1 } m̂(X_i) ],    (29)

where π(X_i, β̂) is the estimated propensity score and m̂(X_i) is the estimate of the conditional expectation function m(X_i) = E[Y_i | X_i]. The (n)DBW DR estimator uses the WLS to estimate m(X_i) because it has better performance than the OLS (Kang and Schafer, 2007; Imai and Ratkovic, 2014). Under the same condition as for the mean balancing methods, the (n)DBW estimator eliminates bias by using a linear regression outcome model. It can also utilize more flexible methods, such as machine learning techniques, for estimating m(X_i). This estimator is doubly robust, i.e., consistent when either the propensity score model or the outcome model is correctly specified (Vermeulen and Vansteelandt, 2015; Seaman and Vansteelandt, 2018). Moreover, the (n)DBW DR estimator achieves the semiparametric efficiency bound if both models are correctly specified (Vermeulen and Vansteelandt, 2015; Seaman and Vansteelandt, 2018).
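The augmented IPW construction can be sketched as follows; the propensity scores and the data-generating process below are hypothetical stand-ins (any weighting estimator could supply them), and the outcome model is fitted by WLS with the estimated weights, as described above:

```python
import numpy as np

expit = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(4)
n = 4000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])

pi_hat = expit(0.3 + 0.8 * x)   # stand-in for an estimated propensity score
r = rng.binomial(1, pi_hat)     # here the "estimate" happens to be the truth
y = 1.0 + 2.0 * x + rng.standard_normal(n)
w_hat = 1.0 / pi_hat

# Outcome model m(X) fitted by WLS on respondents, weighted by w_hat
Xr, yr, wr = X[r == 1], y[r == 1], w_hat[r == 1]
gamma = np.linalg.solve((Xr * wr[:, None]).T @ Xr, (Xr * wr[:, None]).T @ yr)
m_hat = X @ gamma

# Augmented IPW: the second term corrects the simple IPW estimate
# using the outcome-model predictions for all units
mu_dr = np.mean(r * y / pi_hat - (r / pi_hat - 1.0) * m_hat)
```

With a correct outcome model, the augmentation term has mean zero whatever the weights, which is the mechanism behind the double robustness described above.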
Recently, Chattopadhyay and Zubizarreta (2023) demonstrates that the augmented IPW can be expressed as a (non-augmented) IPW with modified weights having the covariate balancing property. Chattopadhyay and Zubizarreta (2023) also proves that the modified weights minimize a chi-square-type distance from the (inverse probability) weights used in the augmented IPW with the WLS. This implies that the (n)DBW DR estimator can be expressed as a (non-augmented) IPW estimator with weights ŵ_nDBWDR satisfying the corresponding balancing conditions, which demonstrates that the (n)DBW DR estimator combines the covariate balancing property with approximate multivariate covariate distribution balancing.
Since the covariate balancing methods satisfy (1/n) Σ_{i=1}^{n} { R_i / π(X_i, β̂) − 1 } X_i = 0 by construction, their DR estimators with the linear outcome model for m(X_i) cannot improve on their non-DR estimators because the augmented term, the second term in (29), is always 0. I examine the finite-sample performance of the nDBW DR estimators through simulation studies in Section 5.

Simulation studies
This section examines the finite-sample performance of the proposed methods through simulation studies. The standard simulation setting was proposed by Kang and Schafer (2007), who showed that even DR estimators may suffer from large bias and variance when both the outcome and propensity score models are misspecified. To examine the performance of newly proposed methods, many studies have utilized the same data-generating process (Robins et al., 2007; Imai and Ratkovic, 2014; Vermeulen and Vansteelandt, 2015; Zubizarreta, 2015; Zhao and Percival, 2017; Tan, 2020; Wang and Zubizarreta, 2020).
Following these studies, I use the following data-generating process. There are n units, and each unit i has four covariates X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4}), each of which is independently and identically distributed according to the standard normal distribution. Each unit also has an outcome Y_i and a response indicator variable R_i ∈ {0, 1}, where the outcome is observed only for the response group R_i = 1. The response indicator variable is assigned according to the Bernoulli distribution R_i ∼ Bernoulli(π(X_i)), where π(X_i) is the true propensity score. Following Robins et al. (2007) and Vermeulen and Vansteelandt (2015), this study considers two types of the true propensity score model, type A and type B, whose coefficients have reversed signs relative to each other. The outcome variable Y_i is independently and identically distributed according to the normal distribution Y_i ∼ N(m(X_i), 1). Following the extension of the standard simulation study by Tan (2020), this study considers six versions of the true outcome model m(X_i): (Linear 1) m(X_i) = 210 + 27.4 X_{i1} + 13.7 X_{i2} + 13.7 X_{i3} + 13.7 X_{i4} with μ = 210; (Linear 2) m(X_i) = 210 + 13.7 X_{i1} + 27.4 X_{i2} + 27.4 X_{i3} + 27.4 X_{i4} with μ = 210; (Quadratic 1) m(X_i) = 210 + 27.…

Finally, the full results are shown in Tables I.1-I.6 in Supplementary Materials I, which enable comparison of the IPW estimators and their DR variants. In most scenarios, the nDBW estimator is improved by incorporating outcome models. In contrast, as explained in Section 4.3, the calibrated weighting and entropy balancing estimators have the same biases and RMSEs as their DR variants in the misspecified propensity score model scenarios, where the linear outcome model uses the same covariates as the propensity score model.
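A sketch of this data-generating process for the "Linear 1" outcome follows. The propensity score coefficients and the transformed covariates used in the misspecified scenarios are not reproduced in the text above; the values below are taken from Kang and Schafer (2007) as I recall them and should be treated as assumptions:

```python
import numpy as np

expit = lambda z: 1.0 / (1.0 + np.exp(-z))

def kang_schafer(n, rng, reverse=False):
    """Kang-Schafer (2007) data-generating process with the 'Linear 1'
    outcome; reverse=True flips the propensity coefficients (type B)."""
    x = rng.standard_normal((n, 4))
    coef = np.array([-1.0, 0.5, -0.25, -0.1])   # assumed type A coefficients
    if reverse:
        coef = -coef                            # type B
    pi = expit(x @ coef)
    r = rng.binomial(1, pi)
    y = 210 + x @ np.array([27.4, 13.7, 13.7, 13.7]) + rng.standard_normal(n)
    # Misspecified analyses observe these transformed covariates instead of x
    z = np.column_stack([
        np.exp(x[:, 0] / 2),
        x[:, 1] / (1 + np.exp(x[:, 0])) + 10,
        (x[:, 0] * x[:, 2] / 25 + 0.6) ** 3,
        (x[:, 1] + x[:, 3] + 20) ** 2,
    ])
    return x, z, r, y

rng = np.random.default_rng(5)
x, z, r, y = kang_schafer(1000, rng)
```

Fitting any of the weighting estimators to `z` rather than `x` reproduces the misspecified-model scenarios compared in the tables.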

Empirical analysis
Can foreign occupiers reduce resistance by devolving political authority to native elites? This question is central to the literature on foreign occupation but also challenging to answer: randomized experiments are almost impossible, and the strategic determination of political devolution causes endogeneity problems in observational studies.
To address this issue, Ferwerda and Miller (2014) utilizes a natural experimental setting in France during World War II, where municipalities in the neighborhood of the demarcation line were plausibly viewed as randomly assigned to the German- or Vichy-governed zone. After
Germany defeated France in June 1940, an armistice provided for the division of France into an occupied zone, comprising northern France and the western coast and ruled directly by Germany, and an unoccupied zone in the south ruled by the Vichy government. In November 1942, in response to the Allied landing in North Africa, German forces invaded and occupied the unoccupied southern zone. However, the formal administrative division remained, preserving the difference in the extent of political devolution between the two zones. Ferwerda and Miller (2014) analyzes this period (November 1942 to September 1944), when German military forces were present in both zones and most of the resistance events occurred (mainly after 1943).
Using a regression discontinuity design, Ferwerda and Miller (2014) finds that political devolution decreases sabotage events in the close neighborhood of the demarcation line, which was drawn arbitrarily at the local level, cutting across preexisting administrative borders and geographic features such as mountain ranges and rivers. There is, however, a concern about spill-over effects: resistance fighters operating from the Vichy zone, rather than residents of the German-governed zone, may increase sabotage events in the German zone at the demarcation line. To address this concern, it is necessary to compare municipalities at some distance from the line, as in the robustness checks in their article. A problem that arises here is that the demarcation line was drawn at random only locally, which leaves some imbalance in the confounders between municipalities with high (the Vichy zone) and low (the German zone) levels of political devolution distant from the demarcation line.
To address this issue, I employ the nDBW DR estimator and compare the results with the MLE DR, calibrated weighting DR, CBPS DR, entropy balancing DR, and unweighted difference-in-means estimators. Following Ferwerda and Miller (2014), I construct a propensity score model with the following variables: the distance from the demarcation line; local state capacity (population from the 1936 census, the distance to the nearest train station, and an index variable indicating whether a municipality possessed a telephone bureau, a telegraph station, or a post office in 1939); forest cover and urbanization (the number of actively farmed hectares); the ruggedness of terrain (the mean and standard deviation of the elevation); and political ideology (leftist and rightist vote shares in the 1936 election).
With these methods, I compare the occurrence of sabotage events, which counts all attacks against infrastructure (largely railroads and communications), between the treatment (German-zone) and control (Vichy-zone) municipalities within pre-specified bandwidths of the distance from the demarcation line (Ferwerda and Miller, 2014). To examine whether the effects are observed only locally near the demarcation line, I use rolling windows of 15 km, comparing municipalities within each of 0-15, 2.5-17.5, …, 20-35 km from the demarcation line. I denote each window by its upper limit: e.g., the results for the municipalities that are 20 km distant from the demarcation line indicate results for those within 5-20 km of the line.
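The rolling-window scheme can be sketched as follows on simulated stand-in data (the distances, zone assignments, and sabotage counts below are illustrative placeholders, not the actual municipality data, and the within-window contrast is shown unweighted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the municipality data: distance (km) from the
# demarcation line, zone indicator (1 = German zone), sabotage counts.
dist = rng.uniform(0, 40, size=2000)
zone = rng.binomial(1, 0.5, size=2000)
sabotage = rng.poisson(1.0 + 0.3 * zone)

# Rolling 15 km windows 0-15, 2.5-17.5, ..., 20-35 km,
# each labeled by its upper limit as in the text.
windows = [(lo, lo + 15.0) for lo in np.arange(0.0, 22.5, 2.5)]

estimates = {}
for lo, hi in windows:
    in_win = (dist >= lo) & (dist <= hi)
    treated = sabotage[in_win & (zone == 1)]
    control = sabotage[in_win & (zone == 0)]
    # Unweighted difference in means within the window
    estimates[hi] = treated.mean() - control.mean()
```

In the application, the unweighted contrast inside each window is replaced by the weighted estimators listed above, since randomization holds only locally near the line.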

Bias due to imbalance in multivariate covariate distribution
While bias induced by multivariate distributional imbalance is difficult to evaluate, an upper bound on the bias can be estimated by a kernel technique (Hazlett, 2020). I use this upper bound to compare the performance of the various methods. Figure 5 presents the upper bounds of the bias due to multivariate distributional imbalance for the proposed estimator (shown in red) and the other estimators. Those for the calibrated weighting at distances larger than 30 km are missing because the weights cannot be estimated owing to a convergence failure. As the distance from the demarcation line increases, the bias due to multivariate distributional imbalance becomes severe, but the proposed estimator mitigates the bias more effectively than the others.
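Hazlett's (2020) exact statistic is not reproduced here, but a generic version of such a kernel bound can be sketched: over outcome functions in the unit ball of the RKHS of a chosen kernel, the worst-case bias induced by a weighting discrepancy a equals sqrt(a'Ka), computable from the Gram matrix. The function names, Gaussian kernel, and bandwidth below are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, bw):
    # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / bw)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / bw)

def worst_case_bias(X, target_w, est_w, bw=2.0):
    # For outcome functions f in the unit ball of the RKHS of this
    # kernel, sup |sum_i a_i f(x_i)| = sqrt(a' K a), where
    # a = target_w - est_w; this bounds the bias from imbalance.
    a = target_w - est_w
    K = gaussian_gram(X, bw)
    return float(np.sqrt(max(a @ K @ a, 0.0)))
```

The bound is zero when the two weight vectors coincide and grows with the distributional imbalance they induce, which is what Figure 5 plots across methods.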

Results
The results are presented in Figure 6, and the details are shown in Supplementary Material G. The left panel presents point estimates of the ATE for the proposed (shown in red) and other estimators, which diverge as the distance from the demarcation line increases. Why do these estimators provide such different estimates? The key is multivariate distributional imbalance. To examine the distributional balance obtained by these estimators, I conducted the following analysis. First, to represent a possibly complex function of the covariates, I prepared the constitutive terms, their squared terms, and all first-order interactions. Then, to select the terms relevant to the potential outcomes, I estimated outcome models using the LASSO and kept the terms with non-zero coefficients. This procedure selected 16 of the 54 terms as relevant, of which four were squared terms and nine were interactions, implying that the outcome model is a complex function of the covariates and highlighting the importance of distributional balance. Finally, I calculated the absolute mean difference for these selected terms.
Figure 7 presents the results for the treated municipalities 20-35 km distant from the demarcation line as an illustration. The proposed estimator (circles) attains better balance than the CBPS estimator (triangles) in all but two terms while avoiding the large imbalances of the entropy balancing estimator (squares). These results confirm the superiority of the proposed method in distributional balance. The estimates for the calibrated weighting at distances larger than 30 km are not shown because the weights cannot be estimated owing to a convergence failure.
The right panel of Figure 6 presents the ATE estimates from the nDBW DR estimator and their jackknife standard errors, which show that the treatment effects decrease steeply toward zero as the distance from the demarcation line increases. This decreasing effect indicates that political devolution reduces resistance activities only near the demarcation line, which raises the concern about spill-over effects: resistance fighters operating from the Vichy zone may increase sabotage events in the German zone. This result suggests that the treatment effects may be contaminated by spill-over effects.
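The jackknife standard errors can be computed with a generic delete-one routine (a sketch; `jackknife_se` is an illustrative helper, not the paper's code):

```python
import numpy as np

def jackknife_se(estimator, data):
    # Delete-one jackknife standard error of a scalar statistic:
    # recompute the estimator n times, each time leaving out one
    # observation, then scale the spread of the leave-one-out values
    # by (n - 1) / n.
    n = len(data)
    idx = np.arange(n)
    loo = np.array([estimator(data[idx != i]) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2)))
```

As a sanity check, for the sample mean this reproduces the textbook standard error s/sqrt(n) exactly; for the weighted ATE estimator, the same leave-one-out recomputation applies with the weights re-estimated on each subsample.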

Conclusions
The IPW estimators are widely used to address missing data problems, including causal inference. However, their practical application is susceptible to bias due to propensity score model misspecification. In response, existing studies have proposed various methods balancing some moments (or kernels) of observed covariates (Hainmueller, 2012; Imai and Ratkovic, 2014; Vermeulen and Vansteelandt, 2015; Zubizarreta, 2015; Chan et al., 2016; Wong and Chan, 2017; Zhao, 2019; Tan, 2020; Wang and Zubizarreta, 2020), but specifying the moment conditions remains a formidable task because it requires knowledge of the true outcome model (Zubizarreta, 2015; Zhao and Percival, 2017; Zhao, 2019; Fan et al., 2021).
This study proposes the distribution balancing weighting, which mitigates bias and controls the MSE by minimizing their upper bounds. These goals are achieved by balancing the covariate distribution through minimizing the KL divergence between the true and estimated weights, an approach inspired by the quasi MLE with misspecified models (Akaike, 1973; White, 1982). The proposed method has several attractive properties, as demonstrated in this study.
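The KL divergence between true and estimated weights can be illustrated with a generic empirical version. Note that the true weights are unknown in practice, so this direct computation is feasible only in simulations; the proposed method instead minimizes an upper bound on this quantity. `kl_weights` is an illustrative helper, not the paper's estimating equation.

```python
import numpy as np

def kl_weights(w_true, w_est, eps=1e-12):
    # Normalize both weight vectors to probability distributions and
    # compute KL(true || estimated); this is zero exactly when the
    # weights agree up to a common rescaling.
    p = np.asarray(w_true, dtype=float)
    q = np.asarray(w_est, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```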
Finally, I note some directions for improving and extending the proposed method. First, the key idea of the distribution balancing weighting is to minimize a statistic measuring the discrepancy between the true and estimated weights. This idea may also apply to recently proposed machine learning techniques for propensity score estimation, such as decision trees, random forests, and generalized random forests. Second, the proposed method may be useful in other missing data problems, such as mediation analysis, causal inference with time-series cross-section data, and continuous treatment cases. I am currently exploring these potential directions.

Figure 1. Goals of the proposed method.

Figure 2. A numerical example: the true and estimated weights and propensity scores with a misspecified propensity score model. Notes: This figure compares the performance of the proposed estimator (triangles) and the MLE estimator (circles) with a misspecified propensity score model in terms of inverse probability weights (left panel) and propensity scores (right panel). In the left (right) panel, the x-axis represents the true weights (propensity scores) and the y-axis represents the estimated ones. The shaded area in the left panel indicates where the difference between the true and estimated weights is small (less than two), and the shaded area in the right panel indicates the corresponding region in terms of the difference in weights (not the difference in propensity scores).

Figure 3. The performance of the proposed estimators with different levels of regularization. Notes: The horizontal dotted line indicates the RMSE of the uniform weights.

Figure 4. Simulation results: bias and RMSE with misspecified propensity score models. Notes: This figure compares the performance of the DBW DR estimator (x-axis) with the calibrated weighting DR (circles), CBPS DR (triangles), and entropy balancing DR (squares) estimators (y-axis) in terms of absolute bias (left panel) and RMSE (right panel). Points above the diagonal line indicate that the DBW DR estimator works better than the corresponding estimator, and points below indicate the opposite.

Figure 5. Upper bounds of the bias due to distributional imbalance for various methods. Notes: As the distance from the demarcation line increases, the bias due to multivariate distributional imbalance becomes severe, but the proposed estimator mitigates it more effectively than the others.

Figure 6. Political devolution decreases resistance activities only near the demarcation line. Notes: The left panel presents the ATE estimates by various estimators, which diverge as the distance from the demarcation line increases. This divergence is consistent with the estimated upper bounds of the bias shown in Figure 5. The right panel presents the ATE estimates and their standard errors estimated by the nDBW DR estimator, which show that the treatment effects decrease steeply toward zero as the distance from the demarcation line increases.

Table 1. Comparison of the proposed and existing methods for average outcome estimation. Notes: The standard IPW is the IPW with the MLE. https://doi.org/10.1017/psrm.2024.23

Table 2. Simulation results: linear outcome models. Notes: This simulation compares the performance of various methods for estimating propensity scores and (inverse probability) weights by investigating combinations of two versions of the true outcome model (Linear 1 and 2) and two versions of coefficients for the propensity score model (type A and B) with two different numbers of observations (n = 200 and n = 1000). For each estimation method, I use two propensity score model specifications (correct and misspecified) and report the bias and RMSE for each in the table.

Table 3. Simulation results: quadratic outcome models. Notes: This simulation compares the performance of various methods for estimating propensity scores and (inverse probability) weights by investigating combinations of two versions of the true outcome model (Quadratic 1 and 2) and two versions of coefficients for the propensity score model (type A and B) with two different numbers of observations (n = 200 and n = 1000). For each estimation method, I use two propensity score model specifications (correct and misspecified) and report the bias and RMSE for each in the table.

Table 4. Simulation results: exponential outcome models. Notes: This simulation compares the performance of various methods for estimating propensity scores and (inverse probability) weights by investigating combinations of two versions of the true outcome model (Exponential 1 and 2) and two versions of coefficients for the propensity score model (type A and B) with two different numbers of observations (n = 200 and n = 1000). For each estimation method, I use two propensity score model specifications (correct and misspecified) and report the bias and RMSE for each in the table.