## 1. Introduction

The existence of separation in binary choice models, where an independent variable perfectly predicts a binary dependent variable, is a problem within political science. The default response to this problem, suggested by Zorn (2005) and further evaluated by Rainey (2016), is the use of penalized maximum likelihood estimation (PMLE), developed by Firth (1993) and equivalent to the use of a Jeffreys prior (Jeffreys, 1946). Such an approach has also been used to provide finite estimates of fixed effects for units that never experience the outcome event (Cook *et al.*, 2020).

In this paper, I demonstrate that PMLE can lead to statistically significant point estimates in the *opposite* direction to that of the separation when the number of observations is sufficiently large and the binary dependent and independent variables responsible for separation are rare events. Dyadic data on countries over time, with tens to hundreds of thousands of observations, are a common example of such data. In these cases, researchers often focus on rare events both as dependent and independent variables, such as interstate war, the onset of sanctions, possession of nuclear weapons, and the signing of preferential trade agreements.

This reversal in sign arises because the Jeffreys prior implies non-independent prior densities for the parameters. When a rare binary independent variable is included, this can place high prior density on parameter values opposite to the direction of separation. To demonstrate how this occurs and its effect on inferences, I use simulated data and an empirical illustration. As an alternative to the Jeffreys prior/Firth's PMLE, independent “weakly-informative” priors such as the Cauchy prior suggested by Gelman *et al.* (2008) ensure that the point estimate remains in the direction of separation for this specification.

Researchers should be mindful that Firth's method can lead to statistically significant point estimates in the opposite direction of separation. While there may be occasions when researchers believe this is an appropriate result, it should be made clear that it is due to the specific choice of the prior. Researchers who wish to ensure point estimates are always in the same direction as the separation can instead use the Cauchy prior approach of Gelman *et al.* (2008). Even so, and echoing Rainey (2016), researchers should be aware that any default solution to separation necessarily has inbuilt assumptions about researchers' prior information that may not be universally applicable.

## 2. How penalized maximum likelihood and Jeffreys prior lead to opposite point estimates

Suppose our estimating equation is *y* = α + β*x* + ε and we are faced with negative quasi-complete separation. That is, there exist no observations in the 2 × 2 table displayed in Table 1a such that $x = 1 \wedge y = 1$.

In this single covariate case, Firth's Jeffreys prior approach is equivalent to adding 0.5 to each cell in Table 1a (Zorn, 2005), resulting in Table 1b.

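This solution becomes problematic when it does not maintain the feature that *y* = 1 is relatively more frequent when *x* = 0 than when *x* = 1. For inferences about the effect of *x* on *y* to have the same sign as the separation, it must hold that:

$$\frac{n_3 + 0.5}{n_1 + n_3 + 1} > \frac{0.5}{n_2 + 1}$$

where $n_1$, $n_2$, and $n_3$ are the numbers of observations with $x = 0 \wedge y = 0$, $x = 1 \wedge y = 0$, and $x = 0 \wedge y = 1$, respectively, and the cell $x = 1 \wedge y = 1$ is empty.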

Rare events in large datasets can violate this inequality. When *y* is a rare event, $n_3$ is small. If *x* is also a rare event, $n_1$ will be very large while $n_2$ remains small. This matters because, when $n_2$ is small, adding 0.5 to all cells can make the relative frequency of *y* = 1 when *x* = 1 larger than when *x* = 0, in spite of the data suggesting the opposite. The proportion for *x* = 1 is strongly affected because $n_2$ is small, while the proportion for *x* = 0 is mostly unaffected because $n_1$ is so large.
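A short rearrangement makes the violation explicit. The sketch below assumes that the sign condition compares the adjusted relative frequencies of $y = 1$ in the two groups, with $n_1$, $n_2$, and $n_3$ counting the observations with $x = 0 \wedge y = 0$, $x = 1 \wedge y = 0$, and $x = 0 \wedge y = 1$, and with the cell $x = 1 \wedge y = 1$ empty. Under those assumptions the condition reduces to an upper bound on $n_1$:

$$
\begin{aligned}
\frac{n_3 + 0.5}{n_1 + n_3 + 1} > \frac{0.5}{n_2 + 1}
&\iff (2 n_3 + 1)(n_2 + 1) > n_1 + n_3 + 1 \\
&\iff n_1 < 2 n_2 n_3 + n_2 + n_3 .
\end{aligned}
$$

With illustrative counts $n_2 = n_3 = 50$, for instance, the bound is $n_1 < 5{,}100$: once the number of observations with $x = 0 \wedge y = 0$ exceeds it, the adjusted frequencies point in the direction opposite to the separation.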

As an illustration, consider the case of war between two nuclear powers. Table 2 displays the appropriate frequencies using data from Rauchhaus (2009). The relationship between these two variables shows quasi-complete separation, with no observations where both countries had nuclear weapons and went to war. The relevant relative frequencies in this case are $p_{10} = \frac{62}{62 + 454{,}752} \approx 0.0001$ and $p_{11} = \frac{0}{805} = 0$.

Suppose we were to use Firth's PMLE to examine this bivariate relationship. As noted previously, this is akin to adding 0.5 to each cell. This makes the new relative frequencies $p_{10}^{\prime} = \frac{62.5}{62 + 454{,}752 + 1} \approx 0.0001$ and $p_{11}^{\prime} = \frac{0.5}{805 + 1} \approx 0.0006$. Thus this approach to dealing with separation results in the relative frequency of war in nuclear dyads being roughly 4.5 times *larger* (six times on the rounded figures) than in dyads where both states do not simultaneously have nuclear weapons. Jeffreys prior leads to inferences about the relationship between nuclear weapons and interstate war opposite to the direction seen in the data.
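The arithmetic can be checked with a few lines of Python. This is a minimal sketch rather than the estimation code behind the paper: the function name `adjusted_frequencies` and the cell labels (`n1`: x=0, y=0; `n2`: x=1, y=0; `n3`: x=0, y=1; `n4`: x=1, y=1) are assumptions chosen for illustration, and only the counts come from the text.

```python
def adjusted_frequencies(n1, n2, n3, n4=0, add=0.5):
    """Relative frequency of y = 1 within x = 0 and within x = 1
    after adding `add` to every cell of the 2 x 2 table.

    Assumed cell labels: n1: x=0,y=0; n2: x=1,y=0; n3: x=0,y=1; n4: x=1,y=1.
    """
    p_x0 = (n3 + add) / (n1 + n3 + 2 * add)
    p_x1 = (n4 + add) / (n2 + n4 + 2 * add)
    return p_x0, p_x1


# Counts quoted in the text for the nuclear-dyad example.
raw_x0, raw_x1 = adjusted_frequencies(454_752, 805, 62, add=0.0)
adj_x0, adj_x1 = adjusted_frequencies(454_752, 805, 62)

print(f"raw:      p_10 = {raw_x0:.6f}, p_11 = {raw_x1:.6f}")
print(f"adjusted: p_10 = {adj_x0:.6f}, p_11 = {adj_x1:.6f}")
print(f"adjusted ratio p_11 / p_10 = {adj_x1 / adj_x0:.2f}")
```

On the unrounded values the adjusted ratio comes out at roughly 4.5, so the ordering of the two frequencies is reversed relative to the raw table.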

More concretely, while Jeffreys prior is non-informative with regard to the baseline probabilities, it is ultimately informative about the parameters. Jeffreys prior results in prior distributions for the constant term and parameter for *x* that are not independent. Jeffreys prior gives more (less) prior mass for large positive coefficients for *x* than equally large negative coefficients for *x*, when evaluated at a large negative (positive) value for the constant term, ensuring that the joint prior distribution remains uninformative with regard to the baseline probability.

This is problematic when *y* is a rare event. The likelihood is high at large negative values for the constant, which is exactly where Jeffreys prior assigns more mass to large positive values for the coefficient of the variable that leads to separation. Therefore, the prior pulls the estimate of the coefficient on *x* toward positive values, even though this is the direction opposite to that of the separation.

Panels (a) and (b) in Figure 1 illustrate this for the case where $n_1 = 50{,}000$, $n_2 = 100$, $n_3 = 100$. Panel (a) shows that Jeffreys prior gives more mass to positive values for the coefficient on *x* relative to equally negative values when the constant is negative. As the likelihood has high density when the constant has a large negative value, the Jeffreys prior pulls the posterior toward being positive in spite of negative separation.
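The asymmetry in the prior can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the code used for Figure 1: it takes the Jeffreys prior to be proportional to the square root of the determinant of the Fisher information of a logit model with a constant and one binary regressor, for which the determinant works out to `n_x0 * n_x1 * w0 * w1` with `w = p(1 - p)` in each group. The function names and the evaluation points (a constant of -6, 0, or 6 and a coefficient of ±3) are hypothetical choices.

```python
import math


def log_weight(z):
    """Stable log of p * (1 - p), where p is the logistic function of z."""
    return -abs(z) - 2.0 * math.log1p(math.exp(-abs(z)))


def log_jeffreys_prior(alpha, beta, n_x0, n_x1):
    """Unnormalized log Jeffreys prior for logit P(y = 1) = alpha + beta * x.

    n_x0 and n_x1 are the numbers of observations with x = 0 and x = 1.
    For this design det(X'WX) = n_x0 * n_x1 * w0 * w1.
    """
    log_det = (math.log(n_x0) + math.log(n_x1)
               + log_weight(alpha) + log_weight(alpha + beta))
    return 0.5 * log_det


# Counts from the text: n1 = 50,000, n2 = 100, n3 = 100 (and an empty fourth cell).
n_x0, n_x1 = 50_000 + 100, 100

for alpha in (-6.0, 0.0, 6.0):
    gap = (log_jeffreys_prior(alpha, 3.0, n_x0, n_x1)
           - log_jeffreys_prior(alpha, -3.0, n_x0, n_x1))
    print(f"alpha = {alpha:+.0f}: log prior(beta = +3) - log prior(beta = -3) = {gap:+.3f}")
```

The gap is positive when the constant is strongly negative, zero when the constant is zero, and negative when the constant is strongly positive, which is the dependence between the two parameters that the text describes.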

Under the same specification, estimates in the direction of the separation can be retained if researchers instead use independent “weakly-informative” prior distributions centered at zero.^{1} This retains the property that the posterior point estimate of the coefficient for the variable that leads to separation is never in a direction opposite to the separation.^{2}

The Cauchy prior suggested by Gelman *et al.* (2008) is one such prior.^{3} They advocate using independent Cauchy distributions as priors for parameters, with a location of zero and scale of 2.5 for independent variables and 10 for the constant. These default priors are weakly informative, based upon plausible baseline probabilities and effect sizes for covariates that are rescaled to have mean zero and standard deviation 0.5 (if continuous or a symmetric binary variable). For the independent variables, the prior corresponds to the idea that absolute changes of less than 5 on the logit scale (e.g. moving from a probability of 0.5 to 0.99) are plausible when increasing a variable from one standard deviation below to one standard deviation above its mean. The scale is widened for the prior on the constant, which due to rescaling reflects the expected success probability when all variables are at their means, so that the range of plausible success probabilities runs from $10^{-9}$ to $1 - 10^{-9}$.
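The two numerical anchors in this description are easy to verify. The snippet below is only a check of the logit arithmetic; the helper names `logit` and `expit` are defined here for illustration and are not taken from any particular package.

```python
import math


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit: probability corresponding to log odds z."""
    return 1.0 / (1.0 + math.exp(-z))


# A change of 5 on the logit scale, starting from a probability of 0.5.
shifted = expit(logit(0.5) + 5.0)

# Log odds corresponding to the extreme baseline probability 1e-9.
extreme = logit(1e-9)

print(f"probability after a +5 logit change from 0.5: {shifted:.4f}")
print(f"logit(1e-9) = {extreme:.2f}")
```

A shift of 5 from a probability of 0.5 lands at about 0.993, in line with the 0.99 quoted above, and a probability of $10^{-9}$ corresponds to log odds of about -20.7, roughly twice the prior scale of 10 used for the constant.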

Panels (c) and (d) in Figure 1 illustrate the use of independent Cauchy distributions. As can be seen from the prior, equally large positive and negative values for the coefficient on *x* have equal prior density, independent of the value of the constant. As a result, at the point where the likelihood has high density for the constant, the posterior remains in the direction of the separation, as movements toward positive values for the coefficient on *x* lead to lower prior density. Thus the use of an independent prior density centered at zero ensures that inferences in terms of point estimates remain in the direction of the separation, unlike those of the Jeffreys prior/PMLE.
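A rough numerical check of this argument is sketched below. It is not the estimation routine used in the paper: it locates the posterior mode with a crude coarse-to-fine grid search, works directly on the 2 × 2 cell counts, and omits the rescaling of inputs that Gelman *et al.* recommend, all of which are simplifying assumptions. The function names and the cell labels (`n1`: x=0, y=0; `n2`: x=1, y=0; `n3`: x=0, y=1; `n4`: x=1, y=1) are hypothetical.

```python
import math


def log1pexp(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def log_posterior(alpha, beta, n1, n2, n3, n4=0, scale_const=10.0, scale_x=2.5):
    """Unnormalized log posterior under independent Cauchy(0, scale) priors."""
    log_lik = (n3 * alpha - (n1 + n3) * log1pexp(alpha)
               + n4 * (alpha + beta) - (n2 + n4) * log1pexp(alpha + beta))
    log_prior = (-math.log1p((alpha / scale_const) ** 2)
                 - math.log1p((beta / scale_x) ** 2))
    return log_lik + log_prior


def posterior_mode(n1, n2, n3, n4=0, rounds=8, points=41, half_width=20.0):
    """Coarse-to-fine grid search for the posterior mode (alpha, beta)."""
    a_mid, b_mid, half = 0.0, 0.0, half_width
    for _ in range(rounds):
        step = 2.0 * half / (points - 1)
        grid = [(a_mid - half + i * step, b_mid - half + j * step)
                for i in range(points) for j in range(points)]
        a_mid, b_mid = max(
            grid, key=lambda ab: log_posterior(ab[0], ab[1], n1, n2, n3, n4))
        half = 2.0 * step  # zoom in around the current best grid point
    return a_mid, b_mid


# Counts from the text's Figure 1 example: n1 = 50,000, n2 = 100, n3 = 100.
alpha_hat, beta_hat = posterior_mode(50_000, 100, 100)
print(f"posterior mode: constant = {alpha_hat:.3f}, coefficient on x = {beta_hat:.3f}")
```

Because the likelihood under negative quasi-complete separation is decreasing in the coefficient on *x* and the Cauchy prior density is decreasing away from zero, any positive coefficient is dominated by zero, so the mode found by the search is negative.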

To further illustrate this result, I compare the use of Jeffreys prior/Firth's PMLE estimator to the use of independent Cauchy prior distributions on hypothetical data.^{4,5} I examine how the performance of these estimators changes with different values of $n_1$, $n_2$, and $n_3$ while keeping $n_4 = 0$ to maintain negative quasi-complete separation.^{6}

Figure 2 displays the results of estimating these models on the simulated data. We can see that when $y = 1 \wedge x = 0$ and $y = 0 \wedge x = 1$ are rare events relative to $y = 0 \wedge x = 0$, the PMLE estimator can lead to coefficients in the opposite (positive) direction to that of the separation. In contrast, using the Cauchy prior ensures that the coefficient on *x* remains in the same direction as the separation, although this parameter asymptotically approaches zero.

Figure 3 displays how the features of the data and choice of prior impact uncertainty around the parameter estimate. I focus on two cases, where the number of observations with $x = 0 \wedge y = 0$ ($n_1$) changes from 1000 to 100,000, while the numbers of observations with $x = 1 \wedge y = 0$ ($n_2$) and $x = 0 \wedge y = 1$ ($n_3$) remain fixed at 50.

For the first case, while both *y* and *x* are still rare events, estimates of the effect of *x* remain in the same direction as the separation due to the relatively low overall number of observations. The magnitudes of the point estimates are similar across models. Examining the distribution of the estimates, we can see that both have a similar amount of mass in the direction of the separation.

Things change considerably in the second case. The point estimate obtained using Firth's PMLE/Jeffreys prior is now in the opposite direction of the separation. Furthermore, the majority of the estimate's distribution is also in the opposite direction of the separation, with approximately 92 percent of the mass of the profile likelihood being positive. Estimates obtained using the Cauchy prior do not exhibit this behavior. However, there are differences in the distribution of the posterior depending upon the choice of estimation algorithm.^{7}
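The sign change between the two cases can be illustrated with the add-0.5 equivalence described earlier for the single covariate case. The sketch below reproduces only the direction of the Firth point estimate, not the profile likelihood or the posterior distributions shown in Figure 3; the function name `firth_log_odds_ratio` and the cell labels (`n1`: x=0, y=0; `n2`: x=1, y=0; `n3`: x=0, y=1; `n4`: x=1, y=1) are assumptions made for illustration.

```python
import math


def firth_log_odds_ratio(n1, n2, n3, n4=0):
    """Coefficient on x implied by adding 0.5 to each cell of the 2 x 2 table."""
    return math.log((n4 + 0.5) * (n1 + 0.5) / ((n2 + 0.5) * (n3 + 0.5)))


# The two cases discussed in the text: n2 = n3 = 50, with n1 = 1,000 or 100,000.
small_case = firth_log_odds_ratio(1_000, 50, 50)
large_case = firth_log_odds_ratio(100_000, 50, 50)
print(f"n1 =   1,000: adjusted coefficient on x = {small_case:+.3f}")
print(f"n1 = 100,000: adjusted coefficient on x = {large_case:+.3f}")

# Smallest n1 at which the adjusted coefficient turns positive, holding n2 = n3 = 50.
first_positive = next(
    n1 for n1 in range(1, 10_001) if firth_log_odds_ratio(n1, 50, 50) > 0)
print(f"sign flips to positive at n1 = {first_positive}")
```

Holding the two rare cells at 50, the adjusted coefficient is negative for the smaller sample and positive for the larger one, so the reversal is driven purely by the growth of the $x = 0 \wedge y = 0$ cell.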

In summary, a seemingly innocuous choice of estimator to deal with separation has an important consequence: coefficient estimates in the opposite direction of the separation can be obtained. Researchers' prior beliefs may suggest this is an appropriate inference to make, for example because, given enough time, we would likely observe a case in the empty cell and so arrive at this relationship. A reading of the prior literature, however, suggests this is not the case.

## 3. Empirical illustration: sanctions and interstate war

In this section, I illustrate the occurrence of this phenomenon using data on economic sanctions and interstate war. Economic sanctions are a tool used by states to facilitate and improve outcomes in bargaining situations (Morgan and Schwebach, 1997). Yet historical cases suggest that sanctions may lead to war. For example, the imposition of extensive sanctions by the United States upon Japan in 1941 is a common explanation for Japan initiating war with the United States in that same year.^{8}

Studying such interstate relations with dyadic data often means confronting rare events for both the dependent and independent variables. As displayed in Table 3, both the onset of sanctions and war between states are rare events. Furthermore, there is quasi-complete negative separation: there are no observations where a sanction onset and war both occur.

To illustrate how the choice of prior would affect inferences about the effect of sanctions, I combined dyadic data on economic sanctions collected by Hufbauer *et al.* (1990) and Elliott *et al.* (2007), and collated by Hafner-Burton and Montgomery (2008), with dyadic data on interstate war collected by Bell and Miller (2015). After removing missing values, this results in a dataset with approximately 370,000 observations for 12,268 dyads from 1951 to 1999.

The results displayed in Table 4 show that using Firth's PMLE/Jeffreys prior yields a coefficient for the onset of sanctions that is opposite to the direction of the separation. Furthermore, this effect is classified as statistically significant at conventional levels, for models both with and without controls (models 4 and 1, respectively). Figure 4 displays the appropriate measures of uncertainty for estimates of the effect of sanction onsets upon war. We can see that the majority of the profile likelihood's mass is in the opposite direction of the separation when using Firth's PMLE. Taken at face value, the effect of sanction onsets estimated using Firth's PMLE would be interpreted as sanctions *increasing* the probability of conflict, even though no such cases occur in the data. The use of the Cauchy prior avoids this problem: the posterior distribution resembles the Cauchy prior, reflecting that the data contain little information about the effect of sanctions upon war.

*Notes to Table 4:* \*\*\* *p* < 0.005, \*\* *p* < 0.05, \* *p* < 0.1. Significance is based upon the p-value, or upon whether zero falls into the associated credible interval. For models using Cauchy priors, all variables are centered. These are further divided by two times their standard deviation if continuous.

Therefore, researchers should be wary of using Firth's PMLE as a default solution to separation in binary choice models. In cases where researchers are dealing with rare events and large amounts of data, such an approach can result in misleading inferences. Rather, researchers should be mindful of the (implicit) choice of prior when they are attempting to deal with separation. If there has to be a default, they should use independent “weakly-informative” prior distributions centered at zero, such as the Cauchy prior suggested by Gelman *et al.* (2008).

## 4. Conclusion

In this paper, I have shown that the commonly suggested PMLE approach to separation (Zorn, 2005) can lead to statistically significant point estimates opposite to the direction of separation. This occurs when confronted with separation in datasets with a large number of observations where the dependent and independent variables of interest are rare events.

Therefore, Firth's PMLE/Jeffreys prior is not necessarily a suitable default choice for dealing with separation when the variables involved are rare events. Independent “weakly-informative” prior distributions centered at zero, such as the Cauchy prior suggested by Gelman *et al.* (2008), are one means of ensuring that point estimates remain in the same direction as the separation. Even so, and echoing Rainey (2016), researchers should be mindful of how prior choices shape uncertainty estimates in such cases.

More generally, researchers should be mindful that the prior used to deal with separation contains specific information with consequences for the possible direction of parameter estimates, and that this needs to be justified *a priori*. In particular, the use of data-dependent priors, such as Jeffreys prior, requires researchers to be mindful of the particular features of their data, how these translate into the prior, and thus how they influence the parameters of interest. In many circumstances, researchers would benefit from instead formulating priors based upon theory and previous evidence. For example, researchers may face cases where separation occurs because it is impossible for the effect of a variable to have a certain sign. In such cases, researchers may be comfortable assigning zero prior, and thus posterior, mass for a given effect size direction or to functions of the posterior distribution.^{9} At a minimum, however, researchers should be cognizant of the implications about prior information that default solutions to separation make, how these influence their estimates, and whether they are appropriate to the problem at hand.

## Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2020.46.