Estimating logit models with small samples

Abstract In small samples, maximum likelihood (ML) estimates of logit model coefficients have substantial bias away from zero. As a solution, we remind political scientists of Firth's (1993, Biometrika, 80, 27–38) penalized maximum likelihood (PML) estimator. Prior research has described and used PML, especially in the context of separation, but its small sample properties remain under-appreciated. The PML estimator eliminates most of the bias and, perhaps more importantly, greatly reduces the variance of the usual ML estimator. Thus, researchers do not face a bias-variance tradeoff when choosing between the ML and PML estimators—the PML estimator has a smaller bias and a smaller variance. We use Monte Carlo simulations and a re-analysis of George and Epstein (1992, American Political Science Review, 86, 323–337) to show that the PML estimator offers a substantial improvement in small samples (e.g., 50 observations) and noticeable improvement even in larger samples (e.g., 1000 observations).

where y represents a vector of binary outcomes, X represents a matrix of explanatory variables and a constant, b represents a vector of model coefficients, and g −1 represents some inverse-link function that maps R into [0, 1]. When g −1 represents the inverse-logit function logit −1 (a) = 1/(1 + e −a ) or the cumulative normal distribution function F(a) = a −1 (1/ 2p √ )e −(x 2 /2) dx, then we refer to Equation 1 as a logit or probit model, respectively. To simplify the exposition, we focus on logit models because the canonical logit link function induces nicer theoretical properties (McCullagh and Nelder, 1989: 31-32). In practice, though, Kosmidis and Firth (2009) show that the ideas we discuss apply equally well to probit models.
To The researcher can obtain the ML estimateb mle by finding the vector b that maximizes log L given y and X (King, 1998). 1 The ML estimator has excellent properties in large samples. It is asymptotically unbiased, so that E(b mle ) ≈ b true when the sample is large (Casella and Berger, 2002: 470;Wooldridge, 2002: 391-395). It is also asymptotically efficient, so that the asymptotic variance of the ML estimate obtains the Cramer-Rao lower bound (Casella and Berger, 2002: 472, 516;Greene, 2012: 513-523). For small samples, though, the ML estimator of logit model coefficients does not work well-the ML estimates have substantial bias away from zero (Long, 1997: 53-54). Long (1997: 54) offers a rough heuristic about appropriate sample sizes: "It is risky to use ML with samples smaller than 100, while samples larger than 500 seem adequate." 2 This presents the researcher with a problem: When dealing with small samples, how can she obtain reasonable estimates of logit model coefficients?
2. An easy solution for the big problem The statistics literature offers a simple solution to the problem of bias. Firth (1993) suggests penalizing the usual likelihood function L(b|y) by a factor equal to the square root of the determinant of the information matrix |I(b)| 1/2 , which produces a "penalized" likelihood function L * (b|y) = L(b|y)|I(b)| 1/2 (see also Kosmidis and Firth 2009;Kosmidis 2014). 3 It turns out that this penalty is equivalent to Jeffreys' (1946) prior for the logit model (Firth 1993;Poirier 1994). We take the natural logarithm of both sides to obtain the penalized log-likelihood function: log L * (b|y) = In practice, we use iteratively reweighted least squares to fit logit models in this paper, the default behavior of glm() in R. 2 Making the problem worse, King and Zeng (2001) point out that ML estimates have substantial bias for much larger sample sizes if the event of interest occurs only rarely. 3 The statistics literature offers other approaches to bias reduction and correction as well. See Kosmidis (2014) for a useful overview.
Then the researcher can find the PML estimateb pmle by finding the vector b that maximizes log L * . 4,5 Zorn (2005) suggests PML for solving the problem of separation, but the broader and more important application to small sample problems seems to remain under-appreciated in political science. 6 Most importantly, though, we show that the small-sample bias should be of secondary concern to the small-sample variance of ML. Fortunately, PML reduces both the bias and the variance of the ML estimates of logit model coefficients.
A researcher can implement PML as easily as ML, but PML estimates of logit model coefficients have a smaller bias (Firth, 1993) and a smaller variance (Copas, 1988;Kosmidis 2007: 49). 7 This is important. When choosing among estimators, researchers often face a tradeoff between bias and variance (Hastie et al., 2013: 37-38), but there is no bias-variance tradeoff between ML and PML estimators. The PML estimator exhibits both lower bias and lower variance.
Two concepts from statistical theory illuminate the improvement offered by the PML estimator over the ML estimator. Suppose two estimators A and B, with a quadratic loss function so that the risk functions R A and R B (i.e., the expected loss) correspond to the MSE. If R A ≤ R B for all parameter values and the inequality holds strictly for at least some parameter values, then we can refer to estimator B as inadmissible and say that estimator A dominates estimator B (Leonard and Hsu, 1999: 143-146;DeGroot and Schervish, 2012: 458). Now suppose a quadratic loss function for the logit model coefficients, such that In this case, the inequality holds strictly for all b true so that R pmle , R mle . For logit models, then, we can describe the ML estimator as inadmissible and say that the PML estimator dominates the ML estimator.
The intuition of the bias reduction is subtle. First, consider the source of the bias, illustrated in Figure 1. Calculate the score function s as the gradient (or first-derivative) of the log-likelihood with respect to b so that s(y, b) = ∇ log L(b|y). Note that solving s(y,b mle ) = 0 is equivalent to findingb mle that maximizes log L(b|y). Now recall that at the true parameter vector b true , the expected value of the score function is zero so that E[s(y, b true )] = 0 (Greene, 2012: 517). By the law of total probability, this implies that which is highlighted by (i) in Figure 1. 4 For details on numerical optimization of the log-likelihood and penalized log-likelihood functions, we refer readers to an especially clear discussion in Gill (2000, Ch. 5) and the references therein and Kosmidis and Firth (2009: 799). 5 One might also consider least squares (LS) in the context of small samples, even with a binary outcome. Indeed, in testing the null hypotheses that an explanatory variable has no effect, LS and PML (with profile likelihood tests) perform nearly identically. However, for a variety of reasons, researchers might like to obtain approximately unbiased estimates of logit coefficients. We take this assumption as a starting point of our paper. Depending on the data, the correct model, and quantity of interest, LS might offer a reasonable or terrible alternative to a logit model estimated with ML. For example, LS does a reasonable job of estimating the average marginal effect (across all observations), but a horrible job of estimating risk ratios. Given that substantive political scientists are increasingly critical of analyses that focus exclusively on p-values (King et al., 2000;Rainey, 2014;Gross, 2015;McCaskey and Rainey, 2015), we view LS only as a non-default choice that only works well in special cases.
However, if the score function s is decreasing and curved in the area around b true j so that s ′′ j = ∂ 2 s(y, b)/∂ 2 b j . 0 (see Figure 1), then a high miss s(y, b true ) . 0 (top dashed line in Figure 1) implies an estimate well above the true value, so thatb mle ≫ b true , and a low miss (bottom dashed line in Figure 1) s(y, b true ) , 0 implies an estimate only slightly below the true value, so thatb mle , b true . In effect, the curved (convex) score function pulls low misses closer the true value and pushes high misses even further from the true value. This dynamic, which is highlighted by (ii) in Figure 1, implies that low misses and high misses do not cancel and that the ML estimate is too large on average. A similar logic applies for s ′′ j , 0. Therefore, due to the curvature in the score function s, the high and low misses ofb mle do not cancel out, so that E(b mle j ) . b true j when s ′′ j . 0 and E(b mle ) , b true j when s ′′ j , 0. Cox and Snell (1968: 251-252) derive a formal statement of this bias of order n −1 , which we denote as bias n −1 (b true ). Now consider the bias reduction strategy. At first glance, one may simply decide to subtract bias n −1 (b true ) from the estimateb mle . However, note that the bias depends on the true parameter. Because researchers do not know the true parameter, this is not the most effective strategy. 8 However, Firth (1993) suggests modifying the score function, so that s * (y, b) = s(y, b) − g(b), where g shifts the score function upward or downward. Firth (1993) shows that one good choice of g takes g j = (1/2)trace[I −1 (∂I/∂b j )] = (∂/∂b j )( log |I(b)|). Integrating, we can see that solving s * (y,b pmle ) = 0 is equivalent to findingb pmle that maximizes log L * (b|y) with respect to b.
The intuition of the variance reduction is straightforward. Because PML shrinks the ML estimates toward zero, the PML estimates must have a smaller variance than the ML estimates. If we imagine the PML estimates as trapped between zero and the ML estimates, then the PML estimates must be less variable. What can we say about the relative performance of the ML and PML estimators? Theoretically, we can say that the PML estimator dominates the ML estimator because the PML estimator has lower bias and variance regardless of the sample size. That is, the PML estimator always outperforms the ML estimator, and least in terms of the bias, variance, and MSE. However, both estimators are asymptotically unbiased and efficient, so the difference between the two estimators becomes negligible as the sample size grows large. In small samples, though, Monte Carlo simulations show substantial improvements that should appeal to substantive researchers.

The big improvements from an easy solution
To show that the size of reductions in bias, variance, and MSE should draw the attention of substantive researchers, we conduct a Monte Carlo simulation comparing the sampling distributions of the ML and PML estimates. These simulations demonstrate three features of the ML and PML estimators: (1) In small samples, the ML estimator exhibits a large bias. The PML estimator is nearly unbiased, regardless of sample size.
(2) In small samples, the variance of the ML estimator is much larger than the variance of the PML estimator. (3) The increased bias and variance of the ML estimator implies that the PML estimator also has a smaller MSE. Importantly, though, the variance makes a much greater contribution to the MSE than the bias.
In our simulation, the true data generating process corresponds to 2x j , and we focus on the coefficient for x 1 as the coefficient of interest. We draw each fixed x j independently from a normal distribution with mean of zero and standard deviation of one and vary the sample size N from 30 to 210, the number of explanatory variables k from 3 to 6 to 9, and the intercept b cons from −1 to −0.5 to 0 (which, in turn, varies the proportion of events P cons from about 0.28 to 0.38 to 0.50). 9 Each parameter in our simulation varies the amount of information in the data set. The biostatistics literature uses the number of events per explanatory variable (1/k) y i as a measure of the information in the data set (e.g., Peduzzi et al., 1996;Vittinghoff and McCulloch, 2007), and each parameter of our simulation varies this quantity, where (N × P cons )/k ≈ (1/k) y i . For each combination of the simulation parameters, we draw 50,000 data sets and use each data set to estimate the logit model coefficients using ML and PML. To avoid an unfair comparison, we exclude the ML estimates where separation occurs (Zorn, 2005). We keep all the PML estimates. Replacing the ML estimates with the PML estimates when separation occurs dampens the difference between the estimators and keeping all the ML estimates exaggerates the differences. From these estimates, we compute the percent bias and variance of the ML and PML estimators, as well as the MSE inflation of the ML estimator compared to the PML estimator.

Bias
We calculate the percent bias = 100% × (E(b)/b true − 1) as the intercept b cons , the number of explanatory variables k, and the sample size N vary. Figure 2 shows the results. The sample size varies across the horizontal-axis of each plot and each panel shows a distinct combination of intercept and number of variables in the model. Across the range of the parameters of our sample, the bias of the ML estimate varies from about 122 percent (b cons = −0.5, k = 9, and N = 30) to around 3 percent (b cons = −1, k = 3, and N = 210). The bias in the PML estimate, on the other hand, is much smaller. For the worst-case scenario (b cons = −0.5, k = 9, and N = 30), the ML estimate has an upward bias of about 122 percent, while the PML estimate has an upward bias of only about 12 percent. 10

Variance
In many cases, estimators tradeoff bias and variance, but the PML estimator reduces both. In addition to nearly eliminating the bias, Figure 3 shows that the PML estimator also substantially reduces the variance, especially for the smaller sample sizes. For b cons = −1 and N = 30, the variance of the ML estimator is about 99, 178, and 436 percent larger than the PML estimator for three, six, and nine variables, respectively. Doubling the sample size to N = 60, the variance remains about 25, 58, and 117 percent larger, respectively. Even for a larger sample of N = 210, the variance of the ML estimator is about 6, 10, and 14 percent larger than the PML estimator. 11

Mean-squared error
However, neither the bias nor the variance serves as a complete summary of the performance of an estimator. The MSE, though, combines the bias and variance into an overall measure of the accuracy, where (3) Since the bias and the variance of the ML estimator exceeds the bias and variance the PML estimator, the ML estimator must have a larger MSE, so that MSE(b mle ) − MSE(b pmle ) . 0. The online appendix shows the expected value and (absolute) bias of these estimates. 11 The online appendix shows the variance inflation = 100% × (Var(b mle )/Var(b pmle ) − 1).

554
C. Rainey and K. McCaskey We care about the magnitude of this difference, though, not the sign. To summarize the magnitude, we compute the percent increase in the MSE of the ML estimator compared to the PML estimator. We refer to this quantity as the "MSE inflation," where An MSE inflation of zero indicates that the ML and PML estimators perform equally well, but because the PML estimator dominates the ML estimator, the MSE inflation is strictly greater than zero. Figure 4 shows the MSE inflation for each combination of the parameter simulations on the log 10 scale. Notice that for the worst-case scenario (b cons = −1, k = 9, and N = 30), the MSE of the ML estimates is about 721 percent larger than the MSE of the PML estimates. The MSE inflation only barely drops below 10 percent for the most information-rich parameter combinations (e.g.,  estimator to obtain the contribution of each. But notice that we can easily compare the relative contributions of the bias and variance using the ratio relative contribution of variance = contribution of variance contribution of bias . (5) Figure 5 shows the relative contribution of the variance. Values less than 1 indicate that the bias makes a greater contribution and values greater than 1 indicate that the variance makes a greater contribution. In each case, the relative contribution of the variance is much larger than 1. For N = 30, the contribution of the variance is between 4 and 17 times larger than the contribution of the bias. For N = 210, the contribution of the variance is between 27 and 166 times larger than the contribution of the bias. In spite of the attention paid to the small sample bias in ML estimates of logit model coefficients, the small sample variance is a more important problem to address, at least in terms of the accuracy of the estimator. Fortunately, the PML estimator greatly reduces the bias and variance, resulting in a much smaller MSE, especially for small samples. These simulation results show that the bias, variance, and MSE of the ML estimates of logit model coefficients are not trivial in small samples. Researchers cannot safely ignore these problems. Fortunately, researchers can implement the PML estimator with little to no added effort and obtain substantial improvements over the usual ML estimator. And these improvements are not limited to Monte Carlo studies. In the example application that follows, we show that the PML estimator leads to substantial reductions in the magnitude of the coefficient estimates and in the width of the confidence intervals.

The substantive importance of the big improvements
To illustrate the substantive importance of using the PML estimator, we reanalyze a portion of the statistical analysis in George and Epstein (1992). 12 We re-estimate the integrated model of US  (1992) and find substantial differences in the ML and PML coefficient estimates and the confidence intervals.

Supreme Court decisions developed by George and Epstein
George and Epstein (1992) combine the legal and extralegal models of Court decision-making in order to overcome the complementary idiosyncratic shortcomings of each. The legal model posits stare decisis, or the rule of law, as the key determinant of future decisions, while the extralegal model takes a behavioralist approach containing an array of sociological, psychological, and political factors.
The authors model the probability of a conservative decision in favor of the death penalty as a function of a variety of legal and extralegal factors. George and Epstein use a small sample of 64 Court decisions involving capital punishment from 1971 to 1988. The data set has only 29 events (i.e., conservative decisions). They originally use the ML estimator, and we reproduce their estimates exactly. For comparison, we also estimate the model with the PML estimator that we recommend. Figure 6 shows the coefficient estimates for each method. In all cases, the PML estimate is smaller than the ML estimate. Each coefficient decreases by at least 25 percent with three decreasing by more than 40 percent. Additionally, the PML estimator substantially reduces the width of all the confidence intervals. Three of the 11 coefficients lose statistical significance.
Because we do not know the true model, we cannot know which of these sets of coefficients is better. However, we can use out-of-sample prediction to help adjudicate between these two methods. We use leave-one-out cross-validation and summarize the prediction errors using Brier and log scores, for which smaller values indicate better predictive ability. 13 The ML estimates produce a Brier score of 0.17, and the PML estimates lower the Brier score by 7 percent to 0.16. Similarly, the ML estimates produce a log score of 0.89, while the PML estimates lower the log score by 41 percent to 0.53. The PML estimates outperform the ML estimates for both approaches to scoring, and this provides good evidence that the PML estimates better capture the data generating process.
Because we estimate a logit model, we are likely more interested in the functions of the coefficients rather than the coefficients themselves (King et al., 2000). For an example, we take George

13
The Brier score is calculated as n i=1 (y i − p i ) 2 , where i indexes the observations, y i [ {0, 1} represents the actual outcome, and p i [ (0, 1) represents the estimated probability that y i = 1. The log score as − n i=1 log (r i ), where r i = y i p i + (1 − y i )(1 − p i ). Notice that because we are logging r i [ [0, 1], n i=1 log (r i ) is always negative and smaller (i.e., more negative) values indicate worse fit. Notice that we use the negative of n i=1 log (r i ), so that, like the Brier score, larger values indicate a worse fit. and Epstein's integrated model of Court decisions and calculate a first difference and risk ratio as the repeat-player status of the state varies, setting all other explanatory variables at their sample medians. George and Epstein hypothesize that repeat players have greater expertise and are more likely to win the case. Figure 7 shows the estimates of the quantities of interest.
The PML estimator pools the estimated probabilities toward one-half. When the state is not a repeat player, the PML estimates suggest a 17 percent chance of a conservative decision while ML estimates suggest only a 6 percent chance. However, when the state is a repeat player, the PML estimates suggest that the Court has a 69 percent chance of a conservative decision compared to the 68 percent chance suggested by ML. Thus, PML also provides smaller effect sizes for both the first difference and the risk ratio. PML decreases the estimated first difference by 17 percent from 0.63 to 0.52 and the risk ratio by 67 percent from 12.3 to 4.0.
This example application clearly highlights the differences between the ML and PML estimators. The PML estimator shrinks the coefficient estimates and confidence intervals substantially. Theoretically, we know that these estimates have a smaller bias, variance, and MSE. Practically, though, this shrinkage manifests in smaller coefficient estimates, smaller confidence intervals, and better out-of-sample predictions. And these improvements come at almost no cost to researchers. The PML estimator is nearly trivial to implement but dominates the ML estimator-the PML estimator always has lower bias, lower variance, and lower MSE.

Recommendations to substantive researchers
Throughout this paper, we emphasize one key point-when using small samples to estimate logit and probit models, the PML estimator offers a substantial improvement over the usual ML estimator. But what actions should substantive researchers take in response to our methodological point? In particular, at what sample sizes should researchers consider switching from the ML estimator to the PML estimator?

Concrete advice about sample sizes
Prior research suggests two rules of thumb about sample sizes. First, Peduzzi et al. (1996) recommend about ten events per explanatory variable, although Vittinghoff and McCulloch (2007) suggest relaxing this rule. Second, Long (1997: 54) suggests that "it is risky to use ML with samples smaller than 100, while samples larger than 500 seem adequate." In both of these cases, the alternative to a logit or probit model seems to be no regression at all. Here, though, we present the PML estimator as an alternative, so we have room to make more conservative recommendations.
On the grounds that the PML estimator dominates the ML estimator, we might recommend that researchers always use the PML estimator. But we do not want or expect researchers to switch from the common and well-understood ML estimator to the PML estimator without a clear, meaningful improvement in the estimates. Although the PML estimator is theoretically superior, it includes practical costs. Indeed, even though the R packages brglm (Kosmidis, 2017a), brglm2 (Kosmidis, 2017b), and logistf (Heinze and Ploner, 2016) and the Stata module FIRTHLOGIT (Coveney, 2015) make fitting these models quick and easy, this approach might require researchers to write custom software to compute quantities of interest. Furthermore, the theory behind the approach is less familiar to most researchers and their readers. Finally, these models are much more computationally demanding. Although the computational demands of a single PML fit are trivial, the costs might become prohibitive for procedures requiring many fits, such as bootstrap or Monte Carlo simulations. With these practical costs in mind, we use a Monte Carlo simulation to develop rules of thumb that link the amount of information in the data set to the cost of using ML rather than PML.
We measure the cost of using ML rather than PML as the MSE inflation defined in Equation 4: the percent increase in the MSE when using ML rather than PML. The MSE inflation summarizes the relative inaccuracy of the ML estimator compared to the PML estimator.
To measure the information in a data set, the biostatistics literature suggests using the number of events per explanatory variable (1/k) y i (e.g., Peduzzi et al. 1996;Vittinghoff and McCulloch 2007). However, we modify this metric slightly and consider the minimum of the number of events and the number of non-events. This modified measure, which we denote as j, has the attractive property of being invariant to flipping the coding of events and non-events. Indeed, one could not magically increase the information in a conflict data set by coding peace-years as ones and conflict-years as zeros. With this in mind, we use a measure of information j that takes the minimum of the events and non-events per explanatory variable, so that (1 − y i ) .
The cost of using ML rather than PML decreases continuously with the amount of information in the data set, but to make concrete suggestions, we break the costs into three categories: substantial, noticeable, and negligible. We use the following cutoffs: (1) Negligible: If the MSE inflation probably falls below 3 percent, then we refer to the cost as negligible.
(2) Noticeable: If the MSE inflation of ML probably falls below 10 percent, but not probably below 3 percent, then we refer to the cost as noticeable.
(3) Substantial: If the MSE inflation of ML might rise above 10 percent, then we refer then the cost as substantial.
To develop our recommendations, we estimate the MSE inflation for a wide range of hypothetical analyses across which the true coefficients, the number of explanatory variables, and the sample size varies.
To create each hypothetical analysis, we do the following: (1) Choose the number of covariates k randomly from a uniform distribution from 3 to 12.
(2) Choose the sample size n randomly from a uniform distribution from 200 to 3000.
(3) Choose the intercept b cons randomly from a uniform distribution from −4 to 4.
(4) Choose the slope coefficients b 1 , . . . , b k randomly from a normal distribution with mean 0 and standard deviation 0.5. (5) Choose a covariance matrix S for the explanatory variables randomly using the method developed by Joe (2006) such that the variances along the diagonal range from 0.25 to 2. (6) Choose the explanatory variables x 1 , x 2 , . . . , x k randomly from a multivariate normal distribution with mean 0 and covariance matrix S. (7) If these choices produce a data set with (a) less than 20 percent events or non-events or (b) separation (Zorn 2005), then we discard this analysis.
Note that researchers should not apply our rules of thumb to rare events data (King and Zeng, 2001), because the recommendations are overly conservative in that scenario. As the sample size increases, the MSE inflation drops, even if j is held constant. By the design of our simulation study, our guidelines apply to events more common than 20 percent and less common than 80 percent. However, researchers using rare events data should view our recommendations as conservative; as the sample size increases, the MSE inflation tends to shrink relative to j. 14 For each hypothetical analysis, we simulate 2000 data sets and compute the MSE inflation of the ML estimator relative to the PML estimator using Equation 3. We then use quantile regression to model the 90th percentile as a function of the information j in the data set. This quantile regression allows us to estimate the amount of information that researchers need before the MSE inflation "probably" (i.e., about a 90 percent chance) falls below some threshold. We then calculate the thresholds at which the MSE inflation probably falls below 10 and 3 percent. Table 1 shows the thresholds and Figure 8 shows the MSE for each hypothetical analysis and the quantile regression fits.
Interestingly, ML requires more information to estimate the intercept b cons accurately relative to PML than the slope coefficients b 1 , . . . , b k (see King and Zeng, 2001). Because of this, we calculate the cutoffs separately for the intercept and slope coefficients.
If the researcher simply wants accurate estimates of the slope coefficients, then she risks substantial costs when using ML with j ≤ 16 and noticeable costs when using ML with j ≤ 57. If the researcher also wants an accurate estimate of the intercept, then she risks substantial costs when using ML with j ≤ 35 and noticeable costs when using ML when j ≤ 92. Researchers should treat these thresholds as rough rules of thumb-not strict guidelines. Indeed, the MSE inflation depends on a complex interaction of many features of the analysis, including the number of covariates, their distribution, the magnitude of their effects, the correlation among them, and the sample size. However, these (rough) rules accomplish two goals. First, they provide researchers with a rough idea of when the choice of estimator might matter. Second, they highlight that analysis with samples typically considered "large enough" for ML might benefit from using PML instead.
Importantly, the cost of ML only becomes negligible for all model coefficients when j . 92this threshold diverges quite a bit from the prior rules of thumb. For simplicity, assume the researcher wants to include eight explanatory variables in her model. In the best case scenario of 50 percent events, she should definitely use the PML estimator with fewer than 8×16 0.5 = 256 observations and ideally use the PML estimator with fewer than 8×57 0.5 = 912 observations. But if she would also like accurate estimates of the intercept, then these thresholds increase to 8×35 0.5 = 560 and 8×92 0.5 = 1, 472 observations. Many logit and probit models estimated using survey data have fewer than 1500 observations and these studies risk a noticeable cost by using the ML estimator rather than the PML estimator. Furthermore, these estimates assume 50 percent events. As the number of events drifts toward 0 or 100 percent or the number of variables increases, then the researcher needs even more observations.

Concrete advice about estimators
When estimating a model of a binary outcome with a small sample, a researcher faces several options. First, she might avoid analyzing the data altogether because she realizes that ML estimates of logit model coefficients have significant bias. We see this as the least attractive option. Even small data sets contain information and avoiding these data sets leads to a lost opportunity.
Second, the researcher might proceed with the biased and inaccurate estimation using ML. We also see this option as unattractive, because simple improvements can dramatically shrink the bias and variance of the estimates.
Third, the researcher might use least squares to estimate a linear probability model (LPM). If the probability of an event is a linear function of the explanatory variables, then this approach is reasonable, as long as the researcher takes steps to correct the standard errors. However, in most cases, using an "S"-shaped inverse-link function (i.e., logit or probit) makes the most theoretical sense, so that marginal effects shrink toward zero as the probability of an event approaches zero or one (e.g., Long, 1997: 34-47;Berry et al., 2010). Long (1997: 40) writes: "In my opinion, the most serious problem with the LPM is its functional form." Additionally, the LPM sometimes produces nonsense probabilities that fall outside the [0, 1] interval and nonsense risk ratios that fall below zero. If the researcher is willing to accept these nonsense quantities and assume that the functional form is linear, then the LPM offers a reasonable choice. However, we agree with Long (1997) that without evidence to the contrary, the logit or probit model offers a more plausible functional form.
Fourth, the researcher might use a bootstrap procedure (Efron, 1979) to correct the bias of the ML estimates. Although in general the bootstrap presents a risk of inflating the variance when correcting the bias (Efron and Tibshirani, 1993: esp. 138-139), simulations suggest that the procedure works comparably to PML in some cases for estimating logit model coefficients. However, the bias-corrected bootstrap has a major disadvantage. When a subset of the bootstrapped data sets have separation (Zorn, 2005), which is highly likely with small data sets, then the bootstrap procedure produces unreliable estimates. In this scenario, the bias and variance can be much larger than even the ML estimates and sometimes wildly incorrect. Given the extra complexity of the bootstrap procedure and the risk of unreliable estimates, the bias-corrected bootstrap is not particularly attractive.
Finally, the researcher might simply use PML, which allows the theoretically-appealing "S"-shaped functional form and fast estimation while greatly reducing the bias and variance. Indeed, the PML estimates always have a smaller bias and variance than the ML estimates. These substantial improvements come at almost no cost to the researcher in learning new concepts or software beyond ML and simple commands in R and/or Stata. 15 We see this as the most attractive option. Whenever researchers have concerns about bias and variance due to a small sample, a simple switch to a PML estimator can quickly ameliorate any concerns with little to no added difficulty for researchers or their readers.