SECOND-ORDER BIAS REDUCTION FOR NONLINEAR PANEL DATA MODELS WITH FIXED EFFECTS BASED ON EXPECTED QUANTITIES

In many nonlinear panel data models with fixed effects maximum likelihood estimators suffer from the incidental parameters problem, which often entails that point estimates are markedly biased. While the recent literature has mostly generated methods that yield a first-order bias reduction relative to maximum likelihood, we derive a first- and second-order bias correction of the profile likelihood based on “expected quantities” which differs from the corresponding correction based on “sample averages” derived in Dhaene and Sun (2021, Journal of Econometrics 220, 227–252). While consistency and asymptotic normality of our estimator are derived in a setting where both the number of individuals and the number of time periods grow to infinity, we illustrate in a simulation study that our second-order bias reduction indeed yields an estimator with substantially improved small sample properties relative to its first-order unbiased counterpart, especially when less than 10 time periods are available.


INTRODUCTION
Panel data models are popular among economists and other social scientists, since besides observed characteristics, they allow researchers to account for unobserved individual-specific attributes that do not vary over time. If the unobserved time-invariant individual heterogeneity is allowed to be arbitrarily correlated with observed explanatory variables, it is often referred to as "fixed effects." Failure to account for the presence of fixed effects may therefore lead to biased estimates and incorrect inference in econometric applications.
Many panel datasets for microeconomic analysis are "short," i.e., the number of individuals (n) is much larger than the number of time periods (T). Therefore, the asymptotic theory has often been derived under the assumption that n → ∞ while T is kept fixed. In this asymptotic framework, it is well known how to estimate panel data models in which the unobserved heterogeneity enters linearly and additively, without the need of making distributional assumptions (cf., e.g., Chamberlain, 1982, 1984; Arellano, 2003; Wooldridge, 2010). However, stronger parametric distributional assumptions are typically needed whenever the fixed effects enter the model nonlinearly. Yet, even then only a few nonlinear models can be satisfactorily handled when the number of time periods is fixed, since the incidental parameters problem (IPP), first noted by Neyman and Scott (1948), often renders the maximum likelihood estimator (MLE) inconsistent.
Alternatively, one may focus on estimation approaches in which both n and T are allowed to grow to infinity. Yet, even in this asymptotic setting, the presence of fixed effects leads to an incorrectly centered asymptotic distribution of the MLE unless n/T → 0. Since Hahn and Kuersteiner (2002), who proposed a bias correction for linear dynamic models, numerous corrections of the first-order bias (FOB), i.e., the leading bias term of order O(T^{-1}), have been proposed to mitigate the IPP (for reviews of the literature, see, for instance, Arellano and Hahn, 2007 or Fernández-Val and Weidner, 2018). Simulation studies suggest that for moderately large T, FOB corrected estimators perform better in small samples than uncorrected estimators. Moreover, FOB corrected estimators can sometimes be shown to be asymptotically correctly centered if n/T^3 → 0, so that n can be thought of as being "large" relative to T. Yet, when T is small (i.e., T < 10), even first-order corrected estimators exhibit some non-negligible bias (see, for instance, Section 5). It is therefore desirable to find methods that, besides eliminating the leading FOB term, also correct bias terms of higher order, both to allow for faster growth rates of n relative to T and to further improve the small-sample performance of the resulting estimators.
In this paper, we follow Arellano and Hahn (2016), henceforth AH, and focus on a bias correction of the "profile" or "concentrated" likelihood. Besides the FOB, our correction also removes the second-order bias (SOB) and thus leads to an objective function that is "second-order unbiased," i.e., its expectation coincides with the "target" likelihood up to an error of order O(T^{-3}). The maximizer of this objective function is shown to be consistent and asymptotically normal with a correctly centered distribution if n,T → ∞ such that n/T^5 → 0. A simulation study illustrates that our second-order corrected estimator substantially improves upon its first-order counterparts in terms of small sample performance and inference.
We further compare our approach to a closely related profile likelihood correction outlined in Dhaene and Sun (2021), henceforth DS, that also yields a second-order unbiased objective function. However, the approach in DS does not make full use of the known likelihood, since expectations are estimated based on "sample averages" (SA) rather than by "expected quantities" (EQ). We further illustrate that each approach relies on mutually exclusive properties: while DS use "score-factors," which is not possible with our approach based on EQ, the "centering" based approach presented here is not applicable for second-order bias corrections based on SA. Finally, we also provide some evidence that estimators that make full use of the known likelihood may be preferable in small samples as compared to estimators based on sample averages, albeit at the expense of higher computational complexity.
The rest of the paper is organized as follows: In Section 2, we introduce some notation and discuss the IPP before introducing our estimator together with an illustrative example. Section 3 discusses differences between corrections based on EQ and SA and highlights the main distinctions between our approach and DS. Assumptions and theory are provided in Section 4. Simulation results on the small sample performance of our estimator are presented in Section 5. Finally, Section 6 concludes. Additional tables as well as all proofs are collected in the Appendixes and the Online Supplementary Material for this paper.

SECOND-ORDER BIAS CORRECTION BASED ON EQ
For n,T ≥ 2, let Y_it denote an outcome variable and X_it a (column) vector of explanatory variables for i = 1,...,n and t = 1,...,T. Here, i refers to an individual, whereas t indexes time. While both Y_it and X_it are assumed to be observed, we assume that the fixed effect α_i0 is an unobserved random variable with unknown distribution. Moreover, for individual i, let Y_i := (Y_i1,...,Y_iT) denote the time series of outcomes and X_i := (X_i1,...,X_iT) the time series of explanatory variables. The time series of outcomes are drawn from the conditional distribution f_{Y_i|X_i,θ_0,α_i0}, which is known up to the p-dimensional parameter of interest θ_0 ∈ Θ. Note that this does not impose any restriction on the distribution of (X_i,α_i0); hence, we allow for arbitrary dependence between the observed explanatory variables and the unobserved fixed effect. The average (or scaled) loglikelihood for individual i depends on (θ,α_i) ∈ Θ × J and is defined as ℓ_i(θ,α_i) := T^{-1} log f_{Y_i|X_i,θ,α_i}(Y_i). By the "time-independence" specified in Assumption 4.1, we can further write ℓ_i(θ,α_i) = T^{-1} Σ_{t=1}^T ℓ_it(θ,α_i), where ℓ_it(θ,α_i) is the likelihood contribution of individual i in time period t. For expansions of the loglikelihood, we indicate the order of partial derivatives with respect to the vector θ by the first number in the index of the loglikelihood, whereas the order of partial derivatives with respect to the scalar α_i is registered by the second number in the index. More precisely, for m ∈ {0,1,...,9}, ℓ_{i0m}(θ,α_i) := ∂^m_{α_i} ℓ_i(θ,α_i) is a scalar, ℓ_{i1m}(θ,α_i) := ∂_θ ∂^m_{α_i} ℓ_i(θ,α_i) is a column vector, and ℓ_{i2m}(θ,α_i) := ∂_{θθ'} ∂^m_{α_i} ℓ_i(θ,α_i) is a p × p-dimensional matrix. Following the usual convention, ∂^0 is interpreted as the identity operator. Since we derive a correction of the profile likelihood bias conditional on covariates that is valid globally, i.e., for any θ ∈ Θ, we further introduce notation for taking expectations conditional on the covariates.^1 For the SOB correction of the profile likelihood of individual i, let τ_i := (γ', φ_i)' ∈ Θ × J.
We then let the expectation operator E_{τ_i} denote integration with respect to f_{Y_i|X_i,τ_i}, the conditional density evaluated at τ_i for individual i. For Section 4, where we derive the conditional bias of our objective function (which depends on preliminary estimators and thus on the data of the whole sample), let τ_0 := (θ_0', α_10,...,α_n0)' ∈ Θ × J × ··· × J and let E_{τ_0} denote integration with respect to f_{Y_1,...,Y_n|X_1,...,X_n,τ_0}, the joint density of outcomes, conditional on all explanatory variables, for the entire sample of individuals. Notice that if the integrand only depends on outcome data of individual i, E_{τ_0} reduces to an expectation with respect to the individual density f_{Y_i|X_i,θ_0,α_i0}, so that in this case E_{τ_0} coincides with E_{τ_i} with τ_i = (θ_0', α_i0)'.^2 The symbol E without subscript is reserved for unconditional expectations evaluated at the true parameters. Finally, we introduce a separate notation for centered and scaled likelihood terms.^3 In the following, we omit the argument τ_i when τ_i = (θ_0', α_i0)'.

Maximum Likelihood Estimation in the Presence of Fixed Effects
Since we assume that the distribution of Y_i|X_i,α_i is known up to θ_0, we can compute the MLE of θ_0 while treating α_i as a "nuisance parameter" to be estimated. Writing α̂_i(θ) := argmax_{α_i ∈ J} ℓ_i(θ,α_i) and calling ℓ_i(θ,α̂_i(θ)) the "profile likelihood" for individual i, we can compute the MLE of θ_0 in the presence of α_1,...,α_n as θ̂ := argmax_{θ ∈ Θ} n^{-1} Σ_{i=1}^n ℓ_i(θ,α̂_i(θ)). Unfortunately, the IPP often renders the MLE inconsistent if T is fixed while n → ∞. The MLE suffers from FOB, which leads to an incorrectly centered asymptotic distribution if n,T → ∞ such that n/T → c, where c is some positive constant. In order to remedy this problem, one may attempt to correct the profile likelihood so as to approximate the infeasible "target likelihood" ℓ_i(θ,α*_i(θ)) (which is score-unbiased), based on the "target value" or "population level MLE" α*_i(θ). As discussed in AH, the profile likelihood correction can be based on EQ (see AH, p. 259), where the FOB is calculated explicitly before replacing the unknown true parameters with plug-in estimates, and based on SA, where the FOB is estimated by replacing population means with sample means.^4

^2 Since E_{τ_i} and E_{τ_0} denote integration with respect to a conditional density, which thus results in a random variable, the same O_p(·)-notation as for the likelihood is used, even though the respective probability measures may differ. If expectations are taken unconditionally, we use standard O(·)-notation.

^3 When expectations and variances are defined conditional on explanatory variables and individual fixed effects, equalities and inequalities involving them hold w.p.1. In order to avoid the proliferation of "w.p.1" qualifiers, we do not state them explicitly for the remainder of the paper.

We now introduce our second-order corrected profile likelihood based on EQ before discussing the distinction between this correction and the one of DS, who provide a second-order correction of the profile likelihood based on SA. Notice that we omit the argument for α_i whenever the likelihood is evaluated at α*_i(θ) to shorten notation. For example, ℓ_{ikm}(θ) := ℓ_{ikm}(θ,α*_i(θ)). Moreover, we frequently omit both arguments when the likelihood is evaluated at (θ_0,α_i0). For instance, ℓ_{ikm} := ℓ_{ikm}(θ_0,α_i0).
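To make the profiling step concrete, the following is a minimal numerical sketch (ours, not the authors' code; all variable names are illustrative) in the Neyman–Scott variance model used as the running example in Section 2.3. For each candidate θ, the inner step computes α̂_i(θ), which in this model is the individual mean Ȳ_i for any θ, and the outer step maximizes the averaged profile likelihood over θ. The incidental parameters problem appears as the downward bias E[θ̂] = θ_0(T−1)/T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, theta0 = 200, 4, 1.0                     # true variance theta0 = sigma_0^2
alpha0 = rng.normal(0.0, 1.0, size=n)          # unobserved fixed effects
Y = alpha0[:, None] + rng.normal(0.0, np.sqrt(theta0), size=(n, T))

def profile_loglik(theta):
    """Average profile loglikelihood n^{-1} sum_i l_i(theta, alpha_i_hat(theta))."""
    a_hat = Y.mean(axis=1)                     # inner step: alpha_i_hat(theta) = Ybar_i
    rss = ((Y - a_hat[:, None]) ** 2).sum()
    return -0.5 * np.log(theta) - rss / (2.0 * theta * n * T)

# outer step: maximize the profiled objective over a grid of candidate theta values
thetas = np.linspace(0.05, 3.0, 2000)
theta_mle = thetas[int(np.argmax([profile_loglik(th) for th in thetas]))]

# closed-form check: the profiled MLE is the mean squared within-residual
theta_closed = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).mean()
```

With n = 200 and T = 4, theta_mle sits near θ_0(T−1)/T = 0.75 rather than near the true value 1.0, which is exactly the bias the corrections below are designed to remove.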

The Estimator
By expanding the profile likelihood, we find the first- and second-order profile likelihood bias terms, given in (2.1) and (2.2). A naive bias correction might then involve plugging the preliminary estimator β̂_i directly into these bias terms. However, this would induce a second-order bias due to the slow convergence of β̂_i. Thus, we use the corrected version given in (2.3). In this definition, partial derivatives of B^{(1)}_i(θ,β_i) with respect to the components of β_i are denoted by the corresponding parameters in the index. In the following, the β_i-argument is further omitted when evaluating at β_i0. The explicit expressions for all derivatives appearing in (2.3) are collected in Appendix A. The estimator θ̌ derived here is then obtained as the maximizer of a second-order bias corrected profile likelihood, defined in (2.4), i.e., it uses second-order unbiased approximations of the first- and second-order profile likelihood bias terms. The intuition behind the form of the SOB corrected profile likelihood based on EQ is provided in Section 4.2.

Example
We now show how the second-order bias correction based on EQ can be applied in the heterogeneous means model of Neyman and Scott (1948). In this simple setup, an explicit form of the bias and its correction can be derived.^5 Let Y_it = α_i0 + U_it and assume that U_i1,...,U_iT | α_i0 ~ NIID(0,σ²_0), where θ := σ² ∈ (0,∞). Ignoring constants, the loglikelihood can then be written as ℓ_i(θ,α_i) = −(1/2) log θ − (2θT)^{-1} Σ_{t=1}^T (Y_it − α_i)², so that α̂_i(θ) = Ȳ_i := T^{-1} Σ_{t=1}^T Y_it, and α*_i(θ) coincides with the true value α_i0 for any θ. Thus, the FOB of the profile likelihood of individual i in the Neyman–Scott model is σ²_0/(2θT). The first-order corrected estimator based on expected quantities, θ̌_exp, can now be computed as θ̌_exp = θ̂(1 + 1/T), where θ̂ := (nT)^{-1} Σ_{i=1}^n Σ_{t=1}^T (Y_it − Ȳ_i)² denotes the MLE. Since E[θ̂] = θ_0(T − 1)/T, the bias of the MLE is exactly of order O(T^{-1}). In comparison, the bias of the estimator derived from the first-order corrected profile likelihood based on expected quantities is of order O(T^{-2}). In order to derive the second-order corrected profile likelihood based on expected quantities, we first notice that the centered likelihood terms ℓ̃_{i0m}(θ) vanish for every m > 1 and that λ_{i0q} = 0 for every q > 2, which by (2.2) implies that the SOB of the profile likelihood is zero. Let β̂_i := (θ̃, α̂_i(θ̃), α̂_i(θ̃))', where θ̃ is a consistent preliminary estimator of θ_0. Expanding the approximation of the FOB and taking expectations yields the required bias estimate, where we have used that derivatives of B^{(1)}_i(θ,β_i) with respect to φ_i and α_i, as well as derivatives with respect to θ̃ of at least second order, are zero. If the preliminary estimator is the MLE, i.e., θ̃ = θ̂, then the expectation of the MLE suggests estimating its bias by −θ̂/T. Therefore, the bias of the maximizer of the second-order corrected profile likelihood based on expected quantities is indeed of order O(T^{-3}). Notice that the estimator using θ̂ as the preliminary estimator coincides with the SOB corrected estimator based on SA derived in Example 2 of DS, denoted as θ̂^{(2)}.
However, if the preliminary estimator of θ_0 is chosen to be the first-order corrected estimator based on expected quantities, i.e., θ̃ = θ̌_exp, the bias of the corrected profile likelihood can be further reduced. Since E[θ̌_exp] = θ_0(1 − 1/T²), the bias of θ̌_exp can be estimated by −θ̌_exp/T², so that the resulting estimator has an even smaller bias. Notice that this iteration step is not possible with the SA-based approach of DS. Further iterating this procedure infinitely many times finally yields θ̂^{(∞)}, whose bias has been removed completely.^6
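The bias orders in this example can be checked by a small Monte Carlo. The sketch below is ours, not the paper's code: it uses the closed forms implied by the example, E[θ̂] = θ_0(T−1)/T, a first-order EQ correction of the form θ̂(1 + 1/T), and the fully iterated limit θ̂·T/(T−1), which is exactly unbiased since E[θ̂·T/(T−1)] = θ_0 (the paper's iteration scheme may differ in intermediate form but has the same limit).

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, theta0, M = 200, 4, 1.0, 2000
bias = {"mle": 0.0, "first_order": 0.0, "iterated": 0.0}
for _ in range(M):
    # Neyman-Scott data: Y_it = alpha_i0 + U_it, U_it ~ N(0, theta0)
    Y = rng.normal(0.0, 1.0, size=n)[:, None] + rng.normal(0.0, np.sqrt(theta0), size=(n, T))
    theta_hat = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).mean()  # MLE, E = theta0*(T-1)/T
    bias["mle"] += theta_hat - theta0
    bias["first_order"] += theta_hat * (1.0 + 1.0 / T) - theta0    # one EQ correction step
    bias["iterated"] += theta_hat * T / (T - 1.0) - theta0         # limit of the iteration
for key in bias:
    bias[key] /= M
```

With T = 4, the simulated biases line up with −θ_0/T = −0.25, −θ_0/T² = −0.0625, and zero, respectively, illustrating the O(T^{-1}), O(T^{-2}), and exact removal claimed in the example.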

COMPARING BIAS CORRECTIONS BASED ON EQ AND SA
We now discuss differences between profile likelihood bias corrections based on EQ and SA, with particular focus on the implications for differences between our approach and DS. As can be seen by comparing b_1 and b_2 in the Proposition given in Section 2.2, the bias correction based on SA of DS is in general different from the bias correction based on EQ derived here. This can be attributed to the differences in the definition of the bias and the sources of randomness, as discussed in Section 3.1. Potential consequences for the small sample performance are discussed in Section 3.2.

Conditional Profile Likelihood Bias
A first crucial distinction between our approach and DS is the benchmark likelihood relative to which the profile likelihood bias is defined. Recall that we define the target value as the population level MLE α*_i(θ), which is used, for instance, in Pace and Salvan (2006), Severini (2000, Chap. 4.6), Schumann, Severini, and Tripathi (2021a), or Schumann, Severini, and Tripathi (2021b). Since the likelihood in many models of interest is obtained by making distributional assumptions on the model error conditional on observed covariates and the unobserved fixed effect, the bias of the profile likelihood is defined conditional on covariates. Clearly, cancellation of the first- and second-order conditional profile likelihood bias terms implies that the corresponding unconditional bias terms are also canceled. This definition of the bias has important consequences for the bias correction. For instance, it is not possible to use "score-factors," which are fundamental to the bias correction in DS, as E_{τ_0}[ℓ_{it01}(θ)] does in general not equal zero unless, for instance, θ = θ_0 or there is no time variation in the observed covariates.^7 Moreover, the "centering" approach used here, by which the likelihood terms are centered around their expectations conditional on covariates, is not possible with SA, as the centered terms depend on the unknown true parameters. It is thus not surprising that the bias terms in (2.1) and (2.2) differ from those presented in Lemma 4 of DS. The choice between EQ and SA also has consequences for the derivation of a feasible approximation of the FOB that is unbiased up to an error of order O_p(T^{-3}). Initial approximations of the FOB based on EQ and SA are B^{(1)}_i(θ,β̂_i) and B̂^{(1)}_i(θ), respectively. Conditional on covariates, randomness in B̂^{(1)}_i(θ) stems from the outcome data in the likelihood derivatives and α̂_i(θ). In B^{(1)}_i(θ,β_i), all outcome data have been integrated out, so that randomness stems from replacing β_i0 with its plug-in estimator β̂_i. Not surprisingly, our approximation of the FOB differs from the one of DS.
Moreover, as the use of score-factors is not permitted when the profile likelihood bias is considered conditional on covariates, the derivations of DS cannot be used for a correction based on EQ. For instance, the terms A_2 and B_1 in Lemma 4 of DS have a "score-factor" and thus, according to DS, have "a zero-mean dominant stochastic term, which can be ignored." However, as E_{τ_0}[ℓ_{it01}(θ)] ≠ 0 in general, the definition of the target value and simple algebra show that the relevant expectation is strictly negative for θ ≠ θ_0, so that the statement in DS does not hold when expectations are derived conditionally on covariates. Similarly, if θ ≠ θ_0, E_{τ_0}[Z] ≠ 0, where "Z" refers to the term of the same notation that appears, for instance, in the proof of Lemma 5 in DS.

Small Sample Performance
While both the estimator of DS and the estimator derived here remove the bias of the profile likelihood up to an O(T^{-3}) term, the small sample performance may differ due to the theoretical differences between the two approaches. Hahn, Kuersteiner, and Newey (2004) use higher-order expansions of the MLE in a cross-sectional model to show that using sample averages instead of integrals in first-order bias corrections does not affect the higher-order variance. However, while their result suggests that also in panel data the limiting distribution (as n,T → ∞) of FOB and SOB corrections may be equivalent, a theoretical analysis of potential advantages of SA and EQ in small samples is very challenging. For instance, comparisons in specific models are either noninformative (e.g., in the Neyman–Scott model in Section 2.3, the SOB corrections based on SA and EQ coincide) or very difficult, as closed forms for parameter estimates such as θ̌ are not available. Nevertheless, we can gain some intuition on possible advantages of each approach. First of all, as can be seen by comparing B̂^{(1)}_i(θ) and B^{(1)}_i(θ,β̂_i), the estimated first-order profile likelihood bias for individual i depends on the T observations of individual i only for SA and on data of the whole sample (through β̂_i) for EQ, which may explain why the bias correction based on EQ can perform better in small samples, particularly when T is small. Moreover, unlike the estimator based on SA, our estimator can in principle be iterated, which can be used to remove dependence on initial estimators and starting values for numerical optimization.^8 Further notice that if β̂_i = β_{i0}, the EQ FOB approximation coincides exactly with the FOB, whereas the SA approximation B̂^{(1)}_i(θ) does so only up to an error of order O_p(T^{-3/2}) and with a conditional bias of order O_p(T^{-2}).
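The SA-versus-EQ contrast can be made tangible in the Neyman–Scott setting. The following is a stylized illustration of ours, not the paper's actual B̂^{(1)} expressions: suppose we need, for each individual, the expected squared error E[(Y_it − α_i0)² | X_i, α_i0] = θ_0. An SA-style estimate uses only that individual's T observations, so it is noisy when T is small; an EQ-style estimate integrates analytically under the model and only requires a plug-in parameter that pools all nT observations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, theta0 = 500, 4, 1.0
alpha0 = rng.normal(0.0, 1.0, size=n)
Y = alpha0[:, None] + rng.normal(0.0, np.sqrt(theta0), size=(n, T))
a_hat = Y.mean(axis=1)                          # individual means

# SA-style estimate per individual: time average of T squared residuals (dof-adjusted)
sa = ((Y - a_hat[:, None]) ** 2).sum(axis=1) / (T - 1)

# EQ-style estimate: the expectation is computed analytically (it equals theta,
# the variance parameter), so only a plug-in value is needed, pooled over all n*T obs
theta_plug = ((Y - a_hat[:, None]) ** 2).sum() / (n * (T - 1))
eq = np.full(n, theta_plug)

# cross-sectional dispersion of the two per-individual estimates around theta0
sd_sa, sd_eq = sa.std(), eq.std()
```

Both routes are approximately unbiased for θ_0, but the SA-style estimate varies substantially across individuals for T = 4, while the EQ-style estimate varies only through the single pooled plug-in, mirroring the discussion above about which data each approximation depends on.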
As the same argument holds for the SOB corrections based on EQ and SA, one may expect the EQ based approximations to be closer to the actual bias terms if the parameter estimates are sufficiently close to the true values. Finally, the fact that our bias correction corrects the conditional profile likelihood bias for every θ ∈ Θ, whereas DS in general reduce the conditional bias for θ = θ_0 (score-factors hold at the true parameters only), might also help explain why the SOB corrections based on EQ and SA perform similarly well in static logit, while the EQ correction appears to perform better in static probit. As shown in detail in Appendix L.1, various simplifications arise in the static logit model. For instance, ℓ_{i02}(θ) coincides with λ_{i02}(θ), so that "Z" in the proof of Lemma 5 in DS, which stems from expanding ℓ_{i02}(θ) in the denominator around λ_{i02}(θ), vanishes. In contrast, no such simplifications arise in the static probit model, so that our approach corrects the conditional bias due to "Z" for any θ ∈ Θ, whereas the approach presented in DS does so only for θ = θ_0. Unfortunately, the potential improvement in terms of small sample performance comes at the cost of an increase in algebraic and computational complexity. A further potential disadvantage of EQ bias corrections is that they rely on the correct specification of the likelihood not only through the likelihood derivatives needed for the bias correction (which is the case also for SA) but also through the integration step when computing expected quantities. While we do not find evidence of a larger misspecification bias of EQ estimators relative to SA estimators in a small simulation study (see the discussion of Table B.3 in Section 5), misspecification bias remains a legitimate concern.

ASSUMPTIONS AND THEORY
We now present the assumptions before discussing the underlying intuition for the second-order correction presented here. Finally, we provide some asymptotic results for θ̌.

Assumptions
We begin the discussion with the fundamental conditions on our model which we will maintain in the rest of the paper.
Assumption 4.1. (i) The true parameter θ_0 lies in the interior of Θ, which is a compact and convex subset of R^p. (ii) For every i, α_i0 is a random variable that, with probability one, lies in the interior of J, which is a finite and closed interval. Assumptions (i) and (ii) are standard assumptions in maximum likelihood theory which ensure that the MLEs for θ_0 and α_i0 are bounded in probability. Moreover, (iii) is standard in microeconometric applications and is, for instance, used in Chamberlain (2010), Alvarez and Arellano (2003) or, more recently, Schumann et al. (2021a) and Schumann et al. (2021b). Notice that it treats the fixed effect as a random variable that is unobserved (which distinguishes it from the observed covariates) and that we impose no restrictions on the correlation between X_i and α_i0. While the conditional independence assumption in (iv), subsequently referred to as "time-independence" or "independence over time," can in principle be relaxed, it is used here to make the proofs, which require higher-order expansions, more tractable. Finally, (v) rules out time effects and time trends.^9 The next assumption is used to ensure uniqueness of the MLE and the population level MLE for α_i0.

Assumption 4.2. For every θ ∈ Θ: (i) ℓ_i(θ,u) has a unique maximizer in u; (ii) so does E[ℓ_i(θ,u)] in u; and (iii) α̂_i(θ) and α*_i(θ) lie in the interior of J with probability one.
In the derivation of our estimator, we will frequently use Assumption 4.3. Similar assumptions are used in the majority of papers on nonlinear panels with fixed effects. A noteworthy exception is Fernández-Val and Weidner (2016).
Assumption 4.3(i) is used to derive stochastic orders of likelihood terms and remainder terms in Taylor expansions when computing the profile likelihood bias of a given individual. Similar assumptions are frequently used in the literature.^10 Notice that here E on the outside denotes the unconditional expectation. By the Markov inequality, (i) also implies that conditional expectations of likelihood derivatives are bounded in probability. Part (ii) is used to control the behavior of individual i's profile likelihood bias expression and its feasible approximation by bounding the denominator away from zero for T large enough. Part (iii) states that the observed information is uniformly bounded away from zero across the parameters. This holds in all our examples given the compactness of Θ and J, which is guaranteed by Assumption 4.1. Part (iv) is an identification condition similar to Condition 1 of Hahn and Kuersteiner (2011). Finally, (v) is required to show asymptotic normality of our estimator.
Since our feasible approximation of the bias terms requires a preliminary estimator of θ 0 which will typically be the MLE, we make the following assumption.

Assumption 4.4. As n,T → ∞:
Unlike the previous conditions, this assumption is maintained mostly for convenience, as (i) and (ii) can be shown under Assumptions 4.1–4.3. Since proofs are available elsewhere, they are omitted here. Part (iii) restricts our analysis to a setup in which the asymptotic distribution of the MLE is not correctly centered (which makes bias corrections necessary). It is used to simplify the stochastic order of remainder terms.^11

Derivation of the SOB Correction
We now provide some intuition behind the estimator presented in Section 2.2. First, we use Assumption 4.3 and follow the approach in Severini (2000, Sect. 5.2) to derive the bias of the profile likelihood for each θ ∈ Θ. As shown in Appendix C.1, the bias of the profile likelihood can be expressed as in (4.1), and using B^{(1)}_i(θ,β̂_i) as an approximation of the FOB will induce a bias of second order. To refine this approximation, we use a Taylor expansion (adapting the idea outlined in Ferrari et al., 1996) and take expectations (conditional on covariates), which yields (4.3), where Rem^{(1)}(θ) is specified in (C.10).^12 Notice that we have used that derivatives of B^{(1)}_i(θ,β_i) do not depend on outcome data if none of their arguments do. As a consequence, they act as constants with respect to the conditional expectation operator E_{τ_0}.^13 As shown in Appendix I, the remainder can be ignored. For a feasible approximation, we require approximations for the terms on the right-hand side of (4.3). Derivatives of the FOB can easily be approximated up to an error of sufficiently low order by replacing the unknown true parameters with plug-in estimates. Consequently, as is shown in Appendix C.2, we obtain an approximation of the FOB that is unbiased up to third order. Finally, combining (4.1), (4.2), and (4.6), the difference between the second-order corrected profile likelihood and the target likelihood exhibits a bias of order O_p(T^{-3}).^14 Notice that the remainder term R_i(θ) is a combination of the remainder terms in (4.1), (4.2), and (4.6).

^12 Notice that the expectation computes B^{(1)}_i with respect to f_{Y_1,...,Y_n|X_1,...,X_n,τ_0}, where the randomness with respect to the conditional density comes from β̂_i only.

^13 If the FOB is approximated based on SA, derivatives of the FOB approximation depend on outcome data even when evaluated at the true parameters. Thus, they are not constant with respect to E_{τ_0}. Consequently, this step highlights one of the main conceptual differences between bias corrections based on EQ and SA.
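In symbols, the expansion-and-expectation step just described takes the following generic form. This is our schematic rendering; the paper's exact statement, including the precise form of Rem^{(1)}(θ), is in Appendix C and (C.10):

```latex
\mathrm{E}_{\tau_0}\!\left[B^{(1)}_i(\theta,\hat\beta_i)\right]
  = B^{(1)}_i(\theta,\beta_{i0})
  + \partial_{\beta'} B^{(1)}_i(\theta,\beta_{i0})\,
    \mathrm{E}_{\tau_0}\!\left[\hat\beta_i-\beta_{i0}\right]
  + \tfrac12\,\mathrm{E}_{\tau_0}\!\left[
      (\hat\beta_i-\beta_{i0})'\,
      \partial_{\beta\beta'} B^{(1)}_i(\theta,\beta_{i0})\,
      (\hat\beta_i-\beta_{i0})\right]
  + \mathrm{Rem}^{(1)}(\theta).
```

Because B^{(1)}_i(θ,·) contains no outcome data, its derivatives at β_{i0} pass through E_{τ_0} as constants; correcting for the first two moments of β̂_i − β_{i0} on the right-hand side is what removes the second-order bias that the naive plug-in B^{(1)}_i(θ,β̂_i) would induce.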
Next, we discuss the asymptotic properties of θ̌ defined in (2.4).

Asymptotic Theory
We now show asymptotic normality of θ̌. While we provide some intuition on the main steps, we defer the detailed proofs to Appendix D. If not stated otherwise, all limits are taken as n,T → ∞. Consistency of θ̌ can be shown in a similar way as the consistency of the MLE θ̂, since the correction term to the profile likelihood in our bias corrected objective function becomes negligible in the limit as n,T → ∞.^15 While showing consistency does not require us to impose a rate at which T grows relative to n, we will need rate conditions to show asymptotic normality of θ̌. In the first step, we expand the first-order condition, which yields (4.8), where θ̄ lies between θ̌ and θ_0. Since the target likelihood is a "genuine" likelihood that satisfies the Bartlett identities, and the information equality in particular, the first term on the right-hand side of (4.8) can be shown to converge to F as defined in Assumption 4.3(v). Next, we consider the second term on the right-hand side of (4.8), recalling the expansion in (4.7). In order to guarantee that the score with respect to θ of ℓ*_i(θ,β_i)|_{β_i=β̂_i} is unbiased up to an error of order O(T^{-3}), we impose the next "high-level" assumption.
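The expansion of the first-order condition follows the usual sandwich pattern. The display below is a schematic sketch in our notation, writing L̂*(θ) for the averaged second-order corrected objective whose maximizer is θ̌:

```latex
0 = \partial_\theta \hat L^{*}(\check\theta)
  = \partial_\theta \hat L^{*}(\theta_0)
  + \partial_{\theta\theta'} \hat L^{*}(\bar\theta)\,(\check\theta-\theta_0),
\qquad\text{so that}\qquad
\sqrt{nT}\,(\check\theta-\theta_0)
  = -\left[\partial_{\theta\theta'} \hat L^{*}(\bar\theta)\right]^{-1}
    \sqrt{nT}\,\partial_\theta \hat L^{*}(\theta_0).
```

Here −∂_{θθ'}L̂*(θ̄) converges to F, while the scaled score carries the residual bias of order √(nT)·O(T^{-3}) = O(√(n/T^5)), which is precisely why the rate condition n/T^5 → 0 is needed for the limiting distribution to be centered at θ_0.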
Assumptions of this kind are frequently used in the literature (e.g., implicitly in Arellano and Bonhomme, 2009). Here, it is used to avoid the explicit derivation of the derivative of R_i(θ), which is algebraically tedious without offering further insight. However, careful inspection of the proofs of Section 4.2 reveals that R_i(θ) is of the form T^{-3} r_i(θ), where r_i(θ) consists of a sum of fractions. The numerators of these fractions consist of products of likelihood derivatives which have bounded unconditional expectation by Lemma E.1, while the denominators are bounded away from zero by Assumption 4.3. The same pattern arises when taking the derivative with respect to θ, albeit with more complicated algebra. Further notice that the argument in r_i(θ) is fixed and thus does not drift with n,T. Further justification for Assumption 4.5 is given in Appendix D.
Since differentiation and integration are interchangeable by Assumption 4.3(i), using an expansion of the profile likelihood together with Assumption 4.5 then implies the required unbiasedness of the score. It is then shown in Appendix D that (4.10) holds, which illustrates that the rate n/T^5 → 0 needs to be imposed in order to ensure that the asymptotic distribution is centered around the true parameter θ_0. We further show in Appendix D the convergence of the remaining terms. Combining these results then leads to the following theorem.

SIMULATIONS
In order to assess the potential gains of our SOB correction while allowing for a comparison with the simulation results of DS, we conduct a simulation study using Design 3 of DS.^16 In a first set of simulations, we choose n = 100 and consider T ∈ {3,4,5,6,10}, since we are particularly interested in the bias reducing properties when T is small relative to n. We further report numbers for n = 1,000 in static logit and static probit with T ∈ {5,10,20}. The code for the experiments is written in R. The results in the first set of simulations are based on M = 2,000 Monte Carlo iterations whereas, due to an increase in computational burden, the results in the second set of simulations are based on 250 iterations.^17 The estimators considered here include the MLE (θ̂), the first-order corrected estimators based on sample averages (θ̌_avg) and on expected quantities (θ̌_exp), where, as discussed in Section 3.1, the FOB is estimated using B̂^{(1)}_i(θ) and B^{(1)}_i(θ,β̂_i), respectively, and the second-order corrected estimator based on expected quantities (θ̌). Moreover, we report results for both the first- and second-order split-panel jackknife bias corrected estimators of Dhaene and Jochmans (2015), denoted as θ̂^{(FO)}_SPJ and θ̂^{(SO)}_SPJ. For static logit, we also report the conditional maximum likelihood estimator (θ̂_CMLE) as a benchmark.

^16 In Appendix F, we further provide simulation results based on a data generating process introduced in Arellano and Bonhomme (2009) and subsequently used in Schumann et al. (2021a). The results are qualitatively similar (see Table 4).

^17 The actual number of Monte Carlo iterations may be lower, as iterations where at least one of the estimators cannot be computed due to the failure of the standard maximization routine optim() are discarded. However, these cases are very rare when T ≥ 4.
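For readers unfamiliar with the split-panel jackknife benchmark: the half-panel version of Dhaene and Jochmans (2015) averages the MLE over the two half panels and sets θ̂_SPJ = 2θ̂ − θ̄_{1/2}. The following is a minimal sketch of ours, in the analytically tractable Neyman–Scott model rather than in the probit/logit designs used in this study, where the half-panel correction happens to remove the bias exactly because E[θ̂] is linear in 1/T.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, theta0, M = 200, 4, 1.0, 2000

def mle(Ysub):
    """Neyman-Scott MLE of the variance on a (sub)panel: mean squared within-residual."""
    return ((Ysub - Ysub.mean(axis=1, keepdims=True)) ** 2).mean()

b_mle = b_spj = 0.0
for _ in range(M):
    Y = rng.normal(0.0, 1.0, size=n)[:, None] + rng.normal(0.0, np.sqrt(theta0), size=(n, T))
    full = mle(Y)
    half = 0.5 * (mle(Y[:, : T // 2]) + mle(Y[:, T // 2 :]))  # average over half panels
    b_mle += full - theta0
    b_spj += 2.0 * full - half - theta0                        # half-panel jackknife
b_mle /= M
b_spj /= M
```

The jackknife trades bias for variance by re-estimating on shorter subpanels, which is consistent with the variance inflation of the SPJ estimators reported in the results below.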

Results
The results for both models, shown in Tables B.1 and B.2, are qualitatively similar. First, they confirm the well-known fact that the MLE exhibits a large bias, especially when T is small. In comparison, the FOB corrected estimators perform much better, in particular when T is moderately large (i.e., T ≥ 5), both in terms of bias and variance. 18 Moreover, our results suggest that the EQ FOB correction of the profile likelihood works better in small samples than its SA counterpart, since θ̂ exp greatly outperforms θ̂ avg in terms of bias, standard deviation, and MSE for any value of T. The difference in performance is particularly striking in static probit, where the MSE of θ̂ avg is more than three times larger than the MSE of θ̂ exp for T ≤ 5. The SOB corrected estimator θ̌ also works well when measured against its first-order counterpart θ̂ exp : in both models, the bias, standard deviation, and MSE of θ̌ are substantially lower than the respective numbers for θ̂ exp . The first- and second-order corrected estimators based on EQ also appear to outperform their split-panel jackknife counterparts introduced in Dhaene and Jochmans (2015). While θ̂ (SO) SPJ in particular does well in reducing the bias when n = 1,000 and T ≥ 10, it does not exist in short panels (T ≤ 5) and generally suffers from a comparatively high variance, potentially due to the panel splitting. 19 Specific to static logit, θ̌ is only beaten (narrowly, for T ≥ 5) by the fixed-T consistent benchmark θ̂ CMLE . Further comparing our logit results with the numbers reported for Design 3 in DS shows that second-order corrections based on EQ and SA perform similarly well for T ∈ {5,10,20}. This may be explained by the fact that in the static logit example simplifications arise that reduce the differences between the two bias correction methods (see Section 3.2). In contrast, θ̌ greatly outperforms all competing estimators for any value of T in the static probit design.
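For reference, the split-panel jackknife estimators used as comparators above are simple to state: the first-order version of Dhaene and Jochmans (2015) combines the full-sample MLE with MLEs computed on the two half panels. A minimal sketch follows (the `mle` argument is a placeholder for an arbitrary panel estimator, and the toy estimator below is a hypothetical stand-in with a pure 1/T bias, used only to make the cancellation visible):

```python
import numpy as np

def spj_first_order(mle, y, x):
    """Half-panel jackknife: theta_SPJ = 2*theta_hat - (average of the two
    half-panel estimates), which removes the O(1/T) bias term exactly."""
    T = y.shape[1]
    h = T // 2
    full = mle(y, x)
    halves = 0.5 * (mle(y[:, :h], x[:, :h]) + mle(y[:, h:], x[:, h:]))
    return 2.0 * full - halves

# Hypothetical estimator with theta0 = 1 and bias exactly 0.6/T:
def toy_estimator(y, x):
    return 1.0 + 0.6 / y.shape[1]
```

Because the half panels double the bias term, 2(θ0 + b/T) − (θ0 + 2b/T) = θ0, the toy example is corrected exactly; for a real MLE only the first-order term cancels, and higher-order bias plus the extra variance from splitting remain, consistent with the simulation results above.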
Moreover, comparing our results in Table B.2 with Design 3 in Table 2 of DS suggests that the EQ SOB correction performs considerably better than the SA second-order correction. While θ̌ has an absolute bias of (less than) 1% for T ∈ {5,10,20}, the estimator of DS still exhibits a substantially larger bias of 14.2% when T = 5. In their Table 4, DS further report biases of 32%, 14.4%, and 7.8% for T ∈ {3,4,5} for their second-order corrected estimator, which is again considerably larger than the biases reported in Table B.1 for the second-order corrected estimator based on EQ.

18 The phenomenon that in small samples bias reduction can lead to a reduction in variance is well documented in the literature (e.g., Newey, 2004 or Schumann et al., 2021b).

19 Nevertheless, it may still be a computationally attractive alternative when the panel is sufficiently long.
A similar ranking also arises for the empirical coverage rates based on the Wald statistic: while the MLE is unreliable even for large values of T, the coverage rates of the FOB corrected estimators are much closer to the nominal coverage level, in particular when the bias correction is based on EQ. The Wald confidence intervals based on the SOB corrected estimator θ̌ appear to be much more reliable than those based on the first-order corrected estimators. Especially in static probit, where the CMLE does not exist, the Wald confidence interval based on θ̌ clearly outperforms all other estimators considered here. Notice that for n = 100 and T = 5, the empirical coverage rate of the Wald statistic based on θ̌ practically coincides with the nominal coverage level in both logit and probit. In comparison, DS still report an over-rejection of 13% for the LR test based on their SA second-order corrected likelihood in the static probit model. As in DS, we observe a worsening of the empirical coverage rates as n gets larger. 20 However, while their rejection rate increases rather steeply to 45% for T = 4, the empirical coverage rate of the Wald test based on θ̌ remains close to 95%.
Finally, Table B.3, which shows the simulated means of the estimators considered here under misspecification (i.e., the data are generated following a probit model and estimated based on the logit likelihood), illustrates that EQ estimators do not necessarily suffer from a larger misspecification bias than SA estimators. Both the SA and EQ estimates appear to be upward biased but approach the CMLE as T increases, which itself is close to the usual "rule of thumb" value of 1.6 (see Wooldridge, 2015, Chap. 17).
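The "rule of thumb" factor of roughly 1.6 reflects the fact that the logistic CDF closely tracks a rescaled standard normal CDF, so logit coefficients are about 1.6 times the corresponding probit coefficients. A quick numerical check of this scaling argument (an illustration only, not a computation from the paper):

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Lam(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

# Matching the densities at zero: Lambda'(0) = 1/4 and phi(0) = 1/sqrt(2*pi),
# so the probit index must be scaled by c = 4*phi(0) to mimic the logit index.
c = 4.0 / math.sqrt(2.0 * math.pi)   # ~ 1.596

# With this scaling, the two CDFs stay within about 0.02 of each other
# over the whole real line.
gap = max(abs(Lam(c * z) - Phi(z)) for z in (i / 100 for i in range(-400, 401)))
```

The small maximum gap explains why the logit estimates under misspecification approach roughly 1.6 times the probit truth as T grows.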

CONCLUSION
We have constructed a second-order bias corrected objective function for estimation and inference in nonlinear panel data models with fixed effects based on expected quantities. To do so, we have derived an explicit form of the second-order profile likelihood bias and demonstrated how to construct feasible approximations that make full use of the known likelihood. The maximizer of our adjusted profile likelihood is shown to be asymptotically normal and correctly centered if n,T → ∞ such that n/T 5 → 0. A simulation study suggests excellent small sample properties, in particular in very short panels in which first-order corrected estimators still exhibit a substantial bias. We have further pointed out some fundamental differences between bias corrections based on EQ and SA and provided some intuition that helps explain why EQ estimators appear to outperform SA estimators.

Appendix A. Derivatives of the First-Order Bias Used to Construct (2.3)
In this section, we display the explicit derivatives of the FOB used in the expansion leading to the refined approximation of the FOB. The detailed derivations are provided in Appendix C.2 of the Online Supplementary Material.

Notes (Table B.3): All estimators are based on the static logit model, while the data are generated using the static probit model.

(C.4)
Using this expression, we can then obtain

(C.5)
The remaining terms are obtained similarly. Explicit expressions for a −3/2 ,...,a −5/2 are provided in Appendix H. Note that a −2 and a −5/2 are not provided in Rilstone et al. (1996), so we have extended their result for the maximum likelihood case. Moreover, these terms differ from those in Ferrari et al. (1996) or Shenton and Bowman (1977) because we cannot use Bartlett identities (in α i ) to simplify moments of l i01 (θ). For example, in general E τ 0 [l 2 i01 (θ)] ≠ −λ i02 (θ). Next, expanding the profile likelihood,

(C.6)
Since explicit expressions for (powers of) δ i (θ) are now available, we obtain an expansion of the profile likelihood around the target likelihood up to the desired order, where the explicit terms for B (1) i (θ) can be derived term by term. Thus, by taking expectations conditional on covariates, E τ 0 , we arrive at (4.1). Finally, by Lemma E.1, the products of likelihood derivatives with a −1/2 ,...,a −5/2 make up the remainder in (C.6). Thus, sup θ∈Θ Rem (0) (θ) = O p (T −3 ) follows from Assumption 4.3 and Lemma E.1.
Remark C.1. Taking the first and second derivatives with respect to θ of the remainder term Rem (0) (θ) does not affect its stochastic order. Given the form of the remainder term specified in the proof of (4.1), its first and second derivatives with respect to θ consist of products of 8 and 9 likelihood derivatives with bounded expectation by Lemma E.1. Moreover, the first and second derivatives with respect to θ of δ i (θ) can be shown to be of order O p (T −1/2 ) by following the argument leading to (C.1), as all derivatives of centered likelihood derivatives covered by Assumption 4.3 are of order O p (T −1/2 ).

Remark C.2.
While the stochastic order of δ i (θ) is given by (C.1), expectations of odd powers of δ i (θ) converge at a faster rate. To see this, we again use the expansion of δ i (θ) evaluated at (θ,ᾱ(θ)), where the denominator is bounded away from zero by Assumption 4.3(iii). The stated bound then follows, where c denotes a constant and the last equality follows from Lemma E.1.
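The mechanism behind the faster rate for odd powers can be sketched as follows (a schematic rendering with generic mean-zero summands $u_{it}$ and a generic constant $c$; the exact terms are those appearing in the proof):

```latex
% delta_i(theta) behaves like a normalized sum of mean-zero terms,
% delta_i(theta) \approx T^{-1} \sum_t u_{it}, so that
\mathrm{E}_{\tau_0}\!\big[\delta_i^3(\theta)\big]
  \;\le\; c\, T^{-3} \sum_{t_1,t_2,t_3=1}^{T}
          \mathrm{E}_{\tau_0}\!\big[u_{it_1} u_{it_2} u_{it_3}\big].
% By independence over time, any term with an unpaired index vanishes,
% leaving only the T terms with t_1 = t_2 = t_3.  The triple sum is thus
% O(T), so E[delta_i^3] = O(T^{-2}), half an order faster than the
% O(T^{-3/2}) suggested by the stochastic order of delta_i(theta).
```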

C.2. Feasible Bias Estimation
In an initial step, we show (C.9), where Rem (1) (θ) is the remainder term in (4.3). For the proof, we introduce some notation for partial derivatives. For k,k 1 ,...,k 4 ∈ {1,...,p + 2}, let derivatives of B (1) i (θ,β i ) with respect to components of β i be denoted by corresponding indices.

Proof of (C.9). We can, ignoring constants, write the remainder term as stated, where β̄ i lies between β̂ i and β i0 . The leading terms are controlled by (C.1), (I.4), and Assumption 4.4(ii) and (iii). Moreover, as β̂ i →p β i0 as T → ∞, the second- and fourth-order derivatives of B (1) i (θ,β i ) are of order O p (1) when evaluated at β i = β̄ i by Lemma E.2. Thus, Rem (1,b) (θ) is of the stated order. Further notice that the same argument also holds for derivatives of Rem (1,b) (θ) with respect to θ, so that the stochastic order is not affected by taking derivatives. Next, we notice that the derivatives of B (1) i (θ,β i0 ) do not depend on outcome data. Moreover, (C.11) shows that the corresponding term is of order O p (T −3/2 ) or lower if at least one index in {k 1 ,k 2 ,k 3 } is an element of {1,...,p}. In this case, Lemma E.1 implies that the expectation of (β̂ i − β i0 ) k 3 is of sufficiently low order. We thus need to consider the remaining case, where we have again used (C.8). Using similar arguments, the remaining terms contribute only a bias of sufficiently low order (see the proof of (4.4)). In total, (C.9) follows.

Remark C.3. In the same way, one can show that derivatives of B (1) i (θ,β i ) with respect to β i can be approximated by replacing β i0 with β̂ i . Moreover, the proof of (4.2) can be carried out with analogous arguments. The full proofs are given in Appendix I.
In the proofs of (4.4) and (4.5), we again denote partial derivatives by parameters in the index and use a plug-in notation.

Proof of (4.4). In an initial step, the full proof of (4.4) requires us to show an initial equation, which is established in Appendix I. Note first that (C.14) holds by (C.5) and Lemma E.1. We first use an expansion (ignoring constants) to see

(C.15)
where β̄ i again lies between β̂ i and β i0 . In order to show (4.4), we need to demonstrate that derivatives of A(θ,β i ) are of order O p (T −1 ). To do so, one can follow the steps in the proof of Lemma E.2. First, notice that for each of the derivatives the denominator consists of (powers of) terms that can be shown to be bounded under Assumption 4.3(i) using the Cauchy-Schwarz inequality and the Jensen inequality, as in the proof of Lemma E.2. Due to the prefactor T −1 , the derivatives of A(θ,β i ), when evaluated at β i = β i0 or β i = β̄ i , are of order O p (T −1 ). Combining this with (C.1), (I.4), and Assumption 4.4, we see that the right-hand side of (C.15) is of order O p (T −3/2 ). Moreover, the derivatives of A(θ,β i ) do not depend on outcome data when evaluated at β i0 . This implies, for example, that the corresponding cross term is of order O p (T −2 ) by (C.14). The remaining terms are handled similarly.

An explicit expression for the FOB of θ̂. Notice that θ and γ are the same for each individual i in the definition of F̂ i (θ,β i ) and ∂ θ B (1) i (θ,β i ). Furthermore, there is no need for a notation that differentiates between α i and φ i here. Thus, let ζ := (θ ,α 1 ,...,α n ) ∈ Θ × J × ··· × J , and further write ζ 0 := (θ 0 ,α 10 ,...,α n0 ) and ζ̂ := (θ ,α̂ 1 (θ),...,α̂ n (θ)) . Then T̂ = T (ζ̂ ), and the FOB of θ̂ can be expressed as T (ζ )| ζ =ζ 0 . The latter can be seen by first considering equation (F.1) together with the remarks in the proof of (6.7) in Schumann et al. (2021b) and taking conditional expectations E τ 0 on both sides (notice that the conditional expectation operator E τ 0 corresponds to their E 0 ). Since the profile likelihood is "first-order Hessian biased," we use an expansion together with the fact that ∂ θ B (1) i is of sufficiently low order.

In the next step, we show that our estimator for E τ 0 [θ̂ ] − θ 0 is first-order unbiased, i.e., that (C.17) holds.

Proof of (C.17). Recall ζ = (θ ,α 1 ,...,α n ) ∈ Θ × J × ··· × J , so that dim(ζ ) = p + n. Let k 1 ,k 2 ∈ {1,...,p + n}.
As in the proof of (C.9), we denote partial derivatives with respect to components of ζ in the index. Further recall ζ̂ = (θ ,α̂ 1 (θ),...,α̂ n (θ)) and ζ 0 = (θ 0 ,α 10 ,...,α n0 ) , and let (ζ̂ − ζ 0 ) k denote the kth component of (ζ̂ − ζ 0 ). Using an expansion (ignoring constants), where ζ̄ lies between ζ̂ and ζ 0 , and taking expectations yields the desired expression, where we have used that T k (ζ )| ζ =ζ 0 does not depend on outcome data.

Proof of (4.6). Recall the definition in (2.3). Using (C.13), we therefore need to show statements (1)-(6) below. We first analyze (1) using g(θ,β i ), where β̄ i1 lies between β̂ i and β i0 . Next, we notice that the first and second derivatives of g(θ,β i ) consist of first and second derivatives of B (1) iα i (θ,β i ) and A(θ,β i ). Since β̂ i lies in any neighborhood of β i0 with probability approaching one as T → ∞, Lemma E.2 implies that derivatives of B (1) iα i (θ,β i ) are O p (1) uniformly across β i in the neighborhood specified in Lemma E.2. Further, as shown in the proof of (4.4), the first and second derivatives of A(θ,β i ) are of order O p (T −1 ), again since β̂ i →p β i0 . Thus, the first and second derivatives of g(θ,β i ) are of order O p (T −1 ). Moreover, ∂ β i g(θ,β i )| β i =β i0 does not depend on outcome data, so that the required bound follows, where we have used (C.14), (C.8), and (C.16). In total, we have therefore shown (1). Similarly, again using Lemma E.2 and the arguments outlined in the proof of (4.5) showing that the first and second derivatives of V(θ,β i ) with respect to the components of β i are of order O p (T −1 ), we obtain (2). To show (3), we assume that dim(θ) = 1 in order to avoid tensor notation; since (3) must hold for each component, the same argument can be carried out component-wise when dim(θ) > 1. Recall that T̂ = T (ζ )| ζ =ζ̂ . We then note the corresponding expansion, where β̄ i2 again lies between β̂ i and β i0 , Rem = O p (T −3/2 ), and the remaining order follows by the proof of (C.17). Now, ∂ β i B (1) iγ (θ,β i )| β i =β i0 and T (ζ 0 ) do not depend on outcome data.
Therefore, (3) follows. To show (4), we again use an expansion (ignoring constants), where β̄ i3 is again some value between β̂ i and β i0 , θ̄ lies between θ̂ and θ 0 , and β̄ i4 lies between β̂ i and β i0 . Recall that A(θ 0 ,β i0 ) = E τ 0 [δ i (θ)] by (4.4). As argued in the proof of (4.4), the first and second derivatives of A(θ,β i ) are of order O p (T −1 ) uniformly over θ and over β i in a neighborhood of β i0 . Hence, (C.18) and (C.19) follow by Assumption 4.4(ii) and (iii), (C.14), and (C.8). Next, we notice that the first derivatives in (C.18) and (C.19) do not depend on outcome data. Multiplying (C.18) and (C.19) then yields (4). Parts (5) and (6) can be shown using analogous arguments.

Appendix D. Proofs for Section 4.3
This section contains the proof of asymptotic normality of θ̌.
Proof of (4.9). Let B denote an open ball centered at θ 0 and let B̄ denote its closure. Now, the first and second derivatives of B (1) i (θ) converge in probability uniformly in θ on B̄ by Lemma E.1 in conjunction with Assumption 4.3(ii). Moreover, by (C.7), we can expand the profile likelihood around the target likelihood up to an error term that is o p (1) uniformly on B̄. Since the same arguments hold for the first and second derivatives with respect to θ , we can write the expansion with sup θ∈B |s R * i (θ)s| = o p (1). Thus, n −1 n i=1 sup θ∈B |s R * i (θ)s| = o p (1) by Assumption 4.1(iii). It is further shown subsequently that the target likelihood satisfies the second Bartlett identity (D.2). Thus, the claim follows by (D.1), (D.2), and the definitions above. Notice that for s ∈ S := {s ∈ R p : ‖s‖ = 1}, the stated bound holds. Next, the required convergence follows from (D.10), (I.3), and Assumption 4.3(i) and (ii), together with the Cauchy-Schwarz inequality. Consequently, since θ̂ ∈ B w.p.a.1, the result follows.
Proof of (D.2). First, notice the identities below. Therefore, using the Bartlett identities, we obtain (D.9). Next, differentiating (D.8) with respect to θ , we have the corresponding expression, where we have used that, by definition of the target value, λ i01 (θ) = 0. Again using (D.5) now yields (D.11). Comparing (D.9) and (D.11), the desired result follows.
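The second Bartlett identity invoked here, E[(∂ℓ)²] = −E[∂²ℓ] at the true parameter, can be verified exactly in the simplest building block of the models above, a single Bernoulli observation with logistic link (a freestanding illustration, not the target-likelihood computation of the proof):

```python
import math

def logit_bartlett_gap(alpha):
    """For one Bernoulli draw with p = Lambda(alpha), the log-likelihood is
    y*log(p) + (1-y)*log(1-p); its score in alpha is y - p and its Hessian
    is -p*(1-p).  The identity E[score^2] + E[Hessian] = 0 holds exactly:
    E[(y - p)^2] = p*(1-p)^2 + (1-p)*p^2 = p*(1-p)."""
    p = 1.0 / (1.0 + math.exp(-alpha))
    e_score_sq = p * (1 - p) ** 2 + (1 - p) * p ** 2   # E[(y - p)^2]
    e_hessian = -p * (1 - p)                           # E[second derivative]
    return e_score_sq + e_hessian
```

The gap is zero (up to floating-point error) for any value of alpha, which is the single-observation analogue of the identity used for the target likelihood.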
Justification for Assumption 4.5. We need to justify that each component of ∂ θ R i (θ) is of order O p (T −3 ) when evaluated at θ = θ 0 . In order to keep the notation simple, we assume here that dim(θ) = 1 and write r i (θ) := T 3 R i (θ) and r′ i (θ 0 ) := ∂ θ r i (θ)| θ=θ 0 . By (4.7), r i (θ) = O p (1). Hence, since the derivative of r i (θ) exists at θ = θ 0 by Assumption 4.3(i), we need to show that for each δ > 0 there exist some constant M < ∞ and T 0 ∈ N such that T ≥ T 0 implies the stated tail bound. Notice that the argument in r i and r′ i does not drift with n or T (i.e., θ is not allowed to involve a multiplicative factor of n or T), as this would violate the compactness assumption in Assumption 4.1. Since the derivative exists and θ 0 lies in the interior of Θ, there is some τ > 0 such that 0 < h < τ implies θ 0 + h ∈ Θ. Fix h with 0 < h < τ. Since for each θ ∈ Θ we have shown that r i (θ) = O p (1), there exists some constant M 1 such that the corresponding bound holds.

Proof of (4.10). Recall (2.5) and (C.7). As in the proofs of (4.2) and (4.6), we can use a second-order expansion around β i0 to show that the bias term vanishes for every θ ∈ Θ, where R * i (θ) consists of the remainder terms in (4.2) and (4.6) and is thus the remainder term in (4.7) and Assumption 4.5. Since differentiation and integration are interchangeable, this in particular implies that

(D.13)
First, we consider the stochastic properties of the last term on the right-hand side of (D.13). By Assumption 4.5, we immediately obtain the required order. Since (D.12) together with iterated expectations implies E[∂ θ B * i (θ,β i )]| θ=θ 0 ,β i =β̂ i = 0 for every i, we only need to consider the variance of the second term on the right-hand side of (D.13) in order to determine its stochastic properties. Considering the leading term of this variance yields the stated rate. In total, this in combination with (D.13) shows the desired result.

Appendix E. Auxiliary Results
In this section, we derive the stochastic orders of products of likelihood derivatives. Moreover, we show that derivatives of B (1) i (θ,β i ) are uniformly bounded (Lemma E.2).

Proof of Lemma E.1. First notice that, by compactness of the parameter spaces and the continuity of E τ i [D ν+μ l i (θ,α i )] in τ i , θ, and α i , for each panel length T there exist some τ̃ i ∈ Θ × J , θ̃ ∈ Θ, and α̃ i ∈ J such that the supremum is attained. Thus, the mean-zero property of centralized likelihood derivatives is not affected by taking the supremum in each parameter. Next,

(E.1)
Now, to bound the summands, repeated use of the Cauchy-Schwarz inequality and the Jensen inequality together with Assumption 4.3(i) shows that the right-hand side of (E.2), and thus every summand on the right-hand side of (E.1), is bounded. For instance, for h = 2, applying the Cauchy-Schwarz inequality to the right-hand side of (E.2) yields an upper bound, where Ẽ denotes integration with respect to the unconditional density evaluated at τ̃ i , corresponding to E[sup τ i ∈Θ×J E τ i [·]]. By the Jensen inequality and Assumption 4.3(i), this bound is finite, where c denotes a constant. Next, we note that by time-independence and because E τ i [l it (θ,α i ;τ i )] = 0 by definition, Ẽ τ i [ h j=1 D ν j +μ j l ik j (θ,α i ;τ i )] = 0 if there exists some j * ∈ {1,...,h} such that time period k j * differs from k j for all j = j * . Consequently, any summation over an index that is not paired with others drops out. Therefore, if h is even, there are at most h/2 pairs of indices and thus at most h/2 sums over absolutely bounded summands, scaled by a factor T −h/2 , on the right-hand side of (E.1). The total expression is thus of order O(1). If h is odd, the maximum number of summations is reached with (h − 3)/2 pairs of indices and a single triplet of indices. Therefore, we have (h − 3)/2 + 1 = (h − 1)/2 summations over bounded summands, scaled by the factor T −h/2 . Consequently, the total expression is of order O(T −1/2 ).
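The even/odd counting argument can be checked by exact enumeration: for independent mean-zero variables, only index tuples in which no index appears alone can contribute, and counting these tuples shows O(T^(h/2)) surviving terms for even h but only O(T^((h−1)/2)) for odd h. A small brute-force sketch (an illustration of the combinatorics only, not part of the proof):

```python
from itertools import product
from collections import Counter

def surviving_terms(h, T):
    """Number of index tuples (t_1, ..., t_h) in {1, ..., T}^h in which no
    index appears exactly once.  Only these tuples can have a nonzero
    expectation for products of independent mean-zero variables."""
    return sum(
        1 for tup in product(range(T), repeat=h)
        if min(Counter(tup).values()) >= 2
    )

# h = 3: only the T all-equal tuples survive -> T * T^(-3/2) = O(T^(-1/2)).
# h = 4: all-equal plus two-pair tuples, 3*T^2 - 2*T in total
#        -> O(T^2) * T^(-2) = O(1).
```

After scaling by T^(−h/2), even powers are O(1) while odd powers pick up the extra factor T^(−1/2), exactly as in the proof.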
As the variance is bounded by the second moment, Lemma E.1 in particular implies the corresponding variance bounds for h = 1,...,9. The next lemma shows that sufficiently many derivatives of B (1) i (θ,β i ) are bounded.

Then, there exists an open ball B centered at β i0 such that
for all ν with |ν| ≤ 4.
Proof of Lemma E.2. Recall from (2.1) the form of B (1) i (θ,β i ), where we used time-independence and E τ i [l it01 (θ,α i ;τ i )] = 0 in the second equation. Thus, derivatives of B (1) i (θ,β i ) consist of terms with derivatives of T −1 T t=1 E τ i [l 2 it01 (θ,α i ;τ i )] and T −1 T t=1 E τ i [ it02 (θ,α i )] in the numerator and E τ i [ i02 (θ,α i )] in the denominator. First, we consider the term in the denominator. Let from now on B be an open ball centered at β i0 such that B̄ ⊂ M. The denominator is then uniformly bounded away from zero over Θ and B. Next, we consider the terms in the numerator. Here, it follows from Assumption 4.3(i) that the derivatives with respect to α i are bounded. For example, each of the terms in (1)-(4) consists of the expectation of a product of two centered likelihood terms, where the expectation is taken with respect to the parameters γ and φ i . As in the proof of (E.3), we can use the Cauchy-Schwarz inequality and the Jensen inequality to show that all terms on the right-hand side are bounded uniformly over the parameter space and across T. To illustrate this, notice that since E τ i [l it0m (θ,α i )] = 0 by definition, independence over time restricts the nonvanishing terms, so that, by Assumption 4.3 and the Jensen inequality, the required bound follows. The remaining derivatives with respect to α i can be handled in a similar way. Next, we consider derivatives with respect to φ i . Again, using the Cauchy-Schwarz inequality repeatedly, we can bound the supremum of each summand, which is finite by Assumption 4.3(i). A similar argument can be used to show that all fourth-order derivatives of E τ i [ i02 (θ,α i )] are bounded. As an example for derivatives with respect to components of γ , let ψ 1 ,...,ψ 4 ∈ {1,...,p}, so that γ ψ denotes the ψth component of γ .
Further, let ψ 1 it1m (θ,α i ) denote the ψ 1 th component of the vector it1m (θ,α i ), let ψ 1 ψ 2 it2m (θ,α i ) denote the (ψ 1 ,ψ 2 )th element of the matrix it2m (θ,α i ), let ψ 1 ψ 2 ψ 3 it3m (θ,α i ) denote the (ψ 1 ,ψ 2 ,ψ 3 )th element of the three-dimensional array it3m (θ,α i ), and so on, and consider the fourth-order derivative ∂ 4 γ ψ 1 γ ψ 2 γ ψ 3 γ ψ 4 . Since by Assumption 4.3 each individual likelihood contribution in time period t is uniformly bounded across the parameters (θ,α i ) ∈ Θ × J and since the expectations are uniformly bounded across (γ ,φ i ) ∈ Θ × J and across time, the supremum over θ ∈ Θ and β i ∈ B of each of the terms in (1)-(4) can again be shown to be bounded by repeated use of the Cauchy-Schwarz inequality. Analogous arguments can be used for mixed derivatives.