God made the integers, all the rest is the work of man.
Kronecker
This book is concerned with models of event counts. An event count refers to the number of times an event occurs, for example the number of airline accidents or earthquakes. An event count is the realization of a nonnegative integer-valued random variable. A univariate statistical model of event counts usually specifies a probability distribution for the number of occurrences of the event that is known up to some parameters. Estimation and inference in such models are concerned with the unknown parameters, given the probability distribution and the count data. Such a specification involves no other variables, and the event counts are assumed to be independently and identically distributed (iid). Much early theoretical and applied work on event counts was carried out in the univariate framework. The main focus of this book, however, is regression analysis of event counts.
The statistical analysis of counts within the framework of discrete parametric distributions for univariate iid random variables has a long and rich history (Johnson, Kotz, and Kemp, 1992). The Poisson distribution was derived as a limiting case of the binomial by Poisson (1837). Early applications include the classic study of Bortkiewicz (1898) of the annual number of deaths from being kicked by mules in the Prussian army. A standard generalization of the Poisson is the negative binomial distribution. It was derived by Greenwood and Yule (1920), as a consequence of apparent contagion due to unobserved heterogeneity, and by Eggenberger and Polya (1923) as a result of true contagion.
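The two distributional facts mentioned above can be checked numerically. The following is a minimal sketch, not taken from the book: it assumes the common mean-dispersion parameterization of the negative binomial (often labeled NB2), in which the variance is mu + alpha*mu^2, and the function names and the values mu = 2 and alpha = 0.5 are illustrative choices.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """Poisson probability of k events with mean mu."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def nb2_pmf(k, mu, alpha):
    """Negative binomial probability with mean mu and variance mu + alpha*mu**2."""
    r = 1.0 / alpha
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def moments(pmf, kmax=500):
    """Mean and variance from summing the pmf over 0..kmax-1."""
    probs = [pmf(k) for k in range(kmax)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

mu = 2.0

# Binomial(n, mu/n) approaches Poisson(mu) as n grows with n*p held at mu.
gap = {
    n: max(abs(binom_pmf(k, n, mu / n) - poisson_pmf(k, mu)) for k in range(11))
    for n in (10, 100, 10000)
}

# The Poisson has variance equal to its mean; the negative binomial does not.
pois_mean, pois_var = moments(lambda k: poisson_pmf(k, mu))
nb_mean, nb_var = moments(lambda k: nb2_pmf(k, mu, 0.5))
```

The largest pointwise gap between the binomial and Poisson probabilities shrinks as n grows, while the negative binomial keeps the same mean as the Poisson but has a larger variance, which is the overdispersion that motivates it as a generalization.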
The general modeling approaches most often used in count data analysis – likelihood-based, generalized linear models, and moment-based – are presented in this chapter. Statistical inference for these nonlinear regression models is based on asymptotic theory, which is also summarized.
The models and results vary according to the strength of the distributional assumptions made. Likelihood-based models and the associated maximum likelihood estimator require complete specification of the distribution. Statistical inference is usually performed under the assumption that the distribution is correctly specified.
A less parametric analysis assumes that some aspects of the distribution of the dependent variable are correctly specified while others are not specified, or if specified are potentially misspecified. For count data models considerable emphasis has been placed on analysis based on the assumption of correct specification of the conditional mean, or on the assumption of correct specification of both the conditional mean and the conditional variance. This is a nonlinear generalization of the linear regression model, where consistency requires correct specification of the mean and efficient estimation requires correct specification of the mean and variance. It is a special case of the class of generalized linear models, widely used in the statistics literature. Estimators for generalized linear models coincide with maximum likelihood estimators if the specified density is in the linear exponential family. But even then the analytical distribution of the same estimator can differ across the two approaches if different second moment assumptions are made.
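The last point can be illustrated with a small sketch. It assumes an exponential conditional mean exp(b0 + b1*x) with a single regressor, fits it by Poisson pseudo-maximum likelihood using Newton-Raphson, and reports two standard errors for the slope: one that imposes the Poisson restriction that the variance equals the mean, and a robust sandwich form that relies only on the mean being correct. The data and the function names are hypothetical and chosen only so the result can be verified by hand.

```python
import math

def _weighted_xx(w, x):
    """Elements of sum_i w_i * (1, x_i)(1, x_i)' for a model with intercept and one regressor."""
    s00 = sum(w)
    s01 = sum(wi * xi for wi, xi in zip(w, x))
    s11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return s00, s01, s11

def poisson_pml(x, y, max_iter=50, tol=1e-12):
    """Poisson pseudo-ML for E[y|x] = exp(b0 + b1*x).

    Returns (slope, se_ml, se_robust).
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score: sum_i (y_i - mu_i) * (1, x_i)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Negative Hessian: A = sum_i mu_i * (1, x_i)(1, x_i)'
        a00, a01, a11 = _weighted_xx(mu, x)
        det = a00 * a11 - a01 * a01
        d0 = (a11 * g0 - a01 * g1) / det
        d1 = (a00 * g1 - a01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    a00, a01, a11 = _weighted_xx(mu, x)
    det = a00 * a11 - a01 * a01
    # Second row of A^{-1}, the row that corresponds to the slope.
    i10, i11 = -a01 / det, a00 / det
    # B = sum_i (y_i - mu_i)^2 * (1, x_i)(1, x_i)'
    sq_resid = [(yi - mi) ** 2 for yi, mi in zip(y, mu)]
    b00, b01, b11 = _weighted_xx(sq_resid, x)
    var_ml = i11                                  # A^{-1}: assumes variance = mean
    var_robust = i10 * i10 * b00 + 2 * i10 * i11 * b01 + i11 * i11 * b11  # A^{-1} B A^{-1}
    return b1, math.sqrt(var_ml), math.sqrt(var_robust)

# Hypothetical overdispersed counts for two groups (x = 0 and x = 1).
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 1, 7, 0, 1, 2, 13]
slope, se_ml, se_robust = poisson_pml(x, y)
```

With a binary regressor the slope is the log of the ratio of the two group means, here log(4/2). The point estimate is the same under either set of assumptions, but the robust standard error is more than twice the one that imposes the Poisson variance, because these counts are strongly overdispersed.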
This chapter deals with the topic of valid inference about the population given samples that are not simple random samples. There are several well-known ways in which departures from simple random sampling occur. They include choice-based sampling and endogenous stratified sampling, endogenous regressors, and sample selection.
The departure from simple random sampling may cause the sample probability of observations to differ from the corresponding population probabilities. In general such a divergence leads to models in which simple conditioning on exogenous variables does not lead to consistent estimates of the population parameters. These topics have been studied in depth in the discrete choice literature (Manski and McFadden, 1981). The analysis of count data in the presence of such complications is relatively underexplored.
A second topic considered in this chapter is endogenous regressors. Ignoring the feedback from the response variable to the endogenous regressor leads in general to invalid inferences. The estimation procedure should allow for stochastic dependence between the response variable and endogenous regressors. In considering this issue the existing literature on simultaneous equation estimation in nonlinear models is of direct relevance (Amemiya, 1985). This material is a continuation of section 8.2.
The third topic considered is sample selection in count regression, which also is closely related to issues of simultaneity and nonrandom sampling.
It is desirable to analyze count data using a cycle of model specification, estimation, testing, and evaluation. This cycle can go from specific to general models – for example, it can begin with Poisson and then test for negative binomial – or one can use a general-to-specific approach – for example, begin with negative binomial and then test the restrictions imposed by Poisson. For inclusion of regressors in a given count model, either approach might be taken. For the choice of the count data model itself, other than simple choices such as Poisson or negative binomial, the specific-to-general approach is most often useful. For example, if the negative binomial model is inadequate, there is a very wide range of models that might be considered, rendering a general-to-specific approach difficult to implement.
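As one illustration of the specific-to-general step, the sketch below starts from fitted Poisson means and tests for overdispersion of the negative binomial form, in which the variance is mu + alpha*mu^2. It uses a regression-based test: the auxiliary variable ((y - mu)^2 - y)/mu is regressed on mu without an intercept, and the t statistic on the coefficient tests alpha = 0. Whether the chapter relies on this particular form is an assumption, and the counts and fitted means shown are hypothetical.

```python
import math

def overdispersion_test(y, mu):
    """Regression-based test of Poisson against variance = mu + alpha*mu**2.

    y  : observed counts
    mu : fitted Poisson means for the same observations
    Returns (alpha_hat, t_stat) from an OLS regression without intercept.
    """
    z = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu)]
    sxx = sum(mi * mi for mi in mu)
    alpha_hat = sum(mi * zi for mi, zi in zip(mu, z)) / sxx
    resid = [zi - alpha_hat * mi for zi, mi in zip(z, mu)]
    s2 = sum(e * e for e in resid) / (len(y) - 1)   # one estimated coefficient
    t_stat = alpha_hat / math.sqrt(s2 / sxx)
    return alpha_hat, t_stat

# Hypothetical counts and fitted means (two groups with means 2 and 4).
y = [0, 0, 1, 7, 0, 1, 2, 13]
mu = [2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0]
alpha_hat, t_stat = overdispersion_test(y, mu)
```

A large positive t statistic points toward a model that relaxes the Poisson variance restriction, such as the negative binomial. With only eight observations, as here, the asymptotic normal approximation for the t statistic should not be relied on; the example is meant to show the mechanics only.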
The preceding two chapters have presented the specification and estimation components of this cycle for cross-section count data. In this chapter we focus on the testing and evaluation aspects of this cycle. This includes residual analysis, goodness-of-fit measures, and moment-based specification tests, in addition to classical statistical inference.
Residual analysis, based on a range of definitions of the residual for heteroskedastic data such as counts, is presented in section 5.2. A range of measures of goodness of fit, including pseudo R-squareds and a chi-square goodness-of-fit statistic, are presented in section 5.3. Likelihood-based hypothesis tests for overdispersion, introduced in section 3.4, are discussed more extensively in section 5.4. Small-sample corrections, including the bootstrap pairs procedure for quite general cross-section data models, are presented in section 5.5.
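The pairs bootstrap resamples whole observations (y, x) with replacement, so that any dependence of the variance of y on x is carried into each resample. The sketch below is a minimal version that computes a bootstrap standard error only; the refinements discussed in the chapter may go further. The estimator, the data, the number of replications, and the seed are all illustrative assumptions.

```python
import math
import random

def bootstrap_pairs_se(pairs, estimator, reps=999, seed=12345):
    """Bootstrap standard error from resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    draws = []
    for _ in range(reps):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        try:
            draws.append(estimator(sample))
        except (ZeroDivisionError, ValueError):
            continue  # skip degenerate resamples, e.g. one group is empty
    mean = sum(draws) / len(draws)
    return math.sqrt(sum((d - mean) ** 2 for d in draws) / (len(draws) - 1))

def log_mean_ratio(sample):
    """Poisson slope for a binary regressor: log of the ratio of group means."""
    y1 = [y for x, y in sample if x == 1]
    y0 = [y for x, y in sample if x == 0]
    return math.log(sum(y1) / len(y1)) - math.log(sum(y0) / len(y0))

# Hypothetical (x, y) pairs with a binary regressor.
pairs = [(0, 1), (0, 1), (0, 2), (0, 4), (0, 2), (0, 3),
         (1, 2), (1, 3), (1, 4), (1, 7), (1, 5), (1, 3)]
se = bootstrap_pairs_se(pairs, log_mean_ratio)
```

Fixing the seed makes the result reproducible. Because the estimator is passed in as a function, the same resampling loop can be reused for any cross-section estimator that takes a list of pairs.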
In this chapter we provide a detailed discussion of empirical models based on two cross-sectional data sets. The first of these analyzes the demand for medical care by the elderly in the United States. This data set shares many features of health utilization studies based on cross-section data. The second is an analysis of recreational trips.
Section 6.2 extends the introduction by surveying two general modeling issues. The first is the decision to model only the conditional mean versus the full distribution of counts. The second issue concerns behavioral interpretation of count models, an issue of importance to econometricians who emphasize the distinction between reduced form and structural models. Sections 6.3 and 6.4 deal in turn with each of the two empirical applications. Each has several subsections that deal with details. The health care example in section 6.3 is intended to illustrate in detail the methodology for fitting a finite mixture model. There are relatively few econometric examples that discuss at length the implementation of the finite mixture model and the interpretation of the results. The example is intended to fill this gap. Section 6.5 pursues a methodological question concerning the distribution of the LR test under nonstandard conditions, previously raised in section 4.8.5. The final two sections provide concluding remarks and bibliographic notes. The emphasis of this chapter is on practical aspects of modeling. Each application involves several competing models which are compared and evaluated using model diagnostics and goodness-of-fit measures.
Longitudinal data or panel data are observations on a cross-section of individual units such as persons, households, firms, and regions that are observed over several time periods. The data structure is similar to that of multivariate data considered in Chapter 8. Analysis is simpler than for multivariate data because for each individual unit the same outcome variable is observed, rather than several different outcome variables. Analysis is more complex because this same outcome variable is observed at different points in time, introducing time series data considerations presented in Chapter 7.
In this chapter we consider longitudinal data analysis when the dependent variable is a count variable. Remarkably, many count regression applications are to longitudinal data rather than simpler cross-section data. Econometrics examples include the number of patents awarded to each of many individual firms over several years, the number of accidents in each of several regions, and the number of days of absence for each of many persons over several years. A political science example is the number of protests in each of several different countries over many years. A biological and health science example is the number of occurrences of a specific health event, such as seizure, for each of many patients in each of several time periods.
A key advantage of longitudinal data over cross-section data is that they permit more general types of individual heterogeneity. Excellent motivation was provided by Neyman (1965), who pointed out that panel data enable one to control for heterogeneity and thereby distinguish between true and apparent contagion.
Many widely used regression packages, including LIMDEP, STATA, TSP, and GAUSS, support maximum likelihood estimation of standard Poisson and negative binomial regressions, the latter of these in a separate count module. LIMDEP also supports the QGPML versions of the standard models, maximum-likelihood estimation of truncated or censored Poisson, geometric and negative binomial models, and ZIP and sample selection models. STATA also supports the generalized negative binomial regression in which the overdispersion parameter is further parameterized as a function of additional covariates. In addition, any statistical package with a generalized linear models component will include maximum likelihood and QGPML estimation of the Poisson, although not necessarily negative binomial. Thus, regression packages cover the models in Chapter 3 and roughly half of those in Chapter 4. The packages vary somewhat in the provision of diagnostics such as overdispersion tests and goodness-of-fit measures.
At the time of writing (late 1997) there is virtually no specialized software for the models presented in Chapters 7 through 12. A notable exception is estimation of basic panel count data models, which is provided by both LIMDEP and TSP. For models for which off-the-shelf software is not available, one needs to provide at least the likelihood function, for maximum likelihood estimation, or the moment conditions and weighting matrix, for GMM estimation. In principle this can be done using many regression packages, or using matrix programming languages such as GAUSS, MATLAB, S-PLUS, or SAS/IML. In practice numerical problems can be encountered if models are quite nonlinear.
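To make concrete what providing the likelihood function involves, the sketch below codes the log-likelihood of a zero-truncated Poisson model without regressors and maximizes it with a simple golden-section search. The choice of model, the data, and the optimizer are illustrative assumptions and are not tied to any of the packages named above; a model with regressors would need a multivariate optimizer.

```python
import math

def ztp_loglik(lam, y):
    """Log-likelihood of a zero-truncated Poisson sample (every y >= 1).

    P(y) = exp(-lam) * lam**y / (y! * (1 - exp(-lam))),  y = 1, 2, ...
    """
    log_trunc = math.log(-math.expm1(-lam))   # log(1 - exp(-lam))
    return sum(yi * math.log(lam) - lam - math.lgamma(yi + 1) - log_trunc
               for yi in y)

def golden_section_max(f, lo, hi, tol=1e-9):
    """Maximize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
    return (a + b) / 2

# Hypothetical positive counts with sample mean 2.
y = [1, 1, 2, 3, 1, 4, 2, 1, 2, 3]
lam_hat = golden_section_max(lambda lam: ztp_loglik(lam, y), 1e-6, 50.0)
```

The first-order condition for this model sets the truncated mean lam/(1 - exp(-lam)) equal to the sample mean, which gives a direct check on the numerical maximum. The estimate of lam is below the sample mean of 2, since the sample mean of truncated data overstates the mean of the underlying Poisson.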