A basic assumption in the construction of likelihood-based models is that constituent observations are independent. This is a reasonable assumption for perhaps the majority of studies. For longitudinal studies, however, the assumption is not tenable, nor does it hold when data are clustered. For example, observations from a study on student drop-outs can be clustered by the type of school sampled. If the study is related to intervention strategies, affluent suburban, middle-class suburban, middle-class urban, and below-poverty-level schools have more highly correlated strategies within a school type than between types or groups. Likewise, if we have study data taken on a group of individual patients over time (e.g. treatment results obtained once per month for a year), the data for an individual across the various time periods are likely to be more highly correlated than are treatment results between patients. Any time the data can be grouped into clusters, or panels, of correlated groups, we must adjust the likelihood-based model (based on independent observations) to account for the extra correlation.
We have previously employed robust variance estimators and bootstrapped standard errors when faced with overdispersed count data. Overdispersed Poisson models were replaced by negative binomial models, by adjusting the variance function of the basic Poisson model, or by designing a new log-likelihood function to account for the specific source of the overdispersion.
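As a brief illustration of the first of these remedies (my own sketch on simulated data, not code from the text), the R fragment below fits a Poisson model to overdispersed counts and then replaces the model-based standard errors with robust (sandwich) estimates; the sandwich and lmtest packages are assumed to be installed.

# Simulated overdispersed counts (negative binomial data fitted as Poisson)
set.seed(1)
x1  <- rnorm(1000)
y   <- rnbinom(1000, mu = exp(1 + 0.5 * x1), size = 1)
dat <- data.frame(y, x1)

library(sandwich)   # robust (sandwich) variance estimators
library(lmtest)     # coeftest() for re-testing coefficients

fit <- glm(y ~ x1, family = poisson, data = dat)

summary(fit)                                      # model-based (naive) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))   # robust standard errors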
In this chapter we describe how count response data can be modeled using the NB2 negative binomial regression. NB2 is the traditional parameterization of the negative binomial model, and is the one with which most statisticians are familiar. For this chapter, then, any reference to negative binomial regression will be to the NB2 model unless otherwise indicated.
Poisson versus negative binomial
We have earlier stated that, given the direct relationship in the negative binomial variance between α and the fitted value, μ, the model becomes Poisson as the value of α approaches zero. A negative binomial with α = 0 will not converge because of division by zero, but values close to zero allow convergence. When α is close to zero, the model statistics displayed in Poisson output are nearly the same as those of a negative binomial.
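For reference, under the NB2 parameterization the variance function in question is Var(y) = μ + αμ², which collapses to the Poisson variance, Var(y) = μ, as α approaches zero.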
The relationship can be observed using simulated data. The code below constructs a 50,000-observation synthetic Poisson model with an intercept value of 2 and parameter values of x1 = 0.75 and x2 = 1.25. Each predictor is generated as a synthetic random normal variate. The Poisson data are then modeled using a negative binomial, where the value of α is estimated prior to the calculation of parameter estimates and associated statistics.
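The book presents this simulation in Stata and R; a minimal R sketch along the same lines is given below, assuming MASS::glm.nb for the NB2 fit (glm.nb reports θ, the inverse of α, so a very large θ corresponds to α near zero).

library(MASS)        # glm.nb() for the NB2 model

set.seed(4590)
n  <- 50000
x1 <- rnorm(n)                            # synthetic random normal predictors
x2 <- rnorm(n)
mu <- exp(2 + 0.75 * x1 + 1.25 * x2)      # intercept 2, x1 = 0.75, x2 = 1.25
y  <- rpois(n, mu)                        # Poisson-distributed response

pois <- glm(y ~ x1 + x2, family = poisson)
nb2  <- glm.nb(y ~ x1 + x2)

summary(pois)
summary(nb2)       # estimates nearly identical; theta is very large, i.e. alpha is near zero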
In this and subsequent chapters we shall discuss the nature and utility of some 25 varieties of negative binomial regression that are useful for modeling count response data. In addition, we examine certain models that are related to the negative binomial family of models. This chapter will primarily be devoted to an examination of the derivation of the negative binomial model and to the two foremost methods of its estimation. We also consider how the probabilities generated from a negative binomial model differ from those of the Poisson, as well as how they differ among various negative binomial models based on both mean and ancillary parameters.
Varieties of negative binomial
I mentioned that the basic negative binomial model can be enhanced to allow for the modeling of a wide range of count response situations. The Poisson can likewise be enhanced to adjust for data that violate its essential distributional assumptions. In fact, many of the same distributional problems face both Poisson and negative binomial models. We therefore find similar approaches to the handling of such data for both the Poisson and negative binomial. These include models such as zero-inflated Poisson (ZIP), which is directly related to the zero-inflated negative binomial (ZINB). All of these models allow the estimated variance to exceed the mean, thereby relaxing the principal assumption of the Poisson regression model, namely equidispersion.
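As a hedged sketch of one such pair of models (using zeroinfl() from the pscl package, one common R implementation, on simulated data; not code from the text), a ZIP and a ZINB fit might look as follows.

library(pscl)        # zeroinfl() for zero-inflated count models

set.seed(19)
n      <- 2000
x1     <- rnorm(n)                                        # count-model predictor
z1     <- rnorm(n)                                        # zero-inflation predictor
zero   <- rbinom(n, 1, plogis(-1 + z1))                   # structural (excess) zeros
counts <- rnbinom(n, mu = exp(1 + 0.5 * x1), size = 1.5)
y      <- ifelse(zero == 1, 0, counts)
dat    <- data.frame(y, x1, z1)

zip  <- zeroinfl(y ~ x1 | z1, data = dat, dist = "poisson")   # ZIP
zinb <- zeroinfl(y ~ x1 | z1, data = dat, dist = "negbin")    # ZINB
AIC(zip, zinb)      # the ZINB should be preferred for these overdispersed data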
This chapter can be considered as a continuation of the previous one. Few real-life Poisson data sets are truly equidispersed. Overdispersion to some degree is inherent to the vast majority of Poisson data. Thus, the real question deals with the amount of overdispersion in a particular model: is it statistically sufficient to require a model other than Poisson? This is the central question we address in this chapter, together with how to differentiate between real and apparent overdispersion.
What is overdispersion?
Not all overdispersion is real; apparent overdispersion may sometimes be identified and the model amended to eliminate it. We first address the difference between real and apparent overdispersion, and what can be done about the latter.
1 What is overdispersion?
Overdispersion in Poisson models occurs when the response variance is greater than the mean.
2 What causes overdispersion?
Overdispersion is caused by positive correlation between responses or by excess variation among response probabilities or counts. Overdispersion also arises when the distributional assumptions of the data are violated, for instance when the data are clustered and thereby violate the assumption of independent observations that underlies the likelihood.
3 Why is overdispersion a problem?
Overdispersion may cause the standard errors of the estimates to be deflated, i.e. underestimated: a variable may appear to be a significant predictor when it is in fact not significant. A quick diagnostic check is sketched below.
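One common informal diagnostic (sketched here in R on simulated data, not code from the text) is the Pearson dispersion statistic of a fitted Poisson model: the sum of squared Pearson residuals divided by the residual degrees of freedom, with values well above 1 suggesting overdispersion.

# Simulate mildly overdispersed counts and fit a Poisson model
set.seed(2)
x <- rnorm(2000)
y <- rnbinom(2000, mu = exp(1 + 0.4 * x), size = 2)

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)   # well above 1 here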
Interaction terms play an important role in statistical modeling. However, the topic relates to regression models in general and not specifically to count models, and it was therefore not addressed in the text. On the other hand, there are few sources that provide specifics on how interactions are to be constructed and interpreted for non-linear models, and in particular for count models. This appendix provides a brief overview of interactions and their construction.
Recall that there are a variety of types of predictors. The three most common are binary (1/0), categorical (1, 2,…, k), and continuous. Continuous predictors are often constrained to have positive real numbers as allowable values. In addition, continuous predictors that do not begin at zero or 1 are usually centered, as discussed in Section 6.3.1 of the text.
Many researchers believe that the coefficients of interaction terms are interpreted in the same manner as other predictors in a model. However, this is not the case, and in particular it is not the case for interactions in discrete response regression models.
I shall provide a brief overview of how to construct and interpret Binary × Binary, Binary × Continuous, Categorical × Continuous, and Continuous × Continuous interactions. These appear to be the most commonly employed interactions. Understanding these allows the construction and interpretation of any of the other options.
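As a small worked example (my own sketch on simulated data, not taken from the appendix), the R code below constructs a Binary × Continuous interaction in a Poisson model. The point to note is that the interaction coefficient is not read off directly: the rate ratio for the binary predictor depends on the value of the continuous predictor, exp(b_treat + b_interaction × age_c).

set.seed(3)
n     <- 5000
treat <- rbinom(n, 1, 0.5)                 # binary predictor
age_c <- rnorm(n)                          # centered continuous predictor
mu    <- exp(0.5 + 0.3 * treat + 0.2 * age_c - 0.15 * treat * age_c)
y     <- rpois(n, mu)

fit <- glm(y ~ treat * age_c, family = poisson)
b   <- coef(fit)

# Rate ratio for treat evaluated at age_c = -1, 0, 1
exp(b["treat"] + b["treat:age_c"] * c(-1, 0, 1))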
Two general methods are used to estimate count response models: (1) an iteratively re-weighted least squares (IRLS) algorithm based on the method of Fisher scoring, and (2) a full maximum likelihood Newton–Raphson type algorithm. Although the maximum likelihood approach was first used with both the Poisson and negative binomial, we shall discuss it following our examination of IRLS. We do this for strictly pedagogical purposes, which will become evident as we progress.
It should be noted at the outset that IRLS is a type or subset of maximum likelihood which can be used for the estimation of generalized linear models (GLMs). Maximum likelihood methods in general estimate model parameters by setting the derivative of the model log-likelihood function, termed the gradient, to zero and solving. The derivative of the gradient with respect to the parameters is called the Hessian matrix, upon which model standard errors are based. Owing to the unique distributional structure inherent to members of the GLM family, estimation of model parameters and standard errors can be achieved using IRLS, which in general is a computationally simpler method of maximum likelihood estimation. Both methods are derived, described, and related in this chapter.
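To make the gradient and Hessian concrete, here is a bare-bones Newton–Raphson loop for a Poisson log-likelihood (my own illustrative R sketch, not the estimation code discussed later in the chapter); the same template extends to the negative binomial once α enters the likelihood.

set.seed(5)
n  <- 1000
x1 <- rnorm(n)
X  <- cbind(1, x1)                          # design matrix with intercept
y  <- rpois(n, exp(1 + 0.5 * x1))

beta <- rep(0, ncol(X))                     # starting values
for (i in 1:25) {
  mu   <- as.vector(exp(X %*% beta))
  grad <- t(X) %*% (y - mu)                 # gradient of the log-likelihood
  H    <- -t(X) %*% (X * mu)                # Hessian
  beta <- beta - solve(H, grad)             # Newton-Raphson update
}
beta                                        # matches coef(glm(y ~ x1, family = poisson))
sqrt(diag(solve(-H)))                       # standard errors from the Hessian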
Derivation of the IRLS algorithm
The traditional generalized linear models (GLM) algorithm, from the time it was implemented in GLIM (generalized linear interactive modeling) through its current implementations in Stata, R, SAS, SPSS, GenStat, and other statistical software, uses some version of an IRLS estimating algorithm.
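For orientation, a stripped-down version of such an IRLS loop for a Poisson (log-link) GLM is sketched below in R; this is an illustration only, since production routines such as glm() add convergence checks, step control, and numerically stabler weighted least squares.

set.seed(7)
n  <- 2000
x1 <- rnorm(n)
X  <- cbind(1, x1)
y  <- rpois(n, exp(0.5 + 0.8 * x1))

beta <- c(log(mean(y)), 0)                  # crude starting values
for (i in 1:25) {
  eta  <- as.vector(X %*% beta)             # linear predictor
  mu   <- exp(eta)                          # inverse link
  w    <- mu                                # IRLS weights for Poisson / log link
  z    <- eta + (y - mu) / mu               # working (adjusted) response
  beta <- solve(t(X) %*% (X * w), t(X) %*% (w * z))   # weighted least squares step
}
beta                                        # agrees with coef(glm(y ~ x1, family = poisson))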
Poisson regression is the standard or base count response regression model. We have seen in previous discussions that other count models deal with data that violate the assumptions carried by the Poisson model. Since the model does play such a pivotal role in count response modeling, we begin with an examination of its derivation and structure, as well as a discussion of how it can be parameterized to model counts per unit time or area. Also discussed are the different interpretations of Poisson coefficients, marginal effects and discrete change, and graphics such as effect plots. We also show how to construct synthetic Poisson models in both Stata and R; the same logic can be used to develop them in other languages.
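As a brief, hedged illustration of the rate parameterization mentioned above (R code with invented variable names), including the log of an exposure measure as an offset turns the fitted coefficients into log incidence-rate ratios, i.e. effects on the count per unit time or area.

set.seed(11)
n        <- 3000
exposure <- runif(n, 1, 10)                 # e.g. person-years observed, or area surveyed
x1       <- rnorm(n)
y        <- rpois(n, exposure * exp(0.2 + 0.6 * x1))

fit <- glm(y ~ x1 + offset(log(exposure)), family = poisson)
exp(coef(fit))                              # incidence rate ratios per unit exposure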
Derivation of the Poisson model
Poisson regression, based on the Poisson probability distribution, is the fundamental method used for modeling count response data. As previously mentioned, however, the Poisson model can also be construed as a subset of the NB2 model. Poisson is an NB2 model with a heterogeneity parameter value of zero. This relationship is discussed in detail in Chapter 8.
In this section we discuss the derivation of the Poisson regression model. First, in Section 6.1.1 we demonstrate how the Poisson probability distribution can be derived from the binomial distribution. In Section 6.1.2, given the Poisson distribution first displayed in equation 3.1, we derive the Poisson regression model.
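For orientation, the limit argument of Section 6.1.1 runs along the standard lines: set p = μ/n in the binomial probability function and let n grow without bound while μ is held fixed,

\[
\lim_{n \to \infty} \binom{n}{y} \left( \frac{\mu}{n} \right)^{y} \left( 1 - \frac{\mu}{n} \right)^{n-y}
= \frac{e^{-\mu} \mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots
\]

which is the Poisson probability function with mean μ.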
Sure to be influential, this book lays the foundations for the use of algebraic geometry in statistical learning theory. Many widely used statistical models and learning machines applied to information science have a parameter space that is singular: mixture models, neural networks, HMMs, Bayesian networks, and stochastic context-free grammars are major examples. Algebraic geometry and singularity theory provide the necessary tools for studying such non-smooth models. Four main formulas are established: 1. the log likelihood function can be given a common standard form using resolution of singularities, even applied to more complex models; 2. the asymptotic behaviour of the marginal likelihood or 'the evidence' is derived based on zeta function theory; 3. new methods are derived to estimate the generalization errors in Bayes and Gibbs estimations from training errors; 4. the generalization errors of maximum likelihood and a posteriori methods are clarified by empirical process theory on algebraic varieties.
Bayesian nonparametrics works - theoretically, computationally. The theory provides highly flexible models whose complexity grows appropriately with the amount of data. Computational issues, though challenging, are no longer intractable. All that is needed is an entry point: this intelligent book is the perfect guide to what can seem a forbidding landscape. Tutorial chapters by Ghosal, Lijoi and Prünster, Teh and Jordan, and Dunson advance from theory, to basic models and hierarchical modeling, to applications and implementation, particularly in computer science and biostatistics. These are complemented by companion chapters by the editors and Griffin and Quintana, providing additional models, examining computational issues, identifying future growth areas, and giving links to related topics. This coherent text gives ready access both to underlying principles and to state-of-the-art practice. Specific examples are drawn from information retrieval, NLP, machine vision, computational biology, biostatistics, and bioinformatics.
Probability and Statistics are as much about intuition and problem solving as they are about theorem proving. Because of this, students can find it very difficult to make a successful transition from lectures to examinations to practice, since the problems involved can vary so much in nature. Since the subject is critical in many modern applications such as mathematical finance, quantitative management, telecommunications, signal processing, bioinformatics, as well as traditional ones such as insurance, social science and engineering, the authors have rectified deficiencies in traditional lecture-based methods by collecting together a wealth of exercises for which they have supplied complete solutions. These solutions are adapted to the needs and skills of students. To make the book of broad value, the authors supply basic mathematical facts as and when they are needed, and have sprinkled some historical information throughout the text.
Applied statistics is an inherently conservative enterprise, and appropriately so since the scientific world depends heavily on the consistent evaluation of evidence. Conservative consistency is raised to its highest level in classical significance testing, where the control of Type I error is enforced with an almost religious intensity. A p-value of 0.06 rather than 0.04 has decided the fate of entire pharmaceutical companies. Fisher's scale of evidence, Table 3.1, particularly the α = 0.05 threshold, has been used in literally millions of serious scientific studies, and stakes a good claim to being the 20th century's most influential piece of applied mathematics.
All of this makes it more than a little surprising that a powerful rival to Type I error control has emerged in the large-scale testing literature. Since its debut in Benjamini and Hochberg's seminal 1995 paper, false discovery rate control has claimed an increasing portion of statistical research, both applied and theoretical, and seems to have achieved “accepted methodology” status in scientific subject-matter journals.
False discovery rate control moves us away from the significance-testing algorithms of Chapter 3, back toward the empirical Bayes context of Chapter 2. The language of classical testing is often used to describe FDR methods (perhaps in this way assisting their stealthy infiltration of multiple testing practice), but, as the discussion here is intended to show, both their rationale and results are quite different.
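As a reminder of the rule itself (an R sketch of the standard Benjamini–Hochberg step-up procedure, not of this chapter's empirical Bayes development), the hypotheses whose ordered p-values fall below the line q·i/m are rejected:

# Benjamini-Hochberg step-up rule at FDR level q
bh_reject <- function(p, q = 0.10) {
  m   <- length(p)
  ord <- order(p)
  ps  <- p[ord]
  k   <- max(c(0, which(ps <= q * seq_len(m) / m)))   # largest i with p_(i) <= q*i/m
  rej <- rep(FALSE, m)
  if (k > 0) rej[ord[seq_len(k)]] <- TRUE
  rej
}

set.seed(13)
p <- c(runif(1000), pnorm(rnorm(50, mean = 3), lower.tail = FALSE))  # nulls plus signals
sum(bh_reject(p, q = 0.10))              # equivalently: sum(p.adjust(p, "BH") <= 0.10)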
Simultaneous hypothesis testing was a lively topic in the early 1960s, my graduate student years, and had been so since the end of World War II. Rupert Miller's book Simultaneous Statistical Inference appeared in 1966, providing a beautifully lucid summary of the contemporary methodology. A second edition in 1981 recorded only modest gains during the intervening years. This was a respite, not an end: a new burst of innovation in the late 1980s generated important techniques that we will be revisiting in this chapter.
Miller's book, which gives a balanced picture of the theory of that time, has three notable features:
It is overwhelmingly frequentist.
It is focused on control of α, the overall Type I error rate of a procedure.
It is aimed at multiple testing situations with individual cases N between 2 and, say, 10.
We have now entered a scientific age in which N = 10 000 is no cause for raised eyebrows. It is impressive (or worrisome) that the theory of the 1980s continues to play a central role in microarray-era statistical inference. Features 1 and 2 are still the norm in much of the multiple testing literature, despite the obsolescence of Feature 3. This chapter reviews part of that theory, particularly the ingenious algorithms that have been devised to control the overall Type I error rate (also known as FWER, the family-wise error rate).
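For concreteness (an R sketch of the two classical FWER procedures, not of the chapter's own algorithms), Bonferroni and Holm adjustments are available through p.adjust:

set.seed(17)
p <- c(runif(95), pnorm(rnorm(5, mean = 4), lower.tail = FALSE))   # 95 nulls, 5 signals

sum(p.adjust(p, method = "bonferroni") <= 0.05)   # Bonferroni: reject if p_i <= alpha/N
sum(p.adjust(p, method = "holm") <= 0.05)         # Holm step-down, also controls FWER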