More formal methods of statistical analysis are based on a probability model for the data. This represents in idealized form the main features of the variability encountered and possibly also summarizes the data-generating process. Such models contain parameters, some of which encapsulate the research questions of concern. The main aspects of probability models are reviewed and simple examples are given.
Preliminaries
Simple methods of graphical and tabular analysis are of great value. They are essential in the preliminary checking of data quality and in some cases may lead to clear and convincing explanations. They play a role too in presenting the conclusions even of quite complex analyses. In many contexts it is desirable that the conclusions of an analysis can be regarded, in part at least, as summary descriptions of the data as well as interpretable in terms of a probability model.
Nevertheless careful analysis often hinges on the use of an explicit probability model for the data. Such models have a number of aspects:
they may encapsulate research questions and hypotheses in compact and clear form via parameters of interest, or they may specify a simple structure, deviations from which can be isolated and studied in detail;
they provide a way of specifying the uncertainty in conclusions;
they formalize the discounting of features that are in a sense accidents of the specific dataset under analysis.
Assessment of the uncertainty implicit in conclusions is often an important feature of statistical analyses. Although the primary emphasis in this book is not on the specialized concepts and methods involved in such assessments, the present chapter reviews the main ideas involved under the following headings: confidence limits, posterior distributions and significance tests.
Preliminaries
The details of specific statistical methods and the associated mathematical theory will not be discussed here. We shall, however, outline the main forms in which statistical conclusions are presented, because understanding of the strengths and limitations of these forms is essential if misunderstanding is to be avoided.
We discuss first analyses in which interpretation centres on individual parameters of interest; that is, in general we investigate component by component. We denote a single such parameter by ψ. This can represent some property of interest, such as the number of individual animals of a specific wildlife species in a particular area, or it can represent contrasts between groups of individuals in an outcome of interest or, in a linear regression application, ψ can be the slope of the relationship.
It is desirable that ψ is considered in ‘sensible’ units, chosen to give answers within or not too far outside the range from 0.1 to 10. For example, a slope of a relationship of length against time, that is a speed, might be in mm per hour or km per day, etc., as appropriate, and an incidence rate might be in the number of cases per 100 000 person-years or in cases per 100 person-days.
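To take a purely hypothetical figure, an incidence of 0.000017 cases per person-year is more readably reported as 1.7 cases per 100 000 person-years; the rescaling changes nothing of substance but brings the figure within the suggested range.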
The success of an investigation depends crucially on the quality and relevance of the data analysed. Criteria for good measurement procedures are outlined and the quantities measured are classified in various ways, in particular by their role in the study in question.
Criteria for measurements
A crucial aspect of any analysis is the nature and quality of the data that are involved, which we call measurements, using that term in a very general sense. Criteria for satisfactory measurements typically include:
relevance;
adequate precision;
economy; and
absence of distortion of the features studied.
In fields with a solid history of investigation, the measurement techniques necessary to capture the relevant features may well be firmly established. In other contexts, defining a measurement procedure that will yield insightful information may be crucial to successful work. Usually we require of a procedure some mixture of face validity, that is, apparent relevance, and construct validity. The latter implies that a procedure has proven success in establishing meaningful and stable relationships with other variables. These further variables should include those for which there is strong prior reason to expect a connection.
The costs, however measured, of obtaining the data should be commensurate with the objectives. In particular, while individual data quality is of great importance, the collection of unnecessarily large amounts of data is to be avoided; in any case the intrinsic quality of data, for example the response rates of surveys, may be degraded if too much is collected.
Statistical considerations arise in virtually all areas of science and technology and, beyond these, in issues of public and private policy and in everyday life. While the detailed methods used vary greatly in the level of elaboration involved and often in the way they are described, there is a unity of ideas which gives statistics as a subject both its intellectual challenge and its importance.
In this book we have aimed to discuss the ideas involved in applying statistical methods to advance knowledge and understanding. It is a book not on statistical methods as such but, rather, on how these methods are to be deployed. Nor is it a book on the mathematical theory of the methods or on the particular issue of how uncertainty is to be assessed, even though a special feature of many statistical analyses is that they are intended to address the uncertainties involved in drawing conclusions from often highly variable data.
We are writing partly for those working as applied statisticians, partly for subject-matter specialists using statistical ideas extensively in their work and partly for masters and doctoral students of statistics concerned with the relationship between the detailed methods and theory they are studying and the effective application of these ideas. Our aim is to emphasize how statistical ideas may be deployed fruitfully rather than to describe the details of statistical techniques.
An outline is given of some of the steps needed to ensure that the data finally analysed are of appropriate quality. These include data auditing and data screening and the use of simple graphical and tabular preliminary analyses. No rigid boundary should be drawn between such largely informal procedures and the more formal model-based analyses that are the primary focus of statistical discussion.
Introduction
While it is always preferable to start with a thoughtful and systematic exploration of any new set of data, pressure of time may tempt those analysing such data to launch into the ‘interesting’ aspects straight away. With complicated data, or even just complicated data collection processes, this usually represents a false time economy as complications then come to light only at a late stage. As a result, analyses have to be rerun and results adjusted.
In this chapter we consider aspects of data auditing, data screening, data cleaning and preliminary analysis. Much of this work can be described as forms of data exploration, and as such can be regarded as belonging to a continuum that includes, at the other extreme, complex statistical analysis and modelling. Owing to the fundamental importance of data screening and cleaning, guidance on ethical statistical practice, aimed perhaps particularly at official statisticians, has included the recommendation that the data cleaning and screening procedures used should be reported in publications and testimony (American Statistical Association Committee on Professional Ethics, 1999).
Models and likelihood are the backbone of modern statistics. This 2003 book gives an integrated development of these topics that blends theory and practice, intended for advanced undergraduate and graduate students, researchers and practitioners. Its breadth is unrivaled, with sections on survival analysis, missing data, Markov chains, Markov random fields, point processes, graphical models, simulation and Markov chain Monte Carlo, estimating functions, asymptotic approximations, local likelihood and spline regressions as well as on more standard topics such as likelihood and linear and generalized linear models. Each chapter contains a wide range of problems and exercises. Practicals in the S language designed to build computing and data analysis skills, and a library of data sets to accompany the book, are available over the Web.
In this definitive book, D. R. Cox gives a comprehensive and balanced appraisal of statistical inference. He develops the key concepts, describing and comparing the main ideas and controversies over foundational issues that have been keenly argued for more than two hundred years. Drawing on a sixty-year career of major contributions to statistical thought, he is uniquely placed to give this much-needed account of the field. An appendix gives a more personal assessment of the merits of different ideas. The content ranges from the traditional to the contemporary. While specific applications are not treated, the book is strongly motivated by applications across the sciences and associated technologies. The mathematics is kept as elementary as feasible, though previous knowledge of statistics is assumed. The book will be valued by every user or student of statistics who is serious about understanding the uncertainty inherent in conclusions from statistical analyses.
The aim of this book is to present a detailed, but thoroughly clear and understandable, analysis of the nature and scope of the varieties of negative binomial model that are currently available for use in research. Modeling count data using the standard negative binomial model, termed NB2, has recently become one of the foremost methods of analyzing count response models, yet relatively few researchers or applied statisticians are familiar with the varieties of available negative binomial models, or with how best to incorporate them into a research plan.
Note that the Poisson regression model, traditionally considered the basic count model, is in fact an instance of NB2: it is an NB2 model with a heterogeneity parameter equal to 0. We shall discuss the implications of this in the book, as well as other negative binomial models that differ from the NB2. Since the Poisson model is a special case of the NB2 negative binomial, we may regard the latter as the more general of the two and arguably as more representative of the majority of count models used in everyday research.
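To make the relationship concrete, and writing α for the heterogeneity parameter, the NB2 variance function is Var(y) = μ + αμ², where μ is the mean. Setting α = 0 removes the quadratic term and leaves Var(y) = μ, the Poisson variance; the Poisson model is thus the special case of NB2 with no extra-Poisson variation.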
I began writing this second edition of the text in mid-2009, some two years after the first edition was published. Most of the first edition was written in 2006. In just this short time – from 2006 to 2009/2010 – a number of advances have been made in the modeling of count data.
I previously indicated that extended Poisson and negative binomial models have generally been developed to solve either a distributional or a variance problem arising in the base Poisson and NB2 models. We shall discover later that some extended negative binomial models rely on the NB1 parameterization rather than on the NB2, e.g. the fixed-effects negative binomial. With few exceptions, however, the negative binomial models used in actual research are based on the NB2 model.
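For reference, the two parameterizations differ in their variance functions: NB1 has variance μ + αμ = μ(1 + α), which is linear in the mean, whereas NB2 has variance μ + αμ², which is quadratic in the mean. Both reduce to the Poisson variance μ when α = 0.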
Changes to the negative binomial variance function were considered in the last chapter. In this chapter we address the difficulties that arise when the zero counts observed in the data differ considerably from the numbers expected under the distributional assumptions of the Poisson and NB2 models. In the first section we address models whose responses cannot include zero counts.
Zero-truncated count models
Many times we are asked to model count data that structurally exclude zero counts. Hospital length-of-stay data are an example of this type of data. When a patient first enters the hospital, the count begins upon registration, with the length of stay given as 1. There can be no 0 days – unless we are describing patients who do not enter the hospital. This latter situation describes a different type of model where there may be two generating processes – one, for example, for patients who may or may not enter the hospital, and another for patients who are admitted. This type of model will be discussed later.
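In outline, the adjustment needed for such data is a renormalization of the count probabilities: if P(y) denotes the untruncated probability of observing the count y, the zero-truncated model assigns probability P(y)/(1 − P(0)) to each y = 1, 2, .... For a Poisson model with mean μ, for instance, P(0) = e^−μ, so a positive count y has probability e^−μ μ^y / {y!(1 − e^−μ)}.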
The notion of risk lies at the foundation of the modeling of counts. In this chapter we discuss the technical meaning of risk and risk ratio, and how to interpret the estimated incidence rate ratios that are displayed in the model output of Poisson and negative binomial regression. In the process, we also discuss the associated relationship of risk difference, as well as odds and odds ratios, which are generally understood with respect to logistic regression models.
Risk is an exposure to the chance or probability of some outcome, typically thought of as a loss or injury. In epidemiology, risk refers to the probability of a person or group becoming diseased given some set of attributes or characteristics. In more general terms, the risk that an individual with a specified condition will experience a given outcome is the probability that the individual actually experiences the outcome. It is the proportion of individuals with the risk factor who experience the outcome. In epidemiological terms, risk is therefore a measure of the probability of the incidence of disease. The attribute or condition upon which the risk is measured is termed a risk factor, or exposure.
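A small hypothetical example may fix ideas. Suppose that 30 of 200 individuals with a risk factor experience the outcome, compared with 20 of 400 individuals without it. The risk in the exposed group is 30/200 = 0.15 and in the unexposed group 20/400 = 0.05, giving a risk ratio of 0.15/0.05 = 3 and a risk difference of 0.15 − 0.05 = 0.10.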
There are many times when certain data elements are lost, discarded, ignored, or are otherwise excluded from analysis. Truncated and censored models have been developed to deal with these types of data. Both models take two forms: truncation or censoring from below, and truncation or censoring from above. Count model forms take their basic logic from truncated and censored continuous response data, in particular, from Tobit (Amemiya, 1984) and censored normal regression (Goldberger, 1983) respectively.
The traditional parameterization used for truncated and censored count data can be called the econometric or cut-point parameterization. This is the form of model discussed in standard econometric texts and is the form found in current econometric software implementations. I distinguish this from what I term a survival parameterization, the form of which is derived from standard survival models. This parameterization only relates to censored Poisson and censored negative binomial models. I shall first address the more traditional econometric parameterization.
Censored and truncated models – econometric parameterization
Censored and truncated count models are related, with only a relatively minor algorithmic difference between the two. The essential difference relates to how response values beyond a user-defined cut-point are handled. Truncated models eliminate the values altogether; censored models revalue them to the value of the cut-point. In both cases the probability and log-likelihood functions must be adjusted to account for the change in the distribution of the response. We begin by considering truncation.
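In symbols, for a count distribution P(y) and a user-defined cut-point C (the exact convention for values at the cut-point varies), truncation from above restricts the response to y < C and renormalizes each retained probability to P(y)/P(Y < C), whereas censoring from above keeps every observation but records values of C or more simply as C, so that such observations contribute P(Y ≥ C) to the likelihood. Truncation and censoring from below are handled analogously, the zero-truncated models discussed earlier being the most familiar case.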
This chapter takes us into areas that have primarily been addressed in the econometric literature. However, the methods discussed are just as appropriate for research in biostatistics and medical health, and in social and political analysis. Recall that hurdle, zero-inflated, and selection models all partition the response distribution into two components. Zero-inflated models explain excess zero counts by mixing a binary generating process with a Poisson generating process that itself includes zero. Hurdle models, on the other hand, are also partitioned into binary and count components, but the counts are generated by two separate, non-overlapping processes: a binary model distinguishes zero from positive counts, and a zero-truncated count model generates the positive counts. We found that zero-inflated models employ traditional binary models, such as logit and probit regression, for understanding the zero counts, whereas the binary component of a hurdle model may be either a count model right-censored at 1 or a traditional binary model. The finite mixture models discussed in Section 11.1 extend hurdle models in such a manner that processes generating disparate types of counts may be identified.
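The contrast can be made explicit by writing out the two probability functions. With mixing probability π and Poisson probabilities f(y; μ), a zero-inflated Poisson model has P(0) = π + (1 − π)f(0; μ) and P(y) = (1 − π)f(y; μ) for y > 0, so that zeros arise from both components. A Poisson hurdle model instead has P(0) = π₀ and P(y) = (1 − π₀)f(y; μ)/{1 − f(0; μ)} for y > 0, with every zero coming from the binary component and the positive counts from a zero-truncated process.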
We follow an examination of finite mixture models with an overview of the methods used to deal with the problem of endogenous predictors. Actually, endogeneity touches on finite mixture models, as well as on count models with unexplained heterogeneity in general, e.g. generalized Waring regression.
The negative binomial regression model is a truly unusual statistical model. Typically, those in the statistical community refer to the negative binomial as a single model, as we would in referring to Poisson regression, logistic regression, or probit regression. However, there are in fact several distinct negative binomial models, each of which is referred to as a negative binomial model. Boswell and Patil (1970) identified 13 separate derivations of the negative binomial distribution. Other statisticians have argued that there are even more. Generally, those who use the distribution as the basis for a statistical model of count data have no idea that the parameterization of the negative binomial they are employing may differ from the parameterization being used by someone else. Most of the time it makes little difference how the distribution is derived, but, as we shall discover, there are times when it does. Perhaps no other model has such a varied pedigree.
I will provide an outline here of the intertwining nature of the negative binomial. Unless you already have a solid background in this area of statistics, my overview is not likely to be completely clear. But, as we progress through the book, its logic will become evident.
Bayesian statistics has grown continually in popularity since the 1980s. Currently there are societies devoted to Bayesian methodology, and the major statistical associations typically have committees devoted to the promotion of Bayesian techniques. In the discipline of astrostatistics, for example, the majority of articles related to the statistical analysis of astronomical data employ some type of Bayesian approach. Not all disciplines are so committed to Bayesian methodology, of course, but a growing number of researchers are turning to Bayesian modeling. In this chapter we shall present a brief overview of this approach and give an example of a Bayesian negative binomial model.
Bayesian statistics is named after Thomas Bayes (1702–1761), a British Presbyterian minister and amateur mathematician who was interested in the notion of inverse probability, now referred to as posterior probability. The notion of posterior probability can perhaps best be understood with a simple example.
Suppose that 90% of the students in a statistics course pass. Students who passed are given the symbol P; students who failed are given an F. Therefore, 10% of the students failed the course. Everyone who passed purchased the course text, while only half of the students who failed did so. Let's suppose we observe a student with the course text (assume they purchased it). What is the probability that this student failed? This is an example of posterior probability.
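Working the example through with Bayes' theorem: P(F) = 0.1, P(P) = 0.9, the probability of owning the text given a fail is 0.5, and given a pass it is 1. Hence the posterior probability of a fail given ownership of the text is (0.5 × 0.1)/(0.5 × 0.1 + 1 × 0.9) = 0.05/0.95 ≈ 0.053. The prior probability of failure, 0.10, is updated to roughly 0.05 in the light of the observation.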
Negative binomial regression has traditionally been used to model otherwise overdispersed count or Poisson data. It is now considered the general catch-all method used when Poisson data are found to be overdispersed, particularly when the source of overdispersion has not been identified. When we can identify what gives rise to extra correlation, and hence overdispersion, the basic Poisson and negative binomial algorithms may themselves be adjusted or enhanced to address the identified source of extra correlation directly. For example, when overdispersion results from an excess of zero counts in the response, an appropriate strategy is to model the data using either a zero-inflated Poisson (ZIP) or a zero-inflated negative binomial (ZINB) model. Employing a hurdle model may also result in a better fit. On the other hand, if the response is structured such that zero counts are not possible, as in hospital length-of-stay data, a zero-truncated Poisson (ZTP) or zero-truncated negative binomial (ZTNB) model may be appropriate.
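As an informal illustration of this strategy (not code from the book; it uses the Python statsmodels library, and the simulated data and variable names are purely hypothetical), one might check a Poisson fit for overdispersion and then fit an NB2 model as follows.

import numpy as np
import statsmodels.api as sm

# Simulate gamma-Poisson (NB2) counts with heterogeneity alpha = 0.7 -- hypothetical data.
rng = np.random.default_rng(2024)
n = 2000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)          # conditional mean
alpha = 0.7                          # heterogeneity (overdispersion) parameter
y = rng.poisson(rng.gamma(shape=1.0 / alpha, scale=alpha * mu))

X = sm.add_constant(x)

# A Pearson dispersion statistic well above 1 signals overdispersion in the Poisson fit.
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print("Pearson dispersion:", poisson_fit.pearson_chi2 / poisson_fit.df_resid)

# The NB2 fit estimates alpha directly; it should be close to the simulated value.
nb2_fit = sm.NegativeBinomial(y, X, loglike_method="nb2").fit(disp=0)
print(nb2_fit.summary())

If the excess zeros themselves were the identified source of the overdispersion, the zero-inflated and hurdle classes mentioned above would be the natural next step rather than a plain NB2 fit.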
A variety of alternative models have been developed to address specific features of the data that give rise to overdispersion. Models dealing with an excess or absence of zeros typically define a mixture that alters the distributional assumptions of the Poisson model. Other models are constructed to alter not the probability and log-likelihood functions but rather the Poisson and negative binomial variance functions. We discuss these types of models in this chapter.