Most of the models considered in the previous chapters are members of the generalized linear model family and have the form g(μ) = xᵀβ, with link function g. The models are nonlinear because of the link function, but they are nonetheless parametric, because the effect of covariates enters through the linear predictor η = xᵀβ. In many applications, parametric models are too restrictive. For example, in a linear logit model with a one-dimensional predictor it is assumed that the response probability is either strictly increasing or strictly decreasing over the whole range of the predictor, provided the covariate has any effect at all.
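A minimal sketch may make this structure concrete. Assuming a binary response and the logit link, the following Python fragment (synthetic data; all names and values are illustrative, not taken from the text) fits a model of the form g(μ) = xᵀβ:

```python
# A minimal sketch, assuming a binary response and the logit link:
# the mean mu is tied to the linear predictor eta = x'beta by g(mu) = eta.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
eta = -0.5 + 1.2 * x                    # linear predictor eta = x'beta
mu = 1.0 / (1.0 + np.exp(-eta))         # inverse link: mu = g^{-1}(eta)
y = rng.binomial(1, mu)

X = sm.add_constant(x)                  # design matrix with intercept
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link is the default
print(fit.params)                       # estimates of beta
```

The model is nonlinear in x on the probability scale, yet parametric: everything about the covariate effect is carried by the two coefficients of the linear predictor.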
Example 10.1: Duration of Unemployment
When duration of unemployment is measured by two categories, short-term unemployment (1: below 6 months) and long-term unemployment (0: above 6 months), an interesting covariate is the age of the unemployed person. Figure 10.1 shows the fits of a linear logistic model, a model with additional quadratic terms, and a model with cubic terms. The most restrictive model is the linear logistic model, which implies strict monotonicity of the probability depending on age. It is seen that the fit is rather crude and unable to fit the observations at the boundary. The quadratic and cubic logistic models show a better fit to the data but still lack flexibility. Non-parametric fits, which will be considered in this chapter, are also given in Figure 10.1. They show that the probability of short-term unemployment seems to be rather constant up to about 45 years of age but then decreases strongly. The methods behind these fitted curves will be considered in Section 10.1.3.
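The comparison in Example 10.1 can be sketched in code. This is a hypothetical reconstruction, not the book's data or figures: synthetic ages and a true curve that is roughly flat before decreasing stand in for the unemployment study, and a lowess smoother stands in for the non-parametric fits of Section 10.1.3.

```python
# Hedged sketch of Example 10.1: polynomial logit fits vs. a smoother.
import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
age = rng.uniform(18, 65, size=500)
# Illustrative true curve: fairly flat at younger ages, then decreasing.
p_true = 0.7 / (1.0 + np.exp(0.3 * (age - 48)))
short_term = rng.binomial(1, p_true)     # 1: below 6 months, 0: above

age_s = (age - age.mean()) / age.std()   # standardize for numerical stability
for degree in (1, 2, 3):                 # linear, quadratic, cubic logit models
    X = sm.add_constant(np.column_stack([age_s**d for d in range(1, degree + 1)]))
    fit = sm.Logit(short_term, X).fit(disp=False)
    print(f"degree {degree}: log-likelihood = {fit.llf:.1f}")

# A simple non-parametric comparison: locally weighted smoothing of the
# raw 0/1 responses against age (a stand-in for the fits in Section 10.1.3).
smooth = lowess(short_term, age, frac=0.4)
```

Plotting the polynomial fits against the lowess curve reproduces the qualitative point of the example: low-degree polynomials enforce a global shape, while the smoother follows local structure in the data.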
In many regression problems the response is restricted to a fixed set of possible values, the so-called response categories. Response variables of this type are called polytomous or multicategory responses. In economic applications, the response categories may refer to the choice of different brands or to the choice of the transport mode (Example 1.3). In medical applications, the response categories may represent different side effects of medical treatment or several types of infection that may follow an operation. Most rating scales have fixed response categories that measure, for example, the medical condition after some treatment in categories like good, fair, and poor, or the severity of symptoms in categories like none, mild, moderate, and marked. These examples show that there are at least two cases to be distinguished, namely, the case where response categories are mere labels that have no inherent ordering and the case where categories are ordered. In the first case, the response Y is measured on a nominal scale. Instead of using the numbers 1, …, k for the response categories, any set of k numbers would do. In the second case, the response is measured on an ordinal scale, where the ordering of the categories and the corresponding numbers may be interpreted but not the distance or spacing between categories. Figures 8.1 and 8.2 illustrate different scalings of response categories. In the nominal case the response categories are given in an unsystematic way, while in the ordinal case the response categories are given on a straight line, thus illustrating the ordering of the categories.
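The practical consequence of the nominal/ordinal distinction can be sketched as follows. Assuming synthetic data and a three-category response (the construction below is illustrative only, not an example from the text), the fragment contrasts a multinomial logit, which ignores the ordering, with a cumulative (proportional-odds) logit, which exploits it:

```python
# Hedged sketch: nominal vs. ordinal treatment of a 3-category response.
import numpy as np
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
x = rng.normal(size=400)
latent = 0.8 * x + rng.logistic(size=400)   # latent-variable construction
y = np.digitize(latent, bins=[-1.0, 1.0])   # ordered codes 0, 1, 2

# Nominal treatment: the multinomial logit ignores the ordering; any
# relabelling of the three categories would fit equally well.
nominal = sm.MNLogit(y, sm.add_constant(x)).fit(disp=False)

# Ordinal treatment: the cumulative (proportional-odds) logit uses the
# ordering, with a single slope and two threshold parameters.
ordinal = OrderedModel(y, x[:, None], distr="logit").fit(method="bfgs", disp=False)
print(nominal.params, ordinal.params, sep="\n")
```

The nominal model spends one slope per non-reference category, whereas the ordinal model is more parsimonious precisely because the ordering carries information.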
Contingency tables, or cross-classified data, come in various forms, differing in dimensions, distributional assumptions, and margins. In general, they may be seen as a structured way of representing count data. They were already used to represent data in binary and multinomial regression problems when explanatory variables were categorical (Chapters 2 and 8). Also, count data with categorical explanatory variables (Chapter 7) may be given in the form of contingency tables.
In this chapter log-linear models are presented that may be seen as regression models or association models, depending on the underlying distribution. Three types of distributions are considered: the Poisson distribution, the multinomial, and the product-multinomial distribution. When the underlying distribution is a Poisson distribution, one considers regression problems as in Chapter 7. When the underlying distribution is multinomial, or product-multinomial, one has more structure in the multinomial response than in the regression problems considered in Chapters 2 and 8. In those chapters the response is assumed to be multinomial without further structuring, whereas in the present chapter the multinomial response arises from the consideration of several response variables that together form a contingency table. Then one wants to analyze the association between these variables. Log-linear models provide a common tool to investigate the association structure in terms of independence or conditional independence between variables. Several examples of contingency tables have already been given in previous chapters. Two more examples are the following.
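As a concrete illustration (hypothetical counts, not one of the book's examples), the independence log-linear model for a two-way table can be fitted as a Poisson regression; the residual deviance then serves as a goodness-of-fit test of independence between the two variables.

```python
# Hedged sketch: log-linear independence model as a Poisson GLM.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Illustrative 2 x 3 table in "long" form: one row per cell, with its count.
cells = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "B": ["b1", "b2", "b3", "b1", "b2", "b3"],
    "count": [25, 40, 35, 15, 30, 55],
})

# Independence model: log(mu_ij) = lambda + lambda_i^A + lambda_j^B;
# adding the A:B interaction would give the saturated model.
indep = smf.glm("count ~ A + B", data=cells,
                family=sm.families.Poisson()).fit()
print(indep.deviance)   # goodness-of-fit statistic for independence, df = 2
```

The same fit is obtained whether the counts are regarded as Poisson or as multinomial with fixed total; what changes is the interpretation, regression versus association, as described above.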
The focus of this book is on applied structured regression modeling for categorical data. Therefore, it is concerned with the traditional problems of regression analysis: finding a parsimonious but adequate model for the relationship between response and explanatory variables, quantifying the relationship, selecting the influential variables, and predicting the response given explanatory variables.
The objective of the book is to introduce basic and advanced concepts of categorical regressions with the focus on the structuring constituents of regressions. The term “categorical” is understood in a wider sense, including also count data. Unlike other texts on categorical data analysis, a classical analysis of contingency tables in terms of association analysis is considered only briefly. For most contingency tables that will be considered as examples, one or more of the involved variables will be treated as the response. With the focus on regression modeling, the generalized linear model is used as a unifying framework whenever possible. In particular, parametric models are treated within this framework.
In addition to standard methods like the logit and probit models and their extensions to multivariate settings, more recent developments in flexible and high-dimensional regressions are included. Flexible or non-parametric regressions allow the weakening of the assumptions on the structuring of the predictor and yield fits that are closer to the data. High-dimensional regression has been driven by the advance of quantitative genetics with its thousands of measurements. The challenge, for example in gene expression data, is in the dimensions of the datasets. The data to be analyzed have the unusual feature that the number of variables is much higher than the number of cases.
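A hedged sketch of the p ≫ n situation: with many more predictors than cases, an L1-penalized logistic regression selects a sparse subset of variables. The data below are synthetic and the tuning constant is arbitrary; this illustrates the setting rather than any specific method from the book.

```python
# Hedged sketch: sparse logistic regression when variables outnumber cases.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 60, 2000                          # far more variables than cases
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                           # only five predictors truly matter
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# The L1 penalty shrinks most coefficients exactly to zero; C is arbitrary here.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("selected predictors:", np.flatnonzero(lasso.coef_))
```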
Multivariate methods are employed widely in the analysis of experimental data but are poorly understood by those users who are not statisticians. This is because of the wide divergence between the theory and practice of multivariate methods. This book provides concise yet thorough surveys of developments in multivariate statistical analysis and gives statistically sound coverage of the subject. The contributors are all experienced in the theory and practice of multivariate methods and their aim has been to emphasize the major features from the point of view of applicability and to indicate the limitations and conditions of the techniques. Professional statisticians wanting to improve their background in applicable methods, users of high-level statistical methods wanting to improve their background in fundamentals, and graduate students of statistics will all find this volume of value and use.
Applied statistics is more than data analysis, but it is easy to lose sight of the big picture. David Cox and Christl Donnelly distil decades of scientific experience into usable principles for the successful application of statistics, showing how good statistical strategy shapes every stage of an investigation. As you advance from research or policy question, to study design, through modelling and interpretation, and finally to meaningful conclusions, this book will be a valuable guide. Over a hundred illustrations from a wide variety of real applications make the conceptual points concrete, illuminating your path and deepening your understanding. This book is essential reading for anyone who makes extensive use of statistical methods in their work.
In this chapter we deal with various issues that were not taken up in earlier chapters, and conclude with a broad summary of the strategic and tactical aspects of statistical analysis.
Historical development
The development of new statistical methods stems directly or indirectly from specific applications or groups of related applications. It is part of the role of statistical theory in the broad sense not only to develop new concepts and procedures but also to consolidate new developments, often developed in an ad hoc way, into a coherent framework. While many fields of study have been involved, at certain times specific subjects have been particularly influential. Thus in the period 1920–1939 there was a strong emphasis on agricultural research; from 1945 problems from the process industries drove many statistical developments; from 1970 onwards much statistical motivation has arisen out of medical statistics and epidemiology. Recently genetics, particle physics and astrophysics have raised major issues. Throughout this time social statistics, which gave the subject its name, have remained a source of challenge.
Clearly a major force in current developments is the spectacular growth in computing power for analysis, in data storage and capture, and in highly sophisticated measurement procedures. The availability of data too extensive to analyse is, however, by no means a new phenomenon. Fifty or more years ago the extensive data recorded on paper tapes could scarcely be analysed numerically, although the now-defunct skill of analogue computation could occasionally be deployed.
This, the first of two chapters on design issues, describes the common features of, and distinctions between, observational and experimental investigations. The main types of observational study, cross-sectional, prospective and retrospective, are presented and simple features of experimental design outlined.
Introduction
In principle an investigation begins with the formulation of a research question or questions, or sometimes more specifically a research hypothesis. In practice, clarification of the issues to be addressed is likely to evolve during the design phase, especially when rather new or complex ideas are involved. Research questions may arise from a need to clarify and extend previous work in a field or to test theoretical predictions, or they may stem from a matter of public policy or other decision-making concern. In the latter type of application the primary feature tends to be to establish directly relevant conclusions, in as objective a way as possible. Does culling wildlife reduce disease incidence in farm animals? Does a particular medical procedure decrease the chance of heart disease? These are examples of precisely posed questions. In other contexts the objective may be primarily to gain understanding of the underlying processes. While the specific objectives of each individual study always need careful consideration, we aim to present ideas in as generally applicable a form as possible.
This second chapter on design issues describes the main types of study in more detail. For sampling an explicit population of individuals the importance of a sampling frame is emphasized. Key principles of experimental design are discussed, including the factorial concept. Finally, various types of comparative observational investigation are outlined.
Preliminaries
We now discuss in a little more detail the main types of study listed in Section 2.3. The distinctions between them are important, notably the contrast between observational and experimental investigations. Nevertheless the broad objectives set out in the previous chapter are largely common to all types of study.
The simplest investigations involve the sampling of explicit populations, and we discuss these first. Such methods are widely used by government agencies to estimate population characteristics, but the ideas apply much more generally. Thus, sampling techniques are often used within other types of work. For example, the quality, rather than quantity, of crops in an agricultural field trial might be assessed partly by chemical analysis of small samples of material taken from each plot, or even from a subset of plots.
By contrast, the techniques of experimental design are concentrated on achieving secure conclusions, sometimes in relatively complicated situations, but in contexts where the investigator has control over the main features of the system under study. We discuss these as our second theme in this chapter, partly because they provide a basis that should be emulated in observational studies.
Interpretation is here concerned with relating, as deeply as is feasible, the conclusions of a statistical analysis to the underlying subject matter. Often this concerns attempts to establish causality, discussion of which is a main focus of the chapter. A more specialized aspect involves the role of statistical interaction in this setting.
Introduction
We may draw a broad distinction between two different roles of scientific investigation. One is to describe an aspect of the physical, biological, social or other world as accurately as possible within some given frame of reference. The other is to understand phenomena, typically by relating conclusions at one level of detail to processes at some deeper level.
In line with that distinction, we have made an important, if rather vague, distinction in earlier chapters between analysis and interpretation. In the latter, the subject-matter meaning and consequences of the data are emphasized, and it is obvious that specific subject-matter considerations must figure strongly and that in some contexts the process at work is intrinsically more speculative. Here we discuss some general issues.
Specific topics involve the following interrelated points.
To what extent can we understand why the data are as they are rather than just describe patterns of variability?
How generally applicable are such conclusions from a study?
Given that statistical conclusions are intrinsically about aggregates, to what extent are the conclusions applicable in specific instances?
An ideal sequence is defined specifying the progression of an investigation from the conception of one or more research questions to the drawing of conclusions. The role of statistical thinking is outlined for design, measurement, analysis and interpretation.
Preliminaries
This short chapter gives a general account of the issues to be discussed in the book, namely those connected with situations in which appreciable unexplained and haphazard variation is present. We outline in idealized form the main phases of this kind of scientific investigation and the stages of statistical analysis likely to be needed.
It would be arid to attempt a precise definition of statistical analysis as contrasted with other forms of analysis. The need for statistical analysis typically arises from the presence of unexplained and haphazard variation. Such variability may be some combination of natural variability and measurement or other error. The former is potentially of intrinsic interest whereas the latter is in principle just a nuisance, although it may need careful consideration owing to its potential effect on the interpretation of results.
Illustration: Variability and error. The fact that features of biological organisms vary between nominally similar individuals may, as in studies of inheritance, be a crucial part of the phenomenon being studied. That, say, repeated measurements of the height of the same individual vary erratically is not of intrinsic interest, although it may under some circumstances need consideration. […]
Detailed consideration is required to ensure that the most appropriate parameters of interest are chosen for a particular research question. It is also important to ensure the appropriate treatment of nonspecific effects, which correspond to systematic differences that are not of direct concern. Thus in the present chapter we discuss aspects relating to the choice of models for a particular application, first the choice between distinct model families and then the choice of a specific model within the selected family.
Criteria for parameters
Preliminaries
In some applications analysis and interpretation may be based on nonparametric formulations, for example the use of smooth curves or surfaces summarizing complex dependencies not easily captured in a simple formula. Estimated spectral densities of time series, or line spectra representing complex mixtures of molecules, are examples. Mostly, however, we aim to summarize the aspects of interest by parameters, preferably small in number and formally defined as properties of the probability model. In the cases on which we concentrate, the distribution specified by the model is determined by a finite number of such unknown parameters.
For a specific research question, parameters may be classified as parameters of interest, that is, directly addressing the questions of concern, or as nuisance parameters necessary to complete the statistical specification. Often the variation studied is a mixture of systematic and haphazard components, with attention focused on the former.