With the exception of Section 4.2, and the first half of Section 4.3, the material of this chapter isn't much used elsewhere and can be skipped at first reading. Yet it would be regrettable in a book of this kind to stay solely with the very simplest of Monte Carlo. Specialist books in finance are Jäckel (2002), Glasserman (2004) and Dagpunar (2007). Most of their illustrations are from short-term finance, and it seems useful to have a text more adapted to the needs of actuaries, with examples from general insurance too.
We shall start with table look-up methods in the next section. These are not always mentioned, yet they possess very useful qualities. They are general (can virtually always be used) and are ultra-fast once they are in place. With Poisson or multinomial sampling these are the methods of choice when there are very many repetitions. At the expense of some discretization error, continuous distributions can be handled too. The restriction of table methods is one of dimensionality. We must be able to break the simulation down into independent variables or, at most, a handful of dependent ones. When that is impossible, Markov chain Monte Carlo is another general approach; see Section 4.8.
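The table look-up idea can be illustrated with a minimal sketch for Poisson sampling. It is not taken from the chapter itself: the function names, the table size and the Poisson parameter are assumptions chosen for illustration. The table stores the Poisson quantile function on a regular grid, so each draw costs a single random index.

```python
import math
import random

def build_poisson_table(lam, size=10000):
    """Build a look-up table whose entries approximately follow Poisson(lam).

    Entry j holds the smallest k with F(k) >= (j + 0.5) / size, so picking a
    uniformly random entry is inverse-CDF sampling on a grid of `size` points.
    Assumes lam is moderate, so that exp(-lam) does not underflow.
    """
    table = []
    k = 0
    pmf = math.exp(-lam)   # P(N = 0)
    cdf = pmf
    for j in range(size):
        u = (j + 0.5) / size
        while cdf < u:
            k += 1
            pmf *= lam / k  # P(N = k) obtained from P(N = k - 1)
            cdf += pmf
        table.append(k)
    return table

def sample_from_table(table, n, rng):
    """Draw n values; each draw is one random index into the table."""
    size = len(table)
    return [table[rng.randrange(size)] for _ in range(n)]

# Illustrative use with an assumed parameter lam = 4.
table = build_poisson_table(lam=4.0)
rng = random.Random(1)
draws = sample_from_table(table, 50000, rng)
table_mean = sum(table) / len(table)    # close to 4, up to grid error
sample_mean = sum(draws) / len(draws)   # close to 4, up to Monte Carlo error
```

The set-up cost is paid once; after that the sampling speed does not depend on the distribution, which is why the approach pays off when there are very many repetitions. The grid introduces a small error in the tails, which shrinks as the table grows.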
The other methods (Sections 4.3–4.6) are specifically geared towards speed enhancement, sometimes achieving drastic improvement.
The case-control approach is a powerful method for investigating factors that may explain a particular event. It is extensively used in epidemiology to study disease incidence, one of the best-known examples being Bradford Hill and Doll's investigation of the possible connection between cigarette smoking and lung cancer. More recently, case-control studies have been increasingly used in other fields, including sociology and econometrics. With a particular focus on statistical analysis, this book is ideal for applied and theoretical statisticians wanting an up-to-date introduction to the field. It covers the fundamentals of case-control study design and analysis as well as more recent developments, including two-stage studies, case-only studies and methods for case-control sampling in time. The latter have important applications in large prospective cohorts which require case-control sampling designs to make efficient use of resources. More theoretical background is provided in an appendix for those new to the field.
This book is about the planning and analysis of a special kind of investigation: a case-control study. We use this term to cover a number of different designs. In the simplest form individuals with an outcome of interest, possibly rare, are observed and information about past experience is obtained. In addition corresponding data are obtained on suitable controls in the hope of explaining what influences the outcome. In this book we are largely concerned with binary outcomes, for example indicating disease diagnosis or death. Such studies are reasonably called retrospective as contrasted with prospective studies, in which one records explanatory features and then waits to see what outcome arises. In retrospective studies we are studying the causes of effects and in prospective studies we are studying the effects of causes. We also discuss some extensions of case-control studies to incorporate temporality, which may be more appropriately viewed as a form of prospective study. The key aspect of all these designs is that they involve a sample of the underlying population that motivates the study, in which individuals with certain outcomes are strongly over-represented.
While we shall concentrate on the many special issues raised by such studies, we begin with a brief survey of the general themes of statistical design and analysis. We use a terminology deriving in part from epidemiological applications although the ideas are of much broader relevance.
We start the general discussion by considering a population of study individuals, patients, say, assumed to be statistically independent.
• A case-control study is a retrospective observational study and is an alternative to a prospective observational study. Cases are identified in an underlying population and a comparable control group is sampled.
• In the standard design exposure information is obtained retrospectively, though this is not necessarily the case if the case-control sample is nested within a prospective cohort.
• Prospective studies are not cost effective for rare outcomes. By contrast, in a case-control study the ratio of cases to controls is higher than in the underlying population in order to make more efficient use of resources.
• There are two main types of case-control design: matched and unmatched.
• The odds ratio is the most commonly used measure of association between exposure and outcome in a case-control study.
• Important extensions to the standard case-control design include the explicit incorporation of time into the choice of controls and into the analysis.
Defining a case-control study
Consider a population of interest, for example the general population of the UK, perhaps restricted by gender or age group. We may call a representation of the process by which exposures X and outcomes Y occur in the presence of intrinsic features W the population model. As noted in the Preamble, such a system may be investigated prospectively or retrospectively; see Figure 1.1. In a prospective or cohort study a suitable sample of individuals is chosen to represent the population of interest, values of (W, X) are determined and the individuals are followed through time until the outcome Y can be observed.
The retrospective case-control approach provides a powerful method for studying rare events and their dependence on explanatory features. The method is extensively used in epidemiology to study disease incidence, one of the best-known early examples being the investigation by Bradford Hill and Doll of the possible impact of smoking and pollution on lung cancer. More recently the approach has been ever more widely used, by no means only in an epidemiological setting. There have also been various extensions of the method.
A definitive account in an epidemiological context was given by Breslow and Day in 1980 and their book remains a key source with many important insights. Our book is addressed to a somewhat more statistical readership and aims to cover recent developments. There is an emphasis on the analysis of data arising in case-control studies, but we also focus in a number of places on design issues. We have tried to make the book reasonably self-contained; some familiarity with simple statistical methods and theory is assumed, however. Many methods described in the book rely on the use of maximum likelihood estimation, and the extension of this to pseudolikelihoods is required in the later chapters. We have therefore included an appendix outlining some theoretical details.
There is an enormous statistical literature on case-control studies. Some of the most important fundamental work appeared in the late 1970s, while the later 1980s and the 1990s saw the establishment of methods for case-control sampling in time.
The misclassification of exposures and outcomes and errors in continuous exposures result in biased estimates of associations between exposure and outcome. A particular consideration that arises in case-control studies is differential error or misclassification that depends on the outcome.
Relatively simple methods can be used to correct for misclassification in binary exposures, provided that there is information available on the sensitivity and specificity of the measured exposure, for example from a validation study. These methods extend to allow differential misclassification and additionally to allow for misclassification in binary outcomes.
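As a minimal sketch of one such correction, assume that the observed exposure prevalence p* in a group is related to the true prevalence p by p* = se·p + (1 − sp)·(1 − p), where se and sp are the sensitivity and specificity of the measured exposure. Inverting this relation separately for cases and controls gives a corrected odds ratio, and using different (se, sp) for the two groups allows for differential misclassification. The function names and all numbers below are hypothetical.

```python
def corrected_prevalence(observed, sensitivity, specificity):
    """Solve observed = se * true + (1 - sp) * (1 - true) for the true prevalence.

    With sampling noise the result can fall outside (0, 1); no truncation
    is applied in this sketch.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed + specificity - 1.0) / denom

def odds(p):
    return p / (1.0 - p)

def corrected_odds_ratio(obs_cases, obs_controls,
                         sens_cases, spec_cases,
                         sens_controls, spec_controls):
    """Odds ratio after correcting the exposure prevalence in each group."""
    p_cases = corrected_prevalence(obs_cases, sens_cases, spec_cases)
    p_controls = corrected_prevalence(obs_controls, sens_controls, spec_controls)
    return odds(p_cases) / odds(p_controls)

# Hypothetical example: true prevalences 0.4 (cases) and 0.2 (controls),
# measured with se = 0.8 and sp = 0.9 in both groups, give observed
# prevalences 0.38 and 0.24.
naive_or = odds(0.38) / odds(0.24)
corrected_or = corrected_odds_ratio(0.38, 0.24, 0.8, 0.9, 0.8, 0.9)
```

In this example the naive odds ratio is attenuated towards one, and the correction recovers the odds ratio implied by the true prevalences. In practice se and sp are themselves estimates, for example from a validation study, and their uncertainty should be carried into the standard error of the corrected estimate.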
Error in continuous exposures arises in many areas of application and can take different forms. The form of the error influences its effect on the estimated association between exposure and outcome.
A commonly used method for correcting error in continuous exposures is regression calibration, which relies on an assumption of non-differential error. Correction methods that allow differential error include multiple imputation and moment reconstruction.
Preliminaries
In this chapter we discuss the effects of misclassification and measurement error and methods for making corrections for these effects. The focus is naturally on case-control studies, but much of the discussion and methods apply more generally. After some preliminary remarks, the chapter is divided broadly into three sections:
• misclassification of a binary or categorical exposure;
• misclassification of case-control status;
• error in the measurement of a continuous exposure.
Logistic regression can be used to estimate odds ratios using data from a case-control sample as though the data had arisen prospectively. This allows regression adjustment for background and confounding variables and makes possible the estimation of odds ratios for continuous exposures using case-control data.
The logistic regression of case-control data gives the correct estimates of log odds ratios, and their standard errors are as given by the inverse of the information matrix.
The logistic regression model is in a special class of regression models for estimating exposure-outcome associations that may be used to analyse case-control study data as though they had arisen prospectively. Another regression model of this type is the proportional odds model. For other models, including the additive risk model, case-control data alone cannot provide estimates of the appropriate parameters.
Absolute risks cannot be estimated from case-control data without additional information on the proportions of cases and controls in the underlying population.
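The points above can be checked numerically in the simplest setting of a single binary exposure, where the logistic model is saturated and its maximum likelihood estimates are the empirical logits. The sketch below uses hypothetical population counts and hypothetical sampling fractions, and works with expected counts rather than a random sample. The slope (log odds ratio) is unchanged by case-control sampling, while the intercept is shifted by the log ratio of the sampling fractions, so that absolute risks can be recovered only if those fractions are known.

```python
import math

def saturated_logit(cases0, noncases0, cases1, noncases1):
    """ML estimates (alpha, beta) in logit P(Y=1 | X=x) = alpha + beta * x
    for a binary exposure, computed directly from the 2 x 2 table."""
    alpha = math.log(cases0 / noncases0)
    beta = math.log(cases1 / noncases1) - alpha
    return alpha, beta

# Hypothetical underlying population.
pop = dict(cases0=100, noncases0=9900, cases1=150, noncases1=4850)

# Hypothetical sampling fractions: all cases, 5% of non-cases.
f_case, f_control = 1.0, 0.05
cc = dict(
    cases0=pop["cases0"] * f_case,
    noncases0=pop["noncases0"] * f_control,
    cases1=pop["cases1"] * f_case,
    noncases1=pop["noncases1"] * f_control,
)

alpha_pop, beta_pop = saturated_logit(**pop)
alpha_cc, beta_cc = saturated_logit(**cc)

# The intercept is recoverable only with the extra information f_case / f_control.
alpha_recovered = alpha_cc - math.log(f_case / f_control)
risk_unexposed = 1.0 / (1.0 + math.exp(-alpha_recovered))
```

Here `beta_cc` equals `beta_pop`, whereas `alpha_cc` exceeds `alpha_pop` by log(1/0.05) = log 20; subtracting that offset returns the population risk of 0.01 among the unexposed.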
Preliminaries
The previous chapters have introduced the key features of case-control studies but their content has been restricted largely to the study of single binary exposure variables. We now give a more general development. The broad features used for interpretation are as before:
• a study population of interest, from which the case-control sample is taken;
• a sampling model constituting the model under which the case-control data arise and which includes a representation of the data collection process;
• an inverse model representing the population dependence of the response on the explanatory variables; this model is the target for interpretation.
The case-subcohort design, often called simply the case-cohort design, is an alternative to the nested case-control design for case-control sampling within a cohort.
The primary feature of a case-subcohort study is the ‘subcohort’, which is a random sample from the cohort and which serves as the set of potential controls for all cases. The study comprises the subcohort plus all additional cases, that is, those not in the subcohort.
In an analysis using event times the cases are compared with members of the subcohort who are at risk at their event time, using a pseudo-partial likelihood. This results in estimates of hazard ratios.
An advantage of this design is that the same subcohort can be used to study cases of different types.
A simpler form of case-subcohort study disregards event times and is sometimes referred to as a case-base study or hybrid epidemiologic design. In this the subcohort enables estimation of risk ratios and odds ratios.
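For the simpler form that disregards event times, a minimal sketch with a binary exposure may help. Because the subcohort is a random sample of the cohort, its ratio of exposed to unexposed members estimates the corresponding ratio in the whole cohort, and dividing the exposure odds among all cases by this ratio estimates the risk ratio. Comparing the cases with the non-cases in the subcohort gives the odds ratio instead. The function names and counts are hypothetical, and the counts are expected values rather than a random draw.

```python
def case_base_risk_ratio(cases_exposed, cases_unexposed,
                         sub_exposed, sub_unexposed):
    """Risk ratio: exposure odds among all cases divided by the
    exposed-to-unexposed ratio in the subcohort (cases in it included)."""
    return (cases_exposed / cases_unexposed) / (sub_exposed / sub_unexposed)

def case_base_odds_ratio(cases_exposed, cases_unexposed,
                         sub_noncases_exposed, sub_noncases_unexposed):
    """Odds ratio: cases compared with the non-cases in the subcohort."""
    return ((cases_exposed / cases_unexposed)
            / (sub_noncases_exposed / sub_noncases_unexposed))

# Hypothetical cohort: 5000 exposed (150 cases), 10000 unexposed (100 cases).
# A 10% subcohort is expected to hold 500 exposed and 1000 unexposed members,
# of whom 15 and 10 respectively are cases.
rr = case_base_risk_ratio(150, 100, 500, 1000)
or_ = case_base_odds_ratio(150, 100, 500 - 15, 1000 - 10)
```

With these counts the risk ratio estimate is 3, matching the cohort risks 0.03 and 0.01, and the odds ratio estimate matches the cohort odds ratio of about 3.06.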
Preliminaries
In this chapter we continue the discussion of studies described broadly as involving case-control sampling within a cohort. In the nested case-control design, discussed in Chapter 7, cases are compared with controls sampled from the risk set at each event time. A feature of the nested case-control design is that the sampled controls are specific to a chosen outcome and therefore cannot easily be re-used in studies of other outcomes of interest if these occur at different time points; in principle, at least, a new set of controls must be sampled for each outcome studied though some methods have been developed that do enable the re-use of controls.
The nested case-control design accommodates case event times into the sampling of controls.
In this design one or more controls are selected for each case from the risk set at the time at which the case event occurs. Controls may also be matched to cases on selected variables.
Nested case-control studies are particularly suited for use within large prospective cohorts, when it is desirable to process exposure information only for cases and a subset of non-cases.
The analysis of nested case-control studies uses a proportional hazards model and a modification to the partial likelihood used in full-cohort studies, giving estimates of hazard ratios. Extensions to other survival models are possible.
In the standard design, controls are selected randomly from the risk set for each case; however, more elaborate sampling procedures for controls, such as counter-matching, may gain efficiency. A weighted partial-likelihood analysis is needed to accommodate non-random sampling.
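For the standard design with randomly sampled controls, the modified partial likelihood can be sketched as follows, assuming a proportional hazards model with a single exposure x and log hazard ratio beta. Each case contributes exp(beta·x_case) divided by the sum of exp(beta·x) over the case and its sampled controls. The function names, the one-dimensional ternary search and the data are assumptions made for illustration; the weighted version needed for non-random sampling is not shown.

```python
import math

def log_partial_likelihood(beta, sampled_sets):
    """Log partial likelihood for nested case-control data.

    sampled_sets is a list of (x_case, [x_control, ...]) pairs, one per case,
    where the controls were sampled from the risk set at the case's event time.
    """
    total = 0.0
    for x_case, x_controls in sampled_sets:
        denom = math.exp(beta * x_case)
        denom += sum(math.exp(beta * x) for x in x_controls)
        total += beta * x_case - math.log(denom)
    return total

def maximise_beta(sampled_sets, lo=-5.0, hi=5.0, tol=1e-7):
    """Ternary search for the maximiser; valid because the log partial
    likelihood is concave in beta. The bounds lo and hi are assumptions."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (log_partial_likelihood(m1, sampled_sets)
                < log_partial_likelihood(m2, sampled_sets)):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Hypothetical data: binary exposure, one sampled control per case.
sampled_sets = (
    [(1, [0])] * 6 +   # case exposed, control unexposed
    [(0, [1])] * 3 +   # case unexposed, control exposed
    [(1, [1])] * 2 +   # both exposed (uninformative)
    [(0, [0])] * 4     # neither exposed (uninformative)
)
beta_hat = maximise_beta(sampled_sets)
hazard_ratio = math.exp(beta_hat)
```

With one control per case and a binary exposure, only the discordant sets carry information, and the estimated hazard ratio reduces to the ratio of the two discordant counts, here 6/3 = 2.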
Preliminaries
We have focused primarily so far on case-control studies in which the cases and controls are sampled from groups of individuals who respectively do and do not have the outcome of interest occurring within a relevant time window. This time window is typically relatively short. If cases are to be defined as those experiencing an event or outcome of interest occurring over a longer time period, or if the rate of occurrence of the event is high, then the choice of a suitable control group requires special care.
In two-stage case-control designs, limited information is obtained on individuals in a first-stage sample and used in the sampling of individuals at the second stage, where full information on exposures and other variables is obtained. The first stage may be a random sample or a case-control sample; the second stage is a case-control sample, possibly within strata. The major aim of these designs is to gain efficiency.
Two-stage studies can be analysed using likelihood-based arguments that extend the general formulation based on logistic regression.
Special sampling designs for matched case-control studies include counter-matching, which uses some information on individuals in the potential pool of controls to select controls in such a way as to maximize the informativeness of the case-control sets.
Family groupings can be used in case-control-type studies, and there is a growing literature in the epidemiological, statistical and genetics fields. In one approach, cases are matched to a sibling or other relative.
Preliminaries
So far we have discussed case-control studies in which cases and controls are sampled, in principle at random, from the underlying population on the basis of their outcome status. We have also considered extensions, including matched studies and stratified sampling, in both of which it is assumed that some features of individuals in the underlying population are easily ascertained. Sometimes it is useful to consider alternative ways of sampling in a case-control study. In this chapter we discuss some special case-control sampling designs.
The cornerstone of the analysis of case-control studies is that the ratio of the odds of a binary outcome Y given exposure X = 1 to that given X = 0 is the same as the ratio of the odds where the roles of Y and X are reversed. This result means that prospective odds ratios can be estimated from retrospective case-control data.
For binary exposure X and outcome Y there are both exact and large-sample methods for estimating odds ratios from case-control studies.
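A minimal sketch of a large-sample calculation for a single 2 × 2 table follows. It uses the familiar approximation in which the variance of the log odds ratio is the sum of the reciprocals of the four cell counts; this is one common choice and is not necessarily the method developed in the chapter. The sketch also checks the symmetry noted above: the odds ratio is the same whether the table is read as exposure given outcome or as outcome given exposure. The function name and counts are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% interval from a 2 x 2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    log_or = math.log(or_hat)
    return or_hat, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)

# Hypothetical counts.
a, b, c, d = 60, 40, 30, 70
or_hat, lower, upper = odds_ratio_ci(a, b, c, d)

# Symmetry of the odds ratio in X and Y.
retrospective = (a / b) / (c / d)   # odds of exposure: cases versus controls
prospective = (a / c) / (b / d)     # odds of being a case: exposed versus unexposed
```

Within a case-control sample the individual "prospective" odds a/c and b/d are distorted by the sampling, but their ratio is not, which is why the two readings agree.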
Methods for the estimation of odds ratios for binary exposures extend to categorical exposures and allow the combination of estimates across strata. The latter enables control for confounding and background variables.
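One widely used way of combining 2 × 2 tables across strata is the Mantel-Haenszel estimator, sketched below. It is offered as an illustration of combining across strata and is not necessarily the estimator favoured in the chapter. The function name and the counts are hypothetical; they are chosen so that each stratum has odds ratio 2 while the pooled table does not.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio.

    strata is a list of (a, b, c, d) tables with
    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical strata, each with odds ratio 2.
strata = [(20, 10, 10, 10), (10, 20, 10, 40)]
mh_or = mantel_haenszel_or(strata)

# Ignoring the stratification and pooling the counts gives a different answer.
pa, pb, pc, pd = (sum(t[i] for t in strata) for i in range(4))
crude_or = (pa * pd) / (pb * pc)
```

Here the stratified estimate is 2, whereas the crude odds ratio from the pooled table is 2.5, illustrating how stratification controls for a confounding variable.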
For binary exposure X and outcome Y, the probabilities of X given Y and of Y given X can be formulated using two different logistic regression models. However, the two models give rise to the same estimates of odds ratios under maximum likelihood estimation.
Rate ratios can be estimated from a case-control study if ‘time’ is incorporated correctly into the sampling of individuals; a simple possibility is to perform case-control sampling within short time bands and then to combine the results.
Preliminaries
Many central issues involved in the analysis of case-control data are illustrated by the simplest special case, namely that of a binary explanatory variable or risk factor and a binary outcome or response.
Case-control studies can involve more than two outcome groups, enabling us to estimate and compare exposure-outcome associations across groups.
Studies may involve multiple case subtypes and a single control group, or one case group and two or more control groups, for example.
Case-control studies with more than two outcome groups can be analysed using pairwise comparisons or special polychotomous analyses. The general formulation based on logistic regression extends to this situation, meaning that the data from such studies can be analysed as though arising from a prospective sample.
By contrast, in some situations case-only studies are appropriate; in these no controls are required. In one such situation the nature of the exposure contrasts studied may make the absence of controls reasonable. In another, each individual is in a sense his or her own control.
Preliminaries
In most of this book we suppose that there are just two possible outcomes for each individual, defining them either as cases or as controls. However, there are two contrasting situations in which the number of outcome groups is other than two. First, it may be of interest to estimate and compare risk factors for three or more outcomes; an extended case-control design can be used to make comparisons between more than two outcome groups. The other, very contrasting, situation occurs when controls may be dispensed with, the case-only design. In the first part of this chapter we consider the former situation, and in the second part the latter.
• The individual matching of controls to cases in a case-control study may be used to control for confounding or background variables at the design stage of the study.
• Matching in a case-control study has some parallels with pair matching in experimental designs. It uses one of the most basic methods of error control, comparing like with like.
• The simplest type of matched case-control study takes a matched-pair form, in which each matched set comprises one case and one control.
• Matched case-control studies require a special form of analysis. The most common approach is to allow arbitrary variations between matched sets and to employ a conditional logistic regression analysis.
• An alternative analysis suitable in some situations uses a regression formulation based on the matching variables.
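For the matched-pair form with a binary exposure, the conditional logistic analysis has a closed form, sketched below: pairs in which case and control share the same exposure drop out, and the conditional maximum likelihood estimate of the odds ratio is the ratio of the two discordant-pair counts. The function name, the large-sample interval and the counts are assumptions made for illustration.

```python
import math

def matched_pair_odds_ratio(n_case_only, n_control_only, z=1.96):
    """Conditional ML odds ratio for 1:1 matched pairs with binary exposure.

    n_case_only    = pairs in which only the case is exposed,
    n_control_only = pairs in which only the control is exposed.
    Concordant pairs carry no information and are not needed.
    """
    or_hat = n_case_only / n_control_only
    se_log = math.sqrt(1.0 / n_case_only + 1.0 / n_control_only)
    log_or = math.log(or_hat)
    return or_hat, math.exp(log_or - z * se_log), math.exp(log_or + z * se_log)

# Hypothetical study: 40 pairs with only the case exposed,
# 20 pairs with only the control exposed.
or_hat, lower, upper = matched_pair_odds_ratio(40, 20)
```

With more than one control per case, or with continuous exposures, no such closed form is available and the conditional likelihood has to be maximized numerically.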
Preliminaries
An important and quite often fruitful principle in investigating the design and analysis of an observational study is to consider what would be appropriate for a comparable randomized experiment. What steps would be taken in such an experiment to achieve secure and precise conclusions? To what extent can these steps be followed in the observational context and what can be done to limit the loss of security of interpretation inherent in most observational situations?
In Chapter 2 we studied the dependence of a binary outcome, Y, on a single binary explanatory variable, the exposure, X.