In several chapters we discussed parametric regression modeling for a moderate number of explanatory variables based on maximum likelihood methods. In some areas of application, however, the number of explanatory variables may be very high. For example, in genetics, where binary regression is a frequently used tool, the number of predictors may be even larger than the number of observations. In this “p > n problem” maximum likelihood and similar estimators are bound to fail. Typical data of this type are microarray data, where the expressions of thousands of predictors (genes) are observed and only a few hundred samples are available. For example, the dataset considered by Golub et al. (1999a), which constitutes a milestone in the classification of cancer, consists of gene expression intensities for 7129 genes of 38 leukemia patients, of whom 27 were diagnosed with acute lymphoblastic leukemia and the remaining with acute myeloid leukemia.
In high-dimensional problems the reduction of the predictor space is the most important issue. A reduction technique with a long history is stepwise variable selection. However, stepwise variable selection is a discrete process and therefore extremely variable: small changes in the data may determine the outcome of the selection procedure. The effect is often poor performance (see, e.g., Frank and Friedman, 1993). Moreover, it is challenging to investigate the sampling properties of stepwise variable selection procedures.
Regularization methods are an alternative to stepwise subset selection. Ridge regression is a familiar regularization method; it adds a simple penalty term to the log-likelihood and thereby shrinks estimates toward zero.
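The shrinkage effect of the ridge penalty can be illustrated with the closed-form penalized least-squares estimate β̂ = (XᵀX + λI)⁻¹Xᵀy. The following pure-Python sketch, with a toy dataset of our own (not from the text), solves this system directly; for simplicity the intercept is penalized along with the slope, which a careful implementation would avoid.

```python
def ridge(X, y, lam):
    """Penalized least squares: solve (X'X + lam*I) beta = X'y."""
    p = len(X[0])
    # Build A = X'X + lam*I and b = X'y
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        b[k], b[piv] = b[piv], b[k]
        for r in range(k + 1, p):
            f = A[r][k] / A[k][k]
            for c in range(k, p):
                A[r][c] -= f * A[k][c]
            b[r] -= f * b[k]
    beta = [0.0] * p
    for k in range(p - 1, -1, -1):
        beta[k] = (b[k] - sum(A[k][c] * beta[c]
                              for c in range(k + 1, p))) / A[k][k]
    return beta

# Toy data: first column is the intercept
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [1.0, 2.1, 2.9, 4.2]
print(ridge(X, y, 0.0))   # lam = 0: ordinary least squares
print(ridge(X, y, 5.0))   # lam = 5: slope shrunk toward zero
```

Increasing λ shrinks the slope estimate toward zero, trading a small bias for reduced variance; this is what makes ridge-type estimators usable when maximum likelihood fails.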
Categorical regression has the same objectives as metric regression. It aims at an economic representation of the link between the covariables, considered as the independent variables, and the response as the dependent variable. Moreover, one wants to evaluate the influence of the independent variables regarding their strength and the way they exert their influence. Predicting new observations can be based on adequate modeling of the response pattern.
Categorical regression modeling differs from classical normal regression in several ways. The most crucial difference is that the dependent variable y follows a quite different distribution. A categorical response variable can take only a limited number of values, in contrast to normally distributed variables, for which any value might be observed. In the simplest case of binary regression the response takes only two values, usually coded as y = 0 and y = 1. One consequence is that the scatterplots look different. Figure 2.1 shows data from the household panel described in Example 1.2. The outcomes “car in household” (y = 1) and “no car in household” (y = 0) are plotted against net income (in Euros). It is seen that for low income the response y = 0 occurs more often, whereas for higher income y = 1 is observed more often. However, the structural connection between the response and the covariate is hardly visible from this representation. Therefore, Figure 2.1 also shows the relative frequencies for owning a car for households within income intervals of length 50. The picture shows that a linear connection is certainly not the best choice.
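The grouping idea behind this figure is easy to reproduce: with a binary response, the raw scatterplot collapses onto two horizontal lines, but relative frequencies within covariate intervals trace out the response curve. The following sketch uses simulated data (the income range, the interval width of 50, and the logistic-shaped true curve are our own illustrative choices, not the household panel data from the text).

```python
import math
import random

random.seed(1)
data = []
for _ in range(2000):
    income = random.uniform(0, 4000)  # hypothetical net income in Euros
    # hypothetical true response curve: probability of owning a car
    p = 1.0 / (1.0 + math.exp(-(income - 1500) / 400))
    data.append((income, 1 if random.random() < p else 0))

width = 50  # interval length, as in the text
counts = {}
for income, y in data:
    k = int(income // width)          # interval index
    n, s = counts.get(k, (0, 0))
    counts[k] = (n + 1, s + y)        # (observations, number with y = 1)

# relative frequency of y = 1 within each interval
rel = {k: s / n for k, (n, s) in counts.items()}
```

Plotting `rel` against the interval midpoints would reproduce the sigmoid-like pattern described in the text, and makes clear why a linear function of income is a poor model for the probability of car ownership.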
In this chapter we embed the logistic regression model as well as the classical regression model into the framework of generalized linear models. Generalized linear models (GLMs), which were proposed by Nelder and Wedderburn (1972), may be seen as a framework for handling several response distributions, some categorical and some continuous, in a unified way. Many of the binary response models considered in later chapters can be seen as generalized linear models, and the same holds for part of the count data models in Chapter 7.
The chapter may be read as a general introduction to generalized linear models; continuous response models are treated as well as categorical response models. Therefore, parts of the chapter can be skipped if the reader is interested in categorical data only. Basic concepts like the deviance are introduced in a general form, but specific forms that are needed in categorical data analysis will also be given in the chapters where the models are considered. Nevertheless, the GLM is useful as a background model for categorical data modeling, and since the book by McCullagh and Nelder (1983) everybody working with regression models should be familiar with the basic concept.
Basic Structure
A generalized linear model is composed of several components. The random component specifies the distribution of the response yi given the covariates xi, whereas the systematic component specifies the link between the expected response and the covariates.
Tree-based models provide an alternative to additive and smooth models for regression problems. The method has its roots in automatic interaction detection (AID), proposed by Morgan and Sonquist (1963). The most popular modern version is due to Breiman et al. (1984) and is known by the name classification and regression trees, often abbreviated as CART, which is also the name of a program package. The method is conceptually very simple. Binary recursive partitioning divides the feature space into a set of rectangles, and on each rectangle a simple model (for example, a constant) is fitted.
The approach is different from that given by fitting parametric models like the logit model, where linear combinations of predictors are at the core of the method. Rather than getting parameters, one obtains a binary tree that visualizes the partitioning of the feature space. If only one predictor is used, the one-dimensional feature space is partitioned, and the resulting estimate is a step function that may be considered a non-parametric (but rough) estimate of the regression function.
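One node of such a tree is easy to compute by hand: with a single predictor, one searches over all candidate cut points for the split that minimizes the residual sum of squares when the mean response is fitted on each side. The sketch below, on a small made-up dataset, shows this single split; a full CART procedure would apply it recursively to each resulting interval.

```python
def best_split(xs, ys):
    """Find the single cut point minimizing the residual sum of squares
    when a constant (the mean) is fitted on each side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint rule
            best = (sse, cut, ml, mr)
    return best[1:]  # (cut point, left mean, right mean)

# Made-up data with an obvious jump between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
cut, ml, mr = best_split(xs, ys)
print(cut, ml, mr)   # the chosen cut lies between 3 and 4
```

The fitted regression function is the step function that equals `ml` left of the cut and `mr` right of it, which is exactly the rough non-parametric estimate described above.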
Regression and Classification Trees
In the following the basic concepts of classification and regression trees (CARTs) are given. The term “regression” in CARTs refers to metrically scaled outcomes, while “classification” refers to the prediction of underlying classes. By considering the underlying classes as the outcomes of a categorical response variable, classification can be treated within the general framework of regression.
In many applications the response variable is given in the form of event counts, where an event count refers to the number of times an event occurs. Simple examples are
number of insolvent firms within a fixed time interval,
number of insurance claims within a given period of time,
number of epileptic seizures per day,
number of cases with a specific disease in epidemiology.
In all of these examples the response y may be viewed as a non-negative integer-valued random variable with y ∈ {0, 1, 2, …}. Although in many applications there is an upper bound for the response, because the number of firms or potential insurance claims is finite, the upper bound is often very large and considered irrelevant in modeling. In other cases, for example, for the number of epileptic seizures, no upper bound is given. In the following some further examples are given.
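The standard distribution for such unbounded counts is the Poisson distribution with probability mass function P(y = k) = e^(−λ)λᵏ/k!, k ∈ {0, 1, 2, …}. The following small check (an illustration of ours, not an example from the text) confirms numerically that the pmf sums to one and that mean and variance both equal λ, the equidispersion property that later motivates extensions of the Poisson model.

```python
import math

def poisson_pmf(k, lam):
    """P(y = k) for a Poisson-distributed count with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.5
# Truncate at 100: for lam = 2.5 the remaining tail mass is negligible
probs = [poisson_pmf(k, lam) for k in range(100)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
print(round(total, 6), round(mean, 6), round(var, 6))
```

When observed counts show variance clearly in excess of the mean (overdispersion), the plain Poisson model is too restrictive, which is why alternatives are considered later.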
Example 7.1: Number of Children
The German General Social Survey Allbus provides micro data, which allow one to model the dependence of the number of children on explanatory variables. We will consider women only and the predictors age in years (age), duration of school education (dur), nationality (nation, 0: German, 1: otherwise), religion (answer categories to “God is the most important in man”, 1: strongly agree, …, 5: strongly disagree, 6: never thought about it), university degree (univ, 0: no, 1: yes).
In Chapter 13 the marginal modeling approach was used to model observations that occur in clusters. An alternative approach to dealing with repeated measurements is to model explicitly the heterogeneity of the clustered responses. By postulating the existence of unobserved latent variables, the so-called random effects, which are shared by the measurements within a cluster, one introduces correlation between the measurements within clusters.
The introduction of cluster-specific parameters has consequences for the interpretation of parameters. Responses are modeled given covariates and cluster-specific terms. Therefore, interpretation is subject-specific, in contrast to marginal models, which have population-averaged interpretations. For illustration let us consider Example 13.4, where a binary response indicating pain depending on treatment and other covariates is measured repeatedly. When each individual has his or her own parameter, which represents the individual's sensitivity to pain, modeling the response given the covariates and the individual parameter means that effects are measured at the individual level. For non-linear models, which are the standard in categorical regression, the effect strength will differ from that found in marginal modeling without subject-specific parameters. The difference will be discussed in more detail in Section 14.2.1 for the simple case of binary response models with random intercepts.
Explicit modeling of heterogeneity by random effects is typically found in repeated measurements, as in the pain study. In the following we give two more examples.
Example 14.1: AIDS Study
The data were collected within the Multicenter AIDS Cohort Study (MACS), which has followed nearly 5000 gay or bisexual men from Baltimore, Pittsburgh, Chicago, and Los Angeles since 1984 (see Kaslow et al., 1987; Zeger and Diggle, 1994).
In prediction problems one considers a new observation (y, x). While the predictor value x is observed, y is unknown and is to be predicted. In general, the unknown y may be from any distribution, continuous or discrete, depending on the prediction problem. When the unknown value is categorical we will often denote it by Y, with Y taking values from {1, …, k}. Then prediction means to find the true underlying value from the set {1, …, k}. The problem is strongly related to the common classification problem where one wants to find the true class from which the observation stems. When the numbers 1, …, k denote the underlying classes, the classification problem has the same structure as the prediction problem. Classification problems are basically diagnostic problems. In medical applications one wants to identify the type of disease, in pattern recognition one might aim at recognizing handwritten characters, and in credit scoring (Example 1.7) one wants to identify risk clients. Sometimes the distinction between prediction and classification is philosophical. In credit scoring, where one wants to find out if a client is a risk client, one might argue that it is a prediction problem since the classification lies in the future. Nevertheless, it is mostly seen as a classification problem, implying that the client is already a risk client or not.
Categorical data play an important role in many statistical analyses. They appear whenever the outcomes of one or more categorical variables are observed. A categorical variable can be seen as a variable for which the possible values form a set of categories, which can be finite or, in the case of count data, infinite. These categories can be records of answers (yes/no) in a questionnaire, diagnoses like normal/abnormal resulting from a medical examination, or choices of brands in consumer behavior. Data of this type are common in all sciences that use quantitative research tools, for example, social sciences, economics, biology, genetics, and medicine, but also engineering and agriculture.
In some applications all of the observed variables are categorical and the resulting data can be summarized in contingency tables that contain the counts for combinations of possible outcomes. In other applications categorical data are collected together with continuous variables and one may want to investigate the dependence of one or more categorical variables on continuous and/or categorical variables.
The focus of this book is on regression modeling for categorical data. Regression modeling distinguishes between explanatory variables, or predictors, and dependent variables. The main objectives are to find a parsimonious model for the dependence, to quantify the effects, and potentially to predict the outcome when explanatory variables are given. Therefore, the basic problems are the same as for normally distributed response variables. However, due to the nature of categorical data, the solutions differ. For example, it is highly advisable to use a transformation function to link the linear or non-linear predictor to the mean response, to ensure that the mean lies in an admissible range.
In many studies the objective is to model more than one response variable. For each unit in the sample a vector of correlated response variables, together with explanatory variables, is observed. Two cases are most important:
repeated measurements, when the same variable is measured repeatedly at different times and/or under different conditions;
different response variables, observed on one subject or unit in the sample.
Repeated measurements occur in most longitudinal studies. For example, in a longitudinal study measurements on an individual may be observed at several times under possibly varying conditions. In Example 1.4 (Chapter 1) an active ingredient is compared to a placebo by observing the healing after 3, 7, and 10 days of treatment. In Example 13.1, the number of epileptic seizures is considered at each of four two-week periods. Although they often do, repeated responses need not refer to different times. Response variables may also refer to different questions in an interview or to the presence of different commodities in a household. In Example 13.2 the two, possibly correlated, responses are the type of birth (Caesarean or not) and the stay of the child in intensive care (yes or no). Responses may also refer to a cluster of subjects; for example, when the health status of the members of a family is investigated, the observed responses form a cluster linked to one family. In Example 10.8, where the health status of trees is investigated, clusters are formed by the trees measured at the same spot.
When the response categories in a regression problem are ordered one can find simpler models than the multinomial logit model. The multinomial logit wastes information because the ordering of categories is not explicitly used. Therefore, the model often contains more parameters than are really needed. In particular with categorical data, parsimonious models are to be preferred because the information content in the response is always low.
Data analysts who are not familiar with ordinal models usually seek solutions by inadequate modeling. If the number of response categories is high, for example in rating scales, they ignore that the response is ordinal and use classical regression models that assume that the response is at least on an interval scale. Thereby they also ignore that the response is categorical. The result is often spurious effects. Analysts who are aware of the ordinal scale but are not familiar with ordinal models frequently use binary regression models by collapsing outcomes into two groups of response categories. The effect is a loss of information. Armstrong and Sloan (1989) demonstrated that the binary model may attain only between 50 and 75% efficiency relative to an ordinal model for a five-level ordered response; see also Steadman and Weissfeld (1998), who in addition consider polytomous models as alternatives.
One may distinguish between two types of ordinal categorical variables, grouped continuous variables and assessed ordinal categorical variables (Anderson, 1984). The first type is a mere categorized version of a continuous variable, which in principle can be observed itself.