Beran's (2003) provocative definition of statistics as “the study of algorithms for data analysis” elevates computational considerations to the forefront of the field. It is apparent that the evolutionary success of statistical methods is to a significant degree determined by considerations of computational convenience. As a result, design and dissemination of statistical software has become an integral part of statistical research. Algorithms are no longer the exclusive purview of the numerical analyst or the proto-industrial software firm; they are an essential part of the artisanal research process. Fortunately, modern computing has also transformed the software development process and greatly facilitated collaborative research; the massive collective international effort represented by the R project exceeds the most idealistic Marxist imagination.
Algorithms have been a crucial part of the research challenge of quantile regression methods since their inception in the 18th century. Stigler (1984) describes an amusing episode in 1760 in which the itinerant Croatian Jesuit Rudjer Boscovich sought computational advice in London regarding his nascent method for median regression. Ironically, a fully satisfactory answer to Boscovich's questions only emerged with the dawn of modern computing. The discovery of the simplex method and subsequent developments in linear programming have made quantile regression methods competitive with traditional least-squares methods in terms of their computational effort. These computational developments have also played a critical role in encouraging a deeper appreciation of the statistical advantages of these methods.
In this chapter, we seek to provide a basic conceptual guide to quantile regression, illustrating the ideas with a number of examples and stressing various aspects of the interpretation of quantile regression. We begin with a discussion of quantile treatment effects in the two-sample treatment-control model. In this context, the difference between the empirical quantile functions of the treatment and control observations provides a natural generalization of conventional mean measures of the treatment effect. This “quantile treatment effect” is precisely what is delivered by the quantile regression estimator of a model with a single binary indicator variable. We describe some basic characteristics of the quantile regression estimator, its equivariance properties, and robustness. The interpretation of estimated quantile regression parameters is described in the context of several applications. Some issues of misspecification are raised, and the chapter concludes with an interpretation of the quantile regression model as a random coefficient model.
QUANTILE TREATMENT EFFECTS
The simplest formulation of regression is the classical two-sample treatment-control model. We begin by reconsidering a general model of two-sample treatment response introduced by Lehmann and Doksum in the 1970s. This model provides a natural introduction to the interpretation of quantile regression models in more general settings.
Lehmann (1974) proposed the following model of treatment response:
Suppose the treatment adds the amount Δ(x) when the response of the untreated subject would be x. Then the distribution G of the treatment responses is that of the random variable X + Δ(X) where X is distributed according to F.
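As a minimal sketch of the quantile treatment effect described above (simulated data; only NumPy assumed, and none of this code is part of the original text), the estimate is simply the difference of the empirical quantile functions of the treatment and control samples. Under a pure location-shift version of Lehmann's model, Δ(x) ≡ Δ₀, this difference is constant in τ; in the scale-shift example below it grows in the upper tail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-sample data: control responses distributed as F,
# treatment responses distributed as G (here a scale shift, so the
# treatment effect grows in the upper tail).
control = rng.exponential(scale=1.0, size=500)
treatment = rng.exponential(scale=1.5, size=500)

# Quantile treatment effect: delta(tau) = G^{-1}(tau) - F^{-1}(tau),
# estimated by differencing the empirical quantile functions.
taus = np.linspace(0.1, 0.9, 9)
qte = np.quantile(treatment, taus) - np.quantile(control, taus)

for tau, d in zip(taus, qte):
    print(f"tau = {tau:.1f}: estimated QTE = {d:.3f}")
```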
Although the finite-sample distribution theory of regression quantiles can be represented explicitly as has been illustrated in Chapter 3, the practical application of this theory would entail a host of hazardous assumptions and an exhausting computational effort. It is generally conceded throughout statistics that approximation methods involving local linearization and the central limit theorem play an indispensable role in the analysis of the performance of statistical procedures and in rendering such procedures practical tools of statistical inference. The zealous pursuit of these objectives is inevitably met with accusations that we live in a cloud-cuckoo land of “asymptopia,” but life is full of necessary compromises and approximations. And it is fair to say that those who try to live in the world of “exact results” in finite-sample statistical distribution theory are exiled to an even more remote and exotic territory.
Fortunately, there are many tools available to help us evaluate the adequacy of our asymptotic approximations. Higher order expansions, although particularly challenging in the present context, may offer useful assessments of the accuracy of simpler approximations and possible refinement strategies. Monte Carlo simulation can be an extremely valuable tool, and the rapid development of resampling methods for statistical inference offers many new options for inference.
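For example, a nonparametric bootstrap offers a quick, simulation-based check on the adequacy of an asymptotic approximation to the sampling variability of an estimator such as the sample median. The following is only a sketch with simulated standard normal data (NumPy assumed), not a procedure taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)          # simulated sample of size n = 200

# Nonparametric bootstrap of the sample median: resample with replacement
# and compare the bootstrap standard error with the asymptotic one,
# 1 / (2 * f(median) * sqrt(n)), where f is the standard normal density.
B = 2000
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])
asym_se = 1.0 / (2.0 * (1.0 / np.sqrt(2 * np.pi)) * np.sqrt(x.size))
print("bootstrap s.e.:", boot.std(ddof=1), " asymptotic s.e.:", asym_se)
```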
The fundamental task of asymptotic theory is to impose some discipline and rigor on the process of developing statistical procedures. The natural enthusiasm that arises from the first few “successful” applications of a new technique can be effectively tempered by some precisely cast questions of the following form: if the data arose according to conditions A, does the procedure produce a result that converges, in some appropriate sense, to object B?
In many disciplines, longitudinal and panel data sets can be readily constructed using a traditional pre- and postevent study; that is, subjects are observed prior to some condition of interest as well as after this condition or event. It is also common to use longitudinal and panel data methods to examine data sets that follow subjects observed at an aggregate level, such as a government entity (state, province, or nation, for example) or firm.
Longitudinal and panel data sets are also available that follow individuals over time. However, these data sets are generally expensive to construct and are assembled under the sponsorship of a government agency. Thus, although the data are available, data providers are generally bound by national laws requiring some form of user agreement to protect confidential information regarding the subjects. Because of the wide interest in these data sets, most data providers make information about the data sets available on the Internet.
Despite the expense and confidentiality requirements, many countries have conducted, or are in the process of conducting, household panel studies. Socio-demographic and economic information is collected about a household as well as individuals within the household. Information may relate to income, wealth, education, health, geographic mobility, taxes, and so forth. To illustrate, one of the oldest ongoing national panels, the U.S. Panel Study of Income Dynamics (PSID), collects 5,000 variables. Table F.1 cites some major international household panel data sets.
Abstract. This chapter extends the linear model introduced in Chapters 1–8 and the binary dependent-variable model in Chapter 9 to the generalized linear model formulation. Generalized linear models (GLMs) represent an important class of nonlinear regression models that have found extensive use in practice. In addition to the normal and Bernoulli distributions, these models include the binomial, Poisson, and Gamma families as distributions for dependent variables.
Section 10.1 begins this chapter with a review of homogeneous GLMs, models that do not incorporate heterogeneity. The Section 10.2 example reinforces this review. Section 10.3 then describes marginal models and generalized estimating equations, a widely applied framework for incorporating heterogeneity. Then, Sections 10.4 and 10.5 allow for heterogeneity by modeling subject-specific quantities as random and fixed effects, respectively. Section 10.6 ties together fixed and random effects under the umbrella of Bayesian inference.
Homogeneous Models
This section introduces the generalized linear model (GLM) due to Nelder and Wedderburn (1972G); a more extensive treatment may be found in the classic work by McCullagh and Nelder (1989G). The GLM framework generalizes linear models in the following sense. Linear model theory provides a platform for choosing appropriate linear combinations of explanatory variables to predict a response. In Chapter 9, we saw how to use nonlinear functions of these linear combinations to provide better predictors, at least for responses with Bernoulli (binary) outcomes. With GLMs, we widen the class of distributions to allow us to handle other types of nonnormal outcomes.
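As a small, self-contained sketch (simulated count data; the statsmodels package is assumed to be available and is not part of the original text), a homogeneous GLM with a Poisson response and its canonical log link can be fit as follows:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical cross-sectional data: a count response whose mean is a
# nonlinear (exponential) function of a linear combination of covariates.
n = 500
x = rng.uniform(0, 2, size=n)
mu = np.exp(0.5 + 0.8 * x)          # log link: E[y] = exp(beta0 + beta1 * x)
y = rng.poisson(mu)

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson())   # Poisson GLM, log link
result = model.fit()
print(result.summary())
```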
Abstract. This chapter describes a conditional modeling framework that takes into account hierarchical and clustered data structures. The data and models, known as multilevel, are used extensively in educational science and related disciplines in the social and behavioral sciences. We show that a multilevel model can be viewed as a linear mixed-effects model, and hence the statistical inference techniques introduced in Chapter 3 are readily applicable. By considering multilevel data and models as a separate unit, we expand the breadth of applications that linear mixed-effects models enjoy.
Cross-Sectional Multilevel Models
Educational systems are often described by structures in which the units of observation at one level are grouped within units at a higher level of structure. To illustrate, suppose that we are interested in assessing student performance based on an achievement test. Students are grouped into classes, classes are grouped into schools, and schools are grouped into districts. At each level, there are variables that may affect responses from a student. For example, at the class level, education of the teacher may be important, at the school level, the school size may be important, and at the district level, funding may be important. Further, each level of grouping may be of scientific interest. Finally, there may be not only relationships among variables within each group but also across groups that should be considered.
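For concreteness, a common two-level specification (the notation here is chosen purely for illustration) writes a student-level equation whose intercept is itself modeled at the class level:

y_ij = β_0j + β_1 x_ij + ε_ij,   with   β_0j = γ_0 + γ_1 z_j + u_j,

where y_ij is the achievement score of student i in class j, x_ij is a student-level covariate, z_j is a class-level covariate such as teacher education, and u_j is a class-level random effect. Substituting the second equation into the first gives y_ij = γ_0 + γ_1 z_j + β_1 x_ij + u_j + ε_ij, a linear mixed-effects model of the kind whose inference techniques are introduced in Chapter 3.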
Abstract. As introduced in Chapter 1, longitudinal and panel data are often heterogeneous and may suffer from problems of attrition. This chapter describes models for handling these tendencies, as well as models designed to handle omitted-variable bias.
Heterogeneity may be induced by (1) fixed effects, (2) random effects, or (3) within-subject covariances. In practice, distinguishing among these mechanisms can be difficult, although, as the chapter points out, it is not always necessary. The chapter also describes the well-known Hausman test for distinguishing between estimators based on fixed versus random effects. As pointed out by Mundlak (1978aE), the Hausman test provides a test of the significance of time-constant omitted variables, certain types of which are handled well by longitudinal and panel data.
This ability to deal with omitted variables is one of the important benefits of using longitudinal and panel data; in contrast, attrition is one of the main drawbacks. The chapter reviews methods for detecting biases arising from attrition and introduces models that provide corrections for attrition difficulties.
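To make the fixed- versus random-effects comparison concrete, here is a rough sketch of the Hausman statistic; the estimates below are hypothetical placeholders, and only NumPy and SciPy are assumed:

```python
import numpy as np
from scipy import stats

def hausman(b_fe, b_re, V_fe, V_re):
    """Hausman statistic contrasting fixed-effects estimates (consistent under
    both hypotheses) with random-effects estimates (efficient under the null)."""
    d = b_fe - b_re
    V = V_fe - V_re                      # difference of covariance matrices
    stat = float(d @ np.linalg.solve(V, d))
    df = d.size
    pval = stats.chi2.sf(stat, df)
    return stat, df, pval

# Hypothetical estimates for two slope coefficients
b_fe = np.array([0.52, -0.11])
b_re = np.array([0.48, -0.09])
V_fe = np.array([[0.0025, 0.0003], [0.0003, 0.0016]])
V_re = np.array([[0.0019, 0.0002], [0.0002, 0.0012]])

stat, df, pval = hausman(b_fe, b_re, V_fe, V_re)
print(f"Hausman chi-square({df}) = {stat:.3f}, p-value = {pval:.3f}")
```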
Heterogeneity
Heterogeneity is a common feature of many longitudinal and panel data sets. When we think of longitudinal data, we think of repeated measurements on subjects. This text emphasizes repeated observations over time, although other types of clustering are of interest. For example, one could model the family unit as a “subject” and have individual measurements of family members as the repeated observations.
Abstract. This chapter introduces the analysis of longitudinal and panel data using the general linear model framework. Here, longitudinal data modeling is cast as a regression problem by using fixed parameters to represent the heterogeneity; nonrandom quantities that account for the heterogeneity are known as fixed effects. In this way, ideas of model representation and data exploration are introduced using regression analysis, a toolkit that is widely known. Analysis of covariance, from the general linear model, easily handles the many parameters needed to represent the heterogeneity.
Although longitudinal and panel data can be analyzed using regression techniques, it is also important to emphasize the special features of these data. Specifically, the chapter emphasizes the wide cross section and the short time series of many longitudinal and panel data sets, as well as the special model specification and diagnostic tools needed to handle these features.
Basic Fixed-Effects Model
Data
Suppose that we are interested in explaining hospital costs for each state in terms of measures of utilization, such as the number of discharged patients and the average hospital stay per discharge. Here, we consider the state to be the unit of observation, or subject. We differentiate among states with the index i, where i may range from 1 to n, and n is the number of subjects. Each state is observed T_i times and we use the index t to differentiate the observation times. With these indices, let y_it denote the response of the ith subject at the tth time point.
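A small sketch of this data layout, together with a basic fixed-effects (analysis-of-covariance) fit via state dummy variables, follows; the numbers are hypothetical and the pandas and statsmodels packages are assumed:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical hospital-cost panel: "state" indexes the subject i,
# "t" indexes the observation time, "cost" is the response y_it.
df = pd.DataFrame({
    "state":      ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "t":          [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "discharges": [10.1, 10.6, 11.0, 7.2, 7.5, 7.9, 12.3, 12.8, 13.1],
    "cost":       [3.2, 3.4, 3.6, 2.1, 2.3, 2.4, 4.0, 4.2, 4.5],
})

# Basic fixed-effects fit: one intercept per state via the C(state)
# dummy variables, plus a common slope for the utilization measure.
fit = smf.ols("cost ~ C(state) + discharges", data=df).fit()
print(fit.params)
```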
Abstract. This chapter considers models of longitudinal data sets with longer time dimensions than were considered in earlier chapters. With many observations per subject, analysts have several options for introducing more complex dynamic model features that address questions of interest or that represent important tendencies of the data (or both). One option is based on the serial correlation structure; this chapter extends the basic structures that were introduced in Chapter 2. Another dynamic option is to allow parameters to vary over time. Moreover, for a data set with a long time dimension relative to the number of subjects, we have an opportunity to model the cross-sectional correlation, an important issue in many studies. The chapter also considers the Kalman filter approach, which allows the analyst to incorporate many of these features simultaneously. Throughout, the assumption of exogeneity of the explanatory variables is maintained. Chapter 6 considered lagged dependent variables as explanatory variables, another way of introducing dynamic features into the model.
Introduction
Because longitudinal data vary over time as well as in the cross section, we have opportunities to model the dynamic, or temporal, patterns in the data. For the data analyst, when is it important to consider dynamic aspects of a problem?
Part of the answer to this question rests on the purpose of the analysis. If the main inferential task is forecasting of future observations as introduced in Chapter 4, then the dynamic aspect is critical. In this instance, every opportunity for understanding dynamic aspects should be explored.
Abstract. This chapter considers the Chapter 2 data structure but here the heterogeneity is modeled using random quantities in lieu of fixed parameters; these random quantities are known as random effects. By introducing random quantities, the analysis of longitudinal and panel data can now be cast in the mixed linear model framework.
Although mixed linear models are an established part of statistical methodology, their use is not as widespread as that of regression. Thus, the chapter introduces this modeling framework, beginning with the special case of a single random intercept known as the error-components model and then focusing on the linear mixed-effects model, which is particularly important for longitudinal data. After introducing the models, this chapter describes estimation of regression coefficients and variance components, as well as hypothesis testing for regression coefficients.
Error-Components/Random-Intercepts Model
Sampling and Inference
Suppose that you are interested in studying the behavior of individuals who are randomly selected from a population. For example, in Section 3.2 we will study the effects that an individual's economic and demographic characteristics have on the amount of income tax paid. Here, the set of subjects that we will study is randomly selected from a larger database, which is itself a random sample of U.S. taxpayers. In contrast, the Chapter 2 Medicare example dealt with a fixed set of subjects. That is, it is difficult to think of the 54 states as a subset from some “superpopulation” of states.
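As a minimal sketch of the error-components (random-intercepts) model introduced above (simulated data; the statsmodels package is assumed and this code is not part of the original text), the subject-specific intercepts are treated as random draws from a population, and the variance components are estimated along with the regression coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulate an error-components (random-intercepts) panel:
# y_it = beta0 + beta1 * x_it + alpha_i + eps_it
n_subjects, T = 50, 6
subject = np.repeat(np.arange(n_subjects), T)
x = rng.normal(size=n_subjects * T)
alpha = rng.normal(scale=1.0, size=n_subjects)[subject]   # random intercepts
y = 1.0 + 0.5 * x + alpha + rng.normal(scale=0.5, size=n_subjects * T)

df = pd.DataFrame({"y": y, "x": x, "subject": subject})

# Random-intercept fit: the variance components are estimated by (RE)ML
# together with the regression coefficients.
fit = smf.mixedlm("y ~ x", data=df, groups=df["subject"]).fit()
print(fit.summary())
```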