In Chapter 1 we introduced several substantive problems that mainly involved scientific studies. In this chapter, we return to these problems. The goal here is not simply to illustrate semiparametric modeling techniques but to show how these techniques can be integrated into scientific studies. Analyses for about half of the studies have recently been published and so, in order to save space, we will simply refer the interested reader to the relevant journal article.
Cancer Rates on Cape Cod
An analysis of the Cape Cod cancer data is given in French and Wand (2003). In their presentation, a logistic geoadditive model (Section 13.6) leads to maps showing regions of elevated relative cancer risk after accounting for age and smoking status. The model developed there also accounts for missingness (missing values) in the smoking variable.
Assessing the Carcinogenicity of Phenolphthalein
Parise and colleagues (2001) used semiparametric logistic mixed models to assess the carcinogenicity of phenolphthalein. After adjusting for rodent weight, they found no significant dose effect for phenolphthalein.
Salinity and Fishing in North Carolina
Real data sets often illustrate several different statistical principles. The salinity data set is not simply an example of semiparametric modeling; it also shows the differing effects of outliers on parametric and nonparametric modeling.
The salinity data are introduced in Section 1.2. Recall the definitions of the variables: salinity is the measured value of salinity in Pamlico Sound, lagged.sal is salinity two weeks earlier, and discharge is the amount of fresh water flowing into the sound from rivers. In this example there are two unusual values of discharge, and the question naturally arises of whether these data points should be included.
The penalty parameter λ of a penalized spline controls the trade-off between bias and variance. In this chapter we introduce a method of fitting penalized splines wherein λ varies spatially in order to accommodate possible spatial nonhomogeneity of the regression function. In other words, λ is allowed to be a function of the independent variable x. Allowing λ to be a function of spatial location can improve mean squared error (MSE; see Section 3.11) and the accuracy of inference.
Suppose we are using a quadratic spline that has constant curvature between knots. If the regression function has rapid changes in curvature, then a small value of λ is needed so that the second derivative of the fitted spline can take jumps large enough to accommodate these changes in curvature. Conversely, if the curvature changes slowly, then a large value of λ will not cause large bias and will reduce df_fit, the degrees of freedom of the fit, and the variance of the fitted values. The problem is that a regression function may be spatially nonhomogeneous, with some regions of rapidly changing curvature and other regions of little change in curvature. A single value of λ is not suitable for such functions. The inferiority, in terms of MSE, of splines having a single smoothing parameter is shown in a simulation study by Wand (2000). In that study, for regression functions with significant spatial inhomogeneity, penalized splines with a single smoothing parameter were not competitive with knot selection methods. In an empirical study by Ruppert (2002), the spatially adaptive penalized splines described in this chapter were found to be at least as efficient as knot selection methods.
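To make the role of λ concrete, the following is a minimal sketch (not code from this book) of fitting a quadratic penalized spline with a single, global smoothing parameter, using a truncated power basis and a ridge-type penalty on the knot coefficients only. The function names, knot placement, and basis choice here are illustrative assumptions.

```python
import numpy as np

def quad_spline_basis(x, knots):
    # Design matrix for a quadratic truncated-power-basis spline:
    # columns 1, x, x^2, then (x - kappa_k)_+^2 for each knot kappa_k.
    X = np.column_stack([np.ones_like(x), x, x**2])
    Z = np.maximum(x[:, None] - knots[None, :], 0.0) ** 2
    return np.column_stack([X, Z])

def fit_penalized_spline(x, y, knots, lam):
    # Penalized least squares: minimize ||y - C beta||^2 + lam * beta' D beta,
    # where D penalizes only the knot (jump-in-curvature) coefficients,
    # leaving the global quadratic polynomial unpenalized.
    C = quad_spline_basis(x, knots)
    D = np.diag([0.0, 0.0, 0.0] + [1.0] * len(knots))
    coef = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return coef, C @ coef
```

A small λ lets the knot coefficients (and hence the second derivative) jump freely, tracking rapid curvature changes at the cost of higher variance; a large λ shrinks them toward a plain quadratic, reducing variance but risking bias where curvature changes quickly.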
Each of the problems described in the previous chapter can benefit from regression analysis. In this book we focus on combining classical parametric regression techniques with modern nonparametric regression techniques to develop useful models for such analyses. It is therefore essential to have a good grounding in the principles of parametric regression before proceeding to the more complicated semiparametric regression chapters. In particular, some of the theoretical aspects of regression should be well understood, since these are important in extensions to semiparametric regression. The present chapter can serve as either a brief introduction to parametric regression for readers without a background in that field or a refresher for those with a working knowledge of parametric regression who could benefit from a review. If you are very familiar with parametric regression methodology and theory, this chapter can be skimmed. Of course, a brief introduction such as this can cover only the main concepts and a few special models; many widely used parametric models are not discussed. This chapter provides sufficient background in parametric regression for the chapters to follow, but readers wishing to apply parametric regression models may consult a textbook on the subject, such as Weisberg (1985), Neter et al. (1996), or Draper and Smith (1998).
Note, moreover, that Section 2.5 contains some new perspectives on parametric regression that are relevant to later chapters on semiparametric models, so this is worth covering regardless of experience.
Toward the end of the chapter we describe some limitations of parametric regression. Most of the remainder of the book is concerned with extensions of parametric regression that have much more flexibility.
Classical statistics treats parameters as fixed unknown quantities. Bayesian statistics is based on a different philosophy; parameters are treated as random variables. The probability distribution of a parameter characterizes knowledge about the parameter's value, and this distribution changes as new data are acquired. The mixed models of classical statistics have a Bayesian flavor because some parameters are treated as random. However, in a mixed model both the fixed effects and the variance components are treated as nonrandom unknowns. Bayesians go one step beyond mixed models in that they treat all parameters as random. In this chapter we take the mixed model formulation of Section 4.9 and extend it to a fully Bayesian model.
Bayesian statistics differs from classical statistics in two important respects:
(1) the use of the prior distribution to characterize knowledge of the parameter values prior to data collection; and
(2) the use of the posterior distribution – that is, the conditional distribution of the parameters given the data – as the basis of inference.
Some statisticians are uneasy about the use of priors, but when done with care, the use of priors is quite sensible. In some situations, we might have strong prior beliefs that will influence our analysis. For example, suppose we needed to estimate the probability that a toss of a coin comes up heads.
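As a minimal illustration of prior-to-posterior updating in the coin example (a sketch under standard conjugacy assumptions, not material from the text), suppose we place a Beta prior on the heads probability p:

```python
# Conjugate updating: with a Beta(a, b) prior on the heads probability p
# and h heads observed in n independent tosses, the posterior is
# Beta(a + h, b + n - h).
def posterior_params(a, b, heads, n):
    return a + heads, b + n - heads

def beta_mean(a, b):
    # Mean of a Beta(a, b) distribution.
    return a / (a + b)

# Strong prior belief that the coin is fair: Beta(50, 50), prior mean 0.5.
a_post, b_post = posterior_params(50, 50, 8, 10)  # observe 8 heads in 10 tosses
```

With the strong Beta(50, 50) prior, the posterior mean after 8 heads in 10 tosses is 58/110 ≈ 0.527: ten tosses barely move a confident prior belief. The same data under a flat Beta(1, 1) prior give a posterior mean of 9/12 = 0.75, showing how the prior's weight governs how much the data shift our conclusions.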
The primary aim of this book is to guide researchers needing to flexibly incorporate nonlinear relationships into their regression analyses. Flexible nonlinear regression is traditionally known as nonparametric regression; it differs from parametric regression in that the shapes of the functional relationships are not predetermined but can adjust to capture unusual or unexpected features of the data.
Almost all existing regression texts treat either parametric or nonparametric regression exclusively, and the level of exposition differs alarmingly between books of the two types. In this book we argue that nonparametric regression can be viewed as a relatively simple extension of parametric regression, and we treat the two together. We refer to this combination as semiparametric regression. Our approach to semiparametric regression is based on penalized regression splines and mixed models. Indeed, every model in this book is a special case of the linear mixed model or its generalized counterpart. This makes the methodology modular and is in keeping with our general philosophy of minimalist statistics (see Section 19.2), where the amount of methodology, terminology, and so on is kept to a minimum. This is the first smoothing book that makes use of the mixed model representation of smoothers.
Unlike many other texts on nonparametric regression, this book is very much problem-driven. Examples from our collaborative research (and elsewhere) have driven the selection of material and emphases and are used throughout the book.
The book is suitable for several audiences. One audience consists of students or working scientists with only a moderate background in regression, though familiarity with matrix and linear algebra is assumed. Marginal notes and the appendices are intended for beginners, especially those from interface disciplines.