We begin by illustrating the interplay between questions of scientific interest and the use of data in seeking answers. Graphs provide a window through which meaning can often be extracted from data. Numeric summary statistics and probability distributions provide a form of quantitative scaffolding for models of random as well as nonrandom variation. Simple regression models foreshadow the issues that arise in the more complex models considered later in the book. Frequentist and Bayesian approaches to statistical inference are contrasted, the latter primarily using the Bayes Factor to complement the limited perspective that p-values offer. The Akaike Information Criterion (AIC) and related "information" statistics provide a further perspective. Resampling methods, where the one available dataset is used to provide an empirical substitute for a theoretical distribution, are introduced. Remaining topics are of a more general nature. RStudio is one of several tools that can help in organizing and managing work. The checks provided by independent replication at another time and place are an indispensable complement to statistical analysis. Questions of data quality, of relevance to the questions asked, of the processes that generated the data, and of generalization remain just as important for machine learning and other newer analysis approaches as for more classical methods.
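As a minimal sketch of the resampling idea, the following R code (which uses simulated values rather than a dataset from the text) compares a bootstrap estimate of the standard error of a mean with its theoretical counterpart:

set.seed(29)
y <- rnorm(50, mean = 10, sd = 2)                # the one available "dataset"
boot_means <- replicate(2000, mean(sample(y, replace = TRUE)))
sd(boot_means)                                   # bootstrap standard error of the mean
sd(y) / sqrt(length(y))                          # theoretical counterpart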
Multiple linear regression generalizes straight line regression to allow multiple explanatory (or predictor) variables, in this chapter under the normal errors assumption. The focus may be on accurate prediction, or, alternatively or additionally, on the regression coefficients themselves. Simplistic interpretations of coefficients can be grossly misleading. Later chapters elaborate on the ideas and methods developed in this chapter, applying them in new contexts. The attaching of causal interpretations to model coefficients must be justified both by reference to subject area knowledge and by careful checks to ensure that they are not artefacts of the correlation structure. Attention is given to regression diagnostics, and to the assessment and comparison of models. Variable selection strategies can readily over-fit; hence the importance of training/test approaches and of cross-validation. The potential for errors in x to seriously bias regression coefficients is demonstrated. Strong multicollinearity leads to large variance inflation factors.
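The following minimal sketch, with simulated data and illustrative variable names, fits a multiple regression, shows a basic residual diagnostic, and computes a variance inflation factor from first principles:

set.seed(7)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.6)            # predictors are correlated
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(100)
fit <- lm(y ~ x1 + x2)
summary(fit)                                     # coefficients and standard errors
plot(fit, which = 1)                             # residuals versus fitted values
1 / (1 - summary(lm(x1 ~ x2))$r.squared)         # variance inflation factor for x1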
This chapter moves from regression to methods that focus on the pattern presented by multiple variables, albeit with applications in regression analysis. A strong focus is on finding patterns that invite further investigation, and/or on replacing many variables by a much smaller number that capture important structure in the data. Methodologies discussed include principal components analysis and, more generally, multidimensional scaling; cluster analysis (the exploratory process that groups “alike” observations) and dendrogram construction; and discriminant analysis. Two sections discuss issues for the analysis of data, such as from high-throughput genomics, where the aim is to determine, from perhaps thousands or tens of thousands of variables, which are shifted in value between groups in the data. A treatment of the role of balance and matching in making inferences from observational data then follows. The chapter ends with a brief introduction to methods for multiple imputation, which use multivariate relationships to fill in missing values in observations that are incomplete, allowing them to have at least some role in a regression or other further analysis.
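As a rough sketch of the kinds of calculation involved, the following uses R's built-in iris measurements (a stand-in for the chapter's datasets) to obtain principal components and a cluster dendrogram:

num <- scale(iris[, 1:4])                        # standardize the four measurements
pc  <- prcomp(num)
summary(pc)                                      # proportion of variance by component
plot(pc$x[, 1:2], col = iris$Species)            # observations on the first two components
hc <- hclust(dist(num), method = "ward.D2")      # hierarchical clustering
plot(hc, labels = FALSE)                         # the dendrogram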
In the models discussed here, there is a hierarchy of variation that corresponds to groupings within the data. For example, students may be sampled from different classes, which in turn are sampled from different schools. Or, rather than being nested, groups may be crossed. Important notions are those of fixed and random effects, and of variance components. Analysis of data from designs that have the balance needed to allow an analysis of variance breakdown is a special case. Further types of mixed models are generalized linear mixed models and repeated measures models. Repeated measures models are multilevel models in which measurements consist of multiple profiles in time or space, resulting in time or spatial dependence. Relative to the length of time series that is required for a realistic analysis, each individual repeated measures profile can and often will have values for a few time points only.
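A minimal sketch, assuming the lme4 package and a hypothetical data frame scores with columns score, school, and class, might fit nested random effects as follows:

library(lme4)
fit <- lmer(score ~ 1 + (1 | school/class), data = scores)   # classes nested within schools
summary(fit)                                     # fixed effect and variance components
VarCorr(fit)                                     # school, class-within-school, and residual variances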
This chapter explores ways to set up a model matrix so that linear combinations of the columns can fit curves and multidimensional surfaces. These extend to methods, within a generalized additive model framework, that use a penalization approach to constrain over-fitting. A further extension is to the fitting of quantiles of the data. The methodologies are important both for direct use in modeling data, and for checking for pattern in residuals from models of a more classical parametric style. The methodology is extended, in later chapters, to include smoothing terms in generalized linear models and in models that allow for time series errors.
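As an illustrative sketch with simulated data, the gam function in the mgcv package fits a penalized smooth in this style:

library(mgcv)
set.seed(11)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
fit <- gam(y ~ s(x))                             # penalization limits the effective degrees of freedom
summary(fit)
plot(fit, residuals = TRUE)                      # fitted smooth with partial residuals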
The notes in this appendix provide a brief and limited overview of R syntax, semantics, and the R package system, as background for working with the R code included in the text. They are intended for use alongside R help pages and the wealth of tutorial material that is available online.
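A few lines of the kind of elementary usage that the notes cover:

x <- c(2, 3, 5, 7)                               # create a numeric vector
mean(x)                                          # apply a function to it
d <- data.frame(x = x, y = x^2)                  # a simple data frame
str(d)                                           # inspect the structure of an object
help(mean)                                       # the help system; see also help.start()
# install.packages("MASS"); library(MASS)        # install and attach a package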
Common time series models allow for a correlation between observations that is likely to be largest for points that are close together in time. Adjustments can be made, also, for seasonal effects. Variation in a single spatial dimension may have characteristics akin to those of time series, and comparable models find application there. Autoregressive models, which make good intuitive sense and are simple to describe, are the starting point for discussion; the treatment then moves on to autoregressive moving average models, with possible differencing. The "forecast" package for R has mechanisms that allow automatic selection of model parameters. Exponential smoothing state space (ETS) models are an important alternative that has often proved effective in forecasting applications. ARCH and GARCH models are further classes that have been developed to handle the heteroskedasticity that is characteristic of financial time series.
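As a brief sketch, assuming the forecast package and using R's built-in AirPassengers series (not necessarily an example from the text):

library(forecast)
fit_arima <- auto.arima(AirPassengers)           # automatic ARIMA order selection
fit_ets   <- ets(AirPassengers)                  # exponential smoothing state space model
plot(forecast(fit_arima, h = 24))                # forecasts two years ahead
accuracy(fit_ets)                                # training set accuracy measures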
Inferences are never assumption free. Data summaries that do not account for all relevant effects readily mislead. Distributions for the Pearson correlation and for counts are noted, along with extensions that handle extra-binomial and extra-Poisson variation. Notions of statistical power are introduced. Resampling methods, the bootstrap, and permutation tests extend the available inferential approaches. Regression with a single explanatory variable is used as a context in which to introduce residual plots, outliers, influence, robust regression, and standard errors of predicted values. There are two regression lines: that of y on x and that of x on y. Power transformations, with the logarithmic transformation as a special case, are often effective in giving a linear relationship. The training/test approach, and the closely allied cross-validation approach, can be important for avoiding over-fitting. Other topics include one-way and two-way comparisons, adjustments when there are multiple comparisons, and the estimation of false discovery rates when there is severe multiplicity. A discussion of theories of inference, including likelihood, the Bayes Factor, and other Bayesian perspectives, ends the chapter.
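A minimal sketch of a permutation test for a difference in two group means, using simulated data:

set.seed(3)
g <- rep(c("A", "B"), each = 20)
y <- rnorm(40) + (g == "B") * 0.8                # group B shifted upwards
obs <- diff(tapply(y, g, mean))                  # observed difference, B minus A
perm <- replicate(5000, diff(tapply(y, sample(g), mean)))   # differences under shuffled labels
mean(abs(perm) >= abs(obs))                      # two-sided permutation p-value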
The strengths of this book include the directness of its encounter with research data, its advice on practical data analysis issues, careful critiques of analysis results, its use of modern data analysis tools and approaches, and its use of simulation and other computer-intensive methods where these provide insight or give results that are not otherwise available. Further strengths are its attention to graphical and other presentation issues, its use of examples drawn from across the range of statistical applications, the links that it makes into the debate over reproducibility in science, and the inclusion of code that reproduces the analyses. The methods that we cover have wide application. The datasets, many of which have featured in published papers, are drawn from many different fields. They reflect a journey in learning and understanding, alike for the authors and for those with whom they have worked, that has ranged widely over many different research areas. The R system has brought into a common framework a huge range of abilities for data analysis, data manipulation, and graphics. Our text aims to help its readers take full advantage of those abilities.
Generalized linear models extend classical linear models in two ways. They allow the fitting of a linear model to a dependent variable whose expected values have been transformed using a "link" function. They allow for a range of error families other than the normal. They are widely used to fit models to count data and to binomial-type data, including models with errors that may exhibit extra-binomial or extra-Poisson variation. The discussion extends to models in the generalized additive model framework, and to ordinal regression models. Survival analysis, also referred to as time-to-event analysis, is principally concerned with the time duration of a given condition, often but not necessarily sickness or death. In nonmedical contexts, it may be referred to as failure time or reliability analysis. Applications include the failure times of industrial machine components, electronic equipment, kitchen toasters, light bulbs, businesses, loan defaults, and more. There is an elegant methodology for dealing with "censoring", where all that can be said is that the event of interest occurred before or after a certain time, or in a specified interval.
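As brief sketches (not analyses from the text), the following fit a Poisson log-linear model with glm, and then Kaplan-Meier survival curves that accommodate censoring, assuming the survival package and its built-in lung data:

counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)   # small illustrative counts
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
fit_glm <- glm(counts ~ outcome + treatment, family = poisson)   # log link by default
summary(fit_glm)

library(survival)
fit_km <- survfit(Surv(time, status) ~ sex, data = lung)   # censoring handled via Surv()
plot(fit_km)                                     # survival curves by sex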
Tree-based methods use methodologies that are radically different from those discussed in previous chapters. They are relatively easy to use and can be applied to a wide class of problems. As with many of the newer machine learning methods, construction of a tree, or (in the random forest approach) of many trees, follows an algorithmic process. Single-tree methods occupy the first part of this chapter. An important aspect of the methodology is the determination of error estimates. By building a large number of trees and using a voting process to make predictions, the random forests methodology that occupies the latter part of the chapter can often greatly improve on what can be achieved with a single tree. The methodology operates more as a black box, but with implementation details that are simpler to describe than for single-tree methods. In large sample classification problems, it has often proved superior to other contenders.
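A minimal sketch, assuming the rpart and randomForest packages and using the built-in iris data:

library(rpart)
library(randomForest)
tree <- rpart(Species ~ ., data = iris)          # a single classification tree
printcp(tree)                                    # cross-validated error estimates
set.seed(1)
rf <- randomForest(Species ~ ., data = iris)     # many trees; predictions by vote
rf                                               # includes the out-of-bag error estimate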