In the models discussed here, there is a hierarchy of variation that corresponds to groupings within the data. For example, students may be sampled from different classes that are, in turn, sampled from different schools. Or, rather than being nested, groups may be crossed. Important notions are those of fixed and random effects, and of variance components. Analysis of data from designs with the balance needed to allow an analysis of variance breakdown is a special case. Further types of mixed models are generalized linear mixed models and repeated measures models. Repeated measures models are multilevel models where the measurements consist of multiple profiles in time or space, resulting in time or spatial dependence. Relative to the length of time series required for a realistic analysis, each individual repeated measures profile can, and often will, have values for a few time points only.
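As a minimal sketch of such a model, not drawn from the text, the following fits random intercepts for a students-within-classes-within-schools structure using the lme4 package; the data frame marks and its columns score, class, and school are hypothetical:

```r
## Hypothetical data: one row per student, with columns
## score, class, and school.
library(lme4)

## Random intercepts for schools, and for classes nested within schools.
fit <- lmer(score ~ 1 + (1 | school/class), data = marks)

## Variance components: between schools, between classes within
## schools, and residual (between students).
VarCorr(fit)
```

The term (1 | school/class) expands to (1 | school) + (1 | school:class), giving one variance component for each level of the hierarchy.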
This chapter explores ways to set up a model matrix so that linear combinations of the columns can fit curves and multidimensional surfaces. These extend to methods, within a generalized additive model framework, that use a penalization approach to constrain over-fitting. A further extension is to the fitting of quantiles of the data. The methodologies are important both for direct use in modeling data, and for checking for pattern in residuals from models in a more classical parametric style. The methodology is extended, in later chapters, to include smoothing terms in generalized linear models and in models that allow for time series errors.
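By way of illustration, and assuming the mgcv package, the following sketch fits a penalized smooth to simulated data; the penalty on the spline coefficients is what constrains over-fitting:

```r
library(mgcv)

## Simulated data: a smooth signal plus noise.
set.seed(17)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

## s(x) sets up a penalized regression spline; the smoothing
## parameter is chosen automatically.
fit <- gam(y ~ s(x))
plot(fit, residuals = TRUE)
```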
The notes in this appendix provide a brief and limited overview of R syntax, semantics, and the R package system, as background for working with the R code included in the text. They are intended for use alongside the R help pages and the wealth of tutorial material that is available online.
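A few lines of the kind of basic usage that the appendix covers, purely by way of illustration:

```r
x <- c(2, 3, 5, 7)               # c() creates a vector
mean(x)                          # functions operate on whole vectors
x[x > 3]                         # logical subscripting
d <- data.frame(x = x, y = x^2)  # a data frame holds columns of equal length
## Packages extend base R: install once, attach in each session.
## install.packages("MASS"); library(MASS)
```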
Common time series models allow for a correlation between observations that is likely to be largest for points that are close together in time. Adjustments can also be made for seasonal effects. Variation in a single spatial dimension may have characteristics akin to those of time series, and comparable models find application there. Autoregressive models, which make good intuitive sense and are simple to describe, are the starting point for discussion; the account then moves on to autoregressive moving average models, with possible differencing. The "forecast" package for R has mechanisms that allow automatic selection of model parameters. Exponential smoothing state space (exponential time series, or ETS) models are an important alternative that has often proved effective in forecasting applications. ARCH and GARCH heteroskedasticity models are further classes that have been developed to handle the special characteristics of financial time series.
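As a sketch of the automatic selection mechanisms, assuming the forecast package and using the built-in AirPassengers series:

```r
library(forecast)

## Automatic ARIMA order selection, differencing included:
fit.arima <- auto.arima(AirPassengers)

## An exponential smoothing state space (ETS) alternative:
fit.ets <- ets(AirPassengers)

## Forecast two years ahead from each model:
plot(forecast(fit.arima, h = 24))
plot(forecast(fit.ets, h = 24))
```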
Inferences are never assumption free. Data summaries that do not account for all relevant effects readily mislead. Distributions for the Pearson correlation and for counts are noted, along with extensions that handle extra-binomial and extra-Poisson variation. Notions of statistical power are introduced. Resampling methods, notably the bootstrap and permutation tests, extend the available inferential approaches. Regression with a single explanatory variable is used as a context in which to introduce residual plots, outliers, influence, robust regression, and standard errors of predicted values. There are two regression lines: that of y on x and that of x on y. Power transformations, with the logarithmic transformation as a special case, are often effective in giving a linear relationship. The training/test approach, and the closely allied cross-validation approach, can be important for avoiding over-fitting. Other topics include one-way and two-way comparisons, adjustments when there are multiple comparisons, and the estimation of false discovery rates when there is severe multiplicity. Discussion of theories of inference, including likelihood, and of Bayes factor and other Bayesian perspectives, ends the chapter.
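As one small illustration of the resampling idea, not taken from the text, a percentile bootstrap interval for a Pearson correlation can be computed with the boot package; the data here are simulated:

```r
library(boot)

## Simulated bivariate data.
set.seed(29)
d <- data.frame(x = rnorm(40))
d$y <- 0.6 * d$x + rnorm(40, sd = 0.8)

## The statistic must accept the data and a vector of resample indices.
corfun <- function(data, i) cor(data$x[i], data$y[i])
b <- boot(d, corfun, R = 2000)
boot.ci(b, type = "perc")   # percentile bootstrap confidence interval
```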
The strengths of this book include the directness of its encounter with research data, its advice on practical data analysis issues, careful critiques of analysis results, its use of modern data analysis tools and approaches, its use of simulation and other computer-intensive methods where these provide insight or give results that are not otherwise available, its attention to graphical and other presentation issues, its use of examples drawn from across the range of statistical applications, the links that it makes into the debate over reproducibility in science, and the inclusion of code that reproduces analyses. The methods that we cover have wide application. The datasets, many of which have featured in published papers, are drawn from many different fields. They reflect a journey in learning and understanding, alike for the authors and for those with whom they have worked, that has ranged widely over many different research areas. The R system has brought into a common framework a huge range of abilities for data analysis, data manipulation, and graphics. The aim of our text is to help readers take full advantage of those abilities.
Generalized linear models extend classical linear models in two ways. They allow the fitting of a linear model to a dependent variable whose expected values have been transformed using a "link" function. They allow for a range of error families other than the normal. They are widely used to fit models to count data and to binomial-type data, including models with errors that may exhibit extra-binomial or extra-Poisson variation. The discussion extends to models in the generalized additive model framework, and to ordinal regression models. Survival analysis, also referred to as time-to-event analysis, is principally concerned with the time duration of a given condition, often but not necessarily sickness or death. In nonmedical contexts, it may be referred to as failure time or reliability analysis. Applications include the failure times of industrial machine components, electronic equipment, kitchen toasters, light bulbs, businesses, loan defaults, and more. There is an elegant methodology for dealing with "censoring": where all that can be said is that the event of interest occurred before or after a certain time, or in a specified interval.
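Two brief sketches, with simulated counts for the GLM and, for the survival fit, the aml dataset that ships with the survival package:

```r
## Poisson regression with the (default) log link.
set.seed(7)
dose <- runif(100)
counts <- rpois(100, lambda = exp(1 + 2 * dose))
fit <- glm(counts ~ dose, family = poisson)
summary(fit)
## family = quasipoisson would allow for extra-Poisson variation.

## Right-censored survival data: Surv() pairs times with event
## indicators; survfit() gives Kaplan-Meier estimates by group.
library(survival)
fit.km <- survfit(Surv(time, status) ~ x, data = aml)
plot(fit.km, lty = 1:2)
```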
Tree-based methods use methodologies that are radically different from those discussed in previous chapters. They are relatively easy to use and can be applied to a wide class of problems. As with many of the newer machine learning methods, construction of a tree, or, in the random forest approach, of many trees, follows an algorithmic process. Single-tree methods occupy the first part of this chapter. An important aspect of the methodology is the determination of error estimates. By building a large number of trees and using a voting process to make predictions, the random forests methodology that occupies the latter part of this chapter can often greatly improve on what can be achieved with a single tree. The methodology operates more as a black box, but with implementation details that are simpler to describe than for single-tree methods. In large-sample classification problems, the methodology has often proved superior to other contenders.
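A sketch using rpart for a single tree and randomForest for the ensemble, with the built-in iris data standing in for a classification problem:

```r
library(rpart)
library(randomForest)

## A single classification tree; printcp() reports
## cross-validated error estimates.
tree.fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(tree.fit)

## A random forest: many trees vote, and the out-of-bag (OOB)
## error provides a built-in estimate of prediction error.
set.seed(31)
rf.fit <- randomForest(Species ~ ., data = iris)
print(rf.fit)
```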
Using diverse real-world examples, this text examines what models used for data analysis mean in a specific research context. What assumptions underlie analyses, and how can you check them? Building on the successful 'Data Analysis and Graphics Using R' (3rd edition, Cambridge, 2010), it expands upon topics including cluster analysis, exponential time series, matching, seasonality, and resampling approaches. An extended look at p-values leads to an exploration of replicability issues and of contexts where numerous p-values exist, including gene expression. Developing practical intuition, this book assists scientists in the analysis of their own data, and familiarizes students in statistical theory with practical data analysis. The worked examples and accompanying commentary teach readers to recognize when a method works and, more importantly, when it doesn't. Each chapter contains copious exercises. Selected solutions, notes, slides, and R code are available online, with extensive references pointing to detailed guides to R.