Sculpins (coastrange and slimy) and sticklebacks (ninespine and threespine) are widely distributed fishes cohabiting 2 south-central Alaskan lakes (Aleknagik and Iliamna), and all these species are parasitized by cryptic diphyllobothriidean cestodes in the genus Schistocephalus. The goal of this investigation was to test for host-specific parasitic relationships between sculpins and sticklebacks based upon morphological traits (segment counts) and sequence variation across the NADH1 gene. A total of 446 plerocercoids was examined. Large, significant differences in mean segment counts were found between cestodes in sculpin (mean = 112; standard deviation [s.d.] = 15) and stickleback (mean = 86; s.d. = 9) hosts within and between lakes. Nucleotide sequence divergence between parasites from sculpin and stickleback hosts was 20.5%, and Bayesian phylogenetic analysis recovered 2 well-supported clades of cestodes reflecting intermediate host family (i.e. sculpin, Cottidae vs stickleback, Gasterosteidae). Our findings point to the presence of a distinct lineage of cryptic Schistocephalus in sculpins from Aleknagik and Iliamna lakes that warrants further investigation to determine appropriate evolutionary and taxonomic recognition.
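The reported 20.5% nucleotide divergence is the kind of value given by a pairwise distance between aligned sequences. As a minimal sketch, here is the uncorrected p-distance (the study may well use a model-corrected distance; the function name and the short toy fragments are illustrative only, not data from the paper):

```python
def p_distance(seq1, seq2):
    """Uncorrected p-distance: proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

# two short, hypothetical aligned fragments differing at 2 of 10 sites
print(p_distance("ACGTACGTAC", "ACGTTCGTTC"))  # 0.2
```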
Another way of examining the patterns among objects based on multiple variables is to plot the objects in multidimensional space based on their pairwise dissimilarities. We first describe multidimensional scaling as a very flexible ordination method that can be based on a wide range of dissimilarities. We also introduce cluster analysis based on dissimilarities, where the pattern among objects is represented in a tree-like plot called a dendrogram. We show how to correlate dissimilarities to other continuous and/or grouping variables and fit linear models that treat the dissimilarities as responses modeled against continuous or categorical predictors.
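Both ordination and cluster analysis start from a matrix of pairwise dissimilarities among objects. A minimal sketch using the Bray–Curtis measure, a common choice for abundance data (the site names and species counts below are invented):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical, 1 = no shared taxa)."""
    denom = sum(a + b for a, b in zip(x, y))
    return sum(abs(a - b) for a, b in zip(x, y)) / denom if denom else 0.0

# hypothetical species counts at three sites
sites = {"A": [12, 0, 5], "B": [10, 1, 6], "C": [0, 8, 0]}
names = list(sites)
# full pairwise dissimilarity matrix, the input to MDS or clustering
dissim = {(i, j): bray_curtis(sites[i], sites[j]) for i in names for j in names}
```

Sites A and B, which share the same dominant species, end up far less dissimilar than A and C, which share none.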
We can easily find ourselves with lots of predictors. This situation has been common in ecology and environmental science but has spread to other biological disciplines as genomics, proteomics, metabolomics, etc., become widespread. Models can become very complex, and with many predictors, collinearity is more likely. Fitting the models is tricky, particularly if we’re looking for the “best” model, and the way we approach the task depends on how we’ll use the model results. This chapter describes different model selection approaches for multiple regression models and discusses ways of measuring the importance of specific predictors. It covers stepwise procedures, all subsets, information criteria, model averaging and validation, and introduces regression trees, including boosted trees.
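Information-criterion comparison of candidate models can be sketched as follows for Gaussian models fitted by OLS. The data are invented, and the AIC form used here (n·ln(RSS/n) + 2 parameters, counting intercept, slope, and error variance) is correct only up to an additive constant, which cancels when comparing models:

```python
import math

def fit_simple(y, x=None):
    """Return (RSS, parameter count) for an intercept-only or one-predictor OLS fit."""
    n = len(y)
    if x is None:
        yhat, k = [sum(y) / n] * n, 1
    else:
        mx, my = sum(x) / n, sum(y) / n
        b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
        b0 = my - b1 * mx
        yhat, k = [b0 + b1 * a for a in x], 2
    return sum((b - yh) ** 2 for b, yh in zip(y, yhat)), k

def aic(rss, n, k):
    # Gaussian AIC up to an additive constant; +1 counts the error variance
    return n * math.log(rss / n) + 2 * (k + 1)

x = [1, 2, 3, 4, 5, 6]                    # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]      # hypothetical, strongly linear response
rss0, k0 = fit_simple(y)                  # null (intercept-only) model
rss1, k1 = fit_simple(y, x)               # model with the predictor
```

For these data the slope model has a far smaller AIC, so it would be preferred; with many predictors the same comparison extends across the candidate set.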
Earlier chapters introduced modeling approaches for a continuous, normally distributed response. Biological data are often not so neat, and the common practice was to transform continuous response variables until the assumption of normality was met. Other kinds of data, particularly presence–absence and behavioral responses and counts, are discrete rather than continuous and require a different approach. In this chapter, we introduce generalized linear models and their extension to generalized linear mixed models to analyze these response variables. We show how common techniques such as contingency tables, loglinear models, and logistic and Poisson regression can be viewed as generalized linear models, using link functions to create the appropriate relationship between response and predictors. The models described in earlier chapters can be reinterpreted as a version of generalized linear models with the identity link function. We finish by introducing generalized additive models for situations where a linear model may be unsuitable.
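For a binary response, the logit link makes the log-odds a linear function of the predictor. A minimal sketch fitted by gradient ascent on the log-likelihood (the data and step settings are invented; in practice these models are fitted by iteratively reweighted least squares in standard software):

```python
import math

def sigmoid(z):
    """Inverse logit link: maps log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

def fit_logistic(x, y, lr=0.5, steps=2000):
    """One-predictor logistic regression via gradient ascent on the Bernoulli log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = sum(yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y))
        g1 = sum((yi - sigmoid(b0 + b1 * xi)) * xi for xi, yi in zip(x, y))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# hypothetical presence-absence data: occurrence becomes more likely as x increases
b0, b1 = fit_logistic([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
```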
Confronting models with data is only effective when the statistical model matches the biological one and the structure of your data collection is right for the statistical model. We outline some basic principles of sampling, emphasizing the importance of randomization. Randomization is also essential to experimental design, but so are controls, replication of experimental units, and independence of experimental units. This chapter emphasizes the distinction between sampling or experimental units representing independent instances and observational units representing things we measure or count from those units. Observational units may be subsamples of experimental units, but shouldn’t be confused with them. In this chapter, we also introduce methods for deciding how much data you need.
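Deciding how much data you need is usually a power calculation. A minimal normal-approximation sketch for comparing two group means, fixing a two-sided α of 0.05 and 80% power (the effect-size scenario is invented):

```python
import math

def n_per_group(delta, sigma):
    """Per-group sample size to detect a mean difference delta given sd sigma,
    using the normal approximation with alpha = 0.05 two-sided (z = 1.960)
    and power = 0.80 (z = 0.8416)."""
    z_alpha, z_beta = 1.959964, 0.841621
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# detecting a difference of one standard deviation
print(n_per_group(delta=1.0, sigma=1.0))  # 16 per group
```

Exact t-based calculations give slightly larger answers for small n, but the approximation shows how required sample size scales with the square of sigma/delta.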
All statistical models have assumptions, and violation of these assumptions can affect the reliability of any conclusions we draw. Before we fit any statistical model, we need to explore the data to be sure we fit a valid model. Are relationships assumed to be a straight line really linear? Does the response variable follow the assumed distribution? Are variances consistent? We outline several graphical techniques for exploring data and introduce the analysis of model residuals as a powerful tool. If assumptions are violated, we consider two solutions, transforming variables to satisfy assumptions and using models that assume different distributions more consistent with the raw data and residuals. The exploratory stage can be extensive, but it is essential. At this pre-analysis stage, we also consider what to do about missing observations.
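The inconsistent-variance check and the transformation remedy can be sketched together. With multiplicative variation (roughly constant coefficient of variation), group standard deviations grow with the mean on the raw scale but are stabilized by a log transformation (the grouped data below are invented):

```python
import math
import statistics

# hypothetical response with roughly constant coefficient of variation
x = [1, 1, 2, 2, 3, 3]                      # three groups, two replicates each
y = [1.0, 1.5, 10.0, 15.0, 100.0, 150.0]

def group_sds(y, x):
    """Sample standard deviation of the response within each group of x."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return [statistics.stdev(v) for v in groups.values()]

raw = group_sds(y, x)                              # sd grows with the mean
logged = group_sds([math.log10(v) for v in y], x)  # roughly constant after log
```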
We don’t always have a single response variable, and disciplines like community ecology or the new “omics” bring rich datasets. Chapters 14–16 introduce the treatment of these multivariate data, with multiple variables recorded for each unit or “object.” We start with how we measure association between variables and use eigenanalysis to reduce the original variables to a smaller number of summary components or functions while retaining most of the variation. Then we look at the broad range of measures of dissimilarity or distance between objects based on the variables. Both approaches allow examination of relationships among objects and can be used in linear modeling when response and predictor variables are identified. We also highlight the important role of transformations and standardizations when interpreting multivariate analyses.
Biological data commonly involve multiple predictors. This chapter starts expanding our models to include multiple categorical predictors (factors) when they are in factorial designs. These designs allow us to introduce synergistic effects – interactions. Two- and three-factor designs are used to illustrate the estimation and interpretation of interactions. Our approach is first to consider the most complex interactions and use them to decide whether it is helpful to continue examining simple interactions. Main effects – single predictors acting independently of each other – are the last to be considered. We also deal with problems caused by missing observations (unbalanced designs) and missing cells (fractional and incomplete factorials) and discuss how to estimate and interpret effects in these unbalanced or incomplete designs.
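In the simplest case, the interaction in a 2 × 2 factorial is a difference of simple effects: the effect of factor B at one level of A minus its effect at the other level. A minimal sketch with invented cell means:

```python
def interaction_2x2(cell_means):
    """Interaction contrast for a 2x2 factorial given cell means
    [(a1b1, a1b2), (a2b1, a2b2)]: difference of the simple effects of B."""
    (a1b1, a1b2), (a2b1, a2b2) = cell_means
    return (a2b2 - a2b1) - (a1b2 - a1b1)

# hypothetical cell means: the effect of B is 2 at a1 but 9 at a2
print(interaction_2x2([(10, 12), (11, 20)]))  # 7, a non-zero interaction
```

A value of zero means the simple effects agree and the two factors act additively, so main effects can be interpreted on their own.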
It’s surprisingly common for biologists to combine crossed and nested factors. These designs are partly nested or split-plot designs. They are nearly always mixed models, usually a random nested effect and at least two fixed effects. We describe the analysis of these designs, starting with a simple three-factor design with a single between-plot and a single within-plot effect, extending this analysis to include multiple effects, including interactions at this level, and adding continuous predictors (covariates).
Most biological ideas can be viewed as models of nature we create to explain phenomena and predict outcomes in new situations. We use data to determine these models’ credibility. We translate our biological models into statistical ones, then confront those models with data. A mismatch suggests the biological model needs refinement. A biological idea can also be considered a signal that appears in the data among the background noise. Fitting the model to the data lets us see if such a signal exists and, importantly, measure its strength. This approach only works well if our biological hypotheses are clear, the statistical models match the biology, and we collect the data appropriately. This clarity is the starting point for any biological research program.
For this book, we assume you’ve had an introductory statistics or experimental design class already! This chapter is a mini refresher of some critical concepts we’ll be using and lets you check you understand them correctly. The topics include predictor and response variables; the probability distributions biologists commonly encounter in their data; and the common model-fitting techniques, particularly ordinary least squares (OLS) and maximum likelihood (ML), for estimating effects and their uncertainty. You should be familiar with confidence intervals and understand what hypothesis tests and P-values do and don’t mean. You should recognize that we use data to decide, but these decisions can be wrong, so you need to understand the risk of missing important effects and the risk of falsely claiming an effect. Decisions about what constitutes an “important” effect are central.
Multiple predictors can all be continuous, or they can be mixtures of continuous and categorical. A common biological situation is a substantial number of continuous predictors, and fitting these models is commonly labeled multiple regression. We might also mix continuous and categorical predictors, and these have been called analyses of covariance. We show how these two analyses are closely related and how to fit and interpret these models. This chapter introduces the complication of correlated predictors (collinearity) and describes ways of detecting and dealing with the problem. This chapter also introduces measures of influence and leverage as part of checking assumptions.
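A standard collinearity diagnostic is the variance inflation factor (VIF), 1/(1 − R²), where R² comes from regressing one predictor on the others. A minimal two-predictor sketch (the predictor values are invented; with more predictors the auxiliary regression would be multiple, not simple):

```python
def r_squared(y, x):
    """R-squared from a simple linear regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def vif(x_j, x_other):
    """Variance inflation factor for predictor j given the other predictor."""
    return 1 / (1 - r_squared(x_j, x_other))

# two hypothetical, strongly correlated predictors
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.0]
```

Rules of thumb vary, but VIFs above about 10 are usually taken as a sign that collinearity is seriously inflating the variance of coefficient estimates.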
Making repeated observations through time adds complications, but it’s a common way to deal with limited research resources and reduce the use of experimental animals. A consequence of this design is that observations fall into clusters, often corresponding to individual organisms or “subjects.” We need to incorporate these relationships into statistical models and consider the additional complication where observations closer together in time may be more similar than those further apart. These designs were traditionally analyzed with repeated measures ANOVA, fitted by OLS. We illustrate this traditional approach but recommend the alternative linear mixed models approach. Mixed models offer better ways to deal with correlations within the data by specifying the clusters as random effects and modeling the correlations explicitly. When the repeated measures form a sequence (e.g. time), mixed models also offer a way to deal with occasional missing observations without omitting the whole subject from the model.
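The extra similarity of observations within a cluster can be summarized by the intraclass correlation, which for a balanced one-way layout can be estimated from ANOVA mean squares. A minimal sketch (the clustered data are invented; a real repeated-measures analysis would fit a mixed model and estimate the variance components directly):

```python
def icc_oneway(groups):
    """Intraclass correlation from a balanced one-way layout (ANOVA estimator):
    between-cluster variance as a proportion of total variance."""
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    ms_between = n * sum((sum(g) / n - grand) ** 2 for g in groups) / (k - 1)
    ms_within = sum((x - sum(g) / n) ** 2 for g in groups for x in g) / (k * (n - 1))
    s2_between = (ms_between - ms_within) / n
    return s2_between / (s2_between + ms_within)

# hypothetical subjects (clusters) whose observations are very similar within subject
icc = icc_oneway([[10, 11], [20, 21], [30, 31]])
```

An ICC near 1 means observations from the same cluster are nearly interchangeable, which is exactly the dependence that treating clusters as random effects accounts for.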
The components or functions derived from an eigenanalysis are linear combinations of the original variables. Principal components analysis (PCA) is a very common method that uses these components to examine patterns among the objects, often in a plot termed an ordination, and identify which variables are driving those patterns. Correspondence analysis (CA) is a related method used when the variables represent counts or abundances. Redundancy analysis and canonical CA are constrained versions of PCA and CA, respectively, where the components are derived after taking into account the relationships with additional explanatory variables. Finally, we introduce linear discriminant function analysis as a way of identifying and predicting membership of objects to predefined groups.
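For two variables the eigenanalysis underlying PCA can be written in closed form from the 2 × 2 covariance matrix. A minimal sketch (the data are invented; real analyses with many variables would use a numerical eigen routine):

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues of the 2x2 covariance matrix of two variables:
    variance along PC1, variance along PC2, and the proportion explained by PC1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    xs = [x - mx for x in xs]
    ys = [y - my for y in ys]
    sxx = sum(x * x for x in xs) / (n - 1)
    syy = sum(y * y for y in ys) / (n - 1)
    sxy = sum(x * y for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[sxx, sxy], [sxy, syy]] from trace and determinant
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(tr ** 2 / 4 - det)
    l1, l2 = tr / 2 + root, tr / 2 - root
    return l1, l2, l1 / (l1 + l2)

# two hypothetical, strongly correlated variables
l1, l2, prop = pca_2d([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```

With strongly correlated variables nearly all the variance lies along the first component, which is why PCA can summarize many correlated variables with a few components.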