Biostatistics with R provides a straightforward introduction to analysing data from the wide field of biological research, including nature protection and global change monitoring. The book is centred around traditional statistical approaches, focusing on those prevailing in research publications. The authors cover t-tests, ANOVA and regression models, as well as the advanced methods of generalised linear models and classification and regression trees. Chapters usually start with several useful case examples, describing the structure of typical datasets and proposing research-related questions. All chapters are supplemented by example datasets, step-by-step R code demonstrating the analytical procedures, and interpretation of the results. The authors also provide examples of how to appropriately describe statistical procedures and the results of analyses in research papers. This accessible textbook will serve a broad audience, from students, researchers or professionals looking to improve their everyday statistical practice, to lecturers of introductory undergraduate courses. Additional resources are provided on www.cambridge.org/biostatistics.
This chapter covers more advanced types of ANOVA models, those that contain multiple explanatory variables (factors). We start with hierarchical ANOVA, illustrated by two example studies, and describe how the variation of the response variable is decomposed, introducing the concept of variance components. We then set apart and discuss the properties of the split-plot ANOVA model and illustrate its use by evaluating the results of a field experiment. Finally, we discuss repeated measurements ANOVA, a very important model for analysing both monitoring data and data from manipulative experiments. Although it is typically analysed as a type of split-plot ANOVA, the repeated measurements ANOVA model has further assumptions that are discussed in the text. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the nlme, lme4, effects, and car packages.
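As a minimal sketch of a split-plot analysis (with simulated data; the block, mowing and fert variables are invented for this illustration, not taken from the book's examples), the model could be specified in R as follows:

set.seed(1)
dat <- expand.grid(block = factor(1:4),
                   mowing = factor(c("no", "yes")),
                   fert = factor(c("no", "yes")))
dat$y <- 5 + (dat$mowing == "yes") * 1 + (dat$fert == "yes") * 2 + rnorm(16)

## Split-plot ANOVA: mowing applied to whole plots nested in blocks,
## fertilisation applied to sub-plots; Error() defines the error strata
summary(aov(y ~ mowing * fert + Error(block/mowing), data = dat))

## The same design expressed as a mixed-effects model with nlme
library(nlme)
m <- lme(y ~ mowing * fert, random = ~ 1 | block/mowing, data = dat)
anova(m)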
We start by comparing the use of correlation (namely the Pearson linear correlation coefficient) for describing the relationship between two variables with the approach based on simple linear regression. Then we describe how to test the hypothesis that there is no correlation between the two variables within the sampled population. We conclude this topic by discussing the power of this test. We then move to the nonparametric correlation coefficients suitable for measuring the strength of a monotonic relationship between two variables. An additional section focuses on how to appropriately interpret correlation strength and significance, factoring in the specific questions being asked. Finally, we discuss the differences between correlation-based and causal relationships. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the pwr package.
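A brief illustrative sketch of these procedures, using simulated data rather than the book's examples:

set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

cor.test(x, y)                        # Pearson correlation, tests H0: rho = 0
cor.test(x, y, method = "spearman")   # nonparametric, monotonic relationship

library(pwr)
pwr.r.test(n = 30, r = 0.5, sig.level = 0.05)   # power of the correlation test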
We start by outlining how the generalised linear models (GLM) extend the classical linear models, namely by the use of the link function, transforming the values of the response variable predicted by the model. We also present the types of statistical distribution we can choose for the unexplained (residual) variation and relate them to the most commonly encountered forms of biological data. The decomposition of the variation in the response variable, using the analysis of deviance, is described together with the concepts of maximum likelihood and of the null model. We also explain how to handle overdispersion, which is the larger-than-expected residual variation in GLMs with an assumed Poisson or binomial distribution. We show the ways we can select predictors for inclusion in our model, focusing on the idea of model parsimony, measured by the AIC criterion. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use.
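For illustration only (simulated count data; the ph and richness variables are hypothetical), a Poisson GLM with a check for overdispersion could be fitted as:

set.seed(1)
ph <- runif(50, 4, 8)
richness <- rpois(50, lambda = exp(0.2 + 0.3 * ph))

m1 <- glm(richness ~ ph, family = poisson)   # log link is the default
summary(m1)                 # compare residual deviance with residual df
anova(m1, test = "Chisq")   # analysis of deviance
AIC(m1)                     # parsimony criterion

## If overdispersed, a quasi-Poisson family with F tests is one remedy
m2 <- glm(richness ~ ph, family = quasipoisson)
anova(m2, test = "F")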
The t distribution plays an important role in many statistical tests and in the estimation of parametric confidence intervals. We introduce properties of the t distribution and its relationship to the normal distribution. The single sample t test is described, as well as the related paired t test. The concept of a one-sided test is then introduced and compared with the two-sided test. We explain the meaning of confidence intervals and show their calculation. We discuss the assumptions of the t tests introduced in this chapter. A separate section is devoted to a detailed treatment of how to present the variability in our data, and the precision of mean value estimation, both numerically and visually. The reporting of standard deviations, standard errors, and confidence intervals is compared and discussed. We round off this chapter by outlining how to calculate the sample size required to attain a specified precision for the mean estimate. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use.
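An illustrative sketch of these steps with a single simulated sample (not the book's example):

set.seed(1)
x <- rnorm(20, mean = 10.4, sd = 1)

t.test(x, mu = 10)                            # two-sided single sample t test
t.test(x, mu = 10, alternative = "greater")   # one-sided alternative

## 95% confidence interval computed directly from the t distribution (n = 20)
mean(x) + qt(c(0.025, 0.975), df = 19) * sd(x) / sqrt(20)

## Sample size needed to detect a mean shift of 0.5 with power 0.8
power.t.test(delta = 0.5, sd = 1, power = 0.8, type = "one.sample")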
Although there are many ways of describing nonlinear relationships among variables, this chapter focuses primarily on the polynomial regression, which is related to the multiple linear regression model. We pay particular attention to models using the second-order polynomial. These models are often employed in the field of community ecology to describe unimodal changes of species abundances along environmental gradients. The downsides of using polynomial regression are also addressed. We bring this chapter to a close by touching on the non-linear least-squares regression models and the appropriate context in which they should be applied. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the nlme package.
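As an illustrative sketch (simulated unimodal data; the parameter names h, opt and w are invented for this example):

set.seed(1)
grad <- runif(60, 0, 10)                        # environmental gradient
abund <- 20 - (grad - 5)^2 + rnorm(60, sd = 2)  # unimodal species response

m.poly <- lm(abund ~ poly(grad, 2))   # second-order polynomial regression
summary(m.poly)

## Non-linear least squares for an explicitly specified functional form
m.nls <- nls(abund ~ h - (grad - opt)^2 / w,
             start = list(h = 20, opt = 5, w = 1))
summary(m.nls)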
This chapter compares two major families of ordination methods, unconstrained and constrained ordination. We start by describing the tasks achieved with the help of unconstrained ordination and illustrate how to interpret the resulting ordination diagrams. The methods of constrained ordination allow us to build and test statistical models describing the effects of predictors (such as environmental descriptors) on multivariate response data (such as the composition of biotic communities). We separately discuss linear discriminant analysis, which aims to use a set of numerical variables to predict the membership of observations in a priori defined classes. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, in this case employing the vegan package.
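For illustration, a sketch using the dune example data shipped with vegan (and, for discriminant analysis, the classic iris data; neither is necessarily the book's own example):

library(vegan)
data(dune, dune.env)   # community table and environmental descriptors

ord <- rda(dune)       # unconstrained ordination (PCA of the community table)
plot(ord)              # ordination diagram

m <- cca(dune ~ Management, data = dune.env)   # constrained ordination
anova(m, permutations = 199)                   # permutation test of the model

## Linear discriminant analysis: predicting a priori defined classes
library(MASS)
lda(Species ~ ., data = iris)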
The analysis of variance is introduced as a method of testing differences among the means of more than two groups of observations. We outline the basic assumptions of ANOVA models, focusing on the expected homogeneity of variances across the compared groups, which is assessed by the Bartlett test. The decomposition of variability in the response variable (its total sum of squares) into among-group and within-group (residual) variation leads to the definition of the F-ratio, which is the central test statistic in ANOVA models. We also introduce the distinction between fixed and random effects and discuss the power of the F test as well as its robustness to violations of ANOVA model assumptions. The first part of the chapter, dealing with one-way ANOVA, concludes with a description of the multiple comparisons procedure; we focus on two types, Tukey's test and Dunnett's test. The chapter closes by presenting a nonparametric counterpart of one-way ANOVA, the Kruskal-Wallis test. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the multcomp package.
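A compact illustrative sketch of this workflow on simulated data (the group names are invented):

set.seed(1)
dat <- data.frame(group = factor(rep(c("ctrl", "low", "high"), each = 10)),
                  y = rnorm(30, mean = rep(c(5, 6, 8), each = 10)))

bartlett.test(y ~ group, data = dat)   # homogeneity of variances
m <- aov(y ~ group, data = dat)
summary(m)                             # F-ratio test
TukeyHSD(m)                            # Tukey's pairwise comparisons

library(multcomp)                      # Dunnett's comparisons with the control
summary(glht(m, linfct = mcp(group = "Dunnett")))

kruskal.test(y ~ group, data = dat)    # nonparametric counterpart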
After a general introduction to multivariate statistical analyses, we focus on describing the task of multivariate classification, distinguishing its non-hierarchical and hierarchical forms. Focusing on hierarchical agglomerative classification methods (cluster analysis), we highlight the important decisions that must be made regarding the measurement of dissimilarity (distance) among objects. Following this, we explain the construction of dendrograms representing this hierarchical classification. We also briefly mention divisive classification methods, focusing on the TWINSPAN method. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, in this case employing the cluster package.
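As a brief sketch, using R's built-in USArrests data rather than the book's examples:

d <- dist(scale(USArrests))          # Euclidean distances on standardised data
hc <- hclust(d, method = "average")  # agglomerative (UPGMA) clustering
plot(hc)                             # dendrogram of the hierarchy

library(cluster)
ag <- agnes(scale(USArrests), method = "average")
ag$ac                                # agglomerative coefficient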
The two-way ANOVA (and its extensions to more factors) is applied to data with a factorial arrangement and is an important tool for analysing data from experimental studies. We start by characterising the properties of a factorial design and compare it with a hierarchical design. We introduce two important experimental concepts here: the ideas of a balanced design and of a proportional design. We then describe the two-way ANOVA model, including an explanation of the interaction term and its use in ANOVA models. We outline some basic types of correct experimental designs, including randomised complete blocks, and contrast them with incorrect designs resulting in pseudo-replicated observations. Separate sections deal with ANOVA model specification for randomised blocks and Latin square designs, and with the specific issues of the multiple comparisons procedure in ANOVA models with multiple factors. A nonparametric counterpart of the randomised complete block ANOVA, the Friedman test, is also introduced. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the multcomp package.
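An illustrative sketch with a simulated factorial experiment in blocks (the water and fert factors are invented for this example):

set.seed(1)
dat <- expand.grid(block = factor(1:5),
                   water = factor(c("low", "high")),
                   fert = factor(c("none", "NPK")))
dat$y <- 10 + (dat$water == "high") * 2 + (dat$fert == "NPK") * 3 + rnorm(20)

## Factorial two-way ANOVA in randomised complete blocks:
## block enters as an additive term, the factors with their interaction
summary(aov(y ~ block + water * fert, data = dat))

## Friedman test: nonparametric counterpart for one factor in blocks
friedman.test(y ~ water | block, data = subset(dat, fert == "none"))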
The simple linear regression is one of the most frequently employed statistical models. Linear regression is used to describe the relationship between two numerical variables, but it also serves as a building block for more complex statistical methods, such as multivariate ordination. We start by comparing the concepts of regression and correlation, before introducing the equation of the simple linear regression. We also explain the decomposition of the observed values of the response variable into fitted values and regression residuals. Following this is a discussion of the hypotheses that can be tested for a regression model, distinguishing the F-ratio based test from the t tests of individual regression coefficients. The calculation of confidence and prediction intervals allows us to enhance diagrams displaying the fitted model. A separate section is devoted to the graphs of regression diagnostics and their interpretation, as well as to the effects of log-transforming the variables to linearise their relationship. Additional specialised sections deal with regression through the origin and its possible dangers, regression using a predictor with random variation, and with linear calibration. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the effects and lmodel2 packages.
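A minimal sketch with simulated data (the nitro and height variables are illustrative, not the book's example):

set.seed(1)
nitro <- runif(40, 0, 10)                      # hypothetical nitrogen supply
height <- 3 + 1.5 * nitro + rnorm(40, sd = 2)

m <- lm(height ~ nitro)
summary(m)   # t tests of coefficients and the overall F test

new <- data.frame(nitro = seq(0, 10, length.out = 50))
predict(m, new, interval = "confidence")   # confidence band for the mean
predict(m, new, interval = "prediction")   # prediction band for new values

par(mfrow = c(2, 2)); plot(m)              # regression diagnostics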
Contingency tables are used to quantify and test relationships between two or more categorical (qualitative) variables. Taking a simple example of a two-way contingency table (relating two categorical variables), we illustrate the process of calculating the frequencies of category combinations expected under the assumption of variable independence, and show how observed and expected frequencies are compared within the chi-square test statistic. We also briefly describe the task of measuring the strength of association between two categorical variables, which is important for evaluating the co-occurrence of biological taxa. We illustrate the differences between statistical and causal relationships between variables, highlighting the essential role of manipulative experiments for revealing causality. Finally, we demonstrate the possible ways of visualising contingency tables and their test results. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the vcd package.
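For illustration only, with invented frequencies rather than data from the book:

## Hypothetical 2 x 2 table: joint occurrence of two taxa on sample plots
tab <- matrix(c(30, 10, 5, 25), nrow = 2,
              dimnames = list(taxonA = c("present", "absent"),
                              taxonB = c("present", "absent")))
chisq.test(tab)            # chi-square test of independence
chisq.test(tab)$expected   # frequencies expected under independence

library(vcd)
assocstats(tab)            # strength of association (phi, Cramer's V)
mosaic(tab, shade = TRUE)  # mosaic plot with residual shading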
We explain linear regression models with multiple predictors, including an overview of partial regression coefficients. The related concept of partial correlation is discussed in a separate section. We also contrast the overall model test using the F-ratio statistic and the t tests of partial effects of individual predictors. The adjusted coefficient of determination is presented as a more accurate way of conveying the explanatory power of a regression model. Finally, we characterise the family of general linear models, focusing specifically on analysis of covariance (ANCOVA). We provide examples of ANCOVA models and demonstrate their usefulness when applied to the analysis of biological experiments. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use, including the effects and ppcor packages.
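A brief sketch on simulated data (variable and group names are illustrative):

set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 0.8 * dat$x1 + 0.4 * dat$x2 + rnorm(50)

m <- lm(y ~ x1 + x2, data = dat)
summary(m)   # overall F test, t tests of partial coefficients, adjusted R^2

library(ppcor)
pcor(dat)    # partial correlations among all numeric variables

## ANCOVA: one factor plus one covariate in a general linear model
dat$grp <- factor(rep(c("ctrl", "trt"), each = 25))
summary(lm(y ~ grp + x1, data = dat))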
The core of this chapter is the two-sample t test, which compares the means of two groups of observations, but we start by comparing the variances of the two groups using the F test. We discuss the assumptions of the two-sample t test and also present the approximate Welch test, used when the assumption of variance homogeneity is violated. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use.
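A minimal sketch with two simulated samples:

set.seed(1)
a <- rnorm(15, mean = 10, sd = 1)
b <- rnorm(15, mean = 12, sd = 1)

var.test(a, b)                   # F test of equal variances
t.test(a, b, var.equal = TRUE)   # classical two-sample t test
t.test(a, b)                     # Welch approximation (R's default)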
After a general introduction to nonparametric tests, we review two useful nonparametric tests for comparing two samples. The Mann-Whitney test is a counterpart of the two-sample t test, but it uses the ranks of the recorded values instead. Although this test is often described as a test of the differences between mean values, this interpretation only applies when both samples come from distributions of the same shape. The Wilcoxon test for paired observations corresponds to the parametric paired t test. We also introduce permutation tests, which represent another group of non-parametric methods for hypothesis testing. The methods described in this chapter are accompanied by a carefully-explained guide to the R code needed for their use.
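An illustrative sketch with simulated skewed samples (not the book's data):

set.seed(1)
a <- rexp(20, rate = 1)
b <- rexp(20, rate = 0.5)

wilcox.test(a, b)                  # Mann-Whitney test, independent samples
wilcox.test(a, b, paired = TRUE)   # Wilcoxon test for paired observations

## A simple permutation test of the difference in means
obs <- mean(a) - mean(b)
perm <- replicate(999, { z <- sample(c(a, b))
                         mean(z[1:20]) - mean(z[21:40]) })
mean(abs(c(perm, obs)) >= abs(obs))   # two-sided permutation p-value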