In this chapter, we describe how to jointly model continuous quantities by representing them as multiple continuous random variables within the same probability space. We define the joint cumulative distribution function and the joint probability density function, and explain how to estimate the latter from data using a multivariate generalization of kernel density estimation. Next, we introduce marginal and conditional distributions of continuous variables and also discuss independence and conditional independence. Throughout, we model real-world temperature data as a running example. Then, we explain how to jointly simulate multiple random variables in order to correctly account for the dependence between them. Finally, we define Gaussian random vectors, which are the most popular multidimensional parametric model for continuous data, and apply them to model anthropometric data.
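As a rough illustration of the multivariate kernel density estimation mentioned above, the sketch below estimates a two-dimensional joint pdf with an isotropic Gaussian kernel. The function name `gaussian_kde_2d` and the use of a single shared bandwidth are assumptions made for brevity; the chapter's own estimator may differ.

```python
import math


def gaussian_kde_2d(data, bandwidth):
    """Return a function that estimates the joint pdf at a point (x, y).

    data: list of (x, y) observations.
    bandwidth: positive kernel width h, shared by both coordinates
               (an assumption of this sketch).
    """
    n = len(data)
    # Normalizing constant of an isotropic 2D Gaussian kernel, averaged over n points.
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for xi, yi in data:
            sq_dist = (x - xi) ** 2 + (y - yi) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density


# Hypothetical data: three (x, y) measurements.
pdf = gaussian_kde_2d([(0.0, 0.0), (1.0, 2.0), (0.5, 1.0)], bandwidth=0.5)
print(pdf(0.5, 1.0))
```

A smaller bandwidth makes the estimate follow individual data points more closely, while a larger one smooths it.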
Some of the key messages of this book are reviewed here in the form of ‘reminders’ that address past misunderstandings and emphasize solutions to perceived challenges. The importance of fundamentals, such as visual assessment, awareness of assumptions and potential numerical solutions, is described, and then the complementarity of the many statistics and their bases is reviewed. The exciting potential of ongoing developments is summarized, featuring hierarchical Bayesian analysis, spatial causal inference, applications of artificial intelligence (AI), knowledge graphs (KG), literature-based discovery (LBD) and geometric algebra. A quick review of future directions concludes this chapter and the book.
This chapter focuses on correlation, a key metric in data science that quantifies to what extent two quantities are linearly related. We begin by defining correlation between normalized and centered random variables. Then, we generalize the definition to all random variables and introduce the concept of covariance, which measures the average joint variation of two random variables. Next, we explain how to estimate correlation from data and analyze the correlation between the height of NBA players and different basketball stats. In addition, we study the connection between correlation and simple linear regression. We then discuss the differences between uncorrelation and independence. In order to gain better intuition about the properties of correlation, we provide a geometric interpretation of correlation, where the covariance is an inner product between random variables. Finally, we show that correlation does not imply causation, as illustrated by the spurious correlation between temperature and unemployment in Spain.
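The relationship between covariance and correlation described above can be sketched with the usual sample estimators. The function names and the toy data are illustrative assumptions, not taken from the chapter.

```python
import math


def covariance(xs, ys):
    """Sample covariance: average joint variation around the sample means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Sample correlation: covariance normalized by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# A perfectly linear relationship has correlation 1.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))

# y is a deterministic function of x (so they are dependent),
# yet the linear association measured by correlation is zero.
xs = [-2, -1, 0, 1, 2]
print(correlation(xs, [x ** 2 for x in xs]))
```

The second example is a small instance of the distinction between uncorrelation and independence: the variables are fully dependent but uncorrelated.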
Sets of points can be analysed from their positions in space and line segments can be studied separately for their own spatial arrangements and relationships. Combining points and lines as the nodes and edges of a spatial graph provides a flexible and powerful approach to spatial analysis. Such graphs and their network versions are studied by Graph Theory, a branch of mathematics that quantifies their properties, with or without additional features such as labels, weights and functions associated with the nodes and edges. Some relevant graph theory terms are introduced, including connectivity, connectedness, modularity and centrality. Networks are graphs with additional features, usually representing an observed system of interest, whether aspatial like a food web or spatial like a metacommunity. Key concepts for the latter example are connectivity, migration and network flow.
The spatial patterns of point events in the plane can exist at several different scales in a single data set. The assessment of point patterns can be based on the distances between neighbour events, on the counts of events in quadrats or on counts of events in point-centred circles of changing size. Ripley’s K function evaluates simple point patterns and can be modified for different spatial dimensions, for bi- and multi-variate variables and for non-homogeneous data. Quadrat-based quantitative data are usually analysed by one of many related ‘quadrat variance’ methods that assess variance or covariance as a function of spatial scale and which can also be modified for different conditions, such as bi- or multi-variate data. There are related methods from other traditions to be considered, including spectral analysis and wavelets. These approaches share a conceptual basis of comparing the data with spatial templates and we provide a summary of their relationships and differences.
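The idea of counting events in point-centred circles of changing size can be sketched with a naive estimator of Ripley's K. This version applies no edge correction, which practical analyses usually need; the function name `ripley_k` and the toy coordinates are assumptions of this example.

```python
import math


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate at one radius, without edge correction.

    points: list of (x, y) event locations.
    radius: circle radius r centred on each event.
    area: area of the study region.
    """
    n = len(points)
    pairs_within = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            # Count ordered pairs of distinct events no farther apart than r.
            if i != j and math.hypot(xi - xj, yi - yj) <= radius:
                pairs_within += 1
    return area * pairs_within / (n * (n - 1))


# Hypothetical events on the corners of a unit square inside a region of area 4.
events = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for r in (0.5, 1.0, 1.5):
    # Under complete spatial randomness, K(r) is expected to be pi * r**2.
    print(r, ripley_k(events, r, area=4.0), math.pi * r ** 2)
```

Comparing the estimate with pi * r**2 across radii indicates clustering (larger values) or regularity (smaller values) at each scale.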
Spatial structure is key to understanding diversity in ecological systems, being affected by both location and scale. The effects of scale are often dealt with as the hierarchy of alpha (local area), beta (between areas) and gamma (largest areas) diversity. All have spatial aspects, but beta diversity may be most interesting for spatial analysis because it involves complex responses such as intermediate-scale nestedness and species turnover with or without environmental gradients. In addition to species diversity within communities, the diversity of species composition or combinations as a function of location is an important characteristic of ecological assemblages. Many aspects of spatial diversity are best understood by spatial graphs, with sites as nodes and edges quantifying inter-site relationships. Temporal information, when available, can provide crucial insights about spatial diversity through understanding the dynamics of the system.
Spatial analysis originated in a broad range of disciplines, producing a diverse set of concepts and terminologies. Ecological processes take place in space and time, and the spatio-temporal structure that results takes different forms that produce spatial dependence at all scales. That dependence has major effects, even when ecological data are abstracted from the spatial context. Dependence does not always decay smoothly with increasing separation; it can vary with scale, with stationarity or its absence, and with direction (anisotropy versus isotropy). A key factor in spatial analysis is the ability to determine neighbour events for points or patches and we present various algorithms to create networks of neighbours. We discuss a range of spatial statistics and related randomization tests, including a ‘Markov and Monte Carlo’ approach. The chapter provides a detailed conceptual background for the technical aspects presented in subsequent chapters.
This chapter presents hypothesis testing, which is used to evaluate whether the available data provide sufficient evidence to support a certain hypothesis. The main idea is to play devil's advocate and assume a null hypothesis, which contradicts our hypothesis of interest. We explain how to use parametric modeling to implement this idea, and define the p-value. We prove that thresholding the p-value controls the probability of false positives. In addition, we define the power of a test, which quantifies the test's ability to identify positive findings. Next, we show how to perform hypothesis testing without a parametric model, focusing on the permutation test. Then, we discuss multiple testing, a setting where many tests are performed simultaneously. Finally, we provide three reasons why hypothesis testing should not be used as the only stamp of approval for scientific discoveries. First, hypothesis testing does not necessarily identify causal effects; it is complementary to causal inference. Second, small p-values do not imply practical significance. Third, relying on p-values to validate findings produces a strong incentive to cherry-pick results.
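A permutation test of the kind mentioned above can be sketched as follows, using the absolute difference in group means as the test statistic. The statistic, the function name and the toy data are assumptions of this example; the chapter may use a different statistic.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, num_permutations=10000, seed=0):
    """Approximate p-value for the null hypothesis that group labels are exchangeable.

    The test statistic is the absolute difference between the group means.
    """
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        # Randomly reassign the labels by shuffling the pooled data.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Adding 1 to numerator and denominator counts the observed labeling itself.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


# Hypothetical measurements from two clearly separated groups.
print(permutation_test([10, 11, 12, 13, 14, 15], [0, 1, 2, 3, 4, 5]))
```

Thresholding the returned value at a chosen significance level gives the decision rule; the `+ 1` correction keeps the estimate strictly positive.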
This chapter explains how to estimate population parameters from data. We introduce random sampling, an approach that yields accurate estimates from limited data. We then define the bias and the standard error, which quantify the average error of an estimator and how much it varies, respectively. In addition, we derive deviation bounds and use them to prove the law of large numbers, which states that averaging many independent samples from a distribution yields an accurate estimate of its mean. An important consequence is that random sampling provides a precise estimate of means and proportions. However, we caution that this is not necessarily the case if the data contain extreme values. Next, we discuss the central limit theorem (CLT), according to which averages of independent quantities tend to be Gaussian. We again provide a cautionary tale, warning that this does not hold in the absence of independence. Then, we explain how to use the CLT to build confidence intervals, which quantify the uncertainty of estimates obtained from finite data. Finally, we introduce the bootstrap, a popular computational technique to estimate standard errors and build confidence intervals.
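The bootstrap estimate of a standard error can be sketched as below: resample the data with replacement, recompute the estimator each time, and take the spread of the results. The function name and the toy sample are assumptions of this example.

```python
import random
import statistics


def bootstrap_standard_error(sample, estimator, num_resamples=2000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        # Each bootstrap resample has the same size as the original sample.
        resample = [rng.choice(sample) for _ in range(n)]
        estimates.append(estimator(resample))
    # The standard deviation of the bootstrap estimates approximates the standard error.
    return statistics.stdev(estimates)


# Hypothetical sample: the integers 1 to 100.
data = list(range(1, 101))
print(bootstrap_standard_error(data, statistics.mean))
```

For the sample mean, the result can be compared with the familiar formula of the sample standard deviation divided by the square root of the sample size.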
This chapter covers regression and classification, where the goal is to estimate a quantity of interest (the response) from observed features. In regression, the response is a numerical variable. In classification, it belongs to a finite set of predetermined classes. We begin with a comprehensive description of linear regression and discuss how to leverage it to perform causal inference. Then, we explain under what conditions linear models tend to overfit or to generalize robustly to held-out data. Motivated by the threat of overfitting, we introduce regularization and ridge regression, and discuss sparse regression, where the goal is to fit a linear model that only depends on a small subset of the available features. Then, we introduce two popular linear models for binary and multiclass classification: Logistic and softmax regression. At this point, we turn our attention to nonlinear models. First, we present regression and classification trees and explain how to combine them via bagging, random forests, and boosting. Second, we explain how to train neural networks to perform regression and classification. Finally, we discuss how to evaluate classification models.
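To make the effect of regularization concrete, the sketch below fits ridge regression in the simplest possible setting: one feature and no intercept, where the minimizer of the penalized squared error has a closed form. This simplification and the function name are assumptions of the example, not the chapter's general formulation.

```python
def ridge_coefficient(xs, ys, penalty):
    """Minimize sum((y - w * x)**2) + penalty * w**2 over the single weight w.

    Setting the derivative to zero gives w = sum(x * y) / (sum(x**2) + penalty).
    """
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)


# Hypothetical data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for penalty in (0.0, 1.0, 14.0):
    # With no penalty we recover ordinary least squares; larger penalties shrink w toward 0.
    print(penalty, ridge_coefficient(xs, ys, penalty))
```

The shrinkage toward zero is what limits overfitting when features are many or noisy.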
We begin the analysis of spatial sample data with join-count statistics for regular (lattice) and irregular (spatial network) samples, leading to methods for spatial autocorrelation and variography or geostatistics. The latter provides spatial interpolation methods that estimate variables at unsampled locations, based on the values at measured samples. There is a range of such methods, based on different assumptions and the types of data analysed. For quantitative data, Kriging estimates interpolated values at unsampled locations and their associated errors. In these applications, as elsewhere, there is an important distinction between global and local statistics and their estimates.
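Join counts on a regular lattice can be sketched as follows for binary data, where cells are coded 1 ("black") or 0 ("white") and joins are shared edges between horizontally or vertically adjacent cells. The rook-adjacency rule, the B/W labels and the function name are assumptions of this example.

```python
def join_counts(grid):
    """Count BB, WW and BW joins in a binary lattice using rook adjacency.

    grid: list of equal-length rows containing 1 (black) or 0 (white).
    """
    rows = len(grid)
    cols = len(grid[0])
    counts = {"BB": 0, "WW": 0, "BW": 0}
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = grid[r][c], grid[rr][cc]
                    if a == 1 and b == 1:
                        counts["BB"] += 1
                    elif a == 0 and b == 0:
                        counts["WW"] += 1
                    else:
                        counts["BW"] += 1
    return counts


# A checkerboard has only unlike (BW) joins, a sign of negative autocorrelation.
print(join_counts([[1, 0], [0, 1]]))
# Two homogeneous rows produce like joins within rows and unlike joins between them.
print(join_counts([[1, 1], [0, 0]]))
```

In practice, the observed counts are compared with their expectation under a random arrangement of the same values.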
The analysis of spatio-temporal data is critical for understanding change in ecological systems. Spatio-temporal methods are the natural extensions of spatial statistics incorporating change over time. This chapter covers spatio-temporal approaches such as join counts, scan statistics, cluster and polygon change and the analysis of movement, cyclic phenomena and synchrony. In all these applications, we must consider and account for multi-dimensional autocorrelation in the data.
This chapter describes how to model multiple discrete quantities as discrete random variables within the same probability space and manipulate them using their joint pmf. We explain how to estimate the joint pmf from data, and use it to model precipitation in Oregon. Then, we introduce marginal distributions, which describe the individual behavior of each variable in a model, and conditional distributions, which describe the behavior of a variable when other variables are fixed. Next, we generalize the concepts of independence and conditional independence to random variables. In addition, we discuss the problem of causal inference, which seeks to identify causal relationships between variables. We then turn our attention to a fundamental challenge: It is impossible to completely characterize the dependence between all variables in a model, unless they are very few. This phenomenon, known as the curse of dimensionality, is the reason why independence assumptions are needed to make probabilistic models tractable. We conclude the chapter by describing two popular models based on such assumptions: Naive Bayes and Markov chains.
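The estimation of a joint pmf from data, and the marginal and conditional distributions derived from it, can be sketched with relative frequencies. The function names and the toy observations are assumptions of this example.

```python
from collections import Counter


def empirical_joint_pmf(pairs):
    """Estimate the joint pmf of two discrete variables by relative frequency."""
    counts = Counter(pairs)
    n = len(pairs)
    return {outcome: count / n for outcome, count in counts.items()}


def marginal_pmf(joint, axis):
    """Sum the joint pmf over the other variable (axis 0 = first, 1 = second)."""
    marginal = {}
    for outcome, prob in joint.items():
        marginal[outcome[axis]] = marginal.get(outcome[axis], 0.0) + prob
    return marginal


def conditional_pmf_of_first(joint, second_value):
    """pmf of the first variable given that the second equals `second_value`."""
    denom = sum(p for (_, b), p in joint.items() if b == second_value)
    return {a: p / denom for (a, b), p in joint.items() if b == second_value}


# Hypothetical paired observations of two binary variables.
observations = [(0, 0), (0, 0), (0, 1), (1, 1)]
joint = empirical_joint_pmf(observations)
print(joint)
print(marginal_pmf(joint, axis=0))
print(conditional_pmf_of_first(joint, second_value=1))
```

The number of entries in the joint pmf grows exponentially with the number of variables, which is the practical face of the curse of dimensionality discussed above.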
This chapter discusses how to build probabilistic models that include both discrete and continuous variables. Mathematically, this is achieved by defining them as random variables within the same probability space. In practice, the variables are manipulated using their marginal and conditional distributions. We define the conditional pmf of a discrete random variable given a continuous variable, and the conditional probability density of a continuous random variable given a discrete variable. We use these objects to build mixture models and apply them to model height in a population. Next, we describe Gaussian discriminant analysis, a classification method based on mixture models with Gaussian conditional distributions, and apply it to diagnose Alzheimer's disease. Then, we explain how to perform clustering using Gaussian mixture models and leverage the approach to cluster NBA players. Finally, we introduce the framework of Bayesian statistics, which enables us to explicitly encode our uncertainty about model parameters, and use it to analyze poll data from the 2020 United States presidential election.
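One common way to encode uncertainty about a proportion in a Bayesian analysis of poll data is a beta prior updated with binomial counts. The sketch below assumes this beta-binomial model, which may not be the one used in the chapter, and the poll numbers are invented for illustration.

```python
def beta_posterior(prior_a, prior_b, successes, trials):
    """Update a Beta(prior_a, prior_b) prior with binomial data.

    Returns the parameters of the posterior beta distribution.
    """
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical poll: 6 of 10 respondents favor a candidate, with a uniform Beta(1, 1) prior.
post_a, post_b = beta_posterior(1, 1, successes=6, trials=10)
print(post_a, post_b, beta_mean(post_a, post_b))
```

The posterior is a full distribution over the unknown proportion, so its spread, not only its mean, reflects how much the data have reduced the uncertainty.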