DISTRIBUTIONS ARE GENERALIZATIONS of mathematical functions from a purely technical standpoint. But perhaps it is most pertinent to begin by asking a more utilitarian question. Why should we study distributions? Specifically, why should we study probability distributions? One of the motivations stems from a practical limitation of experimental measurements that is underlined by the uncertainty principle postulated by Werner Heisenberg (see Figure 2.1). The very fabric of reality and the structure of scientific laws that govern our ability to understand physical phenomena demand a probabilistic (statistical) approach. Because we cannot make infinite-precision measurements, we must average over many measurements taken under similar conditions, which is a more reliable strategy for assigning experimental values to unknowns with reasonable accuracy.
In this chapter, we introduce probabilistic models of the mechanisms that generate data. Probabilistic models let us express scientific hypotheses with clear truth conditions, even when the mechanisms are inherently stochastic. A conditional probabilistic model describes how the conditional probability density of a response variable given a predictor depends on the predictor’s value. This dependence is controlled by a parameter vector, whose possible values form the model’s hypothesis space. Fitting the model means choosing a specific parameter vector based on data. One common approach is maximum likelihood, which selects parameters that make the observed data most probable. For many conditional models, maximising likelihood is equivalent to minimising the residual sum of squares.
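The equivalence between maximum likelihood and least squares can be checked numerically. The following is a minimal sketch, assuming a straight-line prediction function with Gaussian noise of fixed standard deviation; the data, the function names and the noise level are illustrative assumptions, not taken from the book.

```python
import numpy as np

def fit_least_squares(x, y):
    """Return (intercept, slope) minimising the residual sum of squares."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def rss(beta, x, y):
    """Residual sum of squares for the line beta[0] + beta[1] * x."""
    resid = y - (beta[0] + beta[1] * x)
    return float(resid @ resid)

def gaussian_log_likelihood(beta, sigma, x, y):
    """Log-likelihood of y given x under y ~ Normal(beta[0] + beta[1] * x, sigma^2)."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - rss(beta, x, y) / (2 * sigma**2)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=50)

beta_hat = fit_least_squares(x, y)

# For fixed sigma the log-likelihood is a constant minus RSS / (2 sigma^2),
# so any parameter vector with a larger RSS has a smaller likelihood.
sigma = 1.5
perturbed = beta_hat + np.array([0.1, -0.05])
print(gaussian_log_likelihood(beta_hat, sigma, x, y))
print(gaussian_log_likelihood(perturbed, sigma, x, y))
```

Because the only parameter-dependent term in the Gaussian log-likelihood is the residual sum of squares, the least-squares estimate is also the maximum likelihood estimate under these assumptions.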
This final chapter sketches ways to expand the inference toolkit introduced in the book. We explore more flexible prediction functions, including splines, generalised additive models and local regression. These methods improve expressivity while controlling complexity, helping avoid overfitting. We also show how differential equations – commonly used in scientific modelling – fit naturally into the probabilistic framework by defining parameterised function families. Inherently stochastic systems can be modelled using Markov processes, allowing inference via familiar likelihood-based methods. Finally, we discuss generative language models, focusing on the GPT architecture. GPT models define probability distributions over token sequences using autoregressive neural networks trained via cross-entropy loss. Though the underlying architecture is complex, the core modelling idea – predicting the next word given prior context – builds directly on probabilistic and machine learning principles developed throughout the book.
The advent of the internet and sensor technology has enabled humankind to collect, store, and share data in bulk. In turn, access to a variety of data has amplified a different kind of problem, which is to devise an appropriate strategy to derive meaning from data. Indeed, extracting information from data has acquired the highest priority among tasks performed by engineers and scientists alike. State-of-the-art machine learning algorithms are used to process and analyze data in order to leverage maximum gains in developing new technology and creating a new body of knowledge.
Further, the data-rich tech-universe has inherent complexity in addition to its sheer volume. This complexity arises because the data is often embedded in a high-dimensional space. For example, the data acquired by a camera mounted on a robot takes the form of multiple grayscale images (frames); each frame consists of a sequence of numbers representing the gray-level intensity of each pixel. If each image has a resolution of 100 × 100 pixels, then the image data is embedded in a 10000-dimensional space. Additionally, if the camera records 100 frames per second for one minute, then we have 6000 data points in a 10000-dimensional space. This is just an illustrative example of how a large, high-dimensional data set may be generated. Evidently, the 10000 dimensions do not all carry equal amounts of information. One of the most important techniques that we will learn in this chapter will allow us to extract a lower-dimensional representation of the data set that retains sufficient information for the robot to navigate and perform its tasks.
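The abstract does not name the technique, so the sketch below assumes principal component analysis (PCA) computed with a singular value decomposition, a common way to obtain such a lower-dimensional representation. The frame count, image size and the three hidden factors driving the synthetic frames are all illustrative assumptions, scaled down from the 6000 × 10000 example so that the code runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "camera" data: 200 frames of 20 x 20 pixels, flattened to 400 numbers each.
# Assumption: every frame is a mix of 3 hidden pixel patterns plus a little sensor noise.
n_frames, height, width, n_latent = 200, 20, 20, 3
n_pixels = height * width
latent = rng.normal(size=(n_frames, n_latent))
patterns = rng.normal(size=(n_latent, n_pixels))
frames = latent @ patterns + 0.01 * rng.normal(size=(n_frames, n_pixels))

# PCA: centre the data, then take the SVD of the centred matrix.
mean_frame = frames.mean(axis=0)
centered = frames - mean_frame
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of total variance captured by each principal direction.
explained = s**2 / np.sum(s**2)

# Keep the first k directions and project every frame onto them.
k = 3
scores = centered @ Vt[:k].T                 # shape (n_frames, k)
reconstructed = scores @ Vt[:k] + mean_frame  # back in pixel space

print(scores.shape)
print(explained[:k].sum())
```

In this synthetic setting nearly all of the variance sits in the first three directions, so each 400-number frame can be summarised by three scores with little loss. Real camera data would need more components, and the right number has to be judged from the explained-variance curve.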
This chapter provides an overview of the types of inference problems we address and the different approaches to solving them. We focus on risky inference: drawing conclusions, learning and making predictions in situations where certainty is impossible. Predicting a response from one or more predictors using past data is called supervised learning. When the response is continuous, the task is regression; when it is categorical, the task is classification. In unsupervised learning, there is no response variable. Instead, the goal is to find patterns or structure in data, as in density estimation, clustering and dimensionality reduction. In both supervised and unsupervised contexts, overfitting occurs when we model data in excessive detail and fail to distinguish systematic patterns from noise; underfitting occurs when our models are too simple to capture systematic patterns. Probability is a key tool for tackling risky inference, with frequentist and Bayesian interpretations motivating distinct approaches. Finally, large neural networks have proven remarkably effective in both supervised and unsupervised tasks, often avoiding overfitting despite containing billions of parameters.
This chapter introduces key ideas about probability, likelihood, and Bayesian inference. The likelihood of a hypothesis is the conditional probability of the data given the hypothesis. One way of using data to choose a hypothesis from a hypothesis space is to pick the hypothesis with the greatest likelihood; this is known as maximum likelihood inference. When used to choose between hypotheses that differ greatly in intrinsic plausibility, maximum likelihood inference is unreliable. Bayesian inference takes likelihoods into account but is also sensitive to the intrinsic plausibility of hypotheses.
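The contrast between the two approaches can be made concrete with a toy example. The sketch below assumes two hypotheses about a coin (fair, or two-headed), three observed heads in three tosses, and a prior that treats two-headed coins as rare; all of these numbers are illustrative assumptions rather than material from the book.

```python
def likelihood(p_heads, n_heads, n_tosses):
    """Probability of one specific sequence with n_heads heads in n_tosses tosses."""
    return p_heads**n_heads * (1 - p_heads) ** (n_tosses - n_heads)

# Hypothesis space: probability of heads under each hypothesis (assumed values).
hypotheses = {"fair": 0.5, "two-headed": 1.0}

# Prior plausibility of each hypothesis (assumed: two-headed coins are rare).
prior = {"fair": 0.999, "two-headed": 0.001}

# Observed data: three heads in three tosses.
n_heads, n_tosses = 3, 3

likelihoods = {h: likelihood(p, n_heads, n_tosses) for h, p in hypotheses.items()}

# Maximum likelihood inference: pick the hypothesis with the greatest likelihood.
ml_choice = max(likelihoods, key=likelihoods.get)

# Bayesian inference: posterior is proportional to likelihood times prior.
unnormalised = {h: likelihoods[h] * prior[h] for h in hypotheses}
evidence = sum(unnormalised.values())
posterior = {h: v / evidence for h, v in unnormalised.items()}
bayes_choice = max(posterior, key=posterior.get)

print(ml_choice, likelihoods)
print(bayes_choice, posterior)
```

Maximum likelihood selects the two-headed hypothesis because it makes the data certain, while the posterior still favours the fair coin: three heads are not enough evidence to overcome the large difference in prior plausibility.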
MARKOV CHAINS WERE first formulated as a stochastic model by Russian mathematician Andrei Andreevich Markov. Markov spent most of his professional career at St. Petersburg University and the Imperial Academy of Science. During this time, he specialized in the theory of numbers, mathematical analysis, and probability theory. His work on Markov chains utilized finite square matrices (stochastic matrices) to show that the two classical results of probability theory, namely, the weak law of large numbers and the central limit theorem, can be extended to the case of sums of dependent random variables. Markov chains have wide scientific and engineering applications in statistical mechanics, financial engineering, weather modeling, artificial intelligence, and so on. In this chapter, we will look at a few applications as we build the concepts of Markov chains. We will also implement a technique (using Markov chains) to solve a simple and practical engineering problem related to aircraft control and automation.
3.1 Chapter objectives
The chapter objectives are listed as follows.
1. Students will learn the definition and applications of Markov processes.
2. Students will learn the definition of the stochastic matrix (also known as the probability transition matrix) and perform simple matrix calculations to compute conditional probabilities.
3. Students will learn to solve engineering and scientific problems based on discrete time Markov chains (DTMCs) using multi-step transition probabilities.
4. Students will learn to compute return times and hitting times to Markov states.
5. Students will learn to classify different Markov states.
6. Students will learn to use the techniques of DTMCs introduced in this chapter to solve a complex engineering problem related to flight control operations.
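Objectives 2 to 4 can be illustrated with a short sketch. It assumes a hypothetical three-state weather chain (the states and transition probabilities are invented for illustration and do not come from the chapter), computes multi-step transition probabilities as powers of the stochastic matrix, and obtains mean hitting times by solving a linear system.

```python
import numpy as np

# Hypothetical 3-state chain: 0 = sunny, 1 = cloudy, 2 = rainy.
# Row i holds the probabilities of moving from state i to each state in one step.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

def n_step(P, n):
    """n-step transition matrix: entry (i, j) is P(X_n = j | X_0 = i)."""
    return np.linalg.matrix_power(P, n)

def mean_hitting_times(P, target):
    """Expected number of steps to first reach `target` from every other state.

    For i != target: h_i = 1 + sum over j != target of P[i, j] * h_j,
    which is the linear system (I - Q) h = 1 with Q the sub-matrix of P
    restricted to the non-target states.
    """
    others = [i for i in range(P.shape[0]) if i != target]
    Q = P[np.ix_(others, others)]
    h = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    return dict(zip(others, h))

# A stochastic matrix has non-negative entries and rows that sum to one.
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)

two_step = n_step(P, 2)
hitting = mean_hitting_times(P, target=2)

print(two_step[0, 2])  # probability of rain two days after a sunny day
print(hitting)         # mean number of days until the first rainy day
```

Each row of the n-step matrix is again a probability distribution, so the same conditional-probability reading applies at every horizon. The hitting-time system has a unique solution here because the target state can be reached from every other state.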
Aimed at practising biologists, especially graduate students and researchers in ecology, this revised and expanded 3rd edition continues to explore cause-effect relationships through a series of robust statistical methods. Every chapter has been updated, and two brand-new chapters cover statistical power, Akaike information criterion statistics and equivalent models, and piecewise structural equation modelling with implicit latent variables. A new R package (pwSEM) is included to assist with the latter. The book offers advanced coverage of essential topics, including d-separation tests and path analysis, and equips biologists with the tools needed to carry out analyses in the open-source R statistical environment. Writing in a conversational style that minimises technical jargon, Shipley offers an accessible text that assumes only a very basic knowledge of introductory statistics, incorporating real-world examples that allow readers to make connections between biological phenomena and the underlying statistical concepts.
Deductive languages afford many advantages in theory development. They ensure that different people with different biases can understand the logic; they ensure that the logic can be repeated, and they ensure that we can reason from empirical tests to the support or nonsupport for the theory. However, the deductive form also requires that the concepts used are precisely specified. A defining characteristic of such deductive arguments is that the premises enable us to reason to a conclusion that does not add any information beyond the premises. This can be compared to inductive arguments in which the conclusion amplifies or adds information to the premises and because of this does not provide the advantages of deduction.
Empirical tests are necessary for the advancement of theories. Clear theoretical definitions enable researchers to find or create instances of the abstract concepts. Empirical tests can be done using a variety of methodologies. Some empirical tests supply more information than others, depending on how many alternative interpretations of the results can be ruled out. Stronger tests are those that offer more precise predictions. Replications of tests are important for the advancement of theory. We differentiate between empirical replications, which use the exact same measurements to test a theory, and theoretical replications, which use differing operationalizations to test a theory. Both types of replication are important and rely on researchers making their reasoning and test materials publicly available.
Critical for any explanation or theory are well-formulated concepts with clear and unambiguous meaning. Definitions can ensure unambiguous meaning, and the kind of definition most useful for theories is the nominal definition. Nominal definitions have two parts: a definiendum, the term being defined, and a definiens, the other words that tell what it means. To be useful in explanations, a concept should be abstract, clearly defined, and embedded in principles that describe its behavior. Such concepts are the result of thought rather than simply of observations. Their meanings are not tied to any particular time or place. Their definitions include all and only the important elements of whatever phenomena they refer to.
While ownership of private goods enables an actor to exclude others from ownership, public goods are available to everyone. Because of this, there is a social dilemma associated with public goods: why would a person contribute to a public good if they can use it even if they don’t contribute? The traditional response is that some cost must be attached to not paying so that public goods such as community parks can be created and maintained. This response was a clear outcome of early economic theories. However, empirical anomalies emerged that did not support these earlier theories. Cooperation among theorists enabled the development of new theories of how social characteristics of group members could intervene and solve some social dilemmas.