In this chapter, we examine our first supervised learning problem, focusing on how to construct prediction functions and assess their performance. Given data consisting of predictor–response pairs, we can learn the parameters of a prediction function by minimising a loss, such as the residual sum of squares, which measures the discrepancy between actual and predicted responses. Using more flexible families of prediction functions typically reduces loss on the training data, but excessive flexibility can lead to overfitting: fitting to noise rather than the systematic component of the relationship. Overfitting results in poor prediction performance on new, unseen data. To estimate how a prediction method will perform on unseen data, we use cross-validation. However, when we compare many prediction methods using cross-validation, the best-performing method often appears better than it truly is; its apparent performance is an unreliable guide to its future accuracy. Prior knowledge is crucial for selecting plausible prediction methods to compare. Finally, we can use bootstrapping to quantify uncertainty in prediction functions and their predictions.
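The contrast between training loss and cross-validated loss can be sketched in a few lines. This is an illustration only, not the book's code: the synthetic data, the polynomial family, and the helper names `train_mse` and `kfold_mse` are assumptions chosen for brevity.

```python
import numpy as np

def train_mse(x, y, degree):
    """Mean squared residual on the same data used for fitting."""
    coeffs = np.polyfit(x, y, degree)  # least squares, i.e. minimises the RSS
    return float(np.mean((y - np.polyval(coeffs, x)) ** 2))

def kfold_mse(x, y, degree, k=5, seed=0):
    """k-fold cross-validation estimate of squared error on unseen data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((y[test] - np.polyval(coeffs, x[test])) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumed): a smooth signal plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=x.size)

degrees = [1, 3, 9]
train_scores = {d: train_mse(x, y, d) for d in degrees}
cv_scores = {d: kfold_mse(x, y, d) for d in degrees}
for d in degrees:
    print(f"degree {d}: train MSE {train_scores[d]:.3f}, CV MSE {cv_scores[d]:.3f}")
```

Because the polynomial families are nested, the training error cannot increase with the degree. The cross-validated error carries no such guarantee, which is why it is the more useful guide to performance on unseen data.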
In this chapter, we examine how to quantify uncertainty about model parameters, highlighting two main approaches: frequentist and Bayesian. We start by modelling a data-generating mechanism with a parametric family, where different parameter values correspond to different models. Assuming our model family can describe the mechanism, we use data to infer plausible parameters and quantify uncertainty. In frequentist inference, we build parameter estimators and study their sampling distributions across repeated data collection. Here, parameters are fixed unknown constants, and only estimators are treated probabilistically. In Bayesian inference, parameters are latent random variables. We express uncertainty through probability, combining prior beliefs about parameter values with observed data using Bayes’ rule to obtain a posterior distribution. The posterior and the frequentist sampling distribution often play similar roles and can resemble each other in practice. Computational tools like bootstrapping and Markov chain Monte Carlo help estimate sampling and posterior distributions, respectively.
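As a minimal sketch of how bootstrapping approximates a sampling distribution, the example below resamples a data set with replacement and compares the spread of the resampled means with the textbook standard-error formula. The exponential data and the function name `bootstrap_means` are assumptions made for illustration.

```python
import numpy as np

def bootstrap_means(data, n_boot=2000, seed=0):
    """Means of n_boot resamples drawn with replacement from data."""
    rng = np.random.default_rng(seed)
    n = len(data)
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of indices per resample
    return data[idx].mean(axis=1)

# Hypothetical observations from an unknown mechanism.
rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=200)

boot = bootstrap_means(data)
se_boot = float(boot.std(ddof=1))                            # bootstrap standard error
se_formula = float(data.std(ddof=1) / np.sqrt(len(data)))    # s / sqrt(n)
print(f"bootstrap SE {se_boot:.3f}, formula SE {se_formula:.3f}")
```

For the sample mean the two numbers should agree closely; the appeal of the bootstrap is that the same recipe applies to estimators with no convenient formula.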
This chapter introduces simple and multiple Bayesian linear regression models, in which parameters are treated as latent random variables. Thanks to their simplicity, these models yield closed-form posteriors. With flat priors, the posterior closely resembles the frequentist sampling distribution. We also explore the use of shrinkage priors to penalise model complexity and reduce overfitting. A Gaussian prior on the coefficients leads to ridge regression, where the MAP estimate corresponds to L2-regularised least squares. A Laplace prior yields lasso regression, based on L1 regularisation. Both are examples of regularisation techniques, but they behave differently: ridge regression shrinks all coefficients toward zero, while lasso tends to set some exactly to zero, producing a sparse model.
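The link between a Gaussian prior and L2-regularised least squares can be made concrete with the standard closed-form ridge solution. The sketch below assumes a zero-mean Gaussian prior with variance tau^2 on each coefficient and Gaussian noise with variance sigma^2, in which case the penalty weight is `lam = sigma**2 / tau**2`; the data and the name `ridge_fit` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y, the ridge / MAP estimate."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic regression problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=30)

beta_ols = ridge_fit(X, y, lam=0.0)     # flat prior: ordinary least squares
beta_ridge = ridge_fit(X, y, lam=10.0)  # tighter prior: coefficients shrink
print(np.round(beta_ols, 3))
print(np.round(beta_ridge, 3))
```

Increasing `lam` pulls every coefficient toward zero without, in general, making any of them exactly zero. The lasso has no closed form of this kind and is usually fitted iteratively.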
This chapter introduces probabilistic models for supervised learning tasks where the prediction target is categorical. In binary classification, the target takes two values; models output the conditional probability of one of these, given the predictors. Logistic regression expresses the log odds as a linear function of the predictors and is fitted by minimising (regularised) cross-entropy loss. Minimising unregularised cross-entropy is equivalent to maximising likelihood, but in linearly separable cases, a maximum likelihood solution may not exist. Regularisation ensures the problem is well posed and helps control overfitting. In multiclass classification, the target can take K > 2 values, and models output a K-dimensional probability vector. Multinomial logistic regression expresses a K-dimensional score vector as a linear function of the predictors and applies the softmax function to convert scores into probabilities. k-nearest neighbours (k-NN) is a non-parametric method that estimates class probabilities from nearby training points. In high-dimensional predictor spaces, parametric models like logistic regression often outperform non-parametric ones like k-NN.
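The multinomial logistic regression pipeline described above (linear scores, softmax, cross-entropy) can be sketched as follows. The weights, bias and data points are made-up values chosen so the arithmetic is easy to follow; nothing here is fitted.

```python
import numpy as np

def softmax(scores):
    """Convert each row of scores into a probability vector."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Average negative log-probability assigned to the true classes."""
    n = len(labels)
    return float(-np.mean(np.log(probs[np.arange(n), labels])))

# Hypothetical parameters: 2 predictors, K = 3 classes.
W = np.array([[1.0, -1.0, 0.0],
              [-1.0, 1.0, 0.0]])
b = np.array([0.0, 0.0, 0.5])

X = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
labels = np.array([0, 1, 2])

probs = softmax(X @ W + b)   # K-dimensional probability vector per row
loss = cross_entropy(probs, labels)
print(np.round(probs, 3), round(loss, 3))
```

Fitting the model would mean adjusting `W` and `b` to reduce `loss`, typically with an added penalty on `W` when regularisation is used.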
In this chapter, we introduce probabilistic models of the mechanisms that generate data. Probabilistic models let us express scientific hypotheses with clear truth conditions, even when the mechanisms are inherently stochastic. A conditional probabilistic model describes how the conditional probability density of a response variable given a predictor depends on the predictor’s value. This dependence is controlled by a parameter vector, whose possible values form the model’s hypothesis space. Fitting the model means choosing a specific parameter vector based on data. One common approach is maximum likelihood, which selects parameters that make the observed data most probable. For many conditional models, notably those with additive Gaussian noise of constant variance, maximising likelihood is equivalent to minimising the residual sum of squares.
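One case where the equivalence between maximum likelihood and least squares holds is a model with additive Gaussian noise of fixed variance: the negative log-likelihood is then a constant plus RSS / (2 sigma^2), so both criteria share the same minimiser. The sketch below checks this numerically on a grid for a one-parameter model y = beta * x + noise; the data, the grid and the function names are assumptions for illustration.

```python
import numpy as np

def rss(beta, x, y):
    """Residual sum of squares for the line through the origin y = beta * x."""
    return float(np.sum((y - beta * x) ** 2))

def gaussian_nll(beta, x, y, sigma):
    """Negative log-likelihood under y_i ~ Normal(beta * x_i, sigma^2)."""
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + rss(beta, x, y) / (2 * sigma**2)

# Synthetic data (assumed).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

grid = np.linspace(0.0, 4.0, 401)  # candidate parameter values
beta_rss = float(grid[np.argmin([rss(b, x, y) for b in grid])])
beta_mle = float(grid[np.argmin([gaussian_nll(b, x, y, sigma=0.5) for b in grid])])
beta_closed = float(np.sum(x * y) / np.sum(x * x))  # exact least-squares solution
print(beta_rss, beta_mle, round(beta_closed, 4))
```

With a different noise model, for example Laplace noise, the likelihood would favour a different loss, so the equivalence is a property of the Gaussian assumption rather than of maximum likelihood in general.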
This final chapter sketches ways to expand the inference toolkit introduced in the book. We explore more flexible prediction functions, including splines, generalised additive models and local regression. These methods improve expressivity while controlling complexity, helping avoid overfitting. We also show how differential equations – commonly used in scientific modelling – fit naturally into the probabilistic framework by defining parameterised function families. Inherently stochastic systems can be modelled using Markov processes, allowing inference via familiar likelihood-based methods. Finally, we discuss generative language models, focusing on the GPT architecture. GPT models define probability distributions over token sequences using autoregressive neural networks trained via cross-entropy loss. Though the underlying architecture is complex, the core modelling idea – predicting the next word given prior context – builds directly on probabilistic and machine learning principles developed throughout the book.
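To illustrate likelihood-based inference for a Markov process, the sketch below fits a discrete-time, finite-state Markov chain by maximum likelihood, which reduces to normalising transition counts. The observed sequence and the function names are hypothetical, and the chapter itself may treat richer processes than this.

```python
import numpy as np

def fit_transition_matrix(states, n_states):
    """Maximum likelihood transition matrix: row-normalised transition counts.

    Assumes every state has at least one observed outgoing transition;
    otherwise a row of counts is all zero and the division is undefined.
    """
    counts = np.zeros((n_states, n_states))
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(states, P):
    """Log-probability of the observed transitions, conditional on the first state."""
    return float(sum(np.log(P[a, b]) for a, b in zip(states[:-1], states[1:])))

states = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]  # hypothetical observed sequence
P_hat = fit_transition_matrix(states, n_states=2)
print(P_hat)
print(log_likelihood(states, P_hat))
```

The same log-likelihood sum over successive conditional probabilities underlies autoregressive language models, where the states are tokens and the conditioning context is the whole preceding sequence rather than a single previous state.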
This chapter provides an overview of the types of inference problems we address and the different approaches to solving them. We focus on risky inference: drawing conclusions, learning and making predictions in situations where certainty is impossible. Predicting a response from one or more predictors using past data is called supervised learning. When the response is continuous, the task is regression; when it is categorical, the task is classification. In unsupervised learning, there is no response variable. Instead, the goal is to find patterns or structure in data, as in density estimation, clustering and dimensionality reduction. In both supervised and unsupervised contexts, overfitting occurs when we model data in excessive detail and fail to distinguish systematic patterns from noise; underfitting occurs when our models are too simple to capture systematic patterns. Probability is a key tool for tackling risky inference, with frequentist and Bayesian interpretations motivating distinct approaches. Finally, large neural networks have proven remarkably effective in both supervised and unsupervised tasks, often avoiding overfitting despite containing billions of parameters.
This chapter introduces key ideas about probability, likelihood, and Bayesian inference. The likelihood of a hypothesis is the conditional probability of the data given the hypothesis. One way of using data to choose a hypothesis from a hypothesis space is to pick the hypothesis with the greatest likelihood; this is known as maximum likelihood inference. When used to choose between hypotheses that differ greatly in intrinsic plausibility, maximum likelihood inference is unreliable. Bayesian inference takes likelihoods into account but is also sensitive to the intrinsic plausibility of hypotheses.
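A small worked example shows how maximum likelihood and Bayesian inference can disagree when hypotheses differ in prior plausibility. The coin-tossing setup, the two hypotheses and the prior probabilities below are invented for illustration.

```python
from math import comb

def binomial_likelihood(p_heads, heads, tosses):
    """Probability of observing `heads` heads in `tosses` tosses."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads) ** (tosses - heads)

# Two hypotheses about a coin; most coins are assumed fair a priori.
hypotheses = {"fair": 0.5, "biased": 0.9}
prior = {"fair": 0.99, "biased": 0.01}
heads, tosses = 4, 5  # hypothetical data

likelihood = {h: binomial_likelihood(p, heads, tosses) for h, p in hypotheses.items()}
ml_choice = max(likelihood, key=likelihood.get)  # maximum likelihood inference

# Bayes' rule: posterior is proportional to prior times likelihood.
evidence = sum(prior[h] * likelihood[h] for h in hypotheses)
posterior = {h: prior[h] * likelihood[h] / evidence for h in hypotheses}

print("maximum likelihood picks:", ml_choice)
print("posterior:", {h: round(v, 4) for h, v in posterior.items()})
```

Four heads in five tosses is more probable under the biased hypothesis, so maximum likelihood selects it, yet the posterior still favours the fair coin because the data are too weak to overcome the prior.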
This book presents a modern introduction to the field of algorithmic game theory. It places a heavy emphasis on optimization and online learning (a subdiscipline of machine learning), which are tools that increasingly play a central role in both the theory and practice of applying game-theoretic ideas. The book covers the core techniques used in several highly successful applications, including techniques used for creating superhuman poker AIs, the theory behind the 'pacing' methodology that has become standard in the internet advertising industry, and the application of competitive equilibrium from equal incomes for fair course seat allocation in many business schools. With its focus on online learning tools, this book is an ideal companion to classic texts on algorithmic game theory for graduate students and researchers.
Most regression methods estimate the mean of Y given X. But it can also be useful to estimate the quantiles of Y given X. This provides more information about the relationship between X and Y.
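Quantile estimates are commonly obtained by minimising the pinball (check) loss in place of squared error. The sketch below is a simplified illustration that ignores X and searches for the best constant prediction, which recovers an ordinary sample quantile; replacing the constant with a function of X would give a quantile regression. The data and function names are assumptions.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Average pinball loss of predicting q for the tau-th quantile of y."""
    diff = y - q
    # Under-predictions cost tau per unit, over-predictions cost (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def best_constant(y, tau):
    """Observed value that minimises the pinball loss (brute-force search)."""
    candidates = np.unique(y)
    losses = [pinball_loss(y, q, tau) for q in candidates]
    return float(candidates[int(np.argmin(losses))])

y = np.arange(0.0, 101.0)  # the values 0, 1, ..., 100
q90 = best_constant(y, tau=0.9)
q50 = best_constant(y, tau=0.5)
print(q90, q50)
```

The asymmetry of the loss is what moves the minimiser away from the centre: with `tau=0.9`, under-predicting is nine times as costly as over-predicting, so the best prediction sits high in the distribution.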
In Chapter 3, we achieved speedup by reducing the precision used throughout the model using low-precision techniques. In this chapter, we introduce pruning, a method that transforms the model at a finer granularity to accelerate computation.
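As one common instance of pruning, the sketch below applies unstructured magnitude pruning to a weight matrix: the smallest-magnitude entries are set to zero. The source does not say which pruning criterion the chapter uses, so the magnitude rule, the function name `magnitude_prune` and the example weights are all assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of smallest-magnitude weights.

    Returns the pruned weights and the boolean mask of kept entries.
    Ties at the threshold are pruned as well.
    """
    k = int(sparsity * weights.size)  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Hypothetical layer weights.
W = np.array([[0.1, -2.0, 0.03],
              [1.5, -0.2, 0.7]])
W_pruned, mask = magnitude_prune(W, sparsity=0.5)
print(W_pruned)
print("kept:", int(mask.sum()), "of", W.size)
```

Zeroing weights only speeds up computation when the runtime or hardware can exploit the resulting sparsity, which is one reason structured variants that remove whole rows, channels or blocks are also used.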
In this book, various techniques for accelerating and reducing the cost of deep neural networks have been introduced. The fundamental strategies can be categorized into two approaches: compressing the redundancy that develops after training, and employing architectures that inherently suppress redundancy from the beginning.