Aimed at undergraduate students, this text guides readers through the methods and principles of machine learning in an approachable manner without sacrificing mathematical precision or notation. Step-by-step explanations allow students to grasp complicated mathematical calculations and translate the theory and mathematics into programming and applications. The text presents machine learning concepts visually, using illustrations and example datasets about fictional hippopotamuses to explain the material in a distinctive, engaging and easily understood way. Worked examples connect the mathematics and algorithms to real-world applications and enable students to apply this technology in new and ever-changing circumstances. Topics covered include Bayesian reasoning, linear regression and classification, margin-based classification, cross-validation, neural networks, decision trees, clustering and dimensionality reduction. End-of-chapter mathematical exercises and additional coding projects reinforce application and decision-making skills.
This gentle introduction to the most important techniques in natural language processing uses a unified mathematical and algorithmic framework and gradually increases in complexity. Topics covered range from n-gram language models to large language models (LLMs), from perceptron to deep learning, from text classification to structured prediction (e.g., sequence labelling, segmentation, and parsing) and generation, and from discrete representation to neural representation of linguistic structures. This book provides a comprehensive overview of NLP, making it ideal for upper undergraduate and graduate students in computer science and a valuable reference for researchers and engineers. Exercises of varying difficulty are provided, as well as teaching slides and tutorial videos. The new edition features three new chapters on pre-trained language models and large language models, as well as a new preliminary chapter that gives an overview of data and models as a framework for NLP methods.
Kernel methods, with origins in the pioneering work of Mercer (1909), Bochner (1933), and Aronszajn (1950), have become central tools in modern mathematics and machine learning. This book explores their deep connections with approximation theory, highlighting both classical results and cutting-edge developments. Through clear explanations and illustrative examples, it guides readers from foundational concepts to contemporary applications, including computational methods and real-world problem solving. By bridging theory and practice, the text not only provides a rigorous understanding of kernels but also inspires further exploration and research. Suitable for students, researchers, and practitioners, it invites readers to engage with ongoing advances in this dynamic field and to contribute to its future growth.
This book offers a comprehensive introduction to the fundamentals of Markov decision processes and reinforcement learning, using common mathematical notation and language. Its goal is to provide a solid foundation that enables readers to engage meaningfully with these rapidly evolving fields. Topics covered include finite and infinite horizon models, partially observable models, value function approximation, simulation-based methods, Monte Carlo methods, and Q-learning. Rigorous mathematical concepts and algorithmic developments are supported by numerous worked examples. As an up-to-date successor to Martin L. Puterman's influential 1994 textbook, this volume assumes familiarity with probability, mathematical notation, and proof techniques. It is ideally suited for students, researchers, and professionals in operations research, computer science, engineering, and economics.
Bridging the gap between introductory texts and the specialized research literature, this is one of the first truly rigorous yet accessible treatments of modern reinforcement learning. Written by three leading researchers with over a decade of teaching experience, the book uniquely combines mathematical precision with practical insights. It progresses naturally from planning (dynamic programming, MDPs, value and policy iteration) to learning (model-based and model-free algorithms, function approximation, policy gradients, and regret minimization). Each concept is developed from first principles with complete proofs, making the material self-contained. The modular chapter organization enables flexible course design. The book's website offers battle-tested exercises refined through years of classroom use. Combining mathematical rigor with practical applications, this definitive text is ideal for advanced undergraduate and graduate students as well as practitioners seeking a deep understanding of sequential decision-making and intelligent agent design.
Originating from lectures delivered at the African Institute for Mathematical Sciences, this book presents a unifying perspective on traditional and modern methods in generative AI and stochastic thermodynamics. By relating the core topics in machine learning to the notion of (variational) free energy, it builds a bridge between methods such as latent variable models, variational auto-encoders, optimal control, optimal transport, normalizing flows and diffusion models, and concepts such as entropy production and fluctuation theorems in stochastic thermodynamics. Structured into three main parts, the book begins by setting up the mathematical and statistical physics preliminaries needed to make it broadly accessible. The largest part of the book then focuses on building intuition for major advances in generative AI by considering discrete-time processes and their relationship to topics in stochastic thermodynamics. Finally, the authors take a short excursion into the continuous-time domain for the more advanced learner.
Based on courses taught at the University of Cambridge, this text presents core contemporary statistical methods and theory in an accessible, self-contained and rigorous fashion, with a focus on finite-sample guarantees as opposed to asymptotic arguments. Many of the topics and results have not appeared in book form previously, and some constitute new research. The prerequisites are relatively light (primarily a good grasp of linear algebra and real analysis) and complete solutions to all 250+ exercises are available online. It is the perfect entry point to the subject for master's and graduate-level students in statistics, data science and machine learning, as well as related disciplines such as artificial intelligence, signal processing, information theory, electrical engineering and econometrics. Researchers in these fields will also find it an invaluable resource. This title is also available as Open Access on Cambridge Core.
This collection of articles and interviews surveys human-centered approaches to machine learning that can make AI more human-friendly, usable, and ethical. It provides a handbook for students, researchers, and practitioners who want new ways of approaching AI that place humanity at their center. It shows how to apply methods from human-computer interaction to the new technologies of AI and ML with a view to enabling computing technology to become user-friendly and human-centric. The book has 13 articles and 9 interviews from a range of different perspectives, helping readers understand existing machine learning systems and their impacts on people and society. It is an ideal introduction both for human-computer interaction practitioners who are interested in working with ML and for ML experts interested in making their practice more human-centered. The book offers a critical lens on existing machine learning alongside an optimistic vision of AI in the service of humanity.
This chapter explores three kinds of unsupervised task: clustering, density estimation and dimensionality reduction. Cluster analysis aims to group similar observations together. The K-means algorithm does this by repeatedly reassigning each point to the nearest cluster centre, reducing or maintaining the clustering inertia at each step. Density estimation involves learning a probabilistic model of a data-generating process. Gaussian mixture models represent the distribution as a weighted sum of multivariate normal components. The EM algorithm fits these models by alternating between assigning each component a responsibility for each point and updating component locations using responsibility-weighted averages. Cross-entropy measures how well an estimated density approximates the true one and is minimised when the two match. Dimensionality reduction compresses data into a lower-dimensional latent space via an encoder, with a decoder reconstructing the original data. Principal component analysis uses linear encoder–decoder pairs to minimise reconstruction error, offering a simple yet powerful form of dimensionality reduction.
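The K-means behaviour described above can be illustrated with a minimal sketch. It is not taken from the chapter: the function names, the synthetic two-blob dataset and the choice of k = 2 are assumptions made for illustration. The sketch records the inertia (the sum of squared distances from each point to its assigned centre) after every iteration, so the "reducing or maintaining" property can be checked directly.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))

def inertia(points, centres, labels):
    """Sum of squared distances from each point to its assigned centre."""
    return sum(sq_dist(p, centres[k]) for p, k in zip(points, labels))

def k_means(points, k, n_iter=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumed initialisation: k random data points
    history = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members
        # (a centre with no members is left where it is).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = centroid(members)
        history.append(inertia(points, centres, labels))
    return centres, labels, history

# Hypothetical data: two Gaussian blobs in the plane.
rng = random.Random(1)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
          + [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(50)])
centres, labels, history = k_means(points, k=2)
print(history[:5])
```

Each recorded inertia is no larger than the one before it, up to floating-point rounding, because neither the assignment step nor the update step can increase it.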
This chapter introduces neural networks as flexible function approximators built by composing layers of simple processing units. A network with no hidden layers performs linear regression if its output layer is linear and logistic regression if its output layer uses softmax. Hidden layers increase expressivity: a network with one hidden layer and ReLU activations can approximate any continuous function on a closed and bounded input domain, though complex functions may require many units. Deep networks, with multiple hidden layers, are more efficient and scalable than shallow ones, especially for learning hierarchical structure. Neural networks are trained using gradient-based optimisation, with gradients computed via backpropagation. Training adjusts weights to minimise a loss function, using small batches of data. Techniques like early stopping and small batches act as implicit regularisers, while weight decay provides explicit regularisation. Convolutional neural networks use convolution and pooling layers to exploit spatial structure in image data. More broadly, architectural choices often reflect domain-specific assumptions.
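The approximation claim for one-hidden-layer ReLU networks can be made concrete with a small hand-built example. This is a sketch under assumptions, not the chapter's construction: the weights are set by hand rather than trained, the target function (sine on [0, 2π]) is chosen arbitrarily, and all names are hypothetical. Each hidden unit computes relu(x - knot), and the output layer adds the units up so that the network interpolates the target linearly between knots.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, lo, hi, n_units):
    """One-hidden-layer ReLU net interpolating f at n_units + 1 equally spaced points.

    Returns (output_bias, knots, output_weights); hidden unit i computes
    relu(x - knots[i]).
    """
    xs = [lo + (hi - lo) * i / n_units for i in range(n_units + 1)]
    ys = [f(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n_units)]
    # Each output weight is the change in slope at the corresponding knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    return ys[0], xs[:-1], weights

def predict(net, x):
    bias, knots, weights = net
    return bias + sum(w * relu(x - k) for k, w in zip(knots, weights))

def max_error(f, net, lo, hi, n_grid=1000):
    """Largest absolute error on an evenly spaced evaluation grid."""
    grid = (lo + (hi - lo) * i / n_grid for i in range(n_grid + 1))
    return max(abs(f(x) - predict(net, x)) for x in grid)

lo, hi = 0.0, 2 * math.pi
errors = {n: max_error(math.sin, build_relu_net(math.sin, lo, hi, n), lo, hi)
          for n in (4, 8, 16, 32)}
for n, err in errors.items():
    print(n, round(err, 4))
```

The error shrinks as hidden units are added, which matches the remark that complex functions may require many units.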
This chapter introduces directed acyclic graphs (DAGs) as a way to represent multivariate probability distributions. DAGs help clarify the structure of probabilistic models and the dependencies among their variables and serve as a central tool in later chapters. Every DAG corresponds to a specific factorisation of a joint mass or density function into a product of conditional distributions. While a DAG encodes how the distribution breaks down into conditionals, it does not fully determine the distribution itself. Instead, it implies certain dependency constraints among variables. These constraints can be examined using the concept of d-separation, which allows us to infer conditional independence relationships directly from the graph.
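As a small illustration of the factorisation and of d-separation, consider the chain A → B → C over binary variables. The probability tables below are made-up numbers, not taken from the chapter. The joint mass function is the product p(a) p(b | a) p(c | b), and because B blocks the only path from A to C, the graph implies that A and C are conditionally independent given B, even though they are marginally dependent.

```python
from itertools import product

# Hypothetical conditional probability tables for the chain A -> B -> C.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}

def joint(a, b, c):
    """Joint mass function implied by the DAG's factorisation."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The factorised joint is a valid distribution: it sums to one.
total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))

def p_c_given_ab(c, a, b):
    """p(c | a, b), computed from the joint."""
    return joint(a, b, c) / sum(joint(a, b, cc) for cc in (0, 1))

def p_c_given_a(c, a):
    """p(c | a), computed from the joint by summing out b."""
    return sum(joint(a, b, c) for b in (0, 1)) / p_a[a]

# Conditional independence: once b is known, p(c | a, b) does not vary with a.
conditional_gap = max(abs(p_c_given_ab(c, 0, b) - p_c_given_ab(c, 1, b))
                      for b in (0, 1) for c in (0, 1))

# Marginal dependence: without conditioning on b, p(c | a) does vary with a.
marginal_gap = abs(p_c_given_a(1, 0) - p_c_given_a(1, 1))

print(total, conditional_gap, marginal_gap)
```

Changing the numbers in the tables changes the distribution but not the conditional independence, which is the sense in which the DAG constrains the distribution without fully determining it.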
This chapter introduces key concepts and methods in Bayesian statistical modelling. The posterior predictive distribution captures both epistemic uncertainty in model parameters and aleatory uncertainty in future outcomes. A Bayesian p-value gives the probability that a statistic computed from data output by a given model will be more extreme than the value of the same statistic computed from observed data. Bayesian p-values close to 0 or 1 suggest the model may be inadequate. Markov chain Monte Carlo is a general-purpose tool for sampling from complex, unnormalised distributions. It produces dependent samples, so the effective sample size is usually smaller than the number of iterations. Informative priors are useful when data leave large uncertainties in parameter values. Empirical Bayes combines information across related datasets by estimating a distribution over parameters using frequentist methods. Hierarchical modelling provides a unified Bayesian framework for handling multiple related datasets, capturing group structure via a hierarchical graph.
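A Bayesian p-value can be estimated by simulation, as in the following sketch. The setup is an assumption for illustration only: the data are 20 binary outcomes modelled as independent Bernoulli draws with a uniform Beta(1, 1) prior, so the posterior is a Beta distribution, and "more extreme" is taken to mean "at least as large as the observed statistic". Function names are hypothetical. Two statistics are checked: the total count, which the model should reproduce well, and the number of switches between consecutive outcomes, which it should not when the data are streaky.

```python
import random

def n_switches(seq):
    """Number of positions where consecutive outcomes differ."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def bayesian_p_value(observed, statistic, n_sims=2000, seed=0):
    """Estimate P(statistic(replicate) >= statistic(observed)) under the
    posterior predictive of a Bernoulli model with a Beta(1, 1) prior."""
    rng = random.Random(seed)
    n, k = len(observed), sum(observed)
    t_obs = statistic(observed)
    at_least = 0
    for _ in range(n_sims):
        # Epistemic uncertainty: draw the parameter from its posterior.
        theta = rng.betavariate(1 + k, 1 + n - k)
        # Aleatory uncertainty: simulate a replicate dataset given the parameter.
        replicate = [1 if rng.random() < theta else 0 for _ in range(n)]
        if statistic(replicate) >= t_obs:
            at_least += 1
    return at_least / n_sims

# Hypothetical streaky data: ten zeros followed by ten ones.
observed = [0] * 10 + [1] * 10
p_count = bayesian_p_value(observed, sum)
p_switches = bayesian_p_value(observed, n_switches)
print(p_count, p_switches)
```

The p-value for the count is unremarkable, while the p-value for the number of switches is close to 1, suggesting that the independence assumption is inadequate for these data.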
In this chapter, we explore an unsupervised learning problem: estimating a distribution function from two-dimensional data. Although there is no response variable, the workflow mirrors that of supervised learning. We select the best-fitting function within a family by maximising the sum of the log of the distribution's values at the observed data points. As in supervised learning, excessive flexibility leads to overfitting, while insufficient flexibility leads to underfitting. We use cross-validation to identify a function family that achieves a happy medium.
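The workflow can be sketched with a stand-in family of functions. The chapter's own family is not specified here, so the sketch assumes Gaussian kernel density estimates indexed by a bandwidth h, with synthetic two-dimensional data. A small h gives a very flexible estimate and a large h a very rigid one. The training score is the mean log density at the points used for fitting, and the cross-validated score is the mean log density at held-out points.

```python
import math
import random

def log_kde(x, train, h):
    """Log of a 2-D Gaussian kernel density estimate at x, with bandwidth h."""
    log_terms = [-((x[0] - t[0]) ** 2 + (x[1] - t[1]) ** 2) / (2 * h * h)
                 for t in train]
    m = max(log_terms)  # log-sum-exp trick to avoid underflow for small h
    lse = m + math.log(sum(math.exp(v - m) for v in log_terms))
    return lse - math.log(len(train)) - math.log(2 * math.pi * h * h)

def mean_log_lik(points, train, h):
    return sum(log_kde(p, train, h) for p in points) / len(points)

def cv_score(data, h, n_folds=5):
    """Average held-out mean log-likelihood over n_folds folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mean_log_lik(held_out, train, h))
    return sum(scores) / n_folds

# Hypothetical data: 200 draws from a standard bivariate normal.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

bandwidths = [0.01, 0.5, 5.0]
train_scores = {h: mean_log_lik(data, data, h) for h in bandwidths}
cv_scores = {h: cv_score(data, h) for h in bandwidths}
best_h = max(cv_scores, key=cv_scores.get)
print(train_scores, cv_scores, best_h)
```

On the training data the smallest bandwidth scores best, which is the overfitting described above. On held-out data the intermediate bandwidth wins.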
This chapter introduces simple and multiple linear regression models – core tools in predictive modelling due to their simplicity and interpretability. These models assume the response variable is a linear function of the predictor(s), plus a noise term. The regression function gives the expected response given the predictors. The coefficient of determination, R², measures how much of the variance in the response is explained by the model. In simple linear regression, R² equals the square of the Pearson correlation between response and predictor; in multiple regression, it equals the square of the correlation between response and predicted values. Each coefficient in multiple regression reflects the expected change in the response for a one-unit increase in that predictor, holding others fixed. Standardising predictors lets us compare coefficient sizes. Strong collinearity between predictors increases uncertainty in the fitted coefficients. Models using only a subset of predictors may generalise better than models using all of them, which can overfit. The squared error risk of a modelling procedure – its expected test error – can be broken down into bias, variance and irreducible noise.
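The stated identity for simple linear regression can be checked numerically. The sketch below fits a least-squares line to synthetic data and compares the coefficient of determination with the squared Pearson correlation. The data-generating line, noise level and function names are made up for illustration.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def fit_simple_regression(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(x, y, slope, intercept):
    """1 - (residual sum of squares) / (total sum of squares)."""
    my = mean(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a linear signal plus Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [2.0 + 0.5 * a + rng.gauss(0, 1) for a in x]

slope, intercept = fit_simple_regression(x, y)
r2 = r_squared(x, y, slope, intercept)
corr = pearson(x, y)
print(r2, corr ** 2)
```

The two printed values agree up to floating-point rounding, since the identity holds exactly for a least-squares fit with an intercept.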