To save content items to your account,
please confirm that you agree to abide by our usage policies.
If this is the first time you use this feature, you will be asked to authorise Cambridge Core to connect with your account.
Find out more about saving content to .
To save content items to your Kindle, first ensure no-reply@cambridge.org
is added to your Approved Personal Document E-mail List under your Personal Document Settings
on the Manage Your Content and Devices page of your Amazon account. Then enter the ‘name’ part
of your Kindle email address below.
Find out more about saving to your Kindle.
Note you can select to save to either the @free.kindle.com or @kindle.com variations.
‘@free.kindle.com’ emails are free but can only be saved to your device when it is connected to wi-fi.
‘@kindle.com’ emails can be delivered even when you are not connected to wi-fi, but note that service fees apply.
In the second part of this book (chapters 6–9), we treat vision as a machine learning problem and disregard everything we know about the creation of the image. For example, we will not exploit our understanding of perspective projection or light transport. Instead, we treat vision as pattern recognition; we interpret new image data based on prior experience of images in which the contents were known. We divide this process into two parts: in learning we model the relationship between the image data and the scene content. In inference, we exploit this relationship to predict the contents of new images.
To abandon useful knowledge about image creation may seem odd, but the logic is twofold. First, these same learning and inference techniques will also underpin our algorithms when image formation is taken into account. Second, it is possible to achieve a great deal with a pure learning approach to vision. For many tasks, knowledge of the image formation process is genuinely unnecessary.
The structure of Part II is as follows. In Chapter 6 we present a taxonomy of models that relate the measured image data and the actual scene content. In particular, we distinguish between generative models and discriminative models. For generative models, we build a probability model of the data and parameterize it by the scene content. For discriminative models, we build a probability model of the scene content and parameterize it by the data.
This chapter introduces the pinhole or projective camera. This is a purely geometric model that describes the process whereby points in the world are projected into the image. Clearly, the position in the image depends on the position in the world, and the pinhole camera model captures this relationship.
To motivate this model, we will consider the problem of sparse stereo reconstruction (Figure 14.1). We are given two images of a rigid object taken from different positions. Let us assume that we can identify corresponding 2D features between the two images – points that are projected versions of the same position in the 3D world. The goal now is to establish this 3D position using the observed 2D feature points. The resulting 3D information could be used by a robot to help it navigate through the scene, or to facilitate object recognition.
The pinhole camera
In real life, a pinhole camera consists of a chamber with a small hole (the pinhole) in the front (Figure 14.2). Rays from an object in the world pass through this hole to form an inverted image on the back face of the box, or image plane. Our goal is to build a mathematical model of this process.
It is slightly inconvenient that the image from the pinhole camera is upside-down. Hence, we instead consider the virtual image that would result from placing the image plane in front of the pinhole.
This is a brief guide to the notational conventions used in this text.
Scalars, vectors, and matrices
We denote scalars by either small or capital letters α, A α. We denote column vectors by bold small letters a, ϕ. When we need a row vector we usually present this as the transpose of a column vector aT, ϕT.
We represent matrices by bold capital letters B, Φ. The ith row and jth column of matrix A is written as aij. The jth column of matrix A is written as aj. When we need to refer to the ith row of a matrix, we write this as ai • where the bullet • indicates that we are considering all possible values of the column index.
We concatenate two D × 1 column vectors horizontally as a = [b, c] to form the D × 2 matrix A. We concatenate two D × 1 column vectors vertically as a = [bT,cT]T to form the 2D × 1 vector a. Although this notation is cumbersome, it allows us to represent vertical concatenations within a single line of text.
Variables and parameters
We denote variables with Roman letters a, b. The most common examples are the observed data which is always denoted by x and the state of the world which is always denoted by w. However, other hidden or latent variables are also represented by Roman letters. We denote parameters of the model by Greek letters μ, Φ, σ2. These are distinguished from variables in that there is usually a single set of parameters that explains the relation between many sets of variables.
This chapter concerns discriminative models for classification. The goal is to directly model the posterior probability distribution Pr(ω|x) over a discrete world state ω ϵ {1,…K} given the continuous observed data vector x. Models for classification are very closely related to those for regression and the reader should be familiar with the contents of Chapter 8 before proceeding.
To motivate the models in this chapter, we will consider gender classification: here we observe a 60 × 60 RGB image containing a face (Figure 9.1) and concatenate the RGB values to form the 10800×1 vector x. Our goal is to take the vector x and return the probability distribution Pr(ω|x) over a label ω ϵ {0,1} indicating whether the face is male (ω = 0) or female (ω = 1).
Gender classification is a binary classification task as there are only two possible values of the world state. Throughout most of this chapter, we will restrict our discussion to binary classification. We discuss how to extend these models to cope with an arbitrary number of classes in Section 9.9.
Logistic regression
We will start by considering logistic regression, which despite its name is a model that can be applied to classification. Logistic regression (Figure 9.2) is a discriminative model; we select a probability distribution over the world state ω ϵ {0,1} and make its parameters contingent on the observed data x.
At an abstract level, the goal of computer vision problems is to use the observed image data to infer something about the world. For example, we might observe adjacent frames of a video sequence and infer the camera motion, or we might observe a facial image and infer the identity.
The aim of this chapter is to describe a mathematical framework for solving this type of problem and to organize the resulting models into useful subgroups, which will be explored in subsequent chapters.
Computer vision problems
In vision problems, we take visual data x and use them to infer the state of the world w. The world state w may be continuous (the 3D pose of a body model) or discrete (the presence or absence of a particular object). When the state is continuous, we call this inference process regression. When the state is discrete, we call it classification.
Unfortunately, the measurements x may be compatible with more than one world state w. The measurement process is noisy, and there is inherent ambiguity in visual data: a lump of coal viewed under bright light may produce the same luminance measurements as white paper in dim light. Similarly, a small object seen close-up may produce the same image as a larger object that is further away.
In the face of such ambiguity, the best that we can do is to return the posterior probability distribution Pr(w|x) over possible states w.
The first part of this book (Chapters 2–5) is devoted to a brief review of probability and probability distributions. Almost all models for computer vision can be interpreted in a probabilistic context, and in this book we will present all the material in this light. The probabilistic interpretation may initially seem confusing, but it has a great advantage: it provides a common notation that will be used throughout the book and will elucidate relationships between different models that would otherwise remain opaque.
So why is probability a suitable language to describe computer vision problems? In a camera, the three-dimensional world is projected onto the optical surface to form the image: a two-dimensional set of measurements. Our goal is to take these measurements and use them to establish the properties of the world that created them. However, there are two problems. First, the measurement process is noisy; what we observe is not the amount of light that fell on the sensor, but a noisy estimate of this quantity. We must describe the noise in these data, and for this we use probability. Second, the relationship between world and measurements is generally many to one: there may be many real-world configurations that are compatible with the same measurements. The chance that each of these possible worlds is present can also be described using probability.
The structure of Part I is as follows: in Chapter 2, we introduce the basic rules for manipulating probability distributions including the ideas of conditional and marginal probability and Bayes' rule.
The previous chapters discussed models that relate the observed measurements to some aspect of the world that we wish to estimate. In each case, this relationship depended on a set of parameters and for each model we presented a learning algorithm that estimated these parameters.
Unfortunately, the utility of these models is limited because every element of the model depends on every other. For example, in generative models we model the joint probability of the observations and the world state. In many problems both of these quantities may be high-dimensional. Consequently, the number of parameters required to characterize their joint density accurately is very large. Discriminative models suffer from the same pathology: if every element of the world state depends on every element of the data, a large number of parameters will be required to characterize this relationship. In practice, the required amount of training data and the computational burden of learning and inference reach impractical levels.
The solution to this problem is to reduce the dependencies between variables in the model by identifying (or asserting) some degree of redundancy. To this end, we introduce the idea of conditional independence, which is a way of characterizing these redundancies. We then introduce graphical models which are graph-based representations of the conditional independence relations. We discuss two different types of graphical models – directed and undirected – and we consider the implications for learning, inference, and drawing samples.
The models in chapters 6–9 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low dimensional. However, there are many situations where this is not the case, and these models are unsuitable.
For example, consider the semantic image labeling problem in which we wish to assign a label that denotes the object class to each pixel in the image. For example, in a road scene we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with N = 10000 pixels, this means we need to build a model relating the 10000 measured RGB triples to 610000 possible world states. None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.
One possible solution to this problem would be to build a set of independent local models: for example, we could build models that relate each pixel label separately to the nearby RGB data. However, this is not ideal as the image may be locally ambiguous. For example, a small blue image patch might result from a variety of semantically different classes: sky, water, a car door or a person's clothing.
The main focus of this book is on statistical models for computer vision; the previous chapters concern models that relate visual measurements x to the world w. However, there has been little discussion of how the measurement vector x was created, and it has often been implied that it contains concatenated RGB pixel values. In state-of-the-art vision systems, the image pixel data are almost always preprocessed to form the measurement vector.
We define preprocessing to be any transformation of the pixel data prior to building the model that relates the data to the world. Such transformations are often ad hoc heuristics: their parameters are not learned from training data, but they are chosen based on experience of what works well. The philosophy behind image preprocessing is easy to understand; the image data may be contingent on many aspects of the real world that do not pertain to the task at hand. For example, in an object detection task, the RGB values will change depending on the camera gain, illumination, object pose and particular instance of the object. The goal of image preprocessing is to remove as much of this unwanted variation as possible while retaining the aspects of the image that are critical to the final decision.
In a sense, the need for preprocessing represents a failure; we are admitting that we cannot directly model the relationship between the RGB values and the world state. Inevitably, we must pay a price for this.
This chapter concerns regression problems: the goal is to estimate a univariate world state ω based on observed measurements x. The discussion is limited to discriminative methods in which the distribution Pr(ω|x) of the world state is directly modeled. This contrasts with Chapter 7 where the focus was on generative models in which the likelihood Pr(x|ω) of the observations is modeled.
To motivate the regression problem, consider body pose estimation: here the goal is to estimate the joint angles of a human body, based on an observed image of the person in an unknown pose (Figure 8.1). Such an analysis could form the first step toward activity recognition.
We assume that the image has already been preprocessed and a low-dimensional vector x that represents the shape of the contour has been extracted. Our goal is to use this data vector to predict a second vector containing the joint angles for each of the major body joints. In practice, we will estimate each joint angle separately; we can hence concentrate our discussion on how to estimate a univariate quantity ω from continuous observed data x. We begin by assuming that the relation between the world and the data is linear and that the uncertainty around this prediction is normally distributed with constant variance. This is the linear regression model.
Linear regression
The goal of linear regression is to predict the posterior distribution Pr(ω|x) over the world state ω based on observed data x.
There are already many computer vision textbooks, and it is reasonable to question the need for another. Let me explain why I chose to write this volume.
Computer vision is an engineering discipline; we are primarily motivated by the real-world concern of building machines that see. Consequently, we tend to categorize our knowledge by the real-world problem that it addresses. For example, most existing vision textbooks contain chapters on object recognition and stereo vision. The sessions at our research conferences are organized in the same way. The role of this book is to question this orthodoxy: Is this really the way that we should organize our knowledge?
Consider the topic of object recognition. A wide variety of methods have been applied to this problem (e.g., subspace models, boosting methods, bag of words models, and constellation models). However, these approaches have little in common. Any attempt to describe the grand sweep of our knowledge devolves into an unstructured list of techniques. How can we make sense of it all for a new student? I will argue for a different way to organize our knowledge, but first let me tell you how I see computer vision problems.
We observe an image and from this we extract measurements. For example, we might use the RGB values directly or we might filter the image or perform some more sophisticated preprocessing. The vision problem or goal is to use the measurements to infer the world state.
In Part V, we finally acknowledge the process by which real-world images are formed. Light is emitted from one or more sources and travels through the scene, interacting with the materials via physical processes such as reflection, refraction, and scattering. Some of this light enters the camera and is measured. We have a very good understanding of this forward model. Given known geometry, light sources, and material properties, computer graphics techniques can simulate what will be seen by the camera very accurately.
The ultimate goal for a vision algorithm would be a complete reconstruction, in which we aim to invert this forward model and estimate the light sources, materials, and geometry from the image. Here, we aim to capture a structural description of the world: we seek an understanding of where things are and to measure their optical properties, rather than a semantic understanding. Such a structural description can be exploited to navigate around the environment or build 3D models for computer graphics.
Unfortunately, full visual reconstruction is very challenging. For one thing, the solution is nonunique. For example, if the light source intensity increases, but the object reflectance decreases commensurately, the image will remain unchanged. Of course, we could make the problem unique by imposing prior knowledge, but even then reconstruction remains difficult; it is hard to effectively parameterize the scene, and the problem is highly non-convex.
In this part of the book, we consider a family of models that approximate both the 3D scene and the observed image with sparse sets of visual primitives (points).