The first part of this book (Chapters 2–5) is devoted to a brief review of probability and probability distributions. Almost all models for computer vision can be interpreted in a probabilistic context, and in this book we will present all the material in this light. The probabilistic interpretation may initially seem confusing, but it has a great advantage: it provides a common notation that will be used throughout the book and will elucidate relationships between different models that would otherwise remain opaque.
So why is probability a suitable language to describe computer vision problems? In a camera, the three-dimensional world is projected onto the optical surface to form the image: a two-dimensional set of measurements. Our goal is to take these measurements and use them to establish the properties of the world that created them. However, there are two problems. First, the measurement process is noisy; what we observe is not the amount of light that fell on the sensor, but a noisy estimate of this quantity. We must describe the noise in these data, and for this we use probability. Second, the relationship between world and measurements is generally many to one: there may be many real-world configurations that are compatible with the same measurements. The chance that each of these possible worlds is present can also be described using probability.
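Both kinds of uncertainty are combined when we reason backward from the measurements to the world. Writing x for the measurements and w for the world state, Bayes' rule (introduced formally in Chapter 2) gives

    Pr(w|x) = Pr(x|w) Pr(w) / Pr(x),

where the likelihood Pr(x|w) describes the noisy measurement process, the prior Pr(w) describes how plausible each candidate world is before we see the data, and the posterior Pr(w|x) summarizes what the measurements tell us about the world.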
The structure of Part I is as follows: in Chapter 2, we introduce the basic rules for manipulating probability distributions including the ideas of conditional and marginal probability and Bayes' rule.
The previous chapters discussed models that relate the observed measurements to some aspect of the world that we wish to estimate. In each case, this relationship depended on a set of parameters and for each model we presented a learning algorithm that estimated these parameters.
Unfortunately, the utility of these models is limited because every element of the model depends on every other. For example, in generative models we model the joint probability of the observations and the world state. In many problems both of these quantities may be high-dimensional. Consequently, the number of parameters required to characterize their joint density accurately is very large. Discriminative models suffer from the same pathology: if every element of the world state depends on every element of the data, a large number of parameters will be required to characterize this relationship. In practice, the required amount of training data and the computational burden of learning and inference reach impractical levels.
The solution to this problem is to reduce the dependencies between variables in the model by identifying (or asserting) some degree of redundancy. To this end, we introduce the idea of conditional independence, which is a way of characterizing these redundancies. We then introduce graphical models which are graph-based representations of the conditional independence relations. We discuss two different types of graphical models – directed and undirected – and we consider the implications for learning, inference, and drawing samples.
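To see why exploiting redundancy helps, consider a toy illustration: three discrete variables x1, x2 and x3, each taking K values. The full joint distribution Pr(x1, x2, x3) requires K^3 − 1 parameters. If x1 and x3 are conditionally independent given x2, the joint factorizes as

    Pr(x1, x2, x3) = Pr(x1) Pr(x2|x1) Pr(x3|x2),

which needs only (K − 1) + 2K(K − 1) parameters; for K = 10 that is 189 rather than 999, and the saving grows dramatically as more variables are added.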
The models in Chapters 6–9 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low-dimensional. However, there are many situations where this is not the case, and these models are unsuitable.
For example, consider the semantic image labeling problem in which we wish to assign a label denoting the object class to each pixel in the image. In a road scene, for instance, we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with N = 10,000 pixels, this means we need to build a model relating the 10,000 measured RGB triples to 6^10,000 possible world states. None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.
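To get a sense of the scale, 6^10,000 is a number with 7,782 digits; merely enumerating the possible labelings is hopeless, let alone learning a separate parameter for every one of them.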
One possible solution to this problem would be to build a set of independent local models: for example, we could build models that relate each pixel label separately to the nearby RGB data. However, this is not ideal as the image may be locally ambiguous. For example, a small blue image patch might result from a variety of semantically different classes: sky, water, a car door or a person's clothing.
The main focus of this book is on statistical models for computer vision; the previous chapters concern models that relate visual measurements x to the world w. However, there has been little discussion of how the measurement vector x was created, and it has often been implied that it contains concatenated RGB pixel values. In state-of-the-art vision systems, the image pixel data are almost always preprocessed to form the measurement vector.
We define preprocessing to be any transformation of the pixel data prior to building the model that relates the data to the world. Such transformations are often ad hoc heuristics: their parameters are not learned from training data, but they are chosen based on experience of what works well. The philosophy behind image preprocessing is easy to understand; the image data may be contingent on many aspects of the real world that do not pertain to the task at hand. For example, in an object detection task, the RGB values will change depending on the camera gain, illumination, object pose and particular instance of the object. The goal of image preprocessing is to remove as much of this unwanted variation as possible while retaining the aspects of the image that are critical to the final decision.
In a sense, the need for preprocessing represents a failure; we are admitting that we cannot directly model the relationship between the RGB values and the world state. Inevitably, we must pay a price for this.
This chapter concerns regression problems: the goal is to estimate a univariate world state w based on observed measurements x. The discussion is limited to discriminative methods in which the distribution Pr(w|x) of the world state is directly modeled. This contrasts with Chapter 7 where the focus was on generative models in which the likelihood Pr(x|w) of the observations is modeled.
To motivate the regression problem, consider body pose estimation: here the goal is to estimate the joint angles of a human body, based on an observed image of the person in an unknown pose (Figure 8.1). Such an analysis could form the first step toward activity recognition.
We assume that the image has already been preprocessed and a low-dimensional vector x that represents the shape of the contour has been extracted. Our goal is to use this data vector to predict a second vector containing the joint angles for each of the major body joints. In practice, we will estimate each joint angle separately; we can hence concentrate our discussion on how to estimate a univariate quantity w from continuous observed data x. We begin by assuming that the relation between the world and the data is linear and that the uncertainty around this prediction is normally distributed with constant variance. This is the linear regression model.
Linear regression
The goal of linear regression is to predict the posterior distribution Pr(w|x) over the world state w based on observed data x.
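As a minimal sketch of how the maximum-likelihood parameters of this model might be fitted, assume the usual formulation Pr(w|x) = Norm[φ0 + φᵀx, σ²] and a training set of pairs {xi, wi}; the code below, including its function and variable names, is illustrative rather than the book's own implementation.

    import numpy as np

    def fit_linear_regression(X, w):
        """Maximum-likelihood fit of a linear regression model.

        X : (I, D) array of observed data vectors
        w : (I,) array of corresponding world states
        Returns phi (offset phi_0 in position 0, gradient in the rest)
        and the noise variance sigma_sq.
        """
        A = np.hstack([np.ones((len(w), 1)), X])     # prepend ones so phi_0 is absorbed into phi
        phi, *_ = np.linalg.lstsq(A, w, rcond=None)  # least squares = ML estimate under normal noise
        sigma_sq = np.mean((w - A @ phi) ** 2)       # ML noise variance = mean squared residual
        return phi, sigma_sq

    def predict(x_new, phi, sigma_sq):
        """Mean and variance of the predictive distribution Pr(w|x_new)."""
        return phi[0] + phi[1:] @ x_new, sigma_sq

For a new measurement x_new, the posterior over the world state is then a normal distribution centered on φ0 + φᵀx_new with constant variance σ².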
There are already many computer vision textbooks, and it is reasonable to question the need for another. Let me explain why I chose to write this volume.
Computer vision is an engineering discipline; we are primarily motivated by the real-world concern of building machines that see. Consequently, we tend to categorize our knowledge by the real-world problem that it addresses. For example, most existing vision textbooks contain chapters on object recognition and stereo vision. The sessions at our research conferences are organized in the same way. The role of this book is to question this orthodoxy: Is this really the way that we should organize our knowledge?
Consider the topic of object recognition. A wide variety of methods have been applied to this problem (e.g., subspace models, boosting methods, bag of words models, and constellation models). However, these approaches have little in common. Any attempt to describe the grand sweep of our knowledge devolves into an unstructured list of techniques. How can we make sense of it all for a new student? I will argue for a different way to organize our knowledge, but first let me tell you how I see computer vision problems.
We observe an image and from this we extract measurements. For example, we might use the RGB values directly or we might filter the image or perform some more sophisticated preprocessing. The vision problem or goal is to use the measurements to infer the world state.
In Part V, we finally acknowledge the process by which real-world images are formed. Light is emitted from one or more sources and travels through the scene, interacting with the materials via physical processes such as reflection, refraction, and scattering. Some of this light enters the camera and is measured. We have a very good understanding of this forward model. Given known geometry, light sources, and material properties, computer graphics techniques can simulate what will be seen by the camera very accurately.
The ultimate goal for a vision algorithm would be a complete reconstruction, in which we aim to invert this forward model and estimate the light sources, materials, and geometry from the image. Here, we aim to capture a structural description of the world: we seek an understanding of where things are and to measure their optical properties, rather than a semantic understanding. Such a structural description can be exploited to navigate around the environment or build 3D models for computer graphics.
Unfortunately, full visual reconstruction is very challenging. For one thing, the solution is nonunique. For example, if the light source intensity increases, but the object reflectance decreases commensurately, the image will remain unchanged. Of course, we could make the problem unique by imposing prior knowledge, but even then reconstruction remains difficult; it is hard to effectively parameterize the scene, and the problem is highly non-convex.
In this part of the book, we consider a family of models that approximate both the 3D scene and the observed image with sparse sets of visual primitives (points).
In most of the models in this book, the observed data are treated as continuous. Hence, for generative models the data likelihood is usually based on the normal distribution. In this chapter, we explore generative models that treat the observed data as discrete. The data likelihoods are now based on the categorical distribution; they describe the probability of observing the different possible values of the discrete variable.
As a motivating example for the models in this chapter, consider the problem of scene classification (Figure 20.1). We are given example training images of different scene categories (e.g., office, coastline, forest, mountain) and we are asked to learn a model that can classify new examples. Studying the scenes in Figure 20.1 demonstrates how challenging a problem this is. Different images of the same scene may have very little in common with one another, yet we must somehow learn to identify them as the same. In this chapter, we will also discuss object recognition, which has many of the same characteristics; the appearance of an object such as a tree, bicycle, or chair can vary dramatically from one image to another, and we must somehow capture this variation.
The key to modeling these complex scenes is to encode the image as a collection of visual words, and use the frequencies with which these words occur as the substrate for further calculations. We start this chapter by describing this transformation.
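As a minimal sketch of this transformation (the use of k-means clustering to build the dictionary, and all function and variable names, are illustrative assumptions rather than prescriptions from the text): descriptors pooled from many training images are clustered to form a dictionary of visual words, and each image is then summarized by the normalized histogram of its descriptors' nearest words.

    import numpy as np

    def learn_dictionary(descriptors, K, n_iters=20, seed=0):
        """Learn K visual words by running simple k-means on an (N, D)
        array of descriptors pooled from the training images."""
        rng = np.random.default_rng(seed)
        # Initialize the words with K randomly chosen descriptors.
        words = descriptors[rng.choice(len(descriptors), K, replace=False)].astype(float)
        for _ in range(n_iters):
            # Assign every descriptor to its nearest word (Euclidean distance).
            dists = np.linalg.norm(descriptors[:, None, :] - words[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each word to the mean of the descriptors assigned to it.
            for k in range(K):
                if np.any(labels == k):
                    words[k] = descriptors[labels == k].mean(axis=0)
        return words

    def encode_image(descriptors, words):
        """Encode one image as a normalized histogram of visual-word counts."""
        dists = np.linalg.norm(descriptors[:, None, :] - words[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        hist = np.bincount(labels, minlength=len(words)).astype(float)
        return hist / hist.sum()

These word-frequency histograms then stand in for the raw pixel data in the models developed in the rest of the chapter.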
Nowadays there are myriad tools to select from when developing and maintaining software. Many are open source, but many commercial tools are also available. The development and maintenance tools include:
■ Compilers, linkers, and interactive debuggers
■ Build tools to automatically compile the source code in the right order
■ Integrated development environments
■ Tools for static and dynamic analysis
■ Version control systems
■ Tools for source code documentation
This appendix describes some uses of these tools, but it does not provide a complete overview.
The Compiler
Since the Fortran 90 standard, the language has gained features that make it easier for the compiler to perform static and dynamic analysis:
■ The implicit none statement (or equivalent compiler options) forces the compiler to check that every variable is explicitly declared. This reduces the chance that a typo inadvertently introduces a bug in the form of a stray variable name.
■ By putting all routines in modules, you ensure that the compiler checks the number and types of the arguments in subroutine and function calls. It also reduces name clashes when linking a large program that uses many libraries.
■ If you use assumed-shape arrays instead of assumed-size arrays, the compiler can insert runtime checks on array bounds. Moreover, you no longer need to pass separate arguments for the array sizes, which makes calls simpler and less error-prone.
Therefore, the compiler has become a much more powerful tool with respect to static and dynamic analysis.
In this chapter we discuss a family of models that explain observed data in terms of several underlying causes. These causes can be divided into three types: the identity of the object, the style in which it is observed, and the remaining variation.
To motivate these models, consider face recognition. For a facial image, the identity of the face (i.e., whose face it is) obviously influences the observed data. However, the style in which the face is viewed is also important. The pose, expression, and illumination are all style elements that might be modeled. Unfortunately, many other things also contribute to the final observed data: the person may have applied cosmetics, put on glasses, grown a beard, or dyed his or her hair. These myriad contributory elements are usually too difficult to model and are hence explained with a generic noise term.
In face recognition tasks, our goal is to infer whether the identities of face images are the same or different. For example, in face verification, we aim to infer a binary variable w ∈ {0, 1}, where w = 0 indicates that the identities differ and w = 1 indicates that they are the same. This task is extremely challenging when there are large changes in pose, illumination, or expression; the change in the image due to style may dwarf the change due to identity (Figure 18.1).
The models in this chapter are generative, so the focus is on building separate density models over the observed image data for the cases where the faces do and do not have the same identity.
The applications of Fortran span, very nearly, the whole period during which computers have been in general-purpose use. This is quite remarkable and, given the demise of so many other high-level languages, it is difficult to know why. Possibly the original design concepts of John Backus – ease of use and efficiency of execution – have been two major factors. Another might be the devotion of Fortran's user community, who labor to keep it abreast of developments in programming techniques and to adapt it to ever-more demanding requirements.
Despite all the predictions, over several decades, that Fortran is about to become extinct, the language has shown itself to be remarkably resilient. Furthermore, over the last few years, it has been subject to new rounds of standardization, and the latest standard, Fortran 2008, should again extend the language's life. Against this background, it is very regrettable that old versions of Fortran live on, both in the form of antiquated courses given by incorrigible teachers and as an outmoded concept in the minds of its detractors. After all, modern Fortran is a procedural, imperative, and compiled language with a syntax well suited to a direct representation of mathematical formulae. Its individual procedures may be compiled separately or grouped into modules, either way allowing the convenient construction of very large programs and procedure libraries. The language contains features for array processing, abstract data types, dynamic data structures, object-oriented programming, and parallel processing.
I was very pleased to be asked to write this foreword, having seen snapshots of the development of this book since its inception. I write this having just returned from BMVC 2011, where I found that others had seen draft copies, and where I heard comments like “What amazing figures!”, “It's so comprehensive!”, and “He's so Bayesian!”.
But I don't want you to read this book just because it has amazing figures and provides new insights into vision algorithms of every kind, or even because it's “Bayesian” (although more on that later). I want you to read it because it makes clear the most important distinction in computer vision research: the difference between “model” and “algorithm.” This is akin to the distinction that Marr made with his three-level computational theory, but Prince's two-level distinction is made beautifully clear by his use of the language of probability.
Why is this distinction so important? Well, let us look at one of the oldest and apparently easiest problems in vision: separating an image into “figure” and “ground.” It is still common to hear students new to vision address this problem just as the early vision researchers did, by reciting an algorithm: first I'll use PCA to find the dominant color axis, then I'll generate a grayscale image, then I'll threshold that at some value, then I'll clean up the holes using morphological operators.