A task of visual perception is to find the scene which best explains visual observations. Figure 9.1 can be used to illustrate the problem of perception. The visual data is the image held by two cherubs at the right. Scattered in the middle are various geometrical objects – “scene interpretations” – which may account for the observed data. How does one choose between the competing interpretations for the image data?
One approach is to find the probability that each interpretation could have created the observed data. Bayesian statistics is a powerful tool for this (e.g. Geman & Geman, 1984; Jepson & Richards, 1992; Kersten, 1991; Szeliski, 1989). One expresses prior assumptions as probabilities and calculates for each interpretation a posterior probability, conditioned on the visual data. The best interpretation may be that with the highest probability density, or a more sophisticated criterion may be used. Other computational techniques, such as regularization (Poggio et al., 1985; Tikhonov & Arsenin, 1977), can be posed in a Bayesian framework (Szeliski, 1989). In this chapter, we will apply the powerful assumption of “generic view” in a Bayesian framework. This will lead us to an additional term from Bayesian theory involving the Fisher information matrix (see also chapter 7 by Blake et al.). This term modifies the posterior probabilities to give additional information about the scene.
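As a concrete illustration of this selection scheme (a minimal sketch, not a model from this chapter), the posterior for each candidate interpretation can be computed from an assumed likelihood and prior, and the maximum a posteriori (MAP) interpretation chosen. The Gaussian noise model and flat prior below are illustrative placeholders.

```python
# Minimal sketch of Bayesian selection among competing scene interpretations:
# assign each candidate a posterior p(scene | image) and pick the MAP candidate.
import numpy as np

def posteriors(image, interpretations, likelihood, prior):
    """p(scene | image) proportional to p(image | scene) p(scene), normalized over candidates."""
    scores = np.array([likelihood(image, s) * prior(s) for s in interpretations])
    return scores / scores.sum()

# Toy example: the "image" is a noisy scalar measurement of a scene parameter.
def toy_likelihood(image, scene, sigma=1.0):
    return np.exp(-0.5 * ((image - scene) / sigma) ** 2)   # Gaussian noise model (an assumption)

def toy_prior(scene):
    return 1.0                                             # flat prior over the candidate set

candidates = [0.0, 1.0, 2.0]                               # competing scene interpretations
post = posteriors(1.2, candidates, toy_likelihood, toy_prior)
best = candidates[int(np.argmax(post))]                    # the MAP interpretation
```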
Texture cues in the image plane are a potentially rich source of surface information available to the human observer. A photograph of a cornfield, for example, can give a compelling impression of the orientation of the ground plane relative to the observer. Gibson (1950) designed the first experiments to test the ability of humans to use this texture information in their estimation of surface orientation. Since that time, various authors have proposed and tested hypotheses concerning the relative importance of different visual cues in human judgements of shape from texture (Cutting & Millard, 1984; Todd & Akerstrom, 1987). This work has generally relied on a cue conflict paradigm in which one cue is varied while the other is held constant. This is potentially problematic, since surfaces with conflicting texture cues do not occur in nature. It is possible that in a psychophysical experiment our visual system might employ a different mechanism to resolve the cue conflict condition. We show in this paper that the strength of individual texture cues can be measured and compared with an ideal observer model without resorting to a cue conflict paradigm.
Ideal observer models for the estimation of shape from texture have been described by Witkin (1981), Kanatani & Chou (1989), Davis et al. (1983), Blake & Marinos (1990), and Marinos & Blake (1990). Given basic assumptions about the distribution and orientation of texture elements, an estimate of surface orientation can be obtained, together with crucial information about the reliability of that estimate.
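The following sketch conveys the flavour of such an ideal observer under one particularly simple set of assumptions (isotropically oriented line elements on the surface, orthographic projection, known tilt direction); it is an illustrative stand-in, not any of the cited authors' models. The curvature of the log-likelihood at its peak plays the role of the reliability information mentioned above.

```python
# Sketch of an ideal observer for surface slant from texture, assuming isotropic
# line-element orientations on the surface and orthographic projection.  Projected
# orientations beta (measured from the unforeshortened axis) then have density
#     p(beta | slant) = cos(slant) / (pi * (cos^2(slant) cos^2(beta) + sin^2(beta))).
import numpy as np

def log_likelihood(betas, slant):
    """Log-likelihood of the observed projected orientations (radians) for a given slant."""
    c = np.cos(slant)
    density = c / (np.pi * (c**2 * np.cos(betas)**2 + np.sin(betas)**2))
    return np.sum(np.log(density))

def estimate_slant(betas, grid=None):
    """Maximum-likelihood slant, plus the log-likelihood curvature at the peak
    as a rough measure of the reliability of the estimate."""
    if grid is None:
        grid = np.linspace(0.01, 1.5, 300)        # candidate slants in radians
    ll = np.array([log_likelihood(np.asarray(betas), s) for s in grid])
    i = int(np.argmax(ll))
    h = grid[1] - grid[0]
    curvature = -(ll[i - 1] - 2 * ll[i] + ll[i + 1]) / h**2 if 0 < i < len(grid) - 1 else np.nan
    return grid[i], curvature
```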
By the late eighties, the computational approach to perception advocated by Marr (1982) was well established. In vision, most properties of the 2 ½ D sketch such as surface orientation and 3D shape admitted solutions, especially for machine vision systems operating in constrained environments. Similarly, tactile and force sensing was rapidly becoming a practicality for robotics and prostheses. Yet in spite of this progress, it was increasingly apparent that machine perceptual systems were still enormously impoverished versions of their biological counterparts. Machine systems simply lacked the inductive intelligence and knowledge that allowed biological systems to operate successfully over a variety of unspecified contexts and environments. The role of “top-down” knowledge was clearly underestimated and was much more important than precise edge, region, “textural”, or shape information. It was also becoming obvious that even when adequate “bottom-up” information was available, we did not understand how this information should be combined from the different perceptual modules, each operating under their often quite different and competing constraints (Jain, 1989). Furthermore, what principles justified the choice of these “constraints” in the first place? Problems such as these all seemed to be subsumed under a lack of understanding of how prior knowledge should be brought to bear upon the interpretation of sensory data. Of course, this conclusion came as no surprise to many cognitive and experimental psychologists (e.g. Gregory, 1980; Hochberg, 1988; Rock, 1983), or to neurophysiologists who were exploring the role of massive reciprocal descending pathways (Maunsell & Newsome, 1987; Van Essen et al., 1992).
The previous chapters have demonstrated the many ways one can use a Bayesian formulation for computationally modeling perceptual problems. In this chapter, we look at the implications of a Bayesian view of visual information processing for investigating human visual perception. We will attempt to outline the elements of a general program of empirical research which results from taking the Bayesian formulation seriously as a framework for characterizing human perceptual inference. A major advantage of following such a program is that it supports a strong integration of psychophysics and computational theory, since its structure is the same as that of the Bayesian framework for computational modeling. In particular, it provides the foundation for a psychophysics of constraints, used to test hypotheses about the quantitative and qualitative constraints used in human perceptual inferences. The Bayesian approach also suggests new ways to conceptualize the general problem of perception and to decompose it into isolatable parts for psychophysical investigation. Thus, it not only provides a framework for modeling solutions to specific perceptual problems; it also guides the definition of the problems.
The chapter is organized into four major sections. In the next section, we develop a framework for characterizing human perception in Bayesian terms and analyze its implications for studying human perceptual performance. The third and fourth sections of the chapter apply the framework to two specific problems: the perception of 3-D shape from surface contours and the perception of 3-D object motion from cast shadow motion.
By B.M. Bennett (University of California at Irvine), D.D. Hoffman (University of California at Irvine), C. Prakash (California State University), and S.N. Richman (University of California at Irvine)
The search is on for a general theory of perception. As the papers in this volume indicate, many perceptual researchers now seek a conceptual framework and a general formalism to help them solve specific problems.
One candidate framework is “observer theory” (Bennett, Hoffman, & Prakash, 1989a). This paper discusses observer theory, gives a sympathetic analysis of its candidacy, describes its relationship to standard Bayesian analysis, and uses it to develop a new account of the relationship between computational theories and psychophysical data. Observer theory provides powerful tools for the perceptual theorist, psychophysicist, and philosopher. For the theorist it provides (1) a clean distinction between competence and performance, (2) clear goals and techniques for solving specific problems, and (3) a canonical format for presenting and analyzing proposed solutions. For the psychophysicist it provides techniques for assessing the psychological plausibility of theoretical solutions in the light of psychophysical data. And for the philosopher it provides conceptual tools for investigating the relationship of sensory experience to the material world.
Observer theory relates to Bayesian approaches as follows. In Bayesian approaches to vision one is given an image (or small collection of images), and a central goal is to compute the probability of various scene interpretations for that image (or small collection of images). That is, a central goal is to compute a conditional probability measure, called the “posterior distribution”, which can be written p(Scene | Image) or, more briefly, p(S | I).
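In the familiar form of Bayes' rule (a restatement in the chapter's notation, not a formula specific to observer theory), this posterior is obtained from a likelihood, which models how scenes give rise to images, and a prior over scenes: p(S | I) = p(I | S) p(S) / p(I).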
The term “Pattern Theory” was introduced by Ulf Grenander in the 1970s as a name for a field of applied mathematics which gave a theoretical setting for a large number of related ideas, techniques and results from fields such as computer vision, speech recognition, image and acoustic signal processing, pattern recognition and its statistical side, neural nets and parts of artificial intelligence (see Grenander, 1976-81). When I first began to study computer vision about ten years ago, I read parts of this book but did not really understand his insight. However, as I worked in the field, every time I felt I saw what was going on in a broader perspective or saw some theme which seemed to pull together the field as a whole, it turned out that this theme was part of what Grenander called pattern theory. It seems to me now that this is the right framework for these areas, and, as these fields have been growing explosively, the time is ripe for making an attempt to reexamine recent progress and try to make the ideas behind this unification better known. This article presents pattern theory from my point of view, which may be somewhat narrower than Grenander's, updated with recent examples involving interesting new mathematics.
When we see objects in the world, what we actually “see” is much more than the retinal image. Our perception is three-dimensional. Moreover, it reflects constant properties of the objects and the environment, regardless of changes in the retinal image with varying viewing condition. How does the visual system make this possible?
Two different approaches have been evident in the study of visual perception. One approach, most successful in recent times, is based on the idea that perception emerges automatically from some combination of neuronal receptive fields. In the study of depth perception, this general line of thinking has been supported by psychophysical and physiological evidence. The “purely cyclopean” perception in Julesz's random dot stereogram (Julesz, 1960) shows that depth can emerge without the mediation of any higher order form recognition. This suggested that relatively local disparity-specific processes could account for the perception of a floating figure in an otherwise camouflaged display. Corresponding electrophysiological experiments using single cell recordings demonstrated that the depth of such stimuli could be coded by neurons in the visual cortex receiving input from the two eyes (Barlow et al., 1967; Poggio & Fischer, 1977). In contrast to this more modern approach, there exists an older tradition which asserts that perception is inferential, that it can cleverly determine the nature of the world with limited image data. Beginning with Helmholtz's unconscious inference (Helmholtz, 1910) and continuing with more recent formulations such as Gregory's “perceptual hypotheses”, this approach stresses the importance of problem solving in the process of seeing (Hochberg, 1981; Gregory, 1970; Rock, 1983).
The world we live in is a very structured place. Matter does not flit about in space and time in a completely unorganized fashion, but rather is organized by the physical forces, biological processes, social interactions, and so on which exist in our world (McMahon, 1975; Thompson, 1952). It is this structure, or regularity, which makes it possible for us to make reliable inferences about our surroundings from the signals taken in from various senses (Marr, 1982; Witkin and Tenenbaum, 1983). In other words, regularities in the world make sense data reliably informative about the world we move around in. But what is the nature of these regularities, and how can they be used for the purposes of perception?
In this chapter, we consider one class of environmental regularities which arise from what we call the modal structure of the world and which have the effect of making sensory information for certain types of perceptual judgements highly reliable (Bobick and Richards, 1986). Our definition of modal regularities is motivated by careful analyses of some simple examples of reliable perceptual inferences. Given the resulting definition, we then briefly discuss some of the implications for the knowledge required of a perceiver in order for it to make reliable inferences in the presence of such modal structure.
Modal structure: An example
When can we infer that an object is stationary?
A common perceptual inference is that of whether an object is moving or at rest. How can we make this inference given only the two-dimensional projection of a three-dimensional object?
Ideas about perception have changed a great deal over the past half century. Before the discovery of maps of sensory surfaces in the cerebral cortex the consensus view was that perception involved “mind-stuff”, and because mind-stuff was not matter the greatest care was required in using crude, essentially materialistic, scientific methods and concepts to investigate and explain phenomena in which it had a hand. One approach was to reduce the role of this mind-stuff to a minimum by designing experiments so that the mind was used simply as a null-detector, analogous to a sensitive galvanometer in a Wheatstone bridge, which had to do no more than detect the identity or non-identity of two percepts. Those who followed this approach might be termed the hard-psychophysics school – exemplified by people such as Helmholtz, Stiles, Hecht and Rushton – and it had some brilliant successes in explaining the properties of sensation: for instance the trichromatic theory; the relation between the quality of the retinal image and visual acuity; and the relation between sensitivity and the absorption of quanta in photo-sensitive pigments. But it left the mystery of the mind-stuff untouched.
Others attempted to discover the properties of the mind-stuff by defining the physical stimuli required to elicit its verbally recognizable states – a soft-psychophysics approach that could be said to characterize the work of the Gestalt school and, for example, Hering, Gibson, and Hurvich and Jameson.
The luminance of a surface results from the combined effect of its reflectance (albedo) and its conditions of illumination. Luminance can be directly observed, but reflectance and illumination can only be derived by perceptual processes. Human observers are good at judging an object's reflectance in spite of large changes in illumination; this skill is known as “lightness constancy”.
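For a matte (approximately Lambertian) surface this combination is multiplicative: luminance is the product of reflectance and the incident illumination, L(x) = R(x) E(x), so that in the logarithmic domain the two causes simply add, log L(x) = log R(x) + log E(x).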
Most research on lightness constancy has used stimuli consisting of grey patches on a single flat plane. The models are typically based on the assumption that slow variations in luminance are due to illumination gradients, while sharp changes in luminance are due to reflectance edges. The retinex models for use with “Mondrian” stimuli are good examples (Horn, 1974; Land & McCann, 1971). But in three-dimensional scenes, sharp luminance changes can arise either from reflectance or from illumination, as illustrated in Figure 11.1. The edge marked (1) is due to a reflectance change, such as might result from a different shade of paint. The edge marked (2) results from a change in surface normal, which leads to a change in the angle of incidence of the light – an effect that we may simply refer to as “shading.” As Gilchrist and his colleagues have emphasized (Gilchrist et al., 1983), three-dimensional scenes introduce large and important effects that are completely missed in the traditional approach to lightness perception.
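A minimal one-dimensional sketch of this heuristic (an illustration of the thresholding idea, not the Horn or Land & McCann algorithms; the threshold value is an arbitrary choice) is shown below.

```python
# Retinex-style heuristic in 1-D: in log luminance, small gradients are attributed
# to slowly varying illumination and large gradients to reflectance edges.
import numpy as np

def decompose_log_luminance(log_lum, threshold=0.1):
    """Split a 1-D log-luminance signal into reflectance and illumination components."""
    grad = np.diff(log_lum, prepend=log_lum[0])
    refl_grad = np.where(np.abs(grad) > threshold, grad, 0.0)   # keep only sharp changes
    illum_grad = grad - refl_grad                               # the slowly varying remainder
    log_reflectance = np.cumsum(refl_grad)
    log_illumination = np.cumsum(illum_grad)
    # log luminance = log reflectance + log illumination (up to an additive constant)
    return log_reflectance, log_illumination
```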
Intrinsic image analysis
Using the terminology of Barrow & Tenenbaum (1978) we may cast the perceptual task as a problem of computing intrinsic images – images that represent the underlying physical properties of a scene.
Visual perception is the process of inferring world structure from image structure. If the world structure we recover from our images “makes sense” as a plausible world event, then we have a “percept” and can often offer a concise linguistic description of what we see. For example, in the upper panel of Figure 3.1, if asked, “What do you see?”, a typical response might be a pillbox with a handle either erect (left) or flat (right). This concise and confident response suggests that we have identified a model type that fits the image observation with no residual ambiguities at the level of the description. In contrast, when asked to describe the two lower drawings in Figure 3.1, there is some hesitancy and uncertainty. Is the handle erect or not? Does it have a skewed or rectangular shape? The depiction leaves us somewhat uncertain, as if several options were possible but none in which all aspects of the interpretation collectively support each other. What, then, leads us to the certainty in the upper set and to the ambiguity in the lower pair?
To be a bit more precise about our goal, let us assume that some Waltz-like algorithm has already identified the base of the pillbox and the wire-frame handle as separate 3D parts. Even with this decomposition, there remains an infinity of possible interpretations for any of these drawings.