This chapter describes the work on human-computer interaction being carried out in our laboratory at the University of Osaka. Recognition of human expressions is necessary for human-computer interactive applications. A vision system is well suited to this task because it relies on passive sensing, so hand, body, and face gestures can be recognized without any discomfort to the user. The computer should not restrict the user's movements to the area directly in front of it. We therefore study methods of looking at people using a network of active cameras.
Introduction
Sensing of human expressions is very important for human-computer interactive applications such as virtual reality, gesture recognition, and communication. A vision system is well suited to human-computer interaction because it relies on passive sensing, so hand, body, and face gestures can be recognized without any discomfort to the user. In our research we therefore use cameras as the sensors for estimating human motion and gestures.
Facial expression is a natural form of human expression, necessary for communicating emotions such as happiness, surprise, and sadness to others. A large number of studies have been made on machine recognition of human facial expression. Many of them are based on multi-resolution monochrome images and template pattern matching techniques [62, 293, 324]. This kind of approach requires averaging over the face model, or blurring of the input image, to cope with the varying appearance of faces in images.
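As a rough sketch of this template-matching style of approach in general (not of the specific systems cited above), the following Python fragment blurs a monochrome face image, builds a small multi-resolution pyramid, and scores a template by normalized cross-correlation. The OpenCV calls are standard, but the file names, template, and parameters are placeholders.

# Sketch of multi-resolution template matching for expression recognition.
# Hypothetical file names; assumes OpenCV (cv2) and NumPy are installed.
import cv2

face = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)            # monochrome input
template = cv2.imread("smile_template.png", cv2.IMREAD_GRAYSCALE)

# Blur to reduce sensitivity to individual facial appearance,
# then build a coarse-to-fine (multi-resolution) pyramid.
face = cv2.GaussianBlur(face, (5, 5), sigmaX=1.5)
template = cv2.GaussianBlur(template, (5, 5), sigmaX=1.5)
pyramid = [face]
for _ in range(2):
    pyramid.append(cv2.pyrDown(pyramid[-1]))

# Match at the coarsest level; normalized cross-correlation gives a
# score in [-1, 1] that is insensitive to overall brightness changes.
coarse = pyramid[-1]
small_template = template
for _ in range(2):
    small_template = cv2.pyrDown(small_template)
scores = cv2.matchTemplate(coarse, small_template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(scores)
print("best match score:", best_score, "at", best_loc)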
Face it. Butlers cannot be blind. Secretaries cannot be deaf. But somehow we take it for granted that computers can be both.
Human-computer interface dogma was first dominated by direct manipulation and then delegation. The tacit assumption of both styles of interaction has been that the human will be explicit, unambiguous and fully attentive. Equivocation, contradiction and preoccupation are unthinkable even though they are very human behaviors. Not allowed. We are expected to be disciplined, fully focused, single minded and ‘there’ with every attending muscle in our body. Worse, we accept it.
Times will change. Cipolla, Pentland et al. fly in the face (pun intended) of traditional human-computer interface research. The questions they pose and answers they provide have the common thread of concurrency. Namely, by combining modes of communication, the resulting richness of expression is not only far greater than the sum of the parts, but allows for one channel to disambiguate the other. Look. There's an example right there. Where? Well, you can't see it, because you cannot see me, where I am looking, what's around me. So the example is left to your imagination.
That's fine in literature and for well codified tasks. Works for making plane reservations, buying and selling stocks and, think of it, almost everything we do with computers today. But this kind of categorical computing is crummy for design, debate and deliberation. It is really useless when the purpose of communication is to collect our own thoughts.
A key issue in advanced interface design is the development of friendly tools for natural interaction between user and machine. In this chapter, we propose an approach to non-intrusive human-computer interfacing in which the user's head and pupils are monitored by computer vision for interaction control within on-screen environments. Two different visual pointers are defined, allowing simultaneous and decoupled navigation and selection in 3D and 2D graphic scenarios. The pointers intercept user actions, whose geometry is then remapped onto the environment through a drag-and-click metaphor, providing a dialogue with natural semantics.
Introduction
In the last few years, a huge effort has been made towards building advanced environments for human-machine interaction and human-human communication mediated by computers. Such environments can improve both the activity and satisfaction of individual users and computer-supported cooperative work. Apart from some obvious implementation and design differences, virtual reality [255], augmented reality [309] and smart room [235] environments share the same underlying principle: providing users with a more natural dialogue with (and through) the computer than was possible in the past. This is obtained through careful interface design involving interface languages that mimic everyday experience and advanced interaction techniques.
Recently, the simultaneous growth of computing power and decrease in hardware costs, together with the development of specific algorithms and techniques, have encouraged the use of computer vision as a non-intrusive technology for advanced human-machine interaction.
In this chapter, we present our approach to recognizing hand signs. It addresses three key aspects of hand sign interpretation: hand location, hand shape, and hand movement. The approach has two major components: (a) a prediction-and-verification segmentation scheme that segments the moving hand from its background; (b) a recognizer that identifies the hand sign from the temporal sequence of segmented hand shapes together with global motion information. The segmentation scheme can deal with a large number of different hand shapes against complex backgrounds. In the recognition part, we use multiclass, multi-dimensional discriminant analysis in every internal node of a recursive partition tree to automatically select the most discriminating linear features for gesture classification. The method has been tested on 28 classes of hand signs. The experimental results show that the system achieves a 93% recognition rate on test sequences that were not used in the training phase.
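The most-discriminating-feature step can be illustrated with generic multiclass discriminant analysis. The sketch below uses random stand-in data rather than segmented hand images, computes within-class and between-class scatter, and keeps the top generalized eigenvectors as linear features; it illustrates the general technique only, not the chapter's recursive partition tree implementation.

# Minimal sketch of multiclass discriminant analysis (MDA): project
# high-dimensional gesture feature vectors onto the most discriminating
# linear features.  Toy random data stands in for real hand/motion features.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))           # 300 samples, 50-dim features
y = rng.integers(0, 28, size=300)        # 28 hypothetical sign classes

mean_total = X.mean(axis=0)
d = X.shape[1]
Sw = np.zeros((d, d))                    # within-class scatter
Sb = np.zeros((d, d))                    # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    diff = (mc - mean_total)[:, None]
    Sb += len(Xc) * diff @ diff.T

# Generalized eigenproblem Sb v = lambda Sw v; the top eigenvectors are
# the most discriminating linear features (e.g. to split a tree node).
eigvals, eigvecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
W = eigvecs[:, np.argsort(eigvals)[::-1][:10]]   # keep 10 features
X_reduced = X @ W
print(X_reduced.shape)                   # (300, 10)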
Introduction
The ability to interpret hand gestures is essential if computer systems are to interact with human users in a natural way. In this chapter, we present a new vision-based framework that allows the computer to interact with users through hand signs.
Since its first known dictionary was printed in 1856 [61], American Sign Language (ASL) has been widely used in the deaf community, as well as by handicapped people who are not deaf [49].
Face and hand gestures are an important means of communication between humans. Similarly, automatic face and gesture recognition systems could be used for contact-less human-machine interaction. Developing such systems is difficult, however, because faces and hands are both complex and highly variable structures. We describe how flexible models can be used to represent the varying appearance of faces and hands and how these models can be used for tracking and interpretation. Experimental results are presented for face pose recovery, face identification, expression recognition, gender recognition and gesture interpretation.
Introduction
This chapter addresses the problem of locating and interpreting faces and hand gestures in images. By interpreting face images we mean recovering the 3D pose, identifying the individual and recognizing the expression and gender; for the hand images we mean recognizing the configuration of the fingers. In both cases different instances of the same class are not identical; for example, face images belonging to the same individual will vary because of changes in expression, lighting conditions, 3D pose and so on. Similarly hand images displaying the same gesture will vary in form.
We have approached these problems by modeling the ways in which the appearance of faces and hands can vary, using parametrised deformable models that take into account all the main sources of variability. A robust image search method [90, 89] is used to fit the models to new face/hand images, recovering compact parametric descriptions.
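One common way to realise such a parametrised deformable model is a statistical shape model: principal component analysis over aligned landmark points yields a mean shape plus a few modes of variation, and each new shape is then described compactly by its mode coefficients. The sketch below uses synthetic, pre-aligned landmarks and illustrates only this general idea, not the authors' exact formulation.

# Sketch of a flexible (statistical) shape model: PCA over aligned
# landmark coordinates.  Shapes are assumed already aligned (translation,
# rotation, scale removed); the data here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_shapes, n_points = 100, 20
shapes = rng.normal(size=(n_shapes, 2 * n_points))   # (x1, y1, ..., xN, yN)

mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape
# Principal modes of variation of the landmark set.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
n_modes = 5
P = Vt[:n_modes].T                     # 2N x n_modes matrix of modes
b = centered @ P                       # compact parametric description

# Any face/hand shape is approximated as: mean_shape + P @ b_i
reconstructed = mean_shape + b[0] @ P.T
print("residual:", np.linalg.norm(reconstructed - shapes[0]))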
In this chapter I describe ongoing research that seeks to provide a common framework for the generation and interpretation of spontaneous gesture in the context of speech. I present a testbed for this framework in the form of a program that generates speech, gesture, and facial expression from underlying rules specifying (a) what speech and gesture are generated on the basis of a given communicative intent, (b) how communicative intent is distributed across communicative modalities, and (c) where one can expect to find gestures with respect to the other communicative acts. Finally, I describe a system that has the capacity to interpret communicative facial, gestural, intonational, and verbal behaviors.
Introduction
I am addressing in this chapter one very particular use of the term “gesture” – that is, hand gestures that co-occur with spoken language. Why such a narrow focus, given that so much of the work on gesture in the human-computer interface community has focused on gestures as their own language – gestures that might replace the keyboard or mouse or speech as a direct command language? Because I don't believe that everyday human users have any more experience with, or natural affinity for, a “gestural language” than they have with DOS commands. We have plenty of experience with actions and the manipulation of objects. But the type of gestures defined by Väänänen & Böhm (1993) as “body movements which are used to convey some information from one person to another” are in fact primarily found in association with spoken language (90% of gestures occur in the context of speech, according to McNeill, 1992).
This chapter describes several probabilistic techniques for representing, recognizing, and generating spatiotemporal configuration sequences. We first describe how such techniques can be used to visually track and recognize lip movements in order to augment a speech recognition system. We then demonstrate additional techniques that can be used to animate video footage of talking faces and synchronize it to different sentences of an audio track. Finally we outline alternative low-level representations that are needed to apply these techniques to articulated body gestures.
Introduction
Gestures can be described as characteristic configurations over time. While uttering a sentence, we express very fine-grained verbal gestures as complex lip configurations over time, and while performing bodily actions, we generate articulated configuration sequences of jointed arm and leg segments. Such configurations lie in constrained subspaces, and different gestures are embodied as different characteristic trajectories in these constrained subspaces.
We present a general technique called Manifold Learning, which is able to estimate such constrained subspaces from example data. This technique is applied to the domains of tracking, recognition, and interpolation. Characteristic trajectories through such spaces are estimated using Hidden Markov Models. We show the utility of these techniques in the domain of visual-acoustic recognition of continuously spelled letters.
We also show how visual-acoustic lip and facial feature models can be used for the inverse task: facial animation. For this domain we developed a modified tracking technique and a different lip interpolation technique, as well as a more general decomposition of visual speech units based on visemes.
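As a rough, self-contained illustration of the pipeline described above (linear subspace estimation followed by Hidden Markov Model scoring of trajectories), the following sketch uses random stand-in data and toy HMM parameters; the chapter's actual models and features are of course richer.

# Sketch: learn a constrained subspace from example configurations (PCA)
# and score a trajectory through that subspace with the HMM forward
# algorithm.  All data and HMM parameters below are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
examples = rng.normal(size=(500, 30))          # e.g. lip-contour features
mean = examples.mean(axis=0)
_, _, Vt = np.linalg.svd(examples - mean, full_matrices=False)
basis = Vt[:3]                                  # 3-dim constrained subspace

def project(x):
    return basis @ (x - mean)                   # embed a configuration

# A 2-state Gaussian HMM over the subspace coordinates (toy parameters).
A = np.array([[0.9, 0.1], [0.2, 0.8]])          # state transitions
pi = np.array([0.5, 0.5])                       # initial distribution
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = 1.0

def gauss(o, m):
    # Isotropic 3-D Gaussian density.
    return np.exp(-0.5 * np.sum((o - m) ** 2) / var) / (2 * np.pi * var) ** 1.5

def log_likelihood(sequence):
    """Scaled forward algorithm: log P(observation sequence | model)."""
    alpha = pi * np.array([gauss(sequence[0], m) for m in means])
    log_l = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in sequence[1:]:
        alpha = (alpha @ A) * np.array([gauss(o, m) for m in means])
        log_l += np.log(alpha.sum())
        alpha /= alpha.sum()                    # rescale to avoid underflow
    return log_l

trajectory = [project(x) for x in rng.normal(size=(10, 30))]
print("log-likelihood:", log_likelihood(trajectory))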
Using computers to watch human activity has proven to be a research area that not only has a large number of potentially important applications (in surveillance, communications, health, etc.) but has also led to a variety of new, fundamental problems in image processing and computer vision. In this chapter we review research conducted at the University of Maryland during the past five years on various topics involving the analysis of human activity.
Introduction
Our interest in this general area started with consideration of the problem of how a computer might recognize a facial expression from the changing appearance of the face displaying the expression. Technically, this led us to address the problem of how the non-rigid deformations of facial features (eyes, mouth) could be accurately measured even while the face was moving rigidly.
In Section 10.2 we discuss our solution to this problem. Our approach, in which the rigid head motion is estimated and used to stabilize the face so that the non-rigid feature motions can be recovered, naturally led us to consider the problem of head gesture recognition. Section 10.3 discusses two approaches to the recognition of head gestures, both of which employ the rigid head motion descriptions estimated in the course of recognizing expressions.
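A minimal sketch of this stabilize-then-measure idea, assuming standard OpenCV primitives rather than the authors' specific estimator: track sparse points to recover the rigid (similarity) head motion, warp the current frame back into the reference frame, and treat the remaining dense flow as non-rigid feature motion. Frame file names and parameters are placeholders.

# Sketch of rigid stabilization followed by non-rigid motion measurement.
# Assumes OpenCV; frame file names and parameter values are placeholders.
import cv2
import numpy as np

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Track a sparse set of points to estimate the rigid head motion.
pts_prev = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01,
                                   minDistance=7)
pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts_prev, None)
good_prev = pts_prev[status.ravel() == 1]
good_curr = pts_curr[status.ravel() == 1]

# Robustly fit a rotation+translation+scale model (the rigid head motion).
M, inliers = cv2.estimateAffinePartial2D(good_curr, good_prev,
                                         method=cv2.RANSAC)

# Warp the current frame back into the previous frame's coordinates,
# cancelling the rigid motion ("stabilizing" the face).
stabilized = cv2.warpAffine(curr, M, (prev.shape[1], prev.shape[0]))

# Residual dense flow now reflects mainly non-rigid feature deformation
# (eyes, mouth), which is what expression recognition needs.
residual = cv2.calcOpticalFlowFarneback(prev, stabilized, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
print("mean residual motion:", np.abs(residual).mean())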
The ability to track a face in a video stream opens up new possibilities for human computer interaction. Applications range from head gesture-based interfaces for physically handicapped people, to image-driven animated models for low bandwidth video conferencing. Here we present a novel face tracking algorithm which is robust to partial occlusion of the face. Since the tracker is tolerant of noisy, computationally cheap feature detectors, frame-rate operation is comfortably achieved on standard hardware.
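Tolerance to noisy detectors and partial occlusion is typically obtained by fitting the face model robustly, so that gross outliers among the detected features are simply voted down. The fragment below is a deliberately minimal RANSAC-style illustration of that general idea with made-up points; it is not the tracking algorithm of this chapter.

# Minimal RANSAC-style sketch of fitting a face model to noisy feature
# detections, illustrating tolerance to outliers and partial occlusion.
import numpy as np

rng = np.random.default_rng(3)
model = np.array([[0, 0], [30, 0], [15, 20], [15, 40.]])   # eyes, nose, mouth
true_offset = np.array([120.0, 80.0])
detections = model + true_offset + rng.normal(scale=1.0, size=model.shape)
detections[3] += rng.normal(scale=40.0, size=2)   # gross outlier (e.g. occluded)

best_offset, best_inliers = None, -1
for i in range(len(model)):               # hypothesize from one correspondence
    offset = detections[i] - model[i]
    residuals = np.linalg.norm(detections - (model + offset), axis=1)
    inliers = int((residuals < 5.0).sum())
    if inliers > best_inliers:
        best_inliers, best_offset = inliers, offset

print("estimated head position:", best_offset, "inliers:", best_inliers)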
Introduction
The ability to detect and track a person's face is potentially very powerful for human-computer interaction. For example, a person's gaze can be used to indicate something, in much the same manner as pointing. One can envisage a window manager which automatically shuffles to the foreground whichever window the user is looking at [153, 152]. Gaze aside, the head position and orientation can be used for virtual holography [14]: as the viewer moves around the screen, the computer displays a different projection of a scene, giving the illusion of holography. Another application lies in low-bandwidth video conferencing: live images of each participant's face can be used to guide a remote, synthesised “clone” face which is viewed by the other participants [180, 197]. A head tracker could also provide a very useful computer interface for physically handicapped people, some of whom can only communicate using head gestures. With an increasing number of desktop computers being supplied with video cameras and framegrabbers as standard (ostensibly for video mail applications), it is becoming both useful and feasible to track the computer user's face.
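As an illustration of the virtual holography idea mentioned above, the toy sketch below re-projects a few 3D points through a simple pinhole model from whatever head position the tracker reports; a real system would use a proper off-axis viewing frustum, and all numbers here are made up.

# Sketch of "virtual holography": re-project a 3D scene from the tracked
# head position so the on-screen view changes as the user moves.
import numpy as np

scene = np.array([[0.0, 0.0, 5.0],      # 3D points "behind" the screen
                  [1.0, 0.5, 6.0],
                  [-1.0, -0.5, 7.0]])

def project(points, head_pos, focal=800.0):
    """Project scene points as seen from the viewer's head position."""
    rel = points - head_pos                   # scene relative to the eye
    return focal * rel[:, :2] / rel[:, 2:3]   # simple pinhole projection

print(project(scene, head_pos=np.array([0.0, 0.0, 0.0])))
print(project(scene, head_pos=np.array([0.2, 0.0, 0.0])))   # head moved right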
Present day Human–Computer Interaction (HCI) revolves around typing at a keyboard, moving and pointing with a mouse, selecting from menus and searching through manuals. These interfaces are far from ideal, especially when dealing with graphical input and trying to visualise and manipulate three-dimensional data and structures. For many, interacting with a computer is a cumbersome and frustrating experience. Most people would prefer more natural ways of dealing with computers, closer to face-to-face, human-to-human conversation, where talking, gesturing, hearing and seeing are part of the interaction. They would prefer computer systems that understand their verbal and non-verbal languages.
Since its inception in the 1960s, the emphasis of HCI design has been on improving the “look and feel” of a computer by incremental changes: producing keyboards that are easier to use, and graphical input devices with bigger screens and better sound. The advent of low-cost computing and memory, and of computers equipped with tele-conferencing hardware (including cameras mounted above the display), means that video and audio input are available at little additional cost. It is now possible to conceive of more radical changes in the way we interact with machines: of computers that listen to and look at their users.
Progress has already been achieved in computer speech synthesis and recognition [326]. Promising commercial products already exist that allow natural speech to be digitised and processed by computer (32 kilobytes per second) for use in dictation systems.
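For context, the quoted figure of 32 kilobytes per second is consistent with, for instance, 16 kHz sampling at 16 bits per sample; the parameters below are an assumption used only to show the arithmetic.

# Assumed digitisation parameters that reproduce the quoted data rate.
sample_rate_hz = 16_000        # 16 kHz sampling (assumption)
bytes_per_sample = 2           # 16-bit samples (assumption)
print(sample_rate_hz * bytes_per_sample, "bytes per second")   # 32000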
This chapter introduces the Human Reader project and some research results on human-machine interfaces based on image sequence analysis. Real-time responsive, multimodal gesture interaction, a capability that is not easily achieved, is investigated. Primary emphasis is placed on real-time responsiveness for head and hand gestural interaction as applied to the project's Headreader and Handreader. Their performance is demonstrated in experimental interactive applications, the CG Secretary Agent and the FingerPointer. Next, we focus on facial expression as a rich source of nonverbal messages. A preliminary experiment in facial expression research using an optical-flow algorithm is introduced to show what kind of information can be extracted from facial gestures. Real-time responsiveness is left to subsequent research work, some of which is introduced in other chapters of this book. Lastly, new directions in vision-based interface research are briefly addressed based on these experiences.
Introduction
Human body movement plays a very important role in our daily communications. Such communications include not merely human-to-human interactions but also interactions between humans and computers (and other inanimate objects). We can easily infer people's intentions from their gestures. I believe that a computer possessing “eyes”, in addition to a mouse and keyboard, would be able to interact with humans in a smooth, enhanced and well-organized way by using visual input information. If a machine can sense and identify an approaching user, for example, it can load his/her personal profile and prepare the necessary configuration before he/she starts to use it.
Bayesian approaches have enjoyed a great deal of recent success in their application to problems in computer vision (Grenander, 1976-1981; Bolle & Cooper, 1984; Geman & Geman, 1984; Marroquin et al., 1985; Szeliski, 1989; Clark & Yuille, 1990; Yuille & Clark, 1993; Madarasmi et al., 1993). This success has led to an emerging interest in applying Bayesian methods to modeling human visual perception (Bennett et al., 1989; Kersten, 1990; Knill & Kersten, 1991; Richards et al., 1993). The chapters in this book represent to a large extent the fruits of this interest: a number of new theoretical frameworks for studying perception and some interesting new models of specific perceptual phenomena, all founded, to varying degrees, on Bayesian ideas. As an introduction to the book, we present an overview of the philosophy and fundamental concepts which form the foundation of Bayesian theory as it applies to human visual perception. The goal of the chapter is two-fold: first, to serve as a tutorial on the basics of the Bayesian approach for readers who are unfamiliar with it, and second, to characterize the type of theory of perception the approach is meant to provide. The latter topic, by its meta-theoretic nature, is necessarily subjective. This introduction represents the views of the authors in this regard, not necessarily those held by other contributors to the book.
First, we introduce the Bayesian framework as a general formalism for specifying the information in images which allows an observer to perceive the world.
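In its simplest form, that formalism is Bayes' rule applied to scene interpretation; the notation below is a generic statement of the idea rather than a formula quoted from the book:

P(S \mid I) = \frac{P(I \mid S)\, P(S)}{P(I)} \;\propto\; P(I \mid S)\, P(S), \qquad \hat{S} = \arg\max_{S} P(I \mid S)\, P(S),

where I denotes the image data, S a candidate scene description, P(I \mid S) the likelihood supplied by a model of image formation, and P(S) the prior probability of scenes; the maximum a posteriori estimate \hat{S} is one natural choice of percept.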