Introduction
The books Hearing By Eye (Dodd and Campbell 1987) and Speech Perception by Ear and Eye (Massaro 1987) were the first volumes that considered speechreading as a psychological process of interest beyond its direct applications in hearing loss and deafness (see, for example, Jeffers and Barley 1971). Eight years later, David G. Stork, Christian Benoît, and N. Michael Brooke organized the landmark NATO workshop “Speechreading by Man and Machine: Models, Systems and Applications.” The workshop was “the first forum on the interdisciplinary study of speechreading (lipreading) – production, perception and learning by both humans and machines.” This workshop was followed by several volumes (Stork and Hennecke 1996; Campbell et al. 1998; Massaro 1998b) and was undoubtedly a major step towards the design of an audiovisual (AV) speech processing community: you will find in this volume numerous references to the series of subsequent AVSP workshops (Rhodes 1997; Terrigal 1998; Santa Cruz 1999; Scheelsminde 2001; St-Jorioz 2003; Vancouver 2005; Hilvarenbeek 2007; Tangalooma 2008; Norwich 2009; Hakone 2010), sponsored first by the European Speech Communication Association and then by the AVSP Special Interest Group of the International Speech Communication Association, both bodies in which Christian Benoît constantly promoted AV speech processing. These workshops, together with dedicated workshops (see, for example, the AV speech recognition workshop organized in 2000 by Chalapathy et al.) and special sessions at international conferences, have fostered the development of innumerable lines of AV speech research.
The book is divided into four main parts, although most chapters address most of these questions.
The first part of the book is largely devoted to AV speech perception and to two main questions concerning human AV performance: how and where (in the brain) auditory (A) and visual (V) signals combine to access the mental lexicon. Although speech can be perceived by vision alone (i.e., via lipreading/speechreading) and visual speech perception (Bernstein) can provide sufficient phonetic information to access the mental lexicon, talking faces constitute a major part of an infant’s perceptual experience: through the process of watching and listening while people talk to them and point out objects of the world, infants have the opportunity to attribute semantics to the sounds they hear. Developmental studies (Burnham and Sekiyama) can thus contribute to explaining how auditory and visual information combine. Idiosyncrasies of human brain circuitry also hold clues to the evolution and development of human language, and to its accessibility by eye and ear (Campbell and MacSweeney). One commonly accepted tenet of this intersensory integration is that the two signals carry both complementary and redundant information on the phonetic properties of the original message. The striking observation, however, is that integration amounts to more than taking the best of both worlds: AV perception can recover properties that are carried by neither modality alone (Remez). Some answers to this puzzle may be found in a more intimate intersensory integration at the signal level, notably one that exploits the dynamic aspects of both signals (Lander and Bruce), which are in fact audible and visible traces of the same articulatory gestures.
The second part of the book is dedicated to the production and perception of visible speech, i.e. speech movements. We have access to dynamic AV information (Lander and Bruce) through the acoustic and aerodynamic consequences of the motion of the speech articulators. A production-aware, “grounded” perception can benefit from the availability of sensorimotor maps (which may or may not include dynamic representations), which have proven useful for the control of most biological movements. This intersensory integration is necessary not only for perception (and therefore comprehension) but also for movement learning and control (Cathiard et al.). Accurate descriptions and models of the coordinative structures linking the activations of the different speech segments are thereby necessary: for instance, Beautemps, Cathiard et al. describe the coordination between hand and vocal tract motions in manual cued speech.
The third part of the book presents some of the latest developments in AV speech processing by machines, particularly in AV speech recognition and synthesis (Brooke and Scott). In parallel with the development of AV research, computer-generated facial animation (Parke and Waters 1996) has attracted considerable attention and made considerable progress. Areas of application have extended beyond the traditional field of the animation and games industry to address more challenging applications where the metaphor of face-to-face conversation is applied to human–computer interfaces (Cassell et al. 2000). Of course, we are still a long way from building a computer which can carry on a face-to-face conversation with a human and which can pass a face-to-face Turing test, i.e. one whose computed behavior cannot be distinguished from natural human interaction. However, noticeable progress has been made in giving computers the “gift” of AV speech (Brooke and Scott). AV speech recognition outperforms acoustic-only speech recognition, especially for degraded speech (Potamianos, Chalapathy et al.), while the realism of facial animation has been drastically improved by image-based speech synthesis (Ezzat et al.; Slaney and Bregler). The chapter by Massaro et al. on the Baldi talking head and its further developments for virtual speech tutoring concludes this part by reporting experimental work on augmented communication.
The fourth part focuses on the nature of the information related to oro-facial gestures (head, vocal tract, and face movements), which is necessary to enable an efficient contribution of the visual component in the audiovisual processing of speech. Bateson and Munhall’s approach is based on experimental studies of the perception of multimodal natural and synthetic stimuli in which various characteristics are either degraded or carefully preserved. Bailly, Badin et al. use a modeling approach based on a careful analysis of real speakers’ data to study the main degrees of freedom of the speech production system and their impact on the audiovisual perception of speech.
This work has been accomplished because a body of researchers is now working on the various aspects of audiovisual speech processing. Most of this synergy is due to the research field itself, in which the majority of the paradigms of unimodal speech research have been questioned and renewed. Part of this synergy is also due to the communicative enthusiasm of researchers such as Christian Benoît.
Scientific outcomes of multimodal speech communication studies are numerous and cover a broad scope. We acknowledge that little is said about them in this book. Indeed, it was our decision to focus mainly on basics. However, we would like to mention one of the most exciting current outcomes: face-to-face speech communication. Interaction loops between the production and perception of speech and gestures are at the core of this aspect of human communication, transmitting via multimodal signals parallel information about what the interlocutors say, what they think about what they say, and how they feel when they say it. Convergence or imitation phenomena, which are at the core of the L1 and L2 learning processes in babies and adults, result from this interaction. Face-to-face communication studies require integrating all the mechanisms of embodied speech production and audiovisual speech communication, and combining them with the social and physical interactions between humans and between humans and their environment (see notably the extended papers of presentations discussed at two workshops organized in Grenoble: Abry et al. 2009; Dohen et al. 2010). The quest for neurophysiological and behavioral correlates of these sensorimotor loops constitutes an exciting research program that would certainly have attracted Christian Benoît’s attention and titillated his insatiable curiosity.