8 Visual and audiovisual synthesis and recognition of speech by computers
8.1 Overview
It is now a little while since the authors last worked in this area. One (NMB) has now retired, having entered the field when only a handful of researchers around the world were working in visual speech; the other (SDS), who has now moved into new areas of work, became involved in the field when statistical methods were becoming increasingly powerful and useful. During the period of their research activity, the power, speed, and capabilities of computer systems and computer graphics were increasing very rapidly and their cost was simultaneously falling. There was consequently a shift away from the identification of facial features whose movements could be used to represent and model the visual cues to speech and towards the processing and use of facial images themselves. The increased complexity and volume of data that had to be handled was offset by using statistically based methods to identify and represent those characteristics of the images that might be applied to synthesize and recognize visual speech events. Some of these studies also suggested that a relatively small set of parameters might characterize the dimensionality of the space that separated specific speech events, though their physical and anatomical nature generally remained somewhat obscure. During the same period, the growth of the Internet and of local networks has generated new applications for audiovisual speech synthesis and recognition whilst at the same time eliminating others.
There are many new techniques for exploring human speech production and for developing new approaches to audiovisual processing, including, for example, fMRI. On the other hand, there are other areas that remain relatively unchanged or intractable. Thus, for example, there is still no comprehensive catalogue of facial features or points whose behaviour can fully define, or, alternatively, distinguish the set of visual speech events. Speakers differ greatly in their visible articulatory gestures and this is one of the main obstacles to progress. However, some speakers’ visible gestures are known to be easier to understand than others and this may offer a way forward. The paralinguistic gestures, involving the face, head, and body are still incompletely understood, despite the advances that have been made in this area. For example, as this chapter indicates, very simple modelling appears to be very successful in synthesizing the visual effects of speech embodying major head movements. Like the lip gestures themselves, however, many paralinguistic gestures may be small, short-lived, and subtle. The problem in the visual domain is compounded by the partial or intermittent invisibility of some of the visual features, which only complex, intrusive, and expensive methods, like micro-beam X-radiography, can reveal fully. Additionally, there may be significant difficulties in relating surface activity on the face to the underlying musculature and its changes. Measurements of the visible movements of particular, identifiable points at the skin surface, for example, around the lips themselves, are unlikely to bear a simple relationship to the underlying muscular changes.
Visual coarticulation effects are known to exist in speech production. The integration of the audio and visual signals, especially in recognition, also remains an interesting area, because it is clear that there can be modal asynchronies between the audio and the visual signals which may be complex and temporally extended, certainly beyond phone boundaries and possibly further. Both phenomena are likely to be a continuing focus for attention. One of the most successful early models for recognition that takes account of audiovisual asynchrony is described in this chapter. Conversely, the synthesis of facial gestures using an acoustic speech signal as the driver also presents interesting challenges, since it can lead to one-to-many mappings. That is, a single acoustic output may arise from several possible articulatory gestures. In spite of this, it is possible to associate at least some acoustic speech signals with facial measurements. At the same time, the existence of the McGurk effect reinforces the importance of ‘getting it right’, since mismatched audio and video cues can convey a third, quite different, audiovisual percept.
One major advance for research into visual and audiovisual speech synthesis and recognition is the greatly increased availability of well-defined and agreed corpora of speech data that can be used to compare and assess objectively the performance of synthesizers and recognizers. This marks a very great advance on the situation that pertained when the very early TULIPS corpus appeared. There are, however, still major issues concerning the detailed specification of corpora for particular purposes. For example, the effects of different lighting conditions and angles of view, as well as the controlled construction of multi-speaker corpora, in particular, have yet to be fully worked out. Performance measures for synthesis and recognition are also still developing; is it, for example, still appropriate to choose human performance as a measurement baseline, as suggested in this chapter?
The current chapter is concerned to a large extent with the work from the early stages of the field to the point in the new millennium at which significant advances were being made using image data and statistical techniques. It is primarily intended to indicate, therefore, how we got to be where we are now. Some of the issues and problems have now been resolved and others have not. This chapter is therefore intended to complement the other contributions to this volume; it is for the reader to determine which issues remain to be dealt with and which have now been either partially or, indeed, fully resolved. There is still much that needs to be done, including, importantly, continuing analytical experiments in audiovisual speech perception that will support the development of synthesizers and recognizers. It is offered with the hope and wish that our successors will be successful in their endeavours.
8.2 The historical perspective
8.2.1 Visual speech processing and early approaches
One of the prime motivations for the processing of visual speech signals arose from the need to investigate and understand better the nature of speechreading, so that the rehabilitation of the hearing-impaired could be improved. The face is the visible, external termination of the human speech production system, whose articulatory gestures can convey useful cues to production events and, in particular, to the place of articulation of speech sounds. Although the visible gestures by no means uniquely identify individual speech events (Summerfield 1987), the benefits of seeing the face of a speaker, especially where there is noise or hearing-impairment (Sumby and Pollack 1954; Erber 1975), can nonetheless be worthwhile, as the hearing-impaired have of course known for many years. The benefit has been estimated to be equivalent to an increase in the signal-to-noise ratio of about 10–12 dB when identifying words in a sentence uttered against a noise background (MacLeod and Summerfield 1987).
Studying speechreading was severely constrained by the inability to carry out analytical investigations using controlled continua of natural stimuli. Speakers are not able to vary the articulatory gestures of speech in a controllable way and some experiments may even require the presentation of stimuli that cannot be naturally produced by a human speaker. An example of the latter was an experiment to investigate the role of the teeth in the visual identification of the vowel in a range of /bVb/ utterances, using stimuli that were visually identical apart from the presence or absence of the teeth (McGrath et al. 1984; Summerfield et al. 1989). Animated computer graphics displays of speech movements of the face are in principle capable of overcoming all of these difficulties. One very early attempt at visual speech synthesis used Lissajous figures displayed on an oscilloscope to simulate lip movements and was probably also the first to be driven by a speech signal (Boston 1973). Controllable syntheses could, however, be achieved more flexibly and easily by using computer graphics simulations of talking faces. The ready availability of relatively cheap, general purpose mini-computers that were powerful enough to create such computer graphics prompted the development of some of the earliest visual speech synthesizers (e.g. Brooke 1982; Montgomery and Hoo 1982; Brooke 1989).
Most of the early visual speech synthesizers were simple vector graphics animated displays of outline diagrams of the main facial articulators, namely, the lips, teeth, and jaw. Very high performance computers were prohibitively expensive for all but the most specialized applications and most general purpose computers were capable only of approximately one million operations per second. The generation of up to ten thousand separate vectors per second, which was the rate needed to create the early visual speech synthesizers, therefore represented a fairly severe requirement. Although it was well known at the time that the teeth and tongue could convey important visual cues, for example, to the identity of vowels, these articulators were difficult to simulate with vector graphics. The teeth posed a problem because they were only intermittently or partially visible and required an effective hidden-line removal algorithm in order to be displayed accurately. The tongue was almost impossible to display because it usually appeared as an indistinct area, rather than as an outline shape. Raster graphics systems were able to generate much more realistic, rendered displays of the human head, but were very expensive (typically ten times the cost of a vector graphics system). In addition, fully rendered animated displays required significantly longer and were significantly more complex to generate than vector graphics displays, which could themselves take over 20 times realtime for the creation of short ‘copy’ syntheses (syntheses in which the gestures were driven directly from measurements of a human utterance). Nonetheless, a forerunner of the modern visual speech synthesizers had already been developed by the mid 1970s (Parke 1975) and the approach was quickly developed to display facial expressions (e.g. Platt and Badler 1981). The state of visual speech synthesis in the early 1990s has been more fully reviewed elsewhere (Brooke 1992a, 1992b).
The apparently modest gain that results from seeing a speaker’s facial gestures when speech is uttered in a noisy background acquires greater significance when it is realized that changes in the signal-to-noise ratio of a speech signal from −6 dB to +6 dB represent an improvement in word intelligibility from about 20 percent to 80 percent (e.g. MacLeod and Summerfield 1987). Acoustic cues to the place of articulation tend to be easily destroyed by acoustic noise because they are dependent upon phonetic context and upon low intensity, short duration signals usually associated with fine spectral details at the higher frequencies. Acoustic cues to the manner of articulation, however, tend to be associated with relatively slowly changing, spectrally strong features of the acoustic signal at the lower frequencies that are resistant to corruption by noise (Summerfield 1987). The acoustic and visual signals of speech therefore tend to complement each other so that, for example, if both are used together in speech recognition, speech intelligibility in noise should be enhanced. Indeed, human audiovisual speech recognition performance is better than either audio or visual performance in acoustically noisy conditions (Adjoudani and Benoît 1996). This immediately suggests an important potential application for the enhancement of conventional, automatic speech recognition in noisy environments. Automatic visual speech recognition began to develop in the 1980s, when image capture hardware and software capable of capturing the dynamic facial gestures from successive frames of film or video recordings became available. One of the earliest devices was Petajan’s visual recognition system, which used special-purpose hardware to capture binary black and white images of the oral region in realtime (Petajan 1984). This device typified many later recognition systems in using image data to extract a relatively small number of time-varying facial features such as the width and height of the lip aperture and the area of the oral cavity. In the 1980s these were used as templates for matching the characteristics of test utterances with libraries of reference templates from the utterances forming the vocabulary of the recognizer (Petajan et al. 1988a, 1988b). The main objective of the early prototypical recognition systems was to establish the benefits that were available from the use of visual speech signals. As Section 8.4.4 shows, this issue remains incompletely resolved.
8.2.2 Digital image processing and the data-driven approach
One of the main challenges facing automatic visual speech processing has been to develop the ability to process the quantity of information represented by continuous sequences of moving images within a useable time scale. The synthesis of TV-quality, full-screen colour images, for example, requires the generation of millions of bits of information per second. The data processing rate needed for image capture and analysis was also a problem in early recognition systems and was usually dealt with by extracting feature information from oral images, even when special-purpose hardware began to be available for image capture in realtime (e.g. Petajan 1984). However, it is not always possible to describe all of the important visible information in terms of simple parameters and, despite some early studies (Finn 1986), there is still no comprehensive catalogue describing the facial features that are relevant to speechreading.
By the second half of the 1980s digital image processing equipment was becoming more widely accessible and processor speeds were rapidly increasing through the megahertz range. Fully rendered facial images displayed on raster graphics devices had supplanted the earlier vector graphics and several software packages had been implemented (e.g. Yau and Duffy 1988; Saintourens et al. 1990; Terzopoulos and Waters 1990). It also became possible to process images directly. For the first time, data-driven synthesis and, more especially, recognition that did not involve pre-selection of relevant features became a real option. For example, visual speech recognizers were reported that used automatic image processing to follow movements of the lower facial region by detecting the differences between successive images, or by computing optical flows (e.g. Nishida 1986; Pentland and Mase 1989).
One early application of image processing was the use of chroma-key methods to extract lip parameters from video recordings of the face of a speaker, as described in Section 8.3.1 below. Another study from the same period involved the processing of video recordings of the oral region of a speaker enunciating five long vowels in a /hVd/ context. The image sequences were processed to create moving images of the speaker at different spatial and grey-level resolutions (Brooke and Templeton 1990). These were used as visual stimuli in a vowel identification experiment. For this very simple vowel set, the results suggested that images of low resolution (about 16 × 16 pixels and eight grey levels) adequately captured the essential visual cues to vowel identity. The experiment was helpful in setting an approximate lower bound on the amount of data that images had to embody if speechreading was to be possible. To capture even these minimal, low-resolution monochrome video images at the standard 25 frames per second implied the storage of about 20 000 bits per second. For both synthesis and recognition, there is also a lower limit to the frame rate beneath which significant visual information is lost (Pearson and Robinson 1985; Frowein et al. 1991). Normal video recording rates (25 frames, or 50 fields per second) are above this lower limit.
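As a rough, illustrative sketch (not taken from the studies cited), the fragment below shows the kind of spatial and grey-level reduction involved and the back-of-envelope data rate it implies; the array name and the block-averaging scheme are assumptions, and eight grey levels are taken to occupy three bits per pixel at 25 frames per second.

```python
import numpy as np

def downsample(frame, out_h=16, out_w=16, levels=8):
    """Reduce a monochrome frame (values 0-255, dimensions assumed divisible
    by the target size) to a coarse spatial and grey-level resolution."""
    h, w = frame.shape
    # Simple block averaging down to the target resolution.
    small = frame.reshape(out_h, h // out_h, out_w, w // out_w).mean(axis=(1, 3))
    # Quantize to the requested number of grey levels.
    return np.floor(small / 256.0 * levels).clip(0, levels - 1)

# Implied storage rate for 16 x 16 pixels, 8 grey levels (3 bits), 25 frames/s:
bits_per_second = 16 * 16 * 3 * 25   # = 19 200, i.e. roughly the 20 000 bits/s quoted above
print(bits_per_second)
```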
The greater computing power of the 1980s also marked the emergence of new, statistically oriented data-processing methods that were proving extremely successful in pattern-recognition tasks, including conventional acoustic speech recognition. The techniques included the so-called Artificial Neural Networks (ANNs) and the class of finite state machines known as hidden Markov models. These have been described elsewhere (e.g. Rabiner 1989b; Beale and Jackson 1990). Both are well suited to data-driven processing as they can build an internal description of images of a speaker’s visible gestures without needing a detailed description or understanding either of the nature of the gestures, or how they arise. Images can be treated at the lowest level purely as arrays of pixels.
One form of ANN, the Multi-Layer Perceptron (or MLP), was swiftly applied to visual speech processing. MLPs consist of layers of rather simple processing units. Every unit at each layer generates inputs to all the units of the layer above and the connections are weighted. These weightings, together with the bias parameters that typically define the properties of each of the processing units, can be adjusted. Given a starting set of random values, the parameters can be successively refined by a training process. In a typical training process, many examples of ‘labelled’ patterns are presented as input to the first layer of units and the MLP parameters are iteratively adjusted, using a standard algorithm, until the output units generate the correct labels for each input pattern. The output units usually form an encoder in which, conceptually, one unit produces an output for each specific label and all other units give no output. MLPs can thus ‘learn’ to generate internal mappings between sets of inputs and outputs during the training process (e.g. Elman and Zipser 1986). One very early use of a three-layer MLP in visual speech processing was to find a mapping between single, 16 × 16 pixel, monochrome images of the oral region of a speaker, captured at the nuclei of the three vowels that lie at the corners of the vowel triangle, and the corresponding vowel classes. Once trained, the MLP was used to determine the vowel class labels for previously unseen test images (Peeling et al. 1986). Only two processing units were needed in the intermediate layer of the trained MLP. Since this layer is a gateway through which input data passes to the output units, it effectively holds an encoded internal representation of the vowel images and this result suggested that the essential visual cues to vowel identity could be captured by a very small number of parameters. A second MLP-based experiment shortly afterwards (Brooke and Templeton 1990) showed that a machine with just six processing units in the intermediate layer could be trained to identify vowels using 16 × 12 pixel monochrome images of a speaker captured at the nuclei of the eleven non-diphthongal British English vowels in a /bVb/ context. Overall performance was 91 percent correct vowel identification and, for the worst case individual vowel class, 84 percent correct identification. Once again, the real importance of these results was the indication that visual cues to vowel identity could be retained in images of relatively low resolution and, furthermore, that additional significant gains in image compression could be gained by noting that the essential cues could be internally encoded by the MLP with a very small number of parameters.
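The following is a minimal sketch, not a reconstruction of the original experiments, of how such a three-layer perceptron with a small intermediate layer might be set up today; the scikit-learn classifier, the random placeholder data, and the sample counts are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# images: N x (16*12) flattened monochrome oral images (placeholder data);
# labels: the vowel class for each image captured at the vowel nucleus.
rng = np.random.default_rng(0)
images = rng.random((440, 16 * 12))          # stand-in for real training images
labels = np.repeat(np.arange(11), 40)        # eleven non-diphthongal vowel classes

# Three-layer perceptron with a six-unit intermediate ('bottleneck') layer,
# trained by iterative weight adjustment (back-propagation).
mlp = MLPClassifier(hidden_layer_sizes=(6,), activation='logistic',
                    max_iter=2000, random_state=0)
mlp.fit(images, labels)

# Once trained, the network assigns vowel-class labels to unseen test images.
predicted = mlp.predict(images[:5])
```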
Whilst MLPs suggested the possibility of further image compression, they were not particularly stable coders; small changes in images could produce large changes in the MLPs’ internal representations and vice versa. A well-established statistical technique, Principal Components Analysis, or PCA (Flurry 1988), proved more stable and was potentially at least as efficient in compressing image data as MLPs (Anthony et al. 1990). PCA transforms a pattern space into a new space in which as much of the variance of the original data as possible is accounted for by as small a number of axes, or principal components, as possible. Images of the articulatory movements of the face, which is a highly constrained anatomical structure, should show a high degree of structuring. One of the earliest examples of this approach to facial image encoding was reported in 1991 (Turk and Pentland 1991). An independent preliminary experiment also originally reported in 1991 (Brooke and Tomlinson 2000) established the validity of using PCA to compress and encode monochrome oral images of a speaker’s gestures. When PCA was performed on approximately 15 000 monochrome oral images of a speaker uttering digit triples, captured at a resolution of 32 × 24 pixels, it was found that about 80 percent of the variance was captured by just 15 principal components (the uncompressed 32 × 24 pixel monochrome images correspond to points in a 768-dimensional space). Given a representative set of training images, it is possible to use the computed principal components to compress similar test images. Results from perceptual tests on the speech readability of monochrome oral images of spoken digit triples reconstructed from a PCA-encoded format showed that the use of more than 15 components did not significantly improve the visual intelligibility of the images (Brooke and Scott 1994b). In one unpublished project (Brooke, Fiske, and Scott), PCA performed on two-dimensional images of the outer lip margins during unrestricted speech production also suggested that two components sufficed to describe their varying contours, a finding consistent with one animated lip model, which uses three parameters to model full three-dimensional movements (Guiard-Marigny et al. 1996). A later, multistage variant of PCA was also successfully developed (Brooke and Scott 1998a). It involved dividing images into sub-blocks, PCA encoding each of the sub-blocks and then applying PCA a second time to the coded versions of all of the sub-blocks. This method was used to compress the data from a corpus of sentence utterances video recorded in colour (see Section 8.3.1 below). PCA has been widely used to reduce the dimensionality of data that arises in visual speech processing (e.g. Welsh and Shah 1992; Goldschen et al. 1996; Vatikiotis-Bateson et al. 1999).
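A minimal sketch of PCA-based image coding along these lines is given below, using a plain singular value decomposition; the array sizes and placeholder data are assumptions, and the 15-component cut-off simply echoes the figure reported above.

```python
import numpy as np

# images: flattened 32 x 24 monochrome oral images, one per row (placeholder data
# standing in for the ~15 000 frames of digit-triple material described above).
rng = np.random.default_rng(0)
images = rng.random((2000, 32 * 24))

mean = images.mean(axis=0)
centred = images - mean

# Principal components from the singular value decomposition of the centred data.
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
explained = (s ** 2).cumsum() / (s ** 2).sum()   # cumulative proportion of variance

k = 15                                   # the study cited found ~80% of variance here
codes = centred @ Vt[:k].T               # 15-dimensional code per image
reconstructed = codes @ Vt[:k] + mean    # approximate image recovered from its code
```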
Another technique used to reduce the dimensionality of video image data for audiovisual speech recognition is discrete cosine transforms. In the work carried out by Potamianos and colleagues (Potamianos et al. 2001a; Gravier et al. 2002b), the size and position of the speaker’s mouth in video fields, captured at a rate of 60 Hz, was estimated using face-tracking algorithms. The portions of the video fields containing the mouth were then extracted and size-normalized to 64 × 64 pixels. A two-dimensional, separable discrete cosine transform (DCT) was applied to these normalized mouth images and the twenty-four highest-energy DCT coefficients were stored for each image. Linear interpolation was then applied to the stored features to derive sets of visual coefficients that corresponded to the rate of the audio coefficients (100 Hz). Finally, changes in lighting conditions were dealt with by applying feature mean normalization to the sets of linearly interpolated coefficients. The reader can find in Potamianos et al. (this volume) recognition experiments where this DCT technique is compared to PCA.
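A hedged sketch of this processing chain might look as follows; it assumes SciPy’s DCT routine, placeholder feature arrays, and per-image selection of the highest-energy coefficients, whereas the published systems typically fix the retained coefficient positions in advance.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(mouth, n_coeffs=24):
    """2-D separable DCT of a size-normalized 64 x 64 mouth image; keeps the
    n_coeffs highest-energy coefficients (selected per image here purely for
    illustration)."""
    coeffs = dct(dct(mouth, norm='ortho', axis=0), norm='ortho', axis=1)
    flat = coeffs.ravel()
    keep = np.argsort(np.abs(flat))[::-1][:n_coeffs]
    return flat[keep]

rng = np.random.default_rng(0)
# Placeholder sequence of 120 mouth images captured at the 60 Hz field rate.
features_60hz = np.stack([dct_features(rng.random((64, 64))) for _ in range(120)])

# Linear interpolation from the 60 Hz field rate to the 100 Hz audio frame rate.
t_video = np.arange(len(features_60hz)) / 60.0
t_audio = np.arange(0.0, t_video[-1], 1.0 / 100.0)
features_100hz = np.stack([np.interp(t_audio, t_video, c)
                           for c in features_60hz.T], axis=1)

# Feature mean normalization to reduce sensitivity to lighting changes.
features_100hz -= features_100hz.mean(axis=0)
```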
Although ANNs can be used to handle time-varying speech patterns (Stork et al. 1992; Bregler et al. 1993; Lavagetto and Lavagetto 1996; Krone et al. 1997), a second type of model, though a very poor representation of the speech production process, has proved to be very successful in speech recognition tasks. It comprises the finite state machines known as hidden Markov models (HMMs). HMMs can embody the time-varying properties of real speech signals and are also able to capture their inherent variability (Rabiner 1989a). HMMs describe each speech event (whether a word, phone or sub-phone element) as a synchronous finite state machine that begins in a starting state then changes state and generates an output pattern for each tick of a clock until it reaches an end state. The properties of the machine are determined by its parameters, which are the set of transition probabilities that govern the likelihood of a particular state succeeding any other, plus a set of state-dependent probability density functions that determine the probability of a particular state generating any one from the set of all possible output patterns. In conventional acoustic speech recognition, each output pattern is a vector that describes the short-term characteristics of the speech signal, for example, as a set of short-term cepstral coefficients. The clock period of HMMs is conventionally equivalent to the frame rate of the short-term observations. The HMM parameters can be computed by presenting the model for each unit in the recognizer’s vocabulary with corresponding examples of real speech signals, in the form of sets of pattern vectors. The model parameters are then iteratively adjusted, using standard algorithms, starting from a set of initial parameter estimates. The recognition process consists in identifying the trained HMM most likely to have generated an unknown test signal. It is possible to build HMMs that represent each phone in the context of its preceding and succeeding phone; these so-called ‘triphone’ models can help to account for coarticulatory effects. Additionally, since the pattern vectors that HMMs use can be modified very easily, HMMs are well suited to visual speech recognition and to audiovisual recognition simply by replacing or augmenting the conventional, acoustic pattern vectors with a suitable vector to describe the visual features of the speech event. Some systems build their pattern vectors from extracted feature values (e.g. Adjoudani and Benoît 1996; Goldschen et al. 1996; Jourlin et al. 1997; Tamura et al. 1998), rather than from the image data itself in a suitably compressed and encoded form, as described earlier in this section and in Section 8.4.4 below.
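As an illustration of how such models score an utterance, the sketch below implements the standard forward algorithm for an HMM with Gaussian state-output densities; the parameter names and the commented recognition step are assumptions rather than a reproduction of any particular system.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(obs, start_p, trans_p, means, covs):
    """Forward algorithm (in the log domain) for an HMM with Gaussian
    state-output densities; obs is a T x D sequence of pattern vectors."""
    n_states = len(start_p)
    # Log-probability of each observation under each state's output density.
    emit = np.array([[multivariate_normal.logpdf(o, means[s], covs[s])
                      for s in range(n_states)] for o in obs])
    alpha = np.log(start_p) + emit[0]
    for t in range(1, len(obs)):
        alpha = emit[t] + np.array([
            np.logaddexp.reduce(alpha + np.log(trans_p[:, s]))
            for s in range(n_states)])
    return np.logaddexp.reduce(alpha)

# Recognition: score the unknown utterance against each trained word/phone model
# (a hypothetical dict mapping labels to parameter tuples) and pick the most likely:
# best = max(models, key=lambda w: log_likelihood(obs, *models[w]))
```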
Both ANNs and HMMs require considerable amounts of training data and a great deal of computational power, especially in the training phase. The resources to deal with these requirements became steadily more abundant during the latter part of the 1980s, which therefore marked a period during which there was an upsurge in visual speech processing. The recognition phase when using trained ANNs and HMMs is relatively light on computing resources. In MLPs, for example, the content and structure of the images presented to a trained machine make no difference to the processing time. Consequently, it became practicable to undertake more demanding applications in audiovisual speech synthesis and recognition as explained in Section 8.4.
8.2.3 Redefining the goals of visual speech processing
By the early 1990s, the benefits of visual speech processing were generally recognized and visual speech processing was moving from pilot studies of techniques towards practicable and challenging applications. In speech recognition, work has been focused on capturing, encoding, and integrating visual speech with its audio component to permit robust recognition in noisy environments where, for example, hands-free control of devices is required. In speech synthesis, work has focused on building interactive applications. One such application is the teaching of speechreading by allowing an instructor to generate training material for private use, with the system adapting the training material presented to a learner in response to the learner’s progress (Cole et al. 1999). Another application is a computer information system controlled by interaction with a complete, computer-generated, virtual humanoid capable of responding to input by the user, via speech and vision (Cassell et al. 2000). Given the importance of the head and face to such applications, they have become the prime foci for recent research.
8.3 Heads, faces, and visible speech signals
The head has a complex three-dimensional structure capable of quasi-rotational global movements in three dimensions relative to the rest of the body. These can be informally characterized as head shaking, head nodding, and head tilting, though there is not a simple centre of rotation. In addition, the head is carried on the body, whose movements are therefore superimposed on those of the head itself. The face is also a complex anatomical structure whose tissues possess great mobility and elasticity. In the simplest terms, the surface layers are underlaid by a complex musculature through which they are attached to the rigid bone of the skull. At least thirteen separate groups of muscles are involved in movements of the lips alone (Hardcastle 1976), but many more muscles are involved in the generation of the full range of facial expressions. The latter have been classified and characterized in terms of the actions of the muscle groups in the Facial Action Coding System (Ekman and Friesen 1978).
Not only are the visible gestures of the primary speech articulators therefore complex and subtle, they are normally accompanied by many secondary gestures of the face and indeed of the body, including changes in facial expression and body posture or movements. These are discussed in Section 8.3.2, below. They can all convey important speech cues; often they have less to do with the phonetic content than with the speaker’s meaning and intention, that is, they are related more to the understanding of speech than its recognition (see, for example, Dittmann 1972; Ekman and Friesen 1978; Ekman 1979; Ekman and Oster 1979). They also play a part in dialogue. Visible, non-verbal cues may even be made in response to a speaker, in order to direct the discourse. An example of this is the quizzical expression that a listener may use to prompt a speaker into giving further information.
However, knowledge about the nature and significance, even of head and body movements, let alone facial expression, during speech production is still far from complete. It is at least conceivable that inappropriate global movements and facial expressions could change the meaning or intention of an utterance in a way similar to that in which mismatched visual and acoustic cues to phonetic content can induce changed percepts (McGurk and MacDonald 1976). This has important implications for automatic visual speech synthesis, as discussed in Sections 8.4.2 and 8.4.4 below.
8.3.1 Recording and measuring visible speech gestures
Most of the earliest recordings of speakers’ faces were captured on film (e.g. Fujimura 1961). In the 1970s, video recordings were becoming a cheaper alternative, despite having a lower spatial resolution and generally rather long frame exposure times that smeared rapid movements of the oral region (e.g. Brooke and Summerfield 1983). Video recordings were also difficult to handle because of the high cost and limited availability of the equipment that was needed to retrieve the individual fields or frames in sequence so that movements could be tracked. Short time intervals were necessary; 20 to 40 ms intervals were considered only just adequate for capturing the rapid consonantal articulations. Furthermore, the measurement and analysis of speech movements is difficult because the articulatory gestures (a) involve only small movements (the largest excursions rarely if ever exceed 25 mm and are often much smaller); and (b) are superimposed upon the global head and body movements which accompany natural speech. In the experiments conducted to date there has been a trade-off between the accuracy of the data gathered and the range of head and body movements or free facial expressions permitted to the speaker. Recordings have tended to concentrate on the primary visible articulations around the lower face that are most closely related to the phonetic content of an utterance. In some of the very early recordings, the speaker’s head was clamped so that any visible movements could be ascribed to articulatory movements of the mouth region alone. This was hardly a way to capture ‘natural’ speech movements and later experiments allowed the speaker greater freedom of head movement. Where global head movement was permitted, a common strategy has been to track selected points on the head that are not involved in articulation and use their position to compensate for the global movements of the head (Brooke and Summerfield 1983). The tracking of these fixed points has the advantage that it permits the head position and orientation to be quantified so that the global movements and their relation to speech production can in principle be explored. Fixed points of this kind can, however, vary considerably from one individual to another and are not always marked by easily identified anatomical features. It can thus be difficult to establish an accurate reference frame for measurements when speakers are allowed to move freely. Another approach has been to create a 3D model of the speaker’s head and then deform the model, through a set of parameters, until it matches the current image of the speaker (Eisert and Girod 1998). A third approach to the elimination of effects due to global head movements is to fix the camera relative to the head, for example, by using a head-mounted boom. Even then, however, there are still small but significant residual head movements relative to the camera. Petajan’s ‘nostril tracker’ is an early example of tracking used in this situation (Petajan 1984). Contemporary recordings of speakers still tend to employ a front facial image and compensate for any global movements by assuming that they are small x and y translations that can be compensated by tracking the position of a few identifiable facial feature points (e.g. Brooke and Scott 1998b).
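A minimal sketch of this translation-compensation strategy is given below; the array shapes and the choice of reference points are assumptions, and only the small x and y shifts assumed in the front-facial recordings described above are handled.

```python
import numpy as np

def compensate_translation(lip_points, ref_points):
    """Remove small global head translations from tracked lip positions by
    subtracting the frame-by-frame position of reference points (e.g. the
    nose bridge or nostrils) that do not move during articulation.

    lip_points: T x K x 2 tracked lip coordinates per frame (hypothetical).
    ref_points: T x M x 2 coordinates of the non-articulating reference points.
    """
    origin = ref_points.mean(axis=1, keepdims=True)   # per-frame reference origin
    return lip_points - origin                        # lip motion in a head-centred frame
```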
While many important articulatory gestures can be seen in frontal face images, others, such as lip protrusion, require observation of movements in all three dimensions. Simultaneous recording of movements in all three dimensions is not easy to achieve, especially if the subject is allowed complete freedom of movement. One early method (Brooke and Summerfield 1983) used a mirror angled at 45 degrees to capture simultaneously the front and side views of a speaker in a single recorded image. Usually the recordings involve the marking of points on the face, especially around the lips, lower face, and jaw. These techniques may be acceptable in analytical studies of speech production, but are unlikely ever to be appropriate in a practicable visual speech recognition system, which will probably use a single camera to observe the unmarked faces of speakers with at least a reasonable freedom of movement. Articulators that are only partially or intermittently visible, such as the teeth and the tongue, are also known to convey important visual cues. Their movements can be continuously recorded, but only by the use of very complex, expensive and specialized techniques such as X-ray microbeams, X-ray cineradiography, dynamic MRI, or ultrasound measurements (Perkell 1969; Fujimura 1982; Keller and Ostry 1983; Perkell and Nelson 1985; Echternach et al. 2008). These are frequently invasive and involve the placement of targets at points in the internal vocal tract.
In the 1980s, video recording systems had improved and digital image processing techniques were more widely available. It became possible to collect and analyse facial speech movements using automatic techniques. These included the capture of binary oral images in realtime (Petajan 1984) from which lip contours could be derived, or the extraction of lip parameters from lips that had been painted a cyan colour so that they could be separated from skin tones by chroma-keying (Benoît et al. 1992). Large digital frame stores began to make possible the processing and storage of sequences of video-recorded images that could then be retrieved and replayed at normal video frame rates as described in Section 8.2.2 above. Given the increasing capacity of disc storage systems, the greater speed of modern processors, the widespread availability of relatively cheap and efficient digital colour video cameras, and access to very efficient video data compression tools such as MPEG coding, the capture of high-quality moving images of a speaker’s face in timescales close to, or even at, realtime is now relatively straightforward. The high capacity of very cheap storage media such as CD-ROM allows large corpora of speech material embodying both video and audio recordings to be created and distributed. For example, 90 minutes of speech material consisting of a single speaker uttering 132 sentences from the SCRIBE corpus has been video recorded in colour at 50 fields (25 frames) per second, complete with audio sampled at 16 kHz. The digitized, 64 × 48-pixel colour images of the oral region were stored and the audio was encoded as the first 24 LPC cepstral coefficients, sampled at 20 ms intervals. All the audiovisual data could be held on just three CD-ROMs (Brooke and Scott 1998b).
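A back-of-envelope check of that storage figure can be made as follows; the byte counts per pixel and per cepstral coefficient are assumptions, so the result is only indicative.

```python
# Assumed encoding: 3 bytes per pixel for the 64 x 48 colour images, 25 stored
# frames per second, 90 minutes of material; audio stored as 24 cepstral
# coefficients every 20 ms, assumed 4 bytes each.
video_bytes = 64 * 48 * 3 * 25 * 90 * 60        # ~1.24 GB of image data
audio_bytes = 24 * 4 * 50 * 90 * 60             # ~26 MB of cepstral coefficients
cdroms = (video_bytes + audio_bytes) / 650e6    # ~2 discs at 650 MB before any overhead
print(video_bytes, audio_bytes, round(cdroms, 1))
# Broadly consistent with the three CD-ROMs cited, once indexing and overhead are allowed for.
```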
Despite the remarkable advances in the technology for data capture, there are still no readily available, general-purpose tools for tracking either the global movements of the head or the topographical features. Automatic feature extraction remains non-trivial. For example, while the most sophisticated real-time tracking systems (e.g. Blake et al. 1993; Blake and Isard 1994; Dalton et al. 1996) can accurately find and follow changes in shape of the outer lip margins, the inner lip margins, which are known to convey important visual cues to speech (Plant 1980; Montgomery and Jackson 1983), are not so well defined and remain difficult to locate reliably. Also, tracking the variations in lip contours is not necessarily equivalent to tracking the movements of marked points on the face that represent specific locations on the skin surface. Facial surface features are not defined on a contour, except at a few clearly identifiable positions such as the lip corner. Thus, while contours may be an efficient descriptor for lip shapes, they may not be well suited for examining the effects of actions by the facial musculature on changes in configuration of the surface topography.
8.3.2 Creating visual and audiovisual speech databases
The main problem in recording speech data for use in audiovisual speech processing applications is agreeing on what kind of material should be captured and under what conditions. Without necessarily attempting to constrain the specific material, a generally-agreed-upon framework is important in order to establish at least the structure and size of a common database that can be used to compare objectively the results obtained, for example, from automatic audiovisual or visual speech recognition using different systems. This has been attempted and to a large extent achieved for conventional audio speech processing. For example, much confusion can be created by simple inconsistencies such as attempting to compare the results from isolated word and continuous audiovisual speech recognition systems (e.g. Brooke et al. 1994; Adjoudani and Benoît 1996; Tomlinson et al. 1996). Also, many prototypical audiovisual speech recognition systems used corpora of training and test data, such as the TULIPS database and its derivatives, that were really too small to be completely reliable for the assessment of recognition performance (e.g. Movellan and Chadderdon 1996; Matthews et al. 1998).
Even now, there is relatively little systematic data available to describe and classify the visible speech articulations, though some work has been done to identify the visemes (e.g. Benoît et al. 1992), which are the closest visual equivalent of the abstract sound classes known as phonemes. Just as phoneme sets are language-specific, it seems likely that viseme sets may be language-specific as well; but thus far, there are too few systematic studies available to draw any clear conclusions. It is even possible that the facial gestures for speech events whose production methods are similar may differ among native speakers of different languages. Furthermore, since speech production involves time-varying movements of many articulators that may possess different mechanical properties, like stiffness and inertia, there is not a single, fixed articulatory configuration associated with a particular speech sound. Rather, the vocal tract configurations for a particular speech event are affected by the sounds that precede and follow it. This effect is known as coarticulation and, not surprisingly, manifests itself in the visible, facial gestures as well as in the acoustic outputs of the vocal tract (Benguerel and Pichora-Fuller 1982; Bothe et al. 1993). Consequently, a database of speech material must include many samples of each phoneme in different phonetic contexts. An early study (Brooke and Summerfield 1983) employed VCV syllables in the context of the three vowels that lie at the corners of the vowel triangle (see Ladefoged 1975) and a series of /bVb/ and /hVd/ utterances. However, a comprehensive library with multiple samples of all phonemes in the context even of the two adjacent phonemes (in other words, the complete set of triphones) involves a very large database of samples and the coarticulatory effects can in fact extend across a much wider spread of neighbouring phonemes (Ladefoged 1975). Even 90 minutes of recordings of SCRIBE sentences (see Section 8.3.1) can cover only a small proportion of all the possible triphones. Multiple tokens are needed to take account of the natural variations in the productions of even a single speaker. However, recordings of multiple speakers also need to be made to study the inter-speaker variations in articulatory strategies and gestures that are known to exist (Montgomery and Jackson 1983). Until recently, these variations have seriously restricted the application of articulatory synthesis to the creation of acoustic speech signals, despite the potential attractiveness of early accurate and detailed articulatory models (e.g. Mermelstein 1973). Lately, however, there has been a revival in this area with work on fricative consonants (Mawass et al. 2000; Beautemps et al. 2001), and biomechanical modelling of velar stops (Perrier et al. 2000).
Recordings of speakers’ faces may be treated at one extreme as images whose coded representations define the specific gestures and configurations of the visible articulators. At the other extreme, they can be analysed to find the spatial positions of particular topographical features, such as the corners of the lips, for example. The simplest approach is to make front-facial video recordings under a fixed level of lighting from two sources close to either side of the camera and in the same horizontal plane as the nose, so as to minimize unwanted shadowing and asymmetry. In reality, however, speakers’ heads may be presented under a range of lighting conditions and lighting conditions may vary even during short utterances. Similarly, whilst it is possible to make simple x- and y-corrections to measurements taken from different frames to compensate for the small head and body movements made by a speaker in controlled conditions, these corrections do not adequately compensate for the larger variations due to the changes in a speaker’s head orientation and position with respect to the observer that can occur in natural speech. While lighting conditions may or may not interfere with techniques that search for and track facial points or features, they can have a large effect on the appearance of the facial image. Consequently, were PCA, for example, to be used to encode facial images (e.g. Brooke and Scott 1998b; Brooke and Tomlinson 2000), it might be expected that codes representing images of identical facial presentations in differing lighting conditions would vary widely. However, the broadly fixed patterns of articulatory movement for specific utterances might manifest themselves at a deeper level through particular kinds or rates of change in the PCA coefficients, for example. Changes in facial orientation and position would generate additional time-varying changes in the PCA coefficients that would be confounded with the changes due to the visible articulatory gestures.
Recent work has led to the development of an automatic system for facial recognition (Moghaddam et al. 1998); it modelled the two mutually exclusive classes of variation in facial images. The first was the class of intra-personal variations due to differences in appearance of the same individual resulting from changes in lighting conditions, facial expression, facial orientation, and position with respect to the camera. The second class, of extra-personal variations, comprises the differences in facial appearance presented by different individuals. A Bayesian classifier was used to determine whether a pair of images represents the same individual or two different individuals. The technique used an embedded algorithm to isolate, scale, and align the faces from images and was able to deal with changes in head orientation, though the images tested were all essentially front-face presentations and the orientation variations were small. The images also included some lighting changes and variations in facial expression that the recognition system was able to deal with. However, the face-recognition task seeks to discount precisely the differences in facial expressions and gestures that are relevant to visual speech processing and, furthermore, to find the extra-personal differences that it would be highly desirable to discount, for example, in visual speech recognition. There currently appear to be no studies that have recorded speakers under controlled ranges of lighting conditions and facial orientations to explore the effects of either on (a) image encoding, in the case of image-based systems; or (b) the effectiveness of tracking algorithms, in the case of feature-oriented systems. Since separate recordings would be needed for each condition, the analysis of the results would need to take account also of the natural variations in articulatory gestures noted earlier. The outcome of the investigations would be directly relevant, for example, to recognition systems, like hidden Markov models, that essentially attempt to account for the variance of the input data.
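The essence of such a two-class Bayesian decision, though not the exact formulation of Moghaddam et al., can be sketched as follows; the Gaussian density estimates and parameter names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def same_person_posterior(diff, intra, extra, prior_intra=0.5):
    """Posterior probability that an image-difference vector 'diff' reflects an
    intra-personal variation (same individual) rather than an extra-personal
    one, given Gaussian densities fitted to labelled training differences.

    intra, extra: (mean, covariance) tuples estimated from labelled image pairs."""
    p_intra = multivariate_normal.pdf(diff, *intra) * prior_intra
    p_extra = multivariate_normal.pdf(diff, *extra) * (1 - prior_intra)
    return p_intra / (p_intra + p_extra)

# A pair of face images is judged to show the same individual when the
# posterior for the intra-personal class exceeds 0.5.
```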
In addition to the recording of basic visual material as outlined above, account needs to be taken of the differences between gestures that may occur in different kinds of speech activity. For example, spontaneous speech and conversation may not employ the same vocabulary of gestures as reading out loud, or using the telephone. In the very long term, material that reveals the interaction between speakers in a dialogue may be important, but it may perhaps be premature to attempt the specification of corpus material to investigate these aspects of visual speech signals when there is still a dearth of data for exploring more basic questions like the ones outlined above.
One further aspect of visible speech signals is becoming very important, for example, if a synthetic computer agent is to be realized that can simulate realistically a human speaker. It concerns the addition of appropriate facial expressions to the visible articulatory gestures and the interaction between the two. There is currently no analytical information available and no corpus of recorded data, for the following reasons. The simplest view of facial expressions is that the relevant muscular activities concerned with their production, which are well defined (Ekman and Friesen 1978), can be superimposed upon the muscular activities associated with the articulatory gestures. However, it seems more likely that there is a complex coupling between the two. For example, anyone who has attempted to utter speech whilst simultaneously producing a fixed smile or a permanent scowl will be aware that this is unnatural and difficult to do. Not only are the articulatory gestures of speech time varying, the facial expressions themselves are essentially dynamic gestures that vary during speech utterances. A very simple example (Ekman 1979) is provided by the well-known baton gestures of the eyebrows. Furthermore, making a specific facial expression can actually modify the speech production process. Scowling, for example, produces a tightening of the lower facial muscles and a drawing down of the lip corners that dramatically changes the normal articulations and also affects the voice quality of the acoustic output. The empirical investigation of the interaction between facial expressions and speech movements would most naturally start from studies of the articulatory movements in a ‘neutral’ face, which may be considered the normative baseline. Audiovisual recorded material from natural speech that embodies a range of natural expressions is also required and the time-varying expressions would need to be labelled by hand, much as phonetic transcriptions are created by manual labelling of conventional spoken corpora. It would then be possible to attempt to measure the differences between the expressive and the ‘neutral’ faces for each class of labelled expressions, in a range of phonetic contexts. This field has yet to be explored in any real depth and the development of an appropriate methodology remains a major challenge. Existing work is centred on the perception of audiovisual speech in the presence of emotions such as amusement, which are easy to activate spontaneously in constrained speech (Aubergé and Lemaître 2000).
A proper understanding of the interaction between the expressive and the articulatory processes demands a far more sophisticated and extensive model of language than we yet possess. Attempts to engineer expressive agents without greater knowledge of this kind may be not only misplaced, but also positively damaging, for reasons that are discussed in Section 8.4.3 below, in relation to purely articulatory gestures.
8.4 Automatic audiovisual speech processing
Additional sources of relevant knowledge should enhance the performance of conventional automatic speech recognition (ASR) systems. This is one of the main reasons for current interest in the visible aspects of speech production. As Section 8.2.1 above argued, visual speech signals tend to complement and therefore augment the acoustic signals. Indeed, the potential relevance and usefulness of visual signals is confirmed by the employment of speechreading, not only by those suffering from hearing impairment, for whom it is an essential part of successfully managing everyday communication, but also by most people with normal hearing, especially in noisy environments. For the reasons given earlier, one of the principal application areas for automatic audiovisual speech recognition is the robust recognition of speech in the presence of background noise in locations such as aircraft cockpits, where hands-free voice control may be required.
The primary requirement in automatic audiovisual speech recognition is therefore to identify the cues from each of the two modalities that are important to the accurate phonetic identification of speech events and to combine those information sources so as to make the best use of both together. This is in fact the second major issue in visual speech processing, along with the management of the volume of information that visual data presents, as noted in Section 8.2.2 above. There is as yet an incomplete understanding of the nature of the visual cues to speech events. Features such as lip separation and lip width are known to be important cues (e.g. McGrath et al. 1984), and others have been suggested (Finn 1986). However, there is still no reliable literature to indicate anything approaching a complete list of visual cues and no established methodology for identifying them. In addition, some of the cues may be very subtle. For example, when some of the identifiable visible features of speech signals were encoded using PCA, some of the higher-order coefficients that contributed very little to the data variance were more important in contributing to successful recognition than lower-order components that accounted for a much greater proportion of the data variance (Goldschen et al. 1996). Conventional (acoustic) automatic speech recognition systems can now handle speaker-independent continuous speech with sizeable vocabularies of the order of thousands of words. Variations due, for example, to movements of the speaker relative to the microphone can be fairly readily eliminated (see, for example, Holmes 1988). As the discussion of visual corpora above indicated, there is as yet no adequate body of systematic data to support the construction of speaker-independent audiovisual recognizers and it is much more difficult to compensate for visual variations such as movements of the speaker. Any practicable recognition system will ultimately have to allow the speaker reasonably free movement, at least within a fixed field of view.
Visual speech synthesis, by contrast, has a wide range of potential applications, whose requirements may vary. One possibility is the construction and presentation of material to help in the teaching of, for example, speechreading skills. A second is the construction of lifelike computer agents, and a third is the cheap and rapid construction of animated cartoon films. Other, longer-term applications could include automatic film dubbing into a variety of languages by substituting generated facial syntheses for the original actor’s face.
In an agent application, high-resolution colour images are required that model very accurately both speech gestures and facial expressions in essentially complete and photographic detail; realism is the primary motivation. For training purposes, the primary requirement for visual speech syntheses is more likely to be that they are speech-readable and embody the essential visual cues to speech events. It may or may not be essential to model the whole head or face. In film cartoon applications, it may be more important to generate animated displays that exaggerate speech and expressive gestures, possibly, in the extreme limit, in a highly formalized or anatomically inappropriate way. This is typical of the animator’s art and may rely upon generally accepted conventions that are divorced from reality to achieve a kind of plausibility. A limited ability to exaggerate articulatory movements may also be attractive in training applications when specific gestures may need to be emphasized. One other potentially important application area is the analytical investigation of visual and audiovisual speech perception itself. This could also require the generation of artificial visual stimuli that defy normal articulatory realization, for example, because articulators must be coupled or decoupled in an unnatural way. To illustrate this, consider a hypothetical experiment to determine whether it is the jaw opening or the lip opening that is more important in the perception of the British-English long vowel /a/. One approach might subject observers to a continuum of visual stimuli in which, at one extreme, the mouth opens while the jaw remains stationary and, at the other, the jaw drops while the lips remain closed. In this example, the normally coupled movements of these two articulators must be decoupled.
8.4.1 Head model architectures for visual speech synthesis
The development of visual synthesizers basically sprang from early approaches that involved a model of the head, or face, whose conformation could be adjusted to generate a sequence of frames. These could then be displayed in succession sufficiently rapidly to simulate movement. Two-dimensional facial topographies that could be realized through vector graphics were the cheapest and simplest to implement (e.g. Brooke Reference Brooke1982; Montgomery and Hoo Reference Montgomery and Hoo1982; Brooke Reference Brooke, Taylor, Néel and Bouwhuis1989). One of the earliest raster graphics displays that could be fully rendered to simulate texture and shading was the three-dimensional ‘wire-frame’ model of Parke (Reference Parke1975) in which the polygonal surfaces making up the head could be modified. Indeed, this model was the archetype for its contemporary counterparts, which can also include details of internal features like the teeth and tongue (e.g. Cohen et al. Reference Cohen, Beskow and Massaro1998). The main challenge underlying the animation of computer graphics displays of this kind remains the derivation and application of control parameters for driving accurately the time-varying movements of the model. They must be simple enough to adjust the time-varying conformation of the wire-frame economically, yet powerful enough to permit a full range of movements and gestures. One way to achieve this is to configure the model for a set of idealized target gestures, for example, a set of phones, then to generate intermediate image frames by interpolating between the target images (Lewis and Parke Reference Lewis and Parke1987). There are now techniques for modelling coarticulatory effects, of which the application of a gestural theory of speech production was one important step (Cohen and Massaro Reference Cohen, Massaro, Thalmann and Magnenat-Thalmann1993). This technique can, however, involve a very great deal of manual tuning to optimize performance. An alternative approach is to attempt to model the head anatomically, including descriptions of the skin, muscle, and bone structures (Platt and Badler Reference Platt and Badler1981; Terzopoulos and Waters Reference Terzopoulos and Waters1990; Waters and Terzopoulos Reference Waters and Terzopoulos1992). Time-varying muscle-based parameters can then be used to change the shape of the head model. While this is attractive in principle, mechanisms for deriving the muscle parameters are not straightforward. Additionally, the derivation of the control parameters for features that are only partially or intermittently visible may require sophisticated and invasive measurement techniques. Given the greater power of modern processors, it is now also possible to use facial images themselves as the head models for a series of phonetic targets and to ‘morph’ between the target images to simulate movements (e.g. Bregler et al. Reference Bregler, Covell and Slaney1997a; Ezzat and Poggio Reference Ezzat and Poggio1997). Visible coarticulation effects have to be carefully handled in this type of synthesis.
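As a minimal sketch of the target-interpolation idea described above, assuming each idealized gesture is stored as a set of wire-frame vertex coordinates (the names, mesh size, and timings are invented for illustration):

```python
import numpy as np

def interpolate_targets(targets, durations, frame_rate=25.0):
    """Generate intermediate wire-frame conformations by linear interpolation
    between successive target gestures (each an (n_vertices, 3) array)."""
    frames = []
    for (a, b), dur in zip(zip(targets, targets[1:]), durations):
        n = max(1, int(round(dur * frame_rate)))
        for i in range(n):
            t = i / n                      # interpolation fraction within this segment
            frames.append((1.0 - t) * a + t * b)
    frames.append(targets[-1])
    return np.stack(frames)

# Illustrative targets: three idealized gestures over a 100-vertex face mesh.
rng = np.random.default_rng(1)
neutral, open_jaw, rounded = (rng.random((100, 3)) for _ in range(3))
sequence = interpolate_targets([neutral, open_jaw, rounded], durations=[0.12, 0.20])
print(sequence.shape)   # (number_of_frames, 100, 3)
```

A simple linear blend of this kind takes no account of coarticulation; schemes such as the gestural model cited above instead weight the influence of neighbouring targets over time.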
The major advantage of methods based on head models is that they can in principle include facial expressions very easily if suitable modifications can be made to the control parameters to augment the purely articulatory gestures. Attempts have been made to achieve this using a simple catalogue of basic emotions (e.g. Lundeberg and Beskow Reference Lundeberg and Beskow1999), but the difficulties of a general solution to the problem have been described earlier. A further advantage of using head models for visual synthesis is that the images can be rendered to any resolution so that it is possible to generate photographically realistic images at only marginally greater computational expense. Modelling of exaggerated or anatomically implausible gestures is potentially straightforward and could in principle be achieved by applying appropriate control parameters.
8.4.2 Data-driven methods for visual speech synthesis
By using HMMs for capturing and describing the statistical properties of image sequences, it is possible to develop visual recognizers, as discussed in Section 8.2.2 above. However, a more unconventional application of trained HMMs (that is, HMMs whose parameters have been established) is to use them to generate outputs. In this way HMMs can become image synthesizers. One of the earliest applications of HMMs for synthesizing oral images (Simons and Cox Reference Simons and Cox1990) used only lip widths and separations as parameters. Parametric representations of a wider range of facial features and their time-varying changes can be used in this way to generate more sophisticated syntheses (e.g. Tamura et al. Reference Tamura, Masuko, Kobayashi and Tokuda1998; Okadome et al. Reference Okadome, Kaburagi and Honda1999). It is also possible to create syntheses from HMMs trained on images of a speaker’s face without any knowledge of the underlying structure of the images. In order to do this, HMMs are combined with a second statistically based technique, namely, PCA, which can efficiently compress and encode image data as discussed in Section 8.2.2 above. This is the basis of an entirely data-driven approach to visual speech synthesis (Brooke and Scott Reference Brooke and Scott1994a; Brooke and Scott Reference Brooke and Scott1998a). No modelling of the anatomy or structure of the head and face is required. In effect, the computer can be presented with many sequences of images of a speaker’s face, in a PCA-encoded format, from which it can ‘learn’ how the speaker’s facial gestures vary when uttering specific speech sounds. That is, it can use time-varying encoded versions of images to train HMMs to represent each of a set of phones. To synthesize an utterance, the HMMs for the appropriate sequence of sounds are invoked in order and reconstruct what they have ‘learned’. In principle, they generate as their outputs a sequence of images, also in a PCA-encoded format. However, since the outputs of HMMs are generated independently, they cannot be used directly to simulate a human speaker’s smooth, physiologically constrained outputs. Instead, the HMMs are used to generate PCA-encoded outputs at key points within speech events that are then used to compute a smoothly time-varying sequence of outputs (Brooke and Scott Reference Brooke and Scott1998a). Since each HMM is probabilistically driven, the image sequence that any particular HMM generates will vary from one invocation to another. Overall, however, the statistical properties of a large number of invocations will essentially match the variations in production that the HMM encountered during the learning phase. The data-driven methods can therefore model the variations of real speakers’ productions. In order to model at least some of the coarticulatory effects, triphone models are employed; the disadvantage of triphone modelling is that a much greater amount of training data is required.
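The final stage of such a pipeline can be sketched as follows, assuming that PCA codes at key points within successive speech events are already available (here they simply stand in for the HMM outputs), together with the PCA basis and mean image; all names and dimensions are illustrative.

```python
import numpy as np

def synthesize_frames(key_codes, key_times, frame_times, basis, mean_image):
    """Interpolate PCA codes between key points, then map each interpolated
    code back to pixel space through the PCA basis."""
    k = key_codes.shape[1]
    # Smooth each PCA dimension independently over time (linear here; the
    # prototype cited above used a more carefully constrained smoothing).
    trajectory = np.stack(
        [np.interp(frame_times, key_times, key_codes[:, j]) for j in range(k)],
        axis=1,
    )
    return trajectory @ basis + mean_image     # one image vector per frame

# Illustrative quantities: 10-dimensional codes, 32 x 24 pixel images, 25 fps output.
rng = np.random.default_rng(2)
basis = rng.standard_normal((10, 32 * 24))
mean_image = rng.random(32 * 24)
key_codes = rng.standard_normal((6, 10))       # stand-ins for HMM-generated key points
key_times = np.array([0.00, 0.08, 0.16, 0.28, 0.36, 0.48])
frame_times = np.arange(0.0, 0.48, 1.0 / 25.0)
frames = synthesize_frames(key_codes, key_times, frame_times, basis, mean_image)
print(frames.shape)      # (number_of_frames, pixels)
```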
One of the greatest virtues of data-driven modelling is that, because the HMMs are trained on images of real speakers, the images quite naturally embody facial features like the teeth and tongue that are only partially and intermittently visible. There is thus no requirement to estimate and track the position of those features that are most difficult to measure. For the same reason, natural asymmetries of the facial movements and the imperfect skin textures of real faces also arise naturally from the training process. It is thought that some of the shadowing and texturing of the skin surface may provide cues to particular speech events (e.g. Fujimura Reference Fujimura1961). In many models, bilateral symmetry of the face is assumed to minimize the volume of control data that is needed to drive the syntheses. In reality, faces are rarely, if ever, entirely symmetrical and one of the advantages of the data-driven approach is that the natural asymmetries are built in. The training phase, although it employs well-defined, standard algorithms (Rabiner Reference Rabiner1989b), is computationally expensive. However, the generation of the colour syntheses is rather rapid. Using the prototype synthesizer running on a modest 166 MHz PC, complete sentences could be synthesized at 50 frames per second from a phonetically transcribed input in approximately 6–8 times realtime. At these timescales, it is possible to envisage the interactive construction of visual speech stimuli that can be presented in response to a user. This capability would be important in training applications.
Much of the computational load in synthesis is due to reconstruction of the images from their PCA-encoded format. To reduce this load, codebooks of images together with their PCA codes can be constructed. Image sequences can then be generated entirely by a form of vector quantization that selects the codebook images with PCA codes closest to those computed by the HMM-based synthesizer. These syntheses approach realtime but are less smoothly varying (Brooke and Scott Reference Brooke and Scott1998b). In principle it is possible to set a tolerance such that codebook images are selected if their PCA codes lie within the tolerance, but images are reconstructed from the PCA codes otherwise. By adjusting the tolerance, it should be possible to trade off better synthesis timescales against a reduced image quality. The speech-readability of the prototype, non-codebook version of the data-driven synthesizer was estimated in visual perception experiments. These showed that single digit recognition rates in monochrome syntheses of spoken digit triples at a spatial resolution of only 32 × 24 pixels, could reach 63 percent, compared with a rate of 68 percent for monochrome visual displays of real digit triple utterances video-recorded at the same spatial resolution (Scott Reference Scott1996).
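A rough sketch of the tolerance-based trade-off just described, with invented codebook sizes and dimensions:

```python
import numpy as np

def select_or_reconstruct(code, codebook_codes, codebook_images, basis, mean_image,
                          tolerance):
    """Return a codebook image if one lies within `tolerance` of the generated
    PCA code (cheap), otherwise reconstruct the image from the code (costly)."""
    distances = np.linalg.norm(codebook_codes - code, axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] <= tolerance:
        return codebook_images[nearest]           # vector-quantized output
    return code @ basis + mean_image              # full PCA reconstruction

# Illustrative codebook of 256 entries over 10-dimensional codes.
rng = np.random.default_rng(3)
basis = rng.standard_normal((10, 32 * 24))
mean_image = rng.random(32 * 24)
codebook_codes = rng.standard_normal((256, 10))
codebook_images = codebook_codes @ basis + mean_image

generated_code = rng.standard_normal(10)
image = select_or_reconstruct(generated_code, codebook_codes, codebook_images,
                              basis, mean_image, tolerance=1.5)
```

Raising the tolerance increases the proportion of cheap codebook hits, at the cost of output that varies less smoothly from frame to frame.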
An important implication of the data-driven approach is that, unlike most model-based methods, it requires very little manual tuning or adjustment, and a synthesizer can be trained on any speaker for whom sufficient annotated training data is available. The prototype data-driven visual speech synthesizer described above was constructed to be speech-readable. As a result, it has relatively low spatial resolution. Use of a PCA-based coding scheme with greater image resolution would carry additional computational costs, especially if the whole face were to be generated. It is possible to adjust the transition probabilities of the HMMs to modify the rate of speech production in the syntheses, though this capability is restricted. Similarly, it is not easy to realize significant changes in facial expressions by manipulating the model parameters in the way the head models can be manipulated, though some small adjustments are possible, as described in Section 8.5 below. Data-driven models are thus less convenient if detailed control of the movements of specific facial features is required, for example, to exaggerate particular gestures in a speechreading training application. On the other hand, the increasing power and size of computer systems are such that it may be possible within the fairly near future to consider the creation of sets of trained HMMs that include the variations due to facial expressions. That is, it may be possible to use captured data from real speakers to create HMMs not simply for each triphone, but for each triphone in a number of different facial expressions (see Section 8.3.2 above). At present, however, limited adaptability is one of the major drawbacks of the data-driven approach to synthesis.
8.4.3 Data-driven audiovisual synthesis and synchronization
Visual speech synthesizers, like conventional acoustic speech synthesizers, can be, and frequently have been, driven by providing an essentially phonetic description of the utterances to be created, possibly with additional markings to specify their durations. However, for many applications, including the generation of cartoons, or for the film-dubbing application outlined earlier, lip synchronization is vital: the visual speech syntheses must be generated so that they match a pre-existing sound track. One solution might be to generate complete audiovisual syntheses. It is possible to generate an audio speech output using the HMM-based method (from initial proposals by Falaschi et al. Reference Falaschi, Giusiniani and Verola1989; Brooke and Scott Reference Brooke and Scott1998a; to recent work by Zen et al. Reference Zen, Tokuda and Black2009) in much the same way that the visual output is generated. Feasibility experiments have been carried out using short-term spectral descriptions of acoustic speech, which, although of rather poor quality, were adequate to demonstrate the principles of audiovisual synthesis. Audiovisual syntheses can be created by invoking both the audio and the visual synthesizers and running them independently, but in parallel (Tamura et al. Reference Tamura, Kondo, Masuko and Kobayashi1999; Bailly Reference Bailly2002). Given the natural variations inherent in human articulatory movements, it would not be surprising to encounter complex (and variable) phasing relations between the salient events characterizing the acoustic and the visual consequences of the same articulatory movement. This may have important implications for audiovisual speech recognition, as described in Section 8.4.4 below.
Ideally, however, it would be desirable to drive visual speech synthesizers using acoustic speech signals themselves. This was indeed the way in which one of the earliest data-driven synthesizers was used (Simons and Cox Reference Simons and Cox1990). The same method was developed (Brooke and Scott unpublished) in a pilot experiment that used ergodic (that is, fully connected) HMMs with approximately sixteen states. They were trained on audio and visual data to generate both PCA-encoded image data and cartoon-like syntheses from an acoustic input. Other methods for speech-driven synthesis have also been reported (e.g. Morishima Reference Morishima1998; Agelfors et al. Reference Agelfors, Beskow, Granström, Lundeberg, Salvi, Spens and Öhman1999). The ability of ANNs to find complex mappings has made them popular tools for both synthesis and recognition. Time-delay neural networks have been used in synthesis for attempting, for example, to map from acoustic speech signals to articulatory parameters (Lavagetto and Lavagetto Reference Lavagetto, Lavagetto, Stork and Hennecke1996) and from phones to facial images selected from a pre-determined set (Bothe Reference Bothe, Stork and Hennecke1996). A specific acoustic signal may be generated from a number of different articulatory configurations. A rather extreme illustrative example is the production of the /i/ sound. This is usually articulated as a high front vowel with lip spreading, but can also be created with rounded lips if the tongue is moved abnormally far forward and upward (Stevens and House Reference Stevens and House1955). The inverse problem, namely, the computation of an articulatory configuration given an acoustical or phonetic description of the speech event, is an intractable one-to-many mapping problem (Atal et al. Reference Atal, Chang, Mathews and Tukey1978). Artificial Neural Networks can present problems if they are used where one-to-many mappings may be encountered. Even if methods exist in which multiple mappings from specific sounds to articulatory gestures can be reliably and successfully found (e.g. Toda et al. Reference Toda, Black and Tokuda2004; Toda et al. Reference Toda, Black and Tokuda2008), the utterance context is likely to be critical in selecting a plausible articulation within a sequence of articulations. One interesting study (Kuratate et al. Reference Kuratate, Yehia and Vatikiotis-Bateson1998) shows that acoustic speech syntheses, as well as visual syntheses can be achieved using parameters derived from measurements of the facial movements, which indicates a link between the two modalities that might ultimately offer a new route to intrinsic lip synchronization.
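As a highly simplified illustration of speech-driven synthesis, the sketch below fits a linear least-squares mapping from a window of acoustic frames to visual PCA codes; the context window plays the role of the time delays mentioned above, and all data, names, and dimensions are invented. The systems cited used HMMs or neural networks rather than a single linear map.

```python
import numpy as np

def stack_context(acoustic, width=2):
    """Stack each acoustic frame with `width` frames either side, so that the
    mapping to visual parameters can see some temporal context."""
    padded = np.pad(acoustic, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(acoustic)] for i in range(2 * width + 1)])

# Illustrative parallel data: 26-channel acoustic frames and 10-dimensional
# visual PCA codes at the same frame rate.
rng = np.random.default_rng(4)
acoustic = rng.standard_normal((1000, 26))
visual = rng.standard_normal((1000, 10))

X = stack_context(acoustic, width=2)                 # (1000, 26 * 5)
W, *_ = np.linalg.lstsq(X, visual, rcond=None)       # least-squares linear mapping
predicted_visual = X @ W
print(predicted_visual.shape)                        # held-out use would look the same
```

A single linear mapping of this kind cannot represent one-to-many relationships of the sort discussed above; it simply averages over the alternatives, which is one reason why temporal context and more expressive models matter.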
It is very important to combine acoustic and visual speech signals accurately. De-synchronization of the acoustic and visual speech signals beyond a critical time window can cause severe perceptual difficulties (see McGrath Reference McGrath1985). However, the naturally occurring time offset between visual and auditory events can provide cues to particular speech events. For example, changes in the categorical perception of sounds in the /ma/, /ba/, and /pa/ continuum can be induced by changing the voice onset timings with respect to visible lip opening (Erber and de Filippo Reference Erber and de Filippo1978). The McGurk effect (McGurk and MacDonald Reference McGurk and MacDonald1976) is a compelling demonstration of the results of failing to match acoustic and visual data appropriately. For example, a visual /ga/ combined with an acoustic /ba/ produces the percept of a spoken /da/. In other words, not only may a mismatch cause the observer to fail to perceive what was uttered, it may result in an entirely different percept that was neither seen nor heard.
8.4.4 Models for visual and audiovisual automatic speech recognition
One of the earliest approaches to automatic visual speech recognition used template-matching methods to match test and reference utterances encoded as descriptions of time-varying visible features such as lip width and separation, or oral cavity area and perimeter. Even purely visual recognition was shown to be useful for vocabularies of isolated words such as spoken digits and letters (Petajan Reference Petajan1984; Petajan et al. Reference Petajan, Brooke, Bischoff and Bodoff1988b). Audiovisual recognition at that stage was managed by a heuristic that combined separate audio and visual recognition processes, so it was not easy to estimate quantitatively the benefit of the visual component. Human speech intelligibility in all but the very best and quietest of acoustic conditions is improved if the speaker’s face is visible. This is the basis of all attempts to employ visible speech data as an additional source of information, especially in acoustically noisy conditions, which severely degrade the performance of most conventional speech recognition systems even at relatively low noise levels.
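A minimal sketch of this style of template matching, using dynamic time warping over invented lip-width and lip-separation trajectories (the vocabulary and feature values are purely illustrative):

```python
import numpy as np

def dtw_distance(test, reference):
    """Dynamic-time-warping distance between two feature trajectories,
    each an array of shape (frames, features)."""
    n, m = len(test), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test[i - 1] - reference[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(test, templates):
    """Return the label of the reference template closest to the test utterance."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))

# Illustrative templates: lip-width and lip-separation trajectories per digit.
rng = np.random.default_rng(5)
templates = {str(d): rng.random((30 + d, 2)) for d in range(10)}
test_utterance = templates["3"] + 0.05 * rng.standard_normal(templates["3"].shape)
print(recognize(test_utterance, templates))    # expected to report "3"
```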
By the mid 1990s, for reasons noted in Section 8.2.2 above, the range of experimental visual and audiovisual recognition systems was already very wide, extending from purely image-based systems at one extreme, to model-based systems at the other. These were reviewed in a contemporary article (Stork et al. Reference Stork, Hennecke, Prasad, Stork and Hennecke1996); it revealed a fundamental division between the feature-based and the image-based views of visual data that persists. What emerged then as a major issue, prompted by investigations of audiovisual speech perception, was the combination of visual and auditory information so as to obtain the maximum benefit from the two together, the importance of which has already been noted. Four models were proposed (Robert-Ribes et al. Reference Robert-Ribes, Lallouache, Escudier and Schwartz1993; Robert-Ribes et al. Reference Robert-Ribes, Piquemal, Schwartz, Escudier, Stork and Hennecke1996). The first is the Direct Identification, or DI, model, in which acoustic and visual data are combined and transmitted directly to a single, bimodal classifier. The second is the Separate Identification, or SI, model that carries out two unimodal classifications. The results from these are then sent forward for fusion and decision-making. The third is the Dominant Recoding, or DR, model. Here, auditory processing is assumed to be dominant and visual data is recoded into the dominant modality. For example, both modalities might be recoded into a tract transfer function. The two estimates are then fed forward for final classification. The fourth model is the Motor-space Recoding, or MR, model. Inputs from both modalities are recoded into an amodal common space, such as the space of articulatory configurations. The two representations are then passed to the classifier.
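The difference between the SI and DI architectures can be sketched schematically as follows; the classifiers, weights, and class counts are invented, and the weighting term in the SI route simply stands in for the bias towards the visual stream that some systems apply when acoustic noise is high.

```python
import numpy as np

def fuse_separate_identification(audio_loglik, visual_loglik, visual_weight=0.5):
    """SI-style late fusion: combine per-class log-likelihoods from separate
    audio and visual classifiers, then decide."""
    combined = (1.0 - visual_weight) * audio_loglik + visual_weight * visual_loglik
    return int(np.argmax(combined))

def fuse_direct_identification(audio_features, visual_features, bimodal_classifier):
    """DI-style early fusion: concatenate the two feature streams and pass the
    composite vector to a single bimodal classifier (a stand-in callable here)."""
    composite = np.concatenate([audio_features, visual_features])
    return bimodal_classifier(composite)

# Illustrative use for a 10-class (digit) task.
rng = np.random.default_rng(6)
audio_loglik = rng.standard_normal(10)
visual_loglik = rng.standard_normal(10)
print(fuse_separate_identification(audio_loglik, visual_loglik, visual_weight=0.7))

def toy_bimodal_classifier(composite):
    # Stand-in for a trained bimodal classifier: not a real model.
    return int(np.argmax(composite[:10]))

audio_features = rng.standard_normal(26)
visual_features = rng.standard_normal(10)
print(fuse_direct_identification(audio_features, visual_features, toy_bimodal_classifier))
```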
The DR model was implemented in one early prototype recognition system that attempted to code the visual data into a vocal tract transfer function (Sejnowski et al. Reference Sejnowski, Yuhas, Goldstein, Jenkins and Touretzky1990). However, most of the recognizers to date lie in the continuum between the SI model (e.g. Petajan Reference Petajan1984; Petajan et al. Reference Petajan, Bischoff, Bodoff and Brooke1988a) and the DI model (Brooke et al. Reference Brooke, Tomlinson and Moore1994). Perceptual evidence seems least strong for the SI model and the most general view is that fusion takes place at a level higher than the peripheral system, but prior to categorization (Summerfield Reference Summerfield, Dodd and Campbell1987; Massaro Reference Massaro, Stork and Hennecke1996). The DI model, using a composite audiovisual feature vector, is a particularly straightforward architecture to implement with HMM-based recognition systems. It does not rely on early feature extraction and defers the decision-making process so that as much of the available data as possible can be retained for use. One speaker-dependent system of this kind used continuously spoken digit triple utterances as the vocabulary (Brooke et al. Reference Brooke, Tomlinson and Moore1994; Brooke Reference Brooke, Stork and Hennecke1996). The acoustic data, stored at 100 frames per second, consisted of the outputs of a 26-channel filter bank covering the frequency range between 60 Hz and 10 kHz. The visual data was a 10-component PCA-encoded version of 10 × 6 pixel monochrome images of a speaker’s oral region. Visual data was captured at 25 frames per second and four replications of each visual frame were used to match the acoustic data rate. The recognizer used the 36-element composite vectors in 3-state triphone HMMs, each state of which was associated with a single multivariate continuous Gaussian distribution and a diagonal covariance matrix. The HMMs were trained on 200 digit triples and tested on 100 digit triples. The visual signals were clean, but spectrally flat noise at various levels was added to the acoustic signal to investigate the performance of the recognition system on speech in noise. The results showed that, although the simple addition of visual data could improve a recognizer’s performance, the gain was small, essentially because the contribution from the visual channels was swamped by the errors induced by the noise in the acoustic channels. The gains became more significant when compared to the best available acoustic recognizers that used techniques such as silence tracking and noise masking (Klatt Reference Klatt1976) to counter the effects of acoustic noise. In contrast, other, contemporary approaches to recognition using HMMs tended to compute explicit weighting factors to bias the system in favour of the visual signal when the acoustic noise levels became high (e.g. Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996). None of these recognition systems was able to demonstrate a bimodal recognition performance that was consistently better than the unimodal performance in either of the two domains across a wide range of acoustic signal-to-noise ratios. Many of the early systems did not investigate speech in high levels of noise (such as signal-to-noise ratios much below 0 dB).
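The assembly of the composite observation vectors in a system of this kind can be sketched as follows, assuming 26 filter-bank channels at 100 frames per second and 10 visual PCA components at 25 frames per second, with each visual frame replicated four times; the data here are random placeholders.

```python
import numpy as np

def make_composite_vectors(acoustic_100hz, visual_25hz, replication=4):
    """Replicate each visual frame to match the acoustic frame rate, then
    concatenate frame by frame into composite audiovisual observation vectors."""
    visual_100hz = np.repeat(visual_25hz, replication, axis=0)
    n = min(len(acoustic_100hz), len(visual_100hz))   # guard against length mismatch
    return np.hstack([acoustic_100hz[:n], visual_100hz[:n]])

# Illustrative data: two seconds of speech.
rng = np.random.default_rng(7)
acoustic = rng.random((200, 26))      # 26-channel filter-bank outputs at 100 frames/s
visual = rng.random((50, 10))         # 10 PCA components at 25 frames/s
observations = make_composite_vectors(acoustic, visual)
print(observations.shape)             # (200, 36): one 36-element vector per 10 ms frame
```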
Conventional HMM-based audiovisual speech recognizers that conform to the DI model, like that described in the previous paragraph, implicitly assume synchrony between the acoustic and the visual data. The HMMs typically consist of three-state, left-to-right models without state skipping. It is, however, possible to build HMM-based recognizers that permit a degree of audio and visual asynchrony, at least within phones, though synchrony is re-asserted at the phone boundaries. This approach was tested in a prototypical audiovisual speech recognizer using a technique based on signal decomposition (e.g. Varga and Moore Reference Varga and Moore1990). The operation of this recognizer was similar to that described in the paragraph above, except that the HMMs in this type of architecture were 9-state triphone models arranged as a two-dimensional array of 3 × 3 states (Tomlinson et al. Reference Tomlinson, Russell and Brooke1996). The rows corresponded to the states of an audio model and the columns to the states of a visual model. The models were entered at the top left-hand state and exited at the bottom right-hand state. Transitions between states from top to bottom and from left to right, including diagonal transitions, were permitted. Thus, for example, a diagonal path through the HMM would correspond to complete synchrony between the audio and visual signals. The HMMs were trained using separate audio and visual models. This was one of the first experiments to demonstrate, for a continuous speech recognition task, a performance that was better in the audiovisual domain than in either the audio or the visual domain separately, throughout a full range of signal-to-noise ratios between +23 dB and –22 dB. Despite the novel architecture, this recognizer remains a form of DI model. Work on similar HMM architectures, such as coupled HMMs, is currently being undertaken (Nefian et al. Reference Nefian, Liang, Pi, Liu and Murphy2002). The capability to extend the asynchrony across phone boundaries has not yet been investigated and presents some challenges (see proposals by Saenko and Livescu Reference Saenko and Livescu2006; Lee and Ebrahimi Reference Lee and Ebrahimi2009).
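The allowed-transition structure of such a 3 × 3 state array can be sketched as follows (the state indexing and the inclusion of self-loops are illustrative choices, not a description of the cited implementation): either stream may advance alone, or both may advance together, so the audio and visual streams can drift apart within the phone while remaining synchronized at its boundaries.

```python
import numpy as np

def product_hmm_transitions(n_audio=3, n_visual=3):
    """Build the (9 x 9) allowed-transition mask for a product HMM whose state
    (i, j) pairs an audio-model state i with a visual-model state j. Permitted
    moves: stay, audio advances, visual advances, or both advance together."""
    n = n_audio * n_visual
    allowed = np.zeros((n, n), dtype=bool)
    for i in range(n_audio):
        for j in range(n_visual):
            s = i * n_visual + j
            for di, dj in [(0, 0), (1, 0), (0, 1), (1, 1)]:
                if i + di < n_audio and j + dj < n_visual:
                    allowed[s, (i + di) * n_visual + (j + dj)] = True
    return allowed

mask = product_hmm_transitions()
# Entry state is (0, 0) (top left); exit state is (2, 2) (bottom right).
# A purely diagonal path (0,0) -> (1,1) -> (2,2) corresponds to the audio and
# visual streams remaining in step throughout the phone.
print(mask.astype(int))
```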
8.5 Assessing and perceiving audiovisual speech
Assessment of the performance of both recognition and synthesis systems is complex, partly because, as Section 8.4 described, there are many possible application areas for these technologies, each with potentially different requirements. In addition, it may be appropriate to distinguish between performance measures based on purely visual speech processing and those based on audiovisual speech processing. For both synthesis and recognition, speech-readability is a central issue, whether it is speech-readability of human signals by machines, as it is in recognition, or speech-readability of machine outputs by humans, as it is in synthesis. More precisely (Kuratate et al. Reference Kuratate, Yehia and Vatikiotis-Bateson1998), the activity of the vocal tract that generates the acoustic speech signal has time-varying visible correlates that are conveyed by many parts of speakers’ faces. The identification of the visual correlates makes them available for automatic speech recognition, whether or not humans actually use them. Conversely, synthesizers that attain a level of ‘communicative realism’ would minimally embody the audible-visual correlates observed in human oro-facial movement. That is, in a minimal synthesizer, attention should be focused on the visible-acoustic or phonetic aspects of facial movements. Whilst this is a sensible starting point and would be satisfactory for some applications, in others the minimal synthesis would not be adequate: facial expressions and paralinguistic gestures would also need to be included, as described in Section 8.4 above. In the case of synthesis, most of the guidance in the development of systems is likely to be provided by studies of speech production and speech perception. In the case of recognition, by contrast, there is no inherent reason why machines should be bounded by human constraints on operation or performance, but it may still be beneficial to understand how (successful) human systems work in order to exploit their known strengths and capabilities when developing artificial processing methods. For both automatic speech synthesis and recognition, human performance is the obvious baseline with which to compare machine performance (e.g. Brooke and Scott Reference Brooke and Scott1994a; Brooke et al. Reference Brooke, Tomlinson and Moore1994).
Like the construction of corpora of speech material described in Section 8.3, above, the design of standardized performance assessment material and the precise specification of test conditions is central to the progressive refinement of all automatic systems. The design and application of test material is complex because some speech events are inherently easier to identify in the acoustic domain than in the visual domain and vice versa. In audiovisual testing, it is also necessary to separate the contributions from each of these modalities. No generally agreed-upon methodology appears yet to have emerged.
8.5.1 Automatic audiovisual speech recognition
Some sources of difficulty in comparing results from recognition experiments were indicated in Section 8.3.2 above. A starting point for the comparison of results from different systems would have to include at least specifications to define, for example: whether the recognizer was speaker-dependent, multiple speaker, or speaker-independent; whether the speech was uttered as isolated words, as connected (that is, carefully pronounced) words, or as continuous speech; the acoustic and visual conditions under which the test and training data were acquired; and the nature of the test vocabulary. There is a separate issue concerning the speaker that has to be taken into account. Speakers do not enunciate equally clearly and some speakers produce significantly better results with ASR systems than others. Clarity in the auditory domain has its counterpart in the visual domain also and some speakers are much easier to speechread than others. Without clear specifications of recognition conditions, it is impossible to make a valid comparison of the merits of the various recognizers themselves, let alone of the alternative models for audiovisual speech perception as applied to recognition systems (see Section 8.4.4 above).
It is possible to quote the benefits of using a visual component in recognition systems in a number of different ways. Thus, for example, in early experiments to investigate the benefits of using a visual component when attempting automatic recognition of acoustically noisy speech signals (Brooke et al. Reference Brooke, Tomlinson and Moore1994), one way to express the gain was to indicate, for a fixed level of word accuracy, the change in signal-to-noise ratio of an acoustic speech signal that was equivalent to adding a visual component to the recognizer’s input. Alternatively, it was possible to quote the difference in word accuracy for acoustic-only versus acoustic plus visual signal inputs at stated acoustic signal-to-noise ratios. Significant differences in interpretation could result from this choice of presentation. In addition, the conservative use of percentage word accuracy accounted for word insertion errors as well as word deletion and word substitution errors. The commonly used percentage word correct recognition rate accounts only for word substitution and word deletion errors and generally produces an apparently more favourable recognition performance (Tomlinson Reference Tomlinson1996).
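The two scoring conventions can be stated explicitly. With N words in the reference transcription and S substitution, D deletion, and I insertion errors, percentage word accuracy is 100(N − S − D − I)/N, whereas the percentage of words correct is 100(N − S − D)/N; a small sketch:

```python
def word_accuracy(n_ref, subs, dels, ins):
    """Percentage word accuracy: insertion errors are penalized as well."""
    return 100.0 * (n_ref - subs - dels - ins) / n_ref

def word_correct(n_ref, subs, dels):
    """Percentage words correct: insertions are ignored, so the figure is
    generally the more favourable of the two."""
    return 100.0 * (n_ref - subs - dels) / n_ref

# Illustrative scores for a 300-word test set with 20 substitutions,
# 10 deletions, and 15 insertions.
print(word_accuracy(300, 20, 10, 15))   # 85.0
print(word_correct(300, 20, 10))        # 90.0
```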
The use of PCA-encoded oral image data to represent the visible aspects of the speech signal showed that it is not the low-order components that always contribute the biggest improvements to audiovisual speech recognition (e.g. Brooke et al. Reference Brooke, Tomlinson and Moore1994; Goldschen et al. Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996). This is not necessarily surprising. It may simply imply that, though a particular component contributes a significant variance to the image data, it contributes little to discrimination between different speech events. PCA has nothing to reveal about the way that the phonetic classes are distributed along a particular dimension and linear discriminant analysis (see, e.g. Chatfield and Collins Reference Chatfield and Collins1991) may offer more useful insights (Brooke et al. Reference Brooke, Tomlinson and Moore1994; Tomlinson et al. Reference Tomlinson, Russell and Brooke1996). Attempting to identify and retain the components that contribute significantly to the speech recognition process, while eliminating those that do not, should improve recognition efficiency. Scatter plots of PCA component values for pairs of phonetic events should reveal which components contribute most to the recognition process for discrimination between particular phonetic pairs. However, no comprehensive study of this kind seems to have been reported yet. A similar approach might also yield information about the usefulness of feature parameters, like lip separation and lip spreading, for the discrimination between different pairs of phonetic events. One objective of such studies might be to compare experimental estimates of the perceptual distances between different speech events with observable visible feature parameters and to seek correlations. Additionally, in order to investigate the value of feature parameters in contrast to pure image data, it might also be possible to perform PCA upon a composite vector of both feature parameters and image pixel intensity data and seek the most significant components for speech recognition.
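One simple form of the per-component analysis suggested here is to rank PCA components by a Fisher-style ratio of between-class separation to within-class spread for a chosen pair of phonetic classes, rather than by the variance they capture. The sketch below uses invented codes in which one component has been artificially made discriminative:

```python
import numpy as np

def fisher_ratio(class_a, class_b):
    """Per-component Fisher ratio for two sets of PCA codes,
    each of shape (samples, components)."""
    mean_diff = class_a.mean(axis=0) - class_b.mean(axis=0)
    pooled_var = class_a.var(axis=0) + class_b.var(axis=0)
    return mean_diff ** 2 / pooled_var

# Illustrative PCA codes for frames labelled with two phonetic classes.
rng = np.random.default_rng(8)
codes_m = rng.standard_normal((400, 20))
codes_f = rng.standard_normal((400, 20))
codes_f[:, 7] += 2.0      # pretend component 7 separates the two classes well

ratios = fisher_ratio(codes_m, codes_f)
ranking = np.argsort(ratios)[::-1]
print(ranking[:5])        # components most useful for this particular contrast
```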
8.5.2 Automatic visual and audiovisual speech synthesis
The production of animated, computer-generated audiovisual speech displays poses a number of perceptual questions. One is whether or not observers treat the syntheses in the same way as normal, human speech productions. In an early vowel identification experiment with a vector graphics synthesizer that displayed only an outline diagram of the facial topography (McGrath et al. Reference McGrath, Summerfield and Brooke1984), errors in vowel identification were similar to those obtained from natural speech recordings. This suggested that the synthetic stimuli were being perceived in a manner analogous to that which is applied to real speech. The ability of the synthesizer to induce the McGurk effect was further confirmation that the syntheses could be perceived as speech-like (see Brooke Reference Brooke and Ainsworth1992a). Though it is likely that the more sophisticated syntheses now available are also treated as speech-like, a successful outcome to this test should not necessarily be taken for granted. The test may define one criterion in determining the validity of facial speech syntheses. The literature of human–computer interaction offers guidance on subjective measures for determining the naturalness and acceptability of visual speech syntheses, but more objective tests are difficult to devise. A very simple test for the naturalness of visual speech synthesis, not dissimilar from the well-known Turing test for artificial intelligence, might be to determine whether viewers can discriminate between real and synthesized images of talking faces.
Even now, a very large number of fundamental issues concerning visual and audiovisual speech perception remain to be completely investigated. Some illustrative examples are set out below. Section 8.4 noted the capacity of the data-driven visual speech synthesizers to model bilateral facial asymmetries and real skin textures that could be a possible cue to some speech events. It is not known how important the bilateral asymmetries are, but they are certainly very common in real speakers and their absence might render facial images less natural looking. There is also some anecdotal evidence that fixed frontal views of the oral region of speakers can induce a sense of unease, especially when the viewers are children. The natural global movements of the head, despite not playing a significant part in conveying cues to the phonetic content of speech, may nonetheless also contribute to the acceptability of artificial computer-generated displays.
Some perceptual issues are directly relevant to the synthesis of visual speech gestures. For example, unreported experiments with the data-driven visual speech synthesizer described in Section 8.4.2 above (Brooke and Scott Reference Brooke and Scott1998b) indicated that there was no difference in digit recognition rates when otherwise identical syntheses of digit triples were presented in monochrome and in colour. It is, however, not known how colour and monochrome presentations would be likely to affect the results for different vocabularies or the perceived acceptability and naturalness of syntheses. A separate pilot study explored the feasibility of using a simple algorithm to generate facial displays in which the visible articulatory gestures could be accompanied by time-varying global movements of the head. The two-dimensional reconstructed oral images of the visible speech movements were electronically pasted onto a three-dimensional wire-frame model of the lower face in which the oral area was a plausible three-dimensional shape, but did not include the mouth as a feature (Brooke and Scott Reference Brooke and Scott1998b). It acted as a shaped screen that could carry the animated display of the visible speech gestures while the wire-frame head model was rotated about the x, y, and z axes. Though no analytical experiments were carried out, the informal response of a number of observers suggested that the syntheses were interpreted as plausible speech-readable gestures even when the head was shown in views that were far from direct frontal presentations. This was unexpected, especially as the three-dimensional information about features like the teeth and tongue that lay within the oral cavity could not be accurately represented in the two-dimensional image projections at large angles of head rotation. However, useful information about visual speech intelligibility might be gained from a programme of analytical experiments. If the results supported the informal observation, it would suggest that there might be a range of applications for a type of talking computer agent (see Section 8.4 above) that could be implemented without the need for detailed three-dimensional head models.
One unresolved issue in visual speech synthesis and its assessment is the need to find objective methods for measuring the differences between the images of a synthesized talking face and the images of a real speaker’s facial gestures. One application would be to determine, for example, how accurately a synthesizer is able to simulate specific coarticulatory gestures that may be critical to speech-readability. A simple approach to this problem is possible when using the data-driven method of visual speech synthesis in which the output images are reconstructed from sets of PCA code values generated by HMMs trained on the PCA-encoded images of a real speaker (Scott Reference Scott1996). Here, it is possible to measure the differences between either the PCA code values or the image pixel values obtained from (a) the synthesized and (b) the recorded images for particular utterances. Optimal time alignment of the real and synthesized utterances can be computed by dynamic time warping. Pilot experiments were carried out with monochrome images of synthesized and real digit triples utterances. A significant correlation was found between the fraction of the total variance accounted for by the PCA coding scheme and both (a) the visual digit recognition rate of synthesized oral images and (b) the root mean squared intensity difference computed from the corresponding sub-blocks of pixels in the reconstructed and raw images. The correlations thus linked subjective and objective performance measures and suggested that the general quality of syntheses could be predicted without necessarily performing complex and expensive perceptual experiments or detailed image analyses.
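A rough sketch of the objective comparison described above, assuming the synthesized and recorded utterances have already been time-aligned (for example by dynamic time warping) and are available as corresponding monochrome frames; the image size, block size, and data are illustrative.

```python
import numpy as np

def rms_block_difference(synth_frames, real_frames, block=(8, 8)):
    """RMS intensity difference over corresponding sub-blocks of pixels in
    time-aligned synthesized and recorded frames of shape (frames, H, W)."""
    bh, bw = block
    diffs = []
    for s, r in zip(synth_frames, real_frames):
        for y in range(0, s.shape[0], bh):
            for x in range(0, s.shape[1], bw):
                d = s[y:y + bh, x:x + bw] - r[y:y + bh, x:x + bw]
                diffs.append(np.sqrt((d ** 2).mean()))
    return float(np.mean(diffs))

# Illustrative aligned sequences of 32 x 24 monochrome oral images (60 frames).
rng = np.random.default_rng(9)
real = rng.random((60, 24, 32))
synth = real + 0.1 * rng.standard_normal(real.shape)
print(rms_block_difference(synth, real))
```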
However, it would be highly desirable to find a generalized method for matching and comparing synthesized images and images of real speakers. This would, minimally, involve scaling and overlaying the image pairs accurately and defining a suitable metric to describe the differences between the image pairs in a way that focuses on the visible articulatory gestures. Potentially relevant methods for facial image alignment and comparison have been developed successfully for facial recognition systems (Moghaddam et al. Reference Moghaddam, Wahid and Pentland1998; Hall et al. Reference Hall, Crowley and Colin de Verdière2000), and these have prompted a move towards generalized methods for matching and comparing synthesized and real facial images (Elisei et al. Reference Elisei, Bailly, Odisio and Badin2001). As noted in Section 8.3.2, however, facial recognition methods are not designed to focus on the effects of precisely those facial gestures and expressions that are relevant to assessments of the performance of visual speech synthesizers. In the context of visual speech processing, therefore, image comparison remains a largely unresolved problem.
8.6 Current prospects
The development of powerful, statistically oriented techniques for capturing, analysing, and describing data, together with the availability of large, fast computer systems able to support them, has played a major role in the development of automatic visual and audiovisual speech processing. However, while statistical methods can, for example, find efficient ways to reduce and manage large volumes of data, they are not usually informative about the significance or interpretation of the transformed representations. Spaces like those generated by PCA directly from images as described in Section 8.2.2 illustrate this; the components are unlabelled and even when the effects of variations in individual components are plotted, they rarely suggest simple, equivalent, feature-based facial parameters (e.g. Turk and Pentland Reference Turk and Pentland1991); to obtain interpretable movement parameters, it is necessary at least to resort to multistage image analysis (as in Ezzat et al. Reference Ezzat, Geiger and Poggio2002a). The same applies to HMMs, in which the states of the finite state machines relate only weakly, if at all, to the articulatory stages in the production of the speech events that the HMMs model. The properties and behaviour of ANNs like MLPs tend to be expressed by parameters that are distributed throughout the networks. This is one of the characteristics of ANNs that makes them powerful, because they can, for example, continue to function (though not as well) even if part of the network is damaged or destroyed. However, the distribution of the parameters generally makes the behaviour of ANNs difficult to observe and understand. Statistical analyses, including PCA, though powerful and useful techniques, may not be sufficient on their own to reveal the full pattern and structure of the data they themselves generate. This may only emerge through the simultaneous application of sophisticated visualization techniques.
In contrast to the statistically oriented models, speech production processes that involve the muscular control of the configurations of articulatory organs and their time-varying changes are the physical basis of the visible changes that manifest themselves in the gestures and movements of facial features. For this reason, the identification and tracking of the time-varying movements of facial features remains attractive and appropriate. An objective posed by one of the authors (NMB) in 1982 is still relevant, namely, to find the minimal set of facial points whose movements can generate complete and accurate descriptions of a facial speech synthesis that is ‘communicatively realistic’. The techniques for achieving this goal are, however, now emerging (Kuratate et al. Reference Kuratate, Yehia and Vatikiotis-Bateson1998). Notwithstanding this progress, much concerning the speech production process is still not well understood. In particular, there is no convincing account for the inter-speaker differences in the visible articulatory gestures that are known to exist. There are no well-established techniques for seeking the ‘deep structures’ of speech and there is thus no clear answer, in either the acoustic or visual domains, to questions about what exactly it is that characterizes specific speech events. Similarly, a comprehensive catalogue showing the ‘perceptual distance’ that separates phones would set a baseline for human recognition performance that could be compared to machine recognition performance. Some studies of this kind have been undertaken (e.g. Summerfield Reference Summerfield, Dodd and Campbell1987), but the relationship between perceptual distances and the conformation of facial features remains incomplete, as noted in Sections 8.2.2 and 8.5.2 above.
The identity and characteristic properties of phones are among the factors that make the problem of automatic speech recognition difficult and interesting. Perceptual studies suggest numerous mechanisms for visual and audiovisual speech recognition in terms of the categorization of phones. They support the conventional view that the human system operates in a highly parallel way. It seems likely therefore that parallelism could play a major role in automatic visual and audiovisual speech processing systems. The development of highly parallel computing systems is, however, still at a comparatively rudimentary stage. In particular, it is difficult to find useful metalanguages to describe the nature of parallelism and how it can be applied to computational processes. Specifically, appropriate mechanisms are needed for the fusion of information computed by different processes prior to decision making (see also Section 8.4.4).
The success of the modern computational methods has been remarkable. The techniques that have been developed, such as multidimensional morphable models (Ezzat et al. Reference Ezzat, Geiger and Poggio2002a), have made it possible to achieve plausibly realistic visual speech synthesis and to carry out audiovisual speech recognition sufficiently accurately to make some useful applications possible (Brooke et al. Reference Brooke, Tomlinson and Moore1994). It is therefore tempting to concentrate on improving the performance of those techniques and to seek to meet the many demands of the modern world for short-term solutions to specific problems. However, there are risks in doing this without a proper understanding of the nature of visual and audiovisual speech processing. This can be gained only by patient and careful analytical studies, using well-designed and systematic bodies of test data, as this chapter has indicated. For example, even the best of today’s visual speech synthesizers would probably be unlikely to pass the ‘Turing test’ for communicative realism, proposed in Section 8.5.2, let alone a test that included the facial expressions that a computer agent would have to simulate. It is important to understand why they would not.
As Section 8.1 indicated, the early studies of visual speech were generally prompted by interest in the analytical investigation of visual speech processing and the desire to improve the rehabilitation of the hearing-impaired through the better-informed and improved teaching and training of speechreading skills. Speech production studies and perceptual studies of speechreading therefore prompted much of the initial work in the field and, as Section 8.5 suggests, will continue to offer an important contribution to the development of automatic visual and audiovisual speech processing systems. This is likely to be a synergistic collaborative process with the technologists and computer scientists who will in their turn develop the systems that are needed to assist in carrying out those studies.
On a final, more personal note, perhaps the greatest single impetus to the contemporary development of the audiovisual processing of speech was the NATO Advanced Study Institute at Bonas in 1995. For the first time, the whole community of audiovisual speech scientists was brought together, including mathematicians, statisticians, psychologists, linguists, engineers, and computer scientists. This was important because audiovisual speech processing encompasses an unusually wide spectrum of specialist studies and demands a cross-disciplinary approach. The NATO meeting was the catalyst that led to a rapid expansion of the field and also inspired the regular succession of Audiovisual Speech Processing (AVSP) meetings that has followed, largely due to the energy, enthusiasm, and unique leadership of Christian Benoît. The outcomes have now established the field securely and ensured that it will move forward constructively on a suitably broad knowledge base.