9 Audiovisual automatic speech recognition
9.1 Introduction
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments (O’Shaughnessy Reference O’Shaughnessy2003). However, ASR performance has yet to reach the level required for speech to become a truly pervasive user interface. Indeed, even in “clean” acoustic environments, and for a variety of tasks, state-of-the-art ASR system performance lags human speech perception by up to an order of magnitude (Lippmann Reference Lippmann1997). In addition, current systems are quite sensitive to channel, environment, and style of speech variation. A number of techniques for improving ASR robustness have met with limited success in severely degraded environments (Ghitza Reference Ghitza1986; Nadas et al.Reference Nadas, Nahamoo and Picheny1989; Juang Reference Juang1991; Liu et al.Reference Liu, Stern, Huang and Acero1993; Hermansky and Morgan Reference Hermansky and Morgan1994; Neti Reference Neti1994; Gales Reference Gales1997; Jiang et al.Reference Jiang, Soong and Lee2001; De la Torre et al.Reference De la Torre, Peinado, Segura, Perez-Cordoba, Benítez and Rubio2005; Droppo and Acero Reference Droppo, Acero, Banesty, Sondhi and Huang2008; Benesty et al. 2009). Clearly, novel, non-traditional approaches that use sources of information orthogonal to the acoustic input are needed to achieve ASR performance closer to the level of human speech perception, and robust enough to be deployable in field applications. Visual speech is the most promising source of additional speech information, and it is obviously not affected by the acoustic environment and noise.
Human speech perception is bimodal in nature: Humans combine audio and visual information in deciding what has been spoken, especially in noisy environments. The benefit of the visual modality to speech intelligibility in noise has been quantified as far back as in Sumby and Pollack (Reference Sumby and Pollack1954). Furthermore, bimodal fusion of audio and visual stimuli in perceiving speech has been demonstrated in the McGurk effect (McGurk and MacDonald Reference McGurk and MacDonald1976). For example, when the spoken sound /ba/ is superimposed on the video of a person uttering /ga/, most people perceive the speaker as uttering the sound /da/. In addition, visual speech is of particular importance to the hearing-impaired: Mouth movement is known to play an important role in both sign language and simultaneous communication among the deaf (Marschark et al.Reference Marschark, LePoutre, Bement, Campbell, Dodd and Burnham1998). The hearing-impaired speechread well, possibly better than the general population (Bernstein et al.Reference Bernstein, Demorest, Tucker, Campbell, Dodd and Burnham1998).
There are three key reasons why the availability of visual speech benefits human speech perception (Summerfield Reference Summerfield, Dodd and Campbell1987): It helps speaker (audio source) localization; it contains speech segmental information that supplements the audio; and it provides complementary information about the place of articulation. The latter is due to the partial or full visibility of articulators such as the tongue, teeth, and lips. Place of articulation information can help disambiguate, for example, the unvoiced consonants /p/ (a bilabial) and /k/ (a velar), the voiced consonant pair /b/ and /d/ (a bilabial and alveolar, respectively), and the nasal /m/ (a bilabial) from the nasal alveolar /n/ (Massaro and Stork Reference Massaro and Stork1998). All three pairs are highly confusable on the basis of acoustics alone. In addition, jaw and lower face muscle movement is correlated to the produced acoustics (Yehia et al.Reference Yehia, Rubin and Vatikiotis-Bateson1998; Barker and Berthommier Reference Barker and Berthommier1999), and when this movement is visible, human speech perception has been shown to be enhanced (Summerfield et al.Reference Summerfield, MacLeod, McGrath, Brooke, Young and Ellis1989; Smeele Reference Smeele, Stork and Hennecke1996).
The benefits of visual speech information for speech perception have motivated significant interest in automatic recognition of visual speech, formally known as automatic lipreading, or speechreading (Stork and Hennecke Reference Stork and Hennecke1996). Work in this field aims at improving ASR by exploiting the visual information from the speaker’s mouth region in addition to the traditional audio information, leading to audiovisual automatic speech recognition systems. Not surprisingly, systems that include the visual modality have been shown to outperform audio-only ASR over a wide range of conditions. Such performance gains are particularly impressive in noisy environments, where traditional acoustic-only ASR performs poorly. Improvements have also been demonstrated when speech is degraded due to speech impairment (Potamianos and Neti Reference Potamianos and Neti2001a) and Lombard effects (Huang and Chen Reference Huang and Chen2001). Coupled with the diminishing cost of quality video capturing systems, these facts make automatic speechreading tractable for achieving robust ASR in certain scenarios (Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Connell et al.Reference Connell, Haas, Marcheret, Neti, Potamianos and Velipasalar2003).
Automatic recognition of audiovisual speech introduces new and challenging tasks when compared to traditional, audio-only ASR. The block diagram of Figure 9.1 highlights these: In addition to the usual audio front end (feature extraction stage), visual features that are informative about speech must be extracted from video of the speaker’s face. This requires robust face detection, as well as location estimation and tracking of the speaker’s mouth or lips, followed by extraction of suitable visual features. In contrast to audio-only recognizers, there are now two streams of features available for recognition, one for each modality. The combination of the audio and visual streams should ensure that the resulting system performance exceeds the better of the two single-modality recognizers. Both issues, namely the visual front end design and audiovisual fusion, constitute difficult problems, and they have generated significant research work by the scientific community.
Figure 9.1 The main processing blocks of an audiovisual automatic speech recognizer. The visual front end design and the audiovisual fusion modules introduce additional challenging tasks to automatic recognition of speech, as compared to traditional, audio-only ASR. They are discussed in detail in this chapter.
Indeed, since the mid eighties, numerous articles have concentrated on audiovisual ASR, with the vast majority appearing during the last fifteen years. The first automatic speechreading system was reported by Petajan (Reference Petajan1984). Given the video of a speaker’s face, and using simple image thresholding, he was able to extract binary (black and white) mouth images, and subsequently, mouth height, width, perimeter, and area, as visual speech features. He then developed a visual-only recognizer based on dynamic time warping (Rabiner and Juang Reference Rabiner and Juang1993) to rescore the best two choices of the output of the baseline audio-only system. His method improved ASR for a single-speaker, isolated word recognition task on a 100-word vocabulary that included digits and letters. Petajan’s work generated significant excitement, and soon various sites established research in audiovisual ASR.
Among the pioneer sites was the group headed by Christian Benoît at the Institute de la Communication Parlée (ICP), in Grenoble, France. For example, Adjoudani and Benoît (Reference Adjoudani, Benoît, Stork and Hennecke1996) investigated the problem of audiovisual fusion for ASR, and compared early and late integration strategies. In the latter area, they considered modality reliability estimation based on the dispersion of the likelihood of the top four recognized words using the audio-only and visual-only inputs. They reported significant ASR gains on a single-speaker corpus of fifty-four French nonsense words. Later, they developed a multimedia platform for audiovisual speech processing, containing a head-mounted camera to robustly capture the speaker’s mouth region (Adjoudani et al.Reference Adjoudani, Guiard-Marigny, Le Goff, Revéret and Benoît1997). Recently, work at ICP has continued in this area, with additional audiovisual corpora collected (French connected letters and English connected digits) and a new audiovisual ASR system reported by Heckmann et al. (Reference Heckmann, Berthommier and Kroschel2001). In addition, the group has been working in related areas, including audiovisual speech enhancement (Girin et al.Reference Girin, Schwartz and Feng2001b), speech separation (Girin et al.Reference Girin, Allard and Schwartz2001a; Sodoyer et al.Reference Sodoyer, Girin, Jutten and Schwartz2004), coding (Girin Reference Girin2004), synthesis (Bailly et al.Reference Bailly, Bérar, Elisei and Odisio2003), and other aspects of audiovisual speech and face-to-face communication (Dohen et al.Reference Dohen, Schwartz and Bailly2010).
As shown in Figure 9.1, audiovisual ASR systems differ in three main aspects (Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Potamianos and Neti Reference Potamianos and Neti2003): The visual front end design, the audiovisual integration strategy, and the speech recognition method used. Unfortunately, the diverse algorithms suggested in the literature for automatic speechreading are very difficult to compare, as they are rarely tested on a common audiovisual database. In addition, until the beginning of the 2000s (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Hazen et al.Reference Hazen, Saenko, La and Glass2004), early audiovisual ASR studies were conducted on databases of short duration, and, in most cases, limited to a very small number of speakers (mostly fewer than ten, and often single-subject) and to small vocabulary tasks (Chibelushi et al.Reference Chibelushi, Deravi and Mason1996; Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Chibelushi et al.Reference Chibelushi, Deravi and Mason2002). Such tasks are typically nonsense words (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Su and Silsbee Reference Su and Silsbee1996), isolated words (Petajan Reference Petajan1984; Matthews et al.Reference Matthews, Bangham and Cox1996; Movellan and Chadderdon Reference Movellan, Chadderdon, Stork and Hennecke1996; Chan et al.Reference Chan, Zhang and Huang1998; Dupont and Luettin Reference Dupont and Luettin2000; Gurbuz et al.Reference Gurbuz, Tufekci, Patterson and Gowdy2001; Huang and Chen Reference Huang and Chen2001; Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002), connected letters (Potamianos et al.Reference Potamianos, Graf and Cosatto1998), connected digits (Potamianos et al.Reference Potamianos, Graf and Cosatto1998; Zhang et al.Reference Zhang, Levinson and Huang2000; Patterson et al.Reference Patterson, Gurbuz, Tufekci and Gowdy2002), closed-set sentences (Goldschen et al.Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996), or small-vocabulary continuous speech (Chu and Huang Reference Chu and Huang2000). Databases are commonly recorded in English, but other examples are French (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Alissali et al.Reference Alissali, Deléglise and Rogozan1996; André-Obrecht et al.Reference André-Obrecht, Jacob and Parlangeau1997; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Dupont and Luettin Reference Dupont and Luettin2000), German (Bregler et al.Reference Bregler, Hild, Manke and Waibel1993; Krone et al.Reference Krone, Talle, Wichert and Palm1997), Japanese (Nakamura et al.Reference Nakamura, Ito and Shikano2000; Fujimura et al.Reference Fujimura, Miyajima, Itou, Takeda and Itakura2005), Hungarian (Czap Reference Czap2000), Spanish (Ortega et al.Reference Ortega, Sukno, Lleida, Frangi, Miguel, Buera and Zacur2004), Czech (Železný and Cisař Reference Železný and Cisař2003), and Dutch (Wojdel et al. 2002), among others. However, if the visual modality is to become a viable component in real-world ASR systems, research work is required on larger vocabulary tasks, developing speechreading systems on data of sizable duration and of large subject populations.
A first attempt towards this goal was the authors’ work during the summer 2000 workshop at the Center for Language and Speech Processing at the Johns Hopkins University, in Baltimore, Maryland (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000), where a speaker-independent audiovisual ASR system for Large Vocabulary Continuous Speech Recognition (LVCSR) was developed for the first time. Significant performance gains in both clean and noisy audio conditions were reported.
In this chapter, we present the main techniques for audiovisual speech recognition that have been developed since the mid eighties. We first discuss the visual feature extraction problem, followed by a discussion of audiovisual fusion. In both cases, we provide details of some of the techniques employed during the Johns Hopkins summer 2000 workshop (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000). We also consider the problem of audiovisual speaker adaptation, an issue of significant importance when building speaker-specific models or developing systems across databases. We then discuss the main audiovisual corpora used in the literature for ASR experiments, including the IBM audiovisual LVCSR database. Subsequently, we present experimental results on automatic speechreading and audiovisual ASR. As an application of speaker adaptation, we consider the problem of automatic recognition of impaired speech. Finally, we conclude the chapter with a discussion on the current state of audiovisual ASR, and on what we view as open problems in this area.
9.2 Visual front ends
As was briefly mentioned in the Introduction (see also Figure 9.1), the first main difficulty in the area of audiovisual ASR is the visual front end design. The problem is two-fold: Face, lips, or mouth tracking is first required, followed by visual speech representation in terms of a small number of informative features. Clearly, the two issues are closely related: Employing a lip-tracking algorithm allows one to use visual features such as mouth height or width (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Chan et al.Reference Chan, Zhang and Huang1998; Potamianos et al.Reference Potamianos, Graf and Cosatto1998), or parameters of a suitable lip model (Chandramohan and Silsbee Reference Chandramohan and Silsbee1996; Dalton et al.Reference Dalton, Kaucic, Blake, Stork and Hennecke1996; Luettin et al.Reference Luettin, Thacker and Beet1996). On the other hand, a crude detection of the mouth region is sufficient to obtain visual features, using transformations of this region’s pixel values that achieve sufficient dimensionality reduction (Bregler et al.Reference Bregler, Hild, Manke and Waibel1993; Duchnowski et al.Reference Duchnowski, Meier and Waibel1994; Matthews et al.Reference Matthews, Bangham and Cox1996; Potamianos et al.Reference Potamianos and Neti2001b). Needless to say, robust tracking of the lips or mouth region is of paramount importance for good performance of automatic speechreading systems (Iyengar et al.Reference Iyengar, Potamianos, Neti, Faruquie and Verma2001).
9.2.1 Face detection, mouth, and lip tracking
The problem of face and facial-part detection has attracted significant interest in the literature (Graf et al.Reference Graf, Cosatto and Potamianos1997; Rowley et al.Reference Rowley, Baluja and Kanade1998; Sung and Poggio Reference Sung and Poggio1998; Senior Reference Senior1999; Li and Zhang Reference Li and Zhang2004; Garcia et al.Reference Garcia, Ostermann and Cootes2007). In addition to automatic speechreading, it has applications to other areas related to audiovisual speech, for example visual text-to-speech (Cohen and Massaro Reference Cohen and Massaro1994b; Chen et al.Reference Chen, Graf and Wang1995; Cosatto et al.Reference Cosatto, Potamianos and Graf2000; Bailly et al.Reference Bailly, Bérar, Elisei and Odisio2003; Aleksic and Katsaggelos Reference Aleksic and Katsaggelos2004b; Melenchón et al.Reference Melenchón, Martínez, De La Torre and Montero2009), person identification and verification (Jourlin et al.Reference Jourlin, Luettin, Genoud and Wassner1997; Wark and Sridharan Reference Wark and Sridharan1998; Fröba et al.Reference Fröba, Küblbeck, Rothe and Plankensteiner1999; Jain et al.Reference Jain, Bolle, Pankanti, Jain, Bolle and Pankanti1999; Maison et al.Reference Maison, Neti and Senior1999; Chibelushi et al.Reference Chibelushi, Deravi and Mason2002; Zhang et al.Reference Zhang, Broun, Mersereau and Clements2002; Sanderson and Paliwal Reference Sanderson and Paliwal2004; Aleksic and Katsaggelos Reference Aleksic and Katsaggelos2006), speaker localization (Bub et al.Reference Bub, Hunke and Waibel1995; Wang and Brandstein Reference Wang and Brandstein1999; Zotkin et al.Reference Zotkin, Duraiswami and Davis2002), detection of intent to speak (De Cuetos et al.Reference De Cuetos, Neti and Senior2000) and of speech activity (Libal et al.Reference Libal, Connell, Potamianos and Marcheret2007; Rivet et al.Reference Rivet, Girin and Jutten2007), face image retrieval (Swets and Weng Reference Swets and Weng1996), and others. In general, robust face and mouth detection is quite difficult, especially in cases where the background, face pose, and lighting are variable (Iyengar and Neti Reference Iyengar and Neti2001; Jiang et al.Reference Jiang, Potamianos and Iyengar2005).
In the audiovisual ASR literature, where issues such as visual feature design or audiovisual fusion algorithms are typically of more interest, face and mouth detection are often ignored, or at least, the problem is simplified: In some databases for example, the speaker’s lips are suitably colored, so that their automatic extraction becomes trivial by chroma-key methods (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Heckmann et al.Reference Heckmann, Berthommier and Kroschel2001). In other works, where audiovisual corpora are shared (for example, the Tulips1, (X)M2VTS, and AMP/CMU databases, discussed later), the mouth regions are extracted once and re-used in subsequent work by other researchers, or sites. Further, there have been efforts in the literature to design wearable audiovisual headsets that, when properly worn, reliably capture the speaker’s mouth region alone (Huang et al.Reference Huang, Potamianos, Connell and Neti2004). It should also be noted that in the vast majority of audiovisual databases the faces are frontal with minor face pose and lighting variation. Therefore, in this chapter we focus on frontal pose visual feature extraction. Nevertheless, the approaches and algorithms discussed here carry over to non-frontal head poses to a large extent, as demonstrated by the work of Iwano et al. (Reference Iwano, Yoshinaga, Tamura and Furui2007), Kumar et al. (Reference Kumar, Chen and Stern2007), Kumatani and Stiefelhagen (Reference Kumatani and Stiefelhagen2007), and Lucey et al. (Reference Lucey, Potamianos, Sridharan, Liew and Wang2009).
In general, all audiovisual ASR systems require determining a region-of-interest (ROI) for the visual feature extraction algorithm to proceed. For example, a ROI can be the entire face, in which case a subsequent active appearance model can be used to match to the exact face location (Cootes et al.Reference Cootes, Edwards and Taylor1998). Alternatively, a ROI can be the mouth-only region, in which case an active shape model of the lips can be used to fit a lip contour (Luettin et al.Reference Luettin, Thacker and Beet1996). If appearance-based visual features are to be extracted (see below) the latter is all that is required. Many techniques of varying complexity can be used to locate ROIs. Some use traditional image processing techniques, such as color segmentation, edge detection, image thresholding, template matching, or motion information (Graf et al.Reference Graf, Cosatto and Potamianos1997), whereas other methods use statistical modeling techniques, employing “strong” classifiers like neural networks for example (Rowley et al.Reference Rowley, Baluja and Kanade1998), or cascades of “weak” classifiers (Viola and Jones Reference Viola and Jones2001). In the following, we describe one such statistical modeling based approach.
9.2.1.1 Face detection and mouth region-of-interest extraction
A typical algorithm for face detection and facial feature localization is described in Senior (Reference Senior1999). This technique is used in the visual front end design of Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) and Potamianos et al. (Reference Potamianos and Neti2001b), when processing the video of the IBM ViaVoiceTM audiovisual database, described later. Given a video frame, face detection is first performed by employing a combination of methods, some of which are also used for subsequent face feature finding. A face template size is first chosen (an 11 × 11-pixel square, here), and an image pyramid over all permissible face locations and scales (given the video frame and face template sizes) is used to search for possible face candidates. This search is constrained by the minimum and maximum allowed face candidate size with respect to the frame size, the face size increment from one pyramid level to the next, the spatial shift in searching for faces within each pyramid level, and the fact that no candidate face can be of smaller size than the face template. In Potamianos et al. (Reference Potamianos and Neti2001b), the face width is restricted to lie within 10% and 75% of the frame width, with a face size increase of 15% across consecutive pyramid levels. Within each pyramid level, a local horizontal and vertical shift of one pixel is used to search for candidate faces.
If the video signal is in color, skin-tone segmentation can be used to quickly narrow the search to face candidates that contain a relatively high proportion of skin-tone pixels. The normalized (red, green, blue) values of each frame pixel are first transformed to the (hue, saturation) color space, where skin tone is known to occupy a range of values largely invariant to most humans and lighting conditions (Graf et al.Reference Graf, Cosatto and Potamianos1997; Senior Reference Senior1999). In this particular implementation, all face candidates that contain less than 25% of pixels with hue and saturation values that fall within the skin-tone range, are eliminated. This substantially reduces the number of face candidates (depending on the frame background), speeding up computation and reducing spurious face detections. Every remaining face candidate is subsequently size-normalized to the 11 × 11 face template size, and its grayscale pixel values are placed into a 121-dimensional face candidate vector. Each vector is given a score based on both a two-class (face versus non-face) Fisher linear discriminant and the candidate’s distance from face space (DFFS), i.e., the face vector projection error onto a lower, 40-dimensional space, obtained by means of principal components analysis (PCA – see below). All candidate regions exceeding a threshold score are considered as faces. Among such faces at neighboring scales and locations, the one achieving the maximum score is returned by the algorithm as a detected face (Senior Reference Senior1999). An improved version of this algorithm appears in Jiang et al. (Reference Jiang, Potamianos and Iyengar2005).
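As a rough illustration of this candidate-scoring step, the following Python sketch combines a skin-tone fraction check with a Fisher linear discriminant score and a PCA-based distance-from-face-space (DFFS) term. The 25% skin-tone threshold follows the text, but the hue/saturation ranges, the fusion of the two scores by simple subtraction, and the pre-trained quantities (mean_face, eigenfaces, fld_weights, fld_bias) are illustrative assumptions, not the exact values or scoring rule of Senior (1999).

```python
import numpy as np

def skin_fraction(candidate_hs, hue_range=(0.0, 0.14), sat_range=(0.2, 0.7)):
    """Fraction of candidate pixels whose (hue, saturation) fall in a skin-tone range.
    The numeric ranges are placeholders, not the values of the cited system."""
    h, s = candidate_hs[..., 0], candidate_hs[..., 1]
    mask = (h >= hue_range[0]) & (h <= hue_range[1]) & \
           (s >= sat_range[0]) & (s <= sat_range[1])
    return mask.mean()

def dffs_score(candidate_gray, mean_face, eigenfaces):
    """Distance-from-face-space: reconstruction error of the 121-dim candidate vector
    after projection onto the 40 leading PCA eigenfaces (lower = more face-like)."""
    x = candidate_gray.reshape(-1).astype(float) - mean_face   # 11*11 = 121 dims
    proj = eigenfaces @ x                                      # (40,) PCA coefficients
    recon = eigenfaces.T @ proj
    return np.linalg.norm(x - recon)

def score_candidate(candidate_gray, candidate_hs, mean_face, eigenfaces,
                    fld_weights, fld_bias, min_skin=0.25):
    """Reject candidates with too few skin-tone pixels, otherwise fuse a
    face-vs-non-face Fisher discriminant score with the negated DFFS."""
    if skin_fraction(candidate_hs) < min_skin:
        return -np.inf
    x = candidate_gray.reshape(-1).astype(float)
    fld = float(fld_weights @ x + fld_bias)
    return fld - dffs_score(candidate_gray, mean_face, eigenfaces)
```

In practice the candidate with the maximum score among neighboring scales and locations would be returned as the detected face, as described above.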
Once a face has been detected, an ensemble of facial feature detectors is used to estimate the locations of twenty-six facial features, including the lip corners and centers (twelve such facial features are marked on the frames of Figure 9.2). Each feature location is determined by using a score combination of prior feature location statistics, linear discriminant, and distance from feature space (similar to the DFFS discussed above), based on the chosen feature template size (such as 11 × 11 pixels).
Before incorporating the described algorithm into our speechreading system, a training step is required to estimate the Fisher discriminant and eigenvectors (PCA) for face detection and facial feature estimation, as well as the facial feature location statistics. Such training requires a number of frames manually annotated with the faces and their visible features. When training the Fisher discriminant, both face and non-face (or facial feature and non-feature) vectors are used, whereas in the case of PCA, face and facial-feature-only vectors are considered (Senior Reference Senior1999).
Given the output of the face detection and facial feature finding algorithm described above, five located lip contour points are used to estimate the mouth center and its size at every video frame (four such points are marked on the frames of Figure 9.2). To improve ROI extraction robustness to face and mouth detection errors, the mouth center estimates are smoothed over twenty neighboring frames using median filtering to obtain the ROI center, whereas the mouth size estimates are averaged over each utterance. A size-normalized square ROI is then extracted (see Eq. (9.1), below), with sides M = N = 64 (see also Figure 9.2). This can contain just the mouth region, or also parts of the lower face (Potamianos and Neti Reference Potamianos and Neti2001b).
Figure 9.2 Region-of-interest extraction examples. Upper rows: Example video frames of eight subjects from the IBM ViaVoiceTM audiovisual database (described below), with superimposed facial features, detected by the algorithm of Senior (Reference Senior1999). Lower row: Corresponding mouth regions-of-interest, extracted as in Potamianos et al.(Reference Potamianos and Neti2001b). © 1999 and 2001 IEEE.
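The ROI extraction just described can be sketched as follows. It assumes per-frame mouth-center and mouth-size estimates are already available from the facial feature detectors; the 21-frame median window (the closest odd value to the twenty frames mentioned above), the use of the utterance-averaged mouth size as the crop half-width, and the nearest-neighbour resizing are simplifying assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def extract_mouth_rois(frames, mouth_centers, mouth_sizes, roi_size=64, smooth=21):
    """Cut a size-normalized square mouth ROI from every frame.

    frames        : (T, H, W) grayscale video
    mouth_centers : (T, 2) per-frame (row, col) mouth-center estimates
    mouth_sizes   : (T,)   per-frame mouth-size estimates
    """
    rows = medfilt(mouth_centers[:, 0].astype(float), smooth)   # temporal smoothing
    cols = medfilt(mouth_centers[:, 1].astype(float), smooth)
    half = int(round(mouth_sizes.mean()))                       # utterance-averaged size
    rois = []
    for t, frame in enumerate(frames):
        r, c = int(rows[t]), int(cols[t])
        patch = frame[max(r - half, 0):r + half, max(c - half, 0):c + half]
        # resize the crop to roi_size x roi_size by nearest-neighbour index sampling
        ri = np.linspace(0, patch.shape[0] - 1, roi_size).astype(int)
        ci = np.linspace(0, patch.shape[1] - 1, roi_size).astype(int)
        rois.append(patch[np.ix_(ri, ci)])
    return np.stack(rois)                                       # (T, 64, 64)
```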
9.2.1.2 Lip contour tracking
Once the mouth region is located, a number of algorithms can be used to obtain lip contour estimates. Some popular methods are snakes (Kass et al.Reference Kass, Witkin and Terzopoulos1988), templates (Yuille et al.Reference Yuille, Hallinan and Cohen1992; Silsbee Reference Silsbee1994), and active shape and appearance models (Cootes et al.Reference Cootes, Taylor, Cooper and Graham1995; Cootes et al.Reference Cootes, Edwards and Taylor1998).
A snake is an elastic curve represented by a set of control points. The control point coordinates are iteratively updated, by converging towards the local minimum of an energy function, defined on the basis of curve smoothness constraints and a matching criterion with respect to desired features of the image (Kass et al.Reference Kass, Witkin and Terzopoulos1988). Such an algorithm is used for lip contour estimation in the speechreading system of Chiou and Hwang (Reference Chiou and Hwang1997). Another widely used technique for lip tracking is by means of lip templates, employed in the system of Chandramohan and Silsbee (Reference Chandramohan and Silsbee1996) for example. Templates constitute parameterized curves that are fitted to the desired shape by minimizing an energy function, defined similarly to snakes. B-splines, used by Dalton et al. (Reference Dalton, Kaucic, Blake, Stork and Hennecke1996), work similarly to the above techniques as well. Combinations of the above have also been used in the literature, as for example by Aleksic et al. (Reference Aleksic, Williams, Wu and Katsaggelos2002), where both snakes and templates are employed.
Active shape and appearance models construct a lip shape or ROI appearance statistical model, as discussed in following subsections. These models can be used for tracking lips by means of the algorithm proposed by Cootes et al. (Reference Matthews, Cootes, Cox, Harvey and Bangham1998). This assumes that, given small perturbations from the actual fit of the model to a target image, a linear relationship exists between the difference in the model projection and image, and the required updates to the model parameters. An iterative algorithm is used to fit the model to the image data (Matthews et al.Reference Matthews, Cootes, Cox, Harvey and Bangham1998). Alternatively, the fitting can be performed by the downhill simplex method (Nelder and Mead Reference Nelder and Mead1965), as in Luettin et al. (Reference Luettin, Thacker and Beet1996). Examples of lip contour estimation by means of active shape models using the latter fitting technique are depicted in Figure 9.3.
Figure 9.3 Examples of lip contour estimation by means of active shape models (Luettin et al.Reference Luettin, Thacker and Beet1996). Depicted mouth regions are from the Tulips1 audiovisual database (Movellan and Chadderdon Reference Movellan, Chadderdon, Stork and Hennecke1996), and they were extracted preceding lip contour estimation. Reprinted from Computer Vision and Image Understanding, 65:2, Luettin and Thacker, Speechreading using probabilistic models, 163–178, © Reference Luettin and Thacker1997, with permission from Elsevier.
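The iterative fitting idea behind active shape and appearance model tracking can be illustrated with the following generic sketch. The functions render, sample, and the learned update_matrix are placeholders standing in for a trained shape/appearance model and its residual-to-update regression; they are assumptions for illustration, not the specific algorithm of the cited works.

```python
import numpy as np

def fit_model(image, params0, render, sample, update_matrix, n_iter=20, tol=1e-4):
    """Iterative model-to-image fit in the spirit of the linear-update assumption:
    the residual between the model synthesis and the image is treated as (locally)
    linear in the required parameter update."""
    p = params0.copy()
    for _ in range(n_iter):
        residual = sample(image, p) - render(p)   # image texture vs. model texture
        dp = update_matrix @ residual             # learned linear map to parameter updates
        p = p - dp
        if np.linalg.norm(dp) < tol:              # stop when the update becomes negligible
            break
    return p
```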
9.2.2 Visual features
Various sets of visual features for automatic speechreading have been proposed in the literature over the last twenty years. In general, they can be grouped into three categories: (a) Video pixel- (or, appearance) based ones; (b) Lip contour- (or, shape) based features; and (c) Features that are a combination of both appearance and shape (Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Aleksic et al.Reference Aleksic, Potamianos and Katsaggelos2005). In the following, we present each category in more detail. Possible post-feature extraction processing is discussed at the end of this section.
9.2.2.1 Appearance-based features
In this approach to visual feature extraction, the image part typically containing the speaker’s mouth region is considered as informative for lipreading, i.e., the region-of-interest (ROI). This region can be a rectangle containing the mouth, and possibly include larger parts of the lower face, such as the jaw and cheeks (Potamianos and Neti Reference Potamianos and Neti2001b), or the entire face (Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001). Often, it can be a three-dimensional rectangle, containing adjacent frame rectangular ROIs, in an effort to capture dynamic speech information at this early stage of processing (Li et al.Reference Li, Dettmer and Shah1995; Potamianos et al.Reference Potamianos, Graf and Cosatto1998). Alternatively, the ROI can correspond to a number of image profiles vertical to the lip contour (Dupont and Luettin Reference Dupont and Luettin2000), or just be a disc around the mouth center (Duchnowski et al.Reference Duchnowski, Meier and Waibel1994). By concatenating the ROI pixel grayscale (Bregler et al.Reference Bregler, Hild, Manke and Waibel1993; Duchnowski et al.Reference Duchnowski, Meier and Waibel1994; Potamianos et al.Reference Potamianos, Graf and Cosatto1998; Dupont and Luettin Reference Dupont and Luettin2000), or color values (Chiou and Hwang Reference Chiou and Hwang1997), a feature vector is obtained. For example, in the case of an M × N-pixel rectangular ROI, which is centered at location (mt,nt) of video frame Vt(m,n) at time t, the resulting feature vector of length d = M.N will be (after a lexicographic ordering)1
xt = [ Vt(m, n) :  mt − M/2 ≤ m < mt + M/2 ,  nt − N/2 ≤ n < nt + N/2 ]T.     (9.1)
This vector is expected to contain most visual speech information. Notice that approaches that use optical flow as visual features (Mase and Pentland Reference Mase and Pentland1991; Gray et al.Reference Gray, Movellan, Sejnowski, Mozer, Jordan and Petsche1997) can fit within this framework by replacing in Eq. (9.1) the video frame ROI pixels with optical flow estimates.
Typically, the dimensionality d of vector xt in Eq. (9.1) is too large to allow successful statistical modeling (Chatfield and Collins Reference Chatfield and Collins1991) of speech classes, by means of a hidden Markov model (HMM), for example (Rabiner and Juang Reference Rabiner and Juang1993). Therefore, appropriate transformations of the ROI pixel values are used as visual features. Movellan and Chadderdon (Reference Movellan, Chadderdon, Stork and Hennecke1996) for example, use low-pass filtering followed by image subsampling and video frame ROI differencing, whereas Matthews et al. (Reference Matthews, Bangham and Cox1996) propose a non-linear image decomposition using “image sieves” for dimensionality reduction and feature extraction. By far however, the most popular appearance feature representations achieve such reduction by using traditional image transforms (Gonzalez and Wintz Reference Gonzalez and Wintz1977). These transforms are typically borrowed from the image compression literature, and the hope is that they will preserve most information relevant to speechreading. In general, a D × d-dimensional linear transform matrix P is sought, such that the transformed data vector yt =Pxt contains most speechreading information in its D « d elements. To obtain matrix P, L training examples are given, denoted by xl, l =1, . . . , L. A number of possible such matrices are described in the following.
Principal components analysis (PCA)
This constitutes the most popular pixel-based feature representation for automatic speechreading (Bregler et al.Reference Bregler, Hild, Manke and Waibel1993; Bregler and Konig Reference Bregler and Konig1994; Duchnowski et al.Reference Duchnowski, Meier and Waibel1994; Li et al.Reference Li, Dettmer and Shah1995; Brooke Reference Brooke, Stork and Hennecke1996; Tomlinson et al.Reference Tomlinson, Russell and Brooke1996; Chiou and Hwang Reference Chiou and Hwang1997; Gray et al.Reference Gray, Movellan, Sejnowski, Mozer, Jordan and Petsche1997; Luettin and Thacker Reference Luettin and Thacker1997; Potamianos et al.Reference Potamianos, Graf and Cosatto1998; Dupont and Luettin Reference Dupont and Luettin2000; Hazen et al.Reference Hazen, Saenko, La and Glass2004). The PCA data projection achieves optimal information compression, in the sense of minimum square error between the original vector xt and its reconstruction based on its projection yt; however, appropriate data scaling constitutes a problem in the classification of the resulting vectors (Chatfield and Collins Reference Chatfield and Collins1991). In the PCA implementation of Potamianos et al. (Reference Potamianos, Graf and Cosatto1998), the data are scaled according to their inverse variance, and their correlation matrix R is computed. Subsequently, R is diagonalized as R = AΛAT (Chatfield and Collins Reference Chatfield and Collins1991; Press et al.Reference Press, Flannery, Teukolsky and Vetterling1995), where A = [a1,..,ad] has as columns the eigenvectors of R, and Λ is a diagonal matrix containing the eigenvalues of R. Assuming that the D largest such eigenvalues are located at the j1, . . .,jD diagonal positions, the data projection matrix is PPCA= [aj1, . . .,ajD]T. Given a data vector xt, this is first element-wise mean and variance normalized, and subsequently, its feature vector is extracted as yt =PPCAxt.
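A minimal sketch of this PCA training and projection procedure, following the variance scaling and correlation-matrix diagonalization just described, is given below; the small constant guarding against zero-variance pixels is an added assumption.

```python
import numpy as np

def train_pca(X, D):
    """X: (L, d) training ROI vectors, one per row. Returns the normalization
    statistics and the D x d projection matrix built from the leading
    eigenvectors of the correlation matrix of the normalized data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8                 # avoid division by zero
    Z = (X - mean) / std                       # element-wise mean/variance normalization
    R = np.corrcoef(Z, rowvar=False)           # d x d correlation matrix
    evals, evecs = np.linalg.eigh(R)
    P = evecs[:, np.argsort(evals)[::-1][:D]].T   # D leading eigenvectors as rows
    return mean, std, P

def pca_features(x, mean, std, P):
    """Project one ROI vector x (length d) to its D-dimensional PCA features y = P x."""
    return P @ ((x - mean) / std)
```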
Discrete cosine, wavelet, and other image transforms
As an alternative to PCA, a number of popular linear image transforms (Gonzalez and Wintz Reference Gonzalez and Wintz1977) have been used in place of P for obtaining speechreading features. For example, the discrete cosine transform (DCT) has been adopted in several systems (Duchnowski et al.Reference Duchnowski, Meier and Waibel1994; Potamianos et al.Reference Potamianos, Graf and Cosatto1998; Nakamura et al.Reference Nakamura, Ito and Shikano2000; Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Scanlon and Reilly Reference Scanlon and Reilly2001; Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002; Barker and Shao Reference Barker and Shao2009); the discrete wavelet transform (DWT – Daubechies Reference Daubechies1992) in others (Potamianos et al.Reference Potamianos, Graf and Cosatto1998), and the Hadamard and Haar transforms by Scanlon and Reilly (Reference Scanlon and Reilly2001). Most researchers use separable transforms (Gonzalez and Wintz Reference Gonzalez and Wintz1977), which allow fast implementations (Press et al.Reference Press, Flannery, Teukolsky and Vetterling1995) when M and N are powers of 2 (typically, values M, N = 16, 32, or 64 are considered). Notice that, in each case, matrix P can have as rows the image transform matrix rows that maximize the transformed data energy over the training set (Potamianos et al.Reference Potamianos, Graf and Cosatto1998), or alternatively, that correspond to a priori chosen locations (Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002).
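For the DCT case, selecting the coefficients of highest training-set energy can be sketched as follows; the separable 2-D DCT from scipy is used here as a convenient stand-in for whichever fast implementation a given system employs.

```python
import numpy as np
from scipy.fft import dctn

def dct_feature_mask(rois, D):
    """rois: (L, M, N) training ROIs, M and N ideally powers of two. Keep the D
    separable-DCT coefficients with the highest average energy over the training set."""
    energy = np.zeros(rois.shape[1:])
    for roi in rois:
        energy += dctn(roi.astype(float), norm='ortho') ** 2
    flat = np.argsort(energy.ravel())[::-1][:D]
    return np.unravel_index(flat, energy.shape)      # (row_idx, col_idx) of kept coeffs

def dct_features(roi, mask):
    """D-dimensional DCT appearance feature vector for one ROI."""
    return dctn(roi.astype(float), norm='ortho')[mask]
```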
Linear discriminant analysis (LDA)
The data vector transforms presented above are more suitable for ROI compression than for ROI classification into the set of speech classes of interest. For the latter task, LDA (Rao Reference Rao1965) is more appropriate, as it maps features to a new space for improved classification. LDA was first proposed for automatic speechreading by Duchnowski et al. (Reference Duchnowski, Meier and Waibel1994). There, it was applied directly to the ROI vector. LDA has also been considered in a cascade, following the PCA projection of a single frame ROI vector, or on the concatenation of a number of adjacent PCA projected vectors (Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001).
LDA assumes that a set of classes, C (such as HMM states), is a priori chosen, and, in addition, that the training set data vectors xl , l =1,..,L are labeled as c(l) ∈ C. Then, it seeks matrix PLDA, such that the projected training sample [PLDAxl , l = 1, . . ., L] is “well separated” into the set of classes C, according to a function of the training sample within-class scatter matrix SW and its between-class scatter matrix SB (Rao Reference Rao1965). These matrices are given by
SW = ∑c∈C Pr(c) ∑(c)   and   SB = ∑c∈C Pr(c) (m(c) − m)(m(c) − m)T,     (9.2)

respectively. In Eq. (9.2), Pr(c) = Lc/L, c ∈ C, is the class empirical probability mass function, where Lc = ∑l=1,...,L δc(l),c and δi,j = 1, if i = j; 0, otherwise; in addition, m(c) and ∑(c) denote the class sample mean and covariance, respectively; and finally, m = ∑c∈C Pr(c) m(c) is the total sample mean. To estimate PLDA, the generalized eigenvalues and right eigenvectors of the matrix pair (SB, SW), that satisfy SBF = SWFΛ, are first computed (Rao 1965; Golub and van Loan Reference Golub and van Loan1983). Matrix F = [f1, . . ., fd] has as columns the generalized eigenvectors. Assuming that the D largest eigenvalues are located at the j1, . . ., jD diagonal positions of Λ, then PLDA = [fj1, . . ., fjD]T. It should be noted that, due to Eq. (9.2), the rank of SB is at most |C| − 1, where |C| denotes the number of classes (the cardinality of set C); hence D ≤ |C| − 1 should hold. In addition, the rank of the d × d-dimensional matrix SW cannot exceed L − |C|; therefore, having insufficient training data with respect to the input feature vector dimension d is a potential problem.
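The scatter-matrix computation and the generalized eigenproblem of Eq. (9.2) can be sketched as follows. The small ridge added to SW is an assumption made here to keep the problem well posed when training data are scarce, as cautioned above; it is not part of the formal derivation.

```python
import numpy as np
from scipy.linalg import eigh

def train_lda(X, labels, D):
    """X: (L, d) feature vectors, labels: (L,) class indices.
    Solves SB f = lambda SW f and keeps the D leading generalized
    eigenvectors as the rows of P_LDA."""
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / len(labels)
    m = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c, pr in zip(classes, priors):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += pr * np.cov(Xc, rowvar=False, bias=True)   # within-class scatter
        Sb += pr * np.outer(mc - m, mc - m)              # between-class scatter
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(d))       # ridge keeps SW invertible
    return evecs[:, np.argsort(evals)[::-1][:D]].T       # P_LDA, shape (D, d)
```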
Maximum likelihood data rotation (MLLT)
In our speechreading system (Potamianos et al.Reference Potamianos and Neti2001b), LDA is followed by the application of a data maximum likelihood linear transform (MLLT). This transform seeks a square, non-singular, data rotation matrix PMLLT that maximizes the observation data likelihood in the original feature space, under the assumption of diagonal data covariance in the transformed space (Gopinath Reference Gopinath1998). Such a rotation is beneficial, since in most ASR systems diagonal covariances are typically assumed when modeling the observation class conditional probability distribution with Gaussian mixture models. The desired rotation matrix is obtained as
PMLLT = arg maxP { L log |det(P)| − (1/2) ∑c∈C Lc log det( diag( P ∑(c) PT ) ) }     (9.3)
(Gopinath Reference Gopinath1998). This can be solved numerically (Press et al.Reference Press, Flannery, Teukolsky and Vetterling1995).
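One possible numerical treatment of Eq. (9.3) is sketched below, using a general-purpose quasi-Newton optimizer rather than whichever iterative scheme the cited work employs; the penalty values returned for degenerate matrices are arbitrary safeguards.

```python
import numpy as np
from scipy.optimize import minimize

def train_mllt(class_covs, class_counts, n_iter=200):
    """Maximize the MLLT objective of Eq. (9.3): find a square matrix P that makes the
    per-class covariances as diagonal as possible in the likelihood sense.
    class_covs: list of (D, D) class covariances; class_counts: samples per class."""
    D = class_covs[0].shape[0]
    L = float(sum(class_counts))

    def neg_objective(p_flat):
        P = p_flat.reshape(D, D)
        sign, logdet = np.linalg.slogdet(P)
        if sign <= 0:
            return 1e10                              # keep P non-singular
        obj = L * logdet
        for S, n in zip(class_covs, class_counts):
            diag = np.diag(P @ S @ P.T)
            if np.any(diag <= 0):
                return 1e10
            obj -= 0.5 * n * np.sum(np.log(diag))
        return -obj

    res = minimize(neg_objective, np.eye(D).ravel(),
                   method='L-BFGS-B', options={'maxiter': n_iter})
    return res.x.reshape(D, D)                       # P_MLLT
```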
Notice that LDA and MLLT are data transforms aiming at improved classification performance and maximum likelihood data modeling. Therefore, their application can be viewed as a feature post-processing stage, and clearly, should not be limited to appearance-only visual data.
9.2.2.2 Shape-based features
In contrast to appearance-based features, shape-based feature extraction assumes that most speechreading information is contained in the shape (contours) of the speaker’s lips, or more generally (Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001), in the face contours (such as jaw and cheek shape, in addition to the lips). Two types of features fall within this category, geometric features and shape model-based features. In both cases, an algorithm that extracts the inner and/or outer lip contours, or in general, the face shape, is required. A variety of such algorithms were discussed above.
Lip geometric features
Given the lip contour, a number of high-level features meaningful to humans can be readily extracted, such as the contour height, width, and perimeter, as well as the area contained within the contour. As demonstrated in Figure 9.4, such features do contain significant speech information. Not surprisingly, a large number of speechreading systems make use of all or a subset of them (Petajan Reference Petajan1984; Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Alissali et al.Reference Alissali, Deléglise and Rogozan1996; Goldschen et al.Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996; André-Obrecht et al.Reference André-Obrecht, Jacob and Parlangeau1997; Jourlin Reference Jourlin1997; Chan et al.Reference Chan, Zhang and Huang1998; Rogozan and Deléglise Reference Rogozan and Deléglise1998; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Zhang et al.Reference Zhang, Levinson and Huang2000; Gurbuz et al.Reference Gurbuz, Tufekci, Patterson and Gowdy2001; Heckmann et al.Reference Heckmann, Berthommier and Kroschel2001; Huang and Chen Reference Huang and Chen2001).
Figure 9.4 Geometric feature approach. Top: Reconstruction of an estimated outer lip contour from 1, 2, 3, and 20 sets of its Fourier coefficients. Bottom: Three geometric visual features, displayed on a normalized scale, tracked over the spoken utterance “81926” of the connected digits database of Potamianos et al. (Reference Potamianos, Graf and Cosatto1998). Lip contours are estimated as in Graf et al. (Reference Graf, Cosatto and Potamianos1997). © Reference Graf, Cosatto and Potamianos1997 and 1998 IEEE
Additional visual features can be derived from the lip contours, such as lip image moments and lip contour Fourier descriptors (see Figure 9.4), that are invariant to affine image transformations. Indeed, a number of central moments of the contour interior binary image, or its normalized moments, as defined in Dougherty and Giardina (Reference Dougherty and Giardina1987), have been considered as visual features (Czap Reference Czap2000). Normalized Fourier series coefficients of a contour parameterization (Dougherty and Giardina Reference Dougherty and Giardina1987) have also been used to augment previously discussed geometric features in some speechreading systems, resulting in improved automatic speechreading (Potamianos et al.Reference Potamianos, Graf and Cosatto1998; Gurbuz et al.Reference Gurbuz, Tufekci, Patterson and Gowdy2001).
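Given an estimated outer lip contour, the basic geometric features mentioned above are straightforward to compute; a minimal sketch (using the shoelace formula for the enclosed area) follows. The (x, y) point ordering is an assumption about how the contour is delivered by the tracker.

```python
import numpy as np

def lip_geometry(contour):
    """contour: (K, 2) ordered (x, y) points on the outer lip contour.
    Returns width, height, perimeter, and enclosed area."""
    x, y = contour[:, 0], contour[:, 1]
    width = x.max() - x.min()
    height = y.max() - y.min()
    closed = np.vstack([contour, contour[:1]])                 # close the polygon
    perimeter = np.sum(np.linalg.norm(np.diff(closed, axis=0), axis=1))
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return np.array([width, height, perimeter, area])
```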
Lip model features
A number of parametric models (Basu et al.Reference Basu, Oliver and Pentland1998) have been used for lip- or face-shape tracking in the literature, and briefly reviewed in a previous subsection. The parameters of these models can be readily used as visual features. For example, Chiou and Hwang (Reference Chiou and Hwang1997) employ a snake-based algorithm to estimate lip contour, and subsequently use a number of snake radial vectors as visual features. Su and Silsbee (Reference Su and Silsbee1996), as well as Chandramohan and Silsbee (Reference Chandramohan and Silsbee1996), use lip template parameters instead.
Another popular lip model is the active shape model (ASM). These are flexible statistical models that represent an object with a set of labeled points (Cootes et al.Reference Cootes, Taylor, Cooper and Graham1995; Luettin et al.Reference Luettin, Thacker and Beet1996). The object can be the inner and/or outer lip contour (Luettin and Thacker Reference Luettin and Thacker1997), or the union of various face shape contours as in Matthews et al. (Reference Matthews, Potamianos, Neti and Luettin2001). To derive an ASM, a number of K contour points are first labeled on available training set images, and their coordinates are placed on the 2K-dimensional shape vectors
xt(S) = [ m1,t , n1,t , m2,t , n2,t , . . . , mK,t , nK,t ]T.     (9.4)

Given a set of vectors in Eq. (9.4), PCA can be used to identify the optimal orthogonal linear transform P(S)PCA in terms of the variance described along each dimension, resulting in a statistical model of the lip or facial shape (see Figure 9.5). To identify axes of genuine shape variation, each shape in the training set must be aligned. This is achieved using a similarity transform (translation, rotation, and scaling), by means of an iterative Procrustes analysis (Cootes et al.Reference Cootes, Taylor, Cooper and Graham1995; Dryden and Mardia Reference Dryden and Mardia1998). Given a tracked lip contour, the extracted visual features will be yt(S) = P(S)PCA xt(S). Note that vectors in Eq. (9.4) can be the output of a tracking algorithm based on B-splines for example (Dalton et al.Reference Dalton, Kaucic, Blake, Stork and Hennecke1996), or specific “meaningful” points of the lips, appropriately tracked, as the facial animation parameters (Pandzic and Forchheimer Reference Pandzic and Forchheimer2002; Ekman and Friesen Reference Ekman and Friesen2003) in Aleksic et al. (Reference Aleksic, Williams, Wu and Katsaggelos2002).
Figure 9.5 Statistical shape model. The top four modes are plotted (left-to-right) at ±3 standard deviations around the mean. These four modes describe 65% of the variance of the training set, which consists of 4072 labeled images from the IBM ViaVoiceTM audiovisual database (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001). © 2000 and 2001 IEEE.
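The two steps that produce such a shape model, iterative Procrustes alignment followed by PCA of the aligned shape vectors of Eq. (9.4), are sketched below. The simple ordinary-Procrustes alignment and the fixed number of alignment iterations are assumptions for illustration.

```python
import numpy as np

def align_to(ref, shape):
    """Similarity-align `shape` (K, 2) to `ref` (K, 2): translation, scale, rotation."""
    ref_c = ref - ref.mean(axis=0)
    sh_c = shape - shape.mean(axis=0)
    U, s, Vt = np.linalg.svd(sh_c.T @ ref_c)      # cross-covariance SVD (Procrustes)
    R = U @ Vt
    scale = s.sum() / (sh_c ** 2).sum()
    return scale * sh_c @ R + ref.mean(axis=0)

def train_asm(shapes, D, n_iter=5):
    """shapes: (L, K, 2) labeled contour points. Align all shapes to their mean,
    then run PCA on the 2K-dimensional shape vectors to obtain the shape model."""
    aligned = shapes.astype(float)
    for _ in range(n_iter):
        mean_shape = aligned.mean(axis=0)
        aligned = np.stack([align_to(mean_shape, s) for s in aligned])
    X = aligned.reshape(len(shapes), -1)          # (L, 2K) shape vectors of Eq. (9.4)
    mean_vec = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X - mean_vec, rowvar=False))
    order = np.argsort(evals)[::-1][:D]
    return mean_vec, evecs[:, order].T            # mean shape, P_PCA^(S) of size (D, 2K)
```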
9.2.2.3 Joint appearance and shape features
Appearance- and shape-based visual features are quite different in nature. In a sense they code low- and high-level information about the speaker’s face and lip movements. Not surprisingly, combinations of features from both categories have been employed in a number of automatic speechreading systems.
In most cases, features from each category are just concatenated. For example, Chan (Reference Chan2001) combines geometric lip features with the PCA projection of a subset of pixels contained within the mouth. Luettin et al. (Reference Luettin, Thacker and Beet1996), as well as Dupont and Luettin (Reference Dupont and Luettin2000), combine ASM features with PCA-based ones, extracted from a ROI that consists of short image profiles around the lip contour. Chiou and Hwang (Reference Chiou and Hwang1997), on the other hand, combine a number of snake lip contour radial vectors with PCA features of the color pixel values of a rectangle mouth ROI.
A different approach to combining the two classes of features is to create a single model of face shape and appearance. An active appearance model (AAM – Cootes et al.Reference Cootes, Edwards and Taylor1998) provides a framework to statistically combine them. Building an AAM requires three applications of PCA:
1. A shape eigenspace calculation that models shape deformations, resulting in PCA matrix P(S)PCA, computed as above (see Eq. (9.4)).

2. An appearance eigenspace calculation to model appearance changes, resulting in a PCA matrix P(A)PCA of the ROI appearance vectors. If the color values of the M × N-pixel ROI are considered, such vectors are

xt(A) = [ Vt(m, n) :  mt − M/2 ≤ m < mt + M/2 ,  nt − N/2 ≤ n < nt + N/2 ]T.     (9.5)

3. Using these, calculation of a combined shape and appearance eigenspace. The latter is a PCA matrix P(A,S)PCA computed on training vectors

x(A,S) = [ (W y(S))T , (y(A))T ]T,     (9.6)

where W is a suitable diagonal scaling matrix (Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001). The aim of this final PCA is to remove the redundancy due to the shape and appearance correlation and to create a single model that compactly describes shape and the corresponding appearance deformation.

Such a model has been used for speechreading in Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000), Matthews et al. (Reference Matthews, Potamianos, Neti and Luettin2001), and Papandreou et al. (Reference Papandreou, Katsamanis, Pitsikalis and Maragos2009). An example of the resulting learned joint model is depicted in Figure 9.6. A block diagram of the method, including the dimensionalities of the input shape and appearance vectors (Eq. (9.4) and Eq. (9.5), respectively), their PCA projections y(S), y(A), and the final feature vector y(A,S) = P(A,S)PCA x(A,S), is depicted in Figure 9.7.
Figure 9.6 Combined shape and appearance statistical model. Center row: Mean shape and appearance. Top row: Mean shape and appearance +3 standard deviations. Bottom row: Mean shape and appearance −3 standard deviations. The top four modes, depicted left-to-right, describe 46% of the combined shape and appearance variance of 4072 labeled images from the IBM ViaVoiceTM audiovisual database (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001).© 2000 and 2001 IEEE.
Figure 9.7 DCT- versus AAM-based visual feature extraction for automatic speechreading, followed by visual feature post-extraction processing using linear interpolation, feature mean normalization, adjacent frame feature concatenation, and the application of LDA and MLLT. Vector dimensions as implemented in the system of Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) are depicted.
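A compact way to see the three PCA stages together is the following sketch. It assumes the shape vectors of Eq. (9.4) and appearance vectors of Eq. (9.5) have already been extracted; the helper pca_rows and the use of a single global variance-ratio scalar for the scaling matrix W are simplifying assumptions rather than the exact construction of the cited systems.

```python
import numpy as np

def pca_rows(X, D):
    """Leading-D PCA of row vectors X (L, d): returns the mean and a (D, d) projection."""
    mean = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    return mean, evecs[:, np.argsort(evals)[::-1][:D]].T

def train_combined_model(shape_vecs, app_vecs, Ds, Da, Dc):
    """Joint shape-and-appearance model: PCA on shapes and on appearances, then a
    third PCA on the concatenated, scaled coefficient vectors of Eq. (9.6)."""
    ms, Ps = pca_rows(shape_vecs, Ds)
    ma, Pa = pca_rows(app_vecs, Da)
    ys = (shape_vecs - ms) @ Ps.T                # (L, Ds) shape coefficients y(S)
    ya = (app_vecs - ma) @ Pa.T                  # (L, Da) appearance coefficients y(A)
    w = np.sqrt(ya.var() / ys.var())             # crude balancing: W = w * I
    joint = np.hstack([w * ys, ya])              # training vectors of Eq. (9.6)
    mj, Pj = pca_rows(joint, Dc)
    return (ms, Ps), (ma, Pa), w, (mj, Pj)       # final features: Pj @ (x(A,S) - mj)
```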
9.2.2.4 Visual feature post-extraction processing
In an audiovisual speech recognition system, in addition to the visual features, audio features are also extracted from the acoustic waveform. For example, such features could be mel-frequency cepstral coefficients (MFCCs) or linear prediction coefficients (LPCs), typically extracted at a 100 Hz rate (Deller et al.Reference Deller, Proakis and Hansen1993; Rabiner and Juang Reference Rabiner and Juang1993; Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999). In contrast, visual features are generated at the video frame rate, commonly 25 or 30 Hz, or twice that, in the case of interlaced video. Since feature stream synchrony is required in a number of algorithms for audiovisual fusion, as discussed in the next section, the two feature streams must achieve the same rate.
Typically, this is accomplished (whenever required), either after feature extraction, by simple element-wise linear interpolation of the visual features to the audio frame rate (as in Figure 9.7), or before feature extraction, by frame duplication, to achieve a 100 Hz video input rate to the visual front end. Occasionally, the audio front end processing is performed at a lower video rate.
Another interesting issue in visual feature extraction has to do with feature normalization. In a traditional audio front end, cepstral mean subtraction is often employed to enhance robustness to speaker and environment variations (Liu et al.Reference Liu, Stern, Huang and Acero1993; Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999). A simple visual feature mean normalization (FMN) by element-wise subtraction of the vector mean over each sentence has been demonstrated to improve appearance feature-based visual-only recognition (Potamianos et al.Reference Potamianos and Graf1998; Potamianos et al.Reference Potamianos, Neti, Iyengar, Senior and Verma2001b). Alternatively, linear intensity compensation preceding the appearance feature extraction has been investigated by Vanegas et al. (Reference Vanegas, Tanaka, Tokuda and Kitamura1998).
A very important issue in the visual feature design is capturing the dynamics of visual speech. Temporal information, often spanning multiple phone segments, is known to help human perception of visual speech (Rosenblum and Saldaña Reference Rosenblum, Saldaña, Campbell, Dodd and Burnham1998). Borrowing again from the ASR literature, dynamic speech information can be captured by augmenting the visual feature vector with its first- and second-order temporal derivatives (Rabiner and Juang Reference Rabiner and Juang1993; Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999). Alternatively, LDA can be used, as a means of “learning” a transform that optimally captures the speech dynamics. Such a transform is applied on the concatenation of consecutive feature vectors adjacent to and including the current frame (see also Figure 9.7), i.e., on
[ yt−(J−1)/2T , . . . , ytT , . . . , yt+(J−1)/2T ]T ,
with J = 15 for example, as in Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) and Potamianos et al. (Reference Potamianos, Neti, Iyengar, Senior and Verma2001b).
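The post-extraction steps discussed so far, rate interpolation, feature mean normalization, and adjacent-frame concatenation prior to LDA/MLLT, can be chained as in the sketch below. The edge-padding at the utterance boundaries and the specific frame rates are assumptions for illustration.

```python
import numpy as np

def postprocess_visual_features(Y, video_rate=30.0, audio_rate=100.0, J=15):
    """Y: (T, D) per-frame visual features. Upsample to the audio frame rate by
    element-wise linear interpolation, apply per-utterance feature mean
    normalization, and stack J adjacent frames as input to a later LDA/MLLT stage."""
    T, D = Y.shape
    t_video = np.arange(T) / video_rate
    t_audio = np.arange(int(T * audio_rate / video_rate)) / audio_rate
    up = np.stack([np.interp(t_audio, t_video, Y[:, j]) for j in range(D)], axis=1)
    up -= up.mean(axis=0)                                # feature mean normalization
    half = J // 2
    padded = np.pad(up, ((half, half), (0, 0)), mode='edge')
    stacked = np.stack([padded[t:t + J].ravel() for t in range(len(up))])
    return stacked                                       # (T_audio, J * D)
```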
Clearly, and as we already mentioned, LDA could be applied to any category of features discussed. The same holds for MLLT, a method that aims to improve maximum likelihood data modeling and, in practice, ASR performance. For example, a number of feature post-processing steps discussed above, including LDA and MLLT, were interchangeably applied to DCT appearance features, as well as to AAM ones, in our visual front end experiments during the Johns Hopkins workshop, as depicted in Figure 9.7 (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001). Alternate ways of combining feature post-extraction processing steps can easily be envisioned. For example, LDA and MLLT can be applied to obtain within-frame discriminant features (Potamianos and Neti Reference Potamianos and Neti2001b), which can then be augmented by their first- and second-order derivatives, or followed by LDA and MLLT across frames (see also Figure 9.11). Additional feature transformations can also hold benefit to the system, for example a Gaussianization step, as reported by Huang and Visweswariah (Reference Huang, Marcheret and Visweswariah2005).
Finally, an important problem in data classification is the issue of feature selection within a larger pool of candidate features (Jain et al.Reference Jain, Duin and Mao2000). In the context of speechreading, this matter has been directly addressed in the selection of geometric, lip contour-based features by Goldschen et al. (Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996) and in the selection of appearance, DCT-based features by Scanlon et al. (Reference Scanlon, Potamianos, Libal and Chu2004) and Potamianos and Scanlon (Reference Potamianos and Scanlon2005).
9.2.3 Summary of visual front end algorithms
We have presented a summary of the most common visual feature extraction algorithms proposed in the literature for automatic speechreading. Such techniques differ both in their assumptions about where the speechreading information lies, as well as in the requirements that they place on face detection, facial part localization, and tracking. On the one extreme, appearance-based visual features consider a broadly defined ROI and then rely on traditional pattern recognition and image compression techniques to extract relevant speechreading information. On the other end, shape-based visual features require adequate lip or facial shape tracking and assume that the visual speech information is captured by the shape’s form and movement alone. Bridging the two extremes, various combinations of the two types of features have also been used, ranging from simple concatenation to joint modeling.
Comparisons between features within the same class are often reported in the literature (Duchnowski et al.Reference Duchnowski, Meier and Waibel1994; Goldschen et al.Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996; Gray et al.Reference Gray, Movellan, Sejnowski, Mozer, Jordan and Petsche1997; Potamianos et al.Reference Potamianos and Graf1998; Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001; Scanlon and Reilly Reference Scanlon and Reilly2001; Seymour et al.Reference Seymour, Stewart and Ming2008). Comparisons however across the various types of features are rather limited, as the feature types require quite different sets of algorithms for their implementation. Nevertheless, Matthews et al. (Reference Matthews, Cootes, Cox, Harvey and Bangham1998) demonstrate AAMs to outperform ASMs, and to result in similar visual-only recognition to alternative appearance-based features. Chiou and Hwang (Reference Chiou and Hwang1997) report that their joint features outperform their shape and appearance feature components, whereas Potamianos et al. (Reference Potamianos and Graf1998), as well as Scanlon and Reilly (Reference Scanlon and Reilly2001), report that DCT-based visual features are superior to a set of lip contour geometric features. Also, Aleksic and Katsaggelos (Reference Aleksic and Katsaggelos2004a) compare PCA appearance-based and shape-based features, but with inconclusive results. However, the above are all reported on single-subject data and/or small vocabulary tasks. In a larger experiment, Matthews et al. (Reference Matthews, Potamianos, Neti and Luettin2001) compare a number of appearance-based features with AAMs on a speaker-independent LVCSR task. All appearance features considered outperformed AAMs. However, it is suspected that the AAM used there was not sufficiently trained.
Although much progress has been made in visual feature extraction, it seems that the identification of the best visual features for automatic speechreading, features that are robust in a variety of visual environments, remains to a large extent unresolved. Of particular importance is that such features should exhibit sufficient speaker, pose, camera, and environment independence. However, it is worth mentioning two arguments in favor of appearance-based features. First, their use is well motivated by human perception studies of visual speech. Indeed, significant information about the place of articulation, such as tongue and teeth visibility, cannot be captured by the lip contours alone. Human speech perception based on the mouth region is superior to perception on the basis of the lips alone, and it further improves when the entire lower face is visible (Summerfield et al.Reference Summerfield, MacLeod, McGrath, Brooke, Young and Ellis1989). Second, the extraction of certain well-performing, appearance-based features such as the DCT is computationally efficient. Indeed, it requires only a crude mouth region detection algorithm, which can be applied at a low frame rate, whereas the subsequent pixel vector transform is amenable to fast implementation for suitable ROI sizes (Press et al.Reference Press, Flannery, Teukolsky and Vetterling1995). These observations are encouraging with regard to work towards the ultimate goal of implementing real-time automatic speechreading systems (Connell et al.Reference Connell, Haas, Marcheret, Neti, Potamianos and Velipasalar2003), operating robustly in realistic visual environments (Potamianos and Neti Reference Potamianos and Neti2003).
9.3 Audiovisual integration
Audiovisual fusion is an instance of the general classifier combination problem (Jain et al.Reference Jain, Duin and Mao2000; Sannen et al.Reference Sannen, Lughofer and Van Brussel2010). In our case, two observation streams are available (audio and visual modalities) and provide information about speech classes, such as context-dependent sub-phonetic units, or at a higher level, word sequences. Each observation stream can be used alone to train single-modality statistical classifiers to recognize such classes. However, one hopes that combining the two streams will give rise to a bimodal classifier with superior performance to both single-modality ones.
Various information fusion algorithms have been considered in the literature for audiovisual ASR (for example, Bregler et al.Reference Bregler, Hild, Manke and Waibel1993; Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Potamianos and Graf Reference Potamianos and Graf1998; Rogozan Reference Rogozan1999; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Dupont and Luettin Reference Dupont and Luettin2000; Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Chen Reference Chen2001; Chu and Huang Reference Chu and Huang2002; Garg et al.Reference Garg, Potamianos, Neti and Huang2003; Lewis and Powers Reference Lewis and Powers2005; Saenko and Livescu Reference Saenko and Livescu2006; Marcheret et al.Reference Marcheret, Libal and Potamianos2007; Shao and Barker Reference Shao and Barker2008; Papandreou et al.Reference Papandreou, Katsamanis, Pitsikalis and Maragos2009). The proposed techniques differ both in their basic design, and in the adopted terminology. The architecture of some of these methods (Robert-Ribes et al.Reference Robert-Ribes, Piquemal, Schwartz, Escudier, Stork and Hennecke1996; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Lewis and Powers Reference Lewis and Powers2005) is motivated by models of human speech perception (Massaro Reference Massaro, Stork and Hennecke1996; Massaro and Stork Reference Massaro and Stork1998; Berthommier Reference Berthommier2001). In most cases however, research in audiovisual ASR has followed a separate track from work on modeling the human perception of audiovisual speech.
Audiovisual integration techniques can be broadly grouped into feature fusion and decision fusion methods. The first ones are based on training a single classifier (i.e., of the same form as the audio- and visual-only classifiers) on the concatenated vector of audio and visual features, or on any appropriate transformation of it (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Potamianos et al.Reference Potamianos, Luettin and Neti2001c; Aleksic et al.Reference Aleksic, Potamianos and Katsaggelos2005). In contrast, decision fusion algorithms utilize the two single-modality (audio- and visual-only) classifier outputs to recognize audiovisual speech. Typically, this is achieved by linearly combining the class-conditional observation log-likelihoods of the two classifiers into a joint audiovisual classification score, using appropriate weights that capture the reliability of each single-modality classifier, or data stream (Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Rogozan et al.Reference Rogozan, Deléglise and Alissali1997; Potamianos and Graf Reference Potamianos and Graf1998; Dupont and Luettin Reference Dupont and Luettin2000; Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002; Tamura et al.Reference Tamura, Iwano and Furui2005; Marcheret et al.Reference Marcheret, Libal and Potamianos2007; Shao and Barker Reference Shao and Barker2008).
In this section, we provide a detailed description of some popular fusion techniques from each category (see also Table 9.1). In addition, we briefly address two issues relevant to automatic recognition of audiovisual speech. One is the problem of speech modeling for ASR, which is of particular interest in automatic speechreading, and helps establish some background and notation for the remainder of the section. We also consider the subject of speaker adaptation, an important element in practical ASR systems.
Table 9.1 Taxonomy of the audiovisual integration methods considered in this section. Three feature-fusion techniques that differ in the features used for recognition and three decision-fusion methods that differ in the combination stage of the audio and visual classifiers are described in more detail in this chapter.
9.3.1 Audiovisual speech modeling for ASR
Two central aspects in the design of ASR systems are the choice of speech classes that are assumed to generate the observed features, and the statistical modeling of this generation process. In the following, we briefly discuss both issues, since they are often embedded into the design of audiovisual fusion algorithms.
9.3.1.1 Speech classes for audiovisual ASR
The basic unit that describes how speech conveys linguistic information is the phoneme. For American English, there exist approximately forty-two such units (Deller et al.Reference Deller, Proakis and Hansen1993), generated by specific positions or movements of the vocal tract articulators. Only some of the articulators are visible, however; therefore among these phonemes, the number of visually distinguishable units is much smaller. Such units are called visemes in the audiovisual ASR and human perception literatures (Stork and Hennecke Reference Stork and Hennecke1996; Campbell et al.Reference Campbell, Dodd and Burnham1998; Massaro and Stork Reference Massaro and Stork1998). In general, phoneme to viseme mappings are derived from human speechreading studies. Alternatively, such mappings can be generated using statistical clustering techniques, as proposed by Goldschen et al. (Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996) and Rogozan (Reference Rogozan1999). There is no universal agreement about the exact partitioning of phonemes into visemes, but some visemes are well-defined, such as the bilabial viseme consisting of the phoneme set [/p/, /b/, /m/]. A typical clustering into thirteen visemes is used by Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) to conduct visual speech modeling experiments, and is depicted in Table 9.2.
Table 9.2 The forty-four phonemes to thirteen visemes mapping considered by Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000), using the HTK phone set (Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999).
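Since a phoneme-to-viseme mapping such as that of Table 9.2 is a simple many-to-one lookup, a short sketch may help fix ideas. The partial mapping below is purely illustrative: only the bilabial class [/p/, /b/, /m/] is taken from the text above, and the remaining entries are hypothetical placeholders, not the clustering of Table 9.2.

```python
# Illustrative (hypothetical) partial phoneme-to-viseme mapping.
# Only the bilabial class comes from the text; the other entries are
# placeholders and do NOT reproduce the clustering of Table 9.2.
PHONE_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "iy": "V_spread", "ih": "V_spread",
    "uw": "V_rounded", "ao": "V_rounded",
}

def phones_to_visemes(phones, default="V_other"):
    """Map a phone sequence to visemes and merge identical neighbours,
    since consecutive phones of the same class are visually indistinct."""
    visemes = [PHONE_TO_VISEME.get(p, default) for p in phones]
    return [v for i, v in enumerate(visemes) if i == 0 or v != visemes[i - 1]]

if __name__ == "__main__":
    # "mat" -> /m ae t/: bilabial, other (placeholder), alveolar
    print(phones_to_visemes(["m", "ae", "t"]))
```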
In traditional audio-only ASR, the set of classes c ∈C that needs to be estimated on the basis of the observed feature sequence most often consists of sub-phonetic units, and occasionally of sub-word units in small vocabulary recognition tasks. For LVCSR, a large number of context-dependent sub-phonetic units are used, obtained by clustering the possible phonetic contexts (tri-phone ones, for example) using a decision tree. In this chapter, such units are exclusively used, defined over tri- or eleven-phone contexts, as described in the Experiments section (Section 9.5).
For automatic speechreading, it seems appropriate, from the human visual speech perception point of view, to use visemic sub-phonetic classes, and their decision tree clustering based on visemic context. Such clustering experiments are reported by Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000). In addition, visual-only recognition of visemes is occasionally considered in the literature (Potamianos et al.Reference Potamianos, Neti, Iyengar, Senior and Verma2001b; Gordan et al.Reference Gordan, Kotropoulos and Pitas2002). Visemic speech classes are also used for audiovisual ASR at the second stage of a cascade decision fusion architecture proposed by Rogozan (Reference Rogozan1999), as well as in the dynamic Bayesian network proposed for audiovisual fusion by Terry and Katsaggelos (Reference Terry and Katsaggelos2008) and a number of experiments reported by Hazen (Reference Hazen2006). In general, however, the vast majority of works in the literature employ identical classes and decision trees for both modalities.
9.3.1.2 HMM-based speech recognition
The most widely used classifier for audiovisual ASR is the hidden Markov model (HMM), a very popular method for traditional audio-only speech recognition. Additional methods also exist for automatic recognition of speech, and have been employed in audiovisual ASR systems, such as dynamic time warping (DTW), used for example by Petajan (Reference Petajan1984), artificial neural networks (ANN), as in Krone et al. (Reference Krone, Talle, Wichert and Palm1997), hybrid ANN-DTW systems (Bregler et al.Reference Bregler, Hild, Manke and Waibel1993; Duchnowski et al.Reference Duchnowski, Meier and Waibel1994), hybrid ANN-HMM ones (Heckmann et al.Reference Heckmann, Berthommier and Kroschel2001), and support vector machines (SVM, see Gordan et al.Reference Gordan, Kotropoulos and Pitas2002) – in the latter case for visual-only ASR. Various types of HMMs have also been used for audiovisual ASR, such as HMMs with discrete observations after vector quantization of the feature space (Silsbee and Bovik Reference Silsbee and Bovik1996), or HMMs with non-Gaussian continuous observation probabilities (Su and Silsbee Reference Su and Silsbee1996). However, the vast majority of audiovisual ASR systems, to which we restrict our presentation in this chapter, employ HMMs with a continuous observation probability density, modeled as a mixture of Gaussian densities.
Typically in the literature, single-stream HMMs are used to model the generation of a sequence of audio-only or visual-only speech informative features $\mathbf{o}_{st}$, $t = 1, \ldots, T$, of dimensionality $D_s$, where $s \in \{A, V\}$ denotes the audio or visual modality (stream). The HMM emission (class conditional observation) probabilities are modeled by Gaussian mixture densities, given by

$$ b_{sc}(\mathbf{o}_{st}) \;=\; \Pr(\mathbf{o}_{st} \mid c) \qquad\qquad (9.7) $$
$$ \;=\; \sum_{k=1}^{K_{sc}} w_{sck}\, \mathcal{N}_{D_s}\!\bigl(\mathbf{o}_{st};\ \mathbf{m}_{sck},\ \mathbf{s}_{sck}\bigr), \qquad (9.8) $$

for all classes $c \in \mathcal{C}$, whereas the HMM transition probabilities between classes are given by $r_{sc'c} = \Pr(c_t = c \mid c_{t-1} = c')$, for $c, c' \in \mathcal{C}$. The HMM parameter vector is therefore

$$ \mathbf{a}_s \;=\; \bigl[\, w_{sck},\ \mathbf{m}_{sck},\ \mathbf{s}_{sck},\ r_{sc'c}\ :\ c, c' \in \mathcal{C},\ k = 1, \ldots, K_{sc} \,\bigr]. \qquad (9.9) $$

In Eq. (9.7) and Eq. (9.8), $c \in \mathcal{C}$ denotes the HMM context-dependent states, whereas the mixture weights $w_{sck}$ are positive and add to one; $K_{sc}$ denotes the number of mixtures; and $\mathcal{N}_D(\mathbf{o};\ \mathbf{m},\ \mathbf{s})$ is the $D$-variate normal distribution with mean $\mathbf{m}$ and a diagonal covariance matrix, its diagonal being denoted by $\mathbf{s}$.
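As a concrete reading of Eq. (9.7) and Eq. (9.8), the following minimal sketch evaluates the log emission probability of one state of a single-stream HMM under a diagonal-covariance Gaussian mixture; the variable names mirror the notation above, and the numerical values are arbitrary.

```python
import numpy as np

def gmm_log_emission(o, w, m, s):
    """Log of Eq. (9.7)-(9.8): log sum_k w_k N(o; m_k, diag(s_k)).

    o : (D,) observation vector of one stream at time t
    w : (K,) mixture weights, positive and summing to one
    m : (K, D) component means
    s : (K, D) diagonals of the component covariance matrices
    """
    o = np.asarray(o, dtype=float)
    # Per-component log N_D(o; m_k, diag(s_k)) for a diagonal covariance.
    log_norm = -0.5 * (np.log(2.0 * np.pi * s) + (o - m) ** 2 / s).sum(axis=1)
    # Log-sum-exp over mixture components, for numerical stability.
    log_wk = np.log(w) + log_norm
    mx = log_wk.max()
    return mx + np.log(np.exp(log_wk - mx).sum())

# Example: a 2-component mixture over 3-dimensional features.
w = np.array([0.4, 0.6])
m = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
s = np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])
print(gmm_log_emission([0.2, -0.1, 0.4], w, m, s))
```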
The expectation-maximization (EM) algorithm (Dempster et al.Reference Dempster, Laird and Rubin1977) is typically used to obtain maximum likelihood estimates of Eq. (9.9). Given a current HMM parameter vector $\mathbf{a}_s^{(j)}$ at EM algorithm iteration $j$, a re-estimated parameter vector is obtained as

$$ \mathbf{a}_s^{(j+1)} \;=\; \arg\max_{\mathbf{a}_s}\ Q\bigl(\mathbf{a}_s^{(j)},\, \mathbf{a}_s \mid \mathbf{O}^{(s)}\bigr). \qquad (9.10) $$

In Eq. (9.10), $\mathbf{O}^{(s)}$ denotes the training data observations from $L$ utterances, $\mathbf{O}^{(s,l)} = [\,\mathbf{o}_{st},\ t = 1, \ldots, T_l\,]$, $l = 1, \ldots, L$, and $Q(\cdot, \cdot \mid \cdot)$ represents the EM algorithm auxiliary function, defined (Rabiner and Juang Reference Rabiner and Juang1993) as

$$ Q\bigl(\mathbf{a}_s^{(j)},\, \mathbf{a}_s \mid \mathbf{O}^{(s)}\bigr) \;=\; \sum_{l=1}^{L}\ \sum_{\mathbf{c}^{(l)}} \Pr\bigl(\mathbf{c}^{(l)} \mid \mathbf{O}^{(s,l)};\ \mathbf{a}_s^{(j)}\bigr)\ \log \Pr\bigl(\mathbf{O}^{(s,l)},\ \mathbf{c}^{(l)};\ \mathbf{a}_s\bigr). \qquad (9.11) $$

In Eq. (9.11), $\mathbf{c}^{(l)}$ denotes any HMM state sequence for utterance $l$. Replacing it with the best HMM path reduces EM to Viterbi training. As an alternative to maximum likelihood, discriminative training methods can instead be used for HMM parameter estimation (Bahl et al.Reference Bahl, Brown, DeSouza and Mercer1986; Chou et al.Reference Chou, Juang, Lee and Soong1994; Woodland and Povey Reference Woodland and Povey2002; Huang and Povey Reference Huang and Povey2005).
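To illustrate the Viterbi-training simplification just mentioned, the sketch below performs the M-step for single-Gaussian state models given a hard frame-to-state alignment. It is a toy version (one Gaussian per state rather than a mixture, a single stream, no transition-probability updates), not the full EM recursion of Eq. (9.10)–(9.11).

```python
import numpy as np

def viterbi_reestimate(frames, state_alignment, n_states):
    """One Viterbi-style M-step: re-estimate single-Gaussian state means and
    diagonal variances from a hard (best-path) frame-to-state alignment.

    frames          : (T, D) observations of one stream
    state_alignment : (T,) state index of each frame along the best HMM path
    """
    frames = np.asarray(frames, dtype=float)
    alignment = np.asarray(state_alignment)
    means, variances = [], []
    for c in range(n_states):
        x = frames[alignment == c]
        means.append(x.mean(axis=0))
        variances.append(x.var(axis=0) + 1e-6)  # variance floor
    return np.stack(means), np.stack(variances)

# Toy alignment of 9 two-dimensional frames to 3 states.
frames = np.arange(18, dtype=float).reshape(9, 2)
align = [0, 0, 0, 1, 1, 1, 2, 2, 2]
mu, var = viterbi_reestimate(frames, align, n_states=3)
print(mu)
```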
9.3.2 Feature fusion techniques for audiovisual ASR
As already mentioned, feature fusion uses a single classifier to model the concatenated vector of time-synchronous audio and visual features, or appropriate transformations of it. Such methods include plain feature concatenation (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996), feature weighting (Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Chen Reference Chen2001), both also known as direct identification fusion (Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999), and hierarchical linear discriminant feature extraction (Potamianos et al. Reference Potamianos, Luettin and Neti2001c). The dominant and motor recording fusion models discussed by Teissier et al. (Reference Teissier, Robert-Ribes and Schwartz1999) also belong to this category, as they seek a data-to-data mapping of either the visual features into the audio space, or of both modality features to a new common space, followed by linear combination of the resulting features. Audio feature enhancement on the basis of either visual input (Girin et al.Reference Girin, Feng and Schwartz1995; Barker and Berthommier Reference Barker and Berthommier1999), or concatenated audiovisual features (Girin et al.Reference Girin, Schwartz and Feng2001b; Goecke et al.Reference Goecke, Potamianos and Neti2002) also falls within this category of fusion, under the general definition adopted above. In this section, we expand on three feature fusion techniques, schematically depicted in Figure 9.8.
Figure 9.8 Three types of feature fusion considered in this section: Plain audiovisual feature concatenation (AV-Concat), hierarchical discriminant feature extraction (AV-HiLDA), and audiovisual speech enhancement (AV-Enh).
9.3.2.1 Concatenative feature fusion
Given time-synchronous audio and visual feature vectors $\mathbf{o}_{At}$ and $\mathbf{o}_{Vt}$, with dimensionalities $D_A$ and $D_V$, respectively, the joint, concatenated audiovisual feature vector at time $t$ becomes

$$ \mathbf{o}_{AVt} \;=\; \bigl[\, \mathbf{o}_{At}^{\mathsf{T}},\ \mathbf{o}_{Vt}^{\mathsf{T}} \,\bigr]^{\mathsf{T}} \;\in\; \mathbb{R}^{D}, \qquad (9.12) $$

where $D = D_A + D_V$. As with all feature fusion methods (i.e., also for the vectors in Eq. (9.13) and Eq. (9.14), below), the generation process for a sequence of features in Eq. (9.12) is modeled by a single-stream HMM, with emission probabilities (see also Eq. (9.7))

$$ b_{AVc}(\mathbf{o}_{AVt}) \;=\; \sum_{k=1}^{K_{AVc}} w_{AVck}\, \mathcal{N}_{D}\bigl(\mathbf{o}_{AVt};\ \mathbf{m}_{AVck},\ \mathbf{s}_{AVck}\bigr) $$

for all classes $c \in \mathcal{C}$ (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996). Concatenative feature fusion constitutes a simple approach for audiovisual ASR, implementable in most existing ASR systems with minor changes. However, the vector dimensionality in Eq. (9.12) can be rather high, with the consequent risk of inadequate modeling in Eq. (9.8) due to the curse of dimensionality (Chatfield and Collins Reference Chatfield and Collins1991). The following fusion technique aims to avoid this, by seeking lower-dimensional representations of Eq. (9.12).
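A sketch of concatenative feature fusion under the above definition is given below; it only forms the vector of Eq. (9.12) from frame-synchronous streams (assumed already brought to a common frame rate), after which any standard single-stream GMM–HMM toolkit could be trained on the result. The array dimensions in the example are arbitrary.

```python
import numpy as np

def concat_av_features(audio_feats, visual_feats):
    """Eq. (9.12): frame-synchronous concatenation of audio and visual features.

    audio_feats  : (T, D_A) audio feature matrix (e.g. MFCC-based)
    visual_feats : (T, D_V) visual feature matrix (e.g. DCT- or AAM-based),
                   assumed already interpolated to the audio frame rate
    returns      : (T, D_A + D_V) concatenated observations o_AVt
    """
    audio_feats = np.asarray(audio_feats)
    visual_feats = np.asarray(visual_feats)
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("streams must be time-synchronous (equal frame counts)")
    return np.hstack([audio_feats, visual_feats])

# Example with random stand-ins for real front-end outputs.
T, D_A, D_V = 100, 60, 41
o_av = concat_av_features(np.random.randn(T, D_A), np.random.randn(T, D_V))
print(o_av.shape)  # (100, 101)
```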
9.3.2.2 Hierarchical discriminant feature fusion
Visual features contain less speech classification power than audio features, even in the case of extreme noise in the audio channel (see Table 9.3 in the Experiments section). One would therefore expect that an appropriate lower-dimensional representation of Eq. (9.12) could lead to equal and possibly better HMM performance, given the problem of accurate probabilistic modeling in high-dimensional spaces. Potamianos et al. (Reference Potamianos, Luettin and Neti2001c) have considered LDA as a means of obtaining such a dimensionality reduction. The goal is in fact to obtain the best discrimination among the classes of interest, and LDA achieves this on the basis of the data (and their labels) alone, without an a priori bias in favor of either of the two feature streams. LDA is followed by an MLLT-based data rotation (see also Figure 9.8), in order to improve maximum-likelihood data modeling using Eq. (9.8). In the audiovisual ASR system of Potamianos et al. (Reference Potamianos, Luettin and Neti2001c), the proposed method amounts to a two-stage application of LDA and MLLT, first intra-modal on the original audio MFCC and visual DCT features, and then inter-modal on Eq. (9.12), as also depicted in Figure 9.11. It is therefore referred to as HiLDA (hierarchical LDA). The final audiovisual feature vector is (see also Eq. (9.12))

$$ \mathbf{o}^{\mathrm{HiLDA}}_{AVt} \;=\; \mathbf{P}^{\mathrm{MLLT}}_{AV}\ \mathbf{P}^{\mathrm{LDA}}_{AV}\ \mathbf{o}_{AVt}, \qquad (9.13) $$

where $\mathbf{P}^{\mathrm{LDA}}_{AV}$ and $\mathbf{P}^{\mathrm{MLLT}}_{AV}$ denote the inter-modal LDA projection and MLLT rotation matrices, respectively, applied to the concatenated vector of Eq. (9.12).
One can set the dimensionality of Eq. (9.13) to be equal to the audio feature vector size, as implemented by Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000).
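The sketch below gives a rough flavour of the hierarchical discriminant projection, using scikit-learn's LinearDiscriminantAnalysis for both the intra-modal and inter-modal stages. The MLLT rotations that would follow each LDA stage in the actual HiLDA front end are omitted, the per-frame class labels are assumed to come from forced alignments, and all dimensionalities are arbitrary.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def hilda_features(audio, visual, labels, d_audio, d_visual, d_av):
    """Two-stage (hierarchical) LDA, loosely following the HiLDA idea:
    intra-modal LDA on each stream, then inter-modal LDA on the
    concatenation.  MLLT rotations after each LDA stage are omitted.

    audio, visual : (T, D_A), (T, D_V) frame-synchronous feature matrices
    labels        : (T,) per-frame class labels (e.g. HMM-state alignments)
    """
    lda_a = LDA(n_components=d_audio).fit(audio, labels)
    lda_v = LDA(n_components=d_visual).fit(visual, labels)
    intra = np.hstack([lda_a.transform(audio), lda_v.transform(visual)])
    lda_av = LDA(n_components=d_av).fit(intra, labels)
    return lda_av.transform(intra)

# Toy example: 5 classes, random features standing in for MFCC/DCT streams.
rng = np.random.default_rng(0)
T = 2000
labels = rng.integers(0, 5, size=T)
audio = rng.standard_normal((T, 60)) + labels[:, None]
visual = rng.standard_normal((T, 30)) + 0.5 * labels[:, None]
print(hilda_features(audio, visual, labels, d_audio=4, d_visual=4, d_av=4).shape)
```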
9.3.2.3 Audio feature enhancement
Audio and visible speech are correlated since they are constrained to use a common orofacial anatomy. Not surprisingly, a number of techniques have been proposed to obtain estimates of audio features utilizing the visual-only modality (Girin et al.Reference Girin, Feng and Schwartz1995; Yehia et al.Reference Yehia, Rubin and Vatikiotis-Bateson1998; Barker and Berthommier Reference Barker and Berthommier1999), or joint audiovisual speech data, in the case where the audio signal is degraded (Girin et al.Reference Girin, Schwartz and Feng2001b; Goecke et al.Reference Goecke, Potamianos and Neti2002). The latter scenario corresponds to the speech enhancement paradigm. Under this approach, the enhanced audio feature vector $\hat{\mathbf{o}}_{At}$ can be simply obtained as a linear transformation of the concatenated audiovisual feature vector of Eq. (9.12), namely as

$$ \hat{\mathbf{o}}_{At} \;=\; \mathbf{W}\, \mathbf{o}_{AVt}, \qquad (9.14) $$

where $\mathbf{W} = [\,\mathbf{w}_1, \ldots, \mathbf{w}_{D_A}\,]^{\mathsf{T}}$ consists of $D$-dimensional row vectors $\mathbf{w}_i^{\mathsf{T}}$, for $i = 1, \ldots, D_A$, and has dimension $D_A \times D$ (see also Figure 9.8).

A simple way to estimate matrix $\mathbf{W}$ is by considering the approximation $\mathbf{W}\,\mathbf{o}_{AVt} \approx \mathbf{o}^{\mathrm{clean}}_{At}$ in the Euclidean distance sense, where vector $\mathbf{o}^{\mathrm{clean}}_{At}$ denotes clean audio features available in addition to the visual and noisy audio vectors, for a number of time instants $t$ in a training set $\mathcal{T}$. By Eq. (9.14), this becomes equivalent to solving $D_A$ mean square error (MSE) estimations

$$ \mathbf{w}_i \;=\; \arg\min_{\mathbf{w}}\ \sum_{t \in \mathcal{T}} \bigl(\, o^{\mathrm{clean}}_{At,i} \;-\; \mathbf{w}^{\mathsf{T}}\, \mathbf{o}_{AVt} \,\bigr)^2, \qquad (9.15) $$

for $i = 1, \ldots, D_A$, i.e., one per row of the matrix $\mathbf{W}$. Equation (9.15) results in $D_A$ systems of Yule–Walker equations that can be easily solved using Gauss–Jordan elimination (Press et al.Reference Press, Flannery, Teukolsky and Vetterling1995). A more sophisticated way of estimating $\mathbf{W}$, using a Mahalanobis type distance instead of Eq. (9.15), is considered by Goecke et al. (Reference Goecke, Potamianos and Neti2002). Non-linear estimation schemes are proposed by Girin et al. (Reference Girin, Allard and Schwartz2001a) and Deligne et al. (Reference Deligne, Potamianos and Neti2002).
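Since each row of the matrix in Eq. (9.15) is an ordinary least-squares problem, the whole matrix can be estimated in a single call; the sketch below does so with numpy's generic solver rather than the Yule–Walker/Gauss–Jordan route mentioned above, and the synthetic data merely illustrate the shapes involved.

```python
import numpy as np

def estimate_enhancement_matrix(o_av, o_clean):
    """Estimate W of Eq. (9.14) by minimizing Eq. (9.15) over a training set.

    o_av    : (T, D) concatenated noisy-audio + visual features (Eq. (9.12))
    o_clean : (T, D_A) clean audio features for the same frames
    returns : (D_A, D) matrix W such that o_clean[t] ~= W @ o_av[t]
    """
    # np.linalg.lstsq solves min ||o_av @ X - o_clean||^2 for X of shape (D, D_A);
    # its transpose is the D_A x D enhancement matrix.
    X, *_ = np.linalg.lstsq(o_av, o_clean, rcond=None)
    return X.T

# Toy example with a known linear relation plus a little noise.
rng = np.random.default_rng(1)
T, D_A, D_V = 500, 8, 5
W_true = rng.standard_normal((D_A, D_A + D_V))
o_av = rng.standard_normal((T, D_A + D_V))
o_clean = o_av @ W_true.T + 0.01 * rng.standard_normal((T, D_A))
W_hat = estimate_enhancement_matrix(o_av, o_clean)
print(np.allclose(W_hat, W_true, atol=0.05))
```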
9.3.3 Decision fusion techniques for audiovisual ASR
Although feature fusion techniques (for example, HiLDA) that result in improved ASR over audio-only performance have been documented (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000), they cannot explicitly model the reliability of each modality. Such modeling is extremely important, as speech information content and the discrimination power of the audio and visual streams can vary widely, depending on the spoken utterance, acoustic noise in the environment, visual channel degradation, face tracker inaccuracies, and speaker characteristics. In contrast to feature fusion methods, the decision fusion framework provides a mechanism for capturing the reliability of each modality, by borrowing from classifier combination literature.
Classifier combination based on individual decisions about the classes of interest is an active area of research with many applications (Xu et al.Reference Xu, Krzyzak and Suen1992; Kittler et al.Reference Kittler, Hatef, Duin and Matas1998; Jain et al.Reference Jain, Duin and Mao2000; Sannen et al.Reference Sannen, Lughofer and Van Brussel2010). Combination strategies differ in various aspects, such as the architecture used (parallel, cascade, or hierarchical combination), possible trainability (static or adaptive), and information level considered at integration: abstract, rank-order, or measurement level, i.e., whether information is available about the best class only, the top n classes (or the ranking of all possible classes), or the likelihood scores. In the audiovisual ASR literature, examples of most of these categories can be found. For example, Petajan (Reference Petajan1984) rescores the two best outputs of the audio-only classifier by means of the visual-only classifier, a case of cascade, static, rank-order level decision fusion. Combinations of more than one category, as well as cases where one of the two classifiers of interest corresponds to a feature fusion technique are also possible. For example, Rogozan and Deléglise (Reference Rogozan and Deléglise1998) use a parallel, adaptive, measurement-level combination of an audiovisual classifier trained on concatenated features (Eq. (9.12)) with a visual-only classifier, whereas Rogozan (Reference Rogozan1999) considers a cascade, adaptive, rank-order level integration of the two. The lattice rescoring framework used during the Johns Hopkins University workshop (as described in the Experiments section that follows) is an example of a hybrid cascade/parallel fusion architecture (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Glotin et al.Reference Glotin, Vergyri, Neti, Potamianos and Luettin2001; Luettin et al.Reference Luettin, Potamianos and Neti2001).
However, by far the most commonly used decision fusion techniques for audiovisual ASR belong to the paradigm of audio- and visual-only classifier integration using a parallel architecture, adaptive combination weights, and class measurement level information. These methods derive the most likely speech class or word sequence by linearly combining the log-likelihoods of the two single-modality HMM classifier decisions, using appropriate weights (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Jourlin Reference Jourlin1997; Potamianos and Graf Reference Potamianos and Graf1998; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Dupont and Luettin Reference Dupont and Luettin2000; Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Gurbuz et al.Reference Gurbuz, Tufekci, Patterson and Gowdy2001; Heckmann et al.Reference Heckmann, Berthommier and Kroschel2001; Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002; Tamura et al.Reference Tamura, Iwano and Furui2005; Marcheret et al.Reference Marcheret, Libal and Potamianos2007; Shao and Barker Reference Shao and Barker2008). This corresponds to the adaptive product rule in the likelihood domain (Jain et al.Reference Jain, Duin and Mao2000), and it is also known as the separate identification model of audiovisual fusion (Rogozan Reference Rogozan1999; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999).
Continuous speech recognition introduces an additional twist to the classifier fusion problem, due to the fact that sequences of classes (HMM states or words) need to be estimated. One can consider three possible temporal levels for combining stream (modality) likelihoods, as depicted in Table 9.1: (a) “Early” integration, i.e., likelihood combination at the HMM state level, which gives rise to the multi-stream HMM classifier (Bourlard and Dupont Reference Bourlard and Dupont1996; Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999), and forces synchrony between its two single-modality components; (b) “Late” integration, where typically a number of n-best audio and possibly visual-only recognizer hypotheses are rescored by the log-likelihood combination of the two streams, which allows complete asynchrony between the two HMMs; and (c) “Intermediate” integration, typically implemented by means of the product HMM (Varga and Moore Reference Varga and Moore1990), or the coupled HMM (Brand et al.Reference Brand, Oliver and Pentland1997), which forces HMM synchrony at the phone, or word, boundaries. Notice that such terminology is not universally agreed upon, and our reference to early or late integration at the temporal level should not be confused with the feature versus decision fusion meaning of these terms in other work (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996).
9.3.3.1 Early integration: state-synchronous multi-stream HMM
In its general form, the class conditional observation likelihood of the multi-stream HMM is the product of the observation likelihoods of the HMM single-stream components, raised to appropriate stream exponents that capture the reliability of each modality, or equivalently, the confidence of each single-stream classifier. Such a model has been considered in audio-only ASR where, for example, separate streams are used for the energy audio features and MFCC static features, as well as their first and possibly second-order derivatives, as in Hernando et al. (Reference Hernando, Ayarte and Monte1995) and Young et al. (Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999), or for band-limited audio features in the multi-band ASR paradigm (Hermansky et al.Reference Hermansky, Tibrewala and Pavel1996), as in Bourlard and Dupont (Reference Bourlard and Dupont1996), Okawa et al. (Reference Okawa, Nakajima and Shirai1999), Glotin and Berthommier (Reference Glotin and Berthommier2000), among others. In the audiovisual domain, the model becomes a two-stream HMM, with one stream devoted to the audio, and another to the visual modality. As such, it has been extensively used in audiovisual ASR (Jourlin Reference Jourlin1997; Potamianos and Graf Reference Potamianos and Graf1998; Dupont and Luettin Reference Dupont and Luettin2000; Miyajima et al.Reference Miyajima, Tokuda and Kitamura2000; Nakamura et al.Reference Nakamura, Ito and Shikano2000; Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Tamura et al.Reference Tamura, Iwano and Furui2005; Marcheret et al.Reference Marcheret, Libal and Potamianos2007; Shao and Barker Reference Shao and Barker2008; Terry et al.Reference Terry, Shiell and Katsaggelos2008). In the system reported by Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) and Luettin et al. (Reference Luettin, Potamianos and Neti2001), the method was applied for the first time to the LVCSR domain.
Given the bimodal (audiovisual) observation vector $\mathbf{o}_{AVt}$, the state emission “score” (it no longer represents a probability distribution) of the multi-stream HMM is (see also Eq. (9.8) and Eq. (9.12))

$$ b_{AVc}(\mathbf{o}_{AVt}) \;=\; \prod_{s \in \{A,V\}} \Bigl[\, \sum_{k=1}^{K_{sc}} w_{sck}\, \mathcal{N}_{D_s}\!\bigl(\mathbf{o}_{st};\ \mathbf{m}_{sck},\ \mathbf{s}_{sck}\bigr) \Bigr]^{\lambda_{sct}}. \qquad (9.16) $$

Notice that Eq. (9.16) corresponds to a linear combination in the log-likelihood domain. In Eq. (9.16), $\lambda_{sct}$ denote the stream exponents (weights), that are non-negative, and in general, are a function of the modality $s$, the HMM state $c \in \mathcal{C}$, and locally, the utterance frame (time) $t$. Such state- and time-dependence can be used to model the speech class and “local” environment-based reliability of each stream. The exponents are often constrained to $\lambda_{Act} + \lambda_{Vct} = 1$, or 2. In most systems, they are set to global, modality-only dependent values, i.e., $\lambda_s \leftarrow \lambda_{sct}$, for all classes $c \in \mathcal{C}$ and time instants $t$, with the class dependence occasionally being preserved, i.e., $\lambda_{sc} \leftarrow \lambda_{sct}$, for all $t$. In the latter case, the parameters of the multi-stream HMM (see also Eq. (9.8), Eq. (9.9), and Eq. (9.16))

$$ \mathbf{a}_{AV} \;=\; \bigl[\, \mathbf{b}_A,\ \mathbf{b}_V,\ \mathbf{r},\ \lambda_{sc}\ :\ s = A, V,\ c \in \mathcal{C} \,\bigr] \qquad (9.17) $$

consist of the HMM transition probabilities $\mathbf{r}$ and the emission probability parameters $\mathbf{b}_A$ and $\mathbf{b}_V$ of its single-stream components, together with the stream exponents.
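In the log domain, Eq. (9.16) reduces to an exponent-weighted sum of the two single-stream scores; a minimal sketch follows, taking the per-stream GMM log-likelihoods as given (for instance from a function like the one sketched after Eq. (9.8)) and assuming the constraint λ_A + λ_V = 1 when only the audio exponent is supplied.

```python
def multistream_log_score(log_b_audio, log_b_visual, lam_audio, lam_visual=None):
    """Log of Eq. (9.16): lambda_A * log b_Ac(o_At) + lambda_V * log b_Vc(o_Vt).

    log_b_audio, log_b_visual : single-stream GMM log-likelihoods for state c
    lam_audio, lam_visual     : non-negative stream exponents; if lam_visual is
                                omitted, the constraint lam_A + lam_V = 1 is used.
    """
    if lam_visual is None:
        lam_visual = 1.0 - lam_audio
    assert lam_audio >= 0.0 and lam_visual >= 0.0
    return lam_audio * log_b_audio + lam_visual * log_b_visual

# The audio stream is weighted up in a quiet acoustic environment...
print(multistream_log_score(-12.3, -20.1, lam_audio=0.8))
# ...and down when the acoustic channel is noisy.
print(multistream_log_score(-45.0, -20.1, lam_audio=0.3))
```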
The parameters of $\mathbf{a}_{AV}$ can be estimated separately for each stream component using the EM algorithm, namely, Eq. (9.10) for $s \in \{A, V\}$, and subsequently, by setting the joint HMM transition probability vector equal to the audio one, i.e., $\mathbf{r} = \mathbf{r}_A$, or alternatively, to the product of the transition probabilities of the two HMMs, i.e., $\mathbf{r} = \mathrm{diag}(\mathbf{r}_A \mathbf{r}_V^{\mathsf{T}})$ (see also Eq. (9.9)). The latter scheme is referred to in the Experiments section as AV-MS-Sep. An obvious drawback of this approach is that the two single-modality HMMs are trained asynchronously (i.e., using different forced alignments), whereas Eq. (9.16) assumes that the HMM stream components are state synchronous. The alternative is to jointly estimate parameters $\mathbf{a}_{AV}$, in order to enforce state synchrony. Due to the linear combination of stream log-likelihoods in Eq. (9.16), the EM algorithm carries on in the multi-stream HMM case with minor changes (Rabiner and Juang Reference Rabiner and Juang1993; Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999). As a result,

$$ \mathbf{a}_{AV}^{(j+1)} \;=\; \arg\max_{\mathbf{a}_{AV}}\ Q\bigl(\mathbf{a}_{AV}^{(j)},\, \mathbf{a}_{AV} \mid \mathbf{O}^{(AV)}\bigr) \qquad (9.18) $$

can be used, a scheme referred to as AV-MS-Joint. Notice that the two approaches basically differ in the E-step of the EM algorithm.

In both separate and joint HMM training, the remainder of the parameter vector $\mathbf{a}_{AV}$, consisting of the stream exponents, needs to be obtained. Maximum likelihood estimation cannot be used for such parameters, and discriminative training techniques have to be employed instead (Jourlin Reference Jourlin1997; Potamianos and Graf Reference Potamianos and Graf1998; Nakamura Reference Nakamura, Bigun and Smeraldi2001; Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). This issue is discussed later. Notice that HMM stream parameter and stream exponent training iterations can be alternated when using Eq. (9.18).
9.3.3.2 Intermediate integration: product HMM
It is well known that visual speech activity can precede the audio signal by as much as 120 ms (Bregler and Konig Reference Bregler and Konig1994; Grant and Greenberg Reference Grant and Greenberg2001), which is close to the average duration of a phoneme. A generalization of the state-synchronous multi-stream HMM can be used to model such audio and visual stream asynchrony to some extent, by allowing the single modality HMMs to be in asynchrony within a model, but forcing their synchrony at model boundaries instead. Single-stream log-likelihoods are linearly combined at such boundaries using weights, similarly to Eq. (9.16). For LVCSR, a reasonable choice is to force synchrony at the phone boundaries. The resulting phone-synchronous audiovisual HMM is depicted in Figure 9.9, for the typical case of three states used per phone in each modality.
Figure 9.9 Left: Phone-synchronous (state-asynchronous) multi-stream HMM with three states per phone in each modality. Right: Its equivalent product (composite) HMM; black circles denote states that are removed when limiting the degree of within-phone allowed asynchrony to one state. The single-stream emission probabilities are tied for states along the same row (column) to the corresponding audio (visual) state probabilities.
Recognition based on this intermediate integration method requires the computation of the best state sequences for both audio and visual streams. To simplify decoding, the model can be formulated as a product HMM (Varga and Moore Reference Varga and Moore1990). Such a model consists of composite states $\mathbf{c} \in \mathcal{C} \times \mathcal{C}$, that have audiovisual emission probabilities of a form similar to Eq. (9.16), namely

$$ b_{AV\mathbf{c}}(\mathbf{o}_{AVt}) \;=\; \bigl[\, b_{A c_A}(\mathbf{o}_{At}) \,\bigr]^{\lambda_{A c_A t}}\ \bigl[\, b_{V c_V}(\mathbf{o}_{Vt}) \,\bigr]^{\lambda_{V c_V t}}, \qquad (9.19) $$

where $\mathbf{c} = [c_A, c_V]^{\mathsf{T}}$. Notice that in Eq. (9.19), the audio and visual stream components correspond to the emission probabilities of certain audio- and visual-only HMM states, as depicted in Figure 9.9. These single-stream emission probabilities are tied for states along the same row, or column (depending on the modality); therefore the original number of mixture weight, mean, and variance parameters is kept in the new model. However, this is usually not the case with the number of transition probability parameters $\Pr(\mathbf{c}_t = \mathbf{c} \mid \mathbf{c}_{t-1} = \mathbf{c}')$, as additional transitions between the composite states need to be modeled. Such probabilities are often factored as $\Pr(\mathbf{c}_t = \mathbf{c} \mid \mathbf{c}_{t-1} = \mathbf{c}') = \Pr(c_{At} = c_A \mid \mathbf{c}_{t-1} = \mathbf{c}')\, \Pr(c_{Vt} = c_V \mid \mathbf{c}_{t-1} = \mathbf{c}')$, in which case the resulting product HMM is typically referred to in the literature as the coupled HMM (Brand et al.Reference Brand, Oliver and Pentland1997; Chu and Huang Reference Chu and Huang2000; Chu and Huang Reference Chu and Huang2002; Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002). A further simplification of this factorization can be employed, $\Pr(\mathbf{c}_t = \mathbf{c} \mid \mathbf{c}_{t-1} = \mathbf{c}') = \Pr(c_{At} = c_A \mid c_{A,t-1} = c'_A)\, \Pr(c_{Vt} = c_V \mid c_{V,t-1} = c'_V)$, as in Gravier et al. (Reference Gravier, Axelrod, Potamianos and Neti2002a) for example, which results in a product HMM with the same number of parameters as the state-synchronous multi-stream HMM.
Given audiovisual training data, product HMM training can be performed similarly to separate, or joint, multi-stream HMM parameter estimation, discussed in the previous subsection. In the first case, the composite model is constructed based on individual single-modality HMMs estimated by Eq. (9.10), and on transition probabilities equal to the product of the audio- and visual-only ones. In the second case, referred to as AV-MS-PROD in the experiments reported later, all transition probabilities and HMM stream component parameters are estimated at a single stage using Eq. (9.18) with appropriate parameter tying. In both schemes, stream exponents need to be estimated separately. In the audiovisual ASR literature, product (or, coupled) HMMs have been considered in some small-vocabulary recognition tasks (Tomlinson et al.Reference Tomlinson, Russell and Brooke1996; Dupont and Luettin Reference Dupont and Luettin2000; Huang and Chen Reference Huang and Chen2001; Nakamura Reference Nakamura, Bigun and Smeraldi2001; Chu and Huang Reference Chu and Huang2002; Nefian et al.Reference Nefian, Liang, Pi, Liu and Murphy2002), where synchronization is sometimes enforced at the word level, and recently for LVCSR (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Luettin et al.Reference Luettin, Potamianos and Neti2001; Gravier et al.Reference Gravier, Potamianos and Neti2002b).
It is worth mentioning that the product HMM allows the restriction of the degree of asynchrony between the two streams by excluding certain composite states in the model topology. In the extreme case, when only the states that lie in its “diagonal” are kept, the model becomes equivalent to the state-synchronous multi-stream HMM (see also Figure 9.9).
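A small sketch may make the preceding paragraphs more concrete: it enumerates the composite states of a phone-synchronous product HMM built from two three-state single-modality models, prunes those exceeding a chosen degree of asynchrony (max_async = 0 recovers the state-synchronous multi-stream HMM of the previous subsection), and scores a composite state with emissions tied to the corresponding single-stream states, as in Eq. (9.19). All numbers are illustrative.

```python
import itertools

def product_hmm_states(n_states_audio=3, n_states_visual=3, max_async=1):
    """Composite states (c_A, c_V) of a product HMM, keeping only pairs whose
    within-phone asynchrony |c_A - c_V| does not exceed max_async.
    max_async = 0 reduces the model to the state-synchronous multi-stream HMM."""
    return [
        (ca, cv)
        for ca, cv in itertools.product(range(n_states_audio), range(n_states_visual))
        if abs(ca - cv) <= max_async
    ]

def composite_log_emission(state, log_b_audio, log_b_visual, lam_audio, lam_visual):
    """Log of Eq. (9.19): the emission score of composite state (c_A, c_V) is the
    exponent-weighted sum of the tied audio-only and visual-only state scores."""
    ca, cv = state
    return lam_audio * log_b_audio[ca] + lam_visual * log_b_visual[cv]

# 3x3 composite grid with at most one state of asynchrony: 7 of 9 states survive.
states = product_hmm_states(max_async=1)
print(states)
# Score one composite state from per-frame tied single-stream log-likelihoods.
print(composite_log_emission(states[1], [-10.0, -12.0, -9.5],
                             [-8.0, -7.5, -11.0], 0.6, 0.4))
```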
9.3.3.3 Late integration: discriminative model combination
A popular stage at which to combine audio- and visual-only recognition log-likelihoods is the utterance end, giving rise to late integration. In small-vocabulary, isolated word speech recognition, this can be easily implemented by calculating the combined likelihood for each word model in the vocabulary, given the acoustic and visual observations (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Su and Silsbee Reference Su and Silsbee1996; Cox et al.Reference Cox, Matthews and Bangham1997; Gurbuz et al.Reference Gurbuz, Tufekci, Patterson and Gowdy2001). However, for connected word recognition, and even more so for LVCSR, the number of possible word sequence hypotheses becomes prohibitively large. Instead, one has to limit the log-likelihood combination to the top n-best hypotheses only. Such hypotheses can be generated by the audio-only HMM, an alternative audiovisual fusion technique, or can be the union of audio-only and visual-only n-best lists. In this approach, the n-best hypotheses for a particular utterance, $[h_1, h_2, \ldots, h_n]$, are first force-aligned to their corresponding phone sequences $h_i = [c_{i,1}, c_{i,2}, \ldots, c_{i,N_i}]$ by means of both audio- and visual-only HMMs. Let the resulting boundaries of phone $c_{i,j}$ be denoted by $t^{s}_{i,j}$, for $s \in \{A, V\}$, $j = 1, \ldots, N_i$, and $i = 1, \ldots, n$. Then, the audiovisual likelihoods of the n-best hypotheses are computed as

$$ \Pr\bigl(h_i \mid \mathbf{O}^{(A)}, \mathbf{O}^{(V)}\bigr) \;\propto\; \mathrm{Pr}_{\mathrm{LM}}(h_i)^{\lambda_{\mathrm{LM}}}\ \prod_{j=1}^{N_i}\ \prod_{s \in \{A,V\}} \Pr\bigl(\mathbf{o}_{s,t^{s}_{i,j-1}}, \ldots, \mathbf{o}_{s,t^{s}_{i,j}-1} \,\big|\, c_{i,j}\bigr)^{\lambda_{s\,c_{i,j}}}, \qquad (9.20) $$

where $\mathrm{Pr}_{\mathrm{LM}}(h_i)$ denotes the language model (LM) probability of hypothesis $h_i$, and $t^{s}_{i,0}$ marks the utterance start. The exponents in Eq. (9.20) can be estimated using discriminative training criteria, as in the discriminative model combination method of Beyerlein (Reference Beyerlein1998) and Vergyri (Reference Vergyri2000). The method is proposed for audiovisual LVCSR in Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000), and it is referred to as AV-DMC in the Experiments section.
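The rescoring itself is straightforward once per-phone, per-stream segment log-likelihoods are available from the forced alignments; the sketch below re-ranks an n-best list under a log-domain version of Eq. (9.20). The hypothesis data structure, the use of global rather than class-dependent exponents, and all weight values are illustrative simplifications, not the AV-DMC configuration.

```python
def rescore_nbest(hypotheses, lam_audio, lam_visual, lam_lm):
    """Re-rank n-best hypotheses by the log of Eq. (9.20).

    Each hypothesis is a dict with:
      'words'    : the word sequence,
      'audio_ll' : per-phone audio segment log-likelihoods,
      'visual_ll': per-phone visual segment log-likelihoods,
      'lm_logp'  : language-model log-probability of the word sequence.
    """
    def score(h):
        return (lam_audio * sum(h["audio_ll"])
                + lam_visual * sum(h["visual_ll"])
                + lam_lm * h["lm_logp"])
    return sorted(hypotheses, key=score, reverse=True)

nbest = [
    {"words": "set dial to nine", "audio_ll": [-30.0, -28.0, -25.0, -31.0],
     "visual_ll": [-40.0, -41.0, -39.0, -38.0], "lm_logp": -12.0},
    {"words": "set dial to five", "audio_ll": [-30.0, -28.0, -25.0, -29.0],
     "visual_ll": [-40.0, -41.0, -39.0, -45.0], "lm_logp": -11.5},
]
best = rescore_nbest(nbest, lam_audio=1.0, lam_visual=0.4, lam_lm=8.0)
print(best[0]["words"])
```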
9.3.3.4 Stream exponent estimation and reliability modeling
We now address the issue of estimating stream exponents (weights), when combining likelihoods in the audiovisual decision fusion techniques presented above (see Eq. (9.16), Eq. (9.19), and Eq. (9.20)). As already discussed, such exponents can be set to constant values, computed for a particular audiovisual environment and database. In this case, the audiovisual weights depend on the modality and possibly on the speech class, capturing the confidence of the individual classifiers for the particular database conditions, and are estimated by seeking optimal system performance on matched data. However, in a practical audiovisual ASR system, the quality of captured audio and visual data, and thus of the speech information present in them, can change dramatically over time. To model this variability, utterance-level or even frame-level dependence of the stream exponents is required. This can be achieved by first obtaining an estimate of the local environment conditions and then using pre-computed exponents for this condition, or alternatively, by seeking a direct functional mapping between “environment” estimates and stream exponents. In the following, we expand on these methodologies.
In the first approach, constant exponents are estimated, based on training data, or more often, on so-called held-out data. Such stream exponents cannot be obtained by maximum likelihood estimation (Potamianos and Graf Reference Potamianos and Graf1998; Nakamura Reference Nakamura, Bigun and Smeraldi2001), although approaches based on likelihood normalization have appeared in the literature with moderate success (Hernando Reference Hernando1997; Tamura et al.Reference Tamura, Iwano and Furui2005). Typically, though, discriminative training techniques are used. Some of these methods seek to minimize a smooth function of the minimum classification error (MCE) resulting from the application of the audiovisual model on the data, and employ the generalized probabilistic descent (GPD) algorithm (Chou et al.Reference Chou, Juang, Lee and Soong1994) for stream exponent estimation (Potamianos and Graf Reference Potamianos and Graf1998; Miyajima et al.Reference Miyajima, Tokuda and Kitamura2000; Nakamura et al.Reference Nakamura, Ito and Shikano2000; Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). Other techniques use maximum mutual information (MMI) training (Bahl et al.Reference Bahl, Brown, DeSouza and Mercer1986), such as the system reported by Jourlin (Reference Jourlin1997), or the maximum entropy criterion (Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). The latter is reported to be faster than MCE-GPD and to perform better in the case of class-independent exponents. Alternatively, one can seek to directly minimize the word error rate of the resulting audiovisual ASR system on a held-out dataset. In the case of global exponents across all speech classes, constrained to add to a constant, the problem reduces to one-dimensional optimization of a non-smooth function, and can be solved using simple grid search (Miyajima et al.Reference Miyajima, Tokuda and Kitamura2000; Luettin et al.Reference Luettin, Potamianos and Neti2001; Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). For class-dependent weights, the problem becomes of higher dimension, and the downhill simplex method (Nelder and Mead Reference Nelder and Mead1965) can be employed. This technique is used by Neti et al. (Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) to estimate exponents for late decision fusion.
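For global, modality-only dependent exponents constrained to sum to one, the grid search mentioned above is one-dimensional; a sketch follows, in which decode_wer stands in for a full audiovisual decoding pass on held-out data and is therefore purely hypothetical.

```python
import numpy as np

def grid_search_stream_exponent(decode_wer, num_points=21):
    """One-dimensional grid search for a global audio stream exponent.

    decode_wer : callable taking (lam_audio, lam_visual) and returning the
                 word error rate of the audiovisual decoder on held-out data;
                 here it is a stand-in for an actual decoding run.
    """
    grid = np.linspace(0.0, 1.0, num_points)
    wers = [decode_wer(lam, 1.0 - lam) for lam in grid]
    best = int(np.argmin(wers))
    return grid[best], wers[best]

# Hypothetical, smooth stand-in for held-out WER as a function of the weights.
def fake_decode_wer(lam_audio, lam_visual):
    return 10.0 + 25.0 * (lam_audio - 0.7) ** 2

print(grid_search_stream_exponent(fake_decode_wer))  # close to (0.7, 10.0)
```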
In order to capture the effects of varying audio and visual environment conditions on the reliability of each stream, utterance-level, and occasionally frame-level, dependence of the stream weights needs to be considered. In most cases in the literature, exponents are considered as a function of the audio channel signal-to-noise ratio (SNR), and each utterance is decoded based on the fusion model parameters at its SNR (Adjoudani and Benoît Reference Adjoudani, Benoît, Stork and Hennecke1996; Meier et al.Reference Meier, Hürst and Duchnowski1996; Cox et al.Reference Cox, Matthews and Bangham1997; Teissier et al.Reference Teissier, Robert-Ribes and Schwartz1999; Gurbuz et al.Reference Gurbuz, Tufekci, Patterson and Gowdy2001). This SNR value is either assumed known, or estimated from the audio channel (Cox et al.Reference Cox, Matthews and Bangham1997). A linear dependence between SNR and audio stream weight has been demonstrated by Meier et al. (Reference Meier, Hürst and Duchnowski1996). An alternative technique sets the stream exponents to a linear function of the average conditional entropy of the recognizer output, computed using the confusion matrix at a particular SNR for a small-vocabulary isolated word ASR task (Cox et al.Reference Cox, Matthews and Bangham1997). A different approach considers the audio stream exponent as a function of the degree of voicing present in the audio channel, estimated as in Berthommier and Glotin (Reference Berthommier and Glotin1999). This method was used at the Johns Hopkins summer 2000 workshop (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Glotin et al.Reference Glotin, Vergyri, Neti, Potamianos and Luettin2001), and is referred to in the Experiments section as AV-MS-UTTER. Finally, Heckmann et al. (Reference Heckmann, Berthommier and Kroschel2002) use a combination of the above-mentioned audio stream indicators to estimate the audio stream exponent.
The above techniques do not allow modeling of possible variations in the visual stream reliability, since they concentrate on the audio stream alone. Modeling such variability in the visual signal domain is challenging, although one could for example consider face detection confidence as one such measure (Connell et al.Reference Connell, Haas, Marcheret, Neti, Potamianos and Velipasalar2003). Typically however this is achieved using confidence measures of the visual-only classifier applied on the extracted visual feature sequence. For example, Adjoudani and Benoît (Reference Adjoudani, Benoît, Stork and Hennecke1996) and Rogozan et al. (Reference Rogozan, Deléglise and Alissali1997) use the dispersion of both audio-only and visual-only class posterior log-likelihoods to model the single-stream classifier confidences, and then compute the utterance-dependent stream exponents as a closed form function of these dispersions in an unsupervised fashion. Similarly, Potamianos and Neti (Reference Potamianos and Neti2000) consider various confidence measures, such as entropy and dispersion, to capture the reliability of audio- and visual-only classification at the frame level, and they obtain stream exponents using a look-up table over confidence value intervals, estimated on the basis of held-out data. Extensions of this work appear in Garg et al. (Reference Garg, Potamianos, Neti and Huang2003), where a sigmoid function is used instead of the look-up table. This sigmoid is discriminatively trained to map the vector of audiovisual reliability measures to frame-dependent audiovisual exponents. Further, Marcheret et al. (Reference Marcheret, Libal and Potamianos2007) employ Gaussian mixture models for this purpose using the above reliability measures, whereas Shao and Barker (Reference Shao and Barker2008) use an ANN in a two-step approach to estimating the desired exponents. As input to the ANN, they utilize the entire vector of audio and visual likelihoods of all classes (or an appropriate clustering of them). An alternative approach is followed by Terry et al. (Reference Terry, Shiell and Katsaggelos2008), where the audio and visual observations are first vector quantized, allowing estimation of the conditional probability of the visual observation centroids given the audio ones. This is subsequently employed (as a measure of audio visual consistency) together with audio-only SNR in stream exponent estimation via an introduced sigmoid function. Finally, Sanchez-Soto et al. (2009) propose an entirely unsupervised approach, where stream exponents are estimated as functionals of the inter- and intra-class distances that are computed for each stream observation sequence over the given test utterance.
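As a flavour of such reliability-to-exponent mappings (in the spirit of the sigmoid of Garg et al.), the sketch below converts two simple indicators of each stream's frame-level classifier confidence, the dispersion of its top log-likelihoods and the negative entropy of its class posteriors, into an audio exponent through a logistic function; the indicator definitions, sigmoid weights, and all numbers are illustrative assumptions, not the trained mappings of the cited systems.

```python
import numpy as np

def posterior_reliability(log_likelihoods, n_best=4):
    """Two simple confidence indicators of a single-stream classifier at one frame:
    dispersion (mean gap between the best and the next few log-likelihoods) and
    negative entropy of the normalized class posteriors."""
    ll = np.sort(np.asarray(log_likelihoods, dtype=float))[::-1]
    dispersion = np.mean(ll[0] - ll[1:n_best])
    post = np.exp(ll - ll.max())
    post /= post.sum()
    neg_entropy = np.sum(post * np.log(post + 1e-12))
    return np.array([dispersion, neg_entropy])

def audio_exponent(audio_ll, visual_ll, weights=(0.8, 0.3), bias=0.0):
    """Map the difference of audio and visual reliability indicators to a
    frame-level audio exponent in (0, 1) via a logistic (sigmoid) function."""
    x = posterior_reliability(audio_ll) - posterior_reliability(visual_ll)
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, x) + bias)))

# A confident audio frame versus a flat (unreliable) visual frame.
lam_a = audio_exponent([-5.0, -9.0, -11.0, -12.0, -13.0],
                       [-7.0, -7.1, -7.2, -7.3, -7.4])
print(lam_a, 1.0 - lam_a)  # audio exponent and the complementary visual one
```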
9.3.4 Audiovisual speaker adaptation
Speaker adaptation is traditionally used in practical audio-only ASR systems to improve speaker-independent system performance, when little data from a speaker of interest is available (Gauvain and Lee Reference Gauvain and Lee1994; Leggetter and Woodland Reference Leggetter and Woodland1995; Neumeyer et al.Reference Neumeyer, Sankar and Digalakis1995; Anastasakos et al.Reference Anastasakos, McDonough and Makhoul1997; Gales Reference Gales1999; Goronzy Reference Goronzy2002; Young Reference Young, Banesty, Sondhi and Huang2008). Adaptation is also of interest across tasks or environments. In the audiovisual ASR domain, this is of particular importance, since audiovisual corpora are scarce and their collection expensive; therefore adaptation across datasets (in addition to speakers) is also of interest. In general, adaptation can be performed in a supervised or unsupervised fashion, as well as in a batch or incremental mode, depending on the type and availability of the adaptation data (Young Reference Young, Banesty, Sondhi and Huang2008).
Given little bimodal adaptation data from a particular speaker, and a baseline speaker-independent HMM, one may wish to estimate adapted HMM parameters that better model the audiovisual observations of the particular speaker. Two popular algorithms for speaker adaptation are maximum likelihood linear regression (MLLR) (Leggetter and Woodland Reference Leggetter and Woodland1995) and maximum a posteriori (MAP) adaptation (Gauvain and Lee Reference Gauvain and Lee1994). MLLR obtains a maximum likelihood estimate of a linear transformation of the HMM means, while leaving covariance matrices, mixture weights, and transition probabilities unchanged, and it provides successful adaptation with a small amount of adaptation data (rapid adaptation). On the other hand, MAP follows the Bayesian paradigm for estimating the HMM parameters. MAP estimates of HMM parameters converge to their EM-obtained estimates as the amount of adaptation data becomes large; however, this convergence is slow, and therefore MAP is not suitable for rapid adaptation. In practice, MAP is often used in conjunction with MLLR (Neumeyer et al.Reference Neumeyer, Sankar and Digalakis1995). Both techniques can be used in the feature fusion (Potamianos and Neti Reference Potamianos and Neti2001a) and decision fusion models discussed above (Potamianos and Potamianos Reference Potamianos and Potamianos1999), in a straightforward manner. One can also consider feature-level (front end) adaptation by adapting, for example, the audio-only and visual-only LDA and MLLT matrices and, if HiLDA fusion is used, the joint audiovisual LDA and MLLT matrices (Potamianos and Neti Reference Potamianos and Neti2001a). Experiments using these techniques are reported in a later section, all performed in a supervised, batch fashion. Alternative adaptation algorithms also exist, such as speaker adaptive training (Anastasakos et al.Reference Anastasakos, McDonough and Makhoul1997) and front end MLLR (Gales Reference Gales1999), which can also be used in audiovisual ASR (Vanegas et al.Reference Vanegas, Tanaka, Tokuda and Kitamura1998; Huang et al.Reference Huang, Marcheret and Visweswariah2005; Huang and Visweswariah Reference Huang and Visweswariah2005).
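As an illustration of the MAP branch, the following sketch adapts the mean of a single Gaussian component from frame-level occupation probabilities obtained, for example, by forced alignment of the adaptation data; with few frames the estimate stays close to the speaker-independent prior, and it approaches the adaptation-data mean as more data accumulate. The relevance factor tau is an assumed hyperparameter, and the full MAP formulation, which also updates mixture weights and variances, is omitted here.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, gammas, tau=10.0):
    """MAP re-estimation of one Gaussian mean.

    prior_mean : (D,) speaker-independent mean (the prior)
    frames     : (T, D) adaptation-data observations
    gammas     : (T,) occupation probabilities of this component per frame
    tau        : relevance factor; larger values trust the prior longer
    """
    gammas = np.asarray(gammas, dtype=float)
    occ = gammas.sum()
    weighted_sum = gammas @ np.asarray(frames, dtype=float)
    return (tau * np.asarray(prior_mean) + weighted_sum) / (tau + occ)

# With 5 frames the estimate stays near the prior; with 5000 it tracks the data.
rng = np.random.default_rng(2)
prior = np.zeros(3)
data = rng.standard_normal((5000, 3)) + np.array([1.0, -0.5, 2.0])
print(map_adapt_mean(prior, data[:5], np.ones(5)))
print(map_adapt_mean(prior, data, np.ones(5000)))
```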
9.3.5 Summary of audiovisual integration
We have presented a summary of the most common fusion techniques for audiovisual ASR. We first discussed the choice of speech classes and statistical ASR models that influences the design of some fusion algorithms. Subsequently, we described a number of feature and decision integration techniques suitable for bimodal LVCSR, and finally, briefly touched upon the issue of audiovisual speaker adaptation.
Among the fusion algorithms discussed, decision fusion techniques explicitly model the reliability of each source of speech information, by using stream weights to linearly combine audio- and visual-only classifier log-likelihoods. When properly estimated, the use of weights results in improved ASR over feature fusion techniques, as reported in the literature and demonstrated in the Experiments section. In most systems reported, such weights are set to a constant value over each modality, possibly dependent on the audio-only channel quality (SNR). However, robust estimation of the weights at a finer level (utterance or frame level), based on both audio and visual channel characteristics remains a challenge. Furthermore, the issue of whether speech class dependence of stream weights is desirable has also not been fully investigated. Although such dependence seems to help in late integration schemes (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000), or small-vocabulary tasks (Jourlin Reference Jourlin1997; Miyajima et al.Reference Miyajima, Tokuda and Kitamura2000), the problem remains unresolved for early integration in LVCSR (Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a).
There are additional open questions relevant to decision fusion. The first concerns the stage of measurement level information integration, i.e., the degree of allowed asynchrony between the audio and visual streams. The second has to do with the functional form of stream log-likelihood combination, as integration by means of Eq. (9.16) is not necessarily optimal and it fails to yield an emission probability distribution. Finally, it is worth mentioning a theoretical shortcoming of the log-likelihood linear combination model used in the decision fusion algorithms considered. In contrast to feature fusion, such combination assumes class conditional independence of the audio and visual stream observations. This appears to be an unrealistic assumption (Yehia et al.Reference Yehia, Rubin and Vatikiotis-Bateson1998; Jiang et al.Reference Jiang, Alwan, Keating, Auer and Bernstein2002).
A number of models are being investigated to overcome some of the above issues (Pan et al.Reference Pan, Liang, Anastasio and Huang1998; Pavlovic Reference Pavlovic1998). Most importantly, recent years have seen increasing interest in the use of dynamic Bayesian networks for audiovisual fusion, as a generalization of HMMs (Zweig Reference Zweig1998; Murphy Reference Murphy2002; Bilmes and Bartels Reference Bilmes and Bartels2005). Examples include the work of Saenko and Livescu (Reference Saenko and Livescu2006), Livescu et al. (Reference Livescu, Çetin, Hasegawa-Johnson, King, Bartels, Borges, Kantor, Lal, Yung, Bezman, Dawson-Haggerty, Woods, Frankel, Magimai-Doss and Saenko2007), Lv et al. (Reference Lv, Jiang, Zhao and Hou2007), Terry and Katsaggelos (Reference Terry and Katsaggelos2008), and Saenko et al. (Reference Saenko, Livescu, Glass and Darrell2009).
9.4 Audiovisual databases
A major contributor to the progress achieved in traditional, audio-only ASR has been the availability of a wide variety of large, multi-subject databases on a number of well-defined recognition tasks of different complexities. These corpora have often been collected using funding from US government agencies (for example, the Defense Advanced Research Projects Agency and the National Science Foundation), or through well-organized European activities, such as the Information Society Technologies program funded by the European Commission, or the European Language Resources Association. The resulting databases are made available to interested research groups by the Linguistic Data Consortium (LDC) and the European Language Resources Distribution Agency (ELDA), for example. Benchmarking research progress in audio-only ASR has been possible on such common databases.
In contrast to the abundance of audio-only corpora, there exist only a few databases suitable for audiovisual ASR research. This is not only because the field is relatively young, but also due to the fact that audiovisual databases pose additional challenges concerning collection, storage, and distribution not found in the audio-only domain. Most early databases, being the result of efforts by a few university groups or individual researchers with limited resources, suffer from one or more of the following shortcomings (Chibelushi et al.Reference Chibelushi, Deravi and Mason1996; Hennecke et al.Reference Hennecke, Stork, Prasad, Stork and Hennecke1996; Chibelushi et al.Reference Chibelushi, Deravi and Mason2002; Potamianos et al.Reference Potamianos, Neti, Gravier, Garg and Senior2003): They contain a single or small number of subjects, affecting the generalization of developed methods to the wider population; they are typically of short duration, often resulting in undertrained statistical models, or non-significant performance differences between various proposed algorithms; they mostly address simple recognition tasks, such as small-vocabulary ASR of isolated or connected words; and finally they mostly consider ideal visual environments with limited variation in head pose (mostly frontal), lighting, and background that do not reflect realistic human–computer interaction scenarios. These limitations have caused a gap in the state of the art between audio-only and audiovisual ASR in terms of recognition task complexity and have hindered practical system deployment. Nevertheless, the past few years have witnessed an effort to address some of these shortcomings; for example, IBM Research has collected a large corpus suitable for speaker-independent audiovisual LVCSR, employed in the experiments during the Johns Hopkins summer 2000 workshop (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000) and discussed further in this section, whereas a number of groups have collected corpora in realistic, visually challenging environments such as automobiles, or data where the recorded subjects appear at a non-frontal head-pose. Details are provided in the next subsections.
9.4.1 Early audiovisual corpora
The first database used for automatic recognition of audiovisual speech was collected by Petajan (Reference Petajan1984). Data of a single subject uttering from 2 to 10 repetitions of 100 isolated English words, including letters and digits, were collected under controlled lighting conditions. Since then, several research sites have pursued audiovisual data collection. Some of these early resulting corpora are discussed in this subsection. They cover a number of small-vocabulary recognition tasks and are mostly recorded under ideal visual environment conditions.
Some of these early databases are designed to study audiovisual recognition of consonants (C), vowels (V), or transitions between them. For example, Adjoudani and Benoît (Reference Adjoudani, Benoît, Stork and Hennecke1996) report a single-speaker corpus of 54 /V1CV2CV1/ nonsense words (3 French vowels and 6 consonants are considered). Su and Silsbee (Reference Su and Silsbee1996) recorded a single-speaker corpus of /aCa/ nonsense words for recognition of 22 English consonants. Robert-Ribes et al. (Reference Robert-Ribes, Schwartz, Lallouache and Escudier1998), as well as Teissier et al. (Reference Teissier, Robert-Ribes and Schwartz1999) report recognition of 10 French oral vowels uttered by a single subject. Czap (Reference Czap2000) considers a single-subject corpus of /V1CV1/ and /C1VC1/ nonsense words for recognition of Hungarian vowels and consonants.
The most popular task for audiovisual ASR is isolated or connected digit recognition. Various corpora allow digit recognition experiments. For example, the Tulips1 database (Movellan and Chadderdon Reference Movellan, Chadderdon, Stork and Hennecke1996) contains recordings of 12 subjects uttering digits “one” to “four,” and has been used for isolated recognition of these four digits in a number of papers (Luettin et al.Reference Luettin, Thacker and Beet1996; Movellan and Chadderdon Reference Movellan, Chadderdon, Stork and Hennecke1996; Gray et al.Reference Gray, Movellan, Sejnowski, Mozer, Jordan and Petsche1997; Vanegas et al.Reference Vanegas, Tanaka, Tokuda and Kitamura1998; Scanlon and Reilly Reference Scanlon and Reilly2001). The M2VTS database, although tailored to speaker verification applications, also contains digit (“0” to “9”) recordings of 37 subjects, mostly in French (Pigeon and Vandendorpe Reference Pigeon, Vandendorpe, Bigün, Chollet and Borgefors1997), and it has been used for isolated digit recognition experiments (Dupont and Luettin Reference Dupont and Luettin2000; Miyajima et al.Reference Miyajima, Tokuda and Kitamura2000). XM2VTS is an extended version of this database containing 295 subjects in the English language (Messer et al.Reference Messer, Matas, Kittler, Luettin and Maitre1999). Additional single-subject digit databases include the NATO RSG10 digit-triples set, used by Tomlinson et al. (Reference Tomlinson, Russell and Brooke1996) for isolated digit recognition, and two connected-digits databases reported by Potamianos et al. (Reference Potamianos, Graf and Cosatto1998) and Heckmann et al. (Reference Heckmann, Berthommier and Kroschel2001). Finally, several recent databases suitable for multi-subject connected digit recognition include the 36-subject CUAVE dataset, as discussed in Patterson et al. (Reference Patterson, Gurbuz, Tufekci and Gowdy2002), a 100-subject set collected at the University of Illinois at Urbana-Champaign with results reported in Chu and Huang (Reference Chu and Huang2000) and Zhang et al. (Reference Zhang, Levinson and Huang2000), the 97-subject Japanese dataset Aurora-2J-AV (Fujimura et al.Reference Fujimura, Miyajima, Itou, Takeda and Itakura2005), and an 11-subject Japanese set reported by Tamura et al. (Reference Tamura, Iwano and Furui2005). Among these, CUAVE remains the most popular, having been employed by many researchers in their experiments.
Isolated or connected letter recognition constitutes another popular audiovisual ASR task. German connected letter recognition of data of up to six subjects has been reported by Bregler et al. (Reference Bregler, Hild, Manke and Waibel1993), Bregler and Konig (Reference Bregler and Konig1994), Duchnowski et al. (Reference Duchnowski, Meier and Waibel1994), and Meier et al. (Reference Meier, Hürst and Duchnowski1996). Krone et al. (Reference Krone, Talle, Wichert and Palm1997) worked on single-speaker isolated German letter recognition. Single-, or two-subject, connected French letter recognition is considered in Alissali et al. (Reference Alissali, Deléglise and Rogozan1996), André-Obrecht et al. (Reference André-Obrecht, Jacob and Parlangeau1997), Jourlin (Reference Jourlin1997), Rogozan et al. (Reference Rogozan, Deléglise and Alissali1997), and Rogozan (Reference Rogozan1999). Finally, for English, a 10-subject isolated letter dataset is used by Matthews et al. (Reference Matthews, Bangham and Cox1996) and Cox et al. (Reference Cox, Matthews and Bangham1997), as well as a 49-subject connected letter database by Potamianos et al. (Reference Potamianos, Graf and Cosatto1998).
In addition to letter or digit recognition, a number of audiovisual databases have been collected that are suitable for recognition of isolated words. For example, Silsbee and Bovik (Reference Silsbee and Bovik1996) have collected a single-subject, isolated word corpus with a vocabulary of 500 words. Recognition of single-subject command words for radio/tape control has been examined by Chiou and Hwang (Reference Chiou and Hwang1997), as well as by Gurbuz et al. (Reference Gurbuz, Tufekci, Patterson and Gowdy2001), and Patterson et al. (Reference Patterson, Gurbuz, Tufekci and Gowdy2001). A 10-subject isolated word database with a vocabulary size of 78 words is considered by Chen (Reference Chen2001) and Huang and Chen (Reference Huang and Chen2001). This corpus was collected at Carnegie Mellon University (AMP/CMU database), and has also been used by Chu and Huang (Reference Chu and Huang2002), Nefian et al. (Reference Nefian, Liang, Pi, Liu and Murphy2002), and Zhang et al. (Reference Zhang, Broun, Mersereau and Clements2002), among others. Single-subject, isolated word recognition in Japanese is reported in Nakamura et al. (Reference Nakamura, Ito and Shikano2000) and Nakamura (Reference Nakamura, Bigun and Smeraldi2001). Single-subject German command word recognition is considered by Kober et al. (Reference Kober, Harz and Schiffers1997).
Further, a few audiovisual databases are suitable for continuous speech recognition in limited, small-vocabulary domains. Bernstein and Eberhardt (Reference Bernstein and Eberhardt1986a) and Goldschen et al. (Reference Goldschen, Garcia, Petajan, Stork and Hennecke1996) report small corpora containing TIMIT sentences uttered by up to two subjects. Chan et al. (Reference Chan, Zhang and Huang1998) present a dataset of 400 single-subject military command and control utterances. An extended multi-subject version of this database (still with a limited vocabulary of 101 words) is reported in Chu and Huang (Reference Chu and Huang2000).
Finally, a number of databases offer a combination of some or all of the above tasks. One example is the AVOZES corpus (Goecke and Millar Reference Goecke and Millar2004) that contains 20 subjects uttering connected digits, /VCV/ and /CVC/ words, as well as a small number of sentences, all in Australian English. Another is reported in Wojdeł et al. (2002) and contains data in Dutch.
9.4.2 Large-vocabulary databases and the IBM ViaVoiceTM corpus
Following both technological progress that simplified the recording and storage of larger audiovisual sets and research progress in the development of audiovisual speech recognition algorithms, the community became increasingly interested in more challenging recognition tasks. In support of this quest, some sites have proceeded with the collection of larger corpora to allow the development of speaker-independent audiovisual LVCSR systems. Two such datasets are the AVTIMIT corpus that contains 4 hrs of phonetically balanced continuous speech utterances by 223 subjects in English (Hazen et al.Reference Hazen, Saenko, La and Glass2004) and the UWB-04-HSCAVC set that contains 40 hrs of data from 100 subjects in Czech (Cisař et al.Reference Cisař, Železný, Krňoul, Kanis, Zelinka and Müller2005).
A third database suitable for speaker-independent LVCSR is the IBM ViaVoiceTM audiovisual database, which remains the largest such corpus to date. It consists of full face frontal video and audio of 290 subjects (see also Figure 9.10), uttering ViaVoiceTM training scripts in dictation style, i.e., continuous read speech with mostly verbalized punctuation. The database video is 704 × 480 pixels, interlaced, and captured in color at a rate of 30 Hz (i.e., 60 fields per second are available at a resolution of 240 lines). It is MPEG2 encoded at the relatively high compression ratio of about 50:1. High quality wideband audio is collected synchronously with the video at a rate of 16 kHz and in a relatively clean audio environment (quiet office, with some background computer noise), resulting in a 19.5 dB SNR. The duration of the entire database is approximately 50 hours, and it contains 24 325 transcribed utterances with a 10 403-word vocabulary, from which 21 281 utterances are used in the experiments reported in the next section. In addition to LVCSR, a 50-subject connected digit database has been collected at IBM Research, in order to study the benefit of the visual modality on a popular small-vocabulary ASR task. This DIGITS corpus contains 6689 utterances of 7- and 10-digit strings (both “zero” and “oh” are used) with a total duration of approximately 10 hrs. Furthermore, to allow investigation of automatic speechreading performance for impaired speech (Potamianos and Neti Reference Potamianos and Neti2001a), both LVCSR and DIGITS audiovisual speech data of a single speech-impaired male subject with profound hearing loss have been collected. In Table 9.3, a summary of the above corpora is given, together with their partitioning as used in the experiments reported in the following section.
Figure 9.10 Example video frames of 10 subjects from the IBM ViaVoiceTM audiovisual database. The database contains approximately 50 hrs of continuous, dictation-style audiovisual speech by 290 subjects, collected with minor variations in face pose, lighting, and background (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000).
9.4.3 Other recent audiovisual databases
Another recent direction that has attracted the interest of the research community has been the application of audiovisual ASR in realistic human–computer interaction environments. One such environment is the automobile cabin, where the driver needs to interact with the vehicle voice interface, for example for navigation or other command and control tasks. Not surprisingly, a number of research sites have collected in-vehicle audiovisual databases in support of this scenario. Resulting corpora of such efforts in English are the AVICAR dataset that contains 100 subjects uttering digit strings and TIMIT sentences, recorded using 4 cameras and a microphone array (Lee et al.Reference Lee, Hasegawa-Johnson, Goudeseune, Kamdar, Borys, Liu and Huang2004), and a corpus collected by IBM Research containing 87 subjects uttering digit strings, word spellings, and continuous speech geared towards various navigation and other in-vehicle command and control tasks (Potamianos and Neti Reference Potamianos and Neti2003). Similar, but smaller corpora have also been collected in other languages, for example the Japanese Aurora 3J-AV dataset containing 58 subjects recorded with 2 cameras and multiple microphones (Fujimura et al.Reference Fujimura, Miyajima, Itou, Takeda and Itakura2005), the Czech UWB-03-CIVAVC corpus containing 12 subjects (Železný and Cisař Reference Železný and Cisař2003), and the Spanish AV@CAR set of 20 subjects (Ortega et al.Reference Ortega, Sukno, Lleida, Frangi, Miguel, Buera and Zacur2004). Finally, a larger and more recent effort is that of the UTDrive project (Angkititrakul et al.Reference Angkititrakul, Hansen, Choi, Creek, Hayes, Kim, Kwak, Noecker, Phan, Takeda, Hansen, Erdoggan and Abut2009).
Yet another interesting aspect of visual and audiovisual speech recognition concerns the head pose of the recorded subject with respect to the camera. In the vast majority of the above-mentioned datasets, this is assumed to be mostly frontal. A few researchers, though, have ventured to investigate how head pose affects speechreading performance. For this purpose, they have collected datasets of mostly small-vocabulary tasks and few subjects, as for example in Iwano et al. (Reference Iwano, Yoshinaga, Tamura and Furui2007), Kumar et al. (Reference Kumar, Chen and Stern2007), Kumatani and Stiefelhagen (Reference Kumatani and Stiefelhagen2007), and Lucey et al. (Reference Lucey, Potamianos, Sridharan, Liew and Wang2009). Most of these are multi-view databases, i.e., more than one camera (typically, two or three) captures the speaker’s face at relatively fixed head-pose view angles.
Finally, a few additional databases exist that can be used for audiovisual ASR, but are most appropriate for other audiovisual speech processing tasks, closely related to ASR. Two such examples are the XM2VTS corpus, already mentioned above (Messer et al.Reference Messer, Matas, Kittler, Luettin and Maitre1999), and the VidTIMIT database (Sanderson Reference Sanderson2003), which contains data from 43 subjects uttering short sentences. Both are most suitable for speaker recognition experiments. Other interesting datasets are the AVGrid corpus (Cooke et al.Reference Cooke, Barker, Cunningham and Shao2006) that is best suited for audiovisual speaker separation and the MOCHA (Wrench and Hardcastle Reference Wrench and Hardcastle2000) and QSMT (Engwall and Beskow Reference Engwall and Beskow2003) databases that have been used for speech inversion.
9.5 Audiovisual ASR experiments
In this section, we present experimental results on visual-only and audiovisual ASR using mainly the IBM ViaVoiceTM database discussed above. Some of these results were obtained during the Johns Hopkins summer 2000 workshop (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000). Experiments conducted later on both these data and the IBM connected digits task (DIGITS) are also reported (Potamianos et al.Reference Potamianos, Luettin and Neti2001c; Goecke et al.Reference Goecke, Potamianos and Neti2002; Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). In addition, the application of audiovisual speaker adaptation methods to the hearing impaired dataset is also discussed (Potamianos and Neti Reference Potamianos and Neti2001a). First, however, we briefly describe the basic audiovisual ASR system, as well as the experimental framework used.
9.5.1 The audiovisual ASR system
Our basic audiovisual ASR system utilizes appearance-based visual features that use a discrete cosine transform (DCT) of the mouth region-of-interest (ROI), as described in Potamianos et al. (Reference Potamianos, Neti, Iyengar, Senior and Verma2001b). Given the video of the speaker’s face, available at 60 Hz, it first performs face detection and mouth center and size estimation employing the algorithm of Senior (Reference Senior1999). On the basis of these, it extracts a size-normalized, 64 × 64 greyscale pixel mouth ROI, as discussed in Section 9.2.1 (see also Figure 9.2). Subsequently, a two-dimensional, separable, fast DCT is applied on the ROI, and its 24 highest energy coefficients (over the training data) are retained. A number of post-processing steps are applied on the resulting “static” feature vector, namely: linear interpolation to the audio feature rate (from 60 to 100 Hz); feature mean normalization (FMN) for improved robustness to lighting and other variations; concatenation of 15 adjacent features to capture dynamic speech information (see also Eq. (9.7)); and linear discriminant analysis (LDA) for optimal dimensionality reduction, followed by a maximum likelihood data rotation (MLLT) for improved statistical data modeling. The resulting feature vector has dimension 41. These steps are described in more detail in the visual front end section of this chapter (see also Figure 9.7). Improvements to this DCT-based visual front end have been proposed in Potamianos and Neti (Reference Potamianos and Neti2001b), including the use of a larger ROI, a within-frame discriminant DCT feature selection, and a longer temporal window (see Figure 9.11). During the Johns Hopkins summer workshop, and in addition to the DCT-based features, joint appearance and shape features from active appearance models (AAMs) have also been employed. In particular, 6000-dimensional appearance vectors containing the normalized face color pixel values and 134-dimensional shape vectors of the face shape coordinates are extracted at 30 Hz and are passed through two stages of principal components analysis (PCA). The resulting “static” AAM feature vector is 86-dimensional, and it is post-processed similarly to the DCT feature vector (see Figure 9.7), resulting in 41-dimensional “dynamic” features.
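To make the above pipeline concrete, the following minimal sketch (in Python, using NumPy and SciPy) illustrates the DCT-based “static” feature extraction together with the interpolation, mean-normalization, and temporal-concatenation steps. All function and variable names are our own; the energy-based coefficient selection is one plausible reading of “highest energy coefficients over the training data”; and the final LDA/MLLT projection is assumed to be applied separately, once the corresponding matrices are available (see the forced-alignment sketch later in this section). This is a sketch under these assumptions, not the authors’ implementation.

```python
import numpy as np
from scipy.fft import dctn


def top_energy_mask(train_rois, n_keep=24):
    """Indices of the n_keep highest-energy 2D-DCT coefficients over training ROIs."""
    energy = np.zeros(train_rois.shape[1:])                    # (64, 64)
    for roi in train_rois:
        energy += dctn(roi.astype(float), norm="ortho") ** 2
    flat = np.argsort(energy.ravel())[::-1][:n_keep]
    return np.unravel_index(flat, energy.shape)


def static_features(rois, mask):
    """24-dimensional 'static' DCT features per 60 Hz video frame (rois: T x 64 x 64)."""
    return np.stack([dctn(r.astype(float), norm="ortho")[mask] for r in rois])


def visual_front_end(rois, mask, window=15):
    feats = static_features(rois, mask)                        # T x 24, at 60 Hz
    # Linear interpolation from the 60 Hz video rate to the 100 Hz audio feature rate.
    t_video = np.arange(len(feats)) / 60.0
    t_audio = np.arange(int(len(feats) * 100 / 60)) / 100.0
    feats = np.stack([np.interp(t_audio, t_video, feats[:, d])
                      for d in range(feats.shape[1])], axis=1)
    feats -= feats.mean(axis=0)                                # feature mean normalization
    # Concatenate `window` adjacent frames to capture dynamic speech information.
    half = window // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window].ravel() for t in range(len(feats))])
```

The 24 × 15 = 360-dimensional vectors returned by this sketch would then be multiplied by the P_LDA and P_MLLT matrices to obtain the 41-dimensional visual features mentioned above.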
In parallel to the visual front end, traditional audio features that consist of mel frequency cepstral coefficients (MFCCs) are extracted at a 100 Hz rate. The resulting “static” feature vector is 24-dimensional, and following FMN, LDA on 9 adjacent frames and MLLT, it gives rise to a 60-dimensional dynamic speech vector, as depicted in Figure 9.11. The audio and visual front ends provide time-synchronous audio and visual feature vectors that can be used in a number of fusion techniques discussed previously in Section 9.3. The derived concatenated audiovisual vector has dimension 101, whereas in the HiLDA feature fusion implementation, the bimodal LDA generates features with reduced dimensionality 60 (see also Figure 9.11).
Figure 9.11 The audiovisual ASR system employed in some of the experiments reported in this chapter. In addition to the baseline system used during the Johns Hopkins summer 2000 workshop, a larger mouth ROI is extracted, within-frame discriminant features are used, and a longer temporal window is considered in the visual front end (compare to Figure 9.7). HiLDA feature fusion is employed.
In all cases where LDA and MLLT matrices are employed (audio-only, visual-only, and audiovisual feature extraction by means of HiLDA fusion), we consider |C| = 3367 context-dependent sub-phonetic classes that coincide with the context-dependent states of an existing audio-only HMM that was previously developed at IBM for LVCSR and trained on a number of audio corpora (Polymenakos et al.Reference Polymenakos, Olsen, Kanevsky, Gopinath, Gopalakrishnan and Chen1998). The forced alignment (Rabiner and Juang Reference Rabiner and Juang1993) of the training set audio, based on this HMM and the data transcriptions, produces labels c(l) ∈ C for the training set audio-only, visual-only, and audiovisual data vectors x_l, l = 1, . . ., L. Such labeled vectors can then be used to estimate the required matrices P_LDA, P_MLLT, as described in the visual front end section of this chapter.
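As an illustration of this last step, the sketch below estimates an LDA projection from vectors labeled with their forced-alignment classes c(l). scikit-learn’s LDA is used here as a stand-in for the chapter’s estimation procedure, the MLLT rotation is replaced by an identity placeholder since its estimation is more involved, and all names are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def estimate_projections(X, labels, out_dim):
    """X: (L, D) training vectors; labels: (L,) forced-alignment classes c(l)."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim).fit(X, labels)
    P_lda = lda.scalings_[:, :out_dim]     # D x out_dim discriminant directions
    P_mllt = np.eye(out_dim)               # placeholder: MLLT estimation omitted
    return P_lda, P_mllt


def project(X, P_lda, P_mllt):
    # Mean subtraction is omitted for simplicity in this sketch.
    return (X @ P_lda) @ P_mllt
```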
9.5.2 The experimental framework
The audiovisual databases discussed above were partitioned into a number of sets in order to train and evaluate models for audiovisual ASR, as detailed in Table 9.3. For both LVCSR and DIGITS speech tasks in the normal speech condition, the corresponding training sets are used to obtain all LDA and MLLT matrices required and the phonetic decision trees that cluster HMM states on the basis of phonetic context, as well as to train all the HMMs reported. The held-out sets are used to tune parameters relevant to audiovisual decision fusion and decoding (such as the multi-stream HMM and language model weights, for example), whereas the test sets are used for evaluating the performance of the trained HMMs. Optionally, the adaptation sets can be employed for tuning the front ends and/or HMMs to the characteristics of the test set subjects. In the LVCSR case, the subject populations of the training, held-out, and test sets are disjoint, thus allowing for speaker-independent recognition, whereas in the DIGITS data partitioning, all sets have data from the same 50 subjects, thus allowing multi-speaker experiments. Consequently, the adaptation and held-out sets for DIGITS are identical. For the impaired speech data, the duration of the collected data is too short to allow HMM training. Therefore, LVCSR HMMs trained on the IBM ViaVoiceTM dataset are adapted on the impaired LVCSR and DIGITS adaptation sets (see Table 9.3).
Table 9.3 The IBM audiovisual databases discussed and used in the experiments reported in this chapter. Their partitioning into training, held-out, adaptation, and test sets is depicted (duration in hours and number of subjects are shown for each set). Both large-vocabulary continuous speech (LVCSR) and connected digit (DIGITS) recognition are considered for normal, as well as impaired speech. The IBM ViaVoiceTM database corresponds to the LVCSR task in the normal speech condition. For the normal speech DIGITS task, the held-out and adaptation sets are identical. For impaired speech, due to the lack of sufficient training data, adaptation of HMMs trained in the normal speech condition is considered.
To assess the benefit of the visual modality to ASR in noisy conditions (as well as in the relatively clean audio condition of the database recordings), we artificially corrupt the audio data with additive, non-stationary, speech babble noise at various SNRs. ASR results are then reported at a number of SNRs, within [−1.5,19.5] dB for LVCSR and [−3.5,19.5] dB for DIGITS, with all corresponding front end matrices and HMMs trained in the matched condition. In particular, during the Johns Hopkins summer 2000 workshop, only two audio conditions were considered for LVCSR: the original 19.5 dB SNR audio and a degraded one at 8.5 dB SNR. Notice that, in contrast to the audio, no noise is added to the video channel or features. Many cases of “visual noise” could have been considered, such as additive noise on video frames, blurring, frame rate decimation, and extremely high compression factors, among others. Some studies on the effects of video degradations on visual recognition can be found in the literature (Davoine et al.Reference Davoine, Li, Forchheimer, Bigün, Chollet and Borgefors1997; Williams et al.Reference Williams, Rutledge, Garstecki and Katsaggelos1997; Potamianos et al.Reference Potamianos, Graf and Cosatto1998; Seymour et al.Reference Seymour, Stewart and Ming2008). These studies find automatic speechreading performance to be rather robust to video compression, for example, but to degrade rapidly for frame rates below 15 Hz.
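A simple way to generate such matched-condition material is to scale the babble recording so that the mixture reaches the desired SNR, as in the sketch below. The looping and offset conventions, as well as all names, are our own and are not meant to describe the exact corruption procedure used for these experiments.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db, rng=None):
    """Mix `noise` into `clean` (1-D float arrays) so the result has `snr_db` dB SNR."""
    rng = rng or np.random.default_rng()
    if len(noise) < len(clean):                                # loop the noise if too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = rng.integers(0, len(noise) - len(clean) + 1)       # random noise segment
    noise = noise[start:start + len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```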
The ASR experiments reported next follow two distinct paradigms. The results on the IBM ViaVoiceTM data obtained during the Johns Hopkins summer 2000 workshop employ a lattice rescoring paradigm, due to the limitations in large-vocabulary decoding of the early HTK software used there (Young et al.Reference Young, Kershaw, Odell, Ollason, Valtchev and Woodland1999). Lattices were first generated prior to the workshop using the IBM Research stack decoder (Hark) with HMMs trained at IBM Research, and were subsequently rescored during the workshop by triphone context-dependent HMMs trained with HTK on the various feature sets or fusion techniques considered. Three sets of lattices were generated for these experiments and were based on clean audio-only (19.5 dB), noisy audio-only, and noisy audiovisual (at the 8.5 dB SNR condition) HiLDA features. In the second experimental paradigm, full decoding results obtained by directly using the IBM Research recognizer are reported. For the LVCSR experiments, 11-phone context-dependent HMMs with 2808 context-dependent states and 47 k Gaussian mixtures are used, whereas for DIGITS recognition in normal speech the corresponding numbers are 159 and 3.2 k (for single-stream models). Decoding using the closed set vocabulary (10 403 words) and a trigram language model is employed for LVCSR (this is the case also for the workshop results), whereas the 11-digit (“zero” to “nine,” including “oh”) word vocabulary is used for DIGITS (with unknown digit string length).
9.5.3 Visual-only recognition
The suitability for LVCSR of a number of appearance-based visual features and AAMs was studied during and after the Johns Hopkins summer workshop (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001). For this purpose, noisy audio-only lattices were rescored by HMMs trained on the various visual features considered, namely 86-dimensional AAM features, as well as 24-dimensional DCT, PCA (on 32 × 32 pixel mouth ROIs), and DWT-based features. All features were post-processed as previously discussed to yield 41-dimensional feature vectors (see Figure 9.7). For the DWT features, the Daubechies class wavelet filter of approximating order 3 is used (Daubechies Reference Daubechies1992; Press et al.Reference Press, Flannery, Teukolsky and Vetterling1995). LVCSR recognition results are reported in Table 9.4, depicted in word error rate (WER), %. The DCT outperformed all other features considered. Notice however that these results cannot be interpreted as visual-only recognition, since they correspond to cascade audiovisual fusion of audio-only ASR, followed by visual-only rescoring of a network of recognized hypotheses. For reference, a number of characteristic lattice WERs are also depicted in Table 9.4, including the audio-only result (at 8.5 dB). All feature performances are bounded by the lattice oracle and anti-oracle WERs. It is interesting to note that all appearance-based features considered attain lower WERs (e.g., 58.1% for DCT features) than the WER of the best path through the lattice based on the language model alone (62.0%). Therefore, such visual features do convey significant speech information. AAMs on the other hand did not perform well, possibly due to severe undertraining of the models, resulting in poor fitting to unseen facial data.
Table 9.4 Comparisons of recognition performance based on various visual features (three appearance-based features, and one joint shape and appearance feature representation) for speaker-independent LVCSR (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Matthews et al.Reference Matthews, Potamianos, Neti and Luettin2001). Word error rate (WER), %, is depicted on a subset of the IBM ViaVoiceTM database test set of Table 9.3. Visual performance is obtained after the rescoring of lattices that had been previously generated based on noisy (8.5 dB SNR) audio-only MFCC features. For comparison, characteristic lattice WERs are also depicted (oracle, anti-oracle, and best path based on language model scores alone). Among the visual speech representations considered, the DCT-based features are superior and contain significant speech information.

As expected, visual-only recognition based on full decoding (instead of lattice rescoring) is rather poor. The LVCSR WER on the speaker-independent test set of Table 9.3, based on per-speaker MLLR adaptation, is reported at 89.2% in Potamianos and Neti (Reference Potamianos and Neti2001b), using the DCT features of the workshop. Extraction of larger ROIs and the use of within-frame DCT discriminant features and longer temporal windows (as depicted in Figure 9.11) result in the improved WER of 82.3%. In contrast to LVCSR, DIGITS visual-only recognition constitutes a much easier task. Indeed, on the multi-speaker test set of Table 9.3, a 16.8% WER is achieved after per-speaker MLLR adaptation.
9.5.4 Audiovisual ASR
A number of audiovisual integration algorithms presented in the fusion section of this chapter were compared during the Johns Hopkins summer 2000 workshop. As already mentioned, two audio conditions were considered: the original clean database audio (19.5 dB SNR) and a noisy one at 8.5 dB SNR. In the first case, fusion algorithm results were obtained by rescoring pre-generated clean audio-only lattices; in the second condition, HiLDA noisy audiovisual lattices were rescored. The results of these experiments are summarized in Table 9.5. Notice that every fusion method considered outperformed audio-only ASR in the noisy case, reaching up to a 27% relative reduction in WER (from 48.10% noisy audio-only to 35.21% audiovisual). In the clean audio condition, among the two feature fusion techniques considered, HiLDA fusion (Potamianos et al.Reference Potamianos, Luettin and Neti2001c) improved ASR from a 14.44% audio-only to a 13.84% audiovisual WER. However, concatenative fusion degraded performance to 16.0%. Among the decision fusion algorithms used, the product HMM (AV-MS-PROD) with jointly trained audiovisual components (Luettin et al.Reference Luettin, Potamianos and Neti2001) improved performance to a 14.19% WER. In addition, utterance-based stream exponents for a jointly trained multi-stream HMM (AV-MS-UTTER), estimated using an average of the voicing present at each utterance, further reduced WER to 13.47% (Glotin et al.Reference Glotin, Vergyri, Neti, Potamianos and Luettin2001), achieving a 7% relative WER reduction over audio-only performance. Finally, a late integration technique based on discriminative model combination (AV-DMC) of audio and visual HMMs (Beyerlein Reference Beyerlein1998; Vergyri Reference Vergyri2000; Glotin et al.Reference Glotin, Vergyri, Neti, Potamianos and Luettin2001) produced a WER of 12.95%, amounting to a 5% reduction from the clean audio-only baseline of 13.65% (this differs from the 14.44% audio-only result due to the rescoring of n-best lists instead of lattices). For both clean and noisy audio conditions, the best decision fusion method outperformed the best feature fusion technique considered. In addition, for both conditions, joint multi-stream HMM training outperformed separate training of the HMM stream components, something not surprising, since joint training forces state synchrony between the audio and visual streams.
Table 9.5 Test set speaker-independent LVCSR audio-only and audiovisual WER (%), for the clean (19.5 dB SNR) and a noisy audio (8.5 dB) condition. Two feature fusion- and five decision fusion-based audiovisual systems are evaluated using the lattice rescoring paradigm (Neti et al.Reference Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari and Zhou2000; Glotin et al.Reference Glotin, Vergyri, Neti, Potamianos and Luettin2001; Luettin et al.Reference Luettin, Potamianos and Neti2001).
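At the core of the multi-stream systems in Table 9.5 is a linear combination of the audio and visual class-conditional log-likelihoods with stream exponents, applied state-synchronously during decoding. A minimal sketch of this combination is given below; the exponent values and all names are purely illustrative.

```python
import numpy as np


def combine_stream_loglikes(logb_audio, logb_visual, lam_audio=0.7, lam_visual=0.3):
    """Per-frame, per-state audiovisual log-likelihoods:
    lam_audio * log b_a(o_a,t) + lam_visual * log b_v(o_v,t)."""
    return lam_audio * np.asarray(logb_audio) + lam_visual * np.asarray(logb_visual)
```

The combined scores then replace the single-stream state output log-likelihoods during Viterbi decoding, with the stream exponents tuned on held-out data or set per utterance (as in the AV-MS-UTTER system).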
To further demonstrate the differences between the various fusion algorithms and to quantify the visual modality benefit to ASR, we review a number of full decoding experiments recently conducted for both the LVCSR and DIGITS tasks, at a large number of SNR conditions (Potamianos et al.Reference Potamianos, Luettin and Neti2001c; Goecke et al.Reference Goecke, Potamianos and Neti2002; Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). All three feature fusion techniques discussed in Section 9.3 are compared to decision fusion by means of a jointly trained multi-stream HMM. The results are depicted in Figure 9.12. Among the feature fusion methods considered, HiLDA feature fusion is superior to both concatenative fusion and the enhancement approach. In the clean audio case for example, HiLDA fusion reduces the audio-only LVCSR WER of 12.37% to 11.56% audiovisual, whereas feature concatenation degrades performance to 12.72% (the enhancement method obviously provides the original audio-only performance in this case). Notice that these results are somewhat different from the ones reported in Table 9.5, due to the different experimental paradigm considered. In the most extreme noisy case considered for LVCSR (−1.5 dB SNR), the audio-only WER of 92.16% is reduced to 48.63% using HiLDA, compared to 50.76% when feature concatenation is employed, and to 63.45% when audio feature enhancement is used. Similar results hold for DIGITS recognition, although the difference between HiLDA and concatenative feature fusion ASR is small, possibly due to the fact that HMMs with significantly fewer Gaussian mixtures are used, and to the availability of sufficient data to train on high-dimensional concatenated audiovisual vectors. The comparison between multi-stream decision fusion and HiLDA fusion reveals that the jointly trained multi-stream HMM performs significantly better. For example, at −1.5 dB SNR, LVCSR WER is reduced to 46.28% (compared to 48.63% for HiLDA). Similarly, for DIGITS recognition at −3.5 dB, the HiLDA WER is 7.51%, whereas the multi-stream HMM WER is significantly lower, namely 6.64%. This is less than one third of the audio-only WER of 23.97%.
Figure 9.12 Comparison of audio-only and audiovisual ASR by means of three feature fusion (AV-Concat, AV-HiLDA, and AV-Enhanced) algorithms and one decision fusion (AV-MS-Joint) technique, using the full decoding experimental paradigm. WERs vs. audio channel SNR are reported on both the IBM ViaVoiceTM test set (speaker-independent LVCSR – top), and on the multi-speaker DIGITS test set (bottom) of Table 9.3. HiLDA feature fusion outperforms alternative feature fusion methods, whereas decision fusion outperforms all three feature fusion approaches, resulting in an effective SNR gain of 7 dB for LVCSR and 7.5 dB for DIGITS, at 10 dB SNR (Potamianos et al.Reference Potamianos, Luettin and Neti2001c; Goecke et al.Reference Goecke, Potamianos and Neti2002; Gravier et al.Reference Gravier, Axelrod, Potamianos and Neti2002a). Notice that the WER ranges in the two graphs differ.
A useful indicator when comparing fusion techniques and establishing the visual modality benefit to ASR is the effective SNR gain, measured here with reference to the audio-only WER at 10 dB. To compute this gain, we need to consider the SNR value where the audiovisual WER equals the reference audio-only WER (see Figure 9.12). For HiLDA fusion, this gain equals approximately 6 dB for both LVCSR and DIGITS tasks. Jointly trained multi-stream HMMs improve these gains to 7 dB for LVCSR and 7.5 dB for DIGITS, at 10 dB SNR. Full decoding experiments employing additional decision fusion techniques are currently in progress. In particular, intermediate fusion results by means of the product HMM are reported in Gravier et al. (Reference Gravier, Axelrod, Potamianos and Neti2002a).
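For completeness, the effective SNR gain can be computed by interpolating the two WER-versus-SNR curves, as in the following sketch. The function and its arguments are illustrative and do not reproduce the exact operating points of Figure 9.12.

```python
import numpy as np


def effective_snr_gain(snrs, wer_audio, wer_audiovisual, ref_snr=10.0):
    """SNR gain at `ref_snr`: how many dB lower an SNR the audiovisual system can
    tolerate while matching the audio-only WER measured at `ref_snr`.
    `snrs` must be increasing; WER is assumed to decrease as SNR increases."""
    wer_ref = np.interp(ref_snr, snrs, wer_audio)               # audio-only WER at ref_snr
    # Invert the audiovisual curve (reversed so x-values are increasing for np.interp).
    snr_av = np.interp(wer_ref, np.asarray(wer_audiovisual)[::-1], np.asarray(snrs)[::-1])
    return ref_snr - snr_av
```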
9.5.5 Audiovisual adaptation
We now describe recent experiments on audiovisual adaptation in a case study of single-subject audiovisual ASR of impaired speech (Potamianos and Neti Reference Potamianos and Neti2001a). As already indicated, the small amount of speech-impaired data collected (see Table 9.3) is not sufficient for HMM training and calls for speaker adaptation techniques instead. A number of such methods, described in a previous section, are used for adapting the audio-only, visual-only, and audiovisual LVCSR HMMs. The results on both speech-impaired LVCSR and DIGITS tasks are depicted in Table 9.6. Due to poor accuracy on impaired speech, decoding on the LVCSR task is performed using the 537-word test set vocabulary of the dataset. Clearly, the mismatch between the normal and impaired-speech data is dramatic, as the “Unadapted” table entries demonstrate. Indeed, the audiovisual WER in the LVCSR task reaches 106.0% (such large numbers occur due to word insertions), whereas the audiovisual WER in the DIGITS task is 24.8% (in comparison, the normal speech, per subject, adapted audiovisual LVCSR WER is 10.2%, and the audiovisual DIGITS WER is only 0.55%, computed on the test sets of Table 9.3).
Table 9.6 Adaptation results on the speech impaired data. WER, %, of the audio-only (AU), visual-only (VI), and audiovisual (AV) modalities, using HiLDA feature fusion, are reported on both the LVCSR (left table part) and DIGITS test sets (right table) of the speech-impaired data using unadapted HMMs (trained in normal speech) as well as a number of HMM adaptation methods. All HMMs are adapted on the joint speech-impaired LVCSR and DIGITS adaptation sets of Table 9.3. For the continuous speech results, decoding using the test set vocabulary of 537 words is reported. MAP followed by MLLR adaptation, and possibly preceded by front end matrix adaptation (Mat), achieves the best results for all modalities and for both tasks considered (Potamianos and Neti Reference Potamianos and Neti2001a).
We first consider MLLR and MAP HMM adaptation using the joint speech-impaired LVCSR and DIGITS adaptation sets. Audio-only, visual-only, and audiovisual performances improve dramatically, as demonstrated in Table 9.6. Due to the rather large adaptation set, MAP performs similarly well to MLLR. Applying MLLR after MAP improves results further, reducing the audiovisual WER to 41.2% and 0.99% for the LVCSR and DIGITS tasks, respectively, amounting to a 61% and 96% relative WER reduction over the audiovisual unadapted results, and to a 13% and 58% relative WER reduction over the audio-only MAP+MLLR adapted results. Clearly, therefore, the use of the visual modality confers dramatic benefits on the automatic recognition of impaired speech. We also apply front end adaptation, possibly followed by MLLR adaptation, with the results depicted in the Mat+MAP(+MLLR) entries of Table 9.6. Although visual-only recognition improves, the audio-only recognition results fail to do so. As a consequence, audiovisual ASR degrades, possibly due to the fact that, in this experiment, audiovisual matrix adaptation is only applied to the second stage of LDA/MLLT.
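For readers unfamiliar with the adaptation techniques referred to above, the sketch below estimates a single, global, MLLR-style affine transform of the Gaussian means from adaptation data, in a deliberately simplified form (identity covariances, hard frame-to-Gaussian assignments, least-squares estimation). The actual MAP and MLLR re-estimation formulas used in the experiments are more involved, and all names and simplifications here are our own.

```python
import numpy as np


def estimate_global_mllr(frames, means, assign):
    """frames: (T, d) adaptation vectors; means: (M, d) unadapted Gaussian means;
    assign: (T,) index of the Gaussian aligned to each frame.
    Returns W = [b | A], shape (d, d + 1), so that A @ mu + b approximates the data."""
    xi = np.hstack([np.ones((len(frames), 1)), means[assign]])   # (T, d + 1) extended means
    sol, *_ = np.linalg.lstsq(xi, frames, rcond=None)            # least-squares fit
    return sol.T                                                 # (d, d + 1)


def adapt_means(means, W):
    xi = np.hstack([np.ones((len(means), 1)), means])            # (M, d + 1)
    return xi @ W.T                                              # adapted means, (M, d)
```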
9.6 Summary and discussion
In this chapter we provided an overview of the basic techniques for automatic recognition of audiovisual speech proposed in the literature over the past twenty years. The two main issues relevant to the design of audiovisual ASR systems are, first, the visual front end that captures visual speech information, and second, the integration (fusion) of audio and visual features into the automatic speech recognizer used. Both are challenging problems, and significant research effort has been directed towards finding appropriate solutions.
We first discussed extracting visual features from the video of the speaker’s face. This process requires first the detection and then tracking of the face, mouth region, and possibly the speaker’s lip contours. A number of mostly statistical techniques suitable for the task were reviewed. Various visual features proposed in the literature were then presented. Some are based on the mouth region appearance and employ image transforms or other dimensionality-reduction techniques borrowed from the pattern-recognition literature, in order to extract relevant speech information. Others capture the lip contour and possibly face shape characteristics by means of statistical or geometric models. Combinations of features from these two categories are also possible.
Subsequently, we concentrated on the problem of audiovisual integration. Possible solutions to this problem differ in various aspects, including the classifier and classes used for automatic speech recognition, the combination of single-modality features versus single-modality classification decisions, and in the latter case, the information level provided by each classifier, the temporal level of the integration, and the sequence of the decision combination. We concentrated on HMM-based recognition, based on sub-phonetic classes and assuming time-synchronous audio and visual feature generation. We reviewed a number of feature and decision fusion techniques. Within the first category, we discussed simple feature concatenation, discriminant feature fusion, and a linear audio feature enhancement approach. For decision-based integration, we concentrated on linear log-likelihood combination of parallel, single-modality classifiers at various levels of integration, considering the state-synchronous multi-stream HMM for “early” fusion, the product HMM for “intermediate” fusion, and discriminative model combination for “late” integration. We discussed the training of the resulting models.
Developing and benchmarking feature extraction and fusion algorithms requires available audiovisual data. A limited number of corpora suitable for research in audiovisual ASR have been collected and used in the literature. A brief overview of them was also provided, including a description of the IBM ViaVoiceTM database, suitable for speaker-independent audiovisual ASR in the large-vocabulary, continuous speech domain. Subsequently, experimental results were reported using this database, as well as additional corpora collected at IBM Research. Some of these experiments were conducted during the summer 2000 workshop at the Johns Hopkins University and compared both visual feature extraction and audiovisual fusion methods for LVCSR. More recent experiments, as well as a case study of speaker adaptation techniques for audiovisual recognition of impaired speech, were also presented. These experiments showed that a visual front end can be designed that successfully captures speaker-independent, large-vocabulary continuous speech information. Such a visual front end uses discrete cosine transform coefficients of the detected mouth region of interest, suitably post-processed. Combining the resulting visual features with traditional acoustic ones results in significant improvements over audio-only recognition in both clean and, of course, degraded acoustic conditions, across small- and large-vocabulary tasks, as well as for both normal and impaired speech. Successful combination techniques include the multi-stream HMM-based decision fusion approach and the simpler, but inferior, discriminant feature fusion (HiLDA) method.
This chapter clearly demonstrates that, over the past twenty-five years, much progress has been made in capturing and integrating visual speech information into automatic speech recognition. However, the visual modality has yet to be utilized in mainstream ASR systems. This is due to the fact that both practical and research issues remain challenging. On the practical side, the need for high-quality captured visual data, necessary for extracting visual speech information capable of enhancing ASR performance, introduces increased cost, storage, and computer processing requirements. In addition, the lack of common, large audiovisual corpora that address a wide variety of ASR tasks, conditions, and environments hinders the development of audiovisual systems suitable for use in particular applications.
On the research side, the key issues in the design of audiovisual ASR systems remain open and subject to further investigation. In the visual front end design, for example, face detection, facial feature localization, and face shape tracking that are robust to speaker, pose, lighting, and environment variation constitute challenging problems. A comprehensive comparison between face appearance- and shape-based features for speaker-dependent versus speaker-independent automatic speechreading is also unavailable. Joint shape and appearance three-dimensional face modeling, used for both tracking and visual feature extraction, has not been considered in the literature, although such an approach could possibly lead to the desired robustness and generality of the visual front end, successfully addressing challenging visual conditions in realistic environments, such as the automobile cabin. In addition, when combining audio and visual information, a number of issues relevant to decision fusion require further study. These include the optimal level of integrating the audio and visual log-likelihoods, the optimal function for this integration, and the inclusion of suitable, local estimates of the reliability of each modality into this function.
Further investigation of these issues is clearly warranted, and it is expected to lead to improved robustness and performance of audiovisual ASR. Progress in addressing some or all of these questions can also benefit other areas where joint audio and visual speech processing is suitable (Chen and Rao Reference Chen and Rao1998; Aleksic et al.Reference Aleksic, Potamianos and Katsaggelos2005). Such are for example: speaker identification and verification (Jourlin et al.Reference Jourlin, Luettin, Genoud and Wassner1997; Wark and Sridharan Reference Wark and Sridharan1998; Fröba et al.Reference Fröba, Küblbeck, Rothe and Plankensteiner1999; Jain et al.Reference Jain, Bolle, Pankanti, Jain, Bolle and Pankanti1999; Maison et al.Reference Maison, Neti and Senior1999; Chibelushi et al.Reference Chibelushi, Deravi and Mason2002; Zhang et al.Reference Zhang, Broun, Mersereau and Clements2002; Aleksic and Katsaggelos Reference Aleksic and Katsaggelos2003; Chaudhari et al.Reference Chaudhari, Ramaswamy, Potamianos and Neti2003; Sanderson and Paliwal Reference Sanderson and Paliwal2004; Aleksic and Katsaggelos Reference Aleksic and Katsaggelos2006); visual speech synthesis (Cohen and Massaro Reference Cohen and Massaro1994b; Chen et al.Reference Chen, Graf and Wang1995; Yamamoto et al.Reference Yamamoto, Nakamura and Shikano1998; Cosatto et al.Reference Cosatto, Potamianos and Graf2000; Choi et al.Reference Choi, Luo and Hwang2001; Bailly et al.Reference Bailly, Bérar, Elisei and Odisio2003; Aleksic and Katsaggelos Reference Aleksic and Katsaggelos2004b; Fu et al.Reference Fu, Gutierrez-Osuna, Esposito, Kakumanu and Garcia2005; Melenchón et al.Reference Melenchón, Martínez, De La Torre and Montero2009; Tao et al.Reference Tao, Xin and Yin2009); speech intent detection (De Cuetos et al.Reference De Cuetos, Neti and Senior2000); speech activity detection (Libal et al.Reference Libal, Connell, Potamianos and Marcheret2007; Rivet et al.Reference Rivet, Girin and Jutten2007); speech synchrony detection (Iyengar et al.Reference Iyengar, Nock and Neti2003; Bredin and Chollet Reference Bredin and Chollet2007; Sargin et al.Reference Sargin, Yemez, Erzin and Tekalp2007; Kumar et al.Reference Kumar, Potamianos, Navratil, Marcheret, Libal, Bhanu and Govindaraju2010); speech enhancement (Girin et al.Reference Girin, Schwartz and Feng2001b; Deligne et al.Reference Deligne, Potamianos and Neti2002; Goecke et al.Reference Goecke, Potamianos and Neti2002); speech coding (Foucher et al.Reference Foucher, Girin and Feng1998; Girin Reference Girin2004); speech inversion (Yehia et al.Reference Yehia, Rubin and Vatikiotis-Bateson1998; Jiang et al.Reference Jiang, Alwan, Keating, Auer and Bernstein2002; Kjellström et al.Reference Kjellström, Engwall and Bälter2006; Katsamanis et al.Reference Katsamanis, Papandreou and Maragos2009); speech separation (Girin et al.Reference Girin, Allard and Schwartz2001a; Sodoyer et al.Reference Sodoyer, Girin, Jutten and Schwartz2004); speaker localization (Bub et al.Reference Bub, Hunke and Waibel1995; Wang and Brandstein Reference Wang and Brandstein1999; Zotkin et al.Reference Zotkin, Duraiswami and Davis2002); emotion recognition (Cohen et al.Reference Cohen, Sebe, Garg, Chen and Huang2003); and video indexing and retrieval (Huang et al.Reference Huang, Liu, Wang, Chen and Wong1999). Improvements in these technologies are expected to result in more robust and natural human–computer interaction.
9.7 Acknowledgments
The authors would like to state that they are solely responsible for the content of this chapter. The views and opinions of the authors expressed herein do not necessarily reflect those of their affiliations.
The authors would like to acknowledge a number of people for particular contributions to this work: Giridharan Iyengar and Andrew Senior (IBM) for their help with face and mouth region detection on the IBM ViaVoiceTM and other audiovisual data discussed in this chapter; Rich Wilkins and Eric Helmuth (formerly with IBM) for their efforts in data collection; Guillaume Gravier (currently at IRISA/INRIA Rennes) for the joint multi-stream HMM training and full decoding on the connected digits and LVCSR tasks; Roland Goecke (currently at the Australian National University) for experiments on audiovisual-based enhancement of audio features during a summer internship at IBM; Hervé Glotin (ERSS-CNRS) and Dimitra Vergyri (SRI) for their work during the summer 2000 Johns Hopkins workshop on utterance-dependent multi-stream exponent estimation based on speech voicing, and late audiovisual fusion within the discriminative model combination framework, respectively; and the remaining summer workshop student team members for invaluable help.
Gerasimos Potamianos would also like to acknowledge partial support of the European Commission through FP7-PEOPLE-2009-RG-247948 grant AVISPIRE during the revision of this chapter.