
9 - Audiovisual automatic speech recognition

Published online by Cambridge University Press: 05 May 2012

Gérard Bailly, Université de Grenoble
Pascal Perrier, Université de Grenoble
Eric Vatikiotis-Bateson, University of British Columbia, Vancouver

Figure 9.1 The main processing blocks of an audiovisual automatic speech recognizer. Compared with traditional, audio-only ASR, the visual front end design and the audiovisual fusion modules introduce additional challenging tasks; both are discussed in detail in this chapter.

Figure 9.2 Region-of-interest extraction examples. Upper rows: Example video frames of eight subjects from the IBM ViaVoice™ audiovisual database (described below), with superimposed facial features, detected by the algorithm of Senior (1999). Lower row: Corresponding mouth regions-of-interest, extracted as in Potamianos et al. (2001b). © 1999 and 2001 IEEE.
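
The chapter's region-of-interest pipeline relies on the face detector of Senior (1999). As a rough illustration of the idea only, the sketch below uses OpenCV's stock Haar cascade as a stand-in detector and takes the lower part of the face box as a crude mouth region; the ROI size and offsets are illustrative assumptions, not the original system.

```python
# Rough stand-in for mouth region-of-interest extraction (Figure 9.2).
# Uses OpenCV's stock Haar cascade, not the Senior (1999) face detector.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_mouth_roi(frame_bgr, roi_size=(64, 64)):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detected face
    # Take the lower third of the face box as a crude mouth region.
    mouth = gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    return cv2.resize(mouth, roi_size)
```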

Figure 9.3 Examples of lip contour estimation by means of active shape models (Luettin et al. 1996). Depicted mouth regions are from the Tulips1 audiovisual database (Movellan and Chadderdon 1996), and they were extracted prior to lip contour estimation. Reprinted from Computer Vision and Image Understanding, 65:2, Luettin and Thacker, Speechreading using probabilistic models, 163–178, © 1997, with permission from Elsevier.

Figure 9.4 Geometric feature approach. Top: Reconstruction of an estimated outer lip contour from 1, 2, 3, and 20 sets of its Fourier coefficients. Bottom: Three geometric visual features, displayed on a normalized scale, tracked over the spoken utterance “81926” of the connected digits database of Potamianos et al. (1998). Lip contours are estimated as in Graf et al. (1997). © 1997 and 1998 IEEE.
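
The two ingredients of this geometric approach can be sketched as follows: reconstructing a closed lip contour from a truncated set of Fourier coefficients, and deriving a few geometric measurements (mouth width, height, area) from the tracked contour. This is an illustrative sketch, not the Graf et al. (1997) or Potamianos et al. (1998) implementation.

```python
# Illustrative sketch: truncated Fourier description of a closed lip contour
# (top panel of Figure 9.4) and three simple geometric features (bottom panel).
import numpy as np

def truncated_fourier(contour_xy, n_sets):
    """Keep only the lowest-order Fourier coefficient pairs of a closed contour."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # complex point representation
    coeffs = np.fft.fft(z)
    kept = np.zeros_like(coeffs)
    kept[:n_sets] = coeffs[:n_sets]                # DC and low positive frequencies
    kept[-n_sets:] = coeffs[-n_sets:]              # matching negative frequencies
    return kept

def reconstruct_contour(coeffs):
    z = np.fft.ifft(coeffs)
    return np.stack([z.real, z.imag], axis=1)

def geometric_features(contour_xy):
    """Mouth width, height, and enclosed area (shoelace formula)."""
    x, y = contour_xy[:, 0], contour_xy[:, 1]
    width, height = x.max() - x.min(), y.max() - y.min()
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    return width, height, area
```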

Figure 9.5 Statistical shape model. The top four modes are plotted (left-to-right) at ±3 standard deviations around the mean. These four modes describe 65% of the variance of the training set, which consists of 4072 labeled images from the IBM ViaVoice™ audiovisual database (Neti et al. 2000; Matthews et al. 2001). © 2000 and 2001 IEEE.
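
A statistical shape model of this kind is, at its core, PCA on aligned landmark vectors, with each retained mode visualized at plus or minus three standard deviations around the mean. The sketch below shows only that mechanism; the dimensions and number of modes are illustrative, not the actual model trained on the 4072 labeled images.

```python
# Sketch of a point-distribution (statistical shape) model as in Figure 9.5.
import numpy as np

def fit_shape_model(shapes, n_modes=4):
    """shapes: (num_examples, 2 * num_landmarks) array of aligned landmark vectors."""
    mean = shapes.mean(axis=0)
    _, s, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    variances = s ** 2 / (len(shapes) - 1)                 # per-mode variance
    explained = variances[:n_modes].sum() / variances.sum()
    return mean, vt[:n_modes], np.sqrt(variances[:n_modes]), explained

def shape_along_mode(mean, modes, stds, mode_idx, k_sigma=3.0):
    """Synthesize the shape at +/- k_sigma standard deviations along one mode."""
    return mean + k_sigma * stds[mode_idx] * modes[mode_idx]
```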

Figure 9.6 Combined shape and appearance statistical model. Center row: Mean shape and appearance. Top row: Mean shape and appearance +3 standard deviations. Bottom row: Mean shape and appearance −3 standard deviations. The top four modes, depicted left-to-right, describe 46% of the combined shape and appearance variance of 4072 labeled images from the IBM ViaVoice™ audiovisual database (Neti et al. 2000; Matthews et al. 2001). © 2000 and 2001 IEEE.
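
In the usual formulation of such a combined model, per-example shape and texture PCA coefficients are concatenated, with the shape part scaled to balance units, and a second PCA is run on the joint vector. The sketch below follows that standard active appearance model construction and is not necessarily the exact procedure used in the chapter; the shape weighting and number of modes are assumptions.

```python
# Sketch of a combined shape-and-appearance model (Figure 9.6).
import numpy as np

def combined_model(shape_params, texture_params, shape_weight=1.0, n_modes=4):
    """shape_params: (N, Ks), texture_params: (N, Kt) per-example PCA coefficients."""
    joint = np.concatenate([shape_weight * shape_params, texture_params], axis=1)
    mean = joint.mean(axis=0)
    _, s, vt = np.linalg.svd(joint - mean, full_matrices=False)
    explained = (s[:n_modes] ** 2).sum() / (s ** 2).sum()   # variance captured
    return mean, vt[:n_modes], explained
```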

Figure 9.7 DCT- versus AAM-based visual feature extraction for automatic speechreading, followed by visual feature post-extraction processing using linear interpolation, feature mean normalization, adjacent frame feature concatenation, and the application of LDA and MLLT. Vector dimensions as implemented in the system of Neti et al. (2000) are depicted.
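
Two of the steps in this pipeline, DCT-based appearance features and part of the post-extraction processing (feature mean normalization and adjacent-frame concatenation), can be sketched as below. The LDA and MLLT projections and the exact vector dimensions of the Neti et al. (2000) system are omitted; the coefficient count and context window here are illustrative.

```python
# Sketch of DCT visual features plus part of the post-processing of Figure 9.7.
import numpy as np
from scipy.fft import dctn

def dct_features(roi, n_coeffs=24):
    """2-D DCT of a grayscale mouth ROI; keep the lowest-order coefficients
    (a crude substitute for zig-zag selection)."""
    c = dctn(roi.astype(float), norm="ortho")
    return c[:8, :8].ravel()[:n_coeffs]

def post_process(feats, context=7):
    """feats: (T, D). Mean-normalize over the utterance, then concatenate
    each frame with its +/- context neighbouring frames."""
    feats = feats - feats.mean(axis=0, keepdims=True)
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.asarray([padded[t : t + 2 * context + 1].ravel()
                       for t in range(len(feats))])
```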

Table 9.1 Taxonomy of the audiovisual integration methods considered in this section. Three feature-fusion techniques that differ in the features used for recognition and three decision-fusion methods that differ in the combination stage of the audio and visual classifiers are described in more detail in this chapter.

Table 9.2 The mapping from forty-four phonemes to thirteen visemes considered by Neti et al. (2000), using the HTK phone set (Young et al. 1999).
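
Mechanically, such a mapping is a many-to-one lookup that collapses phone labels into viseme classes before training or scoring visual models. The fragment below shows only that mechanism with a few well-known groupings (bilabials, labiodentals); the actual forty-four-to-thirteen assignment is the one given in Table 9.2 and is not reproduced here.

```python
# Illustrative phone-to-viseme lookup; only a few standard groupings are shown.
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    # ...remaining phones map onto the other viseme classes of Table 9.2
}

def to_visemes(phones):
    """Collapse a phone label sequence into its viseme class sequence."""
    return [PHONE_TO_VISEME.get(p, "other") for p in phones]
```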

Figure 9.8 Three types of feature fusion considered in this section: Plain audiovisual feature concatenation (AV-Concat), hierarchical discriminant feature extraction (AV-HiLDA), and audiovisual speech enhancement (AV-Enh).
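
The first two schemes can be sketched directly: plain concatenation stacks the synchronized audio and visual feature vectors frame by frame, and HiLDA-style fusion then projects the concatenated vector onto a lower-dimensional discriminant space. The sketch below uses scikit-learn's LDA as a stand-in for the chapter's LDA/MLLT transform; the target dimensionality and the frame labels (e.g., sub-phonetic states) are assumptions.

```python
# Sketch of AV-Concat and a HiLDA-style discriminant projection (Figure 9.8).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def av_concat(audio_feats, visual_feats):
    """Frame-synchronous concatenation of (T, D_a) and (T, D_v) feature streams."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

def hilda_fusion(audio_feats, visual_feats, frame_labels, n_dims=40):
    """Discriminant projection of the concatenated features; requires more than
    n_dims distinct classes in frame_labels (e.g., sub-phonetic state labels)."""
    lda = LinearDiscriminantAnalysis(n_components=n_dims)
    return lda.fit_transform(av_concat(audio_feats, visual_feats), frame_labels)
```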

Figure 9.9 Left: Phone-synchronous (state-asynchronous) multi-stream HMM with three states per phone in each modality. Right: Its equivalent product (composite) HMM; black circles denote states that are removed when limiting the degree of within-phone allowed asynchrony to one state. The single-stream emission probabilities are tied for states along the same row (column) to the corresponding audio (visual) state probabilities.
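
In the multi-stream formulation, each state's audiovisual emission score is the product of the per-stream likelihoods raised to stream exponents, i.e. a weighted sum in the log domain. The sketch below shows that combination; the weight values are illustrative, and in practice they are tuned or made dependent on the audio conditions.

```python
# Stream-weighted emission score of a multi-stream HMM state (Figure 9.9):
# log b_j(o_av) = lambda_a * log b_j(o_a) + lambda_v * log b_j(o_v).
def multistream_log_emission(log_b_audio, log_b_visual, lambda_a=0.7, lambda_v=0.3):
    """Combine per-stream log-likelihoods with exponents that typically sum to one."""
    return lambda_a * log_b_audio + lambda_v * log_b_visual
```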

Figure 9.10 Example video frames of 10 subjects from the IBM ViaVoice™ audiovisual database. The database contains approximately 50 hours of continuous, dictation-style audiovisual speech by 290 subjects, collected with minor variations in face pose, lighting, and background (Neti et al. 2000).

Figure 9.11 The audiovisual ASR system employed in some of the experiments reported in this chapter. Compared with the baseline system used during the Johns Hopkins summer 2000 workshop, a larger mouth ROI is extracted, within-frame discriminant features are used, and a longer temporal window is considered in the visual front end (compare to Figure 9.7). HiLDA feature fusion is employed.

Table 9.3 The IBM audiovisual databases discussed and used in the experiments reported in this chapter. Their partitioning into training, held-out, adaptation, and test sets is depicted (duration in hours and number of subjects are shown for each set). Both large-vocabulary continuous speech (LVCSR) and connected digit (DIGITS) recognition are considered for normal as well as impaired speech. The IBM ViaVoice™ database corresponds to the LVCSR task in the normal speech condition. For the normal speech DIGITS task, the held-out and adaptation sets are identical. For impaired speech, due to the lack of sufficient training data, adaptation of HMMs trained in the normal speech condition is considered.

Table 9.4 Comparisons of recognition performance based on various visual features (three appearance-based features, and one joint shape and appearance feature representation) for speaker-independent LVCSR (Neti et al. 2000; Matthews et al. 2001). Word error rate (WER), %, is depicted on a subset of the IBM ViaVoice™ database test set of Table 9.3. Visual performance is obtained after the rescoring of lattices that had been previously generated based on noisy (8.5 dB SNR) audio-only MFCC features. For comparison, characteristic lattice WERs are also depicted (oracle, anti-oracle, and best path based on language model scores alone). Among the visual speech representations considered, the DCT-based features are superior and contain significant speech information.

Table 9.5 Test-set speaker-independent LVCSR audio-only and audiovisual WER (%), for the clean (19.5 dB SNR) and a noisy (8.5 dB SNR) audio condition. Two feature fusion- and five decision fusion-based audiovisual systems are evaluated using the lattice rescoring paradigm (Neti et al. 2000; Glotin et al. 2001; Luettin et al. 2001).

Figure 9.12 Comparison of audio-only and audiovisual ASR by means of three feature fusion (AV-Concat, AV-HiLDA, and AV-Enhanced) algorithms and one decision fusion (AV-MS-Joint) technique, using the full decoding experimental paradigm. WERs vs. audio channel SNR are reported on both the IBM ViaVoice™ test set (speaker-independent LVCSR – top) and the multi-speaker DIGITS test set (bottom) of Table 9.3. HiLDA feature fusion outperforms alternative feature fusion methods, whereas decision fusion outperforms all three feature fusion approaches, resulting in an effective SNR gain of 7 dB for LVCSR and 7.5 dB for DIGITS, at 10 dB SNR (Potamianos et al. 2001c; Goecke et al. 2002; Gravier et al. 2002a). Notice that the WER ranges in the two graphs differ.

Table 9.6 Adaptation results on the speech-impaired data. WER, %, of the audio-only (AU), visual-only (VI), and audiovisual (AV) modalities, using HiLDA feature fusion, is reported on both the LVCSR (left part of the table) and DIGITS (right part) test sets of the speech-impaired data, using unadapted HMMs (trained on normal speech) as well as a number of HMM adaptation methods. All HMMs are adapted on the joint speech-impaired LVCSR and DIGITS adaptation sets of Table 9.3. For the continuous speech results, decoding using the test-set vocabulary of 537 words is reported. MAP followed by MLLR adaptation, possibly preceded by front end matrix adaptation (Mat), achieves the best results for all modalities and both tasks considered (Potamianos and Neti 2001a).
