10 Image-based facial synthesis
10.1 Facial synthesis approaches
There are many ways to organize a discussion of facial synthesis: by quality, by computational efficiency, or by particular geometric representations. In this work we organize it around a trade-off between smart algorithms and lots of data.
A conventional approach to synthesizing a face is to model it as a three-dimensional computer graphics object. This approach has been used for nearly thirty years and researchers now understand many aspects of the human face. Given an audio signal we know the desired shape of the mouth, how muscles move, how skin stretches, and how light reflects off the skin (Terzopoulos and Waters 1993). Yet with all this knowledge the results using computer graphics approaches are not 100 percent realistic. This observation is not meant as a criticism of previous work, but instead should be considered an indication of the full richness and subtlety of human behavior and perception.
Recently many researchers have advocated an approach based on simple algorithms, but lots of data. This new approach gets its realism from a large collection of image data, reorganizing the images to synthesize new audiovisual utterances. Oversimplifying, an image-based approach knows nothing about how muscles move, or any other properties of a face. Instead the system learns that when the natural face says “pa,” the video pixel one inch below the nose changes from pink (lip-colored) to white (teeth). In a sense, we reduce the face synthesis problem to a simple database problem.
In practice, image-based work is not so extreme. A little bit of knowledge about faces goes a long way. We can use this knowledge to reduce the size of the dataset we need to collect and, more importantly, to help synthesize utterances or head poses we have not seen before.

Figure 10.1 The range of options on the knowledge- to data-based axis of facial synthesis methods.
The continuum between knowledge-based and image-based synthesis is shown in Figure 10.1. On the left, the smart algorithms encode knowledge about how faces move and reflect light in computer algorithms. Their advantage is that they can easily generate images of new people, new lighting conditions, or even different animals. On the right, the image-based approaches have the potential to create super-realistic images. There are uses for both extremes, but practical systems will probably fall somewhere in the middle (Cohen and Massaro 1990).
It is important to note that this dichotomy between knowledge- and data-based algorithms is common in computer science. People doing speech recognition used to build a hierarchy of specialized recognizers (Cole et al. 1983) but this approach was quickly surpassed by hidden Markov models (HMMs) that learn the probabilities of speech from large collections of speech data (Jelinek 1998). Music synthesis in the 1970s was done using frequency modulation (FM) techniques that synthesize many interesting sounds when given the right parameters. Now musicians use wave tables to synthesize musical sounds – if you want a piano note then record it once and play it back every time you need that note. The polygons of the computer graphics world are often replaced by image-based rendering techniques such as lightfields (Levoy and Hanrahan 1996). Finally, the original work on text-to-speech (TTS) calculated the formant frequencies for each phone and synthesized speech by rule (Carlson and Granström 1976; Klatt 1979). The very highest quality results are now generated by collecting hours of speech data, chopping the waveform into collections of phonemes, and rearranging and concatenating them to produce the final results. This approach, known as concatenative synthesis (Hunt and Black 1996), is closest in spirit to our image-based techniques.
While concatenative synthesis systems for speech and music domains were pioneered in the early 1990s (Moulines et al. 1990), such example-based techniques were first introduced to animation and video synthesis in 1997 (Bregler et al. 1997b), later refined by Ezzat et al. (2002b), and extended to motion capture animation (Arikan and Forsyth 2002; Kovar et al. 2002; Pullen and Bregler 2002; Reitsma and Pollard 2007). Another line of research close in spirit to this philosophy was introduced by Efros and Leung (1999) for image texture synthesis, and by Schoedl et al. (2000) for video texture synthesis. More recent research combines acoustic and facial speech with other motion modalities, like facial expression, head motions, and body gestures (Stone et al. 2004; Chuang and Bregler 2005; Bregler et al. 2009; Williams et al. 2010).
Specific to human motion and facial speech synthesis, all of these data-based approaches use a training-time analysis and normalization step to create a database. Most systems do the bulk of the work when building the database so that the synthesis step is as easy as possible. Given a particular task, the synthesis stage finds the appropriate data in the database, warps it to fit the desired scene, and outputs the final pixels.
We will use the Video Rewrite system (Bregler et al. 1997b) to demonstrate the basic ideas of image-based facial synthesis. Section 10.2 gives an overview of image-based facial synthesis and Video Rewrite in particular. Section 10.3 describes the analysis stage of Video Rewrite and Section 10.4 describes the synthesis stage. Section 10.5 compares and contrasts two other approaches – based on static images and Markov models – for facial synthesis.
10.2 Image-based facial synthesis
Video Rewrite uses a training database of video of a person speaking naturally. To synthesize talking faces it rearranges the contents of the video database to fit the new words. To start, the analysis stage of Video Rewrite uses a conventional speech recognition system to segment the speech, and computer vision techniques to find the face and the exact locations of the mouth and jaw line. In the synthesis stage, the same speech recognition technology segments the new utterances, and then an image-morphing step warps the database images to fit the new words. A separate piece of video, called the background video or background sequence, provides the rest of the face and the overall head movements. The lip and jaw sequences from the database are inserted into the background sequence. This is shown in Figure 10.2.
Figure 10.2 The Video Rewrite synthesis system. Speech is recognized and triphone visemes from a database are found. A separate background video is used to provide the head movements and the rest of the face. The database images are transformed and inserted into the background video to form the final video.
In a strict image-based approach we can synthesize only the conditions we have already seen in the training data. Under controlled lighting conditions, we can capture the reflectivity of a simple object from every possible angle (Levoy and Hanrahan 1996) but this is not possible with a talking face. Instead, we look for ways to simplify the problem.
The most important factor when synthesizing talking faces is coarticulation. The shape of the mouth depends primarily on the acoustic phones before and after the current sound. As we say one sound our lips are still moving from the shape of the previous sound and starting to move towards the shape of the next sound. It would be simpler to store one prototypical image per phoneme, but working with triphones is not difficult. In 8 minutes of training video, we found 1700 different triphones, out of the more than 19 000 naturally occurring triphone sequences (if we do not have exactly the right triphone, we use the triphone that is closest visually). We found it was important to store prototypes of many different triphones, and to choose the triphone that most closely matched our synthesis needs. Three examples of triphones from a Video Rewrite database are shown in Figure 10.3.
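The triphone bookkeeping above reduces to a sliding window over the phone string; the phone labels and the example word below are illustrative, not taken from the Video Rewrite data.

```python
from collections import Counter

def triphones(phones):
    """Return the overlapping three-phone windows of a phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

# Toy phone sequence for the word "teapot": /T IY P AA T/ (illustrative labels).
phones = ["T", "IY", "P", "AA", "T"]
counts = Counter(triphones(phones))  # counts each triphone, e.g. (T, IY, P)
```

Counting the distinct windows over a whole training corpus this way is how one measures triphone coverage against the set of naturally occurring sequences.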
Figure 10.3 The effects of coarticulation. Frames from our training data showing three different triphones show the wide variation in mouth positions, even for the same phones.
Other factors we need to consider are listed below.
Head pose: We want our subjects to speak and move naturally, but an image of the mouth changes as the head rotates and tilts. Video Rewrite adjusts for small changes in head pose with a planar model, and for larger changes by modeling the face as an ellipsoid. Even with a sophisticated ellipsoidal model, however, an image-based approach can never synthesize the profile of a head if all our data are frontal images. The planar and elliptical models for compensating for head pose are described in Section 10.3.1.
Lighting: We assume without too much error that the reflectivity of skin does not change over the range of allowable head poses. But lighting changes are another matter. Some lighting changes are corrected by fitting a simple illumination model to the data, but if one wants to synthesize a face at a beach and in a nightclub then the training dataset will need images in daylight and in a dark room with the appropriate lights. A lighting-correction technique is described in Section 10.4.3.
Emotional content: There is data indicating that we can hear a smile (Ohala 1994). We can certainly identify a smiling speaker, but we don’t have a good model of how the emotional state, as conveyed by an auditory signal, maps into facial positions. It is important that the two signals are consistent since a viewer can identify how much smile is present in both the audio and the visual signals.
Utterance length: The way that we speak changes as we change our rate of speech. Most speakers slur their words as they speak faster. Thus the word “pat” will look different when spoken slowly and carefully, compared to when it is spoken rapidly. We can adjust for this effect by using phone sequences from our database that are close in length to the new audio. Otherwise, we can change the video playback speed to fit the new audio.
Eyebrows: Video Rewrite uses a background face to provide most of the facial image. This includes the eyes and the eyebrows. There is evidence that a speaker’s eyebrows are used in concert with their voice to convey a message, but there is no straightforward mapping between acoustics and eyebrow locations. One possible correlation is with the speaker’s pitch (Ohala 1994).
In all cases, an image-based solution combines multiple views of the speaker with algorithms that can modify the video to fit the need. Any one image in the training data comes with a coarticulation context, a head pose, a lighting model, and an emotional state. The art in image-based facial synthesis is trading off the different features to choose the best input sequence to modify and concatenate. Fortunately, collecting a large database minimizes all these problems.
The importance of this trade-off was made clear to us when we started working with John F. Kennedy footage from the Cuban missile crisis. This footage was shot before the advent of teleprompters: Kennedy spent half the time looking directly into the camera, and half the time looking down at his notes. Moving from an affine model of the face to an elliptical model helped improve the realism of the Video Rewrite results. But in the end we avoided warping a downward-looking image to a full frontal view unless we really had nothing close to the right viseme sequence in the desired pose. In effect, for any one desired pose we could use only half our training data.
We describe image-based facial synthesis in terms of a video database where images corresponding to small chunks of speech are first normalized so they appear in a standardized form and are stored for easy retrieval. But this is not how we implemented Video Rewrite. Instead, Video Rewrite labels the original video so that it can easily find the appropriate sections of video. There might be ten instances of the word “Cuba” but each will have a slightly different pose, length, and context. We improve the synthesis results by choosing a training example from the original video that has the closest context.
There are two reasons we get better results by leaving the video in its original form. First, we improve the image quality by storing just the parameters that normalize each frame. Later, when we need to denormalize by a different transform (to insert the mouth pixels into a new background face), we simply multiply the transforms. Thus instead of doing two image transforms, each with its own image interpolation stage, we can combine the transforms mathematically and do just one interpolation step. Second, by leaving the video in its original form we can more easily choose long sequences of phonemes if they happen to occur in the training set in the right order. The minimum sequence of phones used by Video Rewrite is three phones, or a triphone, but we improve our synthesis quality by using the longest sequence we can find in our database.
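The "multiply the transforms" point can be made concrete with homogeneous 3 × 3 affine matrices: composing the normalizing transform with the inverse of the background transform gives a single matrix, so the pixels are interpolated only once. The parameter values below are illustrative, not from Video Rewrite.

```python
import numpy as np

def affine(a1, a2, a3, a4, a5, a6):
    """Homogeneous matrix for the warp u = a1 + a2*x + a3*y, v = a4 + a5*x + a6*y."""
    return np.array([[a2, a3, a1],
                     [a5, a6, a4],
                     [0.0, 0.0, 1.0]])

# N normalizes a database frame into the canonical pose; B normalizes the
# background frame the mouth will be pasted into (made-up parameter values).
N = affine(1.0, 0.98, 0.02, -2.0, -0.01, 1.01)
B = affine(0.5, 1.02, 0.00, 1.0, 0.03, 0.99)

# Database frame -> canonical pose -> background frame, composed into a single
# matrix so that only one image-interpolation pass is needed.
M = np.linalg.inv(B) @ N

p = np.array([10.0, 20.0, 1.0])  # a pixel in homogeneous coordinates
assert np.allclose(M @ p, np.linalg.inv(B) @ (N @ p))
```

Applying `M` once moves a database pixel directly into the background frame, rather than resampling the image twice.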
10.3 Analyses and normalization
Video Rewrite analyzed the training data to find the location of each segment of speech, and the location and pose of each facial feature. This section describes the analysis Video Rewrite performed on the audio and video data.
Video Rewrite used an HMM speech recognition system (Jelinek 1998) to segment the audio signal. Video Rewrite trained two gender-dependent recognition models using the TIMIT database. If we know the word sequence, segmentation into phonemes is easy: given the word sequence, the recognizer can look up the words in the dictionary and find the expanded phoneme sequence. The recognizer then fits the known phone sequence to the audio data and returns the segment boundaries. This is a comparatively easy task for a speech recognition system.
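The dictionary expansion step amounts to a lookup and concatenation; the two-entry dictionary below is a made-up stand-in for a real pronunciation lexicon such as the one that ships with TIMIT.

```python
# A toy pronunciation dictionary (illustrative entries, TIMIT-style labels).
DICT = {
    "the": ["DH", "AH"],
    "teapot": ["T", "IY", "P", "AA", "T"],
}

def expand(words):
    """Concatenate the dictionary pronunciation of each word in sequence."""
    phones = []
    for w in words:
        phones.extend(DICT[w.lower()])
    return phones

# expand(["the", "teapot"]) gives the phone string the recognizer aligns
# against the audio to recover segment boundaries.
```

The recognizer's remaining job is only to time-align this known phone string against the waveform.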
Video Rewrite knew the approximate location of the head in each video sequence but if this information is not available, a face-finding algorithm (Rowley et al. 1998) could be used to get the approximate location. Depending on the amount of out-of-plane motion, Video Rewrite used either an affine transform or an elliptical model to transform each facial image into the canonical pose. Any image can be chosen as the canonical pose, but it is best to use a median pose so that the average pose correction is small.
10.3.1 Pose estimation
For relatively static facial sequences, Video Rewrite used an affine warp to transform each face image in the training set into the canonical form. This transform not only aligns the size, position, and rotation of each facial image, but can also correct some small out-of-plane tilts of the head.
This step is critically important to Video Rewrite’s success. Early in the development process the isolated mouth sequences looked realistic, but they looked horrible when inserted into a face. At that point we were inserting the new mouth image into the background at a location determined by the pose estimate, but rounded to the nearest integer pixel location. Our results improved dramatically when we interpolated the mouth images to place them exactly. Evidently, viewers are so sensitive to the position of the mouth on the face that a single pixel of jitter was enough to destroy the illusion of a realistic talking face.
Accurately locating the new mouth images on a face is difficult because the true measure of success is how well the teeth are fixed on the skull. We occasionally see the teeth in the training video, but the skull is harder to see. Instead, Video Rewrite looks at portions of the face that are relatively stable and can provide a good estimate of the underlying skull location. These portions of the face are indicated with a mask, which multiplies the facial image. This mask avoids several sections of the face that give unreliable estimates of facial position, including the mouth (it is always moving), the nose (there is specular reflection from the tip, and the nose is the biggest discrepancy from planar or elliptical models of the face), and the eyes (they blink and move). The mask and typical images are shown in Figure 10.4.
Figure 10.4 The masked portion of the face shown at the top is a reference image used to find the head pose. A white rectangle is superimposed on three facial images to show the estimated affine warp that best matches the reference image.
Video Rewrite used an affine tracker (Bergen et al. 1992) or an elliptical model of the face (Basu et al. 1996) to normalize each image in the training data and the background video. The affine warp can exactly compensate for translations and in-plane rotations. The compensation for out-of-plane rotations, such as tilting the head forward, is approximate – compressing the y-axis approximates small amounts of tilt. The affine tracker is less computationally expensive, but does not work as well with out-of-plane rotations of the head. Both algorithms find a transform that warps each image so that it closely matches a reference image.
10.3.1.1 Affine model
An affine tracker (Bergen et al. 1992) adjusts the parameters of the equations

u(x, y) = a1 + a2·x + a3·y
v(x, y) = a4 + a5·x + a6·y

so that a warped image, It(u(x, y), v(x, y)), at time t matches as closely as possible the reference image I0(x, y), with the affine warp defined by the a parameters. Using vector notation, this is written

[u(U), v(U)]^T = A(U)·a

where U = [x y]^T, a = [a1, a2, a3, a4, a5, a6]^T and

A(U) = [ 1 x y 0 0 0
         0 0 0 1 x y ]

We can use Gauss–Newton iteration to find a solution for a in terms of small steps δa. The change in a at each iteration step is given by

δa = [ Σ (A^T·∇I)(A^T·∇I)^T ]^(−1) Σ (A^T·∇I)·ΔI

where ∇I is the 2 × 1 derivative of the image with respect to the pixel location (x, y), and ΔI is a scalar that describes the difference between the first image and the second image warped using the current estimate of a.
For both computational efficiency and to avoid local minima in the optimization, this optimization step is often combined with hierarchical processing. At the start it is not necessary to work with images at the highest resolution. Instead, the images can be low-pass filtered and subsampled before the iteration described above is performed. Once we converge to a good answer at low resolution we can use this as a starting point on a slightly higher resolution image. An example of the affine model is shown in Figure 10.4.
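The iteration above can be sketched with a small NumPy implementation. The bilinear sampler, the synthetic Gaussian test images, and the single-resolution loop are all illustrative; a real tracker would add the coarse-to-fine pyramid just described.

```python
import numpy as np

def warp(img, a):
    """Sample img at the affinely warped coordinates u, v (bilinear, clamped)."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    u = np.clip(a[0] + a[1] * x + a[2] * y, 0, w - 1.001)
    v = np.clip(a[3] + a[4] * x + a[5] * y, 0, h - 1.001)
    x0, y0 = u.astype(int), v.astype(int)
    fx, fy = u - x0, v - y0
    return (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x0 + 1] * fx * (1 - fy) +
            img[y0 + 1, x0] * (1 - fx) * fy + img[y0 + 1, x0 + 1] * fx * fy)

def estimate_affine(ref, img, iters=20):
    """Gauss-Newton estimate of a = [a1..a6] so that warp(img, a) matches ref."""
    h, w = ref.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    a = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 1.0])  # start from the identity warp
    for _ in range(iters):
        warped = warp(img, a)
        gy, gx = np.gradient(warped)  # image gradient at each pixel
        # Each row of J is (A(U)^T grad I)^T: the residual's derivative w.r.t. a.
        J = np.stack([gx, gx * x, gx * y, gy, gy * x, gy * y], axis=-1).reshape(-1, 6)
        r = (ref - warped).ravel()
        da, *_ = np.linalg.lstsq(J, r, rcond=None)  # least-squares step delta-a
        a = a + da
    return a

# Synthetic test: a Gaussian blob and the same blob shifted one pixel in x.
yy, xx = np.mgrid[0:40, 0:40].astype(float)
ref = np.exp(-((xx - 20) ** 2 + (yy - 20) ** 2) / 60)
img = np.exp(-((xx - 21) ** 2 + (yy - 20) ** 2) / 60)
a = estimate_affine(ref, img)  # recovers roughly u = x + 1, v = y
```

The normal-equations sum in the δa formula is solved here as one least-squares problem over all pixels, which is numerically equivalent.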
10.3.1.2 Elliptical model
An affine image transformation model is a good fit for planar objects that are rotating by small amounts in space, but a human face is not planar. With large rotations, the 3D structure of the face is apparent and a richer model is needed to capture, and normalize, the motion.
Basu and his colleagues proposed an elliptical model of the face (Basu et al. 1996). The model has six parameters, [α, β, γ, tx, ty, tz], where α, β, and γ describe the ellipse’s rotations, and tx, ty, and tz describe its position. We would like to recover the parameters that best explain the 2D image data we are considering.
We start with a reference image, usually showing a frontal view of the face, and then hand-fit a 3D ellipse to the image data. (Basu suggests using face-finding software to locate the face in the image and then starting with the closest example from a pre-adjusted set of ellipses.) The optimization step is straightforward. We want to find the six ellipsoidal parameters that best represent the new image in terms of the reference. A set of parameters describes a 3D ellipse, which has a particular projection onto the image plane. We can take the image points and map them back through the projection transformation to find the brightness of each point on the ellipse; in effect, we color the proposed ellipse with the pixels from the image. When we have the correct transformation parameters, the new image’s data, as mapped onto its ellipse, will agree with the data the reference image maps onto the same ellipse. The parameters of this new ellipse can be optimized using, for example, the simplex method (Press et al. 1995).
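The render-and-compare loop can be sketched in two dimensions. Here a filled disk with two translation parameters stands in for the projected ellipse, and a coarse grid search stands in for the simplex optimizer; all shapes and values are illustrative.

```python
import numpy as np

y, x = np.mgrid[0:40, 0:40].astype(float)

def render(tx, ty, r=8.0):
    """A toy stand-in for the projected ellipse: a filled disk at (tx, ty)."""
    return ((x - tx) ** 2 + (y - ty) ** 2 < r * r).astype(float)

reference = render(20.0, 18.0)  # "observed" image we want to explain

def mismatch(params):
    """Pixel disagreement between the proposed model and the observed image."""
    tx, ty = params
    return np.sum((render(tx, ty) - reference) ** 2)

# Coarse grid search stands in for the simplex method over the parameters.
best = min(((tx, ty) for tx in range(10, 31) for ty in range(10, 31)),
           key=mismatch)
```

In the real system `render` would be the full six-parameter ellipse projection, and `mismatch` would compare the pixels mapped onto the proposed ellipse against those from the reference.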
10.3.2 Feature tracking
In an image-based rendering system, database entries are chosen and then must be blended. Speech synthesis systems often simply concatenate the database entries to form a new speech signal; listeners are not particularly sensitive to spectral discontinuities at unit boundaries. The same is not true for visual motion. Our eyes are excellent feature trackers and discontinuities in position or velocity are quite evident.
Thus an important step of image-based video synthesis is the ability to track features and blend their positions, velocities, and appearance across synthesis units. This is especially important for Video Rewrite since Video Rewrite combines overlapping triphones using image morphing; the location of the features controls the morphing algorithm.
The tracking problem is simplified once we have used the pose-estimation algorithm described in Section 10.3.1 to normalize the size and the position of the face. There are many ways to track features. Video Rewrite used a technique known as EigenPoints (Covell and Bregler 1996) because it is computationally efficient.
EigenPoints models the connection between an image, I, and the x–y coordinates of the features we are tracking as a linear relationship. In mathematical terms

x − x̄ = M·(I − Ī)

where M is the matrix that connects the image data, I, with its mean, Ī, subtracted, to the location of the features, x, with its mean, x̄, subtracted.
EigenPoints finds the linear coupling by using a large training set of images and labeled x–y coordinates of the features. The data from one image and its corresponding coordinates are concatenated to form one row of a new matrix. If the image and coordinate data are scaled so they have similar variance, then a singular-value decomposition (SVD) is used to find the vector directions that best characterize the combined spread of the data. Taking into account the effects of noise, the linear mapping, M, is calculated using equation 9 of Covell and Bregler (1996). The result of this processing is shown in Figure 10.5.
Figure 10.5 EigenPoints is a linear transform that maps image brightness into control point locations. The three images at the bottom show the fiduciary points for the facial images on top.
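A simplified version of this estimation can be sketched as follows. The toy data obey the single-hidden-process assumption, and plain least squares stands in for the noise-aware coupled-SVD estimate of equation 9 in Covell and Bregler (1996); all dimensions and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: one hidden process drives both 16-pixel image patches and
# 2 control-point coordinates, mimicking the EigenPoints assumption.
n, pixels = 200, 16
hidden = rng.normal(size=(n, 1))
I = hidden @ rng.normal(size=(1, pixels)) + 0.01 * rng.normal(size=(n, pixels))
X = hidden @ np.array([[2.0, -1.0]])  # feature coordinates

I0, X0 = I.mean(axis=0), X.mean(axis=0)

# Least-squares estimate of the coupling in x - x0 = M (I - I0); here M is
# stored transposed (pixels x points) for the row-vector convention below.
M, *_ = np.linalg.lstsq(I - I0, X - X0, rcond=None)

patch = I[0]
predicted = (patch - I0) @ M + X0  # control points recovered from pixels alone
```

Once `M` is estimated, locating the fiduciary points in a new frame is a single mean-subtraction and matrix multiplication, which is why the method is so cheap at run time.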
The EigenPoints connection is only valid over a limited range of movement. Underlying the EigenPoint theory is the assumption that the image and the coordinate data are linear functions of a single underlying driving process. Thus given an image patch from the face, with the mean removed, we can find the deviation of the control points by a simple matrix multiplication.
But it is unlikely, for example, that the same process drives the upper lip and the jaw line. In fact, they can move relatively independently of each other. Thus Video Rewrite uses two different EigenPoints models to capture all the dynamics of the mouth and jaw line.
Video Rewrite’s two EigenPoints models label each image in the training video using a total of 54 points: 34 on the mouth (20 on the outer boundary, 12 on the inner boundary, one at the bottom of the upper teeth, and one at the top of the lower teeth) and 20 on the chin and jaw line. The first eigenspace controlled the placement of the 34 fiduciary points on the mouth, using 50 × 40 pixels around the nominal mouth location, a region that covers the mouth completely. The second eigenspace controlled the placement of the 20 fiduciary points on the chin and jaw line, using 100 × 75 pixels around the nominal chin location, a region that covers the upper neck and the lower part of the face.
We created the two EigenPoints models for locating the fiduciary points from a small number of images. We hand-annotated 26 images (of 14 218 images total; about 0.2%). We extended the hand-annotated dataset by morphing pairs of annotated images to form intermediate images, expanding the original 26 to 351 annotated images without any additional manual work. We then derived EigenPoints models using this extended dataset.
Video Rewrite used EigenPoints to find the mouth and jaw, and to label their contours. The derived EigenPoints models located the facial features using 6 basis vectors for the mouth and 6 different vectors for the jaw. EigenPoints then placed the fiduciary points around the feature locations: 32 basis vectors place points around the lips and 64 basis vectors place points around the jaw.
10.4 Synthesis
Synthesis in an image-based facial animation system is straightforward. Given a signal, we need to identify the speech sounds, find the best image sequences in our database, and stitch the results together. The synthesis procedure is diagrammed in Figure 10.6.
Video Rewrite uses a speech recognition system to recognize the new speech and translate the audio into a sequence of phonemes (with their durations). In the sections that follow, we will explain how Video Rewrite uses the background image (Section 10.4.1), how the database visemes are selected (Section 10.4.2), and finally how the visemes are morphed and stitched together (Section 10.4.3).
10.4.1 The background video
Video Rewrite uses a background video to set the stage for the new synthesized mouth. The background video provides images of most of the face – especially the eyes – and natural head movements. In the Video Rewrite examples, the background video came from the training video, so the normalization parameters had already been computed. After the new mouth images are selected, the normalizing parameters for each database image are multiplied by the inverse of the background image transform to find the transformation that maps each database image into the background frame.
We chose a background sequence from the training set where the speaker spoke a sentence with roughly the same length as the new utterance. There are undoubtedly important speech cues in the way that we move our head, eyes, and eyebrows as we speak. We do not know what all these cues are. By starting with real facial images and facial movements, we can show some of these cues, without understanding them.
10.4.2 Selecting visemes from the database
The most difficult part of an image-based synthesis method is selecting the right units to combine. There is always a limited database, and many different constraints to satisfy. Video Rewrite chooses the longest possible utterance from the database that has the right visemes, phoneme lengths, and head pose. The design of this trade-off is complicated by the fact that we have information about how some of the factors affect performance (see, for example, the viseme confusability data in Owens and Blazek 1985), but we have no information about how these factors combine. The problem is more acute because we can partially correct many factors – such as viseme length, head pose, and lighting – but the errors that remain are hard to quantify.
The new speech utterance, as understood by the automatic speech recognition system, determines the target viseme sequence. We would like to find a sequence of triphone videos from our database that matches this new speech utterance. For each triphone in the new utterance, our goal is to find a video example with exactly the transition we need, and with lip shapes that are compatible with the lip shapes in neighboring triphone videos. Since this goal is often not reachable, we compromise by choosing a sequence of clips that approximates the desired transitions and shape continuity. This process is quantified as follows.
Given a triphone in the new speech utterance, Video Rewrite computes a matching distance to each triphone in the video database. The matching metric has two terms: the phoneme-context distance, Dp, and the distance between lip shapes in overlapping visual triphones, Ds. The total error is

D = Dp + α·Ds

where the weight, α, is a constant that trades off the two factors.
The phoneme-context distance, Dp, is based on categorical distances between phoneme categories and between viseme classes. Since Video Rewrite does not need to create a new soundtrack (it needs only a new video track), we can cluster phonemes into viseme classes, based on their visual appearance.
We use twenty-six viseme classes. Ten are consonant classes: (1) /CH/, /JH/, /SH/, /ZH/; (2) /K/, /G/, /N/, /L/; (3) /T/, /D/, /S/, /Z/; (4) /P/, /B/, /M/; (5) /F/, /V/; (6) /TH/, /DH/; (7) /W/, /R/; (8) /HH/; (9) /Y/; and (10) /NG/. Fifteen are vowel classes: one each for /EH/, /EY/, /ER/, /UH/, /AA/, /AO/, /AW/, /AY/, /UW/, /OW/, /OY/, /IY/, /IH/, /AE/, /AH/. One class is for silence, /SIL/.
The phoneme-context distance, Dp, is the weighted sum of phoneme distances between the target phonemes and the video-model phonemes within the context of the triphone. This distance is 0 if the phonemic categories are the same (for example, /P/ and /P/). The distance is 1 if they are in different viseme classes (/P/ and /IY/). If they are in different phoneme categories but are in the same viseme class (/P/ and /B/), then the distance is a value between 0 and 1. The intraclass distances are derived from published confusion matrices (Owens and Blazek 1985).
When computing Dp, the center phoneme of the triphone has the largest weight, and the weights drop smoothly from there. Although the video model stores only triphone images, we consider the triphone’s original context when picking the best-fitting sequence. In current animations, this context covers the triphone itself, plus one phoneme on either side.
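A sketch of Dp follows, with made-up viseme classes, intraclass distances, and context weights standing in for the published confusion data and Video Rewrite's actual weighting.

```python
# Illustrative viseme classes and intraclass distances; the real intraclass
# values come from published viseme confusion data (Owens and Blazek 1985).
VISEME_CLASS = {"P": "bilabial", "B": "bilabial", "M": "bilabial",
                "T": "t-d-s-z", "D": "t-d-s-z",
                "IY": "iy", "AA": "aa"}
INTRACLASS = {frozenset(("P", "B")): 0.3, frozenset(("P", "M")): 0.5,
              frozenset(("B", "M")): 0.4, frozenset(("T", "D")): 0.3}

def phone_distance(p, q):
    """0 for identical phones, 1 across viseme classes, fractional within."""
    if p == q:
        return 0.0
    if VISEME_CLASS.get(p) != VISEME_CLASS.get(q):
        return 1.0
    return INTRACLASS.get(frozenset((p, q)), 0.5)

def context_distance(target, candidate, weights=(0.1, 0.2, 0.4, 0.2, 0.1)):
    """Dp over the triphone plus one phone of context on each side; the
    center phone carries the largest weight (illustrative weight values)."""
    return sum(w * phone_distance(p, q)
               for w, p, q in zip(weights, target, candidate))

d = context_distance(["IY", "P", "AA", "T", "IY"],
                     ["IY", "B", "AA", "T", "IY"])  # only /P/ vs /B/ differ
```

Here the single substitution of /B/ for /P/ costs the intraclass distance scaled by that position's context weight.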
The second error term, Ds, measures how closely the mouth contours match in overlapping segments of adjacent triphone videos. In synthesizing the mouth shapes for “teapot” we want the contours for the /IY/ and /P/ in the lip sequence used for /T-IY-P/ to match the contours for the /IY/ and /P/ in the sequence used for /IY-P-AA/. Video Rewrite measures this similarity by computing the Euclidean distance, frame by frame, between four-element feature vectors containing the overall lip width, overall lip height, inner lip height, and height of visible teeth.
The lip-shape distance (Ds) between two triphone videos is minimized with the correct time alignment. For example, consider the overlapping contours for the /P/ in /T-IY-P/ and /IY-P-AA/. The /P/ phoneme includes both a silence, when the lips are pressed together, and an audible release, when the lips move rapidly apart. The durations of the initial silence within the /P/ phoneme may be different. The phoneme labels do not provide us with this level of detailed timing. Yet, if the silence durations are different, the lip-shape distance for two otherwise well-matched videos will be large. This problem is exacerbated by imprecision in the HMM phonemic labels.
We want to find the temporal overlap between neighboring triphones that maximizes the similarity between the two lip shapes. Video Rewrite shifts the two triphones relative to each other to find the best temporal offset and duration. Video Rewrite then uses this optimal overlap both in computing the lip-shape distance, Ds, and in cross-fading the triphone videos during the stitching step. The optimal overlap is the one that minimizes Ds while still maintaining a minimum-allowed overlap. Since the fitness measure for each triphone segment depends on that segment’s neighbors in both directions, Video Rewrite selects the sequence of triphone segments using dynamic programming over the entire utterance. This procedure ensures the selection of the best segments from the data available.
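The overlap search and the dynamic-programming selection described above can be sketched together. This is a simplified illustration with our own naming: the search varies only the overlap duration (not an independent shift), and the costs are abstracted into a per-candidate table and a pairwise join function.

```python
import numpy as np

def segment_distance(tail, head):
    # Sum of per-frame Euclidean distances between lip-feature vectors.
    return float(np.linalg.norm(np.asarray(tail, dtype=float)
                                - np.asarray(head, dtype=float), axis=1).sum())

def best_overlap(tail, head, min_overlap=2):
    """Slide the fade-out frames of one triphone against the fade-in
    frames of the next, returning the overlap length minimizing Ds,
    subject to a minimum-allowed overlap."""
    best = (min_overlap, float("inf"))
    for k in range(min_overlap, min(len(tail), len(head)) + 1):
        d = segment_distance(tail[-k:], head[:k])
        if d < best[1]:
            best = (k, d)
    return best  # (overlap length, Ds)

def select_segments(unary, pairwise):
    """Dynamic programming over candidate triphone segments for the
    whole utterance. unary[i][j] is the phoneme-context cost Dp of
    candidate j at position i; pairwise(i, j, k) is the lip-shape cost
    Ds of joining candidate j at position i with candidate k at i+1."""
    best = [list(unary[0])]
    back = []
    for i in range(1, len(unary)):
        row, ptr = [], []
        for k in range(len(unary[i])):
            j = min(range(len(unary[i - 1])),
                    key=lambda j: best[-1][j] + pairwise(i - 1, j, k))
            row.append(best[-1][j] + pairwise(i - 1, j, k) + unary[i][k])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    k = min(range(len(best[-1])), key=lambda k: best[-1][k])
    path = [k]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Because the pairwise term couples each segment to its neighbors, a greedy per-position choice can be suboptimal; the backward pass recovers the globally best sequence for the whole utterance.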
The experiments performed with Video Rewrite to date have used relatively small databases. We had 8 min of “Ellen” footage, most of which we could easily repurpose. We had 2 min of “JFK” footage, of which only half was usable at any one time because of the extreme head poses. Concatenative text-to-speech systems, by comparison, use tens of hours of speech data. The big advantage of data-based approaches to synthesis is that the quality improves as the database grows. Given a larger database, one is more likely to find the exact triphone in the right context or, in the worst case, to find a segment that needs less modification for the final synthesis.
10.4.3 Morphing and stitching
Video Rewrite produces the final video by stitching together the appropriate entries from the video database. At this point, Video Rewrite has already selected a sequence of triphone videos that most closely matches the target audio. We need to align the overlapping lip images temporally. This internally time-aligned sequence of videos is then time-aligned to the new speech utterance. Finally, the resulting sequences of lip images are spatially aligned and are stitched into the background face. We will describe how Video Rewrite performs each step in turn.
10.4.3.1 Time alignment of triphone videos
We combine a sequence of triphone videos to form a new mouth movie. In combining the videos, we want to maintain the dynamics of the phonemes and their transitions. We need to time-align the triphone videos carefully before blending them. If we are not careful in this step, the mouth will appear to flutter open and closed inappropriately.
Video Rewrite aligns the triphone videos by choosing a portion of the overlapping triphones where the two lip shapes are as similar as possible. Video Rewrite makes this choice when we evaluate Ds to choose the sequence of triphone videos (Section 10.4.2). We use the overlap duration and shift that provide the minimum value of Ds for the given videos.
10.4.3.2 Time alignment of the lips to the utterance
We now have a self-consistent temporal alignment for the triphone videos. We have the correct articulatory motions, in the correct order to match the target utterance, but these articulations are not yet time-aligned with the target utterance.
Video Rewrite aligns the lip motions with the target utterance by comparing the corresponding phoneme transcripts. The starting time of the center phone in the triphone sequence is aligned with the corresponding label in the target transcript. The triphone videos are then stretched or compressed so that they fit the time needed between the phoneme boundaries in the target utterance.
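The stretching or compressing amounts to a linear remapping of frame indices. The sketch below is our own simplification: it assumes the source and target durations have already been converted to frame counts, and it picks the nearest source frame rather than interpolating between frames.

```python
def retime_frames(n_src, n_dst):
    """Map each of n_dst output frames to a source frame index,
    linearly stretching or compressing a triphone video so that it
    fits between the target utterance's phoneme boundaries."""
    if n_dst == 1:
        return [0]
    return [int(i * (n_src - 1) / (n_dst - 1) + 0.5) for i in range(n_dst)]
```

For example, a 5-frame triphone compressed into 3 output frames keeps the first, middle, and last frames, while stretching it to 9 frames repeats source frames roughly evenly.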
10.4.3.3 Illumination matching
Video Rewrite inserts a series of foreground images into the background video to synthesize new words. Under the best of conditions, the lighting on the face will be consistent and there will not be a noticeable difference at the boundary between the two sets of data. Unfortunately, with the need to collect as large a database as possible, there will often be lighting differences between different portions of the database. This is also likely to happen if we are trying to synthesize audiovisual speech in a large number of lighting conditions.
Video Rewrite uses a planar illumination model to adjust the lighting in the background and foreground images before stitching them together. The edge of the mask image, shown in Figure 10.4, defines a region where we want to be careful to match the lighting conditions. Video Rewrite models the average brightness of the pixels at the edge of the mask with a plane. The foreground image is adjusted by linearly scaling its brightness so that it matches the planar model of the background image. This measurement, in both cases, is only done using the pixels near the edge of the mask so that portions of the face that are moving, such as the mouth, are not included in the matching calculation. A more sophisticated approach to lighting control and image splining was proposed by Burt and Adelson (1983a).
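A minimal numpy sketch of this planar matching, under simplifying assumptions of our own: gray-scale images stored as 2D arrays, mask-edge pixels supplied as coordinate arrays, a least-squares plane fit, and a single scalar gain on the foreground.

```python
import numpy as np

def fit_brightness_plane(xs, ys, values):
    """Least-squares fit of brightness = a*x + b*y + c over the
    pixels at the edge of the replacement mask."""
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    return coeffs  # (a, b, c)

def match_illumination(fg, bg, edge_xs, edge_ys):
    """Scale the foreground so its brightness at the mask edge matches
    a planar model of the background's brightness there. Only mask-edge
    pixels enter the fit, so moving regions (the mouth) are excluded."""
    a, b, c = fit_brightness_plane(edge_xs, edge_ys, bg[edge_ys, edge_xs])
    target = a * edge_xs + b * edge_ys + c
    gain = target.mean() / fg[edge_ys, edge_xs].mean()
    return fg * gain
```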
10.4.3.4 Combining the lips and the background
The remaining task is to stitch the triphone videos into the background sequence. The correctness of the facial alignment is critical to the success of the synthesis. The lips and head are constantly moving in the triphone and background footage. Yet, we need to align them so that the new mouth is firmly planted on the face. Any error in spatial alignment causes the mouth to jitter relative to the face – an extremely disturbing effect.
Video Rewrite again uses the mask from Figure 10.4 to find the optimal global transform to register the faces from the triphone videos with the background face. The combined transforms from the mouth and background images to the template face (Section 10.4.2) give a starting estimate in this search. Reestimating the global transform by directly matching the triphone images to the background improves the accuracy of the mapping.
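The kind of global transform involved can be illustrated with a least-squares affine fit from point correspondences. This is a stand-in of our own, not Video Rewrite's actual procedure, which re-estimates the transform by matching the images directly rather than from sparse points.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points to dst.
    Returns a 2x3 matrix M, applied as M @ [x, y, 1]."""
    n = len(src)
    A = np.zeros((2 * n, 6))
    b = np.asarray(dst, dtype=float).reshape(-1)  # x0, y0, x1, y1, ...
    A[0::2, 0:2] = src
    A[0::2, 2] = 1.0
    A[1::2, 3:5] = src
    A[1::2, 5] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.array([[p[0], p[1], p[2]], [p[3], p[4], p[5]]])
```

With four or more correspondences the system is overdetermined, and the least-squares solution averages out small tracking errors in individual points.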
Video Rewrite uses a replacement mask to specify which portions of the final video come from the triphone images and which come from the background video. This replacement mask warps to fit the new mouth shape in the triphone image and to fit the jaw shape in the background image. Figure 10.6 shows an example replacement mask, applied to triphone and background images.
Figure 10.6 The Video Rewrite synthesis process. Two images from the database, with their control points, are combined using morphing. Then this image is transformed to fit into the background image and inserted into the background face.
Local deformations are required to stitch the shape of the mouth and jaw line correctly. Video Rewrite handles these two shapes differently. The mouth’s shape is completely determined by the triphone images. The only changes made to the mouth shape are those needed to align the mouths within the overlapping triphone images: the lip shapes are linearly cross-faded between the shapes in the overlapping segments of the triphone videos.
The jaw’s shape, on the other hand, is a combination of the background jaw line and the two triphone jaw lines. Near the ears, we want to preserve the background video’s jaw line. At the center of the jaw line (the chin), the shape and position are determined completely by what the mouth is doing. The final image of the jaw must join smoothly together the motion of the chin with the motion near the ears. To do this, Video Rewrite smoothly varies the weighting of the background and triphone shapes as we move along the jaw line from the chin towards the ears.
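This smoothly varying weighting can be sketched as follows. The raised-cosine falloff is our own assumption; the text above only requires that the weight be 1 at the chin and fall smoothly to 0 at the ears.

```python
import numpy as np

def jaw_weights(n_points):
    """Blend weight for the triphone (mouth) jaw shape along the jaw
    line: 1.0 at the chin (center), falling smoothly to 0.0 at the
    ears, where the background video's jaw line takes over."""
    t = np.linspace(-1.0, 1.0, n_points)    # -1 = one ear, +1 = the other
    return 0.5 * (1.0 + np.cos(np.pi * t))  # assumed raised-cosine falloff

def blend_jaw(bg_contour, fg_contour):
    """Blend background and triphone jaw contours (n_points x 2 arrays
    of (x, y) positions) with the position-dependent weights."""
    w = jaw_weights(len(bg_contour))[:, None]
    return w * np.asarray(fg_contour) + (1.0 - w) * np.asarray(bg_contour)
```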
The final stitching process is a three-way trade-off in shape and texture among the fade-out lip image, the fade-in lip image, and the background image. As we move from phoneme to phoneme, the relative weights of the mouth shapes associated with the overlapping triphone-video images are changed. Within each frame, the relative weighting of the jaw shapes contributed by the background image and by the triphone-video images is varied spatially.
The derived fiduciary positions are used as control points in morphing. All morphs are done with the Beier–Neely algorithm (Beier and Neely 1992a). For each frame of the output image we need to warp four images: the two triphones, the replacement mask, and the background face. The warping is straightforward since Video Rewrite automatically generates high-quality control points using the EigenPoints algorithm.
10.4.3.5 Results
The facial animation results for Video Rewrite are documented elsewhere (Bregler et al. 1997b). Typical results using John F. Kennedy as a subject are shown in Figure 10.7.
Figure 10.7 Images synthesized by Video Rewrite showing John F. Kennedy speaking (from Bregler et al. 1997b).
The results are difficult to quantify. Our ultimate goal is lifelike video that is indistinguishable from a real person. We are not at that stage yet, and we often face difficult trade-offs, especially when choosing among the many candidate segments in our database.
There are many factors that lead to the overall perception of quality. Most importantly, are the lips and the audio synchronized? Synchrony is easiest to judge on plosive sounds, and also the hardest to get right, since the closure produces little sound.
Are the lip motions smooth? In a sense, the Video Rewrite database directly captures motion data. We need to blend the database triphones while preserving this motion information.
Is the border visible between the background and foreground portions of the face? The lighting control described above helps with much of this problem. Yet, with the high-resolution images on a computer screen we could see a slight reduction in resolution in the lip region. We think that our feature tracking was accurate, but not accurate enough to align the individual pores on the face. When neighboring triphones are overlapped, the pores are often averaged out.
Is the jaw line smooth? Does it move in a natural fashion? Does the neck stay fixed? There is a lot happening at the jaw line and its overlap with the neck. It is important that this region of the face look realistic.
10.5 Alternative approaches
There are many ways to use video from real speakers to learn the mapping between audio and facial images – Video Rewrite is just one example. The two other approaches we discuss represent, respectively, simpler and more complex models of facial animation. In Section 10.5.1 we describe two systems that represent the face by static images, each capturing the position of the face at its extreme pose when speaking a viseme. In Section 10.5.2 we describe a system that trains a hidden Markov model to capture the facial motions.
10.5.1 Synthesis with static visemes
Actors (Scott et al. 1994) and MikeTalk (Ezzat and Poggio 1998) are systems that synthesize a talking face by morphing between single static exemplar images. For each phoneme, a single prototypical image is captured and represents the target location for the face when saying that particular sound. Thus the /o/ sound is characterized by a single facial image with rounded lips. Image morphing techniques are used to synthesize the intermediate images. MikeTalk uses sixteen static images to synthesize speech – some of these visemes are shown in Figure 10.8.
Figure 10.8 Ten of the sixteen static visemes used by the MikeTalk system to synthesize speech (from Ezzat and Poggio 1998).
Systems based on static viseme examples have the same alignment and morphing problems addressed by Video Rewrite. In the Actor system, several dozen fiduciary points on the head and shoulders are manually identified. These provide the alignment information and the control points needed for morphing. In MikeTalk the optic flow procedure described below is used to track facial features, and linear regression on the fixed portions of the face is used to find the global alignment. Optic flow computes the motion between two images by finding a two-dimensional vector field [dx dy]^T that shows how each pixel in one image moves into the next. There is little motion between images that are close in time; this makes the optic flow calculation easier. By looking at how each pixel moves every 33 ms, MikeTalk can establish the correspondence between prototype viseme images and morph between visemes.
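To make the [dx dy]^T estimate concrete, here is a single-window Lucas–Kanade sketch. This is a standard textbook formulation chosen for illustration; it is not necessarily the particular optic flow algorithm MikeTalk uses.

```python
import numpy as np

def lucas_kanade_flow(prev, curr, x, y, win=7):
    """Estimate the [dx, dy] motion of the patch centered at (x, y)
    between two gray-scale frames, by least-squares on the brightness-
    constancy equation Ix*dx + Iy*dy = -It. Assumes the inter-frame
    motion (frames 33 ms apart) is small."""
    h = win // 2
    p = prev[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    c = curr[y - h:y + h + 1, x - h:x + h + 1].astype(float)
    Iy, Ix = np.gradient(p)          # spatial derivatives of the patch
    It = c - p                       # temporal derivative
    A = np.column_stack([Ix.ravel(), Iy.ravel()])
    v, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return v  # [dx, dy]
```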
Synthesizing facial images from prototypical (static) viseme images is straightforward. Using either a TTS system to generate phonemes, as done by MikeTalk, or recognizing the phonemes with ASR, as done by Actor, drives the synthesis process. The correspondence between viseme pixels has already been computed, so synthesis is just a matter of morphing one viseme into the next.
There are two disadvantages of a static viseme system. Most importantly, the detailed motion information that was calculated in the analysis stage is not used. The lips do not move smoothly from one position to the next. Secondly, this type of synthesis does not take into account the effects of coarticulation. When we speak naturally the shape of our mouth and the sounds that we make are dramatically affected by the phonemic context.
We might not, for example, round our lips quite as much when we say the word “boot” at a faster rate as when we say it at a slower rate. Coarticulation has been modeled by filtering the control points, but this approach has not been applied to image-based rendering. Doing so would result in a system nearly identical to the one described by Dom Massaro and Mike Cohen, based on polygonal models of the face and using texture mapping to provide realistic skin and detail.
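Filtering the control points can be illustrated with a one-pole low-pass filter over a control-point trajectory. This is a deliberately crude stand-in for published coarticulation models, which blend weighted articulator targets rather than simply smoothing:

```python
import numpy as np

def smooth_controls(trajectory, alpha=0.4):
    """One-pole low-pass filter over a control-point trajectory, so that
    each phoneme's target is only partially reached and neighboring
    phonemes influence the realized mouth shape."""
    trajectory = np.asarray(trajectory, dtype=float)
    out = np.empty_like(trajectory)
    out[0] = trajectory[0]
    for i in range(1, len(trajectory)):
        out[i] = alpha * trajectory[i] + (1 - alpha) * out[i - 1]
    return out
```

A step change in the target (say, the lip-rounding parameter jumping from 0 to 1) produces a gradual, undershooting approach to the new target, which is the qualitative behavior coarticulation models aim for.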
In contrast, Video Rewrite uses a large database of phonemes in context to capture specific motions and coarticulation. Given a large enough database of examples, this is a simple solution. But it is, perhaps, not as elegant as building a statistical model of visual speech.
10.5.2 Voice puppetry
Matthew Brand (1999) proposed building an HMM to map between an audio signal and the appropriate facial shapes. His system is called Voice Puppetry and it learns a highly constrained model from the data and uses it to drive a conventional polygonal face model or cartoon drawing.
HMMs are a common technique for recognizing speech signals. The HMM model recognizes hidden states – the phoneme sequence the speaker is trying to communicate – based on the observed acoustic signals. The model assumes that the probability of entering any new state depends on a small number of previous states and the observations depend only on the current state.
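Recovering the hidden state sequence from the observations is the standard Viterbi decoding step. The sketch below is a generic illustration of that step (in log probabilities), not Brand's specific model:

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Most likely hidden-state sequence for an HMM.
    log_trans[i, j]: log P(state j at t+1 | state i at t).
    log_emit[t, j]:  log P(observation at t | state j).
    log_init[j]:     log P(state j at t=0)."""
    T, N = log_emit.shape
    delta = log_init + log_emit[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # N x N: from-state x to-state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

A speech recognizer decodes such a path over phoneme states; Voice Puppetry decodes an analogous path over states carrying facial-shape observations.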
Voice Puppetry builds a model in four stages. First, the system analyzes the video and builds a spring-constrained model of facial feature positions. These facial features are used to build an entropic Markov model that represents the video signal. Second, the audio corresponding to states determined by the video analysis is noted. Now each state in the Markov model has both a visual and an auditory observation vector. Third, given new audio, the speech can be decoded and the most likely set of states is determined. Fourth, and finally, the system can traverse the discovered state sequence and generate the most likely set of facial trajectories. These trajectories are used to drive the cartoon character.
An entropic HMM is the key to this method. In a conventional speech recognition system, there are many paths through the model to account for all possible ways of saying any given word. This is unnecessarily complicated for synthesis since we want to produce only the best facial feature locations given any audio signal. Voice Puppetry learns the mapping between audio and video without converting the speech to a phoneme sequence (thus avoiding the errors such a conversion produces).
An entropic HMM maximizes the likelihood of a model by adjusting the model parameters. In a Bayesian framework we say
θ* = argmax_θ P(θ | X) = argmax_θ P(X | θ) P(θ),
where the model parameters are described by θ, and the data is represented by X. The final term in this equation says that we want to maximize the product of (1) the probability of the model as a function of X and a given parameter set, and (2) the probability of seeing that parameter set.
This last term, an entropic prior in Voice Puppetry, says that models that are ambiguous and have less structure are not likely. The entropy of a discrete system is defined as
H(θ) = −Σ_i θ_i log θ_i,
and this function is minimized when most of the probabilities are zero. The entropy is turned back into something that looks like a likelihood by exponentiation
P_e(θ) ∝ e^(−H(θ)).
The model is not very specific if any state can follow any other state. Instead we want to drive most of the parameter values to zero so that redundant links in the model are removed. Voice Puppetry starts with a fully connected set of twenty-six states, and prunes this model so that only one of twenty-two states has more than one alternative. This dramatic reduction in model complexity makes it possible for Voice Puppetry to synthesize new facial movements from audio.
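The entropic prior and the resulting pruning can be illustrated numerically. This is a toy sketch of our own: it computes the prior for a row of transition probabilities and trims near-zero links, whereas Brand's estimator folds the prior into the parameter re-estimation itself.

```python
import numpy as np

def entropy(p):
    """Entropy H = -sum_i p_i log p_i of a discrete distribution
    (zero-probability entries contribute nothing)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def entropic_prior(p):
    """The entropic prior P(theta) proportional to exp(-H(theta)):
    sparse, low-entropy parameter settings are favored."""
    return float(np.exp(-entropy(p)))

def prune_transitions(trans, eps=1e-3):
    """Zero out near-zero transition probabilities and renormalize each
    row, mimicking the removal of redundant links from the model."""
    t = np.where(np.asarray(trans, dtype=float) < eps, 0.0, trans)
    return t / t.sum(axis=1, keepdims=True)
```

A deterministic transition row (probability 1 on one successor) has zero entropy and the maximal prior value of 1; a uniform row has maximal entropy and the smallest prior, so the estimator is pushed toward the sparse solutions that make synthesis unambiguous.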
Voice Puppetry uses the entropic HMM to drive a conventional 3D polygonal model of a face. The biggest advantage of such an approach is that the model can be built from data collected from one speaker, and applied to any other face – cartoon or otherwise. The face model captures many of the movements, primarily by stretching and shrinking the skin, but does not account for changes in appearance due to folding, or even teeth appearing and disappearing (see Figure 10.9).
Figure 10.9 Output from the Voice Puppetry system showing how an inanimate object can be made to talk using entropic HMM facial synthesis (from Brand 1999).
Some forms of coarticulation are handled well by the entropic HMM that Voice Puppetry has learned. But the very nature of the structure reduction that Voice Puppetry enforces means that there is only one way to say each word. Current work on speech recognition, however, suggests that a different HMM is needed to capture the coarticulation for different speaking rates (Siegler 1995).
Table 10.1 Comparing animation systems.
10.5.3 Summary of approaches
Table 10.1 summarizes the approaches of each of the three major image-based synthesis systems. The need to handle many of these factors is described in Section 10.2. It should be noted that the extra complexity in Video Rewrite – the need for more robust head-pose and lighting models – arises because large amounts of data are collected and repurposed to fit new synthesis needs. Since both MikeTalk and Voice Puppetry build simple models of how each piece of sound is said, they do not need large databases. This rich database is both a source of complexity and potentially a source of additional randomness that could make the synthesis more lifelike.
10.6 Conclusions
Image-based algorithms are a powerful way to create realistic synthetic images. By starting with real images and rearranging them, we have the potential to create the highest-quality animations, at minimal computational cost and human effort. This chapter has described a wide range of synthesis options: from using static images, to rearranging dynamic segments, to full mathematical models of facial movement. The results so far are not perfect, but they have the potential to do much better, by just adding more data to their databases.
More work is needed in a number of areas. Most importantly, we would like to learn how to capture many more features of the human face, while keeping the database collection effort to a reasonable size.
10.7 Acknowledgments
We appreciate the help from Michele Covell in designing and building Video Rewrite. Matt Brand and Tony Ezzat took time to explain their systems to us. Conversations with Paul Debevec helped us to understand all the ways that data-intensive algorithms have been successful. Finally, we appreciate the efforts of Christian Benoît to help organize and give life to the field of audio–video speech perception.
