14.1 Early Recordings of African Americans
African American English (AAE) is the most heavily studied group of dialects in North America. Its origins are hotly disputed, though, and one of the frustrating reasons is that there are relatively few early audio recordings of African Americans. Racist ideologies in the early twentieth century led to an assumption that African American life was less interesting to radio listeners and less important for scholarship. Of particular interest are recordings of African Americans who had been born as slaves. The small number of known recordings of former slaves has been collected by the Library of Congress and is now publicly available at <http://memory.loc.gov/ammem/collections/voices/title.html>. Often called the “ex-slave recordings” (ESR), they include interviews from a variety of sources, including recordings by professional folklorists such as John and Alan Lomax and John Henry Faulk, recordings made by linguists Lorenzo Dow Turner, Archibald A. Hill, and Guy S. Lowman, Jr., and a few recordings made by private researchers. The recordings vary greatly in their recording media and their sound quality and thus in their usefulness for linguistic analysis. Some of the folklore recordings are of music, which is less linguistically usable than ordinary speech. However, the collection has proved important in linguistic research on the history of AAE. The most notable exposition of them is Bailey, Maynor, and Cukor-Avila (1991), which provides an overview, transcripts of many of the recordings, and linguistic analyses by ten scholars. Other analyses of the ESR include Myhill (1995), Poplack and Tagliamonte (2001), and Sutcliffe (2001, 2003).
There are a number of corpora collected later that also have recordings of elderly African Americans, some born as early as the 1880s. The Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) recorded all interviews conducted after 1950, though not all of the recordings still survive. All interviews conducted for the Linguistic Atlas of the Gulf States (LAGS), made 1968–1983, were audio recorded, and a small sampling of the recordings, including those of sixteen African Americans, are publicly available from the Digital Archive of Southern Speech (<http://wakespace.lib.wfu.edu/handle/10339/37613>) in mp3 format. The Dictionary of American Regional English (DARE) conducted a survey of the entire United States from 1965 to 1970 and audio recorded about half of their subjects, including a significant number of African Americans from the South and non-Southern cities. Information about access may be found at <http://dare.wisc.edu/>. In addition, a variety of smaller-scale surveys have included audio recordings of elderly African Americans (e.g., Pederson 1965; Wolfram 1969; Butters and Nix 1986; Nichols 1986; Cukor-Avila 2001; Wolfram and Thomas 2002, among others). Most recordings from those small-scale projects are not publicly accessible because of human subject protections.
The discussion here, however, will focus on the ex-slave recordings, which predate all the other known collections of recordings of African Americans. The value of these recordings for morphosyntactic research has been demonstrated in Bailey et al. (1991), though the limitations of what they can tell us about earlier AAE have been demonstrated as well. The main limitation is that it is impossible to state how well the individuals whose voices were preserved represent AAE of the generation born in the mid-nineteenth century. One might expect another limitation to be that they are unsuitable for acoustic phonetic analysis because of the primitive recording media and degradation of the recordings over the years. In fact, however, they can be used for various acoustic analyses, and have been by Thomas and Bailey (1998), Thomas (2001), and Thomas and Carter (2006). Discussion of what kinds of acoustic analyses are possible and the problems that the recordings present will ensue.
14.2 Problematic Issues
Uncertainties about the ESR revolve around a number of issues. The most basic issue is that of how well the people in the recordings represent African Americans of their vintage. Rickford (1991: 194) raises this problem in asking, “[D]o they represent, albeit non-statistically, the range of social types and experiences in the ex-slave population?” The ESR now include a few natives of Virginia, several from Texas, a few from Alabama, and one each from Mississippi and Louisiana, one (Charlie Smith) who had lived in multiple states, and several Gullah speakers from Georgia and South Carolina. There is also a North Carolina native, but her recording is one of the most incomprehensible ones. While this distribution includes the cotton, tobacco, and rice-producing cultures, it still leaves some gaps. For example, there are no recordings from the French part of Louisiana, and no speakers came from areas with relatively sparse concentrations of slaves. There are both male and female subjects. There are also both field hands and house servants, though it is unclear how much of a difference that distinction made to subjects’ speech. Another uncertainty is in the level of rapport the interviewers established with the subjects. As one of the interviewers, John Henry Faulk, discussed in an interview (Brewer 1991), it was difficult for white interviewers to shed their own patronizing attitudes. The interviewers also frequently had their attention diverted by the bulky recording equipment (see also Poplack and Tagliamonte 2001: 69–77). This limitation, one hopes, affects linguistic factors less than it affects the content of the interviews.
The ESR were made with several kinds of recording media. This fact should not be surprising, in that the earliest ones were made in 1932 (or possibly 1931) and the last in 1974. During the 1930s, the recordings were made with discs of various types. Those made during the 1940s mostly employed reel-to-reel magnetic tape, and the two recorded during the 1970s used cassette tapes (<http://memory.loc.gov/ammem/collections/voices/title.html>). Of these recordings, the best sound quality is perhaps to be found on some made with reel-to-reel devices. The early disc recordings tend to have poor retention of higher frequencies, especially above 2 kHz. Whether the problems with higher frequencies were due to weaknesses of the original recording equipment or degradation over the years is unclear. However, the Library of Congress has been able to amplify the higher frequencies to enhance the recordings so that they sound more understandable. The disc recordings also exhibit occasional crackling noises. The recordings made on cassette tapes have some detectable hiss, a problem endemic to magnetic recordings of any kind but more prominent on narrow cassette tape than on wider reel-to-reel tape. Even the reel-to-reel recordings show some loss of signal components at high frequencies, though.
In spite of the problems with sound quality, the ESR are largely understandable. The only ones that are mostly incomprehensible are the three made by Roscoe Lewis in Virginia, which had very poor reception of sound above 1 kHz, though even in them some of the words can be deciphered. The recording of George Johnson (from Mississippi) is also difficult to understand, but more because of Johnson's rapid rate of speech than because of recording problems. In the other recordings, there are indistinct words here and there, variously due to sound-quality problems, mumbling by the interviewee, or unusual idioms. Nevertheless, it has been possible to produce transcripts of all but a few of the recordings. Transcripts of eleven interviews are included in Bailey et al. (1991). More recently, the Library of Congress has produced its own transcripts, including those for some recordings unknown to Bailey et al., which are available at the website given above. Neither set of transcripts is perfect. For example, in the Laura Smalley transcript in Bailey et al. (1991), one passage is rendered as “… out there, an’ brought it to the kitchen. When I was a chil’.” A clause is omitted, and what Smalley actually said was, “… out there, an’ brought it to the house, an’ ?then brought it to the kitchen. When I was a chil’.” Similarly, the Library of Congress transcript for Aunt Phoebe Boyd has Guy Lowman, one of the interviewers, saying “We're sadly together now,” which makes little sense, in answer to a question from Boyd about Lowman's association with the other interviewer, Archibald Hill. When one listens to the recording, it is quite clear that Lowman in fact said, “We're traveling together now.” Rickford (1991) discusses the uncertainties he encountered in transcribing Wallace Quarterman's Gullah.
However, the presence of errors should not be taken to mean that the transcribers were sloppy. Transcription is time-consuming and painstaking, and some of the ex-slaves are exceptionally difficult to understand, so the transcribers are to be commended that their work is as good as it is. The main problem with the transcripts is that, as Wald (1995) asserts, transcriptions of ambiguous items in the recordings may be affected by transcribers’ expectations, which in turn can skew linguistic analyses, particularly of morphosyntactic variants such as presence or absence of –s inflections.
14.3 Past Linguistic Analyses of the ESR
Transcripts of the ESR have been essential for the linguistic analyses of the interviews, especially for morphosyntactic analyses. Indeed, much of the previous work on the ex-slave recordings has focused on morphosyntactic variables. Three of these studies, Singler (1991) and Poplack and Tagliamonte (1991, 2001), compare morphosyntactic data from the ex-slave recordings with data from other corpora. Singler examines verbal aspect and plural marking of nouns in comparison with usage in Liberian Settler English, whose speakers are descended from freed American slaves. Poplack and Tagliamonte (1991) compare verbal –s marking by the ex-slaves with that by another expatriate population descended from U.S. slaves, that of Samaná in the Dominican Republic. Poplack and Tagliamonte (2001) expand the analysis of verbal –s and add a detailed analysis of past tense marking. All three studies find a great deal of similarity in grammatical and phonological conditioning between the ex-slaves and the expatriates. The broad similarity should be expected, considering that the expatriate groups left the United States, and thus contact with AAE and other forms of American English, at roughly the same period that the ex-slaves were growing up and establishing their speaking norms. Singler, though, notes two puzzling differences: the ex-slaves, at least part of the time, use contracted forms of will, would, and have/had/has, as well as the possessive –'s, while the Liberian settlers’ descendants lack any marking of those forms.
Other articles – Mufwene (1991), Holm (1991), and Sutcliffe (2001) – discuss the ex-slave recordings in terms of creole features. Mufwene focuses on Wallace Quarterman, the only Gullah interview available to Bailey et al. (though the Library of Congress collection contains several others that have surfaced since 1991). He notes that Quarterman uses a mixture of creole forms such as remote time been and non-creole forms such as gerunds, and contends that the Gullah of the 1980s is not a decreolized form of the Gullah of Quarterman's generation. Holm discusses a longer list of constructions, such as completive done, copula absence or realization, pronominal forms, and subject/verb order, and whether they occur in the ex-slave interviews to argue that there was never a widespread creole in the United States. Instead, he states (p. 246) that “the most likely scenario is that blacks born in most parts of the American South spoke a semi-creole from the beginning.” Sutcliffe concentrates on trying to find creole markers that other researchers overlooked.
14.4 Analyses that Cannot be Performed on the ESR
Morphosyntactic studies require only that the identity of words be discerned correctly. This prerequisite is most difficult to fulfill for certain one-phone morphemes, such as final –s or –ed. The reason is that consonants ordinarily have lower amplitude than vowels and thus are more easily obscured by noise or poor recording equipment performance. Moreover, sibilants add an additional problem because their frication noise occurs at high frequencies that early recording equipment picked up poorly. In comparison to morphosyntactic analysis, however, acoustic analysis has many more points at which debilitating problems could occur. How usable are the ESR for acoustic studies? Here, I will illustrate various kinds of acoustic techniques and how they play out with one of the recordings. The recording to be featured here is that of Phoebe Boyd, who hailed from Dunnsville, in the Tidewater section of Virginia. This recording is one of those whose existence was made known after the publication of Bailey et al. (1991), and thus it is not included in that book. Born about 1848, Boyd was interviewed and recorded by Guy S. Lowman, Jr., and Archibald A. Hill in 1935. In fact, she may have also served as a subject for the Linguistic Atlas of the Middle and South Atlantic States (LAMSAS), as informant VA 15N, though the destruction of the personal information on LAMSAS subjects makes this possibility impossible to verify. Boyd worked as a domestic slave, but after freedom she was involved in tobacco and cotton farming. Her interview typifies the disc recordings within the ex-slave corpus in terms of its sound quality. It is divided into eight sections, each about five minutes long, that represent both sides of four discs.
There are certain kinds of analyses for which the ESR are unsuited. Most voice quality analyses, or at least comparison of voice quality between different interviews, fall into this category. It should be borne in mind that comparisons within a recording are less problematic than comparisons between different interviews. Within a recording, the equipment is constant, and recording conditions are relatively constant, so even if some frequencies were captured more poorly than others, it is still possible to examine, for example, whether the speaker is creakier at some points than at others. Comparisons between recordings are where the major problems occur. Because the ESR were made with diverse recording media, their fidelity varies considerably. The equipment was quite inferior to today's equipment as well. Most of them were recorded in subjects’ homes, which introduces the factors of background noise and possible echo, though such noise is not apparent most of the time in the ESR. Nonetheless, voice quality is among the most sensitive aspects of speech to sound fidelity. As a result, comparisons of voice quality by any particular ex-slave with that of other ex-slaves or with more recent interviews should not be attempted.
Figure 14.1 shows a wideband spectrogram of the phrase drive on from the Boyd interview. Two recording problems are immediately apparent. One is that there is a considerable amount of noise, visible as the static in the background behind the darker parts of the spectrogram that represent elements of Boyd's voice. The other obvious problem is the transience, or crackling, on the recording, manifested as vertical dark marks. The noise is relatively constant and would not prevent comparisons from one part of the interview to another. As for the crackling, most of it falls at higher frequencies that are of minimal importance for most aspects of speech, and the few crackles that extend to lower frequencies could be avoided by not taking readings where they occur. However, there are other, subtler problems. An important one for 1930s-era recordings is that the equipment picked up low-frequency sound well but high-frequency sound poorly. The Library of Congress attempted to counteract this problem by enhancing the high-frequency parts of the signal and, apparently, damping the lower-frequency parts. This procedure introduced its own distortion, however, as can be seen in the narrowband power spectrum in Figure 14.2, taken from the first syllable of chicken as found in the Boyd interview. Ordinarily, either the fundamental frequency (F0) or other harmonics lying in the vicinity of the first formant (F1) have the greatest amplitude in the entire spectrum. However, here, the highest amplitudes occur in the vicinity of the second and third formants (F2 and F3) because of the enhancement.

Figure 14.1 Wideband spectrogram of the phrase drive on from the Boyd interview, illustrating the noise and transience (crackling) found in the recording.

Figure 14.2 Narrowband power spectrum of the vowel in the first syllable of chicken from the Boyd interview. Because of enhancement, harmonics near F2 and F3 have greater amplitudes than the lowest harmonic (F0).
One set of commonly measured voice quality features are the harmonics-to-noise ratio (HNR), shimmer, and jitter. The problems with recording quality make the HNR virtually of no value. Likewise, shimmer, which represents the amount of local variation in amplitude, is rendered useless by the background noise. Only jitter, which measures the amount of local variation in F0, can be reliably gauged because F0 is preserved well in the ESR. Another commonly utilized method focuses on phonation, such as breathy and creaky voicing. This method involves comparisons of the amplitudes of the lowest harmonics, which can be examined in narrowband power spectra. In breathy voicing, the lowest harmonic (i.e., F0) has a much greater amplitude than other harmonics, while for creaky voicing, the second or third harmonic will show a greater amplitude than the first, and modal voicing is intermediate. Figure 14.3 compares a power spectrum from the Boyd interview with one from a recording of a female voice made in a soundproof booth with modern equipment. Both are from the word got uttered in spontaneous speech. Peaks from the lowest nine harmonics can be located easily in the Boyd spectrum. However, there are two problems. First, the greater amount of noise in the Boyd recording, some of it visible as jagged spikes between harmonics, adds amplitude to the harmonics, and not necessarily at equal amounts across different frequencies. Hence, readings of their relative amplitudes are inaccurate. Second, and more damaging, the enhancement applied by the Library of Congress to the recording has the effect of reducing the amplitude of the first harmonic relative to that of higher harmonics, which distorts any assessments of phonation.

Figure 14.3 Comparison of narrowband power spectra, both from the vowel in utterances of the word got, from Boyd's speech (left) and that of a woman recorded with modern equipment (right).
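The two measures discussed above, jitter and the comparison of the lowest harmonics, can be sketched in a few lines. This is a minimal illustration and not the procedure used for the figures: the function names, the choice of a "local" jitter formula (mean absolute difference between consecutive glottal periods divided by the mean period), and all input values are assumptions made for the example. The harmonic comparison is included only to show the arithmetic; as noted above, the noise and enhancement in the ESR make such readings unreliable.

```python
def local_jitter(periods_s):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period (a common 'local jitter' definition)."""
    if len(periods_s) < 2:
        raise ValueError("at least two glottal periods are required")
    diffs = [abs(b - a) for a, b in zip(periods_s, periods_s[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods_s) / len(periods_s)
    return mean_diff / mean_period


def first_harmonic_dominance(harmonics_db):
    """Amplitude of the first harmonic minus the stronger of the second
    and third harmonics, in dB. Clearly positive values point toward
    breathy voicing, negative values toward creaky voicing."""
    if len(harmonics_db) < 3:
        raise ValueError("amplitudes of the lowest three harmonics are required")
    h1, h2, h3 = harmonics_db[:3]
    return h1 - max(h2, h3)


# Hypothetical glottal period durations in seconds (F0 near 200 Hz).
periods = [0.0050, 0.0051, 0.0049, 0.0050, 0.0052]
print(round(local_jitter(periods), 4))

# Hypothetical amplitudes (dB) of the lowest three harmonics.
print(first_harmonic_dominance([42.0, 36.0, 33.0]))
```

Because jitter depends only on the timing of successive periods, it is unaffected by a filter that rescales amplitudes across frequencies, whereas the harmonic difference changes whenever the first harmonic is damped relative to its neighbors.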
As with voice quality analyses, there are some kinds of consonantal analyses for which the ESR are poorly suited. Analyses of frication noise fall into this category. These kinds of analyses rely on spectra of the frication, most often assessing the frequency of the spectral peak and/or the “spectral moments.” The peak is the frequency at which the spectrum reaches its greatest amplitude. The spectral moments include measures of the center of gravity of the energy in the spectrum, how evenly the energy is distributed across frequencies, the way the energy is skewed across frequencies, and the degree of peakedness of the energy. Frication tends to have lower amplitude than the sound produced by vocal pulsing, making it easier for extraneous noise to drown out the frication. Frication also does not show the regular patterning of sound produced by vocal pulsing, making it difficult to distinguish frication noise from extraneous noise. Furthermore, the ESR did not capture the higher frequencies well, but these frequencies are important for analysis of frication, particularly for sibilants. The enhancement applied to the recordings did not eliminate this problem and to some extent may have exacerbated it. Figure 14.4 compares smoothed average power spectra of two [s] utterances, each covering about 77 ms, one by Boyd and the other by the same female speaker recorded in a soundproof booth as in Figure 14.3. The spectrum from the modern recording shows much better-defined peaks and valleys than the spectrum of Boyd. The modern recording also shows a decided tilt in favor of higher frequencies, with the highest amplitudes above 7,000 Hz, whereas the Boyd recording shows its highest amplitudes below 4,000 Hz. A large portion of the energy that the Boyd spectrum contains is due to adventitious noise in the recording introduced by the recording equipment or by deterioration of the discs. The enhancement process undoubtedly introduced some distortion as well.

Figure 14.4 Comparison of smoothed average power spectra of utterances of [s] from Boyd's speech (left) and that of a woman recorded with modern equipment (right).
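For readers unfamiliar with the measures named above, the spectral peak and the first four spectral moments can be computed from a power spectrum as follows. This is a sketch under stated assumptions: the spectrum is supplied as parallel lists of bin frequencies and non-negative power values, the power values are treated as weights of a distribution over frequency, and the function names are hypothetical. It is not the routine used to produce Figure 14.4.

```python
import math


def spectral_peak(freqs_hz, power):
    """Frequency of the bin with the greatest power."""
    best = max(range(len(power)), key=lambda i: power[i])
    return freqs_hz[best]


def spectral_moments(freqs_hz, power):
    """Return (center of gravity, standard deviation, skewness, excess kurtosis),
    treating the power spectrum as a distribution over frequency."""
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    weights = [p / total for p in power]
    cog = sum(f * w for f, w in zip(freqs_hz, weights))
    variance = sum((f - cog) ** 2 * w for f, w in zip(freqs_hz, weights))
    if variance == 0:
        raise ValueError("all energy lies in a single bin")
    sd = math.sqrt(variance)
    skewness = sum((f - cog) ** 3 * w for f, w in zip(freqs_hz, weights)) / sd ** 3
    kurtosis = sum((f - cog) ** 4 * w for f, w in zip(freqs_hz, weights)) / sd ** 4 - 3
    return cog, sd, skewness, kurtosis


# Hypothetical three-bin spectrum, symmetric about 2,000 Hz.
freqs = [1000.0, 2000.0, 3000.0]
power = [1.0, 2.0, 1.0]
print(spectral_peak(freqs, power))
print(spectral_moments(freqs, power))
```

Every bin contributes to the moments in proportion to its power, so broadband noise added by the discs or the equipment shifts all four values, which is one way to see why these measures are unreliable for the ESR.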
14.5 Analyses Involving Formant Measurements
Nevertheless, some kinds of consonantal analyses can be performed successfully on the ESR. The consonantal analyses that can be performed most readily are those that involve formant measurements. Formants, while usually associated with vowels, play a part in nearly all consonants as well. This attribute is most easily seen for approximants such as [l], which behave like vowels in that they exhibit formant structure throughout their course. Figure 14.5 shows a spectrogram of the phrase the Lord will bless uttered by Boyd with an LPC formant track superimposed on it. The three [l] tokens all show wide spacing between F1 and F2, with F2 falling around 1,700 Hz. This broad gap indicates that the [l]s are “clear,” that is, not velarized. In contrast, most varieties of North American English today show a great deal of velarization, even in syllable onsets. In a velarized [ɫ], F2 is quite low, usually with little or no visible gap between it and F1 in a wideband spectrogram.

Figure 14.5 Spectrogram of the phrase the Lord will bless with superimposed formant tracks. The large difference between F1 and F2 frequencies for the three [l] sounds indicates that the [l]s are not velarized.
Another example of a consonantal analysis that can be performed with the ESR is assessment of rhoticity, or r-fulness. In most varieties of English, /r/ is characterized by lowering of F3. As is well known, however, the /r/ articulation can be lost, in which case a vowel sound such as schwa remains. Non-rhoticity usually occurs in pre-pausal and pre-consonantal contexts, though in AAE and old-fashioned white speech of the U.S. South, the process can be extended to positions before a vowel, such as in for a and carry. F3 tends to stay within a narrow range for most vowels (with a modest increase for high front vowels), so the kind of drop in F3 that typifies /r/ constriction is usually salient in spectrograms. As a result, F3 values that fall within the range of vowels indicate non-rhoticity, while F3 values lower than that of any vowel not adjacent to an /r/ indicate rhoticity. Figure 14.6 shows an example of the word mother from Boyd's interview. It can be seen that F3 has about the same frequency in the second syllable as in the first, and impressionistically the token sounds non-rhotic. There certainly is no resonance corresponding to F3 that is close to the F2 resonance. For a rhotic pronunciation, F2 and F3 lie quite close to each other, often resembling a single, wide formant.
Figure 14.6 Spectrogram of the word mother with superimposed formant tracks. F3 values for the second syllable are no lower than those for the vowel in the first syllable, indicating a non-rhotic pronunciation.
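The F3 criterion for rhoticity stated above amounts to a simple comparison, sketched here. The function name and all formant values are hypothetical; the reference list stands for F3 readings of the same speaker's vowels that are not adjacent to an /r/.

```python
def classify_rhoticity(token_f3_hz, reference_f3_hz):
    """Label a token 'rhotic' if its F3 is lower than that of every
    reference vowel not adjacent to /r/, and 'non-rhotic' otherwise."""
    if not reference_f3_hz:
        raise ValueError("at least one reference F3 value is required")
    floor = min(reference_f3_hz)
    return "rhotic" if token_f3_hz < floor else "non-rhotic"


# Hypothetical F3 values (Hz) for one speaker's vowels away from /r/.
reference = [2650.0, 2720.0, 2800.0, 2950.0]

print(classify_rhoticity(2700.0, reference))  # F3 within the vowel range
print(classify_rhoticity(1900.0, reference))  # F3 well below the vowel range
```

A comparison of this kind is made within one recording, so it does not depend on matching fidelity across interviews, although an auditory check of each token remains advisable.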
Measurements of /l/ and /r/ are not the only kinds of formant analyses that can be performed with consonants. Measurements of formants at the transitions between consonants and vowels are useful for examining place of articulation, and the ESR are suitable for such analyses, even though they are more difficult than with cleaner recordings. However, the most commonly conducted formant analyses are those that are used to determine vowel quality. Vowel formant analyses certainly can be performed on the ESR. There are a few complications that arise, but there are strategies to deal with them.
The main problem for formant analysis of the ESR is the fact that higher frequencies were captured less well than lower frequencies. As noted already, the Library of Congress attempted to counteract this problem by enhancing the higher frequencies. Even before the enhancement, however, it was possible to extract formant readings, at least for F1 and F2. Figure 14.7 shows a sample spectrogram from one of the ESR – in this case, from the Laura Smalley recording – before the enhancement. The difference in amplitude between lower and higher frequencies is obvious. With recordings of this sort, three strategies can be effective. One strategy is to dispense with linear predictive coding (LPC) and estimate the formant values based on which harmonics have the highest amplitudes, as viewed in a narrowband power spectrum. LPC is the usual method for taking formant frequency readings today, and it is included in most spectrographic analysis packages – in fact, many students today do not know of any other method – but it is not the only method. Estimation from harmonic values can work well; see Thomas (2011: 46) for specific details. If one chooses to use LPC, there are two other strategies that can be utilized. One, effective if the drop in amplitude occurs in the range of 2–4.5 kHz, is to lower the analysis range for LPC. Thus, if one sets the upper limit of the formant readings to 4 kHz, none of the measured formant values will exceed 4 kHz. The default upper limit is usually 5 or 5.5 kHz. Of course, lowering the analysis range ordinarily requires lowering the number of LPC coefficients as well so that fewer formant readings will be taken. After all, there are fewer formants in the reduced range. The other LPC-specific method is to use different numbers of LPC coefficients for different formants. This practice is frowned upon for cleaner recordings because, in an LPC analysis, the different formant readings affect each other.
Sometimes, however, especially with imperfect recordings, there is no other way to procure readings of all the desired formants. In the recording in Figure 14.7, for example, there is no single number of LPC coefficients that will yield good readings for all three of the lowest formants. A setting that gives a good reading for F1 will not pick up F3 or, for front vowels, F2. Conversely, in order to capture F2 and F3, the LPC coefficients have to be set so high that F1 appears to be split into two formants: i.e., two separate formant values will appear for what is in reality just one formant, F1.

Figure 14.7 Wideband spectrogram of a section of the Laura Smalley recording without enhancement. The amplitude drops off noticeably above 1,000 Hz.
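The harmonic-based alternative to LPC mentioned above can be illustrated with a small sketch. The particular rule used here, an amplitude-weighted average of the strongest harmonic in a search band and its stronger neighbor, is an assumption made for illustration and should not be taken as the procedure given in Thomas (2011). The function name, the search band, and the harmonic amplitudes are likewise hypothetical.

```python
def estimate_formant_from_harmonics(f0_hz, harmonics_db, band_hz):
    """Estimate a formant frequency from harmonic amplitudes.

    harmonics_db[k - 1] is the amplitude (dB) of harmonic k, which lies at k * f0_hz.
    band_hz is a (low, high) range in which the formant is expected.
    """
    low, high = band_hz
    numbers = range(1, len(harmonics_db) + 1)
    candidates = [k for k in numbers if low <= k * f0_hz <= high]
    if not candidates:
        raise ValueError("no harmonics fall inside the search band")
    # Strongest harmonic within the band.
    peak = max(candidates, key=lambda k: harmonics_db[k - 1])
    neighbors = [k for k in (peak - 1, peak + 1) if 1 <= k <= len(harmonics_db)]
    if not neighbors:
        return peak * f0_hz
    second = max(neighbors, key=lambda k: harmonics_db[k - 1])
    # Weight the two harmonics by linear amplitude rather than by dB.
    w_peak = 10 ** (harmonics_db[peak - 1] / 20)
    w_second = 10 ** (harmonics_db[second - 1] / 20)
    return (peak * w_peak + second * w_second) * f0_hz / (w_peak + w_second)


# Hypothetical spectrum: F0 of 200 Hz, third and fourth harmonics equally strong.
amplitudes_db = [30.0, 40.0, 46.0, 46.0, 38.0, 30.0, 25.0]
print(estimate_formant_from_harmonics(200.0, amplitudes_db, (300.0, 1200.0)))
```

The precision of any estimate of this kind is limited by the spacing of the harmonics, so it is coarser for voices with a high F0.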
Vowel formant measurements can be used for a number of purposes. A common kind of display is an F1/F2 plot of a speaker's entire vowel system, showing either the individual tokens or, as in Figure 14.8 for Boyd, the mean values of each vowel class. Plots of entire vowel systems are most useful for looking for shifts of particular vowels because the relative position of the vowels of interest can be compared with the rest of the system. For example, it can be seen in Figure 14.8 that Boyd shows no evidence of Southern Shift developments. The Southern Shift (e.g., Labov 1991, 1994) involves interchange of the positions of the fleece and kit nuclei and of the face and dress nuclei, as well as fronting of the goose/toot and goat nuclei.¹ None of these mutations, with the possible exception of toot fronting, appears in Boyd's speech. However, she does show the old-fashioned Virginia allophony of price and prize, in which the nucleus is higher before voiceless obstruents than before voiced obstruents (Kurath and McDavid 1961). It might be noted that she does not show much glide weakening of prize, which characterizes the speech of younger generations of Southerners. Even though her prize glide does not reach the level of her dress vowel, it is still strong enough to sound impressionistically like a full glide, at least to this author's ears. Southerners who sound as if they have glide weakening or outright monophthongization of prize show considerably less formant movement than Boyd does (see Thomas 2001 for examples). Certain other old features are apparent in her speech as well. She maintains distinctions between the bin and ben vowels and between the north and force classes. Many Southerners of younger generations merge each of those pairs.

Figure 14.8 Formant plot showing the mean values of the vowels of ex-slave Phoebe Boyd. Arrows indicate the gliding of diphthongs. Squares signify measurements 35 ms after the onset of the vowel, circles indicate measurements at the midpoint, and triangles represent measurements 35 ms before the offset.
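The measurement points named in the caption and the per-class means plotted in the figure can be computed as sketched below. The function names and all token values are hypothetical, and the sketch assumes that each vowel lasts longer than 70 ms, so that the point 35 ms after the onset precedes the point 35 ms before the offset.

```python
from collections import defaultdict


def measurement_times(onset_s, offset_s, margin_s=0.035):
    """Times of the three readings: 35 ms after the onset, the midpoint,
    and 35 ms before the offset."""
    midpoint = (onset_s + offset_s) / 2
    return onset_s + margin_s, midpoint, offset_s - margin_s


def class_means(tokens):
    """tokens: iterable of (vowel_class, f1_hz, f2_hz).
    Returns {vowel_class: (mean F1, mean F2)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for vowel_class, f1, f2 in tokens:
        entry = sums[vowel_class]
        entry[0] += f1
        entry[1] += f2
        entry[2] += 1
    return {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items()}


# Hypothetical vowel with onset at 1.000 s and offset at 1.200 s.
print(measurement_times(1.000, 1.200))

# Hypothetical midpoint readings (Hz) for two vowel classes.
tokens = [("fleece", 320.0, 2500.0), ("fleece", 340.0, 2400.0), ("dress", 600.0, 1900.0)]
print(class_means(tokens))
```

The same averaging would be applied separately to the readings from each measurement point in order to draw the arrows for diphthongs.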
Boyd does not show the fronting of the mouth/proud/how/down complex that typifies Southern White speech. She shows what seems to be an archaic distribution of the diphthongs in this complex, however. The classic Virginia distribution, described, e.g., in Kurath and McDavid (1961), was a system in which the nuclei are higher before voiceless obstruents and lower before voiced consonants and word-finally. However, Lowman (1936) mentioned what was apparently an older distribution in which all tokens showed higher nuclei except those before or after /n/, as in down and now. The lowering would seem to have been related to the effects of nasality on F1 values, as discussed in Thomas (2001: 52–53). Lowman's actual transcriptions often show higher glides associated with higher nuclei, i.e., [əu] vs. [æʊ]. Boyd's system does not match the one that Lowman described, but it may have some properties in common. Before voiced obstruents, shown in Figure 14.8 as proud, she shows low nuclei but notably high glides. Before voiceless obstruents, designated as mouth, her nuclei are higher than for proud but her glides are not quite as high. These differences may have to do with the short durations that occur before voiceless consonants, and short durations have the effect of truncating diphthongs at their onset, offset, or both. In word-final positions, designated as how (based on tokens of the words how and now, the only such words in the recording), Boyd shows quite low nuclei and the lowest glides of any in the complex. Her tokens before nasals, designated as down, show apparent raising of the nucleus and mild lowering of the glide. In nasal contexts, the oral F1 tends to become replaced by two nasal formants, and the auditory impression often does not seem to match what spectrograms show, creating a muddled situation.
Could Boyd's system have been the one that Lowman (Reference Lowman, Jones and Fry1936) described? Lowman, of course, did not have access to modern acoustic techniques that could have provided greater precision for the phones he attempted to specify by ear.
Another kind of vowel formant analysis involves making a series of measurements through the course of each vowel token. This approach is used to examine the trajectory of a vowel in order to determine how diphthongal it is and, if it is a diphthong, in which direction it glides. Figure 14.9 shows trajectories of twenty tokens of Boyd's face vowel. For each trajectory, 1 indicates a point 1/10 of the duration from onset to offset, 2 a point 2/10 of the duration, and so forth. The actual onset (the 0/10 point) and offset (10/10) are not shown. Formant movement is evident for these tokens, but the dynamics are not the kind typical of diphthongs. Some of Boyd's face tokens begin in the interior of the vowel envelope, move toward the perimeter, and then move back to the interior. This kind of formant movement is characteristic of consonantal transitions, with a single vocalic target approached most closely near the center of the vowel. These examples of Boyd's face vowels come closest to their target when they are closest to the edge of the vowel envelope, though other vowels might have their targets in more interior positions. A true diphthong would show onset and offset values in different places, with a significant portion of the trajectory taken up by a decided rising, falling, backing, or fronting movement. A few tokens among Boyd's face vowels show a steadily outward, and often upward, moving trajectory. Most of these tokens fall before a /k/, however (e.g. take, bake), and dorsal consonants such as /k/ are characterized by F2 transitions that make F2 higher at the offset than it is within the vowel. Thus, these tokens do not actually reflect a diphthongal trajectory. They illustrate the importance of taking the consonantal context into account before deciding whether a vowel can be fairly labeled as a diphthong. 
The acoustic evidence, then, corroborates the findings of Dorrill (Reference Dorrill1986), who found, based on the auditory transcriptions of the Linguistic Atlas of the Middle and South Atlantic States, that several vowels, including face, were more likely to be monophthongal for African Americans than for European Americans. See Thomas and Bailey (Reference Thomas and Bailey1998) for acoustic analyses showing that other ex-slaves had monophthongal or nearly monophthongal face and goat vowels.
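The sampling schemes behind Figures 14.8 and 14.9 can be sketched in a few lines. This is a minimal illustration, not the procedure actually used for the Boyd data: the function names and the example onset and offset times are hypothetical, and the formant values themselves would still have to be read from a spectrogram or an acoustic analysis program at the times returned.

```python
def trajectory_times(onset, offset, n_steps=10):
    """Times at 1/10, 2/10, ..., 9/10 of the vowel's duration.

    The onset (0/10) and offset (10/10) themselves are left out,
    as in the trajectory plot described in the text.
    """
    duration = offset - onset
    return [onset + duration * k / n_steps for k in range(1, n_steps)]


def three_point_times(onset, offset, margin=0.035):
    """Times 35 ms after onset, at the midpoint, and 35 ms before offset."""
    return (onset + margin, (onset + offset) / 2, offset - margin)


# Hypothetical vowel token running from 1.200 s to 1.400 s.
times = trajectory_times(1.200, 1.400)
near_onset, midpoint, near_offset = three_point_times(1.200, 1.400)
```

Proportional time points make short and long tokens comparable, whereas the fixed 35 ms margins keep the first and last measurements at a constant distance from the neighboring consonants.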

Figure 14.9 Trajectories of twenty tokens of face vowel as produced by Boyd.

Figure 14.10 An utterance from the Boyd interview with ToBI annotations. A pitch track with a scale from 80 to 270 Hz (black line) is superimposed on a narrowband spectrogram with a scale from 0 to 750 Hz. The series of H* annotations represent the pitch accents, while H- and L-H% are phrasal edge tones. Note the F0 troughs between the pitch accents.
One caveat that should be acknowledged about Boyd's formant readings is that certain measurements may reflect some distortion. The main problem is in the F1 values for her high vowels, which appear a little greater in Figure 14.8 than they should be. For all but a few women, the high vowels usually exhibit F1 values lower than 500 Hz. The distortion seems to stem from the enhancement procedures employed by the Library of Congress, which appear to have damped the lower frequencies, thereby lowering F1 amplitudes and shifting the center frequencies of F1 upward. Fortunately, the positions of the vowels relative to each other are not affected, as all the high vowels and, to a lesser extent, the mid vowels are influenced in the same way. Although F3 and F4 are not shown in Figure 14.8, there was greater uncertainty about their measurements, especially those of F4, than about those for F1 and F2.
14.6 Analyses Involving the Fundamental Frequency
The fundamental frequency, or F0, plays a minor role in certain consonantal and vocalic contrasts. In many languages, particularly in Africa and eastern Asia, it is used for lexical specification. However, its main use in Western languages is for intonation. Fortunately, F0 is well preserved in early recordings such as the ESR. In fact, it is one of the most robust phonetic properties of speech. Even when recordings fail to capture frequencies above 1 kHz or somewhat lower, vocal pulses and the lowest few harmonics are still evident and allow measurement of F0. As a result, the F0 contours that make up intonation are easily discernible. Only when the phonation becomes especially breathy or creaky does F0 become hard to measure.
Analysis of intonation requires some sort of transcription before any further procedures can be performed. At present, the standard method for transcribing intonation is the Tones and Break Indices, or ToBI, system (Beckman and Hirschberg Reference Beckman and Hirschberg1994). ToBI requires construction of a textgrid with at least four tiers, including one for orthographic transcription of the speech sample and one for transcription of the various kinds of tones, including edge tones and pitch accents, that ToBI recognizes. Edge tones indicate the end of a prosodic phrase, of which there may be more than one kind (as in English), while pitch accents occur on some syllables to mark them as more prominent than syllables without a pitch accent. A sample utterance from the Boyd interview, with ToBI annotation, is shown in Figure 14.10. It exhibits a pattern that typifies many varieties of AAE to this day, one of high tones separated by clearly defined troughs as opposed to more gradual falls in F0. Every stressed syllable except those in the words I and out contains a pitch accent. This high density of pitch accents seems to be more common in present-day AAE than in present-day European American varieties, and its presence in the ESR suggests that it has existed in AAE for a long time.
Other kinds of analyses can be conducted once the ToBI transcription is complete. Some analyses simply involve computations of the frequency of different ToBI tonal designations. However, there are phonetic analyses that require numerical measurements. One example is the measurement of peak delay, which has to do with the position of the point of highest F0 of a tonal contour relative to the onset and offset of its host syllable. In English, stressed syllables serve as hosts, but some contours involve gradual rises that begin on the host syllable and reach their peak on a following unstressed syllable. Figure 14.11 shows a L+H* contour, the commonest rising contour in English, in Boyd's speech. The host syllable is hold, but as can be seen, the peak falls on the next word, the unstressed us. The peak delay is calculated as the proportion of the duration of the host syllable at which the peak is found. Because the peak here lies past the end of the host syllable, the value exceeds 1: it is 1.09. Thus far, little work has been conducted comparing dialects of English for peak delay, but the ESR could provide invaluable evidence on early AAE if such research were to be undertaken. See Thomas (Reference Thomas2011) for examples of other intonational metrics that can also be computed.
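The peak delay calculation just described can be sketched as follows. The function name and the three time values are hypothetical; they were chosen only so that the result reproduces the 1.09 reported for this token and are not measurements from the Boyd recording.

```python
def peak_delay(syllable_onset, syllable_offset, peak_time):
    """Position of the F0 peak as a proportion of the host syllable's duration.

    0 means the peak is at the syllable onset, 1 at the offset, and
    values above 1 mean the peak falls after the host syllable has ended.
    """
    duration = syllable_offset - syllable_onset
    if duration <= 0:
        raise ValueError("syllable offset must come after its onset")
    return (peak_time - syllable_onset) / duration


# Hypothetical times in seconds: a 200 ms host syllable with the F0 peak
# 18 ms past its end, i.e. on the following unstressed syllable.
delay = peak_delay(0.500, 0.700, 0.718)
print(round(delay, 2))
```

Expressing the delay as a proportion rather than in milliseconds keeps the measure comparable across syllables of different lengths.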

Figure 14.11 A pitch accent with a rising contour from the Boyd interview, showing the relevant features for computing the peak delay. A pitch track with a scale from 75 to 475 Hz (white line) is superimposed on a wideband spectrogram with a scale from 0 to 4,000 Hz.
14.7 Analyses of Timing
Timing does not appear to be affected noticeably in the ESR, including the Boyd interview. Although it is impossible to know exactly how faithfully the recordings reflect the actual timing of Boyd's utterances, and overall speeding or slowing is certainly possible, no such deviations are evident when one listens to the recordings. It is also possible that the speed of the recordings increased or decreased between the beginning and end of each side of a disc, but such differences are not evident by ear either, and quantitative measures would be required to detect them. As a result, reasonably reliable analyses of rhythm and duration can be performed on the Boyd recording and other ESR.
Timing differences range from segmental-level durations to measures of a speaker's overall rate of speech. On a segmental level, measurements can involve the intrinsic length of particular phonemes (e.g., Peterson and Lehiste Reference Peterson and Lehiste1960) and durational differences due to phonetic context, such as the well-known tendency for vowels to have shorter durations before voiceless obstruents than before voiced obstruents (e.g., House and Fairbanks Reference House and Fairbanks1953). For example, the tokens of the fleece vowel measured for Figure 14.8 have a mean duration of 86 ms, while those of the kit vowel that were measured averaged only 46 ms. When only tokens before voiceless obstruents are included, fleece still shows longer durations than kit, 57 ms to 40 ms.
At the opposite end of the timing spectrum, the overall rate of speech describes how quickly an individual speaks in general. As Kendall (Reference Kendall2013) has shown, speech rate differs from individual to individual and dialect to dialect and thus can encode sociolinguistic meaning. One of the best measures of rate of speech is the articulation rate, which counts the number of syllables per second, excluding silent periods between utterances (Robb, Maclagan, and Chen Reference Robb, Maclagan and Chen2004). One section of the Boyd interview has been analyzed in this way, and Boyd's mean articulation rate for the 65 utterances with no ambiguous words is 5.1 syllables/second. This rate is about average for English, but faster rates are typically found in Romance languages (Arvaniti and Rodriquez Reference Arvaniti and Rodriquez2013). Comparisons of the rate in different parts of the interview could also be conducted.
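A minimal sketch of an articulation rate calculation is given below. The utterance boundaries and syllable counts are invented for illustration. Averaging the per-utterance rates is an assumption about how the mean was obtained; pooling all syllables over all speaking time is shown as an alternative.

```python
from statistics import mean

# Hypothetical utterances: (syllable count, start time in s, end time in s).
# The silence between 2.0 s and 3.0 s is never counted, because only the
# time inside each utterance enters the calculation.
utterances = [
    (10, 0.0, 2.0),
    (6, 3.0, 4.0),
]


def utterance_rates(utts):
    """Syllables per second for each utterance separately."""
    return [n_syll / (end - start) for n_syll, start, end in utts]


def pooled_rate(utts):
    """Total syllables divided by total speaking time, pauses excluded."""
    total_syllables = sum(n_syll for n_syll, _, _ in utts)
    total_time = sum(end - start for _, start, end in utts)
    return total_syllables / total_time


mean_rate = mean(utterance_rates(utterances))  # mean of 5.0 and 6.0
pooled = pooled_rate(utterances)               # 16 syllables over 3.0 s
```

The two summaries differ whenever utterances vary in length, so a study should state which one it reports.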
A more specialized way of examining speech rate is analysis of prosodic rhythm. Prosodic rhythm has to do with variations in the relative durations of segments. It is intended to describe how syllable-timed or stress-timed a language, dialect, or speaker is. Various methods compare durations of vocalic intervals, consonantal intervals, or both (see Thomas Reference Thomas2011). One commonly used method, nPVI, focuses on vocalic intervals. For each pair of adjacent vowels in an utterance, the absolute value of the difference in their durations is divided by the average of the two durations (Low, Grabe, and Nolan Reference Low, Grabe and Nolan2000). A sample utterance from the Boyd interview, the first part of the utterance from Figure 14.10, is shown in Figure 14.12. In this excerpt, the nPVI calculations work as follows. The vowel in But is 89 ms long, while that in I has a duration of 70 ms. The difference of their durations, 19 ms, divided by their mean value, 79.5 ms, yields an nPVI score of 0.24. For the comparison between the vowels in I and was, nPVI = 0.80; between the vowels in was and hired, nPVI = 1.45; and between those in hired and out, nPVI = 0.44. Because out falls before a phrase boundary and is subject to phrase-final lengthening, it could be excluded from the calculations. nPVI scores vary greatly from one pair of words to another, so a large number of them have to be taken for each speaker, after which a mean or median value can be obtained. Smaller scores indicate more syllable-timing and larger scores more stress-timing. For the section of Boyd's interview analyzed for speech rate, there were 367 nPVI comparisons, with a mean value of 0.595 and a median value of 0.544. These figures represent a fairly stress-timed rhythm. Thomas and Carter (Reference Thomas and Carter2006) used nPVI analyses of the ESR to argue that AAE has become more stress-timed than it once was.
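The pairwise nPVI calculation can be sketched as follows. The function names are illustrative. Only the 89 ms and 70 ms durations come from the text; the longer duration list is hypothetical. Published nPVI values are often multiplied by 100, and the unscaled form is used here to match the scores reported in this chapter.

```python
from statistics import mean, median


def pair_score(d1, d2):
    """Absolute difference of two durations divided by their mean."""
    return abs(d1 - d2) / ((d1 + d2) / 2)


def npvi_scores(durations):
    """Scores for every pair of adjacent vocalic intervals in one utterance."""
    return [pair_score(a, b) for a, b in zip(durations, durations[1:])]


# The But / I comparison from the text: vowels of 89 ms and 70 ms.
but_i = pair_score(89, 70)
print(round(but_i, 2))

# Hypothetical vocalic durations (ms) for a longer stretch of speech.
scores = npvi_scores([100, 50, 100, 80])
summary = (mean(scores), median(scores))
```

Because each score is normalized by the mean of its own pair, overall changes in speech rate largely cancel out, which is why scores from different utterances can be pooled before taking a mean or median.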

Figure 14.12 An utterance from the Boyd interview with consonantal (C) and vocalic (V) intervals delineated for analysis of prosodic rhythm.
14.8 Prospects
As can be seen, a great deal of acoustic analysis can be performed on the ex-slave recordings. Other researchers demonstrated over twenty years ago how important these recordings are for morphosyntactic variants. The ESR are clearly valuable for exploring some kinds of phonetic variables as well. Even though certain kinds of analyses are impossible to conduct on them, other kinds of analyses certainly can be performed. Analyses of vowel quality and prosody are quite doable.
The limitations of the ESR, in particular the impossibility of knowing how well the speakers in them represent African Americans born in the mid-nineteenth century, have been pointed out before. These problems, however, should not dissuade researchers from utilizing the ESR. The recordings certainly cannot be used to show that a particular feature was absent in nineteenth-century AAE, and they are not easily used to demonstrate that any feature was predominant. Rather, their chief value lies in proving that certain features occurred in AAE at that time. They can show that some features that are present today are old and that features no longer extant occurred at one time. For vowel quality, they can be used to corroborate linguistic atlas records, which include some African American subjects from the same generation. The presence of monophthongal face and goat vowels in both the linguistic atlas records and the ESR testifies to this value. For prosody, there is no other record at all of African American features of that time. The ESR are the only window available on whether prosodic patterns found in AAE today existed in mid-nineteenth-century AAE. Evidence of this sort can shed new light on how AAE might have originated by shifting attention from the already heavily studied morphosyntactic variables to lesser known phonetic variables.
The methods illustrated here for analyzing these variables can be applied to other early recordings of other dialects and languages as well. It is my hope that other researchers will see that early recordings are indeed suitable for these techniques. It would be a shame to waste the insights that early recordings can provide just because someone considers their sound quality too poor for any modern acoustic methods.










