Vowel and consonant sounds are the heart of traditional pronunciation teaching, and they remain unavoidably important in an intelligibility-based approach to L2 pronunciation (Derwing & Munro, 2015; Derwing, Munro, & Wiebe, 1998). One cannot speak clearly without sufficient accuracy in pronouncing the sounds of a language. To be understood, the sounds a speaker produces must match what listeners expect to hear. This is why teaching materials have typically provided a great majority of practice exercises using segmental contrasts, usually through the use of minimal pairs (e.g., Dauer, 1993; Orion, 2002), and why the pronunciation of even challenging segmentals has been shown to improve through perceiving differences in L2 segmental contrasts (Thomson, 2011, 2012; Wang & Munro, 2004). In other words, the perception and production of vowels and consonants has to be sufficiently accurate to activate the speaker’s intended vocabulary in the mind of the listener. Sufficiently accurate does not mean a perfect match to an ideal. This is unlikely for any speaker, since the realization of phonemes in any language varies according to linguistic environment, idiosyncrasies of individual speakers, and speech register, among other things.
Segmentals refer to the vowel and consonant sounds of a language. Segmentals are distinct from, yet overlap to some extent with, the suprasegmentals of the language. For example, [ə] in English is clearly a vowel, indeed the most common vowel in L1 English varieties. Yet, it is dependent on a suprasegmental, the non-realization of stress, for its pronunciation. Nonetheless, we typically treat schwa and other segmentals as different from stress, rhythm, and intonation. In some languages there may be a relatively equal distribution of vowel and consonant sounds in speech because the languages have a strong tendency toward CV (consonant–vowel) syllable structure, while other languages may have an unequal frequency of consonants and vowels, with more consonants than vowels. Some languages allow multiple consonants to cluster together, while consonant clusters are strongly restricted in other languages. Finally, all the phonemes of a language are unavoidable in speaking, and the extent to which these sounds are pronounced accurately is one factor that affects how well listeners understand speakers. However, it is also clear that different types of mispronunciations may be more or less serious in their impact.
This chapter looks at segmentals from the perspective of L2 intelligibility, primarily from the perspective of L2 English. The first part of the chapter looks at how sounds can be classified and what this means for L2 pronunciation. The second section classifies common L2 pronunciation errors. Following this, the chapter briefly recounts explanations for why errors occur in L2 speech, and what kind of improvement can be expected in the absence of instruction, and with instruction. The next section examines research on intelligibility as it relates to segmentals.
In doing this, the chapter will argue the following about segmentals in L2 pronunciation:
1. Segmentals are critical in judgments of intelligibility and comprehensibility, if for no other reason than that they are unavoidable and determine whether we hear one word rather than another.
2. Although an error in any segmental can cause misunderstanding, some segmental errors are more likely to cause misunderstandings than others.
3. Segmentals are subject to different types of mispronunciations, and these different types of deviations (substitutions, deletions, distortions, insertions) are likely to have varied effects on intelligibility.
4. Individual mispronunciations may lead to loss of understanding, but words/phrases/sentences may also have a complex combination of deviations. In combination, these may cause difficulties in understanding that are hard to compute.
5. Naturalistic development of segmentals is likely to be fastest within the first year in the L2 environment, but this development will not apply equally to all segments.
6. Instruction on pronunciation results in improved segmental accuracy, but the short-term effect of improved segmentals on spontaneous speech intelligibility is less clear.
Segmentals are primarily a word-based feature when it comes to understanding. In other words, mispronunciations can lead listeners to hear different words entirely, and the overall effect of many small deviations can make it more challenging for listeners to understand the speaker’s message. But human listeners, especially native speakers (NSs) of a language, can often interpret sounds in contexts that are not particularly ideal exemplars of the phonemes they represent. They may also be able to adjust to clearly wrong pronunciations because of the discourse context (e.g., Thank you pronounced as Sank you is unlikely to be misunderstood), perhaps because sound contrasts are relatively rare in the language (e.g., stood pronounced as stewed), or because a potential minimal pair does not exist (e.g., value pronounced as *balue). Segmentals are also an important marker of accented speech, both for differences in native varieties (e.g., the New Zealand pronunciation of left sounding like lift to North American English speakers) and for nonnative speech.
Ways to Describe Segmentals
Segmentals can be described in various ways, and in pronunciation teaching these different classifications are important for different purposes. For example, segmentals can be classified quite broadly in their relationship to the way they are spelled, or they may be described according to the way they sound. Even though sound–spelling correspondences are important in the teaching of pronunciation, their connection to intelligibility is indirect at best, and this chapter will not address it, focusing instead on sound categories. In regard to sounds, segmentals can first be described phonologically. This means that all languages have sets of vowel and consonant sounds that distinguish meaning between words (that is, they have different sets of phonemes, signaled by the use of /forward slashes/). For example, English has three voiceless stops that distinguish meaning, /p/, /t/, and /k/ (pick, tick, kick). Languages such as German, Spanish, and French have the same three phonemes, whereas Arabic lacks /p/ but has /t/ and /k/, and Korean has three voiceless bilabial stop phonemes (lax /p/, aspirated /pʰ/, and tense /p͈/) corresponding to the single category found for each voiceless stop in English. These phoneme categories are language-specific, and they affect how we perceive and produce spoken language.
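The notion of a phonemic contrast can be made concrete computationally: two phonemes contrast if swapping one for the other turns a real word into a different real word, yielding a minimal pair. The following is a minimal Python sketch of this idea; the tiny phone-transcribed lexicon is an illustrative assumption, not data from this chapter.

```python
# Toy lexicon mapping phonemic transcriptions (tuples of phonemes) to
# orthographic words. The transcriptions here are illustrative only.
LEXICON = {
    ("p", "ɪ", "k"): "pick",
    ("t", "ɪ", "k"): "tick",
    ("k", "ɪ", "k"): "kick",
    ("s", "ɪ", "n"): "sin",
    ("s", "ɪ", "ŋ"): "sing",
}

def minimal_pairs(phoneme_a, phoneme_b, lexicon):
    """Return word pairs that differ only in phoneme_a vs. phoneme_b."""
    pairs = []
    for trans, word in lexicon.items():
        for i, ph in enumerate(trans):
            if ph == phoneme_a:
                # Swap the phoneme and see whether the result is a word.
                swapped = trans[:i] + (phoneme_b,) + trans[i + 1:]
                if swapped in lexicon:
                    pairs.append((word, lexicon[swapped]))
    return pairs

print(minimal_pairs("p", "t", LEXICON))   # [('pick', 'tick')]
print(minimal_pairs("n", "ŋ", LEXICON))   # [('sin', 'sing')]
```

On a full pronouncing dictionary, the same swap-and-look-up procedure enumerates all minimal pairs for a contrast, which is one way such contrasts are catalogued for teaching materials.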
We can also describe segmentals phonetically – that is, by the specific sounds that are pronounced for particular phonemes. Even though two languages may have some of the same phonemes, such as Spanish and English, this may not help language learners to pronounce the L2 accurately because of the language-specific variants (called allophones) that depend on the linguistic environment in which the vowel or consonant is pronounced. Thus, the /k/ allophones in English kick are different, with the first being spoken with aspiration, or [kʰ], and the second (typically) without aspiration, or [k]. (Note that when we are talking about actual sounds rather than phonemes, we use [square brackets].) Thus, kick can be narrowly transcribed in English as [kʰɪk]. The two /k/ sounds in kick include differences beyond aspiration, with the first /k/ being pronounced further forward in the mouth, toward the palatal region, and the second /k/ perhaps being pronounced with a closed mouth, that is, unreleased. For most pronunciation teachers, however, such details of articulation may seem irrelevant, and both allophones will be considered examples of the phoneme /k/ even though they are quite different sounds. Indeed, most English speakers may not even notice these sound differences. A sampling of common allophones of English sounds is listed in Table 3.1. However, it must be emphasized that allophonic variation is far more common than any chart can show, as sounds vary in systematic ways depending on the sounds they are next to and according to the dialect of the speaker. In an intelligibility-based approach, it is important to recognize that an L2 learner’s difficulties may stem from the production of allophones rather than phonemes.
Table 3.1 Some English phonemes and common allophones
Why is it important to distinguish between phonemes and allophones in teaching pronunciation? In L2 pronunciation learning, phonemes and allophones rarely match perfectly between the learner’s native language and the second language. Sometimes two languages will have the same [sounds] but not associate them with the same /phonemes/. For example, speakers of North American English think [ɾ] (the alveolar flap) represents the medial /t/ in words like city and beauty, but the same sound represents consonants spelled with <r> in Spanish, as in pero. Another reason to distinguish between phonemes and allophones is that we all pronounce the phonemes of our native language in ways that are dependent on where they occur in a word or sentence, and sometimes in ways that are dependent on our dialects. The American English /t/ is pronounced in four different ways in the words take, stake, city, and button (see Table 3.1), but the same words in some varieties of British English will have different allophones. (For example, the flap allophone in city in American English will be pronounced as [t] in British English.) These differences not only affect intelligibility, especially for listening comprehension, but may also be important in social or regional identification. It is also important to determine whether an L2 learner is learning a sound that they already know how to pronounce, but in a different environment, or whether the sound is completely new. For example, American English speakers learning languages like Vietnamese need to learn how to pronounce a familiar sound [ŋ] (e.g., singer, long) at the beginning of words (e.g., Nguyen, a common Vietnamese name), an environment that is illegal in English. They also need to pronounce completely new sounds such as /ɨ/, which may involve a different level of challenge.
Segmentals in a language can also be described according to constraints on the linguistic environments in which they occur. The /ŋ/ phoneme of English, for example, distinguishes meaning in word-medial and word-final positions (sin/sing, sinning/singing), but not word initially (net but not *nget). In another example, the [kʰ] allophone is pronounced at the beginning of stressed syllables in English (kit), but not after [s] (skit). These constraints are language-specific and may make pronunciation of even familiar sounds in a new language more difficult.
Finally, segmentals can be described according to how they fit into syllable structure, which is also a language-specific constraint. For example, Japanese maximally allows CV syllables (shorthand for a syllable made of a consonant followed by a vowel), while English allows syllables that are far more complex. The word strengths, for example, can be spoken with three consonants before the vowel and four after, or CCCVCCCC [stɹɛŋkθs] (Cardoso, 2017). Japanese learners of English must learn to pronounce syllable structures that do not fit the syllabic structure of their native language. Japanese learners of English have a strong tendency toward creating “legal” syllables through vowel epenthesis, so that borrowed English words in Japanese are modified to have a Japanese syllable structure (e.g., strike pronounced [su.to.ɾa.i.ku] rather than [stɹaɪk]). In contrast, Vietnamese only allows certain single-consonant sounds at the ends of syllables, while English allows many single and multiple consonants in syllable codas. For example, lengths has at least three consonant sounds at the end, [lɛŋθs] or [lɛŋks] in my speech, and [lɛŋkθs] in that of some speakers. As a result, Vietnamese learners of English must learn how to pronounce consonant clusters that are phonotactically restricted (i.e., illegal) in Vietnamese. Such a task is not impossible once a learner is aware of it, but pronouncing phonotactically restricted sounds in new environments can be challenging. Without instruction, improvement may be quite limited, as suggested by Osburne’s (1996) case study of a Vietnamese speaker.
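The epenthesis repair described above can be sketched as a simple rewrite over a phone string: insert a vowel after any consonant that is not immediately followed by a vowel. The Python sketch below is a deliberate simplification; real Japanese loanword adaptation also lengthens vowels, respects moraic structure, and conditions the epenthetic vowel more finely, so the vowel-choice rule here is an illustrative assumption.

```python
VOWELS = set("aeiou")

def epenthesize(phones):
    """Repair illegal clusters and codas by inserting an epenthetic
    vowel after any consonant not followed by a vowel (toy CV repair)."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        is_consonant = ph not in VOWELS
        next_is_vowel = i + 1 < len(phones) and phones[i + 1] in VOWELS
        if is_consonant and not next_is_vowel:
            # Toy rule, loosely mirroring Japanese adaptation:
            # [o] after t/d, [u] elsewhere (an assumption).
            out.append("o" if ph in ("t", "d") else "u")
    return "".join(out)

print(epenthesize(["s", "t", "r", "a", "i", "k"]))  # sutoraiku
print(epenthesize(list("gift")))                    # gifuto
```

Every output syllable is CV (or V), which is why the repaired forms gain syllables: strike (one syllable) becomes five-mora sutoraiku. The mirror-image Vietnamese strategy would instead drop the cluster consonants that the toy rule here pads with vowels.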
These four areas of classifying segmentals, according to the phonemes they represent, their allophones, phonotactic constraints, and syllable structure, influence the ways in which language learners pronounce their new language, but not always in obvious ways. Some of the errors of L2 learners can be predicted by contrasting the sounds of the L1 and the L2 (i.e., a contrastive analysis). But L2 learners also make errors that are not predicted by a contrastive analysis of the native and target language (Munro, 2018), or learners may be able to pronounce a sound well in one context but not in another. Sounds that are very similar yet different in the L1 and L2 are particularly difficult to hear and pronounce, while L2 sounds that are very different from anything in the L1 are often much easier to perceive and pronounce (Flege, 1995). A French friend talked about the difficulty of saying Sushi shop in English, even though French has both /s/ and /ʃ/. This suggests that what is difficult for a particular speaker is not always easy to predict.
Vowels and Consonants
Segmentals include, as mentioned, both vowel and consonant sounds, two categories that seem distinct yet often overlap. Glides (also called semi-vowels or semi-consonants) and English /ɹ/ have much in common with vowels in articulation, given that all of these sounds are made with an unobstructed vocal tract and full voicing. In addition, some sonorant consonants may carry the syllable nucleus when they are syllabic (e.g., the final consonant in button, bottom, little), making them behave like vowels.
The most intensively studied consonant contrast is the /l/–/ɹ/ contrast in English and (primarily) how Japanese L1 learners negotiate it. The short answer is, quite poorly. Japanese learners of English have tremendous difficulty both hearing and producing the difference between these two sounds, and experience does not always help. Japanese has one phonemic category (either the alveolar flap /ɾ/ or the retroflex flap /ɽ/, depending on the description), which Japanese learners map unsuccessfully to the two English phonemes. The important finding of research on this difficult contrast is that significant improvement is possible given experience and effective training. Perceptual and production changes are possible even for learners who start learning as adults. In what follows, I will review only a few of the myriad studies on the /l/–/ɹ/ contrast in order to show that improvement in pronunciation is possible even for difficult contrasts, especially those that have a high functional load.
Cues present in the speech signal are sometimes ignored when they are unnecessary in the learner’s L1. For non-experts in my undergraduate classes, I explain, roughly speaking, that all speech sounds comprise acoustic energy across a broad range of frequencies, and that what constitutes any given sound is how particular frequencies are amplified as a consequence of changes to the shape of the vocal tract. For efficiency, humans are very good at ignoring all frequencies that are irrelevant for discriminating between sounds. Unfortunately, what is efficiently ignored in an L1 may be important for the discrimination of sounds in an L2.
The /l/–/ɹ/ contrast in English is very common in all linguistic environments, and the contrast thus has the highest functional load possible, ten on the ten-point scale (Brown, 1988). Both sounds are difficult for Japanese learners to produce, and both are difficult to perceive (Goto, 1971). It appears that /l/ is perceived as more similar to the Japanese flap than /ɹ/, but /ɹ/ appears to be easier to learn, in line with predictions from models of speech learning (Bradlow, 2008; Riney, Takada, & Ota, 2000). The acoustic cues that distinguish /l/ from /ɹ/ are not important in Japanese, and Japanese learners of English find it difficult to attend to features in the speech signal that are habitually ignored in Japanese. As in any kind of speech learning, what is irrelevant in the L1 may be critical in the L2. All speakers are particularly good at attending to features that are important in the L1, but may also be particularly good at not hearing features that are critical for making distinctions in the L2. This process begins early in life, with infants demonstrating increasingly specific abilities in perception depending on the languages in their environment (Burns, Yoshida, Hill, & Werker, 2007; Vihman, 2015; Werker & Tees, 1984).
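Functional load can be operationalized, very roughly, as how much work a contrast does in the lexicon, for instance by counting the minimal pairs it distinguishes, optionally weighted by word frequency. The Python sketch below illustrates one such crude measure; the lexicon and frequency counts are invented for illustration, and Brown's actual scale incorporates further factors (such as the position and text frequency of the contrast).

```python
from itertools import combinations

# Toy lexicon: word -> (phoneme tuple, corpus frequency). Invented data.
LEXICON = {
    "light": (("l", "aɪ", "t"), 90),
    "right": (("ɹ", "aɪ", "t"), 85),
    "lock":  (("l", "ɒ", "k"), 40),
    "rock":  (("ɹ", "ɒ", "k"), 55),
    "then":  (("ð", "ɛ", "n"), 70),
    "den":   (("d", "ɛ", "n"), 5),
}

def functional_load(ph_a, ph_b, lexicon):
    """Frequency-weighted count of minimal pairs distinguished solely
    by the ph_a/ph_b contrast (a crude approximation of functional load)."""
    load = 0
    for (w1, (t1, f1)), (w2, (t2, f2)) in combinations(lexicon.items(), 2):
        if len(t1) != len(t2):
            continue
        diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
        if diffs == [(ph_a, ph_b)] or diffs == [(ph_b, ph_a)]:
            load += min(f1, f2)  # weight by the rarer member of the pair
    return load

print(functional_load("l", "ɹ", LEXICON))  # 125 (light/right + lock/rock)
print(functional_load("ð", "d", LEXICON))  # 5 (only then/den, den is rare)
```

Even on this toy data, the /l/–/ɹ/ contrast carries far more load than /ð/–/d/, which parallels the intuition behind prioritizing high-functional-load contrasts in teaching.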
Experience with the L2 seems to help with improvement, at least to a point. Flege, Takagi, and Mann (1995) studied adult Japanese learners who differed in experience in the United States. The low-experience group had been in the United States an average of 1.6 years and the high-experience group an average of 20.8 years. The high-experience group produced both English sounds far more accurately (though not like NSs).
Research on the /l/–/ɹ/ contrast also demonstrates that perceptual training can not only improve the ability to hear difficult contrasts, but can also result in improved pronunciation of the sounds. Bradlow, Pisoni, Akahane-Yamada, and Tohkura (1997) trained eleven Japanese English learners on sixty-eight minimal pairs in which the /l/–/ɹ/ contrast appeared in varied linguistic environments. Production recordings before and after training showed a trend toward production improvement. McClelland, Fiez, and McCandliss (2002) found that feedback was especially important in improving perception, and Saito and Lyster (2012) found that feedback was also important in improving production of the sounds.
Although I am treating both vowels and consonants as having similar patterns as to their potential effects upon intelligibility, it is likely that they are also different in many ways. For example, Bent, Bradlow, and Smith (2007) examined the intelligibility of sentences produced by Mandarin speakers of English. They found an overall correlation between segmental accuracy and intelligibility, but while overall accuracy of vowel pronunciation correlated with intelligibility judgments, overall accuracy of consonant pronunciation did not (although initial consonant accuracy was correlated with intelligibility). This somewhat surprising finding suggests that vowels and consonants are not equal in their effects on intelligibility.
Similarly, Fogerty and Kewley-Port (2009) systematically examined the contributions of consonant and vowel units to intelligibility. Sentences had either consonant sections or vowel sections replaced by word-shaped noise. The results showed that intelligibility of the sentences was higher when the consonants were masked than when the vowels were. They explain these results by saying that “vowels as traditionally defined carry important perceptual cues that are not found in consonants. Furthermore, even truncated portions of vowels contribute strongly to sentence intelligibility despite providing much less of the overall sentence duration than consonants” (Fogerty & Kewley-Port, 2009, p. 855). This greater effect on intelligibility comes about because vowels carry not only their own information but also co-articulatory information about consonantal transitions into and out of the vowel. In other words, hearing the vowels clearly in difficult listening situations (i.e., in noise) is more likely to lead to greater intelligibility than hearing consonants clearly. Vowels are essential to intelligibility; consonants are less consistently so. This pattern holds only for sentences, however, not for words in isolation (Fogerty & Humes, 2012), and it holds both when younger normal-hearing listeners and when older hearing-impaired listeners judge intelligibility (Miller et al., 2007).
Acquisition of Vowels
L2 learners show certain patterns when learning L2 vowels. Munro and Derwing (2008) examined the vowel production of recent immigrants to Canada from Chinese- and Slavic-language backgrounds. They found that learning was initially rapid, but that after the first six months in Canada, naturalistic learning of the vowel contrasts plateaued and showed inconsistent changes for the next six months. Both L1 groups showed similar trajectories in general, although not all vowels between groups were the same, indicating that there was also an effect of the L1. In a later study of vowel improvement after year one, Munro, Derwing, and Saito (2013) found that listeners heard improvements in Slavic speakers’ pronunciation of /ɪ/, especially in bVC (bit) rather than pVC (pit) contexts. Mandarin speakers did not show the same improvements in this vowel, although both groups showed continued improvement on /u/ and /ʊ/.
Munro, Flege, and MacKay (1996) examined the perception and production of English vowels by Italian learners of English who had lived in Canada for decades. They found that accentedness of vowels could be predicted as a function of age of arrival (AOA), and that even long and extensive exposure to the L2 in the environment did not change accentedness. While some of these vowels were less accented than others, all were accented, yet even the most accented vowels were highly intelligible. These results indicate that native-like production of L2 vowels is less likely with later AOA. The authors also suggest that the influence of the L1 on L2 vowels may be rooted in perception.
Relation of Perception and Production
The ability to perceive L2 vowel contrasts appears to increase with increasing L2 experience. Flege, Bohn, and Jang (1997) asked Korean and German L2 speakers of English to produce the beat–bit and bat–bet vowels, as well as to perceive the vowels in synthetic tokens. Production of vowels was more accurate for more experienced speakers, and some vowels were more difficult than others (especially the bat vowel). More experienced speakers were also more likely to pay attention to vowel quality distinctions in perception, and less attention to durational differences.
In another study of vowel perception and production, Flege, MacKay, and Meador (1999) examined Italian L1 speakers of English. They found that the accuracy with which the Italian L1 speakers produced English vowels was connected to accuracy in perception, even when L2 experience was held constant. In other words, “the accuracy with which L2 vowels are produced is limited by how accurately they are perceived” (Flege, MacKay, & Meador, 1999, p. 2973). Production and perception for early bilinguals (those who began learning English at an earlier age) were often indistinguishable from those of native English speakers, indicating that those who begin learning the L2 earlier are more likely to establish new perceptual categories that may be indistinguishable from those of NSs.
Hillenbrand and Clark (2000) examined how intelligible vowels were when their duration was modified in four different ways. In general, increased duration added very little to the intelligibility of vowels, since they were all identified very successfully even at the original duration. However, duration had some effect in that vowels with relatively long durations tended to be judged as adjacent vowels when duration was shortened (e.g., beat judgments shifted to bit), and vowels with relatively short durations were judged as adjacent vowels when durations were lengthened (e.g., bit judgments shifted to beat). This suggests that duration is a cue to vowel intelligibility, albeit a minor one (Iverson & Evans, 2007).
So can vowel intelligibility be improved by manipulating duration or other features? Ferguson and Kewley-Port (2007) examined the clear and conversational speech of two sets of speakers that had been used in previous studies. Six of these speakers had previously been shown to be more intelligible in noise, while six others had not. Acoustic analyses found that the speakers whose intelligibility benefit was greater (the “Big-Benefit” speakers) showed an expanded vowel space and greater vowel duration relative to the “No-Benefit” speakers. These patterns did not hold for all speakers, however, suggesting that clear speech benefits may involve features beyond vowels and duration.
Effect of Training on Vowel Perception
Cenoz and Lecumberri (1999) examined the effects of training on vowel perception. L1 Spanish and Basque learners of English received training on the discrimination of English monophthongs and diphthongs. The researchers found that these learners, especially those whose initial perception was poor, learned to perceive even difficult vowel contrasts more successfully. Greater improvement was also connected to learners’ desire to have a more native-like accent.
Computer-based perceptual training seems particularly effective in helping L2 learners to create new categories for difficult vowel contrasts (Thomson, 2011, 2012; Wang & Munro, 2004); vowels are more challenging to describe than many consonants because of their less precise placement in the mouth. Thomson’s studies employed high variability phonetic training (i.e., multiple voices) producing vowels in open syllables, even if those syllables did not represent actual words. Wang and Munro trained the beat–bit, bat–bet, and Luke–look contrasts, allowing L2 learners to control the type of practice they used in the computer-based program. This study suggests that allowing learners to determine the kind of practice they receive can be successful.
The kind of L1 vowel inventory that L2 learners have also seems to affect how well they can benefit from L2 vowel training. Iverson and Evans (2009) trained German and Spanish L2 learners on English vowels. German has a vowel inventory that is similar to English in size, while Spanish has a smaller, five-vowel system. Given the same amount of training, German speakers improved more than Spanish speakers, suggesting that a larger L1 vowel system may facilitate learning a larger L2 system. Spanish speakers also improved, but it took them three times longer to do so. Both groups retained their improvements over time. Despite suggestive evidence that vowels and consonants differ in regard to intelligibility, this chapter will largely collapse the two categories because of the many places in which they seem to affect intelligibility similarly.
L2 Segmental Pronunciation Errors
One of the most noticeable things about pronunciation is that segmental errors are very common in L2 speech (Im & Levis, 2015) and are an important element in judgments of foreign accent, even if the errors made are hard to precisely identify. In some cases, errors may lead to loss of intelligibility in that the words are not understood at all (watched pronounced as wat) or are heard as a different word (as in hearing the word watched as washed). That most segmental errors do not cause loss of intelligibility is a testimony to the flexibility of human listeners, who hear effectively when words come at them in the stream of speech, even in speech that is degraded by noise and overlap (Scharenborg, 2007).
Some of the most common errors in L2 speech include substitutions of one phoneme for another, distorted pronunciations that do not match expected sounds, additions of extra sounds, and deletions of expected sounds. These four types of errors are called substitution, distortion, insertion (or epenthesis), and deletion (Derwing & Munro, 2015).
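Three of these four error types (substitution, insertion, deletion) can be identified automatically by aligning a target phone sequence with a produced one using standard edit-distance alignment; distortions, by definition, resist classification as any single phone. The Python sketch below uses Levenshtein dynamic programming with a traceback; the phone transcriptions in the example are illustrative assumptions.

```python
def align_errors(target, produced):
    """Label segmental deviations via Levenshtein alignment.
    Returns a list of (type, target_phone, produced_phone) tuples."""
    m, n = len(target), len(produced)
    # dp[i][j] = edit distance between target[:i] and produced[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if target[i - 1] == produced[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    # Trace back through the table to recover the labelled operations.
    errors, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                and target[i - 1] == produced[j - 1]):
            i, j = i - 1, j - 1                      # exact match
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            errors.append(("substitution", target[i - 1], produced[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors.append(("deletion", target[i - 1], None))
            i -= 1
        else:
            errors.append(("insertion", None, produced[j - 1]))
            j -= 1
    return list(reversed(errors))

# watched /wɑtʃt/ produced as washed /wɑʃt/: /tʃ/ replaced by /ʃ/.
print(align_errors(["w", "ɑ", "tʃ", "t"], ["w", "ɑ", "ʃ", "t"]))
# [('substitution', 'tʃ', 'ʃ')]
```

The same routine labels a cluster simplification such as lengths [lɛŋθs] produced as [lɛŋs] as a deletion of [θ], and an epenthesized form as a series of insertions, which is how automated pronunciation-error analyses of this kind typically proceed.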
Substitutions occur when the intended sound in the new language does not exist in the native language (such as English speakers pronouncing the high front rounded vowel [y] found in French or German) or when the allophonic patterns are different in the two languages. For example, Spanish speakers often substitute [d] for [ð] in English (then sounds like den), even though Spanish speakers use [ð] as an allophone of /d/ in Spanish, as in nada pronounced [nɑðɑ]. Berlitz, a provider of language instruction, played on the potential of substitutions for unintelligibility with a distress call relayed from a sinking ship to a German Coast Guard operator: “We’re sinking, we’re sinking,” says the crew member. “What are you sinking about?” responds the German L2 English speaker (www.youtube.com/watch?v=z78V_oo21Kc).
Across varieties of English, even categories of sounds may be different. Many American English speakers and Canadian English speakers have a merger of the low back vowels /ɑ/ and /ɒ/, leading to a different phonemic inventory from speakers of other varieties of North American English. Although the rounded vowel is typically referred to as open-o, or [ɔ], I will reserve that symbol for the more strongly rounded back vowel in the British English pronunciation of words like caught, fall, taught. Some varieties of American and British English also systematically realize the phonemes /θ/ and /ð/ as [f] and [v]. In other cases, speakers may use allophonic realizations in unexpected places. I have known several people who use [ɫ] as the only allophone of /l/, while most speakers use it only syllable-finally. Other speakers vocalize the [ɫ] so that it loses its /l/ quality altogether while retaining the velar co-articulation such that it sounds more like [w], e.g., fill as [fɪw]. These kinds of allophonic variations are often well known and stereotyped (such as the glottal stop for /t/ in some varieties of British English, even though [ʔ] is unremarkable as an allophone for /t/ in words like button and cotton).
Distortions occur when the pronunciation of a target sound cannot be classified as another L1 or L2 sound, although this is a relatively understudied type of error (Bent et al., 2007). The lack of classification may have as much to do with the listener as it does with the speaker. Language learners may substitute an L1 sound for an L2 target, but a listener from the L2 may not be able to classify the L1 sound. This is true for both untrained and phonetically trained listeners. To overcome this problem, it is possible to use trained listeners who share the native language of L2 learners. For example, Hong, Kim, and Chung (2014) used phonetically trained Korean L1 listeners to code Korean-influenced English and used a number of Korean sounds in their transcription that may not have been obvious even to phonetically trained native English listeners. Distortions are common in casual L1 speech (Cauldwell, 2013; Shockey, 2003), and in my experience L2 speech is full of these types of deviations. L2 learners approximate target sounds as closely as possible within their L1 perceptual system, but their pronunciations may not have the acoustic signature of a single sound category that listeners can classify as a native phoneme. Kaneko et al. (2015) examined the interdental productions of Japanese speakers and found that some started as [s] and ended as [θ], but that the productions as a whole were not easily identifiable as either sound.
Deletions or insertions, on the other hand, are likely to result from L1–L2 conflicts in syllable structure or from phonotactic constraints. Japanese speakers learning English, as noted above, may insert vowels between consonant clusters to make them more pronounceable, while Vietnamese speakers of English may delete consonants from consonant clusters. In San Francisco, under the Golden Gate Bridge, there is a gift shop for tourists. The historically large number of Japanese tourists in this area led to a sign being posted that adjusted the English words “Gift Shop” (CVCC CVC) to reflect Japanese syllable structure, or “Gifuto Shoppu” (CVCVCV CVCV). The same process is evident when Japanese borrows English words, so that baseball terms such as baseball (CVCCVC) are pronounced in a Japanese way, as besuboru (CVCVCVCV). Vietnamese speakers, on the other hand, whose native language also has strong restrictions against final consonant clusters, employ another strategy to pronounce English words. Rather than inserting vowels, they often delete consonant sounds in final clusters. This may come not only from phonotactic constraints but also from Vietnamese being an isolating language with a largely monosyllabic vocabulary. Just as varied L2 learners of English employ deletion and insertion, so do English L1 speakers. A study of American English and Korean speakers (Hong et al., Reference Hong, Kim and Chung2014) showed that American English speakers were far more likely to employ deletion as a phonological strategy than insertion, although insertion is also evident for some speakers in words like ath(e)lete and real(a)tor. Advanced Korean speakers of English, in contrast, used insertion far more than deletion.
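The two repair strategies can be sketched as a toy function operating on CV skeletons like those above. This is an illustrative sketch only, not a claim about how learners actually process clusters: the function name and its simplified rules (break up and close off clusters with vowels, or drop the clustered consonant) are my own.

```python
# Toy sketch of the two cluster-repair strategies described above,
# applied to CV skeletons (e.g., "CVCC" for "gift"). The function and
# its rules are hypothetical simplifications for illustration.

def repair(skeleton: str, strategy: str) -> str:
    """Repair consonant clusters by vowel insertion ("insert",
    Japanese-style epenthesis) or consonant deletion ("delete",
    Vietnamese-style cluster reduction)."""
    out = []
    for ch in skeleton:
        if out and out[-1] == "C" and ch == "C":
            if strategy == "insert":
                out.extend(["V", "C"])  # break up the cluster: CC -> CVC
            # "delete": silently drop the clustered consonant
        else:
            out.append(ch)
    # Japanese also disallows most final consonants, so epenthesis
    # adds a vowel word-finally as well ("Shop" CVC -> "Shoppu" CVCV).
    if strategy == "insert" and out and out[-1] == "C":
        out.append("V")
    return "".join(out)

print(repair("CVCC", "insert"))    # CVCVCV ("Gift" -> "Gifuto")
print(repair("CVCCVC", "insert"))  # CVCVCVCV ("baseball" -> "besuboru")
print(repair("CVCC", "delete"))    # CVC (final cluster reduced)
```

The point of the sketch is simply that both strategies yield licit syllables in the learner's L1, but only insertion preserves all the consonants a listener expects to hear.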
English speakers also demonstrate some of these processes in casual speech (Cauldwell, Reference Cauldwell2013; Shockey, Reference Shockey2003). Such distortions do not typically lead to loss of intelligibility in the course of conversational interaction, even though, when examined in phonetic detail, they seem to distort sounds beyond recognition or lead to elisions – that is, the loss of expected sounds in (casual) speech. For example, the word temperature on weather programs may not be pronounced in citation form with four syllables – that is, [tɛmpɚətʃɚ] – but rather as [tɛmpətʃɚ], or even with two syllables – [tɛmtʃɚ]. Johnson calls this “massive reduction,” that is, “the phonetic realization of a word involves a large deviation from the citation form such that whole syllables are lost and/or a large proportion of the phones in the form are changed” (Reference Johnson2004, p. 29). From the point of view of how learners produce the L2, these kinds of massive reductions are relatively unimportant; L2 learners do not need to imitate this kind of casual conversational speech. From the viewpoint of listening comprehension, however, learners must be able to negotiate massive reductions in the speech of NSs if they are to find normal conversational speech intelligible.
In terms of intelligibility and comprehensibility, these four types of deviations may have different effects. The effects of identifiable phonemic substitutions, for example, may be explained by the concept of functional load (FL) (Brown, Reference Brown1988; Catford, Reference Catford and Morley1987). Although FL includes more than minimal pairs (Sewell, Reference Sewell2017), substitutions that reflect greater numbers of minimal pairs (e.g., /p–b/, /p–f/) are likely to cause more serious intelligibility difficulties than those that have fewer minimal pairs (e.g., /ð–z/). The extent to which FL predicts intelligibility difficulties is unclear, given that few studies have examined this issue experimentally. Munro and Derwing (Reference Munro and Derwing2006) examined the ways that high and low FL errors affected English-speaking listeners. They found that high FL errors had a greater impact on comprehensibility than low FL errors, and that multiple errors impacted judgments differently for the two areas. Multiple high FL errors led to significantly different judgments of accentedness, while multiple low FL errors did not. (Ratings for comprehensibility were not significantly affected for multiple high or low FL errors.) This suggests that substitutions are particularly sensitive to the potential for confusion, and that judgments of accent may be greater for increasing numbers of high FL errors. In a different approach, Kang and Moran (Reference Kang and Moran2014) classified the types of phonological errors in Cambridge ESOL (English for speakers of other languages) tests. They found that lower scores correlated with higher frequencies of high FL errors, and that higher scores had a much lower frequency of such vowel and consonant errors.
Distortions are, by definition, sounds that are hard for a listener to classify. As a result, they do not seem to be quantifiable in the same way that substitutions are. Unfamiliar distortions may not only affect intelligibility but may play a greater role in loss of comprehensibility because listeners need to work harder to get information from the speech stream, and they may face greater challenges in classifying the sounds that they hear as representative of particular phonemes. In other words, rather than sending listeners down the well-defined garden path of minimal pairs, unexpected distortions may defy straightforward interpretation. The effect of distortions on listener understanding is thus unclear.
Regarding deletions and insertions, both of these types of deviations can change the syllable structure of words and have the potential to affect intelligibility. Jenkins (Reference Jenkins2000) argues that insertions are less harmful to intelligibility precisely because the expected segments are present, though they may be masked by the presence of an extra vowel. Deletions, especially in initial consonant clusters, may affect intelligibility because the expected segments are not present. However, there is little empirical research examining this assertion.
This discussion about the effects of various types of pronunciation differences is overly simplistic, of course. L2 learners often do not make only one easily identified error of pronunciation, and computing the effects of multiple errors is especially difficult. For example, in some data I have examined from oral language proficiency tests, a student was teaching about satellites and how they are controlled. About ninety seconds into the talk, he began talking about “[zurɪst] power,” a term he repeated multiple times. The raters (and the transcribers) found him unintelligible, and the raters even asked him “What kind of power?” He replied, with great confidence, “[zurɪst].” One rater thought she heard “frost,” to which he replied, “Yes, [zurɪst].” As I worked with another research assistant, we finally figured out that he was saying “thrust.” The transcript shows that “[zurɪst]” was unintelligible more than ten times and that even when one of the raters tried the word “thrust” (the intended word) the confusion remained, both for the raters and later for the transcriber. Why was this word so difficult to understand? Minimally, the word had several phonetic differences from what was expected. First, [θ] was heard as [z], a difference of place and voicing. Second, this difference was at the beginning of the word, leading to immediate misclassification. Third, the word was spoken with two syllables rather than one. The cluster [θɹ] was broken up with the metathesized vowel [u], perhaps a spelling-influenced pronunciation of the intended vowel [ʌ]. Finally, a second vowel was inserted into the word, leading to an ending pronunciation of [ɪst]. So there appear to have been two substitutions, a metathesis, and an insertion, all relatively unexpected mispronunciations in a single word.
Naturalistic Development of Segmentals
The development of L2 segmentals has at least two aspects: how features develop normally in the learning of the new language, and the development that results from instruction. The first type of development can be referred to as naturalistic development, and the second as instructed development. Naturalistic development, or what other writers have called acquisition (e.g., Krashen, Reference Krashen1981), has been the focus of enough research to understand how L2 adult pronunciation develops in the absence of instruction. Questions about naturalistic development are always asked with a view to the observation that adults typically have noticeable foreign accents. This was originally seen as a reflection of having learned the L2 and its phonology outside the limits of a critical period (Scovel, Reference Scovel1969), but L2 learners continue to have the ability to acquire L2 sound systems throughout their lives (Piske, MacKay, & Flege, Reference Piske, MacKay and Flege2001). Some L2 learners are far more successful at this than others (Moyer, Reference Moyer2014), but there seems to be no obvious restriction related to age on acquiring a new sound system.
Studies on naturalistic development are closely related to questions about whether learners are able to associate L2 sound contrasts with separate phonemic categories, and whether they are able to produce such contrasts, even if the L1 does not have such categories. Baker, Trofimovich, Flege, Mack, and Halter (Reference Baker, Trofimovich, Flege, Mack and Halter2008) looked at whether age was important in the ability to distinguish (both in hearing and production) unfamiliar sound contrasts in an L2. Using children and adults, the study included two experiments, the first testing whether children or adults were more likely to associate contrasting sounds with a single L1 sound category, and the second testing whether children and adults differed in producing sound differences. The sound contrasts were two pairs of vowels (beat/bit, who’d/hood). Both contrasts are difficult for L2 English learners from a wide variety of languages, with the first being relatively common in English words, and the second relatively uncommon. The results showed that younger L2 learners were better able to hear L2 contrasts that were not part of their L1 categories. They were also better able to produce the sound contrasts. This indicates an effect of age, which is influenced by the interaction of the L1 and L2 sound systems. The extent to which the two systems interact seems weaker for younger learners, and stronger for older learners, indicating a greater flexibility for younger learners in learning L2 pronunciation. The authors conclude:
Both skills—being able to perceptually disassociate L2 sounds from L1 sound categories and being able to produce L2 sounds—appear to be associated with the age at which the children and the adults were first exposed to the L2, indicating that age may mediate the relationship between these two skills. Taken together, these findings suggest that children, perhaps due to the developmental state of their L1 at the time of their exposure to the L2, are able to perceptually distinguish L2 sounds from similar L1 sounds and to produce and (perhaps) perceive such L2 sounds more accurately than adults.
Age of arrival (AOA) is often used as a proxy for the effect of a critical period in pronunciation development. But it may also be the case that greater length of residence (LOR) influences phonological development. Baker and Trofimovich (Reference Baker and Trofimovich2006) looked at whether AOA or LOR was more important in the relationship between perception and production by matching Korean learners of English in AOA and LOR. They found that perception and production depended on learners’ AOA, not their LOR. They also found that the Korean speakers who showed the greatest self-awareness of their own production were also the most accurate producers of L2 sounds.
In a frequently cited study, Flege, Munro, and MacKay (Reference Flege, Munro and MacKay1995) examined the accuracy of consonant production by Italian learners of English who had emigrated to Canada at different ages. All participants spoke English more than Italian and had lived in Canada for more than thirty years on average. The study found a systematic effect of age on the accuracy of consonant production, with those who started earlier being more accurate. Accuracy was evaluated by native English listeners making forced-choice judgments of how accurate the consonants were. In general, the subjects who learned English after five years of age became progressively less accurate in their pronunciation, but the decrease in accuracy differed across sounds and across the linguistic environments in which the sounds occurred. For example, accuracy was greater for final stop consonants than for initial stops.
In a follow-up study (Flege, Frieda, & Nozawa, Reference Flege, Frieda and Nozawa1997), the effect of age on the accuracy of L2 pronunciation was influenced by the relative amount of use of the L2 and the L1. The L2 English accents of two groups of Italian English bilinguals who emigrated to Canada between the ages of five and six, and had spoken English for over thirty years on average, were differentiated by the amount that they used Italian. One group used Italian very little, while the other used it about one-third of the time. Both groups had detectable foreign accents, despite having begun learning English well within the expected critical period. The group that reported speaking Italian more, however, had significantly stronger foreign accents. The study suggests that the extent to which the L1 remains active influences the accuracy of L2 pronunciation in naturalistic development.
Particular periods of time seem to be especially influential for phonological development, what Derwing and Munro (Reference Derwing and Munro2015) call the “Window of Maximal Opportunity” for changes in pronunciation. In one study of these changes, Munro and Derwing (Reference Munro and Derwing2008) examined the acquisition of vowels within the first year in an English-speaking environment for L1 speakers of Mandarin (n = 20) and Slavic (n = 24, mostly Russians, but with Ukrainians and a Croatian as well) languages. Both Mandarin and Slavic languages have smaller vowel inventories than English. Recordings for the ten vowel targets (in b_t and p_t frames in a delayed repetition task) were made at two-month intervals over the course of the first year in Canada. Tokens were evaluated by phonetically trained listeners. Production accuracy for the ten vowels across all learners increased from 64 percent to 74 percent. However, a large amount of variation from one learner to another was evident, with some learners showing changes in vowel production primarily in the categories that were familiar from their L1 (Thomson, Reference Thomson2011). For those vowels that learners improved on, improvement leveled off by the end of the study period, suggesting that naturalistic improvement reaches a plateau, and that continued exposure loses its effect even when the learners remain surrounded by the L2. Some patterns of improvement also seemed to depend on the L1 of the subjects, with Slavic-language participants showing greater improvement than Mandarin speakers. Another experiment using phonetically untrained listeners found similar results, with greater improvement within the first six months of exposure, although learning for some vowels continued throughout the full year of data collection.
In summary, naturalistic pronunciation development is strongest when learners are initially in the new environment, but such development does not occur for all sounds, and the sounds that show changes may also be dependent upon the L1 of the learner. The results led the authors to say that “phonetic learning remains possible in adults, even when they do not receive focused instruction” (Munro & Derwing, Reference Munro and Derwing1999, pp. 498–9).
Are certain sounds inherently easier to produce accurately? In a test of the Markedness Differential Hypothesis (a theory that says that certain sounds are inherently more difficult because they are more unusual across the world’s languages – see Eckman, Reference Eckman1977), Chan (Reference Chan2007) looked at the accuracy with which Cantonese learners of English produced word-final consonants. She found that voiced obstruents and /l/ were the most difficult final consonants for the learners, but that voiceless obstruents and other sonorant consonants were generally produced accurately. The results suggested that markedness alone, while helpful in determining difficulty, was not an accurate predictor of difficulty, and that difficulty of sounds is also affected by the learner’s L1 background.
Although there are many more studies looking at naturalistic development of L2 pronunciation, this small collection of studies suggests certain tentative conclusions about the development of pronunciation abilities for L2 learners.
1. The age at which L2 pronunciation learning starts is important but does not determine pronunciation accuracy.
2. The ability to develop new L2 sound categories continues throughout life, but this process may be more challenging the older a learner is.
3. The first 6–12 months living in the L2 environment is connected to the greatest amount of naturalistic pronunciation development, but then naturalistic development slows down greatly.
4. Pronunciation development is connected to the amount and quality of experience with the new language.
5. Sounds that are relatively difficult because of their rarity or markedness in the world’s languages may not be equally difficult for all learners. (Rarity refers to how frequently a sound occurs across languages.)
6. All L2 learners are not the same. Some will improve more than others even given similar demographic characteristics. Variation in development between learners is normal.
Instructed Development of Segmentals
Pronunciation is unusual in studies of second language speech because of its indirect connection to other measures of language proficiency, in that “a beginner can have excellent production and an individual with a superb grasp of L2 syntax and vocabulary can be difficult to understand” (Thomson & Derwing, Reference Thomson and Derwing2015, p. 339). But learners can improve. The unambiguous finding of studies that have examined the effects of pronunciation instruction is that instruction is successful. Because most studies on instruction have focused on segmentals rather than suprasegmentals, the findings of improvement from one study are supported by findings from many others. Three recent summaries of previous research have looked at this issue: Saito (Reference Saito2012); Lee, Jang, and Plonsky (Reference Lee, Jang and Plonsky2015); and Thomson and Derwing (Reference Thomson and Derwing2015). Saito synthesized the results of fifteen quasi-experimental studies (five on segmentals alone); Lee et al. conducted a meta-analysis of eighty-six instructional studies (the majority teaching segmentals); and Thomson and Derwing conducted a narrative analysis of seventy-five studies from Lee et al. (again, with the majority having studied the teaching of segmentals).
Saito (Reference Saito2012) synthesized the results of fifteen pre- and post-test design quasi-experimental studies on the success of pronunciation instruction. He found that instruction that called explicit attention to pronunciation was typically successful in controlled contexts and sometimes in extemporaneous speech. Some of the fifteen studies employed control groups while others did not. Saito also examined results from the type of instruction (Focus on Form [FonF] versus Focus on FormS [FonFS] versus Focus on Meaning [FonM], i.e., form + meaning versus form alone versus meaning alone) and whether improvement showed up in relatively controlled or free speaking tasks.
Consistently, FonF instruction was most likely to lead to improvement, and FonM instruction (typically, the treatment for the control groups) did not lead to pronunciation improvement. FonFS led to improvement in controlled tasks but not extemporaneous speaking. Nine of the studies looked only at controlled speaking tasks, and most found pronunciation improvement. Only two studies showed improvement in spontaneous speech, which suggests that improvement in accentedness or comprehensibility that listeners can hear is a higher standard than simple improvement. However, the two studies also suggest that well-designed instruction over a sufficient length of time can lead to improvements in pronunciation that are audible in spontaneous speech.
In a meta-analysis of eighty-six studies, Lee et al. (Reference Lee, Jang and Plonsky2015) examined the success of pronunciation instruction. Their analysis showed an overall large effect of pronunciation instruction, meaning that instruction typically resulted in a relatively large degree of improvement, especially when done over a longer period of time, when learners received feedback on their pronunciation, and when improvement was measured using more controlled tasks (such as reading aloud or imitation). Such findings are encouraging on the one hand, in that pronunciation instruction seems to work, yet concerning on the other, in that the improvement largely shows up in more controlled types of tasks that allow learners to consciously attend to pronunciation form. The meta-analysis also shows that most studies looked at university students, which is not surprising but limits the generalizability of the findings.
Thomson and Derwing (Reference Thomson and Derwing2015), in a narrative analysis, analyzed most of the studies in the work of Lee et al. (Reference Lee, Jang and Plonsky2015), with attention to what makes an excellent pronunciation training study. In particular, they looked at the studies with respect to whether they were in line with the Nativeness Principle (that is, whether the goal was to improve features that contribute primarily to perceptions of accentedness) or the Intelligibility Principle (Levis, Reference Levis2005a), meaning that the instruction helps learners become more understandable in less-controlled speaking. Studies of segmental improvement dominated the review, from as few as one segment to as many as fourteen. The nature of instruction was typically undefined, suggesting that improvement in controlled contexts resulted from all kinds of instructional interventions. A concern raised was that very few studies (9 percent) met what the authors call “the gold standard of enhanced comprehensibility and intelligibility” (Thomson & Derwing, Reference Thomson and Derwing2015, pp. 332–3) – that is, global judgments of improvement that do not depend on phonetic analysis of changes. In other words, most studies looked at analyses of individual pronunciation features (including features with a low FL) without asking whether they lead to improvements in comprehensibility. This is a particularly problematic issue for the pronunciation of segmentals. The pronunciation of one segment in context may be noticed by a listener but is unlikely to affect intelligibility, and even instruction on a set of segments that are particularly difficult for L2 learners may not result in listeners hearing L2 learners as more comprehensible (Neri, Cucchiarini, & Strik, Reference Neri, Cucchiarini and Strik2008).
Finally, few of the studies looked at the long-term effect of instruction, as few of them employed delayed post-tests. Such tests, when they occurred, showed that short-term improvement is subject to significant backsliding in the level of improvement (Hahn, Reference Hahn2002). To sum up, these analyses of previous studies suggest the following regarding improvement of segmentals based on instruction.
1. Instruction on pronunciation is typically successful, but certain types of instruction may lead to more robust improvement.
2. Improvement is more likely to show up in controlled tasks, but the ultimate goal of instruction is improvement that is evident in spontaneous speech.
3. It typically takes time for segmental improvement in controlled tasks to become evident in spontaneous speech.
4. Segments with a high FL are likely to impact comprehensibility more than those with low FL, but because comprehensibility is typically affected by multiple features, improvement for individual segments alone is unlikely to result in greater comprehensibility.
5. Longer treatments are more likely to lead to noticeable pronunciation improvement.
Segmental Errors and Intelligibility
Although suprasegmentals are often considered to have a greater impact on comprehensibility (Gordon & Darcy, Reference Gordon and Darcy2016; Hahn, Reference Hahn2004; McNerney & Mendelsohn, Reference McNerney, Mendelsohn, Avery and Ehrlich1992), segmentals are also critically important in how understandable speech is, and should not be ignored in instructional decisions (Derwing et al., Reference Derwing, Munro and Wiebe1998). Jenkins (Reference Jenkins2000) argued that consonants are essential for mutual intelligibility in English as lingua franca (ELF) communication, with certain exceptions (e.g., the interdental fricatives and the allophonic dark /l/), as are differences between lengthened and shortened vowels, so long as the vowel quality is within the range of a generally understood regional variety. Bent et al. (Reference Bent, Bradlow and Smith2007) found that vowel accuracy was the most important feature for intelligibility and that accuracy in initial consonants also corresponded with intelligibility in read sentences. Initial consonants show up regularly as important for intelligibility, perhaps because of their importance in leading listeners to expect one cohort rather than another.
While any error may cause some loss of intelligibility, in general, more frequent errors are more likely to affect intelligibility than less frequent ones (Prator & Robinett, Reference Prator and Robinett1985). But the frequency of errors alone is insufficient to measure the potential for loss of intelligibility. It also appears that the frequency of the words in which sounds occur correlates with the effect a particular error has on intelligibility (Thomson & Isaacs, Reference Thomson and Isaacs2009; also see Cole et al., Reference Cole, Mo and Hasegawa-Johnson2010). The types of deviations must also be taken into account. It is well documented, for instance, that NS productions of segmentals may vary tremendously in casual or connected speech (Johnson, Reference Johnson2004). But if these deviations are expected by listeners, that is, if they reflect shared expectations of that speech register, the speech will be more intelligible.
Frequency may also be a factor in intelligibility, in that more frequent phonemes are likely to play a larger role in intelligibility than less frequent ones. Consonant phonemes, for example, are more frequent than vowels in English, accounting for nearly 60 percent of sounds (Hayden, Reference 278Hayden1950; Mines, Hanson, & Shoup, Reference Mines, Hanson and Shoup1978), while vowel sounds are about 40 percent of phonemes, reflecting the syllable structure of English and the numbers of phonemes for each segmental category. Some segments are more or less frequent than others as well.
Frequency itself is affected by two moderating factors: the number of contrasts a phoneme has with other phonemes (its FL); and the degree to which a particular contrast is likely to be confused by L2 learners. In Table 3.2 (from Qian, Chukharev-Hudilainen, & Levis, Reference Qian, Chukharev-Hudilainen and Levis2018, based on Brown, Reference Brown1988), the FLs of British English contrasts are shown, from ten (highest) to one (lowest). There is little disagreement that contrasts at the top end of the scale are more likely to affect intelligibility even if their raw frequencies are in the middle of phoneme frequency (e.g., /b/–/p/ are both slightly less than 2 percent in raw frequency but have many minimal pairs). Likewise, contrasts at the low end of the FL ordering are less likely to impact intelligibility because they are unlikely to be confused with other words when they are mispronounced. It is not clear, however, where the best division is between high and low FL.
Table 3.2 Rank ordering of phoneme pairs commonly conflated by learners
| Rank | Vowels | Examples | Rank | Consonants | Examples |
|---|---|---|---|---|---|
| 10 | /ɛ–æ/ | bet–bat | 10 | /p–b/ | pat–bat |
| | /æ–ʌ/ | bat–but | | /p–f/ | pat–fat |
| | /æ–ɒ/ | cat–cot | | /m–n/ | home–hone |
| | /ʌ–ɒ/ | cut–cot | | /n–l/ | night–light |
| | /ɔ–əu/ | ought–oat | | /l–r/ | fall–fore |
| 9 | /ɛ–ɪ/ | bet–bit | 9 | /f–h/ | fat–hat |
| | /ɛ–eɪ/ | bet–bait | | /t–d/ | tip–dip |
| | /ɑ:–aɪ/ | cart–kite | | /k–g/ | call–gall |
| | /ɚ–əʊ/ | immersion–emotion | | | |
| | | | 8 | /w–v/ | wet–vet |
| 8 | /i–ɪ/ | beat–bit | | /s–z/ | singer–zinger |
| 7 | – | – | 7 | /b–v/ | rebel–revel |
| | | | | /f–v/ | file–vial |
| 6 | /ɔ:–ɚ/ | form–firm | | /ð–z/ | clothing–closing |
| | /ɒ–əʊ/ | cot–coat | | /s–ʃ/ | sea–she |
| 5 | /ɑ:–ʌ/ | bart–but | 6 | /v–ð/ | van–than |
| | /ɔ–ɒ/ | caught–cot | | /s–ʒ/ | person–Persian |
| | /ɚ–ʌ/ | bird–bud | | | |
| | | | 5 | /θ–ð/ | thigh–thy |
| 4 | /ɛ–eə/ | shed–shared | | /θ–s/ | think–sink |
| | /æ–ɑ:/ | at–art | | /ð–d/ | though–dough |
| | /ɑ:–ɒ/ | cart–cot | | /z–dʒ/ | zoo–Jew |
| | /ɔ–ʊ/ | bought–boot | | /n–ŋ/ | sin–sing |
| | /ɚ–ɛ/ | further–feather | | | |
| | | | 4 | /θ–t/ | thank–tank |
| 3 | /i–ɪə/ | tea–tear | | | |
| | /ɑ:–aʊ/ | vase–vows | 3 | /tʃ–dʒ/ | choke–joke |
| | /u–ʊ/ | fool–full | | | |
| | | | 2 | /tʃ–ʃ/ | chair–share |
| 2 | /ɪə–ɛə/ | beer–bare | | /ʃ–ʒ/ | Confucian–confusion |
| 1 | /ɔ–ɔɪ/ | saw–soy | 1 | /f–θ/ | deaf–death |
| | /u–ʊə/ | two–tour | | /dʒ–j/ | juice–use |
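Rankings like those in Table 3.2 lend themselves to a simple lookup. The sketch below is my own illustration, not part of the source: ASCII spellings (e.g., "th" for /θ/) stand in for IPA symbols, only a handful of contrasts are included, and the cutoff of 7 is arbitrary, since the chapter notes that the best division between high and low FL is unknown.

```python
# Sketch: a lookup over selected functional-load ranks from Table 3.2.
# ASCII spellings ("th" for /θ/) stand in for IPA; the cutoff of 7 is
# an arbitrary assumption, since the best high/low FL split is unclear.

FL_RANK = {
    ("p", "b"): 10, ("n", "l"): 10, ("m", "n"): 10,  # many minimal pairs
    ("t", "d"): 9, ("k", "g"): 9,
    ("s", "z"): 8,
    ("th", "s"): 5,                                  # think-sink
    ("f", "th"): 1,                                  # few minimal pairs
}

def is_high_fl(target: str, produced: str, cutoff: int = 7) -> bool:
    """True if a substitution involves a contrast ranked at or above cutoff."""
    rank = FL_RANK.get((target, produced), FL_RANK.get((produced, target)))
    return rank is not None and rank >= cutoff

print(is_high_fl("p", "b"))   # True  - pat/bat and many similar pairs
print(is_high_fl("th", "s"))  # False - think/sink, but few such pairs
```

On this (assumed) cutoff, a /p/–/b/ confusion is flagged as high FL while /θ/–/s/ is not, which matches the intuition that the former is more likely to activate a competing word in the listener's mind.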
The location of a sound in a word also has an important effect on the intelligibility of sounds. Zielinski (Reference Zielinski2008) explored sites of reduced intelligibility from the extended speech of three L2 speakers of English: Korean, Chinese, and Vietnamese. She found that non-standard syllable stress patterns and mispronounced segments both affected listeners’ ability to successfully transcribe words. It was especially the case that initial consonants and vowels in stressed syllables were associated with incorrect transcriptions. Syllable-final consonants and unstressed vowels were much less frequently associated with loss of intelligibility. The one exception to this pattern was for the Vietnamese speaker, for whom mispronounced final consonants in strong syllables were more strongly associated with loss of intelligibility than was the case for the Korean and Chinese speakers. The Vietnamese speaker was more likely to delete final consonants than speakers of the other language backgrounds, suggesting that deletions of consonants are likely to impact intelligibility.
In a partial replication of Zielinski (Reference Zielinski2008), Im and Levis (Reference Alameen, Levis, Reed and Levis2015) looked at the sites of reduced intelligibility for three Korean international teaching assistants (ITAs) in the United States. Each had been rated by NS raters at different levels (high, intermediate, and low) on a field-specific five-minute extemporaneous test of speaking. Five additional NS listeners with ESL (English as a second language) training listened to each of the speakers and stopped whenever they noticed something that challenged them or something unusual in the speaker’s use of English. Using a think-aloud protocol, they described to one of the researchers why they stopped. The reasons for stopping were then coded according to phonological and non-phonological issues in the speech. In agreement with Zielinski (Reference Zielinski2008), errors in stressed (strong) syllables had a greater impact on the listeners, but unlike Zielinski’s findings, syllable-final consonants were more influential than syllable-initial ones, and stressed vowels were also regularly noticed. The highest rated speaker also had far fewer segmentals commented on than the middle and low speakers and, paradoxically, received more comments on non-phonological features. This is a peculiar finding given that the lower rated speakers had far greater grammatical difficulties, but it appears that the greater number of segmental errors for the intermediate and low speakers meant that the listeners were less able to comment on their grammar or other non-phonological features. The pronunciation dominated their attention.
Conclusions
Segmentals in any context are subject to misunderstanding if they are mispronounced. These mispronunciations may be heard as other phonemes by listeners, or they may not be identifiable as any target language sound. The gravity of segmental errors varies tremendously, but it is clear that the major factors in error gravity lie in the way sounds are presented in the speech signal rather than in the listener (Derwing & Munro, Reference Derwing and Munro2015). Measures of functional load (FL) are among the more promising predictors of relative gravity, but it is not clear whether substitutions, deletions, additions, or distortions of segmentals are more serious. There is also little information on the cumulative effect of segmental errors, whether in individual words or across a sentence. Munro and Derwing (Reference Munro and Derwing2006) found that increasing the number of low FL errors had no effect on listener judgments, while increasing the number of high FL errors did. What might happen with a combination of high and low FL errors? We do not know.
Languages differ in the degree to which they permit multiple consonants to occur as the onset or coda in a syllable. In general, consonant clusters, like individual consonants, need to be produced and perceived accurately, or intelligibility may be reduced (Cardoso, Reference Cardoso, Kang, Thomson and Murphy2017; Zielinski, Reference Zielinski2006). L2 learners from languages that have less complex consonant onsets and codas may need to learn to produce and perceive more complex syllables when learning an L2 that has more complex syllable structures. The process of doing so may end in substitutions of sounds within the clusters (e.g., play pronounced as pray), deletions of sounds (e.g., lapsed pronounced as lap or laps), or insertions of vowels to make foreign words more pronounceable (e.g., splash pronounced as esplash by Spanish speakers, or as siblash by Arabic speakers [Al-Saidat, Reference Al-Saidat2010, p. 129]).
Indeed, the issues surrounding the pronunciation of consonant clusters are primarily issues of syllable structure. Syllable structure differences between the L1 and L2 may mean that L2 learners have to delete final sounds (such as the final consonant in log) or add a vowel (such as age being pronounced [eɪdʒi]) if the L1 does not sanction that sound in that position. When the syllable onset or coda becomes more complex, allowing multiple consonants or consonant clusters that are not found in the L1, the potential for perception and production difficulties multiplies. The nature of syllable structure is a matter of language-specific phonotactics (e.g., the English <pl> in plume is a cluster, but the <ps> in psyche is not, despite both being pronounceable onset clusters in French). Some otherwise illegal clusters may show up in fast speech, such as Come here! being pronounced as [kmiɹ]. In general, initial clusters are more restricted in English than are final clusters. For example, all CCC onset clusters start with [s], followed by a stop (p, t, k) and a liquid or glide ([l, w, j, ɹ]), as in street, squelch, spew, splint. It is also clear that the pronunciation of clusters, and therefore syllable structure, is far less commonly treated in pronunciation teaching than is the pronunciation of individual segmentals and suprasegmentals (Cardoso, Reference Cardoso, Kang, Thomson and Murphy2017), and when syllable structure is addressed, it is typically addressed in relation to the segmentals included in the clusters. Syllable structure likely deserves attention in its own right.
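The three-consonant onset template described above is essentially a small phonotactic rule, and it can be sketched as a rule check. The sketch below is illustrative only and not drawn from this chapter: the function name and the representation of sounds as tuples of IPA symbols are my own assumptions. Note too that the template is necessary but not sufficient: some combinations it permits (e.g., [stl]) happen not to occur in English.

```python
# Illustrative sketch (hypothetical, not from the chapter): the English
# CCC onset template /s/ + stop (p, t, k) + liquid/glide (l, w, j, ɹ),
# as in "street" [stɹ], "squelch" [skw], "spew" [spj], "splint" [spl].

STOPS = {"p", "t", "k"}
LIQUIDS_GLIDES = {"l", "w", "j", "ɹ"}

def fits_ccc_onset_template(onset: tuple[str, str, str]) -> bool:
    """Check whether a three-consonant onset fits the English template.

    A True result means the template permits the sequence, not that the
    sequence is attested (e.g., [stl] fits the template but does not occur).
    """
    c1, c2, c3 = onset
    return c1 == "s" and c2 in STOPS and c3 in LIQUIDS_GLIDES

# "street" begins [stɹ]: fits the template.
# "splint" begins [spl]: fits the template.
# A hypothetical [tks] onset does not.
```

The check also makes the textbook generalization easy to see: the first slot admits exactly one consonant ([s]), which is why every English CCC onset violates strict sonority sequencing at its left edge.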
Two primary arguments will be made in this chapter. Both are important for onset (initial) and coda (final) clusters, and both are partially related to the information in Chapter 3.
1. Consonant clusters are important in word identification, and thus for intelligibility. But their effect on intelligibility is subject not only to substitutions of individual segments but to cluster simplifications, either through deletion or epenthesis. Some simplifications are unlikely to cause difficulties because native speakers (NSs) also simplify in certain linguistic environments, but many L2 simplifications will be unexpected for listeners and thus cause difficulties.
2. Final clusters also interact with morphophonological variants, in particular, inflectional endings (especially –ed, –s, –’s). Deletions of inflectional endings in clusters are unlikely for NSs, but are more likely to occur with nonnative speakers (NNSs) who may simplify clusters by deleting the last element of a cluster rather than the middle element. This may cause problems with intelligibility or comprehensibility because of the loss of grammatical information.
Consonant Clusters across Languages
Because languages differ in the degree to which they permit multiple consonants to occur as the onset or coda in a syllable, the pronunciation of syllables is a crucial aspect of intelligibility. In general, consonant clusters, like individual consonants, need to be produced and perceived accurately or intelligibility may be reduced (Zielinski, Reference Zielinski2006). L2 learners from L1s that have less complex consonant onsets and codas may need to learn to produce and perceive more complex syllables when learning an L2 that has more complex syllable structures. L2 learners from L1s that have similarly complex syllable structures will find differences in the types of clusters that are used in the L2; L2 learners from L1s with complex syllable types are also likely to find less complex syllable types challenging, especially if the L2 is syllable-based in its pronunciation (Cutler, Reference Cutler2012). Pronouncing new syllable types may result in substitutions of sounds within the clusters (e.g., play pronounced as pray), deletions of sounds (e.g., lapsed pronounced as lap or laps), or insertions of vowels to make foreign words more pronounceable (e.g., the borrowed word strike pronounced as suturaiku in Japanese).
Consonant clusters are especially important in two environments in English: the beginning of a word (the syllable onset) and the end of a word (the syllable coda). In both cases, clusters may involve up to three consonant sounds (e.g., strengths). Consonant sequences in the middle of a word are also important, but they are also likely to be pronounced as parts of two syllables (Pierrehumbert, Reference Pierrehumbert and Keating2006). The number of possible clusters in onsets and in codas has been a subject of some debate (Algeo, Reference Algeo1978), with numbers subject to definitions of what is a consonant (e.g., whether boards ends in two or three consonant sounds) and a number of other factors. The acquisition of consonant clusters is an important developmental issue for child L1 learners. Kirk (Reference Kirk2008) studied young children between the ages of seventeen months and thirty-one months. She found that the kinds of substitutions children made in clusters were not the same as substitutions for singleton consonants. She also found that the most common errors were cluster reduction or substitutions of one or more elements of the cluster (with the number of consonants in the cluster remaining the same).
For L2 learners, pronouncing consonant clusters is subject not only to the types of errors that are possible for singleton consonants (substitution or distortion), but also to simplification from deletions of part of the cluster, or epenthesis of a vowel. Both of these types of errors can change the syllable structure of the words. Accurate pronunciation of clusters is also subject to transfer from the L1 as well as errors that cannot be ascribed to L1 transfer. In addition to production of clusters, there is also evidence that L2 learners do not always perceive clusters in the way that L1 speakers do, and that they are subject to perceptual illusions (Davidson & Shaw, Reference Davidson and Shaw2012) that may be a result of onset complexity or relative sonority of the cluster consonants.
Studies on L2 Consonant Clusters
The pronunciation of sC clusters has been especially well studied. For example, a well-known difficulty with sC clusters in English is found with Spanish L1 speakers, who insert a vowel before the cluster (e.g., eschool, esmoke, estop). Many sC clusters (e.g., speak, stay, skate) are particularly problematic because they involve decreasing sonority in the cluster before reaching the peak sonority of the vowel ([s] is more sonorous than the stop following it). Other sC clusters (e.g., snow, slow, smoke) do not violate the sonority sequencing principle. However, the sonority sequencing principle may not explain all L2-related problems. Cardoso, John, and French (Reference Cardoso, John, French, Watkins, Rauber and Baptista2009) found that sC clusters seem to be acquired better if they are more frequent in the spoken input learners are exposed to.
L2 studies of consonant clusters have looked mostly at how errors can be explained by transfer or by development. Anderson-Hsieh, Riney, and Koehler (Reference Anderson-Hsieh, Riney and Koehler1994) examined the production of initial and final clusters by intermediate proficiency (IP) and high proficiency (HP) Japanese EFL (English as a foreign language) learners, and compared both groups to NSs. Japanese has a consonant–vowel (CV) syllable structure, so clusters are likely to be a challenge for learning English. The study found that IP learners had significantly more problems with English clusters than the other two groups. The IP group also used epenthesis more than the HP learners, though both groups used deletion more than epenthesis in producing clusters. HP learners had more cluster errors than NSs, though not significantly more.
Hansen (Reference Hansen2001) studied the acquisition of English coda singletons and clusters by three Mandarin speakers over six months. She found that the accuracy rate of coda production was relatively stable over the six months, but that the reasons for errors in codas with one, two, and three consonants were different. In single-consonant codas, errors involved substitutions of place or manner; in CC codas, most errors inserted a vowel before a pause; and in CCC codas, deletion of one or more segments was the most common mispronunciation.
Broselow and Finer (Reference Broselow and Finer1991) tested the effect of the sonority hierarchy on the mastery of onset clusters by Japanese, Korean, and Hindi learners of English. They looked especially at the accuracy with which six onset clusters, [pr, br, fr, pj, bj, fj], were produced in imitation and delayed production tasks. They found that pC clusters were produced with the greatest success, that [br] had a much higher error rate than [bj], and that both [fr] and [fj] were difficult. About two-thirds of the errors involved substitutions of another manner of articulation, while one-third involved deletion of one consonant in the cluster. The study, however, involved uneven numbers of subjects from each L1 (twenty-four Korean, eight Japanese, and eleven Hindi), which may have affected the accuracy rates reported.
Major (Reference Major1994) studied initial and final CC production in a four-week study with three data-collection points. Four beginning-level Brazilian learners of English produced the same words in word lists and in text. Major was looking at how transfer errors (including epenthesis and substitutions of Portuguese sounds) and developmental errors (deletions of sounds) were distributed over time and according to task. He found that transfer errors decreased over time but that there was no difference between the two tasks. Pedagogically, the decrease in transfer errors was unlikely to make a difference in how the learners’ speech would be heard, especially since developmental errors did not decrease over time.
Osburne (Reference Osburne1996) used a case study of a single Vietnamese learner of English to look at final cluster reduction over time. Using recordings of two informal work meetings recorded six years apart, she found that the speaker’s production of clusters improved over time but that errors remained in approximately 80 percent of final clusters. She claimed that cluster reduction could be largely predicted based on a variety of factors, including L1 influence and universal constraints.
In a study of initial cluster production that looked at the effects of style, gender, proficiency, and interlocutor type, Lin (Reference Lin2003) found that the relative amounts of deletion and epenthesis by Chinese EFL learners were affected by all four variables. Epenthesis was favored over deletion in more formal contexts, by HP learners, by female learners, and when speaking to NS teachers (for female learners only). When the task was informal, when learners were of a lower proficiency, were male, or when females spoke to classmates, deletion was favored as a strategy. This suggests that epenthesis is a strategy that involves more careful attention to speech.
Other studies have looked at the kinds of errors made, but have not examined how the errors were related to transfer or development. Anderson (Reference Anderson1983), using a contrastive analysis framework, examined the types of cluster errors made by Arabic- and Chinese-speaking learners of English. Many of the errors were predicted by the learners’ L1s, but certain difficulties were shared between the two groups. Not all the clusters exhibited the same errors. Certain clusters seemed to encourage epenthesis, while others encouraged deletion. Overall, deletion was the more common strategy. In another study of Arabic speakers, Ioup (Reference Ioup1984) found that different dialects can use different patterns of epenthesis (a vowel before the cluster or a vowel within it) to simplify clusters. Even within English, different varieties may have different restrictions on acceptable clusters. Setter (Reference Setter2008) compared cluster production in an inner-circle and an outer-circle variety of English – British English and Hong Kong English. She found that Hong Kong English had a more limited set of syllable types than British English, both for onset and for coda clusters.
Consonant Clusters and Intelligibility
Few studies have explicitly examined the effect of cluster errors on the intelligibility and comprehensibility of L2 speech. Jenkins (Reference Jenkins2000) argues that speakers who are particularly concerned for intelligibility in their speech are more likely to use epenthesis rather than deletion, since epenthesis provides greater recoverability and is likely to be associated with higher proficiency learners (as in Lin, Reference Lin2003). For this reason, Jenkins argues that “consonant deletion is more of a threat to intelligibility” (Reference Jenkins2000, p. 142), especially in onset clusters. She says that “addition is preferable to deletion; that sounds in initial clusters should never be deleted; that where elision occurs in a final cluster, it is preferable to opt for a /t/ or /d/ where this is possible” (Reference Jenkins2000, p. 143).
Zielinski (Reference Zielinski2006), in a study of reduced intelligibility in the English speech of a Korean, a Vietnamese, and a Chinese speaker, established that consonant errors in stressed syllables were more damaging to intelligibility than those in unstressed syllables, and that initial-consonant errors were worse than final-consonant errors (except for the Vietnamese speaker). Initial-consonant errors in stressed syllables were likely so damaging because they can lead listeners to activate phantom vocabulary cohorts (Broersma & Cutler, Reference Broersma and Cutler2008). But a more careful examination of Zielinski (Reference Zielinski2006) provides hints about consonant cluster errors and intelligibility. In general, Zielinski’s examples indicate that errors in consonant clusters could also result in reduced intelligibility, and all types of errors (substitutions, deletions, and epenthesis) were implicated in loss of intelligibility for clusters in all locations. For example, the words in (4.1)–(4.3) were identified as difficult to understand (Zielinski, Reference Zielinski2006, p. 90). Note: The phonetic symbols underneath the word reflect how the word was pronounced, and the “s” and “w” mean strong (stressed) and weak (unstressed) syllables. The first example involves vowel epenthesis, the second deletion, and the third insertion of another consonant, leading to the interpretation of a cluster where none existed in the word.
Onset deletion error
“proverb” heard as “problem”
[prɒ][bɒv]
s s
Consonant epenthesis error
“sitting for that kind of exam” heard as “assisting for that kind of exam”
[sɪs][tɪŋ]
s w
In another suggestive finding from Zielinski (Reference Zielinski2006), reduced intelligibility for cluster errors occurred less frequently than it did for single-consonant errors, perhaps leading to the conclusion that the multiple consonants in a cluster may provide a level of redundancy that single consonants do not. For the Vietnamese speaker, for example, for whom clusters were particularly a source of reduced intelligibility, there were twenty-six sites of reduced intelligibility for single consonants and sixteen for clusters.
Pedagogically, Zielinski recommends that
any non-standard consonant or consonant cluster in the initial position of a strong syllable has the potential to mislead NSE listeners, as does any non-standard vowel in a strong syllable. In order to ensure that segments in strong syllables are a trustworthy source of information for NSE listeners, therefore, the standard production of any consonant or consonant cluster and any vowel that can occur in a strong syllable should be a priority.
In other words, clusters should be treated like single consonants. Any deviation, especially in initial clusters involving high functional load sounds, is likely to be a threat to intelligibility. In addition, epenthesis, deletion, and substitution are all candidates for loss of intelligibility. Although initial and final clusters have different issues, final clusters cannot be ignored as unimportant, as research on Vietnamese learners of English shows. The relative importance of different types of cluster errors, especially epenthesis and deletion, is particularly ripe for research attention. We know that many L2 learners struggle with syllable structure in their new languages, but we do not know enough about which types of syllable structure errors matter and which do not.
Consonant Clusters and Morphophonological Variants
English has a wide variety of morphophonological variants. Morphophonology is the study of word forms (morphemes) that are pronounced differently in certain phonological environments. For example, im/il/ir are variants of the <in-> prefix, meaning “not.” Thus we have words like improbable, illegal, and irrelevant, in which the pronounced (and spelled) forms anticipate the pronunciation of the following consonant sound. Many morphophonological forms are not noticed by listeners and are simply shortcuts in pronunciation that assimilate sounds to their neighbors in place of articulation (e.g., emphasis → [ɛɱfəsɪs], input → imput) or voicing (e.g., have to → hafta). There are three suffixes with variant pronunciations that are potentially relevant to pronunciation teaching: –ed, –s/’s, and –ing.
In most varieties of English, –ing has two variants, –ing [ɪŋ] and –in’ [ɪn]. The second variant is sometimes referred to as “dropping your ‘g’s,” although there is no [g] in most varieties of English (this characterization is likely a consequence of English orthography). Instead, the variants are two realizations of the final nasal, [ŋ] and [n]. These variant pronunciations are sociolinguistically marked and reflect choices made by speakers. NSs typically use both variants, but [ɪn] is more commonly used in informal registers (e.g., Labov, Reference Labov and Lanehart2001), by boys more than girls, and by normal boys more than “model” boys (Fischer, Reference Fischer1958), by less educated speakers (Zahn & Hopper, Reference Zahn and Hopper1985), and in lower socioeconomic classes (Shuy, Wolfram, & Riley, Reference Shuy, Wolfram and Riley1968), findings that have been replicated throughout the English-speaking world. [ɪn] is also perceived as more representative of straight speakers than of gay speakers, and it intensifies the identification of southern US speech (Campbell-Kibler, Reference Campbell-Kibler2007). It is also argued to be common in African-American speech (Wolfram, Reference Wolfram1974). Differences in the proportional use of [ɪŋ] and [ɪn] mark differences in socioeconomic class, with speakers from higher socioeconomic classes typically using more [ɪŋ] than lower socioeconomic groups in the same task. Speakers of each group will also use more [ɪŋ] in formal (monitored) than in informal (unmonitored) speech. From an L2 pronunciation perspective, learners should be made aware of these sociolinguistic issues, but they almost invariably pronounce the –ing, a difference from both the –ed and –s/’s endings, which are commonly subject to consonant deletion.
The other endings, the –ed and –s/’s suffixes, are commonly found in published pronunciation materials. Both sets of endings also contribute to final-consonant clusters (e.g., slapped, slaps), and may be subject to deletion and epenthesis. Further, I am treating the variant forms together because their pronunciation follows consistent regularities, even though the spelled forms are not completely regular (e.g., jump → jumped but judge → judged) and the pronunciation of each spelled form applies to multiple morphemes that function differently in English.
These morphemes and their phonological variants are also fundamentally different from phonemes because of their relationship to the morphological system of English. Each of these suffixes is connected to the English inflectional system, but they are not only inflections in modern English. For example, the –ing ending is originally part of the verbal inflectional system, marking continuous states (e.g., He is smoking). In addition, –ing is used for gerunds (e.g., Smoking is bad for health) and adjectival forms (e.g., a smoking gun). The other inflections in English, the adjectival endings –er and –est (e.g., funnier than I am, the funniest person), are not typically issues for pronunciation.
Past Tense and Past Participle: The –ed Inflections
–ed is used for two verbal inflectional endings in English: regular past verbs (e.g., loved, laughed, wanted) and past participle forms (e.g., has loved, have laughed, was wanted). Like –ing endings, –ed is also used for adjectival forms (e.g., a loved one, a laughed-at joke, a wanted child). The regular past-tense ending is the dominant way to mark past tense in regard to the number of verbs it applies to, but it is not the only way to mark past tense in English. Suppletive forms (e.g., go/went), ablaut forms (e.g., come/came), and other past forms that do not neatly fit the regular pattern (e.g., bring/brought, can/could) are used successfully by L2 learners at different rates (Bayley, Reference Bayley, Tarone, Gass and Cohen1994) and differently from the regular form. Indeed, Bayley found that saliency (defined as the difference in the number of sounds between present and past forms) predicted the number of correct forms. Regular forms of both kinds were low on his saliency hierarchy, and were less successfully produced than most other forms. The syllabic form [ɪd] was less accurately produced than the nonsyllabic forms ([t], [d]), but none were above 40 percent correct.
The rules for the production of –ed endings reflect an almost invariant set of sound/spelling connections, suggesting that L2 learners should be able to understand and apply the rules to unfamiliar forms. In contrast, irregular past and participle forms must be learned as independent lexical items because they do not follow a general rule. However, irregular past endings are more likely to be produced successfully under performance pressures (Bayley, Reference Bayley, Tarone, Gass and Cohen1994) precisely because they are stored as lexical items while regular past-tense forms have to be created by rule (Pinker & Ullman, Reference Pinker and Ullman2002). Computing the rule creates an extra step that can lead to errors in phonological form while communicating or under other cognitive pressures. McDonald and Roussel (Reference McDonald and Roussel2010) found that artificially creating pressure through listening in noise made native English speakers significantly less accurate in judging whether regular past forms were actually used. There is also some disagreement about whether the long ending (with the extra syllable) or the short endings ([d/t]) are more salient. Bayley (Reference Bayley, Tarone, Gass and Cohen1994) found that the long ending was less successfully produced by both HP and LP Chinese learners, although HP speakers used both forms more frequently (but still below half of the time) than LP speakers. Both long and short regular forms were used less frequently than irregular forms. In contrast, Bell et al. (Reference Bell, Trofimovich and Collins2015) found that the long ending was heard more successfully than the short endings. The two studies used different samples, with different age ranges and first languages, but the lack of agreement on such a basic issue means we have a lot to learn about the pronunciation of <ed> endings and their effects on intelligibility.
Verbal, Plural, and Possessive Inflections: The –s/’s Endings
–s/’s refers to multiple spoken forms in English: regular plurals (e.g., dogs, cats, matches), possessives (e.g., Jack’s car), third-person singular present verbs (e.g., He walks versus I/We/You/They walk), and the contracted auxiliaries “is” and “has” (e.g., He’s crazy, She’s been sleeping). The first three –s/’s endings are all inflectional suffixes, while the last is a spoken variant that has a grammatically equivalent full form (e.g., She is crazy, She has been sleeping). From a pronunciation point of view, all the forms are identical. From an acquisitional point of view, the forms differ considerably in how early they are acquired (Brown, Reference Brown1973; Krashen et al., Reference Krashen, Houck, Giunchi, Bode, Birnbaum and Strei1977), a process that may be sensitive to L1 background. Luk and Shirai (Reference Luk and Shirai2009) address the acquisition orders of two of the three –s endings (possessives and plurals) and show that the orders are not universal but are deeply affected by the L1 of the learner. In addition to research on Spanish speakers, they compare findings to similar studies of Japanese, Chinese, and Korean learners. The article does not look at pronunciation issues, though it primarily draws on studies of spoken rather than written language. If a structure is subject to positive transfer, it tends to be acquired earlier; if not, it is acquired later. Thus, L2 learners of English may pronounce some –s endings well while pronouncing others poorly. For the noninflected form, learners often pronounce full forms rather than contractions. Such a pronunciation may be pragmatically quite different in effect from not pronouncing the auxiliary in any form.
The –ed and –s/’s morphemes are parallel in how their morphophonological variants are patterned. Each has three phonologically conditioned variants (Figure 4.1). One variant is pronounced with an extra syllable (which for convenience I will call the long ending), and the other two are pronounced with voiced and voiceless sounds but without an extra syllable (the short endings). Each variant is environmentally conditioned by the sound that precedes it (i.e., progressive assimilation). The rules that describe which variant will occur are relatively straightforward. After accounting for the environment that creates the extra syllable (see Figure 4.1), the short endings assimilate in voicing to the preceding environments ([d/t] and [z/s]). These sound pairs are phonologically distinct in English (e.g., den/ten, led/let; zip/sip, lose/loose). Morphologically, however, the voicing distinction is less important and the two variants function more like allophones than phonemes. For example, final voiced variants may devoice in different environments. This can be important in how the morphemes should be treated in a pronunciation lesson (Dickerson, Reference Dickerson1990).
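Because the conditioning environments just described are regular, the variant selection can be stated as a small decision procedure. The sketch below is an illustrative restatement of the standard rules, not the chapter’s own material: the function names are my own, and the input is assumed to be the final sound of the base word in IPA (not its spelling), which sidesteps the sound/spelling mismatches discussed above.

```python
# Illustrative sketch (hypothetical, not from the chapter): selecting the
# phonologically conditioned variants of the -ed and -s/'s endings from
# the FINAL SOUND of the base word, given in IPA.

SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}          # trigger the long -s/'s ending
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ"}  # voiceless consonants of English

def ed_variant(final_sound: str) -> str:
    """Return the pronunciation of the -ed ending after final_sound."""
    if final_sound in {"t", "d"}:
        return "ɪd"   # the long ending (extra syllable): wanted, needed
    if final_sound in VOICELESS:
        return "t"    # voiceless short ending: laughed, slapped
    return "d"        # voiced short ending: loved, begged, played

def s_variant(final_sound: str) -> str:
    """Return the pronunciation of the -s/'s ending after final_sound."""
    if final_sound in SIBILANTS:
        return "ɪz"   # the long ending: matches, judges
    if final_sound in VOICELESS:
        return "s"    # voiceless short ending: cats, slaps
    return "z"        # voiced short ending: dogs, He's
```

The ordering of the checks mirrors the rule as usually taught: account first for the environment that creates the extra syllable, then let the short endings assimilate in voicing to the preceding sound.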
These variant forms may often be a challenge for learners from different L1s, both in pronunciation and in acquisition. For pronunciation, learners may pronounce spelled forms invariantly (e.g., the –ed should always include a vowel because it is spelled that way). Alternatively, learners may delete sounds when the morphemes are part of final-consonant clusters. Deletion, especially of the [t/d] or [s/z], the carrier of the grammatical information, is likely to negatively affect intelligibility.
Regarding acquisition, pronunciation is not the only difficulty for L2 learners, nor may it even be the correct problem to address when learners mispronounce the endings. Morphologically diverse forms are not all acquired at the same rate (even for L1 English-speaking children), and indeed, some forms may be particularly resistant to being acquired by L2 learners. This is especially true of the verbal –s inflection (e.g., He laughs a lot), which often shows errors even in very advanced learners of English. Alternatively, mispronunciations of endings may continue to occur when advanced L2 users know the rules for the form and can produce the form almost perfectly in a written test (Hawkins & Liszka, Reference Hawkins, Liszka, van Hout, Hulk, Kuiken and Towell2003). Of course, pronunciation may also sometimes be the central issue. For example, some Spanish speakers, whose L1 shows a close correlation between pronunciation and the way sounds are spelled, may overgeneralize and pronounce most –ed endings in L2 English with an extra syllable. Their success in producing past forms shows that such learners have acquired the English past tense but are using the wrong phonological forms and would benefit from pronunciation practice.
To sum up, L2 learners’ pronunciation of morphophonological variants in English may be affected by many causes. They may mispronounce because they:
1. have not yet fully acquired the form;
2. make a wrong guess about how the pronunciation system corresponds to the orthographic representation;
3. struggle to realize the form in a particular environment (e.g., as part of a consonant cluster);
4. are paying attention to other parts of their spoken language and do not prioritize this feature of English in their communication at that time.
Because morpheme mispronunciations are complicated in their causes, it is not enough to understand what learners do in pronouncing these forms. Instead, it is essential to understand why they are not pronouncing the forms correctly. It is not easy to tell the why and the what apart. We see only the surface performance, which may not give a clear picture of cause. Learners may seem to pronounce the forms invariantly, pronounce some forms correctly while others are rarely correct, delete them in different environments than L1 users do, delete –ed endings after voiced obstruents (begged) while pronouncing –ed endings with an extra syllable after voiceless obstruents (Koffi, Reference Koffi, Levis and LeVelle2010), or hear them at little above chance in the absence of adverbial cues (Bell, Trofimovich, & Collins, Reference Bell, Trofimovich and Collins2015).
The results of the research on grammatical morphemes indicate that errors in how they are realized in speech are very common. Bayley’s (Reference Bayley, Tarone, Gass and Cohen1994) study of Chinese speakers showed that even the HP speakers (TOEFL [test of English as a foreign language] above 550) produced only 40 percent of nonsyllabic regular endings and 26 percent of syllabic endings correctly. Hawkins and Liszka (Reference Hawkins, Liszka, van Hout, Hulk, Kuiken and Towell2003) found similar weakness in the pronunciation of regular endings, even though the subjects in their study demonstrated native-like knowledge of the rules in a written test of their grammatical knowledge. Such findings suggest that the way that –ed and –s/’s endings are taught in pronunciation materials must take into account acquisition issues, perception, and production. Being able to demonstrate knowledge of the morphophonological rules is not sufficient for spoken performance.
Perhaps more importantly, we do not know the extent to which errors in the spoken form of the endings affect intelligibility or comprehensibility. Past-tense forms in speech may be redundant because of the presence of time adverbs, and markers of possession, plurality, or present tense in verbs may also be redundant in speech, or listeners may be able to understand the meaning without clearly pronounced morphemes. The goal in this book is to address pronunciation as it affects intelligibility and comprehensibility, but we do not have any research that directly addresses the pronunciation of morphophonological variants. We can make some guesses by analogy to studies of final consonants. Deletions of word-final consonants (Im & Levis, Reference Alameen, Levis, Reed and Levis2015) or sounds at the end of consonant clusters seem to affect understanding (Zielinski, Reference Zielinski2008), but we do not know if the same types of errors that are strictly phonological (pronouncing cold as coal) have the same effect when the errors affect morphosyntax (pronouncing called as call). One expects that morphological errors will be at least as serious as phonological, but the additional layer of meaning potential may have a different effect on understanding.
Jenkins (Reference Jenkins2002) also suggests that word-final insertion errors are less likely to impact intelligibility than deletion errors, but her proposal has not been systematically tested. It is alternatively possible that the addition of a syllable where it is unexpected could lead to loss of comprehensibility (e.g., walked pronounced as [wɔkəd] could be heard as walk it) when the surrounding linguistic context is not sufficiently redundant (e.g., with mismatching adverbial forms – see Bell et al., Reference Bell, Trofimovich and Collins2015). Even if such errors do not seriously impact intelligibility (that is, they do not cause complete loss of understanding), such errors in the use of grammatical morphemes may affect comprehensibility. Grammatical forms such as those expressing subject–verb agreement are among the most easily noticeable (Hopp, Reference Hopp2010), and inconsistent use of these inflections may slow language processing.
Researching the Pronunciation of Clusters and Grammatical Morphemes
The ways in which we teach the pronunciation of grammatical morphemes are badly in need of rethinking, and to do so we need targeted research. The amount of research done on the –ed endings (and other past-tense forms) far outstrips that done on the –s/’s endings, but very little of this has looked at how best to increase perception and production skills in L2 learners. We know that both production and perception skills are often impaired in L2 learners of English. We just don’t seem to know what to do about it beyond giving them rules that look like they come from introductory linguistics textbooks. Some pedagogical research questions include:
1. What are the errors that L2 learners demonstrate for clusters and for different grammatical morphemes? How do they vary by proficiency level? By L1?
2. To what extent do various errors impact comprehensibility? Are errors in plurals, for example, equal to those of third-person singular verb endings?
3. To what extent do nonnative speakers (NNSs) drop final morphemic endings in spontaneous speaking and in writing? What is the effect of planning time on production?
4. Do different –s/’s ending and –ed ending errors affect intelligibility and/or comprehensibility similarly? Do errors in clusters that do not have grammatical morphemes affect intelligibility differently?
5. What kinds of rule-based approaches are more effective in teaching the pronunciation of grammatical morphemes (as in Brutten, Mouw, & Perkins, Reference Brutten, Mouw and Perkins1986)? Do rule-based approaches work more effectively than approaches based on perception/production practice?
Conclusion
Consonant clusters share in some of the generalizations that are true of individual segmentals, but it is also clear that they may affect intelligibility in different ways because of their relationship to syllable structure, a vastly underexplored area in L2 pronunciation teaching and learning (Cardoso, Reference Cardoso, Kang, Thomson and Murphy2017). It is not yet clear whether deletion or epenthesis in consonant clusters affects intelligibility more, or whether both are equally serious types of errors. One process (deletion) removes sounds that may assist listeners in identifying the intended word, while the other (epenthesis) typically preserves the segmentals of a word but adds one or more extra syllables. Until we know more, it is safest to assume that any change to syllable structure is likely to impact intelligibility.
Misplaced word stress in English can stop communication completely. When a word, especially a word central to the understanding of the message, cannot be recognized, listeners may stop all other processing to decode the word that was not understood. For instance, in a study I ran on the ways that teaching assistants (TAs) turned a written text into spoken language, TAs were given a paragraph from a basic physics text on the types of energy and how they were related. They were given a short time to prepare their spoken presentation, which was video-recorded. A research assistant (RA) transcribed each presentation. In one case the RA could not identify a word in the first sentence of a presentation from a speaker from India (“Well, we have a very good ____________ today”). We knew the topic, and because the sentence was otherwise grammatical, we knew what the category of the word should be, but three of us could not decode the word over several days of listening. The three-syllable word was stressed on the middle syllable, which sounded like most or must. Finally, one of the researchers began to break down the segmentals, coming up with things like at most here, finally leading to the word atmosphere, a word identified by Bansal (Reference Bansal1969, p. 227) as likely to be stressed on the middle syllable in Indian English.
There were multiple reasons for our difficulty in understanding, including the word choice itself. Instead of the more expected word weather, the speaker used a climate-related word that did not fit the expected collocation related to talking about the weather in an informal way. Also, because the first sentence came out of the blue, and we only knew the general topic of the presentation – energy – we had insufficient context to interpret the unexpected word. But our failure to recognize the word stemmed primarily from its unexpected word stress. As listeners, we tried to decode it based on the stressed syllable, which led us down the wrong path. Indeed, there is a large amount of evidence that listeners use pronunciation characteristics of spoken words (the segmentals and the prosody) to identify the group of possible words that are likely before a word is fully articulated. The group of possible words changes as listeners are able to access more information about the utterance, including the information provided by the stressed syllables (e.g., van Donselaar, Köster, & Cutler, Reference van Donselaar, Köster and Cutler2005). But if the wrong cohort is activated, intelligibility or comprehensibility will be compromised.
This chapter will build a case for why word stress is essential in an intelligibility-based approach to teaching pronunciation. It does not provide detailed recommendations for how to teach word stress, where to start, what to ignore and what to emphasize, or the types of activities that are most likely to be successful. Those decisions, while obviously important, will be addressed, at least to some extent, in later chapters and will depend on the L1 of the learners, their proficiency levels, the context in which they are learning, their age, and other factors (e.g., Murphy, Reference Murphy2004). The chapter will also argue that word stress is important for intelligibility in English as a second or foreign language (ESL or EFL) and ELF (English as a lingua franca) contexts because of the impossibility of separating stress and segmentals in regard to intelligibility (Zielinski, Reference Zielinski2008), evidence that stress affects both native and nonnative listeners (e.g., Field, Reference Field2005; Richards, Reference Richards2016), and findings that show how misplaced word stress can severely impact intelligibility for learners from particular L1s (e.g., Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012).
The Role of Word Stress in Spoken Word Recognition
One of the strongest arguments for the importance of word stress in L2 pronunciation teaching is the evidence that stress patterns are critical for word recognition (Cutler, Dahan, & van Donselaar, Reference Cutler, Dahan and van Donselaar1997). That is, words spoken with expected stress patterns are recognized more quickly and accurately than those with unexpected stress patterns. In contrast, words spoken with incorrect stress patterns will not be recognized as quickly (that is, they will be less intelligible or comprehensible), will be heard as different real words (perhaps leading to loss of intelligibility or comprehensibility), or they will not be understood at all (a loss of intelligibility). Thus, mis-stressed words can lead to loss of understanding in multiple ways. For some words, however, it is possible that mis-stressing may not affect intelligibility or comprehensibility at all, such as with noun/verb stress pairs, e.g., PERmit/perMIT (Cutler, Reference Cutler1986).
Spoken language has no specified spaces between words, and from this continuous acoustic information listeners decode speech into its component meaningful units. This requires activation of the representation of intended words and the inhibition of words that were not spoken. In other words, possible words compete for recognition based on both segmental and suprasegmental information made available in real time. To take a simple example of a word spoken in isolation, conjure, the [k] will immediately bring to mind words starting with that sound, which will then be trimmed by the vowel [ɑ] to include only words starting with those two sounds. The [n] further narrows the possible words activated to contact, constants, conjugate, converse, etc., and the next sound, [dʒ], will further limit the possible words to a set including conjure, conjugal, conjugate. The next sound [ɚ] will (in American English, at least) lead to recognition of the word conjure or one of its affixed forms (conjured, conjures, conjuring). This kind of high-speed decision-making in word recognition has been so convincingly demonstrated by a large number of research studies that “competition is an established component of human spoken-word recognition” (Cutler, Reference Cutler2012, p. 120). L1 listeners are not only superb at making these kinds of decisions for words in isolation, but also for words embedded in continuous speech, using not only phonemic and suprasegmental information but also phonetic co-articulatory information (for example, [k] already sounds different preceding [ɑ] than preceding [i], even before the vowel itself begins).
One way of talking about what happens in recognition is that acoustic information activates a cohort (Marslen-Wilson & Welsh, Reference Marslen-Wilson and Welsh1978) of possible words, and that more information simultaneously facilitates certain words and inhibits others (e.g., a particular vowel immediately inhibits words with all other vowels that follow the onset). In word recognition, listeners use phonological information before they make use of semantic information.
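The incremental narrowing of a cohort described above can be sketched computationally. The following is a minimal illustration, not a model from the cited research: the lexicon and its simplified phonemic transcriptions are invented for the example, and real cohort models also weight candidates by frequency and allow graded activation rather than simple elimination.

```python
# Toy sketch of cohort-style incremental word recognition:
# each incoming phoneme eliminates candidates inconsistent with
# the input so far. The lexicon below is an illustrative assumption.
LEXICON = {
    "conjure":   ["k", "ɑ", "n", "dʒ", "ɚ"],
    "conjugate": ["k", "ɑ", "n", "dʒ", "ə", "g", "eɪ", "t"],
    "conjugal":  ["k", "ɑ", "n", "dʒ", "ə", "g", "ə", "l"],
    "contact":   ["k", "ɑ", "n", "t", "æ", "k", "t"],
    "converse":  ["k", "ɑ", "n", "v", "ɚ", "s"],
    "copper":    ["k", "ɑ", "p", "ɚ"],
    "keep":      ["k", "i", "p"],
}

def cohort(phonemes_heard):
    """Return the words still consistent with the phonemes heard so far."""
    return [word for word, phones in LEXICON.items()
            if phones[:len(phonemes_heard)] == phonemes_heard]

# Each new phoneme trims the candidate set, as in the conjure example.
target = ["k", "ɑ", "n", "dʒ", "ɚ"]
for i in range(1, len(target) + 1):
    print(target[:i], "->", cohort(target[:i]))
```

On this toy lexicon, [k] activates all seven words, [k ɑ n dʒ] leaves the conjure/conjugal/conjugate set, and the final [ɚ] isolates conjure, mirroring the elimination sequence described in the text.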
The Problem(s) of Word Stress
Poldauf (Reference Poldauf1984) talks about the “problem of word stress” in English. By this he means that there are no simple rules or regularities to account for which syllable will receive word stress in a given word. Accounting for English word stress requires a hodgepodge of different patterns related to the length of the word, its lexical category, etymological origin, types of affixation, when the word became part of English, and which variety is being described (e.g., Berg, Reference Berg1999).
English is a free-stress or variable-stress language. This means that, in principle, words of more than one syllable may be stressed on any syllable. Thus, four-syllable words may have their main stress on the first, second, third, or fourth syllable (e.g., MARginally, moNOtonous, indePENdent, reappointEE), although some patterns are more likely than others. Free stress does not mean that the same word can be stressed on any syllable freely. Free-stress languages are typically contrasted with fixed-stress languages, in which the same syllable is stressed in every word, such as Finnish (the first syllable) and Polish (the next-to-last, or penultimate, syllable).
English has minimal pairs based on stress, such as INsight/inCITE or REcord/reCORD. In INsight/inCITE, the segmentals are identical while the stress is not. In REcord/reCORD, the difference in stress also leads to changes in how the vowels are pronounced (i.e., ˈɹɛkɚd /ɹəˈkɔɹd). Such stress minimal pairs are actually relatively uncommon, especially when the segmentals are identical (Cutler, Reference Cutler1986). There are perhaps as many as 100 in English, most of which involve two-syllable words.
Stress also operates in English as a reliable cue to word identification in spoken discourse. About 90 percent of stressed syllables in discourse are onsets to words (Cutler & Carter, Reference Cutler and Carter1987), and stress is important for identifying both single- and multi-syllabic words in the stream of speech. This is true for single-syllable words because content words in English are typically stressed, whereas single-syllable function words are unstressed and are marked by the presence of reduced vowels in their normal spoken forms (e.g., go for sounds like gopher). Multi-syllable words of all sorts, in contrast, always have at least one stressed syllable with a full vowel. When stress is not word initial, the placement of the stressed syllable is often related to affixation; etymologically related words may be distinguished by the placement of the main stress, as in (5.1). Likewise, unrelated words may share similar stressed syllables and segmentals, as in (5.2).
(5.1) PHOto, PHOtograph, phoTOgraphy, photoGRAphic
(5.2) DIStant, DIStaff, DIStal
Lexical stress is thus often important for word identification, but not always so, and it can be manipulated by speakers in relation to other levels of prosody (such as in phrases like KNOWN and UNknown in which the normal stress of the second word is changed to show a discourse contrast). Stress information is almost always preserved in slips of the tongue (Fromkin, Reference Fromkin1971) and slips of the ear such as mondegreens (Content, Dumay, & Frauenfelder, Reference Content, Dumay and Frauenfelder2000).
Acoustic Cues to Word Stress
Acoustic information is central to word recognition. In free-stress languages, stress may be marked by duration, pitch, intensity, vowel reduction, or a combination of these features. Spanish, for example, does not make use of vowel reduction (Flege & Bohn, Reference Flege and Bohn1989), whereas Dutch and English do. English, however, has far more pervasive use of vowel reduction than Dutch. For English, then, there are four features that are associated with stress. Which features do English speakers pay the greatest attention to?
In terms of importance, it appears that English speakers first listen for segmental cues to stress, followed by duration, pitch, and intensity. In other words, listeners privilege segmental cues because they are so reliable in English; prosodic cues are not as consistently consulted because of the reliability of segmental cues. Segmental cues are immediately available and can thus be used by listeners to begin identifying words before prosodic cues become apparent. Van Donselaar et al. conclude that “in word recognition … listeners are unlikely to rely on any dimension of variation that yields little in the way of crucially distinctive information regarding word identity” (Reference van Donselaar, Köster and Cutler2005, p. 269). Because English segmentals yield so much information about stress patterns, English-speaking listeners rely on them more than on prosodic information.
Prosodic cues can be consulted, of course, and they can help listeners with word recognition. Durational cues to stress are noticed by listeners and used as a guide to stress, but though they are often redundant with vowel quality in English, they are less redundant in other stress languages. In fact, speakers of Dutch have been shown to be better at identifying stress in English words based on duration than native English speakers (Cooper, Cutler, & Wales, Reference Cooper, Cutler and Wales2002; Cutler, Reference Cutler2009).
Pitch is often cited as a marker of lexical stress as well, but it is much less reliably used except when a stressed syllable is also accented (Cutler, Reference Cutler, Gybbon and Richter1984), that is, when it receives a pitch accent because of its placement in a phrase, as in (5.3). The word conversation in the first line has primary lexical stress on the SA syllable, but in the sentence it would also be marked with pitch because it is the nucleus of the sentence. In the second sentence, the same syllable is much less likely to be marked by pitch, even though it is still stressed – that is, its first and third syllables have longer durations and full vowel quality, while the second and fourth syllables are reduced in vowel quality, with a shorter duration.
(5.3) It was a boring converSAtion.
      But conversations with her are ALways boring.
Intensity appears to be the weakest cue to stress in English. Zhang and Francis (Reference Zhang and Francis2010) showed that English listeners are able to identify stress from intensity differences, but that intensity is much less useful than vowel quality and the other prosodic cues. They also demonstrated that listeners could make use of all four cues when they were isolated.
Native Speaker Knowledge of Word Stress
There are two primary roles for stress in word recognition. The first involves the way that listeners use stress to identify individual words that differ prosodically. The second is the way that stress allows listeners to segment continuous speech. In English, the second use is critical for lexical access, while the first use of stress is important in some cases but not others.
In the second use of stress, English listeners rely on stress information to identify words in the stream of speech. Cutler and Norris (Reference Cutler and Norris1988) examined the effect of strong syllables (those with full vowels) on the identification of embedded words in nonsense syllables (e.g., recognizing the word mint in mintayf, with the second syllable being strong, in contrast to mintef, where the second syllable had a reduced vowel). They found that listeners were significantly less efficient at recognizing mint when the second syllable was strong. They argued that this required segmentation across boundaries (min-tayf → mint + ayf), whereas the initial strong syllable in mintef required no such re-segmentation. This led them to propose that English listeners treated strong syllables as likely onsets to words in speech.
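The Cutler and Norris proposal amounts to a simple segmentation heuristic: posit a word boundary before every strong syllable. The sketch below is an illustration of that heuristic only, with invented syllable strings and strength labels; it is not the authors' experimental procedure or a full model of lexical access.

```python
# Minimal sketch of a strong-syllable segmentation heuristic:
# start a new candidate word at every strong (full-vowel) syllable.
# Input: list of (syllable, is_strong) pairs; labels are assumptions.
def segment(syllables):
    """Group syllables into candidate words, opening a new word
    whenever a strong syllable is encountered mid-stream."""
    words, current = [], []
    for syllable, is_strong in syllables:
        if is_strong and current:   # strong syllable: likely word onset
            words.append("".join(current))
            current = []
        current.append(syllable)
    if current:
        words.append("".join(current))
    return words

# mintef (strong-weak) stays whole; mintayf (strong-strong) is split,
# stranding "min" and forcing re-segmentation to recover "mint".
print(segment([("min", True), ("tef", False)]))   # ['mintef']
print(segment([("min", True), ("tayf", True)]))   # ['min', 'tayf']
```

The heuristic leaves mintef as a single candidate containing mint, but breaks mintayf at the second strong syllable, which is one way to picture why listeners in the study were slower to find mint in mintayf.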
Fear, Cutler, and Butterfield (Reference Fear, Cutler and Butterfield1995) further examined how listeners interpreted the distinction between strong and weak syllables, especially how listeners classify unstressed, unreduced vowels (e.g., the first vowel in audition). They used five sets of four words that differed in stress and in vowel quality and examined the acoustic characteristics of the vowels and how listeners judged their acceptability when the vowels in each word were extracted and spliced onto another word in the set. For example, the set of words ˈaudiences, audiˈtoria, auˈdition, adˈdition includes initial vowels with primary and secondary stress as well as unstressed vowels with full and reduced vowels. The third category, unstressed unreduced vowels, formed an acoustically distinct category in its duration, F0, intensity, and spectral characteristics. However, in listener perception, unstressed unreduced vowels were grouped more consistently with strong (full) vowels than with weak (reduced) vowels. The authors suggested that the results indicated that listeners make decisions in a binary manner between strong and weak syllables, and that unstressed unreduced vowels are more often classified as strong, presumably leading to more efficient processing of continuous speech. In their words:
Why should listeners make use of a binary distinction between strong and weak syllables? We suggest that listeners will in general prefer to make discriminations which are absolute rather than relational in nature. Absolute judgments can be made immediately; relational judgments require comparison between at least two instances (in this case, two syllables), and hence may involve a delay in making the decision. Studies of spoken word recognition suggest above all that recognition is fast and efficient; recognition decisions are not delayed. If this is indeed the case, then spectral characteristics offer the best basis for an absolute discrimination, on the grounds that category judgments about vowel identity draw upon spectral information.
The pervasiveness of vowel reduction in English means that a large number of spoken words (the great majority of the 100 most frequent words) are function words with reduced vowels (e.g., a, an, in, on, the, can, etc.). Words with non-initial stress are much less frequent, with the effect that a stressed syllable is an extraordinarily reliable cue to the beginning of a content word in continuous speech.
The less consistently exploited use of stress is the first one: using prosody to identify one word in contrast to another. For example, the three words octopus, October, and occur all begin with the same spelling, oc-. The first word has the primary stress on oc-, the second has secondary stress on oc-, while the third oc- is unstressed. The first two vowels are pronounced similarly while the third has a reduced vowel [ə]. Using stress to identify these words means that listeners need to distinguish degrees of stress reliably in the kinds of words that are in an activated cohort of possible words beginning with oc-. In English, there is little evidence that this kind of activation happens as a result of prosodic information.
The relative importance of acoustic cues in word recognition means that English speakers will use segmentals to make decisions about word identification before using prosodic information, but that both may be useful. As Cutler says, “activation will be affected by whatever aspects of the incoming signal serve usefully to identify the spoken word form” (Reference Cutler2012, p. 97). In a language like Spanish that does not mark stress differences segmentally, listeners rely on prosodic features to distinguish stress. In Dutch, which relies less on vowel quality changes than English, prosodic information is also important in word recognition. Research has demonstrated that English speakers can use a variety of prosodic features in word recognition (Zhang & Francis, Reference Zhang and Francis2010), but in general, English listeners do not have to use prosody because vowel quality is a highly reliable marker of stress differences. Cutler puts it this way: “English listeners do not seem to bother much with suprasegmental information in lexical activation and selection. For them, segmental differences … are far more important than suprasegmental differences” (Reference Cutler2012, p. 120).
Features that Influence Native Judgments of Word Stress
What kinds of word-stress features help listeners make decisions about the correct identification of a word? Research has examined this question for both L1 and L2 users of English. Three characteristics have been particularly evident: syllabic structure, word class, and phonological similarity.
The first parameter, syllabic structure, signifies the stress-attracting properties of different syllable types, especially the propensity of long vowels to attract stress more than short vowels and of coda consonant clusters to attract stress more than singleton coda consonants (e.g., append versus open). The second parameter, word class, is seen in the strong tendency of two-syllable nouns to have initial stress in English (e.g., cable) and the lesser tendency of two-syllable verbs to have final stress (e.g., reveal). A variety of studies (e.g., Davis & Kelly, Reference Davis and Kelly1997) have shown that English speakers are sensitive to this statistical regularity and that whether the word is used as a noun or verb influences their choice of stress pattern (when faced with nonsense words). The final factor, phonological similarity, appeals to analogy as a factor in stressing unknown words. For example, if someone knows the word a carton and is faced with the unknown word a varton, then similarities in the phonological (and orthographic) structure of the words may influence the choice of initial stress on the new word.
Guion, Clark, Harada, and Wayland (Reference Guion, Clark, Harada and Wayland2003) examined the relationship of these three features on stress placement decisions made by speakers of English. They found that all three parameters helped to explain how English speakers made decisions about how to stress two-syllable nonsense words. (Nonsense words are made-up words that speakers cannot, by definition, be familiar with. They are used to make sure that subjects cannot be making decisions based on prior knowledge of the word.)
To determine the influence of these three possible parameters for word stress in English, the researchers created forty two-syllable nonsense words in four combinations based on syllables that differed in structure (i.e., CV–CVCC, CVV–CVCC, CV–CVC, CV–CVVC) to test the effect of syllabic structure. Each syllabic structure combination was represented by ten nonsense words. Both perception and production were tested. In the production task, subjects were presented with the syllables spoken in isolation and asked to repeat them together (e.g., they heard “dɛ” then “kɪps” and were asked to repeat the two syllables in the same order as one word). For syllabic structure, they said the nonsense words in a neutral frame (“Now I say _______”).
To test for the effect of word class, the study used a perception task. Subjects were presented with the nonsense words with different stress patterns (read by a trained phonetician) in both noun and verb frames (“I’d like a _____” and “I’d like to ______”). They were asked to listen to two readings of a single frame (e.g., I’d like a ˈdɛkɪps versus I’d like a dɛˈkɪps; I’d like to ˈdɛkɪps versus I’d like to dɛˈkɪps) and decide which of the two readings sounded more like a real English sentence. Each stress pattern was presented in both noun and verb frames.
Phonological similarity was determined by asking subjects if the nonsense word reminded them of any real words. This open-ended task asked subjects to write down real words they thought of when they heard the syllable pairs from the syllabic structure task.
Guion et al. (Reference Guion, Clark, Harada and Wayland2003) found that all three of these factors were active in the decisions made by native English speakers in stressing nonsense words, with phonological similarity being the strongest influence. This confirmed previous research, and they reported that “analogical predictions won over rule-based predictions for cases in which analogical stress predicted first syllable stress on nouns or second syllable stress on verbs. While analogical effects were seen for all word lengths, longer words were more likely to receive the stress of phonologically similar words” (Reference Guion, Clark, Harada and Wayland2003, p. 407).
Features that Influence Nonnative Judgments of Word Stress
Does L1 background influence the way listeners perceive stress in the L2? This question has a complicated answer, given the wide variety of L1 sound systems and the way they do or do not use stress or its acoustic correlates. Speakers of languages like Dutch, in which stress is similar to English, seem able to transfer their ability with stress to their use of English. They even seem better than English speakers in attending to prosodic features in judgments of stress placement, while also showing evidence of being able to attend to segmental correlates of stress such as vowel quality (Cooper, Cutler, & Wales, Reference Cooper, Cutler and Wales2002). Speakers of Spanish, another free-stress language, make effective use of prosodic features for word stress in English but not segmental ones since Spanish does not use vowel reduction (Flege & Bohn, Reference Flege and Bohn1989). French speakers, in contrast, are said to be stress-deaf in learning Spanish (e.g., Dupoux, Sebastián-Gallés, Navarrete, & Peperkamp, Reference Dupoux, Sebastián-Gallés, Navarrete and Peperkamp2008), leading the researchers to suggest that it is extremely difficult for French speakers to encode free-stress patterns into their representation of L2 vocabulary. Another study by Peperkamp and Dupoux (Reference Peperkamp and Dupoux2002) indicates that not all fixed-stress languages behave in the same way. While they found evidence that Hungarian and Finnish speakers also exhibit a degree of stress-deafness in perception, speakers of Polish do not. Archibald (Reference Archibald1998) argues that interlanguage differences result from differences in the ways that languages use different prosodic parameters.
Speakers of non-stress languages like Japanese and Chinese present a different case. Stress is an unfamiliar feature in their L1 representations, but some of the same acoustic correlates used for stress in English are also present in the L1 (such as pitch in Chinese and Japanese and duration in Japanese). Archibald (Reference Archibald1997a) found that one Japanese and three Chinese subjects seemed to be successful with stress in English, but the study was limited by the small number of subjects. Archibald’s other studies (e.g., Reference Archibald1992, Reference Archibald, Rutherford, Clahsen and Eckman1993a, Reference Archibald1993b, Reference Archibald, Young-Scholten and Hannah1997b) hypothesized that L2 learners transferred the metrical parameters of their L1 to the production and perception of English word stress. Specific metrical parameters included syllable weight (heavy syllables such as CVV and CVC[C] are more likely to attract stress than CV, which are called light syllables), lexical class (e.g., nouns and verbs are treated differently in regard to stress placement), and potential affixation (e.g., deciding that a spelling in the L2 should be treated as a look-alike suffix in the L1).
Zhang and Francis (Reference Zhang and Francis2010) examined whether Mandarin-speaking subjects made use of vowel reduction, duration, pitch, and intensity as clues to stress perception in English. Using synthesized tokens of the word pair DEsert/deSERT, the study systematically varied vowel quality, duration, intensity, and pitch to isolate the contribution of each feature to stress recognition. Because Mandarin uses pitch for lexical tones but does not use vowel reduction, their hypothesis was that Mandarin L1 listeners should find it difficult to use vowel quality as a cue to stress placement. This hypothesis was not upheld. Mandarin listeners and English listeners both showed strong sensitivity to vowel quality as a marker of stress. In each experiment, vowel quality was at least equally important in perception of stress for both Mandarin and English-speaking listeners. The study argued that even though the Mandarin speakers were not able to consistently produce reduced vowels, they were able to perceive and use them in identifying stressed syllables. The results must be qualified somewhat in that the perception task used only one word pair, and listening was done in the most advantageous conditions possible, but the evidence still suggests that lack of L1 experience with particular correlates of stress does not mean L2 learners cannot use them.
Tremblay and Owens (Reference Tremblay and Owens2010) examined the production of lexical stress in two- and three-syllable words by L2 French learners of English and by English speakers. The L2 learners were divided by English proficiency level (intermediate, low-advanced, and high-advanced). In particular, the study examined how well the French L2 learners attended to heavy syllables in non-initial (non-trochaic) stress patterns. All data were collected as nonsense-word productions in which two or three syllables had to be spoken together in a noun frame.
In two-syllable words, all groups preferred initial stress, with the intermediate group preferring it significantly less frequently than the other three groups. The three-syllable words, however, showed differences across all groups. Native English speakers almost always stressed the middle (heavy) syllable and avoided the final syllable. The French L2 learners at every proficiency level, in contrast, produced initial stress around half the time. The use of middle stress increased with proficiency, and only the highest-level L2 learners were similar to the English speakers in avoiding final stress. The L2 learners who were more successful in producing English stress patterns used duration as their primary cue, whereas those who were less successful seemed to attend to pitch as a primary cue. However, none of the learner groups was fully successful, given the dominance of initial stress even on the three-syllable nonsense words in which English speakers stressed the penultimate heavy syllable. The learners thus also seemed relatively insensitive to syllable weight as a cue to stress placement.
A series of production studies replicating the work of Guion et al. (Reference Guion, Clark, Harada and Wayland2003) looked at whether L2 learners attended to the same features in choosing which syllable to stress. Guion, Harada, and Clark (Reference Guion, Harada and Clark2004) examined how two groups of Spanish speakers made stress placements. Participants included early Spanish English bilinguals (those who learned English before age six) and late bilinguals (those who learned in their teens or later). Spanish, like English, is a free-stress language, but it does not have the same connection between stress and vowel reduction, since Spanish is a five-vowel language without vowel reduction. Early bilinguals behaved like NSs of English in almost every way, showing the same sensitivity to lexical class and syllabic structure (except that they showed no preference for long vowels in final syllables). Late bilinguals, on the other hand, behaved quite differently. The noun–verb effect was restricted to one of the four syllabic structures (CVCVC nonsense words, e.g., demip), and the late bilinguals used a higher rate of initial stress for nouns than either of the other two groups, suggesting a hypersensitivity to initial stress on nouns. These findings show that age of acquisition (AOA) may be important to bilinguals’ ultimate success in using English stress patterns for unknown words.
Guion (Reference Guion2005) extended this work to Korean English bilinguals. Korean, unlike Spanish, does not use word stress as a phonological parameter (using instead phrasal accents that are not tied to lexical items), and thus the question is whether Korean English bilinguals make use of any of the same cues that NSs of English are sensitive to. Early Korean bilinguals (EKBs) and late Korean English bilinguals (LKBs) were examined and compared to NSs of English. Results showed that syllabic structure was a significant predictor for NSs, EKBs, and LKBs, but with reduced effects in that order. In other words, NSs showed the strongest sensitivity to syllabic structure, followed by somewhat reduced effects for EKBs and greatly reduced (but still present) effects for LKBs. Within syllabic structure, long vowels were the strongest predictor of stress placement for all groups. The LKB group was not sensitive to any other cue, while the EKBs also made use of coda clusters as a cue to stress placement. All three groups also showed a relationship between phonological similarity and the pronunciation of nonsense words. In regard to lexical class, LKBs were not sensitive to the tendency of nouns to have initial stress and verbs to have final stress. This is a striking difference from the late Spanish English bilinguals in Guion et al. (Reference Guion, Harada and Clark2004), who showed a hypersensitivity to initial stress for nouns. This is evidence that L1 background may affect what learners notice about the English stress system.
The methodology pioneered by Guion et al. (Reference Guion, Clark, Harada and Wayland2003) was also used by Wayland, Landfair, Li, and Guion (Reference Wayland, Landfair, Li and Guion2006), who looked at intermediate Thai learners’ acquisition of English word stress following the same study procedures as the earlier studies. Thai, unlike Spanish and Korean, is a tone language; like Korean, it does not use word stress phonologically. This study did not look at the effect of AOA on acquisition, instead examining a single group of learners who were all at the same general level of proficiency, with an average stay in the United States of 1.4 years. In regard to the tendency of long vowels (i.e., heavy syllables) to attract stress, Thai learners’ use of syllable structure in the production of English non-words was native-like (p. 298), a finding interpreted as transfer from Thai, since long vowels are also associated with tones in Thai. In perception, however, there was no connection between lexical class, syllabic structure, and stress placement. Only the effect of phonologically similar real words had a significant connection to how Thai learners stressed non-words, leading the researchers to argue that “native Thai speakers relied heavily on the stress patterns of already known words” (p. 298).
Chen (Reference Chen2013) adapted Guion et al.’s (Reference Guion, Clark, Harada and Wayland2003) methodology for Chinese learners from Hong Kong. These learners come from a tonal (non-stress) language but learn English in a social context in which English has a long history. Thus, they provide a different view of the influence of these three factors. In addition to the use of the same syllables/nonsense words used in the earlier study, Chen also included perception and production of real English words, eighteen that followed regular stress patterns and eighteen that did not. Subjects performed at around 90 percent accuracy on the real words, suggesting that they had internalized the stress patterns of words that they already knew. For the nonsense words, however, there was evidence only for the effect of lexical class. Syllabic structure and phonologically similar words did not seem to influence Chinese L2 subjects’ stress decisions. Chen argued that this may have been due to Chinese learners storing stress on a word-by-word basis rather than according to rule.
These findings can be summarized in the words of Wayland et al. (Reference Wayland, Landfair, Li and Guion2006):
results … strongly suggest that aspects of knowledge influencing the assignment of stress patterns in English can be acquired by adult L2 learners to a varying degree of success depending on such factors as differences between English and the learners’ L1 prosodic system, age of acquisition, and perhaps English proficiency.
It is important to point out that these studies concern acquisition: in none of them is there evidence that the learners were taught these regularities of English word stress. Although teaching the patterns might well result in improved perception and production, there is currently no evidence about how learners would improve as a result of instruction.
The acquisition of stress may also be affected by higher-level prosodic constraints. Visceglia, Tseng, Su, and Huang (Reference Visceglia, Tseng, Su and Huang2010) looked at Taiwanese L2 learners’ production of English word stress in two-, three-, and four-syllable words under two conditions: when word stress alone was targeted (in carrier sentences) and when word stress was embedded within higher-level prosodic demands (at phrase boundaries and in contexts requiring narrow focus – e.g., It’s in JANuary that I need it). They measured F0, duration, and intensity in the productions of NSs of American English and of Taiwanese learners in both conditions. While the L2 learners approximated the native controls in carrier sentences, their production diverged when higher-level prosody was required: whereas native speakers continued to make word-stress differentiations in higher-level prosodic contexts, lexical stress markers were no longer evident for the L2 learners when they had to pay attention to phrase boundaries or narrow focus.
Another study, by Tseng and Su (Reference Tseng and Su2014), examined how Taiwanese learners used F0 and duration as cues to lexical stress production in English, again compared to NS controls. There was little difference between the two groups in how they used duration, but the Taiwanese learners did not use F0 to distinguish primary, secondary, and tertiary stresses. A third study, by Tseng, Su, and Visceglia (Reference Tseng, Su and Visceglia2013), employed Pairwise Variability Index measures to examine word-stress parameters (F0, duration, and intensity) in the speech of Taiwanese and English speakers. The Taiwanese L2 learners differed most from the English speakers in F0 contrasts, and the contrasts between stressed and unstressed vowels were greater for the English speakers on all parameters.
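The Pairwise Variability Index comes in several variants; the normalized version (nPVI) is the most widely used. It compares each pair of successive interval values and normalizes the difference by the pair’s mean, so that overall speech rate cancels out. A minimal sketch follows; the duration values are invented for demonstration and are not from the study:

```python
def npvi(values):
    """Normalized Pairwise Variability Index.

    `values` is a sequence of successive interval measurements
    (e.g., vowel durations in ms, or per-syllable F0 or intensity).
    Each successive pair's absolute difference is normalized by the
    pair's mean; the result is the average of these ratios x 100.
    """
    if len(values) < 2:
        raise ValueError("nPVI needs at least two intervals")
    diffs = [abs(a - b) / ((a + b) / 2) for a, b in zip(values, values[1:])]
    return 100 * sum(diffs) / len(diffs)

# Invented vowel durations (ms): alternating long/short values, as in
# strongly stress-marked speech, yield a high nPVI ...
stressed_speech = [180, 60, 170, 55, 160, 65]
# ... while near-uniform durations yield a low nPVI.
uniform_speech = [100, 95, 105, 98, 102, 97]

print(npvi(stressed_speech) > npvi(uniform_speech))  # True
```

On measures like this, the finding that stressed–unstressed contrasts were greater for English speakers corresponds to higher variability indices for the NS group.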
Clearly, there is much that remains to be done in understanding how different groups of L2 learners perceive stress, but Cutler suggests possible challenges with such research (Reference Cutler, Reed and Levis2015, pp. 120–1):
In perception, non-native listeners … may or may not match to the listening strategies encouraged by the probabilities of English; where they do not match, they will generate speech perception difficulty unless listeners can succeed in inhibiting their use. At the word recognition level, such perceptual problems fall into three principal groups: pseudo-homophony, spurious word activation, and temporary ambiguity … Pseudo-homophones are not a serious problem for the non-native listener (or indeed for native listeners processing non-native pronunciation), simply because … every language contains many homophones and all listeners have to be able to understand them by choosing the interpretation appropriate to the context.
Spurious lexical activation and prolonged ambiguity are more serious problems. The first occurs when embedded “phantom words” are activated for the non-native listener and produce competition that native listeners are not troubled by. Remaining with the /r/–/l/ phonemic contrast, an example is competition from leg in regular. Such extra activation and competition has been abundantly demonstrated in non-native listening (Broersma, 2012; Broersma & Cutler, Reference Broersma and Cutler2008, Reference Broersma and Cutler2011). The second occurs when competition is resolved later for the non-native than for the native listener (e.g., register is distinguished from legislate only on the sixth phoneme, rather than on the first). This phenomenon too has been extensively documented.
Addressing these challenges will make it more possible to connect findings from native listening studies (e.g., Cutler, Reference Cutler2012), to L2 problems with lexical access, to studies of intelligibility, especially in regard to how L2 listeners process speech from L1 and other L2 speakers.
Word Stress, Intelligibility, and Comprehensibility
As argued at the beginning of this chapter, stressing an unexpected syllable can impair intelligibility and comprehensibility. But misunderstanding is not inevitable. Unexpected stress can also signal regional or social accent (such as saying INsurance rather than inSURance, or FRUStrate instead of frusTRATE). However, it is clear that misplaced word stress can lead to loss of intelligibility – that is, the word will not be recognized at all, as in the atMOSphere example at the beginning of this chapter. In other cases, misplaced word stress may lead to greater difficulty in processing speech (that is, loss of comprehensibility). It appears that there is also a middle ground between intelligibility and comprehensibility (perhaps what Zielinski [Reference Zielinski2008] calls reduced intelligibility) in which words are initially not recognized but then are decoded within a short time. It is also possible that some misplaced word stress will not cause serious processing difficulties, such as some noun–verb minimal pairs like INsult/inSULT (Cutler, Reference Cutler1986) or words like ˈEStiˌmate pronounced by learners as ˌestiˈMATE, in which the major and secondary stress switch locations but both retain their full vowel quality (Slowiaczek, Reference 290Slowiaczek1990).
In general, loss of intelligibility also leads to loss of comprehensibility, but loss of comprehensibility does not mean that speech will be unintelligible. Indeed, this situation often occurs when we listen to things we do not have the background knowledge to understand (for me, a lecture on physics would qualify). I may be able to understand the words being spoken and even write them down, but not be able to process the message easily. Listening to the same lecture in German, however, will for me result in loss of intelligibility at both the lexical and message level.
Mis-stressed Words and Loss of Intelligibility or Comprehensibility
Misplaced stress in English words can include movement of primary stress to a less stressed syllable or a change of stress from an unstressed to a stressed syllable (involving a change in vowel quality as well). These variations affect understanding in different ways. When stress changes from one full vowel to another, it seems at most to slow down processing.
In one study involving primary and secondary stressed vowels (that is, with no reduced vowels), Slowiaczek (Reference 290Slowiaczek1990) conducted a series of experiments to examine the effect of correct and incorrect stress placement on spoken word recognition for two-, three-, four-, and five-syllable words. Some of the experiments included noise as a variable. All experiments examined words that had two stressed syllables (one primary and one secondary). Incorrectly stressed words involved putting primary stress on the secondary stressed syllable (e.g., REScue → resCUE). Subjects listened to the words and typed them out.
Results showed that noise affected the accuracy of word recognition, with greater noise leading to less accurate recognition, but there was no effect of noise on the level of recognition for correctly and incorrectly stressed words. That is, for the same level of noise, subjects typed out the intended words at the same rate regardless of stress placement. Further experiments employed shadowing tasks in which subjects repeated words as quickly as possible when they were presented over loudspeakers. In three experiments, correctly stressed words were repeated more quickly than incorrectly stressed words, which participants responded to more quickly than nonsense words. A final response-time experiment asked subjects to classify auditory stimuli as words or nonsense words. The study employed eighty-eight words (half correctly and half incorrectly stressed) to create eighty-eight equivalent nonsense words by changing one phoneme but keeping the same stress patterns (e.g., ANGuish/anGUISH/ANGlish/angLISH). Subjects were instructed to classify each stimulus as a word or nonsense word. Results showed that real words were identified more quickly than nonsense words and that correctly stressed words were identified more quickly than incorrectly stressed ones.
In agreement with such findings, Cutler (Reference Cutler, Reed and Levis2015, pp. 118–19) states that changes in stress involving only full vowels are not likely to lead to loss of intelligibility:
Mis-stressing can cause similar difficulty for the listener whenever it affects the segments that make up the word – that is, whenever a vowel is changed. Mis-stressing will NOT cause difficulty if it involves suprasegmentals only, e.g., when secondary and primary stress are interchanged; as the early research already mentioned has shown, mis-stressed words where vowels are unchanged (e.g., stampede pronounced as STAMpede) are recognized easily.
Another type of mis-stressing seems not to affect intelligibility or comprehensibility. This is evident in one of the most commonly taught areas of word stress, noun–verb pairs such as CONtract/conTRACT, REcord/reCORD. Word pairs like this are often used to demonstrate to learners that stress makes a difference in English word categories (it changes a word from a noun to a verb or vice versa) or meaning (proDUCE is a verb that means to make while PROduce refers to fruits or vegetables). These word pairs may have very similar vowel quality in both syllables (PERmit/perMIT, INsult/inSULT) or the change in stress may lead to a change in vowel quality (REcord/reCORD). Two studies indicate that these stress differences are unimportant for native listeners.
Cutler (Reference Cutler1986) examined pairs of stress homophones in English (e.g., FOREbear/forBEAR; TRUSTy/trusTEE) to determine whether English listeners used stress differences alone to access these words. Results showed that there was no significant difference in the time it took to access the competing word in a cross-modal priming task: “lexical prosody, i.e., the fact that FORbear is stressed on the first syllable and forBEAR on the second, would appear, from the results of the present experiments, to be irrelevant in the lexical access process” (Cutler, Reference Cutler1986, p. 217). This may not be a great problem given that these true stress homophones are rare in English and Dutch, with fewer than twenty such pairs in each language (van Donselaar et al., Reference van Donselaar, Köster and Cutler2005).
Small, Simon, and Goldberg (Reference Small, Simon and Goldberg1988) also examined the effect of homographs on listening. Their words included those that differed both in stress and vowel quality (e.g., CONvert/conVERT) in contrast to those that differed in stress alone. They found that “changes in vowel quality, when stress was shifted, appeared to have no effect on lexical access of the homographs” (Reference Small, Simon and Goldberg1988, p. 277). This result was different from the lexical access for non-homographs, in which stress changes affected lexical access. In summary, there is evidence that word pairs like this are processed by speakers of English as homophones, whether they have the same segmentals (Cutler, Reference Cutler1986) or whether their segmentals vary with stress changes (Small et al., Reference Small, Simon and Goldberg1988). While interesting from a linguistic point of view in demonstrating how stress can distinguish one word from another, these stress pairs are less useful as teaching tools since they do not impact intelligibility.
When Misplaced Word Stress Affects Intelligibility
In successful communication, speech has words that are identifiable (intelligible) and easily processed in the stream of speech (comprehensible). Word stress is thus one factor influencing the intelligibility and comprehensibility of speech. When misplaced stress results in segmental deviations or when misplaced stress is to the right rather than the left (Field, Reference Field2005; Richards, Reference Richards2016), there is a greater likelihood of unintelligible or incomprehensible speech. According to Cutler (Reference Cutler1986), full versus reduced vowels may be the most important factor involved in the intelligibility of multi-syllabic words in English.
Benrabah (Reference Benrabah1997) argues that word stress is highly likely to be a source of unintelligibility in English. Benrabah presented stress errors within sentences from three speakers (from Algeria, Nigeria, and India) to British English listeners, who transcribed what they heard (a typical intelligibility task). The transcriptions showed how mis-stressings misled the listeners (e.g., suiTAble heard as the level, interVAL as only trouble, cheMIStry as community): the listeners tried to make sense of what they heard by writing down real words or phrases that matched the stress patterns they perceived, even when the resulting transcriptions did not make sense. Benrabah argued that “natives as listeners tend to impose their interpretation based on their ‘aural expectancies,’ relying heavily on the stress pattern produced and totally disregarding segmental information” (Reference Benrabah1997, p. 161).
Benrabah’s study also hints at the importance of the listener in interpreting word-stress deviations. English is spoken in a wide variety of contexts throughout the world, involving not only NSs (inner-circle countries) and nonnative speakers (NNSs) (expanding-circle countries), but also speakers in countries where English has an official status and is almost universally a language of wider communication (outer-circle countries). The speakers in the study were from the outer circle (India, Nigeria) and expanding circle (Algeria), but the listeners were from the inner circle (Kachru, Reference Kachru1992). We do not know what types of transcriptions would result if they were heard by listeners from other L1 backgrounds.
Stress and Sites of Reduced Intelligibility
Stress and segmental information often work together in impacting intelligibility. Zielinski (Reference Zielinski2008) sought to make explicit the effect of misplaced stress on unintelligibility. She examined the intelligibility of utterances spoken by three NNSs of English, one each from Korea, Vietnam, and China. Each NNS was interviewed for between ninety minutes and two hours; the interviews were then transcribed, and individual utterances were extracted and saved in separate sound files. The individual utterances were played for five NS listeners, who rated them for how difficult they were to understand. The most difficult-to-understand utterances numbered fifty-one for the Korean speaker, fifty-eight for the Chinese speaker, and sixty-eight for the Vietnamese speaker.
These utterances were the input for the next part of her study, in which three female NSs of Australian English listened to the target utterances (usually twice), and transcribed them orthographically. They then were asked to comment on the problems they had in identifying the words. Each NS listener individually listened to one speaker at a time, with at least two weeks separating listening sessions (a total of nine sessions in all). These “sites of reduced intelligibility” (Zielinski, Reference Zielinski2008, p. 73) were analyzed for syllable strength by the researcher and phonetically transcribed by another phonetician. The phonetic analysis was then compared to the orthographic transcriptions (enhanced by the listeners’ comments about what they heard), allowing the researcher to identify whether difficulties were caused by suprasegmental or segmental features.
The results showed that “the listeners always relied to some extent on the speakers’ syllable stress pattern and the segments in the speech signal to identify the intended words at sites of reduced intelligibility” but that their “reliance on segments … was not as consistent as their reliance on the speakers’ syllable stress pattern” (Zielinski, Reference Zielinski2008, p. 76). When they did rely on segments, they were segments in strong (i.e., stressed) syllables, especially initial consonants and vowels. This general pattern, however, did not hold for the Vietnamese speaker, for whom non-standard syllable-final consonants were a source of difficulty for the listeners. This is likely due to the tendency of Vietnamese speakers to simplify or delete syllable-final consonants and consonant clusters in English.
Although the ways that segmentals and suprasegmentals interact may be different for other languages, the two are frequently connected in stress languages. In one examination of Chinese and French learners of Dutch as a second language (DSL), there was no difference in loss of intelligibility based on stress versus segmental errors, but the combination of stress and segmental errors created a much larger loss than either error alone (Caspers & Horłoza, Reference Caspers and Horłoza2012), especially for the Chinese DSL speakers.
The L1 background of raters and the context of instruction may also affect how the relative contributions of stress and segmental factors are evaluated. Kashiwagi and Snyder (Reference Kashiwagi and Snyder2010) found that NS and NNS teachers in Japan paid attention to different features of speech in evaluations of intelligibility and accentedness. Whereas NS (i.e., American) teachers were more attentive to vowel and consonant errors, Japanese L1 teachers paid more attention to consonant and stress errors. This finding is similar to Riney, Takagi, and Inutsuka’s results showing that Japanese and American non-teachers also paid attention to different features of speech in judging accent, in which “American listeners relied more on segmentals (especially /ɹ/ and /l/), [but] NNS Japanese listeners relied more on intonation, fluency, and rate of speech” (Reference Riney, Takagi and Inutsuka2005, p. 460).
Word-Stress Errors Are Not All Created Equal
Discussions of word stress as a source of unintelligibility sometimes make an assumption that is not warranted – namely, that any word-stress error is equal to any other stress error in its effect. We’ve already seen some evidence that this assumption is wrong. As Poldauf (Reference Poldauf1984) points out, word stress is not a unitary concept in English. But what is the evidence that word-stress errors differ in their impact?
Field (Reference Field2005) showed that not all word-stress errors have similar effects on NS and NNS listeners. He examined the intelligibility of word stress with two-syllable words in which mis-stressings were shifted leftward (e.g., laGOON → LAgoon) or rightward (e.g., WAllet → waLLET). Field used twenty-four relatively frequent (based on a corpus) two-syllable words (twelve with first-syllable stress and twelve with second-syllable stress) as the listening input in three conditions: standard stress and pronunciation (all twenty-four words), stress shifted (all twenty-four words), and stress shifted with vowel quality changes (eleven words, e.g., conTAIN → CONtain). The last group was smaller since it only used stress shifts likely to create a change in vowel quality. An additional distractor word was included to give sixty items, which were divided into three groups of twenty for the listening task. Each set of twenty included eight words with standard stress, eight with stress shifting, and three or four with stress shifting plus vowel quality changes. No word was repeated within each set.
The words were played for both NS and NNS listeners in intact classes. Each group of listeners heard only one set of words. For the NS listeners, transcription accuracy was significantly affected for both stress shifting and for stress shifting with vowel quality changes. Stress shifting plus vowel quality changes did not affect intelligibility as much as stress shifting alone. This result is paradoxical since the presence of two errors made the word more intelligible. Further analysis showed that rightward shifts impaired identification more than leftward shifts, a result mirrored in the stress-shifting condition without a change in vowel quality. NNS listeners showed the same patterns that NSs did despite starting “from a lower baseline in that their recognition of items in the standard form was less accurate” (Field, Reference Field2005, p. 413) in comparison with the NS listeners.
Field also connected intelligibility (his goal) to comprehensibility, saying that the loss of intelligibility in connected speech may also lead to greater processing demands that lead the listener to try to access other words that fit the phonetic form they have heard. As Field (Reference Field2005) says,
stressing the second syllable in foLLOW will lead the listener toward a cohort … that includes low, local, possibly below, and away from the target word. The consequent increase in processing demands [i.e., affecting comprehensibility] might well limit the listener’s ability to perform under the pressures of a conversational context.
Richards (Reference Richards2016) examined some of the same issues as Field, but for words of two, three, and four or more syllables. Richards tested an English Word Stress Error Gravity Hierarchy based on vowel quality changes and the direction of stress shift as predictors of the effect of word-stress errors on comprehensibility and intelligibility. Results indicated that both features predicted the impact of English word-stress errors, with rightward mis-stressing being more serious than leftward, stress errors that changed vowel quality being more serious than stress shifts that did not, and multiple vowel quality errors being more serious than a single error. Finally, like Field (Reference Field2005), Richards found that these errors in stress placement and vowel quality affected native and nonnative listeners similarly, albeit from different baseline performances for words with no errors.
Field’s and Richards’s results show several things. First, some errors in word stress are more serious than others. Second, NS and NNS listeners may not be very different in how they respond to mis-stressed words: both groups showed similar effects in transcribing, even though they started from different baselines. Third, Field (Reference Field2005) may have something to say about the effects of context on intelligibility. The listening task used by Field was not complicated in that it involved frequent words in citation form, thus lowering the processing load in comparison to what would be faced in normal conversational speech. While conversational speech may also provide greater access to contextual clues, Field argued that the belief that “listeners can compensate for [stress errors] by drawing on information provided by context” is flawed because it depends on “how much of the previous input the listener has been able to decode accurately” (Reference Field2005, p. 418). As a result, his results may underestimate the contribution of lexical stress to overall intelligibility. Listening to NNS speech in an authentic communicative context, whether by NSs or NNSs, will place much greater demands on the processing of speech. In addition, NNS listeners listening to NS speakers in normal communication (Lecumberri, Cooke, & Cutler, Reference Lecumberri, Cooke and Cutler2010) will face much greater demands in identifying where words begin and end and in suppressing phantom cohorts of potential words that are activated not only by stress perception but also by the inability to perceive phonemic contrasts in the L2 (Broersma & Cutler, Reference Broersma and Cutler2008).
Word Stress in Relation to Other Phonological Features
It is impossible to separate word stress from segmentals as a source of unintelligibility or loss of comprehensibility, but it is equally impossible to separate word stress from other elements of discourse in regard to its effect on understanding. Spoken language does not work that way. Kang and Moran (Reference Kang and Moran2014) examined the effect of functional load in the ratings of Cambridge English Proficiency testing at four different levels corresponding to four Common European Framework of Reference (CEFR) levels. They found that high functional load (FL) errors for consonants and vowels were significantly related to differences in ratings across the levels (in general, the higher the proficiency level, the fewer high FL errors were made, a pattern that did not hold for low FL errors). Incorrect stress patterns were an equally significant feature, but the authors did not discuss this finding in accounting for the variance in ratings across the CEFR levels.
In a more explicit discussion of the importance of word stress in relation to other factors, Isaacs and Trofimovich (Reference Isaacs and Trofimovich2012) examined the contribution of phonological and non-phonological factors to comprehensibility ratings. In research studies, comprehensibility is sometimes connected only to phonological factors, but this study expanded the scope of factors considered to create rating scales that could be used by teachers in evaluating the spoken production of learners. The spontaneous spoken production of forty French learners of English of varying spoken proficiency levels was analyzed for nineteen quantitative speech measures divided into four categories: phonology (e.g., word-stress error ratio), fluency (e.g., mean length of run), linguistic resources (e.g., grammatical accuracy), and discourse factors (e.g., story breadth). Next, sixty raters rated the French learners’ production for comprehensibility on a nine-point scale. Correlations between the comprehensibility ratings and the nineteen quantitative measures were calculated to determine which objective measures were most strongly associated with intuitive comprehensibility judgments.
To develop a usable rating scale, the researchers asked three experienced ESL teachers to listen to the spoken presentations and rate them for comprehensibility and give reasons for their ratings in written teacher reports. The final rating guidelines included those features that showed the highest correlations and also overlapped with teacher comments. The final rating categories were type frequency (a measure of vocabulary richness), mean length of run, story breadth, grammatical accuracy, and the only phonology-related measure, word-stress error ratio.
Word stress was the most sensitive measure of the five included in the final rating scale, and was the only factor to reliably distinguish between the three proficiency levels rated by the scale. In relation to comprehensibility judgments, the results suggest that word stress plays a central role in how teachers rated oral proficiency. It may be that this is a particularly important issue for Francophone learners of English, given that word stress is not distinctive in French, but the authors argue that because of “the sheer number of learners from other L1 backgrounds for whom English word stress (and rhythm) generally pose a problem (e.g., Spanish, Polish), English stress patterns could be a much more global feature in distinguishing between different L2 comprehensibility levels” (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012, pp. 497–8).
A summary of what we know about stress shifts is presented in Table 5.1. Some cells are not completed because it is impossible to have a word in which the primary stress is replaced by a primary stress (there is only one primary stress in each word). Also, there is no research suggesting what would happen if a secondary stressed vowel were replaced with another secondary stressed vowel (in effect, this is a segmental error, not a stress error). However, there is a suggestion about what could happen when an unstressed vowel is replaced with another unstressed vowel (e.g., delight pronounced as [diˈlaɪt] rather than [dəˈlaɪt]). In addition, there is the possibility that an unstressed vowel is given full quality even though the stressed vowel remains as it is. Cutler says:
if the word target is uttered with correctly placed stress on the initial syllable, but with the second syllable unreduced ‒ so that it sounds like get ‒ it is liable to be perceived as two words rather than one; the same will happen if in correctly stressed utterance either its second or third syllable is not reduced.
Table 5.1 Likelihood of unintelligibility when stress patterns are changed
| Expected stress | Realized stress: Primary | Realized stress: Secondary | Realized stress: Unstressed |
|---|---|---|---|
| Primary | ***** | NS recognition unaffected, with some slowdown in processing (Slowiaczek, Reference Slowiaczek1990) | Disruption of recognition (Richards, Reference Richards2016), especially when stress is shifted rightward (Field, Reference Field2005). True for NS and NNS listeners. |
| Secondary | Recognition unaffected, with some slowdown in processing (Slowiaczek, Reference Slowiaczek1990) | ***** | Disruption of recognition (Richards, Reference Richards2016) or not (Field, Reference Field2005) |
| Unstressed | Disruption of recognition for NS and NNS listeners (Richards, Reference Richards2016) | Disruption of recognition (Richards, Reference Richards2016) | Full vowels from unstressed syllables can sound acceptable to NSs when spliced into words where the same vowel carries primary or secondary stress (Fear et al., Reference Fear, Cutler and Butterfield1995) |
Conclusions
The research on the contribution of word stress to intelligibility is compelling: misplaced word stress can cause loss of understanding. Caspers and Horłoza (Reference Caspers and Horłoza2012) found that for Dutch (a language closely related to English), word-stress errors impaired intelligibility (even for words spoken by native Dutch speakers), though no more than segmental errors. They also showed that pronunciation of words with both stress and segment errors was associated with a significantly greater loss of intelligibility than either type of error alone. Zielinski (Reference Zielinski2008) and Benrabah (Reference Benrabah1997) provide evidence that NS listeners pay attention to stress patterns in trying to interpret accented speech, and that stress may be a more reliable cue for word recognition than segmental information. Nor can the interaction of segmental and prosodic features be ignored.
What does this mean for teaching? Clearly, in an ESL context, word stress should be a high-priority feature because of its salience and the possibility that it is important for NS listeners in trying to access the words spoken by L2 learners. Even where intelligibility is not completely lost, mis-stressings often impair comprehensibility (cf. Cutler & Clifton, Reference Cutler, Clifton, Bouma and Bouwhuis1984). What about EIL/ELF contexts? Jenkins (Reference Jenkins2000, Reference Jenkins2002) places word stress outside the Lingua Franca Core (LFC), even though Dauer (Reference Dauer2005) and McCrocklin (Reference McCrocklin, Levis and LeVelle2012) criticize this decision. Walker (Reference Walker2010) is also less certain about the place of word stress in his pedagogical application of the LFC. However, Lewis and Deterding (Reference Lewis and Deterding2018) found that word stress did cause loss of intelligibility in South Asian ELF. Thus, Jenkins’ decision seems unwarranted on the basis of the available evidence. We do not know how nonnative listeners react to errors in word stress made by interlocutors from a range of L1s, whether those speakers come from outer or expanding circles. In addition, the near-impossibility of separating the segmental and prosodic features involved in English word stress means that it is wise to advocate a role for word stress in ELF communication. It may be that in ELF communication, listeners pay greater attention to segmentals than stress, but it may also be that stress information remains essential to lexical access, as suggested by Field (Reference Field2005) and Richards (Reference Richards2016). It is also likely that the first languages of the interlocutors will yield different results on the relative contribution of stress and segmentals to intelligibility.
But until we know more, the compelling evidence for the salience of word stress indicates that it should be considered a priority for pronunciation teaching in all L2 contexts.