Currently there is very little debate about whether intelligibility is an appropriate goal for teaching spoken language. At least since the late 1980s, a majority of knowledgeable teachers and researchers have advocated some version of the Intelligibility Principle for teaching pronunciation. This does not always affect how pronunciation is actually taught, or how published teaching materials are constructed, but the advocacy has influenced both implicit and explicit discussions of priorities.
The opposite of the Intelligibility Principle, the Nativeness Principle (Levis, 2005a), is now rarely put forth by teachers or researchers as a worthwhile goal, even though it remains vibrant, living on in popular beliefs about language learning, in spy novels (where effective spies always seem to become native-like in an unusually short time), and in accent-reduction advertising. The Nativeness Principle, by definition, says that learners have to completely match a native speaker’s production of all pronunciation targets, segmental and suprasegmental, in order to have achieved the target pronunciation in a foreign language. A number of pronunciation class texts appear to follow the Nativeness Principle. Such books include a nearly exhaustive collection of exercises for teaching segmentals and segmental contrasts (e.g., Orion, 2002), including sound contrasts that are rarely applicable, although often with a less than complete treatment of suprasegmentals. The advantage of such an approach to pronunciation teaching is that it defines what must be mastered and articulates a final objective – native-like speech. There is no obvious need to prioritize because the entire L2 sound system must be learned to mastery.
The power of the Nativeness Principle can be seen in the fact that it is not unusual to hear language learners themselves say they want to sound like a native speaker (NS). Every time I teach pronunciation, one or two students tell me unbidden that this is what they want. This is not surprising, as it shows that learners believe nativeness to be a desirable goal, and it shows that they understand that any deviation from that norm can mark them as being different or as being hard to understand. Their desire assumes a dichotomous state of affairs: native (and therefore unaccented and easy to understand) or nonnative (and therefore accented and harder to understand). Even though it is unlikely that they will ever reach their goal, their desire comes from a noble motive in that they want their spoken language to facilitate their communication. Not being language teachers, they do not realize that reaching a good-enough pronunciation is neither a sign of weakness nor of low standards, but is instead all that is really necessary (or all that is really possible, in most cases).
Unfortunately, few teachers or students have the time, aptitude, or the age necessary to achieve the kind of mastery needed for native-like pronunciation. An intelligibility-based approach, in contrast, requires prioritizing what is taught. Teaching to achieve intelligibility is challenging precisely because of this prioritized approach to errors. Such an approach is based on the assumption that some errors have a relatively large effect on understanding and others a relatively small effect. Judy Gilbert (personal communication) talks about priorities in terms of a battlefield medical image – triage. When faced with many people who are wounded, medics must prioritize treatment. Applied to pronunciation, the image suggests that certain errors (injuries to communication, to follow the metaphor) should be treated first because they are more likely to harm communication than are others. Other injuries to communication are far less problematic, and neither listeners nor speakers will be harmed by lack of accuracy in such cases.
What such an approach should look like in the classroom is not clear, however, partly because proposals based on the Intelligibility Principle conflict, and partly because the Nativeness Principle continues to strongly influence classroom practice and teachers’ attitudes. Also, the Intelligibility Principle must be context-sensitive and connected to both speaking and listening – speakers need to be intelligible to listeners, and listeners need to be able to understand speakers. So decisions about priorities not only involve helping learners produce speech in an accessible way for listeners, but also involve teaching those same learners to understand the speech they hear.
What Is Involved in Pronunciation?
An intelligibility-based approach to teaching pronunciation requires a clear description of the key elements of pronunciation and how they relate to one another. This description is the goal of the following section.
When we use the term pronunciation, we are talking about an interrelated system of sounds and prosody that communicates meaning through categorical contrasts (e.g., phonemes), systematic variations (e.g., allophones), and individual variations that may mark gradient differences such as gender, age, origin, etc. As such, the system of pronunciation can be divided in a variety of ways, but the divisions are merely a way to understand pieces of the system, even though all parts of the system interact with each other in ways that often make it impossible to separate their effects upon understanding.
The classification of the features that I use here is somewhat unusual (Figure 2.1). Typically, pronunciation is divided into segmentals and suprasegmentals, but from the viewpoint of spoken language understanding, word-level features are those most likely to impact intelligibility at the lexical level, and discourse-level features are likely to affect intelligibility at the semantic and pragmatic levels, and they are also more likely to impact comprehensibility. There are also spoken language elements related to pronunciation that are nonetheless not included in my categories for pronunciation. Any of the categories (word-level, discourse-level, and related areas) may impact perceptions of speech. In addition, although Figure 2.1 divides features into separate categories, this is a failure of the visual in representing the ways in which the categories interact. For example, word stress and rhythm both affect the pronunciation of vowels, with unstressed syllables strongly leaning toward schwa. Schwa is clearly a frequent segmental; it is not a phoneme of English, however, but rather a variant of other vowels when they occur in unstressed syllables (Ladefoged, 1980). Stressed syllables in any part of a word are also the location for aspirated voiceless stop consonants, as in pill, repeat, till, return, kill, and acute, creating a dependency between some consonant allophones and stress. The rhythm of the phrase level and the stress patterns of words affect the ways sounds change in connected speech (e.g., the phoneme /t/ as realized in city, nature, button, and “can’t you” is not the phonetic sound [t] in North American English). Pronunciation of the individual segments, in other words, is dependent upon where they occur in words, which are in turn affected by where they occur within a phrase. Prominence occurs on syllables that are emphasized through a combination of pitch, syllable duration (a rhythmic feature), and loudness. Prominence typically (but not always) occurs on the stressed syllable of the prominent word, connecting prominence to word stress. Prominence results in segments that are pronounced with particular clarity and precision. The prominent syllable in a phrase is often marked by pitch, and is the beginning of the final pitch contour in the phrase – that is, its intonation.

Figure 2.1 Pronunciation features related to intelligibility
Word-Level Features
Word-level features include segmentals (vowels and consonants) and word stress (Figure 2.2). They also include consonant clusters, a type of segmental which may, when mispronounced, have an effect on syllable structure. This introduction is provided to define how segmentals and word stress are related to intelligibility and comprehensibility.
Segmentals include approximately forty phonemes for most varieties of English (around twenty-four consonant phonemes and fourteen or more vowel phonemes). The phonemes mask the number of sounds that English uses, since phonemes often have multiple well-known allophones that are important for pronunciation teaching. For example, English regularly employs a glottal stop ([ʔ]) before vowel-initial words (e.g., I → [ʔaɪ]) or as an allophone for /t/ before a final nasal (e.g., button → [bʌʔn̩]). Other well-known allophones include aspirated voiceless stops and affricates in pill/till/kill/chill, the dark (velarized) /l/ in all (as opposed to the light /l/ in word-initial position, e.g., lap), the flapped /t/ in city, the now increasingly rare voiceless labial-velar fricative [ʍ] in which (sounds like [hw]), and many others. For vowels, allophones are almost too numerous to count, as vowel quality often shifts noticeably in the presence of nasals, before /ɹ/ or dark /l/, before /g/, and in unstressed syllables.
Regarding word stress, English is a free-stress language in that stress is not fixed to a particular syllable. Stress can occur on first, second, or third syllables, but the placement is fixed for individual words. For example, the main stress for a word may fall on the first syllable (COMfort, BEAUtiful), the second (caNOE, rePULsive), the third (referENdum, questionNAIRE), etc. Stressed syllables typically have greater segmental clarity, greater syllable duration, and greater intensity than unstressed syllables. When they are in particular discourse contexts, they may also be marked with pitch movement (Ladd & Cutler, 1983).
Both segmentals and word stress are likely to impact intelligibility at the lexical level in that mispronunciations may lead listeners to fail to decode the intended words. This failure may come from identifying other possible words (as happens with minimal pairs) or failing to identify any word that matches the speech signal (as with segments that are distorted). Both consonants and vowels are affected by the stress patterns of words. Mis-stressed words may especially affect the ways that vowels are pronounced in English because of the ubiquity of the unstressed vowel schwa (33 percent of all vowels according to Woods, 2005). Schwa is a key perceptual clue to lack of stress in English, and listeners tend to classify full vowels, even in unstressed syllables, as stressed (Fear, Cutler, & Butterfield, 1995). Because of these interactions, word stress and segmentals are inseparable in their impact on intelligibility (Zielinski, 2008). An unexpected stress pattern on a word (e.g., FORtune → forTUNE) also may affect the ways that vowels and consonants are produced (e.g., [ˈfɔɹtʃən] versus [fɚˈtʰun]).
Discourse-Level Features
Discourse-level features (Figure 2.3) include suprasegmental features that carry categorical (i.e., phonological) meaning differences. In relation to how listeners may (mis)understand speakers, these suprasegmentals are not likely to cause listeners to misunderstand individual words (making words unintelligible) but are likely to cause listeners to process meaning with greater difficulty (making speech more effortful to understand). Another suprasegmental, word stress, as discussed above, is included as a word-level feature because it is more likely to impact intelligibility (though it may also impact comprehensibility, or ability to process speech, even without a change in vowel quality, as in Slowiaczek [1990]).
Rhythm, the first discourse-level feature, involves at the very least the relative durations of syllables and the timing of syllabic beats. The constructed sentence in (2.1), made up of all single-syllable words, varies between longer, stressed syllables (in CAPS) and shorter, unstressed syllables (in lower case). The stressed words have a pronunciation that will be closer to the citation form, while the unstressed words are prone to simplification (e.g., has is likely to have the [h] deleted or even to be contracted with John).
(2.1) JOHN has CLIMBED the TREE to GET the CAT that’s been STUCK for a TIME.
In pronunciation teaching, rhythm has often been described in terms of stress timing and syllable timing (e.g., Pike, 1945). Stress-timed languages (like English) are asserted to have large durational differences between stressed and unstressed syllables, with relatively equal timing between stresses. Syllable-timed languages are considered to have quite similar durations between syllables and thus timing that is at the level of the syllable rather than at the level of stressed syllables. This well-known formulation is overly simplistic, however, and stress timing and syllable timing are tendencies rather than absolutes (Dauer, 1983). The rhythmic characteristics of many languages remain of interest to researchers, and a wide variety of rhythm metrics have been tested for different L1 speakers (e.g., Low, Grabe, & Nolan, 2000), L2 speakers (Yang & Chu, 2016), and L1–L2 comparisons (White & Mattys, 2007). However, there is great uncertainty about how well various rhythm metrics actually capture perceived rhythmic differences between languages (Arvaniti, 2012).
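To make the idea of a rhythm metric concrete, the sketch below computes one widely used measure, the normalized Pairwise Variability Index (nPVI), which is commonly associated with the work of Low, Grabe, and Nolan. It averages the duration difference between each pair of successive intervals, normalized by the mean duration of that pair. The function name and the vowel durations are invented for illustration and are not taken from any study cited here; the sketch also says nothing about whether the resulting number matches what listeners perceive.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index for a sequence of interval durations.

    For each pair of successive intervals, take the absolute difference
    divided by the mean of the pair, average over all pairs, and scale by 100.
    """
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two durations")
    total = 0.0
    for current, following in zip(durations, durations[1:]):
        total += abs(current - following) / ((current + following) / 2)
    return 100 * total / (len(durations) - 1)


# Hypothetical vowel durations in milliseconds (invented for illustration).
alternating = [180, 60, 170, 55, 190, 65]  # long-short alternation
even = [100, 105, 95, 100, 102, 98]        # nearly equal durations

print(round(npvi(alternating), 1))
print(round(npvi(even), 1))
```

A sequence that alternates between long and short intervals yields a much higher value than one with nearly equal intervals, which is the sense in which such a metric is meant to separate stress-timed from syllable-timed tendencies.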
In English, rhythm and word stress are similar. The discourse level for rhythm in many ways mirrors the word-level rhythm of lexical stress. A major difference is that word stress typically is limited to multi-syllabic words, whereas rhythm includes stress for single-syllable words. In English, for example, content words (e.g., nouns, verbs, adjectives, adverbs, negatives), including those of one syllable, are normally stressed in discourse. Single-syllable function words (e.g., prepositions, auxiliary verbs, pronouns, determiners) are typically unstressed in discourse. Many single-syllable function words are also among the most frequent words in English, helping to contribute to the perception of stress timing.
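The content-word and function-word generalization can be sketched as a simple rule. The example below capitalizes every word that is not in a small list of function words, as a rough stand-in for discourse-level stress. The word list is a hand-picked assumption that is far from complete, the function name is hypothetical, and real speech departs from this rule in many contexts (for example, when a function word is made prominent).

```python
# A small, hand-picked set of single-syllable function words.
# This list is an assumption for the sketch, not an exhaustive inventory.
FUNCTION_WORDS = {
    "a", "an", "the", "to", "for", "of", "in", "on", "at",
    "has", "have", "is", "been", "that's", "he", "she", "it",
}


def mark_rhythm(sentence):
    """Upper-case likely stressed (content) words, lower-case function words."""
    words = sentence.rstrip(".").split()
    marked = [
        word.lower() if word.lower() in FUNCTION_WORDS else word.upper()
        for word in words
    ]
    return " ".join(marked) + "."


print(mark_rhythm("John has climbed the tree to get the cat that's been stuck for a time."))
```

Applied to a sentence made up of single-syllable words, the rule produces an alternation of stressed and unstressed words of the kind illustrated in (2.1).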
Intonation, the second suprasegmental, includes at least three distinct ways in which meaning is communicated: prominence, tune, and range. The example in (2.2) illustrates these three. For context, imagine that the sentence is spoken in the middle of a lecture.

The initial extra-high pitch range in the example is meant to signal a topic shift, or what has been called a paratone (paragraph tone, see Wichmann, 2000). Pitch range may also signal gradient meanings related to emotional engagement. NEXT is a prominent syllable with a jump up in pitch to call attention to the importance of the information (in this case, NEXT is likely related to PREVIOUS, the topic(s) that came before). The last use of pitch is the drop in pitch from NEXT to the end of the utterance. This is the tune. Each of these uses is part of how intonation works in English.
Intonation is the system that uses voice pitch changes to communicate meaning. However, intonation may include more than voice pitch. Prominent syllables are not only higher or lower in pitch than the syllables that precede them, they also have greater duration (a rhythmic feature) and more clearly enunciated segmentals. In addition, the varied intonational categories are closely related. The final prominent syllable in a phrase (the nucleus) is also the beginning of the tune, and both may be pronounced with greater or lesser pitch range.
In regard to comprehensibility, the specific contribution of these suprasegmental features is understudied and needs greater attention. Isaacs and Trofimovich (2012) found that more native-like vowel reduction and pitch contours correlated with better comprehensibility ratings, while pitch range was not significantly related to comprehensibility. Kang (2010) found that pitch range was instead associated with accentedness, but not comprehensibility. Tune, on the other hand, has also been suggested to have an effect on comprehensibility. Pickering (2001), for example, found that Koreans teaching in English used a greater number of falling tunes than would be expected, and that the relative numbers of rising and falling tunes made their speech more challenging for listeners. (This is an interpretation of Pickering, given that she worked within a model that considers tunes to communicate differences in information structure.)
Related Areas
Comprehensibility and intelligibility are not only associated with pronunciation, but also with other characteristics of spoken language (Figure 2.4) that have an indirect connection to pronunciation. These areas include fluency (Derwing, Munro, & Thomson, 2007), speech rate (Kang, 2010), loudness (not typically addressed for L2 pronunciation research), and voice quality (Esling & Wong, 1983; Ladd, Silverman, Tolkmitt, Bergmann, & Scherer, 1985; Munro, Derwing, & Burgess, 2010). Generally, this book will not address these characteristics in detail, because other than fluency and speech rate (which is a component of fluency), these things are more idiosyncratic than the other features.

Figure 2.4 Spoken language features sometimes associated with pronunciation
Research suggests that fluency and speech rate, in particular, have significant effects on judgments of comprehensibility and accentedness. Pronunciation research on voice quality and its inclusion in teaching materials has never been common, despite its seeming promise in pedagogy (Jones & Evans, 1995; Pennington, 1989). Loudness is an especially important issue in regard to hearing loss, hearing in noise, and the intelligibility of speech for those with cochlear implants. Anecdotally, some L2 learners can become more intelligible simply by speaking at a volume more appropriate to the context (e.g., in a large classroom), but this has not been typically considered important for an L2 pronunciation syllabus.
Fluency is sometimes associated with general proficiency (Fillmore, 1979) or smoothness of speech (Lennon, 1990). I use fluency to refer to smoothness of speech. However, judgments of fluency are also closely tied to many discrete elements of speech, including the numbers of silent pauses and filled pauses, whether junctures in spoken phrases are grammatically logical, mean length of run, repetitions in speech, and more (Lennon, 1990), as well as the level of automaticity in speaking (Segalowitz, 2000), phonological memory (O’Brien, Segalowitz, Freed, & Collentine, 2007), and attention control (Segalowitz, 2007). Remedial attention to pronunciation is more likely to be successful when learners are relatively comfortable speaking and listening in the L2 – that is, when they are sufficiently fluent. The development of fluency is clearly an important part of the big picture of L2 speaking development and instruction (Firth, 1992), and comfortable fluency can help give a global structure to other elements of spoken language, but fluency is not, by itself, part of L2 pronunciation. The fact that it impacts comprehensibility and is critical in communicative language teaching (Rossiter, Derwing, Manimtim, & Thomson, 2010) means that, like pronunciation, it should be prioritized in teaching speaking and listening.
Speech rate is predictive of fluency judgments (Cucchiarini, Strik, & Boves, 2000; Kormos & Dénes, 2004). Fluent speakers can speak at different rates, so a fluent speaker could speak at a slower rate than a speaker who is judged less fluent. Speech rate is typically measured in syllables per second or words per minute. This rate may include all silences and filled pauses, or they may be removed, providing a measure of articulation rate. L2 speakers tend to speak more slowly than L1 speakers, and their comprehensibility may be helped by faster speech. However, excessively fast or slow speech is more likely to be rated as less comprehensible and more accented (Derwing & Munro, 2001; Munro & Derwing, 1998).
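The difference between speech rate and articulation rate can be shown with a short calculation. The sketch below assumes that the syllable count, the total sample duration, and the total pause time (silent and filled) have already been measured; the function names and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def speech_rate(n_syllables, total_seconds):
    """Syllables per second over the whole sample, pauses included."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return n_syllables / total_seconds


def articulation_rate(n_syllables, total_seconds, pause_seconds):
    """Syllables per second after silent and filled pauses are removed."""
    speaking_seconds = total_seconds - pause_seconds
    if speaking_seconds <= 0:
        raise ValueError("pause time must be shorter than the sample")
    return n_syllables / speaking_seconds


# Hypothetical sample: 120 syllables in 40 seconds, 10 of which are pauses.
print(speech_rate(120, 40))            # 3.0 syllables per second
print(articulation_rate(120, 40, 10))  # 4.0 syllables per second
```

Because pause time is excluded from the denominator, the articulation rate is never lower than the speech rate for the same sample, so two speakers with the same articulation rate can still differ in speech rate if one pauses more.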
Prioritizing: A Summary and Critique of Recommendations
Various writers have made recommendations about priorities for pronunciation teaching. These recommendations go from very general (learners should try for “listener-friendly pronunciation,” in the words of Olle Kjellin) to more detailed descriptions of what might be included in instruction. Kenworthy’s (1987) early approach to intelligibility was based on describing learner pronunciation issues that may result in unintelligibility. She lists both segmental and suprasegmental issues, including substituting one sound for another, deleting or adding sounds, connecting one word to another, mis-stressing words, not using stress-based rhythm, and the misuse of intonation or use of unfamiliar intonation. Kenworthy does not further prioritize these potential sources of unintelligibility.
In another early attempt to prioritize, Jenner described a “common core for pronunciation” that would “guarantee intelligibility and acceptability anywhere in the world” by specifying “what all native speakers of all native varieties have in common which enables them to communicate effectively with native speakers of varieties other than their own” (1989, p. 2). Rather than focusing on the differences between varieties, or indeed ruling out certain native varieties a priori while elevating others as models, Jenner thought it essential to look at commonalities, recognizing that speakers of native varieties have a better chance of being intelligible to each other because of what they share. Jenner suggested that the commonalities included vowel quantity (phonetic length differences), most consonants, syllable structure, stress-based rhythm, and varied commonalities of intonation, including tonic syllables and final movements of pitch. Jenner’s recommendations refer to segmental and suprasegmental features, and they often distinguish between a category (e.g., consonants) and secondary features that are not essential in the category (e.g., distinctions between [l] and [ɫ]).
In another analysis focused on segmental pronunciation, Brown (1988) proposed functional load as a way to determine pronunciation priorities. Functional load, a topic that had been put forth long before Brown’s application to pronunciation teaching, measures the “frequency with which two phonemes contrast in all environments” (Brown, 1988, p. 591). Functional load is therefore inherently contrastive and the use of minimal pairs gives a way to quantify priorities. Phoneme contrasts that have a higher functional load are more likely to cause confusion if mispronounced, and they should be given priority over those with lower functional loads (assuming, of course, that students have difficulty with the contrast). Brown’s proposal for measuring functional load takes into account not only the number of minimal pairs for a contrast, but other issues such as the number of minimal pairs for the same part of speech, the extent to which a mispronunciation is stigmatized in native accents, and the acoustic similarity of the sounds involved in the contrast (not all minimal pairs are likely to be confused). For example, Brown’s analysis lists contrasts such as /p, b/, /p, f/, /l, ɹ/, and /l, n/ as of the highest importance, and contrasts such as /f, θ/, /ð, d/, and /ʤ, j/ as very low-priority contrasts. The proposal quantifies importance, but Brown recognizes that quantification alone cannot completely determine priorities. In addition, functional load is a measure of segmental importance only and cannot be applied to suprasegmental features, even to relatively uncomplicated ones like word stress.
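The minimal-pair component of functional load can be illustrated with a simple count. The sketch below is not Brown’s measure, which also weighs part of speech, stigmatization, and acoustic similarity; it only counts how many word pairs in a toy lexicon are distinguished solely by a given phoneme contrast. The lexicon, its simplified transcriptions, and the function name are all assumptions made for illustration, so the resulting counts say nothing about the real English vocabulary.

```python
from itertools import combinations

# Toy lexicon with simplified phonemic transcriptions (invented for illustration).
LEXICON = {
    "pat": ("p", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "pin": ("p", "ɪ", "n"),
    "bin": ("b", "ɪ", "n"),
    "cap": ("k", "æ", "p"),
    "cab": ("k", "æ", "b"),
    "fin": ("f", "ɪ", "n"),
    "thin": ("θ", "ɪ", "n"),
    "den": ("d", "ɛ", "n"),
    "then": ("ð", "ɛ", "n"),
}


def minimal_pairs(lexicon, first, second):
    """Return word pairs that differ in exactly one segment,
    where that segment is `first` in one word and `second` in the other."""
    pairs = []
    for (word_a, phones_a), (word_b, phones_b) in combinations(lexicon.items(), 2):
        if len(phones_a) != len(phones_b):
            continue
        differences = [(x, y) for x, y in zip(phones_a, phones_b) if x != y]
        if len(differences) == 1 and set(differences[0]) == {first, second}:
            pairs.append((word_a, word_b))
    return pairs


for contrast in [("p", "b"), ("f", "θ"), ("ð", "d")]:
    found = minimal_pairs(LEXICON, *contrast)
    print(contrast, len(found), found)
```

In a count of this kind, a contrast that distinguishes many word pairs would rank above one that distinguishes few, which is the intuition behind giving high-load contrasts priority when learners have difficulty with them.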
Firth prioritizes differently by proposing a “zoom principle” for teaching in which “a pronunciation syllabus should begin with the widest possible focus and move gradually in on specific problems” (1992, p. 173). By this, she means that pronunciation instruction should start with general speaking abilities and ability to communicate before moving on to phonetic details. This is put forth as the most likely way to promote comprehensible speech. Priorities beyond general speaking ability (including volume and clarity) include intonation and stress/rhythm (features that are more related to general speaking ability) before considering consonants and vowels. Interestingly, Firth’s recommendations led to some paradoxical suggestions that seem to have little to do with communicative importance. Despite its not being seen as critically important for understanding, /θ/ is given high priority within the consonantal system because students perceive it to be important and because it is relatively easy to teach effectively, giving students confidence to try for more difficult sounds. Practically speaking, this may also be a feature to spend modest time on if it leads to greater commitment to features more likely to improve intelligibility (Derwing & Munro, 2015).
Evidence for Firth’s recommendations comes from a recent study by Isaacs and Trofimovich (2012). With a goal of making explicit the issues involved in comprehensibility ratings used in a variety of spoken assessment tools, the researchers examined which factors were salient in ratings of comprehensibility. Using criteria collected from research studies, the researchers identified nineteen quantitatively scored speech measures, including segmentals, suprasegmentals, fluency, vocabulary, grammar, and discourse. Speech samples of forty French learners of English were analyzed, and the scores derived from the analysis were then correlated with naive NS raters’ comprehensibility ratings (based on Munro & Derwing, 1995). Following this, three experienced ESL (English as a second language) teachers with minimal training in pronunciation listened to the speech samples and identified the factors they noticed when evaluating the speech of the French-speaking learners. Five factors were identified as important: fluency, breadth of vocabulary, grammatical control, construction of the discourse, and word-stress accuracy. Of these five measures, only one involves pronunciation in an explicit way, while the others are more global measures. The study concludes that expert raters do not focus overtly on pronunciation when evaluating speech comprehensibility, but that instead they take account of features that are not often included in pronunciation-oriented instruction. Other evidence for Firth’s contention can be seen from the work of Derwing, Munro, and Wiebe (1998), who found that listeners rated L2 learners’ comprehensibility higher after a pronunciation course focusing on global skills and suprasegmentals than after a course focusing on segmental improvement. Similar results for other groups of learners can be found in the work of Gordon and Darcy (2016).
McNerney and Mendelsohn (1992) argue for suprasegmentals to be given top priority in pronunciation instruction. Suprasegmentals control how information is related, and they also are said to have a special role in conveying attitudinal meaning (cf. Pike, 1945). McNerney and Mendelsohn say that “a short-term pronunciation course should focus first and foremost on suprasegmentals as they have the greatest impact on the comprehensibility of learners’ English” and because through suprasegmentals “greater change can be effected in a short time” (1992, p. 186), although they provide no empirical evidence for their confident assertion. Suprasegmental features include:
word stress and rhythm;
intonation;
palatalization in rhythmically related contexts such as “can’t you” and “did you.”
In another attempt to specify priorities for a short pronunciation course, Henderson (2008) identified pacing (stressed words per minute), speech rate (syllables per second), and word stress as important in promoting more comprehensible speech in spontaneous and prepared speaking tasks. The author argued that these three areas were most amenable to changes in the short term, and that learners of English in this university-level planned speaking course were most likely to be successful by learning to vary their pacing and speech rate. Word stress, which was asserted to be important in promoting understanding, was also presented as a feature that may not be easy to change in the short run. It may be significant that both Isaacs and Trofimovich (2012) and Henderson (2008) studied French learners of English, a group for whom word stress may be particularly important.
Gilbert (2001) suggests priorities for beginning learners based on her experience and repeated attempts to distill that experience into pronunciation features that are likely to be learnable and to make a difference in comprehensibility. She lists the following as essential: sound–symbol correspondences for the spelling of key vowel sounds, consonant sounds (mostly final) that serve as signals of grammatical meaning, linking between words, epenthesis and deletion of syllables, word stress, differences between weak and strong syllables (stress timing), and emphasis (prominence or nuclear stress). This set of priorities emphasizes suprasegmentals, but also says that certain segmental targets are both important enough to pay attention to and likely to be learnable even at the beginning levels of proficiency.
Not all writers who agree that a greater emphasis on suprasegmentals is important for pronunciation instruction prioritize what should be taught. Morley (1991), in her historical review of the evolution of pronunciation in TESOL (teaching English to speakers of other languages), described many of the changes from the traditional, segmental-based approach of the 1950s and 1960s to the more communicatively oriented approach becoming evident in the 1980s and 1990s. However, Morley’s recommendations called for “an expanded concept of what constitutes the domain of pronunciation” (1991, p. 493) rather than setting clear priorities. This expanded domain saw pronunciation’s proper sphere of influence as encompassing communication skills, suprasegmentals, segmentals, voice quality, body language, greater learner and teacher involvement in developing self-monitoring skills, contextualization, linking listening to speaking, greater attention to sound–spelling relationships, and attention to individual differences among ESL learners. These recommendations cannot be faulted in and of themselves, but taken together their approach to teaching pronunciation suggests that previous lack of success came not because of misplaced priorities or goals, but because the program of instruction was too limited to work. While Gilbert’s triage metaphor suggests that certain pronunciation needs are medically critical, identifiable, and should be dealt with immediately, Morley’s recommendations sound like an extended stay at a luxury pronunciation spa with personal accent trainers.
Prioritizing for Nonnative Speaker–Nonnative Speaker Communication
All these approaches assume that native listeners are the appropriate audience for determining what is intelligible and what is not. This assumption is called into question by Jenkins (Reference Jenkins2000) in her proposal for a prioritized set of features for pronunciation teaching, the Lingua Franca Core (LFC). Jenkins recognizes that most nonnative speakers (NNS) of English around the world interact in English not with NSs but with other NNSs. This difference in audience prompted Jenkins to consider changes in how priorities are determined, with mutual intelligibility being the standard by which pronunciation features are to be judged for importance. Jenkins studied the interactive interlanguage talk of NNS dyads doing communicative tasks. She analyzed where communication failed in these communicative tasks and determined the causes of the forty instances in which communication broke down. Of these, twenty-seven were related to pronunciation deviations. The deviations that caused loss of intelligibility were candidates for the LFC. Those features that did not cause a loss of intelligibility were typically excluded from the core. The core included most consonant sounds, some consonant cluster simplifications, vowel length, and nuclear stress (i.e., prominence). Consonants that were excluded were the interdental fricatives and [ɫ], the velarized, or dark, allophone of /l/. Consonant cluster deletions were included in the LFC because loss of sounds was argued to be more likely to impact intelligibility than was epenthesis. Perhaps the place in which the LFC departs from the other recommendations most radically is in its treatment of suprasegmentals. Only one, nuclear stress, is included in the core. Others, such as stress-based rhythm, intonation, and word stress are all excluded based on Jenkins’ data, her appeal to teachability/learnability, and the impact of universals. 
Some of these decisions have been criticized, especially regarding word stress (Dauer, Reference Dauer2005; McCrocklin, Reference McCrocklin, Levis and LeVelle2012). In one description of pronunciation teaching in China that used the LFC as a rubric, Deterding (Reference Deterding2010) showed what most experienced teachers know very well: Pronunciation difficulties are varied, and they include errors that reflect both core and non-core features.
In a replication of Jenkins’ (Reference Jenkins2000) work, Kennedy (Reference Kennedy2012) found that the most common sources of unintelligibility were vowel and consonant segments, either individually or in combination. The only suprasegmental feature implicated in unintelligibility was word stress. Nuclear stress (prominence) was not a source of unintelligibility. Kennedy also suggested that learners may not always indicate that they do not understand a speaker and that researchers and teachers may not realize that pronunciation is a factor. This problem may be connected to the types of interactive tasks that are used to collect data, such that both listeners and speakers have to demonstrate understanding.
Walker (Reference Walker2010) applies the LFC extensively to pronunciation teaching, giving the original recommendations a classroom teacher’s perspective. His defense of ELF (English as a lingua franca) priorities describes why he thinks the LFC is appropriate, including issues related to bottom-up rather than top-down processing, mutual intelligibility, speaker identity, and teachability. One of the benefits of the LFC, Walker argues, is that it recognizes that ELF speakers make greater use of bottom-up processing in their interactions. This means that they are much more reliant on the details of the acoustic signal in interactions, and thus more likely to be affected by unexpected pronunciations of segmentals. NSs, in contrast, make greater use of top-down processing, in that they can more easily guess at the content of a spoken message even when the segmentals deviate from what is expected. The LFC, however, remains controversial in many of its recommendations. It lacks robust empirical support, assumes that all NNS contexts are similar, and does not take into account the importance of stigma associated with otherwise intelligible pronunciations (LeVelle & Levis, Reference LeVelle, Levis, Levis and Moyer2014). The actual details of what should and should not be included thus have found uncertain acceptance despite its appeal, and influential researchers have criticized its consistency, calling for empirical verification of its recommendations (e.g., Szpyra-Kozłowska, Reference Szpyra-Kozłowska2015, pp. 77–84).
The second area discussed is mutual intelligibility. Intelligibility is a context-sensitive feature of spoken discourse, a characterization that will find little disagreement from almost anyone interested in teaching pronunciation. Walker is more nuanced than Jenkins, recognizing that the LFC may need to be adjusted in some ways because interlocutors may also include NSs. For example, the teaching of weak forms and vowel reduction may be strictly non-core as far as production, but mutual intelligibility suggests that it is a core feature for perception. NS interlocutors will reduce vowels, and it is important for ELF listeners to be able to understand such speech. For communication to happen, both speakers and listeners must be intelligible to each other, and learning materials typically make use of native speech.
The third positive aspect of an LFC approach is that it recognizes the importance of speaker identity. The LFC recognizes that achieving a native accent is not necessary, and that the influence of the speaker’s L1 should be accepted, as long as intelligibility is not compromised. This is clearly not just part of NNS–NNS communication. Most people in inner-circle communities, especially in larger cities with significant immigrant communities, also think that an NS accent is not needed.
Finally, Walker discusses Jenkins’ concept of teachability. Walker says that “many features that are essential in a traditional EFL syllabus are largely unteachable. This was the case with tone and stress-timing, and with the use of weak forms and certain connected speech changes. In contrast, most of the items in the LFC are teachable, with classroom teaching leading to learning” (Reference Walker2010, p. 63). While it makes little sense to teach things that our learners cannot learn, little evidence is provided regarding teachability. In fact, many of the features that are said to be unteachable are ones that Judy Gilbert (Reference Gilbert2001), who is a notorious stickler for teaching only those things that can be learned, provides as priorities for beginning learners: linking, word stress, and distinguishing strong and weak syllables.
What does it mean for a feature to be teachable? Walker (Reference Walker2010) says this about nuclear stress placement:
[It] is teachable in the sense that the rules are simple enough for learners to master in the classroom, although for some learners there may be a noticeable gap between receptive and productive competence. As a result, our primary aim in the classroom will be to make learners aware of the existence and importance of nuclear stress. This should make them more sensitive to its use by other speakers, and consequently more likely to acquire competence in its use.
Teachability thus seems to mean a topic whose rules can be learned and applied by learners, leading to acquisition. It does not simply mean that a topic can be taught, since any topic can be taught. What matters is the extent to which the teaching, the input, becomes learning, or intake. The principle of teachability/learnability will be discussed in detail in Principle 6 in Chapter 8.
Walker suggests ways the LFC might apply to speakers of different languages. Following Jenkins, he includes in the core rhotic /ɹ/ in all positions (characteristic of most North American English speakers), the non-flapped /t/ characteristic of British English, and word stress (an admitted gray area in Jenkins, Reference Jenkins2000) because of its impact on the core feature of nuclear stress, as well as vowel reduction and weak forms for receptive competence. In addition, certain errors (such as final glottal stops) that are common among certain users of English should be addressed because they may cause loss of intelligibility by masking the character of final stop consonants (see Walker, Reference Walker2010, p. 44; cf. Gilbert, Reference Gilbert2001).
The LFC’s recommendations have been used to examine features that promote the mutual intelligibility of emerging South-East Asian Englishes and the international intelligibility of Hong Kong English. Deterding and Kirkpatrick (Reference Deterding and Kirkpatrick2006) examined conversational interactions in English among speakers from ten South East Asian countries and identified features of speech that seem to form the basis of a developing regional variety. Features of this variety shared by speakers from at least four countries were the use of a stop for the voiceless dental fricative (dis for this), reduced initial aspiration of voiceless stops (pill sounds like bill), monophthongal mid-front and back vowels (take and goat do not have the extra glide typical of inner-circle varieties, so that they may sound like tech and gut), a lack of reduced vowels, stressed pronouns, and phrase-final discourse stress (e.g., Give it to HIM). Of these, several are features that Jenkins suggests should be treated as part of the LFC core (aspiration and nuclear stress) and others are part of her non-core features (dental fricatives and reduced vowels). Others are less obvious and may take finer analysis to determine whether they should be seen as core or non-core. Monophthongal vowels, however, may violate the quantity criterion for vowels while keeping the quality intact.
In the work of Kirkpatrick, Deterding, and Wong (Reference Kirkpatrick, Deterding and Wong2008), the intelligibility of the Hong Kong English of highly educated students was rated by university students in Singapore (a transitioning outer-circle country where English has an official role but is not the native language of all) and Australia (an inner-circle country in which English is the native language of most people). The students in Australia were both NSs and NNSs of English. Recordings of the Hong Kong speakers were played for the listeners, who did a listening comprehension task about the content of the speech. In addition to this measure of intelligibility, raters considered the speakers in terms of intelligence and likeability, two concepts well-attested in other studies to be associated with speech, and in our terms, with the potential for irritation. In an interesting finding, the speakers who were the most intelligible were also seen as less intelligent and less likeable, often based on things they said, but sometimes on the basis of the speech being too good, suggesting that the raters thought the speaker was showy and proud. Clearly, intelligibility, a good thing in itself, may sometimes be judged negatively in some contexts based on unforeseen social values. Overall, Hong Kong English was widely intelligible in this area of the world, where it is likely to be a familiar variety of English. However, not all speakers were equally intelligible.
These different attempts to specify priorities for intelligibility-based instruction are interesting both in what they agree on and also in what they do not agree on. The variety found in the recommendations comes primarily from a heavy reliance on reasoning and a paucity of empirical evidence. Table 2.1 provides a summary of the recommendations.
Table 2.1 Pronunciation priority recommendations from various authors
| Study | Recommended targets | Recommended for exclusion | Source of evidence |
|---|---|---|---|
| Studies related to ESL/EFL contexts | | | |
| Kenworthy (Reference Kenworthy1987) | Sound substitutions, deletions, and additions; linking; word stress; rhythm; intonation | | Reasoning based on experience |
| Jenner (Reference Jenner1989) | Vowel length; most consonants; syllable structure; stress-based rhythm; prominence; movements of pitch | Vowel quality; [ɫ] | Reasoning based on features shared by most NS varieties |
| Brown (Reference Brown1991) | High functional load contrasts, e.g., /p, b/, /p, f/, /l, r/, /l, n/, /æ, ɛ/ | Low functional load contrasts, e.g., /f, θ/, /ð, d/, /ʤ, j/, /u, ʊ/ | Functional load calculations based on minimal pair frequency modified by other criteria |
| Firth (Reference Firth, Avery and Ehrlich1992) | In descending order: general speaking abilities; intonation; stress/rhythm; consonants and vowels | None specified | Based on a “Zoom Principle,” a pedagogical approach that prioritizes general speaking habits over phonetic details |
| Isaacs and Trofimovich (Reference Isaacs and Trofimovich2012) | Word stress; lexical richness; grammatical control; use of discourse features; fluency | Pitch range | Based on correlations between scalar comprehensibility ratings and careful quantitative analysis, informed by teacher’s verbal protocols (one L1 only) |
| McNerney and Mendelsohn (Reference McNerney, Mendelsohn, Avery and Ehrlich1992) | Word-level stress/unstress; sentence-level stress/unstress; major sentence stress (or focus); intonation; linking and pausing; palatalization in rhythmically related contexts | None specified | Reasoning based on the belief that more change can be achieved by focusing on suprasegmentals in a short-term course |
| Gilbert (Reference Gilbert2001) | Key vowel sound/spelling correspondences; final consonants signaling grammatical meaning; linking; word stress; strong and weak syllables; emphasis (prominence) | | Priorities for beginning learners based on experience as a teacher and textbook writer |
| Morley (Reference Morley1991) | An expanded domain for pronunciation, including (in no particular order): communication skills; suprasegmentals; segmentals; voice quality; body language; greater learner and teacher involvement in developing self-monitoring skills; contextualization; linking; greater attention to sound–spelling relationships; attention to individual differences among learners | None specified | Reasoning based upon the asserted need for pronunciation to take on expanded roles in the language classroom |
| Henderson (Reference Henderson2008) | Pacing of speech; rate of speech; word stress | | A review of principles put forth by other writers. The choice of features for the short course is not clearly justified |
| Studies related to English as an international language (EIL)/ELF contexts | | | |
| Jenkins (Reference Jenkins2000) | Most consonant sounds; some consonant cluster simplifications involving deletions; vowel length; nuclear stress (i.e., prominence) | Interdental fricatives; [ɫ]; consonant cluster epenthesis; stress-based rhythm; weak forms; intonation; lexical stress | Forty errors in NNS–NNS interaction, twenty-seven of which were directly related to pronunciation. Additional criteria of teachability and learnability |
| Walker (Reference Walker2010) | Same as Jenkins (Reference Jenkins2000), including rhotic [ɹ] in all positions; intervocalic [t] rather than flap in city, beauty; word stress as a basis for nuclear stress; weak forms and vowel reduction for receptive competence | Same as Jenkins (Reference Jenkins2000) with some modifications for specific language groups | Jenkins’ (Reference Jenkins2000) findings modified by trying to implement the LFC. Other research findings also consulted |
| Deterding and Kirkpatrick (Reference Deterding and Kirkpatrick2006) | No priorities given | | A descriptive study of the features that may be part of an emerging South-East Asian variety of English |
| Kirkpatrick et al. (Reference Kirkpatrick, Deterding and Wong2008) | No priorities given | | A study of the intelligibility of Hong Kong English to listeners in Singapore and Australia. Hong Kong English speakers were generally highly intelligible, but high intelligibility did not guarantee perceptions of likeability or intelligence |
Critiquing the Recommendations
The different attempts to specify priorities are a mishmash of incomplete and contradictory recommendations. Some of the studies offer recommendations based on experience (Firth, Reference Firth, Avery and Ehrlich1992; Gilbert, Reference Gilbert2001; Kenworthy, Reference Kenworthy1987; McNerney & Mendelsohn, Reference McNerney, Mendelsohn, Avery and Ehrlich1992), others provide priorities based on analysis of similarities and differences between English varieties (Jenner, Reference Jenner1989) or careful experimental evidence (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012) or other objective analyses based on models of intelligibility (Brown, Reference Brown1991; Jenkins, Reference Jenkins2000); some study intelligibility without any intention of recommending pronunciation priorities (Deterding & Kirkpatrick, Reference Deterding and Kirkpatrick2006; Kirkpatrick et al., Reference Kirkpatrick, Deterding and Wong2008), and others describe not only what should be included but also what should be excluded (Brown, Reference Brown1991; Jenkins, Reference Jenkins2000; Jenner, Reference Jenner1989). Features such as word stress are seen to be essential in some research (Isaacs & Trofimovich, Reference Isaacs and Trofimovich2012), while they are seen as relatively unimportant in other recommendations (Jenkins, Reference Jenkins2000) or potentially important in relation to other features (Walker, Reference Walker2010). A focus on suprasegmentals is encouraged by some authors (McNerney & Mendelsohn, Reference McNerney, Mendelsohn, Avery and Ehrlich1992), while it is largely bypassed in favor of segmentals in other accounts (Jenkins, Reference Jenkins2000). 
Some writers seek to achieve quicker rates of improvement in intelligibility by focusing first and foremost on features that are not usually part of pronunciation instruction (Firth, Reference Firth, Avery and Ehrlich1992), while other recommendations read like a pronunciation wish-list with no attempts to prioritize (Morley, Reference Morley1991). Sounds such as /θ/ are left off many lists, including Jenkins’ influential LFC, but other writers long to keep /θ/ because its supposed teachability may make it easier for learners to feel success and thus try harder sounds (Firth, Reference Firth, Avery and Ehrlich1992) or because there are situations in which /θ/ can affect intelligibility (Deterding, Reference Deterding2005; Henderson, Reference Henderson2008). Some of these seemingly contradictory recommendations are likely due to different L1 learner groups, limited numbers in each study, and the context in which the study took place.
It is clear that there is a further need to examine principles that help teachers decide on priorities based on context, allowing a finer-grained analysis than any single study can provide. All teachers have to prioritize, and it is best to have explicit, research-based support for setting priorities (Derwing & Munro, Reference Derwing and Munro2005). Much that has been written about priorities cannot be called research-based, and the articles that are based on empirical data should therefore have greater weight. Jenkins (Reference Jenkins2000), for example, has been much discussed, much praised, and much derided, but her recommendations are valuable because they are based upon evidence. However, the amount of evidence is small and some of the recommendations have been called into question (Dauer, Reference Dauer2005; McCrocklin, Reference McCrocklin, Levis and LeVelle2012). Her twenty-seven pronunciation errors (of forty total errors impacting intelligibility) allow us to make suggestive recommendations about what should and should not be included in the core features. /θ/, for example, did not lead to loss of intelligibility in her data. Clearly, there must have been many other phonemes that likewise did not lead to loss of intelligibility, but Jenkins made recommendations against /θ/ as a core item, and for other sounds, based not only on her evidence but also on her view of English’s role in the world. This indicates that decisions about priorities must be made not only on explicit evidence, but also on how implicit evidence is interpreted regarding pronunciation’s role in communicative success or failure in particular communicative contexts.
Context and Intelligibility
Finally, intelligibility is sensitive to the context in which communication takes place. What this means is that the degree of accuracy that determines intelligibility changes from one situation to the next, depending on the type of language use required. Intelligibility can be seen as the lowest possible standard that a speaker has to meet in order to get by. Hinofotis and Bailey (Reference Hinofotis, Bailey, Cameron Fisher, Clarke and Schachter1981), in a now famous phrase, talk about “an intelligibility threshold” that speakers must meet in order to communicate effectively. The intelligibility threshold, rather than being an objective criterion, is actually a moving target that includes much more than pronunciation (Tyler, Reference Tyler1992). Much of what causes the target to move is the context in which speech takes place. The influence of context is understudied, but it is likely to be an important determinant of how intelligible a particular speech sample is.
It should be obvious that certain contexts of use have higher stakes for both the speaker and the listener. For example, if your job involves staffing a cash register in an area in which ethnic shops are the norm (e.g., Chinatown), your needs for understandable pronunciation in English may be relatively low. Most of the customers are likely to either be from the same speech community or outsiders who have decided to take the extra step to shop there rather than somewhere else. The people the clerk interacts with are either sympathetic or are unlikely to come back regularly (tourists). Contrast this context with an instructor in a university class. Interactions are regular and required (students have to come to the lab or breakout session), high-stakes (performance in the class is highly dependent on being able to understand material through the mediation of the instructor), and subject to significant cross-cultural conflict (misunderstanding may be likely to be seen as caused by the inability of the other to play the expected role; cf. Rubin, Reference Rubin1992). High-stakes contexts include education, health, and translation, all areas in which speakers and listeners have to negotiate language and culture barriers and where the cost for failure is very real.
In North American higher education, many basic classes in the natural sciences, mathematics, and engineering are taught by NNSs of English. While some of these teachers are regular faculty, many are graduate teaching assistants who help fund their education through teaching. The classes they teach include not only majors in the field, but also students from other fields who are required to take courses that they may not feel comfortable with or enjoy. The teaching assistants’ overall proficiency in English is very high. These NNSs have, after all, been admitted to demanding graduate degree programs in a foreign-language setting, and their needs for language support are targeted to specific areas such as speaking or writing.
Having these students teach sets up a natural context in which undergraduates’ stress from learning demanding new content (e.g., organic chemistry) can interact with stress from the way the content is presented (which may be due to the graduate students being inexperienced teachers or due to cultural views of appropriate teacher and student behavior), mixed in with unfamiliar or hard-to-understand accents. Pedagogical effectiveness and unfamiliar or inadequate pronunciation or language skills may both be implicated in lack of achievement, but inadequate language skills are most likely to be blamed for ineffective teaching. Many international teaching assistant (ITA) and faculty development programs recognize that working on teaching and presentation skills can lead to greater success (and presumably, a better perception of comprehensibility) even without extensive work on language skills.
Other studies also make clear that comprehensibility is not based only on pronunciation. Tyler (Reference Tyler1992) asked raters to listen to two presentations, one given by an ITA and one by an NTA (native teaching assistant). Both presentations were then transcribed and read aloud by an NS of English so that pronunciation would not be a factor in how the presentations were rated. Raters evaluated the ITA as being less effective and less easy to follow than the NTA. The researcher argued that the ITA’s use of unexpected, nonparallel discourse markers (e.g., “the first one” followed by “and then” and “after that”), not establishing clear synonyms or clearly linking pronominal forms to the original noun phrases, and overuse of coordination and underuse of subordination, caused a loss of understanding.
The use of discourse markers may also be involved in how easy it is to understand speech. Williams (Reference Williams1992) found that when discourse moves were explicitly marked, ITA presentations were rated as being more comprehensible. Tyler and Bro (Reference Tyler and Bro1992), however, found that ITAs overused simple additive connectors that were ambiguous in the connections between ideas in the discourse. Liao (Reference Liao2009) examined Chinese ITAs’ use of common English spoken discourse markers in interactions with an interviewer and found that the ITAs overused some (especially yeah) and underused others (e.g., well, I mean). Overall, their markers were more restricted in range than for NTAs and included innovations that were not likely to be understood easily by NS interlocutors. At the level of grammatical competence, lexico-grammatical features may hamper ITAs’ ability to communicate information clearly. Tyler, Jefferies, and Davies (Reference Tyler, Jefferies and Davies1988) found that NTAs often used strategies to focus listener attention on information to be foregrounded and backgrounded, but ITAs did not.
In addition, even speech that is completely intelligible may be heard as heavily accented or take more effort to understand (Munro & Derwing, Reference Munro and Derwing1995). Expectations and implicit stereotypes may affect how well listeners understand a speaker, despite the speaker being intelligible. Rubin (Reference Rubin1992) played a short lecture given by a female speaker of General American English to undergraduate students under two guises: a Caucasian guise and an Asian guise. In the study, some listeners heard the lecture while looking at a picture of a blond Caucasian woman, while other listeners heard the same lecture (spoken by the same voice) while looking at a picture of an equally attractive Asian woman. When asked to demonstrate their understanding of the lecture, the listeners who heard the lecture in the Asian guise understood significantly less well than those who heard the lecture in the Caucasian guise. Comprehension was measured via a cloze of the passage with every seventh word deleted. In addition, listeners completed a semantic differential instrument with scales measuring their attitudes, issues related to background, values, and appearance, as well as items related to accent, ethnicity, and teaching qualifications. While the Munro and Derwing (Reference Munro and Derwing1995) study demonstrated that listeners can decode speech with 100 percent intelligibility yet find it heavily accented, this study suggests that lack of understanding can be affected by seemingly unrelated nonlanguage factors, in this case the unconscious biases that listeners bring with them to interactions. More recent research suggests that this bias may also be connected to congruence between the visual and the aural. McGowan (Reference McGowan2015) used transcription accuracy in noise to examine whether listeners would be more accurate in transcribing Chinese-accented speech when presented with a Chinese face, a Caucasian face, or an unspecified silhouette. 
Listeners transcribed more successfully when presented with a congruent face (a Chinese face), a finding that was consistent despite differences in experience with listening to Chinese-accented English.
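The cloze measure mentioned above for Rubin's comprehension test (every seventh word deleted) can be illustrated with a short sketch. This is only a minimal illustration of the general cloze procedure, not a reconstruction of Rubin's instrument: the function names, the blank marker, the choice to start counting from the first word, and the exact-word scoring rule are all assumptions made for the example.

```python
def make_cloze(text, n=7, blank="_____"):
    """Replace every n-th word with a blank.

    Returns the gapped passage and the list of deleted words (the answer key).
    Counting starts at the first word of the passage (an assumption).
    """
    gapped, answers = [], []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            answers.append(word)
            gapped.append(blank)
        else:
            gapped.append(word)
    return " ".join(gapped), answers


def score_cloze(answers, responses):
    """Proportion of blanks filled with the exact original word.

    Comparison ignores case and surrounding punctuation (an assumed scoring rule).
    """
    if not answers:
        return 0.0

    def normalize(word):
        return word.strip(".,;:!?").lower()

    correct = sum(normalize(a) == normalize(r) for a, r in zip(answers, responses))
    return correct / len(answers)


passage = ("one two three four five six seven "
           "eight nine ten eleven twelve thirteen fourteen")
gapped_passage, answer_key = make_cloze(passage)
listener_score = score_cloze(answer_key, ["seven", "forty"])
```

In this toy passage the seventh and fourteenth words are blanked, and a listener who restores only the first of them scores 0.5. Comparing mean scores across the two guise conditions would then show whether one group understood less, which is the kind of comparison the study reports.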
Conclusion
Many recommendations about priorities for pronunciation teaching are painted in broad brushstrokes that probably mask distinctions between important and unimportant features within the same category. For example, Isaacs and Trofimovich (Reference Isaacs and Trofimovich2012) found that pitch movement correlated with comprehensibility ratings at a relatively high rate, whereas pitch range showed no correlation. Yet both are considered to be part of intonation. Levis (Reference Levis1999a) suggests that final intonation may be important for certain grammatical structures (e.g., declaratives), while the same pitch movements are relatively unimportant for others (e.g., yes–no questions). Syllable structure modifications are part of Jenner’s (Reference Jenner1989) core and Gilbert’s (Reference Gilbert2001) recommendations, while Jenkins (Reference Jenkins2000) distinguishes between initial deletions (core) and some medial and final deletions and epenthesis (non-core).
It seems clear from the often conflicting, and sometimes contradictory, recommendations that the criteria we use to set priorities are often themselves unclear, based on (un)informed intuition, unsupported by research findings, perhaps because research findings themselves have many gaps. We are still trying to make sense of a picture in which too many elements are missing, like working on a jigsaw puzzle with only half the pieces available. A first step toward understanding the bigger picture is to try to specify and justify guidelines that may help us describe what an intelligibility-based approach might look like in the classroom (as in Chapters 8 and 9). It may also be that we need a better picture of how pronunciation not only affects understanding in a vacuum (i.e., via the speech signal alone), but also how it affects understanding in social and communicative contexts.