17.1 Introduction
In recent decades, our knowledge of how bilingual listeners perceive speech in their first and second languages (L1 and L2, respectively) has advanced considerably; however, there is much that remains unknown. There are fundamental gaps that we have yet to fill about cross-language speech perception in bilinguals. Historically, this has perhaps been driven by linguistic theories that have been based on monolingual assumptions or baselines. This trend continues to this day; most current theories of speech perception make monolingual assumptions (although there are a few exceptions). In the early twentieth century, linguistic researchers conceived of bilinguals as problematic or ignored them entirely. Such stances have been criticized because it is estimated that more than half of the world’s population is bilingual (Grosjean, 1982). Therefore, if we seek to develop a deep understanding of how human listeners perceive speech, it is essential to study those listeners who are the most representative. Occasionally, bilinguals are treated as a population that is worth studying, but as a special case. This logic is flawed. Bilinguals must be studied in order to answer fundamental questions concerning how language experience shapes the perception of speech.
In this chapter, I outline the challenge of speech perception that adult bilingual listeners face, and how the L1 and the L2 phonological systems interfere with one another during the act of speech perception. I provide an overview of studies that have examined bilinguals’ perception of speech segments, be it in the L1, the L2, or in an unfamiliar, non-native language. I also explore each of the factors that are known to affect bilinguals’ perception of speech, which include age of L2 acquisition, effects of L1:L2 usage and how they relate to language dominance and proficiency, and short-term contextual effects on speech perception. I also summarize the small, but growing, literature on bilinguals’ perception of suprasegmentals. I end the chapter by exploring some of the crucial questions facing the field of bilingual speech perception and what possible research lines could address these challenges.
17.2 Perception of Speech Segments by Bilinguals
Speech is the result of a series of intricate, coordinated gestures involving passive and active articulators that generate complex acoustic signals. As is the case for other objects and events in the environment, human listeners perceive speech segments (i.e., consonants and vowels) as belonging to psychologically meaningful categories.
Research examining the speech perception abilities of adult bilingual listeners makes use of sensitive perceptual tasks in order to discover: (1) whether a bilingual’s two languages interfere with one another during the act of perception; (2) if so, in which direction; and (3) to what degree. The vast majority of work has explored the speech perception abilities of bilinguals by comparing their performance to monolingual baselines. There are two possible outcomes for such research designs: (1) bilinguals complete some perception task and their performance is equivalent to that of monolinguals; or (2) bilinguals’ performance deviates from that of monolingual baselines. The former eventuality is typically taken as evidence that the bilingual listeners are able to successfully keep the L1 and the L2 separate and free of interference. The details of the interpretations sometimes differ. For example, some may claim that bilinguals maintain separate L1 and L2 phonological systems, whereas others will explain this pattern of results as suggesting that the L1 and the L2 categories coexist in a common space but are free of L1–L2 phonetic interaction. The alternative eventuality is that bilinguals deviate from the performance of monolinguals. There are a few ways for this to occur. Most commonly, bilinguals will perform at a compromise point between monolinguals of the L1 and the L2, which is typically interpreted as evidence that the L1 and the L2 have influenced one another, or possibly merged, and come to resemble each other in perception. Another possibility is that bilinguals will exceed the performance of monolinguals by scoring more extreme values on the task being used. Such findings are sometimes referred to as a “bilingual advantage” on a given task. It is important to note that comparisons between bilinguals and monolingual baselines are based on assumptions, and these assumptions have received criticism (see Grosjean, 1998).
Nevertheless, many speech perception studies, including those summarized in this chapter, have compared bilinguals with monolinguals, and such comparisons are likely to remain a common feature of research into bilingual speech perception.
17.2.1 Bilinguals Are Flexible Perceivers
As they go about their lives, bilinguals will often use and switch between their languages. To the outside observer, this flexibility appears simultaneously remarkable and effortless, occurring without hesitation, and facilitating communication with others. What makes bilingual speech perception abilities even more impressive is that bilinguals must overcome greater challenges during the course of speech perception than their monolingual counterparts. For instance, consider that a child being raised in a bilingual environment will, by definition, receive less input in each language than a child acquiring a single language. Despite this discrepancy in input, bilingual children show similar developmental trajectories, meeting milestones within similar time frames and at a comparable rate to their monolingual peers. Bilingualism is proof of the human capacity to learn and use multiple languages and is characterized by its resilience, robustness, and flexibility.
Laboratory studies examining the effects of bilingual language experience and use employ very sensitive experimental tasks to understand the consequences of bilingualism for the psycholinguistic organization of the L1 and the L2. As we review the studies that have been conducted within the field throughout the remainder of this chapter, it may seem that their focus is on the constraints or limits of bilingualism. However, this should not be interpreted as suggesting that the bilingual perceiver is less effective than a monolingual one. Please keep in the forefront of your mind that bilinguals navigate the world without difficulty and communicate effectively in their daily lives, almost without exception.
17.2.2 The L1 Exerts Persistent Effects on Perception of L2 Contrasts
A large literature has developed over the past forty years which demonstrates that the language learning that occurs during the first years of life has a privileged status. Even after many years, traces of the early acquired L1 are observable in the performance of bilingual listeners when completing speech perception tasks. This is a robust finding that has been replicated dozens (if not hundreds) of times and reported by many different research groups across the world.
The earliest research on cross-language speech perception was published in the early twentieth century, and Bloomfield (1933, p. 79) argued that language-specific phonetic details were critical, and that listeners “ignore the rest of the gross acoustic mass that reaches [their monolingual] ears.” Trubetzkoy ([1939] 1969) made the important contribution of positing that speech perception is shaped by prior language experience, likening the L1 phonological system to a “sieve” through which L2 speech must pass. This was an important idea because it meant that when two listeners were presented with the same acoustic signal, they would differ in how successfully they would perceive non-native speech as a function of their L1. Hockett (1958) emphasized the importance of considering the relationship between sounds in a “network,” and Brière (1966) added that perception of L2 speech required “exhaustive” analyses of phonetic details, both of which informed our understanding of L1 influence on phonetic contrasts. It was Weinreich (1953) who described the importance of factors that contribute to variation among bilinguals such as demographic characteristics, years of L2 use, language dominance, and the social settings within which the L1 and the L2 are used; these factors also contribute to the degree of L1 interference in the L2. Within a relatively short period, scientific thinking concerning speech perception had advanced considerably, and researchers sought an explanation for why the L1 exerts a persistent influence on perception of the L2. Inspired by work on animal imprinting in early life, Lenneberg (1967) proposed a critical period for L2 speech learning that, due to neurological maturation, restricts L2 learning from early adolescence.
(Though compatible with the observation that the L1 exerts a persistent influence on the L2, the critical period has been criticized as lacking nuance in recent years; see Mayberry and Kluender, 2018; Singleton and Leśniewska, 2021.)
Much of the early work (in the 1970s–1980s) exploring the persistent influence of the L1 involved the perception of L2 stop-voicing contrasts (e.g., Elman, Diehl, & Buchwald, 1977; Flege & Eefting, 1987; Williams, 1977). Compatible results emerged soon after, concerning how the L1 also restricted the perception of L2 vowels (e.g., Bohn & Flege, 1990; Mack, 1989). A consistent observation during this period was that the L1 influences both the identification and the discrimination of L2 segments. It was during this period that we saw the emergence of the influential theoretical models that dominated the field for the coming decades (Best, 1995; Flege, 1995), and their revised versions are still relevant to this day (Best & Tyler, 2007; Flege & Bohn, 2021).
Seminal contributions to this literature were made by researchers from the bilingual city of Barcelona. Across many studies, this research group investigated the perceptual abilities of bilingual listeners of Spanish and Catalan, two Romance languages spoken throughout the city. Bilinguals born in Barcelona are typically exposed to their L2 very early, and due to the bilingual nature of the city have the opportunity to use their two languages in daily life. Although many different experimental tasks have been used, a consistent observation across studies published out of Barcelona has been that those bilinguals born within predominantly Spanish-speaking families (i.e., bilinguals who are Spanish-dominant) encounter greater difficulties in perceiving Catalan-specific contrasts (Bosch, Costa, & Sebastián-Gallés, 2000; Pallier, Bosch, & Sebastián-Gallés, 1997). This effect is well demonstrated in the work of Pallier, Colomé and Sebastián-Gallés (2001), who asked participants to complete a lexical decision task that included phonemes that are contrastive in Catalan but not Spanish (e.g., /son/ versus /sɔn/). Catalan-dominant bilinguals successfully judged the Catalan minimal pairs to be distinct words, whereas the Spanish-dominant bilinguals showed repetition priming, that is, they perceived the Catalan-specific minimal pairs to be repetitions of the same word, leading to faster responses. These results demonstrate the strong and persistent effects of the early acquired L1 (in this case, Spanish) on perception of L2 contrasts even in bilingual listeners who are very proficient in both languages and use them on a daily basis.
Other studies published by this group explored differences in the early emergence of language dominance in simultaneous bilinguals (Ramon-Casas et al., 2009), and the long-lasting effects of parental input during the first year of life (Sebastián-Gallés, Echeverría, & Bosch, 2005).
To this day, new studies continue to be published that advance our knowledge of how the L1 exerts a persistent influence over perception of the L2. These studies are adding an unprecedented level of nuance concerning how the L1 interacts with individual differences (Chang & Kwon, 2020) and task demands (Freeman et al., 2022), and shapes perception of segments that are difficult to categorize (Faris, Best, & Tyler, 2018), as well as lexical tones (Chen, Best, & Antoniou, 2020).
17.2.3 Bidirectional Effects: Learning an L2 Alters Perception of the L1
The studies covered in this chapter thus far have established that the L1 exerts a persistent influence on the L2. However, a considerable number of studies have also shown that L2 learning alters perception of the L1. These effects are perhaps best exemplified in the seminal work of Flege and colleagues who have repeatedly reported such bidirectional L1–L2 effects in studies that have examined speech production (e.g., Flege, 1987; Flege & Eefting, 1987; MacKay et al., 2001), as well as speech perception.
In my own work, I have observed clear L2 interference on the perception of L1 segments. Antoniou, Tyler and Best (2012) examined the L1 and L2 perceptual abilities of Greek-English early bilinguals who had been exposed to Greek since birth but who had become dominant in their L2, English, at a relatively young age. The bilinguals were compared to Greek and English monolinguals in their ability to perceive Greek and English stop-voicing contrasts. The bilinguals accurately identified both L1 and L2 stop consonants in a variety of phonetic environments (syllable-initial, intervocalic, and medial post-nasal). The bilinguals were sensitive to the differences between the phonetically similar short-lag stops of Greek (voiceless) and English (voiced initial stops). When asked to categorize the stops, the bilinguals appeared to be able to attend to high-level, abstract, phonological information, and identified the phonetically distinct English and Greek stops as phonological cognates (e.g., English [pʰ] and Greek [p] were both labeled as belonging to the phonological category /p/). The bilinguals also showed sensitivity to the differing L1 and L2 phonetic realizations of the same phoneme (e.g., English versus Greek realizations of /p/), as indicated by differences in their category-goodness ratings across languages. However, although the bilinguals showed clear differences when categorizing and assigning goodness-of-fit ratings to L1 and L2 stops, when asked to discriminate the stops, the bilinguals exhibited a pattern that mirrored the English monolingual listeners almost perfectly, and, remarkably, this pattern held across almost all positional contexts. The strikingly similar discrimination patterns observed between the Greek-English bilinguals and the Australian English monolinguals seem to suggest that the bilinguals have come to resemble monolinguals of the L2 even though they had continued to use their L1 for many years.
17.2.4 Bilinguals’ Perception of Non-native Speech Segments
The vast majority of studies that have examined the speech perception abilities of bilinguals have examined the patterns that bilinguals show when perceiving the speech of their L1 and their L2. A relatively small number of studies have investigated the effects of bilingualism on the perception of unfamiliar, non-native speech contrasts. Much has been learned from presenting monolingual listeners with non-native contrasts; however, our understanding of how bilingualism and L1:L2 interaction affect perception of non-native contrasts is limited at present. Studies that present bilinguals with non-native speech segmental contrasts are typically concerned with determining how and to what degree the L1 and the L2 phonological systems shape the perception of non-native contrasts (Antoniou, Liang, et al., 2015). This is best illustrated using an example. Consider a non-native contrast that falls across the L1 and the L2 phonological systems of a bilingual (i.e., one segment most closely corresponds to an L1 category, whereas the other most closely resembles an L2 category). In such a situation, would a bilingual hold an advantage over a monolingual listener when attempting to discriminate the given contrast? The evidence in the literature is not entirely clear.
Melguy (2017) asked eight fluent Spanish-English early bilinguals and ten American English monolinguals to discriminate a number of unfamiliar Nepali stops [t̪ʰa, tʰa, t̪a, d̪a, da, kʰa, ga]. The participants did not possess any experience with Nepali, and the bilinguals had been raised in the US, had acquired English by the age of seven, and were L2 dominant. The bilinguals and the monolinguals did not differ in terms of how successfully they were able to discriminate the Nepali contrasts. This finding was interpreted as suggesting that, due to L2 learning, the bilinguals had undergone perceptual reorganization and now resembled monolingual listeners of their L2, English, when attempting to discriminate the non-native Nepali contrasts. As in Antoniou et al. (2012), the bilinguals relied on the phonetic categories of their dominant language (in both cases the L2) during discrimination.
Antoniou, Best, and Tyler (2013) presented non-native Ma’di stops /ba, pa, mba, ɓa, da, ta, nda, ɗa/ to Greek-English bilinguals as well as Greek and English monolinguals. Half of the bilinguals completed the experimental tasks in Greek, while the other half completed them in English, and this had a clear effect on categorization (see Section 17.3.4 on language mode). To illustrate, the Greek monolinguals and the bilinguals who completed the task in Greek both assimilated Ma’di /ba/–/pa/ and /da/–/ta/ to two native categories, whereas the English monolinguals and the bilinguals completing the task in English did not. In discrimination, the Greek monolinguals discriminated Ma’di /ba/–/pa/ and /da/–/ta/ excellently, whereas the English monolinguals had difficulty discerning the prevoiced versus unaspirated contrast. All groups had difficulty discriminating the contrasts involving prenasalized stops and voiced implosives /ba/–/mba/, /da/–/nda/, /ba/–/ɓa/, and /da/–/ɗa/. Most importantly, the two bilingual groups exhibited almost identical discrimination patterns, regardless of which language they completed the experiment in. The bilinguals’ discrimination of the non-native Ma’di contrasts was intermediate to that of the two monolingual groups, suggesting that the bilinguals were uniquely configured as a result of their cumulative L1 and L2 experience, and were unlike either monolingual group.
McKelvie-Sebileau and Davis (2014) asked French-English and English-French bilinguals and English monolinguals to discriminate the Thai contrasts /ba/–/pa/ (which occurs in French), /pa/–/pʰa/ (which occurs in English), and /ba/–/pʰa/ (which occurs in both languages). The results showed that the French-English bilinguals, who had been born in France, successfully discriminated the native as well as the non-native contrasts. The English-French bilinguals, who had been born in Australia or New Zealand, discriminated the native contrast better than the non-native contrasts, but overall performed similarly to the English monolinguals. It is possible that the French-English bilinguals benefited from the native English input that they had encountered during the period of time that they had been living in Australia along with the use of their L2, which was almost certainly greater than the English-French bilinguals’ L2 use. This finding suggests that bilingual speech perception can be shaped by a number of factors relating to patterns of language acquisition as well as language use.
These studies demonstrate that unfamiliar, non-native languages are useful tools for probing the limits of bilinguals’ speech perception abilities. Just as presenting non-native contrasts to monolingual listeners has deepened our understanding of the speech perception abilities of single-language users, there exist a great number of untapped possibilities for exploring the speech perception abilities of bilingual listeners using unfamiliar languages.
17.3 Factors Known to Influence Speech Perception in Adult Bilingual Listeners
For much of the twentieth century, bilingual speech perception was characterized as “incomplete” or “imperfect.” A common but incorrect assumption is that bilinguals must possess an equal command of both languages in order to be considered “truly bilingual.” According to this problematic view, a bilingual must compartmentalize their two languages, keeping them free of interference, and function in a manner akin to two monolinguals within the same person. However, there now exists very convincing evidence that even the most fluent bilinguals will not function exactly like monolinguals when sensitive experimental tasks are used because the two languages are constantly interacting and interfering with one another. The degree to which one language influences the other will depend on numerous dimensions that relate to a bilingual’s patterns of language acquisition and use. The factors can vary across both long and short time frames. In recent decades, these factors have been studied extensively, particularly in the seminal work of Flege and colleagues. Next, I describe the concepts of age of acquisition, L1:L2 usage ratio, language dominance, and language mode, and explore how they affect speech perception in adult bilinguals.
17.3.1 Age of Acquisition of the L2
Age of acquisition refers to how old an individual is when they are first exposed to the language that will become their L2. The age at which this initial exposure occurs is usually taken as the commencement of L2 learning, although for obvious reasons it can be an imperfect measure (e.g., if there is a pause in language exposure due to immigration). Age of acquisition can vary greatly across individuals, and there is a spectrum of possible exposure times across the life span. Some will be exposed to the L2 very early in infancy, whereas others will acquire an L2 in later life. The earliest possible exposure occurs in simultaneous bilinguals who are exposed to both languages essentially from birth, during the first year of life. Heritage speakers are exposed to their heritage language from birth within the home and may first be exposed to the dominant language spoken within the country where they reside in early childhood, typically when they first attend preschool or school. For immigrants and asylum seekers, age of acquisition is often equated with the age at which they have arrived in their host country. This can, of course, vary widely and range from early childhood to late adulthood, as determined by each individual’s circumstances. Age of acquisition has a profound effect on the formation of phonetic categories, and the resulting L1–L2 interaction has a lasting effect on speech perception. Our understanding of these factors and how they contribute to speech perception has been informed by many studies, in particular those conducted by Flege and colleagues, whose work has allowed us to develop a detailed picture of how age of acquisition affects the perception of L2 speech.
Flege (1993) presented Chinese participants with the English stop consonants /t/ and /d/ in word-final position. This contrast is difficult for Chinese listeners to perceive correctly because Chinese does not have word-final stop-voicing distinctions. In English, native speakers will lengthen the vowels preceding /d/ but not /t/ in words such as bad versus bat, which has the perceptual consequence of biasing native English listeners to perceive ambiguous word-final stops as voiced (i.e., /d/) when preceded by longer vowels. Flege (1993) compared nine Chinese-English early bilinguals who had arrived in the United States by the age of ten, to thirty Chinese-English late bilinguals whose regular exposure to English occurred only after they had arrived in the United States. Participants were presented with seventeen-step continua in which the vowel durations had been manipulated in order to bias listeners to perceive the word-final consonant as /d/ or /t/. Both Chinese-English bilingual groups chose continuum steps with longer vowels, indicating that they were sensitive to the role that vowel length plays in determining whether the final stop is voiced. However, this effect was attenuated in the late bilinguals. They showed a smaller effect of vowel length when compared to the early bilinguals or the native English listeners, confirming that their ability to perceive the relationship between vowel length and word-final stop voicing in English was constrained by their later age of acquisition.
Flege, MacKay and Meador (1999) asked Italian-English bilinguals to perceive a series of English vowel contrasts. The bilinguals varied considerably in terms of the age at which they had arrived in Canada (taken as a proxy for their age of acquisition). The seventy-two Italian-English bilinguals were divided into groups on the basis of their age of acquisition. Those bilinguals who migrated to Canada at approximately seven years of age were put into the “early” group; those who arrived at fourteen years of age were put into the “mid” group; and those who arrived at nineteen years of age were placed into the “late” group. The bilinguals were presented with several vowel contrasts from their L1 Italian (/u/–/o/, /e/–/a/, /u/–/i/), with several from their L2 English (/i/–/ɪ/, /u/–/ʊ/, /ʌ/–/æ/, /ɒ/–/ʌ/), and with a selection of L1–L2 cross-language contrasts (/æ/–/a/, /ʌ/–/a/, /ʊ/–/o/, /eɪ/–/e/). For the English–English and English–Italian contrasts, the English monolinguals showed better discrimination than the bilingual groups. For the Italian–Italian contrasts, the English monolinguals, early bilinguals, and mid bilinguals showed better discrimination than the late bilinguals. Importantly, as age of acquisition increased, discrimination accuracy was reduced. These results indicate that increasing age of acquisition reduces the accuracy with which L2 vowels will be perceived.
Subsequent studies have confirmed that age of acquisition has far-reaching effects on speech perception (for a review, see Birdsong, 2018). It has even been proposed that age of acquisition is the primary determinant of less than native-like L2 attainment (Bylund, Hyltenstam, & Abrahamsson, 2021). Its influences have been observed across all aspects of speech perception, extending beyond segments to the behavioral and neural categorization of lexical tones (Zhao et al., 2023).
17.3.2 L1:L2 Usage Ratio
Bilinguals use their languages across diverse social contexts, with different interlocutors, and for a variety of communicative purposes. Different social situations will require different languages to differing degrees. For instance, a monolingual work setting will necessarily differ from a language brokering situation, which will also differ from code-switching with friends. The L1:L2 usage ratio offers a global estimate of the amount of time an individual spends communicating in the L1 and the L2. It is an imperfect measure because quantifying the time spent communicating in the L1 and the L2 is very difficult, particularly if the bilingual in question switches languages frequently. The vast majority of studies that examine L1:L2 usage ratio rely on self-report. However, self-report measures are prone to error and as such should be used only as an approximation of language use, serving as a general guide rather than a definitive measure (Grosjean, 1998). Nevertheless, L1:L2 usage is a useful concept because it captures the relative amount of time an individual spends using each language. In broad terms, the amount of time spent using the L1 is inversely related to the amount of time spent using the L2. For example, bilinguals who report high levels of continued L1 usage may have only limited exposure to their L2. The opposite can also be true. Bilinguals who report using their L1 in limited settings may use their L2 more widely, which over time can have implications for language dominance. In general, a higher amount of language usage is positively correlated with increased proficiency in that language, and this in turn has consequences for speech perception. In the medium to long term, changes in language use patterns are known to influence speech perception in bilingual listeners.
To examine this systematically, studies control for age of acquisition by comparing bilinguals who acquired their languages at comparable ages but differ in their L1:L2 usage ratios.
Flege and MacKay (2004) presented English vowel contrasts to Italian-English bilinguals across four experiments. In Experiments 1 and 2, Italian native listeners who had recently arrived in Canada (< 3 months) had difficulty distinguishing the English vowel contrasts /ɒ/–/ʌ/, /ɛ/–/æ/, and /i/–/ɪ/, often labeling both English vowels as members of the same Italian vowel category. Flege and MacKay (2004) went on to examine the contributions of age of acquisition as well as the amount of L1 Italian use to discrimination of English vowels in two follow-up experiments. Experiments 3 and 4 compared Italian-English bilinguals who had spent years living in Canada to Canadian English monolinguals. The bilinguals were subdivided depending on the age at which they arrived in Canada and how much they used their L1 Italian on a daily basis, ranging from low (i.e., < 15 percent of the time) to high use of their L1 (i.e., > 25 percent of the time). This created four groups of bilinguals, referred to as early low, early high, late low, and late high. The listener groups were asked to discriminate the English vowel contrasts /ɛ/–/æ/, /ɒ/–/ʌ/, and /i/–/ɪ/.
The combined findings from Experiments 3 and 4 converged to confirm that the early bilinguals were better able to discriminate English vowel contrasts than were the late bilinguals. Additionally, bilinguals who used their L1 Italian less frequently better discriminated the English vowel contrasts than bilinguals who used their L1 more frequently. Further, the factors of age of acquisition and frequency of L1 use were found to interact. The early low bilingual group (i.e., those who seldom used Italian) did not differ from native speakers of English in perceiving English vowels, whereas the early high group deviated from the English monolinguals. These findings demonstrate that while speech perception is affected by age of acquisition, its effects are modulated by the L1:L2 usage ratio. Early bilinguals who frequently use the L1 are more likely to show L1 interference when perceiving speech in the L2.
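The four-way grouping described above (early low, early high, late low, late high) can be made concrete with a minimal sketch. The L1-use cutoffs (< 15 percent and > 25 percent) are taken from the description above; the early/late arrival boundary is not stated in this chapter, so the value used here is purely an illustrative assumption, as are all names in the code.

```python
from dataclasses import dataclass
from typing import Optional

# L1-use cutoffs as described in the text: low = < 15 percent, high = > 25 percent.
LOW_L1_USE_MAX = 15.0
HIGH_L1_USE_MIN = 25.0

# ASSUMPTION: the chapter gives no numeric early/late boundary.
# This value is a placeholder for illustration only.
EARLY_ARRIVAL_MAX_AGE = 12.0


@dataclass
class Bilingual:
    participant_id: str
    age_of_arrival: float   # years; used as a proxy for age of acquisition
    l1_use_percent: float   # self-reported share of daily communication in the L1


def assign_group(b: Bilingual) -> Optional[str]:
    """Return 'early low', 'early high', 'late low', 'late high',
    or None when L1 use falls between the two cutoffs."""
    if b.l1_use_percent < LOW_L1_USE_MAX:
        use = "low"
    elif b.l1_use_percent > HIGH_L1_USE_MIN:
        use = "high"
    else:
        return None  # neither clearly low nor clearly high L1 use
    timing = "early" if b.age_of_arrival <= EARLY_ARRIVAL_MAX_AGE else "late"
    return f"{timing} {use}"


if __name__ == "__main__":
    sample = [
        Bilingual("p01", age_of_arrival=7, l1_use_percent=8),
        Bilingual("p02", age_of_arrival=8, l1_use_percent=40),
        Bilingual("p03", age_of_arrival=19, l1_use_percent=10),
        Bilingual("p04", age_of_arrival=21, l1_use_percent=55),
        Bilingual("p05", age_of_arrival=9, l1_use_percent=20),
    ]
    for b in sample:
        print(b.participant_id, assign_group(b))
```

Leaving the middle band of L1 use unassigned mirrors the gap between the two cutoffs: a design of this kind compares clearly separated usage groups rather than splitting the sample at a single midpoint.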
17.3.3 Language Dominance
As bilinguals use their languages over the course of years, the L1:L2 usage ratio will impact the relative strength of the L1 and the L2. It was once thought that bilinguals need to master and maintain mastery of both languages in order to be considered “truly” bilingual. However, it has been claimed that very few, if any, bilinguals possess an equal command of both languages (Reference Grosjean, Silva-Corvalán and Treffers-DallerGrosjean, 2016). All bilinguals have a dominant language, and this is likely to be the language that they use the majority of the time, and which they are most comfortable using. However, there is no consensus regarding how language dominance should be measured, and, consequently, it can be very difficult to determine bilingual dominance (Reference Flege, MacKay and PiskeFlege, MacKay, & Piske, 2002; Reference GrosjeanGrosjean, 1998). This is perhaps best illustrated with an example. Consider a bilingual who is more comfortable using their nondominant language at their place of employment, but who is less comfortable using that nondominant language in other social domains. Further complicating matters is the dynamic nature of language dominance. That is, language dominance is not fixed, but can switch over time. If a bilingual’s language environment changes, this will necessarily place different demands on their need to use particular language skills. We do not fully understand how language dominance affects speech perception. Most speech perception studies involving bilingual listeners have reported a processing advantage for the L1. This pattern holds even for fluent bilingual listeners who acquired the L2 in early life, and continue to use both languages on a daily basis. However, most studies have (knowingly or perhaps unknowingly) recruited L1-dominant bilinguals, which presents an experimental confound. 
Some researchers have suggested that bilinguals who have used their L2 to such an extent that they have become L2-dominant might be the most likely to suppress L1 interference on the L2 (Flege et al., 2002). This hypothesis has been tested in a relatively small number of studies, but the evidence is mixed and more work is needed.
Amengual and Chamorro (2015) assessed the effect of language dominance on speech perception by presenting Galician mid-vowel contrasts /e/–/ɛ/ and /o/–/ɔ/ to fifty-four highly proficient Spanish-Galician early bilinguals. Dominance was quantified using the Bilingual Language Profile (Birdsong, Gertken, & Amengual, 2012), a self-report instrument that outputs a score based on age of acquisition, L1:L2 usage ratio, competence, and language attitudes. Those bilinguals deemed to be Galician-dominant (n = 25) were exposed to Galician earlier than Spanish, used Galician more than Spanish, and were more native-like Galician speakers than those who were Spanish-dominant. The Galician-dominant bilinguals exhibited robust categorical perception of the Galician mid-vowel contrasts, whereas the Spanish-dominant bilinguals encountered difficulty when attempting to differentiate them. These results suggest that, as a result of their differing language dominance, the two groups of bilinguals differed in their phonological organization. It would seem that the Galician-dominant bilinguals established separate vowel categories for each of the front and back mid vowels, whereas the Spanish-dominant bilinguals attempted to discern these vowel contrasts using combined or merged categories. Although this pattern of findings seems to suggest that language dominance may be a determining factor that shapes a bilingual’s perception of speech segments, there exist contradictory findings that limit the generalizability of such a conclusion.
Casillas (2015) asked ten English monolinguals, ten Spanish-English late bilinguals, and ten Spanish-English early bilinguals to categorize steps of a vowel continuum of the Southwestern American English tense-lax high front vowel distinction /i/–/ɪ/. The late bilinguals had acquired their L2 English in adulthood and were Spanish-dominant, whereas the early bilinguals were initially exposed to their L1 Spanish in early childhood, but no longer spoke it as adults and were now dominant in their L2 English. The English monolinguals had the sharpest category boundary, followed by the English-dominant bilinguals who had a clearer category boundary than the Spanish-dominant bilinguals. Differences were also observed in terms of which acoustic information the groups relied on to make their categorical judgments. The English monolinguals and the English-dominant bilinguals relied more on vowel spectrum properties, whereas the Spanish-dominant bilinguals relied more on vowel duration. However, subtle differences were also observed between the English-dominant bilinguals and the English monolinguals. The English-dominant bilinguals relied on vowel duration to a greater extent than the English monolinguals. Therefore, it appears that although the English-dominant bilinguals acquired their L2 early in life and no longer used their L1, there was evidence of persistent sensitivity to vowel duration, which is characteristic of native Spanish listeners. It would seem that, at least in some populations, the influence of the L1 may shape the perception of L2 speech segments, even after many years of exclusive L2 use.
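To make the notion of a “sharper” category boundary concrete, the following minimal sketch fits a logistic identification function to two sets of response proportions and compares their slopes. It is an illustration only: the seven-step continuum, the response proportions, and the function name fit_identification_curve are hypothetical, and the fitting method (a least-squares line through the logit-transformed proportions) is a simplifying assumption rather than the analysis used in the study described above.

```python
import math

def fit_identification_curve(steps, proportions, eps=0.01):
    """Fit a logistic identification function by ordinary least squares
    on logit-transformed proportions. Returns (boundary, slope), where
    boundary is the continuum step at which responses cross 50% and
    slope (in logits per step) indexes how sharp the boundary is."""
    logits = []
    for p in proportions:
        p = min(max(p, eps), 1 - eps)  # keep the logit finite at 0 or 1
        logits.append(math.log(p / (1 - p)))
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(logits) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, logits))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    boundary = -intercept / slope
    return boundary, slope

# Hypothetical proportions of one response label at each continuum step.
steps = [1, 2, 3, 4, 5, 6, 7]
sharp = [0.02, 0.05, 0.10, 0.50, 0.90, 0.95, 0.98]    # abrupt crossover
shallow = [0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]  # gradual crossover

sharp_boundary, sharp_slope = fit_identification_curve(steps, sharp)
shallow_boundary, shallow_slope = fit_identification_curve(steps, shallow)

print(f"sharp:   boundary at step {sharp_boundary:.2f}, slope {sharp_slope:.2f}")
print(f"shallow: boundary at step {shallow_boundary:.2f}, slope {shallow_slope:.2f}")
```

In this toy example both listeners place the boundary at the same continuum step, but the steeper slope of the first corresponds to what is described as a sharper, more categorical boundary. In practice, identification data are more commonly modeled with logistic regression over individual trials.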
17.3.4 Language Mode
Thus far, we have explored the contributions of factors such as age of acquisition, L1:L2 usage ratio, and language dominance, which affect the speech perception abilities of bilinguals over timescales that are measured in years and perhaps even decades. However, bilinguals use their languages dynamically, constantly adapting to the communicative needs across a multitude of settings. It is not unusual for a bilingual to use their L1 within their home, their L2 at their place of employment, and to switch between languages, including mixing languages within the same utterance when talking with bilingual speakers of the same languages. Speech perception theories are not yet able to account for this dynamic use of languages or how rapid switches between languages affect the perception of speech. An influential model in this space, although not specifically intended to generate testable predictions about speech perception, is the Language Mode framework (Grosjean, 1982, 2008), according to which activation of a bilingual’s languages constantly varies as the bilingual moves along points on a monolingual–bilingual continuum. These points correspond to different “language modes,” which, it is argued, influence all levels of language processing. When interacting with a monolingual speaker, one language is in use and the other language is said to be deactivated, although never completely. This is referred to as a monolingual mode. When interacting with a bilingual speaker of the same two languages, both languages are activated and code-switching may occur. This is referred to as a bilingual mode. However, even in a bilingual mode, one of the languages is used for processing and serves as the main language for communication and is, therefore, deemed to be more active than the other “guest language” (Grosjean, 1988).
Indeed, studies examining code-switching have documented small but reliable temporal delays in the processing of code-switched sentences (Altarriba et al., 1996), comprehension in the guest language (Li, 1996), and the reading of passages containing code-switches (Kolers, 1966; Macnamara & Kushnir, 1971). A more recent, but related, conceptualization of how bilingual language use varies by context is the Adaptive Control Hypothesis (Green & Abutalebi, 2013), according to which use of a single language, dual languages, and dense code-switching will necessarily place different cognitive demands on the individual and have behavioral and neural consequences.
There are numerous factors that may cause a bilingual to move along the language mode continuum. These may include their interlocutor, the setting, the purpose of the communication, the content of the message, the topic of discussion, and factors specific to the communicative situation (e.g., whether a participant in an experiment knows that the study concerns their bilingualism). It has been claimed that even a subtle reference to a participant’s bilingualism may be enough to increase activation of the guest language and place them in a bilingual mode (Grosjean, 1998). Movement along the monolingual–bilingual continuum occurs constantly, and is thought to be unconscious, smooth, and effortless. A monolingual analog may be the seemingly automatic change of speech register that occurs in response to changes in conversational partners (Giles, Coupland, & Coupland, 1991).
Language mode has been shown to affect the speech perception abilities of bilingual adults. Although we do not yet have a detailed theoretical model that accounts for these effects, it seems plausible that, when in different language modes, bilinguals may attend to language-specific information present in the speech signal, and this phonetic selectivity may vary depending on the linguistic context. Several studies (including some that we have already reviewed, such as Amengual & Chamorro, 2015; Antoniou et al., 2012, 2013; Elman et al., 1977; Williams, 1977) have systematically manipulated experimental contexts in order to examine how language mode affects perception of both native and non-native speech contrasts.
In an electrophysiological study, Garcia-Sierra et al. (2012) manipulated language mode in order to observe its effects on the mismatch negativity (MMN), a pre-attentive response to an odd stimulus in a sequence of stimuli. In an MMN study, listeners are presented with many repetitions of a standard auditory stimulus and on some trials a deviant stimulus is presented. If the listener is sensitive to the difference between the standard and the deviant stimuli, an MMN response will be observed. In the Garcia-Sierra et al. (2012) study, ten Spanish-English bilinguals were presented with a selection of steps from a /ga/–/ka/ stop-voicing continuum. Continuum steps were presented that would be consistently categorized as /g/ or /k/ by Spanish and English listeners, and an ambiguous continuum step was also chosen, that is, one that Spanish and English listeners would categorize differently (specifically, Spanish listeners would label it as /k/, whereas English listeners would label it as /g/). The experimenters also manipulated the language mode by making the bilinguals read silently in either Spanish or English during the experiment as the speech stimuli were being presented, and this determined whether or not an MMN response was elicited. Bilinguals who read a Spanish magazine showed the expected MMN response. Strikingly, bilinguals who were asked to read in English did not show an MMN response, indicating that they perceived the ambiguous sound to be an exemplar of the same category as the standard stimulus. The findings illustrate that manipulating language mode may cause the same bilingual listeners to perceive the same physical stimulus token as being an exemplar of different native phonetic categories.
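The category-based logic of this result can be expressed as a small toy model. The sketch below assumes that each language mode applies its own /g/–/k/ boundary along the voice onset time dimension and that an MMN is expected only when the standard and the deviant fall into different categories. The boundary values, the stimulus values, and the function names are hypothetical and chosen purely for illustration; real MMN responses also reflect acoustic differences within a category, so this should be read as a simplification of the reasoning and not as a model of the study’s data.

```python
# Hypothetical /g/-/k/ boundaries in ms of voice onset time (illustrative
# assumptions only; they are not values reported in the study).
BOUNDARY_MS = {"Spanish": 0.0, "English": 30.0}

def label(vot_ms, mode):
    """Label a stop as /k/ if its VOT exceeds the boundary for this mode."""
    return "k" if vot_ms > BOUNDARY_MS[mode] else "g"

def predicts_mmn(standard_ms, deviant_ms, mode):
    """Toy prediction: an MMN is expected only when the standard and the
    deviant belong to different phonetic categories in the given mode."""
    return label(standard_ms, mode) != label(deviant_ms, mode)

standard_ms = -20.0  # hypothetical standard, heard as /g/ in both modes
ambiguous_ms = 10.0  # hypothetical ambiguous deviant

for mode in ("Spanish", "English"):
    print(mode, label(ambiguous_ms, mode), predicts_mmn(standard_ms, ambiguous_ms, mode))
```

Under these assumptions, the same physical deviant is labeled /k/ in the Spanish mode, which yields a predicted MMN, and /g/ in the English mode, which yields none. This mirrors the pattern described above.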
17.4 Perception of Suprasegmentals
The term suprasegmentals refers to the phonetic variations that occur above the level of phonemic segments, that is, consonants and vowels. Suprasegmentals include such language elements as stress, intonation, duration, syllabification, and tone. Each of these can play an important role in one language but not in another, and, consequently, suprasegmentals show language-specific effects on speech perception. This is perhaps best illustrated using an example. In Dutch, suprasegmental cues to stress can alter the activation of lexical candidates which will affect word recognition, whereas in Japanese similar effects can arise from pitch accent cues, but neither of these cues to stress will have comparable effects in English (Cutler, 2012). The majority of cross-language studies on suprasegmentals have examined stress (e.g., Dupoux et al., 2008; Dupoux, Peperkamp, & Sebastián-Gallés, 2001; Lin et al., 2014; Peperkamp, 2004; Peperkamp & Dupoux, 2008; Peperkamp, Vendelin, & Dupoux, 2010; Tremblay, 2008, 2009). There remains much that we do not know about suprasegmental processing in the L2, and to date, cross-language perception of suprasegmentals has been examined in far fewer studies than those concerned with perception of segments. Consequently, suprasegmentals have not been well integrated into theoretical models of speech perception, and in the case of theories of L2 speech processing and bilingualism, suprasegmentals have been ignored almost entirely.
In a classic study, Cutler et al. (1992) compared word segmentation in French-English simultaneous bilinguals and monolingual listeners of French and English. French and English differ in that the former is syllable-timed and the latter is stress-timed, and this difference has important implications for word segmentation. The experimenters recruited bilinguals who were fluent and daily speakers of both their L1 and their L2. The experiment involved a phoneme monitoring task, which required listeners to respond when a particular phoneme was heard. Although the bilinguals were recruited because they were fluent in both languages, they were divided into two groups based on language dominance. Interestingly, this was determined by asking each participant: “Suppose you developed a serious disease, and your life could only be saved by a brain operation which would unfortunately have the side effect of removing one of your languages. Which language would you choose to keep?” The results revealed that those bilinguals who elected to keep French (i.e., French-dominant) used syllabic segmentation when presented with French words, whereas those who elected to keep English (i.e., English-dominant) did not show this pattern. The inverse was also true. When presented with the English words, the English-dominant bilinguals used stress-based segmentation, but this time the French-dominant bilinguals did not show the native pattern. The results suggest that even fluent and seemingly balanced bilinguals only make use of a single rhythmically based segmentation procedure, and this is applied to the processing of spoken words in both the L1 and the L2.
Broersma (2005, 2008, 2010) conducted a series of studies in which she examined how duration influences Dutch-English bilinguals’ perception of word-final obstruents. Native English listeners typically use the duration of the preceding vowel as the primary cue to syllable-final voicing. Note that while Dutch also contains obstruent voicing contrasts such as /f/–/v/ and /s/–/z/, it does not distinguish between them in word-final position because obstruents in final position are always voiceless. Broersma (2005) observed that Dutch-English bilinguals were able to categorize the English final voicing contrasts in a manner that matched the English monolinguals. In a follow-up experiment, continua were created for each of the English voiced–voiceless obstruent contrasts and vowel duration was fixed, meaning that it could not serve as an informative cue. English monolinguals continued to try to rely on vowel duration in order to categorize the voiced–voiceless obstruent contrasts, but because the vowel duration was fixed, the accuracy of their categorizations suffered, resulting in less steep categorization curves. The Dutch-English bilinguals, on the other hand, did not rely on vowel duration. The bilinguals were sensitive to the residual nondurational cues and used those to maintain steep categorization functions. This flexibility was observed across Broersma’s series of experiments: the English monolinguals continued to rely on vowel duration to a greater extent than the Dutch-English bilinguals, and persisted with it even when the duration cue was manipulated to be uninformative. The combined results of these studies indicate that the bilinguals were more flexible listeners because they made use of alternative cues present in the speech stream and they used this information to compensate when faced with less than ideal listening conditions.
Cutler (2009) examined sensitivity to L2 suprasegmental cues to stress in twenty-four Dutch-English bilinguals and English monolinguals. The experimenters replaced the vowels within words using a cross-splicing procedure. For example, unstressed vowels were removed from the recordings and replaced by stressed vowels. The bilinguals were sensitive to suprasegmental information and rated these modified words as unacceptable, whereas the monolinguals did not. The findings are compatible with those of Broersma (2005, 2008, 2010). Bilinguals appear to be more sensitive to suprasegmental cues to stress than English monolinguals. One plausible explanation could be that unstressed vowels are more common in Dutch than in English. Perhaps suprasegmental properties of the L1 improve L2 listening in instances when perceptual flexibility is required and bilinguals are able to take advantage of this.
This brief overview of studies examining perception of suprasegmentals demonstrates that bilinguals are flexible listeners who are able to attend to speech information that may be ignored by monolingual listeners. Although these findings are extremely interesting, there is much that we do not know about cross-language perception of suprasegmentals. Additional research is needed in order for us to develop a deeper understanding of how bilinguals process suprasegmentals and how this affects speech perception.
17.5 The Need for Bilingual Theories of Cross-Language Speech Perception
The leading models of cross-language speech perception predict whether listeners will be successful in discerning non-native contrasts (Best, 1995), and whether those individuals who acquire an L2 will accurately perceive L1 and L2 segments and go on to establish new L2 categories (Best & Tyler, 2007; Flege, 1995). However, none of the speech perception models within our field were designed to account for the speech perception abilities of fluent bilingual listeners. There is no cross-language model that provides a theoretical explanation as to why bilinguals may show differences between speech perception and production, as has often been observed in studies involving bilinguals, including my own. At best, some models may assume that categories are used for both speech perception and production, but no mechanism or explanation is provided to account for discrepancies that may be observed due to the dynamic nature of bilingual language use, such as experimental manipulations of the communicative task. Even models that can successfully predict how well L2 learners will distinguish between L2 contrasts (Best & Tyler, 2007; Flege, 1995) do not address the performance of fluent bilinguals. Fluent bilinguals are no longer actively acquiring their L1 or their L2. They have been using both languages for years and have reached a level of language stability. These listeners use both languages in their daily lives, and will often do so by combining languages within the same utterance. Of course, this does not mean that they ever stop learning or that their L1 and their L2 are perfectly balanced. As we have shown, even the most fluent bilinguals have a dominant language. The factors discussed within this chapter, such as L1:L2 usage ratio, language dominance, and language mode, arise from differences in language use patterns.
These changes provide evidence that bilinguals are dynamic language users, and are constantly changing and adapting to meet the needs of diverse communicative situations. However, this lifelong adaptability qualitatively differs from the process of L2 acquisition, which has been the focus of many theoretical models. If we are to capture the complexity and the dynamic nature of how bilinguals use their languages, new models of bilingual speech perception will be needed. Such a dynamic, adaptable model would need to be able to account for shifts according to context at long and short timescales (e.g., across the life span versus from one moment to the next).
17.6 Future Challenges and Directions
The research detailed throughout this chapter provides useful insights into how bilingualism shapes the perception of speech. Although knowledge has advanced greatly in recent decades, answers to fundamental questions remain elusive, and remain to be addressed by future work. One major challenge will be to account for the variability that is characteristic of bilingual listeners when perceiving speech. Perhaps it should not surprise us that, given the majority of the world’s population is bilingual, researchers tend to observe variability across bilinguals, be it across studies or even within the same study. As I have shown, we have developed a clear understanding of several factors that are known to affect speech perception. These factors include bilingual patterns of language acquisition, as well as patterns of language use over both the long and the short term. In past studies, it was relatively common to see different types of bilinguals conflated into a single group of participants, and in some instances this has led to results being misinterpreted. The field has already begun to move away from the notion that bilinguals are part of a homogeneous group. Instead, bilingualism is now being viewed as a continuous variable and our understanding of how bilinguals vary on multiple dimensions is developing (Luk & Bialystok, 2013), although there is room for further improvement. Through this lens, factors such as order of acquisition, language use, proficiency, and dominance provide useful information for differentiating between bilingual individuals rather than for defining whether an individual is bilingual or not.
This shift in thinking coincides with the development of standardized surveys that quantify bilingual variables, such as the Bilingual Language Profile (Birdsong et al., 2012) and the Language Experience and Proficiency Questionnaire (Marian, Blumenfeld, & Kaushanskaya, 2007), which permit across-study comparisons that will benefit the field and lead to a deeper understanding of the factors affecting the speech perception of bilingual adults. We have also witnessed the development of stimulus sets specifically designed for use with bilingual populations, such as the Non-Native Speech Recognition sentences (Stringer & Iverson, 2020), and the Multilingual Picture Database (Duñabeitia et al., 2022). These types of resources will allow for rigorous and systematic examination of issues concerning bilingual speech perception.
There are many exciting future directions for our field to pursue. We have learned an enormous amount by studying citation speech recorded in sound-attenuated booths. However, there is a need for ecologically valid stimuli to test generalizability across listening conditions. Speech perception “in the wild” is messy, often occurring in the presence of competing noises (Blasingame & Bradlow, 2020) or the speech of other talkers (Antoniou, Wong, & Wang, 2015). There is evidence that bilinguals attend to audiovisual information differently than do monolinguals (Pons, Bosch, & Lewkowicz, 2015), and this raises interesting questions about the differential effects between L1 and L2 when audiovisual information is occluded, as occurred, for instance, with the donning of face masks during the COVID-19 pandemic (Chauvin & Phillips, 2022). There is much that we have yet to learn concerning how speech perception is affected by language switching (Pérez & Duñabeitia, 2019), differing exposure to dialects or writing systems (Hayes-Harb & Cheng, 2016), as well as differences in input quality (Kalashnikova & Carreiras, 2022) and quantity (Garcia-Sierra, Ramírez-Esparza, & Kuhl, 2016). The evidence gathered by addressing these emerging research areas will lead to a more nuanced understanding of the bilingual perceptual system and the brain.
18.1 Introduction
Individuals who learn an additional language often find that it is not a trivial task. Second language acquisition (SLA) research typically refers to the added language as a second language (L2) or a third language (L3), and so forth. An L2 can be acquired in many ways. For instance, one can study a language formally (i.e., in an L2 classroom), informally (i.e., in a naturalistic context), or both formally and informally in an immersion-type context (Saville-Troike, 2005). Among the many difficulties the L2 learner faces, production of the target language (TL) is one of the most difficult, if not the most difficult, particularly when it comes to the pronunciation of TL sounds (Flege, 2003; Flege & Hillenbrand, 1984; Flege & Wayland, 2019; Oyama, 1976; among many others). This seems to be especially true if the individual in question begins the endeavor of learning the L2 as an adult. In the present chapter, I focus precisely on these individuals, that is, language learners who have become, or are becoming, bilingual as adults. Concretely, this chapter examines the range of factors known to impact speech production in adult learners, such as age of learning (AOL; Flege, Bohn, & Jang, 1997; Flege & Eefting, 1987; Flege, Munro, & MacKay, 1995), length of residency (LOR; Flege, 2012; Flege, Bohn, & Jang, 1997; Flege & Liu, 2001), age of acquisition (AOA; Flege & Eefting, 1987), L1/L2 use (Flege, MacKay, & Meador, 1999; Piske et al., 2002), and TL input (Bosch & Ramon-Casas, 2011; Flege, 2012).
More than half a century of research on language learning has taught us a simple fact: earlier is better. Adults, which I arbitrarily operationalize in this chapter as individuals aged eighteen years and older, have a more difficult time learning an additional language when compared with children and adolescents. A central theme in this chapter is that these so-called age effects permeate all areas of language learning, but may be more noticeable in the production of the TL. In most cases, native speakers can instantly identify a non-native accent. Adult learners often find that native-like TL production is extremely difficult to attain – regardless of whether or not it is a specific goal of the learner – even after years of extensive study and use of the language. Currently, there are many theoretical models used to posit hypotheses that can account for the types of difficulties encountered by language learners, such as the Perceptual Assimilation Model-L2 (PAM-L2; Best & Tyler, 2007) or the Second Language Linguistic Perception Model (L2LP; Van Leussen & Escudero, 2015), to name just two. In the present work, I will not describe these models in detail, but the interested reader can refer to the relevant chapters of this volume (Chapters 7–11), as well as Flege and Bohn (2021) for the Speech Learning Model-revised (SLM-r).
The goal of this chapter is to provide an overview of the speech production of bilingual adults. It considers what we have learned from the past half-century-plus of empirical work and, importantly, where research on bilingual speech production in adults is heading in the coming years. While this overview cannot be exhaustive, it will cover many of the principal variables researchers have used to contribute to our understanding of adult bilingual speech production. The remainder of the chapter is organized as follows. First, I provide a brief historical overview of the motivations behind current speech production research. Second, I focus on age effects and ultimate attainment in adults. Next, I discuss seminal research looking at the role of TL input and use in production outcomes. Afterwards, I highlight exciting avenues of research, such as phonetic drift and semantic processing, that are likely to shape future studies. Finally, I conclude by describing current methodological paradigm shifts that I believe will guide future speech production researchers.
18.2 Even Cavemen Had Accents
Accents are not new. For as long as we have been speaking, we have been doing so through the filter of our native phonology. When we attempt to speak another language, our native phonological structure, both segmental and suprasegmental, can predict TL production difficulties (Flege, 1995). In his book Languages in Contact: Findings and Problems, Weinreich (1953) considered L1 Russian L2 learners of English and their production of English /t/. In Russian, /t/ is produced with a short interval of time between it and the following vowel, like [ta]. In English, this interval is longer and accompanied by aspiration ([tʰa]). Weinreich (1953) was interested in whether these learners would acquire a new category for the English segment, that is, whether or not they would produce /t/ with aspiration like English speakers do. He proposed two possibilities: either they would develop a “coexistent” phonetic system, whereby they would learn to produce both [t] and [tʰ] as realizations of a single /t/ category, or they would develop a new category altogether such that [t] was available in Russian and [tʰ] in English. Weinreich (1953) considered the possibility that the adult learners would develop a new category for English to be more difficult and less likely.
Throughout this chapter I refer to the term phonetic category, as I did in the previous paragraph. Thus, before continuing, it is worth a moment of our time to consider what this entails. In the speech literature, researchers use this term to refer to a vowel- or consonant-like unit (e.g., Flege & Eefting, 1988), though the specifics can vary (see Trubetzkoy, 1969). Importantly, this concept differs from the notion of a phoneme, a mental representation that indicates the minimal unit that can distinguish words, in at least the following ways. First, a phonetic category is language-specific. For instance, two languages can have the phoneme /t/, but have different phonetic categories for the corresponding acoustic realization, for example [tʰ] or [t]. Next, as illustrated in the previous sentence, a phonetic category operates at the level of the segment. That is, like a phoneme, it represents a single unit that is smaller than a syllable. Lastly, a phonetic category is ascribed a certain level of abstractness in that it does not refer to one specific acoustic realization. If you were to say the word “taco” in English ten times, the realization of the initial [tʰ] would be slightly different in each iteration, but still pertain to the same phonetic category.
Weinreich’s reflections about /t/ exemplify the way in which much of the research on adult speech production has been conducted. That is to say, researchers have been interested in understanding and predicting how TL difficulties can be explained by the organization of one’s native phonology. The parameter Weinreich was considering, known as voice onset time (VOT; Lisker & Abramson, 1964), refers to the duration of the time interval between the release of the stop burst and the onset of voicing of the following segment. A large amount of the extant research on adult speech production includes VOT. This is so because cross-linguistically stop contrasts are produced in varying ways based on VOT. In other words, the same phonological contrast in two distinct languages can be realized in acoustically different ways with regard to VOT, like how /t/ is produced as [t] in Russian and [tʰ] in English.
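Because VOT is defined as a simple time difference, it can be illustrated with a short worked example. The sketch below computes VOT from two annotated time points and sorts the result into the three broad ranges commonly described in the VOT literature (voicing lead, short lag, and long lag). The function names, the time stamps, and the 30 ms cutoff between short and long lag are hypothetical choices made for illustration; actual category boundaries vary by language, place of articulation, and speaker.

```python
def voice_onset_time_ms(burst_release_s, voicing_onset_s):
    """VOT in milliseconds: onset of voicing minus release of the stop
    burst. Negative values mean that voicing began before the release."""
    return (voicing_onset_s - burst_release_s) * 1000.0

def vot_range(vot_ms, short_lag_max_ms=30.0):
    """Sort a VOT value into a broad range. The 30 ms cutoff is an
    illustrative assumption, not a fixed standard."""
    if vot_ms < 0:
        return "voicing lead"
    if vot_ms <= short_lag_max_ms:
        return "short lag"
    return "long lag"

# Hypothetical annotations in seconds from the start of a recording.
unaspirated = voice_onset_time_ms(burst_release_s=0.512, voicing_onset_s=0.527)
aspirated = voice_onset_time_ms(burst_release_s=0.512, voicing_onset_s=0.582)

print(round(unaspirated), vot_range(unaspirated))  # an unaspirated [t]-like token
print(round(aspirated), vot_range(aspirated))      # an aspirated [tʰ]-like token
```

With these made-up time stamps, the first token has a VOT of about 15 ms and falls in the short-lag range, whereas the second has a VOT of about 70 ms and falls in the long-lag range, which parallels the [t] versus [tʰ] difference discussed above.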
In a similar vein to Weinreich (1953), Flege and Hillenbrand (1984) proposed that L2 learners struggle with acoustically distinct L2 segments that are related to L1 categories (i.e., L1 [t] and TL [tʰ] as pertaining to an L1 /t/) because of how acoustically similar they are. Flege (1987) set out to test this hypothesis by analyzing the production of /t/ in native French and native English speakers who were learning English and French, respectively, as an L2. Flege assumed that speakers from both groups would hear the TL segment as a realization of their L1 category, that is, French speakers would hear English [tʰ] as an exemplar of French /t/ and vice versa. The analysis of VOT in word-initial stops found that the learners produced L1-like VOT values in the TL; however, the productions of both learner groups differed from those of monolinguals in both the learners’ L1 and the TL. This led Flege (1987) to conclude that the learners had noticed some of the acoustic differences between the TL segment and their L1 categories. In a related example, Flege and Eefting (1987) examined Spanish speakers’ production of word-initial stops in English. The study found that the Spanish speakers produced English stops with aspiration; the acoustic analyses showed that their VOT values were longer than in Spanish words, but not as long as those produced by monolingual English speakers. Flege and Eefting (1987) concluded that the learners had developed novel L2 phonetic categories for English, though they might have differed from those of the monolingual controls because of exposure to Spanish-accented English.
At this juncture you may be thinking that differences such as these seem trivial, and that may be true for some learners; nonetheless, they have important implications with regard to both speech production and perception (see Chapter 17, this volume), and they provide insight as to how speech research is often conducted with bilinguals. The fact that languages have varying acoustic patterns for similar sound categories – not just with stops but also with other vowels and consonants – provides virtually infinite possibilities for examining bilingual speech. As we will see, L2 learners are likely to produce (and perceive) TL sounds using the distributions of the acoustic parameters of their L1, rather than those of the TL. For instance, as in the example highlighted just now, it has been shown repeatedly that beginning L2 learners will produce TL /ptk/ with VOT based on the values for these segments in their L1 (Caramazza et al., 1973; Flege & Eefting, 1987; Flege & Port, 1981; Port & Mitleb, 1983). More generally, as TL productions deviate from native values, they are more likely to be characterized as “foreign sounding” (González-Bueno, 1997; Sundara, Polka, & Baum, 2006). The good news is that experience with the TL has proven to be correlated with increased accuracy in production (Aoyama et al., 2004; Flege, 2003; among many others), and accurately producing something seemingly minor, like the stop contrasts of the TL, can significantly improve foreign accent ratings (Sundara et al., 2006).
These early studies fostered some of the important questions that shape bilingual speech production research on adults to this day. For example, how and why does age affect speech production outcomes and is it possible to become native-like? What is the role of TL input and use in acquiring the sounds of another language? How do TL sound categories develop over time? How is phonological learning modulated by the context in which one learns? Or how does speech production development relate to that of speech perception? Researchers have attempted to respond to questions such as these for the past half-century. In Section 18.3 we will consider age effects and ultimate attainment.
18.3 Is Youth Wasted on the Young?
Many of us have heard, or can attest firsthand, that children seem to learn additional languages better than adults. We often refer to them as “sponges,” as they seem to effortlessly soak up an L2. Adults, on the other hand, are more likely to display difficulties with seemingly “easy” aspects of a TL. When it comes to learning the phonology of an L2, there is overwhelming evidence suggesting that it is better to start at an earlier age if the goal is to obtain more native-like production (see Bongaerts et al., 1997; Flege, 2003; Flege et al., 1999; Oyama, 1976; among many others). This notion is hardly controversial, as countless studies show that this task seems to become more difficult after a certain age.
Lenneberg (1967a, 1967b) proposed an innate language faculty as the reason children are able to “grow” language. As his work gained support, the notion of the “critical period” became popular in explaining the “age effects” found in many cognitive abilities, language being one of them. Some researchers still believe that after a certain age – a critical period – it is no longer possible to acquire certain abilities due to constraints associated with neuronal maturation. Nonetheless, with regard to language, many learners manage to defy the odds. In other words, in spite of learning their L2 later in life, many adults manage to acquire TL sound categories. We have all heard hard-to-believe anecdotes about the “friend of a friend” who speaks perfect Italian, indistinguishable from a native speaker, after using Rosetta Stone for a year. But is there any truth to this “anecdata”? Work by Flege and colleagues examining an Italian community living in Ottawa for an average of twenty years shows that some adult learners can achieve native-like pronunciation despite their late exposure to English (see Flege et al., 1999; Flege, Frieda, & Nozawa, 1997; Piske et al., 2002). Conversely, there are also cases of early learners with nontarget-like speech production. Rather than abandoning the ideas of Lenneberg, findings such as these have led many researchers to develop variations of the original hypothesis (e.g., Scovel, 1969, 1988; Seliger, 1978; Walsh & Diller, 1981). One possibility, for example, is that age effects are related to a sensitive period that varies from person to person much more than initially postulated.
Others ascribe adult learners’ difficulties to the fact that adults begin the acquisition process with a developed phonological system already in place. A natural implication of this fact is that the already established system is the source of cross-linguistic influence. For instance, a seminal investigation by Flege et al. (1995) examined the L2 English of Italian immigrants living in Canada. These participants are of particular interest due to the fact that most of them had resided in Canada for several decades (mean = 23 years). Flege et al. (1995) found that AOL was a key factor in explaining their L2 outcomes. Specifically, the earlier the AOL, the more native-like the participants performed regarding production and perception of English. Furthermore, AOL was positively correlated with the degree of foreign accent perceived by native English speakers. The results show that most of the adult learners did not have the same success as the early learners, possibly because their native-language categories were already firmly established in long-term memory.
Similarly, Flege et al. (1998) examined the production of /t/ in learners of American English. This study included Spanish speakers who were partitioned into two groups based on their age when they began learning English (under or over twenty-one years old), and compared them to a group of monolingual English speakers. Flege et al. (1998) utilized multiple regression to account for variation in VOT for /t/ in utterance-initial position, where we would expect [tʰ], in sixty words.Footnote 3 The model included factors that were lexical or phonetic in nature, along with predictors related to linguistic experience. The lexical factors were based on the participants’ ratings of the experimental items. Specifically, each participant rated the sixty items for their familiarity, imageability, relatedness to Spanish words, and the age at which they were acquired. The predictors related to linguistic experience were age of exposure to English, age at which the participants began to feel comfortable speaking English, LOR in the United States, use of Spanish, and age. The phonetic predictors included syllabic structure of the word (monosyllabic, disyllabic), and the height of the following vowel (low, mid, high). The participants’ productions ranged from target-like (i.e., [tʰ]) to nontarget-like (i.e., [t]) regarding VOT values. The analyses showed that VOT was more target-like in monosyllabic words and before high vowels, but the lexical factors did not account for variation in VOT values. The amount of English input, assessed through age of exposure and LOR, was a better predictor of target-like productions.
We will return to the role of input shortly, but first we will consider the final state or end point of language learning in relation to the learner’s ability to attain native-like proficiency in an L2. A large portion of the L2 research on adult phonological acquisition has focused on this notion, which is referred to as ultimate attainment (see Mitchell, Myles, & Marsden, 2019). The main question posed in this line of research deals with how close a non-native speaker can come to acquiring the phonology of the TL. That is to say, can an adult learner become indistinguishable from a native speaker? In order to answer this question, this line of research has necessarily searched for adult bilinguals who have a large amount of experience and have reached high levels of proficiency in their L2 (e.g., Bongaerts et al., 1997; Face, 2018a, 2018b, 2021; Face & Menke, 2020; Stölten, Abrahamsson, & Hyltenstam, 2015). For example, Stölten et al. (2015) examined ultimate attainment by recruiting native Spanish speakers who were highly proficient in Swedish. The researchers partitioned the participants into two groups, early and late learners, and analyzed their production of Swedish /ptk/. Their analysis found that the learners were indeed indistinguishable from native speakers when using a raw VOT measurement, that is, milliseconds; however, when taking speech rate into account by calculating VOT as a percentage of word duration, they found that few of the late learners produced /ptk/ in a native-like manner. This study suggests that adult learners can come quite close to native-like production, but when using stringent analytic techniques, production differences emerge, at least with regard to voice-timing.
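The contrast between raw and rate-normalized VOT can be sketched in a few lines of Python. The function name and the token values below are hypothetical and serve only to show why two productions with identical raw VOT can diverge once word duration is taken into account; the sketch is not the analysis pipeline of Stölten et al. (2015).

```python
def vot_percent_of_word(vot_ms, word_duration_ms):
    """Express VOT as a percentage of the duration of the word containing it."""
    if word_duration_ms <= 0:
        raise ValueError("word duration must be positive")
    return 100.0 * vot_ms / word_duration_ms


# Hypothetical tokens: same raw VOT, different speech rates.
fast_token = {"vot_ms": 60.0, "word_ms": 300.0}
slow_token = {"vot_ms": 60.0, "word_ms": 500.0}

fast_pct = vot_percent_of_word(fast_token["vot_ms"], fast_token["word_ms"])
slow_pct = vot_percent_of_word(slow_token["vot_ms"], slow_token["word_ms"])

print(f"fast speech: {fast_pct:.1f}% of word duration")
print(f"slow speech: {slow_pct:.1f}% of word duration")
```

Under these invented values the two tokens are identical in milliseconds but differ once expressed relative to word duration, which is the kind of difference a raw measurement would miss.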
A series of studies by Face and colleagues explored ultimate attainment of Spanish in American English speakers who had been residing in Spain for up to sixty years (mean = 36). These studies compared the adult learners’ production of voiced stops (Face, 2018a; Face & Menke, 2020), rhotics (Face, 2018b), and laterals (Face, 2021) to those of native Spanish speakers. Each of these cases presents difficulties for native English speakers due to cross-linguistic articulatory differences. In all four studies, the researchers found that the adult learners’ production differed from that of native Spanish speakers, though there were considerable individual differences. For each of the segments being analyzed, there were participants with production values within native ranges, with the exception of VOT in Face and Menke (2020). This raises the question of what might differ among phonetic variables, such that some appear to be harder for learners to attain than others. The answer to this question is likely complex, with multiple factors simultaneously at play, such as the status of the segments in question in the learner’s native language.
Taken together, these investigations on ultimate attainment show that adult learners generally differ from native speakers, even when they are highly proficient in the TL and have decades of experience. This is particularly true when conducting group comparisons; however, the majority of the aforementioned studies show that some individuals do become native-like with regard to the parameters measured. The results support the hypothesis that speech learning can indeed continue throughout the life span, though it is not entirely clear to what this learning is attributed. This research suggests that other factors, aside from AOL, are likely at play.
18.4 The Roles of Input and Use
In much of the research that has been conducted on adult learners, input has not been considered the most important factor. Perhaps difficulties producing a TL accurately can be attributed in part to the quality and/or quantity of the input that learners receive. Learners who acquire their L2 in a naturalistic setting, for instance, are likely exposed to input that differs from the input received by individuals who learn in a formal classroom setting. As we have seen thus far, the focus in SLA research has been on cross-linguistic interactions between sound systems, individual differences, maturational states, as well as AOA and LOR. This fact seems to be at odds with the relative importance that input receives in L1 acquisition theory and research. At this time, the role that input plays in the acquisition of L2 phonetic categories in adult learners remains unclear. Another understudied factor is the amount of L1 and TL use. It seems reasonable to assume that using the TL is a critical element for progressing in all areas of language learning, but how much and how often does the TL need to be used for optimal outcomes, and to what extent does L1 use play a role? Could it be a detriment to L2 progress? Could L1/L2 use account for the improvement seen in the aforementioned studies by Flege and colleagues examining Italian immigrants in Canada?
In a follow-up to the Flege et al. (1995) foreign accent study, Flege and MacKay revisited the adult learners from the original investigation (as cited in Flege, 2012). Of the original 240 participants, 160 were rerecorded to see if these highly proficient adult learners had improved in their English production over the ten years. The average LOR at the second time of testing was forty-three years. Three subgroups were established based on changes in self-reported use of English at the two times of testing (in 1992 and in 2003). The first group reported using English 28 percent less at the second time of testing, the second group reported no change in English use, and the third group reported using English 24 percent more at the second time of testing. The researchers found that increased use of English was associated with more native-like VOT (determined by an increase from 1992 values) for voiceless stops. A decrease in English usage resulted in no change in VOT production.
In an unpublished longitudinal study on fifteen native Spanish late learners who immigrated to the United States, Flege found no significant improvement in the foreign accent ratings of their English pronunciation after a period of five years in Alabama (as cited in Flege, 2012). However, a post hoc analysis of the data provided several insights with regard to the role of input. Flege compared the three “worst” (most foreign accented) learners with the three “best” (least foreign accented) learners and found that the large disparity between groups could be explained by the fact that the “best” learners were those who reported using English more, specifically in contexts where it could be considered optional (i.e., when having conversations with their friends). Presumably these learners received more input in these situations, though Flege only provided impressionistic inferences. Studies controlling for TL input and use can shed light on the relative importance of these factors as they pertain to speech production in adult learners.
One way in which researchers have attempted to disentangle the effects of input and use is by analyzing the context in which language acquisition occurs. Despite the relatively long history of study abroad (SA), only in the past twenty-five years have researchers begun to seriously consider the benefits that immersion learning has for L2 outcomes (Freed, 1995), and how these benefits directly compare to those offered in the traditional, at-home L2 classroom context (AH).Footnote 4 There is evidence that L2 production gains are positively correlated with length of stay (Stevens, 2001) and occur at a faster rate than in the AH context for conversational speech (see Díaz-Campos, 2004, 2006). These findings suggest that TL input may well be the key to native-like production; however, it is difficult to know what type of input the participants in the SA context receive and how it compares to the input available to AH learners. For example, it is possible for SA learners to mainly interact with students from their home country, or to use the TL minimally (e.g., Gorba & Cebrian, 2021). Alternatively, improvements seen in the SA research could be due to increased usage of the TL, as most learners in the AH context use the TL for only a few hours each week. Domestic immersion programs represent an understudied context in which L1/L2 use and input can be more easily accounted for (see Casillas, 2020a, 2020b).
18.5 What About Those Suprasegmentals?
The studies outlined thus far point to the importance of several key experiential variables – L1 phonology, AOL, TL input, and amount of L1/L2 use – and their effects on the production of L2 segments.Footnote 5 Our discussion has not gone beyond the segment. That is to say, we have not yet considered L2 production of suprasegmental aspects of speech, such as syllables, stress, length, tone, or intonation. This is indicative of the fact that the primary focus of research on bilingual speech production in adults over the past half-century has been at the segmental level. As evidence of this, consider the fact that many of the major theoretical models describing L2 phonological acquisition do not have specific hypotheses regarding how suprasegmentals are acquired. That being said, some researchers have used the current models and extended them to make predictions about the acquisition of suprasegmental aspects of speech. For example, Trofimovich and Baker (2006, 2007) analyzed how L2 language experience modulated the production of prosody and fluency in adult L1 Korean learners of English. In line with the general findings we have seen at the segmental level, learners with more L2 experience produced suprasegmental aspects of English in a more native-like manner when compared with learners with less L2 experience. Furthermore, the authors affirmed that suprasegmentals can contribute to foreign accent in the same manner as segmentals. They conclude that models of L2 speech learning (e.g., Chapters 7–11 of this volume) can be extended to account for L2 suprasegmental production (see Mennen [2015], for example). Our discussion of suprasegmental aspects of adult speech production is limited in scope. For a complete overview, see Chapter 21 of this volume.
18.6 Into the Unknown
Now I will briefly highlight some of the more recent lines of research relating to speech production in adults that are likely to inspire future studies. While the majority of the investigations we have considered have focused on the influence one’s native language has on the production of an L2, some studies have examined how acquiring an L2 can affect the production of one’s native language. This influence of the L2 on the L1 is referred to as phonetic drift and has been documented in individuals with ample L2 experience (see Flege, 1987, 1995, 2002; among others) as well as in early bilinguals (e.g., Mora & Nadeu, 2012) and switch dominance early learners (e.g., Casillas, 2015; Casillas & Simonet, 2016). What is relatively novel, however, is the finding that phonetic drift can also occur during the beginning stages of adult sequential language learning. Chang (2012, 2013) found that native English speakers learning Korean as an L2 assimilated phonetic properties of Korean into their native English after initial exposure in an immersion program. The early phonetic drift effect has been replicated with some degree of success. For example, native English-speaking learners of Spanish have displayed phonetic drift after six weeks of exposure in the AH context (Herd et al., 2015; Huffman & Schuhmann, 2015a, 2015b). Lang and Davidson (2019) did not find the effect in English-speaking learners of French after the same period of time in an SA program; however, longer-term learners from the same study did show indications of L2 influence on their English vowel space.
Future research in this area with novel language pairs can shed light on how TL input and use can modulate phonetic drift in adult speech production. For a complete overview of the topic, the interested reader is referred to Chapter 32 of this volume.
Other lines of research on L2 production have considered the nature of the representations acquired by adult learners. One question posed by this line of research revolves around whether or not the distinct types of cognitive processes utilized during speech production can have an influence on L2 phonetic processing. Phonetic processing is operationalized as the “processes involved in the planning, programming, and execution of articulation” (Gustafson, Engstler, & Goldrick, 2013, p. 506). One of the cognitive processes that has been used to assess phonetic processing is semantic processing (Gustafson et al., 2013; Nozari et al., 2010), which requires that one access the meaning of an object, concept, and so on. Van Lancker Sidtis, Cameron, and Sidtis (2012) found that patients suffering from Parkinson’s disease had difficulties producing speech when participating in tasks that required semantic processing. In this investigation, Van Lancker Sidtis et al. (2012) utilized picture naming to elicit semantic processing, and found that the articulatory impairments of the participants were compounded when compared to tasks that did not require semantic processing (i.e., reading and delayed repetition). Recent studies have shown that semantic processing also affects stop production in adult language learners (Casillas, 2019; Gustafson et al., 2013), though the effect appears to diminish as TL proficiency increases.
These findings have methodological implications for speech production research, as the nature and the amount of cross-language interactions in adult bilingual speech may be underestimated due to reliance on tasks that do not require semantic processing. Future research should explore semantic processing effects on other speech segments.
A final avenue for future consideration is, in fact, quite an old one: the production-perception link in L2 speech. Investigators have spent decades discussing the nature of the relationship between the two modalities. Importantly, if such a relationship exists, current models posit that one ought to observe a clear correlation between the production of categories and the perceptual judgments of these categories within listeners (Diehl, Lotto, & Holt, 2004). This does indeed seem to be corroborated in some research on production and perception in monolinguals and bilinguals (Elvin, Williams, & Escudero, 2016; Evans & Alshangiti, 2018; Flege, 2003; Flege, Bohn, et al., 1997, Flege et al., 1999), though other studies find weak or no evidence for such a correlation (de Leeuw et al., 2021; Gorba & Cebrian, 2021; Kartushina & Frauenfelder, 2014; Peperkamp & Bouchon, 2011; Rallo Fabra & Romero, 2012). As pointed out by Nagle and Baese-Berk (2021), there are methodological concerns that may explain discrepancies in the extant literature. For instance, the current models positing production–perception links have been essential in establishing a framework describing the nature of the relationship between modalities, but fall short of offering a roadmap for how the links can be investigated. As a consequence, researchers have employed a variety of experimental paradigms and analytic strategies that may not be conducive to explaining how production–perception links develop and evolve during L2 learning.
Nagle and Baese-Berk (2021) suggest that future research in this area would benefit from: (1) prioritizing longitudinal designs, (2) accounting for different learning scenarios, and (3) including individual differences. Moving forward, our knowledge of bilingual speech production will only benefit from furthering our understanding of the production–perception link.
18.7 The Methods They Are A-Changin’
In this penultimate section, I will discuss what I believe are some essential methodological changes that will shape how we conduct speech production research moving forward. First, I will discuss the importance of longitudinal designs. Second, I will consider a recent push for bilingual baselines in L2 research and how it goes hand-in-hand with instructed second language acquisition (ISLA) in shaping how we design our studies. Afterward, I will focus on methodological shortcomings related to power and sample size, before turning my attention to open science practices and reproducible research.
18.7.1 Longitudinal Research Design
In light of the fact that L1 acquisition is a process that develops over the course of years during the early stages of child development, a logical assumption is that L2 acquisition in adults is also a process that requires a substantial amount of time. Unfortunately, longitudinal research is costly, requires careful planning, and involves complex analytic strategies (Nagle, 2021). In spite of these difficulties, longitudinal research is invaluable in providing crucial insight into developmental processes. In speech production research, there is a dearth of longer-term, multiwave studies dealing with pronunciation development (Nagle, 2021), though some recent work has explored how L2 phonetic categories develop over the short term (Casillas, 2020a, 2020b; Holliday, 2015; Nagle, 2017a, 2017b; among others). For example, Nagle (2017a) examined bilabial stop production in twenty-six native English-speaking L2 learners of Spanish studying in an AH, traditional context. He analyzed their stop production over the course of an academic year and found that the learners reduced VOT over time, which suggests that they may have developed language-specific phonetic categories for [b] and [p]. The analyses also revealed that learning generally followed a quadratic trajectory, implying that phonetic development occurred toward the beginning phases of testing and then slowed down over time. Longitudinal studies, such as Nagle (2017a), show that the beginning stages of phonetic category formation can take place in a classroom learning context over the course of just one academic year. As longitudinal designs become more prevalent in L2 speech production research, we can expect to learn more about the relative difficulty of learning specific segments, contrasts, and the suprasegmental aspects of L2 phonology.
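To make the shape of a quadratic trajectory concrete, the sketch below evaluates a second-order polynomial of time and prints the change between successive testing waves. The coefficients and the number of waves are invented for illustration and are not estimates from Nagle (2017a); the point is only that a negative linear term combined with a positive quadratic term yields large early reductions in VOT that shrink over time.

```python
def predicted_vot(wave, intercept=40.0, linear=-8.0, quadratic=1.0):
    """Quadratic growth curve: intercept + linear*wave + quadratic*wave**2.

    All coefficients are hypothetical values in milliseconds.
    """
    return intercept + linear * wave + quadratic * wave ** 2


waves = range(5)  # five hypothetical testing sessions, numbered 0 to 4
trajectory = [predicted_vot(w) for w in waves]

# Difference between each wave and the one before it.
changes = [later - earlier for earlier, later in zip(trajectory, trajectory[1:])]

for w, vot in zip(waves, trajectory):
    print(f"wave {w}: predicted VOT = {vot:.1f} ms")
print("change between successive waves:", changes)
```

In an actual longitudinal analysis the coefficients would be estimated from repeated measurements, typically with a mixed-effects growth model that also allows learners to differ in their starting points and rates of change.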
18.7.2 Bilingual Baselines
Traditionally, L2 acquisition research has used monolingual speakers as comparison groups for L2 learners in speech production (and perception). A recent trend in L2 phonetic research has pushed for the use of a bilingual baseline as the starting point for measuring adult learner outcomes (e.g., Casillas, 2020b; Sakai, 2018), though there has been a push from the field of SLA to move away from idealizing native speakers for nearly forty years (see Bley-Vroman, 1983; Cook, 1999; Ortega, 2013). According to Levis (2005, p. 370), a predominant ideology in pronunciation research is that “it is both possible and desirable to achieve nativelike pronunciation in a foreign language,” which he refers to as the nativeness principle. The idea motivating the methodological shift to bilingual comparison groups is based on the notion that comparing an adult L2 learner with monolingual native speakers sets unobtainable expectations for the learner because it is not possible for an adult to become a native speaker of another language. Perhaps a more convincing argument can be made by restating this idea in a way that highlights the obvious fact that an adult L2 learner cannot become a monolingual in another language, and likely has no desire to somehow lose their L1. For this reason alone, we can begin to see why it makes little sense to compare learners with a monolingual population that is fundamentally different from the population that represents their goal. In other words, the use of bilingual comparison groups can provide a fairer and more useful assessment of the learner’s progress because, in essence, the learner’s abilities are compared to those of a target population (bilinguals) that represents their learning goal (bilingualism).
This paradigm shift also changes the narrative around the idealization of monolingual, native speakers. In its place, a bilingual baseline approach gives more agency to bilingual populations and helps to normalize all of the characteristics and the inherent value of bilingual speech. As noted by Sakai (2018), a consequence of this methodological shift in speech production research is that investigators and language practitioners reevaluate their goals regarding phonetic learning and move toward scholarship focusing on intelligibility rather than nativeness. For a more in-depth overview, the interested reader is referred to Chapters 33 and 35 of this volume.
18.7.3 Stepping Up Our Game
Further methodological concerns include statistical power and sample size. Most studies in the social sciences test for small effects (Ellis, 2010; compare with Plonsky & Oswald, 2014), include small sample sizes, and, therefore, are underpowered. Brysbaert (2020) highlighted how these issues are rampant in bilingualism research, likening it to having blurred vision and, in a recent call to arms, affirming that it is now time for us to step up our game. The severity of having a field plagued with low power should not be overlooked, as an underpowered study is more likely to commit a type II error (false negative) and contribute to a literature with lower positive predictive value. Consequently, this implies that the prevalence of significant findings in our field may be indicative of publication bias. Future empirical studies on bilingual speech production will require careful planning and sample size justification if we hope to see clearly.
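The link between sample size, power, and positive predictive value can be made concrete with a small calculation. The following is a minimal sketch, not a substitute for a proper power analysis: it assumes a two-group comparison, uses a normal approximation to the test statistic, and treats the effect size (d = 0.2, a conventional "small" effect), the group sizes, and the prior probability that a tested hypothesis is true as purely illustrative values. The function names are hypothetical.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses a normal approximation; d is the standardized mean difference.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def positive_predictive_value(power, prior, alpha=0.05):
    """Share of significant results that reflect a true effect.

    prior is the (assumed) proportion of tested hypotheses that are true.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for n in (20, 100, 400):  # illustrative group sizes
        power = approx_power(d=0.2, n_per_group=n)
        ppv = positive_predictive_value(power, prior=0.5)
        print(f"n per group = {n:3d}  power = {power:.2f}  PPV = {ppv:.2f}")
```

Under these assumptions, both power and positive predictive value rise with the number of participants per group, which is the reasoning behind calls for explicit sample size justification.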
18.7.4 The Way Forward
In many fields, analytic flexibility in data analysis – commonly referred to as researcher degrees of freedom (Gelman & Loken, 2014; Simmons, Nelson, & Simonsohn, 2011) – can lead to substantially different conclusions based on the same data set, or what is now being referred to as the inference crisis (see Rotello, Heit, & Dubé, 2015; Starns et al., 2019). The “garden of forking paths” associated with researcher degrees of freedom directly affects our ability to replicate findings and thus accumulate knowledge. The early 2010s saw the reproducibility crisis take hold of the psychological sciences. A push for open science, reproducible methodology, and increased value of replication studies followed as a consequence. This methodological framework and associated techniques have reshaped research methods in psychology and have slowly but surely made their way into adjacent fields. This is certainly the case for applied linguistics, SLA, and phonetics and phonology. More and more, journals in these areas are requiring open science practices, such as sharing of code and data, for novel research even to be considered for publication. These journals are creating incentives for researchers to partake in open science practices by encouraging replication studies for publication. As a result, moving forward we can expect to see stronger research related to bilingual speech production, replications – both successful and failed – of our seminal findings, and larger, more robust studies with increased sample sizes.
18.8 Conclusion
In the present chapter I have provided an overview of the principal aims of speech production research involving adult language learners. The chapter covered in detail important areas of inquiry, such as the nature of age effects on L2 production outcomes, ultimate attainment, and the role of target language input and L1/L2 use in adult bilingual development. I also reviewed some of the fundamental questions that have been asked and are being asked currently, such as how L2 phonetic categories develop over time or how production–perception links develop in bilingualism. Importantly, I also highlighted what I believe to be the principal methodological issues for the field and how we might resolve them moving forward.
19.1 Introduction
Listening to speech in your native language is easy. Recognizing the words spoken in conversation is generally an automatic and smooth everyday process in the first language (L1). Even in noisy or otherwise less than ideal conditions, performance is surprisingly robust. But anyone who has attempted to follow a conversation in a second language (L2) knows how demanding this can be, even when you know all the words. The flexibility and efficiency which listeners usually display in their L1 – immediately understanding sentences they have never heard before; adjusting quickly to new speakers; understanding speech in noisy or degraded environments – diminishes when they listen to an L2 (Takata & Nábělek, 1990). Even for reasonably clear speech, identifying individual words out of the speech stream is difficult. For example, Charles and Trenkic (2015) report that a group of international students at a UK university were unable to recognize about 30 percent of the words they heard during lectures. In this chapter, we review how the perceptual processing of L2 speech sounds and their mapping to stored representations are influenced by the L1 phonology, making listening more complex in the L2 than in the L1.
19.2 An Information Processing Model of Spoken Word Recognition
One critical step when trying to comprehend a spoken message is to identify the words that the speaker intended. Recognizing spoken words involves perceiving speech, and it also requires the existence of a lexical representation for each word stored in long-term memory. The perceived speech must be compared to stored lexical representations, during a probabilistic process that involves competition among multiple candidates. A word is recognized when a listener establishes a form match between what is perceived and an available lexical representation. The phonological form – the one perceived as well as the one stored – is thus central to recognizing spoken words.
As will become clear, this mapping process is complex. Its elucidation depends partly on the assumed architecture of the word recognition system as well as on the assumed phonological content of lexical representations. In the case of L1 listeners, one of the most vexing explanatory challenges is that the form of speech is highly variable: the form of each word or sound is modulated by its surrounding phonetic context and surrounding words, for example through reduction, coarticulation, or assimilation processes. In an L2, an additional issue is that perception of speech is influenced by the L1 of the listener (we address L2 processing in Section 19.3).
Let us first consider the makeup of the speech perception system and the different levels of processing and representation that have been proposed (e.g. McQueen, Cutler, & Norris, 2006, p. 1114; Norris, 1994; Ramus et al., 2010), before examining in more detail how spoken words are recognized. Figure 19.1 provides an overview of the different levels of phonological processing involved in recognizing spoken words. This architecture is applicable to both L1 and L2 speech processing, although it does not make any claims as to their interaction.

Figure 19.1 A simplified view of phonological and lexical processing during recognition. Boxes illustrate different levels of representation, while arrows indicate processes intervening between representations.
The process of speech perception begins when speech, in the form of sound vibrations, enters the ear and is transformed into nerve impulses by the cochlea. Sound at this initial stage is encoded as an acoustic representation, which is not specific to speech but instead is how all sounds are preprocessed. If this acoustic representation is perceived as speech, the speech-specific process of phonetic decoding (also referred to as perceptual processing or simply perception) categorizes the sounds into discrete units of representation. The perceived sounds (the percept) are encoded into a prelexical representation (McQueen et al., 2006), also called a sublexical phonological representation (Ramus et al., 2010). These phonological units consist of segmental and suprasegmental elements, such as phoneme categories, phonotactic structure, and other elements such as tone and stress, depending on the language, and they may be grouped as feet, syllables, morae, or other such temporal orderings. What is important to note here is that these units, and the representations that contain them, are language-specific, that is, they make reference to a particular language (usually the L1 of the listener).
Sublexical representations are then compared to phonological representations in the mental lexicon, also called phonolexical representations, which contain (at least) the canonical forms of the words stored in long-term memory. How exactly this comparison – also called lexical access – takes place in real time has been the focus of intense scrutiny over the past forty years (e.g. Marslen-Wilson & Welsh, 1978; McMurray, Apfelbaum, & Tomblin, 2022; Strauß et al., 2022). A likely scenario is that it is a probabilistic phenomenon in which listeners continuously attempt to map the unfolding speech signal onto lexical representations (McClelland & Elman, 1986; Norris, 1994). Several possible words that partially overlap with the prelexical representation are activated and compete for recognition in parallel, until the lexical candidate that best matches the input wins over its competitors, a process known as lexical competition. The array of potential candidates is updated as more signal is processed, and once a mismatch is detected, a candidate’s likelihood decreases, until the most likely candidate(s) remain(s) (e.g. Norris & McQueen, 2008).
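The incremental updating of candidates described above can be illustrated with a toy simulation. This is a deliberately simplified sketch and does not implement any of the cited models: the four-word lexicon, the rough transcriptions, the segment-by-segment alignment, and the mismatch penalty of 0.1 are all assumptions chosen for illustration.

```python
def compete(segments, lexicon, mismatch_penalty=0.1):
    """Track normalized candidate scores as each segment is heard.

    Every candidate starts with the same score; a mismatching (or missing)
    segment multiplies its score by mismatch_penalty.
    """
    scores = {word: 1.0 for word in lexicon}
    history = []
    for position, heard in enumerate(segments):
        for word, form in lexicon.items():
            matches = position < len(form) and form[position] == heard
            if not matches:
                scores[word] *= mismatch_penalty
        total = sum(scores.values())
        # Store relative support for each candidate at this point in time.
        history.append({word: score / total for word, score in scores.items()})
    return history


# Hypothetical toy lexicon with rough transcriptions (one symbol per segment).
TOY_LEXICON = {"cat": "kæt", "cap": "kæp", "cab": "kæb", "dog": "dɔg"}

if __name__ == "__main__":
    for step, state in enumerate(compete("kæp", TOY_LEXICON), start=1):
        ranked = sorted(state.items(), key=lambda item: -item[1])
        print(step, [(word, round(share, 3)) for word, share in ranked])
```

In this sketch, cat, cap, and cab remain tied while only [kæ] has been heard, and cap pulls ahead once the final segment arrives, mirroring the idea that the candidate set is narrowed as more of the signal is processed.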
For the most likely candidates, other linked representations such as semantic and orthographic representations are coactivated. In the L1, lexical representations for well-known words contain semantic, syntactic, orthographic, and phonological features, among others (they are elaborate; Hulstijn, 2001, p. 259). They are specific and stable, and deeply interconnected into larger networks. The word recognition process is rather intolerant of any segmental or suprasegmental mismatch. Mismatching information can in some cases block recognition (for instance, in the case of an initial consonantal mismatch, as in pracelet). Selecting the most likely lexical representation is primarily bottom-up, and depends on which lexical entry best matches the information present in the speech stream, but it can also be influenced by other factors such as contextual information (McQueen, 2007; see also Chapter 32, this volume).
Some aspects of this process, such as the need for a separate sublexical representation and the nature of phonolexical representations, are still a matter of debate. Researchers in favor of sublexical phonological representations argue that these may still differ from the forms that are encoded in the lexicon. Such intermediate representations are thought to be useful in explaining the ease and precision with which listeners recognize variable word forms in their L1 and the fact that recognizing these variable forms in L2 not only is much more difficult but also relies on L1 phonological knowledge (e.g., Darcy, Peperkamp, & Dupoux, 2007). Variable forms can occur, for example, when final consonants undergo assimilation (e.g. hot bread /hɑt bɹɛd/ can be produced as [hɑp̚bɹɛd], and hot glue /hɑt ɡluː/ as [hɑk̚ɡluː]), which is a language-specific process.
Views also diverge regarding whether the lexicon is composed of abstract representations of speech (the canonical form of words is stored) or of acoustic traces of speech episodes (highly detailed variants of a word are stored) or both (Goldinger, 1998; McQueen et al., 2006; Pallier, Colomé, & Sebastian-Gallés, 2001). There is evidence from a number of studies that at least some degree of abstraction is necessary to account for observed patterns (see McQueen et al. [2006] for a review). However, as mentioned by McQueen et al. (2006), it is likely that episodic memory is available to some degree in speech perception, and, currently, most researchers assume that listeners use both abstract and detailed representations, depending on the situation (e.g., Cutler, 2012; Goldinger, 2007; McQueen, 2007). This has been reflected in so-called hybrid models of speech comprehension (e.g., Goldinger, 2007; Hawkins, 2003). Even though the lexicon is based on abstract representations, listeners may be able to draw upon acoustic traces that are stored in episodic memory under certain conditions (Ramus et al., 2010).
These two issues – the nature of lexical representations and the intermediate steps in lexical access – are interconnected. They matter for our understanding of the L2 lexicon because of the assumptions regarding the level of detail in phonolexical representations and the structure of the lexical access process.
This section outlined the steps involved in accessing the L1 lexicon. In models of native spoken-word recognition, under normal circumstances, listeners’ perception of the L1 input is optimal, and their phonolexical representations are precise, or fine-grained, allowing for efficient activation and recognition. The percept we establish from what we hear is well aligned with our representations, and an optimal set of candidates is activated with strong knowledge of word probabilities for quick lexical selection. Yet, as Ann Cutler so concisely put it in her 2012 monograph, “nonnative listening is hard because native listening is easy” (Cutler, 2012, p. 335).
When applied to non-native or L2 listening, these two central premises are compromised: (1) listeners’ perception of the incoming L2 speech signal is not necessarily optimal; and (2) listeners’ L2 lexical representations are not necessarily accurate or precise. Both complicate the recognition of spoken L2 words. The next two sections outline how extensive experience with L1 shapes L2 phonological processing (Section 19.3) and the phonological coding of L2 words in the mental lexicon (Section 19.4).
19.3 L2 Phonological Processing Is Shaped by Experience with L1
Speech perception and word recognition in L2 are based on the same principles as in L1, likely using the same cognitive architecture: The acoustic input is preprocessed and sublexical representations are generated, before being probabilistically mapped to stored representations for L2 words (Boersma, 2009). However, L2 listeners may experience interference from L1 phonological processing routines at every turn when decoding the speech stream.
The details of this process have received intense attention in the field of L2 phonology, so we will not reiterate here information presented elsewhere (see Chapter 17, this volume, as well as Cutler [2012] and Sebastian-Gallés [2005] for a review). We present here only a brief overview of essential findings.
The perceptual system is attuned to processing native speech very efficiently (Polivanov, 1931; Sebastian-Gallés, 2005), relying on the language-specific phonological units used in L1, which were acquired for that particular L1 early in life (Werker & Tees, 1999). As a result, the available set of phonological units is not the same in all languages. Phonological units that are optimal for the L1 are not necessarily useful when listening to L2 speech, resulting in various misperception phenomena: listeners perceive sounds as different from what they are, or they perceive sounds that are absent, or they ignore dimensions that are relevant. Mostly, this happens because the L2 would require other representations to be processed optimally. For example, to optimally process Swedish, one would need at least seventeen vowel categories. Upon hearing Swedish, a speaker of Spanish, who has acquired five vowel categories for their L1, will initially attempt to “fit” all Swedish vowels in terms of their five categories, obscuring a number of phonemic differences between vowel pairs because they are assimilated into the same category. Similarly, a French speaker for whom words are not differentiated on the basis of stress, unlike in Spanish (e.g., bébe “he drinks” versus bebé “baby”), will initially ignore stress information when processing these Spanish words, obscuring the contrast between them. This particular dimension is not part of the set their processing system can rely upon. We refer to this phenomenon as L1-based processing.
One of the more studied examples of contrasting phonemes that do not exist in the L1 is the English /ɹ/–/l/ contrast for L1 Japanese speakers. Japanese does not have a phoneme that acoustically matches /ɹ/ or /l/. The only available liquid consonant is the tap /ɾ/, which is acoustically more similar to /l/ than /ɹ/ (Cutler, Weber, & Otake, 2006), although noticeably different from both. Japanese speakers learning English are well known to confuse the two consonants in production, and evidence from numerous studies has suggested that this difficulty might ultimately arise from an inability to discriminate between the two in perception (Goto, 1971; Mochizuki, 1981; Sheldon & Strange, 1982). In addition to phoneme contrasts, studies have documented perception difficulties related to suprasegmentals (e.g., Dupoux et al., 2008), phonotactics (Dupoux et al., 1999), and between-word phonological processes (Darcy et al., 2007). All these dimensions being language-specific, the findings converge to suggest that L2 learners process language input using the phonological units of the L1, at least during initial stages of acquisition.
What, then, are the consequences of L1-based phonological processing for spoken word recognition in L2? The main issue is that the presence of confusable phonemes increases lexical competition. By virtue of learners’ knowledge of multiple languages, words from all lexicons can be activated by the input and compete during recognition (Dijkstra, 2007; Marian & Spivey, 2003; see Shook & Marian [2013] for a model of bilingual language comprehension). But besides this added competition, L2 listeners experience more ambiguity when mapping the input to lexical representations because they tend to perceive speech through their L1 phonetic categories and phonological representations. When learners hear a word (“fry”) containing a phoneme that is part of a perceptually difficult L2 contrast (e.g. /ɹ/–/l/ for Japanese learners of English), words containing the other sound (“fly”) can be activated because the percept is ambiguous or inaccurate. Cutler (2005) investigated whether the inability to distinguish sounds does in fact create problems such as the activation of (spuriously) homophonous words. She examined the vowel contrast /æ/–/ɛ/ (difficult for Dutch learners of English) and the consonant pair /ɹ/–/l/ (difficult for Japanese learners). She used CELEX, an English lexicon with 70,000 words, to count the number of potential confusions between existing words. In the case of /æ/ and /ɛ/, the confusion would add just under 150 homophones, like cattle and kettle. In the case of /l/ and /ɹ/, this could cause about 300 spurious homophones, like rake and lake. These numbers are not high compared to the size of an average lexicon.
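The logic of counting spurious homophones can be sketched in a few lines. The example below does not use CELEX and does not reproduce the procedure or the figures reported above; it runs on a hypothetical nine-word lexicon with rough transcriptions, and the function name is an assumption made for illustration. A contrast is "collapsed" by rewriting one member of the pair as the other, after which words with distinct transcriptions that end up identical are counted.

```python
from collections import defaultdict


def spurious_homophones(lexicon, merge):
    """Return groups of words that become identical once a contrast is merged.

    lexicon maps spellings to transcriptions; merge maps each collapsed
    segment to the segment it is perceived as (for example {"ɹ": "l"}).
    """
    table = str.maketrans(merge)
    groups = defaultdict(list)
    for word, form in lexicon.items():
        groups[form.translate(table)].append(word)
    # Keep only groups whose members were distinct before merging.
    return [
        sorted(words)
        for words in groups.values()
        if len({lexicon[word] for word in words}) > 1
    ]


# Hypothetical toy lexicon with rough transcriptions.
TOY_LEXICON = {
    "rake": "ɹeɪk", "lake": "leɪk",
    "rock": "ɹɑk", "lock": "lɑk",
    "fry": "fɹaɪ", "fly": "flaɪ",
    "cattle": "kætəl", "kettle": "kɛtəl",
    "key": "ki",
}

if __name__ == "__main__":
    print(spurious_homophones(TOY_LEXICON, {"ɹ": "l"}))
    print(spurious_homophones(TOY_LEXICON, {"æ": "ɛ"}))
```

Applied to a full lexical database, the same grouping step would yield counts of the kind discussed above; the toy lexicon is intended only to show the mechanics.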
However, if one takes into account phantom-word activation, a phenomenon in which listeners recognize words that are not there, the size of the problem increases. This causes not only the activation of words that differ from the intended word in that segment but also the activation of words that only partially overlap, since any overlap adds to lexical competition. For example, the word phantom may activate fan for any listeners given the segmental overlap in the first syllable. For Japanese listeners, the word regular may also activate leg, and for Dutch listeners, daffodil temporarily activates deaf (Broersma & Cutler, 2008). Cutler et al. (2006) showed that L1 Japanese listeners temporarily activated locker when hearing rocket in an eye-tracking study. Broersma and Cutler (2011) reported that Dutch listeners activated a word containing /æ/ if they heard a sequence containing /ɛ/, such that lamp was activated by the sequence [lɛmp] cut from the English phrase eviL EMPire. Similarly, L1 phonotactic knowledge can impact lexical activation. In Freeman, Blumenfeld, and Marian (2016), L1 Spanish-L2 English bilinguals accessed their L1 phonotactic constraints during English comprehension, increasing lexical competition by activating both lexicons. Therefore, L2 listeners activate more and/or different lexical candidates than would L1 listeners, even across word boundaries.
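The evil empire example lends itself to a small illustration of how a merged contrast can create embedded candidates across a word boundary. The sketch below is a toy under stated assumptions: the transcriptions are rough, the lexicon and function name are hypothetical, and "activation" is reduced to an all-or-nothing substring match, which is far cruder than the graded activation measured in the studies cited above.

```python
def embedded_candidates(utterance, lexicon, merge=None):
    """List words whose (merged) form occurs anywhere in the (merged) utterance.

    utterance is a transcription with word boundaries removed, so matches
    may span a boundary. merge maps a segment to the one it is perceived as.
    """
    table = str.maketrans(merge or {})
    percept = utterance.translate(table)
    return sorted(
        word for word, form in lexicon.items() if form.translate(table) in percept
    )


# Hypothetical toy lexicon and a rough transcription of "evil empire".
TOY_LEXICON = {"evil": "ivəl", "empire": "ɛmpaɪɚ", "lamp": "læmp", "limp": "lɪmp"}
EVIL_EMPIRE = "ivəl" + "ɛmpaɪɚ"

if __name__ == "__main__":
    # Listener who keeps /æ/ and /ɛ/ apart.
    print(embedded_candidates(EVIL_EMPIRE, TOY_LEXICON))
    # Listener for whom /æ/ is perceived as /ɛ/.
    print(embedded_candidates(EVIL_EMPIRE, TOY_LEXICON, merge={"æ": "ɛ"}))
```

With the contrast intact, only the two intended words are found; once /æ/ is treated as /ɛ/, lamp joins the candidate set even though it straddles the boundary between the two words.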
A related complication of applying L1-based phonological knowledge during L2 processing is that L2 listeners may apply L1-specific segmentation strategies, which may not be optimal. Phonotactics and rhythmic properties constrain segmentation of the input, which consists of finding the location of word boundaries in running speech. For example, Tremblay et al. (2016) reported that Korean L2 learners of French tended to use an L1-specific prosodic cue to locate word boundaries, which created problems in L2 speech segmentation. Specifically, both Korean and French use a rise in fundamental frequency (F0) to indicate word-final boundaries, but the exact location of the F0 rise is subtly different. Korean learners of French struggled to adapt this strategy to the prosodic structure of their L2, possibly due to the similarity of the L1 and the L2 cues.
Taken together, these findings show that L1-based processing makes lexical access more challenging, even if listeners know the words and have an accurate phonolexical representation for them. But, as it turns out, L1-based processing also has consequences in the lexical representations themselves.
19.4 L2 Lexical Storage Is Influenced by L1
The other main source of difficulty in recognizing spoken words for L2 listeners is that their phonolexical representations may be inaccurate or less precise in L2 than in L1. The following section provides an overview of the various ways in which phonolexical representations for L2 words have been shown to differ from L1 words.
19.4.1 Representations Code Only Dominant Categories
From the perspective of learning new words in L2, it has generally been assumed that a contrast must be perceived in order to be phonologically coded (i.e., represented) in a lexical representation. If two sounds are assimilated to the same category and the contrast between them is not perceived robustly, then at the moment of learning a minimal pair (a pair of words differing in a single phonological element, e.g. lock vs. rock), the two words may be lexically represented with identical sounds. Pallier et al. (2001, p. 445) state that “it is reasonable to expect that a minimal pair of L2 words differing only in one phoneme contrast (i.e. ‘rock’–‘lock’) will be represented in the same way by L2 learners who do not possess the /ɹ/–/l/ contrast. Indeed, if both L2 phonemes are assimilated to the same L1 phoneme, then the two words become homophones.”
Therefore, lexical representations can be thought to be constrained by perception at the time of learning. A phonolexical representation may thus be derived only from the categories or phonological units that exist in the sublexical phonological representation, which are initially strongly influenced by L1.
Several studies have provided evidence of such spurious homophony between L2 words, whereby a minimal pair of L2 words differing only in a phoneme contrast that is difficult to perceive can be encoded in the mental lexicon as homophones. The first study to experimentally show this (Pallier et al., 2001) used an auditory lexical decision task with repetition priming, looking at Catalan vowel contrasts absent in Spanish. In this task, participants had to decide whether each stimulus in a list was a word or a nonword. When a word was repeated further down the list, decision times were shorter because the item had been recently processed (a phenomenon known as repetition priming). Pallier et al. (2001) presented minimal pair stimuli, such as /netə/–/nɛtə/, in sequence to determine if processing the first caused faster recognition of the second. While Catalan-dominant bilinguals displayed no repetition priming for minimal pairs (only for actual repetitions), Spanish-dominant bilinguals did. They responded faster to néta /netə/ “granddaughter” following itself, /netə/, but also following neta /nɛtə/ “clean.” This suggests that minimal pairs can be lexically coded as homophones for these learners. However, the use of auditory stimuli is a limitation because it cannot rule out that participants are simply perceiving the contrasting sounds in the task as the same. That is, the underlying forms of the lexical representations can be investigated only if we know that participants are perceiving the stimuli as different. And, as we saw in Section 19.3, this is often not the case.
To bypass this complicating factor of perception, Ota, Hartsuiker, and Haywood (2009) used a semantic relatedness task in their investigation of the lexical representation of the English /ɹ/–/l/ contrast for L1 Japanese learners and the English /p/–/b/ contrast for L1 Arabic learners (Arabic does not have a /p/–/b/ contrast). This task visually presents a pair of words to a participant, and they must determine if the words are semantically related or not. For example, if they see LOCK–KEY, they should respond that yes, the words are semantically related, but if they see ROCK–KEY, the expected answer is no. Ota et al. (2009) argue that readers automatically access the phonological form of a word when they encounter its orthographic form, so such a task provides insight into lexical phonological representations without interference from perception. Stimuli consisted of minimal pairs that varied on either the /ɹ/–/l/ contrast or the /p/–/b/ contrast. Each member of the minimal pair was paired with a word either to which it was semantically related (LOCK–KEY) or to which it was semantically unrelated (LOCK–HARD). In the case of the latter, the other half of the minimal pair (ROCK) was semantically related to the paired word. Within the experiment, participants performed semantic relatedness judgments on, for example, the following pairings: LOCK–KEY, LOCK–HARD, ROCK–KEY, and ROCK–HARD. Results showed that the L1 Japanese learners had a strong tendency to respond that unrelated words such as ROCK–KEY were semantically related for the /ɹ/–/l/ condition only, while L1 Arabic learners did the same for the /p/–/b/ condition only. When an L1 Japanese participant saw LOCK on the screen, they activated lexical representations associated with both the concept of lock and that of rock. This was taken as evidence that such minimal pairs are encoded as homophones in the mental lexicon.
This scenario implies that representations are precise but only contain L1-dominant categories. As a result, certain phonemic contrasts are neutralized at the lexical level (e.g., both lock and rock are stored as |lɑk|, and connect to different concepts). When perceiving [lɑk] (likely resulting from either rock or lock), both representations match the percept, which symmetrically activates them. In the hypothetical scenario that [ɹɑk] and [lɑk] are perceived differently, [ɹɑk] mismatches the representation.
The evidence presented so far suggests that an L2 phoneme contrast that is perceived in a nontarget-like way (for instance as a single category) by the learner could be represented in the mental lexicon in the same way, such that minimal pairs of that contrast become homophones. This is an expected finding, since it is not immediately clear how a contrast could be differentiated in the mental lexicon if it cannot be perceived. Thus, in some cases L2 perceptual difficulties appear to lead to (pseudo)homophony, where only the dominant category is lexically coded.
However, two other situations observed in research findings must be considered. First, somewhat unexpectedly, there are cases in which nontarget-like perception does not seem to prevent learners from encoding a lexical contrast, even if that contrast is also probably non-native-like (Section 19.4.2). Second, there are cases in which a phoneme contrast can be discriminated in perception but does not appear to be encoded appropriately (yet?) in the mental lexicon. Together, they suggest that even though “good enough” perceptual discriminability is needed, perfectly accurate perception is not necessary – nor is it a guarantee of accurate phonolexical encoding. Each will now be examined in detail, together with their consequences for the recognition of spoken L2 words.
19.4.2 Phonolexical Representations Are Differentiated Despite Perceptual Difficulties
Weber and Cutler (2004) conducted a series of experiments investigating the lexical representation of the English /ɛ/–/æ/ contrast for L1 Dutch learners. This contrast is absent in Dutch and poses particular difficulty for learners in perception. Weber and Cutler (2004) used an eye-tracking task in which participants were instructed to look at a particular picture when they heard the name of the object shown in the picture. For example, when they heard the word panda, they were told to immediately look at the picture of the panda and not at one of the other pictures displayed in front of them. In other words, upon perceiving the target input, they had to match it to a lexical representation in order to select the correct picture. However, there was always a competing picture whose name matched the first part of the target word except that the vowel was switched between [ɛ] and [æ], such as panda [pændə] and pencil [pɛnsəl]. The results showed an interesting pattern of asymmetry. When hearing the first part of a word with [ɛ] such as pencil, [pɛn], participants tended to quickly look at the picture of the pencil without looking at the competitor. However, when hearing the first part of a word with [æ] such as panda, [pæn], participants were slower and tended to look at both the pictures of the panda and of the pencil. A follow-up study using the same methodology with the English /ɹ/–/l/ contrast for L1 Japanese learners found the same pattern: words that began with [l] were quickly identified and received no competition from words with [ɹ], but words that began with [ɹ] were slower to be identified and received competition from words with [l] (Cutler et al., 2006).
Participants’ perception of these contrasts was not directly measured in either study, but previous research had shown that the contrasting phonemes are not discriminated in a native-like manner (Bradlow et al., 1999; Cutler et al., 2004; Escudero, Simon, & Mitterer, 2012). Rather than perceiving the two sounds as a single ambiguous /ɛ/–/æ/ category (L1 Dutch learners) or /ɹ/–/l/ category (L1 Japanese learners), listeners tended to perceive them as the dominant category, /ɛ/ and /l/, respectively. These categories were dominant because they most closely reflected the relevant L1 categories in terms of acoustic properties.
The asymmetries between dominant and nondominant L2 categories observed by Weber and Cutler (2004) and Cutler et al. (2006) point to the possibility that, at least partially, participants were distinguishing the target phonemes in their lexical representations. The L2 segments were not necessarily stored as the same sound because otherwise no asymmetries in recognition should be observed: upon hearing items with /l/, if all representations for target and competitors contained /l/ as well, they would have been symmetrically activated. Importantly, the researchers claimed that the participants did distinguish between the contrasting phonemes at the level of lexical representation. The dominant category (/ɛ/ and /l/) was encoded accurately, but the nondominant category (/æ/ and /ɹ/) was encoded as something different (perhaps a poor version of the dominant category or an indistinct representation). Perceiving the input as the dominant category would therefore strongly activate the lexical representation that matched the dominant category (pencil |pɛnsəl|). Because the lexical representation that contained the nondominant category (panda, |p?ndə|) was not immediately activated by the percept alone, it must have been different in some way from the dominant one, and thus required more context to be activated.
The question remains, however, of how learners could encode a contrast in the mental lexicon (even though it might not be target-like) that, at the same time, remains perceptually difficult. One possibility, put forth by Cutler et al. (2006, p. 280), is that “L2 listeners have been taught that write and light (or cattle and kettle) are supposed to sound different, and that they have accordingly incorporated this distinction in some form in the phonological representations which they maintain of words containing the sounds in question in the lexicon.” Another proposal by Darcy et al. (2012) states that learning at the lexical level might exploit information about combinations of acoustic features that are repeatedly perceived in the input and stored in phonolexical representations, even if such combinations are not part of the set of L1 phonological units used during phonological categorization, and are not (yet) sufficient or salient enough to trigger separate categorization responses. Accordingly, listeners may perceive the contrast somewhat covertly, that is, “well enough” to make lexical distinctions, but not in a target-like manner when an overt response (such as in categorization or identification tasks) is needed. We return to the potential role of explicit instruction and orthography in Section 19.5.
19.4.3 Phonolexical Representations Are Imprecise Despite Reliable Perception
Many of the difficulties presented earlier can be traced back to perceptual problems with confusable L2 phonological units, which either interfere during lexical access or lead to encoding words with the wrong units. Are those difficulties resolved once perception improves?
Sadly, no. Even learners with reliable perception experience difficulties with L2 lexical encoding (Amengual, 2016; Darcy, Daidone, & Kojima, 2013; Díaz et al., 2012; Sebastián-Gallés & Baus, 2005). This implies that accurate perception is not a guarantee of accurate lexical encoding, suggesting that the phonolexical forms of L2 words may generally be less precise, or “fuzzy,” even in the absence of perceptual difficulty (Cook & Gor, 2015; Cook et al., 2016; see Gor and Cook [2020] specifically for low frequency or less familiar words).
Several studies that have investigated both L2 perception and lexical representation have reported that learners were able to perceive phoneme contrasts (as measured by identification and discrimination tasks) yet did not seem to encode such contrasts in the mental lexicon (as measured by auditory priming and lexical decision or matching tasks). Such studies have looked at, for example, the Catalan /e/–/ɛ/ and /o/–/ɔ/ contrasts for Spanish-dominant early bilinguals (Amengual, 2016), the Japanese singleton-geminate consonant contrasts for L1 English learners (Darcy et al., 2013; Hayes-Harb & Masuda, 2008), the Spanish /ɾ/–/r/ contrast for L1 English learners (Daidone & Darcy, 2021), and the Russian palatalized and nonpalatalized consonant contrasts for L1 English learners (Simonchyk & Darcy, 2017).
For example, in Simonchyk and Darcy (2017), a group of intermediate and advanced learners of Russian took part in a perception task in which they had to distinguish palatalized from nonpalatalized consonants, and in a lexical task with words and nonwords involving the palatalization contrast. Knowledge of all words in the lexical task was verified with a word familiarity questionnaire. Despite high familiarity with all words, error rates on the test condition of the lexical task were high: the minimum was 30 percent error, and some participants made 80–90 percent errors. When comparing error rates on the perception task to those in the lexical task, no clear link emerged: one subgroup of participants with a homogeneous error rate of around 15 percent in the perception task displayed wide-ranging error scores on the lexical task (30–90 percent across both groups). This means that listeners who were quite accurate at distinguishing the palatal and nonpalatal sounds in Russian did not always know which of the two should be stored in the lexical representation of words they were very familiar with. Data of this kind bring home the point that the mental lexicon in learners can be quite abstract or separate from phonetic categorization performance. Even at high levels of proficiency, learners who have mastered perceptual distinctions have not necessarily modified or updated their phonolexical representations. We can therefore conclude that lexical updates are not always automatic. We return to this point in Section 19.5.
These findings raise interesting questions about the exact relationship between perception and lexical representation, and suggest that it may not be a clear causal relationship, as first assumed. The mismatch may be due to the time required to update lexical representations once a perceptual category is acquired. If a learner has encoded two contrasting phonemes as identical in the lexicon and then later acquires the contrasting phoneme categories, the lexical representations presumably must be updated. Since lexical representations are stored in long-term memory, updates may not be immediate. It is not clear how this update process might proceed, and it is possible that, even after the correct representation has been encoded, the previous, inaccurate representation may persist.
It is also possible that the observed disconnect between perception and lexical representation may instead be an effect of the tasks used. Dupoux, Peperkamp, and Sebastián-Gallés (2001) and Højen and Flege (2006) cited a number of early L2 perception studies that failed to find the expected pattern of category assimilation. They stated that the commonly used AXB task “has often yielded near-ceiling discrimination scores (90%–100% correct) for pairs of foreign vowels likely to have been perceived as instances of a single L1 category” (Højen and Flege, 2006, p. 3074). The concern is that if the demands of the task are too simple (such as sounds presented in isolation by a single speaker), participants may be able to discriminate at the level of the acoustic representation without relying on any speech-specific processing. To avoid this, discrimination tasks can be made more complex to push participants to discriminate at the phonological level. In addition, measuring monolingual naïve listeners along with L2 learners can help to elucidate results; if accurate discrimination requires the acquisition of a new perceptual category, then naïve listeners should perform poorly. Moreover, perception tasks are often prelexically oriented, whereas lexical tasks by definition are not. As a result, lexical task performance is likely subject to various influences (e.g. word knowledge, speed of lexical retrieval) that may lead to more individually variable performance compared to a prelexical perception task.
As was noted in the discussion of L2 speech perception (see Section 19.3), much of the research on L2 lexical representations focused on phoneme contrasts at first. However, there is growing evidence that L1 suprasegmental dimensions (Dupoux et al., 2008: Spanish word stress for L1 French learners; Pelzl et al., 2019, 2021: Mandarin tone for L1 English learners) and phonotactic structure (Darcy & Thomas, 2019) are encoded in learners’ lexical representation as well.
These findings all suggest that learners’ phonolexical representations are not target-like. In some cases, representations are precise but encode L1 categories, which may result in homophones. In others, representations are imprecise (fuzzy) but may preserve some form of contrast, avoiding homophones. Asymmetries found in lexical tasks are crucial to further characterizing the nature of phonolexical fuzziness, as well as the patterns of lexical activation in each case (Barrios & Hayes-Harb, 2021).
In this section and the previous one (Section 19.4.2), we saw that learners’ issues stem either from unreliable input perception or from nontarget-like lexical representations, or both. In addition, learners may use segmentation strategies that are optimized for their L1 rather than their L2, resulting in misparsing of the speech stream. Because of their reduced exposure to the language, they likely also have less access to information about the word and phoneme probabilities of their L2 that native listeners use during recognition (Norris & McQueen, 2008). As a result, L2 listeners activate additional or unnecessary competitors, which further increases lexical competition (above and beyond the increased competition inherent in having more than one lexicon; see Marian & Spivey, 2003). Confusable phonemes can also make competitors difficult to inhibit, since L2 learners can have trouble detecting a mismatch between the incoming signal and the words stored in memory, which slows down recognition. Together, these factors make recognizing words in an L2 a challenging undertaking, especially at earlier proficiency levels. In Section 19.5, we highlight the fact that perception and phonolexical representations are not static, and, over time, L2 listeners’ recognition of spoken words improves.
19.5 Development of Phonological and Lexical Processing in L2
Lexical encoding of difficult L2 phonological dimensions has been shown to be challenging, but little is known about how learners update initially inaccurate lexical representations. Over time, sustained input helps the L2 phonological system develop, and processing becomes more accurate across a number of phonological dimensions (e.g., Levy and Strange [2008] for vowels; Wayland and Guion [2003] for tones). Similarly, learners’ lexical representations become more accurate over time. For example, more advanced learners outperform intermediate learners in lexical decision tasks where they have to tell words from nonwords (Darcy et al., 2013). Higher proficiency also helps them detect mispronunciations in words (Simonchyk & Darcy, 2018), and helps them experience less competition from similar words (Cook et al., 2016; Veivo & Järvikivi, 2013). These findings suggest that learners’ lexical representations become more precise over time, and possibly become less influenced by the L1 phonology. Given the assumed link between perceptual ability and word form learning (Pallier, Bosch, & Sebastian-Gallés, 1997; Pallier et al., 2001), it is possible that improvements in lexical encoding depend on improvements in perceptual accuracy – yet this is still an open question, complicated by the observed dissociations between perceptual and lexical processing. Indeed, as mentioned earlier, improvements in perception or production do not seem to parallel those in lexical tasks, and having acquired a contrast in perception does not guarantee accurate lexical representations (Sebastián-Gallés & Baus, 2005).
This raises the question of how L2 learners establish accurate or less fuzzy phonolexical representations, how they update initially inaccurate ones, and how these updates spread throughout the phonolexical representations that need them (and only these).
19.5.1 Learning and Updating Lexical Representations in Adults
Studies that examine how adults learn new words – in L1 or L2 – find that lexical representations for novel words (e.g. “cathedruke”) are swiftly created in terms of their phonological information (Gaskell & Dumay, 2003), even when they contain non-native contrasts that are hard to perceive (Dufour, Nguyen, & Frauenfelder, 2010). However, these representations need at least an overnight consolidation to fully integrate with existing items, enter lexical competition, and “act” as words (Gaskell & Dumay, 2003). Fewer studies have examined the specific mechanisms by which existing phonolexical representations can be updated over time. The first study to show that existing L1-based representations can be updated for a second dialect after training is Dufour et al. (2016), who showed that training a difficult contrast with orthography can temporarily modify lexical representations. However, participants returned to their pretraining patterns on the day following the training – which suggests that, while new lexical representations can be established relatively quickly, modifying existing ones for the long term may be a much longer process.
A study by Darcy and Thomas (2019) found that about 25 percent of the L2 English (L1 Korean) learners they tested were able to identify pseudowords that resembled highly familiar real words ([bʊlu:], based on [blu:] “blue”) as accurately as native speakers, suggesting that updating long-established words is possible. The behavioral task (lexical decision task) they used, however, cannot elucidate the mechanism of updates behind a learner’s accurate performance. It is possible that successful participants have either (1) successfully removed their older L1-based representation, in which case the pseudoword stimulus mismatches with the lexical representation they have, or (2) kept the L1-based representation alongside a more recent, L2-based representation, so that upon hearing a pseudoword that matches the L1-based representation (e.g. [bʊlu:]), they are able to quickly suppress its activation to reject the pseudoword. The latter is a possible mechanism in L1: Bürki and Gaskell (2012) demonstrated that people can have two coexisting representations for variable words (e.g. “mackerel” pronounced as [makɹɛl] or [makəɹɛl]) on a long-term basis.
Two recent studies (Llompart & Reinisch, 2021; Melnik & Peperkamp, 2021) examined whether L2 lexical processing can be improved and lexical representations modified through training. Both studies used perceptual training based on the reasoning that accurate perceptual processing is a prerequisite for (even if not a guarantee of) the establishment of accurate lexical representations.
Melnik and Peperkamp (2021) found that participants who underwent a phonetic training regimen significantly improved their perception of a difficult contrast. They also improved their recognition of real words (versus pseudowords) containing that contrast. Improvement was maintained at the delayed post-test (four months after immediate post-test), which suggests that the representation that learners had for these words had been somehow updated. Llompart and Reinisch (2021) evaluated more directly whether initial representations can be modified by learners. They used a phonological training paradigm with a novel word-learning task. The fact that they used novel words is important because it increases the likelihood that the participants have only one lexical representation for a given novel object, which they create during the learning task. By comparing the effect of minimal-pair training, provided either during the initial learning session or only later, the day after learning, Llompart and Reinisch (2021) examined whether participants were able to update newly acquired representations. Both groups updated their new lexical representations after training. The authors concluded that updating lexical representations for newly learned words is not too costly, and that it can happen very rapidly. This ties in with the earlier observation that “young” representations may still be malleable, but longer-established ones may take some time to modify (Dufour et al., 2010).
19.5.2 Nonperceptual Factors That Impact the Precision of Lexical Representations
Above and beyond the interference from the L1 phonological system during perception, many factors are likely involved in determining the precision with which L2 words are lexically encoded and the ease with which these representations can be adjusted over time. Research has only recently begun to investigate this process in L2 learners and to model these changes in terms of the whole mental lexicon (Bordag, Gor, & Opitz, 2021). In this section, we briefly review two factors that potentially matter in the updating of lexical representations: the presence of orthographic input and the size of a learner’s vocabulary.
Orthographic representations influence spoken word processing in both L1 and L2. In native listeners, Ranbom and Connine (2011) observed identity priming of both spoken forms and orthographically supported mispronunciations (both [kæsl] and [kæstl] prime the target CASTLE, but there was no priming from [hæstl] for HASSLE, only [hæsl]). This suggests the coexistence of a spoken representation and an orthographically derived phonological one that are connected and can both modulate lexical competition during spoken word recognition.
It follows that, to learners, knowing that two similar-sounding words are spelled differently may signal a contrast that they cannot perceive. Escudero, Hayes-Harb, and Mitterer (2008) provided support for the role of orthography in a study that involved novel word learning. One group of participants learned the words through auditory delivery alone, while another group also saw the orthographic forms of the words. Using the same eye-tracking paradigm as Weber and Cutler (2004), Escudero et al. (2008) found that the group who received auditory delivery alone confused /ɛ/–/æ/ words in a symmetric manner. However, the group who also saw the orthographic forms confused the words in an asymmetric manner that matched the results of Weber and Cutler (2004). This suggests that providing the orthographic forms of the words during learning can help learners separate lexical representations, even when the contrast is not perceived (see Chapter 31, this volume). However, this evidence of a benefit of spelling was obtained with a very specific case: when both L1 and L2 phoneme-grapheme mappings refer to distinct categories. In many other cases, exposure to orthographic forms where L1/L2 phoneme-grapheme correspondences mismatch can lead to the emergence of orthographically licensed (mis)pronunciations, and interfere with the memorization of word forms (Hayes-Harb, Nicol, & Barker, 2010). Many questions remain as to whether and how spelling information impacts phonolexical encoding (see Hayes-Harb & Barrios [2021] for a review).
Other factors possibly connected to orthographic input, such as explicit instruction, receiving visual information about articulation, using gestures, as well as exposure to minimal pairs during learning, can cue learners to a contrast that they perceptually neutralize, helping them establish distinct lexical representations (Llompart & Reinisch, 2017, 2020). However, these effects are not stable across studies and need to be further elucidated to fully understand their underlying mechanisms. For example, examining native English listeners’ perception of the Arabic /k/–/q/ contrast, Showalter and Hayes-Harb (2015) found that presenting text in Arabic script did not aid listeners in learning novel words. This result was true even when the subjects received explicit instruction on aspects of the script that would be informative for identifying the novel contrast.
Another factor that could guide learners to more fine-grained lexical representations is the size of the vocabulary, including lexical characteristics such as frequency and neighborhood density. A larger L2 lexicon may highlight the importance of L2 contrasts through the noticing of continual mismatches with phonological neighbors, leading to the refinement of existing phonolexical representations. Two recent studies (Daidone & Darcy, 2021; Llompart, 2021) found that a larger L2 vocabulary was predictive of more accurate phonolexical representations. More studies are needed, but it can be tentatively concluded that acquiring a large vocabulary in the L2 could be useful in mediating phonologically imprecise lexical representations.
19.6 Conclusions
We have reviewed a number of findings that converge to show that L2 listening is no small feat: the profound effect of extensive L1 experience on both the phonological processing and the lexical encoding of spoken L2 words is difficult to overcome. Fine-tuning processing and updating representations is, in any case, a lengthy undertaking that necessitates large amounts of L2 input. Despite many recent advances in our understanding, the connections between perception, lexical representations, orthography, and production are still unclear in the case of L2 listening. Important questions remain about the specificity and the interdependence of bilinguals’ lexical representations and the factors influencing cross-language word activation. Two recent special issues showcase developments in this area (Darcy, Tremblay, & Simonet, 2017; Gor et al., 2021). To conclude, we highlight five major questions that emerge for future research: (1) How fuzzy, precisely, are L2 phonolexical representations? (2) What are the conditions under which L2 learners can acquire accurate representations? (3) What are the mechanisms by which learners update the representations that need it? (4) Which intrinsic and extrinsic factors facilitate or hinder this process? (5) What is the dynamic relationship between phonological processing and phonolexical encoding, over time, within any given learner? Ultimately, understanding the source of imprecise phonolexical representations and the factors that facilitate the development of finer-grained ones will inform instruction – for instance, by outlining when to provide explicit pronunciation support, or which types of words to learn together or separately in order to effectively refine these words’ representations.
20.1 Introduction
This chapter examines vowel and consonant perception and production during typical second language (L2) acquisition by sequential, dominant (i.e., nonbalanced) bilinguals including classroom learners. To contextualize research on L2 segmental speech learning within the fields of bilingualism and L2 learning more generally, in Section 20.2 we begin with a brief overview of four regularly investigated themes that serve to structure our review, namely, cross-linguistic influence, development, inter- and intralearner variability, and training effects. Each theme is accompanied by one or two general questions that have guided previous research. Once this foundation is laid, we then review empirical research exploring these phenomena in the L2 perception and production of vowels (Section 20.3) and consonants (Section 20.4). We conclude briefly in Section 20.5.
20.2 Recurring Research Themes in the L2 Acquisition of Segments
20.2.1 Cross-Linguistic Influence
How does a sequential, dominant bilingual’s first language shape L2 segmental perception and production?
Cross-linguistic influence (CLI) from a learner’s first language (L1), also known as transfer or, when it results in nontarget-like learning or use, interference, shapes L2 speech learning at all levels of target language (TL) proficiency. Arguably, CLI is the most well-researched theme in L2 speech learning, with researchers seeking to understand its effects on perception and production, including how CLI changes over time with increased experience and proficiency (see Chapter 10, this volume, for discussion). As concerns perception, theories propose (see Chapters 7–9, this volume) and empirical studies demonstrate that learners’ L1 segmental inventories as well as speech processing routines shape TL segment categorization, including by making learners more sensitive to phonetic cues (e.g., differences in voice onset time, laryngeal voicing, and duration that signal voiceless–voiced stop contrasts) used to realize L1 phonemic contrasts when perceiving TL sounds. In production, learners may use not only their particular L1 phonetic realizations of vowels and consonants but also the general articulatory and voice settings (i.e., the overall configurations of major articulators including the lips, tongue, and jaws as well as the vocal cords and larynx) of their mother tongue (see Colantoni, Kochetov, and Steele [2021] for discussion).
20.2.2 Development
Are there universal patterns to L2 segmental speech development?
Are certain vocalic and consonantal phenomena acquired more easily?
Research on development seeks to discern the commonalities in L2 speech learning, including developmental universals and sequences as well as factors that facilitate the acquisition of certain vowels or consonants. A potential universal in both perception and production is L2 learners’ greater ease with the durational than with the spectral properties of TL segments (e.g., Baker, 2010; Cebrian, 2006; Choi, Kim, & Cho, 2016; Smith et al., 2009). Developmental sequences consist of sequential, overlapping learning stages observed for a given vowel or consonant for learners of a particular TL. For example, when acquiring the French rhotic /ʁ/, English-speaking learners first master the manner then the voicing of the fricative (Colantoni & Steele, 2007, 2008). Perceptual saliency and articulatory difficulty (e.g., Baker, 2010; Patience, 2022; Yang & Chen, 2019) along with segmental and lexical frequency (e.g., Inceoglu, 2021; Munro & Derwing, 2008; Thomson & Isaacs, 2009) may also correlate with the degree of relative difficulty. When such variables shape acquisition, features/sounds that are more salient, less articulatorily difficult, and/or more frequent are often learned more readily.
20.2.3 Inter- and Intralearner Variability
What are the sources of differences in perception and/or production between learners and for the same learner over time and in different communicative contexts?
One of the hallmarks of L2 speech learning as well as of L2 acquisition in general is the considerable variability observed both between learners and within the same learner at different points in time or in different language use situations (e.g., Colantoni et al., 2021; Smith & Hayes-Harb, 2012). Such differences have been explained with reference to four factors:
1. differences in input quantity and quality: the input available to L2 learners may vary greatly – for example, whereas classroom learners may have relatively little exposure to the TL, often produced by other non-native speakers including potentially their instructors, naturalistic learners, including immigrants and study abroad programme participants, may benefit from a richer input in contexts where the majority language may be used in personal, educational, and/or professional spheres of their daily life;
2. language experience: this variable typically encompasses various factors, the principal ones being age (at onset) of acquisition (AOA), number of years of learning, and the variety of contexts in which the target language must be understood and produced. Note that there may be overlap between language experience and input quantity and quality, as the former shapes the latter;
3. individual differences (IDs): these cognitive and affective learner characteristics shape language acquisition and use. Cognitive IDs include phonological short-term memory, analytical reasoning, and inductive learning abilities. Affective IDs include motivation and L2 anxiety (see Dörnyei and Ryan [2015] for general discussion; see Moyer [2013] and Chapter 3, this volume, for IDs in L2 speech learning);
4. communicative context: as with native speakers, non-native speakers’ language use is shaped by the listening and speaking requirements of the communicative setting, including task complexity (e.g., Saito & Munro, 2014; Solon, Long, & Gurzynski-Weiss, 2017) and interlocutor characteristics (Gnevsheva, 2015).
20.2.4 Effects of Training
What are the effects of training on learning?Footnote 5
How do variables such as task and stimuli type condition effectiveness?
Given the perceptual and production difficulties that L2 learners typically encounter, researchers have investigated whether acquisition can be assisted, particularly for vowels or consonants that prove difficult for most learners. Using a variety of methods reviewed later in this chapter, researchers provide learners with input rich in quantity and quality as well as, in some cases, metalinguistic information, with training directing learners’ attention implicitly and/or explicitly to difficult TL contrasts/features. Overwhelmingly, research has revealed the positive effects of training on both segmental perception and production (see, e.g., Sakai and Moorman [2018] for a meta-analysis of the effects of perception training on production).
Having briefly sketched these main research themes, we now turn to a review of empirical studies that have investigated each of them with respect to the L2 perception and production of vowels and consonants. With the goal of providing a state-of-the-art, comprehensive overview of L2 segmental speech learning, we focus on research from the past ten to fifteen years (see the book-length treatment in Colantoni, Steele, & Escudero [2015] for a review of earlier research).
20.3 L2 Acquisition of Vowels
20.3.1 Cross-Linguistic Influence on L2 Vowels
First and foremost, CLI shapes L2 vowel perception in terms of categorization. Particularly at earlier stages of acquisition, L2 learners’ perception is influenced by the degree of L1–TL similarity, measured using phonological contrasts/features (e.g., Martinez, Goad, & Dow, 2023), articulatory gestures (e.g., Faris, Best, & Tyler, 2016, 2018), or spectral and temporal parameters (e.g., Alispahic, Mulak, & Escudero, 2017; Hu et al., 2016). For example, in research conducted within the Speech Learning Model (SLM; Flege [1995] inter alia), vowel similarity is assessed acoustically via L1–L2 formant comparison. L2 perception is predicted to be more accurate when the L1–L2 acoustics are either effectively identical (“same” sounds), so that the use of an existing L1 category leads to accurate perception, or sufficiently distinct (“new” sounds), leading to learners noticing differences and, with sufficient experience, establishing new target-like L2 categories. The most difficult cases involve highly similar but nonidentical L1 and TL sounds; here, following equivalence classification, learners may continue to use L1-based categories, resulting in nontarget-like perception (and, consequently, production). A growing body of work – particularly research conducted within Escudero’s Second Language Linguistic Perception (L2LP) model (Chapter 8, this volume) – argues that similarity is best assessed at the level of a learner’s particular dialect/idiolect and that both L1 and L2 dialectal differences have consequences for L2 speech perception (e.g., Escudero & Williams, 2012; Simon, Debaene, & Van Herreweghe, 2015).
Turning to production, CLI shapes learners’ vowels through mergers where multiple TL categories are realized via a single L2 vowel (e.g., Đào & Nguyễn, 2018; Georgiou & Themistocleous, 2021). Here, too, L1 dialect or even idiolect has been shown to shape production (e.g., Grantham O’Brien & Smith, 2010; Marinescu, 2013). For example, Kartushina and Frauenfelder (2014), in a study of L1 Spanish–L2 French vowels, found that individual differences in the compactness (i.e., degree of variability) of particular L1 vowels as well as of the entire L1 vowel space correlated with learners’ production accuracy of similar and new TL vowels. Other variables that may condition CLI include AOA and TL experience. Baker and Trofimovich’s (2005) study of English /i u æ ɛ ʊ ɪ/ as produced by early and late L1 Korean-L2 English learners (mean AOA of approximately nine versus twenty-two to twenty-five years, respectively) revealed that the early learners produced greater L1–L2 distinctions with the exception of highly similar vowels, and that the amount of TL experience correlated with accuracy; an effect of experience was not found for the late learners.
Another effect of CLI is its shaping of L2 cue weighting: at least at earlier stages, learners are more attentive to cues weighted more heavily in their L1 (e.g., Escudero, Benders, & Lipski, 2009; Yazawa et al., 2020). In parallel to native speakers, within-group variation in cue weighting occurs. For example, Kim, Clayards, and Goad (2017, 2018) found sub-groups of Korean-speaking learners who relied relatively more on either duration or spectral properties in the perception of the English /ɪ–i/ and /ɛ–æ/ contrasts.
Finally, CLI shapes not only phonological categories but also allophones. For example, studies on L2 English vowel reduction demonstrate the challenges learners face when acquiring the full–reduced vowel alternation. English unstressed vowels are realized as shorter and more centralized than their stressed counterparts. Due to CLI, L2 learners of English whose L1 lacks allophonic vowel reduction often struggle to reduce vowels (sufficiently) (e.g., Byers & Yavas, 2017; Rallo Fabra, 2015), although the alternation can be acquired with sufficient experience (Han, Hwang, & Choi, 2011). English-speaking learners face a parallel problem of suppressing nontarget-like vowel reduction when acquiring languages lacking this allophonic property (e.g., Colantoni, Mascaro, et al., 2015; Menke & Face, 2010).
20.3.2 Developmental Phenomena in the L2 Acquisition of Vowels
Developmental patterns have been observed for L2 vowel perception and production. For example, Stewart Morrison (2008), following Escudero (2000), proposes four main stages in Spanish-speaking learners’ perceptual acquisition of the English /i–ɪ/ contrast: an initial inability to distinguish the two vowels (Stage 0) followed by the use of duration (Stage 1) then a combined use of duration and spectral properties (Stage 2) and, finally, English-like primary use of spectral properties (Stage 3). In production, Munro and Derwing (2008) observed developmental patterns for beginner Mandarin- and Slavic-speaking learners’ realization of the English vowels /i ɪ e ɛ æ u ʊ o ɑ ʌ/. Based on the overall intelligibility of these learners’ productions at two-month intervals over a one-year period,Footnote 6 these researchers observed initial rapid acquisition during the first six months followed by a learning plateau. This general pattern did not hold for all vowels: /i/ and /o/, having L1 phonemic equivalents, were highly intelligible from learning onset, whereas with /ɪ/ and /ʊ/, vowels lacking an L1 equivalent, increased intelligibility occurred relatively later or not at all.
Several factors have been proposed to condition the relative difficulty of L2 vowel acquisition. In perception, having a relatively smaller L1 inventory than the TL may lead to greater L2 perceptual difficulty, as multiple TL vowels may be categorized as a single L1 vowel (e.g., Bundgaard-Nielsen, Best, & Tyler, 2011; Kivistö-de Souza et al., 2020). Alispahic et al. (2017) nuance this claim, suggesting that, while relative L1–TL inventory size is indeed relevant, acoustic similarity plays a greater role. The presence of the L2 contrast or contrastive feature in the L1 has also been argued to facilitate acquisition (Martinez et al., 2023). Finally, as mentioned earlier, researchers have proposed that learners may more easily make use of duration than spectral properties when acquiring new contrasts (e.g., Cebrian, 2006).
20.3.3 Inter- and Intralearner Variability in the L2 Perception and Production of Vowels
Variability in L2 vowel learning is conditioned by a range of variables. Here, as elsewhere, some caution is necessary when making strong statements about the effects of particular variables, which may be mitigated by various factors; indeed, the interaction of multiple variables often best explains inter- and intralearner variability.
Among regularly cited linguistic variables conditioning L2 vowel perception and production, researchers have noted the effects of neighboring consonants (e.g., Baker & Trofimovich, 2008; Levy & Law, 2010; Levy & Strange, 2008; Tsukada, 2009) and stress (e.g., Cobb & Simonet, 2015; Romanelli, Menegotto, & Smyth, 2018). Orthography may also shape L2 vowel learning: orthographic input may trigger greater L1-based CLI (e.g., Shea, 2021); learners may be relatively more/less accurate with vowel contrasts when such contrasts are represented explicitly in the orthography (e.g., Escudero & Wanrooij, 2010; Nimz & Khattab, 2020), although this effect may be mitigated for contrasts that are very difficult to discriminate (Escudero, 2015); and spelling-based nontarget-like pronunciations occur (e.g., Sokolović-Perović, Bassetti, & Dillon, 2019). Lexical effects are also observed during L2 vowel learning, including greater perceptual accuracy and intelligibility with more frequent/familiar words (e.g., Inceoglu, 2021; Munro & Derwing, 2008; Thomson & Isaacs, 2009). Moreover, research within the vocabulary-tuning model (e.g., Bundgaard-Nielsen et al., 2011, 2012; Georgiou, Perfilieva, & Tenizi, 2020) has revealed a positive correlation between L2 vocabulary size and consistency of vowel categorization/perceptual accuracy with certain contrasts.
Task characteristics including topic and interlocutor may also condition variability. For example, Gnevsheva (2015) found that her L1 German- and Korean-speaking learners’ English vowels were more native-like in an authentic service encounter with native speakers than when discussing their studies or family with a near-native speaker. Solon et al. (2017) found their L1 English-L2 Spanish learners to be more target-like in the fronting of /e/ with greater task complexity.
Two of the most frequently studied learner variables are AOA and TL experience. All other things being equal, earlier learners often have an advantage in both perception and production (e.g., Byers & Yavas, 2017; Flege, Shirru, & Mackay, 2003; but see, e.g., Munro and Derwing [2008] for no effect). Similarly, greater TL experience, sometimes measured by length of residence (LOR; i.e., the amount of time spent in a majority TL-speaking environment), has been found to have positive effects (e.g., Levy & Law, 2010; Kartushina & Martin, 2019; but see, e.g., Cebrian [2006] and Simon and D’Hulster [2012] for no effects). The inconsistent effects of AOA and TL experience/LOR may be due to these factors serving as a proxy for or being in an indirect relationship with the input available to learners; if so, the contribution of differences in input quality and quantity to variability may be underestimated (Flege, 2009). A more consistent finding is the positive correlation between accuracy and TL proficiency (e.g., Cobb & Simonet, 2015; Evans & Alshangiti, 2018; Menke & Face, 2010; Moorman, 2017).
Alongside AOA and TL experience and proficiency, research has demonstrated a positive role for cognitive IDs including phonological short-term memory (Aliaga-García, Mora, & Cerviño-Povedano, 2011; Inceoglu, 2016), discrimination sensitivity (Díaz et al., 2008; Díaz et al., 2016), domain-general auditory processing (Kachlicka, Saito, & Tierney, 2019), preattentive speech perception abilities (Jakoby, Goldstein, & Faust, 2011), and musical hearing (Jekiel & Malarski, 2021). Moorman (2017) highlights the potential interaction between such IDs and TL proficiency.
20.3.4 Effects of Training on the L2 Perception and Production of Vowels
A variety of perceptual (e.g., distributional training: Wanrooij, Escudero, & Raijmakers, 2013) and production methods (e.g., articulatory training: Suemitsu et al., 2015) are used, the most common for L2 vowels arguably being high variability phonetic training (HVPT),Footnote 7 which typically exposes learners to stimuli produced by multiple talkers (e.g., Georgiou, 2021; Iverson, Pinet, & Evans, 2011; Rato, 2014) with stimuli that may involve acoustic exaggeration of target features/contrasts (e.g., Cheng et al., 2019). Training may shape both learners’ categorization (e.g., Thomson, 2012) and their use of acoustic cues (e.g., Hu et al., 2016; Ylinen et al., 2010). Perceptual training may also benefit production (e.g., Lengeris, 2018; Okuno & Hardison, 2016) and production training may likewise improve perception (e.g., Kartushina et al., 2015; Linebaugh & Roche, 2015); there is, however, no necessary one-to-one relationship between perception and production in training (e.g., Alshangiti & Evans, 2014; Ghaffarvand Mokari & Werner, 2018).
Training outcomes are conditioned by multiple factors including learner proficiency (Alshangiti & Evans, 2014), degree of TL immersion at the time of training (Alshangiti & Evans, 2015), qualities of the stimuli including a focus on a subset versus the complete vocalic inventory (Lerdpaisalwong, 2015), presence of orthographic and/or visual input (e.g., Inceoglu, 2019; Okuno & Hardison, 2016; Solier et al., 2019), availability and type of corrective feedback (Lee & Lyster, 2016, 2017), and even time of day of training (Sayako Earle & Meyers, 2015). Individual cognitive differences that have been found to be in positive correlation with gains include learners’ phonological short-term and tonal memory (Aliaga-García et al., 2011; Ghaffarvand Mokari & Werner, 2018, respectively), inhibitory control (Ghaffarvand Mokari & Werner, 2019), auditory frequency discrimination acuity (Lengeris & Hazan, 2010), and phonological awareness (J. J. Li et al., 2019).
20.4 L2 Acquisition of Consonants
20.4.1 Cross-Linguistic Influence on L2 Consonants
As with vowels, L2 learners’ perception is shaped by their L1 phonemic (e.g., Iverson et al., 2008; Park & de Jong, 2008) and allophonic categories (e.g., Eckman & Iverson, 2013). Similarity between the L1 and the TL plays a central role in categorization, with greater difficulty occurring with highly similar consonants (e.g., Yang, Chen, & Xiao, 2022; Yang & Lu, 2019) as well as with consonants involving a new (secondary) feature or place of articulation (e.g., palatalization in Russian: Hacking et al., 2016; Arabic emphatics and gutturals: Eads, Khater, & Mielke, 2018; but see Wagner and Baker-Smemoe [2013] for a counterexample involving English-speaking learners’ production of Q’eqchi’ Mayan ejectives). In production, at least at lower proficiency levels, consonants are often realized using L1-based phonetic categories (e.g., Barlow, 2014; Iverson et al., 2008) and allophonic rules. As concerns the latter, for example, learners with L1 syllable-final devoicing may devoice L2 coda obstruents even when they realize TL voicing contrasts in other positions in the syllable/word (e.g., Ghosh et al., 2016; Smith et al., 2009). Finally, L1-based influence on production may exist both at the level of individual consonants and more systematically due to L1-based articulatory settings (Święciński, 2013; Wilson & Kanada, 2014). Święciński’s (2013) electropalatography study of high intermediate and advanced Polish-speaking learners found that learners with more native-like L2 pronunciation had adjusted their degree of overall tongue fronting and raising away from their L1 Polish toward that of TL English.
As concerns phonetic cues, CLI results in learners being more sensitive to, and able to realize more accurately, the cues involved in L1 consonantal contrasts (e.g., Brandmeyer, Desain, & McQueen, 2012; Chang, 2018; Choi et al., 2016; Saito & Munro, 2014), although there may be variability in cue use among learners of the same L1 (Holliday, 2015a; Schertz et al., 2015; Smith et al., 2009). While target-like cue-weighting may be acquired (Dmitrieva, 2019), learners may struggle to effectively redeploy an L1 cue for a new TL contrast, for example, redeploying duration, a cue to L1 coda obstruent voicing, as an L2 cue for geminate consonants (Broersma, 2010; Holliday, 2015b; Kkese & Petinou, 2017).
Finally, CLI results in position-sensitive asymmetries in L2 segmental speech learning. In many languages, a larger set of phonemic contrasts is permitted in syllable onsets than codas – the coda devoicing just discussed is one such example. Furthermore, L1-based onset-coda or other asymmetries may shape both L2 speech perception (e.g., Park & de Jong, 2017; but see De Jong and Hao [2018] for a counterexample) and production (see, e.g., Broersma, 2010; Holliday, 2015b; Kkese & Petinou, 2017). For example, Cheng and Zhang’s (2015) Mandarin-speaking learners, whose L1 has a limited number of coda contrasts, were overall more accurate in the perception and production of English onset than coda consonants.
20.4.2 Developmental Phenomena in the L2 Acquisition of Consonants
Various general developmental phenomena occurring during L2 consonant acquisition have been identified. De Jong, Silbert, and Park (2009) and De Jong and Hao (2018) argue that learners’ discrimination abilities develop systematically: segments sharing a feature are acquired in parallel across prosodic positions, with such generalization being stronger for manner than voicing, and shaped by L1 allophonic properties. Developmental sequences also exist for production (e.g., Eads et al., 2018; Shea & Curtin, 2011; Solon, 2017). For example, in a study of the “unlearning” of the English onset /l/–coda /ɫ/ alternation in L2 Spanish, Solon (2017) found learners to first lighten their laterals across the board then suppress coda velarization.
Turning to relative difficulty, certain features as well as phonetic cues that realize such features may be perceived and/or produced with greater ease. Various researchers have found that L2 learners, when acquiring coda obstruent voicing, realize vowel duration more readily than other cues, including preceding vowel formant properties or the duration or laryngeal voicing of the obstruent itself (e.g., Baker, 2010; Choi et al., 2016; Patience & Steele, 2023; Smith et al., 2009). Perceptual saliency has also been argued to condition relative difficulty with more salient features being acquired earlier/with greater accuracy (e.g., Baker, 2010; Colantoni & Steele, 2007; Yang & Chen, 2019). Articulatory difficulty is another factor in negative correlation with accuracy (Colantoni & Steele, 2007; Patience, 2022; Patience & Steele, 2023). Patience (2022), based on L1 English-L2 Spanish learners’ production of [m t β̞ ɣ̞ ɲ χ r], concludes that consonants involving posterior articulations or precise aerodynamic conditions are more difficult to acquire. Finally, L1 or L2 allophonic properties may increase learning difficulty: learners encounter particular challenges in mastering contextually appropriate TL variants (e.g., Face & Menke, 2009; Han & Kim, 2017; Nagle, 2017; Solon, 2017) and suppressing L1 allophonic rules absent in the TL (e.g., Colantoni, Kochetov, & Steele, 2023; Eckman & Iverson, 2013).
20.4.3 Inter- and Intralearner Variability in the L2 Perception and Production of Consonants
Consonant development may be highly variable during L2 acquisition. For example, Colantoni et al. (2021) tested the articulatory settings-based hypothesis of a common, systematic development of consonants sharing a major articulatory gesture. This hypothesis received little support: their French-speaking learners’ production of the English coronal series /t d s z n l/ involved many between-consonant differences, particularly for degree of contact.
Sources of inter- and intralearner variability involve both linguistic features and learner characteristics. In terms of linguistic variables, the most commonly noted is linguistic context, both the position of the consonant in the syllable or word (e.g., lesser accuracy in codas; Cheng & Zhang, 2015; Lai, 2012; Park & de Jong, 2017) and the neighboring sounds (e.g., Hardison & Motohashi Saigo, 2010; Rau, Chang, & Tarone, 2009). Lexical stress also conditions variability, having potentially both facilitative and hindering influences on accuracy (e.g., Face & Menke, 2009; Grover et al., 2021; Nagle, 2017; Park & de Jong, 2017).
Characteristics of the input are relevant, including the availability of visual cues (Hazan et al., 2006) as well as of orthography, including its contribution to the acquisition of allophonic variation (Barrios & Hayes-Harb, 2020; Han & Kim, 2017); spelling-based pronunciations (e.g., Bassetti, 2017; Sokolović-Perović et al., 2019); and the potential hindering influence of an unfamiliar script (Mathieu, 2016). Finally, learners’ consonantal perception and production may differ across tasks (e.g., Grover et al., 2021; Saito & Munro, 2014; Shea & Rannaud, 2013; Yang & Lu, 2019) and speech style (Rau et al., 2009). Typically, tasks and speech styles involving greater attention, lower complexity, and/or more controlled speech result in more target-like performance.
Turning to learner variables, the most widely studied are LOR, AOA, and TL experience and proficiency. As with vowels, the effects of LOR are mixed. While some authors have found a positive correlation of varying strengths with consonant perception or production (Ito, 2012; Saito & Munro, 2014; Smith et al., 2009), others have found none (Al-Kendi & Khattab, 2021; Grover et al., 2021). Al-Kendi and Khattab’s (2021) study of foreign domestic workers’ acquisition of Omani Arabic consonants and consonant clusters underlines potential limits on the predictive power of LOR. These authors argue that the lack of correlation between LOR (mean 6.2 years, range 0.7–21 years) and perception and production accuracy among their participants of various L1s resulted from learners’ lack of input and output opportunities due to their restricted integration into the host country and their limited conversational opportunities in terms of topics and interlocutors.
The effects of AOA and TL experience are also mixed, with numerous studies finding a positive correlation with perceptual and production accuracy (e.g., Baker, 2010; Holliday, 2015a; Ito, 2012; Lababidi, 2016; Stölten, Abrahamsson, & Hyltenstam, 2014, 2015), yet others finding little or no effect (e.g., Grover et al., 2021; Hacking et al., 2016; Han & Kim, 2017; Iverson et al., 2008) or even negative effects (Holliday, 2015b). This last study compared the perception of the Korean sibilant fricatives /s/ and /s*/ by naïve (i.e., with no learning experience) and beginner Mandarin-speaking learners. Holliday (2015b) hypothesizes that the latter group’s lesser perceptual accuracy in the /_a/ context was related to their perception being shaped by more than phonetic characteristics of the stimuli, including their growing knowledge of the phonological and orthographic properties of the Korean fricatives, information unavailable to naïve learners. This finding raises the question of the potential existence of proficiency thresholds that must be reached before experience can positively affect L2 speech learning. Returning to the general effects of experience, in keeping with Munro and Derwing’s (2008) finding of rapid initial development followed by learning plateaus, experience later on during learning may not be (as) useful (Grover et al., 2021).
Finally, when effects of TL proficiency exist, learners who are overall more proficient typically demonstrate greater phonetic/phonological competence (e.g., Evans & Alshangiti, 2018; Face & Menke, 2009; Hardison & Motohashi Saigo, 2010; Solon, 2017).
Among less-cited learner variables, one notes degree of literacy (Al-Kendi & Khattab, 2021) and ethnic group affiliation (Gatbonton, Trofimovich, & Segalowitz, 2011). As concerns the latter, Gatbonton et al. (2011) found a negative correlation between the strength of their L1 Quebec French speakers’ linguistic and political ethnolinguistic identity and English /ð/ production accuracy, a correlation mediated by the amount of TL use. Compared to vowels, the effects of relatively few cognitive abilities have been studied for consonants (mimicry abilities: Delvaux et al., 2014; inhibitory control: Darcy, Mora, & Daidone, 2016).
20.4.4 Effects of Training on the L2 Perception and Production of Consonants
Consonant training may target perception and/or production using a variety of techniques. For perception, these include high variability perceptual training (e.g., Reference Kar-man Shum, Au, Romo and JunKar-man Shum et al., 2021; Y. Reference Li, Best, Cao, Zhang, Calhoun, Escudero, Tabain and WarrenLi et al. 2019; Reference ShehataShehata, 2015; Reference Shinohara and IversonShinohara & Iverson, 2018, Reference Shinohara and Iverson2021), audiovisual training to enhance visual cue salience/sensitivity (Reference Hazan, Sennema, Iba and FaulknerHazan et al., 2005), and perceptual fading that diminishes the acoustic difference between categories over the course of training (Reference CibelliCibelli, 2018). For production, various feedback paradigms exist, both visual (use of waveforms and spectrograms: Reference OlsonOlson, 2019) and articulatory, using ultrasound (Reference Lin, Cychosz, Shen., Cibelli, Calhoun, Escudero, Tabain and WarrenLin et al., 2019; Reference Roon, Kang and WhalenRoon, Kang, & Whalen, 2020; Reference Tateishi, Winters and LuoTateishi & Winters, 2013), electropalatography (Reference Hacking, Smith and JohnsonHacking, Smith, & Johnson, 2017; Reference SchmidtSchmidt, 2012), or an intraoral tactile feedback device (Reference Bu, Nagano, Harel and McAllisterBu et al., 2021). Training may also include metalinguistic instruction (Reference CibelliCibelli, 2018; Reference OlsonOlson, 2019).
Positive gains following training are regularly observed, including with novice learners (Cibelli, 2018; Roon et al., 2020), although gains may not be (fully) retained beyond the short term (Bu et al., 2021; Cibelli, 2018; but see the former authors’ explanations for lack of retention). Learning may generalize to novel consonants, words, and/or talkers as well as to more spontaneous speech (e.g., Cebrian & Carlet, 2014; Hacking et al., 2017; Huensch & Tremblay, 2015; Y. Li et al., 2019; Olson, 2019). Numerous studies have demonstrated the contributions of variability during training, in terms of both multiple talkers (Kar-man Shum et al., 2021; Shehata, 2015; Shinohara & Iverson, 2018, 2021) and the stimulus set (via a synthesized continuum along which the main phonetic cue varies: Y. Li et al., 2019; or high-variability motor practice – practicing the target segment in single versus multiple syllable or lexical contexts: Bu et al., 2021). Beneficial effects of perception training on production and vice versa may occur (Huensch & Tremblay, 2015; Lopez-Soto & Kewley-Port, 2009; Roon et al., 2020; Schmidt, 2012; Shinohara & Iverson, 2018, 2021).
Finally, the effects of training may be conditioned by variables including position of the consonant in the word (onset-coda asymmetries: Kar-man Shum et al., 2021; Lerdpaisalwong, 2015; Roon et al., 2020), whether the focus is on a subset versus the complete TL consonantal inventory (Lerdpaisalwong, 2015; Roon et al., 2020), stimuli lexical frequency (Cebrian & Carlet, 2014), task (Haslam, 2011; Lerdpaisalwong, 2015; but no effect in Shinohara & Iverson, 2018), the contrasts’ visual distinctiveness (Hazan et al., 2005), AOA (Shinohara & Iverson, 2021), and individual differences (attentional switching: Kim & Hazan, 2010).
20.5 Conclusion
This chapter has presented a review of more recent work investigating the nature of CLI, development, inter- and intralearner variability, and training effects on typical L2 segmental speech learning. Many parallels between the L2 perception and production of vowels and consonants exist, including the ongoing influence of L1 categories and parsing routines, phenomena being more/less readily acquired as shaped by variables including L1–TL similarity, the effects of lexical (e.g., context, stress, frequency) and learner variables (AOA, LOR, TL proficiency, memory), and the overwhelmingly positive effects of training.
In terms of future directions, while outstanding questions exist for all four research themes highlighted here, arguably, our understanding of development is the least advanced. For example, while we possess a rather sound understanding of L1-based CLI’s effects on both speech perception and production, including the mitigating effects of AOA, LOR, and TL experience, less well understood are changes in L2 learners’ perception and production over time. This is undoubtedly related to the practical challenges of conducting large-sized, cross-sectional studies with learners of a wide range of proficiencies or, even more so, longitudinal studies. However, given the growing availability of methods for efficient, rapid data processing as well as of online data collection and L2 speech corpora, these challenges are quickly becoming less onerous. Finally, expanding the variety of L1–TL pairings investigated is of interest, especially in terms of increasing the number of studies on TLs other than English in order to have the richest data set from which to further our understanding of L2 segmental speech learning among this particular group of bilinguals.
21.1 Introduction
This chapter reviews the last twenty-five years of L2 (second language) prosody research in three sections – word stress (Section 21.2), sentence intonation (Section 21.3), and rhythm (Section 21.4) – and presents findings in relation to two underlying themes: form-meaning mapping and bilingual contexts. While my selection of section topics is somewhat arbitrary and nonexhaustive by necessity, given the large scope of L2 prosody, I have chosen those two underlying themes because they illustrate how L2 prosody research has addressed issues of current concern. Second language research is conditioned by the fact that English has become the lingua franca par excellence, a status that represents something of a two-edged sword (Soler & Morales-Galvez, 2022). While on the one hand it facilitates transnational communication, on the other it is linked to language subordination ideologies (Lippi-Green, 2012). Consequently, a researcher’s approach to L2 prosody will inevitably favor a particular ideology. For example, focusing L2 English research only on educated US or British varieties ignores the fact that English is native to millions of speakers in South Asia and thus indirectly reinforces a colonial frame of mind (Kachru, 1992; Soler & Morales-Galvez, 2022). As a result, L2 researchers (e.g., Anya, 2021; Levis, 2020; Rosa & Flores, 2017) are increasingly adopting approaches that incorporate social justice and equity into their work, including research in L2 prosody (Armstrong et al., 2022).
As we will see, social factors interact with linguistic and cognitive factors in the assigning of meaning to sentence intonation patterns. A brief example illustrates this point. Sociolinguist John J. Gumperz (1997) noted that newly hired South Asian women working at a cafeteria in a major British airport were perceived as rude by British customers because when they offered “Gravy?” they used a falling intonation instead of the rising one British service staff would have used. This cross-cultural misinterpretation of sentence intonation that is perfectly polite in Indian English emerged in a specific language contact context with clear power differentials between interlocutors in terms of gender, profession, and national origin. If the context were instead a university cafeteria in India and a group of local students invited a British exchange student to take a seat at their table using precisely the same falling intonation, the young visitor would be unlikely to take offense. Although mapping meaning to an intonation form is unarguably linked to linguistic and cognitive factors, it cannot be analyzed separately from the context of language contact in which it takes place as well as the analyst’s own sensitivity to social patterns and individual attitudes.
I therefore endeavor in this chapter to apply each topic covered to contexts of both additive bilingualism, where the L2 is added to the L1 and thus the learner’s total linguistic capital, and subtractive bilingualism, where L2 acquisition takes place at the expense of the L1 and thus makes no net contribution to the learner’s total linguistic capital. In additive bilingualism, the L2 is acquired in an educational setting and within a society that acknowledges and values the use of the learner’s L1 and regards acquisition of that particular L2 as enhancing the learner’s status. In other words, this bilingualism occurs in a context of privilege. This sort of bilingualism is exemplified by the acquisition of English by the educated middle class in much of Latin America, where children learn their first language, Spanish, at home and become literate in that language at school before starting to learn their L2, English. Subtractive bilingualism, on the other hand, occurs when the learner acquires an L2, which is attributed higher prestige by the broader community, at the expense of their L1. Native speakers of the indigenous languages of Latin America, for example, know that mastery of the registers of the Spanish they learn in school is more likely to confer social advantage than an enriched command of their L1, regardless of the prestige ascribed to the L1 within their home community (Flores & García, 2020).
Our discussion of L2 prosody will also require the use of specialized terminology. First of all, an utterance can be broken down into a hierarchy of prosodic units, the above-word level, the word level, and the below-word level, as illustrated in Figure 21.1 for the Catalan proverb Com més serem, més riurem (“The more [people] we are, the more we will laugh” [i.e., the more, the merrier]). At the word level, languages give perceptual prominence to a syllable or mora by enlarging its duration and intensity, like stress in English and Spanish, or by varying its pitch (i.e., frequency), like a pitch accent in Tokyo Japanese (Figure 21.2). Stress, pitch accents, and tone allow languages to distinguish between two otherwise identical forms, known as word-minimal pairs. For example, stress on the first syllable distinguishes the English noun “object” from the second-syllable-stressed verb “object.” Similarly, Japanese pitch accented hashi (“chopsticks”) contrasts with unaccented hashi (“edge”), while in Chinese ma uttered with tone 1 means “mother” but with tone 2 it means “numb.” Languages with fixed stress, however, lose this contrastive function: the fixed word-initial stress of Hungarian, for example, demarcates word beginnings but cannot generate word-minimal pairs.

Figure 21.1 The prosodic hierarchy. Prosodic domains are hierarchically organized. As a stress-accent language, the melodic tier in Catalan attaches to stressed vowels and the right edge of intonational phrases.

Figure 21.2 Acoustic cues to English stress (a) and Japanese pitch accent (b) in the cognate “America.” Pitch and duration cue stress in English, while only pitch – the ascending F0 on “e” – cues the Japanese pitch accent.
Domains at the above-word level relate to sentence intonation. The right and/or left edges of these domains are marked with boundary tones, such as the H% and L% in the “melodic tier” in Figure 21.1. They also contain pitch movements related to lexical events such as stress, pitch accent, and tone. Languages with above-word domains containing both boundary tones and pitch movements related to lexical prominence are called head-languages. By contrast, in what are known as edge-languages these domains have only boundary tones. Within head-languages, the pitch contour of above-word domains may be either fixed, as in Tokyo Japanese, where a hat shape repeats on each accentual phrase, or variable, as in stress-accent languages like English, where the stressed syllable’s pitch shape is contingent on the particular meaning that sentence intonation is intended to convey.
Below-word domains are related to rhythm. Rhythm is the sensation of isochrony (the division of time into equal portions) conveyed by the repetition of speech units like syllables, morae, or lexical stress. This perceptual isochrony has been thoroughly studied in relation to duration correlates, but only recently have intensity and pitch been investigated and related to word and larger domains.
The remainder of the chapter is organized as follows. Section 21.2 on L2 stress presents pioneering research (Section 21.2.1) which was later extended to other forms of word prominence such as tone and pitch accents, and to other edge-languages besides French mainly in additive bilingualism contexts (Section 21.2.2). Relevant findings in contexts of subtractive bilingualism are summarized in Section 21.2.3. Section 21.3 reviews L2 sentence intonation. It introduces the tools most widely used to analyze sentence intonation and the many sources of variation they capture (Section 21.3.1). Two sources of this variation are discussed, namely typological diversity in expressing focus meaning (Section 21.3.2) and accommodation patterns in situations of subtractive bilingualism (Section 21.3.3). Section 21.4 reviews rhythm, first defining the concept (Section 21.4.1) and then summarizing the methods by which it can be measured (Section 21.4.2). Research findings from contexts of additive (Section 21.4.3) and subtractive language contact (Section 21.4.4) are then presented. Conclusions and directions for future research (Section 21.5) end the chapter.
21.2 L2 Stress
21.2.1 Early Research
Early research into L2 stress was guided by two unexpected findings. First, it was found that native English speakers serving as controls were outperformed by Dutch speakers of L2 English in a stress-matching task (Cooper, Cutler, & Wales, 2002). After listening to the first syllable of English words such as “music” and “museum” – where lexical stress is unusually expressed by only suprasegmental cues (i.e., duration, pitch, and intensity) and not by vowel reduction – participants were asked to match stressed [mju] to [ʹmjusik] and unstressed [mju] to [mjuʹseum]. The Dutch-speaking participants were faster and more accurate at doing this than the native speakers. This unexpected result was framed as a case of L1-to-L2 transfer of suprasegmental cues (Cutler, 2005, 2012; Cutler & Jesse, 2021 and references therein). In general, vowel reduction is a more reliable segmental cue to lexical stress in English than it is in Dutch. Consequently, English speakers rely mostly on vowel reduction to perceive stress in English whereas Dutch speakers rely on both vowel reduction and suprasegmental cues to process Dutch stress. When listening to English stress, the Dutch speakers transferred their L1 reliance on suprasegmental cues, resulting in a superior performance in that specific context. More recently, Bruggeman and colleagues have explored whether the amount of L2 input modulated this initial L1-to-L2 cue transfer (Bruggeman, 2016; Bruggeman, Yu, & Cutler, 2022). Using the same experimental design and materials, they tested L1 Dutch speakers who had been living in Anglophone Canada for at least thirty years. For these speakers, vowel reduction cues outweighed suprasegmental cues, matching native English speakers’ performance. Thus, while L1 cue transfer into L2 explained L2 stress at the initial stages of L2 acquisition, at later stages and given enough L2 input, this initial transfer was readjusted to optimize efficiency in L2 processing.
In the second study yielding unexpected results, advanced-level French speakers of Spanish – despite their high proficiency – experienced persistent “stress deafness” in tasks requiring abstraction such as sequence recall and lexical decision tasks. However, they did perceive stress in tasks involving phonetic perception with minimal abstraction, such as AX (same/different) discrimination tasks (Dupoux et al., 1997; Dupoux, Peperkamp, & Sebastián-Gallés, 2001, 2010; but see Llisterri and Schwab [2019] and references therein for differing interpretations). Similar results were obtained from speakers of languages with fixed stress such as Polish and Hungarian (Peperkamp, Vendelin, & Dupoux, 2010). These speakers’ stress deafness was interpreted as a processing deficit linked to their inability to represent contrastive stress in their L2 Spanish lexicon. Because these speakers matched form (that is, suprasegmental cues) to a demarcative meaning, such as boundary tones in accentual phrases in French and fixed stress in Hungarian, learning to perceive L2 Spanish stress implied remapping suprasegmental cues from their L1 demarcative meaning to the contrastive meanings conveyed by stress in Spanish.
In summary, the persistent stress deafness experienced by French speakers of Spanish was related to the lack of stress representation in their L2 lexicon, while Dutch speakers’ outstanding perception of L2 stress was framed as an initial L1 cue transfer with later cue reweighting to match the target language. These different approaches came from the comparison on the one hand of two prosodically similar languages, Dutch and English – both head-languages with contrastive stress – and on the other of two prosodically different languages, French, which has no word prominence, and Spanish, a head-language with contrastive stress. Recent research continues to disentangle the effects of L1–L2 prosodic distance on L2 outcomes.
21.2.2 Recent Research
21.2.2.1 Adding Tone and Pitch Accents to L2 Stress
Recent research has contributed to the discussion in Section 21.2.1 by augmenting the number of L1–L2 language pairs and methodologies examined and has revolved around two questions. The first concerns the degree to which speakers of a head-language learning a second head-language with a different type of word prominence – be it contrastive stress, tone, or pitch accent – transfer the prominence type of their L1 to their L2. For example, Wiener and Goss (2018, 2019) examined pitch accent perception in L2 Tokyo Japanese by native English (stress) and Mandarin (tone and stress) speakers. Mandarin speakers with no prior knowledge of Japanese outperformed English speakers with an intermediate level of Japanese, and English speakers who had previously studied Mandarin outperformed English speakers who had not, showing transfer from L1 stress and tone to L2 pitch accents. Moreover, prior experience with more than one word-prominence type (e.g., stress and tone) facilitated the perception of L2 pitch accents to a larger extent than prior experience with a single prominence type (e.g., stress only).
In relation to acoustic cues to word prominence, Qin, Chien, and Tremblay (2017) showed that only cues that were already used in the learners’ L1 were transferred to the L2. They compared the perception of English stress by speakers of Standard Mandarin (tone and stress) and speakers of Taiwan Mandarin (tone and no stress) in natural and synthesized speech. Their results showed that, with natural speech, the use of tone in their L1s helped speakers from both groups to perceive stress in English. However, when exposed to the synthesized stimuli, Standard Mandarin speakers used both pitch and durational cues while Taiwan Mandarin speakers relied only on pitch, indicating that speakers’ perception of English stress was based on the cues that signaled word prominence in their own L1s. A more recent study by Choi (2022) showed that functional load also modulated L1-to-L2 cue transfer. Thus, while Cantonese learners of English attended in general to pitch cues to English stress, confirming L1 transfer, they switched to vowel-quality cues in contexts where vowel quality was more informative than pitch.
The second question of interest in research on the acquisition of L2 stress concerns whether a native representation of word prominence is required to learn word prominence in an L2. This question has been explored by comparing the performance of native speakers of head- and edge-languages as they learned an L2 that was a head-language. Most studies have examined the perception of L2 English stress by speakers of different L1s, yielding consistent results (e.g., Choi, Tong, & Samuel, 2019; Kim & Tremblay, 2021; Lin et al., 2014). Native speakers of head-languages outperform those of edge-languages regardless of the type of word prominence characterizing their L1. For example, native speakers of Mandarin (a tone language with stress) outperformed native speakers of Seoul Korean (an edge-language) in perceiving English stress in sequence recall and lexical decision tasks (Lin et al., 2014). Similarly, speakers of Gyeongsang Korean (a pitch accent language) outperformed Seoul Korean speakers (Kim & Tremblay, 2021).
21.2.2.2 Adding Pitch Shape and Post-Lexical Processing to L2 Stress
As noted in Section 21.2.1, early research related the persistent stress deafness experienced by French speakers of Spanish to the lack of stress representation in their L1 lexicon. By contrast, native speakers of head-languages with unpredictable stress, such as English or Spanish, who represent stress in their L1 lexicons, were assumed to perceive L2 stress accurately across the board. Recent research has qualified this assumption by showing gradient differences in stress perception depending on the L1 of the listener. Thus, Ortín and Simonet (2022, 2023) showed that, at the pre-lexical level, Spanish speakers were more accurate than English speakers at perceiving stress in nonwords (e.g., “númi” versus “numí”) in sequence recall tasks. By the same token, at the post-lexical level, Ortega-Llebaria, Gu, and Fan (2013) found that Spanish speakers perceived stress more accurately than English speakers in tasks involving Spanish target words with controlled cues embedded in sentences with different intonations. The authors concluded that the English speakers’ reliance on pitch cues explained their perception of paroxytone words as oxytones in declarative sentences (e.g., [ʹmama] perceived as [maʹma]), while they failed to perceive stress altogether in flat-F0 parenthetic clauses. Taken as a whole, these results show that, within head-languages, L2 stress perception can be gradient at different levels (i.e., pre-lexical versus post-lexical) of processing.
Many studies have examined stress perception in words with an F0-peak on the stressed syllable (i.e., H*). Nevertheless, the pitch shape of stress is variable and contingent on sentence intonation requirements, so that, for example, the stressed syllable “ma” bears an F0-peak in “Mary!” but an F0-valley in “Mary?,” while the pitch shapes of tones and pitch accents are fixed in the lexicon, as in Mandarin ma1 “mother” versus ma2 “numb.” Consequently, native speakers of tonal languages tend to overlook variation in the pitch shape of stress in an L2. Recent research has revealed such effects in English word recognition by bilingual L1 Mandarin-L2 English speakers (Ortega-Llebaria, Nemogá, & Presson, 2017; Shook & Marian, 2016; Wang, Wang, & Malins, 2017). While both monolingual English controls and Mandarin-English bilinguals processed pitch shape at the pre-lexical level to normalize, for example, for gender differences, only Mandarin-English bilinguals continued processing pitch shape at the lexical level. Crucially, these speakers organized both the Mandarin and the English words in their bilingual lexicon with regard to pitch shape as if English were a one-tone language (Ortega-Llebaria & Wu, 2021; Wang, 2021). Nonetheless, future L2 research needs to further compare the effects of pitch shape on the perception of L2 word prominence in speakers of stress languages and speakers of tone languages.
21.2.3 L2 Stress in Subtractive Bilingualism: The Case of Heritage Spanish Speakers
Since the patterns described so far have been mostly documented in contexts of additive bilingualism, there is always the risk that any conclusions to be drawn do not reflect what occurs in the context of subtractive bilingualism. The body of research on the perception of stress by heritage speakers of Spanish sheds considerable light on this issue. The well-documented difficulties that heritage speakers encounter with Spanish stress are not orthographic but rather related to locating stress on the correct syllable of a word (e.g., Parera & Lynch, 2021). Importantly, Kim (2020) showed an incongruent pattern between the production and the perception of Spanish stress by heritage speakers that was not present in either native Spanish speakers or English speakers of L2 Spanish. That is, while native Spanish speakers displayed ceiling accuracy in both perception and production tasks, English speakers of Spanish were equally inaccurate in both skill areas. By contrast, heritage speakers displayed high accuracy in perception but low accuracy in production. This incongruence was related to the context in which subtractive bilingualism occurs: heritage speakers inhabit an environment that favors the development of a perceptual representation of stress (i.e., continual receptive exposure to Spanish at home) but offers insufficient contexts of use to develop stable production patterns (i.e., at school English is the vehicular language). Moreover, Ortín (2022) showed that heritage speakers’ perception of Spanish stress was heavily modulated by language dominance and their knowledge of Spanish. The more heritage speakers used Spanish and the more proficient they became, the closer their performance was to that of native Spanish speakers. These results provided supporting evidence for earlier research (Rao, 2019; Saalfeld, 2012) arguing that the knowledge of Spanish acquired by heritage speakers through an early exposure to Spanish at home can be lost if this use is interrupted early in development.
21.2.4 Summary
Difficulties of L2 stress perception were initially framed either as an L1-to-L2 cue transfer problem or as a processing deficit linked to learners’ inability to represent contrastive stress in their lexicons. Recent research has established the extent and limits of those initial frames by examining a larger number of L1–L2 language pairs. Native speakers of head-languages tend to experience fewer difficulties than native speakers of edge-languages in perceiving L2 stress, pitch accents, and tone, suggesting that L1 representations of word prominence help them to learn any word-prominence type in an L2. However, the gradient perception displayed by native speakers of different head-languages suggests that an L1 word-prominence representation is a necessary – albeit insufficient – condition for the accurate perception of word prominence in an L2. In head-languages, only the cues to word prominence used in the L1 are transferred to an L2, thus modulating L2 perception. While these patterns are reported to occur in contexts of additive bilingualism, in subtractive bilingualism contexts an asymmetry has been found between the perception and the production of stress that is not present in additive contexts. This serves as a reminder that a comprehensive unbiased understanding of the acquisition of L2 prosody cannot be achieved unless it incorporates findings based on subtractive bilingualism.
21.3 L2 Sentence Intonation
21.3.1 Sources of Variation
Sentence intonation has become a fertile ground for systematic study thanks to theories like autosegmental phonology (e.g., Ladd, 2008) and transcription systems like Tones and Break Indices (ToBI; Beckman, Hirschberg, & Shattuck-Hufnagel, 2006 and references therein). Following the line taken by autosegmental phonology (Goldsmith, 1976), segments and tones are defined as autosegments generated in different tiers, with phones in the segmental tier and tones in the melodic tier, which align with each other in accordance with general principles. While the complete set of possible tones and phonemes is universal, each language uses a particular subset of them. These principles were applied to study sentence intonation first in English and Japanese (Beckman et al., 2006) but soon across many other languages, generating language-specific prosodic grammars (e.g., Spanish ToBI: Beckman et al., 2002; Mandarin ToBI: Peng et al., 2005), which were further adapted to express dialectal variation in both L1 (Clopper & Smiljanic, 2011; Cruz, 2013; Gili Fivela et al., 2015; Prieto & Roseano, 2010) and L2 varieties (see, e.g., Mennen’s [2015] L2 Intonation Learning theory or LILt). At the same time, these variety-specific prosodic grammars have advanced our understanding of prosodic typology by showing how varieties can be grouped according to their intonational structure (Jun, 2014). Recently, ToBI has also been used to analyze pathological speech (e.g., Frota et al., 2021). Simultaneously, the growing field of sociophonetics has begun to incorporate prosody into research on social meaning (Holliday & Villarreal, 2020). In short, over the past few decades intonational theories and transcription systems have proven to be of great utility in exploring the different sources of variation that affect sentence intonation. In this section, two such sources of variation will be reviewed, L1–L2 typological distance with regard to the expression of narrow focus (Section 21.3.2), and the social aspects of intonation (Section 21.3.3).
21.3.2 Prosody-Based Generalizations About the Expression of Focus
Typologically distant languages tend to use different linguistic strategies to express equivalent meanings, as is the case, for example, with English and Spanish in their expression of narrow focus (e.g., Zubizarreta, 1998). This is illustrated in short dialogues (21.1)–(21.3), where “Mary/María,” “bone/hueso,” and “renting/alquilando” constitute the information requested in each question, respectively. The narrow focus applied to these words is expressed prosodically by marking the word as the most prominent in the sentence, that is, it bears the nuclear stress (NS) of the utterance (indicated with capital letters in the examples). In English, NS is assigned by a nuclear stress rule (NSR) which mandates not only the assignment of prominence to the word in focus (e.g., MARY) but also the de-accenting of post-focal words (e.g., “came” in [21.1]). Phonetically, prominence on “Mary,” “bone,” and “renting” is signaled by enlarging the pitch range, duration, and intensity of the stressed syllable. Post-focal clause de-accenting is expressed by compressing pitch range, duration, and intensity (Cole et al., 2019; Ladd, 2008). By contrast, according to the NSR applied to Spanish, the word with prominence must be positioned at the end of the intonation phrase (e.g., la está ALQUILANDO in [21.3]). Consequently, the phonetic strategy of enlarging the suprasegmental cues of the in-focus word is relevant only to the English NSR, because prominence under the Spanish NSR is expressed exclusively by sentence position.
(21.1) English: A: Who came yesterday? B: MARY did.
Spanish: A: ¿Quién vino ayer? B: Vino MARÍA.
(21.2) English: A: What’s the dog chewing? B: It’s chewing a BONE.
Spanish: A: ¿Qué está mordiendo el perro? B: Está mordiendo un HUESO.
(21.3) English: A: Did John buy that green house? B: No, John is RENTING it.
Spanish: A: ¿Se compró Juan la casa verde? B: No, la está ALQUILANDO.
Research on the acquisition of focus in L2 English has described, on the one hand, the difficulty consistently experienced by Spanish speakers as they unlearn the Spanish NSR to acquire the English NSR (Ortega-Llebaria & Colantoni, 2014; Zubizarreta & Nava, 2011). This difficulty manifests itself in the following patterns. First, Spanish speakers start out by applying phonetic enhancement at the end of L2 English narrow focus statements, but then, as their proficiency increases, they begin to move it to other sentence positions, suggesting an awareness that they must apply a different NSR. However, after acquiring the English NSR to express focus, they consistently fail to generalize this strategy to thetic and categorical meanings. Finally, they have a great deal of difficulty learning to de-accent the post-focal clause, so, instead of de-accenting, they use repair strategies such as the insertion of a pause after the subject focus (e.g., MARY || came).
On the other hand, studies of L2 English learners have shown that speakers of Beijing Mandarin – in whose L1 the same NSR applies as in English – consistently obtain more native-like scores than Spanish speakers in all perception and production tasks (Ortega-Llebaria & Colantoni, 2014; Xu, Chen, & Wang, 2012). That said, it should be noted that although all Mandarin Chinese dialects use the same NSR to express narrow focus, these dialects differ in their use of post-focal compression, such that the Mandarin spoken in Beijing has post-focal compression while the Mandarin spoken in Taiwan does not (Xu et al., 2012). As a result, only Beijing speakers use post-focal compression as a cue to the perception of narrow focus in their own dialect. Interestingly, this asymmetry in the perception of post-focal compression has been seen to transfer to the perception of L2 English narrow focus, resulting in L1 Beijing Mandarin speakers obtaining higher scores than their Taiwanese counterparts (Adams & Hellmuth, 2022).
In summary, L1–L2 typological distance constitutes a key tool for interpreting L2 focus acquisition by speakers of different L1s. Differences between the NSRs applying in an L1 and an L2 are likely to impede the acquisition of prosody, as data from Spanish speakers of English illustrate. Moreover, if their own L1 does not use post-focal compression, learners experience difficulties producing and perceiving post-focal compression in an L2, as illustrated by the L2 English of native speakers of the Beijing and Taiwan Mandarin Chinese varieties.
21.3.3 Social Aspects of Intonation: Accommodation in Subtractive Contexts
Accommodation refers to the changes in language style that a speaker effectuates to gain social acceptability from an in-group and/or out-group, leading to convergence between language styles, divergence, or maintenance. In principle, accommodation differs from transfer in that the intention underlying the former is to gain social acceptability. In other words, in accommodation the use of a language structure is linked to a social meaning, whereas in transfer it is simply the result of applying L1 parameters to an L2. In practice, this fine line between transfer and accommodation often becomes blurred, so that, for example, although a foreign-accent trait in an immigrant’s speech is initially the result of L1-to-L2 transfer, it can be assigned a social meaning by the host L2-speaking community, which later the speaker may use intentionally to signal convergence with their own immigrant in-group.
Accommodation in sentence intonation has been observed in speakers’ choice of pitch accents and in their use of pitch range. The dynamics of this kind of accommodation can be seen in Spain, for example, in the pitch accents used in absolute interrogatives by bilingual speakers from the island of Majorca. In Standard Peninsular Spanish, absolute interrogatives are produced with an ascending F0 intonation transcribed as L*H%. However, in his analysis of the speech of Catalan-Spanish balanced bilinguals growing up in Majorca, Simonet (2008) found a general tendency to use the Majorcan Catalan HL* L% descending contour in absolute interrogatives when they spoke Spanish, indicating a convergence toward Catalan which was modulated by the age and language dominance of the bilingual speakers. The Majorcan Catalan descending contour occurred most among the younger Catalan-dominant speakers, followed by the other Catalan-dominant bilinguals and then the younger Spanish-dominant bilinguals. By contrast, the Spanish ascending L*H% predominated among the older Spanish-dominant bilinguals. These patterns of use clearly illustrated the bidirectionality of the convergence effects, that is, from L1-to-L2 in Catalan-dominant speakers and from L2-to-L1 in Spanish-dominant speakers. However, it was not clear for the Spanish-dominant group whether the use of the Catalan HL* L% in their L1 Spanish came from their L2 Catalan – which would indicate direct accommodation – or from the Spanish variety spoken in Majorca – which would indicate indirect accommodation. More recently, Romera and Elordieta (2013) provided evidence from the same context of indirect accommodation in the speech of monolingual Spanish-speaking immigrants in Majorca by showing that they produced the descending HL* L% in their L1 Spanish interrogatives after living in Majorca for less than four years, despite not having learned any Catalan during that time. Interestingly, that accommodation was modulated by affective variables such as the degree of interest on the part of the speaker in becoming integrated in the host community.
Further evidence – also from Spain – of the role of affective factors in accommodation comes from the patterns detected in the speech of bilingual speakers of Lekeitio Basque and Spanish by Elordieta and Romera (Elordieta & Romera, 2021; Romera & Elordieta, 2020). As in Majorcan Catalan, absolute interrogatives in this Basque variety exhibited descending intonation, which could also appear in the variety of Spanish spoken in the Basque Country when the speakers were accommodating. In these cases, affective variables could even override social factors such as age or gender and linguistic factors such as language dominance. Studies of this sort make it clear that any analysis of prosodic accommodation must include not just social and linguistic but also affective variables.
Pitch range has also been shown to be susceptible to accommodation. Human societies have historically projected the obvious link between vocal pitch and gender onto issues of power, such that higher pitch is related to femininity and submissive attitudes, and lower pitch to masculinity and dominance (Gussenhoven, 2004; Ohala, 1984; Winter et al., 2021). Not surprisingly, the few studies that have looked at how bilingual speakers accommodate to the pitch range of one or the other language have shown that gender plays an important role. Ordin and Mennen (2015, 2017), for example, examined the speech of Welsh-English bilinguals. In general, Welsh has a larger pitch range than English, which is achieved by means of the higher F0 maxima of pitch accents. The authors found that, interestingly, most male participants in the study did not change pitch ranges when switching between languages, failing to accommodate presumably because the higher pitch range sounded “feminine,” while females consistently did so, showing a distinct pitch range for each language. These differences were attributed by the authors to the sociocultural factors linking pitch and gender on the one hand and power on the other.
In a similar study, de Leeuw (2020) examined the pitch level and pitch range of Germans living in Anglophone Canada. In general, when they were speaking German, the Germans’ natural pitch range was intermediate between that of Canadian English-speaking females and that of Canadian English-speaking males. As expected, the female Germans converged with female Canadian speech by increasing and widening their F0 range in both English and German. Unexpectedly, however, the German males increased their pitch range, more closely approximating the wider pitch range of Canadian females than the narrower range of Canadian males. As a result, German males’ English diverged from the Canadian males’ English. The author postulated that this divergence reflected an attempt by the male Germans to exploit the aforementioned associations between pitch and power to distance themselves from the aggressive stereotypes of Germans that they knew were held in Canada.
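The notions of pitch level and pitch range used in these studies can be made concrete with a small computation. The following is only an illustrative sketch, not the procedure of any of the studies cited: it assumes that F0 has already been extracted frame by frame in Hz, that unvoiced frames are coded as 0, that pitch level is taken as the mean F0, and that pitch range (span) is expressed in semitones between the lowest and highest voiced values. The function name `pitch_level_and_span` is hypothetical.

```python
import math
from statistics import mean


def pitch_level_and_span(f0_hz):
    """Return (level in Hz, span in semitones) for a list of F0 values.

    Assumptions for this sketch: unvoiced frames are coded as 0 and are
    discarded; level is the mean F0; span is max-to-min in semitones.
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    level = mean(voiced)
    # 12 semitones per octave; one octave is a doubling of frequency.
    span = 12 * math.log2(max(voiced) / min(voiced))
    return level, span


# Hypothetical F0 track: one unvoiced frame, then 100 Hz and 200 Hz.
level, span = pitch_level_and_span([0.0, 100.0, 200.0])
print(level, span)
```

Expressing span in semitones rather than Hz is a common way of making ranges comparable across speakers with different pitch levels; published studies often use more robust estimates (for example, percentile-based spans) than the raw maximum and minimum used here.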
Recent research (Mennen et al., 2022) has applied the LILt model (Mennen, 2015) to investigate the L1 prosody of long-term English-speaking expatriates in Austria. The results revealed that L2-to-L1 prosodic accommodation affected the four dimensions outlined in the model, to wit, the phonological dimension or pitch accent inventory, the phonetic dimension or pitch accent implementation, the semantic dimension or form-meaning mapping, and pitch accent frequency of use, illustrating the great permeability of L1 prosody. Because some L2-to-L1 patterns could not be accounted for by purely linguistic concepts such as markedness, these results provided further proof of the need to add social and attitudinal factors to linguistic variables when analyzing the acquisition of prosody by bilinguals.
21.3.4 Summary
Intonation theories and transcription systems have generated a wealth of language-specific prosodic grammars, in the process uncovering multiple factors that affect variation in prosody and triggering the use of prosodic typologies. Both issues have had a profound effect on the study of L2 sentence intonation, giving rise, for example, to the LILt model of L2 prosody (Mennen, 2015). As for prosodic typologies, research has shown that typological distance between an L1 and an L2 has an impact on learners’ ability to approximate the expression of focus by monolingual native speakers of the L2, just as it does for L2 stress and rhythm (Section 21.4). Social meaning is also an important source of variation in sentence intonation. Accommodation literature in particular has uncovered evidence of the effect of affective variables on L2 sentence intonation, which can at times be stronger than the effect of more obvious linguistic and social factors, especially in multilingual and multicultural communities.
21.4 L2 Rhythm
21.4.1 Definition and Contextualization
In general, rhythm refers to the sensation of isochrony conveyed by a repetition pattern in auditory stimuli, most obviously exemplified by the 1–2–3 1–2–3 beat of a waltz. In speech, the perceived isochrony shown by primary stresses, syllables, and morae has been used to classify languages into stress-timed, syllable-timed, and mora-timed rhythmic classes. Although production studies have failed to find acoustic correlates to these rhythmic classes (see Arvaniti [2012] for a historical overview), supporting evidence has come from adult and infant perceptual research (e.g., Murty, Otake, & Cutler, 2007; Nazzi, Bertoncini, & Mehler, 1998). In one study, for example, Ramus and colleagues measured the vowel and consonant interval durations of the sentence stimuli employed in previous infant discrimination experiments and found that the percentage of vocalic intervals in those sentences (%V) together with variations in the duration of vowel and consonant intervals (ΔV and ΔC) captured rhythm classes (Ramus, Nespor, & Mehler, 1999; see Nazzi & Ramus [2003] for a later review). Ramus’ duration-based rhythm measures were refined in subsequent research. On the other hand, strong criticism questioning the suitability of using duration-based measures to distinguish rhythmic classes (Arvaniti, 2012) led to research on new measures based on acoustic dimensions other than duration. Advances in L2 rhythm research have proven to be contingent on the fine-tuning of these new measures.
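To make the three duration-based measures described above concrete, the following sketch computes %V, ΔV, and ΔC for a single sentence. It assumes that the sentence has already been segmented into vocalic and consonantal intervals with durations in seconds, that %V is the share of total sentence duration taken up by vocalic intervals, and that ΔV and ΔC are standard deviations of the interval durations. The use of the population standard deviation, the function name `ramus_metrics`, and the durations in the usage example are illustrative assumptions, not details taken from the cited studies.

```python
from statistics import pstdev


def ramus_metrics(vocalic, consonantal):
    """Compute %V, deltaV and deltaC from interval durations (seconds).

    vocalic, consonantal: lists of interval durations for one sentence.
    Assumption: population standard deviation is used for the deltas.
    """
    if not vocalic or not consonantal:
        raise ValueError("both interval lists must be non-empty")
    total = sum(vocalic) + sum(consonantal)
    return {
        "percent_v": 100 * sum(vocalic) / total,  # share of vocalic time
        "delta_v": pstdev(vocalic),               # variability of vowels
        "delta_c": pstdev(consonantal),           # variability of consonants
    }


# Hypothetical durations for a very short utterance.
metrics = ramus_metrics(vocalic=[0.10, 0.20], consonantal=[0.10, 0.10])
print(metrics)
```

In this toy example the consonantal intervals are identical, so ΔC is zero, whereas the unequal vocalic intervals yield a non-zero ΔV.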
21.4.2 Measures
To refine Ramus’ duration-based measures, researchers proposed the following modifications. Frota and Vigário’s (2001) Δ%C and Δ%V measured the variability of interval duration relative to sentence duration. Grabe and Low’s (2002) Pairwise Variability Index (PVI) measured duration variation between adjacent intervals in a sentence, capturing the fact that the perceived duration of an interval is relative to the duration of the surrounding intervals. Dellwo’s (e.g., 2010) Varco normalized interval durations for speech rate. Applying PVI and Varco for the first time to L2, White and Mattys (2007) showed that these measures were sensitive enough to capture differences between L1 and L2 pairs, such as L1 Spanish compared with the L2 Spanish of L1 English speakers. In the following decade, the original PVI and Varco measures were subjected to further refinements in the light of their sensitivity to various factors such as L2 proficiency levels (e.g., Kawase, Kim, & Davis, 2016; Ozaki, Yazawa, & Kondo, 2017), the differences detected between large speech corpora of the same language being spoken as an L1 and an L2 (e.g., Lai, Evanini, & Zechner, 2013), human judgments of rhythmicity in L2 data (Kinoshita & Sheppard, 2011), and dialectal variation (e.g., Cruz, 2013).
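The PVI and Varco measures described above can be sketched as follows. The code assumes a list of successive interval durations of one type (vocalic or consonantal) in seconds. It implements the raw PVI as the mean absolute difference between adjacent intervals, the normalized PVI as the same difference divided by the mean of each pair and multiplied by 100, and Varco as the standard deviation divided by the mean duration and multiplied by 100. The function names and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def raw_pvi(durations):
    """Mean absolute difference between adjacent interval durations."""
    pairs = list(zip(durations, durations[1:]))
    if not pairs:
        raise ValueError("need at least two intervals")
    return mean(abs(a - b) for a, b in pairs)


def normalized_pvi(durations):
    """PVI with each difference scaled by the mean of the pair (x100)."""
    pairs = list(zip(durations, durations[1:]))
    if not pairs:
        raise ValueError("need at least two intervals")
    return 100 * mean(abs(a - b) / ((a + b) / 2) for a, b in pairs)


def varco(durations):
    """Standard deviation as a percentage of the mean duration."""
    return 100 * pstdev(durations) / mean(durations)


# Hypothetical vocalic intervals: short, long, short.
vowels = [0.10, 0.20, 0.10]
print(raw_pvi(vowels), normalized_pvi(vowels), varco(vowels))
```

Because the normalized PVI and Varco both divide by a local or global mean duration, doubling every duration in `vowels` (a slower speech rate) leaves them unchanged, whereas the raw PVI doubles.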
Despite these improvements, the notion that the various rhythmic classes could be accurately defined using duration-based measures came under sharp attack (Arvaniti, 2009, 2012; Barry, Andreeva, & Koreman, 2009), representing a turning point in rhythm research. While Arvaniti (2012) questioned the validity of duration-based rhythmic classes, Gasparini et al. (2021) proposed a shift in focus from rhythmic classes to metrics proper based on their meta-analysis of forty-two infant studies. They showed that rhythmic classes failed to account for infants’ language discrimination abilities better than successive consonant-duration interval variability. At about the same time, neuroscience research revealed the importance of synchronizing time windows of different sizes when describing isochrony (e.g., Poeppel & Assaneo, 2020), showing the need to add larger timescales such as syllables and words to the vowel and consonant intervals of duration metrics and thus underscoring the need to include pitch-based and intensity-based measures in any analysis of prosody.
With regard to pitch-based measures, Polyanskaya, Busà, and Ordin (2020) were the first to explore the F0 contribution to the sensation of isochrony cross-linguistically. Based on Jun’s (2014) concept of macro-rhythm, these authors applied pitch-based measures to Italian and English, such as the distance between F0 peaks and F0 valleys, the magnitude of pitch excursions, and the number of tonal targets per intonational unit. The tonal rhythm of Italian proved to be more regular than that of English, lending support to Jun’s predictions. These results have opened a door to the investigation of F0-based rhythm measures in L2s and their relationship with duration-based measures (e.g., Nagao & Ortega-Llebaria, 2021).
With regard to intensity-based measures, many have been proposed. He (2012) applied to intensity the PVI and Varco measures previously used on duration. These intensity measures accounted for the perceived rhythmic differences between L1 English and Mandarin-accented English that duration-based measures failed to capture. Quené et al. (2014) operationalized rhythm as peak frequency in the intensity envelope spectrum, normalized to syllables per second. This measure captured rhythmic convergence between L2 English and that of the monolingual community. Of particular interest are the recent amplitude envelope-based measures of rhythm because they reflect overlapping timescales of different sizes, thus overcoming the limitation of a single timescale displayed by previous duration-based measures. Pellegrino, Schwab, and Dellwo (2021) directly compared the effects of amplitude envelope and segment duration on rhythm while controlling for the speech rate of native Italian and L2 Italian as spoken by L1 German speakers. The effect of amplitude envelope in the perception of rhythmic differences between L1 and L2 Italian remained once the effect of segment duration was controlled for. Additional evidence from Frota et al. (2022) showed that wideband amplitude envelopes of different bands captured rhythmic differences between Brazilian Portuguese, European Portuguese, and German.
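The idea of locating a peak in the spectrum of the intensity envelope can be illustrated with a minimal sketch. It is not the procedure of any study cited above: it assumes that an intensity envelope has already been extracted and sampled at a known rate, it uses a plain discrete Fourier transform, it restricts the search to an arbitrarily chosen 1–10 Hz band, and it omits the normalization to syllables per second. The function name `envelope_peak_frequency` and all parameter values are hypothetical.

```python
import cmath
import math


def envelope_peak_frequency(envelope, sample_rate, f_min=1.0, f_max=10.0):
    """Return the frequency (Hz) with most power in the envelope spectrum.

    envelope: intensity envelope samples; sample_rate: samples per second.
    Assumption: the search band f_min..f_max is an illustrative choice.
    """
    n = len(envelope)
    avg = sum(envelope) / n
    centered = [x - avg for x in envelope]  # remove the DC component
    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        if freq < f_min or freq > f_max:
            continue
        # k-th DFT coefficient of the centered envelope.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq


# Synthetic envelope: a 4 Hz modulation sampled at 100 Hz for 2 seconds.
rate = 100
env = [1 + 0.5 * math.sin(2 * math.pi * 4 * i / rate) for i in range(200)]
print(envelope_peak_frequency(env, rate))
```

The direct transform used here is slow for long recordings; with real speech data a fast Fourier transform from a numerical library would be the usual choice.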
Besides these acoustic correlates, gestures have also been investigated because of their proven synchronization with prosody. For example, Rathcke et al. (2021) asked native English speakers to reproduce the beat of a looped sentence they had heard by tapping their fingers and found that the participants reproduced the rhythmic structure of sentences with a consistent anchoring of the taps on vowel onsets. Tying this to L2 acquisition research, Prieto and colleagues showed that the addition of beat-gestures (up-down hand movements) to L2 rhythm perception training significantly improved learners’ pronunciation and comprehension of the language (e.g., Baills, 2022; Gluhareva & Prieto, 2017).
21.4.3 L2 Rhythm in Additive Bilingual Contexts
Earlier research using duration-based measures not only yielded inconsistent results with regard to L2 proficiency effects (e.g., Guilbault, 2002) and L2 length of exposure (e.g., Gut, 2009) but also exposed the difficulty of interpreting these measures. For instance, although L1 Japanese speakers increased duration variability in their L2 English, and these measures progressively approached English monolingual norms, they did not reflect the typical duration patterns of English stress (Grenon & White, 2008). Crucially, Arvaniti (2012) identified speaker variation, elicitation method, and syllable complexity as variables that all had larger effects than rhythmic class. Accordingly, various authors designed experiments in which these effects were controlled for (Arvaniti, 2009, 2012; Gut, 2012; Prieto et al., 2012), resulting in more consistent findings.
These new results indicate that L2 duration-based isochrony is learnable and that L2 proficiency modulates this learning (e.g., da Silva & Barbosa, 2019; Gut, 2012). For example, Ordin and Polyanskaya (2015) and Li and Post (2014) showed that L1 German, L1 French, and L1 Mandarin learners of L2 English progressively improved their L2 duration-based patterns along with their English proficiency. However, in both studies German speakers showed more overall improvement than French or Mandarin speakers, illustrating the effect of L1–L2 typological distance. Directionality was also a modulating factor. Van Maastricht et al. (2019) showed that while both Dutch learners of Spanish and Spanish learners of Dutch improved L2 rhythm, Dutch learners of Spanish achieved higher scores. Because L1–L2 typological distance was the same, the authors attributed this learning difference to the respective phonological grammars of the two languages. Two other factors that seem to modulate L2 learning are musical abilities (Cason et al., 2020) and gestures (e.g., Gluhareva & Prieto, 2017).
In addition to these factors, a universal learning pattern has been identified whereby, when learning a stress-timed L2, learners will initially employ syllable-timed patterns, regardless of the rhythm class of their L1. For example, exactly like French or Mandarin speakers, German speakers start out speaking L2 English using a more syllable-timed pattern, despite the fact that German, unlike French or Mandarin, is not a syllable-timed language (Ordin & Polyanskaya, 2015; Li & Post, 2014).
21.4.4 L2 Rhythm in Subtractive Bilingual Contexts
Contexts of nonadditive bilingualism provide specific learning situations that modulate the factors affecting L2 acquisition in particular ways. It is in these contexts that bidirectional transfer between L1 and L2 has been observed most frequently. On the one hand, there is abundant evidence of L1-to-L2 transfer even after the L1 is no longer spoken in the community (Benet et al., 2012; Gabriel & Kireva, 2014). For example, Porteño Spanish, the Spanish spoken nowadays in Buenos Aires, shows traits of Italian prosody due to the prolonged period of Italian-Spanish language contact resulting from the mass migration of Italians to the city between 1860 and the early twentieth century (Colantoni & Gurlekian, 2004). This was demonstrated by Benet, Gabriel, and colleagues (Benet et al., 2012; Gabriel & Kireva, 2014), who compared the rhythmic patterns of Porteño Spanish with those of Castilian Spanish, Italian, and L2 Spanish spoken by Italian speakers and found that Porteño Spanish displayed values that were closer to L1 Italian and L2 Spanish spoken by Italian speakers than to Castilian Spanish.
On the other hand, L2-to-L1 transfer has been frequently reported in bilingual communities where the L2 becomes the dominant language but does not displace the L1. Coetzee et al. (2015) studied bilingual speakers of Afrikaans and Spanish living in Argentinian Patagonia. After the community had spoken only Afrikaans (a stress-timed language) for decades, Spanish (a syllable-timed language) gradually became its dominant language in the second half of the twentieth century. At the time of the study, the rhythmic patterns of the bilingual speakers’ Spanish were identical to those of monolingual controls. However, the rhythmic patterns of the bilinguals’ Afrikaans showed L2 transfer from Spanish. Similarly, children in Botswana are raised bilingual in Setswana (a syllable-timed language) and English (a stress-timed language) within a diglossic context where the colonial language, English, is the prestige variety used in public settings (Bagwasi, 2021). Schools can use either English or Setswana as their vehicular language. In schools where English is the vehicular language, the use of Setswana is relegated to family settings and English eventually becomes the pupils’ dominant language, with the result that the children transfer the rhythmic patterns of their L2, English, to their L1, Setswana (Sebina, Setter, & Daller, 2021).
21.4.5 Summary
After controlling for elicitation methods and the syllable structure of test stimuli (e.g., Arvaniti, 2009, 2012), duration-based Varco and PVI measures of L2 output yield consistent results which reveal that the factors of L2 proficiency, L1–L2 typological distance, directionality, musical abilities, and gestures modulate the learning of L2 duration-based isochrony. In addition, a universal L2 learning pattern has been identified such that learners of stress-timed L2s start out by producing syllable-timed patterns independently of their L1. Many intensity-based measures of isochrony have been shown to be effective for capturing L2 learning patterns, whereas F0-based measures of isochrony are just starting to be used in L2 research. Finally, L2 learning in subtractive contexts where the L2 becomes the dominant language in a community (e.g., bilingual Afrikaans–Spanish-speaking communities in Argentina, bilingual Setswana-English-speaking communities in Botswana) has shown both L1-to-L2 and L2-to-L1 transfer (Section 21.3.3). Future research needs to disentangle the role of social, linguistic, and cognitive factors in these contexts.
21.5 Conclusions and Directions for Future Research
Research from the last twenty-five years demonstrates that adults are capable of learning L2 stress, L2 duration-based rhythm, and L2 sentence intonation. Two findings are of particular relevance. First, various factors affecting L2 prosody have emerged, mostly in contexts of subtractive bilingualism. Whereas L1-to-L2 transfer is ubiquitous, L2-to-L1 transfer can take place in situations of language contact where the L2 is or becomes the dominant language, as in the case of L1 Afrikaans-L2 Spanish speakers in Patagonia (Coetzee et al., 2015) or L1 Dutch speakers who are long-term residents of Anglophone Canada (Bruggeman, 2016). This L2-to-L1 transfer has been observed in L2 stress, sentence intonation, and rhythm, as well as in the four dimensions outlined in LILt (Mennen, 2015). Whereas in contexts of additive bilingualism speakers’ perception and production of a prosodic feature are equally high or equally low, a mismatch between high perception and low production of Spanish stress has been observed in heritage Spanish speakers residing in the USA (Kim, 2020). This pattern has been linked to the situation of subtractive bilingualism: the knowledge of Spanish acquired by heritage speakers through early exposure to Spanish at home can be lost if this use is interrupted early in development (e.g., Rao, 2019). Prosodic accommodation in multilingual communities conveys social meaning and is mainly modulated by affective and social factors rather than by factors traditionally emphasized in second language acquisition like L2 proficiency and language dominance (Elordieta & Romera, 2021). Altogether, this cumulative evidence underlines the fact that to date most research into the acquisition of L2 prosody has been carried out in contexts of additive bilingualism, and considerable work from contexts of subtractive bilingualism must be done to correct this imbalance.
The second main research finding relates to the central role played by form-meaning mapping in explaining speakers’ ability to approximate L2 prosodic targets. With regard to the acquisition of L2 stress, sentence intonation, or rhythm, speakers approximate an L2 to a lesser degree when learning the L2 either requires remapping L1 suprasegmental cues to new L2 meanings, such as when French speakers map the demarcative function of suprasegmental cues in their L1 onto the culminative function of stress in Spanish, or requires speakers to express the same meaning with prosodic features in one language and nonprosodic features in the other, as can be seen when English speakers learn Spanish syntactic-based focus. In both cases, L2 learning implies remapping prosodic forms to new meanings. Conversely, speakers approximate L2 targets more closely when no remapping is involved, that is, when learning an L2 prosodic target requires only transferring L1 suprasegmental cues to the L2 to express similar meanings.
These two main findings suggest the following routes for future research. First, to overcome bias favoring additive bilingualism, L2 prosody research needs to include more situations of subtractive bilingualism such as heritage speakers, long-term migrants, and pidgins and creoles, to name but a few. Reducing this bias together with taking an interdisciplinary approach that combines linguistic, social, and affective variables will help to unravel the reasons why specific factors modulating L2 prosody emerge only in certain situations of language contact. In short, it will help us to understand why reactions to “Gravy?” spoken with Indian English intonation can differ so sharply in the contexts of a British airport cafeteria or an Indian university cafeteria, as we saw in Section 21.1.
Second, the central role of form-meaning mappings in L2 prosody highlights the need to further our understanding of the meanings that prosody conveys. However, L2 prosody research has emphasized forms over meanings, and consequently only meanings relevant to the particular prosodic forms under study have been investigated, such as the culminative versus demarcative functions of L2 stress or the social meanings related to accommodation patterns. With the exception of a few recent papers (Barbosa & Niebuhr, 2020; Orrico & D’Imperio, 2020; Rao, Ye, & Butera, 2022; Taguchi, Hirsch, & Kang, 2022), the prosody of pragmatic meanings such as evidentiality and of emotional meanings such as those characterizing charismatic speech is mostly being studied in connection with L1s (e.g., Escandell & Prieto, 2020; Niebuhr, Thumm, & Michalsky, 2018), with research linked to L2s being still relatively meager (for a review see Kang & Kermad, 2019).
Finally, L2 prosody research has thus far mainly explored domains up to the sentence, bypassing discourse. To fill this gap, it is crucial to incorporate work that focuses on discourse pragmatics and conversation models. Moreover, studying discourse requires big data and efficient (semi-)automated L2 prosody annotation methodologies (Kang, Johnson, & Kermad, 2021), which in turn involves interdisciplinary research among phoneticians, semanticists, pragmatists, sociolinguists, and computational linguists.



