29.1 Introduction
In the context of bilingualism, language dominance “refers to observed asymmetries of skill in, or use of, one language over the other” (Birdsong, 2014, p. 374). Among other factors, assessments of language dominance may take into account relative proficiency, fluency, ease of processing, frequency of use, situational choice of use, and cultural identification in each language. Assessed dominance has been shown to be an important predictor of certain characteristics of speech production and perception among bilingual children (Law & So, 2006; Ramon-Casas et al., 2009; Ramon-Casas et al., 2023; Zembrzuski et al., 2020) and among adults (Amengual, 2018; Amengual & Chamorro, 2015; Bosch, Costa, & Sebastián-Gallés, 2000; Bullock et al., 2006; Casillas, 2015; Casillas & Simonet, 2016; Mayr et al., 2019; Ramírez & Simonet, 2018; Sebastián-Gallés & Soto-Faraco, 1999; Simonet, 2011; Tomé Lourido & Evans, 2019).
This chapter reports the effects of language dominance on the phonetic and phonological characteristics of adult bilingual speech by selectively examining methodologically and theoretically diverse studies on speech production and perception. Further, we review language dominance effects across a range of multilingual settings, including language contact scenarios of typologically similar (e.g., Spanish-Catalan, Spanish-Galician) and dissimilar (e.g., Cantonese-English, Spanish-K’ichee’, Afrikaans-Spanish) languages in minority language contexts and societal bilingual settings, including simultaneous bilinguals and early and late sequential bilinguals and trilinguals.
We begin the chapter by conceptualizing and operationalizing the construct of dominance and describing how dominance can be measured. We go on to examine the dynamic nature of language dominance, looking in particular at dominance switches and dominance shifts, along with the interplay of language dominance with other relevant variables in bilingual speech research. We conclude with a brief look at dominance effects in third language (L3) acquisition and with suggestions for future research.
29.2 The Construct of Dominance, Its Significance, and Its Measurement
29.2.1 The Nature of Dominance
As stated in Section 29.1, dominance in bilingualism is fundamentally conceptualized in terms of asymmetry. At the level of the individual bilingual, this asymmetry may play out, for example, in speaking or reading aloud more words per minute in one language than the other, in more frequent everyday use of one language over the other, or in greater ease of processing natural speech in one language vis-à-vis the other. These examples illustrate the relativistic nature of dominance within the individual bilingual. Relativity also applies across individuals; for example, one Spanish-English bilingual may be more dominant in Spanish than another Spanish-English bilingual. As these examples further suggest, dominance is properly understood as a gradient, continuous construct. A bilingual is not simply Language X dominant; they are Language X dominant to a certain measurable degree. Assessments of dominance, discussed in Section 29.2.4, assign numerical values to the degree of dominance at the level of the individual.
29.2.2 The Relevance of Dominance to the Study of Bilingual Phonetics and Phonology
Since its introduction as a measurable trait in bilingualism by Lambert (1955), and its subsequent rise in visibility (e.g., Bahrick et al., 1994; Grosjean, 2010; Hamman, Rinke, & Genevska-Hanke, 2019; Silva-Corvalán & Treffers-Daller, 2016), dominance has emerged as a notable predictive factor in bilingual phonetics and phonology. A compelling demonstration of the predictive potential of dominance is Baird’s (2015) study of F0 production on Spanish tonic syllables among Spanish-K’ichee’ bilinguals in Guatemala. Ten residents from each of two Guatemalan communities, Cantel and Nahualá, were assessed for their relative Spanish-K’ichee’ dominance with the Bilingual Language Profile (BLP; Birdsong, Gertken, & Amengual [2012], discussed in Section 29.2.4). The residents of Cantel were assessed as being Spanish dominant to varying degrees, while participants residing in Nahualá displayed varying levels of K’ichee’ dominance. On a task involving reading aloud a set of Spanish declarative phrases, the bilinguals’ speech samples, along with those of ten Spanish monolinguals from the nearby city of Quetzaltenango, were subjected to acoustic analyses of peak F0 placement on paroxytone target words with CV tonic syllables. The Spanish F0 peak typically occurs after the tonic syllable, whereas its placement is variable and often pretonic in contact and bilingualism contexts. Like all of the Spanish monolinguals, all the Cantel bilinguals produced late (post-stress) F0 peaks, as did seven of the ten bilinguals from Nahualá.
This result, a first-pass association of the bilinguals’ place of residence (a categorical predictive factor) with pre- versus post-stress F0 (a categorical dependent variable), underspecifies the actual relationship of peak F0 to the tonic syllable. As shown in Figure 29.1, Baird (2015) found that the continuous factor of dominance was predictive of both the direction and the actual distance of F0 peaks with respect to the tonic syllable nucleus.

Figure 29.1 Correlation of BLP scores with relative peak F0 alignment scores for Spanish-K’ichee’ bilinguals from Cantel (n = 10) and Nahualá (n = 10). BLP scores above zero indicate increasing Spanish dominance; scores below zero indicate increasing K’ichee’ dominance. Relative peak alignment scores = duration from syllable onset to peak F0 divided by total duration of the syllable; values are individual participant averages. The broken vertical line indicates the end of the tonic syllable.
Baird (2015) represents a coherent object lesson for the study of speech production in bilingualism. By properly operationalizing his speakers’ relative Spanish-K’ichee’ dominance as a continuous predictor variable and by representing F0 placement as a continuous dependent variable, Baird captures critical facts of inter-individual variation among bilinguals. As noted above, the direction and the degree of dominance at the level of the individual bilingual participant predict both the observed direction and the distance of peak F0 placement relative to the Spanish tonic syllable. The precise nature of F0 peaks among speakers of these contact varieties of Spanish would not have been revealed under an analysis by categorical factors of place of residence, Spanish versus K’ichee’ dominance, and pre- versus post-stress F0 peak alignment.
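The relative peak alignment measure used in Figure 29.1 is simple to compute from three timestamps. The sketch below (Python, with hypothetical measurements, not Baird’s data) follows the caption’s definition: duration from syllable onset to the F0 peak divided by total syllable duration, so values above 1.0 indicate a post-tonic peak and values below 1.0 a peak within (or before) the tonic syllable.

```python
def relative_peak_alignment(syllable_onset, syllable_offset, peak_time):
    """Relative peak alignment: time from syllable onset to the F0 peak,
    divided by total syllable duration. Values > 1.0 mean the peak falls
    after the end of the tonic syllable (post-tonic alignment)."""
    return (peak_time - syllable_onset) / (syllable_offset - syllable_onset)

# Hypothetical timestamps (seconds): a 180 ms tonic syllable whose
# F0 peak occurs 40 ms after the syllable ends.
score = relative_peak_alignment(0.320, 0.500, 0.540)
print(round(score, 2))  # → 1.22, i.e., a late (post-stress) peak
```

In a dominance analysis like Baird’s, each speaker’s average score on this measure would then be regressed on, or correlated with, their continuous BLP score.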
29.2.3 A Closer Look at Dominance
Beyond emphasizing the relativity and continuousness of dominance, it is important to distinguish language dominance at the level of the bilingual speech community from individual-level language dominance. At the community level, the dominant language is typically associated with greater numbers of speakers, with the language of instruction in schools, or with official communications. The dominant community language is sometimes the lingua franca or the language of prestige. In contrast, at the level of the individual bilingual, the dominant language is associated with greater skill or more frequent use. Although conceptually distinct, the prevalence of the community-level dominant language in daily life may affect the degree to which that language dominates at the level of the individual. For discussion, see De Bot (2007) and Lazarev and Pravikova (2005).
Next, we call attention to the distinction between dimensions and domains of dominance. Dimensions of dominance are thought of in terms of abilities or skills. For example, a bilingual who speaks faster and with greater lexical diversity in one language is dominant in that language in the dimension of oral fluency. Other examples of dominance in language-relevant skill dimensions include detecting words in sentences in each language uttered under white noise, picture-naming speed in each language, and Stroop performance in each language. In contrast, domains are the contexts, situations, and purposes where languages are used. The language that is used more frequently in the workplace is dominant in that domain. Similarly, the language that is used more frequently in counting and calculating, in schooling, and in child rearing is dominant in those domains.
We note here that domain-based dominance has been conceptualized and empirically verified in terms of Grosjean’s Complementarity Principle. To illustrate this principle, Grosjean (2016) cites Gasser’s (2000) case studies of individual English-German bilinguals. Bilingual 14 in the Gasser survey spoke German colloquially and in clubs 80 percent of the time, while speaking English 20 percent of the time – the complement of 80 percent – in these domains. The same bilingual used English 95 percent of the time for counting and calculating, and German 5 percent of the time for these purposes.
In conceptualizing and assessing language dominance, it is important to understand that domains and dimensions are, inherently, independent considerations. For example, a bilingual who uses one language more often than the other when talking with their children (domain-based dominance) may not necessarily be more fluent in that language (dimension-based dominance). At the same time, it is likely that increases or diminishment of use across and within domains (e.g., public speaking, professional writing, casual interactions with peers) can result in fluctuations in the skills that correspond to these domains.
Finally, dominance is not exclusively concerned with, or defined by, language proficiency. Proficiency, understood in basic terms as knowledge of language structure and lexis and facility in language use, is one of several measurable dimensions; for defining and operationalizing dominance, there are multiple dimensions (as well as multiple domains) that are taken into account. We note as well that a bilingual who is dominant in one language in the dimension of proficiency is not necessarily highly proficient in that language, only less proficient in the other language. In a similar vein, a bilingual whose proficiency is equal in both languages (a nondominant, or balanced, bilingual with respect to proficiency) is not necessarily highly proficient in both languages; rather, this person simply demonstrates comparable levels of proficiency – high, low, or anywhere in between – in both languages; see Butler and Hakuta (2004).
29.2.4 Assessing Dominance
The relativistic and gradient characteristics of bilingual dominance are captured in comparisons of measurements of dimensions and domains in the two languages. As a simple example, the speed of naming pictures (e.g., Boston Naming Test [BNT; Kaplan, Goodglass, & Weintraub, 1983]; A Quick Test of Cognitive Speed [AQT; Wiig et al., 2002]) in one language can be compared to picture-naming speed in the other language. In the literature, such comparisons have typically been carried out by two basic arithmetic methods: subtraction and ratio. Continuing with naming speed as an example, under subtraction, the language in which naming is faster is dominant by the difference (in seconds) between the naming times in the two languages. By the ratio method, an index of dominance is derived by dividing the naming time in one language by the naming time in the other. A hybrid method that combines subtraction and ratio in bilingual naming speed times is employed by Langdon, Wiig, and Nielsen (2005). For an evaluative comparison of different methods of deriving dominance indices, see Birdsong (2016).
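The subtraction and ratio methods described above can be sketched in a few lines of code. The example below uses hypothetical naming times (the function names and the figures are illustrative, not drawn from any of the cited studies):

```python
def dominance_difference(time_lang_a, time_lang_b):
    """Subtraction method: positive values mean faster naming
    (greater dominance) in language A; units are seconds."""
    return time_lang_b - time_lang_a

def dominance_ratio(time_lang_a, time_lang_b):
    """Ratio method: values > 1.0 mean faster naming in language A."""
    return time_lang_b / time_lang_a

# Hypothetical total picture-naming times for one bilingual (seconds)
spanish_time, english_time = 42.0, 56.0
print(dominance_difference(spanish_time, english_time))        # → 14.0
print(round(dominance_ratio(spanish_time, english_time), 2))   # → 1.33
```

Note the interpretive difference: the subtraction index preserves the original units (14 seconds faster in Spanish), while the ratio index is unit-free (naming in English takes 1.33 times as long), which can matter when comparing speakers whose overall naming speeds differ.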
A number of instruments for specifically assessing language dominance are in wide use. These include the BLP (Birdsong et al., 2012) mentioned earlier, the Bilingual Dominance Scale (BDS; Dunn & Fox Tree, 2009), and the Self-Report Classification Tool (SRCT; Lim et al., 2008). Other questionnaires and tests that address bilingualism more generally may contain items about dominance; these include the Language Experience and Proficiency Questionnaire (LEAP-Q; Marian, Blumenfeld, & Kaushanskaya, 2007), the Language and Social Background Questionnaire (LSBQ; Luk & Bialystok, 2013), the Bilingual Input-Output Survey (BIOS) subset of the Bilingual English-Spanish Assessment™ (BESA; Peña et al., 2014), and the L2 Language History Questionnaire (Li, Sepanski, & Zhao, 2006). For additional information on dominance instruments and comparisons of their content and output, see Gertken, Amengual, and Birdsong (2014) and Peña, Bedore, and Torres (2021).
Three important features of these dominance assessments must be emphasized. First, since some of these instruments include self-reports of relevant experiences (e.g., situational use over the short and the long term) and skills (e.g., grammatical and lexical proficiency), the responses are necessarily approximations. Note, however, that the proficiency portions of some instruments have been validated by comparisons with objective tests of proficiency; see Gertken et al. (2014) and also Sheng, Lu, and Gollan (2014) for validation of self-reported dominance with naming task performance.
Second, on certain instruments the questionnaire items probe respondents’ attitudes, not just their linguistic practices or skills. It is to be expected that attitudes may affect choices in use, thereby indirectly informing domain-based dominance and dimension-based dominance. For example, answers to items on a questionnaire probing bilinguals’ motivation to sound like native speakers may connect causally to their answers on items relating to choosing to use the L2 in contexts (domains) where either the L1 or the L2 could be used; in turn, the effect of such practice could improve performance in the dimensions of speech accent and speech rate.
Third, the scores on these instruments are composite measures, the sum of responses to multiple items that relate to diverse elements of the makeup of bilinguals (see Luk & Bialystok, 2013). As with the output of intelligence quotient (IQ) tests, what these instruments provide are indices of dominance in a global and undifferentiated sense. As a consequence, two identical scores on a dominance questionnaire are unlikely to reflect identical answers to all items. It follows that two bilinguals with the same dominance scores should not be assumed to be equivalent with respect to underlying components of dominance. In cases where granularity is desired, researchers may consider using as predictor variables relevant scores on modules, subsections, or individual items.
29.2.5 The Dynamic Character of Dominance: Dominance Switches and Dominance Shifts
At the level of the individual bilingual, the dominance relationship of the two languages is not static over time but may evolve as relative usage patterns of each language change. As a result of immigration, or of changes in workplace, family situation, educational demands, and so on, the “wax and wane of languages” (Grosjean, 2010) may range from instances where the dominant language becomes less or more dominant, to more extreme cases where the dominance relationship of the two languages is reversed or switched.
Typically, the likelihood of the second language replacing the first as the dominant language correlates negatively with the age at which learning of the second language begins. In particular, entrenchment of the L1 through continued use makes it increasingly resistant to diminishment in skill dimensions involving processing.
Increasing age is not an absolute barrier to dominance switching, however. For example, Grosjean (2010) famously reports that his second language (English) displaced his first language (French) in terms of relative use and fluency in the years following immersion in English at age eight. This dominance switch was followed by a change back to French dominance between his late twenties and late thirties, subsequent to his return to immersion in French. By the time he reached his sixties, Grosjean’s use and fluency of English and French were approximately equal.
In the area of phonetics, a recent study by Kornder and Mennen (2021) examined the case of Arnold Schwarzenegger, whose pronunciations in both his native Austrian German and his L2 American English have evolved over the course of living in the US for fifty-two years since immigrating at age twenty-one. Schwarzenegger’s bilingual profile reveals a striking switch from L1 German dominance to L2 English dominance:
English has become his dominant means of communication which he makes use of even in L1 settings. He prefers, for example, speaking in English when being interviewed by German or Austrian broadcast stations and newspapers … In interviews, he repeatedly stated that it is much easier for him to speak English than German and that the use of his L1 is restricted predominantly to private contexts … .
A corpus of interviews with Schwarzenegger, beginning in 1979 (ten years after migration to the US) and continuing until 2018, was analyzed for voice onset time (VOT) and for F1 and F2 in German and English. Productions at earlier stages of US residence were compared with those at later stages. In brief, in later productions, Schwarzenegger’s VOT durations for German plosives mostly lengthened toward English long-lag values, while VOTs in English plosives were highly variable. For vowels, German /i/ and /ɔ/ shifted toward English values for F1 and F2, while F1 and F2 for other German vowels moved farther away from corresponding English values. For both consonants and vowels, Schwarzenegger’s switch to English dominance did not result in a generalized pattern of L2 (dominant-language)-induced changes on the L1, nor did partial maintenance of the nondominant L1 German result in systematic departures from native-like English values. Rather, for Schwarzenegger we see local assimilatory and dissimilatory drifts in both languages. Importantly, within the dynamic context of dominance switches, the phonologies of both the L1 and the L2 show evidence of permeability, a reflection of the bidirectional cross-linguistic interaction of speech systems that is a characteristic feature of bilingualism, which in the case of Arnold Schwarzenegger is manifest across five decades. See also de Leeuw (2019), for a study over four decades of the L1 (German) and L2 (English) speech of professional tennis player Steffi Graf.
With respect to dominance shifts – changes in the relative degree to which the dominant language is dominant, as opposed to full-on dominance switches or reversals – the effects on the pronunciation of the L1 are noted early on in L2 learning. For example, among adult Anglophone beginning students of L2 Korean, Chang (2012) observed a drift in the F1 values of their English vowels toward Korean values at even the earliest stages of Korean study abroad. In other studies of limited immersion, the drift is toward dissimilation, that is, an enhancement of L1–L2 differences, particularly in VOT; see Flege and Eefting (1987a, 1987b) and the discussion in Kartushina, Frauenfelder, and Golestani (2016). Extended immersion, to the degree that it is characterized by continued frequent (cumulatively greater) use of the L2, increases the magnitude of dominance shifts, which may in turn reinforce the phonetic drift of VOT. However, upon return to the L1 environment and increased use of the L1, the VOT values tend to drift back toward those of the L1.
To fully characterize the relationship of dominance shifts to phonetic drift, one must take into account the interplay of multiple factors. These include degree of dominance, frequency of code-switching, L2 pronunciation skill, the phonetic similarity of the L2 and the L1 at local levels of analysis, the particular phonetic features in question, and mono-, bi-, and multilingual modes at testing and over time. See Kartushina et al. (2016) for a review of studies where these factors are considered. See Tobin, Nam, and Fowler (2017) for a meticulous examination of factors in phonetic drift among Spanish-English bilinguals. Tobin et al. (2017) go on to incorporate these factors into a dynamical systems model of drift that they test against the findings and predictions of several relevant studies, including Antoniou et al. (2011), Bullock and Toribio (2009), Chang (2012, 2013), and Sancier and Fowler (1997). See Section 29.5 for discussion of dominance shifts and phonetic drifts in the learning of a third language (L3).
29.3 Language Dominance and Bilingual Speech Production
Most of the work to date on language dominance and bilingual speech has focused on analyzing speech production as opposed to speech perception (Amengual, 2016a, 2016b, 2018; Amengual & Chamorro, 2015; Antoniou et al., 2010, 2011; Baird, 2015, 2021; Bullock et al., 2006; de la Fuente Iglesias & Pérez Castillejo, 2020; Kim & Repiso Puigdelliura, 2020; Simonet, 2011, 2014). These studies have enriched our understanding of the effects of language dominance on bilinguals’ acoustic realizations and its interaction with other relevant variables, such as language activation or the quantity and quality of input during childhood, among other factors. Language dominance has been shown to be an important predictor of the interaction patterns between the sound systems of the bilingual individual’s two languages, and this variable has provided a nuanced approach to explaining cross-linguistic influence in bilingual speech production.
To provide a few examples, Simonet (2011) examined the production of Catalan and Spanish mid-back vowels, including the Catalan contrast /o/–/ɔ/, by two groups of highly proficient, early-onset Catalan-Spanish bilinguals, divided as a function of their language dominance. The results of his study revealed that language dominance impacted the acoustic realization of their Catalan mid vowels. On the one hand, Spanish-dominant bilinguals did not produce the Catalan-specific vowel contrast: they produced a single, merged phonetic category instead of two distinct phonemic categories. On the other hand, Catalan-dominant bilinguals were found to have a system of two mid-back vowel categories, one for Catalan /ɔ/ and another for Catalan /o/ and Spanish /o/. These results support the claim that early and extensive exposure to a language does not guarantee that learners will acquire the speech patterns of their L2 in a native-like manner (Sebastián-Gallés & Soto-Faraco, 1999). In a follow-up study that examined the perception and production of the Catalan mid vowel contrasts /e/–/ɛ/ and /o/–/ɔ/ in the same bilingual community, Amengual (2016b) showed that even though these bilinguals maintained robust mid vowel contrasts in their productions, the degree of language dominance had an impact on the acoustic distance maintained between these mid vowel targets.
Language dominance is thus a relevant variable in the study of the phonetic production of early bilinguals, with studies showing that highly proficient bilinguals tend to transfer the phonetic features of the sound categories of their dominant language to their nondominant language even with early and extensive exposure to native input (Amengual, 2016b, 2016c; Simonet, 2011). In addition, language dominance has been seen to account for observed patterns in the acoustic cues employed by early and late bilinguals. In a study that investigated the acoustic correlates of the Spanish tap–trill phonological contrast in the production of Spanish heritage speakers and L2 Spanish learners, Amengual (2016a) showed that most of the L2 learners and English-dominant heritage speakers produced noncanonical phonemic trills with one or zero occlusions and maintained the Spanish tap–trill contrast largely by means of segmental duration. In contrast, Spanish-dominant heritage speakers produced the majority of their trills with two or three occlusions.
For groupwise comparisons, bilinguals are sometimes classified as dominant in one or the other language. However, language dominance is continuous, not dichotomous (e.g., Grosjean [2001] and, in this chapter, Section 29.2), and discrete classifications of language dominance can obscure distinct patterns of between-speaker variation. In order to investigate individual variation in the production patterns of bilinguals, several studies have incorporated gradience into the variable of language dominance (Amengual, 2016b; Amengual & Chamorro, 2015; Baird, 2015; de la Fuente Iglesias & Pérez Castillejo, 2020). For instance, Pillai scores and Euclidean distances have been used to quantify the acoustic distance between two vowel targets, and these individual-speaker values have been correlated with a continuous measure of language dominance. An example is a study of the perception and production of the Galician mid vowel contrasts (Amengual & Chamorro, 2015). A Pillai score representing the difference between the Galician /e/–/ɛ/ and /o/–/ɔ/ vowel clusters was calculated for each individual speaker, whereby the higher the Pillai score, the lower the degree of overlap, and the larger the distinction, between the two vowel clusters. The results indicated that Pillai scores were smaller for Spanish-dominant bilinguals than for Galician-dominant bilinguals, and correlations between the individual language dominance scores (based on the BLP) and the Pillai scores yielded a significant positive correlation for the Galician-dominant group.
In other words, the Galician-dominant bilinguals who were more extremely dominant in Galician were found to maintain a lower degree of overlap, and thus a larger distinction between these Galician-specific mid vowel contrasts, as shown in Figure 29.2.

Figure 29.2 Individual Pillai scores as a measure of /o/–/ɔ/ vowel merger plotted as a function of individual speakers’ BLP score. Fitted lines show Spanish-dominant and Galician-dominant bilinguals.
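For readers unfamiliar with the measure, a Pillai score is the Pillai trace from a one-way MANOVA comparing two vowel clusters on their (F1, F2) values: it runs from 0 (complete overlap, i.e., merger) to 1 (complete separation). Below is a minimal sketch in Python with NumPy, using simulated formant tokens; the data, the specific formant values, and the function name are illustrative assumptions, not values from Amengual and Chamorro (2015).

```python
import numpy as np

def pillai_score(vowel1, vowel2):
    """Pillai trace for a two-group MANOVA on (F1, F2) tokens.
    Values near 0 indicate merged (overlapping) vowel clusters;
    values near 1 indicate well-separated clusters.
    vowel1, vowel2: arrays of shape (n_tokens, 2)."""
    data = np.vstack([vowel1, vowel2])
    grand_mean = data.mean(axis=0)
    H = np.zeros((2, 2))  # between-group (hypothesis) SSCP matrix
    E = np.zeros((2, 2))  # within-group (error) SSCP matrix
    for group in (vowel1, vowel2):
        diff = group.mean(axis=0) - grand_mean
        H += len(group) * np.outer(diff, diff)
        centered = group - group.mean(axis=0)
        E += centered.T @ centered
    # Pillai trace: trace of H(H + E)^-1
    return float(np.trace(H @ np.linalg.inv(H + E)))

# Simulated (F1, F2) tokens in Hz for a speaker who separates /o/ and /ɔ/
rng = np.random.default_rng(1)
o_tokens = rng.normal([400, 850], 25, size=(20, 2))
open_o_tokens = rng.normal([550, 950], 25, size=(20, 2))
print(round(pillai_score(o_tokens, open_o_tokens), 2))  # high, near 1
```

In a study like the one described above, one such score is computed per speaker per contrast and then correlated with that speaker’s continuous dominance score (e.g., the BLP).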
These gradient effects of language dominance complement those reported in Section 29.2.2 (Baird, 2015, 2021) on the prosody of Spanish-K’ichee’ bilinguals in Guatemala. Baird advises that “categorical declarations of the role of language dominance in linguistic outcomes fall shorter of the true nature of bilingualism than cases in which language dominance is viewed as continuous” (Baird, 2021, p. 512). Apart from variation in prior language experience, factors contributing to language dominance effects in bilingual speech production are also likely to be correlated and interact with other relevant variables that have been included in bilingual speech research. These include variables such as language mode, code-switching, and the pronunciation of cognates (see Chapter 30, this volume). For instance, Henriksen et al. (2021) investigated the production of intervocalic voiced stops in Afrikaans and Spanish by Afrikaans-Spanish bilinguals in Patagonia, Argentina. The target voiced stops produced in unilingual contexts (Spanish or Afrikaans) were compared to productions in code-switched speaking contexts. The results of their production experiment showed that these bilinguals produced distinct categories in each language and that code-switching affected the production of the target sounds asymmetrically, such that L1 Afrikaans influenced the production of L2 sounds, but not vice versa. More specifically, code-switched Spanish targets were found to display less lenition of intervocalic /b, d, g/ (i.e., more influence from Afrikaans), whereas code-switched Afrikaans targets resisted influence from Spanish.
This asymmetry as a function of language dominance seems to be persistent, as shown in other studies that investigate language mode and code-switching effects in bilingual speech (Amengual, 2018, 2021; Olson, 2017, 2019; Simonet, 2014; Tsui, Tong, & Chan, 2019). For example, Amengual (2018) examined the effects of language mode on the acoustic realization of Spanish and English laterals by four groups of Spanish-English bilinguals that differed in their degree of language dominance, across three sessions (monolingual English, monolingual Spanish, and bilingual English/Spanish). The results indicated that the lateral production of these bilinguals differed as a function of the experimental session, that is, the speech production setting in which the words were uttered: when in bilingual mode, all four groups of bilinguals produced a less target-like lateral in their nondominant language in comparison to their productions in the unilingual session, whereas the lateral categories in their dominant language were not affected by the experimental session. These asymmetries as a result of language activation (language mode, code-switching, cognates) support the notion that a bilingual’s situational language context warrants consideration within models of bilingual speech production.
To close this section, it should be mentioned that, although there is robust evidence of the effects of language dominance on bilingual speech production and its power to explain the degree of cross-linguistic influence in the speech of bilingual individuals, in certain cases language dominance may fail to predict group differences or to capture certain phonological features. Reference Amengual and SimonetAmengual and Simonet (2020) compared the effects of language dominance on the acoustic realization of the Catalan [a]–[ə] alternation (a phonological process induced by lexical stress) and the Catalan mid vowel contrasts /e/–/ɛ/ and /o/–/ɔ/ (two phonemic contrasts). The results of two production experiments indicated that even though there were significant differences between Spanish-dominant and Catalan-dominant bilinguals in their production of the Catalan mid vowels, these same language dominance effects were not found with respect to vowel reduction.
Complementing this study, Reference Simonet and AmengualSimonet and Amengual (2020) analyzed the acoustic realization of the Spanish and Catalan unstressed /a/, which surfaces as [a] in Spanish but is reduced to schwa, [ə], in Catalan. The results indicated that the Catalan unstressed /a/, which was similarly reduced to schwa in the speech of all participants, became slightly more similar to the Spanish unstressed /a/ (i.e., it had a higher F1) when produced alongside Spanish words (bilingual setting) than when produced in a Catalan unilingual setting. There were no consistent effects of language dominance, and the effects of language mode did not interact with the linguistic experience of the bilingual participants. In a similar experimental paradigm, Reference SimonetSimonet (2014) also did not find effects of language dominance. This study examined the production and perception of the Catalan mid-back vowel contrast /o/–/ɔ/ by a group of highly proficient, early-onset Catalan-Spanish bilinguals. These bilinguals were divided into three groups, as a function of their language dominance score: a first group with ten Catalan-dominant bilinguals, a second group with ten Spanish-dominant bilinguals that were considered to be more “balanced,” and ten Spanish-dominant bilinguals who were more strongly dominant in Spanish than the second group. This study manipulated the level of activation of each language as the Catalan-Spanish bilinguals produced words in Catalan and Spanish in two different experimental sessions: a unilingual session and a bilingual one. The results from the production task showed that both Catalan vowels /o/ and /ɔ/ were affected by the presence of Spanish words in the bilingual environment in that they had lower first-formant (F1) frequencies in the bilingual session in comparison to the unilingual session. 
The results also indicated that all three participant groups were similarly affected by the activation of both languages in that the Catalan mid-back vowels shifted toward Spanish /o/ when both languages were activated, independently of the patterns of language dominance of the speakers. Considering these combined results, we should be careful when making overly general claims about the effects of language dominance on bilingual speech production, as language dominance may impact the patterns of cross-linguistic influence in the sound systems of bilinguals to a different degree, depending on the phonological variables under investigation (e.g., language-specific phonological contrasts, allophonic distributions, primary or secondary acoustic cues, etc.). For details, see Reference Kartushina, Frauenfelder and GolestaniKartushina et al., 2016; Reference Tobin, Nam and FowlerTobin et al., 2017; the discussion of Reference Kornder and MennenKornder and Mennen, 2021 in Section 29.2.5; and the discussion of Reference Kartushina and MartinKartushina and Martin, 2019 in Section 29.5.
29.4 Language Dominance and Bilingual Speech Perception and Processing
It is important to note that in addition to speaking with an “accent,” that is, displaying traces of cross-linguistic influence in their L1 and L2 speech, bilingual individuals also “hear with an accent” (Reference Jenkins, Strange, Polka, Lubinski and DawisJenkins, Strange, & Polka, 1995). An increasing number of studies have incorporated language dominance as a variable to explain cross-linguistic influence in bilingual speech perception and processing, demonstrating that this perceptual accent is modulated by language dominance (Reference Amengual and ChamorroAmengual & Chamorro, 2015; Reference Antoniou, Tyler and BestAntoniou, Tyler, & Best, 2012; Reference Black, Joanisse and RafatBlack, Joanisse, & Rafat, 2020; Reference Carrasco-Ortíz, Amengual and GriesCarrasco-Ortíz, Amengual, & Gries, 2021; Reference Chan, Johnson and BabelChan, Johnson, & Babel, 2020; Reference Hazan and BoulakiaHazan & Boulakia, 1993; Reference OsborneOsborne, 2021; Reference Ramírez and SimonetRamírez & Simonet, 2018).
In a perception study designed to test the effects of language dominance and language mode, Reference Antoniou, Tyler and BestAntoniou et al. (2012) investigated the perception of word-initial and word-final English and Greek voiced and voiceless stops by L2-dominant Greek–English bilinguals, and compared their perceptual abilities to those of English monolinguals and Greek monolinguals. The results of a categorization task with goodness ratings and a categorical AXB discrimination task showed that Greek-English L2-dominant early bilinguals are very accurate, relative to monolinguals in either language, in identifying short-lag and long-lag stops in Greek and English. In the perception of language-specific phonetic VOT ranges of Greek and English, the findings indicated that language dominance influenced bilinguals’ discrimination performance, even when the listeners were tested in the L2 and were operating in an L1 language mode. The authors suggest that when faced with a taxing task, the dominant language of a bilingual individual is the default setting in speech perception.
In a comparable perception study, Reference Ramírez and SimonetRamírez and Simonet (2018) examined the perceptual discrimination of the /ʎ/–/ʒ/ contrast in Majorcan Catalan by two groups of Catalan-Spanish bilinguals classified as a function of their language dominance. This consonantal contrast in Majorcan Catalan does not have a direct equivalent in Spanish. The results from an odd-item-out AXB perceptual task indicate that Spanish-dominant bilinguals were less accurate than Catalan dominants in their discrimination of the /ʎ/–/ʒ/ contrast. While Catalan-dominant participants discriminated any of the pairs that involved [ʒ] very accurately, this was not the case for the Spanish-dominant bilinguals who were less sensitive to this contrast. These findings demonstrate that language dominance modulates the perceptual discrimination of the sounds involved in the Majorcan Catalan /ʎ/–/ʒ/ contrast.
Returning to the Catalan mid vowel contrasts, two studies in the same Majorcan Catalan-Spanish bilingual community examined the perception and processing of the Catalan mid vowels /e/–/ɛ/ and /o/–/ɔ/. In Reference AmengualAmengual (2016b), sixty early Spanish-Catalan bilinguals in Majorca completed a categorical AXB discrimination task and a picture-naming task to examine the perception and production of the Catalan mid-front and mid-back vowel contrasts. The results showed that the Catalan-specific mid vowels were more susceptible to discrimination difficulties than other vowel contrasts in the language, regardless of the participants’ language dominance. Reference AmengualAmengual (2016d) further explored the perception and processing of these mid vowels by the same bilingual participants. Results from binary forced-choice identification, AX discrimination, and lexical decision tasks indicated that even though these bilinguals demonstrated high accuracy in perceptual identification and discrimination tasks, they had difficulties distinguishing between words and nonwords in a lexical decision task, with Spanish dominants exhibiting higher error rates than Catalan dominants. In a follow-up study, Amengual investigated the phonetic production and processing of the Catalan mid-back vowel contrast by a group of Catalan-dominant and Spanish-dominant highly proficient early Spanish-Catalan bilinguals (Reference AmengualAmengual, 2016c). Analyses of individual data showed that in addition to being a strong predictor of the acoustic distance and overlap maintained between both phonemes, the degree of language dominance had an effect on how these mid vowels were processed in the lexical decision task: those bilinguals who were more Spanish dominant displayed higher error rates and slower reaction times when responding to words and nonwords.
Studies on the perception of the mid vowel contrasts in other Romance languages, such as Galician and Portuguese (Reference Amengual and ChamorroAmengual & Chamorro, 2015; Reference OsborneOsborne, 2021), also provide evidence of language dominance effects on bilingual speech perception. Results from binary forced-choice identification and AX discrimination tasks on the Galician mid vowel contrasts by fifty-four early Spanish-Galician bilinguals showed that Spanish-dominant bilinguals had great difficulties in discriminating between the Galician-specific mid vowel contrasts, whereas Galician-dominant bilinguals demonstrated a robust categorical identification of the two mid vowel phonemes (Reference Amengual and ChamorroAmengual & Chamorro, 2015). Figure 29.3 shows the proportion of /e/ and /o/ responses in the binary forced-choice identification task as a function of language dominance, revealing that Galician-dominant bilinguals exhibited clear categorization of two distinct mid vowel phonemes, whereas the trajectory of the Spanish-dominant group indicates that these bilinguals did not accurately identify the stimuli along the /e/–/ɛ/ and /o/–/ɔ/ continua.

Figure 29.3 Identification of stimuli along the /e/–/ɛ/ and /o/–/ɔ/ continua as a function of language dominance (Spanish-dominant and Galician-dominant).
A similar perceptual pattern was found in a study that examined the sensitivity of heritage speakers of Brazilian Portuguese toward the Portuguese mid vowel contrasts /e/–/ɛ/ and /o/–/ɔ/ (Reference OsborneOsborne, 2021). These bilinguals, who were born and/or grew up in the United States, participated in an AXB discrimination task and a two-alternative forced-choice (2AFC) identification task to test their sensitivity to the Portuguese mid vowel contrasts. Based on the language dominance distribution of the participant sample, participants were divided into two groups: a group with less dominance in English (lower English dominant, LED) and another group that was more dominant in English (higher English dominant, HED). The results of this study showed that the HED speakers had more difficulties than the LED speakers in categorizing the Portuguese /o/–/ɔ/ contrast, but these same effects were not found in relation to the /e/–/ɛ/ vowel pair. These results confirm that language dominance can predict certain, but not all, difficulties in the perception of language-specific vowel contrasts.
In a study examining the acquisition of allophonic alternations, rather than phonemic contrasts, Reference Black, Joanisse and RafatBlack et al. (2020) examined perceptual accuracy for Spanish approximants versus voiced stops [β]/[b], [ð]/[d], [ɣ]/[g] in an AXB discrimination task. Participants were twenty-three English-speaking learners of L2 Spanish and ten Spanish-speaking learners of L2 English. The researchers found a significant positive correlation between language dominance, as measured by the Bilingual Language Profile (BLP), and perceptual accuracy, with accuracy scores improving as BLP scores moved along the continuum from strong English dominance toward strong Spanish dominance.
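The analysis pattern in this kind of study, correlating a continuous dominance score with per-participant accuracy, can be sketched in a few lines. The snippet below is purely illustrative: the dominance scores and accuracy values are invented, not taken from Black et al. (2020); only the general design (a Pearson correlation between a signed dominance score and discrimination accuracy) mirrors the approach described above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: signed dominance scores (negative = English-dominant,
# positive = Spanish-dominant) and AXB discrimination accuracy (proportion correct).
dominance_scores = [-150, -90, -40, 10, 60, 120]
accuracy = [0.55, 0.61, 0.70, 0.74, 0.82, 0.88]

r = pearson_r(dominance_scores, accuracy)
print(f"r = {r:.2f}")  # a positive r: accuracy rises with Spanish dominance
```

A positive coefficient on data like these corresponds to the finding reported above: the closer a participant sits to the Spanish-dominant end of the continuum, the higher their discrimination accuracy.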
Recalling our note of caution on making overly general claims about the effects of language dominance on bilingual speech production (see Section 29.3), a similar caveat applies to bilingual speech perception. In several studies, language dominance has not been able to predict bilinguals’ accuracy in identifying ambiguously pronounced words in a lexical decision task (Reference Chan, Johnson and BabelChan et al., 2020) or to modulate the effects of orthographic and phonological similarity in bilingual word recognition (Reference Carrasco-Ortíz, Amengual and GriesCarrasco-Ortíz et al., 2021). Still more research is needed to support or reject the argument that, at least by certain metrics, language dominance has little impact on perceptual flexibility (Bruggeman & Cutler, 2020; Chan et al., 2020).
29.5 L3 and Dominance
When bilinguals learn a third language (L3), what are the effects on their other languages? To answer this question at a relevant level of granularity, one must consider both short-term effects and long-term effects, while keeping in mind the dynamics of relative use of the other two languages, that is, potential shifts for these languages in terms of domain-based dominance.
As an example of work along these lines, Reference Kartushina and MartinKartushina and Martin (2019) designed a study of L3 vowel production among bilinguals in study-abroad contexts. Participants were ten Spanish-Basque from-birth or early sequential bilinguals from San Sebastián, Spain. Half were slightly Spanish dominant and half were roughly balanced; all were between sixteen and seventeen years of age. All were enrolled in a two-week Erasmus Study Abroad program in the Netherlands, which featured six to eight hours per day of debates, discussions, and presentations in English, along with use of English with Dutch host families. Tests of vowel production in Spanish and Basque conducted one day after the end of the program revealed a drift of F1 in both languages, toward higher (English-like) F1; no such drift was observed for F2 values. Interestingly, different levels of drift were associated with the dominance relationship of their early-learned languages. The balanced bilinguals showed drift toward English norms in both Spanish and Basque, whereas the Spanish dominants, who on average used Spanish 30 percent more often than Basque, exhibited drift only in their Basque production. Overall, participants who exhibited the greatest progress in their L3 English pronunciation also exhibited less assimilatory drift (toward English values) in their native languages.
Four months after this first round of testing, with no ongoing exposure to or use of English, the participants’ F1 values were examined again. As expected, a “return drift” to norms of native production was observed. However, this phenomenon was modulated by dominance. Balanced bilinguals showed a return to native F1 values in both their Spanish and their Basque, while the F1 values of those who were Spanish dominant did not return fully to Basque-like norms. Thus, four months after cessation of intensive experience in a third language, only the nondominant of the first two languages remained impacted by the third.
Recalling our earlier discussion of multiple factors in phonetic drift (Section 29.2.5), Reference Kartushina and MartinKartushina and Martin (2019) explored the relationship of language-switching frequency to the degree of assimilation to English F1 values and to the subsequent degree of return drift to native values. Frequency of language switching was found to affect some phonetic features of the L1 differently for balanced Spanish-Basque bilinguals versus Spanish dominants. For other features, more switching between Spanish and Basque among the balanced bilinguals was associated with smaller magnitudes of return drift to native norms in the two languages. Among the Spanish-dominant bilinguals, more switching was associated with a smaller return drift in the direction of Basque, possibly because in language switching contexts Basque was used less often than Spanish, thus reducing the “‘opportunity’ to recover” from the influence of L3 English (Reference Kartushina and MartinKartushina & Martin, 2019, p. 16).
29.6 Future Directions
Looking forward to further studies of dominance effects in the speech production, perception, and processing of bilinguals, we suggest four areas where research can be fruitfully directed. First, additional longitudinal studies hold the promise of advancing the field. Consistent with and building upon works mentioned in this chapter, the aim of longitudinal research programs would be to take full measure of the dynamic character of dominance by looking for effects of dominance shifts in the short term (weeks and months) and at effects of dominance shifts and switches in the long term (years and decades). In so doing, the effects of dominance can be properly situated within the big picture of bilingualism as a life-span phenomenon punctuated by situational, transitory language-relevant events (e.g., Reference FrancisFrancis, 2021; Reference Nicoladis and MontanariNicoladis & Montanari, 2016; Reference Røyneland and BlackwoodRøyneland & Blackwood, 2021).
Second, recalling the findings of Reference Baird, Willis, Martín Butragueño and Herrera ZendejasBaird (2015, Reference Baird2021), Reference Kartushina and MartinKartushina and Martin (2019), and Reference Kornder and MennenKornder and Mennen (2021), the extent of dominance effects in the phonetics and phonology of bilinguals warrants attention. Specifically, researchers should investigate dominance effects not only at the segmental level but also at the suprasegmental level, and should seek to characterize such effects in each of the languages of a bilingual, not just one.
Third, it is important to keep in mind that different assessments of dominance, be they conducted by use of established instruments or by home-grown probes of skills and experiences (see Reference Flege, MacKay and PiskeFlege, MacKay, and Piske [2002] for a notable example), do not all focus on the same domains and dimensions of dominance. Thus, the information available across assessments is qualitatively incommensurable. It also follows that, for a given feature of bilingual speech, for example VOT, some indices of dominance will be better predictors than others, or may not be predictive at all. By the same token, within composite instruments that assess global dominance, some items or modules are intuitively more relevant than others to specifically targeted linguistic features among bilinguals. For example, relative levels of grammatical knowledge in the two languages should not necessarily be hypothesized to correlate with fine-grained acoustic features in the two languages.
Fourth, in studies that involve comparisons of a group of Language X-dominant bilinguals with a group of Language Y dominants, researchers should not limit themselves to examining between-group differences and similarities. For both production and perception of speech, we encourage looking at within-group variable performance as a function of gradations of dominance within the dominant and nondominant languages. The revelatory potential of this approach derives from exploiting the continuous nature of dominance. Importantly, looking at within-group variability as it relates to dominance meaningfully addresses the essential diversity of experience and ability among bilinguals (e.g., Reference Beatty-Martinez and TitoneBeatty-Martinez & Titone, 2021; Reference Luk and BialystokLuk & Bialystok, 2013), which may manifest in individual differences in speech learning and bilingual phonetic and phonological behaviors.
30.1 Introduction
While a large body of research has demonstrated that bilinguals may maintain two distinct phonetic systems in their two languages, it is clear that these two systems are not fully independent. Interaction between these two systems, in which one system influences the other (i.e., unidirectional) or both systems mutually influence one another (i.e., bidirectional), has been found across a range of bilingual populations and contexts. Yet a crucial distinction should be made between long-term and short-term sources of cross-linguistic interaction. Long-term sources involve interaction between long-term phonetic memory representations, often referred to as transfer (Reference AmengualAmengual, 2021). This type of long-term interaction is found in cases of sustained influence over longer periods of time, such as language acquisition and/or immersion, and results in changes in the relatively stable mental representations of a bilingual’s phonetic categories. As noted by Reference SimonetSimonet (2014), empirical studies that examine such long-term interactions often compare participant groups (e.g., novice versus advanced learners) or use longitudinal data (e.g., tracking learners’ phonetic changes across longer periods of time). In contrast, short-term interference refers to the simultaneous or near-simultaneous activation of the competing stable phonetic categories in both languages. Short-term interaction is found in settings or communicative contexts in which momentary (i.e., transient) activation of both languages occurs. Empirical studies of short-term interference often take a within-subjects design, comparing a single participant’s performance across different settings or stimulus types (for additional discussion of transfer and interference, see Reference GrosjeanGrosjean, 2012; Reference SimonetSimonet, 2014).
Relative to the extensive body of research on long-term sources of cross-linguistic transfer, short-term interference in bilingual phonetics is less represented in the prior literature.
Within research on sources and outcomes of short-term phonetic interference, two areas have been the most productive to date: code-switching and bilingual language mode. Broadly, code-switching is defined as the alternation between two languages in a single interaction (e.g., Reference Myers-ScottonMyers-Scotton, 1993). This switch between languages, which may occur in a predominantly monolingual discourse or a predominantly bilingual discourse, represents a near-instantaneous activation of the two languages. Bilingual language mode is defined as the operation of a bilingual in a context that requires the activation of both languages (e.g., Reference Grosjean and NicolGrosjean, 2001) and represents a sustained coactivation of a bilingual’s two languages in a given context or interaction. These two phenomena, code-switching and bilingual language mode, are linked by the fact that both require activation of a bilingual’s two languages in a short time frame. This simultaneous activation of the phonetic representations of both languages presents the opportunity for cross-linguistic interference. The remainder of this chapter will detail the theoretical frameworks, recent research, key concepts, and future directions in these two sources of short-term phonetic interference.
30.2 Code-Switching and Phonetics
Code-switching can be defined as the alternation between two or more languages, or language varieties, in a single interaction (Reference Myers-ScottonMyers-Scotton, 1993). While the general public often perceives code-switching as random (for discussion, see Reference Cheng and ButlerCheng and Butler, 1989) or as reflecting a deficiency in a particular language, a significant body of research has shown that this is not the case. Code-switching is systematic and rule-governed at the syntactic level (e.g., Reference MuyskenMuysken, 2000) and used for a variety of social and pragmatic functions (e.g., Reference ZentellaZentella, 1997). These sophisticated grammatical, social, and pragmatic regularities suggest that, rather than being deficiency-oriented, code-switching serves as a sensitive marker of bilingual competence (Reference Yow, Tan and FlynnYow, Tan, & Flynn, 2018). Moreover, while code-switching is possible for bilingual speakers, not all bilinguals engage in code-switching similarly. Reference OlsonOlson (2024) details a novel tool for measuring code-switching engagement – the Bilingual Code-Switching Profile – and demonstrates that, even within the same community, speakers’ use of code-switching is driven by a variety of contextual and attitudinal factors.
Several distinctions have been used to describe different types of code-switches. At the most basic level, intersentential code-switching can be distinguished from intrasentential, representing switches between different utterances and within a single utterance, respectively. Intrasentential code-switching can be further divided into insertional, alternational, and congruent lexicalization (Reference MuyskenMuysken, 2000). Insertional code-switching consists of embedding a single word or phrase from Language B within an utterance of Language A (30.1a). Alternational code-switching includes a single switch within an utterance from Language A to Language B (30.1b). Congruent lexicalization, possible when two languages share a large part of their grammatical structure, occurs when elements from either language may be inserted into the shared grammatical frame (30.1c) (see Reference MuyskenMuysken, 2000).
a. Spanish-English
Fue como un challenge para mi.
‘It was like a challenge for me.’
Source: Reference Pérez CasasPérez Casas (2008, p. 121)
b. Spanish-English
Es como la lista de things to do before you ‘kick the bucket’.
‘It’s like the list of things to do before you ‘kick the bucket’.’
Source: Reference Pérez CasasPérez Casas (2008, p. 193)
c. Dutch–Sranan
wan heri gedeelte de ondro beheer fu gewapende machten.
‘One whole part is under control of the armed forces.’
Source: Bolle (1994, cited in Reference MuyskenMuysken, 2000, p. 139)
Particularly relevant for an examination of bilingual phonetics and phonology is that a number of authors have distinguished between code-switches, nonce borrowings, and loanwords on the basis of the morphosyntactic and phonological integration into the host language (for discussion, see Reference Bullock, Toribio, Bullock and ToribioBullock and Toribio, 2009b). Code-switched tokens are not integrated into the host language, and thus require a switch between phonetic systems. In contrast, loanwords are those that are integrated into the host language, and may appear in monolingual speech. Reference PoplackPoplack (2012, p. 644) suggests a more flexible interpretation of this distinction, noting that while phonological integration correlates with lexical diffusion, “phonological integration is often gradual and highly variable.” In contrast, morphological integration is largely categorical and, as such, represents a better diagnostic (Reference Poplack, Sankoff and MillerPoplack, Sankoff, & Miller, 1988). Others suggest that borrowing and code-switching may be best represented on a continuum, rather than a categorical distinction (e.g., Reference Bullock, Bullock and ToribioBullock, 2009).
While code-switching has been a fertile ground for linguistic research, the primary focus has been on the lexical, syntactic, and sociopragmatic features of code-switching (e.g., Reference Bullock, Bullock and ToribioBullock, 2009b), with phonetics and phonology being less studied. However, the particular features of different linguistic components will undoubtedly play a role in the potential outcomes of code-switching. For example, lexical items come from either Language A or Language B, excepting cases of cognates and borrowings. In contrast, many phonetic features, such as aspiration duration (e.g., voice onset time [VOT]) or the shape of a pitch contour, are gradient in their interpretation. As such, while code-switching at the lexical or the morphosyntactic levels may be examined as categorical (either Language A or Language B), the impact of code-switching on phonetic cues may be examined along a continuum (from Language A to Language B). Thus, an examination of the phonetic consequences of code-switching may provide a sensitive measure with which to examine the underlying mechanisms involved in language switching that are effectively obscured when examining other linguistic components.
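The contrast between categorical and gradient measures drawn above can be made concrete with a small sketch. Everything in it is hypothetical: the reference VOT means are invented round numbers, not values from any of the cited studies. It simply shows how a gradient cue can be expressed as a normalized position on a continuum from Language A's norm to Language B's norm, which is what allows phonetic data to reveal partial shifts that a categorical (A-or-B) coding would obscure.

```python
def continuum_position(value, norm_a, norm_b):
    """Normalized position of an acoustic measurement between two language norms:
    0.0 = exactly at Language A's norm, 1.0 = exactly at Language B's norm.
    Values between 0 and 1 reflect partial, gradient shifts; values outside
    that range indicate productions beyond either norm."""
    return (value - norm_a) / (norm_b - norm_a)

# Hypothetical reference means for a voiceless stop:
# Spanish-like short-lag ~15 ms, English-like long-lag ~60 ms.
SPANISH_VOT, ENGLISH_VOT = 15.0, 60.0

# A code-switched English token measured at 42 ms lies 60% of the way
# from the Spanish norm to the English norm, i.e., shifted toward Spanish.
print(continuum_position(42.0, SPANISH_VOT, ENGLISH_VOT))
```

A categorical lexical coding would simply label the 42 ms token "English"; the continuum measure records how far it has drifted toward the Spanish norm.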
30.2.1 Recent Research on Code-Switching and Phonetics
Research on the phonetics of code-switching generally examines productions at (or near) the point of switch and seeks to compare the phonetic outcomes of switched and nonswitched productions. When considering the largely gradient measures of different acoustic and corresponding articulatory cues, Reference Antoniou, Best, Tyler and KroosAntoniou et al. (2011, pp. 558–559) detail several possible outcomes that may result from code-switching, which serve as a starting point for this discussion. Unidirectional interference occurs when Language A impacts Language B, but the reverse is not found. Bidirectional interference occurs when Language A impacts Language B and Language B impacts Language A. Finally, Reference Antoniou, Best, Tyler and KroosAntoniou et al. (2011) note that there may be cases in which there is no interaction between the two languages as a result of code-switching. Complementing these possible outcomes, Reference Bullock, Toribio, Isurin, Winford and de BotBullock and Toribio (2009a) note that in addition to convergence (i.e., unidirectional or bidirectional interference), it is possible that speakers may seek to “enhance” contrasts between their two languages, termed divergence (Reference Bullock, Toribio, Isurin, Winford and de BotBullock & Toribio, 2009a, p. 202). Finally, and drawing on notions of the (un)predictability of code-switches relative to nonswitched language, Reference OlsonOlson (2012) suggests that hyperarticulation, which serves to enhance specific acoustic cues (Reference Lindblom, Hardcastle and MarchalLindblom, 1990), may also result from code-switching.
30.2.2 Segmental Phonetics
30.2.2.1 Voice Onset Time
In research on the effects of code-switching on segmental production, the feature of VOT has predominated. Defined as the temporal difference between the release of a stop consonant and the onset of voicing of the following segment (e.g., Reference Lisker and AbramsonLisker & Abramson, 1964), VOT has been a useful feature for this line of research: it is well documented, it differs cross-linguistically, and it provides a gradient measure. For example, English and Spanish both employ a bipartite distinction between voiceless and voiced stop consonants, but the acoustic and articulatory characteristics of the two languages differ. English voiceless stops are considered long-lag (VOT: 30–120 ms), while Spanish voiceless stops are short-lag (VOT: 0–30 ms) (Reference Lisker and AbramsonLisker & Abramson, 1964). This cross-linguistic difference has been exploited in a number of studies on code-switching and phonetics (e.g., Reference Balukas and KoopsBalukas & Koops, 2015; Reference Bullock, Toribio, González, Dalola, O’Brien, Shea and ArchibaldBullock et al., 2006; Reference Bullock, Toribio, Isurin, Winford and de BotBullock & Toribio, 2009a; Reference González LópezGonzález López, 2012; Reference Grosjean and MillerGrosjean & Miller, 1994; Reference Khattab, Bullock and ToribioKhattab, 2009; Reference OlsonOlson, 2016a; Reference Piccinini and ArvanitiPiccinini & Arvaniti, 2015; Reference Schwartz, Balas and RojczykSchwartz, Balas, & Rojczyk, 2015; Reference Šimáčková, Podlipský, Wolters, Livingstone, Beattie, Smith, MacMahon, Stuart-Smith and ScobbieŠimáčková & Podlipský, 2015).
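The Lisker and Abramson ranges just cited can be encoded as a toy classifier. This is a minimal sketch for exposition only, not an analysis tool used in any of the studies above; it simply hard-codes the short-lag (0–30 ms) versus long-lag (30–120 ms) boundary from the text, plus the conventional label for negative VOT.

```python
def classify_vot(vot_ms: float) -> str:
    """Label a stop-consonant VOT measurement (ms) using the categories of
    Lisker & Abramson (1964): short-lag 0-30 ms (e.g., Spanish voiceless
    stops), long-lag above 30 ms (e.g., English voiceless stops)."""
    if vot_ms < 0:
        return "prevoiced"  # negative VOT: voicing begins before the release
    if vot_ms <= 30:
        return "short-lag"
    return "long-lag"

# A Spanish-like /p/ at 12 ms versus an English-like /p/ at 65 ms:
print(classify_vot(12))  # short-lag
print(classify_vot(65))  # long-lag
```

In actual code-switching studies, of course, the interest lies precisely in tokens that drift across or toward this boundary, which is why VOT is analyzed as a continuous measure rather than through discrete labels like these.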
Outcomes across this set of studies have varied widely, and cases of unidirectional interference (e.g., Reference Antoniou, Best, Tyler and KroosAntoniou et al., 2011), bidirectional interference (e.g., Reference Bullock, Toribio, Isurin, Winford and de BotBullock & Toribio, 2009a), no interference (e.g., Reference Grosjean and MillerGrosjean & Miller, 1994), and divergence (e.g., Reference Piccinini and ArvanitiPiccinini & Arvaniti, 2015) have all been reported.
Yet, in the face of such diversity in outcomes, a few unifying patterns have been found. First, unidirectional interference has been the most common finding in the literature. When unidirectional interference is reported, the language with long-lag VOTs shifts in the direction of the short-lag language (Antoniou et al., 2011; Balukas & Koops, 2015; Bullock et al., 2006; Bullock & Toribio, 2009a; Olson, 2016a). This unidirectional interference, impacting the long-lag language, has been found for speakers dominant in either the long-lag or the short-lag language (e.g., Bullock & Toribio, 2009a), regardless of the original order of acquisition (e.g., Antoniou et al., 2011). Moreover, this finding has emerged in read speech (e.g., Antoniou et al., 2011) and spontaneous speech (e.g., Balukas & Koops, 2015), as well as in different types of code-switching (e.g., insertional: Olson, 2016a; alternational: Bullock & Toribio, 2009a). Second, when bidirectional interference has been found, the magnitude of the shift differs between the two languages. The degree of shift is generally larger for the long-lag language than the short-lag language (Bullock & Toribio, 2009a; González López, 2012; Olson, 2016a). Olson (2016a), for example, found a mean shift of 0–5 ms in the short-lag language (Spanish) and a shift of 5–19 ms in the long-lag language (English).
Moreover, bidirectional interference, when present, is often shown for only a subset of participants – namely, those participants who produced the shortest VOTs for nonswitched tokens in the short-lag language (e.g., Olson, 2016a). To date, few studies have found a lack of interference at (or near) the point of switch. For example, Grosjean and Miller (1994) failed to find any evidence of interference at the point of switch, although Bullock and Toribio (2009a) note that the stimuli used (cross-linguistic homophones) may have encouraged careful speech (see also Šimáčková & Podlipský, 2018). Similarly, while Piccinini and Arvaniti (2015) found a degree of divergence for short-lag code-switched tokens, whereby code-switched tokens were produced with shorter VOTs than nonswitched tokens, they note that the effect was small and the nonswitched tokens were produced with relatively long VOTs.
30.2.2.2 Vowel Quality
Beyond the effects of code-switching on VOT, a limited number of studies have begun to examine vocalic production in code-switched tokens, with differing results. Indicative of cross-linguistic interference, Elias, McKinnon, and Milla-Muñoz (2017) examined Spanish-English bilinguals and found that unstressed vowels in Spanish code-switched tokens were produced with greater centralization (i.e., more English-like) than unstressed vowels in nonswitched speech. In contrast, Muldner et al. (2017) failed to find evidence for convergence in the production of vowel quality during code-switching. Finally, most representative of divergence, Ojeda, de Prada Pérez, and Wayland (2018, p. 9) found some evidence for phonetic-level shifts in which both Spanish /u/ and English /u/ became more like the “standard values for each respective language” during code-switching. Although these studies provide an important first step in examining the effects of code-switching on vowel production, methodological considerations may provide additional context for these results. For example, Muldner et al. (2017) used a carrier phrase approach, which could impact phonetic production (for discussion, see Bullock & Toribio [2009a] and Olson [2019]). These mixed findings highlight the need for further research in this area and call for careful examination and contextualization of study characteristics and results.
30.2.2.3 Phonological Processes
The body of research detailed so far on segmental phonetics and code-switching has relied on word-internal, and thus language-internal, phonetic production. Yet, as many phonological processes occur across word boundaries, code-switching presents the opportunity to examine such processes across language boundaries as well. To date, this area has remained understudied relative to the other segmental phenomena. Olson (2019) examined the potential for phonological processes (i.e., assimilation and spirantization) that normally occur across word boundaries in a single language (i.e., Spanish) to also occur across language boundaries in cases of code-switching. Results again showed an asymmetrical pattern of code-switching effects. Specifically, for the phonological processes examined – spirantization of intervocalic voiced stops and regressive voicing assimilation, both of which are typical in Spanish – results showed an impact of code-switching on the production of Spanish tokens, but not English tokens. Spanish tokens became more “English-like” in code-switched productions relative to monolingual productions. Similar results have been found for lenition in Afrikaans-Spanish bilinguals (Henriksen et al., 2021).
Taken as a whole, this body of research suggests that code-switching, which requires the near-simultaneous activation of both languages, serves as a source of cross-linguistic interference. Moreover, while bidirectional interference is possible, the outcomes or evidence for such interference may be impacted by the characteristics of the phonetic feature and a given speaker’s language dominance (see Section 30.6 for discussion).
30.2.3 Code-Switching and Suprasegmental Phonetics
Much like segmental features, suprasegmental phonetic features, such as intonation and duration, are language-specific in nature. For example, different languages may employ different intonational contours for similar types of utterances. Yet, suprasegmental features are also modulated by a number of communicative and cognitive factors in similar ways across many languages. For example, greater pitch modulation has been found in infant-directed speech (e.g., Fernald & Simon, 1984), clear speech (e.g., Han, Munson, & Schlauch, 2021), and for tokens with decreased predictability (e.g., Turk, 2010). As such, research on the potential suprasegmental effects of code-switching has examined both language-specific features (e.g., contour shape) and more universal prosodic patterns (e.g., pitch range).
Considering language-specific phonological patterns, several studies have examined whether code-switched items, usually single-word insertional code-switches, are produced with the intonational contour of the language of the code-switched token (Language B) or that of the surrounding utterance (Language A). While sample sizes in early studies were small, results were mixed. Suggesting that the language of the code-switched token may determine the intonational contour, Queen (2001), in her examination of German–Turkish code-switches, found that intonation contours for code-switched tokens were produced according to the language of the lexical item: German lexical items inserted into Turkish utterances were produced with German intonation contours. In contrast, Birkner (2004), examining Portuguese–German code-switching, found the opposite pattern, with the code-switched token adopting the intonation contour of the surrounding language (see also Aly, 2017).
Taking into account more universal prosodic patterns, Olson (2012) examined both pitch range and duration in Spanish-English code-switching and found that code-switched tokens were produced with a significantly higher pitch range and longer duration than comparable nonswitched tokens. Expanded pitch range and increased duration are found in both Spanish and English, as dictated by communicative factors (e.g., clear speech). Building on this work, Olson (2016b) broadly replicated the finding of increased pitch range and duration for code-switched tokens but found that such effects were modulated by language mode (for a discussion of language mode, see Section 30.3). Reports of expanded pitch range and duration have been found in several language pairings (English–Hindi: Rao et al., 2018; Basque–Spanish: Aly, 2017 [partial support]; Spanish-English: Olson, 2012, 2016b), although such results have not been entirely consistent (see Piccinini & Garellek, 2014). Addressing these inconsistencies, Aly (2017) suggests that increases in pitch and duration may be modulated by other factors such as language dominance (see also Olson, 2016b).
Seeking to provide a theoretical explanation for findings of expanded pitch range and duration, Olson (2012, 2016b) proposes that code-switches may be hyperarticulated. Within the framework of the Hyper- and Hypoarticulation Theory (H&H Theory; Lindblom, 1990), speakers may modulate articulatory effort, including expanding pitch range and slowing speech rate (i.e., increasing duration), to compensate for both contextual and cognitive factors. For example, less predictable lexical items in a discourse may be hyperarticulated relative to the same lexical items in a more predictable context (Aylett & Turk, 2004; Bell et al., 2002). Grosjean (1988) demonstrated that listeners were most likely to expect nonswitched tokens, and, as such, code-switches may be naturally less predictable (and potentially incur greater cognitive effort) than nonswitched tokens. This framework further explains the differences found in hyperarticulation across differing language modes, as switches in a more monolingual mode (i.e., less predictable) were found to be produced with a greater degree of hyperarticulation than in bilingual mode (i.e., more predictable switches).
30.3 Language Mode
As bilinguals move through different communicative contexts and engage with different interlocutors, their language use shifts to reflect the needs or expectations of these different situations. In some contexts, consciously or unconsciously, a bilingual may employ only one of their languages for the duration of the interaction. For example, a bilingual in a formal context with monolingual speakers may be unlikely to produce bilingual speech. In contrast, when interacting with other bilinguals in informal settings, a bilingual may employ both of their languages, engaging in switching, borrowing, and other bilingual behaviors. These differing degrees of bilingual operation are termed language modes.
For bilinguals, language mode may be best conceptualized as a continuum from monolingual behavior (e.g., production, perception) in Language A to bilingual behavior (Languages A and B) to monolingual behavior in Language B (e.g., Grosjean, 2001). Interaction in one of their two languages may be referred to as monolingual mode, and interaction simultaneously in both languages is referred to as bilingual mode. The theoretical framework of language mode has been developed and refined by Grosjean and colleagues (Grosjean, 1988, 1997, 1998, 2001, 2008; Soares & Grosjean, 1984). While bilingual production may be the most “observable” correlate of language mode, this framework includes both bilingual production and perception (Grosjean, 1998), as well as spoken and written language (Grosjean, 1997). Grosjean (2001) notes that, within a given interaction, language mode may be impacted by the form and content of the message, the situation, and the characteristics of the interlocutors. The form and content of the message refers to which or how much of each language has been used, the topic, and the lexical items. The situation of the interaction includes the physical location, the degree of formality, and the presence of monolingual or bilingual overhearers. Interlocutor characteristics include factors such as the language proficiency of the other interlocutors in an interaction, their language attitudes, the usual mode of interaction, and their relationship to the bilingual.
Important for the current discussion is that language mode has been discussed in terms of the relative levels of activation or inhibition of the two languages. In monolingual mode, while one language is maximally activated, the competing language is relatively deactivated or inhibited. In bilingual mode, both languages are activated to similar degrees. Grosjean (2001) conceptualizes activation in terms of a base and a guest language, in which the base language is always fully or maximally activated and the activation of the guest language is variable (for discussion of the Inhibitory Control Model [Green, 1998], see Grosjean [2001]). Olson (2016a) conceptualizes language mode in terms of the relative levels of availability of the two languages, resulting from a modulation of the activation and/or inhibition of each of the bilingual’s two languages.
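The base/guest activation view can be rendered as a toy sketch: the base language is held at full activation, while the guest language's activation varies along the monolingual-bilingual continuum. The numeric scale and class names here are our illustrative assumptions, not part of Grosjean's proposal.

```python
# Toy sketch of the base/guest activation view of language mode:
# the base language is always maximally activated, while the guest
# language's activation varies along the continuum. The 0.0-1.0 scale
# and the 0.1/0.9 cutoffs are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class LanguageMode:
    base: str
    guest: str
    guest_activation: float  # 0.0 = idealized monolingual, 1.0 = fully bilingual

    def describe(self) -> str:
        if self.guest_activation < 0.1:
            # Truly zero guest activation is unlikely in practice.
            return f"near-monolingual {self.base} mode"
        if self.guest_activation > 0.9:
            return f"bilingual {self.base}-{self.guest} mode"
        return f"intermediate mode ({self.base} base, {self.guest} guest)"

print(LanguageMode("Spanish", "English", 0.05).describe())
print(LanguageMode("Spanish", "English", 0.95).describe())
```

The continuous parameter captures the point made below: experimental manipulations shift speakers along this continuum rather than toggling a binary switch.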
Applying the concept of language mode to experimental research in phonetics and psycholinguistics, a few key considerations should be noted. First, while language mode may be theoretically conceptualized as ranging from monolingual to bilingual modes, operation in a truly monolingual mode, in which one language is fully deactivated, is unlikely (e.g., Spivey & Marian, 1999; for asymmetrical cross-linguistic activation, see Blumenfeld & Marian, 2007). For example, previous research has shown that, even when an entire interaction or experimental paradigm is conducted in Language A, activation of the nontarget language (Language B) remains significant enough to impact bilingual behavior (e.g., Spivey & Marian, 1999). Second, while language mode is responsive to social, interactional, and linguistic factors, experimental paradigms are unlikely to faithfully reproduce all of the relevant social and communicative variables. As such, many of the studies that investigate or manipulate language mode do so by isolating one of the multiple factors hypothesized to control language mode, such as the quantity of each language present (see Footnote 2).
30.4 Recent Research on Language Mode and Phonetics
30.4.1 Linguistic Forms
First, a series of studies have begun to specifically examine language mode via the lens of the linguistic forms in the paradigm. Within this framework, two main sources have been examined: dual language paradigms and cognates. Dual language paradigms are those interactions or experimental paradigms in which both languages are used, but only nonswitched tokens are examined. Dual language paradigms require the activation of both languages in the same interaction, leading to a bilingual mode and the potential for cross-linguistic interference. Nonswitched productions in a dual language paradigm, used to avoid the potential confound of interference due to code-switching, are often compared with single language paradigms (i.e., unilingual sessions) to provide evidence of interference driven by dual language activation. Cognates, specifically noted by Grosjean (2001) as linguistic forms that may impact language mode, are defined as words with “considerable” phonological, semantic, and (often) orthographic similarities between a bilingual’s two languages (Amengual, 2012, p. 518). The overlapping representations for cognates in both languages have been shown to lead to cross-linguistic activation (e.g., Spivey & Marian, 1999) and the potential for cross-linguistic phonetic interference.
A growing body of research, framed specifically in terms of language mode, has shown evidence of a degree of cross-linguistic interference in bilingual mode. These studies have largely operationalized language mode by conducting separate experimental sessions that involved either a single language (i.e., monolingual mode) or both languages (i.e., bilingual mode) and comparing bilingual productions between the two sessions. For example, Simonet (2014) investigated the potential for phonetic cross-linguistic interference in mid-back vowels in Catalan-Spanish bilinguals during bilingual mode. While Catalan employs two mid-back vowels, [o] and [ɔ], Spanish contains only one, [o]. In this study, monolingual mode was induced by having participants produce target materials only in Catalan. Bilingual mode was induced by having participants produce target materials in both Catalan and Spanish. Results demonstrated that productions of the mid-back Catalan vowels differed between the two sessions, with Catalan vowels becoming more Spanish-like during the dual language sessions (i.e., bilingual mode). Extending this line of work, Simonet and Amengual (2020) examined cross-linguistic interference in Catalan-Spanish bilinguals during bilingual mode, specifically with respect to language-specific phonological processes (i.e., vowel reduction). Again, the authors found evidence of cross-linguistic interference in bilingual mode relative to monolingual mode. Notably, these results were found regardless of language experience (see also Amengual, 2018). These findings have been further extended to VOT in trilingual speakers (English-Spanish-Japanese: Amengual, Meredith, & Panelli, 2019).
Similar results have been found in the case of sequential interpreting, a unique type of dual language context, with evidence of a degree of cross-linguistic interference found in the context of interpretation relative to monolingual production (Šimáčková & Podlipský, 2015; although see Šimáčková & Podlipský, 2018).
Considering the effect of cognates, a number of studies have shown that cognate status impacts phonetic production, with cognate tokens being produced with some fine-grained features of the opposite language. For example, Amengual (2012) examined the effect of cognate status on phonetic production in Spanish-English bilinguals from across a wide dominance spectrum. Results demonstrated that Spanish cognate tokens were produced with significantly more English-like (longer) VOTs than noncognate tokens. While significant for all of the different groups of bilinguals, from English-dominant to Spanish-dominant, this effect was overall small in magnitude (3–4 ms) and not found for all participants (for VOT, see also Flege & Munro, 1994; compare with Flege et al., 1998). Similar results for the effect of cognates on phonetic production have been found for vowel quality (Amengual, 2016) and several phonological processes (Brown & Amengual, 2015; Brown & Harper, 2009), and have been reported in controlled laboratory speech (e.g., Amengual, 2012) and naturalistic data (Brown & Amengual, 2015). Moreover, recent research has suggested that the degree of overlap between cognates is likely to condition the outcomes of cross-linguistic phonetic interference (e.g., Gradoville, Waltermire, & Long, 2021). In sum, bilingual language mode resulting from the linguistic forms in a given paradigm, including both dual language contexts and cognates, serves as a potential source for cross-linguistic phonetic interference.
30.4.2 Situation
In their seminal work, Sancier and Fowler (1997) tracked the phonetic production (VOT) of a single Portuguese-English bilingual as she moved between linguistic environments. She was recorded in an English-dominant environment (USA), after two and a half months in a Portuguese-dominant environment (Brazil), and again after four months in an English-dominant environment (USA). They found that phonetic productions, for both English and Portuguese tokens, were more English-like after a stay in the English-dominant environment and more Portuguese-like after a stay in the Portuguese-dominant environment. While these results suggest that moving between different interactional contexts, or recent divergent language experiences, may impact the degree of cross-linguistic interference evidenced, the authors acknowledge that it was unclear whether these differences might be attributed to short-term factors (i.e., interference – any recent experience in an English- or Portuguese-speaking context, regardless of duration) or long-term factors (i.e., transfer – longer-term immersion of multiple weeks or months). Subsequent research, testing bilingual participants (English-Spanish) immediately before and after (i.e., within less than seventy-two hours) a change in physical location (i.e., involving countries with different dominant languages), failed to find any significant impact of physical situation on phonetic production (Olson, 2020; for discussion of physical location, see Grosjean, 2001). Olson (2020) suggested that such results may represent a division between the local context of an interaction (i.e., the immediate surroundings) and the broader societal context (i.e., the language norms of the broader community). Additional evidence from perceptual studies may corroborate this suggestion.
Hay and Drager (2010), for example, found that the presence of region-specific symbols in the experimental environment (e.g., a stuffed kangaroo [Australia] versus a stuffed kiwi [New Zealand]) was sufficient to impact vowel perception in bidialectal speakers. As such, the physical location of an interaction, and most notably the immediate local context, may impact a bilingual’s position along the language mode continuum and shape the degree of cross-linguistic interference in phonetic production and perception.
30.4.3 Interlocutor Characteristics
Finally, while subject to less research than linguistic forms (e.g., dual language contexts and cognates), there is some evidence that interlocutor characteristics may also impact language mode and result in cross-linguistic phonetic interference. In her research on bilingual English-Arabic-speaking children, Khattab (2002a) found that children were more likely to use Arabic with Arabic-dominant interlocutors and English with English-dominant interlocutors. Moreover, the results showed that English tokens produced during the Arabic-dominant sessions evidenced more Arabic phonological features, such as the tapped realization of the rhotic, than in the English-dominant sessions. These observations extended to vowels, consonants, and stress patterns (Khattab [2002b] as cited in Khattab [2002a]). The author suggests that these realizations are driven by interlocutor characteristics, and that children are more likely to produce Arabic-accented English when interacting with Arabic-dominant speakers (Khattab, 2009). Contextualized within language mode, the presence of Arabic-dominant speakers, a key interlocutor characteristic, may result in greater activation of the Arabic linguistic system and greater opportunity for cross-linguistic interference. Further evidence for the role of interlocutor characteristics comes from monolingual perceptual paradigms, particularly when monolingual listeners are familiar with multiple regional or social varieties. For example, listeners’ categorization of stimuli along a PEN–PIN continuum, an age-graded merger in some varieties of Southern American English, depended on the perceived age of the speaker (Koops, Gentry, & Pantos, 2008).
Specifically, listeners were more likely to assume merged categories for older speakers and unmerged categories for younger speakers (for perceived geographic origin, see Niedzielski, 1999). In this case, the perceived characteristics of the speaker may have shifted participants across the language (or dialect) mode continuum.
To date, a fairly limited number of studies specifically address bilingual phonetics within the framework of language mode. It should be noted that, while theoretical conceptualizations of language mode (e.g., Grosjean, 2001) suggest that the linguistic forms, situation, and interlocutor characteristics in an interaction may impact language mode, much of the research on these topics is not specifically framed within notions of language mode. That is, researchers may examine the role of cognate status or interlocutor characteristics, factors that may impact language mode, without reference to language mode itself.
30.5 Interactions between Code-Switching and Language Mode
Given the inherent links between code-switching and bilingual language mode, whereby code-switching may be more prevalent in bilingual mode than in near-monolingual mode, it is relevant to consider the interaction between these two phenomena. Specifically, having demonstrated that insertional code-switching in an otherwise mostly monolingual mode (e.g., Olson, 2016a) and nonswitched productions in a bilingual mode (e.g., Simonet, 2014) both provide opportunities for increased cross-linguistic interference, this line of research has considered the potential for additive effects. Namely, if the effects of code-switching and bilingual mode are additive, one might expect code-switching in bilingual mode to evidence greater degrees of phonetic interference than code-switching in a near-monolingual mode. From a language activation perspective, this additive effect may be conceptualized as combining the near-instantaneous dual activation from code-switching with the more sustained dual activation required by a more bilingual language mode.
Olson (2016a) explored this particular question, examining VOT in Spanish-English bilinguals, and found that the effects were not additive. Specifically, while effects of code-switching on phonetic production were found in both near-monolingual and bilingual mode, there was no difference between code-switched tokens in the two differing modes. In contrast, at the suprasegmental level (Olson, 2016b), an analysis of pitch range and vowel duration largely found significant differences between code-switches in near-monolingual and bilingual modes. While code-switches were produced with greater pitch range and longer vowel durations in both modes, this effect was less pronounced in the bilingual mode than in the near-monolingual mode. Returning to H&H Theory (Lindblom, 1990) and the Smooth Signal Redundancy Hypothesis (Aylett & Turk, 2004), Olson (2016b) suggests that code-switching in bilingual mode may be somewhat more expected and thus less marked prosodically. The divergence between findings at the segmental and the suprasegmental levels again highlights the need to consider the characteristics of the specific feature under examination.
30.6 Key Emergent Variables
Within this growing line of work, two specific notions are emerging as key to determining the presence or extent of cross-linguistic interaction: the characteristics of the phonetic category under examination and the role of language dominance. These two variables represent both language-internal (i.e., inherent characteristics of a given phonological system) and language-external (i.e., speaker characteristics) factors (see Bullock and Toribio [2009a] for discussion).
30.6.1 Phonetic Characteristics
Taken together, the body of research focused on segmental phonetics shows that code-switching and bilingual mode may significantly impact phonetic production, but the degree to which cross-linguistic phonetic interference is found appears to be language- and phoneme-specific. Returning to the findings for VOT, the most common finding appears to be unidirectional interference, whereby code-switching (or bilingual language mode) impacts only the language with long-lag voiceless stops, but not the short-lag language (e.g., Antoniou et al., 2011; Balukas & Koops, 2015; Bullock et al., 2006; Bullock & Toribio, 2009a; Olson, 2016a). While the short-lag language is not immune to interference in cases of code-switching, when present, the effects appear to be consistently smaller than for the long-lag language (e.g., Bullock & Toribio, 2009a; González López, 2012; Olson, 2016a). Moreover, the participants most likely to show effects in the short-lag language are those who produced the shortest VOTs in nonswitched contexts (e.g., Olson, 2016a; for a similar effect in the long-lag language, see Šimáčková & Podlipský, 2018). Similar asymmetrical cross-linguistic interference for VOT has also been shown in studies on dual language paradigms and cognate effects. For example, Spanish-English-Japanese trilinguals were shown to experience phonetic interference in bilingual mode only for English and Japanese tokens, but not Spanish (Amengual et al., 2019).
The authors highlight the fact that “this is a remarkable feat if we consider that more precision may be required to maintain short lag Spanish stops in comparison to a wider span of VOT values for English and Japanese voiceless stops in bilingual and multilingual speech” (Amengual et al., 2019, p. 967).
Discussing such results, Bullock and Toribio (2009a) suggest that this asymmetrical interference relates to the range of acceptable productions within a given phonetic category (i.e., phonetic latitude). Notably, short-lag voiceless stops have a much smaller range of acceptable VOTs (approximately 30 ms) than long-lag voiceless stops (up to 100 ms). Bilinguals may “permit” a degree of interference, so long as the production stays within the expected category variability. Long-lag voiceless stops, with greater acceptable ranges, are most likely to show evidence of cross-linguistic interference. In contrast, short-lag voiceless stops, with a relatively reduced range, are less likely to evidence cross-linguistic interference and do so only when monolingual productions are sufficiently short to leave “room” for convergence. In other words, research at the segmental level has tended to show interference at the phonetic level, rather than the phonological level. Participants produced code-switched items or items in bilingual mode that differed from their own monolingual productions (e.g., Language A), but these productions were within the (Language A) category-specific range and never crossed into the opposite language (Language B) phonological category (for discussion, see Olson, 2019). Further support for this approach comes from the lack of evidence for additive effects of code-switching and bilingual language mode. For example, code-switched tokens may already be produced at one end of the range of natural variability, and as such are unlikely to show any additional effects of language mode. That is, while bidirectional interference is possible, the outcomes of such interference are constrained by language-specific and phoneme-specific variability.
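The phonetic latitude reasoning can be sketched as a simple check: a shifted VOT counts as phonetic-level interference so long as it stays within the home category's range, and would count as a phonological-level change only if it crossed into the other language's category (an outcome not attested in this literature). The ranges and names below are illustrative.

```python
# Sketch of the "phonetic latitude" account: interference is tolerated
# as long as a shifted production stays within its own category's range.
# Ranges approximate short-lag (Spanish-type) and long-lag (English-type)
# voiceless-stop VOT spans; names and cutoffs are illustrative only.
SHORT_LAG_RANGE_MS = (0.0, 30.0)
LONG_LAG_RANGE_MS = (30.0, 120.0)

def interference_outcome(shifted_ms: float,
                         home_range: tuple) -> str:
    """Classify a shifted VOT relative to its home category's range."""
    lo, hi = home_range
    if lo <= shifted_ms <= hi:
        return "phonetic-level shift (within category)"
    return "category crossed (phonological-level change)"

# A long-lag token shortened under code-switching, e.g. from ~70 ms to
# 55 ms, stays inside the long-lag range: phonetic, not phonological.
print(interference_outcome(55, LONG_LAG_RANGE_MS))
# A short-lag token at 25 ms has little "room" left before the 30 ms
# boundary, illustrating why short-lag shifts are small when they occur.
print(interference_outcome(25, SHORT_LAG_RANGE_MS))
```

The asymmetry falls out of the range widths: a long-lag token can shift by tens of milliseconds and remain category-internal, while a short-lag token cannot.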
Moving beyond VOT, parallels can be found in the asymmetrical impact of code-switching on phonological processes, whereby cross-linguistic interference has been found only when a given language permits variable application of the phonological rule (Henriksen et al., 2021; Olson, 2019). In his study of spirantization and voicing assimilation in Spanish-English code-switching, Olson (2019) found evidence of significant cross-linguistic interference only for Spanish tokens, but not English tokens, regardless of the language dominance of the bilinguals. Similar findings were reported for spirantization in Afrikaans-Spanish bilinguals, who were mostly L2-dominant (Henriksen et al., 2021). Both intervocalic spirantization and voicing assimilation are variable phenomena in monolingual Spanish speech (for spirantization, see Hualde, Simonet, & Nadeu, 2011; for assimilation, see Campos-Astorkiza, 2015), but invariable in English and Afrikaans. Again, Olson (2019) suggests that interference effects for phonological processes are constrained by language-specific and phoneme-specific variability in nonswitched productions and appear only when a given language permits a degree of variability.
Finally, again highlighting the differential impacts of cross-linguistic phonetic interference on different phonetic features, results for suprasegmental features appear to contrast with those at the segmental level. A number of studies have demonstrated a degree of hyperarticulation at the prosodic level (Aly, 2017; Olson, 2012, 2016b; Rao et al., 2018), potentially responding to the cognitive difficulties in both production and perception (Olson, 2016b). As such, bilinguals may mark code-switches prosodically at the suprasegmental level, diverging from nonswitched productions in both languages, while evidencing interference (i.e., convergence) at the segmental level. Moreover, several studies have demonstrated that, in cases of code-switching, the intonation contour of the inserted lexical item (i.e., the guest language) may be produced with the expected contour of the surrounding language (i.e., the base language) (Aly, 2017; Birkner, 2004). The application of opposite-language contours to code-switched tokens at the suprasegmental level stands in contrast to outcomes at the segmental level, where lexical items are not produced with the phonological categories of the opposite language.
30.6.2 Language Dominance
Considering the role of language dominance in cross-linguistic interaction, it is clear that speakers from across the language dominance continuum may show evidence of bidirectional interference (e.g., Amengual, 2012). Yet, evidence of unidirectional interference has also been well documented in the literature (Antoniou et al., 2011; Balukas & Koops, 2015; Bullock et al., 2006; Bullock & Toribio, 2009a; Olson, 2016a). In many cases, an explanation for unidirectional, rather than bidirectional, interference can be found at the intersection of the characteristics of the phonetic feature and a speaker’s language dominance. Considering VOT, while most studies have shown an impact of short-term interference on the long-lag language, regardless of language dominance (e.g., Olson, 2016a; Šimáčková & Podlipský, 2015), results for the short-lag language seem to be conditioned by language dominance. Notably, speakers who are dominant in the short-lag language are more likely to evidence an impact of short-term interference in the short-lag language than speakers who are dominant in the long-lag language (e.g., Amengual, 2012; Olson, 2016a). Short-lag-dominant speakers are likely to produce the shortest VOTs, effectively allowing “phonetic latitude” to evidence interference effects. For example, in his examination of Spanish-English cognates, Amengual (2012) reported the patterns of significance by individual (Amengual, 2012, table 3).
A brief reanalysis of these data shows a significant difference in noncognate VOTs between participants who showed significant interference in cognate production and those who did not. Specifically, those bilinguals who produced significant shifts in VOT for cognates evidenced shorter (short-lag) VOTs for noncognate words than those bilinguals who did not show significant cognate effects (p = 0.02). Moreover, even when all participants evidence cross-linguistic interference, the magnitude of the effect appears related to language dominance (e.g., Olson, 2019).
Related to language dominance, Bullock and Toribio (2009a) note that their finding of phonetic divergence resulting from code-switching may be explained by proficiency and linguistic awareness. Highly proficient participants with a high degree of linguistic awareness (e.g., language instructors) may modify their phonetic productions to enhance distinctions. Taken as a whole, the results suggest that language dominance may play a role in the phonetic outcomes of short-term cross-linguistic interference.
30.7 Future Directions
While the impact of long-term sources of phonetic transfer has been the subject of a robust body of research, short-term sources of phonetic interference, such as code-switching and bilingual language mode, represent an emergent area in the field. As such, several areas, both methodological and theoretical, remain for future research.
30.7.1 Methodological Considerations
With respect to methodology, it should first be recognized that this body of work has focused on a small number of language pairings and phonetic features. Future research may seek to expand these findings to a variety of different language pairings, notably from typologically distinct languages or language pairings that have substantially different social or communicative roles (e.g., cases of diglossia versus intense language mixing). Given the role that the phonetic feature has played in both code-switching and language mode research and the broad overrepresentation of VOT in this line of work, future research may examine a variety of different segmental and suprasegmental features. Second, the bulk of the research in this area has come from carefully controlled lab speech, and naturalistic paradigms remain underrepresented (although see Aly, 2017; Balukas & Koops, 2015; Piccinini & Arvaniti, 2015). Future research should seek to incorporate data from across the methodological spectrum. Finally, extant research has largely focused on insertional code-switching, with other types of switching less studied (e.g., alternational code-switching) or absent (e.g., congruent lexicalization). In addition, although a number of authors explicitly mention the role of phonology in differentiating borrowing, nonce borrowing, and code-switching, careful acoustic studies are noticeably lacking. Future research should specifically seek to incorporate a variety of types of bilingual behavior, including different types of switching, borrowing, translanguaging, and interpretation.
30.7.2 Theoretical Considerations
First, as noted already, current research in bilingual phonetics specifically framed within the notion of language mode has tended to operationalize language mode as the quantity of each language present in a given paradigm (e.g., Simonet, 2014). Yet, in theoretical proposals (e.g., Grosjean, 2001), factors that may impact language mode are much more wide-ranging, encompassing the form and content of the message, the situation, and the interlocutors. To date, while there is existing research on variables within each of these language mode components, the findings are often not contextualized within the framework of language mode. As such, researchers should work both to incorporate novel variables that may impact language mode and to leverage prior studies that may not necessarily be framed within language mode theory. While the current chapter gives a number of examples of this type of reframing of prior research (e.g., research on cognates or interlocutors), it should by no means be considered exhaustive. Future work should explicitly seek to reframe prior studies on variables under the language mode “umbrella” to inform our understanding of the role language mode plays in bilingual phonetic production and perception. Moreover, the cross-linguistic influence driven by language mode should be more explicitly integrated into models of speech processing.
Second, the body of work discussed in this chapter largely suggests that bidirectional cross-linguistic interference is possible in contexts where both of a bilingual’s languages are activated. But this line of research focuses on the outcomes of such interference as a means to understand the underlying cognitive processes. It is important to distinguish among the sources of cross-linguistic interference, the underlying cognitive process of cross-linguistic interference, and the outcomes or evidence of cross-linguistic interference. That is to say, a failure to find evidence of interference in bilingual productions does not necessarily imply that there is no underlying interference. It may be the case that speakers experience interference but that the outcomes (i.e., the direct evidence) of such interference in production are constrained by a variety of factors, such as the characteristics of the phoneme in question. Research in perception, showing a degree of cross-linguistic activation for cognates even in a strongly monolingual mode (e.g., Spivey & Marian, 1999), seems to support this possibility. In short, future research may begin to better disentangle the production-oriented evidence of cross-linguistic interference from the underlying cross-linguistic activation of the two languages.
Finally, it is worth distinguishing between studies that specifically examine the phonetic outcomes in bilingual speech and those that leverage bilingual phonetics to comment on broader theoretical concepts. Research on code-switching provides a unique lens through which to analyze linguistic features that is largely unavailable in monolingual populations. As an example, Olson (2019) exploits a code-switching paradigm to comment on the underlying nature of phonological processes. Building on prior work on monolingual speech, Olson (2019) proposes that phonological processes may be anchored to a specific lexical item (and language), but that the environment that licenses the phonological change may be language-nonspecific. This type of analysis, contributing to phonological theory, would be largely unavailable in monolingual speech. Future research should both seek to understand bilingual behaviors and utilize those behaviors as a tool for examining major theoretical concepts in the field.
31.1 Introduction
This chapter reviews evidence that the orthographic forms (spellings) of second language (L2) sounds and words affect L2 phonological representation and processing. It argues the following points. Orthographic effects (OEs) are found in five realms: speech perception, speech production, phonological awareness, phonological learning, and lexical learning. Orthographic forms facilitate L2 users – for instance in perception and lexical learning – by representing sounds and contrasts they cannot perceive. However, orthographic forms also have negative effects on L2 phonology, resulting in sound additions, deletions, and substitutions. Such OEs are due to differences between the L2 user’s orthographic knowledge and the way the L2 writing system works. Further, OEs are caused by orthographic input, but may be reinforced by nonorthographic factors, and modulated by individual-level and sound/word-level variables. They are established after minimal exposure to orthographic input and are persistent. Future research could aim at producing a coherent framework, addressing remaining gaps, and further developing research methods, in order to make a stronger contribution to theory and practice.
This section presents key terms and concepts related to writing systems and to OEs.
31.1.1 Introduction to Writing Systems
The smallest unit of a writing system is the grapheme. In an alphabetic writing system, a grapheme is a letter or a letter cluster that represents a phoneme; for example, in English the grapheme <th> represents the phoneme /θ/ (among others). A grapheme–phoneme correspondence (GPC) determines the pronunciation of a grapheme.
Each written language has its own writing system, with its graphemes and GPCs, for instance the English writing system, the German writing system, and so on. Alphabetic writing systems differ in:
sets of graphemes – the grapheme <th> is not attested in Italian (excluding loanwords);
grapheme–phoneme correspondences and phoneme–grapheme correspondences – <ch> corresponds to /k/ in Italian and mostly to /k, ʃ, tʃ/ in English (chaos, machine, child); /k/ corresponds to <c, ch> in Italian and mostly to <c, ck, k> in English (cat, lick, break);
degree of phonological transparency (regularity of orthography–phonology correspondences, also called orthographic transparency and orthographic depth) – in the more phonologically transparent Italian writing system, one grapheme regularly represents the same phoneme, and vice versa, whereas one-to-one correspondences are rare in the more phonologically opaque English writing system.
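These three differences can be made concrete with a small sketch. The mini GPC tables and the `is_transparent` helper below are invented for illustration (they cover only the <ch> example from the text); the point is that a GPC is naturally modelled as a mapping from a grapheme to the set of phonemes it may represent, and phonological transparency as how close that mapping comes to one-to-one.

```python
# Toy grapheme-to-phoneme tables (illustrative only, covering just <ch>).
# Italian: <ch> regularly corresponds to /k/ (one-to-one, transparent).
# English: <ch> corresponds to /k, ʃ, tʃ/ (chaos, machine, child) - opaque.

ITALIAN_GPC = {"ch": {"k"}}
ENGLISH_GPC = {"ch": {"k", "ʃ", "tʃ"}}

def is_transparent(gpc_table):
    """A (toy) table is phonologically transparent if every grapheme
    maps to exactly one phoneme."""
    return all(len(phonemes) == 1 for phonemes in gpc_table.values())

print(sorted(ENGLISH_GPC["ch"]))  # three possible recodings of <ch>
print(is_transparent(ITALIAN_GPC), is_transparent(ENGLISH_GPC))
```

Real writing systems also differ in their grapheme inventories (e.g., <th> is unattested in Italian outside loanwords), which in this sketch would simply be a key present in one table and absent from the other.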
31.1.2 Introduction to L2 Orthographic Effects
Generally, L2 learners (those learning a non-native language after early childhood) are exposed to both spoken and written L2 input. Instructed learners often receive copious orthographic input from the early stages of learning, when their L2 phonological system is not established, and often learn L2 words from written input as well as – or even without – spoken input. Naturalistic learners encounter written input in the environment. Hence, the minds of L2 learners and L2 users (former L2 learners who have now acquired and are using an L2, also called late bilinguals) contain the phonological system and the writing system of two or more languages: the L1(s) plus any L2(s). This can result in OEs on L2 phonology.
Up to the late 2000s, L2 phonology researchers generally believed that OEs on L2 phonological development were negligible, and did not research the topic (reasons for this neglect are discussed in Bassetti, Hayes-Harb, and Escudero, 2015). Indeed, the first overview (Bassetti, 2008) could include only about half a dozen peer-reviewed studies, mostly by Young-Scholten. Then the zeitgeist changed and research flourished. Some key events are:
1995–2007: Researchers occasionally mention orthography as causing pronunciation errors (Piske et al., 2002; Zampini, 1994). Young-Scholten (1995, 1998, 2002) investigates OEs on L2 pronunciation.
2008–2010: Bassetti (2008), at the invitation of Young-Scholten, publishes the first overview. Bassetti, Escudero, and Hayes-Harb each give a talk about OEs on L2 perception and production at the Workshop on the Relation between Orthography and Phonological Acquisition (Ghent, Belgium, 2008). Bassetti presents her 2007 publication (Bassetti, 2007). The other two talks are published in a 2010 special issue of Language and Speech on OEs on (mostly L1) phonology (Escudero & Wanrooij, 2010; Hayes-Harb, Nicol, & Barker, 2010).
2013–2015: Bassetti, Escudero, and Hayes-Harb organize the first colloquium on L2 OEs at the European Second Language Association conference (2013) and then edit the first journal special issue on L2 OEs (Hayes-Harb, Bassetti, & Escudero, 2015) and propose a research agenda (Bassetti et al., 2015).
2021: Hayes-Harb and Barrios (2021) publish a comprehensive overview in Language Teaching.
2023: Bassetti publishes the first monograph (Bassetti, 2023).
There is currently much evidence of OEs on L2 phonology. Section 31.2 will review this.
31.2 Types of OEs on L2 Phonology
This section reviews current knowledge about OEs on L2 phonology – perception, production, phonological awareness, phonological learning, and lexical learning. In a nutshell, the orthographic forms of sounds and words – their spellings – can facilitate learning and processing sounds and words, but can also lead to sound additions, omissions, and substitutions in L2 learners, L2 users, and L0 learners (those exposed to a language they have never encountered before within an experiment, usually called naïve learners).
31.2.1 Facilitative Effects of Orthographic Input
There is by now robust evidence that exposure to orthographic forms in addition to phonological forms facilitates the acquisition of L2 phonological forms. This bimodal exposure can speed up spoken word learning and increase the accuracy of sounds and words.
Word learning. It is generally agreed that with orthographic input L2 English learners learn more words, or learn words faster. This is found at all levels, whether learners are primary (Hu, 2008; Krepel, de Bree, & de Jong, 2021), high-school (Cerni, Bassetti, & Masterson, 2019), or university students (Bürki et al., 2019; Uchihara et al., 2022).
Accuracy. Additional exposure to orthographic forms can also help L2 learners learn more accurate sounds or word forms, compared to exposure to acoustic input alone. This facilitative effect was first established experimentally with L0 learners by Erdener and Burnham (2005), who found that orthographic input facilitated perception in listeners upon first exposure to a spoken language. Research then found that orthographic forms can help not only L0 listeners who are exposed to difficult sounds in a language they have never encountered before (such as L0 Spanish rhotics: Rafat, 2015; or consonant clusters: Davidson, 2010) but also learners at all levels of L2 development, including advanced level (Escudero & Wanrooij, 2010; Pattamadilok, Welby, & Tyler, 2022).
Orthographic forms facilitate L2 perception because they allow L2 listeners to distinguish sounds that they cannot perceive, for instance L2 vowel contrasts (Escudero, Hayes-Harb, & Mitterer, 2008; Escudero, Simon, & Mulak, 2014), L2 German vowel lengths (Nimz & Khattab, 2020), and L2 Chinese lexical tones (Mok et al., 2018; tone symbols even facilitate L0 Chinese learners, as their shape represents pitch variations [Showalter & Hayes-Harb, 2013]). Indeed, training with orthographic input was found to improve accuracy in sound production more than phonetic training (Solier et al., 2019) and to improve perception more than exposure to articulatory gestures (Pattamadilok et al., 2022). Nevertheless, improvement may be limited to the training of specific sounds or contrasts, as lengthy exposure to captioned videos by itself did not improve perception of either a difficult vowel contrast in particular or a foreign accent in general (Wisniewska & Mora, 2020). Orthographic forms also facilitate production, as knowing the spelling of L2 words allows L2 speakers to produce contrasts they cannot perceive, for example producing lip or rip without being able to perceive the [l, r] contrast (Eckman, 2004).
31.2.2 Negative Effects of Orthographic Input: Sound Additions, Deletions, and Substitutions
Although orthographic input has robust facilitative effects on L2 phonological and lexical acquisition, most OE research reports negative effects, resulting in sound additions, deletions, and substitutions. Effects are found in perception, production, and metalinguistic awareness, as well as phonological and lexical learning.
31.2.2.1 Introduction to Negative Effects of Orthographic Input
In some cases, the same researchers who reported facilitative OEs also found negative effects. Faster word learning was accompanied by lower accuracy in the production of the same words, in both L0 (Bürki et al., 2019) and L2 speakers (Cerni et al., 2019; Uchihara et al., 2022). Increased accuracy in the production of some sounds was accompanied by lower accuracy in other sounds (Escudero et al., 2014; Escudero & Wanrooij, 2010). There were also different effects on different characteristics of the same sound, such as higher accuracy in vowel duration but lower accuracy in vowel quality (Nimz & Khattab, 2020). Sections 31.2.2.2–31.2.2.4 present three types of negative OE.
31.2.2.2 Additions
In orthography-induced sound addition, or epenthesis, L2 learners/users add a sound that is not present in the phonological form of an L2 word because of orthography. Additions often happen because L2 users realize a letter that belongs to a larger orthographic unit, such as a multi-letter grapheme or a morphemic spelling. Such letters are traditionally called “silent letters.” Examples (31.1) to (31.3) involve multi-letter graphemes, while (31.4) shows a morphemic spelling:
(31.1) <g> in <ng> (=/ŋ/) as in king;
(31.2) <l> in <al> (=/ɔː/) as in walk;
(31.3) <e> in <a_e> as in made (word-final <V_e> represents a long V);
(31.4) <e> in the past-tense marker <ed> in contexts where <ed>=/t, d/.
L2 users then add:
(31.5) a [g] in king (Cabanero & Alves, 2008);
(31.6) an [l] in walk (Bassetti & Atkinson, 2015);
(31.7) a word-final vowel in made (Silveira, 2007);
(31.8) a vowel before the final consonant in talked (Bassetti & Atkinson, 2015; for orthography-induced epenthesis in L0 speakers, see Hayes-Harb et al. [2010] and Rafat and Stevenson [2019]).
Orthography-induced epenthesis is not always due to realizing an extra letter. For instance, L0 learners who hear L0 words containing complex consonant clusters tend to delete consonants, but if they are exposed to such a word’s spelling, they tend to produce all of the consonants and add epenthetic vowels to facilitate production (Detey & Nespoulous, 2008; Young-Scholten, 2002).
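The silent-letter mechanism behind examples (31.5)–(31.6) can be sketched as a recoding procedure. Everything here is a hypothetical illustration: the mini grapheme tables and the greedy `recode` function are invented, and cover only the letters needed for walk and king. The sketch shows how a reader who lacks a multi-letter GPC falls back to letter-by-letter recoding and thereby "realizes" the silent letter.

```python
# Hypothetical letter-by-letter recoder. A reader who knows the
# multi-letter graphemes <al> = /ɔː/ and <ng> = /ŋ/ recodes walk and king
# target-like; a reader who does not know them realizes every letter,
# adding [l] in walk and [g] in king. Tables are toy assumptions.

TARGET_GPC = {"al": "ɔː", "ng": "ŋ"}  # multi-letter graphemes
LETTER_GPC = {"w": "w", "a": "æ", "l": "l", "k": "k",
              "i": "ɪ", "n": "n", "g": "g"}  # single-letter fallbacks

def recode(spelling, knows_multiletter):
    """Greedy left-to-right recoding; falls back to single letters."""
    out, i = [], 0
    while i < len(spelling):
        two = spelling[i:i + 2]
        if knows_multiletter and two in TARGET_GPC:
            out.append(TARGET_GPC[two])
            i += 2
        else:
            out.append(LETTER_GPC[spelling[i]])
            i += 1
    return "".join(out)

print(recode("walk", knows_multiletter=True))   # wɔːk (target-like)
print(recode("walk", knows_multiletter=False))  # wælk (added [l])
print(recode("king", knows_multiletter=False))  # kɪng (added [g])
```

The same skeleton could model example (31.7), treating word-final <V_e> as a discontinuous grapheme that the learner fails to parse as a unit.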
31.2.2.3 Deletions
In orthography-induced sound deletion, L2 learners/users omit a sound that is present in the phonological form of an L2 word but not represented orthographically. As an example, if the romanized spelling of an L2 Chinese triphthong contains two letters, learners both count two vowels (Bassetti, 2006; Pytlyk, 2017; see Pytlyk [2011] for L0 Chinese learners) and produce a diphthong (Bassetti, 2007), but they count three vowels and produce a triphthong if the spelling contains three letters. For instance, they omit [o] from [jow] if spelled <iu> (after a consonant) but not if spelled <you> (in onsetless syllables).
31.2.2.4 Substitutions
Orthography-induced substitutions – the most commonly reported negative OEs – occur when a sound is replaced with another sound or allophone because of its spelling. The substituted sound can even be a sound that is unattested in the target language.
Looking at production, an early study found that L1 English-L2 German beginner learners produce voiced obstruents instead of voiceless obstruents in word-final position because German word-final voiceless obstruents are spelled <b, d, g> (Young-Scholten, 2002, 2004; see Barrios and Hayes-Harb [2020] for a German-like L0). Also, L1 English-L2 German beginner learners substitute [s] for [z] word-initially because word-initial /z/ is spelled <s> in German (Young-Scholten & Langer, 2015). As another instance, Italian natives overextend a “single vowel letter = short vowel” rule to L2 English words where it does not apply, hence replacing a long with a short vowel (Italians generally use length to distinguish English tense and lax vowels). For example, they produce and perceive [iː] as a long vowel when spelled with a digraph (seen, jeans, bee) and as a short vowel when spelled otherwise (scene, genes, B; Bassetti & Atkinson, 2015; Bassetti et al., 2018, 2020). In the previous instances, the outcome is a sound other than the target sound, while in other instances the outcome is an allophone of the target. For instance, L1 Spanish-L2 American English speakers produce [t] instead of a flap because of the spelling <t> or <tt> (city, bottom; Vokic, 2011), and L1 Chinese-L2 Korean learners produce different allophones of /h/ depending on its spelling (Han & Kim, 2017). Substitutions have also been repeatedly found in L0s by Hayes-Harb and colleagues (for instance, Hayes-Harb, Brown, & Smith, 2018; Hayes-Harb et al., 2010; Showalter, 2020).
Looking at perception, for instance, L2 vowel perception was less accurate in L1 Spanish-L2 Dutch learners who were exposed to L2 vowel spellings that represented different vowels in the L1 and the L2 (Escudero et al., 2014), and L1 English-L2 Spanish listeners perceived Spanish intervocalic approximants as stops due to their spelling <b, d, g> (Shea, 2017). Finally, looking at awareness, researchers found orthography-induced substitutions of L2 English vowels with L1 vowels in tasks such as creating L2 pseudowords (Piske et al., 2002) and adapting L2 pseudowords as L1 loanwords (Vendelin & Peperkamp, 2006).
Orthography-induced substitutions can even result in unattested sounds or contrasts, existing in the L2 phonological systems of L2 learners and users but not in the L1 phonological systems of native speakers of the target language. An early descriptive study (Zampini, 1994) observed that beginner L1 English learners realize <v> as [v] instead of [b] in L2 Spanish, although /v/ is unattested in Spanish (for L0 Spanish learners, see Rafat, 2016). A series of experimental studies by Bassetti and colleagues revealed that Italian natives systematically substitute singleton consonants with geminates in L2 English words that are spelled with double consonant letters (doublets), such as a geminate [tː] in kitty but not in city (Bassetti, 2017). This is because L1 Italian phonology contrasts singleton and geminate consonants, and L1 Italian orthography regularly represents singletons with a single letter and geminates with doublets. Italian natives therefore establish a singleton-geminate contrast in their L2 English phonological systems, and recode L2 English consonant doublets as geminates. The contrast is lexically encoded, as it depends on the spelling of individual words. Italian natives then perceive and produce geminate or singleton consonants in pairs such as Finnish-finish, and consider rhymes such as beginner-Berliner to be nonrhymes (Bassetti et al., 2018, 2020, 2021). According to Bassetti and colleagues, this happens because of top-down effects from orthography-influenced L2 word representations that contain a geminate. Orthography-induced gemination is so widespread that it may be considered a feature of the Italian variety of English (Bassetti, 2023).
Later, L2 English lexical gemination was also confirmed in Japanese (Sokolović-Perović, Bassetti, & Dillon, 2020) and Korean natives (Rafat et al., 2021).
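The doublet-to-geminate recoding described above is simple enough to state as a rule. The sketch below is a toy illustration (the function name is invented, and length is marked by inserting ː into the spelling purely for display): it applies the L1 Italian correspondence "consonant doublet = geminate" to L2 English spellings, yielding a geminate in kitty but not in city, exactly the lexically encoded pattern reported by Bassetti and colleagues.

```python
import re

# Toy model of L1-Italian-influenced recoding: any doubled consonant
# letter in an English spelling is reinterpreted as a geminate (Cː),
# even though English itself has no consonant length contrast.

def recode_with_italian_doublet_rule(spelling):
    """Mark any doubled consonant letter as a geminate (Cː)."""
    return re.sub(r"([bcdfghjklmnpqrstvwxz])\1", r"\1ː", spelling)

print(recode_with_italian_doublet_rule("kitty"))  # kitːy (geminate [tː])
print(recode_with_italian_doublet_rule("city"))   # city  (singleton [t])
```

Because the rule reads the written form of each word, the resulting contrast is lexical rather than phonologically conditioned, which matches the Finnish-finish and beginner-Berliner observations.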
31.3 Causes of OEs
This section aims at explaining OEs on the basis of current knowledge. It argues that OEs cannot be explained other than as consequences of orthographic forms. It also proposes that OEs are caused by the difference between the L2 user’s orthographic knowledge and the GPCs and graphotactics (contextual rules that constrain graphemes and GPCs) of the L2. Although OEs are caused by orthography, they can be reinforced by other factors, including acoustic input. All of these factors may contribute to explaining why OEs arise and persist in spite of the absence of supporting phonological evidence, or even with contrasting phonological evidence.
31.3.1 Orthographic Effects Cannot Be Explained in Terms of Phonology Alone
For an effect to be considered orthographic in nature, it should have an orthographic explanation and should not be explainable in terms of phonology alone. Here are some characteristics of OEs (for an earlier formulation, see Bassetti, 2008). These are illustrated with reference to Italian natives’ L2 English contrastive gemination:
1. OEs must be attributable to orthography. The effect can be explained by one or more GPCs or graphotactic rules (in the L1 or the L2, or in the L2 user’s orthographic repertoire). For instance, L2 English lexical gemination is attributable to L1 Italian doublet-geminate correspondence.
2. OEs cannot be entirely explained by L1 phonology (L1 phonological categories and contrasts, phonotactic rules) or the L1 phonological lexicon (the phonological forms of cognates). For instance, Italian natives’ L2 English gemination cannot be explained by Italian phonology, as follows. First, Italian natives can produce either a geminate or its corresponding singleton in the same phonological context (for example, both [n] and [nː] in [fɪ_ɪʃ]). Second, they learn an L2 English spoken word as containing a geminate only if exposed to orthographic input, not if exposed exclusively to spoken input (Cerni et al., 2019). Third, they geminate L2 English words that have no L1 Italian cognate or whose cognate contains a singleton.
3. OEs cannot be entirely explained by the phonology of the L2 or by universals of phonological development. A sound is not attributable to OEs if it (a) is attested in the repertoire of monolingual adult speakers of the language; (b) emerges during phonological development in native children; and (c) can be explained by universals of phonological acquisition (although orthography can reinforce such effects). For instance, (a) English natives do not produce or perceive geminates; (b) gemination is not attested in L1 English infants; and (c) gemination is marked (uncommon across languages).
31.3.2 Orthographic Sources of OEs
This section argues that OEs occur because L2 users phonologically recode L2 orthographic forms according to their L2 orthographic knowledge, that is to say what they know about the written representation of the L2, both at the graphemic level (GPCs) and above (graphotactics). Effects of orthography can also be reinforced by factors other than orthographic input, including orthography-influenced auditory input (discussed in Section 31.3.3).
31.3.2.1 Grapheme–Phoneme Correspondences
The L2 learner/user’s orthographic knowledge may reflect gaps in their knowledge of L2 GPCs. It may also include unattested GPCs, that is to say GPCs that exist in the orthographic system of the L2 user but are unattested in native users. These may be caused by L1 GPCs or may be novel creations.
Unfamiliar GPCs. In such cases, L2 learners/users do not know an L2 GPC. For example, ignoring that <al> represents /ɔː/ in walk results in adding [l], and ignoring that <V__e#> represents /Vː/ leads to recoding V as a short vowel (Reference Bassetti and AtkinsonBassetti & Atkinson, 2015; Reference Bassetti, Sokolović-Perović, Mairano and CerniBassetti et al., 2018, Reference Bassetti, Mairano, Masterson and Cerni2020, Reference Bassetti, Masterson, Cerni and Mairano2021).
L1-influenced unattested GPCs. In this case, L2 learners/users assimilate an L2 sound with an L1 sound when the two languages map the same grapheme onto different sounds. An L1–L2 incongruence may affect recoding of a single grapheme (for example, <w>=[v] in Italian natives’ L2 English words such as water, a GPC used in L1 Italian loanwords [Reference Bassetti and AtkinsonBassetti & Atkinson, 2015]), or a set of graphemes (for example, <CC>=/Cː/ in Italian natives’ L2 English). Various substitutions reported in Section 31.2.2.4 can be explained in this way (for instance, Reference Escudero, Simon and MulakEscudero et al., 2014; Reference SheaShea, 2017).
Novel unattested GPCs. Occasionally, L2 learners/users may establish GPCs that cannot be explained in terms of L1 or L2, for example realizing <ght> as [tʃ] in caught (Reference Bassetti and AtkinsonBassetti & Atkinson, 2015). Such novel GPCs may have various causes, including difficulties perceiving the target sound, incomplete knowledge of target words, and influences of the GPCs of other languages in the L2 user’s repertoire.
31.3.2.2 Graphotactic Rules
Another aspect that influences L2 users’ L2 perception and production is their graphotactic knowledge. For instance, L2 English learners add [b] in words such as comb, ignoring that morpheme-final <mb#> represents /m/ (Reference Bassetti and AtkinsonBassetti & Atkinson, 2015); L2 Chinese learners produce Chinese [jow] as a diphthong, ignoring that this triphthong is spelled <iu> after a consonant (Reference Bassetti, Guder, Jiang and WanBassetti, 2007); and L2 German speakers substitute [s] for [z] word-initially, ignoring that word-initial /z/ is spelled <s>, and substitute [d] for [t] word-finally, ignoring that word-final <d> represents /t/ (Reference Young-Scholten, Burmeister, Piske and RohdeYoung-Scholten, 2002, Reference Young-Scholten2004; Reference Young-Scholten and LangerYoung-Scholten & Langer, 2015).
31.3.2.3 Lexical Knowledge
Some OEs may be due to gaps in the L2 lexicon. For instance, additions can derive from not knowing that Wednesday represents [wenzdeɪ] (Reference Bassetti and AtkinsonBassetti & Atkinson, 2015).
31.3.2.4 Multiple Causes
Attempting to isolate a single cause is sometimes misguided, as various factors may be at play. For instance, given that <d>=/d/ in both English and German, L1 English-L2 German speakers’ voicing of word-final obstruents may be due to an L1-influenced GPC <d>=/d/ or to overgeneralizing the L2 GPC <d>=/d/ by ignoring the German graphotactic rule <__d#>=/t/. Given that <t>=/t/ in both American English and Spanish, L1 Spanish-L2 American English speakers’ substitution of an intervocalic flap with [t] may be due to an L1-influenced GPC <t>=/t/ or to overgeneralizing the L2 GPC <t>=/t/, ignoring that in intervocalic position <t> represents a flap.
31.3.3 Additional Sources of OEs: Beyond Orthographic Input
Although OEs are caused by orthographic forms, and are reinforced by further encounters with such forms, they can also be reinforced by other factors, including spoken input and output. This section proposes a number of additional possible sources of OEs. For clarity, this section and the next (Section 31.3.4) are illustrated with examples from Italian natives’ L2 English gemination (see Section 31.2.2.4 for references).
1. Orthography-influenced spoken input. This can be produced by fellow L2 speakers. For instance, L2 English geminates produced by native speakers of Italian and other languages with gemination.
2. Orthography-influenced intake. Since orthographic forms affect speech perception, L2 listeners may perceive sounds other than those in the input. For instance, Italian natives perceive geminates in spoken words produced by English natives.
3. The L2 user’s own orthography-influenced overt and covert output. Orthography affects the L2 user’s own speech production and covert language production, that is to say inner speech and silent reading. These all constitute part of the learners’ input. For instance, Italian natives produce geminates in L2 English speech (and presumably in inner speech and silent reading).
4. Strategically ignored target-like spoken input. The L2 user may perceive the L2 sounds correctly but strategically disregard their percept in favor of orthographic forms. Relying on orthography and ignoring one’s own percept is, overall, a successful strategy. For instance, vowel spellings can help L2 listeners discriminate two vowels they perceive as one category, but only if listeners disregard their own perception, since the vowels sound the same to them. Italian natives may then choose to disregard the consonant duration they perceive and rely instead on consonant spelling.
5. Indirect corroborating evidence. There may be other sources of evidence. For instance, contrastive L2 English gemination may be reinforced by noncontrastive gemination in L1 English speech (occurring when a consonant is repeated at morpheme boundaries as in top pick versus topic).
In conclusion, it is argued that orthographic forms affect the learning of L2 sounds and words, but these effects may be reinforced by nonorthographic factors. Once established, orthography-influenced phonological representations then influence L2 processing, via mechanisms discussed next.
31.3.4 How Orthography Affects Online L2 Processing
This section argues that OEs on L2 speech perception and production probably occur because L2 users have mental representations that are influenced by orthographic forms. Yet, the effects may also be reinforced by the simultaneous activation of an orthographic representation.
Orthography-influenced phonological representations. The orthographic forms of the L2, phonologically recoded according to the L2 learner/user’s orthographic knowledge, affect L2 learners/users’ mental representations of L2 phonological words. These orthography-influenced phonological representations in the lexicon (mental representations of phonological forms of words that are stored in long-term memory) in turn affect speech processing (as proposed for the L1 by Reference TaftTaft [2006], among others). For example, L1 Italian-L2 English users have a mental representation of Finnish as /fɪnːɪʃ/, which leads them to perceive and produce [fɪnːɪʃ].
Coactivation of orthographic representations. During speech processing, orthographic representations – mental representations of orthographic forms stored in long-term memory – are coactivated with phonological representations (as proposed for the L1 by Reference Ziegler and FerrandZiegler and Ferrand [1998], among others). For example, when [fɪnɪʃ] (meaning “Finnish”) is activated, L1 Italian-L2 English learners and users also coactivate <Finnish>, which they recode as [fɪnːɪʃ].
The locus of orthographic effects in L2 speech processing is not yet established, and coactivation may play a role. However, the hypothesis that L2 OEs are due to orthography-influenced phonological representations is consistent with much of the evidence reported in this chapter.
31.4 Characteristics of OEs
This section shows that OEs are established after minimal exposure to an orthographic form, and once established are resistant to lengthy naturalistic exposure, and possibly to teaching. They can affect all aspects of L2 speech processing, as the same OE can be found in perception, production, awareness, and phonological and lexical learning, and reflected in spelling.
31.4.1 Orthographic Effects Are Established After Brief Exposure to Orthographic Forms
Even minimal exposure to L2 orthographic forms can result in OEs. Experienced L2 learners/users who are only briefly exposed to L2 orthographic forms acquire orthography-influenced L2 words. For instance, Italian natives produced gemination in L2 English pseudowords after just eight exposures (Reference Cerni, Bassetti and MastersonCerni et al., 2019). Research on L0 phonological and lexical learning also shows that OEs appear after minimal exposure.
31.4.2 Orthographic Effects Reflect Interactions between Orthography and Phonology
The nature of the interaction between phonology and orthography is complex and not yet fully understood. However, OE research can contribute to this debate.
When orthography and phonology conflict, orthography can override phonology. First, at a more general level, orthography can override universals of phonological development. For instance, it can cause the transfer of marked L1 features such as gemination and word-final voicing, as discussed earlier, and it can allow production to precede perception, as argued for instance by Reference Sheldon and StrangeSheldon and Strange (1982) regarding L1 Japanese learners’ L2 English /l-r/ contrast. Second, orthography can override acoustic input, leading L2 learners/users to perceive and produce sounds other than those in the input. For instance, L1 English learners of L2 German can perceive the voiced-voiceless contrast and yet produce word-final voiced consonants as voiced (Reference Young-Scholten, Burmeister, Piske and RohdeYoung-Scholten, 2002), and L1 Polish speakers perceive L2 German [oː] as /u/, but produce it as [ɔ] if exposed to its spelling <o> because in Polish <o>=/ɔ/ (Reference Nimz and KhattabNimz & Khattab, 2020).
A contrast between orthography and phonology can also result in a combined outcome. For example, L1 Polish-L2 German speakers produce [eː] as [eɪ] because they perceive [eː] as [i] or [ɨ] (an L1 phonology effect) and recode its spelling <e> as /ɛ/ (an OE [Reference Nimz and KhattabNimz & Khattab, 2020]; see Reference Rafat and StevensonRafat and Stevenson [2019] for examples in an L0). Similarly, the orthographic contrast between consonant letters and doublets in written English interacts with the absence of durational differences in spoken English consonants, resulting in an L2 phonological contrast between geminate and singleton that is realized with half of the L1 durational difference (Reference Bassetti, Sokolović-Perović, Mairano and CerniBassetti et al., 2018).
31.4.3 Orthographic Effects Are Persistent
Research so far has found that OEs may be more resistant to change than phonological effects, persisting in spite of naturalistic exposure and teaching. Longitudinal research found that naturalistic learners improve L2 pronunciation over time, but not with sounds represented by an L1-L2 incongruent grapheme (Reference Young-Scholten and LangerYoung-Scholten & Langer, 2015). Indeed, OEs are found in former instructed learners after years of naturalistic exposure (Reference Bassetti, Mairano, Masterson and CerniBassetti et al., 2020), even after a reduction in the effects of L1 phonology that are not reinforced by orthography (specifically, on voice onset time [VOT], see Reference Mairano, Bassetti, Sokolović-Perović and CerniMairano et al., 2018).
While training often reduces the effects of L1 phonology on L2 phonology, the only published study of an explicit GPC teaching intervention in L2 learners found no reduction of OEs (Reference Bassetti, Cerni and MastersonBassetti, Cerni, & Masterson, 2022). The same was found with L0s (Reference Hayes-Harb, Brown and SmithHayes-Harb et al., 2018; Reference ShowalterShowalter, 2020; Reference Showalter and Hayes-HarbShowalter & Hayes-Harb, 2015).
31.4.4 Orthographic Effects Can Affect All Aspects of L2 Phonological Processing
Most studies report OEs on either perception or production, but there is some evidence that the same OE can be found across L2 phonological processes. The two-letter spelling of Chinese triphthongs affects both metalinguistic awareness (phoneme counting [Reference BassettiBassetti, 2006; Reference PytlykPytlyk, 2017]) and production (Reference Bassetti, Guder, Jiang and WanBassetti, 2007) in L2 Chinese learners. In a series of studies where the same L1 Italian-L2 English speakers performed various tasks, effects of the “doublet=geminate” correspondence were found across all aspects of L2 phonology – perception, production, awareness, and lexical and phonological learning – as Italian natives perceived, produced, analyzed, and learned English consonants as either singleton or geminate depending on their spelling (Reference Bassetti, Mairano, Masterson and CerniBassetti et al., 2020, Reference Bassetti, Masterson, Cerni and Mairano2021, Reference Bassetti, Cerni and Masterson2022). The same studies also found similar effects of vowel spelling across all aspects of L2 English phonology.
31.4.5 Orthographic Effects Are Impacted by Various Factors
Although it is established that orthography affects L2 phonology, some studies found no evidence of predicted OEs, and research shows high variability. This points to the existence of variables that modulate or even prevent OEs. This section presents some variables – at the level of the individual, the sound, and the task – that have been shown to impact L2 OEs.
31.4.5.1 Individual-Level Predictors
There are a number of individual-level predictors. For example:
Type of L1 writing system. Native users of nonalphabetic writing systems show OEs (for example, L1 Chinese-L2 Korean learners [Reference Han and KimHan & Kim, 2017]). However, their OEs are possibly weaker than those of alphabet native users (for example, there is less L2 English consonant gemination among Japanese than Italian natives [Reference Sokolović-Perović, Bassetti and DillonSokolović-Perović et al., 2020]).
L1 writing system transparency. While native users of a transparent L1 writing system can be strongly affected by orthography, native users of opaque writing systems also show OEs (Reference Hayes-Harb, Brown and SmithHayes-Harb et al., 2018), albeit not always (for instance, not in L1 Farsi-L2 English learners [Reference Rafat, Whitford and JoanisseRafat et al., 2021]). With L0s, an opaque L1 writing system results in weaker OEs (positive and negative) than a transparent one (Reference Erdener and BurnhamErdener & Burnham, 2005; Reference Ijalba and OblerIjalba & Obler, 2015).
L2 orthographic knowledge. While better orthographic knowledge may reduce OEs, a minimum level of L2 orthographic knowledge is required for orthography to influence L2 perception. This has been shown by studies in which OEs on perception were stronger in proficient L2 listeners than in beginners with limited GPC knowledge (Reference Simonchyk and DarcySimonchyk & Darcy, 2018; Reference Veivo, Järvikivi, Porretta and HyönäVeivo et al., 2016).
Other individual-level variables. Lengthy naturalistic exposure does not reduce OEs (Reference Bassetti, Sokolović-Perović, Mairano and CerniBassetti et al., 2018, Reference Bassetti, Mairano, Masterson and Cerni2020, Reference Bassetti, Masterson, Cerni and Mairano2021; Reference Mairano, Bassetti, Sokolović-Perović and CerniMairano et al., 2018). Higher proficiency may reduce OEs on L2 production in learners (Reference Bassetti, Mairano, Masterson and CerniBassetti et al., 2020).
31.4.5.2 Sound-Level Predictors
L2 learners’ reliance on L2 orthographic forms is likely to be affected by the overall phonological transparency of the L2 writing system, as well as by characteristics of the specific sound. For example:
L2 writing system transparency. While OEs are found with transparent L2 writing systems (including scripts other than the Roman alphabet, such as L2 Korean; Reference Han and KimHan & Kim, 2017), they also occur with opaque ones (such as L2 French; Reference Solier, Perret, Baqué and Soum-FavaroSolier et al., 2019). The overall unreliability of the phonological information in the Chinese writing system is the likely reason why phonologically transparent Chinese characters do not facilitate lexical acquisition in L2 Chinese learners (as reported by Reference Zhang, Li and LiuZhang, Li, & Liu, 2021). With L0s, OEs are stronger with regular GPCs (Reference Erdener and BurnhamErdener & Burnham, 2005; Reference Ijalba and OblerIjalba & Obler, 2015), but are also found when the transparency level is unclear (Reference Hayes-Harb, Brown and SmithHayes-Harb et al., 2018).
L1–L2 GPC congruence. If a grapheme represents similar sounds in the L1 and the L2/L0, there is a facilitative effect. On the other hand, if it represents different sounds in the L1 and the L2/L0, this results in sound substitutions, as discussed in Sections 31.2 and 31.3.
Perceptual difficulty of the L2 sound/contrast. The relationship between perceptual difficulty and OEs appears complex. If the listener can perceive the L2 sound, negative OEs are weaker. For instance, Japanese natives, whose L1 has contrastive vowel length, are less affected by L2 English vowel spelling than Italian natives (Reference Sokolović-Perović, Bassetti and DillonSokolović-Perović et al., 2020). If perception is too difficult, facilitative OEs may be weaker. For instance, orthographic cues to L2 Russian consonant palatization facilitate perception word-medially, but not word-finally where palatization is less salient (Reference Simonchyk and DarcySimonchyk & Darcy, 2018). With L0s, perceptual difficulty may explain various null results (Reference EscuderoEscudero, 2015; Reference MathieuMathieu, 2016; Reference Showalter and Hayes-HarbShowalter & Hayes-Harb, 2015).
Other variables. The existence of an L1 cognate containing a sound, for instance a geminate consonant, may reinforce OEs, but the evidence is inconsistent (Reference BassettiBassetti, 2017; Reference Bassetti, Mairano, Masterson and CerniBassetti et al., 2020).
31.4.5.3 Task-Level Predictors
Purely oral tasks regularly reveal OEs (see, among many, Reference PytlykPytlyk, 2017). Some researchers have suggested that OEs may be stronger in tasks that provide orthographic input, such as oral reading, but the evidence is inconsistent (for instance, a direct comparison of oral reading and spoken word repetition found no difference; Reference BassettiBassetti, 2017).
With some metalinguistic tasks, OEs may be explained away as a consequence of orthographic strategies. For instance, L1 Italian-L2 English speakers produce L2 English [ɪ] as L1 Italian [i] in a complex pseudoword creation task, but never when producing real English words (Reference Piske, Flege, MacKay and MeadorPiske et al., 2002).
31.5 Future Directions in OEs Research
While early research had to test whether orthography affects L2 phonology, this is by now established. Research now needs to combine evidence into a coherent framework, to address the remaining gaps and to introduce theoretical and methodological developments, with a view to contributing to wider debates in linguistics and psychology.
31.5.1 Producing a Coherent Framework
A coherent framework is now needed to clarify which aspects of orthography affect which aspects of L2 phonology learning and processing, in which language combinations, how, and why, and to explain inconsistencies in findings (including explaining why some predicted OEs do not occur). To achieve this, there is a need for studies on a larger scale than has hitherto been the case. Current empirical evidence mostly consists of a variety of small-scale studies, each investigating a single phenomenon in a single language pair. Such small-scale studies result in the rapid proliferation of evidence and allow emerging researchers to contribute to the field, but they should be confirmed by larger, systematic investigations of the effects of the same orthographic phenomenon across aspects of L2 phonology – perception, production, phonological awareness, phonological and lexical learning, and their interactions – and across learners, languages, and contexts (for an example, see Bassetti and colleagues’ work on L2 English consonant length, reported in Reference BassettiBassetti, 2023). For instance, if an L2 speaker adds a sound in production, do they also perceive this extra sound and count or segment it in phonological awareness tasks? Does this sound addition occur across tasks, materials, and types of learners (for instance, with different language combinations or proficiency levels)? This systematic approach would demonstrate the robustness of results and warrant contributions to wider debates (for instance, about the relationship between perception and production).
31.5.2 Addressing Gaps in Current Knowledge
For some factors there is much evidence; for instance, we know that incongruence between L1 and L2 GPCs has negative effects. For other factors, evidence is limited but convincingly consistent; for instance, we can probably agree that a phonologically transparent L1 writing system results in stronger L2 OEs than an opaque one. Finally, for many factors the evidence is still insufficient, inconsistent, or entirely absent, such as the effects of teaching interventions, individual differences, and languages other than English.
Teaching interventions. Research needs to do two things. It should (1) investigate whether methods that reduce the effects of L1 phonology can also reduce OEs and (2) devise ways to address directly the orthography–phonology connection (for example, training in L2 GPCs, or contrasting L1 and L2 GPCs; Reference Bassetti, Cerni and MastersonBassetti et al., 2022).
Individual differences. Individual differences that affect L2 phonology and L2 literacy should all be investigated as potential predictors of OEs. (For a list, see Reference Bassetti, Mairano, Masterson and CerniBassetti et al. [2020].)
Languages other than English. Researching L2 English is important (as it is the most widely studied language, hence findings have wide-ranging impacts) and has practical advantages (participants are easily found, researchers feel confident researching a familiar language, and funding may be more easily available as English teaching and testing is a big business). However, researching other languages can widen the research agenda, introducing different sounds and contrasts, writing systems, and language teaching traditions. It can also confirm findings from English, so that a theory of OEs in L2 English phonology can become a true theory of OEs in L2 phonology. Indeed, the field is moving in that direction with a recent increase in studies of less widely studied language combinations, such as L1 Cantonese-L2 Putonghua (Reference Mok, Lee, Li and XuMok et al., 2018) and L1 Tibetan-L2 Putonghua (Reference Qu and DamianQu & Damian, 2019).
31.5.3 Better Integration of Research Developments
Research into OEs could further integrate recent developments in L2 phonology research in particular and in L2 research in general. For instance, in OE research the dependent variable is usually the nativelikeness of L2 sound production, whereas current L2 phonological research often uses measures of intelligibility and comprehensibility. As another example, there is much evidence of factors that impact L2 phonology, and these could be tested as potential predictors of OEs, as mentioned earlier. Also, in light of developments in research on cross-linguistic influence, it would be interesting to investigate the effects of writing systems other than the native one, for instance looking at effects of L2 GPCs on L3 phonology. Recent research on illiterate L2 learners is also very relevant, and research on OEs should investigate preliterate children, illiterate adults, and others who have no or limited access to orthographic representations.
31.5.4 Further Developing Research Methodology
There is a dearth of mixed-methods, qualitative, longitudinal, and intervention studies. Mixed-methods and qualitative studies could confirm, supplement, and explain the extant, mostly experimental evidence, and may generate new research questions beyond those generated by researchers’ intuitions and observations and by contrastive analyses of languages and orthographies. Longitudinal studies could shed light on the development of OEs in L2 learners over time, and individual variation in such development. There is also a need for teaching intervention studies, perhaps starting with beginner L2 learners in order to maximize the probability of success.
Over time, OE research methods have become increasingly sophisticated (for instance, while some early studies reported only descriptive statistics, more recent ones often employ complex statistical analyses). Yet some methodological limitations could be addressed. Design could be more theory-driven, with a declared theoretical background as the basis both for generating hypotheses and for discussing results. As for participants, the overreliance on university students could be corrected by investigating other populations. While sample sizes have increased, sample size calculations are rarely reported, and various studies appear underpowered. Small samples can generate both spurious effects and false null results, and they limit the field’s ability to detect smaller effects and to test more complex designs with larger numbers of variables, such as individual differences. In terms of materials, there is a need for more objective and transparent measures. The limitations of self-reported variables (questionnaires, self-ratings) should be discussed, and objective measures used wherever possible (for instance, for proficiency). More transparent measures would facilitate cross-study comparisons, including meta-analyses (for instance, reporting length of study in months rather than in “semesters” or similar units that have different meanings across educational systems).
The field has already contributed to methodological innovation by creating new tasks and materials, but these could be validated by using them in more than one study. Replication is key to confirming significant effects found in one-off publications, particularly ones with small sample sizes, and to validating tasks and materials. To improve replicability, researchers could more often share materials in online repositories. Finally, the research community could benefit from submitting their findings to discussion among language teachers, including by producing user-friendly reports such as the Oasis reports (https://oasis-database.org).
31.5.5 Contributing to Wider Debates in L2 Phonology and Beyond
Research into L2 OEs should aim to contribute further to wider debates, as its findings have important implications for research on L2 phonology in particular, and more generally for applied linguistics and cognitive psychology. First, the findings of L2 OE research can contradict predictions based on the findings of research on phonological effects; for instance, orthography can result in the transfer of marked features. Second, such findings can lead to questioning established research priorities. For instance, while research tends to focus on the typical 1:2 scenario (where the L1 has one phonological category corresponding to two L2 categories), research on orthography-induced L2 English gemination has revealed the difficulty of the opposite 2:1 scenario, where the L1 has two categories (for instance, singleton/geminate) that align with one L2 category. Finally, OE research can contribute to various debates in L2 phonological research. For instance, findings that the same OE affects both L2 speech production and perception can contribute to debates about the link between perception and production. Also, looking at debates about the effects of age of onset of acquisition, it has now been twenty years since scholars first proposed a causal link between onset of literacy and decreased ability to acquire L2 phonology (Reference BurnhamBurnham, 2003; Reference Young-Scholten, Burmeister, Piske and RohdeYoung-Scholten, 2002).
It will fall to model developers to decide how to integrate L2 OE findings into L2 phonological development models, but OE researchers can make suggestions. For instance, the revised Speech Learning Model (SLM-r; Reference Flege, Bohn and WaylandFlege & Bohn, 2021) – a model of naturalistic L2 speech learning – could take into account that naturalistic exposure includes exposure to environmental orthographic input (Reference Young-Scholten and LangerYoung-Scholten & Langer, 2015) and to orthography-influenced spoken input, and that L2 listeners’ perception of environmental speech may also be biased by their orthographic knowledge (Reference Bassetti, Masterson, Cerni and MairanoBassetti et al., 2021; Reference SheaShea, 2017). Also, the Perceptual Assimilation Model of L2 speech learning (PAM-L2; Reference Best, Tyler, Munro and BohnBest & Tyler, 2007) – which explains the acquisition of L2 contrasts in terms of patterns of assimilation to L1 categories – could extend its list of patterns of perception assimilation to include 2:1 scenarios, and discuss how L1 and L2 orthographic contrasts impact phonological category assimilation (beyond simply acknowledging that if the same grapheme represents an L1 and an L2 sound, this contributes to the assimilation of the two sounds).
As for L2 pedagogy, much more evidence is needed before recommendations can be made. First, we need evidence of the impact of the various solutions that have been proposed, from delaying orthographic input to supplementing it with phonetic transcriptions (for proposals, see Reference Rafat, Perry and RaoRafat & Perry, 2019). Second, we need evidence of the impact of L2 OEs on comprehensibility and intelligibility (and accentedness where this is an issue). This evidence will enable teachers to assess the effectiveness of interventions vis-à-vis the effort required, and the negative impacts of orthographic input on pronunciation vis-à-vis its advantages for vocabulary development.
31.6 Conclusions
This chapter demonstrated that orthography has strong effects on second language phonology that are widespread and persistent. Although more research is needed, both to fill gaps and to systematize knowledge, the evidence is strong enough to warrant serious consideration of this issue in both theory and practice.
32.1 Introduction
32.1.1 Why Is Change in the First Language (L1) Important?
Even the most seemingly monolingual places are shaped in space and time by different languages meeting. Most often, when languages meet, they are brought together by humans who speak with one another. This section of the handbook is devoted to the variables and outcomes of bilingual speech, and within this chapter the authors examine how an individual’s native language (used synonymously here with “first language,” L1) is impacted when a new language (used synonymously here with “second language,” L2) is acquired after puberty.
Most of our knowledge about the speech of bilinguals relates to the potential for humans to acquire a new language. It is indeed a fascinating endeavor to understand the extent to which an L2 can be acquired once an L1 is already – even if only partly – in place. In comparison, much less bilingualism research has examined whether and how the L1 changes upon L2 acquisition and immersion. It is these changes in the L1 which are discussed here as outcomes of bilingual speech. We consider attrition and drift to be inevitable outcomes of bilingual speech, although individual bilinguals are impacted differently, to greater or lesser degrees, and not necessarily across all aspects of their L1.
Research into L1 attrition and L1 drift aims to investigate how the native language changes when a new language is acquired, and – as discussed in more detail shortly – the two terms are closely related. They are, however, different in terms of the type of L1 change they describe. What underlies both terms is the understanding that a person’s L1 is malleable and can – and does – change in adulthood once fully acquired. In this chapter, the focus will be on L1 attrition and L1 drift (henceforth, simply “attrition” and “drift,” although we refer only to changes in the L1) in bilingual speech.
As bilingualism impacts all of us, either directly through our own personal experiences or indirectly through those around us, attrition and drift have an extensive scope. Indeed, individuals from diverse backgrounds may be impacted by attrition and drift. For example, the first author grew up in Northern British Columbia, so her L1 is Canadian English; however, because she has spent decades in Germany and the UK (as well as time in the Netherlands and Switzerland), her English has changed in comparison to the language she spoke when she moved to Europe as a teenager. As her father spoke Dutch as an L1, she was also exposed to Dutch as a child, but, after living in the Netherlands, she discovered that her father’s Dutch was quite different from the language spoken in the Netherlands (for one, he did not use English words in his Dutch, as many Dutch speakers do today). These personal experiences shaped her understanding of language change well before she set about researching attrition and drift.
In the case of the second author, who grew up in New York, his L1 is US English, although he heard Korean often from family and community members throughout childhood. He started to perceive changes in his L1 during his first extended stay in Korea in his early twenties, when he was studying and speaking Korean daily. These changes were unexpected, as he was, at the same time, working as a teacher of his L1 and speaking English daily. Upon returning to the US, he became increasingly interested in the phenomenon of L1 change in adulthood, and noticed how his parents, L1 speakers of Korean who immigrated to the US in their thirties, would speak Korean differently from homeland speakers in Korea – using native Korean words where homeland speakers would use English borrowings, and vice versa – although his parents’ pronunciation of Korean was not noticeably different from homeland speakers’.
Although there are parallels between the experiences of both authors – both are second-generation immigrants to North America, who returned to their family’s country of origin as adults, and became interested in research into attrition and drift in part due to their own backgrounds – their experiences are also unique. For example, their regional origins are different: the first author grew up in a sparsely populated environment, the second in an urban environment. As such, there was no extensive Dutch community where the first author grew up, whereas there was a large Korean community where the second author grew up. This difference in environment resulted in different bilingual experiences for both authors as the quality and quantity of their Dutch and Korean input varied. Such unique, yet similar, backgrounds exemplify the diversity of bilingual experiences.
Despite the difference in the authors’ regional origins, there is a commonality here as well: both authors grew up on land that was colonized by settlers. In Northwestern British Columbia, the first author saw and learned as a young girl about Canadian residential schools, where, among other atrocities, First Nations children were sent after being taken from their families and forbidden to speak their languages (Greenwood & de Leeuw, 2012). Today, many Indigenous languages, which have been profoundly impacted by the effects of colonialism, are undergoing revitalization initiatives (Sterud Miller, 2019). Here, too, we see the impact of bilingualism and the potential impact of attrition and drift. It is not only through personal choice and planned immigration, as in the cases of the authors of this chapter, that an individual acquires an L2; nor is it necessarily personal choice if this L2 changes the speech patterns of an individual’s L1. Colonialism has given rise to large portions of the world’s population that have been forced to acquire the language of colonizers, and, in doing so, colonialism may have changed not only the number of speakers of colonial and Indigenous languages but also the colonized and colonizing languages themselves.
These examples illuminate how different – and yet how similar – personal bilingual experiences can be. Many people come from immigrant backgrounds or have been personally impacted by colonialism, and are thus in a position to experience attrition and drift, directly or indirectly. This is why investigating change in the L1 is important from a personal perspective, although, as will be discussed in the rest of the chapter, investigating change in the L1 is also important from a theoretical perspective.
32.1.2 Terminology
As already exemplified, both attrition and drift refer to changes in the L1 of an individual, rather than the societal loss of a language (de Bot & Clyne, 1994; Schmid, 2011). In this chapter, we discuss attrition and drift as they occur in bilingual speech (i.e., at a phonetic and phonological level), in contrast to how, for example, the L1 syntax or lexical semantics might change due to L2 acquisition.
32.1.2.1 What Is Bilingual Speech?
We use the term bilingual speech to refer broadly to speech which is produced by an individual who speaks more than one language variety. This could be someone who speaks two or more dialects (bidialectal) or two or more languages (bilingual).
The inclusion of dialects in a definition of attrition and drift is relevant when one considers that the terms “language” and “dialect” represent a continuum and that their “edges are extremely ragged and uncertain” (Haugen, 1966, p. 922). There are many examples of dialect continua which show that the distinction between languages and dialects is fuzzy (Chambers & Trudgill, 1998). Therefore, we believe that when investigating potential attrition and drift, both bilingualism and bidialectalism should be considered. Multiple dialects in a bilingual’s repertoire can be referred to as D1, D2, etc., along the same lines as multiple languages (L1, L2, etc.).
Bilingual speech can be speech in L1/D1 or L2/D2, or speech in an additional variety such as a third language or dialect (L3/D3), and so on. It is simply the speech produced by an individual who speaks two or more languages or dialects. In the study of attrition and drift, bilingual speech is the object of investigation; most commonly, the focus is on L1 speech specifically, but, as an additional language is required by definition, L2 speech, L3 speech, and so on may also enter the investigation. Additional languages may be investigated to see how the bilingual’s different languages interact phonetically and/or phonologically, if at all, throughout the bilingual/bidialectal experience.
32.1.2.2 What Is L1 Attrition and What Is L1 Drift?
We consider the terms attrition and drift to characterize outcomes of bilingual speech on a continuum (see Figure 32.1). At the extreme ends of attrition and drift, the outcomes might look very different, but toward the middle of the continuum, the outcomes might overlap greatly. In the following sections, after considering the similarities between attrition and drift, we discuss differences between the two terms.

Figure 32.1 Theoretical similarities and differences between drift and attrition.
32.1.2.3 Similarities between Attrition and Drift
32.1.2.3.1 L1 Change in Speech
Although attrition and drift might also occur in other linguistic domains, our definition of these terms focuses on L1 changes in speech. These changes can occur at both the segmental and the prosodic level of speech in the phonetic domain. That is, both attrition and drift involve phonetic changes in the L1, which could include changes in L1 perception and/or L1 production. However, phonetic change in the L1 does not necessarily imply phonological change (see Section 32.1.2.4).
32.1.2.3.2 Context of Late Bilingualism
Research into both attrition and drift aims to describe changes in pronunciation or listening that occur within an individual’s L1 when an L2 is acquired post-pubescently, as well as to explain why such changes occur. Crucially, both processes show that L1 change is possible after childhood (see also Flege, 1995; Flege & Bohn, 2021). In contrast, psycholinguistic models of language acquisition (e.g., the critical period hypothesis; Lenneberg, 1969) have advanced the idea that the L1 stabilizes in adolescence (i.e., after the closing of a proposed critical period). This idea, however, is challenged by research showing attrition (in the long term) and drift (in the short term) in late sequential bilinguals (see Chang, 2019b, and de Leeuw & Celata, 2019, for reviews). Such studies augment research on the effects of new dialect exposure in monolinguals, which likewise reveals that, post-pubescently, the L1/D1 changes over time when in contact with new varieties of the same language (Evans & Iverson, 2007; Harrington, 2006; Munro, Derwing, & Flege, 1999; Sankoff, 2004; Shockey, 1984). In short, both attrition and drift refer to processes occurring in the context of late sequential bilingualism, broadly construed to include the very beginnings of L2 acquisition.
32.1.2.3.3 Lack of Permanency
We propose that the term “attrition,” like the term “drift,” characterizes changes in the L1 that are not necessarily permanent. In this respect, we conceive of attrition as similar to drift in being, in principle, reversible. In fact, we consider the term “permanent” to be a misnomer, and a straw man, in this context: since studies of L1 change glimpse only a part of an individual bilingual’s life, how can one ever know whether any observed change is truly permanent? Instead, we regard attrition as the summation of relatively durable changes in the L1 of bilingual speech which have occurred over time, without excluding the possibility that such changes could be reversed at a later point in life. By comparison, we use the term “drift” to refer to L1 changes that are intrinsically less durable, due to their source in recent L2 experience specifically (see Section 32.1.2.4).
Related to lack of permanency, we do not use the terms “attrition” and “drift” to imply language “loss,” which in our view carries the implication of permanency and, in certain contexts, clinical implications and associated value judgments. On the contrary, we consider both attrition and drift to be normal, inevitable, and reversible processes in late sequential bilingualism. Furthermore, because neither attrition nor drift involves an irrecoverable “loss” of any aspect of the L1, we would not expect the process of reversing attrition/drift (i.e., reacquiring or retuning the L1) to look the same as acquiring the L1 from scratch.
32.1.2.4 Differences between Attrition and Drift
32.1.2.4.1 Duration of L1 Change
The terms “attrition” and “phonetic drift” (or “drift” for short) refer to L1 changes that differ in durability. We describe long-term, more enduring effects of the L2 on the L1 with the term “attrition” (Bergmann et al., 2016; de Leeuw, 2019b; de Leeuw, Mennen, & Scobbie, 2012; de Leeuw, Opitz, & Lubinska, 2013; de Leeuw, Schmid, & Mennen, 2010; de Leeuw, Tusha, & Schmid, 2018; Hopp & Schmid, 2013; Major, 1992; Mayr, Price, & Mennen, 2012; Mennen, 2004; Ulbrich & Ordin, 2014), but shorter-term, potentially more superficial effects of the L2 on the L1 with the term “drift” (Chang, 2012, 2013; Sancier & Fowler, 1997; Tobin, Nam, & Fowler, 2017). As noted in Section 32.1.2.3.3, however, both types of L1 change are in principle reversible.
32.1.2.4.2 L2 Input
Attrition and drift differ in terms of their presumed source in L2 input and experience. The term “drift” refers to phonetic (i.e., subphonemic) changes in the L1 that can be attributed to the influence of recent L2 experience (Chang, 2019b). Given this presumed source, research on drift as distinct from attrition often focuses on the earliest stages of L2 immersion, such as the first weeks of a study-abroad experience (Chang, 2012, 2013, 2019a; Lang & Davidson, 2019), or on laboratory-based L2 exposure, which can be controlled and directly observed (Kartushina et al., 2016); however, the study of drift extends to L2 learning in L1-dominant environments (Herd et al., 2015) as well. In the case of drift, we are generally concerned with the effects of comparatively little and/or introductory L2 experience, not advanced L2 proficiency or long-term immigration to an L2-dominant environment. On the other hand, attrition is most often investigated within the scope of long-term immigration (Köpke & Schmid, 2004). Therefore, we conceive of attrition as being due to the cumulative totality of L2 experiences (e.g., de Leeuw et al., 2010), which may or may not be recent or ongoing, whereas drift is due precisely to recent or ongoing L2 experience (e.g., Chang, 2012; Dmitrieva, Jongman, & Sereno, 2020).
32.1.2.4.3 Consciousness of L1 Change
The difference between drift and attrition in the extent and timescale of L2 experience is expected to affect how conscious or aware bilinguals are of the fact that their L1 has changed. In short, bilinguals appear more likely to be conscious of attrition than of drift. This stands to reason, as attrition is a long(er)-term process, so the bilingual experiencing attrition has more opportunity to notice changes in their L1 (which may be pointed out to them by others) than the bilingual experiencing drift. To our knowledge, there is no research that has systematically examined consciousness of change in L1 speech at different stages of late sequential bilingualism. However, anecdotally, many bilingual study participants in attrition contexts have revealed to the first author during experimental debriefing sessions that they “know” they have a foreign accent in their L1 and/or that other people have commented on the “way they say things,” whereas this level of consciousness of L1 change is not very common in the second author’s experience with study participants showing drift.
32.1.2.4.4 Agency in L1 Change
Related to the difference in consciousness of change, we expect attrition and drift to differ in terms of potential effects of the bilingual’s agency as a language user. Although all bilinguals have some degree of agency in their speech development and use (e.g., through choosing environments in which their L1 is practiced and/or their L2 is promoted), the bilingual undergoing attrition is more likely to be conscious of changes in their L1 speech than the bilingual undergoing drift, and is therefore more likely to be able to express their agency about these L1 changes. Agency could be expressed in a variety of ways, such as monitoring and inhibition of noncanonical L1 production patterns, choosing different language environments, talking with different interlocutors, or selecting languages differently by interlocutor and/or topic.
32.1.2.4.5 Scope of L1 Change
Both attrition and drift involve phonetic changes in L1 speech – that is, changes that do not alter abstract phonological structure (e.g., phonemic contrast between two sounds) – but only attrition may additionally involve changes at a phonological level. Because drift follows from relatively little, albeit recent, L2 experience, the L1 changes in drift are expected to be phonetic, not phonological. Any L1 changes that are phonological (e.g., merging phonemes) are more likely to be found in contexts of attrition – and can be described precisely as “phonological attrition” (Celata, 2019; de Leeuw et al., 2018) – because such changes ostensibly result from the more extensive L2 experience characteristic of attrition contexts. Whether “phonological drift” is possible is an empirical question, but to our knowledge such a phenomenon has not been observed and we do not predict this to be possible.
An additional note about attrition and drift related to scope concerns the language versus dialect continuum. Although the term “attrition” has been used to refer to L1/D1 change in the context of both bilingualism and bidialectalism, the term “drift” has, up to now, mostly been applied to L1 change in the context of bilingualism per se (i.e., not bidialectalism). This is because the changes in L1 speech observed to follow from acquisition of a new dialect (see Nycz, 2015) have generally only been studied long after migration to a new dialect region (meaning that it is not possible to tease apart effects of the most recent exposure to the new dialect from those of extensive cumulative exposure), while changes in L1 speech found after recent exposure to a same-language talker, who may or may not speak a different dialect, have been described with a different term (namely, “accommodation”; e.g., Coupland, 1984; Giles, Coupland, & Coupland, 1991; Hay, Jannedy, & Mendoza-Denton, 1999; Kim, 2009). In our view, if one were to investigate effects of recent D2 exposure specifically, and in particular outside of interactional contexts, it would be appropriate to describe these effects as “drift” of the D1.
32.1.3 Core Findings of Attrition and Drift
Research on attrition and drift in bilingual speech is relatively young, but there are several recurring themes that have emerged from the literature. In this section, we briefly discuss each of six themes. More in-depth reviews of the literature in this area can be found in Chang (2019a, 2019b) for (phonetic) drift, de Leeuw (2019a) for phonetic attrition, and Celata (2019) for phonological attrition.
First, virtually all aspects of L1 speech are subject to attrition and drift. Attrition has been documented in both segmental (de Leeuw et al., 2018) and suprasegmental features (de Leeuw et al., 2012; Mennen, 2004) and in both perception (Ahn et al., 2017; de Leeuw, Kapia, & Lewis, 2023) and production (de Leeuw, 2019b; de Leeuw et al., 2010). Similarly, drift has been found in both segmental and suprasegmental features (Chang, 2012, 2013) and in both perception (Kellogg & Chang, 2023; Tice & Woodley, 2012) and production (Chang, 2019a; Sancier & Fowler, 1997). In short, there do not appear to be any aspects of L1 speech that are, by nature, impervious to change in late sequential bilinguals.
Second, there is variation in how susceptible different aspects of the L1 are to attrition and drift, and at least some of this variation may be principled. For example, in contexts of drift, the voice onset time (VOT) of L1 stops has been found to change faster than spectral qualities of L1 vowels such as formants (Chang, 2012; see also Lang & Davidson, 2019). This disparity has been attributed to the different nature of these features: VOT is timing-based, but vowel formants are posture-based (and, therefore, potentially more anchored in proprioceptive feedback).
Third, attrition and drift can generally be linked to a developing or well-established L2. In all the examples cited earlier, the L1 changes were observed in bilinguals in the broadest sense, including both novice L2 learners and proficient L2 users. Thus, despite our acknowledgment that attrition has been theorized to be possible without L2 acquisition (see, e.g., Köpke, 2004), in practice, both attrition and drift are typically attributable to the influence of an L2, influence that may be specific (e.g., due to particular L2 sounds) or more general (e.g., due to the presence of an additional language in the bilingual’s repertoire that “competes” with the L1). Furthermore, specific L2 influence may not necessarily result in assimilation toward L2 norms, as dissimilatory attrition/drift is also attested (e.g., Chang, 2019b; de Leeuw et al., 2012; Flege & Eefting, 1987).
Fourth, both attrition and drift are influenced by cross-linguistic similarities between the L1 and the L2 at multiple levels (e.g., between individual phones, between natural classes, etc.), but the effects of L1–L2 similarity vary according to the aspect of the L1 under consideration. In the case of contrasts, similarity to L2 contrasts appears to counteract attrition by reinforcing the contrastiveness of the L1 contrast; thus, L1 contrasts paralleled by similar contrasts in the L2 tend to resist attrition (Ahn et al., 2017), while contrasts unique to the L1 are more vulnerable (de Leeuw et al., 2018). In the case of sounds, however, there seems to be a “sweet spot” of cross-linguistic acoustic similarity that results in L1 change: enough similarity to an L2 sound is needed to cause the L1 sound to be cognitively linked to, and thus able to be influenced by, the L2 sound, but too much similarity (such that any L1–L2 acoustic disparity is so small as to be unnoticeable) may remove any trigger for L1 change (Chang, 2012), just as a lack of phonological similarity appears to do (see, e.g., /y/ in Chang et al., 2011; Flege, 1987). Recent work on drift has further argued that drift is not only triggered by L1–L2 similarities but also shaped directly by L2 development, such that the L1 and the L2 change together (“tandem drift”; see Turner, 2023).
Fifth, apart from L2 influence, attrition and drift are influenced by individual differences in L1-specific as well as nonlinguistic factors. Factors specific to the L1 include the quality of L1 use, including the amount of code mixing and language switching (de Leeuw et al., 2010; Kartushina & Martin, 2019); amount of L1 education (Ahn et al., 2017); number of L1 contacts (Hulsen, 2000); and sociolinguistic variables (Nodari, Celata, & Nagy, 2019). Nonlinguistic factors include inhibitory control (Lev-Ari & Peperkamp, 2013). Although many L1-specific factors could be construed as indirectly related to the L2 for bilinguals, cognitive constructs such as inhibitory control are generally understood to be domain-general (i.e., not specific to language). Crucially, this means that an individual bilingual’s manifestation of attrition and/or drift cannot be reduced completely to L2 influence.
Finally, attrition and drift can be reversed, although their reversal may differ depending on the feature under consideration. For example, drift in both VOT and vowel formants is reversed substantially after L1 reimmersion (Kartushina & Martin, 2019; Sancier & Fowler, 1997). Within an L2 environment as well, drift in VOT is reversed significantly in the absence of frequent active L2 use, whereas drift in fundamental frequency (f0) is more resistant to reversal even without frequent active L2 use (Chang, 2019a). In addition, there is evidence that traces of drift as well as attrition may remain in the L1 as long as ten months after L1 reimmersion (Turner, 2023), suggesting that full reversal of these processes may be difficult, at least as long as the L2 continues to be used within the L1 environment.
In summary, the current body of work on attrition and drift in bilingual speech does not point to any single condition as sufficient for triggering attrition or drift, but rather indicates that these are complex, multifaceted processes influenced by myriad factors, both linguistic and nonlinguistic, and about which much remains unknown. In the next sections, we contextualize a new theory of bilingual speech, one that devotes special attention to attrition and drift, within the landscape of existing theories of L2 speech, and then present the principles of this new theory as a testable framework for future work in this area.
32.2 Links to L2 Speech Theories
Current theories of bilingual speech focus on speech production and perception of individual sounds and sound contrasts in the L2. Three prominent theories – the Perceptual Assimilation Model (PAM; Best, 1995) and its extension to L2 learning (PAM-L2; Best & Tyler, 2007), the Speech Learning Model (SLM; Flege, 1995) and its revised version (SLM-r; Flege & Bohn, 2021), and the Second Language Linguistic Perception model (L2LP; Escudero, 2005, 2009) – explain how L1 experience influences L2 speech learning in different learning scenarios (Tyler et al., 2014). Special attention is given to L2 prosody in other theories, such as PAM for Suprasegmentals (PAM-S; So & Best, 2010) and the L2 Intonation Learning theory (LILt; Mennen, 2015). In this section, we discuss points of convergence and divergence between our proposed theory – Attrition & Drift in Access, Production, and Perception Theory (ADAPPT) – and the most widely tested bilingual speech theories, namely, the SLM-r, the PAM-L2, and the L2LP.
A major similarity among the SLM-r, the PAM-L2, and the L2LP is that none attempts to provide a detailed account of the effects that L2 acquisition has on L1 speech. Although the SLM-r admits the possibility of L2 effects on the L1 by positing that cross-language interactions are bidirectional, its focus remains on accounting for the acquisition of L2 speech, similar to the PAM-L2. The L2LP goes one step further in explicitly not allowing for perceptual attrition: “The L2LP model predicts that advanced L2 learners in an L2 monolingual mode will exhibit an L2 perception similar to that of monolingual native listeners. This means that no fossilization in L2 sound perception and no attrition in L1 sound perception will be attested” (Escudero, 2005, p. 121). We consider the general lack of attention to L2 effects on the L1 to be a significant shortcoming of these models as theories of bilingual speech, given the mounting evidence that the L1 is indeed impacted upon L2 acquisition (see references in Sections 32.1.2.3, 32.1.2.4, and 32.1.3). Our model, ADAPPT, attempts to address this shortcoming by devoting special attention to L1 change.
A second similarity among the PAM-L2, the L2LP, and the SLM (although not the SLM-r) is their stipulation that accurate perception needs to be in place before accurate production is possible, and that difficulties producing L2 sounds have a perceptual basis. The SLM-r, by contrast, adopts the view that “L2 segmental perception and production coevolve without precedence” (Flege & Bohn, 2021, pp. 28–29). In ADAPPT, we take a view similar to that of the SLM-r as to the timing of changes in perception versus production (i.e., they are not coordinated in any particular way), which leads us to predict disparities in attrition/drift between perception and production (see Section 32.4).
Note that ADAPPT’s view on the lack of coordination between L1 changes in perception versus production reflects a core hypothesis of separation between these modalities, which are presumed to rely on divergent, albeit connected, representations. This hypothesis is supported by a growing body of research showing dissociations between perception and production of L2 sounds (Baker & Trofimovich, 2005; Barrientos, 2023; Beach, Burnham, & Kitamura, 2001; de Leeuw et al., 2021; Gorba Masip & Cebrian, 2021; Kartushina & Frauenfelder, 2014; Sheldon & Strange, 1982; Zampini, 1998). Indeed, of the experimental studies explicitly targeting the relationship between L2 perception and production, most have not found that accurate perception clearly precedes production. For example, results from Sheldon and Strange (1982) revealed that L1 Japanese speakers who produced L2 English /l/ and /r/ in a target-like fashion still made errors in perceiving the contrast, which led them to conclude that “perceptual mastery of a foreign contrast does not necessarily precede adult learners’ ability to produce acceptable tokens of the contrasting phonemes” (p. 254). Recent studies that have carefully controlled for methodological factors to measure L2 perception and production using more comparable metrics have also not found especially consistent and/or strong perception–production correlations (e.g., Baese-Berk, 2019; Melnik-Leroy, Turnbull, & Peperkamp, 2021; Turner, 2022; see also Cheng et al. [2021] on the L1 perception–production link).
Such empirical evidence directly challenges the view that L2 perception and L2 production are tightly linked; it also opens up the possibility that change in L1 perceptual capacities may occur separately from change in L1 production capacities.
A third similarity among the SLM-r, the PAM-L2, and the L2LP is their incorporation of a central role for the construct of “similarity.” The SLM-r views cross-linguistic phonetic similarity between nonidentical L1 and L2 sounds as ultimately challenging for developing target-like L2 categories, whereas the L2LP views similarity as facilitative for developing target-like L2 perception. Focusing on L2 contrasts, the PAM-L2 proposes a typology of similarity relationships between L2 contrasts and L1 sounds, with the least challenging case being an L2 contrast resembling an L1 contrast (i.e., two different L1 sounds) and the most challenging case being an L2 contrast resembling one and the same L1 sound. Regarding attrition and drift of L1 speech, ADAPPT views cross-linguistic similarity as playing a significant, but variable, role; in particular, similarity can either prevent L1 change (compare with the SLM-r on L2 development) or facilitate L1 change (compare with the L2LP on L2 development), depending on the aspect of L1 speech at issue (see Section 32.1.3).
Finally, a further similarity among the SLM-r, the PAM-L2, and the L2LP is their focus on segments, such as phonemes and allophones (compare with the PAM-S and LILt). However, languages differ not only in their phoneme inventories but also in their realization of prosody and phonotactic constraints. Moreover, speech itself is holistic: prosodic dimensions are produced in tandem with segments, and segments are constrained by phonotactic rules. In fact, speech is not easily separable into individual sounds at all. This is to say that a theory explaining speech primarily at the segmental level is inherently limited because it does not account for many other important contributors to the speech signal. Thus, in ADAPPT, we propose a theory of bilingual speech which encapsulates segmental and suprasegmental changes in both the L1 and the L2 as the next step forward for research into bilingual speech.
32.3 Attrition and Drift in Access, Production, and Perception Theory (ADAPPT)
Very generally, ADAPPT stipulates that the speech systems of different languages which a bilingual perceives and produces are inherently connected to one another because they are produced by the same person in the same vocal tract with the same ears. As these languages are intrinsically connected, changes which occur in one language (e.g., in an L2 as it is newly acquired and develops) will have effects on the other language (i.e., the L1). There are ten main principles of ADAPPT which differentiate it from other theories of bilingual speech. These are outlined below.
1. L1 and L2 speech are intrinsically linked. As mentioned already, ADAPPT predicts that when a person newly acquires or further develops an L2, this will have an impact on the L1. In its most minimal form, the impact on the L1 could be reduced accessibility, causing, for example, slower reaction times in the L1 (e.g., de Leeuw et al., 2023; Linck, Kroll, & Sunderman, 2009), but in some cases (e.g., extended L2 immersion) extensive structural changes in the L1 could occur (e.g., de Leeuw et al., 2018; Dmitrieva, Jongman, & Sereno, 2010; Yao & Chang, 2016). Future research could examine how these different forms of mutual interaction proceed over a bilingual’s life span. Crucially, in contrast to the aforementioned theories, ADAPPT stipulates that the L1 will always be impacted in some way by L2 exposure and acquisition and that this is a normal outcome of bilingual speech. This does not mean that the L1 will be impacted in the same way across all bilinguals (on the contrary, there is great variation in the extent to which bilinguals display attrition and drift; Chang, 2012; de Leeuw, 2019b); rather, it means that, among the various dimensions of L1 speech, L2 development will necessarily impact an individual’s L1 in one or more of these dimensions.
2. All dimensions of L1 and L2 speech can be affected in a bilingual’s language repertoire. According to ADAPPT, any process related to speech, including access (i.e., retrieval of information in long-term memory representations), production, and perception, and any linguistic representation related to speech, such as of individual sounds, prosodic patterns, and phonotactic constraints, is a candidate for change in attrition and drift. We see these various dimensions of speech as interconnected and do not believe that it is generally possible for L1 and L2 sounds, for example, to influence one another without this affecting other dimensions because attrition and drift are at least in part systemic (Chang, 2012, 2013; Mayr et al., 2012; Turner, 2023). In the future, it could be interesting to jointly investigate numerous dimensions of bilingual speech within individuals to see the extent to which there is systematicity in how these dimensions influence one another throughout attrition and drift.
3. Changes in bilingual speech tend to occur hierarchically. Closely related to principle 2, ADAPPT stipulates that what happens initially at the “lower” levels of speech (e.g., acoustic dimensions) impacts “higher” levels (e.g., phonemes, syllables, tones, etc.). Thus, we would predict, for example, that phonological attrition (de Leeuw et al., 2018) would not occur without phonetic attrition (de Leeuw, Mennen, & Scobbie, 2013). That is, lower-level phonetic change is necessary, but not sufficient, for higher-level changes in the L1. Future research might test this prediction by examining numerous levels of bilingual speech (e.g., vowel formants and vowel contrasts) within the same bilingual participants.
4. There is no straightforward link between speech perception and production. While acknowledging that there may be mechanisms of mapping between speech perception and production, ADAPPT does not predict a straightforward link between perception and production in either the L1 or the L2 (i.e., accurate speech perception is neither sufficient nor necessary for accurate speech production, and vice versa). Instead, ADAPPT suggests that speech perception and production recruit different executive control mechanisms (see Rubinstein, Meyer, & Evans, 2001 for examples of different possible mechanisms), are based on qualitatively different representations (see, e.g., Schuerman, Meyer, & McQueen, 2015; compare with Best, 1995), and are acquired at least partly independently in both L1 and L2 acquisition (de Leeuw et al., 2021; Wode, 1999). By the same token, perception and production are influenced at least partly independently in attrition and drift (for related findings from international adoptees, see Oh, Au, & Jun, 2010). To test this fourth principle, the design of future experiments needs to incorporate both perception and production tasks testing and measuring the same variables.
5. Short- and long-term changes are possible in the L1. As discussed previously, ADAPPT differentiates between short- and long-term changes in the L1 (respectively, drift and attrition). By hypothesis, these two levels of durability of L1 change differ in their source: in the case of drift, it is exclusively recent/ongoing L2 experience, whereas in the case of attrition, it is not, as we consider attrition to be due to the totality of L2 experiences, which are not necessarily recent or ongoing. Although recent L2 experience is often confounded with extensive L2 experience in bilingualism research, in those instances where they are not, the evidence suggests that there are indeed L1 changes that are due to the totality of L2 experience rather than recent L2 experience specifically (e.g., Turner, 2023). Future work could focus on recency effects in more detail to determine the extent to which drift and attrition interact to determine changes in the L1 and the L2 of bilinguals over their life span.
6. The L1 and the L2 can change and adapt at any time in a person’s life. According to ADAPPT, the speech of bilinguals is dynamic (de Bot, Lowie, & Verspoor, 2007; de Leeuw, Opitz, et al., 2013) and the languages of a bilingual are constantly interacting throughout life (Chang, 2019a; Sancier & Fowler, 1997). Therefore, there is no language fossilization (see also Larsen-Freeman, 2005), and attrition can never be permanent because the languages of a bilingual continue to change as they interact. Here, future research could conduct short- and long-term longitudinal studies with numerous testing points to examine how linguistic variables change over time in different situations (see further discussion in Chang, 2012, 2019a; de Leeuw, 2019b).
7. Different variables in both the L1 and the L2 can be affected differently due to sociolinguistic factors. Recent studies of bilingualism highlight the relevance of sociolinguistic factors for understanding cross-language interactions and the lack thereof (e.g., Passoni et al., 2018). For example, research suggests that some L1 variables are more prone to being influenced by a new language or dialect due to their role in socio-indexical marking (Bergmann et al., 2016; de Leeuw, 2019b; Sankoff, 2004) or, alternatively, are resistant to such influence due to their socio-indexical value (Nodari et al., 2019). This is not to rule out the possibility of across-the-board, systematic change; rather, given that different linguistic variables have different social meaning (e.g., stereotypes, markers, and indicators; Labov, 1971), different L1 variables may be affected differently by L2 counterparts depending on their social meaning. According to ADAPPT, when we see differences between variables with respect to attrition and drift, we should look to sociolinguistic explanations.
8. Different variables in both the L1 and the L2 can be affected differently due to language-internal and cross-linguistic factors. The language-internal functional role of sounds and other aspects of the L1 influences the likelihood of their attrition/drift (see, e.g., Alkhudidi, Stevenson, & Rafat, 2020; de Leeuw et al., 2018, 2023). In particular, a high functional load or otherwise important functional role in the L1 may inhibit L1 change. Various types of similarity between aspects of the L1 and the L2 also affect the likelihood of L1 change. For example, where an L1 sound is similar to, but still noticeably different from, a nearby sound in the L2, this proximity could lead to competition between the two sounds or to a perceptual conflation of the two sounds that allows the L1 sound to be influenced by the properties of the L2 sound. Alternatively, where an L1 sound is unique to the L1 (i.e., has no L2 counterpart), there will be little to no competition with L2 sounds, so an L2 speaker could spend years in an L2 immersion setting without hearing the L1 sound and show no attrition of that L1 sound when finally tested. On the other hand, L1 contrasts may benefit from being similar to parallel L2 contrasts, by way of having their contrastiveness reinforced (Ahn et al., 2017; see also Chang & Ahn, 2023). In other words, cross-linguistic similarities that effectively weaken aspects of the L1 promote the likelihood of drift/attrition, whereas those that strengthen them reduce the likelihood of drift/attrition.
9. It is possible for bilinguals to be conscious of changes in their speech. Bilinguals may be consciously aware of changes in their L1, and we hypothesize that as time proceeds in the L2 environment and/or in their use of the L2, bilinguals’ consciousness of these changes becomes more likely. As such, bilinguals are more likely to be conscious of attrition than of drift. It would be interesting in future studies to examine consciousness systematically and measure how strongly it correlates with different stages of attrition/drift.
10. Bilinguals have agency in their speech development and use. This agency can be expressed in a number of ways, such as through choosing environments in which their L1 is practiced and/or their L2 is promoted – choices that crucially affect L1 input (compare with the SLM-r, which emphasizes the role of quantity and quality of L2 input for L2 development; Flege & Bohn, 2021). For example, some bilingual parents choose to speak their L1 with their children, while others choose to speak the new language of their environment; regardless, when the children become adults, they may then choose to learn the parents’ L1 and even live in the respective language environments. Such choices, when available, play a significant role in the course of potential drift and attrition. Because the late bilingual undergoing attrition (as opposed to drift) is more likely to be conscious of this change in L1 speech, that bilingual is also more likely to be able to exercise agency with respect to the change (e.g., if a foreign accent in their L1 is pointed out to them, they might choose to talk more with old friends and acquaintances in their L1). Future studies may wish to investigate the extent to which particular choices made through a bilingual’s agency affect attrition and drift.
By incorporating the findings of research on attrition and drift – which we consider central facets of bilingualism – we submit that ADAPPT offers a more holistic theoretical account of bilingual speech than theories which focus primarily on L2 speech learning. As outlined here, the ten principles of ADAPPT are testable and therefore falsifiable. As such, we hope that this initial version of ADAPPT will help spur future research that will shed further light on the dynamics of bilingualism.
32.4 Implications for Major Branches of Linguistics
In the final section of this chapter, we briefly discuss the significance of findings revealing attrition and drift in relation to major theoretical approaches and branches of linguistics. In doing so, we aim to highlight the breadth of implications of findings on attrition and drift, which have often not been adequately acknowledged in these other parts of the field. We begin with sociolinguistics, continue with usage-based theory, and close with generative linguistics.
32.4.1 Implications for Sociolinguistics
There are three main points we would like to make about sociolinguistics. Firstly, despite a compelling body of studies which has revealed that an L1 can change in adulthood and undergo drift in the short term and attrition in the long term (see Sections 32.1.2.3, 32.1.2.4, and 32.1.3), as well as corpus-based longitudinal research showing systematic L1 changes in young and middle adulthood independent of language contact (Gahl & Baayen, 2019), the assumption that the L1 acquired during childhood and adolescence is stable, or potentially even impermeable to change, is often (we believe inappropriately) applied in sociolinguistic research (see, e.g., Meyerhoff, 2011, pp. 134–136, who describes this issue in further detail).
More specifically, sociolinguistic “apparent time” studies interpret the speech of individuals in adulthood based on this questionable assumption. Language change at the community level is taken to be measurable through the comparison of younger and older individuals (i.e., different generations of speakers): if older individuals speak differently than younger individuals, this difference is interpreted as evidence that the (community-level) language has changed since the time the older individuals were the age of the younger individuals (again, because older individuals’ speech is assumed to have remained the same since the close of a proposed critical period). However, if it is accepted, in light of findings of attrition and drift, that the L1/D1 can and does change in adulthood, then we propose that it is not actually possible to deduce the former speech patterns of any language user from their currently observed adult speech patterns, and the older versus younger comparison in “apparent time” studies breaks down. Therefore, where possible, “apparent time” studies would be wisely augmented by longitudinal studies to draw conclusions about language change over time.
Secondly, particularly in highly diverse multilingual communities, which are increasingly the object of sociolinguistic investigations, it would be interesting to investigate longitudinal effects of attrition and drift more specifically, to see how these processes potentially interact with and influence language change and variation (see, e.g., Cheshire, 2020; Cheshire et al., 2011; Fox, 2015; Fox & Torgersen, 2018; Marzo, 2016; Oxbury & de Leeuw, 2020). For example, in countries impacted by colonization, the L1 of the colonized speakers could undergo changes due to mass (forced) acquisition of an L2. Similarly, in countries less directly impacted by colonization, the hegemony of certain “world languages” such as English may have effects on the L1 of speakers who are not otherwise immersed in a multicultural environment (see, e.g., Chang, 2012, p. 266). Accordingly, it is relevant for sociohistorical studies investigating language variation and change to consider the potential effects of attrition and drift on the level of community language change.
Finally, we believe that studies of attrition and drift should, in turn, bear sociolinguistic factors in mind (see principle 7 of ADAPPT). This has not been a focal point of research into attrition and drift, but we believe that studies of L1 change – in particular, in the phonetic and phonological domains – could fruitfully use social and indexical information about bilinguals’ L1 and L2 to help explain their findings (see, e.g., Nodari et al., 2019; Passoni et al., 2018).
32.4.2 Implications for Usage-Based Theory
There is one main point about usage-based theory – specifically, exemplar theory – that we would like to make. By way of background, usage-based studies of attrition and drift use the construct of memory to explain their findings. For example, the Activation Threshold Hypothesis stipulates that more frequently used items have a lower threshold for activation and are more accessible than less frequently used items (Paradis, 2004). In this vein, a language which is rarely used due to immigration to another country would become less accessible and potentially undergo drift in the short term and attrition in the long term (Paradis, 2007).
Exemplar theory is a usage-based theory which stipulates that language sound systems are represented in a set of detailed phonetic exemplars that the individual perceives and that feed the individual’s production; phonological rules – if they exist at all – emerge from these exemplars. As a result of the perception and use of these exemplars, which are stored in memory with indexical (e.g., talker) characteristics, indexical meaning emerges and sound inventories of languages change over time. For example, it has been claimed, alternately, that high-frequency words lead sound changes (Bybee, 2002) or that low-frequency words lead sound changes (Hay et al., 2015). However, in these investigations, bilingualism was not incorporated into the analysis, despite the fact that the languages of bilinguals influence each other and are not produced in isolation, and more people in the world speak two or more languages than only one language (e.g., Grosjean, 1998).
Research on speech production and/or perception from an exemplar theory perspective should account for bilingualism explicitly, because cross-linguistic influence is particularly apparent in the speech of bilinguals. Research into language acquisition and attrition has shown that the languages of simultaneous bilinguals (Sundara, Polka, & Baum, 2006) as well as early bilinguals (Caramazza et al., 1973) influence one another; in late sequential bilinguals, too, there is mutual cross-linguistic influence, including drift in the short term and attrition in the long term. We are unaware of studies coming from an exemplar theory perspective which have focused on bilinguals, but we believe that, to be generalizable to all language users, usage-based theories such as exemplar theory need to account for bilingualism. After all, the frequency and manner in which bilinguals use their languages seem to play a central role in the way they perceive and produce them, including the L1, and their production of these languages is also perceived (and remembered) by monolinguals.
32.4.3 Implications for Generative Linguistics
There are two points which we believe are important for generative linguistics. Our first point is a practical one: given the reality of L1 change in adulthood, researchers working in the “monolingual model” encouraged by generative linguistics (Chang, 2019a) should consider the consequences of attrition and drift for their target population(s). For example, if the research questions concern a “homeland” language variety, it may not make sense to target diaspora users of the language, who may evince (a type of) bilingualism that is not characteristic of homeland users and who may therefore diverge from them linguistically in unintended or unexpected ways. Crucially, such divergence may occur in such unambiguously structural aspects of the L1 as inflectional morphology (Jarvis, 2003) and phonemic contrast (de Leeuw et al., 2018), which means that no L1 feature should be considered exempt from change in the context of bilingualism (see principle 2 of ADAPPT). This is especially true for speech, which has been our focus in this chapter and which is known to show remarkable plasticity in adulthood.
The second point is in reference to Chomsky’s (1965) differentiation between linguistic “competence” (i.e., internal knowledge) and “performance” (i.e., the externalization of internal knowledge). In the past, demonstrating attrition at any linguistic level was almost exclusively accomplished by comparing the performance of a group of immigrant bilinguals in their L1 with that of monolinguals from the immigrants’ country of origin. In such cross-sectional studies, if the performance of the bilinguals was different from that of the monolinguals, the conclusion was that attrition (of competence) had occurred. However, it has been argued (as in ADAPPT) that any potential attrition, as reflected in performance differences, might not be permanent. Going one step further, it has been argued that, because performance differences could change over time, competence per se has not been assessed; therefore, since no change in competence has been shown, no attrition has actually occurred.
We argue that this view of the relationship between performance and competence, and of the role of permanence in interpreting observed changes in L1 performance, is unnecessarily limiting. It is possible to assume a competence versus performance distinction and still allow that competence may change over time, even in adulthood. To put it another way, one can say that some performance changes do not reflect changes in underlying competence without saying that no performance changes (because of their potentially temporary nature) can reflect changes in underlying competence. Otherwise, we are left in the paradoxical position of drawing conclusions about competence from performance without being able to consider performance. In our view, this is not the way forward; instead, researchers should have the ability to adjudicate when some deviation in L1 performance from a reference group (which may not necessarily comprise monolingual users of the L1) is systematic – and should therefore count as drift or attrition – versus not (e.g., isolated memory lapse).
32.5 Concluding Remarks
We close with an acknowledgment of Vivian Cook’s notion of “multicompetence” (Cook, 1992, 2003), which established, decades ago, a conceptual framework for understanding how the L1 adapts as the result of acquiring an L2. In this chapter, we have outlined a theory of attrition and drift that we call ADAPPT, for a specific reason: like Cook, we regard the L1 as dynamic, capable of change over the life span, and we believe that the way in which the L1 changes is, in a sense, logical. When language users add a new variety to their language repertoire, the rest of the language repertoire must adapt to the presence of this new variety to facilitate its integration into the repertoire. Thus, just as Cook argued, L1 change in the context of L2 acquisition should not be surprising; on the contrary, it should be expected. With ADAPPT, our goal is to begin to formulate a testable theory of how bilingual speech develops, giving due consideration to the L1 changes that have been mostly left out of other theories in this area. Our hope is that future work on L2 speech learning, attrition, and drift will further develop and refine this theory, leading to new questions about bilingual speech and multiple facets of language change over the life span.
33.1 Introduction
Speech intelligibility – the accuracy with which a given speech signal is understood in a particular communication setting – is notoriously difficult to predict (for recent discussion, see Mamun et al., 2021; for second-language speech intelligibility prediction in particular, see van Wijngaarden et al., 2004). Nevertheless, all else being equal, variation in intelligibility as assessed by word recognition accuracy is a common distinguishing feature of first-language (L1) versus second-language (L2) speech production and perception. Notwithstanding substantial variation across talkers with varying language learning experiences, a general finding across many comparisons of L1 and L2 speech intelligibility with controlled test materials, listening conditions, and listeners (typically, L1 adults with normal hearing and cognitive function) is that recognition errors are more likely for speech by L2 talkers than for speech by L1 talkers, particularly with increasing levels of background noise (e.g., among many others, Bradlow & Bent, 2008; McLaughlin et al., 2018; Munro, 1998; Munro & Derwing, 1995; Rogers, Dalby, & Nishi, 2004). For example, in a seminal study demonstrating partial independence of objective intelligibility, subjective comprehensibility, and accentedness of the speech of L2 English learners, Munro and Derwing (1995) found that, even for a group of L2 English talkers with relatively high proficiency, the mean number of recognition errors by L1 English listeners per L2 talker was almost three times the mean number of errors per L1 talker.
This well-documented difference in word recognition accuracy for L2 versus L1 speech is mirrored by equally well-documented challenges associated with L2 versus L1 listening, particularly under noisy listening conditions (e.g., among many others, Bradlow & Alexander, 2007; Bradlow & Pisoni, 1999; García Lecumberri, Cooke, & Cutler, 2010; Mayo, Florentine, & Buus, 1997; Tamati & Pisoni, 2014). Based on a comprehensive review of studies of L1 versus L2 listening in background noise, Scharenborg and van Os (2019) estimated that a signal-to-noise ratio increase of approximately 4–8 dB is required for L2 listeners to match the level of word recognition accuracy of L1 listeners. Considering that many (perhaps most) conversations in the world today involve L2 participants, any account of speech intelligibility in real-world speech communication must address these salient differences between L1 and L2 speaking and listening. Accordingly, the goal of this chapter is to provide a conceptual framework that links various strands of research on speech intelligibility in communicative situations with talkers and listeners from different language backgrounds. The proposed Talker–Listener–Language, or TL2, framework is presented as both a coherent perspective on prior research and a map for future research.
Four streams of research define the broader context for the TL2 framework. First, as a framework for understanding speech intelligibility in bilingual communicative situations, the TL2 framework links directly to prior work that has distinguished speech intelligibility from the related dimensions of comprehensibility, accentedness, and listening effort. Seminal work by Munro and Derwing (1995) established that intelligibility (proportion of words correctly recognized), comprehensibility (a rating from “extremely easy to understand” to “impossible to understand”), and degree of foreign accent (a rating from “no foreign accent” to “very strong foreign accent”) are related but partially independent dimensions of L2 speech processing. Specifically, L2 speech samples that are equally intelligible may not be judged as similar on the other two dimensions (for recent discussion, see Levis, 2005, 2020; Munro & Derwing, 2020). Similarly, research with various talker and listener populations has demonstrated divergence of intelligibility (in terms of word recognition accuracy) from listening effort as assessed from either physiologic (e.g., pupillometric) or subjective measures (e.g., Borghini & Hazan, 2018, 2020; Francis & Love, 2020; McGarrigle, Rakusen, & Mattys, 2020; Van Engen & Peelle, 2014; Winn & Teece, 2021). While all of these dimensions are relevant for speech communication in bilingual communicative situations, the TL2 framework emphasizes intelligibility in terms of word recognition accuracy since its measurement and interpretation are currently the most straightforward of the various dimensions that constitute effective and efficient speech communication.
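To make the intelligibility measure emphasized here concrete, word recognition accuracy can be scored as the proportion of target words a listener correctly reports. The sketch below is a minimal, hypothetical scoring function; the function name, lowercase normalization, and position-independent matching are our own illustrative assumptions, not a standard prescribed in the intelligibility literature.

```python
def word_recognition_accuracy(target: str, response: str) -> float:
    """Score a listener's response as the proportion of target words
    correctly reported, ignoring case and word order (an assumption;
    stricter protocols may require keywords in position)."""
    target_words = target.lower().split()
    response_words = response.lower().split()
    if not target_words:
        return 0.0
    correct = 0
    remaining = list(response_words)  # consume matches to avoid double-counting
    for word in target_words:
        if word in remaining:
            correct += 1
            remaining.remove(word)
    return correct / len(target_words)
```

For example, scoring the response "a dog chased the ball" against the target "the dog chased the ball" credits four of five target words (0.8); averaging such scores over many utterances and listeners yields the per-talker intelligibility estimates discussed above.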
The TL2 framework is also developed within the context of research on speech recognition in “adverse” conditions, which Mattys et al. (2012) defined as involving the introduction of any factor that may lower intelligibility relative to the optimal condition of “healthy native listeners hearing carefully recorded speech in a quiet environment and under focused attention.” In their review of research on speech recognition in adverse conditions, Mattys et al. (2012) classified adverse conditions (ACs) with respect to two orthogonal dimensions: the origin of the AC (noncanonical signal, transmission interference, or receiver limitations) and the locus of the effect of the AC (perceptual processing, mental representation, attention, or memory). This classification scheme aimed to expand the domain of speech recognition research to include the complex listening environments and individual differences that characterize real-world speech communication. Like this AC framework, the TL2 framework aims to avoid undue emphasis on conditions of “artificial normality” (Mattys & Liss, 2008) and to develop the conceptual foundations for detailed descriptions and modeling of natural variation in speech intelligibility, particularly as it relates to conversations involving bilingual individuals.
The influence of language factors has been widely acknowledged throughout the long history of audiometric testing and computational modeling of speech intelligibility (e.g., among many others, French & Steinberg, 1947; Müsch & Buus, 2001; Steeneken & Houtgast, 1982; van Wijngaarden, 2001; van Wijngaarden et al., 2004; van Wijngaarden, Steeneken, & Houtgast, 2002). However, quantitative models of speech intelligibility have generally emphasized accurate and reliable metrics under various signal-distorting conditions, such as background noise, reverberation, and other nonlinear distortions that often occur in real-world communication systems. Language factors such as the language proficiency of the talker and/or the listener, the number of response alternatives, and linguistic entropy in the materials to be recognized have received some attention in these models, but quantitative modeling of the full extent of variation in speech intelligibility as a function of the interplay between the talker’s and the listener’s language backgrounds remains a challenge. Ultimately, TL2 aims to offer a coherent conceptual framework for developing a quantitative interface between language-based factors and signal-based computational models of speech intelligibility.
Finally, the TL2 conceptual orientation is aligned with a recent proposal for a “tripartite conceptualization of intelligibility” in which overall speech recognition accuracy involves contributions from three components: the talker, the listener, and their joint contribution within dynamic interaction (Olmstead, Lee, & Viswanathan, 2020). Motivated by studies demonstrating co-constrained adjustments of speakers with dysarthria and their listeners (Borrie, McAuliffe, & Liss, 2012), the tripartite view claims that the conversational dynamics that emerge during a specific interaction exceed what can be accounted for by the individual speech production and perception abilities of the interlocutors. Under this view, the key to successful speech communication is not unilateral adjustment of either the talker to the listener (through, for example, clear speech production) or the listener to the talker (through, for example, perceptual learning); instead, mutual, iterative adaptation is required to arrive at a jointly created, conversation-specific phonetic system. The TL2 framework takes a broadly similar view of speech intelligibility, with the key difference that it specifies the three critical components in terms of language variables. Nevertheless, the tripartite view of intelligibility (Olmstead et al., 2020) and the TL2 framework are developing along parallel paths toward the goal of broadly applicable research and modeling of speech intelligibility with a strong emphasis on talker–listener interactions.
33.2 The Talker–Listener–Language, or TL2, Framework
Any speech interaction, or turn in a conversation, can be characterized by three fundamental relations: (a) the relation between the talker and the language being spoken (Talker–LBS), (b) the relation between the listener and the language being spoken (Listener–LBS), and (c) the relation between the language repertoires of the interlocutors (Talker–Listener). These three relations are the basis of the TL2 framework. Figure 33.1 provides a schematic of the three foundational relations of the TL2 framework with each relation represented by a slider that can move between two settings. For Talker–LBS and Listener–LBS, the sliders move between low proficiency L2, for early second-language learners of the language being spoken, and L1, for first-language speakers/listeners of the language being spoken. The settings for Talker–LBS and Listener–LBS are specific to the language being spoken in a given conversational interaction; therefore, the settings for a bilingual individual will typically differ across conversations depending on their proficiency in the language of the current conversation. For bilinguals with similar proficiency in more than one language, the Talker–LBS and Listener–LBS settings would be similar for conversational interactions in both/all of their languages.

Figure 33.1 Three sliders representing the three foundational relations of the Talker–Listener–Language (TL2) framework. Talker–LBS = the relation of the talker to the language being spoken; Listener–LBS = the relation of the listener to the language being spoken; Talker–Listener = the relation between the talker’s and the listener’s language repertoires. While Talker–LBS and Listener–LBS can vary independently, Talker–Listener depends on the settings of the other two. The settings shown in this figure represent the conversational condition of a relatively low-proficiency L2 talker and an L1 listener. The Talker–Listener relation is (necessarily) Mismatch.
See text and Table 33.1 for additional information.
For many bilingual individuals, the language that they learned first is not the language that ultimately becomes the dominant language in their adult language repertoire. For these individuals, exposure to the L1 precedes exposure to the L2 (by definition), yet the extent of exposure to the L1 is more constrained than that of exposure to the L2. This pattern of language learning is common across the globe, particularly in bilingual households where the language of instruction in schools (typically, the dominant language in the society) does not match the home language. The consequences for speech production and perception of this “switched” language dominance have direct bearing on speech intelligibility. Specifically, for switched-dominance bilinguals (or heritage speakers), dominance predicts word recognition accuracy more closely than order of acquisition for both speech production (Blasingame & Bradlow, 2021) and speech recognition (Blasingame & Bradlow, 2020). In the TL2 framework, the Talker–LBS and Listener–LBS relations encode variation in spoken language proficiency for the language of the current conversation without distinguishing between order of acquisition, language dominance, and language proficiency. Thus, for a switched-dominance bilingual, the Talker–LBS and Listener–LBS settings are L1 for a conversation in their dominant language to represent their L1 level of spoken language proficiency in the language (rather than representing order of acquisition).
For the Talker–Listener relation, the slider setting varies from Match to Mismatch to indicate varying degrees of shared language experience between the talker and the listener. The setting may vary along the slider continuum depending on the extent of the interlocutors’ shared knowledge and experience. Importantly, the setting for Talker–Listener is partially dependent on Talker–LBS and Listener–LBS, since if the language of the conversation is the L1 for only one of the interlocutors then the Talker–Listener relation is necessarily tilted toward the Mismatch end of the slider (one’s L1 is the other’s L2). However, if Talker–LBS and Listener–LBS are both L1 or both L2, then Talker–Listener may shift toward either Match or Mismatch. If both Talker–LBS and Listener–LBS are L2, then the setting for Talker–Listener varies depending on whether the interlocutors come from the same (Match) or different (Mismatch) L1 backgrounds. If both Talker–LBS and Listener–LBS are L1 then typically the Talker–Listener setting will tilt toward Match. However, a setting may shift toward Mismatch if the interlocutors are from different L2 language backgrounds (e.g., one knows Spanish while the other knows Mandarin) or are from communities where different heritage languages are widely spoken to the extent that even monolinguals in the community speak a language variety that reflects features of the heritage language that circulates in the community. While this within-L1 Talker–Listener variation goes beyond the scope of the empirical support for the TL2 framework presented here, it reflects the fact that the impact of language contact can extend beyond bilingual individuals and into the community at large.
The possible combinations of settings for the three language relations provide a scaffold for classifying conversational interactions in terms of their language-related properties. Table 33.1 presents all possible conversational interaction types that result from the extreme settings for each of the three sliders shown in Figure 33.1: L1 versus (low-proficiency) L2 for Talker–LBS and Listener–LBS, and Match versus Mismatch for Talker–Listener. Two of the eight logical combinations are excluded from the list of conversational interaction types because (as noted above) the setting for the Talker–Listener relation is necessarily Mismatch when the language being spoken is the L1 of only one of the interlocutors. Thus, Table 33.1 describes the limits of variation along the three language relations that define the TL2 framework for speech communication between two individuals. Communicative situations with multiple talkers and listeners would introduce substantial complexity into the framework, as would continuous variation along each of the dimensions, but the broadly defined conceptual framework remains the same.
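The exclusion logic described above can be made concrete with a short sketch. Assuming only the extreme slider settings, the following illustrative code (not from the chapter) enumerates the eight logical combinations and filters out the two that the framework rules out; the condition labels follow the lettering (A–F) used later in the text for Table 33.1, with B (L1-L1-Mismatch) inferred from the ordering rather than stated explicitly.

```python
from itertools import product

# Illustrative sketch: enumerate the 2 x 2 x 2 extreme slider settings and
# drop the two combinations the TL2 framework excludes, because the
# Talker-Listener relation cannot be Match when the language being spoken
# is the L1 of only one interlocutor.
LABELS = {
    ("L1", "L1", "Match"):    "A",
    ("L1", "L1", "Mismatch"): "B",  # inferred label; B is not named in the text
    ("L2", "L1", "Mismatch"): "C",
    ("L1", "L2", "Mismatch"): "D",
    ("L2", "L2", "Match"):    "E",
    ("L2", "L2", "Mismatch"): "F",
}

def conversational_types():
    """Return all valid (Talker-LBS, Listener-LBS, Talker-Listener) settings."""
    valid = []
    for talker, listener, relation in product(("L1", "L2"), ("L1", "L2"),
                                              ("Match", "Mismatch")):
        if talker != listener and relation == "Match":
            continue  # excluded: one interlocutor's L1 is the other's L2
        valid.append((talker, listener, relation))
    return valid

types = conversational_types()
assert len(types) == 6  # eight logical combinations minus two excluded
assert all(t in LABELS for t in types)
```

The enumeration recovers exactly the six conversational interaction types of Table 33.1, making explicit that the typology is the full cross of the three sliders minus the two logically impossible cells.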
Table 33.1 The typology of conversational interactions that emerges from the foundational language relations shown in Figure 33.1.

The typology of conversational interactions listed in Table 33.1 introduces “language profile” as a level of description that is subordinate to the language of the conversation (i.e., the language being spoken) but superordinate to the acoustic-phonetic features of the speech signal. Figure 33.2 illustrates these three levels of description for any conversational interaction. At the broadest level is selection of the language of the conversational interaction (Language Selection); at an intermediate level is description of the conversational interaction type (Language Profile); and at the narrowest level is specification of the vocal gestures and acoustic phonetic parameters of the speech signal (Speech Implementation). The example shown in Figure 33.2 is an interaction between an L1 and an L2 speaker of English. In the first conversational turn, Talker–LBS is L2, Listener–LBS is L1, and Talker–Listener is (necessarily) Mismatch. In the second turn, the interlocutors switch roles so that now Talker–LBS is L1, Listener–LBS is L2, and Talker–Listener remains Mismatch. Intelligibility assessment and/or prediction for the conversation as a whole would be aggregated over each of the conversational interactions.
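The turn-by-turn role switch in the Figure 33.2 example can also be sketched in a few lines. This is an illustrative toy (the function name and representation are not from the chapter): given each interlocutor's relation to the language being spoken, it derives the per-turn language profile, returning None for the Talker–Listener relation in the cases where the two LBS settings alone leave it underdetermined.

```python
# Illustrative toy (names are not from the chapter): derive the language
# profile of a single conversational turn from each interlocutor's
# relation to the language being spoken (LBS).
def turn_profile(talker_lbs, listener_lbs):
    """Return (Talker-LBS, Listener-LBS, Talker-Listener) for one turn."""
    if talker_lbs != listener_lbs:
        # The LBS is the L1 of only one interlocutor, so the
        # Talker-Listener relation is necessarily Mismatch.
        relation = "Mismatch"
    else:
        # Both L1 or both L2: Match vs. Mismatch depends on shared
        # language background, which the LBS settings alone do not fix.
        relation = None
    return (talker_lbs, listener_lbs, relation)

# Figure 33.2 example: an L1 and an L2 speaker of English in conversation.
# Turn 1: the L2 speaker talks, the L1 speaker listens.
assert turn_profile("L2", "L1") == ("L2", "L1", "Mismatch")
# Turn 2: the roles switch; the Talker-Listener relation remains Mismatch.
assert turn_profile("L1", "L2") == ("L1", "L2", "Mismatch")
```

Aggregating such per-turn profiles over a whole conversation corresponds to the chapter's suggestion that intelligibility assessment for a conversation is aggregated over its constituent conversational interactions.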

Figure 33.2 Three levels of description for any conversational interaction.
Note: See text for details.
33.3 Empirical Foundations of the TL2 Framework
For each of the three language relations that describe the language profile of a given conversational interaction, one of the two possible end-point settings is typically associated with higher intelligibility than the other. For Talker–LBS and Listener–LBS, L1 is typically associated with higher intelligibility than L2. For Talker–Listener, Match is presumably associated with higher intelligibility than Mismatch since a greater extent of shared language knowledge and experience is generally advantageous for overall speech intelligibility. While these separate influences on speech intelligibility are important contributors to the overall intelligibility of a given conversational interaction, a key point of the TL2 framework is its emphasis on these language relations as interacting components of the speech communication system. That is, each of these three language relations influences speech intelligibility both independently and through modulation of the others. As discussed in Sections 33.3.1–33.3.4, several lines of experimental phonetic and psycholinguistic research provide the empirical foundation for this perspective.
33.3.1 L2 Speech Intelligibility as a Function of the Talker’s and the Listener’s Language Backgrounds
For well over half a century, researchers have noted the lowering of the language barrier when both interlocutors in a conversation are L2 users of the language being spoken (e.g., Nash, 1969; Weinreich, 1953). Over the past two decades there has been a steady accumulation of speech recognition data in support of this observation (Bent & Bradlow, 2003; Imai, Walley, & Flege, 2005; Pinet, Iverson, & Huckvale, 2011; Shu, Wilson, & Perkins, 2016; Stringer & Iverson, 2019; van Wijngaarden, 2001; van Wijngaarden et al., 2002; Weber, Broersma, & Aoyagi, 2011; Xie & Fowler, 2013). For example, in a factorial experimental design, Bent and Bradlow (2003) crossed variation in Talker–LBS with variation in Listener–LBS to yield speech recognition test conditions that covered both Match and Mismatch for the Talker–Listener relation. Specifically, this study tested recognition of English sentences produced by talkers from three L1 backgrounds (American English, Chinese, or Korean) with listeners from four different L1 backgrounds (American English, Chinese, Korean, or a fourth group of various other L1 backgrounds). The major finding of this study was significant attenuation (and, in some cases, reversal) of the L2 talker intelligibility disadvantage for L2 listeners as compared to L1 listeners.
The TL2 framework expresses this effect through the interplay between the settings for each of the three foundational relations. Under conditions where the language of the stimuli, English, was the L2 for both the talker and the listeners (i.e., both Talker–LBS and Listener–LBS were in the L2 setting), word recognition accuracy for L2 speech was as high as for L1 speech. Moreover, this modulation of the influence of Talker–LBS by Listener–LBS – the “interlanguage speech intelligibility benefit” – extended to both settings for the Talker–Listener relation. That is, the L1 talker intelligibility advantage was attenuated for L2 listeners from the same L1 background as the talker (the “matched interlanguage speech intelligibility benefit”) as well as for L2 listeners from other L1 backgrounds (the “mismatched interlanguage speech intelligibility benefit”). In the conversational interaction typology of the TL2 framework as laid out in Table 33.1, the conditions where we observe the matched and mismatched interlanguage speech intelligibility benefit are conditions E (L2-L2-Match) and F (L2-L2-Mismatch), respectively.
This basic pattern of modulation of the effect on intelligibility when Talker–LBS is L2 through variation in the settings for Listener–LBS and Talker–Listener has been quite widely attested, with some important nuances. In particular, this interlanguage speech intelligibility benefit (ISIB) is sensitive to talker and listener L2 proficiency such that it is substantially weaker for low versus high proficiency L2 talkers (Bent & Bradlow, 2003; Stibbard & Lee, 2006), whereas it is stronger for lower versus higher proficiency listeners (Xie & Fowler, 2013; see also Wang & van Heuven, 2015). This asymmetry in the modulation of Talker–LBS by Listener–LBS and vice versa is further demonstrated by the distinction between the ISIB-Talker and the ISIB-Listener (Hayes-Harb et al., 2008; Xie & Fowler, 2013). The ISIB-Talker involves attenuation of the intelligibility advantage for the L1 over the L2 setting for Talker–LBS (i.e., L1 versus L2 speech) when Listener–LBS is also L2 (Bent & Bradlow, 2003; Hayes-Harb et al., 2008; Stibbard & Lee, 2006; Xie & Fowler, 2013), while the ISIB-Listener involves reduction of the intelligibility disadvantage for the L2 over the L1 setting for Listener–LBS (i.e., L2 versus L1 listening) when Talker–LBS is also L2. In the conversational interaction typology of the TL2 framework (Table 33.1), ISIB-Talker is represented by comparison across conditions D (L1-L2-Mismatch) and E (L2-L2-Match), while ISIB-Listener is represented by comparison across conditions C (L2-L1-Mismatch) and E (L2-L2-Match). The available data suggest that the ISIB-Listener (L2 talker with L1 vs. L2 listeners) may be stronger than the ISIB-Talker (L2 listener with L1 vs. L2 talkers).
The intelligibility differential between L2 and L1 settings for Listener–LBS is sensitive to modulation by variation in Talker–LBS and Talker–Listener, while the intelligibility differential between L2 and L1 settings for Talker–LBS is more difficult to neutralize by variation in the other two relations (Hayes-Harb et al., 2008; Xie & Fowler, 2013). This distinction addresses the question of whether L2 speaking is more difficult than, less difficult than, or as difficult as L2 listening, a distinction that is also represented by comparison across conditions C (L2-L1-Mismatch) and D (L1-L2-Mismatch).
The impact of variation in the language profile of a conversational interaction beyond word recognition accuracy was explored by Van Engen et al. (2010) with an experimental paradigm, Diapix, that emphasizes communicative intent during interaction. The Diapix paradigm involves a spontaneous speech elicitation task in which two participants are each presented with one version of a picture. The participants must work cooperatively to identify differences between the two pictures as quickly as possible without seeing each other’s version of the picture. Communicative efficiency is then measured by the time taken to complete the task (i.e., to identify all the differences). The Diapix paradigm allows investigators to document phonetic variation and adaptation within conversational interactions and, critically, to relate these aspects of conversational dynamics to communicative success through the Diapix measure of task completion time. Van Engen et al. (2010) compared Diapix task completion times across various participant pair types: (a) two L1 talkers (n = 8 pairs), (b) an L1 talker with an L2 talker (n = 11 pairs), (c) two L2 talkers from the same L1 background (n = 11 pairs), and (d) two L2 talkers from different L1 backgrounds (n = 8 pairs). These participant pair types map directly onto the conversational interaction types of the TL2 framework as described in Table 33.1; these are conditions A (L1-L1-Match), C/D (L2-L1-Mismatch and L1-L2-Mismatch), E (L2-L2-Match), and F (L2-L2-Mismatch). The main finding of interest of this study was that task completion time was shortest for the L1-L1-Match and longest for the L2-L2-Mismatch conditions.
Interestingly, participant pairs that included one L1 speaker (L2-L1-Mismatch and L1-L2-Mismatch) took as long to find all the differences between the two pictures as did participant pairs that included two L2 speakers from the same L1 background (L2-L2-Match condition). This pattern suggests that the foundational language relations and the conversational interaction typology of the TL2 framework have the potential to provide a coherent account of variation in real-world speech communication. New web-based data collection and analysis techniques open up the possibility of larger-scale replications and expansions of this relatively small-scale study with many more participant pairs and with conversations in various languages.
Overall, investigations of speech intelligibility that include recognition of both L1 and L2 speech by both L1 and L2 listeners have established significant contributions of variation across these interacting dimensions, albeit with some nuances related to language proficiency and other experimental factors including the nature of the stimuli (e.g., isolated words, words in sentences), the task (e.g., open-set transcription, word classification, reaction time), the presentation conditions (e.g., the absence or presence of background noise), and the location of the listening environment (e.g., in a country where the language being spoken has dominant or nondominant language status). The relations that define the TL2 framework – Talker–LBS, Listener–LBS, and Talker–Listener – provide a coherent structure for representing this variation in a way that clarifies the possibility of asymmetrical effects on speech intelligibility of L2 speaking versus L2 listening in isolation from, as well as in the context of, shared versus different language backgrounds. In particular, the TL2 framework provides a conceptual structure with which to support and interpret direct comparison across the various conversational conditions that involve L2 speaking (conditions C, E, and F in Table 33.1) versus those that involve L2 listening (conditions D, E, and F). The extent and the nature of this asymmetry have significant implications for our understanding of the relationship between speech production and perception in bilingual individuals. Moreover, talker–listener asymmetries could have practical implications for speech intelligibility optimization in real-world situations that involve, for example, an L1 speaker addressing a large crowd of primarily L2 listeners or vice versa. 
Systematic comparison across the various conversational interaction types laid out in Table 33.1 would be a productive direction for future work to follow, preferably in a wider range of languages and with a wider variety of speech materials than has been tested to date (e.g., see Strori, Bradlow, and Souza [2020] for the interplay of language profile and sentence complexity for L2 word recognition accuracy).
33.3.2 Listener-Oriented Talker Adaptation
Clear speech is a mode of speech production that talkers adopt when they are aware of a communication barrier due to, for example, the presence of background noise, impaired hearing by the listener, or the listener’s L2 listening experience in the language of the conversation (for a comprehensive and up-to-date review of clear speech research, see Smiljanic, 2021; see also Smiljanic & Bradlow, 2009; Uchanski, 2005). Clear speech production involves language-general modifications that serve to enhance the overall acoustic salience of the signal (i.e., to make it easier to detect and follow in a complex auditory environment) and language-specific modifications that serve to expand the acoustic distinctions that convey phonological contrasts (i.e., to provide more transparent access to the phonological structure of the language being spoken). Overall, the talker aims to provide a clearer auditory pathway for the listener toward accurate reception of the talker’s message. When successful (i.e., when clear speech production actually enhances speech recognition accuracy), clear speech brings the listener’s signal decoding routines into closer alignment with the talker’s message encoding routines. In the TL2 framework, clear speech is represented as a talker-initiated, listener-oriented response to a Talker–Listener setting of Mismatch.
Available evidence has established that both L1 and L2 talkers are capable of producing effective clear speech, and both L1 and L2 listeners can derive substantial benefit from clear speech production. However, the magnitude of the clear speech benefit for L1 talkers and listeners is often greater than for L2 talkers and listeners. For example, Bradlow and Bent (2002) found a significant but smaller clear speech benefit for the L1-L2-Mismatch condition than for the L1-L1-Match condition. That is, L2 English listeners received a smaller clear speech benefit than L1 English listeners for clear speech produced by L1 speakers of English. Presumably, this variation is related to the language-specific aspect of clear speech production. While the language-independent features of clear speech should enhance speech signal salience for all listeners, the benefits of the language-specific, phonemic enhancements depend on knowledge of the phonological structure and patterns of phonetic implementation that characterize the sound structure of the language being spoken. Data from subsequent studies (Granlund, Hazan, & Baker, 2012; Keerstock & Smiljanic, 2018, 2019; Smiljanic & Bradlow, 2011) further refined this pattern by investigating L2 speakers with varying levels of proficiency in the language being spoken and conversational conditions with both L2 talkers and L2 listeners. Specifically, these studies cover conditions C (L2-L1-Mismatch) and E (L2-L2-Match) in addition to A (L1-L1-Match) and D (L1-L2-Mismatch). These patterns can be expressed in the TL2 framework as modulation of the effect of clear speech by Listener–LBS.
It is important to note that clear speech is a unilateral response to a setting of Mismatch for the Talker–Listener relation since it involves accommodation of one interlocutor (the talker) to the other (the listener) without necessarily involving any reciprocal response (i.e., listener-to-talker adaptation). In fact, it is possible that specific clear speech adjustments may result in divergence between the interlocutors’ speech patterns. For example, in a conversation with an L2 speaker of English, an L1 English speaker may expand their acoustic vowel space as they become aware of their L2 interlocutor’s need for clear speech, but the L2 conversation partner may still produce English vowels with strong influence from their L1. In this case, we may observe phonetic divergence between the interlocutors in vowel space acoustics along with overall intelligibility enhancement in the conversation (assuming that the vowel space expansion produced by the one interlocutor is indeed beneficial for the other’s ability to distinguish phonologically contrastive vowel categories). In the TL2 perspective, this type of unilateral phonetic adjustment in response to Talker–Listener mismatch functions as a mechanism for enhancing intelligibility through alignment at the level of signal-message mapping rather than at the phonetic level.
33.3.3 Perceptual Adaptation to L2 Speech
Just as clear speech can be viewed as a talker-initiated, unilateral response to Talker–Listener Mismatch, perceptual adaptation of listeners to talkers can be viewed as a parallel response from the listener’s side of the conversational interaction. In the case of speech interactions that involve L1 and L2 interlocutors, speech recognition accuracy can be significantly enhanced through exposure-induced perceptual adaptation by L1 listeners to speech produced by L2 talkers (for a recent comprehensive review of perceptual learning of L2 speech, see Bent & Baese-Berk, 2021). A major focus of research on perceptual adaptation to L2 speech has been on the exposure conditions that lead to generalized adaptation, that is, to improved recognition of L2 speech beyond the particular L2 speech samples encountered during initial exposure or training. A general finding from this line of work is that L1 listeners exhibit substantial flexibility in their recognition of L2 speech following short-term exposure to L2 speech, and this flexibility can extend to novel talkers provided that there is sufficient and appropriate overlap between the exposure and the novel conditions (Xie, Liu, & Jaeger [2021] and references therein). Other research in this area has focused on mechanisms of perceptual adaptation including category boundary recalibration, category remapping, general category relaxation, phonetic cue reweighting, and attention to novel cues (Bent & Baese-Berk [2021] and references therein). Within the TL2 framework, the representation of perceptual adaptation to L2 speech is closely analogous to the representation of clear speech (as discussed in Section 33.3.2).
That is, just as clear speech production by the talker for the sake of the listener can bring the talker’s speech encoding routines into closer alignment with the listener’s speech decoding routines, perceptual adaptation by the listener to the talker’s speech patterns also involves closer alignment of the talker’s encoding and the listener’s decoding routines. While the perceptual processes that underlie adaptation of L1 listeners to L2 speech can promote Talker–Listener Match at the level of signal-message mapping, they may remain unilateral (i.e., the changes in the listener’s perceptual routines may not be reflected in any speech production changes by the talker). Nevertheless, the overall intelligibility of the conversational interaction will presumably improve.
An important element of the TL2 framework is the modulation of the influence of each relation by the others. With respect to listener-initiated perceptual adaptation, there is strong evidence that relatively high proficiency L2 listeners can retune their phonetic categories to a similar extent as L1 listeners, with some variation depending on the position of the critical categories in the sound structure of the listeners’ L1 as well as in the sound structure of the language being spoken (e.g., Bruggeman & Cutler, 2020; Drozdova, van Hout, & Scharenborg, 2016; Reinisch, Weber, & Mitterer, 2013; Schertz et al., 2016). This suggests that both L1 and L2 listening are subject to the same adaptive mechanisms with some experience-dependent variation related to interactions between the language being spoken and the other language(s) in the listeners’ repertoires. Moreover, variation in the extent and time-course of perceptual adaptation to L2 speech has shown some dependency on the talker’s L2 proficiency (i.e., variation in the setting of Talker–LBS on the scale from L1 to low proficiency L2), with faster and greater gains in speech recognition accuracy for higher than lower proficiency L2 talkers (e.g., Bradlow & Bent, 2008). In the TL2 framework, these studies demonstrate variation in listener-initiated adaptation to Talker–Listener Mismatch as the Talker–LBS and Listener–LBS settings vary from L1 to low proficiency L2.
33.3.4 Phonetic Variation and Adaptation in Conversational Interactions
Separate investigations of talker-to-listener adaptation (e.g., clear speech) and listener-to-talker adaptation (e.g., perceptual adaptation to L2 speech) have established a strong empirical foundation for our understanding of the dynamics of speech production and perception, and of the mechanisms that underlie these adaptive processes in conversational interactions where the Talker–Listener relation is Mismatch rather than Match. As discussed earlier, these unilateral adaptations may result in closer Talker–Listener alignment at the level of sound-meaning mapping and, at the same time, divergence at the phonetic level. That is, these unilateral adaptations may result in Talker–Listener alignment without phonetic entrainment (for discussion of this important distinction, see Costa, Pickering, & Sorace, 2008). However, unlike the noninteractive testing procedures typical of research on clear speech and perceptual adaptation, real-world speech communication affords iterative (i.e., repeated over multiple conversation turns) and bidirectional (talker-to-listener and vice versa) adaptation guided by the overarching goal of successful message transmission (i.e., communicative intent). These properties of conversational interactions – iterativity, bidirectionality, and communicative intent – suggest that the full impact of talker-initiated and/or listener-initiated adaptations can only be observed with test conditions that involve multiple turns for each interlocutor. Multiple rounds of speech production and speech recognition (i.e., with switching roles as talker and listener) over the course of a conversation provide multiple opportunities for adjustment of the production–perception loop during speech communication.
Notwithstanding the intrinsic challenges of studying speech and language processes in dialogues, a substantial body of research on speech in interaction has revealed general patterns that are particularly relevant for the TL2 framework’s goal of providing a coherent analytical construct for representing and analyzing conversational interactions between individuals from various language backgrounds.
The empirical details and underlying mechanisms of Talker–Listener alignment at all levels of linguistic structure, from semantic to syntactic to phonetic, and across various types of interlocutor pairs, have been widely investigated (for reviews of this research, see Pardo et al., 2017; Pardo et al., 2022). As laid out in detail by Costa, Pickering, and Sorace (2008), fundamental differences in motivations and abilities will likely cause L1 and L2 talkers to adopt different adaptive strategies in conversational interactions, and indeed studies of vocal accommodation with interlocutors from different language backgrounds have shown various patterns of phonetic convergence and divergence depending on the language profile of the conversational interaction (e.g., Hwang, Brennan, & Huffman, 2015; Kim, Horton, & Bradlow, 2011; Olmstead et al., 2013; Olmstead et al., 2021). In general, greater convergence is observed when the Talker–Listener setting is Match rather than Mismatch (see also Babel, 2009; Lewandowski & Dogil, 2010; Nielsen, 2011; Pardo, 2006; Pardo, Jay, & Krauss, 2010).
Many studies of vocal accommodation have avoided test paradigms with dyadic setups (Pardo et al., 2017) and instead have attempted to isolate processes of phonetic convergence from message transmission (i.e., intelligibility or communicative success). This strategy is motivated by theoretically driven inquiry into the mechanisms that underlie vocal accommodation (automatic priming vs. mediated by social or cognitive factors). Consequently, the relation between vocal alignment and speech intelligibility (or communicative success beyond word recognition accuracy) is relatively understudied. Using a highly flexible implementation of the Diapix paradigm (see Section 33.3.1 for a description of this dialogue-based picture matching test setup), Hazan and colleagues have successfully investigated interactive speech across the life span, across participants with various hearing and language abilities, and in the presence of various types of transmission channel distortions (Hazan & Baker, 2011; Hazan et al., 2018; Hazan, Tuomainen, & Pettinato, 2016; Hazan, Tuomainen, & Taschenberger, 2019; Tuomainen et al., 2021). An important finding from this series of studies is that talker adaptations to communicative barriers in conversational settings are tailored to the needs of the listeners.
Specifically, talker adaptations to an L2 listener differ in their acoustic details from talker adaptations to transmission channel distortions (Hazan, Grynpas, & Baker, 2012; Hazan, Uther, & Grunland, 2015), just as infant-, pet-, and foreigner-directed speech registers have specific characteristics adapted to the interlocutor’s needs (Burnham, Kitamura, & Vollmer-Conna, 2002; Uther, Knoll, & Burnham, 2007). This specificity of talker adaptations is represented in the TL2 framework by the three-dimensional language profile of conversational interactions (i.e., the combination of Talker–LBS, Listener–LBS, and Talker–Listener). In other words, the noninteractional studies of vocal accommodation and the dialogue-based studies of speech communication across various barriers show that talker-based variation and adaptation can be fully understood only in relation to listener-based variation and adaptation, and vice versa.
33.4 Concluding Remarks
Situated at the nexus of research on second-language speech production and perception, computational modeling of speech intelligibility, and analyses of conversational dynamics, the TL2 framework is presented in this chapter as both a structured perspective on prior research and a map for future research. The present description of the TL2 framework for understanding bilingual speech intelligibility is a starting point with ample room for elaboration. One important avenue for elaboration relates to the role of background speech in determining overall speech intelligibility. Speech recognition accuracy for a conversation with a given language profile varies substantially depending on the relationship (Match or Mismatch) of the language of any background speech that may be present in the environment to the language being spoken and to the language repertoires of the talker and the listener. A large body of research has focused on this dimension of the language profile of a conversational interaction (e.g., among many others, Brouwer et al., 2012; Calandruccio, Dhar, & Bradlow, 2010; Cooke, García Lecumberri, & Barker, 2008; Van Engen & Bradlow, 2007) and has established a complex interplay between all of these language factors in determining overall speech intelligibility. An elaborated version of the TL2 framework might include additional parameters to reflect this important dimension of variation in bilingual speech intelligibility.
Another aspect of the TL2 framework that needs elaboration is extension beyond dialogue situations to representation of communicative settings with more than two participants. This is particularly important for speech intelligibility modeling and prediction in group settings where the environmental conditions may be suboptimal, such as classrooms, auditoria, and for speech communication over the Internet or other electronic transmission systems. In these settings, L2 speech intelligibility is often significantly compromised and stands to benefit most dramatically from speech enhancement devices.
Finally, it is important to note that the Talker–LBS and Listener–LBS settings represent only the position of the language being spoken in the individual’s language repertoire, and therefore provide no information about the other language(s) in the repertoire or their interaction. For example, a setting of L2 for Talker–LBS does not specify the particular L2 accent (e.g., Chinese-accented English versus Persian-accented French). Similarly, the Talker–Listener parameter represents only the degree of overlap between the language repertoires of the interlocutors without regard for the nonoverlapping language(s). This feature of the TL2 framework reflects its emphasis on overall intelligibility in terms of word recognition accuracy (which is a universally applicable, language-independent measure) rather than on interactions between the sound structures of an individual’s L1 and L2, which are necessarily language-specific and highly dependent on the individual’s entire language repertoire. Consequently, in the TL2 framework, no distinction is made between, for example, a talker of Chinese-accented English and a talker of Spanish-accented French (assuming both are at about the same level of proficiency in their respective L2). Accordingly, the central claim encoded in the set of three language-based relations of the TL2 framework – Talker–LBS, Listener–LBS, and Talker–Listener – is that the combined influence of these relations on overall speech intelligibility follows language-independent patterns. In the TL2 perspective, while the range of proficiency levels may vary across particular L1–L2 pairings (e.g., a higher ultimate level of L2 French proficiency may be attained by an L1 Spanish speaker than by an L1 Thai speaker), different L2 learning trajectories are irrelevant for speech intelligibility modeling in any given conversational interaction.
Instead, the overall language profile of a conversation, which is expressed in language-independent terms, exerts an overarching influence over L1 and L2 speech production and perception, which ultimately determines communicative success in terms of word recognition accuracy (or speech intelligibility). An important goal for future research is to verify this central claim of the TL2 framework with extensive speech intelligibility data from a wide range of languages (i.e., going beyond the bias toward speech research on a small number of languages, most notably English) and with test conditions that cover the full range of language profiles specified by the TL2 framework.
Acknowledgments
I gratefully acknowledge scientific contributions from present and past members of the Speech Communication Research Group in the Linguistics Department at Northwestern University. I also thank Chun Liang Chan for technical support, and Matt Goldrick for many useful discussions. This work is supported by Grant R01-DC005794 from NIH-NIDCD.
34.1 Introduction
Speech production research has examined diverse groups of talkers to define the phonetic properties of vowels and consonants. Research can examine monolingual speakers of a single language or compare monolingual speakers of different languages. Work with bilinguals has tended to focus on either the first language (L1) or the second language (L2), but increasingly, work has aimed to better understand the mutual influence of the L1 and L2 phonetic subsystems on one another (e.g., Tobin, Nam, & Fowler, 2017; Yeni-Komshian, Flege, & Liu, 2000). Production research of any kind requires the elicitation of speech samples for analysis. Although the choice of elicitation procedure can impact findings and so potentially the replicability of a study, there has been surprisingly little focus on the methods used to elicit speech.
After reviewing discussions of what has been designated a “replication crisis,” subsequent sections of this chapter examine replicability in the context of speech production research. These discussions lead to the conclusion that both monolingual and bilingual speech production research must be improved to augment the replicability of research findings. To aid in this effort, a new elicitation method is proposed that aims to reduce uncontrolled variability in speech samples that are to be subjected to acoustic analyses and/or listener evaluation. The aim of the new method is to provide more stable speech samples, and so more readily reproducible speech production findings.
34.2 Replication Crisis
34.2.1 Is Our Research Good Enough?
Research that can be replicated is considered scientifically sound and rigorous, but replication rates vary across disciplines, sometimes falling to levels that fail to inspire confidence. One systematic attempt to replicate previous research in leading psychology journals (Open Science Collaboration, 2015) found that more than 60 percent of studies failed to replicate. This suggested a period of “methodological unrest” in speech and language research to Sönning and Werner (2021) but a generalized “crisis of faith” in science to Smaldino and McElreath (2016).
Psychologists’ self-criticism for low replicability rates (e.g., Laws, 2016) is nothing new. Makel, Plucker, and Hegarty (2012) evaluated replication rates in research published from 1950 to 1999 and from 2000 to 2012. A slightly higher rate of successful direct replications was reported for the latter period (82.7 percent versus 74.0 percent). Importantly, rates were higher in both time periods when one or more authors of the original study participated in the replication study. This might be because the authors were aware of methodological details of the original study that were not published, perhaps due to page limits (Gawne & Styles, 2022, p. 15).
One barrier to obtaining high replication rates is the difficulty of reproducing the statistical power of the original study (Anderson & Maxwell, 2017). Nonstatistical barriers also exist. Bak (2016) argued that the details of an experimental protocol can affect the outcome of the study. Studies of bilingualism, which may vary greatly in terms of experimental protocols, may produce conflicting evidence. However, a failure to replicate is unlikely to lead to the rejection of a hypothesis supported by previous successful replications. Conversely, successful replications may augment confidence in a field of research, but they tend to demonstrate the reliability of an effect, not its validity.
Feest (2019) suggested that no two experiments can ever really be the same in all respects. Different scientists may make different assumptions as to which details of a protocol matter; and no one scientist is likely to identify all details of an experimental design that are potentially relevant when replicating previously published research. One such “detail” is the assignment of participants to groups in bilingual research (Luk & Bialystok, 2013). Even if the same selection criteria have been applied, the members of groups compared across studies are unlikely to be matched along all relevant parameters, including the ages at which the second language was acquired and under what circumstances, the relative proficiency in the two languages, the contexts in which they are used and how often, and how frequently the members of the two groups code-switch.
Diverse explanations have been offered for failures to replicate. For some, the primary cause is how statistical analyses have been performed in the original study (Feest, 2019). Sönning and Werner (2021, p. 1182) pointed to other potentially important causes, including the selective reporting of results, a lack of transparency in describing procedures and data analysis, poor research design, small sample sizes, experimenters’ cognitive biases, and theories that are not sufficiently well defined “to inform, guide, and constrain” data-based research.
34.2.2 Improving Research
Suggestions abound as to how replicability rates can be augmented (see Coretta et al. [2023] for discussion). Roettger and Baer-Henney (2019) suggested that creating a culture in which replication is common might help. Investigators who know that their research might be subjected to additional scrutiny following publication may become more careful and more attentive to all aspects of their research protocols. An obvious approach would be to increase statistical power (Maxwell, 2004). Despite repeated calls for doing so, however, there has been no detectable increase in statistical power over the past fifty years (Smaldino & McElreath, 2016). No panacea for the replication crisis is likely to exist, but any measure that serves to increase replication rates is to be welcomed.
One way to augment replication rates might be to analyze only appropriate speech samples, that is, samples that are characteristic of how participants actually speak (Flege, 2021). This proposal starts with the simple observation that no one always speaks the same way. When individual differences in speech production by monolinguals are reported, one wonders if these differences derived from how individual participants reacted to being seated in a sound booth in front of a microphone and/or how they went about reading, repeating, or naming a series of words, phrases, or sentences. For example, Clayards (2018) observed significant individual differences in voice onset time (VOT) values in stops found in the initial position of English words that participants read in random order. The author attributed the differences between talkers to differences in “speaking style or degree of hyper-articulation” (p. 19), but did not explain why the protocol might have induced individuals to differ in these ways.
An underlying assumption common to speech production research is that a single sample will suffice to identify individual differences in speech production. This assumption is unwarranted given that multiple speech samples are rarely examined. Moreover, differences between talkers may emerge during a single recording session. Mielke and Nielsen (2018) measured VOT in 120 English words presented visually for production at 2-second intervals. Values tended to drift slightly downward during the recording session, perhaps because the participants grew “tired and/or bored” (Mielke & Nielsen, 2018, p. 2175). Changes on a shorter timescale were evident when VOT values in the three words preceding a target word were longer or shorter than was typical for an individual. This suggested that some participants reset production parameters during the elicitation task, perhaps to make their VOT output correspond better to the specification of VOT in long-term memory representations.
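A drift of the kind Mielke and Nielsen describe can be checked by regressing VOT values on trial order. The sketch below runs that check on simulated data; all numbers are invented for illustration, and only the analysis logic follows the text:

```python
# Illustrative sketch: testing for VOT drift across a recording session.
# The data are simulated; only the trial-order regression mirrors the text.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials = np.arange(120)  # 120 words, in presentation order

# Simulate a talker whose VOT starts near 70 ms and drifts slightly downward.
vot = 70 - 0.05 * trials + rng.normal(0, 5, size=120)

# A negative slope indicates downward drift over the session.
res = stats.linregress(trials, vot)
print(f"slope = {res.slope:.3f} ms/trial, p = {res.pvalue:.4f}")
```

The same regression, run per talker, would also expose the individual differences in drift that the authors discuss.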
34.3 Replication Attempts
Discussions of the replication crisis have focused on many different kinds of research. However, little of this discussion has focused on speech production and none, as far as I know, has considered individual studies in detail. The studies reviewed here are not necessarily representative of speech production research but nonetheless provide an opportunity for readers to consider how aspects of an experimental design might impact research outcomes.
34.3.1 VOT in Early versus Late Learners
Flege (1991) was the first to directly compare L2 segmental production by participants differing in age of first exposure to the L2. The study measured VOT in Spanish and English words produced by forty speakers, balanced across language profile. Spanish and English monolinguals produced materials in one language. Bilinguals differing in age of arrival in the USA (Early: M = 2 years, Late: M = 20 years) produced Spanish and English materials in counterbalanced order. All participants were tested in a quiet room at the University of Texas at Austin by undergraduates who were themselves Early Spanish-English bilinguals.
In Experiment 1, participants read fourteen English phrases of the form Take a X (X = 8 words such as t-bone and teapot) and/or fourteen Spanish phrases of the form Tengo un X (X = 8 words such as tigre and tipo) “as if talking to friends.” Figure 34.1 shows the distributions of 140 English VOT values for productions of /t/ by members of 3 groups in (a) the phrase-initial word take and in (b) 8 variable words produced in phrase-final position.

Figure 34.1 Distributions of VOT values produced in phrase-initial and phrase-final words by native English (NE) speakers and native speakers of Spanish who learned English as children (Early) or as adults (Late).
For all three groups, VOT values were about 10 ms longer for phrase-final than phrase-initial words. The Early and native English (NE) groups’ VOT values were normally distributed in both phrasal positions (W > 0.10), but the values obtained for Late learners were not normally distributed in either position (W < 0.01). This was because the Late learners produced values ranging from short-lag VOT, much like the values obtained for Spanish monolinguals (not shown here), to values obtained for the NE monolinguals. Members of the NE and Early learner groups, who did not differ significantly from one another, produced significantly longer VOT values than the Late learners in both phrasal positions.
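Normality checks of the kind reported here are commonly run with the Shapiro–Wilk test. The sketch below applies it to simulated VOT distributions, a unimodal long-lag set and a short-lag/long-lag mixture loosely analogous to the Late learners’ values; all numbers are invented:

```python
# Illustrative sketch: Shapiro-Wilk normality tests on VOT distributions.
# Simulated data; only the testing logic mirrors the analysis in the text.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Unimodal long-lag VOT (ms), like the NE and Early groups' values.
unimodal = rng.normal(loc=75, scale=10, size=100)
# Mixture of short-lag and long-lag VOT, like the Late learners' values.
bimodal = np.concatenate([rng.normal(20, 8, 50), rng.normal(75, 10, 50)])

w1, p1 = stats.shapiro(unimodal)  # normal data: typically nonsignificant
w2, p2 = stats.shapiro(bimodal)   # bimodal data: strongly significant
print(f"unimodal: W = {w1:.3f}, p = {p1:.3f}")
print(f"bimodal:  W = {w2:.3f}, p = {p2:.3g}")
```

A significant result for the bimodal group, as here, is what one would expect when individual talkers split between short-lag and long-lag productions.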
Given the novelty of this finding, Flege (1991) tested for between-group differences in a second experiment. Early and Late learners returned a week later for an experiment in which English and Spanish materials were produced in alternation in three successive blocks. The aim was to determine if the Early versus Late difference would also be evident under time pressure. In Block 1, the disyllabic test words were read at the end of alternating sentences (Take another word such as __; Tengo palabras como __). In Block 2, the words were read at the end of alternating phrases (Take a __; Tengo un __), and in Block 3 the English and Spanish test words were read in isolation, in alternation. This manipulation systematically reduced the intervals between successive productions of Spanish and English /t/. This time, VOT values were found to be significantly longer in Block 1 (sentences) than in Blocks 2 and 3 (phrases, isolated words). The Early learners produced significantly longer VOT values than the Late learners in all three blocks, replicating the finding from Experiment 1.
34.3.2 Foreign Accent
Flege, Munro, and MacKay (1995a) assessed the effect of age of first exposure to an L2 on the overall degree of perceived foreign accent (FA) in the L2 by examining English sentences spoken by 240 Italian immigrants in Canada. The native Italian (NI) participants were equally distributed across ten groups based on age of arrival (AOA), with group mean AOAs ranging from two to twenty-three years. Participants repeated sentences after a filled delay, and these sentences were later rated by NE-speaking listeners. The study yielded three findings that were inconsistent with a critical period account of FA. First, the AOA–FA relation was linear in the AOA range bracketing the end of a supposed critical period. Second, some NI participants who arrived in Canada in early childhood were found to speak English with a detectable FA (see also Flege, 2019, pp. 508–510; Han et al., 2023). Finally, FAs grew increasingly stronger as AOAs increased beyond the supposed close of a critical period.
Flege, Yeni-Komshian, and Liu (1999) carried out a conceptual replication of Flege et al. (1995a). The 240 native Korean (NK) participants recruited for the replication study were equally distributed across ten groups based on AOA, but the two studies differed in important ways. The Koreans had lived for fewer years in the USA than the Italians had lived in Canada, and they had more years of education in English-medium schools than the Italians. Further, the English sentences examined in the Korean study were longer. The Korean study nevertheless yielded the same three findings as the Italian study related to a critical period account of FA.
The results of the Korean and Italian studies were further compared in Flege (2009). Language background questionnaire items common to the two studies were submitted to Principal Component Analyses, and the derived factors were regressed onto the two sets of FA ratings. The Italian and the Korean analyses accounted for 66.8 percent and 71.7 percent of the variance in FA, respectively. The first factor identified in both analyses captured most of the variance. This factor had high loadings on AOA, L1 proficiency, and frequency of English use (see also Luk & Bialystok, 2013). The second factor in both analyses had high loadings on chronological age and length of residence (LOR; in years) in Canada or the USA. A third “Motivation” factor was only observed for the Koreans, but it accounted for only 1.3 percent of additional variance.
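The two-step analysis described above, reducing questionnaire items to principal components and then regressing accent ratings onto the component scores, can be sketched in a few lines. The data below, and the relations among the variables, are simulated purely for illustration and do not reproduce the study’s values:

```python
# Illustrative sketch of a PCA-then-regression pipeline, using plain NumPy.
# All data are simulated; variable names and relations are invented.
import numpy as np

rng = np.random.default_rng(42)
n = 240  # participants

# Simulated questionnaire variables: AOA, L1 use, English use, age, LOR,
# correlated by construction, as such variables typically are.
aoa = rng.uniform(2, 23, n)
l1_use = 0.6 * aoa + rng.normal(0, 3, n)
eng_use = -0.5 * aoa + rng.normal(0, 3, n)
age = rng.uniform(20, 60, n)
lor = age - aoa + rng.normal(0, 2, n)
X = np.column_stack([aoa, l1_use, eng_use, age, lor])
Z = (X - X.mean(0)) / X.std(0)  # standardize the items

# PCA via SVD: keep the first two principal-component score vectors.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U[:, :2] * S[:2]

# Regress simulated foreign-accent ratings onto the component scores.
fa = 0.8 * aoa + rng.normal(0, 4, n)
A = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(A, fa, rcond=None)
resid = fa - A @ beta
r2 = 1 - resid.var() / fa.var()
print(f"variance in FA explained by 2 components: R^2 = {r2:.2f}")
```

Inspecting the loadings in `Vt` would show which items dominate each component, analogous to the AOA/L1-use factor and the age/LOR factor reported in the text.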
34.3.3 LOR Effects on English Liquid Production
Many studies have examined the production of English liquids by native Japanese (NJ) adults because English has two liquids where Japanese has just one (symbolized here as /R/). The Japanese /R/ differs from both English liquids. In early stages of exposure to English, some NJ adults seem to substitute Japanese /R/ for both English liquids, creating the impression for NE-speaking listeners of quasi-random substitution of /r/-for-/l/ and /l/-for-/r/ (Flege, Aoyama, & Bohn, 2021).
Long-standing interest in NJ participants’ production of English liquids arose from the desire to understand if the capacity to learn new speech sounds is lost or reduced following the close of a critical period. Two such studies were those of Flege, Takagi, and Mann (1995) and Larson-Hall (2006). The participants in both studies had comparable experience with English at school in Japan before arriving in the USA as adults, and both studies focused on liquids in minimally paired English words such as right and light. Studies prior to 1995 consistently showed production and perception differences between English monolinguals and NJ adults who had lived in the USA for less than six years (see Flege, Takagi, & Mann, 1996). Such participants might have received less English input than some monolingual English children who still have difficulty producing English /r/ like their parents and older siblings. Flege et al. (1995) hypothesized that the NJ adults examined in earlier research may have erred in producing English liquids because of inadequate input, not a diminished capacity to learn speech.
Flege et al. (1995) tested the input hypothesis by recruiting three groups of twelve adults each in Irvine, CA: NE monolinguals, NJ speakers who had lived in the USA for less than three years (IJ; M = 1.6), and “experienced” NJ adults who had lived there for more than twelve years (EJ; M = 20.8). The thirty-six participants’ productions were elicited in three different ways. In the “Definition” task, the experimenter, an NJ graduate student, said the number of each trial in English and displayed a short written English definition of each successive test word (e.g., “What we get from the sun” to elicit light). After he said the Japanese translation equivalent of a test word (e.g., hikari to elicit English light), the participants said the English test word aloud. The experimenter nodded if the intended word was said, indicating to participants that they were to say the word again at the end of a carrier phrase (The next word is __). In the few cases in which the intended test word was not said, the experimenter offered other hints in Japanese until the intended word was said aloud, first in isolation (followed by a nod) and then at the end of the carrier phrase. In the “Reading” task that followed, the participants simply read the English words from a written list, in isolation. Finally, in the unscripted “Spontaneous” task, participants produced the test words in an English phrase or sentence of their own choosing.
The NE-speaking listeners evaluated the production of liquids in four experiments. In one experiment, listeners rated consonant-vowel (CV) syllables derived from four consonant-vowel-consonant (CVC) words with /r/ (right, rock, read, rate) and four with /l/ (light, lock, lead, late) in eight counterbalanced blocks, one for each Liquid x Target word combination. The intended identity of the initial consonant in each block was always known beforehand. The productions of each target word by the thirty-six participants were randomly presented four times each to all twelve listeners. Responses to the first random presentation of the CV stimuli were counted as practice and so not analyzed. Six mean ratings of liquid production accuracy were computed for each participant by averaging over the ratings obtained from the listeners for each Liquid x Elicitation Condition combination.
The mean ratings were examined in a (3) Group x (2) Liquid x (3) Elicitation Condition ANOVA (analysis of variance), which yielded a significant Group x Elicitation interaction and a significant main effect of Group. Liquids produced by the NE speakers in two of the three elicitation conditions (Reading and Definition, but not Spontaneous) received significantly higher ratings than liquids spoken by the EJ speakers. Productions by members of the NE and EJ groups received significantly higher ratings than productions by members of the IJ group for liquids elicited in all three elicitation conditions.
To reach a more general conclusion, Flege et al. (1995) calculated a global rating for each of the thirty-six participants, averaging over the elicitation conditions and liquids. The global means obtained for ten of the twelve EJ participants fell within the range of mean ratings obtained for the NE speakers, and the means for the remaining two EJ participants fell just outside the NE range. This, together with the earlier analysis, led Flege et al. (1995) to conclude that most NJ speakers who arrive in the USA as adults and who have sufficient experience hearing and speaking English will eventually learn to produce English liquids accurately.
A different conclusion regarding adults’ capacity to learn L2 speech following the end of a supposed critical period was drawn from an attempted replication of Flege et al. (1995). Larson-Hall (2006) recruited fifteen NE and thirty NJ participants in Pittsburgh, PA. The NJ participants in the relatively inexperienced group (IJ) had lived in the USA for 0.3 to 3 years (M = 1.1); those in the more experienced group (EJ) had lived there for 12 to 42 years (M = 23.2).
The original and replication studies were similar in three respects: a similar difference in mean LORs between relatively experienced and inexperienced NJ participants, use of the same seven-point rating scale to evaluate liquid production accuracy, and elimination of word-final consonants via editing to avoid lexical bias effects in listener evaluations (see Flege et al., 1996). The two studies differed in other important respects, however. Larson-Hall (2006, experiment 2) examined a different set of English minimal pairs than Flege et al. (1995) did. The twelve NE-speaking listeners who rated CV stimuli in Pittsburgh were each assigned to rate one-third of the stimuli rather than all stimuli. The listeners in Pittsburgh responded to just two (rather than four) random presentations of the CV stimuli. Perhaps most importantly, the CVs derived from /r/- and /l/-initial words were randomly presented to listeners in Pittsburgh together in single blocks rather than in separate counterbalanced blocks. This necessitated showing in writing on a computer screen the intended identity of the consonant to be rated on each trial (/r/ or /l/) as it was presented auditorily.
A (3) Group x (2) Liquid consonant ANOVA examining the ratings yielded a significant main effect of Group and a significant Group x Liquid interaction, but a nonsignificant main effect of Liquid. Post hoc tests indicated that significantly higher ratings were obtained for both /r/ and /l/ productions by members of the NE group than by members of both the IJ and the EJ groups, who did not differ significantly from one another.
The results obtained by Larson-Hall (2006) failed to replicate the better performance by EJ than IJ participants in Flege et al. (1995). In fact, evidence of a difference in the opposite direction was reported. The correlation between LOR and the ratings of English liquid production accuracy was r(28) = −0.35 (p = 0.06) for the thirty NJ participants. However, when the two NJ groups of fifteen each were considered separately, a significant positive correlation was obtained for the relatively inexperienced participants in IJ (r(13) = 0.51), whereas a significant negative correlation was obtained for the more experienced participants in EJ (r(13) = −0.81). In other words, production accuracy increased for the IJ group as years of residence in the USA increased from 0.3 to 3.0 years while production accuracy decreased for the EJ group as years of residence increased from 12 to 42 years.
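The pattern of opposite within-group correlations is easy to reproduce with toy numbers. The Pearson formula below is the standard one; the LOR/rating pairs are invented solely to illustrate how a positive trend in one subgroup and a negative trend in the other can coexist with a weak pooled correlation.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented (LOR, rating) pairs: accuracy rises with LOR in the
# inexperienced group but falls with LOR in the experienced group.
ij = [(0.5, 3.0), (1.0, 3.6), (2.0, 4.1), (3.0, 4.4)]
ej = [(12, 5.0), (20, 4.4), (30, 3.9), (42, 3.2)]

r_ij = pearson_r(*zip(*ij))   # positive within-group correlation
r_ej = pearson_r(*zip(*ej))   # negative within-group correlation
```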
The source(s) of the divergence in findings obtained in these two studies is impossible to determine because of methodological differences between them. I will therefore simply consider in turn five possible causes of the divergence. First, different experimenters elicited the speech samples in different locations. Docherty (2022) noted that researchers who elicit speech in a laboratory setting often overlook factors that sociolinguists recognize as sources of structured heterogeneity in speech production. All thirty-six participants in Irvine (Flege et al., 1995) were tested by the male NJ graduate student who had earlier invited the NJ participants to take part in a “word knowledge” study on the UC–Irvine campus. He never corrected the NJ participants’ pronunciation of English. The NJ participants in Pittsburgh (Larson-Hall, 2006), presumably tested by a female NE graduate student, were asked to participate in a study that examined pronunciation and were sometimes corrected (Larson-Hall, 2006, p. 528). These differences may have led to greater anxiety for members of the EJ group in Pittsburgh than in Irvine. If so, this may have contributed to the relatively poor performance by members of the EJ group in Pittsburgh.
Second, speaking rates may have differed across the two studies. This is important because NE-speaking listeners use the duration of first-formant (F1) transitions (along with second-formant and third-formant onset frequencies) when identifying liquids as /r/ or /l/ (e.g., Gordon, Keyes, & Yung, 2001). The NJ participants in Pittsburgh were asked to produce successive words at a self-selected “normal pace” (Larson-Hall, 2006, p. 527), but in Irvine the pace was controlled by the experimenter, who initiated successive trials by saying aloud in English the number of the next trial.
Third, different NE-speaking listeners evaluated liquid production accuracy in the two studies: graduate students at the University of Alabama at Birmingham versus undergraduate students at the University of Pittsburgh. For the research in Irvine, mean ratings of production accuracy were based on more judgments of each participant’s production of the two liquids (12 listeners x 5 words x 3 repetitions x 3 elicitation conditions = 540 for both /r/ and /l/) than in the Pittsburgh study (4 listeners x 5 words x 2 repetitions = 40 per liquid).
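The per-liquid judgment counts given above follow directly from the two designs:

```python
# Number of judgments underlying each participant's mean rating, per
# liquid, using the figures cited in the text.
irvine = 12 * 5 * 3 * 3   # listeners x words x repetitions x conditions
pittsburgh = 4 * 5 * 2    # listeners x words x repetitions
```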
Fourth, the NJ groups in the two studies were matched for LOR but not for the quantity and quality of the English input they had received. The NJ participants in Irvine were recruited as they left a store that sold Japanese food and cosmetics. In Pittsburgh, the IJ participants were recruited via ads and flyers placed on campus and tested there, whereas the EJ participants were recruited via a Japanese organization and were tested at their home or office rather than on campus. Use of LOR may provide an appropriate index of quantity of English input only for individuals who have the need and opportunity to communicate in English on a regular basis (Flege & Liu, 2001). Neither of the two studies assessed NJ participants’ need/opportunity to use English. This means that equating EJ groups for LOR did not guarantee that the EJ groups in the two studies were equally experienced in English.
Finally, raw data were treated differently in the two studies. All ratings obtained in Irvine were examined, whereas in Pittsburgh “20% trimmed means and percentile bootstrapping” were used prior to the statistical analyses (Larson-Hall, 2006, p. 530). Eliminating data in L2 production research on purely statistical grounds might be questioned in research designed to assess non-native performance via comparisons to a group of native speakers. Such an approach assumes, perhaps inappropriately, that the native speakers in the comparison group adequately represent the kind of input to which the non-native speakers have been exposed.
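For readers unfamiliar with these robust statistics, a minimal sketch of a 20 percent trimmed mean and a percentile bootstrap interval follows. These are the textbook definitions, not Larson-Hall’s actual analysis code, and the ratings are invented.

```python
import random

def trimmed_mean(values, proportion=0.20):
    """Mean after discarding the lowest and highest `proportion` of values."""
    xs = sorted(values)
    k = int(len(xs) * proportion)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def percentile_bootstrap_ci(values, stat=trimmed_mean, n_boot=2000,
                            alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for `stat`."""
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(values) for _ in values])
                   for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Ten invented seven-point ratings: trimming drops the two lowest and
# two highest values, so the outlying "1" never reaches the mean.
ratings = [1, 4, 4, 5, 5, 5, 6, 6, 6, 7]
tm = trimmed_mean(ratings)               # mean of 4, 5, 5, 5, 6, 6
lo, hi = percentile_bootstrap_ci(ratings)
```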
Figure 34.2 shows weighted intelligibility scores obtained by Flege et al. (1995) for each of nineteen English words beginning with /r/ (left panel) and each of nineteen words beginning with /l/ (right panel). The /r/ scores obtained for a single NE speaker, participant #6, have been arranged in increasing order of intelligibility in the left panel; the average scores obtained for the remaining eleven NE speakers have been arranged in the same order. Eight /r/-initial words produced by participant #6 had a score of less than 5, indicating that these tokens were heard predominantly as /l/ rather than /r/. The scores for the other eleven NE speakers indicate near-perfect identification of all nineteen words with /r/. The anomaly for participant #6 was corroborated by acoustic analyses (Flege, Aoyama, & Bohn, 2021, p. 109). Inspection of the results obtained for /l/, on the other hand, indicates that participant #6 produced all nineteen words with /l/ like the other eleven NE speakers. A consideration of these data suggests that automatically eliminating data on a purely statistical basis may eliminate some of the natural variation that occurs in the input that learners receive while learning an L2, and so may undermine the use of native-speaker comparison groups.

Figure 34.2 Averaged weighted intelligibility scores obtained for nineteen /r/-initial and nineteen /l/-initial words produced by eleven native English (NE) speakers and by a single NE participant (#6) considered separately.
Taken together with the results just presented, the observation by Feest (2019) that no two experiments can ever be completely the same raises the issue of whether perfect replications are possible. The results of one recent study suggest that near-perfect replication is indeed possible, at least if the same participants are tested and all aspects of the protocol are the same.
Flege and Bohn (2021, pp. 26–28) reported a longitudinal study in which English words were produced by 20 NE and 149 NI speakers in 1992 and again in 2003, using the procedures described by Flege, Munro, and MacKay (1995b). The words were recorded in the same place using identical procedures and equipment at Time 1 (late 1992) and Time 2 (the summer of 2003). The VOT values in word-initial /p t k/ tokens produced at Time 1 and Time 2 correlated strongly, r = 0.82. Of the 149 NI participants tested, 20 reported a higher use of English in 2003 than in 1992 and 20 reported using English less. When these forty NI participants were excluded, the Time 1–Time 2 correlation increased to r = 0.95. The authors concluded that while the language use patterns of immigrants usually do stabilize at some point, individuals may experience important life changes (e.g., remarriage, a job change, relocation to a new neighborhood) that will alter the input distributions defining their phonetic categories, resulting in measurable phonetic-level changes.
34.4 Experimental Details Influence Replicability
Findings reviewed earlier indicated that research outcomes may be replicated almost perfectly but that an attempted replication may sometimes yield a finding that diverges substantially from prior research. This section considers studies which, when taken together, suggest that even seemingly small details of an experimental protocol may influence whether a research finding is or is not replicated.
34.4.1 Speaking Rate
Flege (2021) reexamined the results of an earlier study which addressed the question of whether NE speakers who began learning French after the age of eighteen years could produce French /p t k/ with appropriate short-lag VOT values rather than with English-like long-lag VOT values. Birdsong (2003) recorded twenty-one NE speakers who had lived in France for six to twenty-two years. Nine participants (43 percent) produced average VOT values in French that differed by less than 1 standard deviation from the overall mean value obtained for a native French (NF) comparison group. This finding suggested that an authentic production of French /p t k/ is not impossible for some Late learners. Birdsong attributed the substantial intersubject variability to differences in motivation and/or phonetic training.
Birdsong (2003) may have underestimated the performance of the Late learners he recorded in Paris. Speaking rate was not controlled and so, not surprisingly, the Americans’ vowels were 48 percent longer on average than those produced by the NF speakers in the same words. Flege (2021) adjusted the NE speakers’ French VOT values using the VOT–vowel duration relationship reported for English monolinguals by Theodore, Miller, and DeSteno (2009). Doing so shortened the NE speakers’ French VOT values by 8.7 ms, on average, leaving an overall NE versus NF difference which averaged just 2.1 ms, slightly less than the expected measurement error.
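The logic of the adjustment can be sketched as follows, assuming a linear VOT–vowel duration relationship. The slope and all numbers below are hypothetical illustrations, not the figures reported by Theodore, Miller, and DeSteno (2009) or by Flege (2021).

```python
def rate_adjusted_vot(vot_ms, vowel_ms, ref_vowel_ms, slope):
    """Adjust a VOT value to a reference speaking rate, assuming VOT
    grows linearly with vowel duration (slope = ms of VOT per ms of
    vowel duration).  The slope must come from rate-normalization data
    such as Theodore et al. (2009); the value used below is invented."""
    return vot_ms - slope * (vowel_ms - ref_vowel_ms)

# Illustrative: a talker whose vowels are 48 percent longer than the
# 100 ms reference produces 45 ms VOTs; with a hypothetical slope of
# 0.12, the adjustment shortens VOT toward the reference-rate value.
adjusted = rate_adjusted_vot(vot_ms=45.0, vowel_ms=148.0,
                             ref_vowel_ms=100.0, slope=0.12)
```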
34.4.2 Clear Speech
Speaking clearly augments intelligibility when it has been induced via instructions in a laboratory setting. The extent to which speech has been rendered “clear” is usually indexed by determining the percentage of spoken words that have been identified as intended (e.g., Hazan et al., 2018; Smiljanić & Bradlow, 2009). Acoustic measures have generally shown that compared to “plain” speech, clear speech is produced at a slower speaking rate, vowels are more widely separated, and pitch ranges increase (e.g., Lam, Tjaden, & Wilding, 2012; Smiljanić & Bradlow, 2005).
The extent to which intelligibility increases in clear speech has varied widely across published studies (e.g., Uchanski, 2005). Importantly, some individuals have been found to be more intelligible than others even without having been instructed to speak clearly (Bond & Moore, 1994). Such “inadvertently” clear speakers might be those who spontaneously switch to clear speech simply because they are asked to speak into a microphone while seated in a sound booth.
34.4.3 L2 Segmental Production Accuracy
Many previous studies have attempted to determine if “post-Critical Period” (i.e., Late) learners retain the capacity to establish new phonetic categories for certain sounds found in the L2 but not the L1. As already mentioned, Flege (1991) found that native Spanish Late learners differed substantially in producing VOT in English words. He suggested (see also Flege [1995], but compare with Flege & Bohn, 2021) that some but not all Late learners may maintain the capacity for phonetic category formation.
Late learners’ capacity to establish new phonetic categories for long-lag English stops was evaluated for NI immigrants in Canada by Flege, Munro, and MacKay (1995b). As in the companion FA study, 240 NI participants were assigned to 10 subgroups based on their AOA in Canada. The NI participants and twenty-four NE speakers heard CVC words at the end of a carrier phrase (The next word is __) and then, following a delay, repeated the words at the end of another carrier phrase (Now I say __). The VOT was measured in word-initial productions of /p t k/. The members of some AOA-defined Late learner groups produced significantly shorter VOT values in /p/ and /t/ than the members of the NE comparison group did. This supported the hypothesis that some Late learners may lose the capacity to establish new phonetic categories.
This conclusion may have been premature. Flege et al. (1995b) observed that some of the eighty NI participants who arrived in Canada between the ages of sixteen and twenty-three, but none of the eighty Early learners who arrived between the ages of two and nine, produced English /p/ and /t/ with Italian-like short-lag VOT values. One possibility is that some Late learners code-switched (e.g., Olson, 2016; Piccinini & Arvaniti, 2015), even though the protocol was administered in English by an experimenter who could not speak Italian. More specifically, the Late learners who produced English /p/ and /t/ as Italian short-lag stops may have been producing English words using phonetic categories from their Italian phonetic subsystem. This possibility demonstrates the need to control factors that might influence code-switching, especially language mode (Grosjean, 2001), and underscores the importance of the elicitation method.
34.5 No Single Style Talkers
Talkers alter their speech to “maximize communicative efficacy” (Antoniou et al., 2010, p. 641). They may do so to make their speech easier to understand, to make it more (or less) like that of their interlocutor(s), or to make it more appropriate for a particular social context, speaking style, or topic. The acoustic-phonetic changes that occur when speaking styles change may be global in nature (e.g., changing speaking rate), localized at a segmental level, or both (Bradlow, Blasingame, & Lee, 2018).
Talkers’ propensity to alter their speech poses a challenge for replicability in speech production research and raises the issue of how best to elicit speech. Most speech elicited for production research is obtained in a laboratory setting (Xu, 2010). One wonders to what extent laboratory speech reflects how people usually speak and to what extent it reflects spontaneous adaptations to a particular speaking situation.
Labov (1972, 1984) suggested that changes in speaking style may accompany a change in attention to speech, that is, the extent to which talkers focus on the form rather than the content of what they are saying. Bell (2001, p. 145) proposed that style shifting represents a speaker’s response to a particular paralinguistic context, including the interlocutor(s). Giles, Coupland, and Coupland (1991) went further, arguing that stylistic variation may represent an active attunement to the speech of interlocutors.
Sociolinguists assess production differences across speaking styles to identify structured variation that conveys social information. They often record speech in the community rather than in a laboratory, using interactional techniques intended to put their participants at ease. One speaking style of special interest is vernacular speech, a form of speech that adults are most likely to use when speaking to close friends and family members in a relaxed setting. Vernacular speech is acquired early in life because children learning their native language attend to patterns of pronunciation in the ambient language (Labov, 1972, 2014, p. 17).
Vernacular speech is thought to be the least self-conscious form of speech in a talker’s repertoire, and therefore the most regular. It is defined empirically in sociolinguistic research by the frequency of specific phonetic variants that are articulated in different ways and transcribed differently. Variation along a single acoustic-phonetic dimension, such as VOT in segments that are not transcribed as distinct sounds, is also examined (Docherty, 2022). Tarone (1979, p. 189) encouraged L2 researchers to focus on learners’ production of vernacular speech rather than on more formal variants of the L2 that might be learned at school. In agreement with Labov (1972), she speculated that vernacular speech represents the “most systematic” form of speech to which L2 learners are exposed, even though it might not represent the most prestigious form of the L2 being learned.
The results obtained in a study of NI immigrants in New York City bear on this hypothesis. Oyama (1982) assessed the degree of FA in English materials spoken by sixty NI males whose ages of arrival in the USA ranged from six to twenty years and whose LORs ranged from five to eighteen years. The participants read an English paragraph and then recounted a moment in the past in which they had feared dying. The unscripted “danger of death” samples were rated as less foreign-accented than the paragraphs that were read. Oyama (1982, pp. 26–27) speculated that the reading task may have been stressful for some participants because they were “painfully aware” of their Italian accents and knew that their pronunciation of English was under scrutiny.
Paralinguistic context is defined by several factors, one being the interlocutor (Giles et al., 1991). Hay, Drager, and Warren (2009) observed a difference in vowel production by New Zealanders who interacted with an experimenter from the USA and another from New Zealand. The production differences were attributed to accommodation to the dialect of the experimenters. Given the ubiquity of speech accommodation (Tobin et al., 2017), one wonders if differences in segmental production might arise when speech is elicited by multiple experimenters having the same dialect. This possibility has received surprisingly little attention and, in fact, the person who elicits speech is often not explicitly identified (e.g., Birdsong, 2003; Flege et al., 1995; Herd, 2020; Larson-Hall, 2006).
As mentioned earlier, talkers differ substantially in how much they alter their speech when asked to speak “clearly” and some speak clearly even when not instructed to do so. Participants in speech production research are sometimes asked to speak “normally.” However, there is no guarantee that they will all understand such instructions in the same way or be equally willing and/or able to follow them.
Flege (2021) proposed that speech production research should focus on how participants speak “characteristically,” that is, when they are not attempting to accommodate to the speech of others and when not attempting to use a particular speaking style that they deem appropriate to a specific speaking context. Characteristic speech is not necessarily the form of speech that individuals use most frequently or on a daily basis. It is probably impossible to know if a speech sample is characteristic of an individual if observation is confined to a single sample. This is because individual differences that are noted in a single session may reflect how individuals adapt to the paralinguistic context of an experimental protocol.
For a sample to be considered characteristic of an individual’s speech, the individual must demonstrate the same speech behavior in multiple sessions. Multiple test sessions are now beginning to appear in perception research (e.g., Idemaru, Holt, & Seltman, 2012; Schertz et al., 2015) but are still quite rare in speech production research (Wade, Lai, & Tamminga, 2021). Individual differences in production that persist over time, if observed, are likely to derive from previous input, especially input heard in childhood (Flege, 2019; Flege & Bohn, 2021; Schertz, Carbonell, & Lotto, 2020).
Testing in a sound booth may cause some individuals to speak differently than they do before and after entering the sound booth. Labov (1972, p. 209) observed that the aim of sociolinguistic research is to find out how people talk “when they are not being systematically observed” but that the only way to do so is through systematic observation. This is the observer’s paradox. One aim of the new Characteristic Speech Production (CSP) elicitation procedure presented in the following sections is to resolve this long-standing quandary in all kinds of speech production research.
34.6 Speech Elicitation
34.6.1 Existing Procedures
Many procedures have been used to elicit speech for production research. Participants have been asked to read lists of words, phrases, or sentences; to imitate, mimic, or shadow auditorily presented materials; to retell a story just heard or depicted in a series of pictures or line drawings; to speak extemporaneously on a self-selected or indicated topic; or to describe a configuration of objects or events to someone who cannot see what is being described.
The choice of elicitation procedure is rarely explained or justified. Relatively few studies have attempted to recreate the fundamentally communicative aspect of speech, one exception being the Diapix task (e.g., Baker & Hazan, 2011). Picture naming tasks are favored by those who want to avoid the use of written materials, but pictures sometimes evoke different responses (names) and may introduce hesitancy (see Duñabeitia et al. [2022] for recent improvements).
None of the elicitation procedures now in use represents a gold standard. This raises the issue of how to interpret divergent findings that might arise when speech is elicited using different procedures. That being the case, there is a need for a new elicitation procedure that will augment the reproducibility of individual talkers’ production patterns.
34.6.2 A New Procedure
A new method for eliciting speech is presented here. It is a Characteristic Speech Production (CSP) procedure in which participants give meaningful answers to meaningful questions, knowing that they are being recorded but unaware that the research focuses on their pronunciation of vowels and consonants in the answers they give.
Many variants of the CSP procedure exist (Flege, 2021). In the variant outlined in this section, the participants will see pictures of a series of consumer items and make vocal judgments regarding the accuracy of the price shown above each picture. In other variants, participants might be asked to study three pictures per trial and then judge each in turn. For example, NJ speakers might be asked to participate in a “Fashion and Accessories” activity. They would be asked to indicate in English how likely they would be to wear a particular garment or hat or sunglasses. The three vocal responses to be given (No, it’s really ugly; It’s okay but not great; Yes, I like it a lot) would yield /r/ and /l/ tokens. In a “Category Goodness” variant, participants would see three examples of a natural category, for example three different horses. They would rate the representativeness of each instance (picture) as an example of the category. The responses (Poor example; Better example; Best example) would provide numerous /b/ and /p/ tokens.
The aim of a CSP procedure is to provide multiple productions of a target sound in the same phonetic context by each participant. Just as importantly, it can be used to elicit speech in multiple sessions simply by changing the picture stimuli. By hypothesis, multiple productions of each target segment will be more stable than productions elicited using other procedures because participants do not need to accommodate to the speech of others or try to adopt a particular speaking style. Given that the CSP procedure is new, its efficacy must be evaluated formally; a method for doing so is presented in Section 34.6.3.
34.6.3 Test Design
The hypothetical evaluation procedure outlined here calls for the recruitment of monolingual native speakers of two different dialects, for example the Southern and North Central dialects of American English. Speakers of both dialects will be randomly assigned to produce the same English sentences using a CSP procedure or a traditional list reading (LR) procedure. By hypothesis, the CSP procedure will yield production samples that are more characteristic of how the participants speak, leading to the prediction that dialect-specific production patterns will be more evident in samples elicited using the CSP than the LR procedure.
All participants will be students at a four-year institution located in a small town where the two dialects predominate (e.g., Delta State University in Cleveland, MS and Saint Cloud State University in St. Cloud, MN). Selection requirements include predominant use of the local dialect, an age of eighteen to twenty-six years (balanced for gender), agreement to participate in two 15-minute sessions separated by a week, and consent for recordings to be used “for research purposes.”
Participants will be tested in a small, nonreverberant room on campus that is not associated with an academic department. The local experimenters (henceforth “E”) who administer the protocol at the two test locations will be of the same age and dialect as the participants, who will provide background information when they are recruited via social media.
To begin, E will initiate a practice session designed to show participants how to respond. Each of the six practice items will include a picture of an item for sale on Amazon.com and a short, written description of it. Recordings of the written descriptions that have been produced by representative speakers of the local dialect will be output aurally as each picture is shown.
Participants will see a price above each picture that is written but not vocalized. The LR participants will always see the actual, advertised price. The CSP participants will see the advertised price on one-third of the test trials (and for two of the six practice items), prices that are 30 percent too low on one-third of the trials, and prices that are 30 percent too high for the remaining one-third of the trials. For example, participants might see a picture of a “Coleman ultra compact aluminum camping table” with a written price, $29.04, that is lower than the actual advertised price ($41.49), a “Mexerris 360 rotating microfiber dustmop” shown with the correct price ($24.99), or “Braun Series 7390 hair clippers for men” with a price of $77.92 that is higher than the actual advertised price ($59.94).
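The price manipulation is simple arithmetic, rounded to cents. The sketch below reproduces the two manipulated prices given in the text; the function name is illustrative.

```python
def manipulated_prices(advertised):
    """Return the three price conditions for one item: 30 percent too
    low, correct, and 30 percent too high, rounded to cents."""
    return {
        "too_low": round(advertised * 0.70, 2),
        "correct": round(advertised, 2),
        "too_high": round(advertised * 1.30, 2),
    }

# Worked examples from the text: the $41.49 camping table shown at
# $29.04 (too low) and the $59.94 hair clippers shown at $77.92 (too high).
table = manipulated_prices(41.49)
clippers = manipulated_prices(59.94)
```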
The vocal responses that participants can give will be shown in writing above each picture. The response alternatives offered to CSP participants will be (a) The price is too high, (b) The price is just right, or (c) The price is too low. They will need to select and then say aloud one of these response alternatives based on their judgment regarding the accuracy of the price that is shown. The LR participants will see each of the three response alternatives in one-third of the test trials. They will simply read the single response alternative (sentence) they see above each picture as it appears.
After participants have successfully completed their respective practice sessions, sixty test trials will be presented in the same order to members of the CSP and LR groups. Participants will be told that all sixty items they will see are currently for sale on Amazon.com with prices ranging from $10 to $100.
Feedback will be provided to members of the CSP but not the LR group. As mentioned, LR participants will always see the correct, advertised price for each item, whereas CSP participants will see the correct price in only one-third of the test trials. Once participants in both groups have given an appropriate vocal response to each test trial, a slightly modified version of the screen that was just used to prompt a vocal response will appear. For CSP participants, the actual advertised price of an item will replace the written price that was shown on the preceding screen. This will let CSP participants know if the response alternative they just gave was factually correct or not. For LR participants, the correct price shown on the screens used to elicit vocal responses will be replaced by the message Thank you.
34.6.4 Analyses
Acoustic analyses will focus on four phonetic segments found in participants’ vocal responses: the /u/ in too, the /aɪ/ in price and high, the /o/ in low, and the /t/ in too. Formant measurements will indicate the location of vowels in a normalized front-back space for multiple productions of /u/ and /o/ and the amount of movement in an F1-F2 space for /aɪ/. These frequency measurements will be taken at the middle (50 percent) points of /u/ and /o/ productions, and at the 20 percent and 80 percent points for /aɪ/. Compared to speakers of the North Central dialect, Southern dialect participants are expected to front /u/ and /o/ productions to a greater extent. They are also expected to produce /aɪ/ with less formant movement than the North Central dialect participants (e.g., Clopper, Pisoni, & de Jong, 2005; Herd, 2020; Jacewicz & Fox, 2020; Shport & Herd, 2020; Walker, 2020).
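One way to operationalize these measurements is sketched below. Treating “formant movement” as the Euclidean F1-F2 distance between the 20 percent and 80 percent points is an assumption on my part, and the formant tracks are invented.

```python
def track_value(track, prop):
    """Linearly interpolate a formant track (equally spaced samples
    spanning the segment) at proportion `prop` of its duration."""
    pos = prop * (len(track) - 1)
    i = int(pos)
    if i + 1 >= len(track):
        return track[-1]
    frac = pos - i
    return track[i] * (1 - frac) + track[i + 1] * frac

def formant_movement(f1, f2):
    """Euclidean F1-F2 distance (Hz) between the 20% and 80% points,
    a simple index of diphthong formant movement."""
    df1 = track_value(f1, 0.8) - track_value(f1, 0.2)
    df2 = track_value(f2, 0.8) - track_value(f2, 0.2)
    return (df1 ** 2 + df2 ** 2) ** 0.5

# Hypothetical /ai/ tracks (Hz): a fully diphthongal token moves much
# farther in the F1-F2 space than a monophthongized Southern-style token.
full = formant_movement([750, 720, 650, 550, 450],
                        [1100, 1300, 1600, 1900, 2100])
flat = formant_movement([750, 745, 740, 735, 730],
                        [1150, 1180, 1210, 1240, 1260])
```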
There is no reason to expect a cross-dialect difference in the VOT values in /t/. However, previous research leads to the expectation of significant differences between individual speakers of both dialects (e.g., Flege et al., 1998; Theodore et al., 2009). The question of interest regarding /t/ is whether individual differences in VOT (assuming they are observed) will be maintained within and across recording sessions to a greater extent in speech samples that have been elicited using the CSP rather than the LR procedure.
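The maintenance of individual VOT differences can be quantified in (at least) two simple ways, sketched below under simulated data: the average token-to-token standard deviation within speakers, and the correlation of speaker mean VOTs across two sessions. A higher between-session correlation, and a smaller token-to-token SD, in CSP- than LR-elicited samples would support the proposed procedure. All speakers, sessions, and values here are simulated for illustration.

```python
import numpy as np

def speaker_stability(vot):
    """vot: array of shape (speakers, sessions, tokens), in ms.
    Returns (mean within-speaker token SD, Pearson correlation of
    speaker mean VOTs between session 1 and session 2)."""
    token_sd = vot.std(axis=2, ddof=1).mean()
    means = vot.mean(axis=2)              # shape: (speakers, sessions)
    r = np.corrcoef(means[:, 0], means[:, 1])[0, 1]
    return token_sd, r

rng = np.random.default_rng(0)
# Stable speaker-specific VOT baselines, plus session/token noise.
speakers = rng.normal(70, 10, size=(12, 1, 1))
vot = speakers + rng.normal(0, 4, size=(12, 2, 20))
sd, r = speaker_stability(vot)
```

Because the simulated speaker baselines are stable, the between-session correlation comes out high; real data elicited under the LR procedure would, by hypothesis, show a weaker correlation than CSP data.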
Acoustic-phonetic measures of the four target sounds will be examined separately using generalized linear mixed-effects models that evaluate differences between productions elicited using the CSP and LR procedures. Given evidence of drift during a recording session (Mielke & Nielsen, 2018), the first and the last seven productions of each target sound will be considered separately. The models will be fitted to acoustic-phonetic measures with Elicitation Condition, Dialect, and Position-in-List (the first versus the last seven tokens) and their interactions as fixed-effects predictors. For analysis of /aɪ/, frequency measures taken at the 20 percent and 80 percent points will also be entered as fixed-effect predictors. For participant-level effects, models will be fitted with random intercepts and random slopes, with the contribution of by-participant slopes to the final fit of each model assessed using the lme4 package in R (Bates et al., 2015).
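As a rough illustration of this analysis plan, the sketch below fits an analogous model in Python’s statsmodels rather than lme4, with random intercepts only (by-participant slopes would be added via the `re_formula` argument). The VOT data, effect sizes, and variable names are invented for the sake of the example and carry no empirical weight.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for p in range(20):                          # 20 hypothetical participants
    dialect = "South" if p < 10 else "North"
    base = rng.normal(70, 8)                 # speaker-specific VOT baseline (ms)
    for cond in ("CSP", "LR"):
        for tok in range(14):                # the first and last seven tokens
            pos = "first" if tok < 7 else "last"
            vot = base + (2.0 if cond == "LR" else 0.0) + rng.normal(0, 4)
            rows.append(dict(participant=p, condition=cond,
                             dialect=dialect, position=pos, vot=vot))
df = pd.DataFrame(rows)

# Random intercepts by participant; the plan above also assesses
# by-participant random slopes (re_formula="~condition" in statsmodels,
# analogous to lme4's (1 + condition | participant)).
model = smf.mixedlm("vot ~ condition * dialect * position", df,
                    groups=df["participant"])
fit = model.fit()
print(fit.params.round(2))
```

Support for the CSP procedure would appear as a reliable Elicitation Condition effect (and Condition-by-Dialect interaction) in models of this form fitted to the real measurements.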
Use of the CSP rather than the LR procedure in future research would be supported by evidence that cross-dialect differences are significantly greater in CSP- than LR-elicited speech. Additional support would be provided by evidence of significantly smaller changes within and across sessions, and less token-to-token variability in individual participants’ multiple productions of target sounds, in CSP- than LR-elicited speech samples.
Another method that might be used to evaluate the efficacy of the two elicitation procedures would be to obtain listener judgments, as was done by Munro, Derwing, and Flege (1999) in an investigation of cross-dialect acquisition. This would involve randomly presenting target sentences to native speakers and current active users of the Southern and North Central dialects, who would rate each sentence on a scale ranging from 1 (Very Southern) to 9 (Very Northern). By hypothesis, the North versus South differences would be significantly greater for sentences elicited using the CSP than the LR procedure.
34.7 Future Research
Researchers face many important choices whenever they test a hypothesis. Coretta et al. (2023) found that teams of researchers who measured and statistically analyzed the same speech samples sometimes made different decisions, and that these decisions influenced whether statistically significant effects were found and, if so, the size of those effects. Despite progress in the quest for replicability, the earlier phases of research have been virtually ignored in the replication crisis literature, leaving a notable gap. The important decisions to be reached in pre-data analysis phases of research include establishing criteria for selecting participants, establishing separate blocks and conditions to differentiate behavior, deciding what kind and how much data to obtain, and – most important in the present context – determining where, how, and by whom the data to be analyzed will be elicited.
The evidence reviewed earlier in this chapter suggested the need for a better way to elicit speech. Comparisons of previously published studies have been impeded by the use of different speech elicitation procedures. Also, different experimenters, often unnamed, interact with participants in ways that are seldom described.
The CSP procedure proposed here is conceptually simple and fairly easy to operationalize. It differs from existing procedures by providing many productions of a limited number of target sounds in a way that approximates how speech is used in everyday life, that is, to communicate information and beliefs. If implemented, the protocol presented here can be used to determine whether the CSP procedure is indeed superior to a traditional LR procedure. Hopefully, a reader of this chapter will do just that. If the results obtained were to verify the superiority of the CSP procedure, a change in research practice would be warranted.
Although illustrated here for research with speakers of a single language, the CSP procedure can also be used with bilinguals. For example, a CSP procedure could be used to elicit productions of English and Italian /b/ and /p/ in separate, counterbalanced blocks each preceded by a short, language-specific practice. On each trial, participants would see three numbered pictures representing a semantic category (e.g., three different chairs, trees, or cats). The semantic category of interest would appear in writing above the three pictures presented on each trial. Participants would be asked to judge the representativeness of each picture as an instance of the category, saying in turn, for example: 2–Poorest example (or Pessimo esempio in the Italian block); 3–Better example (Bell’esempio); 1–Best example (Miglior esempio).
A CSP procedure is used to elicit multiple productions of specific target sounds when participants are focused on a communicative act rather than on their articulation of speech. When bilinguals are recruited for “categorization research” they will be told that the research examines the representativeness of various pictures as instances of well-known semantic categories. Examples will be provided in both languages. The same set of picture stimuli will be used to elicit vocal responses in both languages; only the response alternatives and the written names of the semantic categories (“gatto” versus “cat”) will change. The motive provided to participants for testing in two languages will be the desire to learn “if the responses you give depend on the language you are using.” Crucial information regarding language history and use will be collected only after elicitation of vocal responses in both languages has been completed.
The CSP procedure illustrated in this chapter provides an alternative to having participants read, name, or repeat meaningless lists of linguistic materials without providing a specific motivation for doing so. There will always be a need, of course, to acquire data in a laboratory setting, especially when more than just a microphone and a recording device are needed. However, the use of traditional elicitation procedures to acquire speech production data should not be an unreflective choice based on mere convenience. Whenever possible, the speech samples examined in production research should be characteristic of how individuals speak.
35.1 Introduction
Researchers working in the fields of bilingual phonetics and pronunciation teaching have approached L2 learning with distinct goals and methodologies. Bilingual phonetics has predominantly been concerned with the nature and extent of cross-linguistic interactions in individuals who grew up speaking two or more languages. Accordingly, work in this area has examined how factors like age of acquisition and language dominance regulate the degree of cross-linguistic influence in language-switching tasks (Goldrick, Runnqvist, & Costa, 2014) and in the perception and production of cognates (Amengual, 2016). Due to the need to control the phonetic, phonological, and lexical properties of the target words, studies are typically conducted in a controlled laboratory setting using highly structured tasks such as word and sentence reading.
Pronunciation teaching research also recognizes the inevitability of cross-linguistic interaction, insofar as L1 transfer influences L2 perception and production throughout the process of L2 learning (Trofimovich, Kennedy, & Foote, 2015). However, pronunciation teaching research is less concerned with the notion of ultimate attainment that is central to documenting the subtle forms of cross-linguistic influence that are evident in highly proficient bilinguals. Instead, L2 pronunciation teaching focuses on pedagogical interventions that can help learners improve their perception and production. For perception, this means that learners need to develop perceptual categories and processing routines that are accurate enough to allow for efficient word recognition, sentence processing, and recognition of speech acts. Likewise, production needs to be accurate enough for speakers to be intelligible and comprehensible (Levis, 2005). These two constructs, while related, are not identical. Even if the listener is able to understand the speaker (in that all words are identifiable), they may have to invest considerable effort to do so (Huensch & Nagle, 2021; Munro & Derwing, 1995; Nagle & Huensch, 2020). Thus, intelligible speech may show varying levels of comprehensibility. Finally, pronunciation teaching is fundamentally concerned with facilitating accurate perception and production in communicative and interactive contexts, which entails developing and sequencing tasks that scaffold learning from controlled to spontaneous contexts of L2 use (Saito & Plonsky, 2019).
In summary, then, bilingual phonetic research often implicitly aims to enhance theories of the bilingual mind by formulating and revising frameworks of cross-linguistic interaction, such as the Speech Learning Model (SLM; Flege, 1995; Flege & Bohn, 2021). While such frameworks are certainly relevant to pronunciation teaching, the end goal of pronunciation teaching is to help learners communicate more effectively in the L2. These two goals are not mutually exclusive but rather reflect different priorities that demand different research epistemologies and methodologies.
Despite these differences, the two fields converge on several points because, as Kroll and Bialystok (2013) argued, bilingualism is a multidimensional phenomenon, not a categorical variable. Thus, we believe that L2 learners, even in the early stages of L2 learning, are emergent bilinguals. In that sense, the findings of bilingual phonetics research are immediately applicable to L2 pronunciation teaching insofar as they help establish expectations about typical bilingual behavior. These expectations can in turn be used to empower learners and debunk commonly held beliefs, such as the view that monolingual-like attainment should be the target for L2 pronunciation learning. Bilingual phonetics research also implies that pronunciation teaching must help learners develop more accurate and more robust lexical representations. Training on individual sounds and contrasts may not be sufficient to help learners develop accurate phonolexical representations or update inaccurate ones. In Section 35.2, we provide a brief overview of the trajectory of pronunciation teaching before discussing principles for pronunciation teaching.
35.2 Historical Perspectives
Traditional views of teaching L2 pronunciation, dating back to the audiolingualism of the 1940s and earlier, assumed that the language learner is a monolingual who is learning a new language as an adult, and that the dominant influence on their pronunciation will be that of the L1 system. Thus, we often speak of a Spanish or Japanese accent in L2 English, or an American or French accent in L2 Spanish, or any of a hundred other combinations. The Contrastive Analysis Hypothesis (CAH) put forth by Lado (1957) and others made this influence of the L1 on the L2 explicit, not only for pronunciation but for grammar and other aspects of language form. In the CAH, a comparison of differences in the L1 and L2 systems was hypothesized to predict areas of difficulty in L2 learning. For example, Japanese learners of English may be predicted to have trouble with the distinction between /l/ and /r/ in English because Japanese has neither of these sounds in its phonological inventory. Additionally, Japanese does not distinguish /b/ from /v/, which from the CAH perspective would be another area of potential difficulty.
Yet, CAH predictions are not always borne out in the data. The CAH predicts errors that do not happen while failing to predict some errors that do occur. This fact ultimately led to the development of strong and weak versions of the CAH (Wardhaugh, 1970). In the strong version, differences between the L1 and the L2 systems determined the types of errors that would occur, while in the weak version the L1 system was expected to influence the types of errors that would occur in the learning of an L2. Furthermore, the CAH says relatively little about learnability, that is, about whether certain challenging contrasts are likely to improve over time as the L2 learner becomes a more experienced L2 user. For Japanese learners of English, /b/–/v/ tends to show such improvement while /l/–/r/ remains an intractable error that shows very little improvement with increased proficiency (Brown, 2000). Ultimately, despite its intuitive attractiveness and long history of applications to L2 pedagogy, the CAH never predicted L2 errors very well (Munro, 2018).
Other models of L2 phonetic learning have similarly sought to determine what types of features will pose the most difficulty for L2 speakers. The SLM (Flege, 1995; Flege & Bohn, 2021) predicts that cross-linguistically similar sounds – sounds that are similar but not phonetically identical in the L1 and the L2 – will be the most difficult for L2 learners to perceive and produce accurately, and the L2 Perceptual Assimilation Model (PAM; Chapter 7, this volume) sorts L2 sounds into assimilation categories based on their relationship to the L1. According to this model, single category assimilations, where both members of an L2 contrast are assimilated to a single L1 category and judged to be equal in terms of their goodness of fit, are predicted to be especially difficult to perceive. While these models of speech learning are useful for directing researchers to potential problem areas that are of theoretical interest, they do not capture the fact that not all pronunciation issues affect L2 communication in the same way. Thus, the perception and production difficulties that models predict may not lead to any real communication difficulties for L2 speakers. Furthermore, even if the models do succeed in predicting problems based on cross-linguistic L1–L2 comparisons, individual variation in perception and production may render group-level trends inapplicable because even speakers who share the same L1 background often have individual issues that are not shared by others with the same background (Munro, 2021).
Taking the perspective of bilingual learning of the pronunciation of an additional language means that we must take into account research on third language (L3) sound acquisition and the ways that the pronunciation systems of all previous languages interact in the learning of the new system. In fact, in many contexts, instructed language learners are effectively emergent multilinguals. In many parts of the world English instruction is mandatory, which means that individuals learn English and additional languages simultaneously. Despite this multilingual landscape, it is not yet clear how multiple phonetic and phonological systems interact in the multilingual brain. While Hammarberg (2001, p. 23) argued that “earlier rather than later secondary languages are more likely to be activated in L3,” other evidence indicates that later secondary languages (i.e., the L3) can influence the pronunciation of the earlier secondary languages (i.e., the L2). Cabrelli Amaro (2017, p. 699) suggested that a more “native-like L2 is more vulnerable to L3 influence than an L1.” The typological distance between the languages and the learner’s proficiency in each secondary language may also influence which language has the most notable influence (Llama, Cardoso, & Collins, 2010). However, Wrembel (2013) found that in some cases the L1 provides the dominant influence on the L2 and L3. In her study, Polish L1 speakers of L2 French and L3 English showed Polish influence on L3 accentedness. These findings all suggest that the influence of different phonological systems on each other is complex for bi/multilingual learners, but that all languages have the potential to influence the others depending on proficiency, language dominance, and typological relationships.
35.3 Goals for Pronunciation Teaching and Learning
Levis (2005) described two distinct principles that are relevant for pronunciation teaching and bilingual pronunciation learning: the Nativeness Principle and the Intelligibility Principle. The Nativeness Principle says that the goal of L2 pronunciation learning is to sound like a native speaker of the new language. The Intelligibility Principle, on the other hand, says that the goal of pronunciation learning is to communicate successfully and that pronunciation teaching is important insofar as it serves this purpose. It contradicts the central claim of the Nativeness Principle by holding that the goal of achieving native pronunciation is unimportant. There is evidence to suggest that teachers are increasingly aware of an intelligibility-based approach (Huensch, 2019; Nagle, Sachs, & Zárate-Sández, 2018), but in many contexts nativeness continues to be a powerful and attractive view of L2 pronunciation attainment among teachers and learners.
Why is native pronunciation unimportant? There are at least three reasons. First, it is unlikely to happen given that even learners who have learned an additional language at a very young age may show influence of the L1 (Flege, Yeni-Komshian, & Liu, 1999). We know that age is an important factor in learning to pronounce a new language and that those who begin learning the new language as adults will almost always have a foreign accent (Flege, Munro, & MacKay, 1995). Even adult bilinguals who can pass as native speakers are likely to do so only in limited contexts such as service encounters (Piller, 2002). We also know that the phonetic system of early bilinguals is not identical in all respects to that of monolinguals (Mack, 1989); nor should we expect it to be. As a result, it seems particularly unwise to use monolingual natives as the baseline for the pronunciation attainment of bilingual and multilingual individuals.
Second, achieving a native accent is unnecessary for communicative success. Bilinguals around the world successfully communicate in their additional languages despite the often-obvious presence of accented speech (at least as judged by a monolingual baseline). This happens because accentedness is only weakly related to intelligibility. There is compelling evidence that L2 speakers can be heard as both highly accented and completely intelligible (e.g., Huensch & Nagle, 2021; Munro & Derwing, 1995). As a result, because native-like pronunciation is both unlikely and unnecessary, the traditional view that pronunciation teaching should target a native accent is especially problematic for bilinguals. Certainly, L2 speakers must achieve a certain level of phonetic accuracy in order to produce intelligible and comprehensible speech, but that threshold is not native-like accuracy.
Finally, a goal of achieving native pronunciation is likely to lead teachers to waste instructional time on pronunciation features that are salient to native identity but have only minor effects on intelligibility and communicative success. For example, the pronunciation of interdental fricatives in English is rare among the world’s languages and is an especially salient marker of foreign accent for L2 learners of English (Huang & Evanini, 2016). In our experience, L2 learners want to learn these sounds, are very aware that their mispronunciation of these sounds outs them as non-native English speakers, and often ask to learn to pronounce these sounds above other less salient features. However, there is little evidence that pronouncing these sounds natively affects the intelligibility of L2 speech (Derwing & Munro, 2005; Jenkins, 2002), suggesting that giving excessive attention to such targets may indeed be a waste of limited instructional time for pronunciation.
In summary, traditional approaches to pronunciation teaching are based on several assumptions that are not relevant to bilingual learners. First, traditional approaches assume that learners are monolingual and that the L1 will be the primary influence on L2 production and perception. Although we know that any L1 influence is at best indirect (Munro, 2018), we still talk about French-accented English or Vietnamese-accented German to reflect general identifiable influences. When learners are bilingual or multilingual, however, the influence of one language on another becomes far more complex, and it is not unusual to notice that the L1 influences both the L2 and the L3, the L2 influences the L3, or even the L2 or the L3 influences the L1 (Cabrelli Amaro, 2017). Second, despite research calling for intelligibility-based pronunciation teaching (Levis, 2020; Munro & Derwing, 2020), traditional approaches emphasize nativeness as a goal. Learners of multiple languages have multiple phonological systems, none of which are necessarily a match for a monolingual. Not only is native-like pronunciation extremely unlikely, it is also only weakly related to being intelligible, the most important goal for pronunciation teaching and learning.
35.4 Current Perspectives
In this section, we outline ten goals for teaching pronunciation. We also show how they can be informed by bilingual phonetic research to develop an effective and empirically grounded pedagogical approach that is not based on monolingual assumptions.
35.4.1 Choose Targets That Are Likely to Impact Intelligibility and Comprehensibility
Research on teaching pronunciation has demonstrated that not all errors are created equal (Levis, 2018). Some learner errors are noticeable markers of foreign accent but are unlikely to affect whether listeners will understand a speaker. For example, the velarized allophone of /l/ in English, the so-called dark [ɫ] (e.g., full), rarely affects intelligibility (Jenkins, 2000) and is highly variable in its pronunciation, commonly being vocalized to [w] (e.g., [fuw]) or deleted altogether (e.g., foo). In contrast, not distinguishing words with /i/ and /ɪ/ (beat versus bit) or /ɛ/ and /æ/ vowels (said versus sad) has a much higher likelihood of causing misunderstanding because there are many more word pairs in English that are distinguished by these phonological differences. The differential importance of sound contrasts is explained by their functional load in a language. Contrasts with more minimal pairs have a higher functional load than contrasts with few minimal pairs (Brown, 1988; Munro, Derwing, & Thomson, 2015; Sewell, 2021), and errors in higher functional load contrasts make understanding more difficult (Munro & Derwing, 2006). This is especially true when words share the same part of speech and are therefore more easily confusable (Levis & Cortes, 2008).
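The idea of functional load as a count of minimal pairs can be made concrete with a small sketch. The toy lexicon and the ASCII stand-ins for phoneme symbols below are illustrative only; a real functional-load count would use a full pronouncing dictionary.

```python
from itertools import combinations

def minimal_pairs(lexicon, phone_a, phone_b):
    """Collect word pairs whose transcriptions differ only in phone_a vs phone_b.
    lexicon maps orthographic words to tuples of phoneme symbols."""
    pairs = []
    for (w1, t1), (w2, t2) in combinations(lexicon.items(), 2):
        if len(t1) != len(t2):
            continue
        diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
        if len(diffs) == 1 and set(diffs[0]) == {phone_a, phone_b}:
            pairs.append((w1, w2))
    return pairs

# Toy English lexicon in broad ASCII transcription
# ("I" = /ɪ/, "E" = /ɛ/, "ae" = /æ/, "U" = /ʊ/).
lex = {
    "beat": ("b", "i", "t"), "bit": ("b", "I", "t"),
    "seat": ("s", "i", "t"), "sit": ("s", "I", "t"),
    "full": ("f", "U", "l"), "fool": ("f", "u", "l"),
    "said": ("s", "E", "d"), "sad":  ("s", "ae", "d"),
}
high = minimal_pairs(lex, "i", "I")   # beat/bit, seat/sit
low = minimal_pairs(lex, "E", "ae")   # said/sad
```

Even in this tiny lexicon, the /i/–/ɪ/ contrast yields more minimal pairs than /ɛ/–/æ/, illustrating how pair counts over a dictionary would rank contrasts by functional load.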
A second example of targeting comprehensibility is that the suprasegmental features of stress, rhythm, and intonation are likely to make a larger difference in the production of spontaneous speech than are segmentals (Derwing, Munro, & Wiebe, 1998; Zhang & Yuan, 2020). Although there have long been proposals that suprasegmentals are likely to be more valuable for improvement in the short term (Gordon & Darcy, 2016; Levis & Muller Levis, 2018; McNerney & Mendelsohn, 1992), suprasegmentals are often neglected in pronunciation teaching because they are harder to describe and because teachers are often less comfortable teaching them (Couper, 2016; Foote, Holtby, & Derwing, 2011). But there is compelling evidence of their importance. In Derwing et al. (1998), for example, groups of English learners in Canada were taught using either a segmental focus or a suprasegmental focus. While both groups improved their comprehensibility in read speech, only the suprasegmental group improved their comprehensibility in spontaneous speech. These results have been replicated repeatedly (e.g., Gordon & Darcy, 2016, 2019; Zhang & Yuan, 2020), indicating that learners benefit from attention to suprasegmentals, though it bears mentioning that this work has focused nearly exclusively on L2 English and should be expanded to other L2s.
35.4.2 Provide Accurate and Accessible Descriptions
Pronunciation is often seen as a relatively abstract part of language learning because learners may not be accustomed to thinking about the structure of the sounds, rhythm, and melody of the L2. Thus, helping learners understand what they are trying to learn is an essential part of pronunciation teaching (Celce-Murcia et al., 2010), which for many teachers means providing learners with a description of L1–L2 articulatory differences for their different languages. While some bilingual learners may already have ways to understand the pronunciation of a new language, it is likely that most will benefit from accessible descriptions approached in varied ways. For example, teachers may wish to employ phonetic symbols to teach pronunciation, and there is evidence that this approach is helpful for learners (Fouz-González & Mompean, 2021). Other successful pronunciation teaching approaches have used orthography rather than phonetic symbols to connect written and spoken forms (Dickerson, 2015). However, there are options besides written symbols for helping phonetically untrained learners understand pronunciation features. For suprasegmentals, pedagogical approaches to English pronunciation have advocated the use of gadgets such as rubber bands or kazoos (Gilbert, 2014, 2019) to physically mimic differences in length or to remove segmental content from sentences and make pitch movement more noticeable. There are also movement-oriented approaches to teaching rhythm (Burri, 2023; Chan, 2018) and connections of pronunciation to body language that make pronunciation visible as well as auditory (Smotrova, 2017).
The emerging consensus from this line of work is that providing metalinguistic information is beneficial when integrated with other instructional strategies, such as form-focused instruction (Saito, 2013). Likewise, corrective feedback, which draws attention to pronunciation differences, thereby helping learners notice the gap between their pronunciation and the pronunciation target, has been shown to be beneficial for pronunciation learning (Martin & Sippel, 2021; Saito & Lyster, 2012a, 2012b).
35.4.3 Find Out the Social Reasons for Learning the New Language
Bilinguals are already experienced L2 learners, and learning an additional language often has social motivations. Why is someone learning a language, and who do they want to communicate with or in what contexts do they hope to use the language? Do learners have instrumental or integrative reasons for learning the language? We know that answers to these questions are powerful predictors of pronunciation success, but they are underexplored in pronunciation teaching (Levis, 2015; Levis & Moyer, 2014; McCrocklin & Link, 2016). In Lybeck (2002), American women married to Norwegians and living in Norway were found to produce the Norwegian /r/ (a salient marker of sounding Norwegian) more accurately if they were part of close-knit social networks, and less accurately if their networks were not close. In one particularly interesting example, a woman’s /r/ pronunciation, which was initially excellent, got worse over time as her social connections to the culture frayed and her marriage became less secure.
Learners may only be vaguely aware of social reasons for learning a particular language, or they may not have connected their desire to pronounce the language well with outside motivations for group membership. Raising social issues with learners can help them connect their pronunciation learning to their social goals by encouraging them to imagine their membership in possible communities (LeVelle & Levis, 2014). Providing communicative practice by connecting to learners in other countries, either through videochat or in immersion experiences, can also help to make social motivations for pronunciation learning more salient (Saito et al., 2021).
Once learners reach a certain proficiency, in terms of both their knowledge about pronunciation and their ability to control aspects of their pronunciation, they can make choices about how they want to sound in the L2. Some learners may choose to retain an L1 accent in the L2, whereas others may use regional pronunciation features to mask their L1 to the extent possible, thereby asserting their identity as legitimate L2 users (Müller, 2013). In one particularly salient case study, Marx (2002) described several stages of accent and identity formation in her own L2 German. For instance, upon arriving in Germany, she quickly adopted a French accent to avoid being perceived as American (Marx was Canadian). Over time, she began using regional pronunciation features to pass as a native speaker, and when she returned to Canada, she had a German accent in English, which eventually resolved itself as the German aspect of her identity faded in an L1 context. As Moyer (2017) has argued, pronunciation learning involves limits both on what most L2 speakers can attain and on the choices they can make. Pronunciation teaching must be sensitive to this interplay.
35.4.4 Build on Learners’ Previous Learning Experiences
Bilingual language learners, especially those who have consciously learned an L2 or L3, will come to the learning of an additional language with pronunciation skills already in place. They are likely to be aware of differences in the segmental and suprasegmental features between the languages they know and the new language, to categorize new sounds differently from monolingual learners, and to classify new sounds as being more or less like their other languages. They may also be more successful in the pronunciation of the new language (Kieseier, 2021). The pronunciation of their existing language systems will interact with the system of the new language such that all of their phonological systems will be in flux and subject to differing amounts of relative language use (Kartushina & Martin, 2019). It only makes sense, therefore, to build on their previous strategies toward pronunciation learning by asking them how they learn pronunciation best, why they think their approaches have been successful, and what types of pronunciation features they have focused on and why. The answers to these questions can help build a bridge between previous learning and the new language by recognizing the learners as multicompetent language users (Cook, 1999).
Possible differences in pronunciation learning strategies for monolinguals and bilinguals can be indirectly seen in other findings about pronunciation learning. One such hint is found in Miller's (2003) description of a Bosnian high-school student who had moved to Sweden as a refugee before her family moved to Australia. She transferred the spoken language strategies she had gained in learning Swedish, including those centered on suprasegmentals, to her learning of English, and as a result quickly became audible. (Audibility, in Miller's use of the term, is the ability to be heard by others and to use that audibility to develop an L2 identity and develop language skills more fully. Audibility, like intelligibility, is based partly on speaking the L2 in a way that others find [unconsciously] acceptable.) Another hint of bilingual strategies in pronunciation learning is found in Piller's (2002) research into the "passing" strategies of bilingual couples in Germany. Often, both halves of the couple enjoyed trying to pass as native speakers of their nondominant language, often by combining pronunciation knowledge with sociolinguistic discourse markers associated with a dialect rather than the standard version of the language. A final hint of the extra flexibility with which bilinguals learn pronunciation is illustrated in Kalmar ([2000] 2015), which describes the functional English pronunciation learning strategies of farm workers in a small town in the Midwestern United States. The workers were L2 speakers of Spanish and L1 speakers of Tarascan, an indigenous language. In interactions with Anglo residents of the town, the farm workers learned to pronounce English by using written Spanish to represent what they heard being said in English. For example, the phrase "Where do you live?" was written down as JUELLULIB, a representation that served its purpose in getting a young woman to respond to the question.
In other words, they created their own phonetic alphabet so they could learn to pronounce English. In pronunciation teaching terms, they wanted to be understood and devised a way to pronounce an unfamiliar language by representing it with their L2 writing system, helping them become both more intelligible and more comprehensible. All of these strategies show that bilingual learners bring their own knowledge and practices to pronunciation learning in a new language, and it only makes sense to draw out these practices in teaching pronunciation.
35.4.5 Provide Practice in Learning Strategies to Promote Learner Autonomy
Learners who improve their pronunciation, no matter how many other languages they know, are likely to show greater improvement if they develop a metacognitive approach to learning (Moyer, 2014). Such metacognitive approaches include the deliberate use of learning strategies. Sardegna (2022) argues, based on a series of studies she carried out on the effects of teaching pronunciation learning strategies, that improvement in controlled pronunciation is most evident for learners who explicitly set goals, engage in focused practice, reflect on their recorded performance, receive feedback, and update their goals over time. In her examination of improvement in three suprasegmental features (linking, word stress, and nuclear stress or prominence), she found that participants who learned strategies to practice their pronunciation showed dramatic improvement in controlled speech for all features, while those in a control group did not show any change. In delayed assessments, the experimental group showed minor backsliding but remained significantly improved over their pretest scores. Overall, the studies reported by Sardegna (2022) indicate a strong effect of strategy use that focuses on out-of-class practice through a particular approach, such as Covert Rehearsal (an approach focused on private practice of pronunciation forms; see Dickerson, 1994). Out-of-class practice, especially when it is carefully constructed to build on curricular goals, is attractive to teachers because it offers the opportunity to extend learning beyond the restricted opportunities in the classroom. Martin (2020), in a study of novice learners of German, demonstrated that homework-based pronunciation practice can even work as effectively as in-class practice in improving learner comprehensibility and perception of L2 sound categories.
Some bilinguals who have consciously learned an L2 before may already have their own sets of strategies (i.e., they may be so-called good language learners), but it is likely that most bilinguals can benefit from training in strategy use. Baker-Smemoe and Haslam (2013) found that learners in foreign language and second language contexts did not differ in their use of strategies, which suggests that context alone does not foster strategy use; explicit training in learning strategies is therefore important if learners are to take charge of their own learning. Although Sardegna (2022) did not clearly delineate the effects of pronunciation teaching with and without strategy use or compare her approach to strategies with other approaches, there are likely multiple ways to approach pronunciation learning strategies. Also, we do not know whether controlled practice with strategies results in improvement in spontaneous speech, although research by others (Derwing et al., 1998; Gordon & Darcy, 2016; Zhang & Yuan, 2020) suggests that interventions similar to Sardegna's focus on suprasegmental features may also lead to noticeable improvement in spontaneous speech and may promote learner autonomy (McCrocklin, 2019).
35.4.6 Use Technology to Facilitate Individualized Production Practice
Most bilingual learners will have individual pronunciation problems that need to be addressed; some will experience difficulty with a given sound, while others will not, and even when many learners experience difficulty with the same sound, some will start out more accurate than others, resulting in distinct individual starting points and rates of improvement. In that case, a one-size-fits-all approach will not be effective, but addressing individual needs in a classroom context is challenging (Munro, 2018). Computer-assisted pronunciation training can provide a solution. Research suggests that the use of visualization software can be a powerful means of drawing learners' attention to pronunciation differences and delivering instant, readily interpretable feedback that helps the user understand and begin to bridge the gap between their output and the target pronunciation (Ding et al., 2019). Because each learner is working with their own input, visual feedback paradigms can be highly individualized, making them effective for a wide range of learners.
Olson (2014a, 2014b) and Offerman and Olson (2016) demonstrated that Praat acoustic analysis software can be used to help learners improve their production of challenging L2 sounds. In their studies, they used a three-step Praat activity to train learners' production of voiced and voiceless stop consonants in L2 Spanish. In a follow-up analysis comparing the different approaches – short and long simultaneous (training all places of articulation at once) and long sequential (training each place of articulation separately) paradigms – Olson and Offerman (2020) found that the longer interventions consisting of three sessions were more beneficial than the short, one-shot intervention and, additionally, that the long sequential intervention seemed to slightly outperform the long simultaneous one. While it can be used for pronunciation training, Praat was designed for research purposes, which means that the interface and its functionality were not developed with language students in mind. Other visualization programs, however, have been created specifically for the purpose of pronunciation training.
Rehman (2021) examined the potential of Vowel Viewer, a program that plots vowel targets in real time in a two-dimensional space representing tongue height and backness. Model productions are also plotted so that the user can compare their production to the target and make adjustments accordingly. Participants who used the software produced vowels that were significantly more accurate at post-test, in terms of their alignment with the values produced by a native speaker group, but parallel gains were not observed in vowel intelligibility, as assessed by four expert judges. Nevertheless, these findings show that programs like Vowel Viewer hold significant potential, especially because learners can use them independently, outside of class, and without special training.
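The internals of tools like Vowel Viewer are not described in detail here, but the core comparison such programs visualize can be sketched: a learner's measured first and second formants (F1 and F2, acoustic proxies for tongue height and backness) are compared against model targets, and the nearest target is reported. The vowel labels and formant values below are illustrative assumptions, not normative targets; the Hz-to-Bark conversion is a common perceptual frequency warping.

```python
import math

# Hypothetical model targets (F1, F2 in Hz) for a few English vowels;
# the values are illustrative placeholders, not normative norms.
MODEL_TARGETS = {
    "i": (300, 2300),   # high front vowel
    "ae": (700, 1700),  # low front vowel
    "u": (350, 900),    # high back vowel
}

def bark(f_hz):
    """Convert a frequency in Hz to the Bark scale (a perceptual warping)."""
    return 26.81 * f_hz / (1960 + f_hz) - 0.53

def vowel_distance(learner, model):
    """Euclidean distance between two (F1, F2) points on the Bark scale."""
    l1, l2 = bark(learner[0]), bark(learner[1])
    m1, m2 = bark(model[0]), bark(model[1])
    return math.hypot(l1 - m1, l2 - m2)

def nearest_target(learner_formants):
    """Return the model vowel closest to the learner's production."""
    return min(MODEL_TARGETS,
               key=lambda v: vowel_distance(learner_formants, MODEL_TARGETS[v]))
```

A production measured at (320, 2200), for example, falls closest to the /i/ target, so feedback would point the learner toward that region of the vowel space.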
35.4.7 Provide Regular and Targeted Perception Practice
Accurate perception is a key ingredient of accurate production. Models of L2 pronunciation learning such as the SLM (Flege, 1995; Flege & Bohn, 2021) argue that accurate perception should lay the groundwork for accurate production. In their meta-analysis of perception training studies, Sakai and Moorman (2018) found that perception training leads to small but reliable gains in production, which means that pronunciation training will be even more successful if it includes perception practice.
One of the most important instructional techniques for helping learners improve their perception of challenging (i.e., imperceptible or barely perceptible) L2 contrasts is high variability phonetic training (HVPT). In this technique, the listener is exposed to stimuli spoken by many different talkers, which forces them to attend not to the specific characteristics of any talker but rather to the general cues along which the target sounds differ (Thomson, 2018). Because HVPT is computer-mediated, it offers highly individualized and adaptive practice. Thomson (2012) used HVPT to help L1 Mandarin speakers improve their vowel perception in English. Participants completed eight training sessions over three weeks. In each session, they heard vowel stimuli spoken by twenty native English talkers. Their task was to map the stimulus to one of ten nautical flags, which Thomson selected to avoid orthographic options that may have interfered with participants' learning. Results showed that learners' L2 vowel perception improved significantly over time for nearly all vowels, gains transferred to at least one novel (i.e., untrained) phonetic context, and performance remained high at delayed post-test (four weeks after training), which suggests that HVPT promotes perception learning that is both durable and generalizable. (For more recent uses of Thomson's technology, see www.englishaccentcoach.com.) In a more recent study, Fouz-González and Mompeán (2020) demonstrated that both phonetic symbols and keywords can be successfully used with an HVPT paradigm.
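The studies above do not publish their training software, but the basic HVPT design – identification trials drawn from many talkers, randomly interleaved, with immediate feedback – can be sketched in a few lines. The talker and vowel labels below are placeholders, not the actual stimuli from any of the studies cited.

```python
import random

# Twenty placeholder talkers and ten placeholder vowel categories,
# mirroring the scale of the paradigm described above.
TALKERS = [f"talker_{i:02d}" for i in range(20)]
VOWELS = ["i", "I", "e", "E", "ae", "a", "o", "U", "u", "^"]

def make_trials(seed=None):
    """Build one randomized session: every talker produces every vowel once,
    and trials are shuffled so no single voice dominates any stretch."""
    rng = random.Random(seed)
    trials = [(talker, vowel) for talker in TALKERS for vowel in VOWELS]
    rng.shuffle(trials)
    return trials

def run_trial(stimulus, response):
    """Score one identification trial and return the feedback a learner sees."""
    _, vowel = stimulus
    return "correct" if response == vowel else f"incorrect: that was /{vowel}/"
```

Because the trial list interleaves all twenty voices, a learner cannot succeed by tracking any one talker's voice quality and must instead attend to the cues that separate the vowel categories themselves.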
As Thomson (2018) pointed out in his overview of HVPT, there is no doubt that this technique is effective, but much remains unknown about how it can be optimized even further. How might altering the number and duration of sessions, or the number of talkers, affect learning? How might learner characteristics interact with the training? When Perrachione et al. (2011) compared various high and low variability conditions for training high and low aptitude learners to perceive three pitch contours (i.e., tonal contrasts), they found that the high variability blocked condition, which presented stimuli from each of the four speakers in separate blocks, was most beneficial for all learners, irrespective of aptitude. In other words, maintaining high overall variability while reducing trial-by-trial variability seemed to be the best option. More work is still needed on the training and learner characteristics (and their interactions) that result in optimal learning, retention, and generalization.
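The "high variability, blocked" design can be contrasted with the fully interleaved ordering sketched earlier: many talkers appear across the session, but each block contains only one talker, so trial-to-trial variability stays low. The following sketch uses placeholder talker and item labels.

```python
import random

def blocked_trials(talkers, items, seed=None):
    """Order trials so that overall talker variability is high but each
    block presents a single talker, keeping trial-to-trial variability low."""
    rng = random.Random(seed)
    block_order = list(talkers)
    rng.shuffle(block_order)        # randomize block order across the session
    trials = []
    for talker in block_order:      # one block per talker
        block = [(talker, item) for item in items]
        rng.shuffle(block)          # randomize items within the block
        trials.extend(block)
    return trials
```

With four talkers and three tonal contrasts, for instance, the session contains all twelve talker-item pairings, but a listener only ever switches voices at block boundaries.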
35.4.8 Provide Production Practice in Discourse and Communicative Exercises
Segmentals are important in learning pronunciation, but they are all too often taught at the sound or word level, leading to improvement that does not transfer to larger stretches of speech or to spontaneous speech. Simply moving beyond the word level to practice with sentences appears to lead to greater generalization of improvement. Hirata (2004) found that learners of Japanese transferred improvement to both the word and sentence level when taught pitch accents and phonemic lengthening in sentences. If they were taught only at the word level, however, they showed no improvement at the sentence level. Providing practice and instruction at the sentence or discourse level also leads to spillover improvement of nontargeted features. Hardison (2004) taught intonation to L1 English learners of French using directed listening, humming, production, and computer visualization of varied French intonation contours. Learners internalized the contours so well that they could repeat sentences even when hearing only the hummed sentence. They also began to notice nontargeted features of French pronunciation such as liaison.
Compared with segmentals, suprasegmental features are more likely to lead to improvement that will be noticeable, and even a short-term focus on suprasegmentals may result in enhanced communicative abilities (McNerney & Mendelsohn, 1992). In one study looking at the teaching of contrastive focus in English (e.g., "the RED house has a LARGE door, but the GREEN one has a SMALL one"), Levis and Muller Levis (2018) found that intermediate learners improved their comprehensibility after six 25-minute sessions over three weeks of instruction. The instruction focused on using contrastive focus in spontaneous production using picture comparisons rather than reading aloud. In another approach to teaching pronunciation at the discourse level, Goodwin (2005) proposed using short clips from sitcoms or other dramatic interactions that included two speakers. Learners used the script (perhaps even transcribing it themselves in normal spelling) to mark features of how speakers interact – phrase breaks, word stress, syllable lengthening, connected speech phenomena such as linking and assimilation, pitch movement on emphasized words or at phrase boundaries, and nonverbals – practicing and shadowing each feature as they noticed what was happening. Foote and McDonough (2017) tested whether such shadowing practice led to improvement in directed learning outside the classroom. They provided learners with iPods over eight weeks, with a different sitcom and transcript for each week and instructions for how to practice. Not only did the learners enjoy the practice but they also increased their comprehensibility and fluency in spontaneous speech.
35.4.9 Integrate Pronunciation with Other Aspects of Language
Pronunciation can be taught alone, but that probably is not the best way to teach it. Pronunciation is a servant. Teaching it without reference to other aspects of instruction means finding dedicated space in already full classes and asking learners to make connections to other skills that the teacher is better able to make. Integrating pronunciation into the teaching of other skills, on the other hand, makes obvious connections to vocabulary learning, listening comprehension, and speaking and conversation skills (Jones, 2016; Levis & Grant, 2003). Integration will look different in different classes, but it is always possible (Levis & Echelberger, 2022). Finally, integrating pronunciation into the teaching of other skills means that pronunciation is much more likely to be taught and not simply reacted to in an ad hoc manner. Darcy, Rocca, and Hancock (2021) examined a corpus of 110 hours of instruction in which teachers took three different approaches to integration: ad hoc, included but not integrated, and communicatively integrated. The ad hoc approach addressed pronunciation in a majority of classes but spent only a small amount of time on it, whereas the included group spent a relatively large amount of time on pronunciation, though not as much as the communicative group. Planning pronunciation work into the class meant that it was actually taught; intending to treat pronunciation when it came up meant that it was rarely addressed.
A significant body of work has explored how pronunciation can be taught within a communicative framework. Early work looked at how to make highly controlled pronunciation exercises more communicative (Celce-Murcia, 1983) by connecting pronunciation to other aspects of language (Morley, 1991). More recent work has related pronunciation teaching to second language acquisition models. Saito and Lyster (2012a) found that teaching using a focus-on-form approach (in which pronunciation practice includes attention to both form and meaning) was more effective than focusing on form alone, the traditional way to approach pronunciation teaching. In other recent work, task-based pronunciation teaching has been shown to be promising (Gurzynski-Weiss, Long, & Solon, 2017; Mora & Levkina, 2017; Mora-Plaza, Mora, & Gilabert, 2018). Ultimately, integrating pronunciation into the teaching of other language skills (or integrating the teaching of other language skills into pronunciation) is more likely to promote transfer of pronunciation improvement into unplanned communication, resulting in the automatization of intelligible and comprehensible L2 pronunciation patterns.
35.4.10 Engage in Meaningful Pronunciation Assessment
Teachers need to provide students with structured but accessible feedback on their pronunciation so that they are aware of their pronunciation issues and can work to address them. Structured assessment of learners is also beneficial for teachers because it provides insight into how learners are progressing and whether the instructional approach is working as intended. Entire volumes have been dedicated to this topic (e.g., Isaacs & Trofimovich, 2017), so here we touch upon only a few basic points.
Adequate pronunciation assessment rests on an understanding of the factors that are likely to influence intelligibility and comprehensibility in the L2. For example, word stress seems to be a particularly good predictor of comprehensibility in English (Isaacs & Trofimovich, 2012; Trofimovich & Isaacs, 2012) and, at least compared to other languages, the functional load of English sounds and its relationship to intelligibility and comprehensibility are relatively well understood (Munro & Derwing, 2006). Teachers therefore have a strong empirical basis for assessing high-priority pronunciation features in L2 English. Empirical evidence for other languages is beginning to emerge (Isbell, 2021), but this work is still in its infancy.
Once appropriate targets are identified, rubrics must be developed to convey that information to learners. Isaacs, Trofimovich, and Foote (2017) designed and validated a user-friendly five-point comprehensibility scale that takes into account many of the pronunciation features known to affect comprehensibility in L2 English. Once work accumulates on pronunciation, intelligibility, and comprehensibility in other languages, similar scales can be developed and deployed. In their chapter on classroom-based pronunciation assessment, Isbell and Sakai (2022) summarized various assessment types, ranging from diagnostic assessments, often carried out at the beginning of a course to offer learners individualized feedback and provide a blueprint for moving forward, to performance and self-assessment, situating each type relative to several continua (e.g., summative-formative, holistic-analytic). They also provided sample activities and a suite of assessment resources that teachers can adapt to fit their own needs. Ultimately, as they argued, classroom-based pronunciation assessment "entails distinct principles and priorities allowing for a much more diverse range of assessment practices" (Isbell & Sakai, 2022, p. 4).
35.5 Future Directions
It is our view that all research dealing with the pronunciation of a second or additional language is relevant to pronunciation teaching, although connections between pedagogy and theoretically oriented work are not always immediately apparent. One of the strongest contributions of bilingual research to pronunciation teaching rests in its capacity to shed light on the types of cross-linguistic interactions that are likely to take place in bilingual speakers, which can help researchers and teachers alike understand and develop realistic expectations in their respective contexts. One obvious implication of bilingual research is the need to reject the view that bilinguals should behave like monolingual speakers. We agree with Sakai (2018), who argued for a bilingual baseline in L2 phonetic research, and further assert that such a baseline is necessary for pronunciation teaching. We would also take the argument one step further. Many learners are themselves bilingual or multilingual users who are learning an additional language, and many traditional language students are now opting to learn an L3 and beyond. Reenvisioning pronunciation teaching in a multilingual world means recognizing the knowledge and skills that learners bring to the table and resetting expectations about target varieties and norms. In a multilingual context, appropriate pronunciation benchmarks can emerge only after careful consideration of who the learners are and with whom they plan to interact. Throughout this chapter, we have made the case for principles that should guide pronunciation teaching to bilingual learners. These principles are certainly broadly applicable, but if they are to be optimized, they must be adapted to local teaching contexts. Overall, then, we argue for a consistent, empirically based approach that is flexible enough to account for the learner and contextual differences that teachers often encounter in the classroom.
In terms of phonetic research, we would like to draw attention to two areas that could be particularly fruitful for pronunciation teaching. First, the phonetic basis of intelligibility and comprehensibility is not yet well understood. How accurate do learners need to be on a given target structure in order to be intelligible and comprehensible? Is phonetic accuracy more important for some structures than others? The answers to these questions can help bridge phonetically oriented approaches and intelligibility-based frameworks in the classroom. Second, bilingual and multilingual individuals do not perceive and produce contrasts in the same way that monolingual speakers do. Bilinguals may perceive and produce intelligible contrasts using the same phonetic cues as monolinguals, albeit weighted slightly differently, or they may rely on different phonetic cues altogether. For instance, Japanese speakers seem to be able to produce an intelligible English /r/ using a combination of duration and spectral cues even though native English speakers rely more on information from the third formant (F3), which correlates with lip-rounding (Saito & van Poeteren, 2017). Thus, an important obstacle that future research and teaching will need to surmount is how to move beyond monolingual standards, addressing the complex ways in which bilingual and multilingual individuals perceive and produce the additional languages they speak.