7.1 Introduction
Phonetic and phonological characteristics vary greatly across the languages of the world. Infants do not come pre-adapted to the specific language environment they are born into, so they must tune in to the characteristics of native speech during language acquisition. Adults’ speech perception reflects their lifetime of using their specific native language(s), but it is particularly strongly grounded in language experience during infancy and early childhood. While this native-language (L1) attunement is beneficial for recognizing words in the L1, it often results in difficulty hearing the difference between words in a foreign language that are distinguished solely by a consonant or a vowel distinction that does not occur in their own language(s) (i.e., a non-native contrast). For instance, German lacks the initial consonant /θ/ in the English word “think” as well as the vowel /æ/ in “bat”; thus, for native German listeners the English minimal-word pairs “sink”–“think” and “bat”–“bet” would both involve a non-native speech contrast.
The Perceptual Assimilation Model (PAM; e.g., Best, 1995) was devised to account for how adults’ speech perception has been shaped by their language experience from birth. In its original form, the model focused on monolingual adults and on infants acquiring a native language in a monolingual setting. This was not because monolinguals are more interesting or important to study than bilinguals (they actually might be less interesting than bilinguals in certain respects), but because the task of accounting for how language experience shapes speech perception is already complex when an adult has learned or an infant is learning only one language, let alone when two or more languages are involved.
In this chapter, we will sketch out how speech perception and phonological category acquisition are shaped by early acquisition of more than one language, that is, from infancy or in early childhood, and how they might be accounted for within a PAM framework. We refer to such adults as “early bilinguals.” Our discussion will focus mainly on PAM, but we will touch on the Perceptual Assimilation Model of L2 Speech Learning (PAM-L2; Best & Tyler, 2007) at the end of the chapter. PAM-L2 extends PAM principles to account for L1 perceptual effects on late second language (L2) acquisition by adults, but aspects of the model are also relevant to early bilinguals. For late L2 acquisition, where learning of a second language begins in adolescence or adulthood, we refer readers to the PAM-L2 model presented in Best and Tyler (2007) and updated by Tyler (2019) and Tyler, Ball, and Best (2024).
Before outlining how PAM might account for phonological attunement in early bilinguals, we first need to consider key findings and concepts. In Section 7.2, we will provide an overview of PAM as it is currently described for monolinguals. An understanding of how PAM might account for language-attuned speech perception in adult early bilinguals also requires an appreciation of the experiential foundation of that attunement: the development of speech perception in infants being raised monolingually versus bilingually. That issue will be covered in Section 7.3. We will then review, in Section 7.4, the empirical findings on early bilingual speech perception that need to be accounted for by any model. In Section 7.5, we will turn to how PAM might account for speech perception in early bilingual adults, before making some suggestions for future research directions in Section 7.6.
7.2 PAM and Monolinguals
To address bilingual speech perception from a PAM perspective, we need to first understand how PAM accounts for language attunement in monolinguals. The clearest evidence for the influence of L1 attunement on speech perception comes from studies in which perception of phonemes from an unfamiliar non-native language is compared across groups of monolinguals with different L1 backgrounds. This type of research is referred to as cross-language speech perception (Bohn, 2017). For example, monolingual English and Greek participants showed different patterns of discrimination accuracy for initial stop voicing contrasts from Ma’di, a language spoken in Uganda and South Sudan (Antoniou, Best, & Tyler, 2013). Both groups had difficulty telling apart the Ma’di contrast between voiced plosive versus implosive coronal (tongue tip) stops /da/–/ɗa/. The Ma’di plosive has a stable larynx position and egressive airflow, whereas the implosive has a larynx-lowering gesture that causes ingressive airflow, a distinction that neither group had experienced in their L1. The English monolinguals also performed poorly on the Ma’di voiced–voiceless plosive /da/–/ta/ contrast, but the Greek monolinguals discriminated it accurately. Ma’di, English, and Greek all have phonologically equivalent voiced and voiceless stops (in this study, /d/ versus /t/, as in English <donor>–<toner> and Greek <ντολμάς> [dolmas] “stuffed cabbage roll” – <τολμάς> [tolmas] “you dare”), but their phonetic realizations are different. The most important phonetic difference is voice onset time (VOT), the time between the release of the tongue tip constriction and the onset of vocal fold vibration in stop consonant phonological contrasts such as English /d/–/t/, /b/–/p/, and /ɡ/–/k/. With respect to the Antoniou et al.
(2013) study, both Ma’di and Greek /d/ are phonetically prevoiced [d] (i.e., with a negative VOT, indicating that the vocal folds begin vibrating well before the constriction release), while /t/ is phonetically voiceless unaspirated [t] (i.e., with a slight positive VOT, indicating that voicing begins slightly after the release). Thus, the Ma’di contrast is native-like for Greek listeners. In contrast, when English /d/ and /t/ occur in stressed syllable onset position, as the stimuli did in Antoniou et al. (2013), English /d/ has a slight positive VOT (unaspirated [t]) that is phonetically quite similar to Ma’di and Greek /t/, whereas English /t/ is phonetically unlike Greek and Ma’di /t/, with a notably longer positive VOT that results in noisy aspiration (i.e., long-lag aspirated [tʰ]). Thus, both Ma’di /d/ and /t/ are phonetically similar to English /d/, but neither is phonetically similar to English /t/. Based on those observations, discrimination of the non-native Ma’di consonant contrasts was consistent with each monolingual group’s prior experience with the phonetic VOT realizations of their L1 /d/ and/or /t/. It is findings such as these that PAM was devised to explain.
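The VOT relationships described above can be sketched schematically. This is an illustrative sketch only: the boundary values (in milliseconds) are rough, hypothetical approximations, and the per-language VOT values are invented stand-ins rather than measurements from the studies cited.

```python
def vot_category(vot_ms: float) -> str:
    """Map a voice onset time (ms) to a broad phonetic category.

    Thresholds are illustrative approximations, not measured values:
    negative VOT = prevoiced; small positive lag = unaspirated;
    long positive lag = aspirated.
    """
    if vot_ms < 0:
        return "prevoiced"            # voicing begins before the release
    if vot_ms < 30:
        return "short-lag unaspirated"  # voicing begins slightly after release
    return "long-lag aspirated"       # noisy aspiration after release

# Hypothetical VOT values for the stops discussed in the text:
stops = {
    "Ma'di /d/": -90, "Ma'di /t/": 15,
    "Greek /d/": -100, "Greek /t/": 10,
    "English /d/": 15, "English /t/": 70,
}
for label, vot in stops.items():
    print(f"{label}: {vot_category(vot)}")
```

Run on these invented values, the sketch shows why Ma’di /t/ and Greek /t/ land in the same short-lag region as English /d/, while English /t/ sits alone in the long-lag aspirated region.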
PAM is grounded on the premise that the objects of speech perception are temporally and spatially coordinated patterns of constrictions along the vocal tract, produced by its articulatory organs (lips, tongue tip, tongue body, tongue root, velum, and larynx), which dynamically shape the vocal tract over time during speech in distinct and perceptually discernible ways (Best, 1995; Fowler, 1986; Studdert-Kennedy & Goldstein, 2003). The multimodal output (acoustic signal, facial movements, airflow) of these temporally coordinated vocal tract constriction events provides evidence of the originating articulatory patterns that perceivers detect in speech. For example, English /d/ is typically produced by making a complete closure between the tongue tip and the alveolar ridge, with vocal fold vibration starting around the same time as, or just after, closure release when it occurs in word- or syllable-initial position (i.e., phonetically, it is an unaspirated short-lag VOT [t]). However, the precise articulatory-phonetic details of these articulatory gestures differ by phonetic context, as well as by talker and accent. The resulting variability in constriction location and degree, and in the relative timing of constrictions, is systematic. For example, when English /d/ occurs word-internally between vowels, as in “widen,” it is more likely to be prevoiced (phonetically, [d]), and if the following consonant involves a constriction at the teeth, as for the fricative [θ] in the word “width,” its constriction location is more likely to be at the upper front teeth, that is, dental (phonetically, [d̪]).
The systematic relationships among variants of a given consonant or vowel, that is, a phonological segment, are delineated by the articulatory properties that are crucial to that segment’s phonological identity, which we refer to as its phonological category. The core articulatory properties that are shared by the many variants of a phonological category are its higher-order invariants, which are abstracted across the many disparate variants, and which set that phoneme apart from all other phonemes in a given language. To give an English lexical example of the difference between higher- and lower-order invariants, the word “baby” is pronounced differently by speakers of different regional English accents (e.g., as something akin to BAY-bee by Americans, BYE-bee by Australians, or BEE(-a)-bee by Jamaicans). Despite those accent differences, which provide listeners with useful lower-order phonetic information about where the speaker grew up, the three accented forms nonetheless share the higher-order lexical invariant of being the same word, “baby,” which is distinct from all other similar-sounding words in English, for example, from “bobby,” “birdy,” and “buggy.”
Each phonological category contrasts with all other phonological categories in that language. This contrastiveness can signal meaningful differences between words in the language. Contrastive differences among phonological segments are what we refer to as phonological distinctiveness (Best et al., 2009). For example, in English the voiceless alveolar fricative /s/ in “sing” differs from the voiceless dental fricative /θ/ in “thing” by a single gestural difference: the location of the tongue tip constriction for producing the fricative, at the alveolar ridge (/s/) versus the upper front teeth (/θ/). The invariants must also be of sufficiently high order, that is, abstract enough, for word recognition to remain robust across context, talker, and accent variations. That is, they must display phonological constancy (Best et al., 2009).
Tuning in to the higher-order invariants of native phonological categories supports rapid and efficient recognition of spoken words. This, in turn, equips native listeners with the flexibility to adapt to phonetic variability across talkers and accents, and to any long-term shifts in their own regional accent. This natively tuned perception leads listeners to automatically perceive speech from any language using the perceptual efficiencies that they developed to support L1 speech perception. In other words, they perceptually assimilate speech to their L1 phonological system.
The effect of L1 attunement on perception can be inferred by observing how listeners categorize and discriminate unfamiliar, minimally contrasting phonemes from a never-before-heard non-native language (i.e., cross-language speech perception, which is distinct from second-language speech perception – perception of a language that the listener is actively learning). If a listener detects higher-order invariant information that is consistent with a single L1 phonological category, then the non-native phoneme is perceptually assimilated as categorized, although its goodness of fit can range from ideal to notably deviant. For example, the monolingual Greek listeners in Antoniou et al. (2013) perceived Ma’di alveolar /d/ as an acceptable exemplar of Greek /d/. They also perceived Ma’di implosive /ɗ/ as Greek /d/, but with a poorer goodness of fit. A non-native phoneme is uncategorized if it is perceived as speech, but not as consistent with any single L1 phonological category, either because it is not primarily similar to any particular L1 category or because the listener perceives only weak similarity to one or to more than one L1 category (for more on uncategorized assimilations, see Faris, Best, & Tyler, 2016, 2018; Tyler, 2021b). Finally, some non-native phonemes may be so unlike any L1 phonological category that they are not perceived as speech, and thus are not assimilated as possible L1 phonological elements. This latter possibility arose from the observation that English monolingual adults often reported hearing nonspeech events (e.g., a clap, a finger snap, a twig cracking) when asked to categorize Zulu click consonants (Best, McRoberts, & Sithole, 1988).
Importantly, native listeners of Zulu and Sesotho (both click languages) predominantly perceived non-native !Xóõ click consonants as speech, in contrast to native English listeners, who perceived them mostly as nonspeech or as containing a nonspeech element (Best et al., 2003).
So far, we have outlined the various ways in which a single unfamiliar phoneme from a never-experienced language might be perceived after perception has become natively tuned. PAM also provides general predictions for discrimination accuracy on non-native speech contrasts based on how the contrasting pairs of non-native phonemes are categorized and rated (see Tyler [2021b] for a discussion of the different types of information available for discriminating non-native segments). Let us consider, first, cases where both non-native phonemes are categorized. Either the two contrasting non-native phonemes will be perceived as instances of different L1 phonological categories, or both will be perceived as instances of the same L1 phonological category. When they are perceived as different L1 categories, it is a two-category assimilation. Discrimination of two-category assimilations should be excellent because the listener has detected higher-order invariants in each of the non-native phonemes that would support efficient perception of an L1 contrast (i.e., due to language-specific attunement rather than the universal discriminability of the contrast). When both non-native phonemes are perceived as instances of the same L1 phonological category, this means that the listener has detected the same higher-order invariants in both non-native phonemes. The extent to which the listener can perceive a difference between the non-native phonemes depends on the perceived goodness of fit of each one to the L1 category. If the phonetic differences between the non-native phonemes fall within the range of variability that would be expected for the single L1 category (i.e., they display phonological constancy in the L1), then both non-native phonemes will be perceived as having a good fit to the same L1 phoneme.
For this single-category type of assimilation, the perception of higher-order invariants that facilitate L1 perception has a detrimental effect on discrimination, to the extent that non-native listeners may not detect any difference between the contrasting non-native phonemes. For example, in Antoniou et al. (2013), English listeners categorized both Ma’di plosive /d/ and implosive /ɗ/ as English /d/. There was no significant difference in their goodness of fit, so /d/–/ɗ/ was a single-category contrast for these listeners. Their discrimination accuracy was 58 percent (chance was 50 percent, so not much above chance). A category-goodness assimilation occurs when both non-native phonemes are assimilated to the same L1 phonological category but there is a perceived difference in goodness of fit. Discrimination accuracy is predicted to fall between that of a two-category and a single-category assimilation. For example, English listeners assimilated both the Zulu voiceless aspirated velar stop /kʰ/ and the velar ejective stop /k’/ to English /k/ (phonetically [kʰ]), but they perceived /k’/ as a notably poorer exemplar than /kʰ/ (Best, McRoberts, & Goodell, 2001). They discriminated the contrast well (~80 percent), though well below ceiling level. Thus, the clear prediction for discrimination accuracy across these three assimilation types is two-category > category-goodness > single-category.
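The decision logic for contrasts in which both non-native phones are categorized can be summarized as a small rule. This is an illustrative sketch rather than part of PAM’s formal statement: the category labels, the goodness ratings, and the threshold for treating two goodness ratings as “equivalent” are all hypothetical.

```python
def contrast_type(cat_a, good_a, cat_b, good_b, threshold=1.0):
    """Classify a non-native contrast when both phones are categorized.

    cat_a, cat_b   : the L1 category each non-native phone is assimilated to
    good_a, good_b : perceived goodness of fit (e.g., a rating on a 1-7 scale)
    threshold      : arbitrary cutoff for "no perceived goodness difference"
    """
    if cat_a != cat_b:
        return "two-category"        # predicted: excellent discrimination
    if abs(good_a - good_b) < threshold:
        return "single-category"     # predicted: poor discrimination
    return "category-goodness"       # predicted: intermediate discrimination

# Ma'di /d/ and /ɗ/ both assimilated to English /d/ with similar goodness
# (ratings invented for illustration):
print(contrast_type("d", 5.0, "d", 4.8))   # single-category

# Zulu /kʰ/ a good English /k/, /k'/ a deviant one (ratings invented):
print(contrast_type("k", 6.0, "k", 3.0))   # category-goodness
```

The predicted discrimination ordering then follows directly: two-category > category-goodness > single-category.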
When at least one non-native phoneme in a contrast is perceived as a phoneme but not as a clear exemplar of any specific L1 phonological category (i.e., it is uncategorized), discrimination accuracy depends on the extent to which the listener perceives overlapping L1 phonological information between the contrasting non-native phonemes. For example, Australian English listeners categorized the Danish vowel [œ] as their L1 [ɜː] (the Australian vowel in “bird”), but Danish [o] was uncategorized, with listeners perceiving phonological characteristics that were weakly consistent with L1 Australian English [ʊ] (“foot”), [ɔ] (“lot”), [oː] (“north”), and [əʉ] (“goat”) (Faris et al., 2018; see Cox and Palethorpe [2007] for phonetic descriptions of Australian vowels). Discrimination accuracy for this uncategorized-categorized [o]–[œ] contrast was high (85 percent) because there was no phonological overlap between the sets of L1 vowel categories the English listeners perceived in the two non-native Danish vowels. In contrast, Danish [ø] was perceived as weakly consistent with L1 [ʊ], [ɜː], and [ʉː] (as in “goose”). The uncategorized-categorized [ø]–[œ] contrast was partially overlapping because the L1 [ɜː] vowel (to which [œ] was categorized) was among the set of English vowels perceived in Danish [ø]. Discrimination accuracy was poor for [ø]–[œ] (57 percent). The general prediction for discrimination accuracy on contrasts involving at least one uncategorized non-native phoneme is nonoverlapping > partially overlapping > completely overlapping (Bohn et al., 2011; Chen, Antoniou, & Best, 2023; Faris et al., 2018).
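One simple way to operationalize this overlap logic is as set intersection over the L1 categories perceived in each non-native phone. The sketch below follows the Danish-vowel example described above; treating “completely overlapping” as identical category sets is our own simplifying assumption, not a claim about how the cited studies scored overlap.

```python
def overlap_type(l1_cats_a: set, l1_cats_b: set) -> str:
    """Classify overlap between the L1 category sets perceived in two phones."""
    shared = l1_cats_a & l1_cats_b
    if not shared:
        return "nonoverlapping"          # predicted: good discrimination
    if l1_cats_a == l1_cats_b:
        return "completely overlapping"  # predicted: poor discrimination
    return "partially overlapping"       # predicted: intermediate-to-poor

# Danish [œ] was categorized as Australian English [ɜː]; Danish [o] was
# uncategorized across [ʊ], [ɔ], [oː], [əʉ] -- no shared category:
print(overlap_type({"ɜː"}, {"ʊ", "ɔ", "oː", "əʉ"}))   # nonoverlapping

# Danish [ø] was weakly consistent with [ʊ], [ɜː], [ʉː], which includes
# the [ɜː] perceived in [œ]:
print(overlap_type({"ɜː"}, {"ʊ", "ɜː", "ʉː"}))        # partially overlapping
```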
Finally, if both non-native phonemes in a contrast are nonassimilable (e.g., English listeners’ perception of Zulu clicks; Best et al., 1988), then discrimination should be good to excellent, depending on how each is perceived as a nonspeech sound (Best et al., 2003; for further discussion, see Tyler, 2021b).
In summary, as a result of perceptual attunement to their native language, adult monolinguals have rapid and efficient perception of phonological contrasts in the L1. However, natively tuned perception is automatic, which may lead to inaccurate perception of unfamiliar phonological contrasts in a non-native language. To consider how PAM might account for early bilinguals’ speech perception, however, we first need to go back to the beginnings of language-specific perceptual attunement in infancy.
7.3 Back to the Future: First Steps in Infancy
7.3.1 Development of Speech Perception in Monolingual Infants
Most early speech perception studies have examined monolingual infants, that is, infants with little or no exposure to other languages. This research has overwhelmingly found effects of L1 experience on vowel and consonant perception in the second half-year of life (for reviews, see, e.g., Best, 2017; Best et al., 2016; Werker, Yeung, & Yoshida, 2012). Before then, infants discriminate nearly all segmental phonetic contrasts of the world’s languages, including those lacking in their L1 environment. Thus, monolingual infants under six months do not yet appear to be perceptually attuned to native phonetic contrasts. But discrimination has declined by six months for most non-native vowel contrasts, and by nine to ten months for most non-native consonant contrasts examined. Most native contrasts continue to be discriminated well throughout that period. This general pattern of declining discrimination for most non-native contrasts but maintenance for most native contrasts is widely accepted and interpreted as evidence for perceptual attunement to native speech.
However, different developmental trajectories have been reported in monolingual infants for some consonant contrasts (see Best, 2017; Best et al., 2016; Liu, Peter, & Tyler, 2023; Tyler et al., 2014). Discrimination of some native contrasts is only fair rather than good initially, improving during the second half-year (Kuhl et al., 2006), while for others it is initially poor and shows even more delayed improvement with L1 experience, not evident until four years of age (Polka, Colantonio, & Sundara, 2001). Conversely, initially good discrimination of some non-native contrasts is maintained across infancy and through to adulthood, rather than showing a decline (e.g., Best et al., 1988; Best & McRoberts, 2003). The disparate developmental patterns for these contrasts appear largely compatible with PAM principles. For example, the Tigrinya ejectives /p’/–/t’/ that English-speaking adults assimilate to native stops /p/–/t/ ([pʰ]–[tʰ], a two-category assimilation) and the Zulu click contrasts that they perceive as nonspeech sounds (nonassimilable) are both discriminated well by English-learning infants from early on and show no decline (Best, 1988; Best et al., 1995; see Best et al., 2016). Meanwhile, the Zulu stop-ejective contrast /k/–/k’/ that English monolingual adults assimilated as good versus deviant exemplars of English /k/ ([kʰ]), and discriminated well (category-goodness assimilation), is discriminated at six to eight months but not at ten to twelve months (Best & McRoberts, 2003).
By tuning in to the higher-order invariant properties of phonological categories, infant speech perception begins to specialize for efficient recognition and learning of meaningful words in the L1. The emergence of phonological constancy and phonological distinctiveness in the second year coincides with the vocabulary spurt, a period of rapid word learning that occurs once an infant’s expressive vocabulary (i.e., the words they can produce) reaches around fifty words (Nazzi & Bertoncini, 2003). For monolingual infants, this occurs at around seventeen to eighteen months of age. We suggest that it is the discovery of phonological constancy and distinctiveness across phonetically variable productions of words (e.g., varying talkers, emotions, accents) that supports this rapid vocabulary expansion (Best, 2015; Best et al., 2009).
7.3.2 Development of Speech Perception in Bilingual Infants
But how does an infant being raised bilingually optimally attune to the phonologies of two languages? It is tempting to ask simply whether bilingual infants show the same trajectories for their multiple languages as monolinguals show for their single L1, which has been the focus of such research to date. But that question implicitly assumes a clear categorical distinction between monolingual and bilingual infants, which is overly simplistic given the multifaceted differences in language learning contexts between monolingual and bilingual children, as well as among individuals on each side of what is actually a rather fuzzy dichotomy (e.g., Johnson, 2018). Even the speech heard on one or the other side of this blurry divide shows individual variation in quantity and quality, in the number of people speaking the language(s), in their accents (native regional variants and/or L2), and in sociocultural and family differences in spoken communication (Gathercole, 2014; Johnson, 2018).
We also must consider whether bilingual infants acquire their languages simultaneously or sequentially, a fundamental distinction in adult bilingual research and theory. Simultaneous bilingual adults have acquired both languages “from infancy” (two L1s), whereas sequential bilinguals learned one language from birth (their L1) and the other later (their L2), either during the preschool years (early-sequential bilinguals), later in childhood, or in adolescence/adulthood (late L2 learners). However, these definitions fail to capture the richly varied reality of infant bilingual contexts. While some do learn their languages simultaneously “from birth,” even these children’s circumstances can differ. For example, each parent may speak a different language to them, or both parents may speak both languages to them, or the parents may speak one language and an extended family member who regularly cares for the child from birth may speak another. But many, if not most, bilingual infants encounter their two languages sequentially. For example, they hear a single home language from birth, then begin to experience another one months later when they start spending regular time with people other than their parents, such as a nanny, other family caregivers, or daycare teachers. In such sequential cases the home language is often the parental heritage language, and the later-encountered L2 is used by the broader community/country, but the reverse situation can also occur (e.g., when a nanny or daycare teachers speak languages other than the home/majority language; see De Houwer, 2021; Gathercole, 2014). Relatedly, one language may dominate, and dominance may shift over time. For example, the home language may be the “dominant” language when an infant is very young, but dominance shifts to the community L2 when the child enters daycare/preschool or interacts with L2-native peers in weekly play groups.
Bilingual infant speech perception studies have generally assumed a quasi-binary distinction that fails to consider either simultaneity versus sequentiality or the other contextual differences among bilingual infants. The language background of participants has usually been grouped using arbitrary language exposure thresholds (e.g., those with > 85 percent exposure to a single L1 defined as monolingual, and those with < 65 percent exposure to one language and > 35 percent exposure to another as bilingual; Bosch & Sebastián-Gallés, 2001), but many infants fall between those two extremes. As a result, we still know little about the effects of the factors discussed here on bilingual infants’ perceptual attunement, or about the vestiges of that attunement in adult bilingual speech perception.
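The exposure-threshold grouping described above can be expressed as a short rule, which also makes the in-between gap explicit. The thresholds are the ones reported in the text; the helper function and its handling of unclassified cases are our own illustrative additions.

```python
def exposure_group(exposure: dict) -> str:
    """Group an infant by language exposure percentages.

    exposure: mapping of language -> percent of input (summing to ~100).
    Thresholds follow those reported for Bosch & Sebastián-Gallés (2001);
    infants between the two criteria fall into neither group.
    """
    shares = sorted(exposure.values(), reverse=True)
    primary = shares[0]
    secondary = shares[1] if len(shares) > 1 else 0.0
    if primary > 85:
        return "monolingual"
    if primary < 65 and secondary > 35:
        return "bilingual"
    return "unclassified"  # falls between the arbitrary thresholds

print(exposure_group({"Catalan": 90, "Spanish": 10}))   # monolingual
print(exposure_group({"Catalan": 55, "Spanish": 45}))   # bilingual
print(exposure_group({"Catalan": 75, "Spanish": 25}))   # unclassified
```

The explicit “unclassified” branch illustrates the chapter’s point: many real infants fall between the two criteria and are simply excluded from such studies.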
7.3.3 Speech Perception in Bilingual Infants
Research on speech perception in bilingual infants is newer and less extensive than that on monolinguals, but it has been increasing, along with reviews and theoretical analyses (e.g., Byers-Heinlein, 2018; Curtin, Byers-Heinlein, & Werker, 2011; Höhle, Bijeljac-Babic, & Nazzi, 2020). The first such study examined discrimination of the Catalan-only vowel contrast between mid-high /e/ and mid /ɛ/ by monolingual and bilingual infants learning Spanish and/or Catalan (Bosch & Sebastián-Gallés, 2003). All groups discriminated Catalan /e/–/ɛ/ at four months, but by twelve months only Catalan monolinguals and bilinguals did so, not Spanish monolinguals, indicating perceptual attunement in all groups. At eight months, however, only the Catalan monolinguals discriminated it, indicating that the two monolingual groups were already perceptually attuned to their L1s. Despite their Catalan experience, the bilinguals showed a dip in discrimination at eight months, from which they recovered by twelve months. In a follow-up study, an eight-month dip and twelve-month recovery was also found in bilinguals’ discrimination of a vowel contrast shared by Catalan and Spanish, high versus mid-high back rounded /u/–/o/, which both monolingual groups discriminated at all ages (Sebastián-Gallés & Bosch, 2009). The authors speculated that this indicates a temporary delay in bilinguals’ development of perceptual attunement. However, using a more sensitive anticipatory eye movement task at eight months, Albareda-Castellot, Pons, and Sebastián-Gallés (2011) found that bilinguals, like Catalan monolinguals, could discriminate Catalan /e/–/ɛ/, whereas Spanish monolinguals failed.
Thus, the bilinguals’ earlier difficulty at this age appears to reflect greater sensitivity to task demands, rather than an actual loss of perceptual ability.
Analogously, English monolingual and Spanish-English bilingual infants discriminated the Midwestern American English /e/–/ɛ/ contrast at both four and eight months (Sundara & Scutellaro, 2011), in a habituation task that was also more sensitive than that of Bosch and Sebastián-Gallés (2003). In another study, monolingual Dutch infants and infants being raised bilingually in Dutch and one of a range of other languages (e.g., English, German, Spanish, or Turkish) completed the same sensitive habituation task with the Dutch /ɪ/–/i/ contrast, a purely spectral high versus mid-high front distinction lacking from the bilinguals’ other languages (Liu & Kager, 2016). Neither group discriminated it at five to six months, only the bilinguals discriminated it at eight to nine months, and then both groups succeeded at eleven to twelve months. The authors interpreted this pattern as a bilingual lead in perceptual attunement to this initially difficult native vowel contrast. That inference is consistent with monolingual English and bilingual Spanish-English infants’ brain responses when discriminating another difficult, purely spectral English contrast, mid-high versus mid front lax /ɪ/–/ɛ/, which indicate that bilingual infants may attend more to the distinction than monolinguals (Shafer, Yu, & Datta, 2011; Shafer, Yu, & Garrido-Nag, 2012).
Considering the task and vowel contrast differences, we draw the following inference about bilingual infants’ perceptual attunement to vowel contrasts: they are neither delayed nor advanced relative to monolinguals, but instead differ in their attention to the contrasts, which is influenced by task demands. This attentional difference may explain the seemingly counterintuitive findings of a bilingual dip/delay at eight months for initially easy native vowel contrasts (discriminated by all groups at four to six months: Catalan/Spanish /u/–/o/; Catalan /e/–/ɛ/; English /e/–/ɛ/), but a bilingual lead at eight to nine months for initially difficult native contrasts (not discriminated by any group at four to six months: Dutch /ɪ/–/i/; English /ɪ/–/ɛ/).
As for consonant contrasts, Canadian English monolingual versus French-English bilingual infants were examined on discrimination of stop consonant voicing distinctions that differ between languages in VOT (Burns et al., 2007). As noted earlier, English uses short-lag versus long-lag aspirated VOTs to distinguish voiced versus voiceless stops, whereas, like Greek and Spanish, French uses near-zero-lag (Canadian) or voicing-lead (European) (Caramazza & Yeni-Komshian, 1974) versus short-lag VOT. The infants were habituated to a language-ambivalent short-lag [p], heard as /p/ by French-speaking but as /b/ by English-speaking adults, then heard two test trials that assessed their discrimination of near-zero-lag [b] (Canadian French /b/) and long-lag [pʰ] (English /p/) from the habituation stimulus. At six to eight months, both monolingual English and bilingual French-English infants discriminated the English [p]–[pʰ] contrast but not the Canadian French [b]–[p] contrast, which was difficult for both groups. Bilinguals discriminated the French contrast from ten months onward, whereas English monolinguals again failed. This finding provides evidence of perceptual attunement to this consonant contrast by ten months in both groups, consistent with prior research showing perceptual attunement for consonant contrasts in monolinguals (see Liu and Kager [2015] for similar results with monolingual Dutch and bilingual Dutch-{French, Spanish, English, German, or Chinese} infants discriminating Dutch and English VOT contrasts). Analogous results were also found for a place-of-articulation difference between the allophones of /d/ in Canadian French (dental [d̪]) and Canadian English (alveolar [d]) (Sundara, Polka, & Molnar, 2008).
The [d]–[d̪] contrast was successfully discriminated by all three groups at six to eight months: bilingual French-English, monolingual French, and monolingual English. However, only the bilingual and the monolingual English infants discriminated the contrast at ten to twelve months, which is consistent with discrimination of the same contrast by monolingual English and bilingual adults (Sundara & Polka, 2008). Neuroscientific studies of VOT comparisons with English monolingual and English-Spanish bilingual infants largely concur (Garcia-Sierra et al., 2011; Ferjan Ramírez et al., 2017), and further suggest that bilinguals’ brain responses may show enhanced executive functioning and a longer period of “openness” to language experience relative to monolinguals (see also Petitto et al., 2012; Singh, Loh, & Xiao, 2017).
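The language-specific VOT regimes discussed above (short-lag versus long-lag in English; voicing-lead or near-zero-lag versus short-lag in French) can be illustrated with a toy classifier. This is only a sketch: the function name and the boundary values (25 ms for English; 0 ms for French, modeling the European voicing-lead pattern) are our own illustrative assumptions, not measured perceptual boundaries.

```python
def classify_stop(vot_ms, language):
    """Map a bilabial stop's voice onset time (ms) to a voicing category.

    Boundary values are illustrative assumptions only; real perceptual
    boundaries vary with speaker, dialect, and place of articulation.
    """
    if language == "english":
        # English: short-lag -> voiced /b/; long-lag aspirated -> voiceless /p/
        return "/b/" if vot_ms < 25 else "/p/"
    if language == "french":
        # (European) French: voicing lead -> /b/; short-lag -> /p/
        return "/b/" if vot_ms < 0 else "/p/"
    raise ValueError(f"no rule for language: {language}")

# A language-ambivalent short-lag token like the habituation stimulus in
# Burns et al. (2007): voiced for English listeners, voiceless for French ones.
ambivalent_vot = 10  # ms (hypothetical value)
```

Under these assumed boundaries, the same +10 ms token is labeled /b/ in the English system but /p/ in the French system, which is the cross-language ambiguity the habituation stimulus exploited.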
Our inference is that discrimination ability and perceptual attunement per se do not differ between bilinguals and monolinguals early in life. What differs instead is their experience-driven attentional biases to the phonetic details of native and non-native contrasts, which are affected by task demands as well as the other factors we discussed earlier (see also Strange, 2011; Chapter 9, this volume). Thus, the simple “same or different trajectory” question is off target for understanding perceptual attunement in bilingual infants. The insights and inferences we have drawn here from research on the beginnings of language-specific perceptual attunement in monolingual versus bilingual infants provide developmental grounding for returning to consider our core question of how PAM principles might account for cross-language speech perception in adults who had acquired their two languages in the first two years of life, that is, mature early bilinguals.
7.4 Speech Perception in Early Bilingual Adults
While cross-language speech perception studies with bilinguals are rare, we can glean important insights from the limited studies available (see Antoniou et al. [2015] for a thorough review). Antoniou et al. (2013) tested cross-language perception of Ma’di consonants by early Greek-English bilinguals, in addition to the Greek and English monolinguals we discussed earlier in Section 7.2. Participant selection was carefully controlled for the bilinguals, with additional requirements beyond those for the monolingual Greek and English participants. All bilingual participants were born in Sydney, Australia, and acquired Greek in the home environment before beginning to learn English as a second language (L2), which they did before the age of five years; that is, they were early sequential bilinguals. A particular focus of the study was whether their perception shifted according to the language mode (Grosjean, 2001; see also Antoniou, Tyler, & Best, 2012). Half of the bilinguals completed the experiment in a unilingual Greek mode, interacting with the early sequential bilingual Greek-English experimenter in Greek throughout recruitment and participation in the study. They received written and verbal instructions in Greek and were engaged in conversation about their activities in the Australian Greek community. The other half completed the experiment in a unilingual English mode. They were only ever spoken to in English, and the experimenter carefully avoided topics that might relate to their Greek heritage. When asked to categorize the Ma’di consonants using letters in Greek (e.g., <τ> versus <ντ>) or English orthography (e.g., <t> versus <d>), the bilinguals in the Greek mode condition categorized the Ma’di consonants similarly to Greek monolinguals, and bilinguals in the English mode condition responded similarly to English monolinguals.
This suggests that bilinguals can change their listening mode depending on the language context of the experiment, at least for categorization tasks. In the discrimination tests of this study, however, accuracy did not differ according to language mode. For Ma’di contrasts where discrimination performance differed between monolingual Greek and English participants, the bilinguals’ accuracy was intermediate between the two monolingual groups. Thus, it appears that when the bilinguals performed a task requiring a judgment about category membership, they were able to respond in a way that matched the language mode they were in (the language context of the experiment). In contrast, language mode did not affect their discrimination. The language mode effect on categorization but not discrimination is in line with prior findings on Greek-English bilinguals’ categorization versus discrimination of Greek and English coronal stop voicing stimuli (Antoniou et al., 2012). These findings suggest that listeners make language-specific phonological judgments when categorizing the stimuli but rely on phonetic information in a way that reflects their language-specific listening history (i.e., amount of exposure to one or the other or both languages) when discriminating between stimuli.
The only other recent cross-language perception study with early bilinguals that we are aware of is a small-scale study by Melguy (2018), who compared categorization and discrimination of Nepali coronal stop place contrasts by early Spanish-English bilinguals (a mixture of sequential and simultaneous) and English monolinguals living in the USA. There was no language mode manipulation, but the bilinguals were given the option to select their consonant category choice from a set of English or Spanish initial-consonant keywords (e.g., English “pole,” “toll,” “coal” and Spanish “pon,” “ton,” “con”). Nepali has phonological contrasts between dental versus retroflex places of articulation for its voiced and voiceless aspirated coronal (tongue tip) stops, /d̪/–/ɖ/ and /t̪ʰ/–/ʈʰ/, respectively (Khatiwada, 2009), in which the retroflex stops are often realized phonetically as alveolars. The stimuli were produced by two Nepali-English bilinguals living in the USA, as dental versus alveolar phonetic distinctions with prevoiced versus long-lag aspirated VOTs, that is, [d̪]–[d] and [t̪ʰ]–[tʰ], respectively. In Spanish, /d/ and /t/ are generally produced with a dental place of articulation as prevoiced versus voiceless unaspirated [d̪]–[t̪], whereas in English they are generally alveolar voiceless unaspirated versus aspirated [t]–[tʰ]. The main aim of the study was to test whether Spanish-English bilinguals’ categorization and discrimination of the Nepali dental-alveolar phonetic distinctions were more accurate, given their experience with both places of articulation across their two languages (dental in Spanish, alveolar in English), than those of monolingual English speakers.
In the categorization task, the bilinguals tended to choose a Spanish more often than an English consonant label for the Nepali dental consonants, and an English more often than a Spanish consonant label for the phonetically alveolar Nepali consonants (the phonologically retroflex stops). This suggests that they were sensitive to the differences between the Nepali dental (Spanish place of articulation) and alveolar (English place of articulation) phonetic realizations. Conversely, their discrimination of the two Nepali place contrasts did not reflect that sensitivity to the language-specific phonetic difference, as their accuracy was close to chance (and did not differ significantly from the English monolinguals). As there was no monolingual Spanish group, it is not possible to infer any effects of language dominance on discrimination accuracy.
While there are few studies focusing on cross-language speech perception in early bilinguals, there is a large literature on how they perceive consonants and vowels (and lexical tones) in their own two languages (for reviews, see Gonzales, Byers-Heinlein, & Lotto, 2019; Simonet, 2016). Antoniou et al. (2012) found that early sequential Greek-English bilinguals’ categorization of Greek and English bilabial and coronal stops differed according to language mode. The potential perceptual conflict for them is that Greek phonologically voiceless stops /t, p/ and English phonologically voiced stops /d, b/ have very similar phonetically short positive VOTs (i.e., unaspirated voicing lag). Only the bilinguals in Greek mode categorized the Greek stops similarly to Greek monolinguals, and only the bilinguals in English mode categorized the English stops like English monolinguals. Thus, despite the phonetic/phonological mismatches between their two languages, the bilinguals apparently optimized their category judgments to match the language context. Other studies have also shown that bilinguals shift their category judgments according to a language-specific phonetic context (Casillas & Simonet, 2018; Gonzales & Lotto, 2013) and their expectations of which language they will hear (Gonzales et al., 2019). As in Antoniou et al. (2013), Antoniou et al. (2012) failed to find a language mode effect in discrimination.
However, the bilinguals’ discrimination of both Greek and English voicing contrasts aligned with that of monolinguals of their dominant language, English, rather than falling between the two monolingual groups’ discrimination levels, as it did for the unfamiliar non-native Ma’di stimuli (Antoniou et al., 2013).
Sundara and Polka (2008) tested discrimination of Canadian French (dental [d̪]) and Canadian English (alveolar [d]) allophones of /d/ by simultaneous Canadian French-English bilinguals. Their performance was compared with that of early bilinguals who acquired English at home and attended French immersion school from age five to six, monolingual French and English speakers, and native Hindi speakers. The last group were included because Hindi has a dental-retroflex stop contrast /d̪/–/ɖ/, and the authors speculated that English alveolar [d] was likely to assimilate to the Hindi speakers’ L1 retroflex /ɖ/ while French dental [d̪] would assimilate to their /d̪/. The results suggested that, unlike the Spanish-English bilinguals in Melguy (2018), the Canadian French-English bilinguals were sensitive to the phonetic difference between French [d̪] and English [d], as their discrimination was as accurate as the Hindi native speakers’, and generally higher than the French and English monolingual groups’. However, the participants did not complete a categorization task, so it is unclear whether the simultaneous bilinguals were using language-specific phonetic sensitivity to discriminate the contrast.
Another important set of findings is from perception of Catalan-only vowel contrasts by early sequential bilinguals with Catalan versus Spanish as their L1. These studies have often used categorical perception tasks, where participants are presented with synthesized speech tokens that are equally spaced along an acoustic continuum between two contrasting consonants or vowels. Perceivers have to categorize each continuum step as one category or the other and discriminate equally spaced pairs of steps along the continuum (e.g., step 1 versus step 4, step 2 versus step 5, and so on). Perception is considered categorical when the continuum steps are mostly perceived as one category or the other (i.e., there is a sharp boundary between categories), and discrimination is much better across the category boundary than within categories. In Pallier, Bosch, and Sebastián-Gallés (1997), early sequential L1-Catalan and L1-Spanish bilinguals from Barcelona categorized and discriminated tokens from a vowel continuum spanning the Catalan-only /e/–/ɛ/ contrast. Spanish has only /e/, which is acoustically intermediate between these Catalan vowels. The L1-Catalan bilinguals showed a clear categorical boundary for the synthetic /e/–/ɛ/ continuum, but the L1-Spanish bilinguals did not. The authors suggest that the L1-Spanish bilinguals’ early exposure to Spanish /e/ prevented them from acquiring the Catalan /e/–/ɛ/ contrast because both Catalan vowels overlap acoustically with Spanish /e/.
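The logic of such categorical perception tasks can be illustrated with a small simulation. Identification along a hypothetical seven-step /e/–/ɛ/ continuum is modeled with a logistic function (the boundary location and slope are arbitrary illustrative values of our own), and the discriminability of equally spaced step pairs is proxied by the difference between their identification probabilities, which peaks for the pair that straddles the category boundary:

```python
import math

def p_ident_e(step, boundary=3.5, slope=2.0):
    """Probability of labeling a continuum step as /e/ (logistic sketch).

    The boundary location and slope are arbitrary illustrative values.
    """
    return 1.0 / (1.0 + math.exp(slope * (step - boundary)))

steps = range(1, 8)  # a hypothetical 7-step /e/-/ɛ/ continuum
ident = {s: p_ident_e(s) for s in steps}

# Discriminability of equally spaced pairs (step i vs step i + 3), proxied by
# the difference in identification probabilities: categorical perception
# predicts a peak for the pair that straddles the category boundary.
pairs = [(i, i + 3) for i in range(1, 5)]
discrim = {(i, j): abs(ident[i] - ident[j]) for i, j in pairs}
```

With these parameters, identification falls steeply from near-certain /e/ to near-certain /ɛ/, and the pair straddling the boundary (steps 2 versus 5) is predicted to be the easiest to discriminate; a listener without the category boundary, like the L1-Spanish bilinguals in Pallier et al. (1997), would show a flatter identification function and no discrimination peak.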
Contrasting findings were reported by Amengual (2016b), who divided Spanish-Catalan bilinguals in terms of their language dominance rather than order of acquisition. Catalan- versus Spanish-dominant bilinguals in Majorca categorized and discriminated synthetic continua for two Catalan-only vowel contrasts, /e/–/ɛ/ and mid-high versus mid /o/–/ɔ/. They discriminated both vowel contrasts quite categorically, without group differences. They also identified both vowel contrasts categorically, with no group difference on /e/–/ɛ/ but less robust categorization of /ɔ/ (one endpoint of the /o/–/ɔ/ continuum) by the Spanish-dominant than by the Catalan-dominant group. The different outcomes in Amengual (2016b) versus Pallier et al. (1997) might be due to phonetic differences between the Catalan spoken in Majorca versus that in Barcelona, but Amengual pointed out that language input differences may also be a factor. Spanish-Catalan bilinguals raised in Barcelona are more likely to be exposed to Spanish-accented Catalan, in which /e/ and /ɛ/ are merged, than are those raised in Majorca. In another study, the same Majorcan bilinguals completed discrimination tasks using naturally produced tokens of Catalan vowels (Amengual, 2016a). While discrimination accuracy was high for both groups (> 85 percent correct), the Catalan-dominant bilinguals had a small but significantly higher accuracy than the Spanish-dominant bilinguals on three of the contrasts: /e/–/ɛ/, /o/–/ɔ/, and /ɛ/–/a/.
Our brief review of perceptual studies shows that early bilinguals can adjust their phonetic sensitivity according to their language context when making category judgments, which could be interpreted as evidence for separate phonological systems. However, the findings with Spanish-Catalan bilinguals seem to be consistent with the idea that bilinguals have a single phonological system for both languages, with attunement to Catalan /e/–/ɛ/ dependent on the phonetic characteristics of the regional accents and the context of language acquisition and use. The reason for the varying findings across these studies is difficult to pinpoint given the multiple differences among them. It could result from one or more of the following factors: different engagement of phonological and phonetic processes in categorization versus discrimination tasks (see Antoniou et al., 2012), stimulus differences (natural tokens versus synthetic continua; consonants versus vowels), language differences (Greek/English versus Catalan/Spanish), and/or different types of bilingual comparisons (language mode, language dominance, or order of acquisition). To make sense of bilingual speech perception, prior research needs to be interpreted within a theory of how phonological categories are acquired and perceived, and future studies should be designed to test theoretical predictions.
7.5 Back to Adult Bilinguals: Implications of, and for, PAM
In order to fully address the question of how PAM might account for speech perception in early bilinguals, we must briefly review some relevant aspects of PAM’s extension to second language learning, PAM-L2 (Best & Tyler, 2007). PAM-L2 was developed to account for the acquisition of phonological categories in late L2 learners (for further detail on PAM-L2, see Best & Tyler, 2007; Tyler, 2019; Tyler et al., 2024). As late L2 learners begin learning the L2 already well practiced in the natively tuned speech perception they developed early in life, they initially perceive L2 segments in terms of their L1 phonological categories (i.e., in the same way as naïve monolinguals in cross-language speech perception studies). PAM-L2 is based on the assumption that late L2 learners incorporate L2 phonological contrasts into a single interlanguage system, and it outlines the ways that L2 categories might develop.
The applicability of PAM-L2 to early bilinguals depends on whether they have a single interlanguage phonological system, as we have proposed for late L2 learners, or separate phonological systems for each language (for discussions of this topic in bilingual language use more generally, see Genesee, 1989; MacSwan, 2017; Otheguy, García, & Reid, 2015). Our review of the infant literature suggests that bilingual infants follow the same developmental sequence as monolinguals, but prior research on speech perception in early bilingual infants and adults does not provide any clear answers. We do not rule out the possibility that some bilinguals might develop separate phonological systems, but we think that in order for this to occur it would probably require a restricted set of circumstances, for example nonoverlapping contexts of language use and/or languages with markedly different phonological properties, such as bilinguals who speak a tonal and a nontonal language (e.g., Choi, Tong, & Samuel, 2019) or who use a signed and a spoken language (e.g., Blanco-Elorrieta, Emmorey, & Pylkkänen, 2018). We suggest that in most early bilingual language environments, both sequential and simultaneous bilinguals are most likely to develop a single interlanguage phonological system, just as we have proposed in PAM-L2 for adult L2 learners.
According to PAM-L2, when an L2 phonological segment is initially perceived as an instance of an L1 category, that category will become a shared L1-L2 phonological category. For L2 segments that are not acquired as shared L1-L2 categories, the learner may acquire a new L2-only phonological category (compare with the Speech Learning Model [SLM]; Flege, 1995), for example when an L2 segment is initially assimilated as uncategorized or as a nonassimilable nonspeech sound. It is important to note that if an L2 contrast is a single-category assimilation for an L2 learner, then both L2 segments might be incorporated into the same shared L1-L2 phonological category, and the learner would continue to find it difficult to discriminate the L2 contrast (for a more detailed discussion, see Tyler et al., 2024). Although this outcome may seem consistent with the failure of early sequential Spanish-Catalan bilinguals in Barcelona to perceive certain Catalan-only vowel contrasts categorically, we consider this late-L2 account unlikely to apply frequently to early bilinguals. This is because, unlike adult L2 learners, their nascent phonological categories would not be established well enough in one language to prevent them from tuning in to a contrast in their other early-acquired language.
When the phonetic properties differ for the L1 and the L2 versions of shared L1-L2 phonological categories, as is the case for stop voicing in Greek and Spanish or place of articulation for /d/ in French and English, then late L2 learners may tune in to those L2-specific phonetic properties. Best and Tyler (2007) referred to this as a language-specific phonetic category within a shared phonological category. As early bilinguals’ phonological categories develop before any one language is well established, we predict that they are more likely than late L2 learners to acquire sensitivity to language-specific phonetic properties within phonological categories that are shared between their two languages (see footnote 3). Predictions for contrast discrimination accuracy are the same for PAM-L2 as they are for PAM, except that the segments comprising two-category, category-goodness, or single-category contrasts may be categorized as L1-only phonological categories, L2-only (new) phonological categories, or shared L1-L2 phonological categories.
One clear finding from the adult bilingual speech perception literature is that early bilinguals can make language-appropriate category judgments, even when there is a phonetic-phonological mismatch across languages for a shared phonological category (Antoniou et al., 2012; Casillas & Simonet, 2018; Gonzales et al., 2019). Bilinguals may achieve this by shifting their attention to lower-order invariant phonetic properties that are unique to one language or the other, that is, to their language-specific phonetic specifications within that shared phonological category. Note that shifting attention to lower-order phonetic properties is not unique to bilingual speech perception. Monolinguals also need to shift their attention to lower-order phonetic properties in certain listening situations, such as when rating the goodness of fit of non-native phones to native phonemes. As monolinguals have only a single, unilingual language mode (Grosjean, 2001), they are able to shift attention between higher-order phonological and lower-order phonetic information according to task demands. We propose that bilinguals necessarily shift into a unilingual language mode in one or the other of their languages when they need to focus on lower-order phonetic details. While they may be able to shift between unilingual modes, we reason that they should not be able to attend to the lower-order phonetic details of both of their languages at the same time. For example, the Spanish-English bilinguals in Melguy (2018) showed some differential sensitivity to the dental (Spanish-like) versus the alveolar (English-like) Nepali consonants in categorization, but their discrimination of the contrasts was poor.
They may have been able to switch between unilingual Spanish and English modes as they judged the individual Nepali consonant stimuli because they were presented one by one, but we argue that in the discrimination task they would need to remain in a unilingual mode to detect the lower-order phonetic differences between the three non-native Nepali consonants within a given trial (see footnote 4).
We posit that bilinguals living in an environment where one language dominates are likely to be more adept at shifting their attention from higher-order phonological to lower-order phonetic properties in the dominant language than in the nondominant language. When perceiving non-native speech, they should be more likely to focus automatically on the lower-order phonetic properties of their dominant language, unless the task requires an explicit focus on the nondominant language (e.g., categorizing non-native segments using orthographic labels from the nondominant language; Antoniou et al., 2013). We speculate, further, that it should require greater effort for them to shift their attention to the nondominant language than to the dominant language. That is, we predict a language-dominance-related asymmetry in directing attention to lower-order phonetic details when perceiving speech in their two languages (see footnote 5).
In summary, the principles that guide a PAM account of speech perception in monolinguals and late L2 learners also apply to early bilinguals. We have suggested that, like late L2 learners (Best & Tyler, 2007; Tyler, 2019; Tyler et al., 2024), early bilinguals are likely to have developed a single interlanguage phonological system, composed of a combination of language-specific and shared phonological categories, the latter of which may also have shared or language-specific phonetic categories. We predict that early bilinguals will perceptually assimilate non-native phonological segments to their interlanguage phonological categories and that they will be sensitive to the higher-order phonological information that sets those categories apart from one another (i.e., two-category assimilations), regardless of language mode. If a non-native contrast cannot be discriminated using higher-order phonological information (e.g., single-category and category-goodness assimilations), the bilingual must shift to a unilingual mode in order to focus on lower-order phonetic details. In this case, if the context does not favor one language over the other, then bilinguals are likely to shift to their dominant language.
7.6 Future Directions and Methodological Considerations
Studies of bilingual speech perception have only begun to scratch the surface of how early bilingual language experience shapes perception. Many studies have compared bilinguals’ perception of segments in their two languages, while very few have tested bilinguals’ cross-language perception using non-native segments from a language that they have never encountered before. For a richer insight into phonological attunement in bilinguals, more studies are needed in which bilinguals’ perception of minimally contrasting non-native segments is compared to perception by monolinguals of each of their two languages. We encourage researchers with access to early bilingual populations to investigate cross-language speech perception, but there are prerequisites that need to be met and methodological considerations, including some issues that are yet to be resolved, as we discuss in the following paragraphs.
To understand how bilingual language experience shapes perception, it is necessary to compare bilinguals and monolinguals. Ideally, there would be two monolingual groups, one for each of the bilingual’s languages (see, e.g., Antoniou et al., 2012, 2013; Sundara & Polka, 2008). PAM makes clear predictions about adult monolinguals’ discrimination accuracy for non-native contrasts because they are raised in linguistically similar language environments. However, perceptual attunement may differ among bilinguals who are exposed to the same two languages because the shape of a bilingual’s interlanguage phonological system depends on the nature of their own individual language experience (compare with the role of input in the revised Speech Learning Model [SLM-r]; Flege & Bohn, 2021; and see Otheguy et al. [2015] for the notion of an idiolect). To avoid confounds, the characteristics of a bilingual sample need to be well controlled on the factors that are known to influence bilingual speech perception (e.g., age of acquisition of each language, dominance, and contexts of language use; see Antoniou, 2018). Monolingual groups must also be well controlled, with researchers ensuring that they have not been regularly exposed to other languages in early childhood (i.e., for more than four hours per week), have not lived for an extended period in a second-language environment, and are not advanced learners of an L2. Finding monolingual groups without early exposure to another language may be difficult nowadays, however, as children often also learn English (e.g., in Europe) or Standard Mandarin (in China or some other Asian countries) as an L2 from a young age. In fact, bilinguals from the same language community may acquire those languages as an L3.
This means that the listeners could have acquired L2(/L3) phonological categories that support discrimination of non-native contrasts (see Tyler [2019] for a discussion of category acquisition in a foreign-language classroom setting). At the very least, researchers should ensure that monolinguals and bilinguals have the same history of early L2(/L3) classroom learning, but it would be preferable, where possible, to test bilinguals with language combinations for which both languages are also still being learned monolingually.
Our previous discussion shows that language mode (Antoniou et al., 2012; Grosjean, 2001) is an important factor to take into account, so investigations of bilingual speech perception should ideally include a language-mode manipulation. If that is not feasible, then it is important to control for language mode by ensuring that bilinguals are restricted to and maintained in one language mode or the other when completing the experiment. However, researchers should bear in mind that, even with the most carefully controlled language mode conditions, language mode might shift when the phonetic properties of the non-native stimuli are much more consistent with the bilingual’s other language (Gonzales & Lotto, 2013).
In order to test PAM predictions, participants must complete both categorization and discrimination tasks. Categorization tasks usually have a closed set of response labels that participants can select from (e.g., Antoniou et al., 2013; Melguy, 2018), but it is possible to use an open set, where participants simply write down what they hear (e.g., Best et al., 1988; Bohn et al., 2011). An open-set categorization task has the advantage that participants are not restricted by a closed response set or influenced by the experimenter’s choice of categories. However, responses can be difficult to interpret if participants use novel letter combinations to indicate peculiar phonetic characteristics (e.g., Best et al., 1988). If using an open-set approach, researchers should interview the participants after the experiment to clarify what their labels refer to. A closed-set approach has the advantage of focusing participants’ attention on phonological categories, and it avoids the need to interpret any idiosyncratic labels that participants might insert. However, if using a closed-set task, we recommend a whole-system approach (Bundgaard-Nielsen, Best, & Tyler, 2011), where participants are provided with labels for all possible consonant or vowel categories in a given language, rather than assuming which phonological categories are relevant.
An unresolved methodological issue for categorization tasks with bilinguals is whether to give them the opportunity to select categories from both of their languages. The Spanish-English bilinguals in Melguy (2018) were provided with a closed set of both English and Spanish response categories, whereas the Greek-English bilinguals in Antoniou et al. (2012, 2013) selected labels in either Greek or English orthography, according to the language-mode condition. Neither approach is ideal. Melguy (2018) found that bilinguals’ responses for the same Nepali consonant were split between the Spanish and the English labels, so he needed to infer their language-specific phonetic sensitivity indirectly, by analyzing the relative proportions of Spanish versus English labels. Furthermore, using a whole-system approach for more than one language would result in an overwhelming number of response categories. The alternative approach of providing response options from only one language would also be problematic if the stimulus were perceived as a clear instance of a category in the bilingual’s other language. This is unlikely to have occurred in Antoniou et al. (2012, 2013) because the stimuli were perceived as consonants that are shared between the bilinguals’ two languages. Had they been asked to categorize an Arabic voiceless uvular fricative /χ/, however, which also occurs in Greek, then bilinguals in English mode would not have been able to indicate that the Arabic consonant was a clear instance of their Greek /χ/ category.
A possible alternative is a forced category-goodness rating task (Tyler, 2021a). For each stimulus item, the participant is provided with a single phonological category label and asked to rate the goodness of fit to the category provided. This task has the advantage that participants provide ratings against phonological categories that they might not have selected in a forced-choice categorization task, but it is very labor-intensive because each stimulus needs to be rated against multiple phonological categories. One compromise that may provide a solution is a categorization task with elements of both open- and closed-set approaches (Rallo Fabra, Achichaou, & Tyler, 2022; Rallo Fabra & Tyler, 2023). Participants are provided with a closed set of labels from one language, with an additional button to use when they wish to use a different category. The participant provides a new label, which is then added to the categorization grid for subsequent trials. A survey at the end of the task asks for a description of each label, including the language that the category is from.
PAM was developed to account for how native-language experience shapes perception. In this chapter, we outlined how the principles of PAM and of PAM-L2 (which addresses late bilinguals) also apply to early bilinguals’ speech perception. We have suggested that bilinguals’ speech perception is shaped by their experience with their two languages, and that they have a single phonological system shared across both of their languages. Phonological categories may be unique to one language or shared, and bilinguals may shift their attention to language-specific phonetic categories within a shared phonological category according to language mode. PAM was developed using experimental evidence from cross-language speech perception studies, in which listeners categorize and discriminate speech segments from a language that they have never encountered before. We suggest that carefully controlled cross-language speech perception studies comparing bilingual and monolingual infants and adults will provide important insights into how language experience shapes perception in early bilinguals.
8.1 Introduction
Since its original proposal (Escudero, 2005) and subsequent revision (van Leussen & Escudero, 2015), the Second Language Linguistic Perception (L2LP) model has received increasing attention as a comprehensive and quantitative model of second language (L2) speech perception. It grew out of and coevolved with the Bidirectional Phonology and Phonetics (BiPhon) framework (Boersma, 1998, 2011), which is itself an extension of Optimality Theory (OT; Prince & Smolensky, [1993] 2002) (see Footnote 1). Numerous studies have been conducted within the model’s framework over the last two decades, accumulating evidence for its adequacy in describing, explaining, and predicting L2 learners’ perceptual patterns. Recent works have also extended the model to a wider range of bilingual populations (e.g., simultaneous bilinguals, as in Escudero et al., 2016a), to other domains of language acquisition (e.g., word learning [as in Escudero, Mulak, & Vlach, 2016b; Escudero, Smit, & Mulak, 2022], orthography [as in Escudero, 2015; Escudero, Simon, & Mulak, 2014a; Escudero, Smit, & Angwin, 2023], and speech production [as in Liu & Escudero, 2023; Yazawa et al., 2023a]), and to other academic disciplines (e.g., language training and curriculum design [as in Colantoni et al., 2021; Elvin & Escudero, 2019]). This chapter aims to illustrate how L2LP can address a breadth of issues in bilingual phonetics and phonology by reviewing pivotal research conducted with the model.
The focus here is on L2LP, but thorough comparisons with other models of L2 and bilingual phonetics and phonology can be found in Colantoni, Steele, and Escudero (2015), Escudero (2005), and Yazawa (2020).
Before we move on, it is important to note that most studies that have previously been conducted within L2LP or other models of non-native speech perception have tended to feature “naïve” listeners and “L2 learners” with different proficiency levels. Given that within this volume the term used to define users of two or more languages is “bilingual,” it seems appropriate to first provide the definitions of a variety of participant groups that have been included in previous and recent L2LP studies.
Most studies within the L2LP framework have used a control group commonly termed “monolingual” listeners of the target language. However, even this seemingly simple term has its complexities. To clarify it, Escudero, Sisinni, and Grimaldi (2014b, p. 1578) defined monolingual listeners, or functional monolinguals, as those who use only their first language (L1) in their everyday life, have not resided in a country or region where another language is spoken for longer than a month, and have received basic classroom L2 instruction (if at all) by L1-accented teachers focusing on reading and grammar. Such monolinguals can be regarded as being in their initial state for learning any subsequent language, that is, at the onset of L2 learning.
Escudero et al. (2022) make an important distinction among users of two languages, commonly referred to as bilinguals, based on their age of acquisition of each language. Specifically, the authors distinguish between “simultaneous” and “sequential” bilinguals, with the former being exposed to both languages from birth and the latter acquiring an L2 after their L1. Sequential bilinguals are commonly called L2 learners, with the onset of L2 acquisition occurring during adolescence or adulthood. Although L2 learners can reach an end state that resembles native-like performance, this may not be the case for all components of language proficiency, resulting in different levels of L2 proficiency for sequential bilinguals. In contrast, simultaneous bilinguals commonly acquire full proficiency comparable to monolinguals of the two languages, especially in the domain of phonetics and phonology (Antoniou et al., 2011; Elvin, Tuninetti, & Escudero, 2018a). Later we will see that this distinction between different types of bilinguals yields differential performance, which will be explained using L2LP’s developmental proposal.
The remainder of this chapter is organized as follows. First, we present an overview of L2LP to help familiarize readers with the model’s key constructs (Section 8.2). This section also discusses how computational and statistical methods are utilized to provide greater explanatory adequacy and more specific and testable predictions, since quantification is a crucial property of the model. We then report on a series of new studies to illustrate L2LP’s recent approach to lexical development (Section 8.3). These studies shed light on previously understudied aspects of bilingual phonetics and phonology, including how bilinguals’ linguistic background influences their prelexical perception and lexical development. Finally, we address some remaining questions concerning how the model handles important issues such as the role of orthography, speech production, and applications to language training and curriculum design, including future directions (Section 8.4). The chapter ends with a summary and conclusion (Section 8.5).
8.2 Model Overview
Given that L2LP’s theoretical framework is based on “Linguistic Perception” (LP), we start by outlining the principles of LP in Section 8.2.1, followed by their extension to L2LP in Section 8.2.2. Section 8.2.3 addresses how the model’s theoretical components can be computationally implemented for explanatory adequacy as well as to formulate specific and testable predictions.
8.2.1 Linguistic Perception (LP)
The term “Linguistic Perception” reflects the notion that human speech perception is a language-specific rather than a general auditory process. Escudero (2005, p. 7) defines speech perception as “the act by which listeners map continuous and variable speech onto linguistic targets.” Given that the very purpose of speech communication is to understand and to be understood, the listener’s task is to map the incoming variable acoustic cues (e.g., first formant or F1, second formant or F2, fundamental frequency, and duration) onto discrete and abstract linguistic representations (e.g., distinctive features, segmental categories, and suprasegmental structures) to ultimately extract the meaning intended by the speaker. The mapping patterns are language-specific in nature, since the number of linguistic representations and the use of acoustic cues vary substantially not only across languages but also across varieties or dialects of the same language.
Consider, for example, how the acoustic cues of F1 and F2 can map onto vowel categories. These cues, though physically continuous, should perceptually map to a different number of discrete categories depending on the language. Native English listeners need to make a fine-grained mapping of the two cues onto a dozen vowel categories so that they can identify and distinguish minimal pairs such as “heed,” “hid,” “hayed,” “head,” “had,” “hud,” “hod,” “hawed,” “hoed,” “who’d,” “hood,” and “heard,” although “a dozen” is a very rough approximation because the exact number of categories varies across different dialects of English. The mapping is much less dense for Arabic, which has only three qualitative contrasts (/i/, /a/, and /u/), though again dialectal variations exist. Languages also exhibit divergent mapping patterns even when they have the same number of categories. For example, native listeners of Greek, Hebrew, Czech, Spanish, and Japanese, all of which have a five-vowel system in their standard varieties, show distinct mapping patterns of the F1 and F2 cues per language (Boersma & Chládková, 2011; Escudero et al., 2014b).
The language-specific nature of speech perception is formulated in the LP model as the optimal perception hypothesis (see Footnote 2), which posits that listeners learn the optimal mapping of acoustic cues onto appropriate sound representations that leads to maximum likelihood behavior (Boersma, 1998, p. 337). This means that the probability of correctly perceiving the intended linguistic representation based on the acoustic cues is maximized or, to put it another way, the probability of misperception is minimized. Native listeners’ perception of a language is optimal in that it tries to extract as many linguistic representations as required in the language (e.g., a dozen vowel categories in English, three in Arabic, or five in Greek, Hebrew, Czech, Spanish, and Japanese, with nonnegligible dialectal differences). It is also optimal in that the mapping patterns mirror the acoustic cues in the language (e.g., Japanese /u/ is generally more fronted than Spanish /u/, and so the perceptual usage of the F2 cue differs between the two languages).
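The maximum-likelihood idea behind optimal perception can be made concrete with a small sketch. The Python snippet below is our own illustrative reduction, not code or parameters from any study cited here: each category’s cue values are assumed to be Gaussian, and the optimal listener maps an incoming (F1, F2) pair to the category under which it is most likely.

```python
import math

def gaussian(x, mean, sd):
    """Likelihood of cue value x under a Gaussian cue distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical (F1, F2) category means in Hz for a toy five-vowel inventory
CATEGORIES = {
    "i": (300, 2300), "e": (450, 2000), "a": (700, 1300),
    "o": (480, 900), "u": (320, 800),
}
F1_SD, F2_SD = 50, 150  # assumed cue variability

def optimal_percept(f1, f2):
    """Map the incoming cues to the category with maximum likelihood."""
    return max(CATEGORIES,
               key=lambda c: gaussian(f1, CATEGORIES[c][0], F1_SD)
                             * gaussian(f2, CATEGORIES[c][1], F2_SD))

print(optimal_percept(310, 2250))  # high front cues -> "i"
print(optimal_percept(690, 1350))  # low central cues -> "a"
```

On this view, cross-linguistic differences reduce to different category inventories and cue distributions plugged into the same maximum-likelihood decision rule.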
How do native listeners acquire such language-specific, optimal perception? The LP model assumes a general learning device that is responsible for creating representations and adjusting cue usage, which is computationally implemented (see Section 8.2.3) by the Gradual Learning Algorithm (GLA; Boersma & Hayes, 2001). An important attribute of the learning device is that it is distribution- and meaning-driven. It is distribution-driven in that it collects statistical information concerning the acoustic cues in the ambient language and gradually adjusts the mapping patterns based on this information (Boersma, Escudero, & Hayes, 2003), whereby the resulting perception exhibits what is known as the perceptual magnet effect (Kuhl, 2004). The device is meaning-driven in that it evaluates how the mappings signal lexical contrasts to determine the number of representations required for optimal perception in the language. The meaning-driven nature of the device implies that LP goes beyond simple acoustic-to-category mapping, since sound categories alone are meaningless unless they are associated with higher-level lexical representations. These learning mechanisms work alongside a complex structure involving multiple levels of representation and connections between them, as shown in the current LP model illustration in Figure 8.1 (van Leussen & Escudero, 2015).
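The distribution-driven side of this learning device can be illustrated with a minimal sketch (not the GLA itself, and with made-up numbers): a category prototype is nudged a small step toward each incoming token, so that it gradually comes to reflect the statistics of the ambient input.

```python
import random

def learn_prototype(tokens, start=500.0, rate=0.05):
    """Shift a cue prototype a small step toward each incoming token."""
    proto = start
    for token in tokens:
        proto += rate * (token - proto)  # gradual, error-driven adjustment
    return proto

random.seed(1)
# Ambient F1 tokens (Hz) drawn from a hypothetical category centered at 700 Hz
ambient = [random.gauss(700, 50) for _ in range(500)]
print(round(learn_prototype(ambient)))  # ends near 700 despite starting at 500
```

Warping of perceptual space around such distributionally learned prototypes is one informal way to think about the perceptual magnet effect mentioned above.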

Figure 8.1 Current full architecture of LP.
In Figure 8.1, the bottom-level representation, the [auditory] form, refers to the incoming acoustic signals as they arrive in the peripheral auditory system. The variable [auditory] form is then mapped to the following /surface/ form, which encodes the listener’s language-specific and invariant representations of speech sounds, including context-specific allophonic details. The /surface/ form is further abstracted into the third, |underlying| form, which encodes the canonical phonemic contrasts that may change the meaning of a word. Finally, the |underlying| form is connected to the <lexical> form, namely, to words and morphemes stored in the mind or brain. These representations, together with the connections between them, are acquired via distributional and meaning-driven learning. The [auditory]-to-/surface/ mappings (cue constraints; see Footnote 3) are learned based on the distributions of acoustic values, while the connections between /surface/ and |underlying| forms (phonological constraints) are learned in relation to which <lexical> forms exist or not (lexical constraints).
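In its sequential (bottom-up) reading, this chain of mappings can be caricatured as three composed functions. Everything in the sketch below is an invented stand-in for the learned cue, phonological, and lexical constraints, included only to make the data flow between the four levels concrete.

```python
# A deliberately simplified, bottom-up walk through the four LP levels.
# All mappings are invented stand-ins for learned constraints.

def auditory_to_surface(f1_hz):
    """Cue constraints: map a continuous F1 value onto a language-specific
    surface sound (here, a toy two-vowel language)."""
    return "/i/" if f1_hz < 450 else "/a/"

def surface_to_underlying(surface):
    """Phonological constraints: abstract away allophonic detail (trivial in
    this toy, where each surface sound is its own phoneme)."""
    return {"/i/": "|i|", "/a/": "|a|"}[surface]

def underlying_to_lexical(underlying):
    """Lexical constraints: connect phonemes to stored words."""
    lexicon = {"|i|": "<word-1>", "|a|": "<word-2>"}
    return lexicon[underlying]

word = underlying_to_lexical(surface_to_underlying(auditory_to_surface(320)))
print(word)  # an F1 of 320 Hz ends up recognized as <word-1>
```

In the full model, of course, each arrow is a ranked set of violable constraints rather than a deterministic function, and the revised model also allows information to flow top-down.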
Importantly, notice in Figure 8.1 that LP distinguishes prelexical perception and lexical recognition. Most psycholinguistic models of speech perception agree that lexical recognition guides perceptual learning, but it remains controversial whether the two processes are sequential (i.e., bottom-up) or interactive (i.e., bottom-up and top-down). The original LP model (Escudero, 2005, 2009) held a sequential view where perception precedes recognition, that is, the outcome of perception is faithfully passed on to recognition. According to this view, the lexical influences on perception are explained by offline (i.e., post hoc) learning from the lexicon (see the Merge model; Norris, McQueen, & Cutler, 2000). In contrast, the revised LP model (van Leussen & Escudero, 2015) allows for an interactive view as well, in which the lexicon can influence lower-level representations during the online (i.e., ad hoc) processing of speech (see the TRACE model; McClelland & Elman, 1986). While the pursuit of this matter is beyond the scope of this chapter, the distinction and the connection between prelexical perception and lexical recognition will be discussed in Section 8.3.
8.2.2 Second Language Linguistic Perception (L2LP)
The L2LP model is a conceptual extension of the LP framework for L2 learners. The model consists of five theoretical ingredients, as shown in Figure 8.2, where straight arrows represent the ingredients’ sequential nature and curved arrows represent relationships between ingredients.

Figure 8.2 The five theoretical ingredients of L2LP.
As shown in Figure 8.2, the first ingredient is optimal perception in the listener’s L1 and the target L2. As mentioned, LP is language-specific, with the number of linguistic representations and the mapping of acoustic cues being unique to each language. This means that optimal perception for one language is not necessarily optimal for another, and vice versa (see Footnote 2 for L2LP’s definition of “optimal”). The L2LP model proposes that a thorough analysis of optimal perception in each language, and specifically in each variety or dialect of a language (see Footnote 4), is a crucial first step toward an adequate account of L2 perception, as L1 and L2 optimal perception define the initial state (Ingredient 2) and the end state or ultimate attainment (Ingredient 5) of L2 learning, respectively. To adequately and correctly predict and explain L2 development, the focus should be on the acoustic distributions of the target sounds and their closest L1 counterparts (closest in terms of acoustic-auditory proximity), together with their phonemic and allophonic status in the two languages (whether the sounds are lexically contrastive or not), but other factors such as the quantity and the quality of input and the learners’ cognitive capacity and skills are also relevant, as we shall see later.
The second ingredient of L2LP is the L2 initial state. The model’s Full Copying hypothesis, which derives from the Full Transfer hypothesis (Schwartz & Sprouse, 1996), states that listeners start with a copy or duplicate of their L1 optimal perception at the onset of L2 learning. This results in the listener having a separate system or grammar for each of their L1 and L2, through which the sounds of the L1 and the L2 are perceived, respectively. Listeners at this stage are called “naïve” because no L2 learning has taken place yet, and their perception of target-language sounds is commonly called “cross-linguistic” because L2 sounds are filtered by the L1. Note that both L1 linguistic representations and perceptual mappings are copied, which relates to the learning tasks in Ingredient 3.
Since the initial L2 grammar is seldom optimal for perceiving L2 sounds because of mismatches between optimal L1 and L2 perception, learners often struggle with misperception and miscommunication in the target language. The learners’ goal, then, is to modify the L2 grammar to solve the mismatch. Two kinds of learning tasks (Ingredient 3) are specified for this goal: a representational task to modify the number of categories (by forming new ones or disposing of existing ones), and a perceptual task to adjust the acoustic cue usage (by changing the weighting of Familiar cues and/or creating new mappings of Unfamiliar cues; see Footnote 5). The L2LP model proposes that three types of learning scenarios emerge depending on the task(s): Similar, New, and Subset. These are illustrated with examples in Figure 8.3 and explained in detail in the paragraphs that follow.

Figure 8.3 Three types of learning scenarios in L2LP.
The Similar scenario occurs when the same number of representations is involved across the two languages. For example, L1 Canadian English listeners’ perception of the L2 Canadian French /æ/–/ɛ/ contrast falls into this scenario (Escudero, 2009; see Footnote 6). While Canadian English also has /æ/ and /ɛ/, which differ in both F1 and duration (where /ɛ/ is generally shorter than /æ/), Canadian French /æ/ and /ɛ/ differ primarily in F1, with little durational difference. The weighting of the F1 and duration cues is thus different between the two languages. Consequently, Canadian English learners of Canadian French tend to misperceive durationally short tokens of L2 /æ/ as /ɛ/, relying on the heavier use of duration cues in their L1. The learners therefore have the perceptual task of adjusting the nonoptimal cue weighting to minimize the likelihood of L2 misperception. They do not have a representational task in this scenario because no addition or removal of categories is needed.
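The cue-weighting account of this misperception can be sketched with a toy weighted-distance classifier (all prototype values, weights, and normalizations below are invented for illustration, not estimates from the studies cited): under copied, duration-heavy L1 weights a durationally short token of L2 /æ/ lands on /ɛ/, whereas re-weighted, F1-heavy perception maps it correctly.

```python
# Toy illustration of the Similar scenario (all values invented): a copied
# L1 grammar weights duration heavily, so a short token of L2 /ae/ is
# misperceived as /eh/; re-weighting the cues repairs the mapping.

PROTOTYPES = {"ae": (700, 140), "eh": (550, 90)}  # (F1 Hz, duration ms)

def classify(f1, dur, w_f1, w_dur):
    """Pick the category minimizing a weighted, roughly normalized distance."""
    def dist(cat):
        pf1, pdur = PROTOTYPES[cat]
        return w_f1 * abs(f1 - pf1) / 100 + w_dur * abs(dur - pdur) / 50
    return min(PROTOTYPES, key=dist)

short_ae_token = (690, 95)  # good /ae/ F1, but durationally short in the L2

print(classify(*short_ae_token, w_f1=0.3, w_dur=0.7))  # L1-style weights: "eh"
print(classify(*short_ae_token, w_f1=0.9, w_dur=0.1))  # re-weighted: "ae"
```

The perceptual task of the Similar scenario amounts to moving from the first weighting to the second while leaving the category inventory untouched.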
The New scenario occurs when L2 representations outnumber L1 representations. Unlike the Similar scenario, this scenario poses a representational task because a new sound category needs to be formed for L2 optimal perception. There are two subscenarios of New that differ in the perceptual task: one that involves an Unfamiliar acoustic dimension and the other that involves only Familiar acoustic dimensions. An example of the Unfamiliar New scenario comes from L1 Iberian Spanish listeners’ perception of the L2 Southern British English /iː/–/ɪ/ contrast (Escudero & Boersma, 2004). This corresponds to a New scenario because the target L2 vowels, which contrast in both F1 and duration, map to the same L1 vowel /i/. The duration cue is Unfamiliar because Spanish does not employ duration for segmental contrasts (see Footnote 7). The learners’ perceptual task is to create completely new mappings (e.g., long versus short) on this “blank slate” or “uncategorized” acoustic dimension. The mappings are then integrated into an existing category to create new ones (e.g., long /i/ versus short /i/) to accomplish the representational task. On the other hand, the Familiar New scenario occurs when new perceptual mappings are created along acoustic dimensions already utilized in the L1. An example of this scenario is L1 Tokyo Japanese listeners’ perception of the L2 American English /ɛ/–/æ/–/ʌ/ contrast (Yazawa et al., 2023b). This is also New because the three L2 vowels map to two L1 vowels, /e/ or /a/. A notable difference from the case of Escudero and Boersma (2004) is that the learners’ L1, Japanese, has phonemic vowel length, unlike Spanish. Given that all relevant acoustic cues for vowel identity (F1, F2, and duration) are Familiar in the L1, the perceptual task is to alter the existing mapping patterns along the known acoustic dimensions. This would result in the splitting of an existing category (e.g., /a/) to yield a new one (e.g., /æ/), as part of the representational task.
Finally, the Subset scenario occurs when L1 representations outnumber L2 representations, whereby an L2 sound perceptually maps to more than one L1 representation. The L2LP model is currently the only model that addresses this mapping pattern, which Escudero and Boersma (2002) termed Multiple Category Assimilation (MCA). Examples of this scenario are L1 North Holland Dutch listeners’ perception of the L2 Iberian Spanish /i/ and /e/ (Boersma & Escudero, 2008; van Leussen & Escudero, 2015) and L1 Australian English listeners’ perception of the L2 Iberian Spanish vowels (Elvin & Escudero, 2019). For Dutch listeners, the Spanish vowels /i/ and /e/ perceptually map to /i/, /ɛ/, or /ɪ/ in the L1, thus resulting in MCA. Here, the listeners can have a representational problem where three categories are perceived instead of two, which could lead to spurious lexical contrasts (i.e., /i/–/ɪ/ or /ɪ/–/ɛ/) in the L2. Even when they “know” from textbooks that there are only two such vowels in Spanish, they have a perceptual problem: their initial L2 grammar cannot help but automatically map the relevant acoustic cues to three categories. Thus, learners have a representational task to unlearn the unnecessary categories and a perceptual task to alter the existing mappings so as not to perceive them (see Footnote 8).
The fourth ingredient of L2LP is L2 development, for which the proposal states that L2 learners have Full Access (Schwartz & Sprouse, 1996) to the general learning device of LP (which is computationally implemented by the GLA; see Section 8.2.3) throughout their lifetime. Thus, L2 perceptual learning is assumed to be fundamentally similar to L1 learning, which is distribution- and meaning-driven. Studies have shown that distributional learning has immediate and long-lasting effects on adult L2 learners (Escudero, Benders, & Wanrooij, 2011; Escudero & Williams, 2014). The meaning-driven nature of L2 perceptual learning becomes evident when its relationship with lexical development is considered (Section 8.3). The hypothesized full access to an L1-like learning device does not guarantee that L2 learning occurs as quickly and as effortlessly as L1 learning, however. In fact, researchers have long noted that adults progress more slowly than children in L2 perceptual learning. The L2LP model attributes age effects to cognitive plasticity, which peaks in youth and then gradually decreases as one gets older. Crucially, Escudero (2005) also argues that the role of input outweighs that of plasticity, which explains why learners of the same age and linguistic background may not follow an identical developmental path, since the quality and the quantity of input are modulated by various factors including motivation. These factors have significant implications for predicting the end state of L2 learning (Ingredient 5).
The model’s final ingredient proposes that all L2 learners can ultimately acquire L2 optimal perception regardless of their age, provided that sufficient and appropriate linguistic input is continually provided to the learner. This holds true for all three learning scenarios, though different scenarios can pose different levels of difficulty depending on the number of learning tasks. Specifically, it is proposed that the New scenario is the most difficult, followed by Subset and then by Similar, as forming new categories is considered more difficult than deleting or reusing existing ones (Escudero, 2005, p. 125). Note that the L1 grammar remains intact because L2 development occurs in a separate copy of the grammar (see Ingredient 2), unless L1 input diminishes or stops when moving to an L2-speaking environment, as implied by the weight that Ingredient 4 assigns to input for L2 development and L1 maintenance. Thus, L2LP predicts that learners can attain two separate optimal grammars for the two languages. This hypothesis may raise questions because bilinguals can show bidirectional interactions when they code-mix their two languages (Antoniou et al., 2011). The L2LP model explains such phenomena with the assumption of gradient and parallel activation of the two grammars, which derives from Grosjean’s (2001) language mode hypothesis. A recent L2LP study has confirmed perception modes in L1 Japanese learners of L2 American English, who adapt their cue weighting (duration versus F2/F1) for vowel perception depending on whether they listen to English or Japanese (Yazawa et al., 2020). Crucially, in this study, some learners showed L1–L2 intermediate cue weighting, implying that both grammars were activated to different degrees, as also shown previously by Boersma and Escudero (2008) and Escudero (2009). Within Ingredient 5, it is also proposed that, for ultimate attainment and successful performance, bilinguals need to master language control (Green, 1998) or selective inhibitory control (Friesen et al., 2015). This proposal can explain individual or group differences in performance and will be relevant for comparing the results of different types of bilinguals in Section 8.3.2.
In sum, the five ingredients of L2LP constitute a comprehensive model of how L2 perception is acquired, starting from the initial state (Full Copying of L1 optimal perception), through learning tasks (Similar, Unfamiliar/Familiar New, and Subset) and development (Full Access to an L1-like learning device mediated by input and plasticity), to the end state (L1 and L2 optimal perception activated in different degrees). While these theoretical components alone can explain and predict the outcome of various L1–L2 learning scenarios, the model’s predictive power is further reinforced by employing computational and statistical methods, as described next.
8.2.3 Computational Implementation of L2LP
A unique strength of L2LP is that its theoretical components can be computationally implemented to simulate L2 speech perception. While sometimes conflated, simulation should be distinguished from modeling (Maria, 1997). A model is a representation of the construction and working of a system of interest, and modeling is the process of building a model. For example, Escudero’s (2005) work concerned the modeling of L2 speech perception, which resulted in the L2LP model. A model should be a close approximation to the real system it represents, incorporating its salient attributes, but it should not be too complex to understand. A good model, therefore, is a trade-off between realism and simplicity (see Footnote 9).
A simulation, on the other hand, involves the operation or implementation of a model: the model is configured so that one can experiment with it virtually. Simulations can serve at least two purposes. First, one can validate a model by implementing it computationally under known conditions and comparing the output with the real system’s output. For example, L2LP’s Ingredient 2 (the initial state) can be tested by simulating a virtual listener who learns to perceive Spanish as their L1, and then comparing this virtual performance with real listeners’ perception data gathered experimentally. Second, simulations can predict the performance of a system under different configurations and over long periods of time, which would be too expensive or impractical to study in the real world. For example, the outcome of specific L2 learning environments can be predicted by reconfiguring the types of input and the learning period, such as one, three, six, and eighteen years of L2 Spanish input fed to an L1 Dutch grammar (Boersma & Escudero, 2008) or a few months versus a few years of L2 English input fed to an L1 Japanese grammar (Yazawa et al., 2020). The main incentive for computational modeling in L2LP is thus to provide a direct test of a hypothesis before conducting an empirical study, resulting in the formulation of more accurate predictions.
Although L2LP’s theoretical components can be implemented with various computational methods, the model has often utilized Stochastic OT (Boersma, 1998) and the GLA (Boersma & Hayes, 2001). More recently, neural networks have been used to extend these frameworks (van Leussen & Escudero, 2015). Stochastic OT is a probabilistic extension of OT that is used to represent the learner’s language-specific grammar. The GLA is an error-driven algorithm for learning optimal constraint rankings in Stochastic OT; it represents the learning device and has been shown to outperform other machine learning algorithms (Escudero et al., 2007). While we do not intend to provide detailed explanations of how Stochastic OT and the GLA work here, interested readers can find step-by-step instructions for implementing L2LP with these computational methods in the following studies: Yazawa et al. (2020) for a Similar scenario; Escudero and Boersma (2004) for an Unfamiliar New scenario; Yazawa et al. (2023b) for a Familiar New scenario; and Boersma and Escudero (2008) for a Subset scenario. These studies focus mainly on the acquisition of the cue constraints (see Figure 8.1), hence representing classic L2 perception research (i.e., cue-based segmental category identification and discrimination), but Boersma (2011) discusses how the phonological and lexical constraints can also be implemented. See also van Leussen and Escudero (2015) for how the constraints at different levels may interact and for an implementation using an approach more compatible with neural networks.
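To give a concrete flavor of these two components, the sketch below is a deliberately minimal toy of our own, not a published L2LP implementation: the constraint names, ranking values, noise, and plasticity settings are all invented for illustration. It shows the two core ideas, stochastic evaluation (Gaussian noise added to ranking values at evaluation time) and the error-driven GLA update (promote the constraint that penalizes the erroneous percept, demote the one that penalizes the intended category).

```python
import random

random.seed(1)

# Toy Stochastic OT grammar (illustrative only, not an actual constraint set):
# against[c] is the ranking value of the cue constraint "a [short] vowel is
# not /c/", which penalizes perceiving a short token as category c.
against = {"/i/": 100.0, "/i:/": 100.0}
NOISE, PLASTICITY = 2.0, 1.0

def perceive():
    # Stochastic evaluation: jitter each ranking value with Gaussian noise;
    # the category whose penalizing constraint lands lowest wins.
    noisy = {cat: val + random.gauss(0, NOISE) for cat, val in against.items()}
    return min(noisy, key=noisy.get)

def gla_step(intended):
    # Error-driven GLA update: on a mismatch, promote the constraint against
    # the erroneous percept and demote the constraint against the intended one.
    percept = perceive()
    if percept != intended:
        against[percept] += PLASTICITY
        against[intended] -= PLASTICITY

# Virtual learning: in this toy input, short tokens are always intended as /i/.
for _ in range(500):
    gla_step("/i/")

print(against["/i/"] < against["/i:/"])  # True: /i/ is now the likely percept
```

With the short vowel always intended as /i/ in the input, the gap between the two ranking values grows until misperceptions become rare, which is the GLA’s route to an optimal cue ranking.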
Other studies have utilized statistical methods to make L2LP’s predictions more specific (Curtin, Fennell, & Escudero, 2009; Elvin, Vasiliev, & Escudero, 2018b; Elvin, Williams, & Escudero, 2016). For example, Elvin et al. (2016) applied discriminant analysis to Australian English vowel production data to predict which acoustic cues (duration, formant means, and formant changes) would contribute to the identity of /iː/–/ɪ/–/ɪə/ and to what extent. The analysis, which has also been used for assessing cross-linguistic phoneme categorization (Escudero & Vasiliev, 2011), found that /ɪ/ can be durationally distinguished from /iː/ or /ɪə/, while formant changes were essential for distinguishing /iː/ and /ɪə/. The statistical model results accurately predicted real Australian English listeners’ perception (Williams, Escudero, & Gafos, 2018), which resembled simulation results using Stochastic OT and the GLA (Yazawa, 2020). The key point here is that quantification of theoretical predictions is a crucial component of the explanatory adequacy of L2LP, whether the adopted method is computational, statistical, or both.
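The logic of such an analysis can be sketched in a few lines of code. The data below are synthetic, generated from assumed means and variances rather than the published Australian English measurements; the point is only to show how one can quantify the extent to which each acoustic cue separates two vowel categories, here with a simple Fisher-style criterion rather than a full discriminant analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic /i:/ vs /I/ tokens with two toy cues: duration (ms) and F1 (Hz).
# The means and SDs are invented for illustration only.
dur = {"/i:/": rng.normal(250, 25, n), "/I/": rng.normal(130, 25, n)}
f1 = {"/i:/": rng.normal(300, 40, n), "/I/": rng.normal(330, 40, n)}

def separability(a, b):
    # Fisher-style criterion: squared mean difference over summed variance.
    # Larger values mean the cue separates the two categories better.
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

d_dur = separability(dur["/i:/"], dur["/I/"])
d_f1 = separability(f1["/i:/"], f1["/I/"])
print(d_dur > d_f1)  # True: duration dominates in this toy setup
```

A discriminant analysis generalizes this idea to several cues and categories at once, yielding per-cue weights that can then be compared with listeners’ perceptual cue weighting.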
8.3 Explaining Lexical Development within L2LP
We now present a series of new studies to highlight recent advancements within the L2LP framework. While earlier research tended to focus on prelexical perception by adult L2 learners, the new studies expand the scope of inquiry to include lexical development in monolinguals and a wider range of bilingual populations. This is a crucial step forward, given the importance of lexical recognition for speech communication (see Figure 8.1) and the diversity of bilinguals worldwide. Although previous studies had shown that L2LP can explain the interrelation between prelexical perception and lexical development (Escudero, 2005; Escudero, Broersma, & Simon, 2013; van Leussen & Escudero, 2015), it was assumed that lexical learning took place via one-to-one mappings between words and their referents. In Sections 8.3.1 and 8.3.2, we introduce a novel word learning paradigm that more closely resembles the real world, where word-referent mappings are ambiguous, and test the L2LP proposal for explaining lexical encoding of minimal pairs in different types of bilinguals.
8.3.1 Learning to Distinguish L2 Sounds in Context: The Case of Minimal Pairs
Previous L2LP studies on lexical development of phonological contrasts employed a word learning paradigm in which each novel word is explicitly and unambiguously paired with its corresponding referent. The method involves a learning phase, where participants are presented with a picture of a novel object in tandem with the object’s auditory form, followed by a testing phase, where they hear one of the learned words and select the corresponding visual object. Many studies using this paradigm have shown that adults and children can learn minimal pairs, that is, words that are distinguished by a phonological contrast, in their L1 or in a subsequent language (Escudero, 2015; Escudero et al., 2013, 2014a; Escudero, Hayes-Harb, & Mitterer, 2008; Escudero & Kalashnikova, 2020; Giezen, Escudero, & Baker, 2016). Results also show that word recognition accuracy is linked to how well the phonological distinction is perceived, confirming the L2LP proposal of a tight relationship between prelexical perception and lexical development. The mechanism underlying this rapid word learning ability is commonly called fast mapping (Escudero et al., 2023).
However, this type of explicit and intentional learning does not entirely reflect how the learning of new words proceeds in more naturalistic and immersive environments. Specifically, everyday situations typically pose high levels of ambiguity because a novel word may appear alongside many potential referents (Mulak, Vlach, & Escudero, 2019; Yu & Smith, 2007). Real-world ambiguity can be resolved by drawing conclusions from statistical regularities across instances or situations where the same word is presented, a mechanism known as cross-situational word learning (CSWL; Escudero et al., 2016a, 2016b, 2023; Yu & Smith, 2007). Studies have shown that adults (Angwin et al., 2022; Escudero et al., 2016a, 2016b; Mulak et al., 2019) and children (Escudero, Mulak, & Vlach, 2016c; Pino Escobar et al., 2023; Smith & Yu, 2008) use CSWL to learn words in their L1 and in subsequent languages (Escudero et al., 2016b, 2022; Junttila & Ylinen, 2020; Tuninetti, Mulak, & Escudero, 2020). Importantly, CSWL differs from the incidental word learning paradigms used in previous L2 vocabulary learning studies in that CSWL is not only unintentional but also ambiguous (see Escudero et al., 2023).
Escudero et al. (2016b) were the first to apply the CSWL paradigm to the learning of minimal pairs, demonstrating that adult monolingual Australian English listeners could track word-referent co-occurrences while at the same time attending to phonetic/phonological distinctions in spoken words. Participants were shown the eight words in Figure 8.4 (left) in pairs that formed nonminimal pairs (e.g., /bɔn/–/dit/), vowel minimal pairs (e.g., /dit/–/dɪt/), or consonant minimal pairs (e.g., /bɔn/–/tɔn/) in English. The experiment consisted of learning and testing phases (Figure 8.4, center and right). During learning, participants were presented with a series of trials with two auditory words and two visual objects, without any instruction about the nature of the task or the correspondence between the words and the objects. Each trial was ambiguous because the order of presentation of the auditory words was not synced with that of the visual objects. Participants were then asked to identify word-object mappings in the test phase. Performance at test was above chance for all pairs, but vowel minimal pairs were learned less accurately than consonant and nonminimal pairs, with no difference between the latter two. These findings suggest that phonological-lexical encoding may be weaker for vowels than for consonants, at least in Australian English monolinguals, and indicate that this unintentional and ambiguous paradigm can be used to explore the link between prelexical perception and lexical development proposed in L2LP in naturalistic environments (see Escudero et al., 2022; Escudero & Hayes-Harb, 2022). The obvious next step was to examine how bilinguals fare at encoding phonological detail in ambiguous, everyday situations.

Figure 8.4 Illustration of the CSWL paradigm.
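The statistical logic that makes such learning possible despite trial-level ambiguity can be illustrated with a toy simulation of our own; the word forms, referent labels, and trial numbers are invented and do not reproduce the actual stimuli or design. A learner who merely accumulates word-object co-occurrence counts across trials can recover the correct mappings, because each word co-occurs with its true referent on every trial in which it appears, while spurious pairings recur only sporadically.

```python
import random
from collections import Counter
from itertools import product

random.seed(0)

words = ["bon", "ton", "dit", "deet"]      # hypothetical word forms
truth = {w: w.upper() for w in words}      # true referent of each word

counts = Counter()
for _ in range(60):                        # 60 ambiguous learning trials
    shown = random.sample(words, 2)        # two words per trial
    objects = [truth[w] for w in shown]    # their objects, order unlinked
    for w, o in product(shown, objects):   # learner can't tell which is which,
        counts[(w, o)] += 1                # so it counts every pairing

# At test, pick the object with the highest co-occurrence count per word.
learned = {w: max(truth.values(), key=lambda o: counts[(w, o)]) for w in words}
print(learned == truth)  # True: the statistics disambiguate the mappings
```

Each true pairing is counted on every trial containing its word, whereas any spurious pairing is counted only when two particular words happen to co-occur, so the correct mapping dominates after enough trials.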
8.3.2 Different Bilinguals Learning Minimal Pairs: Advantages and Disadvantages
Most L2LP studies, and most studies previously conducted within the field of L2 phonetics and phonology, show that monolinguals outperform sequential bilinguals in their target language, which has led researchers to conclude that the “optimal” end state in L2 acquisition is very hard to achieve. Additionally, the idea that monolinguals outperform bilinguals has been confirmed in most studies on lexical processing (see Gollan & Kroll [2001] for a review). Contrary to this common belief, Escudero et al. (2016b) found that sequential bilinguals showed word learning performance comparable to that of monolinguals when tested in the same CSWL task described in Section 8.3.1. One reason for this discrepancy may be that the CSWL task allows sequential learners to perform well regardless of their linguistic background, yielding results that differ both from those gathered with more conventional tasks and from those predicted by the L2LP model. However, another CSWL study (Tuninetti et al., 2020) has shown that the relationship between L1 and L2 phonemes predicts the difficulty with which Australian English listeners learn Dutch and Portuguese words, indicating that CSWL results are in line with the L2LP proposal that perceptual difficulty correlates with word learning and recognition difficulty depending on the listeners’ linguistic background.
Alternatively, the composition of the bilingual group may have influenced the results. That is, the sequential bilinguals in Escudero et al. (2016b) came from diverse linguistic backgrounds and had diverse onsets of acquisition for their L2 English, and including this type of “heterogeneous” bilinguals may have obscured differences when compared to monolinguals. To test the L2LP developmental proposal that both perceptual difficulties related to linguistic background and acquired proficiency play a role in CSWL of minimal pairs, two studies were conducted using the same method reported in Escudero et al. (2016b), summarized at the end of Section 8.3.1. The first study tested simultaneous Mandarin-English bilinguals in Singapore (Escudero et al., 2016a), while the second tested a group of homogeneous sequential bilinguals with L1 Mandarin who started learning L2 English at school and resided in Shanghai or Sydney at the time of testing (Escudero et al., 2022). As expected, opposite group results were found: simultaneous bilinguals performed better overall than monolinguals, while sequential bilinguals showed overall lower performance than monolinguals.
Both results are explained by the bilinguals’ linguistic background and their acquired proficiency. The L2LP model’s explanation for the better performance of simultaneous bilinguals relates to the inclusion of language control and selective inhibition as part of ultimate attainment (see Ingredient 5), which states that with high proficiency and continuous input from both languages, a bilingual can perform on a par with a monolingual of either language. It seems that the heterogeneous group of sequential bilinguals in Escudero et al. (2016b) had high enough L2 proficiency and did not activate L1 features that could have negatively affected their performance. In the case of the simultaneous bilinguals in Escudero et al. (2016a), their “advantage” in overall word learning is explained by their heightened ability to selectively inhibit or suppress the irrelevant language, which may enable higher performance when coping with the ambiguity of a CSWL task. This L2LP proposal is in line with studies showing a bilingual advantage for simultaneous bilinguals depending on their levels of inhibitory control, an ability connected to domain-general executive functioning (Friesen et al., 2015; Pino Escobar, Kalashnikova, & Escudero, 2018). Thus, the L2LP explanation extends the bilingual advantage to the domain of statistical learning of minimal pairs, which involves the encoding of phonological distinctions.
In contrast, the overall “disadvantage” of the homogeneous sequential bilinguals is explained by the activation of an L1 Mandarin linguistic feature, namely contrastive lexical tones, due to the pitch variations in the stimuli presented, since no negative evidence against the use of tonal contrasts was provided in the CSWL task (Escudero et al., 2022). This possibility may have been enhanced by the fact that the words in the study were produced in infant-directed speech (IDS), whose properties can facilitate the learning of phonetic contrasts in adults and children (Escudero & Williams, 2014; Golinkoff & Alioto, 1995; Graf Estes & Hurley, 2013). However, since IDS has more variable pitch than adult-directed speech across languages (Igarashi et al., 2013), the English words produced in IDS likely sounded as though they had different lexical tones to L1 Mandarin ears, challenging their word-referent mappings. A similar effect was found by Smit, Milne, and Escudero (2022), in which participants’ music perception abilities negatively influenced their learning of English vowel minimal pairs via CSWL, presumably because of their enhanced sensitivity to pitch variations in vowels. Escudero et al. (2022) explained that hearing tones in the English words could have resulted in these sequential bilinguals’ MCA of the vowels in the words, leading to a Subset scenario with spurious lexical contrasts and poorer performance. These findings suggest that not only segmental but also suprasegmental details should be considered in predicting and explaining vowel perception and word learning (Escudero et al., 2018; Escudero & Kalashnikova, 2020).
8.4 Remaining Issues and Future Directions
The recent studies reviewed so far demonstrate that L2LP offers adequate explanations regarding the relation between speech perception and lexical development in diverse bilingual populations. Here we review other important issues that the model can explain, such as the role of orthography and speech production (Sections 8.4.1 and 8.4.2), as well as applications to language training and curriculum design for ultimate attainment (Section 8.4.3).
8.4.1 The Role of Orthography
Many studies have shown that orthography influences speech processing in bilinguals (see Bassetti, Escudero, & Hayes-Harb, 2015). Studies within L2LP have shown that the availability of orthographic forms as input to bilinguals can have various influences on speech perception and word learning, both facilitative and detrimental (Escudero, 2015; Escudero et al., 2008, 2014a; Escudero & Wanrooij, 2010). For instance, Escudero et al. (2014a) demonstrate that congruence between L2 learners’ orthographic systems influences performance in word learning, as the orthography of the learners’ dominant language is activated when reading L2 words (Escudero, 2015; Escudero et al., 2008). For both prelexical perception and lexical recognition, it has been shown that learning is facilitated when the learners’ two orthographic systems match, but more challenging when they do not. Also, CSWL is more accurate for both monolinguals and bilinguals when words are presented orthographically rather than auditorily, suggesting that visual information facilitates unintentional and ambiguous word learning (Escudero et al., 2023). Thus, the role of orthography is clearly prominent in bilingual phonetics and phonology.
The L2LP model assumes that bilinguals’ mental lexicon contains phonological and orthographic representations of speech, based on much previous research attesting to the role of orthography in bilingual speech processing (Escudero, 2015; Escudero et al., 2023; Escudero & Wanrooij, 2010). However, how exactly orthography fits within L2LP’s architecture (Figure 8.1) is yet to be formally modeled and computationally simulated. Ongoing research aims at bridging this gap.
8.4.2 Speech Production
There have also been attempts to extend L2LP to speech production (Elvin et al., 2018b; Elvin, Williams, & Escudero, 2020), as the model claims that perception precedes production and is a prerequisite for the development of production skills (Escudero, 2007, p. 110). Unlike other models of L2 speech acquisition that predict no “mastery” of L2 production, L2LP’s Ingredient 5 predicts that L2 learners can ultimately attain optimal perception (and, by extension, production). To test this hypothesis for production, Yazawa et al. (2023a) examined 102 adult L1 Japanese speakers’ productions of L2 American English monophthongs using an L2 English speech corpus called J-AESOP (Kondo, Tsubaki, & Sagisaka, 2015). All learners were late sequential bilinguals who had been learning English since the age of thirteen in Japanese schools and had never lived outside Japan. Despite this uniform linguistic background, the learners exhibited diverse levels of attainment, with some (if not most) showing near-native productions across all vowel categories, regardless of the perceptual similarity between particular L1 and L2 sounds. The result is consistent with L2LP’s prediction and provides a promising extension of the model to speech production.
Liu and Escudero (2023) applied L2LP’s predictions to the influence of dialectal variation in production, finding that those who speak two dialects of the same L1 had overall better performance in L2 production tasks than those who speak only one L1 dialect, despite their similar L2 learning backgrounds. This implies that the divergent performance of different types of bilinguals (Section 8.3.2) extends to “bidialectal” populations, confirming L2LP’s proposal to focus on each specific variety of a language and suggesting that the proposed inhibitory control advantages may also apply to the control of two dialects (Section 8.2.2). Further research can help to better understand how bilingualism and bidialectalism compare.
8.4.3 Applying L2LP to Language Training and Curriculum Design
Finally, L2LP’s theoretical proposals have significant implications for language learning and training, as detailed by Elvin and Escudero (2019). Specifically, its ingredients can be used to identify the specific difficulties learners may have because of cross-linguistic/dialectal differences and to predict their further development. The following are just a few examples of how L2LP can be applied to language training and curriculum design.
Many studies within L2LP have capitalized on the distributional nature of perceptual learning to demonstrate that difficult phonetic contrasts can be accurately perceived after very short exposure to the most frequent sound exemplars of a phonetic continuum (Escudero et al., 2011). It has been shown that distributional training can enhance the perception of difficult vowel and tone contrasts (Ong, Burnham, & Escudero, 2015), that individual differences prior to training modulate success (Wanrooij, Escudero, & Raijmakers, 2013), and that Similar contrasts are easier to train than New contrasts (Chládková, Boersma, & Escudero, 2022, which also reviews neurophysiological studies testing L2LP). Escudero and Williams (2014) also demonstrated that the effects of distributional learning in adult L2 learners are long-lasting, remaining for more than a year after training.
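The core mechanism of distributional training can be sketched as follows; this is an illustrative toy, not the design of any of the cited studies, and the continuum steps, token frequencies, and learning rule are all assumed. Tokens along an eight-step continuum are presented with a bimodal frequency distribution, and a simple learner that drags two category means toward incoming tokens ends up with categories centred near the two frequency peaks.

```python
import random

random.seed(3)

# Presentation frequencies per continuum step (bimodal: peaks at steps 2 and 7).
steps = list(range(1, 9))
freqs = [1, 4, 2, 1, 1, 2, 4, 1]           # illustrative token counts
tokens = [s for s, f in zip(steps, freqs) for _ in range(f)]
random.shuffle(tokens)                      # randomized exposure order

# A naive distributional learner: two category means, each nudged toward
# every incoming token it wins (a 1-D competitive-learning rule).
means, rate = [3.0, 6.0], 0.1               # assumed starting guesses
for t in tokens * 20:                       # repeated short exposure
    i = 0 if abs(t - means[0]) <= abs(t - means[1]) else 1
    means[i] += rate * (t - means[i])

# The two means settle near the lower and upper frequency peaks.
print(means[0] < 4 < means[1])  # True
```

A unimodal distribution over the same continuum would instead pull both means toward the centre, which is why exposure statistics alone can sharpen or blur a two-category distinction.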
The CSWL task can be used to teach real words to L2 learners at different developmental stages. Tuninetti et al. (2020) show that learners can easily learn twelve to eighteen words within a learning session, suggesting that this paradigm could be quite successful for classroom learning or self-paced learning at home, as suggested in Escudero et al. (2023).
Regarding L2 production training, Colantoni et al. (2021) devised innovative perception and production exercises for beginner university-level L2 learners of Spanish. The proposed teaching materials are based on key principles, such as a focus on features that have a high functional load and are shared by most varieties of the target language, which are direct applications of L2LP.
8.5 Conclusion
The L2LP model is a comprehensive theoretical framework for how people learn to perceive, recognize, and produce the phonetics and phonology of multiple languages and/or dialects, simultaneously or sequentially. The model has unique strengths, such as powerful computational and statistical mechanisms for precisely predicting learning outcomes, as well as the ability to explain previously understudied issues, including the bilingual/bidialectal (dis)advantage and the interrelation between prelexical perception and lexical development, with new studies extending the model to orthographic influences, speech production, and language training and curriculum design. We hope that readers find L2LP useful in deepening their understanding of bilingual phonetics and phonology and in promoting explanatory adequacy for modeling language acquisition in this specific area and beyond.
9.1 Overview
Research over the past forty years has revealed that listeners perceive non-native speech sounds through the lens of their native (first) language (L1). Even after years of experience, second-language (L2) learners may continue to show speech perception patterns influenced by the L1. The Automatic Selective Perception (ASP) model, described in this chapter, was developed to explain the conditions influencing L2 speech perception (Strange, 2011; Strange & Shafer, 2008). Sections 9.2 and 9.3 present basic terminology and key observations that are needed to understand speech perception and the motivations for proposing the ASP model. Section 9.4 presents the ASP model and identifies its limitations. Section 9.5 presents pertinent developmental and neurophysiological studies that support modification of the ASP model. Section 9.6 introduces a revised ASP model, and Sections 9.7 and 9.8 address future directions and take-home points.
9.2 Key Constructs
Considerable research has led to important advances in our understanding of speech perception (Diehl, Lotto, & Holt, 2004). Even so, our knowledge of the processes supporting speech perception remains incomplete. Winifred Strange proposed the ASP model to explain gaps in our knowledge, including why L2 learners rarely show L2 speech perception performance that matches that of L1 listeners. Strange and Shafer (2008) made the following observations:
1. Speech perception involves selection and integration of multiple acoustic cues, which supports categorization (discrimination and identification) of phonetic information into phonological categories and access to lexical and syntactic categories.
The aim of this research over the past seventy years has been to identify the representations and processes that allow fast and efficient speech perception (Diehl et al., 2004; Liberman & Mattingly, 1985). The culmination of this research indicates that no simple, invariant set of acoustic cues supports this process, and it has led to the suggestion that gestural rather than acoustic targets are the “invariant cue” (Browman & Goldstein, 1992). These early studies, however, focused primarily on identifying the cues (i.e., the nature of the representations), with less research addressing the question of the processes underlying speech perception.
2. Speech perception abilities change from birth to adulthood via experience.
Infants are equipped with basic perceptual sensitivities that allow them to tune perception of speech to the complex acoustic-phonetic patterns of the native-language input. At birth, they have basic abilities to discriminate a range of speech patterns (Aslin et al., 1981; Kuhl et al., 1992; Werker & Tees, 1983). These patterns are then modified via experience with the ambient language over the course of the first few years of life (Best et al., 2016).
Speech perception, however, is not fully mature until later in childhood. During the preschool years, children’s phonological representations become more detailed (Werker et al., 2002), and they continue to mature into late childhood (e.g., Nittrouer & Miller, 1997; Sundara, Polka, & Genesee, 2006).
3. For adults, speech perception of a native language is robust and automatic.
Adult listeners are highly adept at perceiving native-language speech patterns, even under increasingly difficult task and stimulus conditions, such as divided-attention tasks and conditions of noise (e.g., background babble) (Strange, 2011). The ease of native-language speech perception highlights the challenges of non-native speech perception (e.g., MacKain, Best, & Strange, 1981). For example, native listeners of English find it challenging to categorize the non-native Hindi dental [d̪] versus retroflex [ɖ] consonants but easy to categorize English /v/ and /w/, whereas native Hindi listeners show the opposite pattern (Grover et al., 2021; Werker & Lalonde, 1988).
We add the following observation that is key for explaining speech perception and understanding the ASP model:
4. Speech perception is supported by attentional and memory mechanisms.
Attention is a highly complex construct that includes concepts such as sustained attention, orienting, and focused attention. Focused attention is needed to integrate (bind) information that is not explicitly represented and to resolve competition among multiple targets (Koch, 2004, p. 166). Sustained attention (arousal) is necessary in difficult tasks to maintain focused attention. Explicitly represented information is salient and has a “pop-out” effect (e.g., a dog’s bark among a background of birds chirping); it is available for cognitive operations (manipulable) without the need for focused attention to bind the information into an “object” for perception. Some feature or object categories are genetically predetermined to be explicit. More subtle patterns of information can become explicit through overlearning, which requires multiple presentations of the stimulus or action.
Memory is also a complex construct, which can be broken down into several types, including iconic memory, working/short-term memory, and long-term memory (Cowan, 2016). Speech sounds are initially encoded in an iconic form of memory that decays rapidly, within a matter of seconds. Speech information can be maintained in working memory (phonological memory) via rehearsal processes. Long-term memory holds phonological representations that have been learned via experience.
9.3 Factors Underlying L2 Speech Perception
Speech perception in an L2 is highly variable compared to L1 speech perception. Research has identified a number of factors modulating speech perception, including L1–L2 phonological differences, contrast salience, age of acquisition (AoA), and amount of experience in the L2. In addition, task and stimulus complexity contribute to variability in performance. These are described in Section 9.5.
The Perceptual Assimilation Model (PAM and PAM-L2) and the Speech Learning Model (SLM) reveal that a major portion of the variance in non-native/L2 perception (and production) can be explained in terms of differences between the phonological systems of the two languages (contrastive analysis) (Best & Tyler, 2007; MacKay et al., 2001; Chapter 7, this volume). Phonological differences between the L1 and the L2, however, do not fully predict speech perception performance.
Salience of speech contrasts also appears to affect speech perception performance. Non-native contrasts range from relatively easy to very difficult for naïve listeners to discriminate (Strange, 2011). Burnham described phonetic contrasts in terms of “robustness” and “fragility” (Burnham, 1986); those that are more fragile (less salient) will be most likely to challenge non-native and L2 listeners. Saliency can be difficult to define, but linguists have generally considered contrasts that are rare in the world’s languages and later-learned during L1 acquisition to be more fragile (often called “marked”) (Strange, 2011). Contrast saliency can also be characterized in psychoacoustic terms. For example, /æ/ and /ɑ/ differ spectrally to a greater extent than /ʌ/ versus /ɑ/ (Shafer et al., 2021). The point vowels /i/, /a/, and /u/ are maximally distinct in acoustic and articulatory properties (the first and second formants differ greatly, driven by differences in gestures of the tongue, lips, and jaw) and, thus, these contrasts should be highly salient and easy to discriminate for both native and non-native listeners. It is likely that the commonality of speech sounds across the world’s languages and the difficulty of acquisition are grounded in physical factors (both articulatory and acoustic).
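This psychoacoustic notion of salience can be given a rough quantitative form. The sketch below uses approximate average formant values for American English vowels (round figures of the kind reported in classic vowel surveys, included here only for illustration) and measures contrast distance as Euclidean distance in the F1/F2 plane; a perceptually motivated scale such as Bark or mel would be a refinement of the same idea.

```python
import math

# Approximate average (F1, F2) values in Hz -- illustrative figures only.
formants = {
    "/i/": (270, 2290),
    "/u/": (300, 870),
    "/ɑ/": (730, 1090),
    "/ʌ/": (640, 1190),
}

def distance(v1, v2):
    # Euclidean distance in the F1/F2 plane as a crude salience proxy.
    (f1a, f2a), (f1b, f2b) = formants[v1], formants[v2]
    return math.hypot(f1a - f1b, f2a - f2b)

# Point-vowel contrasts are acoustically far apart; /ʌ/-/ɑ/ is much closer,
# one sense in which it is a less salient ("more fragile") contrast.
print(distance("/i/", "/ɑ/") > distance("/ʌ/", "/ɑ/"))  # True
```

On these values the /i/-/ɑ/ distance is roughly an order of magnitude larger than the /ʌ/-/ɑ/ distance, matching the intuition that point-vowel contrasts should be easy for both native and non-native listeners.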
Researchers define AoA as the age at which a learner first begins to learn the L2. Learners who begin the L2 late in life often find that L2 categories, particularly “fragile” ones, remain difficult to perceive even after years of experience (e.g., Levy & Strange, 2008). Studies have generally observed a strong relationship between AoA of an L2 and accuracy in production and perception of “fragile” L2 speech contrasts that do not match L1 categories (Flege, Frieda, & Nozawa, 1997). Specifically, poorer L2 performance is typically observed with later onset of L2 acquisition.
Amount of L2 input has shown a smaller effect on L2 speech perception than AoA. However, Flege and Liu (2001) argue that amount of input may have a greater effect than previously claimed by studies that crudely measured input in terms of number of years in the L2 country. Specifically, they suggested that the better perception of English contrasts by immigrants from China who were students in a university setting, compared to those in other occupations, was due to higher levels of native input in this setting (Flege & Liu, 2001).
The crucial insight underlying the development of the ASP model came from the observation that task and stimulus factors modulate the performance of individual L2 learners, adding variability that cannot be accounted for by the factors described already. Task and stimulus factors strongly influence non-native performance, whereas they have a much weaker effect on native-language speech perception. Non-native and L2 listeners often perform relatively well in the simplest tasks with the simplest materials. For example, stimulus uncertainty is minimized in a discrimination task where the target (X) is always compared to a base stimulus “A” that occurs in the same relationship (AX or AAAX). The behavioral discrimination task in Hisagi, Garrido-Nag, et al. (2015) illustrates how task difficulty modulates performance; they found that for Spanish late learners of English, discrimination of the English vowel contrast /ɪ/ versus /ɛ/ in an AX task was quite good (mean 89 percent), but identification (labeling the stimulus A or B) was at chance. Native listeners, however, were less affected by task.
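The contrast between good AX discrimination and chance identification can be quantified with the standard signal-detection measure d′. The hit and false-alarm rates below are hypothetical, chosen only to echo the pattern just described (they are not the actual Hisagi, Garrido-Nag, et al. data):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Signal-detection sensitivity: d' = z(hits) - z(false alarms).
    Rates are clipped away from 0 and 1 so the z-transform stays finite."""
    def z(p: float) -> float:
        return NormalDist().inv_cdf(min(max(p, 0.01), 0.99))
    return z(hit_rate) - z(fa_rate)

# Hypothetical rates echoing the reported pattern: AX discrimination
# well above chance, identification at chance.
print(round(d_prime(0.89, 0.11), 2))  # 2.45
print(round(d_prime(0.50, 0.50), 2))  # 0.0
```

A d′ of 0 corresponds to chance performance; values above roughly 1 indicate reliable discrimination, so the same listeners can look sharply different depending on the task.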
9.4 ASP Model
The ASP model is built on the fundamental claim that native-language speech perception is accomplished via selective perception routines (SPRs). Crucially, native-language SPRs are overlearned and do not require focused attention; these SPRs allow for extraction of the most reliable information from the speech signal in a fast and efficient manner even under suboptimal conditions (Strange, 2011, p. 460). These native-language SPRs are sensitive to contextual information (that is, coarticulatory patterns in a language). For example, American English (AE) speakers often front /u/ in the context of an alveolar consonant (e.g., “too”); AE listeners’ SPRs will automatically place this fronted vowel [y] (an allophone of /u/) into the /u/ category (Levy & Strange, 2008).
A second, important assumption of the ASP model is that focused attention is necessary for listeners to perceive fine-grained phonetic details of speech that are irrelevant for recovering lexical and grammatical meaning (Footnote 1). Evidence for access to this fine-grained information is seen in listeners’ ability to detect accentedness. Certain laboratory tasks induce listeners to enter this mode, such as “goodness” rating tasks and “accentedness-judgment” tasks. For example, in perceptual assimilation tasks, listeners are asked to judge which native-language category is closest to a target non-native speech sound and then to rate the goodness of the exemplar as a member of the chosen category. Differences in ratings confirm that listeners can detect fine-grained phonetic details.
Native-language SPRs can be described as engaging a “phonological mode” of speech perception. In contrast, access to fine-grained phonetic details is described as a “phonetic mode.” These different modes of processing are encouraged under different task conditions. Fine-grained phonetic details decay in iconic memory and are not available at long interstimulus intervals (ISIs). In contrast, phonological information is stored in long-term memory and, thus, is available at long ISIs. Werker and Logan (1985) provided evidence for phonetic versus phonological modes of processing; they demonstrated that within-category phonetic differences between a dental and a retroflex place of articulation can be discerned at ISIs of 500 ms, but not 1,500 ms. At the longer ISI, native English listeners categorize both variants as falling within the category /ta/ (see Yu, Shafer, & Sussman, 2017).
A third assumption of the ASP model is that at the beginning stages of L2 learning, listeners must engage phonetic-mode processing to identify how the L2 differs from L1 patterns to allow L2 learning. Therefore, focused attention is necessary at the novice stage of L2 speech perception. Research indicates that L2 learners can detect whether L1 and L2 phones are “identical” versus “similar” versus “novel” (Flege, 1995).
The ASP model allows for the possibility that L2 listeners can learn to perform well in perceiving L2 speech; that is, L2 learners can establish novel SPRs to support category decisions. Some speech sound contrasts, however, remain difficult even for highly proficient L2 learners, particularly under difficult task or stimulus conditions. For example, advanced AE learners of Parisian French continue to have problems categorizing [y]/[u] in both bilabial and alveolar phrase-level contexts (Flege, 1987; Levy, 2009a, 2009b; Levy & Strange, 2008).
The original ASP model did not address how SPRs develop from infancy to adulthood. The developmental literature reveals that L1 SPRs are not adult-like even by seven years of age (Shafer, Yu, & Datta, 2010). Development includes changes in social and cognitive systems (e.g., Morgan, Curtin, & Botting, 2021), in addition to language-specific experience. Thus, changes in performance on speech perception tasks with age could be related to developmental changes in multiple systems. The ASP model is also limited to a description of behavioral perception, with only a few predictions focusing on the neural level (Strange, 2011). Expanding the ASP model to address developmental and neurobiological factors will lead to novel hypotheses.
9.5 Speech Perception Development
Task complexity and contrast salience have been shown to modulate whether infants show behavioral discrimination of native-language speech contrasts. This finding suggests that infant speech perception requires attentional resources. For example, fourteen-month-old infants failed to show perception of a /bɪ/ versus /dɪ/ “word” contrast in a “switch” task in which they had to learn to associate each syllable with a novel image (Pater, Stager, & Werker, 2004); this age group, however, can discriminate /b/ versus /d/ in tasks with lower cognitive load. In the switch task, infants learn to associate each syllable with an image of an object over repeated trials. After habituation (a drop in looking time below 65 percent), infants are presented with a change and a no-change (control) trial. Perception of the difference between the two target sounds is inferred from an increase in looking time on the change trial compared to the no-change trial. For the fine-grained place difference between /b/ and /d/, infants do not succeed in the switch task until seventeen months of age (Werker et al., 2002).
Stimulus salience, however, also modulates task performance in infants. Using the contrast /l/ versus /r/ in “leet” versus “reet,” fourteen-month-olds were able to detect a mismatch in the switch task, but only when the contrast was sufficiently long (Archer & Curtin, 2018). Archer and Curtin (2018) argue that the /r/ versus /l/ difference was more salient than the /b/ versus /d/ difference.
A prediction derived from the ASP model is that improved performance with increasing age on speech perception tasks is due to increasingly detailed phonetic representations via overlearning. By eighteen months of age, the phonetic detail is sufficient to support discriminating the difference between /b/ and /d/ without focused attention. That is, speech perception performance is less affected by task at older ages because the fine-grained information has become explicit. This explanation does not preclude maturation of attentional skills also influencing task performance. However, attentional skills are expected to have a much weaker effect on native-language speech perception with increasing age because these representations are sufficiently detailed to support automatic recovery of phonological identity.
Under this view, perception of fine-grained phonetic contrasts will fail without the aid of focused attention during the first year of life. With increasing experience, these representations become sufficiently detailed to allow automatic perception of the relevant information. Large acoustic differences allow infants to succeed at a younger age in the more complex switch task because focused attention is not needed (Footnote 2). Neural measures can help pinpoint the timeline for development of explicit, detailed phonetic representations, as will be illustrated next.
9.5.1 Neurobiology
An ideal method for studying the brain mechanisms underlying speech perception is the electroencephalogram (EEG). The EEG reflects the summation of excitatory and inhibitory postsynaptic potentials across large populations of cortical neurons, and can be measured at the scalp. The timing and the topography (location on the scalp) of EEG modulations to a stimulus allow inferences about underlying brain mechanisms. Averaging portions of the EEG that are time-locked to a stimulus of interest, yielding event-related potentials (ERPs), is a common method used to isolate neural processes associated with the stimulus (Luck, 2014). Non-time-locked modulations can be examined using approaches such as time-frequency analysis (Wagner et al., 2022).
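The time-locked averaging step can be sketched as follows. The continuous signal, epoch window, and onset times are all synthetic, chosen only to show how averaging pulls an evoked response out of background noise:

```python
import random

random.seed(0)  # deterministic synthetic data

def extract_epochs(eeg, onsets, pre=10, post=50):
    """Slice continuous EEG into stimulus-locked epochs
    (pre samples before each onset to post samples after it)."""
    return [eeg[t - pre:t + post] for t in onsets
            if t - pre >= 0 and t + post <= len(eeg)]

def erp_average(epochs):
    """Point-by-point average across epochs: time-locked activity
    survives while non-time-locked activity averages toward zero."""
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]

# Synthetic example: a fixed evoked response embedded in random noise.
evoked = [0.0] * 10 + [1.0] * 20 + [0.0] * 30    # 60-sample template
onsets = list(range(100, 6000, 120))             # 50 stimulus onsets
eeg = [random.gauss(0, 1) for _ in range(6100)]  # background "EEG"
for t in onsets:                                 # add the evoked response
    for i, v in enumerate(evoked):
        eeg[t - 10 + i] += v

erp = erp_average(extract_epochs(eeg, onsets))
# Averaging 50 epochs shrinks the noise by ~1/sqrt(50), so the evoked
# plateau (samples 10-29) stands out against the near-zero baseline.
print(max(erp[10:30]) > 0.5)  # True
```

Real ERP pipelines add filtering, baseline correction, and artifact rejection, but the core logic is this time-locked average.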
Studies using electrocorticography (ECoG), undertaken in patients with intractable epilepsy, suggest that regions in the superior temporal gyrus (STG) are likely to be the locus of the language-specific representations that form the basis of SPRs (Yi, Leonard, & Chang, 2019). These studies have identified STG regions that correspond to feature categories such as obstruent/sonorant, manner of articulation, place of articulation, and voicing (Moerel, De Martino, & Formisano, 2014; Steinschneider, Nourski, & Fishman, 2013; Yi et al., 2019).
The ERPs used to study speech perception include auditory evoked potentials (AEPs), which are obligatory responses to acoustic input that occur in the first 200 ms from onset and have sources that are likely to correspond to those identified with ECoG methods. The AEPs named P1-N1-P2 are recorded at frontal-central sites (with a source in the superior plane of the STG), and T-complex AEPs are recorded over lateral sites (presumed to have sources in the lateral plane of the STG). Some studies suggest no effect of language-specific experience on AEPs (Hoonhorst et al., 2009; Sharma & Dorman, 2000); however, other studies do show modulation related to language experience (Rinker et al., 2017; Wagner et al., 2013) (Footnote 3). Language-specific modulation of AEPs might be due indirectly to differences in how monolinguals and bilinguals allocate attention. For example, bilinguals, compared to monolinguals, showed greater attention to nontarget vowels (seen as a negative shift at the N1 and P2 latencies, called the Nd) in a task where the targets were tones or consonants (Datta et al., 2020) (see Figure 9.1). To date, too few cross-linguistic studies of AEPs to speech have carefully matched groups for bilingual experience; matching of experience would allow determining whether these effects are due to differences in how bilinguals and monolinguals deploy attention or to different tuning of phonological representations.

Figure 9.1 Auditory evoked potentials to a 250-ms vowel /ɛ/ at the onset of a sequence of ten stimuli (intersequence interval 1,250 ms) for monolingual AE adults (N = 14) and five- to seven-year-old children (N = 14), and bilingual Spanish-English adults (N = 7) and five- to seven-year-old children (N = 13) (both languages learned before age five years). The upper-right graph shows the adult P1-N1-P2 sequence at Fz and the upper-left graph exhibits the child P100-N2 sequence. The increased negativity (Nd shift) in the ERP of bilingual adults indicates allocation of attentional resources to processing the stimuli, even when instructed to ignore the auditory stimuli (see Datta et al. [2020]). The bottom-right graph shows the T-complex measures for left and right electrode sites (1 cm above the ears). The increased positivity of the Ta for bilingual adults may be the polarity reversal of the frontal Nd effect. The bottom-left graph shows the maturing T-complex, with the positive-going Ta peak beginning to emerge (see Shafer, Yu, & Wagner [2015]). The bilingual children show more negative T-complex responses (see Rinker et al. [2017]).
Neural mechanisms of speech perception have also been studied using attention-dependent measures, including negativities (N2b, N400) and positivities (P300/P3b). These measures reflect engagement of goal-directed, higher-level processes that are dependent on conscious perception and are expected to show a stronger correlation with behavioral perception than attention-independent measures (Dehaene-Lambertz, Dupoux, & Gout, 2000; Wagner et al., 2012). Task-related potentials can be used to track the underlying processes leading up to the behavioral response. For example, a common paradigm used to focus attention on the speech signal is an oddball task in which a frequent category is repeated and listeners are asked to respond to a rare (oddball) category (Buchwald et al., 1994; García & Froud, 2018; Xi et al., 2021). Behavioral detection of the oddball elicits an increased negativity around 200 ms, called the N2b, followed by a large positivity (P300 or P3b) peaking between 300 and 600 ms. Easier discriminations yield larger and earlier peak responses. In the absence of attention to the stimuli, a different measure, called the mismatch negativity (MMN), is elicited, as explained next.
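The oddball paradigm itself is easy to sketch. The 15 percent deviant probability and the no-consecutive-deviants constraint below are common design choices assumed for illustration, not parameters from any study cited here:

```python
import random

def oddball_sequence(n_trials=200, p_deviant=0.15, seed=1):
    """Pseudo-random oddball sequence of 'std'/'dev' trial labels.
    The no-consecutive-deviants constraint is a common design choice
    (assumed here): every deviant is preceded by at least one standard."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_trials):
        if seq and seq[-1] == "dev":
            seq.append("std")  # force a standard right after a deviant
        else:
            seq.append("dev" if rng.random() < p_deviant else "std")
    return seq

seq = oddball_sequence()
print(seq.count("dev") < len(seq) // 2)  # True: deviants are rare
print(all(not (a == b == "dev") for a, b in zip(seq, seq[1:])))  # True
```

Keeping deviants rare and spaced apart is what makes the standard a stable, repeated context against which the deviant can elicit N2b/P3b (with attention) or MMN (without).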
The MMN ERP measure is ideal for testing the ASP model’s claim that native-language SPRs are automatic because: (1) it is sensitive to language experience; and (2) it can be examined when attention is directed away from the auditory stimulus. In the oddball paradigm, MMN is elicited by a rare event (deviant) and results in increased negativity at fronto-central sites (and positivity at mastoid sites) if the stimulus contrast is sufficiently salient (Näätänen, Kujala, & Winkler, 2011). Further, MMN is larger in amplitude and earlier in latency with increasing contrast salience. The sources of MMN are localized to the STG, but additional sources are sometimes found in the frontal cortex and hypothesized to reflect activation of the attentional network (e.g., orienting to a salient sound change). It is possible for MMN to be partially or wholly masked by other components, such as the N2b in tasks where attention is focused on detecting the stimulus change. Stable patterns can also be consolidated as an auditory representation, and deviation from the pattern will lead to an MMN (Sussman, Ritter, & Vaughan, 1998). Figure 9.2 (right column) displays the MMN component in adults.
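Since the mismatch response is computed as deviant minus standard (as in Figure 9.2), the analysis step can be sketched with synthetic averaged ERPs (arbitrary units and sample indices, assumed for illustration):

```python
def mismatch_response(deviant_erp, standard_erp):
    """Difference wave: averaged deviant ERP minus averaged standard ERP,
    computed point by point, as at a fronto-central electrode."""
    return [d - s for d, s in zip(deviant_erp, standard_erp)]

def peak_negativity(wave, window):
    """Most negative amplitude (and its sample index) in a latency window."""
    lo, hi = window
    segment = wave[lo:hi]
    amp = min(segment)
    return amp, lo + segment.index(amp)

# Synthetic averaged ERPs in arbitrary units: the deviant carries extra
# negativity over samples 10-19 (standing in for the MMN latency range).
standard = [0.0] * 40
deviant = [0.0] * 10 + [-1.5] * 10 + [0.0] * 20

mmn = mismatch_response(deviant, standard)
amp, latency = peak_negativity(mmn, (5, 30))
print(amp, latency)  # -1.5 10
```

Peak amplitude and latency extracted this way are the quantities compared across groups in the studies below (e.g., larger/earlier MMN for more salient contrasts).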

Figure 9.2 Mismatch responses (deviant minus standard) to the 250 ms vowel /ɛ/ (ISI 400 ms) from the same participants as in Figure 9.1. The MMN peak is 18 ms later for bilingual than for monolingual adults. Monolingual children show a pMMR followed by an MMN that is about 110 ms later than seen for adults. The bilingual children show a pMMR followed by a small MMN deflection.
More than a quarter-century of research has revealed that MMN is modulated by native-language experience (Mah, Goad, & Steinhauer, 2016; Näätänen et al., 1997). Reviewing this literature is beyond the scope of this chapter, but the following studies illustrate the utility of MMN in testing the ASP model.
The first studies using MMN to test the ASP model examined neural discrimination of Japanese vowel and consonant duration contrasts under different conditions of attention in native AE and native Japanese listeners (Hisagi et al., 2010; Hisagi, Shafer, et al., 2015). We predicted that AE listeners would show a smaller MMN when attention was focused away from the speech than when attention was focused on the speech. The AE listeners showed a smaller MMN to the vowel duration contrast when attention was focused on a visual shape task, but an MMN comparable to that of Japanese listeners when attention was directed to the auditory modality (by asking them to count vowel deviants), confirming the prediction (Footnote 4). For both groups, we also observed a large P3b response to the duration change when counting deviants, indicating behavioral detection of the oddball (Hisagi et al., 2010). The findings for the consonant contrast, however, suggested that salience was important for native as well as non-native listeners (Hisagi, Shafer, et al., 2015).
In another study, we examined L2 learners of English from Japanese, Spanish, or Russian backgrounds to further test whether listeners rely on L1 SPRs in the absence of focused attention (Shafer et al., 2021). Neural discrimination of the AE low vowels /ʌ/, /ɑ/, and /æ/ was tested in a paradigm using three natural tokens of each vowel (e.g., ɑpə1, ɑpə2, ɑpə3, ʌpə2, ɑpə1, ɑpə2, ɑpə3, æpə3, etc.; deviants are in bold). These three vowels are challenging for L2 learners whose language includes only one low vowel (represented by /a/), because their F1 and F2 spectral patterns are acoustically similar (compared to, e.g., /a/ versus /e/). The AE /ʌ/, however, is systematically shorter than /ɑ/ and /æ/, which provides an additional cue for listeners. Japanese listeners were expected to access this temporal cue at the automatic level because duration serves as a primary cue in Japanese (Strange et al., 1998). The Spanish listeners were expected to show poor discrimination of all low-vowel pairings because Spanish does not distinguish vowel duration. The expectation was less clear for the Russian group because duration is used for stress in Russian. In addition, Russian listeners sometimes behaviorally judge /æ/ as more similar to Russian /e/ than to /a/, which suggests that they are selecting different cues than Spanish listeners, who judge /æ/ to be most similar to Spanish /a/ (Strange, 2011).
Spanish listeners did not show an MMN to the /ʌ/ versus /ɑ/ contrast when /ʌ/ was the standard; both the Japanese and the Russian listeners showed an MMN to this contrast, although later in latency than that of the English listeners. This result supported the prediction that Spanish listeners’ SPRs for vowels lack duration information and are insufficiently detailed in spectral information to support distinguishing /ɑ/ and /ʌ/. In contrast, Japanese listeners’ native-language SPRs included fine-grained temporal information. The finding of an MMN for Russian listeners suggests that SPRs for their native-language stress patterns facilitated discrimination. The later latency of the MMN for the Japanese and the Russian listeners than for the English listeners is consistent with reliance on the duration cue (which requires waiting for the end of the shorter stimulus), whereas the English speakers were able to discriminate the vowels on the basis of spectral differences from stimulus onset.
An asymmetry in discrimination provided deeper insight into the nature of SPRs. When /ɑ/ served as the standard and /ʌ/ as the deviant stimulus, all groups showed an MMN, suggesting that neural discrimination was easier for this presentation order. One explanation for this finding is that the repeated stimulus triggers the native-language SPR in long-term memory (without focused attention); in contrast, the deviant is registered as an iconic memory, which decays rapidly. Under this view, Spanish listeners’ representation for the repeated stimulus (standard) was Spanish /a/, whether the phonetic identity was AE /ʌ/ or /ɑ/; thus, their perception was /a-a-a/-[ɑ] for the phonetic sequence [ʌ-ʌ-ʌ-ɑ], but /a-a-a/-[ʌ] for the phonetic sequence [ɑ-ɑ-ɑ-ʌ]. In this latter case, the iconic memory trace of /ʌ/ as the deviant is sufficiently different (i.e., it is a poor exemplar of Spanish /a/) to generate the MMN. Increasing the ISI to induce greater decay of the deviant would serve to test this claim. For example, in a study of Mandarin lexical tone contrasts, naïve AE listeners showed an MMN when the ISI was relatively short (600 ms), but not at a longer ISI (2,700 ms), whereas native Mandarin listeners showed an MMN at both ISIs (Yu et al., 2017).
These findings support the claim that listeners fall back on native-language SPRs in the absence of focused attention; however, several studies indicate that attentional resources may be needed to discriminate some native-language contrasts that are phonetically similar. Contrasts that are considered “marked” because they are phonetically similar are the most likely to require attentional resources. The Hindi dental versus retroflex contrast and the Japanese short versus long consonant contrast appear to fall into this category (Hisagi, Shafer, et al., 2015; Shafer, Schwartz, & Kurtzberg, 2004). For the Japanese consonant duration difference, MMN was larger for both naïve English and native Japanese listeners in an auditory-attend compared to a visual-attend condition (although only Japanese listeners showed an MMN without attention to the auditory deviant). Another study, by Barrios et al. (2016), failed to find a larger MMN for Spanish listeners to a contrast that was phonemic in Spanish but allophonic in English compared to a contrast that was allophonic in Spanish but phonemic in English.
9.5.2 Neurodevelopmental Studies
The prior section demonstrated that MMN serves as an excellent measure for probing the nature of phonological representations stored in the brain. This knowledge allows us to now interpret the neural processes supporting speech perception development. Specifically, presence of MMN can be used to indicate whether and when in development phonological representations are sufficiently detailed to support discrimination. This section presents several key studies to illustrate what we have learned so far from neurodevelopmental measures.
First, it is crucial to recognize that brain regions mature at different rates in the developing child and, thus, speech perception is likely to be accomplished differently across development (Werker & Curtin, 2005). Studies of infant auditory development reveal relatively mature development of the peripheral and brainstem auditory systems; in contrast, neocortical regions are still immature at birth (Moore & Linthicum, 2007). Newborns can encode and discriminate speech, but the neural systems supporting auditory processing at the cortical level are immature in density of neurons, connectivity (through dendrites and synapses), and speed of processing (through myelination of the axons that connect neurons to each other). This immaturity is also apparent in neurophysiological measures used to test speech encoding and discrimination (e.g., Shafer, Yu, & Wagner, 2015; Yu et al., 2019). Cortical immaturity leads to the hypothesis that only very large acoustic-phonetic differences can drive attention-independent discrimination in newborns (Moore & Linthicum, 2007). In addition, infants have immature prefrontal lobes (supporting executive functions). In the first few months of life, infants are unable to voluntarily shift attention away from a highly salient or novel stimulus, such as a large sound change (Hendry, Jones, & Charman, 2016). Thus, we predict that only large acoustic differences will support an automatic discrimination process in the youngest infants, indexed by MMN. This prediction has generally held up in infant studies examining neural measures of discrimination.
The first infant studies of neural speech discrimination observed a very large, sustained negativity to a Finnish /y/ versus /i/ contrast in newborn infants; the authors argued that this finding supported early maturation of the speech discrimination process reflected by MMN (Cheour-Luhtanen et al., 1995). Subsequent studies, however, suggest that MMN is observed in young infants only in relation to large acoustic differences. To smaller differences, infants showed an increased positivity to the deviant stimulus (called the positive mismatch response, or pMMR), rather than an increased negativity. This pMMR has been observed in relation to both tones and speech contrasts (Dehaene-Lambertz & Dehaene, 1994; Leppänen et al., 2002; Morr et al., 2002; Trainor et al., 2001). Later studies showed that MMNs to subtle consonant and vowel contrasts are not consistently present in children until at least four years of age (Lee et al., 2012; Shafer et al., 2010). A negative response (that may be the MMN) (Footnote 5) appears only to large acoustic differences (e.g., 1,000 Hz vs. 2,000 Hz pure tones: Morr et al., 2002; lexical tone: Lee et al., 2012; Liu et al., 2018). In addition, the pMMR and the MMN can be elicited in the same children, with the pMMR preceding the MMN in time (Shafer et al., 2010). These studies also indicate slower neural discrimination by younger children.
That is, MMN shifts earlier in time by about 10–12 ms per year from four to ten years of age and may not reach adult latencies until after puberty (Datta et al., 2020; Shafer et al., 2010). Figure 9.2 (left column) displays the pMMR and MMN in children.
These developmental patterns suggest that the pMMR indexes the magnitude of acoustic differences rather than phonological differences, and that it is generated by recovery from refractoriness of a neural population firing to the acoustic characteristics of the deviant (May & Tiitinen, 2010) (Footnote 6). Studies have shown that repetition of a stimulus leads to adaptation (suppression of neuronal firing). Longer intervals between repetitions allow neurons to recover. The longer interval between deviants compared to standards allows for greater recovery, seen as greater positivity to the deviant stimulus for children. We hypothesize that the pMMR reflects an iconic memory mechanism.
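The refractoriness account can be made concrete with a toy exponential-recovery model. The time constant and the 15 percent deviant rate are invented for illustration (the ~0.65-s stimulus onset interval corresponds to a 250-ms vowel plus a 400-ms ISI, as in Figure 9.2):

```python
import math

def recovered_amplitude(interval_s, full_amp=1.0, tau_s=2.0):
    """Toy adaptation model (an assumption, not from the chapter): the
    response of a neural population recovers exponentially toward full_amp
    with time constant tau_s since that stimulus was last presented."""
    return full_amp * (1.0 - math.exp(-interval_s / tau_s))

# Standards recur about every 0.65 s; at an assumed 15 percent deviant
# rate, deviants recur roughly every 4.3 s, leaving far more time for
# recovery from adaptation.
print(recovered_amplitude(4.3) > recovered_amplitude(0.65))  # True
```

Under this sketch, the rarer deviant always arrives with a less-adapted neural population, yielding a larger obligatory response to the deviant, which would surface as positivity in the deviant-minus-standard difference wave.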
Several studies with infants suggest that fine-grained phonetic differences initially require focused attention for discrimination. We observed that six-month-olds showed a negative mismatch response (nMMR) to the fine-grained /ɛ/ versus /ɪ/ vowel distinction only under conditions that manipulated attention to the stimulus change (Shafer, Yu, & Garrido-Nag, 2012). For example, in Karen Garrido-Nag’s dissertation (Garrido-Nag, 2013), infant attention was manipulated by pairing a stimulus change with a reward (a smiling face). The MMRs to this “contingent” stimulus change were more negative than in another condition where the smiling face was not contingent on the change. Infants as young as three months of age appear to have sufficiently mature attentional networks (top-down feedback from frontal regions) to support the allocation of attention in speech perception (Mersad, Kabdebon, & Dehaene-Lambertz, 2021) (Footnote 7).
We argued that the increasing negativity of MMRs to the vowel contrast /ɛ/ versus /ɪ/ with increasing age could be explained by infants’ representations becoming more detailed and, thus, allowing discrimination whether attention was focused on or away from the speech stimuli (Shafer, Yu, & Datta, 2011). A study from Patricia Kuhl’s lab supports this explanation (Garcia-Sierra, Ramírez-Esparza, & Kuhl, 2016). They tested eleven- to fourteen-month-old children from monolingual English and bilingual Spanish-English households and measured the amount of input in English and Spanish (by recording speech in the household and counting the number of words in forty 30-s intervals). Monolingual children with high English input showed a significant negative MMR to the English contrast /pa/ versus /ta/, but a pMMR to a non-native Mandarin Chinese contrast between an alveolo-palatal affricate (/tɕhi/) and an alveolo-palatal fricative (/ɕi/). No MMR effects (positive or negative) were observed in relation to a Spanish prevoiced /da/ versus short-lag voiceless /ta/ contrast. Bilingually exposed Spanish-English infants with high English or high Spanish input showed a pMMR, rather than a negativity, to the voiced/voiceless contrast of the respective language. This absence of a negative MMR in bilingual infants may be explained by insufficient input in either language (the “high” input values for bilinguals matched the low input values for monolingual infants), or by bilingual experience leading to interference in setting up native-language representations (described as delayed neural commitment by Garcia-Sierra et al., 2016).
In summary, these neurodevelopmental studies demonstrate that MMRs provide a method for examining the nature of phonological representations in L2 learning. Section 9.5.3 reveals how research using the MMR method provides additional insight into the nature of SPRs for bilinguals who have learned two languages early in life.
9.5.3 Early Bilinguals
For bilingual listeners, the phonologies of the two languages will conflict for some phonological representations. Bilinguals who have learned both languages before five years of age are particularly interesting because differences in their speech perception compared to monolingual listeners cannot be attributed to closure of a critical/sensitive period for refining speech sound categories (Reh, Arredondo, & Werker, 2018).
Three different possibilities exist for bilingual speech perception (Williams, 1977):
1. Listeners show a bias toward one language (Language A); in this case, “explicit” phonological representations will exist for Language A and allow automaticity of SPRs, but SPRs for Language B will require some attentional resources.
2. Listeners compromise between the two languages; in this case, the phonological representations (SPRs) for both Language A and Language B differ from those used by monolingual listeners.
3. Listeners can shift between the two systems by means of some cue that allows selecting language-specific SPRs; however, some attentional resources may be necessary for selecting the language.
Neurophysiological evidence can be found to support all three scenarios. For example, Peltola et al. (2012) found that proficient Finnish-Swedish bilinguals who were dominant in Finnish showed a larger MMN to a vowel change in the environment where it is phonologically contrastive (that is, the Finnish context), but an attenuated MMN in the context where the vowel pair falls within the same category (Swedish). In contrast, balanced bilinguals, who had learned both languages from birth, showed evidence of compromising, with a smaller and later MMN observed for both the Finnish and the Swedish contexts. The finding that monolingual and bilingual adults differ in the attention they direct to speech suggests that bilinguals may be monitoring the speech signal for cues to determine which language to select (e.g., Datta et al., 2020) (see Figure 9.1). An Nd attention difference was not observed between monolingual and bilingual children in Datta et al. (2020), but it may be that children have not yet developed this strategy.
9.6 ASP Extended
The evidence reviewed in this chapter supports the following claims:
1. For monolingual listeners, native-language SPRs need few attentional resources to support perception because phonological representations in long-term memory have the necessary phonetic detail to allow automatic recovery of lexical identity. Native-language phonological representations are “explicit” and located in the auditory cortex. Their explicit nature (achieved by overlearning) boosts salience.
2. Phonetic contrast salience modulates whether focused attention is needed for perception. If the explicit, phonetic detail in phonological representations is not sufficiently distinct (e.g., dental versus retroflex stops), then focused attention will be needed to draw on additional phonetic cues to support perception.
3. Native-language SPRs develop via overlearning (intense experience) over the first four years of life and continue to be refined up to puberty. Phonological representations are initially sparse, and thus only large phonetic differences are detected without focused attention in the first year. With increasing exposure, representations become more detailed and more selective for the specific language; between three and four years of age, children begin to have sufficiently detailed phonological representations to support more fine-grained distinctions and automaticity of most native-language SPRs.
4. Novel phonological categories can be learned for an L2 via tasks allowing focused attention to phonetic detail. However, these L2 representations may never become explicit, particularly for those that conflict with L1 phonological representations. Thus, L2 SPRs may continue to require focused attention for successful perception, even for highly proficient L2 learners.
5. Second-language listeners will fall back on L1 SPRs in tasks with high cognitive load because focused attention is directed to other levels of processing (e.g., semantic). Learners of an L2 can perform well in L2 speech perception when the L1 SPR supports the distinction between two L2 speech sounds (as in the case of the vowel duration difference between English /a/ and /ʌ/ for Japanese listeners).
6. Learning an L2 early in life before L1 phonological representations are fully refined (i.e., before neural commitment) will result in differences from monolingual phonological representations, and/or in differences in how attention is engaged in speech perception.
9.7 Future Directions
Future studies are needed to examine the limits of L2 learning in relation to factors such as age of onset of learning, quality of input, and type of training (Tamminen et al., 2021). Future studies should also explore whether the ASP model can help understand speech perception deficits in bilingual children with developmental language disorder (DLD) (Kujala & Leminen, 2017; Rinker et al., 2022). For example, many children with DLD do not show MMN to fine-grained phonetic differences, such as the spectral differences between /ɛ/ and /ɪ/ (Shafer et al., 2005). A third area of interest focuses on understanding the developmental link between speech perception and production. Neural measures will be crucial for providing insight into these questions.
9.8 Conclusion
This chapter presented an updated version of the ASP model and provided evidence that supports the basic claims of the ASP. Specifically, L1 speech perception is automatic and robust, whereas L2 speech perception will often require focused attention for accurate recovery of L2 phonological patterns. Evidence also revealed that some L1 phonological representations may have insufficient detail to support attention-independent discrimination. Studies using neural measures indicate that phonological representations become increasingly detailed during the first four years of life, but begin to have sufficient detail to allow attention-independent discrimination of phonetically similar contrasts only after about three years of age. Neurophysiological measures provide an excellent source of evidence and will continue to be important for further testing the ASP model.
10.1 Introduction: Interlanguage
Second language acquisition (SLA) research prior to 1970 claimed that all non-native substitutions were due to transfer. Shortly afterward, however, researchers became increasingly aware that transfer could not account for all phenomena. In 1972, Selinker proposed the term interlanguage (hereafter IL), which included phenomena not occurring in the native first language (L1) or the second language (L2). Accordingly, IL can be considered to be composed of three fundamental elements: L1, L2, and a component that is neither L1 nor L2. Because this third element cannot be attributed to the L1 or the L2, it can be argued that it results from universals specific to neither the L1 nor the L2. Some of these phenomena are identical or similar to those occurring in L1 acquisition (e.g., terminal coda obstruent devoicing), suggesting important similarities between L1 and L2 acquisition. This issue will be addressed in Sections 10.2.3 and 10.2.4. In addition to universals, early research demonstrated the importance of similarity and markedness (Eckman, 1977; Flege, 1987; Major, 1987a; Oller & Ziahosseiny, 1970).
Native L2 comprises one component of the IL; an example would be an English speaker who accurately produces Portuguese /õj̃/ (e.g., põe “puts”). Another element is L1 transfer, such as the same speaker substituting English /e/ for Portuguese /e/, /ej/, and /ɛj/ (e.g., [kade] “where is,” [dej] “I gave,” [idɛja] “idea”). Although the world-renowned Prague School phonologist Trubetzkoy is often not recognized for his contributions to SLA, he did claim that L2 perception is “filtered” through the “sieve” of the L1 (Trubetzkoy, [1939] 1958).
The third IL component encompasses universals – the portion of IL that cannot be attributed to the L1 or the L2. As early as the 1970s, Nemser (1971a) reported that Hungarians acquiring English /θ/ produced [sθ], which occurs in neither English nor Hungarian. In another study, Johansson (1973) discovered that American English and German speakers acquiring Swedish produced sounds not occurring in any of these three languages (e.g., [ɯ] for /ʉː/). In a cross-sectional study of fifty-three Brazilian learners of English, Major (1986b) found universals to be very prevalent: [ə] paragoge rather than the [i] paragoge that transfer would predict. Other studies reported final obstruent devoicing in L2 speakers whose L1s do not have final obstruents (Altenberg & Vago, 1983; Yavaş, 1997). Musau (1993) documented universals occurring in the L2 acquisition of Swahili by native speakers (NSs) of Bukusu, Kamba, Kikuyu, Maasai, Nandi, Somali, and Luo; these substitutions occurred at a variety of levels, including segmentals, syllable structure, stress, and tone.
It is imperative to clarify the term universals (hereafter U). In the explication of the Ontogeny Phylogeny Model, U pertains to universals that are not already evident in the L1 and the L2. For example, when /s/ voicing occurs before nasals in Spanish speakers of English (e.g., Christmas, isthmus), it is subsumed under L1 because it is a native Spanish process (e.g., mismo → [mizmo] “same”). However, in Japanese learners of English it is classified as U because Japanese has no /sN/ sequences.
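The classification convention just described can be sketched in a few lines of code. This is purely illustrative: the process labels and the two-language inventory are hypothetical stand-ins for the Spanish and Japanese examples above, not a real phonological database.

```python
# Sketch of the OPM's labeling convention (hypothetical data): a non-native
# substitution counts as L1 transfer if the same process exists natively in
# the learner's L1; otherwise it is classified as U (a universal).
NATIVE_PROCESSES = {
    "Spanish": {"s-voicing before nasals"},  # e.g., mismo -> [mizmo]
    "Japanese": set(),                       # Japanese has no /sN/ sequences
}

def classify(process, l1):
    """Return 'L1' or 'U' for a substitution observed in an L2 learner."""
    return "L1" if process in NATIVE_PROCESSES.get(l1, set()) else "U"

print(classify("s-voicing before nasals", "Spanish"))   # subsumed under L1
print(classify("s-voicing before nasals", "Japanese"))  # classified as U
```

The same surface substitution thus receives different labels depending on the learner’s L1, exactly as in the Christmas/isthmus example.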
10.2 The Ontogeny Phylogeny Model
Although by the 1970s SLA researchers were aware that U comprised an important component of IL, the relative importance of L1 transfer versus U was a topic of debate. Moreover, there were no hypotheses regarding possible interrelationships between L1, L2, and U. Remarkably, more than fifteen years passed before any systematic interrelationships were proposed. The Ontogeny Model (OM; Major, 1987b) was the first to do so. Clearly, as acquisition improves, the total number of non-native substitutions necessarily decreases. However, the innovative claim of the OM was that a specific type of non-native substitution, namely U, actually increases at first before it decreases. Expanding on this model, the Ontogeny Phylogeny Model (OPM; Major, [2001] 2014) further argues that similarity and markedness have predictable interrelationships with L1, L2, and U. In addition, the OPM claims that these same interrelationships obtain for L1 acquisition and languages in contact.
The OPM is divided into four corollaries, which address chronology, style, similarity, and markedness.Footnote 1 All four corollaries claim that L2 increases, L1 decreases, and U increases and then decreases. The patterns for chronology and style are identical (Figures 10.1 and 10.2). In contrast, the patterns for similarity and markedness differ from chronology and style in their relative proportions of L2, L1, and U (Figures 10.3 and 10.4).
10.2.1 Chronological Corollary of the OPM
The first corollary addresses overall chronological development. It claims that IL develops chronologically in the following manner: L2 increases, L1 decreases, and U increases and then decreases (Figure 10.1).

Figure 10.1 Chronological corollary of the OPM.
At the initial stage the learner speaks the L1 and no L2, while U remains dormant; the L2 is essentially a tabula rasa. Therefore, IL = L1. One could claim that someone with the heaviest of foreign accents is not speaking the L2 at all, but merely speaking the L1 with L2 loanwords. During acquisition, the L2 system gradually develops through various intermediate stages until it reaches the final idealized state when the L2 is completely mastered; that is, IL = L2. Very few, if any, learners reach this advanced stage (but see Moyer’s [2021] “gifted language learner”). The chronological corollary pertains to normal phenomena that are neither similar nor marked. The stylized hypothetical trajectories in Figure 10.1 and those that follow characterize the idealized learner but will necessarily vary with different learners and different phenomena; the specific relative frequencies of the L1, the L2, and U shown are hypothetical.
The OPM claims that at the early stages, L1 influence is so strong that it prevents both L2 mastery and U from surfacing. With continued exposure and improvement in perception,Footnote 2 the learner realizes (consciously or unconsciously) that L1 substitutions are not accurate productions of the L2. Subsequently, the learner’s continuing but failed attempts to accurately produce the L2 cause this latent U to become activated, resulting in phenomena that are part of neither the L1 nor the L2. Thus, at an early stage, L1 obliterates the effect of U; nothing in U is evident that is not already part of the L1; that is, U remains dormant. At a later stage, this nascent U awakens and replaces some L1 components. Then, at a still later stage, U decreases as the L2 replaces it.
The OPM also follows logically from widely observed IL patterns. The learner begins acquisition with 100 percent L1 and 0 percent L2. At the final stage the idealized learner’s IL is 0 percent L1 and 100 percent L2. Chronologically, this means L2 must increase and L1 decrease. Because L1 is 100 percent at the beginning stage and L2 is 100 percent at the final stage, it follows that U must be zero at both the initial and the final stages. However, U is also a well-documented factor (Altenberg & Vago, 1983; Johansson, 1973; Musau, 1993; Nemser, 1971a). Therefore, U must rise and fall at stages other than the very beginning or the very end (when it is zero) because mathematically L1 + L2 + U = 100 percent.
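The arithmetic behind this argument can be made concrete with a small sketch. The particular growth curves below are hypothetical assumptions for illustration only; the OPM itself supplies just the boundary conditions and the identity L1 + L2 + U = 100 percent.

```python
# Stylized OPM chronology: L2 rises from 0 to 100 percent, L1 falls from
# 100 to 0, and U is forced to be whatever remains. The quadratic curves
# are arbitrary illustrative assumptions; the identity L1 + L2 + U = 100
# guarantees that U rises and then falls, peaking mid-acquisition.
def il_composition(t):
    """Return (L1, L2, U) percentages at acquisition stage t in [0, 1]."""
    l2 = 100 * t ** 2           # hypothetical: L2 mastery grows slowly at first
    l1 = 100 * (1 - t) ** 2     # hypothetical: L1 transfer recedes
    u = 100 - l1 - l2           # U is the remainder: 200 * t * (1 - t)
    return l1, l2, u

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    l1, l2, u = il_composition(t)
    print(f"t={t:.2f}  L1={l1:5.1f}  L2={l2:5.1f}  U={u:5.1f}")
```

Whatever the exact shapes of the L1 and L2 curves, U is zero at both endpoints and positive in between, which is precisely the corollary’s claim.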
Although L1 and L2 acquisition involve some principles that are distinct from other forms of learning, all types of learning share one common characteristic: transfer. The underlying cause for transfer being much more important than U in the early stages follows from long-established principles of learning theory in psychology (Ausubel, 1963; Ausubel, Novak, & Hanesian, 1978; Bugelski, 1942; Gagné, 1977). Extensive research has demonstrated that one relies on previous cognitive structures when learning new structures; that is, transfer occurs. Ausubel et al. (1978, p. 65) further claimed that all learning involves transfer and that subsequent learning “results in new transfer by modifying cognitive structure.”Footnote 3 Incorporated into an SLA framework, their conclusions predict that L1 transfer will dominate the early stages because the learner has mastered very little L2. As acquisition proceeds, the existing cognitive structure (the IL) is modified by the L2 experience, creating new cognitive structures. Because the structures are new, they cannot be L1, but are necessarily either U or L2. These new structures in turn cause further IL modifications. Therefore, as acquisition progresses, this new IL will show increasing U and L2, while the influence of pure L1 transfer continues to decrease. Subsequently, as learning continues further, U substitutions will be replaced by native L2 forms, resulting in a decrease in U. These patterned progressions are precisely the OPM’s claims: L1 will decrease over time, L2 will increase, whereas U will first increase and then later decrease.
Consider the acquisition of English terminal coda voiced obstruents by NSs of Japanese and Portuguese (such obstruents occur in neither language). If transfer operates first, learners will epenthesize a vowel. However, if and when epenthesis ceases, devoicing is able to surface. Thus, the hypothetical stages for the acquisition of dog would be: do[gu] → do[k] → do[g] (L1 Japanese) and do[gi] → do[k] → do[g] (L1 Portuguese).
Although some substitutions are unmistakably either L1 or U, others can be both. Coda obstruent devoicing in Japanese learners of English is clearly U because Japanese has no coda obstruents. In contrast, the same process in German speakers of English could be attributed both to U and to L1 because it is obligatory in native German. A number of researchers have proposed that substitutions persist longer when both transfer and U produce the same result (Andersen, 1983; Hecht & Mulford, 1982).Footnote 4 This claim follows logically from the OPM: In German speakers of English, L1 first causes devoicing to occur; subsequently, when learners eliminate L1 transfer, U surfaces. However, because both L1 and U cause devoicing, the process persists. A possible test case would be to compare devoicing in Japanese and German learners of English. The prediction would be that devoicing would persist longer in German NSs because it occurs in native German but not in native Japanese.
An early longitudinal study that provides evidence for the OPM is Wode’s (1981) research on the L2 acquisition of English by his four German-speaking children (over a period of approximately six months, beginning at ages 3;11 to 8;11). Wode proposes two developmental routes: gradual approximations and discrete jumps. The vowels [a] and [ɛ] were first substituted for English /ʌ/ and /æ/, indicating transfer from German. Later the substitutions became closer and closer to native English via gradual approximations. These gradual approximations can be attributed to U because they are neither native English nor native German. In contrast, discrete jumps occurred for r in all four children: [R] > [w] > [ɹ] > [ṛ] (Wode’s [ɹ] = “central frictionless continuant” and [ṛ] = “target-like retroflex”). The stages for vowel and r substitutions follow from the OPM: early substitutions were due to transfer, later substitutions were due to U, and, finally, the L2 was mastered.
Abrahamsson’s (2003) longitudinal study further supports the OPM: he reported that deletion and epenthesis increased and then decreased in Chinese learners of Swedish. Other longitudinal studies provide evidence for the OPM, including English consonant cluster acquisition by Brazilian Portuguese speakers (Major, 1994, 1996) and Spanish r acquisition by English speakers (Major, 1986a). In all three studies, the longitudinal patterns show an increase in L2, a decrease in L1, and an increase and then a decrease in U. For example, in the Spanish study, some speakers exhibited the following stages for /r/ acquisition: [ɹ] > [ɾ] > [ʀ] > [ʐ]Footnote 5 > [r].
For the past three decades, Optimality Theory (OT; Prince & Smolensky, 1997, 2004) has been a prominent theory in mainstream phonology. In OT terminology, the OPM can be formulated as follows: L1 constraint rankings comprise the early stages; later they are reranked into rankings that are neither native L1 nor native L2 (i.e., U); and finally they are reranked into native L2.Footnote 6 Hancin-Bhatt and Bhatt (1997), employing OT, found some substitutions due to language-specific rankings (L1 transfer) and others due to language-independent rankings (U). Hancin-Bhatt and Bhatt (1997, p. 368) argue that
we can begin to give a linguistic-theoretic interpretation to Major’s (1986, 1987, 1994) ontogeny model … that L2 learners have mostly transfer-related errors in early stages of learning, but that, over time, developmental errors become more prominent, whereas both taper off in advanced L2 speakers. If we assume that the L2 learner’s initial state is transferred L1 constraint ranking, we can then begin to address why transfer effects are prominent in early stages of acquisition.
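The reranking idea can be illustrated with a toy OT evaluation. The constraint names follow standard OT usage (NoCoda, Dep, Ident(voice), and a markedness constraint against voiced codas), but the three-candidate set and the stage rankings are simplified assumptions chosen to reproduce the do[gu] → do[k] → do[g] stages discussed earlier; a full OT analysis would consider many more candidates and constraints.

```python
# Toy OT evaluation for Japanese-accented English /dog/ (simplified sketch).
# Constraints return violation counts; the winner is the candidate whose
# violation profile is best when compared in ranking order.

def no_coda(c):       # markedness: penalize a word-final consonant
    return 0 if c[-1] in "aeiou" else 1

def dep(c):           # faithfulness: penalize segments inserted beyond /dog/
    return max(0, len(c) - len("dog"))

def ident_voice(c):   # faithfulness: penalize devoicing input /g/ to [k]
    return 1 if "k" in c else 0

def star_voiced_coda(c):  # markedness: penalize a voiced obstruent coda
    return 1 if c.endswith("g") else 0

def winner(ranking, candidates):
    # Lexicographic comparison: higher-ranked constraints decide first.
    return min(candidates, key=lambda c: tuple(con(c) for con in ranking))

candidates = ["dogu", "dok", "dog"]

# Hypothetical stage rankings (assumptions for illustration):
stage1_l1 = [no_coda, star_voiced_coda, ident_voice, dep]  # L1-like: no codas
stage2_u = [dep, star_voiced_coda, no_coda, ident_voice]   # interlanguage (U)
stage3_l2 = [dep, ident_voice, star_voiced_coda, no_coda]  # target-like L2

for label, ranking in [("L1", stage1_l1), ("U", stage2_u), ("L2", stage3_l2)]:
    print(label, "->", winner(ranking, candidates))
```

Under these rankings the winners are do[gu], do[k], and do[g] in turn, mirroring the claim that reranking moves the grammar from L1 through U to native L2.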
10.2.2 Stylistic Corollary of the OPM
The second corollary deals with style. It predicts that as style becomes more formal, L2 increases, L1 decreases, and U increases and then decreases (Figure 10.2).

Figure 10.2 Stylistic corollary of the OPM.
The shapes of the graphs for chronology and style are identical (Figures 10.1 and 10.2). The L1 pattern follows logically from widespread research demonstrating that transfer is most prominent in casual styles but least prominent in formal styles, such as citation (Dickerson & Dickerson, 1977; Nemser, 1971b; Schmidt, 1977; Tarone, 1988; Wilson & Møllergard, 1981). Probably nearly every language teacher would agree that L2 speakers’ pronunciation of isolated words is usually much better than in conversation, where L1 transfer is more noticeable. A hypothetical example illustrates this. By changing the tone in Mandarin, the sequence /ma/ has four meanings (“mother,” “horse,” “hemp,” “scold”). An NS of English who studies Mandarin for six months and practices mā 妈 “mother” daily may shout “Mother!” and successfully sound native for this one word, while not mistakenly calling her a horse, mǎ 马; nevertheless, it is highly probable that the same person will have a prominent L1 English accent in conversation.
Although L1 transfer decreases and L2 increases as style becomes more formal, this relationship does not overtly predict the rise and fall of U. However, this is implicit, because any IL component that is neither L1 nor L2 is by definition U. In the idealized or extreme case, a very formal style is pure L2 and a very casual style is pure L1. Therefore, as style changes from formal to casual, the U component has to appear and then disappear; that is, it increases and then decreases.
Wode’s (1981, p. 228) study of his daughter Birgit’s L1 German-L2 English acquisition of /ɹ/ supports the stylistic claims of the OPM (Wode’s /r/ = IPA /ɹ/): “Until she got out of school at the beginning of June, she much preferred [R] as a substitute for [ṛ]/[ɹ] in her casual spontaneous speech. In the imitation-like checkups she would frequently produce or attempt [w] or something [w]-like to substitute for the L2 /r/.” These “checkups,” where she produced U substitutions ([w] or something [w]-like),Footnote 7 can be considered more formal than “casual spontaneous speech,” when she used L1 substitutions ([R]). These data thus support the OPM: L1 was greater in casual speech, but U was greater in more formal speech.
The patterns in Figure 10.2 may not hold true due to a number of factors, such as nervousness, comfort level, and competence in a given style. A former student of Major’s reported that his Japanese girlfriend living in the United States sounded more native-like (though formal) when she was less guarded and more relaxed than when she consciously attempted to sound casual, a style that she had not learned in Japan. Furthermore, mastery of different styles implies competence in the appropriate situations (i.e., communicative competence). Major (2004) investigated English casual speech processes in NSs of American English and NSs of Japanese and found that although some Japanese speakers had mastered casual speech processes, they often used them in a formal style, for example [gɑɑɑʧɑɑɑ] got you when slowly reading a list of two-word phrases.
Because different processes can occur in different styles, a thorough knowledge of stylistic variation in both the L1 and the L2 is critical for a comprehensive analysis. If productions are more native-like in a casual style, this might at first appear to be counterevidence to the OPM. However, this is not necessarily true: Zampini (1994) pointed out that positive transfer can occur in a casual style but may not in a formal style. For example, in Brazilian Portuguese, stressed and pretonic /il/ and /iu/ are normally pronounced [iw], but [ju] in very casual speech (e.g., f[iw]mamos → f[ju]mamos “we film,” Major, 1981, 1985). Brazilians tend to pronounce English few as [fiw], but in very casual speech they accurately produce [fju]. Major’s (1994, 1996) studies of consonant clusters also showed that L1 casual processes correctly predicted L2 accuracy, yet L1 formal processes did not. Thus, L1 transfer is the governing factor in both these styles and has no relevance to the stylistic corollary.
When L2 learners are unaware that L2 sounds actually occur in their L1s, it can be to their disadvantage. Brazilian Portuguese has seven nasalized monophthongs and diphthongs, including /ʌ̃/, /ʌ̃j̃/, and /ʌ̃w̃/ (e.g., /χʌ̃/, /mʌ̃j̃/, and /mʌ̃w̃/ “frog,” “mother,” “hand”), which are usually very troublesome for English speakers. Nonetheless, unbeknownst to most learners, these three sounds in fact do occur in English. Huh is [hʌ̃] and in very rapid casual speech “what’s the matter, honey?” and “I don’t know” are often pronounced [smæɹ̩hʌ̃j̃] and [ʌ̃w̃]. Such an awareness could be quite useful in pedagogy.
Further examples demonstrate the relevance of stylistically conditioned processes in SLA analysis. In certain environments, Japanese and Portuguese /i/ and /u/ become voiceless and may subsequently be deleted. Thus, to English speakers, Japanese sukiyaki is perceived as [skijaki] and Portuguese lápis “pencil” is perceived as [laps]. However, curiously and fortuitously, for Japanese and Portuguese speakers of English, greater accuracy in English can occur in casual speech. Thus, in careful speech a Japanese speaker of English may say [sukaj] for sky but [skaj] in running speech. Likewise, a Brazilian carefully pronouncing English laps may say [lapis] but in fluent speech correctly produce the coda cluster [ps]. A further example from Japanese demonstrates how one process in the same style produces a native-like utterance in one word but not in another. In a careful formal style, a Japanese speaker’s pronunciation of sky and city would be [sukaj] and [siti] (assuming the transfer process /s/ → [ʃ] / __ [i] does not operate). However, in running speech the following utterance would be likely: I love a blue [skaj] in a big [sti].
10.2.3 Similarity Corollary of the OPM
The third corollary addresses similarity. The OPM claims that in similar phenomena, IL develops chronologically in the following manner: L2 increases slowly, L1 decreases slowly, and U increases slowly and then decreases slowly (Figure 10.3).

Figure 10.3 Similarity corollary of the OPM.
Slowly means more slowly than in normal phenomena. Compare Figure 10.1 to Figure 10.3. In similar phenomena, the role of L1 relative to U is much greater than in normal phenomena. Although L1 predominates in the early stages of all acquisition, in similar phenomena L1 persists longer than in normal phenomena, indicating that similar phenomena are more difficult to learn.
A caveat is the definition of similar and difficult (Archibald, 2021; Major & Kim, 1996; Yang, Chen, & Xiao, 2020).Footnote 8 Although there are problematic cases, there are also clearly unambiguous ones: [k] is more similar to [q] than it is to [t] or [s]. However, perceptions of similarity are also governed by the native languages (NLs) of the learners. French speakers typically substitute /s/ and /z/ for English /θ/ and /ð/, while Portuguese speakers substitute /t̪/ and /d̪/, even though both French and Portuguese have /s/, /z/, /t/, and /d/; Hindi speakers of English substitute Hindi retroflex /ʈ/ and /ɖ/ for English alveolar /t/ and /d/, but English speakers perceive Hindi dental /t̪/ and /d̪/ as being closer to native English; English speakers typically pronounce fondue as fond[u], while Brazilian Portuguese speakers pronounce it fond[i].
Psychologists have demonstrated that transfer operates only when there are relevant phenomena to transfer. Ausubel et al. (1978, p. 165) claimed that past experience has “impact on relevant properties of cognitive structure,” but if the properties are not “relevant properties,” transfer cannot occur (compare with Andersen’s [1983] “transfer to somewhere”). Transferring a tennis swing to badminton seems more likely than transferring shot-put techniques to the sport of curling. Zobl (1980, p. 43) proposed conditions of transfer in SLA, including the “selectivity of L1 influence on L2 acquisition,” a view also supported by others (James, 1983; Young-Scholten, 1985). In similar phenomena, L1 transfer is more prevalent than U because of perceptual saliency. Minimal differences are less likely to be noticed, resulting in transfer, that is, nonlearning. Accordingly, transferring English /k/ to Arabic /q/ seems more likely than transferring any English sound to Xhosa clicks, where learners are more likely to produce non-English sounds, indicating that U plays a greater role.
However, there is a seeming paradox regarding similarity. When L1 and L2 phenomena are maximally similar, transfer should persist and consequently the L2 should not be learned. Nevertheless, at some point, as L1 and L2 approach identity, negative transfer becomes positive transfer because the two phenomena have become virtually indistinguishable. Thus, learning ostensibly has taken place, but in fact it is merely positive transfer. This “similarity paradox” was noted by the prominent psychologist Osgood (1949). However, the definition of “indistinguishable” raises a further issue because some phenomena can be instrumentally distinguishable but perceptually indistinguishable to humans: Eckman, Iverson, and Song (2015) found statistically significant voice onset time (VOT) differences between /p/ and /b/ in Arabic speakers of English, even though trained phoneticians perceived no differences.
The U pattern of the similarity corollary follows from mathematical logic: because it is well known that similar phenomena are acquired slowly while L1 influence is strong and persistent, U must increase slowly and decrease slowly, because L1 + L2 + U = 100 percent. Thus, U must be a lesser factor than L1.
There is a plethora of research demonstrating that similar phenomena are difficult to learn and are acquired slowly (Best, 1995; Bohn, 2020; Major, 2018; Major & Kim, 1996; Yang et al., 2020).Footnote 9 The Speech Learning Model (SLM, Flege, 1995; SLM-r, Flege & Bohn, 2021) has convincingly demonstrated that “similar” sounds are more difficult to learn than “new” sounds because speakers classify similar sounds as “equivalent.” A study designed to test the SLM and the OPM supported both models (Jevring, 2015; also see Section 10.2.4). Jevring (2015, p. 34) states that “these findings support the Similarity Corollary of the OPM as well, since L1 transfer is a larger affecting factor on similar phenomena than universals and the L2.”
Well-known facts about L1 acquisition, L2 acquisition, and dialect variation add further evidence for the OPM. In L2 acquisition, L1 phenomena can be termed similar or dissimilar to L2 phenomena, but in L1 acquisition there is no L2 as a basis for comparison. Nevertheless, L1 and L2 acquisition share important similarities. In L1 acquisition, mergers are common when a phenomenon being acquired is similar to something already acquired; for example, mergers of /s/ and /ʃ/ and /ɑ/ and /ɔ/ are legion. For the L1 learner, this means that the phenomenon in the adult L1 that has previously been acquired is being substituted for the phenomenon that has not yet been acquired (e.g., [s] for [ʃ] and [ɑ] for [ɔ]). Accordingly, because of similarity, this substitution persists and therefore U cannot surface (as in SLA). Mergers that occur in L1 acquisition are also reflected in dialects in contact (a form of L2 acquisition) and historical change. For example, /ɑ/ and /ɔ/ mergers prevalent in children (e.g., caught/cot) are standard in many American dialects, in addition to mergers between tense and lax vowels before nasals and liquids (e.g., Engler/angler, hairy/Harry, sale/sell). The patterns of L1 and L2 acquisition are therefore analogous.
10.2.4 Markedness Corollary of the OPM
The last corollary deals with markedness. The OPM claims that in marked phenomena, IL develops chronologically in the following manner: L2 increases slowly, L1 decreases rapidly and then decreases slowly, and U increases rapidly and then decreases slowly (Figure 10.4). Here “slowly” and “rapidly” mean more slowly and more rapidly than in normal phenomena. There is abundant research demonstrating that marked phenomena are acquired slowly (Altenberg & Vago, 1983; Anderson, 1987; Archibald, 2021; Carlisle, 2017; Carlisle & Espinosa, 2015; Eckman, 1977, 2018; Major, 1994, 2018; Major & Faudree, 1996).

Figure 10.4 Markedness corollary of the OPM.
Compare Figure 10.1 to Figure 10.4. In marked phenomena, the role of U relative to L1 is much greater than in normal phenomena. Further compare Figure 10.3 to Figure 10.4. Both graphs indicate a slower rate of L2 acquisition than for normal phenomena (Figure 10.1). However, Figures 10.3 and 10.4 differ in their relative proportions of L1 and U. In similar phenomena, L1 persists and U remains minimal throughout; in marked phenomena, U has a relatively greater role than L1, compared to similar phenomena.
In order to test the similarity and the markedness corollaries, it is crucial to compare phenomena that are equal in all relevant criteria except the one under scrutiny. For example, the claims could not be tested by comparing a marked and similar phenomenon to a less marked and less similar phenomenon. This is because both similarity and markedness slow acquisition but have opposite effects on U and transfer (Figure 10.3 versus Figure 10.4). Although it is virtually impossible to control for everything, the OPM is formulated ceteris paribus.
The markedness corollary follows logically from general learning principles. At the early stages of acquisition, learners soon realize that for a specific phenomenon there is often nothing from the L1 to transfer to the L2 (e.g., Zulu clicks; compare with Section 10.3 on conditions for transfer). This prompts learners to attempt something other than the L1, namely U, which causes U to rise rapidly. Not making a focused effort amounts to employing the L1; making a concerted effort without success amounts to employing U. Because marked phenomena are difficult, learners continue their attempts but fail repeatedly, resulting in various U substitutions rather than L1 substitutions. For example, very few non-native speakers of Highland Chontal can master the very marked /tɬʼ/, a voiceless lateral ejective affricate. Predictably, English speakers will often produce non-English/non-Chontal sounds, namely U substitutions, and continue to do so. When L1 continues to decrease while L2 accuracy is still not achieved, the result is a rapidly rising U component. However, because the L2 is mastered slowly, U must subsequently decrease slowly.
The L1 pattern follows from the same arithmetic: because L1 + L2 + U = 100 percent, and because it has been argued that U increases rapidly and then decreases slowly, L1’s chronology must complement that of U: a rapid decrease followed by a slow decrease.
The claim that the role of U is greater in more marked phenomena than in less marked phenomena is supported by several longitudinal studies. Two studies designed to test the OPM’s markedness corollary supported its claims: a study of onset clusters in Spanish learners of English (Carlisle & Espinosa, 2015) and Jevring’s (2015) study of the English fricatives of Swedish learners of English (compare with Section 10.2.3 on similarity). Both studies found a greater occurrence of U in more marked phenomena than in less marked phenomena. Earlier studies also confirmed this. Major (1986a) reported more U substitutions for /r/ than for the less marked /ɾ/ in a longitudinal study of L2 Spanish acquisition by English speakers. In a longitudinal study of Brazilian English investigating initial and final consonants and consonant clusters in nine different environments, a variable rule analysis (VARBRUL) found that the more marked the environment, the greater the probability of U (Major, 1996). For example, in codas, the probability of U was 0.854 for double stops and 0.590 for fricative plus stop, but only 0.083 for a single stop.
Further, L1 acquisition of marked phenomena parallels L2 acquisition. At an early stage, the L1 learner most likely substitutes the nearest equivalent previously acquired (e.g., [w] for [ɹ]). However, the L1 learner soon realizes that this early-stage substitution is not adequate and so attempts other substitutions; nonetheless, these substitutions are not native-like either. They must therefore be U, because the learner has available only the portion of the L1 previously acquired and U; no other systems exist (e.g., a child acquiring L1 Spanish producing non-native substitutions for the Spanish trill /r/). This chronological development is analogous to that of the L2 learner: after realizing that L1 substitutions are not satisfactory, the L2 learner employs U. Both the L1 and the L2 learners’ continual but inaccurate attempts result in a rapid rise of U and its subsequent persistence.
10.2.5 Comparison of Normal, Similar, and Marked Phenomena
The unifying pattern for similarity and markedness is that the rate of L2 mastery is slower than for normal phenomena. However, they differ in the relative importance of L1 and U. After the initial stages, when L1 is large and U small for all phenomena, their patterns diverge. In similar phenomena, L1 persists and the proportion of U to L1 is relatively small throughout the various stages. In contrast, in marked phenomena, the patterns are reversed: U quickly becomes large and persists, resulting in a relatively large proportion of U to L1 throughout acquisition. Thus, the later stages of acquisition of similar and marked phenomena for L1 and U are near mirror images of each other (Figure 10.3 versus Figure 10.4). Two examples mentioned previously (Sections 10.2.3 and 10.2.4) illustrate the difference: an English speaker persistently substitutes /k/ for /q/ but produces non-English substitutions for /tɬʼ/. Table 10.1 compares Figures 10.1, 10.3, and 10.4.
Table 10.1 Comparison of normal, similar, and marked phenomena
| | L2 | L1 | U |
|---|---|---|---|
| Chronological and stylistic corollaries of the OPM | L2 increases | L1 decreases | U increases and then decreases |
| Similarity corollary of the OPM | L2 increases slowly | L1 decreases slowly | U increases slowly and then decreases slowly |
| Markedness corollary of the OPM | L2 increases slowly | L1 decreases rapidly and then decreases slowly | U increases rapidly and then decreases slowly |
In order to further investigate the OPM’s claims for similarity and markedness, it would be informative to examine their interaction in a phenomenon that is both marked and similar to a phenomenon in the L1. Because these factors should be mutually reinforcing, an obvious prediction is that acquisition would be even slower than for purely marked or purely similar phenomena. It would also be predicted that L1 transfer would be more important than U, because the effects of similarity and markedness on U would tend to neutralize each other (in marked phenomena U increases rapidly, but in similar phenomena U increases slowly), and because in similar phenomena L1 tends to persist. A test case could be L1 English [h] and L2 Arabic [ħ]: the two sounds are very similar, but [ħ] is very marked. This prediction was supported by a longitudinal study of Arabic. Alshalawi (1998) reported that NSs of American English produced [h] substitutions for [ħ] more frequently than U substitutions at all stages, meaning that L1 transfer was more important.
It would also be relevant to compare the patterns of marked L2 phenomena and less-marked phenomena to similar L2 phenomena and less-similar phenomena. One possible test case could be L1 Spanish /a/ and L2 English /ɔ/, /æ/, and /ʌ/ (Spanish does not have any of these vowels). English /æ/ and /ʌ/ are more marked than /ɔ/, but all three are similar to Spanish /a/. Another possible study could be L1 Spanish /t̪/ and L2 English /θ/ and /t/; English /θ/ is more marked than English /t/, but both English /θ/ and /t/ are similar to Spanish /t̪/.
10.2.6 Summary of the OPM
While L1 transfer, similarity, markedness, and U have long been investigated, remarkably, prior to the OM (Major, 1987b), research had not addressed the interrelationships of these four factors. Expanding on the OM, the OPM proposes even more explicit interrelationships. The model argues that, chronologically and as style becomes increasingly formal, L2 native-like processes increase, L1 transfer processes decrease, and universal processes increase and then decrease. It further claims that similarity and markedness govern these patterns in predictable ways. In similar phenomena L1 transfer persists, whereas in marked phenomena U persists. Furthermore, it has been argued that L1 and L2 acquisition share important similarities.
10.3 The OPM and Languages in Contact
Whereas SLA ontogeny deals with an individual over time (Section 10.2), phylogeny deals with populations over part of a generation or many generations, including long-term changes in whole languages and language families. This section deals with phylogeny. It argues that the OPM’s claims hold true for large populations, both synchronically and diachronically. It demonstrates how the L2 acquisition of an individual has striking similarities to languages in contact, including such outcomes as loanwords, isolation and assimilation, bilingualism and multilingualism, dialects in contact, and pidgins and creoles.
10.3.1 Loan Phonology
Loan phonology usually involves a very rudimentary form of L2 acquisition, in which L1 transfer completely dominates. For example, Koran in English is pronounced with initial [kʰ], not [q]. However, sometimes a new sound is adopted. The Maori of the Cook Islands has no /s/, yet Jesus is pronounced [iesu]. Phonological changes can be limited to particular words. Some speakers with a classical musical and literary background carefully pronounce Bach [bax] and genre [ʒɑ̃ɹə]. The [x] may occur only in this one word Bach. In contrast, Reich is pronounced with final [k], and the deletion of [n] before r in genre does not occur in other words. Although [bax] Bach and [ʒɑ̃ɹə] genre may sound cultured and educated, [tʰæxɑməɾɹ̩] tachometer and [hɛ̃ɹi] Henry sound odd or amusing. The special status of loanwords is also evident in dialects of the same language. In the United States, many young white speakers of standard American English pronounce cool [kʰuw], yet these same speakers pronounce pool [pʰuɫ] (the slang use of “cool” reportedly originated with African American jazz musicians; in African American English, coda /l/ → [w]). Though sound change is usually regular (according to the Neogrammarians, e.g., IE p > Germanic f: pater > father), these examples indicate that change can be limited to certain words (compare with British /ɑ/ vs. /æ/ in can’t, can, pant).
Loan phonology is relevant to the OPM because of the occurrence of U. Some English speakers pronounce Bach and Van Gogh with final [h], rather than the [x] of native German and Dutch, respectively. In attempting final [x], speakers produce final [h], which does not occur in that position in English, thus resulting in a change in English phonotactics (note: in native Dutch, Gogh is pronounced [xɔx], not *[gɔx], as some English speakers pronounce it). Onset cluster deletion is a further example of U (e.g., Pfizer, psoas, and tsunami). Although coda cluster deletion in English occurs synchronically (e.g., clothes → clo[z]), onset deletion does not and is therefore a U process limited to loanwords.
We also find U in the pronunciation of the names of two well-known former professional athletes, Brett Favre and Patrick Roy. In US and Canadian English, Favre is pronounced [fɑɹv] and Roy [wɑ]. Metathesis, which occurs in Favre (and also in hors d’oeuvre), is a diachronic process in English but not normally an adult synchronic process, except in speech errors; thus, it should be considered U (although it is very common in L1 acquisition, e.g., a[mɪn]al for animal). The deletion of r in Roy ([ʁwa] in native French) is likewise a U process, because L1 transfer would predict epenthesis, not deletion (e.g., Rwanda [ɹəwɑndə], not *[wɑndə]; but Roy [wɑ], not *[ɹəwɑ]).
The Bach example is also relevant to both the chronological and the stylistic corollaries of the OPM. Some English speakers substitute [k], others use [h], while still others achieve native [x], representing stages of the chronological corollary: L1 transfer → U → native L2. On a classical music radio broadcast, an announcer may carefully pronounce Bach with final [x], though [h] when less careful, yet [k] outside of work, perhaps not wanting to sound pretentious. This corroborates the stylistic corollary.
The similarity and the markedness corollaries pertain to loan phonology as well. In similar phenomena L1 transfer is relatively more important than U and in marked phenomena the situation is the reverse. Thus, English speakers typically pronounce coup d’état with aspirated (long-lag VOTs) [kʰ] and [tʰ] instead of the similar unaspirated French (short-lag VOTs) [k] and [t̪], whereas the same speakers carefully trying to pronounce the word Xhosa likely produce non-English/non-Xhosa sounds, rather than a [kʰ] or [h].
10.3.2 Isolation, First Language Attrition, and Assimilation
First language attrition,Footnote 10 assimilation, bilingualism, and isolation can result when different languages come into contact (compare with Weinreich’s [1953] classic work). Complete isolation is rare because monolinguals often have contact with bilinguals; therefore, the two groups mutually influence each other.
Another possible outcome is L1 attrition, the extreme form being complete L1 loss with subsequent assimilation to the dominant language. Attrition can be considered acquisition in reverse. Thus, the OPM proposes that the beginning stages of L1 attrition are the later stages of L2 acquisition (Figures 10.1, 10.3, and 10.4). Consider children acquiring the NL of their immigrant parents, who are themselves undergoing attrition. Most L1 acquisition input is casual speech (e.g., conversation versus carefully reading a word list). Because research has shown that L1 attrition in adults first occurs in casual speech, the first generation born in the new country will show significant attrition, perhaps more than their parents, compounded by the influence of the dominant language of that country.
Further, L1 attrition has striking similarities to some of Jakobson’s work from more than eighty years ago (Jakobson, [1941] 1968).Footnote 11 He postulated interrelationships between child language, aphasia, and phonological universals, convincingly arguing that language loss in aphasic adults mirrors L1 acquisition in reverse; specifically, phenomena that are acquired last are lost first. His claim – last acquired, first lost – parallels L1 attrition as well. Thus, L1 attrition is L2 acquisition in reverse. In accordance with the OPM, the L2 will transfer to the L1. Moreover, U is also a factor; for example, in L1 Spanish attrition in the US, speakers often merge English /ɛ/ and /æ/ (as do many monolingual speakers in the western USA, where Ellen and Allan are homophones and penguin and sanguine rhyme). Since Spanish has neither vowel and these speakers do not substitute Spanish /e/ or /a/, this merger must be attributed to U. Thus, Figure 10.1 also depicts L1 attrition in reverse.
The vast majority of US monolingual NSs of English have immigrant ancestors who did not speak English natively. Consequently, American English is a product of L2 acquisition, leaving traces in speakers who have long since lost their ancestral languages. In Minnesota and North Dakota, there is speculation that the very back and monophthongal /o/ can be attributed to German and Scandinavian immigrants. A number of mergers in American English possibly could be traced to incomplete L2 acquisition, a form of attrition from the original British English being acquired. Among them are the merger of British /ɑ/, /ɒ/, and /ɔ/ to American /ɑ/ and /ɔ/ or to the single vowel /ɑ/ (especially in the West), and the loss in most American dialects of the distinction between /w/ and /ʍ/.
10.3.3 Bilingualism and Multilingualism
In bilingual acquisition in childhood, there are four components: U, language A, language B, and shared language A and B. For example, at some stage the child may produce only three vowels in both languages,Footnote 12 even though A and B have five and eleven vowels (e.g., Spanish and English). At a later stage, the learner exhibits evidence of two vowel systems (though still not native adult systems), while still retaining a core shared system characterizing both languages. Later still, the proportions of A and B increase, resulting in a proportional decrease of shared A and B. At the final stage the two systems become separate. Two completely separate systems characterize the idealized learner. It is unlikely that such a person exists, because both languages continually influence each other. Moreover, bilingual speakers are never equally competent in both languages in all environments. For this to happen, absolutely simultaneous bilingual acquisition would have to occur, a physically impossible situation, except hypothetically in cloned individuals. The result is two new language varieties (Anew and Bnew) having components of the original A and B (i.e., mutual transfer), remnants of U, as well as a mutually shared core system. The same patterns also apply to multilingualism.Footnote 13 For example, in the US Southwest, some Spanish-English bilinguals contrast /b/ and /v/ in Spanish, a contrast that does not occur in standard Latin American Spanish.
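The staged development just described can be pictured schematically as a set relationship between language-specific categories and a shared core. The vowel inventories below are hypothetical placeholders chosen only to echo the five- and eleven-vowel example; the point is how each language's system decomposes into a shared core plus its own categories.

```python
# Illustrative sketch (hypothetical inventories): the components of bilingual
# phonological acquisition pictured as sets of vowel categories. Early on,
# one small shared system serves both languages; later, language-specific
# categories grow and the shared core shrinks proportionally.

early_shared = {"i", "a", "u"}   # one three-vowel core serving both A and B
early_a_only = set()             # no A-specific (Spanish-like) categories yet
early_b_only = set()             # no B-specific (English-like) categories yet

late_shared = {"i", "u"}                      # residual shared core
late_a_only = {"e", "a", "o"}                 # A-specific (five-vowel system)
late_b_only = {"ɪ", "ɛ", "æ", "ʌ", "ɑ", "ɔ",  # B-specific (eleven-vowel
               "ʊ", "e", "o"}                 # system; /e/- and /o/-like
                                              # categories held separately)

def system(shared, own):
    """A language's full inventory = shared core plus language-specific categories."""
    return shared | own

print(len(system(early_shared, early_a_only)))  # 3 vowels in each language early on
print(len(system(late_shared, late_a_only)))    # 5 (Spanish-like)
print(len(system(late_shared, late_b_only)))    # 11 (English-like)
```

In this toy decomposition, growth of the language-specific sets automatically reduces the proportion of the inventory accounted for by the shared core, mirroring the staged development in the text.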
Furthermore, these new varieties (Anew and Bnew) can affect monolingual speakers of A and B. In the US Southwest, where there is widespread Spanish-English bilingualism, monolingual English speakers seem to have more of a syllable-timed rhythm (typical of Spanish) than monolingual English speakers from other parts of the US, who have a stress-timed rhythm. Such mutual influence can also result in phenomena that differ from those of monolinguals of both languages. In a classic study of bilinguals, Caramazza et al. (1973) found the VOTs of word-initial stops (based on perception and production) in the English and the French of French-English bilingual Canadians to be intermediate between those of monolingual speakers of both languages. These values may be attributed to U because they differ from those of monolinguals (but see Moyer, 2021).
10.3.4 Dialects in Contact
The differences between language and dialect are not always based on linguistic criteria but often on political, ethnic, and social criteria. For these reasons, SLA necessarily includes dialects in contact because a new dialect is indeed an L2. Thus, the OPM principles operate: at the initial stage L1 transfer predominates (native dialect). As acquisition proceeds, the L2s (new dialects) increase and the L1s (native dialects) decrease, while U increases and then decreases. Finally, typically remnants of the L1s and U remain, similar to an individual acquiring an L2.
It is widely known that the more intense and varied the contact, the more frequent are mergers (Trudgill, 1986). Some distinctions made in the Eastern US are absent in the Midwest and even more so in the West, where frequent migration continues. In Los Angeles, San Francisco, and Phoenix, the majority of residents are not native-born. The distinction between /æ/ and /ɛ/ before /ɹ/ in the Northeast (e.g., Harry/hairy) is absent in the Midwest and the West. In many areas of the US, writer and rider are pronounced r[ʌj]der and r[ɑj]der, respectively, but they merge in the West, resulting in a diphthong that is intermediate between the two. The /ɑ/–/ɔ/ distinction has been lost in the entire West. Many Western speakers merge certain vowels before /l/: wheel/will, sale/sell, Ellen/Allan, colt/cult, pole/pull. In standard British English, the following pairs contrast phonologically: where/wear, tune/toon, dew/do, as well as the vowels in balm, bomb, and bought (/ɑ/, /ɒ/, /ɔ/). These distinctions rarely occur in the USA.
The similarity and the markedness corollaries of the OPM are also relevant to dialects in contact. The OPM claims that in similar phenomena, L1 plays a larger role than U, but in marked phenomena the opposite is true. Accordingly, because of the similarity of the mother dialects, L1 plays a greater role than U (e.g., Midwestern versus Western US English compared to Mandarin versus German).
Labov (1994, p. 363) coined the term “the Bill Peters effect,” which is germane to the OPM’s stylistic corollary. Bill Peters, an eighty-year-old man, exhibited a distinction between /ɔ/ and /ɑ/ in spontaneous speech (his native dialect) but merged them in minimal pairs, as is typical of younger speakers. The dialect with this merger can be considered Peters’ L2 because he learned it only later in life. Labov further documented a number of similar cases. Peters’ greater accuracy with the new dialect (the L2) in formal speech than in spontaneous speech is evidence supporting the OPM.
10.3.5 Pidgins and Creoles
When individuals sharing no common language come into contact, a pidgin can form; if it becomes nativized, it is termed a creole. Schumann (1978) likened L2 acquisition to pidginization, drawing evidence from the oft-cited case of Alberto’s acquisition of English.
The OPM claims that the patterns of L1, L2, and U hinge on whether the phenomena are normal, similar, or marked. When the L1s share similar phenomena, transfer is a larger component than when they have little in common, in which case U plays a greater role. There are important differences between pidgins and creoles on the one hand and dialects in contact on the other. In dialects in contact, L1 transfer is generally more important than U because of the similarity of the mother languages (Section 10.3.4); in pidgins and creoles, however, U is more important because of the frequent dissimilarity of the mother languages. In US Western dialects, which are the result of dialects in contact, a number of marked phenomena are present (/θ/ and /ð/, coda consonant clusters, and coda voiced obstruents), and these are also present in the vast majority of the mother dialects. In contrast, many English creoles have none of these marked phenomena; moreover, marked phenomena are rare in all pidgins and creoles (e.g., Tok Pisin has no interdental fricatives and Haitian Creole has no front rounded vowels; Valdman, 2015; Verhaar, 1995).
One reason why U is a prominent factor in pidginization is that speakers have little or no shared knowledge of each other’s languages; therefore, they resort to universal principles in order to communicate. Bickerton’s (1981) well-known bioprogram hypothesis argues for the importance of universals in pidginization. Successful communication far outweighs any attention to form, which is in any case a moot point because when a pidgin is created, no established standard exists. Thus, a speaker is more likely to employ U than an L1 with someone who has no familiarity with that L1. As a pidgin forms, speakers with different L1s produce different varieties. Because communication is paramount, these different varieties become increasingly influenced by general acquisition principles (U) and less and less influenced by their L1s. If and when a pidgin becomes more standardized (and/or becomes a creole), the U components can then be considered part of the language and no longer U (e.g., Haitian Creole has affricates but French does not; coda obstruent devoicing is standard in German but is also a common U process in L1 and L2 acquisition; Section 10.2.1). When children learn a pidgin as their L1 (thus forming a creole), they will hear different varieties and necessarily create a compromise, an intermediate category. This results in mergers (as with dialects in contact), which are widespread in all pidgins and creoles (e.g., Tok Pisin has five vowels but English has nine to eleven, depending on the dialect).
The preponderance of U is evident in Haitian Creole. In an extensive work, Valdman (2015) describes a host of phenomena in Haitian Creole that do not occur in French. Among them are /ɣ/, /ɣ/ → [w] (< French /ʁ/), /j̃/ (< French /ɲ/, zonyon [zõj̃õ] “onion”), /ʧ/ and /ʤ/,Footnote 14 intervocalic voicing and deletion of voiceless stops, and hypercorrection. These are natural phonological processes that occur in a number of languages, both diachronically and synchronically (e.g., English /ʧ/, /ʤ/ < /kʲ/, /gʲ/; Portuguese casual speech minha [mĩɲa] → [mĩj̃a] “my” [fem.]).
Usually one language is dominant in pidgins and creoles. If a pidgin survives and becomes a creole, it may undergo decreolization or hypercreolization. If the creole experiences continual contact with the dominant language, decreolization is favored (e.g., Jamaican English) and thus the dominant language becomes increasingly more important than U. However, if contact is cut off, then hypercreolization is favored (e.g., Tok Pisin and Haitian Creole). The result is that U becomes increasingly more important because the creole continues to develop independently and is therefore influenced more by general linguistic U than by the original dominant language. However, other factors can be more important, such as group identity and solidarity. For example, though most speakers of Hawaiian Creole English have continual contact with standard English, the creole does not seem to be undergoing decreolization; rather, it seems to be becoming even more distinct from standard English, very likely because of its importance to Hawaiian identity.
Decreolization is analogous to SLA when L2 learners have continual contact with NSs and become increasingly more native-like (e.g., immigrants). In contrast, hypercreolization is similar to learners being cut off from contact with NSs while continuing to use the L2 with other NNSs (non-native speakers) with different L1s (e.g., a group of professionals with different L1s working together in a country where English is not spoken natively). This L2 continues to develop independently, with less and less influence from other speakers’ L1s and from the original NS version of the L2. Thus, this process is similar to hypercreolization and is continually taking place in international Englishes (see Trudgill & Hannah, 1994).
10.3.6 Summary of the OPM and Languages in Contact
In sum, SLA encompasses both an individual and languages in contact. The OPM claims that its principles apply not only to the individual but also to languages in contact, including loanwords, isolation and assimilation, bilingualism and multilingualism, dialects in contact, and pidgins and creoles. For example, in similar phenomena and in dialects in contact, L1 transfer is more prominent, but in marked phenomena and in pidgins and creoles, universals are more prominent.
10.4 Conclusion
The OPM makes explicit SLA interrelationships between L1 transfer, L2 mastery, and universals (L1, L2, and U). It claims that chronologically, and as style increases in formality, L2 increases, L1 decreases, and U increases and then decreases. These patterns in turn are mediated by similarity and markedness, which govern the relative proportions of L1, L2, and U: L1 transfer persists in similar phenomena, whereas U persists in marked phenomena. The OPM also demonstrates how L1 and L2 acquisition are analogous, and further argues that its claims hold true for languages in contact. The OPM therefore deepens our understanding of language because not only does it make explicit the ontogenetic interrelationships between transfer and universals in SLA, L1, and bilingual acquisition but it also demonstrates how these same principles obtain phylogenetically for languages in contact.
11.1 Introduction
The “speech chain” invites the image of a talker on one end and a listener on the other. In producing speech, the talker, while limited anatomically and physiologically in what they can utter, has some amount of agency as they sample from their production repertoire (e.g., styling one’s speech appropriately for an infant or an adult). The selected linguistic message exits the vocal tract via a motor plan that orchestrates a number of articulatory actions, and results in an audiovisual speech signal that is complex and noisy. The listener engages in a multifaceted mapping process that requires them to identify, synthesize, and prioritize a multitude of acoustic and visual cues from the incoming signals in real time. The information received by the listener is rich and allows for the processing of the linguistic message (e.g., an offer of coffee), the talker’s identity (e.g., from your friend Sahil), in addition to larger pragmatic and contextual cues (e.g., in a noisy coffee shop where other voices, perhaps in other languages, compete for your auditory attention). This talker-and-listener dynamic takes place between individuals, but also within an individual in a posited production-perception link. A strong connection between an individual’s production repertoire and their perceptual representations is at the heart of exemplar-based theories of speech: simply, a listener’s auditory experiences feed into the available forms for production.
Before laying out the assumptions within any exemplar-based model, an initial foray into complicating the outlined phonetic space for bilinguals is necessary. We attempt to do so initially without introducing the theoretical assumptions and jargon of any particular framework. Starting with the more observable speech production, a bilingual’s two languages often, but not always, exhibit cross-language mutual influence (see Section 11.3.2). For example, the Cantonese /u/ from a Cantonese-English bilingual may be a little English-like and the English /u/ may be a little Cantonese-like. This observation suggests that somewhere within the production pipeline there is a link between Cantonese and English. In perception, there is also evidence of mutual influence (see Section 11.3.1), but the situation for a bilingual listener can be a bit more complicated. For example, a bilingual listener needs to be able to adjust their thresholds and categories for phonetic and phonological units based on the language. This adjustment requires that listeners use signal-based information (or other environmental cues) to identify the language and adapt accordingly. Evidence from early bilinguals and proficient second language learners shows that certain bilinguals can, indeed, adjust their categorization boundaries in this way (Casillas & Simonet, 2018).
11.2 The Basics of Exemplar-Based Theories
A shared proposition of any exemplar model is that phonetic detail beyond the abstract phonological category, often accompanied by social and speaker-specific information, is part of linguistic knowledge (Goldinger, 1996, 1998; Johnson, 1997; Pierrehumbert, 2002, 2003; Sumner et al., 2014). The evidence is unclear as to whether the social information is part and parcel of the linguistic representation or connected via an associative network. A note on terminology is crucial here, as what exactly is represented in an “episode” or any linguistic representation is the million-dollar question. Exemplar-based models are often caricatured with claims that speakers/listeners have spectrograms (or cochleagrams or some other auditorily informed transformation of the acoustic signal) dancing around in their heads. This caricature is often used to deride exemplar-based models, but it rightly points to the elephant in the room: the exact nature of the mental representation of speech is unknown. While we do not yet know the exact nature of sound-based linguistic representations in the mind, behavioral work has shown that users of language are keenly aware of phonetic information and its associated linguistic and social meaning on implicit and, occasionally, explicit levels. These behavioral results are supported by a growing body of work in neurolinguistics. The superior temporal gyrus decodes phonetic detail (Mesgarani et al., 2014), and the auditory cortex draws on talker-specific cues in vowel normalization (Sjerps et al., 2019), implicating these regions in the processing of phonetic detail.
In short, while exemplar models use varying terms such as subphonemic, episodic-acoustic, indexical, fine-grained representations, and so on, what these terms have in common is an acknowledgment that phonetic and phonological knowledge exists below the phonemic or allophonic levels. The goal of the current chapter is not to adjudicate between such labels. We refer globally to this body of literature as exemplar models, to be contrasted with purely abstract models (see Section 11.4.1).
While early work paved the way for existing exemplar interpretations and models in the field of linguistics (e.g., Goldinger, 1996; Hintzman, 1986; Johnson, 1997; Nosofsky, 1986, 1992; Pierrehumbert, 2002, 2003), the core tenets of exemplar-based theories have a long history in cognitive psychology. Much of the seminal behavioral evidence that talker-specific phonetic detail facilitates the processing of spoken language came from word and talker recall tasks: listeners are more accurate at recalling words as previously heard when the words are repeated in the same voice (e.g., Palmeri, Goldinger, & Pisoni, 1993). Word identification and categorization paradigms have also been used to demonstrate listeners’ sophisticated knowledge of which social groups produce which sounds in which ways (e.g., Hay, Warren, & Drager, 2006).
The assumption that listeners have detailed phonetic knowledge that can be yoked to social and contextual categories is relatively widespread across many areas of linguistics beyond phonetics and phonology – for example, historical linguistics and language change (Skousen, 1989; Todd, Pierrehumbert, & Hay, 2019; Wedel & Fatkullin, 2017) and child phonological acquisition (Jusczyk, 1993). Crucially, however, any viable theory or model that includes exemplar-like representations must also include mechanisms for abstraction at various levels of the linguistic system. The need for abstraction to operate over experiences has been noted by prominent theorists (Pierrehumbert, 2016).
We continue with two assumptions about exemplar-based models. The first assumption is that any flavor of exemplar model provides language users with sensitivity to phonetic detail. This sensitivity does not mean that all phonetic detail is encoded or weighted equally in a linguistic representation. What listeners attend to and encode in the phonetic detail is also not fixed, but can evolve over time. This is, arguably, particularly important in the bilingual context, as individuals’ exposure to and experience with their languages change over the life span. The second assumption is that phonetic knowledge – at a detailed or abstracted level – is associated with social and contextual information, connecting “core” linguistic knowledge with indexical and sociopragmatic cues and categories. This dispersed network contextualizes speech and language in the socially rich world that individuals inhabit.
11.3 Shared Linguistic Structures
Before we examine how an exemplar model can accommodate bilingual information, we must consider the linguistic structures that have been posited to be connected across an individual’s language systems. This is not a chapter on the robust evidence for cross-linguistic connections, but we must visit this topic to make apparent the required cross-linguistic links within a bilingual exemplar model. In our too-brief overview here, we focus on the sound-based level, extending to lexical and social categories. The body of work on bilingual cross-linguistic influence is expansive, and growing at an impressive rate. Our discussion here touches on only a fraction of the important work that has been conducted in this area, and should not be considered an exhaustive summary by any means.
11.3.1 Perception
Cross-linguistic connections in perceptual representations have long been studied,Footnote 1 approached largely from the perspective of second language acquisition. Two popular frameworks within the language learning space are the Perceptual Assimilation Model-L2 (PAM-L2; Best & Tyler, 2007) and the Speech Learning Model (SLM; Flege, 1995). These frameworks postulate connections between perceptual categories across languages such that category boundaries are nudged by cross-linguistic mutual influence. The revised SLM framework, the SLM-r (Flege & Bohn, 2021), is an updated and more dynamic model that allows for continual reorganization of cross-linguistic connections across the life span. We recommend Chang (2019) as an overview of the phonetics of second language learning.
While early work indicated that bilinguals may exhibit perceptual behaviors that fall between those of monolingual groups from the respective languages (Caramazza et al., 1973), subsequent work focusing on early bilinguals and highly proficient late bilinguals suggests that these groups shift their categorization boundaries based on the language they are perceiving. For instance, the threshold for stop voicing (e.g., /b/ vs. /p/) is shifted along a voice onset time (VOT) continuum for Spanish-English bilinguals depending on the language, suggesting that listeners can rely on different, and thus separable, categories in perception and recognition.Footnote 2 This behavior is called a double phonemic boundary (Casillas & Simonet, 2018; Garcia-Sierra, Diehl, & Champlin, 2009; García-Sierra et al., 2012; Gonzales & Lotto, 2013). Bilinguals may also switch in and out of language modes based on previously heard speech, social and pragmatic environments, or linguistic cues. Those linguistic cues may be phonotactic or subphonemic in nature. Not all perception tasks point to the same linguistic knowledge. For example, Antoniou, Tyler, and Best (2012) examined early bilinguals’ performance on categorization, goodness rating, and discrimination tasks. Language-specific patterns were found for categorization and goodness rating responses, but not for discrimination, which appeared to be dictated by the listener’s more dominant language. While more work is necessary to fully understand the effects of perceptual modes on listening for bilinguals, it is clear that bilinguals, particularly early bilinguals, exhibit sensitivity to phonetic detail.
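The double phonemic boundary can be illustrated with a minimal sketch. The logistic classifier below is our own illustration, not the model used in the studies cited above, and the boundary and slope values are hypothetical (chosen only to reflect the familiar generalization that Spanish voicing boundaries sit near the short-lag region while English boundaries sit at longer VOTs):

```python
import math

def p_voiceless(vot_ms, boundary_ms, slope=0.3):
    """Probability of a voiceless (/p/) response for a VOT value,
    modeled as a logistic function around a category boundary."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

# Hypothetical boundaries: Spanish short-lag stops place the /b/-/p/
# crossover near 0 ms VOT; English long-lag stops push it near 25 ms.
BOUNDARY = {"spanish": 0.0, "english": 25.0}

def classify(vot_ms, language_mode):
    """A 'double phonemic boundary': the same VOT token can be labeled
    differently depending on which language the bilingual is perceiving."""
    return "/p/" if p_voiceless(vot_ms, BOUNDARY[language_mode]) > 0.5 else "/b/"

# A 15 ms VOT token falls on the voiceless side of the Spanish boundary
# but on the voiced side of the English boundary.
print(classify(15.0, "spanish"))  # /p/
print(classify(15.0, "english"))  # /b/
```

The point of the sketch is only that a single acoustic dimension supports two separable category boundaries, selected by a language cue, which is the behavior reported for Spanish-English bilinguals.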
11.3.2 Production
Across a range of methods, there is clear and robust evidence that bilingual speech production can show cross-linguistic influence (e.g., Baker & Trofimovich, 2005; Barlow, 2014; Flege, 1987; Fricke, Kroll, & Dussias, 2016; Major, 1992; Olson, 2016; Sancier & Fowler, 1997; Simonet, 2011). This cross-linguistic influence suggests the existence of linked categories. However, early bilinguals can maintain distinct phonetic categories across their two languages (e.g., Guion, 2003; Sundara, Polka, & Baum, 2006; MacLeod, Stoel-Gammon, & Wassink, 2009), and the evidence that bilinguals habitually produce compromised categories – those that average across the categories of their multiple codes – is not particularly strong (Casillas, 2021). Multilingual individuals are socially sophisticated, capable of performing different degrees of cross-linguistic influence in pronunciation variation based on social demands (Bullock & Toribio, 2009). As such, a model of bilingual speech production must allow for the interaction of phonological and phonetic categories without the requirement that one language’s categories subsume the other’s. The degree of cross-linguistic influence also needs to be flexible synchronically and across an individual’s life span (Simonet & Amengual, 2020). The dynamic nature of the strength of cross-linguistic influence in production is apparent in studies that manipulate language mode (Grosjean, 1998, 2008).
In experimental contexts where researchers attempt to present a unilingual mode, bilinguals can show, for example, phonetic patterns that are equivalent to monolingual norms (e.g., Antoniou et al., 2010; Henriksen et al., 2021), whereas conditions that prompt a multilingual mode through, for example, code-switching elicit cross-linguistic influence in phonetic patterns (e.g., Henriksen et al., 2021). The empirical evidence is mixed as to whether language mode effects vary as a function of language dominance (e.g., Amengual, 2018; Simonet, 2014).
11.3.3 Social Meaning
The subphonemic variation in speech that is associated with social identity is rich, and it is well studied within monolingual populations. There appears, however, to be limited research in this domain in bilingual populations. One notable study is Szakay, Babel, and King (2016), who queried the social associations across languages and dialects in the New Zealand context with a sample of English-Māori bilinguals. New Zealand English has two main varieties: Māori English and Pakeha English. These Englishes are associated with ethnically Māori and ethnically (white) European individuals, respectively. Māori English is acquired as a native variety of English and is positively associated with Māori culture and identity in New Zealand (Holmes, 2005; Maclagan, King, & Gillon, 2008). Szakay et al. (2016) assessed whether the Māori social association with Māori English led to stronger L2 (Māori)–L1 (English) priming when Māori English was the target, as opposed to Pakeha English. We walk through this hypothesized social connection with the lemma snow. (See figure 1 in Szakay et al. [2016] for a schematization to accompany our description.) The lemma concept of “white flakes of frozen water vapor” exists in both languages and has a distinct lexeme in each, which can be orthographically represented as huka and snow. Connections from the L1 lexeme to the L2 lexeme through the shared lemma are stronger than those in the reverse direction, which is typically seen as a stronger L1–L2 priming effect (e.g., Jiang & Forster, 2001) compared to the weaker L2–L1 direction.
Of note in our example with snow, the goose vowel in New Zealand English exhibits pronunciation variation indexed to Māori and Pakeha identities, with realizations of snow as [snɵʉ̘] and [snɐʉ] in Māori English and Pakeha English, respectively.Footnote 3 The social link between the Māori English pronunciation of snow [snɵʉ̘] and the Māori huka [hʉ̙kɐ̝] was hypothesized by Szakay et al. (2016) to facilitate L2–L1 priming for Māori English targets. Indeed, they found that Māori effectively primed Māori English, but not Pakeha English, suggesting that a social category connection facilitated L2–L1 processing. Listeners were sensitive to the subphonemic variation in Māori English that indexed Māori identity, which is itself intricately connected to the Māori language.
This particular example of cross-linguistic, dialect-specific L2–L1 priming is crucial because the lack of shared phonemes in Māori huka and English snow strongly suggests that the cross-language activation must come from subphonemic cues. Indeed, Māori English and Māori are known to share suprasegmental similarities (Szakay, 2012). Listeners must be sensitive to these subphonemic suprasegmental differences in order to use them in the creation, maintenance, and use of a social Māori category across codes. This is a classic example of the fine-grained phonetic detail in an exemplar network being used to facilitate linguistic processing.
11.4 Building Toward a Framework
The overview of shared linguistic structures in bilinguals identifies the desiderata for a bilingual exemplar model. In perception, listeners are able to operate in distinct language modes as a function of language experience (e.g., age of acquisition, continued language use, etc.). In production, coactivation of an individual’s languages entails the possibility of cross-linguistic (mutual) influence. This influence may be observed on cross-linguistically linked phones or at the lemma level, which connects concepts and translation equivalents. Social information carried in the speech stream may also link distinct languages.
11.4.1 Phonetic Details and Abstract Categories
Exemplar models have historically been contrasted with abstract models, in which mental representations of language are exclusively available in an abstracted form, leaving fine phonetic detail unrepresented (e.g., Cohort: Marslen-Wilson, 1984, 1987; TRACE: McClelland & Elman, 1986; Shortlist: Norris, 1994). In these types of models, while token-specific information may be perceived in the speech signal, a normalization process results in a more abstract sublexical representation, with models differing in the nature of the abstract sublexical units (Weber & Scharenborg, 2012). The dichotomy between abstract and exemplar interpretations was more explicit in early phonetic research. For instance, in a study by Pallier, Colomé, and Sebastián-Gallés (2001), Catalan-Spanish bilinguals, varying in Spanish or Catalan dominance, were tested in a “medium-distance” primed auditory lexical decision task. Minimal pairs bearing Catalan-specific vowel contrasts were separated by eight to twenty trials (e.g., /netə/ “granddaughter” – /nɛtə/ “clean, fem.”) and matched with corresponding identity pairs (e.g., /netə/ – /netə/; the same word is repeated twice, but as acoustically unique instances). Pallier et al. (2001) observed that Catalan-dominant speakers showed priming for identity pairs but not minimal pairs, while Spanish-dominant speakers showed priming for both identity pairs and minimal pairs. These findings were taken as evidence in support of abstract models over exemplar models, since Spanish-dominant speakers could only have demonstrated minimal pair priming if the Catalan contrasts were being assimilated to native language categories, with minimal pairs in Catalan being perceived as identity pairs.
Recent work has provided growing empirical support for hybrid models that assume that linguistic knowledge is composed of both abstract and exemplar representations: listeners are sensitive to phonetic detail and make generalizations over and across both experienced items and abstract categories (Pierrehumbert, 2016). Lexically guided perceptual learning paradigms offer a nice example of the need for both attention to and retention of phonetic detail and generalization over categories. In these paradigms, listeners are exposed to novel pronunciations in the context of a lexical frame. This lexical frame provides the scaffolding necessary for a listener to connect the novel pronunciation to a particular phoneme or allophone. This exposure causes an adjustment of a category, such that listeners’ categories accommodate the new pronunciation variant. The empirical support for perceptual adaptation is robust (e.g., Eisner & McQueen, 2006; Jesse, 2021; Kraljic & Samuel, 2007; McQueen, Cutler, & Norris, 2006; Norris, McQueen, & Cutler, 2003; Samuel, 2016), and the literature includes work on bilinguals (Bruggeman & Cutler, 2020; Chan, Johnson, & Babel, 2020; Cutler, Burchfield, & Antoniou, 2019; Drozdova, Van Hout, & Scharenborg, 2016; Reinisch, Weber, & Mitterer, 2013). Within bilingual populations, however, the results are somewhat mixed.
While there is evidence that bilinguals show adaptation in their second language (Drozdova et al., 2016; Reinisch et al., 2013), some data suggest that a bilingual’s less dominant language may not be so adaptable (Bruggeman & Cutler, 2020; Cutler et al., 2019; but see also Chan et al., 2020). What is important for current purposes is that a prerequisite for adaptation is sensitivity to phonetic detail that connects with phonological categories. It may be that, in some bilingual scenarios, the appropriate connection strength is lacking. This sets the stage for the crucial next step of this chapter: the introduction of a particular exemplar model.
11.4.2 Weaving Bilingualism into an Exemplar Model
It is apparent that a model of speech perception needs to provide the structure and the mechanisms to account for both sensitivity to phonetic detail and the ability to abstract and generalize. Exemplar-based models typically form abstract generalizations over experienced items. One must remember, however, that this does not mean that all experienced items are treated equivalently. Variation is socially weighted such that not all experienced items contribute equally to the category (Sumner et al., 2014). In many exemplar models, a production–perception feedback loop is central to the model’s architecture (e.g., Ettlinger, 2007; Harrington et al., 2018; Pierrehumbert, 2002; Sóskuthy, 2015; Todd et al., 2019; Tupper, 2015; Wedel & Fatkullin, 2017). Note that none of these frameworks is designed to model bilingual knowledge, but at the heart of many is the tagging of phonetic variation with respect to its social and pragmatic contexts. While it is not a stretch to adapt such a concept to the tagging of language, what becomes complex and messy is the nature of the connections across languages within a bilingual.
Let us walk through the basic structure of Todd et al. (2019), a recent instantiation of an exemplar-based model, and note opportunities for bilingualism to factor into the model. We focus on the model by Todd et al. (2019) because it is computationally implemented (on a small scale) and thus well specified, and it builds upon the seminal work by Pierrehumbert (2001), which is, arguably, the best-known exemplar model in linguistics. This multilayered framework builds on crucial insights from previous exemplar models (e.g., Harrington et al., 2018; Johnson, 1997; Nosofsky, 1986; Sóskuthy, 2015; Wedel, 2006, 2012; Wedel & Fatkullin, 2017), but is unique in its integration of these stages into a single architecture. Todd et al. (2019) advance an exemplar-based computational model in which the listener’s role in navigating and organizing phonetic variation is emphasized. While this description sounds as though it affords the listener some amount of agency in the process, the point is that the processes of perceiving, categorizing, encoding, and recognizing are iterative, and transform the original signal into a mental representation that has been evaluated several times over. This provides multiple opportunities for bilinguals’ multifaceted linguistic knowledge to be integrated into an exemplar framework.
Let us begin by establishing that, in this model, categories and types are abstract generalizations of experienced phonemes and words, respectively. In terms of the architecture, we start on the listener side of the production–perception loop. In Todd et al. (2019), activation is the first stage in perception: the apprehension of an utterance excites stored exemplars based on their distance from the incoming item in perceptual-acoustic space. Note that the nature of the phonetic detail involved in this activation and similarity-matching process is unspecified.Footnote 4 Which aspects of the signal serve as grounds for the similarity assessment in activation is one opportunity for bilingual interaction. A bilingual will have not only different but also more categories and types available for activation than a monolingual. The similarity assessment can also change in the context of language learning. For example, as native speakers of Korean learn English, they adjust their spectral and duration cue weights for the /i/–/ɪ/ contrast in English (Kim, Clayards, & Goad, 2018). In learning to weigh the spectral cue more heavily, individuals may rely more on spectral information in the similarity assessment of the activation phase.
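The activation stage can be sketched as similarity-weighted excitation of stored exemplars. The sketch below uses an exponential decay of weighted distance in the spirit of Nosofsky’s generalized context model; the cue names, values, and weights are hypothetical, chosen only to illustrate how reweighting spectral versus duration cues (as in the Korean-English /i/–/ɪ/ case) changes which exemplars a token activates:

```python
import math

def activation(token, exemplar, weights, sensitivity=1.0):
    """Exemplar activation decays exponentially with weighted perceptual
    distance (in the spirit of Nosofsky's generalized context model)."""
    dist = math.sqrt(sum(w * (token[cue] - exemplar[cue]) ** 2
                         for cue, w in weights.items()))
    return math.exp(-sensitivity * dist)

# Hypothetical /i/ exemplar and incoming token, each described by two
# normalized cues: spectral quality and duration.
stored_i = {"spectral": 0.2, "duration": 0.8}
token    = {"spectral": 0.2, "duration": 0.3}

# Cue weights stand in for attention: a duration-weighted listener
# (early Korean learner of English) vs. a spectrally weighted listener.
dur_weighted  = activation(token, stored_i, {"spectral": 0.1, "duration": 0.9})
spec_weighted = activation(token, stored_i, {"spectral": 0.9, "duration": 0.1})

# Same token, same exemplar, different activation under different weights.
print(spec_weighted > dur_weighted)  # True
```

The sketch makes the point in the text concrete: what counts as “similar” in the activation phase depends on cue weighting, which is itself shaped by language experience.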
After activation comes identification, which is the process of recognizing the category, as informed by the phonological frame or word in which it is contextualized. Identification does not equal encoding, however, as encoding is achieved only if the item passes the discriminability and typicality evaluations. Discrimination is a comparison of the ratio of category activations to a discriminability threshold. The equation implemented by Todd et al. (2019) has an equivalent in Nosofsky (1986). This part of the process is yet another opportunity where the bilingual experience is relevant. If a bilingual’s categorization threshold is lower, this will lead to an increase in discriminability, regardless of the activation level achieved. We assume that lexical biases are built upon robust abstractions at the phonological level. Second language learners exhibit less lexical bias (Samuel & Frost, 2015; Soo et al., 2020), which may reflect decreased thresholds for some types of bilinguals. Typicality is a comparison of the activation of the intended category to a typicality threshold: when activation is higher than the threshold, the token is more likely to pass the typicality evaluation. In Todd et al.’s (2019) model, both discriminability and typicality are probabilistic in nature. If a token is not well discriminated or is deemed atypical, it does not pass the thresholds in the discrimination and typicality evaluation stages. Correspondingly, it will not be stored and will, therefore, not update the system. Here, an important question to consider in the context of bilingual listeners is which speech patterns are part of the typicality assessment.
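The two encoding evaluations just described can be sketched as follows. This is our own minimal illustration, not Todd et al.’s (2019) actual equations: the threshold values, the activation values, and the linear mapping from activation to pass probability are all hypothetical.

```python
def encoding_probability(category_activations, intended,
                         disc_threshold=0.6, typ_threshold=0.5):
    """Probability that a token is encoded: it must pass both the
    discriminability evaluation (the intended category's share of total
    activation, compared to a threshold) and the typicality evaluation
    (the intended category's raw activation, compared to a threshold)."""
    intended_act = category_activations[intended]
    share = intended_act / sum(category_activations.values())
    # Below threshold, the pass probability scales down linearly (a stand-in
    # for the probabilistic evaluations described in the text).
    p_disc = 1.0 if share >= disc_threshold else share / disc_threshold
    p_typ = 1.0 if intended_act >= typ_threshold else intended_act / typ_threshold
    return p_disc * p_typ

# An ambiguous token activating two competing mid back vowel categories.
acts = {"/o/": 0.45, "/ɔ/": 0.40}

# A lower discriminability threshold (as hypothesized for some bilinguals)
# makes the same ambiguous token more likely to be encoded.
print(encoding_probability(acts, "/o/", disc_threshold=0.5))  # higher
print(encoding_probability(acts, "/o/", disc_threshold=0.8))  # lower
```

The design choice worth noting is that the two evaluations are independent gates: a token can be typical of its category yet poorly discriminated from a competitor, and vice versa.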
In many parts of the world, diaspora communities foster heterogeneous bilingual populations with exposure to a range of pronunciation variants. In Canada, for instance, the cultural mosaic means that Cantonese-English bilinguals have been exposed to Cantonese, English, Cantonese-accented English, and English-accented Cantonese from birth. Which varieties benefit from their own category aggregates? Which compete or coalesce into a more or less cohesive system? Crucial questions like these need to be addressed in order to properly extend an exemplar model to bilingual listeners.
For bilinguals who speak related languages, the steps of identification, discrimination, and typicality evaluation are ideal opportunities for language interaction, as cognates can lead to cross-linguistic activation that blurs phonemic contrasts. Consider, for example, Amengual (2016a), who examined Majorcan Catalan-Spanish bilinguals’ production and lexical processing of the /ɔ/–/o/ contrast in Catalan. Crucially, this contrast does not exist in Spanish. Participants – whether Catalan or Spanish dominant – produced a smaller difference between Catalan /ɔ/ and /o/ in words that were cognates compared to noncognate items, suggesting phonetic interference in production. In terms of lexical processing, the results of a lexical decision task suggest coactivation of the Spanish and Catalan lexicons. For congruent cognates, this coactivation increased listeners’ ability to correctly identify words and nonwords, but it increased error rates for incongruent cognates.Footnote 5 Amengual (2016a) identifies exemplar models as a framework that accounts for the necessary link between lexical and subphonemic structures within and across languages. To accommodate these findings, the similarity of the phonological frames in the cognate items must form part of the assessment of like sounds in the target window for typicality evaluation.Footnote 6 For cognate items, this coactivation would include both Catalan /ɔ/ and Spanish /o/, ultimately inviting conflation at the lexical level, which manifests as increased acceptability of “incorrect” pronunciations in Catalan words. Ultimately, this coactivation would cause cross-linguistic influence on lexical items independent of cognate status (Amengual, 2016b).
A token is stored, or encoded, if it passes both the discriminability and the typicality evaluations. In Todd et al.’s (2019) model, the storage process overwrites a randomly selected exemplar within the same space. If bilinguals experience more variability in their environment, potentially by virtue of hearing their languages spoken by individuals who vary in proficiency and accentedness, they could have consistently lower thresholds than monolingual individuals, leading to an exemplar space with wider distributions.
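The storage step has a simple computational signature: the exemplar cloud stays at a fixed capacity, and each newly encoded token replaces a randomly chosen stored exemplar, so the category’s distribution drifts with experience. A minimal sketch (the cue values are hypothetical):

```python
import random

def store(exemplar_cloud, new_token, rng=random):
    """Storage as described for Todd et al.'s (2019) model: an encoded token
    overwrites a randomly selected exemplar in the same space, keeping the
    cloud at fixed capacity while letting its distribution shift over time."""
    idx = rng.randrange(len(exemplar_cloud))
    exemplar_cloud[idx] = new_token

# A small cloud of (hypothetical) normalized cue values for one category.
cloud = [0.30, 0.31, 0.29, 0.32, 0.30]
store(cloud, 0.45, rng=random.Random(0))  # an atypically high token, once encoded
print(len(cloud))  # 5: capacity unchanged, but the distribution has widened
```

Under this scheme, a bilingual who encodes more variable tokens (because of lower thresholds) would end up with a wider cloud than a monolingual, exactly the contrast drawn in the text.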
We turn next to speech production, which first involves sampling from the pool of stored exemplars. The first pass is called type selection, the selection of the lexical item. Target selection is the selection of an exemplar that provides an acoustic target. Sampling from the type and token space can pull a lexical item or pronunciation variant that has more or fewer sociocultural associations with one language or another, potentially affecting a bilingual’s target pronunciation. The language-specificity of target selection may be influenced by language mode or the magnitude of language activation, where acoustic targets may be sampled from a single-language distribution, from a distribution composed of both languages, or from a single-language distribution that is pulled by the competing language (Amengual, 2018; Brown & Copple, 2018; Johnson & Babel, 2023; Simonet & Amengual, 2020).
The last two stages in production can nudge a pronunciation in different directions. First, a bias parameter conditions category behavior – in Todd et al.’s (2019) model this is a crucial aspect for modeling the dynamics of sound change, which we can broadly envision as a balance of stability (e.g., maintaining categories) and change (e.g., a pull in a particular direction or dimension). In sound change, typically, socially structured variation within the community gets pushed or pulled in various ways, the direction of which may relate to the exact nature of the phonetic variation (Harrington & Schiel, 2017; Harrington et al., 2018). Such a dynamic is not unlike the cross-linguistic influence seen in bilinguals (for a recent attempt to connect the sound change and cross-linguistic influence literatures, see Johnson & Babel, 2023). We see the bias stage as a prime opportunity for mutual influence in bilinguals. The final stage before an item is unleashed into acoustic space is imprecision, which is random noise added to the target form. For bilinguals, we can imagine that imprecision offers an opportunity for the global activation of the nontarget language to bias and shift pronunciation. Language-specific articulatory settings and interspeech postures may manifest as cross-linguistic influence at either the bias or the imprecision stage (Mennen et al., 2010; Wilson & Gick, 2014). This could provide an explanatory mechanism for phonetic drift in L1–L2 language dynamics (Chang, 2012; Guion, 2003).
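The four production stages just described – type selection, target selection, bias, and imprecision – can be strung together in a single sketch. This is an illustration of the pipeline’s shape under our assumptions, not Todd et al.’s (2019) implementation; the lexicon, cue values, and bias magnitude are hypothetical:

```python
import random

def produce(lexicon, bias=0.0, imprecision_sd=0.02, rng=random):
    """One pass through the production stages described in the text:
    type selection (pick a lexical item), target selection (sample one
    stored exemplar as the acoustic target), bias (a systematic push along
    a phonetic dimension, e.g., toward the other language's norm), and
    imprecision (random articulatory noise)."""
    word = rng.choice(sorted(lexicon))          # type selection
    target = rng.choice(lexicon[word])          # target selection
    target += bias                              # bias stage
    target += rng.gauss(0.0, imprecision_sd)    # imprecision stage
    return word, target

# Hypothetical exemplar clouds, one normalized vowel cue per word.
lexicon = {"snow": [0.60, 0.62, 0.61], "huka": [0.55, 0.56]}

# A small positive bias could model a pull toward the nontarget language.
word, realization = produce(lexicon, bias=0.03, rng=random.Random(1))
```

Because the output of production feeds back into listeners’ exemplar clouds, even a small, consistent bias at this stage can accumulate over cycles – which is how the same architecture models both community-level sound change and within-speaker cross-linguistic drift.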
11.5 It’s Complicated
In this final section, we briefly discuss challenges in the study of bilingualism that are a bit thorny for exemplar-based models: cross-linguistic similarity and the heterogeneity within and across bilinguals. We close with a discussion and recommendations for ongoing efforts at quantifying bilingualism in the context of exemplar models.
11.5.1 Similarity
The degree of cross-linguistic similarity is an important factor in exemplar-based models, as the similarity of languages and linguistic structures influences the kinds of connections that can be made within a bilingual (e.g., Yao & Chang, 2016). Similarity is relevant along multiple dimensions within an exemplar model: phonetic forms, phonemic forms, and lexical similarity. The level of linguistic analysis on the part of the speaker-listener is also relevant, as sound-based similarity relationships may be better accounted for at a more abstract, phonological level than in terms of raw phonetic similarity (Chang et al., 2011; Flege, 1987); see Chang (2019) for an overview.
The similarity between languages is continuous and multidimensional. Importantly, similarity along one dimension (e.g., shared phonemic categories) does not entail similarity along another (e.g., the existence of cognates). The scaling up of exemplar-based theories to account for bilingualism should also offer accounts of, and make predictions about, bidialectalism, where cross-code similarity is presumably greater than cross-linguistic similarity. How processes related to bidialectalism compare to those involved in bilingualism is poorly understood, as comparison between bilingual and bidialectal populations has historically been avoided (Chevrot & Ghimenton, 2018). We suggest, however, that the omission of bidialectalism from theoretical discussions of similarity, cross-language competition, and exemplar models is a missed opportunity to better understand the similarity quantification(s) that individuals engage in.
11.5.2 Heterogeneity within and across Bilinguals
The challenge in an exemplar-based model for bilingualism is not simply the increased complexity from the addition of another language. There is also the fact that bilinguals may be vastly different from one another.
A large body of work both within and across languages and dialects has made clear that bilinguals are not all the same. Nor should a bilingual be considered two monolinguals – complete or incomplete – in one (Grosjean, 1998). Bilinguals’ phonetic behaviors are informed by a number of (interacting) language factors, such as age and order of acquisition (e.g., Canseco-Gonzalez et al., 2010; Montrul & Foote, 2014), age of arrival (e.g., Flege, Munro, & MacKay, 1995; Flege, Yeni-Komshian, & Liu, 1999), length of residence (e.g., Flege & Liu, 2001), sociocultural identity (e.g., Szakay et al., 2016), language usage patterns (e.g., Bruggeman & Cutler, 2020; Chan et al., 2020), and language dominance (e.g., Casillas & Simonet, 2016; Soo & Monahan, 2023), to name but a few. Altogether, the myriad factors that contribute to bilingual behavior and linguistic organization suggest that bilingualism is best conceptualized as a continuum.
That bilingual individuals vary is, of course, no surprise to any bilingualism researcher. Heterogeneity even within tightly controlled populations is the norm. Yet in the context of exemplar-based models, this variation is particularly important. The factors just described set the stage for the organization of the bilingual’s linguistic system and its evolution over time. The similarity assessments within a bilingual exemplar model, and the cross-linguistic activation that colors mutual influence, will be conditioned by the nature of the bilingualism, which for any individual will likely shift over time.
11.6 Where Do We Go from Here?
Independent of the heterogeneous nature of bilingual individuals and speech communities is the challenge of how to quantify any sort of bilingualism in the first place. There has been no lack of trying. In her dissertation, Bruggeman (2016) provides an amusing litany of the definitions and means of quantifying language dominance that have been advanced (tables 2-1 and 2-2 in Bruggeman, 2016). These measures have historically fallen short, as bilingualism is complex, multifaceted, and dynamic at both small scales (e.g., a heritage speaker’s familiarity with kinship and domestic terms in the heritage language versus their broader use of the second language in institutional spheres) and large scales (e.g., immigrating to a country with different national languages). More recent metrics like the Bilingual Language Profile (Gertken, Amengual, & Birdsong, 2014) and language entropy (Gullifer & Titone, 2020) consider the multidimensional nature of language dominance, but both still attempt to quantify something that may ultimately be unquantifiable (Marian & Hayakawa, 2021). We are not saying that measures of bilingualism have no utility – they very much do – but they need to be used with recognition of their limits.
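Of the metrics mentioned above, language entropy is the most directly computable: it treats the proportions of use of each language (overall, or within a given context) as a probability distribution and takes its Shannon entropy, so that exclusive use of one language yields 0 and perfectly balanced use of n languages yields log2(n). A minimal sketch, assuming the usage proportions have already been measured (the function name here is ours, not Gullifer and Titone’s):

```python
import math

def language_entropy(proportions):
    """Shannon entropy (in bits) of a distribution of language use.

    proportions: iterable of nonnegative usage proportions summing to 1,
    one value per language. Zero-probability languages are skipped, since
    the limit of p*log2(p) as p approaches 0 is 0.
    """
    return -sum(p * math.log2(p) for p in proportions if p > 0)
```

For example, a speaker using one language exclusively scores 0 bits, a perfectly balanced bilingual scores 1 bit, and entropy can be computed separately per context (home, work, school) to capture the context-dependence of dominance that motivates the measure.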
These concerns exist in any bilingual research, but in the nebulous associative networks that connect social context to phonetic detail to lexicons, the challenge is especially apparent. Bilingual individuals are complex. Language use varies across cultures, communities, and individuals, setting up a seemingly infinite array of social connections. How can we wrangle such a complex and variable topic? We do not, of course, have the ultimate answer, but we underscore the need for within-subject analyses, open science practices, and transparency about hypothesis testing versus exploratory analyses. Altogether, this will allow the field to move incrementally toward a more complete understanding of the connections that bilinguals make, within and across their languages, between abstract representations, social context, and phonetic detail. In the context of exemplar models, where sociocultural connections and metrics of phonetic and phonological similarity link linguistic forms, the heterogeneity of bilinguals makes specific predictions about speech and language particularly tricky. However, it is by welcoming this heterogeneity that we may ultimately reach an improved understanding of the representation of phonetic detail in bilinguals.
Acknowledgments
We thank Fion Fung and Ellie Yoon for copy-editing assistance, and members of the Speech in Context Laboratory at the University of British Columbia for their comments on previous versions of this work. We acknowledge the support of the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Social Sciences and Humanities Research Council (SSHRC) of Canada.